See the HBase and MapReduce documentation in the HBase Javadocs. Start there; below is some additional help.
For more information about MapReduce (i.e., the framework in general), see the Apache Hadoop documentation.
Some MapReduce jobs that use HBase fail to launch. The symptom is an exception similar to the following:
Exception in thread "main" java.lang.IllegalAccessError: class com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass com.google.protobuf.LiteralByteString
  at java.lang.ClassLoader.defineClass1(Native Method)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
  at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
  at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
  at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(ProtobufUtil.java:818)
  at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertScanToString(TableMapReduceUtil.java:433)
  at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:186)
  at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:147)
  at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:270)
  at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
...
This is because of an optimization introduced in HBASE-9867 that inadvertently introduced a classloader dependency.
This affects both jobs that use the -libjars option and "fat jar" jobs, which package their runtime dependencies in a nested lib folder.
To satisfy the new classloader requirement, hbase-protocol.jar must be included in Hadoop's classpath. This can be resolved system-wide by placing a reference to hbase-protocol.jar in Hadoop's lib directory, either via a symlink or by copying the jar into that location.
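For example, the symlink approach looks like the following; the paths are illustrative placeholders, so substitute the actual locations of your HBase and Hadoop installations:

$ ln -s /path/to/hbase/lib/hbase-protocol.jar /path/to/hadoop/lib/hbase-protocol.jar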
This can also be achieved on a per-job basis by including the jar in the HADOOP_CLASSPATH environment variable at job submission time. When launching jobs that package their dependencies, all three of the following job launching commands satisfy this requirement:
$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass
For jars that do not package their dependencies, the following command structure is necessary:
$ HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',') ...
See also HBASE-10304 for further discussion of this issue.
When TableInputFormat is used to source an HBase table in a MapReduce job, its splitter creates one map task for each region of the table. Thus, if there are 100 regions in the table, there will be 100 map tasks for the job, regardless of how many column families are selected in the Scan.
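As a minimal sketch of such a job, a read-only table-sourced job can be configured with TableMapReduceUtil (which sets TableInputFormat as the input format behind the scenes). The table name "mytable" and the classes ExampleReadJob and MyMapper are hypothetical placeholders, not names from the text above:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ExampleReadJob {

  // A do-nothing mapper; each call receives one row of the table.
  static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // Process the row here.
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration config = HBaseConfiguration.create();
    Job job = Job.getInstance(config, "ExampleReadJob");
    job.setJarByClass(ExampleReadJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // a batch larger than the default is typical for MR jobs
    scan.setCacheBlocks(false);  // avoid polluting the block cache from a full scan

    // One map task will be created per region of "mytable",
    // regardless of which column families the Scan selects.
    TableMapReduceUtil.initTableMapperJob(
        "mytable",       // input table
        scan,            // Scan instance controlling column selection
        MyMapper.class,  // mapper class
        null,            // mapper output key, unused in this map-only job
        null,            // mapper output value, unused in this map-only job
        job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}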
For those interested in implementing custom splitters, see the method getSplits in TableInputFormatBase. That is where the logic for map-task assignment resides.
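To show where such customization hooks in, here is a minimal sketch of a subclass; the class name CustomSplitTableInputFormat is a placeholder, and the body simply delegates to the inherited one-split-per-region logic while leaving room for post-processing:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

public class CustomSplitTableInputFormat extends TableInputFormat {
  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    // Start from the default per-region splits computed by TableInputFormatBase.
    List<InputSplit> regionSplits = super.getSplits(context);
    // Custom logic (e.g., merging splits for small regions or subdividing
    // hot ones) would transform regionSplits here before returning it.
    return regionSplits;
  }
}

A job would then opt into the custom splitter with the standard Hadoop call job.setInputFormatClass(CustomSplitTableInputFormat.class) after the table mapper has been initialized.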