Regions are the basic element of availability and distribution for tables, and consist of a Store per Column Family. The hierarchy of objects is as follows:
Table                      (HBase table)
    Region                 (Regions for the table)
        Store              (Store per ColumnFamily for each Region for the table)
            MemStore       (MemStore for each Store for each Region for the table)
            StoreFile      (StoreFiles for each Store for each Region for the table)
                Block      (Blocks within a StoreFile within a Store for each Region for the table)
For a description of what HBase files look like when written to HDFS, see Section 13.7.2, “Browsing HDFS for HBase Objects”.
Determining the "right" region size can be tricky, and there are a few factors to consider:
HBase scales by having regions across many servers. Thus if you have 2 regions for 16GB of data on a 20 node cluster, your data will be concentrated on just a few machines - nearly the entire cluster will be idle. This really can't be stressed enough, since a common problem is loading 200MB of data into HBase and then wondering why your awesome 10 node cluster isn't doing anything.
On the other hand, high region count has been known to make things slow. This is getting better with each release of HBase, but it is probably better to have 700 regions than 3000 for the same amount of data.
There is not much memory footprint difference between 1 region and 10 in terms of indexes, etc, held by the RegionServer.
When starting off, it's probably best to stick with the default region size, perhaps going smaller for hot tables (or manually splitting hot regions to spread the load over the cluster), or going with larger region sizes if your cell sizes tend to be largish (100k and up).
See Section 2.5.2.6, “Bigger Regions” for more information on configuration.
This section describes how Regions are assigned to RegionServers.
When HBase starts, regions are assigned as follows (short version):

1. The Master invokes the AssignmentManager upon startup.
2. The AssignmentManager looks at the existing region assignments in META.
3. If a region assignment is still valid (i.e., the RegionServer is still online), the assignment is kept.
4. If an assignment is not valid, the LoadBalancerFactory is invoked to assign the region. The DefaultLoadBalancer will randomly assign the region to a RegionServer.

When a RegionServer fails (short version), the regions on it immediately become unavailable, the Master detects the failure, and the region assignments are considered invalid and are re-assigned just as in the startup sequence.
Regions can be periodically moved by the LoadBalancer (see Section 9.5.4.1, “LoadBalancer”).
Over time, Region-RegionServer locality is achieved via HDFS block replication. The HDFS client does the following by default when choosing locations to write replicas:

1. The first replica is written to the local node.
2. The second replica is written to a node on another rack.
3. The third replica is written to a different node on the same rack as the second.

Subsequent replicas are written to random nodes in the cluster.
Thus, HBase eventually achieves locality for a region after a flush or a compaction. In a RegionServer failover situation a RegionServer may be assigned regions with non-local StoreFiles (because none of the replicas are local), however as new data is written in the region, or the table is compacted and StoreFiles are re-written, they will become "local" to the RegionServer.
For more information, see HDFS Design on Replica Placement and also Lars George's blog on HBase and HDFS locality.
Splits run unaided on the RegionServer; i.e., the Master does not participate. The RegionServer splits a region, offlines the split region, adds the daughter regions to META, opens the daughters on the parent's hosting RegionServer, and then reports the split to the Master. See Section 2.5.2.7, “Managed Splitting” for how to manually manage splits (and for why you might do this).
The default split policy can be overridden using a custom RegionSplitPolicy (HBase 0.94+). Typically a custom split policy should extend HBase's default split policy: ConstantSizeRegionSplitPolicy.
The policy can be set globally through the HBaseConfiguration used, or on a per-table basis:
HTableDescriptor myHtd = ...;
myHtd.setValue(HTableDescriptor.SPLIT_POLICY, MyCustomSplitPolicy.class.getName());
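As an illustration, a minimal sketch of such a custom policy might look like the following (the class name MyCustomSplitPolicy and its no-split behavior are hypothetical, not part of HBase):

import org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy;

// Hypothetical policy: never trigger automatic splits for tables configured
// with it; splits would then be requested manually (e.g., via the shell).
public class MyCustomSplitPolicy extends ConstantSizeRegionSplitPolicy {
  @Override
  protected boolean shouldSplit() {
    return false; // suppress automatic splitting
  }
}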
Both the Master and the RegionServer participate in online region merges. The client sends a merge RPC to the Master; the Master then moves the regions to the RegionServer hosting the more heavily loaded of the two regions, and finally sends the merge request to that RegionServer, which runs the merge. Similar to region splits, a region merge runs as a local transaction on the RegionServer: it offlines the regions, merges the two regions on the file system, atomically deletes the merging regions from META and adds the merged region to META, opens the merged region, and finally reports the merge to the Master.
An example of region merges in the hbase shell:
hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME'
hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME', true
This is an asynchronous operation; the call returns immediately, without waiting for the merge to complete. Passing 'true' as the optional third parameter forces the merge: by default only adjacent regions can be merged, and a non-adjacent merge will fail unless forced. 'force' is for expert use only.
A Store hosts a MemStore and 0 or more StoreFiles (HFiles). A Store corresponds to a column family for a table for a given region.
The MemStore holds in-memory modifications to the Store. Modifications are KeyValues. When a flush is requested, the current memstore is moved to a snapshot and a new memstore is started. HBase continues to serve edits out of the new memstore and the backing snapshot until the flusher reports that the flush succeeded, at which point the snapshot is discarded.
StoreFiles are where your data lives.
The hfile file format is based on the SSTable file described in the BigTable [2006] paper and on Hadoop's tfile (The unit test suite and the compression harness were taken directly from tfile). Schubert Zhang's blog post on HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs makes for a thorough introduction to HBase's hfile. Matteo Bertozzi has also put up a helpful description, HBase I/O: HFile.
For more information, see the HFile source code. Also see Appendix E, HFile format version 2 for information about the HFile v2 format that was included in 0.92.
To view a textualized version of hfile content, you can use the org.apache.hadoop.hbase.io.hfile.HFile tool. Type the following to see usage:
$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile
For example, to view the content of the file hdfs://10.81.47.41:8020/hbase/TEST/1418428042/DSMP/4759508618286845475, type the following:
$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -v -f hdfs://10.81.47.41:8020/hbase/TEST/1418428042/DSMP/4759508618286845475
If you leave off the -v option, you will see just a summary of the hfile. See the usage output for other things you can do with the HFile tool.
For more information of what StoreFiles look like on HDFS with respect to the directory structure, see Section 13.7.2, “Browsing HDFS for HBase Objects”.
StoreFiles are composed of blocks. The blocksize is configured on a per-ColumnFamily basis.
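As a sketch, a per-ColumnFamily blocksize can be set through the Java client API; the table and family names here are illustrative, and 64 KB is simply the default value:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;

// Illustrative only: give column family 'cf' an explicit 64 KB blocksize.
HTableDescriptor table = new HTableDescriptor("myTable");
HColumnDescriptor cf = new HColumnDescriptor("cf");
cf.setBlocksize(64 * 1024); // blocksize in bytes for this family's StoreFiles
table.addFamily(cf);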
Compression happens at the block level within StoreFiles. For more information on compression, see Appendix C, Compression In HBase.
For more information on blocks, see the HFileBlock source code.
The KeyValue class is the heart of data storage in HBase. KeyValue wraps a byte array and takes offsets and lengths into the passed array, which specify where to start interpreting the content as a KeyValue.
The KeyValue format inside a byte array is:

    keylength
    valuelength
    key
    value

The Key is further decomposed as:

    rowlength
    row (i.e., the rowkey)
    columnfamilylength
    columnfamily
    columnqualifier
    timestamp
    keytype (e.g., Put, Delete, DeleteColumn, DeleteFamily)
KeyValue instances are not split across blocks. For example, if there is an 8 MB KeyValue, even if the block-size is 64kb this KeyValue will be read in as a coherent block. For more information, see the KeyValue source code.
To emphasize the points above, examine what happens with two Puts for two different columns for the same row:
rowkey=row1, cf:attr1=value1
rowkey=row1, cf:attr2=value2
Even though these are for the same row, a KeyValue is created for each column:
Key portion for Put #1:
rowlength ------------> 4
row -----------------> row1
columnfamilylength ---> 2
columnfamily --------> cf
columnqualifier ------> attr1
timestamp -----------> server time of Put
keytype -------------> Put
Key portion for Put #2:
rowlength ------------> 4
row -----------------> row1
columnfamilylength ---> 2
columnfamily --------> cf
columnqualifier ------> attr2
timestamp -----------> server time of Put
keytype -------------> Put
It is critical to understand that the rowkey, ColumnFamily, and column (aka columnqualifier) are embedded within the KeyValue instance. The longer these identifiers are, the bigger the KeyValue is.
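To illustrate, the following hypothetical sketch (arbitrary values and timestamps, not taken from the example above) shows the two Puts materializing as KeyValues that each embed the row, family, and qualifier bytes:

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative only: two KeyValues for the same row but different qualifiers.
long now = System.currentTimeMillis();
KeyValue kv1 = new KeyValue(Bytes.toBytes("row1"), Bytes.toBytes("cf"),
    Bytes.toBytes("attr1"), now, Bytes.toBytes("value1"));
KeyValue kv2 = new KeyValue(Bytes.toBytes("row1"), Bytes.toBytes("cf"),
    Bytes.toBytes("attr2"), now, Bytes.toBytes("value2"));
// The key length includes the row, family, qualifier, timestamp, and type,
// so longer identifiers mean bigger KeyValues.
System.out.println(kv1.getKeyLength() + " " + kv2.getKeyLength());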
There are two types of compactions: minor and major. Minor compactions will usually pick up a couple of the smaller adjacent StoreFiles and rewrite them as one. Minors do not drop deletes or expired cells, only major compactions do this. Sometimes a minor compaction will pick up all the StoreFiles in the Store and in this case it actually promotes itself to being a major compaction.
After a major compaction runs there will be a single StoreFile per Store, and this usually helps performance. Caution: major compactions rewrite all of the Store's data, and on a loaded system this may not be tenable; major compactions will usually have to be done manually on large systems. See Section 2.5.2.8, “Managed Compactions”.
Compactions will not perform region merges. See Section 15.2.2, “Merge” for more information on region merging.
To understand the core algorithm for StoreFile selection, there is some ASCII-art in the Store source code that will serve as useful reference. It has been copied below:
/* normal skew:
 *
 *         older ----> newer
 *     _
 *    | |   _
 *    | |  | |   _
 *  --|-|- |-|- |-|---_-------_-------  minCompactSize
 *    | |  | |  | |  | |  _  | |
 *    | |  | |  | |  | | | | | |
 *    | |  | |  | |  | | | | | |
 */
Important knobs:

hbase.hstore.compaction.ratio - Ratio used in the compaction file selection algorithm (default 1.2f).

hbase.hstore.compaction.min (in HBase 0.90, hbase.hstore.compactionThreshold) (files) - Minimum number of StoreFiles per Store to be selected for a compaction to occur (default 2).

hbase.hstore.compaction.max (files) - Maximum number of StoreFiles to compact per minor compaction (default 10).

hbase.hstore.compaction.min.size (bytes) - Any StoreFile smaller than this setting will automatically be a candidate for compaction. Defaults to hbase.hregion.memstore.flush.size (128 MB).

hbase.hstore.compaction.max.size (HBase 0.92+) (bytes) - Any StoreFile larger than this setting will automatically be excluded from compaction (default Long.MAX_VALUE).
The minor compaction StoreFile selection logic is size-based, and selects a file for compaction when file size <= sum(smaller_files) * hbase.hstore.compaction.ratio.
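As a minimal sketch of that rule (not HBase's actual implementation; the method name is illustrative), where sumOfNewerFileSizes is the combined size of the files newer than the one being considered:

// Minimal sketch of the size-based selection test (illustrative only).
// A file qualifies for minor compaction when it is no larger than the sum of
// the newer files' sizes multiplied by hbase.hstore.compaction.ratio.
static boolean qualifiesForCompaction(long fileSize, long sumOfNewerFileSizes, double ratio) {
  return fileSize <= sumOfNewerFileSizes * ratio;
}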
This example mirrors an example from the unit test TestCompactSelection.

hbase.hstore.compaction.ratio = 1.0f
hbase.hstore.compaction.min = 3 (files)
hbase.hstore.compaction.max = 5 (files)
hbase.hstore.compaction.min.size = 10 (bytes)
hbase.hstore.compaction.max.size = 1000 (bytes)

The following StoreFiles exist: 100, 50, 23, 12, and 12 bytes apiece (oldest to newest). With the above parameters, the files that would be selected for minor compaction are 23, 12, and 12.
Why?
100 --> No, because sum(50, 23, 12, 12) * 1.0 = 97.
50 --> No, because sum(23, 12, 12) * 1.0 = 47.
23 --> Yes, because sum(12, 12) * 1.0 = 24.
12 --> Yes, because the previous file has been included, and this does not exceed the max-files limit of 5.
12 --> Yes, because the previous file has been included, and this does not exceed the max-files limit of 5.
This example mirrors an example from the unit test TestCompactSelection.

hbase.hstore.compaction.ratio = 1.0f
hbase.hstore.compaction.min = 3 (files)
hbase.hstore.compaction.max = 5 (files)
hbase.hstore.compaction.min.size = 10 (bytes)
hbase.hstore.compaction.max.size = 1000 (bytes)

The following StoreFiles exist: 100, 25, 12, and 12 bytes apiece (oldest to newest). With the above parameters, no compaction will be started.
Why?
100 --> No, because sum(25, 12, 12) * 1.0 = 49.
25 --> No, because sum(12, 12) * 1.0 = 24.
12 --> Candidate, because sum(12) * 1.0 = 12.
12 --> Candidate, because the previous file qualified.
Only two files qualify, which is fewer than the minimum of 3, so no compaction is started.
This example mirrors an example from the unit test TestCompactSelection.

hbase.hstore.compaction.ratio = 1.0f
hbase.hstore.compaction.min = 3 (files)
hbase.hstore.compaction.max = 5 (files)
hbase.hstore.compaction.min.size = 10 (bytes)
hbase.hstore.compaction.max.size = 1000 (bytes)

The following StoreFiles exist: 7, 6, 5, 4, 3, 2, and 1 bytes apiece (oldest to newest). With the above parameters, the files that would be selected for minor compaction are 7, 6, 5, 4, and 3.
Why?
7 --> Yes, because sum(6, 5, 4, 3, 2, 1) * 1.0 = 21; also, 7 is less than the min-size.
6 --> Yes, because sum(5, 4, 3, 2, 1) * 1.0 = 15; also, 6 is less than the min-size.
5 --> Yes, because sum(4, 3, 2, 1) * 1.0 = 10; also, 5 is less than the min-size.
4 --> Yes, because sum(3, 2, 1) * 1.0 = 6; also, 4 is less than the min-size.
3 --> Yes, because sum(2, 1) * 1.0 = 3; also, 3 is less than the min-size.
2 --> No. Candidate because 2 is less than the min-size, but the max number of files to compact (5) has been reached.
1 --> No. Candidate because 1 is less than the min-size, but the max number of files to compact (5) has been reached.
Impact of key configuration options:

hbase.hstore.compaction.ratio - A large ratio (e.g., 10) will produce a single giant file. Conversely, a value of .25 will produce behavior similar to the BigTable compaction algorithm, resulting in 4 StoreFiles.

hbase.hstore.compaction.min.size - Because this limit represents the "automatic include" limit for all StoreFiles smaller than this value, it may need to be adjusted downwards in write-heavy environments where many 1 or 2 MB StoreFiles are being flushed, because every file will be targeted for compaction and the resulting files may still be under the min-size and require further compaction, and so on.
Stripe compaction is an experimental feature added in HBase 0.98 which aims to improve compactions for large regions or non-uniformly distributed row keys. In order to achieve smaller and/or more granular compactions, the store files within a region are maintained separately for several row-key sub-ranges, or "stripes", of the region. The division is not visible to the higher levels of the system, so externally each region functions as before.
This feature is fully compatible with default compactions - it can be enabled for existing tables, and the table will continue to operate normally if it's disabled later.
You might want to consider using this feature if you have:

    Large regions (you get the benefits of smaller, more granular compactions without the overhead of additional regions).
    Non-uniform row keys, e.g., a time dimension in the key (only the stripes receiving new keys need to compact; older data is compacted less often, or not at all).
According to the performance testing performed, in these cases read performance can improve somewhat, and the read and write performance variability due to compactions is greatly reduced. There is an overall performance improvement on large regions with non-uniform row keys (e.g., a hash-prefixed timestamp key) over the long term. These performance gains are best realized when the table is already large. In the future, the performance improvement might also extend to region splits.
To use stripe compactions for a table or a column family, you should set its hbase.hstore.engine.class to org.apache.hadoop.hbase.regionserver.StripeStoreEngine. Due to the nature of compactions, you also need to set the blocking file count to a high number (100 is a good starting value, which is 10 times the normal default of 10). If changing an existing table, do so while it is disabled. Examples:
alter 'orders_table', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', 'hbase.hstore.blockingStoreFiles' => '100'}
alter 'orders_table', {NAME => 'blobs_cf', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', 'hbase.hstore.blockingStoreFiles' => '100'}}
create 'orders_table', 'blobs_cf', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', 'hbase.hstore.blockingStoreFiles' => '100'}
Then, you can configure the other options if needed (see below) and enable the table.
To switch back to default compactions, set hbase.hstore.engine.class to nil to unset it, or set it explicitly to "org.apache.hadoop.hbase.regionserver.DefaultStoreEngine" (this also needs to be done on a disabled table).
When you enable a large table after changing the store engine either way, a major compaction will likely be performed on most regions. This is not a problem with new tables.
All of the settings described below are best set at the table/CF level (with the table disabled first, for the settings to apply), similar to the above, e.g.:

alter 'orders_table', CONFIGURATION => {'key' => 'value', ..., 'key' => 'value'}
Based on your region sizing, you might want to also change your stripe sizing. By default, new regions start with one stripe. When a stripe grows too big (16 times the memstore flush size), it is split into two stripes on the next compaction. Stripe splitting continues in a similar manner as the region grows, until the region itself is big enough to split (region splits work the same as with default compactions).
You can tune this pattern for your data. You should generally aim for a stripe size of at least 1 GB, and about 8-12 stripes for uniform row keys. So, for example, if your regions are 30 GB, 12 stripes of 2.5 GB each might be a good idea.
The settings are as follows:
Table 9.1.

Setting | Notes
---|---
hbase.store.stripe.initialStripeCount | Initial stripe count to create. You can use it as follows: for relatively uniform row keys, starting with several stripes (2, 5, 10...) avoids some splitting overhead as the region grows; for existing tables with lots of data, it can be used to effectively pre-split the stripes. Note that if the early data is not representative of the overall row-key distribution, starting with many stripes will not be as efficient.
hbase.store.stripe.sizeToSplit | Maximum stripe size before it's split. You can use this in conjunction with the next setting to control target stripe size (sizeToSplit = splitPartCount * target stripe size), according to the above sizing considerations.
hbase.store.stripe.splitPartCount | The number of new stripes to create when splitting one. The default is 2, which is good for most cases. For non-uniform row keys, you might experiment with increasing the number somewhat (3-4), to isolate the arriving updates into a narrower slice of the region with just one split instead of several.
By default, a flush creates several files from one memstore, according to existing stripe boundaries and the row keys being flushed. This approach minimizes write amplification, but can be undesirable if the memstore is small and there are many stripes (the files will be too small).
In such cases, you can set hbase.store.stripe.compaction.flushToL0 to true. This will cause a flush to create a single file instead; when at least hbase.store.stripe.compaction.minFilesL0 such files (by default, 4) accumulate, they will be compacted into striped files.
All the settings that apply to normal compactions (file size limits, etc.) apply to stripe compactions. The exceptions are the minimum and maximum numbers of files, which are set to higher values by default because the files within stripes are smaller. To control these for stripe compactions, use hbase.store.stripe.compaction.minFiles and hbase.store.stripe.compaction.maxFiles.