Reference: HBase Administration

Table Management

HBase tables can be enabled or disabled depending on the reporting needs of the user base or the usage of the cluster. A table may remain disabled on the primary cluster but be brought online in cloned clusters for testing or ad-hoc analysis as needed. HBase developers may choose to pre-split a Table's data to reduce the load on the cluster if the Row Key design would otherwise cause data skew in one or more Regions. This decision depends heavily on the type of data, its contents, and the job profiles.
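As a rough illustration of pre-splitting, the sketch below computes evenly spaced split keys. It assumes, purely for the example, that every Row Key starts with a uniformly distributed hexadecimal prefix (for instance a hash of the natural key). The function name `hex_split_points` and the 8-digit prefix length are hypothetical choices, not part of HBase.

```python
def hex_split_points(num_regions, prefix_digits=8):
    """Return num_regions - 1 evenly spaced split keys over a hex keyspace.

    Assumption: row keys begin with a uniformly distributed hex prefix
    of `prefix_digits` characters.
    """
    if num_regions < 2:
        return []  # a single region needs no split points
    keyspace = 16 ** prefix_digits
    return [
        format(keyspace * i // num_regions, "0{}x".format(prefix_digits))
        for i in range(1, num_regions)
    ]


# Four regions need three boundaries.
splits = hex_split_points(4)
print(splits)  # ['40000000', '80000000', 'c0000000']
```

Keys like these could then be supplied as split points when the table is created, for example through the SPLITS option of the HBase shell `create` command. If the Row Keys are not uniformly distributed, evenly spaced boundaries will not remove the skew, and the split points would need to come from a sample of the real data.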

 

Baseline Regions

Per the HBase documentation, the number of Regions per Region Server can be estimated from the Memstore configuration and the Table structure. The formula below assumes that all Regions are filled at approximately the same rate.

 

(Region Server Memory * hbase.regionserver.global.memstore.size) / (hbase.hregion.memstore.flush.size * Number of Column Families)
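The formula can be worked through with a few lines of Python. The input values below (16 GB of Region Server memory, a global Memstore fraction of 0.4, a 128 MB flush size, one Column Family) are illustrative assumptions rather than recommendations, and the function name `baseline_regions` is hypothetical.

```python
def baseline_regions(rs_memory_mb, memstore_fraction, flush_size_mb, column_families):
    """Estimate the number of actively written Regions one Region Server can hold."""
    # Memory the Region Server may devote to all Memstores combined.
    memstore_budget_mb = rs_memory_mb * memstore_fraction
    # Each Region needs one Memstore per Column Family.
    per_region_mb = flush_size_mb * column_families
    return memstore_budget_mb / per_region_mb


# Assumed sizing: 16 GB memory, 0.4 Memstore fraction, 128 MB flush size, 1 family.
estimate = baseline_regions(16 * 1024, 0.4, 128, 1)
print(round(estimate, 1))  # 51.2
```

With two Column Families the same inputs give roughly half as many Regions, which is one reason the number of Column Families matters for sizing.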

 

Region Tuning

HBase works best when Regions are sized between 5 and 20 GB. The HBase documentation recommends between 50 and 100 Regions for Tables with one or at most two Column Families, and 20 to 200 Regions per Region Server. Generally, fewer Regions result in a smoother-running cluster, because every Region adds Memstore and bookkeeping overhead on the Region Server that hosts it. A Region splits when its size reaches a preconfigured threshold (hbase.hregion.max.filesize). The split runs unaided on the Region Server, which means the Master is not involved. The Region Server is responsible for saving the new Region data to HBase, making the data available, and informing the Master of the split.
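The guidelines above can be turned into a quick sanity check. The sketch below is a back-of-the-envelope calculation only: the table size, target Region size, and server count are made-up inputs, the helper names are hypothetical, and it treats the table as the only one on the cluster.

```python
import math


def regions_for_table(table_size_gb, target_region_gb):
    """Number of Regions needed if each Region holds about target_region_gb."""
    return max(1, math.ceil(table_size_gb / target_region_gb))


def regions_per_server(total_regions, region_servers):
    """Average Regions per Region Server, assuming an even spread."""
    return total_regions / region_servers


def within_guideline(per_server, low=20, high=200):
    """Check the 20 to 200 Regions per Region Server guideline."""
    return low <= per_server <= high


# Assumed inputs: a 1000 GB table, 10 GB Regions, 5 Region Servers.
total = regions_for_table(1000, 10)
per_server = regions_per_server(total, 5)
print(total, per_server, within_guideline(per_server))  # 100 20.0 True
```

In practice the per-server count should include the Regions of every table hosted on the cluster, so a result near either end of the range is a prompt for closer testing rather than a verdict.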

 

HBase Compactions

HBase stores data in Store Files (HFiles). The more files a Store holds, the greater the cost of a read, since several files may need to be scanned. HBase consolidates these files in a process known as Compaction. Major Compactions can have a significant effect on system response time, because rewriting the HFiles generates a large amount of disk I/O and network traffic across the cluster. Write throughput can also be affected during Major Compactions. As a result, system administrators need a Major Compaction policy that dictates the frequency and schedule, so that the system impact is known ahead of time. Making this decision in a vacuum is not recommended and will most likely result in suboptimal behavior. The best policy is determined through testing and tuning, with an understanding of the query profiles users will execute against the cluster.

 

Minor

Combines a user-designated number of smaller HFiles into one larger HFile.

Reduces the number of HFiles that must be read to return a single row.

Major

Combines all HFiles in a Store into a single large HFile and cleans up data after users submit deletes.

Restores read performance and improves data locality on the Region Servers.
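The difference between the two compaction types can be illustrated with a toy model that tracks only HFile sizes for a single Store. This is a deliberate simplification: it always merges the smallest files, which is not how HBase actually selects compaction candidates, and the function names and sizes are hypothetical.

```python
def minor_compact(hfile_sizes_mb, files_to_merge):
    """Merge the smallest `files_to_merge` HFiles into one larger HFile."""
    if files_to_merge < 2 or len(hfile_sizes_mb) < files_to_merge:
        return sorted(hfile_sizes_mb)  # nothing to merge
    ordered = sorted(hfile_sizes_mb)
    merged = sum(ordered[:files_to_merge])
    return sorted(ordered[files_to_merge:] + [merged])


def major_compact(hfile_sizes_mb, deleted_mb=0):
    """Merge every HFile into one and drop the data marked as deleted."""
    return [sum(hfile_sizes_mb) - deleted_mb]


# Assumed Store with five HFiles (sizes in MB), 75 MB of which is deleted data.
store = [5, 8, 12, 200, 450]
after_minor = minor_compact(store, 3)
after_major = major_compact(store, deleted_mb=75)
print(after_minor)  # [25, 200, 450]
print(after_major)  # [600]
```

The minor compaction lowers the file count from five to three while rewriting only 25 MB, whereas the major compaction rewrites the entire Store. That difference in rewritten volume is why Major Compactions are the ones that need a schedule.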
