How To: HiveQL Tuning

HiveQL

While Hive accepts SQL the framework will translate commands into MapReduce or Tez for execution. As a result there are some syntactical considerations when working with HiveQL. Additionally the number of MapReduce jobs generated must be considered as this will be a measure of the resources required.

HiveQL Source Data

HiveQL performs best when the source data is partitioned and in the preferred format for the data volume presented. Enterprises frequently select the ORC format for the source data format since Hive features several optimizations for the ORC format. Ensure that the source data format is in a columnar format preferred by Hive. 

Hive Select Option

HiveQL will perform the aggregations present in the SELECT clause in the Reducer tasks of the MapReduce jobs. This can have an negative impact on performance if the Reducer are already burdened with other tasks such as grouping and filtering. Therefore the aggregations can be performed in the Mapper tasks - to enable this option set the following:

 set hive.map.aggr = true

Hive Join Syntax

HiveQL only supports equivalency based joins therefore if a non equivalency based join is required it will necessary to rewrite the query with a cartesian join (cross join) with the join criteria specified in the WHERE clause. More details on this behavior available in the Hive documentation: 

https://cwiki.apache.org/confluence/display/Hive/LanguageManual

Hive Join Optimization

HiveQL is able to perform joins in memory to improve the runtime of a SQL query - this is only possible when a small table (~25MB or less) is involved in the join. To enable this option set the following:

set hive.auto.convert.join = true

Hive By Clauses

There are four BY clause options available in HiveQL which can all potentially generate different outputs therefore selecting the right BY clause can affect both the output and runtime.

Order By

Guarantee global ordering of data in a single large sorted output file. This is accomplished by using a single reducer in the final MapReduce job.

Sort By

Using Sort By will guarantee overlapping reducer level sorting. While the data in each reducer and subsequent file will be sorted there may be overlapping keys.

Distribute By

Using Distribute By will guarantee non-overlapping reducers. While there are no overlapping keys across the reducers the data is not sorted.

Cluster By

Using Cluster By will guarantee non-overlapping sorted reducers producing an equal number of sorted files - effectively a combination of Sort By and Distribute By.

 

Have more questions? Submit a request

Comments

Powered by Zendesk