How To: Slow Reducer & Data Skew

Symptom

While running a HiveQL job, one Reducer task executes more slowly than the other Reducer tasks in the same stage. This is expected behavior for a SQL statement that uses ORDER BY: there will be a MapReduce stage with a single Reducer task to accomplish the global ordering of the data. The steps below therefore only help when the slow Reducer is in the same MapReduce job as multiple other Reducers.

Cause

If the job previously executed without issue and now has sluggish Reducers, it is possible that the data type of a column used in the SQL join has changed. Unfortunately, this will not be obvious through the Qubole environment, and it will be necessary to review the data formats internally with administrators and engineers. It is also possible that the data now contains a skew that was not previously present, meaning a small number of join key values account for a disproportionate share of the rows, so the Reducer that receives those keys does most of the work. If the job is brand new and has never been executed, tuning efforts may improve performance; however, a data skew must still be considered.
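
Both causes can be checked with a few queries. The sketch below is illustrative only: the tables orders and customers and the join column customer_id are hypothetical names, to be replaced with the actual join inputs.

```sql
-- Hypothetical table and column names; substitute the real join inputs.

-- Compare the declared type of the join column on both sides of the join.
DESCRIBE orders;
DESCRIBE customers;

-- List the 20 most frequent join key values and their row counts.
SELECT customer_id, COUNT(*) AS row_count
FROM orders
GROUP BY customer_id
ORDER BY row_count DESC
LIMIT 20;
```

If one or two values (often NULL or a default placeholder) have row counts far above the rest, those are likely skew keys.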

Action

Data skew can be addressed prior to execution via schema modification if the skew key is already known:

  1. When creating the table in the Hive metastore, declare the skew key using SKEWED BY (key) ON (key_value).
  2. Configure Hive to leverage a skew join at compile time by setting hive.optimize.skewjoin.compiletime = true.
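
The two steps above might look like the following. This is a minimal sketch, not a definitive schema: the table orders_skewed, its columns, and the skewed value 0 are all assumptions made for illustration.

```sql
-- Hypothetical table; customer_id = 0 is assumed to be the known skew key value.
CREATE TABLE orders_skewed (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE
)
SKEWED BY (customer_id) ON (0);

-- Let Hive use the skew metadata declared above when planning joins.
SET hive.optimize.skewjoin.compiletime=true;
```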

Data skew can also be addressed during execution via a threshold evaluated at runtime:

  1. Set the threshold with hive.skewjoin.key (default 100000). If any join key has more rows than this threshold, a skew join will be triggered for that key.
  2. Configure Hive to consider a skew join during execution by setting hive.optimize.skewjoin = true.
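
As a rough sketch, the runtime settings can be applied at the top of the session before the join runs. The threshold shown is Hive's default and should be tuned to the data. The tables and columns in the query are hypothetical.

```sql
-- Enable the runtime skew join optimization for this session.
SET hive.optimize.skewjoin=true;

-- Keys with more rows than this are treated as skewed (100000 is the default).
SET hive.skewjoin.key=100000;

-- Hypothetical join; substitute the real query.
SELECT o.order_id, c.name
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id;
```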