How To: Diagnose the Cluster Health Check Failed Issue

Description:

When the "Automatic Cluster Termination" option is enabled, clusters whose nodes are in bad health (low disk space, low memory, etc.) are terminated automatically.

However, when the "Automatic Cluster Termination" option is disabled, you instead receive alert emails notifying you that the health check has failed for the cluster.

How To:

Here are the steps to diagnose the root cause of the cluster's bad health:

1) Check for low disk space on the master node (after SSHing into the cluster master):
-> df -m
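
If disk usage is high, you can find the largest directories with something like the following sketch (assuming GNU coreutils; adjust the starting path to your mount points):
-> df -h
-> du -xh --max-depth=1 / 2>/dev/null | sort -h | tail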

2) Check for low free memory on the master node (after SSHing into the cluster master):
-> free -m
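
If memory is low, a quick way to see which processes are consuming the most of it (using standard Linux procps) is:
-> ps aux --sort=-%mem | head -n 10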

3) Check the DFS space usage for the master node (after SSHing into the cluster master):
-> hadoop dfsadmin -report
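
The full report is verbose; to pull out just the capacity figures you can filter it, for example (field labels vary slightly across Hadoop versions):
-> hadoop dfsadmin -report | grep -E 'DFS Used%|DFS Remaining'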

4) Check for low disk space, low memory, and DFS space usage on all the slave nodes in the cluster (after SSHing into each slave); a loop such as the sketch below can speed this up.
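
A minimal sketch for checking all slaves from the master, assuming a file slaves.txt listing the slave hostnames one per line and passwordless SSH between nodes:
-> for h in $(cat slaves.txt); do echo "== $h =="; ssh "$h" 'df -m; free -m'; done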

5) Your cluster may be running with only one or two slave nodes (its configured minimum), while the default HDFS replication factor is 2. When downscaling leaves Qubole unable to place all the data on the remaining nodes and the replication factor cannot be met, the cluster can go into a "bad health" state.
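
To confirm the configured replication factor and how many DataNodes are live, you can run the following (hdfs getconf exists on Hadoop 2+; on older versions, check dfs.replication in hdfs-site.xml instead):
-> hdfs getconf -confKey dfs.replication
-> hadoop dfsadmin -report | grep -i datanodes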

6) An alternative way to check these statistics is to use the Ganglia metrics.

7) If everything looks normal, check whether the NameNode (on the master) is in "Safe Mode" by looking under "DFS Status" on the cluster configuration page.

If it is, bring the NameNode out of "Safe Mode".
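
You can also check and toggle safe mode from the master's shell with the standard dfsadmin subcommands:
-> hadoop dfsadmin -safemode get
-> hadoop dfsadmin -safemode leave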

8) If the issue persists, try restarting the cluster.

Note that no jobs will run on the cluster until the health issue is fixed or the cluster is restarted.
