Currently we support Datadog for Hadoop1 and Hadoop2 clusters.
Datadog keys can be entered at the account level or at the cluster level.
If entered at the account level, Datadog is started on non-Hadoop clusters as well. So, until we support all cluster types, we should recommend that customers use the cluster level only.
Our Datadog program works as follows:
Whenever the cluster is started from the UI, we create a dashboard and a set of alerts for the cluster. The dashboard is named "Cluster #id Dashboard". Unfortunately, there is currently no feedback that the Datadog dashboard has been created, and no link to it; we will add this in the future. To view the dashboard, the customer has to go to the dashboards list on the Datadog website and open the dashboard named "Cluster #id Dashboard".
We re-create the dashboard every time the cluster is started, so any changes made directly to it are lost. If a user wants to change the dashboard, we advise them to clone it and make the changes in the clone.
The alerts are set to send an email to the addresses in the account notification list.
Once the cluster starts, a cron job on the master polls ganglia every 4 minutes for metrics and pushes them to Datadog. The list of metrics sent to Datadog can be extended by adding new metrics to the file /etc/metrics/ganglia_metrics_file.csv
Ganglia needs to be enabled for Datadog to work. Qubole does not enable ganglia from the backend automatically when Datadog is enabled.
The alerts configured by default are:
a) Master disk space
b) HDFS Disk space
c) Master memory usage
d) CPU usage
e) Job tracker liveness
f) Namenode liveness
The metrics format is:
<metric_name>, <master>|"", hour
The second field is master for a master-only metric, and empty for an aggregated metric such as cpu_report or memory_report.
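To make the metrics format concrete, here is a minimal Python sketch that parses lines of that shape. It is not the actual cron job code. The helper names (MetricEntry, parse_metrics_lines), the disk_free sample line, and the assumption that the second field is the literal word master are all hypothetical; the third field is kept as an opaque string because its meaning is not described here.

```python
import csv
from typing import List, NamedTuple


class MetricEntry(NamedTuple):
    name: str          # ganglia metric name (first field)
    master_only: bool  # True when the second field is "master" (assumed literal)
    period: str        # third field, kept as-is (e.g. "hour")


def parse_metrics_lines(lines: List[str]) -> List[MetricEntry]:
    """Parse lines of the form: <metric_name>, <master>|"", hour"""
    entries = []
    # skipinitialspace lets the reader accept ", " between fields
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        name, scope, period = (field.strip() for field in row)
        entries.append(MetricEntry(name, scope.lower() == "master", period))
    return entries


# Hypothetical sample lines: one master-only metric, one aggregated metric
sample = ['disk_free, master, hour', 'cpu_report, "", hour']
for entry in parse_metrics_lines(sample):
    print(entry)
```

A check like this can be run against a copy of the metrics file before editing it on the master, to catch lines that do not have exactly three fields.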
/End hadoop1 specifics
We also want to support services which do not send metrics directly to ganglia.
For this, we have a separate cron job (backed by "/etc/metrics/custom_metrics_file.csv") where users can specify commands to be run periodically, with the output of each command sent to ganglia under a well-defined name.
E.g., if one of the lines in the file is:
"active", "echo 1", "int8"
We send a metric to ganglia every 2 minutes with the metric name "custom.active" and the value "1" (the output of "echo 1"), and set the data type of the metric to int8.
Each line has the following format, where valid_type is the data type of the metric (int8 in the example above):
"metric_name", "command_to_be_run", "valid_type"
If a user wants to monitor a custom metric in Datadog, they have to add it to the custom metrics file and also to the Datadog metrics file (/etc/metrics/ganglia_metrics_file.csv).
The output of the cron jobs can be found at /media/ephemeral0/logs/others/push_metrics*.log
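The handling of one custom metrics line can be sketched roughly as follows. This is only an illustration, not the actual cron job: the helper names are hypothetical, and it is an assumption that the value is published with ganglia's gmetric command-line tool. The sketch only builds the gmetric argument list and does not run gmetric itself.

```python
import csv
import subprocess
from typing import List


def parse_custom_metric(line: str) -> List[str]:
    """Split one line into [metric_name, command, data_type]."""
    return next(csv.reader([line], skipinitialspace=True))


def build_gmetric_args(line: str) -> List[str]:
    """Run the command from the line and build a gmetric invocation for it."""
    name, command, data_type = parse_custom_metric(line)
    # Run the user-supplied command; its stdout becomes the metric value.
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    value = result.stdout.strip()
    # Metric names get the "custom." prefix, as in the "custom.active" example.
    return ["gmetric", "--name", f"custom.{name}",
            "--value", value, "--type", data_type]


print(build_gmetric_args('"active", "echo 1", "int8"'))
```

Running the command from a line by hand in this way is a quick check that it prints a single value of the declared type before the line is added to the file.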
For Datadog to work, the master node needs access to the Datadog endpoint over the internet so that it can push data. We do not currently support any tunneling for Datadog.
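A quick way to check outbound connectivity from the master is a plain TCP connection test, sketched below. The function name can_reach is hypothetical, and the host to test is an assumption: use whichever Datadog endpoint the account actually pushes to (for example api.datadoghq.com on port 443).

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False
```

For example, can_reach("api.datadoghq.com") run on the master returning False would suggest that a firewall or missing internet route is blocking the push. A True result only shows that the TCP connection opens; it does not validate the Datadog keys.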