How To: Running Mahout

Description:

This article discusses how to run Mahout jobs on Qubole.

About Mahout: https://mahout.apache.org/

How To:

  • The Mahout JARs are stored in the Qubole S3 bucket "paid-qubole", and node_bootstrap.sh installs them on the cluster at boot time.
    • Patch: All Mahout JARs have been hand-patched with an extra set of class files under org/apache/hadoop/mapreduce/lib/input/. We patched the Mahout library rather than the Hadoop JARs, which would arguably have been the cleaner fix, because we did not want to modify our Hadoop JARs. (A sketch for verifying the patched classes follows the bootstrap script below.)
  • node_bootstrap.sh must be copied to the cluster's default S3 location so that it runs when the cluster is provisioned at boot time. The bootstrap file node_bootstrap.sh should look like:

{code}
# Create a working directory on the ephemeral drive and fetch the
# Mahout distribution and sample data from the paid-qubole bucket.
mkdir -p /media/ephemeral0/mahout
cd /media/ephemeral0/mahout
hadoop dfs -get s3://paid-qubole/mahout0.9/mahout-distribution-0.9.tar.gz .
hadoop dfs -get s3://paid-qubole/mahout0.9/data.tar.gz .
# Unpack the Mahout distribution and the sample data.
tar -xvf mahout-distribution-0.9.tar.gz
tar -xvf data.tar.gz
{code}
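
To confirm that a cluster picked up the patched JARs, you can list the extra class files inside the Mahout core JAR. This is a minimal sketch, not part of the standard setup: the JAR name mahout-core-0.9-job.jar is an assumption based on the 0.9 distribution layout, so adjust it to whatever JAR your distribution actually contains.

{code}
# Sketch: check that the hand-patched classes are present.
# The JAR name below is assumed; substitute the actual core JAR.
cd /media/ephemeral0/mahout/mahout-distribution-0.9
jar tf mahout-core-0.9-job.jar | grep 'org/apache/hadoop/mapreduce/lib/input/'
{code}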

  • Run a sample job (as a Shell Command in Qubole) as below, replacing the --output path with an S3 location you have write access to:

{code}
/media/ephemeral0/mahout/mahout-distribution-0.9/bin/mahout recommenditembased \
  --input s3://paid-qubole/mahout0.9/sampledata/myrating.csv \
  --output s3://dev.canopydata.com/000Adubey/mahouttest/reco_out \
  --tempDir /tmp/abc6 \
  --usersFile s3://paid-qubole/mahout0.9/sampledata/u.data \
  -s SIMILARITY_COSINE
{code}
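
Once the job finishes, the recommendations land as MapReduce part files under the --output path. As a rough sketch, assuming your own writable bucket in place of the example output path and the usual part-file naming, you can inspect the results like this:

{code}
# Sketch: inspect the job output. The bucket and part-file name are
# assumptions; substitute the --output path you used above.
hadoop dfs -ls s3://your-bucket/mahouttest/reco_out
hadoop dfs -cat s3://your-bucket/mahouttest/reco_out/part-r-00000 | head
{code}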

 
