How To: Enable Alluxio on Qubole's Hadoop clusters

Description:

Alluxio is an open-source, distributed, memory-based storage system. It can be configured to work with the Qubole platform. The steps below enable Alluxio on Qubole's Hadoop clusters; they have been tested only with Hadoop MapReduce and Hive (MR/Tez) workloads:

1. Download the attached node_alluxio_bootstrap.bash file.

2. Modify this script with your AWS credentials and S3 UnderFS URI path.
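The credential and UnderFS lines you edit typically look like the following. This is a minimal sketch: the property keys are standard Alluxio 1.x settings, the placeholder values are yours to fill in, and the output path here is local only for illustration (the real bootstrap writes into Alluxio's conf directory).

```shell
# Sketch of the credential/UnderFS settings the bootstrap writes out.
# Property keys are standard Alluxio 1.x names; the values are placeholders.
cat > alluxio-site.properties <<'EOF'
# S3 bucket (and optional prefix) to use as the Alluxio under-filesystem
alluxio.underfs.address=s3n://<YOUR_BUCKET>/<PREFIX>
# AWS credentials used by the S3 under-filesystem client
aws.accessKeyId=<YOUR_AWS_ACCESS_KEY_ID>
aws.secretKey=<YOUR_AWS_SECRET_KEY>
EOF
# In the real bootstrap this file lands in Alluxio's conf directory,
# e.g. /usr/lib/alluxio/conf/alluxio-site.properties
```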

3. Append the script modified in step 2 to your cluster's existing node bootstrap script.

4. Ensure you are able to access the s3://paid-qubole bucket from your cluster nodes.
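One way to verify this is to list the bucket with the hadoop CLI from a cluster node. A sketch is below; the command-v guard simply keeps the snippet safe to run on a machine without the hadoop CLI.

```shell
# Check that this node can read the s3://paid-qubole bucket.
# Requires the hadoop CLI, so it only does real work on a cluster node.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -ls s3://paid-qubole/ \
    && RESULT="s3://paid-qubole is reachable" \
    || RESULT="s3://paid-qubole is NOT reachable - check the node's credentials/role"
else
  RESULT="hadoop CLI not found; run this on a cluster node"
fi
echo "$RESULT"
```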

5. Restart the cluster.

6. Log in to the cluster nodes (master and slaves) and verify that the Alluxio logs show no errors. The usual path to check is "/media/ephemeral0/qubole/alluxio/logs". Note that the Alluxio daemons/workers run inside the cluster, not outside it.
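A quick way to scan those logs is a recursive grep over the log directory from step 6. A sketch is below; the directory check keeps it safe to run on a machine where the path does not exist.

```shell
# Scan the Alluxio logs for ERROR lines (default path from step 6).
LOG_DIR="${ALLUXIO_LOG_DIR:-/media/ephemeral0/qubole/alluxio/logs}"
if [ -d "$LOG_DIR" ]; then
  if grep -ril "ERROR" "$LOG_DIR"; then
    STATUS="errors found - inspect the files listed above"
  else
    STATUS="no ERROR lines found in $LOG_DIR"
  fi
else
  STATUS="$LOG_DIR not found; run this on a cluster node"
fi
echo "$STATUS"
```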

7. Make note of the cluster master's public DNS.

8. Run a simple MapReduce job from the shell prompt of the master node to confirm everything is working:

hadoop jar /usr/lib/hadoop2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-qds-0.4.0-SNAPSHOT.jar wordcount -libjars /usr/lib/alluxio/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar -Dalluxio.user.file.writetype.default=CACHE_THROUGH s3n://paid-qubole/default-datasets/gutenberg/ alluxio://<CLUSTER_MASTER_PUB_DNS>:19998/out/

Replace <CLUSTER_MASTER_PUB_DNS> with the value recorded in step #7 above.
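Once the job finishes, you can confirm the output landed in Alluxio with the Alluxio shell. This is a sketch: `alluxio fs ls` is the standard Alluxio 1.x shell command, and the /usr/lib/alluxio install path is assumed to match the client-jar path used in the hadoop command above.

```shell
# Confirm the wordcount output is visible in Alluxio at /out.
ALLUXIO_BIN="${ALLUXIO_HOME:-/usr/lib/alluxio}/bin/alluxio"
if [ -x "$ALLUXIO_BIN" ]; then
  "$ALLUXIO_BIN" fs ls /out \
    && VERIFY="output present in Alluxio at /out" \
    || VERIFY="/out not found in Alluxio - did the job succeed?"
else
  VERIFY="alluxio shell not found at $ALLUXIO_BIN; run this on the master node"
fi
echo "$VERIFY"
```

Because the job wrote with CACHE_THROUGH, the same output should also be persisted in the S3 UnderFS configured in step 2.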

9. For all Hive queries (MR or Tez), add the following lines at the job level or at the account level (via the Hive bootstrap):

ADD JAR s3://paid-qubole/jars/alluxio/alluxio-core-client-1.0.1-jar-with-dependencies.jar;
SET hive.execution.engine=tez;
SET fs.alluxio.impl=alluxio.hadoop.FileSystem;
SET fs.alluxio-ft.impl=alluxio.hadoop.FaultTolerantFileSystem;
SET fs.AbstractFileSystem.alluxio.impl=alluxio.hadoop.AlluxioFileSystem;
SET alluxio.user.file.writetype.default=CACHE_THROUGH;

10. To create a Hive table backed by Alluxio, point the table's LOCATION at an Alluxio URI. An example is below:

CREATE EXTERNAL TABLE `tpcds_orc_3000.store_alluxio`(
`s_store_sk` int,
`s_store_id` string,
`s_rec_start_date` timestamp,
`s_rec_end_date` timestamp,
`s_closed_date_sk` int,
`s_store_name` string,
`s_number_employees` int,
`s_floor_space` int,
`s_hours` string,
`s_manager` string,
`s_market_id` int,
`s_geography_class` string,
`s_market_desc` string,
`s_market_manager` string,
`s_division_id` int,
`s_division_name` string,
`s_company_id` int,
`s_company_name` string,
`s_street_number` string,
`s_street_name` string,
`s_street_type` string,
`s_suite_number` string,
`s_city` string,
`s_county` string,
`s_state` string,
`s_zip` string,
`s_country` string,
`s_gmt_offset` float,
`s_tax_precentage` float)

ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'alluxio://<CLUSTER_MASTER_PUB_DNS>:19998/store_alluxio';

Replace <CLUSTER_MASTER_PUB_DNS> with the value recorded in step #7 above.
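To sanity-check the new table end to end, you can combine the step-9 settings with a simple query in one hive session. This is a hedged sketch, not a definitive invocation: it assumes the hive CLI is available on the master node and that the session-level ADD JAR/SET lines shown in step 9 are sufficient for your setup.

```shell
# Run a quick sanity query against the Alluxio-backed table from step 10.
# The command-v guard keeps this runnable on machines without the hive CLI.
if command -v hive >/dev/null 2>&1; then
  hive -e "
    ADD JAR s3://paid-qubole/jars/alluxio/alluxio-core-client-1.0.1-jar-with-dependencies.jar;
    SET fs.alluxio.impl=alluxio.hadoop.FileSystem;
    SET fs.AbstractFileSystem.alluxio.impl=alluxio.hadoop.AlluxioFileSystem;
    SELECT COUNT(*) FROM tpcds_orc_3000.store_alluxio;
  " && QUERY_MSG="query ran against the Alluxio-backed table" \
    || QUERY_MSG="query failed - check the step-9 settings and the table LOCATION"
else
  QUERY_MSG="hive CLI not found; run this on the cluster master"
fi
echo "$QUERY_MSG"
```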
