How To: Running JDBC Storage Handler in a VPC

Description:

The Hive Storage Handler for JDBC by Qubole, a fork of HiveJdbcStorageHandler, lets users read from and write to JDBC databases using Hive, and also enables them to run SQL queries to analyze data that resides in JDBC tables. Optimizations such as FilterPushDown have also been added.

Previously, setting up a Hive external table with the JDBC Storage Handler required the Qubole tier to have access to the customer's database. It is now possible to configure the handler so that the access request is sent from the cluster node instead. This makes the JDBC Storage Handler practical for use with Redshift or RDS inside a VPC.

How To:

Override the default number of map tasks:

  • set mapred.map.tasks=1;

Set "lazy split" in the Hive table's TBLPROPERTIES:

  • TBLPROPERTIES ("mapred.jdbc.hive.lazy.split"="true");

Example:

set mapred.map.tasks=1;
add jar s3://adroitds/jars/RedshiftJDBC41-1.1.6.1006.jar;
add jar s3://paid-qubole/jars/jdbchandler/qubole-hive-JDBC-0.0.6.jar;
DROP TABLE IF EXISTS tmp_test_1;
CREATE EXTERNAL TABLE IF NOT EXISTS tmp_test_1 (
ad_id INT
)
STORED BY 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler'
TBLPROPERTIES (
"mapred.jdbc.driver.class"="com.amazon.redshift.jdbc41.Driver",
"mapred.jdbc.url"="jdbc:redshift://mytest.us-east-1.redshift.amazonaws.com:5439/test",
"mapred.jdbc.username"="user",
"mapred.jdbc.input.table.name"="redshift_tbl",
"mapred.jdbc.output.table.name"="redshift_tbl",
"mapred.jdbc.password"="pw",
"mapred.jdbc.hive.lazy.split"="true"
);
select * from tmp_test_1 limit 10;
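
Because the handler supports FilterPushDown (noted in the description above), a WHERE clause on the external table should be pushed down to the JDBC source rather than evaluated in Hive, so only matching rows cross the JDBC connection. A minimal sketch against the example table (the predicate value is an arbitrary placeholder):

```sql
-- Hedged sketch: the filter on ad_id is expected to be pushed down
-- to Redshift by the storage handler's FilterPushDown optimization.
SELECT ad_id
FROM tmp_test_1
WHERE ad_id > 100
LIMIT 10;
```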
