How To: Submitting Spark programs from a customer-managed remote machine

You want to use spark-submit/spark-shell from a remote machine that you manage in your own account, running the Spark driver on that node and pointing it at the Qubole Spark cluster (which runs YARN).

Note: Although we don't recommend this approach, it has come up multiple times in our discussions, so we have decided to support it by providing steps to get started. That's about it - we can help if there are some setup-related issues.

We won't be able to debug/support this because
a) the jobs are not in our control (they bypass our API), and
b) we may not have access to the remote node.

Here are the steps, roughly:

1. Launch a Spark cluster in Qubole. Wait until the cluster comes up, then get the master_dns from the Control Panel.

2. From your remote node, do the following:

# create the target directories on the remote node
sudo mkdir -p /media/ephemeral0/spark
cd /usr/lib
sudo mkdir hadoop2 spark hive hive13 tez

ssh_key=<customer ssh key>
master_dns=<master dns of cluster>

# copy the Hadoop, Spark, Hive and Tez installations from the cluster's master node
sudo scp -ri ${ssh_key} ${master_dns}:/usr/lib/hadoop2/* /usr/lib/hadoop2/
sudo scp -ri ${ssh_key} ${master_dns}:/usr/lib/spark/* /usr/lib/spark/
sudo scp -ri ${ssh_key} ${master_dns}:/usr/lib/hive/* /usr/lib/hive/
sudo scp -ri ${ssh_key} ${master_dns}:/usr/lib/hive13/* /usr/lib/hive13/
sudo scp -ri ${ssh_key} ${master_dns}:/usr/lib/tez/* /usr/lib/tez/
sudo scp -ri ${ssh_key} ${master_dns}:/media/ephemeral0/spark/* /media/ephemeral0/spark/
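
As a quick sanity check after the copy, you can verify that the copied Hadoop configuration already points at the cluster master. This is only a sketch and assumes the copied configuration lives under the usual /usr/lib/hadoop2/etc/hadoop path.

# list the copied config files that reference the master DNS
grep -l "${master_dns}" /usr/lib/hadoop2/etc/hadoop/*.xml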

  • Spark needs the ResourceManager (RM) and NameNode (NN). These are running on the Qubole Spark cluster, and the remote node needs access to them, so let's open the required ports in your security group. Note that since you copied Spark and Hadoop2 from the cluster's master node, the RM and NN addresses already point to the master DNS; no config change is required.
  • Change the Spark cluster's security group to open all TCP ports for the remote node (either the remote machine alone or the security group it belongs to), as sketched below.
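
If your cluster runs in EC2, the same rule can be added with the AWS CLI. This is only a sketch - the group IDs below are placeholders for your actual security group IDs.

# allow the remote node's security group to reach the Spark cluster's
# security group on all TCP ports
aws ec2 authorize-security-group-ingress \
    --group-id <spark cluster security group id> \
    --protocol tcp --port 0-65535 \
    --source-group <remote node security group id>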

3. Now, let us try: /usr/lib/spark/bin/spark-submit --master yarn-client. The job gets submitted and you can see it in the RM UI, but it stays in the ACCEPTED state, never starts running and eventually fails. The reason is that the ApplicationMaster (AM) is unable to talk to the driver (spark-shell/spark-submit) running on the remote machine. To solve this, open all ports for the Spark cluster in the remote machine's security group.
Bottom line: the remote machine and the master node must have two-way communication open; only then will this work.
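
The reverse rule on the remote machine's security group would look similar (again, only a sketch with placeholder group IDs):

# allow the Spark cluster's security group to reach the remote node's
# security group on all TCP ports
aws ec2 authorize-security-group-ingress \
    --group-id <remote node security group id> \
    --protocol tcp --port 0-65535 \
    --source-group <spark cluster security group id>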

4. Run:

/usr/lib/spark/bin/spark-shell --master yarn-client

or

sudo /usr/lib/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client /usr/lib/spark/spark-examples* 1
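
To confirm from the remote node that the job actually reached YARN, you can also query the ResourceManager with the copied Hadoop client. This assumes JAVA_HOME is set and the copied tree keeps its usual bin/ layout.

# list YARN applications as seen by the cluster's ResourceManager
/usr/lib/hadoop2/bin/yarn application -list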

You may still face issues because of package mismatches between the remote machine and the cluster, as well as issues caused by autoscaling logs getting copied to the master node, which need to be deleted manually. For such issues, please contact Qubole support. We cannot guarantee this setup, as the remote machine is not under the Qubole infrastructure and we have limited access to the remote node.

 
