Reference: Spark Streaming Best Practices

Description:

This article contains best practices for running Spark Streaming jobs with Qubole.

 

Executors and Receivers:

In YARN, the same executor can be used for both receiving and processing. Each receiver is like a long-running task, so it occupies a slot/core. If there are free slots/cores in the executor, other tasks can run on them.

 

Optimal Cluster Size:

The best approach is to start with an appropriately sized cluster and a suitable minimum number of executors.

The number of executors should be at least equal to the number of receivers.

The number of cores per executor should also be set so that each executor has spare capacity for processing, beyond just running its receiver.

The number of cores allocated to the Spark Streaming application must be greater than the number of receivers; otherwise, the system will receive data but not be able to process it.
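As an illustration of these sizing rules, the snippet below is a minimal sketch for an application with two receivers; the application name, executor count, core count, and batch interval are illustrative values, not Qubole defaults. Each executor runs one receiver and still has a spare core for processing tasks.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Illustrative sizing for an application with 2 receivers.
    val conf = new SparkConf()
      .setAppName("streaming-sizing-sketch")
      .set("spark.executor.instances", "2")   // at least as many executors as receivers
      .set("spark.executor.cores", "2")       // 1 core for the receiver + spare capacity for processing
    val ssc = new StreamingContext(conf, Seconds(10))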

 

Setting spark.streaming.backpressure.enabled to true:

This enables Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times, so that the system receives data only as fast as it can process it.
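For example, the property can be set on the SparkConf before the StreamingContext is created. The initialRate line below is optional and its value is purely illustrative; it only caps the first batches, while the backpressure algorithm has no processing history to work from.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional, illustrative value: cap the first batches until
      // backpressure has gathered processing-time history.
      .set("spark.streaming.backpressure.initialRate", "1000")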

 

Number of Receivers and the Need for Autoscaling:

The Spark Streaming API lets the user set the number of receivers.

Refer to http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving for details; there is no need for autoscaling on the receiving side.

Processing, however, is handled by the Spark engine, and autoscaling is available there.

Note: if you want to receive multiple streams of data in parallel in your streaming application, create multiple input DStreams. This creates multiple receivers, which simultaneously receive multiple data streams. Keep in mind that each receiver is a long-running task, so it occupies one of the cores allocated to the Spark Streaming application. Therefore, a Spark Streaming application needs to be allocated enough cores to process the received data as well as to run the receiver(s).
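The snippet below is a minimal sketch of this pattern, assuming a socket source on a hypothetical host and ports: several input DStreams are created (one receiver each) and then unioned, so the streams are received in parallel but processed as a single DStream.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("multi-receiver-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Three input DStreams -> three receivers running in parallel.
    val numReceivers = 3
    val streams = (1 to numReceivers).map { i =>
      ssc.socketTextStream("stream-host", 9000 + i)   // hypothetical host and ports
    }

    // Union them so downstream processing sees a single DStream.
    val unified = ssc.union(streams)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()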

 

Running a streaming app through Analyze:

Pros:

  1. In Analyze, there are no temporary class names; the Scala compiler generates reusable, permanent class names. Saving to a checkpoint and restarting from it therefore does not cause problems.

Cons:

  1. The Tapp layer force-kills the application after 36 hours.

 

Running a streaming app through the notebook shell interpreter:

Cons:

  1. Notebooks generate temporary code, and those temporary class names get saved into checkpoints; recovery from a checkpoint then fails. The workaround is to wrap all the code in an object (see the sketch below).

 

Running a streaming app through the notebook Scala interpreter:

Cons:

  1. Notebooks generate temporary code, and those temporary class names get saved into checkpoints; recovery from a checkpoint then fails. The workaround is to wrap all the code in an object (see the sketch below).
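For both notebook interpreters, the workaround looks roughly like the sketch below; the application name, checkpoint location, and socket source are hypothetical. All streaming logic lives in a named object, so the class names recorded in the checkpoint stay stable across restarts.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object MyStreamingJob {
      // Hypothetical checkpoint location.
      val checkpointDir = "s3://my-bucket/checkpoints/my-streaming-job"

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("my-streaming-job")
        val ssc  = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint(checkpointDir)
        val lines = ssc.socketTextStream("stream-host", 9999)   // hypothetical source
        lines.count().print()
        ssc
      }

      def run(): Unit = {
        // Recover from the checkpoint if one exists, otherwise create a new context.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }

    MyStreamingJob.run()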

 

How to never time out (ensuring the streaming app runs forever):

Refer to https://qubole.zendesk.com/hc/en-us/articles/210292643-How-To-Never-Timeout

Set both yarn.resourcemanager.app.timeout.minutes and spark.qubole.idle.timeout to -1.
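A minimal sketch, assuming both settings can be supplied as configuration overrides at submission time; the exact mechanism may differ in your Qubole setup, and yarn.resourcemanager.app.timeout.minutes is a YARN-level setting that may instead need to be set as a Hadoop override on the cluster.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.qubole.idle.timeout", "-1")   // disable the Spark-level idle timeout
      // Assumption: passing the YARN setting through the spark.hadoop.* prefix;
      // it may need to be a cluster-level Hadoop override instead.
      .set("spark.hadoop.yarn.resourcemanager.app.timeout.minutes", "-1")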

Note: if you are running the application via the QDS Analyze page, these settings will not override the Tapp layer's 36-hour limit.
