Reference: Why Does Spark Lag?

Data Format & Cache Management

Understanding and properly managing the data types and file formats used in Spark can lead to significant performance improvements and potential cost savings. Java objects are fast to access, but they can easily occupy 200% to 500% of the original file size. Computation can also suffer when data formats are complicated to serialize or consume large numbers of bytes. Performance can be improved by using arrays of objects, primitive data types, numeric IDs and enumeration objects.

When caching RDD data inside a Spark Cluster, it is important to consider both how the data is cached and the resulting cache size. Cached RDD data can quickly grow well beyond the size of the source file. The structure of the data and the input file affect the cache size, as does the manner of caching. A serialized cache is cheaper to store but requires more compute power to access: less memory is required for caching, but the data must be deserialized during processing, which increases access time. This is therefore an area where experimentation and benchmarking may be required to fully understand the potential performance improvements.

The resources required for garbage collection are proportional to the number of Java objects, so using primitive types reduces the eventual cost of garbage collection. Persisting serialized RDDs in the cache leads to simpler garbage collection, since there is only one byte array per RDD partition to clean up. Smaller cache sizes also improve garbage collection, since there are fewer Java objects to clean.
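To make the 200% to 500% growth figure concrete, the arithmetic can be sketched in a few lines of Python. This is a rough planning heuristic, not a Spark API: the function names `estimate_deserialized_cache_bytes` and `cache_fit` are hypothetical, and the 2x and 5x factors are simply the range quoted above. Actual cache sizes depend on the data structure and should be confirmed by benchmarking.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def estimate_deserialized_cache_bytes(source_bytes, low_factor=2.0, high_factor=5.0):
    """Return a (low, high) estimate of the in-memory size of a deserialized cache.

    The default factors are the 200% to 500% range quoted in this article
    (an assumption for planning, not a measured value).
    """
    if source_bytes < 0:
        raise ValueError("source_bytes must be non-negative")
    return int(source_bytes * low_factor), int(source_bytes * high_factor)


def cache_fit(source_bytes, cache_bytes):
    """Classify whether the estimated cache range fits in the available cache memory."""
    low, high = estimate_deserialized_cache_bytes(source_bytes)
    if high <= cache_bytes:
        return "fits"            # even the pessimistic estimate fits
    if low > cache_bytes:
        return "does not fit"    # even the optimistic estimate is too large
    return "borderline"          # worth benchmarking, or caching serialized


low, high = estimate_deserialized_cache_bytes(10 * GIB)
print(f"10 GiB source -> {low / GIB:.0f} to {high / GIB:.0f} GiB cached")
print(cache_fit(10 * GIB, 32 * GIB))  # borderline
```

A "borderline" result is the case where a serialized cache is most likely to be worth testing, since it trades extra CPU on access for a smaller memory footprint.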


In Memory & On Disk Storage

Spark stores data on nodes in the cluster and ships code to nodes in the cluster, so the cheapest processing occurs when the data and the code are on the same node. If that node does not have enough resources to accommodate the task, Spark can wait until CPU becomes available on it. However, if the code and the required data are not on the same node and Spark exceeds the locality timeout, one must be moved to the other. Code tends to be cheaper to move across the cluster than data because it is much smaller. This concept is known as Data Locality and describes where a task runs in the executor relative to the data it processes. The Data Locality for a Task can be identified in the Spark Application UI, so when performance decreases, check the relationship between the code and the data it needs.
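The locality timeout mentioned above is controlled in open-source Spark by the `spark.locality.wait` property, which defaults to 3s. As a minimal sketch, the helper below builds the `--conf` arguments for `spark-submit`. The helper name `spark_submit_conf_args`, the value `6s` and the script name `my_job.py` are illustrative assumptions, not recommendations; whether a longer wait helps depends on the workload and should be benchmarked.

```python
def spark_submit_conf_args(conf):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):  # sorted for a stable, reproducible command line
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Illustrative value only: wait longer for a data-local slot before
# falling back to a less local one (Spark's default is 3s).
locality_conf = {"spark.locality.wait": "6s"}

# my_job.py is a hypothetical application script.
command = ["spark-submit"] + spark_submit_conf_args(locality_conf) + ["my_job.py"]
print(" ".join(command))
# spark-submit --conf spark.locality.wait=6s my_job.py
```

A longer wait can improve locality at the cost of tasks sitting idle, while a shorter wait starts tasks sooner but may move more data across the cluster.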


Master Node

While Spark supports jobs written in a variety of programming languages, Scala, the native Spark language, requires less overhead to execute than the others. It is therefore important to keep in mind the increased overhead of running code in other languages such as Python. This consideration also applies when executing Notebooks with PySpark or R interpreters. For example, PySpark code with a large number of variables can place enough of a burden on the Master Node to cause Spark to run out of memory. Notebook Interpreters also run as applications on the Master Node, which can cause resource conflicts when Notebook usage is heavy. The Master Node of a Spark Cluster has a lot of responsibility, so it is important to ensure that ample resources are available to it.


Spark Applications

The Spark applications triggered by Analyze queries.

Spark Application Driver

The driver which returns data to YARN from Spark queries.

Notebook Interpreters

The individual Spark applications supporting Notebooks.

YARN Application Manager

Responsible for managing the status of Spark applications.

YARN Resource Manager

Responsible for managing the vCores and memory allocated to Spark.

