Reference: Relationship between Partitions / Tasks / Cores

Spark Partitions & Source Data

When data is brought into Spark it is stored in either an RDD or a DataFrame, and the source data is partitioned into blocks that reside in executors on nodes in the cluster. The number of partitions Spark recognizes depends on the function used to bring in the dataset; refer to the Spark documentation for the partitioning logic of each input function.

# of Spark RDD / DataFrame Partitions = Result of Partitioning Logic for Spark Function
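As a rough illustration of one common case, file-based Spark SQL sources split files into partitions of at most spark.sql.files.maxPartitionBytes (128 MB by default). The sketch below is a simplified estimate under that assumption only; real Spark also packs small files together and adds a per-file open cost, which this ignores.

```python
import math

def estimate_source_partitions(file_sizes_bytes, max_partition_bytes=128 * 1024 * 1024):
    """Rough partition estimate for a file-based source.

    Simplified: each file is split into chunks of at most
    max_partition_bytes (spark.sql.files.maxPartitionBytes, 128 MB
    by default). Small-file packing and open cost are ignored.
    """
    return sum(math.ceil(size / max_partition_bytes) for size in file_sizes_bytes)

# Two 300 MB files -> ceil(300 / 128) = 3 chunks each, 6 partitions total.
print(estimate_source_partitions([300 * 1024 * 1024] * 2))  # → 6
```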


Spark Tasks & Data Partitions

When data is processed in Spark, the processing is performed by tasks, which do the heavy lifting of picking up data and executing the transformations or actions specified in the code. Transformations may be broken up across stages and generate new RDDs or DataFrames with a different number of partitions, which affects how many tasks subsequent stages need. For the first stage, the task count is driven by the number of partitions in the source:

# of Tasks required for Stage = # of Source Partitions

For subsequent stages, it is driven by the number of partitions produced by the prior stage:

# of Tasks required for Stage = # of Spark RDD / DataFrame Partitions
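The one-task-per-partition rule above can be expressed directly. This is a trivial sketch of the document's formula; the shuffle example assumes the Spark SQL default of spark.sql.shuffle.partitions = 200.

```python
def tasks_for_stage(num_partitions):
    # Spark launches exactly one task per partition of the
    # RDD/DataFrame feeding the stage.
    return num_partitions

# A source read that produced 6 partitions needs 6 tasks; a stage
# after a shuffle with spark.sql.shuffle.partitions = 200 needs 200.
print(tasks_for_stage(6), tasks_for_stage(200))  # → 6 200
```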


Spark Cores & Processing Tasks

Each task in Spark is assigned one or more cores, and the number of cores determines how much processing power each task has. The cores-per-task ratio is set with spark.task.cpus (default 1), and the overall core requirement for a stage is estimated as:

# of Cores required for Stage = (# of Tasks required for Stage) * (spark.task.cpus)
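The core estimate is a direct multiplication; a minimal sketch of the formula above:

```python
def cores_for_stage(num_tasks, task_cpus=1):
    # spark.task.cpus defaults to 1; raising it gives each task
    # more than one core (e.g. for multi-threaded native code).
    return num_tasks * task_cpus

print(cores_for_stage(200, task_cpus=1))  # → 200
print(cores_for_stage(10, task_cpus=2))   # → 20
```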


Spark Executors & Task Cores

Each executor in Spark hosts one or more partitions, and as a result runs one or more tasks while processing a stage of the jobs triggered by the code. The number of cores per executor, and therefore the number of concurrent tasks per executor, is set with spark.executor.cores, so the executor requirement for a stage can be estimated as:

# of Executors required for Stage = (# of Cores required for Stage) / (spark.executor.cores)
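One practical note on the division above: a fractional executor is not possible, so the result should be rounded up. A minimal sketch:

```python
import math

def executors_for_stage(cores_required, executor_cores):
    # Round up, since you cannot allocate part of an executor:
    # 201 cores with 4-core executors still needs 51 executors.
    return math.ceil(cores_required / executor_cores)

print(executors_for_stage(200, executor_cores=4))  # → 50
print(executors_for_stage(201, executor_cores=4))  # → 51
```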
