How To: Control Spark RDD Caching

RDD Storage Policy

In Spark, the storage level of an RDD is captured by a StorageLevel object, and the following command returns the current storage policy of an RDD:

myRDD.getStorageLevel

This returns a StorageLevel object describing how the RDD is stored.

StorageLevel

StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
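For example, in the Scala spark-shell the storage level can be inspected before and after caching. The RDD below is a made-up example for illustration; sc is the SparkContext provided by the shell:

val myRDD = sc.parallelize(1 to 1000)

// Nothing has been cached yet, so all flags are false and replication is 1.
println(myRDD.getStorageLevel)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
myRDD.cache()

// Now reports a level with useMemory and deserialized set to true.
println(myRDD.getStorageLevel)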

 

RDD Storage Level

While Spark will attempt to keep as much data as possible in memory, there are situations where data has to be written to disk. Spark makes the storage decision, but developers can influence the behavior by passing a StorageLevel when calling persist(); a short sketch follows the list of levels below. Please refer to the official API documentation for more details on rdd.persist(StorageLevel).

MEMORY_ONLY

Deserialized Java objects in the JVM - if the RDD does not fit into memory then some partitions will not be cached and will be recomputed on the fly.

MEMORY_AND_DISK

Deserialized Java objects in the JVM - if the RDD does not fit into memory then some partitions will be written to disk and retrieved when needed.

MEMORY_ONLY_SER

Serialized Java objects in the JVM, which is more space-efficient but also more CPU-intensive - if the RDD does not fit into memory then some partitions will not be cached and will be recomputed on the fly.

MEMORY_AND_DISK_SER

Serialized Java objects in the JVM, which is more space-efficient but also more CPU-intensive - if the RDD does not fit into memory then the partitions that do not fit are written to disk and read from there when needed.

DISK_ONLY

Partitions stored only on the disk.
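The following is a minimal sketch of choosing a level explicitly, assuming a text file on HDFS (the path and variable names are made up for illustration):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///data/events.log")

// Keep partitions in memory as serialized bytes, spilling to disk if they do not fit.
lines.persist(StorageLevel.MEMORY_AND_DISK_SER)

// The first action materializes and caches the data; later actions reuse the cache.
println(lines.count())
println(lines.filter(_.contains("ERROR")).count())

// Release the cached partitions once they are no longer needed.
lines.unpersist()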

In addition to where partitions are stored, a storage level can also request replication across nodes; the _2 suffix indicates that each partition is kept on two nodes (see the sketch after this list):

MEMORY_ONLY_2

Same as MEMORY_ONLY, but each cached partition is replicated on two cluster nodes.

MEMORY_AND_DISK_2

Same as MEMORY_AND_DISK, but each cached partition is replicated on two cluster nodes.
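A short sketch of requesting a replicated level (the RDD is a made-up example):

import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))

// Each cached partition is kept in memory on two nodes, so losing one
// executor does not force a recomputation.
pairs.persist(StorageLevel.MEMORY_ONLY_2)

println(pairs.getStorageLevel.replication)  // prints 2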
