Rubix offers several improvements in caching architecture and implementation; as a result, Qubole has integrated Rubix into its YARN-backed clusters.
Columnar Caching Improvements
Each file is divided into logical blocks with a configurable block size. For example, a 1 GB file in S3 is considered to be made up of 1024 blocks of 1 MB each. Each read request is now done in units of this block size, and blocks read for the first time are stored in a sparse file on local disk. Thus only the parts of the file that are needed are loaded into the cache, which improves cache warm-up time considerably when only a subset of columns is read from the table.
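The block-addressed sparse-file scheme above can be sketched as follows. This is an illustrative sketch, not Rubix's actual code: the helper names and the 1 MB default are assumptions for the example, and sparse-file behavior depends on the local filesystem.

```python
import os

BLOCK_SIZE = 1 * 1024 * 1024  # 1 MB logical blocks (block size is configurable)

def blocks_for_range(offset, length, block_size=BLOCK_SIZE):
    """Return the block indices a read of [offset, offset + length) touches."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))

def cache_block(cache_path, block_index, data, block_size=BLOCK_SIZE):
    """Write one block into a sparse local file at its natural offset.

    Seeking past the end of the file before writing leaves a hole, so on a
    filesystem that supports sparse files only the cached blocks actually
    occupy disk space -- uncached regions cost nothing."""
    mode = "r+b" if os.path.exists(cache_path) else "wb"
    with open(cache_path, mode) as f:
        f.seek(block_index * block_size)
        f.write(data)
```

With this layout, a 1 GB S3 object maps to block indices 0 through 1023, and a read that touches only a few column chunks caches only the blocks those chunks fall in.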
Engine Independent Logic
Rubix caching leverages the locality-based scheduling logic that all the major big data engines provide. Rubix decides which node will handle a particular split of the file and always returns the same node for that split to the scheduler, thereby allowing schedulers to use locality-based scheduling. Rubix uses consistent hashing to decide where a block should reside. Consistent hashing keeps the cost of rebalancing the cache low when nodes join or leave the cluster, e.g. during AutoScaling, because only the blocks owned by the departed or newly added node are remapped.
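The split-to-node assignment can be illustrated with a minimal consistent-hash ring. This is a sketch under assumed details (MD5 hashing, 100 virtual points per node), not Rubix's actual implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: each node gets several virtual points
    on the ring, and a key maps to the first node point clockwise from
    the key's hash."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def node_for(self, key):
        # First ring point at or after the key's hash, wrapping around.
        idx = bisect.bisect(self.ring, (self._hash(key), ""))
        if idx == len(self.ring):
            idx = 0
        return self.ring[idx][1]
```

The property the text relies on: when a node leaves, only the keys that hashed to that node's points move; every split that mapped to a surviving node keeps the same owner, so its cached blocks stay useful.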
Shared Cache Across JVMs
All the cached blocks reside on local disk, so the only prerequisite for sharing the cache among different jobs is making the state of the cache visible to all of them. For this, Rubix uses a cache manager: a lightweight server running on each node of the cluster that keeps track of which blocks of which files have already been brought to disk. As a job brings blocks of a file onto disk, it updates the cache manager with the new state. This makes information about the cached blocks available to all other jobs, which can then read the cached data instead of going back to the source data in cloud storage.
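The cache manager's role can be sketched as a per-node block registry. A simplified, in-process sketch follows; in Rubix this state lives in a lightweight server on each node so that every JVM can query and update it, and the method names here are assumptions for illustration:

```python
from collections import defaultdict

class CacheManager:
    """Per-node registry of which blocks of which files are on local disk."""

    def __init__(self):
        self._cached = defaultdict(set)  # file path -> set of block indices

    def report_cached(self, path, block_indices):
        """Called by a job after it writes blocks into the local sparse file,
        so other jobs learn those blocks are now available locally."""
        self._cached[path].update(block_indices)

    def lookup(self, path, block_indices):
        """Split a requested range into blocks already cached locally and
        blocks that must still be fetched from cloud storage."""
        have = self._cached.get(path, set())
        cached = [b for b in block_indices if b in have]
        missing = [b for b in block_indices if b not in have]
        return cached, missing
```

A second job asking for an overlapping range gets back the blocks the first job already cached and only fetches the remainder from the object store.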
To use Rubix with Presto, select “Enable Rubix” on the Presto cluster's UI configuration page; this automatically sets the Rubix configuration in the Presto cluster to cache data.
To use Rubix with Hive, ensure that the entire Hadoop job runs on the master node (hive.on.master) and set the following properties so that MapReduce and Tez jobs use Rubix for caching: