Reference: Files & Compression

File Formats

There are a variety of file formats available for use in the cloud and, by extension, in Qubole. Each entry below pairs a format with its design; a short PySpark sketch for reading each format follows the list.

CSV (comma-separated values): Text file with commas separating the values. Typically used for a large number of columns storing non-sentence text or numerical values.

Tab delimited: Text file with tabs separating the values. Typically used for a large number of columns storing variable text or numerical values.

Pipe delimited: Text file with pipes (|) separating the values. Typically used for a large number of columns storing variable text or numerical values.

Avro: Serialized binary file format designed for more sophisticated data extraction and management; uses JSON to define its schema.

Parquet: Columnar file format designed for more sophisticated data extraction and management; the preferred columnar format for Spark.

ORC (Optimized Row Columnar): Columnar file format designed for more sophisticated data extraction and management, with metadata (such as column statistics) embedded in the file; the preferred columnar format for Hive and Presto.

JSON: Document file format designed to capture data entries whose structure and schema may vary. Typically requires a SerDe to parse when referenced through a Hive table; can be loaded directly into Spark using the native DataFrame API.
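
The following minimal sketch reads each format into a Spark DataFrame. The bucket and file paths are hypothetical placeholders; substitute your own locations.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-examples").getOrCreate()

    # CSV: comma is the default separator; header and schema inference are optional.
    csv_df = spark.read.option("header", "true").csv("s3://example-bucket/data/events.csv")

    # Tab- and pipe-delimited files use the same reader with a custom separator.
    tsv_df = spark.read.option("sep", "\t").csv("s3://example-bucket/data/events.tsv")
    psv_df = spark.read.option("sep", "|").csv("s3://example-bucket/data/events.psv")

    # JSON is loaded directly into a DataFrame; no SerDe is needed.
    json_df = spark.read.json("s3://example-bucket/data/events.json")

    # Avro requires the spark-avro package on the classpath.
    avro_df = spark.read.format("avro").load("s3://example-bucket/data/events_avro/")

    # Columnar formats have dedicated readers.
    parquet_df = spark.read.parquet("s3://example-bucket/data/events_parquet/")
    orc_df = spark.read.orc("s3://example-bucket/data/events_orc/")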


Compression Considerations

There are a variety of compression policies available for these file formats. Compression is useful for reducing the size of stored data and the cost of transferring data across the network. The trade-off is increased CPU usage, since data is compressed after processing and decompressed before processing. Additionally, some compression policies do not support splitting files into multiple tasks, which can strain resources if the tasks supporting the job are not configured to account for it.
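
In Spark, for example, the codec is chosen at write time. A minimal sketch, assuming a small in-memory DataFrame and hypothetical output paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compression-examples").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # Snappy favors speed over compression ratio.
    df.write.option("compression", "snappy").parquet("s3://example-bucket/out/parquet_snappy/")

    # GZip produces smaller files, but each one must be read by a single task.
    df.write.option("compression", "gzip").csv("s3://example-bucket/out/csv_gzip/")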

Snappy: Designed for high speed with reasonable compression; the recommended policy for ORC and Parquet files.

LZ4: Designed for very high speed and splittable by design; the default compression policy for Spark.

ZLib/Deflate: The compressed files cannot be split across multiple tasks, so this policy is typically used to compress single files.

GZip: The compressed files cannot be split across multiple tasks, so this policy is typically used to compress single files (the sketch after this list shows the effect on parallelism). After extraction, data typically grows to 6x-10x the size of the compressed file.
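
Because a non-splittable file must be consumed by a single task, the partition count makes the difference visible. A sketch, reusing the hypothetical output paths from above:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("split-check").getOrCreate()

    # gzip streams cannot be split, so even a very large .gz file is read
    # as a single partition (one task).
    gz_df = spark.read.csv("s3://example-bucket/out/csv_gzip/")
    print(gz_df.rdd.getNumPartitions())  # never more partitions than .gz files

    # Snappy-compressed Parquet splits by row group, so a large dataset
    # spreads across many tasks.
    pq_df = spark.read.parquet("s3://example-bucket/out/parquet_snappy/")
    print(pq_df.rdd.getNumPartitions())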
