How To: Write R Data Frame to S3

In general, you can write a Spark data frame to S3 as CSV, Parquet, or Avro files using the data frame writer (df.write(...)). In the same way, you can write a SparkR data frame to S3: the SparkR (%spark.r, %r) interpreters run the write as distributed Spark jobs.


The process of writing to S3 is straightforward:

-> Read the data as an R data frame using the SQLContext, or read it directly as a SparkR data frame. (Note that a local R data frame has to be converted into a SparkR data frame before it can be written.)

-> Use the data frame write functions (write.df in SparkR) to write it out as a CSV, Avro, or Parquet file.


- Create an R data frame from one of R's built-in datasets.

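A minimal sketch of this step in a Zeppelin %spark.r paragraph (the built-in mtcars dataset and the name local_df are illustrative choices, not from the original screenshot):

    %spark.r
    # Build a local R data frame from a built-in dataset
    local_df <- mtcars
    head(local_df)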

- Convert it into a SparkR data frame.

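The conversion can be sketched as follows, assuming the Spark 2.x SparkR API (on Spark 1.6, createDataFrame() also takes the sqlContext as its first argument):

    %spark.r
    # Convert the local R data frame into a distributed SparkR data frame
    spark_df <- createDataFrame(local_df)
    printSchema(spark_df)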

- Register the SparkR data frame as an in-memory (temporary) table; from there it can also be written out as an actual Hive table.

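A sketch of this step with Spark 2.x SparkR; the table names cars_tmp and cars_hive are hypothetical (on Spark 1.6, registerTempTable() plays the role of createOrReplaceTempView()):

    %spark.r
    # Register the SparkR data frame as an in-memory temporary view
    createOrReplaceTempView(spark_df, "cars_tmp")

    # Optionally persist it as an actual Hive table
    sql("CREATE TABLE IF NOT EXISTS cars_hive AS SELECT * FROM cars_tmp")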


- The in-memory table can be queried from the Spark interpreter (in Scala or Python). The example below is in Scala:

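A sketch of such a Scala paragraph, assuming the hypothetical view name cars_tmp from the previous step (the %spark and %spark.r paragraphs of the same interpreter group share one SparkSession, which is what makes the view visible here):

    %spark
    // Query the temporary view that was registered from SparkR
    val carsDF = spark.sql("SELECT * FROM cars_tmp WHERE mpg > 20")
    carsDF.show()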

- Finally, the data frame can be written to S3 as CSV, Avro, or Parquet.

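A sketch of the write using SparkR's write.df(); the bucket and path are placeholders, and the source can be swapped for "csv" or an Avro data source if the spark-avro package is available:

    %spark.r
    # Write the SparkR data frame to S3 as Parquet
    write.df(spark_df,
             path   = "s3://<your-bucket>/cars_parquet",
             source = "parquet",
             mode   = "overwrite")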

The same data files can be read back into a data frame using the Spark session (e.g., spark.read.csv/avro/parquet("s3://<....>")).
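From SparkR specifically, the same output could be read back with read.df(); the path below is again a placeholder:

    %spark.r
    # Read the files written above back into a SparkR data frame
    df_back <- read.df("s3://<your-bucket>/cars_parquet", source = "parquet")
    head(df_back)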

