In general, you can write a Spark DataFrame to S3 as CSV, Parquet, or Avro files using the DataFrame writer (df.write...). In the same way, you can also write a SparkR data frame to S3, thanks to the distributed jobs run by the SparkR (%spark.r, %r) interpreters.
The process of writing to S3 is simple:
-> Read the data as an R data frame using the SQLContext, or read it directly as a SparkR DataFrame. (Note that a plain R data frame has to be converted into a SparkR DataFrame before it can be written.)
-> Use the DataFrame write functions (e.g., write.df in SparkR) to save it as a CSV, Avro, or Parquet file, as sketched below.
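For illustration, here is a minimal SparkR sketch of that two-step flow, using the newer sparkR.session() API rather than the SQLContext mentioned above; the bucket paths, file names, and column options are placeholders, not values from the original setup.

```r
# Minimal SparkR sketch (run under the %spark.r / %r interpreter).
# Bucket names and paths below are placeholders.
library(SparkR)

# In Zeppelin the Spark session usually already exists; otherwise start one.
sparkR.session()

# Read the data directly as a SparkR DataFrame (CSV used here as an example).
sdf <- read.df("s3://my-bucket/input/data.csv",
               source = "csv", header = "true", inferSchema = "true")

# If you start from a plain R data frame instead, convert it first:
# sdf <- as.DataFrame(my_r_data_frame)

# Write it back to S3 as Parquet (use source = "csv" or "avro" as needed;
# Avro requires the external spark-avro package on the classpath).
write.df(sdf, path = "s3://my-bucket/output/data_parquet",
         source = "parquet", mode = "overwrite")
```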
- Create an R data frame from one of R's built-in datasets.
- Convert it into a SparkR DataFrame.
- These data frames can be registered as in-memory (temporary) views, and from there they can also be saved as actual Hive tables.
- The in-memory view can then be queried from the Spark interpreter (in Scala or Python); a short Scala example follows the sketch below.
- Finally, the DataFrame can be written to S3 as CSV, Avro, or Parquet.
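Putting those steps together, a hedged SparkR sketch could look like the following. The dataset (faithful), the view and table names, and the S3 output path are all illustrative placeholders; saving a Hive table also assumes the Spark session has Hive support enabled (the SparkR default).

```r
library(SparkR)
sparkR.session()   # in Zeppelin this session typically already exists

# Create an R data frame from a built-in dataset.
rdf <- faithful    # R's built-in "Old Faithful" data set

# Convert it into a SparkR DataFrame.
sdf <- as.DataFrame(rdf)

# Register it as an in-memory (temporary) view ...
createOrReplaceTempView(sdf, "faithful_view")

# ... or persist it as an actual Hive table (needs a Hive metastore).
saveAsTable(sdf, "faithful_hive_table", source = "parquet", mode = "overwrite")

# Finally, write the DataFrame to S3 (CSV here; Parquet/Avro work the same way).
write.df(sdf, path = "s3://my-bucket/output/faithful_csv",
         source = "csv", mode = "overwrite", header = "true")
```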
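And a short Scala example of querying the in-memory view from the %spark interpreter, assuming (as the steps above imply) that the Scala and R interpreters share the same SparkSession in Zeppelin; the view name matches the placeholder used in the R sketch.

```scala
// Run under the %spark (Scala) interpreter; it shares the SparkSession
// with %spark.r, so the temporary view registered above is visible here.
val df = spark.sql("SELECT * FROM faithful_view")
df.show(10)

// The same DataFrame could also be written to S3 from Scala, e.g.:
// df.write.mode("overwrite").parquet("s3://my-bucket/output/faithful_parquet")
```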
The same data files can later be read back into a DataFrame through the Spark session (e.g., spark.read.csv/avro/parquet("s3://<....>")); the SparkR equivalent is shown below.
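For completeness, here is the SparkR counterpart of that read-back, again with a placeholder path matching the earlier sketch:

```r
# Read the files written above back into a SparkR DataFrame; the path is a
# placeholder and mirrors the spark.read.parquet(...) call in Scala/Python.
sdf_back <- read.df("s3://my-bucket/output/faithful_parquet", source = "parquet")
head(sdf_back)
```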