IBM Data Science Experience

Read and Write Data To and From Amazon S3 Buckets in RStudio

In this article, you will learn how to bring data into RStudio on DSX from Amazon S3 and write data from RStudio back into Amazon S3, using 'sparklyr' to work with Spark and 'aws.s3' to work with local R objects.


Using Sparklyr to work with Spark

The first thing you want to do is connect to the Spark service using sparklyr's spark_connect function. You can refer to this post for more detail on connecting.
  
#connect to spark
library(sparklyr)  
library(dplyr)  
sc <- spark_connect(config = "Apache Spark-ic")  
  
Get the Java context from the Spark context to set the S3a credentials needed to connect to the S3 bucket.
#Get spark context  
ctx <- sparklyr::spark_context(sc)

#Use below to get the Java Spark context
jsc <- invoke_static(  
  sc,
  "org.apache.spark.api.java.JavaSparkContext",
  "fromSparkContext",
  ctx
)
Now replace the placeholders below with the access key and secret key generated for your AWS account.
#set the s3 configs:  
hconf <- jsc %>% invoke("hadoopConfiguration")  
hconf %>% invoke("set","fs.s3a.access.key", "<put-your-access-key>")  
hconf %>% invoke("set","fs.s3a.secret.key", "<put-your-secret-key>")  
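As a quick sanity check, you can read a property back from the Hadoop configuration; invoke("get", ...) mirrors the invoke("set", ...) calls above. This is just a sketch and requires the live Spark connection and hconf object created in the steps above:

```r
# Read a property back from the Hadoop configuration (sanity check).
# Assumes the sc and hconf objects created in the steps above.
hconf %>% invoke("get", "fs.s3a.access.key")
```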
Let's use spark_read_csv to read from the Amazon S3 bucket into the Spark context in RStudio. The first argument is the Spark connection we created. The second argument is the name under which you can refer to the table within Spark. The third is the path to your S3 bucket. You can additionally specify repartition to parallelize reads across a given number of partitions.
  
#Let's try to read using the sparklyr package
usercsv_tbl <- spark_read_csv(sc,name = "usercsvtlb",path = "s3a://charlesbuckets31/FolderA/users.csv")  
Use src_tbls to see whether the table was read into Spark. Use head to check the data frame.
src_tbls(sc)  
head(usercsv_tbl,4)  
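Because sparklyr registers the table as a dplyr source (dplyr is loaded at the top), you can also push simple transformations down to Spark before collecting the result into R. A minimal sketch, assuming the CSV has an age column (the column name here is hypothetical):

```r
# Hypothetical dplyr pipeline on the Spark table; "age" is an assumed column.
usercsv_tbl %>%
  filter(age > 30) %>%  # filtering happens in Spark, not in R
  head(10) %>%
  collect()             # bring only the small result back into R
```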
Likewise, you can read Parquet files as well.
  
usercsv_tbl <- spark_read_parquet(sc,name = "usertbl",path = "s3a://charlesbuckets31/FolderB/users.parquet")  
src_tbls(sc)  
You can also write the data frame back to the S3 bucket using spark_write_csv.
#Write back into Amazon S3  
sparklyr::spark_write_csv(usercsv_tbl,path = "s3a://charlesbuckets31/FolderA/usersOutput.csv")  
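If you prefer Parquet output, sparklyr provides the analogous spark_write_parquet. A minimal sketch against the same bucket (the output path is hypothetical):

```r
# Write the same Spark table back to S3 in Parquet format.
sparklyr::spark_write_parquet(usercsv_tbl,
  path = "s3a://charlesbuckets31/FolderA/usersOutput.parquet")
```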

Using 'aws.s3' to work with local R objects

First, install the 'aws.s3' package and load it.
  
install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat"))  
library("aws.s3")  
The 'aws.s3' package needs the AWS access key and AWS secret key added to the environment. Replace the placeholders below with yours.
Sys.setenv("AWS_ACCESS_KEY_ID" = "<PUT-ACCESS-KEY>","AWS_SECRET_ACCESS_KEY" = "<PUT-SECRET-KEY>")  
Now, to read an object into R, use 'get_object' and specify your S3 path as shown below.
usercsvobj <-get_object("s3://charlesbuckets31/FolderA/users.csv")  
Since get_object returns a raw object, you need to do further processing to convert the raw object to your desired type of object (a data frame), depending on the type of file you are reading.
csvcharobj <- rawToChar(usercsvobj)  
con <- textConnection(csvcharobj)  
data <- read.csv(con)  
close(con)  
data  
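Going the other way, uploading a local R object, works by serializing it to a file or raw vector first and then calling aws.s3's put_object. A minimal sketch (the bucket and object names are hypothetical, and the put_object call itself requires valid AWS keys, so it is left commented out); the raw-vector round trip mirrors the get_object conversion above:

```r
# Serialize a local data frame to CSV and load it back as a raw vector,
# the reverse of the get_object -> rawToChar conversion above.
df <- data.frame(id = 1:3, name = c("a", "b", "c"))
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)
rawobj <- readBin(tmp, "raw", n = file.size(tmp))

# With credentials set, the file could then be uploaded (hypothetical names):
# put_object(file = tmp, object = "FolderA/usersOutput.csv",
#            bucket = "charlesbuckets31")
```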

For more details, see the 'aws.s3' package documentation.

Charles Gomes

