In this article, you will learn how to bring data from Amazon S3 into RStudio on DSX and write data from RStudio back to Amazon S3, using 'sparklyr' to work with Spark and 'aws.s3' to work with local R objects.
Using Sparklyr to work with Spark
First, connect to the Spark service using sparklyr's spark_connect function. You can refer to this post for details.
# Connect to Spark
library(sparklyr)
library(dplyr)
sc <- spark_connect(config = "Apache Spark-ic")
Get the Java Spark context from the Spark context. It is needed to set the S3A credentials used to connect to the S3 bucket.
# Get the Spark context
ctx <- sparklyr::spark_context(sc)
# Wrap it in a Java Spark context
jsc <- invoke_static(
  sc,
  "org.apache.spark.api.java.JavaSparkContext",
  "fromSparkContext",
  ctx
)
Now replace the placeholders below with the access key and secret key generated for your AWS account.
# Set the S3A credentials on the Hadoop configuration
hconf <- jsc %>% invoke("hadoopConfiguration")
hconf %>% invoke("set", "fs.s3a.access.key", "<put-your-access-key>")
hconf %>% invoke("set", "fs.s3a.secret.key", "<put-your-secret-key>")
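If you would rather not type the keys into the script, one option is to read them from environment variables first. The sketch below assumes the keys are stored in AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; read_aws_keys is a hypothetical helper written for this example, not a sparklyr function.

```r
# Hypothetical helper (not part of sparklyr): read AWS keys from environment
# variables so they do not have to be hardcoded in the script.
read_aws_keys <- function(access_var = "AWS_ACCESS_KEY_ID",
                          secret_var = "AWS_SECRET_ACCESS_KEY") {
  keys <- list(
    access = Sys.getenv(access_var, unset = ""),
    secret = Sys.getenv(secret_var, unset = "")
  )
  # Collect the names of any variables that are empty or unset
  missing <- c(access_var, secret_var)[!nzchar(unlist(keys))]
  if (length(missing) > 0) {
    stop("Missing environment variable(s): ", paste(missing, collapse = ", "))
  }
  keys
}

# Usage with the hconf object from the step above (needs a live Spark connection):
# keys <- read_aws_keys()
# hconf %>% invoke("set", "fs.s3a.access.key", keys$access)
# hconf %>% invoke("set", "fs.s3a.secret.key", keys$secret)
```

The helper stops with an error when a variable is missing, so a misconfigured session fails before any S3 read is attempted.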
Let's use spark_read_csv to read from the Amazon S3 bucket into Spark from RStudio. The first argument is the Spark connection. The second argument is the name you can use to refer to the table within Spark. The third is the path to the file in your S3 bucket. You can additionally set repartition to a number of partitions to parallelize reads.
# Read the CSV file with sparklyr
usercsv_tbl <- spark_read_csv(sc, name = "usercsvtlb", path = "s3a://charlesbuckets31/FolderA/users.csv")
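The repartition argument mentioned above could be used as in the sketch below. s3a_uri is a hypothetical helper introduced here only to assemble the path, and the value 4 is an arbitrary illustration rather than a recommendation.

```r
# Hypothetical helper: build an s3a:// URI from a bucket name and an object key.
s3a_uri <- function(bucket, key) {
  key <- sub("^/+", "", key)  # drop leading slashes so there is exactly one separator
  paste0("s3a://", bucket, "/", key)
}

users_path <- s3a_uri("charlesbuckets31", "FolderA/users.csv")
print(users_path)

# With a live Spark connection `sc`, the read could then look like this
# (repartition = 4 is an assumed value for illustration):
# usercsv_tbl <- spark_read_csv(sc, name = "usercsvtlb",
#                               path = users_path,
#                               repartition = 4)
```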
Use src_tbls to check that the table was read into Spark. Use head to view the first rows of the data frame.
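In an interactive session these checks are simply src_tbls(sc) and head(usercsv_tbl). The sketch below wraps both in one function; preview_spark_table is a hypothetical helper written for this example, and it assumes the dplyr package is installed.

```r
# Hypothetical helper (not part of sparklyr or dplyr): check that a table name
# is registered in the Spark connection, then return its first rows.
preview_spark_table <- function(sc, tbl, table_name, n = 6) {
  registered <- dplyr::src_tbls(sc)  # names of the tables known to the connection
  if (!(table_name %in% registered)) {
    stop("Table '", table_name, "' is not registered in Spark")
  }
  utils::head(tbl, n)                # first n rows of the table
}

# Usage with the objects created earlier (needs a live Spark connection):
# preview_spark_table(sc, usercsv_tbl, "usercsvtlb")
```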
Likewise, you can read Parquet files.
# s3a:// is used here so that the fs.s3a.* credentials set earlier apply
usercsv_tbl <- spark_read_parquet(sc, name = "usertbl", path = "s3a://charlesbuckets31/FolderB/users.parquet")
src_tbls(sc)
You can also write the data frame back to the S3 bucket using spark_write_csv. Spark writes the output as a directory of part files at the given path.
# Write back to Amazon S3
sparklyr::spark_write_csv(usercsv_tbl, path = "s3a://charlesbuckets31/FolderA/usersOutput.csv")
Using 'aws.s3' to work with local R
First, install the 'aws.s3' package and load it.
install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat"))
library("aws.s3")
The 'aws.s3' package needs your AWS access key and AWS secret key set as environment variables. Replace the placeholders below with yours.
Sys.setenv("AWS_ACCESS_KEY_ID" = "<PUT-ACCESS-KEY>","AWS_SECRET_ACCESS_KEY" = "<PUT-SECRET-KEY>")
Now, to read the object into R, use 'get_object' and specify your S3 path as shown below.
Since get_object returns a raw object, you need further processing to convert it to your desired type of object (such as a data frame), depending on the type of file you are reading.
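# Fetch the object as a raw vector (path assumed to be the CSV file read earlier)
usercsvobj <- get_object("s3://charlesbuckets31/FolderA/users.csv")
# Convert the raw vector to text and parse it as CSV
csvcharobj <- rawToChar(usercsvobj)
con <- textConnection(csvcharobj)
data <- read.csv(con)
close(con)
data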
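Writing a local R object back to S3 can follow the reverse path. The sketch below is one possible approach, assuming put_object from 'aws.s3' is given a local file path; df_to_temp_csv and upload_df_as_csv are hypothetical helpers, and the object key and bucket in the example call are placeholders. The upload itself is not run here; only the CSV step is exercised offline.

```r
# Hypothetical helper: write a data frame to a temporary CSV file and return its path.
df_to_temp_csv <- function(df) {
  path <- tempfile(fileext = ".csv")
  write.csv(df, path, row.names = FALSE)
  path
}

# Hypothetical helper: upload a data frame as a CSV object.
# Needs the 'aws.s3' package, valid credentials, and a bucket you can write to.
upload_df_as_csv <- function(df, object, bucket) {
  path <- df_to_temp_csv(df)
  on.exit(unlink(path))  # remove the temporary file when the function exits
  aws.s3::put_object(file = path, object = object, bucket = bucket)
}

# Offline check of the CSV step: read the file back as raw bytes, the same
# form get_object() returns, and parse it into a data frame.
users <- data.frame(id = 1:2, name = c("ann", "bob"), stringsAsFactors = FALSE)
path <- df_to_temp_csv(users)
raw_bytes <- readBin(path, what = "raw", n = file.size(path))
roundtrip <- read.csv(text = rawToChar(raw_bytes), stringsAsFactors = FALSE)
unlink(path)

# Example call (placeholders, not run here):
# upload_df_as_csv(users, object = "FolderA/usersFromR.csv", bucket = "<your-bucket>")
```

Passing the text to read.csv through its text argument avoids opening and closing a textConnection by hand.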
For more details, refer to the 'aws.s3' package documentation.