Calling Spark in R

Spark version: 2.1.0
R version: 3.2.5
RStudio version: 1.0.136

Spark supports R through the SparkR package. There are two ways to invoke SparkR. One is to run the sparkR shell command: sparkR starts a Spark context and launches an R command line, where you can enter R commands and Spark commands (in a slightly different syntax). The other is to run R or RStudio and load the SparkR library there; with the SparkR library loaded, you can run Spark commands from a regular R session.

This short demo shows both ways. We will have more blog posts that do more in-depth analytics using functions from both Spark and R.

Method 1: Use sparkR shell

The executable sparkR is under the $SPARK_HOME/bin folder when you install the Spark 2.1.0 binaries. The following lines should also be present in your .bashrc file if you followed the prior steps. If not, you can add them now.

  • export SPARK_HOME=/opt/app/spark-2.1.0
  • export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin


Next, run sparkR from a terminal window.

Note the difference between the sparkR shell and the Spark Scala shell (the default) we have seen previously. The ">" prompt is the typical R prompt.

In the SparkR shell, you can run R commands. Some Spark commands have a corresponding R syntax when executed in the R shell. In this example, we load a JSON file and create a DataFrame "people". In the Spark Scala shell, we would call people.count() to get the count; here we use count(people). The same applies to other DataFrame methods, such as collect and printSchema. Also note that the R way to assign the output of a function to a variable is "<-".

> sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
Java ref type org.apache.spark.sql.SparkSession id 1
> people <- read.df("data/spark-examples/people.json", "json")
> printSchema(people)
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
> collect(people)
age name
1 NA Michael
2 30 Andy
3 19 Justin
> count(people)
[1] 3
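
To push the comparison a bit further, here is a short sketch of a few more DataFrame operations in the same function-call style. These are my own illustrative additions, not part of the original transcript; the column names age and name come from people.json above.

# select a single column (people.select("name") in Scala becomes select(people, "name") in R)
head(select(people, "name"))

# filter rows on a condition (people.filter($"age" > 20) in Scala)
adults <- filter(people, people$age > 20)
collect(adults)

# aggregate over the whole DataFrame: average age
collect(agg(people, avg_age = avg(people$age)))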

Method 2: Use RStudio

If you followed the prior post to install RStudio, you should be able to run rstudio to launch it. To load the SparkR library you need to set up the R environment properly. The following R code sets the SPARK_HOME environment variable and loads the library from that specific path. This code can be executed in RStudio.


if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = "/opt/app/spark-2.1.0")
}
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
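
As an optional sanity check (my addition, using base R functions only), you can confirm that SparkR was attached and where it was loaded from:

# TRUE if SparkR is among the attached packages
"SparkR" %in% (.packages())
# path the package was found at under $SPARK_HOME/R/lib
find.package("SparkR", lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))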

This command starts a SparkSession connected to a local Spark master (local[*], using all available cores) with 2 GB of driver memory.

sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
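
If you later want different settings (say, more driver memory), one approach, sketched below, is to stop the current session and create a new one. sparkR.session.stop() is the SparkR 2.x call for ending a session; the 4g value is only an illustration.

# stop the running session, then start a fresh one with new config
sparkR.session.stop()
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "4g"))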

I executed the same set of commands to load a JSON file and explored some commands on the DataFrame. It is not much different from the first method. See the screenshot and the commands below.

sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
people <- read.df("data/spark-examples/people.json", "json")
head(people)
printSchema(people)
show(people)
collect(people)
take(people, 1)
count(people)
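
Besides the DataFrame functions above, SparkR can also query a DataFrame with SQL. The sketch below is my own addition (the view name and filter condition are illustrative): it registers people as a temporary view and runs a query against it.

# register the DataFrame so it is visible to SQL
createOrReplaceTempView(people, "people")

# run a SQL query and pull the result back as a local R data.frame
teenagers <- sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
collect(teenagers)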