To test compute-intensive tasks in Spark, the Pi example calculates pi by “throwing darts” at a circle: it generates random points in the unit square ((0,0) to (1,1)) and counts how many of them fall inside the unit circle. The proportion of points that land inside the circle is used to approximate pi.
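The approach can be sketched in a few lines of Spark code. The following is a minimal illustration, not the exact source of the bundled SparkPi class; it assumes an existing SparkContext named sc (such as the one created by spark-shell) and uses 100,000 sample points:
// Illustrative Monte Carlo estimate of pi; assumes an existing SparkContext `sc`
val n = 100000
val inside = sc.parallelize(1 to n).map { _ =>
  val x = math.random              // random point in the unit square
  val y = math.random
  if (x * x + y * y < 1) 1 else 0  // 1 if the point lands inside the unit circle
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * inside / n)
Because the quarter of the unit circle that lies inside the square has area pi/4, multiplying the fraction of points that land inside the circle by 4 yields the estimate.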
To run Spark Pi:
Log on as a user with HDFS access; for example, your spark user (if you defined one) or hdfs. Navigate to a node with a Spark client and access the spark-client directory:
cd /usr/hdp/current/spark-client
su spark
Run the Spark Pi job in yarn-client mode:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
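The trailing 10 is an application argument passed to SparkPi. The bundled example uses it as the number of partitions ("slices") across which sample points are generated, so a larger value produces more samples and typically a closer approximation of pi.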
Commonly-used options include:
--class: The entry point for your application (for example, org.apache.spark.examples.SparkPi).
--master: The master URL for the cluster (for example, spark://23.195.26.187:7077).
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client). Default: client.
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes (as shown).
<application-jar>: Path to a bundled jar that includes your application and all dependencies. The URL must be globally visible inside your cluster; for instance, an hdfs:// path or a file:// path that is present on all nodes.
<application-arguments>: Arguments passed to the main method of your main class, if any.
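As an illustration of how these options fit together, a submission in cluster deploy mode might look like the following; the master URL, configuration property value, and jar path are placeholders to be replaced with values for your own cluster:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://23.195.26.187:7077 --deploy-mode cluster --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" lib/spark-examples*.jar 10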
The job should complete without errors.
It should produce output similar to the following. Note the value of pi near the end of the output.
15/11/10 14:28:35 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:36, took 1.721177 s
Pi is roughly 3.141296
15/11/10 14:28:35 INFO spark.ContextCleaner: Cleaned accumulator 1
To view job status in a browser, navigate to the YARN ResourceManager Web UI and view Job History Server information.
WordCount is a simple program that counts how often a word occurs in a text file.
Select an input file for the Spark WordCount example. You can use any text file as input.
Log on as a user with HDFS access; for example, your spark user (if you defined one) or hdfs. The following example uses log4j.properties as the input file:
cd /usr/hdp/current/spark-client/
su spark
Upload the input file to HDFS:
hadoop fs -copyFromLocal /etc/hadoop/conf/log4j.properties /tmp/data
Run the Spark shell:
./bin/spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
You should see output similar to the following:
bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
15/12/15 16:28:09 INFO SparkContext: Running Spark version 1.5.2
15/12/15 16:28:09 INFO SecurityManager: Changing view acls to: root
...
...
15/12/15 16:28:14 INFO SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.

scala>
At the scala> prompt, type the following commands to submit the job, replacing node names, file names, and file locations with your own values:
val file = sc.textFile("/tmp/data")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("/tmp/wordcount")
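In this sequence, sc.textFile reads the input file into an RDD of lines, flatMap splits each line into words, map pairs each word with a count of 1, reduceByKey adds up the counts for each distinct word, and saveAsTextFile writes the resulting (word, count) pairs to the /tmp/wordcount directory in HDFS.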
To view WordCount output in the scala shell:
scala> counts.count()
To view the full output from within the scala shell:
counts.toArray().foreach(println)
To view the output using HDFS:
Exit the scala shell.
View WordCount job results:
hadoop fs -ls /tmp/wordcount
You should see output similar to the following:
/tmp/wordcount/_SUCCESS
/tmp/wordcount/part-00000
/tmp/wordcount/part-00001
Use the HDFS cat command to list WordCount output. For example:
hadoop fs -cat /tmp/wordcount/part-00000