When tuning Spark applications, it is important to understand how Spark works and what types of resources your application requires. For example, machine learning tasks are usually CPU intensive, whereas extract-transform-load (ETL) operations are I/O intensive.
General performance guidelines:
Minimize shuffle operations where possible.
Match the join strategy (ShuffledHashJoin vs. BroadcastHashJoin) to the size of the tables being joined. This requires manual configuration.
Consider switching from the default serializer to the Kryo serializer to improve performance. This requires manual configuration and class registration.
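The sketch below shows how the last two guidelines might be applied: it enables the Kryo serializer with class registration and sets the broadcast join threshold that lets Spark choose BroadcastHashJoin for small tables. The `Sale` class and the 10 MB threshold are illustrative assumptions, not recommended values.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative application class to register with Kryo (an assumption).
case class Sale(itemId: Long, amount: Double)

val conf = new SparkConf()
  .setAppName("TuningExample")
  // Switch from the default Java serializer to Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Register application classes so Kryo can avoid writing full class names.
  .registerKryoClasses(Array(classOf[Sale]))
  // Tables smaller than this threshold (in bytes) can be broadcast to every
  // executor, enabling BroadcastHashJoin instead of ShuffledHashJoin.
  // The 10 MB value is illustrative only.
  .set("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)

val sc = new SparkContext(conf)
```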
Note: For information about known issues and workarounds related to Spark, see the "Known Issues" section of the HDP Release Notes.
For general information about Spark memory use, including node distribution, local disk, memory, network, and CPU core recommendations, see the Apache Spark Hardware Provisioning document.
When you run a Spark job, you will see a standard set of console messages.
If a job takes longer than expected or does not complete successfully, check the following resources to understand more about what the job was doing and where time was spent.
Using Ambari: In the Ambari Services tab, select Spark (in the left column). Click Quick Links and choose the Spark History Server UI. Ambari displays a list of jobs; click "App ID" for job details. (By default, the Spark History Server is at <host>:18080.)

Using the YARN Web UI at http://<host>:8088/proxy/<job_id>/environment/, view job history and time spent in the various stages of the job: http://<host>:8088/proxy/<app_id>/stages/

From the command line, list running applications (including the application ID):

yarn application -list

From the command line, check the application log:

yarn logs -applicationId <app_id>

The yarn logs command prints the contents of all log files from all containers associated with the specified application. You can also view container log files using the HDFS shell or API. For more information, see "Debugging your Application" in the Apache document Running Spark on YARN.

Use toDebugString() on an RDD to see the list of RDDs that will be executed. This is useful for understanding how jobs will be executed.

Use DataFrame#explain() to check the query plan if you are using the DataFrame API.
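For example, the following spark-shell snippet prints an RDD lineage with toDebugString() and a DataFrame query plan with explain(); the HDFS input paths are hypothetical.

```scala
// Run in spark-shell, where "sc" and "sqlContext" are provided automatically.
// The HDFS paths below are hypothetical.
val counts = sc.textFile("hdfs:///tmp/input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Print the RDD lineage: the RDDs that will be computed and where shuffles occur.
println(counts.toDebugString)

// For the DataFrame API, explain() prints the physical plan,
// including the join strategy that will be used.
val people = sqlContext.read.json("hdfs:///tmp/people.json")
people.filter(people("age") > 21).explain()
```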
This section describes how to determine memory allocation for a JVM running the Spark executor.
To avoid memory issues, Spark uses 90% of the JVM heap by default. This percentage is
controlled by spark.storage.safetyFraction.
Of this 90% of JVM allocation, Spark reserves memory for three purposes:
Storing in-memory shuffle, 20% by default (controlled by spark.shuffle.memoryFraction)

Unroll, used to serialize/deserialize Spark objects to disk when they don't fit in memory, 20% by default (controlled by spark.storage.unrollFraction)

Storing RDDs: 60% by default (controlled by spark.storage.memoryFraction)
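If you change these defaults, a minimal sketch of adjusting the fractions through SparkConf might look like the following; the values shown are simply the defaults listed above, included to illustrate the property names.

```scala
import org.apache.spark.SparkConf

// Values shown are the defaults described above; adjust as needed.
val conf = new SparkConf()
  .set("spark.storage.safetyFraction", "0.9")  // portion of the heap Spark uses
  .set("spark.shuffle.memoryFraction", "0.2")  // in-memory shuffle
  .set("spark.storage.unrollFraction", "0.2")  // unrolling serialized objects
  .set("spark.storage.memoryFraction", "0.6")  // RDD storage
```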
Example
If the JVM heap is 4GB, the total memory available for RDD storage is calculated as:
4 GB x 0.9 x 0.6 = 2.16 GB
Therefore, with the default configuration approximately one half of the Executor JVM heap is used for storing RDDs.
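The same arithmetic can be expressed as a small sketch, assuming a 4 GB heap and the default fractions described above:

```scala
// Assumed values: 4 GB executor heap and the default fractions above.
val jvmHeapGb       = 4.0
val safetyFraction  = 0.9  // spark.storage.safetyFraction
val storageFraction = 0.6  // spark.storage.memoryFraction

val rddStorageGb = jvmHeapGb * safetyFraction * storageFraction
println(f"Memory available for RDD storage: $rddStorageGb%.2f GB")  // 2.16 GB
```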
This section describes how to manually configure YARN memory allocation settings based on node hardware specifications.
YARN takes into account all of the available compute resources on each machine in the cluster, and negotiates resource requests from applications running in the cluster. YARN then provides processing capacity to each application by allocating containers. A container is the basic unit of processing capacity in YARN; it is an encapsulation of resource elements such as memory (RAM) and CPU.
In a Hadoop cluster, it is important to balance the usage of RAM, CPU cores, and disks so that processing is not constrained by any one of these cluster resources.
When determining the appropriate YARN memory configurations for Spark, note the following values on each node:
RAM (Amount of memory)
CORES (Number of CPU cores)
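As an illustrative sketch only (the heuristic and the node specification below are assumptions, not values prescribed by this document), RAM and core counts might be translated into starting YARN memory settings like this:

```scala
// Hypothetical node specification; substitute your own hardware values.
val nodeRamMb  = 64 * 1024  // RAM per node, in MB
val nodeCores  = 16         // CPU cores per node
val reservedMb = 8 * 1024   // assumed reservation for the OS and other services

// One simple heuristic (an assumption): roughly one container per core,
// with the remaining RAM divided evenly among the containers.
val containersPerNode = nodeCores
val containerMemoryMb = (nodeRamMb - reservedMb) / containersPerNode

println(s"yarn.nodemanager.resource.memory-mb = ${nodeRamMb - reservedMb}")
println(s"yarn.scheduler.minimum-allocation-mb = $containerMemoryMb")
```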
Configuring Spark for yarn-cluster Deployment Mode
In yarn-cluster mode, the Spark driver runs inside an application master
process that is managed by YARN on the cluster. The client can stop after initiating the
application.
The following command starts a YARN client in yarn-cluster mode. The client starts the default Application Master, and SparkPi runs as a child thread of the Application Master. The client periodically polls the Application Master for status updates and displays them in the console. The client exits when the application stops running.
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    lib/spark-examples*.jar 10
Configuring Spark for yarn-client Deployment Mode
In yarn-client mode, the driver runs in the client process. The application master is only used to request resources from YARN.
To launch a Spark application in yarn-client mode, replace
yarn-cluster with yarn-client. For example:
./bin/spark-shell --num-executors 32 \
    --executor-memory 24g \
    --master yarn-client
Considerations
When configuring Spark on YARN, consider the following information:
Executor processes are not released until the job finishes, even if they are no longer in use. Therefore, do not request more executors than your estimated requirements.
Driver memory does not need to be large if the job does not aggregate much data (as with a collect() action).

There are tradeoffs between num-executors and executor-memory. Large executor memory does not imply better performance, due to JVM garbage collection. Sometimes it is better to configure a larger number of small JVMs than a small number of large JVMs, as in the sketch below.
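For example, the two configurations sketched below request roughly the same total executor memory, one as many small executors and one as a few large ones; the specific sizes are illustrative assumptions.

```scala
import org.apache.spark.SparkConf

// Many small executors: 24 executors x 4 GB = 96 GB total.
val manySmallExecutors = new SparkConf()
  .set("spark.executor.instances", "24")
  .set("spark.executor.memory", "4g")

// Few large executors: 4 executors x 24 GB = 96 GB total.
// Larger heaps can suffer from longer JVM garbage collection pauses.
val fewLargeExecutors = new SparkConf()
  .set("spark.executor.instances", "4")
  .set("spark.executor.memory", "24g")
```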
