Apache Spark Cluster Performance Tuning Tips

If you are running Spark cluster for your organization there are certain configurations that needs to be considered for better performance of your spark jobs. I would like to discuss those configurations in this blog.

When we talk about Spark jobs performance the first thing that comes in mind is YARN. And also the number of executors, cores, executors memory, shuffle partitions and all depends on complexity of the spark job. The size of the data it processes, size of the cluster. The following configurations are important to tune the spark cluster for better performance.

"spark.sql.shuffle.partitions" :- This is the total number of partitions that the data is split into when executing any actions that require a shuffle across the network. The best solutions is

spark.sql.shuffle.partitions = Incoming Data / 250 MB

"num-executors" :- The total number of executots allocated for the job during thr runtime. You may think that increasing the number of executors will increase the speed. But it is not always correct. This may cause the degrade of performance of your spark job. There might be chances that the same cluster will be used by multiple teams. So be cautious to increase the number of executors.

"executor-cores" :- The number of cores that are allocated for spark job per executor during runtime. We can configure it usually 2-6 cores. Again it depends on what kind of operations you are executing in the job. Always starts with small number, test it and increase the cores if you want.

"spark.yarn.executor.memoryOverhead" :- The amount of overhead memory allocated for every executor spinned up at runtime. If you observe more executors are dropping the connection during the execution of the spark job or you see the error like YARN has killed the executor for exceeding the memory limit, then try to increase the overhead executor memory to between 4-8 GB.

"spark.yarn.driver.memoryOverhead":- This is the amount of memory reserved on the driver for managing any overhead costs that are associated with spark job, just same as executor memory overhead. The more complex spark job is and the more executors you need to spin up for spark job, obviously you need to increase the driver memory overhead.

"driver-memory":- This is the amount of memory that is required to run main execution of spark job. Usually driver memory does not need to be large unless you collect the data back to driver or if you do huge calculations outside the spark job. If you see YARN often killing your driver for exceeding memory limits, the first solution is increasing the memory to 6GB. and then continuing to increase by 2GB, if you still continue to get failures.

"executor-memory":- The total amount of memory allocated to each executor at runtime for spark job same as driver memory. Start small increasing by 6GB of allocated memory per executor and increase from there if needed.

Apart from above configurations there are lot of things that you can take care as a developer in spark programs to reduce the size of the data that transferred through throughout the program and the size of the data that will be shuffled in the job. We can consider the following tricks to optimise the spark job for better performance.

Apply Select / Filter in the very early of the program: There is no point if you allow all huge data inside the program when you are not going to use all the data. It is not efficient. When you know what parts of the data is needed for your program initially, then select or filter the data you need. This will remove unnecessary overhead on Spark job.

When writing larger tables out to HDFS, partition by join/aggregation keys. If the incoming data has already been partitioned by whatever columns you will be using for joining dataframes or calculating aggregations, Spark will skip the expensive shuffle operation. The more reads to be performed downstream on the data, the more time you will save in skipped shuffles. In addition, partitioning by columns that will often be used for filtering data will allow Spark to utilise partition pruning to quickly filter out any unnecessary data. The new Delta Lake feature implemented by Databricks eliminates this problem efficiently.

When joining very large datasets with small datasets, use broadcast joins to cut down on unnecessary network shuffling and, in turn, boost the speed of your program. Broadcast joins are a special type of joins that work by taking the smaller dataframe and copying it to every executor. By doing this, Spark no longer needs to shuffle the data in both dataframes across the network, because each executor can simply use its own (full) copy of the smaller dataframe to join with its piece of the larger dataset.

Techtter

Wednesday, October 23, 2019

Apache Spark Cluster Performance Tuning Tips

0 comments:

Post a Comment

Popular Posts

Labels

Facebook

Adress/Street

Phone number

Website