Recently, AWS announced support for instances running AMD EPYC processors. While the new instances are 10 percent cheaper than comparable Intel-based instances, overall cost and performance are workload dependent. As the AWS announcement notes:
“We recommend that you measure performance and cost on your own workloads when choosing your instance types.”
This raises an obvious question: how do these instance types fare for the big data workloads that our customers run? To answer this question, we compared the performance of AMD and Intel instances using two sets of benchmarks common in the big data space:
- TeraGen and TeraSort for ETL workloads
- TPCDS for Data-Warehousing workloads
The rest of the post covers the benchmark setup, the results, and our conclusions.
All benchmarks were conducted using Apache Hadoop, Hive, and Spark clusters available on Qubole in AWS.
For our comparisons, we selected the r5 series and the 4xlarge instance size, since these are very popular across our user base. Below, we review some pertinent resource configurations for r5a.4xlarge and the comparable r5.4xlarge instance types.
To minimize variables, we arranged each of the clusters with the following hardware configuration for our analysis:
- 1x Master Node: r5.4xlarge/r5a.4xlarge
- 4x Worker Nodes: r5.4xlarge/r5a.4xlarge (static hardware, no autoscaling)
- 1x SSD(gp2) EBS volume attached to each node (these instance types are EBS-only)
All benchmarks were run on clusters running Apache Hadoop 2.6, as available on the Qubole platform, which uses Apache Hadoop YARN as the underlying container scheduler.
- No proprietary Qubole features (such as Workload-Aware Autoscaling or Spot Instance node support) were used.
- No changes to default software configurations were applied
In terms of benchmarking methodology, we ran each benchmark three times and took the best of the three runtimes. The cost calculation formula is given under Cost Calculations (Appendix).
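The best-of-three rule and a cost comparison can be sketched in Python as follows. The formula this post actually uses is the one in the Appendix; the model below (hours run × node count × hourly node price, ignoring EBS charges) is only an assumption for illustration, and the runtimes and prices in the usage section are placeholders, not measured results.

```python
def best_runtime(runtimes_sec):
    """Return the fastest (minimum) runtime among repeated runs."""
    if not runtimes_sec:
        raise ValueError("at least one runtime is required")
    return min(runtimes_sec)


def cluster_cost(runtime_sec, node_count, price_per_node_hour):
    """Assumed cost model: hours run x number of nodes x hourly node price.

    EBS volume charges are deliberately left out of this sketch.
    """
    return (runtime_sec / 3600.0) * node_count * price_per_node_hour


def percent_cheaper(baseline_cost, candidate_cost):
    """How much cheaper the candidate is than the baseline, in percent."""
    return 100.0 * (baseline_cost - candidate_cost) / baseline_cost


if __name__ == "__main__":
    # Placeholder runtimes (seconds) and hourly prices (USD); not real data.
    runs = {"r5": [1250.0, 1210.0, 1232.0], "r5a": [1260.0, 1225.0, 1240.0]}
    prices = {"r5": 1.00, "r5a": 0.90}
    nodes = 5  # 1 master + 4 workers, as in the setup above

    costs = {
        name: cluster_cost(best_runtime(times), nodes, prices[name])
        for name, times in runs.items()
    }
    print(costs)
    print(percent_cheaper(costs["r5"], costs["r5a"]))
```

Taking the minimum rather than the mean discards runs slowed by transient noise, which is one reasonable reading of "best of three".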
For this section, the following benchmarks were run:
- TeraGen: Generate 200GB of data on S3.
- TeraSort: Sort the 200GB of data generated above in S3 using MapReduce and write the results back to S3.
These benchmarks are I/O intensive and involve reading and writing large amounts of data, which makes them a good proxy for a typical big data ETL pipeline. TeraSort is also quite CPU intensive.
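For readers who want to reproduce a similar run, here is a rough Python sketch of how the two jobs might be launched. TeraGen takes a row count rather than a byte size, and each row is 100 bytes, so 200GB (taken here as 200 × 10^9 bytes, an assumption) works out to 2 billion rows. The jar name and S3 URIs are hypothetical placeholders, and the sketch only builds the argument lists; it does not reflect the exact commands we ran.

```python
TERAGEN_ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(target_bytes):
    """Number of TeraGen rows needed to produce roughly target_bytes."""
    return target_bytes // TERAGEN_ROW_BYTES


def teragen_command(examples_jar, target_bytes, output_uri):
    """Argument list for a TeraGen job (jar path supplied by the caller)."""
    return ["hadoop", "jar", examples_jar, "teragen",
            str(teragen_rows(target_bytes)), output_uri]


def terasort_command(examples_jar, input_uri, output_uri):
    """Argument list for a TeraSort job over previously generated data."""
    return ["hadoop", "jar", examples_jar, "terasort", input_uri, output_uri]


if __name__ == "__main__":
    jar = "hadoop-mapreduce-examples.jar"           # hypothetical path
    generated = "s3://example-bucket/teragen-200g"  # hypothetical URI
    sorted_out = "s3://example-bucket/terasort-200g"  # hypothetical URI

    print(" ".join(teragen_command(jar, 200 * 10**9, generated)))
    print(" ".join(terasort_command(jar, generated, sorted_out)))
```

The argument lists can be passed to `subprocess.run` on a cluster node where the `hadoop` CLI is available.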
Effect of EBS Volume Size
First, we ran these with an EBS volume size of 100GB attached to each worker node (for both r5 and r5a instances). The resulting runtimes are noted below:
Clearly something was off with the r5a (AMD) cluster. Since TeraGen is not a CPU-intensive benchmark, the 25 percent performance differential on TeraGen is hard to explain by CPU differences. At this point we noted that r5a instance types have lower EBS throughput than r5 instance types. To give the instances a fair comparison, we increased the EBS volume size for r5a instances (relative to r5) and used this configuration for our subsequent benchmark runs.
For the final results, we ran the benchmarks with 200GB EBS volumes for r5 instances and with 330GB and 400GB EBS volumes for r5a instances. The 330GB size was based on scaling the EBS volume by a ratio similar to the EBS bandwidth difference between r5 and r5a instance types. To see at what point the r5a instances achieved performance parity, we further increased the EBS size to 400GB.
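The sizing rule described above can be written down as a small calculation. This is a sketch of the idea only: the function name is ours, and the two bandwidth figures in the usage section are assumed values chosen for illustration (they are not stated in this post), so check the AWS documentation for the actual dedicated EBS bandwidth of each instance type.

```python
def scaled_volume_gb(baseline_volume_gb, baseline_bw_mbps, candidate_bw_mbps):
    """Scale a volume size by the ratio of two instances' EBS bandwidths.

    The instance with less EBS bandwidth gets a proportionally larger volume.
    """
    if candidate_bw_mbps <= 0 or baseline_bw_mbps <= 0:
        raise ValueError("bandwidths must be positive")
    return baseline_volume_gb * baseline_bw_mbps / candidate_bw_mbps


if __name__ == "__main__":
    # Assumed EBS bandwidths in Mbps, for illustration only.
    r5_bw_mbps = 3500
    r5a_bw_mbps = 2120
    print(round(scaled_volume_gb(200, r5_bw_mbps, r5a_bw_mbps)))
```

With these assumed bandwidths, a 200GB baseline scales to roughly 330GB.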
Our summary findings for TeraGen/Sort are as follows:
- TeraGen/Sort benchmarks are sensitive to local disk performance
- r5a instances need to be configured with higher EBS volume sizes (as compared with r5 instances) to achieve performance parity
- When correctly configured, r5a instances can offer equivalent performance at a slightly lower cost (seven to ten percent lower)
We do not fully understand why increasing the EBS volume size helps here, beyond the fact that IOPS and throughput scale linearly with EBS volume size. Given that EBS bandwidth is not saturated at these volume sizes, the results suggest that the effective IOPS per GB of EBS volume is lower for r5a instances than for r5 instances.
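To make the volume-size effect concrete, here is a minimal sketch of how baseline IOPS grow with size for a gp2 volume. It assumes AWS's published gp2 rule of 3 IOPS per GiB with a floor of 100 and a ceiling of 16,000, ignores burst credits, and says nothing about instance-level limits, so it does not by itself explain the r5a behavior we observed.

```python
GP2_IOPS_PER_GIB = 3     # assumed from AWS's published gp2 baseline
GP2_MIN_IOPS = 100       # floor for very small volumes
GP2_MAX_IOPS = 16_000    # per-volume ceiling


def gp2_baseline_iops(size_gib):
    """Baseline (non-burst) IOPS for a gp2 volume of the given size."""
    if size_gib <= 0:
        raise ValueError("volume size must be positive")
    return max(GP2_MIN_IOPS, min(GP2_IOPS_PER_GIB * size_gib, GP2_MAX_IOPS))


if __name__ == "__main__":
    # Volume sizes used in this post.
    for size in (100, 200, 330, 400):
        print(size, gp2_baseline_iops(size))
```

Under this rule, moving from a 200GB to a 400GB volume doubles the baseline IOPS available to each node.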
The TPCDS benchmark setup is as follows:
- Scale of Data: 1000GB
- Data format:
- ORC Files stored in S3 for Hive
- PARQUET Files stored in S3 for Spark
- 16 TPCDS queries from different categories.
We used a subset of the queries to speed up the benchmark runs. The selection methodology is described under TPCDS Query Selection (Appendix), and the queries themselves are listed in the Appendix below. The TPCDS benchmark results are divided into two parts: in Part 1 we benchmark Apache Hive, and in Part 2 we benchmark Apache Spark.
Part-1: Apache Hive
In this benchmark, we ran Apache Hive version 2.1 with Apache Tez, as available on the Qubole platform. All queries were submitted via an HS2 (HiveServer2) server. We first ran the benchmark with 100GB EBS volumes for both AMD-based and Intel-based instances, and then with larger EBS volumes for the AMD instances. The results of our experiments are tabulated below:
Part-2: Apache Spark
In this benchmark, we ran Apache Spark 2.3 as available on the Qubole platform. All queries were submitted serially to a single Spark application (provisioned via YARN). The approach is the same as in Part 1. The results of our experiments are tabulated below: