Uber AI, Data / ML, Engineering

Uber’s Highly Scalable and Distributed Shuffle as a Service

July 7, 2022 / Global
Figure 1: Basic Shuffle Operation
Figure 2: Writing Shuffle File
Figure 3: Read Shuffle File
Figure 4: Disk I/O distribution on our fleet
Figure 5: Application failures using external Spark shuffle service
Figure 6: Reverse Map Reduce Paradigm
Figure 7: RSS Architecture
Figure 8: Spark interaction with RSS
Figure 9: RSS Server internals
Figure 10: Disk I/O distribution on our fleet after RSS
Figure 11: Failed container with and without RSS
Figure 12: RSS Reliability
Figure 13: Server load: Shuffle data read per second for two representative servers

Mayank Bansal

Mayank Bansal is a staff engineer on Uber's Big Data team.

Bo Yang

Bo Yang was a Senior Software Engineer II at Uber. Bo worked in big data for more than 10 years at various companies, building large-scale systems including a Kafka-based streaming platform and a Spark-based batch processing service.

Mayur Bhosale

Mayur Bhosale is a Senior Software Engineer on Uber’s Batch Data systems. He has been working on making Uber's Spark offering performant, reliable, and cost-efficient.

Kai Jiang

Kai Jiang is a Senior Software Engineer on Uber’s Data Platform team. He has been working on the Spark ecosystem and on encryption and efficiency for big data file formats. He is also a contributor to Apache Beam, Parquet, and Spark.

Posted by Mayank Bansal, Bo Yang, Mayur Bhosale, Kai Jiang