Shuffle cost in Spark
I have run various Spark jobs over the last few months and have done some performance analysis, though not as in depth as I would have liked. I have noticed a drop in the performance gained per core as I add more and more cores. This is expected, since Spark must have some constant-cost (or at least non-parallelizable) operations, so Amdahl's law applies. However, that is not enough to explain what I'm seeing. I am trying to reason in an abstract way about the cost of shuffling data over the network as a function of the number m of machines in the cluster. My reasoning is the following:

Hypotheses:

- the total data size in bytes is d
- there are n >= m partitions
- the data is randomly distributed (with respect to the partitioning function) and equally distributed (the same amount d/m of data per machine) before the shuffle
- the partitioning function is balanced (=> each partition represents 1/n of the data)
- each machine must end up with the same number of partitions, n/m, for an optimal distribution of the data

Reasoning:

Since the data is randomly distributed, each machine contains 1/m of each partition before the shuffle. Since each machine must end up with n/m partitions, it already holds 1/m of the data of those partitions locally and must receive the remaining (m-1)/m over the network.
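To make the arithmetic explicit, here is a worked continuation of that reasoning under the hypotheses above (my own sketch, so please check the steps):

```latex
% Each machine keeps n/m partitions of size d/n each.
% A fraction 1/m of each kept partition is already local,
% so the fraction (m-1)/m must arrive over the network:
\text{bytes received per machine}
  = \frac{n}{m}\cdot\frac{d}{n}\cdot\frac{m-1}{m}
  = \frac{d\,(m-1)}{m^{2}}

% Summing over all m machines gives the total shuffle volume:
\text{total shuffled bytes}
  = m\cdot\frac{d\,(m-1)}{m^{2}}
  = \frac{d\,(m-1)}{m}
  \;\longrightarrow\; d \quad (m \to \infty)
```

Note that n cancels out, so under these hypotheses the total shuffled volume depends only on d and m and is bounded above by d.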
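And a minimal, self-contained Java sketch (class and method names are hypothetical, mine) that evaluates this model for a fixed data size and a growing cluster:

```java
public class ShuffleCostModel {

    /** Bytes one machine must receive during the shuffle: (n/m) * (d/n) * ((m-1)/m) = d(m-1)/m^2. */
    static double bytesReceivedPerMachine(double d, int m) {
        return d * (m - 1) / ((double) m * m);
    }

    /** Total bytes crossing the network: m times the per-machine volume = d(m-1)/m. */
    static double totalShuffledBytes(double d, int m) {
        return d * (m - 1) / m;
    }

    public static void main(String[] args) {
        double d = 1e9; // 1 GB of total data, arbitrary
        for (int m : new int[] {2, 4, 8, 16, 32}) {
            System.out.printf("m=%2d  per-machine=%.3e  total=%.3e%n",
                    m, bytesReceivedPerMachine(d, m), totalShuffledBytes(d, m));
        }
    }
}
```

Running it shows the total approaching d while the per-machine volume falls, which is exactly what the formula predicts.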