java - Shuffle cost in Spark


I have run various Spark jobs over the last few months and have done some performance analysis, though not as in-depth as I would have liked. I have noticed a drop in the performance added per core as I add more and more cores. This is expected, because Spark must have some constant (or at least non-parallelizable) cost in its operations, so Amdahl's law applies. However, that is not enough to explain what I'm seeing.

I am trying to reason, in an abstract way, about the cost of shuffling data over the network depending on the number m of machines in the cluster. My reasoning is the following:

Hypotheses

  • The total data size is d bytes.
  • There are n >= m partitions.
  • The data is randomly distributed (in terms of the partitioning function) and equally distributed (the same amount d/m of data per machine) before the shuffle.
  • The partitioning function is balanced (so each partition represents 1/n of the data).
  • Each machine must end up with the same number of partitions, n/m, to optimize the distribution of the data.

Reasoning

Since the data is randomly distributed, each machine contains 1/m of each partition before shuffling. Since each machine must end up with n/m partitions on it, this means each machine already holds locally (on disk or in memory) the equivalent of n/m^2 partitions that can stay on the node, namely d/m^2 bytes. As it contains d/m bytes in total, each machine must send d/m - d/m^2 bytes, that is (m - 1)/m^2 * d.
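A quick back-of-the-envelope check of that step, as a plain Java sketch rather than Spark code (the class and method names, and the d = 90 GB / m = 3 figures, are made up for illustration):

    // Back-of-the-envelope check of the reasoning above; plain Java,
    // not Spark code, and all names are hypothetical.
    public class ShuffleEstimate {

        // Bytes a machine already holds for the n/m partitions it will own:
        // it has 1/m of each of those partitions, i.e. d/m^2 bytes in total.
        static double bytesKeptPerMachine(double d, long m) {
            return d / ((double) m * m);
        }

        // Bytes a machine must send over the network: its d/m share minus
        // what it keeps, which simplifies to (m - 1)/m^2 * d.
        static double bytesSentPerMachine(double d, long m) {
            return d / m - bytesKeptPerMachine(d, m);
        }

        public static void main(String[] args) {
            double d = 90e9; // assume 90 GB of shuffle data
            long m = 3;      // assume 3 machines
            System.out.printf("kept per machine: %.1f GB%n", bytesKeptPerMachine(d, m) / 1e9); // 10.0 GB
            System.out.printf("sent per machine: %.1f GB%n", bytesSentPerMachine(d, m) / 1e9); // 20.0 GB
            System.out.printf("total shuffled  : %.1f GB%n", m * bytesSentPerMachine(d, m) / 1e9); // 60.0 GB
        }
    }

With those assumed numbers, each machine keeps 10 GB of its 30 GB share and ships 20 GB, which matches (m - 1)/m^2 * d = 2/9 * 90 GB.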

Conclusion

The cost of the shuffle is a function of the number of machines and, for a fixed number of machines, a linear function of the data size:

(d,m) -> (m - 1)/m^2 * d

This function increases between 1 and 2, then decreases. Therefore, the cost of the shuffle step should go down as the number of cores increases. Does this make sense? I would like opinions on this. Note that before an edit my conclusion was the opposite, so... I need to be reassured or contradicted.
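For what it's worth, here is a tiny standalone loop (again plain Java, hypothetical class name, nothing Spark-specific) printing the per-machine factor (m - 1)/m^2 for a few values of m; it does peak at m = 2 and decrease afterwards:

    // Sanity check of the per-machine factor (m - 1)/m^2 from the formula above.
    public class ShuffleCostCurve {
        public static void main(String[] args) {
            for (long m = 1; m <= 8; m++) {
                double factor = (m - 1) / (double) (m * m);
                System.out.printf("m = %d -> (m - 1)/m^2 = %.4f%n", m, factor);
            }
        }
    }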

