java - Shuffle cost in Spark
I have run various Spark jobs over the last few months and have done some performance analysis, though not as in-depth as I would have liked. I have noticed a drop in the performance gained per core as I add more and more cores. This is expected, since Spark must have constant-cost (or at least non-parallelizable) operations, so Amdahl's law applies. However, that is not enough to explain what I am seeing.
I am trying to reason in an abstract way about the cost of shuffling data over the network depending on the number m of machines in the cluster. My reasoning is the following:
Hypotheses
- d is the total data size in bytes
- there are n >= m partitions
- the data is randomly (in terms of the partitioning function) and equally (the same amount d/m of data per machine) distributed before the shuffle
- the partitioning function is balanced (=> each partition represents 1/n of the data)
- each machine must end up with the same number of partitions, n/m, to optimize the distribution of data
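To make these hypotheses concrete, take for example d = 1024 MB, m = 4 machines and n = 8 partitions (arbitrary values, just for illustration): each machine holds d/m = 256 MB before the shuffle, each partition represents d/n = 128 MB, and each machine must end up with n/m = 2 partitions.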
Reasoning
Since the data is randomly distributed, each machine contains 1/m of each partition before the shuffle. Since each machine must end up with n/m partitions, this means that each machine already holds locally (on disk or in memory) the equivalent of n/m^2 partitions that can stay on the node, namely d/m^2 bytes. As each machine contains d/m bytes in total, each machine must send d/m - d/m^2 bytes, i.e. (m - 1)/m^2 * d.
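Here is a small sketch (plain Java, with made-up values for d and m; nothing here is measured from Spark) that just evaluates these quantities:

    public class ShuffleCost {
        public static void main(String[] args) {
            double d = 1_073_741_824.0;   // total data size in bytes (1 GiB, assumed value)
            int m = 4;                    // number of machines (assumed value)

            double perMachine = d / m;                 // bytes held by each machine before the shuffle
            double staysLocal = d / ((double) m * m);  // d/m^2 bytes already on the right machine
            double sent = perMachine - staysLocal;     // bytes each machine sends = (m - 1)/m^2 * d

            System.out.printf("held: %.0f, stays local: %.0f, sent: %.0f%n",
                    perMachine, staysLocal, sent);
        }
    }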
Conclusion

The cost of the shuffle is a function of the number of machines and, for a fixed number of machines, a linear function of the data size:
(d,m) -> (m - 1)/m^2 * d
This function is increasing between 1 and 2, and then decreases. Therefore, the cost of the shuffle step should go down as the number of machines increases. Does this make sense? I would like opinions on this. Note that before an edit my conclusion was the opposite, so... I need to be reassured or contradicted.
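Plugging in a few values: (m - 1)/m^2 gives 0 for m = 1, 0.25 for m = 2, about 0.222 for m = 3 and 0.1875 for m = 4, so the fraction of the data each machine has to send does peak at m = 2 and then decreases.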