java - Shuffle cost in Spark
I have run various Spark jobs over the last months and have done some performance analysis, though not in as much depth as I would have liked. I have noticed a drop in the performance gained per core as I add more and more cores. This is expected, since Spark must have some constant-cost (or at least non-parallelizable) operations, so Amdahl's law applies. However, that is not enough to explain what I'm seeing.
I am trying to reason in an abstract way about the cost of shuffling data over the network, depending on the number m of machines in the cluster. My reasoning is the following:
Hypotheses

- d is the total data size in bytes.
- There are n >= m partitions.
- The data is randomly distributed (in terms of the partitioning function) and equally distributed (the same amount, d/m, of data on each machine) before the shuffle.
- The partitioning function is balanced (=> each partition represents 1/n of the data).
- Each machine must end up with the same number of partitions, n/m, to optimize the distribution of the data.
Reasoning

Since the data is randomly distributed, each machine contains 1/m of each partition before shuffling. Since each machine must end up with n/m partitions, each machine already holds locally (on disk or in memory) the equivalent of n/m^2 partitions that can stay on the node, i.e. d/m^2 bytes. As each machine contains d/m bytes in total, each machine has to send d/m - d/m^2 bytes, which is (m - 1)/m^2 * d.
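
To make the arithmetic concrete, here is a minimal sketch (plain Java, nothing Spark-specific; the values of d and m are just example numbers, not measurements from my cluster) that computes, for one machine, the bytes that stay local and the bytes that have to cross the network under the hypotheses above:

    public class ShuffleBytes {
        public static void main(String[] args) {
            long d = 100L * 1024 * 1024 * 1024; // total data size in bytes (example: 100 GiB)
            int m = 4;                          // number of machines (example)

            double heldPerMachine = (double) d / m;                // d/m bytes held before the shuffle
            double staysLocal     = (double) d / ((double) m * m); // d/m^2 bytes belong to partitions that stay
            double sentOverWire   = heldPerMachine - staysLocal;   // d/m - d/m^2 = (m - 1)/m^2 * d

            System.out.printf("held: %.0f, stays local: %.0f, sent: %.0f bytes%n",
                    heldPerMachine, staysLocal, sentOverWire);
        }
    }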
Conclusion

The cost of the shuffle is a function of the number of machines and, for a fixed number of machines, a linear function of the data size:

(d, m) -> (m - 1)/m^2 * d

This function increases between m = 1 and m = 2, and then decreases. Therefore, the cost of the shuffle step should go down as the number of cores increases. Does that make sense? I would like your opinions on this. Note that before an edit my conclusion was the opposite, so... I need to be either reassured or contradicted.
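
As a quick sanity check of that shape, the per-machine factor (m - 1)/m^2 can simply be printed for a few values of m (again just the formula, nothing Spark-specific); it is 0 at m = 1, peaks at 0.25 for m = 2, and decreases afterwards:

    public class ShuffleFactor {
        public static void main(String[] args) {
            // Per-machine fraction of the total data that crosses the network: (m - 1)/m^2
            for (int m = 1; m <= 10; m++) {
                double factor = (m - 1) / ((double) m * m);
                System.out.printf("m = %2d -> %.4f%n", m, factor);
            }
        }
    }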