java - Difference between Spark toLocalIterator and iterator methods
While coding Spark programs I came across the `toLocalIterator()` method. Earlier I was using only the `iterator()` method.
If anybody has ever used this method, please shed some light on it.
I came across it while using the `foreach` and `foreachPartition` methods in a Spark program.
Can I pass the result of the `foreach` method to the `toLocalIterator` method, or vice versa?
`toLocalIterator()` -> `foreachPartition()`
`iterator()` -> `foreach()`
First of all, the `iterator` method of an RDD should not be called. As you can read in the [javadocs](https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/rdd.html#iterator(org.apache.spark.partition, org.apache.spark.taskcontext)): "This should ''not'' be called by users directly, but is available for implementors of custom subclasses of RDD."
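As for `toLocalIterator`, it is used to collect the data of an RDD that is scattered around the cluster onto one node, the one the program is running from (the driver), so that you can work with all the data on that same node. It is similar to the `collect` method, but instead of returning a `List` it returns an `Iterator`.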
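To make the difference between `collect` and `toLocalIterator` concrete, here is a minimal plain-Java sketch with no Spark dependency. It is only a model: each inner list stands in for one partition, and the names `LocalIteratorModel`, `collectAll`, `localIterator` and `fetched` are hypothetical, not Spark APIs. It assumes the documented behaviour that `toLocalIterator` brings partitions to the driver one at a time, whereas `collect` materializes everything at once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Plain-Java model (not Spark): each inner list stands in for one RDD partition.
final class LocalIteratorModel {
    // Number of partitions "fetched" to the driver so far (illustration only).
    static int fetched = 0;

    // Models collect(): every partition is copied into one list up front.
    static <T> List<T> collectAll(List<List<T>> partitions) {
        List<T> all = new ArrayList<>();
        for (List<T> partition : partitions) {
            fetched++;
            all.addAll(partition);
        }
        return all;
    }

    // Models toLocalIterator(): the next partition is fetched only when
    // the current one has been fully consumed.
    static <T> Iterator<T> localIterator(List<List<T>> partitions) {
        return new Iterator<T>() {
            private int nextPartition = 0;
            private Iterator<T> current = Collections.emptyIterator();

            @Override
            public boolean hasNext() {
                while (!current.hasNext() && nextPartition < partitions.size()) {
                    current = partitions.get(nextPartition++).iterator();
                    fetched++;
                }
                return current.hasNext();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(4, 5));
        Iterator<Integer> it = localIterator(partitions);
        while (it.hasNext()) {
            System.out.println(it.next() + " (partitions fetched so far: " + fetched + ")");
        }
    }
}
```

In this model the iterator holds only one partition at a time, which is why iterating can be gentler on driver memory than collecting the whole list.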
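`foreach` is used to apply a function to each of the elements of the RDD, while `foreachPartition` applies a function to each of the partitions. Both run in parallel on the cluster. In the first approach your function receives one element at a time, and in the second it receives a whole partition as an iterator (useful if you need to perform an operation with all of that partition's data).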
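The per-element versus per-partition distinction can be sketched in plain Java as follows. This is a model, not Spark code: the class `ForeachModel` and its two methods are hypothetical stand-ins for `JavaRDD.foreach` and `JavaRDD.foreachPartition`, and each inner list again plays the role of one partition. The real Spark methods return nothing; the invocation count returned here exists only to make the difference visible.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java model (not Spark): each inner list stands in for one RDD partition.
final class ForeachModel {
    // Models rdd.foreach(f): f is invoked once per element.
    // Returns the number of invocations (illustration only).
    static <T> int foreach(List<List<T>> partitions, Consumer<T> f) {
        int calls = 0;
        for (List<T> partition : partitions) {
            for (T element : partition) {
                f.accept(element);
                calls++;
            }
        }
        return calls;
    }

    // Models rdd.foreachPartition(f): f is invoked once per partition and
    // receives an iterator over that partition's elements.
    static <T> int foreachPartition(List<List<T>> partitions, Consumer<Iterator<T>> f) {
        int calls = 0;
        for (List<T> partition : partitions) {
            f.accept(partition.iterator());
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(4, 5));

        int perElement = foreach(partitions, x -> System.out.println("element " + x));

        int perPartition = foreachPartition(partitions, it -> {
            int sum = 0;
            while (it.hasNext()) {
                sum += it.next();
            }
            System.out.println("partition sum " + sum);
        });

        System.out.println(perElement + " calls vs " + perPartition + " calls");
    }
}
```

The per-partition form is the natural place for setup work that should happen once per partition rather than once per element.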
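So yes, after applying a function to an RDD using `foreach` or `foreachPartition` you can call `toLocalIterator` on that RDD to get an iterator over all of its contents and process it. Note that `foreach` and `foreachPartition` themselves return nothing, so there is no result to pass along; you call `toLocalIterator` on the RDD itself. However, bear in mind that if your RDD is very big, you may have memory issues on the driver. If you want to turn the data into an RDD again after doing the operations you need, use the `SparkContext` to `parallelize` it again.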