java - Difference between Spark toLocalIterator and iterator methods -


While coding Spark programs I came across the toLocalIterator() method. Earlier I was using only the iterator() method.

If anyone has used this method, please shed some light on it.

I came across it while using the foreach and foreachPartition methods in a Spark program.

Can I pass the result of the foreach method to the toLocalIterator method, or vice versa?

toLocalIterator() -> foreachPartition()
iterator() -> foreach()

First of all, the iterator method of an RDD should not be called directly. As you can read in the [Javadocs](https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/rdd.html#iterator(org.apache.spark.partition, org.apache.spark.taskcontext)): this should ''not'' be called by users directly, but is available for implementors of custom subclasses of RDD.

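As for toLocalIterator, it is used to collect the data of an RDD that is scattered around the cluster onto one single node, the one where the program is running, so that you can work with all the data on the same node. It is similar to the collect method, but instead of returning a List it returns an Iterator.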
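
To make the List-versus-Iterator distinction concrete, here is a minimal plain-Java model; it is not Spark code and has no Spark dependency. The class `LocalIteratorModel`, its `LazyIterator`, and the nested lists standing in for partitions are all hypothetical names chosen for illustration. It assumes, as the Spark API documentation for toLocalIterator describes, that the iterator pulls in one partition at a time, whereas collect materializes everything up front.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class LocalIteratorModel {
    // Models collect: every partition is copied into one list up front.
    public static <T> List<T> collect(List<List<T>> partitions) {
        List<T> all = new ArrayList<>();
        for (List<T> partition : partitions) {
            all.addAll(partition);
        }
        return all;
    }

    // Models toLocalIterator: a partition is fetched only when the
    // previous one has been fully consumed (assumption: this mirrors
    // the behaviour described in the Spark docs).
    public static final class LazyIterator<T> implements Iterator<T> {
        private final Iterator<List<T>> remaining;
        private Iterator<T> current = Collections.emptyIterator();
        private int partitionsFetched = 0;

        public LazyIterator(List<List<T>> partitions) {
            this.remaining = partitions.iterator();
        }

        public int partitionsFetched() {
            return partitionsFetched;
        }

        @Override
        public boolean hasNext() {
            // Skip over exhausted (or empty) partitions.
            while (!current.hasNext() && remaining.hasNext()) {
                current = remaining.next().iterator();
                partitionsFetched++;
            }
            return current.hasNext();
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return current.next();
        }
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c"));
        LazyIterator<String> it = new LazyIterator<>(partitions);
        it.next(); // touches only the first partition
        System.out.println(it.partitionsFetched() + " of "
                + partitions.size() + " partitions fetched");
        System.out.println(collect(partitions)); // all elements at once
    }
}
```

In a real program the corresponding calls would be `JavaRDD.collect()` and `JavaRDD.toLocalIterator()`; the model only illustrates why the iterator form can need less driver memory at any one moment.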

foreach is used to apply a function to each of the elements of the RDD, while foreachPartition applies a function to each of the partitions. In the first approach you get one element at a time (to parallelize more), and in the second you get one whole partition (useful if you need to perform an operation over all of its data).
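
The difference in how often the function is invoked can be sketched in plain Java. This is a model only, not Spark code: `PartitionModel`, `PARTITIONS`, `forEachElement`, and `forEachPartition` are hypothetical names, and the nested lists stand in for an RDD with three partitions. The real counterparts are `JavaRDD.foreach` and `JavaRDD.foreachPartition`, where the per-partition function receives an `Iterator` over that partition's elements.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class PartitionModel {
    // Hypothetical stand-in for an RDD: each inner list is one partition.
    static final List<List<Integer>> PARTITIONS = Arrays.asList(
            Arrays.asList(1, 2, 3),
            Arrays.asList(4, 5),
            Arrays.asList(6));

    // Models foreach: the function is invoked once per element.
    // Returns the number of invocations.
    public static int forEachElement(Consumer<Integer> f) {
        int calls = 0;
        for (List<Integer> partition : PARTITIONS) {
            for (Integer element : partition) {
                f.accept(element);
                calls++;
            }
        }
        return calls;
    }

    // Models foreachPartition: the function is invoked once per partition
    // and receives an iterator over that partition's elements.
    // Returns the number of invocations.
    public static int forEachPartition(Consumer<Iterator<Integer>> f) {
        int calls = 0;
        for (List<Integer> partition : PARTITIONS) {
            f.accept(partition.iterator());
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        int perElement = forEachElement(x -> { });
        int perPartition = forEachPartition(it -> {
            while (it.hasNext()) {
                it.next();
            }
        });
        // prints "6 calls vs 3 calls"
        System.out.println(perElement + " calls vs " + perPartition + " calls");
    }
}
```

The per-partition form is typically the one to reach for when each invocation carries a fixed setup cost that is better paid once per partition than once per element.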

So yes, after applying a function to an RDD using foreach or foreachPartition, you can call toLocalIterator on the same RDD to get an iterator over all of its contents and process them. However, bear in mind that if the RDD is very big, you may run into memory issues. If you want to turn the data into an RDD again after doing the operations you need, use the SparkContext to parallelize it again.

