Remove duplicates from a Spark JavaPairDStream / JavaDStream


I'm building a Spark Streaming application that receives data via socketTextStream. The problem is that the sent data contains duplicates. I'd like to remove them on the Spark side (without pre-filtering on the sender side). Can I use JavaPairRDD's distinct function via the DStream's foreach (I can't find a way to do that)? I need the "filtered" Java(Pair)DStream for later actions...

Thank you!

The .transform() method can be used to apply arbitrary RDD operations to each time slice of a DStream. Assuming your data is just strings:

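    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;

    // Apply RDD.distinct() to the RDD of each batch interval.
    someDStream.transform(new Function<JavaRDD<String>, JavaRDD<String>>() {
        @Override
        public JavaRDD<String> call(JavaRDD<String> rows) throws Exception {
            return rows.distinct();
        }
    });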
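One thing to keep in mind: distinct() inside transform() runs once per batch interval, so it removes duplicates within a batch but not duplicates that arrive in different batches. A minimal plain-Java sketch of that behaviour, assuming each batch can be modelled as a List (no Spark dependency; the class and method names below are hypothetical stand-ins, not Spark API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class BatchDistinctSketch {

    // Stand-in for rdd.distinct(): removes duplicates inside ONE batch only.
    // Keeping insertion order is a property of this model; Spark's distinct()
    // makes no ordering guarantee.
    static List<String> distinctWithinBatch(List<String> batch) {
        return new ArrayList<>(new LinkedHashSet<>(batch));
    }

    // Stand-in for dstream.transform(...): the function is applied to every
    // batch independently, with no state shared between batches.
    static List<List<String>> transformEachBatch(List<List<String>> batches) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> batch : batches) {
            out.add(distinctWithinBatch(batch));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("a", "b", "a"),   // batch 1
                Arrays.asList("b", "c", "c"));  // batch 2
        // "b" survives in both batches because each batch is deduplicated on its own.
        System.out.println(transformEachBatch(batches)); // prints [[a, b], [b, c]]
    }
}
```

If duplicates must also be dropped across batches, some form of state or windowing is needed on top of the per-batch distinct().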
