Remove duplicates from a Spark JavaPairDStream / JavaDStream


I'm building a Spark Streaming application that receives data via socketTextStream. The problem is that the sent data contains duplicates, and I want to remove them on the Spark side (without pre-filtering on the sender side). Can I use JavaPairRDD's distinct function via the DStream's foreach? I can't find a way to do that. I need the "filtered" Java(Pair)DStream for later actions.

Thank you!

The .transform() method can be used to apply arbitrary RDD operations to each time slice of the stream. Assuming your data are strings:

someDStream.transform(new Function<JavaRDD<String>, JavaRDD<String>>() {
    @Override
    public JavaRDD<String> call(JavaRDD<String> rows) throws Exception {
        // distinct() removes duplicate elements within this batch's RDD
        return rows.distinct();
    }
});
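Since the question also asks about JavaPairDStream, the same idea applies there via transformToPair(). Here is a minimal sketch, assuming Java 8 lambda syntax and a hypothetical pairDStream of type JavaPairDStream<String, String>:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// pairDStream is an assumed JavaPairDStream<String, String> built elsewhere.
// transformToPair applies distinct() to the pair RDD of each batch,
// yielding a new pair DStream you can use in later actions.
JavaPairDStream<String, String> deduped =
        pairDStream.transformToPair((JavaPairRDD<String, String> pairs) -> pairs.distinct());

Note that distinct() deduplicates only within each batch interval; duplicates arriving in different batches would need stateful tracking (e.g. updateStateByKey or mapWithState).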
