Remove duplicates from a Spark JavaPairDStream / JavaDStream
I'm building a Spark Streaming application that receives data via `socketTextStream`. The problem is that the sent data contains duplicates, and I want to remove them on the Spark side (without pre-filtering on the sender side). Can I use `JavaPairRDD`'s `distinct` function via the DStream's `foreach` (I can't find a way to do that)? I need the "filtered" `Java(Pair)DStream` for later actions.

Thank you!
The `.transform()` method lets you apply arbitrary RDD operations to each time slice (micro-batch) of the stream. Assuming your data are strings:
```java
// transform() returns a new DStream; assign it so the
// deduplicated stream is available for later actions.
JavaDStream<String> deduped = someDStream.transform(
    new Function<JavaRDD<String>, JavaRDD<String>>() {
      @Override
      public JavaRDD<String> call(JavaRDD<String> rows) throws Exception {
        // distinct() removes duplicates within this batch's RDD
        return rows.distinct();
      }
    });
```
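One caveat: `distinct()` is applied per batch, so it only collapses duplicates that arrive within the same micro-batch; a value repeated across batches will still appear once per batch (removing those requires stateful deduplication). A plain-Java sketch of the per-batch semantics, with hypothetical batch contents and no Spark dependency:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PerBatchDistinct {
    // Mimics what rows.distinct() does for one micro-batch:
    // duplicates inside the batch collapse to a single element.
    static List<String> distinctBatch(List<String> batch) {
        return batch.stream().distinct().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two hypothetical micro-batches from the socket stream
        List<String> batch1 = Arrays.asList("a", "b", "a", "c");
        List<String> batch2 = Arrays.asList("c", "d");

        // "a" is deduplicated within batch1...
        System.out.println(distinctBatch(batch1)); // [a, b, c]
        // ...but "c" survives in batch2: distinct() sees one batch at a time
        System.out.println(distinctBatch(batch2)); // [c, d]
    }
}
```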