Scala - Derive multiple columns from a single column in a Spark DataFrame
I have a DataFrame with huge parseable metadata stored as a single string column. Let's call the DataFrame dfa and the column colmna.

I would like to break colmna into multiple columns through a function, classxyz = func1(colmna). This function returns an instance of class classxyz with multiple variables, and each of these variables has to be mapped to a new column, such as colmna1, colmna2, etc.

How would I do such a transformation from one DataFrame to another with these additional columns by calling func1 just once, and not have to repeat it to create all the columns?

It's easy to solve if I call this huge function every time I add a new column, but that is what I wish to avoid.

Kindly advise with working code or pseudocode.

Thanks,
Sanjay
Generally speaking, what you want is not directly possible. A UDF can return only a single column at a time. There are two different ways you can overcome this limitation:
Return a column of complex type. The most general solution is a StructType, but you can consider ArrayType or MapType as well.

    import org.apache.spark.sql.functions.udf

    // Assumes a spark-shell session, where the implicits behind toDF and $"..." are in scope.
    val df = Seq(
      (1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")
    ).toDF("x", "y", "z")

    // The case class fields become the fields of the returned struct.
    case class FooBar(foo: Double, bar: Double)

    val foobarUdf = udf((x: Long, y: Double, z: String) =>
      FooBar(x * y, z.head.toInt * y))

    val df1 = df.withColumn("foobar", foobarUdf($"x", $"y", $"z"))
    df1.show
    // +---+----+---+------------+
    // |  x|   y|  z|      foobar|
    // +---+----+---+------------+
    // |  1| 3.0|  a| [3.0,291.0]|
    // |  2|-1.0|  b|[-2.0,-98.0]|
    // |  3| 0.0|  c|   [0.0,0.0]|
    // +---+----+---+------------+

    df1.printSchema
    // root
    //  |-- x: long (nullable = false)
    //  |-- y: double (nullable = false)
    //  |-- z: string (nullable = true)
    //  |-- foobar: struct (nullable = true)
    //  |    |-- foo: double (nullable = false)
    //  |    |-- bar: double (nullable = false)
This can be easily flattened later, but usually there is no need for that.
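If the struct does need to be flattened into top-level columns, a minimal sketch could look like the one below. It reuses the toy computation from the example above; the object name FlattenStruct, the helper addFooBar and the local SparkSession setup are illustrative assumptions (Spark 2.x or later), not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Holds everything the single call computes; its fields become struct fields.
case class FooBar(foo: Double, bar: Double)

object FlattenStruct {
  // Same toy computation as above, standing in for the expensive function.
  val foobarUdf = udf((x: Long, y: Double, z: String) =>
    FooBar(x * y, z.head.toInt * y))

  // Adds the struct column, then promotes its fields to top-level columns.
  def addFooBar(df: DataFrame): DataFrame =
    df.withColumn("foobar", foobarUdf(col("x"), col("y"), col("z")))
      .select(col("x"), col("y"), col("z"),
              col("foobar.foo").as("foo"), col("foobar.bar").as("bar"))

  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.x or later, running locally.
    val spark = SparkSession.builder()
      .master("local[*]").appName("flatten-struct").getOrCreate()
    import spark.implicits._

    val df = Seq((1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")).toDF("x", "y", "z")
    addFooBar(df).show()
    spark.stop()
  }
}
```

One caveat: depending on the Spark version, the optimizer may collapse the two projections and evaluate the UDF once per extracted field. If the function is expensive, it is worth checking the plan with explain(); marking the UDF with asNondeterministic() (available since Spark 2.3, as far as I know) is one way to discourage the duplication.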
Switch to an RDD, reshape, and rebuild the DataFrame:
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.Row

    // Called once per row; returns the values of all new columns.
    def foobarFunc(x: Long, y: Double, z: String): Seq[Any] =
      Seq(x * y, z.head.toInt * y)

    // Original schema extended with the new fields.
    val schema = StructType(df.schema.fields ++
      Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))

    val rows = df.rdd.map(r => Row.fromSeq(
      r.toSeq ++
      foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"))))

    // With a SparkSession (Spark 2.x and later), spark.createDataFrame works the same way.
    val df2 = sqlContext.createDataFrame(rows, schema)

    df2.show
    // +---+----+---+----+-----+
    // |  x|   y|  z| foo|  bar|
    // +---+----+---+----+-----+
    // |  1| 3.0|  a| 3.0|291.0|
    // |  2|-1.0|  b|-2.0|-98.0|
    // |  3| 0.0|  c| 0.0|  0.0|
    // +---+----+---+----+-----+
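A typed Dataset map is a possible alternative to the RDD round trip. It is not part of the answer above, just a sketch under the assumption of Spark 2.x or later. The case classes Input and Output and the object TypedMapExample are hypothetical names. The function runs once per row, and the schema comes from the case class instead of being built by hand.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical row types: the input columns, and the input plus the derived columns.
case class Input(x: Long, y: Double, z: String)
case class Output(x: Long, y: Double, z: String, foo: Double, bar: Double)

object TypedMapExample {
  // Stand-in for the expensive function; returns all derived values at once.
  def fooBarFunc(x: Long, y: Double, z: String): (Double, Double) =
    (x * y, z.head.toInt * y)

  // Maps each input row to an output row, calling fooBarFunc once per row.
  def addFooBar(spark: SparkSession, ds: Dataset[Input]): Dataset[Output] = {
    import spark.implicits._ // provides the encoder for Output
    ds.map { r =>
      val (foo, bar) = fooBarFunc(r.x, r.y, r.z)
      Output(r.x, r.y, r.z, foo, bar)
    }
  }
}
```

An existing DataFrame with matching column names and types can be converted with df.as[Input] once spark.implicits._ is imported, and the result turned back into a DataFrame with toDF().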