scala - Derive multiple columns from a single column in a Spark DataFrame


I have a DataFrame with huge parseable metadata in a single string column; let's call the DataFrame dfa and the column colmna.

I would like to break the column colmna into multiple columns through a function, classxyz = func1(colmna). The function returns an object of class classxyz with multiple variables, and each of these variables now has to be mapped to a new column, such as colmna1, colmna2, etc.

How can I do such a transformation to get one DataFrame with these additional columns by calling func1 only once, rather than having to repeat it for every column it creates?

It would be easy to solve if I called the huge function every time to add a new column, but that is what I wish to avoid.

Kindly please advise with working or pseudo code.

Thanks,

Sanjay

Generally speaking, what you want is not directly possible. A UDF can return only a single column at a time. There are two different ways you can overcome this limitation:

  1. Return a column of complex type. The most general solution is a StructType, but you can consider ArrayType or MapType as well (a MapType sketch follows after the examples below).

    import org.apache.spark.sql.functions.udf
    import sqlContext.implicits._  // for toDF and the $ column syntax (already in scope in spark-shell)

    val df = Seq(
      (1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")
    ).toDF("x", "y", "z")

    case class Foobar(foo: Double, bar: Double)

    // a UDF returning a case class produces a single struct column
    val foobarUdf = udf((x: Long, y: Double, z: String) =>
      Foobar(x * y, z.head.toInt * y))

    val df1 = df.withColumn("foobar", foobarUdf($"x", $"y", $"z"))
    df1.show
    // +---+----+---+------------+
    // |  x|   y|  z|      foobar|
    // +---+----+---+------------+
    // |  1| 3.0|  a| [3.0,291.0]|
    // |  2|-1.0|  b|[-2.0,-98.0]|
    // |  3| 0.0|  c|   [0.0,0.0]|
    // +---+----+---+------------+

    df1.printSchema
    // root
    //  |-- x: long (nullable = false)
    //  |-- y: double (nullable = false)
    //  |-- z: string (nullable = true)
    //  |-- foobar: struct (nullable = true)
    //  |    |-- foo: double (nullable = false)
    //  |    |-- bar: double (nullable = false)

    This can be flattened later, but usually there is no need for that.
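
    If flat columns are needed, here is a minimal sketch of the flattening (assuming df1 from above; the name df1Flat is only for illustration):

    // selecting the nested fields promotes them to top-level columns
    val df1Flat = df1.select($"x", $"y", $"z", $"foobar.foo", $"foobar.bar")
    // df1Flat now has the plain columns x, y, z, foo and bar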

  2. Switch to an RDD, reshape the rows, and rebuild the DataFrame:

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.Row

    def foobarFunc(x: Long, y: Double, z: String): Seq[Any] =
      Seq(x * y, z.head.toInt * y)

    // extend the existing schema with the two new fields
    val schema = StructType(df.schema.fields ++
      Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))

    // append the computed values to every row
    val rows = df.rdd.map(r => Row.fromSeq(
      r.toSeq ++
      foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"))))

    val df2 = sqlContext.createDataFrame(rows, schema)

    df2.show
    // +---+----+---+----+-----+
    // |  x|   y|  z| foo|  bar|
    // +---+----+---+----+-----+
    // |  1| 3.0|  a| 3.0|291.0|
    // |  2|-1.0|  b|-2.0|-98.0|
    // |  3| 0.0|  c| 0.0|  0.0|
    // +---+----+---+----+-----+
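
As a side note on option 1, the MapType mentioned there works in the same way; below is a minimal, untested sketch (assuming the same toy df; the names foobarMapUdf and dfMap are only for illustration):

    import org.apache.spark.sql.functions.udf

    // a Scala Map[String, Double] becomes a MapType(StringType, DoubleType) column
    val foobarMapUdf = udf((x: Long, y: Double, z: String) =>
      Map("foo" -> x * y, "bar" -> z.head.toInt * y))

    val dfMap = df.withColumn("foobar", foobarMapUdf($"x", $"y", $"z"))

    // individual values are pulled out by key
    dfMap.select($"x", $"foobar".getItem("foo").alias("foo")).show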
