python - Joining multiple columns in PySpark


I am trying to join 2 dataframes that have column names in common.

My dataframes are as follows:

>>> sample3
DataFrame[uid1: string, count1: bigint]
>>> sample4
DataFrame[uid1: string, count1: bigint]

sample3
     uid1  count1
0    john       3
1    paul       4
2  george       5

sample4
     uid1  count1
0    john       3
1    paul       4
2  george       5

(I am using the same dataframe under a different name on purpose.)

I looked at JIRA issue SPARK-7197, which addresses how to perform this join (it is inconsistent with the PySpark documentation). However, the method it proposes produces duplicate columns:

>>> cond = (sample3.uid1 == sample4.uid1) & (sample3.count1 == sample4.count1)
>>> sample3.join(sample4, cond)
DataFrame[uid1: string, count1: bigint, uid1: string, count1: bigint]

I would like a result where the keys do not appear twice.

I can get this behavior with a single column:

>>> sample3.join(sample4, 'uid1')
DataFrame[uid1: string, count1: bigint, count1: bigint]

However, the same syntax does not apply to this method of joining and throws an error.

I would like to get the result:

DataFrame[uid1: string, count1: bigint]

I am wondering how this is possible.

You can define the join condition as a list of keys, in this case:

sample3.join(sample4, ['uid1','count1'])
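
For reference, a minimal end-to-end sketch of this answer. It assumes a Spark 2.x-style local SparkSession (the question's era used SQLContext) and rebuilds the sample data from the question; everything else is taken from the post:

from pyspark.sql import SparkSession

# Assumption: a local SparkSession; adapt the builder to your environment.
spark = SparkSession.builder.master("local[*]").getOrCreate()

rows = [("john", 3), ("paul", 4), ("george", 5)]
sample3 = spark.createDataFrame(rows, ["uid1", "count1"])
sample4 = spark.createDataFrame(rows, ["uid1", "count1"])

# Passing a list of column names performs an equi-join on every listed
# column and keeps a single copy of each join key in the output.
joined = sample3.join(sample4, ['uid1', 'count1'])

print(joined)   # DataFrame[uid1: string, count1: bigint]
joined.show()

Note that this list form only works for equality joins on identically named columns; for any other condition you still need the expression form, and then have to drop or select away the duplicated columns yourself.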

