Joining multiple columns in PySpark
I am joining two dataframes that have column names in common.
My dataframes are as follows:
>>> sample3
DataFrame[uid1: string, count1: bigint]
>>> sample4
DataFrame[uid1: string, count1: bigint]

sample3
     uid1  count1
0    john       3
1    paul       4
2  george       5

sample4
     uid1  count1
0    john       3
1    paul       4
2  george       5
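For reference, a minimal sketch that builds the two dataframes above; the SparkSession setup is an assumption about the environment, not part of the original session:

from pyspark.sql import SparkSession

# Assumed setup; the question does not show how the session was created.
spark = SparkSession.builder.getOrCreate()

# Same data under two names, matching the question's example.
data = [("john", 3), ("paul", 4), ("george", 5)]
sample3 = spark.createDataFrame(data, ["uid1", "count1"])
sample4 = spark.createDataFrame(data, ["uid1", "count1"])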
(I am using the same dataframe under two different names on purpose.)
I looked at Spark JIRA issue 7197, which addresses how to perform this join (it is inconsistent with the PySpark documentation). However, the method proposed there produces duplicate columns:
>>> cond = (sample3.uid1 == sample4.uid1) & (sample3.count1 == sample4.count1)
>>> sample3.join(sample4, cond)
DataFrame[uid1: string, count1: bigint, uid1: string, count1: bigint]
I would like a result where the keys do not appear twice.
I can get this with a single column:
>>> sample3.join(sample4, 'uid1')
DataFrame[uid1: string, count1: bigint, count1: bigint]
However, the same syntax does not apply to joining on two columns, and throws an error.
I would like the result to be:

DataFrame[uid1: string, count1: bigint]

I am wondering how this is possible.
You can define the join condition as a list of keys, in this case:
sample3.join(sample4, ['uid1','count1'])
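Passing a list of column names performs an equi-join on each named column and keeps a single copy of each key, so the duplicate columns disappear. A quick check, assuming the sample dataframes from the question:

>>> sample3.join(sample4, ['uid1', 'count1'])
DataFrame[uid1: string, count1: bigint]

This is exactly the schema the question asks for; the list form generalizes the single-column 'uid1' syntax that already worked.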