Pandas merge and join not working
I have a problem merging 2 dataframes.
I'm processing a list of 10 dataframe pairs, all created the same way from an SQL database and CSV files.
On some pairs merge(df1, df2) works correctly and df1.join(df2) does not. For example, these are subsets of one of the pairs:
>>> a
         mod:user  studentid
2010453   3891583    2010453
2112086    890910    2112086
2222220    201611    2222220
2346979      7084    2346979
2414996   1817436    2414996
2420317     52821    2420317
2438767    884012    2438767
2451924  20815145    2451924
2515531   2115829    2515531
2536751    494565    2536751
2549050    315295    2549050
2549530         0    2549530
2551532    544968    2551532
2551542       213    2551542
2610206   1257038    2610206
2624429    939670    2624429
2630017         6    2630017
2633815    190564    2633815
2633857   1147211    2633857
2634405   1093092    2634405
2641370   2038012    2641370
2644284    658743    2644284
2649427    220230    2649427
2712372      9468    2712372
2714617   1231577    2714617
2718450   3907345    2718450
2732910         0    2732910
2739711    396876    2739711
8200703       166    8200703
9906492    920875    9906492
oscarl        505     oscarl
>>> b
         assignment:5  studentid
2010453            70    2010453
2112086            82    2112086
2222220            76    2222220
2346979           NaN    2346979
2414996            88    2414996
2438767            50    2438767
2451924           100    2451924
2515531            50    2515531
2536751           100    2536751
2538371            94    2538371
2549050           100    2549050
2551532           100    2551532
2610206            50    2610206
2624429           100    2624429
2630017           NaN    2630017
2634405           100    2634405
2641370           100    2641370
2644284           100    2644284
2712372           100    2712372
2714617            69    2714617
2718450           100    2718450
2739711           100    2739711
9906492           100    9906492
>>> pd.merge(a, b, left_on="studentid", right_on="studentid", how="inner")
    mod:user  studentid  assignment:5
0    3891583    2010453            70
1     890910    2112086            82
2     201611    2222220            76
3       7084    2346979           NaN
4    1817436    2414996            88
5     884012    2438767            50
6   20815145    2451924           100
7    2115829    2515531            50
8     494565    2536751           100
9     315295    2549050           100
10    544968    2551532           100
11   1257038    2610206            50
12    939670    2624429           100
13         6    2630017           NaN
14   1093092    2634405           100
15   2038012    2641370           100
16    658743    2644284           100
17      9468    2712372           100
18   1231577    2714617            69
19   3907345    2718450           100
20    396876    2739711           100
21    920875    9906492           100
>>> a.join(b, on="studentid", rsuffix="r", how="inner")
Empty DataFrame
Columns: [mod:user, studentid, assignment:5, studentidr]
Index: []
Now, to make things stranger, on another pair of dataframes merge(df1, df2) is not working and df1.join(df2) is working.
>>> a
         mod:user  studentid
2115728   1177712    2115728
2341322    142805    2341322
2447383   1642046    2447383
2510156       141    2510156
2512053    570889    2512053
2527456  12262284    2527456
2529917  11826381    2529917
2533588    183665    2533588
2535922    107131    2535922
2535991    542259    2535991
2543095  11614678    2543095
2548984       225    2548984
2549565   2059072    2549565
2632847  25408938    2632847
2634371    129605    2634371
2714666    755975    2714666
8307654     74576    8307654
>>> b
         assignment:5  studentid
2115728         86.67    2115728
2341322         86.67    2341322
2447383         80.00    2447383
2512053         93.33    2512053
2527456         93.33    2527456
2529917         86.67    2529917
2533588         86.67    2533588
2535922         86.67    2535922
2535991         86.67    2535991
2543095        100.00    2543095
2548984        100.00    2548984
2549565         86.67    2549565
2632847        100.00    2632847
2634371         73.33    2634371
2714666         80.00    2714666
8307654         86.67    8307654
>>> pd.merge(a, b, left_on="studentid", right_on="studentid", how="inner")
Empty DataFrame
Columns: [mod:user, studentid, assignment:5]
Index: []
>>> a.join(b, on="studentid", rsuffix="r", how="inner")
         mod:user  studentid  assignment:5  studentidr
2115728   1177712    2115728         86.67     2115728
2341322    142805    2341322         86.67     2341322
2447383   1642046    2447383         80.00     2447383
2512053    570889    2512053         93.33     2512053
2527456  12262284    2527456         93.33     2527456
2529917  11826381    2529917         86.67     2529917
2533588    183665    2533588         86.67     2533588
2535922    107131    2535922         86.67     2535922
2535991    542259    2535991         86.67     2535991
2543095  11614678    2543095        100.00     2543095
2548984       225    2548984        100.00     2548984
2549565   2059072    2549565         86.67     2549565
2632847  25408938    2632847        100.00     2632847
2634371    129605    2634371         73.33     2634371
2714666    755975    2714666         80.00     2714666
8307654     74576    8307654         86.67     8307654
I don't have a clue what is happening or which function to use.
Thanks! I found the issue: it is the automatic data conversion when reading the data frame.
As you can see, in the first example someone put the string 'oscarl' as a studentid, which resulted in the whole column being treated as a string, while in the second example there are no records like this, so it was converted to int.
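One way to work around the mismatch is to cast the key to a common type on both sides before merging. The snippet below is only a sketch under assumptions: the two small frames are made-up stand-ins for the real data, and merge_on_string_key is a hypothetical helper name, not something from pandas.

```python
import pandas as pd

# Hypothetical miniature frames imitating the first pair in the question.
# The 'oscarl' entry forces the whole studentid column to hold strings.
a = pd.DataFrame({
    "mod:user": [3891583, 890910, 505],
    "studentid": ["2010453", "2112086", "oscarl"],
})
# Here the key was read as integers.
b = pd.DataFrame({
    "assignment:5": [70, 82],
    "studentid": [2010453, 2112086],
})


def merge_on_string_key(left, right, key):
    """Cast the key column to str on both frames, then inner-merge on it."""
    left = left.assign(**{key: left[key].astype(str)})
    right = right.assign(**{key: right[key].astype(str)})
    return pd.merge(left, right, on=key, how="inner")


# Comparing the key dtypes first shows whether the two sides disagree.
print(a["studentid"].dtype, b["studentid"].dtype)

merged = merge_on_string_key(a, b, "studentid")
print(merged)
```

Keep in mind that a.join(b, on="studentid") matches the studentid column of a against the index of b, not against b's studentid column, so the dtype of b's index matters for join in the same way.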
I found this by running the suggested solution, which gave me an error that pointed me in the right direction:
>>> pd.concat([a, b], axis=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ... ...
Exception: ('cannot have duplicate column names split across dtypes', 'occurred @ index assignment:5')
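If the frames come from CSV files, the automatic conversion can also be avoided at read time by fixing the key's type with the dtype argument of pd.read_csv. This is a rough sketch: the CSV contents below are invented for illustration, since the real files are not shown, and data loaded from the SQL database would need a similar cast after loading.

```python
import io

import pandas as pd

# Hypothetical CSV contents (the real files are not part of the question).
csv_with_name = "studentid,mod:user\n2010453,3891583\noscarl,505\n"
csv_all_digits = "studentid,assignment:5\n2010453,86.67\n2115728,80.00\n"

# Without dtype=, pandas infers the key type separately for each file,
# so an all-digit column becomes an integer column.
inferred = pd.read_csv(io.StringIO(csv_all_digits))
print(inferred["studentid"].dtype)

# Forcing the key to str makes both files agree regardless of their content.
a = pd.read_csv(io.StringIO(csv_with_name), dtype={"studentid": str})
b = pd.read_csv(io.StringIO(csv_all_digits), dtype={"studentid": str})

merged = pd.merge(a, b, on="studentid", how="inner")
print(merged)
```

Reading the key as a string everywhere keeps entries such as 'oscarl' usable, whereas converting everything to int would fail on them.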