import pandas as pd
pd.set_option("display.notebook_repr_html", False)
Introduction
It’s rare that a data analysis involves only a single table of data. In practice, you’ll normally have many tables that contribute to an analysis, and you need flexible tools to combine them.
there are three families of verbs that work with two tables at a time:
Mutating joins, which add new variables to one table from matching rows in another.
Filtering joins, which filter observations from one table based on whether or not they match an observation in the other table.
Set operations, which combine the observations in the data sets as if they were set elements.
Mutating joins
Mutating joins allow you to combine variables from multiple tables. There are five types of mutating join, which differ in their behaviour when a match is not found.
- inner_join
use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
- left_join
use only keys from left frame, similar to a SQL left outer join; preserve key order.
- right_join
use only keys from right frame, similar to a SQL right outer join; preserve key order.
- full_join or outer_join
use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
- cross_join
cartesian product from both frames, preserves the order of the left keys.
left = pd.DataFrame(
{
"key1": ["K0", "K0", "K1", "K2"],
"key2": ["K0", "K1", "K0", "K1"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
}
)
right = pd.DataFrame(
{
"key1": ["K0", "K1", "K1", "K2"],
"key2": ["K0", "K0", "K0", "K0"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
pd.merge(left,right,how='inner',on = ["key1","key2"],indicator=True)
key1 key2 A B C D _merge
0 K0 K0 A0 B0 C0 D0 both
1 K1 K0 A2 B2 C1 D1 both
2 K1 K0 A2 B2 C2 D2 both
pd.merge(left,right,how='left',on = ["key1","key2"],indicator=True)
key1 key2 A B C D _merge
0 K0 K0 A0 B0 C0 D0 both
1 K0 K1 A1 B1 NaN NaN left_only
2 K1 K0 A2 B2 C1 D1 both
3 K1 K0 A2 B2 C2 D2 both
4 K2 K1 A3 B3 NaN NaN left_only
pd.merge(left,right,how='right',on = ["key1","key2"],indicator=True)
key1 key2 A B C D _merge
0 K0 K0 A0 B0 C0 D0 both
1 K1 K0 A2 B2 C1 D1 both
2 K1 K0 A2 B2 C2 D2 both
3 K2 K0 NaN NaN C3 D3 right_only
pd.merge(left,right,how='outer',on = ["key1","key2"],indicator=True)
key1 key2 A B C D _merge
0 K0 K0 A0 B0 C0 D0 both
1 K0 K1 A1 B1 NaN NaN left_only
2 K1 K0 A2 B2 C1 D1 both
3 K1 K0 A2 B2 C2 D2 both
4 K2 K1 A3 B3 NaN NaN left_only
5 K2 K0 NaN NaN C3 D3 right_only
df1 = pd.DataFrame(
{
"A": [1,2]
}
)
df2 = pd.DataFrame(
{
"B": ['a', 'b', 'c']
}
)
df1['key'] = 1
df2['key'] = 1
df_cross_join = pd.merge(df1,df2,on='key').drop("key",1)
df_cross_join
A B
0 1 a
1 1 b
2 1 c
3 2 a
4 2 b
5 2 c
Filter joins
Filtering joins keep observations from left-hand dataframe. They match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
- semi_join(x, y) keeps all observations in x that have a match in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
- anti_join(x, y) drops all observations in x that have a match in y, keeping just columns from x.
# semi_join
df = (pd.merge(left,right,how='outer',on = ["key1","key2"],indicator=True)
.query("_merge in ['both']")
)
df
key1 key2 A B C D _merge
0 K0 K0 A0 B0 C0 D0 both
2 K1 K0 A2 B2 C1 D1 both
3 K1 K0 A2 B2 C2 D2 both
# anti_join
df = (pd.merge(left,right,how='outer',on = ["key1","key2"],indicator=True)
.query("_merge == 'left_only'")
)
df
key1 key2 A B C D _merge
1 K0 K1 A1 B1 NaN NaN left_only
4 K2 K1 A3 B3 NaN NaN left_only
Set operations
The final type of two-table verb is set operations. These expect the x and y inputs to have the same variables, and treat the observations like sets:
- intersect(x, y): return only observations in both x and y
- union(x, y): return unique observations in x and y
- setdiff(x, y): return observations in x, but not in y.
df1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
},
index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
{
"B": ["B2", "B3", "B6", "B7"],
"D": ["D2", "D3", "D6", "D7"],
"F": ["F2", "F3", "F6", "F7"],
},
index=[2, 3, 6, 7],
)
df_intersec = pd.concat([df1,df2],join='outer',axis=1)
df_intersec
A B C D B D F
0 A0 B0 C0 D0 NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN
2 A2 B2 C2 D2 B2 D2 F2
3 A3 B3 C3 D3 B3 D3 F3
6 NaN NaN NaN NaN B6 D6 F6
7 NaN NaN NaN NaN B7 D7 F7
df_intersec = pd.concat([df1,df2],join='inner',axis=1)
df_intersec
A B C D B D F
2 A2 B2 C2 D2 B2 D2 F2
3 A3 B3 C3 D3 B3 D3 F3
pd.concat([df1, df2.reindex(df1.index)], axis=1)
A B C D B D F
0 A0 B0 C0 D0 NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN
2 A2 B2 C2 D2 B2 D2 F2
3 A3 B3 C3 D3 B3 D3 F3
pd.concat([df1, df2], ignore_index=True, sort=False)
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 NaN
3 A3 B3 C3 D3 NaN
4 NaN B2 NaN D2 F2
5 NaN B3 NaN D3 F3
6 NaN B6 NaN D6 F6
7 NaN B7 NaN D7 F7