Two-table Verbs in Pandas

Ray Sun

Apr 12, 2021 5 min read pandas, python

import pandas as pd
pd.set_option("display.notebook_repr_html", False)

Introduction

It’s rare that a data analysis involves only a single table of data. In practice, you’ll normally have many tables that contribute to an analysis, and you need flexible tools to combine them.

there are three families of verbs that work with two tables at a time:

Mutating joins, which add new variables to one table from matching rows in another.
Filtering joins, which filter observations from one table based on whether or not they match an observation in the other table.
Set operations, which combine the observations in the data sets as if they were set elements.

Mutating joins

Mutating joins allow you to combine variables from multiple tables. There are five types of mutating join, which differ in their behaviour when a match is not found.

inner_join

use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

left_join

use only keys from left frame, similar to a SQL left outer join; preserve key order.

right_join

use only keys from right frame, similar to a SQL right outer join; preserve key order.

full_join or outer_join

use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically

cross_join

cartesian product from both frames, preserves the order of the left keys.

left = pd.DataFrame(
       {
           "key1": ["K0", "K0", "K1", "K2"],
           "key2": ["K0", "K1", "K0", "K1"],
           "A": ["A0", "A1", "A2", "A3"],
           "B": ["B0", "B1", "B2", "B3"],
       }
   )
   

right = pd.DataFrame(
       {
           "key1": ["K0", "K1", "K1", "K2"],
           "key2": ["K0", "K0", "K0", "K0"],
           "C": ["C0", "C1", "C2", "C3"],
           "D": ["D0", "D1", "D2", "D3"],
       }
   )

pd.merge(left,right,how='inner',on = ["key1","key2"],indicator=True)

  key1 key2   A   B   C   D _merge
0   K0   K0  A0  B0  C0  D0   both
1   K1   K0  A2  B2  C1  D1   both
2   K1   K0  A2  B2  C2  D2   both

pd.merge(left,right,how='left',on = ["key1","key2"],indicator=True)

  key1 key2   A   B    C    D     _merge
0   K0   K0  A0  B0   C0   D0       both
1   K0   K1  A1  B1  NaN  NaN  left_only
2   K1   K0  A2  B2   C1   D1       both
3   K1   K0  A2  B2   C2   D2       both
4   K2   K1  A3  B3  NaN  NaN  left_only

pd.merge(left,right,how='right',on = ["key1","key2"],indicator=True)

  key1 key2    A    B   C   D      _merge
0   K0   K0   A0   B0  C0  D0        both
1   K1   K0   A2   B2  C1  D1        both
2   K1   K0   A2   B2  C2  D2        both
3   K2   K0  NaN  NaN  C3  D3  right_only

pd.merge(left,right,how='outer',on = ["key1","key2"],indicator=True)

  key1 key2    A    B    C    D      _merge
0   K0   K0   A0   B0   C0   D0        both
1   K0   K1   A1   B1  NaN  NaN   left_only
2   K1   K0   A2   B2   C1   D1        both
3   K1   K0   A2   B2   C2   D2        both
4   K2   K1   A3   B3  NaN  NaN   left_only
5   K2   K0  NaN  NaN   C3   D3  right_only

df1 = pd.DataFrame(
        {
            "A": [1,2]
        }
    )
df2 = pd.DataFrame(
        {
            "B": ['a', 'b', 'c']
        }
    )

df1['key'] = 1
df2['key'] = 1

df_cross_join = pd.merge(df1,df2,on='key').drop("key",1)
df_cross_join

Filter joins

Filtering joins keep observations from left-hand dataframe. They match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:

semi_join(x, y) keeps all observations in x that have a match in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
anti_join(x, y) drops all observations in x that have a match in y, keeping just columns from x.

# semi_join
df = (pd.merge(left,right,how='outer',on = ["key1","key2"],indicator=True)
        .query("_merge in ['both']")
     )

df

  key1 key2   A   B   C   D _merge
0   K0   K0  A0  B0  C0  D0   both
2   K1   K0  A2  B2  C1  D1   both
3   K1   K0  A2  B2  C2  D2   both

# anti_join
df = (pd.merge(left,right,how='outer',on = ["key1","key2"],indicator=True)
        .query("_merge == 'left_only'")
     )

df

  key1 key2   A   B    C    D     _merge
1   K0   K1  A1  B1  NaN  NaN  left_only
4   K2   K1  A3  B3  NaN  NaN  left_only

Set operations

The final type of two-table verb is set operations. These expect the x and y inputs to have the same variables, and treat the observations like sets:

intersect(x, y): return only observations in both x and y
union(x, y): return unique observations in x and y
setdiff(x, y): return observations in x, but not in y.

df1 = pd.DataFrame(
        {
            "A": ["A0", "A1", "A2", "A3"],
            "B": ["B0", "B1", "B2", "B3"],
            "C": ["C0", "C1", "C2", "C3"],
            "D": ["D0", "D1", "D2", "D3"],
        },
        index=[0, 1, 2, 3],
    )

df2 = pd.DataFrame(
        {
            "B": ["B2", "B3", "B6", "B7"],
            "D": ["D2", "D3", "D6", "D7"],
            "F": ["F2", "F3", "F6", "F7"],
        },
        index=[2, 3, 6, 7],
    )

df_intersec = pd.concat([df1,df2],join='outer',axis=1)
df_intersec

     A    B    C    D    B    D    F
0   A0   B0   C0   D0  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN
2   A2   B2   C2   D2   B2   D2   F2
3   A3   B3   C3   D3   B3   D3   F3
6  NaN  NaN  NaN  NaN   B6   D6   F6
7  NaN  NaN  NaN  NaN   B7   D7   F7

df_intersec = pd.concat([df1,df2],join='inner',axis=1)
df_intersec

    A   B   C   D   B   D   F
2  A2  B2  C2  D2  B2  D2  F2
3  A3  B3  C3  D3  B3  D3  F3

pd.concat([df1, df2.reindex(df1.index)], axis=1)

    A   B   C   D    B    D    F
0  A0  B0  C0  D0  NaN  NaN  NaN
1  A1  B1  C1  D1  NaN  NaN  NaN
2  A2  B2  C2  D2   B2   D2   F2
3  A3  B3  C3  D3   B3   D3   F3

pd.concat([df1, df2], ignore_index=True, sort=False)

     A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2  NaN
3   A3  B3   C3  D3  NaN
4  NaN  B2  NaN  D2   F2
5  NaN  B3  NaN  D3   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7

Python SQL

Ray Sun

Data Analytics Professional

My interests include AI/ML and data analytics.