Friday, May 29, 2020

pandas: merge and join

Example of various merge with default indexes and columns between tables.






merge_join
In [3]:
import pandas as pd
import numpy as np
Table0 = pd.read_csv("W0.csv")
Table1 = pd.read_csv("W1.csv")
Table1_color = Table1.style.format(None).highlight_null("orange")
Table2 = pd.read_csv("W2.csv")
Table2_color = Table2.style.format(None).highlight_null("lime")
In [186]:
Table0
Out[186]:
customerID OrderID Product Company_Name
0 1234 9991 laptop Dell
1 2345 9992 desktop HP
2 3456 9993 monitor LG
In [187]:
Table1_color
Out[187]:
customerID OrderID Product Company_Name
0 1234.000000 9991.000000 laptop Dell
1 2345.000000 9992.000000 desktop HP
2 3456.000000 9993.000000 monitor LG
3 nan 8888.000000 trackpad Apple
4 4567.000000 9994.000000 keyboard Anker
5 123.000000 nan nan Apple
6 1234.000000 9990.000000 tablet Apple
In [188]:
Table2_color
Out[188]:
customerID OrderID Product Company_Name
0 1234.000000 9991.000000 laptop Dell
1 2345.000000 9992.000000 desktop HP
2 3456.000000 9993.000000 monitor LG
3 4567.000000 9994.000000 keyboard Anker
4 5678.000000 9995.000000 pen Ballpoint
5 6789.000000 9997.000000 mouse Logitech
6 nan 9998.000000 cup Nobrand
7 8901.000000 9999.000000 nan Logitech
8 8000.000000 nan printer HP
In [161]:
Table0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     3 non-null      int64 
 1    OrderID       3 non-null      int64 
 2    Product       3 non-null      object
 3    Company_Name  3 non-null      object
dtypes: int64(2), object(2)
memory usage: 224.0+ bytes
In [162]:
Table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   customerID     5 non-null      float64
 1    OrderID       5 non-null      float64
 2    Product       5 non-null      object 
 3    Company_Name  6 non-null      object 
dtypes: float64(2), object(2)
memory usage: 320.0+ bytes
In [163]:
Table2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   customerID     8 non-null      float64
 1    OrderID       8 non-null      float64
 2    Product       8 non-null      object 
 3    Company_Name  9 non-null      object 
dtypes: float64(2), object(2)
memory usage: 416.0+ bytes

Note:

Merging 2 tables with different data type can be tricky and buggy. In this situation, we have Table1["customerID"] as int64, Table2["customerID"] as float64. Both of them can be merged but only after I inserted " ' (Nan) into Table2. Otherwise, the data type is an object type. A better procedure will be to just convert the NaN into some bogus numbers such as 0 or 99999999 prior to joining with fillna(0)

When a table contain null (aka NaN), it turns the column from int64 to float64. So, another workaround is to add another NaN row into it.

In [164]:
Table3 = Table2["customerID"].astype(np.int64)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-164-a6051e65159b> in <module>
----> 1 Table3 = Table2["customerID"].astype(np.int64)
      2 

/Library/Python/3.7/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5696         else:
   5697             # else, only a single dtype is given
-> 5698             new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
   5699             return self._constructor(new_data).__finalize__(self)
   5700 

/Library/Python/3.7/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    580 
    581     def astype(self, dtype, copy: bool = False, errors: str = "raise"):
--> 582         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    583 
    584     def convert(self, **kwargs):

/Library/Python/3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, filter, **kwargs)
    440                 applied = b.apply(f, **kwargs)
    441             else:
--> 442                 applied = getattr(b, f)(**kwargs)
    443             result_blocks = _extend_blocks(applied, result_blocks)
    444 

/Library/Python/3.7/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    623             vals1d = values.ravel()
    624             try:
--> 625                 values = astype_nansafe(vals1d, dtype, copy=True)
    626             except (ValueError, TypeError):
    627                 # e.g. astype_nansafe can fail on object-dtype of strings

/Library/Python/3.7/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    866 
    867         if not np.isfinite(arr).all():
--> 868             raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
    869 
    870     elif is_object_dtype(arr):

ValueError: Cannot convert non-finite values (NA or inf) to integer
In [167]:
Table0.merge(Table1)
Out[167]:
customerID OrderID Product Company_Name
0 1234 9991 laptop Dell
1 2345 9992 desktop HP
2 3456 9993 monitor LG
In [168]:
Table1.merge(Table2)
Out[168]:
customerID OrderID Product Company_Name
0 1234.0 9991.0 laptop Dell
1 2345.0 9992.0 desktop HP
2 3456.0 9993.0 monitor LG
3 4567.0 9994.0 keyboard Anker

Note

If not specifying the how, the default is 'inner' join

In [169]:
Table2.merge(Table1, how = 'inner')
Out[169]:
customerID OrderID Product Company_Name
0 1234.0 9991.0 laptop Dell
1 2345.0 9992.0 desktop HP
2 3456.0 9993.0 monitor LG
3 4567.0 9994.0 keyboard Anker
In [170]:
Table1.merge(Table2, how='inner', on = ["customerID"],    suffixes=('_Table1', '_Table2'))
Out[170]:
customerID OrderID_Table1 Product_Table1 Company_Name_Table1 OrderID_Table2 Product_Table2 Company_Name_Table2
0 1234.0 9991.0 laptop Dell 9991.0 laptop Dell
1 2345.0 9992.0 desktop HP 9992.0 desktop HP
2 3456.0 9993.0 monitor LG 9993.0 monitor LG
3 NaN 8888.0 trackpad Apple 9998.0 cup Nobrand
4 4567.0 9994.0 keyboard Anker 9994.0 keyboard Anker
In [171]:
Table1.merge(Table2, how='left', on = ["customerID"])
Out[171]:
customerID OrderID_x Product_x Company_Name_x OrderID_y Product_y Company_Name_y
0 1234.0 9991.0 laptop Dell 9991.0 laptop Dell
1 2345.0 9992.0 desktop HP 9992.0 desktop HP
2 3456.0 9993.0 monitor LG 9993.0 monitor LG
3 NaN 8888.0 trackpad Apple 9998.0 cup Nobrand
4 4567.0 9994.0 keyboard Anker 9994.0 keyboard Anker
5 123.0 NaN NaN Apple NaN NaN NaN
In [172]:
Table1.merge(Table2, on = ["customerID"])
Out[172]:
customerID OrderID_x Product_x Company_Name_x OrderID_y Product_y Company_Name_y
0 1234.0 9991.0 laptop Dell 9991.0 laptop Dell
1 2345.0 9992.0 desktop HP 9992.0 desktop HP
2 3456.0 9993.0 monitor LG 9993.0 monitor LG
3 NaN 8888.0 trackpad Apple 9998.0 cup Nobrand
4 4567.0 9994.0 keyboard Anker 9994.0 keyboard Anker
In [173]:
Table1.merge(Table2, how='right', on = ["customerID"])
Out[173]:
customerID OrderID_x Product_x Company_Name_x OrderID_y Product_y Company_Name_y
0 1234.0 9991.0 laptop Dell 9991.0 laptop Dell
1 2345.0 9992.0 desktop HP 9992.0 desktop HP
2 3456.0 9993.0 monitor LG 9993.0 monitor LG
3 NaN 8888.0 trackpad Apple 9998.0 cup Nobrand
4 4567.0 9994.0 keyboard Anker 9994.0 keyboard Anker
5 5678.0 NaN NaN NaN 9995.0 pen Ballpoint
6 6789.0 NaN NaN NaN 9997.0 mouse Logitech
7 8901.0 NaN NaN NaN 9999.0 NaN Logitech
8 8000.0 NaN NaN NaN NaN printer HP
In [174]:
Table1.merge(Table2, on = ["customerID"])
Out[174]:
customerID OrderID_x Product_x Company_Name_x OrderID_y Product_y Company_Name_y
0 1234.0 9991.0 laptop Dell 9991.0 laptop Dell
1 2345.0 9992.0 desktop HP 9992.0 desktop HP
2 3456.0 9993.0 monitor LG 9993.0 monitor LG
3 NaN 8888.0 trackpad Apple 9998.0 cup Nobrand
4 4567.0 9994.0 keyboard Anker 9994.0 keyboard Anker

merge does not work exactly the same as join

In [175]:
Table1.join(Table2, how='left', on = ["customerID"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-175-b961be48339a> in <module>
----> 1 Table1.join(Table2, how='left', on = ["customerID"])

/Library/Python/3.7/site-packages/pandas/core/frame.py in join(self, other, on, how, lsuffix, rsuffix, sort)
   7207         """
   7208         return self._join_compat(
-> 7209             other, on=on, how=how, lsuffix=lsuffix, rsuffix=rsuffix, sort=sort
   7210         )
   7211 

/Library/Python/3.7/site-packages/pandas/core/frame.py in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   7230                 right_index=True,
   7231                 suffixes=(lsuffix, rsuffix),
-> 7232                 sort=sort,
   7233             )
   7234         else:

/Library/Python/3.7/site-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     86         validate=validate,
     87     )
---> 88     return op.get_result()
     89 
     90 

/Library/Python/3.7/site-packages/pandas/core/reshape/merge.py in get_result(self)
    647 
    648         llabels, rlabels = _items_overlap_with_suffix(
--> 649             ldata.items, lsuf, rdata.items, rsuf
    650         )
    651 

/Library/Python/3.7/site-packages/pandas/core/reshape/merge.py in _items_overlap_with_suffix(left, lsuffix, right, rsuffix)
   2024         raise ValueError(
   2025             "columns overlap but no suffix specified: "
-> 2026             "{rename}".format(rename=to_rename)
   2027         )
   2028 

ValueError: columns overlap but no suffix specified: Index(['customerID', ' OrderID', ' Product', ' Company_Name'], dtype='object')
In [176]:
Table0.merge(Table1).merge(Table2)
Out[176]:
customerID OrderID Product Company_Name
0 1234 9991 laptop Dell
1 2345 9992 desktop HP
2 3456 9993 monitor LG

Multi tables joins

In [177]:
Table0.merge(Table1, on = "customerID").merge(Table2, on = 'customerID')
Out[177]:
customerID OrderID_x Product_x Company_Name_x OrderID_y Product_y Company_Name_y OrderID Product Company_Name
0 1234 9991 laptop Dell 9991.0 laptop Dell 9991.0 laptop Dell
1 2345 9992 desktop HP 9992.0 desktop HP 9992.0 desktop HP
2 3456 9993 monitor LG 9993.0 monitor LG 9993.0 monitor LG

multi-columns join

In [240]:
Table1.merge(Table2, how = 'inner', on = ["customerID", "OrderID"])
Out[240]:
customerID OrderID Product_x Company_Name_x Product_y Company_Name_y
0 1234.0 9991.0 laptop Dell laptop Dell
1 2345.0 9992.0 desktop HP desktop HP
2 3456.0 9993.0 monitor LG monitor LG
3 4567.0 9994.0 keyboard Anker keyboard Anker
In [243]:
Table1
Out[243]:
customerID OrderID Product Company_Name
0 1234.0 9991.0 laptop Dell
1 2345.0 9992.0 desktop HP
2 3456.0 9993.0 monitor LG
3 NaN 8888.0 trackpad Apple
4 4567.0 9994.0 keyboard Anker
5 123.0 NaN NaN Apple
6 1234.0 9990.0 tablet Apple
In [244]:
Table2
Out[244]:
customerID OrderID Product Company_Name
0 1234.0 9991.0 laptop Dell
1 2345.0 9992.0 desktop HP
2 3456.0 9993.0 monitor LG
3 4567.0 9994.0 keyboard Anker
4 5678.0 9995.0 pen Ballpoint
5 6789.0 9997.0 mouse Logitech
6 NaN 9998.0 cup Nobrand
7 8901.0 9999.0 NaN Logitech
8 8000.0 NaN printer HP
In [245]:
Table1.merge(Table2, how = "outer", on = "customerID")
Out[245]:
customerID OrderID_x Product_x Company_Name_x OrderID_y Product_y Company_Name_y
0 1234.0 9991.0 laptop Dell 9991.0 laptop Dell
1 1234.0 9990.0 tablet Apple 9991.0 laptop Dell
2 2345.0 9992.0 desktop HP 9992.0 desktop HP
3 3456.0 9993.0 monitor LG 9993.0 monitor LG
4 NaN 8888.0 trackpad Apple 9998.0 cup Nobrand
5 4567.0 9994.0 keyboard Anker 9994.0 keyboard Anker
6 123.0 NaN NaN Apple NaN NaN NaN
7 5678.0 NaN NaN NaN 9995.0 pen Ballpoint
8 6789.0 NaN NaN NaN 9997.0 mouse Logitech
9 8901.0 NaN NaN NaN 9999.0 NaN Logitech
10 8000.0 NaN NaN NaN NaN printer HP
In [246]:
Table1.dtypes
Out[246]:
customerID      float64
OrderID         float64
Product          object
Company_Name     object
dtype: object
In [247]:
Table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   customerID    6 non-null      float64
 1   OrderID       6 non-null      float64
 2   Product       6 non-null      object 
 3   Company_Name  7 non-null      object 
dtypes: float64(2), object(2)
memory usage: 352.0+ bytes
In [254]:
import pandas as pd
contact = pd.read_csv("contact.csv")
contact
Out[254]:
id phone email
0 1234 3031234567 bk@company.com
1 2345 8018761235 ak@dell.com
2 3456 5053245432 gogo@apple.com

Note : Reason for this error because there is no common column name in both tables to be joined.

MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
In [257]:
Table1.merge(contact)
---------------------------------------------------------------------------
MergeError                                Traceback (most recent call last)
<ipython-input-257-9504307fbf66> in <module>
----> 1 Table1.merge(contact)

/Library/Python/3.7/site-packages/pandas/core/frame.py in merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
   7295             copy=copy,
   7296             indicator=indicator,
-> 7297             validate=validate,
   7298         )
   7299 

/Library/Python/3.7/site-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     84         copy=copy,
     85         indicator=indicator,
---> 86         validate=validate,
     87     )
     88     return op.get_result()

/Library/Python/3.7/site-packages/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    618             warnings.warn(msg, UserWarning)
    619 
--> 620         self._validate_specification()
    621 
    622         # note this function has side effects

/Library/Python/3.7/site-packages/pandas/core/reshape/merge.py in _validate_specification(self)
   1196                             ron=self.right_on,
   1197                             lidx=self.left_index,
-> 1198                             ridx=self.right_index,
   1199                         )
   1200                     )

MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
In [262]:
Table1.merge(contact, left_on ="customerID", right_on="id", sort = True).drop("id", axis="columns")
Out[262]:
customerID OrderID Product Company_Name phone email
0 1234.0 9991.0 laptop Dell 3031234567 bk@company.com
1 1234.0 9990.0 tablet Apple 3031234567 bk@company.com
2 2345.0 9992.0 desktop HP 8018761235 ak@dell.com
3 3456.0 9993.0 monitor LG 5053245432 gogo@apple.com

Similar to calling merge under dataframe level. This is a level above and at the pandas level.

In [4]:
pd.merge(Table1, Table2)
Out[4]:
customerID OrderID Product Company_Name
0 1234.0 9991.0 laptop Dell
1 2345.0 9992.0 desktop HP
2 3456.0 9993.0 monitor LG
3 4567.0 9994.0 keyboard Anker
In [ ]:
 

Pandas: SQL Like pandas operations

Pandas's SQL Like operations such as WHERE clause. = != >= str.contains() & | .isin() .isnull() .notnull() ....