Pandas common 3: filtering like a WHERE clause in SQL¶

Dataframe¶

number,name,species,dob
1.0,joe,human,10-02-1977
2.0,john,human,01-04-1954
3.0,mike,human,04-29-1966
4.0,didi,cat,07-12-2019
5.0,aaron,human,09-08-1990
6.0,boo,dog,02-03-2015
7.0,ziggy,dog,08-09-2010
8.0,balou,gorilla,12-10-2005
9.0,"",""

import pandas as pd
df = pd.read_csv("test5.csv")
df

"==" works like the SQL "=" operator¶

This can be done with .query() as well¶

df["name"] == "joe"

0     True
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
Name: name, dtype: bool

#build the boolean mask first, then pass it to the dataframe to keep only the rows that are True
df[df["species"] == "dog"]

#assign the mask to a variable and reuse it; pick columns with a list of column names
animal = df["species"] == "dog"
df[animal]
df[["species","name"]]
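The heading above mentions that the same filter can be written with .query(). A minimal sketch of both forms, assuming a small hand-built frame that stands in for two columns of test5.csv so the snippet runs on its own:

```python
import pandas as pd

# Stand-in for test5.csv, rebuilt inline so the snippet is self-contained.
df = pd.DataFrame({
    "name": ["joe", "john", "mike", "didi", "aaron", "boo", "ziggy", "balou", None],
    "species": ["human", "human", "human", "cat", "human", "dog", "dog", "gorilla", None],
})

# Boolean-mask form: SELECT * FROM df WHERE species = 'dog'
by_mask = df[df["species"] == "dog"]

# The same filter written as a query string.
by_query = df.query('species == "dog"')

# Filter rows and pick columns in one step:
# SELECT species, name FROM df WHERE species = 'dog'
dogs = df.loc[df["species"] == "dog", ["species", "name"]]
print(dogs)
```

Using .loc with a mask and a column list keeps the row filter and the column selection in a single expression.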

"!=" as "!=" operator¶

can be done with .query() as well¶

animal = df["species"] != "dog"
df[animal]

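note: a date stored as a string is compared as text, so df["date"] >= "2020-01-01" is chronological only for ISO (YYYY-MM-DD) strings; dob here is MM-DD-YYYY, so parse it first¶

#parse the MM-DD-YYYY strings into datetimes, then compare against a real date
greater_date = pd.to_datetime(df["dob"], format="%m-%d-%Y") >= pd.Timestamp("2005-09-09")
df[greater_date]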

greater_number = df["number"] >= 4.0
df[greater_number]
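To see why the date format matters, here is a small sketch comparing a plain string comparison with a datetime comparison on the dob values. It assumes the dates are month-day-year (which "04-29-1966" suggests) and rebuilds the eight non-missing rows inline:

```python
import pandas as pd

# The eight rows of test5.csv that have a dob, rebuilt inline.
df = pd.DataFrame({
    "name": ["joe", "john", "mike", "didi", "aaron", "boo", "ziggy", "balou"],
    "dob": ["10-02-1977", "01-04-1954", "04-29-1966", "07-12-2019",
            "09-08-1990", "02-03-2015", "08-09-2010", "12-10-2005"],
})

# Text comparison: goes character by character, so the month decides first.
string_mask = df["dob"] >= "09-09-2005"
print(df.loc[string_mask, "name"].tolist())   # ['joe', 'balou']

# Datetime comparison: parse with an explicit format, then compare dates.
parsed = pd.to_datetime(df["dob"], format="%m-%d-%Y")
date_mask = parsed >= pd.Timestamp("2005-09-09")
print(df.loc[date_mask, "name"].tolist())     # ['didi', 'boo', 'ziggy', 'balou']
```

The string version keeps joe (born 1977) only because "10" sorts after "09", so it is not a chronological filter for this format.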

use "str.contains()" as the "LIKE" operator¶

note: other noteworthy parameters for str.contains() are case=False (ignore case) and na=False (treat missing values as False)¶

#contains "or" anywhere, like LIKE '%or%'
df["species"].str.contains("or")

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7     True
8      NaN
Name: species, dtype: object

#starts with d, like LIKE 'd%'
df["species"].str.startswith("d")

0    False
1    False
2    False
3    False
4    False
5     True
6     True
7    False
8      NaN
Name: species, dtype: object

#ends with n, like LIKE '%n'
df["species"].str.endswith("n")

0     True
1     True
2     True
3    False
4     True
5    False
6    False
7    False
8      NaN
Name: species, dtype: object
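The outputs above contain NaN for the row with no species, and a mask with NaN in it is awkward to filter with. A short sketch of the case=False and na=False parameters mentioned in the note, using the species column rebuilt inline as a Series:

```python
import pandas as pd

# The species column of test5.csv, rebuilt inline (last value is missing).
species = pd.Series(["human", "human", "human", "cat", "human",
                     "dog", "dog", "gorilla", None])

# na=False turns the missing row into False, so the result is a plain
# boolean mask; case=False makes "OR" match "or" as well.
mask = species.str.contains("OR", case=False, na=False)
print(species[mask].tolist())   # ['gorilla']

# The pattern is a regular expression by default; regex=False treats it
# as literal text, which is closer to LIKE '%or%'.
literal_mask = species.str.contains("or", regex=False, na=False)
```

startswith() and endswith() accept na=False in the same way.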

"&" as "AND" operator¶

greater_number = df["number"] >= 4.0
species_type = df["species"] == "dog"
df[greater_number & species_type]
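When the two conditions are written inline instead of being stored in variables, each one needs its own parentheses, and Python's `and` cannot be used in place of `&`. A minimal sketch, with the frame rebuilt inline:

```python
import pandas as pd

# Stand-in for test5.csv, rebuilt inline.
df = pd.DataFrame({
    "number": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "name": ["joe", "john", "mike", "didi", "aaron", "boo", "ziggy", "balou", None],
    "species": ["human", "human", "human", "cat", "human", "dog", "dog", "gorilla", None],
})

# & binds tighter than >= and ==, so wrap each comparison in parentheses.
dogs_from_4 = df[(df["number"] >= 4.0) & (df["species"] == "dog")]
print(dogs_from_4["name"].tolist())   # ['boo', 'ziggy']

# `and` asks for the truth value of a whole Series, which pandas refuses.
try:
    df[(df["number"] >= 4.0) and (df["species"] == "dog")]
    outcome = "no error"
except ValueError:
    outcome = "ValueError"
print(outcome)                        # ValueError
```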

" | " as "OR" operator in SQL¶

greater_number = df["number"] >= 4.0
species_type = df["species"] == "dog"
df[greater_number | species_type]

greater_number = df["number"] == 4.0
species_type = df["species"] == "dog"
#"==" between two masks is True where both masks hold the same value (both True or both False), so it is not AND
#no row here is True in both masks, so this returns the rows where both masks are False
df[greater_number == species_type]

greater_number = df["number"] >= 4.0
species_type = df["species"] == "dog"
df[species_type]

df["number"] == 4.0

0    False
1    False
2    False
3     True
4    False
5    False
6    False
7    False
8    False
Name: number, dtype: bool

df["species"] == "dog"

0    False
1    False
2    False
3    False
4    False
5     True
6     True
7    False
8    False
Name: species, dtype: bool
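Lining up the two masks printed above shows what comparing them with "==" does: it keeps the rows where the masks agree, including rows where both are False. A small sketch contrasting it with "&" (variable names here are illustrative):

```python
import pandas as pd

# Stand-in for test5.csv, rebuilt inline.
df = pd.DataFrame({
    "number": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "species": ["human", "human", "human", "cat", "human", "dog", "dog", "gorilla", None],
})

number_is_4 = df["number"] == 4.0   # True only for row 3
is_dog = df["species"] == "dog"     # True only for rows 5 and 6

# "==" keeps rows where the two masks have the same value.
same_value = df[number_is_4 == is_dog]
print(same_value["number"].tolist())   # [1.0, 2.0, 3.0, 5.0, 8.0, 9.0]

# "&" keeps rows where both masks are True; there are none here.
both_true = df[number_is_4 & is_dog]
print(len(both_true))                  # 0
```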

".isin()" as "in" operator¶

isin = df["species"].isin(["dog","gorilla"])
df[isin]
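The isin() mask can be inverted with "~" to get something close to NOT IN. A minimal sketch with the frame rebuilt inline; note one difference from SQL that is easy to miss:

```python
import pandas as pd

# Stand-in for test5.csv, rebuilt inline.
df = pd.DataFrame({
    "name": ["joe", "john", "mike", "didi", "aaron", "boo", "ziggy", "balou", None],
    "species": ["human", "human", "human", "cat", "human", "dog", "dog", "gorilla", None],
})

is_in = df["species"].isin(["dog", "gorilla"])

# IN: rows whose species is in the list.
print(df[is_in]["name"].tolist())   # ['boo', 'ziggy', 'balou']

# NOT IN: invert the mask with ~.
not_in = df[~is_in]
print(len(not_in))                  # 6
```

The row with a missing species is kept by `~is_in`, because isin() returns False for it. In SQL, `NULL NOT IN (...)` is not true, so that row would be filtered out.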

"isnull()" as "IS NULL" operator¶

isnull = df["species"].isnull()
df[isnull]
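As with `= NULL` in SQL, comparing a column to NaN with "==" never matches, which is why a dedicated method is needed. A small sketch using the species column rebuilt inline:

```python
import pandas as pd

# The species column of test5.csv, rebuilt inline (last value is missing).
species = pd.Series(["human", "human", "human", "cat", "human",
                     "dog", "dog", "gorilla", None])

# NaN is not equal to anything, itself included, so this is all False.
equals_nan = species == float("nan")
print(equals_nan.any())        # False

# isnull() finds the missing row; isna() is an alias with the same result.
is_null = species.isnull()
print(is_null.sum())           # 1
```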

"notnull" as "IS NOT NULL" operator¶

notnull = df["species"].notnull()
df[notnull]

"duplicated()" is somewhat similar to "select column, count(*) from table group by column having count(*) > 1"¶

#keep=False marks every row of a duplicated value. Other options are keep='first' or keep='last', which leave the first or last occurrence unmarked
duplicate = df["species"].duplicated(keep=False)
df[duplicate]
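duplicated() returns the rows themselves, while the SQL in the heading returns one count per value. A sketch of a closer match using groupby(), with the species column rebuilt inline:

```python
import pandas as pd

# Stand-in for the species column of test5.csv, rebuilt inline.
df = pd.DataFrame({
    "species": ["human", "human", "human", "cat", "human", "dog", "dog", "gorilla", None],
})

# SELECT species, COUNT(*) FROM df GROUP BY species
# (groupby skips the missing species by default)
counts = df.groupby("species").size()

# HAVING COUNT(*) > 1
repeated = counts[counts > 1]
print(repeated.to_dict())   # {'dog': 2, 'human': 4}

# duplicated(keep=False) marks the six rows behind those two values.
print(df["species"].duplicated(keep=False).sum())   # 6
```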

"drop_duplicates()" will drop the duplicates and print out what's left¶

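#keep=False drops every value that occurs more than once; the default keep='first' keeps one row per value, like SELECT DISTINCT
dropduplicate = df["species"].drop_duplicates(keep=False)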
dropduplicate

3        cat
7    gorilla
8        NaN
Name: species, dtype: object
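The output above has only three values because keep=False removes every value that appears more than once. A short sketch comparing it with the default keep="first", using the species column rebuilt inline:

```python
import pandas as pd

# The species column of test5.csv, rebuilt inline (last value is missing).
species = pd.Series(["human", "human", "human", "cat", "human",
                     "dog", "dog", "gorilla", None])

# Default keep="first": one row per distinct value, like SELECT DISTINCT.
distinct = species.drop_duplicates()
print(distinct.index.tolist())      # [0, 3, 5, 7, 8]

# keep=False: only values that occur exactly once survive.
only_unique = species.drop_duplicates(keep=False)
print(only_unique.index.tolist())   # [3, 7, 8]
```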

"between" as "BETWEEN" operator¶

between = df["number"].between(2,5)
df[between]
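Like SQL BETWEEN, between() includes both endpoints by default. A minimal sketch showing that it matches the two comparisons joined with "&", using the number column rebuilt inline:

```python
import pandas as pd

# The number column of test5.csv, rebuilt inline.
number = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# WHERE number BETWEEN 2 AND 5
by_between = number.between(2, 5)

# The same filter spelled out with two comparisons.
by_compare = (number >= 2) & (number <= 5)

print(number[by_between].tolist())   # [2.0, 3.0, 4.0, 5.0]
```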

"loc[a:b]" selects a range of index labels, like BETWEEN for the index¶

#the index labels are integers, so slice with integers; .loc includes both ends (rows 1, 2 and 3)
betweenInd = df.loc[1:3]
betweenInd
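A label slice with .loc includes both ends, which is what makes it behave like BETWEEN; a position slice with .iloc follows normal Python slicing and stops before the end. A small sketch on a five-row frame built inline:

```python
import pandas as pd

# First five names of test5.csv, with the default 0..4 integer index.
df = pd.DataFrame({"name": ["joe", "john", "mike", "didi", "aaron"]})

# .loc slices by label and includes both endpoints.
by_label = df.loc[1:3]
print(by_label["name"].tolist())      # ['john', 'mike', 'didi']

# .iloc slices by position and excludes the stop position.
by_position = df.iloc[1:3]
print(by_position["name"].tolist())   # ['john', 'mike']
```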

".query()" is a simplified form of the WHERE clause¶

df.query('species == "human"')
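A query string can also combine conditions with `and`/`or` and refer to Python variables with an "@" prefix. A minimal sketch with the frame rebuilt inline; `wanted` and `min_number` are illustrative names, not part of the original notebook:

```python
import pandas as pd

# Stand-in for test5.csv, rebuilt inline.
df = pd.DataFrame({
    "number": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "name": ["joe", "john", "mike", "didi", "aaron", "boo", "ziggy", "balou", None],
    "species": ["human", "human", "human", "cat", "human", "dog", "dog", "gorilla", None],
})

wanted = ["dog", "gorilla"]   # hypothetical filter values
min_number = 7                # hypothetical threshold

# WHERE species IN ('dog', 'gorilla') AND number >= 7
result = df.query("species in @wanted and number >= @min_number")
print(result["name"].tolist())   # ['ziggy', 'balou']
```

The equivalent mask form would be `df[df["species"].isin(wanted) & (df["number"] >= min_number)]`.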

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   number   9 non-null      float64
 1   name     8 non-null      object 
 2   species  8 non-null      object 
 3   dob      8 non-null      object 
dtypes: float64(1), object(3)
memory usage: 416.0+ bytes

df["number"].astype(int)
df.describe()
df["name"].notnull()
df[df["name"].notnull()]

Data Engineering

Tuesday, June 23, 2020

Pandas: SQL Like pandas operations


	number	name	species	dob
0	1.0	joe	human	10-02-1977
1	2.0	john	human	01-04-1954
2	3.0	mike	human	04-29-1966
3	4.0	didi	cat	07-12-2019
4	5.0	aaron	human	09-08-1990
5	6.0	boo	dog	02-03-2015
6	7.0	ziggy	dog	08-09-2010
7	8.0	balou	gorilla	12-10-2005
8	9.0	NaN	NaN	NaN