Tuesday, May 5, 2020

Pandas: Common 1








Default read of the original csv

In [11]:
import pandas as pd
df = pd.read_csv("test.csv")
df.style
Out[11]:
name email
0 Adam, Bobo adam@it.com
1 John, Doe johnd@it.com
2 Mary, Johnson maryh@it.com
3 Zikie, Bath zikieb@software.com
4 Beth, Matties bethmat@company.com
5 Dale, Carnegie DaleC@dale.com
6 nan nan
7 Vlad, Didi vladd@ru.com
8 nan noname@emailme.com
9 Mike, Typoon miket@runner.com

Using only one column, in this case the email column

In [18]:
import pandas as pd
df = pd.read_csv("test.csv", usecols = ["email"])
df
Out[18]:
email
0 adam@it.com
1 johnd@it.com
2 maryh@it.com
3 zikieb@software.com
4 bethmat@company.com
5 DaleC@dale.com
6 NaN
7 vladd@ru.com
8 noname@emailme.com
9 miket@runner.com

Selecting specific columns

Note 1: There are two ways to select specific columns: df.column and df["column"]. The first (attribute) style cannot select columns whose names contain spaces; see the sketch after this cell's output.

Note 2: Also watch out for the first two outputs: they are Series, so they print in a different, non-tabular format. The third selection passes a list of column names (hence the double square brackets) and returns a DataFrame; that list of columns can also be assigned to a variable first.

In [57]:
import pandas as pd

df = pd.read_csv("test.csv")
display(df.name)
display(df["email"])
display(df[["email","name"]])
0        Adam, Bobo
1         John, Doe
2     Mary, Johnson
3       Zikie, Bath
4     Beth, Matties
5    Dale, Carnegie
6               NaN
7        Vlad, Didi
8               NaN
9      Mike, Typoon
Name: name, dtype: object
0            adam@it.com
1           johnd@it.com
2           maryh@it.com
3    zikieb@software.com
4    bethmat@company.com
5         DaleC@dale.com
6                    NaN
7           vladd@ru.com
8     noname@emailme.com
9       miket@runner.com
Name: email, dtype: object
email name
0 adam@it.com Adam, Bobo
1 johnd@it.com John, Doe
2 maryh@it.com Mary, Johnson
3 zikieb@software.com Zikie, Bath
4 bethmat@company.com Beth, Matties
5 DaleC@dale.com Dale, Carnegie
6 NaN NaN
7 vladd@ru.com Vlad, Didi
8 noname@emailme.com NaN
9 miket@runner.com Mike, Typoon
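
As mentioned in Note 1, attribute-style access only works for simple column names. A minimal sketch, using a made-up DataFrame (the "full name" column is not part of test.csv), of why bracket notation is needed when a column name contains a space:

In [ ]:
import pandas as pd

# hypothetical DataFrame with a space in a column name (not part of test.csv)
df2 = pd.DataFrame({"full name": ["Adam, Bobo"], "email": ["adam@it.com"]})

display(df2["full name"])    # bracket notation handles the space
# df2.full name              # attribute access would be a syntax error here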

Using a custom column as the index, in this case the name column

In [23]:
import pandas as pd
df = pd.read_csv("test.csv", index_col="name")
df
Out[23]:
email
name
Adam, Bobo adam@it.com
John, Doe johnd@it.com
Mary, Johnson maryh@it.com
Zikie, Bath zikieb@software.com
Beth, Matties bethmat@company.com
Dale, Carnegie DaleC@dale.com
NaN NaN
Vlad, Didi vladd@ru.com
NaN noname@emailme.com
Mike, Typoon miket@runner.com
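
A quick follow-up sketch, assuming the same test.csv: once name is the index, rows can be looked up by label with .loc.

In [ ]:
import pandas as pd

df = pd.read_csv("test.csv", index_col="name")
# look up a row by its label instead of a numeric position
df.loc["Adam, Bobo"]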

Hiding the index. The key word is hide: this only affects the rendered output and is different from not having an index at all.

In [45]:
import pandas as pd
df = pd.read_csv("test.csv")
df.style.hide_index()
Out[45]:
name email
Adam, Bobo adam@it.com
John, Doe johnd@it.com
Mary, Johnson maryh@it.com
Zikie, Bath zikieb@software.com
Beth, Matties bethmat@company.com
Dale, Carnegie DaleC@dale.com
nan nan
Vlad, Didi vladd@ru.com
nan noname@emailme.com
Mike, Typoon miket@runner.com
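
A small sketch to underline the point above: hide_index() only changes how the table is rendered, while the DataFrame itself keeps its default RangeIndex.

In [ ]:
import pandas as pd

df = pd.read_csv("test.csv")
df.style.hide_index()   # hides the index in the rendered output only
print(df.index)         # the underlying DataFrame still has its RangeIndex
# RangeIndex(start=0, stop=10, step=1)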

The useful value_counts() method. It works like SELECT columnA, COUNT(*) FROM table GROUP BY columnA ORDER BY COUNT(*) DESC. A small example with duplicate values follows the output below.

In [64]:
import pandas as pd
df = pd.read_csv("test.csv")
df["name"].value_counts()
Out[64]:
Dale, Carnegie    1
Mary, Johnson     1
Vlad, Didi        1
Beth, Matties     1
Mike, Typoon      1
Adam, Bobo        1
Zikie, Bath       1
John, Doe         1
Name: name, dtype: int64
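
Since every name in test.csv is unique, all the counts above are 1. A minimal made-up Series with a duplicate shows the group-count behaviour more clearly:

In [ ]:
import pandas as pd

s = pd.Series(["John, Doe", "John, Doe", "Adam, Bobo"])   # made-up data with a duplicate
s.value_counts()
# John, Doe     2
# Adam, Bobo    1
# dtype: int64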

** Notice the difference between .value_counts() and .count(): .value_counts() is a group count, while .count() counts how many non-null rows there are, like SELECT COUNT(columnA) FROM table.

In [100]:
df["name"].count()
Out[100]:
8

*** Counting how many non-NULL/NaN values each row has. For example, index 0 has 2 values, while index 6 has none since both of its columns are NULL.

In [102]:
df.count(axis="columns")
Out[102]:
0    2
1    2
2    2
3    2
4    2
5    2
6    0
7    2
8    1
9    2
dtype: int64

Dealing with NULL rows: dropna() drops rows that contain NULLs.

In [84]:
import pandas as pd
df = pd.read_csv("test.csv")
# how="all" means drop a row only if ALL of its values are NULL
# the default is how="any": if any value in the row is NULL, the entire row is dropped
# inplace=True would overwrite the DataFrame with the result (a sketch follows this cell's output)
# keywords: all, any and inplace
df.dropna(how="all")  
Out[84]:
name email
0 Adam, Bobo adam@it.com
1 John, Doe johnd@it.com
2 Mary, Johnson maryh@it.com
3 Zikie, Bath zikieb@software.com
4 Beth, Matties bethmat@company.com
5 Dale, Carnegie DaleC@dale.com
7 Vlad, Didi vladd@ru.com
8 NaN noname@emailme.com
9 Mike, Typoon miket@runner.com
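The comments above mention inplace=True without using it; a minimal sketch of the difference, assuming the same test.csv:

In [ ]:
import pandas as pd

df = pd.read_csv("test.csv")
cleaned = df.dropna(how="all")        # returns a new DataFrame, df is untouched
df.dropna(how="all", inplace=True)    # modifies df itself and returns nothing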
In [89]:
# drop a row if any of its values is null. dropna() with no arguments is the same as dropna(how="any")
#df.dropna()
df.dropna(how="any")
Out[89]:
name email
0 Adam, Bobo adam@it.com
1 John, Doe johnd@it.com
2 Mary, Johnson maryh@it.com
3 Zikie, Bath zikieb@software.com
4 Beth, Matties bethmat@company.com
5 Dale, Carnegie DaleC@dale.com
7 Vlad, Didi vladd@ru.com
9 Mike, Typoon miket@runner.com
In [85]:
# drop an entire column if it contains any null. Here both columns end up dropped
print(df.dropna(axis = 1))
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [83]:
# drop any column that contains nulls. By default dropna drops rows, not columns
# axis="columns" is the same as axis=1
df.dropna(axis = "columns")
Out[83]:
0
1
2
3
4
5
6
7
8
9
In [82]:
# drop a row ONLY if the listed subset column(s) have nulls
df.dropna(subset=["name"])
Out[82]:
name email
0 Adam, Bobo adam@it.com
1 John, Doe johnd@it.com
2 Mary, Johnson maryh@it.com
3 Zikie, Bath zikieb@software.com
4 Beth, Matties bethmat@company.com
5 Dale, Carnegie DaleC@dale.com
7 Vlad, Didi vladd@ru.com
9 Mike, Typoon miket@runner.com
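
Related to the per-row count shown earlier, dropna() also accepts a thresh argument: the minimum number of non-null values a row must have to be kept. A small sketch; with only two columns, thresh=2 gives the same result as how="any".

In [ ]:
import pandas as pd

df = pd.read_csv("test.csv")
# keep only rows that have at least 2 non-null values (both name and email present)
df.dropna(thresh=2)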

fillna() replaces NULL values with some other value

In [97]:
#filling null cells with something
df.fillna(value="NULLIFIED")
Out[97]:
name email
0 Adam, Bobo adam@it.com
1 John, Doe johnd@it.com
2 Mary, Johnson maryh@it.com
3 Zikie, Bath zikieb@software.com
4 Beth, Matties bethmat@company.com
5 Dale, Carnegie DaleC@dale.com
6 NULLIFIED NULLIFIED
7 Vlad, Didi vladd@ru.com
8 NULLIFIED noname@emailme.com
9 Mike, Typoon miket@runner.com
In [98]:
# fillna on only one column or multiple columns: just select them with a list of column names (see the note after this output)
df[["name"]].fillna(value="FILL ME")
Out[98]:
name
0 Adam, Bobo
1 John, Doe
2 Mary, Johnson
3 Zikie, Bath
4 Beth, Matties
5 Dale, Carnegie
6 FILL ME
7 Vlad, Didi
8 FILL ME
9 Mike, Typoon
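
The cell above only returns a filled copy of the name column. A sketch of two ways to make per-column fills stick, assuming the same test.csv: pass a dict to fillna(), or assign the filled column back.

In [ ]:
import pandas as pd

df = pd.read_csv("test.csv")
# fill each column with its own value by passing a dict
df.fillna(value={"name": "FILL ME", "email": "no-email"})

# or assign a filled column back to the DataFrame to persist the change
df["name"] = df["name"].fillna("FILL ME")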

Colorize the null cells

In [22]:
import pandas as pd
df = pd.read_csv("test.csv")
#df.style
df.style.format(None, na_rep="NULL", subset=["name","email"]).highlight_null("orange")
Out[22]:
name email
0 Adam, Bobo adam@it.com
1 John, Doe johnd@it.com
2 Mary, Johnson maryh@it.com
3 Zikie, Bath zikieb@software.com
4 Beth, Matties bethmat@company.com
5 Dale, Carnegie DaleC@dale.com
6 NULL NULL
7 Vlad, Didi vladd@ru.com
8 NULL noname@emailme.com
9 Mike, Typoon miket@runner.com
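Worth noting: the Styler only affects the rendered output. na_rep and highlight_null do not change the underlying data, which still contains NaN. A quick check:

In [ ]:
import pandas as pd

df = pd.read_csv("test.csv")
df.style.format(None, na_rep="NULL").highlight_null("orange")   # display only
print(df.isnull().sum())   # the DataFrame itself still holds NaN values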
