Tuesday, May 5, 2020

Pandas: Common 1





#######


pandas common 2

astype()

info()

nunique()

nunique()

value_sort()

rank()

Convert data type in with astype() method

In [68]:
import pandas as pd
df = pd.read_csv("test5.csv")
df
Out[68]:
number name species
0 1.0 joe human
1 2.0 john human
2 3.0 mike human
3 4.0 didi cat
4 5.0 aaron human
5 6.0 boo dog
6 7.0 ziggy dog
7 8.0 balou gorilla
In [69]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   number   8 non-null      float64
 1   name     8 non-null      object 
 2   species  8 non-null      object 
dtypes: float64(1), object(2)
memory usage: 320.0+ bytes
In [70]:
import pandas as pd
df = pd.read_csv("test5.csv")
df["number"].astype("int")
#no inplace parameter, so assign it back to itself
#df = df["number"].astype("int")
#df.info()
#After converted, it is becoming a Series.
df["number"] = df["number"].astype("int")
df.info()
df[["name","number"]]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   number   8 non-null      int64 
 1   name     8 non-null      object
 2   species  8 non-null      object
dtypes: int64(1), object(2)
memory usage: 320.0+ bytes
Out[70]:
name number
0 joe 1
1 john 2
2 mike 3
3 didi 4
4 aaron 5
5 boo 6
6 ziggy 7
7 balou 8
In [71]:
import pandas as pd
df = pd.read_csv("test5.csv")
df["number"].astype("int")
#no inplace parameter, so assign it back to itself
df = df["number"].astype("int")
#df.info()
#After converted, it is becoming a Series.
#watch out this is reassigning back to entire dataframe
df
Out[71]:
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
Name: number, dtype: int64
In [ ]:

using nunique() method like select count(distinct COLUMN ) in sql

In [72]:
import pandas as pd
df = pd.read_csv("test5.csv")
#df["species"].nunique()
df["species"].nunique()
Out[72]:
4
unique() to list the category ... similar to the following of Categorical at the pd level. at df level uses nunique() (n for number) and unique() methods
In [73]:
import pandas as pd
df = pd.read_csv("test5.csv")
#df["species"].nunique()
df["species"].unique()
Out[73]:
array(['human', 'cat', 'dog', 'gorilla'], dtype=object)

using Categorical to list out the category like select distinct() in sql

In [74]:
import pandas as pd
df = pd.read_csv("test5.csv")
pd.Categorical(df["species"].unique())
Out[74]:
[human, cat, dog, gorilla]
Categories (4, object): [cat, dog, gorilla, human]

usign .value_sort() like ORDER BY in sql

need at least one parameter which is the column. This can be multi-columns sort. It supports inplace.
In [80]:
import pandas as pd
df = pd.read_csv("test5.csv")
df.sort_values(by = "name", ascending ="False")
Out[80]:
number name species
4 5.0 aaron human
7 8.0 balou gorilla
5 6.0 boo dog
3 4.0 didi cat
0 1.0 joe human
1 2.0 john human
2 3.0 mike human
6 7.0 ziggy dog
In [87]:
#multicolumns sort with ascending true or false
df.sort_values(by=["name", "species"], ascending=[False,True])
Out[87]:
number name species
6 7.0 ziggy dog
2 3.0 mike human
1 2.0 john human
0 1.0 joe human
3 4.0 didi cat
5 6.0 boo dog
7 8.0 balou gorilla
4 5.0 aaron human

using .rank() like TopN Analysis in RANK() in SQL

In [89]:
import pandas as pd
df = pd.read_csv('test5.csv')
#rank all columns individually from the dataframe. Interesting that none numerical can be ranked as well
df.rank()
Out[89]:
number name species
0 1.0 5.0 6.5
1 2.0 6.0 6.5
2 3.0 7.0 6.5
3 4.0 4.0 1.0
4 5.0 1.0 6.5
5 6.0 3.0 2.5
6 7.0 8.0 2.5
7 8.0 2.0 4.0
In [92]:
#rank individual column provided by the dataframe
df["number"].rank(ascending=False)
Out[92]:
0    8.0
1    7.0
2    6.0
3    5.0
4    4.0
5    3.0
6    2.0
7    1.0
Name: number, dtype: float64
In [99]:
#this works like TopN Analysis in sql where only 3 top to be queried, inplaced it into original column
df["number"] = df["number"].rank(ascending=True).head(3)
#drop out the ones that not selected from presenting to the output
df.dropna()
Out[99]:
number name species
0 1.0 joe human
1 2.0 john human
2 3.0 mike human
In [ ]:
 

No comments:

Post a Comment

Pandas: SQL Like pandas operations

Pandas's SQL Like operations such as WHERE clause. = != >= str.contains() & | .isin() .isnull() .notnull() ....