Monday, May 20, 2019

DataEng: GCS: Experimenting Panda operations.

Playing with Panda APIs


How the full dataset looks like



import pandas as pd
from StringIO import StringIO
#create dataframe
data = pd.read_csv(StringIO(vpx_vm_data))
print(data)

      vm_id  ds_id                   updated_time
0        27   4098  2018-08-21 22:05:03.437000000
1       183    166  2018-10-08 14:43:10.690000000
2       184    172  2017-10-03 15:59:10.957000000
3       185    162  2017-10-04 01:46:15.417000000
4       187   4924  2017-11-28 21:11:14.127000000
5       188    165  2018-04-23 14:22:01.403000000
6       190    158  2017-10-03 18:00:11.513000000
7       192    173  2018-10-08 21:28:54.837000000
8       193   4100  2018-07-31 21:17:10.343000000
9       194   4925  2017-11-30 17:35:31.977000000
10      196    164  2018-07-08 01:41:37.240000000
11      198    162  2017-10-04 01:46:15.450000000
12      199   4924  2018-04-19 18:15:00.330000000
13      201    169  2018-04-23 14:21:37.287000000
14      203    157  2017-11-07 02:05:28.810000000
15      205    168  2018-02-09 19:31:50.713000000
16      207    158  2017-10-03 15:59:10.363000000
17      208    170  2017-10-03 20:32:52.197000000
18      209    168  2017-10-03 18:00:10.777000000
19      210    169  2017-12-19 14:30:03.007000000
20      211   4100  2018-08-31 16:10:02.320000000
21      212   4098  2018-08-21 22:05:04.547000000
22      213   4065  2018-07-30 21:07:11.120000000
23      214    167  2017-10-03 18:20:18.003000000
24      216   4924  2018-09-25 21:33:04.523000000
25      217    158  2018-04-23 14:20:04.123000000
26      218   5046  2018-04-19 18:14:37.350000000
27      221   4925  2018-08-02 18:37:33.573000000
28      239   6868  2018-05-25 02:55:54.327000000
29      242    158  2018-10-05 21:00:12.307000000
...     ...    ...                            ...
4228   7571   4061  2018-09-24 15:32:39.943000000


counts(*)

import pandas as pd
from StringIO import StringIO
#create dataframe
vm_df = pd.read_csv(StringIO(vpx_vm_data))
date_summary=vm_df.count()
print(date_summary.head())
vm_id           4258
ds_id           4258
updated_time    4258
dtype: int64

That is equivalent to the following. It represents all the columns in the same counts.
select count(*) from vpx_vm;


count(vm_id)

import pandas as pd
from StringIO import StringIO
#create dataframe
vm_df = pd.read_csv(StringIO(vpx_vm_data))
date_summary=vm_df[['vm_id']].count()
print(date_summary.head())
vm_id    4258
dtype: int64

That is equivalent to the following
select count(vm_id) from vpx_vm;


Count with every single column.

import pandas as pd
from StringIO import StringIO
#create dataframe
vm_df = pd.read_csv(StringIO(vpx_vm_data))
date_summary=vm_df.groupby('vm_id').count()
print(date_summary)



That is equivalent to the following.

select vm_id, count(vm_id), count(ds_id), count(updated_time) from EDA.vpx_vm group by vm_id, ds_id, updated_time order by vm_id;


Sorting and Where clause



import pandas as pd
from StringIO import StringIO
#create dataframe
data = pd.read_csv(StringIO(vpx_vm_data))
##print (data)
print data.sort_values("vm_id", inplace=True)
data.where(data["vm_id"]==183, inplace = True)
##data.where(data["vm_id"]==183, inplace = True) --this will spit out everything.
print (data)
None
      vm_id  ds_id                   updated_time
0       NaN    NaN                            NaN
1     183.0  166.0  2018-10-08 14:43:10.690000000
2       NaN    NaN                            NaN
3       NaN    NaN                            NaN
4       NaN    NaN                            NaN
5       NaN    NaN                            NaN
6       NaN    NaN                            NaN
7       NaN    NaN                            NaN
8       NaN    NaN                            NaN
9       NaN    NaN                            NaN
10      NaN    NaN                            NaN
11      NaN    NaN                            NaN
12      NaN    NaN                            NaN
13      NaN    NaN                            NaN
14      NaN    NaN                            NaN

Note 1: data.where(data["vm_id"]==183, inplace = True) is comparison to an integer. data.where(data["vm_id"]=="183", inplace = True) , is comparing to a string and it will fail with error "TypeError: invalid type comparison".

Note 2: inplace = True is used to 'masked' out. Those "NaN" are the result of inplace sets to Tru

That's is similar to the following.

select vm_id, ds_id, updated_time from EDA.vpx_vm where vm_id=183 order by vm_id;

No comments:

Post a Comment

Pandas: SQL Like pandas operations

Pandas's SQL Like operations such as WHERE clause. = != >= str.contains() & | .isin() .isnull() .notnull() ....