How the full dataset looks like
import pandas as pd
from StringIO import StringIO
#create dataframe
data = pd.read_csv(StringIO(vpx_vm_data))
print(data)
vm_id ds_id updated_time
0 27 4098 2018-08-21 22:05:03.437000000
1 183 166 2018-10-08 14:43:10.690000000
2 184 172 2017-10-03 15:59:10.957000000
3 185 162 2017-10-04 01:46:15.417000000
4 187 4924 2017-11-28 21:11:14.127000000
5 188 165 2018-04-23 14:22:01.403000000
6 190 158 2017-10-03 18:00:11.513000000
7 192 173 2018-10-08 21:28:54.837000000
8 193 4100 2018-07-31 21:17:10.343000000
9 194 4925 2017-11-30 17:35:31.977000000
10 196 164 2018-07-08 01:41:37.240000000
11 198 162 2017-10-04 01:46:15.450000000
12 199 4924 2018-04-19 18:15:00.330000000
13 201 169 2018-04-23 14:21:37.287000000
14 203 157 2017-11-07 02:05:28.810000000
15 205 168 2018-02-09 19:31:50.713000000
16 207 158 2017-10-03 15:59:10.363000000
17 208 170 2017-10-03 20:32:52.197000000
18 209 168 2017-10-03 18:00:10.777000000
19 210 169 2017-12-19 14:30:03.007000000
20 211 4100 2018-08-31 16:10:02.320000000
21 212 4098 2018-08-21 22:05:04.547000000
22 213 4065 2018-07-30 21:07:11.120000000
23 214 167 2017-10-03 18:20:18.003000000
24 216 4924 2018-09-25 21:33:04.523000000
25 217 158 2018-04-23 14:20:04.123000000
26 218 5046 2018-04-19 18:14:37.350000000
27 221 4925 2018-08-02 18:37:33.573000000
28 239 6868 2018-05-25 02:55:54.327000000
29 242 158 2018-10-05 21:00:12.307000000
... ... ... ...
4228 7571 4061 2018-09-24 15:32:39.943000000
counts(*)
import pandas as pd
from StringIO import StringIO
#create dataframe
vm_df = pd.read_csv(StringIO(vpx_vm_data))
date_summary=vm_df.count()
print(date_summary.head())
vm_id 4258
ds_id 4258
updated_time 4258
dtype: int64
That is equivalent to the following. It represents all the columns in the same counts.
select count(*) from vpx_vm;
count(vm_id)
import pandas as pd
from StringIO import StringIO
#create dataframe
vm_df = pd.read_csv(StringIO(vpx_vm_data))
date_summary=vm_df[['vm_id']].count()
print(date_summary.head())
vm_id 4258
dtype: int64
That is equivalent to the following
select count(vm_id) from vpx_vm;
Count with every single column.
import pandas as pd
from StringIO import StringIO
#create dataframe
vm_df = pd.read_csv(StringIO(vpx_vm_data))
date_summary=vm_df.groupby('vm_id').count()
print(date_summary)
That is equivalent to the following.
select vm_id, count(vm_id), count(ds_id), count(updated_time) from EDA.vpx_vm group by vm_id, ds_id, updated_time order by vm_id;
Sorting and Where clause
import pandas as pd
from StringIO import StringIO
#create dataframe
data = pd.read_csv(StringIO(vpx_vm_data))
##print (data)
print data.sort_values("vm_id", inplace=True)
data.where(data["vm_id"]==183, inplace = True)
##data.where(data["vm_id"]==183, inplace = True) --this will spit out everything.
print (data)
None
vm_id ds_id updated_time
0 NaN NaN NaN
1 183.0 166.0 2018-10-08 14:43:10.690000000
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
13 NaN NaN NaN
14 NaN NaN NaN
Note 1: data.where(data["vm_id"]==183, inplace = True) is comparison to an integer. data.where(data["vm_id"]=="183", inplace = True) , is comparing to a string and it will fail with error "TypeError: invalid type comparison".
Note 2: inplace = True is used to 'masked' out. Those "NaN" are the result of inplace sets to Tru
That's is similar to the following.
select vm_id, ds_id, updated_time from EDA.vpx_vm where vm_id=183 order by vm_id;
No comments:
Post a Comment