Guide Pandas
Pandas is a fast, powerful, flexible and easy to use open source data
analysis and manipulation tool, built on top of the Python
programming language.
Prudhvi Vardhan Notes
Pandas Series
A Pandas Series is like a column in a table. It is a 1-D array holding data of any type.
Importing Pandas
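The input cell that produced Out[6] is not visible in this export; it presumably imports pandas and builds a Series from a list of strings, roughly like this sketch (the country list is read off the output below):
In [ ]: import pandas as pd

        # a list of strings becomes a Series with dtype object
        countries = ['India', 'Pakistan', 'USA', 'Nepal', 'Srilanka']
        pd.Series(countries)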
Out[6]: 0 India
1 Pakistan
2 USA
3 Nepal
4 Srilanka
dtype: object
In [7]: # integers
marks= [13,24,56,78,100]
pd.Series(marks)
Out[7]: 0 13
1 24
2 56
3 78
4 100
dtype: int64
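Between Out[7] and the next cell, marks and subjects were evidently redefined in a cell that is not shown (the values below no longer match the list above); presumably something like:
In [ ]: # assumed redefinition, inferred from Out[8]
        subjects = ['maths', 'english', 'science', 'hindi']
        marks = [67, 57, 89, 100]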
pd.Series(marks,index=subjects)
Out[8]: maths 67
english 57
science 89
hindi 100
dtype: int64
Out[10]: maths 67
english 57
science 89
hindi 100
Name: Jack Marks, dtype: int64
In [11]: marks = {
'maths':67,
'english':57,
'science':89,
'hindi':100
}
marks_series = pd.Series(marks,name="jack Marks")
In [12]: marks_series
Out[12]: maths 67
english 57
science 89
hindi 100
Name: jack Marks, dtype: int64
Series Attributes
In [13]: # size
marks_series.size
Out[13]: 4
In [14]: # dtype
marks_series.dtype
Out[14]: dtype('int64')
In [15]: # name
marks_series.name
is_unique is an attribute of a Pandas Series that returns True when all the values in the
Series are unique (no duplicates).
In [16]: # is_unique
marks_series.is_unique
Out[16]: True
Out[17]: False
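Out[17] presumably comes from a Series that does contain repeated values; a minimal sketch of the contrast:
In [ ]: # is_unique is True only when no value in the Series repeats
        pd.Series([1, 2, 3]).is_unique   # True
        pd.Series([1, 1, 2]).is_unique   # False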
In [18]: # index
marks_series.index
In [19]: # values
marks_series.values
In [20]: type(marks_series.values)
Out[20]: numpy.ndarray
pandas.read_csv
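The cell that first creates sub is not shown here. By default read_csv returns a DataFrame, which is why the type check below reports DataFrame; the file name is taken from the later cell In [31]:
In [ ]: # one column of daily subscriber counts; read as a DataFrame by default
        sub = pd.read_csv("subs.csv")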
In [23]: type(sub)
Out[23]: pandas.core.frame.DataFrame
In [30]: sub.head(5)
0 48
1 57
2 40
3 43
4 44
In [31]: sub = pd.read_csv("subs.csv",squeeze=True)
In [32]: type(sub)
Out[32]: pandas.core.series.Series
In [33]: sub
Out[33]: 0 48
1 57
2 40
3 43
4 44
...
360 231
361 226
362 155
363 144
364 172
Name: Subscribers gained, Length: 365, dtype: int64
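The cells that load kl (runs scored per IPL match) and movies (Bollywood films and their lead actors) are not visible in this export; presumably they were read the same way, with the first column as the index and the single data column squeezed into a Series. The file names here are placeholders:
In [ ]: # hypothetical file names -- squeeze=True turns a one-column result into a Series
        kl = pd.read_csv("kohli_ipl.csv", index_col='match_no', squeeze=True)
        movies = pd.read_csv("bollywood.csv", index_col='movie', squeeze=True)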
In [57]: kl
Out[57]: match_no
1 1
2 23
3 13
4 12
5 1
..
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int64
In [38]: movies
Out[38]: movie
Uri: The Surgical Strike Vicky Kaushal
Battalion 609 Vicky Ahuja
The Accidental Prime Minister (film) Anupam Kher
Why Cheat India Emraan Hashmi
Evening Shadows Mona Ambegaonkar
...
Hum Tumhare Hain Sanam Shah Rukh Khan
Aankhen (2002 film) Amitabh Bachchan
Saathiya (film) Vivek Oberoi
Company (film) Ajay Devgn
Awara Paagal Deewana Akshay Kumar
Name: lead, Length: 1500, dtype: object
Series Methods
In [40]: # Head
sub.head()
Out[40]: 0 48
1 57
2 40
3 43
4 44
Name: Subscribers gained, dtype: int64
In [41]: # tail
kl.tail()
Out[41]: match_no
211 0
212 20
213 73
214 25
215 7
Name: runs, dtype: int64
Out[43]: movie
Enemmy Sunil Shetty
Name: lead, dtype: object
value_counts(): Returns a Series containing the counts of unique values in the
Series.
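The value_counts() call itself is missing from this export; a minimal sketch of what it looks like on the runs Series:
In [ ]: # how many matches ended on each score, most frequent score first
        kl.value_counts().head()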
Out[45]: match_no
87 0
211 0
207 0
206 0
91 0
...
164 100
120 100
123 108
126 109
128 113
Name: runs, Length: 215, dtype: int64
Out[50]: 113
In [55]: # For permanent Changes use Inplace
kl.sort_values(inplace=True)
kl
Out[55]: match_no
87 0
211 0
207 0
206 0
91 0
...
164 100
120 100
123 108
126 109
128 113
Name: runs, Length: 215, dtype: int64
movies.sort_index()
Out[60]: movie
1920 (film) Rajniesh Duggall
1920: London Sharman Joshi
1920: The Evil Returns Vicky Ahuja
1971 (2007 film) Manoj Bajpayee
2 States (2014 film) Arjun Kapoor
...
Zindagi 50-50 Veena Malik
Zindagi Na Milegi Dobara Hrithik Roshan
Zindagi Tere Naam Mithun Chakraborty
Zokkomon Darsheel Safary
Zor Lagaa Ke...Haiya! Meghan Jadhav
Name: lead, Length: 1500, dtype: object
In [61]: movies.sort_index(ascending=False)
Out[61]: movie
Zor Lagaa Ke...Haiya! Meghan Jadhav
Zokkomon Darsheel Safary
Zindagi Tere Naam Mithun Chakraborty
Zindagi Na Milegi Dobara Hrithik Roshan
Zindagi 50-50 Veena Malik
...
2 States (2014 film) Arjun Kapoor
1971 (2007 film) Manoj Bajpayee
1920: The Evil Returns Vicky Ahuja
1920: London Sharman Joshi
1920 (film) Rajniesh Duggall
Name: lead, Length: 1500, dtype: object
Series Maths Methods
In [62]: # count
kl.count()
Out[62]: 215
Out[66]: 49510
Out[67]: 0
Statistical Methods
In [68]: # mean
sub.mean()
Out[68]: 135.64383561643837
In [72]: # median
kl.median()
Out[72]: 24.0
mode(): The mode is the value that appears most frequently in the Series.
In [74]: # mode
print(movies.mode())
0 Akshay Kumar
dtype: object
Out[71]: 62.67502303725269
Out[75]: 3928.1585127201556
In [76]: # min
sub.min()
Out[76]: 33
In [77]: # max
sub.max()
Out[77]: 396
In [79]: # describe
movies.describe()
In [80]: kl.describe()
In [81]: sub.describe()
Series Indexing
In [83]: # integer indexing
x = pd.Series([12,13,14,35,46,57,58,79,9])
x[1]
Out[83]: 13
In [86]: movies[0]
In [87]: sub[0]
Out[87]: 48
In [90]: # slicing
kl[4:10]
Out[90]: match_no
5 1
6 9
7 34
8 0
9 21
10 3
Name: runs, dtype: int64
sub[-5:]
In [96]: movies[-5:]
Out[96]: movie
Hum Tumhare Hain Sanam Shah Rukh Khan
Aankhen (2002 film) Amitabh Bachchan
Saathiya (film) Vivek Oberoi
Company (film) Ajay Devgn
Awara Paagal Deewana Akshay Kumar
Name: lead, dtype: object
In [97]: movies[::2]
Out[97]: movie
Uri: The Surgical Strike Vicky Kaushal
The Accidental Prime Minister (film) Anupam Kher
Evening Shadows Mona Ambegaonkar
Fraud Saiyaan Arshad Warsi
Manikarnika: The Queen of Jhansi Kangana Ranaut
...
Raaz (2002 film) Dino Morea
Waisa Bhi Hota Hai Part II Arshad Warsi
Kaante Amitabh Bachchan
Aankhen (2002 film) Amitabh Bachchan
Company (film) Ajay Devgn
Name: lead, Length: 750, dtype: object
Out[98]: match_no
1 1
8 0
22 38
11 10
2 23
Name: runs, dtype: int64
Out[99]: movie
Uri: The Surgical Strike Vicky Kaushal
Battalion 609 Vicky Ahuja
The Accidental Prime Minister (film) Anupam Kher
Why Cheat India Emraan Hashmi
Evening Shadows Mona Ambegaonkar
...
Hum Tumhare Hain Sanam Shah Rukh Khan
Aankhen (2002 film) Amitabh Bachchan
Saathiya (film) Vivek Oberoi
Company (film) Ajay Devgn
Awara Paagal Deewana Akshay Kumar
Name: lead, Length: 1500, dtype: object
Out[101]: maths 67
english 57
science 89
hindi 100
Name: jack Marks, dtype: int64
In [102]: marks_series[1]=88
marks_series
Out[102]: maths 67
english 88
science 89
hindi 100
Name: jack Marks, dtype: int64
In [103]: # we can add data, if it doesn't exist
marks_series['social']=90
marks_series
Out[103]: maths 67
english 88
science 89
hindi 100
social 90
Name: jack Marks, dtype: int64
Out[111]: movie
Uri: The Surgical Strike Vicky Kaushal
Battalion 609 Vicky Ahuja
The Accidental Prime Minister (film) Anupam Kher
Why Cheat India Emraan Hashmi
Evening Shadows Mona Ambegaonkar
...
Hum Tumhare Hain Sanam Shah Rukh Khan
Aankhen (2002 film) Amitabh Bachchan
Saathiya (film) Vivek Oberoi
Company (film) Ajay Devgn
Awara Paagal Deewana Akshay Kumar
Name: lead, Length: 1500, dtype: object
In [115]: movies
Out[115]: movie
Uri: The Surgical Strike Vicky Kaushal
Battalion 609 Vicky Ahuja
The Accidental Prime Minister (film) Anupam Kher
Why Cheat India Emraan Hashmi
Evening Shadows Mona Ambegaonkar
...
Hum Tumhare Hain Sanam Jack
Aankhen (2002 film) Amitabh Bachchan
Saathiya (film) Vivek Oberoi
Company (film) Ajay Devgn
Awara Paagal Deewana Akshay Kumar
Name: lead, Length: 1500, dtype: object
Series with Python Functionalities
In [117]: # len/type/dir/sorted/max/min
print(len(sub))
print(type(sub))
365
<class 'pandas.core.series.Series'>
In [122]: print(dir(sub))
print(sorted(sub))
[dir(sub) lists every attribute and method of the Series -- dunder methods (__abs__, __add__, __len__, ...), internal helpers (_mgr, _values, ...) and the public API ('abs', 'add', 'agg', 'apply', ..., 'value_counts', 'values', 'var', 'view', 'where', 'xs'); the full dump is condensed here.]
[33, 33, 35, 37, 39, 40, 40, 40, 40, 42, 42, 43, 44, 44, 44, 45, 46, 46, 48,
49, 49, 49, 49, 50, 50, 50, 51, 54, 56, 56, 56, 56, 57, 61, 62, 64, 65, 65, 6
6, 66, 66, 66, 67, 68, 70, 70, 70, 71, 71, 72, 72, 72, 72, 72, 73, 74, 74, 7
5, 76, 76, 76, 76, 77, 77, 78, 78, 78, 79, 79, 80, 80, 80, 81, 81, 82, 82, 8
3, 83, 83, 84, 84, 84, 85, 86, 86, 86, 87, 87, 87, 87, 88, 88, 88, 88, 88, 8
9, 89, 89, 90, 90, 90, 90, 91, 92, 92, 92, 93, 93, 93, 93, 95, 95, 96, 96, 9
6, 96, 97, 97, 98, 98, 99, 99, 100, 100, 100, 101, 101, 101, 102, 102, 103, 1
03, 104, 104, 104, 105, 105, 105, 105, 105, 105, 105, 105, 105, 108, 108, 10
8, 108, 108, 108, 109, 109, 110, 110, 110, 111, 111, 112, 113, 113, 113, 114,
114, 114, 114, 115, 115, 115, 115, 117, 117, 117, 118, 118, 119, 119, 119, 11
9, 120, 122, 123, 123, 123, 123, 123, 124, 125, 126, 127, 128, 128, 129, 130,
131, 131, 132, 132, 134, 134, 134, 135, 135, 136, 136, 136, 137, 138, 138, 13
8, 139, 140, 144, 145, 146, 146, 146, 146, 147, 149, 150, 150, 150, 150, 151,
152, 152, 152, 153, 153, 153, 154, 154, 154, 155, 155, 156, 156, 156, 156, 15
7, 157, 157, 157, 158, 158, 159, 159, 160, 160, 160, 160, 162, 164, 166, 167,
167, 168, 170, 170, 170, 170, 171, 172, 172, 173, 173, 173, 174, 174, 175, 17
5, 176, 176, 177, 178, 179, 179, 180, 180, 180, 182, 183, 183, 183, 184, 184,
184, 185, 185, 185, 185, 186, 186, 186, 188, 189, 190, 190, 192, 192, 192, 19
6, 196, 196, 197, 197, 202, 202, 202, 203, 204, 206, 207, 209, 210, 210, 211,
212, 213, 214, 216, 219, 220, 221, 221, 222, 222, 224, 225, 225, 226, 227, 22
8, 229, 230, 231, 233, 236, 236, 237, 241, 243, 244, 245, 247, 249, 254, 254,
258, 259, 259, 261, 261, 265, 267, 268, 269, 276, 276, 290, 295, 301, 306, 31
2, 396]
In [123]: print(min(sub))
print(max(sub))
33
396
In [126]: dict(marks_series)
Out[126]: {'maths': 67, 'english': 88, 'science': 89, 'hindi': 100, 'social': 90}
Out[129]: True
Out[133]: True
In [138]: # looping
for i in movies:
print(i)
Vicky Kaushal
Vicky Ahuja
Anupam Kher
Emraan Hashmi
Mona Ambegaonkar
Geetika Vidya Ohlyan
Arshad Warsi
Radhika Apte
Kangana Ranaut
Nawazuddin Siddiqui
Ali Asgar
Ranveer Singh
Prit Kamani
Ajay Devgn
Sushant Singh Rajput
Amitabh Bachchan
Abhimanyu Dasani
Talha Arshad Reshi
Nawazuddin Siddiqui
In [139]: for i in movies.index:
print(i)
Out[140]: maths 33
english 12
science 11
hindi 0
social 10
Name: jack Marks, dtype: int64
In [141]: 100+marks_series
In [143]: # Relational operators
kl>=50
Out[143]: match_no
1 False
2 False
3 False
4 False
5 False
...
211 False
212 False
213 True
214 False
215 False
Name: runs, Length: 215, dtype: bool
Out[146]: 50
Out[148]: 9
In [149]: # Count number of day when I had more than 200 subs a day
sub[sub>=200].size
Out[149]: 59
In [160]: num_mov[num_mov>=20].size
Out[160]: 7
Plotting Graphs on Series
In [162]: sub.plot()
Out[162]: <AxesSubplot:>
In [164]: movies.value_counts().head(20).plot(kind="pie")
Out[164]: <AxesSubplot:ylabel='lead'>
In [165]: movies.value_counts().head(20).plot(kind="bar")
Out[165]: <AxesSubplot:>
In [166]: # astype
# between
# clip
# drop_duplicates
# isnull
# dropna
# fillna
# isin
# apply
# copy
In [175]: # astype
import sys
sys.getsizeof(kl)
Out[175]: 11752
In [176]: kl
Out[176]: match_no
1 1
2 23
3 13
4 12
5 1
..
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int64
In [177]: (kl.astype("int16"))
Out[177]: match_no
1 1
2 23
3 13
4 12
5 1
..
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int16
In [178]: sys.getsizeof(kl.astype("int16"))
Out[178]: 10462
In [181]: # between
kl[kl.between(50,60)]
Out[181]: match_no
15 50
34 58
44 56
57 57
71 51
73 58
80 57
85 56
103 51
122 52
129 54
131 54
137 55
141 58
144 57
182 50
197 51
198 53
209 58
Name: runs, dtype: int64
In [182]: kl[kl.between(50,60)].size
Out[182]: 19
In [183]: # clip
sub.clip(100,200)
Out[183]: 0 100
1 100
2 100
3 100
4 100
...
360 200
361 200
362 155
363 144
364 172
Name: Subscribers gained, Length: 365, dtype: int64
In [186]: # drop_duplicates(): Returns a Series with duplicates removed
dele = pd.Series([1,2,33,3,3,3,1,23,33,22,33,11])
dele
Out[186]: 0 1
1 2
2 33
3 3
4 3
5 3
6 1
7 23
8 33
9 22
10 33
11 11
dtype: int64
In [188]: dele.drop_duplicates()
Out[188]: 0 1
1 2
2 33
3 3
7 23
9 22
11 11
dtype: int64
In [189]: dele.drop_duplicates(keep='last')
Out[189]: 1 2
5 3
6 1
7 23
9 22
10 33
11 11
dtype: int64
In [190]: movies.drop_duplicates()
Out[190]: movie
Uri: The Surgical Strike Vicky Kaushal
Battalion 609 Vicky Ahuja
The Accidental Prime Minister (film) Anupam Kher
Why Cheat India Emraan Hashmi
Evening Shadows Mona Ambegaonkar
...
Rules: Pyaar Ka Superhit Formula Tanuja
Right Here Right Now (film) Ankit
Talaash: The Hunt Begins... Rakhee Gulzar
The Pink Mirror Edwin Fernandes
Hum Tumhare Hain Sanam Jack
Name: lead, Length: 567, dtype: object
In [191]: dele.duplicated().sum()
Out[191]: 5
In [193]: kl.duplicated().sum()
Out[193]: 137
In [194]: dele.count()
Out[194]: 12
isin(values): Returns a boolean Series indicating whether each element in the Series is
in the provided values
In [198]: # isnull
kl.isnull().sum()
Out[198]: 0
In [199]: dele.isnull().sum()
Out[199]: 0
In [200]: # dropna
dele.dropna()
Out[200]: 0 1
1 2
2 33
3 3
4 3
5 3
6 1
7 23
8 33
9 22
10 33
11 11
dtype: int64
In [202]: # fillna
dele.fillna(0)
dele.fillna(dele.mean())
Out[202]: 0 1
1 2
2 33
3 3
4 3
5 3
6 1
7 23
8 33
9 22
10 33
11 11
dtype: int64
In [205]: # isin
kl
Out[205]: match_no
1 1
2 23
3 13
4 12
5 1
..
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int64
In [207]: kl[(kl==49) | (kl==99)]
Out[207]: match_no
82 99
86 49
Name: runs, dtype: int64
In [209]: kl[kl.isin([49,99])]
Out[209]: match_no
82 99
86 49
Name: runs, dtype: int64
In [210]: # apply
movies
Out[210]: movie
Uri: The Surgical Strike Vicky Kaushal
Battalion 609 Vicky Ahuja
The Accidental Prime Minister (film) Anupam Kher
Why Cheat India Emraan Hashmi
Evening Shadows Mona Ambegaonkar
...
Hum Tumhare Hain Sanam Jack
Aankhen (2002 film) Amitabh Bachchan
Saathiya (film) Vivek Oberoi
Company (film) Ajay Devgn
Awara Paagal Deewana Akshay Kumar
Name: lead, Length: 1500, dtype: object
Out[212]: movie
Uri: The Surgical Strike [Vicky, Kaushal]
Battalion 609 [Vicky, Ahuja]
The Accidental Prime Minister (film) [Anupam, Kher]
Why Cheat India [Emraan, Hashmi]
Evening Shadows [Mona, Ambegaonkar]
...
Hum Tumhare Hain Sanam [Jack]
Aankhen (2002 film) [Amitabh, Bachchan]
Saathiya (film) [Vivek, Oberoi]
Company (film) [Ajay, Devgn]
Awara Paagal Deewana [Akshay, Kumar]
Name: lead, Length: 1500, dtype: object
In [213]: movies.apply(lambda x:x.split()[0]) # select first word
Out[213]: movie
Uri: The Surgical Strike Vicky
Battalion 609 Vicky
The Accidental Prime Minister (film) Anupam
Why Cheat India Emraan
Evening Shadows Mona
...
Hum Tumhare Hain Sanam Jack
Aankhen (2002 film) Amitabh
Saathiya (film) Vivek
Company (film) Ajay
Awara Paagal Deewana Akshay
Name: lead, Length: 1500, dtype: object
Out[214]: movie
Uri: The Surgical Strike VICKY
Battalion 609 VICKY
The Accidental Prime Minister (film) ANUPAM
Why Cheat India EMRAAN
Evening Shadows MONA
...
Hum Tumhare Hain Sanam JACK
Aankhen (2002 film) AMITABH
Saathiya (film) VIVEK
Company (film) AJAY
Awara Paagal Deewana AKSHAY
Name: lead, Length: 1500, dtype: object
In [215]: sub
Out[215]: 0 48
1 57
2 40
3 43
4 44
...
360 231
361 226
362 155
363 144
364 172
Name: Subscribers gained, Length: 365, dtype: int64
In [216]: sub.mean()
Out[216]: 135.64383561643837
In [217]: sub.apply(lambda x:'good day' if x > sub.mean() else 'bad day')
In [229]: # Copy
kl
Out[229]: match_no
1 1
2 23
3 13
4 12
5 1
..
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int64
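The cell that creates new is not shown. The behaviour that follows -- editing new also changes kl until an explicit copy is taken -- suggests it was first built as a view of the head of kl and later rebuilt with .copy(). A sketch of the two variants:
In [ ]: new = kl.head()           # a view: assigning into it can also modify kl
        # new = kl.head().copy()  # an independent copy: later edits leave kl untouched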
In [231]: new[1]=100
In [232]: new
Out[232]: match_no
1 100
2 23
3 13
4 12
5 1
Name: runs, dtype: int64
In [233]: kl
Out[233]: match_no
1 100
2 23
3 13
4 12
5 1
...
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int64
In [241]: new[1]=20
In [242]: new
Out[242]: match_no
1 20
2 23
3 13
4 12
5 1
Name: runs, dtype: int64
In [250]: kl
Out[250]: match_no
1 100
2 23
3 13
4 12
5 1
...
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int64
In [ ]:
In [ ]:
In [1]: import numpy as np
import pandas as pd
Creating DataFrame
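student_data is not defined in the visible cells; it is presumably a list of rows matching Out[2] below, for example:
In [ ]: # assumed definition -- one inner list per row: iq, marks, package
        student_data = [
            [100, 80, 10],
            [90, 70, 7],
            [120, 100, 14],
            [80, 50, 2]
        ]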
pd.DataFrame(student_data,columns=['iq','marks','package'])
Out[2]:
iq marks package
0 100 80 10
1 90 70 7
2 120 100 14
3 80 50 2
student_dict = {
'name':['peter','saint','noeum','parle','samme','dave'],
'iq':[100,90,120,80,13,90],
'marks':[80,70,100,50,11,80],
'package':[10,7,14,2,15,100]
}
students=pd.DataFrame(student_dict)
students
Out[3]:
     name   iq  marks  package
0   peter  100     80       10
1   saint   90     70        7
2   noeum  120    100       14
3   parle   80     50        2
4   samme   13     11       15
5    dave   90     80      100
In [4]: students.set_index('name',inplace=True)
students
Out[4]:
        iq  marks  package
name
peter  100     80       10
saint   90     70        7
noeum  120    100       14
parle   80     50        2
samme   13     11       15
dave    90     80      100
In [5]: # Read csv
movies = pd.read_csv("movies.csv")
movies.head()
Out[5]:
[movies.head(): a wide DataFrame (columns title_x, imdb_id, poster_path, wiki_link, title_y, original_title, is_adult, year_of_release, ...) that does not survive this export; the visible rows include The Accidental Prime Minister (film) and Why Cheat India.]
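The ipl DataFrame used from Out[6] onward is loaded in a cell that is not shown; presumably something like the following (the file name is a placeholder):
In [ ]: ipl = pd.read_csv("ipl-matches.csv")
        ipl.head(2)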
Out[6]:
[the ipl match table, with columns ID, City, Date, Season, MatchNumber, Team1, Team2, Venue, TossWinner, TossDecision, SuperOver, WinningTeam, WonBy, ...; the visible rows are the 2022 Final and Qualifier 2 at the Narendra Modi Stadium, Ahmedabad (Rajasthan Royals, Gujarat Titans, Royal Challengers Bangalore) and Qualifier 1 at Eden Gardens, Kolkata.]
In [7]: # shape
movies.shape
In [8]: ipl.shape
In [9]: # dtype
movies.dtypes
In [10]: ipl.dtypes
Out[10]: ID int64
City object
Date object
Season object
MatchNumber object
Team1 object
Team2 object
Venue object
TossWinner object
TossDecision object
SuperOver object
WinningTeam object
WonBy object
Margin float64
method object
Player_of_Match object
Team1Players object
Team2Players object
Umpire1 object
Umpire2 object
dtype: object
In [11]: # index
movies.index
In [12]: ipl.index
In [13]: # Columns
movies.columns
In [14]: ipl.columns
In [15]: # Values
students.values
In [16]: ipl.values
Out[17]:
[a single ipl row rendered as a wide table: 2022 match 11 between Chennai Super Kings and Punjab Kings at Brabourne Stadium, Mumbai, won by Punjab Kings by runs.]
In [18]: # info
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1629 entries, 0 to 1628
Data columns (total 18 columns):
# Column Non-Null Count Dtype
In [19]: ipl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
# Column Non-Null Count Dtype
In [20]: # describe
movies.describe()
Out[20]:
is_adult year_of_release imdb_rating imdb_votes
In [21]: ipl.describe()
Out[21]:
ID Margin
In [22]: # isnull
movies.isnull()
Out[22]:
title_x imdb_id poster_path wiki_link title_y original_title is_adult year_of_release runtime genres imdb_rating imdb_votes story summa
0 False False False False False False False False False False False False False Fal
1 False False True False False False False False False False False False False Fal
2 False False False False False False False False False False False False False Fal
3 False False False False False False False False False False False False False Fal
4 False False True False False False False False False False False False False Fal
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1624 False False False False False False False False False False False False False Fal
1625 False False False False False False False False False False False False False Fal
1626 False False True False False False False False False False False False False Fal
1627 False False False False False False False False False False False False False Fal
1628 False False False False False False False False False False False False False Fal
In [23]: movies.isnull().sum()
Out[23]: title_x 0
imdb_id 0
poster_path 103
wiki_link 0
title_y 0
original_title 0
is_adult 0
year_of_release 0
runtime 0
genres 0
imdb_rating 0
imdb_votes             0
story                 20
summary                0
tagline             1072
actors                 5
wins_nominations     922
release_date         107
dtype: int64
In [24]: # duplicated
movies.duplicated().sum()
Out[24]: 0
In [25]: # rename
students
Out[25]:
iq marks package
name
peter 100 80 10
saint 90 70 7
parle 80 50 2
samme 13 11 15
dave 90 80 100
In [26]: students.rename(columns={'marks':'percent','package':'lpa'},inplace=True)
In [ ]: students.drop(columns='name',inplace=True)
Maths Method
In [28]: # sum -> Axis Argument
students.sum(axis=1)
Out[28]: name
peter 190
saint 167
noeum 234
parle 132
samme 39
dave 270
dtype: int64
In [29]: # mean
students.mean()
Out[29]: iq 82.166667
percent 65.166667
lpa 24.666667
dtype: float64
In [30]: students.min(axis=1)
Out[30]: name
peter 10
saint 7
noeum 14
parle 2
samme 11
dave 80
dtype: int64
In [31]: students.var()
Out[31]: iq 1332.166667
percent 968.166667
lpa 1384.666667
dtype: float64
Selecting cols from a DataFrame
In [33]: type(movies['title_x'])
Out[33]: pandas.core.series.Series
In [35]: type(movies[['year_of_release','actors','title_x']].head(2))
Out[35]: pandas.core.frame.DataFrame
In [36]: ipl[['City','Team1','Team2' ]]
Out[36]:
City Team1 Team2
In [37]: student_dict = {
'name':['peter','saint','noeum','parle','samme','dave'],
'iq':[100,90,120,80,13,90],
'marks':[80,70,100,50,11,80],
'package':[10,7,14,2,15,100]
}
students=pd.DataFrame(student_dict)
students.set_index('name',inplace=True)
In [38]: students
Out[38]:
        iq  marks  package
name
peter  100     80       10
saint   90     70        7
noeum  120    100       14
parle   80     50        2
samme   13     11       15
dave    90     80      100
Selecting rows from a DataFrame
iloc - searches using index positions
loc - searches using index labels
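A minimal contrast of the two on the students frame defined above:
In [ ]: students.iloc[0]        # first row, selected by position
        students.loc['peter']   # the same row, selected by its index label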
In [39]: # single_row
movies.iloc[1]
Out[40]:
title_x imdb_id poster_path wiki_link title_y original_title is_adult y
In [41]: movies.iloc[5:12:2]
Out[41]:
title_x imdb_id poster_path wiki_link title_y original_title is_adult year_
Out[42]:
[ipl rows rendered as a wide table: the 2022 Final (Rajasthan Royals vs Gujarat Titans, Narendra Modi Stadium) and two matches at Wankhede Stadium, Mumbai (Sunrisers Hyderabad vs Punjab Kings; Delhi Capitals vs Mumbai Indians), all won by wickets.]
Out[43]:
        iq  marks  package
name
peter  100     80       10
saint   90     70        7
noeum  120    100       14
parle   80     50        2
samme   13     11       15
dave    90     80      100
In [44]: students.loc['parle']
Out[44]: iq 80
marks 50
package 2
Name: parle, dtype: int64
In [45]: students.loc['saint':'samme':2]
Out[45]:
iq marks package
name
saint 90 70 7
parle 80 50 2
Out[46]:
iq marks package
name
saint 90 70 7
dave 90 80 100
In [47]: students.iloc[[0,4,3]]
Out[47]:
iq marks package
name
peter 100 80 10
samme 13 11 15
parle 80 50 2
In [48]: movies.iloc[0:3,0:3]
Out[48]:
title_x imdb_id poster_path
In [49]: movies.loc[0:2,'title_x':'poster_path']
Out[49]:
title_x imdb_id poster_path
Filtering a DataFrame
In [50]: ipl.head(2)
Out[50]:
[ipl.head(2): the 2022 Final and Qualifier 2 at the Narendra Modi Stadium, Ahmedabad -- Rajasthan Royals vs Gujarat Titans (won by Gujarat Titans) and Royal Challengers Bangalore vs Rajasthan Royals (won by Rajasthan Royals).]
In [51]: # find all the final winners
mask=ipl['MatchNumber'] == 'Final'
new_df= ipl[mask]
new_df[['Season','WinningTeam']]
Out[51]:
Season WinningTeam
Out[52]:
Season WinningTeam
Out[53]:
[a wide slice of ipl: the 2022 Final and Qualifier 2, both at the Narendra Modi Stadium, Ahmedabad.]
In [54]: ipl[ipl['SuperOver']=='Y'].shape[0]
Out[54]: 14
Out[55]:
[one ipl row (index 364): 2017 match 9 between Rising Pune Supergiant and Delhi Daredevils at the Maharashtra Cricket Association Stadium, Pune.]
In [56]:
ipl[(ipl['City'] == 'Kolkata') & (ipl['WinningTeam'] == 'Chennai Super Kings')].shape[0]
Out[56]: 5
Out[57]:
[one ipl row (index 168): a 2020/21 match between Delhi Capitals and Mumbai Indians at the Sheikh Zayed Stadium, Abu Dhabi, won by Mumbai Indians by wickets.]
Out[58]: 51.473684210526315
Out[59]:
[two movies rows rendered as a wide table: Junglee (2019 film) and Hey Bro.]
Out[60]: 43
In [61]: # Action movies with rating higher than 7.5
#mask1=movies['genres'].str.split('|').apply(lambda x:'Action' in x)
mask1=movies['genres'].str.contains('Action')
mask2=movies['imdb_rating']>7.5
movies[mask1 & mask2]
Out[61]:
title_x imdb_id poster_path wiki_link title_y original_tit
In [62]: movies['country']='India'
movies.sample(2)
Out[62]:
[movies.sample(2): Wanted (2009 film) and Shaadi Mein Zaroor Aana, rendered as a wide table.]
In [63]: movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1629 entries, 0 to 1628
Data columns (total 19 columns):
# Column Non-Null Count Dtype
In [138]: # From Existing ones
movies['actors'].str.split('|').apply(lambda x:x[0])
In [68]: ipl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
# Column Non-Null Count Dtype
In [69]: ipl['ID']=ipl['ID'].astype('int32')
In [70]: ipl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
# Column Non-Null Count Dtype
In [72]: ipl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
# Column Non-Null Count Dtype
In [74]: ipl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
# Column Non-Null Count Dtype
Value Counts
In [143]: # value_counts(series and dataframe)
marks = pd.DataFrame([
[100,80,10],
[90,70,7],
[120,100,14],
[80,70,14],
[80,70,14]
],columns=['iq','marks','package'])
marks
Out[143]:
iq marks package
0 100 80 10
1 90 70 7
2 120 100 14
3 80 70 14
4 80 70 14
In [76]: marks.value_counts()
In [77]: # find which player has won most potm -> in finals and qualifiers
ipl.sample(2)
Out[77]:
[ipl.sample(2): a 2018 match between Kolkata Knight Riders and Sunrisers Hyderabad at Eden Gardens, and a 2013 match between Kings XI Punjab and Delhi Daredevils at the Himachal Pradesh Cricket Association Stadium, Dharamsala.]
In [78]: ipl[~ipl['MatchNumber'].str.isdigit()]['Player_of_Match'].value_counts() # the tilde ~ inverts the boolean mask
Out[78]: KA Pollard           3
         F du Plessis         3
         SK Raina             3
         A Kumble             2
         MK Pandey            2
         YK Pathan            2
         M Vijay              2
         JJ Bumrah            2
         AB de Villiers       2
         SR Watson            2
         HH Pandya            1
         Harbhajan Singh      1
         A Nehra              1
         V Sehwag             1
         UT Yadav             1
         MS Bisla             1
         BJ Hodge             1
         MEK Hussey           1
         MS Dhoni             1
         CH Gayle             1
         MM Patel             1
         DE Bollinger         1
         AC Gilchrist         1
         RG Sharma            1
         DA Warner            1
         MC Henriques         1
         JC Buttler           1
         RM Patidar           1
         DA Miller            1
         VR Iyer              1
         SP Narine            1
         RD Gaikwad           1
         TA Boult             1
         MP Stoinis           1
         KS Williamson        1
         RR Pant              1
         SA Yadav             1
         Rashid Khan          1
         AD Russell           1
         KH Pandya            1
         KV Sharma            1
         NM Coulter-Nile      1
         Washington Sundar    1
         BCJ Cutting          1
         M Ntini              1
         Name: Player_of_Match, dtype: int64
In [79]: # Toss decision plot
ipl['TossDecision'].value_counts().plot(kind='pie')
Out[79]: <AxesSubplot:ylabel='TossDecision'>
In [80]: # No.of matches each team has played
(ipl['Team1'].value_counts() + ipl['Team2'].value_counts()).sort_values(ascending=False)
Sort values
In [81]: x = pd.Series([12,14,1,56,89])
x
Out[81]: 0 12
1 14
2 1
3 56
4 89
dtype: int64
In [82]: x.sort_values(ascending=True)
Out[82]: 2 1
0 12
1 14
3 56
4 89
dtype: int64
In [83]: movies.sample(2)
Out[83]:
[movies.sample(2): Hope Aur Hum and Tere Naal Love Ho Gaya, rendered as a wide table.]
In [84]: movies.sort_values('title_x', ascending=False)
Out[84]:
[movies sorted by title_x in descending order: Zor Lagaa Ke...Haiya! and Zindagi Tere Naam at the top, 1920: The Evil Returns and 16 December (film) near the bottom.]
In [85]: students = pd.DataFrame(
{
'name':['nitish','ankit','rupesh',np.nan,'mrityunjay',np.nan,'rishabh',np.nan,'aditya',np.nan],
'college':['bit','iit','vit',np.nan,np.nan,'vlsi','ssit',np.nan,np.nan,'git'],
'branch':['eee','it','cse',np.nan,'me','ce','civ','cse','bio',np.nan],
'cgpa':[6.66,8.25,6.41,np.nan,5.6,9.0,7.4,10,7.4,np.nan],
'package':[4,5,6,np.nan,6,7,8,9,np.nan,np.nan]
}
)
students
Out[85]:
name college branch cgpa package
Out[86]:
name college branch cgpa package
In [87]: students
Out[87]:
name college branch cgpa package
In [88]: movies.sort_values(['year_of_release','title_x'], ascending=[True,False]).head(2)
Out[88]:
[the two oldest films, sorted by year_of_release ascending and title_x descending; the visible row is Yeh Zindagi Ka Safar.]
rank(Series)
In [89]: batsman=pd.read_csv("batsman_runs_ipl.csv")
In [90]: batsman.head(2)
Out[90]:
batter batsman_run
2 A Badoni 161
In [91]: batsman['batsman_run'].rank(ascending=False)
Out[91]: 0 166.5
1 226.0
2 535.0
3 329.0
4 402.5
...
600 594.0
601 343.0
602 547.5
603 27.0
604 256.0
Name: batsman_run, Length: 605, dtype: float64
Out[92]:
batter batsman_run batsman_rank
In [93]: marks = {
'maths':67,
'english':57,
'science':89,
'hindi':100
}
marks_series = pd.Series(marks)
marks_series
Out[93]: maths 67
english 57
science 89
hindi 100
dtype: int64
In [94]: marks_series.sort_index()
Out[94]: english 57
hindi 100
maths 67
science 89
dtype: int64
In [95]: movies.sort_index(ascending=False)
Out[95]:
[movies.sort_index(ascending=False): Sabse Bada Sukh (1626) and Yeh Zindagi Ka Safar (1625) at the top, down to Why Cheat India (3), The Accidental Prime Minister (film) (2) and Battalion 609 (1).]
In [140]: # This code drops two columns from the 'batsman' dataframe: 'level_0' and 'index'.
# The 'inplace=True' parameter ensures that the original dataframe is modified instead of creating a new one.
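The cell the comment describes is not shown; presumably it was:
In [ ]: batsman.drop(columns=['level_0', 'index'], inplace=True)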
In [101]: # reset_index(series + dataframe) -> drop parameter
batsman.reset_index(inplace=True)
In [102]: batsman
Out[102]:
batter batsman_run batsman_rank
2 A Chandila 4 535.0
3 A Chopra 53 329.0
4 A Choudhary 25 402.5
Out[103]:
   index       batter  batsman_run  batsman_rank
2      2   A Chandila            4         535.0
3      3     A Chopra           53         329.0
4      4  A Choudhary           25         402.5
In [104]: batsman
Out[104]:
batter batsman_run batsman_rank
2 A Chandila 4 535.0
3 A Chopra 53 329.0
4 A Choudhary 25 402.5
In [105]: # series to dataframe using reset_index
marks_series.reset_index()
Out[105]:
     index    0
0    maths   67
1  english   57
2  science   89
3    hindi  100
In [106]: type(marks_series.reset_index())
Out[106]: pandas.core.frame.DataFrame
In [107]: movies.set_index('title_x',inplace=True)
In [108]: movies
Out[108]:
[movies with title_x set as the index; the visible rows are Battalion 609, The Accidental Prime Minister (film), Why Cheat India, Yeh Zindagi Ka Safar and Sabse Bada Sukh, each with columns imdb_id, poster_path, wiki_link, title_y, original_title, is_adult, year_of_release, ...]
In [110]: movies
Out[110]:
[the same movies frame after a column rename (not shown) of imdb_id to imdb link; rows Battalion 609, The Accidental Prime Minister (film), Why Cheat India, Yeh Zindagi Ka Safar and Sabse Bada Sukh.]
In [111]: # Rename the index
movies.rename(index={'Uri: The Surgical Strike':'uri','Humsafar':'Hum'})
Out[111]:
[the frame with the requested index labels renamed where present; the visible rows (Battalion 609, The Accidental Prime Minister (film), Why Cheat India, Yeh Zindagi Ka Safar, Sabse Bada Sukh) are unchanged.]
unique
In [112]: # unique(series)
temp = pd.Series([1,1,2,2,3,3,4,4,5,5,np.nan,np.nan])
print(temp)
temp.unique()
0 1.0
1 1.0
2 2.0
3 2.0
4 3.0
5 3.0
6 4.0
7 4.0
8 5.0
9 5.0
10 NaN
11 NaN
dtype: float64
In [113]: ipl['Season'].unique()
Out[113]: ['2022', '2021', '2020/21', '2019', '2018', ..., '2012', '2011', '2009/10', '2009', '2007/08']
Length: 15
Categories (15, object): ['2007/08', '2009', '2009/10', '2011', ..., '2019', '2020/21', '2021', '2022']
nunique: returns the number of unique elements in a pandas Series (or per column of a DataFrame). It does not count missing values (NaN) by default; pass dropna=False if you
want NaN counted as a value.
unique: returns an array of the unique elements of a Series, and it does include NaN if the Series contains one. It has no dropna argument; drop the missing values first (e.g. with
dropna()) if you don't want NaN in the result.
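A quick contrast of the two on a small Series with a missing value (np is already imported above):
In [ ]: s = pd.Series([1, 1, 2, np.nan])
        s.unique()               # array([ 1.,  2., nan]) -- NaN appears in the result
        s.nunique()              # 2 -- NaN is ignored
        s.nunique(dropna=False)  # 3 -- NaN counted as a value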
In [114]: len(ipl['Season'].unique())
Out[114]: 15
In [115]: # nunique(series + dataframe) -> does not count nan -> dropna parameter
ipl['Season'].nunique()
Out[115]: 15
isnull(series + dataframe)
In [116]: students
Out[116]:
name college branch cgpa package
In [117]: students['name'].isnull()
Out[117]: 0 False
1 False
2 False
3 True
4 False
5 True
6 False
7 True
8 False
9 True
Name: name, dtype: bool
Out[118]: 0 True
1 True
2 True
3 False
4 True
5 False
6 True
7 False
8 True
9 False
Name: name, dtype: bool
In [119]: students['name'][students['name'].notnull()]
Out[119]: 0 nitish
1 ankit
2 rupesh
4 mrityunjay
6 rishabh
8 aditya
Name: name, dtype: object
In [120]: # hasnans(series)
students['college'].hasnans
Out[120]: True
In [121]: students.isnull()
Out[121]:
name college branch cgpa package
In [122]: students.notnull()
Out[122]:
name college branch cgpa package
Out[123]: 0 nitish
1 ankit
2 rupesh
4 mrityunjay
6 rishabh
8 aditya
Name: name, dtype: object
In [124]: students.dropna(how='any')
Out[124]:
name college branch cgpa package
In [125]: students.dropna(how='all')
Out[125]:
name college branch cgpa package
The subset argument takes labels along the other axis to consider: when dropping rows,
it is the list of columns in which to look for missing values.
In [126]: students.dropna(subset=['name'])
Out[126]:
name college branch cgpa package
In [127]: students.dropna(subset=['name','college'])
Out[127]:
name college branch cgpa package
In [128]: students
Out[128]:
name college branch cgpa package
In [129]: students['name'].fillna('unknown')
Out[129]: 0 nitish
1 ankit
2 rupesh
3 unknown
4 mrityunjay
5 unknown
6 rishabh
7 unknown
8 aditya
9 unknown
Name: name, dtype: object
In [130]: students.fillna('0')
Out[130]:
name college branch cgpa package
3 0 0 0 0 0
9 0 git 0 0 0
In [131]:
students['package'].fillna(students['package'].mean())
Out[131]: 0 4.000000
1 5.000000
2 6.000000
3 6.428571
4 6.000000
5 7.000000
6 8.000000
7 9.000000
8 6.428571
9 6.428571
Name: package, dtype: float64
Out[132]: 0 nitish
1 ankit
2 rupesh
3 rupesh
4 mrityunjay
5 mrityunjay
6 rishabh
7 rishabh
8 aditya
9 aditya
Name: name, dtype: object
Out[133]: 0 nitish
1 ankit
2 rupesh
3 mrityunjay
4 mrityunjay
5 rishabh
6 rishabh
7 aditya
8 aditya
9 NaN
Name: name, dtype: object
drop_duplicates
In [135]: marks
In [144]: marks
Out[144]:
iq marks package
0 100 80 10
1 90 70 7
2 120 100 14
3 80 70 14
4 80 70 14
In [145]: marks.drop_duplicates()
Out[145]:
iq marks package
0 100 80 10
1 90 70 7
2 120 100 14
3 80 70 14
In [146]: marks.drop_duplicates(keep='last')
Out[146]:
iq marks package
0 100 80 10
1 90 70 7
2 120 100 14
4 80 70 14
Out[148]: [one ipl row (index 276): a 2018 match between Rajasthan Royals and Kings XI Punjab at the Holkar Cricket Stadium, Indore, won by Kings XI Punjab by wickets.]
Out[149]: [one wide ipl row (trailing columns from Venue through Umpire2) from a Rajasthan Royals vs Chennai Super Kings fixture: won by wickets, margin 8.0, Player_of_Match MEK Hussey, umpires SS Hazare and RB Tiffin.]
In [154]: def did_kohli_play(players_list):
return 'V Kohli' in players_list
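The cell that applies this function to build the did_kohli_play column (used in the filters below) is not visible; presumably it checked both teams' playing elevens, roughly:
In [ ]: # assumed reconstruction -- Team1Players/Team2Players hold the playing elevens
        ipl['did_kohli_play'] = (ipl['Team1Players'] + ipl['Team2Players']).apply(did_kohli_play)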
In [165]: ipl.sample(2)
Out[165]:
[ipl.sample(2): the visible row is a 2007/08 match (no. 10) at the Punjab Cricket Association Stadium, Chandigarh, between Kings XI Punjab and Mumbai Indians, Player_of_Match KC Sangakkara. 2 rows × 23 columns]
Out[166]:
[nine wide ipl rows at the Feroz Shah Kotla, Delhi, all Royal Challengers Bangalore vs Delhi Daredevils fixtures; the visible rows are from the 2013, 2012 and 2009/10 seasons. 9 rows × 23 columns]
In [168]: ipl[(ipl['City'] == 'Delhi') & (ipl['did_kohli_play'] == True)].drop_duplicates(subset=['City','did_kohli_play'], keep='first')
Out[168]:
[one row: the 2019 match 46 between Delhi Capitals and Royal Challengers Bangalore at the Arun Jaitley Stadium, Delhi; Player_of_Match S Dhawan. 1 rows × 23 columns]
drop(series + dataframe)
In [169]: # Series
temp = pd.Series([10,2,3,16,45,78,10])
temp
Out[169]: 0 10
1 2
2 3
3 16
4 45
5 78
6 10
dtype: int64
In [170]: temp.drop(index=[0,6])
Out[170]: 1 2
2 3
3 16
4 45
5 78
dtype: int64
In [171]: students
Out[171]:
name college branch cgpa package
In [172]: students.drop(columns=['branch','cgpa']) # To delete Columns
Out[172]:
name college package
Out[173]:
name college branch cgpa package
In [174]: students.set_index('name')
Out[174]:
college branch cgpa package
name
Out[175]:
college branch cgpa package
name
apply(series + dataframe)
In [176]: # series
temp = pd.Series([10,20,30,40,50])
temp
Out[176]: 0 10
1 20
2 30
3 40
4 50
dtype: int64
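sigmoid is not defined in the visible cells, and the values slightly above 1 in Out[178] suggest it was written as 1/(1 - e^-x) rather than the textbook 1/(1 + e^-x). An inferred definition, to be treated as a guess:
In [ ]: import numpy as np

        def sigmoid(x):
            # inferred from Out[178]; the standard logistic function would use 1 + np.exp(-x)
            return 1 / (1 - np.exp(-x))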
In [178]: temp.apply(sigmoid)
Out[178]: 0 1.000045
1 1.000000
2 1.000000
3 1.000000
4 1.000000
dtype: float64
points
Out[179]:
1st point 2nd point
0 (3, 4) (-3, 4)
1 (-6, 5) (0, 0)
2 (0, 0) (2, 2)
4 (4, 5) (1, 1)
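The cells that produce the distances in Out[182] and the distance column in Out[184] are not shown; presumably a Euclidean-distance function was applied row-wise, along these lines:
In [ ]: def euclidean(row):
            # each cell holds an (x, y) tuple
            (x1, y1), (x2, y2) = row['1st point'], row['2nd point']
            return ((x1 - x2)**2 + (y1 - y2)**2)**0.5

        points['distance'] = points.apply(euclidean, axis=1)
        points['distance']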
Out[182]: 0 6.000000
1 7.810250
2 2.828427
3 21.931712
4 5.000000
dtype: float64
Out[184]:
1st point 2nd point distance
In [187]: movies = pd.read_csv("imdb-top-1000.csv")
In [189]: movies.head(1)
Out[189]:
Series_Title Released_Year Runtime Genre IMDB_Rating Director Star1 No_of_Votes Gross Metascore
0 The Shawshank Redemption 1994 142 Drama 9.3 Frank Darabont Tim Robbins 2343110 28341469.0 80.0
In [190]: movies.groupby('Genre')
Out[196]:
Runtime IMDB_Rating No_of_Votes Gross Metascore
Genre
In [197]: generes.mean()
Out[197]:
Runtime IMDB_Rating No_of_Votes Gross Metascore
Genre
In [198]: generes.min()
Out[198]:
[column-wise minimums for each genre (Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Film-Noir, Horror, Mystery, Thriller, Western) across Series_Title, Released_Year, Runtime, IMDB_Rating, Director, Star1, No_of_Votes, Gross and Metascore; the wide table does not survive this export.]
Out[206]: Genre
Drama 3.540997e+10
Action 3.263226e+10
Comedy 1.566387e+10
Name: Gross, dtype: float64
Out[212]: Genre
          Thriller     1.755074e+07
          Western      5.822151e+07
          Film-Noir    1.259105e+08
          Family       4.391106e+08
          Fantasy      7.827267e+08
          Horror       1.034649e+09
          Mystery      1.256417e+09
          Biography    8.276358e+09
          Crime        8.452632e+09
          Adventure    9.496922e+09
          Animation    1.463147e+10
          Comedy       1.566387e+10
          Action       3.263226e+10
          Drama        3.540997e+10
          Name: Gross, dtype: float64
Out[214]: Genre
Drama 3.540997e+10
Action 3.263226e+10
Comedy 1.566387e+10
Name: Gross, dtype: float64
Out[217]: Genre
Western 8.35
Name: IMDB_Rating, dtype: float64
Pandas groupby
groupby in pandas is a function that lets you group data in a DataFrame based on specific
criteria, and then apply aggregate functions to each group. It's a powerful tool for data analysis
that allows you to quickly and easily calculate summary statistics for your data.
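A minimal sketch of the pattern -- group by a column, then aggregate within each group (this reproduces the per-genre gross totals shown further down, using the IMDB movies frame loaded below):
In [ ]: # total Gross per Genre, largest first
        movies.groupby('Genre')['Gross'].sum().sort_values(ascending=False).head(3)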
[movies.head(2) from the IMDB top-1000 dataset: The Shawshank Redemption (1994, 142 min, Drama, 9.3, Frank Darabont, Tim Robbins, 2343110 votes) and The Godfather (1972, 175 min, Crime, 9.2, Francis Ford Coppola, Marlon Brando, 1620367 votes).]
In [3]: movies.groupby('Genre')
In [4]: generes =movies.groupby('Genre')
Genre
Out[6]: Genre
Thriller 1.755074e+07
Western 5.822151e+07
Film-Noir 1.259105e+08
Family 4.391106e+08
Fantasy 7.827267e+08
Horror 1.034649e+09
Mystery 1.256417e+09
Biography 8.276358e+09
Crime 8.452632e+09
Adventure 9.496922e+09
Animation 1.463147e+10
Comedy 1.566387e+10
Action 3.263226e+10
Drama 3.540997e+10
Name: Gross, dtype: float64
Out[7]: Genre
Western 8.35
Name: IMDB_Rating, dtype: float64
Out[8]: Director
Christopher Nolan 11578345
Name: No_of_Votes, dtype: int64
In [9]: # find the highest rated movie of each genre
movies.head(1)
Out[10]: Star1
Tom Hanks 12
Robert De Niro 11
Clint Eastwood 10
Al Pacino 10
Leonardo DiCaprio 9
..
Glen Hansard 1
Giuseppe Battiston 1
Giulietta Masina 1
Gerardo Taracena 1
Ömer Faruk Sorak 1
Name: Series_Title, Length: 660, dtype: int64
Out[12]: 14
Out[13]: 14
In [14]: # find items in each group -> size
movies.groupby('Genre').size() # index based
Out[14]: Genre
Action 172
Adventure 72
Animation 82
Biography 88
Comedy 155
Crime 107
Drama 289
Family 2
Fantasy 2
Film-Noir 3
Horror 11
Mystery 12
Thriller 1
Western 4
dtype: int64
In [18]: # first()/last() -> nth item
#movies.groupby('Genre').first()
#movies.groupby('Genre').last()
movies.groupby('Genre').nth(6)   # gives the 7th movie of each group
[the 7th film of each group where one exists: Action -- Star Wars: Episode V - The Empire Strikes Back (1980, Irvin Kershner, Mark Hamill); Animation -- WALL·E (2008, Andrew Stanton); Crime -- Se7en (1995, David Fincher); Drama -- It's a Wonderful Life (1946, Frank Capra); Horror -- Get Out (2017, Jordan Peele); Mystery -- Sleuth (1972, Joseph L. Mankiewicz).]
In [19]: # get_group -> vs filtering
movies.groupby('Genre').get_group('Horror')
[the Horror group: rows include Get Out (2017, Jordan Peele), Halloween (1978, John Carpenter), The Invisible Man (1933, James Whale) and Saw (2004, James Wan).]
In [20]: movies.groupby('Genre').get_group('Fantasy')
[the Fantasy group: Das Cabinet des Dr. Caligari (1920, Robert Wiene) and Nosferatu (1922, F.W. Murnau).]
In [24]: # groups
movies.groupby('Genre').groups   # a dictionary mapping each group label to the index positions of its rows
Out[24]: {'Action': [2, 5, 8, 10, 13, 14, 16, 29, 30, 31, 39, 42, 44, 55, 57, 59, 60,
63, 68, 72, 106, 109, 129, 130, 134, 140, 142, 144, 152, 155, 160, 161, 166,
168, 171, 172, 177, 181, 194, 201, 202, 216, 217, 223, 224, 236, 241, 262, 27
5, 294, 308, 320, 325, 326, 331, 337, 339, 340, 343, 345, 348, 351, 353, 356,
357, 362, 368, 369, 375, 376, 390, 410, 431, 436, 473, 477, 479, 482, 488, 49
3, 496, 502, 507, 511, 532, 535, 540, 543, 564, 569, 570, 573, 577, 582, 583,
602, 605, 608, 615, 623, ...], 'Adventure': [21, 47, 93, 110, 114, 116, 118,
137, 178, 179, 191, 193, 209, 226, 231, 247, 267, 273, 281, 300, 301, 304, 30
6, 323, 329, 361, 366, 377, 402, 406, 415, 426, 458, 470, 497, 498, 506, 513,
514, 537, 549, 552, 553, 566, 576, 604, 609, 618, 638, 647, 675, 681, 686, 69
2, 711, 713, 739, 755, 781, 797, 798, 851, 873, 884, 912, 919, 947, 957, 964,
966, 984, 991], 'Animation': [23, 43, 46, 56, 58, 61, 66, 70, 101, 135, 146,
151, 158, 170, 197, 205, 211, 213, 219, 229, 230, 242, 245, 246, 270, 330, 33
2, 358, 367, 378, 386, 389, 394, 395, 399, 401, 405, 409, 469, 499, 510, 516,
518, 522, 578, 586, 592, 595, 596, 599, 633, 640, 643, 651, 665, 672, 694, 72
8, 740, 741, 744, 756, 758, 761, 771, 783, 796, 799, 822, 828, 843, 875, 891,
892, 902, 906, 920, 956, 971, 976, 986, 992], 'Biography': [7, 15, 18, 35, 3
8, 54, 102, 107, 131, 139, 147, 157, 159, 173, 176, 212, 215, 218, 228, 235,
243, 263, 276, 282, 290, 298, 317, 328, 338, 342, 346, 359, 360, 365, 372, 37
3, 385, 411, 416, 418, 424, 429, 484, 525, 536, 542, 545, 575, 579, 587, 600,
606, 614, 622, 632, 635, 644, 649, 650, 657, 671, 673, 684, 729, 748, 753, 75
7, 759, 766, 770, 779, 809, 810, 815, 820, 831, 849, 858, 877, 882, 897, 910,
915, 923, 940, 949, 952, 987], 'Comedy': [19, 26, 51, 52, 64, 78, 83, 95, 96,
112, 117, 120, 127, 128, 132, 153, 169, 183, 192, 204, 207, 208, 214, 221, 23
3, 238, 240, 250, 251, 252, 256, 261, 266, 277, 284, 311, 313, 316, 318, 322,
327, 374, 379, 381, 392, 396, 403, 413, 414, 417, 427, 435, 445, 446, 449, 45
5, 459, 460, 463, 464, 466, 471, 472, 475, 481, 490, 494, 500, 503, 509, 526,
528, 530, 531, 533, 538, 539, 541, 547, 557, 558, 562, 563, 565, 574, 591, 59
3, 594, 598, 613, 626, 630, 660, 662, 667, 679, 680, 683, 687, 701, ...], 'Cr
ime': [1, 3, 4, 6, 22, 25, 27, 28, 33, 37, 41, 71, 77, 79, 86, 87, 103, 108,
111, 113, 123, 125, 133, 136, 162, 163, 164, 165, 180, 186, 187, 189, 198, 22
2, 232, 239, 255, 257, 287, 288, 299, 305, 335, 363, 364, 380, 384, 397, 437,
438, 441, 442, 444, 450, 451, 465, 474, 480, 485, 487, 505, 512, 519, 520, 52
3, 527, 546, 556, 560, 584, 597, 603, 607, 611, 621, 639, 653, 664, 669, 676,
695, 708, 723, 762, 763, 767, 775, 791, 795, 802, 811, 823, 827, 833, 885, 89
5, 921, 922, 926, 938, ...], 'Drama': [0, 9, 11, 17, 20, 24, 32, 34, 36, 40,
45, 50, 53, 62, 65, 67, 73, 74, 76, 80, 82, 84, 85, 88, 89, 90, 91, 92, 94, 9
7, 98, 99, 100, 104, 105, 121, 122, 124, 126, 138, 141, 143, 148, 149, 150, 1
54, 156, 167, 174, 175, 182, 184, 185, 188, 190, 195, 196, 199, 200, 203, 20
6, 210, 225, 227, 234, 237, 244, 248, 249, 253, 254, 258, 259, 260, 264, 265,
268, 269, 272, 274, 278, 279, 280, 283, 285, 286, 289, 291, 292, 293, 295, 29
6, 297, 302, 303, 307, 310, 312, 314, 315, ...], 'Family': [688, 698], 'Fanta
sy': [321, 568], 'Film-Noir': [309, 456, 712], 'Horror': [49, 75, 271, 419, 5
44, 707, 724, 844, 876, 932, 948], 'Mystery': [69, 81, 119, 145, 220, 393, 42
0, 714, 829, 899, 959, 961], 'Thriller': [700], 'Western': [12, 48, 115, 69
1]}
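A related helper is get_group(), which pulls a single group out as a DataFrame. A minimal sketch, assuming the same movies DataFrame:
# all rows of one genre, as a plain DataFrame
movies.groupby('Genre').get_group('Horror')
# len() on a GroupBy object gives the number of groups
len(movies.groupby('Genre'))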
7/24
In [25]: # describe
movies.groupby('Genre').describe()
count mean std min 25% 50% 75% max count mean .
Genre
Action 172.0 129.046512 28.500706 45.0 110.75 127.5 143.25 321.0 172.0 7.949419 .
Adventure 72.0 134.111111 33.317320 88.0 109.00 127.0 149.00 228.0 72.0 7.937500 .
Animation 82.0 99.585366 14.530471 71.0 90.00 99.5 106.75 137.0 82.0 7.930488 .
Biography 88.0 136.022727 25.514466 93.0 120.00 129.0 146.25 209.0 88.0 7.938636 .
Comedy 155.0 112.129032 22.946213 68.0 96.00 106.0 124.50 188.0 155.0 7.901290 .
Crime 107.0 126.392523 27.689231 80.0 106.50 122.0 141.50 229.0 107.0 8.016822 .
Drama 289.0 124.737024 27.740490 64.0 105.00 121.0 137.00 242.0 289.0 7.957439 .
Family 2.0 107.500000 10.606602 100.0 103.75 107.5 111.25 115.0 2.0 7.800000 .
Fantasy 2.0 85.000000 12.727922 76.0 80.50 85.0 89.50 94.0 2.0 8.000000 .
Film-Noir 3.0 104.000000 4.000000 100.0 102.00 104.0 106.00 108.0 3.0 7.966667 .
Horror 11.0 102.090909 13.604812 71.0 98.00 103.0 109.00 122.0 11.0 7.909091 .
Mystery 12.0 119.083333 14.475423 96.0 110.75 117.5 130.25 138.0 12.0 7.975000 .
Thriller 1.0 108.000000 NaN 108.0 108.00 108.0 108.00 108.0 1.0 7.800000 .
Western 4.0 148.250000 17.153717 132.0 134.25 148.0 162.00 165.0 4.0 8.350000 .
14 rows × 40 columns
8/24
In [30]: # sample
movies.groupby('Genre').sample(2,replace=True)
9/24
Out[30]:     Series_Title                        Released_Year  Runtime  Genre      IMDB_Rating  Director            Star1              No_...
        171  Die Hard                            1988           132      Action     8.2          John McTiernan      Bruce Willis       ...
        799  South Park: Bigger, Longer & Uncut  1999           81       Animation  7.7          Trey Parker         Trey Parker        ...
        632  The World's Fastest Indian          2005           127      Biography  7.8          Roger Donaldson     Anthony Hopkins    ...
        372  Mar adentro                         2014           126      Biography  8.0          Alejandro Amenábar  Javier Bardem      ...
        660  The Sandlot                         1993           101      Comedy     7.8          David Mickey Evans  Tom Guiry          ...
        108  Scarface                            1983           170      Crime      8.3          Brian De Palma      Al Pacino          ...
        53   Capharnaüm                          2018           126      Drama      8.4          Nadine Labaki       Zain Al Rafeea     ...
        894  Creed                               2015           133      Drama      7.6          Ryan Coogler        Michael B. Jordan  ...
        688  E.T. the Extra-Terrestrial          1982           115      Family     7.8          Steven Spielberg    Henry Thomas       ...
        688  E.T. the Extra-Terrestrial          1982           115      Family     7.8          Steven Spielberg    Henry Thomas       ...
        568  Nosferatu                           1922           94       Fantasy    7.9          F.W. Murnau         Max Schreck        ...
        321  Das Cabinet des Dr. Caligari        1920           76       Fantasy    8.1          Robert Wiene        Werner Krauss      ...
10/24
             Series_Title                   Released_Year  Runtime  Genre    IMDB_Rating  Director            Star1          No_...
        948  The Others                     2001           101      Horror   7.6          Alejandro Amenábar  Nicole Kidman  ...
        959  Dark City                      1998           100      Mystery  7.6          Alex Proyas         Rufus Sewell   ...
        48   Once Upon a Time in the West   1968           165      Western  8.5          Sergio Leone        Henry Fonda    ...
In [31]: # nunique
movies.groupby('Genre').nunique()  # unique --> returns the unique items, nunique --> gives the number of unique items per column
Genre
Adventure 72 49 58 10 59 59 72
Animation 82 35 41 11 51 77 82
Biography 88 44 56 13 76 72 88
Family 2 2 2 1 2 2 2
Fantasy 2 2 2 2 2 2 2
Film-Noir 3 3 3 3 3 3 3
Horror 11 11 10 8 10 11 11
Mystery 12 11 10 8 10 11 12
Thriller 1 1 1 1 1 1 1
Western 4 4 4 4 2 2 4
11/24
aggregate method
In [33]:
# passing dict
movies.groupby('Genre').agg(
{
'Runtime':'mean',
'IMDB_Rating':'mean',
'No_of_Votes':'sum',
'Gross':'sum',
'Metascore':'min'
}
)
Genre
12/24
In [37]: # Passing a list
movies.groupby('Genre').agg(['min','max','mean','sum','median'])
min max mean sum median min max mean sum median ...
Genre
Action 45 321 129.046512 22196 127.5 7.6 9.0 7.949419 1367.3 7.9 ...
Adventure 88 228 134.111111 9656 127.0 7.6 8.6 7.937500 571.5 7.9 ...
Animation 71 137 99.585366 8166 99.5 7.6 8.6 7.930488 650.3 7.9 ...
Biography 93 209 136.022727 11970 129.0 7.6 8.9 7.938636 698.6 7.9 ...
Comedy 68 188 112.129032 17380 106.0 7.6 8.6 7.901290 1224.7 7.9 ...
Crime 80 229 126.392523 13524 122.0 7.6 9.2 8.016822 857.8 8.0 ...
Drama 64 242 124.737024 36049 121.0 7.6 9.3 7.957439 2299.7 8.0 ...
Family 100 115 107.500000 215 107.5 7.8 7.8 7.800000 15.6 7.8 ... 4
Fantasy 76 94 85.000000 170 85.0 7.9 8.1 8.000000 16.0 8.0 ... 337
Film-Noir 100 108 104.000000 312 104.0 7.8 8.1 7.966667 23.9 8.0 ...
Horror 71 122 102.090909 1123 103.0 7.6 8.5 7.909091 87.0 7.8 ...
Mystery 96 138 119.083333 1429 117.5 7.6 8.4 7.975000 95.7 8.0 ... 1
Thriller 108 108 108.000000 108 108.0 7.8 7.8 7.800000 7.8 7.8 ... 17
Western 132 165 148.250000 593 148.0 7.8 8.8 8.350000 33.4 8.4 ... 5
14 rows × 25 columns
13/24
In [39]: # Adding both the syntax
movies.groupby('Genre').agg(
{
'Runtime':['min','mean'],
'IMDB_Rating':'mean',
'No_of_Votes':['sum','max'],
'Gross':'sum',
'Metascore':'min'
}
)
Genre
14/24
In [40]: # looping on groups
for group , data in movies.groupby('Genre'):
print(data)
15/24
In [41]: # find the highest rated movie of each genre
df = pd.DataFrame(columns=movies.columns)
             Series_Title                      Released_Year  Runtime  Genre      IMDB_Rating  Director              Star1
        21   Interstellar                      2014           169      Adventure  8.6          Christopher Nolan     Matthew McConaughey
        23   Sen to Chihiro no kamikakushi     2001           125      Animation  8.6          Hayao Miyazaki        Daveigh Chase
        7    Schindler's List                  1993           195      Biography  8.9          Steven Spielberg      Liam Neeson
        19   Gisaengchung                      2019           132      Comedy     8.6          Bong Joon Ho          Kang-ho Song
        26   La vita è bella                   1997           116      Comedy     8.6          Roberto Benigni       Roberto Benigni
        1    The Godfather                     1972           175      Crime      9.2          Francis Ford Coppola  Marlon Brando
        0    The Shawshank Redemption          1994           142      Drama      9.3          Frank Darabont        Tim Robbins
        321  Das Cabinet des Dr. Caligari      1920           76       Fantasy    8.1          Robert Wiene          Werner Krauss
        309  The Third Man                     1949           104      Film-Noir  8.1          Carol Reed            Orson Welles
        49   Psycho                            1960           109      Horror     8.5          Alfred Hitchcock      Anthony Perkins
        69   Memento                           2000           113      Mystery    8.4          Christopher Nolan     Guy Pearce
        12   Il buono, il brutto, il cattivo   1966           161      Western    8.8          Sergio Leone          Clint Eastwood
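The cell that fills df above is not visible in this export; a minimal sketch of one way to get the highest-rated movie of each genre (an assumption, not necessarily the author's exact code):
# keep, within each genre, the row(s) whose rating equals the group maximum
top = (movies.groupby('Genre')
             .apply(lambda g: g[g['IMDB_Rating'] == g['IMDB_Rating'].max()])
             .reset_index(drop=True))
top[['Series_Title','Genre','IMDB_Rating']]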
16/24
split (apply) combine
Genre       Series_Title                     Released_Year  Runtime  Genre      IMDB_Rating  Director            Star1
Adventure   2001: A Space Odyssey            1925           88       Adventure  7.6          Akira Kurosawa      Aamir Khan
Fantasy     Das Cabinet des Dr. Caligari     1920           76       Fantasy    7.9          F.W. Murnau         Max Schreck
Horror      Alien                            1933           71       Horror     7.6          Alejandro Amenábar  Anthony Perkins
Mystery     Dark City                        1938           96       Mystery    7.6          Alex Proyas         Bernard-Pierre Donnadieu
Western     Il buono, il brutto, il cattivo  1965           132      Western    7.8          Clint Eastwood      Clint Eastwood
17/24
In [43]: # find number of movies starting with A for each group
def foo(group):
print(group) # type = Dataframe
return group
In [44]: movies.groupby('Genre').apply(foo)
In [50]: # count, per genre, how many titles start with 'A'
movies.groupby('Genre').apply(lambda group: group['Series_Title'].str.startswith('A').sum())
Out[50]: Genre
Action 10
Adventure 2
Animation 2
Biography 9
Comedy 14
Crime 4
Drama 21
Family 0
Fantasy 0
Film-Noir 0
Horror 1
Mystery 0
Thriller 0
Western 0
dtype: int64
18/24
In [51]: # find ranking of each movie in the group according to IMDB score
def rank_movie(group):
group['genre_rank']=group['IMDB_Rating'].rank(ascending=False)
return group
In [52]: movies.groupby('Genre').apply(rank_movie)
             Series_Title              Released_Year  Runtime  Genre  IMDB_Rating  Director              Star1              ...
        0    The Shawshank Redemption  1994           142      Drama  9.3          Frank Darabont        Tim Robbins        ...
        1    The Godfather             1972           175      Crime  9.2          Francis Ford Coppola  Marlon Brando      ...
        3    The Godfather: Part II    1974           202      Crime  9.0          Francis Ford Coppola  Al Pacino          ...
        ...
        996  Giant                     1956           201      Drama  7.6          George Stevens        Elizabeth Taylor   ...
        998  Lifeboat                  1944           97       Drama  7.6          Alfred Hitchcock      Tallulah Bankhead  ...
        999  The 39 Steps              1935           86       Crime  7.6          Alfred Hitchcock      Robert Donat       ...
19/24
In [55]: # find normalized IMDB rating group wise
#x normalized = (x – x minimum) / (x maximum – x minimum)
def normal(group):
    group['normal_rating'] = (group['IMDB_Rating'] - group['IMDB_Rating'].min()) / (group['IMDB_Rating'].max() - group['IMDB_Rating'].min())
    return group
movies.groupby('Genre').apply(normal)
             Series_Title              Released_Year  Runtime  Genre  IMDB_Rating  Director              Star1              ...
        0    The Shawshank Redemption  1994           142      Drama  9.3          Frank Darabont        Tim Robbins        ...
        1    The Godfather             1972           175      Crime  9.2          Francis Ford Coppola  Marlon Brando      ...
        3    The Godfather: Part II    1974           202      Crime  9.0          Francis Ford Coppola  Al Pacino          ...
        ...
        996  Giant                     1956           201      Drama  7.6          George Stevens        Elizabeth Taylor   ...
        998  Lifeboat                  1944           97       Drama  7.6          Alfred Hitchcock      Tallulah Bankhead  ...
        999  The 39 Steps              1935           86       Crime  7.6          Alfred Hitchcock      Robert Donat       ...
20/24
groupby on multiple cols
21/24
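The cell that builds combo is not shown here; judging from the (Director, Star1) index of the output below, it was presumably something like:
# group by two columns at once; aggregations are then computed per (Director, Star1) pair
combo = movies.groupby(['Director','Star1'])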
In [77]: # agg on multiple groupby
combo.agg(['min','max','mean'])
[one row per (Director, Star1) pair with the min, max and mean of Runtime, IMDB_Rating, No_of_Votes and Gross — e.g. (Aamir Khan, Amole Gupte), (Aaron Sorkin, Eddie Redmayne), ..., (Ömer Faruk Sorak, Cem Yilmaz)]
Exercise
22/24
In [78]: ipl = pd.read_csv("deliveries.csv")
ipl.head(2)
Out[78]:    match_id  inning  batting_team         bowling_team                 over  ball  batsman    non_striker  bowler    is_s...
         0  1         1       Sunrisers Hyderabad  Royal Challengers Bangalore  1     1     DA Warner  S Dhawan     TS Mills  ...
         1  1         1       Sunrisers Hyderabad  Royal Challengers Bangalore  1     2     DA Warner  S Dhawan     TS Mills  ...
2 rows × 21 columns
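The cells behind the next two outputs are not visible; the listing below looks like the top run scorers, and six is used two cells later, so plausible reconstructions (assumptions) are:
# total runs per batsman, top 10
ipl.groupby('batsman')['batsman_runs'].sum().sort_values(ascending=False).head(10).reset_index()
# deliveries that went for six, used in the next groupby
six = ipl[ipl['batsman_runs'] == 6]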
0 V Kohli 5434
1 SK Raina 5415
2 RG Sharma 4914
3 DA Warner 4741
4 S Dhawan 4632
5 CH Gayle 4560
6 MS Dhoni 4477
7 RV Uthappa 4446
8 AB de Villiers 4428
9 G Gambhir 4223
six.groupby('batsman')['batsman'].count().sort_values(ascending=False).head(2)
Out[94]: batsman
CH Gayle 327
AB de Villiers 214
Name: batsman, dtype: int64
23/24
In [105]: # find batsman with most number of 4's and 6's in last 5 overs
temp = ipl[ipl['over']>15]
temp = temp[(temp['batsman_runs'] == 4) | (temp['batsman_runs'] == 6)]
temp.groupby('batsman')['batsman'].count().sort_values(ascending=False).head(1)
temp.groupby('bowling_team')['batsman_runs'].sum().sort_values(ascending=False)
In [118]: # Create a function that can return the highest score of any batsman
temp.groupby('match_id')['batsman_runs'].sum().sort_values(ascending=False).head()
Out[118]: 113
Out[123]: 126
In [ ]:
24/24
In [1]: import pandas as pd
import numpy as np
In [3]: courses.head(2)
Out[3]:
course_id course_name price
0 1 python 2499
1 2 sql 3499
In [4]: students.head(2)
Out[4]:
student_id name partner
0 1 Kailash Harjo 23
1 2 Esha Butala 1
In [5]: may.head(2)
Out[5]:
student_id course_id
0 23 1
1 15 5
In [6]: june.head(2)
Out[6]:
student_id course_id
0 3 5
1 16 7
In [7]: matches.head(2)
Out[7]:
   id  season  city       date        team1                team2                        toss_winner                  toss_decision  result  dl_applied  winner                  win_by_runs  win_by_wickets  player_of_ma...
0  1   2017    Hyderabad  2017-04-05  Sunrisers Hyderabad  Royal Challengers Bangalore  Royal Challengers Bangalore  field          normal  0           Sunrisers Hyderabad     35           0               Yuvraj Si...
1  2   2017    Pune       2017-04-06  Mumbai Indians       Rising Pune Supergiant       Rising Pune Supergiant       field          normal  0           Rising Pune Supergiant  0            7               SPD S...
Concat
pd.concat() is a powerful function that concatenates two or more DataFrames along a particular axis (row-wise or column-wise). You can control how the
data is combined through parameters such as axis, join, ignore_index and keys.
1/20
In [8]: regs = pd.concat([may,june],ignore_index=True) # Vertically merged
regs
2/20
Out[8]:
student_id course_id
0 23 1
1 15 5
2 18 6
3 23 4
4 16 9
5 18 1
6 1 1
7 7 8
8 22 3
9 15 1
10 19 4
11 1 6
12 7 10
13 11 7
14 13 3
15 24 4
16 21 1
17 16 5
18 23 3
19 17 7
20 23 6
21 25 1
22 19 2
23 25 10
24 3 3
25 3 5
26 16 7
27 12 10
28 12 1
29 14 9
30 7 7
31 7 2
32 16 3
33 17 10
34 11 8
35 14 6
36 12 5
37 12 7
38 18 8
39 1 10
40 1 9
41 2 5
42 7 6
43 22 5
44 22 6
45 23 9
46 23 5
47 14 4
48 14 1
49 11 10
50 42 9
51 50 8
52 38 1
3/20
In [9]: # Multi_index DataFrame
multi = pd.concat([may,june],keys=['may','june'])
multi
4/20
Out[9]:
student_id course_id
may 0 23 1
1 15 5
2 18 6
3 23 4
4 16 9
5 18 1
6 1 1
7 7 8
8 22 3
9 15 1
10 19 4
11 1 6
12 7 10
13 11 7
14 13 3
15 24 4
16 21 1
17 16 5
18 23 3
19 17 7
20 23 6
21 25 1
22 19 2
23 25 10
24 3 3
june 0 3 5
1 16 7
2 12 10
3 12 1
4 14 9
5 7 7
6 7 2
7 16 3
8 17 10
9 11 8
10 14 6
11 12 5
12 12 7
13 18 8
14 1 10
15 1 9
16 2 5
17 7 6
18 22 5
19 22 6
20 23 9
21 23 5
22 14 4
23 14 1
24 11 10
25 42 9
26 50 8
27 38 1
5/20
In [10]: multi.loc['may']
Out[10]:
student_id course_id
0 23 1
1 15 5
2 18 6
3 23 4
4 16 9
5 18 1
6 1 1
7 7 8
8 22 3
9 15 1
10 19 4
11 1 6
12 7 10
13 11 7
14 13 3
15 24 4
16 21 1
17 16 5
18 23 3
19 17 7
20 23 6
21 25 1
22 19 2
23 25 10
24 3 3
In [11]: multi.loc[('june',0)]
Out[11]: student_id 3
course_id 5
Name: (june, 0), dtype: int64
6/20
In [12]: # Horizontally placed
pd.concat([may,june],axis=1)
Out[12]:
student_id course_id student_id course_id
0 23.0 1.0 3 5
1 15.0 5.0 16 7
2 18.0 6.0 12 10
3 23.0 4.0 12 1
4 16.0 9.0 14 9
5 18.0 1.0 7 7
6 1.0 1.0 7 2
7 7.0 8.0 16 3
8 22.0 3.0 17 10
9 15.0 1.0 11 8
10 19.0 4.0 14 6
11 1.0 6.0 12 5
12 7.0 10.0 12 7
13 11.0 7.0 18 8
14 13.0 3.0 1 10
15 24.0 4.0 1 9
16 21.0 1.0 2 5
17 16.0 5.0 7 6
18 23.0 3.0 22 5
19 17.0 7.0 22 6
20 23.0 6.0 23 9
21 25.0 1.0 23 5
22 19.0 2.0 14 4
23 25.0 10.0 14 1
24 3.0 3.0 11 10
25 NaN NaN 42 9
26 NaN NaN 50 8
27 NaN NaN 38 1
Merge
On Joins
Inner Join
7/20
For a join, the two tables need a common column; here it is student_id, present in both students and regs. We merge on student_id, and an inner join
keeps only the rows whose key appears in both DataFrames.
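The In[13] cell itself is not visible; based on the output below it was presumably an inner merge on student_id (only the last rows are displayed):
# keep only the student_ids present in both DataFrames
students.merge(regs, how='inner', on='student_id')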
Out[13]:
student_id name partner course_id
45 23 Chhavi Lachman 18 9
46 23 Chhavi Lachman 18 5
47 24 Radhika Suri 17 4
48 25 Shashank D’Alia 2 1
49 25 Shashank D’Alia 2 10
Left Join
A left join keeps every row of the left DataFrame whether or not it has a match on the right. Here courses is the left table and regs is the right table,
so courses with no registrations (e.g. Numpy, C++) still appear, but with NaN in the right-side columns such as student_id.
In [14]:
courses.merge(regs,how='left',on='course_id').tail(5)
Out[14]:
course_id course_name price student_id
Right join
In [15]:
temp_df = pd.DataFrame({
'student_id':[26,27,28],
'name':['Nitish','Ankit','Rahul'],
'partner':[28,26,17]
})
students = pd.concat([students,temp_df],ignore_index=True)
In [16]: students.tail()
Out[16]:
student_id name partner
23 24 Radhika Suri 17
24 25 Shashank D’Alia 2
25 26 Nitish 28
26 27 Ankit 26
27 28 Rahul 17
The last rows of regs (index 50, 51, 52, i.e. student_ids 42, 50 and 38) have no matching rows in students, yet they are still printed.
Why?
Because a right join keeps every row of the right DataFrame (regs) regardless of whether a matching row exists on the left.
8/20
In [17]: students.merge(regs, how='right',on='student_id').tail(5)
Out[17]:
student_id name partner course_id
50 42 NaN NaN 9
51 50 NaN NaN 8
52 38 NaN NaN 1
The newly added students (26, 27, 28) have no registrations, so their course_id shows NaN in the output below.
Why a left join on student_id here? Because a left join keeps every row of the left DataFrame (students) regardless of whether it has a match on the right.
Out[18]:
student_id name partner course_id
57 26 Nitish 28 NaN
58 27 Ankit 26 NaN
59 28 Rahul 17 NaN
Outer join
An outer join is the union of the left and right joins: the left-only students (Nitish, Ankit, Rahul) appear with NaN in course_id,
the right-only registrations (student_ids 42, 50, 38) appear with NaN in the student columns,
and every matching row appears as usual, so both the common data and the data unique to either side are visible in a single result.
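The outer-join cell is not shown; a minimal sketch of the merge described above:
# union of both key sets; unmatched rows get NaN on the missing side
students.merge(regs, how='outer', on='student_id')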
Out[19]:
student_id name partner course_id
Out[20]: 154247
Out[27]: level_0
june 65072
may 89175
Name: price, dtype: int64
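The cells behind Out[20] (total revenue) and Out[27] (month-wise revenue) are not visible; plausible reconstructions, assuming the multi frame built earlier with keys=['may','june'] (those labels surface as level_0 after reset_index):
# 1. total revenue generated
regs.merge(courses, on='course_id')['price'].sum()
# 2. month-wise revenue
multi.reset_index().merge(courses, on='course_id').groupby('level_0')['price'].sum()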
9/20
In [32]: # 3. Print the registration table
# cols -> name -> course -> price
Out[32]:
student_id course_id name partner course_name price
10/20
In [33]: regs.merge(students, on = 'student_id').merge(courses , on='course_id')[['name','course_name','price']]
Out[33]:
name course_name price
11/20
In [38]: # 4. Plot bar chart for revenue/course
regs.merge(courses,on ='course_id').groupby('course_name')['price'].sum()
Out[38]: course_name
data analysis 24995
machine learning 39996
ms sxcel 7995
pandas 4396
plotly 3495
power bi 11394
pyspark 14994
python 22491
sql 6998
tableau 17493
Name: price, dtype: int64
Out[41]: <AxesSubplot:xlabel='course_name'>
intersect1d
Find the intersection of two arrays. Return the sorted, unique values that are in both of the input arrays.
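The cell that defines common_students_id (used just below) is not shown; a plausible reconstruction with NumPy:
# students who registered in both May and June
common_students_id = np.intersect1d(may['student_id'], june['student_id'])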
In [47]: students[students['student_id'].isin(common_students_id)]
Out[47]:
student_id name partner
0 1 Kailash Harjo 23
2 3 Parveen Bhalla 3
6 7 Tarun Thaker 9
10 11 David Mukhopadhyay 20
15 16 Elias Dodiya 25
16 17 Yasmin Palan 7
17 18 Fardeen Mahabir 13
21 22 Yash Sethi 21
22 23 Chhavi Lachman 18
numpy.setdiff1d()
finds the set difference of two arrays: it returns the unique values in arr1 that are not in arr2.
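A sketch of how setdiff1d can solve exercise 6 below (the original cell is hidden):
# course_ids present in courses but never in regs -> courses with no enrollment
courses[courses['course_id'].isin(np.setdiff1d(courses['course_id'], regs['course_id']))]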
12/20
In [52]: # 6. find course that got no enrollment
# courses['course_id']
# regs['course_id']
Out[52]:
course_id course_name price
10 11 Numpy 699
11 12 C++ 1299
In [53]: # 7. find students who did not enroll into any courses
Out[53]:
student_id name partner
3 4 Marlo Dugal 14
4 5 Kusum Bahri 6
5 6 Lakshmi Contractor 10
7 8 Radheshyam Dey 5
8 9 Nitika Chatterjee 4
9 10 Aayushman Sant 8
19 20 Hanuman Hegde 11
25 26 Nitish 28
26 27 Ankit 26
27 28 Rahul 17
In [55]: students[students['student_id'].isin(student_id_list)].shape[0]
Out[55]: 10
(10/28)*100
Out[56]: 35.714285714285715
Self Join
A self join is a regular join, but the table is joined with itself.
Here left_on='partner' refers to the partner column of the left copy of students, and right_on='student_id' refers to the student_id column of the right copy, so each student's row is matched with their partner's row.
13/20
In [60]: # 8. Print student name -> partner name for all enrolled students
# self join
students.merge(students,how ='inner',left_on = 'partner', right_on= 'student_id')[['name_x','name_y']]
Out[60]:
name_x name_y
26 Nitish Rahul
27 Ankit Nitish
In [81]: # 10. find top 5 students who spent most amount of money on courses
regs.merge(students, on='student_id').merge(courses, on='course_id').groupby(['student_id','name'])['price'].sum().sort_values(ascending=False).head(5)
14/20
In [82]: # Alternate syntax for merge
# students.merge(regs)
Out[82]:
student_id name partner course_id
0 1 Kailash Harjo 23 1
1 1 Kailash Harjo 23 6
2 1 Kailash Harjo 23 10
3 1 Kailash Harjo 23 9
4 2 Esha Butala 1 5
5 3 Parveen Bhalla 3 3
6 3 Parveen Bhalla 3 5
7 7 Tarun Thaker 9 8
8 7 Tarun Thaker 9 10
9 7 Tarun Thaker 9 7
10 7 Tarun Thaker 9 2
11 7 Tarun Thaker 9 6
12 11 David Mukhopadhyay 20 7
13 11 David Mukhopadhyay 20 8
14 11 David Mukhopadhyay 20 10
15 12 Radha Dutt 19 10
16 12 Radha Dutt 19 1
17 12 Radha Dutt 19 5
18 12 Radha Dutt 19 7
19 13 Munni Varghese 24 3
20 14 Pranab Natarajan 22 9
21 14 Pranab Natarajan 22 6
22 14 Pranab Natarajan 22 4
23 14 Pranab Natarajan 22 1
24 15 Preet Sha 16 5
25 15 Preet Sha 16 1
26 16 Elias Dodiya 25 9
27 16 Elias Dodiya 25 5
28 16 Elias Dodiya 25 7
29 16 Elias Dodiya 25 3
30 17 Yasmin Palan 7 7
31 17 Yasmin Palan 7 10
32 18 Fardeen Mahabir 13 6
33 18 Fardeen Mahabir 13 1
34 18 Fardeen Mahabir 13 8
35 19 Qabeel Raman 12 4
36 19 Qabeel Raman 12 2
37 21 Seema Kota 15 1
38 22 Yash Sethi 21 3
39 22 Yash Sethi 21 5
40 22 Yash Sethi 21 6
41 23 Chhavi Lachman 18 1
42 23 Chhavi Lachman 18 4
43 23 Chhavi Lachman 18 3
44 23 Chhavi Lachman 18 6
45 23 Chhavi Lachman 18 9
46 23 Chhavi Lachman 18 5
47 24 Radhika Suri 17 4
48 25 Shashank D’Alia 2 1
49 25 Shashank D’Alia 2 10
15/20
In [87]: # IPL Problems
matches
Out[87]:
[matches DataFrame: one row per IPL match with id, season, city, date, team1, team2, toss_winner, toss_decision, result, dl_applied, winner, win_by_runs, win_by_wickets, player_of_match, ...; head and tail rows from the 2017 and 2016 seasons are shown]
16/20
In [89]: deliveries
Out[89]:
[deliveries DataFrame: one row per ball with match_id, inning, batting_team, bowling_team, over, ball, batsman, non_striker, bowler, is_super_over, ..., bye_runs, legbye_runs, noball_runs, penalty_runs, ...; rows run from match 1 (Sunrisers Hyderabad vs Royal Challengers Bangalore) to match 11415 (Chennai Super Kings vs Mumbai Indians)]
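temp is used in the next cell but its construction is not visible; given the 39 columns in its output it was presumably the deliveries data joined with the match details, e.g.:
# every delivery joined with the details of its match (assumed reconstruction)
temp = deliveries.merge(matches, left_on='match_id', right_on='id')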
temp.head(2)
Out[94]:
[head(2) of temp: the deliveries columns followed by the matched match columns (result, dl_applied, winner, win_by_runs, ...) — 2 rows × 39 columns]
In [101]: six_df=temp[temp['batsman_runs']==6]
six_df.head(2)
Out[101]:
[head(2) of six_df: deliveries where batsman_runs == 6, with the merged match columns — 2 rows × 39 columns]
17/20
In [105]: #stadium --> sixes
number_six = six_df.groupby('venue')['venue'].count()
number_six.head()
Out[105]: venue
Barabati Stadium 68
Brabourne Stadium 114
Buffalo Park 27
De Beers Diamond Oval 34
Dr DY Patil Sports Academy 173
Name: venue, dtype: int64
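number_matches, used in the next cell, is not defined in any visible cell; a plausible (assumed) definition is the number of matches played at each venue, assuming matches carries a venue column:
number_matches = matches['venue'].value_counts()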
In [112]: (number_six/number_matches).sort_values(ascending=False).head()
Out[113]:
[the full merged delivery/match table, from match 1 (Sunrisers Hyderabad vs Royal Challengers Bangalore) to match 636 (Royal Challengers Bangalore vs Sunrisers Hyderabad); deliveries columns plus result, dl_applied, winner, win_by_runs, ...]
18/20
In [114]: df = pd.merge(deliveries,matches ,how ='inner',left_on='match_id',right_on='id')
df.head(2)
Out[114]:
[head(2) of df: deliveries merged with matches on match_id == id — 2 rows × 39 columns]
In [117]: df.groupby(['season','batsman'])['batsman_runs'].sum()
In [120]: df.groupby(['season','batsman'])['batsman_runs'].sum().reset_index().sort_values('batsman_runs',ascending=False)
Out[120]:
season batsman batsman_runs
In [123]: df.groupby(['season','batsman'])['batsman_runs'].sum().reset_index().sort_values('batsman_runs',ascending=False).drop_duplicates(subset='season',keep='first')
Out[123]:
season batsman batsman_runs
19/20
In [124]: df.groupby(['season','batsman'])['batsman_runs'].sum().reset_index().sort_values('batsman_runs',ascending=False).sort_values('season')
Out[124]:
season batsman batsman_runs
58 2008 L Balaji 0
45 2008 I Sharma 11
67 2008 M Ntini 11
In [ ]:
20/20
What is MultiIndex in Pandas?
In pandas, a multi-index, also known as a hierarchical index, is a way to represent two or more dimensions of data in a single index. This is useful
when you have data that can be grouped or categorized by more than one variable.
But why?
And what exactly is index?
1/25
In [5]: # 2. pd.MultiIndex.from_product()
pd.MultiIndex.from_product([['cse','ece'],[2019,2020,2021,2022]])
Out[7]: 4
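The cell that builds sample is not visible; from the outputs below it is a Series of 1-8 indexed by the (branch, year) MultiIndex, e.g.:
multiindex = pd.MultiIndex.from_product([['cse','ece'],[2019,2020,2021,2022]])
sample = pd.Series([1,2,3,4,5,6,7,8], index=multiindex)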
In [8]: sample['cse']
Out[8]: 2019 1
2020 2
2021 3
2022 4
dtype: int64
unstack
reshapes the given Series/DataFrame by moving the specified row index level into the columns
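The cell producing Out[9] is not shown; presumably the multi-indexed Series unstacked into a 2-D table:
temp = sample.unstack()   # the innermost index level (the year) becomes the columns
temp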
Out[9]:
2019 2020 2021 2022
cse 1 2 3 4
ece 5 6 7 8
stack
reshapes the given DataFrame by moving a column level back into the row index.
In [10]: temp.stack()
2/25
In [11]: # multi index dataframes
branch_df1 = pd.DataFrame(
[
[1,2],
[3,4],
[5,6],
[7,8],
[9,10],
[11,12],
[13,14],
[15,16],
],
index = multiindex,
columns = ['avg_package','students']
)
branch_df1
Out[11]:
avg_package students
cse 2019 1 2
2020 3 4
2021 5 6
2022 7 8
ece 2019 9 10
2020 11 12
2021 13 14
2022 15 16
In [12]: branch_df1.loc['cse']
Out[12]:
avg_package students
2019 1 2
2020 3 4
2021 5 6
2022 7 8
In [13]: branch_df1['avg_package']
In [14]: branch_df1['students']
3/25
In [15]: branch_df1.loc['ece']
Out[15]:
avg_package students
2019 9 10
2020 11 12
2021 13 14
2022 15 16
branch_df2
Out[16]:
delhi mumbai
2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
In [17]: branch_df2['delhi']
Out[17]:
avg_package students
2019 1 2
2020 3 4
2021 5 6
2022 7 8
In [18]: branch_df2.loc[2019]
In [19]: branch_df2.iloc[1]
4/25
Multiindex df in terms of both cols and index
branch_df3
# here index=multiindex reuses the (branch, year) MultiIndex defined above; the cse and ece data are the same as in branch_df1
Out[20]:
delhi mumbai
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
In [21]: branch_df1
Out[21]:
avg_package students
cse 2019 1 2
2020 3 4
2021 5 6
2022 7 8
ece 2019 9 10
2020 11 12
2021 13 14
2022 15 16
Out[22]:
avg_package students
cse 1 3 5 7 2 4 6 8
ece 9 11 13 15 10 12 14 16
5/25
In [23]: branch_df1.unstack().unstack()
Out[24]:
avg_package students
cse 2019 1 2
2020 3 4
2021 5 6
2022 7 8
ece 2019 9 10
2020 11 12
2021 13 14
2022 15 16
branch_df1.unstack().stack().stack()
6/25
In [26]: # Example : 2
branch_df2
Out[26]:
delhi mumbai
2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
The Unstack()
The unstack() method pivots a row index level into the columns, spreading the data out into a wider table; stack() does the reverse, moving a column level back into the row index.
In [27]: branch_df2.unstack()
In [28]: branch_df2.stack()
Out[28]:
delhi mumbai
2019 avg_package 1 0
students 2 0
2020 avg_package 3 0
students 4 0
2021 avg_package 5 0
students 6 0
2022 avg_package 7 0
students 8 0
7/25
In [29]: branch_df2.stack().stack()
Out[30]:
delhi mumbai
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
In [31]: branch_df3.stack()
Out[31]:
delhi mumbai
students 2 0
2020 avg_package 3 0
students 4 0
2021 avg_package 5 0
students 6 0
2022 avg_package 7 0
students 8 0
students 10 0
2020 avg_package 11 0
students 12 0
2021 avg_package 13 0
students 14 0
2022 avg_package 15 0
students 16 0
8/25
In [32]: branch_df3.stack().stack()
Out[33]:
delhi mumbai
2019 2020 2021 2022 2019 2020 2021 2022 2019 2020 2021 2022 2019 2020 2021 2022
cse 1 3 5 7 2 4 6 8 0 0 0 0 0 0 0 0
ece 9 11 13 15 10 12 14 16 0 0 0 0 0 0 0 0
9/25
In [34]: branch_df3.unstack().unstack()
Out[35]:
delhi mumbai
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
Out[36]:
delhi mumbai
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
10/25
In [37]: # shape
branch_df3.shape
Out[37]: (8, 4)
In [38]: # info
branch_df3.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 8 entries, ('cse', 2019) to ('ece', 2022)
Data columns (total 4 columns):
# Column Non-Null Count Dtype
In [40]: branch_df3.isnull()
Out[40]:
delhi mumbai
branch_df3.loc[('cse',2022)]
branch_df3.loc[('cse',2019):('ece',2020):2]
Out[42]:
delhi mumbai
cse 2019 1 2 0 0
2021 5 6 0 0
ece 2019 9 10 0 0
11/25
In [43]: # Using iloc
branch_df3.iloc[0:5:2]
Out[43]:
delhi mumbai
cse 2019 1 2 0 0
2021 5 6 0 0
ece 2019 9 10 0 0
Out[45]:
delhi mumbai
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
branch_df3.iloc[:,1:3]
Out[46]:
delhi mumbai
students avg_package
cse 2019 2 0
2020 4 0
2021 6 0
2022 8 0
ece 2019 10 0
2020 12 0
2021 14 0
2022 16 0
Out[47]:
delhi mumbai
students avg_package
cse 2019 2 0
ece 2019 10 0
12/25
In [48]: # sort index
# both -> descending -> diff order
# based on one level
branch_df3
Out[48]:
delhi mumbai
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
In [49]: branch_df3.sort_index(ascending=False)
Out[49]:
delhi mumbai
ece 2022 15 16 0 0
2021 13 14 0 0
2020 11 12 0 0
2019 9 10 0 0
cse 2022 7 8 0 0
2021 5 6 0 0
2020 3 4 0 0
2019 1 2 0 0
Out[50]:
delhi mumbai
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
Out[51]:
cse ece
delhi avg_package 1 3 5 7 9 11 13 15
students 2 4 6 8 10 12 14 16
mumbai avg_package 0 0 0 0 0 0 0 0
students 0 0 0 0 0 0 0 0
13/25
In [52]: # swaplevel
branch_df3
Out[52]:
delhi mumbai
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
In [53]: # On rows
branch_df3.swaplevel()
Out[53]:
delhi mumbai
2019 cse 1 2 0 0
2020 cse 3 4 0 0
2021 cse 5 6 0 0
2022 cse 7 8 0 0
2019 ece 9 10 0 0
2020 ece 11 12 0 0
2021 ece 13 14 0 0
2022 ece 15 16 0 0
In [54]: # on columns
branch_df3.swaplevel(axis=1)
Out[54]:
avg_package students avg_package students
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
14/25
Long(Tall) Vs Wide data
Wide format is where we have a single row for every data point with multiple columns to hold the values of various attributes.
Long format is where, for each data point we have as many rows as the number of attributes and each row contains the value of a particular
attribute for a given data point.
Out[55]:
cse
0 120
In [56]: pd.DataFrame({'cse':[120]}).melt()
Out[56]:
variable value
0 cse 120
Out[57]:
variable value
0 cse 120
1 ece 100
2 mech 50
15/25
In [58]: # we can name the variable and value columns
pd.DataFrame({'cse':[120],'ece':[100],'mech':[50]}).melt(var_name='branch',value_name='num_students')
Out[58]:
branch num_students
0 cse 120
1 ece 100
2 mech 50
In [59]: pd.DataFrame(
{
'branch':['cse','ece','mech'],
'2020':[100,150,60],
'2021':[120,130,80],
'2022':[150,140,70]
}
)
Out[59]:
branch 2020 2021 2022
2 mech 60 80 70
In [60]: pd.DataFrame(
{
'branch':['cse','ece','mech'],
'2020':[100,150,60],
'2021':[120,130,80],
'2022':[150,140,70]
}
).melt()
Out[60]:
variable value
1 branch cse
2 branch ece
3 branch mech
3 2020 100
4 2020 150
5 2020 60
6 2021 120
7 2021 130
8 2021 80
9 2022 150
10 2022 140
11 2022 70
16/25
In [61]: # dont include 'branch' to rows
pd.DataFrame(
{
'branch':['cse','ece','mech'],
'2020':[100,150,60],
'2021':[120,130,80],
'2022':[150,140,70]
}
).melt(id_vars=['branch'])
Out[61]:
branch variable value
2 mech 2020 60
5 mech 2021 80
8 mech 2022 70
the melt() method reshapes a DataFrame from wide to long format: the column names become entries in a 'variable' column and the corresponding cell
values go into a 'value' column.
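The cell behind Out[62] below is not visible; presumably the same wide frame melted with named columns:
pd.DataFrame(
    {
        'branch':['cse','ece','mech'],
        '2020':[100,150,60],
        '2021':[120,130,80],
        '2022':[150,140,70]
    }
).melt(id_vars=['branch'], var_name='year', value_name='students')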
Out[62]:
branch year students
2 mech 2020 60
5 mech 2021 80
8 mech 2022 70
deaths =pd.read_csv("time_series_covid19_deaths_global.csv")
confirm = pd.read_csv("time_series_covid19_confirmed_global.csv")
In [64]: deaths.head(2)
Out[64]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 12/24/22 12/25/22 12/26/22 12/27/22
17/25
In [65]: deaths.shape
Out[67]: (311253, 6)
In [75]: deaths.head()
Out[75]:
Province/State Country/Region Lat Long date no. of deaths
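The melt cells themselves are not visible; given the (311253, 6) shape and the columns shown above, they were presumably along these lines:
deaths = deaths.melt(id_vars=['Province/State','Country/Region','Lat','Long'],
                     var_name='date', value_name='no. of deaths')
confirm = confirm.melt(id_vars=['Province/State','Country/Region','Lat','Long'],
                       var_name='date', value_name='no. of confirmed')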
In [68]: confirm.head(2)
Out[68]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 12/24/22 12/25/22 12/26/22 12/27/22
In [69]: confirm.shape
In [76]: confirm.head()
Out[76]:
Province/State Country/Region Lat Long date no. of confirmed
Out[74]: (311253, 6)
18/25
In [77]: # Now merge both data frames as per desire
confirm.merge(deaths, on =['Province/State','Country/Region','Lat','Long','date'])
Out[77]:
Province/State Country/Region Lat Long date no. of confirmed no. of deaths
311248 NaN West Bank and Gaza 31.952200 35.233200 1/2/23 703228 5708
Out[80]:
Country/Region date no. of confirmed no. of deaths
0 Afghanistan 1/22/20 0 0
1 Albania 1/22/20 0 0
2 Algeria 1/22/20 0 0
3 Andorra 1/22/20 0 0
4 Angola 1/22/20 0 0
A pivot table takes simple column-wise data as input and groups the entries into a two-dimensional table, giving a multidimensional
summarization of the data.
In [83]: df = sns.load_dataset('tips')
df.head()
Out[83]:
total_bill tip sex smoker day time size
19/25
In [85]: # On gender basis average total bill
df.groupby('sex')['total_bill'].mean()
Out[85]: sex
Male 20.744076
Female 18.056897
Name: total_bill, dtype: float64
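The pivot-table cells that produce the next few outputs are hidden; the equivalent of the groupby above written as a pivot table would be:
# mean total_bill broken down by sex (rows) and smoker (columns)
df.pivot_table(index='sex', columns='smoker', values='total_bill', aggfunc='mean')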
Out[88]:
smoker Yes No
sex
Out[89]:
smoker Yes No
sex
Out[90]:
smoker Yes No
sex
Out[91]:
smoker Yes No
sex
Male 60 97
Female 33 54
Out[92]:
smoker Yes No
sex
20/25
In [93]: # All columns together --- gives average
df.pivot_table(index='sex',columns='smoker')
Out[93]:
size tip total_bill
sex
Out[95]:
smoker Yes No
sex
In [96]: df.pivot_table(index='sex',columns='smoker')['size']
Out[96]:
smoker Yes No
sex
Out[98]:
total_bill tip sex smoker day time size
Out[100]:
day Thur Fri Sat Sun
sex smoker
In [102]: df.pivot_table(index=['sex','smoker'],columns=['day','time'])
Out[102]:
tip total_bill
Fri Sat Sun Thur Fri Sat Sun Thur Fri Sat
Dinner Lunch Dinner Dinner Dinner Lunch Dinner Lunch Dinner Dinner Dinner Lunch Dinner Lunch Dinner Dinner
NaN 1.666667 2.4 2.629630 2.600000 3.058000 NaN 1.90 3.246 2.879259 3.521333 19.171000 NaN 11.386667 25.892 21.837778
NaN NaN 2.0 2.656250 2.883721 2.941500 NaN NaN 2.500 3.256563 3.115349 18.486500 NaN NaN 17.475 19.929063
NaN 2.000000 2.0 2.200000 2.500000 2.990000 NaN 2.66 2.700 2.868667 3.500000 19.218571 NaN 13.260000 12.200 20.266667
2.0 3.000000 2.0 2.307692 3.071429 2.437083 3.0 3.00 3.250 2.724615 3.329286 15.899167 18.78 15.980000 22.750 19.003846
21/25
In [103]: df.pivot_table(index=['sex','smoker'],columns=['day','time'],aggfunc={'size':'mean','tip':'max','total_bill':'sum'})
Out[103]:
size tip total_bill
day Thur Fri Sat Sun Thur Fri Sat Sun Thur Fri
time Lunch Dinner Lunch Dinner Dinner Dinner Lunch Dinner Lunch Dinner Dinner Dinner Lunch Dinner Lunch Di
sex smoker
Male Yes 2.300000 NaN 1.666667 2.4 2.629630 2.600000 5.00 NaN 2.20 4.73 10.00 6.5 191.71 0.00 34.16 12
No 2.500000 NaN NaN 2.0 2.656250 2.883721 6.70 NaN NaN 3.50 9.00 6.0 369.73 0.00 0.00 3
Female Yes 2.428571 NaN 2.000000 2.0 2.200000 2.500000 5.00 NaN 3.48 4.30 6.50 4.0 134.53 0.00 39.78 4
No 2.500000 2.0 3.000000 2.0 2.307692 3.071429 5.17 3.0 3.00 3.25 4.67 5.2 381.58 18.78 15.98 2
In [106]: # Margins.
df.pivot_table(index='sex',columns= 'smoker',values ='total_bill',aggfunc='sum',margins=True)
Out[106]:
smoker Yes No All
sex
In [110]: expense.head(2)
Out[110]:
Date Account Category Subcategory Note INR Income/Expense Note.1 Amount Currency Account.1
1 3/2/2022 10:11 CUB - online payment Food NaN Brownie 50.0 Expense NaN 50.0 INR 50.0
2 3/2/2022 10:11 CUB - online payment Other NaN To lended people 300.0 Expense NaN 300.0 INR 300.0
In [115]: # Categories
expense['Category'].value_counts()
22/25
In [117]: expense.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277 entries, 0 to 276
Data columns (total 11 columns):
# Column Non-Null Count Dtype
In [121]: expense.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277 entries, 0 to 276
Data columns (total 11 columns):
# Column Non-Null Count Dtype
expense['Date'].dt.month_name()
Out[123]: 0 March
1 March
2 March
3 March
4 March
...
272 November
273 November
274 November
275 November
276 November
Name: Date, Length: 277, dtype: object
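The cell that adds the month column (used by the pivot table below) is not shown; presumably:
expense['month'] = expense['Date'].dt.month_name()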
In [125]: expense.head(2)
Out[125]:
Date Account Category Subcategory Note INR Income/Expense Note.1 Amount Currency Account.1 month
23/25
In [126]: # Using pivot table
expense.pivot_table(index ='month', columns='Category', values ='INR', aggfunc='sum')
Out[126]:
Petty Social
Category Allowance Apparel Beauty Education Food Gift Household Other Salary Self- Transportation
cash development Life
month
December 11000.0 2590.0 196.0 NaN 6440.72 NaN 4800.0 1790.0 NaN NaN 400.0 513.72 914.0
February NaN 798.0 NaN NaN 5579.85 NaN 2808.0 20000.0 NaN NaN NaN 1800.00 5078.8
January 1000.0 NaN NaN 1400.0 9112.51 NaN 4580.0 13178.0 NaN 8000.0 NaN 200.00 2850.0
March NaN NaN NaN NaN 195.00 NaN NaN 900.0 NaN NaN NaN NaN 30.0
November 2000.0 NaN NaN NaN 3174.40 115.0 NaN 2000.0 3.0 NaN NaN NaN 331.0
Out[128]:
Petty Social
Category Allowance Apparel Beauty Education Food Gift Household Other Salary Self- Transportation
cash development Life
month
December 11000 2590 196 0 6440.72 0 4800 1790 0 0 400 513.72 914.0
In [131]: # plot
expense.pivot_table(index ='month', columns='Category', values ='INR', aggfunc='sum',fill_value =0).plot()
Out[131]: <AxesSubplot:xlabel='month'>
Out[132]: <AxesSubplot:xlabel='month'>
24/25
In [133]: expense.pivot_table(index ='month', columns='Account', values ='INR', aggfunc='sum',fill_value =0).plot()
Out[133]: <AxesSubplot:xlabel='month'>
In [ ]:
25/25
In [1]: import pandas as pd
import numpy as np
In [2]: a = np.array([1,2,3,4])
a * 4
s = pd.Series(['cat','mat',None,'rat'])
s
Out[4]: 0 cat
1 mat
2 None
3 rat
dtype: object
Out[5]: 0 True
1 False
2 None
3 False
dtype: object
1/11
In [7]: df.head(1)
Out[7]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Emba
Braund,
Mr. A/5
0 1 0 3 male 22.0 1 0 7.25 NaN
Owen 21171
Harris
In [8]: df['Name']
Common Functions
lower/upper/capitalize/title
In [9]: # Upper
df['Name'].str.upper() # converts the whole string to UPPERCASE
2/11
In [10]: # lower
df['Name'].str.lower() # converts the whole string to lowercase
In [11]: # title
df['Name'].str.title() # capitalizes the first letter of each word
Out[12]: 82
Out[14]: 'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Va
llejo)'
strip
3/11
In [16]: ' jack '.strip()
Out[16]: 'jack'
split
In [19]: # split
df['Name'].str.split(',')
Out[21]: 0 Braund
1 Cumings
2 Heikkinen
3 Futrelle
4 Allen
...
886 Montvila
887 Graham
888 Johnston
889 Behr
890 Dooley
Name: Name, Length: 891, dtype: object
4/11
In [22]: df['Name'].str.split(',').str.get(1)
In [25]: df.head(1)
Out[25]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Emba
Braund,
0 1 0 3 Mr. male A/5
22.0 1 0 7.25 NaN
Owen 21171
Harris
In [29]: # it is used to split the Name column of the DataFrame df into two columns
# FirstName and LastName.
Out[29]:
0 1
2 Miss. Laina
5/11
In [30]: df[['title','firstname']] = df['Name'].str.split(',').str.get(1).str.strip().str.split(' ', n=1, expand=True)
In [31]: df.head(1)
Out[31]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Emba
Braund,
Mr. A/5
0 1 0 3 male 22.0 1 0 7.25 NaN
Owen 21171
Harris
replace
C:\Users\user\AppData\Local\Temp/ipykernel_15952/1805277261.py:1: FutureWarni
ng: The default value of regex will change from True to False in a future ver
sion.
df['title'] = df['title'].str.replace('Ms.','Miss.')
C:\Users\user\AppData\Local\Temp/ipykernel_15952/1805277261.py:2: FutureWarni
ng: The default value of regex will change from True to False in a future ver
sion.
df['title'] = df['title'].str.replace('Mlle.','Miss.')
6/11
In [37]: df['title'].value_counts()
Filtering
7/11
In [38]: # startswith/endswith
df[df['firstname'].str.startswith('A')]
Out[38]:
[rows whose firstname starts with 'A': e.g. Andersson, Mr. Anders Johan; McGowan, Miss. Anna "Annie"; Holverson, Mr. Alexander Oskar; Vander Planke, Miss. Augusta Maria; ...; Gustafsson, Mr. Alfred Ossian — 95 rows × 15 columns]
8/11
In [40]: # endswith
df[df['firstname'].str.endswith('z')]
Out[40]:
[rows whose firstname ends with 'z': Cardeza, Mr. Thomas Drake Martinez and Jensen, Mr. Svend Lauritz]
In [41]: # isdigit/isalpha...
df[df['firstname'].str.isdigit()]
Out[41]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarke
regex
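The In[42] cell is not visible; judging from the rows returned below it filtered names containing 'John', presumably:
# contains() accepts a plain substring or a regular expression
df[df['Name'].str.contains('John')]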
Out[42]:
[rows whose Name contains 'John': e.g. Cumings, Mrs. John Bradley (Florence Briggs Th...); Turpin, Mrs. William John Robert (Dorothy Ann ...); Rogers, Mr. William John; Doling, Mrs. John T (Ada Julia Bone); ...]
9/11
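The last_name column used in the next cell is not created in any visible cell; a plausible (assumed) definition, consistent with the surnames shown earlier, is:
df['last_name'] = df['Name'].str.split(',').str.get(0)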
In [44]: # find last names that start and end with a consonant (i.e. not a vowel: aeiou)
df[df['last_name'].str.contains('^[^aeiouAEIOU].+[^aeiouAEIOU]$')]
Out[44]:
[rows whose last_name starts and ends with a consonant: e.g. Braund, Cumings, Heikkinen, Moran, McCarthy, ..., Sutehall, Graham, Johnston, Behr, Dooley]
slicing
10/11
In [45]: df['Name'].str[:4] # first 4 characters
Out[45]: 0 Brau
1 Cumi
2 Heik
3 Futr
4 Alle
...
886 Mont
887 Grah
888 John
889 Behr
890 Dool
Name: Name, Length: 891, dtype: object
In [ ]:
11/11
In [1]: import numpy as np
import pandas as pd
Timestamp Object
Time stamps reference particular moments in time (e.g., Oct 24th, 2022 at 7:00pm)
Vectorized date and time operations are a powerful tool for working with date and time data. They can be used to
quickly and easily perform a wide variety of operations on date and time data.
In [3]: # type
type(pd.Timestamp('2023/05/12'))
Out[3]: pandas._libs.tslibs.timestamps.Timestamp
In [4]: # Variations
pd.Timestamp('2023-05-12')
In [5]: pd.Timestamp('2023,05,12')
In [6]: pd.Timestamp('2023.05.12')
1/14
In [10]: # using Python's datetime object
import datetime as dt
x = pd.Timestamp(dt.datetime(2023,5,12,4,42,56))
x
Out[12]: 2023
In [15]: x.day
Out[15]: 12
In [17]: x.time()
In [18]: x.month
Out[18]: 5
why have separate objects to handle date and time when Python already has datetime
functionality?
Because of the uniform type in NumPy datetime64 arrays, this type of operation can be accomplished much more
quickly than if we were working directly with Python's datetime objects, especially as arrays get large
Pandas Timestamp object combines the ease-of-use of python datetime with the efficient storage and vectorized
interface of numpy.datetime64
From a group of these Timestamp objects, Pandas can construct a DatetimeIndex that can be used to index data in
a Series or DataFrame
2/14
DatetimeIndex Object
In [23]: pd.DatetimeIndex(['2023/05/12','2023/01/01','2025/01/22'])[0]
In [25]: # type
type(pd.DatetimeIndex(['2023/05/12','2023/01/01','2025/01/22']))
Out[25]: pandas.core.indexes.datetimes.DatetimeIndex
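The cell defining dt_index is not shown; from the index of Out[29] it was presumably:
dt_index = pd.DatetimeIndex(['2023/1/1','2022/1/1','2021/1/1'])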
In [28]: dt_index
pd.Series([1,2,3],index=dt_index)
Out[29]: 2023-01-01 1
2022-01-01 2
2021-01-01 3
dtype: int64
date_range function
In [33]: # generate daily dates in a given range
pd.date_range(start='2023/5/12',end='2023/6/12',freq='D')
3/14
In [34]: # Alternate days
pd.date_range(start='2023/5/12',end='2023/6/12',freq='2D')
4/14
In [41]: # For every six hours
pd.date_range(start='2023/5/12',end='2023/6/12',freq='6H')
5/14
In [50]: # Hour (using periods)
pd.date_range(start='2023/5/12',periods =30,freq='H')
to_datetime function
6/14
In [59]: # simple series example
s = pd.Series(['2023/5/12','2022/1/1','2021/2/1'])
pd.to_datetime(s).dt.year # converting string to datetime
Out[59]: 0 2023
1 2022
2 2021
dtype: int64
In [60]: pd.to_datetime(s).dt.day
Out[60]: 0 12
1 1
2 1
dtype: int64
In [61]: pd.to_datetime(s).dt.day_name()
Out[61]: 0 Friday
1 Saturday
2 Monday
dtype: object
In [62]: pd.to_datetime(s).dt.month_name()
Out[62]: 0 May
1 January
2 February
dtype: object
s = pd.Series(['2023/1/1','2022/1/1','2021/130/1'])
pd.to_datetime(s,errors='coerce') #NaT = Not a Time
Out[63]: 0 2023-01-01
1 2022-01-01
2 NaT
dtype: datetime64[ns]
In [64]: pd.to_datetime(s,errors='coerce').dt.year
Out[64]: 0 2023.0
1 2022.0
2 NaN
dtype: float64
In [65]: pd.to_datetime(s,errors='coerce').dt.month_name()
Out[65]: 0 January
1 January
2 NaN
dtype: object
In [66]: df = pd.read_csv("expense_data.csv")
7/14
In [69]: df.head()
Out[69]:
Date Account Category Subcategory Note INR Income/Expense Note.1 Amount Currency Account.1
3/2/2022 CUB -
0 online Food NaN Brownie 50.0 Expense NaN 50.0 INR 50.0
10:11
payment
3/2/2022 CUB - To
1 online Other NaN lended 300.0 Expense NaN 300.0 INR 300.0
10:11
payment people
3/1/2022 CUB -
2 online Food NaN Dinner 78.0 Expense NaN 78.0 INR 78.0
19:50
payment
3/1/2022 CUB -
3 online Transportation NaN Metro 30.0 Expense NaN 30.0 INR 30.0
18:56
payment
3/1/2022 CUB -
4 online Food NaN Snacks 67.0 Expense NaN 67.0 INR 67.0
18:22
payment
In [70]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277 entries, 0 to 276
Data columns (total 11 columns):
# Column Non-Null Count Dtype
8/14
In [73]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277 entries, 0 to 276
Data columns (total 11 columns):
# Column Non-Null Count Dtype
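Between the two info() calls the Date column was converted from object to datetime64; the conversion cell is not visible, but it was presumably:
df['Date'] = pd.to_datetime(df['Date'])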
dt accessor
In [75]: df['Date'].dt.year
Out[75]: 0 2022
1 2022
2 2022
3 2022
4 2022
...
272 2021
273 2021
274 2021
275 2021
276 2021
Name: Date, Length: 277, dtype: int64
In [76]: df['Date'].dt.month
Out[76]: 0 3
1 3
2 3
3 3
4 3
..
272 11
273 11
274 11
275 11
276 11
Name: Date, Length: 277, dtype: int64
9/14
In [77]: df['Date'].dt.month_name()
Out[77]: 0 March
1 March
2 March
3 March
4 March
...
272 November
273 November
274 November
275 November
276 November
Name: Date, Length: 277, dtype: object
In [80]: df['Date'].dt.day_name()
Out[80]: 0 Wednesday
1 Wednesday
2 Tuesday
3 Tuesday
4 Tuesday
...
272 Monday
273 Monday
274 Sunday
275 Sunday
276 Sunday
Name: Date, Length: 277, dtype: object
In [86]: df['Date'].dt.is_month_end
Out[86]: 0 False
1 False
2 False
3 False
4 False
...
272 False
273 False
274 False
275 False
276 False
Name: Date, Length: 277, dtype: bool
In [87]: df['Date'].dt.is_year_end
Out[87]: 0 False
1 False
2 False
3 False
4 False
...
272 False
273 False
274 False
275 False
276 False
Name: Date, Length: 277, dtype: bool
10/14
In [90]: df['Date'].dt.is_quarter_end
Out[90]: 0 False
1 False
2 False
3 False
4 False
...
272 False
273 False
274 False
275 False
276 False
Name: Date, Length: 277, dtype: bool
In [91]: df['Date'].dt.is_quarter_start
Out[91]: 0 False
1 False
2 False
3 False
4 False
...
272 False
273 False
274 False
275 False
276 False
Name: Date, Length: 277, dtype: bool
In [95]: import matplotlib.pyplot as plt
plt.plot(df['Date'], df['INR'])
11/14
In [97]: df.head()
Out[97]:
Date Account Category Subcategory Note INR Income/Expense Note.1 Amount Currency Account.1
2022- CUB -
0 03-02 online Food NaN Brownie 50.0 Expense NaN 50.0 INR 50.0
10:11:00 payment
2022- CUB - To
1 03-02 online Other NaN lended 300.0 Expense NaN 300.0 INR 300.0
10:11:00 payment people
2022- CUB -
2 03-01 online Food NaN Dinner 78.0 Expense NaN 78.0 INR 78.0
19:50:00 payment
2022- CUB -
3 03-01 online Transportation NaN Metro 30.0 Expense NaN 30.0 INR 30.0
18:56:00 payment
2022- CUB -
4 03-01 online Food NaN Snacks 67.0 Expense NaN 67.0 INR 67.0
18:22:00 payment
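The cells that add the day_name and month_name columns (used by the groupby plots below) are not shown; presumably:
df['day_name'] = df['Date'].dt.day_name()
df['month_name'] = df['Date'].dt.month_name()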
In [99]: df.groupby('day_name')['INR'].mean().plot(kind='bar')
Out[99]: <AxesSubplot:xlabel='day_name'>
12/14
In [101]: df.head()
Out[101]:
Date Account Category Subcategory Note INR Income/Expense Note.1 Amount Currency Account.1
2022- CUB -
0 03-02 online Food NaN Brownie 50.0 Expense NaN 50.0 INR 50.0
10:11:00 payment
2022- CUB - To
1 03-02 online Other NaN lended 300.0 Expense NaN 300.0 INR 300.0
10:11:00 payment people
2022- CUB -
2 03-01 online Food NaN Dinner 78.0 Expense NaN 78.0 INR 78.0
19:50:00 payment
2022- CUB -
3 03-01 online Transportation NaN Metro 30.0 Expense NaN 30.0 INR 30.0
18:56:00 payment
2022- CUB -
4 03-01 online Food NaN Snacks 67.0 Expense NaN 67.0 INR 67.0
18:22:00 payment
Out[102]: <AxesSubplot:ylabel='INR'>
In [109]: # Average
df.groupby('month_name')['INR'].mean().plot(kind ='bar')
Out[109]: <AxesSubplot:xlabel='month_name'>
13/14