Pandas Data Handling Guide

The document discusses reading data from various file types into pandas including CSV, text, Excel, and HTML files. It shows examples of using pandas to read data from files located locally and from URLs, and to read specific sheets from an Excel file. It also discusses importing necessary libraries like BeautifulSoup and openpyxl for working with different file types.



In [ ]:  # pip install pandas


# pip install openpyxl  ## to work with Excel files

In [2]:  import pandas as pd

1. Read csv, txt, excel, html (files or links)

1. ----------------- csv

In [8]:  file = 'pok_data.csv'


df = pd.read_csv(file, delimiter=',')

In [9]:  df.head()  # first 5 rows

Out[9]: # Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary

0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False

1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False

2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False

3 3 VenusaurMega Venusaur Grass Poison 80 100 123 122 120 80 1 False

4 4 Charmander Fire NaN 39 52 43 60 50 65 1 False


In [6]:  df.tail()  # last 5 rows

Out[6]: # Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary

795 719 Diancie Rock Fairy 50 100 150 100 150 50 6 True

796 719 DiancieMega Diancie Rock Fairy 50 160 110 160 110 110 6 True

797 720 HoopaHoopa Confined Psychic Ghost 80 110 60 150 130 70 6 True

798 720 HoopaHoopa Unbound Psychic Dark 80 160 60 170 130 80 6 True

799 721 Volcanion Fire Water 80 110 120 130 90 70 6 True

In [7]:  len(df)

Out[7]: 800
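
read_csv has many optional parameters besides the delimiter. A minimal sketch of the most common ones (the column names come from this dataset; the '?' missing-value marker is just an illustration):

In [ ]:  df2 = pd.read_csv('pok_data.csv',
                           sep=',',                           # column separator (default ',')
                           usecols=['Name', 'HP', 'Attack'],  # load only these columns
                           nrows=100,                         # read only the first 100 rows
                           na_values=['?'])                   # treat '?' as missing (NaN)
         df2.head()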

In [ ]:  ## read directly from a URL

In [9]:  link = "https://gist.githubusercontent.com/rnirmal/e01acfdaf54a6f9b24e91ba4cae63518/raw/6b589a5c5a851711e20c5eb28f


df = pd.read_csv(link)
df.head()

Out[9]: datasetName about link categoryName cloud vintage

0 Microbiome Project American Gut (Microbiome Project) https://github.com/biocore/American-Gut Biology GitHub NaN

1 GloBI Global Biotic Interactions (GloBI) https://github.com/jhpoelen/eol-globi-data/wik... Biology GitHub NaN

2 Global Climate Global Climate Data Since 1929 http://en.tutiempo.net/climate Climate/Weather NaN 1929.0

3 CommonCraw 2012 3.5B Web Pages from CommonCraw 2012 http://www.bigdatanews.com/profiles/blogs/big-... Computer Networks NaN 2012.0

4 Indiana Webclicks 53.5B Web clicks of 100K users in Indiana Univ. http://cnets.indiana.edu/groups/nan/webtraffic... Computer Networks NaN NaN

In [ ]:


2. --------------- txt

In [120]:  file = 'pok_data.txt'


df = pd.read_csv(file, delimiter='\t')

In [121]:  df.head()

Out[121]: # Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary

0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False

1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False

2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False

3 3 VenusaurMega Venusaur Grass Poison 80 100 123 122 120 80 1 False

4 4 Charmander Fire NaN 39 52 43 60 50 65 1 False
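
read_csv is not limited to files on disk or to commas and tabs; any delimiter works through the sep parameter. A self-contained sketch using an in-memory string:

In [ ]:  from io import StringIO

         raw = "Name;HP;Attack\nBulbasaur;45;49\nIvysaur;60;62"   # semicolon-separated text
         pd.read_csv(StringIO(raw), sep=';')                      # same parser, different delimiter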

3. --------------- excel

In [34]:  # for Excel files, you must first install:


# !pip install openpyxl

In [117]:  file = 'STATISTIQUES EXPLORATION.xlsx'


s = 'STATISTIQUES'
df = pd.read_excel(file, sheet_name=s)

In [1]:  # df.head()

In [2]:  # len(df)
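
If the workbook contains several sheets, sheet_name can also take a list of names or None (all sheets); the result is then a dict of DataFrames keyed by sheet name. A sketch assuming the same workbook, kept commented out like the cells above since the file is course material:

In [ ]:  # sheets = pd.read_excel('STATISTIQUES EXPLORATION.xlsx', sheet_name=None)  # dict {name: DataFrame}
         # list(sheets.keys())            # the sheet names
         # sheets['STATISTIQUES'].head()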


4. ------------------- HTML page

In [3]:  import pandas as pd


from bs4 import BeautifulSoup

fname = "https://en.wikipedia.org/wiki/List_of_American_and_Canadian_cities_by_number_of_major_professional_sports
df_list = pd.read_html(fname)

type(df_list)

Out[3]: list
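
read_html parses every table element on the page, which is why it returns a list. The match parameter keeps only tables whose text matches a regular expression — a hedged sketch (the pattern below is just an illustration, taken from the table shown further down):

In [ ]:  # tables = pd.read_html(fname, match='Metropolitan area')   # only tables containing this text
         # len(tables)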

In [4]:  df_list

Out[4]: (output truncated — only part of one of the extracted tables is shown below)
3 5 49ers[note 8] Giants Athletics Warriors
4 4 Cowboys[note 10] Rangers Mavericks
5 4 Commanders[note 11] Nationals[note 12] Wizards[note 13]
6 4 Eagles[note 14] Phillies[note 15] 76ers[note 16]
7 4 Dolphins Marlins Heat
8 4 Patriots[note 19] Red Sox[note 20] Celtics
9 4 Vikings[note 21] Twins Timberwolves[note 22]
10 4 Broncos Rockies Nuggets[note 24]
11 4 Cardinals Diamondbacks Suns
12 4 Lions[note 26] Tigers[note 27] Pistons[note 28]
13 3 — [note 29] Blue Jays Raptors[note 30]
14 3 Texans[note 31] Astros Rockets
15 3 Falcons Braves Hawks
16 3 Seahawks Mariners[note 33] [note 34]
17 3 Buccaneers Rays [note 35]
18 3 Steelers Pirates [note 37]
19 3 Browns[note 39] Guardians[note 40] Cavaliers[note 41]
20 2 [note 43] Cardinals[note 44] [note 45]
21 2 Panthers — Hornets[note 47]
22 2 Bengals[note 48] Reds[note 49] [note 50]


In [5]:  # replace each link's display text with its URL, then re-parse the table
         for i, df in enumerate(df_list):
             soup = BeautifulSoup(df.to_html(), 'lxml')
             for a in soup.find_all('a'):
                 a.string = a.get('href')
             df_list[i] = pd.read_html(str(soup), flavor='bs4')[0]

In [6]:  df_with_links = df_list[1]



df_with_links.head()

Out[6]: Unnamed: 0 | Metropolitan area | Country | Pop. rank | Population (2022 est.)[8] | B4 | NFL | MLB | NBA | NHL | B6 | MLS | CFL

0 | 0 | New York City | United States | 1 | 19617869 | 9 | Giants Jets[note 1] | Yankees Mets[note 2] | Knicks Nets | Rangers Islanders Devils[note 3] | 11 | Red Bulls New York City FC | —

1 | 1 | Los Angeles | United States | 2 | 12872322 | 8 | Rams Chargers[note 4] | Dodgers Angels | Lakers Clippers | Kings Ducks | 10 | Galaxy Los Angeles FC[note 5] | —

2 | 2 | Chicago | United States | 3 | 9441957 | 5 | Bears[note 6] | Cubs White Sox | Bulls[note 7] | Blackhawks | 6 | Fire | —

3 | 3 | San Francisco Bay Area | United States | 6 | 6518123 | 5 | 49ers[note 8] | Giants Athletics | Warriors | Sharks[note 9] | 6 | Earthquakes | —

4 | 4 | Dallas–Fort Worth | United States | 4 | 7943685 | 4 | Cowboys[note 10] | Rangers | Mavericks | Stars | 5 | FC Dallas | —

2. DataFrame indexing and slicing


In [3]:  import pandas as pd


In [4]:  file = 'pok_data.txt'


df = pd.read_csv(file, delimiter='\t')
df.head()

Out[4]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary

0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False

1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False

2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False

3 3 VenusaurMega Venusaur Grass Poison 80 100 123 122 120 80 1 False

4 4 Charmander Fire NaN 39 52 43 60 50 65 1 False

In [5]:  df.columns

Out[5]: Index(['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk',
'Sp. Def', 'Speed', 'Generation', 'Legendary'],
dtype='object')

In [6]:  df.loc[0]  # select a row

Out[6]: # 1
Name Bulbasaur
Type 1 Grass
Type 2 Poison
HP 45
Attack 49
Defense 49
Sp. Atk 65
Sp. Def 65
Speed 45
Generation 1
Legendary False
Name: 0, dtype: object


In [7]:  df.loc[10:14]

Out[7]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary

10 8 Wartortle Water NaN 59 63 80 65 80 58 1 False

11 9 Blastoise Water NaN 79 83 100 85 105 78 1 False

12 9 BlastoiseMega Blastoise Water NaN 79 103 120 135 115 78 1 False

13 10 Caterpie Bug NaN 45 30 35 20 20 45 1 False

14 11 Metapod Bug NaN 50 20 55 25 25 30 1 False

In [8]:  df.iloc[10:14]

Out[8]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary

10 8 Wartortle Water NaN 59 63 80 65 80 58 1 False

11 9 Blastoise Water NaN 79 83 100 85 105 78 1 False

12 9 BlastoiseMega Blastoise Water NaN 79 103 120 135 115 78 1 False

13 10 Caterpie Bug NaN 45 30 35 20 20 45 1 False

In [72]:  # The difference:
          # loc  : selects by labels (row/column names); the end of a slice is included
          # iloc : selects by integer positions only; the end of a slice is excluded
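
A small self-contained sketch of that difference — with a non-default index, labels and positions no longer coincide:

In [ ]:  tmp = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])
         tmp.loc['a':'b']   # by label: rows 'a' and 'b' (end label included)
         tmp.iloc[0:2]      # by position: rows 0 and 1 (end position excluded)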


In [73]:  # example
df.head()

Out[73]: # Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary

0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False

1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False

2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False

3 3 VenusaurMega Venusaur Grass Poison 80 100 123 122 120 80 1 False

4 4 Charmander Fire NaN 39 52 43 60 50 65 1 False

In [10]:  df.loc[1:3, 'Name':'Type 2']  # select rows 1 to 3, columns 'Name' through 'Type 2'

Out[10]: Name Type 1 Type 2

1 Ivysaur Grass Poison

2 Venusaur Grass Poison

3 VenusaurMega Venusaur Grass Poison

In [75]:  # df.iloc[1, 'Name']  # error: iloc does not accept labels

          df.iloc[1, 1]  # must use integer positions

Out[75]: 'Ivysaur'


In [76]:  df.iloc[:, [1, 2, 5]]  # select the columns at positions 1, 2 and 5

Out[76]: Name Type 1 Attack

0 Bulbasaur Grass 49

1 Ivysaur Grass 62

2 Venusaur Grass 82

3 VenusaurMega Venusaur Grass 100

4 Charmander Fire 52

... ... ... ...

795 Diancie Rock 100

796 DiancieMega Diancie Rock 160

797 HoopaHoopa Confined Psychic 110

798 HoopaHoopa Unbound Psychic 160

799 Volcanion Fire 110

800 rows × 3 columns


In [77]:  df[['Name', 'Type 1', 'Attack']]  # select columns by name

Out[77]: Name Type 1 Attack

0 Bulbasaur Grass 49

1 Ivysaur Grass 62

2 Venusaur Grass 82

3 VenusaurMega Venusaur Grass 100

4 Charmander Fire 52

... ... ... ...

795 Diancie Rock 100

796 DiancieMega Diancie Rock 160

797 HoopaHoopa Confined Psychic 110

798 HoopaHoopa Unbound Psychic 160

799 Volcanion Fire 110

800 rows × 3 columns

In [13]:  ## [row, column]



print(df.iloc[10,2])

Water

In [15]:  len(df)

Out[15]: 800

In [14]:  a = df.loc[df['Attack'] > 150]  # select rows that satisfy a condition
len(a)

Out[14]: 18
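
Conditions can be combined with & (and), | (or) and ~ (not); each condition needs its own parentheses. A sketch on the same columns:

In [ ]:  strong_and_fast = df.loc[(df['Attack'] > 150) & (df['Speed'] > 100)]  # both conditions
         grass_or_fire   = df.loc[df['Type 1'].isin(['Grass', 'Fire'])]       # membership test
         len(strong_and_fast), len(grass_or_fire)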


In [80]:  b = df.drop(columns=['Attack'])  # drop a column


b.head()

Out[80]: # Name Type 1 Type 2 HP Defense Sp. Atk Sp. Def Speed Generation Legendary

0 1 Bulbasaur Grass Poison 45 49 65 65 45 1 False

1 2 Ivysaur Grass Poison 60 63 80 80 60 1 False

2 3 Venusaur Grass Poison 80 83 100 100 80 1 False

3 3 VenusaurMega Venusaur Grass Poison 80 123 122 120 80 1 False

4 4 Charmander Fire NaN 39 43 60 50 65 1 False

In [81]:  # df.head()  # the Attack column is still in df (but not in b)

In [16]:  df.iat[1, 2]  # access a single value by integer position

Out[16]: 'Grass'

In [17]:  df.at[4, 'HP']  ## access a single value by label

Out[17]: 39

In [84]:  df.iloc[4]  # extract a specific row

Out[84]: # 4
Name Charmander
Type 1 Fire
Type 2 NaN
HP 39
Attack 52
Defense 43
Sp. Atk 60
Sp. Def 50
Speed 65
Generation 1
Legendary False
Name: 4, dtype: object


In [18]:  df['HP']  # extract a specific column

Out[18]: 0 45
1 60
2 80
3 80
4 39
..
795 50
796 50
797 80
798 80
799 80
Name: HP, Length: 800, dtype: int64

Searching with filters

In [19]:  df.query('HP > 150') #like SQL

Out[19]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary

121 113 Chansey Normal NaN 250 5 5 35 105 50 1 False

155 143 Snorlax Normal NaN 160 110 65 65 110 30 1 False

217 202 Wobbuffet Psychic NaN 190 33 58 33 58 33 2 False

261 242 Blissey Normal NaN 255 10 10 75 135 55 2 False

351 321 Wailord Water NaN 170 90 45 90 45 60 3 False

655 594 Alomomola Water NaN 165 75 80 40 45 65 5 False

In [20]:  df.query('HP > 150 and Attack > 100')

Out[20]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary

155 143 Snorlax Normal NaN 160 110 65 65 110 30 1 False
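
query can also reference Python variables with the @ prefix, and column names containing spaces or dots can be wrapped in backticks. A sketch with the same DataFrame:

In [ ]:  threshold = 150
         df.query('HP > @threshold')                   # use an outside variable
         df.query('`Sp. Atk` > 150 and Speed > 100')   # backticks for names with spaces/dots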


In [21]:  df.sample(n=5)  ## take a random sample of 5 rows

Out[21]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary

328 303 Mawile Steel Fairy 50 85 85 55 55 50 3 False

552 493 Arceus Normal NaN 120 120 120 120 120 120 4 True

226 210 Granbull Fairy NaN 90 120 75 60 60 45 2 False

773 703 Carbink Rock Fairy 50 50 150 50 150 50 6 False

201 186 Politoed Water NaN 90 75 75 90 100 70 2 False
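
sample can also draw a fraction of the rows, and random_state makes the draw reproducible:

In [ ]:  df.sample(frac=0.01, random_state=42)   # about 1% of the rows, the same ones on every run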

groupby
Take a categorical variable
Look at how other (continuous) variables are distributed across the categories it contains


In [96]:  df

Out[96]: # Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary

0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False

1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False

2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False

3 3 VenusaurMega Venusaur Grass Poison 80 100 123 122 120 80 1 False

4 4 Charmander Fire NaN 39 52 43 60 50 65 1 False

... ... ... ... ... ... ... ... ... ... ... ... ...

795 719 Diancie Rock Fairy 50 100 150 100 150 50 6 True

796 719 DiancieMega Diancie Rock Fairy 50 160 110 160 110 110 6 True

797 720 HoopaHoopa Confined Psychic Ghost 80 110 60 150 130 70 6 True

798 720 HoopaHoopa Unbound Psychic Dark 80 160 60 170 130 80 6 True

799 721 Volcanion Fire Water 80 110 120 130 90 70 6 True

800 rows × 12 columns

In [98]:  dfg1 = df.groupby("Generation")[["Attack", "Defense"]].mean()


print(dfg1)

Attack Defense
Generation
1 76.638554 70.861446
2 72.028302 73.386792
3 81.625000 74.100000
4 82.867769 78.132231
5 82.066667 72.327273
6 75.804878 76.682927


In [102]:  dfg2 = df.groupby("Generation")[["Speed"]].mean()


print(dfg2)

Speed
Generation
1 72.584337
2 61.811321
3 66.925000
4 71.338843
5 68.078788
6 66.439024
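
groupby is not limited to a single statistic: agg applies several at once, and named aggregations give the result columns explicit names. A sketch reusing the same columns:

In [ ]:  df.groupby('Type 1')[['Attack', 'Defense']].agg(['mean', 'max'])              # two statistics per group
         df.groupby('Generation').agg(count=('Name', 'count'), avg_hp=('HP', 'mean'))  # named aggregations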

3. Create your own DataFrames from lists and dicts


In [3]:  import pandas as pd

In [4]:  a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


b = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']


In [5]:  d = {"nbrs":a, "names":b}



df1 = pd.DataFrame(d)
df1

Out[5]: nbrs names

0 1 a

1 2 b

2 3 c

3 4 d

4 5 e

5 6 f

6 7 g

7 8 h

8 9 i

9 10 j
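
A DataFrame can also be built from a list of dicts (one dict per row), or given a custom index at construction time. A short sketch reusing the lists defined above:

In [ ]:  rows = [{'nbrs': 1, 'names': 'a'}, {'nbrs': 2, 'names': 'b'}]
         pd.DataFrame(rows)          # one dict per row
         pd.DataFrame(d, index=b)    # reuse the letters as row labels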

Save as a CSV or Excel file

In [22]:  # fname = 'mydf.csv'


# df1.to_csv(fname, index=False)

fname = 'mydf.xlsx'
df1.to_excel(fname, index=False)
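
To put several DataFrames into one workbook, pandas provides ExcelWriter: each to_excel call then targets its own sheet. A sketch with a hypothetical output file name, kept commented out (it assumes dfg1 from section 2 is still defined and would create a new file):

In [ ]:  # with pd.ExcelWriter('my_dfs.xlsx') as writer:                  # hypothetical file name
         #     df1.to_excel(writer, sheet_name='letters', index=False)
         #     dfg1.to_excel(writer, sheet_name='group_means')           # group means from section 2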


In [24]:  ## read the file back (the saved file was apparently edited in Excel in the meantime:
          ## a 'new' column and an extra row appear below)


df1 = pd.read_excel(fname)
df1

Out[24]: nbrs names new

0 1 a 0

1 2 b 10

2 3 c 20

3 4 d 30

4 5 e 40

5 6 f 50

6 7 g 60

7 8 h 70

8 9 i 80

9 10 j 90

10 11 k 100

In [ ]:
