Data Engineering – Project 1: Pandas Basics
March 6, 2025
1 Introduction
Welcome! This is the first set of project exercises for the Data Engineering course.
Your job in this course will be to write rather simple programs, according to the instructions
and specification presented in each exercise. Most of them will involve loading data from a file,
performing some transformations or analytics, and writing the results to some other file.
You should submit your work via the provided Git repository in the Gitlab project that was provided
to you. It will then be checked using scripts running on the server (via a CI/CD pipeline), and the
results will be added to your repository as a JSON file. Today, in the first project, there is no limit
on the number of tries, but tries will be limited in the subsequent projects. The number of tries will
always be stated in the instructions.
Usually, a file will be provided as sample input for your program, but it is not the exact file your
program will be tested with. Of course, the test input files will adhere to the description in the
exercise, but don’t make any other assumptions regarding their structure and contents. The test
file may change over time, but will always comply with the specification.
You will have 2 weeks, starting with the date of your project classes, to submit your work.
2 Instructions
Below, you will find several exercises with precise specifications. Please write one Python program
which performs all the specified tasks in a sequence, producing the desired results.
It is probably a good idea for your program to start its life as a Jupyter Notebook. However, the
desired form is a Python program in a .py file. In JupyterLab, you can use the Save and Export
Notebook As… > Executable Script menu option to produce such a file. However, please verify that
the file runs correctly, as it is easy to overlook a mistake – remember that Jupyter kernels
maintain their state, and that cells are not always run sequentially.
After you’re done, please commit your solution to your repository as project01/project01.py.
3 Technical stuff
Your assignment will be checked using a Docker container from a custom-built image.
Currently, the image uses the following software packages:
• Python version: 3.13.2 (main, Feb 25 2025, 05:25:21) [GCC 12.2.0]
• NumPy version: 2.2.3
• Pandas version: 2.2.3
4 Exercises
4.1 Exercise 1: Column information
3 points
File proj1_ex01.csv is a properly formed CSV file, with fields separated using commas (,) and
with column headers. Load it into a DataFrame.
Create a file called proj1_ex01_fields.json, which contains information about all of the columns in
the file you read. The file should contain an array of dictionaries with the following items:
• column name (key: name),
• percentage of missing values (key: missing, values in the range [0.0, 1.0]),
• data type as a string with the following values (key: type):
– int for integer types,
– float for floating-point types,
– other for all other types.
An example JSON file could look like this:
[
{
"name": "id",
"missing": 0.0,
"type": "int"
},
{
"name": "title",
"missing": 0.2,
"type": "other"
},
{
"name": "result",
"missing": 0.73,
"type": "float"
}
]
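A minimal sketch of this computation, using a hypothetical stand-in DataFrame (the real one comes from proj1_ex01.csv):

```python
import json
import pandas as pd

# Hypothetical stand-in data; load proj1_ex01.csv in the real solution.
df = pd.DataFrame({"id": [1, 2, 3], "result": [0.5, None, 1.2]})

def type_name(dtype):
    # Map a pandas dtype to the strings required by the exercise.
    if pd.api.types.is_integer_dtype(dtype):
        return "int"
    if pd.api.types.is_float_dtype(dtype):
        return "float"
    return "other"

fields = [
    {
        "name": col,
        "missing": float(df[col].isna().mean()),  # fraction in [0.0, 1.0]
        "type": type_name(df[col].dtype),
    }
    for col in df.columns
]
print(json.dumps(fields, indent=2))
```

`Series.isna().mean()` gives the fraction of missing values directly, since the mean of a boolean mask is the share of `True` entries.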
4.2 Exercise 2: Value statistics
2 points
Compute statistics for all columns in your dataframe.
For numeric columns include:
• the count of non-empty values (count),
• the average (mean),
• the standard deviation (std),
• the minimum (min) and maximum (max) values,
• the 25th, 50th, and 75th percentiles (attribute names: 25%, 50% and 75%, respectively).
For non-numeric columns include:
• the count of non-empty values (count),
• the number of unique values (unique),
• the most common value (top) and its frequency (number of occurrences; freq).
Save the result to a JSON file called proj1_ex02_stats.json which contains a dictionary at the
top level; the keys in the dictionary are column names, and the values are dictionaries with keys as
described above, e.g.:
{
"some_number":{
"count":6.0,
"mean":-0.5009940002,
"std":0.8839385203,
"min":-1.5552904133,
"25%":-1.2470386925,
"50%":-0.4162433767,
"75%":0.1799426841,
"max":0.5271122589
},
"some_string":{
"count":7,
"unique":3,
"top":"good",
"freq":3
}
}
In the inner dictionaries, keys with null values are allowed, e.g. a dictionary for a numeric column
may contain the unique and top attributes with null values.
4.3 Exercise 3: Column names
5 points
Rename (“normalize”) the columns in the dataframe, so that they (sort of) follow the PEP 8
guidelines for variable names.
Apply the following rules:
• keep only characters which belong to the [A-Za-z0-9_ ] class (capital and small letters,
digits, underscore and space),
• convert all letters to lowercase,
• replace all spaces with underscores (_).
Make the changes in your DataFrame persistent.
Save the DataFrame with the new columns to proj1_ex03_columns.csv (don’t include the index).
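One way to implement these rules, sketched with hypothetical column headers:

```python
import re
import pandas as pd

# Hypothetical headers standing in for the real DataFrame's columns.
df = pd.DataFrame(columns=["Product Name", "Unit Price ($)", "In Stock?"])

def normalize(name):
    # Keep only letters, digits, underscores and spaces,
    # lowercase everything, then turn spaces into underscores.
    kept = re.sub(r"[^A-Za-z0-9_ ]", "", name)
    return kept.lower().replace(" ", "_")

# rename() with a callable applies it to every column name;
# reassigning df makes the change persistent.
df = df.rename(columns=normalize)
print(list(df.columns))
```

Note that the rules are applied literally, so a trailing space left after stripping forbidden characters becomes a trailing underscore.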
4.4 Exercise 4: Output formats
3 points (1 for each format)
Write the data in the DataFrame to various output formats.
Create an MS Excel file called proj1_ex04_excel.xlsx, which contains the column headers, but
not the index values.
Create a JSON file called proj1_ex04_json.json, which contains an array of rows stored as
dictionaries, each with the DataFrame columns as keys (and values as values, obviously), e.g.:
[
{
"one":0.3485539245,
"two":"-0.14509562920877161",
"three":"-0.012336991474672475",
"four":9,
"five":"red",
"six":"good",
"seven":"quarrelsome",
"eight":"2016-05-26 09:33:42"
},
{
"one":-1.4938530178,
"two":"0.12436946488785079",
"three":"1.4611100361038865",
"four":4,
"five":"red",
"six":"bad",
"seven":"doctor",
"eight":"2016-12-03 18:55:52"
}
]
Create a pickle file called proj1_ex04_pickle.pkl with the DataFrame.
4.5 Exercise 5: Selecting rows and columns
4 points
Load the DataFrame pickled in file proj1_ex05.pkl.
Select the following items from the DataFrame:
• the 2nd and 3rd columns (regardless of their names),
• rows whose index values begin with the letter v.
Save the result to a Markdown table stored in file proj1_ex05_table.md. Include the index, but
don’t put anything in cells with missing values (i.e. prevent nan from being printed there).
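The selection can be sketched like this, with a hypothetical stand-in DataFrame (the Markdown rendering is shown as a comment because to_markdown() additionally needs the tabulate package):

```python
import pandas as pd

# Hypothetical stand-in; the real DataFrame comes from proj1_ex05.pkl.
df = pd.DataFrame(
    {"a": [1, 2], "b": [None, 3.0], "c": ["x", "y"], "d": [5, 6]},
    index=["v1", "w1"],
)

subset = df.iloc[:, [1, 2]]                        # 2nd and 3rd columns by position
subset = subset[subset.index.str.startswith("v")]  # index values starting with "v"

# Replacing NaN with "" keeps missing-value cells empty in the table.
cleaned = subset.fillna("")
print(cleaned)

# Rendering then needs tabulate:
# with open("proj1_ex05_table.md", "w") as f:
#     f.write(cleaned.to_markdown())
```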
4.6 Exercise 6: Flattening data
3 points
Pandas DataFrames are two-dimensional structures. However, data in JSON files often has a
hierarchical structure, e.g. objects (dictionaries) are nested within objects.
File proj1_ex06.json contains an array with such hierarchical objects (the structure of each array
element is the same).
Based on the data in the file, use pd.json_normalize() to create a Pandas DataFrame which contains
a flattened version of the data. For nested dictionaries, the column names should have the keys
separated using dots (.). E.g., for the following entry:
[
{
"brand": "Audi",
"name": "Q5",
"model": 2023,
"engine": {
"type": "Diesel",
"displacement": "2.0L",
"power": "190 hp",
"environmental": {
"euro": 6,
"filter": "DPF"
}
}
}
]
the resulting columns should be:
• brand,
• name,
• model,
• engine.type,
• engine.displacement,
• engine.power,
• engine.environmental.euro,
• engine.environmental.filter.
Save the resulting DataFrame to pickle file proj1_ex06_pickle.pkl.
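Applied to the sample entry above, the flattening looks like this (the real solution reads the array from proj1_ex06.json instead of defining it inline):

```python
import pandas as pd

# The sample entry from the exercise text.
data = [
    {
        "brand": "Audi",
        "name": "Q5",
        "model": 2023,
        "engine": {
            "type": "Diesel",
            "displacement": "2.0L",
            "power": "190 hp",
            "environmental": {"euro": 6, "filter": "DPF"},
        },
    }
]

# sep="." is the default, so nested keys become dot-separated column names.
df = pd.json_normalize(data)
print(list(df.columns))
df.to_pickle("proj1_ex06_pickle.pkl")
```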