Course - Introduction To Data Science (SD211105)

This document is a tutorial for an Introduction to Data Science course, covering essential concepts and techniques in data science using Python. It includes sections on importing data, data visualization with Matplotlib, data manipulation with Pandas, filtering data, looping through data structures, and indexing and slicing DataFrames. By the end of the tutorial, learners will have foundational skills to analyze and visualize data effectively.

Uploaded by

aldisusilo19

Course: Introduction to Data Science (SD211105)

Session: 2

Topic: Basic Data Science Tutorial

Download Dataset & Try Below Tutorial Using Python:

https://drive.google.com/file/d/1aO1XLlsV3rft4Z3xRZErTjPAhbi4-uye/view?usp=sharing

Welcome to this tutorial, which introduces you to the world of data science. Even if you are a
complete beginner with no prior knowledge of the subject, this guide will help you get
started. Throughout the tutorial, you'll learn key concepts, techniques, and tools used in data
science. By the end of each section, you will have gained new abilities to work with and analyze
data.

To begin, here’s a famous quote from Josh Wills that perfectly encapsulates the role of a data
scientist:

"A data scientist is a person who is better at statistics than any software engineer and
better at software engineering than any statistician."

This definition emphasizes the interdisciplinary nature of data science, requiring a balance of
statistical expertise and programming skills.

1. Import and First Look at the Data

In this section, we will walk through the process of importing data and writing functions that help
us get an initial understanding of it. Data comes in various formats, such as CSV, Excel, or
databases, and it's crucial to understand how to load it into your analysis environment.

Steps:

a. Importing Data

To start working with data, we first need to import it into our environment. In Python, we often
use libraries like pandas to handle data efficiently. Let’s import a dataset and take a quick look
at it.

Python code:
# Importing necessary library
import pandas as pd
# Reading a CSV file into a DataFrame
data = pd.read_csv('your_dataset.csv')

# Display the first few rows of the data

print(data.head())

Hint!: your_dataset.csv => 2015.csv

● pandas: A powerful library for data manipulation and analysis.

● pd.read_csv(): Reads a CSV file into a DataFrame, the most common data structure
for handling tabular data in Python.
● data.head(): Displays the first few rows of the dataset for a quick glance at the
contents.
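
If you want to try read_csv() before downloading the dataset, the following minimal sketch reads a tiny in-memory CSV instead of a file. The three rows, the country names, and the scores are hypothetical placeholders that only mimic a few column names from 2015.csv; io.StringIO stands in for the file path.

```python
import io
import pandas as pd

# Hypothetical sample rows; the values are made up for illustration only
csv_text = """Country,Happiness Rank,Happiness Score
Aland,1,7.5
Borduria,2,7.1
Cascadia,3,6.9
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))   # first two rows only
print(sample.shape)     # (rows, columns)
```

With the real dataset, you would pass the path '2015.csv' in place of the StringIO object.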

b. Understanding Data Structure

After loading the data, it's important to understand its structure—columns, rows, and data types.
Let’s write a function to summarize this:

Python code:
# Function to get a quick summary of the dataset
def data_summary(df):
    print(f"Data Shape: {df.shape}")  # Number of rows and columns
    print("\nColumn Info:")
    df.info()  # Prints data types and non-null counts itself (it returns None)
    print("\nMissing Values:\n", df.isnull().sum())  # Count of missing values per column

# Call the function

data_summary(data)

● df.shape: Provides the dimensions of the dataset (rows, columns).

● df.info(): Displays information on the column names, data types, and the number of
non-null values.
● df.isnull().sum(): Identifies missing data in each column.
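
To see what these attributes return, here is a small sketch on a hypothetical toy DataFrame (the column names and values are invented for illustration) that contains one missing value in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "score": [7.5, np.nan, 6.9],
    "region": ["North", "South", None],
})

n_rows, n_cols = toy.shape      # shape is a (rows, columns) tuple
missing = toy.isnull().sum()    # Series: missing-value count per column

print(n_rows, n_cols)
print(missing)
```

Both NaN and None count as missing, so each column reports one missing value here.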

c. Exploring Data Distributions

To explore data distributions, we can use basic statistical measures:

Python code:
# Function to check basic statistics of numeric columns
def check_statistics(df):
    print("\nBasic Statistics:\n")
    print(df.describe())  # Descriptive statistics for numeric data

# Call the function

check_statistics(data)

● df.describe(): Returns summary statistics such as count, mean, standard deviation,
minimum, quartiles (the 50% row is the median), and maximum for numeric columns in the dataset.
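
The short sketch below, on a hypothetical single-column DataFrame, shows where the median appears in the describe() output: it is labelled 50% rather than "median".

```python
import pandas as pd

# Hypothetical toy column with one large outlier
toy = pd.DataFrame({"score": [1.0, 2.0, 3.0, 4.0, 100.0]})

stats = toy.describe()
print(stats)

# The median is reported as the 50% row rather than under its own name
print(stats.loc["50%", "score"], toy["score"].median())
```

Comparing the mean and the 50% row is a quick way to spot skewed columns: here the outlier pulls the mean far above the median.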

By the end of this section, you’ve learned how to:

1. Import data using pandas.

2. Understand the structure and basic information about your data.
3. Check for missing values and basic statistics.

These foundational steps will help you gain insights and prepare the data for deeper analysis in
future sections!

2. Matplotlib: Visualization with Python

Matplotlib is a widely used library in Python for creating static, animated, and interactive
visualizations. It plays a crucial role in data analysis by enabling you to visualize patterns,
trends, and relationships within data. In this section, we'll introduce two common plot types:
scatter plots and histograms.

a. Scatter Plot

A scatter plot is useful when you want to observe the relationship or correlation between two
numerical variables. For example, if we want to see how one variable changes as another changes,
we plot each pair of values as a point on a graph.

Python code:
# Importing necessary libraries
import matplotlib.pyplot as plt

# Function to create a scatter plot
def scatter_plot(x, y, xlabel, ylabel, title):
    plt.scatter(x, y)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

# Example usage
scatter_plot(data['variable1'], data['variable2'], 'Variable 1',
'Variable 2', 'Scatter Plot of Variable 1 vs Variable 2')

Hint!: data['variable1'] => data['Happiness Score'], data['variable2'] => data['Economy (GDP per Capita)']

● plt.scatter(): Plots data points as a scatter plot.

● plt.xlabel(), plt.ylabel(), plt.title(): Add labels and title to the chart.
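
A scatter plot shows a relationship visually; if you also want a number for it, one option is the Pearson correlation that Pandas provides through Series.corr(). The sketch below uses hypothetical toy values, not the tutorial dataset.

```python
import pandas as pd

# Hypothetical toy values: y rises steadily with x
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.0, 9.8],
})

# Pearson correlation between the two columns (ranges from -1 to 1)
r = toy["x"].corr(toy["y"])
print(round(r, 3))
```

A value close to 1 matches points that line up along a rising diagonal; a value near 0 matches a shapeless cloud.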

b. Histogram

A histogram is helpful for understanding the distribution of numerical data, showing how
frequently data points fall within specified ranges.

Python code:
# Function to create a histogram
def histogram(data_column, bins, xlabel, ylabel, title):
    plt.hist(data_column, bins=bins, edgecolor='black')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

# Example usage
histogram(data['numeric_variable'], bins=50, xlabel='Value',
ylabel='Frequency', title='Distribution of Numeric Variable')

Hint!: data['numeric_variable'] => data['Generosity']

● plt.hist(): Plots the histogram.

● bins: Defines the number of intervals to split the data into.
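
To make the effect of bins concrete, the sketch below computes the bin edges and counts directly with NumPy's np.histogram(), which applies the same equal-width binning when given an integer. The values are hypothetical toy data.

```python
import numpy as np

# Hypothetical toy values between 0 and 8
values = [0, 1, 1, 2, 3, 5, 8]

# 4 equal-width intervals spanning min(values) to max(values)
counts, edges = np.histogram(values, bins=4)

print(edges)    # bin boundaries
print(counts)   # number of values falling in each bin
```

Each bar in a histogram corresponds to one entry of counts, drawn between two neighbouring edges. More bins give narrower intervals and a more detailed, but noisier, picture.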

3. Pandas: Data Manipulation with Python

Pandas is a powerful data manipulation library in Python. It provides easy-to-use structures like
Series and DataFrame to manage and analyze data.

a. Comparison Operators

Comparison operators compare two values and return a boolean result (True or False). They
are useful for filtering data, making decisions, or applying conditions.

Python code:
# Example of comparison operators
result = data['column1'] < 5  # Check if values in column1 are less than 5
print(result.head())

Hint!: data['column1'] => data['Happiness Rank']

b. Boolean Operators

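Boolean operators combine logical expressions and are often used in conjunction with comparison
operators to filter data. Python's and, or, and not keywords work on single True/False values;
for element-wise logic on a Pandas Series, use the &, |, and ~ operators listed below and wrap
each comparison in parentheses.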

Python code:
# Example of boolean operators
# Checking multiple conditions
result = (data['column1'] < 5) & (data['column2'] < 10)
print(result.head())

Hint!: data['column1'] => data['Happiness Rank'], data['column2'] => data['Happiness Score']

● &: Logical "and"

● |: Logical "or"
● ~: Logical "not"
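
The three operators can be compared side by side on a small example. The sketch below uses a hypothetical toy DataFrame; the column names rank and score are placeholders.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "rank": [1, 4, 8, 9],
    "score": [7.5, 12.0, 9.0, 11.0],
})

low_rank = toy["rank"] < 5
low_score = toy["score"] < 10

both = low_rank & low_score      # True where both conditions hold
either = low_rank | low_score    # True where at least one holds
not_low = ~low_rank              # flips every True/False

print(both.tolist())
print(either.tolist())
print(not_low.tolist())
```

When the comparisons are written inline, each one needs its own parentheses, as in (toy["rank"] < 5) & (toy["score"] < 10), because & binds more tightly than <.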

c. Series

A Series is a one-dimensional data structure in Pandas. It can hold any data type and is
essentially a column in a DataFrame.

Python code:
# Creating a Pandas Series
series_example = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd',
'e'])
print(series_example)
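
Once a Series exists, its values can be read by label or by position. The following sketch rebuilds the same Series and shows both, plus a boolean filter.

```python
import pandas as pd

series_example = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

by_label = series_example['b']                    # access by index label
by_position = series_example.iloc[1]              # access by integer position
large = series_example[series_example > 3]        # boolean filtering works on a Series too

print(by_label, by_position)
print(large)
```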

d. DataFrame

A DataFrame is the most commonly used structure in Pandas. It is a two-dimensional table,
similar to an Excel sheet, and can hold multiple types of data.

Python code:
# Creating a DataFrame from a dictionary
df_example = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
})
print(df_example)

● DataFrames can be created from various inputs, such as dictionaries, lists, or another
DataFrame.
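
As a sketch of the list-based inputs mentioned above, the example below builds the same small table in two ways: from a list of dictionaries and from a list of lists. The names and ages are illustrative only.

```python
import pandas as pd

# From a list of dictionaries: each dictionary becomes one row
rows = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30},
]
df_from_dicts = pd.DataFrame(rows)

# From a list of lists: column names are supplied separately
df_from_lists = pd.DataFrame([['Alice', 25], ['Bob', 30]], columns=['Name', 'Age'])

print(df_from_dicts)
print(df_from_lists)
```

Both calls produce a table with the same columns and values; which form is more convenient depends on how the raw data arrives.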

4. Filtering Data

Filtering is a common task when working with data. It allows us to extract specific rows from a
DataFrame based on conditions we set.

a. Filtering with Conditions

You can filter rows by applying conditions to columns, using comparison and boolean operators.

Python code:
# Example: Filtering rows where 'Happiness Score' is greater than 7
filtered_data = data[data["Happiness Score"] > 7]
print(filtered_data)
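
It helps to keep two steps apart: a comparison on its own produces a boolean mask with one True/False per row, and indexing the DataFrame with that mask is what actually selects rows. The sketch below shows both steps on hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Cascadia"],
    "Happiness Score": [7.5, 6.8, 7.2],
})

mask = toy["Happiness Score"] > 7    # boolean Series, one entry per row
happy = toy[mask]                    # only the rows where mask is True

print(mask.tolist())
print(happy["Country"].tolist())
```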

b. Filtering Multiple Conditions

You can apply multiple conditions using logical operators like & and |.

Python code:
import numpy as np

# & operator: both conditions must hold
filtered_1 = data[(data["Happiness Score"] > 5) & (data["Freedom"] > 0.35)]
print(filtered_1.head())

# logical_and() function: NumPy equivalent of &
filtered_2 = data[np.logical_and(data["Health (Life Expectancy)"] > 0.94,
                                 data["Happiness Score"] > 7)]
print(filtered_2.head())
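
To check that the & operator and np.logical_and() select the same rows, and to see how | differs, here is a small sketch on hypothetical toy values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Happiness Score": [7.5, 4.9, 6.1, 5.5],
    "Freedom": [0.60, 0.50, 0.30, 0.40],
})

cond_a = toy["Happiness Score"] > 5
cond_b = toy["Freedom"] > 0.35

with_operator = toy[cond_a & cond_b]                 # rows meeting both conditions
with_numpy = toy[np.logical_and(cond_a, cond_b)]     # same selection via NumPy
either = toy[cond_a | cond_b]                        # rows meeting at least one

print(with_operator.equals(with_numpy))
print(len(with_operator), len(either))
```

The operator form is usually shorter to read; the NumPy function can be handy when the conditions are built up programmatically.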

By the end of Sections 2 to 4, you've learned how to:

1. Create scatter plots and histograms using Matplotlib.

2. Manipulate data with Pandas, including comparison operators, boolean operations,
Series, and DataFrames.
3. Filter data based on conditions to extract specific rows for analysis.

These tools and techniques will form the backbone of your data analysis and visualization
journey in Python!

5. Looping through Data Structures: for and while

In Python, loops allow you to iterate over data structures like lists, arrays, and DataFrames. You
can use loops to perform repetitive tasks efficiently. Python provides two types of loops: for
loops and while loops.

a. for Loop

A for loop iterates over items of a sequence (like a list or DataFrame) and performs actions on
each item.

Python code:

# Example: Looping through a list of numbers
numbers = [1, 2, 3, 4, 5]
for num in numbers:
    print(num)

You can also loop through a DataFrame using the for loop to access each column or row.

Python code:

# Example: Iterating through DataFrame columns
for column in data.columns:
    print(f"Column name: {column}")
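
Rows can be visited in a similar way. The sketch below uses iterrows(), which yields each row's index label together with the row as a Series; the toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Country": ["Aland", "Borduria"],
    "Happiness Score": [7.5, 6.8],
})

# iterrows() yields (index label, row as a Series) pairs
labels = []
for idx, row in toy.iterrows():
    labels.append(f"{idx}: {row['Country']} scored {row['Happiness Score']}")

print(labels)
```

Row-by-row loops are fine for small tables and for printing, but on large DataFrames column-wise operations such as toy["Happiness Score"] > 7 are typically much faster.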

b. while Loop

A while loop keeps running as long as a specified condition remains True. It’s useful when
you don’t know the exact number of iterations beforehand.

Python code:

# Example: Using while loop to print numbers less than 5
i = 0
while i < 5:
    print(i)
    i += 1

Loops can be combined with conditionals and data structures to perform complex operations
efficiently.
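
As an illustration of a loop whose number of iterations is not known in advance, the sketch below halves a value until it drops below 1 and counts the steps. The starting value and the safety limit are arbitrary choices for the example.

```python
# Halve a value until it drops below 1, counting the steps needed
value = 100.0
steps = 0

while value >= 1:
    value /= 2
    steps += 1
    if steps > 1000:    # safety guard against an accidental infinite loop
        break

print(steps, value)
```

The if statement inside the loop shows how a conditional can be combined with a while loop; here it only acts as a guard and is never triggered.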

6. Indexing and Slicing DataFrames

Indexing and slicing are essential techniques to access specific rows, columns, or subsets of
data in a DataFrame.

a. Indexing

Indexing is used to select rows or columns by their labels or position.

Python code:

# Selecting a column by its name

country_column = data['Country']

print(country_column)

You can also use the iloc and loc indexers for more advanced indexing.

● iloc: Selects rows and columns by integer positions (zero-based indexing).

● loc: Selects rows and columns by labels.

Python code:

# Example: Selecting specific rows and columns using iloc

subset_data = data.iloc[0:5, 1:3] # Selects rows 0-4 and columns 1-2

print(subset_data)

# Example: Selecting specific rows and columns using loc

# Selects row labels 0 through 4 (inclusive) and the 'Happiness Rank' and 'Country' columns

subset_data = data.loc[0:4, ['Happiness Rank', 'Country']]

print(subset_data)
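
The difference between iloc and loc is easiest to see when the index labels are not 0, 1, 2, and so on. In the sketch below the toy DataFrame (hypothetical data) is given the labels 10, 20, 30, 40 on purpose.

```python
import pandas as pd

# Hypothetical toy data; labels deliberately differ from positions
toy = pd.DataFrame(
    {"Country": ["Aland", "Borduria", "Cascadia", "Dorne"],
     "Happiness Rank": [1, 2, 3, 4]},
    index=[10, 20, 30, 40],
)

by_position = toy.iloc[0:2]    # positions 0 and 1; the end (2) is excluded
by_label = toy.loc[10:30]      # labels 10 through 30; the end (30) is included

print(by_position.index.tolist())
print(by_label.index.tolist())
```

With the default index, labels and positions coincide, which is why data.iloc[0:5] and data.loc[0:4] return the same five rows.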

b. Slicing

Slicing allows you to extract portions of the DataFrame based on rows or columns.

Python code:

# Example: Slicing first 5 rows of the DataFrame

first_five_rows = data[:5]

print(first_five_rows)

# Example: Slicing specific columns of the DataFrame

Country_and_Region = data[['Country', 'Region']]

print(Country_and_Region)

By using indexing and slicing, you can effectively manage and manipulate different parts of your
DataFrame for analysis. These are powerful tools to extract meaningful subsets of your data
and streamline your workflow.
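
A few more slicing patterns are sketched below on a hypothetical toy DataFrame: a step, a negative start, and a combined row and column slice with loc.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Cascadia", "Dorne"],
    "Region": ["North", "South", "North", "West"],
    "Happiness Rank": [1, 2, 3, 4],
})

every_other = toy[::2]                        # rows at positions 0 and 2
last_two = toy[-2:]                           # final two rows
block = toy.loc[1:2, "Country":"Region"]      # label slices include both ends

print(every_other["Country"].tolist())
print(last_two["Country"].tolist())
print(block.shape)
```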

Summary

By now, you've learned how to:

1. Loop through data structures using for and while loops.

2. Access specific rows, columns, or subsets of data using indexing and slicing in
DataFrames.

These concepts are critical for data manipulation and will help you navigate complex datasets
more easily in Python.
