Course: Introduction to Data Science (SD211105)
Session: 2
Topic: Basic Data Science Tutorial
Download the dataset and try the tutorial below using Python:
https://drive.google.com/file/d/1aO1XLlsV3rft4Z3xRZErTjPAhbi4-uye/view?usp=sharing
Welcome to this tutorial, which introduces you to the world of data science. Even if you are
a complete beginner with no prior knowledge of the subject, this guide will help you get
started. Throughout the tutorial, you'll learn key concepts, techniques, and tools used in data
science. By the end of each section, you will have gained new abilities to work with and analyze
data.
To begin, here’s a famous quote from Josh Wills that perfectly encapsulates the role of a data
scientist:
"A data scientist is a person who is better at statistics than any software engineer and
better at software engineering than any statistician."
This definition emphasizes the interdisciplinary nature of data science, requiring a balance of
statistical expertise and programming skills.
1. Import and First Look at the Data
In this section, we will walk through the process of importing data and writing functions that help
us get an initial understanding of it. Data comes in various formats, such as CSV, Excel, or
databases, and it's crucial to understand how to load it into your analysis environment.
Steps:
a. Importing Data
To start working with data, we first need to import it into our environment. In Python, we often
use libraries like pandas to handle data efficiently. Let’s import a dataset and take a quick look
at it.
Python code:
# Importing necessary library
import pandas as pd
# Reading a CSV file into a DataFrame
data = pd.read_csv('your_dataset.csv')
# Display the first few rows of the data
print(data.head())
Hint!: your_dataset.csv => 2015.csv
● pandas: A powerful library for data manipulation and analysis.
● pd.read_csv(): Reads a CSV file into a DataFrame, the most common data structure
for handling tabular data in Python.
● data.head(): Displays the first few rows of the dataset for a quick glance at the
contents.
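If you want to try these calls before downloading the dataset, the sketch below reads a tiny CSV from an in-memory string instead of a file. The three rows and their values are made up for illustration and are not taken from 2015.csv.

```python
import io

import pandas as pd

# Hypothetical sample data; the real 2015.csv has more rows and columns.
csv_text = """Country,Happiness Rank,Happiness Score
Country A,1,7.5
Country B,2,7.1
Country C,3,6.8
"""

# pd.read_csv accepts a file path or any file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))        # only the first two rows
print(list(sample.columns))  # column names taken from the header line
```

Passing a number to head() controls how many rows are shown; without it, the first five rows are returned.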
b. Understanding Data Structure
After loading the data, it's important to understand its structure—columns, rows, and data types.
Let’s write a function to summarize this:
Python code:
# Function to get a quick summary of the dataset
def data_summary(df):
    print(f"Data Shape: {df.shape}")  # Number of rows and columns
    print("\nColumn Info:")
    df.info()  # Prints data types and non-null counts itself; returns None
    # Count of missing values per column
    print("\nMissing Values:\n", df.isnull().sum())
# Call the function
data_summary(data)
● df.shape: Provides the dimensions of the dataset (rows, columns).
● df.info(): Displays information on the column names, data types, and the number of
non-null values.
● df.isnull().sum(): Identifies missing data in each column.
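To see what these three calls report, here is a minimal sketch on a made-up DataFrame with one deliberately missing value. The name toy and its contents are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second score is missing on purpose.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Score": [7.5, np.nan, 6.8],
})

print(toy.shape)             # (rows, columns)
missing = toy.isnull().sum() # one count per column
print(missing)
```

Here the count for Score is 1 and the count for Country is 0, which tells you exactly which column needs attention.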
c. Exploring Data Distributions
To explore data distributions, we can use basic statistical measures:
Python code:
# Function to check basic statistics of numeric columns
def check_statistics(df):
    print("\nBasic Statistics:\n")
    print(df.describe())  # Descriptive statistics for numeric data
# Call the function
check_statistics(data)
● df.describe(): Returns summary statistics for the numeric columns in the dataset: count,
mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum.
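The small sketch below, using made-up numbers, shows how to pull individual values out of the table that describe() returns. Note that the median appears under the row label "50%".

```python
import pandas as pd

# Hypothetical values chosen so the mean and median differ.
scores = pd.DataFrame({"Score": [1.0, 2.0, 3.0, 4.0, 10.0]})

stats = scores.describe()
print(stats)

# The result is itself a DataFrame, so single values can be read with loc.
mean_value = stats.loc["mean", "Score"]
median_value = stats.loc["50%", "Score"]
print(mean_value, median_value)
```

The large value 10.0 pulls the mean above the median, which is a quick hint that the distribution is skewed.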
By the end of this section, you’ve learned how to:
1. Import data using pandas.
2. Understand the structure and basic information about your data.
3. Check for missing values and basic statistics.
These foundational steps will help you gain insights and prepare the data for deeper analysis in
future sections!
2. Matplotlib: Visualization with Python
Matplotlib is a widely used library in Python for creating static, animated, and interactive
visualizations. It plays a crucial role in data analysis by enabling you to visualize patterns,
trends, and relationships within data. In this section, we'll introduce two common plot types:
scatter plots and histograms.
a. Scatter Plot
A scatter plot is useful when you want to observe the relationship or correlation between two
numerical variables. For example, to see how one variable changes with another, we plot each
pair of values as a point on a graph.
Python code:
# Importing necessary libraries
import matplotlib.pyplot as plt
# Function to create a scatter plot
def scatter_plot(x, y, xlabel, ylabel, title):
    plt.scatter(x, y)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()
# Example usage
scatter_plot(data['variable1'], data['variable2'], 'Variable 1',
             'Variable 2', 'Scatter Plot of Variable 1 vs Variable 2')
Hint!: data['variable1'] => data['Happiness Score'], data['variable2'] => data['Economy
(GDP per Capita)']
● plt.scatter(): Plots data points as a scatter plot.
● plt.xlabel(), plt.ylabel(), plt.title(): Add labels and title to the chart.
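A scatter plot shows a relationship visually. As a rough numeric companion, the sketch below uses the pandas Series.corr() method, which computes the Pearson correlation by default. The columns x, y, and z are made up for illustration.

```python
import pandas as pd

# Hypothetical data: y rises with x, z falls as x rises.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [8, 6, 4, 2],
})

r_xy = toy["x"].corr(toy["y"])  # close to +1: points rise along a line
r_xz = toy["x"].corr(toy["z"])  # close to -1: points fall along a line
print(r_xy, r_xz)
```

Values near +1 or -1 correspond to points that line up tightly in the scatter plot; values near 0 correspond to a cloud with no clear linear trend.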
b. Histogram
A histogram is helpful for understanding the distribution of numerical data, showing how
frequently data points fall within specified ranges.
Python code:
# Function to create a histogram
def histogram(data_column, bins, xlabel, ylabel, title):
    plt.hist(data_column, bins=bins, edgecolor='black')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()
# Example usage
histogram(data['numeric_variable'], bins=50, xlabel='Value',
          ylabel='Frequency', title='Distribution of Numeric Variable')
Hint!: data['numeric_variable'] => data['Generosity']
● plt.hist(): Plots the histogram.
● bins: Defines the number of intervals to split the data into.
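To make the effect of bins concrete, the sketch below uses NumPy's histogram function, which performs the same kind of binning without drawing anything. The values are made up for illustration.

```python
import numpy as np

# Hypothetical values ranging from 1 to 10.
values = np.array([1, 2, 2, 3, 9, 10])

# Split the range [1, 10] into 3 equal-width intervals and count the values in each.
counts, edges = np.histogram(values, bins=3)

print(edges)   # the boundaries of the intervals
print(counts)  # how many values fall inside each interval
```

With 3 bins the interval boundaries are 1, 4, 7, and 10, so four values land in the first interval, none in the second, and two in the third. A larger bins value gives narrower intervals and a more detailed, but noisier, picture.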
3. Pandas: Data Manipulation with Python
Pandas is a powerful data manipulation library in Python. It provides easy-to-use structures like
Series and DataFrame to manage and analyze data.
a. Comparison Operators
Comparison operators compare two values and return a boolean result (True or False). They
are useful for filtering data, making decisions, or applying conditions.
Python code:
# Example of comparison operators
# Check if values in column1 are less than 5
result = data['column1'] < 5
print(result.head())
Hint!: data['column1'] => data['Happiness Rank']
b. Boolean Operators
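Boolean operators combine logical expressions and are often used together with comparison
operators to filter data. Python's and, or, and not keywords work on single True/False values;
for element-wise logic on pandas Series, use &, |, and ~ instead, and wrap each comparison in
parentheses.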
Python code:
# Example of boolean operators
# Checking multiple conditions
result = (data['column1'] < 5) & (data['column2'] < 10)
print(result.head())
Hint!: data['column1'] => data['Happiness Rank'], data['column2'] => data['Happiness Score']
● &: Logical "and"
● |: Logical "or"
● ~: Logical "not"
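The three operators can be tried on a small made-up DataFrame. The column names rank and score are placeholders, not columns of the tutorial dataset.

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({"rank": [1, 4, 8], "score": [7.5, 9.0, 12.0]})

# Each comparison is wrapped in parentheses because &, | and ~ bind
# more tightly than < and >.
both = (toy["rank"] < 5) & (toy["score"] < 10)    # True where both hold
either = (toy["rank"] < 5) | (toy["score"] > 11)  # True where at least one holds
negated = ~(toy["rank"] < 5)                      # flips True and False

print(both.tolist())
print(either.tolist())
print(negated.tolist())
```

Each result is a Series of True/False values with one entry per row, which is exactly the shape needed for filtering.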
c. Series
A Series is a one-dimensional data structure in Pandas. It can hold any data type and is
essentially a column in a DataFrame.
Python code:
# Creating a Pandas Series
series_example = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd',
'e'])
print(series_example)
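Once a Series has an index, its values can be read by label, by position, or through a condition. The following is a minimal sketch that reuses the same values as the example above.

```python
import pandas as pd

series_example = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

by_label = series_example['c']               # value stored under label 'c'
by_position = series_example.iloc[0]         # first value, regardless of label
larger = series_example[series_example > 3]  # keeps only values above 3

print(by_label, by_position)
print(larger)
```

The filtered result keeps the original labels ('d' and 'e'), so you can always trace values back to where they came from.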
d. DataFrame
A DataFrame is the most commonly used structure in Pandas. It is a two-dimensional table,
similar to an Excel sheet, and can hold multiple types of data.
Python code:
# Creating a DataFrame from a dictionary
df_example = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
})
print(df_example)
● DataFrames can be created from various inputs, such as dictionaries, lists, or another
DataFrame.
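As one illustration of a different input, the sketch below builds a DataFrame from a list of dictionaries, where each dictionary becomes one row. The names and ages are made up.

```python
import pandas as pd

# Hypothetical records: one dictionary per row, keys become column names.
records = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30},
]

df_from_records = pd.DataFrame(records)
print(df_from_records)
```

This layout is convenient when data arrives row by row, while the dictionary-of-lists form shown earlier is convenient when data arrives column by column.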
4. Filtering Data
Filtering is a common task when working with data. It allows us to extract specific rows from a
DataFrame based on conditions we set.
a. Filtering with Conditions
You can filter rows by applying conditions to columns, using comparison and boolean operators.
Python code:
# Example: Filtering rows where 'Happiness Score' is greater than 7
# The comparison builds a True/False mask; indexing with it keeps the matching rows
filtered_data = data[data["Happiness Score"] > 7]
print(filtered_data)
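A comparison on its own produces a Series of True/False values, often called a mask. To get the rows themselves, the mask is passed back into the DataFrame. The sketch below separates the two steps on made-up data.

```python
import pandas as pd

# Hypothetical scores for three placeholder countries.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Happiness Score": [7.5, 6.9, 7.2],
})

mask = toy["Happiness Score"] > 7  # one True/False value per row
rows = toy[mask]                   # only the rows where the mask is True

print(mask.tolist())
print(rows)
```

The filtered rows keep their original index labels (0 and 2 here), which is why filtered output often shows gaps in the index.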
b. Filtering Multiple Conditions
You can apply multiple conditions using logical operators like & and |.
Python code:
import numpy as np
# Combining two conditions with &
filtered_1 = data[(data["Happiness Score"] > 5) & (data["Freedom"] > 0.35)]
print(filtered_1.head())
# The same kind of filter using NumPy's logical_and() function
filtered_2 = data[np.logical_and(data["Health (Life Expectancy)"] > 0.94, data["Happiness Score"] > 7)]
print(filtered_2.head())
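The sketch below, on made-up data, checks that & and np.logical_and select the same rows and shows how | widens the selection. The values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({
    "Happiness Score": [7.5, 6.0, 7.2, 5.0],
    "Freedom": [0.6, 0.7, 0.2, 0.1],
})

high_score = toy["Happiness Score"] > 7
high_freedom = toy["Freedom"] > 0.35

with_ampersand = toy[high_score & high_freedom]             # both conditions
with_numpy = toy[np.logical_and(high_score, high_freedom)]  # same selection
either = toy[high_score | high_freedom]                     # at least one condition

print(with_ampersand.index.tolist())
print(either.index.tolist())
```

Storing each condition in a named variable first, as done here, tends to keep longer filters readable.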
By the end of this section, you've learned how to:
1. Create scatter plots and histograms using Matplotlib.
2. Manipulate data with Pandas, including comparison operators, boolean operations,
Series, and DataFrames.
3. Filter data based on conditions to extract specific rows for analysis.
These tools and techniques will form the backbone of your data analysis and visualization
journey in Python!
5. Looping through Data Structures: for and while
In Python, loops allow you to iterate over data structures like lists, arrays, and DataFrames. You
can use loops to perform repetitive tasks efficiently. Python provides two types of loops: for
loops and while loops.
a. for Loop
A for loop iterates over items of a sequence (like a list or DataFrame) and performs actions on
each item.
Python code:
# Example: Looping through a list of numbers
numbers = [1, 2, 3, 4, 5]
for num in numbers:
    print(num)
You can also loop through a DataFrame using the for loop to access each column or row.
Python code:
# Example: Iterating through DataFrame columns
for column in data.columns:
    print(f"Column name: {column}")
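To loop over rows rather than columns, one option is the DataFrame.iterrows() method, which yields the index label and the row as a Series. The data below is made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({"Country": ["A", "B"], "Rank": [1, 2]})

labels = []
for index, row in toy.iterrows():
    # row behaves like a small Series keyed by column name
    labels.append(f"{index}: {row['Country']} has rank {row['Rank']}")

for label in labels:
    print(label)
```

Row-by-row loops are easy to read but slow on large tables; for calculations, the column-wise operations shown in the earlier sections are usually faster.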
b. while Loop
A while loop keeps running as long as a specified condition remains True. It’s useful when
you don’t know the exact number of iterations beforehand.
Python code:
# Example: Using while loop to print numbers less than 5
i = 0
while i < 5:
    print(i)
    i += 1
Loops can be combined with conditionals and data structures to perform complex operations
efficiently.
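As a small sketch of combining loops with conditionals, the example below counts values above a threshold with a for loop and finds a stopping position with a while loop. The list of scores is made up.

```python
# Hypothetical scores for illustration.
scores = [7.5, 6.9, 7.2, 5.1]

# for loop with a conditional: count the scores above 7.
high_count = 0
for score in scores:
    if score > 7:
        high_count += 1

# while loop: move forward until the first score below 7 (or the end of the list).
position = 0
while position < len(scores) and scores[position] >= 7:
    position += 1

print(high_count)  # 2
print(position)    # 1
```

The for loop always visits every item, while the while loop stops as soon as its condition fails, which is the practical difference between the two.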
6. Indexing and Slicing DataFrames
Indexing and slicing are essential techniques to access specific rows, columns, or subsets of
data in a DataFrame.
a. Indexing
Indexing is used to select rows or columns by their labels or position.
Python code:
# Selecting a column by its name
country_column = data['Country']
print(country_column)
You can also use the iloc and loc indexers for more advanced indexing.
● iloc: Selects rows and columns by integer positions (zero-based indexing).
● loc: Selects rows and columns by labels.
Python code:
# Example: Selecting specific rows and columns using iloc
subset_data = data.iloc[0:5, 1:3]  # Selects rows 0-4 and columns 1-2
print(subset_data)
# Example: Selecting specific rows and columns using loc
# Selects rows labeled 0-4 (loc includes the end label) and the two named columns
subset_data = data.loc[0:4, ['Happiness Rank', 'Country']]
print(subset_data)
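One detail that often surprises beginners is that iloc excludes the end of a range while loc includes it. The sketch below shows this on a made-up DataFrame with the default 0, 1, 2, ... index.

```python
import pandas as pd

# Hypothetical data with the default integer index.
toy = pd.DataFrame({"Country": list("ABCDEF"), "Rank": range(1, 7)})

by_position = toy.iloc[0:3]  # positions 0, 1, 2 (end excluded)
by_label = toy.loc[0:3]      # labels 0, 1, 2, 3 (end included)

print(len(by_position))
print(len(by_label))
```

This is why the iloc example above uses 0:5 and the loc example uses 0:4 to select the same five rows.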
b. Slicing
Slicing allows you to extract portions of the DataFrame based on rows or columns.
Python code:
# Example: Slicing first 5 rows of the DataFrame
first_five_rows = data[:5]
print(first_five_rows)
# Example: Slicing specific columns of the DataFrame
country_and_region = data[['Country', 'Region']]
print(country_and_region)
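Slices can also start in the middle, count from the end, or be combined with a column selection. The following is a minimal sketch on made-up data; the regions are placeholders.

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({
    "Country": list("ABCDE"),
    "Region": ["North", "South", "North", "East", "West"],
    "Rank": [1, 2, 3, 4, 5],
})

middle = toy[1:3]                       # rows at positions 1 and 2
last_two = toy[-2:]                     # the final two rows
top_two = toy[["Country", "Rank"]][:2]  # two columns, then the first two rows

print(middle)
print(last_two)
print(top_two)
```

Row slices written this way work by position and exclude the end, just like slicing a Python list.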
By using indexing and slicing, you can effectively manage and manipulate different parts of your
DataFrame for analysis. These are powerful tools to extract meaningful subsets of your data
and streamline your workflow.
Summary
By now, you've learned how to:
1. Loop through data structures using for and while loops.
2. Access specific rows, columns, or subsets of data using indexing and slicing in
DataFrames.
These concepts are critical for data manipulation and will help you navigate complex datasets
more easily in Python.