Module 1:
Machine Learning:
Definition:
Machine learning is a branch of artificial
intelligence that focuses on enabling systems to
learn from data and improve their performance
on specific tasks without explicit
programming. Essentially, it allows computers
to learn patterns and make predictions or
decisions based on data, rather than relying
solely on pre-programmed instructions.
Here's a more detailed breakdown:
Learning from Data:
Machine learning algorithms analyze vast
amounts of data to identify patterns and
relationships.
No Explicit Programming:
Unlike traditional programming, where specific
instructions are given for every task, machine
learning algorithms learn from data and adapt
their behavior accordingly.
Improved Performance:
As the algorithms process more data, they
refine their models and improve their ability to
perform the task they are designed for, whether
it's classification, prediction, or other tasks.
Subset of AI:
Machine learning is a specific approach within
the broader field of artificial intelligence,
focusing on enabling machines to learn from
data.
Importance of machine learning:
Machine learning (ML) is crucial because
it enables computers to learn from data and
improve their performance on specific tasks
without explicit programming, driving
innovation and efficiency across various
sectors. It's important for data analysis,
automation, personalization, and advancements
in diverse fields.
Key Aspects of Importance:
Data-Driven Insights:
ML algorithms can analyse vast amounts of
data to identify patterns and trends that might
be missed by humans, leading to better
decision-making.
Automation and Efficiency:
ML automates repetitive tasks, freeing up
human resources for more complex work and
boosting productivity.
Personalization:
ML tailors user experiences by providing
personalized recommendations and
advertisements based on individual
preferences.
Fraud Detection and Security:
ML helps businesses identify fraudulent
activities and security threats by analyzing
patterns and anomalies in data.
Healthcare Advancements:
ML aids in research, diagnostics, and
personalized treatment in healthcare, speeding
up drug discovery and enabling tailored
therapies.
Improved Accuracy and Efficiency:
ML algorithms can analyse data far faster than
humans and, on well-defined tasks, often more
accurately, leading to better outcomes in
various applications.
Cost Reduction and Risk Mitigation:
ML helps reduce costs and mitigate risks by
automating processes, improving efficiency, and
enabling better decision-making.
In essence, machine learning is transforming
how organizations operate, make decisions, and
deliver value, making it a pivotal technology in
the modern world.
Types of machine learning:
Machine learning (ML) can be broadly
categorized into supervised learning,
unsupervised learning, and
reinforcement learning. These categories
are based on how the algorithm learns
from data. Semi-supervised learning is
also a recognized type, bridging the gap
between supervised and unsupervised
methods.
Here's a breakdown of the main types:
Supervised Learning:
This involves training a model on a
labelled dataset, where the desired
output is known for each input.
The model learns to map inputs to
outputs and can then predict outcomes
for new, unseen data.
Examples include predicting house
prices based on features like size and
location (regression), or classifying
emails as spam or not spam
(classification).
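The house-price regression example above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sizes and prices are made-up numbers, and fit_line is a hypothetical helper that fits a straight line by ordinary least squares, which is one simple supervised method among many.

```python
# Toy labelled dataset (made-up numbers): house size in square metres -> price.
sizes = [50.0, 80.0, 100.0, 120.0]     # inputs (features)
prices = [150.0, 240.0, 300.0, 360.0]  # known outputs (labels)

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# "Training": learn the mapping from inputs to outputs.
slope, intercept = fit_line(sizes, prices)

# "Prediction": estimate the price of a new, unseen 90 square metre house.
predicted_price = slope * 90.0 + intercept
print(round(predicted_price, 2))  # 270.0
```

The model never sees the 90 square metre house during training; it generalizes from the labelled examples, which is the defining feature of supervised learning.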
Unsupervised Learning:
This type of learning involves training a
model on an unlabelled dataset.
The algorithm identifies patterns,
relationships, and structures within the
data without explicit guidance.
Examples include clustering similar
customers together or reducing the
dimensionality of data.
Semi-supervised Learning:
This approach combines labelled and
unlabelled data for training. It's useful
when obtaining labelled data is
expensive or time-consuming, but
unlabelled data is readily available.
Reinforcement Learning:
In this type, an agent learns to make
decisions by interacting with an
environment and receiving feedback in
the form of rewards or penalties.
The goal is for the agent to learn an
optimal policy that maximizes its
cumulative reward over time.
This is often used in robotics and game
playing.
Applications of machine learning in
various domains:
Machine learning (ML) has
revolutionized numerous industries by
enabling automation, prediction, and
optimization across various domains.
Here's a more detailed look at some key
applications:
1. Healthcare:
Disease Diagnosis:
ML algorithms can analyze medical
images (X-rays, MRIs, etc.) to assist in
early and accurate disease detection,
including cancer identification.
Personalized Treatment:
By analyzing patient data, ML can help
tailor treatment plans and predict patient
outcomes, improving healthcare quality.
Drug Discovery:
ML accelerates the process of identifying
potential drug candidates and optimizing
their properties.
Medical Robotics:
ML is used to develop robots that can
assist in surgeries, rehabilitation, and other
medical procedures.
2. Finance:
Fraud Detection:
ML algorithms can identify fraudulent
transactions in real-time, minimizing
financial losses.
Algorithmic Trading:
ML enables automated trading strategies
based on complex data analysis and market
predictions.
Credit Scoring:
ML helps assess credit risk more
accurately, enabling faster and safer
lending decisions.
Risk Management:
ML models can be used to identify and
mitigate various financial risks.
3. E-commerce and Retail:
Personalized Recommendations:
ML powers recommendation systems that
suggest products based on user preferences
and past behaviour.
Demand Forecasting:
ML helps retailers predict future demand,
optimizing inventory management and
supply chain operations.
Customer Sentiment Analysis:
ML analyses customer feedback to
understand their preferences and improve
shopping experiences.
4. Transportation and Automotive:
Self-Driving Cars:
ML is crucial for enabling autonomous
vehicles to perceive their surroundings,
make decisions, and navigate safely.
Route Optimization:
ML algorithms optimize delivery routes,
minimizing travel time and fuel
consumption.
Predictive Maintenance:
ML can predict potential vehicle
malfunctions, enabling timely maintenance
and reducing downtime.
5. Social Media and Entertainment:
Content Recommendations:
Platforms like YouTube and Netflix use
ML to recommend videos and movies that
users might enjoy.
Image and Speech Recognition:
ML powers features like facial recognition,
voice assistants, and automatic content
tagging.
6. Manufacturing:
Quality Control:
ML algorithms can automatically detect
defects in products, improving quality
control processes.
Predictive Maintenance:
ML can predict machine failures, enabling
proactive maintenance and preventing
production downtime.
7. Other Applications:
Spam Filtering:
ML algorithms identify and filter out
unwanted emails, improving inbox
efficiency.
Natural Language Processing (NLP):
ML powers language translation, chatbots,
and other applications that involve
understanding and processing human
language.
Cybersecurity:
ML is used to detect and prevent cyber
threats, including malware and network
intrusions.
Agriculture:
ML helps optimize crop yields, monitor
soil health, and improve precision farming
techniques.
Education:
ML can personalize learning experiences,
predict student performance, and automate
grading processes.
Bioinformatics:
ML analyzes biological data to advance
research in genetics, drug discovery, and
personalized medicine.
Python for Data Analysis:
Introduction to Python programming:
Python is a high-level, general-purpose
programming language known for its
readability and versatility.
It's widely used in web development, data science,
automation, and artificial intelligence.
Python's clean syntax makes it relatively easy to
learn and use, while its extensive libraries and
frameworks support a wide range of
applications.
Key Features:
High-level:
Python abstracts away many low-level details of
computer operations, allowing developers to
focus on problem-solving rather than getting
bogged down in technical complexities.
Interpreted:
Python code is executed by an interpreter,
which reads and executes the code line by line,
making it easier to debug and test.
Object-oriented:
Python supports object-oriented programming
principles, allowing developers to organize code
into reusable objects and classes.
Dynamic typing:
Python infers the data type of a variable at
runtime, making coding faster and more
flexible.
Readability:
Python's syntax emphasizes clarity and uses
English keywords, making code easier to read
and understand.
Extensive libraries and frameworks:
Python boasts a rich ecosystem of libraries and
frameworks for various tasks, such as web
development (e.g., Django, Flask), data analysis
(e.g., Pandas, NumPy), and machine learning
(e.g., TensorFlow, PyTorch).
Large community support:
Python has a large and active community that
provides extensive documentation, tutorials,
and support for beginners and experienced
users alike.
Python Install
Many PCs and Macs already have Python installed.
To check whether Python is installed on a Windows PC, search
for Python in the Start menu or run the following on the Command
Line (cmd.exe):
C:\Users\Your Name>python --version
To check whether Python is installed on Linux or macOS, open the
command line (Linux) or the Terminal (macOS) and type the following
(on many systems the command is python3 rather than python):
python --version
Python is an interpreted programming language. This
means that, as a developer, you write Python (.py) files in
a text editor and then pass those files to the Python
interpreter to be executed.
Let's write our first Python file, called hello.py, which can
be done in any text editor:
hello.py:
print("Hello, World!")
Simple as that. Save your file. Open your command line,
navigate to the directory where you saved your file, and
run:
C:\Users\Your Name>python hello.py
The output should be:
Hello, World!
Python Indentation
Indentation refers to the spaces at the beginning of a
code line.
Where in other programming languages the indentation in code is
for readability only, the indentation in Python is very important.
Python uses indentation to indicate a block of code.
Example
if 5 > 2:
    print("Five is greater than two!")
Python will give you an error (an IndentationError) if you skip the indentation.
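The error can be observed safely with a small sketch. It uses the built-in compile() function to parse a piece of source text without running it; the variable names bad_source and caught are only illustrative.

```python
# Source text with the print() line NOT indented under the if statement.
bad_source = 'if 5 > 2:\nprint("Five is greater than two!")\n'

caught = False
try:
    # compile() parses the text without executing it.
    compile(bad_source, "<example>", "exec")
except IndentationError as error:
    caught = True
    # The exact wording of the message depends on the Python version.
    print("IndentationError:", error.msg)
```

Indenting the print() line by the same number of spaces throughout the block (four is the usual convention) makes the error go away.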
Creating Variables
Python has no command for declaring a variable.
A variable is created the moment you first assign a value to it.
Example
x = 5
y = "John"
print(x)
print(y)
Variables do not need to be declared with any particular type, and can even
change type after they have been set.
Example
x = 4 # x is of type int
x = "Sally" # x is now of type str
print(x)
Single or Double Quotes?
String variables can be declared either by using single or double quotes:
Example
x = "John"
# is the same as
x = 'John'
Case-Sensitive
Variable names are case-sensitive.
Example
This will create two variables:
a = 4
A = "Sally"
#A will not overwrite a
Input and Output in Python
Taking input in Python:
Python's input() function is used to take user input. It always returns the
user input as a string.
Example:
name = input("Enter your name: ")
print("Hello,", name, "! Welcome!")
Output
Enter your name: GeeksforGeeks
Hello, GeeksforGeeks ! Welcome!
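Because input() returns a string, numeric input has to be converted before it can be used in arithmetic. The sketch below shows the idea; years_until_100 is a hypothetical helper, and a fixed string stands in for the text a user would type so that the example runs without waiting for the keyboard.

```python
def years_until_100(age_text):
    """Convert text (as returned by input()) to an int, then do arithmetic."""
    age = int(age_text)  # "25" (str) -> 25 (int)
    return 100 - age

# In a real program this would be: age_text = input("Enter your age: ")
age_text = "25"  # stands in for what the user typed

print(type(age_text))             # <class 'str'>
print(years_until_100(age_text))  # 75
```

If the text cannot be read as a whole number (for example "abc"), int() raises a ValueError.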
Printing Output using print() in Python
The print() function displays text, variables and expressions on the
console. Let's begin with its basic usage:
Example:
print("Hello, World!")
Output
Hello, World!
In this example, "Hello, World!" is a string literal enclosed within double
quotes. When executed, this statement outputs the text to the console.
print() also accepts several values separated by commas and prints them
with a space in between:
Example:
X = 5
print("X =", X)
Output
X = 5
Creating a Comment
Comments start with a #, and Python will ignore them:
Example:
#This is a comment
print("Hello, World!")
Example
print("Hello, World!") #This is a comment
Creating a Python file
Create a Python file in a text editor, save it with the .py file
extension, and run it from the Command Line:
C:\Users\Your Name>python myfile.py
Python Conditions and If statements
Python supports the usual logical conditions from mathematics:
Equals: a == b
Not Equals: a != b
Less than: a < b
Less than or equal to: a <= b
Greater than: a > b
Greater than or equal to: a >= b
These conditions can be used in several ways, most commonly in "if
statements" and loops.
An "if statement" is written by using the if keyword.
Example
If statement:
a = 33
b = 200
if b > a:
    print("b is greater than a")
output:
b is greater than a
Elif
The elif keyword is Python's way of saying "if the previous conditions were
not true, then try this condition".
Example
a = 33
b = 33
if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
Else
The else keyword catches anything which isn't caught by the preceding
conditions.
Example
a = 200
b = 33
if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
else:
    print("a is greater than b")
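The three branches can be exercised together by wrapping the same logic in a function and calling it with different values. This is a small sketch; the function name compare is only illustrative.

```python
def compare(a, b):
    """Return a message describing how a and b relate."""
    if b > a:
        return "b is greater than a"
    elif a == b:
        return "a and b are equal"
    else:
        return "a is greater than b"

print(compare(33, 200))  # b is greater than a
print(compare(33, 33))   # a and b are equal
print(compare(200, 33))  # a is greater than b
```

Only the first branch whose condition is true runs; the remaining elif and else branches are skipped.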
Sample Programs:
# Program to generate a random number between 0 and 9
# importing the random module
import random
print(random.randint(0,9))
Output (varies from run to run)
5
# Python program to find the factorial of a number provided by the user.
# To take input from the user
num = int(input("Enter a number: "))
factorial = 1
# check if the number is negative, positive or zero
if num < 0:
    print("Sorry, factorial does not exist for negative numbers")
elif num == 0:
    print("The factorial of 0 is 1")
else:
    for i in range(1,num + 1):
        factorial = factorial*i
    print("The factorial of",num,"is",factorial)
# Python program to shuffle a deck of cards
# importing modules
import itertools, random
# make a deck of cards
deck = list(itertools.product(range(1,14),
                              ['Spade','Heart','Diamond','Club']))
# shuffle the cards in place
random.shuffle(deck)
# draw five cards
print("You got:")
for i in range(5):
    print(deck[i][0], "of", deck[i][1])
output (the cards drawn vary from run to run):
You got:
3 of Club
7 of Spade
10 of Spade
6 of Diamond
7 of Club
Reverse a Number using a while loop
num = 1234
reversed_num = 0
while num != 0:
    digit = num % 10
    reversed_num = reversed_num * 10 + digit
    num //= 10
print("Reversed Number: " + str(reversed_num))
Python for Data Analysis:
Python libraries for data analysis: NumPy, Pandas,
Matplotlib
Numpy Library:
What is NumPy?
NumPy is a Python library used for working with arrays.
It also has functions for working in the domains of linear algebra, Fourier
transforms, and matrices.
NumPy was created in 2005 by Travis Oliphant. It is an open source project
and you can use it freely.
NumPy stands for Numerical Python.
Installation of NumPy
If you have Python and PIP already installed on a system, then installation
of NumPy is very easy.
Install it using this command:
C:\Users\Your Name>pip install numpy
Import NumPy
Once NumPy is installed, import it in your applications by adding
the import keyword:
import numpy
Example:
import numpy
arr = numpy.array([1, 2, 3, 4, 5])
print(arr)
NumPy as np
NumPy is usually imported under the np alias.
alias: In Python, an alias is an alternate name for referring to the
same thing.
Create an alias with the as keyword while importing:
import numpy as np
Now the NumPy package can be referred to as np instead
of numpy.
Example:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Create a NumPy ndarray Object
NumPy is used to work with arrays. The array object in NumPy is
called ndarray.
We can create a NumPy ndarray object by using
the array() function.
Example:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
output:
[1 2 3 4 5]
<class 'numpy.ndarray'>
type(): This built-in Python function returns the type of the object
passed to it. In the code above, it shows
that arr is of type numpy.ndarray.
NumPy Data Types
Data Types in NumPy
NumPy has some extra data types, and refers to data types with
one character, like i for integers, u for unsigned integers etc.
Below is a list of all data types in NumPy and the characters used
to represent them.
i - integer
b - boolean
u - unsigned integer
f - float
c - complex float
m - timedelta
M - datetime
O - object
S - string
U - unicode string
V - fixed chunk of memory for other type ( void )
Checking the Data Type of an Array
The NumPy array object has a property called dtype that returns
the data type of the array:
Example
Get the data type of an array object:
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr.dtype)
Example
Get the data type of an array containing strings:
import numpy as np
arr = np.array(['apple', 'banana', 'cherry'])
print(arr.dtype)
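The one-character codes listed above can also be used to request a data type when an array is created, or to convert an existing array. A minimal sketch, assuming the same small integer array as in the examples:

```python
import numpy as np

# Ask for single-precision floats explicitly using the code 'f'.
arr = np.array([1, 2, 3, 4], dtype="f")
print(arr.dtype)          # float32

# astype() returns a converted copy; the original array is unchanged.
as_int = arr.astype("i")
print(as_int.dtype.kind)  # i
print(arr.dtype)          # float32
```

The exact integer width selected by 'i' depends on the platform, which is why the sketch prints the kind of the data type rather than its full name.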
Slicing arrays
Slicing in python means taking elements from one given index to
another given index.
We pass slice instead of index like this: [start:end].
We can also define the step, like this: [start:end:step].
If we don't pass start, it is considered 0.
If we don't pass end, it is considered the length of the array in that
dimension.
If we don't pass step, it is considered 1.
Example
Slice elements from index 1 up to (but not including) index 5 of the following array:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5])
output:
[2 3 4 5]
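The step value and the defaults described above can be tried on the same array. The following sketch also uses negative indexes, which count from the end of the array.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

every_other = arr[1:5:2]  # indexes 1 and 3
from_four = arr[4:]       # end omitted: runs to the end of the array
first_four = arr[:4]      # start omitted: begins at index 0
near_end = arr[-3:-1]     # third-from-last up to (not including) the last
stepped = arr[::2]        # start and end omitted, step of 2

print(every_other)  # [2 4]
print(from_four)    # [5 6 7]
print(first_four)   # [1 2 3 4]
print(near_end)     # [5 6]
print(stepped)      # [1 3 5 7]
```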
NumPy Array Shape
Shape of an Array
The shape of an array is the number of elements in
each dimension.
Get the Shape of an Array
NumPy arrays have an attribute called shape that
returns a tuple with each index having the number
of corresponding elements.
Example:
Print the shape of a 2-D array:
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)
output:
(2, 4)
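As a short follow-up sketch, the shape can be compared across arrays with different numbers of dimensions, and changed with reshape() as long as the total number of elements stays the same.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.ndim)        # 2 (number of dimensions)
print(arr.shape)       # (2, 4): 2 rows, 4 columns

# Same 8 elements arranged as 4 rows of 2 columns.
reshaped = arr.reshape(4, 2)
print(reshaped.shape)  # (4, 2)

# A 1-D array has a one-element shape tuple.
flat = np.array([1, 2, 3, 4, 5])
print(flat.shape)      # (5,)
```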
Pandas:
The pandas library in Python is primarily used for data
manipulation and analysis, particularly with tabular
data. Its core functionalities revolve around two primary
data structures:
Series: A one-dimensional labeled array capable of
holding any data type.
DataFrame: A two-dimensional labeled data structure
with columns of potentially different types, analogous
to a spreadsheet or SQL table.
Pandas enables a wide range of operations on these data
structures, including:
Data Loading and Exporting: Reading data from
various formats like CSV, Excel, SQL databases, and
writing data to these formats.
Data Cleaning and Preprocessing: Handling missing
values, removing duplicates, and transforming data
types.
Data Exploration and Analysis: Calculating
descriptive statistics, grouping data, performing
aggregations, and filtering.
Data Transformation and Reshaping: Pivoting,
melting, merging, and joining DataFrames.
Time Series Analysis: Functionalities for working
with time-indexed data, including date range
generation and frequency conversion.
In essence, pandas provides a powerful and efficient
toolkit for tasks commonly encountered in data science,
such as data cleaning, exploration, and preparation for
machine learning models.
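A few of the operations listed above can be sketched on a tiny DataFrame. The sales records and column names below are made up purely for illustration.

```python
import pandas as pd

# Made-up sales records; the last row duplicates the first.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South", "North"],
    "units": [10, 7, 4, 3, 10],
    "price": [2.5, 3.0, 2.5, 3.0, 2.5],
})

# Cleaning: remove exact duplicate rows.
sales = sales.drop_duplicates()

# Filtering: keep only rows with more than 3 units.
large_orders = sales[sales["units"] > 3]

# Grouping and aggregation: total units per region.
units_per_region = sales.groupby("region")["units"].sum()

print(len(sales))         # 4
print(len(large_orders))  # 3
print(units_per_region)
```

Each step returns a new object rather than changing the data in place, which is why the results are assigned to variables.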
What is Pandas?
Pandas is a Python library used for working with data
sets.
It has functions for analyzing, cleaning, exploring, and
manipulating data.
Installation of Pandas
If you have Python and PIP already installed on a
system, then installation of Pandas is very easy.
Install it using this command:
C:\Users\Your Name>pip install pandas
If this command fails, then use a Python distribution
that already has Pandas installed, such as Anaconda
(which also includes the Spyder IDE).
Import Pandas
Once Pandas is installed, import it in your
applications by adding the import keyword:
import pandas
Example:
import pandas
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
output:
cars passings
0 BMW 3
1 Volvo 7
2 Ford 2
Pandas as pd
Pandas is usually imported under the pd alias.
alias: In Python, an alias is an alternate name for
referring to the same thing.
Create an alias with the as keyword while
importing:
import pandas as pd
Now the Pandas package can be referred to
as pd instead of pandas.
Example:
import pandas as pd
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)
Pandas Series
What is a Series?
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any
type.
Example:
Create a simple Pandas Series from a list:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
output:
0 1
1 7
2 2
dtype: int64
Labels
If nothing else is specified, the values are labeled
with their index number. The first value has index 0,
the second value has index 1, etc.
This label can be used to access a specified value.
Example
Return the first value of the Series:
print(myvar[0])
Create Labels
With the index argument, you can name your own
labels.
Example
Create your own labels:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
output:
x 1
y 7
z 2
dtype: int64
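Once custom labels exist, a value can be looked up by its label instead of its position. A minimal sketch using the same Series:

```python
import pandas as pd

a = [1, 7, 2]
myvar = pd.Series(a, index=["x", "y", "z"])

print(myvar["y"])      # 7
print(myvar.loc["z"])  # 2
```

Both forms look the value up by label; .loc makes that intent explicit.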
Pandas DataFrames
What is a DataFrame?
A Pandas DataFrame is a two-dimensional data
structure, like a two-dimensional array, or a table with
rows and columns.
Example:
Create a simple Pandas DataFrame:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Result
calories duration
0 420 50
1 380 40
2 390 45
Pandas Read CSV
Read CSV Files
A simple way to store big data sets is to use CSV files
(comma-separated values files).
CSV files contain plain text and are a well-known
format that can be read by most tools, including
Pandas.
In our examples we will be using a CSV file called
'data.csv'.
Example:
Load the CSV into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
output:
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
5 60 102 127 300.5
6 60 110 136 374.0
7 45 104 134 253.3
8 30 109 133 195.1
9 60 98 124 269.0
10 60 103 147 329.3
11 60 100 120 250.7
12 60 106 128 345.3
13 60 104 132 379.3
Tip: use to_string() to print the entire DataFrame.
If you have a large DataFrame with many rows,
Pandas will by default print only the first 5 rows and
the last 5 rows:
Example:
Print the DataFrame without the to_string() method:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
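If data.csv is not available, the same loading steps can be tried on CSV text held in memory. The sketch below is an assumption-laden stand-in: csv_text reproduces only the first four rows of the output shown earlier, and io.StringIO makes the text behave like a file.

```python
import io
import pandas as pd

# Inline stand-in for data.csv (first four rows only).
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
60,103,135,340.0
45,109,175,282.4
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (4, 4): 4 rows, 4 columns
print(df.head(2))  # only the first 2 rows
mean_calories = df["Calories"].mean()
print(mean_calories)
```

head() and tail() are convenient for inspecting a large DataFrame without printing all of it.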
Read JSON
Big data sets are often stored, or extracted as JSON.
JSON is plain text, but has the format of an object,
and is well known in the world of programming,
including Pandas.
In our examples we will be using a JSON file called
'data.json'.
Example
Load the JSON file into a DataFrame:
import pandas as pd
df = pd.read_json('data.json')
print(df.to_string())
output:
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
5 60 102 127 300.5
6 60 110 136 374.0
7 45 104 134 253.3
8 30 109 133 195.1
9 60 98 124 269.0
10 60 103 147 329.3
11 60 100 120 250.7
Dictionary as JSON
JSON = Python Dictionary
JSON objects have a format very similar to that of Python dictionaries.
If your JSON code is not in a file, but in a Python Dictionary, you can
load it into a DataFrame directly:
Example
Load a Python Dictionary into a DataFrame:
import pandas as pd
data = {
"Duration":{
"0":60,
"1":60,
"2":60,
"3":45,
"4":45,
"5":60
},
"Pulse":{
"0":110,
"1":117,
"2":103,
"3":109,
"4":117,
"5":102
},
"Maxpulse":{
"0":130,
"1":145,
"2":135,
"3":175,
"4":148,
"5":127
},
"Calories":{
"0":409,
"1":479,
"2":340,
"3":282,
"4":406,
"5":300
}
}
df = pd.DataFrame(data)
print(df)
output:
Duration Pulse Maxpulse Calories
0 60 110 130 409
1 60 117 145 479
2 60 103 135 340
3 45 109 175 282
4 45 117 148 406
5 60 102 127 300
Matplotlib:
Matplotlib is a comprehensive plotting library for the Python
programming language and its numerical mathematics extension
NumPy. It is widely used for creating static, animated, and
interactive visualizations in Python.
Key Features and Uses:
Diverse Plotting Capabilities:
Matplotlib supports a wide range of plot types, including line
plots, scatter plots, bar charts, histograms, pie charts, box plots,
error bars, and 3D plots (using mpl_toolkits.mplot3d).
Customization:
It offers extensive control over plot elements, allowing users to
customize colors, labels, titles, axes, gridlines, legends, and styles
to create publication-quality figures.
Integration:
Matplotlib seamlessly integrates with other popular Python
libraries like NumPy and Pandas, making it a powerful tool for
data analysis and visualization workflows. It is also commonly
used within interactive environments like Jupyter Notebooks.
Object-Oriented API:
Matplotlib provides an object-oriented API for embedding plots
into applications built with general-purpose GUI toolkits such as
Tkinter, wxPython, Qt, or GTK.
Pyplot Module:
The pyplot module within Matplotlib offers a MATLAB-like
interface, simplifying the process of creating plots with a more
procedural approach.
Installation:
Matplotlib can be easily installed using pip, the Python package
installer, by running the following command in a terminal or
command prompt:
pip install matplotlib
Importing:
After installation, Matplotlib can be imported into a Python script
or interactive session, typically aliased as plt for convenience:
import matplotlib.pyplot as plt
Pyplot
Most of the Matplotlib utilities lie under
the pyplot submodule, which is usually imported
under the plt alias:
import matplotlib.pyplot as plt
Example:
Draw a line in a diagram from position (0,0) to
position (6,250):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()
Result: (plot of a straight line from (0, 0) to (6, 250))
Matplotlib Plotting
Plotting x and y points
The plot() function is used to draw points (markers) in a diagram.
By default, the plot() function draws a line from point to point.
The function takes parameters for specifying points in the diagram.
Parameter 1 is an array containing the points on the x-axis.
Parameter 2 is an array containing the points on the y-axis.
If we need to plot a line from (1, 3) to (8, 10), we have to pass two
arrays [1, 8] and [3, 10] to the plot function.
Example
Draw a line in a diagram from position (1, 3) to position (8, 10):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints)
plt.show()
Result: (plot of a straight line from (1, 3) to (8, 10))
Plotting Without Line
To plot only the markers, you can use shortcut string
notation parameter 'o', which means 'rings'.
Example
Draw two points in the diagram, one at position (1, 3) and one in
position (8, 10):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints, 'o')
plt.show()
Result: (plot of two round markers at (1, 3) and (8, 10), with no connecting line)
Markers
You can use the keyword argument marker to
emphasize each point with a specified marker:
Example:
Mark each point with a circle:
import matplotlib.pyplot as plt
import numpy as np
ypoints = np.array([3, 8, 1, 10])
plt.plot(ypoints, marker = 'o')
plt.show()
Result: (line plot through (0, 3), (1, 8), (2, 1) and (3, 10), with a circle at each point)
Matplotlib Labels and Title
Create Labels for a Plot
With Pyplot, you can use the xlabel() and ylabel() functions to set
a label for the x- and y-axis.
Example:
Add labels to the x- and y-axis:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.plot(x, y)
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.show()
Result: (line plot with the x-axis labelled "Average Pulse" and the y-axis labelled "Calorie Burnage")
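A title can be added in the same way with the title() function. The sketch below reuses the data above; the title text and the output file name pulse_vs_calories.png are illustrative assumptions, and savefig() is used here so the example also works where no plot window can be opened (plt.show() would display the figure instead).

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])

plt.plot(x, y)
plt.title("Pulse vs. Calorie Burnage")  # illustrative title text
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")

# Write the figure to an image file (hypothetical file name).
plt.savefig("pulse_vs_calories.png")
```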
Data Preprocessing: Data cleaning and transformation,
Handling missing values and outliers, Feature scaling and
normalization.
1. Data Cleaning and Transformation
This step ensures that raw data is converted into a
consistent format suitable for analysis.
Key tasks:
Removing duplicates: Detect and delete duplicate
rows.
Correcting data types: Ensure each feature has the
appropriate type (e.g., integer, float, string, datetime).
Standardizing formats: Unify inconsistent data
entries (e.g., 'yes', 'Yes', 'Y' → 'Yes').
Encoding categorical variables:
o Label Encoding: Assigns each category a unique
number (used for ordinal data).
o One-Hot Encoding: Creates binary columns for
each category (used for nominal data).
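The cleaning tasks above can be sketched with pandas. Everything in the example is assumed for illustration: the raw table, its column names, and the ordering chosen for the size categories.

```python
import pandas as pd

# Made-up raw data: row 2 duplicates row 0, ages are stored as text,
# and "subscribed" uses inconsistent spellings.
raw = pd.DataFrame({
    "age": ["25", "32", "25", "41"],
    "subscribed": ["yes", "Y", "yes", "No"],
    "size": ["small", "large", "small", "medium"],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
})

# Removing duplicates.
clean = raw.drop_duplicates().copy()

# Correcting data types: text -> integer.
clean["age"] = clean["age"].astype(int)

# Standardizing formats: map every spelling to 'Yes' or 'No'.
spellings = {"yes": "Yes", "y": "Yes", "no": "No", "n": "No"}
clean["subscribed"] = clean["subscribed"].str.lower().map(spellings)

# Label encoding for ordinal data: the order small < medium < large is kept.
size_order = {"small": 0, "medium": 1, "large": 2}
clean["size_code"] = clean["size"].map(size_order)

# One-hot encoding for nominal data: one binary column per city.
clean = pd.get_dummies(clean, columns=["city"])

print(clean)
```

Label encoding suits size because its categories have a natural order; city has none, so one-hot encoding avoids implying that one city is "greater" than another.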
2. Handling Missing Values
Causes:
Data entry errors
Sensor failure
Data corruption
Techniques:
Removal:
o Drop rows or columns with a high percentage of
missing values.
Imputation:
o Mean/Median/Mode Imputation: For numerical
or categorical data.
o Forward/Backward Fill (ffill/bfill): Use nearby
values (common in time-series).
o Model-Based Imputation: Use algorithms (e.g.,
KNN, regression) to predict missing values.
o Indicator Variable: Create an additional column
indicating whether a value was missing.
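Several of these techniques can be sketched with pandas on a small made-up table (model-based imputation is left out because it needs an additional library). The column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing number and one missing category.
df = pd.DataFrame({
    "temperature": [20.0, np.nan, 24.0, 26.0],
    "city": ["Pune", "Delhi", None, "Pune"],
})

# Indicator variable: record where the value was missing before filling it.
df["temperature_missing"] = df["temperature"].isna()

# Mean imputation for numerical data.
mean_filled = df["temperature"].fillna(df["temperature"].mean())

# Forward fill: reuse the previous valid value (common in time series).
forward_filled = df["temperature"].ffill()

# Mode imputation for categorical data.
mode_filled = df["city"].fillna(df["city"].mode()[0])

# Removal: drop every row that still contains a missing value.
dropped = df.dropna()

print(mean_filled.tolist())
print(forward_filled.tolist())  # [20.0, 20.0, 24.0, 26.0]
print(mode_filled.tolist())     # ['Pune', 'Delhi', 'Pune', 'Pune']
print(len(dropped))             # 2
```

Which technique is appropriate depends on why the values are missing; dropping rows is simplest but discards information.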
3. Handling Outliers
Outliers can distort model performance and statistical
summaries.
Detection Methods:
Statistical Methods:
o Z-score (standard deviations from the mean)
o IQR (Interquartile Range) method
Visual Methods:
o Box plots
o Scatter plots
Handling Techniques:
Remove outliers if they are clearly errors or
irrelevant.
Cap or Floor them (e.g., Winsorizing).
Transform data using log, square root, or Box-Cox to
reduce outlier impact.
Use robust models that can handle outliers (e.g., tree-
based algorithms).
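The IQR method and capping can be sketched with NumPy. The values are made up, and the 1.5 multiplier is the conventional choice rather than a fixed rule.

```python
import numpy as np

# Made-up measurements with one suspicious value (95).
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])

# IQR method: flag points far outside the middle 50% of the data.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Capping/flooring: pull extreme values back to the fences.
capped = np.clip(values, lower, upper)

# Z-score: number of standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()

print(is_outlier)    # only the last value is flagged
print(capped)        # 95.0 is replaced by the upper fence
print(z_scores[-1])  # roughly 2.2
```

In this small sample the z-score of 95 is only about 2.2, below the commonly used cut-off of 3, because the outlier itself inflates the mean and standard deviation. This is one reason the IQR method is often preferred for small or heavily skewed samples.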
4. Feature Scaling and Normalization
Many machine learning models (e.g., KNN, SVM, logistic
regression) are sensitive to feature scales.
Types:
Standardization (Z-score normalization):
o Transforms data to have mean = 0 and standard
deviation = 1
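o Formula: $z = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ the
standard deviation of the feature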
Min-Max Scaling (Normalization):
o Scales values to a [0, 1] range.
o Formula: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
Robust Scaling:
o Uses median and IQR, more robust to outliers.
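The three scaling methods can be written directly with NumPy, as in the following sketch. The feature values are made up and include one large value to show how each method reacts to it; the plain formulas are used here in place of a scaling library.

```python
import numpy as np

# Made-up feature with one very large value.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (x - x.mean()) / x.std()

# Min-max scaling: map the smallest value to 0 and the largest to 1.
min_max = (x - x.min()) / (x.max() - x.min())

# Robust scaling: subtract the median, divide by the interquartile range.
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

print(standardized.mean(), standardized.std())  # close to 0 and 1
print(min_max)
print(robust)  # [-1.  -0.5  0.   0.5 48.5]
```

With min-max scaling the large value squeezes the other four into the range 0 to about 0.03, while robust scaling keeps them evenly spread between -1 and 0.5, which illustrates why it is described as more robust to outliers.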
Summary Table:
Task            Techniques
Data Cleaning   Remove duplicates, fix data types, encode categories
Missing Values  Drop, mean/median/mode, ffill/bfill, model-based
Outliers        Z-score, IQR, cap/floor, transform
Scaling         StandardScaler, MinMaxScaler, RobustScaler