Affiliated to Dr. A. P. J. Abdul Kalam Technical University, Lucknow, Uttar Pradesh
ASSIGNMENT-1
PROGRAM: - MBA (BUSINESS ANALYTICS)
SEMESTER-3
ACADEMIC YEAR: - 2024-2025
SUBJECT: - MACHINE LEARNING USING PYTHON
SUBJECT CODE: - KMBA-352
SUBMITTED BY: - HARSH KUMAR
ROLL NO.: 2302022
SUBMITTED TO: - DR. SMITA AGARWAL (Professor)
INDEX
S.No. Module Signature
1. Write a Python program to load the iris data from
a given csv file into a data frame and print the
shape of the data, type of the data and first 3 rows.
a) Importing the required libraries
b) Loading the data into the data frame
c) Print the shape of the data
d) Print the datatype of "data"
e) Print the first 3 rows using head()
2. Write a Python program using Scikit-learn to
print the keys, number of rows and columns, feature
names, and the description of the Iris data
a) Print Keys of the data
b) Print the data type of each feature
c) Print the column names, total data, and data
types of the iris data set
d) Print the total number of data for each species.
e) Statistical Exploratory Data Analysis
f) Find the unique values of the species
3. Write a Python program to split the iris dataset
into its attributes (X) and labels (y).
a) Importing the required libraries
b) Loading the data into the data frame
c) Drop the columns that are not required
d) Split the Columns
4. Write a Python program to draw a scatterplot,
then add a joint density estimate to describe
individual distributions on the same plot between
Sepal length and Sepal width.
a) Importing the required libraries
b) Loading the data into data frame
c) Scatter Plot
d) Kernel Density Plots in a Joint Plot
e) Joint Plot with KDE
5. Write a Python program using Scikit-learn to split
the iris dataset into 70% train data and 30% test
data. Out of a total of 150 records, the training set
contains 120 records and the test contains 30
records. Print both datasets.
a) Importing the required libraries
b) Loading the data into the data frame
c) Drop the columns that are not required
d) Split the columns
e) Split arrays or matrices into random train and test
subsets
6. Implement and demonstrate any suitable
algorithm for finding the most specific hypothesis
based on a given set of training data samples.
Read the training data from a .CSV file.
a) Importing the required libraries
b) Loading the data into the data frame
c) Dropping the result column
d) Initializing the hypothesis array with the data in
the first row
e) Iterating the row and replacing the value
7. For a given set of training data examples stored in
a .CSV file, implement and demonstrate the
Candidate-Elimination algorithm to output a
description of the set of all hypotheses consistent
with the training examples.
a) Importing the required libraries
b) Loading the data into the data frame
c) Separate Input data(concept) and the
Output(target)
MODULE 1
Write a Python program to load the iris data from a given csv file into a data
frame and print the shape of the data, type of the data and first 3 rows.
Dataset Source
Dataset: Iris Dataset
Download from:
Github: https://gist.github.com/netj/8836201
Kaggle : https://www.kaggle.com/datasets/arshid/iris-flower-dataset
a) Importing the required libraries
Code:
b) Loading the data into the data frame
Code:
c) Print the shape of the data
Code:
Output:
d) Print the datatype of “data”
Code:
Output:
e) Print the first 3 rows using head()
Code:
Output:
f) Print the last 3 rows using tail()
Code:
Output:
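Steps a) to f) of this module can be sketched as follows. This is a minimal sketch rather than the submitted program: the inline CSV text is a five-row stand-in for the downloaded file, and the column names are assumed to match the GitHub gist (the Kaggle file may name them differently). With the real file, its path would be passed to pd.read_csv instead of csv_source.

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV file (assumption): the first five Iris rows,
# with column names assumed to follow the GitHub gist.
csv_source = io.StringIO(
    "sepal.length,sepal.width,petal.length,petal.width,variety\n"
    "5.1,3.5,1.4,0.2,Setosa\n"
    "4.9,3.0,1.4,0.2,Setosa\n"
    "4.7,3.2,1.3,0.2,Setosa\n"
    "4.6,3.1,1.5,0.2,Setosa\n"
    "5.0,3.6,1.4,0.2,Setosa\n"
)

# b) Load the data into a data frame
data = pd.read_csv(csv_source)

# c) Shape of the data as (rows, columns)
print(data.shape)

# d) Datatype of "data"
print(type(data))

# e) First 3 rows
print(data.head(3))

# f) Last 3 rows
print(data.tail(3))
```

With the full 150-row file the shape would be (150, 5) instead of the (5, 5) that this stand-in produces.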
MODULE 2
Write a Python program using Scikit-learn to print the keys, number of rows
and columns, feature names, and the description of the Iris data
a) Print keys of the data
Code:
Output:
b) Print the data type of each feature
Code:
Output:
c) Print the column names, total data, data types of the iris data set
Code:
Output:
d) Print the total number of data for each species
Code:
Output:
e) Statistical Exploratory Data Analysis
Code:
Output:
f) Find the unique values of the species
Code:
Output:
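Steps a) to f) of this module can be sketched as follows, assuming the copy of the Iris data that ships with Scikit-learn (load_iris) is used. The data frame name df and the added "species" column are illustrative choices, not taken from the original program.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# a) Keys of the Bunch object returned by load_iris
print(list(iris.keys()))

# Number of rows and columns, feature names, and (truncated) description
print(iris.data.shape)
print(iris.feature_names)
print(iris.DESCR[:300])

# Build a data frame so that pandas can be used for exploration.
# The "species" column name is an illustrative assumption.
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# b) Data type of each feature
print(df.dtypes)

# c) Column names, non-null counts and data types
df.info()

# d) Number of records for each species
print(df["species"].value_counts())

# e) Statistical exploratory data analysis
print(df.describe())

# f) Unique values of the species
print(df["species"].unique())
```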
MODULE 3
Write a Python program to split the iris dataset into its attributes (X) and
labels (y).
Dataset: Iris Dataset
Download from:
Github: https://gist.github.com/netj/8836201
Kaggle : https://www.kaggle.com/datasets/arshid/iris-flower-dataset
a) Importing the required libraries
Code:
b) Loading the data into the data frame
Code:
Output:
c) Drop the columns that are not required
Code:
Output:
d) Split the Columns
Code:
Output:
Code:
Output:
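The steps of this module can be sketched as follows. To keep the example self-contained, the data frame is built from Scikit-learn's bundled Iris data instead of the CSV file. The "Id" column is a hypothetical column that is not required, added only to illustrate step c), and the other column names are likewise assumptions.

```python
import pandas as pd
from sklearn.datasets import load_iris

# b) Build a data frame (stand-in for reading the CSV file)
iris = load_iris()
data = pd.DataFrame(
    iris.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
data["species"] = iris.target_names[iris.target]
data.insert(0, "Id", list(range(1, len(data) + 1)))  # hypothetical extra column

# c) Drop the columns that are not required
data = data.drop(columns=["Id"])

# d) Split the columns into attributes (X) and labels (y)
X = data.drop(columns=["species"])
y = data["species"]

print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
print(X.head(3))
print(y.head(3))
```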
MODULE 4
Write a Python program to draw a scatterplot, then add a joint density
estimate to describe individual distributions on the same plot between Sepal
length and Sepal width.
Theory: A joint plot shows the relationship between two variables together with the
distribution of each variable on its own (distribution plots). It consists of three separate
plots. The middle figure shows the relationship between x and y, so this area gives
information about the joint distribution, while the remaining two areas show the marginal
distributions for the x-axis and the y-axis.
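Syntax (as of seaborn 0.12; earlier releases also accepted stat_func and annot_kws, which
have since been removed):
seaborn.jointplot(data=None, *, x=None, y=None, hue=None, kind='scatter', height=6,
ratio=5, space=0.2, dropna=False, xlim=None, ylim=None, color=None, palette=None,
hue_order=None, hue_norm=None, marginal_ticks=False, joint_kws=None,
marginal_kws=None, **kwargs)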
Parameters
• x,y: These are variables which will specify the x-axis and y-axis.
• data: It is an input dataset.
• kind: It is the kind of plot to draw: 'scatter', 'kde', 'hist', 'hex', 'reg' or 'resid'.
• color: It is the parameter used to take a color for the plot elements.
• space: It denotes the space between a joint distribution and marginal distribution.
• xlim, ylim: It represents the limit of the x-axis and y-axis.
A joint plot consists of 3 separate plots:
• The central plot is a bivariate graph showing how the dependent variable (Y) varies
with the independent variable (X). It shows the relationship between the two variables
and the strength of that relationship.
• The second plot is placed horizontally at the top of the bivariate graph and shows the
univariate distribution of the variable on the x-axis (X).
• The third plot is placed vertically at the right margin and shows the univariate
distribution of the variable on the y-axis (Y). A univariate plot focuses on one variable,
describing and summarising it and showing any patterns in the data.
By default, the jointplot() function in the Seaborn library creates a scatter plot with two
histograms at the top and right margins of the graph.
When the parameter 'kind' is set to 'kde' in the above function, the joint plot displays a
bivariate density estimate on the main plot and univariate density curves on the margins.
Plot overlay
plot_joint(): This method of the JointGrid object returned by jointplot() draws on the joint
area (the central area where the two variables are plotted). By default, a jointplot creates a
scatter plot in the joint area, but .plot_joint() allows you to overlay a different type of plot,
such as a KDE plot, histogram, or regression line.
a) Importing the required libraries
Code:
b) Loading the data into the data frame
Code:
Output:
c) Scatter Plot
Code:
Output:
d) Kernel Density Plots in a Joint Plot
Code:
Output:
e) Joint Plot with KDE
Code:
Output:
MODULE 5
Write a Python program using Scikit-learn to split the iris dataset into 70%
train data and 30% test data. Out of a total of 150 records, the training set
contains 120 records and the test contains 30 records. Print both datasets.
Dataset: Iris Dataset
Download from:
Github: https://gist.github.com/netj/8836201
Kaggle : https://www.kaggle.com/datasets/arshid/iris-flower-dataset
a) Importing the required libraries
Code:
b) Loading the data into the data frame
Code:
Output:
c) Drop the columns that are not required
Code:
Output:
d) Split the columns
Code:
Output:
e) Split arrays or matrices into random train and test subsets
Code:
Output:
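The split in step e) can be sketched with train_test_split as follows, assuming Scikit-learn's bundled Iris data stands in for the CSV file. The random_state value is an arbitrary choice for reproducibility. Note that a 70%/30% split of 150 records gives 105 training and 45 test records; the 120/30 sizes mentioned in the task correspond to test_size=0.2, so both are shown.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Attributes (X) and labels (y)
X, y = load_iris(return_X_y=True)

# 70% train / 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)

# 80% train / 20% test, which yields the 120/30 record counts
X_train80, X_test20, y_train80, y_test20 = train_test_split(
    X, y, test_size=0.20, random_state=42
)
print(X_train80.shape, X_test20.shape)  # (120, 4) (30, 4)

# Print both datasets of the 70/30 split
print(X_train, y_train)
print(X_test, y_test)
```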
MODULE 6
Implement and demonstrate any suitable algorithm for finding the most
specific hypothesis based on a given set of training data samples. Read the
training data from a .CSV file.
Theory: Find-S Algorithm
Algorithm:
Initialize h to the most specific hypothesis in H
For each positive training instance x
For each attribute constraint a_i in h
• If the constraint a_i is satisfied by x
• Then do nothing
• Else replace a_i in h by the next more general constraint that is satisfied by x
Output hypothesis h
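The algorithm above can be sketched as follows. The training rows are an illustrative stand-in for the CSV file (an assumption, as are the column names), and find_s is a hypothetical helper name. Instead of assuming that the first row is positive, this sketch initializes the hypothesis from the first positive example it meets.

```python
import io

import pandas as pd

# Illustrative training data (assumption): the last column is the result.
csv_source = io.StringIO(
    "sky,air_temp,humidity,wind,water,forecast,enjoy_sport\n"
    "Sunny,Warm,Normal,Strong,Warm,Same,Yes\n"
    "Sunny,Warm,High,Strong,Warm,Same,Yes\n"
    "Rainy,Cold,High,Strong,Warm,Change,No\n"
    "Sunny,Warm,High,Strong,Cool,Change,Yes\n"
)
data = pd.read_csv(csv_source)

# Separate the attributes from the result column
attributes = data.drop(columns=["enjoy_sport"]).to_numpy()
target = data["enjoy_sport"].to_numpy()


def find_s(attributes, target, positive_label="Yes"):
    """Return the most specific hypothesis consistent with the positive rows."""
    hypothesis = None
    for row, label in zip(attributes, target):
        if label != positive_label:
            continue  # Find-S ignores negative examples
        if hypothesis is None:
            hypothesis = list(row)  # first positive example
            continue
        for i, value in enumerate(row):
            if hypothesis[i] != value:
                hypothesis[i] = "?"  # generalize the mismatching attribute
    return hypothesis


print(find_s(attributes, target))
# ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```

If the data contains no positive example, this sketch returns None, since nothing has forced the hypothesis away from the most specific one.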
a) Importing the required libraries
Code:
b) Loading the data into the data frame
Code:
Output:
c) Dropping the result column
Code:
Output:
d) Initializing the hypothesis array with the data in the first row
Code:
Output:
e) Iterating the row and replacing the value
Code:
Output:
Code:
Output:
Code:
Output:
Code:
Output:
Code:
Output:
MODULE 7
For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Candidate-Elimination algorithm to output a description of
the set of all hypotheses consistent with the training examples.
Theory: Terms used
• Concept learning: Concept learning is the learning task of the machine
(learning from training data)
• General Hypothesis: A hypothesis that does not specify any feature values.
• G = {'?', '?', '?', '?', …}: One '?' for each attribute
• Specific Hypothesis: A hypothesis that specifies feature values (specific features)
• S = {'pi', 'pi', 'pi', …}: The number of 'pi' entries depends on the number of attributes
• Version Space: It lies between the general hypothesis and the specific hypothesis. It
is not just one hypothesis but the set of all hypotheses consistent with the
training data set.
Algorithm
• Step1: Load the data set
• Step2: Initialize the General Hypothesis and the Specific Hypothesis.
• Step3: For each training example, apply Step4 or Step5
• Step4: If the example is a positive example, then for each attribute
- if attribute_value == hypothesis_value: do nothing
- else: replace the attribute value in the specific hypothesis with '?' (generalizing it)
• Step5: If the example is a negative example
Make the general hypothesis more specific so that it excludes the example
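The steps above can be sketched as follows. This is a simplified variant, not the full Candidate-Elimination algorithm: it keeps one specific hypothesis and one general candidate per attribute, and it assumes that the first training example is positive. The training rows, the column names and the function name candidate_elimination are illustrative assumptions.

```python
import io

import pandas as pd

# Illustrative training data (assumption): the last column is the target.
csv_source = io.StringIO(
    "sky,air_temp,humidity,wind,water,forecast,enjoy_sport\n"
    "Sunny,Warm,Normal,Strong,Warm,Same,Yes\n"
    "Sunny,Warm,High,Strong,Warm,Same,Yes\n"
    "Rainy,Cold,High,Strong,Warm,Change,No\n"
    "Sunny,Warm,High,Strong,Cool,Change,Yes\n"
)
data = pd.read_csv(csv_source)

# Separate the input data (concepts) from the output (target)
concepts = data.iloc[:, :-1].to_numpy()
target = data.iloc[:, -1].to_numpy()


def candidate_elimination(concepts, target, positive_label="Yes"):
    """Simplified version-space learner; assumes the first row is positive."""
    n = concepts.shape[1]
    specific = list(concepts[0])
    # One general candidate per attribute, each starting fully general
    general = [["?"] * n for _ in range(n)]

    for row, label in zip(concepts, target):
        if label == positive_label:
            # Generalize the specific hypothesis where it disagrees
            for i in range(n):
                if row[i] != specific[i]:
                    specific[i] = "?"
                    general[i][i] = "?"
        else:
            # Specialize the general candidates to exclude the negative row
            for i in range(n):
                if row[i] != specific[i]:
                    general[i][i] = specific[i]
                else:
                    general[i][i] = "?"

    # Drop candidates that are still fully general
    general = [g for g in general if any(v != "?" for v in g)]
    return specific, general


specific_h, general_h = candidate_elimination(concepts, target)
print(specific_h)  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
print(general_h)
# [['Sunny', '?', '?', '?', '?', '?'], ['?', 'Warm', '?', '?', '?', '?']]
```

The version space for this data consists of the hypotheses lying between specific_h and the candidates in general_h.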
a) Importing the required libraries
Code:
b) Loading the data into the data frame
Code:
Output:
c) Separate Input data (concept) and the Output(target)
Code:
Output:
d) Candidate Elimination Algorithm
Code:
Output: