Data Science Internship Report
(Virtual)
PROGRAM BOOK FOR
SHORT-TERM INTERNSHIP
(Virtual)
SHORT-TERM INTERNSHIP
BACHELOR OF TECHNOLOGY
IN
by
Mr. S. Ashok
Assistant Professor(C)
2021-2025
STUDENT'S DECLARATION
Mr. S. Ashok,
Assistant Professor(C), Dept. of CSE
JNTU-GV, CEV
(Signature of Student)
CERTIFICATE
June
ACKNOWLEDGEMENT
It is our privilege to acknowledge, with a deep sense of gratitude, the keen personal
interest and invaluable guidance rendered by our internship guide Mr. S. Ashok,
Assistant Professor, Department of Computer Science and Engineering, JNTU-GV
College of Engineering, Vizianagaram.
We express our gratitude to the CEO, Pavan Chalamalasetti, and to our guide at
Datavalley.Ai, whose mentorship during the internship period added immense value
to our learning experience. Their guidance and insights played a crucial role in our
professional development.
Our respects and regards to Dr. P. Aruna Kumari, HOD, Department of Computer
Science and Technology, JNTU-GV College of Engineering, Vizianagaram, for her
invaluable suggestions that helped us in the successful completion of the project.
Finally, we also thank all the faculty of the Dept. of CSE, JNTU-GV, our friends, and
our family members who, with their valuable suggestions and support, directly or
indirectly helped us in completing this project work.
Modules Covered
1. Python Programming
2. Python Libraries for Data Science
3. SQL for Data Science
4. Mathematics for Data Science
5. Machine Learning
6. Introduction to Deep Learning - Neural Networks
For the project, we applied ensemble learning techniques to predict the sales of products at
Big Mart outlets. The project involved data cleaning, feature engineering, and model building
using algorithms such as Random Forest, Gradient Boosting, and XGBoost. The final model
aimed to improve the accuracy of sales predictions, providing valuable insights for inventory
management and sales strategies.
Overall, this internship experience was beneficial in developing our skills in data science,
including programming, data analysis, and machine learning. It also provided an opportunity
to gain experience working on a real-world project, collaborating with a team to develop a
complex predictive model.
Authorized signatory
Self-Assessment
In this Data Science internship, we embarked on a comprehensive learning journey through
various data science modules and culminated our experience with the project titled "Big Mart
Sales Prediction Using Ensemble Learning."
For the project, we applied ensemble learning techniques to predict sales for Big Mart outlets.
We utilized Python programming and various data science libraries to clean, manipulate, and
analyze the data. The project involved feature engineering, model training, and evaluation
using ensemble methods such as Random Forest, Gradient Boosting, and XGBoost.
Throughout this internship, we gained hands-on experience with key data science tools and
techniques, enhancing our skills in data analysis, statistical modeling, and machine learning.
The practical application of theoretical knowledge in a real-world project was immensely
valuable.
We are very satisfied with the work we have done, as it has provided us with extensive
knowledge and practical experience. This internship was highly beneficial, allowing us to
enrich our skills in data science and preparing us for future professional endeavors. We are
confident that the knowledge and skills acquired during this internship will be of great use in
our personal and professional growth.
Data Science is an interdisciplinary field that leverages scientific methods, algorithms, and
systems to extract knowledge and insights from structured and unstructured data. It integrates
various domains including mathematics, statistics, computer science, and domain expertise to
analyze data and make data-driven decisions.
Data Science involves the study of data through statistical and computational techniques to
uncover patterns, make predictions, and gain valuable insights. It encompasses data
cleansing, data preparation, analysis, and visualization, aiming to solve complex problems
and inform business strategies.
AI (ARTIFICIAL INTELLIGENCE): AI refers to the ability of machines to
perform tasks that typically require human intelligence, such as understanding natural
language, recognizing patterns in data, and making decisions. It encompasses a
broader scope of technologies and techniques aimed at simulating human intelligence.
DATA SCIENCE: Data Science focuses on extracting insights and knowledge from
data through statistical and computational methods. It involves cleaning, organizing,
analyzing, and visualizing data to uncover patterns and trends, often utilizing AI
techniques such as machine learning and deep learning to build predictive models and
make data-driven decisions.
Data Science is evolving rapidly with advancements in technology and increasing volumes of
data generated daily. Key trends include the rise of deep learning techniques for complex data
analysis, automation of machine learning workflows to accelerate model development and
deployment, and growing concerns around ethical considerations such as bias in AI models
and data privacy regulations.
1. INTRODUCTION TO PYTHON
Example:
DOMAIN USAGE
Web Development: Django and Flask are popular frameworks for building web
applications.
Data Science: NumPy, Pandas, Matplotlib facilitate data manipulation, analysis, and
visualization.
AI/ML: TensorFlow, PyTorch, scikit-learn are used for developing AI models and
machine learning algorithms.
Automation and Scripting: Python's simplicity and extensive libraries make it ideal
for automating tasks and writing scripts.
Python's syntax is designed to be clean and easy to learn, using indentation to define code
structure. Variables in Python are dynamically typed, meaning their type is inferred from the
value assigned. This makes Python flexible and reduces the amount of code needed for
simple tasks.
Detailed Explanation:
Python's syntax:
Uses indentation (whitespace) to define code blocks, unlike languages that use curly
braces {}.
Encourages clean and readable code by enforcing consistent indentation practices.
Variables in Python:
Dynamically typed: You don't need to declare the type of a variable explicitly.
Types include integers, floats, strings, lists, tuples, sets, dictionaries, etc.
Example:
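A minimal illustrative snippet (the variable names and values are our own) showing dynamic typing:

# The same variable name can be bound to values of different types
count = 10          # int
price = 99.5        # float
name = "Big Mart"   # str
items = [1, 2, 3]   # list
print(type(count), type(price), type(name), type(items))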
Control flow statements in Python determine the order in which statements are executed
based on conditions or loops. Python provides several control flow constructs:
Detailed Explanation:
1. Conditional Statements (if, elif, else): Execute a block of code only when a
condition evaluates to true, with optional alternative branches.
Example:
Output:
2. Loops (for and while):
for loop: Iterates over a sequence (e.g., list, tuple) or an iterable object.
while loop: Executes a block of code as long as a condition is true.
Example:
Output:
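A short illustrative sketch of both loop types (the values are our own); the printed output is shown as comments:

fruits = ["apple", "banana", "cherry"]
for fruit in fruits:          # for loop: iterate over a sequence
    print(fruit)              # -> apple, banana, cherry (one per line)

count = 0
while count < 3:              # while loop: repeat while the condition is true
    print("count =", count)   # -> count = 0, count = 1, count = 2
    count += 1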
Example Explanation:
4. FUNCTIONS
Functions in Python are blocks of reusable code that perform a specific task. They help in
organizing code into manageable parts, promoting code reusability and modularity.
Detailed Explanation:
1. Function Definition:
Functions are defined using the def keyword, followed by the function name and
parameters in parentheses.
Example:
2. Function Call:
Example:
3. Parameters and Arguments:
Functions can accept parameters (inputs) that are specified when the function is
called.
Parameters can have default values, making them optional.
Example:
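A small sketch of a function with a default parameter (the function name and values are illustrative):

def greet(name, greeting="Hello"):
    """Return a greeting message for the given name."""
    return f"{greeting}, {name}!"

print(greet("Alice"))               # uses the default greeting -> Hello, Alice!
print(greet("Bob", greeting="Hi"))  # overrides the default     -> Hi, Bob!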
Example Explanation:
Function Definition: Functions are defined using def followed by the function
name and parameters in parentheses. The docstring (optional) provides a
description of what the function does.
Function Call: Functions are called by their name followed by parentheses
containing arguments (if any) that are passed to the function.
Parameters and Arguments: Functions can have parameters with default values,
allowing flexibility in function calls. Parameters are variables that hold the
arguments passed to the function.
5. DATA STRUCTURES
Python provides several built-in data structures that allow you to store and organize data
efficiently. These include lists, tuples, sets, and dictionaries.
Detailed Explanation:
1. Lists:
Example:
2. Tuples:
Example:
3. Sets:
Example:
4. Dictionaries:
Example:
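A combined sketch of the four built-in data structures (the example values are our own):

numbers = [1, 2, 3]                 # list: ordered and mutable
numbers.append(4)

point = (10, 20)                    # tuple: ordered and immutable

unique_ids = {1, 2, 2, 3}           # set: unordered, duplicates removed -> {1, 2, 3}

prices = {"apple": 30, "milk": 55}  # dictionary: key-value pairs
prices["bread"] = 40                # add or update a value by key

print(numbers, point, unique_ids, prices)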
Example Explanation:
Lists: Used for storing ordered collections of items that can be changed or updated.
Tuples: Similar to lists but immutable, used when data should not change.
Sets: Used for storing unique items where order is not important.
Dictionaries: Used for storing key-value pairs, allowing efficient lookup and
modification based on keys.
Detailed Explanation:
1. Opening and Closing Files:
Files are opened using the open() function, which returns a file object.
Use the close() method to close the file once operations are done.
Example:
2. Reading from Files:
Use methods like read(), readline(), or readlines() to read content from files.
Handle file paths and exceptions using appropriate error handling.
Example:
3. Writing to Files:
Use write() or writelines() to write data into files.
Example:
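A minimal sketch of writing to and reading from a text file (the file name is illustrative); using a with block closes the file automatically:

# Writing to a file
with open("sales_notes.txt", "w") as f:
    f.write("Line 1\n")
    f.writelines(["Line 2\n", "Line 3\n"])

# Reading from the same file
with open("sales_notes.txt", "r") as f:
    first = f.readline()   # read a single line
    rest = f.read()        # read the remaining content

print(first, rest)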
Example Explanation:
Opening and Closing Files: Files are opened using open() and closed using close() to
release resources.
Reading from Files: Methods like read(), readline(), and readlines() allow reading
content from files, handling file operations efficiently.
Writing to Files: Use write() or writelines() to write data into files, managing file
contents as needed.
Errors and exceptions are a natural part of programming. Python provides mechanisms to
handle errors gracefully, preventing abrupt termination of programs.
Detailed Explanation:
1. Types of Errors:
o Syntax Errors: Occur when the code violates the syntax rules of Python.
These are detected when the code is parsed, before it runs.
o Exceptions: Occur during the execution of a program and can be handled
using exception handling.
2. Exception Handling:
Example:
3. Raising Exceptions:
Example:
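An illustrative sketch of try/except/else/finally together with raise (the validation rule is our own):

def divide(a, b):
    if b == 0:
        raise ValueError("b must not be zero")   # raising an exception explicitly
    return a / b

try:
    result = divide(10, 2)
except ValueError as err:        # catches the specific exception
    print("Error:", err)
else:                            # runs only if no exception was raised
    print("Result:", result)
finally:                         # always runs, useful for cleanup
    print("Done")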
Example Explanation:
Types of Errors: Syntax errors are caught before the program runs, while exceptions
occur during runtime.
Exception Handling: try block attempts to execute code that may raise exceptions,
except block catches specific exceptions, else block executes if no exceptions occur,
and finally block ensures cleanup code runs regardless of exceptions.
Raising Exceptions: Use raise to trigger exceptions programmatically based on
specific conditions.
Detailed Explanation:
1. Classes and Objects:
Class: Blueprint for creating objects. Defines attributes (data) and methods
(functions) that belong to the class.
Object: Instance of a class. Represents a specific entity based on the class blueprint.
Example:
2. Encapsulation:
Bundling of data (attributes) and methods that operate on the data into a single unit
(class).
Access to data is restricted to methods of the class, promoting data security and
integrity.
3. Inheritance:
Ability to create a new class (derived class or subclass) from an existing class (base
class or superclass).
Inherited class (subclass) inherits attributes and methods of the base class and can
override or extend them.
Example:
4. Polymorphism:
Ability to use the same interface (method name) with objects of different classes or
data types, typically through method overriding or overloading.
Example:
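A compact sketch combining classes, inheritance, and polymorphism (the class names are illustrative):

class Animal:
    def __init__(self, name):
        self.name = name            # attribute (encapsulated state)

    def speak(self):
        return f"{self.name} makes a sound"

class Dog(Animal):                  # inheritance: Dog extends Animal
    def speak(self):                # polymorphism: method overriding
        return f"{self.name} barks"

for animal in [Animal("Generic"), Dog("Rex")]:
    print(animal.speak())           # same interface, different behaviour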
Example Explanation:
Classes and Objects: Classes define the structure and behavior of objects, while
objects are instances of classes with specific attributes and methods.
Encapsulation: Keeps the internal state of an object private, controlling access
through methods.
Inheritance: Allows a new class to inherit attributes and methods from an existing
class, facilitating code reuse and extension.
Polymorphism: Enables flexibility by using the same interface (method name) for
different data types or classes, allowing for method overriding and overloading.
1. NUMPY
Detailed Explanation:
Arrays in NumPy:
o NumPy's main object is the homogeneous multidimensional array (ndarray),
which is a table of elements (usually numbers), all of the same type, indexed
by a tuple of non-negative integers.
o Arrays are created using np.array() and can be manipulated for various
mathematical operations.
Example:
NumPy Operations:
o NumPy provides a wide range of mathematical functions such as np.sum(),
np.mean(), np.max(), np.min(), etc., which operate element-wise on arrays or
perform aggregations across axes.
Example:
Broadcasting:
o Broadcasting is a powerful mechanism that allows NumPy to work with arrays
of different shapes when performing arithmetic operations.
Example:
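A short sketch of array creation, aggregation, and broadcasting (the values are our own), assuming NumPy is installed:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # 2-D ndarray

print(a.sum())            # 21 (aggregation over all elements)
print(a.mean(axis=0))     # [2.5 3.5 4.5] (column-wise mean)

b = np.array([10, 20, 30])
print(a + b)              # broadcasting: b is stretched across both rows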
Example Explanation:
2. PANDAS
Pandas is a powerful library for data manipulation and analysis in Python. It provides data
structures and operations for manipulating numerical tables and time series data.
Detailed Explanation:
o DataFrame: Represents a tabular data structure with labeled axes (rows and
columns). It is similar to a spreadsheet or SQL table.
o Series: Represents a one-dimensional labeled array capable of holding data of
any type (integer, float, string, etc.).
Example:
Basic Operations:
o Indexing and Selection: Use loc[] and iloc[] for label-based and integer-based
indexing respectively.
o Filtering: Use boolean indexing to filter rows based on conditions.
o Operations: Apply operations and functions across rows or columns.
Example:
Data Manipulation:
o Adding and Removing Columns: Use assignment (df['New_Column'] = ...)
or drop() method.
o Handling Missing Data: Use dropna() to drop NaN values or fillna() to fill
NaN values with specified values.
Example:
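A minimal DataFrame sketch covering selection, filtering, adding a column, and handling missing values (the column names are illustrative), assuming pandas is installed:

import pandas as pd

df = pd.DataFrame({
    "Item": ["Soap", "Milk", "Bread"],
    "Sales": [120.0, None, 90.0],
})

print(df.loc[0, "Item"])               # label-based selection
print(df.iloc[0, 1])                   # integer-based selection
print(df[df["Sales"] > 100])           # boolean filtering

df["Discounted"] = df["Sales"] * 0.9   # add a new column
df["Sales"] = df["Sales"].fillna(0)    # fill missing values
print(df)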
Example Explanation:
DataFrame and Series: Pandas DataFrame is used for tabular data, while Series is
used for one-dimensional labeled data.
Basic Operations: Perform indexing, selection, filtering, and operations on
Pandas objects to manipulate and analyze data.
Detailed Explanation:
1. Matplotlib:
o Basic Plotting: Create line plots, scatter plots, bar plots, histograms, etc.,
using plt.plot(), plt.scatter(), plt.bar(), plt.hist(), etc.
o Customization: Customize plots with labels, titles, legends, colors, markers,
and other aesthetic elements.
o Subplots: Create multiple plots within the same figure using plt.subplots().
Example:
2. Seaborn:
o Statistical Plots: Provides a high-level interface built on top of Matplotlib for
drawing attractive statistical graphics quickly and easily.
Example:
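A brief plotting sketch (the data is synthetic), assuming matplotlib and seaborn are installed:

import matplotlib.pyplot as plt
import seaborn as sns

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].plot(x, y, marker="o", color="teal")   # Matplotlib line plot
axes[0].set_title("Line plot")

sns.histplot(y, ax=axes[1])                    # Seaborn statistical plot
axes[1].set_title("Histogram")

plt.tight_layout()
plt.show()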
Example Explanation:
Matplotlib: Create various types of plots and customize them using Matplotlib's
extensive API for visualization.
Seaborn: Build complex statistical plots quickly and easily, leveraging Seaborn's
high-level interface and aesthetic improvements.
SQL (Structured Query Language) is a standard language for managing and manipulating
relational databases. It is essential for data scientists to retrieve, manipulate, and analyze data
stored in databases.
Detailed Explanation:
Example:
Example:
Example:
Example:
3. Querying Data:
Use SELECT statements with conditions (WHERE), sorting (ORDER BY), grouping
(GROUP BY), and aggregating functions (COUNT, SUM, AVG) to retrieve specific
data subsets.
Example:
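A small sketch of these clauses run through Python's built-in sqlite3 module (the sales table and its columns are illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, city TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Soap", "Vizag", 120), ("Milk", "Vizag", 80), ("Soap", "Delhi", 150),
])

query = """
SELECT city, COUNT(*) AS orders, SUM(amount) AS total
FROM sales
WHERE amount > 50
GROUP BY city
ORDER BY total DESC
"""
for row in conn.execute(query):
    print(row)          # e.g. ('Vizag', 2, 200.0) then ('Delhi', 1, 150.0)
conn.close()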
SQL joins are used to combine rows from two or more tables based on a related column
between them. There are different types of joins:
INNER JOIN:
o Returns rows when there is a match in both tables based on the join condition.
Example:
LEFT JOIN:
o Returns all rows from the left table and the matched rows from the right table;
returns NULL where there is no match.
Example:
RIGHT JOIN:
o Returns all rows from the right table and the matched rows from the left table;
returns NULL where there is no match.
Example:
FULL OUTER JOIN:
o Returns all rows when there is a match in either table; returns NULL where
there is no match.
Example:
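A sketch of INNER and LEFT joins on the customers/orders tables described here, using sqlite3; RIGHT and FULL OUTER joins follow the same pattern on databases that support them:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO orders VALUES (101, 1, 250.0), (102, 3, 90.0);
""")

inner = """SELECT o.order_id, c.name, o.amount
           FROM orders o INNER JOIN customers c ON o.customer_id = c.customer_id"""
left = """SELECT o.order_id, c.name, o.amount
          FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id"""

print(conn.execute(inner).fetchall())  # only matching rows
print(conn.execute(left).fetchall())   # all orders; name is None where no match
conn.close()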
Example Explanation:
INNER JOIN: Returns rows where there is a match in both tables based on the join
condition (customer_id).
LEFT JOIN: Returns all rows from the left table (orders) and the matched rows from
the right table (customers). Returns NULL if there is no match.
RIGHT JOIN: Returns all rows from the right table (customers) and the matched
rows from the left table (orders). Returns NULL if there is no match.
FULL OUTER JOIN: Returns all rows when there is a match in either table (orders
or customers). Returns NULL if there is no match.
1. MATHEMATICAL FOUNDATIONS
Mathematics forms the backbone of data science, providing essential tools and concepts for
understanding and analyzing data.
Detailed Explanation:
1. Linear Algebra:
Example:
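A small NumPy sketch of the vector and matrix operations referred to here (the values are arbitrary):

import numpy as np

v = np.array([1, 2, 3])
w = np.array([4, 5, 6])
print(np.dot(v, w))          # dot product -> 32

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A @ B)                 # matrix multiplication
print(np.linalg.inv(A))      # matrix inverse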
2. Calculus:
Example Explanation:
Linear Algebra: Essential for handling large datasets with operations on vectors
and matrices.
Calculus: Provides tools for analyzing and modeling continuous changes and
cumulative effects in data.
Probability and statistics are fundamental in data science for analyzing and interpreting data,
making predictions, and drawing conclusions.
Detailed Explanation:
1. Probability Basics:
Example:
2. Descriptive Statistics:
Descriptive statistics are used to summarize and describe the basic features of data. They
provide insights into the central tendency, dispersion, and shape of a dataset.
Detailed Explanation:
1. Measures of Central Tendency:
o Mean: Also known as average, it is the sum of all values divided by the
number of values.
o Median: The middle value in a sorted, ascending or descending, list of
numbers.
o Mode: The value that appears most frequently in a dataset.
Example:
2. Measures of Dispersion:
Variance: Measures how far each number in the dataset is from the mean.
Standard Deviation: Square root of the variance; it indicates the amount of
variation or dispersion of a set of values.
Range: The difference between the maximum and minimum values in the
dataset.
Example:
3. Skewness and Kurtosis:
Skewness: Measures the asymmetry of the data distribution around its mean.
Kurtosis: Measures the heaviness of the tails of the distribution.
Example:
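A sketch computing the measures above on a small sample (the values are our own), assuming NumPy and SciPy are available:

import numpy as np
from scipy import stats
from statistics import mode

data = [12, 15, 15, 18, 20, 22, 35]

print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Mode:", mode(data))
print("Variance:", np.var(data))
print("Std deviation:", np.std(data))
print("Range:", max(data) - min(data))
print("Skewness:", stats.skew(data))
print("Kurtosis:", stats.kurtosis(data))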
Example Explanation:
Measures of Central Tendency: Provide insights into the typical value of the
dataset (mean, median) and the most frequently occurring value (mode).
Measures of Dispersion: Indicate the spread or variability of the dataset
(variance, standard deviation, range).
Skewness and Kurtosis: Describe the shape of the dataset distribution, whether it
is symmetric or skewed, and its tail characteristics.
3. PROBABILITY DISTRIBUTIONS
Probability distributions are mathematical functions that describe the likelihood of different
outcomes in an experiment. They play a crucial role in data science for modeling and
analyzing data.
Detailed Explanation:
1. Normal Distribution:
Example:
2. Binomial Distribution:
Example:
3. Poisson Distribution:
Example:
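An illustrative sketch of evaluating and sampling the three distributions with scipy.stats (the parameters are arbitrary):

from scipy import stats

# Normal distribution: mean 0, standard deviation 1
print(stats.norm.pdf(0, loc=0, scale=1))      # density at x = 0

# Binomial distribution: 10 trials, success probability 0.5
print(stats.binom.pmf(3, n=10, p=0.5))        # P(exactly 3 successes)

# Poisson distribution: average rate of 4 events per interval
print(stats.poisson.pmf(2, mu=4))             # P(exactly 2 events)

samples = stats.norm.rvs(loc=0, scale=1, size=5)   # random samples
print(samples)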
Example Explanation:
Machine Learning (ML) is a branch of artificial intelligence (AI) that empowers computers to
learn from data and improve their performance over time without explicit programming. It
focuses on developing algorithms that can analyze and interpret patterns in data to make
predictions or decisions.
Detailed Explanation:
Supervised learning involves training a model on labeled data, where each data point is
paired with a corresponding target variable (label). The goal is to learn a mapping from input
variables (features) to the output variable (target) based on the input-output pairs provided
during training.
Classification Algorithms:
1. Logistic Regression
Definition: Despite its name, logistic regression is a linear model for binary
classification that uses a logistic function to estimate probabilities.
Key Concepts:
o Logistic Function: Sigmoid function that maps input values to probabilities
between 0 and 1.
o Decision Boundary: Threshold that separates the classes based on predicted
probabilities.
2. Decision Trees
Definition: Non-linear model that uses a tree structure to make decisions by splitting
the data into nodes based on feature values.
Key Concepts:
o Nodes and Branches: Represent conditions and possible outcomes in the
decision-making process.
o Entropy and Information Gain: Measures used to determine the best split at
each node.
Example:
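A minimal decision-tree classification sketch on scikit-learn's Iris dataset (chosen here only for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, tree.predict(X_test)))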
3. Random Forest
Definition: Ensemble learning method that constructs multiple decision trees during
training and outputs the mode of the classes (classification) or mean prediction
(regression) of the individual trees.
Key Concepts:
o Bagging: Technique that combines multiple models to improve performance
and reduce overfitting.
o Feature Importance: Measures the contribution of each feature to the model's
predictions.
Example:
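A comparable Random Forest sketch (same illustrative dataset), showing bagging of many trees and feature importances:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
print("Feature importances:", forest.feature_importances_)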
Support Vector Machines (SVM) are robust supervised learning models used for
classification and regression tasks. They excel in scenarios where the data is not linearly
separable by transforming the input space into a higher dimension.
Detailed Explanation:
1. Hyperplane and Support Vectors
o SVMs find the optimal hyperplane that maximizes the margin between classes;
the data points closest to the hyperplane (support vectors) determine its position.
2. Types of SVM
o C-Support Vector Classification (SVC): SVM for classification tasks,
maximizing the margin between classes.
o Nu-Support Vector Classification (NuSVC): Similar to SVC but allows
control over the number of support vectors and training errors.
o Support Vector Regression (SVR): SVM for regression tasks, fitting a
hyperplane within a margin of tolerance.
3. Advantages of SVM
4. Applications of SVM
Hyperplane and Support Vectors: SVMs find the optimal hyperplane that maximizes
the margin between classes, with support vectors influencing its position.
Applications: SVMs are applied in diverse fields for classification tasks requiring
robust performance and flexibility in handling complex data patterns.
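A short SVC classification sketch with an RBF kernel (the dataset and parameter choices are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling matters for SVMs, so the classifier is wrapped in a pipeline
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))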
5. Decision Trees
Decision Trees are versatile supervised learning models used for both classification and
regression tasks. They create a tree-like structure where each internal node represents a
"decision" based on a feature, leading to leaf nodes that represent the predicted outcome.
Detailed Explanation:
Regression Analysis
1. Linear Regression
Detailed Explanation:
Linear Model: Represents the relationship between the input features X and the
target variable y using a linear equation of the form y = β0 + β1x1 + ... + βnxn.
Coefficients: Slope coefficients β that represent the impact of each feature on
the target variable.
Intercept: Constant term β0 that shifts the regression line.
Simple Linear Regression: Predicts a target variable using a single input feature.
Multiple Linear Regression: Predicts a target variable using multiple input features.
1. Linearity: Assumes a linear relationship between predictors and the target variable.
2. Independence of Errors: Residuals (errors) should be independent of each other.
3. Homoscedasticity: Residuals should have constant variance across all levels of
predictors.
Example (Simple Linear Regression):
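A minimal simple-linear-regression sketch with synthetic data (the relationship y ≈ 3x + 4 is our own choice):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))          # single input feature
y = 3 * X.ravel() + 4 + rng.normal(0, 1, 100)  # linear trend plus noise

model = LinearRegression().fit(X, y)
print("Coefficient (slope):", model.coef_[0])  # close to 3
print("Intercept:", model.intercept_)          # close to 4
print("Prediction for x = 5:", model.predict([[5]])[0])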
2. Naive Bayes
Naive Bayes is a probabilistic supervised learning algorithm based on Bayes' theorem, with
an assumption of independence between features. It is commonly used for classification tasks
and is known for its simplicity and efficiency, especially with high-dimensional data.
Detailed Explanation:
o Gaussian Naive Bayes: Assumes that continuous features follow a Gaussian
distribution.
o Multinomial Naive Bayes: Suitable for discrete features (e.g., word counts in
text classification).
o Bernoulli Naive Bayes: Assumes binary or boolean features (e.g., presence or
absence of a feature).
Efficiency: Fast training and prediction times, especially with large datasets.
Simplicity: Easy to implement and interpret, making it suitable for baseline
classification tasks.
Scalability: Handles high-dimensional data well, such as text classification.
3. Support Vector Machines (SVM) for Regression
Support Vector Machines (SVM) are versatile supervised learning models that can be used
for both classification and regression tasks. In regression, SVM aims to find a hyperplane that
best fits the data, while maximizing the margin from the closest points (support vectors).
Detailed Explanation:
2. Mathematical Formulation
o SVM for regression predicts the target variable y for an instance X using a
linear function.
Example Explanation:
Kernel Trick: SVM uses kernel functions to transform the input space into a
higher-dimensional space where data points can be linearly separated.
Loss Function: SVM minimizes the error between predicted and actual values
while maximizing the margin around the hyperplane.
Random Forest is an ensemble learning method that constructs multiple decision trees during
training and outputs the average prediction of the individual trees for regression tasks.
Detailed Explanation:
o Prediction: For regression, Random Forest averages the predictions of all
trees to obtain the final output.
Example Explanation:
Gradient Boosting is an ensemble learning technique that combines multiple weak learners
(typically decision trees) sequentially to make predictions for regression tasks.
Detailed Explanation:
3. Advantages of Gradient Boosting for Regression
Example Explanation:
Unsupervised learning algorithms are used when we only have input data (X) and no
corresponding output variables. The algorithms learn to find the inherent structure in the data,
such as grouping or clustering similar data points together.
Detailed Explanation:
Dimensionality Reduction Techniques
Detailed Explanation:
2. PCA Algorithm
o Step-by-Step Process:
Standardize the data (mean centering and scaling).
Compute the covariance matrix of the standardized data.
Calculate the eigenvectors and eigenvalues of the covariance matrix.
Select the top k eigenvectors (principal components) that explain the
most variance.
Project the original data onto the selected principal components to
obtain the reduced-dimensional representation.
Example (PCA):
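A brief PCA sketch following the steps above, using scikit-learn on an illustrative dataset:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)      # standardize the data

pca = PCA(n_components=2)                      # keep the top 2 principal components
X_reduced = pca.fit_transform(X_std)

print("Reduced shape:", X_reduced.shape)                    # (150, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_)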
3. Advantages of PCA
4.Applications of PCA
Clustering techniques
K-Means Clustering
Detailed Explanation:
2. K-Means Algorithm
o Initialization: Randomly initialize K centroids.
o Assignment: Assign each data point to the nearest centroid based on distance
(typically Euclidean distance).
o Update Centroids: Recalculate the centroids as the mean of all data points
assigned to each centroid.
o Iterate: Repeat the assignment and update steps until convergence (when
centroids no longer change significantly or after a specified number of
iterations).
3. Advantages of K-Means Clustering
Simple and Efficient: Easy to implement and computationally efficient for large
datasets.
Scalable: Scales well with the number of data points and clusters.
Interpretability: Provides interpretable results by assigning each data point to a
cluster.
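A compact K-Means sketch on synthetic blob data (the choice of K = 3 is for illustration only):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)                 # assign each point to a cluster

print("Cluster centers:\n", kmeans.cluster_centers_)
print("First 10 labels:", labels[:10])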
4. Applications of K-Means Clustering
Hierarchical Clustering
Hierarchical Clustering is an unsupervised learning algorithm that groups similar objects into
clusters based on their distances or similarities.
Detailed Explanation:
3. Advantages of Hierarchical Clustering
Deep Learning is a subset of machine learning that involves neural networks with many
layers (deep architectures) to learn from data. It has revolutionized various fields like
computer vision, natural language processing, and robotics.
Detailed Explanation:
Example Explanation:
Neuron:
A fundamental unit of a neural network that receives inputs, applies weights, and
computes an output using an activation function.
Activation Function:
A function applied to a neuron's output to introduce non-linearity, enabling the
network to learn complex patterns (common choices include sigmoid, ReLU, and tanh).
Layer:
A collection of neurons that process input data. Common layers include input, hidden
(where computations occur), and output (producing the network's predictions).
Feedforward Neural Network:
A type of neural network where connections between neurons do not form cycles, and
data flows in one direction from input to output.
Backpropagation:
An algorithm that adjusts the network's weights by propagating the error backward
through the layers, computing gradients of the loss with respect to each weight.
Loss Function:
Measures the difference between predicted and actual values. It guides the
optimization process during training by quantifying the network's performance.
Gradient Descent:
An optimization algorithm that iteratively updates weights in the direction of the
negative gradient of the loss function to minimize the loss.
Batch Size:
Number of training examples used in one iteration of gradient descent. Larger batch
sizes can speed up training but require more memory.
Epoch:
One complete pass through the entire training dataset during the training of a neural
network.
Learning Rate:
Parameter that controls the size of steps taken during gradient descent. It affects how
quickly the model learns and converges to optimal weights.
Overfitting:
Condition where a model learns to memorize the training data rather than generalize
to new, unseen data. Regularization techniques help mitigate overfitting.
Underfitting:
Condition where a model is too simple to capture the underlying patterns in the
training data, resulting in poor performance on both training and test datasets.
Dropout:
Regularization technique where randomly selected neurons are ignored during
training to prevent co-adaptation of neurons and improve model generalization.
Convolutional Neural Network (CNN):
Deep learning architecture particularly effective for processing grid-like data, such as
images. CNNs use convolutional layers to automatically learn hierarchical features.
Neural networks are computational models inspired by the human brain's structure and
function. They consist of interconnected neurons organized into layers, each performing
specific operations on input data to produce desired outputs. Here's an overview of neural
network architecture and its working:
3. Activation Functions:
o Purpose: Applied to the output of each neuron to introduce non-linearity, enabling
neural networks to learn complex patterns.
1. Feedforward Process:
o Input Propagation: Input data is fed into the input layer of the neural network.
o Forward Pass: Data flows through the network layer by layer. Each neuron in a layer
receives inputs from the previous layer, computes a weighted sum, applies an
activation function, and passes the result to the next layer.
o Output Generation: The final layer (output layer) produces predictions or
classifications based on the learned representations from the hidden layers.
2. Training Process:
o Loss Calculation: Compares the network's output with the true labels to compute a
loss (error) value using a loss function (e.g., Mean Squared Error for regression,
Cross-Entropy Loss for classification).
o Backpropagation: Algorithm used to minimize the loss by adjusting weights
backward through the network. It computes gradients of the loss function with respect
to each weight using the chain rule of calculus.
o Gradient Descent: Optimization technique that updates weights in the direction of
the negative gradient to reduce the loss, making the network more accurate over time.
o Epochs and Batch Training: Training involves multiple passes (epochs) through the
entire dataset, with updates applied in batches to improve training efficiency and
generalization.
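A minimal feedforward network sketch illustrating the forward pass, loss, backpropagation, epochs, and batches (synthetic data; assumes TensorFlow/Keras is installed):

import numpy as np
import tensorflow as tf

# Synthetic binary-classification data
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X.sum(axis=1) > 2).astype(int)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),  # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),                  # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training: multiple epochs, with weight updates applied batch by batch
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))   # [loss, accuracy]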
Types Of Neural Networks and Their Importance
1. Feedforward Neural Networks (FNN)
Description: Feedforward Neural Networks are the simplest form of neural networks
where information travels in one direction: from input nodes through hidden layers (if
any) to output nodes.
Importance: They form the foundation of more complex neural networks and are
widely used for tasks like classification and regression.
Applications:
o Classification: Image classification, sentiment analysis.
o Regression: Predicting continuous values like house prices.
2. Convolutional Neural Networks (CNN)
Description: CNNs are specialized for processing grid-like data, such as images or
audio spectrograms. They use convolutional layers to automatically learn hierarchical
patterns.
Importance: CNNs have revolutionized computer vision tasks by achieving state-of-
the-art performance in image recognition and analysis.
Applications:
o Image Recognition: Object detection, facial recognition.
o Medical Imaging: Analyzing medical scans for diagnostics.
4. Long Short-Term Memory Networks (LSTM)
Description: A type of RNN that mitigates the vanishing gradient problem. LSTMs
have more complex memory units and can learn long-term dependencies.
Importance: LSTMs excel in capturing and remembering patterns in sequential data
over extended time periods.
Applications:
o Speech Recognition: Transcribing spoken language into text.
o Predictive Text: Autocomplete suggestions in messaging apps.
Versatility: Each type of neural network is tailored to different data structures and
tasks, offering versatility in solving complex problems across various domains.
State-of-the-Art Performance: Neural networks have achieved remarkable results in
areas such as image recognition, natural language understanding, and predictive
analytics.
Automation and Efficiency: They automate feature extraction and data
representation learning, reducing the need for manual feature engineering.
PROJECT WORK
TITLE: BIGMART SALES PREDICTION USING ENSEMBLE LEARNING
PROJECT OVERVIEW
Data Description: The dataset for this project includes annual sales records for 2013,
encompassing 1559 products across ten different stores located in various cities. The dataset
is rich in attributes, offering valuable insights into customer preferences and product
performance.
Key Objectives
Develop robust predictive models to forecast sales for individual products at specific
store locations.
Identify and analyze key factors influencing sales performance, including product
attributes, store characteristics, and external variables.
Implement and compare various machine learning algorithms to determine the most
effective approach for sales prediction.
Provide actionable insights to optimize inventory management, resource allocation,
and marketing strategies.
Learning Objectives:
1. Data Processing Techniques: Students will learn to extract, process, and clean large
datasets efficiently.
2. Exploratory Data Analysis (EDA): Students will conduct EDA to uncover patterns
and insights within the data.
3. Statistical and Categorical Analysis:
o Chi-squared Test
o Cramer’s V Test
o Analysis of Variance (ANOVA)
4. Machine Learning Models:
o Basic Models: Linear Regression
o Advanced Models: Gradient Boosting, Generalized Additive Models (GAMs),
Splines, and Multivariate Adaptive Regression Splines (MARS)
5. Ensemble Techniques:
o Model Stacking
o Model Blending
6. Model Evaluation: Assessing the performance of various models to identify the best
predictive model for sales forecasting.
Methodology
3. Feature Engineering:
Model Development:
Ensemble Techniques:
o Explore model stacking and blending to improve prediction accuracy.
o Model Evaluation and Selection:
o Assess model performance using appropriate metrics.
o Select the most effective model or ensemble for deployment.
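A sketch of the stacking approach on the sales-prediction task; the file name and column names (e.g. Item_Weight, Item_Outlet_Sales) follow the public Big Mart dataset but should be treated as assumptions, and XGBoost can be swapped in for either base learner if it is installed:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv("train.csv")   # assumed Big Mart training file; path and columns are assumptions
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].median())   # simple imputation
X = pd.get_dummies(df.drop(columns=["Item_Outlet_Sales"]))                 # one-hot encode categoricals
y = df["Item_Outlet_Sales"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=200, random_state=42)),
        ("gb", GradientBoostingRegressor(random_state=42)),
    ],
    final_estimator=LinearRegression(),   # meta-model blends the base predictions
)
stack.fit(X_train, y_train)
rmse = mean_squared_error(y_test, stack.predict(X_test)) ** 0.5
print("Stacked model RMSE:", rmse)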
Expected Outcomes
2. Key Findings:
o Feature Importance: Through various models, features such as item weight, item fat
content, and store location were consistently identified as significant predictors of
sales.
o Customer Preferences: Analysis revealed that products with lower fat content had
higher sales in urban stores, indicating a health-conscious consumer base in these
areas.
o Store Performance: Certain stores consistently outperformed others, suggesting
potential areas for targeted marketing and inventory strategies.
3. Best-Performing Model:
Recommendations:
1. Inventory Management:
o Utilize the insights from the sales forecasts to optimize inventory levels, ensuring
high-demand products are adequately stocked to meet customer needs while reducing
excess inventory for low-demand items.
2. Targeted Marketing:
o Implement targeted marketing strategies based on customer preferences identified in
the analysis. For example, promote low-fat products more aggressively in urban
stores where they are more popular.
5. Employee Training:
o Train store managers and staff on the use of sales forecasts and data-driven decision-
making. Empowering employees with these insights can lead to better in-store
execution and customer service.
Bigmart-Sales-Prediction
ACTIVITY LOG FOR FIRST WEEK
24 May 2024, Day 5: Introduction to the different modules of the course (Statistics, ML, DL).
Learning outcome: Understand the basics of Machine Learning and Deep Learning.
WEEKLY REPORT
Objective of the Activity Done: The first week aimed to introduce the students to the
fundamentals of Data Science, covering program structure, key concepts, applications, and an
overview of various modules such as Python, SQL, Data Analytics, Statistics, Machine
Learning, and Deep Learning.
Detailed Report: During the first week, the training sessions provided a comprehensive
introduction to the Data Science internship program. On the first day, students were oriented
on the program flow, schedule, and objectives. They learned about the definition and
significance of Data Science in today's data-driven world.
The following day, students explored various applications and real-world use cases of Data
Science across different industries, helping them understand its practical implications and
benefits. Mid-week, the focus was on basic definitions and differences between key terms
like Data Science, Data Analytics, and Business Intelligence, ensuring a solid foundational
understanding.
Towards the end of the week, students were introduced to the different modules of the course,
including Python, SQL, Data Analytics, Statistics, Machine Learning, and Deep Learning.
These sessions provided an overview of each module's importance and how they contribute to
the broader field of Data Science.
By the end of the week, students had a clear understanding of the training program's
structure, fundamental concepts of Data Science, and the various applications and use cases
across different industries. They were also familiar with the key modules to be studied in the
coming weeks, laying a strong foundation for more advanced learning.
ACTIVITY LOG FOR SECOND WEEK
27 May 2024, Day 1: Introduction to Python.
Learning outcome: Understanding the applications of Python.
WEEKLY REPORT
Detailed Report: Throughout the week, students were introduced to Python, starting with its
installation and setup. They learned about variables, data types, operators, and input/output
operations. The sessions covered control structures and looping statements to define data
flow and basic data structures like lists, tuples, dictionaries, and sets for data storage and
access. Functions, methods, and modules were also discussed, emphasizing user-defined and
built-in functions, as well as the importance of modular programming. The week concluded
with lessons on errors and exception handling, teaching students how to manage and handle
different types of exceptions in their code.
Learning Outcomes:
ACTIVITY LOG FOR THIRD WEEK
WEEKLY REPORT
Objective of the Activity Done: The fourth week aimed to introduce students to Object-
Oriented Programming (OOP) concepts in Python, Python libraries essential for Data Science
(NumPy and Pandas), and foundational SQL concepts. Students learned practical
implementation of OOP principles, numerical operations using NumPy, data manipulation
with Pandas dataframes, and basic SQL commands for database management.
Detailed Report:
Learning Outcomes:
11 June 2024, Day 2: SQL Hands-On - Sample Project on Ecommerce Data.
Learning outcome: Data analysis on ecommerce data, executing all commands on the
Ecommerce database.
14 June 2024, Day 5: Probability Measures and Distributions.
Learning outcome: Understanding data distributions, skewness and bias.
WEEKLY REPORT
Objective of the Activity Done: The focus of the third week was to delve into SQL,
advanced SQL queries, and database operations for data analysis. Additionally, the week
covered fundamental mathematics for Data Science, including descriptive statistics,
inferential statistics, hypothesis testing, probability measures, and distributions essential for
data analysis and decision-making.
Detailed Report:
Learning Outcomes:
Acquired proficiency in SQL joins and advanced SQL queries for effective data
retrieval and manipulation.
Applied SQL skills in a practical project scenario involving ecommerce data analysis.
Developed a solid foundation in descriptive statistics and its application in
summarizing data.
Gained expertise in inferential statistics and hypothesis testing to draw conclusions
from data.
Learned about probability measures and distributions, understanding their
characteristics and applications in Data Science.
ACTIVITY LOG FOR FIFTH WEEK
WEEKLY REPORT
Objective of the Activity Done: The fifth week focused on Machine Learning fundamentals,
covering supervised and unsupervised learning techniques, model evaluation metrics, and
hyperparameter tuning. Students gained a comprehensive understanding of different types of
Machine Learning, algorithms used for both classification and regression, and techniques for
feature importance and dimensionality reduction.
Detailed Report:
Learning Outcomes:
ACTIVITY LOG FOR SIXTH WEEK
WEEKLY REPORT
Objective of the Activity Done: The sixth week focused on practical aspects of Machine
Learning (ML) and introduction to Deep Learning (DL). Topics included the ML project
lifecycle, data preparation, exploratory data analysis (EDA), model development and
evaluation, ensemble methods (bagging, boosting, stacking), and an introduction to DL and
neural networks.
Detailed Report:
Learning Outcomes:
Student Self Evaluation of the Short-Term Internship
Registration
Student Name: No.
From: To:
Date of Evaluation:
2 Written communication 1 2 3 4 5
3 Proactiveness 1 2 3 4 5
5 Positive Attitude 1 2 3 4 5
6 Self-confidence 1 2 3 4 5
7 Ability to learn 1 2 3 4 5
9 Professionalism 1 2 3 4 5
10 Creativity 1 2 3 4 5
12 Time Management 1 2 3 4 5
13 Understanding the Community 1 2 3 4 5
15 OVERALL PERFORMANCE 1 2 3 4 5
Registration
Student Name: No.
From: To:
Date of Evaluation:
2 Written communication 1 2 3 4 5
3 Proactiveness 1 2 3 4 5
5 Positive Attitude 1 2 3 4 5
6 Self-confidence 1 2 3 4 5
7 Ability to learn 1 2 3 4 5
9 Professionalism 1 2 3 4 5
10 Creativity 1 2 3 4 5
12 Time Management 1 2 3 4 5
15 OVERALL PERFORMANCE 1 2 3 4 5
EVALUATION
Objectives:
• To integrate theory and practice.
• To learn to appreciate work and its function towards the future.
• To develop work habits and attitudes necessary for job success.
• To develop communication, interpersonal and other critical skills in the future job.
• To acquire additional skills required for the world of work.
Assessment Model:
• There shall only be internal evaluation.
• The Faculty Guide assigned is in-charge of the learning activities of the students
and for the comprehensive and continuous assessment of the students.
• The assessment is to be conducted for 100 marks.
• The number of credits assigned is 4. Later the marks shall be converted into
grades and grade points to include finally in the SGPA and CGPA.
• The weightings shall be:
o Activity Log: 25 marks
o Internship Evaluation: 50 marks
o Oral Presentation: 25 marks
• Activity Log is the record of the day-to-day activities. The Activity Log is
assessed on an individual basis, thus allowing for individual members within
groups to be assessed this way. The assessment will take into consideration the
individual student’s involvement in the assigned work.
• While evaluating the student’s Activity Log, the following shall be considered –
a. The individual student’s effort and commitment.
b. The originality and quality of the work produced by the individual student.
c. The student’s integration and co-operation with the work assigned.
d. The completeness of the Activity Log.
• The Internship Evaluation shall include the following components, based on the
Weekly Reports and Outcomes Descriptions:
a. Description of the Work Environment.
b. Real Time Technical Skills acquired.
c. Managerial Skills acquired.
d. Improvement of Communication Skills.
e. Team Dynamics
f. Technological Developments recorded.
MARKS STATEMENT
(To be used by the Examiners)
INTERNAL ASSESSMENT STATEMENT
1. Activity Log: 25 marks
2. Internship Evaluation: 50 marks
3. Oral Presentation: 25 marks