Data Analysis
1. Introduction
3. Data Visualization
4. Machine Learning
6. Tableau
9. Conclusion
10. References
1. INTRODUCTION
Data analysis is a process of inspecting, cleansing, transforming, and modelling data with the
goal of discovering useful information, informing conclusions, and supporting decision-
making. Data analysis has multiple facets and approaches, encompassing diverse techniques
under a variety of names, and is used in different business, science, and social science
domains. In today's business world, data analysis plays a role in making decisions more
scientific and helping businesses operate more effectively.
Data Analytics
Data analytics is the process of examining raw data to uncover patterns, trends, and insights
for decision-making. This involves several key components such as data preprocessing, data
visualization, statistical modeling, and the use of machine learning techniques.
Key Components of Data Analytics
1. Data Preprocessing: Cleaning and preparing raw data for analysis.
2. Data Visualization: Presenting insights using charts, graphs, and dashboards.
3. Machine Learning: Building predictive models for decision-making.
4. Advanced Tools: Tools like Tableau for visualization and deep learning for complex tasks
such as image recognition.
Importance of Data Analytics
Data analytics helps organizations uncover hidden patterns, market trends, and customer
preferences. By leveraging analytics, businesses can make informed decisions, reduce risks,
and operate more efficiently.
OBJECTIVE
The objective of the workshop was to introduce participants to Python for data analysis
and give them a clear roadmap for developing their skills in this field. The session also
covered data visualization techniques using libraries such as Matplotlib and Seaborn,
providing students with practical skills to interpret and represent data effectively.
Additionally, participants were introduced to key concepts such as data cleaning,
exploratory data analysis (EDA), and the job opportunities available in the rapidly growing
field of data science. The workshop focused on equipping students with real-world skills in
data analytics, particularly Python, and provided insights into a structured roadmap for
learning data analysis. The session covered practical applications, allowing students to
work on datasets and understand the tools used in the field. Although the initial agenda
included tools like Tableau and Power BI, the workshop ultimately focused solely on
Python and its data libraries to give participants a deep dive into one platform.
IMPORTANCE OF DATA SCIENCE
Data is one of the most important assets of every organization because it helps business leaders make decisions based on facts, statistics, and trends. The growing scope of data gave rise to data science, a multidisciplinary field that uses scientific approaches, processes, algorithms, and frameworks to extract knowledge and insight from huge amounts of data.
The data involved can be either structured or unstructured. Data science brings together ideas, data examination, machine learning, and related strategies to understand and analyse real-world phenomena with data. It is an extension of various data analysis fields such as data mining, statistics, and predictive analysis. Data science is a broad field that draws on methods and concepts from information science, statistics, mathematics, and computer science. Techniques used in data science include machine learning, visualization, pattern recognition, probability modelling, data engineering, and signal processing.
TECHNICAL SKILLS
Programming Languages
Python is the primary programming language used in data analytics because of its simplicity,
flexibility, and vast ecosystem of libraries. Analysts rely on Python’s core libraries such as
Pandas and NumPy to manipulate datasets, perform mathematical operations, and handle
structured and unstructured data effectively. Pandas provides robust tools for working with
tabular data, while NumPy supports high-performance numerical computations. Together,
these libraries enable efficient handling of large datasets and form the foundation for data
preprocessing, exploratory analysis, and building advanced models. Python’s readability and
strong community support also make it ideal for both beginners and professionals.
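As a minimal sketch of how these two libraries work together (the column names and values below are made up for illustration):
import pandas as pd
import numpy as np

# Build a small tabular dataset (hypothetical sales figures)
df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units": [120, 95, 140, 110],
    "price": [9.99, 9.99, 10.49, 10.49],
})

# Pandas handles the tabular manipulation ...
df["revenue"] = df["units"] * df["price"]
summary = df.groupby("region")["revenue"].sum()

# ... while NumPy provides fast numerical routines on the underlying arrays
log_revenue = np.log(df["revenue"].to_numpy())

print(summary)
print(log_revenue.round(2))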
Database Management
Database management skills are essential for data analysts, as raw data is often stored across
relational databases. SQL (Structured Query Language) is the industry-standard tool for
retrieving, updating, and managing data in these systems. Proficiency in SQL allows analysts
to query large datasets efficiently, extract meaningful insights, and join multiple tables to
generate comprehensive reports. According to GeeksforGeeks and other learning platforms,
SQL is considered a core skill for anyone working with data. Strong database management
expertise ensures analysts can access clean, structured data, forming a reliable foundation for
further analysis and decision-making in business environments.
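A small illustration of querying a relational database from Python; the database file, table, and column names here are hypothetical, and SQLite stands in for whatever database the organization uses:
import sqlite3
import pandas as pd

# Connect to a local SQLite database (file name is assumed)
conn = sqlite3.connect("sales.db")

# SQL does the heavy lifting: filtering, joining and aggregating inside the database
query = """
SELECT c.region, SUM(o.amount) AS total_sales
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
GROUP BY c.region
ORDER BY total_sales DESC;
"""

# Pull the aggregated result straight into a DataFrame for further analysis
region_sales = pd.read_sql_query(query, conn)
print(region_sales)
conn.close()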
Data Visualization
Data visualization is a critical skill for translating complex datasets into clear, interpretable
insights. Proficiency in Python visualization libraries such as Matplotlib and Seaborn helps
analysts create meaningful charts, graphs, and plots that reveal patterns and trends. For more
interactive and dynamic representations, advanced tools like Plotly or Bokeh can be used,
allowing the creation of dashboards that stakeholders can explore in real time. Effective
visualization not only improves the communication of findings but also supports decision-
making processes. By transforming raw numbers into compelling visuals, data analysts bridge
the gap between technical analysis and business understanding.
Statistical Analysis
A solid foundation in statistical analysis is essential for deriving meaningful insights from
data. Analysts need to understand core concepts such as probability, sampling, hypothesis
testing, correlation, and regression analysis to evaluate relationships and test assumptions.
Statistical methods provide the backbone for validating results, ensuring accuracy, and
minimizing errors in conclusions. For instance, regression analysis allows analysts to predict
outcomes and identify factors influencing trends. With these skills, professionals can move
beyond descriptive analysis to inferential and predictive insights. Mastery of statistics ensures
analysts make data-driven decisions that are both reliable and scientifically sound.
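As a small illustration of correlation and simple regression in Python (purely synthetic data, for demonstration only):
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic advertising spend vs. sales data with some noise
ad_spend = rng.uniform(10, 100, size=50)
sales = 3.5 * ad_spend + rng.normal(0, 20, size=50)

# Correlation: strength of the linear relationship
r, r_pvalue = stats.pearsonr(ad_spend, sales)

# Simple linear regression: predict sales from ad spend
result = stats.linregress(ad_spend, sales)

print(f"correlation r = {r:.2f} (p = {r_pvalue:.3f})")
print(f"slope = {result.slope:.2f}, intercept = {result.intercept:.2f}, R^2 = {result.rvalue**2:.2f}")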
Data Cleaning and Preprocessing
Before meaningful insights can be derived, raw data must undergo thorough cleaning and
preprocessing. This skill involves identifying missing values, handling duplicates, correcting
inconsistencies, and transforming data into suitable formats for analysis. Data cleaning
ensures the accuracy, quality, and reliability of the dataset, while preprocessing steps such as
normalization, feature scaling, and encoding categorical variables prepare the data for
advanced modeling. Poor data quality often leads to misleading results, making this skill
crucial for analysts. By mastering data cleaning and preprocessing, professionals ensure that
every subsequent step of analysis is built on a solid and trustworthy foundation.
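A brief sketch of typical cleaning steps in pandas; the file name and the "age" and "city" columns are hypothetical:
import pandas as pd

# Load a raw dataset (file name is assumed)
df = pd.read_csv("raw_data.csv")

# Inspect how many values are missing in each column
print(df.isnull().sum())

# Drop exact duplicate rows
df = df.drop_duplicates()

# Impute numeric gaps with the median and categorical gaps with the mode (example columns)
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Fix inconsistent text formatting
df["city"] = df["city"].str.strip().str.title()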
SOFT SKILLS
Analytical Thinking
Developing strong analytical thinking makes data analysts more efficient and precise
problem-solvers.
Problem-Solving
Problem-solving is an essential soft skill for data analysts, as real-world projects often involve
addressing ambiguous or ill-structured challenges. This skill includes identifying the root
causes of problems, evaluating multiple potential solutions, and selecting the most effective
strategy. Analysts often use problem-solving to design workflows, choose appropriate models,
or resolve issues with incomplete or messy datasets. Strong problem-solving requires
creativity, adaptability, and the ability to weigh trade-offs between speed and accuracy. In
practice, this ensures timely and practical solutions that not only address immediate issues but
also add long-term value to the business or research context.
Communication
Clear and concise communication is vital for data analysts, as they frequently present their
findings to both technical and non-technical stakeholders. This skill involves simplifying
complex statistical or technical concepts into actionable insights that decision-makers can
easily understand. Effective communication includes creating well-structured reports, visually
appealing dashboards, and compelling presentations. It also requires active listening to
understand stakeholder requirements and tailoring messages to different audiences. Analysts
who excel in communication can bridge the gap between data and decision-making, ensuring
that their insights lead to practical business outcomes and are appreciated across departments
and teams.
Attention to Detail
Attention to detail is critical in data analysis because even small mistakes can lead to
misleading conclusions or flawed strategies. This skill requires meticulousness in handling
datasets, verifying calculations, and ensuring consistency throughout the analysis process. For
example, overlooking duplicate entries or mislabeling variables can compromise the accuracy
of results. Strong attention to detail ensures the reliability of data cleaning, statistical testing,
and visualization. In business scenarios, this translates into more trustworthy insights and
reduced risk of errors in decision-making. Cultivating this skill enables analysts to deliver
precise, high-quality work that withstands scrutiny from stakeholders.
Data preprocessing is a crucial step in any data analysis project. It involves cleaning,
transforming, and organizing raw data into a usable format. Common tasks include
handling missing values, removing duplicates, encoding categorical variables, and
normalizing numerical data. Data manipulation is achieved using tools like Python
(Pandas, NumPy), ensuring datasets are accurate and consistent. This stage improves data
quality and prepares it for visualization and modeling. Effective preprocessing enhances
model performance and helps analysts uncover true patterns in the data, reducing bias and
ensuring that subsequent analysis is both valid and reliable.
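For example, categorical encoding and normalization might look like this with pandas and scikit-learn (the column names and values are illustrative):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "item_type": ["Food", "Drink", "Food", "Household"],
    "price": [249.0, 48.5, 120.0, 560.0],
})

# One-hot encode the categorical column
df = pd.get_dummies(df, columns=["item_type"])

# Normalize the numeric column to the 0-1 range
scaler = MinMaxScaler()
df[["price"]] = scaler.fit_transform(df[["price"]])

print(df.head())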
Python
Pandas
*Sampling
*Data Cleaning
*Data Transformation
*Data Wrangling
*Data Analysis
Data Structures in Pandas
Functionality of Pandas
The read_csv() function in pandas is commonly used to load data from a CSV (Comma-
Separated Values) file into a DataFrame for analysis. It is one of the most used data import
functions in data analytics.
Syntax:
import pandas as pd
df = pd.read_csv('filename.csv')
Key Parameters: filepath_or_buffer (the path to the file), sep (the delimiter, ',' by default), header (which row to use for column names), index_col (which column to use as the row index), na_values (extra strings to treat as missing), and parse_dates (columns to parse as dates).
Example:
df = pd.read_csv('amazon_stock.csv')
print(df.head())
This reads amazon_stock.csv into a DataFrame and displays its first five rows. Once
loaded, the DataFrame df can be manipulated using various pandas operations such as df.info(), df.describe(), sorting, filtering, and grouping.
Using read_csv() is the starting point for most data analysis workflows when working with
structured tabular data, especially in business intelligence, finance, or scientific computing.
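A few common follow-up operations on the df loaded above, as a sketch; the 'Date', 'Close', and 'Volume' column names are assumptions about this particular file:
df.info()                      # column types and non-null counts
df.describe()                  # summary statistics for numeric columns
df['Close'].mean()             # average closing price (assumed column name)
df.sort_values('Date').head()  # earliest records (assumed column name)
df[df['Volume'] > df['Volume'].median()]  # filter high-volume days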
3. Data Visualization
Data visualization is the graphical representation of information and data. It provides an accessible way
to see and understand trends, outliers, and patterns in data. Additionally, it gives employees and
business owners an excellent way to present data to non-technical audiences without confusion.
Data visualization converts complex datasets into clear, graphical formats that are easier to interpret.
Using tools like Matplotlib, Seaborn, and Plotly in Python, we can generate bar charts, heatmaps,
scatter plots, and histograms. Visualization helps identify trends, outliers, and relationships between
variables. It plays a critical role in communicating findings to stakeholders and supports informed
decision-making. During this project, visualization was used to explore stock data, showing price
trends, volume fluctuations, and volatility patterns in Amazon’s historical stock performance. Clear
visuals enhance comprehension and drive actionable business insights.
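A small sketch of the kind of plot used here, assuming a CSV with 'Date' and 'Close' columns (the file and column names are assumptions):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('amazon_stock.csv', parse_dates=['Date'])  # column names assumed

plt.figure(figsize=(10, 5))
plt.plot(df['Date'], df['Close'], color='steelblue')
plt.title('Amazon Closing Price Over Time')
plt.xlabel('Date')
plt.ylabel('Close price (USD)')
plt.tight_layout()
plt.show()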
METHODOLOGY
The methodology for the "Sales Forecasting" project involves a systematic set of procedures that
combine different research and technical methods to develop a machine learning based solution for
retail sales management. The methodology can be split into a few key areas: data collection, model
development, design and application, and deployment.
1. Data Collection and Preprocessing:
1. Data Sources: Collect historical sales data from retail shops, including sales transactions, item details, and inventory levels over time. External data sources such as market trends, seasonal effects, and promotional activities may be added to the model for increased accuracy.
2. Data Preprocessing: The data will be cleaned by handling missing values, outliers, and inconsistencies. Feature engineering will transform the raw data into more useful features, for example converting dates into day/month/year components and aggregating sales into weekly or monthly totals.
2. Exploratory Data Analysis (EDA): Trends, patterns, and correlations in the data will be identified. Visualization techniques such as histograms, scatter plots, and line charts will be used to assess sales trends over time and to indicate which factors most affect sales.
3. Model Development:
1. Machine Learning Techniques: The focus will be on time-series analysis (for example, ARIMA and seasonal decomposition) and regression techniques such as linear or polynomial regression to predict future sales.
2. Model Evaluation: Model performance will be evaluated using metrics such as Mean Absolute Error, Root Mean Square Error, and R-squared. Cross-validation techniques will be employed to ensure the model's robustness and generalizability. A sketch of this evaluation step is shown below.
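A hedged sketch of the evaluation step described above, using scikit-learn; the feature matrix X and target y stand in for the preprocessed sales data and are generated synthetically here:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-ins for the engineered features and weekly sales totals
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -1.5, 2.0, 0.5]) + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("MAE :", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R^2 :", r2_score(y_test, pred))

# Cross-validation to check robustness
print("CV R^2:", cross_val_score(model, X, y, cv=5, scoring="r2").mean())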
4. Developing the Backend and Integrating:
1. API Development: Develop a backend system using either Flask or Django to serve as the interface between the mobile application and the machine learning model. It should handle the sales-prediction requests and any changes in inventory; a minimal sketch of such an endpoint follows.
2. Database Management: Sales and inventory data will be stored in Firebase so that the data stays in real-time synchrony. An effective database structure will be designed to provide efficient data retrieval and storage.
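A minimal sketch of what such a prediction endpoint could look like in Flask; the model file name and the expected input fields are hypothetical, not the project's actual interface:
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a pre-trained sales forecasting model (file name is assumed)
with open("sales_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect JSON like {"features": [store_id, month, promo_flag, last_week_sales]}
    payload = request.get_json()
    prediction = model.predict([payload["features"]])[0]
    return jsonify({"predicted_sales": float(prediction)})

if __name__ == "__main__":
    app.run(debug=True)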
PROJECT DESIGN:
● 1.Workflow Diagram:
System Architecture:
INTRODUCTION:
"To find out what role certain properties of an item play and how they affect their sales by
understanding Target Mart sales." To help Target Mart achieve this goal, a predictive
model can be built to find, for every store, the key factors that can increase sales and
what changes could be made to the product and store attributes.
The aim is to build a predictive model to find the sales of each product at a particular store.
Using this model, big marts like Target can understand the properties of products and
stores that play a key role in increasing sales. The idea is therefore to identify which product
and store properties impact the sales of a product.
We came up with certain hypotheses relating to the various factors of the products and stores in
order to solve the problem statement. We developed a predictive model using different ML
algorithms, such as linear regression, polynomial regression, and ridge regression, for forecasting
the sales of a business such as Target Corporation. Using this model, we try to understand the
properties of products and stores that play a key role in increasing sales.
• We perform some basic data exploration and draw inferences about the data.
• The model uses a Target Corporation sales dataset. After preprocessing and filling missing
values, we used an ensemble of regressors built from decision trees, linear regression,
ridge regression, polynomial regression, and XGBoost regression.
● PREREQUISITES:
The dataset contains about 10 years of daily weather observations from numerous
Australian weather stations.
Many other paradigms are supported via extensions, including design by contract
and logic programming.
Python uses dynamic typing and a combination of reference counting and a cycle-
detecting garbage collector for memory management. It also features dynamic name
resolution (late binding), which binds method and variable names during program
execution.
Python's design offers some support for functional programming in the Lisp tradition. It
has first-class functions, list comprehensions, dictionaries, sets, and generator expressions. The
standard library includes two modules (itertools and functools) that implement functional tools
borrowed from Haskell and Standard ML.
Regression is a statistical measure that attempts to determine the strength of the
relationship between one dependent variable, usually denoted by Y, and a series of other
changing variables known as independent variables. A regression model that contains more
than one predictor variable is called a multiple regression model.
• MODEL DESIGN:
The architecture diagram of the proposed model shows how the different algorithms are
applied to the dataset. We calculate the accuracy, MAE, MSE, and RMSE for each model and
conclude which algorithm yields the best results. The following algorithms are used.
A. Linear Regression:
• Build a scatter plot and check for 1) a linear or non-linear pattern in the data and 2) unusual
variance (outliers). Consider a transformation if the pattern is not linear; outliers should only be
removed if there is a non-statistical justification for doing so.
• Fit the data with a least-squares line and confirm the model assumptions using the
residual plot (for the constant standard deviation assumption) and the normal probability plot
(for the normality assumption). A transformation may be necessary if the assumptions do not
appear to be met.
• If required, transform the data and construct the least-squares regression line using the
transformed data.
• If a transformation has been applied, return to step 1; otherwise, continue to step 5.
• Once a "good fit" model is obtained, write the least-squares regression line equation,
including the standard error of estimate and R-squared. • Linear regression formulas
look like this:
Y = b1x1 + b2x2 + … + bnxn
R-squared: the proportion of the total variance in Y (the dependent variable) that is explained
by the independent variables X.
B. Polynomial Regression:
• Polynomial regression is a regression algorithm that models the relationship between the
dependent variable (y) and the independent variable (x) as an nth-degree polynomial. The
equation for polynomial regression is given below:
y = b0 + b1x1 + b2x1^2 + b3x1^3 + … + bnx1^n
• It is often referred to as a special case of multiple linear regression in ML, since we add
polynomial terms to the multiple linear regression equation to convert it into polynomial
regression and improve accuracy.
• The dataset used for training in polynomial regression is non-linear in nature.
• It uses a linear regression model to fit complex, non-linear functions and
datasets. A short sketch with scikit-learn is shown below.
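A brief sketch of polynomial regression with scikit-learn (synthetic data, illustrative degree):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(100, 1))
y = 2 + 0.5 * x[:, 0] + 0.3 * x[:, 0] ** 2 + rng.normal(0, 1, size=100)

# Expand x into polynomial terms, then fit an ordinary linear model on them
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)

print("R^2 on training data:", poly_model.score(x, y))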
C. Ridge Regression:
Ridge regression is a model tuning method used on data that suffers from multicollinearity. It
performs L2 regularization. When multicollinearity is present, the least-squares estimates are
unbiased but their variances are large, so the predicted values can be far removed from the
actual values. The cost function for ridge regression:
Min(||Y - X(theta)||^2 + λ||theta||^2)
A brief example with scikit-learn follows.
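The same idea in code, on synthetic data with a deliberately correlated column; λ corresponds to the alpha parameter in scikit-learn:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
# Introduce a correlated (multicollinear) column on purpose
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=200)
y = 2 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(0, 0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# alpha plays the role of lambda in the cost function above
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
print("test R^2:", ridge.score(X_test, y_test))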
D. XGBoost Regression:
"Extreme Gradient Boosting" follows the same principle as gradient boosting but is much more
efficient. It has both a linear model solver and a tree learning algorithm, which makes XGBoost
several times faster than existing gradient boosting implementations. It supports various
objective functions, including regression, classification, and ranking. Because XGBoost has very
high predictive power but can be relatively slow to deploy, it is well suited for competitions. It
also has extra functionality for cross-validation and for finding important variables.
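A brief, hedged example of the XGBoost regressor on synthetic data; the hyperparameters shown are illustrative starting points, not tuned values:
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 6))
y = 5 * X[:, 0] - 2 * X[:, 1] ** 2 + rng.normal(0, 0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators / learning_rate / max_depth are illustrative defaults
model = XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4)
model.fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))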
● Methodology:
3. Data Analysis - imputing missing values in the data and checking for outliers.
4. Feature Engineering - modifying existing variables and creating new ones for analysis.
Machine Learning
Machine learning (ML) applies algorithms to train models from data and make predictions
or classifications. In this project, supervised learning techniques like Linear Regression
and Random Forest were used to forecast Amazon's stock prices. The model was trained
using historical data, and metrics like Mean Squared Error (MSE) were applied to assess
accuracy. ML helps businesses automate decision-making, detect anomalies, and
personalize customer experiences. Feature selection, model tuning, and validation were
integral parts of the process. Through this, we gained a deeper understanding of how data-
driven models are developed, tested, and deployed in real-world scenarios.
Machine Learning is a subset of artificial intelligence.
It permits a system to learn and improve from experience automatically, without being explicitly
programmed.
Supervised Learning: learning from labeled data and applying that knowledge to predict the labels
of new data (test data).
Unsupervised Learning
Unsupervised Learning groups/clusters unlabeled data based on its characteristics and predicts the
cluster of new data (test data).
Train/Test Split
1. Training set
2. Testing set
Test-Set Conditions: the test set should be large enough to yield statistically meaningful results, should be representative of the dataset as a whole, and must never be used for training. An example split is shown below.
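In scikit-learn this split is typically done as follows; the feature matrix and labels below are placeholders, and the 80/20 ratio is just a common convention:
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and labels
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), "training samples /", len(X_test), "testing samples")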
Deep learning, a subset of machine learning, uses artificial neural networks to model
complex patterns. For image data, convolutional neural networks (CNNs) are widely used
due to their ability to capture spatial hierarchies. In this project, deep learning was
explored for image classification tasks using TensorFlow and Keras. Datasets such as
MNIST or CIFAR-10 were used to train models to identify patterns and recognize visual
elements. Techniques like pooling, dropout, and activation functions were applied to
improve accuracy. Deep learning has revolutionized industries like healthcare and security,
making it a powerful tool for handling unstructured data like images.
Convolutional Neural Network
Layers in a CNN
Max Pooling: takes the largest element from each region of the rectified feature map.
Predict Zero Using MNIST Dataset
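A compact sketch of the kind of CNN used for this task with TensorFlow/Keras; the layer sizes and single training epoch are illustrative, not the exact configuration used in the project:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Load and scale the MNIST digits
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0
x_test = x_test[..., None] / 255.0

# Small CNN: convolution + max pooling + dropout, as described above
model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.25),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, validation_split=0.1)

# Predict the digit shown in the first test image
print("predicted digit:", np.argmax(model.predict(x_test[:1])))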
TABLEAU
Tableau is a powerful and fast-growing data visualization tool used in the BI industry.
* It helps companies harness the strength of their most valuable assets: data and people.
* It allows users to spend more time on data analysis and less on "data wrangling".
Tableau is a powerful data visualization tool that allows users to create interactive and shareable
dashboards. It connects seamlessly with various data sources like Excel, SQL, and cloud platforms.
During the internship, Tableau was used to design dynamic dashboards displaying KPIs and trends in
Amazon stock performance. Features like filters, parameters, calculated fields, and storyboards helped
convey insights effectively. Tableau’s drag-and-drop interface enables rapid development and real-
time analysis, making it ideal for business intelligence tasks. The visual appeal and interactivity of
Tableau dashboards greatly enhance the presentation and understanding of complex datasets for both
technical and non-technical audiences.
* This refers to datasets that have almost identical basic descriptive statistics but look very different
when graphed. Each consists of eleven (x, y) points.
Data Granularity
Granularity is the level of depth represented by the data in a fact or dimension table.
Granularity refers to the level of detail or depth captured in a dataset, particularly in data
warehouses and data marts. It defines how much information is stored in each record and how specific
or summarized the data is.
High granularity (detailed data):
o Use case: operational reporting, fraud detection, and personalized marketing.
Low granularity (aggregated data):
o Fewer rows.
o Example: monthly sales totals by region.
Performance: High granularity can slow down queries due to larger datasets, whereas low
granularity improves performance but reduces detail.
Storage: Detailed data consumes more space.
Flexibility: High-granularity data offers more flexibility in slicing and dicing the data during
analysis.
Accuracy: Aggregated data may obscure anomalies or patterns visible in detailed data.
Best Practices:
Use data aggregation techniques when building dashboards or summary reports for better
performance.
Set
* A named set from a multidimensional (cube) data source appears in Tableau with the named set icon.
* An action set icon is created automatically by Tableau when a set action is performed.
* User filters display in the Sets area of the Data window with the user filter set icon.
5. Sharing Insights through Dashboards
Choose the right chart type that will help convey the information most effectively. The appropriate
chart will reveal patterns and trends through which your audience will understand the significance of
the data set that you visualize.
1. Importing Libraries:
The first step in any data analysis project is importing the necessary libraries.
# Load Data Set: the dataset can be loaded using the read_csv() method.
2. Data Preprocessing:
Real-world data is often messy, incomplete, unstructured, inconsistent, redundant, and sprinkled with
odd values. Without deploying any data preprocessing techniques, it is almost impossible to gain
insights from raw data. Data preprocessing is the process of converting raw data into a suitable format
for extracting insights. It is the first and foremost step in the data science life cycle, and it makes
sure that the data is clean, organized, and ready to feed to the machine learning model.
● Except for the Date, Location columns, every column has missing values.
Let’s generate descriptive statistics for the dataset using the function describe() in pandas.
Descriptive Statistics: used to summarize and describe the features of data in a meaningful way
to extract insights. It uses two types of statistics to describe or summarize data:
● Measures of central tendency
● Measures of spread
3. Cardinality check for Categorical features:
The accuracy and performance of a classifier depend not only on the model we use, but also on how
we preprocess the data and what kind of data we feed to the classifier to learn from.
Many Machine learning algorithms like Linear Regression, Logistic Regression, k-nearest neighbors,
etc. can handle only numerical data, so encoding categorical data to numeric becomes a necessary
step.
But before jumping into encoding, check the cardinality of each categorical feature.
Cardinality: The number of unique values in each categorical feature is known as cardinality.
A feature with a high number of distinct/ unique values is a high cardinality feature. A categorical
feature with hundreds of zip codes is the best example of a high cardinality feature.
A high cardinality feature poses serious problems: it will increase the number of dimensions of the
data when that feature is encoded, which is not good for the model.
There are many ways to handle high cardinality, one would be feature engineering and the other is
simply dropping that feature if it doesn’t add any value to the model.
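For example, checking cardinality and (optionally) dropping an identifier-like column might look like this, continuing with the big_mart_data DataFrame used throughout this walkthrough:
# Number of unique values in each categorical column
categorical_cols = big_mart_data.select_dtypes(include="object").columns
print(big_mart_data[categorical_cols].nunique().sort_values(ascending=False))

# 'Item_Identifier' has very high cardinality; one option is to drop it if it adds little value
# (shown only as an illustration of the technique)
reduced_data = big_mart_data.drop(columns=["Item_Identifier"])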
Machine learning algorithms can't handle missing values, and they cause problems during modeling, so
they need to be addressed first. There are many techniques to identify and impute missing values.
If a dataset containing missing values is loaded using pandas, the missing values are replaced with
NaN (Not a Number) values. These NaN values can be identified using methods like isna() or isnull(),
and they can be imputed using fillna(). This process is known as missing data imputation.
# Handling Missing values in Categorical Features:
big_mart_data.isnull().sum()
Missing values in numerical features can be imputed using the mean or the median. The mean is
sensitive to outliers, whereas the median is robust to them. If you want to impute the missing values
with the mean, then outliers in the numerical features need to be addressed properly first.
5. Outliers detection and treatment:
An Outlier is an observation that lies an abnormal distance from other values in a given sample. They
can be detected using visualization (like box-plots, scatter plots), Z-score, statistical and probabilistic
algorithms, etc.
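A small sketch of the IQR (interquartile range) approach on one numeric column, continuing with the big_mart_data DataFrame from this walkthrough:
# IQR-based outlier detection for the Item_Visibility column
q1 = big_mart_data['Item_Visibility'].quantile(0.25)
q3 = big_mart_data['Item_Visibility'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = big_mart_data[(big_mart_data['Item_Visibility'] < lower) |
                         (big_mart_data['Item_Visibility'] > upper)]
print("number of outliers:", len(outliers))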
It’s time to do some analysis on each feature to understand about data and get some insights.
6. Exploratory Data Analysis:
Exploratory Data Analysis(EDA) is a technique used to analyze, visualize, investigate, interpret,
discover and summarize data. It helps Data Scientists to extract trends, patterns, and relationships in
data.
# Outlet_Establishment_Year column:
plt.figure(figsize=(6,6))
sns.countplot(x='Outlet_Establishment_Year', data=big_mart_data)
plt.show()
# Item_Fat_Content column:
plt.figure(figsize=(6,6))
sns.countplot(x='Item_Fat_Content', data=big_mart_data)
plt.show()
# Item_Type column:
plt.figure(figsize=(30,6))
sns.countplot(x='Item_Type', data=big_mart_data)
plt.show()
7. Encoding of Categorical Features:
Most machine learning algorithms, like logistic regression, support vector machines, and k-nearest
neighbours, can't handle categorical data. Hence, these categorical data need to be converted to
numerical data for modeling, which is called feature encoding.
There are many feature encoding techniques, such as one-hot encoding and label encoding. In this
project, the replace() function is used to encode categorical data to numerical data.
8. Correlation:
Correlation is a statistic that measures the strength of the relationship between two features. It
is used in bivariate analysis. Correlation can be calculated with the corr() method in pandas, and a
heatmap over the resulting matrix gives a quick visual summary, as sketched below.
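For instance, a correlation heatmap over the numeric columns can be drawn as follows, reusing the matplotlib and seaborn imports from this walkthrough (numeric_only assumes a reasonably recent pandas version):
# Correlation matrix of numeric columns, visualised as a heatmap
corr = big_mart_data.corr(numeric_only=True)
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()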
# Splitting data into Independent Features and Dependent Features:
For feature importance and feature scaling, we need to split data into independent and dependent
features.
● X – Independent Features or Input features
● y – Dependent Features or target label.
9. Feature Importance:
● Machine learning model performance depends on the features used to train the model. Feature
importance describes which features are relevant for building a model.
● Feature importance refers to techniques that assign a score to input features based on how
useful they are at predicting the target variable. Feature importance helps in feature selection.
We'll be using the ExtraTreesRegressor class for feature importance. This class implements a meta-
estimator that fits a number of randomized decision trees on various samples of the dataset and uses
averaging to improve predictive accuracy and control over-fitting.
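A hedged sketch of this step, assuming X and y are the encoded feature DataFrame and target from the split above:
from sklearn.ensemble import ExtraTreesRegressor

# Fit the meta-estimator on the independent features X and target y
model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Rank features by their importance scores
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))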
IMPLEMENTED SCREENSHOTS:
SAMPLE CODING:
Importing Libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn import metrics
# loading the dataset into a DataFrame (the CSV file name is assumed)
big_mart_data = pd.read_csv('big_mart_data.csv')
big_mart_data.head()
# number of data points & number of features
big_mart_data.shape
# getting some information about the dataset
big_mart_data.info()
#Categorical Features:
#Item_Identifier
#Item_Fat_Content
#Item_Type
#Outlet_Identifier
#Outlet_Size
#Outlet_Location_Type
#Outlet_Type
# checking for missing values
big_mart_data.isnull().sum()
# mean value of "Item_Weight" column
big_mart_data['Item_Weight'].mean()
# filling the missing values in "Item_Weight" column with "Mean" value
big_mart_data['Item_Weight'].fillna(big_mart_data['Item_Weight'].mean(), inplace=True)
# mode of "Outlet_Size" column
big_mart_data['Outlet_Size'].mode()
# filling the missing values in "Outlet_Size" column with Mode
mode_of_Outlet_size = big_mart_data.pivot_table(values='Outlet_Size', columns='Outlet_Type',
                                                aggfunc=(lambda x: x.mode()[0]))
miss_values = big_mart_data['Outlet_Size'].isnull()
print(miss_values)
big_mart_data.loc[miss_values, 'Outlet_Size'] = big_mart_data.loc[miss_values, 'Outlet_Type'].apply(lambda x: mode_of_Outlet_size[x])
big_mart_data.describe()
sns.set()
# Item_Weight distribution
plt.figure(figsize=(6,6))
sns.distplot(big_mart_data['Item_Weight'])
plt.show()
# Item_Visibility distribution
plt.figure(figsize=(6,6))
sns.distplot(big_mart_data['Item_Visibility'])
plt.show()
# Item_MRP distribution
plt.figure(figsize=(6,6))
sns.distplot(big_mart_data['Item_MRP'])
plt.show()
# Item_Outlet_Sales distribution
plt.figure(figsize=(6,6))
sns.distplot(big_mart_data['Item_Outlet_Sales'])
plt.show()
# Outlet_Establishment_Year column
plt.figure(figsize=(6,6))
sns.countplot(x='Outlet_Establishment_Year', data=big_mart_data)
plt.show()
# Item_Fat_Content column
plt.figure(figsize=(6,6))
sns.countplot(x='Item_Fat_Content', data=big_mart_data)
plt.show()
# Outlet_Size column
plt.figure(figsize=(6,6))
sns.countplot(x='Outlet_Size', data=big_mart_data)
plt.show()
big_mart_data.head()
big_mart_data['Item_Fat_Content'].value_counts()
big_mart_data.replace({'Item_Fat_Content': {'low fat': 'Low Fat', 'LF': 'Low Fat',
                                            'reg': 'Regular'}}, inplace=True)
big_mart_data['Item_Fat_Content'].value_counts()
Label Encoding
encoder = LabelEncoder()
big_mart_data['Item_Identifier'] = encoder.fit_transform(big_mart_data['Item_Identifier'])
big_mart_data['Item_Fat_Content'] = encoder.fit_transform(big_mart_data['Item_Fat_Content'])
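The sample code above stops at label encoding; a hedged sketch of how the imported XGBRegressor and metrics could then be used to train and evaluate a model, assuming the remaining categorical columns are encoded the same way:
# Encode the remaining categorical columns (assumed to follow the same LabelEncoder pattern)
for col in ['Item_Type', 'Outlet_Identifier', 'Outlet_Size',
            'Outlet_Location_Type', 'Outlet_Type']:
    big_mart_data[col] = encoder.fit_transform(big_mart_data[col])

# Split into independent features and the target column
X = big_mart_data.drop(columns=['Item_Outlet_Sales'])
y = big_mart_data['Item_Outlet_Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Train an XGBoost regressor and evaluate it with R-squared
regressor = XGBRegressor()
regressor.fit(X_train, y_train)
print('Train R^2:', metrics.r2_score(y_train, regressor.predict(X_train)))
print('Test R^2 :', metrics.r2_score(y_test, regressor.predict(X_test)))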
CONCLUSION:
In this work, we evaluated the effectiveness of various regression algorithms for predicting sales
based on historical sales data. The accuracy of linear regression, polynomial regression, ridge
regression, and XGBoost regression was measured, and we conclude that ridge and XGBoost
regression give better predictions than the linear and polynomial regression approaches. Forecasting
sales and building a sales plan can help avoid unforeseen cash-flow problems and manage production,
staffing, and financing needs more effectively. In future work, we can also consider an ARIMA model
to capture the time-series behaviour of sales.