
INTERNSHIP REPORT ON DATA ANALYSIS
1. Introduction

2. Data Pre-processing and Manipulation

3. Data Visualization

4. Machine Learning

5. Deep Learning for Images

6. Tableau

7. Sharing Insights through Dashboards

8. Stock Market Analysis - Amazon

9. Conclusion

10. References

1. INTRODUCTION

Data analysis is a process of inspecting, cleansing, transforming, and modelling data with the
goal of discovering useful information, informing conclusions, and supporting decision-
making. Data analysis has multiple facets and approaches, encompassing diverse techniques
under a variety of names, and is used in different business, science, and social science
domains. In today's business world, data analysis plays a role in making decisions more
scientific and helping businesses operate more effectively.
Data Analytics
Data analytics is the process of examining raw data to uncover patterns, trends, and insights
for decision-making. This involves several key components such as data preprocessing, data
visualization, statistical modeling, and the use of machine learning techniques.
Key Components of Data Analytics
1. Data Preprocessing: Cleaning and preparing raw data for analysis.
2. Data Visualization: Presenting insights using charts, graphs, and dashboards.
3. Machine Learning: Building predictive models for decision-making.
4. Advanced Tools: Tools like Tableau for visualization and deep learning for complex tasks
such as image recognition.
Importance of Data Analytics
Data analytics helps organizations uncover hidden patterns, market trends, and customer
preferences. By leveraging analytics, businesses can make informed decisions, reduce risks,
and operate more efficiently.

Practical Application: Target Corporation Sales Prediction

OBJECTIVE

The objective of the workshop was to introduce participants to Python for data analysis and
give them a clear roadmap for developing their skills in this field. The session also covered
data visualization techniques using libraries such as Matplotlib and Seaborn, providing
students with practical skills to interpret and represent data effectively. Additionally,
participants were introduced to key concepts such as data cleaning, exploratory data
analysis (EDA), and the job opportunities available in the rapidly growing field of data
science. The workshop focused on equipping students with real-world skills in data
analytics, particularly Python, and provided insights into a structured roadmap for learning
data analysis. The session covered practical applications, allowing students to work on
datasets and understand the tools used in the field. Although the initial agenda included
tools like Tableau and Power BI, the workshop ultimately focused solely on Python and its
data libraries to give participants a deep dive into one platform.

IMPORTANCE OF DATA SCIENCE
Data is one of the most important assets of any organization because it helps business
leaders make decisions based on facts, statistics, and trends. The growing volume of data gave
rise to data science, a multidisciplinary field that uses scientific approaches, procedures,
algorithms, and frameworks to extract knowledge and insight from large amounts of data.
The data involved can be either structured or unstructured. Data science brings together
data examination, machine learning, and related strategies to understand and analyze
real-world phenomena through data. It extends several data analysis fields such as data
mining, statistics, and predictive analytics, and draws on methods and concepts from
information science, mathematics, and computer science. Techniques used in data science
include machine learning, visualization, pattern recognition, probabilistic modeling, data
engineering, and signal processing.

FUTURE OF DATA SCIENCE


As most fields continue to evolve, the importance of data science is also increasing
rapidly. Its effect can be observed in multiple sectors such as retail, healthcare, and
education. In the healthcare industry, new medicines and techniques are discovered
continuously, and there is a growing requirement for better patient care. With the help of
data science techniques, the healthcare sector can find solutions that help improve the care
of patients. Education is another field where the benefits of data science are clearly visible.
Technologies such as smartphones and laptops have become an important part of the
education system, and data science helps create better opportunities for students to
enhance their knowledge.

TECHNICAL SKILLS
Programming Languages

Python is the primary programming language used in data analytics because of its simplicity,
flexibility, and vast ecosystem of libraries. Analysts rely on Python’s core libraries such as
Pandas and NumPy to manipulate datasets, perform mathematical operations, and handle
structured and unstructured data effectively. Pandas provides robust tools for working with
tabular data, while NumPy supports high-performance numerical computations. Together,
these libraries enable efficient handling of large datasets and form the foundation for data
preprocessing, exploratory analysis, and building advanced models. Python’s readability and
strong community support also make it ideal for both beginners and professionals.
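As a brief illustration of how these two libraries work together, here is a minimal sketch; the column names and values are made up purely for demonstration:

import numpy as np
import pandas as pd

# Build a small DataFrame of hypothetical sales records
sales = pd.DataFrame({
    "item": ["Shirts", "Jeans", "Shoes"],
    "unit_price": [100, 300, 250],
    "quantity": [4, 2, 1],
})

# Pandas: tabular manipulation - add a derived column and summarize
sales["total"] = sales["unit_price"] * sales["quantity"]
print(sales.describe())

# NumPy: fast numerical operations on the underlying arrays
totals = sales["total"].to_numpy()
print(np.mean(totals), np.std(totals))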

Database Management

Database management skills are essential for data analysts, as raw data is often stored across
relational databases. SQL (Structured Query Language) is the industry-standard tool for
retrieving, updating, and managing data in these systems. Proficiency in SQL allows analysts
to query large datasets efficiently, extract meaningful insights, and join multiple tables to
generate comprehensive reports. According to GeeksforGeeks and other learning platforms,
SQL is considered a core skill for anyone working with data. Strong database management
expertise ensures analysts can access clean, structured data, forming a reliable foundation for
further analysis and decision-making in business environments.
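As a hedged sketch of querying a relational database from Python, an analyst might combine the standard sqlite3 module with pandas; the database file and the table/column names below are hypothetical:

import sqlite3
import pandas as pd

# Connect to a (hypothetical) SQLite database file
conn = sqlite3.connect("sales.db")

# SQL query joining two tables to build a report-ready dataset
query = """
SELECT o.order_id, c.region, o.amount
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.customer_id
WHERE o.amount > 100
"""

# pandas executes the query and returns the result as a DataFrame
df = pd.read_sql_query(query, conn)
print(df.head())
conn.close()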

Data Visualization

Data visualization is a critical skill for translating complex datasets into clear, interpretable
insights. Proficiency in Python visualization libraries such as Matplotlib and Seaborn helps
analysts create meaningful charts, graphs, and plots that reveal patterns and trends. For more
interactive and dynamic representations, advanced tools like Plotly or Bokeh can be used,
allowing the creation of dashboards that stakeholders can explore in real time. Effective
visualization not only improves the communication of findings but also supports decision-
making processes. By transforming raw numbers into compelling visuals, data analysts bridge
the gap between technical analysis and business understanding.
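A minimal sketch of the kind of chart an analyst might produce with Matplotlib and Seaborn is shown below; the data is randomly generated only for illustration:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Randomly generated example data: two related variables
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

# Scatter plot of the relationship plus a histogram of one variable
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.scatterplot(x=x, y=y, ax=axes[0])
axes[0].set_title("Relationship between x and y")
sns.histplot(x, bins=20, ax=axes[1])
axes[1].set_title("Distribution of x")
plt.tight_layout()
plt.show()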

6
Statistical Analysis

A solid foundation in statistical analysis is essential for deriving meaningful insights from
data. Analysts need to understand core concepts such as probability, sampling, hypothesis
testing, correlation, and regression analysis to evaluate relationships and test assumptions.
Statistical methods provide the backbone for validating results, ensuring accuracy, and
minimizing errors in conclusions. For instance, regression analysis allows analysts to predict
outcomes and identify factors influencing trends. With these skills, professionals can move
beyond descriptive analysis to inferential and predictive insights. Mastery of statistics ensures
analysts make data-driven decisions that are both reliable and scientifically sound.
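As a small illustration of these ideas, the sketch below runs a simple linear regression and a two-sample hypothesis test on synthetic data using SciPy; the data is generated only for demonstration:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x plus noise
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=100)

# Simple linear regression: slope, intercept, correlation, p-value
result = stats.linregress(x, y)
print("slope:", result.slope, "intercept:", result.intercept)
print("r-squared:", result.rvalue ** 2, "p-value:", result.pvalue)

# Hypothesis test: do two samples have the same mean?
a = rng.normal(loc=50, scale=5, size=30)
b = rng.normal(loc=53, scale=5, size=30)
t_stat, p_value = stats.ttest_ind(a, b)
print("t-statistic:", t_stat, "p-value:", p_value)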

Data Cleaning and Preprocessing

Before meaningful insights can be derived, raw data must undergo thorough cleaning and
preprocessing. This skill involves identifying missing values, handling duplicates, correcting
inconsistencies, and transforming data into suitable formats for analysis. Data cleaning
ensures the accuracy, quality, and reliability of the dataset, while preprocessing steps such as
normalization, feature scaling, and encoding categorical variables prepare the data for
advanced modeling. Poor data quality often leads to misleading results, making this skill
crucial for analysts. By mastering data cleaning and preprocessing, professionals ensure that
every subsequent step of analysis is built on a solid and trustworthy foundation.
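The sketch below walks through a few of these cleaning steps on a small, made-up DataFrame; the column names and rules are assumptions chosen only to illustrate the workflow:

import numpy as np
import pandas as pd

# A small, messy example dataset
df = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Mumbai", None],
    "sales": [100.0, 100.0, np.nan, 250.0],
})

# Remove exact duplicate rows
df = df.drop_duplicates()

# Handle missing values: fill numeric gaps with the median,
# drop rows missing the key field
df["sales"] = df["sales"].fillna(df["sales"].median())
df = df.dropna(subset=["city"])

# Normalize a numeric column to the 0-1 range (min-max scaling)
df["sales_scaled"] = (df["sales"] - df["sales"].min()) / (df["sales"].max() - df["sales"].min())
print(df)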

SOFT SKILLS
Analytical Thinking

Analytical thinking is the foundation of effective data analysis, enabling professionals to
approach complex problems logically and systematically. This skill involves breaking down
larger issues into smaller, manageable components and identifying the relationships between
them. Analysts use this ability to evaluate datasets critically, distinguish between relevant and
irrelevant information, and uncover hidden patterns. Analytical thinking helps in forming
hypotheses, testing them, and interpreting results with clarity. In a business context, this
ensures that conclusions are not only data-driven but also aligned with organizational goals.
Developing strong analytical thinking makes data analysts more efficient and precise
problem-solvers.
Problem-Solving

Problem-solving is an essential soft skill for data analysts, as real-world projects often involve
addressing ambiguous or ill-structured challenges. This skill includes identifying the root
causes of problems, evaluating multiple potential solutions, and selecting the most effective
strategy. Analysts often use problem-solving to design workflows, choose appropriate models,
or resolve issues with incomplete or messy datasets. Strong problem-solving requires
creativity, adaptability, and the ability to weigh trade-offs between speed and accuracy. In
practice, this ensures timely and practical solutions that not only address immediate issues but
also add long-term value to the business or research context.

Communication

Clear and concise communication is vital for data analysts, as they frequently present their
findings to both technical and non-technical stakeholders. This skill involves simplifying
complex statistical or technical concepts into actionable insights that decision-makers can
easily understand. Effective communication includes creating well-structured reports, visually
appealing dashboards, and compelling presentations. It also requires active listening to
understand stakeholder requirements and tailoring messages to different audiences. Analysts
who excel in communication can bridge the gap between data and decision-making, ensuring
that their insights lead to practical business outcomes and are appreciated across departments
and teams.

Attention to Detail

Attention to detail is critical in data analysis because even small mistakes can lead to
misleading conclusions or flawed strategies. This skill requires meticulousness in handling
datasets, verifying calculations, and ensuring consistency throughout the analysis process. For
example, overlooking duplicate entries or mislabeling variables can compromise the accuracy
of results. Strong attention to detail ensures the reliability of data cleaning, statistical testing,
and visualization. In business scenarios, this translates into more trustworthy insights and
reduced risk of errors in decision-making. Cultivating this skill enables analysts to deliver
precise, high-quality work that withstands scrutiny from stakeholders.

2. DATA PRE-PROCESSING AND MANIPULATION

Data preprocessing is a crucial step in any data analysis project. It involves cleaning,
transforming, and organizing raw data into a usable format. Common tasks include
handling missing values, removing duplicates, encoding categorical variables, and
normalizing numerical data. Data manipulation is achieved using tools like Python
(Pandas, NumPy), ensuring datasets are accurate and consistent. This stage improves data
quality and prepares it for visualization and modeling. Effective preprocessing enhances
model performance and helps analysts uncover true patterns in the data, reducing bias and
ensuring that subsequent analysis is both valid and reliable.

Python

Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics. It is a widely used tool for analyzing data.

Pandas

Pandas is an open-source library that provides high-performance data structures for
efficient data manipulation and analysis in Python.

Example of data manipulation tasks performed using pandas:

*Sampling

*Data Cleaning

*Data Transformation

*Data Wrangling

*Data Analysis
Data Structures in Pandas

Series – 1-D labelled array

DataFrame – 2-D labelled structure of rows and columns

Functionality of Pandas

Reading Data from CSV

Data from CSV files can be read using read_csv() as follows:

import pandas as pd
pd.read_csv("purchase_data.csv")

     Item  UnitPrice  Quantity  TotalCost
0  Shirts        100         4        400
1   Jeans        300         2        600
2   Shoes        250         1        250
3    Tops        150         3        450

Reading Data from CSV in Pandas

The read_csv() function in pandas is commonly used to load data from a CSV (Comma-
Separated Values) file into a DataFrame for analysis. It is one of the most used data import
functions in data analytics.

Syntax:

import pandas as pd

df = pd.read_csv('filename.csv')

Key Parameters:

 filepath_or_buffer: The file path to the CSV file.


 sep: Delimiter to use. Default is ','.
 header: Row number to use as the column names.
 index_col: Column(s) to set as the index.
 usecols: Return a subset of columns.
 dtype: Data type for data or columns.

Example:

df = pd.read_csv('amazon_stock.csv')
print(df.head())

This command loads amazon_stock.csv into a DataFrame and displays its first few rows. Once
loaded, the DataFrame df can be manipulated using various pandas operations such as:

 Filtering rows (e.g., df[df['Close'] > 1000])


 Handling missing values (df.fillna(0), df.dropna())
 Renaming columns (df.rename(columns={'old': 'new'}))
 Creating new columns (df['Return'] = df['Close'] / df['Open'])

Using read_csv() is the starting point for most data analysis workflows when working with
structured tabular data, especially in business intelligence, finance, or scientific computing.
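Putting these operations together, a minimal sketch of such a workflow is shown below; it assumes a CSV named amazon_stock.csv with Date, Open, and Close columns, which is an assumption about the file rather than a confirmed schema:

import pandas as pd

# Load the CSV (assumed to contain Date, Open, Close columns) and parse dates
df = pd.read_csv("amazon_stock.csv", parse_dates=["Date"], index_col="Date")

# Handle missing values and tidy the column names
df = df.dropna()
df = df.rename(columns={"Close": "close", "Open": "open"})

# Create a new column and filter rows
df["daily_return"] = df["close"] / df["open"] - 1
high_days = df[df["close"] > 1000]

print(df.head())
print(f"{len(high_days)} days closed above 1000")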

3. DATA VISUALIZATION
Data visualization is the graphical representation of information and data. It provides an accessible way
to see and understand trends, outliers, and patterns in data. Additionally, it gives employees and business
owners an effective way to present data to non-technical audiences without confusion.
Data visualization converts complex datasets into clear, graphical formats that are easier to interpret.
Using tools like Matplotlib, Seaborn, and Plotly in Python, we can generate bar charts, heatmaps,
scatter plots, and histograms. Visualization helps identify trends, outliers, and relationships between
variables. It plays a critical role in communicating findings to stakeholders and supports informed
decision-making. During this project, visualization was used to explore stock data, showing price
trends, volume fluctuations, and volatility patterns in Amazon’s historical stock performance. Clear
visuals enhance comprehension and drive actionable business insights.
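As a hedged example of this kind of exploration, the sketch below plots a closing-price trend and daily volume; it assumes the same amazon_stock.csv file with Date, Close, and Volume columns:

import pandas as pd
import matplotlib.pyplot as plt

# Assumed file with Date, Close, and Volume columns
df = pd.read_csv("amazon_stock.csv", parse_dates=["Date"], index_col="Date")

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6), sharex=True)

# Price trend over time
ax1.plot(df.index, df["Close"], color="tab:blue")
ax1.set_ylabel("Closing price")
ax1.set_title("Amazon stock: price trend and trading volume")

# Volume fluctuations as a bar chart
ax2.bar(df.index, df["Volume"], color="tab:gray", width=1.0)
ax2.set_ylabel("Volume")
ax2.set_xlabel("Date")

plt.tight_layout()
plt.show()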

METHODOLOGY

The methodology for the "Sales Forecasting" project involves a systematic set of procedures that combines
research and technical methods to develop a machine-learning-based solution for retail sales management.
The methodology can be split into a few key areas: data collection and preprocessing, model development,
design and application, and deployment.

Collect and preprocess the data:
1. Data Sources: Collect historical sales data from retail shops, including sales transactions, item details, and
inventory levels over time. External data sources such as market trends, seasonal effects, and promotional
activities may be added to the model for increased accuracy.
2. Data Preprocessing: All data will be cleaned by handling missing values, outliers, and inconsistencies.
Feature engineering will transform the raw data into better features, for example converting dates into
day/month/year components and aggregating sales into weekly or monthly totals.
3. Exploratory Data Analysis (EDA): Trends, patterns, and correlations in the data will be identified.
Visualization techniques like histograms, scatter plots, and line charts will be used to assess sales trends
over time and to indicate which factors most affect sales.

Model Development:
1. Machine Learning Techniques: The focus will be on time series analysis (for example, ARIMA and seasonal
decomposition) and regression techniques such as linear or polynomial regression for predicting future sales.
2. Model Evaluation: Model performance will be evaluated using metrics such as Mean Absolute Error (MAE),
Root Mean Square Error (RMSE), and R-squared. Cross-validation techniques will be employed to ensure the
model's robustness and generalizability, as sketched below.
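The following is a minimal sketch of this evaluation step using scikit-learn on synthetic data; the feature matrix and model choice are placeholders for the real sales features described above:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-in for the sales dataset
X, y = make_regression(n_samples=500, n_features=6, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# MAE, RMSE, and R-squared on the held-out set
print("MAE :", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R2  :", r2_score(y_test, pred))

# 5-fold cross-validation for robustness
cv_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("Cross-validated R2:", cv_scores.mean())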

Developing the Backend and Integration:
1. API Development: Develop a backend system using either Flask or Django, which will be the interface
between the mobile application and the machine learning model. It should handle sales prediction requests
and any changes in inventory; a minimal sketch of such an endpoint follows below.
2. Database Management: Sales and inventory data will be stored in Firebase so that the data stays in
real-time synchrony. An effective database structure will provide efficient data retrieval and storage
mechanisms.
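The Flask sketch below illustrates the prediction endpoint mentioned above; the route name, the expected JSON fields, and the model file name are illustrative assumptions, not the report's actual backend:

import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a previously trained model (hypothetical file name)
with open("sales_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"features": [price, promo_flag, month]}
    payload = request.get_json()
    features = [payload["features"]]
    prediction = model.predict(features)[0]
    return jsonify({"predicted_sales": float(prediction)})

if __name__ == "__main__":
    app.run(debug=True)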

PROJECT DESIGN:

● 1. Workflow Diagram (figure in the original report)

● 2. System Architecture (figure in the original report)

PROJECT IMPLEMENTATION, ALGORITHMS AND METHODS USED:

INTRODUCTION:

“To find out what role certain properties of an item play and how they affect their sales by
understanding Target Mart sales.” To help Target Mart achieve this goal, a predictive
model can be built to find out, for every store, the key factors that can increase sales and
what changes could be made to the product or store’s characteristics.


Software Requirements:

• Windows (XP, 8, or 10)
• Any Python interpreter
• Machine learning modules installed
• Target Mart dataset
Working Explanation:

The aim is to build a predictive model and find out the sales of each product at a particular store.
Using this model, big retailers like Target can try to understand the properties of products and
stores that play a key role in increasing sales. The idea is to find out which properties of a
product and store impact the sales of a product.

We came up with certain hypotheses relating to the various factors of the products and stores in
order to solve the problem statement. We develop a predictive model using different ML
algorithms such as Linear Regression, Polynomial Regression, and Ridge Regression for
forecasting the sales of a business such as Target Corporation. Using this model, we try to
understand the properties of products and stores that play a key role in increasing sales.

• We perform some basic data exploration and come up with inferences about the data.
• Our model uses the Target Corporation sales dataset. After preprocessing and filling
missing values, we used an ensemble of models based on Decision Trees, Linear regression,
Ridge regression, Polynomial regression, and XGBoost regression.

● PREREQUISITES:

The dataset contains about 10 years of daily weather observations from numerous
Australian weather stations.

Python is a multi-paradigm programming language. Object-oriented programming and
structured programming are fully supported, and many of its features support functional
programming and aspect-oriented programming (including metaprogramming and metaobjects,
i.e. magic methods). Many other paradigms are supported via extensions, including design by
contract and logic programming.
Python uses dynamic typing and a combination of reference counting and a cycle-detecting
garbage collector for memory management. It also features dynamic name resolution (late
binding), which binds method and variable names during program execution.
Python's design offers some support for functional programming in the Lisp tradition. It
has functions, list comprehensions, dictionaries, sets, and generator expressions. The
standard library has two modules (itertools and functools) that implement functional tools
borrowed from Haskell and Standard ML.

Regression is a statistical measure that attempts to determine the strength of the
relationship between one dependent variable, usually denoted by Y, and a series of other
changing variables known as independent variables. A regression model that contains more
than one predictor variable is called a multiple regression model.

• MODEL DESIGN:
This shows the architecture diagram of the proposed model, where the different algorithms are
applied to the dataset. Here, we calculate the Accuracy, MAE, MSE, and RMSE and finally
conclude which algorithm yields the best results. The following algorithms are used.
A. Linear Regression:
• Build a scatter plot and check for: 1) a linear or non-linear pattern in the data, and 2) the
variance (outliers). Consider a transformation if the pattern isn't linear. If there are outliers,
consider eliminating them only if there is a non-statistical justification.
• Fit the data to the least-squares line and confirm the model assumptions using the
residual plot (for the constant standard deviation assumption) and the normal probability plot
(for the normality assumption). A transformation might be necessary if the assumptions do not
appear to be met.
• If required, transform the data and construct a least-squares regression line using the
transformed data.
• If a transformation has been completed, return to step 1. If not, continue to step 5.
• When a "good-fit" model is identified, write the least-squares regression line equation,
including the standard error of the estimate and R-squared. • Linear regression formulas
look like this:
Y = θ1x1 + θ2x2 + … + θnxn
R-squared: the proportion of the total variance in Y (the dependent variable) that is explained
by X (the independent variables).
B. Polynomial Regression:
• Polynomial regression is a regression algorithm that models the relationship between the
dependent variable (y) and the independent variable (x) as an nth-degree polynomial. The
equation for polynomial regression is given below:
y = b0 + b1x1 + b2x1^2 + b3x1^3 + … + bnx1^n
• It is often referred to as a special case of multiple linear regression in ML, since
polynomial terms are added to the multiple linear regression equation to convert it into
polynomial regression and improve accuracy.
• The dataset used for training in polynomial regression is of a non-linear nature.
• It uses a linear regression model to fit complex, non-linear functions and datasets.

C. Ridge Regression:
Ridge regression is a model-tuning technique used to analyze data that suffers from
multicollinearity. It performs L2 regularization. When multicollinearity occurs, the
least-squares estimates are unbiased but their variances are large, so the predicted values can
be far from the actual values. The cost function for ridge regression is:
Min (||Y – Xθ||^2 + λ||θ||^2)

D. XGBoost Regression:
“Extreme Gradient Boosting” is similar to, but much more efficient than, the standard
gradient boosting framework. It has both a linear model solver and a tree learning algorithm,
which allows XGBoost to run several times faster than earlier gradient boosting
implementations. It supports various objective functions, including regression, classification,
and ranking. Because XGBoost has very high predictive power but is relatively slow to deploy,
it is well suited for competitions. It also has additional functionality for cross-validation
and for finding important variables.
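To make the comparison concrete, the sketch below fits the four model families on synthetic data and reports RMSE for each; it assumes the optional xgboost package is installed, and the generated data is only a stand-in for the Target Mart dataset:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor  # requires the xgboost package

X, y = make_regression(n_samples=800, n_features=8, noise=20.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

models = {
    "Linear": LinearRegression(),
    "Polynomial (deg 2)": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "Ridge": Ridge(alpha=1.0),
    "XGBoost": XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1),
}

# Fit each model and compare RMSE on the held-out set
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{name:20s} RMSE = {rmse:.2f}")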

● Methodology:

We will explore the problem in the following stages:

1. Hypothesis Generation - understanding the problem better by brainstorming possible
factors that can impact the outcome.

2. Data Preprocessing - looking at categorical and continuous feature summaries and
making inferences about the data.

3. Data Analysis - imputing missing values in the data and checking for outliers.

4. Feature Engineering - modifying existing variables and creating new ones for analysis.

5. Implementing the model - making predictive models on the data.

4. MACHINE LEARNING
Machine learning (ML) applies algorithms to train models from data and make predictions
or classifications. In this project, supervised learning techniques like Linear Regression
and Random Forest were used to forecast Amazon's stock prices. The model was trained
using historical data, and metrics like Mean Squared Error (MSE) were applied to assess
accuracy. ML helps businesses automate decision-making, detect anomalies, and
personalize customer experiences. Feature selection, model tuning, and validation were
integral parts of the process. Through this, we gained a deeper understanding of how data-
driven models are developed, tested, and deployed in real-world scenarios.
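The sketch below outlines this kind of workflow; it assumes the amazon_stock.csv file described earlier and uses simple lag features, which are an illustrative choice rather than the exact features used in the project:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Assumed file with Date and Close columns
df = pd.read_csv("amazon_stock.csv", parse_dates=["Date"], index_col="Date").sort_index()

# Illustrative features: previous days' closing prices
for lag in (1, 2, 3):
    df[f"lag_{lag}"] = df["Close"].shift(lag)
df = df.dropna()

X = df[["lag_1", "lag_2", "lag_3"]]
y = df["Close"]

# Chronological split: train on the earlier 80%, test on the most recent 20%
split = int(len(df) * 0.8)
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

for model in (LinearRegression(), RandomForestRegressor(n_estimators=200, random_state=0)):
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(type(model).__name__, "MSE:", round(mse, 2))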
Machine learning is a subset of artificial intelligence. It permits a system to automatically
learn and improve from experience without being explicitly programmed.

Supervised Machine Learning

Learning from labeled data and applying that knowledge to predict the label of new data
(test data) is known as supervised learning.

Unsupervised Learning

Unsupervised learning uses grouping/clustering techniques to group unlabeled data together
based on its characteristics and to predict the cluster of new data (test data).

Data Pre-processing Techniques

 Impute Missing Values


 Handling Categorical values
 Scaling the Data
 Normalization
 Feature Selection

Train/Test Split

Machine Learning algorithms divide the datasets into two parts:

1. Training set

2. Testing set

Test-Set Conditions:

 Should be able to provide meaningful results statistically


 Should represent the dataset as a whole, that is, with the same
characteristics as a training set

Linear Regression vs. Logistic Regression

Linear Regression:
• Predicted values are continuous variables
• Solves regression problems
• Represented as a straight line
• Example: analysis of sales data with monthly sales

Logistic Regression:
• Predicted values are discrete variables
• Solves classification problems
• Represented as an S-curve
• Example: classification of e-mails as spam or not

5. DEEP LEARNING FOR IMAGES

Deep learning, a subset of machine learning, uses artificial neural networks to model
complex patterns. For image data, convolutional neural networks (CNNs) are widely used
due to their ability to capture spatial hierarchies. In this project, deep learning was
explored for image classification tasks using TensorFlow and Keras. Datasets such as
MNIST or CIFAR-10 were used to train models to identify patterns and recognize visual
elements. Techniques like pooling, dropout, and activation functions were applied to
improve accuracy. Deep learning has revolutionized industries like healthcare and security,
making it a powerful tool for handling unstructured data like images.
Convolutional Neural Network (CNN)

 Layers in a CNN:

Convolution --> ReLU --> Pooling --> Fully connected

 Reduces the number of parameters by sharing weights (feature mapping)
 Can easily work with high-resolution images

Pooling Layer (Spatial Pooling)

 Reduces the image size by eliminating parameters; spatial pooling is also called subsampling or downsampling
 Reduces the dimensionality based on the filter
 The processed image still retains all the important information
Types of Pooling Layers

Max Pooling: takes the largest element from the rectified feature map

Average Pooling: performs down-sampling by averaging the input

Sum Pooling: takes the sum of all elements in the feature map

Predict Zero Using MNIST Dataset

 Padding =0 for edge detection


 1=darkest pixels
 -1=lightest pixel
 Stride=1

Image Classification Using CNN


1.Loading the dataset and unzipping 2.Initializing the
parameters
3. Checking for the channel first and rescaling the image

4. Building CNN model

5. Fitting and compiling the model


6.Prediction
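A minimal sketch of these steps using TensorFlow/Keras on the MNIST dataset is shown below; the layer sizes and training settings are illustrative choices, not the report's final configuration:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# 1-3. Load the dataset, rescale pixel values, and add a channel dimension
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.astype("float32")[..., np.newaxis] / 255.0
x_test = x_test.astype("float32")[..., np.newaxis] / 255.0

# 4. Build the CNN: convolution -> ReLU -> pooling -> fully connected
model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.25),
    layers.Dense(10, activation="softmax"),
])

# 5. Compile and fit the model
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, batch_size=128, validation_split=0.1)

# 6. Prediction on a few test images
pred = model.predict(x_test[:5]).argmax(axis=1)
print("Predicted digits:", pred, "True digits:", y_test[:5])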

6. TABLEAU
Tableau is a powerful and fast-growing data visualization tool used in the Business Intelligence (BI) industry.

*Helping companies to harness the strength of their most valuable assets: data and people

*Empowers users to observe and understand their data

*Using the common ability of people to easily identify visual patterns

*Allows users to spend more time on data analysis and less on "data wrangling"

Tableau is a powerful data visualization tool that allows users to create interactive and shareable
dashboards. It connects seamlessly with various data sources like Excel, SQL, and cloud platforms.
During the internship, Tableau was used to design dynamic dashboards displaying KPIs and trends in
Amazon stock performance. Features like filters, parameters, calculated fields, and storyboards helped
convey insights effectively. Tableau’s drag-and-drop interface enables rapid development and real-
time analysis, making it ideal for business intelligence tasks. The visual appeal and interactivity of
Tableau dashboards greatly enhance the presentation and understanding of complex datasets for both
technical and non-technical audiences.

Data Visualization using Tableau

*Demonstrates the power of Data Visualization

*This demonstration uses datasets (Anscombe's quartet) that are almost identical in their basic descriptive statistics but appear very different when graphed; each consists of eleven (x, y) points

Data Granularity

Granularity is the level of depth represented by the data in a fact or dimension table.

High Granularity: detailed view of data and transactions

Low Granularity: zooms out into a summary view of data and transactions

Granularity refers to the level of detail or depth captured in a dataset, particularly in data warehouses and data marts. It
defines how much information is stored in each record and how specific or summarized the data is.

1. High Granularity (Fine-Grained Data):

 Definition: Each data record captures detailed information.


 Characteristics:
o More rows (larger datasets).
o Very specific, often transactional-level data.
o Ideal for drill-down analysis and real-time monitoring.
 Examples:
o Each individual sales transaction.
o Daily temperature readings per city.
o A customer’s click-by-click web activity.
 Use Case: Used in operational reporting, fraud detection, and personalized marketing.

2. Low Granularity (Coarse-Grained Data):

 Definition: Each data record summarizes a group of events or data points.


 Characteristics:
o Fewer rows.
o Aggregated information over time, category, or region.
o Useful for performance summaries or executive dashboards.
 Examples:
o Monthly sales totals by region.
o Average daily temperature over a year.
o Yearly revenue per product line.
 Use Case: Used in strategic decision-making, forecasting, and KPI tracking.

Why Data Granularity Matters:

 Performance: High granularity can slow down queries due to larger datasets, whereas low granularity improves
performance but reduces detail.
 Storage: Detailed data consumes more space.
 Flexibility: High-granularity data offers more flexibility in slicing and dicing the data during analysis.
 Accuracy: Aggregated data may obscure anomalies or patterns visible in detailed data.

Best Practices:

 Define the required granularity based on business goals.


 Store high-granularity data when long-term analysis or traceability is required.
 Use data aggregation techniques when building dashboards or summary reports for better performance.

Set

*A set is a collection of well-defined objects

*Sets created by the Tableau user are shown with the basic set icon

*Sets that come from a multidimensional data source are shown with the server-named set icon

*The action set icon is created automatically by Tableau when a set action is performed

*User filters display in the Sets area of the Data window with the user filter set icon
7. Sharing Insights through Dashboards

Using Charts Effectively - Right Chart Type

Choose the right chart type that will help convey the information more effectively

The appropriate chart will reveal patterns and trends, helping your audience understand the significance of
the data set that you visualize.
1. Importing Libraries:

The first step in any data analysis project is importing the necessary libraries.

# Load Data Set: The dataset can be loaded using the read_csv() method.

2. Data Preprocessing:

Real-world data is often messy, incomplete, unstructured, inconsistent, redundant, and sprinkled with odd values.
Without deploying data preprocessing techniques, it is almost impossible to gain insights from raw data. Data
preprocessing is the process of converting raw data into a suitable format for extracting insights. It is the first
and foremost step in the data science life cycle, and it makes sure that the data is clean, organized, and ready to
feed to the machine learning model.

● The dataset has two data types: float64 and object.

● Except for the Date and Location columns, every column has missing values.

Let's generate descriptive statistics for the dataset using the describe() function in pandas.
Descriptive Statistics: used to summarize and describe the features of the data in a meaningful way and to extract
insights. It uses two types of statistics to describe or summarize data:
● Measures of central tendency

● Measures of spread.

3. Cardinality check for Categorical features:

The accuracy and performance of a classifier depend not only on the model we use but also on how we preprocess
the data and what kind of data we feed to the classifier.
Many machine learning algorithms like Linear Regression, Logistic Regression, and k-nearest neighbors can handle
only numerical data, so encoding categorical data to numeric values becomes a necessary step.
But before jumping into encoding, check the cardinality of each categorical feature.
Cardinality: the number of unique values in each categorical feature is known as its cardinality.
A feature with a high number of distinct/unique values is a high-cardinality feature. A categorical feature with
hundreds of zip codes is a classic example.
High-cardinality features pose serious problems, such as greatly increasing the number of dimensions of the data
when the feature is encoded, which is not good for the model.
There are many ways to handle high cardinality; one is feature engineering, and another is simply dropping the
feature if it doesn't add any value to the model.

4. Handling Missing Values:

Most machine learning algorithms can't handle missing values, which cause problems, so they need to be addressed
first. There are many techniques to identify and impute missing values.

If a dataset containing missing values is loaded using pandas, the missing values are represented as NaN (Not a
Number). These NaN values can be identified using methods like isna() or isnull(), and they can be imputed
using fillna(). This process is known as missing data imputation.

# Handling Missing values in Categorical Features:

big_mart_data.isnull().sum()

Missing values in numerical features can be imputed using the mean or the median. The mean is sensitive to
outliers while the median is robust to them, so if you want to impute missing values with the mean, outliers in the
numerical features need to be addressed first.
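For example, a hedged sketch of median imputation might look like the following; the CSV file name and the Item_Weight column are assumptions about the BigMart-style dataset used in this walkthrough:

import pandas as pd

# Assumed CSV name for the BigMart-style dataset used in this walkthrough
big_mart_data = pd.read_csv("big_mart_data.csv")

# Count missing values per column
print(big_mart_data.isnull().sum())

# Impute a numeric column (assumed name 'Item_Weight') with its median,
# which is robust to outliers, unlike the mean
median_weight = big_mart_data["Item_Weight"].median()
big_mart_data["Item_Weight"] = big_mart_data["Item_Weight"].fillna(median_weight)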

5. Outliers detection and treatment:

An outlier is an observation that lies an abnormal distance from the other values in a given sample. Outliers can be
detected using visualization (such as box plots and scatter plots), the Z-score, and other statistical and
probabilistic algorithms.
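A small sketch of outlier detection with the IQR rule and a box plot is shown below; the file and column names are again assumptions made for illustration:

import pandas as pd
import matplotlib.pyplot as plt

big_mart_data = pd.read_csv("big_mart_data.csv")  # assumed file name
col = big_mart_data["Item_Visibility"]            # assumed numeric column

# IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
outliers = col[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]
print("Outliers found:", len(outliers))

# Visual check with a box plot
col.plot(kind="box")
plt.show()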

It’s time to do some analysis on each feature to understand the data and get some insights.

6. Exploratory Data Analysis:

Exploratory Data Analysis (EDA) is a technique used to analyze, visualize, investigate, interpret, discover, and
summarize data. It helps data scientists extract trends, patterns, and relationships from data.

# Outlet_Establishment_Year column:

plt.figure(figsize=(6,6))
sns.countplot(x='Outlet_Establishment_Year', data=big_mart_data)
plt.show()

# Item_Fat_Content column:

plt.figure(figsize=(6,6))
sns.countplot(x='Item_Fat_Content', data=big_mart_data)
plt.show()

# Item_Type column:

plt.figure(figsize=(30,6))
sns.countplot(x='Item_Type', data=big_mart_data)
plt.show()

7. Encoding of Categorical Features:

Most machine learning algorithms like Logistic Regression, Support Vector Machines, and K-Nearest Neighbours
can't handle categorical data. Hence, categorical data needs to be converted to numerical data for modeling, which
is called feature encoding.

There are many feature encoding techniques, such as one-hot encoding and label encoding. In this particular
walkthrough, the replace() function is used to encode categorical data as numerical data.
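A short sketch of this replace()-based encoding is shown below; the Item_Fat_Content categories follow the column shown in the EDA plots above, and the numeric codes are arbitrary choices for illustration:

import pandas as pd

big_mart_data = pd.read_csv("big_mart_data.csv")  # assumed file name

# Map each category in Item_Fat_Content to a numeric code (codes are arbitrary)
big_mart_data["Item_Fat_Content"] = big_mart_data["Item_Fat_Content"].replace(
    {"Low Fat": 0, "Regular": 1}
)
print(big_mart_data["Item_Fat_Content"].value_counts())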

8. Correlation:

Correlation is a statistic that measures the strength of the relationship between two features. It is used in
bivariate analysis. Correlation can be calculated with the corr() method in pandas.

# Splitting data into Independent Features and Dependent Features:

For feature importance and feature scaling, we need to split data into independent and dependent features.
● X – Independent Features or Input features

● y – Dependent Features or target label.

9. Feature Importance:

● Machine learning model performance depends on the features used to train the model. Feature importance
describes which features are relevant for building a model.
● Feature importance refers to techniques that assign a score to input features based on how useful they are
at predicting a target variable. Feature importance helps in feature selection.

We'll be using the ExtraTreesRegressor class for feature importance. This class implements a meta-estimator that
fits a number of randomized decision trees on various sub-samples of the dataset and uses averaging to improve
predictive accuracy and control over-fitting.
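A hedged sketch of this step is shown below; X and y stand for the independent features and target produced by the split above, and here they are generated synthetically so the snippet runs on its own:

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in for the X (features) and y (target) split described above
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Fit randomized trees and read off the averaged feature importances
model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=feature_names).sort_values(ascending=False)
print(importances)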

10. Feature Scaling:

Feature scaling is a technique used to scale, normalize, or standardize data. When the columns of a dataset have
very different ranges of values, scaling brings the data in each column to a common level. StandardScaler is a class
used to implement feature scaling by standardizing features to zero mean and unit variance.
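For instance, a minimal sketch of scaling a numeric feature matrix with StandardScaler (the input values are made up):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two numeric columns on very different scales (made-up values)
X = np.array([[100.0, 1.2], [300.0, 0.8], [250.0, 2.5], [150.0, 1.9]])

# Standardize each column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0).round(3), X_scaled.std(axis=0).round(3))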

11. Model Building:

Here, a Logistic Regression algorithm is used to build a predictive model that predicts whether or not it will rain
tomorrow.
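A minimal sketch of such a model is shown below; the weather.csv file, the RainTomorrow target, and the feature columns are assumptions about the weather-style dataset referenced in this walkthrough:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assumed CSV with numeric features and a binary RainTomorrow column (0/1)
data = pd.read_csv("weather.csv")
X = data[["Humidity3pm", "Pressure3pm", "Rainfall"]]  # assumed feature columns
y = data["RainTomorrow"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))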

CHAPTER IV

IMPLEMENTED SCREENSHOTS

(Screenshots 1–3 of the implemented analysis appear as figures in the original report.)
CHAPTER V

CONCLUSION:

In this work, the effectiveness of various algorithms on the revenue data was reviewed to find the
best-performing algorithm. We propose software that uses a regression approach for predicting
sales based on past sales data, through which the prediction accuracy of linear regression,
polynomial regression, Ridge regression, and XGBoost regression can be compared. We can
conclude that Ridge and XGBoost regression give better predictions than the linear and
polynomial regression approaches. Forecasting sales and building a sales plan can help avoid
unforeseen cash flow problems and manage production, staffing, and financing needs more
effectively. In future work, we can also consider the ARIMA model, which models the time
series directly.
