Data Analysis
1. Introduction
3. Data Visualization
4. Machine Learning
6. Tableau
9. Conclusion
10. References
1. INTRODUCTION
Data analysis is a process of inspecting, cleansing, transforming, and modelling data with the
goal of discovering useful information, informing conclusions, and supporting decision-
making. Data analysis has multiple facets and approaches, encompassing diverse techniques
under a variety of names, and is used in different business, science, and social science
domains. In today's business world, data analysis plays a role in making decisions more
scientific and helping businesses operate more effectively.
Data Analytics
Data analytics is the process of examining raw data to uncover patterns, trends, and insights
for decision-making. This involves several key components such as data preprocessing, data
visualization, statistical modeling, and the use of machine learning techniques.
Key Components of Data Analytics
1. Data Preprocessing: Cleaning and preparing raw data for analysis.
2. Data Visualization: Presenting insights using charts, graphs, and dashboards.
3. Machine Learning: Building predictive models for decision-making.
4. Advanced Tools: Tools like Tableau for visualization and deep learning for complex tasks
such as image recognition.
Importance of Data Analytics
Data analytics helps organizations uncover hidden patterns, market trends, and customer
preferences. By leveraging analytics, businesses can make informed decisions, reduce risks,
and operate more efficiently.
OBJECTIVE
The objective of the workshop was to introduce participants to Python for data analysis
and give them a clear roadmap for developing their skills in this field. The session also
covered data visualization techniques using libraries such as Matplotlib and Seaborn,
providing students with practical skills to interpret and represent data effectively.
Additionally, participants were introduced to key concepts such as data cleaning,
exploratory data analysis (EDA), and the job opportunities available in the rapidly growing
field of data science. The workshop focused on equipping students with real-world skills in
data analytics, particularly Python, and provided insights into a structured roadmap for
learning data analysis. The session covered practical applications, allowing students to
work on datasets and understand the tools used in the field. Although the initial agenda
included tools like Tableau and Power BI, the workshop ultimately focused solely on
Python and its data libraries to give participants a deep dive into one platform.
IMPORTANCE OF DATA SCIENCE
Data is one of the most important assets of every organization because it helps business leaders make decisions based on facts, statistics, and trends. The growing scope of data gave rise to data science, a multidisciplinary field that uses scientific approaches, processes, algorithms, and frameworks to extract knowledge and insight from huge amounts of data.
The data involved can be either structured or unstructured. Data science brings together ideas, data examination, machine learning, and related strategies to understand and analyse real-world phenomena with data. It is an extension of various data analysis fields such as data mining, statistics, and predictive analysis. Data science is a broad field that draws on methods and concepts from information science, statistics, mathematics, and computer science. Techniques used in data science include machine learning, visualization, pattern recognition, probability modelling, data engineering, and signal processing.
TECHNICAL SKILLS
Programming Languages
Python is the primary programming language used in data analytics because of its simplicity,
flexibility, and vast ecosystem of libraries. Analysts rely on Python’s core libraries such as
Pandas and NumPy to manipulate datasets, perform mathematical operations, and handle
structured and unstructured data effectively. Pandas provides robust tools for working with
tabular data, while NumPy supports high-performance numerical computations. Together,
these libraries enable efficient handling of large datasets and form the foundation for data
preprocessing, exploratory analysis, and building advanced models. Python’s readability and
strong community support also make it ideal for both beginners and professionals.
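As a minimal sketch of how these two libraries work together (the column names and values below are made up for illustration):
import pandas as pd
import numpy as np

# Build a small tabular dataset (hypothetical sales figures)
df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units": [120, 95, 140, 110],
    "price": [9.99, 9.99, 10.49, 10.49],
})

# Pandas handles the tabular manipulation ...
df["revenue"] = df["units"] * df["price"]
summary = df.groupby("region")["revenue"].sum()

# ... while NumPy provides fast numerical routines on the underlying arrays
log_revenue = np.log(df["revenue"].to_numpy())

print(summary)
print(log_revenue.round(2))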
Database Management
Database management skills are essential for data analysts, as raw data is often stored across
relational databases. SQL (Structured Query Language) is the industry-standard tool for
retrieving, updating, and managing data in these systems. Proficiency in SQL allows analysts
to query large datasets efficiently, extract meaningful insights, and join multiple tables to
generate comprehensive reports. According to GeeksforGeeks and other learning platforms,
SQL is considered a core skill for anyone working with data. Strong database management
expertise ensures analysts can access clean, structured data, forming a reliable foundation for
further analysis and decision-making in business environments.
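A small illustration of querying a relational database from Python; the database file, table, and column names here are hypothetical, and SQLite stands in for whatever database the organization uses:
import sqlite3
import pandas as pd

# Connect to a local SQLite database (file name is assumed)
conn = sqlite3.connect("sales.db")

# SQL does the heavy lifting: filtering, joining and aggregating inside the database
query = """
SELECT c.region, SUM(o.amount) AS total_sales
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
GROUP BY c.region
ORDER BY total_sales DESC;
"""

# Pull the aggregated result straight into a DataFrame for further analysis
region_sales = pd.read_sql_query(query, conn)
print(region_sales)
conn.close()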
Data Visualization
Data visualization is a critical skill for translating complex datasets into clear, interpretable
insights. Proficiency in Python visualization libraries such as Matplotlib and Seaborn helps
analysts create meaningful charts, graphs, and plots that reveal patterns and trends. For more
interactive and dynamic representations, advanced tools like Plotly or Bokeh can be used,
allowing the creation of dashboards that stakeholders can explore in real time. Effective
visualization not only improves the communication of findings but also supports decision-
making processes. By transforming raw numbers into compelling visuals, data analysts bridge
the gap between technical analysis and business understanding.
Statistical Analysis
A solid foundation in statistical analysis is essential for deriving meaningful insights from
data. Analysts need to understand core concepts such as probability, sampling, hypothesis
testing, correlation, and regression analysis to evaluate relationships and test assumptions.
Statistical methods provide the backbone for validating results, ensuring accuracy, and
minimizing errors in conclusions. For instance, regression analysis allows analysts to predict
outcomes and identify factors influencing trends. With these skills, professionals can move
beyond descriptive analysis to inferential and predictive insights. Mastery of statistics ensures
analysts make data-driven decisions that are both reliable and scientifically sound.
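As a small illustration of correlation and simple regression in Python (purely synthetic data, for demonstration only):
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic advertising spend vs. sales data with some noise
ad_spend = rng.uniform(10, 100, size=50)
sales = 3.5 * ad_spend + rng.normal(0, 20, size=50)

# Correlation: strength of the linear relationship
r, r_pvalue = stats.pearsonr(ad_spend, sales)

# Simple linear regression: predict sales from ad spend
result = stats.linregress(ad_spend, sales)

print(f"correlation r = {r:.2f} (p = {r_pvalue:.3f})")
print(f"slope = {result.slope:.2f}, intercept = {result.intercept:.2f}, R^2 = {result.rvalue**2:.2f}")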
Data Cleaning and Preprocessing
Before meaningful insights can be derived, raw data must undergo thorough cleaning and
preprocessing. This skill involves identifying missing values, handling duplicates, correcting
inconsistencies, and transforming data into suitable formats for analysis. Data cleaning
ensures the accuracy, quality, and reliability of the dataset, while preprocessing steps such as
normalization, feature scaling, and encoding categorical variables prepare the data for
advanced modeling. Poor data quality often leads to misleading results, making this skill
crucial for analysts. By mastering data cleaning and preprocessing, professionals ensure that
every subsequent step of analysis is built on a solid and trustworthy foundation.
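A brief sketch of typical cleaning steps in pandas; the file name and the "age" and "city" columns are hypothetical:
import pandas as pd

# Load a raw dataset (file name is assumed)
df = pd.read_csv("raw_data.csv")

# Inspect how many values are missing in each column
print(df.isnull().sum())

# Drop exact duplicate rows
df = df.drop_duplicates()

# Impute numeric gaps with the median and categorical gaps with the mode (example columns)
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Fix inconsistent text formatting
df["city"] = df["city"].str.strip().str.title()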
SOFT SKILLS
Analytical Thinking
Developing strong analytical thinking makes data analysts more efficient and precise
problem-solvers.
Problem-Solving
Problem-solving is an essential soft skill for data analysts, as real-world projects often involve
addressing ambiguous or ill-structured challenges. This skill includes identifying the root
causes of problems, evaluating multiple potential solutions, and selecting the most effective
strategy. Analysts often use problem-solving to design workflows, choose appropriate models,
or resolve issues with incomplete or messy datasets. Strong problem-solving requires
creativity, adaptability, and the ability to weigh trade-offs between speed and accuracy. In
practice, this ensures timely and practical solutions that not only address immediate issues but
also add long-term value to the business or research context.
Communication
Clear and concise communication is vital for data analysts, as they frequently present their
findings to both technical and non-technical stakeholders. This skill involves simplifying
complex statistical or technical concepts into actionable insights that decision-makers can
easily understand. Effective communication includes creating well-structured reports, visually
appealing dashboards, and compelling presentations. It also requires active listening to
understand stakeholder requirements and tailoring messages to different audiences. Analysts
who excel in communication can bridge the gap between data and decision-making, ensuring
that their insights lead to practical business outcomes and are appreciated across departments
and teams.
Attention to Detail
Attention to detail is critical in data analysis because even small mistakes can lead to
misleading conclusions or flawed strategies. This skill requires meticulousness in handling
datasets, verifying calculations, and ensuring consistency throughout the analysis process. For
example, overlooking duplicate entries or mislabeling variables can compromise the accuracy
of results. Strong attention to detail ensures the reliability of data cleaning, statistical testing,
and visualization. In business scenarios, this translates into more trustworthy insights and
reduced risk of errors in decision-making. Cultivating this skill enables analysts to deliver
precise, high-quality work that withstands scrutiny from stakeholders.
Data preprocessing is a crucial step in any data analysis project. It involves cleaning,
transforming, and organizing raw data into a usable format. Common tasks include
handling missing values, removing duplicates, encoding categorical variables, and
normalizing numerical data. Data manipulation is achieved using tools like Python
(Pandas, NumPy), ensuring datasets are accurate and consistent. This stage improves data
quality and prepares it for visualization and modeling. Effective preprocessing enhances
model performance and helps analysts uncover true patterns in the data, reducing bias and
ensuring that subsequent analysis is both valid and reliable.
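For example, categorical encoding and normalization might look like this with pandas and scikit-learn (the column names and values are illustrative):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "item_type": ["Food", "Drink", "Food", "Household"],
    "price": [249.0, 48.5, 120.0, 560.0],
})

# One-hot encode the categorical column
df = pd.get_dummies(df, columns=["item_type"])

# Normalize the numeric column to the 0-1 range
scaler = MinMaxScaler()
df[["price"]] = scaler.fit_transform(df[["price"]])

print(df.head())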
Python
Pandas
*Sampling
*Data Cleaning
*Data Transformation
*Data Wrangling
*Data Analysis
Data Structures in Pandas
Functionality of Pandas
The read_csv() function in pandas is commonly used to load data from a CSV (Comma-
Separated Values) file into a DataFrame for analysis. It is one of the most used data import
functions in data analytics.
Syntax:
import pandas as pd
df = pd.read_csv('filename.csv')
Key Parameters: filepath_or_buffer (the path to the file), sep (the delimiter, ',' by default), header (which row to use for column names), index_col (which column to use as the row index), na_values (extra strings to treat as missing), and parse_dates (columns to parse as dates).
Example:
df = pd.read_csv('amazon_stock.csv')
print(df.head())
This reads amazon_stock.csv into a DataFrame and displays its first five rows. Once
loaded, the DataFrame df can be manipulated using various pandas operations such as df.info(), df.describe(), sorting, filtering, and grouping.
Using read_csv() is the starting point for most data analysis workflows when working with
structured tabular data, especially in business intelligence, finance, or scientific computing.
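A few common follow-up operations on the df loaded above, as a sketch; the 'Date', 'Close', and 'Volume' column names are assumptions about this particular file:
df.info()                      # column types and non-null counts
df.describe()                  # summary statistics for numeric columns
df['Close'].mean()             # average closing price (assumed column name)
df.sort_values('Date').head()  # earliest records (assumed column name)
df[df['Volume'] > df['Volume'].median()]  # filter high-volume days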
3. Data Visualization
Data visualization is the graphical representation of information and data. It provides an accessible way
to see and understand trends, outliers, and patterns in data. Additionally, it gives employees and
business owners an excellent way to present data to non-technical audiences without confusion.
Data visualization converts complex datasets into clear, graphical formats that are easier to interpret.
Using tools like Matplotlib, Seaborn, and Plotly in Python, we can generate bar charts, heatmaps,
scatter plots, and histograms. Visualization helps identify trends, outliers, and relationships between
variables. It plays a critical role in communicating findings to stakeholders and supports informed
decision-making. During this project, visualization was used to explore stock data, showing price
trends, volume fluctuations, and volatility patterns in Amazon’s historical stock performance. Clear
visuals enhance comprehension and drive actionable business insights.
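A small sketch of the kind of plot used here, assuming a CSV with 'Date' and 'Close' columns (the file and column names are assumptions):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('amazon_stock.csv', parse_dates=['Date'])  # column names assumed

plt.figure(figsize=(10, 5))
plt.plot(df['Date'], df['Close'], color='steelblue')
plt.title('Amazon Closing Price Over Time')
plt.xlabel('Date')
plt.ylabel('Close price (USD)')
plt.tight_layout()
plt.show()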
METHODOLOGY
The methodology for the "Sales Forecasting" project involves a systematic set of procedures that
combine different research and technical methods to develop a machine learning based solution for
retail sales management. The methodology can be split into a few key areas: data collection, model
development, design and application, and deployment.
1. Data Collection and Preprocessing:
1. Data Sources: Collect historical sales data from retail shops, including sales transactions, item details, and inventory levels over time. External data sources such as market trends, seasonal effects, and promotional activities may be added to the model for increased accuracy.
2. Data Preprocessing: The data will be cleaned by handling missing values, outliers, and inconsistencies. Feature engineering will transform the raw data into more useful features, for example converting dates into day/month/year components and aggregating sales into weekly or monthly totals.
2. Exploratory Data Analysis (EDA): Trends, patterns, and correlations in the data will be identified. Visualization techniques such as histograms, scatter plots, and line charts will be used to assess sales trends over time and to indicate which factors most affect sales.
3. Model Development:
1. Machine Learning Techniques: The focus will be on time-series analysis (for example, ARIMA and seasonal decomposition) and regression techniques such as linear or polynomial regression to predict future sales.
2. Model Evaluation: Model performance will be evaluated using metrics such as Mean Absolute Error, Root Mean Square Error, and R-squared. Cross-validation techniques will be employed to ensure the model's robustness and generalizability. A sketch of this evaluation step is shown below.
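A hedged sketch of the evaluation step described above, using scikit-learn; the feature matrix X and target y stand in for the preprocessed sales data and are generated synthetically here:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-ins for the engineered features and weekly sales totals
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -1.5, 2.0, 0.5]) + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("MAE :", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R^2 :", r2_score(y_test, pred))

# Cross-validation to check robustness
print("CV R^2:", cross_val_score(model, X, y, cv=5, scoring="r2").mean())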
4. Developing the Backend and Integrating:
1. API Development: Develop a backend system using either Flask or Django to serve as the interface between the mobile application and the machine learning model. It should handle the sales-prediction requests and any changes in inventory; a minimal sketch of such an endpoint follows.
2. Database Management: Sales and inventory data will be stored in Firebase so that the data stays in real-time synchrony. An effective database structure will be designed to provide efficient data retrieval and storage.
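A minimal sketch of what such a prediction endpoint could look like in Flask; the model file name and the expected input fields are hypothetical, not the project's actual interface:
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a pre-trained sales forecasting model (file name is assumed)
with open("sales_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect JSON like {"features": [store_id, month, promo_flag, last_week_sales]}
    payload = request.get_json()
    prediction = model.predict([payload["features"]])[0]
    return jsonify({"predicted_sales": float(prediction)})

if __name__ == "__main__":
    app.run(debug=True)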
PROJECT DESIGN:
● 1.Workflow Diagram:
System Architecture:
INTRODUCTION:
"To find out what role certain properties of an item play and how they affect their sales by
understanding Target Mart sales." To help Target Mart achieve this goal, a predictive
model can be built to find, for every store, the key factors that can increase sales and
what changes could be made to the product and store attributes.
The aim is to build a predictive model to find the sales of each product at a particular store.
Using this model, big marts like Target can understand the properties of products and
stores that play a key role in increasing sales. The idea is therefore to identify which product
and store properties impact the sales of a product.
We came up with certain hypotheses relating to the various factors of the products and stores in
order to solve the problem statement. We developed a predictive model using different ML
algorithms, such as linear regression, polynomial regression, and ridge regression, for forecasting
the sales of a business such as Target Corporation. Using this model, we try to understand the
properties of products and stores that play a key role in increasing sales.
• We perform some basic data exploration and draw inferences about the data.
• The model uses a Target Corporation sales dataset. After preprocessing and filling missing
values, we used an ensemble of regressors built from decision trees, linear regression,
ridge regression, polynomial regression, and XGBoost regression.
● PREREQUISITES:
The dataset contains about 10 years of daily weather observations from numerous
Australian weather stations.
Many other paradigms are supported via extensions, including design by contract
and logic programming.
Python uses dynamic typing and a combination of reference counting and a cycle-
detecting garbage collector for memory management. It also features dynamic name
resolution (late binding), which binds method and variable names during program
execution.
Python's design offers some support for functional programming in the Lisp tradition. It
has first-class functions, list comprehensions, dictionaries, sets, and generator expressions. The
standard library includes two modules (itertools and functools) that implement functional tools
borrowed from Haskell and Standard ML.
Regression is a statistical measure that attempts to determine the strength of the
relationship between one dependent variable, usually denoted by Y, and a series of other
changing variables known as independent variables. A regression model that contains more
than one predictor variable is called a multiple regression model.
• MODEL DESIGN:
The architecture diagram of the proposed model shows how the different algorithms are
applied to the dataset. We calculate the accuracy, MAE, MSE, and RMSE for each model and
conclude which algorithm yields the best results. The following algorithms are used.
A. Linear Regression:
• Build a scatter plot and check for 1) a linear or non-linear pattern in the data and 2) unusual
variance (outliers). Consider a transformation if the pattern is not linear; outliers should only be
removed if there is a non-statistical justification for doing so.
• Fit the data with a least-squares line and confirm the model assumptions using the
residual plot (for the constant standard deviation assumption) and the normal probability plot
(for the normality assumption). A transformation may be necessary if the assumptions do not
appear to be met.
• If required, transform the data and construct the least-squares regression line using the
transformed data.
• If a transformation has been applied, return to step 1; otherwise, continue to step 5.
• Once a "good fit" model is obtained, write the least-squares regression line equation,
including the standard error of estimate and R-squared. • Linear regression formulas
look like this:
Y = b1x1 + b2x2 + … + bnxn
R-squared: the proportion of the total variance in Y (the dependent variable) that is explained
by the independent variables X.
B. Polynomial Regression:
• Polynomial regression is a regression algorithm that models the relationship between the
dependent variable (y) and the independent variable (x) as an nth-degree polynomial. The
equation for polynomial regression is given below:
y = b0 + b1x1 + b2x1^2 + b3x1^3 + … + bnx1^n
• It is often referred to as a special case of multiple linear regression in ML, since we add
polynomial terms to the multiple linear regression equation to convert it into polynomial
regression and improve accuracy.
• The dataset used for training in polynomial regression is non-linear in nature.
• It uses a linear regression model to fit complex, non-linear functions and
datasets. A short sketch with scikit-learn is shown below.
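A brief sketch of polynomial regression with scikit-learn (synthetic data, illustrative degree):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(100, 1))
y = 2 + 0.5 * x[:, 0] + 0.3 * x[:, 0] ** 2 + rng.normal(0, 1, size=100)

# Expand x into polynomial terms, then fit an ordinary linear model on them
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)

print("R^2 on training data:", poly_model.score(x, y))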
C. Ridge Regression:
Ridge regression is a model tuning method used on data that suffers from multicollinearity. It
performs L2 regularization. When multicollinearity is present, the least-squares estimates are
unbiased but their variances are large, so the predicted values can be far removed from the
actual values. The cost function for ridge regression:
Min(||Y - X(theta)||^2 + λ||theta||^2)
A brief example with scikit-learn follows.
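The same idea in code, on synthetic data with a deliberately correlated column; λ corresponds to the alpha parameter in scikit-learn:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
# Introduce a correlated (multicollinear) column on purpose
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=200)
y = 2 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(0, 0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# alpha plays the role of lambda in the cost function above
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
print("test R^2:", ridge.score(X_test, y_test))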
D. XGBoost Regression:
"Extreme Gradient Boosting" follows the same principle as gradient boosting but is much more
efficient. It has both a linear model solver and a tree learning algorithm, which makes XGBoost
several times faster than existing gradient boosting implementations. It supports various
objective functions, including regression, classification, and ranking. Because XGBoost has very
high predictive power but can be relatively slow to deploy, it is well suited for competitions. It
also has extra functionality for cross-validation and for finding important variables.
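A brief, hedged example of the XGBoost regressor on synthetic data; the hyperparameters shown are illustrative starting points, not tuned values:
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 6))
y = 5 * X[:, 0] - 2 * X[:, 1] ** 2 + rng.normal(0, 0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators / learning_rate / max_depth are illustrative defaults
model = XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4)
model.fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))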
● Methodology:
3. Data Analysis - imputing missing values in the data and checking for outliers.
4. Feature Engineering - modifying existing variables and creating new ones for analysis.
Machine Learning
Machine learning (ML) applies algorithms to train models from data and make predictions
or classifications. In this project, supervised learning techniques like Linear Regression
and Random Forest were used to forecast Amazon's stock prices. The model was trained
using historical data, and metrics like Mean Squared Error (MSE) were applied to assess
accuracy. ML helps businesses automate decision-making, detect anomalies, and
personalize customer experiences. Feature selection, model tuning, and validation were
integral parts of the process. Through this, we gained a deeper understanding of how data-
driven models are developed, tested, and deployed in real-world scenarios.
Machine Learning is a subset of artificial intelligence.
It permits a system to learn and improve from experience automatically, without being explicitly
programmed.
Supervised Learning: learning from labeled data and applying that knowledge to predict the labels
of new data (test data).
Unsupervised Learning
Unsupervised Learning groups/clusters unlabeled data based on its characteristics and predicts the
cluster of new data (test data).
Train/Test Split
1. Training set
2. Testing set
Test-Set Conditions: the test set should be large enough to yield statistically meaningful results, should be representative of the dataset as a whole, and must never be used for training. An example split is shown below.
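In scikit-learn this split is typically done as follows; the feature matrix and labels below are placeholders, and the 80/20 ratio is just a common convention:
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and labels
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), "training samples /", len(X_test), "testing samples")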
Deep learning, a subset of machine learning, uses artificial neural networks to model
complex patterns. For image data, convolutional neural networks (CNNs) are widely used
due to their ability to capture spatial hierarchies. In this project, deep learning was
explored for image classification tasks using TensorFlow and Keras. Datasets such as
MNIST or CIFAR-10 were used to train models to identify patterns and recognize visual
elements. Techniques like pooling, dropout, and activation functions were applied to
improve accuracy. Deep learning has revolutionized industries like healthcare and security,
making it a powerful tool for handling unstructured data like images.
Convolutional Neural Network
Layers in a CNN
Max Pooling: takes the largest element from each region of the rectified feature map.
Predict Zero Using MNIST Dataset
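A compact sketch of the kind of CNN used for this task with TensorFlow/Keras; the layer sizes and single training epoch are illustrative, not the exact configuration used in the project:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Load and scale the MNIST digits
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0
x_test = x_test[..., None] / 255.0

# Small CNN: convolution + max pooling + dropout, as described above
model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.25),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, validation_split=0.1)

# Predict the digit shown in the first test image
print("predicted digit:", np.argmax(model.predict(x_test[:1])))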
TABLEAU
Tableau is a powerful and fast-growing data visualization tool used in the BI industry.
* It helps companies harness the strength of their most valuable assets: data and people.
* It allows users to spend more time on data analysis and less on "data wrangling".
Tableau is a powerful data visualization tool that allows users to create interactive and shareable
dashboards. It connects seamlessly with various data sources like Excel, SQL, and cloud platforms.
During the internship, Tableau was used to design dynamic dashboards displaying KPIs and trends in
Amazon stock performance. Features like filters, parameters, calculated fields, and storyboards helped
convey insights effectively. Tableau’s drag-and-drop interface enables rapid development and real-
time analysis, making it ideal for business intelligence tasks. The visual appeal and interactivity of
Tableau dashboards greatly enhance the presentation and understanding of complex datasets for both
technical and non-technical audiences.
* This refers to datasets that have almost identical basic descriptive statistics but look very different
when graphed. Each consists of eleven (x, y) points.
Data Granularity
Granularity is the level of depth represented by the data in a fact or dimension table.
Granularity refers to the level of detail or depth captured in a dataset, particularly in data
warehouses and data marts. It defines how much information is stored in each record and how specific
or summarized the data is.
High granularity (detailed data):
o Use case: operational reporting, fraud detection, and personalized marketing.
Low granularity (aggregated data):
o Fewer rows.
o Example: monthly sales totals by region.
Performance: High granularity can slow down queries due to larger datasets, whereas low
granularity improves performance but reduces detail.
Storage: Detailed data consumes more space.
Flexibility: High-granularity data offers more flexibility in slicing and dicing the data during
analysis.
Accuracy: Aggregated data may obscure anomalies or patterns visible in detailed data.
Best Practices:
Use data aggregation techniques when building dashboards or summary reports for better
performance.
Set
* A named set from a multidimensional (cube) data source appears in Tableau with the named set icon.
* An action set icon is created automatically by Tableau when a set action is performed.
* User filters display in the Sets area of the Data window with the user filter set icon.
5. Sharing Insights through Dashboards
Choose the right chart type that will help convey the information most effectively. The appropriate
chart will reveal patterns and trends through which your audience will understand the significance of
the data set that you visualize.
1. Importing Libraries:
The first step in any data analysis project is importing the necessary libraries.
# Load Data Set: the dataset can be loaded using the read_csv() method.
2. Data Preprocessing:
Real-world data is often messy, incomplete, unstructured, inconsistent, redundant, and sprinkled with
odd values. Without deploying any data preprocessing techniques, it is almost impossible to gain
insights from raw data. Data preprocessing is the process of converting raw data into a suitable format
for extracting insights. It is the first and foremost step in the data science life cycle, and it makes
sure that the data is clean, organized, and ready to feed to the machine learning model.
● Except for the Date, Location columns, every column has missing values.
Let’s generate descriptive statistics for the dataset using the function describe() in pandas.
Descriptive Statistics: used to summarize and describe the features of data in a meaningful way
to extract insights. It uses two types of statistics to describe or summarize data:
● Measures of central tendency
● Measures of spread
3. Cardinality check for Categorical features:
The accuracy and performance of a classifier depend not only on the model we use, but also on how
we preprocess the data and what kind of data we feed to the classifier to learn from.
Many Machine learning algorithms like Linear Regression, Logistic Regression, k-nearest neighbors,
etc. can handle only numerical data, so encoding categorical data to numeric becomes a necessary
step.
But before jumping into encoding, check the cardinality of each categorical feature.
Cardinality: The number of unique values in each categorical feature is known as cardinality.
A feature with a high number of distinct/ unique values is a high cardinality feature. A categorical
feature with hundreds of zip codes is the best example of a high cardinality feature.
A high cardinality feature poses serious problems: it will increase the number of dimensions of the
data when that feature is encoded, which is not good for the model.
There are many ways to handle high cardinality, one would be feature engineering and the other is
simply dropping that feature if it doesn’t add any value to the model.
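For example, checking cardinality and (optionally) dropping an identifier-like column might look like this, continuing with the big_mart_data DataFrame used throughout this walkthrough:
# Number of unique values in each categorical column
categorical_cols = big_mart_data.select_dtypes(include="object").columns
print(big_mart_data[categorical_cols].nunique().sort_values(ascending=False))

# 'Item_Identifier' has very high cardinality; one option is to drop it if it adds little value
# (shown only as an illustration of the technique)
reduced_data = big_mart_data.drop(columns=["Item_Identifier"])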
Machine learning algorithms can't handle missing values, and they cause problems during modeling, so
they need to be addressed first. There are many techniques to identify and impute missing values.
If a dataset containing missing values is loaded using pandas, the missing values are replaced with
NaN (Not a Number) values. These NaN values can be identified using methods like isna() or isnull(),
and they can be imputed using fillna(). This process is known as missing data imputation.
# Handling Missing values in Categorical Features:
big_mart_data.isnull().sum()
Missing values in numerical features can be imputed using the mean or the median. The mean is
sensitive to outliers, whereas the median is robust to them. If you want to impute the missing values
with the mean, then outliers in the numerical features need to be addressed properly first.
5. Outliers detection and treatment:
An Outlier is an observation that lies an abnormal distance from other values in a given sample. They
can be detected using visualization (like box-plots, scatter plots), Z-score, statistical and probabilistic
algorithms, etc.
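A small sketch of the IQR (interquartile range) approach on one numeric column, continuing with the big_mart_data DataFrame from this walkthrough:
# IQR-based outlier detection for the Item_Visibility column
q1 = big_mart_data['Item_Visibility'].quantile(0.25)
q3 = big_mart_data['Item_Visibility'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = big_mart_data[(big_mart_data['Item_Visibility'] < lower) |
                         (big_mart_data['Item_Visibility'] > upper)]
print("number of outliers:", len(outliers))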
It’s time to do some analysis on each feature to understand about data and get some insights.
6. Exploratory Data Analysis:
Exploratory Data Analysis(EDA) is a technique used to analyze, visualize, investigate, interpret,
discover and summarize data. It helps Data Scientists to extract trends, patterns, and relationships in
data.
# Outlet_Establishment_Year column:
plt.figure(figsize=(6,6))
sns.countplot(x='Outlet_Establishment_Year', data=big_mart_data)
plt.show()
# Item_Fat_Content column:
plt.figure(figsize=(6,6))
sns.countplot(x='Item_Fat_Content', data=big_mart_data)
plt.show()
# Item_Type column:
plt.figure(figsize=(30,6))
sns.countplot(x='Item_Type', data=big_mart_data)
plt.show()
7. Encoding of Categorical Features:
Most machine learning algorithms, like logistic regression, support vector machines, and k-nearest
neighbours, can't handle categorical data. Hence, these categorical data need to be converted to
numerical data for modeling, which is called feature encoding.
There are many feature encoding techniques, such as one-hot encoding and label encoding. In this
project, the replace() function is used to encode categorical data to numerical data.
8. Correlation:
Correlation is a statistic that measures the strength of the relationship between two features. It
is used in bivariate analysis. Correlation can be calculated with the corr() method in pandas, and a
heatmap over the resulting matrix gives a quick visual summary, as sketched below.
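For instance, a correlation heatmap over the numeric columns can be drawn as follows, reusing the matplotlib and seaborn imports from this walkthrough (numeric_only assumes a reasonably recent pandas version):
# Correlation matrix of numeric columns, visualised as a heatmap
corr = big_mart_data.corr(numeric_only=True)
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()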
# Splitting data into Independent Features and Dependent Features:
For feature importance and feature scaling, we need to split data into independent and dependent
features.
● X – Independent Features or Input features
● y – Dependent Features or target label.
9. Feature Importance:
● Machine learning model performance depends on the features used to train the model. Feature
importance describes which features are relevant for building a model.
● Feature importance refers to techniques that assign a score to input features based on how
useful they are at predicting the target variable. Feature importance helps in feature selection.
We'll be using the ExtraTreesRegressor class for feature importance. This class implements a meta-
estimator that fits a number of randomized decision trees on various samples of the dataset and uses
averaging to improve predictive accuracy and control over-fitting.
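A hedged sketch of this step, assuming X and y are the encoded feature DataFrame and target from the split above:
from sklearn.ensemble import ExtraTreesRegressor

# Fit the meta-estimator on the independent features X and target y
model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Rank features by their importance scores
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))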
IMPLEMENTED SCREENSHOTS:
SAMPLE CODING:
Importing Libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn import metrics
# loading the dataset into a DataFrame (the CSV file name is assumed)
big_mart_data = pd.read_csv('big_mart_data.csv')
big_mart_data.head()
# number of data points & number of features
big_mart_data.shape
# getting some information about the dataset
big_mart_data.info()
#Categorical Features:
#Item_Identifier
#Item_Fat_Content
#Item_Type
#Outlet_Identifier
#Outlet_Size
#Outlet_Location_Type
#Outlet_Type
# checking for missing values
big_mart_data.isnull().sum()
# mean value of "Item_Weight" column
big_mart_data['Item_Weight'].mean()
# filling the missing values in "Item_Weight" column with "Mean" value
big_mart_data['Item_Weight'].fillna(big_mart_data['Item_Weight'].mean(), inplace=True)
# mode of "Outlet_Size" column
big_mart_data['Outlet_Size'].mode()
# filling the missing values in "Outlet_Size" column with Mode
mode_of_Outlet_size = big_mart_data.pivot_table(values='Outlet_Size', columns='Outlet_Type',
                                                aggfunc=(lambda x: x.mode()[0]))
miss_values = big_mart_data['Outlet_Size'].isnull()
print(miss_values)
big_mart_data.loc[miss_values, 'Outlet_Size'] = big_mart_data.loc[miss_values, 'Outlet_Type'].apply(lambda x: mode_of_Outlet_size[x])
big_mart_data.describe()
sns.set()
# Item_Weight distribution
plt.figure(figsize=(6,6))
sns.distplot(big_mart_data['Item_Weight'])
plt.show()
# Item_Visibility distribution
plt.figure(figsize=(6,6))
sns.distplot(big_mart_data['Item_Visibility'])
plt.show()
# Item_MRP distribution
plt.figure(figsize=(6,6))
sns.distplot(big_mart_data['Item_MRP'])
plt.show()
# Item_Outlet_Sales distribution
plt.figure(figsize=(6,6))
sns.distplot(big_mart_data['Item_Outlet_Sales'])
plt.show()
# Outlet_Establishment_Year column
plt.figure(figsize=(6,6))
sns.countplot(x='Outlet_Establishment_Year', data=big_mart_data)
plt.show()
# Item_Fat_Content column
plt.figure(figsize=(6,6))
sns.countplot(x='Item_Fat_Content', data=big_mart_data)
plt.show()
# Outlet_Size column
plt.figure(figsize=(6,6))
sns.countplot(x='Outlet_Size', data=big_mart_data)
plt.show()
big_mart_data.head()
big_mart_data['Item_Fat_Content'].value_counts()
big_mart_data.replace({'Item_Fat_Content': {'low fat': 'Low Fat', 'LF': 'Low Fat',
                                            'reg': 'Regular'}}, inplace=True)
big_mart_data['Item_Fat_Content'].value_counts()
Label Encoding
encoder = LabelEncoder()
big_mart_data['Item_Identifier'] = encoder.fit_transform(big_mart_data['Item_Identifier'])
big_mart_data['Item_Fat_Content'] = encoder.fit_transform(big_mart_data['Item_Fat_Content'])
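The sample code above stops at label encoding; a hedged sketch of how the imported XGBRegressor and metrics could then be used to train and evaluate a model, assuming the remaining categorical columns are encoded the same way:
# Encode the remaining categorical columns (assumed to follow the same LabelEncoder pattern)
for col in ['Item_Type', 'Outlet_Identifier', 'Outlet_Size',
            'Outlet_Location_Type', 'Outlet_Type']:
    big_mart_data[col] = encoder.fit_transform(big_mart_data[col])

# Split into independent features and the target column
X = big_mart_data.drop(columns=['Item_Outlet_Sales'])
y = big_mart_data['Item_Outlet_Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Train an XGBoost regressor and evaluate it with R-squared
regressor = XGBRegressor()
regressor.fit(X_train, y_train)
print('Train R^2:', metrics.r2_score(y_train, regressor.predict(X_train)))
print('Test R^2 :', metrics.r2_score(y_test, regressor.predict(X_test)))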
CONCLUSION:
In this work, we evaluated the effectiveness of various regression algorithms for predicting sales
based on historical sales data. The accuracy of linear regression, polynomial regression, ridge
regression, and XGBoost regression was measured, and we conclude that ridge and XGBoost
regression give better predictions than the linear and polynomial regression approaches. Forecasting
sales and building a sales plan can help avoid unforeseen cash-flow problems and manage production,
staffing, and financing needs more effectively. In future work, we can also consider an ARIMA model
to capture the time-series behaviour of sales.