W02L01 - FA23 - AIC270 - Programming For AI - Syed Ahmed

Data processing involves collecting raw data and converting it into usable information through a series of steps including data collection, preparation, input, processing, interpretation, and storage. It is essential for organizations to develop effective business strategies and enhance competitiveness by utilizing processed data. Key methods include data extraction, transformation, and loading (ETL), along with handling missing data, duplicates, outliers, and encoding categorical variables.


Data Processing

Data Processing
• Data processing is collecting raw data and translating it into usable
information.
• The raw data is collected, filtered, sorted, processed, analyzed, stored, and
then presented in a readable format.
• It is usually performed in a step-by-step process by a team of data scientists
and data engineers in an organization.
• Data processing is crucial for organizations to create better business
strategies and increase their competitive edge.
• By converting the data into a readable format like graphs, charts,
and documents, employees throughout the organization can understand
and use the data.
Data Processing Cont.
The processing of data largely depends on the following factors:

• The volume of data that needs to be processed.
• The complexity of the data processing operations.
• The capacity and built-in technology of the computer systems used.
• Technical skills and time constraints.
Stages of Data Processing
Data Collection

• The collection of raw data is the first step of the data processing cycle.
• The raw data collected has a huge impact on the output produced.
• Raw data can include monetary figures, website cookies, profit/loss
statements of a company, user behavior, etc.
Data Preparation

• Data preparation or data cleaning is the process of sorting and filtering the raw data to remove unnecessary and inaccurate data.
• Raw data is checked for errors, duplication, miscalculations, or missing data and transformed into a suitable form for further analysis and processing.
• This ensures that only the highest-quality data is fed into the processing unit.
Data Input
• The raw data is converted into machine-readable form and fed into
the processing unit. This can be in the form of data entry through a
keyboard, scanner, or any other input source.
Data Processing
• The raw data is subjected to various data processing methods using
machine learning and artificial intelligence algorithms to generate the
desired output.
• This step may vary slightly from process to process depending on the
source of data being processed (data lakes, online databases,
connected devices, etc.) and the intended use of the output.
Data Interpretation or Output
• The data is finally transmitted and displayed to the user in a readable
form like graphs, tables, vector files, audio, video, documents, etc.
This output can be stored and further processed in the next data
processing cycle.
Data Storage
• The last step of the data processing cycle is storage, where data and
metadata are stored for further use. This allows quick access and
retrieval of information whenever needed. Proper data storage is also
necessary for compliance with data protection legislation such as GDPR.
Why Should We Use Data Processing?
• In the modern era, most work relies on data, so large amounts of data are collected for purposes such as academic and scientific research, institutional use, personal and private use, commercial use, and more. Processing this collected data is essential so that it passes through all the steps above and is sorted, stored, filtered, presented in the required format, and analyzed.

• The time consumed and the intricacy of processing depend on the required results. Where large amounts of data are acquired, processing them, whether in data mining or in data research, is unavoidable if authentic results are to be obtained.
Methods of Data Processing
• There are three main data processing methods, listed below.
Types of Data Processing
• Manual data processing
• Mechanical data processing
• Electronic data processing
Data Extraction
• Data extraction is the process of collecting or retrieving disparate
types of data from a variety of sources, many of which may be poorly
organized or completely unstructured.
• Data extraction makes it possible to consolidate, process, and refine
data so that it can be stored in a centralized location in order to be
transformed.
Data Extraction and ETL
• To put the importance of data extraction in context, it’s helpful to briefly consider the ETL
process as a whole. In essence, ETL allows companies and organizations to 1) consolidate
data from different sources into a centralized location and 2) assimilate different types of
data into a common format. There are three steps in the ETL process:

• Extraction: Data is taken from one or more sources or systems. The extraction locates
and identifies relevant data, then prepares it for processing or transformation. Extraction
allows many different kinds of data to be combined and ultimately mined for business
intelligence.
• Transformation: Once the data has been successfully extracted, it is ready to be refined.
During the transformation phase, data is sorted, organized, and cleansed. For example,
duplicate entries will be deleted, missing values removed or enriched, and audits will be
performed to produce data that is reliable, consistent, and usable.
• Loading: The transformed, high-quality data is then delivered to a single, unified target
location for storage and analysis. A minimal sketch of all three steps follows.
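As a concrete illustration, here is a minimal ETL sketch in Python with pandas. The source files (sales.csv, customers.json), their columns, and the SQLite target are hypothetical, not part of the slides:

import sqlite3
import pandas as pd

# Extraction: take data from two hypothetical sources.
sales = pd.read_csv("sales.csv")            # assumed columns: customer_id, amount
customers = pd.read_json("customers.json")  # assumed columns: customer_id, name

# Transformation: assimilate the sources into a common format and cleanse them.
merged = sales.merge(customers, on="customer_id", how="left")
merged = merged.drop_duplicates()          # delete duplicate entries
merged = merged.dropna(subset=["amount"])  # drop rows missing key values

# Loading: deliver the refined data to a single, unified target location.
with sqlite3.connect("warehouse.db") as conn:
    merged.to_sql("clean_sales", conn, if_exists="replace", index=False)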
Types of Data Extraction
• Data extraction is a powerful and adaptable process that can help you gather many
types of information relevant to your business. The first step in putting data extraction
to work for you is to identify the kinds of data you’ll need. Types of data that are
commonly extracted include:
• Customer Data: This is the kind of data that helps businesses and organizations
understand their customers and donors. It can include names, phone numbers, email
addresses, unique identifying numbers, purchase histories, social media activity, and
web searches, to name a few.
• Financial Data: These types of metrics include sales numbers, purchasing costs,
operating margins, and even your competitors’ prices. This type of data helps
companies track performance, improve efficiencies, and plan strategically.
• Use, Task, or Process Performance Data: This broad category of data includes
information related to specific tasks or operations. For example, a retail company may
seek information on its shipping logistics, or a hospital may want to monitor post-
surgical outcomes or patient feedback.
Handling Missing Data
• Missing values are a common problem in datasets, and it is important to decide how to handle them.
• We can delete the records containing missing values; replace them with a mean, median, or mode value; or use imputation techniques to fill them in, as sketched after the table below.

Id | Name | Reg_no | Semester | GPA
---+------+--------+----------+------
 1 | A    | Null   | Null     | 4
 2 | B    | 120342 | 4B       | 5
 3 | C    | Null   | 3A       | Null
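A minimal pandas sketch of these options, built from the table above (None stands in for Null):

import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Name": ["A", "B", "C"],
    "Reg_no": [None, "120342", None],
    "Semester": [None, "4B", "3A"],
    "GPA": [4.0, 5.0, None],
})

dropped = df.dropna()  # option 1: delete every row that has a missing value
filled = df.fillna({
    "GPA": df["GPA"].mean(),               # option 2: numeric column -> mean
    "Semester": df["Semester"].mode()[0],  # option 2: categorical column -> mode
})
# option 3: richer imputation, e.g. sklearn.impute.SimpleImputer or KNNImputer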
Removing duplicates
• Duplicate records can skew the analysis and lead to incorrect insights.
We need to identify and remove any duplicate records in the dataset.

Id | Name | Reg_no | Semester | GPA
---+------+--------+----------+-----
 1 | Asad | Null   | Null     | 4
 2 | B    | 120342 | 4B       | 5
 3 | Asad | Null   | Null     | 4
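A minimal pandas sketch of the same table: rows 1 and 3 are identical apart from Id, so we compare on the data columns only (treating those columns as what defines a duplicate is an assumption made here):

import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Name": ["Asad", "B", "Asad"],
    "Reg_no": [None, "120342", None],
    "Semester": [None, "4B", None],
    "GPA": [4, 5, 4],
})

# Keep the first occurrence of each duplicated record.
deduped = df.drop_duplicates(subset=["Name", "Reg_no", "Semester", "GPA"], keep="first")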
Handling outliers
• Outliers are extreme values that can significantly affect the analysis.
We need to identify and handle outliers appropriately by either
removing them or transforming them.
Example:
Let's say we are working with a dataset of housing prices in a city. One
of the features in the dataset is the size of the house in square feet.
Upon visualizing the data with a scatter plot, we notice an extreme
outlier: a house that is much larger than the other houses in the
dataset. This outlier could significantly affect our analysis and lead
to incorrect insights.
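One common way to make this concrete is the 1.5 x IQR rule. The house sizes below are made up for illustration, with 10000 square feet playing the outlier:

import pandas as pd

sizes = pd.Series([1200, 1350, 1500, 1650, 1800, 10000])  # house sizes in square feet

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers.
q1, q3 = sizes.quantile(0.25), sizes.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

removed = sizes[(sizes >= lower) & (sizes <= upper)]  # option 1: remove outliers
capped = sizes.clip(lower, upper)                     # option 2: transform by capping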
Encoding categorical variables
Categorical variables cannot be used in their raw form in most machine learning
algorithms, so we need to encode them into numerical values. One-hot encoding
and label encoding are popular techniques for encoding categorical variables.
Example:
Let's say we are working with a dataset of customer information for a bank. One
of the features in the dataset is the customer's job type, which can take on
categorical values such as "manager", "engineer", "teacher", and so on. In order
to use this feature in a machine learning algorithm, we need to encode it as
numerical values. One-hot encoding, for example, represents each job type as a
binary vector:
"manager" = [1, 0, 0], "engineer" = [0, 1, 0], "teacher" = [0, 0, 1]
Scaling and normalization
Scaling and normalization are important steps to ensure that the
numerical features are on the same scale. This can help to improve the
performance of many machine learning algorithms.

Example:
Let's say we are working with a dataset of student exam scores, where
each student's score is measured on a scale of 0 to 100 for each
subject. One of the features in the dataset is the student's age, which
ranges from 16 to 20 years old. We want to use this feature in a
machine learning algorithm, but since it has a different scale than the
exam scores, we need to scale or normalize it to make it comparable.
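A minimal scikit-learn sketch, with ages and scores made up to match the ranges in the example:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"age": [16, 17, 18, 19, 20],
                   "score": [55, 70, 65, 90, 80]})

# Min-max scaling maps both features onto the same 0-1 range.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Standardization (zero mean, unit variance) is a common alternative.
standardized = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)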
Example

Item_id | Item_name | Price | Quantity | Expiry
--------+-----------+-------+----------+----------
   1    | A         | 20    | 4        | 20 March
   2    | B         | 5     | 5        | 20 April
   3    | C         | 23    | 6        | 1 June
   4    | D         | Null  | 7        | 2 July
   5    | A         | 20    | 4        | 20 March
   6    | B         | 5     | 5        | 20 March
   7    | E         | 2000  | 1        | Null
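Tying the preceding steps together, a minimal pandas sketch that cleans this table (None stands in for Null; filling the missing price with the median is an illustrative choice, not something the slides prescribe):

import pandas as pd

df = pd.DataFrame({
    "Item_id": [1, 2, 3, 4, 5, 6, 7],
    "Item_name": ["A", "B", "C", "D", "A", "B", "E"],
    "Price": [20, 5, 23, None, 20, 5, 2000],
    "Quantity": [4, 5, 6, 7, 4, 5, 1],
    "Expiry": ["20 March", "20 April", "1 June", "2 July",
               "20 March", "20 March", None],
})

# Remove duplicates: items 1 and 5 repeat, so compare on everything but Item_id.
df = df.drop_duplicates(subset=["Item_name", "Price", "Quantity", "Expiry"])

# Handle missing data: fill the missing price with the median of the known prices.
df["Price"] = df["Price"].fillna(df["Price"].median())

# Handle outliers: flag the extreme price (2000) with the 1.5 x IQR rule.
q1, q3 = df["Price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["Price"] < q1 - 1.5 * iqr) | (df["Price"] > q3 + 1.5 * iqr)]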
