W02L01 - FA23 - AIC270 - Programming For AI - Syed Ahmed

Data processing involves collecting raw data and converting it into usable information through a series of steps including data collection, preparation, input, processing, interpretation, and storage. It is essential for organizations to develop effective business strategies and enhance competitiveness by utilizing processed data. Key methods include data extraction, transformation, and loading (ETL), along with handling missing data, duplicates, outliers, and encoding categorical variables.


Data Processing

Data Processing
• Data processing is collecting raw data and translating it into usable
information.
• The raw data is collected, filtered, sorted, processed, analyzed, stored, and
then presented in a readable format.
• It is usually performed in a step-by-step process by a team of data scientists
and data engineers in an organization.
• Data processing is crucial for organizations to create better business
strategies and increase their competitive edge.
• By converting the data into a readable format like graphs, charts,
and documents, employees throughout the organization can understand
and use the data.
Data Processing Cont.
The processing of data largely depends on the following factors:

• The volume of data that needs to be processed.
• The complexity of the data processing operations.
• The capacity and built-in technology of the computer systems used.
• Technical skills and time constraints.
Stages of Data Processing
Data Collection

• The collection of raw data is the first step of the data processing cycle.
• The raw data collected has a huge impact on the output produced.
• Raw data can include monetary figures, website cookies, profit/loss
statements of a company, user behavior, etc.
Data Preparation

• Data preparation or data cleaning is the process of sorting and filtering the raw data to remove unnecessary and inaccurate data.
• Raw data is checked for errors, duplication, miscalculations, or missing data and transformed into a suitable form for further analysis and processing.
• This ensures that only the highest-quality data is fed into the processing unit.
Data Input
• The raw data is converted into machine-readable form and fed into
the processing unit. This can be in the form of data entry through a
keyboard, scanner, or any other input source.
Data Processing
• The raw data is subjected to various data processing methods using
machine learning and artificial intelligence algorithms to generate the
desired output.
• This step may vary slightly from process to process depending on the
source of data being processed (data lakes, online databases,
connected devices, etc.) and the intended use of the output.
Data Interpretation or Output
• The data is finally transmitted and displayed to the user in a readable
form like graphs, tables, vector files, audio, video, documents, etc.
This output can be stored and further processed in the next data
processing cycle.
Data Storage
• The last step of the data processing cycle is storage, where data and
metadata are stored for further use. This allows quick access and
retrieval of information whenever needed. Proper data storage is also
necessary for compliance with data protection legislation such as GDPR.
Why Should We Use Data Processing?
• In the modern era, most work relies on data, so large amounts of data are collected for purposes such as academic and scientific research, institutional use, personal and private use, commercial use, and more. Processing this collected data is essential so that it passes through all the steps above and is sorted, stored, filtered, presented in the required format, and analyzed.

• The time consumed and the intricacy of processing depend on the required results. Where large amounts of data are acquired, processing them, whether in data mining or in data research, is unavoidable if authentic results are to be obtained.
Methods of Data Processing
• There are three main data processing methods, listed below.
Types of Data Processing
• Manual data processing
• Mechanical data processing
• Electronic data processing
Data Extraction
• Data extraction is the process of collecting or retrieving disparate
types of data from a variety of sources, many of which may be poorly
organized or completely unstructured.
• Data extraction makes it possible to consolidate, process, and refine
data so that it can be stored in a centralized location in order to be
transformed.
Data Extraction and ETL
• To put the importance of data extraction in context, it’s helpful to briefly consider the ETL
process as a whole. In essence, ETL allows companies and organizations to 1) consolidate
data from different sources into a centralized location and 2) assimilate different types of
data into a common format. There are three steps in the ETL process:

• Extraction: Data is taken from one or more sources or systems. The extraction locates
and identifies relevant data, then prepares it for processing or transformation. Extraction
allows many different kinds of data to be combined and ultimately mined for business
intelligence.
• Transformation: Once the data has been successfully extracted, it is ready to be refined.
During the transformation phase, data is sorted, organized, and cleansed. For example,
duplicate entries will be deleted, missing values removed or enriched, and audits will be
performed to produce data that is reliable, consistent, and usable.
• Loading: The transformed, high-quality data is then delivered to a single, unified target
location for storage and analysis. A minimal sketch of all three steps follows.
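As a concrete illustration, here is a minimal ETL sketch in Python with pandas. The source files (sales.csv, customers.json), their columns, and the SQLite target are hypothetical, not part of the slides:

import sqlite3
import pandas as pd

# Extraction: take data from two hypothetical sources.
sales = pd.read_csv("sales.csv")            # assumed columns: customer_id, amount
customers = pd.read_json("customers.json")  # assumed columns: customer_id, name

# Transformation: assimilate the sources into a common format and cleanse them.
merged = sales.merge(customers, on="customer_id", how="left")
merged = merged.drop_duplicates()          # delete duplicate entries
merged = merged.dropna(subset=["amount"])  # drop rows missing key values

# Loading: deliver the refined data to a single, unified target location.
with sqlite3.connect("warehouse.db") as conn:
    merged.to_sql("clean_sales", conn, if_exists="replace", index=False)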
Types of Data Extraction
• Data extraction is a powerful and adaptable process that can help you gather many
types of information relevant to your business. The first step in putting data extraction
to work for you is to identify the kinds of data you’ll need. Types of data that are
commonly extracted include:
• Customer Data: This is the kind of data that helps businesses and organizations
understand their customers and donors. It can include names, phone numbers, email
addresses, unique identifying numbers, purchase histories, social media activity, and
web searches, to name a few.
• Financial Data: These types of metrics include sales numbers, purchasing costs,
operating margins, and even your competitors’ prices. This type of data helps
companies track performance, improve efficiencies, and plan strategically.
• Use, Task, or Process Performance Data: This broad category of data includes
information related to specific tasks or operations. For example, a retail company may
seek information on its shipping logistics, or a hospital may want to monitor post-
surgical outcomes or patient feedback.
Handling Missing Data
• Missing values are a common problem in datasets, and it is important to decide how to handle them.
• We can delete the records containing missing values; replace them with a mean, median, or mode value; or use imputation techniques to fill them in, as sketched after the table below.

Id | Name | Reg_no | Semester | GPA
---+------+--------+----------+------
 1 | A    | Null   | Null     | 4
 2 | B    | 120342 | 4B       | 5
 3 | C    | Null   | 3A       | Null
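A minimal pandas sketch of these options, built from the table above (None stands in for Null):

import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Name": ["A", "B", "C"],
    "Reg_no": [None, "120342", None],
    "Semester": [None, "4B", "3A"],
    "GPA": [4.0, 5.0, None],
})

dropped = df.dropna()  # option 1: delete every row that has a missing value
filled = df.fillna({
    "GPA": df["GPA"].mean(),               # option 2: numeric column -> mean
    "Semester": df["Semester"].mode()[0],  # option 2: categorical column -> mode
})
# option 3: richer imputation, e.g. sklearn.impute.SimpleImputer or KNNImputer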
Removing duplicates
• Duplicate records can skew the analysis and lead to incorrect insights.
We need to identify and remove any duplicate records in the dataset.

Id | Name | Reg_no | Semester | GPA
---+------+--------+----------+-----
 1 | Asad | Null   | Null     | 4
 2 | B    | 120342 | 4B       | 5
 3 | Asad | Null   | Null     | 4
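A minimal pandas sketch of the same table: rows 1 and 3 are identical apart from Id, so we compare on the data columns only (treating those columns as what defines a duplicate is an assumption made here):

import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Name": ["Asad", "B", "Asad"],
    "Reg_no": [None, "120342", None],
    "Semester": [None, "4B", None],
    "GPA": [4, 5, 4],
})

# Keep the first occurrence of each duplicated record.
deduped = df.drop_duplicates(subset=["Name", "Reg_no", "Semester", "GPA"], keep="first")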
Handling outliers
• Outliers are extreme values that can significantly affect the analysis.
We need to identify and handle outliers appropriately by either
removing them or transforming them.
Example:
Let's say we are working with a dataset of housing prices in a city. One
of the features in the dataset is the size of the house in square feet.
Upon visualizing the data with a scatter plot, we notice an extreme
outlier: a house that is much larger than the other houses in the
dataset. This outlier could significantly affect our analysis and lead
to incorrect insights.
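One common way to make this concrete is the 1.5 x IQR rule. The house sizes below are made up for illustration, with 10000 square feet playing the outlier:

import pandas as pd

sizes = pd.Series([1200, 1350, 1500, 1650, 1800, 10000])  # house sizes in square feet

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers.
q1, q3 = sizes.quantile(0.25), sizes.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

removed = sizes[(sizes >= lower) & (sizes <= upper)]  # option 1: remove outliers
capped = sizes.clip(lower, upper)                     # option 2: transform by capping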
Encoding categorical variables
Categorical variables cannot be used in their raw form in most machine learning
algorithms, so we need to encode them into numerical values. One-hot encoding
and label encoding are popular techniques for encoding categorical variables.
Example:
Let's say we are working with a dataset of customer information for a bank. One
of the features in the dataset is the customer's job type, which can take on
categorical values such as "manager", "engineer", "teacher", and so on. In order
to use this feature in a machine learning algorithm, we need to encode it as
numerical values. One-hot encoding, for example, represents each job type as a
binary vector:
"manager" = [1, 0, 0], "engineer" = [0, 1, 0], "teacher" = [0, 0, 1]
Scaling and normalization
Scaling and normalization are important steps to ensure that the
numerical features are on the same scale. This can help to improve the
performance of many machine learning algorithms.

Example:
Let's say we are working with a dataset of student exam scores, where
each student's score is measured on a scale of 0 to 100 for each
subject. One of the features in the dataset is the student's age, which
ranges from 16 to 20 years old. We want to use this feature in a
machine learning algorithm, but since it has a different scale than the
exam scores, we need to scale or normalize it to make it comparable.
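A minimal scikit-learn sketch, with ages and scores made up to match the ranges in the example:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"age": [16, 17, 18, 19, 20],
                   "score": [55, 70, 65, 90, 80]})

# Min-max scaling maps both features onto the same 0-1 range.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Standardization (zero mean, unit variance) is a common alternative.
standardized = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)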
Example

Item_id | Item_name | Price | Quantity | Expiry
--------+-----------+-------+----------+----------
   1    | A         | 20    | 4        | 20 March
   2    | B         | 5     | 5        | 20 April
   3    | C         | 23    | 6        | 1 June
   4    | D         | Null  | 7        | 2 July
   5    | A         | 20    | 4        | 20 March
   6    | B         | 5     | 5        | 20 March
   7    | E         | 2000  | 1        | Null
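Tying the preceding steps together, a minimal pandas sketch that cleans this table (None stands in for Null; filling the missing price with the median is an illustrative choice, not something the slides prescribe):

import pandas as pd

df = pd.DataFrame({
    "Item_id": [1, 2, 3, 4, 5, 6, 7],
    "Item_name": ["A", "B", "C", "D", "A", "B", "E"],
    "Price": [20, 5, 23, None, 20, 5, 2000],
    "Quantity": [4, 5, 6, 7, 4, 5, 1],
    "Expiry": ["20 March", "20 April", "1 June", "2 July",
               "20 March", "20 March", None],
})

# Remove duplicates: items 1 and 5 repeat, so compare on everything but Item_id.
df = df.drop_duplicates(subset=["Item_name", "Price", "Quantity", "Expiry"])

# Handle missing data: fill the missing price with the median of the known prices.
df["Price"] = df["Price"].fillna(df["Price"].median())

# Handle outliers: flag the extreme price (2000) with the 1.5 x IQR rule.
q1, q3 = df["Price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["Price"] < q1 - 1.5 * iqr) | (df["Price"] > q3 + 1.5 * iqr)]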
