1. Data Collection and Management
What is Data Collection?
The process of gathering and analyzing accurate data from various sources to find answers to research problems, identify trends and probabilities, and evaluate possible outcomes is known as Data Collection.
Knowledge is power, information is knowledge, and data is information in digitized form, at least
as defined in IT. Hence, data is power. But before you can leverage that data into a successful
strategy for your organization or business, you need to gather it. That’s your first step.
So, to help you get the process started, we shine a spotlight on data collection.
What exactly is it? Believe it or not, it’s more than just doing a Google search! Furthermore,
what are the different types of data collection? And what kinds of data collection tools and data
collection techniques exist?
During data collection, the researchers must identify the data types, the sources of data, and what
methods are being used. We will soon see that there are many different data collection methods.
There is heavy reliance on data collection in research, commercial, and government fields.
Before an analyst begins collecting data, they must answer three questions:
What’s the goal or purpose of this research?
What kinds of data are they planning on gathering?
What methods and procedures will be used to collect, store, and process the
information?
Additionally, we can break up data into qualitative and quantitative types. Qualitative data covers
descriptions such as color, size, quality, and appearance. Quantitative data, unsurprisingly, deals
with numbers, such as statistics, poll numbers, percentages, etc.
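To make the distinction concrete, here is a minimal sketch of a single record that mixes both kinds of data; the field names are hypothetical.

```python
# A hypothetical customer record mixing qualitative and quantitative fields.
record = {
    # Qualitative (descriptive) attributes
    "color_preference": "blue",
    "size": "medium",
    "feedback": "packaging looked great",
    # Quantitative (numeric) attributes
    "age": 34,
    "orders_last_year": 12,
    "satisfaction_score": 4.5,
}

# Quantitative fields support arithmetic; qualitative fields are grouped or counted instead.
orders_per_month = record["orders_last_year"] / 12
print(orders_per_month)
```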
Categories of APIs
Web-based systems:
A web API is an interface to either a web server or a web browser. These APIs are used extensively in the development of web applications and work at either the server end or the client end. Companies such as Google, Amazon, and eBay all provide web-based APIs. Some popular examples of web-based APIs are the Twitter REST API, the Facebook Graph API, the Amazon S3 REST API, etc.
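To make this concrete, here is a minimal sketch of calling a web API over HTTP from Python with the third-party requests library. The endpoint URL and parameters are illustrative placeholders, not part of any specific API named above.

```python
import requests

# Hypothetical REST endpoint; substitute a real API URL and its documented parameters.
url = "https://api.example.com/v1/search"
params = {"q": "data collection", "limit": 10}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()   # fail loudly on HTTP errors
results = response.json()     # most web APIs return JSON

for item in results.get("items", []):
    print(item)
```

Most web-based APIs follow this same pattern: send a request to a documented URL, then parse the structured response the service returns.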
Operating systems:
There are multiple OS-based APIs that expose the functionality of various operating system features, which can be incorporated when creating Windows or Mac applications.
Some examples of OS-based APIs are Cocoa, Carbon, WinAPI, etc.
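As a small, hedged illustration, Python's standard ctypes module can call directly into a WinAPI function on Windows; the sketch below pops up a message box and runs only on Windows.

```python
import ctypes
import sys

# WinAPI call via ctypes: MessageBoxW(hWnd, text, caption, type).
# ctypes.windll is only available on Windows, so guard the call.
if sys.platform == "win32":
    ctypes.windll.user32.MessageBoxW(0, "Hello from the WinAPI", "OS API demo", 0)
else:
    print("This example requires Windows (WinAPI).")
```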
Database systems:
Interaction with most databases is done through API calls to the database. These APIs are defined so that they return the requested data in a predefined format that the requesting client can understand.
This generalizes the process of interacting with databases and thereby enhances the compatibility of applications with various databases. These APIs are very robust and provide a structured interface to the database.
Some popular examples are the Drupal 7 Database API, the Drupal 8 Database API, and the Django database API.
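As a concrete example of a database API, Python's built-in sqlite3 module (a DB-API 2.0 implementation) lets an application send queries and get results back in a predictable structure, without caring how the database stores the data internally.

```python
import sqlite3

# Open (or create) a local SQLite database and issue API calls against it.
conn = sqlite3.connect("example.db")
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Asha", "Pune"))
conn.commit()

# The API returns rows in a predefined format (here, tuples).
for row in cur.execute("SELECT id, name, city FROM customers"):
    print(row)

conn.close()
```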
Hardware systems:
These APIs allow access to the various hardware components of a system and are crucial for establishing communication with the hardware. This makes possible a range of functions, from collecting sensor data to rendering output on your screens.
For example, the Google PowerMeter API allows device manufacturers to build home energy monitoring devices that work with Google PowerMeter.
Some other examples of hardware APIs are QUANT Electronic, WareNetCheckWare, OpenVX Hardware Acceleration, CubeSensore, etc.
3. Difference between an API and a Library
At this point, you might be scratching your head and confusing APIs with libraries. Let me simplify it for you: an application programming interface (API) is an interface that defines the way in which an application program may request services from a library.
An API is a set of rules that defines the interaction between various entities; here, we are specifically talking about the interaction between two pieces of software.
A library also has an API, which denotes the part of the library that is actually accessible to the user from outside.
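A tiny sketch may help separate the two ideas: the library is the code that does the work, while its API is the set of names and call signatures you are meant to use from outside. The module and function names below are invented for illustration.

```python
# shapes.py -- a toy "library" (hypothetical)
import math

def _validate(radius):
    # Leading underscore: an internal helper, not part of the public API.
    if radius < 0:
        raise ValueError("radius must be non-negative")

def circle_area(radius):
    """Public API: the documented, stable entry point that callers rely on."""
    _validate(radius)
    return math.pi * radius ** 2
```

A caller only touches the public surface, for example circle_area(2.0); the internals behind it can change freely as long as the API stays the same.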
What Are the Different Sources of Data Collection?
The following are seven primary methods of collecting data in business analytics.
Surveys
Transactional Tracking
Interviews and Focus Groups
Observation
Online Tracking
Forms
Social Media Monitoring
Data collection breaks down into two methods. As a side note, many terms, such as techniques, methods, and types, are used interchangeably depending on who uses them. One source may call
data collection techniques “methods,” for instance. But whatever labels we use, the general
concepts and breakdowns apply across the board whether we’re talking about marketing analysis
or a scientific research project.
The two methods are:
Primary
As the name implies, this is original, first-hand data collected by the data researchers. This process
is the initial information gathering step, performed before anyone carries out any further or related
research. Primary data results are highly accurate provided the researcher collects the information themselves.
However, there’s a downside, as first-hand research is potentially time-consuming and expensive.
Secondary
Secondary data is second-hand data collected by other parties and already having undergone
statistical analysis. This data is either information that the researcher has tasked other people to
collect or information the researcher has looked up. Simply put, it’s second-hand information.
Although it’s easier and cheaper to obtain than primary information, secondary information raises
concerns regarding accuracy and authenticity. Quantitative data makes up a majority of secondary
data.
Data Exploration:
What is Data Exploration?
Data exploration is the first step of data analysis used to explore and visualize data to uncover
insights from the start or identify areas or patterns to dig into more. Using interactive dashboards
and point-and-click data exploration, users can better understand the bigger picture and get to
insights faster.
Why is Data Exploration Important?
Starting with data exploration helps users make better decisions about where to dig deeper into the data and gain a broad understanding of the business before asking more detailed questions later.
With a user-friendly interface, anyone across an organization can familiarize themselves with the
data, discover patterns, and generate thoughtful questions that may spur on deeper, valuable
analysis.
Data exploration and visual analytics tools build understanding, empowering users to explore data
in any visualization. This approach speeds up time to answers and deepens users’ understanding
by covering more ground in less time. Data exploration is important for this reason because it
democratizes access to data and provides governed self-service analytics. Furthermore, businesses
can accelerate data exploration by provisioning and delivering data through visual data marts that
are easy to explore and use.
What are the Main Use Cases for Data Exploration?
Data exploration can help businesses explore large amounts of data quickly to better understand
next steps in terms of further analysis. This gives the business a more manageable starting point
and a way to target areas of interest. In most cases, data exploration involves using data
visualizations to examine the data at a high level. By taking this high-level approach, businesses
can determine which data is most important and which may distort the analysis and therefore
should be removed. Data exploration can also be helpful in decreasing time spent on less valuable
analysis by selecting the right path forward from the start.
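In practice, much of this high-level first pass can be done in a few lines of pandas before any dashboarding. A minimal sketch, assuming a CSV file and column names that are purely illustrative:

```python
import pandas as pd

# Load a dataset (the file name and columns are hypothetical).
df = pd.read_csv("sales.csv")

print(df.shape)                      # how much data is there?
print(df.head())                     # what does it look like?
print(df.dtypes)                     # which columns are numeric vs. text?
print(df.describe())                 # quick summary statistics for numeric columns
print(df.isna().sum())               # where are the missing values?
print(df["region"].value_counts())   # distribution of a categorical column
```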
Fixing Data:
Here are 8 effective data cleaning techniques:
Remove duplicates
Remove irrelevant data
Standardize capitalization
Convert data type
Clear formatting
Fix errors
Language translation
Handle missing values
Let’s go through these in more detail now.
1. Remove Duplicates
When you collect your data from a range of different places, or scrape your data, it’s likely that
you will have duplicated entries. These duplicates could originate from human error where the
person inputting the data or filling out a form made a mistake.
Duplicates will inevitably skew your data and/or confuse your results. They can also just make the
data hard to read when you want to visualize it, so it’s best to remove them right away.
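In pandas, deduplication is a one-liner; a small sketch with an invented frame:

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@example.com", "a@example.com", "b@example.com"],
    "age":   [34, 34, 29],
})

# Drop rows that are exact duplicates (keeping the first occurrence).
deduped = df.drop_duplicates()

# Or treat rows as duplicates whenever a key column repeats.
deduped_by_email = df.drop_duplicates(subset=["email"], keep="first")
print(deduped_by_email)
```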
2. Remove Irrelevant Data
Irrelevant data will slow down and confuse any analysis that you want to do. So, deciphering what
is relevant and what is not is necessary before you begin your data cleaning. For instance, if you
are analyzing the age range of your customers, you don’t need to include their email addresses.
Other elements you’ll need to remove as they add nothing to your data include:
Personally identifiable information (PII)
URLs
HTML tags
Boilerplate text (for example, in emails)
Tracking codes
Excessive blank space between text
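A sketch of removing a few of these with pandas and regular expressions; the column names are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, 29],
    "email": ["a@example.com", "b@example.com"],        # PII, not needed for an age analysis
    "comment": ["<p>Great product!</p>", "Fast shipping   thanks"],
})

# Drop columns that are irrelevant to the question being asked.
df = df.drop(columns=["email"])

# Strip HTML tags and collapse excessive whitespace in free-text fields.
df["comment"] = (
    df["comment"]
    .str.replace(r"<[^>]+>", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
print(df)
```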
3. Standardize Capitalization
Within your data, you need to make sure that the text is consistent. If you have a mixture of
capitalization, this could lead to different erroneous categories being created.
It could also cause problems when you need to translate before processing as capitalization can
change the meaning. For instance, Bill is a person's name whereas a bill or to bill is something else
entirely.
If, in addition to data cleaning, you are text cleaning in order to process your data with a computer
model, it’s much simpler to put everything in lowercase.
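A minimal pandas sketch of lowercasing a text column (the column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["New York", "new york", "NEW YORK"]})

# Lowercasing collapses the three spellings into a single category.
df["city"] = df["city"].str.lower().str.strip()
print(df["city"].value_counts())
```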
4. Convert Data Types
Numbers are the most common data type that you will need to convert when cleaning your data. Often numbers are entered as text; however, in order to be processed, they need to appear as numerals.
If they are appearing as text, they are classed as a string and your analysis algorithms cannot
perform mathematical equations on them.
The same is true for dates that are stored as text; these should be converted to a consistent date format. For example, if you have an entry that reads September 24th 2021, you'll need to change it to read 09/24/2021.
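A sketch of both conversions with pandas, assuming columns that arrive as text:

```python
import pandas as pd

df = pd.DataFrame({
    "price":      ["19.99", "5.00", "not available"],
    "order_date": ["September 24th 2021", "10/01/2021", "2021-11-03"],
})

# Convert text to numbers; values that cannot be parsed become NaN instead of raising an error.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Convert text to real datetime values; unparseable entries become NaT.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

print(df.dtypes)
print(df)
```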
5. Clear Formatting
Machine learning models can’t process your information if it is heavily formatted. If you are
taking data from a range of sources, it’s likely that there are a number of different document
formats. This can make your data confusing and incorrect.
You should remove any kind of formatting that has been applied to your documents, so you can
start from zero. This is normally not a difficult process; both Excel and Google Sheets, for example, have simple built-in functions to do this.
6. Fix Errors
It probably goes without saying that you’ll need to carefully remove any errors from your data.
Errors as avoidable as typos could lead to you missing out on key findings from your data. Some
of these can be avoided with something as simple as a quick spell-check.
Spelling mistakes or extra punctuation in data like an email address could mean you miss out on
communicating with your customers. It could also lead to you sending unwanted emails to people
who didn’t sign up for them.
Other errors can include inconsistencies in formatting. For example, if you have a column of US
dollar amounts, you’ll have to convert any other currency type into US dollars so as to preserve a
consistent standard currency. The same is true of any other form of measurement such as grams,
ounces, etc.
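For the currency example, a hedged sketch: the conversion rates below are placeholders rather than real market rates, and the column names are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "amount":   [100.0, 250.0, 80.0],
    "currency": ["USD", "EUR", "GBP"],
})

# Placeholder conversion rates to USD -- look up real rates for an actual analysis.
rates_to_usd = {"USD": 1.0, "EUR": 1.1, "GBP": 1.3}

df["amount_usd"] = df["amount"] * df["currency"].map(rates_to_usd)
df["currency"] = "USD"
print(df)
```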
7. Language Translation
To have consistent data, you’ll want everything in the same language.
The Natural Language Processing (NLP) models behind software used to analyze data are also
predominantly monolingual, meaning they are not capable of processing multiple languages. So,
you’ll need to translate everything into one language.
8. Handle Missing Values
When it comes to missing values you have two options:
Remove the observations that have this missing value
Input the missing data
What you choose to do will depend on your analysis goals and what you want to do next with your
data.
Removing the missing value completely might remove useful insights from your data. After all,
there was a reason that you wanted to pull this information in the first place.
Therefore it might be better to input the missing data by researching what should go in that field.
If you don’t know what it is, you could replace it with the word “missing.” If it is numerical, you can place a zero in the missing field.
However, if there are so many missing values that there isn’t enough data to use, then you should
remove the whole section.
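Both options are a single call in pandas; which one you choose depends on how much is missing and why. The column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["Asha", "Ben", None, "Dina"],
    "spend":    [120.0, None, 45.0, 80.0],
})

# Option 1: remove the observations that contain a missing value.
dropped = df.dropna()

# Option 2: fill in the gaps -- e.g. "missing" for text, 0 for numbers,
# following the simple rule described above.
filled = df.fillna({"customer": "missing", "spend": 0})

print(dropped)
print(filled)
```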
The Wrap Up
While it can sometimes be time-consuming to clean your data, it will cost you more than just time
if you skip this step. “Dirty” data can lead to a whole host of issues, so you want it clean before
you begin your analysis.
Data Storage and Management
Storage management refers to the management of the data storage equipment used to store user- and computer-generated data. It is a tool, or set of processes, used by an administrator to keep data and storage equipment safe. Storage management is a process for optimizing the use of storage devices and protecting the integrity of data on any media on which it resides. The category of storage management generally contains different subcategories covering aspects such as security, virtualization, and more, as well as different types of provisioning and automation, which together make up the storage management software market.
Storage management key attributes: Storage management has some key attributes that are generally used to manage the storage capacity of a system. These are given below:
1. Performance
2. Reliability
3. Recoverability
4. Capacity
Features of storage management: There are some features of storage management that are provided for managing storage capacity. These are given below:
Storage management is a process that is used to optimize the use of storage devices.
Storage management must be allocated and managed as a resource in order to truly benefit
a corporation.
Storage management is generally a basic system component of information systems.
It is used to improve the performance of data storage resources.
Advantages of storage management: There are some advantages of storage management, which are given below:
It becomes very simple to manage storage capacity.
It generally reduces time consumption.
It improves the performance of the system.
With virtualization and automation technologies, it can help an organization improve its agility.
Limitations of storage management:
Limited physical storage capacity: Operating systems can only manage the physical storage
space that is available, and as such, there is a limit to how much data can be stored.
Performance degradation with increased storage utilization: As more data is stored, the
system’s performance can decrease due to increased disk access time, fragmentation, and other
factors.
Complexity of storage management: Storage management can be complex, especially as the size
of the storage environment grows.
Cost: Storing large amounts of data can be expensive, and the cost of additional storage capacity
can add up quickly.
Security issues: Storing sensitive data can also present security risks, and the operating system
must have robust security features in place to prevent unauthorized access to this data.
Backup and Recovery: Backup and recovery of data can also be challenging, especially if the
data is stored on multiple systems or devices.
Using Multiple Data Sources
By using multiple data sources for your model, you can reduce the total volume of data processed
and thereby shorten processing times in Cognos® Transformer. If used in combination with
calculated columns, multiple data sources can minimize or eliminate the need to create database
table joins in an external data access tool. Using multiple data sources also enables measure
allocation.
For example, suppose your product, customer, and order data is stored in a set of tables. If you
were to use this data from a single source, you would need separate tables for Product, Customer,
Customer Site, Order, and Order Detail. This source would contain many duplicate values, and the
joins between the tables would be relatively complex.
Instead, you create three separate sources for Products, Customer/Site, and Order/Order Detail
data. The volume of data contained in each is less than that in the single source, and there are only
simple joins between Customer and Customer Site tables, and Order and Order Detail.
Remember that each source must contain sufficient data to establish context within the dimension
map. You cannot perform database table joins in Cognos Transformer.
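The same idea can be illustrated outside Cognos Transformer with a small pandas sketch: rather than one wide table that repeats product and customer details on every order line, keep three narrower sources and combine them only when a particular analysis needs it. All names here are hypothetical.

```python
import pandas as pd

# Three narrower sources instead of one wide, duplicate-heavy table.
products = pd.DataFrame({"product_id": [1, 2], "product": ["Widget", "Gadget"]})
customers = pd.DataFrame({
    "customer_id": [10, 11],
    "customer": ["Asha", "Ben"],
    "site": ["Pune", "Leeds"],
})
orders = pd.DataFrame({
    "order_id":    [100, 101, 102],
    "customer_id": [10, 11, 10],
    "product_id":  [1, 1, 2],
    "quantity":    [3, 1, 5],
})

# Combine only when needed, with simple key-based joins.
report = orders.merge(customers, on="customer_id").merge(products, on="product_id")
print(report)
```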
Procedure
1. Using a data access tool such as IBM® Cognos Impromptu or Framework Manager, create each of the data sources required for your model.
2. From the Welcome page, click Create a new model to use the New Model wizard to add the largest structural data source to your model.
Tip: If you are already in Cognos Transformer, click New from the File menu.
3. From the Edit menu, click Insert Data Source and add the additional structural data sources to the Data Sources list.
4. Repeat to add the transactional data sources to the Data Sources list.