1. Data Collection and Management
What is Data Collection?
The process of gathering and analyzing accurate data from various sources to find answers to research problems, identify trends and probabilities, and evaluate possible outcomes is known as Data Collection.
Knowledge is power, information is knowledge, and data is information in digitized form, at least
as defined in IT. Hence, data is power. But before you can leverage that data into a successful
strategy for your organization or business, you need to gather it. That’s your first step.
So, to help you get the process started, we shine a spotlight on data collection.
What exactly is it? Believe it or not, it’s more than just doing a Google search! Furthermore,
what are the different types of data collection? And what kinds of data collection tools and data
collection techniques exist?
During data collection, the researchers must identify the data types, the sources of data, and what
methods are being used. We will soon see that there are many different data collection methods.
There is heavy reliance on data collection in research, commercial, and government fields.
Before an analyst begins collecting data, they must answer three questions:
What’s the goal or purpose of this research?
What kinds of data are they planning on gathering?
What methods and procedures will be used to collect, store, and process the
information?
Additionally, we can break up data into qualitative and quantitative types. Qualitative data covers
descriptions such as color, size, quality, and appearance. Quantitative data, unsurprisingly, deals
with numbers, such as statistics, poll numbers, percentages, etc.
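To make the distinction concrete, here is a minimal sketch of a single record that mixes both kinds of data; the field names are hypothetical.

```python
# A hypothetical customer record mixing qualitative and quantitative fields.
record = {
    # Qualitative (descriptive) attributes
    "color_preference": "blue",
    "size": "medium",
    "feedback": "packaging looked great",
    # Quantitative (numeric) attributes
    "age": 34,
    "orders_last_year": 12,
    "satisfaction_score": 4.5,
}

# Quantitative fields support arithmetic; qualitative fields are grouped or counted instead.
orders_per_month = record["orders_last_year"] / 12
print(orders_per_month)
```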
Categories of APIs
Web-based systems:
A web API is an interface to either a web server or a web browser. These APIs are used extensively in the development of web applications and work at either the server end or the client end. Companies such as Google, Amazon, and eBay all provide web-based APIs. Some popular examples of web-based APIs are the Twitter REST API, the Facebook Graph API, the Amazon S3 REST API, etc.
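To make this concrete, here is a minimal sketch of calling a web API over HTTP from Python with the third-party requests library. The endpoint URL and parameters are illustrative placeholders, not part of any specific API named above.

```python
import requests

# Hypothetical REST endpoint; substitute a real API URL and its documented parameters.
url = "https://api.example.com/v1/search"
params = {"q": "data collection", "limit": 10}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()   # fail loudly on HTTP errors
results = response.json()     # most web APIs return JSON

for item in results.get("items", []):
    print(item)
```

Most web-based APIs follow this same pattern: send a request to a documented URL, then parse the structured response the service returns.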
Operating systems:
There are multiple OS-based APIs that expose the functionality of various operating system features, which can be incorporated when creating Windows or Mac applications.
Some examples of OS-based APIs are Cocoa, Carbon, WinAPI, etc.
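As a small, hedged illustration, Python's standard ctypes module can call directly into a WinAPI function on Windows; the sketch below pops up a message box and runs only on Windows.

```python
import ctypes
import sys

# WinAPI call via ctypes: MessageBoxW(hWnd, text, caption, type).
# ctypes.windll is only available on Windows, so guard the call.
if sys.platform == "win32":
    ctypes.windll.user32.MessageBoxW(0, "Hello from the WinAPI", "OS API demo", 0)
else:
    print("This example requires Windows (WinAPI).")
```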
Database systems:
Interaction with most databases is done through API calls to the database. These APIs are defined so that they return the requested data in a predefined format that the requesting client can understand.
This generalizes the process of interacting with databases and thereby enhances the compatibility of applications with various databases. These APIs are very robust and provide a structured interface to the database.
Some popular examples are the Drupal 7 Database API, the Drupal 8 Database API, and the Django database API.
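As a concrete example of a database API, Python's built-in sqlite3 module (a DB-API 2.0 implementation) lets an application send queries and get results back in a predictable structure, without caring how the database stores the data internally.

```python
import sqlite3

# Open (or create) a local SQLite database and issue API calls against it.
conn = sqlite3.connect("example.db")
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Asha", "Pune"))
conn.commit()

# The API returns rows in a predefined format (here, tuples).
for row in cur.execute("SELECT id, name, city FROM customers"):
    print(row)

conn.close()
```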
Hardware systems:
These APIs allow access to the various hardware components of a system and are crucial for establishing communication with the hardware. This makes possible a range of functions, from collecting sensor data to rendering output on your screens.
For example, the Google PowerMeter API allows device manufacturers to build home energy monitoring devices that work with Google PowerMeter.
Some other examples of hardware APIs are QUANT Electronic, WareNetCheckWare, OpenVX Hardware Acceleration, CubeSensore, etc.
3. Difference between an API and a Library
At this point, you might be scratching your head and confusing APIs with libraries. Let me simplify it for you: an application programming interface (API) is an interface that defines the way in which an application program may request services from a library.
An API is a set of rules that defines the interaction between various entities; here, we are specifically talking about the interaction between two pieces of software.
A library also has an API, which denotes the part of the library that is actually accessible to the user from outside.
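A tiny sketch may help separate the two ideas: the library is the code that does the work, while its API is the set of names and call signatures you are meant to use from outside. The module and function names below are invented for illustration.

```python
# shapes.py -- a toy "library" (hypothetical)
import math

def _validate(radius):
    # Leading underscore: an internal helper, not part of the public API.
    if radius < 0:
        raise ValueError("radius must be non-negative")

def circle_area(radius):
    """Public API: the documented, stable entry point that callers rely on."""
    _validate(radius)
    return math.pi * radius ** 2
```

A caller only touches the public surface, for example circle_area(2.0); the internals behind it can change freely as long as the API stays the same.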
What Are the Different Sources of Data Collection?
The following are seven primary methods of collecting data in business analytics.
Surveys
Transactional Tracking
Interviews and Focus Groups
Observation
Online Tracking
Forms
Social Media Monitoring
Data collection breaks down into two methods. As a side note, many terms, such as techniques, methods, and types, are used interchangeably depending on who uses them. One source may call
data collection techniques “methods,” for instance. But whatever labels we use, the general
concepts and breakdowns apply across the board whether we’re talking about marketing analysis
or a scientific research project.
The two methods are:
Primary
As the name implies, this is original, first-hand data collected by the data researchers. This process
is the initial information gathering step, performed before anyone carries out any further or related
research. Primary data results are highly accurate provided the researcher collects the information themselves.
However, there’s a downside, as first-hand research is potentially time-consuming and expensive.
Secondary
Secondary data is second-hand data collected by other parties and already having undergone
statistical analysis. This data is either information that the researcher has tasked other people to
collect or information the researcher has looked up. Simply put, it’s second-hand information.
Although it’s easier and cheaper to obtain than primary information, secondary information raises
concerns regarding accuracy and authenticity. Quantitative data makes up a majority of secondary
data.
Data Exploration:
What is Data Exploration?
Data exploration is the first step of data analysis used to explore and visualize data to uncover
insights from the start or identify areas or patterns to dig into more. Using interactive dashboards
and point-and-click data exploration, users can better understand the bigger picture and get to
insights faster.
Why is Data Exploration Important?
Starting with data exploration helps users make better decisions about where to dig deeper into the data and gain a broad understanding of the business before asking more detailed questions later.
With a user-friendly interface, anyone across an organization can familiarize themselves with the
data, discover patterns, and generate thoughtful questions that may spur on deeper, valuable
analysis.
Data exploration and visual analytics tools build understanding, empowering users to explore data
in any visualization. This approach speeds up time to answers and deepens users’ understanding
by covering more ground in less time. Data exploration is important for this reason because it
democratizes access to data and provides governed self-service analytics. Furthermore, businesses
can accelerate data exploration by provisioning and delivering data through visual data marts that
are easy to explore and use.
What are the Main Use Cases for Data Exploration?
Data exploration can help businesses explore large amounts of data quickly to better understand
next steps in terms of further analysis. This gives the business a more manageable starting point
and a way to target areas of interest. In most cases, data exploration involves using data
visualizations to examine the data at a high level. By taking this high-level approach, businesses
can determine which data is most important and which may distort the analysis and therefore
should be removed. Data exploration can also be helpful in decreasing time spent on less valuable
analysis by selecting the right path forward from the start.
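In practice, much of this high-level first pass can be done in a few lines of pandas before any dashboarding. A minimal sketch, assuming a CSV file and column names that are purely illustrative:

```python
import pandas as pd

# Load a dataset (the file name and columns are hypothetical).
df = pd.read_csv("sales.csv")

print(df.shape)                      # how much data is there?
print(df.head())                     # what does it look like?
print(df.dtypes)                     # which columns are numeric vs. text?
print(df.describe())                 # quick summary statistics for numeric columns
print(df.isna().sum())               # where are the missing values?
print(df["region"].value_counts())   # distribution of a categorical column
```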
Fixing Data:
Here are 8 effective data cleaning techniques:
Remove duplicates
Remove irrelevant data
Standardize capitalization
Convert data type
Clear formatting
Fix errors
Language translation
Handle missing values
Let’s go through these in more detail now.
1. Remove Duplicates
When you collect your data from a range of different places, or scrape your data, it’s likely that
you will have duplicated entries. These duplicates could originate from human error where the
person inputting the data or filling out a form made a mistake.
Duplicates will inevitably skew your data and/or confuse your results. They can also just make the
data hard to read when you want to visualize it, so it’s best to remove them right away.
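In pandas, deduplication is a one-liner; a small sketch with an invented frame:

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@example.com", "a@example.com", "b@example.com"],
    "age":   [34, 34, 29],
})

# Drop rows that are exact duplicates (keeping the first occurrence).
deduped = df.drop_duplicates()

# Or treat rows as duplicates whenever a key column repeats.
deduped_by_email = df.drop_duplicates(subset=["email"], keep="first")
print(deduped_by_email)
```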
2. Remove Irrelevant Data
Irrelevant data will slow down and confuse any analysis that you want to do. So, deciphering what
is relevant and what is not is necessary before you begin your data cleaning. For instance, if you
are analyzing the age range of your customers, you don’t need to include their email addresses.
Other elements you’ll need to remove as they add nothing to your data include:
Personally identifiable information (PII)
URLs
HTML tags
Boilerplate text (for example, in emails)
Tracking codes
Excessive blank space between text
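A sketch of removing a few of these with pandas and regular expressions; the column names are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, 29],
    "email": ["a@example.com", "b@example.com"],        # PII, not needed for an age analysis
    "comment": ["<p>Great product!</p>", "Fast shipping   thanks"],
})

# Drop columns that are irrelevant to the question being asked.
df = df.drop(columns=["email"])

# Strip HTML tags and collapse excessive whitespace in free-text fields.
df["comment"] = (
    df["comment"]
    .str.replace(r"<[^>]+>", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
print(df)
```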
3. Standardize Capitalization
Within your data, you need to make sure that the text is consistent. If you have a mixture of
capitalization, this could lead to different erroneous categories being created.
It could also cause problems when you need to translate before processing as capitalization can
change the meaning. For instance, Bill is a person's name whereas a bill or to bill is something else
entirely.
If, in addition to data cleaning, you are text cleaning in order to process your data with a computer
model, it’s much simpler to put everything in lowercase.
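A minimal pandas sketch of lowercasing a text column (the column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["New York", "new york", "NEW YORK"]})

# Lowercasing collapses the three spellings into a single category.
df["city"] = df["city"].str.lower().str.strip()
print(df["city"].value_counts())
```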
4. Convert Data Types
Numbers are the most common data type that you will need to convert when cleaning your data. Often numbers are entered as text; however, in order to be processed, they need to appear as numerals.
If they are appearing as text, they are classed as a string and your analysis algorithms cannot
perform mathematical equations on them.
The same is true for dates that are stored as text; these should be converted to a consistent date format. For example, if you have an entry that reads September 24th 2021, you'll need to change it to read 09/24/2021.
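A sketch of both conversions with pandas, assuming columns that arrive as text:

```python
import pandas as pd

df = pd.DataFrame({
    "price":      ["19.99", "5.00", "not available"],
    "order_date": ["September 24th 2021", "10/01/2021", "2021-11-03"],
})

# Convert text to numbers; values that cannot be parsed become NaN instead of raising an error.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Convert text to real datetime values; unparseable entries become NaT.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

print(df.dtypes)
print(df)
```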
5. Clear Formatting
Machine learning models can’t process your information if it is heavily formatted. If you are
taking data from a range of sources, it’s likely that there are a number of different document
formats. This can make your data confusing and incorrect.
You should remove any kind of formatting that has been applied to your documents, so you can
start from zero. This is normally not a difficult process; both Excel and Google Sheets, for example, have simple built-in functions to do this.
6. Fix Errors
It probably goes without saying that you’ll need to carefully remove any errors from your data.
Errors as avoidable as typos could lead to you missing out on key findings from your data. Some
of these can be avoided with something as simple as a quick spell-check.
Spelling mistakes or extra punctuation in data like an email address could mean you miss out on
communicating with your customers. It could also lead to you sending unwanted emails to people
who didn’t sign up for them.
Other errors can include inconsistencies in formatting. For example, if you have a column of US
dollar amounts, you’ll have to convert any other currency type into US dollars so as to preserve a
consistent standard currency. The same is true of any other form of measurement such as grams,
ounces, etc.
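For the currency example, a hedged sketch: the conversion rates below are placeholders rather than real market rates, and the column names are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "amount":   [100.0, 250.0, 80.0],
    "currency": ["USD", "EUR", "GBP"],
})

# Placeholder conversion rates to USD -- look up real rates for an actual analysis.
rates_to_usd = {"USD": 1.0, "EUR": 1.1, "GBP": 1.3}

df["amount_usd"] = df["amount"] * df["currency"].map(rates_to_usd)
df["currency"] = "USD"
print(df)
```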
7. Language Translation
To have consistent data, you’ll want everything in the same language.
The Natural Language Processing (NLP) models behind software used to analyze data are also
predominantly monolingual, meaning they are not capable of processing multiple languages. So,
you’ll need to translate everything into one language.
8. Handle Missing Values
When it comes to missing values you have two options:
Remove the observations that have this missing value
Input the missing data
What you choose to do will depend on your analysis goals and what you want to do next with your
data.
Removing the missing value completely might remove useful insights from your data. After all,
there was a reason that you wanted to pull this information in the first place.
Therefore it might be better to input the missing data by researching what should go in that field.
If you don’t know what it is, you could replace it with the word “missing.” If it is numerical, you can place a zero in the missing field.
However, if there are so many missing values that there isn’t enough data to use, then you should
remove the whole section.
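Both options are a single call in pandas; which one you choose depends on how much is missing and why. The column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["Asha", "Ben", None, "Dina"],
    "spend":    [120.0, None, 45.0, 80.0],
})

# Option 1: remove the observations that contain a missing value.
dropped = df.dropna()

# Option 2: fill in the gaps -- e.g. "missing" for text, 0 for numbers,
# following the simple rule described above.
filled = df.fillna({"customer": "missing", "spend": 0})

print(dropped)
print(filled)
```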
The Wrap Up
While it can sometimes be time-consuming to clean your data, it will cost you more than just time
if you skip this step. “Dirty” data can lead to a whole host of issues, so you want it clean before
you begin your analysis.
Data Storage and Management
Storage management refers to the management of the data storage equipment used to store user- and computer-generated data. It is a tool, or set of processes, used by an administrator to keep data and storage equipment safe. Storage management is a process for optimizing the use of storage devices and protecting the integrity of data on any media on which it resides. The category of storage management generally contains different subcategories covering aspects such as security, virtualization, and more, as well as different types of provisioning and automation, which together make up the storage management software market.
Storage management key attributes: Storage management has some key attributes that are generally used to manage the storage capacity of a system. These are given below:
1. Performance
2. Reliability
3. Recoverability
4. Capacity
Features of storage management: There are some features of storage management that are provided for managing storage capacity. These are given below:
Storage management is a process that is used to optimize the use of storage devices.
Storage management must be allocated and managed as a resource in order to truly benefit
a corporation.
Storage management is generally a basic system component of information systems.
It is used to improve the performance of data storage resources.
Advantages of storage management: There are some advantages of storage management, which are given below:
It becomes very simple to manage storage capacity.
It generally reduces time consumption.
It improves the performance of the system.
With virtualization and automation technologies, it can help an organization improve its agility.
Limitations of storage management:
Limited physical storage capacity: Operating systems can only manage the physical storage
space that is available, and as such, there is a limit to how much data can be stored.
Performance degradation with increased storage utilization: As more data is stored, the
system’s performance can decrease due to increased disk access time, fragmentation, and other
factors.
Complexity of storage management: Storage management can be complex, especially as the size
of the storage environment grows.
Cost: Storing large amounts of data can be expensive, and the cost of additional storage capacity
can add up quickly.
Security issues: Storing sensitive data can also present security risks, and the operating system
must have robust security features in place to prevent unauthorized access to this data.
Backup and Recovery: Backup and recovery of data can also be challenging, especially if the
data is stored on multiple systems or devices.
Using Multiple Data Sources
By using multiple data sources for your model, you can reduce the total volume of data processed
and thereby shorten processing times in Cognos® Transformer. If used in combination with
calculated columns, multiple data sources can minimize or eliminate the need to create database
table joins in an external data access tool. Using multiple data sources also enables measure
allocation.
For example, suppose your product, customer, and order data is stored in a set of tables. If you
were to use this data from a single source, you would need separate tables for Product, Customer,
Customer Site, Order, and Order Detail. This source would contain many duplicate values, and the
joins between the tables would be relatively complex.
Instead, you create three separate sources for Products, Customer/Site, and Order/Order Detail
data. The volume of data contained in each is less than that in the single source, and there are only
simple joins between Customer and Customer Site tables, and Order and Order Detail.
Remember that each source must contain sufficient data to establish context within the dimension
map. You cannot perform database table joins in Cognos Transformer.
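The same idea can be illustrated outside Cognos Transformer with a small pandas sketch: rather than one wide table that repeats product and customer details on every order line, keep three narrower sources and combine them only when a particular analysis needs it. All names here are hypothetical.

```python
import pandas as pd

# Three narrower sources instead of one wide, duplicate-heavy table.
products = pd.DataFrame({"product_id": [1, 2], "product": ["Widget", "Gadget"]})
customers = pd.DataFrame({
    "customer_id": [10, 11],
    "customer": ["Asha", "Ben"],
    "site": ["Pune", "Leeds"],
})
orders = pd.DataFrame({
    "order_id":    [100, 101, 102],
    "customer_id": [10, 11, 10],
    "product_id":  [1, 1, 2],
    "quantity":    [3, 1, 5],
})

# Combine only when needed, with simple key-based joins.
report = orders.merge(customers, on="customer_id").merge(products, on="product_id")
print(report)
```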
Procedure
1. Using a data access tool such as IBM® Cognos Impromptu or Framework Manager, create each of the data sources required for your model.
2. From the Welcome page, click Create a new model to use the New Model wizard to add the largest structural data source to your model.
Tip: If you are already in Cognos Transformer, click New from the File menu.
3. From the Edit menu, click Insert Data Source and add the additional structural data sources to the Data Sources list.
4. Repeat to add the transactional data sources to the Data Sources list.