DATA ANALYSIS AND INFORMATION MANAGEMENT
3RD SEMESTER
Submitted by:
NAME: DIPTASREE DEBBARMA
ENROLLMENT NO: 20PMHS005
SEMESTER & YEAR: 3RD SEMESTER 2021
SUBJECT: DATA ANALYSIS & INFORMATION MANAGEMENT
ASSIGNMENT - I (QUESTION & ANSWER)
MANAGEMENT HUMANITIES & SOCIAL SCIENCE
National Institute of Technology, AGARTALA
Jirania PO, Agartala, Barjala, Tripura 799046
1. What is the process of Data Analysis?
Data Analysis is the process of collecting, cleaning, transforming, and modeling data with
the goal of discovering useful information. The results so obtained are communicated,
suggesting conclusions and supporting decision-making. Data visualization is at times
used to portray the data so that useful patterns are easier to discover. The terms Data
Modeling and Data Analysis are often used interchangeably. The Data Analysis process
consists of the following phases, which are iterative in nature –
• Data Requirements Specification
• Data Collection
• Data Processing
• Data Cleaning
• Data Analysis
• Communication
I. Data Requirements Specification
The data required for analysis is based on a question or an experiment. Based on the
requirements of those directing the analysis, the data necessary as inputs to the analysis
is identified (e.g., Population of people). Specific variables regarding a population (e.g.,
Age and Income) may be specified and obtained. Data may be numerical or categorical.
II. Data Collection
Data Collection is the process of gathering information on targeted variables identified
as data requirements. The emphasis is on ensuring accurate and honest collection of
data. Data Collection ensures that data gathered is accurate such that the related
decisions are valid.
Data Collection provides both a baseline to measure and a target to improve. Data is
collected from various sources ranging from organizational databases to the
information in web pages. The data thus obtained, may not be structured and may
contain irrelevant information. Hence, the collected data is required to be subjected to
Data Processing and Data Cleaning.
III. Data Processing:-
The data that is collected must be processed or organized for analysis. This includes
structuring the data as required for the relevant Analysis Tools. For example, the data
might have to be placed into rows and columns in a table within a Spreadsheet or
Statistical Application. A Data Model might have to be created.
IV. Data Cleaning:-
The processed and organized data may be incomplete, contain duplicates, or contain
errors. Data Cleaning is the process of preventing and correcting these errors. There are
several types of Data Cleaning that depend on the type of data. For example, while
cleaning the financial data, certain totals might be compared against reliable published
numbers or defined thresholds. Likewise, quantitative data methods can be used for
outlier detection; the outliers found may subsequently be excluded from the analysis.
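The following is a hedged sketch of such cleaning steps on a hypothetical pandas column (duplicate removal, dropping incomplete records, and a simple z-score rule for outliers); the data values and threshold are assumptions:

```python
# Illustrative Data Cleaning steps on a hypothetical DataFrame:
# drop duplicates, drop incomplete records, flag extreme values.
import pandas as pd

df = pd.DataFrame({"income": [48000, 52000, 52000, None, 61000, 990000]})

df = df.drop_duplicates()              # remove duplicate rows
df = df.dropna(subset=["income"])      # drop incomplete records

# Flag outliers with a z-score rule (threshold of 3 is an assumption)
z = (df["income"] - df["income"].mean()) / df["income"].std()
outliers = df[z.abs() > 3]             # candidates to exclude from analysis
print(outliers)
```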
V. Data Analysis:-
Data that is processed, organized and cleaned would be ready for the analysis. Various
data analysis techniques are available to understand, interpret, and derive conclusions
based on the requirements. Data Visualization may also be used to examine the data in
graphical format, to obtain additional insight regarding the messages within the data.
Statistical data models such as correlation and regression analysis can be used to identify
the relations among the data variables. These models, which are descriptive of the data,
are helpful in simplifying the analysis and communicating the results.
The process might require additional Data Cleaning or additional Data Collection, and
hence these activities are iterative in nature.
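As a brief illustration of this step, the sketch below computes a correlation and fits a simple linear regression on hypothetical Age and Income variables (the numbers are made up for the example):

```python
# Illustrative Data Analysis step: correlation and a simple
# least-squares regression on hypothetical variables.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 32, 40, 47, 55],
                   "income": [38000, 45000, 52000, 60000, 67000]})

print(df["age"].corr(df["income"]))        # correlation between the variables

# Least-squares fit: income ≈ slope * age + intercept
slope, intercept = np.polyfit(df["age"], df["income"], 1)
print(slope, intercept)
```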
VI. Communication:-
The results of the data analysis are to be reported in a format as required by the users
to support their decisions and further action. The feedback from the users might result
in additional analysis.
Data analysts can choose data visualization techniques, such as tables and charts,
which help in communicating the message clearly and efficiently to the users. Analysis
tools provide facilities to highlight the required information with colour codes
and formatting in tables and charts.
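A small sketch of the Communication step follows; it produces a simple bar chart with matplotlib, and the categories and sales figures are hypothetical:

```python
# Minimal sketch: presenting results as a chart for stakeholders.
import matplotlib.pyplot as plt

categories = ["North", "South", "East", "West"]   # hypothetical regions
sales = [120, 95, 143, 88]                        # hypothetical results

plt.bar(categories, sales, color="steelblue")
plt.title("Sales by Region")
plt.ylabel("Units sold")
plt.savefig("sales_by_region.png")   # save the chart to share with users
```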
2. What is the difference between Data Mining and Data Analysis?
1. Data Analysis:
Data Analysis involves the extraction, cleaning, transformation,
modeling, and visualization of data with the objective of extracting important and
helpful information that can further be used to derive conclusions and make decisions.
The main purpose of data analysis is to find meaningful information in raw data so that
the derived knowledge can be used to make important decisions.
2. Data Mining:
Data mining can be considered a subset of Data Analysis. It is the exploration and
analysis of huge datasets to find important patterns and rules.
Data mining is a systematic and sequential process of identifying and
discovering hidden patterns and information within a large dataset. Moreover, it is used
to build machine learning models that are further used in artificial intelligence.
Based on           | Data Mining                                                        | Data Analysis
Definition         | It is the process of extracting important patterns from large datasets. | It is the process of analysing and organizing raw data in order to determine useful information and support decisions.
Function           | It is used to discover hidden patterns in raw data sets.          | It involves all the operations used in examining data sets to find conclusions.
Data set           | Data sets are generally large and structured.                      | Data sets can be large, medium, or small, and structured, semi-structured, or unstructured.
Models             | It often requires mathematical and statistical models.             | It uses analytical and business intelligence models.
Visualization      | It generally does not require visualization.                       | It definitely requires data visualization.
Goal               | The prime goal is to make data usable.                             | It is used to make data-driven decisions.
Required knowledge | It involves the intersection of machine learning, statistics, and databases. | It requires knowledge of computer science, statistics, mathematics, subject knowledge, and AI/Machine Learning.
Also known as      | It is also known as Knowledge Discovery in Databases (KDD).        | Data analysis can be divided into descriptive statistics, exploratory data analysis, and confirmatory data analysis.
Output             | It shows the data trends and patterns.                             | The output is a verified or discarded hypothesis.
3. What is data cleansing and what are the best ways to practice data
cleansing?
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly
formatted, duplicate, or incomplete data within a dataset. When combining multiple
data sources, there are many opportunities for data to be duplicated or mislabelled. If
data is incorrect, outcomes and algorithms are unreliable, even though they may look
correct. There is no one absolute way to prescribe the exact steps in the data cleaning
process because the processes will vary from dataset to dataset. But it is crucial to
establish a template for your data cleaning process so you know you are doing it the
right way every time.
Some of the benefits of data cleaning are:
• It accelerates data governance while reducing time and cost of implementation
to maximise ROI.
• Accurately target customers and drive faster customer acquisition.
• Consolidate applications and cost-saving.
• Improves decision-making capabilities as it supports better analytics.
• It saves valuable resources by removing duplicate and inaccurate data from
databases, freeing up storage space and processing time.
• It boosts productivity as it saves time in re-analysing work due to mistakes in
data and saves from making incorrect decisions.
Best Practices for Data Cleaning:-
A. Chalk Out A Plan
When talking about data cleaning, one of the first steps is to carry out data profiling, which
helps in filtering the data and identifying outlier values or spotting problems in the data
that was collected. Once the profiling is done, the data is normalised, de-duplicated, and
purged of obsolete information. While profiling is the first step, what follows next is
asking these questions to carry out best practices:
• What are our goals and expectations?
• How will the execution be carried out?
• What are the benefits in terms of ROI?
• Where are the data sets captured from?
• How do we standardise the data?
• How do we validate the data?
• How do we test and monitor data quality?
• Are the expectations realistic?
• What is the cost, and more?
Having responses to these questions will help in chalking out an overall plan and
strategy to carry out data cleaning.
B. Uniform Data Standards Is The Way
For data cleaning, having a uniform data standard brings about better results. It
helps in improving the initial data quality, thereby easing the later steps. Decent-quality
data is easier to clean than low-quality data. Correction at the data entry point can be
one of the most crucial steps in ensuring overall data cleanliness.
To ensure data standards, many companies create data entry standards
documents, which help in the long run.
C. Validating the Accuracy of Data
The data that is collected and captured should be authentic to avoid errors in programs
and avoid re-runs. Data should be able to meet the required standards, and the source
should be accurate. While it is a crucial step, and can significantly improve the overall
quality of data sets, the process can be complicated and challenging. Especially while
dealing with large datasets. One of the effective ways is to develop a script or validate
small data at a time. It also helps in removing duplicates, identifying obsolete records
and other errors in the dataset.
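The sketch below is one possible shape such a validation script could take; the column names, validity rules, and file name are assumptions used only for illustration:

```python
# A hedged sketch of a small data-validation script.
import pandas as pd

def validate(df: pd.DataFrame) -> list:
    """Return a list of data-quality problems found in the frame."""
    problems = []
    if df.duplicated().any():
        problems.append("duplicate rows present")
    if df["age"].lt(0).any() or df["age"].gt(120).any():
        problems.append("age outside plausible range")
    if df["email"].isna().any():
        problems.append("missing email addresses")
    return problems

# Validate a small batch of data at a time rather than the whole file.
chunk = pd.read_csv("customers.csv", nrows=10_000)
print(validate(chunk))
```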
D. Identifying & Adding the Missing Data
After you have validated the data, the next step is to append the data that is missing.
Cross-referencing multiple data sources and combining known data into a final data set
makes it far more useful and valuable to you. This step is essential
in order to provide complete information for business intelligence and analytics. Once
the usability of the dataset is checked, the whole data cleaning process should be
automated to avoid human error, saving significant time and money.
E. Monitoring the System
While setting up automation is crucial, monitoring the whole data cleansing process is
highly essential. It checks the overall health and effectiveness of the system. It also
checks if the data is meeting standards and that the procedures have been followed
correctly. Implementing periodic checks will keep the situation in control.
4. When do you think you should retrain a model? Is it dependent on
the data?
Business data keeps changing on a day-to-day basis, but its format does not. Whenever a
business enters a new market, faces a sudden rise in competition, or sees its own position
rising or falling, it is recommended to retrain the model. So yes, the decision is dependent
on the data: as and when the business dynamics (and hence the underlying data) change,
the model should be retrained to reflect the changing behaviour of customers.
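One common way to operationalise this is to monitor the model and retrain when its performance on recent data drops; the sketch below is a generic, hypothetical trigger (the threshold and function names are assumptions, not a prescribed method):

```python
# Illustrative retraining trigger: refit the model when monitored
# accuracy on recent data falls below an assumed threshold.
def maybe_retrain(model, recent_accuracy: float, fresh_X, fresh_y,
                  threshold: float = 0.80):
    """Retrain when monitored performance falls below the threshold."""
    if recent_accuracy < threshold:
        model.fit(fresh_X, fresh_y)   # refit on data reflecting the new behaviour
    return model
```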
5. Can you mention a few problems that data analysts usually
encounter while performing the analysis?
• Having a poorly formatted data file. For instance, having CSV data with un-escaped
newlines and commas in columns.
• Having inconsistent and incomplete data can be frustrating.
• Common misspellings and duplicate entries are data quality problems
that most data analysts face.
• Having different value representations and misclassified data.
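For the CSV issue mentioned above, a hedged workaround in pandas is sketched below (the file name is hypothetical, and the on_bad_lines option exists in recent pandas versions):

```python
# Coping with a poorly formatted CSV: quoted fields allow embedded
# commas/newlines, and rows that still cannot be parsed are skipped.
import csv
import pandas as pd

df = pd.read_csv(
    "export.csv",
    quoting=csv.QUOTE_MINIMAL,   # respect quotes around fields with commas/newlines
    on_bad_lines="skip",         # drop malformed rows instead of aborting the load
)
print(df.shape)
```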
6. Mention the name of the framework developed by Apache for
processing large dataset for an application in a distributed
computing environment?
Hadoop, together with its MapReduce programming framework, was developed by Apache for
processing large datasets for an application in a distributed computing environment.
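To give the flavour of the MapReduce model that Hadoop runs at scale, here is a single-machine toy sketch in Python (the documents are made up; real Hadoop jobs distribute these steps across a cluster):

```python
# Toy word-count illustrating the MapReduce idea: a map step emits
# (word, 1) pairs and a reduce step sums the counts per word.
from collections import defaultdict

documents = ["big data needs big tools", "hadoop processes big data"]

# Map: emit (key, value) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle + Reduce: group by key and aggregate the values
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))
```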
7. How can you highlight cells with negative values in Excel?
• Select the cells in which you want to highlight the negative numbers in red.
• Go to Home → Conditional Formatting → Highlight Cell Rules → Less Than.
• In the Less Than dialog box, specify the value (0) below which the formatting should
be applied and choose a red format.
• Click OK.
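The same "less than 0" rule can also be applied programmatically; a hedged sketch with the openpyxl library follows (the file name, range, and fill colour are assumptions):

```python
# Apply a conditional-formatting rule that fills negative cells in red.
from openpyxl import Workbook
from openpyxl.styles import PatternFill
from openpyxl.formatting.rule import CellIsRule

wb = Workbook()
ws = wb.active
for value in (10, -4, 3, -7):
    ws.append([value])                      # one value per row in column A

red_fill = PatternFill(start_color="FFC7CE", end_color="FFC7CE", fill_type="solid")
ws.conditional_formatting.add(
    "A1:A4",
    CellIsRule(operator="lessThan", formula=["0"], fill=red_fill),  # highlight negatives
)
wb.save("negatives_highlighted.xlsx")
```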
8. How can you clear all the formatting without actually
removing the cell contents?
If you click a cell and then press DELETE or BACKSPACE, you clear the cell contents
without removing any cell formats or cell comments. To do the opposite, that is, clear all
the formatting while keeping the contents, first select the cells from which you want to
remove the formatting, then go to the Home tab, click the Clear drop-down in the Editing
group, and choose Clear Formats. The cell values remain, but all formatting is removed.
9. What is a Print Area and how can you set it in Excel?
Print Area:-
A print area is a range of cells to be included in the final printout. In case you don't want
to print the entire spreadsheet, set a print area that includes only your selection. When you
press Ctrl + P or click the Print button on a sheet that has a defined print area, only that area
will be printed. You can select multiple print areas in a single worksheet, and each area will
print on a separate page. Saving the workbook also saves the print area. If you change your
mind at a later point, you can clear the print area or change it. Defining a print area gives
you more control over what each printed page looks like and, ideally, you should always set
a print area before sending a worksheet to the printer. Without it, you may end up with
messy, hard-to-read pages where some important rows and columns are cut off, especially if
your worksheet is bigger than the paper you are using. To instruct Excel which section of
your data should appear in a printed copy, proceed in one of the following ways.
Fastest way to set print area in Excel:-
The quickest way to set a constant print range is this:
1) Select the part of the worksheet that you want to print.
2) On the Page Layout tab, in the Page Setup group, click Print Area > Set Print Area.
3) A faint gray line will appear denoting the print area.
Tips and notes:
When you save the workbook, the print area is also saved. Whenever you send
the worksheet to the printer, only that area will be printed.
To make sure the defined areas are the ones you really want, press Ctrl + P and
go through each page preview.
To quickly print a certain part of your data without setting a print area, select the
desired range(s), press Ctrl + P and choose Print Selection in the drop-down list
right under Settings.
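Setting a print area can also be scripted; a short hedged sketch with openpyxl follows (the workbook name and cell range are assumptions for illustration):

```python
# Define a print area so that only the chosen range is sent to the printer.
from openpyxl import load_workbook

wb = load_workbook("report.xlsx")
ws = wb.active
ws.print_area = "A1:D20"   # only this range will be printed
wb.save("report.xlsx")
```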
10.What is the default port for SQL?
The default port 1433 is used by the default SQL Server instance running on the computer.
When multiple SQL Server named instances are running, they listen by default on dynamic
ports (49152–65535). In this scenario, an application first connects to the SQL Server
Browser service port (UDP 1434) to obtain the dynamic port and then connects to the
dynamic port directly.
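As a hedged illustration, a client connecting to a default instance can name port 1433 explicitly in its connection string; the server, database, credentials, and ODBC driver version below are placeholders:

```python
# Connection sketch: "host,port" names the default SQL Server port 1433.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=dbhost,1433;"          # host,port — 1433 is the default instance port
    "DATABASE=SalesDB;"
    "UID=analyst;PWD=secret"
)
```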
11. What do you mean by DBMS? What are its different types?
A database is a collection of data or records. Database management systems are
designed to manage databases. A database management system (DBMS) is a software
system that uses a standard method to store and organize data. The data can be added,
updated, deleted, or traversed using various standard algorithms and queries.
Types of Database Management Systems
There are several types of database management systems. Here is a list of eight
common database management systems:
A. Hierarchical databases
B. Network databases
C. Relational databases
D. Object-oriented databases
E. Graph databases
F. ER model databases
G. Document databases
H. NoSQL databases
• Hierarchical Databases
In the hierarchical database management system (hierarchical DBMS) model, data is stored
in parent-child relationship nodes. In a hierarchical database, besides the actual data,
records also contain information about their parent/child relationships.
In a hierarchical database model, data is organized into a tree-like structure. The
data is stored in the form of a collection of fields where each field contains only one value.
The records are linked to each other via links in a parent-child relationship. In a
hierarchical database model, each child record has only one parent. A parent can have
multiple children. To retrieve a field’s data, we need to traverse through each tree until the
record is found.
The hierarchical database structure was developed by IBM in the
early 1960s. While the hierarchical structure is simple, it is inflexible due to the
one-to-many parent-child relationships. Hierarchical databases are widely used to build
high-performance, high-availability applications, usually in the banking and
telecommunications industries. The IBM Information Management System (IMS) and the
Windows Registry are two popular examples of hierarchical databases.
Advantage
A hierarchical database can be accessed and updated rapidly because its model structure
is like a tree and the relationships between records are defined in advance. This feature
is, however, a double-edged sword.
Disadvantage
The drawback of this type of database structure is that each child in the tree may have
only one parent. Relationships or linkages between children are not permitted, even if
they make sense from a logical standpoint; hierarchical databases are simply designed that
way. Adding a new field or record requires that the entire database be redefined.
• Network Databases
Network database management systems (Network DBMSs) use a network structure to
create a relationship between entities. Network databases are mainly used on large
digital computers. Network databases are hierarchical databases, but unlike
hierarchical databases where one node can have a single parent only, a network node
can have a relationship with multiple entities. A network database looks more like a
cobweb or interconnected network of records.
In network databases, children are called members and parents are
called owners. The difference from the hierarchical model is that each member can have
more than one owner (parent).
The approach of the network data model is similar to that of the hierarchical data model. Data in a
network database is organized in many-to-many relationships. The network database
structure was invented by Charles Bachman. Some of the popular network databases
are the Integrated Data Store (IDS), IDMS (Integrated Database Management System),
Raima Database Manager, Turbo IMAGE, and Univac DMS-1100.
• Relational Databases
In a relational database management system (RDBMS), the relationship between data is
relational and data is stored in tabular form of columns and rows. Each column of a
table represents an attribute and each row in a table represents a record. Each field in a
table represents a data value. Structured Query Language (SQL) is the language used to
query RDBMS, including inserting, updating, deleting, and searching records. Relational
databases work on each table that has a key field that uniquely indicates each row.
These key fields can be used to connect one table of data to another.
Relational databases are the most popular and widely used databases. Some of the
popular RDBMSs are Oracle, SQL Server, MySQL, SQLite, and IBM DB2.
The relational database has two major advantages:
• Relational databases can be used with little or no training.
• Database entries can be modified without redefining the entire database structure.
Properties of Relational Tables
In a relational database, tables follow the properties given below (a small sketch follows
the list):
• Values are atomic.
• Each row is unique.
• Column values are of the same kind.
• The sequence of columns is insignificant.
• The sequence of rows is insignificant.
• Each column has a unique name.
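The sketch below illustrates the relational model using Python's built-in sqlite3 module: each table has a key field, and a foreign key connects one table to another (the table and column names are illustrative only):

```python
# Relational model sketch: key fields and a join between two tables.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE customers (
                   customer_id INTEGER PRIMARY KEY,
                   name        TEXT NOT NULL)""")
con.execute("""CREATE TABLE orders (
                   order_id    INTEGER PRIMARY KEY,
                   customer_id INTEGER REFERENCES customers(customer_id),
                   amount      REAL)""")
con.execute("INSERT INTO customers VALUES (1, 'Asha')")
con.execute("INSERT INTO orders VALUES (101, 1, 250.0)")

# Join the two tables through their key fields
for row in con.execute("""SELECT c.name, o.amount
                          FROM orders o JOIN customers c
                            ON o.customer_id = c.customer_id"""):
    print(row)
```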
• Object-Oriented Model
This model incorporates the functionality of object-oriented programming. It involves
more than just the storage of programming language objects: object DBMSs extend the
semantics of languages such as C++ and Java, providing full-featured database
programming capability while retaining native language compatibility. They add database
functionality to object programming languages. This approach unifies application and
database development into a consistent data model and language environment.
Applications require less code, use more natural data modeling, and their code bases are
easier to maintain. Object developers can write complete database applications with a
modest amount of additional effort.
The object-oriented database combines object-oriented programming language systems
with persistent database systems. The power of object-oriented databases comes from
the unified treatment of both persistent data, as found in databases, and transient data,
as found in executing programs. Object-oriented databases use small, reusable pieces of
software called objects, and the objects themselves are stored in the object-oriented
database.
Each object contains two elements:
• A piece of data (e.g., sound, video, text, or graphics).
• Instructions, or software programs called methods, that define what to do with the data.
Object-oriented database management systems (OODBMSs) were created in the early
1980s. Some OODBMSs were designed to work with OOP languages such as Delphi, Ruby,
C++, Java, and Python. Some popular OODBMSs are TORNADO, Gemstone, Object Store,
GBase, VBase, Intersystem Cache, Versant Object Database, ODABA, ZODB, Poet, JADE,
and Informix.
Disadvantages of Object-oriented databases
• Object-oriented databases are more expensive to develop.
• Most organizations are unwilling to abandon and convert from those databases.
Benefits of Object-oriented databases
• The benefits of object-oriented databases are compelling.
• The ability to mix and match reusable objects provides incredible multimedia
capability.
• Graph Databases
Graph Databases are NoSQL databases and use a graph structure for semantic queries.
The data is stored in the form of nodes, edges, and properties. In a graph database, a
Node represents an entity or instance such as a customer, person, or car. A node is
equivalent to a record in a relational database system. An Edge in a graph database
represents a relationship that connects nodes. Properties are additional information
added to the nodes.
Neo4j, Azure Cosmos DB, SAP HANA, Sparksee, Oracle Spatial and
Graph, OrientDB, ArangoDB, and MarkLogic are some of the popular graph databases.
The graph database structure is also supported by some RDBMSs, including Oracle and
SQL Server 2017 and later versions.
• ER Model Databases
An ER model is typically implemented as a database. In a simple relational database
implementation, each row of a table represents one instance of an entity type, and each
field in a table represents an attribute type. In a relational database, a relationship
between entities is implemented by storing the primary key of one entity as a pointer or
"foreign key" in the table of another entity. The entity-relationship model was
developed by Peter Chen in 1976.
• Document Databases
Document databases (Document DB) are also NoSQL databases that store data in the
form of documents. Each document represents the data, its relationships with other
data elements, and the attributes of the data. Document databases store data in a
key-value form.
Document DBs have become popular recently due to their document
storage and NoSQL properties. NoSQL data storage provides a faster mechanism to
store and search documents.
Popular NoSQL databases are Hadoop/Hbase, Cassandra,
Hypertable, MapR, Hortonworks, Cloudera, Amazon SimpleDB, Apache Flink, IBM
Informix, Elastic, MongoDB, and Azure DocumentDB.
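The toy sketch below illustrates the key-value idea behind document stores using a plain Python dictionary; the keys and document contents are purely hypothetical:

```python
# Toy illustration of document storage: each document (a JSON-like
# dict) is stored and retrieved by its key.
documents = {}

documents["order:101"] = {
    "customer": "Asha",
    "items": [{"sku": "A12", "qty": 2}],
    "total": 250.0,
}

print(documents["order:101"]["total"])   # fetch a document by its key
```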
• NoSQL Databases
NoSQL databases are databases that do not use SQL as their primary data access
language. Graph databases, network databases, object databases, and document databases
are common NoSQL databases.
NoSQL databases do not have predefined schemas, which makes them a perfect candidate
for rapidly changing development environments. NoSQL allows developers to make changes
on the fly without affecting applications.
NoSQL databases can be categorized into the following five major categories:
Column, Document, Graph, Key-value, and Object databases.
Here is a list of 10 popular NoSQL databases:
• Cosmos DB
• ArangoDB
• Couchbase Server
• CouchDB
• Amazon DocumentDB
• MongoDB
• Elasticsearch
• Informix
• SAP HANA
• Neo4j
12. What is Normalization? Explain different types of Normalization with
advantages?
Normalization is the methodology of organizing a data model to efficiently store data in a
database. The end result is that redundant data is eliminated, and only data related to the
attribute is stored within the table. Normalization usually involves dividing a database
into two or more tables and defining relationships between the tables. The objective is to
isolate data so that additions, deletions, and modifications of a field can be made in just
one table and then propagated through the rest of the database by means of the defined
relationships.
There are three standard normal forms, each with an increasing level of normalization,
as follows (a brief sketch follows the list).
• First Normal Form (1NF) –
Each field in a table holds a single, atomic piece of information. For example, in an
employee database, each table would hold only one date-of-birth field.
• Second Normal Form (2NF) –
Each field in a table that is not a determinant of the contents of another field
must itself be a function of the other key fields in the table.
• Third Normal Form (3NF) –
No duplicated data is permitted. For example, if two tables both require a
date-of-birth field, the date-of-birth information is separated into its own table,
and the two other tables then access the date-of-birth information by means of an
index (key) field into the date-of-birth table. Any change to a date of birth is then
automatically reflected in all tables that link to the date-of-birth table.
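As a hedged illustration of this idea, the sketch below uses sqlite3 to move a repeated department name out of an employees table into its own table referenced by a key; the schema and data are hypothetical:

```python
# Normalization sketch: repeated data lives in exactly one table and is
# referenced by a key, so a change needs to be made in only one place.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE departments (
                   dept_id   INTEGER PRIMARY KEY,
                   dept_name TEXT UNIQUE)""")
con.execute("""CREATE TABLE employees (
                   emp_id  INTEGER PRIMARY KEY,
                   name    TEXT,
                   dept_id INTEGER REFERENCES departments(dept_id))""")

con.execute("INSERT INTO departments VALUES (1, 'Finance')")
con.execute("INSERT INTO employees VALUES (10, 'Ravi', 1)")
con.execute("INSERT INTO employees VALUES (11, 'Mina', 1)")

# Changing the department name in one place updates it for every employee.
con.execute("UPDATE departments SET dept_name = 'Finance & Accounts' WHERE dept_id = 1")
```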
Advantages of Normalization:
Here we can see why normalization is an attractive prospect in RDBMS concepts.
• A smaller database can be maintained, as normalization eliminates duplicate data;
the overall size of the database is reduced as a result.
• Better performance is ensured, which is connected to the point above: as databases
become smaller in size, passes through the data become faster and shorter, thereby
improving response time and speed.
• Narrower tables are possible, as normalized tables are fine-tuned and have fewer
columns, which allows more data records per page.
• Fewer indexes per table ensure faster maintenance tasks (such as index rebuilds).
• It also realizes the option of joining only the tables that are required.
Disadvantages of Normalization:
• More tables to join: by spreading data out into more tables, the need to join tables
increases and the task becomes more tedious. The database also becomes harder to
understand.
• Tables will contain codes rather than real data, as repeated data is stored as codes
instead of the actual values. Hence, there is always a need to go to the lookup table.
• The data model becomes extremely difficult to query against, because the data
model is optimized for applications, not for ad hoc querying. (An ad hoc query is a
query that cannot be determined before it is issued. It consists of SQL that is
constructed dynamically and is usually built by desktop-friendly query tools.)
Consequently, it is hard to model the database without knowing what the customer
wants.
• As the normal form progresses, the performance becomes slower and slower.
• Proper knowledge of the various normal forms is required to execute the
normalization process successfully. Careless use may lead to a bad design filled with
major anomalies and data inconsistency.