BDA Unit5
Hadoop Ecosystem:
The following are the components of Hadoop ecosystem:
1. HDFS: Hadoop Distributed File System. It simply stores data files as close to their original form as
possible.
2. HBase: Hadoop's distributed, column-oriented database. It supports structured data storage
for large tables.
3. Hive: Hadoop's data warehouse. It enables analysis of large data sets using a language very
similar to SQL, so one can access data stored in a Hadoop cluster by using Hive.
4. Pig: Pig is an easy-to-understand data-flow language. It helps with the analysis of the large data
sets that are typical with Hadoop, without writing code in the MapReduce paradigm.
Hadoop PIG
To perform a particular task, programmers need to write a Pig script using the Pig Latin
language and execute it using any of the execution mechanisms (Grunt shell, UDFs,
Embedded). After execution, these scripts go through a series of transformations applied by
the Pig framework to produce the desired output. Internally, Apache Pig converts these
scripts into a series of MapReduce jobs, and thus it makes the programmer's job easy. The
architecture of Apache Pig is shown in the figure below.
Parser
o Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does
type checking, and other miscellaneous checks.
o The output of the parser will be a DAG (Directed Acyclic Graph), which represents the
Pig Latin statements and logical operators.
o In the DAG, the logical operators of the script are represented as the nodes and the data
flows are represented as edges.
Optimizer
o The logical plan (DAG) is passed to the logical optimizer, which carries out the logical
optimizations such as projection and pushdown.
Compiler
o The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
o Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, and these jobs
are executed on Hadoop to produce the desired results.
Tuple
A record that is formed by an ordered set of fields is known as a tuple; the fields can
be of any type. A tuple is similar to a row in a table of an RDBMS.
o Example: (john, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non-
unique) is known as a bag. Each tuple can have any number of fields (flexible
schema).
A bag is represented by ‘{ }’.
It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not necessary
that every tuple contain the same number of fields or that the fields in the same
position (column) have the same type.
o Example: {(Adam, 30), (John, 45)}
A bag can be a field in a relation; in that context, it is known as an inner bag.
o Example: (Adam, 30, {(9848022338), ([email protected])})
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no
guarantee that tuples are processed in any particular order).
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type
chararray and should be unique. The value might be of any type. It is represented
by ‘[ ]’
o Example: [name#john, age#30]
NULL
Pig's notion of null means the value is unknown. Nulls can show up in the data in
cases where values are unreadable or unrecognizable, for example, if the wrong data
type were used in the LOAD statement.
LOAD
LOAD 'info' [USING FUNCTION] [AS SCHEMA];
o LOAD is a relational operator.
o 'info' is the file to be loaded. It can contain any type of data.
o USING is a keyword.
o FUNCTION is a load function.
o AS is a keyword.
o SCHEMA is the schema of the file being loaded, enclosed in parentheses.
Example:
File in local file system
$cat data.txt
1010,10,3
2020,20,4
3030,30,5
4040,40,2
Loading data file into HDFS file system
$ hdfs dfs -put data.txt /pigtest
Starting pig grunt shell
$pig -x mapreduce or $pig
Loading data into Pig by defining a schema; fields are separated by commas.
grunt> A = LOAD '/pigtest/data.txt' USING PigStorage(',') AS (d1:int,d2:int,d3:int) ;
Printing loaded data on console
grunt> DUMP A;
(1010,10,3)
(2020,20,4)
(3030,30,5)
(4040,40,2)
STORE
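The STORE operator writes the contents of a relation to the file system. A minimal sketch, reusing the
relation A loaded in the example above; the output directory /pigtest/out is a hypothetical path and
must not already exist:
grunt> STORE A INTO '/pigtest/out' USING PigStorage(',');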
CROSS
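The CROSS operator computes the cross product of two or more relations, pairing every tuple of one
with every tuple of the other. An illustrative sketch; the input files and schemas here are assumptions:
grunt> A = LOAD '/pigtest/a.txt' USING PigStorage(',') AS (x:int,y:int);
grunt> B = LOAD '/pigtest/b.txt' USING PigStorage(',') AS (z:int);
grunt> C = CROSS A, B;
grunt> DUMP C;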
DISTINCT
File in local file system
$cat data.txt
1,10,3
2,20,4
1,10,3
2,20,4
Loading data file into HDFS file system
$ hdfs dfs -put data.txt /pigtest
Starting pig grunt shell
$pig -x mapreduce or $pig
Loading data into Pig by defining a schema; fields are separated by commas.
grunt> A = LOAD '/pigtest/data.txt' USING PigStorage(',') AS (d1:int,d2:int,d3:int) ;
To remove duplicate tuples
grunt> B = DISTINCT A;
Printing the result on console
grunt> DUMP B;
(1,10,3)
(2,20,4)
FILTER
File in local file system
$cat data.txt
1,10,3
2,20,4
3,10,3
4,20,4
Loading data file into HDFS file system
$ hdfs dfs -put data.txt /pigtest
Starting pig grunt shell
$pig -x mapreduce or $pig
Loading data into Pig by defining a schema; fields are separated by commas.
grunt> A = LOAD '/pigtest/data.txt' USING PigStorage(',') AS (d1:int,d2:int,d3:int) ;
To filter the tuples whose d2 field equals 10
grunt> B = FILTER A BY d2 == 10;
Printing the result on console
grunt> DUMP B;
(1,10,3)
(3,10,3)
FOREACH
File in local file system
$cat data.txt
1,2,3,4
5,6,7,8
8,7,6,5
4,3,2,1
Loading data file into HDFS file system
$ hdfs dfs -put data.txt /pigtest
Starting pig grunt shell
$pig -x mapreduce or $pig
Loading data into Pig by defining a schema; fields are separated by commas.
grunt> A = LOAD '/pigtest/data.txt' USING PigStorage(',') AS (d1:int,d2:int,d3:int,d4:int) ;
To fetch the second and fourth columns
grunt> B = FOREACH A GENERATE d2,d4;
Printing the result on console
grunt> DUMP B;
(2,4)
(6,8)
(7,5)
(3,1)
GROUP BY
File in local file system
$cat data.txt
John,Ram,3
Clark,John,2
Nike,Ram,5
Imran,John,6
Loading data file into HDFS file system
$ hdfs dfs -put data.txt /pigtest
Starting pig grunt shell
$pig -x mapreduce or $pig
Loading data into Pig by defining a schema; fields are separated by commas.
grunt> A = LOAD '/pigtest/data.txt' USING PigStorage(',')
AS (d1:chararray,d2:chararray,d3:int) ;
To group the data based on d2 column data
grunt>B = GROUP A BY d2;
Printing the result on console
grunt> DUMP B;
(Ram, {(John,Ram,3), (Nike,Ram,5)})
(John, {(Clark,John,2), (Imran,John,6)})
LIMIT
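The LIMIT operator returns at most a given number of tuples from a relation. A minimal sketch,
reusing the relation A loaded in the GROUP BY example above:
grunt> B = LIMIT A 2;
grunt> DUMP B;
Unless the relation has been ordered first, there is no guarantee which tuples are returned.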
ORDER BY
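The ORDER BY operator sorts a relation on one or more fields. An illustrative sketch, again reusing
the relation A from the GROUP BY example (d3 is its integer field):
grunt> B = ORDER A BY d3 DESC;
grunt> DUMP B;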
SPLIT
File in local file system
$cat data.txt
1,2
2,4
3,6
4,8
5,7
6,5
7,3
8,1
Loading data file into HDFS file system
$ hdfs dfs -put data.txt /pigtest
Starting pig grunt shell
$pig -x mapreduce or $pig
Loading data into Pig by defining a schema; fields are separated by commas.
grunt> A = LOAD '/pigtest/data.txt' USING PigStorage(',') AS (d1:int,d2:int) ;
To split the tuples into two relations based on field values
grunt> SPLIT A INTO X IF d1<=5, Y IF d1>=6;
Printing the result relations on console
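Expected output, derived from the sample data above:
grunt> DUMP X;
(1,2)
(2,4)
(3,6)
(4,8)
(5,7)
grunt> DUMP Y;
(6,5)
(7,3)
(8,1)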
UNION
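The UNION operator merges the contents of two relations, which are usually loaded with the same
schema. A minimal sketch; the input files here are assumptions:
grunt> A = LOAD '/pigtest/a.txt' USING PigStorage(',') AS (d1:int,d2:int);
grunt> B = LOAD '/pigtest/b.txt' USING PigStorage(',') AS (d1:int,d2:int);
grunt> C = UNION A, B;
grunt> DUMP C;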
Apache Hive
What is Hive?
One of the biggest ingredients in the Information Platform built by Jeff’s team at Facebook
was Hive, a framework for data warehousing on top of Hadoop. Hive grew from a need to
manage and learn from the huge volumes of data that Facebook was producing every day
from its burgeoning social network. After trying a few different systems, the team chose
Hadoop for storage and processing, since it was cost-effective and met their scalability needs.
Hive provides the functionality of reading, writing, and managing large datasets residing in
distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are
internally converted into MapReduce jobs.
Using Hive, we can skip the requirement of the traditional approach of writing complex
MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation
Language (DML), and User Defined Functions (UDF).
Features of Hive:
Limitations of Hive
Hive vs Pig
Hive follows a declarative, SQL-like language called HiveQL, whereas Pig follows a
procedural data-flow language called Pig Latin.
Hive's data model consists of Tables, Partitions and Buckets, whereas Pig's data model
consists of Atom, Tuple, Bag and Map.
Modes of Execution
o Local Mode
Used when Hadoop is installed in pseudo-distributed mode (a single node/local
machine) and the data resides on that node.
o MapReduce Mode
Used when Hadoop is set up with multiple data nodes and the data is distributed
across the cluster.
Hive Architecture
Hive Client
Hive allows writing applications in various languages, including Java, Python, and
C++. It supports different types of clients, such as the Thrift server, the JDBC driver,
and the ODBC driver.
Hive Services
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can
execute Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI.
It provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structural
information of the various tables and partitions in the warehouse. It also includes
metadata of each column and its type, the serializers and deserializers used to
read and write data, and the corresponding HDFS files where the data is stored.
o Hive Server - It is referred to as Apache Thrift Server. It accepts the request
from different clients and provides it to Hive Driver.
o Hive Driver - It receives queries from different sources like web UI, CLI, Thrift,
and JDBC/ODBC driver. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform
semantic analysis on the different query blocks and expressions. It converts
HiveQL statements into MapReduce jobs.
o Hive Execution Engine - Optimizer generates the logical plan in the form of
DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine
executes the incoming tasks in the order of their dependencies.
The Metastore
The metastore is the central repository of Hive metadata. The metastore is divided
into two pieces: a service and the backing store for the data. It is configured in three
different ways:
Embedded Metastore
Local Metastore
Remote Metastore
Embedded Metastore
By default, the metastore service runs in the same JVM as the Hive service and
contains an embedded Derby database instance backed by the local disk. This is
called the embedded metastore configuration. Using an embedded metastore is a
simple way to get started with Hive; however, only one embedded Derby database
can access the database files on disk at any one time, which means you can have
only one Hive session open at a time that shares the same metastore. Trying to start
a second session gives the error: Failed to start database 'metastore_db'
Local Metastore
The solution to supporting multiple sessions (and therefore multiple users) is to use
a standalone database. This configuration is referred to as a local metastore, since
the metastore service still runs in the same process as the Hive service, but connects
to a database running in a separate process, either on the same machine or on a
remote machine.
Remote Metastore
In the remote metastore configuration, one or more metastore servers run in processes
separate from the Hive service. This brings better manageability and security because the
database tier can be completely firewalled off, and the clients no longer need the
database credentials.
APPLICATIONS ON BIG DATA USING HIVE
In addition to primitive types, Hive provides complex data types such as arrays, maps, and structs.
Two points to note:
a. The literal forms for arrays, maps, and structs are provided as functions. That is, array(), map(),
and struct() are built-in Hive functions.
b. The columns of a struct created with struct() are named col1, col2, col3, etc.
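An illustrative query using these literal functions; a table such as the employee table used in later
examples is assumed here only to provide a FROM clause:
hive> SELECT array(1, 2, 3), map('a', 1, 'b', 2), struct('x', 7) FROM employee LIMIT 1;
The struct in the result has its fields named col1 and col2, as described in note (b).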
Operators
The usual set of SQL operators is provided by Hive:
Relational Operators
Operator Description
A <> B, A !=B It returns null if A or B is null; true if A is not equal to B, otherwise false.
A<=B It returns null if A or B is null; true if A is less than or equal to B, otherwise false.
Arithmetic Operators
Operators Description
A/B This is used to divide A and B and returns the quotient of the operands.
Hive Tables
In Hive, we can create a table using conventions similar to SQL. Hive offers a great deal
of flexibility in where the data files for tables are stored. It provides two
types of table: -
o Internal table
o External table
Internal Table
The internal tables are also called managed tables, as the lifecycle of their data is
controlled by Hive. By default, these tables are stored in a subdirectory under
the directory defined by hive.metastore.warehouse.dir (i.e., /user/hive/warehouse).
The internal tables are not flexible enough to share with other tools like Pig. If we try
to drop the internal table, Hive deletes both table schema and data.
hive> create table demo.employee (Id int, Name string, Salary float)
row format delimited
fields terminated by ',' ;
hive> create table if not exists demo.employee (Id int, Name string, Salary float)
row format delimited
fields terminated by ',' ;
External Table
The external table allows us to create and access a table and its data externally.
The external keyword is used to specify the external table, whereas
the location keyword is used to determine the location of loaded data.
As the table is external, the data is not present in the Hive directory. Therefore, if we
try to drop the table, the metadata of the table will be deleted, but the data still exists.
hive> create external table emplist (Id int, Name string , Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';
Once the internal table has been created, the next step is to load the data into it. So,
in Hive, we can easily load data from any file to the database.
o Let's load the data of the file into the database by using the following
command: -
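A sketch of the load command, assuming a local file at a hypothetical path
/home/hduser/emp_details whose comma-separated records match the table schema:
hive> load data local inpath '/home/hduser/emp_details' into table demo.employee;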
If we want to add more data into the current database, execute the same query
again by just updating the new file name.
o In Hive, if we try to load unmatched data (i.e., the data in one or more columns
doesn't match the data type of the corresponding table columns), it will not throw
any exception. Instead, it stores NULL at the position of the unmatched values.
Rename a Table
If we want to change the name of an existing table, we can rename that table by using
the following signature: -
hive>ALTER TABLE old_table_name RENAME TO new_table_name;
hive>SHOW TABLES;
Adding column
In Hive, we can add one or more columns in an existing table by using the following
syntax
hive>ALTER TABLE table_name ADD COLUMNS(column_name datatype);
Change Column
In Hive, we can rename a column, change its type and position, or replace the entire
column list. Here, the REPLACE COLUMNS clause is used to redefine the columns of the table: -
hive>DESCRIBE demo.emp;
hive>ALTER TABLE demo.emp REPLACE COLUMNS( id string, first_name string, age int);
hive>DESCRIBE demo.emp;
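To rename a single column rather than replacing the whole column list, the CHANGE clause can
be used. A sketch, assuming the first_name column from the statement above is to be renamed
to fname:
hive> ALTER TABLE demo.emp CHANGE first_name fname string;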
The Hive Query Language (HiveQL) is a query language for Hive to process and analyze
structured data in a Metastore.
Some of the built-in mathematical functions are listed below (return type, function, description):
BIGINT round(num) It returns the BIGINT for the rounded value of DOUBLE num.
BIGINT floor(num) It returns the largest BIGINT that is less than or equal to num.
BIGINT ceil(num), ceiling(num) It returns the smallest BIGINT that is greater than or equal to num.
DOUBLE exp(num) It returns the exponential of num.
DOUBLE ln(num) It returns the natural logarithm of num.
DOUBLE log10(num) It returns the base-10 logarithm of num.
DOUBLE sqrt(num) It returns the square root of num.
DOUBLE abs(num) It returns the absolute value of num.
DOUBLE sin(d) It returns the sine of d, in radians.
DOUBLE asin(d) It returns the arcsine of d, in radians.
DOUBLE cos(d) It returns the cosine of d, in radians.
DOUBLE acos(d) It returns the arccosine of d, in radians.
DOUBLE tan(d) It returns the tangent of d, in radians.
DOUBLE atan(d) It returns the arctangent of d, in radians.
Operations in HiveQL
Arithmetic Operations
Example: find 10% of the salary of each employee:
hive> select id, name, (salary * 10) / 100 from employee;
Relational Operators
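A sketch of a relational operator used in a WHERE clause, on the same employee table (the salary
threshold is only illustrative):
hive> select id, name, salary from employee where salary >= 25000;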
Math functions
hive> select Id, Name, sqrt(Salary) from employee_data ;
Aggregate Functions in Hive
In Hive, an aggregate function returns a single value resulting from a computation
over many rows. Let's see some commonly used aggregate functions (return type, operator,
description):
BIGINT count(*) It returns the total number of rows, including rows containing NULL values.
DOUBLE sum(col) It returns the sum of the values in the column.
DOUBLE avg(col) It returns the average of the values in the column.
DOUBLE min(col) It returns the minimum value in the column.
DOUBLE max(col) It returns the maximum value in the column.
GROUP BY Clause:
The HQL GROUP BY clause is used to group the data from multiple records
based on one or more columns. It is generally used in conjunction with aggregate
functions (like SUM, COUNT, MIN, MAX and AVG) to perform an aggregation over
each group.
hive> SELECT desig, sum(salary) from emp GROUP BY desig;
Developer 55,000
Manager 1,15,000
Sr. Manager 90,000
HAVING Clause:
The HQL HAVING clause is used with GROUP BY clause. Its purpose is to
apply constraints on the group of data produced by GROUP BY clause. Thus, it
always returns the data where the condition is TRUE.
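A sketch applying HAVING to the grouped query shown above (the threshold is only illustrative):
hive> SELECT desig, sum(salary) FROM emp GROUP BY desig HAVING sum(salary) >= 90000;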
By using the HiveQL ORDER BY and SORT BY clauses, we can sort the result set on a column,
in either ascending or descending order. Here, we execute these clauses on the records of the
emp table used above.
SORT BY Clause
The HiveQL SORT BY clause is an alternative of ORDER BY clause. It orders the data
within each reducer. Hence, it performs the local ordering, where each reducer's
output is sorted separately. It may also give a partially ordered result.
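Illustrative queries on the emp table used in the GROUP BY example above:
hive> SELECT * FROM emp ORDER BY salary DESC;
hive> SELECT * FROM emp SORT BY salary DESC;
ORDER BY produces a single, totally ordered result, while SORT BY only orders the rows within
each reducer.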
DISTRIBUTE BY:
The DISTRIBUTE BY clause is used on tables present in Hive. Hive uses the columns in
DISTRIBUTE BY to distribute the rows among reducers: all rows with the same DISTRIBUTE BY
column values go to the same reducer.
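A sketch combining DISTRIBUTE BY with SORT BY on the same emp table:
hive> SELECT * FROM emp DISTRIBUTE BY desig SORT BY salary DESC;
Rows with the same desig value are sent to the same reducer, and each reducer then sorts its rows
by salary.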
• NoSQL Databases
1) Key-Value based DB
2) Document based DB
3) Column based DB
4) Graph based DB
Row Oriented Data Stores vs Column-Oriented Data Stores
A data store is basically a place for storing collections of data, such as a database, a file
system or a directory. In database systems, data can be stored in two ways. These are as
follows:
1. Row Oriented Data Stores
2. Column-Oriented Data Stores
Comparisons between row-oriented and column-oriented data stores are as follows:
Row-oriented data stores:
o Data is stored and retrieved one row at a time, so unnecessary data may be read when only
some of the fields in a row are required.
o Records are easy to read and write.
o Best suited for online transaction processing.
o Not efficient for operations that apply to the entire dataset, so aggregation is an expensive
operation.
o Typical compression mechanisms give less efficient results than those achieved by
column-oriented data stores.
Column-oriented data stores:
o Data is stored and retrieved column by column, so only the relevant data is read when
required.
o Read and write operations are slower compared to row-oriented stores.
o Best suited for online analytical processing.
o Efficient for operations that apply to the entire dataset, enabling aggregation over many
rows and columns.
o Permit high compression rates because columns tend to contain few distinct values.
HBASE DATA MODEL:
Tables: Data is stored in a table format in HBase, but here the tables are in a column-oriented
format.
Row Key: Row keys are used to search records which make searches fast.
Column Families: Various columns are combined in a column family. These column
families are stored together which makes the searching process faster because data
belonging to same column family can be accessed together in a single seek.
Column Qualifiers: Each column’s name is known as its column qualifier.
Cell: Data is stored in cells. The data is dumped into cells which are specifically identified
by rowkey and column qualifiers.
Timestamp: A timestamp is a combination of date and time. Whenever data is stored, it is
stored with its timestamp. This makes it easy to search for a particular version of the data.
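An illustrative HBase shell session showing these concepts (the table name 'emp' and the column
family names are hypothetical):
hbase> create 'emp', 'personal', 'professional'
hbase> put 'emp', 'row1', 'personal:name', 'John'
hbase> get 'emp', 'row1'
Here 'emp' is the table, 'row1' is the row key, 'personal' is a column family, 'name' is a column
qualifier, and every stored value carries a timestamp that identifies its version.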
ARCHITECTURE OF HBASE:
The main components of HBase architecture are
Region
Region Server
HMaster
Zookeeper
Region: A region contains all the rows between the start key and the end key assigned to that
region. HBase tables can be divided into a number of regions in such a way that all the columns
of a column family are stored in one region. Each region contains the rows in sorted order.
Many regions are assigned to a Region Server, which is responsible for handling, managing,
and executing read and write operations on that set of regions.
So, concluding in a simpler way:
A table can be divided into a number of regions. A Region is a sorted range of rows
storing data between a start key and an end key.
A Region has a default size of 256 MB, which can be configured according to need.
A group of regions is served to the clients by a Region Server.
A Region Server can serve approximately 1000 regions to the client.
Components of Region Server: This below image shows the components of a Region Server.
Now, I will discuss them separately.
A Region Server maintains various regions running on the top of HDFS. Components of a
Region Server are:
WAL: As you can conclude from the above image, Write Ahead Log (WAL) is a file
attached to every Region Server inside the distributed environment. The WAL stores
the new data that hasn’t been persisted or committed to the permanent storage. It is used
in case of failure to recover the data sets.
Block Cache: From the above image, it is clearly visible that the Block Cache resides at
the top of the Region Server. It stores the frequently read data in memory. If the data in
the BlockCache is least recently used, then that data is removed from the BlockCache.
MemStore: It is the write cache. It stores all the incoming data before committing it to
the disk or permanent memory. There is one MemStore for each column family in a
region. As you can see in the image, there are multiple MemStores for a region because
each region contains multiple column families. The data is sorted in lexicographical
order before committing it to the disk.
HFile: From the above figure you can see that HFiles are stored on HDFS; they store the actual
cells on disk. The MemStore flushes its data to an HFile when the size of the MemStore exceeds
a configured threshold.
HMaster
As in the below image, you can see that the HMaster handles a collection of Region Servers which
reside on DataNodes. Let us understand how HMaster does that.
HBase HMaster performs DDL operations (create and delete tables) and
assigns regions to the Region servers as you can see in the above image.
It coordinates and manages the Region Servers (similar to how the NameNode manages
DataNodes in HDFS).
It assigns regions to the Region Servers on startup and re-assigns regions to Region
Servers during recovery and load balancing.
It monitors all the Region Server’s instances in the cluster (with the help of Zookeeper)
and performs recovery activities whenever any Region Server is down.
It provides an interface for creating, deleting and updating tables.
HBase has a distributed and huge environment where HMaster alone is not sufficient to manage
everything. So, what helps HMaster manage this huge environment? That is where ZooKeeper
comes into the picture. Having understood how HMaster manages the HBase environment, let us
now understand how ZooKeeper helps HMaster in managing the environment.
HBase combines HFiles to reduce the storage and reduce the number of disk seeks needed for
a read. This process is called compaction. Compaction chooses some HFiles from a region and
combines them. There are two types of compaction as you can see in the above image.
1. Minor Compaction: HBase automatically picks some smaller HFiles and rewrites them into
fewer, bigger HFiles, as shown in the above image. This is called minor compaction. It
performs a merge sort while committing the smaller HFiles into bigger HFiles. This helps in
storage space optimization.
2. Major Compaction: As illustrated in the above image, in major compaction HBase
merges and rewrites the smaller HFiles of a region into a new HFile. In this process,
the same column families are placed together in the new HFile, and deleted and
expired cells are dropped. It increases read performance.
Fundamentals of ZooKeeper
Zookeeper is a cluster coordinating, cross-platform software service provided by the Apache
Foundation. It is essentially designed for providing service for distributed systems offering
a hierarchical key-value store, which is used to provide a distributed configuration
service, synchronization service, and naming registry for large distributed systems
Architecture of Zookeeper
Apache Zookeeper basically follows the Client-Server Architecture. Participants in the
Zookeeper architecture can be enlisted as follows.
Ensemble
Server
Server Leader
Follower
Client
Ensemble: It is basically the collection of all the Server nodes in the Zookeeper ecosystem.
The Ensemble requires a minimum of three nodes to get itself set up.
Server: It is one amongst the servers present in the Zookeeper Ensemble, whose
objective is to provide all sorts of services to its clients. It sends its alive status to its
clients in order to inform them about its availability.
Server Leader: Ensemble Leader is elected at the service startup. It has access to recover the
data from any of the failed nodes and performs automatic data recovery for clients.
Follower: A follower is one of the servers in the Ensemble. Its duty is to follow the orders
passed by the Leader.
Client: Clients are the nodes that request service from a server. Similar to servers, a
client also sends signals to the server regarding its availability. If a server fails to
respond, the client automatically redirects itself to the next available server.
Every Region Server, along with the HMaster server, sends continuous heartbeats at regular
intervals to ZooKeeper, which checks which servers are alive and available, as mentioned in the
above image. It also provides server failure notifications so that recovery measures can be executed.
VISUALIZATIONS
Data visualization is defined as a graphical representation that contains the information and
the data. By using visual elements like charts, graphs, and maps, data visualization techniques
provide an accessible way to see and understand trends, outliers, and patterns in data.
In the modern world of Big Data, where we have enormous amounts of data in hand, data
visualization tools and technologies are crucial for analyzing massive amounts of information and
making data-driven decisions.
Distribution Plot
It is one of the best univariate plots for understanding the distribution of data.
When we want to analyze the impact of an independent variable (input) on the target
variable (output), we use distribution plots a lot.
This plot gives us a combination of a probability density function (pdf) and a
histogram in a single plot.
Box Plot
This plot can be used to obtain more statistical details about the data.
The straight lines at the maximum and minimum are also called whiskers.
Points that lie outside the whiskers are considered outliers.
The box plot also gives us a description of the 25th, 50th and 75th percentiles (quartiles).
With the help of a box plot, we can also determine the interquartile range (IQR), within
which the bulk of the data lies. Therefore, it can also give us a clear idea about the
outliers in the dataset.
Violin Plot
The violin plots can be considered as a combination of Box plot at the middle and
distribution plots (Kernel Density Estimation) on both sides of the data.
This can give us the description of the distribution of the dataset like whether the
distribution is multimodal, Skewness, etc.
It also gives us useful information like a 95% confidence interval.
1. Line Plot
This is the plot that you can see in the nook and corners of almost any analysis between
2 variables.
A line plot is nothing but a series of data points whose values are connected with
straight lines.
The plot may seem very simple, but it has many applications, not only in machine
learning but in many other areas.
2. Bar Plot
This is one of the most widely used plots; we would have seen it multiple times, not just
in data analysis but wherever there is trend analysis in many fields.
Though it may seem simple, it is powerful in analysing data like weekly sales figures,
revenue from a product, the number of visitors to a site on each day of a week,
etc.
3. Scatter Plot
It is one of the most commonly used plots used for visualizing simple data in Machine
learning and Data Science.
This plot gives us a representation in which each point in the dataset is plotted with
respect to any 2 to 3 features (columns).
Scatter plots are available in both 2-D as well as in 3-D. The 2-D scatter plot is the
common one, where we will primarily try to find the patterns, clusters, and separability
of the data.
INTERACTION TECHNIQUES
Interactive Visual Analysis (IVA) is a set of techniques for combining the computational
power of computers with the perceptive and cognitive capabilities of humans, in order to extract
knowledge from large and complex datasets.
IVA is a suitable technique for analyzing high-dimensional data that has a large number of
data points, where simple graphing and non-interactive techniques give an insufficient
understanding of the information.
The objective of Interactive Visual Analysis is to discover information in data which is not
readily apparent. The goal is to move from the data itself to the information contained in the
data, ultimately uncovering knowledge which was not apparent from looking at the raw
numbers.
The general framework for an interactive data structure visualization project typically follows
these steps: identify your desired goals, understand the challenges presented by data
constraints, and design a conceptual model in which data can be quickly iterated and reviewed.
Some popular libraries for creating your own interactive data visualizations include: Altair,
Bokeh, Celluloid, Matplotlib, nbinteract, Plotly, Pygal, and Seaborn. Libraries are available for
Python, Jupyter, Javascript, and R interactive data visualizations.
Identify Trends Faster - The majority of human communication is visual, as the human
brain processes graphics orders of magnitude faster than it does text. Direct manipulation of
analyzed data via familiar metaphors and digestible imagery makes it easy to
understand and act on valuable information.
Identify Relationships More Effectively - The ability to narrowly focus on specific
metrics enables users to identify otherwise overlooked cause-and-effect relationships
throughout definable timeframes. This is especially useful in identifying how daily
operations affect an organization’s goals.
Useful Data Storytelling - Humans best understand a data story when its development
over time is presented in a clear, linear fashion. A visual data story in which users can
zoom in and out, highlight relevant information, filter, and change the parameters
promotes better understanding of the data by presenting multiple viewpoints of the data.
Simplify Complex Data - A large data set with a complex data story may present itself
visually as a chaotic, intertwined hairball. Incorporating filtering and zooming controls
can help untangle and make these messes of data more manageable, and can help users
glean better insights.