
---FISAC 1---

GROUP ASSIGNMENT

TOOL-APACHE IMPALA
TEAM 4

NAME                 REGISTRATION NO.
Srideep Sarkar       200911241
Mohamad Farzeen      190911049
Sreeman Kejriwal     200911216
Paavan Akaveeti      200911154

----INTRODUCTION----

 WHAT IS BIG DATA?

Big data refers to extremely large and complex data sets that are too
big to be processed and analyzed by traditional data processing
applications. These data sets are typically characterized by their
volume, variety, and velocity, which means they contain vast
amounts of information from multiple sources that are generated at
high speeds.

The term "big data" also includes the technologies and techniques
used to manage, process, and analyze these data sets. This includes
tools like Hadoop, Spark, and other data analytics software that are
designed to extract insights and knowledge from the vast amount of
data.

The applications of big data are numerous and diverse, ranging from
business intelligence and marketing to scientific research and
healthcare. By harnessing the power of big data, organizations can
make informed decisions, improve their products and services, and
gain a competitive edge in their respective industries.

Examples of Big Data


 New York Stock Exchange

The New York Stock Exchange, which produces roughly one terabyte of
new trading data per day, is a classic example of big data.
 Social Media

According to estimates, Facebook's databases ingest more than 500
terabytes of new data each day. This information is primarily
produced by uploads of images and videos, messaging, comments, and
similar activity.

Types Of Big Data


1. Structured
2. Unstructured
3. Semi-structured

Structured

Structured data refers to any data that can be accessed, processed, and stored in a
fixed format.
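As a minimal sketch of what "fixed format" means (using Python's built-in sqlite3 module as a stand-in relational store; the table name, columns, and values are invented for illustration):

```python
import sqlite3

# Structured data: every record follows the same fixed schema of typed columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, price REAL, volume INTEGER)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("AAPL", 189.5, 1000), ("MSFT", 410.2, 500)],
)

# Because the format is fixed, queries can rely on column names and types.
total = conn.execute("SELECT SUM(volume) FROM trades").fetchone()[0]
print(total)  # 1500
```

The fixed schema is exactly what lets traditional databases index, validate, and query this kind of data efficiently.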

Unstructured

Unstructured data is data whose structure is unknown. It is
enormous in quantity and presents several processing obstacles that must be
overcome in order to extract value from it. Unstructured data is frequently found in
heterogeneous data sources that combine simple text files with photos, videos, and
other types of data.

Semi-structured

Semi-structured data contains elements of both structured and unstructured data.
It can appear to be structured, but it does not conform to the fixed schema of a
relational DBMS table. An XML file containing data is a common example of
semi-structured data.

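For instance, a small hypothetical XML fragment can be parsed with Python's standard library; the tags suggest a structure, but records need not share a fixed set of fields:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: both employees have a <name>, but only one has a <dept>,
# so the data has tags (structure) without a fixed relational schema.
doc = """
<employees>
  <employee><name>Asha</name><dept>Sales</dept></employee>
  <employee><name>Ravi</name></employee>
</employees>
"""
root = ET.fromstring(doc)
names = [e.findtext("name") for e in root.iter("employee")]
print(names)  # ['Asha', 'Ravi']
```
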
The Five V’s of Big Data

 Volume – The quantity of data produced.

 Velocity - The rate at which data is generated, gathered, and processed.

 Variety - Structured, semi-structured, and unstructured data

 Value - The ability to transform data into insightful knowledge.

 Veracity - Reliability in terms of quality and accuracy

Challenges of Big Data

Storage
The biggest challenge with the massive volumes of data generated
every day is storage, especially when data is in various formats.
Traditional databases are unable to store unstructured data.

Processing

Big data processing is the act of reading, converting, extracting,
and structuring relevant information from raw data. The lack of
unified formats for data input and output continues to be
troublesome.

Security

Organisations are very concerned about security. Information that is
not encrypted is vulnerable to theft or damage by hackers. Data
security experts must therefore strike a balance between maintaining
stringent security protocols and granting users access to data.

Finding and Fixing Data Quality Issues

It is likely that many of us are struggling with issues related to
bad data quality, but there are solutions. Common methods for
addressing data issues include:
• Ensuring the data is accurate in the initial database
• Repairing the original data source to fix any data errors
• Identifying users with precise, well-defined procedures

Hadoop and its ecosystem can be used to process the data and extract
useful information from it, overcoming these challenges. Hadoop is
an open-source framework for data storage and application execution
on clusters of commodity hardware. Its ecosystem includes tools like
Apache Spark, MapReduce, Apache Hive, Apache Impala, Apache Mahout,
Apache Pig, HBase, Tableau, Apache Sqoop, and Apache Storm.
 THE TOOL USED IS APACHE IMPALA

Apache Impala is an open-source massively parallel processing (MPP)
SQL query engine used to query data stored in Apache Hadoop
clusters in real time. It is designed to handle large-scale data
sets and allows users to perform complex analytics on data stored
in the Hadoop Distributed File System (HDFS) or Apache HBase.

Impala was developed by Cloudera and was inspired by Google's
Dremel system, which was designed to query large-scale datasets in
real time. Impala provides interactive SQL queries, enabling data
analysts and developers to perform ad-hoc queries and analysis
without having to wait for batch processing to complete.

Impala is optimized for performance and is capable of processing
large amounts of data in a matter of seconds or minutes. It also
supports a wide range of data formats, including Apache Parquet,
Apache Avro, and Apache ORC, and can be integrated with various
Hadoop ecosystem tools like Apache Hive, Apache Kafka, and Apache
Spark.

Overall, Impala provides a powerful tool for querying and analyzing
large datasets in real time, making it a popular choice for big
data processing and analysis.

 WHY IS APACHE IMPALA USED?
Apache Impala is used primarily for querying and analyzing large-
scale datasets stored in Apache Hadoop clusters. It provides a high-
performance, low-latency SQL query engine that enables users to
perform interactive ad-hoc queries and analysis on large datasets
without having to wait for batch processing.

Here are some of the key benefits and use cases of using
Apache Impala:

1. Real-time query processing: Impala provides real-time SQL
queries, which makes it an ideal tool for interactive data
exploration and ad-hoc analysis.

2. High-performance SQL engine: Impala is optimized for performance
and is capable of processing large-scale data sets in seconds or
minutes.

3. Integration with the Hadoop ecosystem: Impala can be integrated
with various Hadoop ecosystem tools like Apache Hive, Apache Spark,
and Apache Kafka, which makes it a versatile tool for big data
processing.

4. Analytics on diverse data types: Impala supports a wide range of
data formats, including Apache Parquet, Apache Avro, and Apache
ORC, which makes it a suitable tool for diverse analytics use
cases.

5. Business intelligence and reporting: Impala can be used for
business intelligence and reporting applications, enabling users to
generate reports and visualizations from large-scale data sets in
real time.

Overall, Apache Impala is a powerful tool for big data processing
and analytics that can be used across various industries, including
finance, healthcare, retail, and telecommunications, among others.

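To give a feel for the interactive, ad-hoc SQL described above: Impala's dialect is broadly ANSI-SQL-like, so the sketch below runs a representative aggregation against Python's built-in sqlite3 purely as a stand-in (the table and data are invented; on a real cluster the same query would be submitted through impala-shell or a JDBC/ODBC client):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.0), ("north", 50.0)])

# An ad-hoc aggregation of the kind an analyst would run interactively:
query = """
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('north', 170.0), ('south', 80.0)]
```

The point of Impala is that queries like this return in seconds even when the underlying table holds billions of rows spread across HDFS.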
APPLICATIONS OF APACHE IMPALA

 WHERE CAN APACHE IMPALA BE USED?

Apache Impala can be used for various big data processing and
analytics applications, including:

1. Ad-hoc analytics: Impala is suitable for interactive ad-hoc
queries and analysis, enabling users to quickly explore and analyze
large data sets in real time.

2. Business intelligence and reporting: Impala can be used to
generate reports and visualizations from large-scale data sets in
real time, enabling users to make informed business decisions.

3. ETL (Extract, Transform, Load): Impala can be used to extract
data from various sources, transform it into a suitable format, and
load it into Hadoop clusters for further analysis.

4. Data warehousing: Impala can be used to build a data warehouse
solution on top of Hadoop clusters, enabling users to store and
analyze large data sets in a centralized location.

5. Machine learning: Impala can be integrated with machine learning
frameworks like Apache Spark and TensorFlow to perform advanced
analytics and machine learning on large data sets.

6. IoT (Internet of Things) analytics: Impala can be used to
analyze real-time data generated by IoT devices, enabling users to
gain insights and make decisions based on the data.
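The ETL use case above is often expressed as plain SQL: extract into a staging table, then transform and load with a single INSERT ... SELECT. A sketch of the pattern, again using sqlite3 as a stand-in (the table names and the GLOB-based filter are illustrative; in Impala the cleaned table would typically be stored as Parquet and the filter written with built-ins such as regexp_like):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Extract": a raw staging table holding messy source data (hypothetical).
conn.execute("CREATE TABLE raw_events (user_id TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?)",
                 [("u1", "10.5"), ("u2", "bad"), ("u1", "4.5")])

# "Transform + Load": cast and filter into a clean table in one SQL statement.
conn.execute("CREATE TABLE clean_events (user_id TEXT, amount REAL)")
conn.execute("""
    INSERT INTO clean_events
    SELECT user_id, CAST(amount AS REAL)
    FROM raw_events
    WHERE amount GLOB '[0-9]*'
""")
n = conn.execute("SELECT COUNT(*) FROM clean_events").fetchone()[0]
print(n)  # 2 (the row with a non-numeric amount is dropped)
```
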

Overall, Apache Impala is a versatile tool for big data processing and
analytics that can be used across various industries and applications,
including finance, healthcare, retail, telecommunications, and more.

INSTALLATION AND PRE-REQUISITES

 Here are the steps for installing Apache Impala:

1. Pre-requisites:
Before installing Apache Impala, you need to ensure that your
system meets the following requirements:
 Java 8 or higher
 Apache Hadoop 2.6.x or higher
 Apache Hive 1.1.x or higher

2. Download Impala:
Download the Impala package from the Apache Impala website
(https://impala.apache.org/).

3. Install Impala:
To install Impala, follow these steps:
 Extract the downloaded Impala package to a directory on your
system.
 Configure the Impala environment by setting the required
environment variables.
 Start the Impala daemons by running the appropriate scripts.

4. Verify the installation:
Once the installation is complete, you can verify it by running some
sample queries on your data using the Impala shell.

Here are some additional considerations when installing Apache
Impala:
 Security: Apache Impala can be configured to work with
Kerberos authentication for added security.
 High availability: Impala can be configured for high availability
by setting up multiple Impala daemons on different nodes.
 Integration with other tools: Impala can be integrated with
other big data tools like Apache Spark, Apache Kafka, and
Apache Nifi for advanced analytics and processing.
 Cluster size: The performance of Impala depends on the size of
your Hadoop cluster, so you may need to scale your cluster
accordingly for optimal performance.

Overall, installing Apache Impala is a straightforward process, and
there are many resources available online to help with configuration
and troubleshooting.
