
---FISAC 1---

GROUP ASSIGNMENT

TOOL-APACHE IMPALA
TEAM 4

NAME                 REGISTRATION NO.
Srideep Sarkar       200911241
Mohamad Farzeen      190911049
Sreeman Kejriwal     200911216
Paavan Akaveeti      200911154

----INTRODUCTION----

 WHAT IS BIG DATA?

Big data refers to extremely large and complex data sets that are too
big to be processed and analyzed by traditional data processing
applications. These data sets are typically characterized by their
volume, variety, and velocity, which means they contain vast
amounts of information from multiple sources that are generated at
high speeds.

The term "big data" also includes the technologies and techniques
used to manage, process, and analyze these data sets. This includes
tools like Hadoop, Spark, and other data analytics software that are
designed to extract insights and knowledge from the vast amount of
data.

The applications of big data are numerous and diverse, ranging from
business intelligence and marketing to scientific research and
healthcare. By harnessing the power of big data, organizations can
make informed decisions, improve their products and services, and
gain a competitive edge in their respective industries.

Examples of Big Data


 New York Stock Exchange

The New York Stock Exchange, which produces roughly one terabyte of
new trading data per day, is a classic example of big data.
 Social Media

According to estimates, Facebook's databases ingest more than 500
terabytes of new data each day. This information is primarily
produced by uploads of images and videos, messaging, comments, and
similar activity.

Types Of Big Data


1. Structured
2. Unstructured
3. Semi-structured

Structured

Structured data refers to any data that can be accessed, processed, and stored in a
fixed format.
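As a minimal sketch of what "fixed format" means (using Python's built-in sqlite3 module as a stand-in relational store; the table name, columns, and values are invented for illustration):

```python
import sqlite3

# Structured data: every record follows the same fixed schema of typed columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, price REAL, volume INTEGER)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("AAPL", 189.5, 1000), ("MSFT", 410.2, 500)],
)

# Because the format is fixed, queries can rely on column names and types.
total = conn.execute("SELECT SUM(volume) FROM trades").fetchone()[0]
print(total)  # 1500
```

The fixed schema is exactly what lets traditional databases index, validate, and query this kind of data efficiently.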

Unstructured

Unstructured data is data whose structure is unknown. It is
enormous in quantity and presents several processing obstacles that must be
overcome in order to extract value from it. Unstructured data is frequently found in
heterogeneous data sources that combine simple text files with photos, videos, and
other types of data.

Semi-structured

Semi-structured data contains elements of both structured and unstructured data.
It can appear to be structured, but it does not conform to the fixed schema of a
relational DBMS table. An XML file containing data is a common example of
semi-structured data.

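For instance, a small hypothetical XML fragment can be parsed with Python's standard library; the tags suggest a structure, but records need not share a fixed set of fields:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: both employees have a <name>, but only one has a <dept>,
# so the data has tags (structure) without a fixed relational schema.
doc = """
<employees>
  <employee><name>Asha</name><dept>Sales</dept></employee>
  <employee><name>Ravi</name></employee>
</employees>
"""
root = ET.fromstring(doc)
names = [e.findtext("name") for e in root.iter("employee")]
print(names)  # ['Asha', 'Ravi']
```
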
The Five V’s of Big Data

 Volume – The quantity of data produced.

 Velocity - The rate at which data is generated, gathered, and processed.

 Variety - Structured, semi-structured, and unstructured data

 Value - The ability to transform data into insightful knowledge.

 Veracity - Reliability in terms of quality and accuracy

Challenges of Big Data

Storage
The biggest challenge with the massive volumes of data generated
every day is storage, especially when data is in various formats.
Traditional databases are unable to store unstructured data.

Processing

Big data processing is the act of reading, converting, extracting,
and structuring relevant information from raw data. The lack of
unified formats for data input and output continues to be
troublesome.

Security

Organisations are very concerned about security. Information that is
not encrypted is vulnerable to theft or damage by hackers. Data
security experts must therefore strike a balance between maintaining
stringent security protocols and granting users access to data.

Finding and Fixing Data Quality Issues

It is likely that many of us are struggling with issues related to
bad data quality, but there are solutions. Common methods for
addressing data issues include:
• Ensuring the data is accurate in the initial database
• Repairing the original data source to fix any data errors
• Identifying users with precise, well-defined procedures

Hadoop and its ecosystem can be used to process the data and extract
useful information from it, overcoming these challenges. Hadoop is
an open-source framework for data storage and application execution
on clusters of commodity hardware. Its ecosystem includes tools like
Apache Spark, MapReduce, Apache Hive, Apache Impala, Apache Mahout,
Apache Pig, HBase, Tableau, Apache Sqoop, and Apache Storm.
 THE TOOL USED IS APACHE IMPALA

Apache Impala is an open-source massively parallel processing (MPP)
SQL query engine used to query data stored in Apache Hadoop
clusters in real time. It is designed to handle large-scale data
sets and allows users to perform complex analytics on data stored
in the Hadoop Distributed File System (HDFS) or Apache HBase.

Impala was developed by Cloudera and was inspired by Google's
Dremel system, which was designed to query large-scale datasets in
real time. Impala provides interactive SQL queries, enabling data
analysts and developers to perform ad-hoc queries and analysis
without having to wait for batch processing to complete.

Impala is optimized for performance and is capable of processing
large amounts of data in a matter of seconds or minutes. It also
supports a wide range of data formats, including Apache Parquet,
Apache Avro, and Apache ORC, and can be integrated with various
Hadoop ecosystem tools like Apache Hive, Apache Kafka, and Apache
Spark.

Overall, Impala provides a powerful tool for querying and analyzing
large datasets in real time, making it a popular choice for big
data processing and analysis.

 WHY IS APACHE IMPALA USED?
Apache Impala is used primarily for querying and analyzing large-
scale datasets stored in Apache Hadoop clusters. It provides a high-
performance, low-latency SQL query engine that enables users to
perform interactive ad-hoc queries and analysis on large datasets
without having to wait for batch processing.

Here are some of the key benefits and use cases of using
Apache Impala:

1. Real-time query processing: Impala provides real-time SQL
queries, which makes it an ideal tool for interactive data
exploration and ad-hoc analysis.

2. High-performance SQL engine: Impala is optimized for performance
and is capable of processing large-scale data sets in seconds or
minutes.

3. Integration with the Hadoop ecosystem: Impala can be integrated
with various Hadoop ecosystem tools like Apache Hive, Apache Spark,
and Apache Kafka, which makes it a versatile tool for big data
processing.

4. Analytics on diverse data types: Impala supports a wide range of
data formats, including Apache Parquet, Apache Avro, and Apache
ORC, which makes it a suitable tool for diverse analytics use
cases.

5. Business intelligence and reporting: Impala can be used for
business intelligence and reporting applications, enabling users to
generate reports and visualizations from large-scale data sets in
real time.

Overall, Apache Impala is a powerful tool for big data processing
and analytics that can be used across various industries, including
finance, healthcare, retail, and telecommunications, among others.

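To give a feel for the interactive, ad-hoc SQL described above: Impala's dialect is broadly ANSI-SQL-like, so the sketch below runs a representative aggregation against Python's built-in sqlite3 purely as a stand-in (the table and data are invented; on a real cluster the same query would be submitted through impala-shell or a JDBC/ODBC client):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.0), ("north", 50.0)])

# An ad-hoc aggregation of the kind an analyst would run interactively:
query = """
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('north', 170.0), ('south', 80.0)]
```

The point of Impala is that queries like this return in seconds even when the underlying table holds billions of rows spread across HDFS.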
APPLICATIONS OF APACHE IMPALA

 WHERE CAN APACHE IMPALA BE USED?

Apache Impala can be used for various big data processing and
analytics applications, including:

1. Ad-hoc analytics: Impala is suitable for interactive ad-hoc
queries and analysis, enabling users to quickly explore and analyze
large data sets in real time.

2. Business intelligence and reporting: Impala can be used to
generate reports and visualizations from large-scale data sets in
real time, enabling users to make informed business decisions.

3. ETL (Extract, Transform, Load): Impala can be used to extract
data from various sources, transform it into a suitable format, and
load it into Hadoop clusters for further analysis.

4. Data warehousing: Impala can be used to build a data warehouse
solution on top of Hadoop clusters, enabling users to store and
analyze large data sets in a centralized location.

5. Machine learning: Impala can be integrated with machine learning
frameworks like Apache Spark and TensorFlow to perform advanced
analytics and machine learning on large data sets.

6. IoT (Internet of Things) analytics: Impala can be used to
analyze real-time data generated by IoT devices, enabling users to
gain insights and make decisions based on the data.
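The ETL use case above is often expressed as plain SQL: extract into a staging table, then transform and load with a single INSERT ... SELECT. A sketch of the pattern, again using sqlite3 as a stand-in (the table names and the GLOB-based filter are illustrative; in Impala the cleaned table would typically be stored as Parquet and the filter written with built-ins such as regexp_like):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Extract": a raw staging table holding messy source data (hypothetical).
conn.execute("CREATE TABLE raw_events (user_id TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?)",
                 [("u1", "10.5"), ("u2", "bad"), ("u1", "4.5")])

# "Transform + Load": cast and filter into a clean table in one SQL statement.
conn.execute("CREATE TABLE clean_events (user_id TEXT, amount REAL)")
conn.execute("""
    INSERT INTO clean_events
    SELECT user_id, CAST(amount AS REAL)
    FROM raw_events
    WHERE amount GLOB '[0-9]*'
""")
n = conn.execute("SELECT COUNT(*) FROM clean_events").fetchone()[0]
print(n)  # 2 (the row with a non-numeric amount is dropped)
```
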

Overall, Apache Impala is a versatile tool for big data processing and
analytics that can be used across various industries and applications,
including finance, healthcare, retail, telecommunications, and more.

INSTALLATION AND PRE-REQUISITES

 Here are the steps for installing Apache Impala:

1. Pre-requisites:
Before installing Apache Impala, you need to ensure that your
system meets the following requirements:
 Java 8 or higher
 Apache Hadoop 2.6.x or higher
 Apache Hive 1.1.x or higher

2. Download Impala:
Download the Impala package from the Apache Impala website
(https://impala.apache.org/).

3. Install Impala:
To install Impala, follow these steps:
 Extract the downloaded Impala package to a directory on your
system.
 Configure the Impala environment by setting the required
environment variables.
 Start the Impala daemons by running the appropriate scripts.

4. Verify the installation:
Once the installation is complete, you can verify it by running some
sample queries on your data using the Impala shell.

Here are some additional considerations when installing Apache
Impala:
 Security: Apache Impala can be configured to work with
Kerberos authentication for added security.
 High availability: Impala can be configured for high availability
by setting up multiple Impala daemons on different nodes.
 Integration with other tools: Impala can be integrated with
other big data tools like Apache Spark, Apache Kafka, and
Apache Nifi for advanced analytics and processing.
 Cluster size: The performance of Impala depends on the size of
your Hadoop cluster, so you may need to scale your cluster
accordingly for optimal performance.

Overall, installing Apache Impala is a straightforward process, and
there are many resources available online to help with configuration
and troubleshooting.
