Big Data Lab Record


B.Sc. III Year VI Semester (CBCS) : Data Science Syllabus
(With Mathematics Combination)
(Examination at the end of Semester - VI)
Practical – 7(A) : Big Data (Lab)
[3 HPW :: 1 Credit :: 50 Marks]
Objectives:
• Installation and understanding of the working of Hadoop.
• Understanding of the MapReduce programming paradigm.
• Writing MapReduce programs in Python.
• Understanding the working of Pig and Hive.
• Understanding the working of an Apache Spark cluster.
1) Setting up and Installing Hadoop in its two operating modes:
a. Pseudo distributed,
b. Fully distributed.
AIM: To install and set up Hadoop in pseudo-distributed and fully distributed modes.
DESCRIPTION: Pseudo-distributed mode is the mode in which all Hadoop daemons, such as
NameNode, DataNode, ResourceManager, and NodeManager, run on a single machine. In fully-
distributed mode, the Hadoop cluster consists of multiple machines, with each machine running one
or more Hadoop daemons.
SOFTWARE: Hadoop 3.3.5
PROCEDURE:
A. JAVA INSTALLATION
1. Download the Java installer. The Java installation wizard will open, and you will be prompted
to accept the terms and conditions of the Java License Agreement.
2. Choose the installation directory and install the Java Development Kit (JDK), which includes
the JRE as well as tools for developing Java applications.
3. Finally, click the "Install" button to begin the installation of Java on your system. The
installation process may take a few minutes to complete, depending on your system
specifications.
B. JAVA SETUP
1. Go to System Settings → Edit the system environment variables → Environment Variables → under
User variables, click "New" → set the variable name to "JAVA_HOME" and the variable value to the
JDK installation folder (for example, C:\Java\jdk-15) → OK.
2. Under System variables → click "Path" → New → add the path to the JDK's bin folder
(for example, C:\Java\jdk-15\bin) → OK.
3. Verify the installation: after the installation process is complete, you can verify that Java has
been installed successfully by opening a command prompt and typing the following command:
java -version
C. HADOOP INSTALLATION & SETUP:
1. Download the latest stable release of Hadoop from the Apache Hadoop website. Once you've
downloaded the distribution file, extract it to a suitable directory. In this record, Hadoop is
extracted to "C:\hadoop-3.3.5".
2. To configure Hadoop, you'll need to edit some of the configuration files in the Hadoop
installation directory.
3. Navigate to the "etc\hadoop" directory inside the Hadoop installation directory and open the
"hadoop-env.cmd" file in a text editor.
4. Find the line that sets the value of the "JAVA_HOME" variable and make sure it points to the
location of your JDK installation. Save and close the file.
set JAVA_HOME=C:\Java\jdk-15
5. In the "etc\hadoop" directory, open the "core-site.xml" file in a text editor. Add the following
lines to the file:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
6. Save and close the file. In the "etc\hadoop" directory, open the "mapred-site.xml" file in a text
editor. Add the following lines to the file:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
7. Save and close the file. In the "etc\hadoop" directory, open the "yarn-site.xml" file in a text
editor. Add the following lines to the file:
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
8. Save and close the file. In the "etc\hadoop" directory, open the "hdfs-site.xml" file in a text
editor. Add the following lines to the file:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///C:/hadoop-3.3.5/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///C:/hadoop-3.3.5/data/datanode</value>
  </property>
</configuration>
9. Save and close the file.
D. RUNNING HADOOP:
1. Format HDFS: before starting the NameNode, you need to format the Hadoop Distributed File
System (HDFS) using the following command in a command prompt (run as administrator).
This initializes the Hadoop file system: hdfs namenode -format
2. To start Hadoop, you'll need to run the following commands in a Command Prompt window:
3. Change the directory using the command cd .. until the current directory is C:\.
4. Now change the directory using the command: cd hadoop-3.3.5\sbin
5. Type "start-dfs" or "start-dfs.cmd" and press Enter. This starts the Hadoop distributed file
system (DFS).
6. Type "start-yarn" or "start-yarn.cmd" and press Enter. This starts the Hadoop resource
manager and node manager.
7. Type "jps" and press Enter. This displays the process_id of each daemon.

RESULTS: Java and Hadoop have been installed successfully.


OUTPUT:

2023-04-26 20:05:37,047 INFO namenode.NameNode: STARTUP_MSG:


/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = MUNNAWER/192.168.137.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 3.3.5
STARTUP_MSG: classpath = C:\hadoop-3.3.5\etc\hadoop;C:\hadoop-3.3.5\share\hadoop\common;C:\hadoop-
3.3.5\share\hadoop\common\lib\accessors-smart-2.4.7.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\animal-sniffer-annotations-
1.17.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\asm-5.0.4.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\audience-annotations-
0.5.0.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\avro-1.7.7.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\checker-qual-2.5.2.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\commons-beanutils-1.9.4.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\commons-cli-1.2.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\commons-codec-1.15.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\commons-collections-3.2.2.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\commons-compress-1.21.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\commons-configuration2-2.8.0.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\commons-daemon-1.0.13.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\commons-io-2.8.0.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\commons-lang3-3.12.0.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\commons-logging-1.1.3.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\commons-math3-3.1.1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\commons-net-3.9.0.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\commons-text-1.10.0.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\curator-client-4.2.0.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\curator-framework-4.2.0.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\curator-recipes-4.2.0.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\dnsjava-2.1.7.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\failureaccess-1.0.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\gson-2.9.0.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\guava-27.0-jre.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\hadoop-annotations-3.3.5.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\hadoop-auth-3.3.5.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\hadoop-shaded-guava-1.1.1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\hadoop-shaded-protobuf_3_7-
1.1.1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\httpclient-4.5.13.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\httpcore-
4.4.13.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\j2objc-annotations-1.1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jackson-annotations-
2.12.7.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jackson-core-2.12.7.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jackson-core-asl-
1.9.13.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jackson-databind-2.12.7.1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jackson-mapper-
asl-1.9.13.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jakarta.activation-api-1.2.1.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\javax.servlet-api-3.1.0.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jaxb-api-2.2.11.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\jaxb-impl-2.2.3-1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jcip-annotations-1.0-1.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\jersey-core-1.19.4.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jersey-json-1.20.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\jersey-server-1.19.4.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jersey-servlet-1.19.4.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\jettison-1.5.3.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jetty-http-9.4.48.v20220622.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\jetty-io-9.4.48.v20220622.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jetty-security-
9.4.48.v20220622.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jetty-server-9.4.48.v20220622.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\jetty-servlet-9.4.48.v20220622.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jetty-util-
9.4.48.v20220622.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jetty-util-ajax-9.4.48.v20220622.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\jetty-webapp-9.4.48.v20220622.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jetty-xml-
9.4.48.v20220622.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jsch-0.1.55.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\json-smart-
2.4.7.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jsp-api-2.1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jsr305-3.0.2.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\jsr311-api-1.1.1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\jul-to-slf4j-1.7.36.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\kerb-admin-1.0.1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\kerb-client-1.0.1.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\kerb-common-1.0.1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\kerb-core-1.0.1.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\kerb-crypto-1.0.1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\kerb-identity-1.0.1.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\kerb-server-1.0.1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\kerb-simplekdc-1.0.1.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\kerb-util-1.0.1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\kerby-asn1-1.0.1.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\kerby-config-1.0.1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\kerby-pkix-1.0.1.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\kerby-util-1.0.1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\kerby-xdr-1.0.1.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\metrics-core-3.2.4.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-all-4.1.77.Final.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\netty-buffer-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-codec-4.1.77.Final.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\netty-codec-dns-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-codec-haproxy-
4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-codec-http-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-
codec-http2-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-codec-memcache-4.1.77.Final.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\netty-codec-mqtt-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-codec-redis-
4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-codec-smtp-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-
codec-socks-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-codec-stomp-4.1.77.Final.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\netty-codec-xml-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-common-
4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-handler-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-
handler-proxy-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-resolver-4.1.77.Final.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\netty-resolver-dns-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-resolver-dns-classes-macos-
4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-resolver-dns-native-macos-4.1.77.Final-osx-aarch_64.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\netty-resolver-dns-native-macos-4.1.77.Final-osx-x86_64.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-
transport-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-transport-classes-epoll-4.1.77.Final.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\netty-transport-classes-kqueue-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-transport-
native-epoll-4.1.77.Final-linux-aarch_64.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-transport-native-epoll-4.1.77.Final-linux-
x86_64.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-transport-native-kqueue-4.1.77.Final-osx-aarch_64.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\netty-transport-native-kqueue-4.1.77.Final-osx-x86_64.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-
transport-native-unix-common-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-transport-rxtx-4.1.77.Final.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\netty-transport-sctp-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\netty-transport-udt-
4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\nimbus-jose-jwt-9.8.1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\paranamer-
2.3.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\protobuf-java-2.5.0.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\re2j-1.1.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\reload4j-1.2.22.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\slf4j-api-1.7.36.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\slf4j-reload4j-1.7.36.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\snappy-java-1.1.8.2.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\stax2-api-4.2.1.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\token-provider-1.0.1.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\woodstox-core-5.4.0.jar;C:\hadoop-3.3.5\share\hadoop\common\lib\zookeeper-3.5.6.jar;C:\hadoop-
3.3.5\share\hadoop\common\lib\zookeeper-jute-3.5.6.jar;C:\hadoop-3.3.5\share\hadoop\common\hadoop-common-3.3.5-tests.jar;C:\hadoop-
3.3.5\share\hadoop\common\hadoop-common-3.3.5.jar;C:\hadoop-3.3.5\share\hadoop\common\hadoop-kms-3.3.5.jar;C:\hadoop-
3.3.5\share\hadoop\common\hadoop-nfs-3.3.5.jar;C:\hadoop-3.3.5\share\hadoop\common\hadoop-registry-3.3.5.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\accessors-smart-2.4.7.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\animal-
sniffer-annotations-1.17.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\asm-5.0.4.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\audience-
annotations-0.5.0.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\avro-1.7.7.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\checker-qual-
2.5.2.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\commons-beanutils-1.9.4.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\commons-cli-
1.2.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\commons-codec-1.15.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\commons-collections-
3.2.2.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\commons-compress-1.21.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\commons-configuration2-
2.8.0.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\commons-daemon-1.0.13.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\commons-io-
2.8.0.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\commons-lang3-3.12.0.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\commons-logging-
1.1.3.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\commons-math3-3.1.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\commons-net-
3.9.0.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\commons-text-1.10.0.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\curator-client-
4.2.0.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\curator-framework-4.2.0.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\curator-recipes-
4.2.0.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\dnsjava-2.1.7.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\failureaccess-1.0.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\gson-2.9.0.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\guava-27.0-jre.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\hadoop-annotations-3.3.5.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\hadoop-auth-3.3.5.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\hadoop-shaded-guava-1.1.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\hadoop-shaded-protobuf_3_7-
1.1.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\httpclient-4.5.13.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\httpcore-4.4.13.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\j2objc-annotations-1.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\jackson-annotations-2.12.7.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\jackson-core-2.12.7.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\jackson-core-asl-1.9.13.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\jackson-databind-2.12.7.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\jackson-mapper-asl-1.9.13.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\jakarta.activation-api-1.2.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\javax.servlet-api-3.1.0.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\jaxb-api-2.2.11.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\jaxb-impl-2.2.3-1.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\jcip-annotations-1.0-1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\jersey-core-1.19.4.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\jersey-json-1.20.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\jersey-server-1.19.4.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\jersey-servlet-1.19.4.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\jettison-1.5.3.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\jetty-http-9.4.48.v20220622.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\jetty-io-9.4.48.v20220622.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\jetty-security-9.4.48.v20220622.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\jetty-server-
9.4.48.v20220622.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\jetty-servlet-9.4.48.v20220622.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\jetty-
util-9.4.48.v20220622.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\jetty-util-ajax-9.4.48.v20220622.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\jetty-webapp-9.4.48.v20220622.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\jetty-xml-
9.4.48.v20220622.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\jsch-0.1.55.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\json-simple-
1.1.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\json-smart-2.4.7.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\jsr305-3.0.2.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\jsr311-api-1.1.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\kerb-admin-1.0.1.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\kerb-client-1.0.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\kerb-common-1.0.1.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\kerb-core-1.0.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\kerb-crypto-1.0.1.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\kerb-identity-1.0.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\kerb-server-1.0.1.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\kerb-simplekdc-1.0.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\kerb-util-1.0.1.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\kerby-asn1-1.0.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\kerby-config-1.0.1.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\kerby-pkix-1.0.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\kerby-util-1.0.1.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\kerby-xdr-1.0.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\kotlin-stdlib-1.4.10.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\kotlin-stdlib-common-1.4.10.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\leveldbjni-all-1.8.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-
3.10.6.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-all-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-buffer-
4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-codec-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-codec-dns-
4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-codec-haproxy-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-
codec-http-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-codec-http2-4.1.77.Final.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\netty-codec-memcache-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-codec-mqtt-
4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-codec-redis-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-
codec-smtp-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-codec-socks-4.1.77.Final.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\netty-codec-stomp-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-codec-xml-
4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-common-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-handler-
4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-handler-proxy-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-
resolver-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-resolver-dns-4.1.77.Final.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\netty-resolver-dns-classes-macos-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-resolver-dns-
native-macos-4.1.77.Final-osx-aarch_64.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-resolver-dns-native-macos-4.1.77.Final-osx-
x86_64.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-transport-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-transport-
classes-epoll-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-transport-classes-kqueue-4.1.77.Final.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\netty-transport-native-epoll-4.1.77.Final-linux-aarch_64.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-
transport-native-epoll-4.1.77.Final-linux-x86_64.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-transport-native-kqueue-4.1.77.Final-osx-
aarch_64.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-transport-native-kqueue-4.1.77.Final-osx-x86_64.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\netty-transport-native-unix-common-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-transport-rxtx-
4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-transport-sctp-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\netty-
transport-udt-4.1.77.Final.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\nimbus-jose-jwt-9.8.1.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\okhttp-4.9.3.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\okio-2.8.0.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\paranamer-2.3.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\protobuf-java-2.5.0.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\re2j-1.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\reload4j-1.2.22.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\snappy-java-1.1.8.2.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\stax2-api-4.2.1.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\token-provider-1.0.1.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\woodstox-core-5.4.0.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\lib\zookeeper-3.5.6.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\lib\zookeeper-jute-3.5.6.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\hadoop-hdfs-3.3.5-tests.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\hadoop-hdfs-3.3.5.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\hadoop-hdfs-client-3.3.5-tests.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\hadoop-hdfs-client-3.3.5.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\hadoop-hdfs-httpfs-3.3.5.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\hadoop-hdfs-native-client-3.3.5-tests.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\hadoop-hdfs-native-client-3.3.5.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\hadoop-hdfs-nfs-3.3.5.jar;C:\hadoop-
3.3.5\share\hadoop\hdfs\hadoop-hdfs-rbf-3.3.5-tests.jar;C:\hadoop-3.3.5\share\hadoop\hdfs\hadoop-hdfs-rbf-
3.3.5.jar;\share\hadoop\yarn\*;C:\hadoop-3.3.5\share\hadoop\mapreduce\hadoop-mapreduce-client-app-3.3.5.jar;C:\hadoop-
3.3.5\share\hadoop\mapreduce\hadoop-mapreduce-client-common-3.3.5.jar;C:\hadoop-3.3.5\share\hadoop\mapreduce\hadoop-mapreduce-client-core-
3.3.5.jar;C:\hadoop-3.3.5\share\hadoop\mapreduce\hadoop-mapreduce-client-hs-3.3.5.jar;C:\hadoop-3.3.5\share\hadoop\mapreduce\hadoop-
mapreduce-client-hs-plugins-3.3.5.jar;C:\hadoop-3.3.5\share\hadoop\mapreduce\hadoop-mapreduce-client-jobclient-3.3.5-tests.jar;C:\hadoop-
3.3.5\share\hadoop\mapreduce\hadoop-mapreduce-client-jobclient-3.3.5.jar;C:\hadoop-3.3.5\share\hadoop\mapreduce\hadoop-mapreduce-client-
nativetask-3.3.5.jar;C:\hadoop-3.3.5\share\hadoop\mapreduce\hadoop-mapreduce-client-shuffle-3.3.5.jar;C:\hadoop-
3.3.5\share\hadoop\mapreduce\hadoop-mapreduce-client-uploader-3.3.5.jar;C:\hadoop-3.3.5\share\hadoop\mapreduce\hadoop-mapreduce-examples-
3.3.5.jar
STARTUP_MSG: build = https://github.com/apache/hadoop.git -r 706d88266abcee09ed78fbaa0ad5f74d818ab0e9; compiled by 'stevel' on 2023-03-
15T15:56Z
STARTUP_MSG: java = 1.8.0_351
************************************************************/
2023-04-26 20:05:37,401 INFO namenode.NameNode: createNameNode [-format]
2023-04-26 20:05:40,851 INFO namenode.NameNode: Formatting using clusterid: CID-465a2753-c480-4815-a94d-560d71043367
2023-04-26 20:05:41,121 INFO namenode.FSEditLog: Edit logging is async:true
2023-04-26 20:05:41,217 INFO namenode.FSNamesystem: KeyProvider: null
2023-04-26 20:05:41,223 INFO namenode.FSNamesystem: fsLock is fair: true
2023-04-26 20:05:41,228 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
2023-04-26 20:05:41,300 INFO namenode.FSNamesystem: fsOwner = DELLi7 (auth:SIMPLE)
2023-04-26 20:05:41,301 INFO namenode.FSNamesystem: supergroup = supergroup
2023-04-26 20:05:41,307 INFO namenode.FSNamesystem: isPermissionEnabled = true
2023-04-26 20:05:41,314 INFO namenode.FSNamesystem: isStoragePolicyEnabled = true
2023-04-26 20:05:41,322 INFO namenode.FSNamesystem: HA Enabled: false
2023-04-26 20:05:41,490 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2023-04-26 20:05:41,926 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit : configured=1000, counted=60, effected=1000
2023-04-26 20:05:41,927 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
2023-04-26 20:05:41,941 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2023-04-26 20:05:41,941 INFO blockmanagement.BlockManager: The block deletion will start around 2023 Apr 26 20:05:41
2023-04-26 20:05:41,946 INFO util.GSet: Computing capacity for map BlocksMap
2023-04-26 20:05:41,950 INFO util.GSet: VM type = 64-bit
2023-04-26 20:05:41,968 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
2023-04-26 20:05:41,969 INFO util.GSet: capacity = 2^21 = 2097152 entries
2023-04-26 20:05:41,997 INFO blockmanagement.BlockManager: Storage policy satisfier is disabled
2023-04-26 20:05:41,998 INFO blockmanagement.BlockManager: dfs.block.access.token.enable = false
2023-04-26 20:05:42,025 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.threshold-pct = 0.999
2023-04-26 20:05:42,026 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.min.datanodes = 0
2023-04-26 20:05:42,028 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.extension = 30000
2023-04-26 20:05:42,042 INFO blockmanagement.BlockManager: defaultReplication = 1
2023-04-26 20:05:42,044 INFO blockmanagement.BlockManager: maxReplication = 512
2023-04-26 20:05:42,045 INFO blockmanagement.BlockManager: minReplication = 1
2023-04-26 20:05:42,059 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
2023-04-26 20:05:42,062 INFO blockmanagement.BlockManager: redundancyRecheckInterval = 3000ms
2023-04-26 20:05:42,063 INFO blockmanagement.BlockManager: encryptDataTransfer = false
2023-04-26 20:05:42,074 INFO blockmanagement.BlockManager: maxNumBlocksToLog = 1000
2023-04-26 20:05:42,230 INFO namenode.FSDirectory: GLOBAL serial map: bits=29 maxEntries=536870911
2023-04-26 20:05:42,231 INFO namenode.FSDirectory: USER serial map: bits=24 maxEntries=16777215
2023-04-26 20:05:42,234 INFO namenode.FSDirectory: GROUP serial map: bits=24 maxEntries=16777215
2023-04-26 20:05:42,242 INFO namenode.FSDirectory: XATTR serial map: bits=24 maxEntries=16777215
2023-04-26 20:05:42,296 INFO util.GSet: Computing capacity for map INodeMap
2023-04-26 20:05:42,296 INFO util.GSet: VM type = 64-bit
2023-04-26 20:05:42,300 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
2023-04-26 20:05:42,301 INFO util.GSet: capacity = 2^20 = 1048576 entries
2023-04-26 20:05:42,311 INFO namenode.FSDirectory: ACLs enabled? true
2023-04-26 20:05:42,327 INFO namenode.FSDirectory: POSIX ACL inheritance enabled? true
2023-04-26 20:05:42,334 INFO namenode.FSDirectory: XAttrs enabled? true
2023-04-26 20:05:42,343 INFO namenode.NameNode: Caching file names occurring more than 10 times
2023-04-26 20:05:42,366 INFO snapshot.SnapshotManager: Loaded config captureOpenFiles: false, skipCaptureAccessTimeOnlyChange: false,
snapshotDiffAllowSnapRootDescendant: true, maxSnapshotLimit: 65536
2023-04-26 20:05:42,371 INFO snapshot.SnapshotManager: SkipList is disabled
2023-04-26 20:05:42,396 INFO util.GSet: Computing capacity for map cachedBlocks
2023-04-26 20:05:42,396 INFO util.GSet: VM type = 64-bit
2023-04-26 20:05:42,398 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
2023-04-26 20:05:42,409 INFO util.GSet: capacity = 2^18 = 262144 entries
2023-04-26 20:05:42,441 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
2023-04-26 20:05:42,443 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
2023-04-26 20:05:42,445 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2023-04-26 20:05:42,485 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
2023-04-26 20:05:42,486 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000
millis
2023-04-26 20:05:42,498 INFO util.GSet: Computing capacity for map NameNodeRetryCache
2023-04-26 20:05:42,500 INFO util.GSet: VM type = 64-bit
2023-04-26 20:05:42,508 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
2023-04-26 20:05:42,511 INFO util.GSet: capacity = 2^15 = 32768 entries
Re-format filesystem in Storage Directory root= C:\hadoop-3.3.5\data\namenode; location= null ? (Y or N) Y
2023-04-26 20:06:03,936 INFO namenode.FSImage: Allocated new BlockPoolId: BP-344613685-192.168.137.1-1682519763825
2023-04-26 20:06:03,937 INFO common.Storage: Will remove files: [C:\hadoop-
3.3.5\data\namenode\current\edits_inprogress_0000000000000000001, C:\hadoop-3.3.5\data\namenode\current\fsimage_0000000000000000000,
C:\hadoop-3.3.5\data\namenode\current\fsimage_0000000000000000000.md5, C:\hadoop-3.3.5\data\namenode\current\seen_txid, C:\hadoop-
3.3.5\data\namenode\current\VERSION]
2023-04-26 20:06:04,098 INFO common.Storage: Storage directory C:\hadoop-3.3.5\data\namenode has been successfully formatted.
2023-04-26 20:06:04,216 INFO namenode.FSImageFormatProtobuf: Saving image file C:\hadoop-
3.3.5\data\namenode\current\fsimage.ckpt_0000000000000000000 using no compression
2023-04-26 20:06:04,534 INFO namenode.FSImageFormatProtobuf: Image file C:\hadoop-
3.3.5\data\namenode\current\fsimage.ckpt_0000000000000000000 of size 401 bytes saved in 0 seconds .
2023-04-26 20:06:04,650 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2023-04-26 20:06:04,691 INFO namenode.FSNamesystem: Stopping services started for active state
2023-04-26 20:06:04,692 INFO namenode.FSNamesystem: Stopping services started for standby state
2023-04-26 20:06:04,723 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2023-04-26 20:06:04,724 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at MUNNAWER/192.168.137.1
************************************************************/
2) Implementation of the following file management tasks in Hadoop:
a. Adding files and directories
b. Retrieving files
c. Deleting files
AIM: To implement file management tasks in Hadoop.
DESCRIPTION: Hadoop provides a distributed file system, HDFS (Hadoop Distributed File System),
that allows you to store and process large data sets across a cluster of computers. File
management tasks in Hadoop can be carried out through the HDFS command-line interface
(hdfs dfs / hadoop fs) or through the Hadoop APIs.
1. Create a new directory in HDFS using the command:
hdfs dfs -mkdir /path_to_new_directory
2. Copy the file to the HDFS using the command:
hdfs dfs -copyFromLocal /path_to_local_file /path_to_destination
3. Alternatively, you can create a new file in HDFS using the command:
hdfs dfs -touchz /path_to_new_file
4. To upload a file to HDFS, use the command:
hadoop fs -put filename /path_to_destination/
5. Retrieve the file from HDFS using the command:
hdfs dfs -copyToLocal /path_to_hdfs_file /path_to_local_destination
6. Alternatively, you can view the contents of a file in HDFS using the command:
hdfs dfs -cat /path_to_hdfs_file
7. To list the contents of a directory use the following command:
hadoop fs -ls /directoryname
8. To download a file from HDFS, use the following command:
hadoop fs -get /directoryname/filename /local_file_path/
9. Delete the file from HDFS using the command: hdfs dfs -rm /path_to_hdfs_file
10. Alternatively, you can delete a directory and all its contents using the command:
hdfs dfs -rm -r /path_to_hdfs_directory
SOFTWARE: Hadoop 3.3.5
PROCEDURE:
A. ADDING FILES AND DIRECTORIES: Open a terminal and connect to your Hadoop cluster
and use the following commands:
hadoop fs -mkdir /mydirectory
hadoop fs -put myfile.txt /mydirectory/
B. RETRIEVING FILES
hadoop fs -ls /mydirectory
hadoop fs -get /mydirectory/myfile.txt /local/path/
C. DELETING FILES
hadoop fs -rm /mydirectory/myfile.txt
hadoop fs -rm -r /mydirectory
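
The commands above can also be scripted. The following is a hedged sketch (not part of the original
record) that drives the same add/retrieve/delete workflow with Python's subprocess module; it assumes
the "hadoop" client is on the PATH and that myfile.txt exists in the local directory.

import subprocess


def run(cmd):
    # Print and execute a hadoop fs command, raising an error if it fails
    print("$ " + " ".join(cmd))
    subprocess.run(cmd, check=True)


# Adding files and directories
run(["hadoop", "fs", "-mkdir", "-p", "/mydirectory"])
run(["hadoop", "fs", "-put", "myfile.txt", "/mydirectory/"])

# Retrieving files
run(["hadoop", "fs", "-ls", "/mydirectory"])
run(["hadoop", "fs", "-get", "/mydirectory/myfile.txt", "./"])

# Deleting files and the directory
run(["hadoop", "fs", "-rm", "/mydirectory/myfile.txt"])
run(["hadoop", "fs", "-rm", "-r", "/mydirectory"])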
RESULTS: File management tasks in Hadoop have been demonstrated successfully.
OUTPUT:

[cloudera@quickstart ~]$ mkdir it106
[cloudera@quickstart ~]$ ls
106 avgsal106 emp.txt new wc106 07 cloudera-manager express-deployment.json it106
[cloudera@quickstart ~]$ cd it106
[cloudera@quickstart it106]$ vi prg1.txt
[cloudera@quickstart it106]$ ls
prg1.txt
[cloudera@quickstart ~]$ hdfs dfs -mkdir /hdfsit106
[cloudera@quickstart ~]$ hdfs dfs -ls /
drwxr-xr-x - cloudera supergroup 0 2021-12-06 06:43 /hdfsit106
drwxr-xr-x - cloudera supergroup 0 2021-11-23 07:01 /it106
[cloudera@quickstart ~]$ hdfs dfs -put it106/prg1.txt /hdfsit106
[cloudera@quickstart ~]$ hdfs dfs -ls /hdfsit106
Found 1 items
-rw-r--r-- 1 cloudera supergroup 12 2021-12-06 06:46 /hdfsit106/prg1.txt
[cloudera@quickstart ~]$ cd it106
[cloudera@quickstart it106]$ vi prg2.txt
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal it106/prg2.txt /hdfsit106
[cloudera@quickstart ~]$ hdfs dfs -ls /hdfsit106
Found 2 items
-rw-r--r-- 1 cloudera supergroup 12 2021-12-06 06:46 /hdfsit106/prg1.txt
-rw-r--r-- 1 cloudera supergroup 14 2021-12-06 06:48 /hdfsit106/prg2.txt
[cloudera@quickstart ~]$ hdfs dfs -cat /hdfsit106/prg1.txt
hello world
[cloudera@quickstart ~]$ hdfs dfs -get /hdfsit106/prg2.txt ./it106/newprg.txt
[cloudera@quickstart ~]$ cd it106
[cloudera@quickstart it106]$ ls
newprg.txt prg1.txt prg2.txt prg3.txt
[cloudera@quickstart ~]$ hdfs dfs -mkdir /hdfsit18106
[cloudera@quickstart ~]$ hdfs dfs -touchz /hdfsit106/empt.txt
[cloudera@quickstart ~]$ hdfs dfs -ls /hdfsit106
Found 2 items
-rw-r--r-- 1 cloudera supergroup 0 2021-12-06 06:59 /hdfsit106/empt.txt
-rw-r--r-- 1 cloudera supergroup 12 2021-12-06 06:46 /hdfsit106/prg1.txt
[cloudera@quickstart ~]$ hdfs dfs -rm /hdfsit106/empt.txt
Deleted /hdfsit106/empt.txt
[cloudera@quickstart ~]$ hdfs dfs -ls /hdfsit106
Found 1 items
-rw-r--r-- 1 cloudera supergroup 12 2021-12-06 06:46 /hdfsit106/prg1.txt
3) Implementation of Word Count Map Reduce program:
a. Find the number of occurrences of each word appearing in the input file(s).
b. Performing a MapReduce Job for word search count (look for specific keywords in
a file).
AIM: To implement a word count MapReduce program that finds the number of occurrences of each
word and searches for specific keywords in a file.
DESCRIPTION: The word count MapReduce program is a popular example of how the MapReduce
framework can be used to process large amounts of data efficiently. In this program, the objective is to count
the number of times each word appears in a set of input files. The following is a detailed procedure for
implementing a word count MapReduce program:
1. Prepare the input data: The input data should be in a format that can be read by Hadoop,
which is the underlying framework for MapReduce. The input data can be in a variety of formats,
including text files, CSV files, or HDFS files.
2. Write the Map function: The Map function is responsible for processing the input data and
generating a set of key-value pairs. In the case of the word count program, the Map function
should read each line of the input data, tokenize the words in each line, and emit a key-value
pair for each word. The key should be the word itself, and the value should be the number 1,
indicating that the word has been counted once.
3. Write the Reduce function: The Reduce function is responsible for combining the key-value
pairs generated by the Map function and producing a final set of output key-value pairs. In the
case of the word count program, the Reduce function should receive a set of key-value pairs,
where the key is a word and the value is a list of 1s. The Reduce function should sum up the
values for each key and emit a final key-value pair where the key is the word and the value is
the total count for that word.
4. Configure the MapReduce job: The MapReduce job should be configured with the input data,
the Map function, the Reduce function, and any additional settings or parameters required for
the job. This can be done using a configuration file or command line arguments.
5. Submit the MapReduce job: Once the job has been configured, it can be submitted to the
Hadoop cluster for execution. The MapReduce framework will divide the input data into smaller
chunks and distribute them across the nodes in the cluster. The Map function will be executed
in parallel on each node, generating a set of intermediate key-value pairs. The Reduce function
will then be executed on the results of the Map function, producing a final set of output key-value
pairs.
6. Collect the output: Once the MapReduce job has completed, the output can be collected from
the Hadoop cluster and analyzed. In the case of the word count program, the output will be a set
of key-value pairs, where the key is a word and the value is the number of times that word
appears in the input data.
SOFTWARE: Hadoop 3.3.5
CODE:
# Word count implemented with the mrjob library (the same library used in
# Practical 5); it runs locally or on a Hadoop cluster via Hadoop Streaming.
from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Sum the 1s emitted for each word to get its total count
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()

# Run as:  python wordcount.py input/sample.txt
# Add "-r hadoop" to submit the job to the Hadoop cluster, and
# "--output-dir output/wordcount" to store the results in HDFS.
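
For part (b), a word search count restricted to particular keywords follows the same structure. The
sketch below is not from the original record; the keyword list is only an illustration and should be
replaced with the terms of interest.

from mrjob.job import MRJob

# Hypothetical keywords to search for; replace these with the terms of interest.
KEYWORDS = {"hello", "hadoop"}


class MRWordSearchCount(MRJob):

    def mapper(self, _, line):
        # Emit a count only for the keywords being searched for
        for word in line.split():
            if word.lower() in KEYWORDS:
                yield word.lower(), 1

    def reducer(self, word, counts):
        # Total number of occurrences of each keyword across the input file(s)
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordSearchCount.run()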
RESULTS: The word count MapReduce program has been executed successfully.
INPUT DATASET

Hello world
Hello hadoop
Hadoop is cool
World is round

OUTPUT
Hello 2
World 2
hadoop 1
is 2
cool 1
round 1
4) Map Reduce Program for Stop word elimination:
a. Map Reduce program to eliminate stop words from a large text file.
AIM: To implement a MapReduce program that eliminates stop words from a file.
DESCRIPTION: Stop words are words that are commonly used in a language and are generally
filtered out in natural language processing (NLP) tasks. Examples of stop words in English include
"the," "and," "a," "an," and "in".
1. Input Data: The first step in creating a MapReduce program is to identify the input data. In
this case, the input data would be a large text file that contains a lot of stop words.
2. Map: The next step is to create the Map function, which will take the input data and break it
down into smaller chunks. The Map function will read each line of the text file and tokenize it
into individual words. For each word, the Map function will check if it is a stop word or not. If
the word is a stop word, the Map function will discard it, and if it's not a stop word, it will emit
a key-value pair with the word as the key and the value as 1.
3. Shuffle and Sort: After the Map function has emitted all the key-value pairs, the next step is
to group the pairs by the key. This is done by the Shuffle and Sort phase of the MapReduce
program. All key-value pairs with the same key are sent to the same reducer.
4. Reduce: The Reduce function will take the key-value pairs from the Shuffle and Sort phase.
Because stop words were already discarded in the Map phase, every key that reaches the reducer is
a non-stop word; the Reduce function simply sums the values for each key and emits the word
together with its total count.
5. Output: Finally, the output of the MapReduce program will be a list of all the words in the text
file that are not stop words.
SOFTWARE: Hadoop 3.3.5
CODE:
# List of stop words
stop_words = ["the", "and", "a", "an", "in", "are", "to", "of", "that"]

# Map function: tokenize the input text and drop stop words
def map(key, value):
    # Emit (word, 1) for every word that is not a stop word
    for word in value.split():
        if word.lower() not in stop_words:
            yield (word.lower(), 1)

# Reduce function: sum the counts for each remaining word
def reduce_func(key, values):
    # Stop words never reach this point, so every word with a positive count is emitted
    count = sum(values)
    if count > 0:
        yield (key, count)
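
The map and reduce functions above show only the per-record logic. A minimal runnable counterpart,
written with the mrjob library used in Practical 5 (an assumption, since the original record does not
specify a driver), could look like this:

from mrjob.job import MRJob

STOP_WORDS = {"the", "and", "a", "an", "in", "are", "to", "of", "that"}


class MRStopWordEliminator(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) only for words that are not in the stop-word list
        for word in line.split():
            if word.lower() not in STOP_WORDS:
                yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the counts for each remaining (non-stop) word
        yield word, sum(counts)


if __name__ == '__main__':
    MRStopWordEliminator.run()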

RESULTS: The MapReduce program to eliminate stop words has been executed successfully.
INPUT DATASET

The quick brown fox jumped over the lazy dog. The dog was not amused. Stop words are words that
are commonly used in a language. We need to eliminate them to improve the accuracy of NLP tasks.

OUTPUT
brown 1
fox 1
jumped 1
over 1
lazy 1
dog 1
not 1
amused. 1
stop 1
words 1
commonly 1
used 1
language. 1
need 1
eliminate 1
them 1
improve 1
accuracy 1
nlp 1
tasks. 1
5) Map Reduce program that mines weather data. Weather sensors collecting data every
hour at many locations across the globe gather a large volume of log data, which is a good
candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
Data available at: https://github.com/tomwhite/hadoop-book/tree/master/input/ncdc/all.
a. Find average, max and min temperature for each year in NCDC data set?
b. Filter the readings of a set based on value of the measurement, Output the line of
input files associated with a temperature value greater than 30.0 and store it in a
separate file.
AIM: To implement a MapReduce program that mines weather data and displays the average,
minimum, and maximum temperature for each year.
DESCRIPTION:
1. Data Input: Download the NCDC data set available at https://github.com/tomwhite/hadoop-
book/tree/master/input/ncdc/all and store it in the Hadoop Distributed File System (HDFS).
2. Mapper: Write a mapper function that parses each record of the input file and emits key-value
pairs, where the key is the year and the value is the temperature. The mapper function should
filter out invalid temperature values and emit only those records that have valid temperature
values. Write a separate mapper function that filters the readings based on the measurement value
and emits the input lines whose temperature is greater than 30.0 (a sketch of this appears after the
code below).
3. Combiner: Write a combiner function that computes the local sum, count, minimum, and
maximum of temperature values for each year.
4. Reducer: Write a reducer function that receives the output of the combiner function and
computes the global sum, count, minimum, and maximum of temperature values for each year.
5. Output: Write a function that formats the output of the reducer function to display the average,
max, and min temperature for each year and stores it in a separate file. Likewise, store the lines
emitted by the filtering mapper (temperature greater than 30.0) in a separate file.
6. Run the MapReduce job: Submit the MapReduce job to the Hadoop cluster using the Hadoop
command-line interface. Monitor the progress of the MapReduce job using the Hadoop job
monitoring tool. Once the MapReduce job has completed successfully, retrieve the output files
from the HDFS and store them locally.
SOFTWARE: Hadoop 3.3.5
CODE:
from mrjob.job import MRJob


class MRWeather(MRJob):

    def mapper(self, _, line):
        # Fields: station_id, date (YYYYMMDD), observation type, value, flags...
        fields = line.strip().split(',')
        year = fields[1][:4]
        temperature = int(fields[3])
        quality = fields[4]
        # Skip missing values and readings whose quality flag marks them as bad
        if temperature != 9999 and (quality == '' or quality[0] in ['0', '1', '4', '5', '9']):
            yield year, (temperature, quality)

    def reducer(self, year, values):
        temps = []
        for value in values:
            temps.append(value[0])
        if temps:
            avg = sum(temps) / len(temps)
            maximum = max(temps)
            minimum = min(temps)
            yield year, (avg, maximum, minimum)
        else:
            yield year, (None, None, None)


if __name__ == '__main__':
    MRWeather.run()
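
Part (b) of this practical asks for the input lines with a temperature above 30.0 °C to be written to a
separate file. The following is a rough sketch under the same assumptions about the comma-separated
input format (values in tenths of a degree Celsius); it is not part of the original record. The output
can be sent to a separate location with mrjob's --output-dir option.

from mrjob.job import MRJob
from mrjob.protocol import RawValueProtocol


class MRHotReadings(MRJob):

    # Write the matching input lines out unchanged (no key/JSON wrapping)
    OUTPUT_PROTOCOL = RawValueProtocol

    def mapper(self, _, line):
        fields = line.strip().split(',')
        # Keep only temperature readings (TMAX/TMIN); values are tenths of a degree Celsius
        if len(fields) > 3 and fields[2] in ('TMAX', 'TMIN'):
            value = int(fields[3])
            if value != 9999 and value / 10.0 > 30.0:
                yield None, line


if __name__ == '__main__':
    MRHotReadings.run()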
RESULTS: The MapReduce program that mines weather data and displays the average, minimum, and
maximum temperature for each year has been executed successfully.
INPUT DATASET: The input dataset consists of weather data in the NCDC format, where each line
represents a single reading at a particular location and time. The data is available at:
https://github.com/tomwhite/hadoop-book/tree/master/input/ncdc/all . The data contains readings from
multiple years across the globe, with each line including the following fields:
Station_id (11 characters)         The ID of the weather station
Date (8 characters)                The date of the reading in YYYYMMDD format
Observation_type (4 characters)    The type of reading, such as TMAX (maximum temperature) or TMIN (minimum temperature)
Observation_value (5 characters)   The value of the reading in tenths of a degree Celsius
Observation_quality (1 character)  The quality of the reading, such as whether it was estimated or based on actual data
Here's an example of a few lines from the input file:

USC00230850,19491231,TMAX,11,,7
USC00230850,19491231,TMIN,-39,,7
USC00230850,19491231,PRCP,0,,7
USC00230850,19500101,TMAX,0,,7
USC00230850,19500101,TMIN,-56,,7
USC00230850,19500101,PRCP,0,,7

OUTPUT: Each line represents one year and its corresponding temperature data, where the first
column is the year, and the second column is a tuple containing the average, maximum and minimum
temperature for that year.

"1949" [-7.728318584070796, 39, -183]


"1950" [-5.24746835443038, 322, -271]
"1951" [-4.674073733903433, 383, -273]
"1952" [-5.938235294117647, 367, -276]
"1953" [-5.4297583081571, 361, -277]
...
6) Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter
your data.
AIM: To install and run Pig, and to perform sorting, grouping, joining, projection, and filtering on a dataset.
DESCRIPTION: Apache Pig is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with infrastructure for evaluating these
programs. The salient property of Pig programs is that their structure is amenable to substantial
parallelization, which in turn enables them to handle very large data sets. At present, Pig's
infrastructure layer consists of a compiler that produces sequences of MapReduce programs, for
which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's
language layer currently consists of a textual language called Pig Latin, which has the following key
properties: ease of programming (it is trivial to achieve parallel execution of simple data analysis
tasks, and complex tasks comprised of multiple interrelated data transformations are explicitly
encoded as data flow sequences, making them easy to write, understand, and maintain), optimization
opportunities (the way in which tasks are encoded permits the system to optimize their execution
automatically, allowing the user to focus on semantics rather than efficiency), and extensibility
(users can create their own functions to do special-purpose processing).
1. Sort: the ORDER BY operator sorts the data in ascending or descending order based on the
specified field.
2. Group: the GROUP operator groups the data based on a specified field.
3. Join: the JOIN operator combines two or more relations based on a common field.
4. Project: a FOREACH ... GENERATE statement selects (projects) specific fields from a relation.
5. Filter: the FILTER operator selects only those records that meet a specified condition.
SOFTWARE: Pig 0.17.0
PROCEDURE:
1. Download the latest stable release of Apache Pig from the official website https://pig.apache.org/.
2. Extract the downloaded file to a directory on your system.
3. Set the PIG_HOME environment variable to point to the extracted directory.
4. Add the PIG_HOME/bin directory to your system's PATH variable.
5. Verify that Pig is installed and running by executing the command "pig -help" in your terminal.
CODE:
-- SORTING
data = LOAD 'input.txt' USING PigStorage(',') AS (id:int, name:chararray,
    age:int, gender:chararray);
sorted_data = ORDER data BY age ASC;
STORE sorted_data INTO 'output.txt' USING PigStorage(',');

-- GROUPING
data = LOAD 'input.txt' USING PigStorage(',') AS (id:int, name:chararray,
    age:int, gender:chararray);
grouped_data = GROUP data BY gender;
STORE grouped_data INTO 'output.txt' USING PigStorage(',');

-- JOINING
data1 = LOAD 'input1.txt' USING PigStorage(',') AS (id:int, name:chararray,
    age:int, gender:chararray);
data2 = LOAD 'input2.txt' USING PigStorage(',') AS (id:int,
    address:chararray);
joined_data = JOIN data1 BY id, data2 BY id;
STORE joined_data INTO 'output.txt' USING PigStorage(',');

-- PROJECTING
data = LOAD 'input.txt' USING PigStorage(',') AS (id:int, name:chararray,
    age:int, gender:chararray);
projected_data = FOREACH data GENERATE id, name;
STORE projected_data INTO 'output.txt' USING PigStorage(',');

-- FILTERING
data = LOAD 'input.txt' USING PigStorage(',') AS (id:int, name:chararray,
    age:int, gender:chararray);
filtered_data = FILTER data BY age > 30 AND gender == 'F';
STORE filtered_data INTO 'output.txt';


RESULTS: Pig has been installed successfully, and Pig Latin scripts to sort, group, join, project, and
filter the dataset have been executed successfully.
INPUT DATASET

//input1.txt
1,John,25,M
2,Emily,32,F
3,Michael,18,M
4,Lisa,41,F
//input2.txt
1,123 Main St
2,456 Park Ave
3,789 Elm St
4,1010 Cedar Rd

OUTPUT

//SORT
(3,Michael,18,M)
(1,John,25,M)
(2,Emily,32,F)
(4,Lisa,41,F)

//GROUP
(F,{(2,Emily,32,F),(4,Lisa,41,F)})
(M,{(1,John,25,M),(3,Michael,18,M)})

//JOIN
(1,John,25,M,1,123 Main St)
(2,Emily,32,F,2,456 Park Ave)
(3,Michael,18,M,3,789 Elm St)
(4,Lisa,41,F,4,1010 Cedar Rd)

//PROJECT
(1,John)
(2,Emily)
(3,Michael)
(4,Lisa)

//FILTER
(2,Emily,32,F)
7) Write a Pig Latin script for finding TF-IDF value for book dataset (A corpus of eBooks
available at: Project Gutenberg).
AIM: To write a Pig Latin script that finds the TF-IDF values for a book dataset.
DESCRIPTION:
1. Download and Preprocess the Corpus: First, download the corpus of eBooks available at
Project Gutenberg (the wget command can be used for this). Once downloaded, preprocess the
corpus by removing unwanted characters, punctuation, and stop words. Pig Latin's built-in
functions such as TOKENIZE and FILTER can be used for this preprocessing (stop words can be
removed by filtering against a stop-word list).
2. Generate Term Frequency (TF) Values: Next, generate the Term Frequency (TF) values for
each term in the corpus. You can use the TOKENIZE function to split the text into individual
words, and then use the GROUP BY and COUNT functions to count the number of occurrences
of each word in the corpus.
3. Calculate Document Frequency (DF) Values: After generating the TF values, calculate the
Document Frequency (DF) values for each term in the corpus. You can use the DISTINCT
function to get the number of unique documents in which a term appears, and then use the
GROUP BY function to group the terms by their document frequency.
4. Calculate Inverse Document Frequency (IDF) Values: Next, calculate the Inverse
Document Frequency (IDF) values for each term in the corpus. You can use the LOG function to
calculate the logarithm of the total number of documents divided by the document frequency for
each term.
5. Calculate TF-IDF Values: Finally, calculate the TF-IDF values for each term in the corpus.
You can use the JOIN function to combine the TF and IDF values for each term, and then
multiply them together to get the final TF-IDF value. You can also use the STORE function to
store the results in a file.
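In short, for a term t and a document d, tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the
number of times t occurs in d, df(t) is the number of documents containing t, and N is the total
number of documents in the corpus; this is the quantity computed by the script below.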
SOFTWARE: Pig 0.17.0
CODE:
-- Load the dataset (TextLoader yields one record per line; here each line is treated as a document)
documents = LOAD '/path/to/corpus' USING TextLoader() AS (doc:chararray);

-- Tokenize the documents, keeping the document alongside each token
tokenized_docs = FOREACH documents GENERATE doc, FLATTEN(TOKENIZE(doc)) AS token;

-- Group by (document, token) and count the occurrences of each token per document (term frequency)
word_counts = FOREACH (GROUP tokenized_docs BY (doc, token)) GENERATE
    FLATTEN(group) AS (doc, token), COUNT(tokenized_docs) AS count;

-- Compute the total number of documents
docs_only = FOREACH word_counts GENERATE doc;
doc_list = DISTINCT docs_only;
total_docs = FOREACH (GROUP doc_list ALL) GENERATE COUNT(doc_list) AS n;

-- Compute the document frequency for each token
df = FOREACH (GROUP word_counts BY token) GENERATE group AS token,
    COUNT(word_counts) AS doc_freq;

-- Compute the inverse document frequency for each token
idf = FOREACH df GENERATE token, LOG((double) total_docs.n / doc_freq) AS idf;

-- Join the word counts with the idf values to compute the TF-IDF score for each (document, token) pair
joined = JOIN word_counts BY token, idf BY token;
tf_idf = FOREACH joined GENERATE word_counts::doc AS doc, word_counts::token AS token,
    word_counts::count AS count, idf::idf AS idf, word_counts::count * idf::idf AS tf_idf;

-- Output the results
STORE tf_idf INTO '/path/to/output';
RESULTS: The Pig Latin script that finds the TF-IDF values for the book dataset has been executed
successfully.
INPUT DATASET

PG10000_counts.txt 13.3 kB
PG10001_counts.txt 13.9 kB
PG10002_counts.txt 48.9 kB
PG10003_counts.txt 59.2 kB
PG10004_counts.txt 64.2 kB
PG10005_counts.txt 77.8 kB
PG10006_counts.txt 27.5 kB
PG10007_counts.txt 40.0 kB
PG10008_counts.txt 74.3 kB
PG10009_counts.txt 75.3 kB
PG1000_counts.txt 121.3 kB
...

OUTPUT
|- part-r-00000
|- ...
|- part-r-0NNNN
(doc:chararray, token:chararray, count:long, idf:double, tf_idf:double)
(book1.txt, "the", 1500, 0.75, 1125.0)
(book1.txt, "of", 1000, 0.6, 600.0)
(book2.txt, "the", 1200, 0.75, 900.0)
(book2.txt, "of", 800, 0.6, 480.0)
...
8) Install and Run Hive then use Hive to create, alter, and drop databases, tables, views,
functions, and indexes.
AIM: To install and run Hive, and to use it to create, alter, and drop databases, tables, views,
functions, and indexes.
DESCRIPTION: Apache Hive is a distributed, fault-tolerant data warehouse system that enables
analytics at a massive scale. The Hive Metastore (HMS) provides a central repository of metadata
that can easily be analyzed to make informed, data-driven decisions, which makes it a critical
component of many data lake architectures. Hive is built on top of Apache Hadoop and supports
storage on S3, ADLS, GCS, and other filesystems in addition to HDFS. Hive allows users to read,
write, and manage petabytes of data using SQL.
SOFTWARE: Hive 4.0.0-alpha-2
PROCEDURE:
1. Download and install Hive: After installing Hadoop, you can download and install Hive. Hive
can be downloaded from the Apache website. Once downloaded, extract the files and move the
extracted folder to a desired location.
2. Configure Hive: Hive needs to be configured to work with Hadoop. This involves setting
environment variables, configuring Hadoop settings, and configuring Hive settings.
a) Set the HADOOP_HOME environment variable to the location where Hadoop is installed.
b) Update the PATH environment variable to include the bin directory of Hadoop and Hive.
c) Configure Hadoop settings by modifying the core-site.xml and hdfs-site.xml files located in
the Hadoop configuration directory.
d) Configure Hive settings by modifying the hive-site.xml file located in the Hive configuration
directory; a minimal example is shown after this procedure.
3. Start Hive: To start Hive, run the following command in the Hive installation directory:
bin/hive. This will start the Hive shell.
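For illustration, a minimal hive-site.xml for a single-node setup that uses the embedded Derby
metastore might contain entries such as the following (the property values are indicative and
should be adapted to your installation):

<configuration>
  <!-- HDFS directory where managed table data is stored -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <!-- JDBC connection string for the metastore database (embedded Derby here) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  </property>
</configuration>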
CODE:
-- Creating a database
CREATE DATABASE dsbd;

-- Changing the current database
USE dsbd;

-- Creating a table
CREATE TABLE sales (
  id INT,
  name STRING,
  product STRING,
  quantity INT,
  price FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- Loading data into the table
LOAD DATA LOCAL INPATH 'sales_data.csv' OVERWRITE INTO TABLE sales;

-- Adding columns to the table
ALTER TABLE sales ADD COLUMNS (
  address STRING,
  email STRING
);

-- Creating a view
CREATE VIEW sales_view AS
SELECT name, product, price * quantity AS total_price
FROM sales;

-- Querying data from the view
SELECT * FROM sales_view WHERE total_price > 100;

-- Dropping the view
DROP VIEW sales_view;

-- Creating a function: Hive user-defined functions are implemented as Java
-- classes packaged in a JAR (the class and JAR path below are placeholders)
CREATE FUNCTION example_function AS 'com.example.hive.udf.ExampleUDF'
USING JAR 'hdfs:///path/to/example_udf.jar';

-- Using the function in a query (assumes the UDF accepts a string argument)
SELECT example_function(product) FROM sales;

-- Dropping the function
DROP FUNCTION example_function;

-- Creating an index on the customer_id column of a transactions table
-- (assumes such a table exists; index DDL was removed in Hive 3.0, so these
-- statements apply to earlier Hive releases)
CREATE INDEX transactions_cust_idx
ON TABLE transactions (customer_id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- Building the index
ALTER INDEX transactions_cust_idx ON transactions REBUILD;

-- Querying data that can use the index
SELECT *
FROM transactions
WHERE customer_id = 102;

-- Dropping the index
DROP INDEX transactions_cust_idx ON transactions;

-- Dropping a table
DROP TABLE sales;

-- Dropping a database
DROP DATABASE dsbd;
RESULTS: HQL scripts to create, alter, and drop databases, tables, views, functions, and indexes
have been successfully executed.
INPUT DATASET

//id,name,product,quantity,price
1,John,Dress,2,49.99
2,Mary,Shoes,1,79.99
3,Mark,Watch,3,129.99
4,Kate,Pants,2,39.99
5,Lisa,Hat,4,19.99

OUTPUT
hive> CREATE DATABASE dsbd;
OK
Time taken: 1.224 seconds
hive> USE dsbd;
OK
Time taken: 0.017 seconds
hive> CREATE TABLE sales (id INT,name STRING,product STRING,quantity INT,price FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY
',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
OK
Time taken: 0.155 seconds
hive> LOAD DATA LOCAL INPATH 'sales_data.csv' OVERWRITE INTO TABLE sales;
OK
Time taken: 0.017 seconds
hive> SELECT * FROM sales;
OK
1 John Dress 2 49.99
2 Mary Shoes 1 79.99
3 Mark Watch 3 129.99
4 Kate Pants 2 39.99
5 Lisa Hat 4 19.99

hive> CREATE VIEW sales_view AS SELECT name, product, price * quantity AS total_price FROM sales;
OK
Time taken: 0.131 seconds
hive> SELECT * FROM sales_view WHERE total_price > 100;
OK
Mark Watch 389.97
Time taken: 0.119 seconds
hive> DROP VIEW sales_view;
OK
Time taken: 0.136 seconds

hive> DROP TABLE sales;
OK
Time taken: 0.085 seconds
hive> DROP DATABASE dsbd;
OK
Time taken: 0.145 seconds
9) Install, Deploy & configure Apache Spark Cluster. Run apache spark applications using
Scala.
AIM: To install, deploy, and configure an Apache Spark cluster and run Spark applications using Scala.
DESCRIPTION: Apache Spark is a unified analytics engine for large-scale data processing. It
provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports
general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for
SQL and structured data processing, pandas API on Spark for pandas workloads, MLlib for machine
learning, GraphX for graph processing, and Structured Streaming for incremental computation and
stream processing. It is simple, fast, scalable, and unified. Key features include batch/streaming
data (Unify the processing of your data in batches and real-time streaming, using your preferred
language: Python, SQL, Scala, Java or R), SQL analytics (Execute fast, distributed ANSI SQL
queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses), Data
science at scale (Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having
to resort to downsampling) and Machine learning (Train machine learning algorithms on a laptop
and use the same code to scale to fault-tolerant clusters of thousands of machines).
SOFTWARE: Spark 3.4.0
PROCEDURE:
1. Installing Apache Spark: The first step in setting up a Spark cluster is to install the Spark
software on each node of the cluster. You can download the latest version of Apache Spark from
the official website https://spark.apache.org/downloads.html. Once you have downloaded the
tarball file, extract it to a directory on each node of the cluster.
2. Setting up a Cluster Manager: There are several cluster managers that can be used to manage
and distribute tasks across the Spark cluster, such as Apache Mesos, Hadoop YARN, and Apache
Spark's built-in standalone manager. For the purpose of this tutorial, we will be using the
standalone manager.
3. Configuring Spark: Next, you need to configure Spark by modifying the spark-defaults.conf
file in the conf directory. You can modify settings such as memory allocation, executor cores, and
driver memory based on your system specifications and workload requirements; an illustrative
example is given after this procedure.
4. Starting Spark Master: To start the Spark master, run the following command on the node
that will serve as the master: sbin/start-master.sh
This will start the Spark master and print the URL for the web UI, which you can access in a
web browser to monitor the status of the cluster.
5. Starting Spark Workers: Next, you need to start the Spark workers on each node of the cluster
by running the following command: sbin/start-worker.sh <master-url> Replace <master-
url> with the URL of the Spark master that you obtained in step 4.
6. Running Spark Applications: To run this program, you can package the code and
dependencies into a JAR file and then submit it to the Spark cluster using the spark-submit
command, like this:
$ spark-submit --class MovieRating /path/to/MovieRating.jar
/path/to/input/file /path/to/output/dir
Replace /path/to/MovieRating.jar with the actual path to the JAR file containing the MovieRating
application code. The /path/to/input/file should be replaced with the actual path to the input text
file, and /path/to/output/dir should be replaced with the actual path to the output directory.
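For illustration, a minimal spark-defaults.conf for a small standalone cluster might contain entries
such as the following (the values are indicative and should be tuned to the available hardware;
replace <master-host> with the host name of the Spark master):

spark.master            spark://<master-host>:7077
spark.executor.memory   2g
spark.executor.cores    2
spark.driver.memory     1g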

CODE:
import org.apache.spark.{SparkConf, SparkContext}

object MovieRating {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("MovieRating")
    val sc = new SparkContext(conf)

    // Load the input data from a text file
    val input = sc.textFile(args(0))

    // Parse the input data and create a tuple (Movie ID, Rating)
    val ratings = input.map(line => {
      val fields = line.split(",")
      (fields(1), fields(2).toInt)
    })

    // Calculate the average rating for each movie
    val avgRatings = ratings.aggregateByKey((0.0, 0))(
      (acc, value) => (acc._1 + value, acc._2 + 1),
      (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
    ).mapValues(x => x._1 / x._2)

    // Save the output to a text file
    avgRatings.saveAsTextFile(args(1))

    sc.stop()
  }
}
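The aggregateByKey transformation is used instead of groupByKey so that each partition only sends
a running (sum, count) pair per movie across the network rather than every individual rating, which
keeps the shuffle small; the final mapValues step divides the accumulated sum by the count to
produce the average rating.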
RESULTS: Spark has been successfully installed and the MovieRating application has been executed on the cluster.
INPUT DATASET

//User ID,Movie ID,Rating
1,101,5
1,102,4
1,103,2
2,101,2
2,102,1
2,103,4
3,101,4
3,102,5
3,103,3

OUTPUT
(101,3.6666666666666665)
(102,3.3333333333333335)
(103,3.0)
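For example, movie 101 received the ratings 5, 2, and 4, so its average is (5 + 2 + 4) / 3 ≈ 3.67,
which matches the first output record.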
10) Perform Data analytics using Apache Spark on Amazon food dataset, find all the pairs
of items frequently reviewed together.
AIM: To perform data analytics using Spark on the Amazon food dataset and find all the pairs of
items that are frequently reviewed together.
DESCRIPTION: Data analytics using Apache Spark involves the use of Spark's powerful
processing engine to manipulate and analyze large datasets. The framework provides built-in
support for various data sources including HDFS, Apache Cassandra, Apache HBase, Amazon S3,
and many others. Spark's distributed architecture allows for parallel processing of data, making it
ideal for handling large datasets. It supports in-memory computing, which enables fast data
processing, and offers a variety of data processing and analysis libraries, such as Spark SQL, MLlib
for machine learning, GraphX for graph processing, and Streaming for real-time data processing.
Data analytics using Spark involves reading data from different sources, transforming and cleaning
the data, and applying various statistical and machine learning algorithms to generate insights
from the data. Spark also supports the use of SQL queries and other data manipulation techniques
to transform and aggregate data.
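In this practical, the reviews are grouped per user into a 'basket' of the products that user has
reviewed, and the FP-growth algorithm from Spark MLlib (org.apache.spark.ml.fpm.FPGrowth) is
used to mine itemsets of size two that occur in many baskets; these itemsets are precisely the pairs
of items frequently reviewed together.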
SOFTWARE: Spark 3.4.0
CODE:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.ml.fpm.FPGrowth

object ItemPairs {

  def main(args: Array[String]) {

    // create Spark session
    val spark = SparkSession.builder.appName("ItemPairs").getOrCreate()
    import spark.implicits._ // needed for the $"column" syntax

    // read in the Amazon food dataset (CSV with a header row)
    val data = spark.read.option("header", "true").option("inferSchema", "true").csv(args(0))

    // select the columns needed for the analysis
    val reviews = data.select("ProductId", "UserId")

    // group by UserId and collect the set of products each user reviewed
    // (a set is used because FP-growth requires the items in a transaction to be unique)
    val userReviews = reviews.groupBy("UserId").agg(collect_set("ProductId").alias("Products"))

    // run the FP-growth algorithm to find frequent itemsets
    val fpGrowth = new FPGrowth().setItemsCol("Products").setMinSupport(0.01).setMinConfidence(0.01)
    val model = fpGrowth.fit(userReviews)

    // get the frequent itemsets
    val freqItemsets = model.freqItemsets

    // keep only itemsets of size two, i.e. pairs of products
    val pairs = freqItemsets.filter(size($"items") === 2)
      .select($"items".getItem(0).alias("Item1"), $"items".getItem(1).alias("Item2"), $"freq")

    // order each pair alphabetically so that (A, B) and (B, A) are counted together
    val orderedPairs = pairs
      .withColumn("OrderedItems", sort_array(array($"Item1", $"Item2")))
      .drop("Item1", "Item2")
      .groupBy("OrderedItems")
      .agg(sum($"freq").alias("Freq"))

    // sort the pairs by frequency in descending order
    val sortedPairs = orderedPairs.orderBy($"Freq".desc)

    // show the results
    sortedPairs.show()

    // stop the Spark session
    spark.stop()
  }

}
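Besides freqItemsets, the fitted FP-growth model also exposes the association rules derived from
the frequent itemsets through model.associationRules, which can be displayed with show() if rule
confidence and lift are of interest.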
RESULTS: The Scala program to analyze the Amazon food dataset has been successfully executed.
INPUT DATASET

Data includes:
Reviews from Oct 1999 - Oct 2012
568,454 reviews
256,059 users
74,258 products
260 users with > 50 reviews

OUTPUT
+--------------------+----+
| OrderedItems|Freq|
+--------------------+----+
|[B000G6RYNE, B000...| 15|
|[B000G6RYNE, B000...| 14|
|[B000G6RYNE, B000...| 12|
|[B000G6RYNE, B001...| 11|
|[B000G6RYNE, B000...| 10|
|[B000G6RYNE, B000...| 9|
|[B000G6RYNE, B000...| 9|
|[B000G6RYNE, B000...| 9|
|[B000G6RYNE, B000...| 9|
|[B000G6RYNE, B000...| 9|
|[B000G6RYNE, B000...| 9|
|[B000G6RYNE, B000...| 9|
|[B000G6RYNE, B000...| 9|
|[B000G6RYNE, B001...| 9|
|[B000G6RYNE, B000...| 9|
|[B000G6RYNE, B000...| 9|
|[B000G6RYNE, B000...| 9|
|[B000G6RYNE, B000...| 9|
|[B000G6RYNE, B000...| 9|
...
