Reg No.
B.Tech. DEGREE EXAMINATION, JULY 2024
Seventh Semester
18CSE333J - BIG DATA TOOLS AND TECHNIQUES FOR BLOCKCHAIN
(For the candidates admitted from the academic year 2021 - 2022)
Note:
(i) Part - A should be answered in OMR sheet within first 40 minutes and OMR sheet should be handed over to hall
invigilator at the end of 40th minute.
(ii) Part - B & Part - C should be answered in answer booklet.
Time: 3 hours
Max. Marks: 100
PART - A (20 x 1 = 20 Marks)
Answer ALL Questions
1. What is Big Data?
A) A small amount of structured data B) Large volumes of structured and unstructured data
C) Only structured data D) Data that fits into traditional databases
2. Hadoop was inspired by which of the following papers?
A) Oracle White Paper B) Microsoft Azure Paper
C) Google's MapReduce and Google File System D) IBM Watson White Paper
3. Which Unix tool is commonly used for searching through large text files?
A) cat B) grep
C) echo D) ls
4. Hadoop Streaming allows the use of which programming languages to write
MapReduce jobs?
A) Only Java B) Java and Python
C) Any language that can read from standard input and write to standard output D) Java and C++
5. Which of the following is a primary feature of HDFS?
A) It is designed for interactive queries. B) It is optimized for reading large files.
C) It does not replicate data. D) It is designed for low-latency access to small files.
6. Which command is used to list all files in a Hadoop directory?
A) hadoop fs -ls B) hadoop fs -mkdir
C) hadoop fs -put D) hadoop fs -rm
7. In HDFS data flow, what is the primary role of a DataNode?
A) To manage the file system namespace B) To store and retrieve blocks
C) To coordinate between clients and data storage D) To replicate data across nodes
8. What is serialization in Hadoop?
A) The process of converting a data structure into a byte stream B) The process of compressing data
C) The process of distributing data across nodes D) The process of querying data
25JF718CSE333J
9. What is the primary purpose of the Shuffle phase in MapReduce?
A) Sorting the intermediate key-value pairs B) Merging mapper outputs
C) Sending data from mappers to reducers D) Initializing job parameters
10. In MapReduce, job scheduling is primarily concerned with
A) Allocating map and reduce slots on nodes B) Defining the input-output format for the job
C) Determining the order of mapper tasks D) Allocating memory for tasks
11. Which of the following is NOT a potential cause of job failures in MapReduce?
A) Task Tracker failure B) Network congestion during Shuffle
C) Incorrect reducer logic D) Input data format errors
12. Which MapReduce format is suitable for handling large datasets where each input file is
processed independently?
A) TextInputFormat B) SequenceFileInputFormat
C) KeyValueTextInputFormat D) MultipleInputsFormat
13. Which execution mode in Apache Pig is typically used for processing large datasets on a
Hadoop cluster?
A) Local mode B) MapReduce mode
C) Tez mode D) Spark mode
14. Which of the following statements about Pig Latin is true?
A) It is a procedural language for defining data flows. B) It supports ACID transactions.
C) It uses SQL-like syntax for querying data. D) It requires compilation before execution.
15. Which component of Hive provides metadata management and storage for Hive tables?
A)Hive Shell B) Hive Metastore
C) Hive Server D) Hive CLI
16. Which statement accurately describes Hive tables compared to tables in traditional
databases?
A) Hive tables are stored in memory for faster access. B) Hive tables support ACID transactions.
C) Hive tables are schema-less and flexible compared to traditional databases. D) Hive tables cannot be queried using SQL-like syntax.
17. What is a key characteristic of supervised learning?
A) Requires labeled data for training B) Learns from rewards and punishments
C) Does not require training data D) Does not use statistical methods
18. Which of the following is NOT a type of machine learning?
A) Supervised Learning B) Unsupervised Learning
C) Reinforcement Learning D) Collaborative Filtering
19. HBase is preferred over traditional RDBMS in scenarios requiring
A) ACID transactions B) Schema flexibility and scalability
C) Simple data storage D) Structured query language (SQL) support
20. In the context of Big Data Analytics, what is a distinguishing feature of Hadoop
compared to traditional data processing systems?
A) Real-time data processing capabilities B) Centralized storage of all data
C) Ability to handle unstructured data D) Distributed processing across clusters
PART - B (5 x 4 = 20 Marks)
Answer ANY FIVE Questions
21. Consider a healthcare organization that needs to manage a variety of data types
including patient records, medical images, and social media feedback on healthcare
services. Examine how different types of digital data can be managed and utilized
effectively in such an organization.
22. Explain the evolution of Hadoop, highlighting key milestones in its development. How
did Hadoop address the limitations of traditional data processing systems?
23. A company needs to store and process several petabytes of log data generated by web
servers. Examine how HDFS can handle this requirement and mention its key
features that support large-scale data storage and processing.
24. Defend the choice of Avro for handling schema evolution in Hadoop, and explain how it
facilitates data serialization and deserialization in Hadoop.
25. Compare and contrast different input formats available in MapReduce, highlighting their
features and use cases.
26. Explain how to use Apache Pig to analyze a large dataset containing logs from multiple
servers in a distributed computing environment effectively.
27. Compare HBase with traditional relational databases (RDBMS) like MySQL in terms of
architecture, use cases, and scalability. Discuss when each database system is more
suitable for Big Data analytics applications.
PART - C (5 x 12 = 60 Marks)
Answer ALL the Questions
28 a. Imagine Dharanee is a data scientist at an e-commerce company that needs to analyze
customer reviews written in various programming languages like Python and Ruby.
Demonstrate how Hadoop Streaming can be used to process these reviews, and describe
a complete workflow from data ingestion to result generation. Include the benefits of
using Hadoop Streaming in this scenario.
(OR)
b. Discuss the characteristics and challenges of Big Data. How does Hadoop address these
challenges to provide a robust solution for Big Data processing? Provide examples to
support your answers.
29 a. Discuss the different commands provided by the Hadoop File System (HDFS) command
line interface (CLI) for file and directory management. Include examples for operations
like creating files, deleting files, and modifying file permissions.
(OR)
b. Compare and contrast the functionalities of Flume and Sqoop for data ingestion into
Hadoop. Provide specific scenarios where each tool would be most appropriate to use.
30 a. Imagine Lakshman is leading a team tasked with developing a large-scale data
processing application for analysing petabytes of data collected from sensors in a
distributed computing environment. He has the option to choose between
implementing the application using the MapReduce programming model or traditional
parallel computing models like MPI (Message Passing Interface) or HPC (High
Performance Computing). Compare the advantages and disadvantages of selecting
MapReduce over traditional parallel computing models.
(OR)
b. Discuss the anatomy of a MapReduce job execution, detailing each phase and its
significance in processing large-scale data.
31 a. Discuss the anatomy of Apache Pig and its execution modes. How do these modes
impact data processing in different environments?
(OR)
b. An organization is planning to migrate its existing data warehouse to a Hadoop-based
solution using Apache Hive. Explain how it would design and implement Hive tables,
manage metadata, and optimize queries for efficient data retrieval and analysis.
32 a. Discuss the evolution of machine learning and its applications in data analytics using R.
Explain the concepts of supervised and unsupervised learning, providing examples of
each. Compare their strengths, weaknesses, and real-world applications.
(OR)
b. Imagine a team tasked with implementing a Big Data analytics solution using HBase for
a social media platform. Discuss the architecture design, data modeling approach, and
integration strategies with existing systems. Compare the advantages of using HBase
over traditional RDBMS for handling large-scale social media data. Support the
discussion with relevant examples and considerations.