CS525: Large-Scale Data Management
Project 5 Candidate Ideas
Project 1: Record-Level Indexing
• For each data split, create an inverted index over selected columns
  – Index(es) for each split independently
• At query time
  – A special input format (IF) will be designed
  – The IF will accept "trivial" predicates, e.g., column = constant
  – The IF will decide which inverted index to use
  – It reads only the records that match the input predicates and passes them to the map task
• Fits well in one month
• Most of the work is in the Input Format and efficient storage for the index (a RecordReader sketch follows below)
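To make the query-time behavior concrete, here is a minimal sketch of such a RecordReader, assuming the per-split index lives in a sidecar file "<file>_<splitStart>.idx" (one line per value: "value TAB off1,off2,..."), and that the predicate arrives as a job parameter "indexed.predicate" in the column=constant form. These names and conventions are illustrative assumptions, and choosing among multiple indexes by column name is elided.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.util.LineReader;

    public class IndexedRecordReader extends RecordReader<LongWritable, Text> {
      private FSDataInputStream in;
      private List<Long> offsets;            // matching offsets from the index
      private int pos = -1;
      private final LongWritable key = new LongWritable();
      private final Text value = new Text();

      @Override
      public void initialize(InputSplit split, TaskAttemptContext ctx)
          throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        FileSystem fs = fileSplit.getPath().getFileSystem(ctx.getConfiguration());
        in = fs.open(fileSplit.getPath());
        // Predicate passed as a job parameter, e.g. "name=Alice".
        String[] pred = ctx.getConfiguration().get("indexed.predicate").split("=");
        // Scan this split's index for the constant; a sorted or hashed index
        // would make this lookup faster than a linear scan.
        offsets = new ArrayList<>();
        Path idx = new Path(fileSplit.getPath() + "_" + fileSplit.getStart() + ".idx");
        try (BufferedReader br =
                 new BufferedReader(new InputStreamReader(fs.open(idx)))) {
          String line;
          while ((line = br.readLine()) != null) {
            String[] parts = line.split("\t");
            if (parts[0].equals(pred[1]))
              for (String o : parts[1].split(",")) offsets.add(Long.parseLong(o));
          }
        }
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        if (++pos >= offsets.size()) return false;
        long off = offsets.get(pos);
        in.seek(off);                        // jump straight to the record
        new LineReader(in).readLine(value);  // fresh reader: buffer is stale after seek
        key.set(off);
        return true;
      }

      @Override public LongWritable getCurrentKey() { return key; }
      @Override public Text getCurrentValue() { return value; }
      @Override public float getProgress() {
        return offsets.isEmpty() ? 1f : (pos + 1f) / offsets.size();
      }
      @Override public void close() throws IOException { in.close(); }
    }

The matching InputFormat would return this reader from createRecordReader() and fall back to an ordinary LineRecordReader when no index covers the predicate's column.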
Project 1: Specifications
• Initial Input: A dataset (say, a Customers dataset)
• Preprocessing Phase:
  – Design a tool (a map-reduce job) that reads each split (split Si) and creates a corresponding index (Di)
  – The index will be on a specific column of your choice
    • Theoretically, you can create multiple indexes, each one on a different column
  – The index can be an "inverted index", where each line holds a value V and the list of offsets of the records containing V in the corresponding split
• Query Time:
  – Given a regular job, assume that all selection predicates will be passed to you as parameters in the form: column_name = constant
  – You design a special input format that understands the predicates, decides which index to use (if possible), and opens this index to learn which records to read from the split
  – If the predicate is on a column that is not indexed, then the input format will not use the index and will scan all records normally
  – Inside the map function, the normal job executes (including the predicates again)
• Think about how to search the index fast so that the lookup itself is efficient (a sketch of the index-building job follows below)
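A minimal sketch of the preprocessing tool, assuming a map-only job in which each map task indexes its own split and writes the index next to the data. The in-memory map, the comma-separated record layout, and the "<file>_<splitStart>.idx" naming are assumptions, not part of the assignment.

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.StringJoiner;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SplitIndexer
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
      private final Map<String, StringJoiner> index = new HashMap<>();
      private int column;  // position of the indexed column (job parameter)

      @Override
      protected void setup(Context ctx) {
        column = ctx.getConfiguration().getInt("index.column", 0);
      }

      @Override
      protected void map(LongWritable offset, Text record, Context ctx) {
        // With TextInputFormat, the input key is already the record's byte
        // offset in the file, which is exactly what the index must store.
        String value = record.toString().split(",")[column];
        index.computeIfAbsent(value, v -> new StringJoiner(","))
             .add(Long.toString(offset.get()));
      }

      @Override
      protected void cleanup(Context ctx) throws IOException {
        // One index file per split ("<file>_<splitStart>.idx"); each line is
        // "value TAB off1,off2,...". A real version would also encode the
        // column name so that multiple indexes can coexist.
        FileSplit split = (FileSplit) ctx.getInputSplit();
        Path idx = new Path(split.getPath() + "_" + split.getStart() + ".idx");
        FileSystem fs = idx.getFileSystem(ctx.getConfiguration());
        try (PrintWriter out = new PrintWriter(fs.create(idx))) {
          index.forEach((v, offs) -> out.println(v + "\t" + offs));
        }
      }
    }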
Project 2: File Tagging
• Add an additional property to files
  – Tag or label
  – A file can have one or more tags
• At query time
  – The job defines a tag, and processes all files having this tag
• Fits well in one month
• Most of the work is in the Input Format and HDFS (NameNode and File objects)
Project 2: Specifications
• Phase 1: Adding Tags
  – Investigate the HDFS classes and, more specifically, the File class and its properties
  – Add a new property (Tags) to each file, probably as an array of int, so each file can have many tags
  – At upload time, the file should take an additional optional parameter indicating its tags
  – See how the file properties are stored on disk (so that when the NameNode restarts it can find this info) and do the same for the new property
• Phase 2: Query Time
  – Instead of specifying a file to read, you will specify a tag (or more) to read in the job
  – A special input format should find all files having the given tag and start reading them as inputs to the job
    • To convert from tags to files, HDFS should provide a new API (function) to do this job (a prototyping sketch follows below)
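The assignment asks you to store tags inside the NameNode's own file metadata, but if you want to prototype the behavior before touching HDFS internals, extended attributes (available since Hadoop 2.5) can stand in for the new property. The "user.tag.<name>" naming is an assumption for illustration.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TagPrototype {
      // Attach a tag at upload time; xattr names need a namespace prefix ("user.").
      static void tagFile(FileSystem fs, Path file, String tag) throws IOException {
        fs.setXAttr(file, "user.tag." + tag, new byte[0]);
      }

      // The tag-to-files API the slides ask for. Here it scans a directory,
      // while the real project would answer this inside the NameNode directly.
      static List<Path> filesWithTag(FileSystem fs, Path dir, String tag)
          throws IOException {
        List<Path> result = new ArrayList<>();
        for (FileStatus st : fs.listStatus(dir)) {
          if (st.isFile() && fs.getXAttrs(st.getPath()).containsKey("user.tag." + tag))
            result.add(st.getPath());
        }
        return result;
      }
    }

A custom FileInputFormat could then override its protected listStatus() method to return filesWithTag(...), so that the job's input is defined by a tag rather than by explicit paths.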
Project 3: Special Join in Pig
• Pig allows for hints, e.g., "replicated", to do a broadcast join (for small files)
• What if I want to join A and B, and each is already partitioned on the join key?
  – A special (map-only) join can also be used for this task
• We need to add a new hint to Pig, e.g., "partitioned"
  – Pig would then use a special input format to join the corresponding partitions (see the sketch below)
[Figure: corresponding partitions of File A and File B joined pair-wise]
• Fits well in one month
• Most of the work is in understanding Pig's compiler (and trying to mimic "replicated" joins)
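For reference, Hadoop's own CompositeInputFormat already implements a map-side join over inputs that are sorted and identically partitioned on the join key, so the new hint could compile down to a plan like the following sketch. The paths and job wiring are illustrative; verify the class against your Hadoop version.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;
    import org.apache.hadoop.mapreduce.lib.join.TupleWritable;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PartitionedJoin {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "inner" pairs records of A and B that share a key; both inputs must
        // have the same number of partitions, each sorted on the join key.
        conf.set(CompositeInputFormat.JOIN_EXPR, CompositeInputFormat.compose(
            "inner", KeyValueTextInputFormat.class,
            new Path("/data/A"), new Path("/data/B")));
        Job job = Job.getInstance(conf, "partitioned join");
        job.setJarByClass(PartitionedJoin.class);
        job.setInputFormatClass(CompositeInputFormat.class);
        // The identity mapper receives (key, TupleWritable) pairs, where the
        // tuple holds the matching values from A and B; no reduce phase needed.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(TupleWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/out/joined"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Pig's existing "merge" join hint, which handles sorted inputs, is also worth studying alongside "replicated" when deciding where the new keyword fits in the compiler.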
Project 3: Specifications
• This one is more challenging than Project 4 (see the next one); it is for those who want to learn the internals of Pig
• Step 1:
  – Learn how Pig takes a high-level language and converts it to map-reduce job(s)
    • Focus on a simple scenario (e.g., one map-reduce job to join two files)
  – Learn how Pig uses hints like the "replicated" keyword to change the implementation of a join
• Step 2:
  – Extend Pig by adding a new keyword, "partitioned", to implement another type of special join, as shown on the previous slide
  – Try to focus on the things that you will change, i.e., try to mimic to a large extent what Pig does for the "replicated" join
• Step 3:
  – Compare your new join algorithm with and without the new keyword
Project 4: Performance Comparison (Pig vs. Java)
• In this project, the internals of Pig will not change
• Find the different types of joins supported by Pig and compare them with "your own" implementation of these joins in Java (a baseline sketch follows below)
• In Java, implement one of the optimized join techniques presented in the paper "A Comparison of Join Algorithms for Log Processing in MapReduce"
• Fits well in one month
• Most of the work is in writing Java jobs and comparing the performance
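As a baseline for "your own" Java implementation, here is a minimal sketch of the paper's standard repartition join. The comma-separated record layout, the first column as join key, and file names starting with "A"/"B" to identify the relations are all assumptions.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class RepartitionJoin {
      public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable off, Text rec, Context ctx)
            throws IOException, InterruptedException {
          // Tag each record with its source relation, derived from the file name.
          String tag = ((FileSplit) ctx.getInputSplit()).getPath().getName();
          String[] cols = rec.toString().split(",");
          ctx.write(new Text(cols[0]),            // join key = first column
                    new Text(tag + "|" + rec));
        }
      }

      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> vals, Context ctx)
            throws IOException, InterruptedException {
          // The standard version caches the records of BOTH relations per key,
          // then emits their cross product.
          List<String> fromA = new ArrayList<>();
          List<String> fromB = new ArrayList<>();
          for (Text v : vals) {
            String s = v.toString();
            if (s.startsWith("A")) fromA.add(s); else fromB.add(s);
          }
          for (String a : fromA)
            for (String b : fromB)
              ctx.write(key, new Text(a + "\t" + b));
        }
      }
    }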
Project 4: Specifications
• Step 1:
  – Find out the different types of joins supported in Pig and the different scenarios in which to utilize each one
• Step 2:
  – Implement your own corresponding join jobs using Java
• Step 3:
  – Compare the performance between Pig and Java for the different join types
• Step 4:
  – Select one optimization from the paper below to implement in Java
  – The paper is: "A Comparison of Join Algorithms for Log Processing in MapReduce"
  – For example: instead of the reducers caching the records from both relations, with some optimizations the reducers can cache only the records from the smaller relation (see the sketch below)
  – For the optimization you select, discuss whether it can be done in Pig or not
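A sketch of that example optimization (the paper's improved repartition join), assuming the mappers emit a composite key "joinKey|tag" with tag 0 for the smaller relation S and tag 1 for the larger relation L, and that the join key itself contains no "|". The default sort then places S's records before L's within each join key.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ImprovedJoin {
      // Route by join key only, so both relations' records for a key meet in
      // the same reducer even though their full composite keys differ by tag.
      public static class KeyPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text val, int numPartitions) {
          String joinKey = key.toString().split("\\|")[0];
          return (joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
      }

      // Group by the join-key part only, so one reduce() call sees both tags.
      public static class GroupComparator extends WritableComparator {
        protected GroupComparator() { super(Text.class, true); }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
          return a.toString().split("\\|")[0]
              .compareTo(b.toString().split("\\|")[0]);
        }
      }

      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> vals, Context ctx)
            throws IOException, InterruptedException {
          // The sort order guarantees the smaller relation's records ("...|0")
          // arrive first: cache only those, then stream L against the cache.
          // Note that Hadoop updates `key` as the value iterator advances.
          List<String> smaller = new ArrayList<>();
          Text joinKey = new Text(key.toString().split("\\|")[0]);
          for (Text v : vals) {
            if (key.toString().endsWith("|0")) smaller.add(v.toString());
            else for (String s : smaller) ctx.write(joinKey, new Text(s + "\t" + v));
          }
        }
      }
    }

Wire it up with job.setPartitionerClass(KeyPartitioner.class) and job.setGroupingComparatorClass(GroupComparator.class); because grouping ignores the tag, the reducer sees one call per join key yet only ever buffers the smaller relation.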