CS525: Large-Scale Data Management
Project 5 Candidate Ideas
Project 1: Record-Level Indexing
• For each data split, create an inverted index over selected columns
  – Index(es) for each split independently
• At query time
  – A special input format (IF) will be designed
  – The IF will accept "trivial" predicates, e.g., column = constant
  – The IF will decide which inverted index to use
  – It reads only the records that match the input predicates and passes them to the map task
• Fits well in one month
• Most of the work is in the Input Format and efficient storage for the index (a RecordReader sketch follows below)
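To make the query-time behavior concrete, here is a minimal sketch of such a RecordReader, assuming the per-split index lives in a sidecar file "<file>_<splitStart>.idx" (one line per value: "value TAB off1,off2,..."), and that the predicate arrives as a job parameter "indexed.predicate" in the column=constant form. These names and conventions are illustrative assumptions, and choosing among multiple indexes by column name is elided.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.util.LineReader;

    public class IndexedRecordReader extends RecordReader<LongWritable, Text> {
      private FSDataInputStream in;
      private List<Long> offsets;            // matching offsets from the index
      private int pos = -1;
      private final LongWritable key = new LongWritable();
      private final Text value = new Text();

      @Override
      public void initialize(InputSplit split, TaskAttemptContext ctx)
          throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        FileSystem fs = fileSplit.getPath().getFileSystem(ctx.getConfiguration());
        in = fs.open(fileSplit.getPath());
        // Predicate passed as a job parameter, e.g. "name=Alice".
        String[] pred = ctx.getConfiguration().get("indexed.predicate").split("=");
        // Scan this split's index for the constant; a sorted or hashed index
        // would make this lookup faster than a linear scan.
        offsets = new ArrayList<>();
        Path idx = new Path(fileSplit.getPath() + "_" + fileSplit.getStart() + ".idx");
        try (BufferedReader br =
                 new BufferedReader(new InputStreamReader(fs.open(idx)))) {
          String line;
          while ((line = br.readLine()) != null) {
            String[] parts = line.split("\t");
            if (parts[0].equals(pred[1]))
              for (String o : parts[1].split(",")) offsets.add(Long.parseLong(o));
          }
        }
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        if (++pos >= offsets.size()) return false;
        long off = offsets.get(pos);
        in.seek(off);                        // jump straight to the record
        new LineReader(in).readLine(value);  // fresh reader: buffer is stale after seek
        key.set(off);
        return true;
      }

      @Override public LongWritable getCurrentKey() { return key; }
      @Override public Text getCurrentValue() { return value; }
      @Override public float getProgress() {
        return offsets.isEmpty() ? 1f : (pos + 1f) / offsets.size();
      }
      @Override public void close() throws IOException { in.close(); }
    }

The matching InputFormat would return this reader from createRecordReader() and fall back to an ordinary LineRecordReader when no index covers the predicate's column.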
Project 1: Specifications
• Initial Input: A dataset (say, a Customers dataset)
• Preprocessing Phase:
  – Design a tool (a map-reduce job) that reads each split (split Si) and creates a corresponding index (Di)
  – The index will be on a specific column of your choice
    • Theoretically, you can create multiple indexes, each one on a different column
  – The index can be an "inverted index", where each line holds a value V and the list of offsets of the records containing V in the corresponding split
• Query Time:
  – Given a regular job, assume that all selection predicates will be passed to you as parameters in the form: column_name = constant
  – You design a special input format that understands the predicates, decides which index to use (if possible), and opens this index to learn which records to read from the split
  – If the predicate is on a column that is not indexed, then the input format will not use the index and will scan all records normally
  – Inside the map function, the normal job executes (including the predicates again)
• Think about how to search the index fast so that the lookup itself is efficient (a sketch of the index-building job follows below)
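A minimal sketch of the preprocessing tool, assuming a map-only job in which each map task indexes its own split and writes the index next to the data. The in-memory map, the comma-separated record layout, and the "<file>_<splitStart>.idx" naming are assumptions, not part of the assignment.

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.StringJoiner;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SplitIndexer
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
      private final Map<String, StringJoiner> index = new HashMap<>();
      private int column;  // position of the indexed column (job parameter)

      @Override
      protected void setup(Context ctx) {
        column = ctx.getConfiguration().getInt("index.column", 0);
      }

      @Override
      protected void map(LongWritable offset, Text record, Context ctx) {
        // With TextInputFormat, the input key is already the record's byte
        // offset in the file, which is exactly what the index must store.
        String value = record.toString().split(",")[column];
        index.computeIfAbsent(value, v -> new StringJoiner(","))
             .add(Long.toString(offset.get()));
      }

      @Override
      protected void cleanup(Context ctx) throws IOException {
        // One index file per split ("<file>_<splitStart>.idx"); each line is
        // "value TAB off1,off2,...". A real version would also encode the
        // column name so that multiple indexes can coexist.
        FileSplit split = (FileSplit) ctx.getInputSplit();
        Path idx = new Path(split.getPath() + "_" + split.getStart() + ".idx");
        FileSystem fs = idx.getFileSystem(ctx.getConfiguration());
        try (PrintWriter out = new PrintWriter(fs.create(idx))) {
          index.forEach((v, offs) -> out.println(v + "\t" + offs));
        }
      }
    }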
Project 2: File Tagging
• Add an additional property to files
  – Tag or label
  – A file can have one or more tags
• At query time
  – The job defines a tag, and processes all files having this tag
• Fits well in one month
• Most of the work is in the Input Format and HDFS (NameNode and File objects)
Project 2: Specifications
• Phase 1: Adding Tags
  – Investigate the HDFS classes and, more specifically, the File class and its properties
  – Add a new property (Tags) to each file, probably as an array of int, so each file can have many tags
  – At upload time, the file should take an additional optional parameter indicating its tags
  – See how the file properties are stored on disk (so that when the NameNode restarts it can find this info) and do the same for the new property
• Phase 2: Query Time
  – Instead of specifying a file to read, you will specify a tag (or more) to read in the job
  – A special input format should find all files having the given tag and start reading them as inputs to the job
    • To convert from tags to files, HDFS should provide a new API (function) to do this job (a prototyping sketch follows below)
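The assignment asks you to store tags inside the NameNode's own file metadata, but if you want to prototype the behavior before touching HDFS internals, extended attributes (available since Hadoop 2.5) can stand in for the new property. The "user.tag.<name>" naming is an assumption for illustration.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TagPrototype {
      // Attach a tag at upload time; xattr names need a namespace prefix ("user.").
      static void tagFile(FileSystem fs, Path file, String tag) throws IOException {
        fs.setXAttr(file, "user.tag." + tag, new byte[0]);
      }

      // The tag-to-files API the slides ask for. Here it scans a directory,
      // while the real project would answer this inside the NameNode directly.
      static List<Path> filesWithTag(FileSystem fs, Path dir, String tag)
          throws IOException {
        List<Path> result = new ArrayList<>();
        for (FileStatus st : fs.listStatus(dir)) {
          if (st.isFile() && fs.getXAttrs(st.getPath()).containsKey("user.tag." + tag))
            result.add(st.getPath());
        }
        return result;
      }
    }

A custom FileInputFormat could then override its protected listStatus() method to return filesWithTag(...), so that the job's input is defined by a tag rather than by explicit paths.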
Project 3: Special Join in Pig
• Pig allows for hints, e.g., "replicated", to do a broadcast join (for small files)
• What if I want to join A and B, and each is already partitioned on the join key?
  – A special (map-only) join can also be used for this task
• We need to add a new hint to Pig, e.g., "partitioned"
  – Pig would then use a special input format to join the corresponding partitions (see the sketch below)
[Figure: corresponding partitions of File A and File B joined pair-wise]
• Fits well in one month
• Most of the work is in understanding Pig's compiler (and trying to mimic "replicated" joins)
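For reference, Hadoop's own CompositeInputFormat already implements a map-side join over inputs that are sorted and identically partitioned on the join key, so the new hint could compile down to a plan like the following sketch. The paths and job wiring are illustrative; verify the class against your Hadoop version.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;
    import org.apache.hadoop.mapreduce.lib.join.TupleWritable;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PartitionedJoin {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "inner" pairs records of A and B that share a key; both inputs must
        // have the same number of partitions, each sorted on the join key.
        conf.set(CompositeInputFormat.JOIN_EXPR, CompositeInputFormat.compose(
            "inner", KeyValueTextInputFormat.class,
            new Path("/data/A"), new Path("/data/B")));
        Job job = Job.getInstance(conf, "partitioned join");
        job.setJarByClass(PartitionedJoin.class);
        job.setInputFormatClass(CompositeInputFormat.class);
        // The identity mapper receives (key, TupleWritable) pairs, where the
        // tuple holds the matching values from A and B; no reduce phase needed.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(TupleWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/out/joined"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Pig's existing "merge" join hint, which handles sorted inputs, is also worth studying alongside "replicated" when deciding where the new keyword fits in the compiler.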
Project 3: Specifications
• This one is more challenging than Project 4 (see the next one); it is for those who want to learn the internals of Pig
• Step 1:
  – Learn how Pig takes a high-level language and converts it to map-reduce job(s)
    • Focus on a simple scenario (e.g., one map-reduce job to join two files)
  – Learn how Pig uses hints like the "replicated" keyword to change the implementation of a join
• Step 2:
  – Extend Pig by adding a new keyword, "partitioned", to implement another type of special join, as shown on the previous slide
  – Try to focus on the things that you will change, i.e., try to mimic to a large extent what Pig does for the "replicated" join
• Step 3:
  – Compare your new join algorithm with and without the new keyword
Project 4: Performance Comparison (Pig vs. Java)
• In this project, the internals of Pig will not change
• Find the different types of joins supported by Pig and compare them with "your own" implementation of these joins in Java (a baseline sketch follows below)
• In Java, implement one of the optimized join techniques presented in the paper "A Comparison of Join Algorithms for Log Processing in MapReduce"
• Fits well in one month
• Most of the work is in writing Java jobs and comparing the performance
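As a baseline for "your own" Java implementation, here is a minimal sketch of the paper's standard repartition join. The comma-separated record layout, the first column as join key, and file names starting with "A"/"B" to identify the relations are all assumptions.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class RepartitionJoin {
      public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable off, Text rec, Context ctx)
            throws IOException, InterruptedException {
          // Tag each record with its source relation, derived from the file name.
          String tag = ((FileSplit) ctx.getInputSplit()).getPath().getName();
          String[] cols = rec.toString().split(",");
          ctx.write(new Text(cols[0]),            // join key = first column
                    new Text(tag + "|" + rec));
        }
      }

      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> vals, Context ctx)
            throws IOException, InterruptedException {
          // The standard version caches the records of BOTH relations per key,
          // then emits their cross product.
          List<String> fromA = new ArrayList<>();
          List<String> fromB = new ArrayList<>();
          for (Text v : vals) {
            String s = v.toString();
            if (s.startsWith("A")) fromA.add(s); else fromB.add(s);
          }
          for (String a : fromA)
            for (String b : fromB)
              ctx.write(key, new Text(a + "\t" + b));
        }
      }
    }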
Project 4: Specifications
• Step 1:
  – Find out the different types of joins supported in Pig and the different scenarios in which to utilize each one
• Step 2:
  – Implement your own corresponding join jobs using Java
• Step 3:
  – Compare the performance between Pig and Java for the different join types
• Step 4:
  – Select one optimization from the paper below to implement in Java
  – The paper is: "A Comparison of Join Algorithms for Log Processing in MapReduce"
  – For example: instead of the reducers caching the records from both relations, with some optimizations the reducers can cache only the records from the smaller relation (see the sketch below)
  – For the optimization you select, discuss whether it can be done in Pig or not
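A sketch of that example optimization (the paper's improved repartition join), assuming the mappers emit a composite key "joinKey|tag" with tag 0 for the smaller relation S and tag 1 for the larger relation L, and that the join key itself contains no "|". The default sort then places S's records before L's within each join key.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ImprovedJoin {
      // Route by join key only, so both relations' records for a key meet in
      // the same reducer even though their full composite keys differ by tag.
      public static class KeyPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text val, int numPartitions) {
          String joinKey = key.toString().split("\\|")[0];
          return (joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
      }

      // Group by the join-key part only, so one reduce() call sees both tags.
      public static class GroupComparator extends WritableComparator {
        protected GroupComparator() { super(Text.class, true); }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
          return a.toString().split("\\|")[0]
              .compareTo(b.toString().split("\\|")[0]);
        }
      }

      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> vals, Context ctx)
            throws IOException, InterruptedException {
          // The sort order guarantees the smaller relation's records ("...|0")
          // arrive first: cache only those, then stream L against the cache.
          // Note that Hadoop updates `key` as the value iterator advances.
          List<String> smaller = new ArrayList<>();
          Text joinKey = new Text(key.toString().split("\\|")[0]);
          for (Text v : vals) {
            if (key.toString().endsWith("|0")) smaller.add(v.toString());
            else for (String s : smaller) ctx.write(joinKey, new Text(s + "\t" + v));
          }
        }
      }
    }

Wire it up with job.setPartitionerClass(KeyPartitioner.class) and job.setGroupingComparatorClass(GroupComparator.class); because grouping ignores the tag, the reducer sees one call per join key yet only ever buffers the smaller relation.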