1
u Assignments > Activity #19 - Evaluating Weather Data with Spark
u Grad Projects due Tuesday, 11/15
u Team Project Presentations due Thursday, 11/17
u Final Exam – Thursday, 12/8 from 7-9:30pm in GOL 1550
2
Scoring levels: EXEMPLARY (Level 4, 76-100%), ACCOMPLISHED (Level 3, 51-75%), DEVELOPING (Level 2, 26-50%), BEGINNING (Level 1, 0-25%)

TEAM PROJECT PROPOSAL (5 points)
  Level 4: Proposal clearly states project intent, including the AWS technologies that will be used.
  Level 3: Proposal stated project intent, but had some minor issues with clarity or goals.
  Level 2: Proposal stated the project but was unclear on technologies, goals, and intent.
  Level 1: Proposal was unclear and was missing key points on the overall goal and intent.

TEAM CHECK-INS (15 points) – Final check-in #3 next week (11/14-11/18)
  Level 4: Team checked in 3 times and clearly demonstrated their ongoing progress expected per check-in.
  Level 3: Team checked in 2-3 times and progress might not have been clear each time.
  Level 2: Team checked in 2 times and progress was not clear.
  Level 1: Team checked in once (or not at all) and/or was unprepared to explain progress.

WHITEPAPER (15 points) – Due 12/2
  Level 4: Whitepaper was 5 pages or more in length and clearly articulated the overall problem, projected cost savings, benefits of technologies, and other valuable information.
  Level 3: Whitepaper was minimally 5 pages but had some minor issues around clarifying the overall problem, projected cost savings, or benefits of technologies.
  Level 2: Whitepaper was less than 5 pages and had major issues around clarifying the overall problem, projected cost savings, and the benefits of technologies.
  Level 1: Whitepaper lacked any useful information, was hastily or poorly written, and was missing many key details that were expected.

FINAL PRESENTATION (15 points) – Due 11/17; presentations 11/17 & 11/22
  Level 4: Presentation clearly described the problem and costs, and the demo was executed with no noted issues.
  Level 3: Presentation clearly described the problem with some minor issues, or the demo was executed with some minor problems.
  Level 2: Presentation had major issues with clarity, or the demo had problems running properly.
  Level 1: Presentation had major issues and the demo was either not working or not presented.

USE OF AWS TECHNOLOGIES (10 points)
  Level 4: 5 or more AWS technologies were used for the project.
  Level 3: At least 4 AWS technologies were used for the project.
  Level 2: At least 3 AWS technologies were used for the project.
  Level 1: 2 or fewer AWS technologies were used for the project.

PROJECT SETUP (15 points) – Due 12/2
  Level 4: Project setup was clear and straightforward to set up and run. Team used an automated process, including Infrastructure as Code tools (AWS SDK/CLI, CloudFormation, Terraform, or CDK), for the setup and teardown process.
  Level 3: Able to set up the project with minimal issues, or instructions had minor problems around clarity; was still able to run despite these issues.
  Level 2: Able to set up to some capacity, but had major problems or gaps in the instructions to get up and running.
  Level 1: Instructions were unclear and/or unable to get the project fully running.

OVERALL EFFORT, DESIGN & EXECUTION (25 points) – Due 12/2
  Level 4: Based on the size of the team, the overall complexity and effort of the project exceeded expectations. Team clearly met all goals and objectives of the project.
  Level 3: Based on the size of the team, the overall complexity of the project met what was expected. Team met most of the goals of the project with some minor issues with deliverables.
  Level 2: Based on the size of the team, the overall complexity and effort could have been higher. Team met some of the goals of the project but deliverables were lacking in some key areas.
  Level 1: Project lacked any complexity (too simple), or the effort was far below what the entire team was capable of. Team did not meet the goals of the project, or there were major issues with the overall design and execution.

TOTAL POINTS: 100
7
u Up to this point, you’ve had a crash course on the Hadoop ecosystem and some key tools that you can use to tackle Big Data problems
8
u Using AWS EMR, you can create Hadoop clusters that process data of
literally any size
u Some of the key benefits to note:
u Ease of use – Launch a fully running cluster in minutes (a minimal launch sketch follows this list)
u Low cost – A 10-node cluster can run for as little as $0.15/hour
u Reliable – EMR is tuned for the cloud, requiring less
time tuning and monitoring
u Security – Several security features to keep data safe
u Flexible – Full control over the cluster
u This has led more organizations to leverage the
cloud for Big Data Analytics by creating “Data
Lakes”
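u To make “launch a cluster in minutes” concrete, here is a minimal sketch of launching a small EMR cluster with Spark using boto3; the cluster name, release label, instance types, and count are illustrative assumptions, not course-provided values

import boto3

# Minimal, illustrative EMR launch (all names and sizes below are assumptions)
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="demo-spark-cluster",                # hypothetical cluster name
    ReleaseLabel="emr-6.9.0",                 # example EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                   # 1 master + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for interactive use
    },
    JobFlowRole="EMR_EC2_DefaultRole",        # assumes the default EMR roles already exist
    ServiceRole="EMR_DefaultRole",
)
print("Cluster id:", response["JobFlowId"])

u When finished, terminate the cluster (e.g. emr.terminate_job_flows(JobFlowIds=[...])) so it does not keep accruing charges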
Big Data on the Cloud: Data Lakes &
Analytics
SWEN 514/614: Engineering Cloud Software Systems
Department of Software Engineering
Rochester Institute of Technology
Big Data Analytics 10
u Big Data Analytics is the complex process of examining large and
varied data sets, or big data, to uncover information such as hidden
patterns, unknown correlations, market trends and customer
preferences that can help organizations make informed business
decisions
Source: https://datafloq.com/read/big-data-analytics-paving-path-businesses-decision/6110
Big Data Analytics – Why Important? 11
u Cost reduction
u Big data technologies bring significant cost advantages when it comes to
storing large amounts of data – plus they can identify more efficient ways
of doing business
u Faster, better decision making
u With the speed of Hadoop and in-memory analytics, combined with the ability to analyze new sources of data, businesses can analyze information immediately and make decisions based on what they’ve learned
u New products and services
u With the ability to gauge customer needs and
satisfaction through analytics comes the
power to give customers what they want
Source: https://www.sas.com/en_us/insights/analytics/big-data-analytics.html
Data Lake - Defined 12
u A Data Lake allows you to store all your
structured and unstructured data, in
one centralized repository, and at any
scale
u It is usually a single store of all enterprise
data including raw copies of source
system data and transformed data
used for tasks such as reporting,
visualization, advanced analytics and
machine learning
u With a Data Lake, you store your data as-is, without having to first structure it based on the questions you might ask in the future
Data Lake – How do they work? 13
Source: https://40uu5c99f3a2ja7s7miveqgqu-wpengine.netdna-ssl.com/wp-content/uploads/2017/02/Understanding-data-lakes-EMC.pdf
Data Lake - Challenges 14
u A main challenge with a Data Lake architecture is that raw data is stored with no oversight of the contents (i.e. it’s a dumping ground for anything and everything)
u For a Data Lake to make data usable, it needs clearly defined mechanisms for cataloging, curating, and securing data
u Without these elements, data cannot be found or trusted, resulting in a “Data Swamp"
Data Lake vs. Data Warehouse 15
u Data Lakes and Data Warehouses are both widely used for storing big data, but
they are not interchangeable terms
u A Data Warehouse is a repository for structured, filtered data that has already
been processed for a specific purpose
u A Data Lake is a vast pool of raw data, the purpose for which is not yet defined
Source: https://www.talend.com/resources/data-lake-vs-data-warehouse/
Data Lake → Hadoop 16
u Hadoop provided the foundation for Data Lake architectures as it provides a
cheap way for organizations to store all their data on commodity hardware
u The term Data Lake is often associated with Hadoop-oriented storage (HDFS)
u In such a scenario, an organization's data is first loaded into the Hadoop platform,
and then business analytics and data mining tools are applied to the data where it
resides on Hadoop's cluster nodes of commodity computers
Data Lake Example… 17
u Business Units (Legal, IP&S, F&R, etc.) all had their own content they managed
u The “Knowledge Graph” would connect all this content in ways that were previously not possible
[Diagram: business-unit content stores (IP&S, Legal, Corp, F&R, TRTA) connected through shared Compute to the Knowledge Graph]
Data Lake Example… 18
u PermID.org
u Built on-premises as Project BOLD (Big Open Linked Data)
u Technologies
u Spark, MapReduce, HBase, Kafka, Oozie and Cassandra (Knowledge Graph)
Data Lakes in the cloud on AWS 19
u With a Data Lake built on Amazon S3,
you get extreme durability
(99.999999999%)
u There are several native AWS services for running Big Data analytics, artificial intelligence, and machine learning to gain insights from your unstructured data sets
u You can automatically scale up storage
and processing capacity, without
lengthy resource procurement cycles
u Can you identify another advantage
with S3?
Data Lakes in S3 = Content Sharing 20
u With the right credentials, you can see the contents of another company’s S3 bucket(s) right from your own personal AWS account
u No physical copying is required, which is a major shift in mindset from how companies do things in the on-premises world
u Where we used to focus our time on isolating data with physical infrastructure, cloud computing shifts our attention to isolating data using security policies (a minimal bucket-policy sketch follows)
More on this
later…
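u As a rough sketch of that policy-based sharing (the bucket name and partner account ID below are placeholders, not course values), a bucket policy can grant another account read access without copying any data

import json
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and partner account ID, used only for illustration
BUCKET = "example-data-lake-bucket"
PARTNER_ACCOUNT = "111122223333"

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowPartnerReadOnly",
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{PARTNER_ACCOUNT}:root"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{BUCKET}",
                     f"arn:aws:s3:::{BUCKET}/*"],
    }],
}

# Attach the policy; access is granted in place, no data is copied
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))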
Who uses Data Lakes on AWS? 21
u A few examples…
Source: https://tinyurl.com/y4sdpzfe
Data Visualization 22
u Data Visualization is the presentation of data in a pictorial or graphical format
u It enables decision makers to see analytics presented visually, so they can
grasp difficult concepts or identify new patterns
u With interactive visualization, you can take the concept a step further by using
technology to drill down into charts and graphs for more detail, interactively
changing what data you see and how it’s processed
Source: https://www.sas.com/en_us/insights/big-data/data-visualization.html
Data Visualization on AWS 23
u Amazon QuickSight
u A fully managed service that lets you create and publish interactive dashboards that include ML Insights
u Dashboards can then be accessed from any device, and embedded
into your applications, portals, and websites
u Amazon EMR Notebooks
u A managed environment based on Jupyter Notebooks that allows
data scientists, analysts, and developers to prepare and visualize
data, collaborate with peers, build applications, and perform
interactive analysis using EMR clusters
u They are pre-configured for Spark to interactively run jobs on EMR
clusters in languages such as PySpark, Spark SQL, Spark R, and Scala
u First, we need to understand a little bit about Spark SQL…
Spark SQL 24
u Spark SQL is Spark’s interface for working with structured data
u Those familiar with RDBMS can easily relate to the syntax of Spark SQL
u It was developed to address some of the shortcomings of Hive (e.g.
performance of MapReduce)
Spark SQL - Features 25
u Spark SQL queries are integrated with Spark programs. It allows us to
query structured data inside Spark programs, using SQL or a DataFrame
API which can be used in Java, Scala, Python and R
u Both DataFrames and SQL support a common way to access a variety
of data sources, like Hive, Parquet, JSON, and JDBC
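u A brief illustration of that uniform access (paths and connection settings below are made-up placeholders): the same reader interface loads Parquet, JSON, or a JDBC table, and the result can be queried with either DataFrame operations or SQL

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uniform-sources").getOrCreate()

# The same reader interface works across formats (all locations are placeholders)
parquet_df = spark.read.parquet("s3://example-bucket/events/")
json_df = spark.read.json("s3://example-bucket/raw/people.json")
# Reading over JDBC also requires the matching driver jar on the classpath
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://example-host:5432/sales")
           .option("dbtable", "orders")
           .option("user", "reader")
           .option("password", "secret")
           .load())

# Either API can then be used: DataFrame operations or SQL over a temp view
parquet_df.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) AS n FROM events").show()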
Spark SQL - Features 26
u Spark SQL incorporates a cost-based optimizer, code generation, and columnar storage to keep queries fast while scaling to thousands of nodes on the Spark engine
u The interfaces provided by Spark
SQL provide Spark with more
information about the structure of
both the data and the
computation being performed
u Spark SQL uses this extra
information to perform extra
optimization
u As a result, Spark SQL can execute queries up to 100x faster than Hadoop MapReduce
Source: https://www.edureka.co/blog/spark-sql-tutorial/
Spark SQL - DataFrames 27
u A Spark DataFrame is a distributed collection of data organized into
named columns that provides operations to filter, group, or compute
aggregates, and can be used with Spark SQL
u DataFrames can be constructed from structured data files like CSV, existing RDDs, tables in Hive, or external databases (see the sketch after the diagram below)
[Diagram: RDDs, CSV data, JSON data, and Parquet data feeding a DataFrame of named columns (Column 1-3) and rows (Row 1-3)]
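u A small sketch of two of those construction paths (the CSV path and sample data are invented for illustration)

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("dataframe-construction").getOrCreate()

# From a structured file: CSV with a header row (path is a placeholder)
csv_df = spark.read.csv("s3://example-bucket/people.csv", header=True, inferSchema=True)

# From an existing RDD of Row objects
rdd = spark.sparkContext.parallelize([
    Row(name="Michael", age=None),
    Row(name="Andy", age=30),
    Row(name="Justin", age=19),
])
rdd_df = spark.createDataFrame(rdd)
rdd_df.printSchema()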
Spark SQL (DataFrame) – Simple Example 28
Input – examples/people.json:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

# Create a DataFrame based on the content of the JSON file
df = spark.read.json("examples/people.json")

# Display the content of the DataFrame
df.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+

# Print the schema in a tree format
df.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)
Spark SQL (DataFrame) – Simple Example 29
# Select only the "name" column
df.select("name").show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+

# Select everybody, but increment the "age" by 1
df.select(df['name'], df['age'] + 1).show()
# +-------+---------+
# |   name|(age + 1)|
# +-------+---------+
# |Michael|     null|
# |   Andy|       31|
# | Justin|       20|
# +-------+---------+

# Select people older than 21
df.filter(df['age'] > 21).show()
# +---+----+
# |age|name|
# +---+----+
# | 30|Andy|
# +---+----+
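u The DataFrame API also handles the grouping and aggregation mentioned on the previous slide; a short sketch on the same example data (the output shown is what would be expected, and row order may vary)

# Count people by age (grouping + aggregation)
df.groupBy("age").count().show()
# +----+-----+
# | age|count|
# +----+-----+
# |  19|    1|
# |null|    1|
# |  30|    1|
# +----+-----+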
Spark SQL (Native SQL) – Simple Example 30
u Running native SQL queries is also possible

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

# Use the typical SQL query syntax (SELECT…WHERE, etc.)
sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+
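u The same temporary view also supports filtering with WHERE, mirroring the DataFrame filter() shown earlier (expected output for the example data)

spark.sql("SELECT name, age FROM people WHERE age > 21").show()
# +----+---+
# |name|age|
# +----+---+
# |Andy| 30|
# +----+---+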
EMR Notebook – Activity (Due Next Class) 31
u For the upcoming holiday season, you’ve been asked to evaluate Amazon’s
ratings for some of their products
u A "Data Lake" has been created in an S3 bucket and your assignment is to
create some visualizations to better understand the data
u Activity is located in Assignments > Activity #20 - Evaluating Amazon Data with
EMR Notebooks
u Note: You need to do the Setup EMR Notebook activity first
u Opportunity to earn another bonus point
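u As a rough starting point only (the bucket, path, and column names below are invented placeholders, not the actual activity data), a notebook cell for this kind of task typically reads the S3 data into a DataFrame and plots an aggregate

# PySpark cell in an EMR Notebook (all names below are placeholders)
df = spark.read.parquet("s3://example-reviews-bucket/ratings/")

# Average rating per product category
avg_by_category = (df.groupBy("category")
                     .avg("rating")
                     .withColumnRenamed("avg(rating)", "avg_rating"))

# Bring the small aggregate back to the driver and plot with pandas
pdf = avg_by_category.toPandas()
pdf.plot.bar(x="category", y="avg_rating")
# (how the plot is rendered depends on the notebook kernel configuration)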
Tying it all together on AWS 32
[Architecture diagram: Data Lake feeding Analytics, Runtime Queries, a Data Warehouse, and Visualization]
A Data Lake solution on Azure 33
[Diagram: Azure Data Lake architecture feeding a Data Warehouse]
Source: https://www.microsoft.com/en-us/insidetrack/azure-data-lake-connects-supply-chain-data-for-advanced-analytics
A Data Lake solution on Google Cloud 34
u Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes
u Dataplex is an intelligent data fabric that unifies data across data lakes, data warehouses, and data marts
u BigQuery is a fully managed, serverless data warehouse that enables scalable analysis over petabytes of data
Source: https://www.hcltech.com/blogs/dataplex-unified-data-fabric-google-cloud-platform
Big Data and Data Science 35
u Data Science and cloud computing
essentially go hand in hand
u A Data Scientist typically analyzes different
types of data that are stored in the Cloud
u Their work typically involves making sense of
messy, unstructured data, from sources such
as smart devices, social media feeds, and
emails that don’t neatly fit into a database
u They analyze, process, and model data then
interpret the results to create actionable plans
for companies and other organizations
u With the increase in Big Data, organizations
are increasingly storing large sets of data
online and there is an increasing need for
Data Scientists
Source: https://www.whizlabs.com/blog/data-science-vs-big-data-vs-data-analytics/
If you want to learn more about Hadoop but 36
not pay for AWS resources…
u You can easily do this on your computer…for free!
u Download VirtualBox (https://www.virtualbox.org/)
u Download the Hortonworks Sandbox, which is a fully functioning, ready-to-go Hadoop cluster
u All the activities for the Big Data
exercises were originally developed
in this environment before running
on AWS EMR
37
u Going back to your scenarios one last time, your team is tasked to analyze
opportunities for Big Data and Analytics
u Questions to answer:
u What type of analytics can be created and what data would be needed? Is a Data Lake
required? What about a Data Warehouse?
u Based on the data you are working with, what type of Big Data technologies would you recommend to process the data?
u How often would you process this data?
u What is the value of the analytics you would be providing (e.g. increased sales)?
u Submit completed template to Assignments > Activity #19 - Big Data Analytics
Recommendation
u Only 1 submission per team
u The scenarios can be found in Assignment > Activity #7 > Cost Estimating Scenarios