1
u Assignments > Activity #19 - Evaluating Weather Data with Spark
u Grad Projects due Tuesday, 11/15
u Team Project Presentations due Thursday, 11/17
u Final Exam – Thursday, 12/8 from 7-9:30pm in GOL 1550
2
Scoring levels: EXEMPLARY (Level 4, 76-100%), ACCOMPLISHED (Level 3, 51-75%), DEVELOPING (Level 2, 26-50%), BEGINNING (Level 1, 0-25%)

TEAM PROJECT PROPOSAL (5 points)
  Level 4: Proposal clearly states project intent, including the AWS technologies that will be used.
  Level 3: Proposal stated project intent, but had some minor issues with clarity or goals.
  Level 2: Proposal stated the project but was unclear on technologies, goals, and intent.
  Level 1: Proposal was unclear and was missing key points on the overall goal and intent.

TEAM CHECK-INS (15 points) – Final check-in #3 next week (11/14-11/18)
  Level 4: Team checked in 3 times and clearly demonstrated their ongoing progress expected per check-in.
  Level 3: Team checked in 2-3 times and progress might not have been clear each time.
  Level 2: Team checked in 2 times and progress was not clear.
  Level 1: Team checked in once (or not at all) and/or was unprepared to explain progress.

WHITEPAPER (15 points) – Due 12/2
  Level 4: Whitepaper was 5 pages or more in length and clearly articulated the overall problem, projected cost savings, benefits of technologies, and other valuable information.
  Level 3: Whitepaper was minimally 5 pages but had some minor issues around clarifying the overall problem, projected cost savings, or benefits of technologies.
  Level 2: Whitepaper was less than 5 pages and had major issues around clarifying the overall problem, projected cost savings, and the benefits of technologies.
  Level 1: Whitepaper lacked any useful information, was hastily or poorly written, and was missing many key details that were expected.

FINAL PRESENTATION (15 points) – Due 11/17; presentations 11/17 & 11/22
  Level 4: Presentation clearly described the problem and costs, and the demo was executed with no noted issues.
  Level 3: Presentation clearly described the problem with some minor issues, or the demo was executed with some minor problems.
  Level 2: Presentation had major issues with clarity, or the demo had problems running properly.
  Level 1: Presentation had major issues and the demo was either not working or not presented.

USE OF AWS TECHNOLOGIES (10 points)
  Level 4: 5 or more AWS technologies were used for the project.
  Level 3: At least 4 AWS technologies were used for the project.
  Level 2: At least 3 AWS technologies were used for the project.
  Level 1: 2 or fewer AWS technologies were used for the project.

PROJECT SETUP (15 points) – Due 12/2
  Level 4: Project setup was clear and straightforward to set up and run. Team used an automated process, including Infrastructure as Code tools (AWS SDK/CLI, CloudFormation, Terraform, or CDK), for the setup and teardown process.
  Level 3: Able to set up the project with minimal issues, or instructions had minor problems around clarity; was still able to run despite these issues.
  Level 2: Able to set up to some capacity, but had major problems or gaps in the instructions to get up and running.
  Level 1: Instructions were unclear and/or unable to get the project fully running.

OVERALL EFFORT, DESIGN & EXECUTION (25 points) – Due 12/2
  Level 4: Based on the size of the team, the overall complexity and effort of the project exceeded expectations. Team clearly met all goals and objectives of the project.
  Level 3: Based on the size of the team, the overall complexity of the project met what was expected. Team met most of the goals of the project with some minor issues with deliverables.
  Level 2: Based on the size of the team, the overall complexity and effort could have been higher. Team met some of the goals of the project but deliverables were lacking in some key areas.
  Level 1: Project lacked any complexity (too simple), or the effort was far below what the entire team was capable of. Team did not meet the goals of the project, or there were major issues with the overall design and execution.

TOTAL POINTS: 100
7
u Up to this point, you’ve had a crash course on the Hadoop ecosystem and some key tools that you can use to tackle Big Data problems
8
u Using AWS EMR, you can create Hadoop clusters that process data of
literally any size
u Some of the key benefits to note:
u Ease of use – Launch a fully running cluster in minutes (a minimal launch sketch follows this list)
u Low cost – A 10-node cluster can run for as little as $0.15/hour
u Reliable – EMR is tuned for the cloud, requiring less
time tuning and monitoring
u Security – Several security features to keep data safe
u Flexible – Full control over the cluster
u This has led more organizations to leverage the
cloud for Big Data Analytics by creating “Data
Lakes”
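u To make “launch a cluster in minutes” concrete, here is a minimal sketch of launching a small EMR cluster with Spark using boto3; the cluster name, release label, instance types, and count are illustrative assumptions, not course-provided values

import boto3

# Minimal, illustrative EMR launch (all names and sizes below are assumptions)
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="demo-spark-cluster",                # hypothetical cluster name
    ReleaseLabel="emr-6.9.0",                 # example EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                   # 1 master + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for interactive use
    },
    JobFlowRole="EMR_EC2_DefaultRole",        # assumes the default EMR roles already exist
    ServiceRole="EMR_DefaultRole",
)
print("Cluster id:", response["JobFlowId"])

u When finished, terminate the cluster (e.g. emr.terminate_job_flows(JobFlowIds=[...])) so it does not keep accruing charges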
Big Data on the Cloud: Data Lakes &
Analytics
SWEN 514/614: Engineering Cloud Software Systems
Department of Software Engineering
Rochester Institute of Technology
Big Data Analytics 10
u Big Data Analytics is the complex process of examining large and
varied data sets, or big data, to uncover information such as hidden
patterns, unknown correlations, market trends and customer
preferences that can help organizations make informed business
decisions
Source: https://datafloq.com/read/big-data-analytics-paving-path-businesses-decision/6110
Big Data Analytics – Why Important? 11
u Cost reduction
u Big data technologies bring significant cost advantages when it comes to
storing large amounts of data – plus they can identify more efficient ways
of doing business
u Faster, better decision making
u With the speed of Hadoop and in-memory analytics, combined with the ability to analyze new sources of data, businesses can analyze information immediately and make decisions based on what they’ve learned
u New products and services
u With the ability to gauge customer needs and
satisfaction through analytics comes the
power to give customers what they want
Source: https://www.sas.com/en_us/insights/analytics/big-data-analytics.html
Data Lake - Defined 12
u A Data Lake allows you to store all your
structured and unstructured data, in
one centralized repository, and at any
scale
u It is usually a single store of all enterprise
data including raw copies of source
system data and transformed data
used for tasks such as reporting,
visualization, advanced analytics and
machine learning
u With a Data Lake, you store your data as-is, without having to first structure it based on the questions you might ask in the future
Data Lake – How do they work? 13
Source: https://40uu5c99f3a2ja7s7miveqgqu-wpengine.netdna-ssl.com/wp-content/uploads/2017/02/Understanding-data-lakes-EMC.pdf
Data Lake - Challenges 14
u A main challenge with a Data Lake architecture is that raw data is stored with no oversight of the contents (i.e. it’s a dumping ground for anything and everything)
u For a Data Lake to make data usable, it needs clearly defined mechanisms for cataloging, curating, and securing data
u Without these elements, data cannot be found or trusted, resulting in a “Data Swamp"
Data Lake vs. Data Warehouse 15
u Data Lakes and Data Warehouses are both widely used for storing big data, but
they are not interchangeable terms
u A Data Warehouse is a repository for structured, filtered data that has already
been processed for a specific purpose
u A Data Lake is a vast pool of raw data, the purpose for which is not yet defined
Source: https://www.talend.com/resources/data-lake-vs-data-warehouse/
Data Lake → Hadoop 16
u Hadoop provided the foundation for Data Lake architectures as it provides a
cheap way for organizations to store all their data on commodity hardware
u The term Data Lake is often associated with Hadoop-oriented storage (HDFS)
u In such a scenario, an organization's data is first loaded into the Hadoop platform,
and then business analytics and data mining tools are applied to the data where it
resides on Hadoop's cluster nodes of commodity computers
Data Lake Example… 17
u Business Units (Legal, IP&S, F&R, etc.) all had their own content they managed
u The “Knowledge Graph” would connect all this content in ways that were previously not possible
[Diagram: business-unit content stores (IP&S, Legal, Corp, F&R, TRTA) connected through shared Compute to the Knowledge Graph]
Data Lake Example… 18
u PermID.org
u Built on-premises as Project BOLD (Big Open Linked Data)
u Technologies
u Spark, MapReduce, HBase, Kafka, Oozie and Cassandra (Knowledge Graph)
Data Lakes in the cloud on AWS 19
u With a Data Lake built on Amazon S3,
you get extreme durability
(99.999999999%)
u There are several native AWS services for running Big Data analytics, artificial intelligence, and machine learning to gain insights from your unstructured data sets
u You can automatically scale up storage
and processing capacity, without
lengthy resource procurement cycles
u Can you identify another advantage
with S3?
Data Lakes in S3 = Content Sharing 20
u With the right credentials, you can see the contents of another company’s S3 bucket(s) right from your own personal AWS account
u No physical copying is required, which is a major shift in mindset from how companies do things in the on-premises world
u Where we used to focus our time on isolating data with physical infrastructure, cloud computing shifts our attention to isolating data using security policies (a minimal bucket-policy sketch follows)
More on this
later…
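u As a rough sketch of that policy-based sharing (the bucket name and partner account ID below are placeholders, not course values), a bucket policy can grant another account read access without copying any data

import json
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and partner account ID, used only for illustration
BUCKET = "example-data-lake-bucket"
PARTNER_ACCOUNT = "111122223333"

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowPartnerReadOnly",
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{PARTNER_ACCOUNT}:root"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{BUCKET}",
                     f"arn:aws:s3:::{BUCKET}/*"],
    }],
}

# Attach the policy; access is granted in place, no data is copied
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))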
Who uses Data Lakes on AWS? 21
u A few examples…
Source: https://tinyurl.com/y4sdpzfe
Data Visualization 22
u Data Visualization is the presentation of data in a pictorial or graphical format
u It enables decision makers to see analytics presented visually, so they can
grasp difficult concepts or identify new patterns
u With interactive visualization, you can take the concept a step further by using
technology to drill down into charts and graphs for more detail, interactively
changing what data you see and how it’s processed
Source: https://www.sas.com/en_us/insights/big-data/data-visualization.html
Data Visualization on AWS 23
u Amazon QuickSight
u A fully managed service that lets you create and publish interactive dashboards that include ML Insights
u Dashboards can then be accessed from any device, and embedded
into your applications, portals, and websites
u Amazon EMR Notebooks
u A managed environment based on Jupyter Notebooks that allows
data scientists, analysts, and developers to prepare and visualize
data, collaborate with peers, build applications, and perform
interactive analysis using EMR clusters
u They are pre-configured for Spark to interactively run jobs on EMR
clusters in languages such as PySpark, Spark SQL, Spark R, and Scala
u First, we need to understand a little bit about Spark SQL…
Spark SQL 24
u Spark SQL is Spark’s interface for working with structured data
u Those familiar with RDBMS can easily relate to the syntax of Spark SQL
u It was developed to address some of the shortcomings of Hive (e.g.
performance of MapReduce)
Spark SQL - Features 25
u Spark SQL queries are integrated with Spark programs. It allows us to
query structured data inside Spark programs, using SQL or a DataFrame
API which can be used in Java, Scala, Python and R
u Both DataFrames and SQL support a common way to access a variety
of data sources, like Hive, Parquet, JSON, and JDBC
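u A brief illustration of that uniform access (paths and connection settings below are made-up placeholders): the same reader interface loads Parquet, JSON, or a JDBC table, and the result can be queried with either DataFrame operations or SQL

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uniform-sources").getOrCreate()

# The same reader interface works across formats (all locations are placeholders)
parquet_df = spark.read.parquet("s3://example-bucket/events/")
json_df = spark.read.json("s3://example-bucket/raw/people.json")
# Reading over JDBC also requires the matching driver jar on the classpath
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://example-host:5432/sales")
           .option("dbtable", "orders")
           .option("user", "reader")
           .option("password", "secret")
           .load())

# Either API can then be used: DataFrame operations or SQL over a temp view
parquet_df.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) AS n FROM events").show()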
Spark SQL - Features 26
u Spark SQL incorporates a cost-based optimizer, code generation, and columnar storage to keep queries fast while scaling to thousands of nodes on the Spark engine
u The interfaces provided by Spark
SQL provide Spark with more
information about the structure of
both the data and the
computation being performed
u Spark SQL uses this extra
information to perform extra
optimization
u As a result, Spark SQL can execute queries up to 100x faster than Hadoop MapReduce
Source: https://www.edureka.co/blog/spark-sql-tutorial/
Spark SQL - DataFrames 27
u A Spark DataFrame is a distributed collection of data organized into
named columns that provides operations to filter, group, or compute
aggregates, and can be used with Spark SQL
u DataFrames can be constructed from structured data files like CSV, existing RDDs, tables in Hive, or external databases (see the sketch after the diagram below)
[Diagram: RDDs, CSV data, JSON data, and Parquet data feeding a DataFrame of named columns (Column 1-3) and rows (Row 1-3)]
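u A small sketch of two of those construction paths (the CSV path and sample data are invented for illustration)

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("dataframe-construction").getOrCreate()

# From a structured file: CSV with a header row (path is a placeholder)
csv_df = spark.read.csv("s3://example-bucket/people.csv", header=True, inferSchema=True)

# From an existing RDD of Row objects
rdd = spark.sparkContext.parallelize([
    Row(name="Michael", age=None),
    Row(name="Andy", age=30),
    Row(name="Justin", age=19),
])
rdd_df = spark.createDataFrame(rdd)
rdd_df.printSchema()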
Spark SQL (DataFrame) – Simple Example 28
Input – examples/people.json:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

# Create a DataFrame based on the content of the JSON file
df = spark.read.json("examples/people.json")

# Display the content of the DataFrame
df.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+

# Print the schema in a tree format
df.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)
Spark SQL (DataFrame) – Simple Example 29
# Select only the "name" column
df.select("name").show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+

# Select everybody, but increment the "age" by 1
df.select(df['name'], df['age'] + 1).show()
# +-------+---------+
# |   name|(age + 1)|
# +-------+---------+
# |Michael|     null|
# |   Andy|       31|
# | Justin|       20|
# +-------+---------+

# Select people older than 21
df.filter(df['age'] > 21).show()
# +---+----+
# |age|name|
# +---+----+
# | 30|Andy|
# +---+----+
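u The DataFrame API also handles the grouping and aggregation mentioned on the previous slide; a short sketch on the same example data (the output shown is what would be expected, and row order may vary)

# Count people by age (grouping + aggregation)
df.groupBy("age").count().show()
# +----+-----+
# | age|count|
# +----+-----+
# |  19|    1|
# |null|    1|
# |  30|    1|
# +----+-----+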
Spark SQL (Native SQL) – Simple Example 30
u Running native SQL queries is also possible

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

# Use the typical SQL query syntax (SELECT…WHERE, etc.)
sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+
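u The same temporary view also supports filtering with WHERE, mirroring the DataFrame filter() shown earlier (expected output for the example data)

spark.sql("SELECT name, age FROM people WHERE age > 21").show()
# +----+---+
# |name|age|
# +----+---+
# |Andy| 30|
# +----+---+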
EMR Notebook – Activity (Due Next Class) 31
u For the upcoming holiday season, you’ve been asked to evaluate Amazon’s
ratings for some of their products
u A "Data Lake" has been created in an S3 bucket and your assignment is to
create some visualizations to better understand the data
u Activity is located in Assignments > Activity #20 - Evaluating Amazon Data with
EMR Notebooks
u Note: You need to do the Setup EMR Notebook activity first
u Opportunity to earn another bonus point
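u As a rough starting point only (the bucket, path, and column names below are invented placeholders, not the actual activity data), a notebook cell for this kind of task typically reads the S3 data into a DataFrame and plots an aggregate

# PySpark cell in an EMR Notebook (all names below are placeholders)
df = spark.read.parquet("s3://example-reviews-bucket/ratings/")

# Average rating per product category
avg_by_category = (df.groupBy("category")
                     .avg("rating")
                     .withColumnRenamed("avg(rating)", "avg_rating"))

# Bring the small aggregate back to the driver and plot with pandas
pdf = avg_by_category.toPandas()
pdf.plot.bar(x="category", y="avg_rating")
# (how the plot is rendered depends on the notebook kernel configuration)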
Tying it all together on AWS 32
[Architecture diagram: Data Lake feeding Analytics, Runtime Queries, a Data Warehouse, and Visualization]
A Data Lake solution on Azure 33
[Diagram: Azure Data Lake architecture feeding a Data Warehouse]
Source: https://www.microsoft.com/en-us/insidetrack/azure-data-lake-connects-supply-chain-data-for-advanced-analytics
A Data Lake solution on Google Cloud 34
u Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes
u Dataplex is an intelligent data fabric that unifies data across data lakes, data warehouses, and data marts
u BigQuery is a fully managed, serverless data warehouse that enables scalable analysis over petabytes of data
Source: https://www.hcltech.com/blogs/dataplex-unified-data-fabric-google-cloud-platform
Big Data and Data Science 35
u Data Science and cloud computing
essentially go hand in hand
u A Data Scientist typically analyzes different
types of data that are stored in the Cloud
u Their work typically involves making sense of
messy, unstructured data, from sources such
as smart devices, social media feeds, and
emails that don’t neatly fit into a database
u They analyze, process, and model data then
interpret the results to create actionable plans
for companies and other organizations
u With the increase in Big Data, organizations
are increasingly storing large sets of data
online and there is an increasing need for
Data Scientists
Source: https://www.whizlabs.com/blog/data-science-vs-big-data-vs-data-analytics/
If you want to learn more about Hadoop but 36
not pay for AWS resources…
u You can easily do this on your computer…for free!
u Download VirtualBox (https://www.virtualbox.org/)
u Download the Hortonworks Sandbox, which is a fully functioning, ready-to-go Hadoop cluster
u All the activities for the Big Data
exercises were originally developed
in this environment before running
on AWS EMR
37
u Going back to your scenarios one last time, your team is tasked to analyze
opportunities for Big Data and Analytics
u Questions to answer:
u What type of analytics can be created and what data would be needed? Is a Data Lake
required? What about a Data Warehouse?
u Based on the data you are working with, what type of Big Data technologies would you recommend to process the data?
u How often would you process this data?
u What is the value of the analytics you would be providing (e.g. increased sales)?
u Submit completed template to Assignments > Activity #19 - Big Data Analytics
Recommendation
u Only 1 submission per team
u The scenarios can be found in Assignment > Activity #7 > Cost Estimating Scenarios