
PySpark DataFrame

A DataFrame in PySpark is a distributed table-like structure optimized for handling big data across multiple machines, similar to an Excel sheet or SQL table. It can be created from various sources, including Python lists, JSON files, and RDDs, and supports SQL queries for structured data handling. Key features include lazy evaluation, immutability, and integration with various data sources.


DATAFRAME Simplified
series, part 13
@mrk_talkstech

1

What is a DataFrame?
In PySpark, a DataFrame is a
distributed table-like structure with
rows and columns (like an Excel
sheet or SQL table).

2

Each column has a name and a
data type. It's optimized to handle
big data across multiple machines.

PySpark DataFrame = Big Data table

3

Creating a DataFrame

The common ways to create a
DataFrame:
1. From a Python list or tuple
2. From a file (CSV, JSON)
3. From an RDD (Resilient
Distributed Dataset)

4

From a Python List or Tuple

5

From a file (CSV or JSON)

6

From an RDD

Method 1: Convert the RDD directly to a
DataFrame (no schema / column names)

7

Method 2: Convert the RDD to a DataFrame
with column names

8

Method 3: Convert the RDD to a DataFrame
with an explicit schema

9

Why is a DataFrame important?
Structured Data Handling → Easy
to work with rows & columns
SQL Support → Run queries
directly like SQL (SELECT,
WHERE, GROUP BY)
Big Data Ready → Handles
terabytes of data across clusters

10

Optimized Performance → Uses
Spark's Catalyst Optimizer +
Tungsten Engine
Integration → Works with
different sources (CSV, JSON,
databases, Parquet, etc.)
Easier than RDDs → Provides
high-level APIs instead of
low-level transformations

11

Key Features
Structured (rows + columns)

Distributed (works on a cluster)

Supports SQL

12

Lazy Evaluation (runs only on
actions)

Immutable (no direct changes)

Optimized (Catalyst + Tungsten)


Hands-on with PySpark DataFrame
methods - don't miss it!
Follow the series!
