DataFramesCheatSheet v1.x Rev1

This cheat sheet provides an overview of data wrangling techniques using DataFrames.jl, emphasizing the importance of tidy data where each variable is in its own column and each observation is in its own row. It includes commands for creating, reshaping, sorting, filtering, and summarizing data, as well as handling missing values and combining datasets. The document is inspired by similar resources from RStudio and pandas, with examples based on the Kaggle Titanic dataset.

Data Wrangling with DataFrames.jl Cheat Sheet (for version 1.x)

Tidy Data - the foundation of data wrangling

In a tidy data set:
Each variable is saved in its own column.
Each observation is saved in its own row.

Tidy data makes data analysis easier and more intuitive. DataFrames.jl can help you tidy up your data.
Create DataFrame

DataFrame(x = [1,2,3], y = 4:6, z = 9)
Create data frame with column data from vector, range, or constant.

DataFrame([(x=1, y=2), (x=3, y=4)])
Create data frame from a vector of named tuples.

DataFrame("x" => [1,2], "y" => [3,4])
Create data frame from pairs of column name and data.

DataFrame(rand(5, 3), [:x, :y, :z])
DataFrame(rand(5, 3), :auto)
Create data frame from a matrix.

DataFrame()
Create an empty data frame without any columns.

DataFrame(x = Int[], y = Float64[])
Create an empty data frame with typed columns.

DataFrame(mytable)
Create data frame from any data source that supports Tables.jl interface.

Reshape Data - changing layout

stack(df, [:sibsp, :parch])
Stack columns data as rows with new variable and value columns.

unstack(df, :variable, :value)
Unstack rows into columns using variable and value columns.

Sort Data

sort(df, :age)
Sort by age.

sort(df, :age, rev = true)
Sort by age in reverse order.

sort(df, [:age, order(:sibsp, rev = true)])
Sort by age in ascending order and by sibsp in descending order.

Mutation: use sort!

Select Observations (rows)

Function syntax

first(df, 5) or last(df, 5)
First 5 rows or last 5 rows.

unique(df)
unique(df, [:pclass, :survived])
Return data frame with unique rows.

filter(:sex => ==("male"), df)
filter(row -> row.sex == "male", df)
Return rows having sex equals “male”.
Note: the first syntax performs better.

subset(df, :survived)
subset(df, :sex => x -> x .== "male")
Return rows for which value is true.
Note: the “survived” column is Bool type.

Indexing syntax

df[6:10, :]
Return rows 6 to 10.

df[df.sex .== "male", :]
Return rows having sex equals “male”.

df[findfirst(==(30), df.age), :]
Return first row having age equals 30.

df[findall(==(1), df.pclass), :]
Return all rows having pclass equals 1.

Mutation: use unique!, filter!, or subset!

Select Variables (columns)

Function syntax

select(df, :sex)
select(df, "sex")
select(df, [:sex, :age])
Select desired column(s).

select(df, 2:5)
Select columns by index.

select(df, r"^s")
Select columns by regex.

select(df, Not(:age))
Select all columns except the age column.

select(df, Between(:name, :age))
Select all columns between name and age columns.

Indexing syntax

df[:, [:sex, :age]]
Select a copy of columns.

df[!, [:sex, :age]]
Select original column vectors.

P.S. Indexing syntax can select observations and variables at the same time!

Mutation: use select!

View Metadata

names(df)
propertynames(df)
Column names.

nrow(df)
ncol(df)
Number of rows and columns.

columnindex(df, "sex")
Index number of a column.

Handle Missing Data

dropmissing(df)
dropmissing(df, [:age, :sex])
Return rows without any missing data.

allowmissing(df)
allowmissing(df, :sibsp)
Allow missing data in column(s).

disallowmissing(df)
disallowmissing(df, :sibsp)
Do not allow missing data in column(s).

completecases(df)
completecases(df, [:age, :sex])
Return Bool array with true entries for rows without any missing data.

Mutation: use dropmissing!, allowmissing!, or disallowmissing!

Describe DataFrame

describe(df)
Summary stats for all columns.

describe(df, :mean, :std)
Specific stats for all columns.

describe(df, extrema => :extrema)
Apply custom function to all columns.
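The row selection, column selection, missing data, and sorting calls listed on this page can be combined as in the following sketch. The four-row data frame is a made-up stand-in for the Titanic data (the names and values are assumptions for illustration only).

```julia
using DataFrames

# Hypothetical stand-in for a few Titanic rows; values are invented.
df = DataFrame(
    name = ["Ann", "Bob", "Cal", "Dee"],
    sex = ["female", "male", "male", "female"],
    age = [29, missing, 41, 35],
    survived = [true, false, false, true],
)

males = filter(:sex => ==("male"), df)      # rows where sex equals "male"
survivors = subset(df, :survived)           # rows where the Bool column is true
no_age = select(df, Not(:age))              # every column except age
complete = dropmissing(df, :age)            # drop rows whose age is missing
sorted = sort(complete, :age, rev = true)   # oldest passenger first
```

None of these calls modify df; the variants ending in ! (filter!, subset!, select!, dropmissing!, sort!) change the data frame in place instead.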

This cheat sheet is inspired by the data wrangling cheat sheets from RStudio and pandas. Examples are based on the Kaggle Titanic data set. Created by Tom Kwong, May 2021. v1.x rev1
Cumulative and Moving Stats

Cumulative Stats

select(df, :x => cumsum)
select(df, :x => cumprod)
Cumulative sum and product of column x.

select(df, :x => v -> accumulate(min, v))
select(df, :x => v -> accumulate(max, v))
Cumulative minimum/maximum of column x.

select(df, :x => v -> cumsum(v) ./ (1:length(v)))
Cumulative mean of column x.

Moving Stats (a.k.a. Rolling Stats)

select(df, :x => (v -> runmean(v, n)))
select(df, :x => (v -> runmedian(v, n)))
select(df, :x => (v -> runmin(v, n)))
select(df, :x => (v -> runmax(v, n)))
Moving mean, median, minimum, and maximum for column x with window size n.

The run* functions (and more) are available from RollingFunctions.jl package.

Ranking and Lead/Lag Functions

select(df, :x => ordinalrank) # 1234
select(df, :x => competerank) # 1224
select(df, :x => denserank) # 1223
select(df, :x => tiedrank) # 1 2.5 2.5 4
The *rank functions come from StatsBase.jl package.

select(df, :x => lead) # shift up
select(df, :x => lag) # shift down
The lead and lag functions come from ShiftedArrays.jl package.

Build Data Pipeline

@pipe df |>
    filter(:sex => ==("male"), _) |>
    groupby(_, :pclass) |>
    combine(_, :age => mean)
The @pipe macro comes from Pipe.jl package. Underscores are automatically replaced by return value from the previous operation before the |> operator.

Summarize Data

Aggregating variables

combine(df, :survived => sum)
combine(df, :survived => sum => :survived)
Apply a function to a column; optionally assign column name.

combine(df, :age => (x -> mean(skipmissing(x))))
Apply an anonymous function to a column.

combine(df, [:parch, :sibsp] .=> maximum)
Apply a function to multiple columns using broadcasting syntax.

Adding variables with aggregation results

transform(df, :fare => mean => :average_fare)
Add a new column that is populated with the aggregated value.

select(df, :name, :fare, :fare => mean => :average_fare)
Select any columns and add new ones with the aggregated value.

Adding variables by row

transform(df, [:parch, :sibsp] => ByRow(+) => :relatives)
Add new column by applying a function over existing column(s).

transform(df, :name => ByRow(x -> split(x, ",")) => [:lname, :fname])
Add new columns by applying a function that returns multiple values.

Tips: Use skipmissing function to remove missing values.

Group Data Sets

gdf = groupby(df, :pclass)
gdf = groupby(df, [:pclass, :sex])
Group data frame by one or more columns.

keys(gdf)
Get the keys for looking up SubDataFrames in the group.

gdf[(1,)]
Look up a specific group using a tuple of key values.

combine(gdf, :survived => sum)
Apply a function over a column for every group. Returns a single data frame.

combine(gdf) do sdf
    DataFrame(survived = sum(sdf.survived))
end
Apply a function to each SubDataFrame in the group and combine results.

combine(gdf, AsTable(:) => t -> sum(t.parch .+ t.sibsp))
Apply a function to each SubDataFrame in the group and combine results.

Tips: You can also use these functions to add summarized data to all rows: select, select!, transform, transform!

Combine Data Sets

innerjoin(df1, df2, on = :id)
leftjoin(df1, df2, on = :id)
rightjoin(df1, df2, on = :id)
outerjoin(df1, df2, on = :id)
semijoin(df1, df2, on = :id)
antijoin(df1, df2, on = :id)

vcat(df1, df2)
hcat(df1, df2)
Data frames can be combined vertically or horizontally.
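The grouping, aggregation, and join calls on this page fit together roughly as in the sketch below. Both data frames are invented for illustration; the cabins table and its column names are hypothetical and not part of the cheat sheet.

```julia
using DataFrames
using Statistics  # provides mean

# Hypothetical passenger data; values are invented.
df = DataFrame(
    id = 1:4,
    pclass = [1, 1, 3, 3],
    fare = [80.0, 60.0, 8.0, 7.0],
    survived = [true, false, false, true],
)

# One summary row per passenger class.
gdf = groupby(df, :pclass)
per_class = combine(gdf, :survived => sum => :n_survived,
                         :fare => mean => :average_fare)

# transform keeps every original row and repeats the group mean on each.
with_avg = transform(gdf, :fare => mean => :class_fare)

# Hypothetical lookup table sharing the :id key with df.
cabins = DataFrame(id = [1, 2, 5], cabin = ["C85", "E46", "B20"])
joined = innerjoin(df, cabins, on = :id)     # ids present in both tables
unmatched = antijoin(df, cabins, on = :id)   # rows of df with no match in cabins
```

Here combine collapses each group to a single row, while transform returns as many rows as df, which is the difference the Tips note on select and transform refers to.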
