Data Wrangling with DataFrames.jl Cheat Sheet (for version 1.x)

Tidy Data - the foundation of data wrangling

In a tidy data set, each variable is saved in its own column and each observation is saved in its own row.

Tidy data makes data analysis easier and more intuitive. DataFrames.jl can help you tidy up your data.
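As a small illustration of the tidy layout, the sketch below reshapes a wide table into one row per observation. The table, its column names, and the values are hypothetical and are not part of the Titanic data used elsewhere in this sheet.

```julia
using DataFrames

# Hypothetical wide table: the "year" variable is spread across column names.
wide = DataFrame(city = ["Oslo", "Lima"], y2020 = [10, 20], y2021 = [11, 21])

# stack moves the year columns into rows, giving one row per (city, year) observation.
# variable_name and value_name rename the default :variable and :value columns.
tidy = stack(wide, [:y2020, :y2021]; variable_name = :year, value_name = :visitors)
```

After the call, each variable (city, year, visitors) has its own column and each observation has its own row.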
Create DataFrame

DataFrame(x = [1,2,3], y = 4:6, z = 9)
Create data frame with column data from vector, range, or constant.

DataFrame([(x=1, y=2), (x=3, y=4)])
Create data frame from a vector of named tuples.

DataFrame("x" => [1,2], "y" => [3,4])
Create data frame from pairs of column name and data.

DataFrame(rand(5, 3), [:x, :y, :z])
DataFrame(rand(5, 3), :auto)
Create data frame from a matrix.

DataFrame()
Create an empty data frame without any columns.

DataFrame(x = Int[], y = Float64[])
Create an empty data frame with typed columns.

DataFrame(mytable)
Create data frame from any data source that supports the Tables.jl interface.

Describe DataFrame

describe(df)
Summary stats for all columns.

describe(df, :mean, :std)
Specific stats for all columns.

describe(df, extrema => :extrema)
Apply custom function to all columns.

Reshape Data - changing layout

stack(df, [:sibsp, :parch])
Stack column data as rows with new variable and value columns.

unstack(df, :variable, :value)
Unstack rows into columns using variable and value columns.

Select Observations (rows)

Function syntax

first(df, 5) or last(df, 5)
First 5 rows or last 5 rows.

unique(df)
unique(df, [:pclass, :survived])
Return data frame with unique rows.

filter(:sex => ==("male"), df)
filter(row -> row.sex == "male", df)
Return rows where sex equals “male”.
Note: the first syntax performs better.

subset(df, :survived)
subset(df, :sex => x -> x .== "male")
Return rows for which value is true.
Note: the “survived” column is Bool type.

Indexing syntax

df[6:10, :]
Return rows 6 to 10.

df[df.sex .== "male", :]
Return rows where sex equals “male”.

df[findfirst(==(30), df.age), :]
Return first row where age equals 30.

df[findall(==(1), df.pclass), :]
Return all rows where pclass equals 1.

Mutation: use unique!, filter!, or subset!

Select Variables (columns)

Function syntax

select(df, :sex)
select(df, "sex")
select(df, [:sex, :age])
Select desired column(s).

select(df, 2:5)
Select columns by index.

select(df, r"^s")
Select columns by regex.

select(df, Not(:age))
Select all columns except the age column.

select(df, Between(:name, :age))
Select all columns between name and age columns.

Indexing syntax

df[:, [:sex, :age]]
Select a copy of columns.

df[!, [:sex, :age]]
Select original column vectors.

P.S. Indexing syntax can select observations and variables at the same time!

Mutation: use select!

Sort Data

sort(df, :age)
Sort by age.

sort(df, :age, rev = true)
Sort by age in reverse order.

sort(df, [:age, order(:sibsp, rev = true)])
Sort by ascending age, then by descending sibsp.

Mutation: use sort!

View Metadata

names(df)
propertynames(df)
Column names.

nrow(df)
ncol(df)
Number of rows and columns.

columnindex(df, "sex")
Index number of a column.

Handle Missing Data

dropmissing(df)
dropmissing(df, [:age, :sex])
Return rows without any missing data.

allowmissing(df)
allowmissing(df, :sibsp)
Allow missing data in column(s).

disallowmissing(df)
disallowmissing(df, :sibsp)
Do not allow missing data in column(s).

completecases(df)
completecases(df, [:age, :sex])
Return Bool array with true entries for rows without any missing data.

Mutation: use dropmissing!, allowmissing!, or disallowmissing!
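The calls above can be tried together on a small table. The sketch below is one possible walkthrough, assuming a hypothetical four-row data frame whose column names mimic the Titanic examples; the values are made up for illustration.

```julia
using DataFrames

# Hypothetical toy data; column names follow the examples in this sheet.
df = DataFrame(
    name = ["Ann", "Bob", "Cid", "Dee"],
    sex = ["female", "male", "male", "female"],
    age = [29.0, missing, 41.0, 35.0],
    pclass = [1, 3, 3, 2],
)

# Select observations: function syntax and indexing syntax pick the same rows.
males = filter(:sex => ==("male"), df)
males_idx = df[df.sex .== "male", :]

# Select variables: drop one column, then sort by class and descending name.
no_age = select(df, Not(:age))
by_class = sort(no_age, [:pclass, order(:name, rev = true)])

# Handle missing data: flag complete rows, then drop rows with a missing age.
complete = completecases(df)
clean = dropmissing(df, :age)
```

Because the age column contains a missing value, comparing the two results with == would yield missing; isequal(males, males_idx) is the safer check here.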
This cheat sheet is inspired by the data wrangling cheat sheets from RStudio and pandas. Examples are based on the Kaggle Titanic data set. Created by Tom Kwong, May 2021. v1.x rev1 Page 1 / 2
Cumulative and Moving Stats

Cumulative Stats

select(df, :x => cumsum)
select(df, :x => cumprod)
Cumulative sum and product of column x.

select(df, :x => v -> accumulate(min, v))
select(df, :x => v -> accumulate(max, v))
Cumulative minimum/maximum of column x.

select(df, :x => v -> cumsum(v) ./ (1:length(v)))
Cumulative mean of column x.

Moving Stats (a.k.a. Rolling Stats)

select(df, :x => (v -> runmean(v, n)))
select(df, :x => (v -> runmedian(v, n)))
select(df, :x => (v -> runmin(v, n)))
select(df, :x => (v -> runmax(v, n)))
Moving mean, median, minimum, and maximum for column x with window size n.

The run* functions (and more) are available from the RollingFunctions.jl package.

Ranking and Lead/Lag Functions

select(df, :x => ordinalrank) # 1 2 3 4
select(df, :x => competerank) # 1 2 2 4
select(df, :x => denserank) # 1 2 2 3
select(df, :x => tiedrank) # 1 2.5 2.5 4
The *rank functions come from the StatsBase.jl package.

select(df, :x => lead) # shift up
select(df, :x => lag) # shift down
The lead and lag functions come from the ShiftedArrays.jl package.

Build Data Pipeline

@pipe df |>
filter(:sex => ==("male"), _) |>
groupby(_, :pclass) |>
combine(_, :age => mean)

The @pipe macro comes from the Pipe.jl package. Each underscore is automatically replaced by the return value of the operation before the |> operator.

Summarize Data

Aggregating variables

combine(df, :survived => sum)
combine(df, :survived => sum => :survived)
Apply a function to a column; optionally assign column name.

combine(df, :age => (x -> mean(skipmissing(x))))
Apply an anonymous function to a column.

combine(df, [:parch, :sibsp] .=> maximum)
Apply a function to multiple columns using broadcasting syntax.

Adding variables with aggregation results

transform(df, :fare => mean => :average_fare)
Add a new column that is populated with the aggregated value.

select(df, :name, :fare, :fare => mean => :average_fare)
Select any columns and add new ones with the aggregated value.

Adding variables by row

transform(df, [:parch, :sibsp] => ByRow(+) => :relatives)
Add new column by applying a function over existing column(s).

transform(df, :name => ByRow(x -> split(x, ",")) => [:lname, :fname])
Add new columns by applying a function that returns multiple values.

Tips: Use the skipmissing function to remove missing values.

Group Data Sets

gdf = groupby(df, :pclass)
gdf = groupby(df, [:pclass, :sex])
Group data frame by one or more columns.

keys(gdf)
Get the keys for looking up SubDataFrames in the group.

gdf[(1,)]
Look up a specific group using a tuple of key values.

combine(gdf, :survived => sum)
Apply a function over a column for every group. Returns a single data frame.

combine(gdf) do sdf
DataFrame(survived = sum(sdf.survived))
end
Apply a function to each SubDataFrame in the group and combine results.

combine(gdf, AsTable(:) => t -> sum(t.parch .+ t.sibsp))
Pass the columns of each SubDataFrame to a function as a named tuple and combine results.

Tips: You can also use these functions to add summarized data to all rows: select, select!, transform, transform!

Combine Data Sets

innerjoin(df1, df2, on = :id)
leftjoin(df1, df2, on = :id)
rightjoin(df1, df2, on = :id)
outerjoin(df1, df2, on = :id)
semijoin(df1, df2, on = :id)
antijoin(df1, df2, on = :id)

vcat(df1, df2)
hcat(df1, df2)
Data frames can be combined vertically or horizontally.
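To see how grouping, summarizing, and joining fit together, here is a minimal sketch. Both tables and the ports lookup are hypothetical, and mean is assumed to come from the Statistics standard library.

```julia
using DataFrames
using Statistics  # standard library; provides mean

# Hypothetical toy tables; column names follow the examples in this sheet.
df = DataFrame(
    id = 1:6,
    pclass = [1, 1, 2, 3, 3, 3],
    survived = [true, true, true, false, false, true],
    age = [38.0, 54.0, 30.0, 22.0, missing, 27.0],
)
ports = DataFrame(id = [1, 2, 3, 7], port = ["S", "C", "Q", "S"])

# Group, then aggregate each group; skipmissing drops the missing age.
gdf = groupby(df, :pclass)
stats = combine(gdf,
    :survived => sum => :n_survived,
    :age => (x -> mean(skipmissing(x))) => :mean_age)

# transform on a grouped data frame keeps every row and repeats the group value.
with_total = transform(gdf, :survived => sum => :class_survivors)

# Joins: inner keeps only matching ids; left keeps every row of df.
inner = innerjoin(df, ports, on = :id)
left = leftjoin(df, ports, on = :id)
```

Rows of df without a match in ports get a missing port in the left join, which is why the port column of left allows missing values.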