Description
Some keywords: GROUPING SETS, ROLLUP, CUBE, GROUPING
Some references: postgres, Oracle, SQL Server, groupings combined with arbitrary functions
Grouping sets and related operations are useful for pre-calculating several aggregation levels at once, which is often needed. The current API for this in data.table is not very friendly; see "Aggregating sub totals and grand totals with data.table".
In the case of rollup, aggregates are computed for each prefix of the provided `by` columns, from the most detailed level down to the grand total. See the description from the PostgreSQL manual and the example code below.
ROLLUP ( e1, e2, e3, ... )
is equivalent to:
GROUPING SETS (
( e1, e2, e3, ... ),
...
( e1, e2 ),
( e1 ),
( )
)
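To make the equivalence concrete, here is a minimal base R sketch of the semantics, not of any data.table internals. The helpers `rollup_sets` and `aggregate_set` are hypothetical names introduced only for illustration: the first expands a vector of columns into the prefixes listed above, the second aggregates one grouping set and fills the rolled-up columns with NA.

```r
# Hypothetical helper: expand ROLLUP(by) into its list of grouping sets,
# from the full column vector down to the empty set (grand total).
rollup_sets <- function(by) {
  lapply(length(by):0, function(i) by[seq_len(i)])
}

# Hypothetical helper: mean of `value` for one grouping set. Columns of `by`
# that are not in `set` are filled with NA to mark them as aggregated away.
aggregate_set <- function(set, df, value, by) {
  if (length(set) == 0) {
    out <- data.frame(agg = mean(df[[value]]))
  } else {
    out <- aggregate(df[value], by = df[set], FUN = mean)
    names(out)[names(out) == value] <- "agg"
  }
  for (col in setdiff(by, set)) out[[col]] <- NA
  out[c(by, "agg")]
}

by <- c("vs", "am")
sets <- rollup_sets(by)   # (vs, am), (vs), ()
res <- do.call(rbind, lapply(sets, aggregate_set, df = mtcars, value = "mpg", by = by))
print(res)
```

Each grouping set costs a separate pass over the data in this sketch, which is the part a C-level implementation could presumably avoid.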
I wonder whether there could be a cheap speed-up for this process, since it is potentially a heavy computing task. It would be great to have the computation of grouping sets implemented in C, so that rollup, cube and other features could be built on top of grouping sets more easily in R while still running at full speed.
Answers to update when closed:
library(plyr)
grp.cols <- c("vs", "am", "gear", "carb", "cyl")
# plyr: one ddply per prefix of grp.cols, stacked with NA fill for missing columns
plyr.r = do.call(
  rbind.fill,
  lapply(seq_along(grp.cols), function(x) ddply(mtcars, grp.cols[1:x], summarize, agg=mean(mpg)))
)
library(data.table) # 1.9.7+
# data.table: same aggregation levels, plus a grand total as the last row
dt.r = rollup(as.data.table(mtcars), j = .(agg=mean(mpg)), by=grp.cols)
all.equal(
  as.data.table(plyr.r),
  dt.r[-.N], # exclude grand total, not present in BrodieG answer
  ignore.row.order = TRUE,
  ignore.col.order = TRUE
)
#[1] TRUE
# install.packages("data.table", type = "source", repos = "https://Rdatatable.github.io/data.table")
- https://stackoverflow.com/questions/9315258/aggregating-sub-totals-and-grand-totals-with-data-table/
library(data.table)
set.seed(1)
DT = data.table(
  group=sample(letters[1:2],100,replace=TRUE),
  year=sample(2010:2012,100,replace=TRUE),
  v=runif(100))
cube(DT, mean(v), by=c("group","year"))
#    group year        V1
# 1:     a 2011 0.4176346
# 2:     b 2010 0.5231845
# 3:     b 2012 0.4306871
# 4:     b 2011 0.4997119
# 5:     a 2012 0.4227796
# 6:     a 2010 0.2926945
# 7:    NA 2011 0.4463616
# 8:    NA 2010 0.4278093
# 9:    NA 2012 0.4271160
#10:     a   NA 0.3901875
#11:     b   NA 0.4835788
#12:    NA   NA 0.4350153
cube(DT, mean(v), by=c("group","year"), id=TRUE)
#    grouping group year        V1
# 1:        0     a 2011 0.4176346
# 2:        0     b 2010 0.5231845
# 3:        0     b 2012 0.4306871
# 4:        0     b 2011 0.4997119
# 5:        0     a 2012 0.4227796
# 6:        0     a 2010 0.2926945
# 7:        2    NA 2011 0.4463616
# 8:        2    NA 2010 0.4278093
# 9:        2    NA 2012 0.4271160
#10:        1     a   NA 0.3901875
#11:        1     b   NA 0.4835788
#12:        3    NA   NA 0.4350153
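The `grouping` values in the output above look like a bitmask over the `by` columns, with the first column as the most significant bit and a set bit meaning "this column was aggregated away". That reading is inferred from the printed result only. A minimal base R sketch under that assumption, using a hypothetical helper `grouping_id`, enumerates the four grouping sets that cube expands to for two columns and recomputes their ids:

```r
by <- c("group", "year")

# Hypothetical helper: bitmask of the `by` columns missing from `set`,
# with the first `by` column as the most significant bit (assumption).
grouping_id <- function(set, by) {
  weights <- 2^(rev(seq_along(by)) - 1)   # c(2, 1) for two columns
  sum(weights[!(by %in% set)])
}

# Enumerate every subset of `by`: for each mask, keep the columns
# whose bit is not set.
cube_sets <- lapply(0:(2^length(by) - 1), function(id) {
  weights <- 2^(rev(seq_along(by)) - 1)
  by[(id %/% weights) %% 2 == 0]
})

ids <- vapply(cube_sets, grouping_id, numeric(1), by = by)
for (i in seq_along(cube_sets)) {
  cat(ids[i], ":", paste(cube_sets[[i]], collapse = ", "), "\n")
}
```

With an id column of this kind, rows where a group value is genuinely NA in the data can be told apart from subtotal rows, which the plain NA markers alone do not allow.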
# install.packages("data.table", type = "source", repos = "https://Rdatatable.github.io/data.table")
Some other questions can get new answers also:
- https://stackoverflow.com/questions/20918619/nested-table-within-column-sub-group-totals-frequencies-and-percentages-using
- https://stackoverflow.com/questions/10956300/grouping-and-sorting-in-r/10956655#10956655
- https://stackoverflow.com/questions/5982546/r-calculating-column-sums-row-sums-as-an-aggregation-from-a-dataframe
- https://stackoverflow.com/questions/14242409/use-plyr-to-compute-margins
- https://stackoverflow.com/questions/2566766/margin-totals-in-xtabs
- https://stackoverflow.com/questions/12445574/subtotals-in-columns-using-reshape2
- http://stackoverflow.com/questions/36169073/how-to-do-group-by-rollup-in-r-like-sql