datatable.optimize {data.table} | R Documentation |
data.table
internally optimises certain expressions inorder to improve
performance. This section briefly summarises those optimisations.
Note that there's no additional input needed from the user to take advantage of these optimisations. They happen automatically.
Run the code under the example section to get a feel for the performance benefits from these optimisations.
data.table
reads the global option datatable.optimize
to figure
out what level of optimisation is required. The default value Inf
activates all available optimisations.
At optimisation level >= 1
, i.e., getOption("datatable.optimize")
>= 1, these are the optimisations:
The base function order
is internally replaced with
data.table
's fast ordering. That is, dt[order(...)]
gets internally optimised to dt[forder(...)]
.
The expression dt[, lapply(.SD, fun), by=.]
gets optimised
to dt[, list(fun(a), fun(b), ...), by=.]
where a,b, ...
are
columns in .SD
. This improves performance tremendously.
Similarly, the expression dt[, c(.N, lapply(.SD, fun)), by=.]
gets optimised to dt[, list(.N, fun(a), fun(b), ...)]
. .N
is
just for example here.
base::mean
function is internally optimised to use
data.table
's fastmean
function. mean()
from base
is an S3 generic and gets slow with many groups.
At optimisation level >= 2
, i.e., getOption("datatable.optimize")
>= 2, additional optimisations are implemented on top of the optimisations already shown above.
When expressions in j
which contains only these functions
min, max, mean, median, var, sd, prod
, for example,
dt[, list(mean(x), median(x), min(y), max(y)), by=z]
, they are very
effectively optimised using, what we call, GForce. These functions
are replaced with gmean, gmedian, gmin, gmax
instead.
Normally, once the rows belonging to each groups are identified, the values
corresponding to the group is gathered and the j-expression
is
evaluated. This can be improved by computing the result directly without
having to gather the values or evaluating the expression for each group
(which can get costly with large number of groups) by implementing it
specifically for a particular function. As a result, it is extremely fast.
In addition to all the functions above, '.N' is also optimised to use GForce. It when used separately or combined with the functions mentioned above still uses GForce.
Expressions of the form DT[i, j, by]
are also optimised when
i
is a subset operation and j
is any/all of the functions
discussed above.
Auto indexing: data.table
also allows for blazing fast subsets by
creating an index on the first run. Any successive subsets on the same
column then reuses this index to binary search (instead of
vector scan) and is therefore fast.
At the moment, expressions of the form dt[col == val]
and
dt[col %in% val]
are both optimised. We plan to expand this to more
operators and conditions in the future.
Auto indexing can be switched off with the global option
options(datatable.auto.index = FALSE)
. To switch off using existing
indices set global option options(datatable.use.index = FALSE)
.
setNumericRounding
, getNumericRounding
## Not run: # Generate a big data.table with a relatively many columns set.seed(1L) dt = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE)) setDT(dt)[, id := sample(1e5, 5e6, TRUE)] print(object.size(dt), units="Mb") # 400MB, not huge, but will do # 'order' optimisation options(datatable.optimize = 1L) # optimisation 'on' system.time(ans1 <- dt[order(id)]) options(datatable.optimize = 0L) # optimisation 'off' system.time(ans2 <- dt[order(id)]) identical(ans1, ans2) # optimisation of 'lapply(.SD, fun)' options(datatable.optimize = 1L) # optimisation 'on' system.time(ans1 <- dt[, lapply(.SD, min), by=id]) options(datatable.optimize = 0L) # optimisation 'off' system.time(ans2 <- dt[, lapply(.SD, min), by=id]) identical(ans1, ans2) # optimisation of 'mean' options(datatable.optimize = 1L) # optimisation 'on' system.time(ans1 <- dt[, lapply(.SD, mean), by=id]) system.time(ans2 <- dt[, lapply(.SD, base::mean), by=id]) identical(ans1, ans2) # optimisation of 'c(.N, lapply(.SD, ))' options(datatable.optimize = 1L) # optimisation 'on' system.time(ans1 <- dt[, c(.N, lapply(.SD, min)), by=id]) options(datatable.optimize = 0L) # optimisation 'off' system.time(ans2 <- dt[, c(N=.N, lapply(.SD, min)), by=id]) identical(ans1, ans2) # GForce options(datatable.optimize = 2L) # optimisation 'on' system.time(ans1 <- dt[, lapply(.SD, median), by=id]) system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(stats::median(x))), by=id]) identical(ans1, ans2) # restore optimization options(datatable.optimize = Inf) # auto indexing options(datatable.auto.index = FALSE) system.time(ans1 <- dt[id == 100L]) # vector scan system.time(ans2 <- dt[id == 100L]) # vector scan system.time(dt[id options(datatable.auto.index = TRUE) system.time(ans1 <- dt[id == 100L]) # index + binary search subset system.time(ans2 <- dt[id == 100L]) # only binary search subset system.time(dt[id ## End(Not run)