data.table-package {data.table} | R Documentation |
data.table
inherits from data.frame
. It offers fast and nemory efficient: file reader and writer, aggregations, updates, equi, non-equi, rolling, range and interval joins, in a short and flexible syntax, for faster development.
It is inspired by A[B]
syntax in R where A
is a matrix and B
is a 2-column matrix. Since a data.table
is a data.frame
, it is compatible with R functions and packages that accept only data.frame
s.
Type vignette(package="data.table")
to get started. The Introduction to data.table vignette introduces data.table
's x[i, j, by]
syntax and is a good place to start. If you have read the vignettes and the help page below, please feel free to ask questions on Stack Overflow data.table tag or on datatable-help mailing list. To report a bug please type: bug.report(package = "data.table")
.
Please check the homepage for up to the minute live NEWS.
Tip: one of the quickest ways to learn the features is to type example(data.table)
and study the output at the prompt.
data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFactors=FALSE) ## S3 method for class 'data.table' x[i, j, by, keyby, with = TRUE, nomatch = getOption("datatable.nomatch"), # default: NA_integer_ mult = "all", roll = FALSE, rollends = if (roll=="nearest") c(TRUE,TRUE) else if (roll>=0) c(FALSE,TRUE) else c(TRUE,FALSE), which = FALSE, .SDcols, verbose = getOption("datatable.verbose"), # default: FALSE allow.cartesian = getOption("datatable.allow.cartesian"), # default: FALSE drop = NULL, on = NULL]
... |
Just as |
keep.rownames |
If |
check.names |
Just as |
key |
Character vector of one or more column names which is passed to |
stringsAsFactors |
Logical (default is |
x |
A |
i |
Integer, logical or character vector, single column numeric
If
Using When the binary operator Support for non-equi join was recently implemented, which allows for other binary operators See Keys and fast binary search based subset and Secondary indices and auto indexing. Advanced: When |
j |
When The expression '.()' is a shorthand alias to When Advanced: Advanced: When Advanced: Columns of See Introduction to data.table vignette and examples. |
by |
Column names are seen as if they are variables (as in
Advanced: When Advanced: In the |
keyby |
Same as |
with |
By default When |
nomatch |
Same as |
mult |
When |
roll |
When
Rolling joins apply to the last join column, generally a date but can be any variable. It is particularly fast using a modified binary search. A common idiom is to select a contemporaneous regular time series ( |
rollends |
A logical vector length 2 (a single logical is recycled) indicating whether values falling before the first value or after the last value for a group should be rolled as well.
When |
which |
|
.SDcols |
Specifies the columns of For convenient interactive use, the form |
verbose |
|
allow.cartesian |
|
drop |
Never used by |
on |
Indicate which columns in
See examples as well as Secondary indices and auto indexing. |
data.table
builds on base R functionality to reduce 2 types of time:
programming time (easier to write, read, debug and maintain), and
compute time (fast and memory efficient).
The general form of data.table syntax is:
DT[ i, j, by ] # + extra arguments | | | | | -------> grouped by what? | -------> what to do? ---> on which rows?
The way to read this out loud is: "Take DT
, subset rows by i
, then compute j
grouped by by
. Here are some basic usage examples expanding on this definition. See the vignette (and examples) for working examples.
X[, a] # return col 'a' from X as vector. If not found, search in parent frame. X[, .(a)] # same as above, but return as a data.table. X[, sum(a)] # return sum(a) as a vector (with same scoping rules as above) X[, .(sum(a)), by=c] # get sum(a) grouped by 'c'. X[, sum(a), by=c] # same as above, .() can be ommitted in by on single expression for convenience X[, sum(a), by=c:f] # get sum(a) grouped by all columns in between 'c' and 'f' (both inclusive) X[, sum(a), keyby=b] # get sum(a) grouped by 'b', and sort that result by the grouping column 'b' X[, sum(a), by=b][order(b)] # same order as above, but by chaining compound expressions X[c>1, sum(a), by=c] # get rows where c>1 is TRUE, and on those rows, get sum(a) grouped by 'c' X[Y, .(a, b), on="c"] # get rows where Y$c == X$c, and select columns 'X$a' and 'X$b' for those rows X[Y, .(a, i.a), on="c"] # get rows where Y$c == X$c, and then select 'X$a' and 'Y$a' (=i.a) X[Y, sum(a*i.a), on="c" by=.EACHI] # for *each* 'Y$c', get sum(a*i.a) on matching rows in 'X$c' X[, plot(a, b), by=c] # j accepts any expression, generates plot for each group and returns no data # see ?assign to add/update/delete columns by reference using the same consistent interface
A data.table
is a list
of vectors, just like a data.frame
. However :
it never has or uses rownames. Rownames based indexing can be done by setting a key of one or more columns or done ad-hoc using the on
argument (now preferred).
it has enhanced functionality in [.data.table
for fast joins of keyed tables, fast aggregation, fast last observation carried forward (LOCF) and fast add/modify/delete of columns by reference with no copy at all.
See the see also
section for the several other methods that are available for operating on data.tables efficiently.
If keep.rownames
or check.names
are supplied they must be written in full because R does not allow partial argument names after '...
'. For example, data.table(DF, keep=TRUE)
will create a
column called "keep"
containing TRUE
and this is correct behaviour; data.table(DF, keep.rownames=TRUE)
was intended.
POSIXlt
is not supported as a column type because it uses 40 bytes to store a single datetime. They are implicitly converted to POSIXct
type with warning. You may also be interested in IDateTime
instead; it has methods to convert to and from POSIXlt
.
https://github.com/Rdatatable/data.table/wiki (data.table
homepage)
http://crantastic.org/packages/data-table (User reviews)
http://en.wikipedia.org/wiki/Binary_search
special-symbols
, data.frame
, [.data.frame
, as.data.table
, setkey
, setorder
, setDT
, setDF
, J
, SJ
, CJ
, merge.data.table
, tables
, test.data.table
, IDateTime
, unique.data.table
, copy
, :=
, alloc.col
, truelength
, rbindlist
, setNumericRounding
, datatable-optimize
, fsetdiff
, funion
, fintersect
, fsetequal
, anyDuplicated
, uniqueN
, rowid
, rleid
, na.omit
, frank
## Not run: example(data.table) # to run these examples at the prompt ## End(Not run) DF = data.frame(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9) DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9) DF DT identical(dim(DT), dim(DF)) # TRUE identical(DF$a, DT$a) # TRUE is.list(DF) # TRUE is.list(DT) # TRUE is.data.frame(DT) # TRUE tables() # basic row subset operations DT[2] # 2nd row DT[3:2] # 3rd and 2nd row DT[order(x)] # no need for order(DT$x) DT[order(x), ] # same as above. The ',' is optional DT[y>2] # all rows where DT$y > 2 DT[y>2 & v>5] # compound logical expressions DT[!2:4] # all rows other than 2:4 DT[-(2:4)] # same # select|compute columns data.table way DT[, v] # v column (as vector) DT[, list(v)] # v column (as data.table) DT[, .(v)] # same as above, .() is a shorthand alias to list() DT[, sum(v)] # sum of column v, returned as vector DT[, .(sum(v))] # same, but return data.table (column autonamed V1) DT[, .(sv=sum(v))] # same, but column named "sv" DT[, .(v, v*2)] # return two column data.table, v and v*2 # subset rows and select|compute data.table way DT[2:3, sum(v)] # sum(v) over rows 2 and 3, return vector DT[2:3, .(sum(v))] # same, but return data.table with column V1 DT[2:3, .(sv=sum(v))] # same, but return data.table with column sv DT[2:5, cat(v, "\n")] # just for j's side effect # select columns the data.frame way DT[, 2, with=FALSE] # 2nd column, returns a data.table always colNum = 2 DT[, colNum, with=FALSE] # same, equivalent to DT[, .SD, .SDcols=colNum] DT[["v"]] # same as DT[, v] but much faster # grouping operations - j and by DT[, sum(v), by=x] # ad hoc by, order of groups preserved in result DT[, sum(v), keyby=x] # same, but order the result on by cols DT[, sum(v), by=x][order(x)] # same but by chaining expressions together # fast ad hoc row subsets (subsets as joins) DT["a", on="x"] # same as x == "a" but uses binary search (fast) DT["a", on=.(x)] # same, for convenience, no need to quote every column DT[.("a"), on="x"] # same DT[x=="a"] # same, single "==" internally optimised to use binary search (fast) DT[x!="b" | y!=3] # not yet optimized, currently vector scan subset DT[.("b", 3), on=c("x", "y")] # join on columns x,y of DT; uses binary search (fast) DT[.("b", 3), on=.(x, y)] # same, but using on=.() DT[.("b", 1:2), on=c("x", "y")] # no match returns NA DT[.("b", 1:2), on=.(x, y), nomatch=0] # no match row is not returned DT[.("b", 1:2), on=c("x", "y"), roll=Inf] # locf, nomatch row gets rolled by previous row DT[.("b", 1:2), on=.(x, y), roll=-Inf] # nocb, nomatch row gets rolled by next row DT["b", sum(v*y), on="x"] # on rows where DT$x=="b", calculate sum(v*y) # all together now DT[x!="a", sum(v), by=x] # get sum(v) by "x" for each i != "a" DT[!"a", sum(v), by=.EACHI, on="x"] # same, but using subsets-as-joins DT[c("b","c"), sum(v), by=.EACHI, on="x"] # same DT[c("b","c"), sum(v), by=.EACHI, on=.(x)] # same, using on=.() # joins as subsets X = data.table(x=c("c","b"), v=8:7, foo=c(4,2)) X DT[X, on="x"] # right join X[DT, on="x"] # left join DT[X, on="x", nomatch=0] # inner join DT[!X, on="x"] # not join DT[X, on=.(y<=foo)] # NEW non-equi join (v1.9.8+) DT[X, on="y<=foo"] # same as above DT[X, on=c("y<=foo")] # same as above DT[X, on=.(y>=foo)] # NEW non-equi join (v1.9.8+) DT[X, on=.(x, y<=foo)] # NEW non-equi join (v1.9.8+) DT[X, .(x,y,x.y,v), on=.(x, y>=foo)] # Select x's join columns as well DT[X, on="x", mult="first"] # first row of each group DT[X, on="x", mult="last"] # last row of each group DT[X, sum(v), by=.EACHI, on="x"] # join and eval j for each row in i DT[X, sum(v)*foo, by=.EACHI, on="x"] # join inherited scope DT[X, sum(v)*i.v, by=.EACHI, on="x"] # 'i,v' refers to X's v column DT[X, on=.(x, v>=v), sum(y)*foo, by=.EACHI] # NEW non-equi join with by=.EACHI (v1.9.8+) # setting keys kDT = copy(DT) # (deep) copy DT to kDT to work with it. setkey(kDT,x) # set a 1-column key. No quotes, for convenience. setkeyv(kDT,"x") # same (v in setkeyv stands for vector) v="x" setkeyv(kDT,v) # same # key(kDT)<-"x" # copies whole table, please use set* functions instead haskey(kDT) # TRUE key(kDT) # "x" # fast *keyed* subsets kDT["a"] # subset-as-join on *key* column 'x' kDT["a", on="x"] # same, being explicit using 'on=' (preferred) # all together kDT[!"a", sum(v), by=.EACHI] # get sum(v) for each i != "a" # multi-column key setkey(kDT,x,y) # 2-column key setkeyv(kDT,c("x","y")) # same # fast *keyed* subsets on multi-column key kDT["a"] # join to 1st column of key kDT["a", on="x"] # on= is optional, but is preferred kDT[.("a")] # same, .() is an alias for list() kDT[list("a")] # same kDT[.("a", 3)] # join to 2 columns kDT[.("a", 3:6)] # join 4 rows (2 missing) kDT[.("a", 3:6), nomatch=0] # remove missing kDT[.("a", 3:6), roll=TRUE] # locf rolling join kDT[.("a", 3:6), roll=Inf] # same as above kDT[.("a", 3:6), roll=-Inf] # nocb rolling join kDT[!.("a")] # not join kDT[!"a"] # same # more on special symbols, see also ?"special-symbols" DT[.N] # last row DT[, .N] # total number of rows in DT DT[, .N, by=x] # number of rows in each group DT[, .SD, .SDcols=x:y] # select columns 'x' and 'y' DT[, .SD[1]] # first row of all columns DT[, .SD[1], by=x] # first row of 'y' and 'v' for each group in 'x' DT[, c(.N, lapply(.SD, sum)), by=x] # get rows *and* sum columns 'v' and 'y' by group DT[, .I[1], by=x] # row number in DT corresponding to each group DT[, grp := .GRP, by=x] # add a group counter column X[, DT[.BY, y, on="x"], by=x] # join within each group # add/update/delete by reference (see ?assign) print(DT[, z:=42L]) # add new column by reference print(DT[, z:=NULL]) # remove column by reference print(DT["a", v:=42L, on="x"]) # subassign to existing v column by reference print(DT["b", v2:=84L, on="x"]) # subassign to new column by reference (NA padded) DT[, m:=mean(v), by=x][] # add new column by reference by group # NB: postfix [] is shortcut to print() # advanced usage DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1) DT[, sum(v), by=.(y%%2)] # expressions in by DT[, sum(v), by=.(bool = y%%2)] # same, using a named list to change by column name DT[, .SD[2], by=x] # get 2nd row of each group DT[, tail(.SD,2), by=x] # last 2 rows of each group DT[, lapply(.SD, sum), by=x] # sum of all (other) columns for each group DT[, .SD[which.min(v)], by=x] # nested query by group DT[, list(MySum=sum(v), MyMin=min(v), MyMax=max(v)), by=.(x, y%%2)] # by 2 expressions DT[, .(a = .(a), b = .(b)), by=x] # list columns DT[, .(seq = min(a):max(b)), by=x] # j is not limited to just aggregations DT[, sum(v), by=x][V1<20] # compound query DT[, sum(v), by=x][order(-V1)] # ordering results DT[, c(.N, lapply(.SD,sum)), by=x] # get number of observations and sum per group DT[, {tmp <- mean(y); .(a = a-tmp, b = b-tmp) }, by=x] # anonymous lambdain 'j', j accepts any valid # expression. TO REMEMBER: every element of # the list becomes a column in result. pdf("new.pdf") DT[, plot(a,b), by=x] # can also plot in 'j' dev.off() # using rleid, get max(y) and min of all cols in .SDcols for each consecutive run of 'v' DT[, c(.(y=max(y)), lapply(.SD, min)), by=rleid(v), .SDcols=v:b] # Follow r-help posting guide, SUPPORT is here (*not* r-help) : # http://stackoverflow.com/questions/tagged/data.table # or # datatable-help@lists.r-forge.r-project.org ## Not run: vignette("datatable-intro") vignette("datatable-reference-semantics") vignette("datatable-keys-fast-subset") vignette("datatable-secondary-indices-and-auto-indexing") vignette("datatable-reshape") vignette("datatable-faq") test.data.table() # over 5700 low level tests # keep up to date with latest stable version on CRAN update.packages() # get the latest devel (needs Rtools for windows, xcode for mac) install.packages("data.table", repos = "https://Rdatatable.github.io/data.table", type = "source") ## End(Not run)