fwrite {data.table} | R Documentation |
As write.csv but much faster (e.g. 2 seconds versus 1 minute) and just as flexible. Modern machines almost surely have more than one CPU, so fwrite uses them, on all operating systems including Linux, Mac and Windows.
This is new functionality as of Nov 2016. We may need to refine argument names and defaults.
fwrite(x, file = "", append = FALSE, quote = "auto",
       sep = ",", sep2 = c("","|",""),
       eol = if (.Platform$OS.type=="windows") "\r\n" else "\n",
       na = "", dec = ".", row.names = FALSE, col.names = TRUE,
       qmethod = c("double","escape"),
       logicalAsInt = FALSE,
       dateTimeAs = c("ISO","squash","epoch","write.csv"),
       buffMB = 8L, nThread = getDTthreads(),
       showProgress = getOption("datatable.showProgress"),
       verbose = getOption("datatable.verbose"))
x | Any
file | Output file name.
append | If
quote | When
sep | The separator between columns. Default is
sep2 | For columns of type
eol | Line separator. Default is
na | The string to use for missing values in the data. Default is a blank string
dec | The decimal separator, by default
row.names | Should row names be written? For compatibility with
col.names | Should the column names (header row) be written? If missing,
qmethod | A character string specifying how to deal with embedded double quote characters when quoting strings.
logicalAsInt | Should
dateTimeAs | How The first three options are fast due to new specialized C code. The epoch to date-part conversion uses a fast approach by Howard Hinnant (see references) using a day-of-year starting on 1 March. You should not be able to notice any difference in write speed between those three options. The date range supported for
buffMB | The buffer size (MB) per thread in the range 1 to 1024, default 8MB. Experiment to see what works best for your data on your hardware.
nThread | The number of threads to use. Experiment to see what works best for your data on your hardware.
showProgress | Display a progress meter on the console? Ignored when
verbose | Be chatty and report timings?
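A few of the arguments above can be tried together on a small table. The following is a minimal sketch rather than an official example: the table contents, the temporary file and the variable names are made up for illustration, and it assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical toy data with one missing value.
DT <- data.table(id = 1:3, score = c(1.5, NA, 3))
tmp <- tempfile(fileext = ".csv")

# Semicolon-separated output, with missing values written as the string "NA"
# instead of the default blank.
fwrite(DT, tmp, sep = ";", na = "NA")

# Append one more row to the same file; col.names = FALSE is set explicitly
# so the header row is not written a second time.
fwrite(data.table(id = 4L, score = 0.25), tmp,
       sep = ";", append = TRUE, col.names = FALSE)
lines <- readLines(tmp)
print(lines)

# Write Date values in the compact "squash" form (yyyymmdd).
dates <- data.table(d = as.Date(c("2016-11-01", "2016-11-02")))
fwrite(dates, tmp, dateTimeAs = "squash")
squashed <- readLines(tmp)
print(squashed)

unlink(tmp)
```

Reading the file back with readLines is only a convenient way to inspect what was written; fread would be the usual choice for loading the data again.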
fwrite began as a community contribution with pull request #1613 by Otto Seiskari. This gave Matt Dowle the impetus to specialize the numeric formatting and to parallelize: http://blog.h2o.ai/2016/04/fast-csv-writing-for-r/. Final items were tracked in issue #1664 such as automatic quoting, bit64::integer64 support, decimal/scientific formatting exactly matching write.csv between 2.225074e-308 and 1.797693e+308 to 15 significant figures, row.names, dates (between 0000-03-01 and 9999-12-31), times and sep2 for list columns where each cell can itself be a vector.
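For intuition about converting an epoch day count to date parts with a year that starts on 1 March, here is an R transcription of the civil_from_days routine published on Howard Hinnant's date algorithms page (listed in the references). It is a sketch for illustration only: fwrite does this work in C, and this page does not show that code, so the actual implementation may differ. The function name and variable names follow Hinnant's write-up, not data.table.

```r
# Convert days since 1970-01-01 to year, month, day (proleptic Gregorian).
# Internally the year is shifted to start on 1 March, so the leap day
# falls at the very end of the shifted year.
civil_from_days <- function(z) {
  z   <- z + 719468                      # shift epoch to 0000-03-01
  era <- z %/% 146097                    # 400-year cycles (146097 days each)
  doe <- z - era * 146097                # day of era, 0..146096
  yoe <- (doe - doe %/% 1460 + doe %/% 36524 - doe %/% 146096) %/% 365
  doy <- doe - (365 * yoe + yoe %/% 4 - yoe %/% 100)  # day of March-based year
  mp  <- (5 * doy + 2) %/% 153           # March-based month, 0 = March
  d   <- doy - (153 * mp + 2) %/% 5 + 1  # day of month, 1..31
  m   <- if (mp < 10) mp + 3 else mp - 9 # calendar month, 1..12
  y   <- yoe + era * 400 + (m <= 2)      # Jan and Feb belong to the next year
  c(year = y, month = m, day = d)
}

civil_from_days(0)      # year 1970, month 1, day 1
civil_from_days(17106)  # year 2016, month 11, day 1
```

Only integer arithmetic is involved, with no loops or table lookups per date.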
http://howardhinnant.github.io/date_algorithms.html
https://en.wikipedia.org/wiki/Decimal_mark
setDTthreads, fread, write.csv, write.table, bit64::integer64
DF = data.frame(A=1:3, B=c("foo","A,Name","baz"))
fwrite(DF)
write.csv(DF, row.names=FALSE, quote=FALSE)  # same

fwrite(DF, row.names=TRUE, quote=TRUE)
write.csv(DF)                                # same

DF = data.frame(A=c(2.1,-1.234e-307,pi), B=c("foo","A,Name","bar"))
fwrite(DF, quote='auto')        # Just DF[2,2] is auto quoted
write.csv(DF, row.names=FALSE)  # same numeric formatting

DT = data.table(A=c(2,5.6,-3),B=list(1:3,c("foo","A,Name","bar"),round(pi*1:3,2)))
fwrite(DT)
fwrite(DT, sep="|", sep2=c("{",",","}"))

## Not run:
set.seed(1)
DT = as.data.table( lapply(1:10, sample,
         x=as.numeric(1:5e7), size=5e6))                  # 382MB
system.time(fwrite(DT, "/dev/shm/tmp1.csv"))              # 0.8s
system.time(write.csv(DT, "/dev/shm/tmp2.csv",            # 60.6s
                      quote=FALSE, row.names=FALSE))
system("diff /dev/shm/tmp1.csv /dev/shm/tmp2.csv")        # identical

set.seed(1)
N = 1e7
DT = data.table(
  str1=sample(sprintf("
  str2=sample(sprintf("
  str3=sample(sapply(sample(2:30, 100, TRUE), function(n)
     paste0(sample(LETTERS, n, TRUE), collapse="")), N, TRUE),
  str4=sprintf("
  num1=sample(round(rnorm(1e6,mean=6.5,sd=15),2), N, replace=TRUE),
  num2=sample(round(rnorm(1e6,mean=6.5,sd=15),10), N, replace=TRUE),
  str5=sample(c("Y","N"),N,TRUE),
  str6=sample(c("M","F"),N,TRUE),
  int1=sample(ceiling(rexp(1e6)), N, replace=TRUE),
  int2=sample(N,N,replace=TRUE)-N/2
)                                                                    # 774MB
system.time(fwrite(DT,"/dev/shm/tmp1.csv"))                          # 1.1s
system.time(write.csv(DT,"/dev/shm/tmp2.csv",row.names=FALSE,quote=FALSE))  # 63.2s
system("diff /dev/shm/tmp1.csv /dev/shm/tmp2.csv")                   # identical

unlink("/dev/shm/tmp1.csv")
unlink("/dev/shm/tmp2.csv")

## End(Not run)