One option to make complex data.table commands more efficient to write and maintain is to combine grouping ("by") with lapply over .SD, restricting the columns with a vector of names passed to .SDcols.

For example, let's say we have a data.table called "DT" with columns "A", "B", "C", and "D". We want to calculate the mean of columns "B", "C", and "D" for each unique value of "A".

Instead of doing:

DT[, .(mean_B = mean(B), mean_C = mean(C), mean_D = mean(D)), by = A]

we can create a vector of the column names we want to operate on, pass it to .SDcols, and call lapply on .SD inside the "j" argument. Note that the result columns keep their original names (B, C, D) rather than mean_B, mean_C, mean_D:

cols <- c("B", "C", "D")
DT[, lapply(.SD, mean), by = A, .SDcols = cols]

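This reduces code duplication (especially if we have more columns to operate on) and makes it easier to change the columns we operate on in the future. Any speed difference is likely to be small, since data.table generally optimizes both forms internally for simple functions such as mean, so benchmark on your own data if performance matters.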
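To check that the two forms agree, here is a minimal sketch on a small made-up table. The sample values and the names "explicit" and "via_sd" are assumptions for illustration only. Renaming with setnames and paste0 is one way to get the mean_ prefix back; it is not something the .SD form does on its own.

```r
library(data.table)

# Hypothetical sample data; the values are made up for illustration.
DT <- data.table(
  A = c("x", "x", "y", "y"),
  B = c(1, 3, 5, 7),
  C = c(2, 4, 6, 8),
  D = c(10, 20, 30, 40)
)

cols <- c("B", "C", "D")

# Explicit form: one mean() call per column, with custom output names.
explicit <- DT[, .(mean_B = mean(B), mean_C = mean(C), mean_D = mean(D)), by = A]

# .SD form: .SDcols limits .SD to the listed columns,
# and the output columns keep the input names (B, C, D).
via_sd <- DT[, lapply(.SD, mean), by = A, .SDcols = cols]

# Rename by reference so the names match the explicit form.
setnames(via_sd, cols, paste0("mean_", cols))

# Compare the two results.
all.equal(explicit, via_sd)
```

To operate on a different set of columns, only the "cols" vector needs to change; the rest of the call stays the same.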