One option to make complex data.table commands more efficient is to combine grouping (the "by" argument) with lapply over .SD, restricted to a vector of column names via .SDcols.
For example, let's say we have a data.table called "DT" with columns "A", "B", "C", and "D". We want to calculate the mean of columns "B", "C", and "D" for each unique value of "A".
Instead of doing:
DT[, .(mean_B = mean(B), mean_C = mean(C), mean_D = mean(D)), by = A]
We can create a vector of column names we want to operate on and use lapply inside the "j" argument of data.table to generate the desired columns, like this:
cols <- c("B", "C", "D")
DT[, lapply(.SD, mean), by = A, .SDcols = cols]
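To check that the two forms agree, here is a minimal sketch on made-up data (the values in DT are purely illustrative). One difference to be aware of: the lapply version keeps the original column names (B, C, D) rather than producing mean_B, mean_C, mean_D, so the sketch renames them with setnames before comparing.

```r
library(data.table)

# Hypothetical sample data, only for illustration
DT <- data.table(
  A = c("x", "x", "y", "y"),
  B = c(1, 3, 5, 7),
  C = c(2, 4, 6, 8),
  D = c(10, 20, 30, 40)
)

cols <- c("B", "C", "D")

# Explicit version: one mean() call written out per column
explicit <- DT[, .(mean_B = mean(B), mean_C = mean(C), mean_D = mean(D)), by = A]

# Generic version: apply mean() to every column listed in .SDcols
generic <- DT[, lapply(.SD, mean), by = A, .SDcols = cols]

# The generic result is named B, C, D; rename to match the explicit version
setnames(generic, cols, paste0("mean_", cols))

# Compare the two results
same <- isTRUE(all.equal(explicit, generic))
print(same)
```

If the mean_ prefix is not needed, the setnames step can be dropped and the result used as is.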
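This reduces code duplication (especially if we have more columns to operate on) and makes it easier to change the columns we operate on in the future, since only the "cols" vector needs editing. It may also be faster in some cases, although for simple functions such as mean, data.table typically optimizes both forms internally, so the speed difference is often small.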
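As a sketch of the maintainability point, the column vector can also be built programmatically so that newly added columns are picked up without touching the aggregation call. The extra column E and the sample values here are assumptions for illustration only.

```r
library(data.table)

# Hypothetical data with an extra column E added later
DT <- data.table(
  A = c("x", "x", "y"),
  B = c(1, 3, 5),
  C = c(2, 4, 6),
  D = c(10, 20, 30),
  E = c(0.5, 1.5, 2.5)
)

# Every column except the grouping column
cols <- setdiff(names(DT), "A")

# The aggregation call itself does not change when columns are added
res <- DT[, lapply(.SD, mean), by = A, .SDcols = cols]
print(res)
```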
Asked: 2023-06-26 15:29:12 +0000
Last updated: Jun 26 '23