One way to draw a random sample from a data frame that is more likely to include values of a given variable within a particular range is stratified sampling.
First, create a new variable in the data frame that indicates whether the value of the variable of interest falls within the desired range. For example, if we want to include values of a variable 'x' between 20 and 50, we can create a new variable 'x_range' as follows:
df$x_range <- ifelse(df$x >= 20 & df$x <= 50, "within_range", "outside_range")
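As a quick check of this step, the sketch below builds a hypothetical data frame and tabulates the new variable. The data frame itself (1000 values of 'x' drawn uniformly between 0 and 100, plus an 'id' column) is an assumption made for illustration, not part of the original question.

```r
# Hypothetical data: 1000 values of x drawn uniformly from 0 to 100
set.seed(123)
df <- data.frame(id = 1:1000, x = runif(1000, min = 0, max = 100))

# Flag whether each value falls in the closed interval [20, 50]
df$x_range <- ifelse(df$x >= 20 & df$x <= 50, "within_range", "outside_range")

# Count the rows in each stratum before sampling
stratum_counts <- table(df$x_range)
print(stratum_counts)
```

Knowing the stratum counts up front matters because sampling without replacement cannot draw more rows from a stratum than it contains. Note also that ifelse returns NA for rows where 'x' is missing, so such rows would need to be handled separately.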
Next, use stratified sampling to select a random sample that includes a higher proportion of observations within the desired range. We can use the stratified function from the splitstackshape package to do this:
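library(splitstackshape)
set.seed(123)
# Rows to draw from each stratum; the 70/30 split is only an illustration
stratum_sizes <- c(within_range = 70, outside_range = 30)
df_sample <- stratified(df, group = "x_range", size = stratum_sizes, replace = FALSE)
In this example, group specifies the new variable 'x_range' we created, size specifies how many rows to draw from each stratum, and replace = FALSE specifies sampling without replacement within each stratum. A named vector is used for size because a single integer would draw that same number of rows from every stratum, which favors the desired range only when it is the smaller group. With 70 of the 100 sampled rows taken from inside the range, the sample is more likely to include observations within the desired range for the variable of interest, provided each stratum contains at least as many rows as requested.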
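The same idea can be sketched in base R without any package, which also makes it easy to verify that the range is overrepresented. The data frame, the allocation of 70 in-range and 30 out-of-range rows, and the names n_per_stratum and picked are all assumptions chosen for illustration.

```r
# Hypothetical data, for illustration only
set.seed(123)
df <- data.frame(id = 1:1000, x = runif(1000, min = 0, max = 100))
df$x_range <- ifelse(df$x >= 20 & df$x <= 50, "within_range", "outside_range")

# Assumed allocation: 70 rows from inside the range, 30 from outside
n_per_stratum <- c(within_range = 70, outside_range = 30)

# Draw row indices without replacement inside each stratum
picked <- unlist(lapply(names(n_per_stratum), function(s) {
  rows <- which(df$x_range == s)
  rows[sample.int(length(rows), n_per_stratum[[s]])]
}))
df_sample <- df[picked, ]

# Compare the share of in-range rows in the full data and in the sample
print(prop.table(table(df$x_range)))
print(prop.table(table(df_sample$x_range)))
```

Indexing with sample.int rather than calling sample on the row numbers directly avoids the surprising behavior of sample when a stratum contains a single row. Because the sample deliberately overrepresents one stratum, any estimate meant to describe the full data frame would need to be reweighted accordingly.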