Search for parameter combinations corresponding to a value in a large dataframe in R

I would like to ask for performance/scalability feedback on an R function that I recently wrote. I am not a programmer or computer scientist, so I hope I am able to phrase the question clearly. :)

Overview of the Question

I have a large 4-column dataframe (the lookup_dataframe; i.e., the output of a simulation study). The first column contains an integer value, and the other three columns contain parameter values that correspond to that integer value.

I have a second dataframe (data) that contains a column of input values. For each input value, I want to look up (from the lookup_dataframe) one possible combination of parameters that led to that input value.

Create some sample data

# Load packages
library(furrr)
#> Loading required package: future
library(tidyverse)

# Set the number of parallel workers
plan(multisession, workers = 4)

# Set seed
set.seed(123)

# Create `data`: 1000 random input values in 1..800, plus two covariates
data <- data.frame(input = raster::sampleInt(n = 800, size = 1000, replace = TRUE)) %>%
  mutate(status = sample(0:1, n(), replace = TRUE), 
         factor = sample(c("A", "B"), n(), replace = TRUE))

# Create `lookup_dataframe`: 2500 simulated parameter sets for each value 1..800
lookup_dataframe <- data.frame(value = rep(1:800, each = 2500), 
                               param_A = rnorm(n = 2e6), 
                               param_B = rnorm(n = 2e6),
                               param_C = rnorm(n = 2e6))
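
As a quick sanity check, the shapes come out as intended: the lookup table has 2,500 parameter draws for each of the 800 values.

dim(lookup_dataframe)
#> [1] 2000000       4
nrow(data)
#> [1] 1000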

Write function to look up data

I wrote a function that takes an input value, filters the lookup_dataframe on that value, and randomly selects a row. After dropping the value column, the result is a vector of the three parameters.

func_input_to_param <- function(input){
  lookup_dataframe %>% 
    filter(input == value) %>%  # `input` is the function argument; `value` the lookup column
    select(-value) %>%
    sample_n(size = 1, replace = TRUE) %>%  # draw one random matching row
    flatten_dbl()  # return it as a named double vector
}
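
For example, a single call returns one randomly sampled parameter combination as a named numeric vector (the actual numbers vary with the draw):

# One random parameter combination for input value 231; returns a named
# numeric vector c(param_A = ..., param_B = ..., param_C = ...)
func_input_to_param(231)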

Map function to input data

Finally, I map the function over the input data, stitch both dataframes together, and the job is done! Note that I am using the furrr package's future_map function to perform the mapping in parallel.

param_dataframe <- future_map(.x = data$input, 
                              .f = func_input_to_param, 
                              .progress = TRUE, .options = furrr_options(seed = TRUE))

# Row-bind the list of parameter vectors and attach the columns to `data`
param_dataframe <- do.call(rbind, param_dataframe) %>%
  as_tibble(.name_repair = ~ c("param_A", "param_B", "param_C")) %>% 
  cbind(data, .)

head(param_dataframe, 5)
#>   input status factor    param_A    param_B    param_C
#> 1   231      0      A  0.4533840 -0.4965135 -0.3215218
#> 2   631      1      A  1.1109426  0.8285732 -0.6635507
#> 3   328      1      A  0.9866890 -1.5464006  0.9079893
#> 4   707      0      B -1.8231198 -1.2731512  1.1951422
#> 5   753      1      B -0.5865885 -0.6276842  1.3558658

Feedback request

First of all, I really appreciate any feedback you can provide (including feedback that is not performance-related)! But I wanted to ask specifically about the code's performance and scalability. For example, if the input data has size = 1000, the code is fast enough. However, with size = 10e6 (ten million input values) it takes forever to run. How can I speed up this code?

It feels like a straightforward problem, so I was wondering if there are smarter ways to approach it? One half-formed idea is sketched below.
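
For context, here is a rough sketch of the kind of vectorized shortcut I imagine might exist, assuming lookup_dataframe stays sorted by value with exactly 2,500 rows per value (as it is constructed above); I have not verified or benchmarked it:

# Untested sketch: pick one random row index per input directly, instead
# of filtering the 2e6-row lookup table once per input value.
rows_per_value <- 2500
offsets <- sample.int(rows_per_value, size = nrow(data), replace = TRUE)
idx <- (data$input - 1) * rows_per_value + offsets
param_dataframe_alt <- cbind(data, lookup_dataframe[idx, c("param_A", "param_B", "param_C")])
row.names(param_dataframe_alt) <- NULL

Even if this particular indexing trick does not generalize, I would welcome pointers to approaches in that vectorized spirit.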

Created in 2021 by the reprex package (v1.0.0)