Performance issues with large data.tables #983

@thejokenott

Description

For some reason, some functions take far longer to run when called as part of the pipeline than when called directly.

Minimal reproducible example

_targets.R


library(targets)

get_toy_data <- function() {
  library(data.table)
  dt <- data.table(
    id = sample(1:1e8, 3000)
  )

  dt <- dt[
    ,
    .(date1 = seq.Date(as.Date("2017-01-01"), as.Date("2017-01-31"), by = "day")),
    by = id
  ]

  dt <- dt[
    ,
    .(date2 = seq.Date(as.Date(date1) - 10, as.Date(date1), by = "day")),
    by = .(id, date1)
  ]

  dt[
    ,
    avail := sample(c(rep(FALSE, 9), TRUE), size = nrow(dt), replace = TRUE)
  ]

  return(dt[])
}

f <- function(data) {
  start_time <- Sys.time()
  library(data.table)
  data[avail == TRUE, date2_if_available := date2]
  data[
    ,
    date2_first_available := min(date2_if_available, na.rm = TRUE),
    by = .(id, date1)
  ]
  end_time <- Sys.time()
  print(end_time - start_time)
  return(data[])
}


list(
  tar_target(data, get_toy_data()),
  tar_target(fdata, f(data))
)

When I run tar_make(fdata), the fdata target takes more than 2 minutes to complete. When I run f(tar_read(data)) instead, it takes only around 2 seconds. The data target itself builds without any issue. I have tried this on both macOS and Ubuntu, both inside and outside of renv, with similar results every time. What might be the issue here?
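To put numbers on the gap described above, the two timings can be collected side by side. The sketch below is only an illustration: `compare_timings()` is a hypothetical helper, not part of targets. It assumes it is run from the project root next to the _targets.R shown above, and that `f` is already defined in the calling session.

```r
# Hypothetical helper (not part of targets): compare the runtime of `fdata`
# inside the pipeline with a direct call to f() on the stored `data` target.
# Assumes the working directory contains the _targets.R from this issue.
compare_timings <- function(f) {
  # Mark `fdata` as outdated so the next tar_make() actually reruns it.
  targets::tar_invalidate(fdata)
  targets::tar_make(fdata)

  # Runtime that targets recorded for `fdata`, in seconds.
  pipeline <- targets::tar_meta(names = fdata, fields = seconds)

  # Same function, called directly on the stored `data` target.
  direct <- system.time(f(targets::tar_read(data)))

  list(
    pipeline_seconds = pipeline$seconds,
    direct_seconds = direct[["elapsed"]]
  )
}
```

Calling `compare_timings(f)` returns both figures in one list. Running `targets::tar_make(fdata, callr_function = NULL)` executes the pipeline in the current R session instead of a fresh external process, which may help narrow down whether the external process plays a role in the slowdown.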
