Closed
Prework
- Read and agree to the code of conduct and contributing guidelines.
- If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
- Post a minimal reproducible example so the maintainer can troubleshoot the problems you identify. A reproducible example is:
  - Runnable: post enough R code and data so any onlooker can create the error on their own computer.
  - Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
  - Readable: format your code according to the tidyverse style guide.
Description
Hi @wlandau,
It seems like something about the torch serialization isn't working right. Or maybe I'm not quite understanding how the serialization/unserialization works. I thought the serialization/unserialization meant you could pass objects in memory from main to worker.
But here is a reprex that I think showcases the issue. The torch object has to be retrieved by the worker, or else the external pointer is invalid.
Mostly this wouldn't be an issue, but this means torch objects cannot be loaded on the "main" process and passed to the HPC via ssh (and maybe this affects AWS storage/retrieval too, but I don't use it so I can't test it out).
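For context, here is a rough sketch of what I understood "passing a torch object in memory" to involve: the tensor is turned into a plain raw vector on one side and rebuilt on the other, so no external pointer has to cross the process boundary. The helper names `marshal_tensor()` and `unmarshal_tensor()` are made up for illustration, and I'm assuming `torch_save()` and `torch_load()` accept connections. This is not necessarily what targets does internally.

```r
library(torch)

# Hypothetical helper: write the tensor into an in-memory raw connection
# and return the resulting bytes as a raw vector.
marshal_tensor <- function(tensor) {
  con <- rawConnection(raw(), open = "wr")
  on.exit(close(con))
  torch_save(tensor, con)
  rawConnectionValue(con)
}

# Hypothetical helper: rebuild a tensor from the raw vector, which creates
# a fresh external pointer in the process that calls it.
unmarshal_tensor <- function(bytes) {
  con <- rawConnection(bytes, open = "r")
  on.exit(close(con))
  torch_load(con)
}

tensor <- torch_zeros(10)
bytes <- marshal_tensor(tensor)     # plain raw vector, safe to send around
restored <- unmarshal_tensor(bytes) # valid tensor again
as.array(restored)
```

If the round trip happens on both sides of the main/worker boundary, I would expect `as.array()` to work on the worker regardless of where the tensor was loaded.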
Reproducible example
tar_make_clustermq() with "main" retrieval
library(targets)
tar_script({
  library(targets)
  library(torch)
  options(clustermq.scheduler = "multiprocess")
  tar_option_set(
    packages = c("torch"),
    retrieval = "main"
  )
  tar_pipeline(
    tar_target(
      tensor,
      torch_zeros(10),
      format = "torch"
    ),
    tar_target(
      test,
      as.array(tensor)
    )
  )
})
tar_make_clustermq()
#> ● run target tensor
#> ● run target test
#> x error target test
#> Warning in self$crew$finalize() : Unclean shutdown for PIDs: 24760
#> Error : external pointer is not valid .
#> Error: callr subprocess failed: external pointer is not valid .
tar_read(tensor)
#> torch_tensor
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> [ CPUFloatType{10} ]
tar_read(test)
#> Error in gzfile(file, "rb"): invalid 'description' argument
tar_make_clustermq() with "worker" retrieval
library(targets)
tar_script({
  library(targets)
  library(torch)
  options(clustermq.scheduler = "multiprocess")
  tar_option_set(
    packages = c("torch"),
    retrieval = "worker"
  )
  tar_pipeline(
    tar_target(
      tensor,
      torch_zeros(10),
      format = "torch"
    ),
    tar_target(
      test,
      as.array(tensor)
    )
  )
})
tar_make_clustermq()
#> ● run target tensor
#> ● run target test
#> Master: [1.3s 4.6% CPU]; Worker: [avg 74.2% CPU, max 3329675.0 Mb]
tar_read(tensor)
#> torch_tensor
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> 0
#> [ CPUFloatType{10} ]
tar_read(test)
#> [1] 0 0 0 0 0 0 0 0 0 0