Torch serialization not working properly; pointer issues. #209

@mattwarkentin

Prework

  • Read and agree to the code of conduct and contributing guidelines.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • Post a minimal reproducible example so the maintainer can troubleshoot the problems you identify. A reproducible example is:
    • Runnable: post enough R code and data so any onlooker can create the error on their own computer.
    • Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
    • Readable: format your code according to the tidyverse style guide.

Description

Hi @wlandau,

It seems like something about the torch serialization isn't working right. Or maybe I'm not quite understanding how the serialization/unserialization works. I thought the serialization/unserialization meant you could pass objects in memory from main to worker.

But here is a reprex that I think showcases the issue. The torch object has to be retrieved by the worker, or else the external pointer is invalid.

Mostly this wouldn't be an issue, but this means torch objects cannot be loaded on the "main" process and passed to the HPC via ssh (and maybe this affects AWS storage/retrieval too, but I don't use it so I can't test it out).
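
To make the expectation above concrete, here is a minimal sketch of the round trip I would expect between "main" and a worker. It assumes that `torch_save()` and `torch_load()` accept raw connections. The variable names (`plain_copy`, `bytes`, `restored`) are only illustrative, and this is not necessarily what targets does internally.

```r
library(torch)

tensor <- torch_zeros(10)

# Plain R serialization copies the R wrapper but not the memory behind the
# external pointer, so using the copy is expected to fail (assumption: the
# failure surfaces as an R error, as in the reprex in this issue).
plain_copy <- unserialize(serialize(tensor, connection = NULL))
plain_result <- tryCatch(
  as.array(plain_copy),
  error = function(e) conditionMessage(e)
)

# "Main" side: write the tensor into an in-memory raw vector with torch_save().
con <- rawConnection(raw(), open = "wr")
torch_save(tensor, con)
bytes <- rawConnectionValue(con)
close(con)

# "Worker" side: rebuild a tensor with a fresh, valid pointer from the bytes.
con <- rawConnection(bytes, open = "r")
restored <- torch_load(con)
close(con)

as.array(restored)
```

If the raw vector `bytes` is what travels to the worker, the tensor should be usable there without the worker reading the file from storage itself.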

Reproducible example

tar_make_clustermq() with "main" retrieval

library(targets)

tar_script({
  library(targets)
  library(torch)
  
  options(clustermq.scheduler = "multiprocess")
  
  tar_option_set(
    packages = c("torch"), 
    retrieval = "main"
    )
  
  tar_pipeline(
    tar_target(
      tensor, 
      torch_zeros(10), 
      format = "torch"
      ),
    tar_target(
      test, 
      as.array(tensor)
      )
  )
})

tar_make_clustermq()
#> ● run target tensor
#> ● run target test
#> x error target test
#> Warning in self$crew$finalize() : Unclean shutdown for PIDs: 24760
#> Error : external pointer is not valid .
#> Error: callr subprocess failed: external pointer is not valid .
tar_read(tensor)
#> torch_tensor
#>  0
#>  0
#>  0
#>  0
#>  0
#>  0
#>  0
#>  0
#>  0
#>  0
#> [ CPUFloatType{10} ]
tar_read(test)
#> Error in gzfile(file, "rb"): invalid 'description' argument

tar_make_clustermq() with "worker" retrieval

library(targets)

tar_script({
  library(targets)
  library(torch)
  
  options(clustermq.scheduler = "multiprocess")
  
  tar_option_set(
    packages = c("torch"), 
    retrieval = "worker"
    )
  
  tar_pipeline(
    tar_target(
      tensor, 
      torch_zeros(10), 
      format = "torch"
      ),
    tar_target(
      test, 
      as.array(tensor)
      )
  )
})

tar_make_clustermq()
#> ● run target tensor
#> ● run target test
#> Master: [1.3s 4.6% CPU]; Worker: [avg 74.2% CPU, max 3329675.0 Mb]
tar_read(tensor)
#> torch_tensor
#>  0
#>  0
#>  0
#>  0
#>  0
#>  0
#>  0
#>  0
#>  0
#>  0
#> [ CPUFloatType{10} ]
tar_read(test)
#>  [1] 0 0 0 0 0 0 0 0 0 0
