
Conversation

@swan343 commented Jun 26, 2025

Hi!

I noticed that the documentation here mentions the seed argument, which was present in v0.1.0 but is no longer present in the latest version. This PR adds the seed argument back. With a specified non-zero seed, I now get exactly the same output across multiple preprocess/normalize runs; without it, the outputs differ.

@DarnellGranberry (Collaborator) commented:

Hi @swan343, it looks like the removal was part of a big update @tbepler made in afa606f, so let's check whether it was intentional.

@tbepler (Owner) commented Jun 30, 2025

I removed the seed argument because of multiprocessing. If the random number generator is initialized within each process, you will not get the same result when processing a list of micrographs if you change the number of processes being used or run everything in the main thread. Given this potential inconsistency, I felt it best not to expose the option at all.
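A toy sketch of the concern (hypothetical round-robin scheduling; this is not Topaz's actual worker logic): if each worker gets its own RNG seeded with the same seed, the draw an image receives depends on how images are dealt to workers, so changing the worker count changes the output.

```python
import numpy as np

def per_image_draws(num_workers, num_images, seed=5424):
    # one identically seeded RNG per worker; images dealt round-robin
    rngs = [np.random.RandomState(seed) for _ in range(num_workers)]
    return [float(rngs[i % num_workers].rand()) for i in range(num_images)]

serial = per_image_draws(1, 4)    # one stream consumed in order
parallel = per_image_draws(2, 4)  # two identical streams interleaved
print(serial == parallel)         # False: image 1 gets a different draw
```

With 2 workers, image 1 receives the *first* draw of worker 1's stream (identical to image 0's draw) instead of the *second* draw of a single shared stream, so the per-image results diverge.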

@swan343 (Author) commented Jun 30, 2025

Hi @tbepler, thanks for the response! From my limited testing, the results are repeatable with multiprocessing when the seed argument is given (np.random.RandomState should, in theory, always return the same pseudo-random number generator regardless of when and where it is constructed). Here's a sample test script. I do notice that for at least some .mrc files the GPU yields different results than the CPU; I'm not sure whether that variation also applies across different computers (perhaps that's what you meant?). Using the same computer/device, though, the results are identical.

import glob
import hashlib
import os
import shutil


def get_md5(file_path: str) -> str:
    """Compute the MD5 checksum of a file in 8 KiB chunks."""
    md5 = hashlib.md5()
    with open(file_path, 'rb') as f:
        while True:
            data = f.read(8192)
            if not data:
                break
            md5.update(data)
    return md5.hexdigest()


def run_preprocess(num_workers: int) -> None:
    os.system(' '.join([
        'topaz', 'preprocess', '--device=-1', '--seed=5424',
        f'--num-workers={num_workers}', '--format=png',
        '-s', '4', '-o=/tmp/lu3435571etv8n.tmp/res',
        '/tmp/lu3435571etv8n.tmp/test/*.mrc',
    ]))


# reference run
result = {}
run_preprocess(12)
for path in sorted(glob.glob('/tmp/lu3435571etv8n.tmp/res/*.png')):
    result[path] = get_md5(path)

# repeat with different worker counts and compare checksums
for i in range(10):
    print('testing iteration', i)
    shutil.rmtree('/tmp/lu3435571etv8n.tmp/res', ignore_errors=True)
    os.mkdir('/tmp/lu3435571etv8n.tmp/res')
    run_preprocess(i)
    for path in sorted(glob.glob('/tmp/lu3435571etv8n.tmp/res/*.png')):
        if get_md5(path) != result[path]:
            print("md5 doesn't match! @ iter", i, path, get_md5(path), result[path])
            raise Exception("md5 doesn't match!")
    print('iteration', i, 'passed')

@tbepler (Owner) commented Jul 1, 2025

@swan343 my main concern is actually about consistency between runs with different numbers of workers, or with no workers. With the same number of workers, the results should be the same, because each process's random number generator is independently seeded with the same seed. However, if you change the number of workers, the results will change: for example, starting 2 workers and running them in parallel from the same state is not the same as running 1 worker through all the images from that same starting state.

This is a pretty minor replicability issue, but it would cause inconsistency between runs even when the seed is set, which would be unexpected.

@swan343 (Author) commented Jul 1, 2025

@tbepler thanks for the reply. By workers, are you referring to --num-workers or something else? I can generate exactly the same output with --num-workers ranging from 0 to 9 using the modified script above, on CPU or GPU (which I believe overrides the worker count to 1 anyway).

@tbepler (Owner) commented Jul 3, 2025

Ah, I see. You want to reset the random number generator to the same seed for every image. This means every image will start from the same random initialization. Is that what you want?
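Resetting the generator per image can be sketched like this (a hypothetical process_image helper, not the PR's actual code): because the RNG state at the start of each image is fixed, the output depends only on the image and the seed, never on worker count or scheduling order.

```python
import numpy as np

def process_image(pixels, seed=5424):
    # reseed per image: the added noise depends only on the seed,
    # not on which worker handles the image or in what order
    rng = np.random.RandomState(seed)
    return pixels + rng.standard_normal(pixels.shape)

img = np.zeros((4, 4))
a = process_image(img)
b = process_image(img)  # identical, regardless of call order
assert np.array_equal(a, b)
```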

Do you really need to set the random seed for this function? How much variability are you seeing in the results? I would expect a little, but not major qualitative differences. If there are major differences, setting the seed will lock in that one solution, but I'm not sure it addresses the core problem, which is that the solution is unstable.

I think this is fine, though. I put a few comments on your PR. We can merge it after those are fixed.

@swan343 (Author) commented Jul 3, 2025

Yes, that's exactly the intent. The variability isn't large per se: I do particle masking and see about a 2-5% difference depending on the detection technique and validation criteria. With a fixed seed, the variation is eliminated completely. Appreciate the feedback!
