Restart option added in case of unexpected termination #64
base: master
Conversation
…start job. Delete empty vibration cache file in vib folder for vibration restart job
…irectory to a destination subdirectory
add machine keyword in firework task
Function copyDataAndSave() is added to copy a file from an origin subdirectory to a destination subdirectory
add machine keyword to vibration keywords in setup_adsorbates
add machine keywords to optimization and vibration firework tasks
…t task and delete empty json file in vib folder for vib task
Thanks! I went over most of it.
import os
from fireworks.core.fworker import FWorker


def createCommand(node, software):
I'm okay with support for NERSC and general ALCF machines, but expecting users to write a whole new file for each machine is not tenable. If this is Polaris-specific, we need to redesign. It looks more like a hack to get this to work.
We can simply establish a better protocol to pass this information. I would recommend we write commands in the format:
"mpiexec --hosts {node} ......... {binary} PREFIX...."
Then we can simply tell users in the documentation that if they put {node} or {binary} in the command, Pynta will automatically insert the corresponding values. Then we don't need this.
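A minimal sketch of the placeholder protocol suggested above. The helper name `fill_command` and the example template are hypothetical and not part of Pynta. It uses `str.replace` rather than `str.format`, so any other braces in the command (shell variables, for example) are left untouched.

```python
def fill_command(template, node, binary):
    """Insert the node name and binary into a user-supplied command template.

    Hypothetical helper: only the literal placeholders {node} and {binary}
    are substituted; everything else in the template is passed through.
    """
    return template.replace("{node}", node).replace("{binary}", binary)


# Illustrative template only; a real one would carry the user's own flags.
template = "mpiexec --hosts {node} {binary} PREFIX"
print(fill_command(template, node="node01", binary="pw.x"))
```

With this approach a template that contains neither placeholder is returned unchanged, so existing commands for other machines would keep working.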
import os
from fireworks.core.fworker import FWorker


def createCommand(node, software):
function names should be snake case (also below)
@@ -0,0 +1,297 @@
"""
Clearly this code was taken from somewhere. How is this code licensed? Can we put it in Pynta? Where is it from and is there a reason this isn't included in fireworks?
# TODO: why is loglvl a required parameter??? Also nlaunches and sleep_time could have a sensible default??
def launch_multiprocess2(
This definitely needs a different name. What distinguishes this from launch_multiprocess?
from fireworks.core.fworker import FWorker
import fireworks.fw_config
import logging
#restart RHE
Remove these attribution comments. They are confusing, especially here, and anyone who wants to know who edited a section of the code can look at the blame.
del constraint_dict["type"]
return constructor(**constraint_dict)


def copyDataAndSave(origin, destination, file):
snake case
def copyDataAndSave(origin, destination, file):
    '''
    Function to copy a file from a origin subdirectory to a destination subdirectory.
I don't understand why this function needs to exist. What makes it unviable to just use shutil.copy?
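For reference, a minimal sketch of the same operation built directly on `shutil.copy`. The wrapper name `copy_file` is hypothetical, and nothing is assumed about `copyDataAndSave()` beyond what its docstring states.

```python
import os
import shutil


def copy_file(origin, destination, filename):
    """Copy `filename` from directory `origin` into directory `destination`.

    Returns the path of the new copy, as reported by shutil.copy.
    """
    os.makedirs(destination, exist_ok=True)  # create the target directory if needed
    return shutil.copy(os.path.join(origin, filename), destination)
```

Because `shutil.copy` accepts a directory as its destination, the only extra step here is creating that directory when it does not exist yet.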
vib_obj_dict = {"software": self.software, "label": adsname, "software_kwargs": software_kwargs,
"machine": self.machine, "constraints": ["freeze up to "+str(self.nslab)]}
vib_obj_dict = {"software": self.software, "label": ad, "software_kwargs": software_kwargs,
Something is messed up with the changes in this line. I think you were making changes in one commit and then undoing them in another.
pynta/main.py
Outdated
print(f'No directory named "vib" found in {src}.')
print('No vibration calculations executed: Check optimization runs are finished and optimized geometries are collected.')

Remove these whitespace changes.
pynta/main.py
Outdated
xyz = os.path.join(self.path,"Adsorbates",adsname,str(prefix),str(prefix)+".xyz")
xyzs.append(xyz)
fwopt = optimize_firework(os.path.join(self.path,"Adsorbates",adsname,str(prefix),str(prefix)+"_init.xyz"),
self.machine,self.software,"weakopt_"+str(prefix),
Most of this should probably have been squashed with the changes in main, which may help resolve some of the doing-and-undoing changes.
This PR includes two major updates. The updated restart option restarts optimization or vibration calculations if previous runs were unexpectedly terminated. This PR also allows users to run Pynta efficiently on the ALCF Polaris machine. Details are described below:
Upon running `restart()`, it retrieves the Fireworks workflow information, including the workflow ID number, task ID numbers, task states, and the launch directories where the unexpectedly terminated calculations were running. Before rerunning Fireworks for the incomplete runs, all necessary files, such as optimization trajectory files or vib folders, are copied to the destination directory. If a task's state is not completed (e.g., fizzled or lost runs), the optimization restarts from the last geometry of the optimization trajectory file in the previous launch directory. In the case of a vibration restart, empty vibration JSON files are deleted from the vib folder before rerunning the vibration calculation.
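The vibration-restart cleanup described above could look roughly like the sketch below. The function name `remove_empty_json` is hypothetical, and "empty" is assumed here to mean a zero-byte file; the implementation in this PR may differ.

```python
from pathlib import Path


def remove_empty_json(vib_dir):
    """Delete zero-byte *.json files in `vib_dir` and return their names."""
    removed = []
    for path in sorted(Path(vib_dir).glob("*.json")):
        # Assumption: a zero-byte file marks a displacement that never finished.
        if path.is_file() and path.stat().st_size == 0:
            path.unlink()
            removed.append(path.name)
    return removed
```

Returning the removed file names makes it easy to log which displacements will be recomputed on restart.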
With Raymundo's efforts, this PR allows Pynta to run on ALCF Polaris with a single queue allocation. Raymundo updated the way Pynta maps tasks to nodes on ALCF machines. Each task runs on a different Fireworker, and each Fireworker is associated with a node. This is available for `multilauncher`. The optimal approach is to set `num_jobs` in the Pynta input script to the number of nodes.
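The one-worker-per-node idea can be illustrated with plain Python, independent of Fireworks. The function name `map_workers_to_nodes` and the `worker_<i>` naming are hypothetical. Raising an error when `num_jobs` exceeds the number of nodes is also an assumption made for this sketch, not necessarily what the PR does.

```python
def map_workers_to_nodes(nodes, num_jobs):
    """Associate each of `num_jobs` workers with its own node name."""
    unique_nodes = list(dict.fromkeys(nodes))  # drop duplicates, keep order
    if num_jobs > len(unique_nodes):
        # Assumption: more workers than nodes is treated as a configuration error.
        raise ValueError(
            f"num_jobs={num_jobs} exceeds the {len(unique_nodes)} available nodes"
        )
    return {f"worker_{i}": node for i, node in enumerate(unique_nodes[:num_jobs])}


# Placeholder node names; on a real allocation they would come from the scheduler.
print(map_workers_to_nodes(["node01", "node02", "node03"], num_jobs=3))
```

Setting `num_jobs` equal to the number of nodes, as recommended above, gives every node exactly one worker in this scheme.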