
09 April 2014

570. Briefly: restarting a g09 frequency job with SGE, using same queue

I've had g09 frequency jobs die on me, and analytical frequency jobs in g09 can only be restarted from the .rwf file. Because the .rwf files are around 160 GB, I don't want to copy them back and forth between nodes. It's easier to simply make sure that the restarted job runs on the same node as the original job.


First, figure out which node the job ran on. Assuming that the job number was 445:
qacct -j 445 | grep hostname
hostname compute-0-6.local

Next figure out the PID, as this is used to name the Gau-[PID].rwf file:
grep PID g03.g03out
Entering Link 1 = /share/apps/gaussian/g09/l1.exe PID= 24286.
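The two lookups above can be scripted. Here's a sketch -- extract_gau_pid is a helper of my own making, not anything shipped with g09 or SGE; it just pulls the number out of the "PID=" line:

```shell
#!/bin/sh
# Hypothetical helper: pull the PID out of a g09 output file so that the
# /scratch/Gau-<PID>.rwf path can be scripted rather than typed by hand.
extract_gau_pid() {
    # $1: path to the g09 output file
    grep 'PID=' "$1" | sed -n 's/.*PID= *\([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Usage sketch (assumes SGE's qacct is on the PATH; job id 445 as above):
#   node=$(qacct -j 445 | awk '/hostname/ {print $2}')
#   pid=$(extract_gau_pid g03.g03out)
#   echo "rwf is /scratch/Gau-${pid}.rwf on ${node}"
```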

You can now craft your restart file, g09_freq.restart -- you'll need to make sure that the paths are appropriate for your system:
%nprocshared=8
%Mem=900000000
%rwf=/scratch/Gau-24286.rwf
%Chk=/home/me/jobs/testing/delta_631gplusstar-freq/delta_631gplusstar-freq.chk
#P restart
(having empty lines at the end of the file is important). You'll also need a qsub file, g09_freq.qsub:
#$ -S /bin/sh
#$ -cwd
#$ -l h_rt=999:30:00
#$ -l h_vmem=8G
#$ -j y
#$ -pe orte 8
export GAUSS_SCRDIR=/tmp
export GAUSS_EXEDIR=/share/apps/gaussian/g09/bsd:/share/apps/gaussian/g09/local:/share/apps/gaussian/g09/extras:/share/apps/gaussian/g09
/share/apps/gaussian/g09/g09 g09_freq.restart > g09_freq.out
Then submit it to the correct queue by doing
qsub -q all.q@compute-0-6.local g09_freq.qsub

The output goes to g09_freq.out. You know the restart worked properly if it says
Skip MakeAB in pass 1 during restart.
Resume CPHF with iteration 214.

Note that restarting analytical frequency jobs in g09 can be a hit-and-miss affair. Jobs that ran out of time are easy to restart, and some jobs that died silently have also been restarted successfully. On the other hand, a job that died because my resource allocations ran out couldn't be restarted, i.e. the restart started the frequency job from scratch. The same happened on a node of mine that seems to have a dodgy PSU. Finally, I also couldn't restart jobs that died silently after all the RAM had been allocated to g09 without leaving any for the OS (or at least that's the current best theory). It may thus be a good idea to back up the .rwf file every now and again, in spite of its unwieldy size.
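Along those lines, a trivial backup helper -- entirely my own sketch, nothing g09-specific -- that keeps one previous copy of the .rwf next to the live file:

```shell
#!/bin/sh
# Copy the live .rwf to a .bak next to it. With 160 GB files this is slow,
# so run it sparingly (e.g. from cron every few hours), not in a tight loop.
backup_rwf() {
    # $1: path to the live .rwf file
    [ -f "$1" ] || return 1
    cp -p "$1" "$1.bak"
}
```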

13 June 2012

189. Thoughts on restarting NWChem jobs in ECCE

UPDATE: Because all my nodes are working hard to keep my office warm in the Australian winter, I haven't tested this very extensively, but it seems like freq jobs can be restarted using
           reuse oldhessian.hess
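In input-deck terms, that directive goes inside the freq block. A minimal sketch -- the hessian filename is the example one from above, and the task line assumes a DFT frequency job:

```
freq
  reuse oldhessian.hess
end
task dft frequency
```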
Original post
As often is the case, this is as much a note to myself as a blog post.

Or that's how it started out. I've since spent a bit of time testing different restart options, since I found that some of them paradoxically led to longer calculations...

The jobs I've experimented with are very short, so the error margin is probably huge.

What I tried:

A. Task dft geometry:
1. Original job - 215 s
2. Substituting Start with restart in the same directory as A.1 (i.e. loaded db)  - 204 s
3. Same as A.2, but also deleted geometry section - 43 s
4. Same as A.2 and A.3, but loaded movecs explicitly - 43 s
5. Used start, but loaded movecs from job B.3. - 267 s

Comment: I'm not sure whether A.5 is consistently slower than A.1, but I've never seen it go faster. A.3 looks like a good bet when resuming a calculation.

B. Task dft energy:
1. Original job - 41 s
2. Deleted geometry, loaded movecs, db from B.1 - 8.7 s
3. Used start directive, kept geometry, loaded movecs from B.1, no db - 8.7 s

Comment: loading movecs (B.3) seems like a winner

C. Task dft frequency
0. Original job - 952 s
1. Deleted geom, loaded movecs, db from task energy (B.2 above) - 853 s
2. Deleted geometry, loaded movecs from opt, put drv.hess, .hess, fd_ddipole in same directory (from A.1 above) - 842 s
3. Same as C2, but deleted basis block as well, and removed everything from the dft block except direct and vectors - 849 s
4. Copied hessian from 0, and put 'reuse' inside the freq block - 0.1 s (!)
5. Copied hessian from A.1 and put 'reuse' inside the freq block - 0.1 s (but data wrong)

Comment: is C.0 really slower than the other jobs?
The problem with approach C.5 is that while C.4 gives

65.943 kcal/mol, 69.553 cal/mol-K
C.5 gives
288.058 kcal/mol, 64.706 cal/mol-K

Here's a comment from one of the developers, Bert:

"If you just want to redo an energy calculation followed by an ESP calculation, I would never use restart, but just use start and define the geometry in the geometry block. [cut] The restart is purely to continue the calculation that got interrupted, and the runtime database is probably not in a clean enough state to do something completely different with it. You can use the movecs that have been generated as the starting vectors though."

A.5 (no effect) and B.3 (speed-up) would be in line with that approach.

With that in hand, time to work on ECCE.

ECCE is a nice tool, but like any point-and-click program it has its limitations -- it's impossible to predict every single type of usage, and this is particularly true for computational wizardry. To a large extent this is compensated for by the ability to do a 'final edit' in vim before submitting a job -- there is obviously nothing whatsoever that you're prevented from doing at that point, so it offers ultimate flexibility.

There is one major weakness in ECCE though -- reusing files from old jobs.

In particular, this is a weakness when it comes to restarting jobs. In terms of structure this isn't a problem -- the last structure is provided by ECCE via ecce.out. It would be nice to be able to carry over the .movecs files though (I'm still learning, but loading movecs and using a fragment guess seems to be the neatest approach). This is high on the wish list.

Anyway, there are two major use cases:

Restarting an interrupted job
Assuming that you resubmit it without changing the name and to the same cluster, so that it'll run remotely in the same directory: change the start directive to restart, which should tell nwchem to look for the .db file, and edit the scf or dft block to add
vectors input jobname.movecs
Obviously, this isn't an example of great insight, but rather a product of reading the manual.
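Put together, a restarted input deck might look like this sketch -- jobname is a placeholder, the task line assumes a DFT geometry optimisation, and whether the vectors line belongs in the scf or dft block depends on the original job:

```
restart jobname

dft
  vectors input jobname.movecs
end

task dft optimize
```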

Also, the manual states (but does not recommend) that leaving the start/restart directive out will cause nwchem to look for evidence of whether it's a restarted job or not. The problem is that ECCE automatically names all input files nwch.nw, which would cause nwchem to look for nwch.db and fail.

Launching a new calculation based on an old job
Now, if you are duplicating a job, or if you've since renamed the job, you're in a spot of trouble, since ecce doesn't concern itself with the .db and .movecs files. Maybe there's a good reason for this? But if I understand everything correctly, this means that you lose a lot of time on scf cycles that you could avoid by loading the .movecs and .db files.

I think that in the case of the same cluster and a similar directory structure (i.e. the previous job is also a subdirectory of the same parent directory as the new job) you can put this at the beginning of your job:
task shell "cp ../oldjob/*.movecs ."

and either edit the scf or dft block and add
vectors input jobname.movecs

and it actually works.
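For a new job reusing an old job's vectors, the combined sketch would then look like this -- oldjob and newjob are placeholders, and the task shell line has to come before the task dft line so that the file is in place when the vectors are read:

```
start newjob

task shell "cp ../oldjob/*.movecs ."

geometry
  ... (coordinates as usual)
end

dft
  vectors input oldjob.movecs
end

task dft energy
```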

But I had no luck with the equivalent for the .db file:
task shell "cp ../oldjob/*.db ."
and combining it with restart -- nwchem wants the .db file to be present already if you use restart. This, at least in terms of functionality, agrees well with Bert's comment above: same directory is fine, but with a different directory only the movecs are reasonably easy to carry over.

Now, all we need is a tick box to copy the old movecs files between jobs... and the underlying structure to support it. At the moment the movecs files don't get imported, so it would take a bit of editing to get to that point.