Showing posts with label g09d. Show all posts
Showing posts with label g09d. Show all posts

18 January 2016

626. Briefly: Gaussian and cloud computing -- Gaussian G09D with Slurm on aws/ec2

Note: you may want to install awscli and euca2ools. I didn't, so I don't actually know whether they are useful.

My instructions are quite rudimentary since I don't have much time to write these blog posts anymore. Hopefully there's enough information to get you through.


AWS
Either way, sign up for AWS. If you already have an amazon ID I think you can use that. Go to https://aws.amazon.com/

Select Launch an Instance and pick the ubuntu AIM and do Launch and Review. I launched it as a t2.micro instance type, as it is free and it's sufficient for set up but not to run jobs.

Hit launch, and create a new key pair. I called mine myfirstkeypair and saved the pem file in my ~/Downloads folder

In my Downloads folder:
ssh -i "myfirstkeypair.pem" ubuntu@ec2-11-222-33-444.us-west-2.compute.amazonaws.com
I then set a password in the ubuntu AWS image:
sudo passwd ubuntu

I added my id_rsa.pub to ~/.ssh/authorized_keys on the ubuntu AWS image to make logging in via ssh easier -- that way I won't need the pem file.


Set up Gaussian
I then connected with SCP and uploaded my gaussian files -- I went straight for EM64T G09D. It went quite fast at +5 MB/s

scp E6L-103X.tgz ubuntu@ec2-00-111-22-333.us-west-2.compute.amazonaws.com:/home/ubuntu/E6L-103X.tgz

Once that was done, on the ubuntu AWS instance I did:
sudo apt-get install csh 
sudo mkdir /opt/gaussian
cd /opt 
sudo chown ubuntu gaussian -R
cd /opt/gaussian
cp ~/E6L-103X.tgz .
tar xvf E6L-103X.tgz
cd g09
csh bsd/install

echo 'export GAUSS_EXEDIR=/opt/gaussian/g09/bsd:/opt/gaussian/g09/local:/opt/gaussian/g09/extras:/opt/gaussian/g09' >> ~/.bashrc
echo 'export GAUSS_SCRDIR=/home/ubuntu/scratch' >> ~/.bashrc
echo 'export PATH=$PATH:/opt/gaussian/g09' >> ~/.bashrc
source ~/.bashrc 
mkdir ~/scratch ~/jobs

NOTE that you can't run any gaussian jobs under a t2.micro instance. You will have to stop and relaunch as at least a t2.small instance or the jobs will be 'Killed' (that's what is echoed in the terminal when you try to run)
Note that if you terminate an image it will be deleted.

Stop the image and then create a snapshot or an image from it to keep everything you've installed.

Set up Slurm
You'll want a queue manager so that you can launch several jobs in serial. Also, you can set up your script so that it shuts down the image when your job is done to save money.

sudo apt-get update
sudo apt-get install slurm-llnl

ControlMachine=localhost ControlAddr=127.0.0.1 MpiDefault=none ProctrackType=proctrack/pgid ReturnToService=2 SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd SlurmUser=slurm StateSaveLocation=/var/lib/slurm-llnl/slurmctld SwitchType=switch/none TaskPlugin=task/none FastSchedule=1 SchedulerType=sched/backfill SelectType=select/linear AccountingStorageType=accounting_storage/none ClusterName=rupert JobAcctGatherType=jobacct_gather/none SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log SlurmdLogFile=/var/log/slurm-llnl/slurmd.log NodeName=localhost NodeAddr=127.0.0.1 PartitionName=All Nodes=localhost
sudo /usr/sbin/create-munge-key
Edit /etc/default/munge:
OPTIONS=--force
Then run
sudo service slurm-llnl restart
sudo service munge restart 
Test using slurm.batch
#!/bin/bash # #SBATCH -p All #SBATCH --job-name=test #SBATCH --output=res.txt # #SBATCH --ntasks=1 #SBATCH --time=10:00 srun hostname srun sleep 60
and submit with
sbatch slurm.batch
 squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2       All     test   ubuntu  R       0:08      1 localhost

Benchmark:
#!/bin/csh #SBATCH -p All #SBATCH --time=9999999 #SBATCH --output=slurm.out #SBATCH --job-name=benchmark setenv GAUSS_SCRDIR /home/ubuntu/scratch setenv GAUSS_EXEDIR /opt/gaussian/g09/bsd:/opt/gaussian/g09/local:/opt/gaussian/g09/extras:/opt/gaussian/g09 /opt/gaussian/g09/g09< benchmark.in > benchmark.out
Using the same opt/freq benchmark as in post 621.

c4.2xlarge 2h 11 min [1h 20 min] 8 vcpu/16 Gb
c4.4xlarge 1h 15 min [     44 min] 16 vcpu/32 Gb
c4.8xlarge      41 min [     25 min] 36 vcpu/60 Gb

It scales surprisingly well, although not perfectly linearly. It's clear that it's cheaper to use a smaller instance, so if time isn't critical or the larger memory isn't needed, c4.8xlarge is not the first choice.

Dropbox:
You might want to use dropbox to transfer files back and forth, especially finished job files (useful if you shut down the machine using a slurm script as shown below)

cd ~ && wget -O - "https://www.dropbox.com/download?plat=lnx.x86_64" | tar xzf -
~/.dropbox-dist/dropboxd
This computer isn't linked to any Dropbox account... Please visit https://www.dropbox.com/cli_link_nonce?nonce=0011223344556677889900aabbccddeef to link this device. This computer isn't linked to any Dropbox account...

Open that link in a browser, then go back to the terminal.
 
wget -O - https://www.dropbox.com/download?dl=packages/dropbox.py > dropbox.py
sudo mv dropbox.py /usr/local/bin
sudo chmod +x d/usr/local/bin/dropbox.py
dropbox.py autostart y

Now, since you don't want to use up space unnecessarily (you're paying for it after all), exclude as many directories as possible. To exclude all existing dropbox dirs, do
 
cd ~/Dropbox
dropbox.py exclude add `ld -d */`
dropbox.py exclude add `ld *.*`
dropbox.py exclude list

Note that it can't handle directories with spaces in the name, so you'll need to polish the list by hand. Next create a directory where you want to run and store your jobs,e .g.
mkdir ~/Dropbox/aws_jobs

When you run a gaussian job, make sure to specify where the .chk files should end up, e.g.
%chk=/home/ubuntu/scratch/benchmark.chk
so that you don't use up space/bandwidth for your chk files (unless of course you want to).

Stop after execution:
Use a batch script along these lines:
#!/bin/csh #SBATCH -p All #SBATCH --time=9999999 #SBATCH --output=slurm.out #SBATCH --job-name=benchmark setenv GAUSS_SCRDIR /home/ubuntu/scratch setenv GAUSS_EXEDIR /opt/gaussian/g09/bsd:/opt/gaussian/g09/local:/opt/gaussian/g09/extras:/opt/gaussian/g09 /opt/gaussian/g09/g09< benchmark.in > benchmark.out rm /home/ubuntu/scratch/*.* sudo shutdown -h now

15 October 2015

624. Gaussian fails with "traps: l502.exe[12449] general protection ip:d75df7 sp:7f7de40dcce0 error:0 in l502.exe[400000+dc8000]" on i7-5820K

Update: 
* The systems is rock solid with nwchem and ADF.  Only G09 crashes
* Gaussian has now released G09E, and the release notes say: "A Sandybridge/Haswell binary distribution is also available". Remains to be found out if this new version solves the issue. I can't check, as I don't have access to that version.

Update 18 Oct 2015:
TL;DR version: G09D (EM64T and AMD64) crash within the first 30 min to 4 hours. An NWChem job has so far run 6 days without crashing and is still going strong.

Original post:
This isn't much of a post yet. I'm mostly posting this so that people searching online will see that they aren't alone.


I just built a new node:
AU$559 Intel BX80648I75820K 6 Core i7-5820K 3.3Ghz 15MB LGA-2011-V3 (No Heatsink)
AU$407 Gigabyte X99-SLI Intel X99 S2011-3 8xDDR4/4xPCI-E/Intel GBLan/ATX Motherboard
AU$50 DeepCool FrostWin v2.0 CPU cooler
AU$155x2 Patriot 16G Kit (8Gx2) DDR4 2133 Desktop RAM
AU$185 Antec HCG-900 High Current Gamer Gaming PSU
AU$39 Gigabyte N210SL-1GI 1GB GT210 PCI-E VGA Card
AU$68 Seagate 3.5" Barracuda 1TB ST1000DM003 SATA3 7200RPM 64MB HDD (carbon)
AU$76 Antec GX500B-W Dominator Window USB3.0 Gaming Case without PSU

I've got an installation of up-to-date Jessie on it, with the following kernel: Debian 3.16.7-ckt11-1+deb8u5 (2015-10-09) x86_64 GNU/Linux.


When running G09D rev. 01 EM64T I keep getting random errors along these lines (these are collected over a couple of days and between restarts):
[100433.566789] traps: l703.exe[11236] general protection ip:df18ca sp:7fc96f595268 error:0 in l703.exe[400000+a46000] [ 2587.899019] traps: l703.exe[3727] general protection ip:9c9757 sp:7fa6fc436ce0 error:0 in l703.exe[400000+a46000] [26439.755347] traps: l502.exe[3235] general protection ip:ab8a55 sp:7ffe29504c10 error:0 in l502.exe[400000+dc8000] [43030.457126] traps: l502.exe[427] general protection ip:11565a7 sp:7f2ec1fff268 error:0 in l502.exe[400000+dc8000] [ 2587.899019] traps: l703.exe[3727] general protection ip:9c9757 sp:7fa6fc436ce0 error:0 in l703.exe[400000+a46000] [37460.207608] traps: l703.exe[14649] general protection ip:a38ae0 sp:7f1a813cf8c0 error:0 in l703.exe[400000+a46000] [ 8865.403861] traps: l502.exe[12449] general protection ip:d75df7 sp:7f7de40dcce0 error:0 in l502.exe[400000+dc8000]


Sometimes the crashes happen after 30 minutes, sometimes after 3 hours. Most happen within four hours. I seem to remember that one ran up to 12 hours, but nothing's gone beyond that. Some short (1h 30 min) calculations have managed to run to completion.

I've checked each RAM stick with memtest+ -- they are fine -- and they are distributed as recommended in the motherboard manual.

The temperature is running below 40 degrees Celsius.

The harddrive is fine according to SMART.

I log everything every two minutes, and so can go back and look at what happened right before the crash, but there's nothing odd.


My current best three hypotheses are:

* There's an issue with G09D EM64T and the new generation of LGA2011-v3 i7 intel cpus specifically

* There's an issue with any version of G09D and the new generation of LGA2011-v3 i7 intel cpus

* There's an issue with my system which is independent of G09D.

To test, I'll be:
* Do runs with G09D rev. 01 AMD64
                         Done -- this also crashed.

* Do runs with NWChem 6.5 (ifort, mkl)
                         Running -- 6 days so far without a crash!

* Update the BIOS (long shot)

* Remove CPU and check for bent pins (long, arduous shot)

I'll be posting updates...

17 August 2015

621. Very briefly: comparison of different hardware for a single G09 calculation

And another update:
* I've set up an Amazon EC2/AWS instance and have done some benchmarking there too: http://verahill.blogspot.com.au/2016/01/626-briefly-gaussian-g09d-with-slurm-on.html 
Exciting stuff.

Another update:
* I turned off hyperthreading on the i7-4930K to see whether it improved performance. The geometry optimisation took 2 min (4%) longer but the overall runtime was 2 min (2%) shorter. This is probably within the reproducibility error for the measurement.
* I've also added results for an i7-5820K without hyperthreading


Update: I spotted a few mistakes
* the L5430 job ran on a dual-socket machine, so I've multiplied the passmark by two and have replotted
* the X3480 job use the EM64T version of gaussian, not AMD64. I don't have a license to test that system using AMD64.

Original post:
There are lots of potential flaws when comparing the performance of a computational package on different hardware. Thus, it can be difficult to find examples online comparing different hardware using computational chemistry packages which makes it challenging to decide on what hardware to budget for.

So here's a simple comparison of a few different types of hardware for a geovib calculation in Gaussian.

All systems have spinning (7200 rpm) disks and use debian jessie (64). The systems haven't been optimised in any way.

All systems used G09D rev 01 AMD 64 unless otherwise indicated. The amount of time the geometry optimisation took is given within [].

Performance:
2h 15 min. [1h 14 min.] Intel i7-4930K/ 32 Gb ram/ 12 threads
2h 23 min. [1h 25 min.] Intel i7-5820K/ 32 Gb ram/ 6 threads (HT turned off, NOTE)
3h 49 min. [2h 12 min.] AMD FX 8350/ 8 Gb ram/ 8 threads
4h 12 min. [2h 19 min.] Intel i5-2400/ 16 Gb ram/ 4 threads
4h 16 min. [2h 24 min.] AMD Phenom II X6 1055T/ 8 Gb ram/ 6 threads
4h 28 min. [2h 16 min.] dual-socket Intel Xeon L5430/ 16 Gb ram/ 8 threads-- rev A.02
4h 43 min. [2h 47 min.] AMD FX 8150/ 32 Gb ram/ 8 threads
9h 26 min. [5h 18 min.] AMD Athlon II X3 445/ 8 Gb ram/ 3 threads

I also tried the EM64T version of G09D rev 01 and got:
1h 43 min. [57 min.] i7-4930K/ 32 Gb ram/ 12 threads
3h 03 min. [1h 44 min.] i5-2400/16 Gb ram/ 4 threads
4h 21 min. [2h 27 min.] Xeon X3480/ 8 Gb ram/ 8 threads -- rev B.01

Just by switching from the AMD64 to the EM64T version we thus cut the calculation down to 75% of the time for the i7.

 I also turned off hyperthreading for the i7-4930K and ran with six threads using the EM64T version:
 1h 41 min. [59 min.] i7-4930K/ 32 Gb ram/ 6 threads
 1h 35 min. [58 min.] i7-5820K/ 32 Gb ram/ 6 threads (NOTE)


Here's a plot of the run times vs Passmark benchmarks (I've multiplied the Xeon L5430 passmark by 2):

Note that the Xeon L5430 and Xeon X340 store the job files on an NFS (scratch files are local) and are not using G09 rev D.01.

It'd be tempting to draw a line through the Athlon II to the I7-4930K, in which case the Phenom II X6 and the I5-2400 perform much, much better than they should based on the CPU Passmark alone.

Either way, these are my observations. No interpretations or opinion attached.

So what's  the benchmarking job that I used? I actually prefer not to reveal it, as it'd eventually point towards my identity (and you're not supposed to publish gaussian benchmarks...)

Suffice to say that it uses:
rpbe/def2-svp and 459 functions/759 primitives (46 atoms) with opt=(verytight) and integral(ultrafine)