Showing posts with label acml. Show all posts

23 May 2013

430. Strange issue with NWChem, openmpi, SGE and ECCE

This one's a bit odd.

Odd in the sense that

  • the math libs (acml) I'm using should be suitable for the processors that I'm using them for.
  • it only happens when I submit with ECCE + SGE. Calcs on the input files are fine if I launch them by hand



The problem:
I'm having issues launching jobs on two nodes where the nwchem 6.3 binaries were compiled against acml 5.3.1 (gfortran, int64). I'm launching the jobs from ECCE, and I've had SGE set up and working for a long time. My two other nodes, one i5-2400 linked against openblas, and one AMD FX 8150 linked against acml 5.3.1 (gfortran, fma4, int64) work absolutely fine.

Both binaries were linked with acml using
export BLASOPT="-L/opt/acml/acml5.3.1/gfortran64_int64/lib -lacml"
export LIBRARY_PATH="$LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/acml/acml5.3.1/gfortran64_int64/lib"

The first node is an AMD Phenom II X6 1055T, while the second one is an ancient, recently-revived AMD Athlon 64 X2 3800+. The acml util cpuid.exe gives
Chip manufacturer: AuthenticAMD
AuthenticAMD family 15 extended family 1 model 10
Model Name: AMD Phenom(tm) II X6 1055T Processor
Chip supports SSE
Chip supports SSE2
Chip supports SSE3
Chip does not support AVX
Chip does not support FMA3
Chip does not support FMA4
and
Model Name: AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
Chip supports SSE
Chip supports SSE2
Chip supports SSE3
Chip does not support AVX
Chip does not support FMA3
Chip does not support FMA4
respectively. On the AMD Phenom II X6 1055T I kept getting
Scaling coordinates for geometry "geometry" by 1.889725989
(inverse scale = 0.529177249)
0:Illegal Instruction error, status=: 4
(rank:0 hostname:boron pid:12386):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/signaltrap.c:SigIllHandler():276 cond:0
On the Athlon 64 X2 3800+ the job would just exit at
Directory information
---------------------
0 permanent = .
0 scratch = /home/me/scratch
There would be no other errors (in e.g. .po or .o files).

If I launch the job by hand, e.g.
mpirun -n 6 nwchem nwch.nw
it works fine.
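
Since the same input behaves differently by hand and under ECCE + SGE, one thing worth comparing is the environment each launch actually sees. The following is only a sketch of how that could be recorded; `dump_env` is a hypothetical helper name, and I'm assuming `nwchem` is found via PATH in both cases.

```shell
#!/bin/sh
# Hypothetical helper: record the bits of the environment that decide
# which math library gets loaded at run time.
dump_env() {
    # $1: file to write the report to
    {
        echo "host=$(uname -n)"
        echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-}"
        echo "PATH=$PATH"
        nw=$(command -v nwchem 2>/dev/null || true)
        if [ -n "$nw" ]; then
            echo "nwchem=$nw"
            # Show which shared libraries this binary resolves to
            ldd "$nw" 2>&1 || true
        else
            echo "nwchem=not found on PATH"
        fi
    } > "$1"
}

# Usage (left commented out so that sourcing this file has no side effects):
#   dump_env "$HOME/env.manual.txt"   # in an interactive shell on the node
#   dump_env "$HOME/env.sge.txt"      # at the top of the submitted job script
#   diff "$HOME/env.manual.txt" "$HOME/env.sge.txt"
```

Any difference in the nwchem path or in the resolved libraries between the two reports would be a place to start looking; it may well show nothing.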



The partial solution
The errors for the AMD Phenom II X6 1055T went away when I used openblas instead of acml:
export BLASOPT="-L/opt/openblas/lib -lopenblas"
export LIBRARY_PATH="$LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/openblas/lib"

See e.g. http://verahill.blogspot.com.au/2013/05/424-nwchem-63-on-debian-wheezy.html for general compilation instructions.

The odd thing:
With openblas the AMD Athlon X2 3800+ suddenly gives
Scaling coordinates for geometry "geometry" by 1.889725989
(inverse scale = 0.529177249)
0:Illegal Instruction error, status=: 4
(rank:0 hostname:beryllium pid:9267):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/signaltrap.c:SigIllHandler():276 cond:0

19 May 2013

422. Set up ACML on linux

These are the same instructions as in post 409B. However, I've decided it's better to do the posts the unix/linux way -- have them do one thing, and do that thing well. It makes life easier for me if I can simply refer back to more modular posts.

Anyway, here's how to set up the ACML libs on debian.

ACML
Download both the 'regular' and the int64 gfortran packages from AMD:
http://developer.amd.com/tools-and-sdks/cpu-development/amd-core-math-library-acml/acml-downloads-resources/#download

tar xvf acml-5-3-1-gfortran-64bit-int64.tgz
tar xvf acml-5-3-1-gfortran-64bit.tgz
sh install-acml-5-3-1-gfortran-64bit-int64.sh
Where do you want to install ACML? Press return to use the default
location (/opt/acml5.3.1), or enter an alternative path.
The directory will be created if it does not already exist.
> /opt/acml/acml5.3.1
sh install-acml-5-3-1-gfortran-64bit.sh
Where do you want to install ACML? Press return to use the default
location (/opt/acml5.3.1), or enter an alternative path.
The directory will be created if it does not already exist.
> /opt/acml/acml5.3.1
You'll get something like this:
/opt/acml/acml5.3.1
|-- Doc
|-- gfortran64
|-- gfortran64_fma4
|-- gfortran64_fma4_int64
|-- gfortran64_fma4_mp
|-- gfortran64_fma4_mp_int64
|-- gfortran64_int64
|-- gfortran64_mp
|-- gfortran64_mp_int64
`-- util

where
*  fma4 is for cpus with FMA4 support (use util/cpuid to check)
*  int64 is for 64-bit integers (integer*8), not for double-precision floats
*  mp is for openmp. For MPI do not use the _mp_ libraries!

Pick your library/ies and add them to the LD_LIBRARY_PATH, e.g.:
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/acml/acml5.3.1/gfortran64_int64/lib' >> ~/.bashrc
source ~/.bashrc
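
As a rough sketch of the "pick your library" step, the choice between the fma4 and non-fma4 directories can be made from the CPU flags that linux reports in /proc/cpuinfo. `pick_acml_variant` is a made-up helper name, the ACML_ROOT default is the install path used above, and I'm assuming you want the int64, non-mp libraries (which is what I use for nwchem).

```shell
#!/bin/sh
# Install location used in this post; override ACML_ROOT if yours differs.
ACML_ROOT=${ACML_ROOT:-/opt/acml/acml5.3.1}

# Hypothetical helper: map a list of CPU flags to an ACML directory name.
pick_acml_variant() {
    # $1: space-separated CPU flags, e.g. the "flags" line of /proc/cpuinfo
    case " $1 " in
        *" fma4 "*) echo "gfortran64_fma4_int64" ;;
        *)          echo "gfortran64_int64" ;;
    esac
}

# Take the flags of the first CPU entry (linux only; empty elsewhere).
cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)
variant=$(pick_acml_variant "$cpu_flags")
echo "Suggested library path: $ACML_ROOT/$variant/lib"
```

This only looks at the fma4 flag; util/cpuid.exe is still the tool that ships with ACML for checking a CPU.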

04 September 2012

226. ACML libs and nwchem -- what libs to choose to avoid 'Singularity in Pulay matrix' hang.

The problem:
If I compile nwchem against the acml libs (gfortran64_fma4 in acml-5-2-0-gfortran-64bit.tgz) everything appears to go fine, but once I try to run stuff I get


           Memory utilization after 1st SCF pass:
           Heap Space remaining (MW):       12.94            12937848
          Stack Space remaining (MW):       13.11            13107006
   convergence    iter        energy       DeltaE   RMS-Dens  Diis-err    time
 ---------------- ----- ----------------- --------- --------- ---------  ------
 d= 0,ls=0.0,diis     1    -74.9488845804 -7.49D+01  1.85D-02  1.70D-01     0.4
  Singularity in Pulay matrix. Error and Fock matrices removed.


and then the node hangs with 100% CPU.

The (obvious) solution:
To some this will be obvious, but to someone not skilled in the art, like myself, it isn't.
Of course, I could've just RTFM...but what academic does that?
"ACML and MKL can support 64-bit integers if the appropriate library is chosen. For MKL, one can choose the ILP64 Version of Intel® MKL, while for ACML the int64 libraries should be chosen, e.g. in the case of ACML 4.4.0 using a PGI compiler /opt/acml/4.4.0/pgi64_int64/lib/libacml.a"
So, when you go to download your libraries from the AMD website, make sure to download at a minimum the 64-bit integer file (e.g. acml-5-2-0-gfortran-64bit-int64.tgz).
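
To avoid repeating my mistake, a small guard can be run before building. This is a minimal sketch that only looks at the directory name in BLASOPT, on the assumption that the int64 libraries live in a directory whose name ends in _int64 (as they do in the ACML tarballs); `check_int64_blasopt` is a hypothetical helper, not part of nwchem or ACML.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given linker options point at a
# library directory whose name ends in _int64.
check_int64_blasopt() {
    # $1: the value of BLASOPT
    case "$1" in
        *_int64/*|*_int64) return 0 ;;
        *)                 return 1 ;;
    esac
}

if check_int64_blasopt "${BLASOPT:-}"; then
    echo "BLASOPT points at an int64 library directory"
else
    echo "Warning: BLASOPT does not look like an int64 library: ${BLASOPT:-unset}" >&2
fi
```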

How I built nwchem:

export LARGE_FILES=TRUE
export TCGRSH=/usr/bin/ssh
export NWCHEM_TOP=`pwd`
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES="all"
#export PYTHONVERSION=2.7
export PYTHONHOME=/usr
export BLASOPT="-L/opt/acml/acml5.2.0/gfortran64_fma4_int64/lib -lacml"
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/usr/lib/openmpi/lib
export MPI_INCLUDE=/usr/lib/openmpi/include
export LIBRARY_PATH="$LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/acml/acml5.2.0/gfortran64_fma4_int64/lib"
export LIBMPI="-lmpi -lopen-rte -lopen-pal -ldl -lmpi_f77 -lpthread"
cd $NWCHEM_TOP/src
make clean
make nwchem_config
make FC=gfortran 2> make.err 1>make.log
export FC=gfortran
cd ../contrib
./getmem.nwchem


Don't forget to add the acml libs to the LD_LIBRARY_PATH in your ~/.bashrc, e.g.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/acml/acml5.2.0/gfortran64_fma4_int64/lib
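
A quick way to confirm that the export took effect in a given shell is sketched below. `path_contains` is a hypothetical helper, and the directory is the one from the build above; adjust it to whichever library you linked against.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a colon-separated list contains a directory.
path_contains() {
    # $1: colon-separated path list, $2: directory to look for
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

acml_lib=/opt/acml/acml5.2.0/gfortran64_fma4_int64/lib
if path_contains "${LD_LIBRARY_PATH:-}" "$acml_lib"; then
    echo "acml is on LD_LIBRARY_PATH"
else
    echo "acml is NOT on LD_LIBRARY_PATH" >&2
fi
```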