11 September 2012

231. Compiling john the ripper: single/serial, parallel/OMP and MPI

Update: updated for v1.7.9-jumbo-7 since hccap2john in 1.7.9-jumbo-6 was broken

For no particular reason at all, here's how to compile John the Ripper on Debian Testing (Wheezy). It's very easy, and this post is probably a bit superfluous. The standard version only supports serial and parallel (OMP). See below for MPI.


The regular version: 

mkdir ~/tmp
cd ~/tmp
wget http://www.openwall.com/john/g/john-1.7.9.tar.gz
tar xvf john-1.7.9.tar.gz
cd john-1.7.9/src

If you don't edit the Makefile you build a serial/single-threaded binary.
If you want to build a threaded version for a single node with a multicore processor (OMP) do:
Edit Makefile and uncomment row 19 or 20

 18 # gcc with OpenMP
 19 OMPFLAGS = -fopenmp
 20 OMPFLAGS = -fopenmp -msse2
make clean linux-x86-64
cd ../run

You now have a binary called john in your ../run folder.


The Jumbo version:
If you want to build a distributed version with MPI (can split jobs across several nodes) you need the enhanced, community version:

sudo apt-get install openmpi-bin libopenmpi-dev
cd ~/tmp
wget http://www.openwall.com/john/g/john-1.7.9-jumbo-7.tar.gz
tar xvf john-1.7.9-jumbo-7.tar.gz 
cd john-1.7.9-jumbo-7/src

Edit the Makefile
  20 ## Uncomment the TWO lines below for MPI (can be used together with OMP as well)
  21 ## For experimental MPI_Barrier support, add -DJOHN_MPI_BARRIER too.
  22 ## For experimental MPI_Abort support, add -DJOHN_MPI_ABORT too.
  23 CC = mpicc -DHAVE_MPI
  24 MPIOBJ = john-mpi.o

and do
make clean linux-x86-64-native
cd ../run

I had a look at the passwords on one of our lab boxes -- it immediately discovered that someone had used 'password' as the password...


These test were run on my old AMD II X3 445. Processes which don't speed up with MP are highlighted in red. LM DES is borderline -- it's faster, but doesn't scale well.

Here's the single thread/serial version:
./john --test
Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     2906K c/s real, 2918K c/s virtual
Only one salt:  2796K c/s real, 2807K c/s virtual
Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     95564 c/s real, 95948 c/s virtual
Only one salt:  93593 c/s real, 93781 c/s virtual
Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw:    14094 c/s real, 14122 c/s virtual
Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw:    918 c/s real, 919 c/s virtual
Benchmarking: Kerberos AFS DES [48/64 4K]... DONE
Short:  474316 c/s real, 475267 c/s virtual
Long:   1350K c/s real, 1356K c/s virtual
Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    39843K c/s real, 39923K c/s virtual
Benchmarking: generic crypt(3) [?/64]... DONE
Many salts:     262867 c/s real, 263393 c/s virtual
Only one salt:  260121 c/s real, 260642 c/s virtual
Benchmarking: Tripcode DES [48/64 4K]... DONE
Raw:    369843 c/s real, 370584 c/s virtual
Benchmarking: dummy [N/A]... DONE
Raw:    99512K c/s real, 99712K c/s virtual
Here's the OMP version:
Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     6706K c/s real, 2555K c/s virtual
Only one salt:  5015K c/s real, 2091K c/s virtual
Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     205670 c/s real, 85411 c/s virtual
Only one salt:  238524 c/s real, 86720 c/s virtual
Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw:    38400 c/s real, 13812 c/s virtual
Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw:    2306 c/s real, 845 c/s virtual
Benchmarking: Kerberos AFS DES [48/64 4K]... DONE
Short:  474675 c/s real, 476581 c/s virtual
Long:   1332K c/s real, 1335K c/s virtual
Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    49046K c/s real, 16785K c/s virtual
Benchmarking: generic crypt(3) [?/64]... DONE
Many salts:     721670 c/s real, 246640 c/s virtual
Only one salt:  699168 c/s real, 239605 c/s virtual
Benchmarking: Tripcode DES [48/64 4K]... DONE
Raw:    367444 c/s real, 369657 c/s virtual
Benchmarking: dummy [N/A]... DONE
Raw:    100351K c/s real, 100552K c/s virtual
And here's the MPI version:
mpirun -n 3 ./john --test
(note that this includes a great many more tests than the default version)
Benchmarking: Traditional DES [128/128 BS SSE2-16]... (3xMPI) DONE
Many salts:     8533K c/s real, 8707K c/s virtual
Only one salt:  7705K c/s real, 8110K c/s virtual
Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... (3xMPI) DONE
Many salts:     279808 c/s real, 282634 c/s virtual
Only one salt:  273362 c/s real, 276096 c/s virtual
Benchmarking: FreeBSD MD5 [128/128 SSE2 intrinsics 12x]... (3xMPI) DONE
Raw:    65124 c/s real, 65781 c/s virtual
Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... (3xMPI) DONE
Raw:    2722 c/s real, 2749 c/s virtual
Benchmarking: Kerberos AFS DES [48/64 4K]... (3xMPI) DONE
Short:  1387K c/s real, 1415K c/s virtual
Long:   3880K c/s real, 3959K c/s virtual

Benchmarking: LM DES [128/128 BS SSE2-16]... (3xMPI) DONERaw:    114781K c/s real, 115940K c/s virtual

I don't quite understand the Kerberos results.



Other targets of interest are:

linux-x86-64-avx         Linux, x86-64 with AVX (2011+ Intel CPUs)
linux-x86-64-xop         Linux, x86-64 with AVX and XOP (2011+ AMD CPUs)
linux-x86-64             Linux, x86-64 with SSE2 (most common)
linux-x86-avx            Linux, x86 32-bit with AVX (2011+ Intel CPUs)
linux-x86-xop            Linux, x86 32-bit with AVX and XOP (2011+ AMD CPUs)
linux-x86-sse2           Linux, x86 32-bit with SSE2 (most common, if 32-bit)
linux-x86-mmx            Linux, x86 32-bit with MMX (for old computers)
linux-x86-any            Linux, x86 32-bit (for truly ancient computers)

The FX 8150 does AVX and XOP, while my 1055T doesn't.

The community version has more options:

linux-x86-64-native      Linux, x86-64 'native' (all CPU features you've got)
linux-x86-64-gpu         Linux, x86-64 'native', CUDA and OpenCL (experimental)
linux-x86-64-opencl      Linux, x86-64 'native', OpenCL (experimental)
linux-x86-64-cuda        Linux, x86-64 'native', CUDA (experimental)
linux-x86-64-avx         Linux, x86-64 with AVX (2011+ Intel CPUs)
linux-x86-64-xop         Linux, x86-64 with AVX and XOP (2011+ AMD CPUs)
linux-x86-64[i]          Linux, x86-64 with SSE2 (most common)
linux-x86-64-icc         Linux, x86-64 compiled with icc
linux-x86-64-clang       Linux, x86-64 compiled with clang
linux-x86-gpu            Linux, x86 32-bit with SSE2, CUDA and OpenCL (experimental)
linux-x86-opencl         Linux, x86 32-bit with SSE2 and OpenCL (experimental)
linux-x86-cuda           Linux, x86 32-bit with SSE2 and CUDA (experimental)
linux-x86-sse2[i]        Linux, x86 32-bit with SSE2 (most common, 32-bit)
linux-x86-native         Linux, x86 32-bit, with all CPU features you've got (not necessarily best)
linux-x86-mmx            Linux, x86 32-bit with MMX (for old computers)
linux-x86-any            Linux, x86 32-bit (for truly ancient computers)
linux-x86-clang          Linux, x86 32-bit with SSE2, compiled with clang
linux-alpha              Linux, Alpha
linux-sparc              Linux, SPARC 32-bit
linux-ppc32-altivec      Linux, PowerPC w/AltiVec (best)
linux-ppc32              Linux, PowerPC 32-bit
linux-ppc64              Linux, PowerPC 64-bit
linux-ia64               Linux, IA-64

10 September 2012

230. ROCKS 5.4.3, ATLAS and Gromacs on Xeon X3480

After doing another round of 'benchmarks' (there are so many factors that differ between the systems that it's difficult to tell exactly what I'm measuring) I'm back to looking at my BLAS/LAPACK.


So here's compiling ATLAS on a cluster made up of six dual-socket mobos with 2x quadcore XeonX3480 CPUs and 8 Gb RAM. The cluster is running ROCKS 5.4.3, which is a spin based on Centos 5.6. We then compile GROMACS using ATLAS and compare it with Openblas. Please note that I am not an expert on optimisations (or computers or anything) so comparing Openblas vs ATLAS won't tell you which one is 'better'. They are just numbers based on what someone once observed on a particular system under a particular set of circumstances.

Hurdles: I first had to deal with the lapack + bad symbols + recompile with -fPIC problem (solved by using netlib lapack and building shared libraries), then encountered the 'libgmx.so: undefined reference to _gfortran_' issue (solved by adding -lgfortran to LDFLAGS).


ATLAS
sudo mkdir /share/apps/ATLAS
sudo chown $USER /share/apps/ATLAS
cd ~/tmp
wget http://www.netlib.org/lapack/lapack-3.4.1.tgz
 wget http://downloads.sourceforge.net/project/math-atlas/Developer%20%28unstable%29/3.9.72/atlas3.9.72.tar.bz2
tar xvf atlas3.9.72.tar.bz2
cd ATLAS/
mkdir build
cd build
.././configure --prefix=/share/apps/ATLAS -Fa alg '-fPIC' --with-netlib-lapack-tarfile=$HOME/tmp/lapack-3.4.1.tgz --shared
OS configured as Linux (1)
Assembly configured as GAS_x8664 (2)
Vector ISA Extension configured as  SSE3 (6,448)
Architecture configured as  Corei1 (25)
Clock rate configured as 3059Mhz
make
DONE  STAGE 5-1-0 at 15:23
ATLAS install complete.  Examine
ATLAS/bin/<arch>/INSTALL_LOG/SUMMARY.LOG for details.

ls lib/
libatlas.a  libcblas.a  libf77blas.a  libf77refblas.a  liblapack.a  libptcblas.a  libptf77blas.a  libptlapack.a  libsatlas.so  libtatlas.so  libtstatlas.a  Makefile  Make.inc
make install

In addition to successful copying you'll also get errors along the lines of

cp: cannot stat `/home/me/tmp/ATLAS/build/lib/libsatlas.dylib': No such file or directory
make[1]: [install_lib] Error 1 (ignored)
cp /home/me/tmp/ATLAS/build/lib/libtatlas.dylib /share/apps/ATLAS/lib/.
cp: cannot stat `/home/me/tmp/ATLAS/build/lib/libtatlas.dylib': No such file or directory
make[1]: [install_lib] Error 1 (ignored)
cp /home/me/tmp/ATLAS/build/lib/libsatlas.dll /share/apps/ATLAS/lib/.
cp: cannot stat `/home/me/tmp/ATLAS/build/lib/libsatlas.dll': No such file or directory
make[1]: [install_lib] Error 1 (ignored)
cp /home/me/tmp/ATLAS/build/lib/libtatlas.dll /share/apps/ATLAS/lib/.
cp: cannot stat `/home/me/tmp/ATLAS/build/lib/libtatlas.dll': No such file or directory
make[1]: [install_lib] Error 1 (ignored)
cp /home/me/tmp/ATLAS/build/lib/libsatlas.so /share/apps/ATLAS/lib/.
cp /home/me/tmp/ATLAS/build/lib/libtatlas.so /share/apps/ATLAS/lib/.
because those files don't exist. 


Gromacs

FFTW3 was first build according to this. The only difference is the install targets (--prefix) -- I put things in /share/apps/gromacs/.fftwsingle and /share/apps/gromacs/.fftwdouble. Gromacs was downloaded and extracted as shown in that post, and /share/apps/gromacs was created.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/openmpi/lib:/share/apps/ATLAS/lib
#single precision
export LDFLAGS="-L/share/apps/gromacs/.fftwsingle/lib -L/share/apps/ATLAS/lib -latlas -llapack -lf77blas -lcblas -lgfortran"
export CPPFLAGS="-I/share/apps/gromacs/.fftwsingle/include -I/share/apps/ATLAS/include/atlas"
./configure --disable-mpi --enable-float --with-fft=fftw3 --with-external-blas --with-external-lapack --program-suffix=_spa --prefix=/share/apps/gromacs
make -j3
make install
#double precision
make distclean
export LDFLAGS="-L/share/apps/gromacs/.fftwdouble/lib -L/share/apps/ATLAS/lib -latlas -llapack -lf77blas -lcblas -lgfortran"
export CPPFLAGS="-I/share/apps/gromacs/.fftwdouble/include -I/share/apps/ATLAS/include/atlas"
./configure --disable-mpi --disable-float --with-fft=fftw3 --with-external-blas --with-external-lapack --program-suffix=_dpa --prefix=/share/apps/gromacs
make -j3
make install
#single + mpi
make distclean
export LDFLAGS="-L/share/apps/gromacs/.fftwsingle/lib -L/share/apps/ATLAS/lib -latlas -llapack -lf77blas -lcblas -lgfortran"
export CPPFLAGS="-I/share/apps/gromacs/.fftwsingle/include -I/share/apps/ATLAS/include/atlas""
./configure --enable-mpi --enable-float --with-fft=fftw3 --with-external-blas --with-external-lapack --program-suffix=_spampi --prefix=/share/apps/gromacs
make -j3
make install
#double + mpi
make distclean
export LDFLAGS="-L/share/apps/gromacs/.fftwdouble/lib -L/share/apps/ATLAS/lib -latlas -llapack -lf77blas -lcblas -lgfortran"
export CPPFLAGS="-I/share/apps/gromacs/.fftwdouble/include -I/share/apps/ATLAS/include/atlas"
./configure --enable-mpi --disable-float --with-fft=fftw3 --with-external-blas --with-external-lapack --program-suffix=_dpampi --prefix=/share/apps/gromacs
make -j3
make install
The -lgfortran is IMPORTANT, or you'll end up with
libgmx.so: 'undefined reference to _gfortran_' type errors.

Performance
I ran a 6x6x6 nm box of water for 5 million steps (10 ns) to get a rough idea of the performance.
Make sure to put
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/share/apps/ATLAS/lib
in your ~/.bashrc, and to include it in your SGE jobs files (if that's what you use).

I allocated 8 Gb RAM and 8 cores for each run.

Double precision:
Openblas: 10.560 ns/day (11.7 GFLOPS, runtime 8182 seconds)
ATLAS  : 10.544 ns/day (11.6 GFLOPS, runtime 8194 seconds)

Single precision:
Openblas: 17.297 ns/dat (19.1 GFLOPS, runtime 4995 seconds)
ATLAS:   17.351 ns/day (19.2 GFLOPS, runtime 4980 seconds)
That's 15 seconds difference on a 1h 20 min run. I'd say they are identical.

07 September 2012

229. Compile ATLAS (+ gromacs, nwchem) on AMD FX 8150 on Debian Testing (Wheezy)

Xianyi's openblas doesn't seem to be ready for AMD FX 8150 yet. I've played with ATLAS in the past, but  for some reason didn't see the same performance with NWChem and ATLAS as I saw with NWChem and Openblas, so I never ended up using it.

I'm also having issues using openblas with CPMD and quantum espresso, and ATLAS is a well-established, respectable project, so it's time to give it another shot. As in most cases in these situations, it's probably a matter of PEBKAC.

Building ATLAS
Anyway. On we go...

mkdir /opt/ATLAS
chown ${USER} /opt/ATLAS
mkdir ~/tmp
cd ~/tmp

wget http://www.netlib.org/lapack/lapack-3.4.1.tgz
 wget http://downloads.sourceforge.net/project/math-atlas/Developer%20%28unstable%29/3.9.72/atlas3.9.72.tar.bz2
tar xvf atlas3.9.72.tar.bz2
cd ATLAS/

Edit ATLAS/Make.top
change the V on line 6 to lowercase i.e. from
- $(ICC) -V 2>&1 >> bin/INSTALL_LOG/ERROR.LOGto
- $(ICC) -v 2>&1 >> bin/INSTALL_LOG/ERROR.LOG
mkdir build/
cd build/


sudo apt-get install cpufreq-utils
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
ondemand
sudo cpufreq-set -g performance

Unfortunately that only takes care of cpu0:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance
but

cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
ondemand
So...since we have 8 cores (cpu0-cpu7):

sudo cp /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
sudo cp /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
sudo cp /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
sudo cp /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
sudo cp /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
sudo cp /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu6/cpufreq/scaling_governor
sudo cp /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor

OK, we're ready to compile:
.././configure --prefix=/opt/ATLAS -Fa alg '-fPIC' --with-netlib-lapack-tarfile=$HOME/tmp/lapack-3.4.1.tgz --shared

Some of the info that's important is:
OS configured as Linux (1)
Assembly configured as GAS_x8664 (2)
Vector ISA Extension configured as  AVXFMA4 (4,496)
Architecture configured as  AMDDOZER (34)
Clock rate configured as 3600Mhz
If that checks out you don't need to manually set your architecture. To get a list over options, do
 make xprint_enums ; ./xprint_enums

If all is well,

make
make install

You should now be done.

Linking Gromacs against ATLAS

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/ATLAS/lib
#single precision
export LDFLAGS="-L/opt/fftw/fftw-3.3.2/single/lib -L/opt/ATLAS/lib -lsatlas -ltatlas -lgfortran"
export CPPFLAGS="-I/opt/fftw/fftw-3.3.2/single/include -I/opt/ATLAS/include"
./configure --disable-mpi --enable-float --with-fft=fftw3 --with-external-blas --with-external-lapack --program-suffix=_spatlas --prefix=/opt/gromacs/gromacs-4.5.5
make -j6 2>make.err 1>make.log
make install

#double precision
make distclean
export LDFLAGS="-L/opt/fftw/fftw-3.3.2/double/lib -L/opt/ATLAS/lib -lsatlas -ltatlas -lgfortran" 
export CPPFLAGS="-I/opt/fftw/fftw-3.3.2/double/include -I/opt/ATLAS/include"
./configure --disable-mpi --disable-float --with-fft=fftw3 --with-external-blas --with-external-lapack --program-suffix=_dpatlas --prefix=/opt/gromacs/gromacs-4.5.5
make -j6 2>make2.err 1>make2.log
make install

#single + mpi
make distclean
export LDFLAGS="-L/opt/fftw/fftw-3.3.2/single/lib -L/opt/ATLAS/lib -lsatlas -ltatlas -lgfortran"
export CPPFLAGS="-I/opt/fftw/fftw-3.3.2/single/include -I/opt/ATLAS/include"
./configure --enable-mpi --enable-float --with-fft=fftw3 --with-external-blas --with-external-lapack --program-suffix=_spmpiatlas --prefix=/opt/gromacs/gromacs-4.5.5
make -j6 2>make3.err 1>make3.log
make install

#double + mpi
make distclean
export LDFLAGS="-L/opt/fftw/fftw-3.3.2/double/lib -L/opt/ATLAS/lib -lsatlas  -ltatlas  -lgfortran" 
export CPPFLAGS="-I/opt/fftw/fftw-3.3.2/double/include -I/opt/ATLAS/include"
./configure --enable-mpi --disable-float --with-fft=fftw3 --with-external-blas --with-external-lapack --program-suffix=_dpmpiatlas --prefix=/opt/gromacs/gromacs-4.5.5
make -j6 2>make4.err 1>make4.log
make install

Linking NWChem against ATLAS

export LARGE_FILES=TRUE
export TCGRSH=/usr/bin/ssh
export NWCHEM_TOP=`pwd`
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES="all"
export BLASOPT="-L/opt/ATLAS/lib -lsatlas -ltatlas"
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/usr/lib/openmpi/lib
export MPI_INCLUDE=/usr/lib/openmpi/include
export LIBRARY_PATH="$LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/ATLAS/lib"
export LIBMPI="-lmpi -lopen-rte -lopen-pal -ldl -lmpi_f77 -lpthread"
export LDFLAGS="-I/opt/ATLAS/include"
cd $NWCHEM_TOP/src
make clean
make nwchem_config
make FC=gfortran 2> make.err 1>make.log
export FC=gfortran
cd $NWCHEM_TOP/contrib
./getmem.nwchem