Simple CDO operations extremely slow while running as parallel background processes

I'm running a simple GCM on the Cheyenne HPC cluster and have been trying to process the results in parallel as the model runs in blocks of 100 days. On 40 cores, the output is 160 lats by 320 lons, 60 levels, and 200 timesteps (each 100-day block is saved every 12 hours).

For some reason, it appears that even with a sufficient number of cores, the more parallel CDO processes, the slower each process runs. Using other tools like NCO and NCL, more parallel processes were no slower than running a single process. Note that I ensure my code runs on at least as many cores as the number of background processes. Does anyone have any idea how this could have come about?

Here's the script I wrote to test this (results in the comments at the bottom). It runs -zonmean on the NetCDF files (the model was run on 40 cores and produced 40 files in parallel), first processing 1 of these files, then 2 files concurrently, then 4, 10, 20, and 40. I ran this for 40 400MB model files, but have provided a smaller 60MB file you can test with. It may be hard to reproduce if you don't have HPC access.

#!/usr/bin/env bash
# Test the strange bottleneck issue observed with concurrent CDO processes
shopt -s nullglob
dir=~/scratch/test10x4/d0000-d0100 # any will do, just make sure it has contents
cd "$dir" || exit 1 # stop here if the directory is missing

# Get file(s)
i=0
np=0
nprocs=(1 2 4 10 20 40)
file=sample.nc # simple sample file
# files=(*_interp.????.nc) # can't use glob for number because tmp files show up
# nf=${#files[@]}
# [ $nf -eq 0 ] && echo "Error: No files found." && exit 1
echo "Files: ${files[*]:-$file}" # falls back to $file while the glob above is commented out

# Get zonal mean using increasingly more concurrent processes; since 40 cores
# are available, for most programs, this should be fine
# while [ $np -le $nf ] && [ $i -lt ${#nprocs[@]} ]; do
while [ $i -lt ${#nprocs[@]} ]; do
  # First CDO
  np=${nprocs[i]}
  echo $np concurrent CDO processes
  # Option A) Loop over model files
  # for file in ${files[@]::np}; do
  #   time cdo -s -O -zonmean $file ${file%.nc}_tmp.nc 2>/dev/null &
  # done
  # Option B) Read from single file "sample.nc" 
  for j in $(seq 1 $np); do
    time cdo -s -O -zonmean $file ${file%.nc}_${j}.nc 2>/dev/null &
  done
  wait # block until all background jobs finish; each `time` report prints as its job ends
  echo

  # Then compare to NCO
  echo $np concurrent NCO processes
  # Option A) Loop over model files
  # for file in ${files[@]::np}; do
  #   time ncwa -O -a lon $file ${file%.nc}_tmp.nc &
  # done
  # Option B) Read from single file "sample.nc" 
  for j in $(seq 1 $np); do
    time ncwa -O -a lon $file ${file%.nc}_${j}.nc &
  done
  wait
  echo

  # Next
  let i+=1
done

# Times for 400MB files on *40* cores
# | N   | NCO walltime | CDO walltime         |
# | --- | ---          | ---                  |
# | 1   | 3s           | 15s                  |
# | 2   | 3s           | 16s                  |
# | 4   | 3s           | 20s                  |
# | 10  | 4s           | 25-35s               |
# | 20  | 4-5s         | 45-55s               |
# | 40  | 6-7s         | 100-200s, most >150s |
#
# Times for 400MB files on *4* cores -- *no change*
# | N   | NCO walltime | CDO walltime |
# | --- | ---          | ---          |
# | 1   | 3s           | 15s          |
# | 2   | 3s           | 17s          |
# | 4   | 3s           | 20s          |

I am using CDO version 1.9.5, compiled with pthread support and MPI support (which as I understand is only for parallel computations for some particular commands?). It was downloaded with *anaconda, but I had the same issues using the version 1.9.4 provided by the supercomputer. See below.

$ cdo -V
Climate Data Operators version 1.9.5 (http://mpimet.mpg.de/cdo)
System: x86_64-pc-linux-gnu
CXX Compiler: g++ -fPIC -DPIC -g -O2 -std=c++11 -fopenmp -fPIC -DPIC  -m64 -fPIC -fopenmp
CXX version : g++ (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
C Compiler: gcc -std=gnu99 -fPIC -DPIC  -m64 -fPIC -fopenmp
C version : gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
F77 Compiler: gfortran -g -O2
F77 version : GNU Fortran (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
Features: 252GB 72threads C++11 Fortran DATA PTHREADS OpenMP3 HDF5 NC4/HDF5/threadsafe OPeNDAP UDUNITS2 PROJ.4 XML2 CURL FFTW3 SSE2
Libraries: HDF5/1.10.3 proj/4.93 xml2/2.9.8 curl/7.62.0
Filetypes: srv ext ieg grb1 grb2 nc1 nc2 nc4 nc4c nc5
     CDI library version : 1.9.5
 CGRIBEX library version : 1.9.1
GRIB_API library version : 2.9.2
  NetCDF library version : 4.6.1 of Oct 20 2018 02:37:36 $
    HDF5 library version : 1.10.3 threadsafe
    EXSE library version : 1.4.0
    FILE library version : 1.8.3

Attachment: sample_small.nc (61.7 MB)

Replies (2)

RE: Simple CDO operations extremely slow while running as parallel background processes - Added by Luke Davis over 6 years ago

Please also note that the node has 109GB of memory available to the job (I ran with qcmd -l walltime=00:10:00 -l select=1:ncpus=40:mem=109GB -- ./bottleneck_test). So I don't think this is an insufficient memory issue, given that 40 400MB files amount to only 16GB.
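
As a quick sanity check on that estimate, the arithmetic can be written out as below. The file count and size are the ones from the test above; the variable names are only illustrative, and this counts input size only, not what each process actually allocates.

```shell
# Rough upper bound on input data held in memory if every process
# loaded its whole file at once (an assumption, not a measurement).
nfiles=40      # concurrent processes, one file each
file_mb=400    # approximate size of one model file in MB
total_mb=$(( nfiles * file_mb ))
echo "Total input: ${total_mb} MB (~$(( total_mb / 1000 )) GB) of 109 GB requested"
```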

RE: Simple CDO operations extremely slow while running as parallel background processes - Added by Uwe Schulzweida over 6 years ago

Here is the result from our HPC cluster on a dedicated interactive node:

cdo -V
Climate Data Operators version 1.9.5 (http://mpimet.mpg.de/cdo)
System: x86_64-pc-linux-gnu
CXX Compiler: g++ -fPIC -DPIC -g -O2 -std=c++11 -fopenmp -fPIC -DPIC  -m64 -fPIC -fopenmp 
CXX version : g++ (GCC) 4.8.2
C Compiler: gcc -std=gnu99 -fPIC -DPIC  -m64 -fPIC -fopenmp  
C version : gcc (GCC) 4.8.2
F77 Compiler: /sw/rhel6-x64/pgi/pgi-17.7/linux86-64/17.7/bin/pgf77 -g
F77 version : unknown
Features: 1009GB 72threads C++11 DATA PTHREADS OpenMP3 NC4/HDF5 OPeNDAP SSE2
Libraries:
Filetypes: srv ext ieg grb1 nc1 nc2 nc4 nc4c nc5 
     CDI library version : 1.9.5
 CGRIBEX library version : 1.9.1
  NetCDF library version : 4.4.1.1 of May 24 2017 10:31:30 $
    HDF5 library version : 1.8.10 threadsafe
    EXSE library version : 1.4.0
    FILE library version : 1.8.3

# Times for 500MB files on *36* cores
# | N   | NCO walltime | CDO walltime     |
# | --- | ---          | ---              |
# | 1   | 7s           | 3s               |
# | 2   | 7s           | 3s               |
# | 4   | 7s           | 3s               |
# | 10  | 8s           | 3s               |
# | 20  | 9s           | 3s               |
# | 40  | 10-14s       | 3s               |

I tried to make the CDO binary very similar to yours, but I can't reproduce your result.
I monitored resource usage with top and found that the NCO version on our system uses a lot of memory (>1300 MB), while CDO uses only <80 MB. I guess that's the reason why the NCO walltime increases for N>=10.
What does the memory usage look like on your system?
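
One possible way to answer the memory question without watching top by hand is sketched below. It assumes a Linux system where /proc/<pid>/status exposes VmHWM (peak resident set size); peak_rss_kb is a hypothetical helper name, not part of CDO or NCO, and the polling interval means very short-lived peaks could be missed.

```shell
# Run a command in the background and report its peak resident memory in kB
# by polling VmHWM in /proc/<pid>/status until the process exits.
# Linux-only sketch; prints 0 if /proc is not available.
peak_rss_kb() {
  local pid peak=0 hwm
  "$@" >/dev/null &          # command's stdout is discarded so only the peak is printed
  pid=$!
  while kill -0 "$pid" 2>/dev/null; do
    hwm=$(awk '/^VmHWM:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
    [ -n "$hwm" ] && [ "$hwm" -gt "$peak" ] && peak=$hwm
    sleep 0.1
  done
  wait "$pid" || true        # reap the job; ignore its exit status here
  echo "$peak"
}

# Example (assumes cdo is installed and sample.nc exists in the current directory):
#   peak_rss_kb cdo -s -O -zonmean sample.nc sample_zm.nc
#   peak_rss_kb ncwa -O -a lon sample.nc sample_zm.nc
```

Running this once per tool for N=1 and again inside the concurrent loop would show whether per-process memory changes as N grows.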
