CDO on speed

Added by Ralf Mueller over 3 years ago

hi CDO-ninjas!

This is a little bit off topic, because it's not directly about CDO operators but more on how to speed things up in general.
Even with the right algorithm and an efficient implementation, it's very often the IO part that makes things slow. But how to cope with that strongly depends on

  • how IO-intensive your workflow is
  • what are the most costly parts in terms of computations
  • the total number of input files
  • the layout of your input files:
    • number of variables
    • number of timesteps
    • number of grids
    • gridsize
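
To get a feeling for the layout of a file before picking a strategy, the counts from the list above can be queried with CDO's information operators. This is only a minimal sketch, assuming `cdo` is on your PATH; the helper names `layout_commands` and `file_layout` are made up for illustration and not part of CDO.

```python
import subprocess

# CDO information operators that each print a single number
LAYOUT_OPERATORS = {
    "variables": "npar",
    "timesteps": "ntime",
    "grids": "ngrids",
}

def layout_commands(path):
    """Build one 'cdo -s <operator> <file>' call per layout property (hypothetical helper)."""
    return {key: ["cdo", "-s", op, path] for key, op in LAYOUT_OPERATORS.items()}

def file_layout(path):
    """Run the calls and return the counts as integers; needs CDO installed."""
    layout = {}
    for key, cmd in layout_commands(path).items():
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        layout[key] = int(out.split()[0])
    return layout
```

Calling `file_layout` on one representative input is usually enough to decide whether splitting by variable or by timestep is worth it.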

Here is a list of recommendations that might help on one occasion or another. Depending on how you make use of them, they can put heavy (and I mean heavy) load on any machine. Login nodes or any other resources you share with other users are not the right place to run this in production. If you still do, your soon-to-be-not-so-favourite admin will give you a call. Yes, I speak from personal experience. You have been warned.

  1. Use -P <thread-count> for everything related to horizontal interpolation and EOFs. Choose a number that fits the number of threads that can run in parallel on your machine. It does no harm if you choose too many: CDO will then use the maximum number. But do not run several such calls in parallel: this will result in a slow-down, because the threads will interfere with each other. It works in the same way for the remap* and gen* operators. This is a list of all operators that support OpenMP parallelization.
  2. Use process-based parallelization if possible: In case your workflow has a chunk of CDO calls to be executed on a rather large number of input files, you can parallelize this with a tool like GNU parallel. It reads text input from a file or stdin and executes each line of it in parallel with a given number of processes (-j <N>, like make). So the recipe is to put your chunk of calls on a single line (separated with ';'), loop over the input files and pipe this into parallel:
    for file in *.nc; do echo "<long list of CDO commands and all the other tools you might need to run on ${file}>"; done | parallel -v -j 12
    This technique can be used for all stuff you can do on the command line, e.g. generating 1000 plots for a movie.
  3. Scripting languages like Python or Ruby offer very similar functionality: multiprocessing in Python's standard library, the parallel gem for Ruby. The following is an example extracted from the DYAMOND Hackathon at MPI:
    import glob
    from multiprocessing import Pool
    from cdo import Cdo
    cdo = Cdo()
    def cdozonmean(infile):
        # zonal mean of one input; returns the name of the output file
        print('processing ' + infile)
        ofile = cdo.zonmean(input=infile)
        return ofile
    ntasks = 4
    nicam_path = '/work/ka1081/DYAMOND/NICAM-7km/'
    files = sorted(glob.glob(nicam_path + '*/'))[0:4]
    pool = Pool(ntasks)
    results = dict()
    # submit one asynchronous task per input
    for file in files:
        print(file)
        results[file] = pool.apply_async(cdozonmean, (file,))
    # retrieve results, keeping the order of the input files for output files and cat to vfile
    for k, v in results.items():
        results[k] = v.get()
    # wrk_dir is expected to hold the path of your working directory
    vfile = cdo.cat(input=' '.join([results[x] for x in files]), output=wrk_dir + '')
  4. In order to reduce IO for each CDO call, it might be useful to split input files along different dimensions (variables, timesteps, grids,...) and loop over them with the tools mentioned above. CDO offers a long list of operators for that purpose, please check it with
    cdo -h split
    Always make sure IO is as cheap as possible, but not cheaper.
  5. Write intermediate output to fast IO buffers like /tmp or, better, /dev/shm. Both directories are usually mapped into RAM, so the IO cost is negligible. But space is very limited there. Keep track of what you do and (re)move things as soon as possible.
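
Recommendations 1 and 5 can be combined in a few lines of Python. Treat this as a sketch under assumptions, not a reference implementation: the helper names `scratch_dir`, `remap_command` and `remap_then_average`, the target grid `r360x180` and all file names are purely illustrative, and falling back to the default temp directory when /dev/shm is missing is a choice made here.

```python
import os
import shutil
import subprocess
import tempfile
from contextlib import contextmanager

@contextmanager
def scratch_dir(preferred="/dev/shm"):
    """Create a private scratch directory and remove it again on exit."""
    # Use the RAM-backed directory if it exists and is writable,
    # otherwise let tempfile pick the default temp location.
    usable = os.path.isdir(preferred) and os.access(preferred, os.W_OK)
    path = tempfile.mkdtemp(prefix="cdo_", dir=preferred if usable else None)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)

def remap_command(infile, outfile, grid="r360x180", threads=None):
    """Build a 'cdo -P <n> remapbil,<grid>' call as an argument list."""
    available = os.cpu_count() or 1
    # Never ask for more threads than the machine reports
    threads = min(threads or available, available)
    return ["cdo", "-P", str(threads), f"remapbil,{grid}", infile, outfile]

def remap_then_average(infile, outfile):
    """Two-step workflow; the intermediate file never touches the disk (needs CDO)."""
    with scratch_dir() as tmp:
        intermediate = os.path.join(tmp, "remapped.nc")
        subprocess.run(remap_command(infile, intermediate), check=True)
        subprocess.run(["cdo", "timmean", intermediate, outfile], check=True)
```

The context manager takes care of the "(re)move things as soon as possible" part: the scratch directory disappears even if one of the CDO calls fails.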