Improving CDO performance

Added by Pauline Millet about 3 years ago

Hello all,

I'm building an application that uses CDO to compute climate indicators on the fly. To keep waiting times short for the user, I'd like to improve CDO performance as much as possible.
I'm using the CDO Python bindings. Here is an example of the code I run for my computation, which I would like to speed up:

import multiprocessing as mp
from cdo import Cdo

cdo = Cdo()

def compute_my_indicator(ifile, lat, lon):
    myxarray = cdo.yearsum(
        input = f"-select,month=6/10 -remapnn,lon={lon}/lat={lat} -mul {ifile} -gec,25 {ifile}",
        returnXArray = "pr"
    )
    return myxarray

with mp.Pool(mp.cpu_count()) as pool:
    result = pool.starmap(compute_my_indicator, [["file1.nc", lat, lon], ["file2.nc", lat, lon], ..., ["fileN.nc", lat, lon]])

So far, I have tried the following strategies, which I found while reading the docs and forum discussions:

- Chain operators as much as possible --> I make a single call to CDO
- Reduce input file size as much as possible --> my input files are limited to my area of interest and the years I focus on
- Use multiprocessing --> implemented as shown in the example code above. It reduced processing time from ~45 seconds to ~20 seconds.
- Convert files to netCDF3 format (better than netCDF4) --> it reduced processing time from ~20 seconds to ~6 seconds (a minimal conversion sketch follows below).
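For reference, the conversion is a one-off step; a minimal sketch with the Python bindings (file names here are just placeholders) could look like this:

from cdo import Cdo

cdo = Cdo()

# "-f nc" writes classic netCDF (netCDF3); "copy" simply rewrites the file once
cdo.copy(input="pr_day_france_nc4.nc", output="pr_day_france_nc3.nc", options="-f nc")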

Any idea how I could reduce the processing time even further?

Thanks,
Pauline


Replies (16)

RE: Improving CDO performance - Added by Brendan DeTracey about 3 years ago

Ralf will have to answer your python binding question. In the meantime, have you read the following topic: https://code.mpimet.mpg.de/boards/53/topics/6672 ?
The big question is: How large are the files you are reading? Do you think your processing is I/O bound? It is interesting that nc4 to nc3 gave such a large speedup simply due to decompression. (I guess the decompression is single threaded.)

On Linux, you can use GNU Parallel to easily parallelize serial jobs. I sometimes dump a bunch of separate cdo commands to a text file and then:

$ env_parallel --jobs 12 < cdo_parallel_jobs.txt
will process 12 of the jobs at a time until they are all finished. You have to play around with the job number to see what saturates your I/O or your RAM bus. I have found that I/O saturation is usually the issue, at least when working with CMIP6 model output. At that point, you either invest in a very fast SSD (I wish) or find computing resources with a parallel file system (I wish again).
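As a rough sketch (file names and the operator chain below are placeholders to adapt), such a job file could be generated from Python and then handed to env_parallel:

# write one complete cdo command per line into a job file for GNU parallel
input_files = ["file1.nc", "file2.nc", "fileN.nc"]   # placeholders

with open("cdo_parallel_jobs.txt", "w") as jobs:
    for ifile in input_files:
        ofile = ifile.replace(".nc", "_yearsum.nc")
        jobs.write(f"cdo yearsum -select,month=6/10 {ifile} {ofile}\n")

# then, on the shell:  env_parallel --jobs 12 < cdo_parallel_jobs.txt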

Sometimes a little shell scripting is the answer.

RE: Improving CDO performance - Added by Ralf Mueller about 3 years ago

Since your call makes use of remapping, you can possibly speed it up in different ways:

  1. use the OpenMP parallelization with the option '-P <numThreads>'. Keep in mind that your program already runs in parallel, so this can put a heavy load on the machine. Maybe it is beneficial, maybe it isn't.
  2. select the months BEFORE doing the remapping. Otherwise you interpolate fields just to throw them away in the next step of the chain. I refer to '-select,month=6/10' here - without knowing any details I would write something like
    "-remapnn,lon={lon}/lat={lat} -mul -select,month=6/10 {ifile} -gec,25 -select,month=6/10 {ifile}"
  3. in case your input and output grids are constant, you can pre-compute the interpolation weights for your single-point grid. In your current implementation this is done for every input file. As an alternative you could compute the weights once with cdo -P 8 gennn .... and just apply them in a parallel way with the Pool object (using the remap operator) - see the sketch below.

But your function uses lon/lat as parameters, so the output indeed varies too much to apply the second idea.
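A minimal sketch of that weights idea, assuming the target point is fixed ('target_point.txt' stands for a CDO grid description file; the threshold part of the chain is left out for brevity):

from cdo import Cdo

cdo = Cdo()

# one-off: compute the nearest-neighbour interpolation weights for the constant target grid
weights = cdo.gennn("target_point.txt", input="file1.nc", options="-P 8")

# per file (e.g. inside the Pool workers): reuse the weights via the remap operator
def compute_my_indicator(ifile):
    return cdo.yearsum(
        input=f"-remap,target_point.txt,{weights} -select,month=6/10 {ifile}",
        returnXArray="pr",
    )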

Finally, you can try to work in memory as much as possible instead of using the disk. How much this pays off depends on your disk (HDD or SSD) and the amount of main memory you have. Using 'returnXArray' implies the creation of a temporary file under the hood; this is usually done in /tmp, which is fast. You could move the input files to '/dev/shm' if you are on a Unix-like system. This will speed up the reading, but it means

  • an extra copy operation
  • you have to remove the files manually after the processing to keep main memory free. Otherwise your system will become unusable. But this can easily be done within the parallel execution: just remove the file with 'os.remove(...)' at the end of compute_my_indicator() - see the sketch after this list
  • Brendan is definitely right: this only works if I/O is a bottleneck here
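Roughly like this (a sketch only, not tested against your data):

import os
import shutil
from cdo import Cdo

cdo = Cdo()

def compute_my_indicator(ifile, lat, lon):
    # copy the input into the RAM-backed filesystem to speed up reading
    shm_file = shutil.copy(ifile, "/dev/shm/")
    try:
        return cdo.yearsum(
            input=f"-select,month=6/10 -remapnn,lon={lon}/lat={lat} -mul {shm_file} -gec,25 {shm_file}",
            returnXArray="pr",
        )
    finally:
        # free the main memory again, otherwise /dev/shm fills up
        os.remove(shm_file)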

There is one thing worth knowing when working with Python via JupyterHub: temp files do not get removed automatically after the processing, because the kernels keep running all the time in this environment. So the /tmp dir can fill up, which will make the compute node unusable and usually requires a reboot. If you use plain scripts, the temp files are removed when the program finishes.

I totally agree with Brendan: GNU parallel is a great tool to do most of this stuff on the command line. But things like dictionaries/hashes are a bit cumbersome in bash/zsh IMO. Or I never got used to them ...

RE: Improving CDO performance - Added by Brendan DeTracey about 3 years ago

I used bash associative arrays this year for the first time. Bash arrays give ample opportunities to shoot oneself in the foot. And the syntax tastes terrible.

RE: Improving CDO performance - Added by Pauline Millet about 3 years ago

Thanks to you two!

First clarification: I'm indeed running on a Unix-like system with an SSD disk, but I'm not using JupyterHub here.
Second clarification: my input files are 1.3 GB each. For each call to my function, 6 of them are used as input for the computation. They cover metropolitan France, with one value per day over 66 years. But we aim to cover a wider area, over a longer period of time, sometime in the future ;)

My processing seems neither I/O bound nor CPU bound (checked with iostat and htop). I reply to your suggestions below:

1. use the OpenMP parallelization with the option '-P <numThreads>'.
I tried it before, but it indeed doesn't improve the computation time.
By the way, is there a more efficient way to select a cell based on its geographic coordinates than remapping?

2. select the months BEFORE doing the remapping.
With the suggested command I got the following error: "Too few streams specified! Operator -mul needs 2 input streams and 1 output stream!". So I tried to first apply the month selection with a separate cdo call, store the output in /tmp, and then use it as the input file in my second cdo call, but I didn't notice any improvement.

3. you can pre-compute the interpolation weights to your single-point-grid.
My grid is constant, but do you think it is worth trying with a 141x95 grid?

I need to test further with GNU parallel given your recommendations; I gave it a try a few weeks ago, but it may not have been set up properly.

I feel that the weak point of the processing here is the application of the threshold (-gec), as it involves multiplying the output with the original file to keep the raw values. But I may be wrong!

RE: Improving CDO performance - Added by Ralf Mueller about 3 years ago

Pauline Millet wrote:

Thanks to you two!

First clarification: I'm indeed running on a Unix-like system with an SSD disk, but I'm not using JupyterHub here.
Second clarification: my input files are 1.3 GB each. For each call to my function, 6 of them are used as input for the computation. They cover metropolitan France, with one value per day over 66 years. But we aim to cover a wider area, over a longer period of time, sometime in the future ;)

My processing seems neither I/O bound nor CPU bound (checked with iostat and htop). I reply to your suggestions below:

Parallel reading of 6 files of 1.3 GB each indeed sounds pretty I/O-bound to me, but an SSD can help a lot. In case you have enough main memory, please run your processing with these files copied to /dev/shm as input.

1. use the OpenMP parallelization with the option '-P <numThreads>'.
I tried before but it indeed doesn't improve the computation time.
By the way, is there a more efficient way to select a cell based on its geographic coordinates than remapping?

OpenMP parallelization is powerful when you have a larger target grid; a single point does not benefit from it - you are right. If you just want to select cells, you can use sellonlatbox instead. In case the result has more than one cell, you can append a -fldmean operator or something similar. That should be faster than remapping.
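Something along these lines (the bounding box values and file name are just an example):

from cdo import Cdo

cdo = Cdo()

# select the few grid cells around the point of interest instead of remapping,
# then reduce the small box to one value per timestep with fldmean
myxarray = cdo.yearsum(
    input="-fldmean -sellonlatbox,1.55,1.56,43.48,43.49 -select,month=6/10 infile.nc",
    returnXArray="pr",
)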

2. select the months BEFORE doing the remapping.
With the suggested command I got the following error: "Too few streams specified! Operator -mul needs 2 input streams and 1 output stream!". So I tried to first apply the month selection with a separate cdo call, store the output in /tmp, and then use it as the input file in my second cdo call, but I didn't notice any improvement.

These changes to the chain are executed in memory. For debugging I'd need more input. You can rerun the script with cdo.debug = True to inspect the exact calls. This will help find the wrong call.
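i.e. something like:

from cdo import Cdo

cdo = Cdo()
cdo.debug = True   # print the underlying cdo command lines as they are executed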

3. you can pre-compute the interpolation weights to your single-point-grid.
My grid is constant, but do you think it is worth trying with a 141x95 grid?

Due to the fact that remapping is in general more costly than selection, we can drop my suggestion from before.

I need to test further with GNU parallel given your recommendations, I gave it a try a few weeks ago but it may not have been implemented properly.

I feel that the weak point of the processing here is the application of the threshold (-gec) as it involves multiplying the output with the original file to keep the raw values. But I may be wrong!

That should be fast, but in order to say more I need to know more. It would be best to upload your script(s), one timestep of at least one input file, and a reasonable list of locations to process.

cheers
ralf

RE: Improving CDO performance - Added by Brendan DeTracey about 3 years ago

Pauline Millet wrote:

2. select the months BEFORE doing the remapping.
With the suggested command I got the following error: "Too few streams specified! Operator -mul needs 2 input streams and 1 output stream!". So I tried to first apply the month selection with a separate cdo call, store the output in /tmp, and then use it as the input file in my second cdo call, but I didn't notice any improvement.

Show us your code! Is it:

    myxarray = cdo.yearsum(
        input = f"-remapnn,lon={lon}/lat={lat} -mul -select,month=6/10 {ifile} -gec,25 -select,month=6/10 {ifile}", 
        returnXArray = "pr"
    )
?
- Is it true that you want 0 as a value?
- If you are going to sum to get totals for days over 25, I strongly suggest you do this before remapping. This is the best practice, if possible. Edit: if you are always choosing nearest neighbour interpolation, it should not matter which op you do first.

RE: Improving CDO performance - Added by Pauline Millet about 3 years ago

Hello,

I tried copying the files to /dev/shm; it did not reduce computation time.
But applying sellonlatbox rather than remapnn had a positive effect. I also retried applying the month selection and sellonlatbox before the threshold application (-gec / -lec), and it's now working well.
I'm now at about 4 seconds of processing time :)

In case you'd like to have a look, here is my full function (it's kind of generic for my project needs):

def myfunction(ifile, bbox, period=None, threshold_min=None, threshold_max=None):
    """
    ifile:          path to input file
    bbox:           string containing the geographic bounding box of the point of
                    interest.
                    Has to be formatted like "lon_min,lon_max,lat_min,lat_max".
    period:         tuple of month numbers, optional.
    threshold_min:  minimum value to keep for computation, optional.
    threshold_max:  maximum value to keep for computation, optional.

    threshold_min and threshold_max cannot be applied together
    """

    (var,) = cdo.showname(input=ifile)

    if period == (1, 12):
        period = None

    if period:
        ifile = cdo.sellonlatbox(
            bbox,
            input=f"-select,month={period[0]}/{period[1]} {ifile}"
        )
    else:
        ifile = cdo.sellonlatbox(bbox, input=ifile)

    if threshold_min:
        threshold_application = f"-mul {ifile} -gec,{threshold_min} "
    elif threshold_max:
        threshold_application = f"-mul {ifile} -lec,{threshold_max} "
    else:
        threshold_application = ''

    myxarray = cdo.yearsum(
        input=f"{threshold_application}{ifile}",
        returnXArray=var,
    )

    return myxarray

and I'm calling it with

example_param_dict = {
    'bbox': "1.55,1.56,43.48,43.49",
    'period': (6, 10),
    'threshold_min': 25
}

files_list = ['sample1.nc', 'sample2.nc']

param_list = [value for value in example_param_dict.values()]
args_list = [tuple(chain([f], param_list)) for f in files_list]

with mp.Pool(mp.cpu_count()) as pool:
    result = pool.starmap(myfunction, args_list)

You'll find attached sample1.nc and sample2.nc

Thanks for your suggestions,
Pauline

PS:

If you are going to sum to get total degree days over 25C, I strongly suggest you do this before remapping. This is the best practice, if possible.

OK, I get the idea behind it. But I suppose it's not relevant anymore since selection is now applied instead of remapping, right?

RE: Improving CDO performance - Added by Brendan DeTracey about 3 years ago

Pauline Millet wrote:

OK, I get the idea behind it. But I suppose it's not relevant anymore since selection is now applied instead of remapping, right?

Yes. And thanks for posting your code too!

RE: Improving CDO performance - Added by Ralf Mueller about 3 years ago

hi Pauline!

What is the 'chain()' method in your code? Is it a wrapper around Pool? naw ....

RE: Improving CDO performance - Added by Ralf Mueller about 3 years ago

If all the data files are so small, could you upload the others, too? I would like to understand the whole processing. Or do you only use these two files for your 4-second measurement?

RE: Improving CDO performance - Added by Pauline Millet about 3 years ago

Hi,

No, sorry, I forgot to share the package imports...
chain is a function from the itertools package (more info: https://www.geeksforgeeks.org/python-itertools-chain/) that helps me build the right shape for the arguments passed to 'myfunction' in pool.starmap().
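For illustration, with the example values from my previous post:

from itertools import chain

param_list = ["1.55,1.56,43.48,43.49", (6, 10), 25]

# chain prepends the file name to the parameter values, giving the positional
# arguments that pool.starmap() passes on to myfunction
args = tuple(chain(["sample1.nc"], param_list))
# -> ('sample1.nc', '1.55,1.56,43.48,43.49', (6, 10), 25)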

About the files, I provided only samples because they are easier to share. The original files are 1.3 GB each, and I use 6 of them as input for my 4-second measurement.

RE: Improving CDO performance - Added by Pauline Millet over 1 year ago

Hello,

I'm re-opening the discussion because, as I said a few messages above, I'm now dealing with more data (larger spatial and temporal coverage).
As expected, computation time has increased with the new input data. I was wondering, since 2 years have passed, whether you are aware of any new tips to optimize my requests and reduce computation time?

Thanks,
Pauline

PS: If you like, you can have a look at the app developed thanks to CDO and your precious advice: https://canari-agri.fr/
It's in French, but I bet you can guess the meaning.
The idea is to present the evolution of climate indicators applied to agricultural issues.
The next version of the app (expected for March or April 2023) will be available in English, German, and Spanish as well.

RE: Improving CDO performance - Added by Ralf Mueller over 1 year ago

Hi Pauline! happy new year ;-)

In order to improve performance, it would be best to share the real, full input data. Maybe you can upload some of it to free storage like Google Drive? Or do you have an account at DKRZ, CSCS or MPIMET?
The second step is to measure which calls take the most time, so that the improvement can be made in the right place.

cheers
ralf

RE: Improving CDO performance - Added by Pauline Millet over 1 year ago

Happy new year to you too!

Sorry for the late answer...
After some investigation, the spatial selection is the longest part of the request. As you suggested in the discussion "CDO on speed" (https://code.mpimet.mpg.de/boards/53/topics/6672), I split the files spatially to keep the computation time reasonable. It's a good move!
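In case it helps others, such a pre-split can be done with sellonlatbox, roughly like this (tile boundaries and file names below are just placeholders, not my actual setup):

from cdo import Cdo

cdo = Cdo()

# cut the big file into longitude bands once, so that later requests only have
# to open the much smaller tile containing the point of interest
lon_bands = [(-5, 0), (0, 5), (5, 10)]   # placeholder tile boundaries

for lon_min, lon_max in lon_bands:
    cdo.sellonlatbox(
        f"{lon_min},{lon_max},41,52",   # lon_min,lon_max,lat_min,lat_max
        input="pr_day_europe.nc",
        output=f"pr_day_lon_{lon_min}_{lon_max}.nc",
    )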

If you're interested, I'll keep you updated when the European version of our app is available.

Cheers,
Pauline

RE: Improving CDO performance - Added by Ralf Mueller over 1 year ago

Pauline Millet wrote in RE: Improving CDO performance:

Happy new year to you too!

Sorry for the late answer...
After some investigation, the spatial selection is the longest part of the request. As you suggested in the discussion "CDO on speed" (https://code.mpimet.mpg.de/boards/53/topics/6672), I split the files spatially to keep the computation time reasonable. It's a good move!

At some point it might be useful to analyze the workflow as a whole. Operator ordering can have a big effect, too. The select operator is usually more efficient than selecting each name, level and time range with separate operators.
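For example (a sketch with placeholder variable names, dates and file names):

from cdo import Cdo

cdo = Cdo()

# one select call with several criteria ...
cdo.select("name=pr,year=2000/2010,month=6/10", input="infile.nc", output="subset.nc")

# ... instead of chaining the individual selection operators
cdo.selmon("6,7,8,9,10", input="-selyear,2000/2010 -selname,pr infile.nc", output="subset2.nc")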

If you need support with the details, let me know.

If you're interested, I'll keep you updated when the European version of our app is available.

yes, please!

have a nice day, Pauline

RE: Improving CDO performance - Added by Pauline Millet about 1 year ago

Hello,

As promised, here is the link to our app for visualizing the evolution of climate indicators focused on the agricultural domain: https://canari-europe.com/
The computation of the indicators is done on the fly thanks to CDO and your precious advice :)
The data used are provided by the EUROCORDEX-11 project.

Thanks for your help again,
Pauline

PS: If you have any feedback, do not hesitate to share it so that we can improve the app!
