Optimal operator order?

Added by Brendan DeTracey about 2 years ago

Hi. Me again... :P

In the following command line, what is the optimal order of cdo commands?

cdo -O -a -f nc4 -z zip --cmor \
    -intlevel,50 \
    -select,startdate="$seldate_start_arg",enddate="$seldate_end_arg",levidx=18,19 \
    -mergetime "$file_glob" "$file_out"

Would the following be more optimal?

cdo -O -a -f nc4 -z zip --cmor \
    -intlevel,50 \
    -mergetime "$file_glob" "$file_out" \
    -apply,-select,startdate="$seldate_start_arg",enddate="$seldate_end_arg",levidx=18,19 [ "$file_glob" ] \
    "$file_out"


Replies (9)

RE: Optimal operator order? - Added by Ralf Mueller about 2 years ago

hey, Brendan! Happy to hear from you ;-)

IMO it's a good rule of thumb to select the data before doing anything else, so your idea in the second version seems good: use apply to select things before the merge.
But I think it does not work syntactically, because mergetime should get the output of the apply chain as input, and the file_out should only occur at the very end of the call.
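
For illustration, the corrected ordering would look roughly like this (an untested sketch; whether apply really accepts select is checked further down):

cdo -O -a -f nc4 -z zip --cmor \
    -intlevel,50 \
    -mergetime \
    -apply,-select,startdate="$seldate_start_arg",enddate="$seldate_end_arg",levidx=18,19 [ "$file_glob" ] \
    "$file_out"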

Can you upload two input files? I would love to test the combination of merge, apply and select.

cheers
ralf

RE: Optimal operator order? - Added by Ralf Mueller about 2 years ago

hi again!

thx for the link. I downloaded 2 of the files. they are compressed netCDF4, which is very bad for processing (constant decompression of data/coordinates), so before doing anything else I decompress them.
2nd thing is mergetime: I think you don't need it, because the shell wildcard together with the filename convention already gives the correct temporal order. extra scanning and re-ordering of timesteps does not seem to be needed IMO.
cat should be faster.
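
Something like this should do (sketch, reusing your variables):

cdo -O -a -f nc4 cat "$file_glob" "$file_out"

With the glob quoted, CDO expands the wildcard itself; either way the files arrive in lexical order, which here is the temporal order.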

My attempt to decompress a file with CDO takes ages (after 10 min I am at 5%). I have to check this in more detail first.

RE: Optimal operator order? - Added by Ralf Mueller about 2 years ago

With some help from Uwe I can say more: not only are the data variables compressed, they are also saved with the largest possible chunk size (a single chunk for everything). Similar to the intlevel ticket (#10617), the data is in the worst possible shape to be analyzed with CDO.
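
You can check this yourself (assuming the standard netCDF tools are installed):

ncdump -hs thetao_Omon_CNRM-CM6-1-HR_historical_r1i1p1f2_gn_199501-199912.nc | grep -E '_ChunkSizes|_DeflateLevel'

The -s option prints the special virtual attributes, among them the per-variable chunk sizes and the deflate level.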

My recommendation is: uncompress the data and set a reasonable chunk size (= the horizontal grid size). This can be done with:

nccopy -d 0 -c lev/1 thetao_Omon_CNRM-CM6-1-HR_historical_r1i1p1f2_gn_199501-199912.nc \
                     thetao_Omon_CNRM-CM6-1-HR_historical_r1i1p1f2_gn_199501-199912-noZ.nc
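
Here -d 0 switches deflate compression off, and -c lev/1 requests a chunk length of 1 along the lev dimension, so reading a single level no longer drags the whole variable through the decompressor.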

The individual file size rises from 8 GB to 26 GB, so there is an I/O penalty, but usually the time spent in decompression is far worse.

Here are some tests I did:
  1. $ cdo -a -f nc --cmor -intlevel,50 \
        -select,levidx=18,19 \
        -cat 'thetao_Omon_CNRM-CM6-1-HR_historical_r1i1p1f2_gn_*-noZ.nc' tttt.nc
    cdo(1) select: Process started
    cdo(2) cat: Process started
    cdo(2) cat: Processed 13626900000 values from 2 variables over 120 timesteps.
    cdo(1) select: Processed 363384000 values from 1 variable over 120 timesteps.
    cdo    intlevel: Processed 363384000 values from 1 variable over 120 timesteps [64.95s 375MB].
    
    This should be close to what you initially posted: CMOR output, first the concatenation, then select, then intlevel; no compression.
  2. $ cdo -a -f nc --cmor -intlevel,50 \
        -select,startdate=1999-06-01,enddate=2010-06-01,levidx=18,19 \
        -cat 'thetao_Omon_CNRM-CM6-1-HR_historical_r1i1p1f2_gn_*-noZ.nc' tttt-sel.nc
    cdo(1) select: Process started
    cdo(2) cat: Process started
    cdo(2) cat: Processed 13626900000 values from 2 variables over 120 timesteps.
    cdo(1) select: Processed 36338400 values from 1 variable over 66 timesteps.
    cdo    intlevel: Processed 36338400 values from 1 variable over 12 timesteps [47.46s 375MB].
    This time I included the date selection to check how the processing time drops with the data reduction done by select.
  3. $ cdo -a -f nc --cmor -intlevel,50 \
        -select,startdate=1999-06-01,enddate=2010-06-01,levidx=18,19 \
        'thetao_Omon_CNRM-CM6-1-HR_historical_r1i1p1f2_gn_*-noZ.nc' tttt-sel-noCat.nc
    cdo(1) select: Process started
    cdo(1) select: Processed 36338400 values from 2 variables over 66 timesteps.
    cdo    intlevel: Processed 36338400 values from 1 variable over 12 timesteps [9.34s 369MB].
    
    Finally, a version without merge or cat, because select already is a collective operation over all inputs. This version seems reasonable, because the input files partition the time axis (each timestep occurs in exactly one file).

At the moment select cannot be used with apply, because select accepts an arbitrary number of input files. But maybe the select-only version does the job for you.
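
If you ever do want the apply route, operators that read exactly one input stream work inside it, e.g. sellevidx; an untested sketch (output name arbitrary):

cdo -a -f nc --cmor -intlevel,50 \
    -select,startdate=1999-06-01,enddate=2010-06-01 \
    -mergetime \
    -apply,-sellevidx,18,19 [ thetao_Omon_CNRM-CM6-1-HR_historical_r1i1p1f2_gn_*-noZ.nc ] \
    tttt-apply.nc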

One final point about compressed netCDF: IMO its only useful application is archiving (as a transparent way of saving space). Any kind of analysis on such data should be done on uncompressed input.
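
If the results need to be compressed again for storage, do that once at the very end, e.g. (sketch; deflate level 5 is an arbitrary choice):

cdo -f nc4 -z zip_5 copy tttt-sel-noCat.nc tttt-sel-noCat-zip.nc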

hth
ralf

RE: Optimal operator order? - Added by Brendan DeTracey about 2 years ago

Thanks ralf! It looks like the correct answer to my problem is to uncompress these large files that were sub-optimally chunked by their creators. I wonder if I have enough disk space...

RE: Optimal operator order? - Added by Brendan DeTracey about 2 years ago

I am discovering that most of the CMIP6 ocean model datasets are chunked in this way. Blech...

RE: Optimal operator order? - Added by Ralf Mueller about 2 years ago

hi Brendan!

Unfortunately data suitable for archiving is in most cases not suitable for data analysis.

Funny: Blech is a German word, too. But I guess that's not what you had in mind, right?

RE: Optimal operator order? - Added by Brendan DeTracey about 2 years ago

"Blech" for me comes from comic strips. "Blech" is the noise a comic book character might make when expressing disgust with a situation, perhaps with tongue stuck out like they ate something that tasted terrible.

RE: Optimal operator order? - Added by Ralf Mueller about 2 years ago

I thought so ;-) - in German it means sheet metal or plate
