timesel*** vs nco

Added by Antonio Rodriges over 7 years ago

Hello,

I am interested in what optimizations CDO performs to run 9x faster than NCO.

cdo timselavg,1460 uwnd.10m.gauss.1979.nc u1979.nc
// shape is 1460 x 94 x 192
takes 0.5 sec

while

ncwa -D 4 -O -x -v time --thr_nbr=1 -a time uwnd.10m.gauss.1979.nc n1979.nc

takes 4.48 sec

Disclaimer: I am not a developer of NCO or CDO; I use both of them for my personal projects.

Thanks


Replies (5)

RE: timesel*** vs nco - Added by Antonio Rodriges over 7 years ago

P.S.
I just noticed that CDO produces short values:

short uwnd(time,level,lat,lon) ;
uwnd:standard_name = "eastward_wind" ;
uwnd:long_name = "6-Hourly Forecast of U-wind at 10 m" ;
uwnd:units = "m/s" ;
uwnd:grid_type = "gaussian" ;
uwnd:add_offset = 207.65f ;
uwnd:scale_factor = 0.01f ;

while NCO produces float values:
float uwnd(level,lat,lon) ;
uwnd:long_name = "6-Hourly Forecast of U-wind at 10 m" ;
uwnd:valid_range = -32765s, -8765s ;
uwnd:unpacked_valid_range = -120.f, 120.f ;
uwnd:actual_range = -38.15f, 46.84f ;
uwnd:units = "m/s" ;

Additional question: does CDO take the average on packed (short) values? If yes, what about the precision?

Thanks again

RE: timesel*** vs nco - Added by Uwe Schulzweida over 7 years ago

All CDO operators use as little memory as possible, and this is the reason why the time mean in CDO is much faster. It seems that NCO stores all timesteps of the array in memory, whereas CDO reads the array timestep by timestep and accumulates it immediately.
Memory requirement:
CDO: 2 x 94 x 192 x 8 = 288768 bytes (one timestep plus the accumulated field)
NCO: 1461 x 94 x 192 x 8 = 210945024 bytes (all 1460 timesteps plus the result)
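
To make the streaming idea concrete, here is a minimal sketch in C of a timestep-by-timestep mean. This is not CDO's actual code; read_timestep() is a hypothetical helper standing in for CDO's real record-based I/O, and the grid sizes are taken from the file above.

  #include <stdio.h>
  #include <stdlib.h>

  #define NLAT   94
  #define NLON   192
  #define NSTEPS 1460

  /* Hypothetical stand-in for CDO's record-based I/O: fills 'field' with one
     timestep of the variable (here just synthetic data). */
  static int read_timestep(int step, double *field, size_t len)
  {
    for (size_t i = 0; i < len; i++) field[i] = (double)step;
    return 0;
  }

  int main(void)
  {
    const size_t len = (size_t)NLAT * NLON;
    double *sum   = calloc(len, sizeof *sum);    /* running sum: one field  */
    double *field = malloc(len * sizeof *field); /* current step: one field */
    if (!sum || !field) return 1;

    /* Only two 94 x 192 fields are ever resident in memory, as in the
       "2 x 94 x 192 x 8" estimate above. */
    for (int t = 0; t < NSTEPS; t++) {
      if (read_timestep(t, field, len) != 0) return 1;
      for (size_t i = 0; i < len; i++) sum[i] += field[i];
    }

    for (size_t i = 0; i < len; i++) sum[i] /= NSTEPS;

    printf("time mean of first grid point: %g\n", sum[0]); /* writing omitted */
    free(sum); free(field);
    return 0;
  }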

The format and datatype of the CDO output are derived from the input. That means short values on input produce short output values. Use the option "-b F32" to write 32-bit floats. Internally, all operations are done on 64-bit floats. You can replace "timselavg,1460" with "timavg":

cdo -b F32 timavg uwnd.10m.gauss.1979.nc u1979.nc
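
To make the precision point concrete, here is a small hedged sketch (not CDO's actual source) of the usual netCDF packing convention: the short values are unpacked with scale_factor and add_offset, the average is accumulated in double precision, and the result is packed back to short only when it is written (or kept as float with "-b F32").

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
    /* netCDF packing convention: value = packed * scale_factor + add_offset.
       Attribute values taken from the CDO output shown above. */
    const double scale_factor = 0.01;
    const double add_offset   = 207.65;

    /* Two example packed (short) samples, unpacking to 0.0 and 10.0 m/s. */
    short packed[2] = { -20765, -19765 };

    /* Unpack and accumulate in double precision, as CDO does internally. */
    double sum = 0.0;
    for (int i = 0; i < 2; i++)
      sum += packed[i] * scale_factor + add_offset;
    double mean = sum / 2.0;  /* 5.0 m/s */

    /* Only when writing is the result packed back to short
       (skipped entirely with "-b F32"). */
    short packed_mean = (short)lround((mean - add_offset) / scale_factor);

    printf("mean = %g m/s, repacked short = %d\n", mean, packed_mean);
    return 0;
  }

The repacking step quantizes the mean to the scale_factor resolution (0.01 m/s here), which is where any precision loss in a short output would come from.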

Cheers,
Uwe

RE: timesel*** vs nco - Added by Antonio Rodriges over 7 years ago

Thanks!

Really interesting insights!
The performance difference is really impressive!

Does CDO exploit any SSE or OpenMP optimizations in this case (e.g. for the averaging)?
Also, is it possible to specify a dimension index, like in NCO, instead of explicitly naming "time", "lon", etc.?

Thanks

RE: timesel*** vs nco - Added by Uwe Schulzweida over 7 years ago

No, it's not possible to specify a dimension index. That's another difference from NCO: CDO has very specific operators, each for one task:

timavg: for the time average
zonavg: for the zonal average
fldavg: for the field average
The loops are very simple, so automatic SIMD vectorization shouldn't be a problem. Here is the loop for the accumulation:
  double *restrict array1;        /* running sum, one field (94 x 192)   */
  const double *restrict array2;  /* current timestep read from the file */

  for ( int i = 0; i < len; i++ ) array1[i] += array2[i];
Gaining anything from OpenMP parallelization seems to be very difficult for the time average. I have spent a lot of time trying to improve the performance of this operation with OpenMP, without any success. The main reason is that this task is highly I/O bound and can't be parallelized while the file is accessed serially.
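
For reference, here is a hedged sketch (not CDO's source) of how that accumulation loop could be annotated with explicit OpenMP SIMD and threading directives; as noted above, the operation is I/O bound, so the threading part usually buys nothing, and the pragma is simply ignored when compiled without OpenMP.

  #include <stddef.h>

  /* Hypothetical helper, not CDO's code: accumulate one timestep into the sum.
     'simd' asks for explicit SIMD vectorization; 'parallel for' adds threads,
     which rarely helps here because reading the file dominates the runtime. */
  void field_accumulate(double *restrict array1, const double *restrict array2, size_t len)
  {
    #pragma omp parallel for simd schedule(static)
    for (size_t i = 0; i < len; i++)
      array1[i] += array2[i];
  }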

RE: timesel*** vs nco - Added by Antonio Rodriges over 7 years ago

Thanks for the information! Very interesting!
