Process NextGEMS Cycle3 ICON data with CDO

NextGEMS Cycle3 ICON data

The NextGEMS Cycle3 ICON data can be accessed via an intake catalog. The ICON run ngc3028 covers the period 2020-01-21 to 2025-07-22 and was run at the model resolution R2B9. The output data is available at different spatial and temporal resolutions. The output grid is HEALPix, with zoom levels 0 to 10: the data mapped from the ICON R2B9 grid to HEALPix has zoom level 10, and each lower zoom level has a quarter of the grid cells of the level above. An introduction to the HEALPix grid can be found at: https://easy.gems.dkrz.de/Processing/healpix.
The available temporal resolutions of the output data are: PT30M, PT3H, P1D.
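The number of grid cells per zoom level follows from the HEALPix definition given below (12*NSIDE*NSIDE cells with NSIDE=2^zoom, i.e. 12*4^zoom). As a quick sanity check of the quarter relation between zoom levels, a small shell loop (a sketch, not part of the original tooling) prints the cell counts for all zoom levels:
for ZOOM in $(seq 0 10); do
  echo "zoom ${ZOOM}: $((12 * 4**ZOOM)) cells"   # 12 * 4^zoom cells per horizontal field
done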

Find the path to the data

The path to the desired archive can be found with the tools find_files or query_yaml.py; use the option -h for help. The tools are available with the following module:

module use /work/k20200/k202134/hsm-tools/outtake/module
module load hsm-tools/unstable
The following command gives the path to the daily NextGEMS Cycle3 ICON data at zoom level 7:
query_yaml.py ICON ngc3028  -s time=P1D zoom=7 --cdo
The paths will change, so always use find_files or query_yaml.py!
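The same selection syntax works for the other temporal resolutions and zoom levels; for example, the path to the 3-hourly data at zoom level 8 (assuming this combination is present in the catalog) can be queried with the same options, only with different values:
query_yaml.py ICON ngc3028  -s time=PT3H zoom=8 --cdo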

CDO version

For processing NextGEMS Cycle3 HEALPix Zarr data the latest CDO version 2.2.0 is required. This also requires the latest NetCDF version 4.9.2 and plugins to decompress the Zarr data. On Levante such a CDO version is not yet installed by DKRZ; a local CDO installation is available with:

module use /home/m/m221078/etc/Modules
module add netcdf-dev
module unload cdo
module add cdo-dev
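After loading these modules you can check that the development CDO is actually picked up by printing the version information, which also lists the linked library versions:
cdo -V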

Data information

The operator sinfon gives an overview of the data. Here is an example for daily mean data at HEALPix zoom level 7:

PATH_TO_ngc3028_P1D_ZOOM7=$(query_yaml.py ICON ngc3028  -s time=P1D zoom=7 --cdo)
cdo sinfon ${PATH_TO_ngc3028_P1D_ZOOM7}
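Other information operators work the same way; for example, the variable names, the number of time steps and the grid description can be listed with:
cdo showname ${PATH_TO_ngc3028_P1D_ZOOM7}
cdo ntime    ${PATH_TO_ngc3028_P1D_ZOOM7}
cdo griddes  ${PATH_TO_ngc3028_P1D_ZOOM7}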

Select data

In CDO, output data is stored in the format of the input data if no other format is selected. For Zarr input it is usually useful to change the output format with the -f option; otherwise the selected data is also stored in Zarr format. The following command selects the variable wind_speed_10m and stores the result in NetCDF4 format.

PATH_TO_ngc3028_P1D_ZOOM7=$(query_yaml.py ICON ngc3028  -s time=P1D zoom=7 --cdo)
cdo -f nc4 select,name=wind_speed_10m ${PATH_TO_ngc3028_P1D_ZOOM7} wind_speed_10m.nc
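The select operator accepts further parameters, so the variable and time selection can be combined in one step. As a sketch, the following command restricts wind_speed_10m to the year 2021 via the year parameter (any other year within the simulation period works the same way):
cdo -f nc4 select,name=wind_speed_10m,year=2021 ${PATH_TO_ngc3028_P1D_ZOOM7} wind_speed_10m_2021.nc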

Interpolation of HEALPix data

The following operators are available for the interpolation of HEALPix data: remapnn, remapdis, remapcon and remapbil. The interpolation with remapcon is slightly inaccurate, because the edges of the HEALPix grid cells cannot be calculated exactly.
When interpolating HEALPix data, it is often useful to choose input data with a resolution similar to that of the target grid. A global 1 degree lon/lat grid has 64800 grid points; the next best HEALPix resolution for this target grid is zoom level 7 (196608 grid cells).
In the following example the variable ts is interpolated to a global 1 degree lon/lat grid with the nearest-neighbor method:

PATH_TO_ngc3028_P1D_ZOOM7=$(query_yaml.py ICON ngc3028  -s time=P1D zoom=7 --cdo)
cdo -f nc4 remapnn,global_1 -select,name=ts ${PATH_TO_ngc3028_P1D_ZOOM7} ts_global_1.nc
cdo    remapnn: Processed 395182080 values from 1 variable over 2010 timesteps [86.69s 2110MB]
The decomposition of the interpolation into two steps, weight generation followed by remapping, works as usual:
cdo gennn,global_1 ${PATH_TO_ngc3028_P1D_ZOOM7} healpix7_to_global1_weights.nc
cdo -f nc4 remap,global_1,healpix7_to_global1_weights.nc -select,name=ts ${PATH_TO_ngc3028_P1D_ZOOM7} ts_global_1.nc
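Weight generation works analogously for the other remapping methods, for example for the (slightly inaccurate, see above) conservative remapping with gencon and remap:
cdo gencon,global_1 ${PATH_TO_ngc3028_P1D_ZOOM7} healpix7_to_global1_conweights.nc
cdo -f nc4 remap,global_1,healpix7_to_global1_conweights.nc -select,name=ts ${PATH_TO_ngc3028_P1D_ZOOM7} ts_global_1_con.nc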

Interpolation to HEALPix grid

CDO can generate HEALPix grids. hp<NSIDE>[_<ORDER>] defines the parameters of a global HEALPix grid.
The NSIDE parameter controls the resolution of the pixelization; it is the number of pixels along the side of each of the 12 top-level HEALPix pixels. The total number of grid pixels is 12*NSIDE*NSIDE. NSIDE=1 generates the 12 (H=4, K=3) equal-sized top-level HEALPix pixels.
ORDER sets the index ordering convention of the pixels; available are nested (default) and ring ordering.
A shortcut for hp<NSIDE>_nested is hpz<ZOOM>, where ZOOM is the zoom level and ZOOM=log2(NSIDE).

If the geographical coordinates are required in CDO, they are calculated from the HEALPix parameters. For this calculation the astropy-healpix C library is used.

Here is an example to interpolate data from an arbitrary grid to a HEALPix grid with zoom level 7:

cdo remapnn,hpz7 infile outfile
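Since ZOOM=log2(NSIDE), hpz7 is equivalent to hp128_nested (NSIDE=128, i.e. 12*128*128 = 196608 cells). The grid name can also be written out explicitly, for example to request ring ordering:
cdo remapnn,hp128_nested infile outfile   # same grid as hpz7
cdo remapnn,hp128_ring   infile outfile   # same resolution, ring ordering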

Performance Issues

Reading and decompressing data has a significant impact on the overall performance of CDO. All operators in CDO always read a complete horizontal field. For high-resolution compressed data this poses at least two challenges.
First, decompressing can take significantly longer than processing the data. Second, if the number of required grid points is significantly smaller than the complete horizontal field, more data is decompressed than necessary.

The following CDO feature is under construction: it is experimental, not yet documented, and only available in the current developer version. Your feedback is very welcome!

query:

The new query: identifier allows variables, grid cells and time steps to be selected before CDO starts working on the data.
query: has the following parameters:
name   comma-separated list of variable names
cell   range (1-N) of grid cells (<start>[/to/<end>])
step   range (1-N) of time steps (<start>[/to/<end>])
path   path to the data

Here is an example to select 2 variables:

cdo  <operator> query:name=var1,var2,path=infile outfile
In principle this does the same as:
cdo  <operator> -selname,var1,var2  infile outfile
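Assuming the same syntax, the step parameter can be used to restrict the time steps that are read; for example, to process only the first 31 time steps of a variable:
cdo  <operator> query:name=var1,step=1/to/31,path=infile outfile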

Selection of grid cells

A contiguous set of grid cells can be selected with the query: parameter cell=<start>[/to/<end>]. This can speed up reading and decompression if the data is split into enough compressed chunks, which is only the case for the high-resolution data with zoom=10: there one field has 12582912 cells spread over 48 compressed chunks.
In the following example the temperature ts is selected at the single grid point closest to the coordinates of Madrid:

PATH_TO_ngc3028_P1D_ZOOM10=$(query_yaml.py ICON ngc3028  -s time=P1D zoom=10 --cdo)
MADRID=$(cdo gridcellindex,lon=-3.703889,lat=40.4125 ${PATH_TO_ngc3028_P1D_ZOOM10})  # grid cell index for Madrid is 3494582
cdo -f nc copy query:name=ts,cell=${MADRID},path=${PATH_TO_ngc3028_P1D_ZOOM10} ngc3028_P1D_ts_madrid.nc
cdo    copy: Processed 2010 values from 1 variable over 2010 timesteps [15.62s 2066MB]
Without the pre-selection of the grid cell, significantly more resources are needed:
cdo -f nc selgridcell,${MADRID} query:name=ts,path=${PATH_TO_ngc3028_P1D_ZOOM10} ngc3028_P1D_ts_madrid.nc
cdo    selgridcell: Processed 25291653120 values from 1 variable over 2010 timesteps [450.09s 95GB]
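The cell parameter also accepts a range of cells. At zoom level 10 the 12582912 cells of one field are stored in 48 chunks, i.e. 262144 cells per chunk; a sketch that reads only the first 262144 cells (roughly one chunk) could look like this:
cdo -f nc copy query:name=ts,cell=1/to/262144,path=${PATH_TO_ngc3028_P1D_ZOOM10} ngc3028_P1D_ts_cells_1-262144.nc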

Memory leak

This problem was solved in NetCDF 4.9.3!

Reading and writing Zarr data with NetCDF 4.9.0/4.9.2 appears to result in a memory leak. This is especially noticeable with high-resolution data (zoom level 10) and many time steps. Here is an example:

PATH_TO_ngc3028_P1D_ZOOM10=$(query_yaml.py ICON ngc3028  -s time=P1D zoom=10 --cdo)

levante> cdo -f nc4 monmean -select,name=tas,year=2020  $PATH_TO_ngc3028_P1D_ZOOM10 monmean_2020-2020.nc
cdo    monmean: Processed 4353687552 values from 1 variable over 346 timesteps [326.05s 17GB]

levante> cdo -f nc4 monmean -select,name=tas,year=2020,2021  $PATH_TO_ngc3028_P1D_ZOOM10 monmean_2020-2021.nc
cdo    monmean: Processed 8946450432 values from 1 variable over 711 timesteps [427.44s 34GB]
Here the monthly mean of the variable tas is calculated over one year and over two years. When two years are read, the required memory doubles. Both CDO commands need a relatively large amount of memory; the same commands on NetCDF4 data would need only about 340MB instead of 17GB.
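Until a CDO built against NetCDF 4.9.3 is available, a pragmatic workaround is to bound the memory use by processing shorter periods separately and concatenating the results afterwards, for example one year at a time (a sketch using the standard mergetime operator):
for YEAR in 2020 2021; do
  cdo -f nc4 monmean -select,name=tas,year=${YEAR} ${PATH_TO_ngc3028_P1D_ZOOM10} monmean_${YEAR}.nc
done
cdo mergetime monmean_2020.nc monmean_2021.nc monmean_2020-2021.nc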