Process nextGEMS Cycle3 ICON data with CDO¶
NextGEMS Cycle3 ICON data¶
The nextGEMS Cycle3 ICON data can be accessed via an intake catalog. The ICON run ngc3028 covers the period from 2020-01-21 to 2025-07-22 and has the model resolution R2B9. The output data is available at different spatial and temporal resolutions. The output grid is HEALPix, with resolutions ranging from zoom level 0 to 10. The data mapped from the ICON R2B9 grid to HEALPix has zoom level 10; each coarser zoom level has a quarter of the cells of the level above. An introduction to the HEALPix grid can be found at: https://easy.gems.dkrz.de/Processing/healpix.
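For orientation, the number of HEALPix cells at a given zoom level is 12*4^zoom; zoom level 10 therefore has the 12582912 cells per field mentioned further below. A minimal shell sketch (not part of the original workflow):

# Number of HEALPix cells per zoom level: npix = 12 * 4^zoom
for zoom in 0 7 10; do
  echo "zoom $zoom: $(( 12 * 4**zoom )) cells"
done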
The available temporal resolutions of the output data are: PT30M, PT3H, P1D.
Find the path to the data¶
The path to the desired archive can be found with the tools find_files or query_yaml.py. Use the option -h for help. The tools are available via the following module:
module use /work/k20200/k202134/hsm-tools/outtake/module
module load hsm-tools/unstable

The following command gives the path to the daily nextGEMS Cycle3 ICON data on zoom level 7:
query_yaml.py ICON ngc3028 -s time=P1D zoom=7 --cdo

The paths will change, so always use find_files or query_yaml.py!
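The other temporal resolutions and zoom levels can be queried in the same way; for example, assuming the same selector syntax, the 3-hourly output at zoom level 7:

# Path to the 3-hourly nextGEMS Cycle3 ICON data at zoom level 7
query_yaml.py ICON ngc3028 -s time=PT3H zoom=7 --cdo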
CDO version¶
Processing the nextGEMS Cycle3 HEALPix Zarr data requires the latest CDO version 2.2.0, which in turn requires the latest NetCDF version 4.9.2 and plugins to decompress the Zarr data. On Levante such a CDO version is not yet installed by DKRZ; a local CDO build is available with:
module use /home/m/m221078/etc/Modules
module add netcdf-dev
module unload cdo
module add cdo-dev
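To check that the intended build is picked up after loading these modules, the version can be printed (a quick sanity check, not part of the original instructions):

# Should report CDO 2.2.0 (or newer) linked against NetCDF 4.9.2
cdo -V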
Data information¶
With the operator sinfon you get an overview of the data. Here is an example for daily mean data at HEALPix zoom level 7:
PATH_TO_ngc3028_P1D_ZOOM7=$(query_yaml.py ICON ngc3028 -s time=P1D zoom=7 --cdo)
cdo sinfon ${PATH_TO_ngc3028_P1D_ZOOM7}
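If only the variable names are of interest, the showname operator gives a more compact listing (a small sketch, not part of the original example):

# List only the variable names contained in the dataset
cdo showname ${PATH_TO_ngc3028_P1D_ZOOM7}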
Select data¶
In CDO, output data is stored in the format of the input data if no other format is selected. For Zarr input it is usually useful to change the output format with the -f option; otherwise the selected data will be stored in Zarr format. The following command selects the variable wind_speed_10m and stores the result as NetCDF4.
PATH_TO_ngc3028_P1D_ZOOM7=$(query_yaml.py ICON ngc3028 -s time=P1D zoom=7 --cdo)
cdo -f nc4 select,name=wind_speed_10m ${PATH_TO_ngc3028_P1D_ZOOM7} wind_speed_10m.nc
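A variation of the same call (a sketch; the year parameter and output file name are chosen for illustration) that additionally restricts the selection to a single year, using the year parameter of select that also appears in the memory-leak example below:

# Select wind_speed_10m for the year 2021 only and store it as NetCDF4
cdo -f nc4 select,name=wind_speed_10m,year=2021 ${PATH_TO_ngc3028_P1D_ZOOM7} wind_speed_10m_2021.nc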
Interpolation of HEALPix data¶
The following operators are available for interpolation of HEALPix data: remapnn, remapdis, remapcon and remapbil. The interpolation with remapcon is a bit inaccurate, because we cannot calculate the edges of the HEALPix grid 100% correctly.
When interpolating HEALPix data, it is often useful to choose input data with a resolution similar to that of the target grid. A global 1 degree lon/lat grid has 64800 grid points; the closest HEALPix resolution for a global 1 degree target grid is zoom level 7 (196608 cells).
In the following example the variable ts is interpolated to a global 1 degree lon/lat grid with the nearest neighbor method:
PATH_TO_ngc3028_P1D_ZOOM7=$(query_yaml.py ICON ngc3028 -s time=P1D zoom=7 --cdo)
cdo -f nc4 remapnn,global_1 -select,name=ts ${PATH_TO_ngc3028_P1D_ZOOM7} ts_global_1.nc
cdo remapnn: Processed 395182080 values from 1 variable over 2010 timesteps [86.69s 2110MB]

The decomposition of the interpolation into 2 parts works as usual:
cdo gennn,global_1 ${PATH_TO_ngc3028_P1D_ZOOM7} healpix7_to_global1_weights.nc
cdo -f nc4 remap,global_1,healpix7_to_global1_weights.nc -select,name=ts ${PATH_TO_ngc3028_P1D_ZOOM7} ts_global_1.nc
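As a variation under the same setup, the field could also be remapped with one of the other listed operators, for example conservatively (a sketch; as noted above, remapcon on HEALPix is only approximate, and the output file name is arbitrary):

# Conservative remapping of ts to the global 1 degree grid
cdo -f nc4 remapcon,global_1 -select,name=ts ${PATH_TO_ngc3028_P1D_ZOOM7} ts_global_1_con.nc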
Interpolation to HEALPix grid¶
CDO can generate HEALPix grids. hp<NSIDE>[_<ORDER>] defines the parameters of a global HEALPix grid.
The NSIDE parameter controls the resolution of the pixelization. It is the number of pixels along the side of each of the 12 top-level HEALPix pixels. The total number of grid pixels is 12*NSIDE*NSIDE. NSIDE=1 generates the 12 (H=4, K=3) equal-sized top-level HEALPix pixels.
ORDER sets the index ordering convention of the pixels, available are nested (default) or ring ordering.
A shortcut for hp<NSIDE>_nested is hpz<ZOOM>. ZOOM is the zoom level and the relation to NSIDE is ZOOM=log2(NSIDE).
If the geographical coordinates are required in CDO, they are calculated from the HEALPix parameters. For this calculation the astropy-healpix C library is used.
Here is an example to interpolate data from an arbitrary grid to a HEALPix grid with zoom level 7:
cdo remapnn,hpz7 infile outfile
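Since nested is the default ordering and zoom level 7 corresponds to NSIDE=2^7=128, the following call should be equivalent to the one above:

# hpz7 is a shortcut for hp128_nested; ring ordering would be requested with hp128_ring
cdo remapnn,hp128_nested infile outfile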
Performance Issues¶
Reading and decompressing data has a significant impact on the overall performance of CDO. All CDO operators always read a complete horizontal field. For high-resolution compressed data this poses at least two challenges: first, decompression can take significantly longer than processing the data; second, if the number of required grid points is significantly smaller than the complete horizontal field, more data is decompressed than necessary.
The following CDO features are under construction, experimental, not further documented and only available in the current developer version! So your feedback is very welcome!
Speed up the decompression¶
The approach here is to decompress a few of the required fields in advance. This makes it necessary to move the selection of the required data from CDO to the I/O layer. For this purpose the new query: identifier was introduced. It allows variables, grid cells and time steps to be selected before CDO starts working on the data. query: has the following parameters:
name | comma-separated list of variable names
cell | range (1-N) of grid cells (<start>[/to/<end>])
step | range (1-N) of time steps (<start>[/to/<end>])
path | path to the data
Here is an example to select 2 variables:
cdo <operator> query:name=var1,var2,path=infile outfile

This does in principle the same as:
cdo <operator> -selname,var1,var2 infile outfile

The new CDO option --worker can be used to decompress some of the required fields in advance when it is combined with query:. The optimal number of workers depends on the number and size of the fields and on the processing; usually a small number between 4 and 16 is sufficient.
Here again the above interpolation example with 8 workers:
PATH_TO_ngc3028_P1D_ZOOM7=$(query_yaml.py ICON ngc3028 -s time=P1D zoom=7 --cdo)
cdo -f nc4 --worker 8 remapnn,global_1 query:name=ts,path=${PATH_TO_ngc3028_P1D_ZOOM7} ts_global_1.nc
cdo remapnn: Processed 395182080 values from 1 variable over 2010 timesteps [13.16s 1604MB]

Unfortunately, the asynchronous decompression (option --worker) only works if the compressed chunks are inside a horizontal field. This is only the case for the 2D fields of the nextGEMS Cycle3 ICON data.
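The query: parameters can also be combined. A hedged sketch that additionally restricts reading to the first 31 daily time steps of ts, following the step syntax from the table above (the step range and output file name are only illustrative):

# Pre-select variable ts and time steps 1 to 31 in the I/O layer, decompress with 8 workers
cdo -f nc4 --worker 8 copy query:name=ts,step=1/to/31,path=${PATH_TO_ngc3028_P1D_ZOOM7} ts_steps_1-31.nc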
Selection of grid cells¶
A contiguous set of grid cells can be selected with the parameter cell=<start>[/to/<end>] of query:. This can speed up reading and decompressing if there are enough compressed chunks. This is only the case for the high resolution data with zoom=10. Here one field has 12582912 cells spread over 48 compressed chunks.
In the following example the temperature is selected on one grid point with the coordinates of Madrid:
PATH_TO_ngc3028_P1D_ZOOM10=$(query_yaml.py ICON ngc3028 -s time=P1D zoom=10 --cdo)
MADRID=$(cdo gridcellindex,lon=-3.703889,lat=40.4125 ${PATH_TO_ngc3028_P1D_ZOOM10})
# grid cell index for Madrid is 3494582
cdo -f nc --worker 8 copy query:name=ts,cell=${MADRID},path=${PATH_TO_ngc3028_P1D_ZOOM10} ngc3028_P1D_ts_madrid.nc
cdo copy: Processed 2010 values from 1 variable over 2010 timesteps [15.62s 2066MB]

Without the pre-selection of the grid cell, significantly more resources are needed:
cdo -f nc --worker 8 selgridcell,${MADRID} query:name=ts,path=${PATH_TO_ngc3028_P1D_ZOOM10} ngc3028_P1D_ts_madrid.nc
cdo selgridcell: Processed 25291653120 values from 1 variable over 2010 timesteps [450.09s 95GB]
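The range form of the cell parameter works the same way. A sketch that reads a contiguous block of cells instead of a single one (the cell indices and output file name are chosen arbitrarily for illustration):

# Select a contiguous block of grid cells before decompression
cdo -f nc --worker 8 copy query:name=ts,cell=1/to/1000000,path=${PATH_TO_ngc3028_P1D_ZOOM10} ngc3028_P1D_ts_cellblock.nc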
Memory leak¶
It looks like reading and writing Zarr data with NetCDF 4.9.0/4.9.2 results in a memory leak. This is especially noticeable with high resolution (zoom level 10) and many time slices. Here is an example:
PATH_TO_ngc3028_P1D_ZOOM10=$(query_yaml.py ICON ngc3028 -s time=P1D zoom=10 --cdo)
levante> cdo -f nc4 monmean -select,name=tas,year=2020 $PATH_TO_ngc3028_P1D_ZOOM10 monmean_2020-2020.nc
cdo monmean: Processed 4353687552 values from 1 variable over 346 timesteps [326.05s 17GB]
levante> cdo -f nc4 monmean -select,name=tas,year=2020,2021 $PATH_TO_ngc3028_P1D_ZOOM10 monmean_2020-2021.nc
cdo monmean: Processed 8946450432 values from 1 variable over 711 timesteps [427.44s 34GB]

Here the monthly mean of the variable tas is calculated over one year and over two years. When reading 2 years, the required memory doubles. Both CDO commands require a relatively large amount of memory. The same commands on NetCDF4 data would only need 340MB instead of 17GB.