cdo with compressed data

Added by Aaron Spring about 2 years ago

I want to compress `netcdf` files down to their real information content based on `bitinformation.jl`, as showcased in this gist https://gist.github.com/aaronspring/383dbbfe31baa4618c5b0dbef4f6d574, i.e. `bitround` and `compress` existing `netcdf` MPI-M output and replace the previous files with smaller new files, while still being able to work with all the usual tools: `xarray`, `cdo`, ...

However, when I compress the rounded data with the `xarray` standard `zlib`, reading these files in `cdo` is very slow, whereas they remain quite fast in `xarray`. With files generated by `ncks`-based bitrounding, `cdo` is still quite fast.
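For reference, compressing with the `xarray` standard `zlib` amounts to passing per-variable `encoding` options to `xr.Dataset.to_netcdf()`. A minimal sketch of that step (file names here are placeholders, not the actual gist contents):

```python
import xarray as xr

# placeholder: dataset whose variables have already been bitrounded
ds = xr.open_dataset("rounded.nc")

# apply zlib compression to every data variable when writing with the netCDF4 backend
encoding = {v: {"zlib": True, "shuffle": True, "complevel": 9} for v in ds.data_vars}
ds.to_netcdf("rounded_compressed.nc", encoding=encoding)
```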

What I found in the docs: "File size can be misleading though: NetCDF and GRIB2 have very effective compression algorithms built-in (zip-compressed nc4, aec/szip compressed grb2). The downside is that in both cases decompression is slow. Especially with large horizontal fields the time for decompressing supersedes the saved read-in time compared to uncompressed data. These compressions are essentially made for saving storage space, but not for extensive work with the data." https://code.mpimet.mpg.de/projects/cdo/wiki/Tutorial#Tips-and-tricks-for-high-resolution-data.

Question: Is there a way to compress `netcdf` files in `xarray` or `julia` without harming `cdo` read performance?
(Maybe I can get the `ncks` path working for my needs)

Full gist of the cdo with compressed data: https://gist.github.com/aaronspring/7b9ea18127dca56d5478af5cd0aadc86

What I actually want to achieve: https://gist.github.com/aaronspring/383dbbfe31baa4618c5b0dbef4f6d574

EDIT: used /sw/rhel6-x64/cdo/cdo-1.9.10-magicsxx-gcc64/bin/cdo

EDIT: `ncks -L 9` enables compression, and `cdo` is faster with this compression, but I don't get the same file size reduction as with `xarray`.


Replies (2)

RE: cdo with compressed data - Added by Uwe Schulzweida about 2 years ago

Hi Aaron,

For performant reading of compressed data, it is important that the access pattern matches the chunking.
CDO reads the data in horizontal fields. If the chunks span the time dimension, a lot of data must be read and decompressed multiple times.
Your example data have a grid of 256x220 points over 588 time steps. The optimal chunk size here is 256x220x1 or smaller, e.g. 128x110x1. Your compressed data have a chunk size of 86x74x196, which means that CDO has to decompress 196 fields for every field it reads.
I guess that when writing the data in Python, the default chunk size of the netCDF library is used. There should also be a way to specify the chunk size explicitly when writing in Python.
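One way to verify the chunk layout on disk is with the netCDF4 Python library, for example (a minimal sketch; the file name is a placeholder):

```python
import netCDF4

# print the on-disk chunk shape of every variable
with netCDF4.Dataset("rounded_compressed.nc") as nc:
    for name, var in nc.variables.items():
        print(name, var.chunking())  # 'contiguous' or a list of chunk sizes per dimension
```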

Cheers,
Uwe

RE: cdo with compressed data - Added by Aaron Spring about 2 years ago

Thanks, Uwe. I didn't check the chunk sizes and also didn't expect them to be changed silently by `xr.to_netcdf()`.

```python
import copy

encoding = {v: {'zlib': True, 'shuffle': True, 'complevel': 9}}  # v is the variable name
encoding_chunksizes = copy.deepcopy(encoding)  # deep copy so `encoding` itself stays unchanged
encoding_chunksizes[v]['chunksizes'] = (1, 1, 220, 256)  # one horizontal field per chunk
```
does the trick, and `cdo` is now as fast as expected.
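For completeness, the corresponding write call would then be something like this (a sketch; dataset and file name are placeholders):

```python
# write with one horizontal field per chunk so cdo can read field by field
ds.to_netcdf("compressed_chunked.nc", encoding=encoding_chunksizes)
```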
