NetCDF file inflation?

Added by Bjorn Stevens about 5 years ago

I've noticed very different behavior in the handling of NetCDF files, perhaps due to the flavor of NetCDF. When processing the DYAMOND runs I noticed that NetCDF4 tended to be much more compact, which led me to check how much difference a conversion to NetCDF4 would make for our BCO data. Here I got a bad surprise: it inflated the file by a factor of ten. Any ideas? See below for details. Note that the '-' causes formatting errors that I was not able to completely resolve.

On mistral:

$ ls -al /scratch/m/m219063/BCO/Meteorology__Deebles_Point__2m_10s__201301.nc 
-rw-r--r-- 1 m219063 26008368 Jan 27 18:34 /scratch/m/m219063/BCO/Meteorology__Deebles_Point__2m_10s__201301.nc

$ cdo sinfo /scratch/m/m219063/BCO/Meteorology__Deebles_Point__2m_10s__201301.nc |head -2
Warning (find_time_vars): Found more than one time variable, skipped variable bco_day!
   File format : NetCDF
    -1 : Institut Source   T Steptype Levels Num    Points Num Dtype : Parameter ID
cdo sinfo: Processed 28 variables over 209703 timesteps [1.08s 737MB]

$ cdo -f nc4 copy Meteorology__Deebles_Point__2m_10s__201502.nc dmy.nc
$ ls -l /scratch/m/m219063/BCO/dmy.nc 
-rw-r--r-- 1 m219063 283693564 Jan 27 18:26 /scratch/m/m219063/BCO/dmy.nc

$ cdo sinfo /scratch/m/m219063/BCO/dmy.nc | tail -2
cdo sinfo: Processed 28 variables over 241863 timesteps [8.33s 833MB]

On the last step I was surprised that the file format indicator was no longer given.


Replies (5)

RE: NetCDF file inflation? - Added by Ralf Mueller about 5 years ago

hi Bjorn!

could you make the files readable?

k202125@mlogin108% ls /scratch/m/m219063/BCO/dmy.nc
ls: cannot access /scratch/m/m219063/BCO/dmy.nc: Permission denied
k202125@mlogin108% ls /scratch/m/m219063/BCO/Meteorology__Deebles_Point__2m_10s__201301.nc
ls: cannot access /scratch/m/m219063/BCO/Meteorology__Deebles_Point__2m_10s__201301.nc: Permission denied

RE: NetCDF file inflation? - Added by Bjorn Stevens about 5 years ago

Hi Ralf,

sorry the main directory was only for m people... try now.

Bjorn

RE: NetCDF file inflation? - Added by Uwe Schulzweida about 5 years ago

At first I was surprised that the number of timesteps changed, but this is because different input files were used.
The file format identifier is missing because you are using tail instead of head in the last command.
Your dataset contains 28 variables with only 1 value per timestep and a lot of timesteps. For CDO this is the worst case in terms of performance, because it reads and writes every single value separately. That's because the dataset is processed one time slice at a time. Normally this is not a big problem, because the total amount of data is relatively small.
It seems that this dataset is also not optimal in terms of storage for netCDF4/HDF5. NetCDF4 stores the data in chunks, but with the CDO data model the chunk size is always 1 in this case.
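
To put rough numbers on the chunking point, here is a back-of-the-envelope sketch in Python. It only uses the variable count, timestep counts and file sizes quoted in the original post. The helper names `chunk_count` and `bytes_per_value` and the alternative chunk length of 4096 are hypothetical, chosen for illustration; the thread does not say how much space HDF5 actually spends per chunk, and the two files cover different months.

```python
import math

N_VARS = 28  # from the cdo sinfo output in the original post


def chunk_count(n_timesteps, chunk_len, n_vars=N_VARS):
    """Chunks needed to store n_vars 1-D time series (hypothetical helper)."""
    return n_vars * math.ceil(n_timesteps / chunk_len)


def bytes_per_value(file_size, n_timesteps, n_vars=N_VARS):
    """Average file bytes per stored value, all metadata included."""
    return file_size / (n_vars * n_timesteps)


# Chunk length 1 (as described in the reply) vs. an assumed length of 4096.
print(chunk_count(209703, 1))     # 5871684
print(chunk_count(209703, 4096))  # 1456

# File sizes and timestep counts as listed in the original post.
print(round(bytes_per_value(26008368, 209703), 1))   # original NetCDF file
print(round(bytes_per_value(283693564, 241863), 1))  # cdo -f nc4 copy
```

With a chunk length of 1 every single value becomes its own chunk, so any fixed per-chunk bookkeeping is paid millions of times; the second pair of numbers shows the resulting average cost per stored value in each file.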

RE: NetCDF file inflation? - Added by Hauke Schulz about 5 years ago

Could you please elaborate a bit on what you find not "optimal in terms of storage for netCDF4"?
Does it lie in the nature of this type of measurement (few variables, many timesteps), or is there actually something we could do about it?

However, as I work quite a bit with xarray (http://xarray.pydata.org/en/stable/), I can report that it inflates the file by only about 1MB when writing to netCDF4.

import xarray as xr

d = xr.open_dataset('Meteorology__Deebles_Point__2m_10s__201301.nc')
# Write as netCDF4
d.to_netcdf('Meteorology__Deebles_Point__2m_10s__201301.nc4')

-rw-r--r-- 1 25M Jan 28 12:04 /home/mpim/m300408/Meteorology__Deebles_Point__2m_10s__201301.nc
-rw-r--r-- 1 26M Jan 28 12:06 /home/mpim/m300408/Meteorology__Deebles_Point__2m_10s__201301.nc4
-rw-r--r-- 1 235M Jan 29 15:33 /home/mpim/m300408/Meteorology__Deebles_Point__2m_10s__201301_cdo.nc4

Because of this difference, xarray converts the data much faster and can also compress it much faster. Any idea why this is the case?

The cdo compression also seems to have some trouble. Or is the syntax different?

cdo -z zip_4 copy /home/mpim/m300408/Meteorology__Deebles_Point__2m_10s__201301_cdo.nc4 /home/mpim/m300408/Meteorology__Deebles_Point__2m_10s__201301_cdo_c.nc4

-rw-r--r--  1 235M Jan 29 15:50 Meteorology__Deebles_Point__2m_10s__201301_cdo_c.nc4

Cheers,
Hauke

RE: NetCDF file inflation? - Added by Uwe Schulzweida about 5 years ago

This is a CDO-specific problem caused by how CDO handles the data. The implementation focused more on processing large data. We will find a solution for this problem, but unfortunately the code has to be restructured a bit, and that will take some time.
This also affects compression in CDO.

Cheers,
Uwe
