CDO Usage and Performance with Compressed and Chunked Files

Added by Matt Thompson almost 8 years ago

All,

I'm wondering if you could help me make sure I'm using CDO the right way. In my experience, CDO has tended to be as fast as or faster than, say, NCO and similar tools (cdo diffn vs. h5diff being a prime example), with the added benefit of nice usability (the commands/operators are self-explanatory). But I've avoided using it with compressed files, since compression seems to slow it down (especially relative to NCO). Today, though, I realized the slowdown might actually have more to do with chunking, or with both.

To wit, using some slightly old tools, NCO 4.6.1 and CDO 1.7.2, I was doing some tests on merging NC4 files. A model I work on has seen a good speed benefit from writing its diagnostic output to many small files rather than a few large ones (probably due to load balancing, etc.). In the end, though, we want the files to be "large" again so that many, many existing tools and processes don't need to be rewritten. So, being lazy, rather than use netCDF directly, I turn to existing tools, namely 'ncks --append' and 'cdo merge'. But the timings are interesting. First, if our output is neither chunked nor compressed:

 ncks: 11.25 sec
  CDO:  8.79 sec

Cool, the CDO advantage I'm used to. Now, if the files are compressed by our model (deflate level 1):

 ncks: 66.27 sec
  CDO: 28.94 sec
CDO-z: 73.71 sec

The two different CDO lines are there because if you do a 'cdo merge' on compressed files, the output is written uncompressed; to get a compressed final output, you have to run with '-z zip'. Those numbers are roughly what I've seen before.

But, it turns out, what I meant to do was have our model compress and chunk the files internally before writing them out, and I had forgotten to. If I use those outputs instead, things get interesting:

 ncks:  72.29 sec
  CDO: 150.99 sec
CDO-z: 508.39 sec

FYI, the commands used above are:

 ncks: /usr/bin/time -p sh -c "find . -iname 'f516_fp.inst3_3d_asm_Np-___*20150415*nc4' -print0 | xargs -0 -I file ncks -h -A file test_nco.nc4" 
  CDO: /usr/bin/time -p sh -c 'cdo merge f516_fp.inst3_3d_asm_Np-___*20150415*nc4 test_cdo.nc4'
CDO-z: /usr/bin/time -p sh -c 'cdo -z zip merge f516_fp.inst3_3d_asm_Np-___*20150415*nc4 test_cdo.zip.nc4'

One reason CDO appears to be much slower with chunking is that it seems to be re-chunking as well:

$ ncdump -hsc f516_fp.inst3_3d_asm_Np-___O3.20150415_0000z.nc4 | grep -e 'O3:_Storage' -A3
        O3:_Storage = "chunked" ;
        O3:_ChunkSizes = 1, 1, 91, 192 ;
        O3:_DeflateLevel = 1 ;
        O3:_Shuffle = "true" ;
$ ncdump -hsc test_nco.nc4 | grep -e 'O3:_Storage' -A3
        O3:_Storage = "chunked" ;
        O3:_ChunkSizes = 1, 1, 91, 192 ;
        O3:_DeflateLevel = 1 ;
        O3:_Shuffle = "true" ;
$ ncdump -hsc test_cdo.nc4 | grep -e 'O3:_Storage' -A3
        O3:_Storage = "chunked" ;
        O3:_ChunkSizes = 1, 11, 181, 288 ;
        O3:_Endianness = "little" ;
    float OMEGA(time, lev, lat, lon) ;
$ ncdump -hsc test_cdo.zip.nc4 | grep -e 'O3:_Storage' -A3
        O3:_Storage = "chunked" ;
        O3:_ChunkSizes = 1, 11, 181, 288 ;
        O3:_DeflateLevel = 1 ;
        O3:_Shuffle = "true" ;

So, I suppose my first question is: is there a way to have CDO preserve the chunking during a merge of compressed and chunked files? Perhaps our current chunking isn't ideal, but it is the chunk pattern we've used for this resolution, 1152x721.
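(One workaround I could imagine, though I haven't tried it, would be to re-chunk and re-compress the CDO output afterwards with NCO; a sketch only, with the dimension names and chunk sizes taken from the ncdump output above and the output file name made up:

 ncks -O -4 -L 1 --cnk_dmn time,1 --cnk_dmn lev,1 --cnk_dmn lat,91 --cnk_dmn lon,192 test_cdo.nc4 test_cdo.rechunk.nc4

But that adds yet another pass over the data, so an option to preserve the chunking inside CDO itself would be nicer.)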

Second: is there a better way to call CDO for these operations? That is, are there options to pass along with 'merge' to get better performance?

Thanks,
Matt


Replies (4)

RE: CDO Usage and Performance with Compressed and Chunked Files - Added by Uwe Schulzweida almost 8 years ago

Hi Matt,

Thanks for this post! This is a CDO bug: the ChunkSizes are not set correctly. I'll try to fix the problem in the next CDO release.
In the meantime, you can set the chunk size manually to the size of the horizontal grid with the CDO option "-k grid". This is usually the best chunk size for most processing steps.

/usr/bin/time -p sh -c 'cdo -z zip -k grid merge f516_fp.inst3_3d_asm_Np-___*20150415*nc4 test_cdo.zip.nc4'
Please send us your feedback.

Cheers,
Uwe

RE: CDO Usage and Performance with Compressed and Chunked Files - Added by Matt Thompson almost 8 years ago

Uwe,

Well, it definitely sped things up, if nothing else! I decided to try all three chunking options that I now know exist:

CDO-z,  kauto: 497.05 sec
CDO-z, klines: 105.77 sec
CDO-z,  kgrid:  83.64 sec
$ ncdump -hsc test_cdo.zip.kauto.nc4 | grep -e 'O3:_Storage' -A3
        O3:_Storage = "chunked" ;
        O3:_ChunkSizes = 1, 11, 181, 288 ;
        O3:_DeflateLevel = 1 ;
        O3:_Shuffle = "true" ;

$ ncdump -hsc test_cdo.zip.klines.nc4 | grep -e 'O3:_Storage' -A3
        O3:_Storage = "chunked" ;
        O3:_ChunkSizes = 1, 1, 1, 1152 ;
        O3:_DeflateLevel = 1 ;
        O3:_Shuffle = "true" ;

$ ncdump -hsc test_cdo.zip.kgrid.nc4 | grep -e 'O3:_Storage' -A3
        O3:_Storage = "chunked" ;
        O3:_ChunkSizes = 1, 1, 721, 1152 ;
        O3:_DeflateLevel = 1 ;
        O3:_Shuffle = "true" ;

So I guess all I need is a "passthrough" option that says "don't re-chunk, just keep the input chunking", and my guess is that it would be pretty fast. All you'd really be paying for then is the re-compression, I assume. (I'm currently trying to puzzle out what chunking at the exact grid size means. Is it any different from no chunking at all? Hmm.)

RE: CDO Usage and Performance with Compressed and Chunked Files - Added by Uwe Schulzweida almost 8 years ago

Chunking at the exact grid size is similar to no chunking at all if the access pattern is the data on the horizontal grid. CDO has this access pattern, but it might be different for other tools.
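To make that concrete with the numbers from above: reading one 721x1152 horizontal field touches a single chunk with "-k grid" chunking, but with the model's 1x1x91x192 chunks it has to read and decompress 8 x 6 = 48 chunks (illustrative arithmetic only):

 echo $(( (721 + 90) / 91 * ((1152 + 191) / 192) ))   # ceil(721/91) * ceil(1152/192) = 8 * 6 = 48 chunks per field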

RE: CDO Usage and Performance with Compressed and Chunked Files - Added by Adriano Fantini about 7 years ago

I have a similar experience with a big, 10GB file (4GB if zipped).

Time-averaging the file takes 3 minutes for the unzipped file and 35 minutes for the zipped one.
The file contains several variables which may or may not share the same chunking specs.

However, if I select only one variable, the results are more extreme: 30 s for the unzipped file, 45 s for the zipped one.
The files are on a networked file system, but performing the same analysis on a RAM-backed virtual file system does not yield different results.
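For reference, the operations were roughly of the form below (a sketch only; the file names and the variable name VAR are placeholders for mine):

 cdo timmean big_file.nc timmean_all.nc                 # time-average every variable
 cdo timmean -selname,VAR big_file.nc timmean_VAR.nc    # select one variable, then time-average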

Any idea why?
