
CDOs in python: How to handle multiple files simultaneously?

Added by Semjon Schimanke almost 10 years ago

Hi,

is there a way to handle many files simultaneously with the CDOs in Python?

Let's say I have model output for 50 years and the data is stored annually, hence in 50 files. Is there a smooth way to open and process them all at once? I do not want to loop over them.

The Python module netCDF4 offers such an opportunity with MFDataset. However, it seems CDO cannot handle the virtually merged dataset.

A simplified example:
@
from netCDF4 import MFDataset
from cdo import *
cdo=Cdo()

files='/data/tmp/Model_data_year_*.nc'

Multifiles=MFDataset(files)
print(Multifiles.file_name)

cdo.showname(input=Multifiles.file_name)
@
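One thing worth knowing: the cdo.py wrapper passes `input` straight through to the command line, so for operators that accept several input streams (e.g. `mergetime`) a space-separated list of files can be handed over directly. A minimal sketch of building such an argument (the helper name `cdo_input_from` and the chained `mergetime`/`ymonmean` call are illustrative assumptions, not part of the wrapper's API):

```python
import glob

def cdo_input_from(pattern):
    # Build the space-separated file list that cdo expects as its input argument
    files = sorted(glob.glob(pattern))
    return ' '.join(files)

# Hypothetical usage (requires the cdo module and the files to exist):
# cdo.showname(input=cdo_input_from('/data/tmp/Model_data_year_*.nc'))
# cdo.ymonmean(input=cdo.mergetime(input=cdo_input_from('/data/tmp/Model_data_year_*.nc')),
#              output='clim.nc')
```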

Thanks for any help/suggestion,
Semjon


Replies (3)

RE: CDOs in python: How to handle multiple files simultaneously? - Added by Ralf Mueller almost 10 years ago

I mostly use the multiprocessing module:

@
import glob, multiprocessing
from cdo import *

cdo = Cdo()

def showyear(ifile):
    # return the list of years contained in one file
    return cdo.showyear(input=ifile)[0].split(' ')

pool = multiprocessing.Pool(14)
results = []

for ifile in glob.glob('/data/tmp/Model_data_year_*.nc'):
    years = pool.apply_async(showyear, [ifile])
    results.append([ifile, years])

pool.close()
pool.join()

# apply_async returns AsyncResult objects; call get() to obtain the values
print([[ifile, years.get()] for ifile, years in results])
@

RE: CDOs in python: How to handle multiple files simultaneously? - Added by Semjon Schimanke almost 10 years ago

Thanks for the suggestion, but I am not sure it helps in my case. As far as I understand, your code processes the files in parallel. That is nice, but it cannot be used when you need information from all files at once, e.g. a simple mean over time (cdo timmean). For the given example, I would like to compute a monthly climatology (cdo ymonmean). Can multiprocessing help in this case?

Cheers,
Semjon

RE: CDOs in python: How to handle multiple files simultaneously? - Added by Ralf Mueller almost 10 years ago

Semjon Schimanke wrote:

Thanks for the suggestion, but I am not sure it helps in my case. As far as I understand, your code processes the files in parallel. That is nice, but it cannot be used when you need information from all files at once, e.g. a simple mean over time (cdo timmean). For the given example, I would like to compute a monthly climatology (cdo ymonmean). Can multiprocessing help in this case?

Sure: for timmean, you could split the files horizontally (see here) and process each area in parallel. Whether that yields a speedup or a slowdown will depend on the amount of data.
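The split-then-combine idea can be sketched with plain Python (no CDO involved). The toy `DATA` array and the chunk boundaries are assumptions for illustration; with real files the horizontal splitting would be done by a CDO operator such as sellonlatbox:

```python
import multiprocessing

# Toy stand-in for gridded model output: DATA[time][cell]
DATA = [[float(t + c) for c in range(8)] for t in range(4)]

def area_time_mean(cells):
    # timmean for one horizontal chunk: mean over time for each cell in it
    return {c: sum(step[c] for step in DATA) / len(DATA) for c in cells}

def parallel_timmean(chunks, workers=2):
    with multiprocessing.Pool(workers) as pool:
        partials = pool.map(area_time_mean, chunks)
    # the areas are disjoint, so combining is a plain merge
    result = {}
    for part in partials:
        result.update(part)
    return result

# two "areas" of an 8-cell grid, processed in parallel
mean_per_cell = parallel_timmean([list(range(0, 4)), list(range(4, 8))])
```

Because each worker only touches its own cells, the combine step is trivial; the trade-off is that every worker still has to read all timesteps of its area.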

For ymonmean, you might split the files by month (splitmon) and compute the timmean for each month in parallel. Unless it's really necessary, I'd avoid joining them at the end, of course. If your grid is large enough, you could also combine both techniques.
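The splitmon-then-timmean recipe follows the same pattern, just decomposed along the time axis instead of the grid. A sketch with plain Python, where the `(month, value)` pairs stand in for real timesteps (with real files, splitmon and timmean would be the CDO operators doing this work):

```python
import multiprocessing
from collections import defaultdict

# Toy timesteps: (month, value) pairs spanning three years
STEPS = [(m, float(y * 10 + m)) for y in range(3) for m in range(1, 13)]

def month_mean(item):
    # timmean over all timesteps that fell into one calendar month
    month, values = item
    return month, sum(values) / len(values)

def ymonmean(steps, workers=4):
    # splitmon step: group the timesteps by calendar month
    by_month = defaultdict(list)
    for month, value in steps:
        by_month[month].append(value)
    # each month is independent, so the means parallelise trivially
    with multiprocessing.Pool(workers) as pool:
        return dict(pool.map(month_mean, sorted(by_month.items())))

clim = ymonmean(STEPS)  # 12 entries, one climatological mean per month
```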
