How to handle very large data processing

Added by Guido Cioni over 1 year ago

I'm trying to compute hourly statistics (a climatology) from 30 years of ERA5-Land data.
I have a list of files

era5_t2m_1991.grib, era5_t2m_1992.grib, ... era5_t2m_2020.grib

Each file contains 3-hourly values of 2 m temperature from ERA5-Land and is about 15 GB in size.

Trying to compute the hourly average with

cdo yhourmean -select,name=T2M 'era5_t2m_*.grib' era5_t2m_1991-2020_mean.grib

fills up the RAM (we have about 116 GB on this machine, so quite a lot :D).

I tried to do the same with ERA5 data and had more luck, as the RAM usage saturated at "only" 23 GB. With ERA5-Land, probably because of the finer grid, I cannot find a way to process the data.
CPU usage is still quite low, because I believe most of the time is spent loading data from disk into memory, and with GRIB files we cannot really split the tasks since we have to scan through the entire file.

Regardless, is there any advice on how to process this kind of data? Will I be able to compute this climatology at all? :)

I thought about splitting the tasks by subsetting the dataset into lat/lon boxes, but even with this approach I would still need to scan the entire GRIB file to select only the part that I need, so I guess the memory usage would be comparable.
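For illustration, this is roughly what I had in mind (the sellonlatbox coordinates and output file name are just placeholders); each box would still require scanning every input file:

# sketch only: compute the hourly climatology for a single lon/lat box
# (0,30,30,60 is a placeholder box: lon1,lon2,lat1,lat2)
cdo yhourmean -sellonlatbox,0,30,30,60 -select,name=T2M 'era5_t2m_*.grib' era5_t2m_box1_1991-2020_mean.grib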


Replies (4)

RE: How to handle very large data processing - Added by Uwe Schulzweida over 1 year ago

Do the calculation for each month; this reduces the memory requirement by a factor of 12:

# process one calendar month at a time to keep the memory requirement low
for month in {1..12}; do
  cdo yhourmean -select,name=T2M,month=$month 'era5_t2m_*.grib' result$month
done
# concatenate the 12 monthly climatologies into one file and clean up
cdo mergetime result* era5_t2m_1991-2020_mean.grib
rm result*

RE: How to handle very large data processing - Added by Guido Cioni over 1 year ago

Really good suggestion, I didn't think of splitting over months.
Obviously this only works for linear operators, but if I have to compute the standard deviation there's really no other way... right?

RE: How to handle very large data processing - Added by Uwe Schulzweida over 1 year ago

I think this should work the same way with the standard deviation.
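The monthly split does not mix any hour-of-year bins between months, so the same loop applies. For example, an untested sketch with yhourstd instead of yhourmean (output names are placeholders):

for month in {1..12}; do
  cdo yhourstd -select,name=T2M,month=$month 'era5_t2m_*.grib' std_result$month
done
cdo mergetime std_result* era5_t2m_1991-2020_std.grib
rm std_result*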

RE: How to handle very large data processing - Added by Guido Cioni over 1 year ago

You're right, the result is split into contiguous intervals, so it's fine to merge them afterwards.

Thanks again for the suggestion; let's see how long it takes to compute everything: I started a job 2 hours ago and it's still computing the first month of results ("only" 24 GB of RAM used). Hopefully it won't take the entire day.
