How to handle very large data processing
Added by Guido Cioni over 2 years ago
I'm trying to compute hourly statistics (a climatology) from 30 years of ERA5-Land data.
I have a list of files
era5_t2m_1991.grib, era5_t2m_1992.grib, ... era5_t2m_2020.grib
Each file contains 3-hourly values of 2m Temperature from ERA5-Land and is about 15 GB in size.
Trying to compute the hourly average with

cdo yhourmean -select,name=T2M 'era5_t2m_*.grib' era5_t2m_1991-2020_mean.grib

fills up the RAM (we have about 116 GB on this machine, so quite a lot :D).
I tried to do the same with ERA5 data and had more luck: the RAM usage saturated at "only" 23 GB. With ERA5-Land, probably because of the finer grid, I cannot find a way to process the data.
CPU usage is still quite low; I believe most of the time is spent loading data from disk into memory, and with GRIB files we cannot really split the task since we have to scan through the entire file.
Regardless, is there any advice on how to process this kind of data? Is there any way I will be able to compute this climatology?
I thought about splitting the task by subsetting the dataset into lat/lon boxes, but even with this approach I would still need to scan the entire GRIB file to select only the part that I need, so I guess memory usage would be comparable.
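For completeness, the lat/lon splitting I had in mind would look roughly like this (just a sketch; the box coordinates and the output name are placeholders), repeated for each box and recombined afterwards:

cdo yhourmean -sellonlatbox,0,10,40,50 -select,name=T2M 'era5_t2m_*.grib' era5_t2m_box1.grib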
Replies (4)
RE: How to handle very large data processing - Added by Uwe Schulzweida over 2 years ago
Do the calculation for each month; this reduces the memory requirement by a factor of 12:
for month in {1..12}; do
  cdo yhourmean -select,name=T2M,month=$month 'era5_t2m_*.grib' result$month
done
cdo mergetime result* era5_t2m_1991-2020_mean.grib
rm result*
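As an optional sanity check (not strictly necessary), the merged file can be inspected afterwards, e.g.:

cdo ntime era5_t2m_1991-2020_mean.grib    # number of time steps in the merged climatology
cdo sinfon era5_t2m_1991-2020_mean.grib   # variable, grid and time axis summary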
RE: How to handle very large data processing - Added by Guido Cioni over 2 years ago
Really good suggestion, I didn't think of splitting over the months.
Obviously this works only with linear operators, but if I have to compute the standard deviation there's really no other way, right?
RE: How to handle very large data processing - Added by Uwe Schulzweida over 2 years ago
I think this should work the same way with the standard deviation.
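For example, the same monthly splitting with the multi-year hourly standard deviation operator yhourstd should look roughly like this (untested sketch, with std_result* as placeholder names):

for month in {1..12}; do
  cdo yhourstd -select,name=T2M,month=$month 'era5_t2m_*.grib' std_result$month
done
cdo mergetime std_result* era5_t2m_1991-2020_std.grib
rm std_result*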
RE: How to handle very large data processing - Added by Guido Cioni over 2 years ago
You're right, the result is split into contiguous time intervals, so it's fine to merge them afterwards.
Thanks again for the suggestion, let's see how long it takes to compute everything: I started a job 2 hours ago and it's still computing the first month of results ("only" 24 GB RAM consumption). Hopefully it's not going to take the entire day.