Getting rid of duplicate data within a netCDF file

Added by Navajyoth MP over 7 years ago

Hey,

I have a netCDF file, file.nc, which is supposed to contain monthly data spanning January 1960 to December 2007 (576 months). However, the file contains 803 time frames. When I viewed the file with ncview, I could see that there is duplicated data within the file: time frames 1 to 576 contain data for the actual observation period, while time frames 577 to 803 duplicate a shorter part of that period.

How do I get rid of the time frames from 577 to 803 using CDO? Also, what could be the reason for such duplication within the dataset? In other words, is this error purely technical, or could there be another cause behind it?


Replies (5)

RE: Getting rid of duplicate data within a netCDF file - Added by Karin Meier-Fleischer over 7 years ago

Hi,

how did you create the data file? Without the file itself, I can only guess that something went wrong when it was created.

Bye,
Karin

RE: Getting rid of duplicate data within a netCDF file - Added by Michael Böttinger over 7 years ago

Hi,

in any case, if you are sure that the first 576 time steps are OK, you can select only those by using
cdo seltimestep,1/576 <file1.nc> <file2.nc>
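
To double-check the result, CDO's ntime operator prints the number of time steps in a file; for <file2.nc> it should then report 576:

cdo ntime <file2.nc>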

As for the reason for the duplicated time steps: how did you create that file?
"ncdump -h <file1.nc>" will show you, among other header information, the file history.

Cheers,
Michael

RE: Getting rid of duplicate data within a netCDF file - Added by Saumya Singh over 2 years ago

@Michael Böttinger
Hi, there are duplicate records in the raw file itself. What do I do? I need to delete the duplicate records. Kindly help.

RE: Getting rid of duplicate data within a netCDF file - Added by Karin Meier-Fleischer over 2 years ago

Hi Saumya,

do you mean duplicate times in a netCDF file? What do you mean by 'raw data'?

An example of time duplicates in a netCDF file:

If you know the time indices of the duplicated time steps, for instance from the output of

cdo infon data_with_duplicate_time_records.nc
    -1 :       Date     Time   Level Gridsize    Miss :     Minimum        Mean     Maximum : Parameter name
     1 : 2000-07-16 06:00:00       0    18432       0 :      214.46      278.60      305.98 : temp2         
     2 : 2000-07-16 06:00:00       0    18432       0 :      214.46      278.60      305.98 : temp2         
     3 : 2001-07-16 06:00:00       0    18432       0 :      214.78      278.65      306.47 : temp2         
     4 : 2001-07-16 06:00:00       0    18432       0 :      214.78      278.65      306.47 : temp2         
     5 : 2002-07-16 06:00:00       0    18432       0 :      214.58      278.65      305.64 : temp2         
     6 : 2003-07-16 06:00:00       0    18432       0 :      214.72      278.79      306.28 : temp2         
     7 : 2003-07-16 06:00:00       0    18432       0 :      214.72      278.79      306.28 : temp2         
     8 : 2004-07-16 06:00:00       0    18432       0 :      214.88      278.88      306.48 : temp2         
     9 : 2005-07-16 06:00:00       0    18432       0 :      214.04      278.83      306.52 : temp2         
    10 : 2005-07-16 06:00:00       0    18432       0 :      214.04      278.83      306.52 : temp2

You can use the time indices to delete the duplicates with CDO's delete operator:

cdo -delete,timestep=2,4,7,10 data_with_duplicate_time_records.nc outfile.nc

If there are too many duplicates to list by hand, you can use a short Python script with xarray and numpy to delete them:

import xarray as xr
import numpy as np

# read the dataset
infile = 'data_with_duplicate_time_records.nc'

ds = xr.open_dataset(infile)
print(ds['time'])

# Numpy provides the function np.unique to create an index list of unique time 
# records of a dataset or array
_, index = np.unique(ds['time'], return_index=True)
print(index)

# create a new dataset which doesn't contain time record duplicates
ds_unique = ds.isel(time=index)
print(ds_unique['time'])

# write the dataset with unique time records to a new netCDF file
ds_unique.to_netcdf('temp2_unique_timesteps.nc')
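
Afterwards, cdo infon temp2_unique_timesteps.nc should list each date only once. As a side note, recent xarray versions (assuming yours is new enough) also provide a built-in Dataset.drop_duplicates(dim) method that achieves the same in one step.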

-Karin

RE: Getting rid of duplicate data within a netCDF file - Added by Saumya Singh over 2 years ago

@Karin

Hi,

Thank you very much for the quick response. Yes, I was referring to the duplicate times in the NC file. By raw data I meant the file as downloaded from the source; I did not create it myself. It turns out that each even time step is a duplicate value, for 4 consecutive years. I will try the script you provided and hope it works for me. I was wondering if there is a way to just delete the even time steps in CDO.
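
For instance, would something along these lines work? (Just a sketch on my part, assuming GNU seq is available; cdo ntime prints the total number of time steps, and seq generates the comma-separated list of odd indices, so only the odd time steps are kept.)

cdo seltimestep,$(seq -s, 1 2 $(cdo ntime infile.nc)) infile.nc outfile.nc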

Thanks a lot.

Have a nice day!
