Pack'em ESMs! Tools for (un-)archiving Earth system model data

Updated over 3 years ago by Karl-Hermann Wieners
Information on this page refers to packems-1.0.1

Overview

Pack'em ESMs! (packems) contains a set of tools for packing, archiving,
listing, and retrieval of Earth system model and other data.

packems takes directories to be packed into tar files, and optionally pushes these directly to the tape archiving system. Features include parallel operation, batch job operation, and error recovery. It keeps track of the user's archiving operations so that this information can be re-used for later retrieval of data. Currently, only HPSS's pftp interface is supported for archiving.

listems takes the information recorded by packems and lets you survey and examine the archived data without having to unarchive or unpack large data files. It features UNIX and regular expression style search patterns for file selection and also allows including user-defined storage index information.

unpackems uses the listems interface to unarchive and unpack the selected data. The retrieval process allows for transforming the original directory structure as needed by the user, e.g. using a different base directory, renaming parts of the path name, or flattening out the tar file contents into a single directory.


Design and Prototype

Please refer to this presentation


Preparation

Tasks to do once prior to the first usage

Passwordless access to the tape archive (HPSS)

Before using packems for the first time, you must ensure that access to the tape library is available to you without entering a password. The recommended procedure is described at https://www.dkrz.de/up/systems/hpss/pftp-with-kerberos.

Tasks to do prior to each usage

Check Kerberos ticket

Kerberos authentication works with so-called tickets, which currently expire after one week. The expiry date can be checked on mistral with klist. The crucial part of the output looks like this

Valid starting     Expires            Service principal
05/19/20 11:16:29  05/26/20 21:16:29  krbtgt/PHPSS.DKRZ.DE@PHPSS.DKRZ.DE
    renew until 06/30/20 21:16:29

If the Expires date (the format is month/day/year) has not yet passed, everything is fine. If the ticket is about to expire, it can easily be renewed with kinit -AVR. Once it has expired, a new ticket must be created using kinit -AV, entering the Kerberos password.
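
The check can also be scripted. The following is a minimal sketch, assuming the klist output has the layout shown above (dates as month/day/year with a two-digit year). The function names ticket_expiry and ticket_status and the one-day warning threshold are illustrative choices and not part of packems.

```python
from datetime import datetime, timedelta

# Sample taken from the klist output shown above.
KLIST_SAMPLE = """\
Valid starting     Expires            Service principal
05/19/20 11:16:29  05/26/20 21:16:29  krbtgt/PHPSS.DKRZ.DE@PHPSS.DKRZ.DE
    renew until 06/30/20 21:16:29
"""


def ticket_expiry(klist_output):
    """Return the expiry time of the krbtgt ticket, or None if none is listed."""
    for line in klist_output.splitlines():
        fields = line.split()
        # A ticket line holds: start date, start time, expiry date, expiry time, principal.
        if len(fields) == 5 and fields[4].startswith("krbtgt/"):
            return datetime.strptime(fields[2] + " " + fields[3], "%m/%d/%y %H:%M:%S")
    return None


def ticket_status(klist_output, now, warn_before=timedelta(days=1)):
    """Classify the ticket as 'valid', 'expiring' or 'expired' (hypothetical labels)."""
    expires = ticket_expiry(klist_output)
    if expires is None or expires <= now:
        return "expired"    # create a new ticket with: kinit -AV
    if expires - now <= warn_before:
        return "expiring"   # renew with: kinit -AVR
    return "valid"


print(ticket_status(KLIST_SAMPLE, now=datetime(2020, 5, 20, 12, 0, 0)))
```

In practice the text would come from running klist, for example via subprocess.run(["klist"], capture_output=True, text=True).stdout, and now would be datetime.now().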

Modules loaded?

The tools are installed as modules in my home directory:

module use ~m221078/etc/Modules
module add packems

This gives you access to the actual (Python) script packems, and packems_wrapper for use with sbatch.

Setting up packems on your own (if you have the source code)

You need:

  • python 3
  • pykerberos
  • pandas

If you work on mistral, please use the module python3/2020.02-gcc-9.1.0. It contains all needed libraries.


Quick Tour (examples taken from MPI-ESM)

Packems

  1. I recommend starting the job in the experiment directory on work and specifying the directories or files to be archived relative to it. That way, the archives will also contain relative path names and can easily be unpacked to other locations later on.
    cd /work/xy1234/m123456/mpiesm/experiments/abc1234
    
  2. -j JOBS specifies how many jobs are started in parallel. Useful in combination with SLURM, e.g. as
    sbatch -A xy1234 -p prepost,compute,compute2 --exclusive --mail-type=FAIL packems_wrapper -j 12 ...
    
  3. By default, the archives are named packems_001.tar, packems_002.tar, and so on. With -o OUTPUT you can specify a different name for them
    ... -o abc1234_outdata_echam ...
    
  4. -d determines the directory in which the .tar files are packed (by default in the current directory). Important if you archive directories in which you do not have write permission
    ... -d packing ...
    
  5. -a activates the archiving - normally only packing is done. -S determines the directory where the files are stored in the HPSS.
    ... -a -S xy1234/mpiesm/experiments/abc1234 ...
    
  6. -p deletes the packed archives automatically after archiving. Otherwise you can manually delete the files in packing afterwards.
    (!) The -P option, which also deletes the original files, prohibits re-starting after abnormal program termination and should not be used at present!
    ... -p ...
    
  7. Finally, define the directory which you would like to archive.
    ... outdata/echam6
    
  8. You can add more than one input directory, but you can set the output only once:
    ... -o abc1234_outdata outdata/echam6 outdata/jsbach outdata/mpiom outdata/hamocc
    
  9. For restart files, the temporal relationship is very important. Therefore, you sort them by timestamp:
    ... -O by_time restart
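
Putting the fragments from steps 2 to 7 together gives one complete call. The sketch below only assembles and prints the command line; it does not run anything. It assumes that the options can be combined in this order, and it uses the placeholder project, experiment, and directory names from the steps above.

```python
import shlex

# Options for sbatch, as in step 2.
sbatch_options = [
    "-A", "xy1234",
    "-p", "prepost,compute,compute2",
    "--exclusive",
    "--mail-type=FAIL",
]

# Options for packems, collected from steps 2 to 6.
packems_options = [
    "-j", "12",                                    # parallel jobs
    "-o", "abc1234_outdata_echam",                 # archive name prefix
    "-d", "packing",                               # where the .tar files are packed
    "-a", "-S", "xy1234/mpiesm/experiments/abc1234",  # archive to HPSS
    "-p",                                          # delete packed archives afterwards
]

# Input directory from step 7, relative to the experiment directory.
inputs = ["outdata/echam6"]

command = ["sbatch", *sbatch_options, "packems_wrapper", *packems_options, *inputs]
command_line = " ".join(shlex.quote(word) for word in command)
print(command_line)
```

The printed line is meant to be issued from the experiment directory chosen in step 1, so that the relative input path resolves correctly.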
    

Listems

Input and filtering

  1. -i INDEX_FILE specify an INDEX file located on the local file system (in addition to those specified in files attached via -l); please note the subsection Scope in which files/expressions are searched/evaluated below
    ... -i /home/user/my_index_files/INDEX_file.txt ...
    
  2. -i INDEX_FILE specify an INDEX file located on HPSS by prepending t: to the path (in addition to those specified in files attached via -l); please note the subsection Scope in which files/expressions are searched/evaluated below
    ... -i t:/hpss/arch/bm0146/k204221/INDEX_file.txt ...
    
  3. -i INDEX_FILE_LIST specify a list of INDEX files (separated by ;); Note: enclose in ' so that the shell does not treat ; as a command separator
    ... -i 't:/hpss/arch/bm0146/k204221/INDEX_file_1.txt;t:/hpss/arch/bm0146/k204221/INDEX_file_2.txt' ...
    
  4. -i INDEX_FILE_WILDCARD specify INDEX files via wildcard; Note: enclose in ' to prevent automatic evaluation by the shell
    ... -i 't:/hpss/arch/bm0146/k204221/INDEX_file_*.txt' ...
    
  5. -i INDEX_FILE_REGULAR_EXPRESSION specify INDEX files via regular expression; prepend r:; the order of r: and t: has no effect; Note: enclose in ' to prevent automatic evaluation by the shell
    ... -i 'r:t:/hpss/arch/bm0146/k204221/INDEX_file_[0-9].txt' ...
    ... -i 't:r:/hpss/arch/bm0146/k204221/INDEX_file_[0-9].txt' ...
    
  6. -i INDEX_FILE_MIXED_EXPRESSION specify INDEX file(s) via a mixture of wildcards and regular expressions; Note: enclose in ' to prevent automatic evaluation by the shell
    ... -i 't:/hpss/arch/bm0146/k204221/other/INDEX_file.txt;t:/hpss/arch/bm0146/k204221/icon/INDEX_file_*.txt;r:t:/hpss/arch/bm0146/k204221/data_c/INDEX_file_[0-9].txt' ...
    
  7. -s LIST_SEPARATOR -i INDEX_FILE_LIST specify a list of INDEX files and manually set the list separator
    ... -s ',' -i t:/hpss/arch/bm0146/k204221/INDEX_file_1.txt,t:/hpss/arch/bm0146/k204221/INDEX_file_2.txt ...
    
  8. -l INDEX_LIST_FILE specify a file that contains the path(s) to INDEX file(s) (in addition to those INDEX files specified by -i); same usage of wildcards and regular expressions as -i
    ... -l my_dummy_index_list.txt ...
    
  9. --no-default-index-list do not read the default INDEX file list ~/.packems/INDEX_LIST.txt
    ... --no-default-index-list ...
    
  10. --no-default-index-list is useful in combination with -i when only one specific INDEX file should be read:
    ... --no-default-index-list -i t:/hpss/arch/bm0146/k204221/INDEX_file.txt ...
    
  11. -a TAR_ARCHIVE select one(/more) tar archive(s) exclusively; only tar archives, which are listed in the provided INDEX file(s), are included; no new tar archives can be added; same usage of wildcards and regular expressions as -i
    ... -a search_only_this_archive.tar ...
    
  12. -x EXCLUDE_TAR_ARCHIVE exclude one(/more) tar archive(s) that are listed in the provided INDEX files; no new tar archives can be added; same usage of wildcards and regular expressions as -i
    ... -x contains_bad_data.tar ...
    

Output formatting

  1. --long print detailed/long output (more columns)
    ... --long ...
    
  2. -t OUTPUT_FORMAT choose an output format: txt/text (default), csv, json or html
    ... -t json ...
    
  3. -o OUTPUT_FILE writes default output into a file
    ... -o output.txt ...
    
  4. -t OUTPUT_FORMAT -o OUTPUT_FILE writes output into a file of specific format; file extensions are not automatically recognized
    ... -t json -o output.json ...
    

Storing local copies of INDEX files

By default, the INDEX files are copied/retrieved to a local temporary directory and are deleted after they have been read by listems. A prefix t: in the beginning of the INDEX file path indicates that the files should be retrieved from HPSS. Omitting the t: prefix indicates that the files should be copied from the local file system.

  1. -N dry run; stop after a list of INDEX files is created; don't download INDEX files from HPSS; meant to check whether input from -i and -l was properly processed
    ... -N ...
    
  2. -w TMP_DIRECTORY user-provided working directory to retrieve the INDEX files into; the directory will be created if it does not exist; the INDEX files will be kept if purge is not set (-p)
    ... -w store/index/files/here/ ...
    
  3. -p -w TMP_DIRECTORY remove INDEX files from user-provided directory after they were imported into listems
    ... -p -w store/index/files/here/ ...
    

Unpackems

Overview of the Options

Please see section on packems (above) for options -b, -F, -n, -N and -j.

Please see section on listems (above) for options -l, -i, -a, -x, -s, -v and --no-default-index-list.

Information on the following flags is provided here: -p, -d, -D, --flatten, -o, -K, -O, -q, -w

Setting working and destination directories; modifying extraction paths

  1. -d DESTINATION_DIR: destination directory into which the files should be extracted from the tar files
    ... -d /work/bm0146/k204221/model_results ...
    
  2. -D REPLACE_DESTINATION_DIR: replace the first n folders of the archived files by the folder provided by -D; n is the number of folders provided (4 in the example below); see Rules to construct/modify the output path of extracted files below for details
    ... -D new/folder/for/data ...
    
  3. -d DESTINATION_DIR -D REPLACE_DESTINATION_DIR: extract files (from tar archives) into /work/bm0146/k204221/model_results/new_folder and drop first directory of each archived file
    ... -d /work/bm0146/k204221/model_results -D new_folder ...
    
  4. -d DESTINATION_DIR -D REPLACE_DESTINATION_DIR: extract files (from tar archives) into /work/bm0146/k204221/model_results/new_folder/subfolder and drop first two directories of each archived file
    ... -d /work/bm0146/k204221/model_results -D new_folder/subfolder ...
    
  5. --flatten flatten/remove directory tree of files stored in tar archives
    ... --flatten ...
    
  6. -w WORK_DIR: specify a working directory into which the INDEX and tar files are retrieved.
    ... -w /scratch/k/k204221 ...
    

Keeping and overwriting files

  1. no -K or -O: unpackems raises an error if a file already exists or if a file would be extracted twice (or more) to the same location; this is not fail-safe, e.g. during parallel extractions
  2. -K: keep existing files during extraction; warn if keeping is expected
    ... -K ...
    
  3. -O: overwrite existing files during extraction; warn if overwriting is expected
    ... -O ...
    
  4. -q: suppress warnings thrown by -K and -O
    ... -q ...
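
The behaviour described in this list can be summarised as a small decision function. This is a sketch of the documented rules only, not the unpackems implementation. The function name resolve_conflict, its parameters, and its return values are made up for illustration, and the combination of -K and -O is not described above, so the sketch rejects it.

```python
def resolve_conflict(target_taken, keep=False, overwrite=False, quiet=False):
    """Decide what to do with one file that is about to be extracted.

    target_taken: True if the target already exists or another file
                  would be extracted to the same location.
    keep, overwrite, quiet: stand for the flags -K, -O and -q.
    Returns (action, warn) with action in {"extract", "skip", "error"}.
    """
    if keep and overwrite:
        raise ValueError("this sketch does not model -K and -O together")
    if not target_taken:
        return ("extract", False)       # no conflict, nothing to report
    if keep:
        return ("skip", not quiet)      # -K: keep the existing file
    if overwrite:
        return ("extract", not quiet)   # -O: overwrite the existing file
    return ("error", False)             # default: refuse to continue


print(resolve_conflict(True))
print(resolve_conflict(True, keep=True))
print(resolve_conflict(True, overwrite=True, quiet=True))
```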
    

Cleanup after extraction

  1. -p: purge INDEX files from working directory provided by -w
    ... -p ...
    

NOTE: tar archives are not automatically removed after retrieval and successful extraction. Please run make clean -f MAKEFILE manually (MAKEFILE is the name of the generated Makefile).

Other options/arguments

  1. -f/--force: force to extract all available files
    ... -f ...
    
  2. -o NAME_MAKEFILE: set name of the Makefile to create
    ... -o example_makefile_name ...
    

Detailed Explanations on some Topics

Regular Expressions and bash Wildcards (listems and unpackems)

General Notes

  • enclose the expression with '; e.g. 'data/results/*.nc', 'file_?.nc'
  • regular expressions are indicated by leading r:; e.g. 'r:file_[0-9].nc'
  • wildcards have to match the whole path; e.g. 'file_?.nc' will not match data/results/file_1.nc; but, '*/file_?.nc' will match data/results/file_1.nc
  • for -a, -x and names/expressions without a preceding flag: if a regular expression should only match the whole path, we need to add line beginning (^) and line end ($) characters; e.g. '^file_[0-9].nc$' will not match data/results/file_1.nc
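
Python's fnmatch and re modules behave in the way these notes describe, so they can be used to try out an expression before passing it to listems or unpackems. This is only a sketch: whether the tools use these exact functions internally is an assumption.

```python
import re
from fnmatch import fnmatchcase

path = "data/results/file_1.nc"

# Wildcards have to match the whole path; note that "*" also matches "/".
wildcard_bare = fnmatchcase(path, "file_?.nc")
wildcard_with_dir = fnmatchcase(path, "*/file_?.nc")

# An unanchored regular expression may match anywhere inside the path ...
regex_unanchored = re.search(r"file_[0-9].nc", path) is not None
# ... while ^ and $ force it to cover the whole path.
regex_anchored = re.search(r"^file_[0-9].nc$", path) is not None
regex_anchored_plain_name = re.search(r"^file_[0-9].nc$", "file_1.nc") is not None

print("wildcard 'file_?.nc':     ", wildcard_bare)
print("wildcard '*/file_?.nc':   ", wildcard_with_dir)
print("regex 'file_[0-9].nc':    ", regex_unanchored)
print("regex '^file_[0-9].nc$':  ", regex_anchored)
```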

RegEx lookup on local file system:

  • If relative path is given: resolve from current working directory
  • If absolute path is given:
    • we look whether the first three folders exist (not via regex matching but as fixed expression; e.g. /first/second/third); if they don't exist, we stop evaluation of regex
    • reason: users should be prevented from using expressions such as [a-zA-Z_/]*/my_file.txt, which cause a lot of traffic on the file system / metadata server
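
The check on the first three folders might look like the following. This is a sketch under the assumption that the pattern is split at "/" and the leading components are tested literally; the function names fixed_prefix and may_evaluate are hypothetical.

```python
import os
from pathlib import PurePosixPath


def fixed_prefix(pattern, depth=3):
    """Return the first `depth` folders of an absolute pattern as a literal path."""
    parts = PurePosixPath(pattern).parts    # ('/', 'first', 'second', 'third', ...)
    return str(PurePosixPath(*parts[: depth + 1]))


def may_evaluate(pattern):
    """Relative patterns are always evaluated; absolute ones only if the
    literal prefix exists as a directory."""
    if not pattern.startswith("/"):
        return True
    return os.path.isdir(fixed_prefix(pattern))


print(fixed_prefix("/first/second/third/[a-z]*/my_file.txt"))
print(may_evaluate("/[a-zA-Z_/]*/my_file.txt"))
```

An expression whose first components are themselves part of the regular expression never names an existing directory, so it is rejected before any directory is scanned.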

RegEx and Wildcard lookup on HPSS:

  • currently deactivated
  • implemented but not activated:
    • we look whether the first three folders exist (not via regex matching but as fixed expression; e.g. /first/second/third); if they don't exist, we stop evaluation of regex
    • reason: users should be prevented from using expressions such as [a-zA-Z_/]*/my_file.txt or *.txt, which cause a lot of traffic on the file system / metadata server

Scope in which files/expressions are searched/evaluated (listems and unpackems)

  • -l look locally
  • -i:
    • no prefix (or l:): look locally
    • t: prefix: look at HPSS
  • -a and -x search for tar archives in last column of the provided INDEX files (resp.: in the content of the column)
  • arguments without flag (files/expressions attached to the call): files/expressions are looked up in the files listed in the provided INDEX files; the available list has been filtered by -a and -x previously

Available prefixes for file names/expressions (listems and unpackems)

  • r:: evaluate the following expression as a regular expression
  • t:: expect file to be located on HPSS
  • l:: expect file to be located on local file system (optional; will be ignored; same as omitting t:)
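
How the prefixes and the list separator combine can be illustrated with a small parser. This is a sketch of the documented rules, not the code used by listems; the function names parse_spec and parse_list and the result keys are hypothetical.

```python
def parse_spec(spec):
    """Split leading r:/t:/l: prefixes off a file name or expression.

    The order of the prefixes has no effect; l: is accepted and ignored.
    """
    flags = {"r": False, "t": False, "l": False}
    while len(spec) > 2 and spec[1] == ":" and spec[0] in flags:
        flags[spec[0]] = True
        spec = spec[2:]
    return {"regex": flags["r"], "on_hpss": flags["t"], "path": spec}


def parse_list(argument, separator=";"):
    """Split a list argument (as given to -i) and parse every entry."""
    return [parse_spec(item) for item in argument.split(separator) if item]


entries = parse_list(
    "t:/hpss/arch/bm0146/k204221/other/INDEX_file.txt;"
    "r:t:/hpss/arch/bm0146/k204221/data_c/INDEX_file_[0-9].txt;"
    "l:my_index_files/INDEX_file.txt"
)
for entry in entries:
    print(entry)
```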

Rules to construct/modify the output path of extracted files (unpackems)

files stored in tar archive file:

data/mask.nc
data/forcing/sst.nc
data/forcing/emis.nc
data/output/wind.nc

extraction with -d /work/bm0146/k204221:

/work/bm0146/k204221/data/mask.nc
/work/bm0146/k204221/data/forcing/sst.nc
/work/bm0146/k204221/data/forcing/emis.nc
/work/bm0146/k204221/data/output/wind.nc

extraction with -d ./old_data:

./old_data/data/mask.nc
./old_data/data/forcing/sst.nc
./old_data/data/forcing/emis.nc
./old_data/data/output/wind.nc

extraction with --flatten:

./mask.nc
./sst.nc
./emis.nc
./wind.nc

extraction with -d old_data --flatten:

./old_data/mask.nc
./old_data/sst.nc
./old_data/emis.nc
./old_data/wind.nc

extraction with -D new_dir:

./new_dir/mask.nc
./new_dir/forcing/sst.nc
./new_dir/forcing/emis.nc
./new_dir/output/wind.nc

extraction with -D new_dir/second_dir:

./new_dir/second_dir/mask.nc
./new_dir/second_dir/sst.nc
./new_dir/second_dir/emis.nc
./new_dir/second_dir/wind.nc

extraction with -d old_data -D new_dir:

./old_data/new_dir/mask.nc
./old_data/new_dir/forcing/sst.nc
./old_data/new_dir/forcing/emis.nc
./old_data/new_dir/output/wind.nc
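
The rules illustrated by these listings can be written down as one function. The following is a sketch that reproduces the examples above, not the unpackems source; the function name output_path and its parameters are hypothetical. The combination of -D and --flatten does not appear in the examples, so the sketch does not model it.

```python
from pathlib import PurePosixPath


def output_path(archived, dest=".", replace=None, flatten=False):
    """Compute where a file from a tar archive would be extracted.

    dest, replace and flatten stand for -d, -D and --flatten.
    """
    if flatten and replace is not None:
        raise ValueError("-D together with --flatten is not modelled here")
    parts = PurePosixPath(archived).parts
    dirs, name = parts[:-1], parts[-1]
    if flatten:
        dirs = ()                                   # drop the directory tree
    elif replace is not None:
        new_dirs = PurePosixPath(replace).parts
        # Replace the first n folders, n being the number of folders in -D.
        dirs = new_dirs + dirs[len(new_dirs):]
    return PurePosixPath(dest).joinpath(*dirs, name)


archive = ["data/mask.nc", "data/forcing/sst.nc", "data/forcing/emis.nc", "data/output/wind.nc"]
for member in archive:
    print(output_path(member, dest="old_data", replace="new_dir"))
```

PurePosixPath drops a leading "./", so the printed paths correspond to the listings above without that prefix.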

Detailed Examples

Please download this zip archive and extract it in your testing/training directory to run the example commands below (those that target the local file system).

Listems

Locally stored INDEX files

# read index files from file lists; local index files
./listems -v -l 'examples/index_file_lists/commented_file_list.txt;examples/index_file_lists/empty_line_file_list.txt;examples/index_file_lists/plain_file_list.txt'

# same as above but do not read default INDEX list file
./listems -v -l 'examples/index_file_lists/commented_file_list.txt;examples/index_file_lists/empty_line_file_list.txt;examples/index_file_lists/plain_file_list.txt' --no-default-index-list

# as two above but other separator for file lists
./listems -v -s ',' -l 'examples/index_file_lists/commented_file_list.txt,examples/index_file_lists/empty_line_file_list.txt,examples/index_file_lists/plain_file_list.txt'

# print as text table
./listems -l 'examples/index_file_lists/commented_file_list.txt;examples/index_file_lists/empty_line_file_list.txt;examples/index_file_lists/plain_file_list.txt' data/ocean_day3d_t_pocp_emep_2012.nc

# print extended output
./listems --long -l 'examples/index_file_lists/commented_file_list.txt;examples/index_file_lists/empty_line_file_list.txt;examples/index_file_lists/plain_file_list.txt' 'data/ocean_day3d_t_pocp_emep_201?.nc'

# more files to look for
./listems -l 'examples/index_file_lists/commented_file_list.txt;examples/index_file_lists/empty_line_file_list.txt;examples/index_file_lists/plain_file_list.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc

# wildcard in -l
./listems -l "examples/index_file_lists/*.txt" data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc

# wildcard in -l and simple file in -x
./listems -l "examples/index_file_lists/*.txt" -x iow_data4_002.tar data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc 

# regex for -l
./listems -v -l "r:examples/index_file_lists/[a-zA-Z-_]*.txt" -x iow_data4_002.tar data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc 

# use several files in -a
./listems -l "examples/index_file_lists/*.txt" -a 'iow_data_001.tar;iow_data_004.tar;iow_data4_002.tar;iow_data_006.tar' -x iow_data4_002.tar data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc 

# use wildcard in -a
./listems -l "examples/index_file_lists/*.txt" -a '*4.tar' -x iow_data4_002.tar data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc

# use regex in -a
./listems -l "examples/index_file_lists/*.txt" -a 'iow_data_00[0-9].tar' -x iow_data4_002.tar data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc

# use wildcard in -x
./listems -l "examples/index_file_lists/*.txt" -a 'iow_data_00[0-9].tar' -x '*6.tar' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc 

# use wildcards in files to search
./listems -l "examples/index_file_lists/*.txt" -a 'iow_data_00[0-9].tar;abc.tar' -x '*6.tar' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc '*.nc'

# `warnow_river...` was not found in the calls before because it is in a subfolder; we solve it here:
./listems -l "examples/index_file_lists/*.txt" data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc '*/warnow_river_phoswam_v04_ist.nc'

# use regular expressions in files to search
./listems -l "examples/index_file_lists/*.txt" -a 'iow_data_00[0-9].tar;abc.tar' -x '*6.tar' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc r:data/ocean_day3d_u_emep_20[0-9][0-9].nc

# some other call ...
./listems -l "examples/index_file_lists/*.txt" -a 'iow_data_001.tar;iow_data_004.tar;iow_data4_002.tar;iow_data_006.tar' -x iow_data4_002.tar data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc

# print as json; with verbose flag
./listems -v -t json -l 'examples/index_file_lists/*.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc '*warnow_river_phoswam_v04_ist.nc'

# look via regex for files to retrieve and print as json
./listems -t json -l 'examples/index_file_lists/*.txt' 'r:.*warnow_river_phoswam_v0[0-9]_[a-zA-Z0-9]+.nc'

# some verbose output
./listems -v -l 'examples/index_file_lists/*.txt' 'data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc *warnow_river_phoswam_v04_ist.nc'

# write output into file `output_listems.txt`
./listems -o output_listems.txt -l 'examples/index_file_lists/*.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc '*warnow_river_phoswam_v04_ist.nc'

# write output into an html file; if `-o` is omitted, the output is printed to the command line
./listems -t html -o output_listems.html -l 'examples/index_file_lists/*.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc '*warnow_river_phoswam_v04_ist.nc'

# create download directory
./listems -v -l 'examples/index_file_lists/*.txt' 'data/ocean_day3d_t_pocp_emep_*.nc' data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc -t json -w tmp

# create download directory and remove/purge
./listems -v -l 'examples/index_file_lists/*.txt' 'data/ocean_day3d_t_pocp_emep_*.nc' data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc -t json -w tmp -p

INDEX files on HPSS

# call for searching in hpss:
./listems -v -i t:/hpss/arch/bm0146/k204221/iow/INDEX.txt '*day3d_area_t_phoswam_v04_15_1995.nc'

# an extended call
./listems -v -i t:/hpss/arch/bm0146/k204221/iow/INDEX.txt '*day3d_area_t_phoswam_v04_15_1995.nc' '*warnow_river*'

Unpackems

Locally stored INDEX files, dry runs for testing

See the listems examples for more ideas.

# an example call
./unpackems -N -l 'examples/index_file_lists/*.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc

# provide a name for the Makefile
./unpackems -N -l 'examples/index_file_lists/*.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc -o test_makefile

# select some files to extract
./unpackems -N -l "examples/index_file_lists/*.txt" -a 'iow_data_001.tar;iow_data_004.tar;iow_data4_002.tar;iow_data_006.tar' -x iow_data4_002.tar data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc 

Create some problematic files first, and then call unpackems:

# create some files that should be extracted
mkdir -p abc/def/ghi/
touch abc/def/ghi/warnow_river_phoswam_v04_ist.nc

# should throw error
./unpackems -N -v -l 'examples/index_file_lists/*.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc '*/warnow_river_phoswam_v04_ist.nc' -D abc/def/ghi

# should throw warning
./unpackems -N -v -l 'examples/index_file_lists/*.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc '*/warnow_river_phoswam_v04_ist.nc' -D abc/def/ghi -K

# will overwrite files and be quiet about it
./unpackems -N -v -l 'examples/index_file_lists/*.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc '*/warnow_river_phoswam_v04_ist.nc' -D abc/def/ghi -O -q

Future Plans Archiving/Retrieval mistral <-> HPSS (partly outdated)

First Steps

  • Catalog of the packed data (home directory .packems/...) with original directories
  • Tools for reading the catalog (packls, packinfo)
  • Frontend for unpackems
    • Single files from catalog (unpackems /original/dir/file.ext)
    • Whole directories (unpackems /original/dir)
    • Specify file names with *, ?, [...] (unpackems /original/dir/*_1984*)
    • Specify new destination directory
      unpackems -d /new/directory /original/dir/file.ext -> /new/directory/file.ext
      unpackems -d /new/directory /original//dir/file.ext -> /new/directory/dir/file.ext
      unpackems -s original,new /original/dir/file.ext -> /new/dir/file.ext
      

Features of Retrieval and Unpacking Scripts

  • Path to index file / local or in archive
  • Source path in archive (if different)
  • Target path of the archive (=.tar)
  • Target path for the content of the archive (if different)
  • "Pattern" of the desired file(s) (e.g. "echam6hr_19")
  • (if this feature is added for the index file: tags like variable names)
  • Unpack?
  • Delete .tar files after unpacking?
  • Unpack only the desired files (according to the pattern)
  • Flattened unpacking (i.e. removing the folder structure)
  • Remove x components of the folder structure (from var/www/html/ to html/ or similar)
  • Set new access rights / chmod for folders and files
  • Overwrite (archive), overwrite (unpack)

Improved Error Handling

  • Advise users to also use the compression script compresm
  • Provide retrieval script that uses the INDEX file
  • Provide use cases/examples
  • Ensure that INDEX file is not overwritten by several processes simultaneously
    • How to prevent write access to the index file? -> Create lockfile?
  • Ensure that INDEX file is not messed up when processes write to it in parallel
  • Resume transfer to /hpss/arch: possibility to delete incomplete files
  • Resume creation of the tar files: possibility to resume incompletely created files
  • Possibility to specify a "Basedir" (tar -C $BASEDIR / --transform "s,${BASEDIR#/},,g" [...])
  • (possibility to combine all files flattened into one .tar file)
  • Possibility to perform archiving of .tar files only
  • Allow more meaningful file names/prefixes, e.g. by specifying different file names. Prefixes (-o) for different inputs (-i).
  • .netrc: Mitigate possible traps (directory change).
  • Timestamp expected in the file name directly before file extension; possibly generalize it

Future Aspects

  • Possibility to store additional metadata (which variables are in which file etc.)
  • (mtar / htar)
  • Striping
  • Larger block sizes (tar -b <n> -> <n>: multiples of 512 bytes)
  • Request from Florian Ziemen: Caching or similar.
  • (archiving script from Pavan in gitlab: https://gitlab.dkrz.de/hsm-tools/pypftp)