Pack'em ESMs! Tools for (un-)archiving Earth system model data
- Table of contents
- Overview
- Design and Prototype
- Preparation
- Quick Tour (examples taken from MPI-ESM)
- Detailed Explanations on some Topics
- Detailed Examples
- Future Plans Archiving/Retrieval mistral <-> HPSS (partly outdated)
Updated over 4 years ago by Karl-Hermann Wieners
Information on this page refers to packems-1.0.1
Tools for (un-)archiving Earth system model data
Overview¶
Pack'em ESMs! (packems) contains a set of tools for packing, archiving,
listing, and retrieval of Earth system model and other data.
packems takes directories to be packed into tar files and optionally pushes these directly to the tape archiving system. Its features include parallel operation, batch job support, and error recovery. It keeps track of the user's archiving operations so that this information can be re-used for later retrieval of data. Currently, only HPSS's pftp interface is supported for archiving.
listems takes the information recorded by packems and allows you to survey and examine the archived data without having to unarchive or unpack large data files. It features UNIX and regular expression style search patterns for file selection and also allows user-defined storage index information to be included.
unpackems uses the listems interface to unarchive and unpack the selected data. The retrieval process allows for transforming the original directory structure as needed by the user, e.g. using a different base directory, renaming parts of the path name, or flattening out the tar file contents into a single directory.
Design and Prototype¶
Please refer to this presentation
Preparation¶
Tasks to do once prior to the first usage¶
Passwordless access to the tape archive (HPSS)¶
Before using packems for the first time, you must ensure that access to the tape library is available to you without entering a password. The recommended procedure is described at https://www.dkrz.de/up/systems/hpss/pftp-with-kerberos.
Tasks to do prior to each usage¶
Check Kerberos ticket¶
Kerberos authentication works with so-called tickets, which currently expire after one week. The expiry date can be checked on mistral with klist. The crucial part of the output looks like this:
Valid starting     Expires            Service principal
05/19/20 11:16:29  05/26/20 21:16:29  krbtgt/PHPSS.DKRZ.DE@PHPSS.DKRZ.DE
        renew until 06/30/20 21:16:29
If the Expires date (the format is month/day/year) has not yet passed, everything is fine. If the ticket is about to expire, it can easily be renewed with kinit -AVR. Once it has expired, a new ticket must be created using kinit -AV, entering the Kerberos password.
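The expiry check can also be scripted. The following is only a minimal sketch and not part of packems: it assumes the timestamp layout of the sample klist output shown above (month/day/two-digit year), and the helper names ticket_expiry and needs_renewal as well as the one-day margin are hypothetical.

```python
from datetime import datetime, timedelta

# Timestamp layout taken from the sample klist output (month/day/year).
KLIST_FORMAT = "%m/%d/%y %H:%M:%S"


def ticket_expiry(klist_line):
    """Return the 'Expires' timestamp of one klist ticket line.

    Assumes the layout
    '<start date> <start time> <expiry date> <expiry time> <principal>'.
    """
    fields = klist_line.split()
    return datetime.strptime(" ".join(fields[2:4]), KLIST_FORMAT)


def needs_renewal(klist_line, now, margin=timedelta(days=1)):
    """True if the ticket expires within `margin` of `now` or has expired."""
    return ticket_expiry(klist_line) - now <= margin


line = "05/19/20 11:16:29 05/26/20 21:16:29 krbtgt/PHPSS.DKRZ.DE@PHPSS.DKRZ.DE"
print(ticket_expiry(line))
print(needs_renewal(line, now=datetime(2020, 5, 26, 12, 0, 0)))
```

In a job script, such a check could run before packems is started, so that an expiring ticket is noticed before any transfer begins.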
Modules loaded?¶
The tools are installed as modules in my home directory:
module use ~m221078/etc/Modules
module add packems
This gives you access to the actual (Python) script packems, and to packems_wrapper for use with sbatch.
Setting up packems on your own (if you have the source code)¶
You need:
python 3
pykerberos
pandas
If you work on mistral, please use the module python3/2020.02-gcc-9.1.0. It contains all needed libraries.
Quick Tour (examples taken from MPI-ESM)¶
Packems¶
- I recommend starting the job in the experiment directory on work and specifying the directories or files to be archived relative to it. That way, the archives also contain relative path names and can easily be unpacked to other locations later on.
cd /work/xy1234/m123456/mpiesm/experiments/abc1234
- -j JOBS specifies how many jobs are started in parallel. This is useful in combination with SLURM, e.g. as
sbatch -A xy1234 -p prepost,compute,compute2 --exclusive --mail-type=FAIL packems_wrapper -j 12 ...
- By default, the archives are named packems_001.tar, packems_002.tar, and so on. With -o OUTPUT you can specify a different name for them.
... -o abc1234_outdata_echam ...
- -d determines the directory in which the .tar files are packed (by default, the current directory). This is important if you archive directories in which you do not have write permission.
... -d packing ...
- -a activates the archiving; normally only packing is done. -S determines the directory where the files are stored in the HPSS.
... -a -S xy1234/mpiesm/experiments/abc1234 ...
- -p deletes the packed archives automatically after archiving. Otherwise you can manually delete the files in packing afterwards. The -P option, which also deletes the original files, prevents restarting after abnormal program termination and should not be used at present!
... -p ...
- Finally, define the directory which you would like to archive.
... outdata/echam6
- You can add more than one input directory, but you can set the output only once:
... -o abc1234_outdata outdata/echam6 outdata/jsbach outdata/mpiom outdata/hamocc
- For restart files, the temporal relationship is very important. Therefore, sort them by timestamp:
... -O by_time restart
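The fragments above are meant to be combined into one call. The following minimal sketch assembles such a call as an argument list. It uses only the options described in this section; the helper name packems_command and the order in which the options are emitted are assumptions, not something prescribed by packems.

```python
import shlex


def packems_command(inputs, output=None, pack_dir=None, archive_dir=None,
                    purge=False, jobs=None, order=None):
    """Assemble a packems argument list from the options described above."""
    cmd = ["packems"]
    if jobs is not None:
        cmd += ["-j", str(jobs)]
    if output is not None:
        cmd += ["-o", output]
    if pack_dir is not None:
        cmd += ["-d", pack_dir]
    if archive_dir is not None:
        # -a switches archiving on, -S names the target directory in the HPSS
        cmd += ["-a", "-S", archive_dir]
    if purge:
        cmd.append("-p")
    if order is not None:
        cmd += ["-O", order]
    # The input directories or files come last
    return cmd + list(inputs)


cmd = packems_command(
    ["outdata/echam6"],
    output="abc1234_outdata_echam",
    pack_dir="packing",
    archive_dir="xy1234/mpiesm/experiments/abc1234",
    purge=True,
    jobs=12,
)
print(shlex.join(cmd))
```

Building a list rather than a single string keeps path names with unusual characters intact when the list is handed to subprocess.run.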
Listems¶
Input and filtering¶
-i INDEX_FILE: specify an INDEX file located on the local file system (in addition to those specified in files attached via -l); please note the subsection "Scope in which files/expressions are searched/evaluated" below
... -i /home/user/my_index_files/INDEX_file.txt ...
-i INDEX_FILE: specify an INDEX file located on HPSS by prepending t: to the path (in addition to those specified in files attached via -l); please note the subsection "Scope in which files/expressions are searched/evaluated" below
... -i t:/hpss/arch/bm0146/k204221/INDEX_file.txt ...
-i INDEX_FILE_LIST: specify a list of INDEX files (separated by ;)
... -i t:/hpss/arch/bm0146/k204221/INDEX_file_1.txt;t:/hpss/arch/bm0146/k204221/INDEX_file_2.txt ...
-i INDEX_FILE_WILDCARD: specify INDEX files via a wildcard; note: enclose in ' to prevent automatic evaluation by the shell
... -i 't:/hpss/arch/bm0146/k204221/INDEX_file_*.txt' ...
-i INDEX_FILE_REGULAR_EXPRESSION: specify INDEX files via a regular expression; prepend r:; the order of r: and t: has no effect; note: enclose in ' to prevent automatic evaluation by the shell
... -i 'r:t:/hpss/arch/bm0146/k204221/INDEX_file_[0-9].txt' ...
... -i 't:r:/hpss/arch/bm0146/k204221/INDEX_file_[0-9].txt' ...
-i INDEX_FILE_MIXED_EXPRESSION: specify INDEX file(s) via a mixture of wildcards and regular expressions; note: enclose in ' to prevent automatic evaluation by the shell
... -i 't:/hpss/arch/bm0146/k204221/other/INDEX_file.txt;t:/hpss/arch/bm0146/k204221/icon/INDEX_file_*.txt;r:t:/hpss/arch/bm0146/k204221/data_c/INDEX_file_[0-9].txt' ...
-s LIST_SEPARATOR -i INDEX_FILE_LIST: specify a list of INDEX files and manually set the list separator
... -s ',' -i t:/hpss/arch/bm0146/k204221/INDEX_file_1.txt,t:/hpss/arch/bm0146/k204221/INDEX_file_2.txt ...
-l INDEX_LIST_FILE: specify a file that contains the path(s) to INDEX file(s) (in addition to those INDEX files specified by -i); same usage of wildcards and regular expressions as for -i
... -l my_dummy_index_list.txt ...
--no-default-index-list: do not read the default INDEX file list ~/.packems/INDEX_LIST.txt
... --no-default-index-list ...
--no-default-index-list is useful in combination with -i when only one specific INDEX file should be read:
... --no-default-index-list -i t:/hpss/arch/bm0146/k204221/INDEX_file.txt ...
-a TAR_ARCHIVE: select one (or more) tar archive(s) exclusively; only tar archives that are listed in the provided INDEX file(s) are included; no new tar archives can be added; same usage of wildcards and regular expressions as for -i
... -a search_only_this_archive.tar ...
-x EXCLUDE_TAR_ARCHIVE: exclude one (or more) tar archive(s) that are listed in the provided INDEX files; no new tar archives can be added; same usage of wildcards and regular expressions as for -i
... -x contains_bad_data.tar ...
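The interplay of -a and -x can be pictured as a filter on the archive names found in the INDEX files. The following is a minimal sketch under stated assumptions, not the listems implementation: it handles wildcards only (no r: expressions), it assumes that -x is applied after -a, the function name select_archives is hypothetical, and the archive names are borrowed from the examples further down this page.

```python
from fnmatch import fnmatchcase


def select_archives(indexed, include=None, exclude=None):
    """Filter the tar archive names listed in the INDEX files.

    indexed: archive names found in the INDEX files
    include: wildcard patterns as given with -a (None keeps everything)
    exclude: wildcard patterns as given with -x (None excludes nothing)
    """
    selected = []
    for name in indexed:
        if include is not None and not any(fnmatchcase(name, p) for p in include):
            continue  # not selected by -a
        if exclude and any(fnmatchcase(name, p) for p in exclude):
            continue  # removed by -x
        selected.append(name)
    return selected


indexed = ["iow_data_001.tar", "iow_data_004.tar",
           "iow_data4_002.tar", "iow_data_006.tar"]
print(select_archives(indexed, include=["iow_data_00[0-9].tar"], exclude=["*6.tar"]))
```

Because the result is always a subset of the indexed names, a pattern such as abc.tar that matches nothing in the INDEX files simply selects nothing; it cannot add a new archive.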
Output formatting¶
--long: print detailed/long output (more columns)
... --long ...
-t OUTPUT_FORMAT: choose an output format: txt/text (default), csv, json or html
... -t json ...
-o OUTPUT_FILE: write the default output into a file
... -o output.txt ...
-t OUTPUT_FORMAT -o OUTPUT_FILE: write the output into a file in a specific format; file extensions are not automatically recognized
... -t json -o output.json ...
Storing local copies of INDEX files¶
By default, the INDEX files are copied/retrieved to a local temporary directory and are deleted after they have been read by listems. A t: prefix at the beginning of the INDEX file path indicates that the files should be retrieved from HPSS. Omitting the t: prefix indicates that the files should be copied from the local file system.
-N: dry run; stop after the list of INDEX files has been created; don't download INDEX files from HPSS; meant to check whether the input from -i and -l was properly processed
... -N ...
-w TMP_DIRECTORY: user-provided working directory to retrieve the INDEX files into; the directory will be created if it does not exist; the INDEX files will be kept if purge (-p) is not set
... -w store/index/files/here/ ...
-p -w TMP_DIRECTORY: remove the INDEX files from the user-provided directory after they have been imported into listems
... -p -w store/index/files/here/ ...
Unpackems¶
Overview of the Options¶
Please see the section on packems (above) for the options -b, -F, -n, -N and -j.
Please see the section on listems (above) for the options -l, -i, -a, -x, -s, -v and --no-default-index-list.
Information on the following flags is provided here: -p, -d, -D, --flatten, -o, -K, -O, -q, -w
Setting working and destination directories; modifying extraction paths¶
-d DESTINATION_DIR: destination directory into which the files should be extracted from the tar files
... -d /work/bm0146/k204221/model_results ...
-D REPLACE_DESTINATION_DIR: replace the first n folders of the archived files by the folder provided by -D; n is the number of folders provided (4 in the example below); see "Rules to construct/modify the output path of extracted files" below for details
... -D new/folder/for/data ...
-d DESTINATION_DIR -D REPLACE_DESTINATION_DIR: extract files (from tar archives) into /work/bm0146/k204221/model_results/new_folder and drop the first directory of each archived file
... -d /work/bm0146/k204221/model_results -D new_folder ...
-d DESTINATION_DIR -D REPLACE_DESTINATION_DIR: extract files (from tar archives) into /work/bm0146/k204221/model_results/new_folder/subfolder and drop the first two directories of each archived file
... -d /work/bm0146/k204221/model_results -D new_folder/subfolder ...
--flatten: flatten/remove the directory tree of files stored in tar archives
... --flatten ...
-w WORK_DIR: specify a working directory into which the INDEX and tar files are retrieved
... -w /scratch/k/k204221 ...
Keeping and overwriting files¶
no -K or -O: unpackems raises an error if a file already exists or if a file would be extracted twice (or more) to the same location; this is not fail-safe, e.g. during parallel extractions
-K: keep existing files during extraction; warn if keeping is expected
... -K ...
-O: overwrite existing files during extraction; warn if overwriting is expected
... -O ...
-q: suppress warnings thrown by -K and -O
... -q ...
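The behaviour of these flags can be summarised as a small decision function. This is a minimal sketch of the rules listed above, not the unpackems code; the name resolve_conflict is hypothetical, and rejecting the combination of -K and -O is an assumption, since this page does not say how the two flags combine.

```python
import warnings


def resolve_conflict(target_exists, keep=False, overwrite=False, quiet=False):
    """Decide what to do with one file that is about to be extracted.

    Returns 'extract', 'keep' or 'overwrite'. Raises FileExistsError if the
    target exists and neither keep (-K) nor overwrite (-O) was requested.
    """
    if keep and overwrite:
        # Assumption: treated as a usage error in this sketch.
        raise ValueError("choose either keep (-K) or overwrite (-O)")
    if not target_exists:
        return "extract"
    if not (keep or overwrite):
        raise FileExistsError("target exists; use -K or -O")
    action = "keep" if keep else "overwrite"
    if not quiet:  # -q suppresses the warning
        warnings.warn(f"existing file: will {action}")
    return action


print(resolve_conflict(target_exists=True, overwrite=True, quiet=True))
```

As noted above, a check of this kind is not fail-safe during parallel extractions, because another process can create the target between the check and the extraction.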
Cleanup after extraction¶
-p: purge the INDEX files from the working directory provided by -w
... -p ...
NOTE: tar archives are not automatically removed after retrieval and successful extraction. Please run make clean -f MAKEFILE manually (MAKEFILE is the name of the generated Makefile).
Other options/arguments¶
-f/--force: force extraction of all available files
... -f ...
-o NAME_MAKEFILE: set the name of the Makefile to create
... -o example_makefile_name ...
Detailed Explanations on some Topics¶
Regular Expressions and bash Wildcards (listems and unpackems)¶
General Notes¶
- enclose the expression in '; e.g. 'data/results/*.nc', 'file_?.nc'
- regular expressions are indicated by a leading r:; e.g. 'r:file_[0-9].nc'
- wildcards have to match the whole path; e.g. 'file_?.nc' will not match data/results/file_1.nc, but '*/file_?.nc' will match data/results/file_1.nc
- for -a, -x and names/expressions without a preceding flag: if a regular expression should only match the whole path, add the line-beginning (^) and line-end ($) characters; e.g. '^file_[0-9].nc$' will not match data/results/file_1.nc
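The difference between the two matching styles can be tried out with Python's standard library. This is only an illustration using fnmatch and re as stand-ins; the matching inside listems and unpackems may differ in details.

```python
import re
from fnmatch import fnmatchcase

path = "data/results/file_1.nc"

# Wildcards are matched against the whole path.
whole_path_miss = fnmatchcase(path, "file_?.nc")
whole_path_hit = fnmatchcase(path, "*/file_?.nc")

# A regular expression may match anywhere in the path ...
regex_anywhere = re.search(r"file_[0-9].nc", path) is not None
# ... unless it is anchored with ^ and $.
regex_anchored = re.search(r"^file_[0-9].nc$", path) is not None

print(whole_path_miss, whole_path_hit, regex_anywhere, regex_anchored)
```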
RegEx lookup on local file system:¶
- If a relative path is given: resolve it from the current working directory
- If an absolute path is given:
  - we check whether the first three folders exist (not via regex matching but as a fixed expression; e.g. /first/second/third); if they don't exist, we stop evaluating the regex. Reason: users should be prevented from using expressions such as [a-zA-Z_/]*/my_file.txt, which cause a lot of traffic on the file system / metadata server
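The guard described above can be sketched as follows. This is an illustration of the idea only, not the code used by listems; the function names fixed_prefix and may_evaluate are hypothetical.

```python
import os


def fixed_prefix(pattern, depth=3):
    """Return the first `depth` folders of an absolute pattern as a plain path."""
    parts = pattern.split("/")
    # parts[0] is the empty string in front of the leading "/"
    return "/".join(parts[: depth + 1])


def may_evaluate(pattern, depth=3):
    """Allow regex evaluation only if the fixed leading folders exist."""
    if not pattern.startswith("/"):
        # Relative patterns are resolved from the current working directory
        return True
    return os.path.isdir(fixed_prefix(pattern, depth))


print(fixed_prefix("/first/second/third/file_[0-9].txt"))
print(may_evaluate("/[a-zA-Z_/]*/my_file.txt"))
```

Because the leading folders are tested literally, a pattern that starts with a character class right after the root is rejected before any directory is scanned.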
RegEx and Wildcard lookup on HPSS:¶
- currently deactivated
- implemented but not activated:
  - we check whether the first three folders exist (not via regex matching but as a fixed expression; e.g. /first/second/third); if they don't exist, we stop evaluating the regex. Reason: users should be prevented from using expressions such as [a-zA-Z_/]*/my_file.txt or *.txt, which cause a lot of traffic on the file system / metadata server
Scope in which files/expressions are searched/evaluated (listems and unpackems)¶
- -l: look locally
- -i:
  - no prefix (or l:): look locally
  - t: prefix: look at HPSS
- -a and -x: search for tar archives in the last column of the provided INDEX files (resp.: in the content of the column)
- arguments without flag (files/expressions attached to the call): files/expressions are looked up in the files listed in the provided INDEX files; the available list has been filtered by -a and -x previously
Available prefixes for file names/expressions (listems and unpackems)¶
- r: evaluate the following expression as a regular expression
- t: expect the file to be located on HPSS
- l: expect the file to be located on the local file system (optional; will be ignored; same as omitting t:)
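The prefixes can be stripped off one after the other, which also explains why the order of r: and t: has no effect. The following minimal sketch shows one way to do this; the function names are hypothetical, and how the tools treat contradictory prefixes (t: together with l:) is not covered by this page, so the sketch simply lets t: win.

```python
def parse_spec(spec):
    """Split leading r:/t:/l: prefixes off a file name or expression.

    Returns a tuple (path, is_regex, on_hpss).
    """
    is_regex = on_hpss = False
    while spec[:2] in ("r:", "t:", "l:"):
        prefix, spec = spec[:2], spec[2:]
        if prefix == "r:":
            is_regex = True
        elif prefix == "t:":
            on_hpss = True
        # "l:" means local, which is the default, so nothing is recorded
    return spec, is_regex, on_hpss


def parse_spec_list(arg, separator=";"):
    """Parse a separator-delimited list as accepted by -i and -l."""
    return [parse_spec(item) for item in arg.split(separator) if item]


print(parse_spec("r:t:/hpss/arch/bm0146/k204221/INDEX_file_[0-9].txt"))
print(parse_spec("t:r:/hpss/arch/bm0146/k204221/INDEX_file_[0-9].txt"))
```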
Rules to construct/modify the output path of extracted files (unpackems)¶
files stored in tar archive file:
data/mask.nc
data/forcing/sst.nc
data/forcing/emis.nc
data/output/wind.nc
extraction with -d /work/bm0146/k204221:
/work/bm0146/k204221/data/mask.nc
/work/bm0146/k204221/data/forcing/sst.nc
/work/bm0146/k204221/data/forcing/emis.nc
/work/bm0146/k204221/data/output/wind.nc
extraction with -d ./old_data:
./old_data/data/mask.nc
./old_data/data/forcing/sst.nc
./old_data/data/forcing/emis.nc
./old_data/data/output/wind.nc
extraction with --flatten:
./mask.nc
./sst.nc
./emis.nc
./wind.nc
extraction with -d old_data --flatten:
./old_data/mask.nc
./old_data/sst.nc
./old_data/emis.nc
./old_data/wind.nc
extraction with -D new_dir:
./new_dir/mask.nc
./new_dir/forcing/sst.nc
./new_dir/forcing/emis.nc
./new_dir/output/wind.nc
extraction with -D new_dir/second_dir:
./new_dir/second_dir/mask.nc
./new_dir/second_dir/sst.nc
./new_dir/second_dir/emis.nc
./new_dir/second_dir/wind.nc
extraction with -d old_data -D new_dir:
./old_data/new_dir/mask.nc
./old_data/new_dir/forcing/sst.nc
./old_data/new_dir/forcing/emis.nc
./old_data/new_dir/output/wind.nc
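The rules illustrated by the listings above can be condensed into a few lines of Python. This is a minimal sketch that reproduces the examples on this page, not the code used by unpackems. The function name output_path is hypothetical, letting --flatten take precedence over -D is an assumption, and the results are printed without the leading ./ used in the listings.

```python
from pathlib import PurePosixPath


def output_path(member, dest=".", replace=None, flatten=False):
    """Construct the extraction path of one tar archive member.

    member:  path of the file inside the tar archive
    dest:    destination directory as given with -d
    replace: replacement folders as given with -D
    flatten: True if --flatten was given
    """
    parts = PurePosixPath(member).parts
    dirs, name = list(parts[:-1]), parts[-1]
    if flatten:
        dirs = []
    elif replace is not None:
        new = PurePosixPath(replace).parts
        # Drop as many leading folders as -D provides, then prepend -D
        dirs = list(new) + dirs[len(new):]
    return str(PurePosixPath(dest, *dirs, name))


for member in ("data/mask.nc", "data/forcing/sst.nc"):
    print(output_path(member, dest="old_data", replace="new_dir"))
```

If -D provides more folders than the archived path contains, all folders of the archived path are dropped, which is why data/mask.nc ends up directly in new_dir/second_dir in the example above.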
Detailed Examples¶
Please download this zip archive and extract it in your testing/training directory to run the example commands below (those aimed at the local file system).
Listems¶
Locally stored INDEX files¶
# read index files from file lists; local index files
./listems -v -l 'examples/index_file_lists/commented_file_list.txt;examples/index_file_lists/empty_line_file_list.txt;examples/index_file_lists/plain_file_list.txt'
# same as above but do not read default INDEX list file
./listems -v -l 'examples/index_file_lists/commented_file_list.txt;examples/index_file_lists/empty_line_file_list.txt;examples/index_file_lists/plain_file_list.txt' --no-default-index-list
# as two above but other separator for file lists
./listems -v -s ',' -l 'examples/index_file_lists/commented_file_list.txt,examples/index_file_lists/empty_line_file_list.txt,examples/index_file_lists/plain_file_list.txt'
# print as text table
./listems -l 'examples/index_file_lists/commented_file_list.txt;examples/index_file_lists/empty_line_file_list.txt;examples/index_file_lists/plain_file_list.txt' data/ocean_day3d_t_pocp_emep_2012.nc
# print extended output
./listems --long -l 'examples/index_file_lists/commented_file_list.txt;examples/index_file_lists/empty_line_file_list.txt;examples/index_file_lists/plain_file_list.txt' 'data/ocean_day3d_t_pocp_emep_201?.nc'
# more file to look for
./listems -l 'examples/index_file_lists/commented_file_list.txt;examples/index_file_lists/empty_line_file_list.txt;examples/index_file_lists/plain_file_list.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc
# wildcard in -l
./listems -l "examples/index_file_lists/*.txt" data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc
# wildcard in -l and simple file in -x
./listems -l "examples/index_file_lists/*.txt" -x iow_data4_002.tar data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc
# regex for -l
./listems -v -l "r:examples/index_file_lists/[a-zA-Z-_]*.txt" -x iow_data4_002.tar data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc
# use several files in -a
./listems -l "examples/index_file_lists/*.txt" -a 'iow_data_001.tar;iow_data_004.tar;iow_data4_002.tar;iow_data_006.tar' -x iow_data4_002.tar data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc
# use wildcard in -a
./listems -l "examples/index_file_lists/*.txt" -a '*4.tar' -x iow_data4_002.tar data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc
# use regex in -a
./listems -l "examples/index_file_lists/*.txt" -a 'iow_data_00[0-9].tar' -x iow_data4_002.tar data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc
# use wildcard in -x
./listems -l "examples/index_file_lists/*.txt" -a 'iow_data_00[0-9].tar' -x '*6.tar' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc
# use wildcards in files to search
./listems -l "examples/index_file_lists/*.txt" -a 'iow_data_00[0-9].tar;abc.tar' -x '*6.tar' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc '*.nc'
# `warnow_river...` was not found in the calls before because it is in a subfolder; we solve it here:
./listems -l "examples/index_file_lists/*.txt" data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc '*/warnow_river_phoswam_v04_ist.nc'
# use regular expressions in files to search
./listems -l "examples/index_file_lists/*.txt" -a 'iow_data_00[0-9].tar;abc.tar' -x '*6.tar' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc r:data/ocean_day3d_u_emep_20[0-9][0-9].nc
# some other call ...
./listems -l "examples/index_file_lists/*.txt" -a 'iow_data_001.tar;iow_data_004.tar;iow_data4_002.tar;iow_data_006.tar' -x iow_data4_002.tar data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc
# print as json; with verbose flag
./listems -v -t json -l 'examples/index_file_lists/*.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc '*warnow_river_phoswam_v04_ist.nc'
# look via regex for files to retrieve and print as json
./listems -t json -l 'examples/index_file_lists/*.txt' 'r:.*warnow_river_phoswam_v0[0-9]_[a-zA-Z0-9]+.nc'
# some verbose output
./listems -v -l 'examples/index_file_lists/*.txt' 'data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc *warnow_river_phoswam_v04_ist.nc'
# write output into file `output_listems.txt`
./listems -o output_listems.txt -l 'examples/index_file_lists/*.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc '*warnow_river_phoswam_v04_ist.nc'
# write output into an html file; if `-o` is omitted, the output is printed to the command line
./listems -t html -o output_listems.html -l 'examples/index_file_lists/*.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc '*warnow_river_phoswam_v04_ist.nc'
# create download directory
./listems -v -l 'examples/index_file_lists/*.txt' 'data/ocean_day3d_t_pocp_emep_*.nc' data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc -t json -w tmp
# create download directory and remove/purge
./listems -v -l 'examples/index_file_lists/*.txt' 'data/ocean_day3d_t_pocp_emep_*.nc' data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc -t json -w tmp -p
INDEX files on HPSS¶
# call for searching in hpss:
./listems -v -i t:/hpss/arch/bm0146/k204221/iow/INDEX.txt '*day3d_area_t_phoswam_v04_15_1995.nc'
# slightly extended call
./listems -v -i t:/hpss/arch/bm0146/k204221/iow/INDEX.txt '*day3d_area_t_phoswam_v04_15_1995.nc' '*warnow_river*'
Unpackems¶
Locally stored INDEX files, dry runs for testing¶
see examples for listems for more ideas
# an example call
./unpackems -N -l 'examples/index_file_lists/*.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc
# provide a name for the Makefile
./unpackems -N -l 'examples/index_file_lists/*.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc -o test_makefile
# select some files to extract
./unpackems -N -l "examples/index_file_lists/*.txt" -a 'iow_data_001.tar;iow_data_004.tar;iow_data4_002.tar;iow_data_006.tar' -x iow_data4_002.tar data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc warnow_river_phoswam_v04_ist.nc
create some problematic files, first, and then call unpackems:
# create some files that should be extracted
mkdir -p abc/def/ghi/
touch abc/def/ghi/warnow_river_phoswam_v04_ist.nc
# should throw error
./unpackems -N -v -l 'examples/index_file_lists/*.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc '*/warnow_river_phoswam_v04_ist.nc' -D abc/def/ghi
# should throw warning
./unpackems -N -v -l 'examples/index_file_lists/*.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc '*/warnow_river_phoswam_v04_ist.nc' -D abc/def/ghi -K
# will overwrite files and be quiet about it
./unpackems -N -v -l 'examples/index_file_lists/*.txt' data/ocean_day3d_t_pocp_emep_2012.nc data/ocean_day3d_t_pocp_emep_2004.nc '*/warnow_river_phoswam_v04_ist.nc' -D abc/def/ghi -O -q
Future Plans Archiving/Retrieval mistral <-> HPSS (partly outdated)¶
First Steps¶
- Catalog of the packed data (home directory .packems/...) with original directories
- Tools for reading the catalog (packls, packinfo)
- Frontend for unpackems
  - Single files from catalog (unpackems /original/dir/file.ext)
  - Whole directories (unpackems /original/dir)
  - Specify file names with *, ?, [...] (unpackems /original/dir/*_1984*)
  - Specify new destination directory
    unpackems -d /new/directory /original/dir/file.ext -> /new/directory/file.ext
    unpackems -d /new/directory /original//dir/file.ext -> /new/directory/dir/file.ext
    unpackems -s original,new /original/dir/file.ext -> /new/dir/file.ext
Features Retrieval and Unpacking Scripts¶
- Path to index file / local or in archive
- Source path in archive (if different)
- Target path of the archive (=.tar)
- Target path for the content of the archive (if different)
- "Pattern" of the desired file(s) (e.g. "echam6hr_19")
- (if this feature is added for the index file: tags like variable names)
- Unpack?
- Delete .tar files after unpacking?
- Unpack only the desired files (according to the pattern)
- Flattened unpacking (i.e. removing the folder structure)
- Remove x components of the folder structure (from var/www/html/ to html/ or similar)
- Set new access rights / chmod for folders and files
- Overwrite (archive), overwrite (unpack)
Improved Error Handling¶
- Advise users to also use the compression script compresm
- Provide retrieval script that uses the INDEX file
- Provide use cases/examples
- Ensure that INDEX file is not overwritten by several processes simultaneously
- How to prevent write access to the index file? -> Create lockfile?
- Ensure that INDEX file is not messed up when processes write to it in parallel
- Resume transfer to /hpss/arch: possibility to delete incomplete files
- Resume creation of the tar files: possibility to resume incompletely created files
- Possibility to specify a "Basedir" (tar -C $BASEDIR / --transform "s,${BASEDIR#/},,g" [...])
- (possibility to combine all files flattened into one .tar file)
- Possibility to perform archiving of .tar files only
- Allow more meaningful file names/prefixes, e.g. by specifying different file names. Prefixes (-o) for different inputs (-i).
- .netrc: Mitigate possible traps (directory change).
- Timestamp expected in the file name directly before file extension; possibly generalize it
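The lockfile idea raised in the list above could look roughly like the following. This is a sketch of one possible approach, not an existing packems feature. The names index_lock and append_record, the .lock suffix, and the timeout values are all hypothetical, and the format of an INDEX record is deliberately left open.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def index_lock(index_path, timeout=10.0, poll=0.05):
    """Hold an exclusive lockfile next to the INDEX file while writing."""
    lock_path = index_path + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes the creation fail if the lockfile already exists
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def append_record(index_path, record):
    """Append one line to the INDEX file while holding the lock."""
    with index_lock(index_path):
        with open(index_path, "a") as index_file:
            index_file.write(record.rstrip("\n") + "\n")
```

A lockfile left behind by a crashed process would block all later writers until it is removed by hand, so a real implementation would need a strategy for stale locks.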
Future Aspects¶
- Possibility to store additional metadata (which variables are in which file etc.)
- (mtar/htar)
- Striping
- Larger block sizes (tar -b <n> -> <n>: multiples of 512 bytes)
- Request from Florian Ziemen: Caching or similar.
- (archiving script from Pavan in gitlab: https://gitlab.dkrz.de/hsm-tools/pypftp)