Command-line Usage

The command-line tools are the primary way to use fast-carpenter and friends at this point. All of the FAST commands provide built-in help via the --help option.


The fast-curator package handles the description of the input datasets. These are saved as YAML files, which contain a dictionary that lists the different datasets, the list of files for each dataset, and additional meta-data.
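As an illustration, a dataset description might look something like the following. The field names and values here are a sketch of the format described above (datasets, their file lists, and summary meta-data); see the fast-curator pages for the authoritative schema:

```yaml
datasets:
  - name: ttbar          # dataset identifier (hypothetical example)
    eventtype: mc        # simulated data, as set by the --mc flag
    tree: Events         # name of the ROOT tree, as set by --tree-name
    nevents: 123456      # summary meta-data filled in by fast_curator
    nfiles: 2
    files:
      - data/ttbar/file_1.root
      - data/ttbar/file_2.root
```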

You can build these files semi-automatically by using the fast_curator command. This can be called once per dataset and given a wildcarded expression for the input files of this dataset, which it will then expand, build some simple summary meta-data (number of events, number of files, etc) and write this to an output YAML file. If the output file already exists, and is a valid fast-curator description, the new information will be appended to the existing file.
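For example, to describe a simulated dataset whose files live in a local directory (the dataset name, tree name, and paths here are purely illustrative):

```shell
fast_curator --mc -d ttbar -t Events -o datasets.yml "data/ttbar/*.root"
```

Quoting the wildcard lets fast_curator expand it itself; leaving it unquoted lets the shell expand it first, which also works since the command accepts multiple file arguments.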

Input ROOT files can also be stored on xrootd servers, with a file-path specified by the root:// protocol. You can also provide wild-cards for such files, but make sure to check that you pick all files that you expect; wildcarded searches on xrootd directories can depend on permissions, access rights, storage mirroring and so on.
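For instance, with a hypothetical xrootd server and path (replace both with your own), you might write:

```shell
fast_curator --data -d data_2018 -q xrootd -o datasets.yml \
    "root://some.server.example//store/data/2018/*.root"
```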

For an in-depth description of the dataset description files, see the fast-curator pages.

$ fast_curator --help
usage: fast_curator [-h] -d DATASET [-o OUTPUT] [--mc] [--data] [-t TREE_NAME]
                    [-u USER] [-q QUERY_TYPE] [--no-empty-files]
                    [--ignore-inaccessible IGNORE_INACCESSIBLE] [-p PREFIX]
                    [--no-defaults-in-output] [--version] [-m META]
                    [files [files ...]]

positional arguments:
  files

optional arguments:
  -h, --help            show this help message and exit
  -d DATASET, --dataset DATASET
                        Which dataset to associate these files to
  -o OUTPUT, --output OUTPUT
                        Name of output file list
  --mc                  Specify if this dataset contains simulated data
  --data                Specify if this dataset contains real data
  -t TREE_NAME, --tree-name TREE_NAME
                        Provide the name of the tree in the input files to
                        calculate number of events, etc
  -u USER, --user USER  Add a user function to extend the dataset dictionary,
                        eg. my_package.my_module.some_function
  -q QUERY_TYPE, --query-type QUERY_TYPE
                        How to interpret file arguments to this command.
                        Allows the use of experiment-specific file catalogues
                        or wild-carded file paths. Known query types are:
                        xrootd, local, cmsdas
  --no-empty-files      Don't include files that contain no events
  --allow-missing-tree  Allow files that don't contain the named tree
  --ignore-inaccessible IGNORE_INACCESSIBLE
                        Don't include files that can't be opened
  -p PREFIX, --prefix PREFIX
                        Provide a common prefix to files, useful for
                        supporting multiple sites
  --no-defaults-in-output
                        Explicitly list all settings for each dataset in
                        output file instead of grouping them in default block
  --version             show program's version number and exit
  -m META, --meta META  Add other metadata (eg cross-section, run era) for
                        this dataset. Must take the form of 'key=value'


Sometimes it can be useful to check that your dataset config files are valid, in particular if you use the import section (which allows you to include dataset configs from another file). The fast_curator_check command can help by expanding such sections and dumping the result to screen or to an output file.
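For example, to expand any import sections in a config and print a couple of summary fields for each dataset (the field names here are illustrative):

```shell
fast_curator_check -f name,nfiles datasets.yml
```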

$ fast_curator_check --help
usage: fast_curator_check [-h] [-o OUTPUT] [-f FIELDS] [-p PREFIX]
                          files [files ...]

positional arguments:
  files

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Name of output file list to expand things to
  -f FIELDS, --fields FIELDS
                        Comma-separated list of fields to dump for each
                        dataset
  -p PREFIX, --prefix PREFIX
                        Choose one of the file prefixes to use


The fast_carpenter command is the star of the show: it is what actually converts your event-level datasets into binned summaries.

The built-in help should tell you everything you need to know:

$ fast_carpenter --help
usage: fast_carpenter [-h] [--outdir OUTDIR] [--mode MODE] [--ncores NCORES]
                      [--nblocks-per-dataset NBLOCKS_PER_DATASET]
                      [--nblocks-per-sample NBLOCKS_PER_SAMPLE]
                      [--blocksize BLOCKSIZE] [--quiet] [--profile]
                      [--execution-cfg EXECUTION_CFG]
                      [--help-stages [stage-name-regex]]
                      [--help-stages-full stage] [-v] [--bookkeeping]
                      dataset_cfg sequence_cfg

Chop up those trees into nice little tables and dataframes

positional arguments:
  dataset_cfg           Dataset config to run over
  sequence_cfg          Config for how to process events

optional arguments:
  -h, --help            show this help message and exit
  --outdir OUTDIR       Where to save the results
  --mode MODE           Which mode to run in (multiprocessing, htcondor, sge)
  --ncores NCORES       Number of cores to run on
  --nblocks-per-dataset NBLOCKS_PER_DATASET
                        Number of blocks per dataset
  --nblocks-per-sample NBLOCKS_PER_SAMPLE
                        Number of blocks per sample
  --blocksize BLOCKSIZE
                        Number of events per block
  --quiet               Keep progress report quiet
  --profile             Profile the code
  --execution-cfg EXECUTION_CFG, -e EXECUTION_CFG
                        A configuration file for the execution system. The
                        exact format and contents of this file will depend on
                        the value of the `--mode` option.
  --help-stages [stage-name-regex]
                        Print help specific to the available stages
  --help-stages-full stage
                        Print the full help specific to the available stages
  -v, --version         show program's version number and exit
  --bookkeeping         Enable creation of book-keeping tarball
  --no-bookkeeping      Disable creation of book-keeping tarball

In its simplest form, therefore, you can just provide a dataset config and a processing config and run:

fast_carpenter datasets.yml processing.yml

Quite often you will want to use one of the acceleration options, which let you process the jobs more quickly using multiple CPU cores or a batch system. In batch mode, the fast_carpenter command submits tasks to the batch system and waits for them to finish, so you need to make sure the command is not killed in the meantime, e.g. by running fast-carpenter on a remote login node and losing the connection to that login. Use a tool such as tmux or screen in such cases.
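For example, with tmux you can start a detachable session, run fast_carpenter inside it, and safely disconnect (the session name and configs here are illustrative):

```shell
tmux new -s carpenter          # start a named, detachable session
fast_carpenter --mode htcondor datasets.yml processing.yml
# press Ctrl-b then d to detach; later, reattach with:
tmux attach -t carpenter
```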

To use multiple CPUs on the local machine, use --mode multiprocessing (this is the default, so not generally needed) and specify the number of cores to use, e.g. --ncores 4.

fast_carpenter --ncores 4 datasets.yml processing.yml

Alternatively, if you have access to an htcondor or SGE (i.e. qsub) batch system, then the fast_carpenter command can submit many tasks to run at the same time using the batch system. In this case you need to choose the appropriate value for the --mode option. In addition, the options with "block" in their names control how many events are processed in each task and for each dataset.
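For example, to run on an htcondor batch system while limiting how many blocks each dataset is split into (the values here are illustrative):

```shell
fast_carpenter --mode htcondor --nblocks-per-dataset 10 \
    datasets.yml processing.yml
```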


For all modes, the --blocksize option can be helpful to prevent fast-carpenter from reading too many events into memory in one go. Its default value of 100,000 might be too large, in which case reducing it to some other value (e.g. 20,000) can help. Common symptoms of the blocksize being too large are:

  • Extremely slow processing, or
  • Batch jobs crashing or not being started
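If you hit either symptom, try rerunning with a smaller block size, for example:

```shell
fast_carpenter --blocksize 20000 datasets.yml processing.yml
```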


Once you have produced your binned dataframes, the next thing you’ll likely want to do is turn these into figures. The fast_plotter command and library can help with this. Its command-line interface gives a simple way to make plots from the dataframes with reasonable defaults, whereas its internal functions can be useful when you need more specific ways of presenting results.


See also

See the dedicated fast-plotter documentation for more guidance on this package.