The command-line tools are the primary way to use fast-carpenter and friends at this point.
All of the FAST commands provide built-in help via the --help option.
The fast-curator package handles the description of the input datasets. These are saved as YAML files, which contain a dictionary that lists the different datasets, the list of files for each dataset, and additional meta-data.
You can build these files semi-automatically by using the fast_curator command.
This can be called once per dataset and given a wildcarded expression for the input files of that dataset; it will expand the expression, build some simple summary meta-data (number of events, number of files, etc.) and write this to an output YAML file.
If the output file already exists, and is a valid fast-curator description, the new information will be appended to the existing file.
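As a concrete sketch (the dataset name and file paths here are hypothetical, and the exact set of fields written may depend on your fast-curator version), a command such as fast_curator -d my_signal --mc -o datasets.yml "data/signal_*.root" might produce a description along these lines:

```yaml
datasets:
  - name: my_signal          # hypothetical dataset name, from -d
    eventtype: mc            # set because --mc was given
    nfiles: 2                # summary meta-data built by fast_curator
    nevents: 20000
    files:                   # expanded from the wildcard expression
      - data/signal_1.root
      - data/signal_2.root
```

Running the command again with a different -d dataset but the same -o output file appends the new dataset to this list.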
Input ROOT files can also be stored on xrootd servers, with a file path using the root:// protocol.
You can also provide wild-cards for such files, but make sure to check that you pick up all the files you expect; wildcarded searches on xrootd directories can depend on permissions, access rights, storage mirroring and so on.
For an in-depth description of the dataset description files, see the fast-curator pages.
$ fast_curator --help
usage: fast_curator [-h] -d DATASET [-o OUTPUT] [--mc] [--data] [-t TREE_NAME]
                    [-u USER] [-q QUERY_TYPE] [--no-empty-files]
                    [--allow-missing-tree]
                    [--ignore-inaccessible IGNORE_INACCESSIBLE] [-p PREFIX]
                    [--no-defaults-in-output] [--version] [-m META]
                    [files [files ...]]

positional arguments:
  files

optional arguments:
  -h, --help            show this help message and exit
  -d DATASET, --dataset DATASET
                        Which dataset to associate these files to
  -o OUTPUT, --output OUTPUT
                        Name of output file list
  --mc                  Specify if this dataset contains simulated data
  --data                Specify if this dataset contains real data
  -t TREE_NAME, --tree-name TREE_NAME
                        Provide the name of the tree in the input files to
                        calculate number of events, etc
  -u USER, --user USER  Add a user function to extend the dataset dictionary,
                        eg. my_package.my_module.some_function
  -q QUERY_TYPE, --query-type QUERY_TYPE
                        How to interpret file arguments to this command.
                        Allows the use of experiment-specific file catalogues
                        or wild-carded file paths. Known query types are:
                        xrootd, local, cmsdas
  --no-empty-files      Don't include files that contain no events
  --allow-missing-tree  Allow files that don't contain the named tree in
  --ignore-inaccessible IGNORE_INACCESSIBLE
                        Don't include files that can't be opened
  -p PREFIX, --prefix PREFIX
                        Provide a common prefix to files, useful for
                        supporting multiple sites
  --no-defaults-in-output
                        Explicitly list all settings for each dataset in
                        output file instead of grouping them in default block
  --version             show program's version number and exit
  -m META, --meta META  Add other metadata (eg cross-section, run era) for
                        this dataset. Must take the form of 'key=value'
Sometimes it can be useful to check that your dataset config files are valid, in particular if you use the import section (which allows you to include dataset configs from another file). The fast_curator_check command can help by expanding such sections and dumping the result to screen or to an output file.
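For illustration (the filenames here are made up, and you should check the fast-curator pages for the exact import syntax supported by your version), a config that pulls in datasets defined elsewhere might look like:

```yaml
# top-level dataset config that includes two other files
import:
  - backgrounds.yml
  - signals.yml
```

Running fast_curator_check on such a file dumps the fully expanded dataset list, which is an easy way to confirm that the imports resolve as you expect.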
$ fast_curator_check --help
usage: fast_curator_check [-h] [-o OUTPUT] [-f FIELDS] [-p PREFIX]
                          files [files ...]

positional arguments:
  files

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Name of output file list to expand things to
  -f FIELDS, --fields FIELDS
                        Comma-separated list of fields to dump for each
                        dataset
  -p PREFIX, --prefix PREFIX
                        Choose one of the file prefixes to use
fast_carpenter is the star of the show.
It is what actually converts your event-level datasets to the binned summaries.
The built-in help should tell you everything you need to know:
$ fast_carpenter --help
usage: fast_carpenter [-h] [--outdir OUTDIR] [--mode MODE] [--ncores NCORES]
                      [--nblocks-per-dataset NBLOCKS_PER_DATASET]
                      [--nblocks-per-sample NBLOCKS_PER_SAMPLE]
                      [--blocksize BLOCKSIZE] [--quiet] [--profile]
                      [--execution-cfg EXECUTION_CFG]
                      [--help-stages [stage-name-regex]]
                      [--help-stages-full stage] [-v] [--bookkeeping]
                      [--no-bookkeeping]
                      dataset_cfg sequence_cfg

Chop up those trees into nice little tables and dataframes

positional arguments:
  dataset_cfg           Dataset config to run over
  sequence_cfg          Config for how to process events

optional arguments:
  -h, --help            show this help message and exit
  --outdir OUTDIR       Where to save the results
  --mode MODE           Which mode to run in (multiprocessing, htcondor, sge)
  --ncores NCORES       Number of cores to run on
  --nblocks-per-dataset NBLOCKS_PER_DATASET
                        Number of blocks per dataset
  --nblocks-per-sample NBLOCKS_PER_SAMPLE
                        Number of blocks per sample
  --blocksize BLOCKSIZE
                        Number of events per block
  --quiet               Keep progress report quiet
  --profile             Profile the code
  --execution-cfg EXECUTION_CFG, -e EXECUTION_CFG
                        A configuration file for the execution system. The
                        exact format and contents of this file will depend on
                        the value of the `--mode` option.
  --help-stages [stage-name-regex]
                        Print help specific to the available stages
  --help-stages-full stage
                        Print the full help specific to the available stages
  -v, --version         show program's version number and exit
  --bookkeeping         Enable creation of book-keeping tarball
  --no-bookkeeping      Disable creation of book-keeping tarball
In its simplest form therefore, you can just provide a dataset config and a processing config and run:
fast_carpenter datasets.yml processing.yml
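To give a flavour of the processing config (all stage names, variables and binning here are made up; use --help-stages or the fast-carpenter documentation for the real stage options), a minimal processing.yml might look something like:

```yaml
# run two stages: an event selection, then a binned summary table
stages:
  - my_selection: fast_carpenter.CutFlow
  - my_table: fast_carpenter.BinnedDataframe

my_selection:
  selection:
    All:
      - nJet >= 2          # hypothetical branch name in the input tree

my_table:
  binning:
    - {in: nJet}           # bin the events by this hypothetical branch
```

The stages list defines the order of processing, and each stage then gets its own top-level section with its settings.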
Quite often you will want to use some of the acceleration options, which allow you to process the jobs more quickly using multiple CPU cores or by running things on a batch system.
When you do this, the fast_carpenter command will submit tasks to the batch system and wait for them to finish, so you need to make sure the command is not killed in the meantime, e.g. by running fast-carpenter on a remote login node and breaking or losing the connection to that node. Use a tool such as tmux or screen in such cases.
To use multiple CPUs on the local machine, use --mode multiprocessing (this is the default, so not generally needed) and specify the number of cores to use:
fast_carpenter --ncores 4 datasets.yml processing.yml
Alternatively, if you have access to an htcondor or SGE (i.e. qsub) batch system, then the fast_carpenter command can submit many tasks to run at the same time using the batch system.
In this case you need to choose the appropriate value for the --mode option. In addition, the options with block in their names control how many events are processed in each task and for each dataset.
For all modes, the --blocksize option can be helpful to prevent fast-carpenter from reading too many events into memory in one go. Its default value of 100,000 might be too large, in which case reducing it to some other value (e.g. 20,000) can help.
Common symptoms of the blocksize being too large are:
- Extremely slow processing, or
- Batch jobs crashing or not being started
Once you have produced your binned dataframes, the next thing you’ll likely want to do is to make these into figures.
The fast_plotter command and library can help with this.
Its command-line interface gives a simple way to make plots from the dataframes with reasonable defaults, while its internal functions can be useful when you need more specific ways of presenting results.
See the dedicated fast-plotter documentation for more guidance on this package.