Command-line Usage

The command-line tools are the primary way to use fast-carpenter and friends at this point. All of the FAST commands provide built-in help when given the --help option.

fast_curator

The fast-curator package handles the description of the input datasets. These are saved as YAML files, which contain a dictionary that lists the different datasets, the list of files for each dataset, and additional meta-data.
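
As a rough illustration, the content of such a file looks something like the sketch below. The exact key names and layout are defined by fast-curator (see its documentation); the dataset name, file paths and numbers here are purely hypothetical.

datasets:
- name: ttbar             # hypothetical dataset name
  eventtype: mc
  tree: Events            # illustrative tree name
  nfiles: 2
  nevents: 200000
  files:
  - /data/ttbar/file_1.root
  - /data/ttbar/file_2.root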

You can build these files semi-automatically by using the fast_curator command. This can be called once per dataset with a wildcarded expression for that dataset's input files; it will expand the expression, build some simple summary meta-data (number of events, number of files, etc.) and write this to an output YAML file. If the output file already exists and is a valid fast-curator description, the new information will be appended to the existing file.
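
For example, to describe two different datasets in the same output file, you might run something like the following (the dataset names, tree name and file paths are placeholders; quoting the wildcard lets fast_curator expand it itself):

fast_curator -d ttbar --mc -t Events -o datasets.yml "/data/ttbar/*.root"
fast_curator -d data_2018 --data -t Events -o datasets.yml "/data/2018/*.root"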

Input ROOT files can also be stored on xrootd servers, with a file-path specified by the root:// protocol. You can also provide wild-cards for such files, but make sure to check that you pick up all the files you expect; wildcarded searches on xrootd directories can depend on permissions, access rights, storage mirroring and so on.
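
For example, using the xrootd query type listed in the help below (the server and path are purely illustrative):

fast_curator -q xrootd -d dy_jets --mc -t Events -o datasets.yml "root://xrootd.example.site//store/user/someone/DYJets/*.root"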

For an in-depth description of the dataset description files, see the fast-curator pages.

$ fast_curator --help
usage: fast_curator [-h] -d DATASET [-o OUTPUT] [--mc] [--data] [-t TREE_NAME]
                    [-u USER] [-q QUERY_TYPE] [--no-empty-files]
                    [--allow-missing-tree]
                    [--ignore-inaccessible IGNORE_INACCESSIBLE] [-p PREFIX]
                    [--no-defaults-in-output] [--version] [-m META]
                    [files [files ...]]

positional arguments:
  files

optional arguments:
  -h, --help            show this help message and exit
  -d DATASET, --dataset DATASET
                        Which dataset to associate these files to
  -o OUTPUT, --output OUTPUT
                        Name of output file list
  --mc                  Specify if this dataset contains simulated data
  --data                Specify if this dataset contains real data
  -t TREE_NAME, --tree-name TREE_NAME
                        Provide the name of the tree in the input files to
                        calculate number of events, etc
  -u USER, --user USER  Add a user function to extend the dataset dictionary,
                        eg. my_package.my_module.some_function
  -q QUERY_TYPE, --query-type QUERY_TYPE
                        How to interpret file arguments to this command.
                        Allows the use of experiment-specific file catalogues
                        or wild-carded file paths. Known query types are:
                        xrootd, local, cmsdas
  --no-empty-files      Don't include files that contain no events
  --allow-missing-tree  Allow files that don't contain the named tree in
  --ignore-inaccessible IGNORE_INACCESSIBLE
                        Don't include files that can't be opened
  -p PREFIX, --prefix PREFIX
                        Provide a common prefix to files, useful for
                        supporting multiple sites
  --no-defaults-in-output
                        Explicitly list all settings for each dataset in
                        output file instead of grouping them in default block
  --version             show program's version number and exit
  -m META, --meta META  Add other metadata (eg cross-section, run era) for
                        this dataset. Must take the form of 'key=value'
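
For instance, to attach a cross-section to a dataset while curating it (the key and value here are arbitrary illustrations):

fast_curator -d wjets --mc -t Events -m cross_section=61526.7 -o datasets.yml "/data/wjets/*.root"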

fast_curator_check

Sometimes it can be useful to check that your dataset config files are valid, in particular if you use the import section (which allows you to include dataset configs from another file). The fast_curator_check command can help you by expanding such sections and dumping the result to the screen or to an output file.
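
For example, to expand a config and write the result to a new file (the file names are placeholders):

fast_curator_check -o expanded_datasets.yml datasets.yml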

$ fast_curator_check --help
usage: fast_curator_check [-h] [-o OUTPUT] [-f FIELDS] [-p PREFIX]
                          files [files ...]

positional arguments:
  files

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Name of output file list to expand things to
  -f FIELDS, --fields FIELDS
                        Comma-separated list of fields to dump for each
                        dataset
  -p PREFIX, --prefix PREFIX
                        Choose one of the file prefixes to use

fast_carpenter

The fast_carpenter command is the star of the show: it is what actually converts your event-level datasets into binned summaries.

The built-in help should tell you everything you need to know:

$ fast_carpenter --help
usage: fast_carpenter [-h] [--outdir OUTDIR] [--mode MODE] [--ncores NCORES]
                      [--nblocks-per-dataset NBLOCKS_PER_DATASET]
                      [--nblocks-per-sample NBLOCKS_PER_SAMPLE]
                      [--blocksize BLOCKSIZE] [--quiet] [--profile]
                      [--execution-cfg EXECUTION_CFG]
                      [--help-stages [stage-name-regex]]
                      [--help-stages-full stage] [-v] [--bookkeeping]
                      [--no-bookkeeping]
                      dataset_cfg sequence_cfg

Chop up those trees into nice little tables and dataframes

positional arguments:
  dataset_cfg           Dataset config to run over
  sequence_cfg          Config for how to process events

optional arguments:
  -h, --help            show this help message and exit
  --outdir OUTDIR       Where to save the results
  --mode MODE           Which mode to run in (multiprocessing, htcondor, sge)
  --ncores NCORES       Number of cores to run on
  --nblocks-per-dataset NBLOCKS_PER_DATASET
                        Number of blocks per dataset
  --nblocks-per-sample NBLOCKS_PER_SAMPLE
                        Number of blocks per sample
  --blocksize BLOCKSIZE
                        Number of events per block
  --quiet               Keep progress report quiet
  --profile             Profile the code
  --execution-cfg EXECUTION_CFG, -e EXECUTION_CFG
                        A configuration file for the execution system. The
                        exact format and contents of this file will depend on
                        the value of the `--mode` option.
  --help-stages [stage-name-regex]
                        Print help specific to the available stages
  --help-stages-full stage
                        Print the full help specific to the available stages
  -v, --version         show program's version number and exit
  --bookkeeping         Enable creation of book-keeping tarball
  --no-bookkeeping      Disable creation of book-keeping tarball

In its simplest form, then, you can just provide a dataset config and a processing config and run:

fast_carpenter datasets.yml processing.yml

Quite often you will want to use some of the acceleration options, which allow you to process the jobs more quickly by using multiple CPU cores or by running things on a batch system. When you do this, the fast_carpenter command will submit tasks to the batch system and wait for them to finish, so you need to make sure the command is not killed in the meantime, e.g. by breaking or losing the connection to the remote login node on which fast-carpenter is running. Use a tool such as tmux or screen in such cases.
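
A minimal sketch of that workflow with tmux (the session name is arbitrary):

tmux new -s carpenter
fast_carpenter --mode htcondor datasets.yml processing.yml
# detach with Ctrl-b d; reattach later with: tmux attach -t carpenter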

To use multiple CPUs on the local machine, use --mode multiprocessing (this is the default, so not generally needed) and specify the number of cores to use, e.g. --ncores 4:

fast_carpenter --ncores 4 datasets.yml processing.yml

Alternatively, if you have access to an htcondor or SGE batch system (i.e. qsub), then the fast_carpenter command can submit many tasks to run at the same time using the batch system. In this case you need to choose the appropriate value for the --mode option. In addition, the options with block in their names control how many events are processed by each task and for each dataset.
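
For example, to run on an htcondor batch system, splitting each dataset into 20 blocks (the number of blocks is just an illustration):

fast_carpenter --mode htcondor --nblocks-per-dataset 20 datasets.yml processing.yml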

Note

For all modes, the --blocksize option can be helpful to prevent fast-carpenter from reading too many events into memory in one go. Its default value of 100,000 might be too large, in which case reducing it to some other value (e.g. 20,000; see the example after the list below) can help. Common symptoms of the blocksize being too large are:

  • Extremely slow processing, or
  • Batch jobs crashing or not being started
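
A reduced blocksize can be passed directly on the command line, for example:

fast_carpenter --blocksize 20000 datasets.yml processing.yml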

fast_plotter

Once you have produced your binned dataframes, the next thing you’ll likely want to do is turn these into figures. The fast_plotter command and library can help with this. Its command-line interface gives a simple way to make plots from the dataframes with reasonable defaults, whereas its internal functions can be useful when you need more specific ways of presenting results. As with the other FAST commands, running fast_plotter --help will show the available options.

See also

See the dedicated fast-plotter documentation for more guidance on this package.