The Processing Config

The processing config file tells fast-carpenter what to do with your data and is written in YAML.

An example config file looks like:

stages:
    - jet_cleaning: fast_carpenter.Define
    - event_selection: fast_carpenter.CutFlow
    - histogram: fast_carpenter.BinnedDataframe

jet_cleaning:
    variables:
        - BtaggedJets: Jet_bScore > 0.9
        - nBJets: {reduce: count, formula: BtaggedJets}

event_selection:
    selection:
        All:
            - nElectron == 0
            - nJet > 1
            - {reduce: 0, formula: Jet_pt > 100}
            - Any:
                - HT >= 200
                - MHT >= 200

histogram:
    binning:
        - {in: nJet}
        - {in: nBJets}
        - {in: MET, out: met, bins: {edges: [0, 200, 400, 700, 1000]}}
    weights: weight_nominal

Other, more complete examples are listed in Example repositories.

Tip

Since this is a YAML file, things like anchors and block syntax are totally valid, which can be helpful for defining "aliases" or reusing certain parts of a config. For more guidance on YAML, this is a good overview of the concepts and syntax: https://kapeli.com/cheat_sheets/YAML.docset/Contents/Resources/Documents/index.
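
For example, two binned-dataframe stages (hypothetically named histogram_loose and histogram_tight, both assumed to be declared in the stages section) could share their binning through an anchor:

histogram_loose:
    binning: &common_binning
        - {in: nJet}
        - {in: nBJets}

histogram_tight:
    binning: *common_binning
    weights: weight_nominal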

Anatomy of the config

The stages section
This is the most important section of the config because it defines what steps to take with the data. It is a list of single-length dictionaries, whose key is the name for the stage (e.g. histogram) and whose value is the python-importable class that implements it (e.g. fast_carpenter.BinnedDataframe). The following sections discuss which stage classes are valid. The first four lines of the config above show an example of this section, and others can be found in the linked Example repositories.
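
The class does not have to come from fast_carpenter itself: any python-importable class will do. For instance, a stage backed by a user's own code (using the hypothetical module and class from the User-defined Stages section below) could be declared as:

stages:
    - my_fancy_stage: my_custom_module.AddMyFancyVar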
Stage configuration sections
Each stage must be given a complete description by adding a top-level section to the YAML file with the same name provided in the stages section. This should contain a dictionary, which will be passed as keyword arguments to the underlying class's __init__ method. The histogram section at the end of the example config above shows how the stage called histogram is configured. See below for more help on configuring specific stages.
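
In effect, the jet_cleaning section of the example above is handed to the stage's class roughly like this (a sketch: the name and output directory are supplied internally by the framework, and "output/" is only illustrative):

fast_carpenter.Define(
    name="jet_cleaning",
    out_dir="output/",
    variables=[
        {"BtaggedJets": "Jet_bScore > 0.9"},
        {"nBJets": {"reduce": "count", "formula": "BtaggedJets"}},
    ],
)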
Importing other config files

Sometimes it can be helpful to re-use one config in another, for example, defining a list of common variables and event selections, but then changing the BinnedDataframes that are produced. The processing config supports this by using the reserved word IMPORT as the key for a stage, followed by the path to the config file to import. If the path starts with {this_dir} then the imported file will be located relative to the directory of the importing config file.

For example:

- IMPORT: "{this_dir}/another_processing_config.yml"
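
Since IMPORT acts as the key for a stage, it can sit directly in the stages list alongside normal stage definitions; a sketch with a hypothetical imported file name:

stages:
    - IMPORT: "{this_dir}/common_stages.yml"
    - histogram: fast_carpenter.BinnedDataframe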

See also

The interpretation of the processing config is handled by the fast-flow package, so its documentation can also be helpful for understanding the basic anatomy and handling of the config.

Built-in Stages

The list of stages that fast_carpenter already knows about can be found using the built-in --help-stages option.

$ fast_carpenter --help-stages
fast_carpenter.Define
   config: variables
   purpose: Creates new variables using a string-based expression.

fast_carpenter.SystematicWeights
   config: weights, out_format=weight_{}, extra_variations=[]
   purpose: Combines multiple weights and variations to produce a single event weight

fast_carpenter.CutFlow
   config: selection_file=None, keep_unique_id=False, selection=None, counter=True, weights=None
   purpose: Prevents subsequent stages seeing certain events.

fast_carpenter.SelectPhaseSpace
   config: region_name, **kwargs
   purpose: Creates an event-mask and adds it to the data-space.

fast_carpenter.BinnedDataframe
   config: binning, weights=None, dataset_col=True, pad_missing=False, file_format=None, observed=False, weight_data=False
   purpose: Produces a binned dataframe (a multi-dimensional histogram).

fast_carpenter.BuildAghast
   config: binning, weights=None, dataset_col=True
   purpose: Builds an aghast histogram.

fast_carpenter.EventByEventDataframe
   config: collections, mask=None, flatten=True
   purpose: Write out a pandas dataframe with event-level values

Further guidance on the built-in stages can be found by giving the name of a stage to the --help-stages-full option. All the built-in stages of fast_carpenter are available directly from the fast_carpenter module, e.g. fast_carpenter.Define.
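
For example, to see the full description of the Define stage (a sketch of the invocation, assuming the stage name is passed as the option's argument):

$ fast_carpenter --help-stages-full fast_carpenter.Define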

See also

In-depth discussion of the built-in stages and their configuration can be found on the fast_carpenter module page: fast_carpenter.

Todo

Build that list programmatically, so it's always up to date and uses the built-in docstrings for a description.

User-defined Stages

fast-carpenter is still evolving, so it is natural that many analysis tasks cannot be implemented using the existing stages. In this case, it is possible to implement your own stage, making sure it can be imported by Python (e.g. by setting the PYTHONPATH variable to point to the directory containing its code). The class implementing a custom stage should provide the following methods:

__init__(name, out_dir, ...)

This is the method that will receive configuration from the config file, creating the stage itself.

Parameters:
  • name (str) – will contain the name of the stage as given in the config file.
  • out_dir (path) – receives the path to the output directory that should be used if the stage produces output.

Additional arguments can be added, which will be configurable from the processing config file.
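
As a minimal sketch, a custom __init__ might look like the following, where pt_threshold is a hypothetical user-defined argument that then becomes configurable from the YAML:

class AddMyFancyVar():
    def __init__(self, name, out_dir, pt_threshold=30):
        self.name = name                  # the stage's name from the config
        self.out_dir = out_dir            # where any output should be written
        self.pt_threshold = pt_threshold  # extra, user-configurable option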

event(chunk)

Called once for each chunk of data.

Parameters:
  • chunk – provides access to the dataset configuration (chunk.config) and the current data-space (chunk.tree).

Typically one wants an array, or set of arrays, representing the data for each event, in which case these can be obtained using:

jet_pt = chunk.tree.array("jet_pt")
jet_pt, jet_eta = chunk.tree.arrays(["jet_pt", "jet_eta"], outputtype=tuple)

If your stage produces a new variable, which you want other stages to be able to see, then use the new_variable method:

chunk.tree.new_variable("number_good_jets", number_good_jets)

For more details on working with chunk.tree, see fast_carpenter.masked_tree.MaskedUprootTree.

Returns: True or False for whether to continue processing the chunk through subsequent stages.
Return type: bool
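
Putting the pieces together, a hedged sketch of an event method for the hypothetical stage above (the jet_pt branch name is an assumption about the input tree):

    def event(self, chunk):
        # Per-event (jagged) array of jet transverse momenta
        jet_pt = chunk.tree.array("jet_pt")
        # Count the jets above the configured threshold in each event
        number_good_jets = (jet_pt > self.pt_threshold).sum()
        # Make the new variable visible to subsequent stages
        chunk.tree.new_variable("number_good_jets", number_good_jets)
        # Keep processing this chunk through later stages
        return True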

See also

An example of such a user stage can be seen in the cms_public_tutorial demo repository: https://gitlab.cern.ch/fast-hep/public/fast_cms_public_tutorial/blob/master/cms_hep_tutorial/__init__.py

Warning

Make sure that your stage can be imported by Python, most likely by setting the PYTHONPATH variable to point to the containing directory. Then, to check that a stage called AddMyFancyVar defined in a module called my_custom_module can be imported, make sure no errors are raised by running:

python -c "from my_custom_module import AddMyFancyVar"

Todo

Describe the collector and merge methods to allow a user stage to save results to disk.