fast_carpenter.summary.binned_dataframe module

Summarize the data by producing binned and possibly weighted counts of the data.

class fast_carpenter.summary.binned_dataframe.BinnedDataframe(name, out_dir, binning, weights=None, dataset_col=True, pad_missing=False, file_format=None)[source]

Bases: object

Produces a binned dataframe (a multi-dimensional histogram).

def __init__(self, name, out_dir, binning, weights=None, dataset_col=True, pad_missing=False, file_format=None):

Parameters:
  • binning (list[dict]) –

    A list of dictionaries describing the variables to bin on and how they should be binned. Each dictionary configures one binned dimension and can contain the following keys:

    ◦ in (no default) – The name of the attribute on the event to use.
    ◦ out (defaults to the value of in) – The name of the column to be filled in the output dataframe.
    ◦ bins (default: None) – Must be either None or a dictionary. If a dictionary, it must contain one of the following sets of key-value pairs:
      1. nbins, low, high: used to produce a list of bin edges equivalent to numpy.linspace(low, high, nbins + 1)
      2. edges: treated as the list of bin edges directly.
      If set to None, the input variable is assumed to already be categorical (i.e. binned or discrete).
  • weights (str or list[str] or dict[str, str]) – How to weight events in the output table. Must be a single variable, a list of variables, or a dictionary whose values are variables in the data and whose keys are the column names these weights should have in the output tables.
  • file_format (str or dict or list) – Determines the file format used to save the binned dataframe to disk. Should be either a) a string with the file format, b) a dict containing the keyword extension to give the file format, with all other keyword-argument pairs passed on to the corresponding pandas function, or c) a list of values matching a) or b).
  • dataset_col (bool) – If True, adds an extra binning column containing the name of each dataset.
  • pad_missing (bool) – If False, any bins that contain no data are excluded from the stored dataframe. Leaving this False can save disk space and improve processing time, particularly if the bins are only sparsely filled.
Other Parameters:
 
  • name (str) – The name of this stage (handled automatically by fast-flow)
  • out_dir (str) – Where to put the summary table (handled automatically by fast-flow)
Raises:

BadBinnedDataframeConfig – If there is an issue with the binning description.
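To make the bins description concrete, the sketch below resolves a hypothetical binning list into NumPy edge arrays, following the nbins/low/high and edges conventions described above. The variable names ("jet_pt", "njets") and the resolve_edges helper are illustrative, not part of the fast_carpenter API:

```python
import numpy as np

# A hypothetical binning description, as it might be passed to BinnedDataframe.
binning = [
    {"in": "jet_pt", "out": "pt_bin",
     "bins": {"nbins": 10, "low": 0.0, "high": 100.0}},
    {"in": "njets", "bins": {"edges": [0, 1, 2, 3, 10]}},
]

def resolve_edges(bins):
    """Turn a 'bins' dictionary into an array of bin edges (or None)."""
    if bins is None:
        return None  # variable is treated as already categorical
    if "edges" in bins:
        return np.asarray(bins["edges"])
    # nbins/low/high form: nbins bins need nbins + 1 edges
    return np.linspace(bins["low"], bins["high"], bins["nbins"] + 1)

edges = resolve_edges(binning[0]["bins"])  # 11 edges from 0 to 100
```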

collector()[source]
event(chunk)[source]
merge(rhs)[source]
class fast_carpenter.summary.binned_dataframe.Collector(filename, dataset_col, binnings, file_format)[source]

Bases: object

collect(dataset_readers_list, doReturn=True, writeFiles=True)[source]
valid_ext = {'dta': 'stata', 'h5': 'hdf', 'msg': 'msgpack', 'p': 'pickle', 'pkl': 'pickle', 'xlsx': 'excel'}
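The valid_ext mapping ties a file extension to the suffix of the corresponding pandas writer method (to_stata, to_hdf, and so on). A minimal sketch of that lookup, assuming a hypothetical writer_for helper not present in the module:

```python
# Copied from the Collector class attribute documented above.
valid_ext = {'dta': 'stata', 'h5': 'hdf', 'msg': 'msgpack',
             'p': 'pickle', 'pkl': 'pickle', 'xlsx': 'excel'}

def writer_for(extension):
    """Hypothetical helper: map a file extension to a pandas writer name."""
    fmt = valid_ext.get(extension.lstrip('.'))
    if fmt is None:
        raise ValueError(f"Unsupported extension: {extension}")
    return "to_" + fmt  # e.g. the name of pandas.DataFrame.to_pickle
```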
fast_carpenter.summary.binned_dataframe.combined_dataframes(dataset_readers_list, dataset_col, binnings=None)[source]
fast_carpenter.summary.binned_dataframe.densify_dataframe(in_df, binnings)[source]
fast_carpenter.summary.binned_dataframe.explode(df)[source]

Based on this answer: https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows/40449726#40449726
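The linked answer describes splitting a delimited string column into one row per entry. Modern pandas offers DataFrame.explode for the same idea; a sketch with made-up example data (not the module's implementation):

```python
import pandas as pd

df = pd.DataFrame({"dataset": ["a", "b"], "values": ["1,2", "3,4,5"]})

# Split each string into a list, then emit one row per list element.
exploded = (df.assign(values=df["values"].str.split(","))
              .explode("values")
              .reset_index(drop=True))
```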