fast_carpenter.summary.binned_dataframe module

Summarize the data by producing binned and possibly weighted counts of the data.

class fast_carpenter.summary.binned_dataframe.BinnedDataframe(name, out_dir, binning, weights=None, dataset_col=True, pad_missing=False, file_format=None)[source]

Bases: object

Produces a binned dataframe (a multi-dimensional histogram).

def __init__(self, name, out_dir, binning, weights=None, dataset_col=True, pad_missing=False, file_format=None):

Parameters:
  • binning (list[dict]) –

    A list of dictionaries describing the variables to bin on and how they should be binned. Each dictionary configures one binned dimension and can contain the following keys:

    ◦ in (no default) – The name of the attribute on the event to use.
    ◦ out (defaults to the value of in) – The name of the column to be filled in the output dataframe.
    ◦ bins (default: None) – Must be either None or a dictionary. If a dictionary, it must contain one of the following sets of key-value pairs:
      1. nbins, low, high: used to produce a list of bin edges equivalent to numpy.linspace(low, high, nbins + 1)
      2. edges: treated as the list of bin edges directly.
      If set to None, the input variable is assumed to already be categorical (i.e. binned or discrete).
  • weights (str or list[str] or dict[str, str]) – How to weight events in the output table. Must be a single variable, a list of variables, or a dictionary whose values are variables in the data and whose keys are the column names these weights should have in the output tables.
  • file_format (str or dict or list) – Determines the file format used to save the binned dataframe to disk. Should be either a) a string with the file format, b) a dict containing the keyword extension to give the file format, with all other keyword-argument pairs passed on to the corresponding pandas function, or c) a list of values matching a) or b).
  • dataset_col (bool) – If True, adds an extra binning column containing the name of each dataset.
  • pad_missing (bool) – If False, any bins that contain no data are excluded from the stored dataframe. Leaving this False can save disk space and improve processing time, particularly if the bins are only sparsely filled.
Other Parameters:
 
  • name (str) – The name of this stage (handled automatically by fast-flow)
  • out_dir (str) – Where to put the summary table (handled automatically by fast-flow)
Raises:

BadBinnedDataframeConfig – If there is an issue with the binning description.
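To make the bins description concrete, the sketch below resolves a hypothetical binning list into NumPy edge arrays, following the nbins/low/high and edges conventions described above. The variable names ("jet_pt", "njets") and the resolve_edges helper are illustrative, not part of the fast_carpenter API:

```python
import numpy as np

# A hypothetical binning description, as it might be passed to BinnedDataframe.
binning = [
    {"in": "jet_pt", "out": "pt_bin",
     "bins": {"nbins": 10, "low": 0.0, "high": 100.0}},
    {"in": "njets", "bins": {"edges": [0, 1, 2, 3, 10]}},
]

def resolve_edges(bins):
    """Turn a 'bins' dictionary into an array of bin edges (or None)."""
    if bins is None:
        return None  # variable is treated as already categorical
    if "edges" in bins:
        return np.asarray(bins["edges"])
    # nbins/low/high form: nbins bins need nbins + 1 edges
    return np.linspace(bins["low"], bins["high"], bins["nbins"] + 1)

edges = resolve_edges(binning[0]["bins"])  # 11 edges from 0 to 100
```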

collector()[source]
event(chunk)[source]
merge(rhs)[source]
class fast_carpenter.summary.binned_dataframe.Collector(filename, dataset_col, binnings, file_format)[source]

Bases: object

collect(dataset_readers_list, doReturn=True, writeFiles=True)[source]
valid_ext = {'dta': 'stata', 'h5': 'hdf', 'msg': 'msgpack', 'p': 'pickle', 'pkl': 'pickle', 'xlsx': 'excel'}
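The valid_ext mapping ties a file extension to the suffix of the corresponding pandas writer method (to_stata, to_hdf, and so on). A minimal sketch of that lookup, assuming a hypothetical writer_for helper not present in the module:

```python
# Copied from the Collector class attribute documented above.
valid_ext = {'dta': 'stata', 'h5': 'hdf', 'msg': 'msgpack',
             'p': 'pickle', 'pkl': 'pickle', 'xlsx': 'excel'}

def writer_for(extension):
    """Hypothetical helper: map a file extension to a pandas writer name."""
    fmt = valid_ext.get(extension.lstrip('.'))
    if fmt is None:
        raise ValueError(f"Unsupported extension: {extension}")
    return "to_" + fmt  # e.g. the name of pandas.DataFrame.to_pickle
```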
fast_carpenter.summary.binned_dataframe.combined_dataframes(dataset_readers_list, dataset_col, binnings=None)[source]
fast_carpenter.summary.binned_dataframe.densify_dataframe(in_df, binnings)[source]
fast_carpenter.summary.binned_dataframe.explode(df)[source]

Based on this answer: https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows/40449726#40449726
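The linked answer describes splitting a delimited string column into one row per entry. Modern pandas offers DataFrame.explode for the same idea; a sketch with made-up example data (not the module's implementation):

```python
import pandas as pd

df = pd.DataFrame({"dataset": ["a", "b"], "values": ["1,2", "3,4,5"]})

# Split each string into a list, then emit one row per list element.
exploded = (df.assign(values=df["values"].str.split(","))
              .explode("values")
              .reset_index(drop=True))
```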