fast_carpenter.summary.binned_dataframe module
Summarize the data by producing binned and possibly weighted counts of the data.
class fast_carpenter.summary.binned_dataframe.BinnedDataframe(name, out_dir, binning, weights=None, dataset_col=True, pad_missing=False, file_format=None)
Bases: object
Produces a binned dataframe (a multi-dimensional histogram).
def __init__(self, name, out_dir, binning, weights=None, dataset_col=True, pad_missing=False, file_format=None):
Parameters:
- binning (list[dict]) – A list of dictionaries describing the variables to bin on and how each should be binned. Each of these dictionaries can contain the following:

    Parameter   Default        Description
    in          (no default)   The name of the attribute on the event to use.
    out         same as in     The name of the column to be filled in the output dataframe.
    bins        None           Must be either None or a dictionary.

  If bins is a dictionary, it must contain one of the following sets of key-value pairs:
    1. nbins, low, high: used to produce a list of bin edges equivalent to numpy.linspace(low, high, nbins + 1).
    2. edges: treated as the list of bin edges directly.
  If bins is set to None, the input variable is assumed to already be categorical (i.e. binned or discrete).
- weights (str, list[str], or dict[str, str]) – How to weight events in the output table. Must be either a single variable, a list of variables, or a dictionary whose values are variables in the data and whose keys are the column names these weights should be given in the output tables.
- file_format (str, list[str], or dict[str, str]) – Determines the file format used to save the binned dataframe to disk. Should be either a) a string naming the file format, b) a dict containing the key extension, which gives the file format, with all other key-value pairs passed on as keyword arguments to the corresponding pandas function, or c) a list of values matching a) or b).
- dataset_col (bool) – Adds an extra binning column containing the name of each dataset.
- pad_missing (bool) – If False, any bins that don’t contain data are excluded from the stored dataframe. Leaving this False can save some disk space and improve processing time, particularly if the bins are only very sparsely filled.
Other Parameters:
- name (str) – The name of this stage (handled automatically by fast-flow)
- out_dir (str) – Where to put the summary table (handled automatically by fast-flow)
Raises: BadBinnedDataframeConfig – If there is an issue with the binning description.
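The binning description above can be illustrated with a minimal sketch in plain numpy and pandas. It is not the implementation used by BinnedDataframe: the helper edges_from_bins, the variable names (jet_pt, n_jets, weight) and the toy events are all assumptions made for illustration. It shows how a bins dictionary could be expanded into edges and how binned, weighted counts might then be produced, with empty bins left out as when pad_missing is False.

```python
import numpy as np
import pandas as pd


def edges_from_bins(bins):
    """Hypothetical helper: turn a ``bins`` entry into an array of bin edges."""
    if bins is None:
        return None  # variable is assumed to be categorical already
    if "edges" in bins:
        return np.asarray(bins["edges"], dtype=float)
    # nbins, low, high -> nbins + 1 equally spaced edges
    return np.linspace(bins["low"], bins["high"], bins["nbins"] + 1)


# A binning description in the documented shape (variable names are made up).
binning = [
    {"in": "jet_pt", "out": "pt", "bins": {"nbins": 4, "low": 0.0, "high": 200.0}},
    {"in": "n_jets", "bins": None},
]

# Toy event data, invented for this example.
events = pd.DataFrame({
    "jet_pt": [10.0, 60.0, 70.0, 150.0],
    "n_jets": [1, 2, 2, 3],
    "weight": [1.0, 0.5, 0.5, 2.0],
})

group_keys = []
for dim in binning:
    out_name = dim.get("out", dim["in"])  # "out" defaults to the "in" name
    edges = edges_from_bins(dim.get("bins"))
    values = events[dim["in"]]
    if edges is not None:
        # Cut continuous values into half-open intervals [low, high).
        values = pd.cut(values, edges, right=False)
    group_keys.append(values.rename(out_name))

# observed=True drops empty bins, similar in spirit to pad_missing=False.
binned = events.groupby(group_keys, observed=True)["weight"].agg(["count", "sum"])
```

Here binned has one row per filled (pt, n_jets) bin, with the unweighted count in the count column and the sum of weights in the sum column.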
class fast_carpenter.summary.binned_dataframe.Collector(filename, dataset_col, binnings, file_format)
Bases: object
valid_ext = {'dta': 'stata', 'h5': 'hdf', 'msg': 'msgpack', 'p': 'pickle', 'pkl': 'pickle', 'xlsx': 'excel'}
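One plausible reading of valid_ext is that it maps a file extension to the suffix of a pandas DataFrame.to_* writer method. The sketch below illustrates that reading only. The helper writer_name is hypothetical, and the fallback for unlisted extensions (such as csv) is an assumption, not documented behaviour of Collector.

```python
import pandas as pd

# Copied from Collector.valid_ext as listed above.
valid_ext = {'dta': 'stata', 'h5': 'hdf', 'msg': 'msgpack',
             'p': 'pickle', 'pkl': 'pickle', 'xlsx': 'excel'}


def writer_name(extension):
    """Hypothetical helper: name of the pandas writer method for an extension."""
    # Assumption: extensions missing from the mapping (e.g. "csv") name
    # their own pandas writer directly.
    fmt = valid_ext.get(extension, extension)
    return "to_" + fmt


df = pd.DataFrame({"n": [1, 2]})

# Look the writer up by name; calling it would write the file to disk.
save = getattr(df, writer_name("pkl"))  # bound method DataFrame.to_pickle
```

Note that DataFrame.to_msgpack was removed in pandas 1.0, so the msg entry would only resolve to a real method on older pandas versions.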
fast_carpenter.summary.binned_dataframe.combined_dataframes(dataset_readers_list, dataset_col, binnings=None)
fast_carpenter.summary.binned_dataframe.explode(df)
Based on this answer: https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows/40449726#40449726
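The linked question is about splitting a string entry into separate rows. As a rough illustration of that general technique (not of this module's explode function, whose exact behaviour is not described here), the sketch below uses pandas' built-in DataFrame.explode, available since pandas 0.25. The column names and data are made up.

```python
import pandas as pd

# Toy data: one row per event, with a comma-separated string of tags.
df = pd.DataFrame({"event": [1, 2], "tags": ["a,b", "c"]})

exploded = (
    df.assign(tags=df["tags"].str.split(","))  # string -> list of strings
      .explode("tags")                         # one row per list element
      .reset_index(drop=True)                  # drop the repeated index labels
)
```

Each element of a split entry ends up on its own row, with the other columns repeated alongside it.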