vak.prep.split.split.unit_dataframe

vak.prep.split.split.unit_dataframe(dataset_df: DataFrame, dataset_path: str | Path, labelset: set, train_dur: float | None = None, test_dur: float | None = None, val_dur: float | None = None)

Create dataset splits from a dataframe representing a unit dataset.

Splits the dataset into training, test, and (optionally) validation subsets, each specified by its total duration.

Also adds a ‘split’ column to the dataframe that assigns each row to ‘train’, ‘val’, ‘test’, or ‘None’.

Parameters:

dataset_df (pandas.DataFrame) – A pandas DataFrame representing the samples in a dataset, generated by vak prep.
dataset_path (str, pathlib.Path) – Path to dataset, a directory generated by running vak prep.
labelset (set, list) – The set of label classes for vocalizations in dataset.
train_dur (float) – Total duration of training set, in seconds. Default is None.
test_dur (float) – Total duration of test set, in seconds. Default is None.
val_dur (float) – Total duration of validation set, in seconds. Default is None.

Returns:

dataset_df – A copy of the input dataset with a ‘split’ column added, that assigns each vocalization (row) to a subset, i.e., train, validation, or test. If the vocalization was not added to one of the subsets, its value for ‘split’ will be ‘None’.

Return type:

pandas.DataFrame

Notes

Uses the function vak.dataset.split.train_test_dur_split_inds() to find indices for each subset.
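
Examples

The sketch below illustrates the kind of ‘split’ labelling described above: each row is assigned to ‘train’, ‘val’, or ‘test’ until a target duration is met, and any remaining rows keep the string ‘None’. It is not vak's implementation. The helper name assign_splits_by_duration and the simple greedy, in-order strategy are assumptions made for illustration, and it ignores labelset entirely. The real function delegates the choice of indices to the routine named in Notes.

```python
from typing import List, Optional


def assign_splits_by_duration(
    durations: List[float],
    train_dur: Optional[float] = None,
    test_dur: Optional[float] = None,
    val_dur: Optional[float] = None,
) -> List[str]:
    """Hypothetical helper (not part of vak): greedily label rows by split.

    Rows are taken in order and added to a split until that split's
    total duration reaches its target. A target of None skips the split.
    """
    targets = [("train", train_dur), ("val", val_dur), ("test", test_dur)]
    # Rows that are never assigned keep the string 'None'.
    labels = ["None"] * len(durations)
    next_row = 0
    for split_name, target in targets:
        if target is None:
            continue
        total = 0.0
        while total < target and next_row < len(durations):
            labels[next_row] = split_name
            total += durations[next_row]
            next_row += 1
    return labels


# Durations in seconds, one per row of an illustrative dataset.
durations = [2.0, 3.0, 1.5, 2.5, 4.0, 1.0]
labels = assign_splits_by_duration(
    durations, train_dur=5.0, test_dur=2.0, val_dur=1.0
)
print(labels)
```

In a dataframe setting, a list like labels would become the ‘split’ column on a copy of the input, leaving the original untouched.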