vak.prep.split.algorithms.bruteforce.brute_force

vak.prep.split.algorithms.bruteforce.brute_force(durs: list[float], labels: list[np.ndarray], labelset: set, train_dur: int | float, val_dur: int | float, test_dur: int | float, max_iter: int = 5000)

Generate indices that split a dataset into separate training, validation, and test subsets.

Finds indices that split (labels, durations) tuples into training, validation, and test sets of the specified durations, such that the set of unique labels in each subset is equal to the specified labelset.

The durations of the datasets created using the returned indices will be greater than or equal to the durations specified.
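The two conditions above (label coverage and minimum duration) can be checked for any one subset with a few lines of Python. This is a minimal sketch, not part of vak: the helper name `split_meets_targets` is hypothetical, and plain lists stand in for the `np.ndarray` labels.

```python
def split_meets_targets(durs, labels, labelset, inds, target_dur):
    """Hypothetical helper: check one subset against the two conditions."""
    # Total duration of the items selected by `inds`.
    total_dur = sum(durs[i] for i in inds)
    # Unique labels that occur in the selected items.
    found = set()
    for i in inds:
        found.update(labels[i])
    return found == set(labelset) and total_dur >= target_dur


durs = [1.5, 2.0, 0.5, 3.0]
labels = [["a", "b"], ["b"], ["a"], ["a", "b", "c"]]

print(split_meets_targets(durs, labels, {"a", "b"}, [0, 1], 3.0))  # True
print(split_meets_targets(durs, labels, {"a", "b"}, [1, 2], 3.0))  # False: only 2.5 s
print(split_meets_targets(durs, labels, {"a", "b"}, [0, 3], 3.0))  # False: extra label "c"
```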

A positive value must be specified for one of {train_dur, test_dur}. The other value can be specified as ‘-1’, which is interpreted as “use the remainder of the dataset for this split, after finding indices for the set with a specified duration”.
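One way to picture the ‘-1’ convention: once indices have been found for the subsets with a specified duration, every index not yet assigned goes to the remaining subset. The sketch below illustrates only that idea; `remainder_inds` is a hypothetical name, and how vak resolves ‘-1’ internally is not shown on this page.

```python
def remainder_inds(n_items, *assigned):
    """Hypothetical helper: indices in range(n_items) not present in any assigned list."""
    taken = set().union(*assigned)
    return [i for i in range(n_items) if i not in taken]


# Six items in total; two subsets were already filled to their target durations.
test_inds = [0, 2]
val_inds = [5]
train_inds = remainder_inds(6, test_inds, val_inds)
print(train_inds)  # [1, 3, 4]
```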

Parameters:
  • durs (list) – Durations of vocalizations, as float.

  • labels (list) – Labels from vocalizations, as np.ndarray.

  • labelset (set) – Set of unique labels that each subset must contain.

  • train_dur (int, float) – Target duration for training set, in seconds.

  • val_dur (int, float) – Target duration for validation set, in seconds.

  • test_dur (int, float) – Target duration for test set, in seconds.

  • max_iter (int) – Maximum number of iterations to attempt to find indices. Default is 5000.

Returns:

train_inds, val_inds, test_inds – Lists of int, the indices that split the dataset into training, validation, and test subsets.

Return type:

list

Notes

This is a “brute force” algorithm that randomly assigns indices to each set and iterates, up to max_iter times, until it finds a partition where every set has instances of all classes of label. Each attempt starts by ensuring that each label is represented in each set, and then adds files to reach the required durations.
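The procedure described in these notes can be sketched in plain Python. This is an illustration under stated assumptions, not vak's implementation: the function names, the `targets` dict, and the `seed` argument are hypothetical, all target durations are assumed positive (no ‘-1’ handling), and plain lists stand in for the `np.ndarray` labels.

```python
import random


def _try_once(durs, labels, labelset, targets, rng):
    """One random attempt; returns a dict of index lists, or None on failure."""
    pool = list(range(len(durs)))
    rng.shuffle(pool)  # random order, so taking from the front is a random draw
    inds = {name: [] for name in targets}

    # Step 1: make sure every label is represented in every subset.
    for name in targets:
        for label in labelset:
            if any(label in labels[i] for i in inds[name]):
                continue
            candidates = [i for i in pool if label in labels[i]]
            if not candidates:
                return None
            pool.remove(candidates[0])
            inds[name].append(candidates[0])

    # Step 2: add files until each subset reaches its target duration.
    for name, target in targets.items():
        while pool and sum(durs[i] for i in inds[name]) < target:
            inds[name].append(pool.pop(0))

    # Final check: label set equals labelset, duration >= target.
    for name, target in targets.items():
        found = set()
        for i in inds[name]:
            found.update(labels[i])
        if found != labelset or sum(durs[i] for i in inds[name]) < target:
            return None
    return inds


def brute_force_sketch(durs, labels, labelset, targets, max_iter=5000, seed=None):
    """Hypothetical sketch: retry random attempts until one succeeds."""
    rng = random.Random(seed)
    labelset = set(labelset)
    for _ in range(max_iter):
        result = _try_once(durs, labels, labelset, targets, rng)
        if result is not None:
            return result
    raise ValueError(f"no valid split found in {max_iter} iterations")


durs = [1.0] * 12
labels = [["a"], ["b"]] * 6
splits = brute_force_sketch(
    durs, labels, {"a", "b"}, {"train": 4, "val": 2, "test": 2}, seed=0
)
for name, inds in splits.items():
    print(name, sorted(inds), sum(durs[i] for i in inds))
```

Because each attempt reshuffles the pool, an attempt that fails (for example, when a rare label runs out before every subset has it) is simply discarded and retried, which is what makes the approach “brute force”.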