Source code for vak.prep.split.algorithms.validate
[docs]defvalidate_split_durations(train_dur,val_dur,test_dur,dataset_dur):"""helper function to validate durations specified for splits, so other functions can do the actual splitting. First the functions checks for invalid conditions: + If train_dur, val_dur, and test_dur are all None, a ValueError is raised. + If any of train_dur, val_dur, or test_dur have a negative value that is not -1, an ValueError is raised. -1 is interpreted differently as explained below. + If only val_dur is specified, this raises a ValueError; not clear what durations of training and test set should be. Then, if either train_dur or test_dur are None, they are set to 0. None means user did not specify a value. Finally the function validates that the sum of the specified split durations is not greater than the the total duration of the dataset, `dataset_dur`. If any split is specified as -1, this value is interpreted as "first get the split for the set with a value specified, then use the remainder of the dataset in the split whose duration is set to -1". Functions that do the splitting have to "know" about this meaning of -1, so this validation function does not modify the value. Parameters ---------- train_dur : int, float Target duration for training set split, in seconds. val_dur : int, float Target duration for validation set, in seconds. test_dur : int, float Target duration for test set, in seconds. dataset_dur : int, float Total duration of dataset of vocalizations that will be split. Returns ------- train_dur, val_dur, test_dur : int, float """ifval_durand(train_durisNoneandtest_durisNone):raiseValueError("cannot specify only val_dur, unclear how to split dataset into training and test sets")# make a dict so we can refer to variable by name in loopsplit_durs={"train":train_dur,"val":val_dur,"test":test_dur,}ifall([durisNonefordurinsplit_durs.values()]):raiseValueError("train_dur, val_dur, and test_dur were all None; must specify at least train_dur or test_dur")ifnotall([dur>=0ordur==-1fordurinsplit_durs.values()ifdurisnotNone]):raiseValueError("all durations for split must be real non-negative number or ""set to -1 (meaning 'use the remaining dataset)")ifsum([split_dur==-1forsplit_durinsplit_durs.values()])>1:raiseValueError("cannot specify duration of more than one split as -1, unclear how to calculate durations of splits.")# set any None values for durations to 0; no part of dataset will go to that splitforsplit_nameinsplit_durs.keys():ifsplit_durs[split_name]isNone:split_durs[split_name]=0if-1insplit_durs.values():total_other_splits_dur=sum([durfordurinsplit_durs.values()ifdur!=-1])iftotal_other_splits_dur>dataset_dur:raiseValueError("One dataset split duration was specified as -1, but the total of the other durations specified, "f"{total_other_splits_dur} s, is greater than total duration of Dataset, {dataset_dur}.")else:# if none of the target durations are -1total_splits_dur=sum(split_durs.values())iftotal_splits_dur>dataset_dur:raiseValueError(f"Total of the split durations specified, {total_splits_dur} s, "f"is greater than total duration of dataset, {dataset_dur}.")returnsplit_durs["train"],split_durs["val"],split_durs["test"]