vak.prep.frame_classification.learncurve.make_subsets_from_dataset_df
- vak.prep.frame_classification.learncurve.make_subsets_from_dataset_df(dataset_df: DataFrame, input_type: str, train_set_durs: Sequence[float], num_replicates: int, dataset_path: Path, labelmap: dict, background_label: str = 'background') → DataFrame
Make subsets of the training data split for a learning curve.
Makes subsets given a dataframe representing the entire dataset, with one subset for each combination of (training set duration, replicate number). Each subset is randomly drawn from the total training split.
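The "one subset per (training set duration, replicate number)" scheme can be sketched with `itertools.product`. This is only an illustration of the enumeration, not vak's implementation: the helpers `subset_name` and `enumerate_subsets` are hypothetical, the name format is an assumption modeled on file names such as `inds_in_sample-train-dur-4.0-replicate-1.npy`, and numbering replicates from 1 is likewise assumed.

```python
from itertools import product


def subset_name(train_dur: float, replicate_num: int) -> str:
    """Hypothetical helper: build a label for one subset.

    The format is an assumption modeled on the example file names;
    vak's own naming function may differ.
    """
    return f"train-dur-{float(train_dur)}-replicate-{int(replicate_num)}"


def enumerate_subsets(train_set_durs, num_replicates):
    """Return one (train_dur, replicate_num, name) tuple per subset."""
    # Assumption: replicates are numbered starting from 1.
    replicate_nums = range(1, num_replicates + 1)
    return [
        (dur, rep, subset_name(dur, rep))
        for dur, rep in product(train_set_durs, replicate_nums)
    ]


# Two durations and two replicates give four subsets in total.
subsets = enumerate_subsets(train_set_durs=[4.0, 6.0], num_replicates=2)
for dur, rep, name in subsets:
    print(dur, rep, name)
```

The number of subsets is always `len(train_set_durs) * num_replicates`, which is also the number of extra sets of indexing vectors written for the training split.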
Uses vak.prep.split.frame_classification_dataframe() to make subsets of the training data from dataset_df. A new column, ‘subset’, will be added to the dataframe, along with additional rows for each subset. The dataframe is returned with these subsets added. (The ‘split’ for these rows will still be ‘train’.) Additionally, a separate set of indexing vectors will be made for each subset, using vak.prep.frame_classification.learncurve.make_index_vectors_for_each_subset().

032312-vak-frame-classification-dataset-generated-231005_121809
├── 032312_prep_231005_121809.csv
├── labelmap.json
├── metadata.json
├── prep_231005_121809.log
├── TweetyNet_learncurve_audio_cbin_annot_notmat.toml
├── train
    ├── gy6or6_baseline_230312_0808.138.cbin.spect.frame_labels.npy
    ├── gy6or6_baseline_230312_0808.138.cbin.spect.frames.npy
    ├── gy6or6_baseline_230312_0809.141.cbin.spect.frame_labels.npy
    ├── gy6or6_baseline_230312_0809.141.cbin.spect.frames.npy
    ├── gy6or6_baseline_230312_0813.163.cbin.spect.frame_labels.npy
    ├── gy6or6_baseline_230312_0813.163.cbin.spect.frames.npy
    ├── gy6or6_baseline_230312_0816.179.cbin.spect.frame_labels.npy
    ├── gy6or6_baseline_230312_0816.179.cbin.spect.frames.npy
    ├── gy6or6_baseline_230312_0820.196.cbin.spect.frame_labels.npy
    ├── gy6or6_baseline_230312_0820.196.cbin.spect.frames.npy
    ├── inds_in_sample.npy
    ├── inds_in_sample-train-dur-4.0-replicate-1.npy
    ├── inds_in_sample-train-dur-4.0-replicate-2.npy
    ├── inds_in_sample-train-dur-6.0-replicate-1.npy
    ├── inds_in_sample-train-dur-6.0-replicate-2.npy
    ├── sample_ids.npy
    ├── sample_ids-train-dur-4.0-replicate-1.npy
    ├── sample_ids-train-dur-4.0-replicate-2.npy
    ├── sample_ids-train-dur-6.0-replicate-1.npy
    └── sample_ids-train-dur-6.0-replicate-2.npy
...

Parameters
- dataset_df: pandas.DataFrame
The dataframe representing a dataset for frame classification models. It is returned by vak.prep.frame_classification.get_or_make_source_files(), and has a 'split' column added.
- train_set_durs: list
Durations in seconds of subsets taken from training data to create a learning curve, e.g., [5., 10., 15., 20.].
- num_replicates: int
Number of times to replicate training for each training set duration to better estimate metrics for a training set of that size. Each replicate uses a different randomly drawn subset of the training data (but of the same duration).
- dataset_path: str, pathlib.Path
Directory where splits will be saved.
- input_type: str
The type of input to the neural network model. One of {'audio', 'spect'}.
- background_label: str, optional
The string label applied to segments belonging to the background class. Default is vak.common.constants.DEFAULT_BACKGROUND_LABEL.

Returns
- dataset_df_out: pandas.DataFrame
A pandas.DataFrame that has the original splits from dataset_df, as well as the additional subsets of the training data, along with additional columns, 'subset', 'train_dur', and 'replicate_num', that are used by vak. Other functions like vak.learncurve.learncurve() select a specific subset of the training data by getting the subset name with the function vak.common.learncurve.get_train_dur_replicate_split_name(), and then filtering dataset_df_out with that name using the ‘subset’ column.
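The filtering step described above might look like the following sketch. It uses a toy stand-in for dataset_df_out rather than real vak output: the `audio_path` column, all row values, and the hard-coded subset name are assumptions for illustration, and in practice the name would come from vak.common.learncurve.get_train_dur_replicate_split_name().

```python
import pandas as pd

# Toy stand-in for the returned dataframe; all values here are made up.
# Assumption: rows from the original splits have no value in 'subset'.
dataset_df_out = pd.DataFrame(
    {
        "audio_path": ["a.cbin", "b.cbin", "c.cbin", "a.cbin", "b.cbin"],
        "split": ["train", "train", "val", "train", "train"],
        "subset": [
            None,
            None,
            None,
            "train-dur-4.0-replicate-1",
            "train-dur-4.0-replicate-2",
        ],
        "train_dur": [None, None, None, 4.0, 4.0],
        "replicate_num": [None, None, None, 1, 2],
    }
)

# Hypothetical subset name, hard-coded for the sketch.
subset_name = "train-dur-4.0-replicate-1"

# Keep only the rows that belong to the requested subset.
subset_df = dataset_df_out[dataset_df_out["subset"] == subset_name]
print(subset_df[["audio_path", "split", "subset"]])
```

Rows selected this way still have 'train' in the 'split' column, so filtering on 'split' alone would not separate a subset from the full training split.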