vak.prep.frame_classification.learncurve.make_subsets_from_dataset_df
- vak.prep.frame_classification.learncurve.make_subsets_from_dataset_df(dataset_df: DataFrame, input_type: str, train_set_durs: Sequence[float], num_replicates: int, dataset_path: Path, labelmap: dict) → DataFrame
Make subsets of the training data split for a learning curve.
Makes subsets given a dataframe representing the entire dataset, with one subset for each combination of (training set duration, replicate number). Each subset is randomly drawn from the total training split.
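As a rough sketch of what "one subset for each combination" means, the following enumerates every (training set duration, replicate number) pair. The helpers `subset_name` and `enumerate_subsets` are hypothetical and not part of vak; the name format is an assumption that mirrors the `train-dur-4.0-replicate-1` pattern visible in the file names listed in this docstring. The sketch does not draw any data, it only shows how many subsets to expect.

```python
from itertools import product


def subset_name(train_dur: float, replicate_num: int) -> str:
    """Hypothetical helper: build a name like 'train-dur-4.0-replicate-1'."""
    return f"train-dur-{float(train_dur)}-replicate-{replicate_num}"


def enumerate_subsets(train_set_durs, num_replicates):
    """Return one (train_dur, replicate_num, name) tuple per combination."""
    # Assumption: replicates are numbered from 1, as in the listed file names.
    replicate_nums = range(1, num_replicates + 1)
    return [
        (train_dur, replicate_num, subset_name(train_dur, replicate_num))
        for train_dur, replicate_num in product(train_set_durs, replicate_nums)
    ]


subsets = enumerate_subsets(train_set_durs=[4.0, 6.0], num_replicates=2)
for train_dur, replicate_num, name in subsets:
    print(train_dur, replicate_num, name)
```

With two durations and two replicates this yields four subsets, which matches the four `inds_in_sample-train-dur-*` files in the example dataset directory.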
Uses vak.prep.split.frame_classification_dataframe() to make subsets of the training data from dataset_df. A new column will be added to the dataframe, "subset", and additional rows for each subset. The dataframe is returned with these subsets added. (The "split" for these rows will still be "train".) Additionally, a separate set of indexing vectors will be made for each subset, using vak.prep.frame_classification.learncurve.make_index_vectors_for_each_subset().

    032312-vak-frame-classification-dataset-generated-231005_121809
    ├── 032312_prep_231005_121809.csv
    ├── labelmap.json
    ├── metadata.json
    ├── prep_231005_121809.log
    ├── TweetyNet_learncurve_audio_cbin_annot_notmat.toml
    ├── train
        ├── gy6or6_baseline_230312_0808.138.cbin.spect.frame_labels.npy
        ├── gy6or6_baseline_230312_0808.138.cbin.spect.frames.npy
        ├── gy6or6_baseline_230312_0809.141.cbin.spect.frame_labels.npy
        ├── gy6or6_baseline_230312_0809.141.cbin.spect.frames.npy
        ├── gy6or6_baseline_230312_0813.163.cbin.spect.frame_labels.npy
        ├── gy6or6_baseline_230312_0813.163.cbin.spect.frames.npy
        ├── gy6or6_baseline_230312_0816.179.cbin.spect.frame_labels.npy
        ├── gy6or6_baseline_230312_0816.179.cbin.spect.frames.npy
        ├── gy6or6_baseline_230312_0820.196.cbin.spect.frame_labels.npy
        ├── gy6or6_baseline_230312_0820.196.cbin.spect.frames.npy
        ├── inds_in_sample.npy
        ├── inds_in_sample-train-dur-4.0-replicate-1.npy
        ├── inds_in_sample-train-dur-4.0-replicate-2.npy
        ├── inds_in_sample-train-dur-6.0-replicate-1.npy
        ├── inds_in_sample-train-dur-6.0-replicate-2.npy
        ├── sample_ids.npy
        ├── sample_ids-train-dur-4.0-replicate-1.npy
        ├── sample_ids-train-dur-4.0-replicate-2.npy
        ├── sample_ids-train-dur-6.0-replicate-1.npy
        ├── sample_ids-train-dur-6.0-replicate-2.npy
    ...

Parameters
----------
dataset_df : pandas.DataFrame
    Dataframe representing a dataset for frame classification models. It is returned by :func:`vak.prep.frame_classification.get_or_make_source_files`, and has a ``'split'`` column added.
train_set_durs : list
    Durations in seconds of subsets taken from training data to create a learning curve, e.g., `[5., 10., 15., 20.]`.
num_replicates : int
    Number of times to replicate training for each training set duration to better estimate metrics for a training set of that size. Each replicate uses a different randomly drawn subset of the training data (but of the same duration).
dataset_path : str, pathlib.Path
    Directory where splits will be saved.
input_type : str
    The type of input to the neural network model. One of {'audio', 'spect'}.

Returns
-------
dataset_df_out : pandas.DataFrame
    A pandas.DataFrame that has the original splits from ``dataset_df``, as well as the additional subsets of the training data added, along with additional columns, ``'subset', 'train_dur', 'replicate_num'``, that are used by :mod:`vak`. Other functions like :func:`vak.learncurve.learncurve` specify a specific subset of the training data by getting the subset name with the function :func:`vak.common.learncurve.get_train_dur_replicate_split_name`, and then filtering ``dataset_df_out`` with that name using the 'subset' column.
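
The filtering step described in the Returns section can be sketched with plain pandas. The toy dataframe below is made up for illustration and has only a few of the columns a real `dataset_df_out` would contain (the `duration` column and all row values are assumptions). The subset name is written out by hand rather than obtained from `vak.common.learncurve.get_train_dur_replicate_split_name`; its format is assumed from the file names in the directory listing.

```python
import pandas as pd

# Toy stand-in for dataset_df_out; rows and the 'duration' column are illustrative only.
dataset_df_out = pd.DataFrame(
    {
        "split": ["train", "train", "train", "train", "val"],
        "subset": [
            None,
            "train-dur-4.0-replicate-1",
            "train-dur-4.0-replicate-1",
            "train-dur-4.0-replicate-2",
            None,
        ],
        "train_dur": [None, 4.0, 4.0, 4.0, None],
        "replicate_num": [None, 1, 1, 2, None],
        "duration": [2.5, 2.5, 1.5, 4.0, 3.0],
    }
)

# Assumed name format; in vak the name would come from the library's own helper.
subset_name = "train-dur-4.0-replicate-1"

# Keep only the rows that belong to the chosen subset.
subset_df = dataset_df_out[dataset_df_out["subset"] == subset_name]
print(subset_df[["split", "subset", "duration"]])
```

Rows from the original splits have no subset name in this toy frame, so they drop out of the comparison, and the rows that remain still carry `'train'` in the `'split'` column, as the docstring describes.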