vak.prep.frame_classification.learncurve.make_subsets_from_dataset_df

vak.prep.frame_classification.learncurve.make_subsets_from_dataset_df(dataset_df: DataFrame, input_type: str, train_set_durs: Sequence[float], num_replicates: int, dataset_path: Path, labelmap: dict, background_label: str = 'background') → DataFrame

Make subsets of the training data split for a learning curve.

Makes subsets given a dataframe representing the entire dataset, with one subset for each combination of (training set duration, replicate number). Each subset is randomly drawn from the total training split.
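The pairing of durations and replicates described above can be sketched with the standard library. This is only an illustration of how many subsets to expect, not vak's implementation; the values of `train_set_durs` and `num_replicates` are made up, and starting replicate numbers at 1 is an assumption based on the file names in this page's example listing.

```python
from itertools import product

# Illustrative values only; not taken from a real configuration.
train_set_durs = [4.0, 6.0]
num_replicates = 2

# Assumption: replicate numbers start at 1, as in the example file names.
combos = list(product(train_set_durs, range(1, num_replicates + 1)))

# One subset is made per (training set duration, replicate number) pair,
# so the number of subsets is len(train_set_durs) * num_replicates.
for train_dur, replicate_num in combos:
    print(f"train_dur={train_dur}, replicate_num={replicate_num}")
```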

Uses vak.prep.split.frame_classification_dataframe() to make subsets of the training data from dataset_df.

A new column, ‘subset’, is added to the dataframe, along with additional rows for each subset, and the dataframe is returned with these subsets added. (The ‘split’ for these rows is still ‘train’.) In addition, a separate set of indexing vectors is made for each subset, using vak.prep.frame_classification.learncurve.make_index_vectors_for_each_subset(). For example, with training set durations of 4.0 and 6.0 seconds and two replicates each, the dataset directory looks like this:

  032312-vak-frame-classification-dataset-generated-231005_121809
  ├── 032312_prep_231005_121809.csv
  ├── labelmap.json
  ├── metadata.json
  ├── prep_231005_121809.log
  ├── TweetyNet_learncurve_audio_cbin_annot_notmat.toml
  ├── train
  │   ├── gy6or6_baseline_230312_0808.138.cbin.spect.frame_labels.npy
  │   ├── gy6or6_baseline_230312_0808.138.cbin.spect.frames.npy
  │   ├── gy6or6_baseline_230312_0809.141.cbin.spect.frame_labels.npy
  │   ├── gy6or6_baseline_230312_0809.141.cbin.spect.frames.npy
  │   ├── gy6or6_baseline_230312_0813.163.cbin.spect.frame_labels.npy
  │   ├── gy6or6_baseline_230312_0813.163.cbin.spect.frames.npy
  │   ├── gy6or6_baseline_230312_0816.179.cbin.spect.frame_labels.npy
  │   ├── gy6or6_baseline_230312_0816.179.cbin.spect.frames.npy
  │   ├── gy6or6_baseline_230312_0820.196.cbin.spect.frame_labels.npy
  │   ├── gy6or6_baseline_230312_0820.196.cbin.spect.frames.npy
  │   ├── inds_in_sample.npy
  │   ├── inds_in_sample-train-dur-4.0-replicate-1.npy
  │   ├── inds_in_sample-train-dur-4.0-replicate-2.npy
  │   ├── inds_in_sample-train-dur-6.0-replicate-1.npy
  │   ├── inds_in_sample-train-dur-6.0-replicate-2.npy
  │   ├── sample_ids.npy
  │   ├── sample_ids-train-dur-4.0-replicate-1.npy
  │   ├── sample_ids-train-dur-4.0-replicate-2.npy
  │   ├── sample_ids-train-dur-6.0-replicate-1.npy
  │   └── sample_ids-train-dur-6.0-replicate-2.npy
  ...
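The naming pattern of the per-subset indexing vectors in the listing above can be reproduced with a small helper. This is a sketch inferred from the file names shown here; `index_vector_filenames` is a hypothetical function written for illustration and is not part of vak.

```python
def index_vector_filenames(train_dur: float, replicate_num: int) -> list[str]:
    """Build the index-vector file names for one subset.

    Hypothetical helper: the pattern is inferred from the example
    directory listing, not taken from vak's source code.
    """
    subset = f"train-dur-{train_dur}-replicate-{replicate_num}"
    return [
        f"sample_ids-{subset}.npy",
        f"inds_in_sample-{subset}.npy",
    ]


# File names expected for the 4.0 second subset, first replicate
names = index_vector_filenames(4.0, 1)
print(names)
```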

Parameters
----------
dataset_df : pandas.DataFrame
    Dataframe representing a dataset for frame classification models.
    It is returned by
    :func:`vak.prep.frame_classification.get_or_make_source_files`,
    and has a ``'split'`` column added.
train_set_durs : list of float
    Durations in seconds of subsets taken from training data
    to create a learning curve, e.g., ``[5., 10., 15., 20.]``.
num_replicates : int
    Number of times to replicate training for each training set duration
    to better estimate metrics for a training set of that size.
    Each replicate uses a different randomly drawn subset of the training
    data (but of the same duration).
dataset_path : str, pathlib.Path
    Directory where splits will be saved.
input_type : str
    The type of input to the neural network model.
    One of {'audio', 'spect'}.
background_label : str, optional
    The string label applied to segments belonging to the background class.
    Default is ``vak.common.constants.DEFAULT_BACKGROUND_LABEL``.

Returns
-------
dataset_df_out : pandas.DataFrame
    A pandas.DataFrame that has the original splits from ``dataset_df``,
    as well as the additional subsets of the training data,
    along with the additional columns ``'subset'``, ``'train_dur'``,
    and ``'replicate_num'`` that are used by vak.
    Other functions, like :func:`vak.learncurve.learncurve`,
    select a specific subset of the training data by getting the subset name
    with :func:`vak.common.learncurve.get_train_dur_replicate_split_name`
    and then filtering ``dataset_df_out`` with that name,
    using the ``'subset'`` column.
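The filtering step described above can be sketched with plain pandas. The dataframe below is a toy stand-in with made-up rows, not real output of this function, and the subset name is written out by hand following the pattern in the example file names instead of being obtained from vak.

```python
import pandas as pd

# Toy stand-in for dataset_df_out; real dataframes have more columns.
dataset_df_out = pd.DataFrame(
    {
        "split": ["train", "val", "train", "train"],
        "subset": [
            None,
            None,
            "train-dur-4.0-replicate-1",
            "train-dur-6.0-replicate-1",
        ],
        "train_dur": [None, None, 4.0, 6.0],
        "replicate_num": [None, None, 1, 1],
    }
)

# Assumption: the subset name follows the pattern seen in the file names.
subset_name = "train-dur-4.0-replicate-1"

# Keep only the rows that belong to this subset of the training data.
subset_df = dataset_df_out[dataset_df_out["subset"] == subset_name]
print(subset_df)
```

Rows from the original splits have no value in the toy ‘subset’ column, so the comparison excludes them and only the rows added for that subset remain.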