vak.prep.frame_classification.learncurve.make_subsets_from_dataset_df

vak.prep.frame_classification.learncurve.make_subsets_from_dataset_df(dataset_df: DataFrame, input_type: str, train_set_durs: Sequence[float], num_replicates: int, dataset_path: Path, labelmap: dict) → DataFrame

Make subsets of the training data split for a learning curve.

Makes subsets given a dataframe representing the entire dataset, with one subset for each combination of (training set duration, replicate number). Each subset is randomly drawn from the total training split.

Uses vak.prep.split.frame_classification_dataframe() to make subsets of the training data from dataset_df.

The returned dataframe has a new column, ‘subset’, and additional rows for each subset. (The ‘split’ for these rows is still ‘train’.) In addition, a separate set of indexing vectors is made for each subset, using vak.prep.frame_classification.learncurve.make_index_vectors_for_each_subset(). These vectors are saved alongside the training data in the dataset directory, as in the following example listing:

  032312-vak-frame-classification-dataset-generated-231005_121809
  ├── 032312_prep_231005_121809.csv
  ├── labelmap.json
  ├── metadata.json
  ├── prep_231005_121809.log
  ├── TweetyNet_learncurve_audio_cbin_annot_notmat.toml
  ├── train
      ├── gy6or6_baseline_230312_0808.138.cbin.spect.frame_labels.npy
      ├── gy6or6_baseline_230312_0808.138.cbin.spect.frames.npy
      ├── gy6or6_baseline_230312_0809.141.cbin.spect.frame_labels.npy
      ├── gy6or6_baseline_230312_0809.141.cbin.spect.frames.npy
      ├── gy6or6_baseline_230312_0813.163.cbin.spect.frame_labels.npy
      ├── gy6or6_baseline_230312_0813.163.cbin.spect.frames.npy
      ├── gy6or6_baseline_230312_0816.179.cbin.spect.frame_labels.npy
      ├── gy6or6_baseline_230312_0816.179.cbin.spect.frames.npy
      ├── gy6or6_baseline_230312_0820.196.cbin.spect.frame_labels.npy
      ├── gy6or6_baseline_230312_0820.196.cbin.spect.frames.npy
      ├── inds_in_sample.npy
      ├── inds_in_sample-train-dur-4.0-replicate-1.npy
      ├── inds_in_sample-train-dur-4.0-replicate-2.npy
      ├── inds_in_sample-train-dur-6.0-replicate-1.npy
      ├── inds_in_sample-train-dur-6.0-replicate-2.npy
      ├── sample_ids.npy
      ├── sample_ids-train-dur-4.0-replicate-1.npy
      ├── sample_ids-train-dur-4.0-replicate-2.npy
      ├── sample_ids-train-dur-6.0-replicate-1.npy
      └── sample_ids-train-dur-6.0-replicate-2.npy
  ...
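
The per-subset file names in the listing above follow one pattern for each combination of (training set duration, replicate number). The sketch below enumerates those names for durations of 4.0 and 6.0 seconds with two replicates each. The helper ``subset_name`` is hypothetical and its naming pattern is inferred only from the listing; it is not the function vak itself uses.

```python
from itertools import product


def subset_name(train_dur: float, replicate_num: int) -> str:
    """Hypothetical helper: build a subset name matching the example listing.

    The pattern is an assumption inferred from the file names shown above.
    """
    return f"train-dur-{train_dur}-replicate-{replicate_num}"


train_set_durs = [4.0, 6.0]
num_replicates = 2

# One pair of index-vector files per (duration, replicate) combination.
index_vector_files = []
for train_dur, replicate_num in product(
    train_set_durs, range(1, num_replicates + 1)
):
    name = subset_name(train_dur, replicate_num)
    index_vector_files.append(f"inds_in_sample-{name}.npy")
    index_vector_files.append(f"sample_ids-{name}.npy")

for filename in sorted(index_vector_files):
    print(filename)
```

With two durations and two replicates this gives four subsets, so eight per-subset files in addition to the ``inds_in_sample.npy`` and ``sample_ids.npy`` files for the full training split.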

Parameters
----------
dataset_df : pandas.DataFrame
    Dataframe representing a dataset for frame classification models.
    It is returned by
    :func:`vak.prep.frame_classification.get_or_make_source_files`,
    and has a ``'split'`` column added.
train_set_durs : sequence of float
    Durations in seconds of subsets taken from training data
    to create a learning curve, e.g., ``[5., 10., 15., 20.]``.
num_replicates : int
    Number of times to replicate training for each training set duration
    to better estimate metrics for a training set of that size.
    Each replicate uses a different randomly drawn subset of the training
    data (but of the same duration).
dataset_path : str, pathlib.Path
    Directory where splits will be saved.
input_type : str
    The type of input to the neural network model.
    One of {'audio', 'spect'}.
labelmap : dict
    Dict that maps labels from the dataset to consecutive integers.

Returns
-------
dataset_df_out : pandas.DataFrame
    A pandas.DataFrame that has the original splits
    from ``dataset_df``, as well as the additional subsets
    of the training data added, along with additional
    columns, ``'subset', 'train_dur', 'replicate_num'``,
    that are used by :mod:`vak`.
    Other functions, such as :func:`vak.learncurve.learncurve`,
    select a specific subset of the training data
    by getting the subset name from
    :func:`vak.common.learncurve.get_train_dur_replicate_split_name`
    and then filtering ``dataset_df_out`` on the ``'subset'`` column
    with that name.
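
Examples
--------
A minimal sketch of selecting one subset from the returned dataframe by filtering on the ``'subset'`` column. The toy dataframe, its ``'file'`` column, and the helper ``subset_name`` are hypothetical stand-ins. In real code the name would come from :func:`vak.common.learncurve.get_train_dur_replicate_split_name`, whose exact output format is assumed here rather than confirmed.

```python
import pandas as pd


def subset_name(train_dur: float, replicate_num: int) -> str:
    """Hypothetical stand-in for the vak helper that names a subset.

    The naming pattern is an assumption, not confirmed vak behavior.
    """
    return f"train-dur-{train_dur}-replicate-{replicate_num}"


# Toy stand-in for ``dataset_df_out``: the original splits come first,
# followed by extra rows for each subset, whose 'split' is still 'train'.
rows = [
    ("a.npy", "train", None, None, None),
    ("b.npy", "train", None, None, None),
    ("c.npy", "train", None, None, None),
    ("d.npy", "val", None, None, None),
    ("a.npy", "train", subset_name(4.0, 1), 4.0, 1),
    ("b.npy", "train", subset_name(4.0, 1), 4.0, 1),
    ("b.npy", "train", subset_name(4.0, 2), 4.0, 2),
    ("c.npy", "train", subset_name(4.0, 2), 4.0, 2),
]
dataset_df_out = pd.DataFrame(
    rows, columns=["file", "split", "subset", "train_dur", "replicate_num"]
)

# Keep only the rows that belong to one (duration, replicate) subset.
name = subset_name(4.0, 1)
subset_df = dataset_df_out[dataset_df_out["subset"] == name]
print(subset_df)
```

Rows from the original splits have no value in the ``'subset'`` column, so the comparison excludes them and only the rows added for the requested subset remain.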