vak.prep.frame_classification.learncurve.make_subsets_from_dataset_df

vak.prep.frame_classification.learncurve.make_subsets_from_dataset_df(dataset_df: DataFrame, input_type: str, train_set_durs: Sequence[float], num_replicates: int, dataset_path: Path, labelmap: dict) -> DataFrame

Make subsets of the training data split for a learning curve.

Makes subsets given a dataframe representing the entire dataset, with one subset for each combination of (training set duration, replicate number). Each subset is randomly drawn from the total training split.
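
As a rough illustration of "one subset for each combination", the sketch below enumerates the (training set duration, replicate number) pairs for example values. The values are made up for illustration, and numbering replicates from 1 is an assumption based on the file names in the directory listing on this page; this is not the library's own code.

```python
from itertools import product

# Example values only; in practice these come from the caller.
train_set_durs = [4.0, 6.0]
num_replicates = 2

# One subset per (duration, replicate) pair.
# Assumption: replicate numbers start at 1.
combos = list(product(train_set_durs, range(1, num_replicates + 1)))

for train_dur, replicate_num in combos:
    print(f"train_dur={train_dur}, replicate_num={replicate_num}")
```

With two durations and two replicates, this yields four subsets in total.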

Uses vak.prep.split.frame_classification_dataframe() to make subsets of the training data from dataset_df.

A new column will be added to the dataframe, 'subset', and additional rows for each subset. The dataframe is returned with these subsets added. (The 'split' for these rows will still be 'train'.) Additionally, a separate set of indexing vectors will be made for each subset, using vak.prep.frame_classification.learncurve.make_index_vectors_for_each_subset().

  032312-vak-frame-classification-dataset-generated-231005_121809
  ├── 032312_prep_231005_121809.csv
  ├── labelmap.json
  ├── metadata.json
  ├── prep_231005_121809.log
  ├── TweetyNet_learncurve_audio_cbin_annot_notmat.toml
  ├── train
      ├── gy6or6_baseline_230312_0808.138.cbin.spect.frame_labels.npy
      ├── gy6or6_baseline_230312_0808.138.cbin.spect.frames.npy
      ├── gy6or6_baseline_230312_0809.141.cbin.spect.frame_labels.npy
      ├── gy6or6_baseline_230312_0809.141.cbin.spect.frames.npy
      ├── gy6or6_baseline_230312_0813.163.cbin.spect.frame_labels.npy
      ├── gy6or6_baseline_230312_0813.163.cbin.spect.frames.npy
      ├── gy6or6_baseline_230312_0816.179.cbin.spect.frame_labels.npy
      ├── gy6or6_baseline_230312_0816.179.cbin.spect.frames.npy
      ├── gy6or6_baseline_230312_0820.196.cbin.spect.frame_labels.npy
      ├── gy6or6_baseline_230312_0820.196.cbin.spect.frames.npy
      ├── inds_in_sample.npy
      ├── inds_in_sample-train-dur-4.0-replicate-1.npy
      ├── inds_in_sample-train-dur-4.0-replicate-2.npy
      ├── inds_in_sample-train-dur-6.0-replicate-1.npy
      ├── inds_in_sample-train-dur-6.0-replicate-2.npy
      ├── sample_ids.npy
      ├── sample_ids-train-dur-4.0-replicate-1.npy
      ├── sample_ids-train-dur-4.0-replicate-2.npy
      ├── sample_ids-train-dur-6.0-replicate-1.npy
      └── sample_ids-train-dur-6.0-replicate-2.npy
  ...
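
The per-subset index vector files in the listing above follow a visible naming pattern. The sketch below reproduces that pattern with two hypothetical helpers, `subset_name` and `index_vector_filenames`, which are not part of vak; the format string is inferred only from the file names shown, so treat it as an assumption rather than the library's implementation.

```python
def subset_name(train_dur: float, replicate_num: int) -> str:
    """Hypothetical helper: build a subset name such as
    'train-dur-4.0-replicate-1' (format inferred from the listing)."""
    return f"train-dur-{train_dur}-replicate-{replicate_num}"


def index_vector_filenames(train_dur: float, replicate_num: int) -> list[str]:
    """Hypothetical helper: names of the two index vector files
    that the listing shows for each subset."""
    name = subset_name(train_dur, replicate_num)
    return [f"inds_in_sample-{name}.npy", f"sample_ids-{name}.npy"]


filenames = index_vector_filenames(4.0, 1)
for filename in filenames:
    print(filename)
```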

Parameters
----------
dataset_df : pandas.DataFrame
    Dataframe representing a dataset for frame classification models.
    It is returned by
    :func:`vak.prep.frame_classification.get_or_make_source_files`,
    and has a ``'split'`` column added.
train_set_durs : sequence of float
    Durations in seconds of subsets taken from training data
    to create a learning curve, e.g., ``[5., 10., 15., 20.]``.
num_replicates : int
    Number of times to replicate training for each training set duration
    to better estimate metrics for a training set of that size.
    Each replicate uses a different randomly drawn subset of the training
    data (but of the same duration).
dataset_path : str, pathlib.Path
    Directory where splits will be saved.
input_type : str
    The type of input to the neural network model.
    One of {'audio', 'spect'}.

Returns
-------
dataset_df_out : pandas.DataFrame
    A pandas.DataFrame that has the original splits
    from ``dataset_df``, as well as the additional subsets
    of the training data added, along with additional
    columns, ``'subset', 'train_dur', 'replicate_num'``,
    that are used by :mod:`vak`.
    Other functions, such as :func:`vak.learncurve.learncurve`,
    select a specific subset of the training data
    by getting the subset name with the function
    :func:`vak.common.learncurve.get_train_dur_replicate_split_name`,
    and then filtering ``dataset_df_out`` with that name
    using the ``'subset'`` column.
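
The filtering step described above can be sketched with plain pandas. The dataframe here is a toy stand-in built by hand, not output from vak, and the subset name string is an assumption inferred from the file names in the directory listing; in real code the name would come from :func:`vak.common.learncurve.get_train_dur_replicate_split_name`.

```python
import pandas as pd

# Toy stand-in for the returned dataframe: the original rows have no
# subset, and the added subset rows keep 'train' as their split.
dataset_df_out = pd.DataFrame(
    {
        "split": ["train", "val", "train", "train"],
        "subset": [None, None, "train-dur-4.0-replicate-1", "train-dur-4.0-replicate-2"],
        "train_dur": [None, None, 4.0, 4.0],
        "replicate_num": [None, None, 1, 2],
    }
)

# Assumed name format; normally obtained from the vak helper function.
name = "train-dur-4.0-replicate-1"

# Keep only the rows that belong to this subset.
subset_df = dataset_df_out[dataset_df_out["subset"] == name]
print(subset_df)
```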