vak.prep.frame_classification.learncurve.make_index_vectors_for_each_subset

vak.prep.frame_classification.learncurve.make_index_vectors_for_each_subset(subsets_df: DataFrame, dataset_path: str | Path, input_type: str) → DataFrame

Make npy files containing indexing vectors for each subset of the training data used to generate a learning curve with a frame classification dataset.

This function does essentially the same thing as vak.prep.frame_classification.make_splits.make_splits(), except that it makes only the indexing vectors for each subset of the training data. Each subset needs its own indexing vectors so that windows are grabbed correctly from the npy files during training. The npy files themselves do not need to be remade.

All the indexing vectors for each subset are saved in the “train” split directory inside dataset_path.

The indexing vectors are used by vak.datasets.frame_classification.WindowDataset and vak.datasets.frame_classification.FramesDataset. These vectors make it possible to work with files, to avoid loading the entire dataset into memory, and to avoid working with memory-mapped arrays.

The first vector is sample_ids, which represents the “ID” of any sample \((x, y)\) in the split. We use these IDs to load the array files corresponding to the samples. For a split with \(m\) samples, this will be an array of length \(T\), the total number of frames across all samples, with elements \(i \in (0, 1, ..., m - 1)\) indicating which frames correspond to which sample \(x_i\): \((0, 0, 0, ..., 1, 1, ..., m - 1, m - 1)\).

The second vector is inds_in_sample. This vector is the same length as sample_ids, but its values represent the indices of frames within each sample \(x_i\). For a dataset with \(T\) total frames across all samples, where \(t_i\) indicates the number of frames in each \(x_i\), this vector will look like \((0, 1, ..., t_0 - 1, 0, 1, ..., t_1 - 1, ..., 0, 1, ..., t_{m-1} - 1)\).
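To make the layout of the two vectors concrete, here is a minimal NumPy sketch that builds them from a list of per-sample frame counts. The helper name make_index_vectors and the example frame counts are hypothetical and chosen only for illustration; this is not vak's implementation, just one way to produce vectors with the structure described above.

```python
import numpy as np


def make_index_vectors(n_frames_per_sample):
    """Hypothetical helper: build sample_ids and inds_in_sample
    from the number of frames in each sample."""
    # sample_ids: repeat each sample's ID once per frame in that sample
    sample_ids = np.concatenate(
        [
            np.full(n_frames, sample_id, dtype=np.int64)
            for sample_id, n_frames in enumerate(n_frames_per_sample)
        ]
    )
    # inds_in_sample: frame indices restart at 0 for every sample
    inds_in_sample = np.concatenate(
        [np.arange(n_frames, dtype=np.int64) for n_frames in n_frames_per_sample]
    )
    return sample_ids, inds_in_sample


# Three samples with 3, 2, and 4 frames, so T = 9 frames in total
sample_ids, inds_in_sample = make_index_vectors([3, 2, 4])
print(sample_ids)      # [0 0 0 1 1 2 2 2 2]
print(inds_in_sample)  # [0 1 2 0 1 0 1 2 3]

# Look up which file and which frame a position in the split refers to
position = 4
print(sample_ids[position], inds_in_sample[position])  # 1 1
```

Given any position in the split, the pair of values at that position identifies which sample's array file to load and which frame within it to start from, so only that one file needs to be read.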

Parameters:

subsets_df (pandas.DataFrame) – A pandas.DataFrame representing the training data subsets. This DataFrame is created by vak.prep.frame_classification.learncurve.make_subsets_from_dataset_df() and then passed into this function. It is created from a pandas.DataFrame returned by vak.prep.frame_classification.get_or_make_source_files() with a 'split' column added.
dataset_path (str or pathlib.Path) – Path to the directory that represents the dataset.
input_type (str) – The type of input to the neural network model. One of {‘audio’, ‘spect’}.

Return type:

None