vak.prep.frame_classification.make_splits.make_splits

vak.prep.frame_classification.make_splits.make_splits(dataset_df: DataFrame, dataset_path: str | Path, input_type: str, purpose: str, labelmap: dict, audio_format: str | None = None, spect_key: str = 's', timebins_key: str = 't', freqbins_key: str = 'f') → DataFrame

Make each split of a frame classification dataset.

This function takes a pandas.DataFrame returned by vak.prep.spectrogram_dataset.prep_spectrogram_dataset() or vak.prep.audio_dataset.prep_audio_dataset(), after it has been assigned a ‘split’ column, and then copies, moves, or generates the required files as appropriate for each split.

For each unique ‘split’ in the pandas.DataFrame, a directory is made inside dataset_path. At a high level, all files needed for working with that split will be in that directory. E.g., the train directory inside dataset_path would have all the files for every row in dataset_df for which dataset_df['split'] == 'train'.
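The per-split directory layout described above can be sketched with pandas and pathlib alone. This is a minimal illustration, not the function's actual implementation: the toy dataframe, its audio_path column, and the temporary dataset_path are assumptions made up for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy dataframe; a real dataset_df has more columns.
dataset_df = pd.DataFrame(
    {
        "audio_path": ["a.wav", "b.wav", "c.wav"],
        "split": ["train", "train", "val"],
    }
)

# Stand-in for the dataset directory (an assumption for this sketch).
dataset_path = Path(tempfile.mkdtemp())

# Make one directory per unique split inside dataset_path.
split_dirs = {}
for split_name in sorted(dataset_df["split"].unique()):
    split_dir = dataset_path / split_name
    split_dir.mkdir(parents=True, exist_ok=True)
    split_dirs[split_name] = split_dir

# Rows whose files would belong in the train directory.
train_rows = dataset_df[dataset_df["split"] == "train"]
```

Here train_rows selects exactly the rows for which dataset_df['split'] == 'train', which are the rows whose files would end up in split_dirs["train"].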

The inputs to the neural network model are moved or copied into the split directory, or generated if necessary. If the input_type is ‘audio’, then the audio files are copied from their original directory. If the input_type is ‘spect’, and the spectrogram files are already in dataset_path, they are moved into the split directory (under the assumption they were generated by vak.prep.spectrogram_dataset.audio_helper). If they are npz files, but they are not in dataset_path, then they are validated to make sure they have the appropriate keys, and then copied into the split directory. This could be the case if the files were generated by another program. If they are mat files, they are converted to npz with the default keys for arrays, and saved as new npz files in the split directory. This step is required so that all datasets prepared by vak are in a “normalized” or “canonicalized” format.
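As a rough sketch of what validating the keys of an npz file could look like, the snippet below checks that a file contains arrays under the default keys ‘s’, ‘t’, and ‘f’. The helper name has_required_keys and the toy files are hypothetical; this is not vak's own validation code.

```python
import tempfile
from pathlib import Path

import numpy as np

# Default keys from the function signature: spectrogram, time bins, frequency bins.
REQUIRED_KEYS = ("s", "t", "f")


def has_required_keys(npz_path, keys=REQUIRED_KEYS):
    """Return True if every key in ``keys`` is an array name in the npz file.

    Hypothetical helper, for illustration only.
    """
    with np.load(npz_path) as npz_file:
        return all(key in npz_file.files for key in keys)


tmp_dir = Path(tempfile.mkdtemp())

# A file with the expected keys.
good_path = tmp_dir / "good.npz"
np.savez(
    good_path,
    s=np.zeros((4, 10)),
    t=np.arange(10) * 0.001,
    f=np.arange(4) * 100.0,
)

# A file from "another program" that uses a different key.
bad_path = tmp_dir / "bad.npz"
np.savez(bad_path, spect=np.zeros((4, 10)))

good_ok = has_required_keys(good_path)
bad_ok = has_required_keys(bad_path)
```

A file that fails this kind of check would need its arrays re-saved under the expected keys before it could be treated as part of a normalized dataset.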

In addition to copying or moving the audio or spectrogram files that are inputs to the neural network model, other npy files are made for each split and saved in the corresponding directory. This function creates one npy file for each row in dataset_df. Each file has the extension ‘.frame_labels.npy’, and contains a vector where each element is the target label that the network should predict for the corresponding frame. Taken together, the audio or spectrogram file in each row along with its corresponding frame labels are the data for each sample \((x, y)\) in the dataset, where \(x_t\) supplies the “frames”, and \(y_t\) supplies the frame labels.
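To make the idea of a frame labels vector concrete, here is a minimal sketch that assigns an integer class to every time bin of a spectrogram from a list of labeled segments. The labelmap, the "unlabeled" background class, and the onset and offset times are all assumptions invented for the example; vak's own labeling logic may differ in its details.

```python
import numpy as np

# Hypothetical labelmap, including an assumed background class.
labelmap = {"unlabeled": 0, "a": 1, "b": 2}

# Time bin centers for 10 frames, in seconds.
timebins = np.arange(10) * 0.01

# Hypothetical annotated segments: onset, offset (seconds), and label.
onsets = np.array([0.015, 0.055])
offsets = np.array([0.035, 0.075])
labels = ["a", "b"]

# Start with every frame labeled as background.
frame_labels = np.full(timebins.shape, labelmap["unlabeled"], dtype=np.int64)

# Overwrite the frames that fall inside each segment with that segment's class.
for onset, offset, label in zip(onsets, offsets, labels):
    in_segment = (timebins >= onset) & (timebins < offset)
    frame_labels[in_segment] = labelmap[label]
```

The resulting vector has one element per frame, so it lines up with the time axis of the corresponding spectrogram.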

This function also creates two additional npy files for each split. These npy files are “indexing” vectors that are used by vak.datasets.frame_classification.WindowDataset and vak.datasets.frame_classification.FramesDataset. These vectors make it possible to work with files, which avoids both loading the entire dataset into memory and working with memory-mapped arrays. The first is the sample_ids vector, which represents the “ID” of any sample \((x, y)\) in the split. We use these IDs to load the array files corresponding to the samples. For a split with \(m\) samples, this will be an array of length \(T\), the total number of frames across all samples, with elements \(i \in (0, 1, ..., m - 1)\) indicating which sample each frame belongs to: \((0, 0, 0, ..., 1, 1, ..., m - 1, m - 1)\). The second vector is the inds_in_sample vector. This vector is the same length as sample_ids, but its values represent the indices of frames within each sample \(x_i\). For a split with \(T\) total frames across all samples, where \(t_i\) indicates the number of frames in each \(x_i\), this vector will look like \((0, 1, ..., t_0 - 1, 0, 1, ..., t_1 - 1, ..., 0, 1, ..., t_{m-1} - 1)\).
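The two indexing vectors can be illustrated with a small NumPy sketch. The frame counts below are made up, and the construction shown is only one simple way to build vectors with the described contents, not necessarily how vak builds them.

```python
import numpy as np

# Hypothetical number of frames in each of m = 3 samples: t_0, t_1, t_2.
frames_per_sample = [3, 2, 4]

# sample_ids: for every frame in the split, the ID of the sample it belongs to.
sample_ids = np.concatenate(
    [
        np.full(n_frames, sample_id, dtype=np.int64)
        for sample_id, n_frames in enumerate(frames_per_sample)
    ]
)

# inds_in_sample: for every frame in the split, its index within its own sample.
inds_in_sample = np.concatenate(
    [np.arange(n_frames) for n_frames in frames_per_sample]
)

# Map a "global" frame index in the split back to (sample, frame within sample).
global_frame_index = 4
which_sample = int(sample_ids[global_frame_index])
which_frame = int(inds_in_sample[global_frame_index])
```

With these frame counts, both vectors have length \(T = 9\), and global frame 4 maps to frame 1 of sample 1. A dataset class can use this lookup to load only the one file it needs for a given frame or window.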

Parameters:

dataset_df (pandas.DataFrame) – A pandas.DataFrame returned by vak.io.dataframe.from_files() with a 'split' column, added either as a result of calling vak.io.dataframe.from_files() or “manually” by calling vak.core.prep.prep_helper.add_split_col() (as is done for ‘predict’, when the entire DataFrame belongs to this “split”).
dataset_path (pathlib.Path) – Path to directory that represents dataset.
input_type (str) – The type of input to the neural network model. One of {‘audio’, ‘spect’}.
purpose (str) – A string indicating what the dataset will be used for. One of {‘train’, ‘eval’, ‘predict’, ‘learncurve’}. Determined by vak.core.prep.prep() using the TOML configuration file.
labelmap (dict) – A dict that maps a set of human-readable string labels to the integer classes predicted by a neural network model. As returned by vak.labels.to_map().
audio_format (str) – A string representing the format of audio files. One of vak.common.constants.VALID_AUDIO_FORMATS. Default is None.
spect_key (str) – Key for accessing spectrogram in files. Default is ‘s’.
timebins_key (str) – Key for accessing vector of time bins in files. Default is ‘t’.
freqbins_key (str) – Key for accessing vector of frequency bins in files. Default is ‘f’.

Returns:

dataset_df_out – The dataset_df with splits sorted by increasing frequency of labels (see dataset_arrays()), and with columns added containing the npy files for each row.

Return type:

pandas.DataFrame