vak.prep.frame_classification.make_splits.make_splits
- vak.prep.frame_classification.make_splits.make_splits(dataset_df: DataFrame, dataset_path: str | Path, input_type: str, purpose: str, labelmap: dict, audio_format: str | None = None, spect_key: str = 's', timebins_key: str = 't', freqbins_key: str = 'f') -> DataFrame
Make each split of a frame classification dataset.
This function takes a pandas.DataFrame returned by vak.prep.spectrogram_dataset.prep_spectrogram_dataset() or vak.prep.audio_dataset.prep_audio_dataset(), after it has been assigned a ‘split’ column, and then copies, moves, or generates the required files as appropriate for each split.

For each unique ‘split’ in the
pandas.DataFrame, a directory is made inside dataset_path. At a high level, all files needed for working with that split will be in that directory. For example, the train directory inside dataset_path would have all the files for every row in dataset_df for which dataset_df['split'] == 'train'.

The inputs to the neural network model are moved or copied into the split directory, or generated if necessary. If the
input_type is ‘audio’, then the audio files are copied from their original directory. If the input_type is ‘spect’, and the spectrogram files are already in dataset_path, they are moved into the split directory (under the assumption they were generated by vak.prep.spectrogram_dataset.audio_helper). If they are npz files, but they are not in dataset_path, then they are validated to make sure they have the appropriate keys, and then copied into the split directory. This could be the case if the files were generated by another program. If they are mat files, they will be converted to npz with the default keys for arrays, and then saved in a new npz file in the split directory. This step is required so that all datasets prepared by vak are in a “normalized” or “canonicalized” format.

In addition to copying or moving the audio or spectrogram files that are inputs to the neural network model, other npy files are made for each split and saved in the corresponding directory. This function creates one npy file for each row in
dataset_df. Each file has the extension ‘.frame_labels.npy’, and contains a vector where each element is the target label that the network should predict for the corresponding frame. Taken together, the audio or spectrogram file in each row along with its corresponding frame labels are the data for each sample \((x, y)\) in the dataset, where \(x_t\) supplies the “frames”, and \(y_t\) is the frame labels.

This function also creates two additional npy files for each split. These npy files are “indexing” vectors that are used by
vak.datasets.frame_classification.WindowDataset and vak.datasets.frame_classification.FramesDataset. These vectors make it possible to work with files, to avoid loading the entire dataset into memory, and to avoid working with memory-mapped arrays. The first is the sample_ids vector, which represents the “ID” of any sample \((x, y)\) in the split. We use these IDs to load the array files corresponding to the samples. For a split with \(m\) samples, this will be an array of length \(T\), the total number of frames across all samples, with elements \(i \in (0, 1, ..., m - 1)\) indicating which frames correspond to which sample \(m_i\): \((0, 0, 0, ..., 1, 1, ..., m - 1, m - 1)\). The second vector is the inds_in_sample vector. This vector is the same length as sample_ids, but its values represent the indices of frames within each sample \(x_t\). For a data set with \(T\) total frames across all samples, where \(t_i\) indicates the number of frames in each \(x_i\), this vector will look like \((0, 1, ..., t_0 - 1, 0, 1, ..., t_1 - 1, ..., 0, 1, ..., t_{m-1} - 1)\).

- Parameters:
dataset_df (pandas.DataFrame) – A pandas.DataFrame returned by vak.io.dataframe.from_files() with a 'split' column added, as a result of calling vak.io.dataframe.from_files() or because it was added “manually” by calling vak.core.prep.prep_helper.add_split_col() (as is done for ‘predict’ when the entire DataFrame belongs to this “split”).
dataset_path (str or pathlib.Path) – Path to the directory that represents the dataset.
input_type (str) – The type of input to the neural network model. One of {‘audio’, ‘spect’}.
purpose (str) – A string indicating what the dataset will be used for. One of {‘train’, ‘eval’, ‘predict’, ‘learncurve’}. Determined by vak.core.prep.prep() using the TOML configuration file.
labelmap (dict) – A dict that maps a set of human-readable string labels to the integer classes predicted by a neural network model. As returned by vak.labels.to_map().
audio_format (str) – A string representing the format of audio files. One of vak.common.constants.VALID_AUDIO_FORMATS. Default is None.
spect_key (str) – Key for accessing spectrogram in files. Default is ‘s’.
timebins_key (str) – Key for accessing vector of time bins in files. Default is ‘t’.
freqbins_key (str) – Key for accessing vector of frequency bins in files. Default is ‘f’.
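
The sample_ids and inds_in_sample indexing vectors described above can be illustrated with a small NumPy sketch. This is not vak's actual implementation: the helper name make_index_vectors and the per-sample frame counts are hypothetical, chosen only to show the shape and contents of the two vectors.

```python
import numpy as np


def make_index_vectors(frames_per_sample):
    """Build sample_ids and inds_in_sample for one split.

    frames_per_sample holds the number of frames t_i in each sample,
    in the order the samples appear in the split (hypothetical input).
    """
    frames_per_sample = np.asarray(frames_per_sample, dtype=np.int64)
    # sample_ids: repeat each sample's ID once for every frame in that sample
    sample_ids = np.repeat(np.arange(frames_per_sample.size), frames_per_sample)
    # inds_in_sample: frame indices restart at 0 for every sample
    inds_in_sample = np.concatenate([np.arange(n) for n in frames_per_sample])
    return sample_ids, inds_in_sample


# Three hypothetical samples with 3, 2, and 4 frames, so T = 9
sample_ids, inds_in_sample = make_index_vectors([3, 2, 4])
print(sample_ids)      # [0 0 0 1 1 2 2 2 2]
print(inds_in_sample)  # [0 1 2 0 1 0 1 2 3]

# Map a position in the split-wide frame sequence back to (sample, frame)
global_frame = 6
print(sample_ids[global_frame], inds_in_sample[global_frame])  # 2 1
```

Both vectors have length T, so a single integer index into the split identifies which file to load and which frame within that file to read, without holding every sample in memory.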
- Returns:
dataset_df_out – The dataset_df with splits sorted by increasing frequency of labels (see dataset_arrays()), and with columns added containing the npy files for each row.
- Return type: