vak.prep.audio_dataset.prep_audio_dataset

Gets a set of audio files from a directory, optionally paired with an annotation file or files, and returns a pandas DataFrame that represents the set of files.

Finds all files with audio_format in data_dir, then finds any annotations with annot_format if specified, and additionally filters the audio and annotation files by labelset if specified. Then creates the DataFrame with the columns specified by vak.prep.audio_dataset.DF_COLUMNS: "audio_path", "annot_path", "annot_format", "samplerate", "sample_dur", and "duration".
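The steps above can be illustrated with a minimal, standard-library-only sketch. It is not vak's implementation: the helper name sketch_audio_records is hypothetical, it reads only WAV files, it skips annotation lookup, and it assumes that "sample_dur" is the duration of one sample (1 / samplerate) and "duration" is the length of the file in seconds.

```python
import wave
from pathlib import Path

# Column names as listed in the description above.
DF_COLUMNS = [
    "audio_path", "annot_path", "annot_format",
    "samplerate", "sample_dur", "duration",
]


def sketch_audio_records(data_dir, audio_format="wav", annot_format=None):
    """Hypothetical helper: build one record per audio file in data_dir."""
    records = []
    # Find all files in data_dir whose extension matches audio_format.
    for audio_path in sorted(Path(data_dir).glob(f"*.{audio_format}")):
        # The stdlib wave module reads WAV only; other formats would
        # need a different reader.
        with wave.open(str(audio_path), "rb") as wav_file:
            samplerate = wav_file.getframerate()
            n_frames = wav_file.getnframes()
        records.append(
            {
                "audio_path": str(audio_path),
                "annot_path": None,  # annotation pairing is omitted here
                "annot_format": annot_format,
                "samplerate": samplerate,
                "sample_dur": 1 / samplerate,  # assumed meaning: seconds per sample
                "duration": n_frames / samplerate,  # assumed meaning: seconds
            }
        )
    return records
```

Passing the returned list to pandas.DataFrame(records, columns=DF_COLUMNS) would give a table with the same column layout as the one described here.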

Parameters:

data_dir (str, pathlib.Path) – Path to directory containing audio files that should be used in dataset.
audio_format (str) – A string representing the format of audio files. One of vak.common.constants.VALID_AUDIO_FORMATS.
annot_format (str) – Name of annotation format. Added as a column to the DataFrame if specified. Used by other functions that open annotation files via their paths from the DataFrame. Should be a format that the crowsetta library recognizes. Default is None.
annot_file (str) – Path to a single annotation file. Default is None. Used when a single file contains annotations for multiple audio files.
labelset (str, list, set) – Iterable of str or int, the set of unique labels for annotations. Default is None. If not None, files whose associated annotation contains labels not found in labelset are skipped. labelset is converted to a Python set using vak.common.converters.labelset_to_set(). See the docstring of that function for details on how to specify labelset.
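The labelset filtering rule can be sketched as a subset check. This is only an illustration: keep_annotated_file and the example file names and labels are hypothetical, and the conversion performed by vak.common.converters.labelset_to_set() is not reproduced, so labelset is given here as a plain Python set.

```python
def keep_annotated_file(labels, labelset):
    """Hypothetical helper: True if every label in an annotation is in labelset."""
    if labelset is None:
        # No labelset specified, so no file is skipped.
        return True
    return set(labels).issubset(set(labelset))


# Made-up annotations: audio file name -> labels of its annotated segments.
annotations = {
    "bird1.wav": ["a", "b", "a"],
    "bird2.wav": ["a", "x"],  # "x" is not in labelset, so this file is skipped
}
labelset = {"a", "b"}

kept = [
    name
    for name, labels in annotations.items()
    if keep_annotated_file(labels, labelset)
]
```

With these inputs only bird1.wav is kept, because bird2.wav carries a label outside labelset.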

Returns:

source_files_df – A set of source files that will be used to prepare a dataset for use with neural network models, represented as a pandas.DataFrame. Will contain paths to audio files, possibly paired with annotation files. The columns of the DataFrame are specified by vak.prep.audio_dataset.DF_COLUMNS.

Return type:

pandas.DataFrame