vak.prep.parametric_umap.parametric_umap.prep_parametric_umap_dataset

vak.prep.parametric_umap.parametric_umap.prep_parametric_umap_dataset(data_dir: str | Path, purpose: str, output_dir: str | Path | None = None, audio_format: str | None = None, spect_params: dict | None = None, annot_format: str | None = None, annot_file: str | Path | None = None, labelset: set | None = None, context_s: float = 0.015, train_dur: int | None = None, val_dur: int | None = None, test_dur: int | None = None, train_set_durs: list[float] | None = None, num_replicates: int | None = None, spect_key: str = 's', timebins_key: str = 't')

Prepare datasets for neural network models that perform a dimensionality reduction task.

For general information on dataset preparation, see the docstring for vak.prep.prep().

Parameters:
  • data_dir (str, Path) – Path to directory with files from which to make dataset.

  • purpose (str) – Purpose of the dataset. One of {‘train’, ‘eval’, ‘predict’, ‘learncurve’}. These correspond to commands of the vak command-line interface.

  • output_dir (str, Path) – Path to location where datasets should be saved. Default is None, in which case datasets are saved in data_dir.

  • audio_format (str) – Format of audio files. One of {‘wav’, ‘cbin’}. Default is None, but either audio_format or spect_format must be specified.

  • spect_params (dict, vak.config.SpectParams) – Parameters for creating spectrograms. Default is None.

  • annot_format (str) – Format of annotations. Any format that can be used with the crowsetta library is valid. Default is None.

  • labelset (str, list, set) – Set of unique labels for vocalizations. Strings or integers. Default is None. If not None, then files will be skipped where the associated annotation contains labels not found in labelset. labelset is converted to a Python set using vak.converters.labelset_to_set(). See help for that function for details on how to specify labelset.

  • train_dur (float) – Total duration of training set, in seconds. When creating a learning curve, training subsets of shorter duration will be drawn from this set. Default is None.

  • val_dur (float) – Total duration of validation set, in seconds. Default is None.

  • test_dur (float) – Total duration of test set, in seconds. Default is None.

  • train_set_durs (list) – Durations, in seconds, of the subsets taken from the training data to create a learning curve, e.g. [5, 10, 15, 20]. Default is None.

  • num_replicates (int) – Number of times to replicate training for each training set duration, to better estimate metrics for a training set of that size. Each replicate uses a different randomly drawn subset of the training data (but of the same duration). Default is None.

  • spect_key (str) – Key for accessing spectrogram in files. Default is ‘s’.

  • timebins_key (str) – Key for accessing vector of time bins in files. Default is ‘t’.
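
The relationship between train_set_durs and num_replicates described above can be sketched in plain Python. This is only an illustration of how many training subsets a learning curve implies (one per duration and replicate), not vak's internal implementation; the value of num_replicates and the dictionary keys are assumptions chosen for the example.

```python
from itertools import product

train_set_durs = [5, 10, 15, 20]  # seconds; the example values from the description
num_replicates = 3                # hypothetical value, chosen for illustration

# One training subset is drawn per (duration, replicate) pair.
# Each replicate of a given duration would be a different random draw.
learncurve_plan = [
    {"train_dur": dur, "replicate_num": rep}
    for dur, rep in product(train_set_durs, range(1, num_replicates + 1))
]

print(len(learncurve_plan))  # 4 durations x 3 replicates = 12 subsets
```

Because every subset is drawn from the full training set, train_dur presumably needs to be at least as long as the largest entry in train_set_durs.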

Returns:

  • dataset_df (pandas.DataFrame) – DataFrame that represents the dataset.

  • dataset_path (pathlib.Path) – Path to the CSV file saved from dataset_df.
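
As a rough sketch of how the parameters fit together, the snippet below collects keyword arguments for a call with purpose ‘train’. The helper build_prep_kwargs, the directory name, the labelset, and the durations are all hypothetical; "notmat" is assumed to be a format name that crowsetta recognizes. spect_params is left at its default here, although a real run that builds spectrograms from audio would probably need it. The call itself is left commented out because it requires vak and real data.

```python
from pathlib import Path

# Values listed for `purpose` in the parameter description above.
VALID_PURPOSES = {"train", "eval", "predict", "learncurve"}


def build_prep_kwargs(data_dir, purpose, **options):
    """Hypothetical helper: collect keyword arguments for the prep call."""
    if purpose not in VALID_PURPOSES:
        raise ValueError(
            f"purpose must be one of {sorted(VALID_PURPOSES)}, got {purpose!r}"
        )
    kwargs = {"data_dir": Path(data_dir), "purpose": purpose}
    kwargs.update(options)
    return kwargs


kwargs = build_prep_kwargs(
    "data/bird1",           # hypothetical directory of audio and annotation files
    "train",
    audio_format="cbin",
    annot_format="notmat",  # assumption: a crowsetta format name
    labelset=set("abcd"),   # hypothetical labels
    train_dur=50,           # seconds
    val_dur=15,
    test_dur=30,
)

# With vak installed and real data in data_dir, the call would look like:
# from vak.prep.parametric_umap.parametric_umap import prep_parametric_umap_dataset
# dataset_df, dataset_path = prep_parametric_umap_dataset(**kwargs)
```

The two return values would then be the DataFrame describing the dataset and the path to the CSV file saved from it.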