(howto-user-annot)=

# How do I use my own vocal annotation format?

To load annotation formats, vak uses another Python package called crowsetta.
It has built-in support for common formats such as the .TextGrid files generated by [Praat](http://www.fon.hum.uva.nl/praat/manual/Intro_7__Annotation.html).
Even if your data is not annotated with one of these common formats, you can still use crowsetta to convert your annotations into a format that vak can read.

There are two main ways to do this.
The first is to convert the annotations to a simple .csv file format that `crowsetta` calls `'simple-seq'`.
You can easily create files in this format with the pandas library, as we show with an example script below.
The second approach is to convert your annotations to a more generic format built into crowsetta, called `'generic-seq'`, that is designed to represent a large set of annotations as a single .csv file.
In the sections below, we walk through both methods.

```{seealso}
For more detail on how vak relates annotation files to the files that they annotate, please see {ref}`which-annotations-go-with-which-annotated` in the how-to on {ref}`howto-prep-annotate`.
```

## Method 1: converting your annotations to the `'simple-seq'` format

The first method is to convert your annotations to a format named `'simple-seq'`.
This method will work for a wide array of annotation formats that all can be mapped to a sequence of segments, with each segment having an onset time, offset time, and label.
**The one assumption the `'simple-seq'` format makes is that you have one annotation file per file that is annotated, that is, one annotation file per audio file or per array file containing a spectrogram.**
This is likely to be the case if you are using apps like Praat or Audacity.
An example of such a format is the Audacity [standard label track format](https://manual.audacityteam.org/man/importing_and_exporting_labels.html#Standard_.28default.29_format), exported to .txt files, that you would get if you were to annotate with [region labels](https://manual.audacityteam.org/man/label_tracks.html#type).

Below we provide an example of how you would write a very small Python script to convert your annotations to the `'simple-seq'` format using the pandas library.
First we explain what your dataset should look like.

### Explanation of when you can use the `'simple-seq'` format

Again, this first approach assumes that you have a separate annotation file for each file you have with a vocalization in it, either an audio file or an array file containing a spectrogram.
In other words, a directory of your data looks something like this:

```console
BB_SGP16-1___20160521_214723.txt
BB_SGP16-1___20160521_214723.wav
BBY15-4___20150907_211645.txt
BBY15-4___20150907_211645.wav
...  # more files here
DB_1-WWS16-2___20160822_203501.txt
DB_1-WWS16-2___20160822_203501.wav
```

Notice that each .wav audio file has a corresponding .txt file with annotations.
Each of the .txt files has columns that could be imported into a GUI application, e.g. Audacity.

:::{note}
Those files are taken from this dataset:
You can download them to work through the example yourself.
:::

Here we use the `cat` command in the terminal to print the contents of the first .txt file:

```console
$ cat BB_SGP16-1___20160521_214723.txt
8.358329	15.019360	Common Pip
194.710924	199.112019	Barbastelle - good
```

We can see there are two rows, each with an onset time, an offset time, and a text label.
The evenly-aligned columns tell us that they are separated by tabs (which you can also notice if you open the file in a text editor and move the cursor around).
Lastly we see that there is no *header*, that is, no first row with column names, such as "start time", "stop time", and "name".
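
The observations above (tab-separated columns, no header row) translate directly into arguments for `pandas.read_csv`. The following is a minimal sketch, not part of vak or crowsetta: it parses the two rows printed by `cat` from an in-memory string, and it assumes the column names `onset_s`, `offset_s`, and `label` that the conversion script later on this page uses.

```python
import io

import pandas as pd

# contents of BB_SGP16-1___20160521_214723.txt, as printed by `cat` above,
# with the tabs written out explicitly as '\t'
txt = (
    "8.358329\t15.019360\tCommon Pip\n"
    "194.710924\t199.112019\tBarbastelle - good\n"
)

df = pd.read_csv(
    io.StringIO(txt),
    sep="\t",       # columns are separated by tabs
    header=None,    # the file has no row of column names
    names=["onset_s", "offset_s", "label"],  # so we supply the names ourselves
)
print(df)
```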

What we want is to convert each .txt file to a comma-separated file (a `.csv`) in the `'simple-seq'` format, with a header that has the specific column names that `crowsetta` recognizes.
We can easily create such files with pandas.
We will write a script to do so.
After running the script, we will have a `.csv` file for each .txt file in our directory, as shown:

```console
BB_SGP16-1___20160521_214723.txt
BB_SGP16-1___20160521_214723.wav
BB_SGP16-1___20160521_214723.wav.csv
BBY15-4___20150907_211645.txt
BBY15-4___20150907_211645.wav
BBY15-4___20150907_211645.wav.csv
...  # more files here
DB_1-WWS16-2___20160822_203501.txt
DB_1-WWS16-2___20160822_203501.wav
DB_1-WWS16-2___20160822_203501.wav.csv
```

Notice also how the script names the new annotation files.
For each audio file, it creates an annotation file with the same name, including the audio extension, and the annotation extension added after that.
For example, the script creates an annotation file named "DB_1-WWS16-2___20160822_203501.wav.csv" for the audio file named "DB_1-WWS16-2___20160822_203501.wav".
We could also just name the files by replacing the extension .wav with the extension .csv.
One drawback of naming the files by just replacing the extension is that we cannot have any other .csv files with the same name in the directory.
This would be a problem if, for example, we wanted an analysis file for each audio file: "DB_1-WWS16-2___20160822_203501.csv" could contain features or measurements we extract from "DB_1-WWS16-2___20160822_203501.wav".

```{admonition} More on naming annotation files
As stated above, more detail on how vak relates annotation files to the files that they annotate can be found in the section {ref}`which-annotations-go-with-which-annotated` on the how-to page {ref}`howto-prep-annotate`.
The reference section also provides a page on {ref}`file-naming-conventions`.
```

### Example script for converting .txt files to the `'simple-seq'` format

Below is a script that loads the text files using pandas, and then adds the column names needed before saving a new .csv file with the same values.

```python
import pathlib

import pandas as pd

COLUMNS = ['onset_s', 'offset_s', 'label']


def main():
    txt_files = sorted(pathlib.Path('./path/to/data').glob('*.txt'))
    for txt_file in txt_files:
        # sep='\t' because tab-separated, header=None because there is no header row
        txt_df = pd.read_csv(txt_file, sep='\t', header=None)
        txt_df.columns = COLUMNS
        # in next line, use `txt_file.stem` to get the file name without the .txt extension,
        # then add the audio extension and the .csv extension to it, to follow naming convention,
        # e.g., annotations for 'a.wav' are loaded from 'a.txt' and saved in 'a.wav.csv'
        csv_name = txt_file.parent / (txt_file.stem + '.wav.csv')
        # index=False so that pandas does not write the row index as an extra column
        txt_df.to_csv(csv_name, index=False)


if __name__ == '__main__':
    main()
```

### Using the `'simple-seq'` format with vak

Once you have annotations in the `'simple-seq'` format, you will set up the `[PREP]` section of your configuration file like this:

```{code-block} toml
[PREP]
data_dir = "~/Documents/data/vocal/BFSongRepo-test-csv-format/gy6or6/032212"
output_dir = "./data/prep/train"
audio_format = "cbin"
annot_format = "simple-seq"
labelset = "iabcdefghjk"
train_dur = 50
val_dur = 15
```

vak will look for a .csv file in the `'simple-seq'` format for each audio file (or spectrogram file, if you are supplying your own spectrogram files).

## Method 2: converting your annotations to the generic format

An alternative to the first method is to use the `'generic-seq'` format.
This method may make sense if you do not have a separate annotation file for each audio file, e.g., all your annotations are in a single file saved by an application.
There are two steps to converting your format to `'generic-seq'`, described below.

:::{note}
Previously, the `'generic-seq'` format was called `'csv'`; that name will be removed in the next version of `crowsetta`.
:::

### Step-by-step

1. Write a Python script that loads the onsets, offsets, and labels from your format, and then uses that data to create the `Annotation`s and `Sequence`s that `crowsetta` uses to convert between formats.

   :::{note}
   For examples, please see any of the modules for built-in formats in the `crowsetta` library.
   E.g., the `notmat` module:
   That module parses annotations from this dataset:
   :::

2. Then save your `Annotation`s, converted to the generic `crowsetta` format, in a .csv file, using the `crowsetta.csv` functions.
   There is a convenience function `crowsetta.csv.annot2csv` that you can use if you have already written a function that returns `crowsetta.Annotation`s.
   Again, see examples in the built-in format modules.

### Example script for converting .not.mat files to the `'generic-seq'` format

Here is a script that carries out steps one and two.
This script can be run on the example {download}`data ` used for training a model in the tutorial {ref}`autoannotate`.

```python
import pathlib

import numpy as np
import scipy.io

import crowsetta

data_dir = pathlib.Path('~/Documents/data/gy6or6/032212').expanduser()  # ``expanduser`` for '~'
annot_path = sorted(data_dir.glob('*.not.mat'))

# the name of the .csv with our `'generic-seq'` format annotations
csv_filename = 'data/annot/gy6or6.032212.annot.csv'

# ---- step 1. convert to ``Annotation``s with ``Sequence``s
annot = []
for a_notmat in annot_path:
    notmat_dict = scipy.io.loadmat(a_notmat, squeeze_me=True)
    # in .not.mat files saved by evsonganaly,
    # onsets and offsets are in units of ms, have to convert to s
    onsets_s = notmat_dict['onsets'] / 1000
    offsets_s = notmat_dict['offsets'] / 1000

    # the audio file has the same name as the annotation file, minus '.not.mat'
    audio_pathname = str(a_notmat).replace('.not.mat', '')

    notmat_seq = crowsetta.Sequence.from_keyword(labels=np.asarray(list(notmat_dict['labels'])),
                                                 onsets_s=onsets_s,
                                                 offsets_s=offsets_s)
    annot.append(
        # see `warning` below for explanation of why `annot_path=csv_filename`
        crowsetta.Annotation(annot_path=csv_filename, audio_path=audio_pathname, seq=notmat_seq)
    )

# ---- step 2. save as a .csv
crowsetta.csv.annot2csv(annot, csv_filename=csv_filename)
```

:::{warning}
In the script above, when creating `Annotation`s, notice that we specified the `annot_path` as the path to the .csv file itself, instead of specifying the path to the original `.not.mat` annotation files.
You should do the same.
E.g., if you are saving your annotations in a .csv file named `bat1_converted.csv`, then the value for every cell in the `annot_path` column of your .csv file should also be `bat1_converted.csv`.
This workaround prevents vak from trying to open the original annotation files as if they were .csv files, which can cause an error.
:::

(howto-user-annot-format-method-2)=

### Using the `'generic-seq'` format with vak

If you have written a script that saves all your annotations in a single .csv file as described above, then you need to tell vak to use that file.
To do so, you add the `annot_file` option in the `[PREP]` section of your .toml configuration file, as shown below:

```{code-block} toml
:emphasize-lines: 6

[PREP]
data_dir = "~/Documents/data/vocal/BFSongRepo-test-csv-format/gy6or6/032212"
output_dir = "./data/prep/train"
audio_format = "cbin"
annot_format = "generic-seq"
annot_file = "./data/annot/gy6or6.032212.annot.csv"
labelset = "iabcdefghjk"
train_dur = 50
val_dur = 15
```
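
Before pointing `annot_file` at your .csv file, it may be worth checking that the file follows the workaround from the warning above, i.e., that every cell in its `annot_path` column names the .csv file itself. The following is a minimal sketch of such a check using pandas. It is not part of vak or crowsetta: the helper name `annot_path_matches_csv` is hypothetical, the only thing it assumes about the file is that it has an `annot_path` column, and the toy values written to the temporary file are made up for illustration.

```python
import pathlib
import tempfile

import pandas as pd


def annot_path_matches_csv(csv_path):
    """Return True if every cell in the `annot_path` column names `csv_path` itself."""
    csv_path = pathlib.Path(csv_path)
    df = pd.read_csv(csv_path)
    # compare file names only, so that relative and absolute paths are treated alike
    names = df['annot_path'].map(lambda p: pathlib.Path(p).name)
    return bool((names == csv_path.name).all())


# toy demonstration with made-up values; a real file has more columns and rows
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_csv = pathlib.Path(tmp_dir) / 'bat1_converted.csv'
    pd.DataFrame(
        {
            'label': ['a', 'b'],
            'onset_s': [0.1, 0.5],
            'offset_s': [0.3, 0.9],
            'annot_path': ['bat1_converted.csv', 'bat1_converted.csv'],
        }
    ).to_csv(toy_csv, index=False)
    print(annot_path_matches_csv(toy_csv))
```

If the function returns `False` for your own file, the `annot_path` column most likely still points at the original annotation files, which is the situation the warning describes.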