(howto-user-annot)=

# How do I use my own vocal annotation format?

To load annotation formats,
vak uses another Python package called 
crowsetta (<https://crowsetta.readthedocs.io/en/latest/>).
It has built-in support for common formats  
such as the .TextGrid files
generated by [Praat](http://www.fon.hum.uva.nl/praat/manual/Intro_7__Annotation.html).
Even if your data is not annotated with one of these common formats, 
you can still use crowsetta to convert your annotations 
into a format that vak can read.

There are two main ways to do this.
The first is to convert the annotations to a simple .csv file format 
that `crowsetta` calls `'simple-seq'`.
You can easily create files in this format with the pandas library, 
as we show with an example script below.
The second approach is to convert your annotations 
to a more generic format built into crowsetta, 
called `'generic-seq'`, 
that is designed to represent 
a large set of annotations as a single .csv file.
In the sections below, we walk through both methods.

```{seealso}
For more detail on how vak relates 
annotation files to the files that they 
annotate, please see 
{ref}`which-annotations-go-with-which-annotated` 
in the how-to on 
{ref}`howto-prep-annotate`.
```

## Method 1: converting your annotations to the `'simple-seq'` format

The first method is to convert your annotations to a format named `'simple-seq'`. 
This method will work for a wide array of annotation formats 
that all can be mapped to a sequence of segments, 
with each segment having an onset time, offset time, and label.
**The one assumption the `'simple-seq'` format makes is that you have one annotation file 
per file that is annotated, that is, 
one annotation file per audio file or per array file containing a spectrogram.**
This is likely to be the case if you are using apps like Praat or Audacity.
An example of such a format is the Audacity 
[standard label track format](https://manual.audacityteam.org/man/importing_and_exporting_labels.html#Standard_.28default.29_format), 
exported to .txt files, that you would get if you were to annotate with  
[region labels](https://manual.audacityteam.org/man/label_tracks.html#type).

Below we provide an example of how you would write 
a very small Python script to convert your annotations 
to the `'simple-seq'` format using the pandas library.
First we explain what your dataset should look like.

### Explanation of when you can use the `'simple-seq'` format

Again, this first approach assumes that you have a separate 
annotation file for each file you have with a vocalization in it, 
either an audio file or an array file containing a spectrogram.
In other words, a directory of your data looks something like this:

```console
BB_SGP16-1___20160521_214723.txt
BB_SGP16-1___20160521_214723.wav
BBY15-4___20150907_211645.txt
BBY15-4___20150907_211645.wav
... # more files here
DB_1-WWS16-2___20160822_203501.txt
DB_1-WWS16-2___20160822_203501.wav
```

Notice that each .wav audio file has 
a corresponding .txt file with annotations.
Each of the .txt files has columns 
that could be imported into a GUI application, e.g. Audacity.

:::{note}
Those files are taken from this dataset:  
<https://figshare.com/articles/dataset/Wav_and_label_files_used_in_the_workshop/4714387>  
You can download them to work through the example yourself.
:::

Here we use the `cat` command in the terminal 
to dump out the contents of the first .txt file:
```console
$ cat BB_SGP16-1___20160521_214723.txt
8.358329	15.019360	Common Pip
194.710924	199.112019	Barbastelle - good
```

We can see there are two rows, 
each with an onset time, an offset time, and a text label.
The evenly-aligned columns tell us that they are separated by tabs 
(which you can also notice if you open the file in a text editor 
and move the cursor around).
Lastly we see that there is no *header*, that is, 
no first row with column names, such as "start time", "stop time", and "name".

What we want is to convert each .txt file to a comma-separated file
(a `.csv`) in the `'simple-seq'` format,
with a header that has the specific column names that `crowsetta` recognizes.
We can easily create such files with pandas. 
We will write a script to do so.
After running the script,  
we will have a `.csv` file for each .txt file in our directory, as shown:

```console
BB_SGP16-1___20160521_214723.txt
BB_SGP16-1___20160521_214723.wav
BB_SGP16-1___20160521_214723.wav.csv
BBY15-4___20150907_211645.txt
BBY15-4___20150907_211645.wav
BBY15-4___20150907_211645.wav.csv
... # more files here
DB_1-WWS16-2___20160822_203501.txt
DB_1-WWS16-2___20160822_203501.wav
DB_1-WWS16-2___20160822_203501.wav.csv
```

Notice also how the script names the new annotation files.
For each audio file, 
it creates an annotation file with the same name, 
including the audio extension,
and the annotation extension added after that. 
For example, the script creates an annotation file 
named "DB_1-WWS16-2___20160822_203501.wav.csv" 
for the audio file named "DB_1-WWS16-2___20160822_203501.wav".
We could also name the files by simply replacing 
the extension .wav with the extension .csv. 
One drawback of doing so is that we could not have any other .csv file 
with the same name in the directory. 
That would be a problem if we wanted an analysis file 
for each audio file. For example, "DB_1-WWS16-2___20160822_203501.csv" 
could contain features or measurements 
that we extract from "DB_1-WWS16-2___20160822_203501.wav".
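
The two naming options can be sketched with `pathlib` as follows. 
This only illustrates the file names involved; 
the variable names are ours and are not part of vak or crowsetta.

```python
import pathlib

audio_path = pathlib.Path("DB_1-WWS16-2___20160822_203501.wav")

# Option used in this how-to: keep the audio extension and append ".csv"
appended = audio_path.with_name(audio_path.name + ".csv")

# Alternative: replace the audio extension with ".csv"
replaced = audio_path.with_suffix(".csv")

print(appended.name)  # DB_1-WWS16-2___20160822_203501.wav.csv
print(replaced.name)  # DB_1-WWS16-2___20160822_203501.csv
```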

```{admonition} More on naming annotation files 
As stated above, 
more detail on how vak relates 
annotation files to the files that they 
annotate can be found in the section 
{ref}`which-annotations-go-with-which-annotated` 
on the how-to page  
{ref}`howto-prep-annotate`.
The reference section also provides a page 
on {ref}`file-naming-conventions`.
```

### Example script for converting .txt files to the `'simple-seq'` format

Below is a script that loads the text files using pandas, 
and then adds the column names needed before saving 
a new .csv file with the same values.

```python
import pathlib

import pandas as pd

COLUMNS = ['onset_s', 'offset_s', 'label']


def main():
    txt_files = sorted(pathlib.Path('./path/to/data').glob('*.txt'))
    
    for txt_file in txt_files:
        txt_df = pd.read_csv(txt_file, sep='\t', header=None)  # sep='\t' because tab-separated
        txt_df.columns = COLUMNS
        # use `txt_file.stem` to get the file name without the .txt extension,
        # then add the audio extension and .csv, to follow the naming convention
        csv_name = txt_file.parent / (txt_file.stem + '.wav.csv')
        # index=False so pandas does not write its row index as an extra column
        txt_df.to_csv(csv_name, index=False)

if __name__ == '__main__':
    main()
```
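
After running the script, it can be worth checking a converted file before handing it to vak. 
The sketch below uses only the standard library. 
`check_simple_seq_csv` is a hypothetical helper, not part of vak or crowsetta, 
and the checks it makes (the three column names used above are present, 
and each onset comes before its offset) are our assumptions about what a well-formed file looks like.

```python
import csv

REQUIRED_COLUMNS = ("onset_s", "offset_s", "label")


def check_simple_seq_csv(csv_path):
    """Return a list of problems found in a converted .csv file (empty if none)."""
    with open(csv_path, newline="") as fp:
        reader = csv.DictReader(fp)
        fieldnames = reader.fieldnames or []
        missing = [col for col in REQUIRED_COLUMNS if col not in fieldnames]
        if missing:
            return [f"missing columns: {missing}"]

        problems = []
        # data rows start on line 2 of the file, after the header
        for line_num, row in enumerate(reader, start=2):
            try:
                onset, offset = float(row["onset_s"]), float(row["offset_s"])
            except (TypeError, ValueError):
                problems.append(f"line {line_num}: could not parse onset/offset as numbers")
                continue
            if onset >= offset:
                problems.append(f"line {line_num}: onset {onset} is not before offset {offset}")
    return problems
```

Calling `check_simple_seq_csv` on one of the new `.wav.csv` files should return an empty list 
if the header and segment times look as expected.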

### Using the `'simple-seq'` format with vak

Once you have annotations in the `'simple-seq'` format,
you will set up the `[PREP]` section of your configuration 
file like this:
```{code-block} toml
[PREP]
data_dir = "~/Documents/data/vocal/BFSongRepo-test-csv-format/gy6or6/032212"
output_dir = "./data/prep/train"
audio_format = "cbin"
annot_format = "simple-seq"
labelset = "iabcdefghjk"
train_dur = 50
val_dur = 15
```

vak will look for a .csv file in the `'simple-seq'` format for each audio file 
(or spectrogram file, if you are supplying your own spectrogram files).
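
Before running `vak prep`, you may want to confirm that every audio file has an annotation file. 
The sketch below is one way to do that. 
It assumes the naming convention used in this how-to (audio file name plus `.csv`); 
`audio_without_annot` is a hypothetical helper and does not reproduce vak's own matching logic.

```python
import pathlib


def audio_without_annot(data_dir, audio_ext=".wav", annot_ext=".csv"):
    """Return audio files in ``data_dir`` that lack a '<audio name><annot_ext>' file."""
    data_dir = pathlib.Path(data_dir).expanduser()
    missing = []
    for audio_path in sorted(data_dir.glob(f"*{audio_ext}")):
        # e.g. "song.wav" -> "song.wav.csv"
        annot_path = audio_path.with_name(audio_path.name + annot_ext)
        if not annot_path.exists():
            missing.append(audio_path)
    return missing
```

An empty list means each audio file in the directory has a matching annotation file.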

## Method 2: converting your annotations to the generic format

An alternative to the first method is to use the `'generic-seq'` format. 
This method may make sense if you do not have a separate annotation file 
for each audio file, e.g., all your annotations are in a single file 
saved by an application.
There are two steps to converting your format to `'generic-seq'`,
described below. 

:::{note}
Previously the `'generic-seq'` format was called `'csv'`; 
this name will be removed in the next version of `crowsetta`.
:::

### Step-by-step

1. Write a Python script that loads the onsets, offsets, and labels
   from your format, and then uses that data to create the `Annotation`s and
   `Sequence`s that `crowsetta` uses to convert between formats.

   :::{note}
   For examples, please see any of the modules for built-in formats
   in the `crowsetta` library.

   E.g., the `notmat` module:
   <https://github.com/vocalpy/crowsetta/blob/main/src/crowsetta/notmat.py>

   That module parses annotations from this dataset:
   <https://figshare.com/articles/dataset/Bengalese_Finch_song_repository/4805749>
   :::

2. Then save your `Annotation`s---converted to the generic
   `crowsetta` format---in a .csv file, using the `crowsetta.csv` functions.
   There is a convenience function `crowsetta.csv.annot2csv` that you can use
   if you have already written a function that returns `crowsetta.Annotation`s.
   Again, see examples in the built-in format modules.

### Example script for converting .txt files to the `'generic-seq'` format

Here is a script that carries out steps one and two.
This script can be run on the example 
{download}`data <https://ndownloader.figshare.com/files/9537229>` 
used for training a model in the tutorial {ref}`autoannotate`.
```python
import pathlib

import numpy as np
import scipy.io

import crowsetta

data_dir = pathlib.Path('~/Documents/data/gy6or6/032212').expanduser()  # ``expanduser`` for '~'
annot_path = sorted(data_dir.glob('*.not.mat'))

# the name of the .csv with our `'generic-seq'` format annotations
csv_filename = 'data/annot/gy6or6.032212.annot.csv'

# ---- step 1. convert to ``Annotation``s with ``Sequence``s
annot = []
for a_notmat in annot_path:
    notmat_dict = scipy.io.loadmat(a_notmat, squeeze_me=True)
    # in .not.mat files saved by evsonganaly,
    # onsets and offsets are in units of ms, have to convert to s
    onsets_s = notmat_dict['onsets'] / 1000
    offsets_s = notmat_dict['offsets'] / 1000

    audio_pathname = str(a_notmat).replace('.not.mat', '')

    notmat_seq = crowsetta.Sequence.from_keyword(labels=np.asarray(list(notmat_dict['labels'])),
                                                 onsets_s=onsets_s,
                                                 offsets_s=offsets_s)
    annot.append(
        # see `warning` below for explanation of why `annot_path=csv_filename`
        crowsetta.Annotation(annot_path=csv_filename, audio_path=audio_pathname, seq=notmat_seq)
    )

# ---- step 2. save as a .csv
crowsetta.csv.annot2csv(annot, csv_filename=csv_filename)
```

:::{warning}
In the script above, when creating `Annotation`s, 
notice that we specified 
the `annot_path` as the path to the .csv file itself,
instead of specifying the path to the original `.not.mat` annotation files. 
You should do the same.
E.g., if you are saving your annotations in a .csv file
named `bat1_converted.csv`, then the value for every cell in
the `annot_path` column of your .csv file should
also be `bat1_converted.csv`.
This workaround prevents vak from trying to open the original 
annotation files as if they were a .csv file, 
which can cause an error.
:::
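
One way to confirm that your .csv file follows the advice in the warning 
is to scan its `annot_path` column, as in the minimal sketch below. 
It uses only the standard library and assumes nothing about the file 
except the `annot_path` column named above; 
`rows_with_wrong_annot_path` is a hypothetical helper, not part of crowsetta or vak.

```python
import csv


def rows_with_wrong_annot_path(csv_path, expected):
    """Return (line number, value) pairs where ``annot_path`` differs from ``expected``."""
    wrong = []
    with open(csv_path, newline="") as fp:
        # data rows start on line 2 of the file, after the header
        for line_num, row in enumerate(csv.DictReader(fp), start=2):
            value = row.get("annot_path")  # None if the column is missing
            if value != expected:
                wrong.append((line_num, value))
    return wrong
```

Here `expected` should be the same string you passed as `annot_path` when creating the `Annotation`s. 
An empty list means every row points at the .csv file itself.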

(howto-user-annot-format-method-2)=
### Using the `'generic-seq'` format with vak

If you have written a script that saves all your annotations 
in a single .csv file as described above, 
then you need to tell vak to use that file.
To do so, you add the `annot_file` option in the `[PREP]` section 
of your .toml configuration file, as shown below:

```{code-block} toml
:emphasize-lines: 6
[PREP]
data_dir = "~/Documents/data/vocal/BFSongRepo-test-csv-format/gy6or6/032212"
output_dir = "./data/prep/train"
audio_format = "cbin"
annot_format = "generic-seq"
annot_file = "./data/annot/gy6or6.032212.annot.csv"
labelset = "iabcdefghjk"
train_dur = 50
val_dur = 15
```