How do I use my own vocal annotation format?¶
To load annotation formats,
vak uses another Python package called
crowsetta (https://crowsetta.readthedocs.io/en/latest/).
It has built-in support for common formats
such as the .TextGrid files
generated by Praat.
Even if your data is not annotated with one of these common formats,
you can still use crowsetta to convert your annotations
into a format that vak can read.
There are two main ways to do this.
The first is to convert the annotations to a simple .csv file format
that crowsetta calls 'simple-seq'.
You can easily create files in this format with the pandas library,
as we show with an example script below.
The second approach is to convert your annotations
to a more generic format built into crowsetta,
called 'generic-seq',
that is designed to represent
a large set of annotations as a single .csv file.
In the sections below, we walk through both methods.
See also
For more detail on how vak relates annotation files to the files that they annotate, please see How does vak know which annotations go with which annotated files? in the how-to on How do I prepare datasets of annotated vocalizations for use with vak?.
Method 1: converting your annotations to the 'simple-seq' format¶
The first method is to convert your annotations to a format named 'simple-seq'.
This method will work for a wide array of annotation formats
that all can be mapped to a sequence of segments,
with each segment having an onset time, offset time, and label.
The one assumption the 'simple-seq' format makes is that you have one annotation file
per file that is annotated, that is,
one annotation file per audio file or per array file containing a spectrogram.
This is likely to be the case if you are using apps like Praat or Audacity.
An example of such a format is the Audacity
standard label track format,
exported to .txt files, that you would get if you were to annotate with
region labels.
Below we provide an example of how you would write
a very small Python script to convert your annotations
to the 'simple-seq' format using the pandas library.
First we explain what your dataset should look like.
Explanation of when you can use the 'simple-seq' format¶
Again, this first approach assumes that you have a separate annotation file for each file you have with a vocalization in it, either an audio file or an array file containing a spectrogram. In other words, a directory of your data looks something like this:
BB_SGP16-1___20160521_214723.txt
BB_SGP16-1___20160521_214723.wav
BBY15-4___20150907_211645.txt
BBY15-4___20150907_211645.wav
... # more files here
DB_1-WWS16-2___20160822_203501.txt
DB_1-WWS16-2___20160822_203501.wav
Notice that each .wav audio file has a corresponding .txt file with annotations. Each of the .txt files has columns that could be imported into a GUI application, e.g. Audacity.
Note
Those files are taken from this dataset:
https://figshare.com/articles/dataset/Wav_and_label_files_used_in_the_workshop/4714387
You can download them to work through the example yourself.
Here we use the cat command in the terminal
to dump out the contents of the first .txt file:
$ cat BB_SGP16-1___20160521_214723.txt
8.358329 15.019360 Common Pip
194.710924 199.112019 Barbastelle - good
We can see there are two rows, each with an onset time, an offset time, and a text label. The evenly-aligned columns tell us that they are separated by tabs (which you can also notice if you open the file in a text editor and move the cursor around). Lastly we see that there is no header, that is, no first row with column names, such as “start time”, “stop time”, and “name”.
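As a quick check of that reading, the two rows shown above can be parsed with pandas. This is only a minimal sketch: the file contents are pasted into a string with explicit tab characters rather than read from disk, and the column names are assigned here purely for display.

```python
import io

import pandas as pd

# Contents of BB_SGP16-1___20160521_214723.txt, with the tabs written explicitly
txt = (
    "8.358329\t15.019360\tCommon Pip\n"
    "194.710924\t199.112019\tBarbastelle - good\n"
)

# header=None because the file has no first row of column names
df = pd.read_csv(io.StringIO(txt), sep="\t", header=None)
df.columns = ["onset_s", "offset_s", "label"]

print(df.shape)  # (2, 3)
print(df["label"].tolist())
```

If the file were separated by spaces instead of tabs, the labels that contain spaces would be split across several columns, which is one reason to confirm the separator before converting.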
What we want is to convert each .txt file to a comma-separated file
(a .csv) in the 'simple-seq' format,
with a header that has the specific column names that crowsetta recognizes.
We can easily create such files with pandas.
We will write a script to do so.
After running the script,
we will have a .csv file for each .txt file in our directory, as shown:
BB_SGP16-1___20160521_214723.txt
BB_SGP16-1___20160521_214723.wav
BB_SGP16-1___20160521_214723.wav.csv
BBY15-4___20150907_211645.txt
BBY15-4___20150907_211645.wav
BBY15-4___20150907_211645.wav.csv
... # more files here
DB_1-WWS16-2___20160822_203501.txt
DB_1-WWS16-2___20160822_203501.wav
DB_1-WWS16-2___20160822_203501.wav.csv
Notice also how the script names the new annotation files. For each audio file, it creates an annotation file with the same name, including the audio extension, and the annotation extension added after that. For example, the script creates an annotation file named “DB_1-WWS16-2___20160822_203501.wav.csv” for the audio file named “DB_1-WWS16-2___20160822_203501.wav”. We could also just name the files by replacing the extension .wav with the extension .csv. One drawback of doing so is that we could not have any other .csv file with the same name in the directory. This would be a problem if we wanted an analysis file for each audio file. For example, “DB_1-WWS16-2___20160822_203501.csv” could contain features or measurements we extract from “DB_1-WWS16-2___20160822_203501.wav”.
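The two naming options can be compared with a few lines of pathlib. This is just an illustrative sketch using one of the file names from the listing above; the variable names are arbitrary.

```python
import pathlib

audio_path = pathlib.Path("DB_1-WWS16-2___20160822_203501.wav")

# Convention used on this page: keep the audio extension, then append ".csv"
annot_path = audio_path.parent / (audio_path.name + ".csv")

# Alternative: replace the audio extension with ".csv"
replaced_path = audio_path.with_suffix(".csv")

print(annot_path.name)     # DB_1-WWS16-2___20160822_203501.wav.csv
print(replaced_path.name)  # DB_1-WWS16-2___20160822_203501.csv
```

With the first option, the second name stays free for other per-file outputs, such as a table of extracted features.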
More on naming annotation files
As stated above,
more detail on how vak relates
annotation files to the files that they
annotate can be found in the section
How does vak know which annotations go with which annotated files?
on the how-to page
How do I prepare datasets of annotated vocalizations for use with vak?.
The reference section also provides a page
on File naming conventions.
Example script for converting .txt files to the 'simple-seq' format¶
Below is a script that loads the text files using pandas, and then adds the column names needed before saving a new .csv file with the same values.
import pathlib

import pandas as pd

COLUMNS = ['onset_s', 'offset_s', 'label']


def main():
    txt_files = sorted(pathlib.Path('./path/to/data').glob('*.txt'))
    for txt_file in txt_files:
        # sep='\t' because tab-separated; header=None because there is no header row
        txt_df = pd.read_csv(txt_file, sep='\t', header=None)
        txt_df.columns = COLUMNS
        # in next line, use `txt_file.stem` to get the file name without `.txt`,
        # then add the audio extension and `.csv`, to follow the naming convention
        csv_name = txt_file.parent / (txt_file.stem + '.wav.csv')
        # index=False so the row index is not saved as an extra, unnamed column
        txt_df.to_csv(csv_name, index=False)


if __name__ == '__main__':
    main()
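After running a conversion script, it can be worth checking the resulting files before handing them to vak. The sketch below is one possible sanity check, not part of vak or crowsetta: `check_simple_seq_csv` is a hypothetical helper, and the rules it applies (the three column names used on this page are present, each offset comes after its onset, no label is missing) are assumptions about what a well-formed file looks like.

```python
import pathlib
import tempfile

import pandas as pd

REQUIRED_COLUMNS = ["onset_s", "offset_s", "label"]


def check_simple_seq_csv(csv_path):
    """Return a list of problems found in one converted file (empty if none)."""
    df = pd.read_csv(csv_path)
    problems = []
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need those columns
    if (df["offset_s"] <= df["onset_s"]).any():
        problems.append("some segments have an offset at or before their onset")
    if df["label"].isna().any():
        problems.append("some segments have no label")
    return problems


# Small demonstration on a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = pathlib.Path(tmp_dir) / "example.wav.csv"
    pd.DataFrame(
        {"onset_s": [8.358329], "offset_s": [15.019360], "label": ["Common Pip"]}
    ).to_csv(csv_path, index=False)
    problems = check_simple_seq_csv(csv_path)

print(problems)  # []
```

To check a whole dataset, the same function could be called in a loop over `pathlib.Path('./path/to/data').glob('*.csv')`.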
Using the 'simple-seq' format with vak¶
Once you have annotations in the 'simple-seq' format,
you will set up the [PREP] section of your configuration
file like this:
[PREP]
data_dir = "~/Documents/data/vocal/BFSongRepo-test-csv-format/gy6or6/032212"
output_dir = "./data/prep/train"
audio_format = "cbin"
annot_format = "simple-seq"
labelset = "iabcdefghjk"
train_dur = 50
val_dur = 15
vak will look for a .csv file in the 'simple-seq' format for each audio file
(or spectrogram file, if you are supplying your own spectrogram files).
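Before running vak, you may want to confirm that every audio file really does have an annotation file next to it. The following is a rough sketch of such a check. It does not reproduce vak's own matching logic; it only mirrors the naming convention used on this page (audio file name plus `.csv`). The helper `find_unannotated` and its `audio_ext` parameter are hypothetical names introduced for illustration.

```python
import pathlib
import tempfile


def find_unannotated(data_dir, audio_ext=".wav"):
    """Return audio files in data_dir that lack a '<audio file name>.csv' file."""
    data_dir = pathlib.Path(data_dir)
    audio_files = sorted(data_dir.glob(f"*{audio_ext}"))
    return [
        audio_file
        for audio_file in audio_files
        if not (audio_file.parent / (audio_file.name + ".csv")).exists()
    ]


# Small demonstration with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_dir = pathlib.Path(tmp_dir)
    (tmp_dir / "a.wav").touch()
    (tmp_dir / "a.wav.csv").touch()
    (tmp_dir / "b.wav").touch()  # no annotation file for this one
    missing = [path.name for path in find_unannotated(tmp_dir)]

print(missing)  # ['b.wav']
```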
Method 2: converting your annotations to the generic format¶
An alternative to the first method is to use the 'generic-seq' format.
This method may make sense if you do not have a separate annotation file
for each audio file, e.g., all your annotations are in a single file
saved by an application.
There are two steps to converting your format to 'generic-seq',
described below.
Note
(Previously the 'generic-seq' format was called 'csv';
this name will be removed in the next version of crowsetta).
Step-by-step¶
1. Write a Python script that loads the onsets, offsets, and labels from your format, and then uses that data to create the Annotations and Sequences that crowsetta uses to convert between formats.

   Note
   For examples, please see any of the modules for built-in formats in the crowsetta library. E.g., the notmat module: https://github.com/vocalpy/crowsetta/blob/main/src/crowsetta/notmat.py
   That module parses annotations from this dataset: https://figshare.com/articles/dataset/Bengalese_Finch_song_repository/4805749

2. Then save your Annotations, converted to the generic crowsetta format, in a .csv file, using the crowsetta.csv functions. There is a convenience function crowsetta.csv.annot2csv that you can use if you have already written a function that returns crowsetta.Annotations. Again, see examples in the built-in format modules.
Example script for converting .not.mat files to the 'generic-seq' format¶
Here is a script that carries out steps one and two.
This script can be run on the example data
used for training a model in the tutorial Automated Annotation.
import pathlib
import numpy as np
import scipy.io
import crowsetta
data_dir = pathlib.Path('~/Documents/data/gy6or6/032212').expanduser()  # ``expanduser`` for '~'
annot_path = sorted(data_dir.glob('*.not.mat'))

# the name of the .csv with our `'generic-seq'` format annotations
csv_filename = 'data/annot/gy6or6.032212.annot.csv'

# ---- step 1. convert to ``Annotation``s with ``Sequence``s
annot = []
for a_notmat in annot_path:
    notmat_dict = scipy.io.loadmat(a_notmat, squeeze_me=True)
    # in .not.mat files saved by evsonganaly,
    # onsets and offsets are in units of ms, have to convert to s
    onsets_s = notmat_dict['onsets'] / 1000
    offsets_s = notmat_dict['offsets'] / 1000

    audio_pathname = str(a_notmat).replace('.not.mat', '')

    notmat_seq = crowsetta.Sequence.from_keyword(
        labels=np.asarray(list(notmat_dict['labels'])),
        onsets_s=onsets_s,
        offsets_s=offsets_s,
    )
    annot.append(
        # see `warning` below for explanation of why `annot_path=csv_filename`
        crowsetta.Annotation(annot_path=csv_filename, audio_path=audio_pathname, seq=notmat_seq)
    )

# ---- step 2. save as a .csv
crowsetta.csv.annot2csv(annot, csv_filename=csv_filename)
Warning
In the script above, when creating Annotations,
notice that we specified the annot_path
as the path to the .csv file itself,
instead of specifying the path to the original .not.mat annotation files.
You should do the same.
E.g., if you are saving your annotations in a .csv file
named bat1_converted.csv, then the value for every cell in
the annot_path column of your .csv file should
also be bat1_converted.csv.
This workaround prevents vak from trying to open the original
annotation files as if they were .csv files,
which can cause an error.
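One way to confirm that a saved file follows this rule is to load it with pandas and inspect the annot_path column. The snippet below is a minimal sketch under stated assumptions: the .csv contents are a made-up stand-in written inline, and only the annot_path column name comes from the text above. The other columns are illustrative and may not match what your version of crowsetta writes.

```python
import io

import pandas as pd

csv_name = "bat1_converted.csv"

# Stand-in for the contents of bat1_converted.csv (hypothetical rows);
# only the `annot_path` column matters for this check
csv_text = (
    "label,onset_s,offset_s,annot_path\n"
    "a,0.10,0.25,bat1_converted.csv\n"
    "b,0.30,0.42,bat1_converted.csv\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# True only if every cell in the column is the name of the .csv file itself
all_match = bool((df["annot_path"] == csv_name).all())

print(all_match)  # True
```

For a real file, replace `io.StringIO(csv_text)` with the path to your .csv, and compare against the same path string that you passed as `annot_path` when creating the annotations.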
Using the 'generic-seq' format with vak¶
If you have written a script that saves all your annotations
in a single .csv file as described above,
then you need to tell vak to use that file.
To do so, you add the annot_file option in the [PREP] section
of your .toml configuration file, as shown below:
[PREP]
data_dir = "~/Documents/data/vocal/BFSongRepo-test-csv-format/gy6or6/032212"
output_dir = "./data/prep/train"
audio_format = "cbin"
annot_format = "generic-seq"
annot_file = "./data/annot/gy6or6.032212.annot.csv"
labelset = "iabcdefghjk"
train_dur = 50
val_dur = 15