Automatically classifying the content of sound files using ML

Following on from yesterday’s extraction of old sound effects, I quickly realised I needed an easier way to search them, since they came out of Director as unlabelled, numbered files. I can use QuickLook or a media player to quickly audition them, but how could I easily find the sample that contains the sound of running water, or a horse trotting?

I wondered if there was a way of using Machine Learning (ML) to automatically categorise sounds. It seemed like something that should be possible, especially given the recent explosion in “AI” (really: ML) tools. I quickly found Google’s AudioSet, which sounded like the perfect dataset:

AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.

But the dataset is only half of the solution. You need to use the dataset to train a model and then run that model against your own data to get the required results. Thankfully, I found YAMNet:

YAMNet is a deep net that predicts ~521 audio event classes from the AudioSet-YouTube corpus it was trained on.

I guess YAMNet is lagging behind AudioSet in terms of total categories, but it is good enough for me. Here is a list of all the classes of sounds it can recognise.

Let’s go

I used the script described in this tutorial as a starting point. I’m not a regular Python user, but after using pip to install tensorflow (along with any other missing imports) it… just worked.

Getting your files in order

According to the documentation, all sound files need to be at a sample rate of 16,000 Hz. After getting some classification results of “Silence”, I realised they also need to be at 16-bit resolution. So I ran a quick sox command to create compliant copies of all my sounds, which I’ll delete when I’m done. Notice that I decided to trim the sounds to a maximum length of 3 seconds: this speeds things up, and most sounds can still be recognised from such a short opening section.

find . -iname "*.wav" -exec sox {} -c 1 -r 16000 -b 16 {}_16k.wav trim 0 00:03 \;
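
Breaking that down: -c 1 downmixes to mono, -r 16000 resamples to 16 kHz, -b 16 sets the bit depth, and trim 0 00:03 keeps only the first three seconds of each file. The copies are written alongside the originals with a _16k.wav suffix.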

Optimisation

Running the classifier happens at roughly real time, a few seconds per sound, but I noticed that it was leaving a lot of my CPU unused. This struck me as a prime candidate for parallelisation, which is pretty easy on the command line. I used the parallel command to scale the classification up to all 10 cores of the M1 Pro CPU in my 2021 MacBook Pro.

find . -iname "*_16k.wav" -exec parallel python3 classify.py {} ::: {} \+
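
This works because find’s + terminator substitutes only the final {} with the list of matched files, which all land after parallel’s ::: separator as job arguments; the first {} passes through untouched for parallel to fill in with each filename. At least, that’s the behaviour of macOS’s find; GNU find may refuse a second {} in the + form.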

As I type, my computer is making short work of the task, whilst remaining perfectly responsive, if a little warm. Final speed for me is one sound every ~0.85 seconds.

Python Script
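
Here is a minimal sketch of classify.py, following the tutorial above (pip will need to pull in tensorflow, tensorflow_hub, and scipy). The semicolon-separated output is an assumption on my part, chosen so that redirecting stdout produces the metadata file imported below.

# classify.py — minimal sketch based on the TensorFlow Hub YAMNet tutorial.
# Assumption: input is already mono, 16-bit, 16 kHz (see the sox step above).
import csv
import sys
import tensorflow as tf
import tensorflow_hub as hub
from scipy.io import wavfile

# Load the pre-trained YAMNet model from TensorFlow Hub.
model = hub.load('https://tfhub.dev/google/yamnet/1')

# The model ships with a CSV mapping class indices to display names.
class_map_path = model.class_map_path().numpy()
with tf.io.gfile.GFile(class_map_path) as f:
    class_names = [row['display_name'] for row in csv.DictReader(f)]

wav_path = sys.argv[1]

# YAMNet expects float32 samples in the range [-1.0, 1.0].
sample_rate, wav_data = wavfile.read(wav_path)
waveform = wav_data / tf.int16.max

# scores has one row of 521 class scores per ~0.5 s frame of audio.
scores, embeddings, spectrogram = model(waveform)

# Average the frame scores and report the top class, semicolon-separated
# so the output can be collected straight into the metadata file below.
top_class = class_names[scores.numpy().mean(axis=0).argmax()]
print(f'{wav_path};{top_class}')

Running it per file and appending stdout to a single .csv (the name is up to you) gives the import file for the next step.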

Creating my SFX Library

Soundly is a sort of iTunes for sound effects: an app that enables easy, automatic organisation of files, quick searching of metadata, painless playback/auditioning, non-destructive edits, and simple exporting of the final sounds. The free version allows a local library of 10,000 files, which is more than enough for my usage. I’m not affiliated with them in any way; besides the free version, they offer a 1-month free trial of their paid version.

When you add your local folder of files, Soundly lets you import a (semicolon-separated) .csv file containing additional metadata. It’s here that I point it at the file generated by the classifier. The categories are imported as each sound’s description, and can be searched. Perfect!
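
For illustration, the generated file looks something like this (hypothetical filenames, and the exact header names Soundly expects may differ from mine):

Filename;Description
0001.wav_16k.wav;Water tap, faucet
0002.wav_16k.wav;Horse, Clip-clop

The semicolon separation matters because many AudioSet class names, like “Water tap, faucet”, themselves contain commas.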

--
Originally published: 2023-08-13