Aligning Orchestral Music

Below we show how to use this library to align some of the audio used in the paper. For this notebook to work, the audio requirements must be installed (pip install -r requirements_audio.txt at the root of the repository), and the “short” orchestral pieces in the benchmark must already have been downloaded using the script in experiments/orchestral.py. We do not provide the pieces here for copyright reasons.

Audio Example 1: Vivaldi’s Spring

The first step is to load the audio.

[1]:

import linmdtw
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
import warnings
warnings.filterwarnings("ignore")
import IPython.display as ipd

sr = 44100
x0_0, sr = linmdtw.load_audio("../experiments/OrchestralPieces/Short/0_0.mp3", sr)
x0_1, sr = linmdtw.load_audio("../experiments/OrchestralPieces/Short/0_1.mp3", sr)

Next, we’ll compute the “MFCC mod” features for each audio clip, as described in [1].

[1] Gadermaier, Thassilo, and Gerhard Widmer. “A Study of Annotation and Alignment Accuracy for Performance Comparison in Complex Orchestral Music.” arXiv preprint arXiv:1910.07394 (2019).

[2]:

hop_length = 512
X0_0 = linmdtw.get_mfcc_mod(x0_0, sr, hop_length)
X0_1 = linmdtw.get_mfcc_mod(x0_1, sr, hop_length)
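As a rough guide to what this hop length means in time, the sketch below converts between feature frames and seconds. It assumes one feature frame per hop of hop_length samples, which is an assumption about get_mfcc_mod rather than something stated above; frames_to_seconds and approx_num_frames are hypothetical helpers, not part of linmdtw.

```python
sr = 44100        # sample rate used in this notebook
hop_length = 512  # hop size used in this notebook

def frames_to_seconds(frame_idx, sr, hop_length):
    """Approximate start time, in seconds, of a feature frame
    (assumes one frame per hop)."""
    return frame_idx * hop_length / sr

def approx_num_frames(duration_seconds, sr, hop_length):
    """Approximate frame count for a clip, ignoring edge padding."""
    return int(duration_seconds * sr // hop_length)

frame_period_ms = 1000 * frames_to_seconds(1, sr, hop_length)
print(f"{frame_period_ms:.2f} ms per frame")
print(approx_num_frames(40, sr, hop_length), "frames in 40 seconds")
```

Under these assumptions, each step along a warping path corresponds to roughly 11.6 ms of audio.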

Now we can extract a warping path between the two audio streams using the main DTW library.

[3]:

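import time
# Passing metadata runs DTW in verbose mode, printing progress as it goes
metadata = {'totalCells':0, 'M':X0_0.shape[0], 'N':X0_1.shape[0],
            'timeStart':time.time(), 'perc':10}
# do_gpu=True requests the GPU implementation; element [1] of the result is the warping path
path0 = linmdtw.linmdtw(X0_0, X0_1, do_gpu=True, metadata=metadata)[1]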

Parallel Alignment 10.0% Elapsed time: 6.77
Parallel Alignment 20.0% Elapsed time: 13.5
Parallel Alignment 30.0% Elapsed time: 20.3
Parallel Alignment 40.0% Elapsed time: 27
Parallel Alignment 50.0% Elapsed time: 33.9
Parallel Alignment 60.0% Elapsed time: 40.7
Parallel Alignment 70.0% Elapsed time: 48.3
Parallel Alignment 80.0% Elapsed time: 55.8
Parallel Alignment 90.0% Elapsed time: 62.7
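To make the notion of a warping path concrete, here is a minimal sketch of textbook DTW on two tiny sequences. It is not the algorithm this library uses: it fills the entire accumulated-cost table, so it only suits short inputs. The function name dtw_full and the choice of Euclidean distance between frames are assumptions made for illustration.

```python
import numpy as np

def dtw_full(X, Y):
    """Textbook DTW between point sequences X (M x d) and Y (N x d).

    Returns (cost, path), where path is a K x 2 array of aligned
    (row in X, row in Y) index pairs.
    """
    M, N = X.shape[0], Y.shape[0]
    # Pairwise Euclidean distances between all frames
    D = np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2))
    # Accumulated cost, padded with an extra row/column of infinities
    S = np.full((M + 1, N + 1), np.inf)
    S[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            S[i, j] = D[i - 1, j - 1] + min(S[i - 1, j - 1], S[i - 1, j], S[i, j - 1])
    # Backtrace from the last cell to the first
    i, j = M, N
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: S[s])
        path.append((i - 1, j - 1))
    return S[M, N], np.array(path[::-1])

X = np.array([0.0, 1.0, 2.0, 3.0])[:, None]
Y = np.array([0.0, 0.0, 1.0, 2.0, 3.0])[:, None]
cost, path = dtw_full(X, Y)
print(cost)
print(path.tolist())
```

In this toy case the second sequence repeats its first value, so the path pairs frame 0 of X with frames 0 and 1 of Y and then advances diagonally, for a total cost of 0.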

Before we apply the computed warping path, let’s compare the first 40 seconds of the two audio clips side by side. We’ll put the first one in the left ear and the second one in the right ear. The one on the left goes faster than the one on the right, but it starts later. Because of this, they are in sync for a brief moment, but the left one then overtakes the right one for the rest of the clip.

[4]:

xunsync0 = np.zeros((sr*40, 2))
xunsync0[:, 0] = x0_0[0:sr*40]
xunsync0[:, 1] = x0_1[0:sr*40]
linmdtw.save_audio(xunsync0, sr, "unsync0")
ipd.Audio("unsync0.mp3")

ffmpeg version 4.2.4-1ubuntu0.1 Copyright (c) 2000-2020 the FFmpeg developers
  built with gcc 9 (Ubuntu 9.3.0-10ubuntu2)
  configuration: --prefix=/usr --extra-version=1ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 31.100 / 56. 31.100
  libavcodec     58. 54.100 / 58. 54.100
  libavformat    58. 29.100 / 58. 29.100
  libavdevice    58.  8.100 / 58.  8.100
  libavfilter     7. 57.100 /  7. 57.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  5.100 /  5.  5.100
  libswresample   3.  5.100 /  3.  5.100
  libpostproc    55.  5.100 / 55.  5.100
Guessed Channel Layout for Input Stream #0.0 : stereo
Input #0, wav, from 'unsync0.wav':
  Duration: 00:00:40.00, bitrate: 5644 kb/s
    Stream #0:0: Audio: pcm_f64le ([3][0][0][0] / 0x0003), 44100 Hz, stereo, dbl, 5644 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_f64le (native) -> mp3 (libmp3lame))
Press [q] to stop, [?] for help
Output #0, mp3, to 'unsync0.mp3':
  Metadata:
    TSSE            : Lavf58.29.100
    Stream #0:0: Audio: mp3 (libmp3lame), 44100 Hz, stereo, fltp
    Metadata:
      encoder         : Lavc58.54.100 libmp3lame
size=     626kB time=00:00:40.02 bitrate= 128.1kbits/s speed=50.7x
video:0kB audio:626kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.039486%

[4]:
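The left/right layout used above can be sketched with synthetic tones in place of the orchestral clips. The helper side_by_side is hypothetical (it is not part of linmdtw); unlike the direct slicing above, it zero-pads a clip that is shorter than the requested duration instead of raising an error.

```python
import numpy as np

def side_by_side(x_left, x_right, sr, seconds):
    """Place two mono signals in the left/right channels of a stereo array.

    Clips shorter than `seconds` are zero-padded.
    """
    n = int(sr * seconds)
    stereo = np.zeros((n, 2))
    n_left = min(n, len(x_left))
    n_right = min(n, len(x_right))
    stereo[:n_left, 0] = x_left[:n_left]
    stereo[:n_right, 1] = x_right[:n_right]
    return stereo

sr = 8000  # low sample rate to keep the toy example small
t = np.arange(sr * 2) / sr
tone_a = np.sin(2 * np.pi * 440 * t)       # 2 seconds long
tone_b = np.sin(2 * np.pi * 660 * t[:sr])  # only 1 second long
stereo = side_by_side(tone_a, tone_b, sr, seconds=2)
print(stereo.shape)
```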

Let’s now apply the computed warping path to see how the alignment went. This library wraps around the pyrubberband library, which we can use to stretch the first audio stream (x1) to match the second (x2) according to this warping path. The method stretch_audio returns a stereo audio stream with the stretched version of x1 in the left ear and the original version of x2 in the right ear. Let’s save the first 30 seconds of the aligned audio to disk and listen to it.

[5]:

xsync0 = linmdtw.stretch_audio(x0_0, x0_1, sr, path0, hop_length)
linmdtw.save_audio(xsync0[0:sr*30, :], sr, "sync0")
ipd.Audio("sync0.mp3")

Stretching...

Guessed Channel Layout for Input Stream #0.0 : stereo
Input #0, wav, from 'sync0.wav':
  Duration: 00:00:30.00, bitrate: 5644 kb/s
    Stream #0:0: Audio: pcm_f64le ([3][0][0][0] / 0x0003), 44100 Hz, stereo, dbl, 5644 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_f64le (native) -> mp3 (libmp3lame))
Press [q] to stop, [?] for help
Output #0, mp3, to 'sync0.mp3':
  Metadata:
    TSSE            : Lavf58.29.100
    Stream #0:0: Audio: mp3 (libmp3lame), 44100 Hz, stereo, fltp
    Metadata:
      encoder         : Lavc58.54.100 libmp3lame
size=     470kB time=00:00:30.01 bitrate= 128.2kbits/s speed=49.3x
video:0kB audio:469kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.052637%

[5]:
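To illustrate what stretching according to a warping path involves, the sketch below converts a path in frame indices into sample-level anchor points of the kind a time-map based stretcher could consume. This is an illustration under assumptions, not the library's implementation: path_to_time_map is a hypothetical helper, and each frame is assumed to start at frame_index * hop_length samples.

```python
import numpy as np

def path_to_time_map(path, hop_length, n_source, n_target):
    """Turn a warping path (K x 2 frame indices) into sample anchors.

    Returns a list of (source_sample, target_sample) pairs that is
    strictly increasing in both coordinates and ends at
    (n_source, n_target).
    """
    samples = np.asarray(path) * hop_length
    anchors = []
    for s, t in samples:
        # Keep only anchors that advance in both signals, so that
        # horizontal or vertical runs in the path do not repeat a sample
        if not anchors or (s > anchors[-1][0] and t > anchors[-1][1]):
            anchors.append((int(s), int(t)))
    # Drop trailing anchors that reach or pass the end, then pin the end
    while anchors and (anchors[-1][0] >= n_source or anchors[-1][1] >= n_target):
        anchors.pop()
    anchors.append((n_source, n_target))
    return anchors

path = np.array([[0, 0], [0, 1], [1, 2], [2, 3], [3, 4]])
time_map = path_to_time_map(path, hop_length=512, n_source=2048, n_target=2560)
print(time_map)
```

Each pair says which sample of the source clip should land on which sample of the target timeline; the stretcher then speeds up or slows down the audio between consecutive anchors.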

Audio Example 2: Schubert’s Unfinished Symphony

We now show one more example with Schubert’s Unfinished Symphony (short clip index 5 in the paper corpus). We align the entire audio streams (11 minutes, 30 seconds and 12 minutes, 47 seconds, respectively), and we pull out a 45-second clip of the result (from 0:45 to 1:30) to listen to.
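For a sense of scale, the following back-of-the-envelope calculation estimates how many cells a full DTW table would have for clips of these lengths. It assumes one feature frame per hop of 512 samples at 44100 Hz (the settings used in this notebook) and 4-byte floats; both are assumptions for illustration, and the actual frame counts may differ slightly.

```python
sr = 44100
hop_length = 512
# Durations in seconds, taken from the text above
durations = {"5_0": 11 * 60 + 30, "5_1": 12 * 60 + 47}

# Approximate number of feature frames per clip
frames = {name: (secs * sr) // hop_length for name, secs in durations.items()}
cells = frames["5_0"] * frames["5_1"]
gigabytes = cells * 4 / 1e9  # assuming 4-byte floats

print(frames)
print(f"{cells:,} cells, about {gigabytes:.1f} GB as float32")
```

Storing a table of that size in full would take on the order of 16 GB under these assumptions, which is why memory use matters for alignments of this length.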

[6]:

## Step 1: Load in audio
sr = 44100
x5_0, sr = linmdtw.load_audio("../experiments/OrchestralPieces/Short/5_0.mp3", sr)
x5_1, sr = linmdtw.load_audio("../experiments/OrchestralPieces/Short/5_1.mp3", sr)
## Step 2: Compute Features
hop_length = 512
X5_0 = linmdtw.get_mfcc_mod(x5_0, sr, hop_length)
X5_1 = linmdtw.get_mfcc_mod(x5_1, sr, hop_length)

## Step 3: Run DTW in verbose mode
metadata = {'totalCells':0, 'M':X5_0.shape[0], 'N':X5_1.shape[0],
            'timeStart':time.time(), 'perc':10}
path5 = linmdtw.linmdtw(X5_0, X5_1, do_gpu=True, metadata=metadata)[1]

Parallel Alignment 10.0% Elapsed time: 101
Parallel Alignment 20.0% Elapsed time: 201
Parallel Alignment 30.0% Elapsed time: 299
Parallel Alignment 40.0% Elapsed time: 396
Parallel Alignment 50.0% Elapsed time: 493
Parallel Alignment 60.0% Elapsed time: 587
Parallel Alignment 70.0% Elapsed time: 681
Parallel Alignment 80.0% Elapsed time: 775
Parallel Alignment 90.0% Elapsed time: 868
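Before stretching, a computed path can be sanity-checked against the usual DTW constraints. The sketch below assumes the path is a K x 2 array of 0-based frame indices that starts at (0, 0), ends at (M-1, N-1), and moves by (1, 0), (0, 1), or (1, 1) at each step; these are the standard conditions and should be verified against the library's actual output. The function is_valid_warping_path is a hypothetical helper.

```python
import numpy as np

def is_valid_warping_path(path, M, N):
    """Check boundary, monotonicity, and step-size conditions of a path."""
    path = np.asarray(path)
    if path.ndim != 2 or path.shape[1] != 2 or len(path) == 0:
        return False
    starts_ok = tuple(path[0]) == (0, 0)
    ends_ok = tuple(path[-1]) == (M - 1, N - 1)
    steps = np.diff(path, axis=0)
    # Each step must be one of (1, 0), (0, 1), (1, 1)
    in_range = np.all((steps >= 0) & (steps <= 1))
    advances = np.all(steps.sum(axis=1) >= 1)
    return bool(starts_ok and ends_ok and in_range and advances)

good = np.array([[0, 0], [0, 1], [1, 2], [2, 3]])
jumpy = np.array([[0, 0], [2, 1], [2, 3]])
print(is_valid_warping_path(good, 3, 4))
print(is_valid_warping_path(jumpy, 3, 4))
```

With the variables from this notebook, the corresponding call would be is_valid_warping_path(path5, X5_0.shape[0], X5_1.shape[0]).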

[7]:

## Step 4: Synchronize audio and play the results
xsync5 = linmdtw.stretch_audio(x5_0, x5_1, sr, path5, hop_length)
linmdtw.save_audio(xsync5[sr*45:sr*90, :], sr, "sync5")
ipd.Audio("sync5.mp3")

Stretching...

Guessed Channel Layout for Input Stream #0.0 : stereo
Input #0, wav, from 'sync5.wav':
  Duration: 00:00:45.00, bitrate: 5644 kb/s
    Stream #0:0: Audio: pcm_f64le ([3][0][0][0] / 0x0003), 44100 Hz, stereo, dbl, 5644 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_f64le (native) -> mp3 (libmp3lame))
Press [q] to stop, [?] for help
Output #0, mp3, to 'sync5.mp3':
  Metadata:
    TSSE            : Lavf58.29.100
    Stream #0:0: Audio: mp3 (libmp3lame), 44100 Hz, stereo, fltp
    Metadata:
      encoder         : Lavc58.54.100 libmp3lame
size=     704kB time=00:00:45.01 bitrate= 128.1kbits/s speed=46.9x
video:0kB audio:704kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.035112%

[7]:
