Automatic Lyrics Transcription from Single Audio Channel Homophonic Music Recordings
Gerardo Roa Dabike

Presentation Outline


  1. Overview of the project
  2. Stage 1
  3. Stage 2
  4. Plan for the next 18 months

Overview of the project

The Task

To recognise sung speech from accompanied singing.

Overview of the project

Aims

To automatically recognise sung speech from a homophonic single-channel audio recording, by adapting robust systems designed for typical speech and taking advantage of the musical prosody information in both the background accompaniment and the sung speech.

Overview of the project

Breaking down the task

The project was split into three components or stages.
  1. Unaccompanied singing recognition.
  2. Audio source separation and singer enhancement.
  3. Joint optimisation of the singer enhancement with the acoustic model.

Overview of the project

Research Questions - Stage 1

  1. Using state-of-the-art ASR systems for spoken speech, is it possible to construct a robust and fair DNN ASR baseline system for an unaccompanied singing scenario?
  2. Given a baseline system for an unaccompanied singing scenario, can the performance of the system be improved by incorporating musically motivated features?

Overview of the project

Research Questions - Stage 2

  1. Given that suitable training databases are small by modern ASR standards, how can a singing enhancement DNN approach be trained to obtain high-quality singing segments suitable for an ASR task?
  2. Can musically motivated features be used to improve the performance of the DNN singer enhancement model?
  3. Can the singing enhancement model also be extended to recover the background accompaniment?
  4. Given the background accompaniment, are there useful musically motivated features that can be exploited for acoustic modelling?

Overview of the project

Research Questions - Stage 3

  1. Given the source separation and ASR systems, can they be combined into a single system?
  2. Is there an advantage to jointly optimising the separation and ASR stages, compared to training them as separate systems?

Presentation Outline


  1. Overview of the project
  2. Stage 1
  3. Stage 2
  4. Plan for the next 18 months

Stage 1

This stage is considered complete!

  • Two papers:

    • Roa Dabike, Gerardo and Barker, Jon. "Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System". In Proc. Interspeech. 2019.

    • Roa Dabike, Gerardo and Barker, Jon. "The Use of Voice Source Features for Sung Speech Recognition". To be submitted to Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2021.

Presentation Outline


  1. Overview of the project
  2. Stage 1
  3. Stage 2
  4. Plan for the next 18 months

Stage 2

Plan proposed at the 24-month panel

  • Baseline replication.
  • Audio source separation (ASS) experiments.
  • Musically motivated features for ASS.
  • Beat information for the acoustic model (AM).

Stage 2

Dataset

  • DAMP Vocal Separation (DAMP-VSEP)
    • 41,000 30-second segments (20,800 solo ensemble).
    • 155 countries and 36 languages (8,500 English performances).
    • 6,456 singers.
    • 11,494 song arrangements.

Stage 2

Dataset

Dataset        Utterances  Size
English train  9,243       77 hrs
Solo train     20,660      174 hrs
Validation     100         0.8 hrs
Evaluation     100         0.8 hrs

Stage 2

DAMP-VSEP challenges

Different backgrounds are perceptually identical.

Stage 2

DAMP-VSEP challenges



  • Mixtures include non-linear effects.
  • The vocal energy is rescaled in the mixture (a gain-estimation sketch follows).

[Audio examples: Mixture, Vocal, Background]
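
Because the vocal stem is rescaled before mixing, the mixing gain is unknown. A minimal sketch of one way to estimate it, assuming time-aligned, equal-length mono numpy arrays; this is illustrative, not the project's actual preprocessing, and a linear gain cannot capture the non-linear effects noted above:

    import numpy as np

    def estimate_vocal_gain(mixture: np.ndarray, vocal: np.ndarray) -> float:
        """Least-squares estimate of the gain g minimising ||mixture - g*vocal||^2."""
        # Closed-form solution: g = <mixture, vocal> / <vocal, vocal>
        return float(np.dot(mixture, vocal) / np.dot(vocal, vocal))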

Stage 2

DAMP-VSEP challenges



  • The vocal is time-shifted relative to the mixture (a lag-estimation sketch follows).
  • The vocal is also mixed into the background track.

[Audio examples: Mixture, Vocal, Background]
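
A standard way to estimate the unknown vocal/mixture offset is the peak of the cross-correlation. A sketch assuming mono numpy arrays at the same sample rate; illustrative only, not the project's actual alignment step:

    import numpy as np
    from scipy.signal import correlate

    def estimate_lag(mixture: np.ndarray, vocal: np.ndarray) -> int:
        """Lag of the vocal relative to the mixture, in samples."""
        # The peak of the full cross-correlation gives the best alignment.
        xcorr = correlate(mixture, vocal, mode="full")
        return int(np.argmax(np.abs(xcorr))) - (len(vocal) - 1)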

Stage 2

Model Architecture

Convolutional TasNet
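
A structural sketch of a Conv-TasNet-style separator in PyTorch: a learned 1-D conv encoder, a stack of dilated residual blocks predicting one mask per source, and a transposed-conv decoder. All hyperparameters here are illustrative; the M and R columns in the result tables presumably control the separator depth (blocks per repeat, number of repeats):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """One dilated, depthwise-separable conv block of the TCN separator."""
        def __init__(self, channels: int, hidden: int, dilation: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(channels, hidden, 1),
                nn.PReLU(),
                # Depthwise conv; padding keeps the frame count unchanged.
                nn.Conv1d(hidden, hidden, 3, padding=dilation,
                          dilation=dilation, groups=hidden),
                nn.PReLU(),
                nn.Conv1d(hidden, channels, 1),
            )

        def forward(self, x):
            return x + self.net(x)  # residual connection

    class ConvTasNetSketch(nn.Module):
        """Encoder -> masking TCN -> decoder, for 2 sources (vocal/background)."""
        def __init__(self, n_src=2, n_filters=256, kernel=20, blocks=8, repeats=4):
            super().__init__()
            self.n_src, self.n_filters = n_src, n_filters
            self.encoder = nn.Conv1d(1, n_filters, kernel, stride=kernel // 2)
            tcn = [ResidualBlock(n_filters, 512, dilation=2 ** b)
                   for _ in range(repeats) for b in range(blocks)]
            self.separator = nn.Sequential(
                *tcn, nn.Conv1d(n_filters, n_src * n_filters, 1))
            self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel,
                                              stride=kernel // 2)

        def forward(self, mix):  # mix: (batch, 1, samples)
            w = torch.relu(self.encoder(mix))              # (batch, N, frames)
            m = torch.sigmoid(self.separator(w))
            m = m.view(-1, self.n_src, self.n_filters, w.shape[-1])
            # One mask per source; decode each masked representation.
            return torch.stack(
                [self.decoder(w * m[:, s]) for s in range(self.n_src)], dim=1)

For example, ConvTasNetSketch()(torch.randn(1, 1, 16000)) returns a (1, 2, 1, samples) tensor holding the vocal and background estimates.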

Stage 2

Preliminary Results

English train set

Exp  Mixture  M   R  SDR (V/B)   SI-SDR (V/B)  STOI
1    Remix    8   4  6.13/13.70  4.84/12.85    0.5364
2    Remix    10  4  6.24/13.73  5.05/12.83    0.5317

V/B: vocal/background, in dB. (SI-SDR is sketched below.)

[Audio examples: Vocal, Remix, Exp 1, Exp 2]
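
For reference, the scale-invariant SDR reported above follows the standard definition (Le Roux et al., 2019); a minimal numpy sketch, with argument names chosen here for illustration:

    import numpy as np

    def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
        """Scale-invariant signal-to-distortion ratio in dB."""
        # Project the estimate onto the reference to obtain the scaled target.
        scale = np.dot(estimate, reference) / np.dot(reference, reference)
        target = scale * reference
        noise = estimate - target
        return float(10 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2)))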

Stage 2

Preliminary Results

Solo ensemble train set

Exp  Mixture   M   R  SDR (V/B)    SI-SDR (V/B)  STOI
3    Remix     10  4  15.02/14.76  14.14/14.48   0.6808
4    Original  10  4  -8.44/11.71  -23.42/-2.33  0.2750

[Audio examples: Vocal, Remix, Mixture, Exp 3, Exp 4]

Stage 2

Preliminary Results

Composite Loss function

Exp  Alpha  SDR (V/B)   SI-SDR (V/B)  STOI
5    0.1    2.38/13.62  -15.92/-2.04  0.6530
6    0.5    2.85/11.81  -3.67/-1.62   0.6401

Mixture=Original, M=10, R=4. (A sketch of a composite loss of this form follows.)

[Audio examples: Vocal, Mixture, Exp 5, Exp 6]
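
The exact composite loss is not reproduced here. A plausible sketch, assuming it interpolates negative SI-SDR terms for the vocal and background estimates using the Alpha from the table; the function names and the 1e-8 stabilisers are assumptions:

    import torch

    def neg_si_sdr(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        """Negative SI-SDR in dB (lower is better), averaged over the batch."""
        scale = (est * ref).sum(-1, keepdim=True) / (ref * ref).sum(-1, keepdim=True)
        target = scale * ref
        noise = est - target
        ratio = (target ** 2).sum(-1) / ((noise ** 2).sum(-1) + 1e-8)
        return -10 * torch.log10(ratio + 1e-8).mean()

    def composite_loss(est_vocal, vocal, est_back, back, alpha: float = 0.5):
        # alpha trades off vocal quality against background reconstruction.
        return alpha * neg_si_sdr(est_vocal, vocal) \
            + (1 - alpha) * neg_si_sdr(est_back, back)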

Stage 2

ASR evaluation

Sung speech  ASS Exp      DAMP-VSEP  DALI
Reference    Clean Vocal  23.83      -
             Mixture      63.98      89.81
Enhanced     Exp 2        49.90      79.83
             Exp 5        68.93      89.35

DALI is a collection of polyphonic songs sourced from YouTube with synchronised audio, lyrics and notes. (The evaluation metric is sketched after the citation below.)

  • Roa Dabike, Gerardo and Barker, Jon. "The Sheffield University System for the MIREX 2020: Lyrics Transcription Task". Submitted to International Society for Music Information Retrieval (ISMIR). 2020.
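
The scores in the table are presumably word error rates (%). For reference, WER is the word-level edit distance normalised by the reference length; a minimal sketch:

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate (%): word-level edit distance / reference word count."""
        ref, hyp = reference.split(), hypothesis.split()
        # Standard dynamic-programming edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return 100.0 * d[-1][-1] / len(ref)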

Presentation Outline


  1. Overview of the project
  2. Stage 1
  3. Stage 2
  4. Plan for the next 18 months

Plan for the next 18 months

Musically Motivated Features

The use of instrument embedding features.



  • Train a convolutional autoencoder for instrument classification (a sketch follows this list).
  • Use the embedding features to inform the model about the musical accompaniment.
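
A hypothetical sketch of such an embedding extractor, assuming 80x100 log-mel patches as input; the layer sizes, patch shape and embedding dimension are illustrative assumptions, not the planned configuration:

    import torch
    import torch.nn as nn

    class InstrumentAutoencoder(nn.Module):
        """Conv autoencoder over log-mel patches; the bottleneck vector is the
        instrument embedding to be fed to the acoustic model."""
        def __init__(self, emb_dim: int = 64):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # -> (16, 40, 50)
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # -> (32, 20, 25)
                nn.Flatten(),
                nn.Linear(32 * 20 * 25, emb_dim),                      # bottleneck
            )
            self.decoder = nn.Sequential(
                nn.Linear(emb_dim, 32 * 20 * 25), nn.ReLU(),
                nn.Unflatten(1, (32, 20, 25)),
                nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                                   output_padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                                   output_padding=1),
            )

        def forward(self, mel):      # mel: (batch, 1, 80, 100)
            z = self.encoder(mel)    # embedding, shape (batch, emb_dim)
            return self.decoder(z), z

Trained with a reconstruction loss (and/or an auxiliary instrument-classification head), the embedding z would be appended to the acoustic features as an auxiliary AM input.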

Plan for the next 18 months

Jointly optimised system
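
The diagram for this slide is not reproduced here. As a rough illustration only, a hypothetical joint fine-tuning step in which the ASR loss back-propagates through the enhancer; every name (enhancer, asr_model, asr_loss_fn, sep_loss_fn) is a placeholder, not project code:

    import torch

    def joint_step(enhancer, asr_model, asr_loss_fn, optimiser,
                   mixture, targets,
                   sep_loss_fn=None, clean_vocal=None, beta=0.1):
        """One optimisation step over the enhancer and acoustic model jointly."""
        optimiser.zero_grad()
        est_vocal = enhancer(mixture)                      # separated sung speech
        loss = asr_loss_fn(asr_model(est_vocal), targets)  # e.g. a CTC-style loss
        if sep_loss_fn is not None:                        # optional multi-task term
            loss = loss + beta * sep_loss_fn(est_vocal, clean_vocal)
        loss.backward()                                    # gradients reach the enhancer
        optimiser.step()
        return float(loss)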


Plan for the next 18 months

Gantt Chart

Review 24-month panel recommendations

  • Improve explanation of ideas and writing skills.
  • Attention mechanisms.
  • Listening experiments.




Thank you!!