Automatic Lyrics Transcription from Single Audio Channel Homophonic Music Recordings
Gerardo Roa Dabike
Presentation Outline
Overview of the project
Stage 1
Stage 2
Plan next 18 months
Overview of the project
The Task
To recognise sung speech from accompanied singing.
Overview of the project
Aims
To automatically recognise sung speech from a homophonic single-channel audio recording,
by adapting robust systems for typical speech and taking advantage of the musical prosody
information of both the background accompaniment and the sung speech.
Overview of the project
Breaking down the task
The project was split into three components or stages.
Unaccompanied singing recognition.
Audio source separation and singer enhancement.
Jointly optimise the singer enhancement with the acoustic model.
Overview of the project
Research Questions - Stage 1
Using state-of-the-art ASR systems for spoken speech, is it possible to construct a robust and fair DNN ASR baseline system for an unaccompanied singing scenario?
Given a baseline system for an unaccompanied singing scenario, can the performance of the system be improved by incorporating musically motivated features?
Overview of the project
Research Questions - Stage 2
Given that suitable training databases are small by modern ASR standards, how can a singing enhancement DNN approach be trained to obtain high-quality singing segments suitable for the ASR task?
Can musically motivated features be used to increase the performance of the DNN singer enhancement model?
Can the singing enhancement model also be extended to recover the background accompaniment?
Given the background accompaniment, are there useful musically motivated features that can be exploited for acoustic modelling?
Overview of the project
Research Questions - Stage 3
Given the source separation and the ASR system, can they be joined in a single system?
Is there an advantage to jointly optimising the separation and ASR stages, compared to training them as separate systems?
Presentation Outline
Overview of the project
Stage 1
Stage 2
Plan next 18 months
Stage 1
This stage is considered complete!
Two papers:
Roa Dabike, Gerardo and Barker, Jon. "Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System". In Proc. Interspeech. 2019.
Roa Dabike, Gerardo and Barker, Jon. "The Use of Voice Source Features for Sung Speech Recognition". To be submitted to Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2021.
Presentation Outline
Overview of the project
Stage 1
Stage 2
Plan next 18 months
Stage 2
Plan proposed at the 24-month panel
Baseline replication.
Audio source separation (ASS) experiments.
Musically motivated features for ASS.
Beat information for the acoustic model (AM).
Stage 2
Dataset
DAMP Vocal Separation (DAMP-VSEP)
41,000 30-second segments - 20,800 solo ensemble.
155 countries and 36 languages - 8,500 English performances.
6,456 singers.
11,494 song arrangements.
Stage 2
Dataset
Dataset        Utt     Size
English train  9243    77 hrs
Solo train     20660   174 hrs
Validation     100     0.8 hrs
Evaluation     100     0.8 hrs
Stage 2
DAMP-VSEP challenges
Different background tracks can be perceptually identical.
Stage 2
DAMP-VSEP challenges
The mixture includes non-linear effects.
The vocal energy is rescaled in the mixture.
[Audio examples: mixture, vocal, background]
Stage 2
DAMP-VSEP challenges
The vocal is time-shifted relative to the mixture.
Some vocal is mixed into the background track.
[Audio examples: mixture, vocal, background]
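One consequence of this shift is that the reference vocal must be re-aligned with the mixture before training or scoring. A minimal sketch of lag estimation via cross-correlation (all names here are illustrative, not from the project code):

```python
import numpy as np

def estimate_shift(vocal, mixture):
    """Estimate the lag (in samples) that best aligns the
    vocal with the mixture via cross-correlation."""
    corr = np.correlate(mixture, vocal, mode="full")
    # Index 0 of "full" mode corresponds to a lag of -(len(vocal) - 1).
    return int(np.argmax(corr)) - (len(vocal) - 1)

# Toy check: a vocal delayed by 50 samples inside a noisy mixture.
rng = np.random.default_rng(0)
v = rng.standard_normal(1000)
mix = np.concatenate([np.zeros(50), v]) + 0.1 * rng.standard_normal(1050)
print(estimate_shift(v, mix))  # 50
```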
Stage 2
Model Architecture
Conv-TasNet (convolutional time-domain audio separation network)
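Conv-TasNet separates directly in the time domain: a learned 1-D conv encoder, a temporal convolutional mask estimator, and a decoder. The toy sketch below illustrates only that encode-mask-decode structure, with a hand-crafted energy mask standing in for the learned network:

```python
import numpy as np

def separate(mixture, frame=16):
    """Toy encode-mask-decode separation (not a trained model)."""
    # "Encoder": frame the waveform (stand-in for a learned conv basis).
    n = len(mixture) // frame
    frames = mixture[:n * frame].reshape(n, frame)
    # "Mask network": assign high-energy frames to the vocal source.
    energy = (frames ** 2).sum(axis=1)
    mask = (energy > energy.mean()).astype(float)[:, None]
    vocal = frames * mask
    background = frames * (1 - mask)
    # "Decoder": flatten frames back to waveforms.
    return vocal.ravel(), background.ravel()

# Masks are complementary, so the estimates sum back to the mixture.
mix = np.linspace(-1.0, 1.0, 64)
voc, bg = separate(mix, frame=8)
```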
Stage 2
Preliminary Results
English train set
Exp  Mixture  M   R  SDR (V/B)   SI-SDR (V/B)  STOI
1    Remix    8   4  6.13/13.70  4.84/12.85    0.5364
2    Remix    10  4  6.24/13.73  5.05/12.83    0.5317
[Audio examples: vocal, remix, Exp 1 output, Exp 2 output]
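SDR and SI-SDR above are reported in dB (vocal/background). A minimal sketch of scale-invariant SDR, for reference (illustrative, not the evaluation code used here):

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB: project the estimate onto the
    reference, then compare target energy to residual energy."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10 * np.log10((target @ target) / (residual @ residual + eps))

# A rescaled copy of the reference scores very high (scale invariance).
ref = np.sin(np.linspace(0, 100, 16000))
print(si_sdr(0.5 * ref, ref))  # large positive dB
```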
Stage 2
Preliminary Results
Solo ensemble train set
Exp  Mixture   M   R  SDR (V/B)    SI-SDR (V/B)  STOI
3    Remix     10  4  15.02/14.76  14.14/14.48   0.6808
4    Original  10  4  -8.44/11.71  -23.42/-2.33  0.2750
[Audio examples: vocal, remix, original mixture, Exp 3 output, Exp 4 output]
Stage 2
Preliminary Results
Composite Loss function
Exp  Alpha  SDR (V/B)   SI-SDR (V/B)  STOI
5    0.1    2.38/13.62  -15.92/-2.04  0.6530
6    0.5    2.85/11.81  -3.67/-1.62   0.6401
Mixture=Original, M=10, R=4
[Audio examples: vocal, mixture, Exp 5 output, Exp 6 output]
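A composite loss of this kind can be read as an alpha-weighted sum of per-source losses, e.g. negative SI-SDR on the vocal and on the background. The sketch below is an assumed formulation for illustration; the function names are not from the project code:

```python
import numpy as np

def negative_si_sdr(est, ref, eps=1e-8):
    """Negative SI-SDR (dB), usable as a loss to minimise."""
    proj = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = proj * ref
    residual = est - target
    return -10 * np.log10((target @ target + eps) / (residual @ residual + eps))

def composite_loss(vocal_est, vocal_ref, bg_est, bg_ref, alpha=0.5):
    """Alpha trades vocal quality against background reconstruction:
    alpha=1 trains on the vocal only, alpha=0 on the background only."""
    return (alpha * negative_si_sdr(vocal_est, vocal_ref)
            + (1 - alpha) * negative_si_sdr(bg_est, bg_ref))
```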
Stage 2
ASR evaluation
Sung speech  ASS Exp      DAMP-VSEP  DALI
Reference    Clean Vocal  23.83      -
Reference    Mixture      63.98      89.81
Enhanced     Exp 2        49.90      79.83
Enhanced     Exp 5        68.93      89.35
DALI is a collection of polyphonic songs sourced from YouTube with synchronised audio, lyrics and notes.
Roa Dabike, Gerardo and Barker, Jon. "The Sheffield University System for the MIREX 2020: Lyrics Transcription Task". Submitted to the International Society for Music Information Retrieval (ISMIR). 2020.
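Assuming the scores in the table are word error rates (WER, %), as is standard for ASR evaluation, a minimal WER sketch via Levenshtein distance over words (illustrative only):

```python
def wer(reference, hypothesis):
    """Word error rate (%): word-level edit distance divided by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("the sun will shine", "the son will shine"))  # 25.0
```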
Presentation Outline
Overview of the project
Stage 1
Stage 2
Plan next 18 months
Plan next 18 months
Musically Motivated Features
The use of instrument embedding features.
Train a convolutional autoencoder for instrument classification.
Use embedding features to inform the model about the musical accompaniment.