Task 1: headphones#

This tutorial walks through the process of running the Cadenza Challenge (CAD1) Task 1 baseline using the shell interface.

1. Cloning the Clarity Repository#

We first need to install the Clarity package.

# Clean directory
from pathlib import Path
import shutil

dirpath = Path('clarity')
if dirpath.exists() and dirpath.is_dir():
    shutil.rmtree(dirpath)
%%capture
!git clone --quiet https://github.com/claritychallenge/clarity.git
%cd clarity
!git checkout v0.3.3
%pip install -e .
%pip install seedir
%cd recipes/cad1/task1/baseline

2. Dataset#

We will be using a small demo subset of the music audio and listener metadata.

%%capture

import gdown

!gdown 10SfuZR7yVlVO6RwNUc3kPeJHGiwpN3VS
!tar -xvf cadenza_data_demo.tar.xz
!rm cadenza_data_demo.tar.xz
import seedir as sd
sd.seedir('cadenza_data_demo/cad1/task1/', style='lines', depthlimit=3)
task1/
├─metadata/
│ ├─listeners.valid.json
│ └─musdb18.valid.json
└─audio/
  └─musdb18hq/
    └─train/

3. Baseline#

Note

The baseline recipe is divided into two stages: enhancement and evaluation.

[Figure: Task 1 baseline system]

3.1 Enhancement#

The enhancement script, enhance.py, contains several key functions:

3.1.1 Enhance#

The main function that orchestrates the demixing and remixing.

  • It takes the configuration from config.yaml as its input parameter (a Hydra DictConfig).

@hydra.main(config_path="", config_name="config")
def enhance(config: DictConfig) -> None:
    """
    Run the music enhancement.
    The system decomposes the music into vocal, drums, bass, and other stems.
    Then, the NAL-R prescription procedure is applied to each stem.
    Args:
        config (dict): Dictionary of configuration options for enhancing music.

    Returns 8 stems for each song:
        - left channel vocal, drums, bass, and other stems
        - right channel vocal, drums, bass, and other stems
    """

3.1.2 Decompose Signal#

It takes a stereo music signal and uses a pretrained model to decompose the signal into eight stems (VDBO: vocals, drums, bass and other, for the left and right channels).

  • The baseline uses either the Demucs or the Open-Unmix model, according to the value of the separator.model parameter in config.yaml.

  • The function takes the left and right audiograms as input parameters. This is to show that one could train a model that uses this information to produce a personalised decomposition. A hedged usage sketch follows the function signature below.

def decompose_signal(
    model: torch.nn.Module,
    model_sample_rate: int,
    signal: np.ndarray,
    signal_sample_rate: int,
    device: torch.device,
    sources_list: list[str],
    left_audiogram: np.ndarray,
    right_audiogram: np.ndarray,
    normalise: bool = True,
) -> dict[str, np.ndarray]:
    """
    Decompose signal into 8 stems.

    The left and right audiograms are ignored by the baseline system as it
    is not performing a personalised decomposition.
    Instead, it performs a standard music decomposition using the
    HDEMUCS model trained on the MUSDB18 dataset.

    Args:
        model (torch.nn.Module): Torch model.
        model_sample_rate (int): Sample rate of the model.
        signal (np.ndarray): Signal to be decomposed.
        signal_sample_rate (int): Sample frequency.
        device (torch.device): Torch device to use for processing.
        sources_list (list): List of strings used to index dictionary.
        left_audiogram (np.ndarray): Left ear audiogram.
        right_audiogram (np.ndarray): Right ear audiogram.
        normalise (bool): Whether to normalise the signal.

     Returns:
         Dictionary: Indexed by sources, with the associated separated signals as values.
    """

3.1.3 Process Stems for Listeners#

This function takes the eight stems from the demixing and amplifies them to compensate for the listener’s hearing loss.

  • In the baseline, it applies the NAL-R prescription to each stem.

  • This function may become unnecessary if the separation model already performs a personalised decomposition, e.g., if the model applies the NAL-R amplification as part of the decomposition. A hedged sketch of calling this function follows the signature below.

def process_stems_for_listener(
    stems: dict,
    enhancer: NALR,
    compressor: Compressor,
    audiogram_left: np.ndarray,
    audiogram_right: np.ndarray,
    cfs: np.ndarray,
    apply_compressor: bool = False,
) -> dict:
    """Process the stems from sources.

    Args:
        stems (dict) : Dictionary of stems
        enhancer (NALR) : NAL-R prescription hearing aid
        compressor (Compressor) : Compressor
        audiogram_left (np.ndarray) : Left channel audiogram
        audiogram_right (np.ndarray) : Right channel audiogram
        cfs (np.ndarray) : Center frequencies
        apply_compressor (bool) : Whether to apply the compressor

    Returns:
        processed_sources (dict) : Dictionary of processed stems
    """

3.1.4 Remix Signal#

This function does the remixing.

  • The baseline simply adds all the processed stems.

  • This function can be modified to apply any approach you may want to explore (a sketch of one variation follows the function below). For example:

    • Change the level of the stems.

    • Rebalance the left and right channels.

def remix_signal(stems: dict) -> np.ndarray:
    """
    Function to remix signal. It takes the eight stems
    and combines them into a stereo signal.

    Args:
        stems (dict) : Dictionary of stems

    Returns:
        (np.ndarray) : Remixed signal

    """

3.1.5 Utility Functions#

These are a set of helper functions:

  • separate_sources : Performs the inference of the stems using any separation model.

  • map_to_dict : Reorganises the separated stems into a dictionary.

  • get_device : Returns the Torch device (‘cuda’ or ‘cpu’); a simplified sketch follows this list.
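
The device selection itself is simple. Below is a simplified sketch of the usual prefer-CUDA logic; the real helper in enhance.py may differ in signature and return value.

# A simplified sketch of "prefer CUDA, otherwise CPU" device selection; this is not
# the exact signature of get_device in enhance.py, just the idea behind it.
from typing import Optional

import torch

def pick_device(device: Optional[str] = None) -> torch.device:
    if device is not None:
        return torch.device(device)  # honour an explicit request, e.g. from config.yaml
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")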

3.2 Evaluation#

The evaluation script assumes that the exp output folder (path.exp_folder) exists. This is the output directory produced by the enhance.py script.

Caution

The evaluation script should not be modified or altered in any way.

3.2.1 Run calculate aq#

The main function that orchestrates the evaluation.

@hydra.main(config_path="", config_name="config")
def run_calculate_aq(config: DictConfig) -> None:
    """Evaluate the enhanced signals using the HAAQI metric."""

3.2.2 Set Song Seed#

Function to seed the random components of the HAAQI (Hearing Aid Audio Quality Index) score. This ensures a fair comparison between different systems, as the seed depends on the song and not on the order in which songs are processed.

def set_song_seed(song: str) -> None:
    """Set a seed that is unique for the given song"""
    song_encoded = hashlib.md5(song.encode("utf-8")).hexdigest()
    song_md5 = int(song_encoded, 16) % (10**8)
    np.random.seed(song_md5)
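
For example, seeding on the song name means that two different systems evaluating the same song draw identical random values, regardless of how many songs they processed beforehand. A small, hypothetical usage:

# Hypothetical usage: the seed depends only on the song name, so this draw is
# reproducible across runs and across systems. Assumes the working directory is
# the baseline folder, where evaluate.py lives.
import numpy as np

from evaluate import set_song_seed

set_song_seed("Actions - One Minute Smile")
print(np.random.randint(0, 10**6))  # identical every time for this song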

3.2.3 Evaluate Song Listener#

Computes the evaluation of a single song-listener pair.

def _evaluate_song_listener(
    song: str,
    listener: str,
    config: DictConfig,
    split_dir: str,
    listener_audiograms: dict,
    enhanced_folder: Path,
) -> tuple[float, dict]:
    """Evaluate a single song-listener pair

    Args:
        song (str): The name of the song to evaluate.
        listener (str): The name of the listener to evaluate.
        config (DictConfig): The configuration object.
        split_dir (str): The name of the split directory.
        listener_audiograms (dict): A dictionary of audiograms for each listener.
        enhanced_folder (Path): The path to the folder containing the enhanced signals.

    Returns:
        combined_score (float): The combined score for the result.
        per_instrument_score (dict): A dictionary of scores for each
            instrument channel in the result.

    """

4. Inspecting Existing Configuration#

All of the included Python scripts take configurable variables from the yaml files in the same directory as the script. Typically these are named config.yaml; however, other names may be used if more than one script is in a directory.

We can inspect the contents of the config file:

path:
  root: ../../cadenza_data_demo/cad1/task1
  metadata_dir: ${path.root}/metadata
  music_dir: ${path.root}/audio/musdb18hq
  music_train_file: ${path.metadata_dir}/musdb18.train.json
  music_valid_file: ${path.metadata_dir}/musdb18.valid.json
  listeners_train_file: ${path.metadata_dir}/listeners.train.json
  listeners_valid_file: ${path.metadata_dir}/listeners.valid.json
  exp_folder: ./exp_${separator.model} # folder to store enhanced signals and final results

sample_rate: 44100       # sample rate of the input mixture
stem_sample_rate: 24000  # sample rate output stems
remix_sample_rate: 32000 # sample rate for output remixed signal

nalr:
  nfir: 220
  fs: ${sample_rate}

apply_compressor: False
compressor:
  threshold: 0.35
  attenuation: 0.1
  attack: 50
  release: 1000
  rms_buffer_size: 0.064

soft_clip: True

separator:
  model: demucs   # demucs or openunmix
  device: ~

evaluate:
  set_random_seed: True
  small_test: False
  batch_size: 1  # Number of batches
  batch: 0       # Batch number to evaluate

# hydra config
hydra:
  run:
    dir: ${path.exp_folder}
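
The ${...} entries are OmegaConf interpolations that Hydra resolves at run time. If you just want to inspect the resolved values, a small sketch using OmegaConf directly is shown below; it assumes your working directory is the baseline folder containing config.yaml.

# A small sketch for inspecting the configuration outside of Hydra; assumes the
# current directory is recipes/cad1/task1/baseline.
from omegaconf import OmegaConf

config = OmegaConf.load("config.yaml")
print(config.separator.model)                   # demucs
print(config.path.music_dir)                    # ${path.root} is resolved on access
print(OmegaConf.to_yaml(config, resolve=True))  # the full config with interpolations expanded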

The general organisation of the config files is hierarchical, with property labels depending on the script in question. The config file for the enhance and evaluate recipes contains configurable parameters for both scripts. These include:

  • Paths for the locations of audio files, metadata and the export location for generated files

  • Parameters for the NAL-R fitting (a linear amplification prescription used for hearing aids)

  • Parameters for the automatic gain control (AGC) compressor used in the baseline enhancer

  • Parameters for the challenge evaluator

  • Parameters necessary for Hydra to run

The path.root parameter defaults to the root of the baseline and must be overridden with a dataset root path when the Python script is called on the command line.

For example:

user:~$ python mypythonscript.py path.root='/path/to/project' 

In this notebook we will use the environment variable $NBOOKROOT which we defined at the start of the tutorial.

Note the lack of a trailing slash at the end of the path.root argument string. If you inspect a variable such as path.metadata_dir you will see that this slash is already included there.

path:
  root: ./
  metadata_dir: ${path.root}/task1/metadata

The general form for overriding a parameter on the CLI uses dot-indexed keys. For the following entry in a config.yaml file:

A:
  B:
    parameter_0: some_value
    parameter_1: some_other_value

The CLI syntax to override those values would be:

user:~$ python myscript.py A.B.parameter_0="new_value" A.B.parameter_1="another_new_value"

5. Output Signals#

  1. Eight mono signals corresponding to the left and right VDBO (vocals, drums, bass, other) stems.

    • 24000 Hz

    • Peak normalised to avoid clipping; the normalisation factor is saved so that the signal can be reconstructed for the HAAQI evaluation (see the sketch after this list)

    • 16-bit PCM

    • Compressed and saved using lossless FLAC

  2. One stereo signal (the remix)

    • 32000 Hz

    • 16-bit PCM

    • Compressed and saved using lossless FLAC

    • Signal clipping may occur.
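
The sketch below illustrates the peak-normalise-then-save step referenced in the list above. The helper name and the use of soundfile are assumptions made for the example; the recipe has its own FLAC handling (the files are read back later with read_flac_signal).

# A minimal sketch of peak normalisation before writing 16-bit FLAC; the helper name
# and soundfile usage are illustrative, not the recipe's exact code.
import numpy as np
import soundfile as sf

def save_stem_flac(signal: np.ndarray, path: str, sample_rate: int) -> float:
    """Peak-normalise, write a 16-bit FLAC file, and return the scale factor."""
    scale = float(np.max(np.abs(signal)))
    scale = scale if scale > 0 else 1.0
    sf.write(path, signal / scale, sample_rate, subtype="PCM_16", format="FLAC")
    return scale  # keep this so the signal can be rescaled before HAAQI scoring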

6. Run Demo#

Typically, all the work is done within Python, with configurable variables supplied by a yaml file. This is parsed by Hydra inside the Python code.

The execution of this code is performed in the CLI and new configuration variable values are supplied as arguments to override defaults.

We are now ready to run the prepared python script. However, the standard configuration is designed to work with the full clarity dataset. We can redirect the script to the correct folders by overriding the appropriate configuration parameters.

!python enhance.py path.root=../cadenza_data_demo/cad1/task1
[2023-05-19 15:56:54,604][torchaudio.utils.download][INFO] - The local file (/home/gerardo/.cache/torch/hub/torchaudio/models/hdemucs_high_musdbhq_only.pt) exists. Skipping the download.
[2023-05-19 15:56:57,704][__main__][INFO] - [001/002] Processing Actions - One Minute Smile for L5076...
[2023-05-19 15:57:07,048][__main__][INFO] - [002/002] Processing Actions - One Minute Smile for L5040...

The enhancement results in eight stems and one remix for all listener-song pairs.

%%capture
!pip install more_itertools
from os import listdir
from os.path import isfile, join
from pathlib import Path
from scipy.io import wavfile
from more_itertools import windowed

from clarity.utils.flac_encoder import read_flac_signal

import IPython.display as ipd
import pandas as pd

def audio_player_list(signals, rates, width=270, height=40, columns=None, column_align='center'):
    """Generate a list of HTML audio players tags for a given list of audio signals.

    Notebook: B/B_PythonAudio.ipynb

    Args:
        signals (list): List of audio signals
        rates (list): List of sample rates
        width (int): Width of player (either number or list) (Default value = 270)
        height (int): Height of player (either number or list) (Default value = 40)
        columns (list): Column headings (Default value = None)
        column_align (str): Left, center, right (Default value = 'center')
    """
    pd.set_option('display.max_colwidth', None)

    if isinstance(width, int):
        width = [width] * len(signals)
    if isinstance(height, int):
        height = [height] * len(signals)

    audio_list = []
    for cur_x, cur_Fs, cur_width, cur_height in zip(signals, rates, width, height):
        audio_html = ipd.Audio(data=cur_x.T, rate=cur_Fs)._repr_html_()
        audio_html = audio_html.replace('\n', '').strip()
        audio_html = audio_html.replace('<audio ', f'<audio style="width: {cur_width}px; height: {cur_height}px" ')
        audio_list.append([audio_html])

    df = pd.DataFrame(audio_list, index=columns).T
    table_html = df.to_html(escape=False, index=False, header=bool(columns))
    table_html = table_html.replace('<th>', f'<th style="text-align: {column_align}">')
    ipd.display(ipd.HTML(table_html))

def load_and_display(audio_path, number_audios=None, start=20, end=30):
    audio_files = [f for f in audio_path.glob('**/*') if f.suffix in ['.flac', '.wav']]
    if number_audios is not None:
        audio_files = audio_files[:number_audios]
    signals = []
    rates = []
    columns = []
    
    # load signals
    for file_to_play in audio_files:
        if file_to_play.suffix == '.flac':
            signal, sample_rate = read_flac_signal(file_to_play)
        else:
            sample_rate, signal = wavfile.read(file_to_play)
            
        signals.append(signal[int(start*sample_rate):int(end*sample_rate)])
        rates.append(sample_rate)
        
        columns.append("/".join(list(file_to_play.parts[-3:-1]) +  [file_to_play.stem]))
    
    # display signals
    for s, r, c in zip(
          windowed(signals, 2, step=2),
          windowed(rates, 2, step=2),
          windowed(columns, 2, step=2)): 
        audio_player_list(s, r, columns=c)
load_and_display(Path("exp_demucs/enhanced_signals"), 8, start=20, end=30)
[Audio players for listener L5040, "Actions - One Minute Smile": left and right bass, drums, other and vocals stems]

Now that we have the enhanced audio, we can use the evaluate recipe to generate HAAQI scores for the signals. The evaluation is run in the same manner as the enhancement script.

!python evaluate.py path.root=../cadenza_data_demo/cad1/task1