Searching For Music Mixing Graphs: A Pruning Approach

Abstract

Music mixing is compositional: experts combine multiple audio processors to achieve a cohesive mix from dry source tracks. We propose a method to reverse engineer this process from the input and output audio. First, we create a mixing console that applies all available processors to every chain. Then, after the initial console parameter optimization, we alternate between removing redundant processors and fine-tuning. We achieve this through differentiable implementation of both processors and pruning. Consequently, we find a sparse mixing graph that achieves nearly the same matching quality as the full mixing console. We apply this procedure to dry-mix pairs from various datasets and collect graphs that can also be used to train neural networks for music mixing applications.

1 Introduction

Motivation

From a signal processing perspective, modern music is more than the mere sum of source tracks. Mixing engineers combine and control multiple processors to balance the sources in terms of loudness, frequency content, spatialization, and much more. Many attempts have been made to uncover parts of this intricate process. Some have gathered expert knowledge [1, 2] and built rule-based systems [3, 4]. More recent work has adopted data-driven approaches. Neural networks have been trained to map source tracks directly to a mix [5, 6] or to estimate parameters of a fixed processing chain [7]. Yet, efforts to address the compositional aspects of music mixing, such as which processors to use for each track, are still limited. One possible remedy is to consider a graph representation whose nodes and edges are processors and connections between them, respectively. In other words, each graph contains the essential information about the mixing process. However, other than the dry source and mixed audio, no public dataset provides such mixing graphs or related metadata [8, 9, 10], which hinders this line of research. This is not surprising; besides the cost of crowdsourcing, it is difficult to standardize the mixing data from multiple engineers with different equipment. One recent work [11] sidestepped this issue by creating synthetic graphs and using them for training. However, this approach is not free from downsides. Neural networks would suffer from poor generalization unless the synthetic data distribution matches the real world. Similar data-related issues arise in different domains, e.g., audio effect chain recognition [12, 13] and synthesizer sound matching [14, 15, 16]. Furthermore, real-world multitrack mixes have far more source tracks and much larger graphs, making synthetic data generation more challenging. Therefore, it is desirable to have a systematic, reliable, and scalable method for collecting graphs.
All these contexts lead us to ask: Can we find the mixing graphs solely from audio?

[Figure 1 diagram. Arrow labels: parameter optimization, pruning. Legend: console parameters, pruning search with parameters, audio search space, tolerance threshold range. Symbols shown: $\mathcal{G}, \mathcal{P}$, $G_{\mathrm{c}}$, $\mathbf{0}$, $G_{\mathrm{p}}, \mathbf{P}_{\mathrm{p}}$, $\mathbf{P}_{\mathrm{c}}$, $\hat{y}_0$, $\hat{y}_{\mathrm{c}}$, $\hat{y}_{\mathrm{p}}$, $y$, $\tau$, $\mathcal{Y}$.]
Figure 1: Music mixing graph search via iterative pruning.

Problem definition

Precisely, for each song (piece) whose dry sources $s_1, \cdots, s_K$ and mix $y$ are available, we aim to find an audio processing graph $G$ and its processor parameters $\mathbf{P}$ so that processing the dry sources $s_1, \cdots, s_K$ results in a mix $\hat{y}$ that closely matches the original mix $y$. With a loss $L_{\mathrm{a}}$ that measures the match quality on the mixture audio domain $\mathcal{Y}$ and regularization $L_{\mathrm{r}}$, our objective can be written as follows,

$$G^{*}, \mathbf{P}^{*} = \operatorname*{arg\,min}_{G, \mathbf{P}} \big[ L_{\mathrm{a}}(\hat{y}, y) + L_{\mathrm{r}}(G, \mathbf{P}) \big]. \tag{1}$$

Contributions

Figure 2: Finding the sparse graph $G_{\mathrm{p}}$ from the differentiable mixing console $G_{\mathrm{c}}$. (a) Full mixing console (before pruning). (b) Pruned graph. Initial letters in the nodes denote their respective types. i: input, o: output, m: mix, e: equalizer, c: compressor, n: noisegate, s: stereo imager, g: gain/panning, r: reverb, d: multitap delay.

One might want to explore the candidate graphs without any restriction. However, this makes the problem ill-posed and underdetermined. The graph’s combinatorial nature makes the search space $\mathcal{G}$ extremely large. Furthermore, we have to find the processor parameters jointly. As a result, numerous pairs of graphs and parameters can have similar match quality. Therefore, it is desirable to add some restrictions, e.g., preferring structures that are widely used by practitioners. To this end, we resort to the following pruning-based search; see Figure 1 for a visual illustration. Inspired by a recent work [17], we first create a so-called “mixing console” $G_{\mathrm{c}}$ (see Figure 2a for an example). It applies a fixed processing chain to each source. Then, it subgroups the outputs, applies the chain again, and sums the processed subgroups to obtain a final mix $\hat{y}$. This resembles the traditional hybrid mixing console [18]. Each chain comprises 7 processors, including an equalizer, compressor, and multitap delay. We implement all of them in a differentiable manner [19, 20, 21]. This allows end-to-end optimization of all parameters $\mathbf{P}_{\mathrm{c}}$ with an audio-domain loss $L_{\mathrm{a}}$ via gradient descent. After this initial console training, we proceed to the pruning stage. Here, we search for a maximally pruned graph $G_{\mathrm{p}}$ and its parameters $\mathbf{P}_{\mathrm{p}}$ while maintaining the match quality of the mixing console up to a certain tolerance $\tau$; this is shown as a circle centered at $y$ in Figure 1.
Also, see Figure 2b for an example pruned graph. We use iterative pruning, alternating between pruning and fine-tuning, i.e., optimization of the remaining parameters [22]. To collect graphs from multiple songs, it is crucial to make the entire search efficient and fast. Pruning, in particular, takes a significant amount of computation time; hence, we investigate efficient and effective methods for pruning. During the pruning, we need to find a subset of nodes that can be removed without harming the match quality. To achieve this, we view each processor’s “dry/wet” parameter as an approximate importance score and use it to select the candidate nodes. This approach gives 3 variants of the pruning method with different trade-offs between the computational cost and resulting sparsity. It also draws connections to neural network pruning [23, 24] where the binary pruning operation is relaxed to continuous weights. Note that casting the graph search into pruning is a double-edged sword. The pruning only removes the processors and does not consider all possible signal routings, reducing the search space (from grey to colored regions in Figure 1). Consequently, it does not improve the match quality over the mixing consoles. Nevertheless, the pruned graph follows the real-world practice of selectively applying appropriate processors. In other words, the sparsity is crucial for the graph’s interpretability. Also, it keeps the search cost in a practical range, which might be challenging with other alternatives [25, 26]. Our method serves as a standalone reverse engineering algorithm [17], but it can also be used to collect pseudo-label data to train neural networks for music mixing applications. For example, we may extend existing methods for automatic mixing [3, 4, 5, 6, 7, 27] and mixing style transfer [28] to output the graphs. This allows end users to interpret and control the estimated outputs.

Data

We first report a list of datasets to which we can apply our method. For each song, we need a pair of dry sources $s_1, \cdots, s_K$ and a final mixture $y$. Additionally, we use subgrouping information that describes how dry tracks are grouped together. Therefore, we use the MedleyDB dataset [8, 9] as it provides all of them. We also add the MixingSecrets library [10]. Since it only provides the audio, we manually subgrouped each track based on its instrument. Finally, we include our private dataset of Western music mixes from multiple engineers (denoted as Internal). The resulting ensemble comprises 1129 songs (188, 472, and 579 songs for each respective dataset). The number of dry tracks ranges from 1 to 133, and the number of subgroups ranges from 1 to 26 (see Figure 6 for the statistics). Except for the final pruned graph collection stage (Section 3.4), we use a random subset for the evaluations (a total of 72 songs, 24 songs for each dataset). Every signal is stereo and resampled to a 30 kHz sampling rate.

Supplementary materials

Refer to the following link for audio samples, pruned graphs, and appendices that contain additional details: https://sh-lee97.github.io/grafx-prune.

2 Differentiable Processing on Graphs

An audio processing graph $G = (V, E)$ is assumed to be directed and acyclic ($V$ and $E$ denote the set of nodes and edges, respectively). Each node $v_i \in V$ is either a processor or an auxiliary module and has its type $t_i$, e.g., e for an equalizer. Each processor takes an audio signal $u_i$ and a parameter vector $p_i$ as input and outputs a processed signal $f_i(u_i, p_i)$. Then, we further mix the input and this processed result with a “dry/wet” weight $w_i \in [0, 1]$. Hence, the output $y_i$ of the processor $v_i$ is given as follows,

$$y_i = w_i f_i(u_i, p_i) + (1 - w_i) u_i. \tag{2}$$
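As a minimal illustration of Equation 2, the following NumPy sketch blends a processed signal with its input. The function name `dry_wet` and the toy "processor" (a fixed gain of 0.5) are assumptions made only for this example; they are not part of the original implementation.

```python
import numpy as np

def dry_wet(u, processed, w):
    """Blend the processed signal f(u, p) with the input u (Eq. 2)."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("dry/wet weight must lie in [0, 1]")
    return w * processed + (1.0 - w) * u

# Hypothetical processor for illustration: a fixed gain of 0.5.
u = np.array([1.0, -1.0, 0.5])
f_u = 0.5 * u

bypassed = dry_wet(u, f_u, 0.0)   # w = 0: the processor is bypassed
fully_wet = dry_wet(u, f_u, 1.0)  # w = 1: only the processed signal remains
half = dry_wet(u, f_u, 0.5)       # w = 0.5: equal blend of both
```

Setting the weight to zero returns the input unchanged, which is why a zero dry/wet weight can stand in for removing the processor.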

We have the following 3 auxiliary modules:

  • Input: It outputs one of the dry sources $s_k$.

  • Mix: It outputs the sum of incoming signals.

  • Output: A sum of its inputs is considered as a final output $y$.

Each edge $e_{ij} \in E$ represents a “cable” that sends an output signal to another node as input. Throughout the text, we denote an ordered collection from multiple nodes with a boldface letter, e.g., $\mathbf{w}$ for a weight vector, $\mathbf{S}$ for a source tensor, and $\mathbf{P}$ for a dictionary with processor types as keys and their parameter tensors as values. Under this notation, our task is to find $G$, $\mathbf{P}$, and $\mathbf{w}$ from $\mathbf{S}$ and $y$.

2.1 Differentiable Implementation

For music mixing, we use the following 7 processors.

  • Gain/panning

    We control both the loudness and stereo panning of the input audio by multiplying each channel by a learnable scalar.

  • Stereo imager

    We change the stereo width of the input by modifying the loudness of the side channel (left minus right).

  • Equalizer

    We use a finite impulse response (FIR) filter with a length of 2047 to modify the input’s magnitude response. The FIR is parameterized with its log magnitude (thus 1024 parameters). We apply an inverse FFT to the magnitude with zero phase, obtain a zero-centered FIR, and multiply it with a Hann window. We apply the same FIR to both the left and right channels.

  • Reverb

    We employ 2 seconds of filtered noise as an impulse response for reverberation. First, we create 2-channel uniform noise, where the channels represent the mid and side. We filter the noise by multiplying its short-time Fourier transform (STFT) element-wise by a 2-channel magnitude mask, where the FFT size and hop length are 384 and 192, respectively. This mask is constructed using the reverberation’s initial and decaying log magnitudes. After the masking, we obtain the mid/side filtered noise via inverse STFT, convert it to stereo, and perform channel-wise convolutions with the input to get an output.

  • Compressor

    We use a slight variant of the recently proposed differentiable dynamic range compressor [21]. First, we obtain the input’s smooth energy envelope. The smoothing is typically done with a ballistics filter, but we instead use a one-pole filter for speedup on the GPU. Then, we compute the desired gain reduction from the envelope and apply it to the input audio.

  • Noisegate

    Except for the gain computation, its implementation is the same as the compressor.

  • Multitap delay

    For each (left and right) channel, we employ an independent 2-second delay effect with a single delay for every 100 ms interval. To optimize delay lengths using gradient descent, we employ surrogate complex damped sinusoids [29]. Each sinusoid is converted to a delayed soft impulse via inverse FFT. Its angular frequency represents a continuous relaxation of the discrete delay length. Each delay is filtered with a length-39 FIR equalizer to mimic the filtered echo effect [30].
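The equalizer description above can be sketched in a few lines. This is a plain NumPy illustration of the stated steps (log magnitude, zero-phase inverse FFT, centering, Hann window), not the differentiable implementation used in the paper; the function names `zero_phase_fir` and `apply_equalizer` and the time-domain convolution are assumptions for this example.

```python
import numpy as np

def zero_phase_fir(log_mag, fir_len=2047):
    """Build a windowed, centered FIR from 1024 log-magnitude parameters."""
    mag = np.exp(log_mag)                # non-negative gain per frequency bin
    fir = np.fft.irfft(mag, n=fir_len)   # zero phase: symmetric around index 0
    fir = np.roll(fir, fir_len // 2)     # move the peak to the center tap
    return fir * np.hanning(fir_len)     # taper with a Hann window

def apply_equalizer(x, log_mag):
    """Apply the same FIR to both channels of a stereo signal x of shape (2, T)."""
    fir = zero_phase_fir(log_mag)
    return np.stack([np.convolve(ch, fir, mode="same") for ch in x])

# A flat (all-zero) log magnitude yields a unit impulse, i.e., no filtering.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4096))
flat = np.zeros(1024)
fir = zero_phase_fir(flat)
y = apply_equalizer(x, flat)
```

With an odd FIR length of 2047, the real inverse FFT expects exactly 1024 frequency bins, which matches the parameter count stated above.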

Batched node processing

It is common to compute the graph output signal by processing each node one by one [15, 19]. However, this severely bottlenecks the computation speed for large mixing graphs. Therefore, we instead batch-process multiple nodes in parallel. For the graph in Figure 2b, we can batch-process 1 equalizer e, 3 noisegates n, and 5 gain/pannings g sequentially. Then, we aggregate the intermediate outputs to 2 subgroup mixes m (also in parallel). This part is identical to graph neural networks’ “message passing,” so we adopt their implementations [31]. We repeat these parallel computations until we reach the output node o. By doing so, we obtain the output faster; in this example, the number of sequential processing steps is reduced from 15 (one-by-one) to 8 (optimal). We empirically found that up to 5.8× speedup can be achieved for the pruned graphs with an RTX 3090 GPU.
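To illustrate the idea of batching same-type nodes, the sketch below schedules a small hypothetical graph. It uses a simple greedy rule (at each step, process all ready nodes of the most frequent ready type), which is an assumption made for this example and is not guaranteed to reproduce the optimal schedule mentioned above. The function name `schedule_batches` and the toy graph are likewise illustrative.

```python
from collections import defaultdict

def schedule_batches(node_types, edges):
    """Greedily group ready nodes of one type into a batch per step."""
    preds = defaultdict(set)
    for src, dst in edges:
        preds[dst].add(src)
    done, schedule = set(), []
    while len(done) < len(node_types):
        # Collect nodes whose predecessors have all been processed.
        ready = defaultdict(list)
        for node, t in node_types.items():
            if node not in done and preds[node] <= done:
                ready[t].append(node)
        if not ready:
            raise ValueError("graph contains a cycle")
        # Pick the type with the most ready nodes and process them together.
        best = max(ready, key=lambda t: len(ready[t]))
        batch = sorted(ready[best])
        schedule.append((best, batch))
        done.update(batch)
    return schedule

# Hypothetical graph: two tracks, each i -> e -> g, summed by m, then o.
node_types = {0: "i", 1: "i", 2: "e", 3: "e", 4: "g", 5: "g", 6: "m", 7: "o"}
edges = [(0, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 6), (6, 7)]
schedule = schedule_batches(node_types, edges)
```

In this toy case, the 8 nodes are covered in 5 sequential steps instead of 8, since both equalizers and both gain/pannings are each handled in a single batch.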

2.2 Mixing Console

We construct a mixing console $G_{\mathrm{c}}$ as follows (see Figure 2a).

  (i) We add an input node i for each source track.

  (ii) We connect a serial chain (with a fixed order) of an equalizer e, compressor c, noisegate n, stereo imager s, gain/panning g, multitap delay d, and reverb r for each input.

  (iii) We subgroup and sum the processed tracks with mix nodes m based on the prepared subgrouping information.

  (iv) We apply the same chain ecnsgdr to each mix output, then pass it to the output node o (we omit the mix module here).
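The four construction steps above can be sketched as follows. The graph is represented as a list of node types plus a list of directed edges; this representation, the function name `build_console`, and the input format (a list of track counts per subgroup) are assumptions for illustration only.

```python
CHAIN = "ecnsgdr"  # fixed processor order used for every chain

def build_console(subgroup_sizes):
    """Return (node_types, edges) of a console for the given subgroup sizes."""
    types, edges = [], []

    def add(node_type):
        types.append(node_type)
        return len(types) - 1

    def add_chain(src):
        # Append the serial chain after node `src` and return its last node.
        for node_type in CHAIN:
            node = add(node_type)
            edges.append((src, node))
            src = node
        return src

    subgroup_ends = []
    for n_tracks in subgroup_sizes:
        # Steps (i) and (ii): one input and one chain per source track.
        track_ends = [add_chain(add("i")) for _ in range(n_tracks)]
        # Step (iii): sum the processed tracks of this subgroup.
        mix = add("m")
        edges.extend((end, mix) for end in track_ends)
        # Step (iv): apply the chain again to the subgroup mix.
        subgroup_ends.append(add_chain(mix))
    out = add("o")  # the output node sums all processed subgroups
    edges.extend((end, out) for end in subgroup_ends)
    return types, edges

# Hypothetical song with 3 tracks in 2 subgroups.
types, edges = build_console([2, 1])
n_processors = sum(t in CHAIN for t in types)
```

For this toy input, the console has 3 track chains and 2 subgroup chains, hence 35 processors before any pruning.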

2.3 Optimization

Before exploring the pruning of each mixing console, as a sanity check, we first evaluate its match performance. To investigate how much each processor type contributes to the match quality, we start with a base graph, a mixing console with no processors that simply sums all the inputs. Then, we add each processor type one by one to the processor chain (see the first column of Table 1). We optimize and evaluate all these preliminary graphs for each song. For each graph, we train its parameters and weights simultaneously with an audio-domain loss given as follows,

$$L_{\mathrm{a}} = \alpha_{\mathrm{lr}} L_{\mathrm{lr}} + \alpha_{\mathrm{m}} L_{\mathrm{m}} + \alpha_{\mathrm{s}} L_{\mathrm{s}} \tag{3}$$

where each term $L_{\mathrm{x}}$ is a variant of multi-resolution STFT loss [32] ($\mathrm{x} \in \{\mathrm{lr}, \mathrm{m}, \mathrm{s}\}$, $\mathrm{lr}$: left/right, $\mathrm{m}$: mid, $\mathrm{s}$: side)

$$L_{\mathrm{x}} = \sum_{i=1}^{I} \Bigg[ \frac{\| \log Y^{(i)}_{\mathrm{x}} - \log \hat{Y}^{(i)}_{\mathrm{x}} \|_1}{N} + \frac{\| Y^{(i)}_{\mathrm{x}} - \hat{Y}^{(i)}_{\mathrm{x}} \|_F}{\| Y^{(i)}_{\mathrm{x}} \|_F} \Bigg]. \tag{4}$$

Here, $Y_{\mathrm{x}}^{(i)}$ and $\hat{Y}_{\mathrm{x}}^{(i)}$ denote the $i^{\mathrm{th}}$ Mel spectrograms of the target and predicted mixture, respectively. $N$, $\|\cdot\|_1$, and $\|\cdot\|_F$ denote the number of frames, $l_1$ norm, and Frobenius norm, respectively. We use FFT sizes of 512, 1024, and 4096, and hop sizes are 1/4 of their respective FFT sizes. The number of Mel filterbanks is set to 96 for all scales. We apply A-weighting before each STFT [33]. The per-channel loss weights are set to $\alpha_{\mathrm{lr}} = 0.5$, $\alpha_{\mathrm{m}} = 0.25$, and $\alpha_{\mathrm{s}} = 0.25$. The implementation is based on auraloss [34]. We further add a regularization that promotes gain-staging, a common practice of audio engineers that keeps the total energy of input and output roughly the same. This is achieved with the following loss:

$$L_{\mathrm{g}} = \sum\nolimits_{v_i \in V_{\mathrm{g}}} \left| \log \| f_i(u_i)_{\mathrm{m}} \|_2 - \log \| u_{i,\mathrm{m}} \|_2 \right| \tag{5}$$

where $(\cdot)_{\mathrm{m}}$ and $\|\cdot\|_2$ denote mid channel and $l_2$ norm, respectively. We apply this regularization to a subset of processors $V_{\mathrm{g}} \subset V$ that comprises all equalizers, reverbs, and multitap delays. This allows us to (i) eliminate redundant gains that these linear time-invariant (LTI) processors could create and (ii) restrict the parameters to be in a reasonable range. Therefore, the total loss is given as

$$L(\mathbf{P}, \mathbf{w}) = L_{\mathrm{a}}(\mathbf{P}, \mathbf{w}) + \alpha_{\mathrm{g}} L_{\mathrm{g}}(\mathbf{P}) \tag{6}$$

where the gain-staging weight is set to $\alpha_{\mathrm{g}} = 10^{-3}$. Here, we used a slightly different notation from Equation 1 to emphasize what is optimized. Each console is optimized for 12k steps using AdamW [35] with a 0.01 learning rate. For each step, we random-sample a 3.8 s region of dry sources $\mathbf{S}$ (thus the batch size is 1), compute the mix $\hat{y}$, and compare its last 2.8 s with the corresponding ground-truth $y$. Note that the first second is used only for the “warm-up” of the processors with long states such as compressors and reverbs.
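To make Equations 3 and 4 concrete, the following NumPy sketch evaluates the per-scale term and the weighted combination. It assumes the Mel spectrograms are already computed as non-negative arrays of shape (Mel bins, frames); the function names and the small constant `eps` that guards the logarithm are assumptions for this example and are not taken from auraloss.

```python
import numpy as np

def scale_term(Y, Y_hat, eps=1e-7):
    """One summand of Eq. 4 for a single spectrogram resolution."""
    n_frames = Y.shape[-1]
    # Log-magnitude l1 distance, normalized by the number of frames N.
    log_l1 = np.abs(np.log(Y + eps) - np.log(Y_hat + eps)).sum() / n_frames
    # Frobenius-norm distance, normalized by the target norm.
    conv = np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y)
    return log_l1 + conv

def channel_loss(specs, specs_hat):
    """Eq. 4: sum the per-scale terms over all resolutions."""
    return sum(scale_term(Y, Yh) for Y, Yh in zip(specs, specs_hat))

def audio_loss(L_lr, L_m, L_s, a_lr=0.5, a_m=0.25, a_s=0.25):
    """Eq. 3: weighted sum of left/right, mid, and side losses."""
    return a_lr * L_lr + a_m * L_m + a_s * L_s

# Toy spectrograms: 96 Mel bins, 10 frames.
Y = np.ones((96, 10))
same = channel_loss([Y, Y], [Y, Y])   # identical inputs give zero loss
doubled = scale_term(Y, 2.0 * Y)      # 96 * log(2) + 1 up to eps
base_row = audio_loss(1.52, 1.46, 74.3)
```

The last line plugs in per-channel values of 1.52, 1.46, and 74.3 with the stated weights, which gives approximately 19.7.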

Table 1: Matching performances of the mixing consoles using different processor type configurations.
| Processor configuration | $L_{\mathrm{a}}$ | $L_{\mathrm{lr}}$ | $L_{\mathrm{m}}$ | $L_{\mathrm{s}}$ |
|---|---|---|---|---|
| Base graph (sum of dry sources) | 19.7 | 1.52 | 1.46 | 74.3 |
| + Gain/panning | 0.689 | 0.686 | 0.634 | 0.752 |
| + Stereo imager | 0.676 | 0.671 | 0.623 | 0.734 |
| + Equalizer | 0.557 | 0.549 | 0.493 | 0.637 |
| + Reverb | 0.481 | 0.471 | 0.457 | 0.523 |
| + Compressor | 0.423 | 0.407 | 0.385 | 0.492 |
| + Noisegate | 0.414 | 0.398 | 0.375 | 0.485 |
| + Multitap delay (full) | 0.409 | 0.395 | 0.375 | 0.469 |

2.4 Results

Table 1 reports the evaluation results that are calculated over the entire song. First, the base graph results in an audio loss $L_{\mathrm{a}}$ of 19.7. The side-channel loss $L_{\mathrm{s}}$ is especially large as most source tracks are close to mono while the target mixes have wide stereo images. With the gain/pannings and stereo imagers, we can achieve “rough mixes” with a loss of 0.676. Then, we fill in the missing details with the remaining processor types. Every type improves the match, and the full mixing console reports a loss of 0.409. Also, the top 5 rows of Figure 7 show mid/side log-magnitude STFTs of the target mixes, matches of the mixing consoles, and their errors. We report the results with 3, 4, 6, and 7 types where the choice of processors and their order follow Table 1; see the supplementary page for the results on other configurations and additional songs. Again, we can observe from the spectrogram error plots that adding each type improves the match. Furthermore, each song benefits more from different types; for the song RockSteady, the multitap delays improve the match more than the reverbs (Figure 7b), which is different from the average trend. Yet, this is expected since the original mix heavily uses the delay effects. Finally, we note that mixes from MixingSecrets are more challenging to match than the others; they report a mean audio loss of 0.545, while MedleyDB and Internal report 0.296 and 0.385, respectively.

3 Music Mixing Graph Search

Considering the full mixing console $G_{\mathrm{c}}$ as an upper bound in terms of the matching performance, we want to find a sparser graph with a similar match quality. We achieve this by pruning the console as much as possible while keeping the loss increase up to a tolerance threshold $\tau$. This objective can be written as

$$\mathrm{minimize}\;\; |V_{\mathrm{p}}| \quad \mathrm{s.t.}\;\; \min L_{\mathrm{a}}(G_{\mathrm{p}}) \leq \min L_{\mathrm{a}}(G_{\mathrm{c}}) + \tau \tag{7}$$

where $V_{\mathrm{p}}$ and $|\cdot|$ denote the pruned graph’s node set and its cardinality, respectively. We define the pruning as removal of the nodes $V_{\mathrm{c}} \setminus V_{\mathrm{p}}$ and re-routing of their edges, in a way that is equivalent to setting them to “bypass,” i.e., $w_i = 0$ for $v_i \in V_{\mathrm{c}} \setminus V_{\mathrm{p}}$. Also, $\min(\cdot)$ signifies that we are (ideally) interested in the optimized audio loss. We only prune the processors, not the auxiliary nodes. Hence, we define a pruning ratio $\rho$ as the number of pruned processors divided by the number of processors in the initial console.

3.1 Iterative Pruning

Finding the optimal (sparsest) solution $V_{\mathrm{p}}^{*}$ is prohibitively expensive. First, due to the interaction between the processors, we need a combinatorial search. As such, we instead assume their independence and prune the processors in a greedy manner. Following the iterative approach [22], we gradually remove processors whenever the tolerance condition is satisfied. Under this setup, we still need to fine-tune intermediate pruned graphs before evaluating the tolerance condition. For reasonable computational complexity, we simply omit this fine-tuning, paying the cost of possibly missing more removable processors. Our method is summarized in Algorithm 1 (numbers in parentheses below denote line numbers). First, we construct a mixing console $G_{\mathrm{c}} = (V, E)$, optimize its parameters $\mathbf{P}$ and dry/wet weights $\mathbf{w}$, and evaluate the loss (3-5). This validation loss $L_{\mathrm{a}}^{\mathrm{min}}$ serves as a pruning threshold with the tolerance $\tau$. Then, we alternate between pruning and fine-tuning, i.e., further optimization of the remaining parameters and weights (7-20). Each pruning stage consists of multiple trials, which sample subsets of candidates $V_{\mathrm{cand}}$ from the set of remaining processors $V_{\mathrm{pool}}$ (10) and check whether they are removable (12). We keep the pruning if the result satisfies the constraint or cancel it otherwise (12-15).
We repeat this process until the terminal condition (9) is satisfied. Implementation-wise, we multiply binary masks, $\mathbf{m}$ and $\mathbf{m}_{\mathrm{cand}}$, to the weight vector $\mathbf{w}$ to mimic the pruning during the trials (11). After that, we actually update the graph and remove the pruned processors’ parameters and weights for faster search (18). Pruning can sometimes, albeit rarely, improve the match. In this case, we update the threshold (13).

Algorithm 1 Music mixing graph search with iterative pruning.
1: Input: A mixing console $G_{\mathrm{c}}$, dry tracks $\mathbf{S}$, and mixture $y$
2: Output: Pruned graph $G_{\mathrm{p}}$, parameters $\mathbf{P}$, and weights $\mathbf{w}$
3: $\mathbf{P}, \mathbf{w} \leftarrow \mathrm{Initialize}(G_{\mathrm{c}})$
4: $\mathbf{P}, \mathbf{w} \leftarrow \mathrm{Train}(G_{\mathrm{c}}, \mathbf{P}, \mathbf{w}, \mathbf{S}, y)$
5: $L_{\mathrm{a}}^{\mathrm{min}} \leftarrow \mathrm{Evaluate}(G_{\mathrm{c}}, \mathbf{P}, \mathbf{w}, \mathbf{S}, y)$
6: $G_{\mathrm{p}} \leftarrow G_{\mathrm{c}}$
7: for $n \leftarrow 1$ to $N_{\mathrm{iter}}$ do
8:     $V_{\mathrm{pool}}, \mathbf{m} \leftarrow \mathrm{GetAllProcessors}(V), \mathbf{1}$
9:     while $\mathrm{TryPrune}(V_{\mathrm{pool}}, \mathbf{w}, \mathbf{m})$ do
10:         $V_{\mathrm{cand}}, \mathbf{m}_{\mathrm{cand}} \leftarrow \mathrm{SampleCandidate}(V_{\mathrm{pool}}, \mathbf{w})$
11:         $L_{\mathrm{a}} \leftarrow \mathrm{Evaluate}(G_{\mathrm{p}}, \mathbf{P}, \mathbf{w} \odot \mathbf{m} \odot \mathbf{m}_{\mathrm{cand}}, \mathbf{S}, y)$
12:         if $L_{\mathrm{a}} < L_{\mathrm{a}}^{\mathrm{min}} + \tau$ then
13:              $L_{\mathrm{a}}^{\mathrm{min}} \leftarrow \min(L_{\mathrm{a}}^{\mathrm{min}}, L_{\mathrm{a}})$
14:              $\mathbf{m} \leftarrow \mathbf{m} \odot \mathbf{m}_{\mathrm{cand}}$
15:         end if
16:         $V_{\mathrm{pool}} \leftarrow \mathrm{UpdatePool}(V_{\mathrm{pool}}, V_{\mathrm{cand}})$
17:     end while
18:     $G_{\mathrm{p}}, \mathbf{P}, \mathbf{w} \leftarrow \mathrm{Prune}(G_{\mathrm{p}}, \mathbf{P}, \mathbf{w}, \mathbf{m})$
19:     $\mathbf{P}, \mathbf{w} \leftarrow \mathrm{Train}(G_{\mathrm{p}}, \mathbf{P}, \mathbf{w}, \mathbf{S}, y)$
20: end for
21: return $G_{\mathrm{p}}, \mathbf{P}, \mathbf{w}$
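The control flow of Algorithm 1 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the actual implementation: `evaluate` is a hypothetical callable that returns the audio loss for a given keep-mask, the candidates are sampled one at a time (the brute-force case), the fine-tuning step is an optional hook, and a single global mask stands in for the per-iteration mask and the graph update.

```python
import random

def iterative_prune(evaluate, num_processors, tol, n_iter=3, train=None, seed=0):
    """Greedy trial-and-error pruning with a binary keep-mask.

    evaluate(mask) -> audio loss when processors with mask[i] == 0 are bypassed.
    train(mask) is an optional fine-tuning hook (unused in the toy below).
    """
    rng = random.Random(seed)
    mask = [1] * num_processors
    loss_min = evaluate(mask)                     # reference loss (line 5)
    for _ in range(n_iter):                       # pruning rounds (lines 7-20)
        pool = [i for i in range(num_processors) if mask[i] == 1]
        rng.shuffle(pool)                         # one candidate per trial
        for i in pool:                            # trials (lines 9-17)
            trial = list(mask)
            trial[i] = 0                          # masked weights (line 11)
            loss = evaluate(trial)
            if loss < loss_min + tol:             # tolerance check (line 12)
                loss_min = min(loss_min, loss)    # threshold update (line 13)
                mask = trial                      # keep the pruning (line 14)
        if train is not None:
            train(mask)                           # fine-tune survivors (line 19)
    return mask, loss_min

# Toy loss: removing processor i adds penalty[i] to a base loss of 0.4.
penalty = [0.0, 0.5, 0.002, 0.3, 0.0]

def toy_loss(mask):
    return 0.4 + sum(p for p, m in zip(penalty, mask) if m == 0)

mask, loss_min = iterative_prune(toy_loss, len(penalty), tol=0.01)
final_loss = toy_loss(mask)
```

In this toy, the three processors with negligible penalties are removed and the two important ones survive, regardless of the sampling order. The threshold tracks the best loss seen so far rather than the loss of the current graph, which mirrors line 13.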

3.2 Candidate Sampling

The remaining design choices are choosing an appropriate candidate set $V_{\mathrm{cand}}$ (10, 16) and deciding when to terminate the trials (9). We explore the following 3 approaches.

  • Brute-force: We randomly sample every processor one by one, i.e., $|V_{\mathrm{cand}}| = 1$. This granularity could achieve high sparsity, but comes with a large computational cost.

  • Dry/wet: For efficient pruning, we need an informed guess of each node's importance. Intuitively, we can use each dry/wet weight $w_i$ as an approximate importance. This observation leads to the following procedure. For each pruning iteration:

    (i) We create a set of remaining processor types $T_{\mathrm{pool}}$. Next, we count the number of processors of each type $t \in T_{\mathrm{pool}}$, denoted as $N_t$.

    (ii) For each trial, we sample a type $t \in T_{\mathrm{pool}}$ and choose the smallest-weight processors of that type as candidates. The number of candidates is set to $|V_{\mathrm{cand}}| = \max(1, \lfloor r_t N_t \rceil)$, where $r_t$ denotes the portion of the chosen processors and is initialized to $0.1$ for every pruning iteration.

    (iii) When the trial fails, we perform one of the following. If $|V_{\mathrm{cand}}| > 1$, we halve the candidate set, i.e., $r_t \leftarrow r_t / 2$. Otherwise, i.e., if $|V_{\mathrm{cand}}| = 1$, we finish the search of this type by removing it from the pool as $T_{\mathrm{pool}} \leftarrow T_{\mathrm{pool}} \setminus \{t\}$.

    (iv) We repeat steps (ii)-(iii) until $T_{\mathrm{pool}} = \emptyset$.

    This way, we can skip large-weight nodes and evaluate multiple candidates at once, reducing the total number of trials. Note that if we set $r_t = 0.5$, this method is similar to a simple binary search. However, it can lead to over-pruning of specific types sampled early in (ii). Hence, we set $r_t$ to a more conservative value of $0.1$.

  • Hybrid: Solely relying on the weight values could miss some processors that can be pruned but have large weights. We mitigate this by combining the above two, running the brute-force method for every $4^{\mathrm{th}}$ iteration.

By default, we use the hybrid method with tolerance $\tau = 0.01$.
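Steps (i)-(iv) of the dry/wet strategy can be sketched as follows. This is a simplified illustration: `is_removable` is a hypothetical stand-in for the tolerance check of Algorithm 1, the function name `dry_wet_round` is invented here, and the rounding $\lfloor \cdot \rceil$ is implemented as round-half-up, which is an assumption.

```python
import random

def dry_wet_round(weights, types, is_removable, r_init=0.1, seed=0):
    """One pruning iteration of the dry/wet strategy; returns pruned node ids.

    weights: dict node_id -> dry/wet weight; types: dict node_id -> type name.
    is_removable(candidates) -> bool replaces the audio-loss tolerance check.
    """
    rng = random.Random(seed)
    pruned = set()
    # (i) remaining types and the per-type processor counts N_t
    pool = sorted(set(types.values()))
    count = {t: sum(1 for v in types.values() if v == t) for t in pool}
    ratio = {t: r_init for t in pool}
    while pool:                                        # (iv) until pool is empty
        t = rng.choice(pool)                           # (ii) sample a type
        alive = sorted(
            (n for n in weights if types[n] == t and n not in pruned),
            key=lambda n: weights[n],
        )
        if not alive:                                  # nothing left of this type
            pool.remove(t)
            continue
        k = max(1, int(ratio[t] * count[t] + 0.5))     # max(1, round(r_t * N_t))
        cand = alive[:k]                               # smallest weights first
        if is_removable(cand):
            pruned.update(cand)
        elif len(cand) > 1:
            ratio[t] /= 2                              # (iii) halve the candidates
        else:
            pool.remove(t)                             # (iii) finish this type
    return pruned

# Toy console: 20 compressors with weights 0.00, 0.05, ..., 0.95 and 2 reverbs.
weights = {f"c{i}": i / 20 for i in range(20)}
weights.update({"r0": 0.01, "r1": 0.7})
types = {n: ("compressor" if n.startswith("c") else "reverb") for n in weights}

def removable(cand):
    """Toy check: a candidate set is removable if all its weights are below 0.2."""
    return all(weights[n] < 0.2 for n in cand)

pruned = dry_wet_round(weights, types, removable)
```

For the compressors, the first two trials remove two nodes each, the third fails and halves the portion, and the fourth single-candidate trial fails and closes the type. Large-weight nodes are never evaluated, which is where the savings in trials come from.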

Figure 3: Process of iterative pruning (hybrid, $\tau = 0.01$). 24 songs (8 songs per dataset) are shown; each color represents an individual song. The upper and lower rows show the pruning ratios and mean dry/wet weights. The yellow-shaded regions show the pruning phase.

3.3 Optimization

We use the same audio loss $L_{\mathrm{a}}$ and gain-staging regularization $L_{\mathrm{g}}$ as before. To promote sparsity, we add a weight regularization $L_{\mathrm{p}}$, the $\ell_1$ norm of the weight vector $\mathbf{w}$. Hence, the full objective is as follows,

$$L(\mathbf{P}, \mathbf{w}) = L_{\mathrm{a}}(\mathbf{P}, \mathbf{w}) + \alpha_{\mathrm{g}} L_{\mathrm{g}}(\mathbf{P}) + \alpha_{\mathrm{p}} L_{\mathrm{p}}(\mathbf{w}). \tag{8}$$

We first train the console for 6k steps. Then, we run $N_{\mathrm{iter}} = 12$ rounds of pruning, each followed by 0.5k steps of fine-tuning. As a result, the total number of optimization steps (12k) is the same as in the previous console training. During the first 4k steps of the pruning phase, we linearly increase the sparsity coefficient $\alpha_{\mathrm{p}}$ from $0$ to $10^{-2}$. Halving the full console optimization steps could increase the loss, but we accept this because of the tight resource constraints. With an RTX 3090 GPU, each song took about 56, 29, and 36 minutes using the brute-force, dry/wet, and hybrid methods, respectively.
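The step budget and the sparsity schedule can be checked with a short sketch. The function names are hypothetical, and holding the coefficient constant after the ramp is an assumption, since the text only specifies the linear increase. The value of the gain-staging coefficient is not given in this section, so it is left as a parameter.

```python
def sparsity_coefficient(prune_step, ramp_steps=4000, alpha_max=1e-2):
    """Linear ramp of alpha_p from 0 to alpha_max over the first ramp_steps
    steps of the pruning phase; held at alpha_max afterwards (assumed)."""
    return alpha_max * min(prune_step, ramp_steps) / ramp_steps

def total_objective(audio_loss, gain_reg, weights, alpha_g, alpha_p):
    """Eq. (8): L = L_a + alpha_g * L_g + alpha_p * ||w||_1."""
    l1_norm = sum(abs(w) for w in weights)
    return audio_loss + alpha_g * gain_reg + alpha_p * l1_norm

console_steps = 6000                  # initial console training
n_iter, finetune_steps = 12, 500      # pruning rounds and fine-tuning per round
total_steps = console_steps + n_iter * finetune_steps

alpha_mid = sparsity_coefficient(2000)    # halfway through the ramp
alpha_end = sparsity_coefficient(6000)    # end of the pruning phase
example = total_objective(0.4, 0.0, [0.5, -0.5], alpha_g=1.0, alpha_p=alpha_end)
```

The ramp spans the first 8 of the 12 pruning rounds, so the sparsity pressure is weakest when the console has just been trained and strongest once the graph is already partly pruned.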

3.4 Results

Pruning process

Figure 3 shows how the pruning progresses. Each graph's sparsity increases gradually while its weights adapt over time. This trend differs across processor types. The mean objective metrics are reported in Table 2. The default setting reports an average audio loss $L_{\mathrm{a}}$ of $0.422$, a $0.013$ increase from the full consoles, slightly exceeding the tolerance $\tau = 0.01$. This was expected due to the shorter full console training. The average pruning ratio $\rho$ is $0.67$; the equalizer is the most frequently retained type and the stereo imager the least (per-type pruning ratios of $0.46$ and $0.86$, respectively). We note that MedleyDB and MixingSecrets report similar pruning ratios of $0.61$ and $0.62$, respectively. However, the Internal graphs are sparser; their average pruning ratio is $0.77$.

Table 2: Pruning results with various candidate selection methods and tolerance $\tau$. The subscripts denote per-type pruning ratios.

| Method | $\tau$ | $L_{\mathrm{a}}$ | $\rho$ | $\rho_{\mathtt{g}}$ | $\rho_{\mathtt{s}}$ | $\rho_{\mathtt{e}}$ | $\rho_{\mathtt{r}}$ | $\rho_{\mathtt{c}}$ | $\rho_{\mathtt{n}}$ | $\rho_{\mathtt{d}}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Mix console | -- | .409 | -- | -- | -- | -- | -- | -- | -- | -- |
| Brute-force | .01 | .424 | .69 | .54 | .85 | .53 | .76 | .71 | .78 | .69 |
| Dry/wet | .01 | .420 | .62 | .51 | .84 | .38 | .69 | .66 | .76 | .53 |
| Hybrid | .001 | .411 | .49 | .35 | .76 | .27 | .53 | .57 | .62 | .34 |
| Hybrid | .01 | .422 | .67 | .51 | .86 | .46 | .71 | .71 | .79 | .63 |
| Hybrid | .1 | .499 | .87 | .73 | .94 | .81 | .90 | .85 | .91 | .92 |
Figure 4: Each node's weight and loss increase when pruned.

Figure 5: Pruning results (hybrid method) with various tolerances. Song: TablaBreakbeatScience_RockSteady. Panels: (a) $\tau = 0.001$, $\rho = 0.45$, $L_{\mathrm{a}} = 0.525$; (b) $\tau = 0.01$, $\rho = 0.74$, $L_{\mathrm{a}} = 0.525$; (c) $\tau = 0.1$, $\rho = 0.85$, $L_{\mathrm{a}} = 0.611$.

Sampling method comparison

Here, we fix the tolerance $\tau$ to $0.01$ and compare the candidate sampling approaches; see Table 2. As expected, the brute-force method achieves the highest sparsity, reporting a pruning ratio of $0.69$. Its average audio loss is also the highest, $0.424$, a $0.015$ increase from the mixing console result. The dry/wet method prunes the least, with a pruning ratio of $0.62$, which is $0.07$ lower than that of the brute-force method. However, its audio loss is the lowest, $0.420$, as more processors remained. We can investigate the cause of this difference in sparsity by analyzing the relationship between each dry/wet weight $w_i$ and the loss increase $\Delta_i$ caused by pruning the processor $v_i$, defined as follows,

$$\Delta_i = L_{\mathrm{a}}(G \setminus \{v_i\}) - L_{\mathrm{a}}(G). \tag{9}$$

Figure 4 shows scatterplots for 2 randomly sampled songs, one per song. Each point $(w_i, \Delta_i)$ corresponds to a processor after the initial console training. To maximize the sparsity using the dry/wet method, a monotonic relationship between the weights $w_i$ and loss increases $\Delta_i$ is desirable, which is unfortunately not the case. Yet, a positive correlation exists, and this becomes more evident when we analyze the relationship for each type separately, justifying the per-type candidate selection. Still, we cannot completely remove the weakness of the dry/wet method, leading us to the hybrid strategy as a compromise. We note that the pruning methods differ not only in sparsity but also in the trade-off between sparsity and match performance. By evaluating the methods with more fine-grained tolerance settings (7 values from $0.001$ to $0.2$), we observed that the brute-force method finds graphs with better matches even with the same graph size, closely followed by the hybrid method; refer to the supplementary page for the details.
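The analysis behind Figure 4 can be sketched as follows. The snippet computes $\Delta_i$ from Eq. (9) by bypassing one processor at a time and then measures how monotonic the relationship to the weights is with a rank correlation. The callable `evaluate` and the toy numbers are hypothetical, and the use of a Spearman-style rank correlation is a choice made here for illustration, not a statistic reported in the text.

```python
import numpy as np

def loss_increases(evaluate, num_processors):
    """Delta_i = L_a(G without v_i) - L_a(G) for every processor, as in Eq. (9)."""
    full = np.ones(num_processors)
    base = evaluate(full)
    deltas = np.empty(num_processors)
    for i in range(num_processors):
        mask = full.copy()
        mask[i] = 0.0                     # bypass processor i only
        deltas[i] = evaluate(mask) - base
    return deltas

def rank_correlation(a, b):
    """Spearman rank correlation as the Pearson correlation of ranks (no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Toy console: removing processor i adds effect[i] to a base loss of 0.4.
w = np.array([0.1, 0.4, 0.2, 0.9])
effect = np.array([0.0, 0.3, 0.05, 0.2])

def toy_loss(mask):
    return 0.4 + float(np.sum(effect * (1.0 - mask)))

deltas = loss_increases(toy_loss, len(w))
corr = rank_correlation(w, deltas)
```

A rank correlation of 1 would mean that sorting by weight sorts by importance exactly. In this toy the value is positive but below 1, which is the situation the text describes: the weights are informative, yet a large-weight processor can still be cheaper to remove than a smaller-weight one.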

Choice of tolerance

Finally, we analyze the effect of the tolerance $\tau$. Even with a very low tolerance $\tau = 0.001$, we can nearly halve the number of processors, i.e., $\rho = 0.49$. If we set the value too high, e.g., $\tau = 0.1$, the resulting graphs are highly sparse but their matches degrade ($L_{\mathrm{a}} = 0.499$, i.e., a $0.090$ increase). The default setting of $\tau = 0.01$ seems “just right,” balancing the match performance and graph sparsity. We can verify this with the spectrogram errors (bottom 3 rows of each subplot; see Figure 7 and the supplementary page). There is no noticeable degradation from the full consoles to $\tau = 0.001$ and $0.01$.

Case study

We report the pruning method’s behavior from observations of the individual results.

  • Recall that, for the song RockSteady, there was no clear performance improvement when we added the reverbs (Figure 7b). Hence, we can expect those reverbs to be pruned with a moderate tolerance $\tau$. Figure 5 shows that this is indeed the case; only $5/14$ reverbs are left when $\tau = 0.001$ and $0/14$ for $\tau = 0.01$, which is far fewer than the average statistics suggest (Tables 2 and 3). When $\tau = 0.1$, the processors that handle finer details are removed; only the gain/pannings and equalizers remain. See the captions in Figure 5 for the pruning ratios and audio losses of the pruned graphs (the full console has an audio loss of $0.523$).

  • The current pruning method fails to detect some redundant processors. In Figure 5b, the bottom 2 sources are processed with 3 gain/pannings. Since there is no nonlinear or time-varying processor between those, at least one can be “absorbed” by the others. While this case can be handled with some post-processing, it hints that we might have missed sparser graphs.

  • Each pruning of the same song yields a slightly different graph. Pruning a mixing console of GirlOnABridge multiple times resulted in graphs with 19 to 22 processors. This is because our iterative pruning has a stochastic and greedy nature; candidates that were sampled early are more likely to be pruned. Refer to the supplementary page for the pruned graphs.

  • The pruning does not necessarily result in graphs that are close to the maximum loss $L_{\mathrm{a}}(G_{\mathrm{c}}) + \tau$. For RockSteady, pruning with $\tau = 0.01$ resulted in a loss of $0.525$, much lower than the threshold. Interestingly, the $\tau = 0.001$ case achieved the same loss in spite of a much lower pruning ratio ($0.45$ versus $0.74$; see Figure 5).

  • Processors for sources with short spans and low energy tend to get pruned as their contributions to the audio loss are small. Yet, we found that this could sometimes be perceptually noticeable.
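The “absorption” of consecutive gain/pannings mentioned above can be illustrated with a minimal sketch. It assumes that each gain/panning is a static pair of linear (left, right) channel gains, which is a simplification made here; the function name `merge_gain_pannings` is hypothetical.

```python
import numpy as np

def merge_gain_pannings(stages):
    """Collapse a cascade of static per-channel gains into a single stage.

    Each stage is a (left, right) pair of linear gains; cascading multiplies them.
    """
    merged = np.ones(2)
    for left, right in stages:
        merged = merged * np.array([left, right])
    return merged

stages = [(0.5, 0.5), (1.0, 0.8), (2.0, 1.0)]
merged = merge_gain_pannings(stages)

# Compare the cascade and the merged stage on a toy stereo signal.
x = np.random.default_rng(0).standard_normal((2, 8))
y_cascade = x.copy()
for left, right in stages:
    y_cascade = y_cascade * np.array([[left], [right]])
y_merged = x * merged[:, None]
```

Because the stages are linear and time-invariant, the cascade and the single merged stage give the same output, so such a post-processing pass would remove processors without changing the audio loss.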

Table 3: Pruning results with the default setting on the full dataset.

| Dataset | $L_{\mathrm{a}}$ | $\rho$ | $\rho_{\mathtt{g}}$ | $\rho_{\mathtt{s}}$ | $\rho_{\mathtt{e}}$ | $\rho_{\mathtt{r}}$ | $\rho_{\mathtt{c}}$ | $\rho_{\mathtt{n}}$ | $\rho_{\mathtt{d}}$ |
|---|---|---|---|---|---|---|---|---|---|
| MedleyDB | .431 | .63 | .37 | .84 | .44 | .69 | .74 | .77 | .57 |
| MixingSecrets | .625 | .64 | .50 | .87 | .33 | .64 | .63 | .80 | .69 |
| Internal | .434 | .75 | .70 | .87 | .55 | .73 | .85 | .86 | .72 |
| Total | .506 | .69 | .57 | .87 | .45 | .69 | .75 | .82 | .69 |
Figure 6: Statistics of the consoles and pruned graphs (full data). Each dataset’s results are stacked to form the full histograms.

Full results

Finally, we pruned every song in the full dataset ensemble. Table 3 reports the results. The overall trend follows the evaluation subset results but with a higher average audio loss ($0.506$ compared to the previous $0.422$). Figure 6 shows statistics of the 3 datasets, initial mixing console graphs, and their pruned versions. MedleyDB has the smallest number of source tracks, an average of $17.6$. The Internal dataset has the largest ($28.8$), closely followed by MixingSecrets ($27.9$). The Internal dataset also has more subgroups, resulting in even larger mixing consoles. This is one potential cause of the higher sparsity of its pruned graphs; more processors were initially used to match the mix, and many of them were redundant. On average, $76.0$ processors ($108.5$ nodes) remained for each song. Since each full mixing console has an average of $247.6$ processors ($280.1$ nodes), we achieved a pruning ratio of $0.692$.

4 Discussion

Summary

We started with a general formulation of retrieving mixing graphs from dry sources and mix. Then, we posed restrictions to cast the search as the pruning of mixing consoles, making it computationally feasible and obtaining more interpretable graphs. Next, with additional assumptions, we derived the iterative method that gradually removes negligible processors in a stochastic and greedy manner. As a result, instead of finding the exact optimum, our method gives (or “samples”) one of the close-to-optimal graphs. With the differentiable processors and relaxation of the pruning with the dry/wet weights, we optimized this objective via gradient descent. We explored 3 methods to choose pruning candidates, comparing them in terms of their computational cost and graph sparsity. The hybrid method gave a good compromise, so we used it to gather over one thousand graph-audio pairs.

Future works

We list possible extensions of our method.

  • The choice of processors and their implementations directly affects the match quality. Our setup, including the equalizer with zero-phase FIR and the reverb based on an STFT mask, was motivated by its simplicity and fast computation on a GPU. However, other alternatives exist, e.g., parametric equalizer [20] and artificial reverberation [36], that allow more efficient computation on a CPU and have compact parameterizations. In addition, the spectrogram errors showed clear temporal patterns (vertical stripes), indicating that the loudness dynamics were not precisely matched. We suspect this is due to the ballistics approximation error, as recently reported [37]. If so, we might need a more sophisticated implementation of the compressor and noisegate. Moreover, the current method does not support time-varying parameters (or “automation”), which can cause audible errors. For example, we could not match a fade-out, i.e., a gradual decrease in track loudness. Finally, we can add other processor types, e.g., saturation/distortion [38] or modulation effects [39].

  • We note several considerations to improve the current pruning method in terms of sparsity, match quality, and interpretability. First, we can modify the mixing console to reflect real-world practices more closely. For example, we can add send and return loops with additional processor chains. Post-equalizers for compressors and processors with multiple inputs or outputs (e.g., auxiliary sidechain and crossover filter) are also commonly used. Second, to prevent the pruning from harming the perceptual quality, the tolerance condition and the objective function must be appropriately designed. We used a simple multi-resolution STFT loss [32, 34], which has been reported to miss some perceptual features [40, 41]. Hence, we might need an alternative objective as a remedy [42]. Third, as discussed before, using the average loss to determine the pruning might be inappropriate. Lastly, to increase the sparsity, more advanced neural network pruning techniques [23, 24] and domain-specific post-processing, e.g., merging LTI processors into a single processor with the combined effect, can be applied.

  • We may relax the prior assumptions and restrictions on graph structures. This will expand our search space and require search methods other than pruning. For example, allowing arbitrary processor order extends our framework to differentiable architecture search [25, 26]. A completely different approach based on reinforcement learning could also be possible [43]. While all of these are promising, balancing flexibility, match quality, and computation cost will be the main challenge.

Figure 7: Log-magnitude spectrograms of the matched mixes (odd columns) of mixing consoles (4 center rows) and pruned graphs (3 bottom rows; in dB). The even columns show the match errors. Panels: (a) Torres_NewSkin; (b) TablaBreakbeatScience_RockSteady.

References

  • [1] P. D. Pestana and J. D. Reiss, “Intelligent audio production strategies informed by best practices,” 2014.
  • [2] F. Everardo, “Towards an automated multitrack mixing tool using answer set programming,” in 14th SMC Conf, 2017.
  • [3] E. Perez-Gonzalez and J. Reiss, “Automatic gain and fader control for live mixing,” in IEEE WASPAA, 2009.
  • [4] B. De Man and J. D. Reiss, “A knowledge-engineered autonomous mixing system,” in AES Convention 135, 2013.
  • [5] M. A. Martinez Ramirez et al., “Automatic music mixing with deep learning and out-of-domain data,” in ISMIR, 2022.
  • [6] D. Koszewski, T. Görne, G. Korvel, and B. Kostek, “Automatic music signal mixing system based on one-dimensional wave-u-net autoencoders,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2023, no. 1, 2023.
  • [7] C. J. Steinmetz, J. Pons, S. Pascual, and J. Serrà, “Automatic multitrack mixing with a differentiable mixing console of neural audio effects,” in IEEE ICASSP, 2021.
  • [8] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello, “MedleyDB: A multitrack dataset for annotation-intensive MIR research,” in ISMIR, vol. 14, 2014.
  • [9] R. M. Bittner, J. Wilkins, H. Yip, and J. P. Bello, “MedleyDB 2.0: New data and a system for sustainable data collection,” ISMIR LBD, 2016.
  • [10] M. Senior, Mixing secrets for the small studio, 2018.
  • [11] S. Lee, J. Park, S. Paik, and K. Lee, “Blind estimation of audio processing graph,” in IEEE ICASSP, 2023.
  • [12] C. Mitcheltree and H. Koike, “SerumRNN: Step by step audio VST effect programming,” in Artificial Intelligence in Music, Sound, Art and Design, 2021.
  • [13] J. Guo and B. McFee, “Automatic recognition of cascaded guitar effects,” in DAFx, 2023.
  • [14] N. Masuda and D. Saito, “Improving semi-supervised differentiable synthesizer sound matching for practical applications,” IEEE/ACM TASLP, vol. 31, 2023.
  • [15] N. Uzrad et al., “DiffMoog: a differentiable modular synthesizer for sound matching,” arXiv:2401.12570, 2024.
  • [16] F. Caspe, A. McPherson, and M. Sandler, “DDX7: Differentiable FM synthesis of musical instrument sounds,” in ISMIR, 2022.
  • [17] J. Colonel, “Music production behaviour modelling,” 2023.
  • [18] “The mixing console — split, inline and hybrids,” https://steemit.com/sound/@jamesub/the-mixing-console-split-inline-and-hybrids, accessed: 2024-02-26.
  • [19] J. Engel, L. H. Hantrakul, C. Gu, and A. Roberts, “DDSP: differentiable digital signal processing,” in ICLR, 2020.
  • [20] S. Nercessian, “Neural parametric equalizer matching using differentiable biquads,” in DAFx, 2020.
  • [21] C. J. Steinmetz, N. J. Bryan, and J. D. Reiss, “Style transfer of audio effects with differentiable signal processing,” JAES, vol. 70, no. 9, 2022.
  • [22] G. Castellano, A. M. Fanelli, and M. Pelillo, “An iterative pruning algorithm for feedforward neural networks,” IEEE transactions on Neural networks, vol. 8, no. 3, 1997.
  • [23] Y. He and L. Xiao, “Structured pruning for deep convolutional neural networks: A survey,” arXiv:2303.00566, 2023.
  • [24] H. Cheng, M. Zhang, and J. Q. Shi, “A survey on deep neural network pruning-taxonomy, comparison, analysis, and recommendations,” arXiv:2308.06767, 2023.
  • [25] H. Liu, K. Simonyan, and Y. Yang, “DARTS: Differentiable architecture search,” in ICLR, 2019.
  • [26] Z. Ye, W. Xue, X. Tan, Q. Liu, and Y. Guo, “NAS-FM: Neural architecture search for tunable and interpretable sound synthesis based on frequency modulation,” arXiv:2305.12868, 2023.
  • [27] C. J. Steinmetz, S. S. Vanka, M. A. Martínez-Ramírez, and G. Bromham, Deep Learning for Automatic Mixing. ISMIR, Dec. 2022.
  • [28] J. Koo et al., “Music mixing style transfer: A contrastive learning approach to disentangle audio effects,” in IEEE ICASSP, 2023.
  • [29] B. Hayes, C. Saitis, and G. Fazekas, “Sinusoidal frequency estimation by gradient descent,” in IEEE ICASSP, 2023.
  • [30] U. Zölzer, Ed., DAFX: Digital Audio Effects, 2nd ed., 2011.
  • [31] M. Fey and J. E. Lenssen, “Fast graph representation learning with pytorch geometric,” arXiv:1903.02428, 2019.
  • [32] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in IEEE ICASSP, 2020.
  • [33] A. Wright and V. Välimäki, “Perceptual loss function for neural modeling of audio systems,” in IEEE ICASSP, 2020.
  • [34] C. J. Steinmetz and J. D. Reiss, “auraloss: Audio focused loss functions in PyTorch,” in Digital Music Research Network One-day Workshop (DMRN+15), 2020.
  • [35] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv:1711.05101, 2017.
  • [36] S. Lee, H.-S. Choi, and K. Lee, “Differentiable artificial reverberation,” IEEE/ACM TASLP, vol. 30, 2022.
  • [37] C. J. Steinmetz, T. Walther, and J. D. Reiss, “High-fidelity noise reduction with differentiable signal processing,” in AES Convention 155, 2023.
  • [38] J. T. Colonel, M. Comunità, and J. Reiss, “Reverse engineering memoryless distortion effects with differentiable waveshapers,” in AES Convention 153, 2022.
  • [39] A. Carson, S. King, C. V. Botinhao, and S. Bilbao, “Differentiable grey-box modelling of phaser effects using frame-based spectral processing,” in DAFx, 2023.
  • [40] B. Hayes, J. Shier, G. Fazekas, A. McPherson, and C. Saitis, “A review of differentiable digital signal processing for music & speech synthesis,” Frontiers in Signal Process., 2023.
  • [41] J. Turian and M. Henry, “I’m sorry for your loss: Spectrally-based audio distances are bad at pitch,” in “I Can’t Believe It’s Not Better!” NeurIPS workshop, 2020.
  • [42] C. Vahidi et al., “Mesostructures: Beyond spectrogram loss in differentiable time–frequency analysis,” JAES, 2023.
  • [43] J. You, B. Liu, Z. Ying, V. Pande, and J. Leskovec, “Graph convolutional policy network for goal-directed molecular graph generation,” NeurIPS, 2018.
  • [44] M. A. Martínez-Ramírez, O. Wang, P. Smaragdis, and N. J. Bryan, “Differentiable signal processing with black-box audio effects,” in IEEE ICASSP, 2021.
  • [45] A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” NeurIPS, 2019.
  • [46] T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A survey,” Journal of Machine Learning Research, vol. 20, no. 55, 2019.
  • [47] D. C. Elton, Z. Boukouvalas, M. D. Fuge, and P. W. Chung, “Deep learning for molecular design—a review of the state of the art,” Molecular Systems Design & Engineering, vol. 4, no. 4, 2019.
  • [48] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.

Appendix A Related Works

A.1 Composition of audio processors

Most audio processors are designed to modify some specific properties of their input signals, e.g., magnitude response, loudness dynamics, and stereo width. As such, combining multiple processors is a common practice to achieve the full effect. Following the main text, we will use the terminology “graph” to represent this composition, although some previous works considered simple structures that allow a more compact form, e.g., a sequence. Now, we outline the previous attempts that tried to estimate the processing graph or its parameters from reference audio. These works differ in task and domain, processors, graph structure, and estimation methods. For example, if the references are dry sources and a wet mixture, this task becomes reverse engineering [11, 12, 13, 17]. In terms of the prediction targets, some fixed the graph and estimated only the parameters [7, 15, 17, 19, 44]. Others tried to predict the graph [13] or both [11, 12, 26, 16]. Table 4 summarizes and highlights such differences.

A.2 Differentiable signal processing

Differentiable processor

Exact implementation or approximation of processors in an automatic differentiation framework, e.g., PyTorch [45], enables parameter optimization via gradient descent. Numerous efforts have focused on converting existing audio processors to their differentiable versions [17, 21, 38, 39, 20, 36]; refer to the recent review [40] and references therein for more details. In many cases, these processors are combined with neural networks, whose computation is done on the GPU. Thus, converting the audio processors to be “GPU-friendly” has been an active research topic. For example, for a linear time-invariant (LTI) system with a recurrent structure, we can sample its frequency response to approximate its infinite impulse response (IIR) instead of directly running the recurrence; the former is faster than the latter [20, 36]. However, it is nontrivial to apply a similar trick to nonlinear, recurrent, or time-varying processors. Typically, further simplifications and approximations are employed, e.g., replacing the nonlinear recurrent part with an IIR filter [21] or treating a linear time-varying system as frame-wise LTI [39]. Sometimes, we can only access input and output signals. In such a case, one can approximate the gradients with finite difference methods [44, 37] or use a pre-trained auxiliary neural network that mimics the processors [7]. In the literature, these are also referred to as “differentiable”; hence, it is rather an umbrella term encompassing all methods that obtain the output signals or gradients within a reasonable amount of time. Nevertheless, our work limits its focus to implementations in the automatic differentiation framework.
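To make the frequency-sampling idea concrete, the following is a minimal sketch in plain Python, not the implementation used in [20, 36]. It samples the frequency response of a biquad filter on the unit circle, inverts the DFT to obtain a finite impulse response, and compares the result with the sample-by-sample recurrence. The filter coefficients, the signal, and all function names are hypothetical choices for illustration; an actual differentiable implementation would use the FFT routines of an automatic differentiation framework rather than the naive DFT below.

```python
import cmath
import math


def biquad_recursive(x, b, a):
    """Run the biquad difference equation sample by sample (a[0] must be 1)."""
    y = []
    for n in range(len(x)):
        acc = b[0] * x[n]
        if n >= 1:
            acc += b[1] * x[n - 1] - a[1] * y[n - 1]
        if n >= 2:
            acc += b[2] * x[n - 2] - a[2] * y[n - 2]
        y.append(acc)
    return y


def frequency_sampled_fir(b, a, num_points):
    """Sample H(z) on the unit circle, then invert the DFT to get an FIR."""
    response = []
    for k in range(num_points):
        z_inv = cmath.exp(-2j * math.pi * k / num_points)
        numerator = b[0] + b[1] * z_inv + b[2] * z_inv ** 2
        denominator = a[0] + a[1] * z_inv + a[2] * z_inv ** 2
        response.append(numerator / denominator)
    # Naive O(N^2) inverse DFT for clarity; an FFT would be used in practice.
    fir = []
    for n in range(num_points):
        total = sum(
            response[k] * cmath.exp(2j * math.pi * k * n / num_points)
            for k in range(num_points)
        )
        fir.append((total / num_points).real)
    return fir


def convolve(x, h):
    """Causal convolution of x with h, truncated to len(x) samples."""
    return [
        sum(h[k] * x[n - k] for k in range(min(n + 1, len(h))))
        for n in range(len(x))
    ]


# Hypothetical stable biquad: poles at radius 0.5, so the response decays fast.
b = [0.2, 0.3, 0.1]
a = [1.0, -0.5, 0.25]
x = [1.0, 0.5, -0.25, 0.0, 0.75] + [0.0] * 27

y_recursive = biquad_recursive(x, b, a)
y_sampled = convolve(x, frequency_sampled_fir(b, a, num_points=64))
max_error = max(abs(p - q) for p, q in zip(y_recursive, y_sampled))
print(f"max abs difference: {max_error:.2e}")
```

The approximation error comes from time-domain aliasing of the truncated impulse response, so it shrinks as the number of frequency samples grows or as the filter poles move away from the unit circle.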

Audio processing graph

Now, consider a composition of multiple differentiable processors; the entire graph remains differentiable due to the chain rule. However, some practical considerations remain. If we fix the processing graph prior to the optimization and the graph size is relatively small, we can implement the “differentiable graph” following the existing implementations [19, 15]. That is, we compute every processor one by one in a pre-defined topological order. Our setting imposes two additional requirements. First, the pruning changes the graph during the optimization. Therefore, our implementation must take a graph and its parameters along with the source signals as input arguments for every forward pass. Note that this feature is also necessary when training a neural network that predicts the parameters of any given graph [11]. Second, the size of our graphs is much larger than the ones from previous works [19, 15, 26, 11]. In this case, the one-by-one computation severely bottlenecks the computation speed. Therefore, we derived a flexible and efficient graph computation algorithm that (i) can take different graphs for each forward pass as input and (ii) performs batched processing of multiple processors within a graph, utilizing the parallelism of GPUs. Finally, we note that other than the differentiation with respect to the input signals $\mathbf{S}$ and parameters $\mathbf{P}$, one might be interested in differentiation with respect to the graph structure $G$. The proposed pruning method performs this to a limited extent; deletion of a node $v_i$ is a binary operation that modifies the graph structure. We relaxed this to a continuous dry/wet weight $w_i$ and optimized it with the audio loss $L_{\mathrm{a}}$ and regularization $L_{\mathrm{p}}$.
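As an illustration of requirement (ii), the sketch below shows one simple way to group the nodes of a processing graph into batches that could be computed in parallel. It is an assumption for illustration, not the algorithm derived in this work: nodes are grouped by their longest-path depth and processor type, so every node in a batch depends only on earlier batches. The function name, node labels, and example graph are hypothetical.

```python
from collections import defaultdict, deque


def schedule_batches(node_types, edges):
    """Group nodes into (depth, type) batches that can run in parallel.

    node_types: dict mapping node id -> processor type (hypothetical labels).
    edges: list of (source, destination) pairs of a directed acyclic graph.
    """
    successors = defaultdict(list)
    in_degree = {v: 0 for v in node_types}
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1

    # Kahn's algorithm, tracking the longest path length to each node.
    depth = {v: 0 for v in node_types}
    queue = deque(v for v, d in in_degree.items() if d == 0)
    visited = 0
    while queue:
        v = queue.popleft()
        visited += 1
        for nxt in successors[v]:
            depth[nxt] = max(depth[nxt], depth[v] + 1)
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)
    if visited != len(node_types):
        raise ValueError("graph contains a cycle")

    # Nodes with equal depth never depend on each other, so same-type nodes
    # at the same depth can share one batched processor call.
    batches = defaultdict(list)
    for v, t in node_types.items():
        batches[(depth[v], t)].append(v)
    return [(key, sorted(batches[key])) for key in sorted(batches)]


# Two tracks, each with an equalizer and a compressor, summed into one mix.
node_types = {0: "in", 1: "in", 2: "eq", 3: "eq", 4: "comp", 5: "comp", 6: "mix"}
edges = [(0, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 6)]
for key, nodes in schedule_batches(node_types, edges):
    print(key, nodes)
```

Because the schedule is recomputed from the graph itself, the same routine applies after nodes are pruned. A depth-based grouping can split same-type nodes into separate batches once the chains become uneven, so a more careful scheduler may be preferable in practice.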

A.3 Graph search

Several independent research efforts in various domains search for graphs that satisfy certain requirements. For example, neural architecture search (NAS) aims to find a neural network architecture that achieves improved performance [46]. In this case, the search space consists of graphs, with each node (or edge) representing one of the primitive neural network layers. One work particularly relevant to ours is differentiable architecture search (DARTS) [25], which relaxes the choice of each layer to a categorical distribution and optimizes it via gradient descent. Theoretically, our method can be naturally extended to this approach; we only need to change our $2$-way choice (prune or not) to an $(N+1)$-way choice (bypass or select one of $N$ processor types). DARTS is clearly more flexible and general, allowing an arbitrary order of processors. However, it also greatly increases the computational cost, as we must compute all $N$ processors to compute their weighted sum for every node. For example, if we want to keep the mixing console structure and allow arbitrary processor choices, the memory complexity becomes $O(N^2)$ instead of the current $O(N)$. In other words, we must pay additional costs to increase the size of the search space. This cost increase is especially critical to us since we have to find a graph for every song. Another popular related domain is the generation/design of molecules with desired chemical properties [47]. One dominant approach for this task is to use reinforcement learning (RL), which estimates each graph by making a sequence of decisions, e.g., adding nodes and edges [43]. RL is an attractive choice since it requires no prior assumptions on graphs and can use arbitrary quality measures that are not differentiable. We also note that RL can be used for NAS [48]. However, applying RL to our task has a risk of obtaining nontrivial mixing graphs that are difficult for practitioners to interpret; we may need a soft regularization penalty that guides the generation process towards familiar structures, e.g., ones like the pruned mixing consoles. Also, it may need much larger computational resources to explore the search space sufficiently.
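The $(N+1)$-way relaxation mentioned above can be sketched as follows. This is a minimal illustration of a DARTS-style node under stated assumptions, not a method evaluated in this work: each node outputs a softmax-weighted sum of a bypass path and $N$ candidate processors. The toy processors (a gain and a hard clipper) and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relaxed_node(x, processors, logits):
    """Softmax-weighted sum over a bypass path and N candidate processors."""
    if len(logits) != len(processors) + 1:
        raise ValueError("need one logit per processor plus one for bypass")
    weights = softmax(logits)
    # Every candidate is evaluated, which is the source of the extra cost.
    outputs = [list(x)] + [proc(x) for proc in processors]
    return [
        sum(w * out[n] for w, out in zip(weights, outputs))
        for n in range(len(x))
    ]


def half_gain(x):
    """Toy processor: attenuate by a factor of two."""
    return [0.5 * v for v in x]


def hard_clip(x):
    """Toy processor: clip samples to the range [-0.5, 0.5]."""
    return [max(-0.5, min(0.5, v)) for v in x]


x = [1.0, -1.0, 0.25]
uniform = relaxed_node(x, [half_gain, hard_clip], logits=[0.0, 0.0, 0.0])
mostly_bypass = relaxed_node(x, [half_gain, hard_clip], logits=[50.0, 0.0, 0.0])
print(uniform)
print(mostly_bypass)
```

With $N$ candidates at every node, each forward pass evaluates all of them, whereas the dry/wet relaxation evaluates a single processor per node.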

Appendix B Dry/wet Pruning Algorithm

Algorithm 2 describes the details of the dry/wet method. For a simpler description, we modified the initialization to include per-type node sets and weights, as in lines 8-12. The termination condition is given in line 13. The trial candidate sampling is implemented in lines 14-15. The candidate pool update is expanded to handle the trial successes and failures separately, as shown in lines 20 and 22-27.

Table 4: A brief summary and comparison of previous works on estimation of compositional audio signal processing.

[15]
  Task & domain: Sound matching $[x] \to [\mathbf{P}]$. The synthesizer parameters $\mathbf{P}$ were estimated to match the reference (target) audio $x$.
  Processors: Oscillators, envelope generators, and filters that allow parameter modulation as an optional input.
  Graph: Any pre-defined directed acyclic graph (DAG). For example, a subtractive synthesizer that comprises 2 oscillators, 1 amplitude envelope, and 1 lowpass filter was used in the experiments.
  Method: Trained a single neural backbone for the reference encoding, followed by multiple prediction heads for the parameters. Optimized with a parameter loss and spectral loss, where the latter is calculated with every intermediate output.

[16]
  Task & domain: Sound matching $[x] \to [\mathbf{P}]$. A frequency-modulation (FM) synthesizer matches recordings of monophonic instruments (violin, flute, and trumpet). Estimates parameters of an operator graph that is empirically searched and selected.
  Processors: Differentiable sinusoidal oscillators, each used as a carrier or modulator with pre-defined frequencies. An additional FIR reverb is added to the FM graph output for post-processing.
  Graph: DAGs with at most 6 operators. Different graphs for different target instruments.
  Method: Trained a convolutional neural network that estimates envelopes from the target loudness and pitch.

[26]
  Task & domain: Sound matching $[x] \to [G, \mathbf{P}]$. Similar setup to the above [16] plus additional estimation of the operator graph $G$.
  Processors: Identical to [16], except for the frequency ratio that can be searched.
  Graph: A subgraph of a supergraph, which resembles a multi-layer perceptron (modulator layers followed by a carrier layer).
  Method: Trained a parameter estimator for the supergraph and found the appropriate subgraph $G$ with an evolutionary search.

[12]
  Task & domain: Reverse engineering $[s, y] \to [G, \mathbf{P}]$ of an audio effect chain from a subtractive synthesizer (commercial plugin).
  Processors: 5 audio effects: compressor, distortion, equalizer, phaser, and reverb. Non-differentiable implementations.
  Graph: Chain of audio effects generated with no duplicate types (therefore 32 possible combinations) and random order.
  Method: Trained a next effect predictor and parameter estimator in a supervised (teacher-forcing) manner.

[13]
  Task & domain: Blind estimation $[y] \to [G]$ and reverse engineering $[s, y] \to [G]$ of guitar effect chains.
  Processors: 13 guitar effects, including non-linear processors, modulation effects, ambience effects, and equalizer filters.
  Graph: A chain of guitar effects. Maximum 5 processors and a total of 221 possible combinations.
  Method: Trained a convolutional neural network with synthetic data to predict the correct combination.

[7]
  Task & domain: Automatic mixing $[\mathbf{S}] \to [\mathbf{P}]$. Estimated parameters of fixed processing chains from source tracks $(K \leq 16)$.
  Processors: 7 differentiable processors, where 4 (gain, polarity, fader, and panning) were implemented exactly. A combined effect of the remaining 3 (equalizer, compressor, and reverb) was approximated with a single pre-trained neural network.
  Graph: Tree structure: applied a fixed chain of the 7 processors for each track, and then summed the chain outputs altogether.
  Method: Trained a parameter estimator (convolutional neural network) with a spectrogram loss end-to-end.

[44]
  Task & domain: Reverse engineering of music mastering $[s, y] \to [\mathbf{P}]$.
  Processors: A multi-band compressor, graphic equalizer, and limiter. Gradient approximated with a finite difference method.
  Graph: A serial chain of the processors.
  Method: Optimized parameters with gradient descent.

[11]
  Task & domain: Blind estimation $[y] \to [G, \mathbf{P}]$ and reverse engineering $[\mathbf{S}, y] \to [G, \mathbf{P}]$. Estimates both the graph and its parameters for singing voice effect $(K = 1)$ or drum mixing $(K \leq 6)$.
  Processors: A total of 33 processors, including linear filters, nonlinear filters, and control signal generators. Some processors are multiple-input multiple-output (MIMO), e.g., allowing auxiliary modulations. Non-differentiable implementations.
  Graph: Complex DAG; splits (e.g., multi-band processing) and merges (e.g., sum and modulation). 30 processors max.
  Method: Trained a convolutional neural network-based reference encoder and a transformer variant for graph decoding and parameter estimation. Both were jointly trained via direct supervision of synthetic graphs (e.g., parameter loss).

[17]
  Task & domain: Reverse engineering $[\mathbf{S}, y] \to [\mathbf{P}]$ of music mixing. Estimated parameters of a fixed chain for each track.
  Processors: 6 differentiable processors: gain, equalizer, compressor, distortion, panning, and reverb.
  Graph: A chain of 5 processors (all above types except the reverb) for each dry track (any other DAG can also be used). The reverb is used for the mixed sum.
  Method: Parameters were optimized with spectrogram loss end-to-end via gradient descent.

Ours
  Task & domain: Reverse engineering $[\mathbf{S}, y] \to [G, \mathbf{P}]$ of music mixing. Estimated a chain of processors and their parameters for each track and submix, where $K \leq 130$.
  Processors: 7 differentiable processors: gain/panning, stereo imager, equalizer, reverb, compressor, noisegate, and delay.
  Graph: A tree of processing chains with a subgrouping structure (any other DAG can also be used). Processors can be omitted but should follow the fixed order.
  Method: Joint estimation of the soft masks (dry/wet weights) and processor parameters. Optimized with the spectrogram loss (and additional regularizations) end-to-end via gradient descent. Accompanied by hard pruning stages.

Table 5: Per-dataset results of the mixing consoles with different processor type configurations.

                         MedleyDB                  MixingSecrets             Internal
Configuration            L_a   L_lr  L_m   L_s     L_a   L_lr  L_m   L_s     L_a   L_lr  L_m   L_s
Base graph               50.7  1.45  1.42  198     7.30  2.16  2.02  22.9    1.12  .951  .940  1.63
+ Gain/panning           .550  .583  .485  .550    .876  .856  .819  .973    .642  .619  .597  .734
+ Stereo imager          .541  .564  .483  .553    .847  .834  .791  .928    .538  .616  .595  .727
+ Equalizer              .450  .453  .390  .504    .700  .698  .622  .780    .522  .497  .467  .626
+ Reverb                 .368  .361  .360  .390    .614  .601  .579  .674    .463  .451  .432  .517
+ Compressor             .315  .304  .297  .356    .558  .542  .512  .637    .396  .377  .347  .482
+ Noisegate              .302  .288  .281  .353    .548  .532  .502  .625    .393  .374  .343  .480
+ Multitap delay (full)  .296  .288  .284  .324    .545  .529  .502  .618    .385  .369  .338  .465

Table 6: Per-dataset results of the pruning. MC: mixing console. BF: brute-force, DW: dry/wet, and H: hybrid.

MedleyDB
Method  τ     L_a   ρ     ρ_g   ρ_s   ρ_e   ρ_r   ρ_c   ρ_n   ρ_d
MC      --    .296  --    --    --    --    --    --    --    --
BF      .01   .305  .63   .35   .81   .54   .77   .67   .71   .53
DW      .01   .302  .56   .32   .79   .42   .74   .57   .56   .43
H       .001  .295  .44   .22   .73   .26   .61   .50   .54   .24
H       .01   .302  .61   .32   .81   .48   .74   .69   .71   .52
H       .1    .375  .83   .59   .93   .83   .92   .80   .84   .90

MixingSecrets
Method  τ     L_a   ρ     ρ_g   ρ_s   ρ_e   ρ_r   ρ_c   ρ_n   ρ_d
MC      --    .545  --    --    --    --    --    --    --    --
BF      .01   .566  .65   .50   .82   .43   .71   .61   .77   .75
DW      .01   .561  .57   .47   .82   .22   .62   .54   .74   .55
H       .001  .550  .43   .35   .76   .13   .42   .43   .55   .38
H       .01   .563  .62   .49   .85   .34   .64   .60   .79   .65
H       .1    .648  .84   .70   .91   .75   .86   .83   .93   .91

Internal
Method  τ     L_a   ρ     ρ_g   ρ_s   ρ_e   ρ_r   ρ_c   ρ_n   ρ_d
MC      --    .385  --    --    --    --    --    --    --    --
BF      .01   .402  .80   .76   .92   .61   .81   .85   .87   .80
DW      .01   .397  .74   .74   .82   .49   .70   .87   .86   .61
H       .001  .388  .59   .48   .77   .40   .57   .78   .77   .39
H       .01   .400  .77   .73   .93   .56   .75   .85   .86   .74
H       .1    .474  .93   .90   .99   .84   .92   .93   .95   .96

Algorithm 2 Music mixing graph search (dry/wet method).
1: Input: a mixing console $G_{\mathrm{c}}$, dry tracks $\mathbf{S}$, and mixture $y$
2: Output: pruned graph $G_{\mathrm{p}}$, parameters $\mathbf{P}$, and weights $\mathbf{w}$
3: $\mathbf{P}, \mathbf{w} \leftarrow \mathrm{Initialize}(G_{\mathrm{c}})$
4: $\mathbf{P}, \mathbf{w} \leftarrow \mathrm{Train}(G_{\mathrm{c}}, \mathbf{P}, \mathbf{w}, \mathbf{S}, y)$
5: $L_{\mathrm{a}}^{\mathrm{min}} \leftarrow \mathrm{Evaluate}(G_{\mathrm{c}}, \mathbf{P}, \mathbf{w}, \mathbf{S}, y)$
6: $G_{\mathrm{p}} \leftarrow G_{\mathrm{c}}$
7: for $n \leftarrow 1$ to $N_{\mathrm{iter}}$ do
8:     $T_{\mathrm{pool}} \leftarrow \mathrm{GetProcessorTypeSet}(V)$
9:     for $t$ in $T_{\mathrm{pool}}$ do
10:         $V_t, \mathbf{w}_t \leftarrow \mathrm{Filter}(V, t), \mathrm{Filter}(\mathbf{w}, t)$
11:         $N_t, r_t, \mathbf{m}_t \leftarrow |V_t|, 0.1, \mathbf{1}$
12:     end for
13:     while $T_{\mathrm{pool}} \neq \emptyset$ do
14:         $t \leftarrow \mathrm{SampleType}(T_{\mathrm{pool}})$
15:         $\bar{V}_t, \bar{\mathbf{m}} \leftarrow \mathrm{GetLeastWeightNodes}(V_t, \mathbf{w}_t, \lfloor N_t r_t \rceil)$
16:         $L_{\mathrm{a}} \leftarrow \mathrm{Evaluate}(G_{\mathrm{p}}, \mathbf{P}, \mathbf{w} \odot \mathbf{m} \odot \bar{\mathbf{m}}, \mathbf{S}, y)$
17:         if $L_{\mathrm{a}} < L_{\mathrm{a}}^{\mathrm{min}} + \tau$ then
18:             $L_{\mathrm{a}}^{\mathrm{min}} \leftarrow \min(L_{\mathrm{a}}^{\mathrm{min}}, L_{\mathrm{a}})$
19:             $\mathbf{m} \leftarrow \mathbf{m} \odot \bar{\mathbf{m}}$
20:             $V_t \leftarrow V_t \setminus \bar{V}_t$
21:         else
22:             if $\lfloor N_t r_t \rceil > 1$ then
23:                 $r_t \leftarrow r_t / 2$
24:             else
25:                 $T_{\mathrm{pool}} \leftarrow T_{\mathrm{pool}} \setminus \{t\}$
26:             end if
27:         end if
28:     end while
29:     $G_{\mathrm{p}}, \mathbf{P}, \mathbf{w} \leftarrow \mathrm{Prune}(G_{\mathrm{p}}, \mathbf{P}, \mathbf{w}, \mathbf{m})$
30:     $\mathbf{P}, \mathbf{w} \leftarrow \mathrm{Train}(G_{\mathrm{p}}, \mathbf{P}, \mathbf{w}, \mathbf{S}, y)$
31: end for
32: return $G_{\mathrm{p}}, \mathbf{P}, \mathbf{w}$
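
The candidate search in lines 8-28 of Algorithm 2 can be sketched in plain Python as follows. This is a simplified illustration rather than the actual implementation: the audio loss is replaced by a user-supplied `evaluate` callable, training and graph pruning (lines 29-30) are omitted, and two guards that are not in the listing are added as assumptions so that the loop always terminates (at least one candidate per trial, and a type is dropped once it has no nodes left). All names and the toy loss are hypothetical.

```python
import math
import random


def search_prune_mask(node_types, weights, evaluate, loss_min, tol, seed=0):
    """One candidate-search pass (lines 8-28 of Algorithm 2), without training.

    node_types: list of processor-type labels, one per node.
    weights: list of dry/wet weights, one per node.
    evaluate: callable(mask) -> loss, a stand-in for Evaluate in the listing.
    Returns the binary mask (1 = keep, 0 = prune) and the updated loss_min.
    """
    rng = random.Random(seed)
    mask = [1] * len(node_types)
    remaining = {}  # per-type candidate nodes V_t
    for i, t in enumerate(node_types):
        remaining.setdefault(t, []).append(i)
    count = {t: len(v) for t, v in remaining.items()}  # N_t
    ratio = {t: 0.1 for t in remaining}                # r_t
    pool = sorted(remaining)                           # T_pool

    while pool:
        t = rng.choice(pool)
        if not remaining[t]:
            pool.remove(t)  # assumption: nothing left to prune for this type
            continue
        # Round to nearest; the lower bound of 1 is an added assumption.
        num = max(1, math.floor(count[t] * ratio[t] + 0.5))
        candidates = sorted(remaining[t], key=lambda i: weights[i])[:num]
        trial = list(mask)
        for i in candidates:
            trial[i] = 0
        loss = evaluate(trial)
        if loss < loss_min + tol:
            # Success: accept the trial and shrink the candidate set.
            loss_min = min(loss_min, loss)
            mask = trial
            remaining[t] = [i for i in remaining[t] if i not in candidates]
        elif num > 1:
            ratio[t] /= 2   # failure: retry with fewer candidates
        else:
            pool.remove(t)  # failure with a single node: stop this type
    return mask, loss_min


# Toy problem: removing a node adds a fixed cost to a base loss of 0.3.
node_types = ["eq"] * 4 + ["reverb"] * 4
weights = [0.01, 0.02, 0.9, 0.8, 0.05, 0.7, 0.9, 0.6]
removal_cost = [0.0, 0.0, 0.5, 0.5, 0.0, 0.5, 0.5, 0.5]


def toy_evaluate(mask):
    return 0.3 + sum(c for c, m in zip(removal_cost, mask) if m == 0)


mask, loss_min = search_prune_mask(node_types, weights, toy_evaluate, 0.3, tol=0.01)
print(mask, loss_min)
```

In this toy setting the nodes with small dry/wet weights are also the ones whose removal is free, so the search prunes exactly those nodes; with a real audio loss, the weights only serve as a heuristic ordering and each trial must still pass the tolerance check.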
Figure 8: Loss increases from the mixing console and pruning ratios for different pruning methods and tolerances.

Appendix C Supplementary Results

  • Tables 5 and 6 report the per-dataset results on the mixing consoles and graph pruning, respectively.

  • Figure 8 compares the pruning methods on $2$ randomly sampled songs using $7$ tolerance settings from $0.001$ to $0.2$.

  • Figure 9 shows multiple graphs obtained by pruning the same console (song) repeatedly.

  • Refer to Figures 10-12 for more pruned graphs obtained with the default setting (hybrid method and $\tau = 0.01$).

  • Figures 14-16 show more spectrogram plots.

Figure 9: Each pruning run (default setting) yields a slightly different graph. Song: EthanHein_GirlOnABridge. Panels: (a)-(f).
Figure 10: Example pruned graphs (default setting). Number of tracks: $K \leq 20$. Panels: (a) Internal_part1_65536; (b) DonCamilloChoir_TrudeTheBumblebee.
Figure 11: Example pruned graphs (default setting). Number of tracks: $K > 20$. Panels: (a) LittleTybee_TheAlchemist; (b) RaftMonk_Tiring.
Figure 12: Example pruned graphs (default setting). Number of tracks: $K > 20$ (continued). Panels: (a) Internal_part2_67692; (b) StevenClark_Bounty.
Figure 13: Matching of target music mixes with mixing consoles and their pruned versions: MedleyDB dataset. Panels: (a) Torres_NewSkin; (b) TablaBreakbeatScience_RockSteady.
Figure 14: Matching of target music mixes with mixing consoles and their pruned versions: MedleyDB dataset. Panels: (a) MusicDelta_SwingJazz; (b) ChrisJacoby_BoothShotLincoln.
Figure 15: Matching of target music mixes with mixing consoles and their pruned versions: MixingSecrets dataset. Panels: (a) HowlProject_IfIWereABell; (b) IanDearden_TeraniaCreekWalking.
Figure 16: Matching of target music mixes with mixing consoles and their pruned versions: Internal dataset. Panels: (a) Internal_66680; (b) Internal_67954.