Searching For Music Mixing Graphs: A Pruning Approach
Abstract
Music mixing is compositional: experts combine multiple audio processors to achieve a cohesive mix from dry source tracks. We propose a method to reverse engineer this process from the input and output audio. First, we create a mixing console that applies all available processors to every chain. Then, after the initial console parameter optimization, we alternate between removing redundant processors and fine-tuning. We achieve this through differentiable implementation of both processors and pruning. Consequently, we find a sparse mixing graph whose matching quality is nearly identical to that of the full mixing console. We apply this procedure to dry-mix pairs from various datasets and collect graphs that can also be used to train neural networks for music mixing applications.
1 Introduction
Motivation
From a signal processing perspective, modern music is more than the mere sum of source tracks. Mixing engineers combine and control multiple processors to balance the sources in terms of loudness, frequency content, spatialization, and much more. Many attempts have been made to uncover parts of this intricate process. Some have gathered expert knowledge [1, 2] and built rule-based systems [3, 4]. More recent work has adopted data-driven approaches. Neural networks have been trained to map source tracks directly to a mix [5, 6] or to estimate parameters of a fixed processing chain [7]. Yet, efforts to address the compositional aspects of music mixing, such as which processors to use for each track, are still limited. One possible remedy is to consider a graph representation whose nodes and edges are processors and connections between them, respectively. In other words, each graph contains the essential information about the mixing process. However, public datasets provide only the dry source and mixed audio, without such mixing graphs or related metadata [8, 9, 10], which hinders this line of research. This is not surprising; besides the cost of crowdsourcing, it is difficult to standardize the mixing data from multiple engineers with different equipment. One recent work [11] sidestepped this issue by creating synthetic graphs and using them for training. However, this approach has downsides. Neural networks trained this way generalize poorly unless the synthetic data distribution matches the real world. Similar data-related issues arise in different domains, e.g., audio effect chain recognition [12, 13] and synthesizer sound matching [14, 15, 16]. Furthermore, real-world multitrack mixes have many more source tracks and much larger graphs, making synthetic data generation more challenging. Therefore, it is desirable to have a systematic, reliable, and scalable method for collecting graphs.
All these contexts lead us to ask: Can we find the mixing graphs solely from audio?
Problem definition
Precisely, for each song (piece) whose dry sources $\mathbf{s}$ and mix $y$ are available, we aim to find an audio processing graph $G$ and its processor parameters $\mathbf{P}$ so that processing the dry sources results in a mix $\hat{y}$ that closely matches the original mix $y$. With a loss $L_a$ that measures the match quality on the mixture audio domain and regularization $L_r$, our objective can be written as follows,
$$G^*, \mathbf{P}^* = \arg\min_{G, \mathbf{P}} \left[ L_a(\hat{y}, y) + L_r(G, \mathbf{P}) \right]. \tag{1}$$
Contributions
One might want to explore the candidate graphs without any restriction. However, this makes the problem ill-posed and underdetermined. The graph’s combinatorial nature makes the search space extremely large. Furthermore, we have to find the processor parameters jointly. As a result, numerous pairs of graphs and parameters can have similar match quality. Therefore, it is desirable to add some restrictions, e.g., preferring structures that are widely used by practitioners. To this end, we resort to the following pruning-based search; see Figure 1 for a visual illustration. Inspired by a recent work [17], we first create a so-called “mixing console” (see Figure 2a for an example). It applies a fixed processing chain to each source. Then, it subgroups the outputs, applies the chain again, and sums the processed subgroups to obtain a final mix . This resembles the traditional hybrid mixing console [18]. Each chain comprises processors, including an equalizer, compressor, and multitap delay. We implement all of them in a differentiable manner [19, 20, 21]. This allows end-to-end optimization of all parameters with an audio-domain loss via gradient descent. After this initial console training, we proceed to the pruning stage. Here, we search for a maximally pruned graph and its parameters while maintaining the match quality of the mixing console up to a certain tolerance ; this is shown as a circle centered at in Figure 1. Also, see Figure 2b for an example pruned graph. We use iterative pruning, alternating between the pruning and fine-tuning, i.e., optimization of the remaining parameters [22]. To collect graphs from multiple songs, it is crucial to make the entire search efficient and fast. Pruning, in particular, takes a significant amount of computation time; hence, we investigate efficient and effective methods for pruning. During the pruning, we need to find a subset of nodes that can be removed while not harming the match quality. 
To achieve this, we view each processor’s “dry/wet” parameter as an approximate importance score and use it to select the candidate nodes. This approach gives variants of the pruning method with different trade-offs between the computational cost and resulting sparsity. It also draws connections to neural network pruning [23, 24] where the binary pruning operation is relaxed to continuous weights. Note that casting the graph search into pruning is a double-edged sword. The pruning only removes the processors and does not consider all possible signal routings, reducing the search space (from grey to colored regions in Figure 1). Consequently, it does not improve the match quality over the mixing consoles. Nevertheless, the pruned graph follows the real-world practice of selectively applying appropriate processors. In other words, the sparsity is crucial for the graph’s interpretability. Also, it keeps the search cost in a practical range, which might be challenging with other alternatives [25, 26]. Our method serves as a standalone reverse engineering algorithm [17], but it can also be used to collect pseudo-label data to train neural networks for music mixing applications. For example, we may extend existing methods for automatic mixing [3, 4, 5, 6, 7, 27] and mixing style transfer [28] to output the graphs. This allows end users to interpret and control the estimated outputs.
Data
We first report a list of datasets to which we can apply our method. For each song, we need a pair of dry sources and a final mixture . Additionally, we use subgrouping information that describes how dry tracks are grouped together. Therefore, we use the MedleyDB dataset [8, 9] as it provides all of them. We also add the MixingSecrets library [10]. Since it only provides the audio, we manually subgrouped each track based on its instrument. Finally, we include our private dataset of Western music mixes from multiple engineers (denoted as Internal). The resulting ensemble comprises songs (, , and songs for each respective dataset). The number of dry tracks ranges from to , and the number of subgroups ranges from to (see Figure 6 for the statistics). Except for the final pruned graph collection stage (Section 3.4), we use a random subset for the evaluations (a total of songs, songs for each dataset). Every signal is stereo and resampled to sampling rate.
Supplementary materials
Refer to the following link for audio samples, pruned graphs, and appendices that contain additional details: https://sh-lee97.github.io/grafx-prune.
2 Differentiable Processing on Graphs
An audio processing graph $G = (V, E)$ is assumed to be directed and acyclic ($V$ and $E$ denote the set of nodes and edges, respectively). Each node $v_i \in V$ is either a processor or an auxiliary module and has its type $t_i$, e.g., e for an equalizer. Each processor takes an audio $u_i$ and a parameter vector $\mathbf{p}_i$ as input and outputs a processed signal $f_i(u_i, \mathbf{p}_i)$. Then, we further mix the input and this processed result with a “dry/wet” weight $w_i$. Hence, the output $y_i$ of the processor is given as follows,
$$y_i = w_i f_i(u_i, \mathbf{p}_i) + (1 - w_i) u_i. \tag{2}$$
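A minimal sketch of the dry/wet mixing described above, written in plain Python for illustration. The names `dry_wet` and `halve` are hypothetical, and the range check on the weight is an assumption of this sketch rather than something stated in the text.

```python
from typing import Callable, List

Signal = List[float]


def dry_wet(u: Signal, process: Callable[[Signal], Signal], w: float) -> Signal:
    """Blend the processed (wet) signal with the unprocessed (dry) input."""
    # Assumption of this sketch: the weight is kept inside [0, 1].
    if not 0.0 <= w <= 1.0:
        raise ValueError("dry/wet weight must lie in [0, 1]")
    wet = process(u)
    return [w * y + (1.0 - w) * x for x, y in zip(u, wet)]


def halve(u: Signal) -> Signal:
    """Toy stand-in for a real processor: attenuate by a factor of two."""
    return [0.5 * x for x in u]


u = [1.0, -2.0, 4.0]
bypassed = dry_wet(u, halve, 0.0)   # weight 0 leaves the input untouched
fully_wet = dry_wet(u, halve, 1.0)  # weight 1 returns the processed signal
blended = dry_wet(u, halve, 0.5)    # equal blend of both
```

A weight of zero reproduces the input exactly, which is why a processor with zero weight behaves like a bypassed (removable) node.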
We have the following auxiliary modules:
- Input: It outputs one of the dry sources .
- Mix: It outputs the sum of its incoming signals.
- Output: The sum of its inputs is taken as the final output .
Each edge represents a “cable” that sends an output signal to another node as input. Throughout the text, we denote an ordered collection from multiple nodes with a boldface letter, e.g., for a weight vector, for a source tensor, and for a dictionary with processor types as keys and their parameter tensors as values. Under this notation, our task is to find , , and from and .
2.1 Differentiable Implementation
Considering the music mixing, we use the following processors.
- Gain/panning: We control both the loudness and the stereo panning of the input audio by multiplying each channel by a learnable scalar.
- Stereo imager: We change the stereo width of the input by modifying the loudness of the side channel (left minus right).
- Equalizer: We use a finite impulse response (FIR) with a length of to modify the input’s magnitude response. The FIR is parameterized with its log magnitude (thus parameters). We apply inverse FFT of the magnitude with zero phase, obtain a zero-centered FIR, and multiply it with a Hann window. We apply the same FIR to both the left and right channels.
- Reverb: We employ seconds of filtered noise as an impulse response for reverberation. First, we create a -channel uniform noise, where these channels represent the mid and side. We filter the noise by multiplying an element-wise -channel magnitude mask to its short-time Fourier transform (STFT), where the FFT sizes and hop lengths are and , respectively. This mask is constructed using the reverberation’s initial and decaying log magnitudes. After the masking, we obtain the mid/side filtered noise via inverse STFT, convert it to stereo, and perform channel-wise convolutions with the input to get an output.
- Compressor: We use a slight variant of the recently proposed differentiable dynamic range compressor [21]. First, we obtain the input’s smooth energy envelope. The smoothing is typically done with a ballistics filter, but we instead use a one-pole filter for a speedup on GPU. Then, we compute the desired gain reduction from the envelope and apply it to the input audio.
- Noisegate: Except for the gain computation, its implementation is the same as the compressor.
- Multitap delay: For each (left and right) channel, we employ independent seconds of delay effects with a single delay for every interval. To optimize delay lengths using gradient descent, we employ surrogate complex damped sinusoids [29]. Each sinusoid is converted to a delayed soft impulse via inverse FFT. Its angular frequency represents a continuous relaxation of the discrete delay length. Each delay is filtered with a length- FIR equalizer to mimic the filtered echo effect [30].
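The two simplest processors in the list above, gain/panning and the stereo imager, can be sketched in a few lines. This is an illustrative plain-Python version, not the differentiable implementation used for optimization; the function names and the 1/2 normalization of the mid/side conversion are assumptions of this sketch.

```python
from typing import List, Tuple

Stereo = Tuple[List[float], List[float]]  # (left, right)


def gain_pan(x: Stereo, gain_left: float, gain_right: float) -> Stereo:
    """Scale each channel by its own scalar (loudness and panning at once)."""
    left, right = x
    return [gain_left * s for s in left], [gain_right * s for s in right]


def stereo_imager(x: Stereo, side_gain: float) -> Stereo:
    """Change the stereo width by scaling the side channel."""
    left, right = x
    # Assumed convention: mid = (L + R) / 2 and side = (L - R) / 2.
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [side_gain * (l - r) / 2 for l, r in zip(left, right)]
    new_left = [m + s for m, s in zip(mid, side)]
    new_right = [m - s for m, s in zip(mid, side)]
    return new_left, new_right


x = ([1.0, 0.0], [0.0, 1.0])
unchanged = stereo_imager(x, 1.0)  # unit side gain is the identity
mono = stereo_imager(x, 0.0)       # zero side gain collapses to mono
wide = stereo_imager(x, 2.0)       # larger side gain widens the image
panned = gain_pan(x, 2.0, 0.5)     # louder left, quieter right
```

With this convention a side gain of one leaves the signal untouched, so the imager reduces to a bypass in that case.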
Batched node processing
It is common to compute the graph output signal by processing each node one by one [15, 19]. However, this severely bottlenecks the computation speed for large mixing graphs. Therefore, we instead batch-process multiple nodes in parallel. For the graph in Figure 2b, we can batch-process equalizer e, noisegates n, and gain/pannings g sequentially. Then, we aggregate the intermediate outputs to subgroup mixes m (also in parallel). This part is identical to the “message passing” of graph neural networks, so we adopt their implementations [31]. We repeat these parallel computations until we reach the output node o. By doing so, we obtain the output faster; in this example, the number of sequential processing steps is reduced from (one-by-one) to (optimal). We empirically found that up to speedup can be achieved for the pruned graphs with an RTX 3090 GPU.
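To make the batching idea concrete, the following sketch groups nodes of the same type whose inputs are already available and processes them as one batch. It is a simple greedy heuristic written for illustration, not the scheduling used in the experiments and not guaranteed to be optimal; the function name, node names, and the toy graph are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def batched_schedule(types: Dict[str, str],
                     edges: List[Tuple[str, str]]) -> List[List[str]]:
    """Greedily group ready nodes of one type into a batch per step."""
    preds = defaultdict(set)
    for src, dst in edges:
        preds[dst].add(src)
    done = set()
    schedule = []
    while len(done) < len(types):
        # A node is ready once all of its predecessors have been processed.
        ready = [n for n in types if n not in done and preds[n] <= done]
        if not ready:
            raise ValueError("graph contains a cycle")
        by_type = defaultdict(list)
        for n in ready:
            by_type[types[n]].append(n)
        # Pick the type with the most ready nodes (ties broken by type name).
        best = max(sorted(by_type), key=lambda t: len(by_type[t]))
        batch = sorted(by_type[best])
        schedule.append(batch)
        done.update(batch)
    return schedule


# Toy graph: two sources, each with an equalizer and a gain/panning,
# summed by one mix node and sent to the output.
types = {"i1": "i", "i2": "i", "e1": "e", "e2": "e",
         "g1": "g", "g2": "g", "m": "m", "o": "o"}
edges = [("i1", "e1"), ("e1", "g1"), ("g1", "m"),
         ("i2", "e2"), ("e2", "g2"), ("g2", "m"), ("m", "o")]
schedule = batched_schedule(types, edges)
num_batched_steps = len(schedule)   # sequential steps with batching
num_one_by_one_steps = len(types)   # sequential steps without batching
```

In this toy graph the eight nodes are covered in five sequential steps, since the two equalizers (and likewise the two gain/pannings) run together.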
2.2 Mixing Console
We construct a mixing console as follows (see Figure 2a).
- (i) We add an input node i for each source track.
- (ii) We connect a serial chain (with a fixed order) of an equalizer e, compressor c, noisegate n, stereo imager s, gain/panning g, multitap delay d, and reverb r for each input.
- (iii) We subgroup and sum the processed tracks with mix nodes m based on the prepared subgrouping information.
- (iv) We apply the same chain ecnsgdr to each mix output, then pass it to the output node o (we omit the mix module here).
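The construction steps above can be sketched as a small graph builder. This is an illustrative plain-Python version under assumed conventions: the function name `build_console`, the `src:`/`grp:` node naming, and the example subgrouping are all hypothetical.

```python
from typing import Dict, List, Tuple

# Fixed chain order: equalizer, compressor, noisegate, stereo imager,
# gain/panning, multitap delay, reverb.
CHAIN = "ecnsgdr"


def build_console(
    subgroups: Dict[str, List[str]],
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Return (node types, edges) for a console given {subgroup: sources}."""
    types: Dict[str, str] = {"o": "o"}
    edges: List[Tuple[str, str]] = []

    def add_chain(start: str, prefix: str) -> str:
        """Append the serial chain after `start` and return its last node."""
        prev = start
        for t in CHAIN:
            node = f"{prefix}/{t}"
            types[node] = t
            edges.append((prev, node))
            prev = node
        return prev

    for group, sources in subgroups.items():
        mix = f"grp:{group}/m"
        types[mix] = "m"
        for src in sources:
            inp = f"src:{src}/i"
            types[inp] = "i"                      # step (i)
            last = add_chain(inp, f"src:{src}")   # step (ii)
            edges.append((last, mix))             # step (iii)
        # Step (iv): same chain on the subgroup mix, then to the output node.
        edges.append((add_chain(mix, f"grp:{group}"), "o"))
    return types, edges


types, edges = build_console({"drums": ["kick", "snare"], "vocals": ["lead"]})
num_processors = sum(1 for t in types.values() if t in CHAIN)
```

For three sources in two subgroups this gives 35 processors (five chains of seven) plus six auxiliary nodes.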
2.3 Optimization
Before exploring the pruning of each mixing console, as a sanity check, we first evaluate its match performance. To investigate how much each processor type contributes to the match quality, we start with a base graph, a mixing console with no processors that simply sums all the inputs. Then, we add each processor type one by one to the processor chain (see the first column of Table 1). We optimize and evaluate all these preliminary graphs for each song. For each graph, we train its parameters and weights simultaneously with an audio-domain loss given as follows,
(3) |
where each term is a variant of multi-resolution STFT loss [32] (, : left/right, : mid, : side)
(4) |
Here, and denote the Mel spectrograms of the target and predicted mixture, respectively. , , and denote the number of frames, norm and Frobenius norm, respectively. We use FFT sizes of , , and , and hop sizes are of their respective FFT sizes. The number of Mel filterbanks is set to for all scales. We apply A-weighting before each STFT [33]. The per-channel loss weights are set to , , and . The implementation is based on auraloss [34]. We further add a regularization that promotes gain-staging, a common practice of audio engineers that keeps the total energy of input and output roughly the same. This is achieved with the following loss:
(5) |
where and denote mid channel and norm, respectively. We apply this regularization to a subset of processors that comprises all equalizers, reverbs, and multitap delays. This allows us to (i) eliminate redundant gains that these linear time-invariant (LTI) processors could create and (ii) restrict the parameters to a reasonable range. Therefore, with $L_a$ denoting the audio loss and $L_g$ the gain-staging regularization accumulated over these processors, the total loss is given as
$$L(\mathbf{P}, \mathbf{w}) = L_a + \alpha_g L_g, \tag{6}$$
where the gain-staging weight $\alpha_g$ is set to . Here, we used a slightly different notation from Equation 1 to emphasize what is optimized. Each console is optimized for steps using AdamW [35] with a learning rate. For each step, we randomly sample a region of dry sources (thus the batch size is ), compute the mix , and compare its last with the corresponding ground-truth . Note that the first second is used only for the “warm-up” of the processors with long states such as compressors and reverbs.
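The region sampling and warm-up handling described above can be illustrated as follows. This is a toy sketch: the function names are hypothetical, the lengths are arbitrary small numbers rather than the values used in the experiments, and a mean absolute error stands in for the multi-resolution STFT loss.

```python
import random
from typing import List, Tuple


def sample_region(num_samples: int, region_len: int,
                  rng: random.Random) -> Tuple[int, int]:
    """Pick a random [start, end) region that fits inside the song."""
    if region_len > num_samples:
        raise ValueError("region is longer than the song")
    start = rng.randrange(num_samples - region_len + 1)
    return start, start + region_len


def tail_error(pred: List[float], target: List[float],
               warmup_len: int) -> float:
    """Mean absolute error computed only after the warm-up portion."""
    p, t = pred[warmup_len:], target[warmup_len:]
    return sum(abs(a - b) for a, b in zip(p, t)) / len(p)


rng = random.Random(0)
start, end = sample_region(num_samples=100, region_len=20, rng=rng)

target = [0.0] * 20
pred = [1.0] * 5 + [0.0] * 15  # mismatch only inside the warm-up part
err_after_warmup = tail_error(pred, target, warmup_len=5)
err_whole_region = tail_error(pred, target, warmup_len=0)
```

Discarding the warm-up part keeps transient errors of stateful processors, whose internal state has not yet settled, out of the loss.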
Base graph (sum of dry sources) | |||||
---|---|---|---|---|---|
+ Gain/panning | ecnsgdr | ||||
+ Stereo imager | ecnsgdr | ||||
+ Equalizer | ecnsgdr | ||||
+ Reverb | ecnsgdr | ||||
+ Compressor | ecnsgdr | ||||
+ Noisegate | ecnsgdr | ||||
+ Multitap delay (full) | ecnsgdr |
2.4 Results
Table 1 reports the evaluation results that are calculated over the entire song. First, the base graph results in an audio loss of . The side-channel loss is especially large as most source tracks are close to mono while the target mixes have wide stereo images. With the gain/pannings and stereo imagers, we can achieve “rough mixes” with a loss of . Then, we fill in the missing details with the remaining processor types. Every type improves the match, and the full mixing console reports a loss of . Also, the top rows of Figure 7 show mid/side log-magnitude STFTs of the target mixes, matches of the mixing consoles, and their errors. We report the results with , , , and types where the choice of processors and their order follow Table 1; see the supplementary page for the results on other configurations and additional songs. Again, we can observe that adding each type improves the match from the spectrogram error plots. Furthermore, each song benefits more from different types; for the song RockSteady, the multitap delays improve the match more than the reverbs (Figure 7b), which is different from the average trend. Yet, this is expected since the original mix heavily uses the delay effects. Finally, we note that mixes from MixingSecrets are more challenging to match than the others; it reports a mean audio loss of , while MedleyDB and Internal report and , respectively.
3 Music Mixing Graph Search
Considering the full mixing console as an upper bound in terms of the matching performance, we want to find a sparser graph with a similar match quality. We achieve this by pruning the console as much as possible while keeping the loss increase up to a tolerance threshold . This objective can be written as
(7) |
where and denote the pruned graph’s node set and its cardinality, respectively. We define the pruning as removal of the nodes and re-routing of their edges, in a way that is equivalent to setting them to “bypass,” i.e., for . Also, signifies that we are (ideally) interested in the optimized audio loss. We only prune the processors, not the auxiliary nodes. Hence, we define a pruning ratio as the number of pruned processors divided by the number of processors in the initial console.
3.1 Iterative Pruning
Finding the optimal (sparsest) solution is prohibitively expensive. First, due to the interaction between the processors, we need a combinatorial search. As such, we instead assume their independence and prune the processors in a greedy manner. Following the iterative approach [22], we gradually remove processors whenever the tolerance condition is satisfied. Second, even under this setup, we would still need to fine-tune each intermediate pruned graph before evaluating the tolerance condition. To keep the computational complexity reasonable, we simply omit this fine-tuning, at the cost of possibly missing more removable processors. Our method is summarized in Algorithm 1 (in the following, numbers in parentheses denote its line numbers). First, we construct a mixing console , optimize its parameters and dry/wet weights , and evaluate the loss (3-5). This validation loss serves as a pruning threshold with the tolerance . Then, we alternate between pruning and fine-tuning, i.e., further optimization of the remaining parameters and weights (7-20). Each pruning stage consists of multiple trials, which sample subsets of candidates from the set of remaining processors (10) and check whether they are removable (12). We keep the pruning if the result satisfies the constraint or cancel it otherwise (12-15). We repeat this process until the terminal condition (9) is satisfied. Implementation-wise, we multiply binary masks, and , to the weight vector to mimic the pruning during the trials (11). After that, we actually update the graph and remove the pruned processors’ parameters and weights for faster search (18). Occasionally, albeit rarely, the pruning improves the match. In this case, we update the threshold (13).
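The alternation between pruning trials and fine-tuning can be sketched with a toy example. This is a simplified illustration, not a transcription of Algorithm 1: candidates are tried one at a time, the tolerance is treated as an absolute loss increase, fine-tuning is a no-op callback, and the toy loss and all names are hypothetical.

```python
import random
from typing import Callable, Iterable, Set


def iterative_prune(processors: Iterable[str],
                    loss_fn: Callable[[Set[str]], float],
                    tolerance: float,
                    num_rounds: int,
                    fine_tune: Callable[[Set[str]], None],
                    rng: random.Random) -> Set[str]:
    """Greedily remove processors while the loss stays within tolerance."""
    active = set(processors)
    reference = loss_fn(active)  # loss of the optimized console
    for _ in range(num_rounds):
        pool = sorted(active)
        rng.shuffle(pool)
        for candidate in pool:
            trial = active - {candidate}  # mimics masking the candidate
            loss = loss_fn(trial)
            if loss <= reference + tolerance:
                active = trial            # keep the pruning
                # Pruning occasionally improves the match: tighten threshold.
                reference = min(reference, loss)
            # Otherwise the trial is cancelled and `active` is unchanged.
        fine_tune(active)                 # optimize remaining parameters
    return active


# Toy loss: each removed processor adds its "contribution" to a base loss.
contribution = {"eq": 0.5, "comp": 0.004, "gate": 0.003, "reverb": 0.0}


def toy_loss(active: Set[str]) -> float:
    return 1.0 + sum(c for name, c in contribution.items()
                     if name not in active)


pruned = iterative_prune(contribution, toy_loss, tolerance=0.01,
                         num_rounds=2, fine_tune=lambda a: None,
                         rng=random.Random(0))
pruning_ratio = 1 - len(pruned) / len(contribution)
```

In this toy case only the equalizer survives, because removing it would raise the loss far beyond the tolerance, while the other three processors together stay within it.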
3.2 Candidate Sampling
The remaining design choices are choosing an appropriate candidate set (10, 16) and deciding when to terminate the trials (9). We explore the following approaches.
- Brute-force: We random-sample every processor one by one, i.e., . This granularity could achieve high sparsity, but comes with a large computational cost.
- Dry/wet: For efficient pruning, we need an informed guess of each node’s importance. Intuitively, we can use each dry/wet weight as an approximate importance. This observation leads to the following. For each pruning iteration:
  - (i) We create a set of remaining processor types . Next, we count the number of processors of each type , denoted as .
  - (ii) For each trial, we sample a type and choose the smallest-weight processors of that type as candidates. The number of candidates is set to where denotes the portion of the chosen processors and is initialized to for every pruning iteration.
  - (iii) When the trial fails, we perform one of the following. If , we halve the candidate set, i.e., . Otherwise, i.e., if , we finish the search of this type by removing it from the pool as .

  This way, we can skip large-weight nodes and evaluate multiple candidates, reducing the total number of trials. Note that if we set , this method is similar to the simple binary search. However, it can lead to over-pruning of specific types sampled early in (ii). Hence, we set to a more conservative value .
- Hybrid: Solely relying on the weight values could miss some processors that can be pruned but have large weights. We mitigate this by combining the above two, running the brute-force method for every iteration.
By default, we use the hybrid method with tolerance .
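The dry/wet candidate sampling can be sketched as below. Several details are assumptions of this illustration rather than statements from the text: a trial is halved only when it holds more than one candidate, a successful trial keeps the current portion for that type, the candidate count is rounded down with a minimum of one, and the initial portion of 0.5 is an arbitrary example value. All names are hypothetical.

```python
import math
import random
from typing import Callable, Dict, Set


def dry_wet_prune(weights: Dict[str, float],
                  types: Dict[str, str],
                  is_removable: Callable[[Set[str]], bool],
                  init_ratio: float,
                  rng: random.Random) -> Set[str]:
    """One pruning iteration using dry/wet weights as importance scores."""
    remaining = set(weights)
    ratio = {t: init_ratio for t in set(types.values())}
    pool = sorted(ratio)  # types that may still be searched
    while pool:
        t = rng.choice(pool)
        # Processors of the sampled type, smallest weight first.
        members = sorted((p for p in remaining if types[p] == t),
                         key=lambda p: weights[p])
        if not members:
            pool.remove(t)
            continue
        k = max(1, math.floor(ratio[t] * len(members)))
        candidates = set(members[:k])
        if is_removable(candidates):
            remaining -= candidates   # trial succeeded: prune them
        elif k > 1:
            ratio[t] /= 2             # trial failed: halve the candidate set
        else:
            pool.remove(t)            # single candidate failed: stop this type
    return remaining


weights = {"e1": 0.9, "e2": 0.05, "e3": 0.02, "e4": 0.8, "r1": 0.01, "r2": 0.6}
types = {name: name[0] for name in weights}
redundant = {"e2", "e3", "r1"}  # toy ground truth of removable processors

kept = dry_wet_prune(weights, types,
                     is_removable=lambda c: c <= redundant,
                     init_ratio=0.5, rng=random.Random(0))
```

Here the three small-weight processors are removed in two successful trials, and the search of each type ends as soon as its smallest remaining processor cannot be removed. A redundant processor with a large weight would be missed, which is the weakness the hybrid method addresses.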
3.3 Optimization
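We use the same audio loss $L_a$ and gain-staging regularization $L_g$ as in Section 2.3. To promote sparsity, we add a weight regularization $L_p$, a norm of the weight $\mathbf{w}$. Hence, the full objective is as follows,
$$L(\mathbf{P}, \mathbf{w}) = L_a + \alpha_g L_g + \alpha_p L_p, \tag{8}$$
where $\alpha_p$ is the sparsity coefficient.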
We first train the console for steps. Then, we repeat rounds of pruning, each with -step fine-tuning. As a result, the total number of optimization steps is the same as the previous console training. During the first steps of the pruning phase, we linearly increase the sparsity coefficient from to . While we halved the full console optimization steps, which could lead to increased loss, this is justified by the tight resource constraints. With an RTX 3090 GPU, each song took about , , and using the brute-force, dry/wet, and hybrid methods, respectively.
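The linear increase of the sparsity coefficient can be written as a small schedule function. This is a sketch only; the function name and the numbers in the usage lines are placeholders, not the values used in the experiments.

```python
def sparsity_coefficient(step: int, ramp_steps: int,
                         start: float, end: float) -> float:
    """Linearly ramp the coefficient from `start` to `end`, then hold it."""
    if ramp_steps <= 0:
        return end
    frac = min(max(step / ramp_steps, 0.0), 1.0)
    return start + frac * (end - start)


# Placeholder values chosen only to show the shape of the schedule.
at_begin = sparsity_coefficient(0, ramp_steps=100, start=0.0, end=0.2)
at_middle = sparsity_coefficient(50, ramp_steps=100, start=0.0, end=0.2)
after_ramp = sparsity_coefficient(200, ramp_steps=100, start=0.0, end=0.2)
```

Ramping the coefficient gradually avoids pushing the weights toward zero abruptly right after the console training.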
3.4 Results
Pruning process
Figure 3 shows how the pruning progresses. Each graph’s sparsity increases gradually while its weights adapt over time. This trend differs across processor types. The mean objective metrics are reported in Table 2. The default setting reports an average audio loss of , an increase from the full consoles, slightly exceeding the tolerance . This was expected due to the shorter full console training. The average pruning ratio is and the equalizer and stereo imager are the most and least frequently retained types ( and ), respectively. We note that MedleyDB and MixingSecrets report similar pruning ratios of and , respectively. However, the Internal graphs are sparser; their average pruning ratio is .
Mix console | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Brute-force | ||||||||||
Dry/wet | ||||||||||
Hybrid | ||||||||||
Sampling method comparison
Here, we fix the tolerance to and compare the candidate sampling approaches; see Table 2. As expected, the brute-force method achieves the highest sparsity, reporting a pruning ratio of . Its average audio loss is also the highest, , an increase from the mixing console result. The dry/wet method prunes the least with , lower than the brute-force method. However, its audio loss is the lowest, , as more processors remained. We can investigate the cause of this difference in sparsity by analyzing the relationship between each dry/wet weight and the loss increase caused by pruning the processor defined as follows,
(9) |
Figure 4 shows scatterplots for random-sampled songs, one for each song. Each point corresponds to each processor after the initial console training. To maximize the sparsity using the dry/wet method, a monotonic relationship between the weights and loss increases is desirable, which is unfortunately not the case. Yet, a positive correlation exists, and this becomes more evident when we analyze the relationship for each type separately, justifying the per-type candidate selection. Still, we cannot completely remove the weakness of the dry/wet method, leading us to the hybrid strategy as a compromise. We note that the pruning methods are not only different in sparsity but also in trade-offs between sparsity and match performance. By evaluating the methods with more fine-grained tolerance settings ( values from to ), we observed that the brute-force method finds graphs with better matches even with the same graph size, closely followed by the hybrid method; refer to the supplementary page for the details.
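The analysis above relates each dry/wet weight to the loss increase caused by pruning that processor alone. A toy sketch of that measurement, together with a rank correlation, is given below. The toy loss, the processor names, and the use of Spearman correlation (without tie handling) are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set


def loss_increases(processors: List[str],
                   loss_fn: Callable[[Set[str]], float]) -> Dict[str, float]:
    """Loss increase when each processor is pruned on its own."""
    full = loss_fn(set(processors))
    return {p: loss_fn(set(processors) - {p}) - full for p in processors}


def ranks(values: List[float]) -> List[float]:
    """Rank positions of the values (ties are not handled in this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order):
        result[i] = float(rank)
    return result


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


weights = {"a": 0.1, "b": 0.5, "c": 0.9}
contribution = {"a": 0.0, "b": 0.2, "c": 0.1}


def toy_loss(active: Set[str]) -> float:
    return 1.0 + sum(c for p, c in contribution.items() if p not in active)


names = sorted(weights)
delta = loss_increases(names, toy_loss)
rho = spearman([weights[p] for p in names], [delta[p] for p in names])
```

In this toy case the correlation is positive but below one: the processor with the largest weight is not the one whose removal hurts the most, which mirrors the non-monotonic relationship noted above.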
Choice of tolerance
Finally, we analyze the effect of the value of tolerance . Even with a very low tolerance , we can nearly halve the number of processors, i.e., . If we set the value too high, e.g., , the resulting graphs are highly sparse but their matches degrade (, i.e., increase). The default setting of seems “just right,” balancing the match performance and graph sparsity. We can verify this with the spectrogram errors (bottom rows of each subplot; see Figure 7 and supplementary page). There is no noticeable degradation from the full consoles to and .
Case study
We report the pruning method’s behavior from observations of the individual results.
- Recall that, for the song RockSteady, there was no clear performance improvement when we added the reverbs (Figure 7b). Hence, we can expect those reverbs to be pruned with a moderate tolerance . Figure 5 shows that this is indeed the case; only reverbs are left when and for , which is much less than the average statistics (Tables 2 and 3). When , processors for the details get removed; only the gain/pannings and equalizers remain. See captions in Figure 5 for the pruning ratios and audio losses of the pruned graphs (the full console has an audio loss of ).
- The current pruning method fails to detect some redundant processors. In Figure 5b, the bottom sources are processed with gain/pannings. Since there is no nonlinear or time-varying processor between those, at least one can be “absorbed” by the others. While this case can be handled with some post-processing, it hints that we might have missed sparser graphs.
- Each pruning of the same song yields a slightly different graph. Pruning a mixing console of GirlOnABridge multiple times resulted in graphs with the number of processors from to . This is because our iterative pruning has a stochastic and greedy nature; candidates that were sampled early are more likely to be pruned. Refer to the supplementary page for the pruned graphs.
- The pruning does not necessarily result in graphs that are close to the maximum loss . For RockSteady, pruning with resulted in a loss of , much lower than the threshold. Interestingly, the case achieved the same loss in spite of a much lower pruning ratio ( versus ).
- Processors for sources with short spans and low energy tend to get pruned as their contributions to the audio loss are small. Yet, we found that this could sometimes be perceptually noticeable.
MedleyDB | |||||||||
---|---|---|---|---|---|---|---|---|---|
MixingSecrets | |||||||||
Internal | |||||||||
Total |
Full results
Finally, we pruned every song in the full dataset ensemble. Table 3 reports the results. The overall trend follows the evaluation subset results but with higher average audio loss ( compared to the previous ). Figure 6 shows statistics of the datasets, initial mixing console graphs, and their pruned versions. MedleyDB has the smallest number of source tracks, an average of . The Internal dataset has the largest (), closely followed by MixingSecrets (). The Internal dataset also has more subgroups, resulting in even larger mixing consoles. This is one potential cause of the higher sparsity of its pruned graphs; more processors were initially used to match the mix, and many of them were redundant. On average, processors ( nodes) remained for each song. Since each full mixing console has an average of processors ( nodes), we achieved a pruning ratio of .
4 Discussion
Summary
We started with a general formulation of retrieving mixing graphs from dry sources and mix. Then, we posed restrictions to cast the search to the pruning of mixing consoles, making it computationally feasible and obtaining more interpretable graphs. Next, with additional assumptions, we derived the iterative method that gradually removes negligible processors in a stochastic and greedy manner. As a result, instead of finding the exact optimum, our method gives (or “samples”) one of the close-to-optimal graphs. With the differentiable processors and relaxation of the pruning with the dry/wet weights, we optimized this objective via gradient descent. We explored methods to choose pruning candidates, comparing them in terms of their computational cost and graph sparsity. The hybrid method gave a good compromise, so we used it to gather over one thousand graph-audio pairs.
Future works
We list possible extensions of our method.
- The choice of processors and their implementations directly affect the match quality. Our setup, including the equalizer with zero-phase FIR and the reverb based on STFT mask, was motivated by its simplicity and fast computation on a GPU. However, other alternatives exist, e.g., parametric equalizer [20] and artificial reverberation [36], that allow more efficient computation on a CPU and have compact parameterizations. Also, the spectrogram errors showed clear temporal patterns (vertical stripes), indicating that the loudness dynamics were not precisely matched. We suspect it is due to the ballistics approximation error, as recently reported [37]. If so, we might need a more sophisticated implementation of the compressor and noisegate. Also, the current method does not support time-varying parameters (or “automation”), which can cause audible errors. For example, we could not match fade-out, i.e., a gradual decrease in track loudness. Finally, we can add other processor types, e.g., saturation/distortion [38] or modulation effects [39].
- We note several considerations to improve the current pruning method in terms of sparsity, match quality, and interpretability. First, we can modify the mixing console to reflect real-world practices more. For example, we can add send and return loops with additional processor chains. Post-equalizers for compressors and processors with multiple inputs or outputs (e.g., auxiliary sidechain and crossover filter) are also commonly used. Second, to prevent the pruning from harming the perceptual quality, the tolerance condition and the objective function must be appropriately designed. We used a simple multi-resolution STFT loss [32, 34], which has been reported to miss some perceptual features [40, 41]. Hence, we might need an alternative objective as a remedy [42]. Third, as discussed before, using average loss to determine the pruning might be inappropriate. Lastly, to increase the sparsity, more advanced neural network pruning techniques [23, 24] and domain-specific post-processing, e.g., merging LTI processors to a single processor with the combined effect, can be applied.
- We may relax the prior assumptions and restrictions on graph structures. This will expand our search space and require different search methods other than pruning. For example, allowing arbitrary processor order extends our framework to different architecture search [25, 26]. A completely different approach based on reinforcement learning could also be possible [43]. While all of these are promising, balancing flexibility, match quality, and computation cost will be the main challenge.
References
- [1] P. D. Pestana and J. D. Reiss, “Intelligent audio production strategies informed by best practices,” 2014.
- [2] F. Everardo, “Towards an automated multitrack mixing tool using answer set programming,” in 14th SMC Conf, 2017.
- [3] E. Perez-Gonzalez and J. Reiss, “Automatic gain and fader control for live mixing,” in IEEE WASPAA, 2009.
- [4] B. De Man and J. D. Reiss, “A knowledge-engineered autonomous mixing system,” in AES Convention 135, 2013.
- [5] M. A. Martinez Ramirez et al., “Automatic music mixing with deep learning and out-of-domain data,” in ISMIR, 2022.
- [6] D. Koszewski, T. Görne, G. Korvel, and B. Kostek, “Automatic music signal mixing system based on one-dimensional wave-u-net autoencoders,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2023, no. 1, 2023.
- [7] C. J. Steinmetz, J. Pons, S. Pascual, and J. Serrà, “Automatic multitrack mixing with a differentiable mixing console of neural audio effects,” in IEEE ICASSP, 2021.
- [8] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello, “MedleyDB: A multitrack dataset for annotation-intensive MIR research,” in ISMIR, vol. 14, 2014.
- [9] R. M. Bittner, J. Wilkins, H. Yip, and J. P. Bello, “MedleyDB 2.0: New data and a system for sustainable data collection,” ISMIR LBD, 2016.
- [10] M. Senior, Mixing secrets for the small studio, 2018.
- [11] S. Lee, J. Park, S. Paik, and K. Lee, “Blind estimation of audio processing graph,” in IEEE ICASSP, 2023.
- [12] C. Mitcheltree and H. Koike, “SerumRNN: Step by step audio VST effect programming,” in Artificial Intelligence in Music, Sound, Art and Design, 2021.
- [13] J. Guo and B. McFee, “Automatic recognition of cascaded guitar effects,” in DAFx, 2023.
- [14] N. Masuda and D. Saito, “Improving semi-supervised differentiable synthesizer sound matching for practical applications,” IEEE/ACM TASLP, vol. 31, 2023.
- [15] N. Uzrad et al., “DiffMoog: a differentiable modular synthesizer for sound matching,” arXiv:2401.12570, 2024.
- [16] F. Caspe, A. McPherson, and M. Sandler, “DDX7: Differentiable FM synthesis of musical instrument sounds,” in ISMIR, 2022.
- [17] J. Colonel, “Music production behaviour modelling,” 2023.
- [18] “The mixing console — split, inline and hybrids,” https://steemit.com/sound/@jamesub/the-mixing-console-split-inline-and-hybrids, accessed: 2024-02-26.
- [19] J. Engel, L. H. Hantrakul, C. Gu, and A. Roberts, “DDSP: differentiable digital signal processing,” in ICLR, 2020.
- [20] S. Nercessian, “Neural parametric equalizer matching using differentiable biquads,” in DAFx, 2020.
- [21] C. J. Steinmetz, N. J. Bryan, and J. D. Reiss, “Style transfer of audio effects with differentiable signal processing,” JAES, vol. 70, no. 9, 2022.
- [22] G. Castellano, A. M. Fanelli, and M. Pelillo, “An iterative pruning algorithm for feedforward neural networks,” IEEE Transactions on Neural Networks, vol. 8, no. 3, 1997.
- [23] Y. He and L. Xiao, “Structured pruning for deep convolutional neural networks: A survey,” arXiv:2303.00566, 2023.
- [24] H. Cheng, M. Zhang, and J. Q. Shi, “A survey on deep neural network pruning-taxonomy, comparison, analysis, and recommendations,” arXiv:2308.06767, 2023.
- [25] H. Liu, K. Simonyan, and Y. Yang, “DARTS: Differentiable architecture search,” in ICLR, 2019.
- [26] Z. Ye, W. Xue, X. Tan, Q. Liu, and Y. Guo, “NAS-FM: Neural architecture search for tunable and interpretable sound synthesis based on frequency modulation,” arXiv:2305.12868, 2023.
- [27] C. J. Steinmetz, S. S. Vanka, M. A. Martínez-Ramírez, and G. Bromham, Deep Learning for Automatic Mixing. ISMIR, Dec. 2022.
- [28] J. Koo et al., “Music mixing style transfer: A contrastive learning approach to disentangle audio effects,” in IEEE ICASSP, 2023.
- [29] B. Hayes, C. Saitis, and G. Fazekas, “Sinusoidal frequency estimation by gradient descent,” in IEEE ICASSP, 2023.
- [30] U. Zölzer, Ed., DAFX: Digital Audio Effects, 2nd ed., 2011.
- [31] M. Fey and J. E. Lenssen, “Fast graph representation learning with PyTorch Geometric,” arXiv:1903.02428, 2019.
- [32] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in IEEE ICASSP, 2020.
- [33] A. Wright and V. Välimäki, “Perceptual loss function for neural modeling of audio systems,” in IEEE ICASSP, 2020.
- [34] C. J. Steinmetz and J. D. Reiss, “auraloss: Audio focused loss functions in PyTorch,” in Digital Music Research Network One-day Workshop (DMRN+15), 2020.
- [35] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv:1711.05101, 2017.
- [36] S. Lee, H.-S. Choi, and K. Lee, “Differentiable artificial reverberation,” IEEE/ACM TASLP, vol. 30, 2022.
- [37] C. J. Steinmetz, T. Walther, and J. D. Reiss, “High-fidelity noise reduction with differentiable signal processing,” in AES Convention 155, 2023.
- [38] J. T. Colonel, M. Comunità, and J. Reiss, “Reverse engineering memoryless distortion effects with differentiable waveshapers,” in AES Convention 153, 2022.
- [39] A. Carson, S. King, C. V. Botinhao, and S. Bilbao, “Differentiable grey-box modelling of phaser effects using frame-based spectral processing,” in DAFx, 2023.
- [40] B. Hayes, J. Shier, G. Fazekas, A. McPherson, and C. Saitis, “A review of differentiable digital signal processing for music & speech synthesis,” Frontiers in Signal Process., 2023.
- [41] J. Turian and M. Henry, “I’m sorry for your loss: Spectrally-based audio distances are bad at pitch,” in “I Can’t Believe It’s Not Better!” NeurIPS workshop, 2020.
- [42] C. Vahidi et al., “Mesostructures: Beyond spectrogram loss in differentiable time–frequency analysis,” JAES, 2023.
- [43] J. You, B. Liu, Z. Ying, V. Pande, and J. Leskovec, “Graph convolutional policy network for goal-directed molecular graph generation,” NeurIPS, 2018.
- [44] M. A. Martínez-Ramírez, O. Wang, P. Smaragdis, and N. J. Bryan, “Differentiable signal processing with black-box audio effects,” in IEEE ICASSP, 2021.
- [45] A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” NeurIPS, 2019.
- [46] T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A survey,” Journal of Machine Learning Research, vol. 20, no. 55, 2019.
- [47] D. C. Elton, Z. Boukouvalas, M. D. Fuge, and P. W. Chung, “Deep learning for molecular design—a review of the state of the art,” Molecular Systems Design & Engineering, vol. 4, no. 4, 2019.
- [48] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
Appendix A Related Works
A.1 Composition of audio processors
Most audio processors are designed to modify some specific properties of their input signals, e.g., magnitude response, loudness dynamics, and stereo width. As such, combining multiple processors is a common practice to achieve the full effect. Following the main text, we will use the terminology “graph” to represent this composition, although some previous works considered simple structures that allow a more compact form, e.g., a sequence. Now, we outline the previous attempts that tried to estimate the processing graph or its parameters from reference audio. These works differ in task and domain, processors, graph structure, and estimation methods. For example, if the references are dry sources and a wet mixture, this task becomes reverse engineering [11, 12, 13, 17]. In terms of the prediction targets, some fixed the graph and estimated only the parameters [7, 15, 17, 19, 44]. Others tried to predict the graph [13] or both [11, 12, 26, 16]. Table 4 summarizes and highlights such differences.
A.2 Differentiable signal processing
Differentiable processor
Exact implementation or approximation of processors in an automatic differentiation framework, e.g., PyTorch [45], enables parameter optimization via gradient descent. Numerous efforts have focused on converting existing audio processors to their differentiable versions [17, 21, 38, 39, 20, 36]; refer to the recent review [40] and references therein for more details. In many cases, these processors are combined with neural networks, whose computation is done on the GPU. Thus, making the audio processors “GPU-friendly” has been an active research topic. For example, for a linear time-invariant (LTI) system with a recurrent structure, we can sample its frequency response to approximate its infinite impulse response (IIR) instead of directly running the recurrence; the former is faster than the latter [20, 36]. However, it is nontrivial to apply a similar trick to nonlinear, recurrent, or time-varying processors. Typically, further simplifications and approximations are employed, e.g., replacing the nonlinear recurrent part with an IIR filter [21] or approximating a linear time-varying system as frame-wise LTI [39]. Sometimes, we can only access input and output signals. In such a case, one can approximate the gradients with finite difference methods [44, 37] or use a pre-trained auxiliary neural network that mimics the processors [7]. In the literature, these are also referred to as “differentiable”; hence, the word is rather an umbrella term encompassing all methods that obtain the output signals or gradients within a reasonable amount of time. Nevertheless, our work limits the focus to implementations in the automatic differentiation framework.
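The frequency-sampling idea above can be sketched as follows. This is a minimal, pure-Python illustration rather than the implementation of [20, 36]: it samples the frequency response of a rational filter at `n_fft` points, applies a naive inverse DFT, and compares the result with the impulse response obtained from the recurrence. The function names and the one-pole filter coefficients are assumptions made for the example; a practical version would use batched FFTs in an automatic differentiation framework.

```python
import cmath


def frequency_sampled_ir(b, a, n_fft):
    """Approximate an IIR impulse response by sampling H(e^{jw}) on n_fft
    uniformly spaced frequencies and taking the inverse DFT."""
    response = []
    for k in range(n_fft):
        z_inv = cmath.exp(-2j * cmath.pi * k / n_fft)
        num = sum(bi * z_inv ** i for i, bi in enumerate(b))
        den = sum(ai * z_inv ** i for i, ai in enumerate(a))
        response.append(num / den)
    ir = []
    for n in range(n_fft):
        s = sum(response[k] * cmath.exp(2j * cmath.pi * k * n / n_fft)
                for k in range(n_fft))
        ir.append((s / n_fft).real)
    return ir


def recursive_ir(b, a, length):
    """Reference: run the difference equation on a unit impulse."""
    x = [1.0] + [0.0] * (length - 1)
    y = []
    for n in range(length):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[i] * y[n - i] for i in range(1, len(a)) if n - i >= 0)
        y.append(acc / a[0])
    return y


# One-pole lowpass y[n] = 0.5 x[n] + 0.5 y[n-1] (hypothetical example filter).
b, a = [0.5], [1.0, -0.5]
n_fft = 64
ir_fs = frequency_sampled_ir(b, a, n_fft)
ir_ref = recursive_ir(b, a, n_fft)
max_err = max(abs(u - v) for u, v in zip(ir_fs, ir_ref))
```

The sampled version is a time-aliased copy of the true impulse response, so the approximation is accurate only when the response has decayed sufficiently within `n_fft` samples; slowly decaying filters need more frequency samples.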
Audio processing graph
Now, consider a composition of multiple differentiable processors; the entire graph remains differentiable due to the chain rule. However, some practical considerations remain. If we fix the processing graph prior to the optimization and the graph size is relatively small, we can implement the “differentiable graph” following the existing implementations [19, 15]. That is, we compute every processor one by one in a pre-defined topological order. Our setting, however, imposes two additional requirements. First, the pruning changes the graph during the optimization. Therefore, our implementation must take a graph and its parameters along with the source signals as input arguments for every forward pass. Note that this feature is also necessary when training a neural network that predicts the parameters of any given graph [11]. Second, the size of our graphs is much larger than the ones from previous works [19, 15, 26, 11]. In this case, the one-by-one computation severely bottlenecks the computation speed. Therefore, we derived a flexible and efficient graph computation algorithm that (i) can take different graphs for each forward pass as input and (ii) performs batched processing of multiple processors within a graph, utilizing the parallelism of GPUs. Finally, we note that, other than the differentiation with respect to the input signals and parameters, one might be interested in differentiation with respect to the graph structure. The proposed pruning method performs this to a limited extent; deletion of a node is a binary operation that modifies the graph structure. We relaxed this to a continuous dry/wet weight and optimized it with the audio loss and regularization.
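The dry/wet relaxation described above can be illustrated with a toy chain. The sketch below is only an assumption-level example: the processors (`gain`, `hard_clip`), the helper names, and the L1 form of the regularizer are hypothetical and stand in for the actual differentiable processors and regularization. Each node outputs a mix of its processed (wet) and unprocessed (dry) signals, so a weight of 0 is equivalent to deleting the node.

```python
def gain(x, g):
    """Toy processor: scale every sample by g."""
    return [g * s for s in x]


def hard_clip(x, threshold):
    """Toy nonlinear processor: clip samples to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, s)) for s in x]


def dry_wet(x, processor, params, w):
    """Relaxed node: w = 1 keeps the processor, w = 0 bypasses (prunes) it.
    The output is linear in w, with derivative (wet - dry) per sample."""
    wet = processor(x, *params)
    return [w * a + (1.0 - w) * b for a, b in zip(wet, x)]


def run_chain(x, nodes):
    """Apply (processor, params, weight) nodes in their fixed order."""
    for processor, params, w in nodes:
        x = dry_wet(x, processor, params, w)
    return x


def l1_penalty(nodes):
    """Assumed sparsity regularizer that pushes the weights toward zero."""
    return sum(abs(w) for _, _, w in nodes)


x = [0.1, -0.4, 0.9, -1.2]
soft = [(gain, (2.0,), 1.0), (hard_clip, (0.5,), 0.0)]  # second node has w = 0
hard = [(gain, (2.0,), 1.0)]                            # second node deleted

soft_out = run_chain(x, soft)
hard_out = run_chain(x, hard)
max_diff = max(abs(u - v) for u, v in zip(soft_out, hard_out))
penalty = l1_penalty(soft)
```

Because a zero weight reproduces the pruned chain exactly, a node whose weight has been driven close to zero is a natural candidate for hard removal.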
A.3 Graph search
Several independent research efforts in various domains search for graphs that satisfy certain requirements. For example, neural architecture search (NAS) aims to find a neural network architecture that achieves improved performance [46]. In this case, the search space consists of graphs, with each node (or edge) representing one of the primitive neural network layers. One work particularly relevant to ours is differentiable architecture search (DARTS) [25], which relaxes the choice of each layer to a categorical distribution and optimizes it via gradient descent. In theory, our method extends naturally to this approach; we only need to change our two-way choice (prune or not) to a multi-way choice (bypass or select one of the processor types). DARTS is clearly more flexible and general, allowing an arbitrary order of processors. However, it also greatly increases the computational cost, as we must compute all processors to obtain their weighted sum for every node. For example, if we want to keep the mixing console structure and allow arbitrary processor choices, the memory complexity grows accordingly compared to the current method. In other words, we must pay additional costs to increase the size of the search space. This cost increase is especially critical to us since we have to find a graph for every song. Another popular related domain is the generation and design of molecules with desired chemical properties [47]. One dominant approach for this task is reinforcement learning (RL), which estimates each graph by making a sequence of decisions, e.g., adding nodes and edges [43]. RL is an attractive choice since it frees us from prior assumptions on graphs and allows arbitrary quality measures that are not differentiable. We also note that RL can be used for NAS [48].
However, applying RL to our task has a risk of obtaining nontrivial mixing graphs that are difficult for practitioners to interpret; we may need a soft regularization penalty that guides the generation process towards familiar structures, e.g., ones like the pruned mixing consoles. Also, it may need much larger computational resources to explore the search space sufficiently.
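To make the cost of the DARTS-style relaxation discussed above concrete, the following sketch shows a single relaxed node. It is a simplified illustration, not the DARTS implementation of [25]: the candidate set (`bypass`, `half_gain`, `hard_clip`) and all names are hypothetical. The node output is a softmax-weighted sum over every candidate, so every candidate must be evaluated on each forward pass, whereas the dry/wet relaxation evaluates one processor per node.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bypass(x):
    return list(x)


def half_gain(x):
    return [0.5 * s for s in x]


def hard_clip(x):
    return [max(-0.3, min(0.3, s)) for s in x]


# Hypothetical candidate set: bypass plus two toy processor types.
CANDIDATES = [bypass, half_gain, hard_clip]


def mixed_node(x, logits):
    """Softmax-weighted sum of all candidate outputs for one node."""
    weights = softmax(logits)
    outputs = [f(x) for f in CANDIDATES]  # every candidate is computed
    return [sum(w * out[n] for w, out in zip(weights, outputs))
            for n in range(len(x))]


x = [1.0, -0.8, 0.2]
uniform = mixed_node(x, [0.0, 0.0, 0.0])      # equal mix of all candidates
mostly_bypass = mixed_node(x, [50.0, 0.0, 0.0])  # nearly a pure bypass
```

With one such node at every position of every chain, both compute and memory scale with the number of candidate types, which is the overhead the paragraph above refers to.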
Appendix B Dry/wet Pruning Algorithm
Algorithm 2 describes the details of the dry/wet method. For a simpler description, we modified the initialization to include per-type node sets and weights, as in lines 8-12. The termination condition is given in line 13. The trial candidate sampling is implemented in lines 14-15. The candidate pool update is expanded to handle the trial successes and failures separately, as shown in lines 20 and 22-27.
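The overall shape of such a trial-based pruning loop can be sketched as below. This is a loose illustration under stated assumptions, not a transcription of Algorithm 2: the deterministic lowest-weight selection stands in for the candidate sampling step, the halving schedule on failure and the additive tolerance test are assumed, and all names (`prune_with_trials`, `toy_loss`, `IMPORTANCE`) are hypothetical. Candidates with small dry/wet weights are removed on trial; the removal is kept if the loss stays within the tolerance and reverted otherwise.

```python
def prune_with_trials(weights, evaluate, tolerance):
    """weights: dict mapping node name -> dry/wet weight.
    evaluate: callable taking the set of active nodes, returning a loss.
    Returns the set of nodes that survive pruning."""
    active = set(weights)
    pool = set(weights)            # candidates still eligible for a trial
    base_loss = evaluate(active)
    ratio = 0.5                    # fraction of the pool tried at once
    while pool:                    # terminate when no candidate remains
        ranked = sorted(pool, key=lambda n: (weights[n], n))
        k = max(1, int(len(ranked) * ratio))
        trial = set(ranked[:k])    # lowest-weight candidates first
        if evaluate(active - trial) <= base_loss + tolerance:
            active -= trial        # success: the removal is kept
            pool -= trial
        elif k == 1:
            pool -= trial          # failure on one node: keep it for good
        else:
            ratio /= 2             # failure on a group: try fewer nodes
    return active


# Toy stand-in for the audio loss: removing a node costs its "importance".
IMPORTANCE = {"eq": 1.0, "comp": 0.5, "reverb": 0.001, "delay": 0.0}
WEIGHTS = {"eq": 0.9, "comp": 0.6, "reverb": 0.05, "delay": 0.01}


def toy_loss(active):
    return sum(v for n, v in IMPORTANCE.items() if n not in active)


kept = prune_with_trials(WEIGHTS, toy_loss, tolerance=0.01)
```

In the actual method, each trial would evaluate the audio loss of the pruned graph, and the surviving graph would be fine-tuned between pruning stages.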
| Work | Aspect | Description |
|---|---|---|
| [15] | Task & domain | Sound matching. The synthesizer parameters were estimated to match the reference (target) audio. |
| | Processors | Oscillators, envelope generators, and filters that allow parameter modulation as an optional input. |
| | Graph | Any pre-defined directed acyclic graph (DAG). For example, a subtractive synthesizer comprising oscillators, an amplitude envelope, and a lowpass filter was used in the experiments. |
| | Method | Trained a single neural backbone for the reference encoding, followed by multiple prediction heads for the parameters. Optimized with a parameter loss and a spectral loss, where the latter is calculated with every intermediate output. |
| [16] | Task & domain | Sound matching. A frequency-modulation (FM) synthesizer matches recordings of monophonic instruments (violin, flute, and trumpet). Estimates parameters of an operator graph that is empirically searched and selected. |
| | Processors | Differentiable sinusoidal oscillators with pre-defined frequencies, each used as a carrier or modulator. An additional FIR reverb is added to the FM graph output for post-processing. |
| | Graph | DAGs with a bounded number of operators. Different graphs for different target instruments. |
| | Method | Trained a convolutional neural network that estimates envelopes from the target loudness and pitch. |
| [26] | Task & domain | Sound matching. Similar setup to [16] above, plus additional estimation of the operator graph. |
| | Processors | Identical to [16], except for the frequency ratio, which can be searched. |
| | Graph | A subgraph of a supergraph, which resembles a multi-layer perceptron (modulator layers followed by a carrier layer). |
| | Method | Trained a parameter estimator for the supergraph and found the appropriate subgraph with an evolutionary search. |
| [12] | Task & domain | Reverse engineering of an audio effect chain from a subtractive synthesizer (commercial plugin). |
| | Processors | Five audio effects: compressor, distortion, equalizer, phaser, and reverb. Non-differentiable implementations. |
| | Graph | Chain of audio effects generated with no duplicate types and in random order. |
| | Method | Trained a next effect predictor and parameter estimator in a supervised (teacher-forcing) manner. |
| [13] | Task & domain | Blind estimation and reverse engineering of guitar effect chains. |
| | Processors | Guitar effects, including non-linear processors, modulation effects, ambience effects, and equalizer filters. |
| | Graph | A chain of guitar effects with a bounded maximum length, giving a finite set of possible combinations. |
| | Method | Trained a convolutional neural network with synthetic data to predict the correct combination. |
| [7] | Task & domain | Automatic mixing. Estimated parameters of fixed processing chains from source tracks. |
| | Processors | Seven differentiable processors, where four (gain, polarity, fader, and panning) were implemented exactly. A combined effect of the remaining three (equalizer, compressor, and reverb) was approximated with a single pre-trained neural network. |
| | Graph | Tree structure: applied a fixed chain of the processors for each track, and then summed the chain outputs altogether. |
| | Method | Trained a parameter estimator (convolutional neural network) with a spectrogram loss end-to-end. |
| [44] | Task & domain | Reverse engineering of music mastering. |
| | Processors | A multi-band compressor, graphic equalizer, and limiter. Gradient approximated with a finite difference method. |
| | Graph | A serial chain of the processors. |
| | Method | Optimized parameters with gradient descent. |
| [11] | Task & domain | Blind estimation and reverse engineering. Estimates both the graph and its parameters for singing voice effects or drum mixing. |
| | Processors | Linear filters, nonlinear filters, and control signal generators, among others. Some processors are multiple-input multiple-output (MIMO), e.g., allowing auxiliary modulations. Non-differentiable implementations. |
| | Graph | Complex DAG with splits (e.g., multi-band processing) and merges (e.g., sum and modulation), up to a maximum number of processors. |
| | Method | Trained a convolutional neural network-based reference encoder and a transformer variant for graph decoding and parameter estimation. Both were jointly trained via direct supervision of synthetic graphs (e.g., parameter loss). |
| [17] | Task & domain | Reverse engineering of music mixing. Estimated parameters of a fixed chain for each track. |
| | Processors | Six differentiable processors: gain, equalizer, compressor, distortion, panning, and reverb. |
| | Graph | A chain of five processors (all above types except the reverb) for each dry track (any other DAG can also be used). The reverb is used for the mixed sum. |
| | Method | Parameters were optimized with a spectrogram loss end-to-end via gradient descent. |
| Ours | Task & domain | Reverse engineering of music mixing. Estimated a chain of processors and their parameters for each track and submix. |
| | Processors | Seven differentiable processors: gain/panning, stereo imager, equalizer, reverb, compressor, noisegate, and delay. |
| | Graph | A tree of processing chains with a subgrouping structure (any other DAG can also be used). Processors can be omitted but should follow the fixed order. |
| | Method | Joint estimation of the soft masks (dry/wet weights) and processor parameters. Optimized with the spectrogram loss (and additional regularizations) end-to-end via gradient descent. Accompanied by hard pruning stages. |
[Table: results on MedleyDB, MixingSecrets, and Internal. Rows: Base graph; + Gain/panning; + Stereo imager; + Equalizer; + Reverb; + Compressor; + Noisegate; + Multitap delay (full).]
[Table: results on MedleyDB, MixingSecrets, and Internal. Rows: MC; BF; DW; H.]