Adversarial intelligent music mixing mechanism: a theoretical framework based on deep feature disentanglement
DOI: https://doi.org/10.71113/JCAC.v1i1.304

Keywords: Transformer, GANs, TFR-GANs, intelligent mixing, deep feature disentanglement

Abstract
This study employs a generative adversarial network (GAN), using a Transformer architecture (referred to in this study as TFR) as the generator network and a β-VAE architecture as an auxiliary module. The β-VAE module is integrated into the encoder and decoder of the TFR, placed after each feed-forward network (FFN). On this theoretical basis, the study constructs a TFR-GAN adversarial architecture for fully intelligent music mixing, presenting a new paradigm for intelligent-mixing research. In addition, by combining a deep feature disentanglement mechanism with multi-head attention, it proposes an implementation path for fully intelligent music mixing based on the proposed architecture, theoretically realizing artificial-intelligence mixing at the level of the deep neural network structure.
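To make the layer layout described above concrete, the following is a minimal PyTorch sketch of one generator layer arranged as the abstract describes: multi-head self-attention, then the FFN, then a β-VAE-style bottleneck that applies the reparameterization trick and a β-weighted KL penalty to encourage feature disentanglement. All names, dimensions, and wiring details (BetaVAEBottleneck, TFRGeneratorLayer, d_latent, the residual connections) are illustrative assumptions for this sketch, not the authors' published implementation.

```python
# Sketch, assuming the beta-VAE bottleneck sits after the FFN inside each
# Transformer layer; every name and dimension here is an assumption.
import torch
import torch.nn as nn


class BetaVAEBottleneck(nn.Module):
    """beta-VAE style bottleneck: encode to (mu, logvar), sample, decode."""
    def __init__(self, d_model: int, d_latent: int, beta: float = 4.0):
        super().__init__()
        self.to_stats = nn.Linear(d_model, 2 * d_latent)  # predicts mu and logvar
        self.from_latent = nn.Linear(d_latent, d_model)
        self.beta = beta

    def forward(self, x):
        mu, logvar = self.to_stats(x).chunk(2, dim=-1)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL divergence to N(0, I), scaled by beta to encourage disentanglement
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return self.from_latent(z), self.beta * kl


class TFRGeneratorLayer(nn.Module):
    """One generator layer: self-attention -> FFN -> beta-VAE bottleneck."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, d_latent=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.bottleneck = BetaVAEBottleneck(d_model, d_latent)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x):
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)                    # multi-head self-attention
        x = self.norm2(x + self.ffn(x))          # feed-forward network (FFN)
        h, kl = self.bottleneck(x)               # disentangling module after the FFN
        return self.norm3(x + h), kl


# Example usage: a batch of 2 sessions, 100 time frames, 512-dim features.
layer = TFRGeneratorLayer()
x = torch.randn(2, 100, 512)
y, kl = layer(x)   # y: (2, 100, 512); kl: scalar disentanglement penalty
```

In a full TFR-GAN, the output of the final generator layer would be rendered to an audio mix and scored by a discriminator, with the standard GAN adversarial loss trained jointly alongside the β-weighted KL term returned above; that training loop is omitted from this sketch.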
License
Copyright (c) 2025 Mingyang Yong

This work is licensed under a Creative Commons Attribution 4.0 International License.