Adversarial intelligent music mixing mechanism: a theoretical framework based on deep feature disentanglement
DOI: https://doi.org/10.71113/JCAC.v1i1.304

Keywords: Transformer, GANs, TFR-GANs, intelligent mixing, deep feature disentanglement

Abstract
This study employs a generative adversarial network (GAN), using a Transformer architecture (referred to in this study as TFR) as the generator network and a β-VAE architecture as an auxiliary module. The β-VAE module is integrated into the encoder and decoder of the TFR, placed after each feed-forward network (FFN). On this theoretical basis, the study constructs a TFR-GAN adversarial architecture for fully intelligent music mixing, presenting a new paradigm for intelligent-mixing research. In addition, by combining a deep feature disentanglement mechanism with multi-head attention, it proposes an implementation path for fully intelligent music mixing based on the proposed architecture, theoretically realizing artificial-intelligence mixing at the level of the deep neural network structure.
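To make the layer layout described above concrete, the following is a minimal PyTorch sketch of one generator layer arranged as the abstract describes: multi-head self-attention, then the FFN, then a β-VAE-style bottleneck that applies the reparameterization trick and a β-weighted KL penalty to encourage feature disentanglement. All names, dimensions, and wiring details (BetaVAEBottleneck, TFRGeneratorLayer, d_latent, the residual connections) are illustrative assumptions for this sketch, not the authors' published implementation.

```python
# Sketch, assuming the beta-VAE bottleneck sits after the FFN inside each
# Transformer layer; every name and dimension here is an assumption.
import torch
import torch.nn as nn


class BetaVAEBottleneck(nn.Module):
    """beta-VAE style bottleneck: encode to (mu, logvar), sample, decode."""
    def __init__(self, d_model: int, d_latent: int, beta: float = 4.0):
        super().__init__()
        self.to_stats = nn.Linear(d_model, 2 * d_latent)  # predicts mu and logvar
        self.from_latent = nn.Linear(d_latent, d_model)
        self.beta = beta

    def forward(self, x):
        mu, logvar = self.to_stats(x).chunk(2, dim=-1)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL divergence to N(0, I), scaled by beta to encourage disentanglement
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return self.from_latent(z), self.beta * kl


class TFRGeneratorLayer(nn.Module):
    """One generator layer: self-attention -> FFN -> beta-VAE bottleneck."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, d_latent=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.bottleneck = BetaVAEBottleneck(d_model, d_latent)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x):
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)                    # multi-head self-attention
        x = self.norm2(x + self.ffn(x))          # feed-forward network (FFN)
        h, kl = self.bottleneck(x)               # disentangling module after the FFN
        return self.norm3(x + h), kl


# Example usage: a batch of 2 sessions, 100 time frames, 512-dim features.
layer = TFRGeneratorLayer()
x = torch.randn(2, 100, 512)
y, kl = layer(x)   # y: (2, 100, 512); kl: scalar disentanglement penalty
```

In a full TFR-GAN, the output of the final generator layer would be rendered to an audio mix and scored by a discriminator, with the standard GAN adversarial loss trained jointly alongside the β-weighted KL term returned above; that training loop is omitted from this sketch.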
License
Copyright (c) 2025 Mingyang Yong

This work is licensed under a Creative Commons Attribution 4.0 International License.