
Graduate Student: Hao-Wei Lai (賴浩維)
Thesis Title: GENPIA: A Genre-conditioned Piano Music Generation System
Advisor: Min-Te Sun (孫敏德)
Committee Members:
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science & Information Engineering
Publication Year: 2023
Graduation Academic Year: 111
Language: English
Pages: 55
Keywords: Music Generation, Audio Representation, Transformer-based Model

Abstract: With the demand for music continuing to grow as people seek variety and personal resonance, many works focus on music generation. In this research, we propose GENPIA, a genre-conditioned piano music generation system. The system covers the Anime, R&B, Jazz, and Classical genres. To build the system, we collect and label audio data from these genres for the specific objectives of our research. During data pre-processing, a REMI audio representation extended with genre information is applied to give the audio data a better-structured encoding. Transformer-XL serves as the model that learns this extended audio representation and generates the desired output audio. An external dataset, Ailabs.tw 1K7, is used for pre-training. Results from a listening questionnaire show that GENPIA generates better piano pieces conditioned on different genres than the prior state-of-the-art work.
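The abstract describes extending the REMI event representation with genre information before feeding it to Transformer-XL. A minimal sketch of that idea is to prepend a genre condition token to a REMI-style token sequence. The event names below follow the general REMI scheme (Bar, Position, Note events), but the exact vocabulary, the `Genre_*` token format, and the helper function are illustrative assumptions, not the thesis's actual implementation.

```python
# Hypothetical sketch: prepending a genre token to a REMI-style event
# sequence so a language model can learn genre-conditioned continuations.
# Token names and the helper are illustrative, not the thesis's code.

GENRES = ["Anime", "RnB", "Jazz", "Classical"]

def to_conditioned_sequence(genre, remi_events):
    """Return the REMI event list with a genre condition token prepended."""
    if genre not in GENRES:
        raise ValueError(f"unknown genre: {genre}")
    return [f"Genre_{genre}"] + list(remi_events)

# A toy REMI fragment: the start of a bar containing a single note.
events = ["Bar_None", "Position_1/16", "Note-Velocity_20",
          "Note-On_60", "Note-Duration_4"]

seq = to_conditioned_sequence("Jazz", events)
# seq[0] is "Genre_Jazz"; the remaining tokens are the unchanged REMI events.
```

At generation time, the same genre token would be supplied as the first element of the prompt, steering the model's sampled continuation toward that genre.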

    Contents
    1 Introduction ............................................ 1
    2 Related Work ............................................ 4
      2.1 Non-Transformer-based Music Generation .............. 4
      2.2 Transformer-based Music Generation .................. 5
      2.3 Music Datasets ...................................... 7
    3 Preliminary ............................................. 9
      3.1 Transformer-XL ...................................... 9
        3.1.1 Transformer Encoder ............................. 9
        3.1.2 Transformer Decoder ............................. 10
      3.2 Differences between Transformer and Transformer-XL .. 12
        3.2.1 Segment-Level Recurrence ........................ 12
        3.2.2 Attention Using Relative Positional Encoding .... 14
        3.2.3 Stochastic Temperature-Controlled Sampling ...... 15
      3.3 REMI ................................................ 16
        3.3.1 MIDI-like and REMI Audio Representation ......... 16
        3.3.2 Conversion Process .............................. 17
      3.4 YT-DLP .............................................. 18
    4 Design .................................................. 19
      4.1 Problem Statement ................................... 19
      4.2 Research Challenges ................................. 19
      4.3 Proposed System Architecture ........................ 20
        4.3.1 Data Collection ................................. 21
        4.3.2 Data Pre-processing ............................. 21
        4.3.3 Model Training and Inference .................... 25
    5 Performance ............................................. 28
      5.1 External Dataset .................................... 28
      5.2 Experimental Environment Configuration .............. 28
      5.3 Evaluation Metrics .................................. 29
      5.4 Experimental Results and Analysis ................... 30
    6 Conclusion .............................................. 36

