| Graduate Student: | CHAPUIS Vincent (沙文森) |
|---|---|
| Thesis Title: | AI Driven Music Transcription and Generation (以人工智慧方法驅動音樂轉錄與生成) |
| Advisors: | Timothy K. Shih (施國琛), Frederic Lassabe |
| Oral Defense Committee: | |
| Degree: | Master (碩士) |
| Department: | Department of Computer Science & Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2020 |
| Academic Year of Graduation: | 108 |
| Language: | English |
| Pages: | 92 |
| Chinese Keywords: | 音樂、深度學習、轉錄、生成 |
| English Keywords: | Music, Deep Learning, Transcription, Generation |
Music transcription and generation is a broad field of research that has attracted many curious and hopeful researchers, who actively seek innovative techniques that go beyond the current state of the art and, through technical breakthroughs, accomplish tasks that were previously out of reach. To better understand the difficulties of this field, this thesis conducts an in-depth literature review of music transcription and generation, proposes improved algorithms for each difficulty, and validates their robustness through experiments. First, the thesis adopts the Waon algorithm to transcribe musical notes and develops an easy-to-use graphical user interface. Next, it shows how, given an adequate dataset, deep learning with a recurrent neural network built on top of a convolutional neural network can achieve the desired results. In addition, it shows how newer models such as the Transformer can use the transcribed notes as material for music generation. All experiments in this thesis take the MIDI format as the deep-learning input and are implemented with the PyTorch framework. Finally, the thesis discusses the experimental results in depth and examines how this work could be further refined into an interactive product that offers composers a friendlier graphical interface.
Music transcription and generation is a wide field that has been explored by many with hope and curiosity: hope to reach and surpass human skill and creativity, and curiosity to find new ways of accomplishing tasks that were either difficult or impossible for previously existing technology. In this thesis, we explore this field and review the existing techniques used for these tasks. We then introduce the several approaches tested during our research, either trying new methods or improving on the current state of the art. We first used an algorithmic approach based on the Waon algorithm to transcribe music notes, and developed a graphical user interface to support this task. We then show how deep learning approaches, such as a convolutional neural network coupled with a recurrent neural network, can give satisfying results when an adequate dataset is chosen, and how deep learning can also be a great asset for generating music with cutting-edge models like the Transformer. For all of these tasks we mainly used the MIDI file format and Python frameworks such as PyTorch. We finally discuss how these techniques can help composers create new music and develop their ideas, and how future work on this subject could focus on creating an ergonomic user interface for production use.
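As a concrete illustration of the symbolic pipeline the abstract describes, the sketch below converts a list of MIDI-style note events into a binary piano-roll matrix, the frame-wise representation commonly used as input or target for CNN/RNN transcription and generation models. This is a minimal sketch, not code from the thesis: the function name, the 50-frames-per-second resolution, and the event format are all illustrative assumptions.

```python
# Illustrative sketch (not the thesis implementation): turn MIDI-style
# note events into a binary piano-roll matrix, the frame-wise
# representation commonly fed to CNN+RNN music models.

FRAMES_PER_SECOND = 50   # assumed time resolution (20 ms frames)
NUM_PITCHES = 128        # full MIDI pitch range 0-127

def events_to_piano_roll(events, duration_s):
    """events: list of (pitch, onset_s, offset_s) tuples.

    Returns a (num_frames x NUM_PITCHES) matrix of 0/1 flags,
    where roll[t][p] == 1 means pitch p sounds during frame t.
    """
    num_frames = int(duration_s * FRAMES_PER_SECOND)
    roll = [[0] * NUM_PITCHES for _ in range(num_frames)]
    for pitch, onset, offset in events:
        start = int(onset * FRAMES_PER_SECOND)
        end = min(int(offset * FRAMES_PER_SECOND), num_frames)
        for t in range(start, end):
            roll[t][pitch] = 1
    return roll

# Example: a C4 (MIDI pitch 60) held for the first half-second
# of a one-second clip.
roll = events_to_piano_roll([(60, 0.0, 0.5)], duration_s=1.0)
print(len(roll), sum(row[60] for row in roll))  # → 50 25
```

In a real pipeline this matrix would be stacked into a tensor and passed to the network, with the CNN reading local time-pitch patterns and the RNN (or Transformer) modelling the long-range temporal structure.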