| Graduate Student: | 陳逸星 Yi-Hsin Chen |
|---|---|
| Thesis Title: | 應用生成對抗網路於人體姿態映射與全身風格轉換之演算法 (A Generative Adversarial Network-based Framework for Human Pose Mapping and Full Body Style Transformation) |
| Advisor: | 蘇木春 Mu-Chun Su |
| Committee Members: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering |
| Year of Publication: | 2022 |
| Graduation Academic Year: | 110 |
| Language: | Chinese |
| Pages: | 58 |
| Chinese Keywords: | Style Transfer, Deep Learning, Generative Adversarial Network, Image Processing, Computer Vision |
| English Keywords: | Style Transfer, Deep Learning, Generative Adversarial Network, Image Processing, Computer Vision |
In the past, changing the pose and motion of a person in an image relied on visual effects artists spending large amounts of time in post-production. Traditional approaches capture the person's motion with surrounding 3D camera rigs while building a 3D animated model whose joints are mapped to the person's. As technology has advanced, Generative Adversarial Networks (GANs) and other deep neural networks can now be used to help generate these images. To capture details such as a person's texture during generation, these deep learning methods often rely on the person's skeleton, a 3D mesh, semantic segmentation of body parts, or UV coordinates. This thesis proposes a GAN-based algorithm that re-renders a person, preserving their appearance details, in a specified pose. The algorithm (1) uses a Pix2pix network to translate a skeleton image into the corresponding UV-coordinate image, and (2) feeds the person's silhouette, the UV-coordinate image, and the original image into a StyleGAN-based network to generate an image of the person in the target pose. According to our experiments, skeleton-to-UV generation achieves an SSIM of 0.932 and pose and style transfer achieves an SSIM of 0.7524, demonstrating that the proposed algorithm offers a useful degree of practical applicability.
In the past, pose re-rendering relied on skilled visual effects artists and time-consuming post-production. Traditional methods build 3D camera arrays to capture a person's pose and fit human keypoints to an animation model. Today, learning-based tools such as Generative Adversarial Networks (GANs) and other neural network frameworks are used to generate such images. To capture human appearance, these methods typically use skeletons, meshes, body-part segmentation, or dense UV coordinates to recover fine appearance details. In this thesis, we present a framework that re-renders a person from a single source image into a specified pose. Our framework (1) uses a Pix2pix network to generate a UV-coordinate image from a keypoint skeleton image, and (2) takes the human foreground mask, the UV-coordinate image, and the original image as input to a StyleGAN-based network that translates the person from the source to the target pose. In our experiments, the skeleton-keypoints-to-UV-coordinates model achieves an SSIM of 0.932, and the pose re-rendering model achieves an SSIM of 0.7524. These results demonstrate that our framework has a useful degree of practical applicability.
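Both experiments are evaluated with SSIM (the structural similarity index), which compares luminance, contrast, and structure between a generated image and its ground truth. As an illustration of what that score measures, here is a minimal pure-Python sketch of the SSIM formula computed over a whole grayscale image as a single window, with the standard stabilizing constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2. This is a simplification for exposition only: the function name and the single-window variant are ours, not the thesis's implementation, and real evaluations typically use a sliding-window SSIM such as `skimage.metrics.structural_similarity`.

```python
def ssim_global(x, y, data_range=255.0):
    """Single-window SSIM between two equal-length grayscale images,
    given as flat lists of pixel intensities. Returns 1.0 for identical
    images and lower values as structure, luminance, or contrast diverge."""
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and the same size")
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


# A 4x4 image compared with itself scores exactly 1.0; a uniformly
# brightened copy keeps the same structure but a different mean,
# so only the luminance term drops below 1.
img = [float(v) for v in range(16)]
print(ssim_global(img, img))
print(ssim_global(img, [v + 40.0 for v in img]))
```

An SSIM of 0.932 for the skeleton-to-UV stage thus indicates that the generated UV-coordinate images are structurally very close to the ground truth, while the 0.7524 for full pose re-rendering reflects the harder task of synthesizing appearance in a new pose.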