
Author: Yi-Hsin Chen (陳逸星)
Title: A Generative Adversarial Network-based Framework for Human Pose Mapping and Full Body Style Transformation
Advisor: Mu-Chun Su (蘇木春)
Committee members:
Degree: Master
Department: Department of Computer Science & Information Engineering
Year of publication: 2022
Graduation academic year: 110
Language: Chinese
Pages: 58
Keywords: Style Transfer, Deep Learning, Generative Adversarial Network, Image Processing, Computer Vision
    In the past, changing the pose and motion of a person in an image required visual effects artists to spend large amounts of time in post-production. Traditional approaches capture an actor's motion with surround 3D camera rigs while building a 3D animated model whose joints correspond to the actor's body. As technology has advanced, such images can now be generated with Generative Adversarial Networks (GANs) or other deep neural networks. To capture fine details such as a person's textures during generation, these deep learning methods commonly rely on the person's skeleton, a 3D mesh, semantic segmentation of body parts, or dense UV coordinates.

    This thesis proposes a GAN-based framework that re-renders a person, with their appearance details preserved, in a specified target pose. The framework consists of two stages: (1) a Pix2pix network translates a keypoint skeleton image into the corresponding UV coordinate image; (2) a StyleGAN-based network takes the person's foreground mask, the UV coordinate image, and the source image as input and renders the person in the target pose. In our experiments, the skeleton-to-UV model achieves an SSIM of 0.932 and the pose and style transfer model achieves an SSIM of 0.7524, demonstrating that the proposed framework has a practical degree of usability.
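    The two-stage pipeline above can be sketched at the data level. The abstract does not specify the stage-2 conditioning format, so the resolution, channel order, and concatenation layout below are illustrative assumptions only:

    ```python
    import numpy as np

    H, W = 256, 256  # hypothetical working resolution

    # Stage-2 inputs described in the abstract (placeholder arrays here):
    src = np.zeros((H, W, 3), dtype=np.float32)   # source RGB image
    iuv = np.zeros((H, W, 3), dtype=np.float32)   # IUV image: part index I plus U, V coordinates
    mask = np.zeros((H, W, 1), dtype=np.float32)  # binary foreground mask of the person

    # One plausible way to feed all three to a conditional generator:
    # channel-wise concatenation into a single 7-channel tensor.
    cond = np.concatenate([src, iuv, mask], axis=-1)
    # cond.shape == (256, 256, 7)
    ```

    In the full system, `iuv` would come from the stage-1 Pix2pix model (or DensePose at training time) rather than being a placeholder.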

    1. Introduction
       1.1 Motivation
       1.2 Objectives
       1.3 Thesis Organization
    2. Background and Related Work
       2.1 Background
           2.1.1 Human Skeleton Detection
           2.1.2 UV Coordinates and DensePose
           2.1.3 Mask R-CNN
       2.2 Generative Adversarial Networks
           2.2.1 Conditional GAN
           2.2.2 Pix2pix Conditional GAN
           2.2.3 StyleGAN
           2.2.4 PoseWithStyleGAN
       2.3 Related Work
           2.3.1 Human Pose Transfer
           2.3.2 StyleGAN-related Research
    3. Methodology
       3.1 Algorithm Pipeline
       3.2 IUV Generation Model
           3.2.1 Data Preprocessing
           3.2.2 Training Method and Network Architecture
           3.2.3 Objective Function
       3.3 Pose and Style Transfer Model
           3.3.1 Data Preprocessing
           3.3.2 Training Method and Network Architecture
    4. Experiments and Results
       4.1 IUV Generation Experiments and Evaluation
           4.1.1 Dataset Description
           4.1.2 Experimental Design
           4.1.3 Results and Analysis for Different Inputs
           4.1.4 Results and Analysis for Different Keypoint Models and Batch Settings
           4.1.5 Comparison of IUV Generation Results and Performance Analysis
       4.2 Pose and Style Transfer Experiments and Evaluation
           4.2.1 StyleGAN Generation Performance
           4.2.2 Pose and Style Transfer Results and Analysis
           4.2.3 Person Translation and Turning Experiments
           4.2.4 Comparison with Related Work
       4.3 Applications and Limitations
           4.3.1 Clothing Style Transfer
           4.3.2 Limitations of the Model
    5. Conclusion
       5.1 Conclusions
       5.2 Future Work
    References
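    The experiments are evaluated with SSIM (0.932 for skeleton-to-IUV generation, 0.7524 for pose and style transfer), which compares two images through their luminance, contrast, and structure statistics. The thesis does not state which SSIM implementation it used; as a minimal sketch, a single-window variant of the standard Wang et al. formulation can be written directly in NumPy:

    ```python
    import numpy as np

    def ssim_global(x, y, data_range=1.0):
        """Simplified SSIM computed over the whole image as one window.
        (Standard implementations average over local sliding windows.)"""
        c1 = (0.01 * data_range) ** 2
        c2 = (0.03 * data_range) ** 2
        mx, my = x.mean(), y.mean()
        vx, vy = x.var(), y.var()
        cov = ((x - mx) * (y - my)).mean()
        return ((2 * mx * my + c1) * (2 * cov + c2)) / \
               ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

    img = np.random.rand(64, 64)
    score = ssim_global(img, img)  # identical images score 1.0 (up to floating point)
    ```

    Scores near 1.0 indicate close structural agreement, which is why 0.932 for IUV generation is a strong result while 0.7524 for full re-rendering reflects the harder synthesis task.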

