
Author: 侯貫中 (Kuan-Chung Hou)
Thesis title: Application of Data Visualization to Topic Detection and Tracking on Social Media Platforms (資料視覺化在社群媒體平台主題偵測與追蹤的應用)
Advisor: 林熙禎 (She-Jen Lin)
Oral defense committee:
Degree: Master
Department: College of Management - Department of Information Management
Year of publication: 2017
Academic year of graduation: 105 (ROC calendar)
Language: Chinese
Number of pages: 65
Keywords: Topic Detection and Tracking, Data visualization, Chinese natural language processing, Facebook, TF-IDF, k-medoids
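  • With the rise of social media, users are willing to express positions, comment on viewpoints, and share posts in many forms on these platforms. Because social media emphasizes the real-time spread of messages, content streams are produced without pause, and it has become a major challenge for users to learn quickly, from such a large volume of information, which topics are currently popular and which events users are following. Topic Detection and Tracking (TDT) applied to social media has consequently become a popular research subject. Traditional TDT research mainly targets highly structured articles such as news articles; this study uses Facebook as its research platform and investigates topic detection and tracking on the short posts of public fan pages.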

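    The purpose of this study is to let users grasp the events under a topic more quickly. Through data visualization, the designed architecture is examined with real news cases against the five capabilities that a topic detection and tracking system should have (story segmentation, first story detection, cluster detection, tracking, and story link detection), and its commercial uses are explained. The system workflow is divided into three stages. Data collection and extraction: post information from public fan pages is retrieved through the Facebook Graph API, and posts are mapped to specific topics by keyword matching. Data analysis: Incremental TF-IDF extracts the core feature terms of each post while avoiding excessively high term dimensionality; k-medoids document clustering, combined with an algorithm that adaptively determines the number of clusters, then groups posts automatically to identify events. Data presentation: cluster analysis and data visualization techniques present the analysis results at scale.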
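
    The keyword mapping and incremental TF-IDF steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: posts are assumed to be already tokenized (the thesis lists Jieba for word segmentation), the names `TOPIC_KEYWORDS`, `map_to_topics`, and `IncrementalTfidf` are hypothetical, and the smoothed IDF formula is one common choice that may differ from the author's.

```python
import math
from collections import Counter

# Hypothetical topic dictionary; the thesis's real keyword lists are not given.
TOPIC_KEYWORDS = {
    "election": {"vote", "ballot"},
    "typhoon": {"typhoon", "rain"},
}


def map_to_topics(tokens, topic_keywords):
    """Return the topics whose keyword set shares at least one token with the post."""
    token_set = set(tokens)
    return sorted(t for t, kws in topic_keywords.items() if token_set & kws)


class IncrementalTfidf:
    """TF-IDF whose document frequencies are updated as each post arrives."""

    def __init__(self):
        self.doc_count = 0
        self.doc_freq = Counter()

    def add(self, tokens):
        """Update corpus statistics with one tokenized post."""
        self.doc_count += 1
        self.doc_freq.update(set(tokens))

    def top_terms(self, tokens, k=3):
        """Return the k highest-weighted terms of a post under the current statistics."""
        tf = Counter(tokens)
        weights = {
            term: (count / len(tokens))
            * math.log((1 + self.doc_count) / (1 + self.doc_freq[term]))
            for term, count in tf.items()
        }
        # Sort by weight (descending), then alphabetically so ties are deterministic.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        return [term for term, _ in ranked[:k]]
```

    Keeping only the top `k` terms per post is one simple way to cap the term dimensionality that the abstract mentions; the statistics never need to be recomputed from scratch when a new post arrives.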


    With the rise of social media, people are increasingly willing to state their positions, comment, and share others' posts on these platforms. Social media emphasizes the immediacy of information, so content streams are generated continuously. As a result, helping users quickly identify hot topics and the events that users care about has become a difficult challenge. In particular, Topic Detection and Tracking (TDT) applied to social media has become a popular research area. Traditional TDT research has mainly focused on highly structured articles such as news articles. This research takes Facebook as its platform and applies Topic Detection and Tracking to the short-text posts of public fan pages.

    The primary purpose of this research is to let users grasp the events under a topic more quickly through data visualization. The designed framework is examined against the five capabilities that a topic detection and tracking system should have: story segmentation, first story detection, topic tracking, topic detection, and link detection. These capabilities are discussed using news cases, and their commercial uses are explained. The system procedure is divided into three stages. The first stage is data collection and extraction, which retrieves post information from public fan pages through the Facebook Graph API and maps each post to a specific topic by keyword matching. The second stage is data analysis, which extracts the key terms of each post with Incremental TF-IDF while avoiding excessive term dimensionality, and then applies k-medoids document clustering together with an algorithm that automatically determines the number of clusters, so that events are identified without manual grouping. The third stage is data presentation, which uses cluster analysis and data visualization techniques to present the analysis results at scale.
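
    The clustering stage can be illustrated with a short Python sketch. It assumes a precomputed pairwise distance matrix between posts, and it chooses the number of clusters by the average silhouette score. The silhouette criterion is a stand-in chosen for illustration; the abstract does not say which adaptive rule the thesis uses. The function names `k_medoids`, `silhouette`, and `choose_k` are hypothetical.

```python
def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix (list of lists)."""
    n = len(dist)

    def assign(medoids):
        # Each point joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]

    # Deterministic farthest-first initialization, starting from the most central point.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(max(candidates, key=lambda i: min(dist[i][m] for m in medoids)))

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            # The new medoid minimizes total distance to its cluster members;
            # an empty cluster keeps its previous medoid.
            new_medoids.append(
                min(members, key=lambda i: sum(dist[i][j] for j in members))
                if members else medoids[c]
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


def silhouette(dist, labels):
    """Average silhouette score; singleton clusters contribute 0."""
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, c in enumerate(labels):
        own = clusters[c]
        if len(own) == 1:
            continue
        a = sum(dist[i][j] for j in own) / (len(own) - 1)
        b = min(
            sum(dist[i][j] for j in other) / len(other)
            for oc, other in clusters.items() if oc != c
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def choose_k(dist, k_max):
    """Try k = 2..k_max and keep the clustering with the best silhouette score."""
    best = None
    for k in range(2, k_max + 1):
        medoids, labels = k_medoids(dist, k)
        score = silhouette(dist, labels)
        if best is None or score > best[0]:
            best = (score, k, medoids, labels)
    return best[1], best[2], best[3]
```

    Because k-medoids only needs pairwise distances and always picks an actual post as each cluster center, it works with any document dissimilarity measure, and each medoid can serve as a representative post for its event.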

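    Table of Contents

    Chinese Abstract
    Abstract
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    1. Introduction
        1-1 Research Background
        1-2 Research Motivation
        1-3 Research Objectives
    2. Related Work
        2-1 Topic Detection and Tracking
        2-2 Handling Short-Document Stories
            2-2-1 Document-Pivot Methods
            2-2-2 Feature-Pivot Methods
            2-2-3 Probabilistic Topic Models
        2-3 OpView Social Monitoring Platform
            2-3-1 Keyword Storm Chart
    3. System Architecture
        3-1 System Concept and Workflow
        3-2 Data Collection and Extraction
            3-2-1 Post Scoring
            3-2-2 Event Processing
        3-3 Data Analysis
            3-3-1 Jieba Chinese Word Segmentation
            3-3-2 Document Feature Extraction
            3-3-3 Semantic Similarity of Terms
            3-3-4 Document Similarity
            3-3-5 k-medoids Clustering
        3-4 Data Presentation
            3-4-1 Cluster Keyword Labeling
            3-4-2 Data Visualization
    4. Experimental Results and Discussion
        4-1 Evaluation Method
        4-2 Dataset
        4-3 Term Threshold for Feature Selection
        4-4 Synonym Filtering Parameter
        4-5 Automatic Topic Clustering Parameter
        4-6 Experiment 1: System Parameter Configuration
        4-7 Experiment 2: System Execution Efficiency Comparison
        4-8 Experiment 3: Effect of the Word2Vec Corpus on System Performance
    5. Conclusion and Future Research Directions
        5-1 Conclusion
        5-2 Research Limitations
        5-3 Future Research Directions
    References
        English References
        Chinese References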

