| Graduate student: | 莊秀敏 Hsiu-Min Chuang |
|---|---|
| Thesis title: | 從Web擷取興趣點及驗證關係 POI Extraction and Relation Verification from the Web |
| Advisor: | 張嘉惠 Chia-Hui Chang |
| Oral defense committee: | |
| Degree: | Doctor |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering |
| Year of publication: | 2016 |
| Academic year of graduation: | 104 |
| Language: | English |
| Number of pages: | 94 |
| Keywords (Chinese): | 基於位置的服務 、興趣點爬取 、興趣點關係配對 、地理資訊檢索 、店名辨識 、半監督學習 |
| Keywords (English): | Location-based service, POI crawling, POI relation pairing, geographic information retrieval, store name recognition, semi-supervised learning |
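With the spread of mobile devices and smartphones, we have witnessed rapid growth in mobile application services, especially location-based services. According to a 2014 mobile market survey, map/local search is among the most frequently used services on smartphones. Points of interest (POIs) such as shopping malls, stores, gas stations, and parking lots are common queries. Existing map services such as Google Maps and Wikimapia are built manually, either by dedicated staff or through crowdsourcing. However, manual annotation is costly and limited in coverage for POI search services; given the wealth of information on the Web, POI information for many businesses can instead be extracted from the Web. On the other hand, POI relations may change over time, so keeping POI data correct is critical. When a store relocates or closes, one address may end up paired with several stores (a one-to-many relation). Identifying outdated POI relations is therefore both important and challenging for improving database quality.
This thesis addresses two problems: (1) POI database construction and map search, and (2) POI relation verification. The first study comprises three tasks: POI extraction, POI pairing, and POI search. We propose a query-based crawling strategy that finds Web pages likely to contain addresses, extract addresses and POI names from those pages, and use a pairing model to find the most likely POI. To provide effective POI search, we integrate multiple search results for ranking. In the second study, we use weakly labeled Web data to train a verification model that detects potentially outdated POI pairs in the database. We also analyze performance under different methods and scenarios. We have built a database containing 1.25 million POIs and provide a POI search service on the Apache Solr search platform. Experimental results show that our POI search outperforms Wikimapia and the commercial app "What's the Number?", and performs close to Google Maps. For POI pairing, our method reaches an F1 of 91.1% when sufficient Google queries are available. For verifying outdated POI pairs, semi-supervised learning improves accuracy to 72.8%.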
With the popularity of mobile devices and smartphones, we have witnessed rapid growth in mobile applications and services, especially location-based services (LBS). According to a 2014 mobile marketing survey, map/local searches are among the most used services on smartphones. Points of interest (POIs), such as stores, gas stations, and parking lots, are common targets of map/local searches. Existing map services such as Google Maps and Wikimapia are constructed manually, either by professionals or through crowdsourcing. However, manual annotation is costly and limits the coverage of current POI search services. With the abundance of information on the Web, many POIs can instead be extracted from the Web. On the other hand, because POI relations change over time, it is critical to ensure the accuracy of POI data. When stores close or move, the result is often one-to-many address-to-store-name pairs. Thus, effectively identifying outdated POI relations is important and challenging for improving the quality of databases.
We focus on two problems: (1) POI database construction and search on maps, and (2) POI relation verification. The first study comprises three tasks: POI extraction, POI pairing, and POI search. We adopt a query-based crawler to find address-bearing pages, which contain addresses and POI names, and use a pairing model to couple each address with its POI name. To enable POI search, we integrate multiple search results for POI ranking. In the second study, a verification model trained on weakly labeled Web data is used to detect outdated POIs in the database. We also analyze performance with respect to different classifiers and scenarios. We crawled 1.25 million distinct POIs from the Web and implemented a POI search service on the Apache Solr platform. The results show that our POI search outperforms Wikimapia and a commercial app called "What's the Number?" and is close to Google Maps. For POI pairing, our method achieves a 91.1% F1-measure. In addition, tri-training improves the accuracy of detecting outdated POIs to 72.8%.
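To make the tri-training idea mentioned above more concrete, the following is a minimal sketch, not the thesis's implementation. It assumes a hypothetical one-dimensional "staleness" feature per address/store-name pair and a toy nearest-centroid classifier. It also simplifies tri-training by omitting the error-rate checks of the original algorithm: in each round, an unlabeled example is added to one classifier's training set whenever the other two classifiers agree on its label.

```python
import random
from collections import Counter

def train_centroids(samples):
    """Fit a toy nearest-centroid classifier: label -> mean feature value."""
    sums, counts = Counter(), Counter()
    for x, y in samples:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in counts}

def predict(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))

def stratified_bootstrap(labeled, rng):
    """Resample with replacement inside each class so no class disappears."""
    by_label = {}
    for x, y in labeled:
        by_label.setdefault(y, []).append((x, y))
    sample = []
    for items in by_label.values():
        sample.extend(rng.choice(items) for _ in items)
    return sample

def tri_train(labeled, unlabeled, rounds=3, seed=0):
    """Simplified tri-training (no error-rate checks; an assumption)."""
    rng = random.Random(seed)
    # Start from three classifiers trained on different bootstrap samples.
    models = [train_centroids(stratified_bootstrap(labeled, rng))
              for _ in range(3)]
    for _ in range(rounds):
        new_models = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            # The other two classifiers pseudo-label the points they agree on.
            pseudo = [(x, predict(models[j], x)) for x in unlabeled
                      if predict(models[j], x) == predict(models[k], x)]
            new_models.append(train_centroids(labeled + pseudo))
        models = new_models
    return models

def vote(models, x):
    """Final prediction: majority vote of the three classifiers."""
    return Counter(predict(m, x) for m in models).most_common(1)[0][0]

# Hypothetical data: (staleness score, label) for address/store-name pairs.
labeled = [(0.0, "valid"), (1.0, "valid"), (2.0, "valid"),
           (8.0, "outdated"), (9.0, "outdated"), (10.0, "outdated")]
unlabeled = [0.5, 1.5, 3.0, 7.0, 8.5, 9.5]

models = tri_train(labeled, unlabeled)
print(vote(models, 2.0), vote(models, 8.0))
```

The stratified bootstrap is a design choice made only for this toy data, so that each of the three initial classifiers sees both classes; a real system would use richer features and a stronger base learner.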