PagePilot: A Multimodal Automated Web Assistant Based on Multi-Agent Architecture

Graduate student:	葉季儒 Chi-Ju Yeh
Thesis title:	PagePilot: A Multimodal Automated Web Assistant Based on Multi-Agent Architecture
Advisor:	張嘉惠
Oral defense committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Year of publication:	2025
Graduation academic year:	113
Language:	Chinese
Number of pages:	56
Keywords (Chinese):	automation, web automation, large language model agent, Multi-Agent

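As the reasoning and multimodal analysis capabilities of large language models (LLMs) improve, many tasks can now be completed automatically, such as operating web pages. Existing tools such as browser-use and Manus can browse the web according to a user's request and handle tasks such as online shopping and information search. However, current automation tools struggle with long web pages, large numbers of articles, and operationally complex tasks; they are prone to navigation problems, visual alignment problems, and hallucination, all of which hinder automated operation. Automation agents therefore still need further research, with deep optimization for web page structure, to address these problems.

Using WebVoyager, an automated web operation system, as a reference, we propose the PagePilot system, which integrates visual input from the web page with source code information as the input to the LLM Agent. PagePilot operates web pages using visual methods, supplemented by key information extracted from the page source code, to improve performance on information extraction tasks. In addition, the system introduces optimizations such as dynamic loading and an observer agent: the former loads more content by simulating the user's mouse scrolling, while the latter provides a rollback function when an operation error occurs. Experiments show that these improvements mitigate the control problems above and raise the task completion rate. On the WebVoyager and GAIA datasets, PagePilot reaches task completion rates of 76% and 57%, respectively, both significantly exceeding the WebVoyager (65%, 27%) and GPT-4 (32%, 18%) baselines, while substantially reducing the number of operations required.

We also built a task dataset drawn from Mind2Web and a web dataset for Chinese-language sites; even on these more complex datasets, the system reaches 52% and 70%, respectively. Human evaluation and LLM evaluation give similar results, showing that our system performs well on information extraction tasks. According to the ablation results, the architecture raises the task completion rate by 30% while reducing action steps by 9%, providing a new baseline for automated web control.

Overall, we propose a web automation control system based on a Multi-Agent architecture. Through a novel combination of visual and source code inputs, and an architecture deeply optimized for web control, it substantially raises the task completion rate while reducing the number of operation steps. We also propose an evaluation dataset based on Chinese web pages to verify the feasibility of automated control on Chinese websites. We hope these methods and resources will help the field of web automation and advance related research.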

With the advancement of large language models (LLMs) in reasoning and multimodal analysis, many tasks can now be automated, such as solving mathematical problems, controlling computers, and operating web pages. However, few automated agents are deeply optimized for web page structure. Existing agents often struggle with complex tasks involving long web pages or large volumes of articles, which leads to navigation issues, visual alignment problems, and hallucination, all of which hinder automated operation.

Taking WebVoyager, an automated web operation system, as a reference, we propose the PagePilot system, which integrates both web visual input and source code information as the input to the LLM Agent. PagePilot performs web operations using visual methods, supplemented with key information extracted from the page source code to enhance its performance on information retrieval tasks. Additionally, the system incorporates optimizations such as dynamic loading and an observer agent: the former simulates user mouse scrolling to load more content, while the latter provides rollback functionality in case of operational errors. Experimental results demonstrate that these enhancements mitigate the aforementioned control issues and improve task completion rates. On the WebVoyager and GAIA datasets, PagePilot achieves task completion rates of 76% and 57%, respectively, significantly surpassing the WebVoyager (65%, 27%) and GPT-4 (32%, 18%) baselines, while also greatly reducing the number of required actions.
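The dynamic loading and observer rollback described above can be illustrated with a minimal sketch. It is not the thesis implementation: `FakePage`, `load_all`, and `run_with_observer` are hypothetical names, the page is simulated rather than driven through a real browser, and the observer is reduced to a boolean check on the resulting state.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class FakePage:
    """Hypothetical stand-in for a page that lazy-loads items on scroll."""
    total_items: int
    batch: int = 3
    loaded: int = 0

    def scroll(self) -> None:
        # Each simulated mouse scroll reveals at most one more batch.
        self.loaded = min(self.total_items, self.loaded + self.batch)


def load_all(page: FakePage, max_scrolls: int = 20) -> int:
    """Scroll until no new content appears or the scroll budget runs out.

    Returns the number of scrolls performed.
    """
    scrolls = 0
    while scrolls < max_scrolls:
        before = page.loaded
        page.scroll()
        scrolls += 1
        if page.loaded == before:  # nothing new appeared, so stop
            break
    return scrolls


def run_with_observer(
    actions: List[Callable[[int], int]],
    is_ok: Callable[[int], bool],
    state: int,
) -> Tuple[int, int]:
    """Apply actions in order, discarding any result the observer rejects.

    Returns the final state and the number of accepted actions.
    """
    history = [state]
    for act in actions:
        candidate = act(history[-1])
        if is_ok(candidate):
            history.append(candidate)
        # Otherwise the candidate is dropped: the next action starts again
        # from the last accepted state, which is the rollback.
    return history[-1], len(history) - 1


if __name__ == "__main__":
    page = FakePage(total_items=7)
    print(load_all(page), page.loaded)

    steps = [lambda s: s + 1, lambda s: -s, lambda s: s + 2]
    print(run_with_observer(steps, lambda s: s >= 0, 0))
```

In this toy run the second action produces a negative state, the observer rejects it, and the third action continues from the last accepted state. A real system would replace the integer state with browser state and the boolean check with the observer agent's judgment.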

Additionally, we constructed a task dataset derived from Mind2Web as well as a Chinese-language web dataset. Even on these more complex datasets, our system achieves 52% and 70%, respectively. Human evaluation and LLM-based evaluation yielded similar results, demonstrating that our system performs well on information retrieval tasks. According to the ablation study, this architecture improves task completion rates by 30% while reducing the number of action steps by 9%, establishing a new benchmark for automated web control.
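The abstract states improvements as percentages without saying whether they are absolute (percentage points) or relative. As a small worked example, the sketch below computes both readings for the WebVoyager and GAIA completion rates reported above. The helper name `gains` and the dictionary layout are assumptions for illustration, and the sketch does not claim which reading the thesis uses.

```python
from typing import Dict, Tuple

# Task completion rates (%) as reported in the abstract.
reported: Dict[str, Dict[str, int]] = {
    "WebVoyager dataset": {"PagePilot": 76, "WebVoyager": 65, "GPT-4": 32},
    "GAIA": {"PagePilot": 57, "WebVoyager": 27, "GPT-4": 18},
}


def gains(rates: Dict[str, int], system: str = "PagePilot") -> Dict[str, Tuple[int, float]]:
    """For each baseline, return (percentage-point gain, relative gain in %)."""
    ours = rates[system]
    return {
        name: (ours - rate, round(100 * (ours - rate) / rate, 1))
        for name, rate in rates.items()
        if name != system
    }


if __name__ == "__main__":
    for dataset, rates in reported.items():
        for baseline, (points, relative) in gains(rates).items():
            print(f"{dataset} vs {baseline}: +{points} points, +{relative}% relative")
```

On GAIA, for instance, the gap to the WebVoyager baseline is 30 percentage points, which is a relative gain of about 111%, so the two readings can differ widely.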

Overall, we propose a web automation control system based on a Multi-Agent architecture, which significantly improves task completion rates and reduces the number of operational steps through an innovative combination of visual and source code inputs and an architecture deeply optimized for web control. We also introduce an evaluation dataset based on Chinese web pages to validate the feasibility of automated control on Chinese-language websites. We hope that these methods and resources will benefit the field of web automation and promote further research in this area.

Chinese Abstract
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
1. Introduction
 1-1 Challenges
 1-2 Method
 1-3 Contributions
2. Related Work
 2-1 Workflow and Web Automation
 2-2 Reinforcement Learning for Web Automation
 2-3 Applications of Large Language Models and Vision-Language Models
 2-4 Automation Environments and Interaction Methods
 2-5 Task Planning and Action Search Algorithms
3. Method
 3-1 Multi-Agent Architecture Design
 3-2 Web Page Source Code Processing
  3-2-1 Preprocessing and Data Cleaning
  3-2-2 WebAssistant
  3-2-3 Dynamic Loading
 3-3 Web Page Screenshot Processing
 3-4 Action Agent
 3-5 Observer Agent
4. Experiments
 4-1 Datasets
 4-2 Evaluation Criteria
 4-3 Results and Analysis
 4-4 Ablation Study
 4-5 Case Analysis and Challenges
 4-6 Model Extensibility
 4-7 Comparison with Other Related Projects
5. Model Training
 5-1 Data Preparation
 5-2 Training Method
 5-3 Training Results
6. Limitations and Future Work
7. Conclusion
8. Ethical Issues
References
1. WebAssistantPrompt
2. ActionAgentPrompt
3. ObserverAgentPrompt
4. AutoEvaluationPrompt

