| Graduate Student: | De-Tai Li (李德泰) |
|---|---|
| Thesis Title: | Browser Agent 效能瓶頸分析與改進挑戰 (Browser Agent Performance Bottleneck Analysis and Improvement Challenges) |
| Advisor: | Chia-Hui Chang (張嘉惠) |
| Degree: | Master |
| Department: | Executive Master Program of Computer Science & Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2025 |
| Graduation Academic Year: | 113 (ROC calendar) |
| Language: | Chinese |
| Pages: | 37 |
| Chinese Keywords: | AI Agent, Browser Agent |
| English Keywords: | WebVoyager, Browser-use, Browser Agent |
In recent years, Large Language Models (LLMs) have made substantial advances in language understanding, reasoning, and task execution. As web interfaces increasingly become the unified entry point to heterogeneous information systems, LLM-based browser agents have emerged as an important research direction for building general-purpose intelligent agents.
This study examines WebVoyager, an open-source project released in 2024, and its performance on tasks executed against real-world websites. Preliminary experiments show that the original model performs poorly, in both comprehension and execution efficiency, on websites with highly dynamic content, complex structure, or visually oriented design.
To improve the agent's adaptability, this study proposes improvements to WebVoyager's core capabilities and overall architecture, and conducts a comparative analysis against another mainstream open-source system, Browser-use, covering perception, reasoning and planning, and execution methods. The WebVoyager dataset is adopted as the evaluation standard for empirical testing on the relevant tasks.
In addition, this study introduces a Retrieval-Augmented Generation (RAG) mechanism: through lightweight knowledge texts constructed in this work, the agent acquires basic knowledge of a website's structure and functionality before executing a task, thereby improving its comprehension and operational accuracy. Experimental results show that the RAG-enhanced WebVoyager improves the task success rate by 8.7% and outperforms Browser-use in most test scenarios. These results confirm the tangible benefit of integrating external knowledge for LLM decision quality and for the generalization ability of browser-agent systems.
In recent years, Large Language Models (LLMs) have demonstrated significant improvements in language understanding, reasoning, and task execution. As
web interfaces increasingly serve as unified access points to heterogeneous information systems, LLM-based browser agents have emerged as a crucial direction
for building general-purpose intelligent agents.
This study focuses on WebVoyager, an open-source project released in 2024,
investigating its performance in executing tasks on real-world websites. Preliminary experiments reveal that the original model struggles with sites characterized
by dynamic content, complex structures, or visually oriented layouts, resulting
in inefficiencies in comprehension and execution.
To enhance the adaptability of browser agents, this research proposes improvements to both the capabilities and architecture of WebVoyager, and conducts a comparative analysis with another mainstream open-source system, Browser-use. The comparison covers aspects such as perception, planning, and execution
strategies. The evaluation is based on the WebVoyager benchmark dataset and
includes empirical testing across relevant tasks.
Furthermore, this study integrates a Retrieval-Augmented Generation (RAG)
mechanism. By providing lightweight knowledge texts constructed during the
experiments, the agent can acquire basic knowledge of website structures and
functionalities prior to task execution, thereby improving its comprehension and
operational accuracy. Experimental results show that the RAG-enhanced WebVoyager achieves an 8.7% improvement in task success rate and outperforms Browser-use in most test scenarios. These findings demonstrate the
practical benefits of external knowledge integration for improving LLM decision
quality and the generalization ability of browser agents.
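The retrieval step described above can be sketched in a few lines: before acting on a site, the agent looks up short knowledge texts relevant to its task and prepends the best matches to its prompt. This is a minimal illustration, not the thesis implementation; the function names, the example snippets, and the word-overlap scoring are all assumptions here (a real system would typically use embedding-based similarity for retrieval).

```python
# Minimal sketch of RAG-style prompt augmentation for a browser agent.
# All names (score, retrieve, build_prompt) and the sample knowledge
# texts are illustrative, not taken from the thesis.

def score(query: str, doc: str) -> int:
    """Crude relevance score: number of shared lowercase word tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k knowledge texts most relevant to the task query."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(task: str, docs: list[str]) -> str:
    """Prepend retrieved site knowledge to the agent's task prompt."""
    context = "\n".join(f"- {d}" for d in retrieve(task, docs))
    return f"Site knowledge:\n{context}\n\nTask: {task}"

# Hypothetical lightweight knowledge texts for a booking site.
knowledge = [
    "The search form on the booking site requires a date in YYYY-MM-DD format.",
    "Flight results load dynamically; wait for the results list before clicking.",
    "The login page is reached via the account icon in the top-right corner.",
]

prompt = build_prompt("search for a flight on the booking site", knowledge)
```

The design point is that the retrieved texts are injected before task execution, so the agent's first action is already conditioned on site-specific structure rather than discovered through trial and error.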