
Author: Huai-Hsuan Huang (黃懷萱)
Thesis Title: DREAM: Domain-Retrieved Evidence and Multi-Agent Reasoning for Structured Paper Evaluation
(Chinese title: DREAM:結合領域知識檢索與多代理推理的結構化論文評估方法)
Advisor: Chia-Hui Chang (張嘉惠)
Committee Members:
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science & Information Engineering
Publication Year: 2025
Graduation Academic Year: 113 (ROC calendar)
Language: Chinese
Pages: 94
Chinese Keywords: multi-agent system design, RAG, peer review
Foreign Keywords: Multi-Agent System, RAG, Peer Review
Access counts: Views: 72, Downloads: 0
    With the continuing expansion of global academic output, the traditional peer-review system faces severe reviewer shortages and inconsistent review quality. Because reviewing is inherently time-consuming and relies heavily on expert judgment, demand for automated review-assistance systems keeps growing. Although existing language models show preliminary generative ability, they still fall markedly short of human experts in accuracy, interpretability, and bias control, which limits their practical adoption.

    To address this difficulty, this thesis designs a modular AI-assisted paper-review system that improves the accuracy, professionalism, and consistency of generated reviews. The overall pipeline embodies three design ideas. First, a memory module performs reflective learning: it records past error patterns and distills them into generalized advice, so that the model can proactively correct its biases in later tasks, improving decision stability and traceability. Second, the generation stage employs Retrieval-Augmented Generation (RAG) to dynamically retrieve cross-domain knowledge relevant to a paper's topic, supplying external evidence that strengthens the factuality and depth of the review. Third, a Multi-Agent System (MAS) decomposes the complex review task into concrete facets, including readability, methodological rigor, and academic contribution, each handled by a dedicated agent; a consolidating agent then merges the individual outputs into a structured review with decision-making value. The system not only improves feedback quality but is also modular and extensible, adapting flexibly to different review needs and application scenarios.
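The reflective-learning idea described above can be sketched in a few lines. This is a minimal illustration with hypothetical names (`ReflectiveMemory`, `reflect`, `as_prompt_prefix` are not from the thesis): past error patterns are recorded together with generalized advice, and the accumulated advice is prepended to future review prompts.

```python
from dataclasses import dataclass, field

@dataclass
class ReflectiveMemory:
    """Hypothetical sketch of a reflective memory module."""
    entries: list = field(default_factory=list)  # (error_pattern, advice) pairs

    def reflect(self, error_pattern: str, advice: str) -> None:
        """Record a past mistake together with the lesson drawn from it."""
        self.entries.append((error_pattern, advice))

    def as_prompt_prefix(self) -> str:
        """Turn accumulated lessons into guidance for the next review task."""
        if not self.entries:
            return ""
        lines = ["Lessons from previous reviews:"]
        lines += [f"- {advice}" for _, advice in self.entries]
        return "\n".join(lines)

memory = ReflectiveMemory()
memory.reflect("over-rejected borderline papers",
               "Weigh novelty against clarity before deciding to reject.")
prefix = memory.as_prompt_prefix()
```

In this sketch the prefix would be injected into the reviewing model's prompt, which is one plausible way to make past errors influence future decisions without retraining.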

    Using ECNU-SEA/SEA-E as the baseline, which achieves F1 = 0.700 on the test set, adding the memory module alone raises F1 from 0.700 to 0.771, mainly because it increases the proportion of correctly rejected samples (true positives, TP) without sacrificing the accuracy of correct acceptances (true negatives, TN). The multi-agent architecture alone improves F1 to 0.750; fine-tuning each agent with QLoRA raises it further to 0.786, and integrating the memory module on top reaches 0.801. These results confirm that the memory module and the agent system are complementary, offering gains in precision and stability together with good extensibility and module-level interpretability.


    With the rapid growth of global academic output, traditional peer review faces mounting challenges such as reviewer shortages and inconsistent quality. While large language models (LLMs) offer initial support for automating this process, they still fall short of human experts in accuracy, interpretability, and bias control, limiting their practical adoption.

    To overcome these issues, we propose a modular AI-assisted review system designed to enhance the accuracy, consistency, and professionalism of automated reviews. The system integrates three key components: (1) a Memory Module for reflective learning, which records past errors and improves decision stability; (2) Retrieval-Augmented Generation (RAG) to incorporate cross-domain factual knowledge; and (3) a Multi-Agent System (MAS) that delegates subtasks—such as readability, methodology, and contribution—to specialized agents, with a final agent consolidating the outputs.
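The delegation pattern behind the MAS component can be illustrated with a small sketch. The agent functions below are hypothetical stand-ins (the thesis's agents are LLM-backed, not string templates): each specialized agent reviews one facet, and a consolidating agent merges the facet-level opinions into one structured review.

```python
# Hypothetical facet agents; in the real system each would call an LLM.
def readability_agent(paper: str) -> str:
    return f"Readability: the writing in '{paper}' is clear and well organized."

def methodology_agent(paper: str) -> str:
    return f"Methodology: the experimental design of '{paper}' is sound."

def contribution_agent(paper: str) -> str:
    return f"Contribution: '{paper}' offers an incremental but useful advance."

def consolidating_agent(opinions: list) -> str:
    """Merge facet-level opinions into one structured review."""
    return "Structured review:\n" + "\n".join(f"- {o}" for o in opinions)

agents = [readability_agent, methodology_agent, contribution_agent]
review = consolidating_agent([agent("Example Paper") for agent in agents])
```

The design choice this illustrates is separation of concerns: each facet agent can be prompted, fine-tuned, or replaced independently, which is what makes the system modular and extensible.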

    Using ECNU-SEA/SEA-E as the base model (F1 = 0.700), adding the Memory Module boosts performance to 0.771 by significantly improving true positive (TP) rates. The MAS alone achieves 0.750, while fine-tuning each agent with QLoRA increases it to 0.786. Combining MAS and Memory further lifts performance to 0.801, demonstrating the system’s synergy, scalability, and robustness.
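As a hedged sketch of how the F1 figures above are computed, assuming (as the abstract suggests) that "reject" is treated as the positive class, so TP counts correctly rejected papers and TN correctly accepted ones. The counts in the example are illustrative only, not from the thesis.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion-matrix counts, guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 70 correct rejections, 30 wrong in each direction.
print(round(f1_score(tp=70, fp=30, fn=30), 3))  # → 0.7
```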

    Table of Contents
    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    1. Introduction
       1-1 Background
       1-2 Current Research Challenges
       1-3 Method
       1-4 Contributions
    2. Related Work
       2-1 Peer Review
       2-2 Peer-Review Datasets
       2-3 Automated Peer-Review System Architectures
       2-4 Retrieval-Augmented Generation (RAG)
       2-5 Multi-Agent System (MAS)
       2-6 Evaluation Metrics
    3. Methodology
       3-1 Data Preparation
           3-1-1 Multiple Reviews Dataset
           3-1-2 Additional Knowledge
           3-1-3 Review-Facet Metrics
       3-2 Standardization Module
       3-3 Retrieval Module
       3-4 Review Strategies
           3-4-1 Rejection-Reason Analysis and Rule Design
           3-4-2 Multi-Agent System (MAS)
           3-4-3 Reflective Memory
           3-4-4 Paper Compression
    4. Evaluation
       4-1 Environment Setup
           4-1-1 Fine-Tuning Settings
           4-1-2 Format-Error Handling and Data Reconstruction
       4-2 Decision Metrics
           4-2-1 Decision Evaluation of Baselines (Reflective Memory and Domain Retrieval)
           4-2-2 Decision Analysis of the Multi-Agent System
           4-2-3 Multi-Agent Ablation Study
       4-3 Numerical Metrics
           4-3-1 Numerical Analysis of Baselines (Reflective Memory and Domain Retrieval)
           4-3-2 Numerical Analysis of the Multi-Agent System
       4-4 Textual Metrics
       4-5 Other Explorations
           4-5-1 Extending GPT's Potential
           4-5-2 Paper Compression
    5. Conclusion
    6. Limitations and Future Work
    7. System Demonstration and Implementation
    Index
    References
    Appendix A
       A-1 General Prompts
           A-1-1 Standardization Prompt
           A-1-2 Review Prompt
           A-1-3 Retrieval-Module Prompt
       A-2 Multi-Agent Prompts
           A-2-1 Retrieval-Module Prompt: Readability and Organization
           A-2-2 Retrieval-Module Prompt: Methodological Soundness and Rigor
           A-2-3 Retrieval-Module Prompt: Contribution and Novelty
           A-2-4 Retrieval-Module Prompt: Summary
       A-3 Memory Module
           A-3-1 Reflection-Module Prompt
    Appendix B
       B-1 Baseline Memory
       B-2 Readability Memory
       B-3 Methodology Memory
       B-4 Contribution Memory
    Appendix C
       C-1 Additional-Knowledge Vector Data Extraction
           C-1-1 ACL Crawler
           C-1-2 ScienceDirect Crawler
       C-2 Multi-Review Data Extraction

