Test-Time Adaptation

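📄 FutureSim: Replaying World Events to Evaluate Adaptive Agents
#Benchmarking #LargeLanguageModels #AdaptiveAgents #TestTimeAdaptation

✅ 7.6/10 | Top 25% | arxiv
Academic quality 6.1/8 | Impact 0.8/1 | Reproducibility 0.7/1 | Confidence: high

👥 Authors and affiliations
First author: Shashwat Goel (ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems)
Corresponding author: not specified
Author list: Shashwat Goel (ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems), Nikhil Chandak (Max Planck Institute for Intelligent Systems, Tübingen AI Center), Arvindh Arun (Institute for AI, University of Stuttgart), Ameya Prabhu (Tübingen AI Center, University of Tübingen), Steffen Staab (Institute for AI, University of Stuttgart, University of Southampton), Moritz Hardt (Max Planck Institute for Intelligent Systems, Tübingen AI Center), Maksym Andriushchenko (ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems), Jonas Geiping (ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center) (Note: the paper marks the first three authors as equal contributors.)

💡 Blunt review
Highlights: The paper builds a long-horizon adaptive evaluation environment that is both "grounded" (based on real news) and "controllable" (replayable and ablatable), and it turns forecasting tasks into a probe of how well an AI's world model evolves over time. The experimental design (for example, the "direct query" vs. "sequential update" comparison, and isolating adaptation ability by giving all models the same initial predictions) precisely quantifies the core weaknesses of current models and gives emerging research directions such as test-time adaptation, memory, and search a clear experimental paradigm.
Weaknesses: The core step of the evaluation pipeline, matching free-form answers, depends entirely on a commercial LLM (DeepSeek V3.2). The consistency and reliability of its matching, and its bias toward different answer formats, have not been systematically validated, which puts the credibility of the benchmark's scores at risk. In addition, although the framework is open source, reproducing the core results requires paying for expensive closed-source model APIs or coding-tool subscriptions (GPT 5.5/Codex, Claude Code) and bearing the high cost of the simulation runs themselves, which in practice limits reproduction by teams without such resources.
...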