ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
📄 ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents #基准测试 #模型评估 #多模态模型 #大语言模型 #动态环境 ✅ 7.0/10 | 前25% | #基准测试 | #模型评估 | #多模态模型 #大语言模型 | arxiv 学术质量 6.5/7 | 选题价值 2.0/2 | 复现加成 0.5 | 置信度 高 👥 作者与机构 第一作者:Fanqing Meng (Evolvent AI, National University of Singapore) - 根据论文附录,其有*号标记为共同贡献者。 通讯作者:Mengkang Hu†, Michael Qizhe Shieh† (Evolvent AI, National University of Singapore) - 根据论文附录,其有†号标记为通讯作者。 作者列表:Fanqing Meng (Evolvent AI, National University of Singapore), Lingxiao Du (National University of Singapore), Zijian Wu (National University of Singapore), Guanzheng Chen (National University of Singapore), Xiangyan Liu (National University of Singapore), Jiaqi Liao (Independent Researcher), Chonghe Jiang (Massachusetts Institute of Technology), Zhenglin Wan (National University of Singapore), Jiawei Gu (University of Washington), Pengfei Zhou (National University of Singapore), Rui Huang (The University of Hong Kong), Ziqi Zhao (The Hong Kong Polytechnic University), Shengyuan Ding (Fudan University), Ailing Yu (Independent Researcher), Bo Peng (Shanghai Jiao Tong University), Bowei Xia (University of Electronic Science and Technology of China), Hao Sun (Peking University), Haotian Liang (University of Science and Technology of China), Ji Xie (Zhejiang University), Jiajun Chen (National University of Singapore), Jiajun Song (Renmin University of China), Liu Yang (The Hong Kong Polytechnic University), Ming Xu (National University of Singapore), Qionglin Qiu (Hunan University), Runhao Fu (Anhui University), Shengfang Zhai (National University of Singapore), Shijian Wang (Southeast University), Tengfei Ma (The Chinese University of Hong Kong), Tianyi Wu (National University of Singapore), Weiyang Jin (The University of Hong Kong), Yan Wang (Tongji University), Yang Dai (National University of Singapore), Yao Lai (The University of Hong Kong), Youwei Shu (National University of Singapore), Yue Liu (National University of Singapore), Yunzhuo Hao (Zhejiang University), Yuwei Niu (Peking University), Jinkai Huang (Evolvent AI, National University of Singapore), Jiayuan Zhuo (Evolvent AI, National University of Singapore), Zhennan Shen (The Hong Kong University of Science and Technology), Linyu Wu (National University of Singapore), Cihang Xie (University of California, Santa Cruz), Yuyin Zhou (University of California, Santa Cruz), Jiaheng Zhang (National University of Singapore), Zeyu Zheng (University of California, Berkeley), Mengkang Hu (Evolvent AI, National University of Singapore), Michael Qizhe Shieh (Evolvent AI, National University of Singapore)。 💡 毒舌点评 亮点:提出了一个设计极其严谨、评估维度(多天、动态环境、全模态)全面且完全杜绝“LLM当裁判”评分模糊性的智能体基准测试,填补了重要空白。短板:作为基准测试,其本身不产出新的模型或算法,对推动模型能力提升的作用是间接的;且100个任务的规模对于构建稳健的排行榜可能稍显不足。 ...