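ProgramBench: Can Language Models Rebuild Programs From Scratch?
The study introduces ProgramBench, a benchmark that evaluates whether language-model agents can rebuild a complete, working codebase from scratch when given only a reference program and its documentation. The benchmark contains 200 tasks, ranging from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. Of the 9 models evaluated, none fully resolves any task, and the best model passes 95% of tests on only 3% of tasks, which highlights the limits of current language models on whole-program software engineering. The work provides a new tool for measuring how well models can actually build software.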
Title: ProgramBench: Can Language Models Rebuild Programs From Scratch?
Abstract: Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
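The abstract does not describe the evaluation harness in detail, so the following is only a minimal sketch of how behavior-based scoring of this kind could look. It assumes a test case is just a stdin string and that "matching behavior" means identical exit code and stdout; both are simplifying assumptions, and every name here (`run_case`, `pass_rate`, `fraction_of_tasks_at_or_above`) is hypothetical rather than taken from ProgramBench. The sketch also takes test inputs as given and does not attempt the agent-driven fuzzing that generates them.

```python
import subprocess
from typing import Sequence, Tuple


def run_case(cmd: Sequence[str], stdin_text: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Run one command on one input and return (exit code, stdout)."""
    proc = subprocess.run(
        list(cmd),
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


def pass_rate(
    reference_cmd: Sequence[str],
    candidate_cmd: Sequence[str],
    stdin_cases: Sequence[str],
) -> float:
    """Fraction of inputs on which the candidate reproduces the reference's
    exit code and stdout. Only observable behavior is compared, so the
    candidate's internal structure is never inspected."""
    if not stdin_cases:
        return 0.0
    passed = sum(
        run_case(reference_cmd, case) == run_case(candidate_cmd, case)
        for case in stdin_cases
    )
    return passed / len(stdin_cases)


def fraction_of_tasks_at_or_above(
    task_pass_rates: Sequence[float], threshold: float = 0.95
) -> float:
    """Share of tasks whose pass rate reaches the threshold (assumed reading
    of 'passing 95% of tests on only 3% of tasks')."""
    if not task_pass_rates:
        return 0.0
    hits = sum(rate >= threshold for rate in task_pass_rates)
    return hits / len(task_pass_rates)
```

Because the comparison is black-box, a single-file candidate and a well-modularized one score the same if they behave identically, which is consistent with the abstract's point that the tests do not prescribe implementation structure.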
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as: arXiv:2605.03546 [cs.SE] (or arXiv:2605.03546v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2605.03546 (arXiv-issued DOI via DataCite, pending registration)