How Expensify’s open-source program is powering OpenAI’s next-gen AI engineering benchmarks

The rapid rise of large language models (LLMs) has opened up exciting new possibilities for AI-driven software development. But how well can these models tackle real-world engineering tasks?
That question led to the creation of SWE-Lancer—a benchmark developed by OpenAI that evaluates LLMs using actual freelance software tasks from Expensify’s open-source repository.
Let’s explore how Expensify contributed to the SWE-Lancer project and what this means for the future of AI in software engineering.
SWE-Lancer: A real-world testbed for AI in software engineering
Most AI coding benchmarks focus on isolated code snippets or theoretical problems. SWE-Lancer is different: it evaluates LLMs on 1,488 real-world software tasks, all sourced from Expensify's freelance jobs and its open-source repository.
These tasks range from simple bug fixes ($250 payouts) to complex feature implementations ($32,000 payouts), covering:
Frontend & backend development
API integrations
Full-stack debugging
Software architecture decision-making
By leveraging Expensify’s open-source development lifecycle—which encapsulates the intricate challenges of real-world engineering workflows from initial triage to testing and deployment—SWE-Lancer provides the most realistic AI software engineering benchmark to date.
Mapping AI performance to real economic value
One of SWE-Lancer’s most significant innovations is mapping AI coding performance to real-world money. Since each task in the benchmark has an actual payout, researchers can measure AI success not just in terms of accuracy but in dollars earned.
For example, the best-performing LLM in the study, Claude 3.5 Sonnet, earned $403,000 out of a possible $1,000,000.
This unique approach allows researchers to estimate AI’s impact on the freelance job market—an essential step in understanding how automation may shape software engineering careers in the future.
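As a rough illustration of how payout-weighted scoring works, here is a minimal Python sketch; the Task fields and the scoring logic are assumptions made for this example, not SWE-Lancer's actual data format or evaluation harness.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    payout_usd: int   # the real freelance payout attached to this task
    passed: bool      # did the model's patch pass the end-to-end tests?

def score_run(tasks: list[Task]) -> dict:
    """Aggregate pass counts and payout-weighted earnings for one model run."""
    earned = sum(t.payout_usd for t in tasks if t.passed)
    possible = sum(t.payout_usd for t in tasks)
    return {
        "tasks_passed": sum(t.passed for t in tasks),
        "tasks_total": len(tasks),
        "earned_usd": earned,
        "possible_usd": possible,
        "earn_rate": earned / possible if possible else 0.0,
    }

# Example: a $250 bug fix that passed and a $32,000 feature that failed
print(score_run([
    Task("bugfix-1", 250, True),
    Task("feature-9", 32_000, False),
]))  # earned_usd=250, possible_usd=32250, earn_rate ~ 0.008
```

Weighting by payout means a model that only solves cheap bug fixes scores far lower than one that lands the expensive feature work, which is exactly the economic signal the benchmark is designed to capture.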
A more rigorous evaluation standard
Existing AI coding benchmarks often rely on unit tests, which can be easily gamed by models that memorize patterns instead of understanding the problem. SWE-Lancer takes a different approach by emphasizing:
End-to-end tests: Simulating real user workflows to ensure AI solutions work in production.
Real-world challenges: Leveraging authentic issues from the Expensify app ensures that models tackle genuine, production-level problems with real business impact.
Full-stack evaluation: AI models must reason across multiple layers of an application, rather than solving isolated coding problems.
By raising the bar for AI evaluation, SWE-Lancer provides deeper insights into where AI can truly contribute to software development.
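To make the end-to-end idea concrete, here is a hedged sketch of the kind of browser-level test that exercises a full user workflow, written with Playwright's Python API; the URL, selectors, and expense-submission flow are illustrative assumptions, not code from Expensify's test suite or the benchmark harness.

```python
# A hypothetical browser-level check: the test drives the app the way a user
# would, so a patch only "passes" if the whole flow still works end to end.
from playwright.sync_api import sync_playwright, expect

def test_submit_expense_flow():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Assumed local build of the app under test
        page.goto("http://localhost:8080")

        # Simulate the user workflow rather than calling one function directly
        page.click("text=New expense")
        page.fill("[data-testid=amount-input]", "42.50")
        page.fill("[data-testid=merchant-input]", "Coffee Shop")
        page.click("text=Submit")

        # The fix has to hold up through the full UI round trip
        expect(page.locator("text=42.50")).to_be_visible()
        browser.close()

if __name__ == "__main__":
    test_submit_expense_flow()
```

Because a test like this fails whenever any layer of the stack misbehaves, it is much harder for a model to pass by pattern-matching than a narrow unit test would be.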
AI as a software manager: A new frontier
Beyond coding, SWE-Lancer also evaluates whether LLMs can function as software managers. Expensify’s hiring decisions provided the basis for SWE Manager tasks, where models:
Review multiple software proposals
Select the best implementation strategy
Are graded based on how well they match real human engineering manager decisions
Surprisingly, LLMs performed better at management tasks than at direct coding, suggesting that AI may first integrate into software teams as an advisor rather than a coder.
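As a sketch of how manager-task grading could work, the example below compares a model's proposal picks against the choices real engineering managers made; the data structures and field names are assumptions for illustration, not the benchmark's actual format.

```python
from dataclasses import dataclass

@dataclass
class ManagerTask:
    issue_id: str
    proposal_ids: list[str]   # competing freelancer proposals for this issue
    chosen_by_human: str      # the proposal the real engineering manager selected

def grade_manager_run(tasks: list[ManagerTask], model_picks: dict[str, str]) -> float:
    """Fraction of issues where the model picked the same proposal as the human manager."""
    agree = sum(1 for t in tasks if model_picks.get(t.issue_id) == t.chosen_by_human)
    return agree / len(tasks) if tasks else 0.0

# Example: the model agrees with the human manager on one of two issues
tasks = [
    ManagerTask("issue-101", ["p1", "p2", "p3"], chosen_by_human="p2"),
    ManagerTask("issue-102", ["p4", "p5"], chosen_by_human="p5"),
]
print(grade_manager_run(tasks, {"issue-101": "p2", "issue-102": "p4"}))  # 0.5
```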
The bigger picture: AI and the future of freelance work
Expensify is one of the few companies actively integrating freelance engineering into an open-source development model. Our contributions program provides an early glimpse into how AI could impact the freelance economy:
Will AI replace freelance engineers, or augment them?
Can LLMs bridge skill gaps for underrepresented developers?
Might engineering managers lean on LLMs as advisors?
By open-sourcing its task pipeline, Expensify is helping the AI community explore these critical economic and ethical questions.
A bold experiment with real-world implications
Expensify’s contributions to SWE-Lancer represent a real-world experiment in AI-driven software engineering. With our open-source app and freelance program, Expensify has helped create the most realistic AI coding benchmark to date.
This research is just the beginning. As AI models continue to evolve, Expensify’s open-source development model could serve as a blueprint for the future of AI-assisted engineering teams.
FAQs
What is SWE-Lancer?
SWE-Lancer is a benchmark developed by OpenAI that evaluates large language models on 1,488 real-world freelance software engineering tasks drawn from Expensify's open-source repository. Because every task carries its original payout, model performance can be measured in dollars earned as well as tasks completed.
How is SWE-Lancer different from other AI coding benchmarks?
Most benchmarks test isolated code snippets against unit tests that models can game by memorizing patterns. SWE-Lancer instead relies on end-to-end tests that simulate real user workflows, full-stack tasks from a production codebase, and real payouts that tie performance to economic value.
What kinds of tasks does SWE-Lancer include?
Tasks range from simple bug fixes ($250 payouts) to complex feature implementations ($32,000 payouts), covering frontend and backend development, API integrations, full-stack debugging, and software architecture decisions. A separate set of SWE Manager tasks asks models to choose among competing implementation proposals.
How well did AI models perform on SWE-Lancer?
The best-performing model in the study, Claude 3.5 Sonnet, earned $403,000 of a possible $1,000,000 in task payouts. Notably, models scored higher on SWE Manager tasks than on direct coding, suggesting AI may first join engineering teams as an advisor rather than a coder.
How does Expensify's open-source program contribute to SWE-Lancer?
Expensify's open-source program provides OpenAI's SWE-Lancer benchmark with 1,488 real-world software engineering tasks, ranging from bug fixes to complex feature implementations. These tasks, which come with actual monetary payouts ($250-$32,000), allow researchers to evaluate AI models' performance in terms of real economic value and practical engineering capabilities.
What makes this contribution particularly valuable is that it offers a comprehensive testing environment with end-to-end tests, production-level problems, and management task evaluation—providing a much more realistic benchmark than traditional coding tests that rely on isolated snippets or theoretical problems.