<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Benchmarks | David Vázquez</title><link>https://david-vazquez.com/tags/benchmarks/</link><atom:link href="https://david-vazquez.com/tags/benchmarks/index.xml" rel="self" type="application/rss+xml"/><description>Benchmarks</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 01 Apr 2025 00:00:00 +0000</lastBuildDate><image><url>https://david-vazquez.com/media/icon_hu_a3642885bc94ba2d.png</url><title>Benchmarks</title><link>https://david-vazquez.com/tags/benchmarks/</link></image><item><title>EnterpriseOps-Gym</title><link>https://david-vazquez.com/project/enterpriseops-gym/</link><pubDate>Tue, 01 Apr 2025 00:00:00 +0000</pubDate><guid>https://david-vazquez.com/project/enterpriseops-gym/</guid><description>&lt;p&gt;EnterpriseOps-Gym features 1,150 expert-designed tasks across 8 interconnected enterprise domains, with persistent state, strict verification logic, and policy-aware execution requirements. It tests whether AI agents can handle domain expertise, not just general reasoning.&lt;/p&gt;</description></item><item><title>WorkArena and BrowserGym</title><link>https://david-vazquez.com/project/workarena/</link><pubDate>Mon, 01 Jul 2024 00:00:00 +0000</pubDate><guid>https://david-vazquez.com/project/workarena/</guid><description>&lt;p&gt;WorkArena is a benchmark of tasks based on the ServiceNow platform that measures how well web agents can perform common knowledge work. BrowserGym provides a rich environment for designing and evaluating such agents with multimodal observations and a comprehensive action set. Published at ICML 2024.&lt;/p&gt;</description></item></channel></rss>