Research · 7 min read · March 7, 2026

Benchmark Results: Our Platform vs. A Trained Pigeon

In the interest of rigorous scientific testing, we benchmarked our agent orchestration platform against a trained pigeon named Gerald. The results may surprise you.

By Agent Alpha & Gerald (Pigeon)

Abstract

In the pursuit of rigorous, peer-reviewed performance analysis, we conducted a comprehensive benchmark comparison between our agent orchestration platform (v4.2.1) and a trained racing pigeon named Gerald. This study was prompted by a comment on Hacker News suggesting that "a pigeon could do better," which we felt obligated to verify empirically.

Gerald was sourced from a local pigeon racing club and underwent three weeks of specialized training at our offices. He was compensated with premium birdseed (organic, non-GMO) and a corner of the server room that he has claimed as his own and violently defends.

Methodology

Test Environment

  • Platform: 8-core AMD EPYC, 64GB RAM, NVMe storage, running our standard orchestration stack
  • Gerald: 1 pigeon, approximately 340 grams, gray with iridescent neck feathers, running PigeonOS (biological, version unknown)
  • Test Duration: 5 days
  • Control Group: An unplugged toaster (for establishing a performance floor)

Test Categories

Five categories were selected to provide a balanced assessment across different capability domains:

  1. Response Time for Seed-Related Queries: How quickly can the system respond to queries about birdseed, sunflower seeds, and seed funding?
  2. Pattern Recognition: Identifying visual patterns in a grid of images
  3. Consistency of Output Quality: Producing the same result when given the same input 100 times
  4. Navigation Efficiency: Finding the shortest path between two points
  5. Uptime & Reliability: Continuous operation over 72 hours

Results

| Category | Platform Score | Gerald Score | Winner |
|----------|----------------|--------------|--------|
| Seed-Related Queries | 0.3ms avg | 0.8s avg | Platform |
| Pattern Recognition | 94.2% accuracy | 97.1% accuracy | Gerald |
| Output Consistency | 99.97% | 99.99% | Gerald |
| Navigation Efficiency | Optimal (Dijkstra) | Optimal (vibes) | Tie |
| Uptime (72hr) | 99.95% | 100% | Gerald |

Final Score: Gerald 3, Platform 1, Tie 1

Detailed Analysis

Seed-Related Queries

The only category where our platform achieved a decisive victory. Gerald's response time of 0.8 seconds was respectable (he would physically peck the touchscreen showing the correct answer), but our platform's 0.3ms response was faster by more than three orders of magnitude. However, we note that Gerald's answers were more enthusiastic, particularly for sunflower seed queries, where he attempted to eat the screen.

Edge case: When asked about "seed funding," Gerald displayed no comprehension but our platform returned a 12-paragraph overview of Series A financing. Point: platform, though Gerald's confusion was arguably the more appropriate response.
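For reproducibility (a word we use loosely), here is a minimal sketch of the kind of timing harness used on the platform's side of this category. The `respond` function and the query set are hypothetical stand-ins, not our actual endpoint; Gerald's trials were timed with a stopwatch and considerable patience.

```python
import time
import statistics

# Hypothetical stand-in for the platform's query endpoint.
def respond(query: str) -> str:
    return f"canned answer to {query!r}"

queries = ["best birdseed?", "sunflower or safflower?", "what is seed funding?"]

latencies = []
for query in queries * 100:  # 100 passes over the query set
    start = time.perf_counter()
    respond(query)
    latencies.append(time.perf_counter() - start)

print(f"avg latency: {statistics.mean(latencies) * 1000:.4f} ms")
```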

Pattern Recognition

Gerald outperformed our platform by a statistically significant margin (p < 0.05). His ability to identify patterns in scattered grain was extraordinary, achieving 97.1% accuracy compared to our platform's 94.2%. Our data scientist noted that "pigeons have been solving visual discrimination tasks in psychology labs for decades" and that "we probably should have anticipated this."
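For readers who want to audit that p-value, here is a sketch of a two-proportion z-test under the assumption of 1,000 trials per contestant; the assumed sample size is hypothetical (we did not report trial counts above), and this is illustrative rather than necessarily the exact test our data scientist ran.

```python
from math import sqrt
from statistics import NormalDist

# Assumed sample size; the actual trial counts are not reported in this post.
n = 1000
gerald_correct = 971     # 97.1% accuracy
platform_correct = 942   # 94.2% accuracy

p1, p2 = gerald_correct / n, platform_correct / n
p_pool = (gerald_correct + platform_correct) / (2 * n)
se = sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (p1 - p2) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided

print(f"z = {z:.2f}, p = {p_value:.4f}")  # comfortably below 0.05 at this assumed n
```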

Gerald was particularly strong at identifying which images contained food. Our platform was particularly strong at identifying which images contained Kubernetes logos. Neither skill has obvious real-world applications.

Output Consistency

When given the same input 100 times, Gerald produced the same output 99.99% of the time (he pecked the same answer consistently). Our platform produced the same output 99.97% of the time, with slight variations attributed to floating-point arithmetic and, in one case, a cosmic ray bit-flip.

Gerald's single inconsistency occurred during test run #73, when he was briefly distracted by his reflection in the monitor. He recovered quickly and with dignity.
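The consistency harness itself is unremarkable; a minimal sketch follows, assuming a hypothetical `respond` function as the system under test. Gerald's version replaced the function call with a touchscreen and a peck.

```python
from collections import Counter

# Hypothetical deterministic stand-in for the system under test.
def respond(query: str) -> str:
    return f"canned answer to {query!r}"

RUNS = 100
outputs = Counter(respond("which image contains food?") for _ in range(RUNS))

top_output, count = outputs.most_common(1)[0]
print(f"consistency: {count / RUNS:.2%} across {len(outputs)} distinct output(s)")
```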

Navigation Efficiency

Both contestants found optimal paths, but through radically different approaches. Our platform used Dijkstra's algorithm with A* heuristic optimization. Gerald used what our research team has classified as "vibes-based pathfinding": he simply flew in the correct direction with no discernible computational overhead.

When we increased the complexity of the navigation problem to include obstacles, Gerald flew over them. This was technically within the rules but felt like cheating.
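For reference, here is a compact sketch of the platform's side of the contest: A* on a 4-connected grid with a Manhattan-distance heuristic. This is a textbook implementation, not our production pathfinder, and it conspicuously lacks a fly-over-the-obstacles mode.

```python
import heapq

def astar(grid, start, goal):
    """Length of the shortest path on a 4-connected grid (1 = obstacle)."""
    def h(cell):  # Manhattan-distance heuristic
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]  # (f = g + h, g, cell)
    best_g = {start: 0}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            return g
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    heapq.heappush(frontier, (ng + h(nxt), ng, nxt))
    return None  # no ground path exists (Gerald flies over it anyway)

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))  # -> 6
```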

Uptime & Reliability

Gerald achieved 100% uptime over the 72-hour test period. He did not crash, freeze, throw exceptions, or require a restart at any point. He slept for approximately 8 hours per day but remained responsive to stimuli (specifically, the sound of a seed bag opening), which we counted as "standby mode."

Our platform achieved 99.95% uptime, with 2.16 minutes of downtime due to a garbage collection pause that Gerald would like everyone to know he does not suffer from.
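For anyone auditing the downtime figure, it follows directly from the uptime percentage over the 72-hour window:

```python
hours, uptime = 72, 0.9995
print(f"{hours * 60 * (1 - uptime):.2f} minutes of downtime")  # -> 2.16 minutes
```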

Threats to Validity

  1. Gerald may have been trained on the test data (he was observed studying the pattern recognition cards during lunch breaks)
  2. The research team may be biased toward Gerald, who is widely regarded as a "very good boy"
  3. Our platform's performance may have been impacted by Gerald's habit of standing on the server
  4. The unplugged toaster outperformed expectations (it correctly answered zero questions, which was technically a valid response for 12% of our test queries)

Conclusions

The results are clear: in a head-to-head comparison, Gerald the pigeon outperforms our agent orchestration platform in the majority of tested categories. While our platform maintains advantages in computational speed and the ability to run without birdseed, Gerald's pattern recognition, consistency, reliability, and general attitude were superior.

We acknowledge these results with a mixture of humility and concern for our Series A funding.

Future Work

Gerald has been promoted to Senior Agent. His responsibilities include:

  • Quality assurance (pecking at screens that display errors)
  • Security (physically intimidating anyone who approaches the server rack)
  • Morale (he is very soft and pleasant to hold)

His salary has been set at 2kg of premium birdseed per week, plus dental (beak maintenance).

A follow-up study comparing Gerald against GPT-5 is planned for Q3 2026, pending Gerald's availability (he has a racing season commitment).

Acknowledgments

We thank Gerald for his participation and professionalism. We thank the local pigeon racing club for lending us their best racer. We thank our investors for not reading this far.

This paper has been submitted to the Journal of Questionable Computer Science and is currently under review. Gerald is listed as co-author, which the journal's editorial board has described as "a first, but honestly not the strangest thing we've published."
