FHIR Search Performance Across 4 Servers: A First Look at the Numbers

FHIR search is where most production payer workloads actually live. Patient Access calls, panel queries, analytics extracts, and prior-auth bundle assembly all funnel through search parameters before they reach the underlying resources. A new open-source benchmark from Health Samurai loaded the same Synthea dataset into Aidbox, HAPI FHIR, Medplum, and the Microsoft FHIR Server on identical hardware and measured search throughput across the six FHIR search families. The numbers tell a more interesting story than the headline RPS table alone.

The Search Throughput Numbers

Aggregate search throughput, across the mixed workload of string, date, reference, token, quantity, and composite searches:

  • Aidbox: about 3,404 RPS
  • Medplum: about 1,796 RPS
  • HAPI FHIR: about 1,005 RPS
  • Microsoft FHIR Server: about 261 RPS

That is a 13x spread between the top and bottom of the table, on the same hardware, same data, and same k6 load profile. To understand what that spread means for downstream applications, the per-family breakdown is more useful than the aggregate.
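The spread can be checked directly from the four aggregate figures. A minimal sketch in Python: the dictionary and function names are illustrative, and the values are the rounded RPS figures quoted in this article rather than raw benchmark output.

```python
# Aggregate search throughput in requests per second, as quoted in this article.
AGGREGATE_RPS = {
    "Aidbox": 3404,
    "Medplum": 1796,
    "HAPI FHIR": 1005,
    "Microsoft FHIR Server": 261,
}


def spread_vs_slowest(rps_by_server):
    """Return each server's throughput as a multiple of the slowest server's."""
    slowest = min(rps_by_server.values())
    return {name: rps / slowest for name, rps in rps_by_server.items()}


if __name__ == "__main__":
    for name, ratio in spread_vs_slowest(AGGREGATE_RPS).items():
        print(f"{name}: {ratio:.1f}x")
```

On these figures the top of the table sits at roughly 13x the bottom, with Medplum near 6.9x and HAPI FHIR near 3.9x.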

Where Each Server Diverges By Search Family

The aggregate number hides the shape of the workload. The per-family P99 latency numbers reported by the benchmark show that the four servers behave very differently depending on which search family the application leans on.
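For readers less familiar with tail-latency metrics, P99 is the latency that 99 percent of requests come in at or under. The sketch below uses the nearest-rank definition, which is an assumption made for illustration; the benchmark's own tooling may interpolate between samples and report slightly different values. The function name and sample data are hypothetical.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
    latencies_ms = [10] * 98 + [1900] * 2
    print("P50:", percentile_nearest_rank(latencies_ms, 50))
    print("P99:", percentile_nearest_rank(latencies_ms, 99))
```

In the hypothetical sample, two slow requests out of a hundred are enough to push P99 to the slow value while the median stays fast, which is why a family-level P99 can look far worse than an aggregate throughput figure suggests.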

The Microsoft FHIR Server is the clearest example. Its aggregate search RPS is the lowest, but the bottleneck is concentrated: quantity searches show P99 latency around 1.2 seconds, and composite searches show P99 latency around 1.9 seconds. For an application that rarely uses quantity or composite, Microsoft performs closer to its mid-tier peers. For an application that depends on lab-value range queries (a quantity workload) or multi-parameter joins (composite), Microsoft is the wrong fit.

Medplum surfaces a different constraint. Its aggregate RPS is solid, but Medplum does not support composite search at all in this release. Applications that depend on composite search parameters have to express the query differently or rule Medplum out until the engine ships support.

HAPI and Aidbox both cover all six families. Their relative ordering varies from family to family, so the aggregate is a reasonable first proxy for both, but application owners should re-check against the specific query mix their workload actually issues.
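To make the six families concrete, the sketch below builds one example query per family using standard FHIR R4 search parameters on Patient and Observation. The base URL, patient id, and specific values are hypothetical, and these are not necessarily the queries the benchmark itself issues.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used for illustration only.
BASE_URL = "https://fhir.example.org/fhir"

# One illustrative query per search family: (resource type, search parameters).
FAMILY_EXAMPLES = {
    "string": ("Patient", {"family": "Smith"}),
    "date": ("Patient", {"birthdate": "ge1980-01-01"}),
    "reference": ("Observation", {"subject": "Patient/123"}),
    "token": ("Observation", {"code": "http://loinc.org|2339-0"}),
    "quantity": (
        "Observation",
        {"value-quantity": "gt100|http://unitsofmeasure.org|mg/dL"},
    ),
    # Composite joins a token and a quantity with "$" so both must match
    # on the same Observation.
    "composite": (
        "Observation",
        {"code-value-quantity": "http://loinc.org|2339-0$gt100|http://unitsofmeasure.org|mg/dL"},
    ),
}


def build_search_url(resource_type, params, base_url=BASE_URL):
    """Return a percent-encoded FHIR search URL for the given parameters."""
    return f"{base_url}/{resource_type}?{urlencode(params)}"


if __name__ == "__main__":
    for family, (resource_type, params) in FAMILY_EXAMPLES.items():
        print(f"{family:>9}: {build_search_url(resource_type, params)}")
```

An application owner can list the queries their own product issues, map each to a family in this way, and then read the matching per-family rows of the benchmark.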

What the Spread Implies Operationally

A 13x spread in aggregate search throughput, on identical hardware, is large enough to change deployment economics. A FHIR platform that needs to serve a million Patient Access calls per day, concentrated in peak hours, has to translate that peak load into provisioned capacity. If throughput scales roughly linearly with hardware, capacity sized for the bottom of the table is roughly 13x the capacity sized for the top. That difference is real money on cloud infrastructure and real headcount on operational tooling.
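A back-of-the-envelope version of that sizing exercise is sketched below. It rests on loud assumptions that are not from the benchmark: throughput scales linearly as nodes are added, each node matches the benchmark hardware, and nodes are run at a target utilization of 70 percent to leave headroom. The 10,000 RPS peak target is a hypothetical figure chosen for illustration.

```python
import math

# Aggregate search throughput per benchmark node, as quoted in this article.
AGGREGATE_RPS = {
    "Aidbox": 3404,
    "Medplum": 1796,
    "HAPI FHIR": 1005,
    "Microsoft FHIR Server": 261,
}


def nodes_needed(peak_rps, rps_per_node, target_utilization=0.7):
    """Estimate node count, assuming linear scaling (a simplification)."""
    if rps_per_node <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("rps_per_node must be positive and utilization in (0, 1]")
    usable_rps = rps_per_node * target_utilization
    return math.ceil(peak_rps / usable_rps)


if __name__ == "__main__":
    hypothetical_peak_rps = 10_000  # assumption, not a benchmark figure
    for name, rps in AGGREGATE_RPS.items():
        print(f"{name}: {nodes_needed(hypothetical_peak_rps, rps)} nodes")
```

Real deployments rarely scale perfectly linearly, since the database tier usually becomes the shared bottleneck, so the output is best read as a ratio between servers rather than as a procurement number.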

The more useful comparison is not the aggregate but the family-specific number for the queries the application actually issues. A care-management product that runs Patient + Observation chains depends on reference and token search performance more than on quantity. A claims-analytics product that runs lab-value range queries depends on quantity. A complex measure-compute product depends on composite. The benchmark's per-family numbers are the right input.
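One way to turn per-family numbers into a single figure for a given workload is to weight them by the application's query mix. The sketch below uses a weighted harmonic mean, which assumes each family's throughput was measured in isolation and that requests of different families compete for the same capacity. That model is an assumption for illustration, not the benchmark's method, and the per-family figures in the usage example are hypothetical placeholders rather than benchmark results.

```python
def blended_rps(query_mix, rps_by_family):
    """Weighted harmonic mean of per-family throughput.

    query_mix maps family name to its share of requests (any positive weights).
    rps_by_family maps family name to measured RPS, or None if unsupported.
    """
    total_weight = sum(query_mix.values())
    if total_weight <= 0:
        raise ValueError("query_mix must contain positive weights")
    seconds_per_request = 0.0
    for family, weight in query_mix.items():
        if weight == 0:
            continue
        rps = rps_by_family.get(family)
        if not rps:
            # An unsupported family cannot be served at any throughput.
            raise ValueError(f"family {family!r} is unsupported or unmeasured")
        seconds_per_request += (weight / total_weight) / rps
    return 1.0 / seconds_per_request


if __name__ == "__main__":
    # Hypothetical placeholder numbers, NOT taken from the benchmark.
    hypothetical_rps = {"token": 2000.0, "reference": 1500.0, "quantity": 300.0}
    care_management_mix = {"token": 0.5, "reference": 0.4, "quantity": 0.1}
    print(f"{blended_rps(care_management_mix, hypothetical_rps):.0f} RPS")
```

Because the harmonic mean is dominated by the slowest family in the mix, even a small share of a slow family pulls the blended figure down sharply, and a mix that includes an unsupported family has no valid figure at all.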

What the Benchmark Does Not Tell You

The dataset is 1,000 Synthea-generated patients, around 2 million resources. That fits in memory on the benchmark hardware, which makes the search numbers a clean baseline but not a scale test. The next post in the Health Samurai series is expected to test at scale, which should surface the search-performance behavior that only shows up when working sets exceed available memory.

The benchmark is also vendor-run, which matters. Health Samurai develops Aidbox, and the report is published by them. The repository is open source and reruns daily, which makes the numbers re-verifiable. Re-verification on a workload that matches the reader's production query mix is the only way to know which family-level number actually applies.

Where Search Performance Fits in the Broader Picture

Search throughput is one dimension of FHIR platform performance. For the PostgreSQL substrate that three of these four servers share, "Top 5 PostgreSQL-based FHIR engines for 2026" covers the storage layer. For the broader FHIR-native platform comparison, "Top 5 FHIR-native platforms for health plans in 2026" covers the platform layer that sits above search.