AI search keeps failing in the field. Here's what we found when we tested ours.

April 9, 2026

If you run secure messaging at scale, you already know the problem: the information was shared, the thread exists, and nobody can find it when it counts.

 In high-stakes environments, scattered knowledge is a mission risk. 

In Rocket.Chat's latest release, we built Intelligent Search to close that gap. We stress-tested it against 1.2 million real, messy, conversational messages, the kind that actually move through operational environments, and we published every result, including the uncomfortable ones.

About Intelligent Search

Most search tools make one assumption: that you remember exactly what you are looking for. In practice, nobody does. You remember the topic. You remember it was somewhere in a thread from three weeks ago. You remember it involved the logistics team. You do not remember the exact words, and that is where most search tools leave you stranded.

Intelligent Search works the way people recall information. "Authentication service is down" and "users can't log in" return the same results. You describe what you are looking for in plain language and the system finds it, regardless of how it was originally worded.
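Under the hood, this kind of retrieval is typically embedding-based: each message is mapped to a vector, and a query matches by vector similarity rather than by shared keywords. Here is a minimal sketch using toy, hand-made vectors; Rocket.Chat's actual model and dimensions are not published here, so every number below is illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings standing in for a real model's output.
# In practice an embedding model maps each message to hundreds of dimensions.
embeddings = {
    "Authentication service is down": [0.9, 0.1, 0.2],
    "users can't log in":             [0.8, 0.2, 0.3],
    "lunch menu for Friday":          [0.1, 0.9, 0.1],
}

def search(query_vector, k=2):
    """Rank stored messages by similarity to the query vector."""
    ranked = sorted(embeddings.items(),
                    key=lambda item: cosine(query_vector, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# A query vector about login failures sits close to both outage phrasings,
# even though the two messages share no keywords.
print(search([0.85, 0.15, 0.25], k=2))
```

Because matching happens in vector space, "Authentication service is down" and "users can't log in" land next to each other regardless of wording.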

Every result comes directly from your organization's own message history. What was said is what you get. In environments where a fabricated answer is worse than no answer at all, that matters more than any accuracy benchmark.

We sat down with Devanshu Sharma, Lead Research Engineer, to get into what the data actually showed and what it took to build something that performs on a bad day, not just a benchmark day.

On the surface it looks like a few lost minutes. But the actual effect is much broader.

The first cost is time, and not just one person's time. It is the entire team grinding to a halt while everyone searches for something that was already shared.

The second cost is duplication. If people cannot find what was documented, they rebuild it from scratch, burning resources and pushing decisions further down the line.

But the third cost is the one that matters most: it erodes the confidence of decision makers.

The problem isn't that organizations fail to generate information. It's that critical intel is distributed across channels, and the people who need the full picture are always working with a partial one.

When someone is making a call without the full context, working off stale information or a partial picture, their confidence in their own decision drops. They second-guess. They delay. Or they proceed without knowing what they don't know. In the environments we are building for, that is not an efficiency problem. That is a mission risk.

It's almost always the dataset.

The demos are built around refined, curated data that a whole team designed to make the system look good.

In the LLM space right now, the company with the best benchmark trends in the news, so everyone optimizes for the benchmark. The assumption becomes: if a model performs well on a standard dataset, it performs well for everyone.

But that's not how it works in real operational environments. The data is noisy. It's fragmented across multiple channels. It changes constantly. It is not a static file sitting on a server. It is a live feed of human communication, and it is messy in ways that no curated benchmark dataset will ever capture.

A demo that does not account for any of that is not showing you what you will actually get when you deploy. The gap between the benchmark and the field is where most AI promises quietly die.

There is no single silver-bullet question. But the most powerful thing a procurement team can ask is: Can I run your system against my own data?

Not a demo. Not a reference customer story. Just: let me test it in my environment, against the kind of messages my people actually send. That is when you find out if the performance is real.

And alongside that, demand transparency. If a vendor claims near-perfect performance with no way for you to verify it, that is a red flag. Either they are hiding something, or they have optimized so hard for the benchmark that there is no headroom left to improve. Either way, be skeptical. We did not just make that argument. We tested it.

Because that's exactly how people search.

When I'm trying to find something I sent or received weeks ago, I don't remember the exact room, the exact channel, or the exact wording. I remember the topic. I remember roughly who it involved. Maybe I remember a timeframe.

That's how operators search too: they're not going to sit there reconstructing the precise keywords. And that's where keyword search completely fails you.

The dataset we used was deliberately messy: short messages, slang, domain-specific terminology, multi-domain conversations, the kind of noise you'd find in any real workspace. It mirrors what's actually happening in the field. We chose it because it reflects reality, not because it would produce favorable numbers.

Because if we optimize the numbers just to show a pretty benchmark, we are setting a trap for ourselves.

When customers deploy in their own environment and test against their own data, and they will, they are going to get the real results. We wanted to find the worst case before anyone else did.

At 380,000 conversational records, the system delivered a Mean Reciprocal Rank of 0.72, meaning the most relevant result was consistently surfacing near the top. At 1.2 million documents, that shifted to 0.56. The system was still retrieving the right content. The challenge at that scale was ranking the best match first, with more content competing for the top result. That is expected behavior, and it gives us a reproducible baseline to improve against.
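For context, Mean Reciprocal Rank averages 1/rank of the first relevant result across queries, so an MRR of 0.72 means the best match typically lands at rank 1 or 2. A minimal sketch of the standard formula (the evaluation harness itself is separate; this only shows how the metric is computed):

```python
def mean_reciprocal_rank(ranks):
    """ranks: the 1-based rank of the first relevant result for each query,
    or None when nothing relevant was retrieved (which scores 0)."""
    scores = [0.0 if r is None else 1.0 / r for r in ranks]
    return sum(scores) / len(scores)

# Four queries: best match ranked first, first, second, and one miss.
print(mean_reciprocal_rank([1, 1, 2, None]))  # → 0.625
```

The metric rewards putting the right answer at the very top, which is exactly the behavior that degrades as more near-duplicate content competes for rank 1.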

The dataset is public. The methodology is documented. Anyone can run it themselves. In the research community, reproducibility is highly valued. You publish results and make them verifiable not because someone is trying to make you look bad, but because open reproduction is how the field moves forward.

Retrieval accuracy only matters if the system is available when you need it.

Picture a search system with exceptional accuracy that slows to a crawl under load, or goes dark at the one moment you genuinely need it. That is not a retrieval problem. That is a trust problem. And once you break a user's confidence in the system, you probably do not get it back.

Under sustained load at million-document scale, our API response times stayed in low single-digit milliseconds at the 95th percentile. Error rates remained near baseline. Worker failures were essentially zero. The system absorbed traffic spikes through backpressure and batching without degradation.
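For readers unfamiliar with the metric: a 95th-percentile latency is the value that 95% of requests come in under, so it is dominated by tail behavior rather than the average. A toy illustration using the nearest-rank method, with made-up numbers rather than our actual measurements:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value at position ceil(pct/100 * n)
    in the sorted sample."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

# Nine fast requests and one slow outlier (milliseconds, invented data).
latencies_ms = [1.2, 0.9, 1.1, 1.4, 1.0, 1.3, 0.8, 1.1, 1.2, 9.5]

print(percentile(latencies_ms, 50))  # → 1.1 (median looks healthy)
print(percentile(latencies_ms, 95))  # → 9.5 (the tail tells the truth)
```

This is why p95 is the number worth reporting: a single slow outlier barely moves the median but shows up immediately in the tail.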

For teams running communication infrastructure that supports operational decisions, that stability story matters more than any retrieval benchmark. An AI feature that introduces latency or instability into that stack is not a feature. It is a liability.

Three things. The third is the one that keeps people up at night.

First, irreproducibility. When a user searches for the same thing twice and gets completely different results, trust in the system collapses fast and it does not come back.

Second, opacity about sources. If the system cannot show you where the answer came from, you cannot verify it. In high-stakes environments, an unverifiable answer is not just unhelpful. It is unusable.

But the third is access control. These organizations operate with strict information hierarchies, and the last thing anyone needs is a search system that surfaces sensitive data to someone who should not see it. That is not a retrieval quality problem. That is an institutional catastrophe.
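One common way to enforce this is to filter every candidate result against the caller's channel membership before anything is ranked or returned. A hypothetical sketch, with names and data invented for illustration; this is not Rocket.Chat's actual permission API:

```python
# Invented example data: each document carries the channel it came from.
documents = [
    {"text": "patrol route updated",  "channel": "ops-restricted"},
    {"text": "lunch menu for Friday", "channel": "general"},
]

# Invented membership table mapping users to readable channels.
user_channels = {
    "alice": {"ops-restricted", "general"},
    "bob":   {"general"},
}

def authorized_results(user, results):
    """Drop every hit from a channel the user cannot read.
    Unknown users get an empty set, so they see nothing."""
    allowed = user_channels.get(user, set())
    return [r for r in results if r["channel"] in allowed]

print([r["text"] for r in authorized_results("bob", documents)])
# → ['lunch menu for Friday']
```

The design point is that filtering happens server-side, before ranking: a restricted document never appears in the candidate set, so it can never leak through a clever query.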

In an early version of the pipeline, we were using a library for embedding inference that had a hidden dependency on endpoints behind Cloudflare. We didn't catch it because we weren't testing in a fully air-gapped environment. Then Cloudflare went down, and suddenly our embedding model was broken.

If that had happened in a real deployment for a defense customer, it would have been a serious incident. We stripped out the entire library and rebuilt it with one that had zero cloud infrastructure dependencies.

It slowed us down. It was not an easy path. But for air-gapped deployments, which is the operating reality for most of the organizations we are building for, there is no other path. The system has to work completely offline, with no external dependencies, under any conditions.

That decision, and others like it, reflect a particular kind of engineering discipline. Not optimizing for the demo. Optimizing for the day when the system has to work and nobody has time to troubleshoot.

The next iteration brings hybrid search, combining semantic and keyword retrieval, alongside LLM-as-judge evaluation to better measure what the numbers are actually missing. The baseline is set. The work continues.
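One widely used recipe for merging semantic and keyword rankings is reciprocal rank fusion, where each document is scored by summing 1/(k + rank) across the individual ranked lists. This is a generic sketch of that standard technique, not necessarily the exact method the next release will ship:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one.
    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the conventional constant from the original RRF formulation."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d2"]   # embedding-based ranking (invented ids)
keyword  = ["d1", "d4", "d3"]   # keyword/BM25 ranking (invented ids)
print(reciprocal_rank_fusion([semantic, keyword]))
# → ['d1', 'd3', 'd4', 'd2']
```

The appeal of rank fusion is that it needs no score calibration between the two retrievers: a document that both rankings place near the top wins, even when their raw scores live on incomparable scales.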
