The Autoscaling Dilemma: Architecture, Cost, and Reality in OpenSearch


A couple of years ago, I was assigned to add observability to a legacy system at Oracle. It wasn’t a fresh start or a tech playground: OpenSearch was the only option I was allowed to use. No Elasticsearch, no managed services, no time to negotiate with procurement. It was just me, a legacy system, and the very real need to make logs searchable at scale.

That experience forced me to stop thinking like a developer trying to “make things work” and instead face a more difficult challenge: making things work within constraints. I had to ask: how do you make autoscaling meaningful when your architecture wasn’t really designed for it?


When Scaling Isn’t About Nodes, But About Understanding

On paper, autoscaling sounds easy: add nodes when demand rises, remove them when it falls. But OpenSearch, like many systems built on top of Lucene, hides a lot of complexity under the hood. Yes, it’s technically easy to spin up a new node, but that doesn’t mean the data will be evenly distributed. And if it’s not, you’ve just wasted money and possibly made things worse.

Autoscaling here isn’t just an infrastructure problem; it’s about reshaping how the system stores and indexes data. And the more unstructured that data is, the harder it gets. In fact, OpenSearch doesn’t support autoscaling natively because scaling unstructured data isn’t a predictable process. Not every node will share the same load, and not every index will play nice. Hotspots will happen. Bottlenecks will form.


Rollover, Not Resize

Here’s a trap I walked into early: I assumed that adding shards to an index would let me scale writes better. Makes sense, right? More shards, more distribution. But OpenSearch uses a hash-based sharding algorithm, and once the shard count is set, it’s fixed. Add a shard and the math breaks: documents get hashed to the wrong place.
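You can see the trap with a toy simulation. This is a sketch, not OpenSearch’s real routing (which uses murmur3 on the routing value); a simple stand-in hash is enough to show that changing the modulus re-routes existing documents:

```python
# Toy illustration of why changing the primary shard count breaks routing.
# OpenSearch actually hashes the routing value with murmur3; the stand-in
# hash below is only there to make the demo deterministic and dependency-free.

def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard the way hash-based routing does: hash(id) mod shard count."""
    h = sum(ord(c) * 31 ** i for i, c in enumerate(doc_id))
    return h % num_shards

doc_ids = [f"doc-{i}" for i in range(10)]

before = {d: shard_for(d, 3) for d in doc_ids}  # index created with 3 shards
after = {d: shard_for(d, 4) for d in doc_ids}   # pretend we "added" a shard

moved = [d for d in doc_ids if before[d] != after[d]]
print(f"{len(moved)} of {len(doc_ids)} documents now hash to a different shard")
```

Any document in `moved` would be unreachable by ID lookup after the change, which is exactly why OpenSearch refuses to let you resize a live index.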

The only way out is to roll over to a new index with more shards. This sounds simple, but it forces you to think ahead: plan your shard count like you’re packing for a trip you don’t fully understand. Rollovers become your version of “scaling”, even though all you’re doing is creating more pieces for OpenSearch to juggle.

This means that autoscaling isn’t reactive. It has to be predictive. You prepare for spikes by over-sharding in advance, or suffer from shard overload later.
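That predict-don’t-react idea maps onto two real API surfaces: index templates, which fix the shard count of the *next* index before it exists, and the `_rollover` endpoint. The request bodies below follow OpenSearch’s documented shapes, but the shard count and thresholds are illustrative assumptions, not recommendations:

```python
import json

# Pre-provision the shard count for the next index via a template, since an
# existing index's primary shard count cannot be changed in place.
# The pattern, counts, and thresholds here are made-up examples.
index_template = {
    "index_patterns": ["logs-*"],
    "template": {
        "settings": {
            "index.number_of_shards": 6,    # "over-shard in advance" for spikes
            "index.number_of_replicas": 1,
        }
    },
}

# Conditions under which POST /logs-write/_rollover cuts over to a new index.
rollover_request = {
    "conditions": {
        "max_age": "1d",          # roll daily...
        "max_docs": 100_000_000,  # ...or when the index grows too large
        "max_size": "50gb",
    }
}

print(json.dumps(rollover_request, indent=2))
```

The rollover fires when any one condition is met, so the conditions act as a safety net rather than a schedule.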


The Two Topologies That Saved Me

After several missteps, two architectural patterns stood out. I didn’t invent them, but applying them in the right moment changed everything.

Burst Index

Prepare indexes with higher shard counts ahead of time and point your write alias to them only when needed. You avoid moving data under pressure and gain flexibility without overwhelming the cluster. When the spike is over, you exclude the burst nodes and let the data settle.

This works well when you have a handful of tenants and can predict when load will hit.
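A minimal sketch of the alias flip, using OpenSearch’s `_aliases` API with `is_write_index`; the index and alias names are made up for illustration:

```python
# Sketch of flipping the write alias from the steady-state index to a
# pre-created, higher-shard "burst" index. Names are hypothetical.

def switch_write_index(alias: str, old_index: str, new_index: str) -> dict:
    """Build the _aliases request body that atomically moves write traffic."""
    return {
        "actions": [
            # Both indexes stay readable under the alias; only writes move.
            {"add": {"index": old_index, "alias": alias, "is_write_index": False}},
            {"add": {"index": new_index, "alias": alias, "is_write_index": True}},
        ]
    }

body = switch_write_index("logs-write", "logs-000007", "logs-burst-000001")
# POST /_aliases with this body when the spike hits; reverse it when it passes.
```

Because both actions land in one request, there is no window where the alias has no write index.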

Burst Cluster

When you’re dealing with many tenants and unpredictable spikes, a separate cluster makes more sense. You tailor smaller clusters for average load and divert specific tenants to a “burst” cluster during peaks. Reads remain consistent using cross-cluster search, and you avoid penalizing stable users.
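A sketch of what the tenant routing might look like in application code, with cross-cluster search’s `remote:index` pattern on the read side; the tenant and cluster names are hypothetical:

```python
# Hedged sketch of tenant-aware routing between a steady cluster and a burst
# cluster. Cluster aliases and tenant IDs are illustrative assumptions.

BURST_TENANTS = {"tenant-42"}  # tenants currently diverted during a spike

def write_target(tenant: str) -> str:
    """Pick which cluster receives this tenant's writes."""
    return "burst-cluster" if tenant in BURST_TENANTS else "main-cluster"

def search_pattern(tenant: str) -> str:
    """Query local indexes plus the burst cluster's, via cross-cluster search."""
    # "remote:index" is OpenSearch's cross-cluster search syntax, so reads
    # stay unified no matter where the tenant's recent writes landed.
    return f"logs-{tenant}-*,burst-cluster:logs-{tenant}-*"

print(search_pattern("tenant-42"))
```

The key property is that diverting a tenant changes only `write_target`; readers never need to know a spike happened.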


Scaling Is Not About Servers

The real problem wasn’t infrastructure. It was expectation. Everyone wants “autoscaling” to behave like magic: when traffic rises, the system reacts, the cost stays fair, and everything keeps working. But with OpenSearch, and any system managing unstructured data, that dream is only real if you understand the mechanics.

Autoscaling here is more like choreography: roll over at the right time, allocate shards properly, avoid hotspots, and know your tenants. If you don’t think ahead, your autoscaling efforts become just another source of instability.

And even when it works, it’s temporary. Every shard you create adds pressure later. Every burst index needs cleanup. Every diverted cluster needs tracking. Observability comes with a cost, and if you’re not careful, the tools you use to observe can become the thing you need to fix next.
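That cleanup doesn’t have to be manual. A hedged sketch of an ISM (Index State Management) policy that ages out burst indexes; the retention window and index patterns are assumptions, not a recommendation:

```python
# Sketch of an OpenSearch ISM policy: burst indexes sit in a "hot" state,
# then transition to a "delete" state once they age out. The 7d window and
# the logs-burst-* pattern are illustrative.

burst_cleanup_policy = {
    "policy": {
        "description": "Delete burst indexes once the spike has aged out",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {"state_name": "delete", "conditions": {"min_index_age": "7d"}}
                ],
            },
            {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
        ],
        # Auto-attach the policy to any index matching the burst pattern.
        "ism_template": [{"index_patterns": ["logs-burst-*"], "priority": 100}],
    }
}
```

With an `ism_template` in place, every burst index you create is born with an expiry date, which keeps the “every burst index needs cleanup” tax from compounding.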


Final Thoughts

Sometimes, you don’t choose the architecture. It’s handed to you, restrictions included. That’s when real architectural thinking starts. You stop dreaming in ideal patterns and start solving in trade-offs.

Autoscaling in OpenSearch is possible, but only if you see beyond the infrastructure. It’s not about nodes, it’s about patterns. Not about cost, but timing. Not about scaling up, but scaling right.


This post is dedicated to my cat, Naranja. Your quiet company saw me through more late-night debugging sessions than I can count. I hope there are warm sunbeams wherever you are.

