Blockchain Indexing Architectures: Subgraphs vs Data Lakes vs Streaming Pipelines

A breakdown of subgraphs, data lakes, and streaming pipelines, and how each handles blockchain data.


Blockchain indexing is a crowded topic, and each architecture tends to get reduced to a one-word pitch:

  • Subgraphs are often described as simple.
  • Data lakes are often described as fast.
  • Streaming pipelines are often described as real-time.

In reality, they optimize for very different things: query latency, processing throughput, and data freshness.

The question worth asking is different: how does each system process blockchain data, and what does that cost you? If you're building anything where data has to be correct, fresh, and available under load, that question decides your architecture.

Why indexing exists in the first place

Blockchains aren't built for querying. They're built to write transactions, verify state, and maintain consensus. Everything else is downstream.

Try to build an application directly on RPC, and the limits show up fast. You can't filter across large datasets. Historical queries are slow, and data structures are fragmented across contracts and blocks, which leads to higher compute costs.

Indexing exists to solve one problem: turning blockchain data into something you can query reliably at scale. Every system in this space, regardless of how it's built, is answering that same question.

The three architectures you'll actually encounter

Battle-tested systems today fall into three categories:

  • stateful indexers (subgraphs),
  • data lake pipelines, and
  • streaming pipelines.

They solve the same problem in different ways, and the differences matter once you're operating at scale.

Stateful indexers (subgraphs)

A subgraph is a deterministic state machine. It processes blockchain data block by block, applies your logic, and builds application-specific state along the way.

The flow looks like this:

chain → node → indexer → mappings → database

Each event triggers a handler. Each handler updates entities. The database reflects the current state of your system.
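The event → handler → entity loop can be sketched in a few lines. This is a minimal illustration, not a real indexer API: the `TransferEvent`, `Store`, and `handleTransfer` names are hypothetical stand-ins for an event type, the indexer's entity database, and a mapping handler.

```typescript
// Hypothetical event shape; a real handler would receive decoded log data.
interface TransferEvent {
  from: string;
  to: string;
  amount: bigint;
}

// In-memory stand-in for the indexer's entity store.
class Store {
  private balances = new Map<string, bigint>();

  get(addr: string): bigint {
    return this.balances.get(addr) ?? 0n;
  }

  set(addr: string, value: bigint): void {
    this.balances.set(addr, value);
  }
}

// Each event mutates entity state. Replaying the same events in the same
// order always yields the same final state, which is the determinism
// property described above.
function handleTransfer(event: TransferEvent, store: Store): void {
  store.set(event.from, store.get(event.from) - event.amount);
  store.set(event.to, store.get(event.to) + event.amount);
}
```

Because the handler is a pure function of (event, current state), replaying from any block reproduces the database exactly.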

Three properties define this model.

  1. It's deterministic. Replay the chain from block X and you get the same result every time.
  2. It sits close to the source of truth, reading directly from a node or stream rather than an intermediate dataset.
  3. It's stateful by design. The indexer maintains state over time instead of just processing events.

Subgraphs work well when correctness matters more than raw speed: when you need reliable reorg handling, when you're operating close to the chain tip, and when reproducibility is non-negotiable. That's why they've been the standard for querying DeFi state, balances and positions, and protocol-level data.

Data lake pipelines

Data lake architectures separate extraction from transformation. Instead of every indexer pulling from the chain on its own, a shared ingestion layer extracts the data once, normalizes it, and stores it in a distributed dataset. Indexers then read from that dataset and shape it for their own use case.

chain → ingestion → data lake → query layer → indexer

Extraction happens once. If a hundred teams need the same data, a subgraph model processes it a hundred times. A data lake centralizes ingestion so data is extracted once and reused across systems. That single change unlocks batch access to large datasets, parallel processing across block ranges, and far less pressure on RPC infrastructure.

Instead of looping through block → decode → repeat, you query a dataset and process it in batches.
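The batch pattern can be sketched against a toy dataset. The `Row` shape, `queryLake`, and `sumValues` are illustrative assumptions, standing in for a query against the shared, pre-extracted dataset rather than any real client library.

```typescript
// Hypothetical row shape for a normalized, pre-extracted dataset.
interface Row {
  block: number;
  contract: string;
  value: bigint;
}

// Stand-in for a range query against the data lake: filter a block range
// in one pass instead of walking the chain block by block over RPC.
function queryLake(rows: Row[], fromBlock: number, toBlock: number): Row[] {
  return rows.filter((r) => r.block >= fromBlock && r.block <= toBlock);
}

// Batch aggregation over the result set, the kind of work that is painful
// over raw RPC but cheap over a centralized dataset.
function sumValues(rows: Row[]): bigint {
  return rows.reduce((acc, r) => acc + r.value, 0n);
}
```

Because block ranges are independent, many such queries can run in parallel across ranges, which is where the throughput gains come from.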

This model fits well for analytics platforms, data APIs, and any product that needs to reuse datasets across teams or chains. The tradeoff is complexity.

You now depend on ingestion pipelines, normalization logic, storage consistency, and a query layer.

The trust boundary expands from:

chain → indexer

to:

chain → ingestion → storage → gateway → indexer

Debugging is far more difficult when data is inaccurate or missing, because a bad value could have been introduced at any of those stages. You also accept some delay between when data lands on-chain and when it shows up in the dataset.

Streaming pipelines

Streaming pipelines treat blockchain data as a continuous flow rather than something to query after the fact. Instead of pulling, the system pushes data through transformation layers in real time.

chain → stream → process → output

The focus is on low latency, parallel execution, and real-time data flow. Streaming systems often process multiple block ranges in parallel, emit intermediate results, and push data into downstream databases or APIs.
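A push-based stage can be sketched as follows. The `BlockEvent` shape and the `decodeStage` name are illustrative assumptions: the point is only that each stage transforms incoming data and forwards it downstream immediately, rather than waiting to be queried.

```typescript
// Hypothetical shape of a raw event pushed from the chain source.
interface BlockEvent {
  block: number;
  logs: string[];
}

// A sink is anything downstream that accepts processed items:
// another stage, a database writer, an API publisher.
type Sink<T> = (item: T) => void;

// A transformation stage: process each incoming event and push the result
// downstream as soon as it is ready. Nothing is polled; data flows.
function decodeStage(
  downstream: Sink<{ block: number; count: number }>
): Sink<BlockEvent> {
  return (event) => {
    downstream({ block: event.block, count: event.logs.length });
  };
}
```

Stages composed this way can be fanned out across block ranges, which is how streaming systems get parallel execution and intermediate results.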

This model fits real-time analytics, event-driven systems, and high-frequency data pipelines.

The trade-off is that streaming systems aren't complete on their own. You still need downstream storage, schema design, and state management if your use case requires it. Streaming is excellent at processing data and weaker at serving it.

The distinction that matters more than the tools

Tool comparisons miss the architectural split that actually drives behavior in production systems: stateful systems versus stateless pipelines.

Subgraphs are stateful systems. They maintain application state, update it over time, and serve it directly.

event → mutate state → store result

Data lakes and streaming systems separate processing from state ownership. They transform and move data, but the responsibility for maintaining application state usually sits downstream, in a database or serving layer.

event → transform → output
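The two shapes can be contrasted directly. Both functions below are illustrative sketches, not real APIs: the stateful handler owns and mutates long-lived state, while the stateless transform only maps input to output and leaves state ownership to whatever consumes it.

```typescript
// Stateful model (subgraph-style): the handler owns the state and
// mutates it in place. Correctness depends on processing order.
function statefulHandler(state: Map<string, number>, key: string): void {
  state.set(key, (state.get(key) ?? 0) + 1);
}

// Stateless model (lake/streaming-style): a pure transform. The output
// is handed downstream; a database or serving layer owns the state.
function statelessTransform(event: { key: string }): { key: string; seen: 1 } {
  return { key: event.key, seen: 1 };
}
```

The stateful handler must replay events in order to stay correct; the stateless transform can run anywhere, in parallel, because it carries no history.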

This distinction affects how systems scale, how they fail, and how you reason about correctness.

Production systems combine all three

In practice, most production systems end up combining all three:

chain → streaming ingestion → storage → API layer
chain → deterministic indexing → database → application
chain → streaming → data lake → application APIs

As adoption grows, the industry is unlikely to rely on a single architecture to index and query data.

How to choose

Skip the question of which system is better. Ask these instead:

  • Do you need deterministic state, or flexible data pipelines?
  • Are you optimizing for real-time freshness, or historical scale?
  • Do you want simplicity in debugging, or efficiency at scale?
  • What does correctness mean in your domain, and what's the cost of getting it wrong?

Each architecture trades off latency, throughput, correctness, cost, and operational complexity in different ways. There's no universal answer, only a fit for your use case.

The bigger picture

Blockchain indexing has moved past the question of how to extract data. The harder problem is building systems that stay correct under reorgs, fast under load, and reliable in production. Subgraphs, data lakes, and streaming pipelines each solve part of that problem. Knowing how they differ is what lets you choose the right one, or combine them into something that holds up.

About Ormi

Ormi is the next-generation data layer for Web3, purpose-built for real-time, high-throughput applications like DeFi, gaming, wallets, and on-chain infrastructure. Its hybrid architecture ensures sub-30ms latency and up to 4,000 RPS for live subgraph indexing.

With 99.9% uptime and deployments across ecosystems representing $50B+ in TVL and $100B+ in annual transaction volume, Ormi is trusted to power the most demanding production environments without throttling or delay.