Background

Developers building applications that use LLMs face unique challenges compared to working with conventional software systems. To understand why, it helps to look at how LLMs work at a high level and what makes them useful to developers.

LLMs and What They Enable

LLMs are models that have learned, through processing vast amounts of data, to understand and generate human language. More specifically, they are neural networks that respond to natural language input by generating a sequence of tokens one at a time, with each token chosen according to a probability distribution. This generation process is known as inference. The resulting token sequences make up the responses you see when interacting with an LLM through an app like Claude or ChatGPT.

Model choice is one of several factors that determine what can be achieved with LLMs. Deciding on the model involves significant trade-offs, as some are better suited to complex reasoning tasks (e.g., Opus), while others are better suited to quick, everyday tasks (e.g., Haiku).

Here are a few examples of applications that were much more complicated, or impossible, to build before the rise of LLMs:

A customer support bot that classifies a complaint based on its meaning rather than keywords and routes the ticket to the appropriate team.
A code review tool that reads a pull request, identifies potentially buggy code, and then suggests fixes based on what it “thinks” the intent of the code is. This is a very different task to just correcting syntax.
A writing assistant that infers the text’s intended audience and then appropriately adjusts the context and tone of the suggestions.

The Downsides of LLM Integration

These benefits do not come for free, though.

To use LLMs in their applications, developers make API calls to endpoints hosted by model providers. As with a standard API call, a POST request is typically sent to an endpoint, and the response contains the generated output. Unlike a typical API call, output, cost, and latency can vary wildly between two calls to the same endpoint due to factors such as model choice, token count (i.e., user prompt and model output tokens), and the difficulty of the requested task.
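As a rough illustration of the client side of such a call, the sketch below builds a POST request with Python’s standard library. The endpoint URL, the body fields (model, prompt), and the bearer-token header are hypothetical placeholders and do not describe any specific provider’s API. The request is constructed but deliberately not sent.

```python
import json
import urllib.request

# Hypothetical endpoint, for illustration only.
ENDPOINT = "https://llm.example.com/v1/generate"


def build_request(prompt: str, model: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the prompt as JSON."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


request = build_request("Summarise this ticket.", "example-model", "PLACEHOLDER_KEY")
print(request.get_method(), request.full_url)
```

The request itself looks like any other JSON API call. What differs is how much the output, cost, and latency of the response can vary from one call to the next.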

Non-determinism

Two identical prompts sent to an LLM are not guaranteed to produce the same response. At their core, LLMs are next-token predictors: at each step of generation, the model assigns a probability to every candidate next token, and the emitted token is typically sampled from that distribution rather than always being the single most likely one. In other words, one cannot guarantee that the exact same inputs will lead to the same output, as one can in deterministic systems.
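The toy sketch below shows where the variation comes from. It turns a set of made-up scores into probabilities with a softmax and then samples one token at random, weighted by those probabilities. The three-word vocabulary, the scores, and the function names are invented for illustration. Real models work over vocabularies of many thousands of tokens and use additional sampling controls.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(tokens, logits, rng, temperature=1.0):
    """Draw one token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Toy vocabulary and scores (made up for illustration).
tokens = ["cat", "dog", "bird"]
logits = [2.0, 1.5, 0.5]

rng = random.Random()  # unseeded, so repeated runs can pick different tokens
samples = [sample_next_token(tokens, logits, rng) for _ in range(10)]
print(samples)
```

Running this several times usually prints a different list each time, even though the inputs never change. Seeding the generator (for example, random.Random(0)) makes this toy reproducible. Hosted models do not necessarily offer an equivalent guarantee.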

Cost

Standard APIs typically have usage-based pricing that can be determined in advance. Token-based LLM pricing is quite different, because the cost of a call depends on several factors that are only partly known beforehand.

The cost of a single input-output interaction is determined by the number of input and output tokens (i.e., the size of the user prompt and of the LLM’s generated response), each billed at its own rate. Additionally, input-token count is not a predictor of output-token count, meaning that two prompts to the same endpoint can have orders-of-magnitude differences in their total cost.

Cost is therefore hard to predict per request, and it becomes harder still with multi-step LLM tasks.

Also, unlike with standard API calls, the cost-benefit trade-off is difficult to assess before one actually makes the LLM call. Only when the response comes back can one ascertain whether the output (i.e., the expected benefit) was worth the cost.

Another factor here is that there is no uniform pricing across providers or models. For example, Claude’s cutting-edge model meant for extreme reasoning and automation, Fable 5, costs $10/MITok and $50/MOTok, whereas the older, more lightweight model for light, everyday use, Haiku 4.5, costs one-tenth that at $1/MITok and $5/MOTok.

In other words, prices can vary widely, with the most sophisticated models costing many times more than simpler, everyday models.

as of June 2026

MITok: per million input tokens; MOTok: per million output tokens
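To make the cost arithmetic in this section concrete, the sketch below computes the price of a single request from its token counts, using the per-million-token prices quoted above. The dictionary keys are hypothetical identifiers chosen for this example, and the token counts are made up.

```python
# (input USD, output USD) per million tokens, as quoted in the text.
# The keys are hypothetical identifiers, not official model names.
PRICES_PER_MILLION = {
    "haiku-4.5": (1.00, 5.00),
    "fable-5": (10.00, 50.00),
}


def request_cost(model, input_tokens, output_tokens):
    """Return the USD cost of one request given its token counts."""
    in_price, out_price = PRICES_PER_MILLION[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Same prompt size, very different output sizes and models.
short_answer = request_cost("haiku-4.5", 200, 50)
long_answer = request_cost("fable-5", 200, 4_000)
print(f"{short_answer:.5f} USD vs {long_answer:.5f} USD")
```

With these made-up token counts, the two requests cost about $0.00045 and $0.202 respectively, a difference of more than two orders of magnitude for the same 200-token prompt.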

Latency

Response times for standard APIs are usually measured in milliseconds. LLM responses, on the other hand, can take several seconds, and their P99 latency (the response time within which 99% of requests complete, so that only the slowest 1% exceed it) can reach into the tens of seconds.
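As a small worked example of what P99 means, the sketch below computes percentiles over a list of made-up latencies using the nearest-rank method. This is one of several common percentile definitions, chosen here for simplicity. Monitoring tools may interpolate or use approximations instead.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Made-up latencies in seconds: 98 fast responses and 2 slow outliers.
latencies = [1.2] * 98 + [18.0, 25.0]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(p50, p99)
```

In this made-up sample the median is 1.2 seconds while the P99 is 18.0 seconds, which is why tail latency is usually tracked separately from the average.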

To soften the blow of long waits, most LLM APIs reduce perceived latency by streaming token chunks as they are generated by the model, rather than buffering the full response before sending it. This requires keeping a persistent connection open between the client and the LLM provider (usually via HTTP chunked transfer encoding or Server-Sent Events, SSE) for the duration of generation. While this does provide a better user experience, it requires developers to handle complications such as connection management, midstream errors, and stream completion state.
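The sketch below hints at the extra client-side handling that streaming involves. It consumes SSE-style lines, accumulates the text, raises on a midstream error, and reports whether the stream finished cleanly. The event shapes (a JSON object with a text or error field) and the [DONE] end-of-stream sentinel are assumptions made for this example. Each provider defines its own event format.

```python
import json


def consume_stream(lines):
    """Collect text from SSE-style lines; report whether the stream finished cleanly."""
    parts = []
    completed = False
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            completed = True
            break
        event = json.loads(payload)
        if "error" in event:  # assumed shape of a midstream error event
            raise RuntimeError(event["error"])
        parts.append(event.get("text", ""))
    return "".join(parts), completed


full = ['data: {"text": "Hel"}', "", 'data: {"text": "lo"}', "data: [DONE]"]
truncated = full[:2]  # connection dropped before the sentinel arrived

print(consume_stream(full))       # ('Hello', True)
print(consume_stream(truncated))  # ('Hel', False)
```

The second call shows why completion state matters: the client received some text, but without the sentinel it cannot tell whether the response is complete.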

Availability

Non-LLM API providers typically guarantee a measured level of uptime through Service Level Agreements (SLAs). These guarantees range from 99.9% all the way up to the gold standard of “Five Nines” (99.999% uptime), which translates to only about 5 minutes of downtime a year.

In contrast, the measured uptime of LLM API providers can be considerably lower. At the time of writing, OpenAI’s uptime over the last three months was 99.98%, while Claude’s was 99.2%. The latter translates to around 70 hours of downtime a year!
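The downtime figures above follow from simple arithmetic, sketched here for a 365-day year. The function name is chosen for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(uptime_percent):
    """Convert an uptime percentage into expected hours of downtime per year."""
    return (100 - uptime_percent) / 100 * HOURS_PER_YEAR


for uptime in (99.999, 99.98, 99.9, 99.2):
    hours = downtime_hours_per_year(uptime)
    print(f"{uptime}% uptime -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

This gives roughly 5.3 minutes a year for 99.999%, about 1.75 hours for 99.98%, about 8.76 hours for 99.9%, and about 70 hours for 99.2%.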

Going Beyond a Single LLM

Somewhat paradoxically, one way to address many of these challenges is for developers to integrate not one but multiple models and providers into their applications. Given disparities in cost, capabilities, and reliability both amongst and within LLM providers, this pattern has become increasingly common.

One benefit is the ability to route a very complex prompt to the latest, most sophisticated model and a simpler prompt to a faster, cheaper model. This saves developers’ token budgets (i.e., money) and users’ time (i.e., latency). A win-win.

An additional benefit is that one can route to a provider based on whether the provider is actually up and available. As we saw earlier, LLM provider uptime is less reliable than traditional cloud services, so having some form of failover routing that uses the next available model or provider when the requested one is down helps increase the end user’s perceived LLM reliability.
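A minimal sketch of failover routing might look like the following. The provider functions are stand-ins that simulate an outage and a success. In a real application they would wrap network calls, and the exception type, function names, and ordering policy shown here are assumptions made for the example.

```python
class ProviderUnavailable(Exception):
    """Raised by a provider call when the provider is down or rate limited."""


def call_with_failover(prompt, providers):
    """Try each (name, call) pair in order; return the first successful result."""
    failures = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderUnavailable as exc:
            failures.append(f"{name}: {exc}")
    raise ProviderUnavailable("all providers failed: " + "; ".join(failures))


# Stand-in provider functions; real ones would make network calls.
def primary(prompt):
    raise ProviderUnavailable("503 from upstream")


def secondary(prompt):
    return f"echo: {prompt}"


used, reply = call_with_failover(
    "hello", [("primary", primary), ("secondary", secondary)]
)
print(used, reply)  # secondary echo: hello
```

From the end user’s point of view the request simply succeeded, even though the first provider was unavailable.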

The Cost of Multiple Integrations

The benefits of supporting more than one LLM are only half the story.

For one, LLM provider APIs differ in their request and response schema formats. Therefore, developers need to build and maintain a compatibility layer to handle each one. The complexity here grows with each new integration.
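The sketch below illustrates what such a compatibility layer involves on the response side. The two raw response shapes are hypothetical and invented for this example, as are the provider names. Real provider schemas differ in their own ways, and the request side needs the same treatment.

```python
from dataclasses import dataclass


@dataclass
class NormalisedResponse:
    text: str
    input_tokens: int
    output_tokens: int


def from_provider_a(raw):
    # Hypothetical shape: text nested in a list, usage under prompt/completion keys.
    return NormalisedResponse(
        text=raw["choices"][0]["text"],
        input_tokens=raw["usage"]["prompt_tokens"],
        output_tokens=raw["usage"]["completion_tokens"],
    )


def from_provider_b(raw):
    # Hypothetical shape: flat output field, usage under in/out keys.
    return NormalisedResponse(
        text=raw["output"],
        input_tokens=raw["tokens"]["in"],
        output_tokens=raw["tokens"]["out"],
    )


# One adapter per integration; every new provider adds an entry here.
ADAPTERS = {"provider_a": from_provider_a, "provider_b": from_provider_b}


def normalise(provider, raw):
    """Translate a provider-specific response into the common shape."""
    return ADAPTERS[provider](raw)


raw_a = {"choices": [{"text": "Hi"}], "usage": {"prompt_tokens": 3, "completion_tokens": 1}}
raw_b = {"output": "Hi", "tokens": {"in": 3, "out": 1}}
print(normalise("provider_a", raw_a) == normalise("provider_b", raw_b))  # True
```

Each adapter is small, but every new integration adds another one to write, test, and keep in step with the provider’s changes.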

Routing logic is required if a developer wants to send each request to the right model. For example, this logic can read a metadata field in the request body and use it to select the model. Another approach is to add an LLM-as-Judge step, in which another LLM classifies the request so that it can be forwarded to the chosen model.
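The metadata-based approach can be sketched in a few lines. The field name task_type, its values, and the model identifiers are all hypothetical and chosen for this example.

```python
# Hypothetical model identifiers and routing table.
DEFAULT_MODEL = "fast-model"
ROUTES = {
    "complex": "reasoning-model",
    "simple": "fast-model",
}


def choose_model(request_body):
    """Pick a model from the request's metadata, falling back to the default."""
    task_type = request_body.get("metadata", {}).get("task_type")
    return ROUTES.get(task_type, DEFAULT_MODEL)


print(choose_model({"prompt": "Prove this lemma.", "metadata": {"task_type": "complex"}}))
print(choose_model({"prompt": "Say hi."}))  # no metadata, so the default is used
```

This keeps routing cheap and predictable, but it relies on the caller labelling each request correctly. An LLM-as-Judge step removes that burden at the price of an extra model call.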

Another consideration is that each provider imposes its own rate limits on requests and tokens per minute. In practice, this means that one provider’s quota can be depleted while another remains within limits. Without per-provider rate limit tracking, developers can’t route away from a rate-limited provider before requests start failing.
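One way to sketch per-provider rate limit tracking is a sliding 60-second window of request timestamps per provider. The class and function names, the limits, and the provider names below are assumptions made for this example. The sketch tracks requests only, whereas a fuller version would track tokens per minute in the same way.

```python
from collections import deque


class RateLimitTracker:
    """Sliding-window count of requests sent in the last 60 seconds."""

    def __init__(self, requests_per_minute):
        self.limit = requests_per_minute
        self.sent = deque()  # timestamps (seconds) of recent requests

    def _prune(self, now):
        # Drop timestamps that have left the 60-second window.
        while self.sent and now - self.sent[0] >= 60:
            self.sent.popleft()

    def has_capacity(self, now):
        self._prune(now)
        return len(self.sent) < self.limit

    def record(self, now):
        self.sent.append(now)


def pick_provider(trackers, now):
    """Return the first provider with remaining quota, or None if all are exhausted."""
    for name, tracker in trackers.items():
        if tracker.has_capacity(now):
            tracker.record(now)
            return name
    return None


trackers = {"provider_a": RateLimitTracker(2), "provider_b": RateLimitTracker(5)}
picks = [pick_provider(trackers, t) for t in (0.0, 1.0, 2.0)]
print(picks)  # ['provider_a', 'provider_a', 'provider_b']
```

The third request is routed to provider_b because provider_a’s quota of two requests per minute is already used up, so no request has to fail first. Timestamps are passed in explicitly to keep the example deterministic. Production code would typically read a monotonic clock.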

Observability compounds the overall complexity. Providers return token usage, errors, and other metadata in different fields and formats in their responses. This means that having a complete view of an application’s LLM costs, latency, and error rates across different providers is impossible without custom aggregation logic for each additional integration.

Finally, LLM APIs change. Models are deprecated as providers release newer alternatives. Each model arrives with its own pricing and capability trade-offs that developers need to consider. This results in a more persistent maintenance burden than working with more standard APIs.

The Gateway Pattern

LLM Gateways address these challenges of integrating multiple LLMs into applications. Here’s how:

To handle disparate request-response formats across providers, LLM Gateways expose a single API interface and translate requests and responses to and from each provider’s format internally. As far as a gateway user is concerned, they only interact with a single API format.

When a provider fails or a rate limit is reached, applications that depend on that single provider fail with it. Gateways solve this through configurable routing that allows for failover. This routing logic is also what directs requests to different models based on the type of user prompt.

LLM Gateways can also reduce costs and improve end-user latency through a response cache. When a request similar to a prior one flows through, the gateway returns the cached response and skips the LLM call entirely.
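A response cache can be sketched as follows. Note that this example is deliberately simpler than similarity matching: it only treats two prompts as the same if they are identical after lower-casing and collapsing whitespace. Matching prompts that are merely similar in meaning would need something like embedding comparison, which is not shown. The class name and the fake model function are invented for the example, and there is no expiry or size limit.

```python
import hashlib


class ResponseCache:
    """Exact-match cache keyed on the model and a normalised prompt."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(model, prompt):
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model}\n{normalised}".encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, call):
        """Return (response, was_cached); only invoke call on a cache miss."""
        key = self._key(model, prompt)
        if key in self._store:
            return self._store[key], True
        response = call(prompt)
        self._store[key] = response
        return response, False


calls = []


def fake_llm(prompt):
    # Stand-in for a real model call; records each invocation.
    calls.append(prompt)
    return "Paris"


cache = ResponseCache()
first = cache.get_or_call("fast-model", "What is the capital of France?", fake_llm)
second = cache.get_or_call("fast-model", "  what is the capital of   FRANCE? ", fake_llm)
print(first, second, len(calls))  # ('Paris', False) ('Paris', True) 1
```

The second request never reaches the model, which is where both the cost and the latency savings come from.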

Then there is the LLM-specific issue of non-determinism. There is always the chance that a model will generate inappropriate output in response to a user request. Similarly, users may prompt a model to generate hateful, insulting, violent, or otherwise undesirable content. They may also mistakenly include PII in their prompts, leaking sensitive information to LLM providers. Gateways provide a way to address this through guardrails and configurable moderation policies. In practice, this means scanning user inputs and model outputs and then redacting or replacing the undesired content.
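As a very small illustration of the redaction side of a guardrail, the sketch below replaces email addresses in a piece of text and reports which rule fired. The regular expression is a rough approximation that will miss some valid addresses and is not a complete PII detector. The rule name and placeholder text are invented for this example.

```python
import re

# Rough email pattern for illustration; not a complete PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def apply_guardrails(text):
    """Redact email addresses and return (clean_text, triggered_rules)."""
    redacted, count = EMAIL.subn("[REDACTED_EMAIL]", text)
    triggers = ["pii_email"] if count else []
    return redacted, triggers


clean, triggers = apply_guardrails("Contact me at jane.doe@example.com please.")
print(clean)     # Contact me at [REDACTED_EMAIL] please.
print(triggers)  # ['pii_email']
```

The same function can be applied to user input before it is sent to a provider and to model output before it is returned to the user.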

Given that every interaction passes through the gateway, it is the natural place to capture observability data, normalised across providers. This includes errors, token usage, latency, and guardrail triggers.
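Once records share a common shape, aggregating them per provider is straightforward, as the sketch below suggests. The record fields (provider, input_tokens, output_tokens, latency_s, error) and the sample values are assumptions made for this example.

```python
from collections import defaultdict


def summarise(records):
    """Aggregate normalised request records into per-provider totals."""
    totals = defaultdict(
        lambda: {"requests": 0, "errors": 0, "tokens": 0, "latency_s": 0.0}
    )
    for record in records:
        entry = totals[record["provider"]]
        entry["requests"] += 1
        entry["errors"] += 1 if record["error"] else 0
        entry["tokens"] += record["input_tokens"] + record["output_tokens"]
        entry["latency_s"] += record["latency_s"]
    return {
        provider: {**entry, "avg_latency_s": entry["latency_s"] / entry["requests"]}
        for provider, entry in totals.items()
    }


# Made-up records in the assumed normalised shape.
records = [
    {"provider": "provider_a", "input_tokens": 100, "output_tokens": 50, "latency_s": 1.0, "error": False},
    {"provider": "provider_a", "input_tokens": 200, "output_tokens": 0, "latency_s": 3.0, "error": True},
    {"provider": "provider_b", "input_tokens": 10, "output_tokens": 20, "latency_s": 0.5, "error": False},
]

summary = summarise(records)
print(summary["provider_a"])
```

Because normalisation happens once at the gateway, this aggregation does not need to change when a new provider is added.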

LLM Gateways consolidate these solutions into a single layer that significantly simplifies how applications interact with LLMs, giving developers control over reliability, cost, and flexibility.