Stop Scaling Models. Start Scaling Architecture

By Luke E. Babarinde, MSc, MBA

2026-03-27 · 6 min read

One of the most common mistakes when building a document question-answering system is assuming that every failure is due to the model itself. If the model cannot satisfactorily answer a question, the instinct is often to replace it with a bigger, newer, or longer-context model.

This article explains why designing LLM architecture goes far beyond simply increasing the model’s parameter count or context window.

The Use Case:
The pipeline was designed to:
• Process arbitrary documents
• Classify them
• Chunk them into sections
• Embed them
• Store them
• Retrieve relevant sections
• Answer questions using an LLM

During recent testing of a retrieval pipeline using Qwen3-32B-AWQ running on a single RTX 5090 (32GB VRAM), an interesting issue occurred. The RTX 5090 can theoretically support mid-sized open-weight models comfortably - including GPT-OSS 20B, quantized Gemma 3 27B, Mistral 14B, and Qwen3.5 35B-A3B.

The pipeline was initially built around Qwen3-32B-AWQ and later migrated to Qwen3.5-35B-A3B for improved reasoning and efficiency. Yet despite the hardware being sufficient, the model failed to answer questions about documents that clearly contained the answer.

At first glance, it seemed like a model intelligence problem. It wasn’t.

It was a pipeline design problem.

The Conventional RAG Pattern
Most systems follow this approach:
1. Extract text
2. Chunk text
3. Embed chunks
4. Store in vector DB
5. Retrieve top-k chunks
6. Send to LLM
7. Generate answer
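
To make the pattern concrete, here is a minimal sketch of steps 2 through 6 in plain Python. It is not the pipeline used in this article: the word-count "embedding", the in-memory list standing in for a vector DB, and every function name are illustrative assumptions. The final LLM call (step 7) is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(text: str, max_words: int) -> list[str]:
    """Step 2: naive fixed-size chunking by word count."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(document: str, max_words: int = 10) -> list[tuple[str, Counter]]:
    """Steps 3-4: embed every chunk and keep it in an in-memory 'vector store'."""
    return [(c, embed(c)) for c in chunk(document, max_words)]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Step 5: return the top-k chunks by similarity to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Step 6: assemble the prompt that would be sent to the LLM."""
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


document = (
    "The lease term is twelve months starting in January. "
    "Rent is due on the first day of each month. "
    "The security deposit equals two months of rent and is refundable."
)
question = "How much is the security deposit?"
index = build_index(document)
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
```

Even in this toy run, the fixed-size chunker cuts the word "The" off the front of the retrieved sentence, a small illustration of how naive chunking can fragment meaning.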

This works in theory, but it is far too simplistic for real-world document reasoning. Large chat models, for example, do not operate this way internally. Production systems require more structure and discipline.

Improving the Ingestion Pipeline
Instead of increasing model size, we redesigned ingestion. We moved toward a “Two Brains in a Skull” architecture — separating lightweight precision cognition from deeper synthesis.

Enhanced Pipeline:
1. Ingestion
2. Normalize and deduplicate
3. Structure and section documents
4. Chunk with embedding constraints
5. Embed
6. Index
7. Enrich via classification

The key insight is that intelligence at query time depends on discipline at ingestion time. There is also the small matter of latency: ideally, responses should arrive within 5-7 seconds.
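
As a rough illustration of steps 2 and 4 above, the sketch below normalizes and deduplicates paragraphs, then packs them into chunks that respect an embedding token cap. The helper names, the 1.3 tokens-per-word estimate, and the hashing approach are assumptions made for the example, not the implementation behind this article; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
import hashlib
import re

EMBED_TOKEN_CAP = 512  # matches --max-model-len of the embedding server in this article


def normalize(text: str) -> str:
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(paragraphs: list[str]) -> list[str]:
    """Drop exact duplicates after normalization, keeping first occurrences in order."""
    seen: set[str] = set()
    unique: list[str] = []
    for p in paragraphs:
        norm = normalize(p)
        digest = hashlib.sha256(norm.lower().encode("utf-8")).hexdigest()
        if norm and digest not in seen:
            seen.add(digest)
            unique.append(norm)
    return unique


def approx_tokens(text: str) -> int:
    """Rough token estimate: whitespace words * 1.3 (an assumption, not a tokenizer)."""
    return int(len(text.split()) * 1.3) + 1


def chunk_section(paragraphs: list[str], cap: int = EMBED_TOKEN_CAP) -> list[str]:
    """Pack whole paragraphs into chunks whose estimated size stays under the cap.

    A single paragraph larger than the cap is kept whole here; a real
    pipeline would need to split it further.
    """
    chunks: list[str] = []
    current: list[str] = []
    for p in paragraphs:
        candidate = " ".join(current + [p])
        if current and approx_tokens(candidate) > cap:
            chunks.append(" ".join(current))
            current = [p]
        else:
            current.append(p)
    if current:
        chunks.append(" ".join(current))
    return chunks


raw = [
    "Termination   requires 30 days notice.",
    "Termination requires 30 days notice.",
    "Fees are billed monthly.",
]
clean = deduplicate(raw)
chunks = chunk_section(clean, cap=8)  # tiny cap so the split is visible
```

Packing whole paragraphs rather than cutting at a fixed word count keeps chunk boundaries aligned with the document's own structure.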

The Cognitive Gradient: 3B + 14B
To support this pipeline, we adopted a dual-model strategy: a Small Language Model (3B parameters) and a Large Language Model (14B parameters).

Qwen2.5 3B - for basic reasoning. This is well suited to quick fact lookup, extraction for semantic classification, and grounded responses that are structured and straight to the point. The SLM is token-efficient and also reduces the risk of hallucination.

Qwen3 14B AWQ - for deeper reasoning. This is well suited to multi-document reasoning and broad synthesis that cross-references context, so it is intentionally expansive in production. This is where trade-offs emerge: the 14B model consumes more tokens and requires stricter budgeting.
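
One way to picture the split between the 3B and 14B models is a small router that decides, per question, which model to call and with what budget. The sketch below is a hypothetical heuristic: the cue words, the token limits, and the `Route` structure are assumptions for illustration, and only the served model names come from the configuration shown in this article.

```python
from dataclasses import dataclass

SLM = "qwen3-3b"   # served name of the 3B model in this article's configuration
LLM = "qwen3-14b"  # served name of the 14B model in this article's configuration

# Hypothetical cue words suggesting the question needs synthesis, not lookup.
SYNTHESIS_CUES = ("compare", "summarize", "across", "contrast", "trade-off")


@dataclass
class Route:
    model: str
    max_tokens: int  # generation budget for this request (illustrative values)
    thinking: bool   # whether to allow extended deliberation


def route(question: str, num_sections: int) -> Route:
    """Send simple single-section lookups to the SLM and everything else to the LLM."""
    q = question.lower()
    needs_synthesis = num_sections > 1 or any(cue in q for cue in SYNTHESIS_CUES)
    if needs_synthesis:
        return Route(model=LLM, max_tokens=1024, thinking=True)
    return Route(model=SLM, max_tokens=256, thinking=False)


lookup = route("What is the invoice date?", num_sections=1)
synthesis = route("Compare the termination clauses in both contracts.", num_sections=3)
```

A keyword heuristic like this is deliberately crude; the point is that the routing decision is explicit and cheap, so the expensive model is only paid for when the question calls for it.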

To improve performance, it is natural to assume that GPU utilization and VRAM will be the main bottlenecks. However, the actual limiting factors when designing production-grade infrastructure are:
• Token budgeting
• Routing discipline
• Chunking strategy
• Thinking mode
• Embedding boundaries

Scaling the Infrastructure
Building a scalable LLM infrastructure begins with understanding the problem that needs to be solved. For example, in a document knowledge system, it is important to know the size limits of the documents and how much inner reasoning the model must perform to process prompts before generating a response. You will need a different token budget to process a 5,000-word legal document versus an 800-word guide.
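
The difference between those two documents can be estimated with simple arithmetic. The sketch below assumes roughly 1.3 tokens per English word and 200 tokens of prompt scaffolding; both figures are rules of thumb chosen for illustration, and real counts should come from the model's tokenizer.

```python
TOKENS_PER_WORD = 1.3  # rough rule of thumb for English text (assumption)


def prompt_budget(doc_words: int, max_model_len: int, max_tokens: int,
                  scaffold_tokens: int = 200) -> dict:
    """Estimate whether a whole document fits in the context window.

    The window must hold the prompt scaffolding, the document text,
    and the tokens reserved for generation (max_tokens).
    """
    doc_tokens = round(doc_words * TOKENS_PER_WORD)
    available = max_model_len - max_tokens - scaffold_tokens
    return {
        "doc_tokens": doc_tokens,
        "available_for_context": available,
        "fits_whole": doc_tokens <= available,
    }


legal = prompt_budget(doc_words=5000, max_model_len=4096, max_tokens=512)
guide = prompt_budget(doc_words=800, max_model_len=4096, max_tokens=512)
```

Under these assumptions the 800-word guide (about 1,040 tokens) fits in a 4,096-token window with room to spare, while the 5,000-word legal document (about 6,500 tokens) cannot be sent whole and has to be served through retrieval of selected sections.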

Things to think about when designing the AI infrastructure:
Context Variables
• --max-model-len
Defines the upper bound of cognitive scope, directly trading reasoning breadth for memory pressure and latency.
• KV cache usage
Reflects how much real-time memory the model consumes per token, scaling linearly with context length and concurrency.
• Prefix cache efficiency
Measures how effectively repeated prompt scaffolding is reused, directly reducing latency and GPU load.
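
The linear scaling of the KV cache can be sketched with the standard per-token formula: one key and one value vector per layer per KV head. The layer count, KV head count, and head dimension below are illustrative placeholders, not confirmed values for any specific model; read the real ones from the model's config.json.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Illustrative architecture values (assumptions, check the model's config.json).
per_token = kv_cache_bytes_per_token(
    num_layers=40, num_kv_heads=8, head_dim=128,
    bytes_per_elem=1,  # 1 byte per element with an fp8 KV cache, 2 with fp16
)

# Worst case: every concurrent sequence filled to the full context length.
worst_case_tokens = 4096 * 6  # --max-model-len * --max-num-seqs
total_gib = per_token * worst_case_tokens / 2**30
```

With these placeholder values the worst case comes to a little under 2 GiB, and doubling either the context length or the concurrency doubles it.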

Generation Variables
• max_tokens
Caps the model’s expressive depth, determining whether reasoning completes or truncates under pressure.
• thinking / deep_mode
Amplifies internal deliberation, increasing token burn and latency in exchange for higher abstraction capacity.
• finish_reason behavior
Reveals whether generation ended naturally or was forcibly truncated, serving as a diagnostic for token budgeting accuracy.
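
A minimal sketch of using finish_reason as a budgeting diagnostic follows. It assumes an OpenAI-style chat-completions response, in which a finish_reason of "length" means generation was cut off by max_tokens. The `generate_with_budget` helper, the doubling policy, and the `fake_model` stand-in are hypothetical and exist only so the example runs without a server.

```python
from typing import Callable


def generate_with_budget(call_model: Callable[[int], dict],
                         max_tokens: int, ceiling: int) -> dict:
    """Retry with a doubled max_tokens while the response is truncated.

    Stops when the model finishes naturally or the ceiling is reached.
    """
    while True:
        response = call_model(max_tokens)
        choice = response["choices"][0]
        truncated = choice["finish_reason"] == "length"
        if not truncated or max_tokens >= ceiling:
            return {
                "text": choice["message"]["content"],
                "finish_reason": choice["finish_reason"],
                "max_tokens": max_tokens,
            }
        max_tokens = min(max_tokens * 2, ceiling)


def fake_model(max_tokens: int) -> dict:
    """Stand-in for a real chat-completions call; needs 300 tokens to finish."""
    done = max_tokens >= 300
    return {"choices": [{
        "message": {"content": "full answer" if done else "partial ans"},
        "finish_reason": "stop" if done else "length",
    }]}


result = generate_with_budget(fake_model, max_tokens=128, ceiling=1024)
```

Retrying costs latency, so in practice the more useful signal may be the logged rate of "length" finishes per route, which shows whether the default budget is set too low.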

Throughput Variables
• --max-num-seqs
Controls concurrency, balancing system throughput against deterministic latency and memory stability.
• --max-num-batched-tokens
Defines batching density, trading GPU efficiency for potential volatility in memory allocation.
• Batching behavior
Determines how requests are grouped for execution, directly impacting latency consistency and hardware utilization efficiency.

Precision Variables
• Embedding token cap
Sets the semantic granularity ceiling, forcing chunk boundaries to respect vector model limits.
• Chunk size discipline
Governs retrieval signal clarity, where oversized chunks dilute relevance and undersized chunks fragment meaning.
• Retrieval section selection
Filters contextual noise by constraining which document segments are eligible for synthesis.
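
Retrieval section selection can be as simple as filtering candidates by labels assigned at ingestion before ranking them. In the sketch below, the `Chunk` fields, the label values, and the `select_sections` helper are all hypothetical; the article does not specify how its metadata is stored.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    doc_type: str  # label assigned at ingestion by the classifier (hypothetical field)
    section: str   # section heading the chunk came from (hypothetical field)
    score: float   # similarity score returned by the vector search


def select_sections(chunks: list[Chunk], doc_type: str,
                    allowed_sections: set[str], k: int = 3) -> list[Chunk]:
    """Keep only chunks from eligible sections, then take the top-k by score."""
    eligible = [
        c for c in chunks
        if c.doc_type == doc_type and c.section in allowed_sections
    ]
    return sorted(eligible, key=lambda c: c.score, reverse=True)[:k]


candidates = [
    Chunk("Either party may terminate with 30 days notice.", "contract", "Termination", 0.71),
    Chunk("Terminate the process with Ctrl+C.", "manual", "Usage", 0.78),
    Chunk("This agreement is governed by state law.", "contract", "Governing Law", 0.40),
]
selected = select_sections(candidates, doc_type="contract",
                           allowed_sections={"Termination"}, k=2)
```

Here the highest-scoring candidate is discarded because it comes from the wrong document type, which is the kind of contextual noise that similarity search alone would pass through.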

Hardware Strategy Variables
• dtype (fp16 vs quantized)
Balances reasoning fidelity against VRAM footprint, trading marginal quality for hardware accessibility.
• kv-cache-dtype
Compresses memory used for attention state, impacting scalability more than reasoning capability.
• gpu-memory-utilization
Sets the ceiling for VRAM allocation, influencing stability, concurrency headroom, and fragmentation risk.
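
The dtype trade-off can be estimated from parameter count and bits per weight. The sketch below uses nominal parameter counts (14B and 3B) and ignores quantization scales, any layers left unquantized, activations, and the KV cache, so the results are lower bounds for illustration rather than measured footprints.

```python
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB: parameters * bits per weight / 8."""
    return params_billion * bits_per_weight / 8


llm_fp16 = weight_gb(14, 16)  # 14B model stored in fp16
llm_awq = weight_gb(14, 4)    # same model with 4-bit weights (AWQ-style quantization)
slm_fp16 = weight_gb(3, 16)   # 3B model stored in fp16
```

By this estimate a 14B model in fp16 needs about 28 GB for weights alone, which leaves almost nothing on a 32 GB card, while 4-bit weights bring it to about 7 GB and make room for a second model and the KV cache.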

The view of the architecture
LLM - Qwen3 14B AWQ

command:
    [
        "Qwen/Qwen3-14B-AWQ",
        "--served-model-name", "qwen3-14b",
        "--quantization", "awq_marlin",
        "--dtype", "float16",
        "--gpu-memory-utilization", "0.60",
        "--max-model-len", "4096",
        "--max-num-seqs", "6",
        "--max-num-batched-tokens", "8192",
        "--kv-cache-dtype", "fp8",
        "--enable-prefix-caching"
    ]

SLM - Qwen 2.5 3B

command:
    [
        "Qwen/Qwen2.5-3B-Instruct",
        "--served-model-name", "qwen3-3b",
        "--dtype", "float16",
        "--gpu-memory-utilization", "0.30",
        "--max-model-len", "4096",
        "--max-num-seqs", "2",
        "--max-num-batched-tokens", "4096",
        "--kv-cache-dtype", "fp8",
        "--enable-prefix-caching"
    ]

For Embeddings - BGE-EMB:

command:
    [
        "--model", "BAAI/bge-base-en-v1.5",
        "--served-model-name", "bge-emb",
        "--gpu-memory-utilization", "0.05",
        "--max-model-len", "512"
    ]
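
A quick sanity check on the three --gpu-memory-utilization values is to confirm that they fit on one card together. The sketch below assumes each fraction is taken of the total 32 GB of device memory and that nothing else is using the GPU; the `vram_plan` helper is illustrative.

```python
TOTAL_VRAM_GB = 32  # RTX 5090

# --gpu-memory-utilization values from the three configurations above.
allocations = {"qwen3-14b": 0.60, "qwen3-3b": 0.30, "bge-emb": 0.05}


def vram_plan(fractions: dict[str, float], total_gb: float) -> dict[str, float]:
    """Convert per-server fractions into GB and report the unallocated headroom."""
    total_fraction = sum(fractions.values())
    if total_fraction > 1.0:
        raise ValueError(f"over-allocated: fractions sum to {total_fraction:.2f}")
    plan = {name: round(frac * total_gb, 2) for name, frac in fractions.items()}
    plan["headroom"] = round((1.0 - total_fraction) * total_gb, 2)
    return plan


plan = vram_plan(allocations, TOTAL_VRAM_GB)
```

The fractions sum to 0.95, which under these assumptions gives roughly 19.2 GB to the 14B server, 9.6 GB to the 3B server, 1.6 GB to the embedding server, and 1.6 GB of headroom.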

Here is a simple workflow showing how the SLM and LLM work together to great effect while making full use of VRAM under the above configuration:

[Figure: workflow before classification with the SLM]

[Figure: workflow after classification with the SLM]

Conclusion
As we move toward an age of efficiency through the adoption of AI for intelligent automation, it is critical that the infrastructure powering that intelligence is itself built on the principles of efficiency.

The future of AI infrastructure will not be defined by the largest model running on the most powerful GPU. It will be defined by systems that understand how cognition flows — when to compress, when to expand, when to reason deeply, and when to remain literal.

True intelligence at scale is not achieved by removing constraints. It is achieved by aligning them.

The systems that endure will not be those that maximize parameters, but those that maximize coherence — where architecture, hardware, retrieval, and reasoning operate as a unified gradient rather than as isolated components. Careful attention must be paid to use cases, with the understanding that one size does not fit all. Scaling to a larger model is rarely the solution to architectural misalignment; thoughtful system design is what enables scalable outcomes.

In the end, efficient infrastructure is not a trade-off for intelligence. It is its foundation.
