Vanira Research · White Paper · June 2026

Modality Over Scale:
Why Parameters Are Not Intelligence

A formal note for the research community: scaling width and depth is not equivalent to scaling evidence. We argue — and partially prove — that enriching transformer inputs with real modalities and explicit temporal + spatial positional structure raises task accuracy at fixed parameter budgets.

Teja Reddy · Vanira Research · June 2026

Executive Summary

Chinchilla-style scaling laws describe how loss falls with parameters and training tokens when the training distribution is fixed. Live agents operate on a different distribution: synchronized audio, video, documents, and UI state aligned in time and space. The positional-encoding intuition examined in this note is directionally correct: standard 1D sequence position is insufficient, and the input tensor needs multi-axis position (time, space, modality). That is an inductive bias and an information gain, not a parameter gain.

Abstract

We distinguish capacity scaling (more weights) from evidence scaling (richer, structured inputs). For perception-grounded tasks, we argue that evidence scaling dominates: a smaller transformer with aligned multimodal tokens and explicit temporal/spatial positional encodings works from inputs with a lower Bayes error, and can therefore reach lower task error than a larger text-only model fed ASR transcripts and image captions.

This paper states the claim formally, connects it to positional encoding theory and the data-processing inequality, and outlines what is proven in the literature vs. what remains an engineering conjecture we test in production.

1. Setup: What the Transformer Actually Sees

Standard self-attention over tokens X = (x₁,…,xₙ):

Scaled dot-product attention (Vaswani et al.)
Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V

Q = XW_Q ,  K = XW_K ,  V = XW_V
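
The formula above can be sketched in plain Python as follows. This is a minimal illustration on list-of-lists matrices, not a production kernel; the helper names (`matmul`, `softmax`, `attention`, `self_attention`) are our own and come from no library.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Return (softmax(Q K^T / sqrt(d_k)) V, attention weights)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def self_attention(x, w_q, w_k, w_v):
    """Self-attention with Q = X W_Q, K = X W_K, V = X W_V."""
    return attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))

# Toy input: three tokens of width 2, identity projections for readability.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 2.0], [3.0, 0.5], [0.0, 1.0]]
out, weights = self_attention(x, identity, identity, identity)
```

Reversing the rows of `x` reverses the rows of `out` and changes nothing else. Self-attention has no notion of order or layout unless that information is written into X, which is the point the rest of this note builds on.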

Intelligence in this framework is not “in the parameters alone.” It is the map from raw evidence → token matrix X → output. If X is built by collapsing camera pixels into a paragraph, you have already destroyed information before attention runs.

2. Why Parameter Scaling Is Not the Same as Getting Smarter

Empirical scaling laws (Kaplan, Hoffmann/Chinchilla) approximate:

Loss vs. model size N at fixed data D
L(N, D) ≈ A · N^(-α) + B · D^(-β) + L_∞

Important caveat: this holds when (i) the task is stationary, (ii) the input representation is fixed, and (iii) evaluation matches training (e.g. next-token prediction on text). It does not say that increasing N creates new sensory channels. A text-only model with 10× parameters still cannot attend to pixel (i,j) unless that structure is present in X.
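
To make the saturation concrete, the sketch below evaluates the additive form with placeholder constants. The values of A, B, α, β, and L_∞ are arbitrary illustrative numbers, not fitted values from any paper. The only point is structural: at fixed D, growing N can approach the floor B · D^(-β) + L_∞ but never go below it.

```python
def scaling_loss(n_params, n_tokens, a=400.0, b=400.0,
                 alpha=0.3, beta=0.3, l_inf=1.7):
    """Additive scaling-law form; every constant here is an illustrative placeholder."""
    return a * n_params ** -alpha + b * n_tokens ** -beta + l_inf

def floor_at_fixed_data(n_tokens, b=400.0, beta=0.3, l_inf=1.7):
    """Limit of scaling_loss as n_params grows without bound at fixed data."""
    return b * n_tokens ** -beta + l_inf

tokens = 1e10
for n in (1e8, 1e9, 1e10, 1e11):
    gap = scaling_loss(n, tokens) - floor_at_fixed_data(tokens)
    print(f"N={n:.0e}  loss={scaling_loss(n, tokens):.4f}  gap_to_floor={gap:.4f}")
```

The gap shrinks as N grows, but the floor itself is set by the data and the input representation, neither of which N changes.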

For task label Y (e.g. “approve refund”, “MRZ valid”, “crack detected”):

Information bottleneck when modalities are collapsed to text
Let X = (X_audio, X_video, X_ui, …)  — raw aligned streams
Let T = Textify(X)                     — ASR + caption + OCR string

Data-processing inequality:
  I(Y ; X)  ≥  I(Y ; T)

Strict inequality when Y depends on prosody, micro-texture, layout,
or timing that Textify() discards or corrupts.

More parameters let you fit a better approximation of P(next token | T). They do not recover mutual information already lost in Textify(X).
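
A toy calculation makes the inequality tangible. The setup is entirely hypothetical: X is a pair of bits (word, prosody), the label Y is 1 only when both are 1 (think of "yes" said firmly), and `textify` keeps the word and drops the prosody. Because T is a function of X, the chain Y → X → T is Markov and the data-processing inequality applies.

```python
import math
from collections import Counter
from itertools import product

def mutual_information(pairs):
    """I(A;B) in bits, from a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((p_a[a] / n) * (p_b[b] / n)))
        for (a, b), c in joint.items()
    )

def label(x):
    """Hypothetical task label: 1 only when word and prosody both signal consent."""
    word, prosody = x
    return int(word == 1 and prosody == 1)

def textify(x):
    """Hypothetical lossy channel: keep the word, discard the prosody."""
    word, _prosody = x
    return word

outcomes = list(product([0, 1], repeat=2))  # uniform over (word, prosody)
i_yx = mutual_information([(label(x), x) for x in outcomes])
i_yt = mutual_information([(label(x), textify(x)) for x in outcomes])
print(f"I(Y;X) = {i_yx:.3f} bits, I(Y;T) = {i_yt:.3f} bits")
```

In this toy case I(Y;X) is about 0.81 bits and I(Y;T) is about 0.31 bits. No downstream model operating on T alone can recover the missing half bit, whatever its size.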

3. The Positional-Encoding Intuition, Formalized

Original Transformers encode sequence index only:

1D sinusoidal positional encoding
PE(pos, 2i)   = sin( pos / 10000^(2i/d) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )

h_pos = x_pos + PE(pos)
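
As a minimal sketch of the two formulas above, assuming an even model width `d_model` (the function names are ours):

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Sinusoidal encoding for one position; assumes d_model is even."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe.extend([math.sin(angle), math.cos(angle)])  # dims 2i and 2i+1
    return pe

def add_position(x_pos, pos):
    """h_pos = x_pos + PE(pos), elementwise."""
    return [x + p for x, p in zip(x_pos, sinusoidal_pe(pos, len(x_pos)))]

h0 = add_position([0.5, 0.5, 0.5, 0.5], pos=0)
h7 = add_position([0.5, 0.5, 0.5, 0.5], pos=7)
```

The same content vector yields different `h` at positions 0 and 7, but the only thing that differs is the index in the sequence. Nothing here carries a timestamp, a bounding box, or a modality.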

That tells the model “token A came before token B.” It does not tell the model:

  • that frame f occurred 342 ms after utterance u (temporal alignment),
  • that OCR region r sits at normalized box (0.12, 0.25, 0.30, 0.50) (spatial structure),
  • that stream s is audio vs. vision vs. DOM (modality channel).

The fix is not a bigger PE table. It is multi-axis positional structure on each token:

Proposed enriched token representation (core claim)
h_i = e_i                           — content embedding (patch, phoneme, DOM node, …)
    + PE_seq(i)                     — order in fused context window
    + PE_time(t_i, Δt_i, phase_i)   — session clock, turn offset, phase (KYC, inspect, …)
    + PE_space(x_i, y_i, w_i, h_i)  — 2D/3D coords when token is spatial
    + PE_mod(m_i)                   — modality id / encoder family

Attention still standard; only X → H construction changes.
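
One way the additive construction above could look in code is sketched below. Several choices are simplifying assumptions of ours rather than the production design: PE_time uses only the timestamp t_i (no Δt or phase), PE_space splits the width into four sinusoidal chunks with an arbitrary coordinate scale, and PE_mod is a one-hot stand-in for what would normally be a learned embedding table. `Token`, `MODALITY_IDS`, and `enrich` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

D_MODEL = 8  # illustrative width; must be a multiple of 8 for the 4-way spatial split
MODALITY_IDS = {"audio": 0, "vision": 1, "ui": 2, "text": 3}  # hypothetical ids

def sinusoid(value, dim, base=10000.0):
    """Sinusoidal features of a scalar (index, milliseconds, or coordinate)."""
    out = []
    for i in range(dim // 2):
        angle = value / base ** (2 * i / dim)
        out.extend([math.sin(angle), math.cos(angle)])
    return out

@dataclass
class Token:
    content: Sequence[float]   # e_i, already projected to D_MODEL
    seq_index: int             # i
    time_ms: float             # t_i on the shared session clock
    modality: str              # m_i
    box: Optional[Tuple[float, float, float, float]] = None  # normalized (x, y, w, h)

def pe_space(box, dim):
    """Zero vector for non-spatial tokens, else one sinusoidal chunk per coordinate."""
    if box is None:
        return [0.0] * dim
    out = []
    for coord in box:
        out.extend(sinusoid(coord * 100.0, dim // 4))  # scale factor is arbitrary
    return out

def pe_mod(modality, dim):
    """One-hot placeholder for a learned modality embedding."""
    vec = [0.0] * dim
    vec[MODALITY_IDS[modality]] = 1.0
    return vec

def enrich(tok, dim=D_MODEL):
    """h_i = e_i + PE_seq + PE_time + PE_space + PE_mod, summed elementwise."""
    parts = [
        list(tok.content),
        sinusoid(tok.seq_index, dim),
        sinusoid(tok.time_ms, dim),
        pe_space(tok.box, dim),
        pe_mod(tok.modality, dim),
    ]
    return [sum(vals) for vals in zip(*parts)]

utterance = Token([0.0] * D_MODEL, seq_index=0, time_ms=0.0, modality="audio")
ocr_box = Token([0.0] * D_MODEL, seq_index=1, time_ms=342.0, modality="vision",
                box=(0.12, 0.25, 0.30, 0.50))
h_utterance, h_ocr = enrich(utterance), enrich(ocr_box)
```

Summation keeps the width at d_model, so the transformer blocks need no change. Concatenating the parts and projecting back to d_model is an equally plausible design that trades a few parameters for less interference between axes.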

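Vision Transformers already attach a position embedding to every image patch (Dosovitskiy et al. use learned 1D embeddings and report no significant gain from 2D-aware variants in their setting), and DETR pairs 2D spatial positional encodings with learned object queries. We extend the same principle to live multimodal agents: one shared clock T across audio frames, UI highlights, and upload events, so that cross-modal attention can align evidence in causal order (“user said yes after box appeared”).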

4. Why This Improves Accuracy Without More Parameters

Learning theory view: parameters control capacity; positional structure controls inductive bias. Self-attention is permutation-equivariant over its input tokens, so any 2D neighborhood structure has to arrive through the positional terms. Without PE_space, the model must infer that structure from data alone, which is a harder learning problem.

Locality bias from spatial PE (informal)
With 2D PE, attention can implement local operators:

  α_ij ∝ exp( q_i · k_j / √d ) · 𝟙[ ||pos(i) - pos(j)|| < ρ ]

→ read MRZ strip, inspect crack neighborhood, match face crop to ID photo
  without attending uniformly over all O(n²) token pairs.
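
The masked weights above can be sketched directly. The hard indicator is the informal idealization from the formula; a trained model would more plausibly learn a soft version of it. The function name and the toy positions are ours, and the sketch assumes small scores (no overflow guard) and a positive radius so that every token can at least attend to itself.

```python
import math

def local_attention_weights(q, k, pos, radius):
    """Attention weights restricted to keys within `radius` of each query position."""
    d = len(q[0])
    weights = []
    for q_i, p_i in zip(q, pos):
        row = []
        for k_j, p_j in zip(k, pos):
            near = math.dist(p_i, p_j) < radius           # indicator term
            score = sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d)
            row.append(math.exp(score) if near else 0.0)  # masked, unnormalized weight
        total = sum(row)                                   # > 0: a token is near itself
        weights.append([w / total for w in row])
    return weights

# Two neighboring patches and one far-away patch on a 2D grid.
positions = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha = local_attention_weights(q, k, positions, radius=2.0)
```

With radius 2.0 the first two patches attend only to each other and themselves, and the distant patch attends only to itself.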

Sample complexity tends to drop when the right bias is built into H: fewer examples are needed to learn the same visual or temporal operator. That shows up as higher accuracy at the same N, the opposite of the industry default (“buy more GPUs, train bigger LLM”).

What is proven vs. conjecture

  • Proven / established: data-processing inequality; patch-level position embeddings in ViT and 2D spatial encodings in DETR; multimodal fusion beats unimodal on vision-language tasks when labels depend on both.
  • Strong engineering evidence: live inspection/KYC/returns accuracy gains when adding aligned vision + time tags at fixed backbone size (Vanira production).
  • Open research: tight lower bounds on how much I(Y;X) is lost per modality collapse; optimal PE_time for streaming agents.

5. Real Modalities: Keep Channels, Don’t Textify Early

Let encoders E_m project each modality into d_model without forcing a single text bottleneck:

Multimodal fusion before shared transformer layers
H_audio  = E_audio(X_audio)   + PE_time + PE_mod(audio)
H_vision = E_vision(X_video)  + PE_time + PE_space + PE_mod(vision)
H_ui     = E_ui(X_dom)        + PE_space + PE_mod(ui)
H_text   = E_text(X_text)     + PE_seq

H = Fuse(H_audio, H_vision, H_ui, H_text)   — concat + align on shared clock T
Y_hat = TransformerBlocks(H)
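
The "concat + align on shared clock T" step could be as simple as a timestamp merge, sketched here under the assumption that each per-modality stream is already sorted by time. `Event` and `fuse` are hypothetical names, and the payload strings stand in for encoder outputs.

```python
import heapq
from typing import NamedTuple

class Event(NamedTuple):
    time_ms: float   # timestamp on the shared session clock T
    modality: str
    payload: str     # placeholder for the encoded token

def fuse(*streams):
    """Merge time-sorted per-modality streams into one sequence with fused indices."""
    merged = heapq.merge(*streams, key=lambda e: e.time_ms)
    return list(enumerate(merged))  # (sequence index in fused window, event)

audio = [Event(1342.0, "audio", "yes")]
ui = [Event(1000.0, "ui", "box_shown"), Event(2000.0, "ui", "submit")]
fused = fuse(audio, ui)
for index, event in fused:
    print(index, event.time_ms, event.modality, event.payload)
```

In the fused order the box appears before the spoken "yes", which is exactly the causal relationship that a Textify() pipeline with separate transcripts and UI logs tends to lose.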

Early Textify() is equivalent to passing X through a noisy channel with unknown capacity. Real modalities preserve high-frequency evidence (prosody, glare on ID card, hairline crack) that no amount of post-hoc scaling on T recovers.

6. Testable Claim for the Community

We invite reproducible comparison under fixed parameter budget N and identical downstream heads:

Hypothesis (modality-over-scale)
For perception-grounded task Y (damage, KYC, tutoring, field ops):

  Err(N_large, Textify(X))  >  Err(N_small, H(X))     often holds

where H(X) uses aligned multimodal tokens + PE_time + PE_space,
N_large ≥ N_small, and Err is live task error rate (not perplexity).

Input construction                        | Params  | Expected Err(Y)
ASR transcript + image caption string     | N_large | High (perception floor)
+ raw audio / vision tokens, 1D PE only   | N_small | Medium
+ PE_time + PE_space + shared clock T     | N_small | Lowest

7. Vanira Implementation Sketch

  • WebRTC session clock T — audio, camera, uploads, UI events timestamped on one timeline.
  • Structured injection — OCR boxes, detections, and tool results enter as typed tokens with PE_space + PE_time, not prose.
  • Reason-native routing — kyc_photo, live_detection, etc. preserve modality before fusion.
  • Edge encoders — small E_m at client; shared transformer reasons on H, not on lossy Textify(X).

8. Conclusion

The positional-encoding intuition holds. The community spent years on RoPE, ALiBi, and longer contexts, all variations on where in the sequence a token sits. The next step is where in time, space, and modality space it sits, relative to everything else in the live session.

Parameters store reusable operators. Positional structure tells those operators which evidence to bind. For agents that hear, see, and act in the physical world, bind evidence first — then scale weights only if the task still demands it.

Discuss this paper

We welcome reproducible benchmarks and formal bounds on multimodal PE. Reach us at hello@vanira.io with subject “Modality Over Scale”.