Executive Summary

Chinchilla scaling laws describe how loss falls with parameter count and training tokens when the training distribution is fixed. Live agents operate on a different distribution: synchronized audio, video, documents, and UI state aligned in time and space. Your intuition about positional encoding is directionally correct: standard 1D sequence position is insufficient, and you need multi-axis position (time, space, modality) attached to the input tokens. That is an inductive bias and an information gain, not a parameter gain.

Abstract

We distinguish capacity scaling (more weights) from evidence scaling (richer, structured inputs). For perception-grounded tasks, we argue that evidence scaling dominates: the Bayes error of a task is set by the input representation, not by the model, so a smaller transformer with aligned multimodal tokens and explicit temporal/spatial positional encodings can reach lower task error than a larger text-only model fed ASR transcripts and image captions.

This paper states the claim formally, connects it to positional encoding theory and the data-processing inequality, and outlines what is proven in the literature vs. what remains an engineering conjecture we test in production.

1. Setup: What the Transformer Actually Sees

Standard self-attention over tokens X = (x₁,…,xₙ):

Scaled dot-product attention (Vaswani et al.)

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V

Q = XW_Q ,  K = XW_K ,  V = XW_V
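As a minimal sketch of the formula above, the following plain-Python version computes single-head attention on a toy input. The helper names (matmul, softmax, attention), the identity projections, and the toy values are illustrative assumptions, not part of any library. It also checks a property that matters for the rest of the argument: with no positional information in X, permuting the tokens only permutes the outputs.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention over the rows of X."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V), weights

# Toy input: 3 tokens, d_model = 2, identity projections (illustrative values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention(X, I2, I2, I2)

# Feed the same tokens in a different order.
perm = [2, 0, 1]
out_perm, _ = attention([X[p] for p in perm], I2, I2, I2)
```

Each row of weights sums to 1, and out_perm[i] equals out[perm[i]] up to floating-point rounding. Order, time, and space therefore have to be written into X explicitly; attention does not supply them.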

Intelligence in this framework is not “in the parameters alone.” It is the map from raw evidence → token matrix X → output. If X is built by collapsing camera pixels into a paragraph, you have already destroyed information before attention runs.

2. Why Parameter Scaling Is Not the Same as Getting Smarter

Empirical scaling laws (Kaplan, Hoffmann/Chinchilla) approximate:

Loss vs. model size N and training tokens D

L(N, D) ≈ A · N^(-α) + B · D^(-β) + L_∞

Important caveat: this holds when (i) the task is stationary, (ii) the input representation is fixed, and (iii) evaluation matches training (e.g. next-token prediction on text). It does not say that increasing N creates new sensory channels. A text-only model with 10× parameters still cannot attend to pixel (i,j) unless that structure is present in X.
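To make the shape of this law concrete, here is a small sketch that evaluates the parametric form at a fixed token budget. The constants A, B, alpha, beta, and l_inf, and the budget D, are illustrative placeholders chosen for this example, not fitted values from any paper.

```python
def scaling_loss(n_params, n_tokens, A=400.0, B=400.0, alpha=0.34, beta=0.28, l_inf=1.7):
    """Parametric loss L(N, D) = A * N^-alpha + B * D^-beta + L_inf.

    The default constants are illustrative placeholders, not fitted values.
    """
    return A * n_params ** -alpha + B * n_tokens ** -beta + l_inf

D = 1e11  # fixed training-token budget (hypothetical)
floor = 400.0 * D ** -0.28 + 1.7  # what remains as N grows without bound

for N in (1e8, 1e9, 1e10, 1e11):
    print(f"N={N:.0e}  L={scaling_loss(N, D):.3f}  floor={floor:.3f}")
```

Under these assumptions the loss decreases monotonically in N but never drops below the floor set by the data term and L_∞. Nothing in the formula refers to the input representation, which is the point of the caveat above.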

For task label Y (e.g. “approve refund”, “MRZ valid”, “crack detected”):

Information bottleneck when modalities are collapsed to text

Let X = (X_audio, X_video, X_ui, …)  — raw aligned streams
Let T = Textify(X)                     — ASR + caption + OCR string

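Data-processing inequality (T is a function of X, so Y → X → T is a Markov chain):
  I(Y ; X)  ≥  I(Y ; T)

The inequality is strict whenever T is not a sufficient statistic of X for Y,
e.g. when Y depends on prosody, micro-texture, layout, or timing that
Textify() discards or corrupts.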

More parameters let you fit a better approximation of P(next token | T). They do not recover mutual information already lost in Textify(X).
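A toy calculation illustrates the inequality. In the sketch below, the world model (a word plus a tone, with the label set only by a confident "yes") and the textify function are hypothetical stand-ins invented for this example; only the mutual-information formula is standard.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(A; B) in bits from a dict {(a, b): probability}."""
    p_a, p_b = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        p_a[a] += p
        p_b[b] += p
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in joint.items() if p > 0)

# Hypothetical toy world: X = (word, tone), label Y = 1 only for a confident "yes".
# All four (word, tone) combinations are equally likely.
p_xy = {}
for word in ("yes", "no"):
    for tone in ("confident", "hesitant"):
        y = 1 if (word == "yes" and tone == "confident") else 0
        p_xy[((word, tone), y)] = 0.25

def textify(x):
    """Stand-in for Textify(): keep the word, drop the prosody."""
    word, _tone = x
    return word

# Push the joint distribution through textify to obtain P(T, Y).
p_ty = defaultdict(float)
for (x, y), p in p_xy.items():
    p_ty[(textify(x), y)] += p

i_xy = mutual_information(p_xy)
i_ty = mutual_information(p_ty)
print(f"I(Y;X) = {i_xy:.3f} bits, I(Y;T) = {i_ty:.3f} bits")
```

In this toy world I(Y;X) is about 0.811 bits and I(Y;T) is about 0.311 bits. The missing 0.5 bits sit in the tone, and no model that sees only T can recover them, whatever its size.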

3. Your Positional-Encoding Intuition — Formalized

Original Transformers encode sequence index only:

1D sinusoidal positional encoding

PE(pos, 2i)   = sin( pos / 10000^(2i/d) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )

h_pos = x_pos + PE(pos)

That tells the model “token A came before token B.” It does not tell the model:

that frame f occurred 342 ms after utterance u (temporal alignment),
that OCR region r sits at normalized box (0.12, 0.25, 0.30, 0.50) (spatial structure),
that stream s is audio vs. vision vs. DOM (modality channel).
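The first gap in the list above can be shown in a few lines. This sketch implements the 1D sinusoidal encoding and applies it to two hypothetical sessions that share a token order but differ in timing; the session data and the function name sinusoidal_pe are assumptions made for illustration, and d_model is assumed to be even.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    """PE(pos, 2i) = sin(pos / base^(2i/d)), PE(pos, 2i+1) = cos of the same angle."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

# Two hypothetical sessions: same token order, different timestamps in ms.
session_a = [("utterance", 0), ("frame", 20)]
session_b = [("utterance", 0), ("frame", 342)]

# 1D PE is a function of the sequence index only; the timestamps are never read.
pe_a = [sinusoidal_pe(i, 8) for i, _ in enumerate(session_a)]
pe_b = [sinusoidal_pe(i, 8) for i, _ in enumerate(session_b)]
```

Here pe_a and pe_b are identical: a frame 20 ms after the utterance and a frame 342 ms after it receive the same positional signal.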

Yes — you are making sense. The fix is not “bigger PE table.” It is multi-axis positional structure on each token:

Proposed enriched token representation (core claim)

h_i = e_i                           — content embedding (patch, phoneme, DOM node, …)
    + PE_seq(i)                     — order in fused context window
    + PE_time(t_i, Δt_i, phase_i)   — session clock, turn offset, phase (KYC, inspect, …)
    + PE_space(x_i, y_i, w_i, h_i)  — 2D/3D coords when token is spatial
    + PE_mod(m_i)                   — modality id / encoder family

Attention still standard; only X → H construction changes.
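One way the sum above might be assembled is sketched below. Everything here is an assumption for illustration: the Token fields, the use of sinusoidal features for every axis, the scale factors, and the fixed modality code (a trained model would more plausibly learn these embeddings). PE_time is also simplified to the timestamp alone, omitting the turn offset and phase arguments.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

D_MODEL = 8
MODALITIES = ("audio", "vision", "ui", "text")

def sinusoid(value, d_model=D_MODEL, base=10000.0):
    """Sinusoidal features of one scalar coordinate (index, ms, or scaled x)."""
    out = []
    for i in range(d_model // 2):
        angle = value / base ** (2 * i / d_model)
        out.extend([math.sin(angle), math.cos(angle)])
    return out

def add(*vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]

@dataclass
class Token:
    content: list      # e_i, produced by some encoder (not shown)
    seq_index: int     # order in the fused context window
    t_ms: float        # timestamp on the session clock
    modality: str
    box: Optional[Tuple[float, float, float, float]] = None  # (x, y, w, h), normalized

def pe_mod(modality):
    """Hypothetical fixed modality code; a real model would learn this embedding."""
    return sinusoid(1000.0 * (MODALITIES.index(modality) + 1))

def pe_space(box):
    """Sum of per-coordinate features; the factor 100 is an arbitrary scale choice."""
    return add(*(sinusoid(100.0 * c) for c in box))

def enrich(tok):
    """h_i = e_i + PE_seq + PE_time + PE_mod (+ PE_space when the token is spatial)."""
    parts = [tok.content, sinusoid(tok.seq_index), sinusoid(tok.t_ms), pe_mod(tok.modality)]
    if tok.box is not None:
        parts.append(pe_space(tok.box))
    return add(*parts)

e = [0.0] * D_MODEL
early = enrich(Token(e, 1, 20.0, "vision", (0.12, 0.25, 0.30, 0.50)))
late = enrich(Token(e, 1, 342.0, "vision", (0.12, 0.25, 0.30, 0.50)))
```

Unlike the index-only encoding, early and late now differ even though content and sequence index are the same, so downstream attention has something to condition on.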

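Vision Transformers already attach a position embedding to every image patch (Dosovitskiy et al. use learned 1D embeddings over the flattened patch grid), and DETR adds 2D spatial positional encodings alongside its learned object queries. We extend the same principle to live multimodal agents: one shared session clock T (not to be confused with the text string T = Textify(X) of Section 2) across audio frames, UI highlights, and upload events, so that cross-modal attention can align evidence causally (“user said yes after box appeared”).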

4. Why This Improves Accuracy Without More Parameters

Learning theory view: parameters control capacity; positional structure controls inductive bias. Self-attention without positional information is permutation-equivariant, so without PE_space the model cannot tell one arrangement of patches from another, and with 1D order alone it must learn the 2D neighborhood structure of patch tokens from data: a larger hypothesis class to search.

Locality bias from spatial PE (informal)

With 2D PE, attention can implement local operators:

  α_ij ∝ exp( q_i · k_j / √d ) · 𝟙[ ||pos(i) - pos(j)|| < ρ ]

→ read MRZ strip, inspect crack neighborhood, match face crop to ID photo
  without attending uniformly over all O(n²) token pairs.
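The masked weights above can be computed directly, as in this sketch. Here the indicator is applied as an explicit hard mask for illustration; the claim in the text is that spatial PE lets attention learn behavior of this kind, which a trained model would approximate softly. The function name, the grid, and the radius are assumptions for the example.

```python
import math

def local_attention_weights(q, keys, positions, i, radius):
    """Weights alpha_ij for query token i, restricted to tokens within `radius`.

    q: query vector of token i; keys: one key vector per token;
    positions: one (x, y) coordinate per token.
    """
    d = len(q)
    logits = []
    for k, pos in zip(keys, positions):
        if math.dist(positions[i], pos) < radius:
            logits.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        else:
            logits.append(None)  # indicator is 0: token lies outside the neighborhood
    peak = max(l for l in logits if l is not None)  # subtract the max for stability
    unnorm = [0.0 if l is None else math.exp(l - peak) for l in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# 4x4 grid of patch tokens with identical keys (illustrative values).
positions = [(x, y) for y in range(4) for x in range(4)]
keys = [[1.0, 0.0] for _ in positions]
alpha = local_attention_weights([1.0, 0.0], keys, positions, i=0, radius=1.5)
```

With the query at the corner patch (0, 0) and radius 1.5, only the four patches of the adjacent 2x2 block receive nonzero weight, 0.25 each since the keys are identical.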

Sample complexity drops when the right bias is baked into H: fewer examples are needed to learn the same visual or temporal operator. That shows up as higher accuracy at the same N, the opposite of the industry default (“buy more GPUs, train a bigger LLM”).

What is proven vs. conjecture

Proven / established: data-processing inequality; patch position embeddings in ViT and 2D spatial encodings in DETR; multimodal fusion beats unimodal on vision-language tasks when labels depend on both.
Strong engineering evidence: live inspection/KYC/returns accuracy gains when adding aligned vision + time tags at fixed backbone size (Vanira production).
Open research: tight lower bounds on how much I(Y;X) is lost per modality collapse; optimal PE_time for streaming agents.

5. Real Modalities: Keep Channels, Don’t Textify Early

Let encoders E_m project each modality into d_model without forcing a single text bottleneck:

Multimodal fusion before shared transformer layers

H_audio  = E_audio(X_audio)   + PE_time + PE_mod(audio)
H_vision = E_vision(X_video)  + PE_time + PE_space + PE_mod(vision)
H_ui     = E_ui(X_ui)         + PE_space + PE_mod(ui)
H_text   = E_text(X_text)     + PE_seq

H = Fuse(H_audio, H_vision, H_ui, H_text)   — concat + align on shared clock T
Y_hat = TransformerBlocks(H)
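One plausible reading of "concat + align on shared clock" is a timestamp-ordered merge of the per-modality streams, sketched below. The Event type, the string payloads standing in for encoder outputs, and the timestamps are all hypothetical; the only library call is heapq.merge from the Python standard library, which assumes each input stream is already sorted.

```python
import heapq
from typing import NamedTuple

class Event(NamedTuple):
    t_ms: int       # timestamp on the shared session clock
    modality: str
    payload: str    # placeholder for the encoder output of that modality

def fuse(*streams):
    """Merge per-modality streams (each already time-sorted) into one sequence."""
    return list(heapq.merge(*streams, key=lambda e: e.t_ms))

audio = [Event(100, "audio", "phoneme:y"), Event(480, "audio", "phoneme:es")]
vision = [Event(120, "vision", "patch_17"), Event(500, "vision", "patch_18")]
ui = [Event(138, "ui", "box_shown")]

fused = fuse(audio, vision, ui)
order = [e.payload for e in fused]
```

In the fused sequence, box_shown (138 ms) lands before the second audio token (480 ms), so the ordering a causal reading needs is preserved without passing through text.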

Early Textify() acts like a lossy channel of unknown capacity placed between X and the model. Real modalities preserve high-frequency evidence (prosody, glare on ID card, hairline crack) that no amount of post-hoc scaling on T recovers.

6. Testable Claim for the Community

We invite reproducible comparison under fixed parameter budget N and identical downstream heads:

Hypothesis (modality-over-scale)

For perception-grounded task Y (damage, KYC, tutoring, field ops):

  Err(N_large, Textify(X))  >  Err(N_small, H(X))     often holds

where H(X) uses aligned multimodal tokens + PE_time + PE_space,
N_large ≥ N_small, and Err is live task error rate (not perplexity).

Input construction	Params	Expected Err(Y)
ASR transcript + image caption string	N_large	High (perception floor)
+ raw audio / vision tokens, 1D PE only	N_small	Medium
+ PE_time + PE_space + shared clock T	N_small	Lowest

7. Vanira Implementation Sketch

WebRTC session clock T — audio, camera, uploads, UI events timestamped on one timeline.
Structured injection — OCR boxes, detections, and tool results enter as typed tokens with PE_space + PE_time, not prose.
Reason-native routing — kyc_photo, live_detection, etc. preserve modality before fusion.
Edge encoders — small E_m at client; shared transformer reasons on H, not on lossy Textify(X).

8. Conclusion

You are not wrong about positional encoding. The community spent years on RoPE, ALiBi, and longer contexts — all variations on where in the sequence a token sits. The next step is where in time, space, and modality space it sits, relative to everything else in the live session.

Parameters store reusable operators. Positional structure tells those operators which evidence to bind. For agents that hear, see, and act in the physical world, bind evidence first — then scale weights only if the task still demands it.

Discuss this paper

We welcome reproducible benchmarks and formal bounds on multimodal PE. Reach us at hello@vanira.io with subject “Modality Over Scale”.
