Executive Summary
Chinchilla scaling laws describe how loss falls with parameter count and training tokens when the training distribution is fixed. Live agents operate on a different distribution: synchronized audio, video, documents, and UI state aligned in time and space. Your intuition about positional encoding is directionally correct: standard 1D sequence position is insufficient; you need multi-axis position (time, space, modality) attached to the input tokens. That is an inductive bias and an information gain, not a parameter gain.
Abstract
We distinguish capacity scaling (more weights) from evidence scaling (richer, structured inputs). For perception-grounded tasks, evidence scaling dominates: aligned multimodal tokens with explicit temporal/spatial positional encodings lower the Bayes error of the input itself, so a smaller transformer fed those tokens can reach lower task error than a larger text-only model fed ASR transcripts and image captions.
This paper states the claim formally, connects it to positional encoding theory and the data-processing inequality, and outlines what is proven in the literature vs. what remains an engineering conjecture we test in production.
1. Setup: What the Transformer Actually Sees
Standard self-attention over tokens X = (x₁,…,xₙ):
Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V
Q = XW_Q , K = XW_K , V = XW_V
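As a minimal sketch of the formula above, the following pure-Python code computes single-head scaled dot-product attention over a token matrix X. The helper names (`matmul`, `softmax`, `attention`) and the toy input with identity projections are assumptions for illustration only; a real model would use learned W_Q, W_K, W_V and a tensor library.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention over token matrix X (n x d)."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]  # transpose of K
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V), weights

# Toy input: 3 tokens, d = 2, identity projections (illustrative values only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention(X, I2, I2, I2)
```

Everything the attention layer can use is already in X: the function has no access to the raw evidence from which X was built.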
Intelligence in this framework is not “in the parameters alone.” It is the map from raw evidence → token matrix X → output. If X is built by collapsing camera pixels into a paragraph, you have already destroyed information before attention runs.
2. Why Parameter Scaling Is Not the Same as Getting Smarter
Empirical scaling laws (Kaplan et al.; Hoffmann et al., Chinchilla) approximate loss as a function of parameter count N and training tokens D:
L(N, D) ≈ A · N^(-α) + B · D^(-β) + L_∞
Important caveat: this holds when (i) the task is stationary, (ii) the input representation is fixed, and (iii) evaluation matches training (e.g. next-token prediction on text). It does not say that increasing N creates new sensory channels. A text-only model with 10× parameters still cannot attend to pixel (i,j) unless that structure is present in X.
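To make the shape of this curve concrete, here is a small sketch that evaluates the parametric form at a fixed token budget. The constants are illustrative placeholders chosen for this example, not fitted values from any paper, and `scaling_loss` is a hypothetical helper name.

```python
def scaling_loss(N, D, A=400.0, B=400.0, alpha=0.34, beta=0.28, L_inf=1.7):
    """Parametric loss L(N, D) = A*N^-alpha + B*D^-beta + L_inf.
    All default constants are illustrative placeholders, not fitted values."""
    return A * N ** (-alpha) + B * D ** (-beta) + L_inf

D = 1e11                          # training tokens, held fixed
sizes = [1e8, 1e9, 1e10, 1e11]    # parameter counts N
losses = [scaling_loss(N, D) for N in sizes]
floor = 400.0 * D ** (-0.28) + 1.7  # what remains as N grows without bound

for N, loss in zip(sizes, losses):
    print(f"N={N:.0e}  L={loss:.3f}  gap to floor={loss - floor:.3f}")
```

Under these assumptions the loss decreases with N but never crosses the floor. The floor is defined relative to a fixed input representation, which is why adding parameters alone cannot stand in for a missing sensory channel.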
For task label Y (e.g. “approve refund”, “MRZ valid”, “crack detected”):
Let X = (X_audio, X_video, X_ui, …), the raw aligned streams
Let T = Textify(X), the ASR + caption + OCR string
Data-processing inequality: I(Y ; X) ≥ I(Y ; T)
Strict inequality when Y depends on prosody, micro-texture, layout, or timing that Textify() discards or corrupts.
More parameters let you fit a better approximation of P(next token | T). They do not recover mutual information already lost in Textify(X).
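The inequality can be checked on a toy example. The sketch below assumes a hypothetical world in which X is a (word, prosody) pair, the label Y depends only on prosody, and Textify keeps only the word. All names and the uniform distribution are assumptions made for illustration.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(A;B) in bits from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Hypothetical toy world: X = (word, prosody), all four combinations equally likely.
xs = [(w, pr) for w in ("yes", "no") for pr in ("firm", "hesitant")]

def label(x):
    return int(x[1] == "hesitant")  # Y depends only on prosody

def textify(x):
    return x[0]  # a transcript keeps the word and drops the prosody

joint_xy, joint_ty = defaultdict(float), defaultdict(float)
for x in xs:
    joint_xy[(x, label(x))] += 0.25
    joint_ty[(textify(x), label(x))] += 0.25

i_xy = mutual_information(joint_xy)
i_ty = mutual_information(joint_ty)
print(f"I(Y;X) = {i_xy:.2f} bits, I(Y;T) = {i_ty:.2f} bits")
```

In this constructed case the raw stream carries one full bit about Y and the transcript carries none, so no model reading T alone can do better than chance on Y. Real tasks sit somewhere between these extremes.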
3. Your Positional-Encoding Intuition — Formalized
Original Transformers encode sequence index only:
PE(pos, 2i) = sin( pos / 10000^(2i/d) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )
h_pos = x_pos + PE(pos)
That tells the model “token A came before token B.” It does not tell the model:
- that frame f occurred 342 ms after utterance u (temporal alignment),
- that OCR region r sits at normalized box (0.12, 0.25, 0.30, 0.50) (spatial structure),
- that stream s is audio vs. vision vs. DOM (modality channel).
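For reference, the sequence-only encoding can be sketched in a few lines. The function names are hypothetical and the code assumes an even embedding width d.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Sinusoidal encoding of a scalar position: sin on even dims, cos on odd dims."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def add_position(tokens):
    """h_pos = x_pos + PE(pos) for a list of d-dimensional token embeddings."""
    d_model = len(tokens[0])
    return [[x + p for x, p in zip(tok, sinusoidal_pe(pos, d_model))]
            for pos, tok in enumerate(tokens)]

pe0 = sinusoidal_pe(0, 8)  # position 0 gives alternating sin(0), cos(0)
pe5 = sinusoidal_pe(5, 8)
h = add_position([[0.0] * 8, [0.0] * 8])
```

The only input is the integer index `pos`. Two tokens that are adjacent in the sequence receive adjacent encodings whether they were captured 10 ms or 10 s apart, and whether they came from a microphone or a camera.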
Yes, you are making sense. The fix is not a bigger PE table. It is multi-axis positional structure on each token:
h_i = e_i — content embedding (patch, phoneme, DOM node, …)
+ PE_seq(i) — order in fused context window
+ PE_time(t_i, Δt_i, phase_i) — session clock, turn offset, phase (KYC, inspect, …)
+ PE_space(x_i, y_i, w_i, h_i) — 2D/3D coords when token is spatial
+ PE_mod(m_i) — modality id / encoder family
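Attention itself stays standard; only the X → H construction changes.
Vision Transformers already attach position embeddings to image patches (Dosovitskiy et al. use learned embeddings over the flattened patch grid; later variants use explicit 2D encodings). DETR adds 2D spatial positional encodings to image features alongside learned object queries. We extend the same principle to live multimodal agents: one shared clock T across audio frames, UI highlights, and upload events, so that attention across modalities can align evidence causally (“user said yes after box appeared”).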
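A minimal sketch of the multi-axis construction follows. It assumes sinusoidal features for every continuous axis and a fixed lookup table standing in for a learned modality embedding; the width `D_MODEL`, the scaling of box coordinates, and all function names are hypothetical choices. The turn offset Δt_i and phase_i are omitted for brevity.

```python
import math

D_MODEL = 16  # hypothetical embedding width, must be divisible by 8

def sinusoid(value, d, base=10000.0):
    """Sinusoidal features of one scalar: d values as sin/cos pairs."""
    out = []
    for i in range(d // 2):
        angle = value / base ** (2 * i / d)
        out.extend([math.sin(angle), math.cos(angle)])
    return out

def pe_time(t_ms, d=D_MODEL):
    """Encode the session-clock timestamp in milliseconds."""
    return sinusoid(t_ms, d)

def pe_space(box, d=D_MODEL):
    """Encode a normalized box (x, y, w, h); each coordinate gets d/4 dims."""
    if box is None:  # non-spatial token, e.g. an audio frame
        return [0.0] * d
    feats = []
    for coord in box:
        feats.extend(sinusoid(coord * 1000.0, d // 4))
    return feats

# Stand-in for a learned modality embedding table (values are arbitrary).
MODALITY_TABLE = {
    "audio": [0.1] * D_MODEL,
    "vision": [-0.1] * D_MODEL,
    "ui": [0.05] * D_MODEL,
}

def build_h(content, seq_index, t_ms, box, modality):
    """h_i = e_i + PE_seq(i) + PE_time(t_i) + PE_space(box_i) + PE_mod(m_i)."""
    parts = [content, sinusoid(seq_index, D_MODEL), pe_time(t_ms),
             pe_space(box), MODALITY_TABLE[modality]]
    return [sum(vals) for vals in zip(*parts)]

e = [0.0] * D_MODEL  # placeholder content embedding
h_ocr = build_h(e, seq_index=3, t_ms=1200.0,
                box=(0.12, 0.25, 0.30, 0.50), modality="vision")
h_audio = build_h(e, seq_index=4, t_ms=1542.0, box=None, modality="audio")
```

Summation is used here because it matches the h_i definition above; concatenation followed by a projection would be an equally plausible design and is not ruled out by anything in this paper.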
4. Why This Improves Accuracy Without More Parameters
Learning theory view: parameters control capacity; positional structure controls inductive bias. Without PE_space, the model must learn from data alone that patch tokens have 2D neighborhood structure. That is a larger hypothesis space to search, because a 1D sequence index does not say which patches are vertical neighbors.
With 2D PE, attention can approximate local operators:
α_ij ∝ exp( q_i · k_j / √d ) · 𝟙[ ||pos(i) - pos(j)|| < ρ ]
This lets a head read an MRZ strip, inspect a crack neighborhood, or match a face crop to an ID photo without attending uniformly over all O(n²) token pairs.
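The local operator can be written out directly. The sketch below applies the indicator as a hard mask, which is an idealization: a standard transformer has no such mask and would have to learn a soft version of it from the positional encodings. The function name and toy positions are assumptions.

```python
import math

def local_attention_weights(q, keys, positions, i, rho):
    """Softmax over q.k_j / sqrt(d), restricted to tokens whose position lies
    within Euclidean distance rho of token i; other tokens get weight 0."""
    d = len(q)
    scores = []
    for k, pos in zip(keys, positions):
        if math.dist(positions[i], pos) < rho:
            scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        else:
            scores.append(-math.inf)  # masked out
    m = max(scores)  # finite, because token i is always within its own radius
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Four patch tokens with identical keys; the last one is far away in 2D.
positions = [(0, 0), (0, 1), (1, 0), (5, 5)]
keys = [[1.0, 0.0]] * 4
weights = local_attention_weights([1.0, 0.0], keys, positions, i=0, rho=1.5)
```

With identical keys, content alone cannot separate the four tokens. Position does: the three nearby patches share the weight equally and the distant one receives none.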
Sample complexity drops when the right bias is baked into H: fewer examples are needed to learn the same visual or temporal operator. That shows up as higher accuracy at the same N, the opposite of the industry default (“buy more GPUs, train a bigger LLM”).
What is proven vs. conjecture
- Proven / established: data-processing inequality; position embeddings over image patches (ViT) and 2D spatial encodings (DETR); multimodal fusion beats unimodal on vision-language tasks when labels depend on both.
- Strong engineering evidence: live inspection/KYC/returns accuracy gains when adding aligned vision + time tags at fixed backbone size (Vanira production).
- Open research: tight lower bounds on how much I(Y;X) is lost per modality collapse; optimal PE_time for streaming agents.
5. Real Modalities: Keep Channels, Don’t Textify Early
Let encoders E_m project each modality into d_model without forcing a single text bottleneck:
H_audio = E_audio(X_audio) + PE_time + PE_mod(audio)
H_vision = E_vision(X_video) + PE_time + PE_space + PE_mod(vision)
H_ui = E_ui(X_dom) + PE_space + PE_mod(ui)
H_text = E_text(X_text) + PE_seq
H = Fuse(H_audio, H_vision, H_ui, H_text) (concat + align on shared clock T)
Y_hat = TransformerBlocks(H)
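One way to read Fuse is as a merge of per-modality token streams ordered by the shared clock. The sketch below assumes that reading; the `Token` fields, the `fuse` function, and the one-dimensional placeholder embeddings are hypothetical and do not describe the production implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Token:
    modality: str            # "audio", "vision", "ui", or "text"
    t_ms: float              # timestamp on the shared session clock T
    embedding: List[float]   # output of the modality encoder E_m
    box: Optional[Tuple[float, float, float, float]] = None  # normalized (x, y, w, h)

def fuse(*streams: List[Token]) -> List[Token]:
    """Concatenate per-modality streams and order them by the shared clock.
    sorted() is stable, so tokens with equal timestamps keep stream order."""
    merged = [tok for stream in streams for tok in stream]
    return sorted(merged, key=lambda tok: tok.t_ms)

# Placeholder embeddings; real ones would come from E_audio, E_vision, E_ui.
audio = [Token("audio", 1000.0, [0.2]), Token("audio", 1542.0, [0.7])]
vision = [Token("vision", 1200.0, [0.5], box=(0.12, 0.25, 0.30, 0.50))]
ui = [Token("ui", 1200.0, [0.9], box=(0.10, 0.20, 0.35, 0.55))]

H = fuse(audio, vision, ui)
order = [(tok.modality, tok.t_ms) for tok in H]
```

Each token keeps its modality tag, timestamp, and box through the merge, so the positional terms can still be added per token after fusion. In this example the box appears at 1200 ms and the second audio token arrives at 1542 ms, preserving the "said yes after box appeared" ordering.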
Early Textify() is equivalent to passing X through a noisy channel with unknown capacity. Real modalities preserve high-frequency evidence (prosody, glare on ID card, hairline crack) that no amount of post-hoc scaling on T recovers.
6. Testable Claim for the Community
We invite reproducible comparison under stated parameter budgets and identical downstream heads:
For a perception-grounded task Y (damage, KYC, tutoring, field ops), we conjecture that
Err(N_large, Textify(X)) > Err(N_small, H(X))
often holds, where H(X) uses aligned multimodal tokens + PE_time + PE_space, N_large ≥ N_small, and Err is the live task error rate (not perplexity).
| Input construction | Params | Expected Err(Y) |
|---|---|---|
| ASR transcript + image caption string | N_large | High (perception floor) |
| + raw audio / vision tokens, 1D PE only | N_small | Medium |
| + PE_time + PE_space + shared clock T | N_small | Lowest |
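A comparison of this kind could be scored with something as simple as the sketch below. It assumes both arms are evaluated on the same labeled examples; the function names are hypothetical, and the prediction lists are made-up values that exist only to exercise the code, not measured results.

```python
def error_rate(preds, labels):
    """Fraction of examples where the prediction differs from the label."""
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)

def discordant_counts(preds_a, preds_b, labels):
    """Count examples where exactly one arm is correct. These paired counts
    are what a McNemar-style significance test would take as input."""
    a_only = sum(a == y and b != y for a, b, y in zip(preds_a, preds_b, labels))
    b_only = sum(b == y and a != y for a, b, y in zip(preds_a, preds_b, labels))
    return a_only, b_only

# Made-up toy predictions, for illustration only.
labels      = [1, 0, 1, 1, 0, 1, 0, 0]
preds_text  = [1, 0, 0, 1, 1, 1, 0, 1]   # arm A: N_large on Textify(X)
preds_multi = [1, 0, 1, 1, 0, 1, 0, 1]   # arm B: N_small on H(X)

err_text = error_rate(preds_text, labels)
err_multi = error_rate(preds_multi, labels)
text_only_right, multi_only_right = discordant_counts(preds_text, preds_multi, labels)
```

Reporting the paired discordant counts alongside the two error rates makes it easier for others to judge whether a gap is larger than noise on a small live test set.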
7. Vanira Implementation Sketch
- WebRTC session clock T — audio, camera, uploads, UI events timestamped on one timeline.
- Structured injection — OCR boxes, detections, and tool results enter as typed tokens with PE_space + PE_time, not prose.
- Reason-native routing — kyc_photo, live_detection, etc. preserve modality before fusion.
- Edge encoders — small E_m at client; shared transformer reasons on H, not on lossy Textify(X).
8. Conclusion
You are not wrong about positional encoding. The community spent years on RoPE, ALiBi, and longer contexts — all variations on where in the sequence a token sits. The next step is where in time, space, and modality space it sits, relative to everything else in the live session.
Parameters store reusable operators. Positional structure tells those operators which evidence to bind. For agents that hear, see, and act in the physical world, bind evidence first — then scale weights only if the task still demands it.
Discuss this paper
We welcome reproducible benchmarks and formal bounds on multimodal PE. Reach us at hello@vanira.io with subject “Modality Over Scale”.