Day 2

Geometric Terrain Statistics Composite

Document Purpose

Running catalog of geometric measurements across language and vision models. Each metric includes its formula, measurement process, and cross-model results. Designed for expansion as new models and experiments are added.


I. Models Profiled

Model Params Vocab Hidden Dim Layers Heads Architecture Training
T5-Small 60.5M 32,128 512 6+6 8 Enc-Dec (relative PE, ReLU MLP) C4 span corruption
T5-Base 222.9M 32,128 768 12+12 12 Enc-Dec (relative PE, ReLU MLP) C4 span corruption
T5-v1.1-XXL 11.4B 32,128 4096 24+24 64 Enc-Dec (relative PE, GeGLU MLP) C4 (v1.1 variant, no multi-task)
BERT-large 336.2M 30,522 1024 24 16 Encoder-only (absolute PE) BookCorpus+Wikipedia MLM
CLIP-ViT-B/16 85.5M (visual) β€” 768 12 12 Vision encoder (fused QKV) LAION-2B contrastive
DINOv2-large 302.0M β€” 1024 24 16 Vision encoder (separate Q/K/V) Self-supervised (no labels)
CLIP-ViT-bigG/14 1.84B (visual) β€” 1664 48 16 Vision encoder (fused QKV) LAION-2B contrastive
Qwen3.5-0.8B 853M 248,320 1024 β€” β€” DeltaNet + MoE + ViT Multilingual + Vision
Qwen3.5-4B ~4B 248,320 2560 β€” β€” DeltaNet + MoE + ViT Multilingual + Vision
T5Gemma2-1B-1B 2.1B 262,144 1152 27+26 GQA 4:1 Adapted enc-dec (Gemma 2, RoPE, GeGLU) Gemma 2 decoder β†’ enc-dec
T5Gemma2-4B-4B 7.5B 262,144 2560 34+34 GQA 2:1 Adapted enc-dec (Gemma 2, RoPE, GeGLU) Gemma 2 decoder β†’ enc-dec
SD 1.5 UNet 860M β€” [320,640,1280,1280] 16 attn blocks 8 Conv UNet + self/cross attn LDM diffusion (LAION)
SDXL UNet 2.6B β€” [320,640,1280] 70 attn blocks [5,10,20] Conv UNet + self/cross attn LDM diffusion (internal)
SD 1.5 VAE 83.7M β€” 4 latent ch [128,256,512,512] β€” Conv autoencoder + mid attn Reconstruction (LAION)
SDXL VAE 83.7M β€” 4 latent ch [128,256,512,512] β€” Conv autoencoder + mid attn Reconstruction (internal)
Flux.1 VAE 83.8M β€” 16 latent ch [128,256,512,512] β€” Conv autoencoder + mid attn Reconstruction (BFL)
Flux.2 VAE 84.0M β€” 32 latent ch [128,256,512,512] β€” Conv autoencoder + mid attn Reconstruction (BFL)

Notes:

  • T5-v1.1-XXL encoder is the text encoder used by Flux.1 Schnell, Flux.1 Dev, and Flux.2
  • CLIP models use fused QKV (in_proj_weight); Q/K/V split by thirds for analysis
  • T5-v1.1 uses GeGLU (wi_0 gate + wi_1 value) instead of ReLU (single wi)
  • T5Gemma2 models are Gemma 2 decoder weights adapted to encoder-decoder; include ViT vision tower
  • UNet attention: attn1 = self-attention (spatial), attn2 = cross-attention (to text encoder)
  • VAE Conv2d weights reshaped to 2D as [out_channels, in_channels * kH * kW] for analysis
  • VAE attention exists only at the bottleneck (mid_block) β€” one in encoder, one in decoder

II. Embedding Geometry Metrics

II.1 Participation Ratio (Effective Dimensionality)

Formula: PR = (Σλᡒ)² / Σ(λᡒ²), where λᡒ are eigenvalues of the embedding covariance matrix.

Process: Center embeddings (subtract mean), compute covariance C = Eα΅€E / N, eigendecompose. PR counts effective number of dimensions used. PR/dim normalizes to [0, 1].
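A minimal numpy sketch of this process (the sanity-check data at the end is illustrative random data, not from the profiled models):

```python
import numpy as np

def participation_ratio(E):
    """PR = (sum(lam))^2 / sum(lam^2) over eigenvalues of the centered covariance."""
    E = E - E.mean(axis=0, keepdims=True)                # center embeddings
    C = E.T @ E / E.shape[0]                             # covariance [d, d]
    lam = np.clip(np.linalg.eigvalsh(C), 0.0, None)      # symmetric C: real eigenvalues
    return lam.sum() ** 2 / (lam ** 2).sum()

# Sanity check: isotropic Gaussian data should use nearly all d dimensions (PR/d near 1).
rng = np.random.default_rng(0)
pr = participation_ratio(rng.standard_normal((5000, 64)))
```

Rank-deficient data collapses the ratio: a rank-1 embedding matrix gives PR of 1 regardless of ambient dimension.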

Model PR PR / dim Dims for 95% var
T5-Small (512d) 287.2 0.561 379 (74.0%)
Qwen3.5-0.8B (1024d) 547.7 0.535 893 (87.2%)
Qwen3.5-4B (2560d) 812.4 0.317 2125 (83.0%)

Finding: PR/dim β‰ˆ 0.53–0.56 for the two smaller models, suggesting a common attractor for embedding dimensionality utilization; Qwen3.5-4B's 0.317 shows the ratio is not preserved at larger widths.

II.2 Pairwise Cosine Similarity Distribution

Formula: cos(eα΅’, eβ±Ό) = (eα΅’ Β· eβ±Ό) / (β€–eα΅’β€– Β· β€–eβ±Όβ€–), sampled over 5K random tokens (12.5M pairs).

Process: Random sample 5K token embeddings, L2-normalize, compute full pairwise cosine matrix, extract upper triangle.
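A sketch of the sampling procedure, assuming the embedding matrix fits in memory (random Gaussian data shown only to illustrate the near-orthogonal baseline):

```python
import numpy as np

def sampled_pairwise_cosines(E, n_sample=5000, seed=0):
    """Sample rows, L2-normalize, return the upper triangle of the cosine matrix."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(E.shape[0], size=min(n_sample, E.shape[0]), replace=False)
    X = E[idx]
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = X @ X.T
    return S[np.triu_indices(len(X), k=1)]   # exclude the self-similarity diagonal

# Random Gaussian embeddings are near-orthogonal: mean ~ 0, std ~ 1/sqrt(dim).
cos = sampled_pairwise_cosines(np.random.default_rng(1).standard_normal((2000, 128)))
```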

Model Mean Std Median 1% 99%
T5-Small 0.057 0.060 0.053 -0.068 0.225
Qwen3.5-0.8B 0.195 0.085 0.197 -0.016 0.408
Qwen3.5-4B 0.142 0.078 0.139 -0.029 0.356

Finding: T5 is near-orthogonal (span corruption objective). Qwen has positive bias (autoregressive next-token prediction pushes shared "being a token" component).

II.3 Embedding Norm Distribution

Formula: β€–eα΅’β€–β‚‚ = √(Ξ£eα΅’β±ΌΒ²)

Model Mean Norm Std Min Max
T5-Small 520.15 69.84 243.31 1333.61
Qwen3.5-0.8B 0.627 0.062 0.347 1.057
Qwen3.5-4B 0.656 0.067 0.400 1.091

Note: T5 embeddings are unnormalized (large magnitudes). Qwen embeddings are near-unit norm.


III. Simplex Geometry Metrics

III.1 Pentachoron Volume (Cayley-Menger Determinant)

Formula: For 5 points Pβ‚€...Pβ‚„, construct the bordered distance matrix:

D = | 0  1    1    1    1    1   |
    | 1  0    d₀₁² dβ‚€β‚‚Β² d₀₃² dβ‚€β‚„Β²|
    | 1  d₁₀² 0    d₁₂² d₁₃² d₁₄²|
    | 1  dβ‚‚β‚€Β² d₂₁² 0    d₂₃² dβ‚‚β‚„Β²|
    | 1  d₃₀² d₃₁² d₃₂² 0    d₃₄²|
    | 1  dβ‚„β‚€Β² d₄₁² dβ‚„β‚‚Β² d₄₃² 0   |

Vol² = (-1)⁡ · det(D) / (2⁴ · (4!)²) = -det(D) / 9216
Vol = √(Vol²) if Vol² > 0, else invalid

Process: Sample 1000 random 5-token subsets. Compute Cayley-Menger volume for each. Report CV (coefficient of variation = std/mean).
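The bordered-determinant computation above is a few lines of numpy; this sketch returns None for degenerate (non-realizable) configurations, matching the "invalid" case:

```python
import numpy as np

def pentachoron_volume(P):
    """4-simplex volume of 5 points (rows of P) via the Cayley-Menger determinant:
    Vol^2 = -det(D) / 9216 for the bordered squared-distance matrix D."""
    d2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    D = np.ones((6, 6))
    D[0, 0] = 0.0
    D[1:, 1:] = d2
    vol2 = -np.linalg.det(D) / 9216.0
    return np.sqrt(vol2) if vol2 > 0 else None            # degenerate configs: invalid
```

As a check, the orthogonal simplex with vertices at the origin and the four standard basis vectors has volume 1/4! = 1/24.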

Model Valid/1000 CV Embed/Random Ratio
T5-Small 1000 0.233 0.855
Qwen3.5-0.8B 1000 0.208 0.984
Qwen3.5-4B 1000 0.222 0.988

Finding: CV 0.20–0.23 is a universal attractor. All models pack simplices with similar evenness regardless of architecture, scale, or training data. The "pentachoron packing constant."

III.2 Cross-Model Relational Structure

Formula: For shared tokens between two models, compute pairwise cosine matrices in each model's embedding space. Pearson correlation between flattened upper triangles measures relational preservation.

Process (Qwen 0.8B vs 4B): PCA 4B embeddings (2560β†’1024), Procrustes alignment using 10K anchor tokens, evaluate on 5K held-out tokens.

Comparison Relational Pearson Pentachoron per-simplex corr
Qwen 0.8B vs 4B (raw) 0.920 0.89

Finding: Models at different scales learn the same relational geometry (r=0.92).


IV. Semantic Structure Metrics

IV.1 Digit Manifold

Formula: For digit tokens '0'–'9', compute all 45 pairwise cosines. Measure Pearson correlation between |iβˆ’j| (numerical distance) and cosine similarity.

Model |iβˆ’j| Correlation Adjacent Mean Non-Adjacent Mean Gap
T5-Small -0.575 0.622 0.442 0.180
Qwen3.5-0.8B -0.862 0.769 0.678 0.091
Qwen3.5-4B -0.871 0.790 0.731 0.059
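The 45-pair correlation is straightforward to compute; a sketch (the arc-shaped test data is synthetic, chosen so cosine decays with numerical distance):

```python
import numpy as np

def digit_distance_correlation(digit_embeds):
    """Pearson r between |i-j| and cos(e_i, e_j) over the 45 digit pairs."""
    X = digit_embeds / np.linalg.norm(digit_embeds, axis=1, keepdims=True)
    S = X @ X.T
    dists = [abs(i - j) for i in range(10) for j in range(i + 1, 10)]
    cosines = [S[i, j] for i in range(10) for j in range(i + 1, 10)]
    return float(np.corrcoef(dists, cosines)[0, 1])
```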

IV.2 Semantic Category Clustering (T5-Small)

Formula: Mean intra-category pairwise cosine vs global mean pairwise cosine. Lift = intra βˆ’ global.

Category N tokens Intra Cosine Global Lift
numbers 9 0.497 0.057 +0.440
colors 10 0.421 0.057 +0.365
time 10 0.351 0.057 +0.294
food 10 0.248 0.057 +0.191
animals 12 0.241 0.057 +0.184
body 10 0.216 0.057 +0.159
emotions 10 0.197 0.057 +0.141
actions 9 0.183 0.057 +0.126

V. Encoder Transformation Metrics (T5-Small)

V.1 Layer-by-Layer Geometry

Process: Feed 10 diverse sentences through encoder, capture hidden states at each layer. Measure mean norm and mean pairwise cosine between token positions.

Layer Mean Norm Pairwise Cosine
0 (embed) 377.3 0.052
1 761.6 0.278
2 1092.6 0.330
3 1428.8 0.367
4 1829.1 0.382
5 2378.3 0.419
6 (post-LN) 3.3 0.211

Finding: Norms balloon through depth; the final LayerNorm crushes them to ~3. Pairwise cosine increases monotonically β€” tokens become MORE similar through depth. The encoder is a convergence funnel.

V.2 WordNet Relational Alignment

Process: Encode 9,362 WordNet definitions via "summarize: {definition}". Mean-pool encoder output. Compare pairwise cosine to WordNet path similarity.

Representation Pearson Spearman
Static embeddings 0.078 0.015
Encoder output 0.095 0.081

50-seed stability (encoder): Pearson 0.100 Β± 0.008, Spearman 0.090 Β± 0.010, CV 0.204 Β± 0.006.

V.3 Encoder Distance Bands

WN Similarity Band N pairs Static Cosine Encoder Cosine Lift
[0.50, 0.90) 23 0.244 0.728 +0.484
[0.25, 0.50) 53,112 0.077 0.573 +0.496
[0.10, 0.25) 145,035 0.060 0.565 +0.505
[0.05, 0.10) 295,680 0.061 0.553 +0.492

V.4 Hypernym Chain Decay

Depth Static Cosine Encoder Cosine
1 0.160 0.656
3 0.075 0.594
5 0.069 0.585
7 0.068 0.579

VI. Cross-Architecture Inactive Weight Topology

VI.1 Q/K/V Sparsity (<0.1 threshold)

Formula: Fraction of |wα΅’β±Ό| < 0.1 across all weights of that type.

Process: Iterate all 2D weight matrices, compute abs values, count below threshold. No inference needed.
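The per-matrix measurement is a one-liner; shown here for completeness since every table in this section derives from it:

```python
import numpy as np

def sparsity_fraction(W, threshold=0.1):
    """Fraction of entries with |w| below threshold; pure weight inspection, no inference."""
    return float((np.abs(W) < threshold).mean())
```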

Model Q K V O MLP Full Model
T5-Small (512d, 6L) 93.7% 19.2% 12.1% 10.4% 11.9% 18.4%
T5-Base (768d, 12L) 99.4% 30.0% 16.2% 13.5% 16.9% 27.9%
T5-v1.1-XXL (4096d, 24L) 100.0% 65.5% 73.1% 65.4% ~57% β€”
BERT-large (1024d, 24L) 99.1% 99.1% 99.9% 99.9% 99.4% 99.3%
DINOv2-large (1024d, 24L) 100.0% 100.0% 100.0% 100.0% 100.0% 100.0%
CLIP-ViT-B/16 (768d, 12L) β€” (fused) β€” β€” β€” 100.0% 100.0%
CLIP-ViT-bigG (1664d, 48L) β€” (fused) β€” β€” β€” ~97% 98.0%

Key Finding β€” T5 Q/K Asymmetry Scales:

Model Q (<0.1) K (<0.1) Q/K Ratio
T5-Small 93.7% 19.2% 4.9Γ—
T5-Base 99.4% 30.0% 3.3Γ—
T5-v1.1-XXL 100.0% 65.5% 1.5Γ—

T5 has a genuine Q-specific sparsity that scales with model size. Q hit 100.0% at XXL (every single weight below 0.1). This is NOT the BERT/DINOv2 pattern where all weight types are uniformly sparse. The query projection in T5 is functionally vestigial at scale.

T5-v1.1-XXL Encoder vs Decoder:

Component Encoder Decoder
self_attn_q 100.0% 100.0%
self_attn_k 71.7% 59.4%
self_attn_v 76.0% 70.1%
cross_attn_q β€” 100.0%
cross_attn_k β€” 63.1%
cross_attn_v β€” 71.1%

Q is 100% sparse everywhere β€” self-attention and cross-attention, encoder and decoder.

VI.2 SVD Effective Rank

Formula: Stable rank = β€–Wβ€–Β²_F / β€–Wβ€–Β²β‚‚ = Σσᡒ² / σ₁². Measures effective rank without thresholding.
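A direct implementation of the stable-rank formula:

```python
import numpy as np

def stable_rank(W):
    """||W||_F^2 / ||W||_2^2 = sum(sigma_i^2) / sigma_1^2; no thresholding required."""
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)
```

The identity matrix has stable rank equal to its dimension; any rank-1 matrix has stable rank 1.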

Weight Type T5-Small T5-Base T5-v1.1-XXL BERT-large DINOv2-large
self_attn_q 47.6 58.1 96.8 50.8 57.7
self_attn_k 53.2 62.4 90.0 37.7 55.5
self_attn_v 75.3 97.5 204.4 113.0 94.8
self_attn_o 25.4 35.0 16.4 125.0 85.6
mlp_up/gate 15.2 20.6 67.9 (gate) / 247.3 (up) 27.4 58.4
mlp_down 31.3 43.9 25.3 52.2 94.4

T5-v1.1-XXL O matrices have very low stable rank (16.4) β€” the output projection is extremely low-rank despite the 4096-d space. Cross-attention O is even lower at 6.1.

VI.3 QK Similarity Manifold

Formula: QK = W_Q Β· W_Kα΅€. Eigendecompose the symmetric part (QK + QKα΅€)/2. Positive eigenvalues = attraction directions. Negative eigenvalues = repulsion directions.
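A sketch of both quantities used below, the positive-eigenvalue fraction and the symmetry deviation (random Gaussian Q/K matrices sit at the ~0.50 / √2 equilibrium the tables reference):

```python
import numpy as np

def qk_positive_fraction(Wq, Wk):
    """Fraction of positive eigenvalues of the symmetrized QK interaction matrix."""
    QK = Wq @ Wk.T
    eig = np.linalg.eigvalsh((QK + QK.T) / 2.0)
    return float((eig > 0).mean())

def symmetry_deviation(Wq, Wk):
    """||QK - QK^T||_F / ||QK||_F; approximately sqrt(2) for a random QK matrix."""
    QK = Wq @ Wk.T
    return float(np.linalg.norm(QK - QK.T) / np.linalg.norm(QK))
```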

Positive Eigenvalue Fraction Trends:

Model First Layer Last Layer Trend
T5-Small encoder 0.615 0.535 βˆ’0.080 (decreasing)
T5-v1.1-XXL encoder 0.510 0.503 βˆ’0.007 (flat)
T5-v1.1-XXL decoder self 0.501 0.548 +0.047 (increasing)
T5-v1.1-XXL cross-attn 0.500 0.500 0.000 (locked)
BERT-large 0.446 0.513 +0.066 (increasing)
CLIP-ViT-B/16 0.503 0.538 +0.035 (increasing)
DINOv2-large 0.498 0.548 +0.050 (increasing)
CLIP-ViT-bigG 0.498 0.582 +0.084 (increasing)

Critical Finding β€” Cross-Attention is Perfectly Balanced:

T5-v1.1-XXL cross-attention QK manifold is exactly 0.500 positive / 0.500 negative at ALL 24 layers. Symmetry deviation is 1.414 (= √2) everywhere. This is a locked equilibrium β€” the bridge between encoder and decoder maintains perfect balance between attraction and repulsion at every depth. No other attention type shows this level of stability.

T5-v1.1-XXL encoder self-attention is flat (~0.50 throughout). Unlike T5-Small which decreased from 0.615 to 0.535, the XXL encoder stays near the equilibrium point. The larger model doesn't need to build anti-similarity boundaries because it has enough capacity to discriminate through other mechanisms.

BERT starts BELOW 0.50 (0.446). The only model with majority-repulsion from layer 0. MLM bidirectional training creates fundamentally different QK geometry from autoregressive or contrastive training.

VI.4 MLP Dead Neurons

Formula: Combined importance = β€–wα΅’_upβ€–β‚‚ Β· β€–wα΅’_downβ€–β‚‚ (ReLU) or β€–wα΅’_gateβ€–β‚‚ Β· β€–wα΅’_upβ€–β‚‚ Β· β€–wα΅’_downβ€–β‚‚ (GeGLU). Dead if < 1% of mean.
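A sketch of the ReLU-case importance score (the GeGLU case multiplies in the gate row norms as a third factor; the weight layout assumed here is the common [d_ff, d_model] up / [d_model, d_ff] down convention):

```python
import numpy as np

def dead_neuron_count(W_up, W_down, frac=0.01):
    """Per-neuron importance = ||w_up_i|| * ||w_down_i||; dead if below frac of the mean.
    W_up: [d_ff, d_model] (rows = neuron inputs), W_down: [d_model, d_ff] (cols = outputs)."""
    imp = np.linalg.norm(W_up, axis=1) * np.linalg.norm(W_down, axis=0)
    return int((imp < frac * imp.mean()).sum())
```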

Model Dead (<1% mean) Weak (<10% mean) Notes
T5-Small (enc+dec) 0/24,576 (0.00%) 0/24,576 (0.00%) All neurons alive
T5-Base (enc+dec) 0/73,728 (0.00%) 0/73,728 (0.00%) All neurons alive
T5-v1.1-XXL encoder 0/245,760 (0.00%) 0/245,760 (0.00%) All neurons alive
T5-v1.1-XXL decoder 14/245,760 (0.01%) 461/245,760 (0.19%) First dead neurons in T5 family
BERT-large 0/98,304 (0.00%) 0/98,304 (0.00%) All neurons alive
DINOv2-large 0/98,304 (0.00%) 0/98,304 (0.00%) All neurons alive
CLIP-ViT-B/16 1,316/36,864 (3.57%) 1,356/36,864 (3.68%) Only model with significant dead neurons
CLIP-ViT-bigG 0/393,216 (0.00%) 24,163/393,216 (6.14%) 0 dead but 6% weak

Finding: T5-v1.1-XXL decoder has the first dead neurons in the T5 family β€” 14 neurons in layers 1-2 only. The decoder's early GeGLU layers carved out a tiny amount of capacity. Encoder uses everything. CLIP-ViT-B/16 is the outlier with 3.6% dead neurons β€” contrastive training at small scale produces genuine pruning.

VI.5 Cross-Layer Weight Correlation

Formula: cos(flatten(Wα΅’), flatten(Wβ±Ό)) between weight matrices of the same type at different layers.

Model Q adj mean K adj mean MLP_up adj mean
T5-Small ~0.000 ~0.000 0.031–0.045
T5-Base ~0.000 ~0.000 0.024–0.036
T5-v1.1-XXL encoder 0.0001 β€” β€”
T5-v1.1-XXL decoder βˆ’0.0001 β€” β€”
BERT-large 0.0002 0.0003 0.032
CLIP-ViT-B/16 βˆ’0.0004 (QKV) β€” 0.008
DINOv2-large βˆ’0.0003 βˆ’0.0002 0.006
CLIP-ViT-bigG 0.0000 (QKV) β€” 0.055

Universal finding: Attention weights (Q, K, V) are completely uncorrelated across layers (~0.000). Every layer defines an independent similarity function. MLP weights show positive correlation decaying with distance β€” feedforward layers share structure.

VI.6 Position Bias Topology

T5 uses learned relative position biases: [32 buckets Γ— N_heads].

Model Encoder Decoder
T5-Small (8 heads) 3 local, 2 global, 3 mixed 4 local, 4 global, 0 mixed
T5-Base (12 heads) 4 local, 3 global, 5 mixed 5 local, 4 global, 3 mixed
T5-v1.1-XXL (64 heads) 24 local, 2 global, 38 mixed 27 local, 37 global, 0 mixed

T5-v1.1-XXL position findings:

  • Encoder: 38/64 mixed heads β€” nuanced position sensitivity at scale
  • Decoder: ZERO mixed heads β€” perfect binary crystallization. Every head is either pure local or pure global
  • Decoder is 58% global (37/64) β€” overwhelmingly biased toward long-range attention
  • Encoder range: [-47.2, 11.2] β€” strong local suppression
  • Decoder range: [-28.4, 17.0] β€” more balanced

Finding: The decoder local/global binary split is scale-invariant (0 mixed at T5-Small, 0 mixed at XXL). Gradient descent crystallizes decoder position heads into two pure modes regardless of capacity.


VII. Geometric Residual Modulator

VII.1 Architecture

  • Geometric embedding: [vocab_size, 64] β€” per-token geometric fingerprint
  • Projection: Linear(64, d_model, bias=False) β€” Procrustes-aligned to encoder PCA space
  • Alpha: per-layer learnable LERP coefficient, stored in logit space, applied via sigmoid
  • Intervention: residual_out = (1 βˆ’ Ξ±) Β· residual + Ξ± Β· proj(geo_embed(token_ids))
  • Params: 2.09M (3.45% of T5-Small)
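The intervention reduces to a per-layer sigmoid-gated LERP. A minimal sketch, using illustrative names (geo_embed, proj, alpha_logit) rather than the repo's actual API:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def modulated_residual(residual, token_ids, geo_embed, proj, alpha_logit):
    """residual: [T, d_model]; geo_embed: [vocab, 64]; proj: [64, d_model].
    Alpha is stored in logit space and squashed per layer."""
    alpha = sigmoid(alpha_logit)
    geo = geo_embed[token_ids] @ proj              # project geometric fingerprints
    return (1.0 - alpha) * residual + alpha * geo  # LERP toward the geometric field
```

At a very negative alpha logit the layer is a near-identity; at logit 0 it blends residual and geometric field equally.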

VII.2 Geometric Embedding Initialization

Metric Value
WN reconstruction correlation 0.921
Procrustes alignment cosine 0.372
Eigenvalue cumulative (top 64) 61.3%

VII.3 Alpha Convergence

Start Ξ± Final Mean Ξ± Final Layer-5 Ξ± Pearson Ξ” CV Coherent Basin
0.01 (20 ep) 0.067 0.107 +0.151 0.220 Yes Binding
0.20 (20 ep) 0.222 0.308 +0.085 0.452 No Ridge
0.70 (20 ep) 0.695 0.640 -0.029 0.482 No Separation
0.01 (100 ep) 0.125 0.218 +0.074 0.322 No Overfit

VII.4 Depth Gradient (Consistent Across All Runs)

Layer 20ep (Ξ±=0.01) 100ep (Ξ±=0.01) 20ep (Ξ±=0.20)
0 0.015 0.035 0.170
1 0.052 0.061 0.180
2 0.066 0.102 0.227
3 0.080 0.137 0.197
4 0.080 0.197 0.248
5 0.107 0.218 0.308

Finding: Always monotonically increasing. The model wants minimal geometric modulation early and maximum modulation at the deepest layer. Geometry is a final correction, not an initial condition.

VII.5 Best Result

Metric Original Modulated (20ep, Ξ±=0.01 start) Change
WordNet Pearson 0.099 0.250 +152%
WordNet Spearman 0.085 0.245 +189%
Semantic Gradient 0.022 0.052 +132%
Pentachoron CV 0.202 0.220 Stayed in band
Per-token Preservation β€” 0.730 β€”
Coherence Baseline Identical on 4/4 tests β€”

VIII. Geometric Field Modulator (Multi-Expert)

VIII.1 Architecture

  • Three KSimplexChannel experts: k=1 (edge, 2 features), k=2 (triangle, 4 features), k=4 (pentachoron, 11 features)
  • Multiplicative gating: residual Γ— Ξ (blended_gates) β€” valid regions pass, invalid suppressed
  • Soft blending: per expert gate = (1 βˆ’ Ξ±) + Ξ± Γ— expert_gate
  • Null space: 25% of residual dimensions untouched by modulator
  • Alpha clamped: [0.001, 0.35] β€” hard ceiling below the phase boundary
  • Gradient scaling: geometric params at 10% LR, alpha at 50% LR, gates at full LR
  • Params: 38,552 (0.064% of T5-Small)
  • Self-test: validity=0.985, null space preserved, template volumes sane
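The soft-blending rule above can be sketched in a few lines (expert_gates here is a hypothetical list of per-dimension validity gates in [0, 1], standing in for the KSimplexChannel outputs):

```python
import numpy as np

def blended_gate(expert_gates, alpha):
    """Blend each expert gate toward identity, then combine multiplicatively:
    g_i = (1 - alpha) + alpha * expert_gate_i, total = prod(g_i).
    At alpha = 0 the modulator is exactly the identity."""
    return np.prod([(1.0 - alpha) + alpha * g for g in expert_gates], axis=0)
```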

VIII.2 Design Rationale (Grounded in Cross-Architecture Data)

Data Point Design Decision
Q sparsity 100% at scale Geometric field can replace Q β€” the model barely uses it
Cross-attn QK locked at 0.500 Target equilibrium for geometric validity gating
Depth gradient always increasing Per-layer alpha respects this (low early, high late)
Zero dead MLP neurons Don't touch MLPs β€” all capacity is in use
Decoder position: binary L/G split Modulator preserves positional structure (null space)
CV 0.20–0.23 universal CV monitoring as health check, not loss

IX. The 0.29154 Constant

IX.1 Observations Across Systems

System Context Value
MinimalShunts CLIP-L ↔ CLIP-G projection gate Emergent equilibrium
Wormhole Lambda Vision transformer training Converges from 0.74 toward ~0.29
Alpha curriculum Devil's Staircase PE training Converges to ~0.50 under geometric loss, CE destroys
T5 generation Greedy decode alpha sweep Stable plateau at 0.291–0.292, semantic phase transition
Alpha training basins 0.70 start β†’ settled at 0.695 Mirror constant 1 βˆ’ 0.29154 = 0.70846, Ξ” = 0.013

IX.2 T5 Generation Phase Transition

Alpha Output (triangle prompt)
0.01–0.10 "...three edges and three vertices. it is one of the basic shapes in geometry."
0.20 "a triangle is a polygon with three edges and three vertices..."
0.28 "a polygon with three vertices. it is one of the basic shapes in a graph."
0.291 "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in a graph."
0.2915 "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in a graph."
0.292 "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in the world."
0.30 "a polygon with a vertice and a vertice. it is one of the basic shapes in the world."

Finding: 0.29154 marks the phase boundary between structural representation ("graph") and physical representation ("world"). Output is invariant to perturbation in a narrow band centered on the constant.


X. Universal Geometric Constants

Constant Value Observed In
Pentachoron CV 0.20–0.23 T5-Small, Qwen 0.8B, Qwen 4B, trained modulator
Participation / dim 0.53–0.56 T5-Small, Qwen 0.8B
Binding/separation constant 0.29154 / 0.70846 MinimalShunts, CLIP projections, T5 generation, alpha convergence
Depth gradient Monotonic increasing All modulator training runs
Q sparsity scaling (T5) 93.7% β†’ 99.4% β†’ 100.0% T5-Small β†’ T5-Base β†’ T5-v1.1-XXL
Q sparsity asymmetry T5 pretraining only Present in T5, absent in T5Gemma2, BERT, DINOv2, UNets, VAEs
Cross-modal QK balance Locked at 0.500 T5-v1.1-XXL cross-attn, T5Gemma2 (both), SD 1.5 UNet, SDXL UNet (6 models)
Self-attn QK: adapted models Locked at 0.500 T5Gemma2 1B (all 53 layers), T5Gemma2 4B (all 68 layers)
UNet QK U-gradient down→repulsion, up→attraction SD 1.5 (0.451→0.581), SDXL (0.477→0.549)
VAE decoder QK Repulsion-biased SD 1.5 (0.486), SDXL (0.416), Flux.1 (0.451), Flux.2 (0.416)
Attention cross-layer corr ~0.000 ALL 17 models, including UNets and VAEs
Conv cross-layer corr ~0.000 All UNets and VAEs (extends to pure convnets)
MLP/FF full utilization 0.00% dead T5 family (enc), BERT, DINOv2, UNets, all VAEs
Decoder position crystallization 0 mixed heads T5-Small, T5-v1.1-XXL
VAE spectral invariant Pearson 0.94–0.98 All 6 VAE pairs β€” SV distribution is architecture-determined
VAE Procrustes alignment 70–76% cosine All 6 pairs β€” same solution in different coordinate systems

XI. Measurement Toolkit Reference

Tool Input Output Requires Inference
Participation Ratio Embedding matrix Effective dimensionality No
Cayley-Menger Volume 5-point subsets of embeddings Simplex volume + CV No
Pairwise Cosine Embedding matrix (sampled) Similarity distribution No
Digit Manifold 10 digit token embeddings |iβˆ’j| vs cosine correlation No
SVD Effective Rank Any 2D weight matrix Stable rank, condition number No
QK Manifold W_Q, W_K matrices Eigenspectrum, pos/neg balance No
Dead Neuron Count MLP wi/gate/up, wo matrices Combined importance distribution No
Cross-Layer Correlation Same-type weight matrices Adjacent cosine similarity No
Position Bias Topology Relative attention bias tensor Local/global/mixed head counts No
Sparsity Topology Any weight matrix Fraction below threshold No
WordNet Relational Encoder output (mean-pooled) Pearson/Spearman vs path similarity Yes
Alpha Convergence Modulator training loop Per-layer equilibrium values Yes (training)

XII. T5Gemma2 β€” Decoder-Adapted Encoder-Decoder

Architecture: Gemma 2 decoder weights adapted to encoder-decoder. GQA (grouped query attention), RoPE, GeGLU MLPs. Multimodal (ViT in encoder).

XII.1 Sparsity

Model Q (<0.1) K (<0.1) V (<0.1) Pattern
T5Gemma2 1B-1B 100.0% 99.9% 100.0% Uniform
T5Gemma2 4B-4B 100.0% 100.0% 100.0% Uniform

Finding: No Q/K asymmetry. The T5 Q sparsity pattern is ABSENT when the encoder is initialized from decoder weights. The asymmetry is a property of T5's span corruption pretraining, not the encoder-decoder architecture.

XII.2 QK Manifold

Model Encoder Self Decoder Self All Layers
T5Gemma2 1B 0.500 (Β±0.001) 0.500 (Β±0.001) Locked
T5Gemma2 4B 0.500 exact 0.500 exact Locked

Finding: Perfect 0.500 lock across ALL layers in BOTH encoder and decoder. Symmetry deviation √2 everywhere. The Gemma 2 initialization left the QK matrices near random-matrix equilibrium. The adaptation to encoder-decoder didn't perturb them enough to break Wigner semicircle symmetry.

XII.3 Other Invariants

  • Dead neurons: 0/359,424 (1B), 0/696,320 (4B) β€” all alive
  • Cross-layer Q correlation: ~0.000 β€” confirmed universal
  • MLP utilization: 100% (1 weak neuron each in enc L6 and dec L6 at 4B scale)
  • GQA: 4:1 at 1B scale, 2:1 at 4B scale

XIII. Diffusion UNet Weight Topology

XIII.1 UNet Sparsity

Model Self Q Self K Self V Cross Q Cross K Cross V
SD 1.5 UNet 90.5% 90.9% 97.1% 96.8% 94.9% 98.9%
SDXL UNet 99.9% 99.9% 100.0% 100.0% 100.0% 100.0%

SD 1.5 is the least sparse model in the entire battery. 90.5% for self-attention Q β€” below T5-Small's 93.7%. A parameter-starved model (860M for 512Γ—512 image generation) uses denser weights. SDXL at 3Γ— the params reaches near-100%.

Sparsity traces the U-path (SD 1.5): down=88.9%, mid=99.3%, up=89.4%. The bottleneck has the most diffuse weights; the periphery has the densest.

XIII.2 UNet QK Manifold β€” The U-Shape

Self-attention positive eigenvalue fraction through the UNet path:

Position SD 1.5 SDXL
down (early) 0.509 ~0.49
down (deep) 0.451 0.483
mid (bottleneck) 0.483 0.477
up (early) 0.501 0.501
up (late) 0.581 0.549

The QK manifold traces the U-shape: repulsion-dominated downpath (compressing, discriminating), maximum repulsion at bottleneck, rising to attraction-dominated uppath (reconstructing, grouping). SD 1.5 shows the wider swing (0.451β†’0.581 = 0.130 range) because it's more parameter-starved.

Cross-attention: locked at 0.500 in both UNets. SD 1.5: mean=0.501, std=0.001. SDXL: mean=0.500, std=0.001. The fifth and sixth confirmations of the cross-modal QK lock.

XIII.3 Other UNet Invariants

  • Dead neurons: 0/23,040 (SD 1.5), 0/163,840 (SDXL)
  • Cross-block Q correlation: ~0.000 (both self-attn and cross-attn)
  • SDXL cross-attn Q stable rank: 13.97 (lowest of any weight type) β€” extremely concentrated queries to text
  • SDXL cross-attn V: highest stable rank (165.9) and lowest condition number (15.8) β€” richest value matrices

XIV. VAE Weight Topology

XIV.1 Cross-VAE Comparison

VAE Params Latent Ch Enc (<0.1) Dec (<0.1) Enc QK pos Dec QK pos
SD 1.5 83.7M 4 98.6% 99.1% 0.496 0.486
SDXL 83.7M 4 29.0% 38.1% 0.502 0.416
Flux.1 83.8M 16 96.5% 97.5% 0.498 0.451
Flux.2 84.0M 32 94.3% 94.3% 0.393 0.416

SDXL VAE is the densest model measured. 29% encoder sparsity at 0.1 threshold. Identical architecture and param count to SD 1.5, but weights are 3Γ— denser. Attention condition numbers reach 1.16M.

XIV.2 VAE Decoder QK Breaks Toward Repulsion

VAE Latent Ch Decoder QK pos Interpretation
SD 1.5 4 0.486 Slight repulsion
SDXL 4 (1024Β² target) 0.416 Strong repulsion β€” 4Γ— reconstruction challenge
Flux.1 16 0.451 Moderate repulsion
Flux.2 32 0.416 Strong repulsion β€” most channels to separate

Decoder bottleneck attention breaks symmetry toward repulsion. Reconstruction requires spatial discrimination β€” more negative eigenvalues = finer spatial separation. More latent channels or higher target resolution β†’ stronger repulsion.

Flux.1 decoder anomaly: Top eigenvalue = 60,807 (typical is 2–150). One attention direction completely dominates. Rank-1 approximation of the attention space.

XIV.3 VAE Invariants

  • Zero dead neurons across all four VAEs
  • Conv filter utilization: 100% (active fraction 1.000)
  • Cross-layer conv correlation: ~0.000 β€” universal, extends to pure convnets
  • Spectral correlation between VAEs: 0.94–0.98 β€” architecture determines SV distribution

XV. Procrustes Analysis β€” VAE Weight-Space Alignment

XV.1 Methodology

Orthogonal Procrustes: For each common weight matrix (same name, same shape), find orthogonal R minimizing β€–A βˆ’ BRβ€–_F via SVD of B^TA. Report residual (0 = identical up to rotation, √2 = orthogonal) and cosine after alignment.

Spectral correlation: Pearson correlation of normalized singular value distributions.
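The orthogonal Procrustes step described above, as a self-contained sketch:

```python
import numpy as np

def procrustes_align(A, B):
    """Orthogonal R minimizing ||A - B R||_F via SVD of B^T A.
    Returns R and the cosine between A and the rotated B."""
    U, _, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt
    BR = B @ R
    cos = float((A * BR).sum() / (np.linalg.norm(A) * np.linalg.norm(BR)))
    return R, cos
```

If A is literally B expressed in a rotated basis, the recovered cosine is ~1.0, which is the effect the "rotation gain" column measures.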

XV.2 Pairwise Results

Pair Raw Cosine Procrustes Cosine Rotation Gain Spectral Corr
SD1.5 vs SDXL 0.053 0.697 +0.644 0.958
SD1.5 vs Flux.1 0.091 0.730 +0.640 0.964
SD1.5 vs Flux.2 -0.000 0.757 +0.757 0.979
SDXL vs Flux.1 0.024 0.675 +0.650 0.939
SDXL vs Flux.2 -0.001 0.705 +0.705 0.937
Flux.1 vs Flux.2 0.000 0.736 +0.736 0.957

XV.3 Key Findings

1. Raw cosine is zero. All pairs. Weights are orthogonal in raw space. Naive comparison says these VAEs share nothing. This is wrong.

2. After Procrustes rotation, 70–76% of structure aligns. These models found the SAME geometric solution, expressed in different coordinate systems. Different initialization β†’ different basis β†’ same function.

3. Spectral correlation is 0.94–0.98. Singular value distributions are nearly identical across all pairs. The "shape" of each weight matrix β€” rank structure, energy distribution β€” is architecture-determined, not training-determined.

4. SD 1.5 vs Flux.2 is the most alignable pair. Raw cosine literally zero, but highest Procrustes cosine (0.757) and highest spectral correlation (0.979). The most different training produces the most alignable weights. Shared structure is deepest when surface differences are greatest.

5. SDXL is the geometric outlier. Lowest Procrustes cosine with every model (0.675–0.705). Found a more distant basin despite identical architecture to SD 1.5.

XV.4 Distance Matrices

Procrustes Residual (lower = more similar):

SD 1.5 SDXL Flux.1 Flux.2
SD 1.5 0.000 0.752 0.707 0.679
SDXL 0.752 0.000 0.774 0.739
Flux.1 0.707 0.774 0.000 0.699
Flux.2 0.679 0.739 0.699 0.000

Spectral Correlation (higher = more similar):

SD 1.5 SDXL Flux.1 Flux.2
SD 1.5 1.000 0.958 0.964 0.979
SDXL 0.958 1.000 0.939 0.937
Flux.1 0.964 0.939 1.000 0.957
Flux.2 0.979 0.937 0.957 1.000

XV.5 Implication for Geometric Transfer

A geometric field modulator trained on one VAE can be ROTATED to work on another via the Procrustes R matrix. 70–76% structural alignment means the modulator captures the shared geometric invariant. The remaining 24–30% is model-specific β€” the unique basin each training run found.


XVI. Scripts Reference

Script Purpose Key Outputs
probe_t5_small_terrain.py T5-Small embedding + layer geometry PR, CV, digit manifold, layer evolution
probe_t5_wordnet_summarize.py T5-Small Γ— WordNet relational alignment Pearson, Spearman, distance bands, hypernym decay
probe_t5_wordnet_50seeds.py 50-seed stability test (GPU-accelerated) Confidence intervals for all relational metrics
probe_t5_inactive_weights.py T5-Small/Base inactive weight topology SVD, sparsity, QK manifold, dead neurons
cross_architecture_weight_battery.py BERT + CLIP + DINOv2 battery Cross-model comparison table
probe_flux_t5_g4.py T5-v1.1-XXL (Flux encoder) full battery All layers, encoder + decoder + cross-attn
geometric_residual_modulator.py LERP modulator + training utilities Modulator class + measurement tools
geometric_field_modulator.py Multi-expert field modulator KSimplex experts + multiplicative gating
geometric_modulator_full_pipeline.py Self-contained T5 + WordNet + modulator End-to-end pipeline
train_modulator.py Training loop for alpha convergence Freeze T5, train modulator, track alpha
probe_t5gemma2.py T5Gemma2 battery (both scales) GQA handling, adapted enc-dec topology
probe_unet_geometry.py SD 1.5 / SDXL UNet battery U-path QK gradient, cross-attn lock
probe_vae_geometry.py All four VAE battery Conv reshape, bottleneck attention, latent comparison
procrustes_vae_analysis.py Pairwise Procrustes on 4 VAEs Distance matrices, depth profiles, rotation gain

Last updated: 2026-03-06
Models profiled: 17 (T5-Small, T5-Base, T5-v1.1-XXL, BERT-large, CLIP-ViT-B/16, DINOv2-large, CLIP-ViT-bigG, Qwen3.5-0.8B, Qwen3.5-4B, T5Gemma2-1B, T5Gemma2-4B, SD 1.5 UNet, SDXL UNet, SD 1.5 VAE, SDXL VAE, Flux.1 VAE, Flux.2 VAE)
Architecture families: 5 (Transformer enc-dec, encoder-only/vision, adapted enc-dec, conv UNet, conv autoencoder)
Training objectives: 6 (span corruption, MLM, contrastive, self-supervised, diffusion, reconstruction)
Procrustes analysis: 6 VAE pairs, 68 weight matrices each
Modulator experiments: 4 LERP configurations, 1 field modulator
