2026-05-15 · cross-LLM analysis
Published today · xai-org/x-algorithm

X just open-sourced its algorithm.
Here is what three frontier LLMs see in it.

On May 15 2026, xAI published xai-org/x-algorithm — a Rust-first rewrite of X's recommendation system with a Grok-1 transformer at the core. We ran the release past GPT-5, Gemini 2.5 Pro, and Grok 4, then layered our own analysis. Below: what they all agreed on, where each one saw something different, and what it means for anyone selling X engagement data.

Codebase · 57% Rust, 43% Python, gRPC service boundaries
Model backbone · Grok-1 transformer ported into recsys ranking
Released model · 256-d, 2-layer, 3 GB toy via Git LFS; production weights withheld
Commit history · 2 commits; fresh release or aggressive squash
What is in the box

Five components, two languages, one production stack.

The repo ships five working modules, one of which is a reusable Rust trait framework. Models, training, and serving infra are not included — and that absence is the tell.

home-mixer · orchestration layer
gRPC endpoint that blends candidates, injects ads, applies brand-safety tracking.
phoenix · ML retrieval + ranking
Grok-1 transformer adapted for recsys. Multi-action prediction heads. Attention-masked for cacheable scoring.
grox · content understanding
Spam + PTOS enforcement classifiers. Probably the most policy-relevant module.
thunder · in-network store
Kafka-ingested feature store. Schema references included; producer/consumer implementations are not.
candidate-pipeline · reusable framework
Rust trait-based stages. Composable retrieval → filter → rank → mix.
What is missing · the actual IP
Full training pipelines, real Phoenix weights, production embedding indices, Kafka serving code, raw engagement data.
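The candidate-pipeline staging is easy to picture in miniature. A minimal Python sketch of the retrieval → filter → rank → mix flow (the repo's actual abstractions are Rust traits; every name and toy stage below is illustrative, not lifted from the code):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    id: str
    score: float = 0.0

def run_pipeline(user_id, retrievers, filters, ranker, mixer):
    """Mirror the stage order: each retriever builds a pool, every
    filter prunes every pool, the ranker orders each pool, and the
    mixer blends the pools into one feed."""
    pools = [retrieve(user_id) for retrieve in retrievers]
    for f in filters:
        pools = [f(pool) for pool in pools]
    pools = [ranker(pool) for pool in pools]
    return mixer(pools)

# Toy stages: two sources, a spam filter, id-ordered ranking, flat mix.
follows   = lambda uid: [Candidate("b"), Candidate("a")]
trending  = lambda uid: [Candidate("spam"), Candidate("c")]
drop_spam = lambda pool: [c for c in pool if c.id != "spam"]
by_id     = lambda pool: sorted(pool, key=lambda c: c.id)
flatten   = lambda pools: [c for pool in pools for c in pool]

feed = run_pipeline("u1", [follows, trending], [drop_spam], by_id, flatten)
```

The appeal of trait-based stages is exactly this composability: any stage can be swapped without touching the orchestration.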
Where all three LLMs agreed

Five things GPT-5, Gemini, and Grok 4 all flagged independently.

We sent the same prompt to all three. These five points came back from every one of them, in different words but identical substance. That convergence is the signal.

01
Attention masking is the genuinely novel insight
Candidates cannot attend to each other in-batch. Scores become batch-agnostic, cacheable, and resistant to manipulation via batch composition.
3/3 flagged
02
"Eliminated every hand-engineered feature" is marketing
The Author Diversity Scorer is a post-ranking heuristic. Brand-safety coupling is another. The claim does not survive contact with the repo.
3/3 flagged
03
"End-to-end" is structurally hollow
No training pipelines. No data. No production indices. You cannot reproduce, audit, or verify anything material.
3/3 flagged
04
The multi-action weights are the audit target
The relative weights on like, reply, and click versus block, mute, and report decide whether the algorithm rewards outrage or thoughtful content. None are published.
3/3 flagged
05
Transparency as competitive pressure
Meta and TikTok must now either match the disclosure or explain why they will not. A reputational gambit aimed squarely at competitors.
3/3 flagged
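Point 01 is concrete enough to sketch. Assuming a sequence laid out as [user-history tokens | candidate tokens], the mask lets each candidate attend to the full history and to itself, never to another candidate. The layout and names here are our assumption, not taken from the repo:

```python
def candidate_mask(n_history: int, n_candidates: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key k.
    Sequence layout: [history tokens | candidate tokens]."""
    n = n_history + n_candidates
    mask = [[False] * n for _ in range(n)]
    for q in range(n_history):            # history: ordinary causal attention
        for k in range(q + 1):
            mask[q][k] = True
    for i in range(n_candidates):         # candidates: history + self only
        q = n_history + i
        for k in range(n_history):
            mask[q][k] = True
        mask[q][q] = True
    return mask

m = candidate_mask(n_history=3, n_candidates=2)
```

Because a candidate's row is identical no matter which other candidates share the batch, its forward pass, and therefore its score, is batch-agnostic. That is what makes per-(user, candidate) caching sound.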
Where they diverged

Three frontier models. Three distinctive lenses.

Same prompt, same repo, three different angles. Worth noticing what each chose to elevate — and where they all stayed quiet.

OpenAI GPT-5
via api.openai.com
Most thorough on governance
"Brand safety and PTOS classifier thresholds, error costs, and their coupling to ranking — false positives/negatives by topic or dialect. Robustness: feature missingness handling, spam defenses, and places a spammer could exploit candidate masking/caching."
policy · manipulation surface · enterprise lens
Google Gemini 2.5 Pro
via generativelanguage.googleapis.com
Most measured and political
"This is a talent acquisition play and a transparency gambit. It positions X/xAI as a serious engineering organization, using Rust and a unified Grok-based ML stack to attract top talent."
strategic framing · diplomatic · talent gambit
xAI Grok 4
via api.x.ai
Most direct — and self-critical
"256-dim / 2-layer model is a toy; the actual system runs on larger un-released weights. Phoenix still inherits Grok-1 inductive biases plus post-ranking diversity and brand-safety logic."
technical bluntness · parent-co self-critique · noted bias
Telling absence

None of the three frontier models touched the political / moderation-policy context — verified-account amplification, content-moderation rollback under Musk, hate-speech policy changes since 2024. All three stayed in pure ML/engineering critique. Either alignment training is making them shy on contested topics, or those features simply are not in this release. Worth knowing which.

Ultrathink — what the LLMs missed

Four deeper reads on what just shipped.

Beyond what the frontier LLMs converged on, four insights that change how to interpret the release.

01

The real story is architectural unification, not transparency.

xAI took Grok-1's transformer and made it the recommendation ranker's backbone. Every marginal dollar spent scaling Grok now improves both the LLM and the X feed. This is the same compute-flywheel play that TPUs gave Google and PyTorch gave Meta. If it works, "Grok improves → feed improves" becomes a structural advantage.

02

The release is audit-shaped without being auditable.

The 256-dim, 2-layer model is a runnable demo to silence "where's the code." The actual production system runs on weights that are not in the repo. You can inspect the architecture but you cannot validate any behavioral claim against the real model. Transparency theatre, not transparency.

03

The cacheable scoring is also an anti-manipulation defense.

Independent candidate scores mean you cannot game ranking through batch composition — no faking co-occurrence patterns to surface low-signal content. Quietly the most underrated piece of the design. Twitter has historically struggled with this exact attack class.
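Batch-agnostic scores are what make the cache trivial: the key is just (user, candidate), with no batch context in it. A toy illustration, where the scoring function is a deterministic stand-in for a model forward pass, not anything from the repo:

```python
from functools import lru_cache

@lru_cache(maxsize=100_000)
def score(user_id: str, post_id: str) -> float:
    # Stand-in for a Phoenix forward pass; deterministic per pair.
    return sum(map(ord, user_id + post_id)) % 100 / 100

# The same (user, post) pair yields the same score in any batch,
# so one cache entry serves every batch composition.
batch_a = [score("u1", p) for p in ("t1", "t9")]
batch_b = [score("u1", p) for p in ("t9", "t7", "t2")]
```

An attacker who controls which candidates co-occur in a request gains nothing: no batch composition can move any individual score.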

04

The multi-action heads are an actionable map.

Multi-action prediction with explicit negative weights for block/mute/report tells you exactly what the algorithm optimizes for and against. For anyone selling X engagement data, this rewrites how to score "quality" engagement vs raw volume. Suppression detection becomes possible.
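The blend amounts to a weighted sum over the heads' predicted probabilities. The weights below are invented for illustration; the real ratios are exactly what the repo withholds:

```python
# Hypothetical per-action weights -- the production values are NOT public.
WEIGHTS = {
    "like": 1.0, "reply": 2.0, "repost": 1.5, "click": 0.3,
    "block": -10.0, "mute": -5.0, "report": -15.0,
}

def blended_score(action_probs: dict[str, float]) -> float:
    """Collapse the multi-action heads into one ranking score."""
    return sum(WEIGHTS[action] * p for action, p in action_probs.items())

# Same positive engagement, very different negative-signal exposure:
thoughtful = blended_score({"like": 0.30, "reply": 0.10, "block": 0.001})
outrage    = blended_score({"like": 0.30, "reply": 0.10, "block": 0.050})
```

Even a modest block probability wipes out strong positive engagement, which is why the unpublished weight ratios are the real audit target.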

What this unlocks for LunarCrush

X just told us how to score our own data.

Three direct implications for LunarCrush — and any institutional buyer of X social engagement signals.

Quality engagement scoring
Multi-action heads with explicit negative weights for block/mute/report let us refine "engagement quality" scoring beyond raw volume. Sellable to hedge funds: "this creator's reach is being algorithmically supported / suppressed."
Creator algorithmic ceiling
The Author Diversity Scorer attenuates repeat-author exposure per session. High raw engagement no longer implies algorithmic favor. Our creator analytics need to factor in a per-session ceiling.
Public grox taxonomy
X's spam + PTOS classifier architecture is now public reference. We can legitimately mirror their content-quality taxonomy in our X ingestion pipeline — better noise filtering, fewer false-engagement signals reaching customers.
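The per-session ceiling can be modeled as multiplicative decay on repeat authors. A sketch with an invented decay constant; the repo does not publish the actual attenuation curve:

```python
def attenuate_repeat_authors(ranked, decay: float = 0.5):
    """Multiply each item's score by decay**k, where k counts how many
    earlier items in the session came from the same author.
    The decay value is hypothetical."""
    seen: dict[str, int] = {}
    out = []
    for author, score in ranked:
        k = seen.get(author, 0)
        out.append((author, score * decay ** k))
        seen[author] = k + 1
    return out

session = [("alice", 1.0), ("alice", 0.9), ("bob", 0.8)]
adjusted = attenuate_repeat_authors(session)
```

Under these toy numbers, alice's second post (raw 0.9) lands below bob's (0.8): high raw engagement no longer implies algorithmic favor.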