Search: [ml]

The Top 26 Essential Papers (+5 Bonus Resources) for Mastering LLMs and Transformers

This list bridges the Transformer foundations
with the shift toward reasoning, MoE, and agentic systems

Recommended Reading Order

Attention Is All You Need (Vaswani et al., 2017)

The original Transformer paper. Covers self-attention,
multi-head attention, and the encoder-decoder structure
(even though most modern LLMs are decoder-only).
The Illustrated Transformer (Jay Alammar, 2018)

Great intuition builder for understanding
attention and tensor flow before diving into implementations
BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2018)

Encoder-side fundamentals, masked language modeling,
and representation learning that still shape modern architectures
Language Models are Few-Shot Learners (GPT-3) (Brown et al., 2020)

Established in-context learning as a real
capability and shifted how prompting is understood
Scaling Laws for Neural Language Models (Kaplan et al., 2020)

First clean empirical scaling framework for parameters, data, and compute
Read alongside Chinchilla to understand why most models were undertrained
Training Compute-Optimal Large Language Models (Chinchilla) (Hoffmann et al., 2022)

Demonstrated that, for a fixed compute budget, parameters and training
tokens should scale together, so earlier large models were trained on too few tokens
LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023)

The paper that triggered the open-weight era
Popularized architectural defaults like RMSNorm, SwiGLU,
and RoPE as standard practice
RoFormer: Rotary Position Embedding (Su et al., 2021)

Positional encoding that became the modern default for long-context LLMs
FlashAttention (Dao et al., 2022)

Memory-efficient attention that enabled long context windows
and high-throughput inference by optimizing GPU memory access.
Retrieval-Augmented Generation (RAG) (Lewis et al., 2020)

Combines parametric models with external knowledge sources
Foundational for grounded and enterprise systems
Training Language Models to Follow Instructions with Human Feedback (InstructGPT) (Ouyang et al., 2022)

The modern post-training and alignment blueprint
that instruction-tuned models follow
Direct Preference Optimization (DPO) (Rafailov et al., 2023)

A simpler and more stable alternative to PPO-based RLHF
Preference alignment via the loss function
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)

Demonstrated that reasoning can be elicited through prompting
alone and laid the groundwork for later reasoning-focused training
ReAct: Reasoning and Acting (Yao et al., 2022 / ICLR 2023)

The foundation of agentic systems
Combines reasoning traces with tool use and environment interaction
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Guo et al., 2025)

The R1 paper. Showed that large-scale reinforcement learning without
supervised fine-tuning as a first step can induce self-verification and structured reasoning behavior
Qwen3 Technical Report (Yang et al., 2025)

A lightweight overview of a modern architecture
Unifies Thinking Mode and Non-Thinking Mode in one model family (dense and MoE)
to dynamically trade off cost and reasoning depth
Outrageously Large Neural Networks: Sparsely-Gated Mixture of Experts (Shazeer et al., 2017)

The modern MoE ignition point
Conditional computation at scale
Switch Transformers (Fedus et al., 2021)

Simplified MoE routing using single-expert activation
Key to stabilizing trillion-parameter training
Mixtral of Experts (Mistral AI, 2024)

Open-weight MoE that showed sparse models can match dense quality
while running at small-model inference cost
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints (Komatsuzaki et al., 2022 / ICLR 2023)

Practical technique for converting dense checkpoints into MoE models
Critical for compute reuse and iterative scaling
The Platonic Representation Hypothesis (Huh et al., 2024)

Evidence that scaled models converge toward shared
internal representations across modalities
Textbooks Are All You Need (Gunasekar et al., 2023)

Demonstrated that high-quality synthetic data allows
small models to outperform much larger ones
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024)

The biggest leap in mechanistic interpretability
Decomposes neural networks into millions of interpretable features
PaLM: Scaling Language Modeling with Pathways (Chowdhery et al., 2022)

A masterclass in large-scale training
orchestration across thousands of accelerators
GLaM: Generalist Language Model (Du et al., 2022)

Validated MoE scaling economics with massive
total parameters but small active parameter counts
The Smol Training Playbook (Hugging Face, 2025)

Practical end-to-end handbook for efficiently training language models
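To make the first entry in the reading order concrete, here is a minimal pure-Python sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, with an optional causal mask of the kind decoder-only models use. The function names and the toy inputs are illustrative assumptions, not code from any of the papers, and a real implementation would use batched tensor operations and multiple heads.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of equal-length float vectors.

    With causal=True, position i may only attend to positions <= i.
    Returns (outputs, weights), where weights[i][j] is how much
    position i attends to position j.
    """
    d_k = len(keys[0])
    width = len(values[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if causal:
            # Masked positions get -inf, so they receive zero weight.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, values)) for c in range(width)])
    return outputs, weights


# Toy self-attention: the same three vectors act as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens, causal=True)
```

With the causal mask on, the first position can only attend to itself, so its output equals its own value vector.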

Bonus Material

T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019)
Toolformer (Schick et al., 2023)
GShard (Lepikhin et al., 2020)
Adaptive Mixtures of Local Experts (Jacobs et al., 1991)
Hierarchical Mixtures of Experts (Jordan and Jacobs, 1994)

If you deeply understand these fundamentals (Transformer core, scaling laws, FlashAttention, instruction tuning, R1-style reasoning, and MoE upcycling), you already understand LLMs better than most.
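For the scaling-laws item in particular, a back-of-the-envelope sketch can help. It rests on two common approximations that are assumptions here, not exact results from the papers: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a rule of thumb of roughly 20 training tokens per parameter at the compute-optimal point. The function names are hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: C ~ 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumption: rough compute-optimal rule of thumb


def training_flops(params, tokens):
    """Approximate training compute for a dense model."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def compute_optimal(flops_budget):
    """Split a FLOP budget into (params, tokens) under the assumptions above."""
    # Substitute D = 20 * N into C = 6 * N * D and solve for N.
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


# Chinchilla's reported configuration: 70B parameters, 1.4T tokens.
budget = training_flops(70e9, 1.4e12)
params, tokens = compute_optimal(budget)

# GPT-3 for comparison: 175B parameters trained on about 300B tokens.
gpt3_tokens_per_param = 300e9 / 175e9
```

Under these assumptions, Chinchilla's own budget maps back to about 70B parameters and 1.4T tokens, while GPT-3 sits below 2 tokens per parameter, which is the sense in which earlier models were undertrained.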

Time to lock in, good luck ;)

llm · ml · paper

January 30, 2026 at 9:51:19 AM EST

https://x.com/TheAhmadOsman/status/2016893734986616915

[2505.03335] Absolute Zero: Reinforced Self-play Reasoning with Zero Data

paper · ml · rl

May 10, 2025 at 4:32:08 PM EDT

https://arxiv.org/abs/2505.03335

The AI Scientist Generates its First Peer-Reviewed Scientific Publication

AI · ml

March 17, 2025 at 11:06:37 PM EDT

https://sakana.ai/ai-scientist-first-publication/

DeepSeek's open-source week and why it's a big deal | PySpur - AI Agent Builder

Quick Intro to FlashMLA, DeepEP, DeepGEMM, DualPipe, EPPLB, 3FS and Smallpond

ml

March 7, 2025 at 8:43:47 PM EST

https://www.pyspur.dev/blog/deepseek_open_source_week

S1: The $6 R1 Competitor? - Tim Kellogg

ml

February 6, 2025 at 1:17:27 AM EST

https://timkellogg.me/blog/2025/02/03/s1

SOTA on swebench-verified: (re)learning the bitter lesson

Searching code is an important part of every developer's workflow. We're trying to make it better.

aide · ml · agent

January 27, 2025 at 1:23:02 AM EST

https://aide.dev/blog/sota-bitter-lesson

in-context learning

paper · ml

December 30, 2024 at 7:24:07 PM EST

https://x.com/DimitrisPapail/status/1873503233378820257

Finally, a Replacement for BERT: Introducing ModernBERT

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

ml

December 23, 2024 at 1:41:30 PM EST

https://huggingface.co/blog/modernbert

[2411.04997] LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation

LLM2CLIP: Research showing how to improve CLIP's image-text matching abilities by replacing its text encoder with a frozen LLM (like Llama) and a trainable adapter. The key innovation is fine-tuning the LLM first to make its outputs more discriminative, then using it to help CLIP's vision encoder better understand language. Results show major improvements in matching detailed descriptions to images, handling long text, and even working across languages, while requiring relatively little training time and compute.

ml · paper

November 14, 2024 at 3:28:51 PM EST

https://arxiv.org/abs/2411.04997

[2410.01792] When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1

ml · paper

October 24, 2024 at 1:30:44 PM EDT

https://arxiv.org/abs/2410.01792

[2410.01201] Were RNNs All We Needed?

ml · paper

October 24, 2024 at 1:30:37 PM EDT

https://arxiv.org/abs/2410.01201

Machine Learning and the Yield Curve: Tree-Based Macroeconomic Regime Switching by Siyu Bie, Francis X. Diebold, Jingyu He, Junye Li :: SSRN

We explore tree-based macroeconomic regime-switching in the context of the dynamic Nelson-Siegel (DNS) yield-curve model.

finance · ml · paper

September 26, 2024 at 11:32:06 AM EDT

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4934442

[2203.14465] STaR: Bootstrapping Reasoning With Reasoning

Generating step-by-step "chain-of-thought" rationales improves language model performance on complex reasoning tasks like mathematics or commonsense question-answering.

ml · llm · paper

September 14, 2024 at 7:20:47 PM EDT

https://arxiv.org/abs/2203.14465

GitHub - franckalain/financial-machine-learning

ml · finance

August 15, 2024 at 4:49:17 PM EDT

https://github.com/franckalain/financial-machine-learning

Uncensor any LLM with abliteration

A Blog post by Maxime Labonne on Hugging Face

ml

July 25, 2024 at 10:41:16 PM EDT

https://huggingface.co/blog/mlabonne/abliteration

BigCodeBench Leaderboard

ml

June 23, 2024 at 6:52:59 PM EDT

https://bigcode-bench.github.io/

FX trading signals with regression-based learning | Macrosynergy Research

Regression-based statistical learning helps build trading signals from multiple candidate constituents. The method optimizes models and hyperparameters sequentially and produces point-in-time signals for backtesting and live trading. This post applies regression-based learning to macro trading factors for developed market FX trading, using a novel cross-validation method for expanding panel data. Sequentially optimized models […]

trade · ml · finance

April 6, 2024 at 10:01:23 AM EDT

https://research.macrosynergy.com/fx-trading-signals-with-regression-based-learning/
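The post's own cross-validation method is not reproduced here. As a rough illustration of the general idea behind point-in-time evaluation, the sketch below generates plain expanding-window splits in which each fold trains only on periods that precede its test periods. The function name and parameters are hypothetical.

```python
def expanding_window_splits(n_periods, min_train, test_size):
    """Yield (train_idx, test_idx) pairs with training strictly before testing.

    The training window always starts at period 0 and grows by test_size
    each fold, so no fold is fitted on data from after its test periods.
    """
    start = min_train
    while start + test_size <= n_periods:
        yield list(range(0, start)), list(range(start, start + test_size))
        start += test_size


# Ten periods, at least four for training, two per test block.
splits = list(expanding_window_splits(n_periods=10, min_train=4, test_size=2))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx}")
```

Selecting a model and its hyperparameters again on each successively larger training window is what keeps the resulting signals free of look-ahead.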

[2403.12180] Advanced Statistical Arbitrage with Reinforcement Learning

Statistical arbitrage is a prevalent trading strategy which takes advantage of mean reverse property of spread of paired stocks. Studies on this strategy often rely heavily on model assumption. In...

finance · ml

April 3, 2024 at 1:52:23 PM EDT

https://arxiv.org/abs/2403.12180
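For context on the mean-reversion idea in the abstract, here is a sketch of the classic rule-based version of pairs trading: compute the spread between two price series, standardize it, and trade against large deviations. This is a textbook baseline under stated assumptions, not the paper's reinforcement-learning method. The hedge ratio, the entry threshold, and the toy prices are all illustrative, and the full-sample mean and standard deviation introduce look-ahead that a real backtest would have to avoid.

```python
from statistics import mean, stdev


def spread_zscores(prices_a, prices_b, hedge_ratio=1.0):
    """Standardize the spread a - hedge_ratio * b over the whole sample."""
    spread = [a - hedge_ratio * b for a, b in zip(prices_a, prices_b)]
    mu, sigma = mean(spread), stdev(spread)
    return [(s - mu) / sigma for s in spread]


def signals(zscores, entry=1.0):
    """Return -1 (short the spread), +1 (long the spread), or 0 (flat).

    A high z-score means the spread is unusually wide, so the
    mean-reversion bet is that it narrows, and vice versa.
    """
    return [-1 if z > entry else 1 if z < -entry else 0 for z in zscores]


# Toy prices: stock A drifts away from stock B and then returns.
prices_a = [10.0, 11.0, 12.0, 11.0, 10.0]
prices_b = [10.0, 10.0, 10.0, 10.0, 10.0]
z = spread_zscores(prices_a, prices_b)
positions = signals(z)
```

The only position opened is a short at the point where the spread is widest.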

Ushering in the Thermodynamic Future - Litepaper

We are very excited to finally share a bit more about what we are building: a full-stack hardware platform to harness the natural fluctuations of matter as a computational resource for Generative AI.

ai · ml

March 19, 2024 at 6:38:35 PM EDT

https://www.extropic.ai/future

Optimizing Portfolio Allocation with Hierarchical Risk Parity in Python

Advanced Strategy to Account for Correlations, Risk, and Returns in your Portfolio Leveraging Hierarchical Structures

ml · finance · optim

March 11, 2024 at 1:02:29 PM EDT

https://medium.com/@crisvelasquez/optimizing-portfolio-allocation-with-hierarchical-risk-parity-in-python-19b1813af618