This list bridges the Transformer foundations with the shift toward reasoning, MoE, and agentic systems.
Recommended Reading Order
- Attention Is All You Need (Vaswani et al., 2017)
  The original Transformer paper. Covers self-attention, multi-head attention, and the encoder-decoder structure (even though most modern LLMs are decoder-only). See the attention sketch after this list.
- The Illustrated Transformer (Jay Alammar, 2018)
  Great intuition builder for understanding attention and tensor flow before diving into implementations.
- BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2018)
  Encoder-side fundamentals, masked language modeling, and representation learning that still shape modern architectures.
- Language Models are Few-Shot Learners (GPT-3) (Brown et al., 2020)
  Established in-context learning as a real capability and shifted how prompting is understood.
- Scaling Laws for Neural Language Models (Kaplan et al., 2020)
  First clean empirical scaling framework for parameters, data, and compute. Read alongside Chinchilla to understand why most models were undertrained.
- Training Compute-Optimal Large Language Models (Chinchilla) (Hoffmann et al., 2022)
  Demonstrated that token count matters more than parameter count for a fixed compute budget. See the compute-budget sketch after this list.
- LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023)
  The paper that triggered the open-weight era. Introduced architectural defaults like RMSNorm, SwiGLU, and RoPE as standard practice.
- RoFormer: Rotary Position Embedding (Su et al., 2021)
  Positional encoding that became the modern default for long-context LLMs. See the RoPE sketch after this list.
- FlashAttention (Dao et al., 2022)
  Memory-efficient attention that enabled long context windows and high-throughput inference by optimizing GPU memory access. See the online-softmax sketch after this list.
- Retrieval-Augmented Generation (RAG) (Lewis et al., 2020)
  Combines parametric models with external knowledge sources. Foundational for grounded and enterprise systems.
- Training Language Models to Follow Instructions with Human Feedback (InstructGPT) (Ouyang et al., 2022)
  The modern post-training and alignment blueprint that instruction-tuned models follow.
- Direct Preference Optimization (DPO) (Rafailov et al., 2023)
  A simpler and more stable alternative to PPO-based RLHF: preference alignment happens directly through the loss function, with no separate reward model or RL loop. See the DPO loss sketch after this list.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
  Demonstrated that reasoning can be elicited through prompting alone and laid the groundwork for later reasoning-focused training.
- ReAct: Reasoning and Acting (Yao et al., 2022 / ICLR 2023)
  The foundation of agentic systems. Combines reasoning traces with tool use and environment interaction.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Guo et al., 2025)
  The R1 paper. Showed that large-scale reinforcement learning without supervised fine-tuning can induce self-verification and structured reasoning behavior.
- Qwen3 Technical Report (Yang et al., 2025)
  A lightweight overview of a modern architecture. Introduced unified MoE models with Thinking and Non-Thinking modes to dynamically trade off cost and reasoning depth.
- Outrageously Large Neural Networks: Sparsely-Gated Mixture of Experts (Shazeer et al., 2017)
  The modern MoE ignition point. Conditional computation at scale.
- Switch Transformers (Fedus et al., 2021)
  Simplified MoE routing using single-expert activation. Key to stabilizing trillion-parameter training. See the top-1 routing sketch after this list.
- Mixtral of Experts (Mistral AI, 2024)
  Open-weight MoE that proved sparse models can match dense quality while running at small-model inference cost.
- Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints (Komatsuzaki et al., 2022 / ICLR 2023)
  Practical technique for converting dense checkpoints into MoE models. Critical for compute reuse and iterative scaling. See the upcycling sketch after this list.
- The Platonic Representation Hypothesis (Huh et al., 2024)
  Evidence that scaled models converge toward shared internal representations across modalities.
- Textbooks Are All You Need (Gunasekar et al., 2023)
  Demonstrated that high-quality synthetic data allows small models to outperform much larger ones.
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024)
  A major leap in mechanistic interpretability. Decomposes neural networks into millions of interpretable features.
- PaLM: Scaling Language Modeling with Pathways (Chowdhery et al., 2022)
  A masterclass in large-scale training orchestration across thousands of accelerators.
- GLaM: Generalist Language Model (Du et al., 2022)
  Validated MoE scaling economics with massive total parameter counts but small active parameter counts.
- The Smol Training Playbook (Hugging Face, 2025)
  Practical end-to-end handbook for efficiently training language models.
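Minimal Code Sketches
Several entries above are easier to internalize with a few lines of code. Everything below is a toy NumPy sketch under simplifying assumptions (single head, no masking or batching, made-up shapes and hyperparameters), not a reference implementation. First, the scaled dot-product attention at the heart of Attention Is All You Need:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head, no masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_q, seq_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted mix of value vectors

# Toy example: 4 query positions, 6 key/value positions, d_k = 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```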
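Next, the Chinchilla rule of thumb. Under the common approximation C ≈ 6·N·D (training FLOPs from N parameters and D tokens), the paper's compute-optimal fit lands near 20 tokens per parameter; the exact constant is an empirical estimate, so treat this as back-of-the-envelope arithmetic:

```python
import math

def chinchilla_optimal(flops_budget, tokens_per_param=20.0):
    """Split a FLOPs budget C ~= 6 * N * D with D ~= 20 * N (Chinchilla heuristic)."""
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: roughly Chinchilla's own training budget of ~5.76e23 FLOPs
N, D = chinchilla_optimal(5.76e23)
print(f"~{N/1e9:.0f}B params trained on ~{D/1e12:.1f}T tokens")  # ~69B, ~1.4T
```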
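Next, rotary position embedding (RoFormer). Each pair of channels is rotated by an angle proportional to the token position, so relative offsets show up directly in query-key dot products. A minimal sketch with the standard base of 10000:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (seq, d), d even."""
    seq, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)      # one frequency per channel pair
    angles = positions[:, None] * inv_freq[None, :]   # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin         # 2-D rotation of each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Relative-position property: the rotated q.k depends only on the offset m - n
q = np.random.default_rng(1).normal(size=(1, 8))
k = np.random.default_rng(2).normal(size=(1, 8))
a = rope(q, np.array([5]))[0] @ rope(k, np.array([3]))[0]
b = rope(q, np.array([12]))[0] @ rope(k, np.array([10]))[0]
print(np.isclose(a, b))  # True: offsets 5-3 and 12-10 match
```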
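Next, the online-softmax idea underlying FlashAttention. The real contribution is a tiled GPU kernel; the numerics below show only the core trick of streaming over key blocks while keeping a running max and normalizer, so the full score matrix is never materialized:

```python
import numpy as np

def attention_streaming(q, K, V, block=2):
    """One query row attending over K/V in blocks, FlashAttention-style numerics."""
    m = -np.inf                  # running max of scores seen so far
    l = 0.0                      # running softmax normalizer
    acc = np.zeros(V.shape[1])   # running unnormalized output
    d_k = q.shape[0]
    for start in range(0, K.shape[0], block):
        s = K[start:start+block] @ q / np.sqrt(d_k)   # scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                     # rescale old accumulator
        p = np.exp(s - m_new)
        acc = acc * scale + p @ V[start:start+block]
        l = l * scale + p.sum()
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
scores = K @ q / np.sqrt(8)
w = np.exp(scores - scores.max()); w /= w.sum()       # reference full softmax
print(np.allclose(attention_streaming(q, K, V), w @ V))  # True
```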
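Next, the DPO objective. Given summed log-probabilities of a chosen and a rejected response under the policy and a frozen reference model, the loss is a logistic loss on the difference of implicit rewards (beta is the usual temperature hyperparameter; the values below are made up for illustration):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * margin), margin = (pi - ref) gap on chosen vs. rejected."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy already prefers the chosen response more than the reference does,
# the margin is positive and the loss is small.
print(dpo_loss(-12.0, -20.0, -15.0, -18.0))  # margin = 3 - (-2) = 5
```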
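Next, Switch-style top-1 routing. Each token goes to the single expert with the highest router logit, and the output is scaled by that expert's router probability so the gate stays differentiable. This dense-loop sketch omits what real implementations need: capacity limits and a load-balancing loss:

```python
import numpy as np

def switch_layer(x, W_router, experts):
    """Top-1 MoE routing: each token runs through exactly one expert FFN."""
    logits = x @ W_router                                   # (tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)              # router softmax
    top1 = probs.argmax(axis=-1)                            # chosen expert per token
    out = np.zeros_like(x)
    for e, (W1, W2) in enumerate(experts):
        mask = top1 == e
        if mask.any():                                      # tiny two-layer ReLU FFN
            h = np.maximum(x[mask] @ W1, 0.0) @ W2
            out[mask] = probs[mask, e:e+1] * h              # gate-scaled output
    return out

rng = np.random.default_rng(0)
d, n_experts, tokens = 16, 4, 10
experts = [(rng.normal(size=(d, 4*d)) / d**0.5, rng.normal(size=(4*d, d)) / (4*d)**0.5)
           for _ in range(n_experts)]
print(switch_layer(rng.normal(size=(tokens, d)), rng.normal(size=(d, n_experts)), experts).shape)
```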
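Finally, sparse upcycling. The dense checkpoint's FFN weights are copied into every expert of a new MoE layer (attention and embeddings are reused as-is), and a freshly initialized router breaks the symmetry during continued training. A sketch of just the initialization step under those assumptions:

```python
import copy
import numpy as np

def upcycle_ffn(dense_ffn_weights, n_experts, d_model, rng):
    """Initialize an MoE layer from one dense FFN checkpoint (Sparse Upcycling)."""
    experts = [copy.deepcopy(dense_ffn_weights) for _ in range(n_experts)]
    # Router starts near-random so the initially identical experts can diverge.
    router = rng.normal(scale=0.02, size=(d_model, n_experts))
    return experts, router

rng = np.random.default_rng(0)
dense = {"W1": rng.normal(size=(16, 64)), "W2": rng.normal(size=(64, 16))}
experts, router = upcycle_ffn(dense, n_experts=8, d_model=16, rng=rng)
print(len(experts), router.shape)  # 8 (16, 8)
```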
Bonus Material
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019)
- Toolformer (Schick et al., 2023)
- GShard (Lepikhin et al., 2020)
- Adaptive Mixtures of Local Experts (Jacobs et al., 1991)
- Hierarchical Mixtures of Experts (Jordan and Jacobs, 1994)
If you deeply understand these fundamentals (the Transformer core, scaling laws, FlashAttention, instruction tuning, R1-style reasoning, and MoE upcycling), you already understand LLMs better than most.
Time to lock in. Good luck ;)