Search: [LLM]

The Top 26 Essential Papers (+5 Bonus Resources) for Mastering LLMs and Transformers

This list bridges the Transformer foundations
with the reasoning, MoE, and agentic shift

Recommended Reading Order

Attention Is All You Need (Vaswani et al., 2017)

The original Transformer paper. Covers self-attention,
multi-head attention, and the encoder-decoder structure
(even though most modern LLMs are decoder-only).

The Illustrated Transformer (Jay Alammar, 2018)

Great intuition builder for understanding
attention and tensor flow before diving into implementations

BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2018)

Encoder-side fundamentals, masked language modeling,
and representation learning that still shape modern architectures

Language Models are Few-Shot Learners (GPT-3) (Brown et al., 2020)

Established in-context learning as a real
capability and shifted how prompting is understood

Scaling Laws for Neural Language Models (Kaplan et al., 2020)

First clean empirical scaling framework for parameters, data, and compute
Read alongside Chinchilla to understand why most models were undertrained

Training Compute-Optimal Large Language Models (Chinchilla) (Hoffmann et al., 2022)

Demonstrated that for a fixed compute budget, parameters and training
tokens should be scaled in equal proportion, so data matters as much as model size

LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023)

The paper that triggered the open-weight era
Popularized architectural defaults like RMSNorm, SwiGLU,
and RoPE as standard practice

RoFormer: Rotary Position Embedding (Su et al., 2021)

Positional encoding that became the modern default for long-context LLMs

FlashAttention (Dao et al., 2022)

Memory-efficient attention that enabled long context windows
and high-throughput inference by optimizing GPU memory access.

Retrieval-Augmented Generation (RAG) (Lewis et al., 2020)

Combines parametric models with external knowledge sources
Foundational for grounded and enterprise systems

Training Language Models to Follow Instructions with Human Feedback (InstructGPT) (Ouyang et al., 2022)

The modern post-training and alignment blueprint
that instruction-tuned models follow

Direct Preference Optimization (DPO) (Rafailov et al., 2023)

A simpler and more stable alternative to PPO-based RLHF
Optimizes preferences directly through the loss function, with no separate reward model

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)

Demonstrated that reasoning can be elicited through prompting
alone and laid the groundwork for later reasoning-focused training

ReAct: Reasoning and Acting (Yao et al., 2022 / ICLR 2023)

The foundation of agentic systems
Combines reasoning traces with tool use and environment interaction

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Guo et al., 2025)

The R1 paper. Showed that large-scale reinforcement learning without
supervised fine-tuning can induce self-verification and structured reasoning behavior

Qwen3 Technical Report (Yang et al., 2025)

A lightweight overview of a modern architecture
Unifies Thinking Mode and Non-Thinking Mode in one model, across dense
and MoE variants, to dynamically trade off cost and reasoning depth

Outrageously Large Neural Networks: Sparsely-Gated Mixture of Experts (Shazeer et al., 2017)

The modern MoE ignition point
Conditional computation at scale

Switch Transformers (Fedus et al., 2021)

Simplified MoE routing using single-expert activation
Key to stabilizing trillion-parameter training

Mixtral of Experts (Mistral AI, 2024)

Open-weight MoE that proved sparse models can match dense quality
while running at small-model inference cost

Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints (Komatsuzaki et al., 2022 / ICLR 2023)

Practical technique for converting dense checkpoints into MoE models
Critical for compute reuse and iterative scaling

The Platonic Representation Hypothesis (Huh et al., 2024)

Evidence that scaled models converge toward shared
internal representations across modalities

Textbooks Are All You Need (Gunasekar et al., 2023)

Demonstrated that textbook-quality filtered and synthetic data allows
small models to outperform much larger ones

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024)

The biggest leap in mechanistic interpretability
Decomposes neural networks into millions of interpretable features

PaLM: Scaling Language Modeling with Pathways (Chowdhery et al., 2022)

A masterclass in large-scale training
orchestration across thousands of accelerators

GLaM: Generalist Language Model (Du et al., 2022)

Validated MoE scaling economics with massive
total parameters but small active parameter counts

The Smol Training Playbook (Hugging Face, 2025)

Practical end-to-end handbook for efficiently training language models

Bonus Material

T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019)
Toolformer (Schick et al., 2023)
GShard (Lepikhin et al., 2020)
Adaptive Mixtures of Local Experts (Jacobs et al., 1991)
Hierarchical Mixtures of Experts (Jordan and Jacobs, 1994)

If you deeply understand these fundamentals (Transformer core, scaling laws, FlashAttention, instruction tuning, R1-style reasoning, and MoE upcycling), you already understand LLMs better than most

Time to lock in, good luck ;)

llm · ml · paper

January 30, 2026 at 9:51:19 AM EST

https://x.com/TheAhmadOsman/status/2016893734986616915

[2505.23836] Large Language Models Often Know When They Are Being Evaluated

If AI models can detect when they are being evaluated, the effectiveness of evaluations might be compromised. For example, models could have systematically different behavior during evaluations,...

paper · llm

June 4, 2025 at 9:32:16 PM EDT

https://www.arxiv.org/abs/2505.23836

SWE-smith

Creating training data for software engineering agents is difficult. Until now.

Introducing SWE-smith: Generate 100s to 1000s of task instances for any GitHub repository.

We've generated 50k+ task instances for 128 popular GitHub repositories, then trained our own LM for SWE-agent.

The result? SWE-agent-LM-32B achieves 40% pass@1 on SWE-bench Verified.

Now, we've open-sourced everything, and we're excited to see what you build with it!

Check out the tutorial below to generate 100 task instances for any GitHub repository in 10 minutes.

llm

May 16, 2025 at 10:01:37 PM EDT

https://swesmith.com/

Paper page - LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities

The success of Large Language Models (LLMs) has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and the knowing-doing gap, the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe, demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as epsilon-greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making.

paper · llm

April 27, 2025 at 4:29:07 PM EDT

https://huggingface.co/papers/2504.16078

The First LLM – Jonathon Belotti [thundergolfer]

A tracing of the history of GPT-1 and its predecessors.

llm

March 30, 2025 at 3:29:19 PM EDT

https://thundergolfer.com/blog/the-first-llm

Gitingest

Replace 'hub' with 'ingest' in any GitHub URL for prompt-friendly text.
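
The URL swap described above can be sketched in a few lines of Python. This is only an illustration of the 'hub' to 'ingest' host rewrite: `to_gitingest_url` is a hypothetical helper name, not part of Gitingest, and the example repository path is just a placeholder.

```python
from urllib.parse import urlsplit, urlunsplit


def to_gitingest_url(github_url: str) -> str:
    """Swap the github.com host for gitingest.com, keeping the rest of the URL.

    Hypothetical helper for illustration; it only rewrites the host name.
    """
    parts = urlsplit(github_url)
    host = parts.netloc.lower()
    # Only rewrite URLs that actually point at GitHub.
    if host not in ("github.com", "www.github.com"):
        raise ValueError(f"not a GitHub URL: {github_url!r}")
    # 'github.com' -> 'gitingest.com' is the 'hub' -> 'ingest' swap.
    return urlunsplit(
        (parts.scheme, "gitingest.com", parts.path, parts.query, parts.fragment)
    )


# Placeholder repository path, used only to show the rewrite.
print(to_gitingest_url("https://github.com/octocat/Hello-World"))
# → https://gitingest.com/octocat/Hello-World
```

Parsing the URL instead of doing a plain string replace avoids touching a 'hub' that happens to appear in a repository or branch name.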

github · ai · llm

February 13, 2025 at 8:51:09 PM EST

https://gitingest.com/

I can now run a GPT-4 class model on my laptop

llm

December 9, 2024 at 2:48:50 PM EST

https://simonwillison.net/2024/Dec/9/llama-33-70b/

OpenAI's Whisper model is reportedly 'hallucinating' in high-risk situations | Tom's Guide

A new report reveals OpenAI's audio transcription tool, Whisper, has recorded consistent "hallucinations", according to multiple studies.

llm

October 28, 2024 at 4:42:05 PM EDT

https://www.tomsguide.com/ai/openais-whisper-model-is-reportedly-hallucinating-in-high-risk-situations

Google's rumored Gemini 2.0 launch in December could support LLM stagnation thesis

Google is gearing up to unveil its latest AI language model, Gemini 2.0, in December, according to insider sources from The Verge.

Another indication of the plateau thesis: OpenAI has just confirmed that a new model, internally considered as a potential successor to GPT-4, will not be released this year, despite looming competition from Google Gemini 2.0.

Similarly, Anthropic is rumored to have put a previously announced version 3.5 of its flagship Opus model on hold due to a lack of significant progress, instead focusing on an improved version of Sonnet 3.5 that emphasizes agent-based AI.

google · llm

October 26, 2024 at 2:46:43 PM EDT

https://the-decoder.com/googles-rumored-gemini-2-0-launch-in-december-could-support-llm-stagnation-thesis/

Introducing the Open FinLLM Leaderboard

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

llm · finance

October 6, 2024 at 1:37:00 PM EDT

https://huggingface.co/blog/leaderboard-finbench

[2203.14465] STaR: Bootstrapping Reasoning With Reasoning

Generating step-by-step "chain-of-thought" rationales improves language model performance on complex reasoning tasks like mathematics or commonsense question-answering.

ml · llm · paper

September 14, 2024 at 7:20:47 PM EDT

https://arxiv.org/abs/2203.14465

THE HYBRID FORECAST OF S&P 500 VOLATILITY ENSEMBLED FROM VIX, GARCH AND LSTM MODELS

Hybrid LSTM models significantly outperform the traditional GARCH models

finance · paper · llm

September 9, 2024 at 10:20:59 AM EDT

https://www.wne.uw.edu.pl/application/files/4417/1949/0286/WNE_WP449.pdf

[2409.01666] In Defense of RAG in the Era of Long-Context Language Models

llm · ai · paper

September 4, 2024 at 10:27:45 PM EDT

https://arxiv.org/abs/2409.01666

Anthropic launches Claude Enterprise plan to compete with OpenAI | TechCrunch

Anthropic is launching a new subscription plan for its AI chatbot, Claude, catered toward enterprise customers that want more administrative controls and

llm · ai

September 4, 2024 at 4:57:35 PM EDT

https://techcrunch.com/2024/09/04/anthropic-launches-claude-enterprise-plan-to-compete-with-openai

Faith and Fate: Transformers as fuzzy pattern matchers – Answer.AI

llm · to_read

August 27, 2024 at 3:02:57 PM EDT

https://www.answer.ai/posts/2024-07-25-transformers-as-matchers.html

Anthropic's new prompt caching will save developers a fortune | VentureBeat

Anthropic's prompt caching lets users save prompts and call these up for later sessions with additional context for a lower price.

llm

August 15, 2024 at 5:14:33 PM EDT

https://venturebeat.com/ai/anthropics-new-claude-prompt-caching-will-save-developers-a-fortune/

Unveiling Hermes 3: The First Fine-Tuned Llama 3.1 405B Model is on Lambda’s Cloud

We’re excited to offer the AI/ML community free access to Hermes 3 through Lambda’s new Chat Completions API, fully compatible with the OpenAI API. It provides endpoints for creating completions, chat completions and listing models.

llm

August 15, 2024 at 2:34:37 PM EDT

https://lambdalabs.com/blog/unveiling-hermes-3-the-first-fine-tuned-llama-3.1-405b-model-is-on-lambdas-cloud

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

llm

July 18, 2024 at 5:34:44 PM EDT

https://livecodebench.github.io/

Slack Combines ASTs with Large Language Models to Automatically Convert 80% of 15,000 Unit Tests - InfoQ

Slack's engineering team recently published how it used a large language model (LLM) to automatically convert 15,000 unit and integration tests from Enzyme to React Testing Library (RTL). By combining

llm

June 12, 2024 at 10:37:54 AM EDT

https://www.infoq.com/news/2024/06/slack-automatic-test-conversion/

Google Search results polluted by buggy AI-written code frustrate coders • The Register

Pulumi claims it has culled bad infrastructure-as-code samples

llm

May 1, 2024 at 7:55:17 PM EDT

https://www.theregister.com/AMP/2024/05/01/pulumi_ai_pollution_of_search/