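10 Papers · In Reading Order

1-2 hours per paper. How to read: pull out three things, "problem, method, matching code".

On the AI infra track, reading the 10 papers below gets you roughly 80% of the engineering vocabulary. The remaining 20% is newly published work, which you can absorb quickly by using these 10 as a framework. The order is deliberate: each paper raises the question that the next one answers.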

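💡 How to read a paper

Do not read from the abstract through to the references. AI infra papers almost all share this structure:

Last paragraph of the Intro → take the contribution list (what the authors believe they solved)
Motivation / Background section → take "what is wrong with the status quo"
Main Design section → find the first system diagram and understand it
First figure of the Evaluation → take "how large the performance gain is"
Related work → skip; come back when you do your own research

Spend no more than 90 minutes in total. Write a 200-word, three-part note: problem / method / which line of vLLM code it maps to.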

01 Required: 7 papers · aligned by month

① PagedAttention (vLLM) · SOSP 2023 M2

Kwon et al. · "Efficient Memory Management for Large Language Model Serving with PagedAttention"
arxiv.org/abs/2309.06180

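Problem: naive KV cache allocation suffers internal and external fragmentation and cannot share memory across sequences; the paper reports that existing systems use only about 20-40% of KV cache memory for actual token states.
Method: fixed-size blocks plus a block table, i.e. OS virtual memory (paging) rewritten for the GPU.
Read next: vllm/v1/core/kv_cache_manager.py + csrc/attention/paged_attention_v1.cu.
Follow-up: why is the block size 16? (the paper's ablation on the impact of block size discusses this)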
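
To make the "fixed-size blocks plus a block table" idea concrete, here is a minimal sketch in plain Python. It is not vLLM's implementation: `BlockAllocator`, `Sequence`, and `fork` are hypothetical names for illustration, and copy-on-write for a shared, partially filled block is left out.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (the value the follow-up question asks about)


class BlockAllocator:
    """Hypothetical allocator: hands out fixed-size physical blocks from a free list."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.ref_count = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("out of KV blocks")
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        # Sharing a block is just a reference-count bump, no copy.
        self.ref_count[block] += 1
        return block

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free.append(block)


@dataclass
class Sequence:
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)  # logical index -> physical block

    def append_token(self, alloc: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(alloc.allocate())
        self.num_tokens += 1

    def slot(self, pos: int) -> tuple[int, int]:
        """Map a token position to (physical block, offset inside the block)."""
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def fork(self, alloc: BlockAllocator) -> "Sequence":
        # Share every block with the child (copy-on-write is omitted in this sketch).
        return Sequence(self.num_tokens, [alloc.share(b) for b in self.block_table])


alloc = BlockAllocator(num_blocks=8)
seq = Sequence()
for _ in range(20):
    seq.append_token(alloc)
```

After 20 tokens the sequence holds two blocks, which need not be adjacent in physical memory. That indirection is what removes external fragmentation.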

② Orca · OSDI 2022 M3

Yu et al. · "Orca: A Distributed Serving System for Transformer-Based Generative Models"
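The official PDF is hosted by USENIX.

Problem: traditional batching waits for the whole batch to finish before starting the next one, which wastes a lot of compute when request lengths differ.
Method: iteration-level scheduling, which re-forms the batch at every token step. This is the original paper behind continuous batching.
Read next: schedule() in vllm/v1/core/sched/scheduler.py.
Follow-up: what is Orca's "selective batching"? How does vLLM implement it?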
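
The difference between the two batching styles can be shown with a toy step counter. This is a sketch under simplifying assumptions (each request is just a number of decode steps, prefill is ignored, and both function names are made up for illustration), not Orca's or vLLM's scheduler.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Request-level batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def iteration_level_steps(lengths: list[int], batch_size: int) -> int:
    """Iteration-level scheduling: finished requests leave and waiting ones join every step."""
    waiting = deque(lengths)
    running: list[int] = []
    steps = 0
    while waiting or running:
        # Fill any free batch slots before this iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step for every running request, then drop the finished ones.
        running = [r - 1 for r in running]
        running = [r for r in running if r > 0]
        steps += 1
    return steps


lengths = [2, 8, 2, 2]  # decode steps needed per request
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = iteration_level_steps(lengths, batch_size=2)
```

With these inputs the static version needs 10 steps and the iteration-level version needs 8, because the short requests no longer wait for the 8-step request to finish.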

③ Sarathi-Serve · OSDI 2024 M3

Agrawal et al. · "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve"
arxiv.org/abs/2403.02310

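Problem: under continuous batching, mixing prefill and decode in one batch slows decode down.
Method: chunked prefill, which slices a large prefill into chunks and packs them together with decodes.
Read next: the chunked_prefill path in the vLLM scheduler.
Follow-up: how does chunk size relate to the token budget?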
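
One way to think about the chunk size / token budget question is the sketch below. It assumes a decode-first policy with a fixed per-step token budget; `Request` and `schedule_step` are hypothetical names, and the real vLLM scheduler handles many more cases.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed

    @property
    def in_prefill(self) -> bool:
        return self.prefilled < self.prompt_len


def schedule_step(requests: list[Request], token_budget: int) -> dict[str, int]:
    """Return {request name: number of tokens to run in this step}."""
    plan: dict[str, int] = {}
    budget = token_budget
    # Decodes go first: each running decode costs exactly one token.
    for r in requests:
        if not r.in_prefill and budget > 0:
            plan[r.name] = 1
            budget -= 1
    # Whatever budget is left is handed to prefill, in chunks.
    for r in requests:
        if r.in_prefill and budget > 0:
            chunk = min(budget, r.prompt_len - r.prefilled)
            plan[r.name] = chunk
            r.prefilled += chunk
            budget -= chunk
    return plan
```

Under this policy the chunk size is not a separate knob: it is the token budget minus the number of decodes in the step. For example, with two decoding requests and a 1000-token prompt under a budget of 512, the first step runs a 510-token chunk and the second runs the remaining 490.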

④ SGLang (RadixAttention) · NeurIPS 2024 M2

Zheng et al. · "SGLang: Efficient Execution of Structured Language Model Programs"
arxiv.org/abs/2312.07104

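Problem: vLLM's prefix caching only hits on prefixes made of complete blocks, so the sharing rate is low in structured workloads (multi-branch, tree of thoughts).
Method: use a radix tree for fine-grained KV sharing.
Compare: when does this beat vLLM's prefix caching? (tree-structured conversations, multi-branch agent reasoning)
Follow-up: can vLLM borrow this? Is there a community PR working on it?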
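
The granularity difference can be seen in a small sketch. For brevity it uses an uncompressed token trie rather than a true radix tree (which merges single-child chains into one edge), and it stores no KV data. `PrefixCache` and `block_granular_hit` are illustrative names, not SGLang or vLLM APIs.

```python
class TrieNode:
    def __init__(self) -> None:
        self.children: dict[int, "TrieNode"] = {}


class PrefixCache:
    """Token-level prefix index (uncompressed trie, a stand-in for a radix tree)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: list[int]) -> None:
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match_prefix(self, tokens: list[int]) -> int:
        """Number of leading tokens whose KV entries could be reused."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched


def block_granular_hit(matched_tokens: int, block_size: int = 16) -> int:
    """Reusable tokens if only complete blocks can be shared."""
    return (matched_tokens // block_size) * block_size


cache = PrefixCache()
shared_prompt = list(range(40))            # 40-token shared prefix
cache.insert(shared_prompt + [100, 101])   # branch 1
token_hit = cache.match_prefix(shared_prompt + [200, 201])  # branch 2
block_hit = block_granular_hit(token_hit)
```

Here the second branch reuses all 40 shared tokens at token granularity but only 32 at 16-token block granularity. The gap grows when many short branches fork off the same prefix.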

⑤ FlashAttention (v1) · NeurIPS 2022 M4

Dao et al. · "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness"
arxiv.org/abs/2205.14135

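Problem: naive attention materializes the N×N matrix in HBM, which makes it heavily memory-bound.
Method: tiling plus online softmax, so the attention matrix never lands in HBM.
Read next: the forward kernel under csrc in the FlashAttention repository.
Follow-up: why is online softmax mathematically equivalent? (see §3.1 and the correctness proof in the appendix)
Later versions: v2 (better work partitioning) and v3 (Hopper-specific optimizations) are optional reading.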
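
The equivalence behind online softmax can be checked numerically. The sketch below handles a single query row in plain Python. It keeps a running max, a running denominator, and an unnormalized output, and divides only at the end, which is a simplification relative to the paper's Algorithm 1. The function names are made up for illustration.

```python
import math


def attention_row_reference(scores, values):
    """softmax(scores) @ values, computed with all scores available at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / denom for j in range(dim)]


def attention_row_online(scores, values, tile=2):
    """Same result, but scores are visited one tile at a time."""
    dim = len(values[0])
    m = float("-inf")   # running max of the scores seen so far
    l = 0.0             # running softmax denominator, expressed relative to m
    acc = [0.0] * dim   # running unnormalized output, expressed relative to m
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        m_new = max(m, max(s_tile))
        # Rescale the old state so that it is expressed relative to the new max.
        scale = math.exp(m - m_new)
        l *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(s_tile, v_tile):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]


scores = [1.0, 3.0, -2.0, 0.5, 4.0]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
reference = attention_row_reference(scores, values)
online = attention_row_online(scores, values, tile=2)
```

The key step is the `scale` factor: whenever a larger max appears, everything accumulated so far is multiplied by exp(m_old - m_new), so the final ratio is unchanged no matter how the scores are tiled.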

⑥ H2O · NeurIPS 2023 between M2 and M3

Zhang et al. · "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models"
arxiv.org/abs/2306.14048

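Problem: with long contexts, the KV cache blows past GPU memory.
Method: the authors observe that only a few "heavy hitter" tokens are used frequently by attention, so the rest can be evicted.
OS analogy: a modern take on working set theory + LRU.
Follow-up: is vLLM's current eviction policy H2O-style? (Answer: no, vLLM uses plain LRU. H2O's ideas may show up in a future PR.)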
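
A rough sketch of the heavy-hitter idea: keep the most recent tokens plus the older tokens that have received the most attention so far. This is a one-shot selection written for illustration; the paper's policy evicts incrementally during decoding, and `h2o_keep_set` with its parameters is a hypothetical name.

```python
def h2o_keep_set(acc_scores: list[float], budget: int, recent: int) -> list[int]:
    """Choose which token positions stay in the KV cache.

    acc_scores[i] is the total attention mass token i has received so far.
    `budget` is the number of tokens to keep; `recent` of them are reserved
    for the newest tokens, the rest go to the heaviest older tokens.
    """
    if recent > budget:
        raise ValueError("recent window cannot exceed the budget")
    n = len(acc_scores)
    if n <= budget:
        return list(range(n))
    recent_ids = list(range(n - recent, n))
    older = range(n - recent)
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[:budget - recent]
    return sorted(heavy + recent_ids)


acc = [5.0, 0.1, 0.2, 3.0, 0.1, 0.3, 0.2, 0.4]
kept = h2o_keep_set(acc, budget=4, recent=2)
```

In this example positions 0 and 3 survive as heavy hitters and positions 6 and 7 survive as the recent window. A plain LRU policy would instead look only at recency, not at attention mass.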

⑦ S-LoRA · MLSys 2024 M5+

Sheng et al. · "S-LoRA: Serving Thousands of Concurrent LoRA Adapters"
arxiv.org/abs/2311.03285

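Problem: every user has their own LoRA adapter (a small fine-tuned patch); how do you serve thousands of them?
Method: a unified memory pool + custom CUDA kernels, so requests using different adapters can share one batch.
OS analogy: multi-tenant isolation + resource sharing.
Read next: vllm/lora/.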
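
To see why different adapters can share one batch, note that each request computes y = xW + xAB, where W is common and only the small (A, B) pair differs. The sketch below spells that out in plain Python with a per-request loop. The real system replaces the loop with custom kernels, and `lora_batch_forward` is a hypothetical name.

```python
def matvec(x, w):
    """Row vector x (length n) times matrix w (n x m)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def lora_batch_forward(xs, adapter_ids, base_w, adapters):
    """Run a batch in which every request may use a different LoRA adapter."""
    outputs = []
    for x, aid in zip(xs, adapter_ids):
        y = matvec(x, base_w)            # shared base weights, identical for all requests
        a, b = adapters[aid]             # low-rank pair: a is (n x r), b is (r x m)
        delta = matvec(matvec(x, a), b)  # x A B: two thin products instead of a full n x m one
        outputs.append([yi + di for yi, di in zip(y, delta)])
    return outputs


base_w = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "u1": ([[1.0], [0.0]], [[0.0, 1.0]]),  # rank-1 adapter for user 1
    "u2": ([[0.0], [1.0]], [[2.0, 0.0]]),  # rank-1 adapter for user 2
}
outs = lora_batch_forward([[1.0, 2.0], [1.0, 2.0]], ["u1", "u2"], base_w, adapters)
```

The same input produces different outputs for the two users while the base weights are stored only once, which is the resource-sharing half of the multi-tenant analogy.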

02 Optional: 3 papers · pick a branch by interest

⑧ ZeRO · SC 2020 training track

Rajbhandari et al. · "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models"

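The classic on memory sharding for large-model training.
Heavy on OS thinking: it shards optimizer state, gradients, and parameters across GPUs.
Inference interviews may ask about it too; knowing the core idea is enough.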
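
The core idea fits in a few lines of arithmetic. The sketch below follows the paper's accounting for mixed-precision Adam (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of optimizer state per parameter). The function name and the `stage` argument are illustrative, and activations and buffers are ignored.

```python
def zero_bytes_per_gpu(params: float, gpus: int, stage: int) -> float:
    """Model-state bytes per GPU under ZeRO stage 0 (no sharding) to 3."""
    weights = 2 * params   # fp16 parameters
    grads = 2 * params     # fp16 gradients
    optim = 12 * params    # fp32 parameters, momentum, and variance for Adam
    if stage >= 1:
        optim /= gpus      # stage 1: shard optimizer state
    if stage >= 2:
        grads /= gpus      # stage 2: also shard gradients
    if stage >= 3:
        weights /= gpus    # stage 3: also shard parameters
    return weights + grads + optim


# A 7.5B-parameter model on 64 GPUs, in GB (1 GB = 1e9 bytes).
per_stage_gb = [zero_bytes_per_gpu(7.5e9, 64, s) / 1e9 for s in range(4)]
```

For this configuration the per-GPU footprint drops from 120 GB without sharding to about 31.4 GB, 16.6 GB, and 1.9 GB for stages 1, 2, and 3.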

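⑨ Megatron-LM (TP/PP) · paper series, training track

Tensor parallelism slices matrices across GPUs, which then compute cooperatively.
Pipeline parallelism splits the model's layers across GPUs.
vLLM's multi-GPU inference uses both. Know the concepts, then look at vllm/distributed/.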
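
Tensor parallelism for a single linear layer can be sketched without any GPU: split the weight matrix by columns, let each "rank" multiply by its own shard, and concatenate the results. This is a column-parallel toy in plain Python with made-up helper names; real implementations use collective communication (an all-gather here) instead of list concatenation.

```python
def matmul(x, w):
    """Dense matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Cut w into `parts` column shards of equal width."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = cols // parts
    return [[row[k * step:(k + 1) * step] for row in w] for k in range(parts)]


def column_parallel_matmul(x, w, parts):
    shards = split_columns(w, parts)                   # one shard per simulated GPU
    partials = [matmul(x, shard) for shard in shards]  # independent work on each rank
    # "All-gather": stitch each rank's output columns back together, row by row.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
sharded = column_parallel_matmul(x, w, parts=2)
dense = matmul(x, w)
```

Each rank holds only half of `w`, yet the stitched result equals the dense product. Pipeline parallelism is the complementary cut: whole layers, rather than slices of one layer, are assigned to different GPUs.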

⑩ torch.compile / Inductor · ASPLOS 2024 compiler track

Ansel et al. · "PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation"

If you want to move toward compilers, this is the entry point.
vLLM already uses torch.compile on some paths.
Read it as a bridge paper for Month 7 and beyond.

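03 Paper note template

After each paper, create a .md file under ~/infra/papers/ with three fixed sections plus one optional section:

# <paper title> · <venue>

## Problem (50 words)
What the status quo is; what is wrong with it.

## Method (100 words)
The core idea in one sentence; one paragraph on the key mechanism.

## Reconciliation (50 words)
Which vLLM line or module does this map to? (If you cannot find it, write "not found, to be checked".)

## What I learned (optional)
Counterintuitive points, clever engineering, things worth borrowing for mini-vLLM.

✓ Paper reading volume

After six months you should have ~15-20 paper notes (the 7 required ones plus whatever you follow on your own). This is the "conceptual capital" of a senior infra engineer. Being able to cite these papers offhand in an interview shows you really understand them.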

04 Who to follow on Twitter / arXiv

Woosuk Kwon · first author of vLLM
Tri Dao · author of FlashAttention
Lianmin Zheng · author of SGLang
Hao Zhang (UCSD) · a series of inference-optimization papers
Amey Agrawal · author of Sarathi
arXiv cs.DC + cs.LG, skimmed once a week
