---
id: wiki-2026-0508-flash-attention
title: Flash Attention
category: 10_Wiki/Topics
status: needs_review
canonical_id: self
aliases: [P-Reinforce-AUTO-FLAT-001]
duplicate_of: none
source_trust_level: A
confidence_score: 1.0
tags: [auto-reinforced, flash-attention, attention-optimization, transformer, gpu-optimization, llm-inference]
raw_sources: []
last_reinforced: 2026-05-04
github_commit: pending
inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
tech_stack:
  language: unspecified
  framework: unspecified
---

# [[Flash Attention|Flash Attention]]

## 📌 One-Line Insight (The Karpathy Summary)

> "Liberator of the memory bottleneck: it keeps the mathematics of attention intact while using tiling to optimize data movement between GPU SRAM and HBM, delivering a 2-4x speedup and a dramatic memory reduction at the same time. It is the high point of hardware-aware optimization."

## 📖 Structured Knowledge (Synthesized Content)

FlashAttention is a hardware-aware optimization technique that addresses the compute and memory bottlenecks the attention mechanism runs into as the context window of a large language model (LLM) grows.

1. **Core operating principles**:
    * **Tiling**: The large attention matrix is split into small blocks (tiles) that are processed in the GPU's fast on-chip SRAM, which minimizes the number of accesses to the slower HBM (high-bandwidth memory).
    * **Recomputation**: Instead of storing the large intermediate attention matrix in memory, the values needed during backpropagation are recomputed on demand, which lowers memory complexity from $O(n^2)$ to $O(n)$ in the sequence length $n$.
2. **Key results**:
    * **Accuracy preserved**: It computes exact attention rather than an approximation, so results match standard attention up to floating-point rounding. The underlying compute complexity ($O(n^2 d)$) is unchanged, yet actual execution is 2-4x faster.
    * **Context extension**: By maximizing memory efficiency, it makes long contexts of hundreds of thousands of tokens or more practical, which was previously infeasible.
3. **Version evolution**:
    * **FlashAttention-2**: Reordering the computation and improving work partitioning raises parallelism further, reaching roughly 50-73% of the theoretical peak FLOPs/s on A100 GPUs in FP16/BF16.

## ⚠️ Contradictions & Updates

* **The compute cost itself remains**: FlashAttention solves the memory-bandwidth problem, but it does not make the growth of compute with sequence length ($O(n^2)$) linear. Ultra-long sequences of a million tokens or more therefore still incur substantial compute cost.
* **Trade-offs in distributed settings**: When combined with context-parallelism techniques such as Ring Attention, splitting the FlashAttention work into finer pieces can cause efficiency penalties due to communication overhead. Hybrid approaches such as USP (Unified Sequence Parallelism) are needed to address this.

## 🔗 Knowledge Links (Graph)

* **Parent concepts**: [[Attention Mechanisms|Attention Mechanisms]], [[LLM Inference Optimization|LLM Inference Optimization]]
* **Child/related techniques**: [[KV Cache|KV Cache]], [[Ring Attention|Ring Attention]], [[Sparse Attention|Sparse Attention]], [[PagedAttention|PagedAttention]]
* **Project applications**: RAG engine with ultra-large context support, autonomous agent analysis loop

---
*Last updated: 2026-05-04*

## 🤖 LLM Usage Hints (How to Use This Knowledge)

**When to use this knowledge:**
- *(TODO)*

**When not to use it:**
- *(TODO)*

## 🧪 Validation Status

- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 automatic normalization.
Body content needs verification.)*

## 🧬 Duplicate Check

- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: filled in old-template and missing fields.

## 🕓 Changelog

| Date | Change | Handling | Trust level |
|------|--------|----------|-------------|
| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |

## 💻 Code Patterns

**Pattern 1:** *(TODO: structural skeleton reflecting this project's conventions)*

```text
# TODO
```

## 🤔 Decision Criteria

**When to choose option A:**
- *(TODO)*

**When to choose option B:**
- *(TODO)*

**Default:**
> *(TODO)*

## ❌ Anti-Patterns

- **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*
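
## 🧮 Worked Example (Tiled Attention Sketch)

The tiling principle described under Structured Knowledge can be illustrated with a minimal sketch. It is not the FlashAttention kernel itself: the real implementation is a fused GPU kernel that tiles both queries and keys/values in SRAM, whereas this pure-Python version only tiles keys/values per query row to show the running-softmax bookkeeping. The names `naive_attention`, `tiled_attention`, and `block_size` are hypothetical and chosen for illustration only.

```python
import math
import random


def naive_attention(Q, K, V):
    """Reference version: builds every full row of the n x n score matrix."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=4):
    """Processes K/V block by block, keeping only O(1) running statistics
    per query row instead of a full row of scores."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = -math.inf     # running maximum of the scores seen so far
        l = 0.0           # running softmax denominator
        acc = [0.0] * dv  # running unnormalized output
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            m_new = max(m, max(scores))
            # Rescale earlier statistics to the new maximum
            # (math.exp(-inf) is 0.0, so the first block starts clean).
            correction = math.exp(m - m_new)
            p = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(p)
            acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, v_blk))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out


# Small random problem; 10 keys with block_size=4 exercises a ragged last block.
random.seed(0)
n, d = 10, 8
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

ref = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block_size=4)
max_diff = max(abs(a - b) for r, t in zip(ref, tiled) for a, b in zip(r, t))
print("max abs difference:", max_diff)
```

The two functions should agree up to floating-point rounding, which reflects the point that tiling changes how data moves through memory but not the result.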