---
id: [[P-Reinforce|P-Reinforce]]-AUTO-FLAT-001
category: Unified
confidence_score: 1.00
tags: [auto-reinforced, flash-attention, attention-optimization, transformer, gpu-optimization, llm-inference]
last_reinforced: 2026-05-04
---

# [[Flash Attention|Flash Attention]]

## 📌 One-Line Insight (The Karpathy Summary)

> "Liberator of the memory bottleneck: FlashAttention keeps the mathematics of attention intact while using tiling to optimize data movement between GPU SRAM and HBM, delivering a 2-4x speedup together with a dramatic reduction in memory use. It is hardware-aware optimization at its peak."

## 📖 Structured Knowledge (Synthesized Content)

FlashAttention is a hardware-aware optimization technique that addresses the compute and memory bottlenecks of the attention mechanism that arise as the context window of large language models (LLMs) grows.

1. **Core mechanisms**:
    * **Tiling**: The large attention matrix is split into small blocks (tiles) so that the computation runs in fast on-chip GPU SRAM, minimizing the number of accesses to the much slower HBM (high-bandwidth memory).
    * **Recomputation**: Instead of storing the large intermediate matrix in memory, the values needed during backpropagation are recomputed on demand, which lowers memory complexity from $O(n^2)$ to $O(n)$ in the sequence length $n$.
2. **Key results**:
    * **Accuracy preserved**: The result is exact attention, not an approximation. The fundamental compute complexity ($O(n^2d)$) is unchanged, yet wall-clock speed improves by 2-4x because far less data moves between HBM and SRAM.
    * **Context extension**: The improved memory efficiency makes it practical to process long contexts of hundreds of thousands of tokens or more, which was previously infeasible.
3. **Version evolution**:
    * **FlashAttention-2**: Reorders the computation and improves work partitioning to raise parallelism further, reaching up to about 70% of theoretical peak throughput in FP16.

## ⚖️ Trade-offs & Caveats

* **Compute cost remains quadratic**: FlashAttention solves the memory-bandwidth problem, but it does not make the growth of compute with sequence length ($O(n^2)$) linear. Ultra-long sequences of a million tokens or more therefore still incur substantial compute cost.
* **Tension with distributed processing**: When combined with context-parallelism techniques such as Ring Attention, the finer-grained FlashAttention computation can suffer efficiency penalties caused by communication overhead. Hybrid approaches such as USP (Unified Sequence Parallelism) are needed to address this.

## 🔗 Knowledge Links (Graph)

* **Parent concepts**: [[Attention Mechanisms|Attention Mechanisms]], [[LLM Inference Optimization|LLM Inference Optimization]]
* **Child/related techniques**: [[KV Cache|KV Cache]], [[Ring Attention|Ring Attention]], [[Sparse Attention|Sparse Attention]], [[PagedAttention|PagedAttention]]
* **Project applications**: RAG engine with very large context support, autonomous agent analysis loop

---
*Last updated: 2026-05-04*
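
The tiling idea in this note can be illustrated with a small numerical sketch. The code below is not the FlashAttention CUDA kernel. It is a pure-Python illustration, under the assumption of a single head with no masking, of the running (online) softmax that lets attention be computed one key/value block at a time without ever holding the full score matrix. The function names `naive_attention` and `tiled_attention`, the `block_size` parameter, and the random test data are all hypothetical and chosen only for this example.

```python
import math
import random


def naive_attention(Q, K, V):
    """Reference version: builds a full row of scores for every query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))
        ])
    return out


def tiled_attention(Q, K, V, block_size=4):
    """Visits K/V block by block, keeping only O(1) state per query row:
    a running max, a running softmax denominator, and an output accumulator."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        running_max = float("-inf")
        denom = 0.0
        acc = [0.0] * dv
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale everything accumulated under the previous max.
            correction = math.exp(running_max - new_max)
            denom *= correction
            acc = [a * correction for a in acc]
            for s, v in zip(scores, v_blk):
                w = math.exp(s - new_max)
                denom += w
                for j in range(dv):
                    acc[j] += w * v[j]
            running_max = new_max
        out.append([a / denom for a in acc])
    return out


# Hypothetical toy data: 10 tokens, head dimension 8.
random.seed(0)
n, d = 10, 8
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

reference = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block_size=4)
max_diff = max(
    abs(a - b) for row_r, row_t in zip(reference, tiled) for a, b in zip(row_r, row_t)
)
print(f"max abs difference: {max_diff:.2e}")
```

The two results agree up to floating-point rounding, which is the sense in which the note says accuracy is preserved. The real kernels additionally tile over query blocks and keep each tile in SRAM, which this sketch does not model.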
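
To make the $O(n^2)$ versus $O(n)$ memory claim concrete, here is a back-of-the-envelope calculation. It assumes a single attention head and batch size 1, FP16 storage (2 bytes) for the score matrix, and one FP32 value (4 bytes) per query row for the softmax statistics that are kept for recomputation. These sizes and the helper names are assumptions made for illustration, not figures from the note.

```python
def score_matrix_bytes(n, bytes_per_element=2):
    """Memory needed to store the full n x n attention score matrix."""
    return n * n * bytes_per_element


def row_statistics_bytes(n, bytes_per_element=4):
    """Memory for one saved softmax statistic per query row."""
    return n * bytes_per_element


n = 131_072  # a 128K-token context (illustrative choice)
full_gib = score_matrix_bytes(n) / 2**30
stats_mib = row_statistics_bytes(n) / 2**20

print(f"full score matrix: {full_gib} GiB per head")  # 32.0 GiB
print(f"per-row statistics: {stats_mib} MiB per head")  # 0.5 MiB
```

Under these assumptions the stored intermediate state shrinks from tens of gibibytes per head to well under a mebibyte, which is why much longer contexts become feasible.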
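
The caveat that compute stays quadratic can be checked with a similar rough count. The sketch below assumes one head, a forward pass only, a multiply-add counted as 2 FLOPs, and ignores the softmax itself and any causal masking. The function name and the head dimension of 128 are illustrative assumptions.

```python
def attention_forward_flops(n, d):
    """Approximate matmul FLOPs for one head: QK^T and PV each cost 2*n*n*d."""
    return 4 * n * n * d


d = 128  # assumed head dimension
base = attention_forward_flops(8_192, d)
doubled = attention_forward_flops(16_384, d)
ratio = doubled / base
million = attention_forward_flops(1_000_000, d)

print(f"doubling n multiplies FLOPs by {ratio}")  # 4.0
print(f"n = 1M tokens: {million:.2e} FLOPs per head per layer")  # 5.12e+14
```

Doubling the sequence length quadruples the work regardless of how efficiently memory is handled, so FlashAttention changes the constant factor and the memory footprint but not this scaling.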