---
id: PREI-AUTO-FLASH-001
category: Unified
confidence_score: 0.98
tags: [auto-reinforced, [[FlashAttention|FlashAttention]], IO-awareness, GPU-optimization, [[LLM|LLM]], long-context]
last_reinforced: 2026-05-05
---

# [[FlashAttention|FlashAttention]]

## 📌 One-Line Insight (The Karpathy Summary)
> "The oxygen mask of modern [[LLM|LLM]]s: it sidesteps the memory-bandwidth bottleneck with a hardware-aware algorithm, so that very large models can keep a 'long memory' while computing dramatically faster."

## 📖 Structured Knowledge (Synthesized Content)
FlashAttention is an attention algorithm that explicitly manages the GPU's fast memory tier in order to minimize input/output (IO) overhead.

1. **Hardware-aware (IO-aware) design**:
    * Starts from the observation that moving data between the GPU's **HBM (main memory)** and **SRAM (fast cache)** is far slower than the arithmetic itself.
    * Uses tiling so that the full attention matrix is never materialized in memory: each tile is computed to completion inside SRAM, and only the result is written back to HBM.
2. **Compute efficiency and context extension**:
    * **Memory efficiency**: Reduces the memory required as a function of sequence length from quadratic ($O(N^2)$) to linear ($O(N)$), which removes the attention matrix as the main cause of OOM (out-of-memory) failures on long sequences.
    * **Speed**: As of FlashAttention-4, up to 1.3x faster than cuDNN and several times faster than standard attention.
3. **Ecosystem compatibility**:
    * Because it preserves the mathematical exactness of the original attention and optimizes only the implementation, it can be combined directly with context-extension techniques such as [[E2LLM|E2LLM]] and [[LongLoRA|LongLoRA]].
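The tiling idea in item 1 and the exactness claim in item 3 can be illustrated with a small sketch. It is plain CPU Python, not the actual GPU kernel, and it makes several simplifying assumptions: only the keys and values are tiled (the real algorithm also tiles the queries), the function names are illustrative, and `block_size` is a hypothetical stand-in for whatever tile fits in SRAM. The point is only to show that a running maximum and running sum (an "online softmax") let attention be computed tile by tile without ever storing a full row of scores, while still matching the standard result.

```python
import math
import random


def standard_attention(q, k, v):
    """Reference attention: every query row sees all N scores at once."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out


def tiled_attention(q, k, v, block_size=2):
    """Same result, but K/V are visited one tile at a time (online softmax).

    Per query row we carry only a running max, a running sum and a running
    output vector between tiles, so at most `block_size` scores exist at once.
    `block_size` is a hypothetical stand-in for the tile that fits in SRAM.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        run_max, run_sum, acc = -math.inf, 0.0, [0.0] * dv
        for start in range(0, len(k), block_size):
            k_tile = k[start:start + block_size]
            v_tile = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_tile]
            new_max = max(run_max, max(scores))
            # Rescale what was accumulated under the old maximum.
            correction = math.exp(run_max - new_max)
            probs = [math.exp(s - new_max) for s in scores]
            run_sum = run_sum * correction + sum(probs)
            acc = [
                a * correction + sum(p * vj[c] for p, vj in zip(probs, v_tile))
                for c, a in enumerate(acc)
            ]
            run_max = new_max
        out.append([a / run_sum for a in acc])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    n, d = 6, 4
    q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    k = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    v = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    ref = standard_attention(q, k, v)
    tiled = tiled_attention(q, k, v, block_size=2)
    max_diff = max(abs(a - b) for r, t in zip(ref, tiled) for a, b in zip(r, t))
    print("max abs difference:", max_diff)
```

In this sketch the state carried between tiles is constant per query row, which is where the linear rather than quadratic memory growth comes from. The two functions should agree up to floating-point rounding for any `block_size`, which mirrors the claim that only the implementation changes and not the math.
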

## ⚠️ Contradictions & RL Update
- **Limits of memory savings (RL Update)**: FlashAttention by itself already pushes peak memory very low, so layering Sparse Attention on top of it yields little additional memory benefit that users would actually notice (diminishing returns).
- **Deepening hardware dependence**: As recent models come to rely heavily on FlashAttention's optimizations, a technical lock-in emerges: on older hardware or other architectures that do not support it, those models struggle to reach their full performance.

## 🔗 Knowledge Links (Graph)
- [[GPU-Memory-Hierarchy|GPU-Memory-Hierarchy]], [[E2LLM|E2LLM]], [[Attention-Mechanism|Attention-Mechanism]], [[Mamba|Mamba]] (shares the hardware-aware design, via its parallel scan)
- **Raw Source**: Datacollector_MAC/out_wiki/FlashAttention.md

---