---
id: [[P-Reinforce|P-Reinforce]]-AUTO-GQAM-001
category: Unified
confidence_score: 1.00
tags: [auto-reinforced, grouped-query-attention, gqa, transformer, mha, mqa, llm-efficiency]
last_reinforced: 2026-05-04
---

# [[Grouped-Query Attention (GQA)|Grouped-Query Attention (GQA)]]

## 📌 One-Line Insight (The Karpathy Summary)

> "The golden ratio of efficiency and quality: between the heavy cost of MHA, where every head keeps its own Key-Value pair, and the quality loss of MQA, where all heads share a single KV, GQA takes the clever compromise of 'grouped KV sharing' to win on both inference speed and quality, and has become the standard for modern LLMs."

## 📖 Structured Knowledge (Synthesized Content)

Grouped-Query Attention (GQA) is an attention variant for the Transformer architecture. It shrinks the memory footprint of the KV cache (Key-Value cache) to improve inference efficiency while preserving the model's expressive power.

1. **Background**:
    * **MHA (Multi-Head Attention)**: every Query head has its own Key/Value head $\rightarrow$ excellent quality, but the KV cache becomes very large.
    * **MQA (Multi-Query Attention)**: all Query heads share a single Key/Value head $\rightarrow$ very fast, but quality degrades.
2. **Core Mechanism**:
    * **Grouping**: several Query heads are bundled into one group, and each group is assigned a single Key/Value head.
    * **Trade-off**: GQA picks a 'middle ground' that uses less memory than MHA and preserves more information than MQA.
3. **Significance**:
    * A standard technique adopted by recent open-source SOTA models such as Llama 2 (its larger variants), Llama 3, and Mistral.
    * For long-context workloads in particular, it substantially lowers the share of VRAM taken by the KV cache, so the same hardware can handle larger batch sizes or longer sequences.

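The grouping mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any particular model's implementation: it omits batching, causal masking, and the projection matrices, and the function name `gqa_attention` and the tensor shapes are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa_attention(q, k, v):
    """Grouped-query attention (illustrative: no batch dimension, no mask).

    q:    (n_q_heads,  seq_len, head_dim)
    k, v: (n_kv_heads, seq_len, head_dim), with n_q_heads % n_kv_heads == 0
    """
    n_q_heads, _, head_dim = q.shape
    n_kv_heads = k.shape[0]
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads

    # Each KV head serves `group_size` consecutive query heads.
    k = np.repeat(k, group_size, axis=0)
    v = np.repeat(v, group_size, axis=0)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))  # 8 query heads
k = rng.standard_normal((2, 5, 16))  # 2 KV heads -> 2 groups of 4 query heads
v = rng.standard_normal((2, 5, 16))
out = gqa_attention(q, k, v)
print(out.shape)  # (8, 5, 16)
```

With `n_kv_heads == n_q_heads` the same function reduces to MHA, and with `n_kv_heads == 1` it reduces to MQA. Only the `k` and `v` tensors before the `np.repeat` step would need to be cached, which is where the memory saving comes from.
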
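To make the VRAM claim concrete, the KV cache size can be estimated as 2 (one Key and one Value tensor) × layers × KV heads × head dimension × sequence length × batch size × bytes per element. The sketch below applies that formula to a hypothetical configuration (32 layers, 32 query heads, head dimension 128, 8192 tokens, 2-byte elements). These numbers are assumptions for illustration and do not describe any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    # Factor 2: one Key tensor and one Value tensor per layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical model shape; only the number of KV heads changes per variant.
cfg = dict(n_layers=32, head_dim=128, seq_len=8192)

for name, n_kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(n_kv_heads=n_kv_heads, **cfg) / 2**30
    print(f"{name}: {gib:.3f} GiB")
# MHA: 4.000 GiB
# GQA: 1.000 GiB
# MQA: 0.125 GiB
```

Because the cache scales linearly with the number of KV heads, going from 32 to 8 KV heads cuts it by a factor of four under these assumptions, and the freed memory can go to a larger batch or a longer context.
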
## ⚖️ Trade-offs & Caveats

* **Quality vs. efficiency**: Increasing the number of groups ($G$) moves the model toward MHA: quality improves, but the KV cache grows. Decreasing it moves the model toward MQA: efficiency improves, but quality drops. At the extremes, $G$ equal to the number of Query heads is MHA and $G = 1$ is MQA.
* **Fixed model architecture**: The group structure must be decided at training time, so an existing MHA model cannot simply be switched to GQA at inference time. Converting one requires additional upcycling (uptraining).

## 🔗 Knowledge Links (Graph)

* **Parent concepts**: [[Attention Mechanisms|Attention Mechanisms]], [[LLM Inference Optimization|LLM Inference Optimization]]
* **Contrasting techniques**: [[Multi-Head Attention (MHA)|Multi-Head Attention (MHA)]], [[Multi-Query Attention (MQA)|Multi-Query Attention (MQA)]]
* **Related techniques**: [[KV Cache|KV Cache]], [[PagedAttention|PagedAttention]], [[Flash Attention|Flash Attention]]

---

*Last updated: 2026-05-04*