--- id: DL-TR-MHA-001 category: "10_Wiki/๐Ÿ’ก Topics/AI" confidence_score: 1.0 tags: [ai, deep-learning, transformer, multi-head-attention, self-attention, nlp] last_reinforced: 2026-04-26 --- # Multi-Head Attention Mechanism (๋ฉ€ํ‹ฐ ํ—ค๋“œ ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜) ## ๐Ÿ“Œ ํ•œ ์ค„ ํ†ต์ฐฐ (The Karpathy Summary) > "์ •๋ณด์˜ ๋ฐ”๋‹ค๋ฅผ ํ•œ ์Œ์˜ ๋ˆˆ์ด ์•„๋‹Œ, ์„œ๋กœ ๋‹ค๋ฅธ ๊ด€์ ์„ ๊ฐ€์ง„ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋ˆˆ์œผ๋กœ ๋™์‹œ์— ์ฃผ์‹œํ•˜์—ฌ ์ž…์ฒด์ ์ธ ๋ฌธ๋งฅ์„ ์™„์„ฑํ•˜๋ผ" โ€” ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋…๋ฆฝ์ ์ธ ํ•˜์œ„ ๊ณต๊ฐ„(Subspaces)์œผ๋กœ ํˆฌ์˜ํ•˜์—ฌ ๋‹ค์–‘ํ•œ ๊ด€๊ณ„ ์ •๋ณด๋ฅผ ๋ณ‘๋ ฌ์ ์œผ๋กœ ํ•™์Šตํ•˜๊ณ  ํ†ตํ•ฉํ•˜๋Š” ํŠธ๋žœ์Šคํฌ๋จธ ์•„ํ‚คํ…์ฒ˜์˜ ํ•ต์‹ฌ ๋ฉ”์ปค๋‹ˆ์ฆ˜. ## ๐Ÿ“– ๊ตฌ์กฐํ™”๋œ ์ง€์‹ (Synthesized Content) - **์ถ”์ถœ๋œ ํŒจํ„ด:** "Parallel Diverse Representation" โ€” ๋‹จ์ผ ์–ดํ…์…˜์ด ๊ฐ€์ง„ ํŽธํ–ฅ์„ฑ์„ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด, ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์„ ์—ฌ๋Ÿฌ ๋ญ‰์น˜(Heads)๋กœ ๋‚˜๋ˆ„์–ด ๊ฐ ํ—ค๋“œ๊ฐ€ ๋ฌธ๋ฒ•์  ๊ด€๊ณ„, ์˜๋ฏธ์  ๊ด€๊ณ„, ์žฅ๊ฑฐ๋ฆฌ ์˜์กด์„ฑ ๋“ฑ ์„œ๋กœ ๋‹ค๋ฅธ ํŠน์ง•์— ์ง‘์ค‘ํ•˜๊ฒŒ ๋งŒ๋“  ํ›„ ์ด๋ฅผ ๋‹ค์‹œ ํ•ฉ์ณ(Concatenate) ํ’๋ถ€ํ•œ ํ‘œํ˜„๋ ฅ์„ ํ™•๋ณดํ•˜๋Š” ํŒจํ„ด. - **์ž‘๋™ ์›๋ฆฌ:** - **Linear Projection:** ์ž…๋ ฅ ๋ฒกํ„ฐ๋ฅผ $h$๊ฐœ์˜ ๋‹ค๋ฅธ ๊ฐ€์ค‘์น˜๋กœ ํˆฌ์˜ํ•˜์—ฌ ์ฟผ๋ฆฌ(Q), ํ‚ค(K), ๊ฐ’(V) ์ƒ์„ฑ. - **Scaled Dot-Product Attention:** ๊ฐ ํ—ค๋“œ๋ณ„๋กœ ๋…๋ฆฝ์ ์ธ ์–ดํ…์…˜ ์Šค์ฝ”์–ด ๊ณ„์‚ฐ. - **Concat & Linear:** ๋ชจ๋“  ํ—ค๋“œ์˜ ๊ฒฐ๊ณผ๋ฅผ ์ด์–ด ๋ถ™์ด๊ณ  ์ตœ์ข… ์„ ํ˜• ๋ณ€ํ™˜์„ ํ†ตํ•ด ์ฐจ์› ์œ ์ง€. - **์˜์˜:** ๋ฌธ๋งฅ์˜ ์ค‘์˜์„ฑ์„ ํ•ด์†Œํ•˜๊ณ  ๋ฌธ์žฅ ๋‚ด ๋ณต์žกํ•œ ์ƒํ˜ธ์ž‘์šฉ์„ ํ•œ ๋ฒˆ์— ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•˜์—ฌ, NLP๋ฅผ ๋„˜์–ด ๋น„์ „, ์˜ค๋””์˜ค ๋“ฑ ๋ชจ๋“  AI ๋„๋ฉ”์ธ์˜ ํ‘œ์ค€ ์ถ”๋ก  ๋ฐฉ์‹์œผ๋กœ ์ •์ฐฉ๋จ. ## โš ๏ธ ๋ชจ์ˆœ ๋ฐ ์—…๋ฐ์ดํŠธ (Contradictions & RL Update) - **๊ณผ๊ฑฐ ๋ฐ์ดํ„ฐ์™€์˜ ์ถฉ๋Œ:** ํ—ค๋“œ๊ฐ€ ๋งŽ์„์ˆ˜๋ก ๋ฌด์กฐ๊ฑด ์ข‹๋‹ค๋Š” ํ†ต๋…์—์„œ ๋ฒ—์–ด๋‚˜, ํŠน์ • ํ—ค๋“œ๊ฐ€ ์ค‘๋ณต๋œ ์ •๋ณด๋ฅผ ํ•™์Šตํ•˜๊ฑฐ๋‚˜ ์ค‘์š”๋„๊ฐ€ ๋‚ฎ์€ ํ—ค๋“œ๊ฐ€ ์กด์žฌํ•  ์ˆ˜ ์žˆ์Œ์ด ๋ฐํ˜€์ ธ, ์ตœ๊ทผ์—๋Š” ํ—ค๋“œ๋ณ„ ๊ฐ€์ค‘์น˜๋ฅผ ์กฐ์ ˆํ•˜๊ฑฐ๋‚˜ ๊ฐ€์ง€์น˜๊ธฐ(Pruning)ํ•˜๋Š” ๊ธฐ์ˆ ๋„ ์—ฐ๊ตฌ๋จ. - **์ •์ฑ… ๋ณ€ํ™”:** Antigravity ํ”„๋กœ์ ํŠธ์˜ ํ•ต์‹ฌ ์ถ”๋ก  ์—”์ง„์€ ์ง€์‹ ๋ฌธ์„œ์˜ ๊ตฌ์กฐ์  ๊ณ„์ธต๊ณผ ํ…์ŠคํŠธ์˜ ์˜๋ฏธ์  ์—ฐ๊ฒฐ์„ ๋™์‹œ์— ํฌ์ฐฉํ•˜๊ธฐ ์œ„ํ•ด ์ตœ์ ํ™”๋œ 8๊ฐœ ์ด์ƒ์˜ ๋ฉ€ํ‹ฐ ํ—ค๋“œ ์–ดํ…์…˜ ๋ ˆ์ด์–ด๋ฅผ ์šด์šฉํ•จ. ## ๐Ÿ”— ์ง€์‹ ์—ฐ๊ฒฐ (Graph) - Transformer-Architecture-Foundations, Self-Attention-Foundations, [[GPT-Architecture-Foundations|GPT-Architecture-Foundations]], BERT-Foundations - **Raw Source:** 10_Wiki/Topics/AI/Multi-Head-Attention-Mechanism.md