---
id: wiki-2026-0508-parameter-sharing
title: Parameter Sharing
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Weight Sharing, Tied Weights]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [parameter-sharing, weight-tying, cnn, rnn, model-compression]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---
# Parameter Sharing
## One-line summary
> **"Same weights, different positions."** A single parameter set is reused across multiple computations: translation equivariance in CNNs, time invariance in RNNs, and parameter efficiency from tied input/output embeddings in Transformer language models. It is a fundamental design pattern of modern deep learning.
## Core concepts
### Motivation
- Parameter explosion: a fully connected layer on a raw image can need hundreds of millions to billions of parameters.
- Inductive bias: weight sharing encodes a prior (translation or time invariance).
- Generalization: fewer parameters often generalize better (less overfitting).
- Compute: shared weights enable optimized convolution and matmul kernels.
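The parameter-explosion point can be checked with quick arithmetic. This is a minimal sketch; the 224×224 RGB input, the 4096 hidden units, and the helper names `dense_params` and `conv2d_params` are illustrative assumptions, not values from a specific model.

```python
def dense_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def conv2d_params(in_ch: int, out_ch: int, k: int) -> int:
    """Weights plus biases of a k x k convolution; independent of image size."""
    return in_ch * out_ch * k * k + out_ch


# Hypothetical setup: flattened 224x224 RGB image into 4096 hidden units,
# versus 64 shared 3x3 filters over the same image.
fc = dense_params(224 * 224 * 3, 4096)
conv = conv2d_params(3, 64, 3)
print(f"fully connected: {fc:,}  conv: {conv:,}")
```

Under these assumptions a single dense layer already costs several hundred million parameters, while the shared-kernel layer stays under two thousand.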
### Forms
- **Spatial sharing (CNN)**: the same conv kernel is slid across the image.
- **Temporal sharing (RNN/LSTM/GRU)**: the same recurrent weights are applied at every timestep.
- **Cross-layer sharing**: ALBERT and Universal Transformer reuse the same layer parameters L times.
- **Tied embeddings**: the input embedding matrix doubles as the output projection (LM head).
- **Multi-head attention**: NOT shared across heads (each head has its own W_q, W_k, W_v).
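### Modern usage
- ALBERT (2019): cross-layer sharing plus factorized embeddings for BERT compression (about 9× fewer parameters for the base configuration, 12M vs. 108M, and about 18× for large, 18M vs. 334M).
- ViT: spatial sharing via patch embedding (one linear projection applied to every patch).
- Mamba/SSM: temporal sharing via state-space recurrence.
- LoRA: a single low-rank delta per adapted weight matrix, shared across positions.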
### Applications
1. CNN image classification (ResNet, ConvNeXt).
2. Sequence modeling (RNNs; Transformers, which apply the same weights at every position).
3. Model compression (ALBERT, distillation).
4. Multi-task learning (shared encoder).
## 💻 Patterns
### CNN spatial sharing
```python
import torch.nn as nn
# 64 filters of shape 3x3x3, each applied at every spatial position
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
# Params: 3*64*3*3 + 64 = 1792 (independent of image size)
```
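To illustrate that the parameter count does not depend on image size, the following sketch applies one convolution to two inputs of different resolution. The input sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
n_params = sum(p.numel() for p in conv.parameters())

# The same weights handle both resolutions; only the output size changes.
small = conv(torch.randn(1, 3, 32, 32))
large = conv(torch.randn(1, 3, 224, 224))
print(n_params, tuple(small.shape), tuple(large.shape))
```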
### Tied input/output embeddings
```python
import torch.nn as nn

class LanguageModel(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Tie: both weights have shape (vocab_size, dim), so they can be one Parameter
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # share!

    def forward(self, x):
        h = self.embed(x)
        return self.lm_head(h)  # no extra params
```
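A quick way to confirm that tying took effect is to compare storage pointers and count parameters. This is a self-contained sketch; the class name `TiedLM` and the small sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class TiedLM(nn.Module):
    """Minimal LM skeleton with the LM head tied to the input embedding."""

    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # same Parameter object

    def forward(self, x):
        return self.lm_head(self.embed(x))


vocab_size, dim = 1000, 32  # small illustrative sizes
model = TiedLM(vocab_size, dim)

# Both modules should point at the same underlying storage.
same_storage = model.lm_head.weight.data_ptr() == model.embed.weight.data_ptr()
# parameters() skips duplicate tensors, so the tied matrix is counted once.
n_params = sum(p.numel() for p in model.parameters())
logits = model(torch.randint(0, vocab_size, (2, 5)))
print(same_storage, n_params, tuple(logits.shape))
```

If `n_params` came out as `2 * vocab_size * dim`, the tie would have been lost, for example by re-initializing one of the two modules afterwards.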
### Cross-layer sharing (ALBERT-style)
```python
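import torch.nn as nn

class SharedTransformer(nn.Module):
    def __init__(self, num_layers, dim, num_heads=8):
        super().__init__()
        # ONE block; nn.TransformerEncoderLayer stands in for the
        # otherwise undefined TransformerBlock (num_heads is an added assumption)
        self.shared_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.shared_layer(x)  # reuse the same params at every depth
        return x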
```
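The saving from cross-layer sharing can be measured by comparing a stack of independent blocks with one reused block. This is a rough sketch; the sizes and the use of `nn.TransformerEncoderLayer` as the block are assumptions for illustration, not ALBERT's actual configuration.

```python
import torch
import torch.nn as nn


def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


dim, heads, depth = 64, 4, 6  # illustrative sizes


def make_block() -> nn.Module:
    return nn.TransformerEncoderLayer(dim, heads, dim_feedforward=128, batch_first=True)


unshared = nn.ModuleList([make_block() for _ in range(depth)])  # depth separate blocks
shared = make_block()                                           # one block, reused

unshared_count = count_params(unshared)
shared_count = count_params(shared)

# Applying the shared block `depth` times gives the same depth of computation.
x = torch.randn(2, 10, dim)
shared.eval()
with torch.no_grad():
    for _ in range(depth):
        x = shared(x)
print(unshared_count, shared_count, tuple(x.shape))
```

Parameter count drops by a factor of `depth`, while compute per forward pass stays the same: sharing saves memory, not FLOPs.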
### RNN temporal sharing (built-in)
```python
rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=2)
# At every timestep t, same W_ih, W_hh applied
# Params independent of sequence length
```
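The claim that the parameter count is independent of sequence length can be verified directly. A minimal sketch using the same GRU configuration; the sequence lengths 5 and 500 are arbitrary.

```python
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=2)
n_params = sum(p.numel() for p in rnn.parameters())

# Default layout is (seq_len, batch, input_size).
short = torch.randn(5, 1, 128)
long = torch.randn(500, 1, 128)
out_short, _ = rnn(short)
out_long, _ = rnn(long)
print(n_params, tuple(out_short.shape), tuple(out_long.shape))
```

The same module processes both sequences; only the output length differs.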
### Detect shared params
```python
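# Count unique parameter tensors; `model` is any nn.Module defined elsewhere.
# Note: model.parameters() already skips duplicate (shared) tensors by default,
# so the id() check is a defensive guard rather than the detection mechanism.
seen = set()
unique = 0
for p in model.parameters():
    if id(p) not in seen:
        seen.add(id(p))
        unique += p.numel()
print(f"Unique params: {unique}")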
```
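To find out which parameters are shared, rather than only counting unique ones, the names can be grouped by tensor identity. This sketch assumes a PyTorch version in which `named_parameters` accepts `remove_duplicate=False` (2.x); the helper name `find_shared_parameters` and the toy model are hypothetical.

```python
from collections import defaultdict

import torch.nn as nn


def find_shared_parameters(model: nn.Module) -> list[list[str]]:
    """Return groups of parameter names that refer to the same tensor."""
    groups = defaultdict(list)
    # remove_duplicate=False yields every name, including tied ones.
    for name, p in model.named_parameters(remove_duplicate=False):
        groups[id(p)].append(name)
    return [names for names in groups.values() if len(names) > 1]


# Toy model with a tied embedding and LM head.
embed = nn.Embedding(100, 16)
head = nn.Linear(16, 100, bias=False)
head.weight = embed.weight
model = nn.ModuleDict({"embed": embed, "lm_head": head})

shared_groups = find_shared_parameters(model)
print(shared_groups)
```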
### Multi-task shared encoder
```python
# ResNet50 and DetectionHead are placeholders assumed to be defined elsewhere;
# the encoder is assumed to return 2048-d features.
class MultiTaskModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = ResNet50()  # SHARED across tasks
        self.classifier = nn.Linear(2048, 1000)
        self.detector = DetectionHead(2048)

    def forward(self, x, task):
        features = self.encoder(x)
        if task == "cls":
            return self.classifier(features)
        return self.detector(features)
```
## Decision criteria
| Situation | Approach |
|---|---|
| Image input | CNN spatial sharing |
| Sequence input | RNN or Transformer (positional sharing) |
| Memory constrained, many layers | Cross-layer sharing (ALBERT) |
| LM with large vocab | Tied embeddings (saves vocab*dim params) |
| Multi-task related | Shared encoder |
| Tasks unrelated | Don't force sharing — degrades quality |
**Default**: tied embeddings + CNN spatial / Transformer positional sharing.
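The `vocab*dim` saving in the table is easy to estimate before committing to a design. A small sketch; the function name `tied_embedding_savings` is hypothetical, and the vocabulary and width are example values.

```python
def tied_embedding_savings(vocab_size: int, dim: int) -> int:
    """Parameters saved by reusing the embedding matrix as the LM head."""
    return vocab_size * dim


# Example values: a 50k-token vocabulary with 512-d embeddings.
saved = tied_embedding_savings(50_000, 512)
print(f"{saved:,} parameters saved")
```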
## 🔗 Graph
- Parent: [[Inductive-Bias]] · [[LLM_Optimization_and_Deployment_Strategies|Model-Compression]]
- Applications: [[CNN]] · [[데이터 사이언스 및 ML 엔지니어링|RNN]] · [[Transformer]]
## 🤖 LLM usage
**When**: designing an efficient architecture, debugging parameter counts, applying an inductive bias.
**When not**: tasks or positions are truly independent (forcing sharing hurts quality).
## ❌ Anti-patterns
- **Over-sharing**: sharing ALL layers can cause a severe quality drop on complex tasks.
- **No tied embeddings on a small LM**: vocab=50k, dim=512 → about 25.6M wasted parameters.
- **Sharing across modalities**: a vision encoder and a text encoder should not share weights (use CLIP-style separate encoders).
- **Forgetting that LayerNorm is not shared**: when sharing across layers, share the W matrices but keep LayerNorm per layer.
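The last point, shared weight matrices with per-layer LayerNorm, might look like the sketch below. It is only one possible arrangement under stated assumptions: the class name `SharedFFNPerLayerNorm`, the pre-norm residual layout, and the sizes are all illustrative, and attention is omitted for brevity.

```python
import torch
import torch.nn as nn


class SharedFFNPerLayerNorm(nn.Module):
    """One feed-forward block reused at every depth, with a separate LayerNorm per depth."""

    def __init__(self, num_layers: int, dim: int, hidden: int):
        super().__init__()
        self.shared_ffn = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )
        # LayerNorm parameters are cheap, so each depth keeps its own.
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_layers)])

    def forward(self, x):
        for norm in self.norms:
            x = x + self.shared_ffn(norm(x))  # pre-norm residual step
        return x


model = SharedFFNPerLayerNorm(num_layers=4, dim=32, hidden=64)
ffn_params = sum(p.numel() for p in model.shared_ffn.parameters())
norm_params = sum(p.numel() for p in model.norms.parameters())
out = model(torch.randn(2, 7, 32))
print(ffn_params, norm_params, tuple(out.shape))
```

The per-layer norms add only `2 * dim` parameters per depth, which is small next to the shared matrices.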
## 🧪 Verification / duplicates
- Verified (LeCun 1989 CNN, ALBERT paper, Press & Wolf 2017 tied embeddings).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — sharing forms, modern usage, patterns |