---
id: wiki-2026-0508-parameter-sharing
title: Parameter Sharing
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Weight Sharing, Tied Weights]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [parameter-sharing, weight-tying, cnn, rnn, model-compression]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---

# Parameter Sharing

## One-line summary

> **"Same weights, different positions."** A single parameter set is reused across multiple computations: translation equivariance (CNN), temporal invariance (RNN), and parameter efficiency (tied input/output embeddings in Transformer language models). It is a fundamental design pattern of modern deep learning.

## Key points

### Motivation

- Parameter explosion: a fully connected layer applied directly to an image can require billions of parameters.
- Inductive bias: weight sharing encodes a prior that the same feature detector is useful at every position or timestep.
- Generalization: fewer parameters usually means less overfitting, provided the encoded prior matches the data.
- Compute: shared weights let the work be expressed as convolutions or batched matmuls, which hardware optimizes well.

### Forms

- **Spatial sharing (CNN)**: the same conv kernel is slid across the image.
- **Temporal sharing (RNN/LSTM/GRU)**: the same recurrent weights are applied at every timestep.
- **Cross-layer sharing**: ALBERT, Universal Transformer; the same layer parameters are reused L times.
- **Tied embeddings**: the input embedding matrix is also used as the output projection (LM head).
- **Multi-head attention**: NOT shared in the standard formulation (each head has its own W_q, W_k, W_v).

### Modern usage

- ALBERT (2019): cross-layer sharing, combined with factorized embeddings, for BERT compression (roughly 9× fewer parameters for the base configuration, 12M vs 108M, and about 18× for large).
- ViT: spatial sharing via the patch embedding (one linear projection applied to every patch).
- Mamba/SSM: temporal sharing via a state-space recurrence whose weights are reused at every timestep.
- LoRA: one low-rank delta per adapted weight matrix, applied identically at every token position.

### Applications

1. CNN image classification (ResNet, ConvNeXt).
2. Sequence modeling (RNN weights shared across timesteps; Transformer attention and FFN weights shared across token positions).
3. Model compression (ALBERT, distillation).
4. Multi-task learning (shared encoder).
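To make the parameter-explosion point from the motivation concrete, the sketch below compares a 3x3 convolution with a hypothetical dense layer that would produce a feature map of the same shape. The 224×224 RGB input and the 64 output channels are illustrative assumptions, not values from the source, and the dense count is computed analytically because such a layer would be far too large to allocate.

```python
import torch.nn as nn


def count_params(module: nn.Module) -> int:
    """Total number of parameter elements registered on a module."""
    return sum(p.numel() for p in module.parameters())


# Assumed input/output sizes, chosen only for illustration.
H, W, C_IN, C_OUT = 224, 224, 3, 64

# Shared 3x3 kernel: the count depends on channels and kernel size only.
conv = nn.Conv2d(C_IN, C_OUT, kernel_size=3, padding=1)
conv_params = count_params(conv)  # 3*64*3*3 + 64 = 1792

# Unshared alternative: every output unit gets its own weight per input pixel.
# Computed by formula (weights + biases) rather than instantiated.
in_features = C_IN * H * W
out_features = C_OUT * H * W
dense_params = in_features * out_features + out_features

print(f"conv params:  {conv_params:,}")
print(f"dense params: {dense_params:,}")
print(f"ratio:        {dense_params // conv_params:,}x")
```

Under these assumptions the dense layer lands in the hundreds of billions of parameters, while the convolution stays under two thousand regardless of image size.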
## 💻 Patterns

### CNN spatial sharing

```python
import torch.nn as nn

# Single set of 3x3 kernels applied to every spatial position
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
# Params: 3*64*3*3 + 64 = 1792 (independent of image size)
```

### Tied input/output embeddings

```python
import torch.nn as nn

class LanguageModel(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        # Tie: both weights have shape (vocab_size, dim), so one tensor serves both
        self.lm_head.weight = self.embed.weight

    def forward(self, x):
        h = self.embed(x)
        return self.lm_head(h)  # output projection adds no extra params
```

### Cross-layer sharing (ALBERT-style)

```python
import torch.nn as nn

class SharedTransformer(nn.Module):
    def __init__(self, num_layers, dim, num_heads=8):
        super().__init__()
        # ONE block; nn.TransformerEncoderLayer stands in for the original TransformerBlock
        self.shared_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.shared_layer(x)  # reuse the same params at every depth
        return x
```

### RNN temporal sharing (built-in)

```python
import torch.nn as nn

rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=2)
# At every timestep t, the same W_ih and W_hh are applied
# Params are independent of sequence length
```

### Detect shared params

```python
from collections import defaultdict

# `model` is any nn.Module. parameters() already skips tied duplicates,
# so request them explicitly (remove_duplicate exists in recent PyTorch).
names_by_tensor = defaultdict(list)
unique = 0
for name, p in model.named_parameters(remove_duplicate=False):
    if id(p) not in names_by_tensor:
        unique += p.numel()  # count each underlying tensor once
    names_by_tensor[id(p)].append(name)

shared = [names for names in names_by_tensor.values() if len(names) > 1]
print(f"Unique params: {unique}")
print(f"Shared groups: {shared}")
```

### Multi-task shared encoder

```python
import torch.nn as nn
from torchvision.models import resnet50

class MultiTaskModel(nn.Module):
    def __init__(self, num_classes=1000, num_box_coords=4):
        super().__init__()
        self.encoder = resnet50(weights=None)  # SHARED backbone
        self.encoder.fc = nn.Identity()        # expose 2048-d pooled features
        self.classifier = nn.Linear(2048, num_classes)
        # Stand-in for the original DetectionHead(2048): a simple box regressor
        self.detector = nn.Linear(2048, num_box_coords)

    def forward(self, x, task):
        features = self.encoder(x)
        head = self.classifier if task == "cls" else self.detector
        return head(features)
```

## Decision criteria

| Situation | Approach |
|---|---|
| Image input | CNN spatial sharing |
| Sequence input | RNN or Transformer (weights shared across positions) |
| Memory constrained, many layers | Cross-layer sharing (ALBERT) |
| LM with large vocab | Tied embeddings (saves vocab*dim params) |
| Multi-task, related tasks | Shared encoder |
| Unrelated tasks | Don't force sharing; it degrades quality |

**Default**: tied embeddings + CNN spatial / Transformer positional sharing.

## 🔗 Graph

- Parent: [[Inductive-Bias]] · [[LLM_Optimization_and_Deployment_Strategies|Model-Compression]]
- Applications: [[CNN]] · [[데이터_사이언스_및_ML_엔지니어링|RNN]] · [[Transformer]]

## 🤖 LLM usage

**When**: designing an efficient architecture, debugging a parameter count, applying an inductive bias.

**When not**: tasks or positions are truly independent (forcing sharing hurts quality).

## ❌ Anti-patterns

- **Over-sharing**: sharing ALL layers can cause a noticeable quality drop on complex tasks at a fixed model width.
- **No tied embeddings on a small LM**: vocab=50k, dim=512 → about 25.6M parameters spent on a separate output matrix.
- **Sharing across modalities**: vision encoder weights ≠ text encoder weights (use separate CLIP-style encoders).
- **Forgetting that LayerNorm need not be shared**: when sharing weight matrices across layers, consider keeping LayerNorm parameters per-layer; they are cheap and let each depth rescale its activations.

## 🧪 Verification / duplicates

- Verified (LeCun 1989 CNN, ALBERT paper, Press & Wolf 2017 tied embeddings).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: sharing forms, modern usage, patterns |