id: wiki-2026-0508-parameter-sharing
title: Parameter Sharing
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Weight Sharing, Tied Weights
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: parameter-sharing, weight-tying, cnn, rnn, model-compression
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language python, framework pytorch

Parameter Sharing

In one line

"Same weights, different positions." A single parameter set is reused across many computations, which buys translation equivariance (CNN), time invariance (RNN), and parameter efficiency (position-wise Transformer FFN, tied embeddings). It is a fundamental design pattern of modern deep learning.

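Key points

Motivation

  • Parameter explosion: a fully connected layer on a 224×224×3 image (150,528 inputs) needs about 150M weights for only 1,000 hidden units, and billions for an image-sized hidden layer.
  • Inductive bias: weight sharing encodes a prior (the same feature detector is useful at every position or timestep).
  • Generalization: fewer parameters often generalize better (less overfitting), provided the shared structure matches the data.
  • Compute: shared weights let the work be expressed as convolutions or batched matmuls, which hardware and libraries optimize well.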
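The parameter-explosion point can be checked with plain arithmetic. The sketch below is only an illustration: the 224×224 RGB input, the 1,000-unit dense layer, and the 64-filter 3×3 convolution are assumed sizes, and the helper names are hypothetical.

```python
def dense_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def conv2d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights plus biases of a square 2D convolution (one kernel shared by all positions)."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


# Assumed input: a 224x224 RGB image, flattened for the dense layer.
height, width, channels = 224, 224, 3
flat_inputs = height * width * channels   # 150528

dense = dense_params(flat_inputs, 1000)   # grows with the image size
conv = conv2d_params(channels, 64, 3)     # independent of the image size

print(f"dense: {dense:,}  conv: {conv:,}  ratio: {dense / conv:,.0f}x")
```

Doubling the image side quadruples `dense` but leaves `conv` unchanged, because the kernel is reused at every position.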

Forms

  • Spatial sharing (CNN): the same conv kernel is slid across the image.
  • Temporal sharing (RNN/LSTM/GRU): the same recurrent weights are applied at every timestep.
  • Cross-layer sharing (ALBERT, Universal Transformer): the same layer parameters are reused L times.
  • Tied embeddings: input embedding == output projection (LM head).
  • Multi-head: NOT shared in standard attention (each head has its own W_q, W_k, W_v); multi-query and grouped-query attention are the exception, sharing K/V projections across heads.
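The prior encoded by spatial sharing can be observed directly: because one kernel is applied everywhere, shifting the input shifts the output by the same amount. The check below is a minimal sketch; it assumes circular padding so that the property holds exactly at the borders, and the tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Circular padding makes the shift property exact even at the image border.
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular", bias=False)

x = torch.randn(1, 1, 8, 8)
shift = 2

with torch.no_grad():
    shifted_then_conv = conv(torch.roll(x, shifts=shift, dims=-1))
    conv_then_shifted = torch.roll(conv(x), shifts=shift, dims=-1)

# Same kernel at every position, so the two orders of operation agree.
equivariant = torch.allclose(shifted_then_conv, conv_then_shifted, atol=1e-5)
print(equivariant)
```

With zero padding the same relation holds away from the borders only, which is why CNNs are usually described as approximately translation equivariant.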

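Modern usage

  • ALBERT (2019): cross-layer sharing, combined with factorized embeddings, for BERT compression (about 9× fewer parameters for the base model, 108M → 12M; about 18× for large, 334M → 18M).
  • ViT: spatial sharing via patch embedding (one linear projection applied to every patch).
  • Mamba/SSM: temporal sharing via state-space recurrence.
  • LoRA: one low-rank update ΔW = BA per adapted weight matrix, applied at every token position like the base weight.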
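As an illustration of the ViT entry, a patch embedding can be written as a convolution whose kernel size equals its stride, so a single projection is shared by all patches. This is a sketch under assumed sizes (patch 16, width 192), not the reference ViT code.

```python
import torch
import torch.nn as nn

patch, dim = 16, 192  # assumed patch size and embedding width

# kernel_size == stride == patch: one linear map applied to each non-overlapping patch.
patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
n_params = sum(p.numel() for p in patch_embed.parameters())

# (batch, dim, H/patch, W/patch) -> (batch, num_patches, dim)
small = patch_embed(torch.randn(1, 3, 224, 224)).flatten(2).transpose(1, 2)
large = patch_embed(torch.randn(1, 3, 384, 384)).flatten(2).transpose(1, 2)

print(n_params, tuple(small.shape), tuple(large.shape))
```

The number of tokens grows with resolution (196 versus 576 here) while the parameter count stays fixed.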

Applications

  1. CNN image classification (ResNet, ConvNeXt).
  2. Sequence modeling (RNN; Transformer layers applied identically at every position).
  3. Model compression (ALBERT, distillation).
  4. Multi-task learning (shared encoder).

💻 Patterns

CNN spatial sharing

import torch.nn as nn
# Single 3x3 kernel applied to every spatial position
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
# Params: 3*64*3*3 + 64 = 1792 (independent of image size)

Tied input/output embeddings

class LanguageModel(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # tie: lm_head.weight = embed.weight
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # share!

    def forward(self, x):
        h = self.embed(x)
        return self.lm_head(h)  # no extra params
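A quick way to confirm that a tie actually took effect is to check object identity and the parameter count. The snippet below is a self-contained sketch with small assumed sizes; `TiedLM` is a hypothetical name mirroring the pattern above.

```python
import torch
import torch.nn as nn


class TiedLM(nn.Module):
    """Toy LM whose output projection reuses the embedding matrix."""

    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)             # weight: (vocab, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)  # weight: (vocab, dim)
        self.lm_head.weight = self.embed.weight                # same Parameter object

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.embed(tokens))


vocab_size, dim = 1000, 32  # small assumed sizes so the check runs quickly
model = TiedLM(vocab_size, dim)

is_tied = model.lm_head.weight is model.embed.weight
n_params = sum(p.numel() for p in model.parameters())  # shared tensor counted once

logits = model(torch.randint(0, vocab_size, (2, 5)))
logits.sum().backward()  # gradients from both uses accumulate in one .grad
has_grad = model.embed.weight.grad is not None

print(is_tied, n_params, tuple(logits.shape), has_grad)
```

The tie works without a transpose because `nn.Embedding` and `nn.Linear` both store their weight as `(vocab_size, dim)`.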

Cross-layer sharing (ALBERT-style)

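class SharedTransformer(nn.Module):
    def __init__(self, num_layers, dim, num_heads=8):
        super().__init__()
        # ONE block; nn.TransformerEncoderLayer stands in for a custom TransformerBlock.
        # dim must be divisible by num_heads.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.num_layers = num_layers

    def forward(self, x):  # x: (batch, seq, dim)
        for _ in range(self.num_layers):
            x = self.shared_layer(x)  # reuse the same params on every pass
        return x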
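To see what cross-layer sharing saves, one can compare a stack of independent blocks with a single block applied repeatedly. This is a rough sketch with assumed toy sizes; `count` and `make_block` are hypothetical helpers, and `nn.TransformerEncoderLayer` is used as a generic stand-in block.

```python
import torch
import torch.nn as nn


def count(module: nn.Module) -> int:
    """Number of unique parameters in a module."""
    return sum(p.numel() for p in module.parameters())


dim, heads, depth = 64, 4, 6  # assumed toy sizes


def make_block() -> nn.Module:
    return nn.TransformerEncoderLayer(
        d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
    )


unshared = nn.ModuleList([make_block() for _ in range(depth)])  # depth separate blocks
shared = make_block()                                           # one block, reused

x = torch.randn(2, 10, dim)
h = x
for _ in range(depth):
    h = shared(h)  # same effective depth, one set of weights

print(count(unshared), count(shared), tuple(h.shape))
```

Compute per forward pass is the same in both cases; only the stored parameters shrink, by a factor equal to the depth.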

RNN temporal sharing (built-in)

rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=2)
# At every timestep t, same W_ih, W_hh applied
# Params independent of sequence length
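Unrolling the recurrence by hand makes the temporal sharing explicit. The sketch below uses `nn.GRUCell` with assumed small sizes; `run` is a hypothetical helper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

cell = nn.GRUCell(input_size=8, hidden_size=16)
n_params = sum(p.numel() for p in cell.parameters())


def run(seq_len: int) -> torch.Tensor:
    """Apply the same cell at every timestep of a random sequence."""
    x = torch.randn(seq_len, 4, 8)  # (time, batch, features)
    h = torch.zeros(4, 16)
    for t in range(seq_len):
        h = cell(x[t], h)           # identical weights at each step
    return h


h_short = run(5)
h_long = run(50)
print(n_params, tuple(h_short.shape), tuple(h_long.shape))
```

A sequence ten times longer costs ten times the compute but not a single extra parameter.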

Detect shared params

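# model.parameters() already skips tensors it has seen, so to detect sharing
# iterate with remove_duplicate=False and group names by tensor identity.
# `model` is any nn.Module instance.
seen = {}
unique = 0
for name, p in model.named_parameters(remove_duplicate=False):
    if id(p) in seen:
        print(f"{name} shares its tensor with {seen[id(p)]}")
    else:
        seen[id(p)] = name
        unique += p.numel()
print(f"Unique params: {unique}")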

Multi-task shared encoder

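from torchvision.models import resnet50

class MultiTaskModel(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.encoder = resnet50(weights=None)  # SHARED backbone
        self.encoder.fc = nn.Identity()        # expose the 2048-d pooled features
        self.classifier = nn.Linear(2048, num_classes)
        # Placeholder for a real detection head: regresses one box (x, y, w, h)
        self.detector = nn.Linear(2048, 4)

    def forward(self, x, task):
        features = self.encoder(x)  # computed once, consumed by either head
        return self.classifier(features) if task == "cls" else self.detector(features)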

Decision criteria

| Situation | Approach |
|---|---|
| Image input | CNN spatial sharing |
| Sequence input | RNN or Transformer (positional sharing) |
| Memory constrained, many layers | Cross-layer sharing (ALBERT) |
| LM with large vocab | Tied embeddings (saves vocab*dim params) |
| Related tasks (multi-task) | Shared encoder |
| Unrelated tasks | Don't force sharing; it degrades quality |

Default: tied embeddings + CNN spatial / Transformer positional sharing.

🤖 LLM usage

When to use: designing an efficient architecture, debugging a parameter count, or applying an inductive bias. When not to use: tasks or positions that are truly independent (forcing sharing hurts quality).

Anti-patterns

  • Over-sharing: sharing ALL layers can cost noticeable quality on complex tasks.
  • No tied embeddings on a small LM: with vocab=50k and dim=512, the untied output projection adds about 25.6M parameters that tying would save.
  • Sharing across modalities: a vision encoder and a text encoder should not share weights (use CLIP-style separate encoders).
  • Sharing LayerNorm by accident: when sharing across layers, consider sharing the W matrices but keeping a separate LayerNorm per layer.
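The last point can be sketched as a stack that reuses one feed-forward block while giving every layer its own normalization. This is an illustrative toy under assumed sizes, not ALBERT's actual layout; `SharedFFNStack` is a hypothetical name.

```python
import torch
import torch.nn as nn


class SharedFFNStack(nn.Module):
    """One shared feed-forward block, one LayerNorm per layer."""

    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.ffn = nn.Sequential(  # shared across all layers
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for norm in self.norms:
            x = x + self.ffn(norm(x))  # pre-norm residual, per-layer statistics
        return x


dim, depth = 32, 4  # assumed toy sizes
model = SharedFFNStack(dim, depth)

ffn_params = sum(p.numel() for p in model.ffn.parameters())
norm_params = sum(p.numel() for p in model.norms.parameters())
out = model(torch.randn(2, 7, dim))

print(ffn_params, norm_params, tuple(out.shape))
```

The per-layer norms add only `2 * dim` parameters each, so they are cheap compared with the shared matrices.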

🧪 Verification / duplicates

  • Verified (LeCun 1989 CNN, ALBERT paper, Press & Wolf 2017 tied embeddings).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: sharing forms, modern usage, patterns |