

One-line summary

"Same weights, different positions": a single parameter set is reused across multiple computations. This gives translation equivariance in CNNs, time-invariant dynamics in RNNs, and parameter efficiency in Transformers (position-wise FFN, tied embeddings). It is a fundamental design pattern of modern deep learning.

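Core ideas

Motivation

Parameter explosion: a fully connected layer on raw image pixels can need hundreds of millions to billions of parameters.
Inductive bias: weight sharing encodes a prior, such as translation equivariance or time invariance.
Generalization: fewer parameters usually mean less overfitting, provided the shared-weight prior matches the data.
Compute: shared weights let the work be expressed as convolutions or batched matmuls, which hardware executes efficiently.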
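
To put rough numbers on the parameter-explosion point, here is a minimal sketch in plain Python. The image size (224x224 RGB) and the 64 output feature maps are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def dense_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def conv2d_params(in_ch: int, out_ch: int, k: int) -> int:
    """Weights plus biases of a k x k convolution (no dependence on image size)."""
    return in_ch * out_ch * k * k + out_ch


# Assumed example: map a 224x224 RGB image to 64 feature maps of the same size
h = w = 224
fc = dense_params(3 * h * w, 64 * h * w)
conv = conv2d_params(3, 64, 3)
print(f"dense: {fc:,} params, conv: {conv:,} params")
```

The dense count grows with the square of the pixel count, while the convolution count stays fixed no matter how large the image is.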

Forms

Spatial sharing (CNN): the same conv kernels are slid across the image.
Temporal sharing (RNN/LSTM/GRU): the same recurrent weights are applied at every timestep.
Cross-layer sharing (ALBERT, Universal Transformer): the same layer parameters are reused L times.
Tied embeddings: the input embedding matrix is reused as the output projection (LM head).
Multi-head attention: NOT shared across heads in the standard formulation; each head has its own W_q, W_k, W_v.
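
The multi-head point can be checked directly on PyTorch's `nn.MultiheadAttention`, which packs the query, key, and value projections into one `in_proj_weight` tensor. The sketch below assumes the default configuration (key and value dimensions equal to `embed_dim`); the sizes are arbitrary.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 64, 8
head_dim = embed_dim // num_heads
mha = nn.MultiheadAttention(embed_dim, num_heads)

# in_proj_weight stacks W_q, W_k, W_v along dim 0: shape (3 * embed_dim, embed_dim)
w_q, w_k, w_v = mha.in_proj_weight.chunk(3, dim=0)

# Each head owns a distinct block of head_dim output rows of W_q
w_q_heads = w_q.view(num_heads, head_dim, embed_dim)
heads_differ = not torch.equal(w_q_heads[0], w_q_heads[1])

print(tuple(mha.in_proj_weight.shape), tuple(w_q_heads.shape), heads_differ)
```

The heads are stored in one tensor for efficiency, but each head reads its own slice, so nothing is shared between them.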

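Modern usage

ALBERT (2019): cross-layer sharing plus factorized embeddings for BERT compression (about 9× fewer parameters for the base configuration, 108M vs 12M, and about 18× for large).
ViT: spatial sharing via patch embedding (one linear projection applied to every patch).
Mamba/SSM: temporal sharing via state-space recurrence (the same learned projections are applied at every timestep).
LoRA: a single low-rank update (ΔW = BA) per adapted weight matrix, applied identically at every token position.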
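
As an illustration of the ViT entry, a patch embedding can be sketched as a convolution whose stride equals its kernel size, so one shared projection handles every patch. The patch size 16 and width 768 are assumed ViT-Base-like values, and `embed` is a hypothetical helper.

```python
import torch
import torch.nn as nn

patch, dim = 16, 768  # assumed ViT-Base-like sizes

# A stride-16 convolution applies one shared linear projection to each 16x16 patch
patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)


def embed(images: torch.Tensor) -> torch.Tensor:
    x = patch_embed(images)              # (B, dim, H/patch, W/patch)
    return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)


n_params = sum(p.numel() for p in patch_embed.parameters())
small = embed(torch.randn(1, 3, 224, 224))
large = embed(torch.randn(1, 3, 384, 384))
print(n_params, tuple(small.shape), tuple(large.shape))
```

A larger image produces more patch tokens, but the projection's parameter count does not change.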

Applications

CNN image classification (ResNet, ConvNeXt).
Sequence modeling (RNNs; Transformers, whose attention and FFN weights are shared across positions).
Model compression (ALBERT, distillation).
Multi-task learning (shared encoder).

💻 Patterns

CNN spatial sharing

import torch.nn as nn

# One bank of 64 kernels (each 3x3 over 3 input channels) is applied at every spatial position
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
# Params: 3*64*3*3 + 64 = 1792 (independent of image size)

Tied input/output embeddings

class LanguageModel(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # tie: lm_head.weight = embed.weight
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # share!

    def forward(self, x):
        h = self.embed(x)
        return self.lm_head(h)  # no extra params
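
A quick way to confirm the tie took effect is to check that both modules hold the same tensor and that gradients from both uses land in it. This is a minimal sketch with small, arbitrary sizes, using bare modules instead of the full class.

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 64  # small illustrative sizes
embed = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size, bias=False)
lm_head.weight = embed.weight  # tie: both modules now hold the same Parameter

same_object = lm_head.weight is embed.weight
same_storage = lm_head.weight.data_ptr() == embed.weight.data_ptr()

# Gradients from the lookup and from the projection accumulate in the one shared tensor
tokens = torch.randint(0, vocab_size, (2, 5))
logits = lm_head(embed(tokens))
logits.sum().backward()
grad_shape = tuple(embed.weight.grad.shape)

print(same_object, same_storage, grad_shape)
```

The tie works without a transpose because `nn.Embedding` and `nn.Linear(dim, vocab_size)` both store their weight with shape `(vocab_size, dim)`.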

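Cross-layer sharing (ALBERT-style)

class SharedTransformer(nn.Module):
    # num_heads=8 is an assumed default; dim must be divisible by it
    def __init__(self, num_layers, dim, num_heads=8):
        super().__init__()
        # ONE block; any Transformer block works, a built-in encoder layer is used here
        self.shared_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):  # x: (batch, seq_len, dim)
        for _ in range(self.num_layers):
            x = self.shared_layer(x)  # reuse same params
        return x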
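
To see what cross-layer sharing saves, the sketch below compares one reused block against a stack of independent blocks. The layer sizes are arbitrary assumptions, and `count_params` and `make_layer` are hypothetical helpers.

```python
import torch.nn as nn


def count_params(module: nn.Module) -> int:
    # parameters() yields each unique tensor once, so reused weights are not double counted
    return sum(p.numel() for p in module.parameters())


def make_layer() -> nn.Module:
    return nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=128)


num_layers = 6
shared = make_layer()  # one block, to be applied num_layers times
independent = nn.ModuleList([make_layer() for _ in range(num_layers)])

shared_count = count_params(shared)
independent_count = count_params(independent)
print(shared_count, independent_count, independent_count // shared_count)
```

Sharing cuts the stored parameters by a factor of `num_layers`, but the forward pass still runs the block `num_layers` times, so compute and latency are unchanged.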

Temporal sharing (RNN)

rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=2)
# At every timestep t, each layer applies the same W_ih and W_hh
# Params independent of sequence length
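
The claim that the parameter count does not depend on sequence length can be checked by running the same GRU on a short and a long input. The batch size and sequence lengths below are arbitrary.

```python
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=2)
n_params = sum(p.numel() for p in rnn.parameters())

# Default input layout is (seq_len, batch, input_size)
out_short, _ = rnn(torch.randn(10, 4, 128))
out_long, _ = rnn(torch.randn(500, 4, 128))

print(n_params, tuple(out_short.shape), tuple(out_long.shape))
```

The same module handles 10 or 500 timesteps, because each step reuses the same input-to-hidden and hidden-to-hidden matrices.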

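Detect shared params

from collections import defaultdict

# model: any nn.Module, e.g. the LanguageModel above.
# model.parameters() already skips duplicates, so list every name
# with remove_duplicate=False and group names that point to the same tensor.
groups = defaultdict(list)
for name, p in model.named_parameters(remove_duplicate=False):
    groups[id(p)].append(name)
shared = [names for names in groups.values() if len(names) > 1]
unique = sum(p.numel() for p in model.parameters())
print(f"Unique params: {unique}")
print(f"Shared groups: {shared}")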

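Multi-task shared encoder

from torchvision.models import resnet50

class MultiTaskModel(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.encoder = resnet50(weights=None)  # SHARED backbone
        self.encoder.fc = nn.Identity()        # expose the 2048-d pooled features
        self.classifier = nn.Linear(2048, num_classes)
        # Placeholder for a real detection head: regresses one box (x, y, w, h)
        self.detector = nn.Linear(2048, 4)

    def forward(self, x, task):
        features = self.encoder(x)
        return self.classifier(features) if task == "cls" else self.detector(features)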

Decision criteria

Situation	Approach
Image input	CNN spatial sharing
Sequence input	RNN or Transformer (positional sharing)
Memory constrained, many layers	Cross-layer sharing (ALBERT)
LM with large vocab	Tied embeddings (saves vocab*dim params)
Multi-task related	Shared encoder
Tasks unrelated	Don't force sharing; it degrades quality

Default: tied embeddings + CNN spatial / Transformer positional sharing.
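
The "saves vocab*dim params" entry in the table can be worked through in plain Python. The two configurations below are assumed examples, and `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_size: int, dim: int, tied: bool) -> int:
    """Parameters in the input embedding plus the output projection (no bias)."""
    table = vocab_size * dim
    return table if tied else 2 * table


savings = {}
for vocab_size, dim in [(50_000, 512), (32_000, 4096)]:
    untied = embedding_params(vocab_size, dim, tied=False)
    tied = embedding_params(vocab_size, dim, tied=True)
    savings[(vocab_size, dim)] = untied - tied
    print(f"vocab={vocab_size:,} dim={dim}: tying saves {untied - tied:,} params")
```

The saving is exactly one `vocab_size * dim` table, which matters most for small models where the embedding is a large share of the total.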

🔗 Graph

Parents: Inductive-Bias · LLM_Optimization_and_Deployment_Strategies
Applications: CNN · 데이터 사이언스 및 ML 엔지니어링 (Data Science and ML Engineering) · Transformer

🤖 LLM usage

When to use: designing an efficient architecture, debugging parameter counts, or applying an inductive bias. When not to use: the tasks or positions are truly independent, so forcing sharing hurts quality.

❌ Anti-patterns

Over-sharing: sharing ALL layers can cause a severe quality drop on complex tasks.
No tied embeddings on a small LM: with vocab=50k and dim=512, an untied output projection adds about 25.6M parameters that tying would save.
Sharing across modalities: vision and text encoders should not share weights; use separate encoders, as CLIP does.
Forgetting that LayerNorm need not be shared: when sharing the weight matrices across layers, consider keeping LayerNorm parameters per layer.
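
The LayerNorm point can be sketched as a stack that reuses one feed-forward block but gives each layer its own normalization. `SharedFFNStack` is a hypothetical, simplified module (no attention) and the sizes are arbitrary; it shows the wiring, not a tuned architecture.

```python
import torch
import torch.nn as nn


class SharedFFNStack(nn.Module):
    """Hypothetical stack: one shared feed-forward block, one LayerNorm per layer."""

    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        self.shared_ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Per-layer norms are cheap: 2 * dim parameters each
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for norm in self.norms:
            x = x + self.shared_ffn(norm(x))  # pre-norm residual update
        return x


model = SharedFFNStack(num_layers=6, dim=64)
ffn_params = sum(p.numel() for p in model.shared_ffn.parameters())
norm_params = sum(p.numel() for p in model.norms.parameters())
out = model(torch.randn(2, 10, 64))
print(ffn_params, norm_params, tuple(out.shape))
```

The per-layer norms add only a small fraction of the total parameters while letting each depth rescale its input differently.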

🧪 Verification / duplicates

Verified (LeCun 1989 CNN, ALBERT paper, Press & Wolf 2017 tied embeddings).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup — sharing forms, modern usage, patterns


Parameter Sharing