"매 same weights, different positions". 매 single parameter set 가 multiple computations 에 reuse — translation invariance (CNN), temporal invariance (RNN), parameter efficiency (transformer FFN tied embeddings). 매 modern DL 의 fundamental design pattern.
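Key points
Motivation
Parameter explosion: a fully connected layer on an image needs billions of parameters (mapping a 224×224×3 input to an equally sized hidden layer already takes about 22.7 billion weights).
Inductive bias: weight sharing encodes a prior, namely that the same feature detector is useful at every position or timestep.
Generalization: fewer parameters often generalize better (less capacity to overfit).
Compute: shared weights let the operation run as a convolution or a single batched matmul.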
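The parameter-explosion point can be checked with plain arithmetic. The sketch below is illustrative only: the 224×224 RGB input and the 64 output feature maps are assumed sizes, and the helper names `dense_params` and `conv2d_params` are hypothetical.

```python
def dense_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def conv2d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights plus biases of a square 2D convolution."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


# Assumed setting: 224x224 RGB input, 64 output feature maps of the same spatial size.
height, width, in_channels, out_channels = 224, 224, 3, 64

# Unshared: every output unit has its own weight for every input value.
fc = dense_params(height * width * in_channels, height * width * out_channels)
# Shared: one 3x3 kernel per (input channel, output channel) pair, reused everywhere.
conv = conv2d_params(in_channels, out_channels, kernel_size=3)

print(f"fully connected: {fc:,} parameters")
print(f"3x3 convolution: {conv:,} parameters")
print(f"ratio: {fc / conv:,.0f}x")
```

The convolution count does not depend on `height` or `width` at all, which is the whole point of sharing.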
Forms
Spatial sharing (CNN): the same conv kernel is slid across the image.
Temporal sharing (RNN/LSTM/GRU): the same recurrent weights are applied at every timestep.
Cross-layer sharing: ALBERT, Universal Transformer. The same layer parameters are reused L times.
Tied embeddings: the input embedding matrix doubles as the output projection (LM head).
Multi-head: NOT shared across heads (each head has its own W_q, W_k, W_v), although each head's weights are still shared across positions.
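Spatial sharing and the prior it encodes can be seen in a few lines of plain Python. This is a minimal sketch, assuming wrap-around (circular) padding so that the shift property holds exactly; `circular_conv1d` and `roll` are hypothetical helpers, not library functions.

```python
def circular_conv1d(signal, kernel):
    """Slide one kernel over every position of a 1D signal (wrap-around padding)."""
    n = len(signal)
    return [
        sum(w * signal[(i + j) % n] for j, w in enumerate(kernel))
        for i in range(n)
    ]


def roll(values, shift):
    """Cyclically shift a list to the right by `shift` positions."""
    shift %= len(values)
    return values[-shift:] + values[:-shift] if shift else list(values)


signal = [0, 0, 1, 5, 1, 0, 0, 0]
kernel = [1, -2, 1]  # the single shared weight vector

out = circular_conv1d(signal, kernel)
out_of_shifted = circular_conv1d(roll(signal, 3), kernel)

# Because every position uses the same weights, shifting the input
# shifts the output by the same amount (translation equivariance).
assert out_of_shifted == roll(out, 3)
```

With ordinary zero padding the property holds only away from the borders, which is why the sketch wraps around instead.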
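Modern usage
ALBERT (2019): cross-layer sharing, combined with factorized embeddings, for BERT compression (roughly 9× fewer parameters for the base model, 12M vs. 108M, and about 18× for large, 18M vs. 334M).
ViT: spatial sharing via patch embedding (one linear projection applied to every patch).
Mamba/SSM: temporal sharing via state-space recurrence (the same parameters, or in Mamba the same projections that produce them, are used at every step).
LoRA: a single low-rank delta, shared across positions just like the weight matrix it modifies.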
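A back-of-the-envelope count shows the scale of the savings from cross-layer sharing and from a low-rank delta. This is a rough sketch under stated assumptions: BERT-base-like sizes (`dim = 768`, 12 layers), an FFN hidden size of 4×`dim`, biases and layer norms ignored, and a LoRA rank of 8 chosen only for illustration. The helper names are hypothetical.

```python
def transformer_block_params(dim: int, ffn_mult: int = 4) -> int:
    """Approximate weight count of one block.

    Counts the Q, K, V and output projections plus a two-layer FFN;
    biases and layer norms are ignored.
    """
    attention = 4 * dim * dim
    ffn = 2 * dim * (ffn_mult * dim)
    return attention + ffn


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights of a rank-`rank` update B @ A, A: (rank, d_in), B: (d_out, rank)."""
    return rank * d_in + d_out * rank


dim, num_layers = 768, 12  # assumed BERT-base-like sizes

per_block = transformer_block_params(dim)
unshared = num_layers * per_block  # one parameter set per layer
shared = per_block                 # one parameter set reused at every depth

full_matrix = dim * dim
low_rank = lora_params(dim, dim, rank=8)

print(f"encoder stack, unshared: {unshared:,}")
print(f"encoder stack, shared:   {shared:,}")
print(f"dense {dim}x{dim} update: {full_matrix:,}")
print(f"rank-8 LoRA update:      {low_rank:,}")
```

The encoder stack shrinks by exactly the layer count, but a whole-model ratio is smaller because embeddings are not part of the shared block.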
Applications
CNN image classification (ResNet, ConvNeXt).
Sequence modeling (RNNs; Transformers, whose position-wise shared weights rely on position embeddings to tell positions apart).
Model compression (ALBERT, distillation).
Multi-task learning (shared encoder).
💻 Patterns
CNN spatial sharing
    import torch.nn as nn

    # Single 3x3 kernel applied to every spatial position
    conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
    # Params: 3*64*3*3 + 64 = 1792 (independent of image size)
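The "independent of image size" claim can be checked directly. The following is a small sketch assuming PyTorch is installed; the two input resolutions are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
num_params = sum(p.numel() for p in conv.parameters())

# Two inputs of different spatial size, laid out as (batch, channels, height, width).
small = torch.randn(1, 3, 32, 32)
large = torch.randn(1, 3, 128, 128)

with torch.no_grad():
    out_small = conv(small)
    out_large = conv(large)

# The same layer handles both inputs; only the output size changes.
print(num_params)
print(tuple(out_small.shape))
print(tuple(out_large.shape))
```

A fully connected layer would need a different weight matrix for each resolution, whereas the kernel here is reused unchanged.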
Tied input/output embeddings
    import torch.nn as nn

    class LanguageModel(nn.Module):
        def __init__(self, vocab_size, dim):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            # tie: lm_head.weight = embed.weight (both have shape (vocab_size, dim))
            self.lm_head = nn.Linear(dim, vocab_size, bias=False)
            self.lm_head.weight = self.embed.weight  # share!

        def forward(self, x):
            h = self.embed(x)
            return self.lm_head(h)  # no extra params
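To confirm that tying really leaves one tensor rather than two copies, a quick check along these lines may help. It is a sketch assuming PyTorch is installed; the vocabulary size and dimension are small illustrative values, and the `nn.ModuleDict` wrapper is used here only to count parameters.

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 32  # small assumed sizes for illustration

embed = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size, bias=False)
lm_head.weight = embed.weight  # tie: both names now refer to one Parameter

model = nn.ModuleDict({"embed": embed, "lm_head": lm_head})

same_object = lm_head.weight is embed.weight
same_storage = lm_head.weight.data_ptr() == embed.weight.data_ptr()
# parameters() yields each shared tensor once, so the tied weight is not double counted.
num_params = sum(p.numel() for p in model.parameters())

tokens = torch.randint(0, vocab_size, (2, 5))
logits = lm_head(embed(tokens))
logits.sum().backward()

# The single gradient accumulates contributions from both uses of the weight.
print(same_object, same_storage, num_params, tuple(embed.weight.grad.shape))
```

Because both uses feed one gradient, an optimizer step updates the lookup table and the output projection together.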
Cross-layer sharing (ALBERT-style)
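    import torch.nn as nn

    class SharedTransformer(nn.Module):
        def __init__(self, num_layers, dim, num_heads=8):
            super().__init__()
            # ONE block. nn.TransformerEncoderLayer stands in for the undefined
            # TransformerBlock; num_heads is an assumed extra argument.
            self.shared_layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=num_heads, batch_first=True
            )
            self.num_layers = num_layers

        def forward(self, x):
            for _ in range(self.num_layers):
                x = self.shared_layer(x)  # reuse same params
            return x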
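The effect on model size can be measured by comparing a shared block against an ordinary stack of independent blocks. This is a sketch assuming PyTorch is installed, using `nn.TransformerEncoderLayer` as a stand-in block; the sizes and the helpers `make_block` and `count` are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

dim, num_heads, num_layers = 64, 4, 6  # assumed small sizes


def make_block() -> nn.Module:
    return nn.TransformerEncoderLayer(
        d_model=dim, nhead=num_heads, dim_feedforward=4 * dim, batch_first=True
    )


def count(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


unshared = nn.ModuleList([make_block() for _ in range(num_layers)])  # L parameter sets
shared = make_block()                                                # 1 parameter set

x = torch.randn(2, 10, dim)  # (batch, sequence, features)
h = x
for _ in range(num_layers):
    h = shared(h)  # same weights applied at every depth

print(count(unshared), count(shared), tuple(h.shape))
```

Sharing cuts stored parameters by the layer count, but compute per forward pass is unchanged because the block still runs `num_layers` times.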
RNN temporal sharing (built-in)
    import torch.nn as nn

    rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=2)
    # At every timestep t, the same W_ih, W_hh are applied
    # Params independent of sequence length
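The same check works for temporal sharing: the parameter count follows from the layer sizes alone, and one module processes sequences of any length. This is a sketch assuming PyTorch is installed; the sequence lengths are arbitrary, and `gru_layer_params` is a hypothetical helper that mirrors PyTorch's layout of three gates with two bias vectors each.

```python
import torch
import torch.nn as nn


def gru_layer_params(input_size: int, hidden_size: int) -> int:
    """Three gates, each with input weights, hidden weights and two bias vectors."""
    return 3 * hidden_size * (input_size + hidden_size) + 2 * 3 * hidden_size


rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=2)
num_params = sum(p.numel() for p in rnn.parameters())
# The second layer takes the first layer's 256-dimensional hidden state as input.
expected = gru_layer_params(128, 256) + gru_layer_params(256, 256)

# Default layout is (seq_len, batch, input_size) since batch_first=False.
short_seq = torch.randn(5, 1, 128)
long_seq = torch.randn(500, 1, 128)

with torch.no_grad():
    out_short, _ = rnn(short_seq)
    out_long, _ = rnn(long_seq)

print(num_params, expected)
print(tuple(out_short.shape), tuple(out_long.shape))
```

Neither sequence length appears in the count, since the recurrence reapplies the same matrices at every step.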