"매 learned by data vs set by human". Parameter = model 이 training 중 학습 (weight, bias). Hyperparameter = 매 human 이 사전 설정 (lr, depth, batch size). 2026 frontier: 매 trillion-parameter models (GPT-5, Claude Opus 4.7) — 매 scale 의 dominant axis.
Key points
Parameter vs. hyperparameter
Parameter (θ): trainable; the target of gradient-descent updates. Examples: W and b in y = Wx + b.
Hyperparameter: fixed before training; an architecture or optimization choice. Examples: learning rate, batch size, num_layers, dropout p.
Ambiguous case: prompt tokens. A soft prompt is a parameter (its embeddings are trained); a hard prompt is just input.
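To make the split concrete, here is a minimal sketch in plain Python (no framework) that fits y = wx + b by gradient descent. The toy data, the learning rate, and the step count are all assumptions picked for illustration: `learning_rate` and `num_steps` are hyperparameters fixed up front, while `w` and `b` are the parameters that training updates.

```python
# Hypothetical toy data generated from y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

# Hyperparameters: chosen by a human before training starts.
learning_rate = 0.05
num_steps = 2000

# Parameters: initialised arbitrarily, then updated by gradient descent.
w, b = 0.0, 0.0


def mse(w, b):
    """Mean squared error of the line y = w*x + b on the toy data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


initial_loss = mse(w, b)
for _ in range(num_steps):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Only the parameters change; the hyperparameters stay fixed.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
final_loss = mse(w, b)

print(f"w = {w:.4f}, b = {b:.4f}")
```

Changing `learning_rate` changes how training behaves, but no training step ever writes to it; that is the practical meaning of "set by a human".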
Parameter types
Weights: matrix multiply coefficients (W in Wx + b).
Biases: additive offsets (b).
Embeddings: lookup table (vocab × dim).
LayerNorm γ, β: scale/shift learned per channel.
Buffers: NOT parameters. Non-trainable state such as running statistics (BatchNorm running_mean) and moving averages.
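The types above can be tallied with simple arithmetic. The sketch below uses hypothetical sizes (`vocab_size`, `dim`, `hidden` are made up for illustration and do not describe any real model) to show how each type contributes to the total, and why buffers are left out of it.

```python
# Hypothetical sizes for a tiny model; not taken from any real architecture.
vocab_size, dim, hidden = 1000, 64, 256

param_counts = {
    "embedding": vocab_size * dim,   # lookup table: vocab × dim
    "linear.weight": dim * hidden,   # W in Wx + b
    "linear.bias": hidden,           # b
    "layernorm.gamma": dim,          # per-channel scale
    "layernorm.beta": dim,           # per-channel shift
}

# Buffers hold state but are never trained, so they are tracked
# separately and excluded from the parameter total.
buffer_counts = {"running_mean": dim}

total_params = sum(param_counts.values())
total_buffers = sum(buffer_counts.values())

for name, count in param_counts.items():
    print(f"{name:16s} {count:>8,d}")
print(f"{'total params':16s} {total_params:>8,d}")
```

Even at this toy size the embedding table accounts for most of the parameters.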
Modern scale
BERT-base (2018): 110M.
GPT-3 (2020): 175B.
GPT-4 (2023): ~1.7T (rumored MoE).
Llama 3.1 405B (2024): 405B dense.
GPT-5 / Claude Opus 4.7 (2025-2026): reportedly trillion-scale (sizes not officially disclosed); MoE is common.
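For a sense of what these counts mean in practice, the sketch below converts the publicly documented sizes from the list into weight storage, assuming 2 bytes per parameter (bf16). The rumored and undisclosed entries are left out on purpose, and the figures cover weights only, not activations or optimizer state.

```python
BYTES_PER_PARAM_BF16 = 2  # assumption: weights stored in bf16

# Publicly documented parameter counts from the list above.
documented_sizes = {
    "BERT-base": 110_000_000,
    "GPT-3": 175_000_000_000,
    "Llama 3.1 405B": 405_000_000_000,
}

# Weights-only footprint in GB (1 GB = 1e9 bytes).
weights_gb = {
    name: n_params * BYTES_PER_PARAM_BF16 / 1e9
    for name, n_params in documented_sizes.items()
}

for name, gb in weights_gb.items():
    print(f"{name}: {gb:,.2f} GB of weights in bf16")
```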
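Parameter vs. buffer (PyTorch)
    import torch
    import torch.nn as nn

    class MyLayer(nn.Module):
        def __init__(self):
            super().__init__()
            # Trainable: returned by parameters() and updated by the optimizer
            self.weight = nn.Parameter(torch.randn(10, 10))
            # NOT trainable: stored in state_dict, but not returned by parameters()
            self.register_buffer("running_mean", torch.zeros(10))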
Freeze parameters (transfer learning)
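    import torch

    # `model` is assumed to be an nn.Module with an `encoder` submodule
    for p in model.encoder.parameters():
        p.requires_grad = False  # frozen

    # Only the classifier head trains
    optimizer = torch.optim.Adam(
        [p for p in model.parameters() if p.requires_grad], lr=1e-4
    )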
Memory estimation
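    def model_memory_gb(model, dtype_bytes=2):  # 2 bytes per value = bf16
        # Rough estimate: activations are not included
        n = sum(p.numel() for p in model.parameters())
        weights = n * dtype_bytes
        gradients = n * dtype_bytes  # only needed when training
        optimizer = n * 8            # Adam: 2 states × 4 bytes (fp32)
        return (weights + gradients + optimizer) / 1e9

    # `model` is assumed to be an existing nn.Module
    print(f"Training memory: {model_memory_gb(model):.1f} GB")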
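The same estimate can be worked through without a model object. This is a minimal sketch that takes a raw parameter count instead; `training_memory_gb` and the 7B figure are hypothetical, chosen only to make the arithmetic easy to follow. With bf16 weights and gradients plus two fp32 Adam states, the cost comes to 2 + 2 + 8 = 12 bytes per parameter.

```python
def training_memory_gb(n_params, dtype_bytes=2):
    """Rough training footprint in GB for a given parameter count.

    Assumes Adam with two fp32 states per parameter; activations and
    any fp32 master copy of the weights are not counted.
    """
    weights = n_params * dtype_bytes    # model weights (bf16 by default)
    gradients = n_params * dtype_bytes  # one gradient value per parameter
    optimizer = n_params * 8            # Adam: 2 states × 4 bytes (fp32)
    return (weights + gradients + optimizer) / 1e9


# Hypothetical 7B-parameter model: 12 bytes per parameter in total.
estimate = training_memory_gb(7_000_000_000)
print(f"Training memory: {estimate:.1f} GB")
```

Under these assumptions the optimizer state is twice the size of the weights and gradients combined, which is why freezing parameters cuts training memory so sharply.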