---
id: wiki-2026-0508-parameter
title: Parameter
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Model Parameter, Weight, Trainable Parameter]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [parameter, weight, hyperparameter, ml-fundamentals]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---

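# Parameter

## One-liner

> **"Learned from data vs. set by a human."** A parameter is a value the model learns during training (weights, biases). A hyperparameter is a value a human sets before training (learning rate, depth, batch size). As of 2026, frontier models such as GPT-5 and Claude Opus 4.7 are presumed to be trillion-parameter scale, although their exact sizes are not publicly disclosed; parameter count remains a dominant axis of scaling.
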
## Core

### Parameter vs hyperparameter

- **Parameter (θ)**: trainable; the target of gradient-descent updates. Examples: `W` and `b` in `y = Wx + b`.
- **Hyperparameter**: fixed before training; an architecture or optimization choice. Examples: learning rate, batch size, `num_layers`, dropout `p`.
- Ambiguous case: prompt tokens. A soft prompt is a trainable parameter; a hard prompt is just input.

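To make the distinction concrete, here is a minimal pure-Python sketch (no PyTorch) that fits `y = Wx + b` with gradient descent. The toy data, the function name `fit_line`, and the values of `lr` and `steps` are illustrative assumptions, not taken from any source. `w` and `b` are parameters because the loop updates them; `lr` and `steps` are hyperparameters because they are chosen up front and never updated.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # parameters: updated by training
    n = len(xs)
    for _ in range(steps):  # lr and steps are hyperparameters: fixed up front
        # Gradients of MSE = mean((w*x + b - y)^2) with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # toy data generated with w=2, b=1
w, b = fit_line(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
```

Changing `lr` or `steps` changes how training proceeds, but neither value is ever touched by the gradient update itself.
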
### Parameter types

- **Weights**: matrix-multiply coefficients (`W` in `Wx + b`).
- **Biases**: additive offsets (`b`).
- **Embeddings**: lookup table (vocab × dim).
- **LayerNorm γ, β**: scale and shift, learned per feature.
- **Buffers**: NOT parameters. Persistent state such as running statistics (BatchNorm `running_mean`) and moving averages; saved with the model but not updated by the optimizer.

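As a rough illustration of how each type contributes to the total, the sketch below counts parameters analytically for a small hypothetical model (one embedding table, one LayerNorm, and a two-layer MLP). The helper names and the sizes are assumptions chosen for the example.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim  # one dim-sized row per vocabulary entry


def linear_params(in_features, out_features, bias=True):
    # Weight matrix plus an optional bias vector
    return in_features * out_features + (out_features if bias else 0)


def layernorm_params(dim):
    return 2 * dim  # gamma (scale) and beta (shift)


vocab_size, dim, hidden = 1000, 64, 256  # hypothetical sizes
counts = {
    "embedding": embedding_params(vocab_size, dim),
    "layernorm": layernorm_params(dim),
    "mlp_up": linear_params(dim, hidden),
    "mlp_down": linear_params(hidden, dim),
}
total_params = sum(counts.values())
for name, n in counts.items():
    print(f"{name:>10}: {n:>7,} ({n / total_params:.1%})")
print(f"{'total':>10}: {total_params:>7,}")
```

In this toy configuration the embedding table accounts for most of the parameters, while LayerNorm contributes almost nothing.
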
### Modern scale

- BERT-base (2018): 110M.
- GPT-3 (2020): 175B.
- GPT-4 (2023): ~1.7T (rumored, MoE; not officially confirmed).
- Llama 3.1 405B (2024): 405B dense.
- GPT-5 / Claude Opus 4.7 (2025-2026): presumed trillion-scale (sizes undisclosed); MoE is common at this scale.
- For MoE models, active params per token ≠ total params.

### Applications

1. Model size estimation (memory budget).
2. Compute budget (Chinchilla scaling: compute-optimal tokens ≈ 20 × params).
3. Compression (quantization and pruning operate on parameters).
4. Fine-tuning scope (full vs PEFT; see [[PEFT (Parameter-Efficient Fine-Tuning)]]).

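The Chinchilla rule of thumb (tokens ≈ 20 × params) combines with the common training-cost approximation FLOPs ≈ 6 × params × tokens into a quick budget estimate. The function names and the 70B example are assumptions for illustration, and both formulas are approximations rather than exact laws.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal training tokens under the ~20 tokens/param heuristic."""
    return n_params * tokens_per_param


def training_flops(n_params, n_tokens):
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


n_params = 70e9  # hypothetical 70B-parameter dense model
n_tokens = chinchilla_tokens(n_params)
flops = training_flops(n_params, n_tokens)
print(f"tokens ≈ {n_tokens:.2e}, training FLOPs ≈ {flops:.2e}")
```
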
## 💻 Patterns

### Count parameters

```python
def count_params(model):
    """Return (total, trainable) parameter counts for a torch.nn.Module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# `model` is assumed to be an existing torch.nn.Module
total, trainable = count_params(model)
print(f"Total: {total/1e9:.2f}B, Trainable: {trainable/1e9:.2f}B")
```

### Parameter vs buffer

```python
import torch
import torch.nn as nn

class MyLayer(nn.Module):
    def __init__(self):
        super().__init__()
        # Trainable: returned by parameters() and updated by the optimizer
        self.weight = nn.Parameter(torch.randn(10, 10))
        # NOT trainable: absent from parameters(), but saved in state_dict
        self.register_buffer("running_mean", torch.zeros(10))
```

### Freeze parameters (transfer learning)

```python
import torch

# `model` is assumed to have an `encoder` submodule and a classifier head
for p in model.encoder.parameters():
    p.requires_grad = False  # frozen: no gradients are computed

# Only the classifier head trains, so pass only trainable params to the optimizer
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```

### Memory estimation

```python
def model_memory_gb(model, dtype_bytes=2):  # 2 bytes = bf16
    """Rough training memory in GB; excludes activations."""
    n = sum(p.numel() for p in model.parameters())
    weights = n * dtype_bytes
    gradients = n * dtype_bytes  # only needed when training
    optimizer = n * 8            # Adam: 2 states × 4 bytes (fp32)
    return (weights + gradients + optimizer) / 1e9

# `model` is assumed to be an existing torch.nn.Module
print(f"Training memory: {model_memory_gb(model):.1f} GB")
```

### Hyperparameter search (Optuna)

```python
import optuna

def objective(trial):
    # Each suggest_* call samples one hyperparameter for this trial
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    bs = trial.suggest_categorical("batch_size", [32, 64, 128])
    layers = trial.suggest_int("num_layers", 2, 8)
    # train_and_eval is a user-supplied placeholder returning the metric to maximize
    return train_and_eval(lr=lr, batch_size=bs, num_layers=layers)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
```

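### MoE active params

```python
# Mixtral 8x7B: ~47B total, ~13B active per token (top-2 routing)
total = 47e9
active = 13e9
experts = 8
active_per_token = 2

# total  = shared + experts * per_expert
# active = shared + active_per_token * per_expert
per_expert = (total - active) / (experts - active_per_token)  # ~5.7B
shared = active - active_per_token * per_expert               # ~1.7B (rough)
```
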
## Decision criteria

| Situation | Approach |
|---|---|
| Memory budget planning | Total params × dtype bytes × (1 for weights only; ~4 for fp32 training with gradients and Adam states) |
| Inference deployment | Total params × dtype bytes (+ KV cache) |
| Scaling decision | Chinchilla: tokens ≈ 20 × params |
| Compute budget | Training FLOPs ≈ 6 × params × tokens |
| Fine-tuning | PEFT if params > 1B and only one GPU is available |

**Default**: always report total and trainable params separately.

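For inference deployment, a hedged back-of-the-envelope sketch: weights take params × dtype bytes, and the KV cache adds 2 (keys and values) × layers × KV heads × head dim × sequence length × batch × bytes per element. The configuration below (8B params, 32 layers, 8 KV heads, head dim 128) is a hypothetical example, not a measurement of any specific model, and the estimate ignores activations and framework overhead.

```python
def weight_bytes(n_params, dtype_bytes=2):
    """Memory for the weights alone (2 bytes per element = bf16/fp16)."""
    return n_params * dtype_bytes


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    """Keys and values cached for every layer, KV head, and token position."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes


# Hypothetical configuration for illustration only
weights = weight_bytes(8_000_000_000)
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192)
print(f"weights: {weights / 1e9:.1f} GB, KV cache: {kv / 2**30:.1f} GiB")
```

The KV cache grows linearly with both sequence length and batch size, so long-context or high-throughput serving can need far more memory than the weights-only figure suggests.
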
## 🔗 Graph

- Parent: [[Machine-Learning]]
- Variant: [[Trainable-Parameter]]
- Applications: [[LLM_Optimization_and_Deployment_Strategies|Model-Compression]] · [[PEFT (Parameter-Efficient Fine-Tuning)]] · [[LLM_Optimization_and_Deployment_Strategies|Quantization]]
- Adjacent: [[Scaling-Laws]] · [[MoE]]

## 🤖 LLM usage

**When**: model size discussions, memory planning, fine-tuning scope decisions.
**When not**: high-level user-facing communication (say "model size" instead).

## ❌ Anti-patterns

- **Confusing parameters with hyperparameters**: calling `lr` a parameter.
- **Counting frozen as trainable**: reporting 70B "trainable" when only the LoRA adapters (~0.5%) actually train.
- **Ignoring MoE active vs total**: treating Mixtral's 47B as 47B of compute per token (actually ~13B active per token).
- **Memory underestimation**: forgetting optimizer states (8 bytes per parameter for Adam with fp32 states).

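To see why the trainable fraction is so small under LoRA, note that a rank-`r` adapter on a `d_in × d_out` weight matrix adds `r × (d_in + d_out)` trainable parameters while the original matrix stays frozen. The sketch below uses a hypothetical 4096 × 4096 projection with rank 8; the function name is an assumption, and the fraction for a real model depends on which layers receive adapters.

```python
def lora_trainable(d_in, d_out, rank):
    """Trainable params of a LoRA adapter: A is (d_in x rank), B is (rank x d_out)."""
    return rank * (d_in + d_out)


d_in = d_out = 4096  # hypothetical projection size
frozen = d_in * d_out
adapter = lora_trainable(d_in, d_out, rank=8)
fraction = adapter / frozen
print(f"frozen: {frozen:,}  trainable: {adapter:,}  ({fraction:.2%})")
```

Reporting both numbers side by side avoids the "counting frozen as trainable" mistake.
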
## 🧪 Verification / duplicates

- Verified (PyTorch docs, Kaplan 2020 / Hoffmann 2022 scaling laws).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: parameter vs hyperparameter, modern scale, memory math |