---
id: wiki-2026-0508-parameter
title: Parameter
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Model Parameter, Weight, Trainable Parameter]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [parameter, weight, hyperparameter, ml-fundamentals]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---
# Parameter
## One-line summary
> **"Learned from data vs. set by a human."** A parameter is what the model learns during training (weights, biases). A hyperparameter is set by a human beforehand (learning rate, depth, batch size). 2026 frontier: models such as GPT-5 and Claude Opus 4.7 are reportedly trillion-parameter scale (official counts are undisclosed), and parameter count remains a dominant scaling axis.
## Key points
### Parameter vs hyperparameter
- **Parameter (θ)**: trainable; the target of gradient-descent updates. Examples: `W` and `b` in `y = Wx + b`.
- **Hyperparameter**: fixed before training; an architecture or optimizer choice. Examples: learning rate, batch size, `num_layers`, dropout `p`.
- Ambiguous case: prompt tokens. A soft prompt is a parameter (its embeddings are trained); a hard prompt is just input.
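To make the distinction concrete, here is a minimal sketch that fits `y = wx + b` with plain gradient descent in pure Python. The names `fit_line`, `lr`, and `steps` are illustrative assumptions, not part of any library. `w` and `b` are parameters (every update changes them), while `lr` and `steps` are hyperparameters (chosen once, never touched by the gradient).

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0          # parameters: learned from the data
    n = len(xs)
    for _ in range(steps):   # lr and steps are hyperparameters: fixed up front
        # Gradients of MSE = mean((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Data generated from y = 3x + 1, so training should recover w ≈ 3 and b ≈ 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3 * x + 1 for x in xs]
w, b = fit_line(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
```

Changing `lr` or `steps` changes how training proceeds, but only `w` and `b` are stored in the trained model.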
### Parameter types
- **Weights**: matrix-multiply coefficients (`W` in `Wx + b`).
- **Biases**: additive offsets (`b`).
- **Embeddings**: lookup table (vocab × dim).
- **LayerNorm γ, β**: learned scale and shift, one pair per feature.
- **Buffers**: NOT parameters. They hold state that is saved with the model but not updated by gradients, such as running statistics (BatchNorm `running_mean`) and moving averages.
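As a rough illustration of how these types add up, the sketch below counts weights, biases, embeddings, and LayerNorm parameters by hand. It assumes the standard published BERT-base configuration (vocab 30522, hidden size 768, 12 layers, FFN size 3072, 512 positions); the helper names are hypothetical, and buffers are excluded because they are not parameters.

```python
def linear(n_in, n_out):
    """Weights (n_in x n_out) plus biases (n_out) of one dense layer."""
    return n_in * n_out + n_out

def layer_norm(dim):
    """Learned scale (gamma) and shift (beta), one pair per feature."""
    return 2 * dim

def bert_base_params(vocab=30522, max_pos=512, type_vocab=2,
                     hidden=768, ffn=3072, layers=12):
    # Embeddings: token, position, and segment lookup tables, then a LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm(hidden)
    # One encoder layer: Q, K, V, and output projections plus a LayerNorm ...
    attention = 4 * linear(hidden, hidden) + layer_norm(hidden)
    # ... followed by a two-layer feed-forward block plus a LayerNorm.
    feed_forward = linear(hidden, ffn) + linear(ffn, hidden) + layer_norm(hidden)
    # Pooler: one dense layer applied to the first token's output.
    pooler = linear(hidden, hidden)
    return embeddings + layers * (attention + feed_forward) + pooler

total = bert_base_params()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

The result lands within about half a percent of the commonly quoted 110M figure for BERT-base.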
### Modern scale
- BERT-base (2018): 110M.
- GPT-3 (2020): 175B.
- GPT-4 (2023): ~1.7T (rumored MoE; unconfirmed).
- Llama 3.1 405B (2024): 405B dense.
- GPT-5 / Claude Opus 4.7 (2025-2026): reportedly trillion-scale (counts undisclosed); MoE is common.
- For MoE models, active params per token ≠ total params.
### Applications
1. Model size estimation (memory budget).
2. Compute budget (Chinchilla scaling: training tokens ≈ 20 × params).
3. Compression (quantization and pruning operate on params).
4. Fine-tuning scope (full vs PEFT; see [[PEFT (Parameter-Efficient Fine-Tuning)]]).
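The first two applications reduce to simple arithmetic on the parameter count. The sketch below is a back-of-the-envelope calculation, assuming the rules of thumb tokens ≈ 20 × params and training FLOPs ≈ 6 × params × tokens; the function names are illustrative and the 70B model size is only an example.

```python
def weight_memory_gb(params, dtype_bytes=2):
    """Memory for the weights alone (2 bytes per param = bf16/fp16)."""
    return params * dtype_bytes / 1e9

def chinchilla_tokens(params, tokens_per_param=20):
    """Compute-optimal training tokens under the ~20 tokens/param heuristic."""
    return params * tokens_per_param

def training_flops(params, tokens):
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

params = 70e9  # example: a 70B-parameter dense model
tokens = chinchilla_tokens(params)
flops = training_flops(params, tokens)
print(f"weights: {weight_memory_gb(params):.0f} GB in bf16")
print(f"tokens:  {tokens:.2e}")
print(f"FLOPs:   {flops:.2e}")
```

These are planning estimates only; real runs deviate with architecture, data quality, and hardware utilization.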
## 💻 Patterns
### Count parameters
```python
def count_params(model):
    """Return (total, trainable) parameter counts for an nn.Module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# `model` is any torch.nn.Module you have already built.
total, trainable = count_params(model)
print(f"Total: {total/1e9:.2f}B, Trainable: {trainable/1e9:.2f}B")
```
### Parameter vs buffer
```python
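import torch
import torch.nn as nn

class MyLayer(nn.Module):
    def __init__(self):
        super().__init__()
        # Parameter: returned by .parameters() and receives gradients.
        self.weight = nn.Parameter(torch.randn(10, 10))
        # Buffer: saved in state_dict but never updated by the optimizer.
        self.register_buffer("running_mean", torch.zeros(10))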
```
### Freeze parameters (transfer learning)
```python
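import torch

# Assumes `model` has an `encoder` submodule and a classifier head.
for p in model.encoder.parameters():
    p.requires_grad = False  # frozen: no gradients, no updates

# Only parameters that still require gradients (the head) are optimized.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)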
```
### Memory estimation
```python
def model_memory_gb(model, dtype_bytes=2):  # 2 bytes = bf16/fp16
    """Rough training memory in GB; excludes activations."""
    n = sum(p.numel() for p in model.parameters())
    weights = n * dtype_bytes
    gradients = n * dtype_bytes  # only needed when training
    optimizer = n * 8            # Adam: 2 states × 4 bytes (fp32) per parameter
    return (weights + gradients + optimizer) / 1e9

print(f"Training memory: {model_memory_gb(model):.1f} GB")
```
### Hyperparameter search (Optuna)
```python
import optuna

def objective(trial):
    # Each suggest_* call samples one hyperparameter for this trial.
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    bs = trial.suggest_categorical("batch_size", [32, 64, 128])
    layers = trial.suggest_int("num_layers", 2, 8)
    # train_and_eval is user-supplied; it returns the metric to maximize.
    return train_and_eval(lr=lr, batch_size=bs, num_layers=layers)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
```
### MoE active params
```python
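# Mixtral 8x7B: ~47B total, ~13B active per token (top-2 routing)
total = 47e9
active = 13e9
experts = 8
active_per_token = 2
# total = shared + experts * per_expert; active = shared + active_per_token * per_expert
per_expert = (total - active) / (experts - active_per_token)  # ~5.7B
shared = active - active_per_token * per_expert               # ~1.7B (rough)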
```
## Decision criteria
| Situation | Approach |
|---|---|
| Memory budget plan | Weights = total params × dtype bytes; training adds gradients and optimizer states (≈ 4× the weight bytes for fp32 with Adam) |
| Inference deployment | Total params × dtype bytes (+ KV cache) |
| Scaling decision | Chinchilla: tokens ≈ 20 × params |
| Compute budget | Training FLOPs ≈ 6 × params × tokens |
| Fine-tuning | PEFT if params > 1B and only one GPU is available |
**Default**: always report total and trainable params separately.
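The inference row in the table can be sketched as weights plus KV cache. The example below is a rough estimate under stated assumptions: standard multi-head attention (one key and one value vector per token, per layer, per KV head), 2-byte dtypes, and a hypothetical 7B-class configuration (32 layers, 32 KV heads, head dimension 128). The function name and the configuration values are illustrative, not taken from the note.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    """Bytes held by the key/value cache for `batch` sequences of `seq_len` tokens."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

params = 7e9                      # hypothetical 7B-parameter model
weights_gb = params * 2 / 1e9     # bf16/fp16 weights
kv_gb = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096) / 1e9
print(f"weights: {weights_gb:.1f} GB, KV cache: {kv_gb:.2f} GB, "
      f"total: {weights_gb + kv_gb:.2f} GB")
```

The KV cache grows linearly with both sequence length and batch size, so it can dominate at long contexts even when the weights fit comfortably.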
## 🔗 Graph
- Parent: [[Machine-Learning]]
- Variant: [[Trainable-Parameter]]
- Applications: [[LLM_Optimization_and_Deployment_Strategies|Model-Compression]] · [[PEFT (Parameter-Efficient Fine-Tuning)]] · [[LLM_Optimization_and_Deployment_Strategies|Quantization]]
- Adjacent: [[Scaling-Laws]] · [[MoE]]
## 🤖 LLM usage
**When**: model size discussions, memory planning, fine-tuning scope decisions.
**When not**: high-level user-facing communication (say "model size" instead).
## ❌ Anti-patterns
- **Confusing parameters with hyperparameters**: calling `lr` a parameter.
- **Counting frozen params as trainable**: reporting 70B "trainable" when only the LoRA adapters (0.5%) actually train.
- **Ignoring MoE active vs total**: treating Mixtral's 47B as 47B of compute per token (actually ~13B active per token).
- **Memory underestimation**: forgetting optimizer states (8 bytes per parameter for Adam with fp32 states).
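The frozen-vs-trainable anti-pattern is easy to check with arithmetic. The sketch below assumes a single square projection of width 4096 with a rank-8 LoRA adapter; both numbers and the helper name `lora_params` are illustrative, and real trainable fractions depend on which layers get adapters and at what rank.

```python
def lora_params(d_in, d_out, rank):
    """LoRA adds two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

d = 4096
frozen = d * d                        # base weight matrix, kept frozen
adapter = lora_params(d, d, rank=8)   # the only part that trains
fraction = adapter / (frozen + adapter)
print(f"trainable: {adapter:,} of {frozen + adapter:,} ({fraction:.2%})")
```

Reporting the adapter count and the total side by side avoids overstating what is actually being trained.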
## 🧪 Verification / duplicates
- Verified (PyTorch docs, Kaplan 2020 / Hoffmann 2022 scaling laws).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: parameter vs hyperparameter, modern scale, memory math |