2nd/10_Wiki/Topics/AI_and_ML/Parameter.md

---
id: wiki-2026-0508-parameter
title: Parameter
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Model Parameter, Weight, Trainable Parameter]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [parameter, weight, hyperparameter, ml-fundamentals]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---

# Parameter

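## One-line summary
> **"Learned from data vs set by a human."** A parameter is learned by the model during training (weights, biases). A hyperparameter is set by a human beforehand (learning rate, depth, batch size). 2026 frontier: models such as GPT-5 and Claude Opus 4.7 are reported to be trillion-parameter scale (exact counts are not officially disclosed), and parameter count remains a dominant axis of scale.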

## Core concepts

### Parameter vs hyperparameter
- **Parameter (θ)**: trainable; the target of gradient-descent updates. Examples: `W`, `b` in `y = Wx + b`.
- **Hyperparameter**: fixed before training; an architecture or optimizer choice. Examples: learning rate, batch size, num_layers, dropout p.
- Ambiguous case: prompt tokens. A soft prompt (trainable embeddings) is a parameter; a hard prompt (plain text) is an input.
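
To make the distinction concrete, here is a minimal sketch in plain Python (no framework) that fits `y = wx + b` by gradient descent. The dataset, the constant names `LEARNING_RATE` and `NUM_STEPS`, and their values are assumptions chosen for illustration: `w` and `b` are the parameters that the loop updates, while the two constants are hyperparameters that never change during training.

```python
# Hyperparameters: chosen by a human before training, never updated by gradients.
LEARNING_RATE = 0.05
NUM_STEPS = 2000

# Toy dataset generated from y = 2x + 1 (illustrative values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

# Parameters: learned from data, updated on every step.
w, b = 0.0, 0.0

for _ in range(NUM_STEPS):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= LEARNING_RATE * grad_w
    b -= LEARNING_RATE * grad_b

print(f"learned w={w:.3f}, b={b:.3f}")
```

With these settings the learned values should land very close to the generating values `w = 2` and `b = 1`. Changing `LEARNING_RATE` changes how training proceeds, but it is never itself the output of training.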

### Parameter types
- **Weights**: matrix multiply coefficients (`W` in `Wx + b`).
- **Biases**: additive offsets (`b`).
- **Embeddings**: lookup table (vocab × dim).
- **LayerNorm γ, β**: scale/shift learned per channel.
- **Buffers**: NOT params. They hold running statistics (BatchNorm `running_mean`) and moving averages, saved with the model but not updated by gradients.
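
As a rough illustration of how each type contributes to the total, the sketch below counts parameters by formula for a hypothetical toy model. The helper names and the configuration (`vocab_size`, `dim`, `hidden`) are assumptions for this example, not values from any real architecture.

```python
def linear_params(in_features, out_features, bias=True):
    # Weight matrix plus an optional bias vector.
    return in_features * out_features + (out_features if bias else 0)

def embedding_params(vocab_size, dim):
    # One row of `dim` values per vocabulary entry.
    return vocab_size * dim

def layernorm_params(dim):
    # gamma (scale) and beta (shift), one value each per feature.
    return 2 * dim

# Hypothetical toy configuration.
vocab_size, dim, hidden = 1000, 64, 256

counts = {
    "embedding": embedding_params(vocab_size, dim),
    "ffn_up": linear_params(dim, hidden),
    "ffn_down": linear_params(hidden, dim),
    "layernorm": layernorm_params(dim),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:>10}: {n:>7,} ({n / total:.1%})")
print(f"{'total':>10}: {total:>7,}")
```

Even in this small configuration the embedding table dominates, which is typical when the vocabulary is large relative to the hidden size.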

### Modern scale
- BERT-base (2018): 110M.
- GPT-3 (2020): 175B.
- GPT-4 (2023): ~1.7T (rumored MoE; not officially confirmed).
- Llama 3.1 405B (2024): 405B dense.
- GPT-5 / Claude Opus 4.7 (2025-2026): reportedly trillion-scale (counts undisclosed); MoE is common.
- For MoE models, active params per token ≠ total params.

### Applications
1. Model size estimation (memory budget).
2. Compute budget (Chinchilla scaling: tokens ≈ 20× params).
3. Compression (quantization and pruning operate on params).
4. Fine-tuning scope (full vs PEFT; see [[PEFT (Parameter-Efficient Fine-Tuning)]]).
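
The compute-budget rule of thumb can be turned into a quick calculation. This is a sketch under two common approximations: roughly 20 training tokens per parameter (Chinchilla) and roughly 6 FLOPs per parameter per token for training. The function names and the 7B example size are assumptions for illustration.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    # Approximate compute-optimal number of training tokens.
    return tokens_per_param * n_params

def training_flops(n_params, n_tokens):
    # Approximate training cost: ~6 FLOPs per parameter per token
    # (forward plus backward pass).
    return 6 * n_params * n_tokens

n_params = 7e9  # hypothetical 7B-parameter dense model
tokens = chinchilla_tokens(n_params)
flops = training_flops(n_params, tokens)

print(f"tokens: {tokens:.2e}")
print(f"training FLOPs: {flops:.2e}")
```

For a 7B model this gives about 140B tokens and roughly 5.9e21 training FLOPs. Both figures are order-of-magnitude planning numbers, not exact requirements.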

## 💻 Patterns

### Count parameters
```python
def count_params(model):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

total, trainable = count_params(model)
print(f"Total: {total/1e9:.2f}B, Trainable: {trainable/1e9:.2f}B")
```

### Parameter vs buffer
```python
import torch
import torch.nn as nn

class MyLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(10, 10))  # trainable
        self.register_buffer("running_mean", torch.zeros(10))  # NOT trainable
```
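
One way to see the difference in practice is to inspect what the module reports. The sketch below assumes PyTorch is installed and uses a hypothetical `TinyNorm` module: the buffer is saved in `state_dict()` alongside the parameter, but it does not appear in `named_parameters()`, so an optimizer built from `parameters()` never updates it.

```python
import torch
import torch.nn as nn

class TinyNorm(nn.Module):  # hypothetical module for illustration
    def __init__(self, dim=10):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))            # trainable
        self.register_buffer("running_mean", torch.zeros(dim))  # state, not trainable

layer = TinyNorm()
param_names = [name for name, _ in layer.named_parameters()]
buffer_names = [name for name, _ in layer.named_buffers()]
state_keys = list(layer.state_dict().keys())

print("parameters:", param_names)
print("buffers:   ", buffer_names)
print("state_dict:", state_keys)
```

Because buffers are excluded from `parameters()`, they are also excluded from a parameter count computed with `p.numel()`, even though they still take up memory and checkpoint space.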

### Freeze parameters (transfer learning)
```python
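import torch

# Assumes `model` exposes an `encoder` submodule; adjust to your architecture.
for p in model.encoder.parameters():
    p.requires_grad = False  # frozen: no gradients are computed for these
# Pass only the still-trainable parameters (e.g. the classifier head) to the optimizer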
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```
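
Freezing is also the basis of PEFT methods such as LoRA, where the base model stays frozen and only small low-rank adapters train. The sketch below estimates the trainable fraction under stated assumptions: each adapted `d_out × d_in` weight gains two matrices of rank `r`, for `r * (d_in + d_out)` new parameters. The layer count, hidden size, rank, and base size are hypothetical values, not taken from a specific model.

```python
def lora_params(d_in, d_out, rank):
    # LoRA adds A (rank x d_in) and B (d_out x rank) next to a frozen weight.
    return rank * (d_in + d_out)

# Hypothetical configuration: 32 layers, 4 adapted square projections per layer.
d_model, n_layers, adapted_per_layer, rank = 4096, 32, 4, 16
base_total = 7_000_000_000  # assumed size of the frozen base model

trainable = n_layers * adapted_per_layer * lora_params(d_model, d_model, rank)
total = base_total + trainable
fraction = trainable / total

print(f"total: {total / 1e9:.2f}B, trainable: {trainable / 1e6:.1f}M ({fraction:.2%})")
```

Under these assumptions about 16.8M parameters train, roughly 0.24% of the total, which is why total and trainable counts should be reported separately.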

### Memory estimation
```python
def model_memory_gb(model, dtype_bytes=2):  # bf16
    n = sum(p.numel() for p in model.parameters())
    weights = n * dtype_bytes
    gradients = n * dtype_bytes  # if training
    optimizer = n * 8  # Adam: 2 states × fp32
    return (weights + gradients + optimizer) / 1e9

print(f"Training memory: {model_memory_gb(model):.1f} GB")
```
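
The same arithmetic can be worked through without a model object. This sketch assumes a hypothetical 7B-parameter model and counts only weights, gradients, and Adam states; activations, the KV cache, and any fp32 master copy of the weights are deliberately left out, so real usage will be higher.

```python
n_params = 7e9  # hypothetical 7B-parameter model

# Weight-only memory at different precisions (bytes per parameter).
dtype_bytes = {"fp32": 4, "bf16": 2, "int8": 1, "int4": 0.5}
weights_gb = {name: n_params * b / 1e9 for name, b in dtype_bytes.items()}

def training_bytes_per_param(weight_bytes=2, grad_bytes=2, adam_state_bytes=8):
    # bf16 weights + bf16 gradients + two fp32 Adam states (2 x 4 bytes).
    return weight_bytes + grad_bytes + adam_state_bytes

inference_gb = weights_gb["bf16"]
training_gb = n_params * training_bytes_per_param() / 1e9

for name, gb in weights_gb.items():
    print(f"{name}: {gb:.1f} GB of weights")
print(f"bf16 inference: {inference_gb:.1f} GB, bf16 + Adam training: {training_gb:.1f} GB")
```

In this setup training needs about six times the bf16 weight memory (12 bytes per parameter versus 2), which is the gap that quantization and PEFT aim to close.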

### Hyperparameter search (Optuna)
```python
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    bs = trial.suggest_categorical("batch_size", [32, 64, 128])
    layers = trial.suggest_int("num_layers", 2, 8)
    # train_and_eval is user-defined and should return the validation metric to maximize
    return train_and_eval(lr=lr, batch_size=bs, num_layers=layers)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
```

### MoE active params
```python
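# Mixtral 8x7B: ~47B total, ~13B active per token (top-2 routing)
total = 47e9
active = 13e9
experts = 8
active_per_token = 2
# total  = shared + experts * per_expert
# active = shared + active_per_token * per_expert
per_expert = (total - active) / (experts - active_per_token)  # ~5.7B per expert
shared = active - active_per_token * per_expert  # ~1.7B shared (attention, embeddings); rough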
```

## Decision criteria
| Situation | Approach |
|---|---|
| Memory budget planning | Training with Adam: params × (weight bytes + gradient bytes + 8 optimizer bytes), about 4× the weights in fp32 and 6× in bf16 |
| Inference deployment | Total params × dtype bytes (+ KV cache) |
| Scaling decision | Chinchilla: tokens ≈ 20 × params |
| Compute budget | Training FLOPs ≈ 6 × params × tokens |
| Fine-tuning | PEFT if params > 1B and only one GPU is available |

**Default**: always report total and trainable params separately.

## 🔗 Graph
- Parent: [[Machine-Learning]]
- Variant: [[Trainable-Parameter]]
- Applications: [[LLM_Optimization_and_Deployment_Strategies|Model-Compression]] · [[PEFT (Parameter-Efficient Fine-Tuning)]] · [[LLM_Optimization_and_Deployment_Strategies|Quantization]]
- Adjacent: [[Scaling-Laws]] · [[MoE]]

## 🤖 LLM usage
**When**: model size discussions, memory planning, fine-tuning scope decisions.
**When not**: high-level, user-facing communication (say "model size" instead).

## ❌ Anti-patterns
- **Confusing parameters with hyperparameters**: calling `lr` a parameter.
- **Counting frozen params as trainable**: reporting 70B "trainable" when only the LoRA adapters (e.g. ~0.5%) actually train.
- **Ignoring MoE active vs total**: treating Mixtral's 47B total as 47B of per-token compute (actually ~13B active per token).
- **Memory underestimation**: forgetting optimizer states (Adam with fp32 states adds 8 bytes per parameter, i.e. 4× the size of bf16 weights).

## 🧪 Verification / duplicates
- Verified (PyTorch docs, Kaplan 2020 / Hoffmann 2022 scaling laws).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: parameter vs hyperparameter, modern scale, memory math |