id: wiki-2026-0508-parameter
title: Parameter
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Model Parameter, Weight, Trainable Parameter
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: parameter, weight, hyperparameter, ml-fundamentals
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python, framework: pytorch

Parameter

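In one line

"Learned from data vs set by a human." A parameter is a value the model learns during training (weights, biases). A hyperparameter is a value a human sets before training (learning rate, depth, batch size). As of 2026, frontier models such as GPT-5 and Claude Opus 4.7 are widely assumed to be trillion-parameter scale, although their vendors do not publish parameter counts; parameter count remains a dominant axis of scale.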

Key points

Parameter vs hyperparameter

  • Parameter (θ): trainable; the target of gradient-descent updates. Examples: W, b in y = Wx + b.
  • Hyperparameter: fixed before training; an architecture or optimizer choice. Examples: learning rate, batch size, num_layers, dropout p.
  • Ambiguous case: prompt tokens. With a soft prompt they are parameters (trained embeddings); with a hard prompt they are just input.
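The distinction can be made concrete with a minimal sketch in plain Python, assuming a one-dimensional model y = w·x + b trained by full-batch gradient descent on mean squared error. The names `fit_line`, `LEARNING_RATE`, and `NUM_STEPS` are illustrative and not part of any library: `w` and `b` are parameters (updated from data), while the learning rate and step count are hyperparameters (chosen up front and never touched by the update rule).

```python
# Hyperparameters: chosen by a human before training; never updated by gradients.
LEARNING_RATE = 0.05
NUM_STEPS = 2000


def fit_line(xs, ys, lr=LEARNING_RATE, steps=NUM_STEPS):
    """Fit y = w * x + b by full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # parameters: learned from the data
    n = len(xs)
    for _ in range(steps):
        # Gradients of mean((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1
w, b = fit_line(xs, ys)
print(f"learned parameters: w={w:.3f}, b={b:.3f}")
```

Changing `LEARNING_RATE` changes how training behaves, but it is never itself learned; that is the sense in which it is a hyperparameter.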

Parameter types

  • Weights: matrix multiply coefficients (W in Wx + b).
  • Biases: additive offsets (b).
  • Embeddings: lookup table (vocab × dim).
  • LayerNorm γ, β: scale/shift learned per channel.
  • Buffers: NOT parameters. They hold running statistics (BatchNorm running_mean) and moving averages, and are saved with the model but not updated by the optimizer.
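As a rough illustration of how each type contributes to the total, the sketch below counts parameters from layer shapes alone. The helper names and the toy dimensions (`vocab`, `dim`, `hidden`) are hypothetical, chosen only to make the arithmetic visible; buffers are deliberately left out because they are not parameters.

```python
def linear_params(in_features, out_features, bias=True):
    """Weight matrix (in x out) plus an optional bias vector (out)."""
    return in_features * out_features + (out_features if bias else 0)


def embedding_params(vocab_size, dim):
    """Lookup table with one dim-sized row per vocabulary entry."""
    return vocab_size * dim


def layernorm_params(dim):
    """One scale (gamma) and one shift (beta) per channel."""
    return 2 * dim


# Hypothetical toy dimensions, for illustration only.
vocab, dim, hidden = 30_000, 768, 3_072
counts = {
    "embedding": embedding_params(vocab, dim),
    "ffn_up": linear_params(dim, hidden),
    "ffn_down": linear_params(hidden, dim),
    "layernorm": layernorm_params(dim),
}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:>10}: {n:>12,} ({n / total:.1%})")
print(f"{'total':>10}: {total:>12,}")
```

With these toy sizes the embedding table dominates, which is typical for small models with large vocabularies.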

Modern scale

  • BERT-base (2018): 110M.
  • GPT-3 (2020): 175B.
  • GPT-4 (2023): ~1.7T (rumored MoE; never officially confirmed).
  • Llama 3.1 405B (2024): 405B dense.
  • GPT-5 / Claude Opus 4.7 (2025-2026): presumed trillion-scale (counts not published); MoE is common.
  • Active params (MoE) ≠ total params.

Applications

  1. Model size estimation (memory budget).
  2. Compute budget (Chinchilla scaling: tokens ≈ 20× params).
  3. Compression (quantization, pruning operate on params).
  4. Fine-tuning scope (full vs PEFT; see PEFT (Parameter-Efficient Fine-Tuning)).
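The compute-budget item in the list above can be worked through numerically. This is a minimal sketch assuming the two rules of thumb this note uses (tokens ≈ 20 × params, and training FLOPs ≈ 6 × params × tokens); the function names and the 7B example size are illustrative assumptions, not measurements.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal training tokens under the ~20 tokens/param heuristic."""
    return tokens_per_param * n_params


def training_flops(n_params, n_tokens):
    """Approximate training cost: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


n_params = 7e9  # hypothetical 7B-parameter dense model
n_tokens = chinchilla_tokens(n_params)
flops = training_flops(n_params, n_tokens)
print(f"params: {n_params:.2e}")
print(f"tokens: {n_tokens:.2e}")
print(f"FLOPs:  {flops:.2e}")
```

Both rules are approximations: many production models are trained on far more than 20 tokens per parameter to lower inference cost, so treat the result as a starting point for a budget rather than a target.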

💻 Patterns

Count parameters

def count_params(model):
    """Return (total, trainable) parameter counts for a torch.nn.Module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# `model` is any torch.nn.Module instance defined elsewhere
total, trainable = count_params(model)
print(f"Total: {total/1e9:.2f}B, Trainable: {trainable/1e9:.2f}B")

Parameter vs buffer

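import torch
import torch.nn as nn

class MyLayer(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Parameter: appears in .parameters() and receives gradients
        self.weight = nn.Parameter(torch.randn(10, 10))
        # Buffer: saved in state_dict and moved with .to(), but NOT trainable
        self.register_buffer("running_mean", torch.zeros(10))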

Freeze parameters (transfer learning)

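import torch

# Assumes `model` has an `encoder` submodule plus a classifier head
for p in model.encoder.parameters():
    p.requires_grad = False  # frozen: no gradients computed or applied
# Only the classifier head trains, so pass only trainable params to the optimizer
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)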

Memory estimation

def model_memory_gb(model, dtype_bytes=2):  # 2 bytes per value = bf16/fp16
    """Rough training footprint in GB; activations and fp32 master weights excluded."""
    n = sum(p.numel() for p in model.parameters())
    weights = n * dtype_bytes
    gradients = n * dtype_bytes  # only needed when training
    adam_states = n * 8  # Adam: 2 states (m, v) × 4 bytes in fp32
    return (weights + gradients + adam_states) / 1e9

print(f"Training memory: {model_memory_gb(model):.1f} GB")

Hyperparameter search (Optuna)

import optuna

def objective(trial):
    # Each suggest_* call samples one hyperparameter for this trial
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    bs = trial.suggest_categorical("batch_size", [32, 64, 128])
    layers = trial.suggest_int("num_layers", 2, 8)
    # train_and_eval is a user-defined function returning the validation metric
    return train_and_eval(lr=lr, batch_size=bs, num_layers=layers)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

MoE active params

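# Mixtral 8x7B: ~47B total, ~13B active per token (top-2 routing)
total = 47e9
active = 13e9
experts = 8
top_k = 2  # experts used per token
# total = shared + experts * per_expert; active = shared + top_k * per_expert
per_expert = (total - active) / (experts - top_k)  # ~5.7B (rough)
shared = active - top_k * per_expert  # ~1.7B: attention, embeddings, router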

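Decision criteria

| Situation | Approach |
| --- | --- |
| Memory budget plan | Total params × dtype bytes (×1 for weights only; about ×4 with gradients and Adam states when everything is fp32) |
| Inference deployment | Total params × dtype bytes (+ KV cache) |
| Scaling decision | Chinchilla: tokens ≈ 20 × params |
| Compute budget | FLOPs ≈ 6 × params × tokens |
| Fine-tuning | PEFT if params > 1B and only a single GPU is available |

Default: always report total and trainable params separately.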
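For the inference row, the KV cache term is often what gets forgotten. The sketch below assumes a standard decoder-only transformer that caches one key and one value vector per layer, per KV head, per token; the function name and the example configuration (32 layers, 32 KV heads, head dimension 128) are hypothetical values chosen for illustration.

```python
def inference_memory_gb(
    n_params,
    dtype_bytes=2,  # bf16/fp16 weights
    *,
    n_layers,
    n_kv_heads,
    head_dim,
    seq_len,
    batch_size=1,
    kv_dtype_bytes=2,
):
    """Weights plus KV cache in GB; activations and runtime overhead excluded."""
    weights = n_params * dtype_bytes
    # Factor 2 = one key tensor and one value tensor per layer.
    kv_cache = (
        2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * kv_dtype_bytes
    )
    return (weights + kv_cache) / 1e9


# Hypothetical 7B-class configuration, for illustration only.
gb = inference_memory_gb(
    7e9, n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096
)
print(f"Inference memory: {gb:.2f} GB")
```

The cache grows linearly with both sequence length and batch size, so for long contexts or large batches it can rival the weights themselves; grouped-query attention reduces it by lowering `n_kv_heads`.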

🔗 Graph

🤖 LLM usage

When to use: model size discussion, memory planning, fine-tuning scope decisions.
When not to use: high-level user-facing communication (say "model size" instead).

Anti-patterns

  • Confusing parameter with hyperparameter: calling lr a parameter.
  • Counting frozen as trainable: reporting 70B "trainable" when only the LoRA adapters (0.5%) actually train.
  • Ignoring MoE active vs total: treating Mixtral's 47B total as 47B of per-token compute (actually about 13B active per token).
  • Memory underestimation: forgetting optimizer states (8 bytes per parameter for Adam with fp32 states).
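To see why frozen and trainable counts must be reported separately, here is a back-of-the-envelope LoRA count. It assumes the usual LoRA factorization (two low-rank matrices of shapes rank × d_in and d_out × rank per adapted weight); the base model size, layer count, and the choice of two adapted projections per layer are hypothetical.

```python
def lora_params(d_in, d_out, rank):
    """Trainable params added by one LoRA adapter: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical setup, for illustration only.
base_params = 7_000_000_000  # frozen base model
d_model, n_layers, rank = 4096, 32, 8
adapted_per_layer = 2  # e.g. two square attention projections per layer

trainable = n_layers * adapted_per_layer * lora_params(d_model, d_model, rank)
total = base_params + trainable
fraction = trainable / total
print(f"Total: {total:,}  Trainable: {trainable:,} ({fraction:.3%})")
```

Under these assumptions well under 1% of the parameters train, so quoting the base model size as "trainable" would overstate the training footprint by orders of magnitude.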

🧪 Verification / duplicates

  • Verified (PyTorch docs, Kaplan 2020 / Hoffmann 2022 scaling laws).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: parameter vs hyperparameter, modern scale, memory math |