Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

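10_Wiki/Topics large-scale cleanup:
- Removed 227 error-capture and incomplete stub documents
- Merged 43 clusters of cross-folder duplicates (63 files → redirect)
- Link name normalization: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 items)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections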

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00

5.3 KiB

id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack

title

Parameter

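One-line summary

"Learned from data vs. set by a human." A parameter is a value the model learns during training (weights, biases). A hyperparameter is a value a human sets before training (learning rate, depth, batch size). 2026 frontier: models such as GPT-5 and Claude Opus 4.7 are generally assumed to be trillion-parameter scale (their sizes are not officially disclosed), and parameter count remains a dominant axis of scaling.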

Key points

Parameter vs hyperparameter

Parameter (θ): trainable; the target of gradient-descent updates. Examples: W and b in y = Wx + b.
Hyperparameter: fixed before training; an architecture or optimization choice. Examples: learning rate, batch size, num_layers, dropout p.
Ambiguous case: prompt tokens. A soft prompt is a parameter (its embeddings are trained); a hard prompt is just input.
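
The split can be made concrete with a minimal sketch in plain Python (no framework assumed). Here `w` and `b` are parameters that gradient descent updates, while `lr` and `steps` are hyperparameters fixed before training. The data points and the chosen values are illustrative assumptions, not taken from any real model.

```python
# Fit y = w*x + b to points that lie on the line y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

# Hyperparameters: chosen by a human before training, never updated by it.
lr = 0.05
steps = 2000

# Parameters: learned from the data.
w, b = 0.0, 0.0

for _ in range(steps):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    # Only the parameters move; lr stays fixed.
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.3f}, b={b:.3f}, lr={lr}")
```

After training, `w` and `b` have moved toward the values that generated the data, while `lr` is unchanged: that is the whole distinction.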

Parameter types

Weights: matrix multiply coefficients (W in Wx + b).
Biases: additive offsets (b).
Embeddings: lookup table (vocab × dim).
LayerNorm γ, β: scale/shift learned per channel.
Buffers: NOT parameters. They are state stored with the module but not updated by the optimizer, such as running statistics (BatchNorm running_mean) and moving averages.
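
As a rough illustration of how each type contributes to a model's size, the sketch below counts parameters per layer type in plain Python. The helper names and the example dimensions (a 30,000-token vocabulary, width 768) are hypothetical choices for illustration.

```python
def linear_params(d_in, d_out, bias=True):
    # Weight matrix (d_in x d_out) plus an optional bias vector (d_out).
    return d_in * d_out + (d_out if bias else 0)

def embedding_params(vocab_size, dim):
    # One learned vector of length dim per vocabulary entry.
    return vocab_size * dim

def layernorm_params(dim):
    # One scale (gamma) and one shift (beta) per channel.
    return 2 * dim

vocab_size, dim = 30_000, 768  # illustrative sizes
counts = {
    "embedding": embedding_params(vocab_size, dim),
    "linear (dim -> 4*dim)": linear_params(dim, 4 * dim),
    "layernorm": layernorm_params(dim),
}
for name, n in counts.items():
    print(f"{name}: {n:,}")
```

Buffers are deliberately absent from this count: they occupy memory but are not parameters.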

Modern scale

BERT-base (2018): 110M.
GPT-3 (2020): 175B.
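GPT-4 (2023): ~1.7T (rumored MoE; not officially confirmed).
Llama 3.1 405B (2024): 405B dense.
GPT-5 / Claude Opus 4.7 (2025-2026): parameter counts undisclosed; generally assumed to be trillion-scale, with MoE architectures common.
For MoE models, active parameters per token ≠ total parameters.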

Applications

Model size estimation (memory budget).
Compute budget (Chinchilla scaling: tokens ≈ 20× params).
Compression (quantization, pruning operate on params).
Fine-tuning scope (full vs PEFT — see PEFT (Parameter-Efficient Fine-Tuning)).
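
To show why fine-tuning scope matters for the parameter count, here is a small sketch comparing full fine-tuning of one weight matrix with a LoRA adapter on the same matrix. It assumes the standard LoRA formulation (two low-rank factors per adapted matrix); the matrix size and rank are illustrative.

```python
def lora_params(d_in, d_out, rank):
    # LoRA adds two low-rank factors: A (rank x d_in) and B (d_out x rank).
    return rank * (d_in + d_out)

d_model, rank = 4096, 8  # illustrative sizes
full = d_model * d_model                       # trainable params, full fine-tuning
adapter = lora_params(d_model, d_model, rank)  # trainable params, LoRA only

print(f"full: {full:,}  LoRA: {adapter:,}  ratio: {adapter / full:.2%}")
```

The frozen base weights still count toward total parameters and memory; only the trainable count shrinks.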

💻 Patterns

Count parameters

def count_params(model):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

total, trainable = count_params(model)
print(f"Total: {total/1e9:.2f}B, Trainable: {trainable/1e9:.2f}B")

Parameter vs buffer

import torch
import torch.nn as nn

class MyLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(10, 10))  # trainable
        self.register_buffer("running_mean", torch.zeros(10))  # NOT trainable

Freeze parameters (transfer learning)

for p in model.encoder.parameters():
    p.requires_grad = False  # frozen
# Only classifier head trains
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

Memory estimation

def model_memory_gb(model, dtype_bytes=2):  # bf16
    n = sum(p.numel() for p in model.parameters())
    weights = n * dtype_bytes
    gradients = n * dtype_bytes  # if training
    optimizer = n * 8  # Adam: 2 states × fp32
    return (weights + gradients + optimizer) / 1e9

print(f"Training memory: {model_memory_gb(model):.1f} GB")
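
The estimate above covers training. For inference there are no gradients or optimizer states, but the KV cache grows with context length and batch size. The following is a rough sketch assuming a standard decoder-only transformer that caches one key and one value tensor per layer; the function names and the 7B-style configuration are hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, dtype_bytes=2):
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

def inference_memory_gb(n_params, kv_bytes, dtype_bytes=2):
    # Weights only (no gradients, no optimizer state) plus the KV cache.
    return (n_params * dtype_bytes + kv_bytes) / 1e9

# Hypothetical 7B-parameter configuration, chosen only for illustration.
kv = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)
print(f"KV cache: {kv / 1e9:.2f} GB")
print(f"Inference total: {inference_memory_gb(7e9, kv):.2f} GB")
```

Activations and framework overhead are ignored here, so treat the result as a lower bound.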

Hyperparameter search (Optuna)

import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    bs = trial.suggest_categorical("batch_size", [32, 64, 128])
    layers = trial.suggest_int("num_layers", 2, 8)
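    # train_and_eval is a placeholder for your own training loop;
    # it should return the validation metric to maximize.
    return train_and_eval(lr=lr, batch_size=bs, num_layers=layers)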

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

MoE active params

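# Mixtral 8x7B: ~47B total, ~13B active per token (top-2 routing)
total = 47e9
active = 13e9
experts = 8
experts_per_token = 2
# total = shared + experts * per_expert
# active = shared + experts_per_token * per_expert
per_expert = (total - active) / (experts - experts_per_token)  # ~5.7B
shared = active - experts_per_token * per_expert  # ~1.7B (attention, embeddings); rough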

Decision criteria

Situation	Approach
Memory budget plan	Training with Adam: weights + gradients + 8 bytes/param of optimizer state (about 4× the weights in fp32, about 6× in bf16)
Inference deployment	Total params × dtype bytes (+ KV cache)
Scaling decision	Chinchilla: tokens ≈ 20 × params
Compute budget	FLOPs ≈ 6 × params × tokens
Fine-tuning	PEFT if params > 1B and only one GPU is available

Default: always report total and trainable parameters separately.
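
The two scaling rules of thumb in the table (tokens ≈ 20 × params, FLOPs ≈ 6 × params × tokens) can be combined into a quick budget calculation. This is a minimal sketch; the function names and the 70B model size are assumptions for illustration.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    # Compute-optimal token count under the ~20 tokens/param rule of thumb.
    return tokens_per_param * n_params

def training_flops(n_params, n_tokens):
    # Approximate training cost: about 6 FLOPs per parameter per token.
    return 6 * n_params * n_tokens

n_params = 70_000_000_000  # hypothetical 70B dense model
n_tokens = chinchilla_tokens(n_params)
flops = training_flops(n_params, n_tokens)

print(f"tokens ≈ {n_tokens:.2e}, FLOPs ≈ {flops:.2e}")
```

For an MoE model, the active parameter count per token is the more relevant input to the FLOPs estimate.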

🔗 Graph

Parent: Machine-Learning
Variant: Trainable-Parameter
Applications: LLM_Optimization_and_Deployment_Strategies · PEFT (Parameter-Efficient Fine-Tuning)
Adjacent: Scaling-Laws · MoE

🤖 LLM usage

When to use: model size discussions, memory planning, fine-tuning scope decisions. When not to use: high-level user-facing communication (say "model size" instead).

❌ Anti-patterns

Confusing parameters with hyperparameters: calling lr a parameter.
Counting frozen parameters as trainable: reporting 70B "trainable" when only the LoRA adapters (e.g. ~0.5%) actually train.
Ignoring MoE active vs total: treating Mixtral's 47B total as 47B of per-token compute (about 13B are active per token, though all 47B must still fit in memory).
Memory underestimation: forgetting optimizer states (Adam keeps two fp32 states, i.e. 8 bytes per parameter).

🧪 Verification / duplicates

Verified against the PyTorch docs and the Kaplan et al. (2020) / Hoffmann et al. (2022) scaling-law papers.
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup — parameter vs hyperparameter, modern scale, memory math
