---
id: wiki-2026-0508-parameter
title: Parameter
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Model Parameter, Weight, Trainable Parameter]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [parameter, weight, hyperparameter, ml-fundamentals]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---

# Parameter

## One-line summary

> **"Learned from data vs. set by a human."** Parameter = a value the model learns during training (weight, bias). Hyperparameter = a value a human sets beforehand (learning rate, depth, batch size). 2026 frontier: models reported to be trillion-parameter scale (GPT-5, Claude Opus 4.7), with parameter count as the dominant axis of scale.

## Core

### Parameter vs hyperparameter

- **Parameter (θ)**: trainable; the target of gradient-descent updates. Examples: W, b in `y = Wx + b`.
- **Hyperparameter**: fixed before training; an architecture or optimization choice. Examples: learning rate, batch size, num_layers, dropout p.
- Ambiguous case: prompt tokens (a parameter with soft prompts, an input with hard prompts).

### Parameter types

- **Weights**: matrix-multiply coefficients (`W` in `Wx + b`).
- **Biases**: additive offsets (`b`).
- **Embeddings**: lookup table (vocab × dim).
- **LayerNorm γ, β**: learned per-feature scale and shift.
- **Buffers**: NOT parameters. Running statistics (BatchNorm `running_mean`) and moving averages are stored with the model but receive no gradient updates.

### Modern scale

- BERT-base (2018): 110M.
- GPT-3 (2020): 175B.
- GPT-4 (2023): ~1.7T (rumored MoE, unconfirmed).
- Llama 3.1 405B (2024): 405B dense.
- GPT-5 / Claude Opus 4.7 (2025-2026): trillion-scale (counts not officially disclosed), MoE common.
- For MoE models, active params per token ≠ total params.

### Applications

1. Model size estimation (memory budget).
2. Compute budget (Chinchilla scaling: tokens ≈ 20× params).
3. Compression (quantization and pruning operate on params).
4. Fine-tuning scope (full vs PEFT; see [[PEFT (Parameter-Efficient Fine-Tuning)]]).
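The first two applications reduce to simple arithmetic on the parameter count. Below is a minimal sketch, assuming the Chinchilla rule of thumb (tokens ≈ 20 × params) and the common approximation of roughly 6 FLOPs per parameter per training token. The function names and the 7B example model are illustrative assumptions, not part of any library.

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal training tokens under the ~20 tokens/param rule of thumb."""
    return n_params * tokens_per_param


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def weight_memory_gb(n_params: float, dtype_bytes: int = 2) -> float:
    """Memory for the weights alone (bf16 by default), in GB (1e9 bytes)."""
    return n_params * dtype_bytes / 1e9


n = 7e9  # hypothetical 7B-parameter dense model
tokens = chinchilla_tokens(n)
flops = training_flops(n, tokens)

print(f"tokens: {tokens:.2e}")                        # 1.40e+11
print(f"FLOPs: {flops:.2e}")                          # 5.88e+21
print(f"bf16 weights: {weight_memory_gb(n):.1f} GB")  # 14.0 GB
```

These are order-of-magnitude planning numbers. Real training memory also includes gradients, optimizer states, and activations, so the weights-only figure is a lower bound.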
## 💻 Patterns

### Count parameters

```python
def count_params(model):
    """Return (total, trainable) parameter counts for a torch.nn.Module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

total, trainable = count_params(model)
print(f"Total: {total/1e9:.2f}B, Trainable: {trainable/1e9:.2f}B")
```

### Parameter vs buffer

```python
import torch
import torch.nn as nn

class MyLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(10, 10))        # trainable
        self.register_buffer("running_mean", torch.zeros(10))  # NOT trainable
```

### Freeze parameters (transfer learning)

```python
import torch

# Assumes the model exposes an `encoder` submodule.
for p in model.encoder.parameters():
    p.requires_grad = False  # frozen

# Only the classifier head trains
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```

### Memory estimation

```python
def model_memory_gb(model, dtype_bytes=2, training=True):  # 2 bytes = bf16
    n = sum(p.numel() for p in model.parameters())
    weights = n * dtype_bytes
    if not training:
        return weights / 1e9
    gradients = n * dtype_bytes
    optimizer = n * 8  # Adam: 2 states × 4 bytes (fp32)
    return (weights + gradients + optimizer) / 1e9

print(f"Training memory: {model_memory_gb(model):.1f} GB")
```

### Hyperparameter search (Optuna)

```python
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    bs = trial.suggest_categorical("batch_size", [32, 64, 128])
    layers = trial.suggest_int("num_layers", 2, 8)
    # train_and_eval is user-supplied and returns the metric to maximize
    return train_and_eval(lr=lr, batch_size=bs, num_layers=layers)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
```

### MoE active params

```python
# Mixtral 8x7B: ~47B total, ~13B active per token (top-2 routing).
# Rough split: total = shared + n_experts * per_expert,
#              active = shared + top_k * per_expert.
total = 47e9
active = 13e9
n_experts = 8
top_k = 2
per_expert = (total - active) / (n_experts - top_k)  # ~5.7B per expert
shared = active - top_k * per_expert                 # ~1.7B (attention, embeddings)
```

## Decision criteria

| Situation | Approach |
|---|---|
| Memory budget planning | Total params × bytes per dtype × (1 for weights only; ~4 for fp32 weights + gradients + Adam states) |
| Inference deployment | Total params × bytes per dtype (+ KV cache) |
| Scaling decision | Chinchilla: tokens ≈ 20 × params |
| Compute budget | FLOPs ≈ 6 × params × tokens |
| Fine-tuning | PEFT if params > 1B and only one GPU is available |

**Default**: always report total and trainable params separately.

## 🔗 Graph

- Parent: [[Machine-Learning]]
- Variants: [[Trainable-Parameter]]
- Applications: [[LLM_Optimization_and_Deployment_Strategies|Model-Compression]] · [[PEFT (Parameter-Efficient Fine-Tuning)]] · [[LLM_Optimization_and_Deployment_Strategies|Quantization]]
- Adjacent: [[Scaling-Laws]] · [[MoE]]

## 🤖 LLM usage

**When**: model size discussions, memory planning, fine-tuning scope decisions.

**When not**: high-level user-facing communication (say "model size" instead).

## ❌ Anti-patterns

- **Confusing parameters with hyperparameters**: calling `lr` a parameter.
- **Counting frozen as trainable**: reporting 70B "trainable" when only the LoRA adapters (0.5%) actually train.
- **Ignoring MoE active vs total**: treating Mixtral's 47B as 47B of per-token compute (actually ~13B active per token).
- **Memory underestimation**: forgetting optimizer states (8 bytes per parameter for Adam with fp32 states, on top of weights and gradients).

## 🧪 Verification / Duplicates

- Verified (PyTorch docs, Kaplan 2020 / Hoffmann 2022 scaling laws).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: parameter vs hyperparameter, modern scale, memory math |