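koriweb d8a80f6272 chore(wiki): normalize dangling links to canonical titles (768 files / 1,200 links)

Replaced [[wikilinks]] that differed from their target only by a spelling variant with the target document's canonical title,
reconnecting 1,200 broken links. Only normalized title/filename matches were applied; alias matching was
excluded because of the risk of over-merging (ambiguity guard). Originals are backed up in _link_reconcile_backup/.
Tool: Datacollect/scripts/link_reconcile_apply.mjs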

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-08 12:24:15 +09:00

id, title, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack

Model Parameters

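One-line definition

Internal model variables whose values are determined by training (weights W, biases b, and so on); they are the storage medium that encodes patterns from the data. The parameter count N determines model capacity, memory, and cost, and in 2020s LLMs the scaling laws relating N to data and compute (Chinchilla) became a core design tool.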

Core

Parameters vs. hyperparameters

Aspect	Parameter	Hyperparameter
How it is set	Estimated by training	Chosen by a person or a search
Examples	W, b, embedding	lr, batch size, depth
Stored in	checkpoint	config

Estimating parameter counts

Linear (in, out): in*out + out (including bias).
Embedding (V, d): V*d.
Multi-head attention, 1 layer (d_model = d): 4*d² (Q, K, V, O).
FFN with hidden size 4d: 8*d².
Transformer block ≈ 12*d². With L layers, the total is ≈ 12*L*d² + V*d.
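The per-layer formulas above can be checked with a few lines of plain Python. This is a minimal sketch: the helper names are hypothetical, and biases are ignored for the attention and FFN projections so that the block total matches the 12*d² approximation.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Weights (d_in * d_out) plus an optional bias vector (d_out)."""
    return d_in * d_out + (d_out if bias else 0)


def embedding_params(vocab: int, d: int) -> int:
    """One d-dimensional vector per vocabulary entry."""
    return vocab * d


def attention_params(d: int) -> int:
    """Q, K, V, O projections, each d x d (biases ignored)."""
    return 4 * linear_params(d, d, bias=False)


def ffn_params(d: int, mult: int = 4) -> int:
    """Up-projection d -> mult*d and down-projection mult*d -> d (biases ignored)."""
    return linear_params(d, mult * d, bias=False) + linear_params(mult * d, d, bias=False)


d = 1024
block = attention_params(d) + ffn_params(d)
print(block == 12 * d * d)        # True
print(linear_params(768, 3072))   # 2362368
```

Gated FFN variants (three projection matrices) and grouped-query attention change these counts, so treat the result as an order-of-magnitude check rather than an exact figure.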

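Scaling laws

Kaplan (2020): loss follows roughly a power law, loss ∝ N^-α, with analogous power laws in data and compute.
Chinchilla (2022): for a given compute budget, a parameter-to-token ratio of about 1:20 is optimal (N and D scale roughly equally).
Later (Llama-3, 2024+): with inference cost in mind, the trend is toward smaller N trained on much larger D (well beyond 20×).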
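As a rough illustration of the 1:20 rule, the sketch below splits a compute budget into a parameter count and a token count. It assumes the common approximation C ≈ 6·N·D for training FLOPs, which does not appear in this note; the function name and the default ratio are illustrative only.

```python
import math


def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (N, D) under C ≈ 6*N*D and D = ratio * N.

    Substituting D = ratio * N gives C = 6 * ratio * N^2,
    so N = sqrt(C / (6 * ratio)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget that corresponds to 7B parameters trained on 140B tokens.
budget = 6 * 7e9 * 140e9
n, d = chinchilla_optimal(budget)
print(f"N ≈ {n:.3g} params, D ≈ {d:.3g} tokens")
```

Passing a larger `tokens_per_param` models the later over-training trend: the same budget then buys a smaller model trained on more tokens.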

Parameter-Efficient Fine-Tuning (PEFT)

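LoRA: trains only a low-rank update ΔW = BA on top of the frozen weight W (r = 8 or 16). Typically approaches full fine-tuning quality with 0.1–1% of the parameters.
QLoRA: 4-bit quantized base + LoRA. Fine-tunes a 65B model on a single 48GB GPU; models up to roughly 33B fit on a 24GB GPU.
Adapters / IA³ / Prefix-Tuning: variants.
Soft Prompt: trains only a small set of prompt embeddings while the model weights stay frozen.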
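To see where the "0.1–1%" figure for LoRA comes from, the trainable count can be worked out by hand: for a weight of shape (d_out, d_in), ΔW = BA adds r·(d_in + d_out) values. The sketch below assumes square d_model × d_model projections and a 7B-parameter base; the shapes and helper names are hypothetical, not read from any real checkpoint.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A has shape (r, d_in) and B has shape (d_out, r)."""
    return r * (d_in + d_out)


def lora_fraction(n_layers: int, d_model: int, r: int,
                  n_targets: int, base_params: float):
    """Trainable LoRA parameters and their share of the base model.

    n_targets is the number of adapted projection matrices per layer,
    all assumed to be d_model x d_model.
    """
    trainable = n_layers * n_targets * lora_params(d_model, d_model, r)
    return trainable, trainable / base_params


# Two adapted projections per layer (e.g. query and value), rank 16.
trainable, frac = lora_fraction(n_layers=32, d_model=4096, r=16,
                                n_targets=2, base_params=7e9)
print(f"{trainable:,} trainable ({frac:.2%} of base)")  # 8,388,608 trainable (0.12% of base)
```

Doubling r or adding more target matrices scales the trainable count linearly, which is why it stays small next to the d² cost of the frozen weights.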

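Quantization (storage and inference)

INT8, INT4, NF4, FP8: reduce VRAM and often improve speed, usually with small quality loss (the loss tends to be more noticeable at 4-bit).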
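The VRAM saving from quantization follows directly from bytes per parameter. A minimal sketch, assuming nominal storage widths only: it ignores quantization scales and zero-points, activations, and the KV cache, and the table and function names are illustrative.

```python
# Nominal storage width per parameter, in bytes (assumption: no metadata overhead).
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
    "nf4": 0.5,
}


def weight_gb(n_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GB (1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp16", "int8", "nf4"):
    print(dtype, weight_gb(7e9, dtype))
# fp16 14.0
# int8 7.0
# nf4 3.5
```

Real footprints are somewhat higher because block-wise quantization stores extra constants per block, so measure on the target hardware before sizing infrastructure.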

Applications

Comparing model sizes, computing cost and VRAM, choosing a fine-tuning strategy, sizing inference infrastructure.

💻 Patterns

Counting parameters (PyTorch)

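import torch.nn as nn

def count_params(model: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts."""
    total = sum(p.numel() for p in model.parameters())
    train = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, train

model = nn.Linear(768, 3072)  # stand-in for illustration; substitute your own model
t, tr = count_params(model)
print(f"total={t:,}  trainable={tr:,}")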

Transformer estimation formula

def transformer_params(L, d, V, ff_mult=4):
    block = (4 * d * d) + (2 * d * d * ff_mult)  # attention + FFN (approximate)
    return L * block + V * d  # ignores small terms (layernorm, biases)

print(transformer_params(L=32, d=4096, V=128_000))  # 6966738944 (≈ 7.0B)

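VRAM estimate (training)

# Rough estimate: params * (weight + grad + optimizer state) bytes
# fp16 weight 2 + fp16 grad 2 + AdamW state (m, v) in fp32 8 = 12 bytes/param
# Excludes activations and any fp32 master copy of the weights
# (typical mixed precision adds 4 more bytes/param for that copy).
def vram_train_gb(n_params):
    return n_params * 12 / 1e9

print(vram_train_gb(7e9))  # ≈ 84 GB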

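Applying LoRA (peft)

from peft import LoraConfig, get_peft_model

# base_model is assumed to be an already-loaded causal LM (e.g. via AutoModelForCausalLM).
cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj","v_proj"],
                 lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(base_model, cfg)
model.print_trainable_parameters()
# For a Llama-2-7B-shaped model: roughly 8.4M trainable of ~6.7B total, about 0.12%
# (r=8 would give roughly 4.2M, about 0.06%)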

4-bit loading (bitsandbytes / QLoRA starting point)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16,
                         bnb_4bit_use_double_quant=True)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B",
                                             quantization_config=bnb, device_map="auto")

Freezing weights (per layer)

for n, p in model.named_parameters():
    p.requires_grad = "classifier" in n  # train only the head

Chinchilla token estimate

def chinchilla_tokens(n_params, ratio=20):
    return n_params * ratio  # 7B → 140B tokens

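Decision criteria

Situation	Choice
Fine-tuning 7–13B on a small GPU (24GB)	QLoRA (NF4 + LoRA r=16)
Single-domain adaptation	LoRA
Domain + new vocabulary	full FT (small LR) or LoRA + embedding training
Inference only, reduce VRAM	INT8/INT4 quantization
Pretraining a new model with compute budget X	N : tokens ≈ 1 : 20+ (Chinchilla)
Inference cost comes first	smaller N, more tokens (overtrain)
Quick baseline, data < 1k	LoRA r=8 + few-shot

Default: LoRA (r=16) + 4-bit base. For pretraining, use at least the Chinchilla token ratio.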

🔗 Graph

Parents: Machine-Learning · Deep Learning · Neural-Networks
Variants: LoRA · QLoRA
Applications: Fine-Tuning · LLM_Optimization_and_Deployment_Strategies
Adjacent: Scaling-Laws · Hyperparameters · Mixture-of-Experts

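🤖 LLM usage

When: sanity-checking parameter-count and VRAM estimates, choosing LoRA target_modules, reviewing fine-tune configs, discussing scaling laws.

When not: exact memory prediction. That requires measurement, because it depends on activations, optimizer state, and sequence length. Do not reserve a cluster on the strength of an LLM estimate alone.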

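❌ Anti-patterns

Comparing performance by parameter count alone (ignoring data and training tokens).
Targeting only part of attention with LoRA, which limits expressiveness (q, k, v, o plus FFN is recommended).
Training LoRA with the same LR as full fine-tuning (LoRA usually wants 1e-4 to 3e-4).
Forgetting optimizer-state memory (AdamW's m, v), which leads to OOM.
Training a quantized model by updating its weights directly (a workaround such as LoRA is needed).
"Bigger model = always better": with fewer tokens than the Chinchilla ratio calls for, the larger model ends up undertrained.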

🧪 Verification / duplicates

Verified sources: Hoffmann et al., Chinchilla (2022); Hu et al., LoRA (2021); Dettmers et al., QLoRA (2023); PyTorch / Hugging Face peft / bitsandbytes documentation; Llama-3 technical report (2024). Trust level A.

Kept separate from Hyperparameters: parameters (learned) vs. hyperparameters (specified).

🕓 Changelog

2026-05-08 Phase 1: initial stub.
2026-05-10 Manual cleanup: full rewrite. Estimation formulas, scaling laws, PEFT/LoRA/QLoRA, quantization, VRAM calculation, 7 code snippets.
