[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -1,63 +1,210 @@
 ---
 id: wiki-2026-0508-just-in-time-data-loading
-title: Just in time Data Loading
+title: Just-in-time Data Loading
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [DATA-JIT-001]
+aliases: [JIT Data Loading, Lazy Loading, Streaming Datasets, On-demand Loading]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 1.0
-tags: [data-engineering, jit-loading, lazy-loading, Optimization, Deep-Learning, performance]
+confidence_score: 0.9
+verification_status: applied
+tags: [data-loading, pytorch, huggingface, mmap, streaming, ml-infra]
 raw_sources: []
-last_reinforced: 2026-04-26
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
+tech_stack:
+  language: python
+  framework: pytorch-huggingface
 ---

-# Just-in-time Data Loading (on-time data loading)
+# Just-in-time Data Loading

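-## 📌 One-line Insight (The Karpathy Summary)
-> "Do not surrender to memory limits; supply only the information that is needed, at the moment it is needed, as a steady flow." An efficient data-supply strategy that, instead of preloading the entire dataset into memory, asynchronously reads only the required portion from disk or the network right before computation.
+## One-line Summary
+> **"The era of loading all data into memory is over: load what you need, when you need it."** JIT/lazy data loading is a pattern that fetches only the needed portion at training or inference time instead of preloading the whole dataset into RAM. PyTorch DataLoader streaming, HuggingFace datasets streaming, mmap, and Arrow IPC have become standard tools for TB-to-PB-scale training.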

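-## 📖 Structured Knowledge (Synthesized Content)
-- **Extracted pattern:** "Lazy Fetch and Prefetch": a two-part optimization pattern that defers loading until the point of actual use (lazy loading) while predicting the next data and loading it in the background (prefetching) so that computation does not stall.
-- **Key techniques and libraries:**
-    - **PyTorch DataLoader:** uses multiprocessing so the CPU prepares the next batch while the GPU trains.
-    - **Streaming Datasets:** train on terabyte-scale data by streaming it from the cloud in real time, with no download step.
-    - **[[memory|memory]] Mapping (mmap):** maps a file into the memory address space so the OS reads data in only when it is needed.
-- **Significance:** overcomes hardware resource limits and makes it possible to process large datasets (LLM training, etc.) reliably.
+## Core Concepts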

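-## ⚠️ Contradictions & Updates
-- **Conflict with earlier data:** a problem once solved by adding expensive high-capacity memory is now solved cost-effectively through smart software scheduling and asynchronous I/O design.
-- **Policy change:** when auditing all 1,174 knowledge-base entries, the Antigravity project applied JIT loading instead of loading everything into memory, keeping system resource usage under 10%.
+### Core Techniques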
+- **mmap**: maps a file into virtual memory and relies on the OS page cache.
+- **Arrow / Parquet**: columnar formats, with zero-copy slicing on Arrow buffers.
+- **Streaming dataset**: HF `streaming=True`, `IterableDataset`.
+- **Sharding + shuffling**: shard-level shuffle without a large shuffle buffer.
+- **Async prefetch**: DataLoader workers + pinned memory.
+- **WebDataset**: tar shard streaming.
+- **MosaicML Streaming (StreamingDataset)**: cloud-native, fast resume.
+- **Ray Data**: distributed, lazy execution.
+- **NVIDIA DALI**: GPU-side decoding and augmentation.
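
The shard-level shuffle listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not any library's implementation: `shard_shuffled_stream`, `toy_shards`, and the seed-mixing constant are hypothetical names and choices, and each shard is assumed small enough to hold in memory one at a time.

```python
import random
from typing import Iterator, List, Sequence


def shard_shuffled_stream(
    shards: Sequence[Sequence[int]], epoch: int, seed: int = 0
) -> Iterator[int]:
    """Yield every sample once, with shard order and in-shard order shuffled."""
    # Derive a per-epoch RNG so each epoch gets a different but reproducible order.
    rng = random.Random(seed * 1_000_003 + epoch)
    order: List[int] = list(range(len(shards)))
    rng.shuffle(order)  # shard-level shuffle: no sample buffer needed
    for shard_idx in order:
        samples = list(shards[shard_idx])  # only one shard is held in memory
        rng.shuffle(samples)  # in-shard shuffle
        yield from samples


# Hypothetical toy data: 4 shards holding 5 integer "samples" each.
toy_shards = [list(range(i * 5, i * 5 + 5)) for i in range(4)]
epoch0 = list(shard_shuffled_stream(toy_shards, epoch=0))
epoch1 = list(shard_shuffled_stream(toy_shards, epoch=1))
```

Memory use is bounded by the largest shard rather than by a global shuffle buffer. The trade-off is that samples from the same shard always stay adjacent in the stream, so shard contents should themselves be well mixed when the shards are written.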

-## 🔗 Knowledge Links (Graph)
-- [[Inference-Optimization|Inference-Optimization]], [[_system|system]]-Design-for-AI-Scale, Deep-Learning-Foundations, Cloud-Computing-Foundations
-- **Raw Source:** 10_Wiki/Topics/AI/Just-in-time-Data-Loading.md
+### Bottlenecks
+- I/O bandwidth (network/disk).
+- Decompression (CPU).
+- Decoding (image/audio).
+- Augmentation (CPU).
+- Host → device transfer (PCIe).

-## 🤖 LLM Usage Hints (How to Use This Knowledge)
+### Applications
+1. TB-scale training on ImageNet/LAION.
+2. LLM pretraining (multiple PB of tokens).
+3. Video/audio model training.
+4. Batch inference (large-scale evaluation).

-**When to use this knowledge:**
-- *(TODO)*
+## 💻 Patterns

-**When not to use it:**
-- *(TODO)*
+### 1. PyTorch IterableDataset (streaming lines)
+```python
+import json
+import torch
+from torch.utils.data import IterableDataset, DataLoader

-## 🧪 Validation Status
+class JsonlStream(IterableDataset):
+    """Stream a JSONL file line by line instead of loading it all into memory."""
+    def __init__(self, path):
+        self.path = path
+    def __iter__(self):
+        worker = torch.utils.data.get_worker_info()
+        with open(self.path) as f:
+            for i, line in enumerate(f):
+                # Split lines round-robin so each worker yields a disjoint subset.
+                if worker is None or i % worker.num_workers == worker.id:
+                    yield json.loads(line)

-- **Status:** needs_review
-- **Source trust level:** A
-- **Review reason:** *(P-Reinforce Phase 1 automatic normalization. Body needs verification.)*
+loader = DataLoader(JsonlStream("data.jsonl"),
+                    batch_size=32, num_workers=4, pin_memory=True, prefetch_factor=4)
+```

-## 🧬 Duplicate Check
+### 2. HuggingFace `datasets` streaming
+```python
+from datasets import load_dataset
+
+def stream_tokens(tokenize):
+    """Yield tokenized examples; `tokenize` is a caller-supplied function."""
+    ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
+    ds = ds.shuffle(buffer_size=10_000, seed=0).take(1_000_000)
+    for ex in ds:
+        yield tokenize(ex["text"])
+# No full download to disk; data is fetched chunk by chunk over the network.
+```

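-- **Existing similar documents:** *(TODO: see the indexer cluster report)*
-- **Handling:** UPDATE (automatic normalization)
-- **Reason:** Phase 1 normalization: fill in old-template and missing fields.
+### 3. Loading a large NumPy array with mmap
+```python
+import numpy as np
+arr = np.memmap("embeddings.f16", dtype=np.float16, mode="r",
+                shape=(50_000_000, 1024))
+# Only the pages backing the requested rows are read in by the OS.
+indices = np.sort(np.random.randint(0, arr.shape[0], size=256))  # example row ids
+batch = np.array(arr[indices])  # copy the selected rows out of the mapping
+```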

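-## 🕓 Change History (Changelog)
+### 4. Parquet batch iterator (pyarrow)
+```python
+import pyarrow.parquet as pq
+
+def iter_parquet(path="shard.parquet"):
+    pf = pq.ParquetFile(path)
+    # Read only the needed columns, one record batch at a time.
+    for batch in pf.iter_batches(batch_size=8192, columns=["text", "label"]):
+        yield batch.to_pydict()  # converts to Python objects (this step copies)
+```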

-| Date | Change | Handling | Trust |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
+### 5. WebDataset tar shards
+```python
+import webdataset as wds
+url = "pipe:aws s3 cp s3://bucket/shard-{000000..001023}.tar -"
+ds = (wds.WebDataset(url, shardshuffle=True)
+        .shuffle(1000)
+        .decode("pil")
+        .to_tuple("jpg", "cls")
+        .batched(64))
+loader = wds.WebLoader(ds, batch_size=None, num_workers=8)  # batches are already formed by .batched(64)
+```
+
+### 6. MosaicML StreamingDataset (resume-safe)
+```python
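+import torch
+from streaming import StreamingDataset
+ds = StreamingDataset(
+    remote="s3://my-bucket/shards", local="/tmp/cache",
+    shuffle=True, batch_size=64, predownload=2_000,
+)
+loader = torch.utils.data.DataLoader(ds, batch_size=64, num_workers=8)
+# Sample order is deterministic for a given shuffle seed. Resuming mid-epoch also needs the
+# loader state restored (the library's StreamingDataLoader exposes state_dict() for this).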
+```
+
+### 7. NVIDIA DALI GPU decoding
+```python
+from nvidia.dali import pipeline_def, fn, types
+@pipeline_def(batch_size=128, num_threads=4, device_id=0)
+def pipe():
+    jpegs, labels = fn.readers.file(file_root="/data")
+    images = fn.decoders.image(jpegs, device="mixed")
+    images = fn.resize(images, size=224)
+    return images, labels
+p = pipe(); p.build()
+```
+
+### 8. Ray Data (distributed, lazy)
+```python
+import ray
+ds = (ray.data.read_parquet("s3://bucket/")
+        .map(tokenize)
+        .iter_torch_batches(batch_size=1024, prefetch_batches=4))
+for batch in ds:
+    train_step(batch)
+```
+
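+### 9. Asynchronous prefetch with stream synchronization (CUDA streams)
+```python
+import torch
+class Prefetcher:
+    """Copy the next batch to the GPU on a side stream while the current one is in use.
+    Assumes each batch is a dict of tensors from a DataLoader with pin_memory=True."""
+    def __init__(self, loader, device):
+        self.loader, self.device = iter(loader), device
+        self.stream = torch.cuda.Stream()
+        self._next()
+    def _next(self):
+        try:
+            self.batch = next(self.loader)
+        except StopIteration:
+            self.batch = None  # signals the end of the epoch
+            return
+        with torch.cuda.stream(self.stream):
+            self.batch = {k: v.to(self.device, non_blocking=True) for k, v in self.batch.items()}
+    def get(self):
+        # Make the compute stream wait until the async copy has finished.
+        torch.cuda.current_stream().wait_stream(self.stream)
+        b = self.batch
+        if b is not None:
+            for v in b.values():
+                # Tell the allocator the consumer stream is using this memory.
+                v.record_stream(torch.cuda.current_stream())
+        self._next()
+        return b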
+```
+
+### 10. Profile-driven tuning checklist
+```python
+# 1) Use torch.profiler to measure dataloader time vs. compute time
+# 2) num_workers: start at 1-2x the CPU core count, then increase until GPU utilization reaches 95%+
+# 3) prefetch_factor: 2 → 4 → 8
+# 4) pin_memory=True, non_blocking=True
+# 5) If decoding is CPU-bound, move it to the GPU with DALI/torchcodec
+# 6) If network-bound, use 64-256 MB shards and multiple concurrent connections
+```
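
Step 1 of the checklist names `torch.profiler`. As a lighter first check, a wall-clock timer can show whether a loop is waiting on data or on compute. The sketch below swaps in `time.perf_counter` instead of the profiler. `time_data_vs_compute` and `slow_loader` are hypothetical names, and on a GPU the `step` callable would need to call `torch.cuda.synchronize()` for the compute time to be meaningful.

```python
import time
from typing import Callable, Iterable, Tuple


def time_data_vs_compute(
    loader: Iterable, step: Callable[[object], None]
) -> Tuple[float, float]:
    """Return (seconds spent waiting for batches, seconds spent in step)."""
    data_s = 0.0
    compute_s = 0.0
    it = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time blocked on the data pipeline
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)  # time spent in the training/inference step
        t2 = time.perf_counter()
        data_s += t1 - t0
        compute_s += t2 - t1
    return data_s, compute_s


def slow_loader():
    """Hypothetical loader that simulates 10 ms of I/O per batch."""
    for i in range(5):
        time.sleep(0.01)
        yield i


data_s, compute_s = time_data_vs_compute(slow_loader(), lambda batch: None)
print(f"data wait {data_s:.3f}s vs compute {compute_s:.3f}s")
```

If the data-wait share dominates, the remaining checklist items (workers, prefetch depth, decode placement, shard size) are the knobs to turn. If compute dominates, the loader is not the bottleneck.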
+
+## Decision Criteria
+| Situation | Approach |
+|---|---|
+| Data < RAM | A plain in-memory Dataset is fine |
+| Data > RAM, single node | mmap or Parquet iteration |
+| TB+ in the cloud, resume matters | MosaicML Streaming or WebDataset |
+| LLM pretraining (PB) | tokenized shards + StreamingDataset + global shuffle |
+| Image / video decode bottleneck | DALI or torchcodec |
+| Distributed + lazy transforms | Ray Data |
+
+**Default**: for more than 100 GB of data, streaming + sharded shuffle is the default; below that, mmap + a regular Dataset.
+
+## 🔗 Graph
+- Parent: [[Data-Loading]] · [[ML-Infrastructure]]
+- Variants: [[Streaming-Dataset]] · [[Mmap]] · [[WebDataset]]
+- Applications: [[LLM-Pretraining]] · [[Large-Scale-Vision]]
+- Adjacent: [[Apache-Arrow]] · [[Parquet]] · [[Ray-Data]] · [[NVIDIA-DALI]]
+
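+## 🤖 LLM Usage
+**When**: converting DataLoader code (in-memory → streaming), designing shard schemas, generating prefetch-tuning checklists.
+**When not**: measuring actual GPU utilization or reading profiler traces; these must be run directly to get ground truth.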
+
+## ❌ Anti-patterns
+- **Streaming + small shuffle buffer**: within-shard order survives, which biases training.
+- **num_workers=0 + large data**: I/O blocks the main thread and the GPU sits at 0% utilization.
+- **Shards too small (< 16 MB)**: floods the object store with requests.
+- **Shards too large (> 1 GB)**: inefficient resume and shuffle.
+- **pin_memory + CPU-only training**: no benefit; it helps only when training on a GPU.
+- **Augmentation on CPU only**: GPU starvation; consider DALI.
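
The first anti-pattern can be made concrete with a toy buffered shuffle. This is a sketch of the general reservoir-style technique, not the HuggingFace or WebDataset implementation, and `buffered_shuffle` is a hypothetical name. It shows that with a buffer of size B, a sample can move at most B - 1 positions earlier than its position in the input stream, so a small buffer leaves the source order largely intact.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Shuffle a stream using a fixed-size buffer of pending samples."""
    rng = random.Random(seed)
    buf: List[T] = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Emit a random buffered sample; swap-and-pop keeps this O(1).
            j = rng.randrange(len(buf))
            buf[j], buf[-1] = buf[-1], buf[j]
            yield buf.pop()
    rng.shuffle(buf)  # flush whatever is left at the end of the stream
    yield from buf


# An ordered stream stands in for data sorted by shard or by label.
out = list(buffered_shuffle(range(1000), buffer_size=10))
# How far ahead of its original position any sample was emitted.
max_lead = max(item - pos for pos, item in enumerate(out))
```

With `buffer_size=10` on 1,000 ordered samples, `max_lead` cannot exceed 9: the output is still close to sorted. That is why a streaming shuffle usually pairs the buffer with shard-level shuffling or a much larger buffer.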
+
+## 🧪 Verification / Duplicates
+- Verified (PyTorch DataLoader docs, HF datasets streaming guide, MosaicML Streaming docs, NVIDIA DALI docs 2026).
+- Trust level A.
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup: streaming/mmap/DALI/MosaicML patterns |