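koriweb d8a80f6272 chore(wiki): canonical normalization of dangling links (768 files / 1,200 links)
Replaced [[wikilinks]] that differed only in name (spelling variants) with the canonical title of the target document,
reconnecting 1,200 broken links. Only normalized title/filename matches were applied; alias matching was
excluded because of the risk of over-merging (ambiguity guard). Originals are backed up in _link_reconcile_backup/.
Tool: Datacollect/scripts/link_reconcile_apply.mjs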

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:24:15 +09:00


id: wiki-2026-0508-bottlenecks
title: Bottlenecks (Performance & Process)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: 병목, bottleneck, theory of constraints, TOC, critical path, profiling
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: performance, bottleneck, profiling, theory-of-constraints, optimization, scalability, latency
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: any; framework: profiling tools

Bottlenecks

📌 One-line insight

"The throat of every system." The weakest link determines throughput. Improving a non-bottleneck is wasted time. Goldratt's TOC: five steps. In modern AI, HBM bandwidth and the network are the bottlenecks.

📖 Core

Types

  1. Hardware: CPU / GPU / RAM / disk / network.
  2. Software: algorithm / blocking / lock contention.
  3. Process: approval / single point of expertise.
  4. Data: schema / indexing / partitioning.
  5. Cognitive (team): meeting / context-switch.

Theory of Constraints (Goldratt)

  1. Identify the bottleneck.
  2. Exploit it (use 100%).
  3. Subordinate non-bottleneck (don't over-feed).
  4. Elevate it (invest to widen).
  5. Repeat (new bottleneck emerges).
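  • Speeding up 90% of the work by 100× still caps the overall speedup below 10×, because the remaining 10% is untouched (Amdahl's law).
  • Work on anything other than the bottleneck does not raise that cap.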
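
The 10× cap above is an instance of Amdahl's law. A minimal sketch of the arithmetic follows; the function names are illustrative and not part of any library.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of the runtime becomes `factor` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    # The untouched part (1 - fraction) keeps its full cost.
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def amdahl_cap(fraction: float) -> float:
    """Upper bound on the speedup as `factor` grows without limit."""
    return float("inf") if fraction >= 1.0 else 1.0 / (1.0 - fraction)


if __name__ == "__main__":
    # 90% of the work made 100x faster: about 9.17x overall, never reaching 10x.
    print(round(amdahl_speedup(0.9, 100), 2))
    print(round(amdahl_cap(0.9), 2))
```

The gap between the two numbers is the reason step 5 of TOC exists: once the old bottleneck is widened, the untouched remainder becomes the new one.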

Modern hardware bottlenecks (LLM)

  • HBM bandwidth: roughly 3 TB/s on an H100. The dominant constraint for LLM inference.
  • NVLink: GPU-to-GPU traffic.
  • Network (RDMA, InfiniBand): distributed training.
  • PCIe: GPU-to-CPU transfers.
  • Storage: NVMe vs. spinning disks.
  • Power / cooling: the datacenter-level limit.
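
To see why HBM bandwidth dominates inference, a rough upper bound helps. The sketch below assumes single-stream decoding where every weight is read from HBM once per generated token, and it ignores KV-cache traffic, batching, and compute time. The 70B-parameter FP16 model is a hypothetical example, and the function name is illustrative.

```python
def decode_tokens_per_second(params_billion: float,
                             bytes_per_param: float,
                             hbm_tb_per_s: float) -> float:
    """Bandwidth-bound ceiling on decode speed (tokens per second).

    Assumes each token requires streaming all model weights from HBM once.
    """
    model_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = hbm_tb_per_s * 1e12
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    # Hypothetical 70B-parameter model in FP16 (2 bytes/param) at 3 TB/s.
    print(round(decode_tokens_per_second(70, 2, 3.0), 1))
```

Under these assumptions, halving the bytes per parameter (for example through quantization) doubles the ceiling, while extra FLOPs do not move it.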

Software bottlenecks

  • CPU-bound: compute-heavy work.
  • I/O-bound: waiting on disk or network.
  • Memory-bound: swapping / cache misses.
  • Lock contention: threads queuing on a mutex.
  • GIL (Python): CPU-bound Python threads effectively run on a single core.
  • N+1 query: typical with ORMs.
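
A quick first cut between CPU-bound and I/O-bound code is to compare CPU time with wall-clock time. The sketch below uses only the standard library; `cpu_fraction`, `busy_loop`, and `waiting` are illustrative names, and the two workloads are stand-ins for real code.

```python
import time


def cpu_fraction(fn, *args, **kwargs) -> float:
    """Share of wall-clock time the process spent on the CPU while running fn."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


def busy_loop(n: int = 2_000_000) -> int:
    """Stand-in for CPU-bound work."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def waiting(seconds: float = 0.2) -> None:
    """Stand-in for I/O-bound work: the process sleeps instead of computing."""
    time.sleep(seconds)


if __name__ == "__main__":
    print(f"busy_loop: {cpu_fraction(busy_loop):.2f}")
    print(f"waiting:   {cpu_fraction(waiting):.2f}")
```

A fraction near 1 suggests CPU-bound work; a fraction near 0 suggests the code is mostly waiting, in which case a wall-clock profile or a trace is likely to be more informative than a CPU profile.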

Detection

  • Profiler: cProfile, perf, async-profiler.
  • Trace: distributed tracing (Jaeger).
  • Metric: CPU/mem/disk/network util.
  • APM: Datadog, NewRelic.
  • Flame graph.
  • Critical path.

Process bottlenecks

  • Approval chains.
  • A single expert everyone depends on.
  • Environment provisioning.
  • Review SLAs.
  • Meeting cadence.

→ Each of these is a component of DORA Lead Time.

Data bottlenecks

  • A single hot row.
  • A missing index.
  • Cross-shard transactions.
  • Blocking schema migrations.
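
The missing-index case is easy to reproduce. The sketch below uses an in-memory SQLite database from the standard library; the `orders` table, its columns, and the index name are hypothetical. It compares the query plan before and after adding an index.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params=()) -> str:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

sql = "SELECT total FROM orders WHERE customer_id = ?"

# Without an index the planner has to scan the whole table.
plan_before = query_plan(conn, sql, (42,))

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# With the index the planner can search directly for matching rows.
plan_after = query_plan(conn, sql, (42,))

if __name__ == "__main__":
    print("before:", plan_before)
    print("after: ", plan_after)
```

Other databases expose the same information through their own `EXPLAIN` variants; the exact wording of the plan differs by engine and version.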

Distributed bottlenecks (modern)

  • A single leader (Raft, Paxos).
  • Cross-region calls.
  • Synchronous replication.
  • Connection pool limits.
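
The connection pool limit can be estimated with Little's law: concurrency equals throughput times the time each request holds a connection. The sketch below is a back-of-the-envelope estimate under that assumption, with illustrative function names; it ignores queuing variance and retries.

```python
import math


def pool_throughput_cap(pool_size: int, hold_seconds: float) -> float:
    """Maximum requests per second a pool can serve if each request
    holds one connection for hold_seconds."""
    if hold_seconds <= 0:
        raise ValueError("hold_seconds must be positive")
    return pool_size / hold_seconds


def pool_size_needed(target_rps: float, hold_seconds: float) -> int:
    """Smallest pool that sustains target_rps at the given hold time."""
    return math.ceil(target_rps * hold_seconds)


if __name__ == "__main__":
    # 20 connections, each held for 250 ms per request.
    print(pool_throughput_cap(20, 0.25))
    # Pool size required for 100 requests per second at the same hold time.
    print(pool_size_needed(100, 0.25))
```

If measured throughput plateaus near this cap while CPU and database load stay low, the pool itself is a likely bottleneck. Shortening the hold time raises the cap without adding connections.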

💻 Patterns

Profile (Python cProfile)

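import cProfile
import pstats
import time

def expensive_call():
    time.sleep(0.2)  # placeholder for the slow work

def cheap_call():
    pass  # placeholder for the fast work

def main():
    expensive_call()
    cheap_call()

# Write raw stats to out.prof, then print the 20 costliest entries by cumulative time.
cProfile.run('main()', 'out.prof')
stats = pstats.Stats('out.prof').sort_stats('cumulative')
stats.print_stats(20)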

Linux perf (system-level)

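# CPU profile: sample at 99 Hz for 10 s (-g records call stacks, which the flame graph needs)
perf record -F 99 -g -p "$PID" -- sleep 10
perf report

# Flame graph (scripts from Brendan Gregg's FlameGraph repository)
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flame.svg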

Async profiler (JVM)

# async-profiler ships a launcher rather than a runnable jar:
# asprof in recent releases, ./profiler.sh in older ones.

# Sample lock contention
asprof -e lock -d 30 -f lock.html "$PID"

# Wall clock (also captures I/O-bound time)
asprof -e wall -d 30 -f wall.html "$PID"

N+1 detect (Django)

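import logging

from django.db import connection
from django.test.utils import CaptureQueriesContext

logger = logging.getLogger(__name__)

# Post is assumed to be a model with a ForeignKey named `author`.
with CaptureQueriesContext(connection) as ctx:
    posts = Post.objects.all()
    for post in posts:
        print(post.author.name)  # N+1: one extra query per post

# Check the count after the block so every query is included.
if len(ctx.captured_queries) > 5:
    logger.warning('N+1 detected: %d queries', len(ctx.captured_queries))

# Fix: join the author in the same query.
posts = Post.objects.select_related('author')  # 1 query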

GPU bottleneck profile (PyTorch)

import torch.profiler as prof

# `model` and `inputs` are assumed to exist and to be on the GPU already.
with prof.profile(
    activities=[prof.ProfilerActivity.CPU, prof.ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as p:
    model(inputs)

print(p.key_averages().table(sort_by='cuda_time_total', row_limit=20))

# Memory-bound kernels dominating CUDA time hint at an HBM bandwidth bottleneck.

Lock contention detection

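import logging
import threading
import time

logger = logging.getLogger(__name__)

class LockMonitor:
    """Context manager that records how long each caller waits for a lock."""

    def __init__(self, lock):
        self.lock = lock
        self.wait_times = []

    def __enter__(self):
        start = time.perf_counter()
        self.lock.acquire()
        # Recorded while the lock is held, so appends do not race.
        self.wait_times.append(time.perf_counter() - start)
        return self

    def __exit__(self, *args):
        self.lock.release()

    def report(self):
        if not self.wait_times:
            return
        avg = sum(self.wait_times) / len(self.wait_times)
        if avg > 0.1:  # flag average waits above 100 ms
            logger.warning('Lock contention: avg wait %.1f ms', avg * 1000)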

Distributed trace (Jaeger)

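from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

provider = TracerProvider()
# Without a span processor the exporter is never used and no spans leave the process.
# Host and port are assumptions (the Jaeger agent defaults).
provider.add_span_processor(
    BatchSpanProcessor(JaegerExporter(agent_host_name='localhost', agent_port=6831))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# `db` is assumed to be an existing database client.
@tracer.start_as_current_span('handle_request')
def handle(req):
    with tracer.start_as_current_span('db_query') as span:
        span.set_attribute('db.statement', 'SELECT ...')
        result = db.query(...)
    return result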

→ The trace view makes the slowest span, and so the bottleneck, visible at a glance.

Process bottleneck (workflow analysis)

def analyze_workflow(stage_durations):
    """Compare per-stage throughput. Durations are assumed to be seconds per item."""
    rates = {stage: 1 / dur for stage, dur in stage_durations.items()}
    # The slowest stage sets the pace of the whole pipeline.
    bottleneck = min(rates, key=rates.get)

    overall_rate = rates[bottleneck]
    # Capacity the faster stages have but cannot use.
    waste = sum(r - overall_rate for r in rates.values() if r > overall_rate)

    return {
        'bottleneck': bottleneck,
        'overall_rate_per_min': overall_rate * 60,
        'capacity_wasted': waste,
    }

Critical path (DAG)

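import networkx as nx

def critical_path(tasks):
    """Return the task ids on the longest duration-weighted path through the DAG."""
    G = nx.DiGraph()
    for task in tasks:
        # dag_longest_path reads edge attributes, not node attributes, so each
        # incoming edge carries the duration of the task it leads to.
        # '__start__' is a virtual source so that root tasks are counted too.
        G.add_edge('__start__', task.id, duration=task.duration)
        for dep in task.deps:
            G.add_edge(dep, task.id, duration=task.duration)

    path = nx.dag_longest_path(G, weight='duration')
    return [node for node in path if node != '__start__']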

🤔 Decision criteria

| Symptom | Tool |
| --- | --- |
| Slow request | APM + distributed trace |
| CPU pegged | Flame graph (perf) |
| GPU underutilized | Memory bandwidth (PyTorch profiler) |
| Slow query | EXPLAIN + slow query log |
| Lock contention | async-profiler -e lock |
| Long lead time | Process / DORA analysis |
| Thundering herd | Coordination check |

Default: measure first, then optimize against a hypothesis.

🔗 Graph

🤖 LLM usage

When: performance optimization, capacity planning, incident root-cause analysis, process improvement. When not: optimizing without a hypothesis.

Anti-patterns

  • Optimizing without measuring: the effort lands in the wrong place.
  • Improving a non-bottleneck: wasted time (TOC).
  • Investing equally in every part: low ROI.
  • Trusting a single profile: one run is not representative.
  • Blaming people for process problems: most are system issues.
  • Premature optimization: simplicity is lost.
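
To avoid trusting a single run, a measurement can be repeated and summarized. The sketch below uses the standard library's `timeit.repeat`; the `measure` helper and its default counts are illustrative choices, not recommendations.

```python
import statistics
import timeit


def measure(fn, repeats: int = 7, number: int = 1000) -> dict:
    """Time fn over several independent runs and summarize per-call cost in seconds."""
    runs = timeit.repeat(fn, repeat=repeats, number=number)
    per_call = [total / number for total in runs]
    return {
        "runs": per_call,
        "min": min(per_call),        # closest to the noise-free cost
        "median": statistics.median(per_call),
        "max": max(per_call),        # shows how much noise the runs contain
    }


if __name__ == "__main__":
    result = measure(lambda: sum(range(100)))
    print({key: value for key, value in result.items() if key != "runs"})
```

A wide gap between `min` and `max` suggests the environment is noisy and that conclusions drawn from any one run would be unreliable.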

🧪 Verification / duplicates

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: types + TOC + profile / N+1 / GPU / trace code |