---
id: wiki-2026-0508-pose-estimation
title: Pose Estimation
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Human Pose Estimation, HPE, Keypoint Detection]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [computer-vision, pose-estimation, deep-learning, keypoints]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch, mmpose, mediapipe
---

# Pose Estimation

## One-line summary

> **"Detect the positions of human body keypoints (joints) in an image or video."** OpenPose (2017) popularized multi-person bottom-up estimation, MediaPipe brought real-time inference to mobile devices, and as of 2024-2025 ViTPose and SAM-style transformers are SOTA.

## Key points

### Paradigms

- **Top-down**: detect a person bbox → crop → regress keypoints. Accurate, but cost grows with the number of people, so it is slow in crowds.
- **Bottom-up**: detect all keypoints first → group them into persons (PAF / associative embedding). Fast at scale, because runtime depends little on the person count.
- **Single-stage** (modern): YOLO-Pose, ED-Pose. Detection and keypoints are predicted jointly.

### Representations

- **2D keypoints**: `(x, y, confidence)` per joint. The COCO 17-keypoint layout is the standard.
- **3D pose**: `(x, y, z)` per joint, lifted from a single image or reconstructed from multiple views.
- **SMPL / mesh**: a full-body parametric model, as in VIBE, HMR, and 4D-Humans.

### Applications

1. AR/VR avatar driving (Meta Quest, Apple Vision Pro).
2. Fitness coaching (form correction).
3. Sports analytics (gait, biomechanics).
4. Markerless motion capture for animation.
5. Surveillance / fall detection.
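To make the 2D representation and the form-correction use case concrete, here is a minimal sketch that computes a knee angle from COCO-ordered `(x, y, confidence)` keypoints. The helper names (`joint_angle`, `knee_angle`) and the `min_conf=0.3` threshold are illustrative assumptions, not part of any library.

```python
import math

# Standard COCO person-keypoint order (17 joints).
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]
KP = {name: i for i, name in enumerate(COCO_KEYPOINTS)}


def joint_angle(a, b, c):
    """Angle in degrees at vertex b for the points a-b-c, each an (x, y) pair."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("degenerate joint: two keypoints coincide")
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp against floating-point drift before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def knee_angle(keypoints, side="left", min_conf=0.3):
    """keypoints: 17 (x, y, confidence) triples in COCO order.

    Returns the hip-knee-ankle angle, or None if any of the three
    joints falls below min_conf (hypothetical threshold).
    """
    pts = [keypoints[KP[f"{side}_{j}"]] for j in ("hip", "knee", "ankle")]
    if any(p[2] < min_conf for p in pts):
        return None
    return joint_angle(pts[0][:2], pts[1][:2], pts[2][:2])
```

A fitness application could, for instance, flag a squat repetition when this angle stays above a chosen depth threshold. Note that a 2D angle is view-dependent, so a side-on camera is assumed here.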
## 💻 Patterns

### MediaPipe (real-time, on-device)

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
pose = mp_pose.Pose(model_complexity=1, min_detection_confidence=0.5)

cap = cv2.VideoCapture(0)  # default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB; OpenCV delivers BGR.
    results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks:
        for lm in results.pose_landmarks.landmark:
            # x and y are normalized to [0, 1] by image width and height.
            print(lm.x, lm.y, lm.visibility)

# Release the camera and the model once the loop ends.
cap.release()
pose.close()
```

### MMPose (research, ViTPose backbone)

```python
from mmpose.apis import MMPoseInferencer

inferencer = MMPoseInferencer(pose2d='vitpose-h')
# The inferencer returns a generator; next() runs inference on the first input.
result = next(inferencer('image.jpg', show=False))
person = result['predictions'][0][0]  # first image, first detected person
keypoints = person['keypoints']       # 17 [x, y] pairs
scores = person['keypoint_scores']    # 17 per-keypoint confidences
```

### YOLO-Pose (Ultralytics, single-stage)

```python
from ultralytics import YOLO

model = YOLO('yolo11n-pose.pt')
results = model('image.jpg')
for r in results:
    kpts = r.keypoints.xy    # (n_persons, 17, 2)
    conf = r.keypoints.conf  # (n_persons, 17)
```

### 3D lift (VideoPose3D-style)

```python
import torch


# 2D (T, 17, 2) -> 3D (T, 17, 3) via a temporal CNN
class TemporalLift(torch.nn.Module):
    def __init__(self, n_kpts=17, ch=1024):
        super().__init__()
        self.expand = torch.nn.Conv1d(n_kpts * 2, ch, 3, padding=1)
        self.blocks = torch.nn.Sequential(*[
            torch.nn.Sequential(
                # padding=d keeps the temporal length T unchanged for a
                # kernel of size 3 with dilation d; the final reshape needs this.
                torch.nn.Conv1d(ch, ch, 3, padding=d, dilation=d),
                torch.nn.BatchNorm1d(ch),
                torch.nn.ReLU(),
            )
            for d in (3, 9, 27)
        ])
        self.head = torch.nn.Conv1d(ch, n_kpts * 3, 1)

    def forward(self, x):  # x: (B, T, 17, 2)
        B, T = x.shape[:2]
        x = x.reshape(B, T, -1).transpose(1, 2)  # (B, 34, T): channels first
        y = self.head(self.blocks(self.expand(x)))  # (B, 51, T)
        return y.transpose(1, 2).reshape(B, T, -1, 3)  # (B, T, 17, 3)
```

### COCO keypoint metric (OKS / mAP)

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO('person_keypoints_val2017.json')
dt = gt.loadRes('predictions.json')  # predictions in COCO results format
e = COCOeval(gt, dt, 'keypoints')
e.evaluate()
e.accumulate()
e.summarize()  # AP @ OKS=.50:.95 is the standard metric
```

### SMPL mesh recovery (4D-Humans / HMR2)

```python
import torch
from hmr2.models import load_hmr2

model, model_cfg = load_hmr2('logs/checkpoints/epoch=35.ckpt')
model.eval()

# image_tensor: preprocessed person crops, assumed shape (B, 3, 256, 256).
# HMR2 takes a batch dict; the 'img' key follows the 4D-Humans demo script.
with torch.no_grad():
    out = model({'img': image_tensor})

verts = out['pred_vertices']                  # (B, 6890, 3)
betas = out['pred_smpl_params']['betas']      # SMPL shape parameters
pose = out['pred_smpl_params']['body_pose']   # SMPL body joint rotations
```

## Decision criteria

| Situation | Approach |
|---|---|
| Mobile / web real-time | MediaPipe Pose |
| Highest accuracy on a single image | ViTPose-H (MMPose) |
| Multi-person crowd | YOLO-Pose / ED-Pose (single-stage) |
| 3D from monocular video | 4D-Humans / WHAM |
| Animation mocap | SMPL / SMPL-X based |
| Edge device < 10 ms | MoveNet Lightning, RTMPose-tiny |

**Default**: RTMPose for 2D, 4D-Humans for 3D mesh.

## 🔗 Graph

- Parent: [[Computer_Vision]] · [[Deep_Learning]]
- Variants: [[MediaPipe]]
- Adjacent: [[Object_Detection]] · [[Keypoint_Detection]]

## 🤖 LLM usage

**When to use**: as an input feature for a vision-action pipeline, in fitness/AR apps, and for mocap automation.

**When not to use**: for facial keypoints, use a face-specific model (MediaPipe Face Mesh, dlib); for hands, use MediaPipe Hands.

## ❌ Anti-patterns

- **Top-down without bbox tracking**: re-detecting every frame causes severe temporal jitter. Combine with ByteTrack.
- **Direct 2D (x, y) regression without heatmaps**: lower accuracy. Heatmap supervision is the standard.
- **3D from a single 2D pose**: depth is ambiguous. Temporal context or multiple views are needed.
- **Ignoring camera intrinsics for 3D**: the metric scale comes out wrong.

## 🧪 Verification / duplicates

- Verified (MMPose docs, Ultralytics YOLO11-pose, MediaPipe docs, COCO keypoint benchmark).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: pose estimation paradigms + modern stack (ViTPose, YOLO-Pose, 4D-Humans) |