---
id: wiki-2026-0508-pose-estimation
title: Pose Estimation
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Human Pose Estimation, HPE, Keypoint Detection]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [computer-vision, pose-estimation, deep-learning, keypoints]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch, mmpose, mediapipe
---

# Pose Estimation

## In one line
> **"Detect the positions of human body keypoints (joints) in an image or video."** OpenPose (2017) popularized multi-person bottom-up estimation, MediaPipe brought real-time inference to mobile, and as of 2024-2025 ViTPose / SAM-style transformers are SOTA.

## Core concepts

### Paradigms
- **Top-down**: detect person bbox → crop → keypoint regression. Accurate, but slow in crowds because the pose model runs once per person.
- **Bottom-up**: detect all keypoints first → group them into persons (PAF / associative embedding). Fast at scale.
- **Single-stage** (modern): YOLO-Pose, ED-Pose. Detection and keypoints are predicted jointly.

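The top-down flow above (bbox → crop → keypoints) involves one bookkeeping step that is easy to get wrong: keypoints predicted in crop coordinates have to be mapped back to the full image. A minimal sketch, assuming a hypothetical helper `crop_to_image` and a 192×256 model input (a common top-down input size, not something this note specifies):

```python
def crop_to_image(kpts_crop, bbox, input_size=(192, 256)):
    """Map (x, y) keypoints from model-input coordinates back to the full image.

    kpts_crop:  list of (x, y) in the resized crop fed to the pose model
    bbox:       (x0, y0, w, h) of the person box in the original image
    input_size: (width, height) of the model input (assumed value)
    """
    x0, y0, w, h = bbox
    in_w, in_h = input_size
    sx, sy = w / in_w, h / in_h  # undo the resize
    return [(x0 + x * sx, y0 + y * sy) for x, y in kpts_crop]


# The centre of the crop maps to the centre of the bbox.
print(crop_to_image([(96, 128)], bbox=(100, 50, 96, 128)))
```

This sketch stretches the box to the input size. Real pipelines usually pad the box to the input aspect ratio first, which changes the scale factors but not the idea.
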
### Representations
- **2D keypoints**: (x, y, confidence). The 17 COCO keypoints are the standard layout.
- **3D pose**: (x, y, z), lifted from a single image or recovered from multiple views.
- **SMPL / mesh**: full-body parametric model (VIBE, HMR, 4D-Humans).

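To make the 2D representation concrete, the sketch below lists the 17 COCO keypoints in their standard index order and wraps a raw `(x, y, confidence)` array in a name-indexed dict. The helper name `to_named` is hypothetical, chosen only for this illustration:

```python
# COCO keypoint order (index 0..16).
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def to_named(pose):
    """pose: 17 rows of (x, y, confidence) -> {name: (x, y, confidence)}."""
    if len(pose) != len(COCO_KEYPOINTS):
        raise ValueError(f"expected {len(COCO_KEYPOINTS)} keypoints, got {len(pose)}")
    return dict(zip(COCO_KEYPOINTS, (tuple(p) for p in pose)))


# Dummy pose, only to show the lookup.
pose = [(float(i), float(i) * 2, 0.9) for i in range(17)]
print(to_named(pose)["left_wrist"])
```
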
### Applications
1. AR/VR avatar driving (Meta Quest, Apple Vision Pro).
2. Fitness coaching (form correction).
3. Sports analytics (gait, biomechanics).
4. Markerless mocap for animation.
5. Surveillance / fall detection.

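Several of these applications (form correction, gait analysis) reduce to measuring joint angles from keypoints. A minimal sketch, assuming 2D `(x, y)` keypoints and a hypothetical helper `joint_angle`; production systems would typically also smooth over time and account for camera viewpoint:

```python
import math


def joint_angle(a, b, c):
    """Angle at b (degrees) formed by points a-b-c, each an (x, y) pair."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("degenerate limb: two keypoints coincide")
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


# Elbow angle from shoulder, elbow, wrist keypoints (toy coordinates).
shoulder, elbow, wrist = (0, 0), (1, 0), (1, 1)
print(joint_angle(shoulder, elbow, wrist))
```
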
## 💻 Patterns

### MediaPipe (real-time, on-device)
```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
pose = mp_pose.Pose(model_complexity=1, min_detection_confidence=0.5)

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB; OpenCV delivers BGR.
    results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks:
        # x and y are normalized by image width and height.
        for lm in results.pose_landmarks.landmark:
            print(lm.x, lm.y, lm.visibility)
cap.release()
pose.close()
```

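MediaPipe landmarks are normalized by image size, so drawing or measuring in pixels needs a conversion step. A minimal sketch, assuming a hypothetical helper `landmarks_to_pixels` and a visibility threshold of 0.5 (an illustrative value, not a MediaPipe default):

```python
def landmarks_to_pixels(landmarks, width, height, min_visibility=0.5):
    """Convert normalized (x, y, visibility) landmarks to pixel coordinates.

    Landmarks below min_visibility (an assumed threshold) become None.
    """
    pixels = []
    for x, y, vis in landmarks:
        if vis < min_visibility:
            pixels.append(None)
        else:
            # Clamp: normalized values can fall slightly outside [0, 1].
            px = min(max(int(x * width), 0), width - 1)
            py = min(max(int(y * height), 0), height - 1)
            pixels.append((px, py))
    return pixels


print(landmarks_to_pixels([(0.5, 0.25, 0.9), (0.1, 0.1, 0.2)], 640, 480))
```

Inside the capture loop, the input could be built as `[(lm.x, lm.y, lm.visibility) for lm in results.pose_landmarks.landmark]`, with `width, height = frame.shape[1], frame.shape[0]`.
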
### MMPose (research, ViTPose backbone)
```python
from mmpose.apis import MMPoseInferencer

inferencer = MMPoseInferencer(pose2d='vitpose-h')
# The inferencer is a generator; next() yields the result for the first input.
result = next(inferencer('image.jpg', show=False))
# predictions[0] holds the instances of the first image; [0] picks the first person.
keypoints = result['predictions'][0][0]['keypoints']  # 17 [x, y] pairs
scores = result['predictions'][0][0]['keypoint_scores']  # 17 confidences
```

### YOLO-Pose (Ultralytics, single-stage)
```python
from ultralytics import YOLO

model = YOLO('yolo11n-pose.pt')
results = model('image.jpg')
for r in results:
    kpts = r.keypoints.xy  # (n_persons, 17, 2)
    conf = r.keypoints.conf
```

### 3D lift (VideoPose3D-style)
```python
import torch

# 2D (T, 17, 2) -> 3D (T, 17, 3) via temporal CNN
class TemporalLift(torch.nn.Module):
    def __init__(self, n_kpts=17, ch=1024):
        super().__init__()
        self.expand = torch.nn.Conv1d(n_kpts * 2, ch, 3, padding=1)
        self.blocks = torch.nn.Sequential(*[
            torch.nn.Sequential(
                # padding=d keeps the sequence length T for kernel size 3;
                # padding=1 would shrink T and break the final reshape.
                torch.nn.Conv1d(ch, ch, 3, padding=d, dilation=d),
                torch.nn.BatchNorm1d(ch), torch.nn.ReLU()
            ) for d in (3, 9, 27)
        ])
        self.head = torch.nn.Conv1d(ch, n_kpts * 3, 1)

    def forward(self, x):  # x: (B, T, 17, 2)
        B, T = x.shape[:2]
        x = x.reshape(B, T, -1).transpose(1, 2)  # (B, 34, T)
        out = self.head(self.blocks(self.expand(x)))  # (B, 51, T)
        return out.transpose(1, 2).reshape(B, T, -1, 3)
```

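The dilations (3, 9, 27) determine how many input frames influence each output frame. The arithmetic can be sketched as follows, assuming stride-1 convolutions as in the block above; `receptive_field` is a hypothetical helper written for this note:

```python
def receptive_field(kernel_sizes, dilations):
    """Frames seen by one output step of a stride-1 stack of 1D convolutions."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))


# expand conv (dilation 1) followed by the three dilated blocks;
# the 1x1 head does not widen the receptive field.
rf = receptive_field(kernel_sizes=[3, 3, 3, 3], dilations=[1, 3, 9, 27])
print(rf)  # 81
```

Each output pose therefore draws on 81 frames of 2D input, which is the temporal context that helps resolve depth ambiguity.
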
### COCO keypoint metric (OKS / mAP)
```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO('person_keypoints_val2017.json')
dt = gt.loadRes('predictions.json')
e = COCOeval(gt, dt, 'keypoints')
e.evaluate()
e.accumulate()
e.summarize()
# AP @ OKS=.50:.95 is the standard metric
```

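For intuition about what the evaluator computes, OKS between one prediction and one ground-truth instance can be sketched in plain Python. The formula and per-keypoint sigmas follow the COCO keypoint evaluation; the function name `oks` and the fallback of returning 0.0 when no keypoint is labeled are simplifications made for this sketch:

```python
import math

# Per-keypoint falloff constants (sigma_i) used by the COCO evaluation.
COCO_SIGMAS = [0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
               0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089]


def oks(pred, gt, visibility, area, sigmas=COCO_SIGMAS):
    """Object Keypoint Similarity between one prediction and one ground truth.

    pred, gt:   lists of (x, y) per keypoint
    visibility: COCO flags per ground-truth keypoint (0 = not labeled)
    area:       ground-truth object area in pixels^2 (the scale term s^2)
    """
    total, labeled = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visibility, sigmas):
        if v == 0:
            continue  # unlabeled keypoints do not count
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k2 = (2 * sigma) ** 2
        total += math.exp(-d2 / (2 * area * k2))
        labeled += 1
    return total / labeled if labeled else 0.0


gt_pose = [(10.0 * i, 5.0 * i) for i in range(17)]
shifted = [(x + 3.0, y) for x, y in gt_pose]
print(oks(gt_pose, gt_pose, [2] * 17, area=5000.0))
print(oks(shifted, gt_pose, [2] * 17, area=5000.0))
```

A perfect prediction scores 1.0, and the same 3-pixel error costs more on keypoints with a small sigma (eyes, nose) than on those with a large one (hips).
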
### SMPL mesh recovery (4D-Humans / HMR2)
```python
from hmr2.models import load_hmr2

model, model_cfg = load_hmr2('logs/checkpoints/epoch=35.ckpt')
model.eval()
# image_tensor: preprocessed person crops (B, 3, H, W), assumed prepared upstream.
# HMR2's forward takes a batch dict with the crops under 'img'.
out = model({'img': image_tensor})
verts = out['pred_vertices']  # (B, 6890, 3)
betas = out['pred_smpl_params']['betas']
pose = out['pred_smpl_params']['body_pose']
```

## Decision criteria
| Situation | Approach |
|---|---|
| Mobile / web real-time | MediaPipe Pose |
| Highest accuracy single image | ViTPose-H (MMPose) |
| Multi-person crowd | YOLO-Pose / ED-Pose (single-stage) |
| 3D from monocular video | 4D-Humans / WHAM |
| Animation mocap | SMPL / SMPL-X based |
| Edge device < 10ms | MoveNet Lightning, RTMPose-tiny |

**Defaults**: RTMPose for 2D, 4D-Humans for 3D mesh.

## 🔗 Graph
- Parent: [[Computer_Vision]] · [[Deep_Learning]]
- Variants: [[MediaPipe]]
- Adjacent: [[Object_Detection]] · [[Keypoint_Detection]]

## 🤖 LLM usage
**When to use**: as an input feature for a vision-action pipeline, fitness/AR apps, mocap automation.
**When not to use**: for facial keypoints use a face-specific model (MediaPipe Face Mesh, dlib); for hands use MediaPipe Hands.

## ❌ Anti-patterns
- **Top-down without bbox tracking**: re-detecting on every frame causes severe temporal jitter. Combine with a tracker such as ByteTrack.
- **Direct 2D (x, y) regression without heatmaps**: lower accuracy. Heatmap supervision is the standard.
- **3D from a single 2D pose**: depth is ambiguous; temporal context or multi-view input is needed.
- **Ignoring camera intrinsics for 3D**: the metric scale comes out wrong.

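The heatmap point above can be illustrated with a small NumPy sketch: the training target is a Gaussian centred on each keypoint, and decoding takes the per-channel argmax. The function names and `sigma=2.0` are assumptions for this illustration; real pipelines usually add sub-pixel refinement on top of the argmax:

```python
import numpy as np


def gaussian_heatmap(x, y, width, height, sigma=2.0):
    """Training target: a 2D Gaussian centred on the keypoint (x, y)."""
    xs = np.arange(width)[None, :]
    ys = np.arange(height)[:, None]
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))


def decode_heatmaps(heatmaps):
    """heatmaps: (K, H, W) -> (K, 2) integer (x, y) peaks and (K,) peak scores."""
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    idx = flat.argmax(axis=1)
    ys, xs = np.unravel_index(idx, (h, w))
    return np.stack([xs, ys], axis=1), flat.max(axis=1)


# Two keypoints on a 48x64 (width x height) heatmap grid.
target = np.stack([gaussian_heatmap(12, 7, 48, 64),
                   gaussian_heatmap(30, 40, 48, 64)])
coords, scores = decode_heatmaps(target)
print(coords.tolist(), scores.tolist())
```

The peak value doubles as a confidence score, which direct (x, y) regression does not provide for free.
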
## 🧪 Verification / duplicates
- Verified (MMPose docs, Ultralytics YOLO11-pose, MediaPipe docs, COCO keypoint benchmark).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: pose estimation paradigms + modern stack (ViTPose, YOLO-Pose, 4D-Humans) |