---
id: [[P-Reinforce|P-Reinforce]]-AUTO-FEG-001
category: AI_and_ML
confidence_score: 1.00
tags: [auto-reinforced, feature-engineering, feature-extraction, data-processing, ml-pipeline]
last_reinforced: 2026-05-04
---

# [[Feature Engineering|Feature Engineering]]

## 📌 One-Line Insight (The Karpathy Summary)
> "Restructuring data: the process of selecting, transforming, and creating useful features from raw data so that machine learning algorithms can capture patterns more easily, with the goal of maximizing model performance."

## 📖 Structured Knowledge (Synthesized Content)
Feature engineering converts raw data into input variables suited to a machine learning model, and it has a decisive effect on model accuracy.

1. **Key processes**:
    * **[[Feature Extraction|Feature Extraction]]**: Reduces dimensionality or creates new attributes while preserving the most important information in high-dimensional raw data. (Example: extracting [[Vector Embedding|embeddings]] from text.)
    * **Feature Selection**: Keeps only the meaningful variables that contribute most to model performance.
    * **Feature Transformation**: Rescales the data or normalizes its distribution.
2. **Data encoding techniques**:
    * **[[One-hot Encoding|One-hot Encoding]]**: Converts categorical data into vectors of 0s and 1s. Useful when the categories are independent of one another, but dimensionality grows quickly as the number of categories rises.
    * **Label Encoding**: Converts categorical data into plain integers.
3. **Use in search systems**:
    * User behavior data (click-through rate, dwell time) is converted into features and fed to a [[Learning to Rank (LTR)|LTR]] model as input.

## ⚖️ Trade-offs & Caveats
* **Risk of data contamination**: Features extracted from a faulty data pipeline misrepresent reality, which undermines the reliability of the whole model.
* **Curse of dimensionality**: Adding too many features sharply increases computational cost and model complexity, which can degrade performance. (Expanding the feature set incrementally is recommended.)
* **Dependence on domain knowledge**: Designing effective features requires a deep understanding of the business context behind the data (domain knowledge).

## 💻 Practical Implementation Code (Boilerplate)
A basic example of one-hot encoding and scaling with `Pandas` and `Scikit-learn`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1. Sample data (categorical 'city', numeric 'population')
df = pd.DataFrame({
    'city': ['Seoul', 'Busan', 'Incheon', 'Seoul'],
    'population': [9400, 3300, 2900, 9500]
})

# 2. Apply one-hot encoding (the sparse_output argument requires scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)
city_encoded = encoder.fit_transform(df[['city']])
city_df = pd.DataFrame(city_encoded, columns=encoder.get_feature_names_out(['city']))

# 3. Scale the numeric column (standardization); ravel() flattens the (n, 1) result
scaler = StandardScaler()
df['pop_scaled'] = scaler.fit_transform(df[['population']]).ravel()

# 4. Combined feature DataFrame
final_features = pd.concat([city_df, df['pop_scaled']], axis=1)
print(final_features)
```

## 🔗 Knowledge Links (Graph)
* **Related concepts**: [[Machine Learning (Machine Learning)|Machine Learning]], [[Natural Language Processing (NLP)|NLP]]
* **Technical tools**: [[One-hot Encoding|One-hot Encoding]], [[Vector Embedding|Vector Embedding]]
* **Connected algorithms**: [[Learning to Rank (LTR)|Learning to Rank]]

---
*Last updated: 2026-05-04*
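The Synthesized Content section describes label encoding and feature selection without an example, so the following is a minimal sketch of both, not a definitive implementation. It makes several assumptions: scikit-learn's `OrdinalEncoder` stands in for label encoding on an input column (scikit-learn's `LabelEncoder` is intended for target labels), `SelectKBest` with `f_classif` is just one of many possible selection methods, and the `signal` and `noise` columns and the labels `y` are made-up toy data.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import OrdinalEncoder

# --- Label-style encoding: map each category to an integer code ---
cities = pd.DataFrame({"city": ["Seoul", "Busan", "Incheon", "Seoul"]})
encoder = OrdinalEncoder()  # categories are sorted alphabetically by default
city_codes = encoder.fit_transform(cities[["city"]]).ravel().astype(int)
print(dict(zip(cities["city"], city_codes)))

# --- Feature selection: keep the single most informative column ---
# Hypothetical toy data: 'signal' separates the two classes, 'noise' does not.
X = pd.DataFrame({
    "signal": [0.10, 0.20, 0.15, 0.90, 1.00, 0.95],
    "noise":  [0.50, 0.10, 0.90, 0.40, 0.80, 0.20],
})
y = [0, 0, 0, 1, 1, 1]

selector = SelectKBest(score_func=f_classif, k=1)  # ANOVA F-score per feature
selector.fit(X, y)
selected = list(X.columns[selector.get_support()])
print(selected)
```

Integer codes imply an ordering (here Busan < Incheon < Seoul) that city names do not actually have, so linear and distance-based models may read meaning into it. That is the usual reason to prefer one-hot encoding for unordered categories despite the extra dimensions.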