Feature Engineering - Basic Concepts
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45284887
- 🔧 Applying domain knowledge (your knowledge of the data – and the model you’re using) to create better features to train your model
- ‼️ ART OF ML!!
- The most critical part of a good ML implementation
- Talented/expert ML specialists are good at feature engineering
- Curse of dimensionality
- ❗ More features ≠ better!
- Every feature is a new dimension
- Much of feature engineering is selecting the most relevant features → this is where domain knowledge comes into play
- Unsupervised dimensionality reduction techniques can help (PCA, K-Means)
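A minimal sketch of PCA-based dimensionality reduction with scikit-learn (an assumed library choice; the random 20-feature data set is purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set: 100 samples, 20 features (dimensions)
X = np.random.rand(100, 20)

# Project down to 5 dimensions, keeping the directions of greatest variance
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 5)
print(pca.explained_variance_ratio_)  # variance captured by each component
```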
- Common problems that Feature Engineering usually addresses:
Missing Data
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285025
- 💡 Impute missing data = fill missing data with something
Impute: Mean Replacement
- 🔧 Replace missing values with mean value of column
- 💡 A column represents a single feature
- Median value of column can be more useful if outliers distort the mean
- 💡 e.g. outlier billionaires distorting the income data of average citizens
- 👍 Pros
- Fast & easy
- Doesn't affect mean or sample size of overall data set
- 👎 Cons: pretty terrible
- Not very accurate
- Misses correlations between features (only works on column level)
- If age & income are correlated, simply imputing the mean income will muddy that relationship a lot
- Mean/median can only be calculated on numeric features, not on categorical features
- 💡 Most frequent value in a categorical feature could work though
- Example code
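A minimal sketch of mean/median/mode imputation using pandas and scikit-learn (assumed libraries; the `age`, `income`, and `city` columns are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data set with missing values (np.nan)
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 38],
    "income": [40_000, 52_000, 61_000, np.nan, 1_000_000_000],  # outlier billionaire
    "city":   ["NYC", "NYC", np.nan, "SF", "SF"],
})

# Mean replacement: fill a numeric column with its mean
df["age"] = df["age"].fillna(df["age"].mean())

# Median replacement: more robust when outliers distort the mean
df["income"] = df["income"].fillna(df["income"].median())

# Categorical column: mean is undefined, so fall back to the most frequent value
imputer = SimpleImputer(strategy="most_frequent")
df[["city"]] = imputer.fit_transform(df[["city"]])

print(df)
```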
Dropping Missing Data
- Reasonable if (all must apply!):
- Not many rows with missing data
- Dropping those rows doesn't bias data
- Need a fast solution
- Almost any other technique is better, though; dropping is rarely the “best” approach
- e.g. impute from a similar field (copy “review summary” into a missing “full text”), as in the sketch below
- Data is generally valuable; dropping it is usually a bad idea
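A minimal sketch contrasting dropping rows with imputing from a similar field, using pandas (assumed library; the review columns are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "review_summary": ["Great!", "Terrible.", "Okay-ish"],
    "full_text":      ["Loved every minute of it.", np.nan, np.nan],
    "rating":         [5, np.nan, 3],
})

# Option 1: drop any row with missing data -- fast, but throws data away
df_dropped = df.dropna()

# Option 2: impute from a similar field instead,
# e.g. copy the review summary into the missing full text
df["full_text"] = df["full_text"].fillna(df["review_summary"])
```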