Feature Engineering - Basic Concepts
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45284887
- 🔧 Applying domain knowledge (your knowledge of the data – and the model you’re using) to create better features to train your model
- ‼️ ART OF ML!!
- The most critical part of a good ML implementation
- Talented/expert ML specialists are good at feature engineering
- Curse of dimensionality
- ❗ More features ≠ better!
- Every feature is a new dimension
- Much of feature engineering is selecting the most relevant features → this is where domain knowledge comes into play
- Unsupervised dimensionality reduction techniques can help (PCA, K-Means)
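A minimal sketch of PCA-based dimensionality reduction with scikit-learn (an assumed library choice; the random 20-feature data set is purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set: 100 samples, 20 features (dimensions)
X = np.random.rand(100, 20)

# Project down to 5 dimensions, keeping the directions of greatest variance
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 5)
print(pca.explained_variance_ratio_)  # variance captured by each component
```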
- Common problems that Feature Engineering usually addresses:
Missing Data
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285025
- 💡 Impute missing data = fill missing data with something
Impute: Mean Replacement
- 🔧 Replace missing values with mean value of column
- 💡 A column represents a single feature
- Median value of column can be more useful if outliers distort the mean
- 💡 e.g. outlier billionaires distorting the income data of average citizens
- 👍 Pros
- Fast & easy
- Doesn't affect mean or sample size of overall data set
- 👎 Cons: pretty terrible
- Not very accurate
- Misses correlations between features (only works on column level)
- If age & income are correlated, simply imputing the mean income will muddy that relationship a lot
- Mean/median can only be calculated on numeric features, not on categorical features
- 💡 Most frequent value in a categorical feature could work though
- Example code
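A minimal sketch of mean/median/mode imputation using pandas and scikit-learn (assumed libraries; the `age`, `income`, and `city` columns are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data set with missing values (np.nan)
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 38],
    "income": [40_000, 52_000, 61_000, np.nan, 1_000_000_000],  # outlier billionaire
    "city":   ["NYC", "NYC", np.nan, "SF", "SF"],
})

# Mean replacement: fill a numeric column with its mean
df["age"] = df["age"].fillna(df["age"].mean())

# Median replacement: more robust when outliers distort the mean
df["income"] = df["income"].fillna(df["income"].median())

# Categorical column: mean is undefined, so fall back to the most frequent value
imputer = SimpleImputer(strategy="most_frequent")
df[["city"]] = imputer.fit_transform(df[["city"]])

print(df)
```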
Dropping Missing Data
- Reasonable if (all must apply!):
- Not many rows with missing data
- Dropping those rows doesn't bias data
- Need a fast solution
- Almost any other technique is better, though; dropping is rarely the “best” approach
- e.g. impute from a similar field (copy “review summary” into a missing “full text”), as in the sketch below
- Data is generally valuable; dropping it is usually a bad idea
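A minimal sketch contrasting dropping rows with imputing from a similar field, using pandas (assumed library; the review columns are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "review_summary": ["Great!", "Terrible.", "Okay-ish"],
    "full_text":      ["Loved every minute of it.", np.nan, np.nan],
    "rating":         [5, np.nan, 3],
})

# Option 1: drop any row with missing data -- fast, but throws data away
df_dropped = df.dropna()

# Option 2: impute from a similar field instead,
# e.g. copy the review summary into the missing full text
df["full_text"] = df["full_text"].fillna(df["review_summary"])
```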