Feature Engineering - Basic Concepts
- 🔧 Applying domain knowledge (of both the data and the model you're using) to create better features for training your model
- ‼️ ART OF ML!!
- Most critical part of a good ML implementation
- Talented/expert ML specialists are good at feature engineering
- Curse of dimensionality
- ❗ More features ≠ better!
- Every feature is a new dimension
- Much of feature engineering is selecting the most relevant features → this is where domain knowledge comes into play
- Unsupervised dimensionality reduction techniques can help (PCA, K-Means) → see the PCA sketch below
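A minimal sketch of unsupervised dimensionality reduction with scikit-learn's PCA (the matrix shape and the 95% variance threshold are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical wide feature matrix: 500 rows x 100 features
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 100))

# Keep just enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # far fewer dimensions to train on
```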
Missing Data
Imputation: Mean Replacement
- 🔧 Replace missing values with mean value of column
- 💡 A column represents a single feature
- The median of the column can be more useful if outliers distort the mean
- 👍 Pros
- Fast & easy
- Doesn't affect mean or sample size of overall data set
- 👎 Cons: pretty terrible
- Not very accurate
- Misses correlations between features (only works on column level)
- If age & income are correlated, imputing the mean income will muddy that relationship a lot
- Mean/median can only be calculated on numeric features, not on categorical features
- 💡 Most frequent value in a categorical feature could work though
- Example code
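A minimal sketch with pandas and scikit-learn's SimpleImputer (the columns and values are hypothetical):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age":    [25, 32, None, 41, 38],
    "income": [40_000, None, 55_000, None, 72_000],
})

# Mean replacement with plain pandas: fill each column with its own mean
df_mean = df.fillna(df.mean(numeric_only=True))

# Median replacement via scikit-learn, more robust when outliers skew the mean
imputer = SimpleImputer(strategy="median")
df_median = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# For categorical features, SimpleImputer(strategy="most_frequent") works instead
```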
Dropping Missing Data
- Reasonable if:
- Not many rows with missing data
- Dropping those rows doesn't bias data
- Need a fast solution
- Almost anything else is better, though; dropping is rarely the "best" approach
- e.g. impute from a similar field (use "review summary" to fill a missing "full text")
- Example code
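A quick sketch of both options with pandas (the columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "review_summary": ["great", None, "ok", "bad"],
    "full_text":      ["loved it", "meh", None, "hated it"],
})

# Drop every row that has at least one missing value
df_dropped = df.dropna()

# Or drop only rows missing a specific critical column
df_dropped_subset = df.dropna(subset=["full_text"])

# The "impute from a similar field" idea: copy review_summary into missing full_text
df["full_text"] = df["full_text"].fillna(df["review_summary"])
```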
Imputation: Using ML
- KNN → find the K "nearest" (most similar) neighbors, i.e. rows, & average their values to fill the missing data (see the sketch after this list)
- Assumes numerical data
- Categorical data can be handled with Hamming distance, but usually DL is better
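A minimal sketch with scikit-learn's KNNImputer (k=2 and the numbers are made up):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Rows are samples, columns are numeric features; np.nan marks missing values
X = np.array([
    [25.0, 40_000.0],
    [32.0, np.nan],
    [np.nan, 55_000.0],
    [41.0, 70_000.0],
])

# Each missing value becomes the average of the 2 most similar rows
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```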
- Deep Learning → train a DL model on the rows with complete data, then use it to impute the missing values in the incomplete rows (sketch below)
- Works very well for categorical data
- 👎 Complicated
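A rough sketch of the idea, with a small scikit-learn MLPClassifier standing in for a real deep learning model (the data and column names are invented):

```python
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: a categorical "segment" column with missing values
df = pd.DataFrame({
    "age":     [25, 32, 29, 41, 38, 50],
    "income":  [40_000, 48_000, 45_000, 70_000, 65_000, 90_000],
    "segment": ["basic", "basic", None, "premium", None, "premium"],
})

complete = df[df["segment"].notna()]
missing = df[df["segment"].isna()]

# Train on the complete rows only (an MLP stands in for the DL model)
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
model.fit(complete[["age", "income"]], complete["segment"])

# Predict (impute) the missing categorical values from the other features
df.loc[df["segment"].isna(), "segment"] = model.predict(missing[["age", "income"]])
print(df)
```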