Feature Engineering - Basic Concepts
- 🔧 Applying domain knowledge (of both the data and the model you're using) to create better features for training your model
- ‼️ ART OF ML!!
- Most critical part of a good ML implementation
- Talented/expert ML specialists are good at feature engineering
- Curse of dimensionality
- ❗ More features ≠ better!
- Every feature is a new dimension
- Much of feature engineering is selecting the most relevant features → this is where domain knowledge comes into play
- Unsupervised dimensionality reduction techniques can help (PCA, K-Means) → see the PCA sketch below
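A minimal sketch of unsupervised dimensionality reduction with scikit-learn's PCA (the matrix shape and the 95% variance threshold are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical wide feature matrix: 500 rows x 100 features
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 100))

# Keep just enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # far fewer dimensions to train on
```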
Missing Data
Imputation: Mean Replacement
- 🔧 Replace missing values with mean value of column
- 💡 A column represents a single feature
- The median of the column can be more useful if outliers distort the mean
- 👍 Pros
- Fast & easy
- Doesn't affect mean or sample size of overall data set
- 👎 Cons: pretty terrible
- Not very accurate
- Misses correlations between features (only works on column level)
- If age & income are correlated, imputing the mean income will muddy that relationship a lot
- Mean/median can only be calculated on numeric features, not on categorical features
- 💡 Most frequent value in a categorical feature could work though
- Example code
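A minimal sketch with pandas and scikit-learn's SimpleImputer (the columns and values are hypothetical):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age":    [25, 32, None, 41, 38],
    "income": [40_000, None, 55_000, None, 72_000],
})

# Mean replacement with plain pandas: fill each column with its own mean
df_mean = df.fillna(df.mean(numeric_only=True))

# Median replacement via scikit-learn, more robust when outliers skew the mean
imputer = SimpleImputer(strategy="median")
df_median = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# For categorical features, SimpleImputer(strategy="most_frequent") works instead
```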
Dropping Missing Data
- Reasonable if:
- Not many rows with missing data
- Dropping those rows doesn't bias data
- Need a fast solution
- Almost anything else is better, though; dropping is rarely the "best" approach
- e.g. impute from a similar field (use "review summary" to fill a missing "full text")
- Example code
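A quick sketch of both options with pandas (the columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "review_summary": ["great", None, "ok", "bad"],
    "full_text":      ["loved it", "meh", None, "hated it"],
})

# Drop every row that has at least one missing value
df_dropped = df.dropna()

# Or drop only rows missing a specific critical column
df_dropped_subset = df.dropna(subset=["full_text"])

# The "impute from a similar field" idea: copy review_summary into missing full_text
df["full_text"] = df["full_text"].fillna(df["review_summary"])
```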
Imputation: Using ML
- KNN → find the K "nearest" (most similar) neighbors, i.e. rows, & average their values to fill the missing data (see the sketch after this list)
- Assumes numerical data
- Categorical data can be handled with Hamming distance, but usually DL is better
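A minimal sketch with scikit-learn's KNNImputer (k=2 and the numbers are made up):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Rows are samples, columns are numeric features; np.nan marks missing values
X = np.array([
    [25.0, 40_000.0],
    [32.0, np.nan],
    [np.nan, 55_000.0],
    [41.0, 70_000.0],
])

# Each missing value becomes the average of the 2 most similar rows
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```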
- Deep Learning → train a DL model on the rows with complete data, then use it to impute the missing values in the incomplete rows (sketch below)
- Works very well for categorical data
- 👎 Complicated
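A rough sketch of the idea, with a small scikit-learn MLPClassifier standing in for a real deep learning model (the data and column names are invented):

```python
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: a categorical "segment" column with missing values
df = pd.DataFrame({
    "age":     [25, 32, 29, 41, 38, 50],
    "income":  [40_000, 48_000, 45_000, 70_000, 65_000, 90_000],
    "segment": ["basic", "basic", None, "premium", None, "premium"],
})

complete = df[df["segment"].notna()]
missing = df[df["segment"].isna()]

# Train on the complete rows only (an MLP stands in for the DL model)
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
model.fit(complete[["age", "income"]], complete["segment"])

# Predict (impute) the missing categorical values from the other features
df.loc[df["segment"].isna(), "segment"] = model.predict(missing[["age", "income"]])
print(df)
```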