TF-IDF Algorithm
- TF-IDF = Term Frequency-Inverse Document Frequency
- Measures how relevant a term is to a document, relative to a collection (corpus) of documents
- Used in search algorithms → surface the documents most relevant to the search keys/terms
- TF = Term frequency = how often a term/word occurs in a single text/document
- DF = Document frequency = how common a term/word is across all documents
- Typically, non-domain connector words (“stop words”) like “the”, “a”, “and”, etc. are frequent across all documents
- TF-IDF = TF × IDF = frequency within a specific document / frequency across all documents (dividing by DF = multiplying by its inverse)
- Down-weights the high-DF terms in each document, leaving mostly the relevant/distinctive terms (see the sketch below)
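A minimal from-scratch sketch of the idea, assuming a whitespace-tokenized toy corpus; the corpus, tokenizer, and the log-scaled IDF variant (described under Nuances below) are illustrative choices, not from the notes:

```python
# Minimal TF-IDF sketch with log-scaled IDF: tfidf(t, d) = tf(t, d) * log(N / df(t))
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Bag of words: each document becomes a multiset of tokens
bags = [Counter(doc.split()) for doc in docs]
n_docs = len(bags)

# Document frequency: in how many documents each term appears
df = Counter()
for bag in bags:
    df.update(bag.keys())

def tfidf(term, bag):
    tf = bag[term] / sum(bag.values())   # term frequency within this document
    idf = math.log(n_docs / df[term])    # inverse document frequency across the corpus
    return tf * idf

# "the" appears in 2 of 3 docs → low IDF; "mat" appears in 1 of 3 → higher
print(tfidf("the", bags[0]))   # small: common across documents
print(tfidf("mat", bags[0]))   # larger: specific to this document
```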
- Nuances
- Instead of the raw ratio, use the log for IDF: IDF = log(N / DF), where N = total number of documents (frequencies tend to be distributed exponentially)
- TF-IDF assumes a document is just a “bag of words”
- Parsing docs into bags of words can be most of the work
- Words can be represented by their hashes for efficiency (the “hashing trick”; see the Spark sketch below)
- Capitalization, synonyms, abbreviations, misspellings, tenses, etc. should be taken into account → feature engineering
- Can be used for single words (unigrams), but also n-grams (bigrams, trigrams…)
- Difficult to handle at scale → Apache Spark helps! (sketch below)
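A hedged sketch of the Spark route using pyspark.ml’s Tokenizer, HashingTF (the hashing trick from above), and IDF; the toy corpus and numFeatures value are illustrative:

```python
# TF-IDF at scale with Apache Spark: hash tokens into a fixed-size
# feature vector, then rescale the counts by an IDF fit over the corpus.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.appName("tfidf-sketch").getOrCreate()

df = spark.createDataFrame(
    [(0, "the cat sat on the mat"), (1, "the dog chased the cat")],
    ["id", "text"],
)

tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
tf = HashingTF(inputCol="words", outputCol="rawFeatures",
               numFeatures=1 << 10).transform(tokens)
idf_model = IDF(inputCol="rawFeatures", outputCol="features").fit(tf)
idf_model.transform(tf).select("id", "features").show(truncate=False)
```

HashingTF avoids building a vocabulary dictionary at all, which is part of what makes it cheap to distribute across a cluster.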
- Sample TF-IDF matrix with unigrams & bigrams (sketched below)
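One way to produce such a matrix, sketched with scikit-learn’s TfidfVectorizer (an assumed library choice, not from the notes); it also lowercases by default, covering one small piece of the feature engineering above:

```python
# TF-IDF matrix over unigrams and bigrams for a toy corpus
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]

vec = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
matrix = vec.fit_transform(docs)            # sparse (n_docs x n_terms) matrix

print(vec.get_feature_names_out())          # e.g. 'cat', 'cat sat', 'the cat', ...
print(matrix.toarray().round(2))            # one row per document
```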