TF-IDF Algorithm
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45284921
- 🔧 TF-IDF = Term Frequency-Inverse Document Frequency
- Measures how relevant a term is in a document
- Old-school technique used in search algorithms → surfaces the documents most relevant to the search keys/terms
- TF = Term frequency = how often a term/word occurs in a single text/document
- DF = Document frequency = how common a term/word is across all documents
- Normally non-domain connector words (stop words) like “the”, “a”, “and”, etc. are frequent across all documents
- TF-IDF = term frequency in a document / frequency across all documents (dividing by DF = multiplying by its inverse, hence “inverse document frequency”)
- Down-weights terms that are common across all documents, leaving only the relevant terms/words
- Nuances
- Instead of the raw ratio, use the log of the IDF (since term frequencies tend to be distributed exponentially)
- TF-IDF assumes document is just a “bag of words” (BoW)
- Parsing docs into a bag of words can be most of the work
- Words can be represented as hash values (numbers) for efficiency
- Capitalization, synonyms, abbreviations, misspellings, tenses, etc. should be taken into account → feature engineering
- Can be used for single words (unigrams), but also n-grams (bigrams, trigrams…)
- Difficult to handle at scale → Apache Spark helps!
- Sample TF-IDF matrix with unigrams & bigrams (rows = documents, columns = terms, cells = TF-IDF scores)
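The formula above (TF × log-IDF, over unigrams or n-grams) can be sketched in plain Python. This is a minimal, illustrative implementation with a made-up toy corpus and a naive tokenizer — real systems add smoothing, normalization, and the feature engineering noted above:

```python
import math
import re
from collections import Counter

def tokenize(text, n=1):
    """Lowercase, split into words, and emit n-grams (n=1 -> unigrams)."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def tf_idf(docs, n=1):
    """Return one {term: tf-idf score} dict per document.

    TF  = count of term in this document / total terms in this document
    IDF = log(total documents / documents containing the term)
    """
    bags = [Counter(tokenize(d, n)) for d in docs]   # bag of words per doc
    n_docs = len(docs)
    df = Counter(term for bag in bags for term in bag)  # document frequency
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({t: (c / total) * math.log(n_docs / df[t])
                       for t, c in bag.items()})
    return scores

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
unigram_scores = tf_idf(docs, n=1)
# "the" appears in every document -> IDF = log(3/3) = 0, so it drops out
print(unigram_scores[0]["the"])       # 0.0
# "mat" appears only in doc 0 -> positive score
print(unigram_scores[0]["mat"] > 0)   # True
```

Passing `n=2` to `tf_idf` builds the same matrix over bigrams ("the cat", "cat sat", …), which is how the unigram & bigram matrix above would be produced.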
LAB: PySpark notebook on EMR Serverless
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285017
- 💡 We will be using EMR Serverless in this demo. If the workflow ever gets stuck, try restarting the application; the serverless cluster is not always 100% reliable
- We will apply TF-IDF to a small subset of Wikipedia
- Spin up an EMR Notebook within EMR Studio and select an interactive notebook (we will interact with PySpark)
- Select EMR serverless so that you don't need to estimate capacity
- Once launched, upload the PySpark code provided in the course and open it in the notebook
- NOTE: EMR clusters may take some time to spin up and attach; be patient
- Select PySpark kernel
- Follow the instructions & commands in the notebook in order
- 💡 If it ever gets stuck, respin the EMR serverless cluster