TF-IDF Algorithm
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45284921
- 🔧 TF-IDF = Term Frequency-Inverse Document Frequency
- Measures how relevant a term is in a document
- Old-school technique used in search algorithms → surfaces the documents most relevant to the search keys/terms
- TF = Term frequency = how often a term/word occurs in a single text/document
- DF = Document frequency = how common a term/word is across all documents
- Normally non-domain connector words (stop words) like “the”, “a”, “and”, etc. are frequent across all documents
- TF-IDF = term frequency in a document / frequency across all documents (dividing by DF = multiplying by its inverse, hence “inverse document frequency”)
- Down-weights terms that are common across all documents, leaving only the relevant terms/words
- Nuances
- Instead of the raw ratio, use the log of the IDF (since term frequencies tend to be distributed exponentially)
- TF-IDF assumes document is just a “bag of words” (BoW)
- Parsing docs into a bag of words can be most of the work
- Words can be represented as hash values (numbers) for efficiency
- Capitalization, synonyms, abbreviations, misspellings, tenses, etc. should be taken into account → feature engineering
- Can be used for single words (unigrams), but also n-grams (bigrams, trigrams…)
- Difficult to handle at scale → Apache Spark helps!
- Sample TF-IDF matrix with unigrams & bigrams (rows = documents, columns = terms, cells = TF-IDF scores)
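The formula above (TF × log-IDF, over unigrams or n-grams) can be sketched in plain Python. This is a minimal, illustrative implementation with a made-up toy corpus and a naive tokenizer — real systems add smoothing, normalization, and the feature engineering noted above:

```python
import math
import re
from collections import Counter

def tokenize(text, n=1):
    """Lowercase, split into words, and emit n-grams (n=1 -> unigrams)."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def tf_idf(docs, n=1):
    """Return one {term: tf-idf score} dict per document.

    TF  = count of term in this document / total terms in this document
    IDF = log(total documents / documents containing the term)
    """
    bags = [Counter(tokenize(d, n)) for d in docs]   # bag of words per doc
    n_docs = len(docs)
    df = Counter(term for bag in bags for term in bag)  # document frequency
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({t: (c / total) * math.log(n_docs / df[t])
                       for t, c in bag.items()})
    return scores

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
unigram_scores = tf_idf(docs, n=1)
# "the" appears in every document -> IDF = log(3/3) = 0, so it drops out
print(unigram_scores[0]["the"])       # 0.0
# "mat" appears only in doc 0 -> positive score
print(unigram_scores[0]["mat"] > 0)   # True
```

Passing `n=2` to `tf_idf` builds the same matrix over bigrams ("the cat", "cat sat", …), which is how the unigram & bigram matrix above would be produced.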
LAB: PySpark notebook on EMR Serverless
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285017
- 💡 We will be using EMR Serverless in this demo. If the workflow ever gets stuck, try restarting the application; the serverless cluster is not always 100% reliable
- We will apply TF-IDF to a small subset of Wikipedia
- Spin up an EMR Notebook within EMR Studio and select an interactive notebook (we will interact with PySpark)
- Select EMR serverless so that you don't need to estimate capacity
- Once launched, upload the PySpark code provided in the course and open it in the notebook
- NOTE: EMR clusters may take some time to spin up and attach; be patient
- Select PySpark kernel
- Follow the instructions & commands in the notebook in order
- 💡 If it ever gets stuck, respin the EMR serverless cluster