TF-IDF Algorithm
- TF-IDF = Term Frequency-Inverse Document Frequency
- Measures how relevant a term is to a document, relative to a collection (corpus) of documents
- Used in search algorithms → surface the documents most relevant to the search keys/terms
- TF = Term frequency = how often a term/word occurs in a single text/document
- DF = Document frequency = how common a term/word is across all documents
- Typically, non-domain connector words (“stop words”) like “the”, “a”, “and”, etc. are frequent across all documents
- TF-IDF = TF × IDF = frequency within a specific document / frequency across all documents (dividing by DF = multiplying by its inverse)
- Down-weights the high-DF terms in each document, leaving mostly the relevant/distinctive terms (see the sketch below)
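A minimal from-scratch sketch of the idea, assuming a whitespace-tokenized toy corpus; the corpus, tokenizer, and the log-scaled IDF variant (described under Nuances below) are illustrative choices, not from the notes:

```python
# Minimal TF-IDF sketch with log-scaled IDF: tfidf(t, d) = tf(t, d) * log(N / df(t))
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Bag of words: each document becomes a multiset of tokens
bags = [Counter(doc.split()) for doc in docs]
n_docs = len(bags)

# Document frequency: in how many documents each term appears
df = Counter()
for bag in bags:
    df.update(bag.keys())

def tfidf(term, bag):
    tf = bag[term] / sum(bag.values())   # term frequency within this document
    idf = math.log(n_docs / df[term])    # inverse document frequency across the corpus
    return tf * idf

# "the" appears in 2 of 3 docs → low IDF; "mat" appears in 1 of 3 → higher
print(tfidf("the", bags[0]))   # small: common across documents
print(tfidf("mat", bags[0]))   # larger: specific to this document
```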
- Nuances
- Instead of the raw ratio, use the log for IDF: IDF = log(N / DF), where N = total number of documents (frequencies tend to be distributed exponentially)
- TF-IDF assumes a document is just a “bag of words”
- Parsing docs into bags of words can be most of the work
- Words can be represented by their hashes for efficiency (the “hashing trick”; see the Spark sketch below)
- Capitalization, synonyms, abbreviations, misspellings, tenses, etc. should be taken into account → feature engineering
- Can be used for single words (unigrams), but also n-grams (bigrams, trigrams…)
- Difficult to handle at scale → Apache Spark helps! (sketch below)
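A hedged sketch of the Spark route using pyspark.ml’s Tokenizer, HashingTF (the hashing trick from above), and IDF; the toy corpus and numFeatures value are illustrative:

```python
# TF-IDF at scale with Apache Spark: hash tokens into a fixed-size
# feature vector, then rescale the counts by an IDF fit over the corpus.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.appName("tfidf-sketch").getOrCreate()

df = spark.createDataFrame(
    [(0, "the cat sat on the mat"), (1, "the dog chased the cat")],
    ["id", "text"],
)

tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
tf = HashingTF(inputCol="words", outputCol="rawFeatures",
               numFeatures=1 << 10).transform(tokens)
idf_model = IDF(inputCol="rawFeatures", outputCol="features").fit(tf)
idf_model.transform(tf).select("id", "features").show(truncate=False)
```

HashingTF avoids building a vocabulary dictionary at all, which is part of what makes it cheap to distribute across a cluster.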
- Sample TF-IDF matrix with unigrams & bigrams (sketched below)
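One way to produce such a matrix, sketched with scikit-learn’s TfidfVectorizer (an assumed library choice, not from the notes); it also lowercases by default, covering one small piece of the feature engineering above:

```python
# TF-IDF matrix over unigrams and bigrams for a toy corpus
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]

vec = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
matrix = vec.fit_transform(docs)            # sparse (n_docs x n_terms) matrix

print(vec.get_feature_names_out())          # e.g. 'cat', 'cat sat', 'the cat', ...
print(matrix.toarray().round(2))            # one row per document
```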