BlazingText - Word2Vec
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285461
- 🔧 Word embedding = create a vector representation for words
- Vector can represent e.g. the sentiment or semantic meaning of a word
- Not an NLP algorithm in itself, but useful in NLP → used in machine translation, sentiment analysis
- ‼️ ONLY embedding of individual words, NOT sentences or documents!!
- Multiple modes:
  - CBOW (Continuous Bag of Words): predicts a word from its surrounding context words
    - Order of words does NOT matter (disconnected words inside a “bag”)
  - Skip-gram: predicts the surrounding context words from a word
    - n-grams → Order of words DOES matter!
  - Batch skip-gram
    - Like skip-gram, but allows distributed computation over many CPU nodes
- Input data: text file with one sentence per line
- Training instance types:
  - For cbow and skipgram:
    - Any single CPU or single GPU instance will work
    - A single ml.p3.2xlarge is recommended
  - For batch_skipgram:
    - Can use single or multiple CPU instances
- Important hyperparameters:
  - mode (batch_skipgram, skipgram, cbow)
  - For tuning performance: learning_rate, window_size, vector_dim, negative_samples
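
A minimal sketch of the input format and modes above, in plain Python (no SageMaker calls). The helper names `to_training_line`, `write_corpus` and `context_pairs` are made up for illustration, and `context_pairs` is only a toy view of what each mode learns from, not BlazingText's actual internals. The hyperparameter names are written in lowercase, which is how I understand SageMaker expects them; the values are placeholders, not tuning advice.

```python
import re
from pathlib import Path

def to_training_line(sentence):
    """Lowercase a sentence and return its tokens separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9']+", sentence.lower()))

def write_corpus(sentences, path):
    """Write one cleaned sentence per line; returns the lines written."""
    lines = [line for line in map(to_training_line, sentences) if line]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines

def context_pairs(tokens, window_size, mode):
    """Toy illustration of the training examples each mode is built around."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window_size):i] + tokens[i + 1:i + 1 + window_size]
        if mode == "cbow":
            # The whole context (treated as a bag) predicts the target word
            pairs.append((context, target))
        elif mode in ("skipgram", "batch_skipgram"):
            # The target word predicts each context word in turn
            pairs.extend((target, c) for c in context)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return pairs

# Placeholder values for illustration only (assumption, not a recommendation)
hyperparameters = {
    "mode": "skipgram",
    "learning_rate": 0.05,
    "window_size": 5,
    "vector_dim": 100,
    "negative_samples": 5,
}
```

Calling `context_pairs(tokens, window_size=2, mode="cbow")` and then the same with `mode="skipgram"` on one tokenized sentence makes the difference between the two modes visible: one example per word versus one example per (word, context word) pair.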
Object2Vec
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285475
- 🔧 Creates low-dimensional vector embeddings from high-dimensional objects
- Similar objects end up with similar vectors, so the distance between two vectors indicates how similar the objects are
- Arbitrary objects → any type of object accepted (input data must have correct format)
- 💡 Object2Vec is like Word2Vec from BlazingText, but generalized for arbitrary objects, not just words
- Use cases: compute nearest neighbors of objects, visualize clusters, genre prediction, recommendations (similar items or users…)
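
The nearest-neighbor use case can be sketched as follows, assuming the embeddings have already been produced. The vectors and IDs (`movie_a` etc.) are invented for illustration, and cosine similarity is just one common choice of similarity measure, not something the notes prescribe.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_neighbors(query_id, embeddings, k=1):
    """Return the k object IDs whose vectors are most similar to the query's."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vec), other)
        for other, vec in embeddings.items()
        if other != query_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]

# Made-up low-dimensional embeddings, standing in for model output
embeddings = {
    "movie_a": [0.9, 0.1, 0.0],
    "movie_b": [0.8, 0.2, 0.1],
    "movie_c": [0.0, 0.1, 0.9],
}

print(nearest_neighbors("movie_a", embeddings, k=1))
```

The same lookup works for "similar users" or any other object type, since only the vectors are compared.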
How it works
- Process data into JSONL (JSON Lines) & shuffle it
- Train with two (parallel) input channels, two encoders, and a comparator
- The comparator is followed by a feed-forward neural network, which generates the label
- Encoder choices:
- Average-pooled embeddings
- CNNs
- Bidirectional LSTM
- 💡 what works best for your data will vary
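
A small sketch of the "process into JSONL & shuffle" step. The field names `label`, `in0` and `in1` (one per input channel) follow the record layout as I recall it from the AWS documentation and should be checked against it; the token IDs and the helper name `to_jsonl` are made up for illustration.

```python
import json
import random

def to_jsonl(pairs, seed=0):
    """Turn (in0, in1, label) tuples into shuffled JSON Lines text."""
    records = [
        {"label": label, "in0": in0, "in1": in1}
        for in0, in1, label in pairs
    ]
    # Seeded shuffle so the example is reproducible
    random.Random(seed).shuffle(records)
    # One JSON object per line
    return "\n".join(json.dumps(r) for r in records) + "\n"

# Each pair: tokens for channel 0, tokens for channel 1, and a label
pairs = [
    ([774, 14, 21], [21, 366], 1),
    ([12, 5, 9], [880, 3], 0),
    ([42, 7], [42, 8, 19], 1),
]

jsonl_text = to_jsonl(pairs)
print(jsonl_text, end="")
```

Each line carries one training pair: `in0` feeds the first encoder, `in1` the second, and `label` is what the comparator and feed-forward network learn to predict.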