Unsupervised Algorithms in SageMaker

BlazingText - Word2Vec

Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285461

🔧 Word embedding = create a vector representation for words
- Vector can represent e.g. the sentiment or semantics of a word
- Not an NLP algorithm in itself, but useful in NLP → used in machine translation, sentiment analysis
- ‼️ ONLY embedding of individual words, NOT sentences or documents!!
Multiple modes
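- Cbow (Continuous Bag of Words)
  - Predicts a word from its surrounding context words; order of the context words does NOT matter (unordered words inside a “bag”)
- Skip-gram
  - Predicts the surrounding context words from a given word, using (word, context word) pairs taken from a sliding window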
- Batch skip-gram → Like skip-gram, but allows distributed computation over many CPU nodes
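
To make the difference between the modes concrete, here is a minimal sketch of how training examples are formed in the general Word2Vec formulation. It does not show BlazingText's internals; the function names and the toy sentence are hypothetical, and `window_size` here only mirrors the idea of a context window.

```python
def cbow_examples(tokens, window_size):
    """Return (context_words, target_word) pairs: context predicts the center word."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window_size)
        context = tokens[lo:i] + tokens[i + 1:i + 1 + window_size]
        if context:
            # Sorting makes the "bag" explicit: the order of context words is discarded.
            examples.append((sorted(context), target))
    return examples


def skipgram_examples(tokens, window_size):
    """Return (center_word, context_word) pairs: center word predicts each neighbor."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        for context_word in tokens[lo:i] + tokens[i + 1:i + 1 + window_size]:
            examples.append((center, context_word))
    return examples


sentence = "the cat sat on the mat".split()  # toy data, for illustration only
cbow = cbow_examples(sentence, window_size=1)
skipgram = skipgram_examples(sentence, window_size=1)
```

Cbow produces one example per word position, while skip-gram produces one example per (word, neighbor) pair, so skip-gram sees more training pairs for the same text.
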
Input data: text file with one sentence per line
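
A minimal sketch of preparing such a file, assuming space-separated tokens on each line. The lowercasing and punctuation stripping are illustrative preprocessing choices, not requirements stated in these notes, and the helper name and file name are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def to_training_line(sentence):
    """Lowercase a sentence and return its word tokens joined by single spaces."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    return " ".join(tokens)


raw_sentences = [  # toy data, for illustration only
    "The cat sat on the mat.",
    "Dogs chase cats!",
]

# One sentence per line, written to a temporary location for this example.
lines = [to_training_line(s) for s in raw_sentences]
train_path = Path(tempfile.mkdtemp()) / "train.txt"
train_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```
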
Training instance types:
- For cbow and skipgram
  - Any single CPU or single GPU instance will work
  - single ml.p3.2xlarge recommended
- For batch_skipgram
  - Can use single or multiple CPU instances
Important hyperparameters:
- mode (batch_skipgram, skipgram, cbow)
- For tuning performance: learning_rate, window_size, vector_dim, negative_samples
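
The hyperparameters and the instance rules above can be collected in a small sketch. The helper functions are hypothetical, and the numeric values are illustrative placeholders rather than recommended settings.

```python
VALID_MODES = {"batch_skipgram", "skipgram", "cbow"}


def build_hyperparameters(mode, learning_rate, window_size, vector_dim, negative_samples):
    """Return a hyperparameter dict after checking that the mode is one of the three listed."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    return {
        "mode": mode,
        "learning_rate": learning_rate,
        "window_size": window_size,
        "vector_dim": vector_dim,
        "negative_samples": negative_samples,
    }


def check_instance_count(mode, instance_count):
    """Only batch_skipgram may be spread over multiple instances (per the notes above)."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    if instance_count > 1 and mode != "batch_skipgram":
        raise ValueError(f"{mode} supports a single instance only")
    return instance_count


# Illustrative values only, not tuned settings.
hyperparameters = build_hyperparameters(
    mode="skipgram",
    learning_rate=0.05,
    window_size=5,
    vector_dim=100,
    negative_samples=5,
)
instance_count = check_instance_count("batch_skipgram", 2)
```

With the SageMaker Python SDK, a dict like this would typically be passed to an estimator (for example via `set_hyperparameters(**hyperparameters)`); that step is left out here because it needs an AWS session.
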

Object2Vec

Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285475

🔧 Creates low-dimensional vector embeddings from high-dimensional objects
- Similar objects get vectors that lie close together, so the distance between two vectors reflects how similar the objects are
- Arbitrary objects → any type of object accepted (input data must have correct format)
  - 💡 Object2Vec is like Word2Vec from BlazingText, but generalized for arbitrary objects, not just words
- Use cases: compute nearest neighbors of objects, visualize clusters, genre prediction, recommendations (similar items or users…)
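
As an illustration of the nearest-neighbor use case, here is a minimal sketch that ranks objects by cosine similarity of their embeddings. The embedding values and object names are made up; in practice the vectors would come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, embeddings, k=2):
    """Return the names of the k objects most similar to `query`, excluding itself."""
    scored = [
        (name, cosine_similarity(embeddings[query], vector))
        for name, vector in embeddings.items()
        if name != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:k]]


# Hypothetical 3-dimensional embeddings, for illustration only.
embeddings = {
    "movie_a": [0.9, 0.1, 0.0],
    "movie_b": [0.8, 0.2, 0.1],
    "movie_c": [0.0, 0.1, 0.9],
}
neighbors = nearest_neighbors("movie_a", embeddings, k=1)
```
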

Process data into JSONL (JSON Lines) & shuffle it
Train with two (parallel) input channels, two encoders, and a comparator
Comparator is followed by a feed-forward neural network, which generates the label
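
The data-preparation step (JSONL plus shuffling) can be sketched as follows. The field names `label`, `in0` and `in1`, with each input given as a list of integer token IDs, are an assumption based on the pair format commonly shown for Object2Vec; the records themselves are made up.

```python
import json
import random

# Each record pairs two inputs (one per input channel) with a label.
records = [  # toy data, for illustration only
    {"label": 1, "in0": [6, 17, 606], "in1": [16, 21, 13]},
    {"label": 0, "in0": [22, 1016, 32], "in1": [22, 12, 5]},
    {"label": 1, "in0": [774, 14, 21], "in1": [21, 366, 310]},
]


def to_shuffled_jsonl(rows, seed=0):
    """Shuffle a copy of the rows and serialize them as one JSON object per line."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return "\n".join(json.dumps(row) for row in shuffled) + "\n"


jsonl_text = to_shuffled_jsonl(records)
```

A fixed seed keeps the shuffle reproducible between runs; the original `records` list is left untouched.
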

Encoder choices:
- Average-pooled embeddings
- CNNs
- Bidirectional LSTM
- 💡 what works best for your data will vary
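
To show how the pieces fit together, here is a minimal, untrained sketch of the forward pass: two average-pooled encoders, a comparator, and a feed-forward step that outputs a score. Everything here is an assumption for illustration: the weights are random, the feed-forward part is reduced to a single sigmoid unit, and the comparator (concatenation, element-wise product, absolute difference) is one plausible way to combine the two encodings, not a statement of what Object2Vec does internally.

```python
import math
import random

rng = random.Random(0)
VOCAB_SIZE, EMBED_DIM = 10, 4  # hypothetical sizes


def make_embedding_table():
    """Random token embeddings; a trained model would learn these."""
    return [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]


def average_pool_encoder(token_ids, table):
    """Encode a token sequence as the element-wise mean of its token embeddings."""
    vectors = [table[t] for t in token_ids]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def comparator(enc0, enc1):
    """Combine both encodings: concatenation, element-wise product, absolute difference."""
    hadamard = [a * b for a, b in zip(enc0, enc1)]
    abs_diff = [abs(a - b) for a, b in zip(enc0, enc1)]
    return enc0 + enc1 + hadamard + abs_diff


def feed_forward(features, weights, bias):
    """Single sigmoid unit standing in for the feed-forward network."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 / (1 + math.exp(-z))


# One embedding table per input channel, i.e. two separate encoders.
table0, table1 = make_embedding_table(), make_embedding_table()
weights = [rng.uniform(-1, 1) for _ in range(4 * EMBED_DIM)]

enc0 = average_pool_encoder([1, 2, 3], table0)
enc1 = average_pool_encoder([4, 5], table1)
features = comparator(enc0, enc1)
score = feed_forward(features, weights, bias=0.0)  # predicted label score in (0, 1)
```

Swapping `average_pool_encoder` for a CNN or bidirectional LSTM encoder would change only how `enc0` and `enc1` are computed; the comparator and feed-forward steps stay the same.
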