Foreword to this Section
Architecting, building, and improving AI systems is far from trivial. This section includes common problems, challenges, and tradeoffs to watch out for in production-level AI systems. First we expose the challenge, then mention cloud-agnostic techniques to tackle it, and finally the AWS-specific architectures and solutions available to us (but it can all still blend together).
Token Efficiency
Ref: https://www.udemy.com/course/ultimate-aws-certified-generative-ai-developer-professional/learn/lecture/53684395
- 🔧 More tokens used = More processing time & More $$$ spent → if you can do the same or a similar task with less tokens, better!
- Several tools and techniques to measure/improve number of tokens used, token efficiency…
Token Efficiency Techniques
- Count tokens (duh!)
- Super simple
- Very useful to estimate costs prior to inference
- Very useful to optimize prompts to fit within token limits (different models have different context windows)
- Context window optimization/Context pruning (Input)
- Delete older parts of chat history (at some point, the whole chat conversation is too big to continue sending as-is…)
- Chunk big documents and send only relevant chunks (RAG)
- Retrieve less chunks in RAG systems
- Filter RAG chunks by metadata (fast way to toss irrelevant chunks)
- Prompt Compression (Input)
- Summarize older parts of chat history (at some point, the whole chat conversation is too big to continue sending as-is…)
- Summarize large documents (without losing meaning) before sending to FM (normally a small AI model can summarize user prompt before sending to larger model)
- Use KBs and RAG instead of complete documents in the prompt
- Response size controls/Response limiting (Output)
- Establish
maxTokens parameter
- Can also force format/length via JSON output settings
- Specify the desired length of a response in the prompt (”answer can not exceed 100 words”)
- Use few-shot examples to specify desired verbosity
Token Efficiency in AWS
- Bedrock has
CountTokens API (which costs nothing, use!)
- CloudWatch can track lots of stuff
InputTokenCount and OutputTokenCount
- Model/response latency
- TTFT (Time to First Token) in streaming mode
- Throttles, server/client errors on invocation
- …
Cost-Effective Model Selection
Ref: https://www.udemy.com/course/ultimate-aws-certified-generative-ai-developer-professional/learn/lecture/53684401
- 🔧 Model size (small-large) ↔ cost-capability tradeoff