Operational Efficiency and Optimization

Contents:

Foreword to this Section

Architecting, building, and improving AI systems is far from trivial. This section includes common problems, challenges, and tradeoffs to watch out for in production-level AI systems. First we expose the challenge, then mention cloud-agnostic techniques to tackle it, and finally the AWS-specific architectures and solutions available to us (but it can all still blend together).

Token Efficiency

Ref: https://www.udemy.com/course/ultimate-aws-certified-generative-ai-developer-professional/learn/lecture/53684395

🔧 More tokens used = More processing time & More $$$ spent → if you can do the same or a similar task with less tokens, better!
- Several tools and techniques to measure/improve number of tokens used, token efficiency…

Token Efficiency Techniques

Count tokens (duh!)
- Super simple
- Very useful to estimate costs prior to inference
- Very useful to optimize prompts to fit within token limits (different models have different context windows)
Context window optimization/Context pruning (Input)
- Delete older parts of chat history (at some point, the whole chat conversation is too big to continue sending as-is…)
- Chunk big documents and send only relevant chunks (RAG)
- Retrieve less chunks in RAG systems
- Filter RAG chunks by metadata (fast way to toss irrelevant chunks)
Prompt Compression (Input)
- Summarize older parts of chat history (at some point, the whole chat conversation is too big to continue sending as-is…)
- Summarize large documents (without losing meaning) before sending to FM (normally a small AI model can summarize user prompt before sending to larger model)
- Use KBs and RAG instead of complete documents in the prompt
Response size controls/Response limiting (Output)
- Establish maxTokens parameter
- Can also force format/length via JSON output settings
- Specify the desired length of a response in the prompt (”answer can not exceed 100 words”)
- Use few-shot examples to specify desired verbosity

Token Efficiency in AWS

Bedrock has CountTokens API (which costs nothing, use!)
CloudWatch can track lots of stuff
- InputTokenCount and OutputTokenCount
- Model/response latency
  - TTFT (Time to First Token) in streaming mode
- Throttles, server/client errors on invocation
- …

Cost-Effective Model Selection

Ref: https://www.udemy.com/course/ultimate-aws-certified-generative-ai-developer-professional/learn/lecture/53684401

🔧 Model size (small-large) ↔ cost-capability tradeoff