💡 Much of this section of Kane's course was already covered in AIF and MLA, so review the notes from those certs (e.g. Bedrock Evaluations). Here I only include details that are not covered in those earlier notes
Bedrock Agent Tracing
Ref: https://www.udemy.com/course/ultimate-aws-certified-generative-ai-developer-professional/learn/lecture/53684739
- When a Bedrock Agent returns a response, a trace can be recorded alongside it (via `enableTrace` on the `InvokeAgent` call)
- Shows reasoning process, what KB(s) it hit, errors…
- Great way to debug agents
- Different trace types: pre-processing, orchestration, post-processing, routing classifier…
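To make the trace types concrete, here is a minimal sketch that counts them in an agent response stream. It assumes the `bedrock-agent-runtime` event shape where each event is either a `chunk` (response text) or a `trace` whose inner `trace` dict is keyed by type (e.g. `orchestrationTrace`). The helper names `count_trace_types` and `invoke_with_trace` are my own, not part of any AWS SDK.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def count_trace_types(events: Iterable[Dict[str, Any]]) -> Counter:
    """Count trace parts by type in an agent response event stream."""
    counts: Counter = Counter()
    for event in events:
        trace_part = event.get("trace")
        if trace_part is None:
            continue  # e.g. a "chunk" event carrying response text
        # Assumed shape: {"trace": {"trace": {"orchestrationTrace": {...}}}}
        for trace_type in trace_part.get("trace", {}):
            counts[trace_type] += 1
    return counts


def invoke_with_trace(agent_id: str, alias_id: str, session_id: str, text: str) -> Counter:
    """Call an agent with tracing on (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so count_trace_types works without it

    client = boto3.client("bedrock-agent-runtime")
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=alias_id,
        sessionId=session_id,
        inputText=text,
        enableTrace=True,  # without this, no trace events are returned
    )
    # "completion" is an event stream that can be iterated once
    return count_trace_types(response["completion"])
```

A quick look at the counts (e.g. many `orchestrationTrace` parts for one question) is often enough to see where an agent is looping or failing before reading the full trace.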
FM Evaluation Criteria
Ref: https://www.udemy.com/course/ultimate-aws-certified-generative-ai-developer-professional/learn/lecture/53684741
- Can use humans to evaluate FMs/custom models (how helpful is the model, what vibes/feels do you get…)
- Can evaluate against benchmark datasets: sets of sample Q&A (prompt + response) created by SMEs
- Evaluate accuracy, speed/efficiency, scalability, context retrieval…
- Architecture diagram
- Can use another model as a judge ("LLM-as-a-judge")
- Many types of LLM-as-a-judge architectures
- NB: To use LLM-as-a-judge in Bedrock Evaluations, you must provide a prompt dataset in JSONL format in S3
- Can use hybrid approaches
- 💡 comparing human and benchmark-based approaches can reveal the limitations of both
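As a reminder of what the JSONL prompt dataset looks like, here is a minimal sketch that serializes records one JSON object per line. The field names `prompt` and `referenceResponse` are what I recall from the Bedrock model evaluation docs, so treat them as an assumption and check the current format. `to_jsonl` and the sample records are hypothetical.

```python
import json
from typing import Dict, List


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    lines = []
    for record in records:
        if not record.get("prompt"):
            raise ValueError("every record needs a non-empty 'prompt'")
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Hypothetical sample data; field names are assumed, not verified
sample_records = [
    {"prompt": "What is Amazon S3?",
     "referenceResponse": "An object storage service."},
    {"prompt": "What does IAM stand for?",
     "referenceResponse": "Identity and Access Management."},
]

jsonl_text = to_jsonl(sample_records)
print(jsonl_text, end="")
```

The resulting text would then be uploaded to an S3 bucket (e.g. with boto3's `put_object`) and referenced from the evaluation job.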
ROUGE, BLEU, and BERT Scores
Ref: https://www.udemy.com/course/ultimate-aws-certified-generative-ai-developer-professional/learn/lecture/53684745
- Review this section in AWS AIF
- Remember:
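- ROUGE is recall-oriented: how much of the reference text's n-grams show up in the model output
- BLEU is precision-oriented: how much of the model output's n-grams show up in the reference text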
- BERTScore → uses embeddings to compare semantic similarity, so synonyms and paraphrases are penalized less than with n-gram overlap metrics
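The recall vs. precision distinction can be sketched with unigram overlap. This is a simplification, not the full metrics: real ROUGE has several variants (ROUGE-N, ROUGE-L) and real BLEU combines several n-gram orders with a brevity penalty. The function names are my own.

```python
from collections import Counter
from typing import Tuple


def _overlap(candidate: str, reference: str) -> Tuple[int, int, int]:
    """Return (clipped overlap, candidate token count, reference token count)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter & Counter keeps the minimum count per token (clipping)
    return sum((cand & ref).values()), sum(cand.values()), sum(ref.values())


def unigram_recall(candidate: str, reference: str) -> float:
    """ROUGE-1-style recall: share of reference tokens found in the candidate."""
    overlap, _, ref_total = _overlap(candidate, reference)
    return overlap / ref_total if ref_total else 0.0


def unigram_precision(candidate: str, reference: str) -> float:
    """BLEU-1-style precision: share of candidate tokens found in the reference."""
    overlap, cand_total, _ = _overlap(candidate, reference)
    return overlap / cand_total if cand_total else 0.0


reference = "the cat sat on the mat"
candidate = "the cat sat"

print(unigram_recall(candidate, reference))     # 0.5 (3 of 6 reference tokens)
print(unigram_precision(candidate, reference))  # 1.0 (3 of 3 candidate tokens)
```

A short but correct output scores high on precision and low on recall, which is why BLEU adds a brevity penalty and why the two metrics are usually read together.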