Governance and QA

💡 Much of this section of Kane's course was already covered in AIF and MLA, so review topics such as Bedrock Evaluations in the notes for those certs. Here I only include details that my notes for the previous certs don't cover.

Bedrock Agent Tracing

Ref: https://www.udemy.com/course/ultimate-aws-certified-generative-ai-developer-professional/learn/lecture/53684739

When a Bedrock Agent sends a response, it can also return a trace of how it got there (tracing has to be enabled on the request)
- Shows the agent's reasoning process, which KB(s) it queried, errors…
- Great way to debug agents
Different trace types: pre-processing, orchestration, post-processing, routing classifier…
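
To make the trace idea concrete, here is a minimal sketch of reading traces with boto3. It assumes the `bedrock-agent-runtime` client's `invoke_agent` call with `enableTrace=True`, and that each streamed event is either a `chunk` (answer bytes) or a `trace` whose inner `trace` dict is keyed by trace type. The helper names `summarize_trace_events` and `ask_agent` are mine, not from the course, and the agent/alias IDs are whatever your own agent uses.

```python
import uuid
from collections import Counter


def summarize_trace_events(events):
    """Split an invoke_agent event stream into answer text and trace-type counts."""
    answer_parts = []
    trace_types = Counter()
    for event in events:
        if "chunk" in event:
            # Answer text arrives as UTF-8 bytes, possibly in several chunks.
            answer_parts.append(event["chunk"]["bytes"].decode("utf-8"))
        elif "trace" in event:
            # Assumption: the inner "trace" dict is keyed by trace type,
            # e.g. "preProcessingTrace" or "orchestrationTrace".
            for trace_type in event["trace"].get("trace", {}):
                trace_types[trace_type] += 1
    return "".join(answer_parts), dict(trace_types)


def ask_agent(agent_id, agent_alias_id, question):
    """Call a Bedrock Agent with tracing on (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without AWS access

    client = boto3.client("bedrock-agent-runtime")
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=str(uuid.uuid4()),  # fresh session for each question
        inputText=question,
        enableTrace=True,  # without this, no trace events are returned
    )
    return summarize_trace_events(response["completion"])
```

Counting trace types is only a starting point for debugging; in practice you would look inside the orchestration traces to see the rationale and the KB lookups.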

FM Evaluation Criteria

Ref: https://www.udemy.com/course/ultimate-aws-certified-generative-ai-developer-professional/learn/lecture/53684741

Can use humans to evaluate FMs/custom models (how helpful is the model, what vibes/feels do you get…)
Can evaluate against Benchmark Datasets
- Set of sample Q&A (prompt+response) created by SMEs
- Evaluate accuracy, speed/efficiency, scalability, context retrieval…
- Architecture diagram
Can use another model as a judge ("LLM-as-a-judge")
- Many types of LLM-as-a-judge architectures
- NB: To use LLM-as-a-judge in Bedrock Evaluations, you must provide a prompt dataset in JSONL format in S3
Can use hybrid approaches
- 💡 comparing human and benchmark-based approaches can reveal the limitations of both
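
For the JSONL prompt dataset mentioned under LLM-as-a-judge, a rough sketch of building the file contents is below. The field names `prompt`, `referenceResponse`, and `category` are how I remember the Bedrock Evaluations dataset format, so treat them as an assumption and check the AWS docs before relying on them. The sample records and the `to_jsonl` helper are made up for illustration.

```python
import json

# Hypothetical sample records; the field names are an assumption (see above).
SAMPLE_RECORDS = [
    {
        "prompt": "What is the capital of France?",
        "referenceResponse": "Paris",
        "category": "geography",
    },
    {
        "prompt": "Summarize: The cat sat on the mat all day.",
        "referenceResponse": "A cat stayed on a mat all day.",
        "category": "summarization",
    },
]


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    lines = []
    for record in records:
        if not record.get("prompt"):
            raise ValueError("every record needs a non-empty 'prompt'")
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the dataset locally; it would then be uploaded to S3.
    with open("prompts.jsonl", "w", encoding="utf-8") as handle:
        handle.write(to_jsonl(SAMPLE_RECORDS))
```

The resulting `prompts.jsonl` would then go to an S3 bucket that the evaluation job can read.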

ROUGE, BLEU, and BERT Scores

Ref: https://www.udemy.com/course/ultimate-aws-certified-generative-ai-developer-professional/learn/lecture/53684745

Review this section in AWS AIF
Remember:
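- ROUGE is recall-oriented: how much of the reference text shows up in the model output
- BLEU is precision-oriented: how much of the model output shows up in the reference text
- BERTScore → uses embeddings to compare semantic similarity, so synonyms and rewording hurt the score less than with n-gram overlap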
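
A toy example of the recall vs precision difference, using only unigram overlap. This is a deliberately simplified sketch: real ROUGE has several variants (ROUGE-N, ROUGE-L) and real BLEU combines several n-gram sizes with a brevity penalty, none of which is modeled here. The function names are mine.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Count words shared by both texts, clipped to the smaller count per word."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    return sum((cand_counts & ref_counts).values())


def rouge1_recall(candidate, reference):
    """ROUGE-1-style recall: shared words / words in the reference."""
    ref_len = len(reference.split())
    return unigram_overlap(candidate, reference) / ref_len if ref_len else 0.0


def bleu1_precision(candidate, reference):
    """BLEU-1-style precision: shared words / words in the candidate."""
    cand_len = len(candidate.split())
    return unigram_overlap(candidate, reference) / cand_len if cand_len else 0.0


reference = "the cat sat on the mat"
candidate = "the cat sat"

print(rouge1_recall(candidate, reference))    # 0.5: half the reference is covered
print(bleu1_precision(candidate, reference))  # 1.0: every output word is in the reference
```

A short but correct output scores perfectly on precision and poorly on recall, which is why the two metrics are usually read together.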