Reinforcement Learning - Basic Concepts
Ref: https://www.udemy.com/course/aws-ai-practitioner-certified/learn/lecture/44886629
- 🔧 A type of ML where an agent learns to make decisions by performing actions in an environment to maximize cumulative rewards
- Over many simulations, the agent gradually learns from its mistakes and successes (learning is reinforced through feedback)
- Applications: gaming (chess, Go), robotics, finance, healthcare, autonomous vehicles…
- Key concepts:
  - Agent – the learner or decision-maker
  - Environment – the external system the agent interacts with
  - Action – the choices made by the agent
  - Reward – the feedback from the environment based on the agent’s actions
  - State – the current situation of the environment
  - Policy – the strategy the agent uses to determine actions based on the state
- Learning Process (similar to a state machine)
  - Agent observes current State of the Environment
  - Agent selects an Action based on its Policy
  - Environment transitions to a new State and provides a Reward
  - Agent updates its Policy to improve future decisions
- Goal: Maximize cumulative reward over time
- Example: RL agent exiting a maze → -1 per step (encourages taking as few steps as possible), -10 for hitting a wall (encourages avoiding walls), +100 for reaching the exit (rewards ending at the exit) → outcome: the robot learns to navigate efficiently over time (see the Q-learning sketch after this list)
- See the YouTube channel AI Warehouse for another example
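As a rough illustration of the loop above (observe State → choose Action via Policy → receive Reward → update Policy), here is a minimal tabular Q-learning sketch in Python for the maze example. Only the reward scheme (-1 per step, -10 for a wall, +100 for the exit) comes from the example; the grid layout, the Q-table approach, and the hyperparameters (ALPHA, GAMMA, EPSILON) are illustrative assumptions, not part of the course material.

```python
import random

# Tiny grid maze: 0 = free cell, 1 = wall (illustrative layout)
GRID = [
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
START, EXIT = (0, 0), (2, 3)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate
Q = {}  # Q[(state, action)] -> estimated cumulative reward


def step(state, action):
    """Environment: apply an Action, return (new State, Reward, done)."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])) or GRID[nr][nc] == 1:
        return state, -10, False       # hit a wall: penalty, stay in place
    if (nr, nc) == EXIT:
        return (nr, nc), 100, True     # reached the exit
    return (nr, nc), -1, False         # every other step costs -1


def choose_action(state):
    """Policy: epsilon-greedy over the current Q estimates."""
    if random.random() < EPSILON:
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))


for episode in range(500):             # many simulations
    state = START
    for _ in range(100):               # cap episode length
        action = choose_action(state)                   # Agent selects an Action (Policy)
        next_state, reward, done = step(state, action)  # Environment: new State + Reward
        best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
        old = Q.get((state, action), 0.0)
        # Agent updates its Policy (here: the Q-table) to improve future decisions
        Q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
        state = next_state
        if done:
            break

# Follow the learned greedy policy from the start
state, path = START, [START]
while state != EXIT and len(path) < 20:
    state, _, _ = step(state, max(ACTIONS, key=lambda a: Q.get((state, a), 0.0)))
    path.append(state)
print(path)  # typically [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 3)]
```

Epsilon-greedy exploration is one simple way to let the agent balance trying new actions against exploiting what it has already learned, which is why the robot's performance improves over many episodes.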
Reinforcement Learning from Human Feedback (RLHF)
Ref: https://www.udemy.com/course/aws-ai-practitioner-certified/learn/lecture/45375323
- 🔧 Use human feedback in a reward function to help ML models self-learn more efficiently
- Aligns model output with human goals, wants and needs
- Used throughout GenAI applications to significantly improve model performance
- e.g. grading text translations not just as “technically correct” but on how “human” they sound
RLHF process (example: internal company knowledge chatbot)
- Data collection
  - A set of human-generated prompts and responses is created
    - e.g. “Where is the location of the HR department in Boston?”
- Supervised fine-tuning of a model
  - Fine-tune an existing model with internal knowledge
  - The fine-tuned model then creates responses for the human-generated prompts
  - Responses are mathematically compared to the human-generated answers
- Build a separate reward model
  - Humans indicate which response they prefer for the same prompt (Answer 1 > Answer 2)
  - The reward model can now estimate how a human would rate a response to a given prompt
- Optimize the original model with the reward model
  - Use the reward model as a reward function for RL (see the sketch below)
  - This part can be fully automated
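A minimal PyTorch sketch of how the separate reward model is commonly trained, and of where the final optimization step plugs in. It uses a pairwise preference (Bradley-Terry) loss so that the answer humans preferred scores higher than the rejected one; the tiny network, tensor sizes, and the random stand-ins for encoded prompt/response pairs are assumptions for illustration, not the course's implementation.

```python
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps an encoded (prompt, response) pair to a scalar preference score."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)


reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Human preference data: for each prompt, Answer 1 was preferred over Answer 2.
# Random vectors stand in here for the encoded (prompt + answer) pairs.
chosen = torch.randn(32, 128)    # batch of preferred answers
rejected = torch.randn(32, 128)  # batch of rejected answers

for _ in range(100):
    # Pairwise preference loss: push score(chosen) above score(rejected)
    loss = -torch.nn.functional.logsigmoid(
        reward_model(chosen) - reward_model(rejected)
    ).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Final step (fully automatable): the trained reward_model scores new responses,
# and that score serves as the reward signal in an RL loop (e.g. PPO) that
# fine-tunes the original chatbot model.
```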