01. The RL Feedback Loop
Reinforcement learning trains an agent to act in an environment so as to maximize cumulative reward. Unlike supervised learning, we do not feed the agent labeled answers; the agent discovers good behavior through sequential exploration, with the reward signal as its only feedback.
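The feedback loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `CorridorEnv`, its `reset`/`step` methods, the reward of 1.0 at the last cell, and `run_episode` are all hypothetical names and values chosen for the example.

```python
import random


class CorridorEnv:
    """Hypothetical 1-D corridor: start in cell 0, reward 1.0 on reaching the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left; walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=1000):
    """Observe a state, act, receive a reward, and repeat until the episode ends."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(state):
    # No labels are provided: the agent simply tries actions and sees what reward follows.
    return random.choice([0, 1])


episode_return = run_episode(CorridorEnv(), random_policy)
```

A learning agent would replace `random_policy` with a policy that is adjusted from the observed rewards.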
02. Bellman Optimality & Policy Updates
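The Bellman equation expresses the value of a state as the immediate reward plus the discounted value of the states that follow, which lets us model expected future reward recursively. In this module, we construct custom state matrices and evaluate policy convergence algorithms in simulated grids.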
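As one possible illustration of a convergence algorithm driven by the Bellman optimality backup, the sketch below runs value iteration on a small deterministic grid. The array layout (`P[a, s, s']` for transitions, `R[s, a]` for rewards), the function name `value_iteration`, and the four-cell corridor are assumptions made for this example; the module's own matrices may be organized differently.

```python
import numpy as np


def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """P[a, s, t] is the probability of moving from s to t under action a; R[s, a] is the reward."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for _ in range(max_iters):
        # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * sum_t P(t | s, a) * V(t)
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return V, Q.argmax(axis=1)


# Hypothetical corridor with 4 cells; cell 3 is an absorbing goal.
n_states, n_actions = 4, 2              # action 0 = left, action 1 = right
P = np.zeros((n_actions, n_states, n_states))
R = np.zeros((n_states, n_actions))
for s in range(n_states - 1):
    P[0, s, max(s - 1, 0)] = 1.0        # moving left is clamped at the wall
    P[1, s, s + 1] = 1.0                # moving right always succeeds
P[:, 3, 3] = 1.0                        # the goal cell loops onto itself
R[2, 1] = 1.0                           # reward for stepping into the goal

V, policy = value_iteration(P, R)
```

Each sweep applies the backup to every state at once, and the loop stops when the largest change in `V` falls below `tol`, which is one simple way to check convergence.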
Python Q-Learning Implementation
import numpy as np


class SimpleQLearner:
    """Tabular Q-learning agent with an epsilon-greedy policy."""

    def __init__(self, states_count, actions_count, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q_table = np.zeros((states_count, actions_count))
        self.alpha = alpha      # Learning rate
        self.gamma = gamma      # Discount factor
        self.epsilon = epsilon  # Exploration rate

    def choose_action(self, state):
        # Explore with probability epsilon; otherwise exploit the current estimates.
        if np.random.uniform(0, 1) < self.epsilon:
            return np.random.choice(self.q_table.shape[1])
        else:
            # np.argmax breaks ties in favor of the lowest action index.
            return np.argmax(self.q_table[state, :])

    def learn(self, state, action, reward, next_state):
        # Move Q(s, a) toward the one-step target r + gamma * max over a' of Q(s', a').
        predict = self.q_table[state, action]
        target = reward + self.gamma * np.max(self.q_table[next_state, :])
        self.q_table[state, action] += self.alpha * (target - predict)
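To show how `choose_action` and `learn` fit into a training loop, here is a minimal sketch on a hypothetical five-cell corridor. The helpers `corridor_step`, `train`, and `greedy_policy`, as well as the reward values (-1 per step, +10 at the goal), are assumptions for illustration only. `train` accepts any object that exposes `choose_action(state)`, `learn(state, action, reward, next_state)`, and a `q_table` attribute, such as the learner above.

```python
import numpy as np

N_STATES = 5             # hypothetical corridor: cells 0..4
GOAL = N_STATES - 1      # the episode ends when the agent reaches cell 4


def corridor_step(state, action):
    """Hypothetical dynamics: action 0 = left, 1 = right; -1 per step, +10 at the goal."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), GOAL)
    done = next_state == GOAL
    reward = 10.0 if done else -1.0
    return next_state, reward, done


def train(learner, episodes=500, max_steps=100):
    """Run episodes from cell 0, updating the learner after every transition."""
    for _episode in range(episodes):
        state = 0
        for _step in range(max_steps):
            action = learner.choose_action(state)
            next_state, reward, done = corridor_step(state, action)
            # No action is ever taken from the goal cell, so its Q-values stay at
            # zero and the bootstrapped target there reduces to the reward alone.
            learner.learn(state, action, reward, next_state)
            state = next_state
            if done:
                break
    return learner


def greedy_policy(q_table):
    """Read off the action with the highest estimated value in each state."""
    return np.argmax(q_table, axis=1)
```

Under these assumptions, calling `train(SimpleQLearner(N_STATES, 2))` and then `greedy_policy(learner.q_table)` would be expected to favor action 1 (right) in every non-goal cell. The per-step penalty is a design choice: it pushes the estimates of already-tried actions below the zero initialization, which encourages the agent to try the alternatives early on.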