Reinforcement learning is a branch of machine learning in which an agent learns to make a sequence of decisions by interacting with an environment. Unlike supervised learning, there are no labeled examples of the correct action; instead, the agent receives a reward signal and must learn behavior that maximizes the reward it accumulates over time.
A Markov Decision Process (MDP) is a mathematical framework used to model decision-making in situations where outcomes are uncertain. It is formulated as a tuple $(S, A, \{P_{sa}\}, \gamma, R)$, where $S$ is the set of states, $A$ is the set of actions, $\{P_{sa}\}$ are the state transition probabilities (for each state $s$ and action $a$, $P_{sa}$ is a distribution over next states), $\gamma \in [0, 1)$ is the discount factor, and $R$ is the reward function.
Rewards are usually written as functions of states and actions, i.e. $R : S \times A \to \mathbb{R}$, or simply just states, i.e. $R : S \to \mathbb{R}$.
Our goal is to find actions that maximize the expected sum of discounted rewards.
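Written out, if the agent starts in state $s_0$ and then visits states $s_1, s_2, \ldots$, this objective is

$$\mathbb{E}\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots\right],$$

where the discount factor $\gamma$ weights rewards received sooner more heavily than rewards received later.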
A policy is any function $\pi : S \to A$ that maps from states to actions. A value function $V^\pi(s)$ is defined as the expected return starting from state $s$ and following policy $\pi$: $V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, s_0 = s, \pi\right]$.
We can write the value function recursively as the Bellman equation: $V^\pi(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') V^\pi(s')$.
Note that $P_{sa}(s')$ is the probability of landing on state $s'$ from state $s$ if we take the action $a$.
The optimal value function is the maximum value function over all policies: $V^*(s) = \max_{\pi} V^\pi(s)$.
The Bellman equation for the optimal value function is: $V^*(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P_{sa}(s') V^*(s')$.
The optimal policy is the one that maximizes the value function: $\pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s') V^*(s')$.
Finding the optimal policy is therefore equivalent to finding the optimal value function, and there are two classic algorithms for doing so: Value Iteration and Policy Iteration.
```python
# Value Iteration: initialize the value function to zero for every state
V = {s: 0.0 for s in states}

# Apply the Bellman backup V(s) := R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')
# until the values stop changing
while True:
    delta = 0.0
    for s in states:
        v_new = R[s] + gamma * max(
            sum(P[s][a][s_prime] * V[s_prime] for s_prime in states)
            for a in actions
        )
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < 1e-6:  # convergence threshold
        break
```
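Once the values converge, the optimal policy can be read off greedily from $V$, exactly as in the $\arg\max$ expression above. A minimal sketch, reusing the `states`, `actions`, `P`, and `V` names from the loop above:

```python
# Greedy policy extraction: pi*(s) = argmax_a sum_s' P(s'|s,a) V(s')
pi_star = {
    s: max(actions, key=lambda a: sum(P[s][a][s_prime] * V[s_prime] for s_prime in states))
    for s in states
}
```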
```python
import random
import numpy as np

# Policy Iteration: start from an arbitrary (random) policy
pi = {s: random.choice(actions) for s in states}
idx = {s: i for i, s in enumerate(states)}

while True:
    # Policy evaluation: solve the linear system (I - gamma * P^pi) V = R for V^pi
    P_pi = np.array([[P[s][pi[s]][s_prime] for s_prime in states] for s in states])
    R_vec = np.array([R[s] for s in states])
    V_vec = np.linalg.solve(np.eye(len(states)) - gamma * P_pi, R_vec)
    V = {s: V_vec[idx[s]] for s in states}
    # Policy improvement: act greedily with respect to V^pi
    policy_stable = True
    for s in states:
        best_a = max(actions, key=lambda a: sum(P[s][a][s_prime] * V[s_prime] for s_prime in states))
        if best_a != pi[s]:
            pi[s], policy_stable = best_a, False
    if policy_stable:
        break
```
For small MDPs, policy iteration typically converges in fewer iterations. However, for large MDPs, solving for $V^\pi$ in each step of policy iteration involves solving a large system of linear equations, which is computationally expensive.
Usually, we do not know the state transition probabilities beforehand. Instead, they can be estimated by repeatedly running the agent on the MDP under some policy $\pi$ and using the following formula:

$$P_{sa}(s') = \frac{\#\text{ times we took action } a \text{ in state } s \text{ and got to } s'}{\#\text{ times we took action } a \text{ in state } s}.$$
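As a minimal sketch of this counting estimate, assuming the experience is available as a list of `(s, a, s_prime)` transition triples (a hypothetical format, not fixed by the notes above):

```python
from collections import defaultdict

def estimate_transitions(transitions):
    """Estimate P[s][a][s'] from observed (s, a, s') triples by counting."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    totals = defaultdict(lambda: defaultdict(int))
    for s, a, s_prime in transitions:
        counts[s][a][s_prime] += 1
        totals[s][a] += 1
    # Normalize counts into probabilities
    return {
        s: {a: {s_p: c / totals[s][a] for s_p, c in next_counts.items()}
            for a, next_counts in a_counts.items()}
        for s, a_counts in counts.items()
    }
```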
One way to deal with continuous state spaces in MDPs is to discretize the state space. If we have $n$ dimensions and we discretize each dimension into $k$ values, then we have $k^n$ states. As we increase $n$, the number of states grows exponentially; for example, discretizing a 10-dimensional state space into 100 values per dimension already gives $100^{10}$ states.
To deal with this, we can use function approximation.
One way to do this is using a model-based approach: we use a simulator to execute $m$ trials, each for $T$ time steps.
We can then use these trials to learn a linear model that predicts $s_{t+1}$ as a function of $s_t$ and $a_t$: $s_{t+1} = A s_t + B a_t$.
We can then minimize the squared error over all trials using gradient descent to find $A$ and $B$:

$$\min_{A, B} \sum_{i=1}^{m} \sum_{t=0}^{T-1} \left\| s_{t+1}^{(i)} - \left(A s_t^{(i)} + B a_t^{(i)}\right) \right\|^2.$$
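A minimal sketch of fitting $A$ and $B$, assuming the trials are stored as lists of per-trajectory state and action arrays (a hypothetical layout), and using the closed-form least-squares solution rather than the gradient descent mentioned above:

```python
import numpy as np

def fit_linear_model(state_trajs, action_trajs):
    """Fit s_{t+1} ~ A s_t + B a_t by least squares over all observed transitions."""
    X, Y = [], []
    for traj_s, traj_a in zip(state_trajs, action_trajs):
        for t in range(len(traj_a)):
            X.append(np.concatenate([traj_s[t], traj_a[t]]))  # input: [s_t; a_t]
            Y.append(traj_s[t + 1])                           # target: s_{t+1}
    X, Y = np.array(X), np.array(Y)
    # Solve min_W ||X W - Y||^2, where W stacks A^T on top of B^T
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    state_dim = len(state_trajs[0][0])
    A = W[:state_dim].T   # shape (state_dim, state_dim)
    B = W[state_dim:].T   # shape (state_dim, action_dim)
    return A, B
```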
However, this is a deterministic model, and most real-world systems are stochastic. So we can modify the model to be stochastic by adding a noise term: $s_{t+1} = A s_t + B a_t + \epsilon_t$, where $\epsilon_t \sim \mathcal{N}(0, \Sigma)$.
Now, if we assume that our state space is continuous but our action space is small and discrete, we can use the fitted value iteration algorithm.
Since $\epsilon_t$ is normally distributed, the term $A s_t + B a_t + \epsilon_t$ is also normally distributed. We can then write: $s_{t+1} \sim \mathcal{N}(A s_t + B a_t, \Sigma)$.
Our state transition distribution can now be written as: $P_{s_t a_t} = \mathcal{N}(A s_t + B a_t, \Sigma)$.
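This makes it easy to sample next states from the learned model, which is exactly what the fitted value iteration sketch below assumes via a `sample_from_P` helper. A minimal version, assuming `A`, `B`, and `Sigma` have already been estimated (e.g. as above):

```python
import numpy as np

# A, B, and Sigma are assumed to be available from the model-fitting step
def sample_from_P(s, a):
    """Draw one next state from the learned Gaussian model N(A s + B a, Sigma)."""
    return np.random.multivariate_normal(mean=A @ s + B @ a, cov=Sigma)
```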
Moreover, our value iteration update rule for the continuous case can be written as: $V(s) := R(s) + \gamma \max_{a} \mathbb{E}_{s' \sim P_{sa}}\left[V(s')\right]$.
However, since our states are continuous, we have to approximate the value function as well. We do so by finding a linear or non-linear mapping from states to the value function: $V(s) = \theta^T \phi(s)$, where $\phi(s)$ is a feature mapping of the state.
We also cannot directly update $V(s)$ from the value iteration update rule. Instead, we have to update our parameters $\theta$.
We do this by minimizing the squared error between the approximation $\theta^T \phi(s^{(i)})$ and the backed-up target values $y^{(i)}$ over a set of sampled states, where $y^{(i)}$ is the right-hand side of the update rule estimated by sampling:

$$\theta := \arg\min_{\theta} \sum_{i=1}^{n} \left(y^{(i)} - \theta^T \phi(s^{(i)})\right)^2.$$
```python
import random
import numpy as np

def V(s, theta):
    # Approximate value function: V(s) = theta^T phi(s)
    return theta.T @ phi(s)

# Randomly sample n states and initialize the parameters
sampled_states = random.sample(state_space, n)
theta = np.zeros(feature_dim)

for _ in range(num_iterations):  # iterate until the fitted values stabilize
    y = []
    for s in sampled_states:
        q_values = []
        for a in actions:
            # Estimate E[V(s')] by sampling k next states from the learned transition model
            next_states = [sample_from_P(s, a) for _ in range(k)]
            q_a = R[s] + gamma * np.mean([V(s_prime, theta) for s_prime in next_states])
            q_values.append(q_a)
        y.append(max(q_values))  # y^(i) = R(s^(i)) + gamma * max_a E[V(s')]
    # Update theta by solving the least-squares problem min_theta sum_i (y^(i) - theta^T phi(s^(i)))^2
    Phi = np.array([phi(s) for s in sampled_states])
    theta, *_ = np.linalg.lstsq(Phi, np.array(y), rcond=None)
```