Reinforcement Learning Concepts and the Q-Learning Algorithm
This section introduces the concepts needed to understand Q-learning quickly and mathematically.
Reinforcement Learning Concepts
Reinforcement learning (RL) is a branch of machine learning in which an agent learns by interacting with an environment. Several fundamental concepts recur in RL, for instance agent, environment, step, episode, state, action, and reward.
The environment can be as simple as a fixed 5x5 grid map or as complex as a three-dimensional model whose parameters are constantly changing. During the agent-environment interaction, each step corresponds to one interaction loop, as depicted in the following figure. Starting an episode means restarting the interaction loop from an initial state, and episodes can consist of different numbers of steps depending on the algorithm design. In each interaction loop, the environment gives the agent its next state and a reward (or a penalty, if the reward is negative) based on the current state of the environment and the action chosen by the agent; the agent takes in this information and learns to maximize its cumulative reward, normally within one episode, by adapting its actions accordingly.
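As a minimal sketch of this interaction loop in code (the environment and agent interfaces used here, reset, step, act, and learn, are hypothetical and not tied to any particular library):

def run_episode(env, agent):
    """Run one episode of the agent-environment interaction loop.

    `env` and `agent` are hypothetical objects: env.reset() returns an initial
    state, env.step(action) returns (next_state, reward, done), agent.act(state)
    chooses an action, and agent.learn(...) adapts the agent to the feedback.
    """
    state = env.reset()
    done = False
    cumulative_reward = 0.0
    while not done:                      # each pass through this loop is one step
        action = agent.act(state)                      # agent decides an action
        next_state, reward, done = env.step(action)    # environment responds
        agent.learn(state, action, reward, next_state, done)
        cumulative_reward += reward      # the quantity the agent learns to maximize
        state = next_state
    return cumulative_reward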
In tutorials and the literature, another concept, observation, is often mentioned. However, the concept of, and the relation between, observation and state can be confusing. Some call the factors affecting the action the observation and reserve state for a description of the fully observed environment [OpenAI tutorial], whereas conventional notation conditions the action on the state whether the environment is fully observed or not, and the observation is either left out or treated as identical to the state. The confusion arises because different amounts of information are described without first being distinguished. The following figure summarizes and lists four levels of information and helps clear up the confusion. The amount of information included in information level four can range from the smallest circle in the figure to the largest, depending on the practical requirements of calculation, debugging, or building visualization. It is possible for these four information levels to overlap completely, but in most cases they do not. Information levels one to three are used differently in theory, in convention, and in practice. As described in the OpenAI tutorial, information level three is theoretically named the state, while level two is assumed to be the same as level one and named the observation. Conventionally, information level one is denoted the state. In practice, information level two is somewhat irrelevant, in the sense that not modeling it would probably not undermine the research, while information levels one and three must be modeled.
Mathematically, an RL problem is often formalized as a Markov decision process (MDP), which models a discrete-time, controlled, and fully observed system. The Markov property of a Markov model assumes that the future (transitions and rewards) depends only on the most recent state and action, not on prior history. Any process can be modeled under the Markov property, as long as the state space is detailed enough to capture the aspects of the real process that are relevant for predicting state transitions and rewards [Learning from Delayed Rewards]. For an agent acting in an environment, the Markov property, i.e., this statistical sufficiency, can also be interpreted as the definition of full observation [Robust Reinforcement Learning: A Review of Foundations and Recent Advances], or of no hidden information. An MDP consists of five elements, written out below.
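One common five-element definition, used for instance in the OpenAI tutorial (the exact notation is one convention among several), writes an MDP as the tuple

\[ \langle S, A, R, P, \rho_0 \rangle , \]

where $S$ is the set of states, $A$ the set of actions, $R : S \times A \times S \to \mathbb{R}$ the reward function, $P : S \times A \to \mathcal{P}(S)$ the state-transition probability function, and $\rho_0$ the distribution of the initial state; other formulations use a discount factor $\gamma$ as the fifth element instead of $\rho_0$.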
Based on the construct described above, more RL concepts follow, such as policy, trajectory, and return, sketched below.
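In the usual notation (a sketch of the standard definitions, using the symbols introduced above), a policy is the rule the agent uses to choose its actions, either deterministic, $a_t = \mu(s_t)$, or stochastic, $a_t \sim \pi(\cdot \mid s_t)$. A trajectory is the sequence of states and actions produced by running a policy in the environment,

\[ \tau = (s_0, a_0, s_1, a_1, \dots), \qquad s_0 \sim \rho_0, \quad s_{t+1} \sim P(\cdot \mid s_t, a_t), \]

and the return $R(\tau)$ aggregates the rewards collected along a trajectory, as made precise next.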
Return can be assessed in different ways. Two commonly used formulations are the finite-horizon undiscounted return and the infinite-horizon discounted return. The former is simply the sum of the rewards obtained in a trajectory within a finite number of steps, while the latter introduces an additional discount factor gamma when summing rewards over infinitely many steps, because without it the return may not converge to a finite value.
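In symbols, with $r_t$ the reward received at step $t$, $T$ a finite horizon, and $\gamma \in (0, 1)$ the discount factor (the notation follows the convention above):

\[ R(\tau) = \sum_{t=0}^{T} r_t \qquad \text{and} \qquad R(\tau) = \sum_{t=0}^{\infty} \gamma^{t} r_t , \]

for the finite-horizon undiscounted return and the infinite-horizon discounted return, respectively.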
The goal of an RL agent can therefore be rephrased as adopting an optimal policy, i.e., one with maximal expected return over all of its possible trajectories. Supposing the infinite-horizon discounted return is used, this goal can be expressed by the following three equations.
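Written out in the standard way (the particular symbols are one common convention), the three equations are the probability of a trajectory under a policy, the expected return of a policy, and the optimal policy:

\[ P(\tau \mid \pi) = \rho_0(s_0) \prod_{t=0}^{\infty} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t), \]

\[ J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right], \]

\[ \pi^{*} = \arg\max_{\pi} J(\pi). \]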
The Q-Learning Algorithm
There are many different approaches to solving an RL problem. Depending on whether agents are fed predictions of the environment's responses from sources other than the actual environment, RL algorithms can be labelled "model-free" or "model-based", where "model" refers to those predictions. A famous example of a model-based approach is AlphaGo, which learns from Go game manuals and competition records. In reality, however, a problem is usually modeled with RL concepts precisely because the laws within it are hard to capture. Therefore, many model-free algorithms have been developed, with two main approaches being policy optimization and Q-learning.
In order to explain how an agent learns with the Q-learning algorithm, value functions must first be introduced. Value functions were first used in dynamic programming. Dynamic programming simplifies decision-making by breaking it down into stages, and value functions save evaluations of the different stages for later access and computation. To think of it visually: if all end-result stages of a decision-making problem are modeled as vertices with zero out-degree in a directed acyclic graph, the other vertices are intermediate stages, and each possible solution flows along the arrows starting from vertices with zero in-degree, then value functions reduce computation by saving evaluations of the vertices with the most degrees. An MDP matches exactly this kind of formulation, albeit under different conditions. Value functions in an MDP save evaluations of the expected return of each state, which can be understood by that analogy and is denoted, in the standard notation, as

\[ V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right]. \]

Q-learning is an off-policy algorithm. This means that its learned policy is updated by learning from other policies [Learning from Delayed Rewards], normally at every step. The feature of updating a policy during episodes, rather than after them, is called a temporal-difference update, as opposed to a Monte Carlo update. Therefore, one-step off-policy value functions, also called optimal value functions, evaluating state-action pairs are introduced next as

\[ Q^{*}(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ r(s, a) + \gamma \max_{a'} Q^{*}(s', a') \right], \]

whose sample-based, iterative approximation,

\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right], \]

is the known form of the learning update process in Q-learning, i.e., how an agent learns.
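To make this update concrete, the following is a minimal sketch of tabular Q-learning with an epsilon-greedy behaviour policy on a small, hypothetical deterministic chain environment; the environment, the reward of 1 at the rightmost state, and all hyperparameter values (alpha, gamma, epsilon, 500 episodes) are assumptions chosen only for illustration.

import random
from collections import defaultdict

# A tiny, hypothetical deterministic chain environment with 5 states (0..4).
# Action 0 moves left, action 1 moves right; reaching state 4 gives reward 1
# and ends the episode. All numbers here are illustrative assumptions.
N_STATES = 5
ACTIONS = [0, 1]

def reset():
    return 0  # every episode starts in state 0

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
Q = defaultdict(float)                  # Q[(state, action)] defaults to 0.0

def greedy_action(state):
    # Greedy action with random tie-breaking, so unvisited states are not biased.
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

for _ in range(500):
    state = reset()
    done = False
    while not done:
        # Behaviour policy: explore with probability epsilon, otherwise act greedily.
        action = random.choice(ACTIONS) if random.random() < epsilon else greedy_action(state)
        next_state, reward, done = step(state, action)
        # One-step temporal-difference (Q-learning) update:
        # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# After training, the greedy policy should move right (action 1) in states 0..3.
print({s: greedy_action(s) for s in range(N_STATES - 1)})

Because the update target takes the maximum over the next state's actions regardless of which action the behaviour policy actually takes next, the agent learns about the greedy policy while following an exploratory one, which is what makes Q-learning off-policy.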