Working with trees
Search depth d : Number of moves from the root to the current node (in ply)
Branching factor b : Average number of successor moves per node
Circles represent the moves of the player running the algorithm (the maximizing player), and squares represent the moves of the opponent (the minimizing player). The values inside the circles and squares are the values α computed by the minimax algorithm. The numbers on the left indicate the tree depth, and the colored arrows mark the chosen move.
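The alternation between maximizing (circle) and minimizing (square) nodes described above can be sketched as follows. The `Node` type is a hypothetical minimal tree structure, not something defined in the text: leaves carry a static evaluation, inner nodes a list of successor moves.

```python
# Minimax sketch for a two-player zero-sum game tree.
class Node:
    def __init__(self, value=None, children=()):
        self.value = value              # leaf evaluation (None for inner nodes)
        self.children = list(children)  # successor moves

def minimax(node, depth, maximizing):
    """Return the minimax value of `node`, searching up to `depth` plies."""
    if depth == 0 or not node.children:
        return node.value
    if maximizing:
        # Circle node: the maximizing player picks the highest child value.
        return max(minimax(c, depth - 1, False) for c in node.children)
    # Square node: the minimizing opponent picks the lowest child value.
    return min(minimax(c, depth - 1, True) for c in node.children)

# A depth-2 tree with branching factor b = 2:
tree = Node(children=[
    Node(children=[Node(3), Node(5)]),   # opponent will pick the min -> 3
    Node(children=[Node(2), Node(9)]),   # opponent will pick the min -> 2
])
print(minimax(tree, 2, True))  # -> 3, the best guaranteed outcome at the root
```

The cost of this search grows as roughly b^d, which is why the search depth d and branching factor b defined above matter.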
Markov Decision Process
First, some basics: we operate on a set of States (S). A State is a set of observations that describes the current scene. Actions (A) can be executed on States; an Action transitions the scene from one State to another. To express how desirable a state is, utility theory assigns it a numerical value. The utility (u) depends on the State and on the policy (which action to execute in which state). Besides the reward (r) for entering the next state, the utility function also accounts for the future rewards that can still be collected.
The aim of an MDP is usually to find the optimal policy: the best action for every state. To achieve this, we assume the Markov Property holds, which implies that the optimal policy depends only on the current state.
To weigh future rewards, a discount factor is used. The discount factor (usually 0 < gamma < 1) expresses how valuable a reward collected in a future state is compared to an immediate one. The Bellman Equation is used to show that an optimal policy always exists, that it does not depend on the history of States, and that multiple optimal policies can exist.
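The text gives no formulas, but a common way to write the Bellman equation with the quantities introduced above is the following (notation assumed: utility U, reward R, discount factor gamma, and transition probabilities P):

```latex
U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')
```

The utility of a state is its immediate reward plus the discounted expected utility of the best action's successor states.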
To determine the optimal policy, two approaches exist: Value Iteration and Policy Iteration. Both are iterative techniques for finding the optimal policy.
The Bellman equation can be reformulated as a Bellman Update that reduces the distance to the optimal utilities; applied iteratively, the error converges to 0. In Value Iteration, the utility of every action possible in every state is calculated, and the best resulting utilities determine the policy. This only yields an estimate of the optimal policy, but it is easier to compute in very large scenes, as it avoids solving systems of equations. It requires that all transition and reward functions are known and that the number of states is finite.
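A minimal sketch of Value Iteration on a tiny hand-made MDP (the states, actions, transition probabilities, and rewards below are illustrative, not from the text; `P[s][a]` is a list of `(probability, next_state)` pairs):

```python
# Value Iteration: repeatedly apply the Bellman Update to all states.
P = {
    "s0": {"stay": [(1.0, "s0")], "go": [(0.8, "s1"), (0.2, "s0")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s1")]},
}
R = {"s0": 0.0, "s1": 1.0}    # reward for being in a state
gamma = 0.9                   # discount factor, 0 < gamma < 1

U = {s: 0.0 for s in P}       # initial utility estimates
for _ in range(100):          # iterate until the error is (nearly) 0
    U = {s: R[s] + gamma * max(sum(p * U[s2] for p, s2 in P[s][a])
                               for a in P[s])
         for s in P}

# Extract the estimated optimal policy: the best action in every state.
policy = {s: max(P[s], key=lambda a: sum(p * U[s2] for p, s2 in P[s][a]))
          for s in P}
print(policy)  # in s0, "go" is better: it leads toward the rewarding s1
```

Note that both the transition model `P` and the rewards `R` must be fully known, and the state set must be finite, exactly as stated above.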
Policy Iteration also uses the Bellman Update, to define a process that improves the policy in every step. Starting from an arbitrary policy, a set of linear equations is solved to calculate the utilities achievable under that policy. These utilities are then compared with the utilities obtained by choosing the actions the policy does not prescribe. If any of those actions yields a higher utility than the current policy's action, the policy is updated accordingly. This process is repeated until the policy no longer changes. Policy Iteration computes the exact solution and is therefore preferable to Value Iteration. However, because it solves systems of linear equations, the computational effort can become prohibitive for large scenes; in that case Value Iteration can be used to compute an estimate of the optimal policy. As before, the transition and reward functions must be known and the number of states must be finite.
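The evaluate-then-improve loop can be sketched on the same kind of illustrative MDP (states, transitions, and rewards are made up). For simplicity, the linear system of the evaluation step is approximated iteratively here instead of with a matrix solver:

```python
# Policy Iteration: alternate policy evaluation and policy improvement.
P = {
    "s0": {"stay": [(1.0, "s0")], "go": [(0.8, "s1"), (0.2, "s0")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s1")]},
}
R = {"s0": 0.0, "s1": 1.0}
gamma = 0.9

def q(s, a, U):
    """Expected utility of executing action a in state s."""
    return R[s] + gamma * sum(p * U[s2] for p, s2 in P[s][a])

policy = {s: "stay" for s in P}        # start with an arbitrary policy
while True:
    # 1. Policy evaluation: utilities of the current, fixed policy.
    #    (This approximates solving the linear equations iteratively.)
    U = {s: 0.0 for s in P}
    for _ in range(100):
        U = {s: q(s, policy[s], U) for s in P}
    # 2. Policy improvement: switch to a better action where one exists.
    new_policy = {s: max(P[s], key=lambda a: q(s, a, U)) for s in P}
    if new_policy == policy:           # policy stable -> optimal
        break
    policy = new_policy
print(policy)
```

Termination is guaranteed because there are finitely many policies and each improvement step strictly increases the utilities until the policy is stable.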
Every action has a probability of failure, i.e. of reaching not the desired state but a different one. The Markov Property implies that the transition probability depends only on the current state and the chosen action.
Put differently: every piece of information needed to determine whether an action fails must be explicitly contained in the State. If, for example, a state contains only the position of an object but not its speed, and the speed matters for the action, an additional state would be needed to extract the required speed, and the Markov Property would not hold. For the Markov Property to be satisfied, the speed must therefore be included in the observations of every state.
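The speed example above can be made concrete. In this sketch (all names and numbers are made up), the state explicitly contains both position and speed, so the outcome distribution of an action can be read off the current state alone, with no history needed:

```python
# Illustrative Markovian transition model: returns a list of
# (probability, next_state) pairs for a (state, action) pair.
def transition(state, action):
    pos, speed = state
    if action == "brake":
        # The failure probability of braking depends on the speed --
        # because speed is part of the state, the Markov Property holds.
        p_success = 0.9 if speed <= 1 else 0.5
        return [(p_success, (pos, 0)),                  # desired: stopped
                (1 - p_success, (pos + speed, speed))]  # failure: keeps moving
    # Any other action moves the object forward deterministically.
    return [(1.0, (pos + speed, speed))]

print(transition((0, 2), "brake"))
# -> [(0.5, (0, 0)), (0.5, (2, 2))]: at high speed, braking fails half the time
```

If the state were only `pos`, the function could not compute `p_success` without looking at earlier states, which is exactly the violation of the Markov Property described above.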