
For the value function, an important aspect is how the future should be taken into
account. A number of models have been proposed [432]:
• The finite-horizon model, in which the agent optimizes its expected reward
for the next $n_t$ steps, i.e.
$$
E\left[ \sum_{t=1}^{n_t} r(t) \right] \qquad (6.1)
$$
where $r(t)$ is the reward for time-step $t$.
• The infinite-horizon discounted model, which takes the entire long-run
reward of the agent into consideration. However, each reward received in the
future is geometrically discounted according to a discount factor, $\gamma \in [0, 1)$:
$$
E\left[ \sum_{t=0}^{\infty} \gamma^t r(t) \right] \qquad (6.2)
$$
The discount factor enforces a bound on the infinite sum.
• The average reward model, which prefers actions that optimize the agent’s
long-run average reward:
$$
\lim_{n_t \to \infty} E\left[ \frac{1}{n_t} \sum_{t=0}^{n_t} r(t) \right] \qquad (6.3)
$$
A problem with this model is that it cannot distinguish between a policy that
gains a large amount of reward in the initial phases and a policy where the
largest gain is obtained in the later phases, as the sketch after this list
illustrates.
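To make the three criteria concrete, the following minimal sketch computes each
return for a fixed reward sequence; the function and variable names are
illustrative, not from the source:

```python
def finite_horizon_return(rewards, n_t):
    """Expected reward over the next n_t steps (equation (6.1))."""
    return sum(rewards[:n_t])

def discounted_return(rewards, gamma):
    """Geometrically discounted long-run reward (equation (6.2)), gamma in [0, 1)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def average_return(rewards):
    """Long-run average reward (equation (6.3)), approximated over a finite run."""
    return sum(rewards) / len(rewards)

# Two policies with equal total reward: one earns early, the other late.
early = [10, 10, 0, 0, 0, 0]
late = [0, 0, 0, 0, 10, 10]

print(discounted_return(early, 0.9))   # 19.0: early reward is barely discounted
print(discounted_return(late, 0.9))    # about 12.47: late reward counts for less
print(average_return(early), average_return(late))  # identical: the average reward
                                                    # model cannot tell them apart
```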
In order to find an optimal policy, $\pi^*$, it is necessary to find an optimal value function.
A candidate optimal value function is [432],
$$
V^*(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') V^*(s') \right], \quad s \in S \qquad (6.4)
$$
where A is the set of all possible actions, S is the set of environmental states, R(s, a)
is the reward function, and T(s, a, s') is the transition function. Equation (6.4) states
that the value of a state, s, is the expected instantaneous reward, R(s, a), for action
a plus the expected discounted value of the next state, using the best possible action.
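Equation (6.4) can be solved by repeatedly applying its right-hand side as an
update rule until the values converge, a procedure known as value iteration. The
sketch below is a minimal tabular version; the dictionary encoding of T and R,
the two-state example, and the convergence threshold are assumptions made for
illustration:

```python
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    """Apply the update in equation (6.4) until the value function converges.

    T[(s, a, s2)] is the probability of moving from s to s2 under action a;
    R[(s, a)] is the expected instantaneous reward. Missing entries count as 0.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Best achievable one-step reward plus discounted next-state value.
            v_new = max(
                R.get((s, a), 0.0)
                + gamma * sum(T.get((s, a, s2), 0.0) * V[s2] for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Hypothetical two-state example: 'go' switches state, 'stay' does not,
# and reward is earned only by staying in s1.
states, actions = ["s0", "s1"], ["stay", "go"]
T = {("s0", "stay", "s0"): 1.0, ("s0", "go", "s1"): 1.0,
     ("s1", "stay", "s1"): 1.0, ("s1", "go", "s0"): 1.0}
R = {("s1", "stay"): 1.0}
print(value_iteration(states, actions, T, R))  # V(s1) ≈ 10, V(s0) ≈ 9
```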
Solving equation (6.4) in this way requires a clear definition of the model in terms of the
transition function, T, and the reward function, R. A number of algorithms have been
developed for such model-based RL problems. The reader is referred to [432, 824] for a
summary of these methods. Of more interest to this chapter are model-free learning
methods, as described in the next section.