mlpy.planners.explorers.discrete.SoftmaxExplorer

class mlpy.planners.explorers.discrete.SoftmaxExplorer(tau=None, decay=None)

Bases: mlpy.planners.explorers.discrete.DiscreteExplorer

The softmax explorer.

The softmax explorer varies the action probability as a graded function of estimated value. The greedy action is still given the highest selection probability, but all the others are ranked and weighted according to their value estimates.

Parameters:

tau : float, optional

The temperature $\tau$. Default is 2.0.

decay : float, optional

The factor by which the temperature $\tau$ decays over time. This value should be between 0 and 1. Set it to 1 if $\tau$ should remain the same throughout the experiment. Default is 1.
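To illustrate how `tau` and `decay` might interact, here is a minimal sketch. It assumes the decay is multiplicative and applied once per step; the documentation above does not say when the update happens. The helper name `decayed_temperatures` is hypothetical and not part of mlpy.

```python
def decayed_temperatures(tau, decay, steps):
    """Return the temperature used at each of `steps` consecutive steps.

    Assumption: the temperature is multiplied by `decay` after every step.
    This is an illustrative helper, not mlpy's implementation.
    """
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay should be in (0, 1]")

    schedule = []
    for _ in range(steps):
        schedule.append(tau)  # temperature in effect at this step
        tau *= decay          # shrink the temperature for the next step
    return schedule


# decay = 1 keeps the temperature constant throughout
constant = decayed_temperatures(2.0, 1.0, 3)
# decay = 0.5 halves the temperature after every step
halving = decayed_temperatures(2.0, 0.5, 3)
```

Under this assumption, a `decay` below 1 moves the explorer gradually from exploratory toward greedy behaviour.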

Notes

The softmax function implemented uses the Gibbs distribution. It chooses action a on the t-th play with probability:

$\frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^ne^{Q_t(b)/\tau}}$

where $n$ is the number of available actions and $\tau$ is a positive parameter called the temperature. High temperatures cause all actions to be nearly equiprobable. Low temperatures cause a greater difference in selection probability between actions whose value estimates differ. As $\tau$ approaches zero, action selection becomes the same as greedy.
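The Gibbs distribution above can be sketched in plain Python as follows. This is an illustrative stand-alone version, not mlpy's implementation: the function names `softmax_probabilities` and `choose_softmax` are hypothetical, and `qvalues` is assumed to be a list of value estimates aligned with `actions`.

```python
import math
import random


def softmax_probabilities(qvalues, tau):
    """Return the Gibbs selection probability of each value estimate."""
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    # Subtracting the maximum leaves the probabilities unchanged
    # but avoids overflow in exp() for small tau.
    q_max = max(qvalues)
    weights = [math.exp((q - q_max) / tau) for q in qvalues]
    total = sum(weights)
    return [w / total for w in weights]


def choose_softmax(actions, qvalues, tau, rng=random):
    """Sample one action according to its softmax probability."""
    probs = softmax_probabilities(qvalues, tau)
    return rng.choices(actions, weights=probs, k=1)[0]


qvalues = [1.0, 2.0, 3.0]
hot = softmax_probabilities(qvalues, 1000.0)  # close to uniform
cold = softmax_probabilities(qvalues, 0.01)   # almost all mass on the greedy action
```

With a high temperature the three probabilities are nearly equal, while with a low temperature the last action (the one with the highest estimate) is selected almost always.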

Methods

`activate`()	Turn on exploration mode.
`choose_action`(actions, qvalues)	Choose the next action.
`deactivate`()	Turn off exploration mode.