mlpy.planners.explorers.discrete.SoftmaxExplorer

class mlpy.planners.explorers.discrete.SoftmaxExplorer(tau=None, decay=None)

Bases: mlpy.planners.explorers.discrete.DiscreteExplorer

The softmax explorer.

The softmax explorer varies the action probability as a graded function of estimated value. The greedy action is still given the highest selection probability, but all the others are ranked and weighted according to their value estimates.

Parameters:

tau : float, optional

The temperature value. Default is 2.0.

decay : float, optional

The factor by which \tau decays. This value should be between 0 and 1. The temperature \tau decreases over time by a factor of decay. Set this value to 1 if \tau should remain the same throughout the experiment. Default is 1.
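The effect of decay on the temperature can be illustrated with a small standalone sketch. It does not use mlpy. The helper name decayed_temperatures is hypothetical, and it assumes the decay is applied multiplicatively once per step, which this page does not spell out.

```python
def decayed_temperatures(tau, decay, steps):
    """Return the temperature used at each of `steps` consecutive steps.

    Assumption (not confirmed by the mlpy documentation): after every step
    the temperature is multiplied by `decay`.
    """
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay is expected to lie in (0, 1]")
    temperatures = []
    for _ in range(steps):
        temperatures.append(tau)  # temperature in effect for this step
        tau *= decay              # shrink it for the next step
    return temperatures


if __name__ == "__main__":
    # With decay < 1 the temperature shrinks geometrically.
    print(decayed_temperatures(2.0, 0.5, 4))
    # With decay == 1 the temperature stays constant.
    print(decayed_temperatures(2.0, 1.0, 4))
```

Under this assumption, a decay below 1 moves the explorer gradually from exploratory to near-greedy behaviour, while a decay of 1 keeps the amount of exploration fixed.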

Notes

The softmax function implemented uses the Gibbs distribution. It chooses action a on the t-th play with probability:

\frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^ne^{Q_t(b)/\tau}}

where \tau is a positive parameter called the temperature and n is the number of available actions. High temperatures cause all actions to be nearly equiprobable. Low temperatures cause a greater difference in selection probability between actions that differ in their value estimates. As \tau approaches zero, softmax action selection becomes the same as greedy action selection.
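The Gibbs distribution above can be sketched in plain Python to show how the temperature shapes the selection probabilities. This is an illustration only, not the mlpy implementation. The function names softmax_probabilities and sample_action are hypothetical, and Q-values are assumed to be passed as a list aligned with the list of actions.

```python
import math
import random


def softmax_probabilities(qvalues, tau):
    """Return the Gibbs (softmax) selection probability of each action."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Subtracting the maximum leaves the probabilities unchanged but keeps
    # exp() from overflowing when tau is small.
    q_max = max(qvalues)
    weights = [math.exp((q - q_max) / tau) for q in qvalues]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(actions, qvalues, tau, rng=random):
    """Draw one action according to its softmax probability."""
    probabilities = softmax_probabilities(qvalues, tau)
    return rng.choices(actions, weights=probabilities, k=1)[0]


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0]
    for tau in (100.0, 2.0, 0.01):
        print(tau, softmax_probabilities(q, tau))
    print(sample_action(["left", "stay", "right"], q, tau=2.0))
```

With a large \tau the printed probabilities are close to uniform. With a very small \tau almost all of the probability mass falls on the action with the highest value estimate, which matches greedy selection.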

Methods

activate() Turn on exploration mode.
choose_action(actions, qvalues) Choose the next action.
deactivate() Turn off exploration mode.