mlpy.learners.offline.irl.ApprenticeshipLearner

class mlpy.learners.offline.irl.ApprenticeshipLearner(obs, planner, method=None, max_iter=None, thresh=None, gamma=None, nsamples=None, max_steps=None, filename=None, **kwargs)[source]

Bases: mlpy.learners.offline.IOfflineLearner

The apprenticeship learner.

The apprenticeship learner is an inverse reinforcement learning method introduced by Abbeel and Ng [R2] that strives to imitate the demonstrations given by an expert.

Parameters:

obs : array_like, shape (n, nfeatures, ni)

List of trajectories provided by the demonstrator, which the learner is trying to emulate, where n is the number of sequences, ni is the length of the i-th demonstration, and each demonstration has nfeatures features.

planner : IPlanner

The planner to use to determine the best action.

method : {‘projection’, ‘maxmargin’}, optional

The IRL method to employ. Default is projection.

max_iter : int, optional

The maximum number of iterations after which learning is terminated; it is then assumed that a policy close enough to the expert's demonstrations has been found. Default is inf.

thresh : float, optional

Learning is considered to have converged to the demonstrations once this threshold has been reached. Default is eps.

gamma : float, optional

The discount factor. Default is 0.9.

nsamples : int, optional

The number of samples taken during Monte Carlo sampling. Default is 100.

max_steps : int, optional

The maximum number of steps per iteration during Monte Carlo sampling (see the feature-expectation sketch after this parameter list). Default is 100.

filename : str, optional

The name of the file to save the learner state to after each iteration. If None is given, the learner state is not saved. Default is None.
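The gamma, nsamples, and max_steps parameters control the Monte Carlo estimate of the discounted feature expectations mu = E[sum_t gamma^t phi(s_t)]. A minimal sketch of such an estimate, where sample_trajectory (a policy rollout) and phi (the state feature map) are hypothetical stand-ins and not part of the mlpy API:

    import numpy as np

    def estimate_feature_expectations(sample_trajectory, phi, nfeatures,
                                      gamma=0.9, nsamples=100, max_steps=100):
        # Monte Carlo estimate of mu = E[sum_t gamma^t * phi(s_t)].
        # sample_trajectory and phi are hypothetical helpers standing in for
        # rolling out the current policy and computing state features.
        mu = np.zeros(nfeatures)
        for _ in range(nsamples):
            for t, state in enumerate(sample_trajectory(max_steps)):
                mu += gamma ** t * phi(state)
        return mu / nsamples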

Other Parameters:
 

mix_policies : bool

Whether to create a new policy by mixing the policies seen so far or by considering the best-valued action. Default is False.

rescale : bool

If set to True, the feature expectations are rescaled to be between 0 and 1. Default is False.

visualize : bool

Visualize each iteration of the IRL step if set to True. Default is False.
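A minimal usage sketch. The demonstrations array follows the shape described under obs; my_planner stands in for any concrete IPlanner implementation, and demonstrations.npy is a hypothetical data file, neither being part of this documentation:

    import numpy as np
    from mlpy.learners.offline.irl import ApprenticeshipLearner

    # n expert trajectories with nfeatures features per step (see `obs` above).
    obs = np.load("demonstrations.npy")   # hypothetical expert data

    learner = ApprenticeshipLearner(obs, my_planner, method="projection",
                                    max_iter=50, gamma=0.9, nsamples=100,
                                    max_steps=100, filename="learner_state")
    learner.learn()   # iterate until convergence or max_iter is reached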

Notes

The maxmargin method uses a QP solver to solve the following optimization problem:

\begin{aligned}
& \underset{t, w}{\text{maximize}} & & t \\
& \text{subject to} & & w^T \mu_E \ge w^T \mu^{(j)} + t, \quad j=0, \ldots, i-1 \\
& & & ||w||_2 \le 1.
\end{aligned}
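mlpy solves this program with a commercial QP solver (see the licensing note below). Purely as an illustration, and not the library's implementation, the same program can be written with cvxpy, assuming mu_E and the mu^{(j)} are available as NumPy arrays:

    import cvxpy as cp

    def max_margin(mu_E, mus):
        # max t  s.t.  w' mu_E >= w' mu_j + t for all j,  ||w||_2 <= 1
        w = cp.Variable(mu_E.shape[0])
        t = cp.Variable()
        constraints = [w @ mu_E >= w @ mu_j + t for mu_j in mus]
        constraints.append(cp.norm(w, 2) <= 1)
        cp.Problem(cp.Maximize(t), constraints).solve()
        return w.value, t.value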

Policy mixing (see the mix_policies parameter) is realized by solving the quadratic program:

\begin{aligned}
& \text{minimize} & &  ||\mu_E - \mu||_2 \\
& \text{subject to} & & \mu = \sum_i (\lambda_i \mu^{(i)}) \\
& & & \lambda_i \ge 0 \\
& & & \sum_i \lambda_i = 1
\end{aligned}
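Again for illustration only (the library itself uses CPLEX), the mixing weights lambda can be computed with cvxpy, where mus is assumed to be a list of per-policy feature-expectation vectors:

    import numpy as np
    import cvxpy as cp

    def mixing_weights(mu_E, mus):
        # min ||mu_E - sum_i lambda_i mu_i||_2  s.t.  lambda_i >= 0, sum_i lambda_i = 1
        M = np.column_stack(mus)          # one column per policy
        lam = cp.Variable(M.shape[1])
        problem = cp.Problem(cp.Minimize(cp.norm(mu_E - M @ lam, 2)),
                             [lam >= 0, cp.sum(lam) == 1])
        problem.solve()
        return lam.value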

The QP solver used for the implementation is the IBM ILOG CPLEX Optimizer, which requires a separate license. If you are unable to obtain a license, the ‘projection’ method can be used instead.
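The projection method needs no QP solver. As described in [R2], each iteration orthogonally projects mu_E onto the line through the previous projection mu_bar^{(i-1)} and the newest feature expectations mu^{(i)}; a sketch of that update:

    import numpy as np

    def projection_step(mu_E, mu_bar_prev, mu_i):
        d = mu_i - mu_bar_prev
        # Orthogonal projection of mu_E onto the line through mu_bar_prev and mu_i.
        mu_bar = mu_bar_prev + ((d @ (mu_E - mu_bar_prev)) / (d @ d)) * d
        w = mu_E - mu_bar                   # new reward weights
        t = np.linalg.norm(w, 2)            # current margin estimate
        return mu_bar, w, t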

References

[R2] Abbeel, Pieter, and Andrew Y. Ng. “Apprenticeship learning via inverse reinforcement learning.” Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 2004.

Attributes

mid The module’s unique identifier.
type This learner is of type offline.

Methods

choose_action(state) Choose the next action.
execute(experience) Execute learning-specific updates.
learn() Learn the optimal policy via apprenticeship learning.
load(filename) Load the state of the module from file.
reset(t, **kwargs) Reset the reinforcement learner.
save(filename) Save the current state of the module to file.
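A short persistence sketch using save and load; the file name is hypothetical, and calling load on an existing instance is an assumption based on the signature above:

    learner.save("apprentice_state")    # checkpoint the learner state
    # ... later, restore the saved state into a compatible instance:
    learner.load("apprentice_state")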