mlpy.learners.offline.irl.ApprenticeshipLearner

class mlpy.learners.offline.irl.ApprenticeshipLearner(obs, planner, method=None, max_iter=None, thresh=None, gamma=None, nsamples=None, max_steps=None, filename=None, **kwargs)[source]

Bases: mlpy.learners.offline.IOfflineLearner

The apprenticeship learner.

The apprenticeship learner is an inverse reinforcement learner based on the method introduced by Abbeel and Ng [R2], which strives to imitate the demonstrations given by an expert.

Parameters:

obs : array_like, shape (n, nfeatures, ni)

List of trajectories provided by the demonstrator, which the learner tries to emulate, where n is the number of sequences, ni is the length of the i-th demonstration, and each demonstration has nfeatures features.

planner : IPlanner

The planner to use to determine the best action.

method : {‘projection’, ‘maxmargin’}, optional

The IRL method to employ. Default is ‘projection’.

max_iter : int, optional

The maximum number of iterations after which learning is terminated; at that point it is assumed that a policy close enough to the expert’s demonstrations has been found. Default is inf.

thresh : float, optional

Learning is considered to have converged to the demonstrations once this threshold has been reached. Default is eps.

gamma : float, optional

The discount factor. Default is 0.9.

nsamples : int, optional

The number of samples taken during Monte Carlo sampling. Default is 100.

max_steps : int, optional

The maximum number of steps in an iteration during Monte Carlo sampling (the feature-expectation estimate this sampling feeds is sketched after the parameter list). Default is 100.

filename : str, optional

The name of the file to save the learner state to after each iteration. If None is given, the learner state is not saved. Default is None.

Other Parameters:
 

mix_policies : bool

Whether to create a new policy by mixing the policies seen so far or by taking the best-valued action. Default is False.

rescale : bool

If set to True, the feature expectations are rescaled to be between 0 and 1. Default is False.

visualize : bool

Visualize each iteration of the IRL step if set to True. Default is False.
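
Both the expert trajectories in obs and the rollouts drawn during Monte Carlo sampling (controlled by nsamples and max_steps) are reduced to discounted feature expectations of the form mu = (1/n) * sum_i sum_t gamma^t * phi(s_t), as in [R2]. A minimal sketch of that estimate, assuming each trajectory is an (nfeatures, length) array matching the obs layout; the function name is illustrative and not part of the class API:

    # Illustrative sketch only: discounted feature-expectation estimate from [R2].
    # Each trajectory is assumed to be an (nfeatures, length) array, matching obs.
    import numpy as np

    def feature_expectations(trajectories, gamma=0.9):
        """Average discounted feature counts over a list of trajectories."""
        mu = None
        for traj in trajectories:
            discounts = gamma ** np.arange(traj.shape[1])  # gamma^0, gamma^1, ...
            total = traj @ discounts                       # sum_t gamma^t * phi(s_t)
            mu = total if mu is None else mu + total
        return mu / len(trajectories)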

Notes

The ‘maxmargin’ method uses a QP solver to solve the following optimization problem:

\begin{aligned}
& \underset{t, w}{\text{maximize}} & & t \\
& \text{subject to} & & w^T \mu_E > w^T \mu^{(j)} + t, j=0, \ldots, i-1 \\
& & & ||w||_2 \le 1.
\end{aligned}
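
Inside the class this program is handed to CPLEX (see the note below). Purely as an illustration, the same max-margin problem can be sketched with SciPy’s SLSQP solver, assuming the feature expectations are available as NumPy arrays; the function name is an assumption and not part of the class API:

    # Illustrative sketch only: maximize t subject to w^T mu_E >= w^T mu_j + t
    # and ||w||_2 <= 1, with feature expectations given as NumPy arrays.
    import numpy as np
    from scipy.optimize import minimize

    def max_margin(mu_E, mu_list):
        """Return (w, t) separating the expert from all previously found policies."""
        d = mu_E.shape[0]

        def objective(x):              # x = [w_1, ..., w_d, t]
            return -x[-1]              # maximize t == minimize -t

        constraints = [
            # w^T mu_E - w^T mu_j - t >= 0 for every earlier policy j
            {"type": "ineq", "fun": lambda x, mu=mu: x[:d] @ (mu_E - mu) - x[-1]}
            for mu in mu_list
        ]
        # ||w||_2 <= 1, written as 1 - ||w||_2^2 >= 0
        constraints.append({"type": "ineq", "fun": lambda x: 1.0 - x[:d] @ x[:d]})

        res = minimize(objective, np.zeros(d + 1), method="SLSQP",
                       constraints=constraints)
        return res.x[:d], res.x[-1]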

Mixing of policies is realized by solving the quadratic program:

\begin{aligned}
& \text{minimize} & &  ||\mu_E - \mu||_2 \\
& \text{subject to} & & \mu = \sum_i (\lambda_i \mu^{(i)}) \\
& & & \lambda_i \ge 0 \\
& & & \sum_i \lambda_i = 1
\end{aligned}

The QP solver used in the implementation is the IBM ILOG CPLEX Optimizer, which requires a separate license. If you are unable to obtain a license, the ‘projection’ method can be used instead.
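
For reference, one iteration of the ‘projection’ method as described in [R2] only requires the geometric update sketched below; no QP solver is involved. The function name and argument layout are illustrative and not part of the class API:

    # Illustrative sketch only: project mu_E onto the line through mu_bar (the
    # running projection) and mu (feature expectations of the latest policy).
    import numpy as np

    def projection_step(mu_E, mu, mu_bar):
        """Return the updated projection, the new reward weights w and margin t."""
        delta = mu - mu_bar
        mu_bar_new = mu_bar + (delta @ (mu_E - mu_bar)) / (delta @ delta) * delta
        w = mu_E - mu_bar_new
        t = np.linalg.norm(w)
        return mu_bar_new, w, t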

References

[R2] Abbeel, Pieter, and Andrew Y. Ng. “Apprenticeship learning via inverse reinforcement learning.” Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 2004.

Attributes

mid The module’s unique identifier.
type

Methods

choose_action(state) Choose the next action.
end() End the episode.
init() Initialize the apprenticeship learner.
learn() Learn the optimal policy via apprenticeship learning.
load(filename) Load the state of the module from file.
save(filename) Save the current state of the module to file.
start() Start an episode.
step(experience) Execute learning specific updates.
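
A hedged usage sketch based on the constructor signature documented above. The planner instance and the data file are placeholders: constructing an IPlanner is outside the scope of this page, and the file name is purely hypothetical:

    import numpy as np
    from mlpy.learners.offline.irl import ApprenticeshipLearner

    # demos: expert trajectories shaped (n, nfeatures, ni); the file is hypothetical.
    demos = np.load("expert_trajectories.npy")

    # planner: an IPlanner instance created elsewhere (not shown on this page).
    learner = ApprenticeshipLearner(demos, planner, method="projection",
                                    max_iter=50, thresh=1e-6, gamma=0.9,
                                    nsamples=100, max_steps=100)
    learner.learn()   # run apprenticeship learning until convergence or max_iter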