mlpy.learners.offline.irl.ApprenticeshipLearner.learn

ApprenticeshipLearner.learn()

Learn the optimal policy via apprenticeship learning.

The apprenticeship learning algorithm for finding a policy \tilde{\pi} that induces feature expectations \mu(\tilde{\pi}) close to \mu_E is as follows (a code sketch of the loop follows the steps):

  1. Randomly pick some policy \pi^{(0)}, compute (or approximate via Monte Carlo) \mu^{(0)} = \mu(\pi^{(0)}), and set i=1.
  2. Compute t^{(i)} = \underset{w:\|w\|_2 \le 1}{\text{max}}\underset{j \in \{0, \ldots, (i-1)\}}{\text{min}} w^T(\mu_E - \mu^{(j)}), and let w^{(i)} be the value of w that attains this maximum. This can be achieved either by the max-margin method or by the projection method.
  3. If t^{(i)} \le \epsilon, then terminate.
  4. Using the RL algorithm, compute the optimal policy \pi^{(i)} for the MDP with reward R = (w^{(i)})^T \phi.
  5. Compute (or estimate) \mu^{(i)} = \mu(\pi^{(i)}).
  6. Set i = i + 1, and go back to step 2.
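
For intuition, here is a minimal NumPy sketch of this loop using the projection method for step 2. The helpers estimate_mu (Monte Carlo estimation of \mu(\pi)), solve_mdp (the underlying RL solver), and phi (the feature map) are hypothetical stand-ins for illustration, not part of the mlpy API:

    import numpy as np

    def apprenticeship_learn(mu_E, pi_0, estimate_mu, solve_mdp, phi,
                             epsilon=1e-3, max_iter=100):
        # Step 1: start from the random policy pi^(0) and estimate mu^(0).
        policy = pi_0
        mu = estimate_mu(policy)
        mu_bar = mu  # running projection point mu_bar^(i-1)

        for _ in range(max_iter):
            # Step 2 (projection method): w^(i) = mu_E - mu_bar^(i-1),
            # with margin t^(i) = ||w^(i)||_2.
            w = mu_E - mu_bar
            t = np.linalg.norm(w)

            # Step 3: terminate once the feature expectations are close.
            if t <= epsilon:
                break

            # Step 4: solve the MDP under the reward R(s) = (w^(i))^T phi(s).
            # (w is bound via a default argument so the reward is frozen.)
            policy = solve_mdp(lambda s, w=w: w @ phi(s))

            # Step 5: estimate mu^(i) for the newly computed policy.
            mu = estimate_mu(policy)

            # Projection update: orthogonally project mu_E onto the segment
            # between mu_bar^(i-1) and mu^(i) to obtain mu_bar^(i).
            d = mu - mu_bar
            mu_bar = mu_bar + ((d @ (mu_E - mu_bar)) / (d @ d)) * d
            # Step 6: i = i + 1 is implicit in the loop.

        return policy

The projection method replaces the max-margin optimization in step 2 with the closed-form update shown above, which avoids invoking a quadratic-program solver on each iteration.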