mlpy.learners.offline.irl.ApprenticeshipLearner.learn

ApprenticeshipLearner.learn()

Learn the optimal policy via apprenticeship learning.

The apprenticeship learning algorithm for finding a policy \tilde{\pi} that induces feature expectations \mu(\tilde{\pi}) close to \mu_E is as follows (a code sketch of the loop follows the steps):

  1. Randomly pick some policy \pi^{(0)}, compute (or approximate via Monte Carlo) \mu^{(0)} = \mu(\pi^{(0)}), and set i=1.
  2. Compute t^{(i)} = \underset{w:\|w\|_2 \le 1}{\text{max}}\underset{j \in \{0, \ldots, (i-1)\}}{\text{min}} w^T(\mu_E - \mu^{(j)}), and let w^{(i)} be the value of w that attains this maximum. This can be achieved either by the max-margin method or by the projection method.
  3. If t^{(i)} \le \epsilon, then terminate.
  4. Using the RL algorithm, compute the optimal policy \pi^{(i)} for the MDP with reward R = (w^{(i)})^T \phi.
  5. Compute (or estimate) \mu^{(i)} = \mu(\pi^{(i)}).
  6. Set i = i + 1, and go back to step 2.
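
For intuition, here is a minimal NumPy sketch of this loop using the projection method for step 2. The helpers estimate_mu (Monte Carlo estimation of \mu(\pi)), solve_mdp (the underlying RL solver), and phi (the feature map) are hypothetical stand-ins for illustration, not part of the mlpy API:

    import numpy as np

    def apprenticeship_learn(mu_E, pi_0, estimate_mu, solve_mdp, phi,
                             epsilon=1e-3, max_iter=100):
        # Step 1: start from the random policy pi^(0) and estimate mu^(0).
        policy = pi_0
        mu = estimate_mu(policy)
        mu_bar = mu  # running projection point mu_bar^(i-1)

        for _ in range(max_iter):
            # Step 2 (projection method): w^(i) = mu_E - mu_bar^(i-1),
            # with margin t^(i) = ||w^(i)||_2.
            w = mu_E - mu_bar
            t = np.linalg.norm(w)

            # Step 3: terminate once the feature expectations are close.
            if t <= epsilon:
                break

            # Step 4: solve the MDP under the reward R(s) = (w^(i))^T phi(s).
            # (w is bound via a default argument so the reward is frozen.)
            policy = solve_mdp(lambda s, w=w: w @ phi(s))

            # Step 5: estimate mu^(i) for the newly computed policy.
            mu = estimate_mu(policy)

            # Projection update: orthogonally project mu_E onto the segment
            # between mu_bar^(i-1) and mu^(i) to obtain mu_bar^(i).
            d = mu - mu_bar
            mu_bar = mu_bar + ((d @ (mu_E - mu_bar)) / (d @ d)) * d
            # Step 6: i = i + 1 is implicit in the loop.

        return policy

The projection method replaces the max-margin optimization in step 2 with the closed-form update shown above, which avoids invoking a quadratic-program solver on each iteration.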