Solving Reinforcement Learning with Probabilistic Inference: Pyro Colab Code

2022-11-22

2018: Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Sergey Levine, UC Berkeley

Abstract: The framework of reinforcement learning or optimal control provides a mathematical formalization of intelligent decision making that is powerful and broadly applicable. While the general form of the reinforcement learning problem enables effective reasoning about uncertainty, the connection between reinforcement learning and inference in probabilistic models is not immediately obvious. However, such a connection has considerable value when it comes to algorithm design: formalizing a problem as probabilistic inference in principle allows us to bring to bear a wide array of approximate inference tools, to extend the model in flexible and powerful ways, and to reason about compositionality and partial observability. In this article, we will discuss how a generalization of the reinforcement learning or optimal control problem (sometimes termed maximum entropy reinforcement learning) is equivalent to exact probabilistic inference in the case of deterministic dynamics, and to variational inference in the case of stochastic dynamics. We will present a detailed derivation of this framework, overview prior work that has drawn on this and related ideas to propose new reinforcement learning and control algorithms, and describe perspectives on future research.
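For orientation, the central construction in this tutorial (restated here as a brief sketch, not a substitute for the paper's full derivation) attaches a binary "optimality" variable to each timestep whose likelihood is the exponentiated reward; conditioning on optimality at every timestep turns control into inference, and the resulting objective is the maximum-entropy RL objective:

$$
p(\mathcal{O}_t = 1 \mid s_t, a_t) \propto \exp\big(r(s_t, a_t)\big),
\qquad
J(\pi) \;=\; \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim \pi}\Big[ r(s_t, a_t) + \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big].
$$

As the abstract states, exact inference in this graphical model corresponds to the deterministic-dynamics case, while with stochastic dynamics the same objective is recovered as a variational bound.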

2019: VIREL: A Variational Inference Framework for Reinforcement Learning

Abstract

Applying probabilistic models to reinforcement learning (RL) enables the use of powerful optimisation tools such as variational inference in RL. However, existing inference frameworks and their algorithms pose significant challenges for learning optimal policies, for example, the lack of mode-capturing behaviour in pseudo-likelihood methods, difficulties learning deterministic policies in maximum entropy RL based approaches, and a lack of analysis when function approximators are used. We propose VIREL, a theoretically grounded inference framework for RL that utilises a parametrised action-value function to summarise future dynamics of the underlying MDP, generalising existing approaches. VIREL also benefits from a mode-seeking form of KL divergence, the ability to learn deterministic optimal policies naturally from inference, and the ability to optimise value functions and policies in separate, iterative steps. Applying variational expectation-maximisation to VIREL, we show that the actor-critic algorithm can be reduced to expectation-maximisation, with policy improvement equivalent to an E-step and policy evaluation to an M-step. We derive a family of actor-critic methods from VIREL, including a scheme for adaptive exploration, and demonstrate that our algorithms outperform state-of-the-art methods based on soft value functions in several domains.
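Schematically (a paraphrase of the abstract rather than the paper's exact objective, which additionally uses an adaptive temperature $\varepsilon$), the variational EM view alternates two steps: the E-step improves the policy against the current critic under an entropy bonus, and the M-step re-fits the critic to the improved policy:

$$
\text{E-step (policy improvement):}\quad
\pi \leftarrow \arg\max_{\pi}\; \mathbb{E}_{s}\Big[\mathbb{E}_{a\sim\pi(\cdot\mid s)}\big[Q_\omega(s,a)\big] + \varepsilon\,\mathcal{H}\big(\pi(\cdot\mid s)\big)\Big],
$$
$$
\text{M-step (policy evaluation):}\quad
\omega \leftarrow \arg\min_{\omega}\; \mathbb{E}\Big[\big(Q_\omega(s,a) - \hat{Q}^{\pi}(s,a)\big)^2\Big],
$$

where $\hat{Q}^{\pi}$ denotes a bootstrapped evaluation target for the current policy.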

2021: Empirical Evaluation: Variational Inference Reinforcement Learning in Pyro

Abstract

The experimental results validated the theoretical connection between MERL and VI. More importantly, they confirmed that

(1) Pyro is expressive enough to implement policy-based RL algorithms (see the sketch below),

(2) the performance of the Pyro versions of the algorithms is satisfactory, and

(3) modeling and training are better decoupled when using Pyro.

Code available on GitHub: https://github.com/ljlin/rl.Pyro

Available on Colab: https://drive.google.com/drive/folders/1PE5lqtHGvrnrFMDpe3vxbYdCWJAJhlh6
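To make claim (1) concrete, here is a minimal, self-contained sketch, not taken from the linked repository; the bandit rewards, temperature, and parameter names are illustrative assumptions. It expresses the MERL-as-VI idea in Pyro: the model places a uniform prior over actions and scores them with an exponentiated-reward factor, the guide is the learned policy, and SVI on the ELBO therefore maximises expected reward plus policy entropy, so the policy should converge to a softmax over the rewards.

```python
# Minimal sketch: maximum-entropy action selection for a one-step bandit,
# phrased as variational inference in Pyro. Rewards, temperature, and the
# parameter name "policy_logits" are hypothetical, for illustration only.
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

rewards = torch.tensor([1.0, 3.0, 0.5])   # known rewards of a 3-armed bandit
temperature = 1.0                          # max-ent temperature

def model():
    # Uniform prior over actions; the "optimality" factor exp(r / temperature)
    # plays the role of p(O = 1 | a) in the control-as-inference formulation.
    a = pyro.sample("a", dist.Categorical(logits=torch.zeros(3)))
    pyro.factor("optimality", rewards[a] / temperature)

def guide():
    # The variational posterior over actions is the learned (soft) policy.
    logits = pyro.param("policy_logits", torch.zeros(3))
    pyro.sample("a", dist.Categorical(logits=logits))

svi = SVI(model, guide, Adam({"lr": 0.05}), loss=Trace_ELBO())
for step in range(2000):
    svi.step()

# The learned policy should approach softmax(rewards / temperature).
print(torch.softmax(pyro.param("policy_logits"), dim=-1))
```

Because the latent action is discrete, Trace_ELBO falls back to a score-function (REINFORCE-style) gradient estimator, which mirrors the estimators used by policy-based RL algorithms; training is therefore noisy but converges easily on a problem this small.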

