Pseudo Dyna-Q: A Reinforcement Learning Framework for Interactive Recommendation
ABSTRACT
Applying reinforcement learning (RL) in recommender systems is attractive but costly because of the constraint of interacting with real customers: performing online policy learning through interaction with real customers usually harms the customer experience.
RL for interactive recommendation is very appealing, but online learning hurts the user experience (RL gets stronger through trial and error, so at the start it really does recommend more or less anything).
A practical alternative is to build a recommender agent offline from logged data; however, directly using the logged data offline leads to the problem of selection bias between the logging policy and the recommendation policy.
One alternative is to train on offline logged data, but that introduces a bias problem.
The existing direct offline learning algorithms, such as Monte Carlo (MC) methods and temporal difference (TD) methods, are either computationally expensive or unstable in convergence.
Existing MC and TD methods are either computationally expensive or wildly unstable.
To address these issues, we propose Pseudo Dyna-Q (PDQ). In PDQ, instead of interacting with real customers, we resort to a customer simulator, referred to as the World Model, which is designed to simulate the environment and handle the selection bias of logged data.
To solve this problem, here we come: we build a customer simulator to mimic the environment, and use importance sampling to handle the selection bias in the logged data.
During policy improvement, the World Model is constantly updated and optimized adaptively, according to the current recommendation policy.
During policy improvement, the world model keeps being updated.
This way, the proposed PDQ not only avoids the instability of convergence and high computation cost of existing approaches but also provides unlimited interactions without involving real customers.
This way we get unlimited interactions, which addresses the convergence instability and the high computation cost.
Moreover, a proven upper bound on the empirical error of the reward function guarantees that the learned offline policy has lower bias and variance.
The upper bound on the empirical error means low bias and variance.
Extensive experiments demonstrated the advantages of PDQ on two real-world datasets against state-of-the-art methods.
Experiments show that we are pretty great.
Now let's look at the contributions (the usual bragging session).
We present Pseudo Dyna-Q (PDQ) for interactive recommendation, which provides a general framework that can be instantiated in different neural architectures, and tailored for specific recommendation tasks.
The proposed framework is quite extensible (pretty nice).
We conduct a general error analysis for the world model and show the connection between the error and the discrepancy between the recommendation policy and the logging policy.
We analyze our world model through a series of experiments (it's good, of course; otherwise they wouldn't say so).
We implement a simple instantiation of PDQ, and demonstrate its effectiveness on two real-world large-scale datasets, showing superior performance over the state-of-the-art methods in interactive recommendation.
Experiments show that we are great (super great).
POLICY LEARNING FOR RECOMMENDER VIA PSEUDO DYNA-Q
The proposed PDQ recommender agent is shown in Figure 1. It consists of two modules:
A world model for generating simulated customers' feedback, which should be similar to the feedback generated by a real customer according to the historical logged data.
Put plainly, this model-based component is the same simulator we see in other RL-based recommender systems.
A recommendation policy which selects the next item to recommend based on the current state. It is learned to maximize the cumulative reward, such as total clicks in a session.
The goal of the recommender is to maximize the overall cumulative reward (a minimal interface sketch of both modules follows below).
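To make the two modules a bit more concrete, here is a minimal interface sketch in plain Python; the class names, the logistic reward model, and the epsilon-greedy item selection are my own illustrative assumptions, not the architecture used in the paper.

```python
import numpy as np

class WorldModel:
    """Illustrative simulator: predicts a customer's feedback (e.g. a click
    probability) for a recommended item. In PDQ it is fitted offline from
    logged data; the fitting step is sketched in the training-loop example below."""
    def __init__(self, n_items, state_dim):
        self.w = np.zeros((n_items, state_dim))   # one linear reward model per item

    def predict_reward(self, state, item):
        return 1.0 / (1.0 + np.exp(-self.w[item] @ state))   # simulated feedback

class RecommendationPolicy:
    """Illustrative epsilon-greedy policy over linear per-item Q-values,
    trained to maximize the cumulative (session-level) reward."""
    def __init__(self, n_items, state_dim, eps=0.1):
        self.q = np.zeros((n_items, state_dim))
        self.eps = eps

    def recommend(self, state):
        if np.random.rand() < self.eps:               # explore
            return np.random.randint(self.q.shape[0])
        return int(np.argmax(self.q @ state))         # exploit: highest-Q item
```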
The recommender policy and the world model are co-trained in an iterative way in PDQ. In each iteration, once the current recommender policy is set, the world model will be updated accordingly to support it. In turn, the new information gained from the updated world model will further improve the recommendation policy through planning. This way, the recommendation policy is iteratively improved with an evolving world model.
The recommendation policy and the world model complement each other and improve together.
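A minimal, self-contained sketch of that iteration with simple linear models and a toy one-step temporal-difference planning update on pseudo rewards; the function name and all hyperparameters are my own assumptions, and the paper's actual algorithm differs in its details.

```python
import numpy as np

def pdq_iteration(logged_data, reward_w, q_w, gamma=0.9, lr=0.05, planning_steps=200):
    """One illustrative PDQ-style iteration:
    (1) refit the world model's reward estimate on the logged data,
    (2) improve the Q-function by planning against simulated (pseudo) rewards.
    `logged_data` is a list of (state, item, reward) tuples; `reward_w` and
    `q_w` are (n_items, state_dim) weight matrices."""
    # --- world-model update: regress the predicted reward onto the logged reward ---
    for state, item, reward in logged_data:
        err = reward_w[item] @ state - reward
        reward_w[item] -= lr * err * state

    # --- planning: Q-learning on pseudo experience generated by the world model ---
    n_items, state_dim = q_w.shape
    for _ in range(planning_steps):
        state = np.random.randn(state_dim)          # stand-in for a replayed session state
        item = int(np.argmax(q_w @ state))          # greedy recommendation
        pseudo_reward = reward_w[item] @ state      # simulated customer feedback
        next_state = np.random.randn(state_dim)     # toy transition; a real tracker updates the session
        td_target = pseudo_reward + gamma * np.max(q_w @ next_state)
        q_w[item] -= lr * (q_w[item] @ state - td_target) * state
    return reward_w, q_w
```

In PDQ this iteration would be repeated, with the world-model fit re-weighted toward the evolving recommendation policy.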
What follows is the training of the world model and of the agent; let's first look at the overall algorithm.
World Model Learning
3.1.1 The Error Function. The goal of the world model is to imitate the customer's feedback and to generate pseudo experience that is as realistic as possible. Since the reward function is associated with the customer's feedback, e.g., a click or a purchase, learning the reward function is equivalent to imitating the customer's feedback. Formally, the world model can be learned effectively by minimizing the error between the online and offline rewards.
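The notes stop before the formula. As a hedged reconstruction (my assumption, based only on the abstract's mention of importance sampling to correct the logging bias), the objective is presumably an importance-weighted regression of the modelled reward onto the logged reward, roughly:

$$
\mathcal{L}(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N} \frac{\pi(a_i \mid s_i)}{\pi_{\text{log}}(a_i \mid s_i)} \,\bigl(R_\theta(s_i, a_i) - r_i\bigr)^2
$$

where $(s_i, a_i, r_i)$ is a logged state, recommended item, and observed feedback, $R_\theta$ is the world model's reward estimate, $\pi$ is the current recommendation policy, and $\pi_{\text{log}}$ is the logging policy. The exact weighting and regularization in the paper may differ.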
We learn a reward function to model the customer's feedback.
Policy Learning
The concrete setup is as follows:
State tracker (forms the state from the interaction history)
Q-net (scores candidate items given the state)
World model (supplies simulated feedback for planning)
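As a hedged sketch of how a state tracker and a Q-net might fit together (the GRU tracker, the embedding and layer sizes, and the class names below are my own illustrative choices, not necessarily the paper's architecture):

```python
import torch
import torch.nn as nn

class StateTracker(nn.Module):
    """Hypothetical state tracker: encodes the sequence of items the customer
    has interacted with into a fixed-size session state."""
    def __init__(self, n_items, emb_dim=64, state_dim=128):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, emb_dim)
        self.gru = nn.GRU(emb_dim, state_dim, batch_first=True)

    def forward(self, item_seq):              # item_seq: (batch, seq_len) item ids
        emb = self.item_emb(item_seq)         # (batch, seq_len, emb_dim)
        _, h = self.gru(emb)                  # h: (1, batch, state_dim)
        return h.squeeze(0)                   # (batch, state_dim)

class QNet(nn.Module):
    """Hypothetical Q-network: scores every candidate item given the state."""
    def __init__(self, n_items, state_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_items),
        )

    def forward(self, state):                 # state: (batch, state_dim)
        return self.mlp(state)                # (batch, n_items) Q-values

# Usage sketch: recommend the item with the highest Q-value for the current session.
tracker, qnet = StateTracker(n_items=1000), QNet(n_items=1000)
session = torch.randint(0, 1000, (1, 5))      # last 5 interacted item ids
q_values = qnet(tracker(session))
recommended_item = q_values.argmax(dim=-1)
```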
Overall, that's about it. The general recipe is: train a model to simulate the environment, let the agent interact with that model to generate plenty of data for training, and finally obtain the policy we want. You may not follow every detail, but that is not really important; what matters is knowing the method being used. Put bluntly, RL-based recommender systems can never get around building a simulator.
Alright, alright: for the little cuties who want to get into recommender-system research but don't know how to start writing the code, you can check out my GitHub page, or RecBole, produced by Renmin University of China:
https://github.com/xingkongxiaxia/Sequential_Recommendation_System (today's mainstream recommendation algorithms, implemented in PyTorch)
https://github.com/xingkongxiaxia/tensorflow_recommend_system (I also have TensorFlow-based code)
https://github.com/RUCAIBox/RecBole RecBole (all kinds of models, more than 60 recommendation algorithms)
Feel free to give them a star!