Paper Reading 5 ----- Reinforcement-Learning-Based Recommender Systems

2021-01-18 14:22:20

Deep Reinforcement Learning for Page-wise Recommendations

ABSTRACT

Recommender systems can mitigate the information overload problem by suggesting users' personalized items. In real-world recommendations such as e-commerce, a typical interaction between the system and its users is – users are recommended a page of items and provide feedback; and then the system recommends a new page of items. To effectively capture such interaction for recommendations, we need to solve two key problems:

(1) how to update recommending strategy according to user’s real-time feedback, and

In plain terms: the system should be able to adjust its recommendations promptly according to user feedback. As mentioned in Paper Reading 4, traditional recommender systems cannot adjust in real time based on feedback.

(2) how to generate a page of items with proper display, which pose tremendous challenges to traditional recommender systems. In this paper, we study the problem of page-wise recommendations aiming to address aforementioned two challenges simultaneously. In particular,

That is, how to recommend a whole page of items rather than a single item, and preferably not in the traditional top-10 style, where the recommended items end up being very similar to each other.

(1)we propose a principled approach to jointly generate a set of complementary items and the corresponding strategy to display them in a 2-D page; and

In other words, they propose how to recommend a whole set of items in one shot.

(2)propose a novel page-wise recommendation framework based on deep reinforcement learning, DeepPage, which can optimize a page of items with proper display based on real-time feedback from users.

An RL-based recommender system can adjust its strategy promptly according to real-time feedback.

(3)The experimental results based on a real-world e-commerce dataset demonstrate the effectiveness of the proposed framework.

The experiments show that the proposed framework is effective.

proposed model

The state is fed in to obtain an action (a page of recommended items). The action is a bundle of vectors, one per page slot; for each vector, the nearest item in the item-embedding space is looked up and used as the recommended item (the red arrows from left to right in the figure). So the overall structure is essentially an Actor-Critic framework, trained in the DDPG way. The mapping operation that finds the nearest item is sketched below.
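The original post pasted the mapping as an image, which is not reproduced here. Below is a minimal sketch of what such a nearest-item lookup could look like, assuming `proto_items` holds the actor's output reshaped to one vector per page slot and `item_embeddings` is the full item-embedding table; both names and the distance choice are illustrative, not taken from the paper's code.

```python
import torch

def map_to_real_items(proto_items: torch.Tensor,
                      item_embeddings: torch.Tensor) -> torch.Tensor:
    """Map each proto-item vector from the actor to its nearest real item.

    proto_items:     (page_size, dim) -- actor output, one vector per page slot
    item_embeddings: (num_items, dim) -- embedding table of all candidate items
    Returns the indices of the chosen items, shape (page_size,).
    """
    # Pairwise Euclidean distances between every proto-item and every real item.
    dists = torch.cdist(proto_items, item_embeddings)  # (page_size, num_items)
    # For each slot, recommend the real item closest to the actor's vector.
    return dists.argmin(dim=1)
```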

1. How the state is obtained

Initial state: the most recent N items are used as input, and the final output of a GRU over them is taken as the state.
Update during recommendation: C denotes the item category and F denotes the feedback; the detailed equations are in the paper (the original images are not reproduced here), and a rough reconstruction is sketched below.
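As a rough reconstruction (my own sketch, not the authors' code) of how such a GRU-based state could be computed in PyTorch, with the category (C) and feedback (F) signals concatenated to each item embedding; all layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Encode the last N interacted items into a state vector with a GRU.

    Category (C) and feedback (F) embeddings are concatenated to each item
    embedding before the GRU; the exact dimensions are assumptions.
    """
    def __init__(self, num_items, num_categories, num_feedback_types,
                 item_dim=64, cat_dim=8, fb_dim=8, state_dim=128):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, item_dim)
        self.cat_emb = nn.Embedding(num_categories, cat_dim)
        self.fb_emb = nn.Embedding(num_feedback_types, fb_dim)
        self.gru = nn.GRU(item_dim + cat_dim + fb_dim, state_dim, batch_first=True)

    def forward(self, items, cats, feedback):
        # items, cats, feedback: (batch, N) integer id tensors
        x = torch.cat([self.item_emb(items),
                       self.cat_emb(cats),
                       self.fb_emb(feedback)], dim=-1)  # (batch, N, concat_dim)
        _, h_n = self.gru(x)                            # h_n: (1, batch, state_dim)
        return h_n.squeeze(0)                           # (batch, state_dim) = state
```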

offline training

A pre-trained Actor-Critic model is needed, used off-policy, as a simulator to generate data.
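How exactly the pre-trained model serves as a simulator is only stated briefly here, so the following is just one possible reading: a small network trained on logged interactions that predicts the reward a user would give to a proposed page, so that new (state, action, reward) transitions can be generated offline. The architecture, dimensions, and names are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class FeedbackSimulator(nn.Module):
    """Predict a simulated reward for a (state, page-of-items) pair.

    Trained on logged interactions; during offline training it stands in for
    the real user and scores the pages the actor proposes.
    """
    def __init__(self, state_dim=128, page_dim=64 * 10, hidden=256):
        super().__init__()
        # page_dim assumes 10 slots of 64-dim item vectors, flattened.
        self.net = nn.Sequential(
            nn.Linear(state_dim + page_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, page):
        # state: (batch, state_dim); page: (batch, page_dim) flattened item vectors
        return self.net(torch.cat([state, page], dim=-1)).squeeze(-1)  # reward
```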

First, the gap between the two has to be closed; otherwise it is no longer a real Actor-Critic framework and cannot be trained with DDPG (see the sketch after these notes).
In plain terms, the two are still disconnected.
That is, use the action actually output by the Actor network, rather than the action after the DeCNN layer.
Offline test: only data with ground truth is used.
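For the DDPG training mentioned in these notes, here is a generic, minimal DDPG update step on logged (or simulator-generated) transitions. It assumes `actor`, `critic`, their target copies, and the optimizers already exist, and that the critic takes (state, action) pairs; this is standard DDPG, not the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.9, tau=0.01):
    """One DDPG update on a batch of (s, a, r, s') transitions."""
    s, a, r, s_next = batch  # (B, state_dim), (B, action_dim), (B,), (B, state_dim)

    # Critic: regress Q(s, a) towards r + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        q_next = target_critic(s_next, target_actor(s_next)).squeeze(-1)
        q_target = r + gamma * q_next
    critic_loss = F.mse_loss(critic(s, a).squeeze(-1), q_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximize Q(s, mu(s)), i.e. minimize its negative.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft-update the target networks towards the online networks.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```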

online training

Alright: for those who want to do recommender-system research but are not sure how to write the code, you can check my GitHub homepage, or RecBole, developed by Renmin University of China.

Mainstream recommendation algorithms of today, implemented in PyTorch.

I also have TensorFlow-based code.

RecBole (covers all kinds of models, more than 60 recommendation algorithms).

Feel free to give them a star.
