Paper Reading 7 ----- Recommender Systems Based on Reinforcement Learning

2021-01-18 14:41:30

DRN: A Deep Reinforcement Learning Framework for News Recommendation

ABSTRACT

In this paper, we propose a novel Deep Reinforcement Learning framework for news recommendation.

We propose an RL-based method for news recommendation.

Online personalized news recommendation is a highly challenging problem due to the dynamic nature of news features and user preferences. Although some online recommendation models have been proposed to address the dynamic nature of news recommendation, these methods have three major issues.

News recommendation is highly challenging because news features and user preferences change dynamically. Existing recommendation methods have the following drawbacks.

First, they only try to model current reward(e.g., Click Through Rate).

1. They only model the immediate reward (e.g., Click Through Rate). This motivates the RL approach introduced below, since RL is well suited to optimizing long-term reward.

Second, very few studies consider to use user feedback other than click / no click labels (e.g., how frequent user returns) to help improve recommendation.

2. They make little use of user feedback, and even when they do, it is only click / no click labels. (The feedback is not rich enough; the paper later adds user return time to supplement it.)

Third, these methods tend to keep recommending similar news to users, which may cause users to get bored.

3. The recommended items tend to be similar, which bores users. (The paper's exploration mechanism addresses this.)

Therefore, to address the aforementioned challenges,

Our solutions are as follows.

we propose a Deep Q-Learning based recommendation framework, which can model future reward explicitly.

The RL approach considers not just the immediate reward but also future rewards.
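A minimal sketch of how future reward enters the Deep Q-Learning objective (the one-step target is standard Q-learning; the function name and inputs here are illustrative, not the paper's code):

```python
import numpy as np

def q_target(reward, gamma, next_q_values):
    """One-step Q-learning target: immediate reward plus the
    discounted best future value. This is what lets the model
    look beyond the current click."""
    return reward + gamma * np.max(next_q_values)

# With gamma = 0 only the immediate reward counts (the CTR-style
# objective the paper criticizes); gamma > 0 mixes in future reward.
```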

We further consider user return pattern as a supplement to click / no click labels in order to capture more user feedback information.

The interval from when a user leaves the app (or web page) until they come back is also treated as feedback; it reflects user stickiness.

In addition, an effective exploration strategy is incorporated to find new attractive news for users.

Our exploration mechanism is very effective.

Extensive experiments are conducted on the offline dataset and online production environment of a commercial news recommendation application and have shown the superior performance of our methods.

We also ran experiments, both offline and online, which indeed confirm that our method works very well.

Proposed model

The procedure is as follows:
1. Take in a request
2. Return recommendations and store all relevant feedback data
3. Minor update between the explore net and the current Q net
4. Major update
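The four steps above can be sketched as a simple serving loop (all function names here are hypothetical placeholders, not the paper's actual interfaces):

```python
def run_drn_loop(requests, recommend, observe_feedback,
                 minor_update, major_update,
                 minor_every=10, major_every=100):
    """Sketch of the serving loop: (1) serve each request,
    (2) log feedback, (3) periodically do a minor update
    (explore net vs. current Q net), (4) periodically do a
    major update on the accumulated log."""
    log = []
    for t, req in enumerate(requests, start=1):
        items = recommend(req)                    # step 1: serve the request
        log.append(observe_feedback(req, items))  # step 2: store feedback
        if t % minor_every == 0:
            minor_update(log[-minor_every:])      # step 3: minor update
        if t % major_every == 0:
            major_update(log)                     # step 4: major update
    return log
```

The update frequencies are assumed values; the point is only that minor updates are frequent and cheap while major updates are rare and retrain on the full log.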

User representation: how the state is formed

Many kinds of features are used; this model suits companies with abundant data and resources rather than academic research.
The Q network

User feedback --- this paper uses the time until the user returns as an additional feedback signal.

In short, the shorter the return interval, the higher the user's satisfaction.
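A toy sketch of how the return interval could be folded into the reward, where a shorter gap yields a larger activeness term (the decay form, the parameters `s0`, `tau`, and the weight `beta` are all assumptions for illustration, not the paper's survival-model formulation):

```python
import math

def activeness_reward(return_gap_hours, s0=0.5, tau=24.0):
    """Toy stand-in for a user-activeness score: the shorter the
    return gap, the higher the reward. s0 and tau are hypothetical."""
    return s0 * math.exp(-return_gap_hours / tau)

def total_reward(clicked, return_gap_hours, beta=0.05):
    """Click reward plus a weighted activeness term (beta is assumed)."""
    return float(clicked) + beta * activeness_reward(return_gap_hours)
```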

Our exploration mechanism

The explore net is generated by taking the current net and adding a little noise, as shown in the figure below:
W_explore = W + ΔW (where W is the current net's parameters)
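A minimal sketch of this perturbation, assuming ΔW is multiplicative uniform noise scaled by a small coefficient `alpha` (the coefficient value is an assumption):

```python
import numpy as np

def explore_weights(W, alpha=0.1, rng=None):
    """Generate the explore net from the current net by adding noise:
    W_explore = W + Delta_W, with Delta_W = alpha * U(-1, 1) * W.
    alpha is an assumed exploration coefficient."""
    rng = rng or np.random.default_rng()
    delta = alpha * rng.uniform(-1.0, 1.0, size=W.shape) * W
    return W + delta
```

If the explore net's recommendations receive better feedback, the current net is nudged toward the explore net; that is the minor update mentioned above.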

The minor update of W was already mentioned above.
