This post is part of a series on Model-Free Control; in fact, both SARSA and Q-learning with $\epsilon$-greedy Exploration are instances of model-free control. For the full picture, it is recommended to read the original article.
SARSA Algorithm
SARSA stands for state, action, reward, next state, action taken in next state. The algorithm performs an update each time such a five-tuple is sampled, hence the name SARSA.
1: Set initial $\epsilon$-greedy policy $\pi$, $t=0$, initial state $s_t=s_0$
2: Take $a_t \sim \pi(s_t)$ // sample action from policy
3: Observe $(r_t, s_{t+1})$
4: loop
5:   Take action $a_{t+1} \sim \pi(s_{t+1})$
6:   Observe $(r_{t+1}, s_{t+2})$
7:   $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha(r_t + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t))$
8:   $\pi(s_t) = \mathop{\arg\max}\limits_{a} Q(s_t,a)$ w.prob $1-\epsilon$, else random
9:   $t = t+1$
10: end loop
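Below is a minimal Python sketch of the SARSA loop above, assuming a tabular problem and a gym-style environment interface (`env.reset()` returning a state, `env.step(action)` returning `(next_state, reward, done, info)`); the function names and hyperparameter defaults are illustrative, not part of the original post.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Return argmax_a Q(state, a) with probability 1 - epsilon, else a random action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def sarsa(env, n_states, n_actions, alpha=0.1, gamma=0.99,
          epsilon=0.1, n_episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, epsilon, rng)                 # a_t ~ pi(s_t)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)                   # observe (r_t, s_{t+1})
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)   # a_{t+1} ~ pi(s_{t+1})
            # SARSA update: the target bootstraps on the action actually taken next
            target = r + gamma * Q[s_next, a_next] * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next                              # t = t + 1
    return Q
```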
Q-learning: Learning the Optimal State-Action Value
Can we estimate the value of the optimal policy $\pi^*$ without knowing $\pi^*$ itself?
Yes, by using Q-learning.
Key idea: maintain an estimate of the state-action values $Q$ and use it to bootstrap the value of the best future action.
Recall SARSA: $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\big((r_t + \gamma Q(s_{t+1},a_{t+1})) - Q(s_t,a_t)\big)$
Q-learning: $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\big((r_t + \gamma \max\limits_{a'} Q(s_{t+1},a')) - Q(s_t,a_t)\big)$
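To make the single difference between the two updates concrete, here is a small sketch of just the update step, written as NumPy functions on a tabular $Q$; the argument names mirror the symbols in the formulas and are otherwise assumptions.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy target: bootstrap with the action a_{t+1} actually sampled from pi."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy target: bootstrap with the greedy value max_{a'} Q(s_{t+1}, a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```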
Off-Policy Control Using Q-learning
- In the previous section we assumed there was some behavior policy $\pi_b$ that could be used to act.
- $\pi_b$ determines the returns that are actually obtained.
- Now consider how to improve the behavior policy (policy improvement).
- Let the behavior policy $\pi_b$ be the $\epsilon$-greedy policy with respect to (w.r.t.) the current best estimate of $Q(s,a)$.
Q-learning with $\epsilon$-greedy Exploration
1: Initialize $Q(s,a), \forall s \in S, a \in A$; $t=0$, initial state $s_t=s_0$
2: Set $\pi_b$ to be $\epsilon$-greedy w.r.t. $Q$
3: loop
4:   Take $a_t \sim \pi_b(s_t)$ // sample action from policy
5:   Observe $(r_t, s_{t+1})$
6:   Update $Q$ given $(s_t, a_t, r_t, s_{t+1})$:
7:   $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha(r_t + \gamma \max\limits_{a} Q(s_{t+1},a) - Q(s_t,a_t))$
8:   Perform policy improvement: set $\pi_b$ to be $\epsilon$-greedy w.r.t. $Q$
9:   $t = t+1$
10: end loop
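As with SARSA, here is a minimal Python sketch of the loop above under the same assumed gym-style interface; names and defaults are illustrative. Because the behavior policy is re-derived from the current $Q$ at every step, the policy-improvement line is implicit in the action selection.

```python
import numpy as np

def q_learning(env, n_states, n_actions, alpha=0.1, gamma=0.99,
               epsilon=0.1, n_episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))        # initialization choice discussed below
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy pi_b: epsilon-greedy w.r.t. the current Q
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)   # observe (r_t, s_{t+1})
            # Q-learning update: bootstrap with max_a Q(s_{t+1}, a), not the next action taken
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next                         # t = t + 1
        # Policy improvement is implicit: pi_b is always epsilon-greedy w.r.t. the latest Q
    return Q
```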
Does the initialization of $Q$ matter? $Q$ converges to the correct values no matter how it is initialized (all zeros, random values), but in practice the choice matters a great deal: initializing it optimistically (optimistic initialization) can be very helpful. This point is covered in more detail in the section on exploration.
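As a hedged illustration of the initialization options mentioned above, where the table sizes and the optimistic value are arbitrary choices for the example:

```python
import numpy as np

n_states, n_actions = 10, 4   # illustrative sizes

Q_zero = np.zeros((n_states, n_actions))                                 # all zeros
Q_random = np.random.default_rng(0).normal(size=(n_states, n_actions))   # random values
# Optimistic initialization: start every Q(s, a) at an (assumed) upper bound on returns,
# which pushes the epsilon-greedy policy to try untested actions early on.
Q_optimistic = np.full((n_states, n_actions), 10.0)
```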