CS 540 Introduction to Artificial Intelligence
Reinforcement Learning II / Summary
University of Wisconsin-Madison, Fall 2022

Outline
• Review of reinforcement learning
  – MDPs, value functions, Bellman equation, value iteration
• Q-learning
  – Q function, Q-learning

Defining the Optimal Policy
For a policy $\pi$, the expected utility over all possible state sequences from $s_0$ produced by following that policy is called the value function (for $\pi$ and $s_0$):
$$V^\pi(s_0) = \sum_{\text{sequences starting from } s_0} P(\text{sequence}) \, U(\text{sequence})$$

Discounting Rewards
One issue: these are infinite series. Do they converge?
• Solution: a discount factor $\gamma$ between 0 and 1
  – Set according to how important the present is vs. the future
  – Note: $\gamma$ must be strictly less than 1 for convergence

Example
[Figure: grid world with states A (reward 10), B (20), C (20), G (100); deterministic transitions, $\gamma = 0.8$, policy shown by red arrows.]

Bellman Equation
Let's walk over one step for the value function: the value of a state decomposes into the current state reward plus the discounted expected future rewards.
Credit: L. Lazebnik

The Bellman equation
[Figure: agent-environment loop — the agent receives reward $r(s)$ and chooses action $a$; the environment returns $s' \sim P(\cdot \mid s, a)$.]
• Define the state utility $V^*(s)$ as the expected sum of discounted rewards if the agent executes an optimal policy starting in state $s$.
• What is the expected utility of taking action $a$ in state $s$?
$$\sum_{s'} P(s' \mid s, a) \, V^*(s')$$
• What is the recursive expression for $V^*(s)$ in terms of $V^*(s')$, the utilities of its successors?
$$V^*(s) = r(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a) \, V^*(s')$$
• The same reasoning gives the Bellman equation for a general policy:
$$V^\pi(s) = r(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^\pi(s')$$
Image source: L. Lazebnik

Example
[Figure: the same grid world — A (10), B (20), C (20), G (100); deterministic transitions, $\gamma = 0.8$, policy shown by red arrows.]

The Q*(s,a) function
• Starting from state $s$, perform the (perhaps suboptimal) action $a$, THEN follow the optimal policy.
• Equivalently:
$$Q^*(s, a) = r(s) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s')$$
$$Q^*(s, a) = r(s) + \gamma \sum_{s'} P(s' \mid s, a) \max_b Q^*(s', b)$$

Q-Learning Iteration
How do we get $Q(s, a)$?
• A similar iterative procedure
• Idea: combine the old value and a new estimate of the future value, weighted by a learning rate $\alpha$
• Note: we are using a policy to take actions, based on the estimated Q!

Offline Q-Learning
Estimate $Q^*(s, a)$ from data $\{(s_t, a_t, r_t, s_{t+1})\}$:
1. Initialize $Q(\cdot, \cdot)$ arbitrarily (e.g., all zeros)
   – except terminal states: $Q(s_{\text{terminal}}, \cdot) = 0$
2. Iterate over the data until $Q(\cdot, \cdot)$ converges:
$$Q(s_t, a_t) \leftarrow (1 - \alpha) \, Q(s_t, a_t) + \alpha \big( r_t + \gamma \max_b Q(s_{t+1}, b) \big)$$
where $\alpha$ is the learning rate.
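Before turning the update above into code, it may help to see the Bellman optimality equation from the earlier slides run as code. Below is a minimal value-iteration sketch, not course-provided code: the toy MDP (the transition tensor P, the reward vector r, and the choice of sizes) is an illustrative assumption.

```python
# Minimal value-iteration sketch for the Bellman optimality equation.
# The MDP here (P, r, gamma, sizes) is a made-up toy example, not the
# grid world from the lecture slides.
import numpy as np

n_states, n_actions = 4, 2
gamma = 0.8

# P[a, s, s'] = probability of landing in s' after taking action a in s.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = np.array([10.0, 20.0, 20.0, 100.0])  # per-state rewards r(s)

V = np.zeros(n_states)
for _ in range(1000):
    # Q[a, s] = r(s) + gamma * sum_{s'} P(s'|s,a) V(s')  -- one Bellman backup
    Q = r + gamma * (P @ V)        # shape (n_actions, n_states)
    V_new = Q.max(axis=0)          # V*(s) = max over actions
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)  # approximates V*
```

Each pass of the loop is one Bellman backup; iteration stops once $V$ changes by less than a small tolerance, at which point $V$ approximates $V^*$.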
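The offline Q-learning update from the slide above can be sketched in the same style. Again this is a hedged illustration rather than the course's code: the dataset of (s, a, r, s', done) tuples, the sizes, and the sweep count are all made up.

```python
# Sketch of offline Q-learning over a fixed dataset of transitions.
import numpy as np

n_states, n_actions = 4, 2
alpha, gamma = 0.1, 0.8

Q = np.zeros((n_states, n_actions))  # terminal states simply stay at Q = 0

def q_update(Q, s, a, r, s_next, done):
    # target = r + gamma * max_b Q(s', b); a terminal next state contributes 0
    target = r + (0.0 if done else gamma * Q[s_next].max())
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

# Toy (s, a, r, s', done) transitions; sweep the data repeatedly,
# which stands in for "iterate until Q converges".
data = [(0, 1, 10.0, 1, False), (1, 1, 20.0, 3, True)]
for _ in range(500):
    for s, a, r, s_next, done in data:
        q_update(Q, s, a, r, s_next, done)
print(Q)
```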
Online Q-learning Algorithm
Input: step size $\alpha$, greedy parameter $\epsilon$
1. $Q(\cdot, \cdot) = 0$
2. for each episode
3.   draw an initial state $s \sim \mu$
4.   while ($s$ not terminal)
5.     perform $a = \epsilon$-greedy$(Q)$, receive $r, s'$
6.     $Q(s, a) \leftarrow (1 - \alpha) \, Q(s, a) + \alpha \big( r + \gamma \max_b Q(s', b) \big)$
7.     $s \leftarrow s'$
8.   endwhile
9. endfor
Note: step 5 can use any other behavior policy.

Online Q-learning Algorithm
• Step 5 can use any other behavior policy to choose action $a$, as long as all actions are chosen frequently enough.
• The cumulative rewards collected during Q-learning may not be the highest.
• But after Q-learning converges, an optimal policy can be extracted:
$$\pi^*(s) \in \operatorname*{argmax}_a Q(s, a), \qquad V^*(s) = \max_a Q^*(s, a)$$

Q-Learning: SARSA
An alternative update rule:
• Just use the next action $a_{t+1}$; no max over actions:
$$Q(s_t, a_t) \leftarrow (1 - \alpha) \, Q(s_t, a_t) + \alpha \big( r_t + \gamma \, Q(s_{t+1}, a_{t+1}) \big)$$
with learning rate $\alpha$.
• Called state-action-reward-state-action (SARSA)
• Can be used with an $\epsilon$-greedy policy

Search and RL Review
• Search
  – Uninformed vs. informed
  – Optimization
• Games
  – Minimax search
• Reinforcement learning
  – MDPs, value iteration, Q-learning

Uninformed vs Informed Search
Uninformed search (all of what we saw). Know:
• Path cost $g(s)$ from the start to node $s$
• Successors
Informed search. Know:
• All uninformed search properties, plus
• A heuristic $h(s)$ from $s$ to the goal (recall game heuristics)
[Figure: start-to-goal search diagrams annotated with $g(s)$, plus $h(s)$ in the informed case.]

Uninformed Search: Iterative Deepening DFS
Repeated depth-limited DFS
• Searches like BFS, with a fringe like DFS
• Properties:
  – Complete
  – Optimal (if edge costs are 1)
  – Time $O(b^d)$
  – Space $O(bd)$
A good option!

Hill Climbing Algorithm
Pseudocode:
1. Pick an initial state $s$
2. Pick $t$ in neighbors$(s)$ with the largest $f(t)$
3. If $f(t) \leq f(s)$, then stop and return $s$
4. $s \leftarrow t$; go to 2
What could happen? Local optima!

Hill Climbing: Local Optima
Note the local optima. How do we handle them?
[Figure: two plots of $f$ over the state space; one run stops at a local optimum ("Done?"), another sits on a flat plateau ("Where do I go?").]

Simulated Annealing
A more sophisticated optimization approach.
• Idea: move quickly at first, then slow down
• Pseudocode:
  Pick an initial state $s$
  For $k = 0$ through $k_{\max}$:
    $T \leftarrow$ temperature$\big((k+1)/k_{\max}\big)$
    Pick a random neighbor, $t \leftarrow$ neighbor$(s)$
    If $f(s) \leq f(t)$, then $s \leftarrow t$
    Else, with probability $P(f(s), f(t), T)$, $s \leftarrow t$
  Output: the final state $s$
The interesting bit is the acceptance probability $P(f(s), f(t), T)$, typically $\exp\big((f(t) - f(s))/T\big)$: worse moves are accepted readily while the temperature $T$ is high, and only rarely once it is low.
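To make the acceptance step concrete, here is a minimal simulated-annealing sketch in the same spirit as the pseudocode above. The objective f, the neighbor function, and the cooling schedule are all illustrative assumptions; a worse neighbor is accepted with probability $\exp((f(t) - f(s))/T)$.

```python
# Minimal simulated-annealing sketch (maximizing f).
# f, neighbor, and temperature below are illustrative assumptions.
import math
import random

def f(x):
    return -(x - 3.0) ** 2  # toy objective; maximum at x = 3

def neighbor(x):
    return x + random.uniform(-0.5, 0.5)  # small random perturbation

def temperature(frac):
    return max(1e-3, 1.0 - frac)  # cools from 1 toward 0

s = 0.0
k_max = 10_000
for k in range(k_max):
    T = temperature((k + 1) / k_max)
    t = neighbor(s)
    # Always accept an uphill move; accept a downhill move with
    # probability exp((f(t) - f(s)) / T), which shrinks as T drops.
    if f(s) <= f(t) or random.random() < math.exp((f(t) - f(s)) / T):
        s = t
print(s)  # should end near the maximizer x = 3
```

Early on, the high temperature lets the search jump out of local optima (the failure mode of hill climbing above); as $T$ falls, the behavior approaches pure hill climbing.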