CS 540 Introduction to Artificial Intelligence
Reinforcement Learning II / Summary
University of Wisconsin-Madison, Fall 2022

Outline
• Review of reinforcement learning
  – MDPs, value functions, Bellman equation, value iteration
• Q-learning
  – Q function, Q-learning

Defining the Optimal Policy
For a policy π, the expected utility over all possible state sequences from s_0 produced by following that policy is called the value function (for π and s_0):

  V^π(s_0) = Σ_{sequences starting from s_0} P(sequence) U(sequence)

Discounting Rewards
One issue: these are infinite series. Do they converge?
• Solution: a discount factor γ between 0 and 1
  – Set according to how important the present is versus the future
  – Note: γ has to be less than 1 for convergence

Example (figure): states A, B, C, G with rewards 10, 20, 20, 100; deterministic transitions; γ = 0.8; the policy is shown by red arrows.

Bellman Equation
Walk one step forward in the value function: it splits into the current state's reward plus the discounted expected future rewards. (Figure credit: L. Lazebnik.)

The Bellman Equation
Setting: the agent receives reward r(s), chooses action a, and the environment returns s' ~ P(·|s, a).
• Define the state utility V*(s) as the expected sum of discounted rewards if the agent executes an optimal policy starting in state s.
• What is the expected utility of taking action a in state s?

  Σ_{s'} P(s'|s, a) V*(s')

• What is the recursive expression for V*(s) in terms of V*(s'), the utilities of its successors?

  V*(s) = r(s) + γ max_a Σ_{s'} P(s'|s, a) V*(s')

• The same reasoning gives the Bellman equation for a general policy π:

  V^π(s) = r(s) + γ Σ_{s'} P(s'|s, π(s)) V^π(s')

Example (figure, revisited): the same states A, B, C, G with deterministic transitions, γ = 0.8, and the policy shown by red arrows.

The Q*(s, a) Function
• Starting from state s, perform a (perhaps suboptimal) action a, THEN follow the optimal policy.
• Equivalently:

  Q*(s, a) = r(s) + γ Σ_{s'} P(s'|s, a) V*(s')
  Q*(s, a) = r(s) + γ Σ_{s'} P(s'|s, a) max_b Q*(s', b)

Q-Learning Iteration
How do we get Q(s, a)? A similar iterative procedure.
• Idea: combine the old value and a new estimate of the future value, weighted by a learning rate α.
• Note: we are using a policy to take actions, and that policy is based on the estimated Q!

Offline Q-Learning
Estimate Q*(s, a) from data {(s_t, a_t, r_t, s_{t+1})}:
1. Initialize Q(·,·) arbitrarily (e.g., all zeros), except terminal states: Q(s_terminal, ·) = 0.
2. Iterate over the data until Q(·,·) converges, applying the update (α is the learning rate):

  Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α (r_t + γ max_b Q(s_{t+1}, b))
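To make the offline update concrete, here is a minimal tabular sketch in Python/NumPy. It is not the course's code: the function name offline_q_learning, the transitions list of (s, a, r, s') tuples, and the hyperparameter defaults are illustrative assumptions, and a fixed number of sweeps stands in for "iterate until convergence."

```python
import numpy as np

def offline_q_learning(transitions, n_states, n_actions,
                       alpha=0.1, gamma=0.8, n_sweeps=100):
    """Tabular offline Q-learning over a fixed dataset of (s, a, r, s') transitions."""
    # Rows for terminal states are never updated, so Q(s_terminal, .) stays 0.
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_sweeps):                  # repeated sweeps over the data
        for s, a, r, s_next in transitions:
            # Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma max_b Q(s', b))
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next].max())
    return Q

# After convergence, a greedy policy can be read off row by row:
# policy = Q.argmax(axis=1)
```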
Online Q-Learning Algorithm
Input: step size α, greedy parameter ε
1. Q(·,·) = 0
2. for each episode
3.   draw initial state s ~ μ
4.   while s is not terminal
5.     perform a = ε-greedy(Q), receive r, s'
6.     Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_b Q(s', b))
7.     s ← s'
8.   endwhile
9. endfor

Notes on step 5:
• Step 5 can use any other behavior policy to choose action a, as long as all actions are chosen frequently enough.
• The cumulative rewards collected during Q-learning may not be the highest.
• But after Q-learning converges, an optimal policy can be extracted:

  π*(s) ∈ argmax_a Q(s, a),  V*(s) = max_a Q*(s, a)

Q-Learning: SARSA
An alternative update rule:
• Just use the next action actually taken, with no max over actions (α is again the learning rate):

  Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α (r_t + γ Q(s_{t+1}, a_{t+1}))

• Called state–action–reward–state–action (SARSA)
• Can be used with an ε-greedy policy

Search and RL Review
• Search
  – Uninformed vs. informed
  – Optimization
• Games
  – Minimax search
• Reinforcement learning
  – MDPs, value iteration, Q-learning

Uninformed vs. Informed Search
Uninformed search (all of what we saw). Know:
• The path cost g(s) from the start to node s
• Successors
Informed search. Know:
• All uninformed search properties, plus
• A heuristic h(s) from s to the goal (recall the game heuristic)
(Figure: a path from start through s to the goal, annotated with g(s) and h(s).)

Uninformed Search: Iterative Deepening DFS
Repeated depth-limited DFS
• Searches like BFS, keeps a fringe like DFS
• Properties:
  – Complete
  – Optimal (if all edge costs are 1)
  – Time O(b^d)
  – Space O(bd)
A good option!

Hill Climbing Algorithm
Pseudocode:
1. Pick an initial state s
2. Pick t in neighbors(s) with the largest f(t)
3. if f(t) ≤ f(s), stop and return s
4. s ← t, go to 2
What could happen? Local optima!

Hill Climbing: Local Optima
Note the local optima. How do we handle them? (Figure: two plots of f versus state illustrating local optima, labeled "Done?" and "Where do I go?")

Simulated Annealing
A more sophisticated optimization approach.
• Idea: move quickly at first, then slow down.
• Pseudocode:
  Pick an initial state s
  For k = 0 through k_max:
    T ← temperature((k + 1) / k_max)
    Pick a random neighbor, t ← neighbor(s)
    If f(s) ≤ f(t), then s ← t
    Else, with probability P(f(s), f(t), T), s ← t
  Output: the final state s
The interesting bit is the acceptance probability P(f(s), f(t), T), which occasionally accepts a worse neighbor.
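As a companion to the pseudocode above, here is a minimal Python sketch for maximizing f. The linear cooling schedule and the Metropolis-style acceptance probability exp((f(t) − f(s)) / T) are illustrative assumptions; the slide leaves temperature(·) and P(f(s), f(t), T) abstract.

```python
import math
import random

def simulated_annealing(s, f, neighbor, k_max=10_000, t0=1.0):
    """Maximize f by simulated annealing; f and neighbor are problem-specific callables."""
    for k in range(k_max):
        T = t0 * (1 - (k + 1) / k_max)      # assumed cooling schedule: temperature((k+1)/k_max)
        t = neighbor(s)                      # pick a random neighbor of s
        if f(s) <= f(t):
            s = t                            # always accept a non-worsening move
        elif T > 0 and random.random() < math.exp((f(t) - f(s)) / T):
            s = t                            # sometimes accept a worse state, more often when T is high
    return s
```

With T pinned at zero the loop never accepts a worse neighbor, so it reduces to a randomized hill climb; raising T early on is what lets the search escape local optima.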