CS 540 Introduction to Artificial Intelligence
Reinforcement Learning II / Summary
University of Wisconsin-Madison, Fall 2022

Outline
• Review of reinforcement learning
  – MDPs, value functions, Bellman equation, value iteration
• Q-learning
  – Q function, Q-learning

Defining the Optimal Policy
For a policy $\pi$, the expected utility over all possible state sequences from $s_0$ produced by following that policy is called the value function (for $\pi$ and $s_0$):
$$V^\pi(s_0) = \sum_{\text{sequences starting from } s_0} P(\text{sequence}) \, U(\text{sequence})$$

Discounting Rewards
One issue: these are infinite series. Do they converge?
• Solution: a discount factor $\gamma$ between 0 and 1
  – Set according to how important the present is vs. the future
  – Note: $\gamma$ must be strictly less than 1 for convergence

Example
[Figure: grid world with states A (reward 10), B (20), C (20), G (100); deterministic transitions, $\gamma = 0.8$, policy shown by red arrows.]

Bellman Equation
Let's walk over one step for the value function: the value of a state decomposes into the current state reward plus the discounted expected future rewards.
Credit: L. Lazebnik

The Bellman equation
[Figure: agent-environment loop — the agent receives reward $r(s)$ and chooses action $a$; the environment returns $s' \sim P(\cdot \mid s, a)$.]
• Define the state utility $V^*(s)$ as the expected sum of discounted rewards if the agent executes an optimal policy starting in state $s$.
• What is the expected utility of taking action $a$ in state $s$?
$$\sum_{s'} P(s' \mid s, a) \, V^*(s')$$
• What is the recursive expression for $V^*(s)$ in terms of $V^*(s')$, the utilities of its successors?
$$V^*(s) = r(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a) \, V^*(s')$$
• The same reasoning gives the Bellman equation for a general policy:
$$V^\pi(s) = r(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^\pi(s')$$
Image source: L. Lazebnik

Example
[Figure: the same grid world — A (10), B (20), C (20), G (100); deterministic transitions, $\gamma = 0.8$, policy shown by red arrows.]

The Q*(s,a) function
• Starting from state $s$, perform the (perhaps suboptimal) action $a$, THEN follow the optimal policy.
• Equivalently:
$$Q^*(s, a) = r(s) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s')$$
$$Q^*(s, a) = r(s) + \gamma \sum_{s'} P(s' \mid s, a) \max_b Q^*(s', b)$$

Q-Learning Iteration
How do we get $Q(s, a)$?
• A similar iterative procedure
• Idea: combine the old value and a new estimate of the future value, weighted by a learning rate $\alpha$
• Note: we are using a policy to take actions, based on the estimated Q!

Offline Q-Learning
Estimate $Q^*(s, a)$ from data $\{(s_t, a_t, r_t, s_{t+1})\}$:
1. Initialize $Q(\cdot, \cdot)$ arbitrarily (e.g., all zeros)
   – except terminal states: $Q(s_{\text{terminal}}, \cdot) = 0$
2. Iterate over the data until $Q(\cdot, \cdot)$ converges:
$$Q(s_t, a_t) \leftarrow (1 - \alpha) \, Q(s_t, a_t) + \alpha \big( r_t + \gamma \max_b Q(s_{t+1}, b) \big)$$
where $\alpha$ is the learning rate.
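Before turning the update above into code, it may help to see the Bellman optimality equation from the earlier slides run as code. Below is a minimal value-iteration sketch, not course-provided code: the toy MDP (the transition tensor P, the reward vector r, and the choice of sizes) is an illustrative assumption.

```python
# Minimal value-iteration sketch for the Bellman optimality equation.
# The MDP here (P, r, gamma, sizes) is a made-up toy example, not the
# grid world from the lecture slides.
import numpy as np

n_states, n_actions = 4, 2
gamma = 0.8

# P[a, s, s'] = probability of landing in s' after taking action a in s.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = np.array([10.0, 20.0, 20.0, 100.0])  # per-state rewards r(s)

V = np.zeros(n_states)
for _ in range(1000):
    # Q[a, s] = r(s) + gamma * sum_{s'} P(s'|s,a) V(s')  -- one Bellman backup
    Q = r + gamma * (P @ V)        # shape (n_actions, n_states)
    V_new = Q.max(axis=0)          # V*(s) = max over actions
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)  # approximates V*
```

Each pass of the loop is one Bellman backup; iteration stops once $V$ changes by less than a small tolerance, at which point $V$ approximates $V^*$.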
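The offline Q-learning update from the slide above can be sketched in the same style. Again this is a hedged illustration rather than the course's code: the dataset of (s, a, r, s', done) tuples, the sizes, and the sweep count are all made up.

```python
# Sketch of offline Q-learning over a fixed dataset of transitions.
import numpy as np

n_states, n_actions = 4, 2
alpha, gamma = 0.1, 0.8

Q = np.zeros((n_states, n_actions))  # terminal states simply stay at Q = 0

def q_update(Q, s, a, r, s_next, done):
    # target = r + gamma * max_b Q(s', b); a terminal next state contributes 0
    target = r + (0.0 if done else gamma * Q[s_next].max())
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

# Toy (s, a, r, s', done) transitions; sweep the data repeatedly,
# which stands in for "iterate until Q converges".
data = [(0, 1, 10.0, 1, False), (1, 1, 20.0, 3, True)]
for _ in range(500):
    for s, a, r, s_next, done in data:
        q_update(Q, s, a, r, s_next, done)
print(Q)
```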
Online Q-learning Algorithm
Input: step size $\alpha$, greedy parameter $\epsilon$
1. $Q(\cdot, \cdot) = 0$
2. for each episode
3.   draw an initial state $s \sim \mu$
4.   while ($s$ not terminal)
5.     perform $a = \epsilon$-greedy$(Q)$, receive $r, s'$
6.     $Q(s, a) \leftarrow (1 - \alpha) \, Q(s, a) + \alpha \big( r + \gamma \max_b Q(s', b) \big)$
7.     $s \leftarrow s'$
8.   endwhile
9. endfor
Note: step 5 can use any other behavior policy.

Online Q-learning Algorithm
• Step 5 can use any other behavior policy to choose action $a$, as long as all actions are chosen frequently enough.
• The cumulative rewards collected during Q-learning may not be the highest.
• But after Q-learning converges, an optimal policy can be extracted:
$$\pi^*(s) \in \operatorname*{argmax}_a Q(s, a), \qquad V^*(s) = \max_a Q^*(s, a)$$

Q-Learning: SARSA
An alternative update rule:
• Just use the next action $a_{t+1}$; no max over actions:
$$Q(s_t, a_t) \leftarrow (1 - \alpha) \, Q(s_t, a_t) + \alpha \big( r_t + \gamma \, Q(s_{t+1}, a_{t+1}) \big)$$
with learning rate $\alpha$.
• Called state-action-reward-state-action (SARSA)
• Can be used with an $\epsilon$-greedy policy

Search and RL Review
• Search
  – Uninformed vs. informed
  – Optimization
• Games
  – Minimax search
• Reinforcement learning
  – MDPs, value iteration, Q-learning

Uninformed vs Informed Search
Uninformed search (all of what we saw). Know:
• Path cost $g(s)$ from the start to node $s$
• Successors
Informed search. Know:
• All uninformed search properties, plus
• A heuristic $h(s)$ from $s$ to the goal (recall game heuristics)
[Figure: start-to-goal search diagrams annotated with $g(s)$, plus $h(s)$ in the informed case.]

Uninformed Search: Iterative Deepening DFS
Repeated depth-limited DFS
• Searches like BFS, with a fringe like DFS
• Properties:
  – Complete
  – Optimal (if edge costs are 1)
  – Time $O(b^d)$
  – Space $O(bd)$
A good option!

Hill Climbing Algorithm
Pseudocode:
1. Pick an initial state $s$
2. Pick $t$ in neighbors$(s)$ with the largest $f(t)$
3. If $f(t) \leq f(s)$, then stop and return $s$
4. $s \leftarrow t$; go to 2
What could happen? Local optima!

Hill Climbing: Local Optima
Note the local optima. How do we handle them?
[Figure: two plots of $f$ over the state space; one run stops at a local optimum ("Done?"), another sits on a flat plateau ("Where do I go?").]

Simulated Annealing
A more sophisticated optimization approach.
• Idea: move quickly at first, then slow down
• Pseudocode:
  Pick an initial state $s$
  For $k = 0$ through $k_{\max}$:
    $T \leftarrow$ temperature$\big((k+1)/k_{\max}\big)$
    Pick a random neighbor, $t \leftarrow$ neighbor$(s)$
    If $f(s) \leq f(t)$, then $s \leftarrow t$
    Else, with probability $P(f(s), f(t), T)$, $s \leftarrow t$
  Output: the final state $s$
The interesting bit is the acceptance probability $P(f(s), f(t), T)$, typically $\exp\big((f(t) - f(s))/T\big)$: worse moves are accepted readily while the temperature $T$ is high, and only rarely once it is low.
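To make the acceptance step concrete, here is a minimal simulated-annealing sketch in the same spirit as the pseudocode above. The objective f, the neighbor function, and the cooling schedule are all illustrative assumptions; a worse neighbor is accepted with probability $\exp((f(t) - f(s))/T)$.

```python
# Minimal simulated-annealing sketch (maximizing f).
# f, neighbor, and temperature below are illustrative assumptions.
import math
import random

def f(x):
    return -(x - 3.0) ** 2  # toy objective; maximum at x = 3

def neighbor(x):
    return x + random.uniform(-0.5, 0.5)  # small random perturbation

def temperature(frac):
    return max(1e-3, 1.0 - frac)  # cools from 1 toward 0

s = 0.0
k_max = 10_000
for k in range(k_max):
    T = temperature((k + 1) / k_max)
    t = neighbor(s)
    # Always accept an uphill move; accept a downhill move with
    # probability exp((f(t) - f(s)) / T), which shrinks as T drops.
    if f(s) <= f(t) or random.random() < math.exp((f(t) - f(s)) / T):
        s = t
print(s)  # should end near the maximizer x = 3
```

Early on, the high temperature lets the search jump out of local optima (the failure mode of hill climbing above); as $T$ falls, the behavior approaches pure hill climbing.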