Docsity
Machine Learning Approaches - Artificial Intelligence - Lecture Slides | CPSC 481, Study notes of Computer Science

Ch. 5. Material Type: Notes; Professor: Ryu; Class: Artificial Intelligence; Subject: Computer Science; University: California State University - Fullerton; Term: Spring 2014;

Typology: Study notes
Academic year: 2013/2014
Uploaded on 10/14/2014 by qq-falcon

Partial preview of the text

Artificial Intelligence
CPSC 481
Machine Learning, Part A

What is Learning?
 Learning
 Is acquiring new knowledge, skills, values, or understanding
 Occurs as part of training or education, personal development, or experience
 A learning curve is the progress of learning over time (learning performance)
 Think about how you learn
 Some common learning methods
 Memorization; generalization or conceptualization; specialization; pattern recognition; induction/deduction; classification or categorization by patterns or concepts; synthesizing different types of information; reasoning; analogy

Symbolic Machine Learning Approaches
 Concept learning and generalization
 Supervised learning of concepts
 Version space learning using the candidate elimination algorithm
 Decision trees: ID3 and C4.5

Generalization
 Definition of generalization
 Expression P is more general than Q iff P ⊇ Q
 Volleyball, baseball, and football generalize to ball or sports
 Use of generalization operators in a logical representation
 Replacing constants with variables: ball(round, red) generalizes to ball(round, X)
 Dropping conditions from a conjunctive expression: shape(X,round) ^ size(X,small) ^ color(X,red) generalizes to shape(X,round) ^ color(X,red)
 Adding a disjunct to an expression: shape(X,round) ^ size(X,small) ^ color(X,red) generalizes to shape(X,round) ^ size(X,small) ^ (color(X,red) v color(X,blue))
 Replacing a property with its parent in a class hierarchy: color(X,red) generalizes to color(X, primary_color) if primary_color is a superclass of red (following a class hierarchy)

Concept Learning through Generalization
 Definition of concept and concept space
 A concept is a cognitive unit of meaning, an abstract idea or a mental symbol, e.g., defined as a "unit of knowledge" through generalization
 If concept p is more general
than concept q, we say that p covers q. In predicate calculus, p covers q iff q(x) is a logical consequence of p(x)
 A concept space is the set of concepts (candidate concepts) that can be created by learning operators (e.g., generalization, specialization, or others)
 From two positive instances of ball:
 1. size(ball1,small) ^ color(ball1,red) ^ shape(ball1,round)
 2. size(ball2,large) ^ color(ball2,blue) ^ shape(ball2,round)
 these generalize to a ball concept: size(ball,Y) ^ color(ball,Z) ^ shape(ball,round)
 A representation of the concept "ball" can therefore be size(ball,Y) ^ color(ball,Z) ^ shape(ball,round)
 Any sentence that unifies with this general definition represents a ball

Generalization to Cover all Positive Examples in Winston's Learning Program
 [Figure: positive examples] A generalization hierarchy of these concepts is needed as background knowledge

Positive and Negative Examples in Learning the Concept of Arch in Blocks World
 [Figure: negative examples (near misses)] The goal is to exclude all negative examples
 * To specialize a description so that it excludes a near miss, network c adds constraints to network a so that it cannot match network b. A near miss (negative example) helps the learning algorithm determine exactly how to specialize the candidate concept

Specialization from a Concept to Exclude Near Miss
 * Winston's program performs a hill-climbing search without backtracking on the concept space, guided by the training data, so its performance is highly sensitive to the order and quality of the training examples

A Framework for Symbolic Learning
 Supervised (inductive) learning process:
 1. Training with both positive and negative examples (training data) to build a learning model or patterns
 2. Testing to verify the concept learned

Version Space Learning
 Approach used
 A version space is the set of all concept descriptions consistent with the training examples.
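The step in which two positive ball instances generalize to a single ball concept can be sketched in code. This is a toy illustration, not from the slides: instances are tuples of attribute values, and any attribute on which the instances disagree is replaced by a variable X0, X1, ….

```python
# Sketch (illustrative assumption): generalize positive examples by
# replacing attribute values that differ across instances with variables.
def generalize(instances):
    """Return the least general tuple covering all instances:
    keep a value where every instance agrees, else use a variable."""
    result = []
    for i, values in enumerate(zip(*instances)):
        result.append(values[0] if len(set(values)) == 1 else f"X{i}")
    return tuple(result)

# Two positive instances of "ball": (size, color, shape)
ball1 = ("small", "red", "round")
ball2 = ("large", "blue", "round")

concept = generalize([ball1, ball2])
print(concept)  # ('X0', 'X1', 'round') -- size and color generalized away
```

Any instance whose attributes unify with this tuple (variables match anything) is then covered by the learned concept.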
 Inductive learning by generalization using the candidate elimination algorithm (Mitchell, 1978)
 Candidate elimination algorithm
 Generalization and specialization operations are used to define a concept space, starting with the maximally general and maximally specific concepts.
 A concept c is maximally general if it covers none of the negative training examples and, for any other concept c' that covers no negative training example, c ≥ c'. A concept c is maximally specific if it covers all positive examples, none of the negative examples, and, for any other concept c' that covers the positive examples, c ≤ c'.

A Version Space (Concept Space)
 obj(X, Y, Z)
 obj(X, Y, ball)    obj(X, red, Y)    obj(small, X, Y)
 obj(X, red, ball)    obj(small, X, ball)    obj(small, red, X)
 obj(small, red, ball)    obj(large, red, ball)    obj(small, white, ball)

Example of Specific to General Search to Learn the Concept of "Ball"
 Training data:
 Size    Color    Class
 Small   Red      Ball
 Small   White    Ball
 Large   Blue     Ball
 Concept learned from the learning process (training) using only the S set

General to Specific Search for Hypothesis Set G
 [Figure: candidate concepts in the G set]

General to Specific Search of the Version Space Learning the Concept "Ball"
 Concept learned from the learning process (training) using only the G set
 Training data:
 Size    Color    Class
 Small   Red      Brick
 Large   White    Ball
 Large   Blue     Cube
 Small   Blue     Ball

Converging Boundaries of the G and S Sets
 [Figure: the boundary of G converges downward and the boundary of S converges upward; the potential target concepts lie between the two boundaries]

A Version Space Learning for Solving Symbolic Integration, LEX

Evaluation of Candidate Elimination Algorithm
 Pros:
 Because the algorithm is incremental (concept building), it does not require all training examples to be present, unlike other learning approaches.
 Cons:
 Search-based learning must deal with a combinatorial problem space, like all search problems.
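A much-simplified version of the specific-to-general (S set) search for the "ball" concept can be sketched as follows. This illustration assumes only positive examples and a single maximally specific hypothesis; Mitchell's full algorithm also maintains the G set against negative examples.

```python
# Simplified sketch of the S-boundary update in candidate elimination:
# start from the first positive example and minimally generalize the
# maximally specific hypothesis to cover each new positive example.
def update_s(hypothesis, example):
    """Replace each attribute value that disagrees with the example
    by the 'anything' variable '?'."""
    return tuple(h if h == e else "?" for h, e in zip(hypothesis, example))

# Positive training data for "ball": (size, color)
positives = [("small", "red"), ("small", "white"), ("large", "blue")]

s = positives[0]
for example in positives[1:]:
    s = update_s(s, example)
print(s)  # ('?', '?'): any size, any color, i.e. obj(X, Y, ball)
```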
 The algorithm is not noise resistant
 A single misclassified training instance can prevent the algorithm from converging on a consistent concept.
 A possible solution is to maintain multiple G and S sets (an efficiency problem in this case).
 Main contributions of version spaces, and related issues:
 Knowledge representation, generalization, and search in inductive learning; the work also raised general questions related to learning (complexity, expressiveness, use of knowledge and data to guide generalization)
 The role of prior knowledge in learning

A Decision Tree for Credit Risk Assessment
 credit history?
   unknown -> debt?
     high -> high risk
     low -> collateral?
       none -> income?
         $0 to $15k -> high risk
         $15k to $35k -> moderate risk
         over $35k -> low risk
       adequate -> low risk
   bad -> collateral?
     none -> high risk
     adequate -> moderate risk
   good -> debt?
     high -> collateral?
       none -> income?
         $0 to $15k -> high risk
         $15k to $35k -> moderate risk
         over $35k -> low risk
       adequate -> low risk
     low -> low risk

A Simplified Decision Tree for Credit Risk Assessment
 If multiple trees can be created from the same data set, which tree would you use, and why?

Theory that Supports the Simpler Tree
 A pattern needs to be "simpler" than listing out the data set it describes.
 Minimal Message Length (MML) theory
 Given a data set D
 Given a set of hypotheses H1, H2, ..., Ht that describe D
 These H can be classifiers, clusters, theories, etc.
 The MML principle states that we should choose the Hi for which the quantity Mlength(Hi) + Mlength(D|Hi) is minimized, where Mlength(Hi) is the minimum number of bits needed to specify Hi, and Mlength(D|Hi) is the minimum number of bits needed to describe the data given that Hi is true.
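The MML comparison can be made concrete with a toy cost model for describing a set of even numbers. The bit costs below are invented purely for illustration; only the relative growth of the two description lengths matters.

```python
# Toy MML cost model (assumption for illustration only): writing down
# an integer costs 7 bits; a primitive concept token costs 3 bits.
INT_BITS, TOKEN_BITS = 7, 3

def cost_list(data):
    """H1: describe the data by listing every element."""
    return len(data) * INT_BITS

def cost_rule():
    """H2: 'all even numbers >= 2 and <= 100' = two bounds + two tokens."""
    return 2 * INT_BITS + 2 * TOKEN_BITS  # bounds, '<=', 'even'

data = list(range(2, 101, 2))  # the even numbers 2..100
print(cost_list(data), cost_rule())  # 350 20 -- the rule wins
```

As the set grows, cost_list grows linearly while cost_rule stays fixed, so MML increasingly prefers the rule hypothesis.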
 Example: represent the set {2, 4, 6, ..., 100}
 H1: list all 50 integers; cost of H1 = cost of listing 50 integers
 H2: "all even numbers >= 2 and <= 100"; cost of H2 = cost of listing 2 integers + cost of "<=" + cost of the "even" concept
 According to MML, if cost(H1) > cost(H2), then H2 is preferable; as the size of the set increases, the cost of H1 increases much more rapidly.

Information Theory
 Entropy is a measure of the uncertainty in a random variable. Shannon entropy quantifies the expected value of the information contained in a message, usually in units such as bits. For a random variable X with n outcomes {x1, ..., xn}, the Shannon entropy H(X) is defined as:
 H(X) = -Σ(i=1 to n) p(xi) log2 p(xi)
 where p(xi) is the probability mass function of outcome xi
 H(X) is the average unpredictability in a random variable, which is equivalent to its information content (assuming that communication may be represented as a sequence of independent and identically distributed random variables).
 The more unpredictable the outcome (increased uncertainty), the larger the entropy, and the more bits are required to carry the information. The more predictable the outcome (reduced uncertainty, i.e., more information about the outcome), the smaller the entropy, and the fewer bits are needed.
 A fair coin toss has one bit of entropy, since there are two possible outcomes that occur with equal probability: H(coinToss) = -p(h)log2 p(h) - p(t)log2 p(t) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit, i.e., learning the actual outcome conveys one bit of information.
 If we have more information, say the coin is rigged to come up heads 75% of the time, then H(coinToss) = -3/4 log2(3/4) - 1/4 log2(1/4) = 0.811 bits.
 What is the entropy of a coin toss with a coin that has two heads and no tails?
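The entropy figures above, including the answer to the closing question, can be checked with a few lines of Python (a sketch, not part of the slides):

```python
from math import log2

def entropy(probs):
    """Shannon entropy H(X) = -sum p(x) log2 p(x), in bits.
    Outcomes with zero probability contribute nothing to the sum."""
    return sum(-p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))              # fair coin: 1.0 bit
print(round(entropy([0.75, 0.25]), 3))  # rigged coin: 0.811 bits
print(entropy([1.0, 0.0]))              # two-headed coin: 0.0 bits
```

The two-headed coin has zero entropy: the outcome is certain, so learning it conveys no information.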
Information Theoretic Test Selection
 For a data set T, the entropy H(T) = -Σ p(Ci) log2 p(Ci) is a measure of the amount of uncertainty in T about the occurrence of each class Ci
 H(T) is the average amount of information needed to identify the class Ci of an element of T
 p(Ci) is the proportion of the number of elements in class Ci to the number of elements in T
 When H(T) = 0, the set T is perfectly classified (all instances of T are of the same class)
 After T is divided into m possible subsets {v1, v2, ..., vm} using an attribute X, the expected information needed to complete the tree after selecting X is E(X) = Σ (|vi|/|T|) * H(vi)
 That is, the amount of information needed to complete the tree is the weighted average of the information in all its subtrees
 |vi| and |T| are the numbers of instances that belong to the sets vi and T, respectively
 NOTE: H(vi) should be calculated over the classes Ci
 The information gain from attribute X is gain(X) = H(T) - E(X)
 ID3 chooses the attribute that provides the greatest information gain, i.e., the smallest E(X), by the "simplicity rule"
 The more information/knowledge we already have, the fewer bits we need to describe the result!

Example: Calculating Information Gain
 Loan Data
 H(LoanData) = -6/14 log2(6/14) - 3/14 log2(3/14) - 5/14 log2(5/14) = 1.531 bits. This holds for any tree that covers the examples in the data set.
 The partitions by income are: V1 = {1,4,7,11}, V2 = {2,3,12,14}, V3 = {5,6,8,9,10,13}
 From the partitions, |v1|/|T| = 4/14, |v2|/|T| = 4/14, |v3|/|T| = 6/14
 H(v1) = -Σ p(Ci) log2 p(Ci) = 0.0 since all instances are in the "high" class; H(v2) = 1.0; H(v3) = 0.65
 The expected information needed to complete the tree for income is E(income) = 4/14 * H(V1) + 4/14 * H(V2) + 6/14 * H(V3) = 4/14 * 0.0 + 4/14 * 1.0 + 6/14 * 0.65 = 0.564 bits
 The information gain from choosing income is gain(income) = H(LoanData) - E(income) = 1.531 - 0.564 = 0.967 bits.
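The arithmetic above is easy to verify programmatically. The per-partition class counts below are an assumption chosen to be consistent with the entropies quoted on the slide (H(v1) = 0.0, H(v2) = 1.0, H(v3) = 0.65):

```python
from math import log2

def entropy(counts):
    """Entropy (bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c > 0)

# Loan data: 14 examples, classes high/moderate/low = 6/3/5
h_t = entropy([6, 3, 5])            # H(LoanData), about 1.531 bits

# Income partitions V1, V2, V3 with assumed class counts 4, 2/2, 5/1
partitions = [[4], [2, 2], [5, 1]]
e_income = sum(sum(v) / 14 * entropy(v) for v in partitions)

print(round(h_t, 3), round(e_income, 3))  # 1.531 0.564
print(round(h_t - e_income, 3))           # gain(income)
```

Note that the unrounded gain is 0.966 bits; the slide's 0.967 comes from subtracting the already-rounded intermediate values.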
 Similarly, we calculate the information gain for all other attributes: gain(credit history) = 0.266, gain(debt) = 0.633, gain(collateral) = 0.206.
 So we choose the attribute income for the partition, which produces the subtrees.
 For each subtree, repeat the same procedure.
 Can we choose the same attribute as the parent node?

Decision Tree Induction as Search through a Concept Space
 Representation: tree
 Concepts represented in the tree
 A class concept can be described by attribute-value pairs.
 Concept of high risk: (income = $0-$15k) OR (income = $15k-$35k AND credit history = unknown AND debt = high) OR (income = $15k-$35k AND credit history = bad)
 High-risk rule: IF (income = $0-$15k) OR (income = $15k-$35k AND credit history = unknown AND debt = high) OR (income = $15k-$35k AND credit history = bad) THEN the application is high risk
 The concept space (version space) is the set of all possible decision trees that can be created from the training examples
 Operations
 Moving through this space consists of adding tests to a tree.
 ID3 as a search method
 ID3 implements a form of greedy search in the space of all possible trees, without backtracking. Very efficient!

Evaluation of Decision Tree
 Advantages
 It is easy to convert a decision tree into a set of rules:
 LHS: the decisions leading to the leaf node
 RHS: the class outcome
 Common issues
 Bad data, e.g., inconsistent data or missing data
 Continuous attributes: nominal attributes are ideal; for continuous values, break the values into subsets (groups of values).
 A large data set may produce a large tree.
 C4.5 and later versions addressed many of these issues
 Bagging produces replicate training sets, called bootstrap samples, by sampling with replacement from the training instances. Boosting maintains a weight for each classifier; the weight reflects the classifier's importance.
Multiple classifiers produced from bootstrap samples are combined by averaging or voting to form a composite classifier. In bagging, each component classifier has the same vote, while boosting assigns different voting strengths to the component classifiers on the basis of their accuracy.
 These techniques can be used with other learning or regression approaches as well.
 To handle a large data set, divide the data into subsets, build the decision tree on one subset, and then test its accuracy on the other subsets.
 Many variations exist, e.g., CART

Inductive Bias
 Induction depends on prior knowledge and assumptions about the nature of the concepts being learned.
 Inductive bias refers to any criteria a learner uses to constrain the concept space or to select concepts within that space.
 Examples of inductive biases
 Use of domain-specific knowledge and assumptions, such as heuristics
 Syntactic constraints on the representation of learned concepts, to limit the size of the space
 Bit representations, decision trees, feature vectors, rules, Horn clauses, etc.
 Conjunctive forms, with a limited number of disjunctive forms added to increase the expressiveness of the representation
 Reasons for inductive bias
 The learning space is so large that search-based learning becomes impractical.
 Training data are only a subset of all instances in the domain; since data alone are not enough, the learner must make additional assumptions about "likely" concepts, perhaps in the form of heuristics.
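The bagging scheme described in the decision tree evaluation slide, bootstrap samples combined by equal-vote majority, can be sketched with stub classifiers. Everything here (the 1-nearest-neighbour stand-in for tree induction, the toy data set, and the ensemble size) is an illustrative assumption, not the slides' method:

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """One bootstrap sample: len(data) items drawn with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_predict(classifiers, x):
    """Bagging: every component classifier gets one equal vote."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Toy stand-in for tree induction: a 1-nearest-neighbour "classifier"
# built from one bootstrap sample of (feature, label) pairs.
def train(sample):
    return lambda x: min(sample, key=lambda fl: abs(fl[0] - x))[1]

data = [(1, "low"), (2, "low"), (8, "high"), (9, "high")]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
ensemble = [train(bootstrap(data, rng)) for _ in range(15)]
print(bagged_predict(ensemble, 1.5))  # majority vote of the 15 components
```

Boosting would replace the equal vote in bagged_predict with per-classifier weights derived from each component's accuracy.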