
Comparing Data Mining Systems: Commercial Options and Selection Criteria - Prof. Jennifer, Study notes of Data Analysis & Statistical Methods

An overview of data mining systems, discussing the importance of choosing the right system based on various dimensions such as data types, data mining functions and methodologies, coupling with databases, and data visualization. It also lists several example data mining systems and their features.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

koofers-user-bm2
Data Mining
CS57300 / STAT 59800-024
Purdue University
April 28, 2009

Data mining systems

How to choose a data mining system
• Commercial data mining systems have little in common
  • Different data mining functionality or methodology
  • May even work with completely different kinds of data
• Need to consider multiple dimensions in selection
  • Data types: relational, transactional, sequential, spatial?
  • Data sources: ASCII text files? Multiple relational data sources? Support for open database connectivity (ODBC) connections?
  • System issues: running on only one or on several operating systems? A client/server architecture? Web-based interfaces, with XML data allowed as I/O?

Choosing a system
• Dimensions (cont.):
  • Data mining functions and methodologies
    • One vs. multiple data mining functions
    • One vs. a variety of methods per function
    • More functions and more methods per function give the user greater flexibility and analysis power
  • Coupling with DB and/or data warehouse systems
    • Four forms of coupling: no coupling, loose coupling, semitight coupling, and tight coupling
    • Ideally, a data mining system should be tightly coupled with a database system

Top Ten Data Mining Mistakes
(source: John Elder, Elder Research)

You’ve made a mistake if you...
• Lack data
• Focus on training
• Rely on one technique
• Ask the wrong question
• Listen (only) to the data
• Accept leaks from the future
• Discount pesky cases
• Extrapolate
• Answer every inquiry
• Sample casually
• Believe the best model

0: Lack data
• Need labeled cases for best gains
• Interesting known cases may be exceedingly rare
• Should not proceed until enough critical data is gathered to make analysis worthwhile
• Example: credit scoring
  • Company randomly gave credit to thousands of applicants who were risky by conventional scoring method, and monitored them for two years
  • Then they estimated risk using what was known at the start
  • This large investment in creating relevant data paid off

1: Focus on training
• Only out-of-sample results matter
• Example: cancer detection
  • MD Anderson doctors and researchers (1993), using neural networks, were surprised to find that longer training (week vs. day) led to only slightly improved training results, and much worse evaluation results
• Resampling (bootstrap, cross-validation, jackknife, leave-one-out...) is an essential tool for evaluation
• Note that resampling no longer tests a single model, but a model class, or a modeling process

2: Rely on one technique
• "To a person with a hammer, all the world's a nail."
• For best work, need a whole toolkit
• At the very least, compare your method to a conventional one (e.g., naive Bayes, logistic regression)
• It’s somewhat unusual for a particular modeling technique to make a big difference, and when it will is hard to predict
• Best approach: use a handful of good tools (each adds only 5-10% effort)

[Figure: Relative Performance Examples: 5 Algorithms on 6 Datasets (with Stephen Lee, U. Idaho, 1997). Datasets: Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, Investment. Algorithms: Neural Network, Logistic Regression, Linear Vector Quantization, Projection Pursuit Regression, Decision Tree. Y-axis: Error Relative to Peer Techniques (lower is better), .00 to 1.00. © 2004 Elder Research, Inc.]

5: Accept leaks from the future
• Example:
  • Forecasting interest rate at Chicago Bank
  • Neural network was 95% accurate, but output was a candidate input
• Example 2:
  • Used moving average of 3 days, but centered on today
• Look for variables which work (too) well
  • Example: Insurance code associated with 25% of purchasers turned out to describe type of cancellation
• Need domain knowledge about collection process

6: Discount pesky cases
• Outliers may be skewing results (e.g. decimal point error on price) or be the whole answer (e.g. Ozone hole), so examine carefully!
• The most exciting phrase in research isn't "Aha!" but "That's odd..."
• Inconsistencies in the data may be clues to problems with the information flow process
• Example: Direct mail
  • Persistent questioning of oddities found errors in the merge-purge process and was a major contributor to doubling sales per catalog

7: Extrapolate
• Tend to learn too much from first few experiences
• Hard to "erase" findings after an upstream error is discovered
• Curse of Dimensionality: low-dimensional intuition is useless in high dimensions
• Human and computer strengths are more complementary than alike

8: Answer every inquiry
• "Don't Know" is a useful model output state
• Could estimate the uncertainty for each output (a function of the number and spread of samples near X)
• However, few algorithms provide an estimate of uncertainty along with their predictions

9: Sample without care
• Example: Down sampling
  • MD Direct Mailing firm had too many non-responders (NR) for model (about 99% of >1M cases)
  • Took all responders, and every 10th NR to create a more balanced database of 100K cases
  • Model predicted that everyone in Ketchikan, Wrangell, and Ward Cove Alaska would respond
  • Why? Data was sorted by zipcode, and the 100Kth case was drawn before reaching the bottom of the file (i.e., 999**)
• Solution: Add random variables to candidate list
  • Use as "canaries in the mine" to signal trouble

9: Sample without care (cont.)
• Example: Up sampling in credit scoring
  • Paucity of interesting cases led to quintupling them
  • Cross-validation employed with many techniques and modeling cycles → results tended to improve with the complexity of the models (instead of the reverse)
  • Noticed that rare cases were better estimated by complex models but others were worse
  • Had duplicated cases in each set by upsampling before splitting → need to split first!
• It's hard to beat a stratified sample (a proportional sample from each group of interest)

[Figure: Median (and Mean) Error Reduced with each Stage of Combination. X-axis: No. Models in combination (1 to 5). Y-axis: Missed (55 to 75). © 2004 Elder Research, Inc.]

How to succeed?
• More complex tools and harder problems → more ways to make mistakes
• Don’t expect too much of technology alone!
• Success ← Learning ← Experience ← Mistakes
• Persistence: Attack repeatedly, from different angles
• Collaboration: Domain and statistical experts need to cooperate
• Humility: Learning from mistakes requires vulnerability

Myths and pitfalls of data mining
(source: Tom Khabaza, DMReview)

Myth #1
• Data mining is all about algorithms
• Data mining is a process consisting of many elements, such as formulating business goals, mapping business goals to data mining goals, acquiring, understanding and preprocessing the data, evaluating and presenting the results of analysis, and deploying these results to achieve business benefits
• A problem occurs when data miners focus too much on the algorithms and ignore the other 90-95 percent of the data mining process

Myth #2
• Data mining is all about predictive accuracy
• Predictive models should have some degree of accuracy because this demonstrates that they have truly discovered patterns in the data
• However, the usefulness of an algorithm or model is also determined by a number of other properties, one of which is understandability
• This is because the data mining process is driven by business expertise -- it relies on the input and involvement of non-technical business professionals in order to be successful

Myth #3
• Data mining requires a data warehouse
• Data mining can benefit from warehoused data that is well organized, relatively clean and easy to access
• But warehoused data may be less useful than the source or operational data -- in the worst case, warehoused data may be completely useless (e.g. if only summary data is stored)
• Data mining benefits from a properly designed data warehouse, and constructing such a warehouse often benefits from doing some exploratory DM

Pitfalls
5. Insufficient Data Knowledge
• In order to perform data mining, we must be able to answer questions such as: What do the codes in this field mean? Can there be more than one record per customer in this table?
• In some cases, this information is surprisingly hard to come by
6. Erroneous Assumptions (courtesy of experts)
• Business and data experts are crucial resources, but this does not mean that the data miner should unquestioningly accept every statement they make

Pitfalls
7. Incompatibility of Data Mining Tools
• No toolkit will provide every possible capability, especially when the individual preferences of analysts are taken into account, so the toolkit should interface easily with other available tools and third-party options
8. Locked in the Data Jail House
• Some tools require the data to be in a proprietary format that is not compatible with commonly used database systems
• This can result in high overhead costs and create difficulty in deployment into an organization's system

Announcements
• Next class: Semester review
• Final report due in class
• Please complete online student evaluations
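
Supplementary sketch for mistake 9 (up sampling before splitting). This is a minimal Python illustration and is not taken from the lecture: the toy data, the helper names (make_cases, split, upsample, shared_ids), the 30% test fraction and the random seed are all assumptions chosen for the example. It counts how many case ids end up in both the training and the test set under the two orderings.

```python
import random


def make_cases(n_common=1000, n_rare=20):
    """Build toy labeled cases as (case_id, label); label 1 marks the rare class."""
    cases = [(i, 0) for i in range(n_common)]
    cases += [(n_common + i, 1) for i in range(n_rare)]
    return cases


def split(cases, test_fraction, rng):
    """Shuffle a copy of the cases and cut it into (train, test)."""
    shuffled = cases[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def upsample(cases, factor):
    """Repeat every rare case so it appears `factor` times (5 = quintupling)."""
    rare = [c for c in cases if c[1] == 1]
    return cases + rare * (factor - 1)


def shared_ids(train, test):
    """Case ids present in both sets, i.e. duplicates leaked across the split."""
    return {c[0] for c in train} & {c[0] for c in test}


rng = random.Random(0)  # fixed seed so the run is repeatable
cases = make_cases()

# Mistake: upsample, then split. Copies of one rare case can land on both sides.
bad_train, bad_test = split(upsample(cases, 5), 0.3, rng)

# Fix: split first, then upsample the training portion only.
train, test = split(cases, 0.3, rng)
good_train = upsample(train, 5)

print("leaked ids, upsample then split:", len(shared_ids(bad_train, bad_test)))
print("leaked ids, split then upsample:", len(shared_ids(good_train, test)))
```

Splitting first guarantees that the second count is zero, because every duplicate is created from a case that is already on the training side. With the first ordering, a more complex model can score well on the test set simply by memorizing rare cases it has already seen, which matches the pattern described in the credit scoring example.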