Recent Advances in Knowledge Discovery and Data Mining Padhraic Smyth Information and Computer Science University of California, Irvine and Jet Propulsion Laboratory smyth@ics.uci.edu www.ics.uci.edu/~smyth/ acknowledgements: Gregory Piatetsky-Shapiro, Heikki Mannila, Sam Uthurusamy, Usama Fayyad, Evangelos Simoudis, Pedro Domingos, Rick Lathrop Take-home Messages n Massive data sets are revolutionizing data analysis - broad range of practical applications problems - fundamental open research problems n A very rich potential area for AI - many opportunities which have not yet been tapped - a call to action! Outline n Motivation and Context n Emerging Applications n Methods and Algorithms n Research Opportunities for AI Advanced Scout n Background - every NBA game is annotated (each pass, shot, foul, etc.) - potential competitive advantage for coaches - Problem: over a season, this generates alot of data! n Solution (Bhandari et al, IBM, 1997) - ³attribute focusing² finds conditional ranges on attributes where the distributions differ from the norm generates descriptions of interesting patterns e.g., ³Player X made 100% of his shots when when Player Y was in the game: X normally makes only 50% of his shots² n Status - used by 28 of the 29 teams in the NBA - an intelligent assistant AT&T Classification of Telephone Numbers n Background - AT&T has about 100 million customers - It logs 200 million calls per day, 40 attributes each - 250 million unique telephone numbers - Which are business and which are residential? n Solution (Pregibon and Cortes, AT&T,1997) - Proprietary model, using a few attributes, trained on known business customers to adaptively track p(business|data) - Significant systems engineering: data are downloaded nightly, model updated (20 processors, 6Gb RAM, terabyte disk farm) n Status: - running daily at AT&T - HTML interface used by AT&T marketing The Data Revolution n Context - ³.. drowning in data, but starving for knowledge² - Ubiquitous in business, science, medicine, government - Users cannot handle data manually or using traditional methods n Viewpoint: data as a resource - Data themselves are not of direct use - How can we leverage data to make better decisions ? Technology is a Driving Factor n Larger, cheaper memory - Moore¹s law for magnetic disk density ³capacity doubles every 18 months² (Jim Gray) - storage cost per byte falling rapidly n Faster, cheaper processors - can analyze more data - apply more complex models, search techniques Massive Data Sets n Characteristics - very large N (billions) - very large d (thousands or millions) - heterogeneous - dynamic High-dimensional data example n Volume of sphere relative to cube in d dimensions? High-dimensional data example n high-d => most data points will be ³out² at the corners n high-d space is sparse: and non-intuitive Examples of Massive Data Sets n US Treasury Financial Crimes Enforcement Network - 200,000 large-cash transactions reported per week - application: detect money laundering n NASA¹s Earth Observing satellites - June 1998, multispectral spatial images, 80 Gbytes/day - application: classify pixels for atmospheric monitoring n Information Resources Inc. - market scanner data, 14000 stores - 4 million products, 1 terabyte of historical data - application: spot trends in consumer behavior Some Nomenclature n Data Mining - ³extracting structure from large data sets² - currently popular in business circles (high hype factor) - largely derived from Machine Learning - traditionally a disparaging term in statistics n Knowledge Discovery in Databases (KDD) - title of workshop/conference series and edited books - reflects its AI origins n Data mining is one step in the overall iterative process of discovering knowledge from data A Short History of Data Analysis picture showing timeline of how pattern recognition, databases, statistics, and machine learning have "converged" to data mining. Supporting Technologies picture with circles for DB, ML, pattern recognition, visualization, statistics, with Data Mining in the middle intersecting all of them. Characteristics of KDD and Data Mining - ³Complexity-aware² € scalability in time and memory € computationally-intensive search over representations - Exploratory, Local Analysis € patterns versus models - The Human Interface € an interactive process € interpretable results (rules, trees, graphs) - Applications-driven € data used for decisions, actions Outline n Motivation and Context n Emerging Applications - general trends - description of success stories n Methods and Algorithms n Research Opportunities for AI Application Areas in Data Mining n ³Industrial² Applications - Retail goods € e.g., ³market basket analysis² - Telecommunications € e.g., predicting which customers will switch carrier - Banks and Financial services € e.g., finding the most profitable credit-card holders - Insurance and Health services € detecting fraudulent claims - Web Monitoring € exploring Web site usage n Common Themes - dynamic, online: track the customer! - often proprietary APPLICATION: Firstar Bank n Background - large bank and credit-card provider, Milwaukee - uses direct mailings to find new customers - problem: focus mailings on ³high-payoff² customers n Solution (HNC Software, 1997) - analyze characteristics of existing customers (800 variables) - trained a proprietary model to ³score² customers - predict which products to market to which customer n Status - response rate on mailings increased four-fold APPLICATION: Australian Health Insurance Commission n Background - processes 300 million claims annually - 1.3 terabytes: every aspect of a healthcare ³transaction² - problem: controlling prescription costs n Solution (Agrawal et al, VLDB 1996) - Uses IBM¹s commercial data mining package - Generates daily snapshots of trends and patterns - Supports ³drill-down² for detailed analysis n Status - association rules led to savings of 2 to 3 $million per year - system in routine use: € e.g., summary letters mailed to every doctor in Australia Application Areas (continued) n Scientific Applications - Pharmaceutical € drug design and combinatorial chemistry - Computational Biology € finding structure in protein sequences - Astronomy € clustering the Sloan survey to find new galaxy types - Earth Sciences € detecting unusual changes in the atmosphere n Common themes: - scientists are swamped by massive data sets - data mining provides invaluable ³focusing² tools APPLICATION: SKICAT n Background - 2nd Palomar Sky-Survey, NSF-funded, on-going - Digitized images will contain 6 million galaxies, 2 billion stars - Problem: produce catalogs from the images n Solution (Weir, Fayyad, Djorgovski, 1994) - Image analysis calculated 40 attributes per sky-object - Classification tree trained to classify stars v. galaxies - over 90% accurate: previously an unsolved problem - coupled with commercial (Sybase) database system n Status - in routine automated use: has catalogued 250,000 objects - has led to the discovery of 22 quasars - current work focusing on clustering rare objects APPLICATION: Clustering Atmospheric Patterns n Background - ³pressure maps² of Northern Hemisphere, since 1947 - 18 million observations, 32,000 maps. - problem: are there clusters of patterns which recur? - significance: understanding of long-term climate variability n Solution (Smyth et al, KDD-97) - principal components used to project maps to low-d space - probabilistic mixture models used to cluster in low-d space - cross-validation objectively determines the number of clusters n Status - first objective confirmation in atmospheric science that € there exist recurrent patterns in the Northern Hemisphere € clear evidence for 3 clusters Trends n Applications-oriented - very practical focus - business and science users (not experts in data analysis) n Coupling of Algorithms and Architecture - ³systems engineering² for large data sets n Identification of Niche Problem Areas - telephone call record analysis - market basket analysis - insurance/health care fraud detection - market segmentation - astronomical image analysis Outline n Motivation and Context n Emerging Applications n Methods and Algorithms - common tasks in data mining - current techniques and algorithms n Research Opportunities for AI What Users want to do with their Data n Summarize - condensed descriptive patterns € e.g., customers who buy soap often buy toothbrushes also n Understand - generative models: clusters, hierarchies, density estimates € e.g., there are 6 separate customer groupings n Discover - detect novel patterns relative to the norm € e.g., detect fraudulent use of customer¹s credit card n Predict - predictive models: classification, regression € e.g., predict next week¹s sales given this week¹s data Data Mining Methods n Summarization, Local Patterns - association rules - visualization, exploratory analysis n Descriptive Models - clustering, segmentation - graphical models n Change and Deviation Detection - outliers relative to a normative model n Predictive Modeling - classification - regression Data Analysis Algorithms Algorithm = task + model + fit criterion + search n Examples: - C5.0, CART, etc. = Classification + Tree + Error rate + Hill-climbing - Neural network = Regression + Nonlinear + Squared error + Hill-climbing - Association Rules = Description + Rules + Support/Confidence + Branch-and-bound EXAMPLE: Transaction Data and Association Rules n Supermarket example: (Srikant and Agrawal, 1997) - #items = 500,000, #transactions = 1.5 million n Example of an Association Rule If a customer buys beer they will also buy diapers - p(diapers|beer) = ³confidence² - p(beer) = ³support² Association Rule Algorithms n Characteristics: - Closely coupled with database: scale efficiently - produce ³local patterns²: user must still filter, interpret patterns - descriptive rather than inferential - simple, easy to understand: wide use in transaction applications Trends in Data Mining Algorithms n Databases: - highly efficient algorithms coupled closely with database architectures € emphasis on speed of execution (linear time) n Statistics: - scalability to massive, high-dimensional data sets € emphasis on quality of fitted model n Other areas: - machine learning: scaleable, interpretable models - visualization: interactive exploratory data analysis - supercomputing: parallel data analysis Outline n Motivation and Context n Current Status n Emerging Applications n Research Opportunities for AI - specific areas where AI can contribute - agenda for AI research in this context Opportunities n Machine Learning n Knowledge Representation n Anytime Algorithms n Reasoning with Uncertainty n Planning n Heuristic Search n Autonomous Agents n Machine Discovery n Natural Language, Vision An Analogy n AI: Modeling Humans as Agents - must deal with vast streams of perceptual data € massive € dynamic, online € heterogeneous - data are a means for taking action - agent is resource-limited n ....... the human agent problem has alot in common with the data mining problem! Anytime Exploratory Data Analysis n Traditional statistical view of data analysis - resources are unlimited - look for optimal solutions - ......but with massive data sets... € there are resource limits on data exploration n Anytime, flexible, data analysis - fundamental idea: € tradeoff quality of learned model with resources required - e.g., look for simple, ³broad² structure first - report results to user in continuous manner - interactive, adaptive strategies € focus attention on interesting local regions € online learning of which models best describe data Planning and Data Analysis n Massive Data Analysis is a complex, iterative process - preprocessing, focusing, ³cleaning² - subsampling, caching - visualizing, transforming - model-fitting, pattern-finding - evaluating n Very difficult problem! - enormous branching factors - user¹s goals may not be clear - significant degree of uncertainty n Planning and heuristic search can play a useful role - also relevant may be cognitive models of users Knowledge Representation and Learning n Data Mining needs to be in the hands of domain experts - need better representations for the human interface n Current Status - KR in data mining = simple rules - popular because they are interpretable! - even other statistical, ML models are relatively ³flat² - ...... => limits how we interact with the data n Need better languages for interacting with data - increase both expressive power and interpretability - e.g., hierarchies, relations, non-standard data-types - need to be robust: inference is uncertain € couple ideas from logical KR with statistical inference Data Analysis and Domain Knowledge n Where is domain knowledge useful in data mining? - ³data cleaning² € e.g., a male patient who is ³pregnant² - prior knowledge in learning € e.g., causal knowledge in a medical database - interpreting results € e.g., eliminating rules which are already known to the user n Techniques used in practice - Bayesian estimation, causal networks n Opportunities for AI - integration of domain knowledge and data mining € e.g., ontologies for data cleaning and pattern interpretation - interactive techniques for eliciting prior knowledge Real-World Machine Learning n Data Mining applications often characterized by: - scale (massive) - dynamic environments - heterogeneous data types n Traditionally machine learning has focused on: - small-scale - static environments - ³flat-file² data types n .... demand will be for - scaleable algorithms - online, adaptive, active learning - mixtures of flat file, image, text, sequences, HTML, etc. Heuristic Search and Model Exploration n Current data mining algorithms use relatively simple search techniques - e.g., hill-climbing with multiple restarts n Massive data leads to more complex models - => more local maxima in parameter space - e.g., clustering galaxy data: € hundreds of clusters € small (valuable) clusters are extremely hard to find - heuristic search could be very useful, e.g., € initial searches find broad structure € subsequent searches take advantage of earlier information Other Opportunities for AI in Data Mining n Machine Discovery - particularly for science applications - semi-automated, interactive, discovery systems n Autonomous Agents: - using unused CPU cycles to search for unusual patterns - e.g., Web-based data mining n Computer Vision and Pattern Recognition - ³query by content² for image databases - notions of distance, scale for heterogeneous data - interactive systems for harnessing user¹s visual recognition abilities Conclusions n ³Terror-bytes² of data are here to stay! n Traditional techniques are useful, n ....but there are large ³gaps² in our arsenal n Artificial Intelligence can play a key role - many immediate, practical applications - challenging fundamental research questions Resources n http:10/23/97/www.kdnuggets.com n ournal of Data Mining and Knowledge Discovery (1997) n Proceedings - 1st, 2nd, 3rd, Intl KDD Conferences, 95,96,97, AAAI Press - VLDB, SIGMOD workshops and conferences - 29th Interface on Computing Science and Statistics, 1997 n Other reading - Communications of the ACM, November 1996 - Knowledge Discovery in Databases, Piatetsky-Shapiro and Frawley, AAAI Press, 1991. - Advances in Knowledge Discovery and Data Mining, Fayyad et al, AAAI/MIT Press, 1996.