What is Data Mining?
Data mining is the computational process of discovering patterns, trends, and behaviors in large data sets using artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Data mining is considered a synonym for another popular term, KDD (knowledge discovery in databases), and is an essential step in the process of predictive analytics. Mining and extracting useful information from existing data is interdisciplinary work involving mathematicians, statisticians, and computer programmers.
Data Mining Definition
Strictly speaking, the more accurate term for data mining would be data discovery, but in practice the term covers collection, extraction, warehousing, analysis, statistics, artificial intelligence, machine learning, and business intelligence. Statistics provides the tools for data analysis, while machine learning contributes the different learning methodologies. Before data can be mined it must be cleaned, which removes errors and ensures consistency. Common data mining methods include generalization, characterization, classification, clustering, association, evolution analysis, pattern matching, data visualization, and meta-rule guided mining.
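The cleaning step mentioned above can be sketched with pandas. This is a minimal, hypothetical example — the column names and quality problems (a duplicate record, missing values, inconsistent casing) are assumptions made for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical raw records with common quality problems:
# a duplicate row, missing values, and inconsistent casing.
raw = pd.DataFrame({
    "customer": ["Alice", "alice", "Bob", "Carol"],
    "city": ["NYC", "NYC", None, "Boston"],
    "spend": [120.0, 120.0, 80.0, np.nan],
})

# Normalize casing so "Alice" and "alice" compare equal.
raw["customer"] = raw["customer"].str.title()

# Remove the exact duplicate exposed by the normalization step.
clean = raw.drop_duplicates()

# Fill missing numeric values with the column median, and
# missing categories with an explicit "Unknown" label.
clean = clean.assign(
    spend=clean["spend"].fillna(clean["spend"].median()),
    city=clean["city"].fillna("Unknown"),
)

print(len(clean))                 # duplicate row removed -> 3 rows
print(clean.isna().sum().sum())   # no missing values remain -> 0
```

The order matters: normalizing casing before deduplication is what lets `drop_duplicates` catch records that differ only in formatting.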
The knowledge discovery in databases process has been defined under several different frameworks.
Data Mining Definition - Simplified
(1) Pre-processing, (2) Data mining, and (3) Results validation.
Knowledge Discovery in Databases (KDD) process definition:
(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data Mining
(5) Interpretation/Evaluation
Cross Industry Standard Process for Data Mining (CRISP-DM) definition:
(1) Business Understanding
(2) Data Understanding
(3) Data Preparation
(4) Modeling
(5) Evaluation
(6) Deployment
Data Mining Overall Plan
The SAS Institute's overall plan for data mining is known as SEMMA. The plan has five steps: sample, explore, modify, model, and assess.
Step 1: Sample
Extract a portion of a large data set big enough to contain the significant information yet small enough to manipulate quickly.
Step 2: Explore
Search speculatively for unanticipated trends and anomalies so as to gain understanding and ideas.
Step 3: Modify
Create, select, and transform the variables to focus the model construction process.
Step 4: Model
Search automatically for a variable combination that reliably predicts a desired outcome.
Step 5: Assess
Evaluate the usefulness and reliability of findings from the data mining process.
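The five SEMMA steps above can be sketched end-to-end with scikit-learn. This is an illustrative walkthrough only — the bundled Iris data, the 70/30 split, and the logistic regression model are assumptions for demonstration, not part of SAS's definition:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Sample: extract a portion of the data small enough to work with quickly.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0)

# Explore: inspect basic statistics to look for trends and anomalies.
print("feature means:", X_train.mean(axis=0))

# Modify: transform the variables to focus model construction.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Model: search for a variable combination that predicts the outcome.
model = LogisticRegression(max_iter=200).fit(X_train_s, y_train)

# Assess: evaluate the usefulness and reliability of the findings
# on data the model has never seen.
print("accuracy:", accuracy_score(y_test, model.predict(X_test_s)))
```

Note that the scaler is fit on the training sample only and merely applied to the held-out data, so the Assess step measures performance on genuinely unseen records.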
Tasks in Data Mining
Data mining tasks involve the automatic or semi-automatic analysis of large volumes of data to extract previously unknown, interesting patterns. These include cluster analysis, anomaly detection on unusual records, and dependency checks using association rule mining, and they usually rely on database techniques such as spatial indices. The patterns identified provide a summary of the input data and can be used in further analysis, or in machine learning and predictive analytics. Applying data mining methods to samples of data is known as data dredging, data fishing, or data snooping. Mining techniques are employed in different kinds of databases, including relational, transactional, object-oriented, spatial, and active databases.
Data mining involves six common classes of tasks:
1. Anomaly detection - Also called outlier or deviation detection; the identification of unusual data records or data errors that require further investigation.
2. Association rule learning - Also called dependency modeling; searches for relationships between variables. Commonly known as market basket analysis.
3. Clustering - The task of discovering groups and structures in the data that are similar in some way, without using known structures in the data.
4. Classification - The task of generalizing known structure to apply to new data.
5. Regression - The task of finding a function that models the data with the least error.
6. Summarization - Provides a more compact representation of the data set, including visualization and report generation.
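Two of these task classes — clustering and anomaly detection — can be illustrated on synthetic data with scikit-learn. The generated groups, the single planted outlier, and the model parameters here are assumptions chosen for demonstration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Two well-separated groups of points, plus one obvious outlier.
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
outlier = np.array([[20.0, 20.0]])
data = np.vstack([group_a, group_b, outlier])

# Clustering: discover groups without using any known labels.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

# Anomaly detection: flag records that deviate from the rest
# (fit_predict returns -1 for points judged to be outliers).
flags = IsolationForest(contamination=0.05, random_state=0).fit_predict(data)
print("outlier flagged:", flags[-1] == -1)
```

Clustering and anomaly detection are both unsupervised: neither model is given labels, which is exactly what distinguishes them from the classification and regression tasks above.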
Data Mining Software
Orange, Weka, Rattle GUI, Apache Mahout, SCaViS, RapidMiner, R, ML-Flex, Databionic ESOM Tools, Natural Language Toolkit, SenticNet API, ELKI, UIMA, KNIME, Chemicalize.org, Vowpal Wabbit, GNU Octave, CMSR Data Miner, Mlpy, MALLET, Shogun, Scikit-learn, LIBSVM, LIBLINEAR, Lattice Miner, Dlib, Jubatus, KEEL, Gnome-datamine-tools, Alteryx Project Edition, OpenNN, ADaM, ROSETTA, ADaMSoft, Anaconda, yooreeka, AstroML, streamDM, jHepWork, TraMineR, ARMiner, arules, CLUTO, and TANAGRA are some of the top free data mining software, in no particular order.
IBM SPSS Modeler, SAS Enterprise Miner, RapidMiner, Angoss Knowledge STUDIO, Microsoft Analysis Services, Oracle Data Mining, FICO Data Management Integration Platform, Think Analytics, Salford Systems, Viscovery, Portrait Software, IBM DB2 Intelligent Miner, STATISTICA Data Miner, QIWare, LIONsolver, KXEN Modeler, Neural Designer, Megaputer's PolyAnalyst, TIBCO Spotfire Miner, XLMiner (Frontline Systems), GhostMiner, Teradata Warehouse Miner, KNIME, Advanced Miner, Alteryx Designer, and Rapid Insight Veera are some of the top commercial data mining software, in no particular order.
ELKI, ITALASSI, R, Data Applied, DevInfo, Tanagra, Waffles, Weka, Gephi, OpenRefine, Fusion Tables, DataMelt, Orange, Wrangler, Encog, RapidMiner, PAW, SCaVi, ILNumerics.Net, ROOT, Julia, MOA, NumPy, SciPy, KNIME, NetworkX, matplotlib, IPython, SymPy, Scilab, FreeMat, jMatLab, NodeXL Basic, Fluentd, and Tableau Public are some of the top free or open source software for data analysis, in no particular order.
More Information on Predictive Analysis Process
For more information on the predictive analytics process, please review the overview of each component in the predictive analytics process: data collection (data mining), data analysis, statistical analysis, predictive modeling, and predictive model deployment.