In the simplest possible terms, data mining is taking a large amount of information, analysing it in depth to identify patterns, and then typically using that analysis to make predictions.
It draws on fields like machine learning, statistics and database analysis, with uses ranging from data management to learning about the spending habits of customers in order to inform marketing decisions.
With more students taking on a master's of data science online, it's important that we understand the many areas in which data gathering and analysis are used by businesses, and what they can teach us about the digital experience.
An Example of Data Mining in Action
Let’s take an example of something you’re probably all familiar with and may even have wondered about at some point in your life: Kickstarter.
Over half a million projects have been launched on Kickstarter, raising more than $7.31 billion between them.
However, more than half of all Kickstarters fail – statistics from February 2023 put the number of failed campaigns at 359,814.
A lot of factors go into a Kickstarter, ranging from the content and the idea to the existence of a pre-established community or an engaging online video explaining the project.
Then there are things like pledge tiers and stretch goals – whether they're announced at the start or released over time to build hype – and a host of other factors.
Kickstarter has been the subject of a variety of studies before; check the bibliography of this article to find a few of them. Today there is even freely available software designed to do this specific kind of data mining, generating results from the most recent Kickstarter data and identifying trends and advice on a regular basis rather than relying on year-old or even decade-old information.
So, what can all this data mining tell us? Well, by setting software loose on the data to identify specific patterns, we can surface very strong correlations between what does and doesn't directly benefit a Kickstarter.
For example, we ran one of these free data-mining tools across 1,163 Kickstarter campaigns and asked it to determine, across 15 points of criteria, which criteria had the highest correlations with success and which had the lowest.
The results? Pledge levels, the use of images, a history of previously created projects and a history of backed projects had the highest association with success. Having a low goal of only $1,000 also trended very positively.
On the lower end? Pledge tiers of $3,000 or more and a campaign duration of 60 days were among the worst-scoring factors.
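For readers who want to picture that pass, here is a minimal sketch of the idea in Python. It is not the tool we used: it assumes a hypothetical `campaigns.csv` with one row per campaign, a binary `succeeded` column, and one column per criterion, and simply ranks each criterion by its correlation with success.

```python
# Minimal sketch: rank criteria by correlation with campaign success.
# Assumes a hypothetical campaigns.csv with a binary "succeeded" column
# and one numeric/binary column per criterion.
import pandas as pd

df = pd.read_csv("campaigns.csv")  # 1,163 rows in our example run

criteria = [c for c in df.columns if c != "succeeded"]

# Pearson correlation of each criterion against the success flag
# (equivalent to the point-biserial correlation for binary criteria).
correlations = df[criteria].corrwith(df["succeeded"]).sort_values(ascending=False)

print("Strongest positive associations with success:")
print(correlations.head(5))
print("\nStrongest negative associations:")
print(correlations.tail(5))
```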
What this tells us is that short, quick Kickstarters are key to success, and that being integrated into the community likely has some tangential bonuses as well.
A deeper dive could have found things like word associations – which phrases or tones generated the most appeal. A closer look at creator profiles might have gleaned more still; perhaps tapping into specific Kickstarter communities draws a bigger crowd than establishing a community outside of Kickstarter and bringing it over.
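As a rough illustration of how that word-association pass might look, here is a hedged sketch. It assumes the same hypothetical `campaigns.csv` now carries a free-text `blurb` column, and simply compares word frequencies between successful and failed campaigns.

```python
# Sketch: which words are disproportionately common in successful blurbs?
# Assumes hypothetical "blurb" (text) and "succeeded" (0/1) columns.
from collections import Counter
import pandas as pd

df = pd.read_csv("campaigns.csv")

def word_counts(texts):
    words = " ".join(texts).lower().split()
    return Counter(w.strip(".,!?") for w in words if len(w) > 3)

won = word_counts(df.loc[df["succeeded"] == 1, "blurb"].dropna())
lost = word_counts(df.loc[df["succeeded"] == 0, "blurb"].dropna())

# Ratio of appearances in successful vs failed blurbs, for common words.
skew = {w: won[w] / (lost[w] + 1) for w, n in won.items() if n >= 20}
for word, ratio in sorted(skew.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{word:15s} {ratio:.1f}x more common in successful blurbs")
```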
For larger corporations, results like these might be generated not only for Kickstarter but for other crowdfunding websites as well, in order to determine which platform has the strongest community for their product.
Data Mining vs Data Analysis
However, despite the potential for incredible results, data mining of this sort is only as good as the questions we ask, the sample we use, and the presentation style we choose. Did you notice something in the example above that we didn’t highlight?
There was no direct comparison between data points. The "low goal" of $1,000 had a high correlation with success, but we only ever looked for that one value.
We didn't ask the tool to find the most common goal amounts, nor did we ask which of the most common values performed least favourably – we merely picked a number and asked it to tell us how well that number correlated.
A truly deep data-mining project, however, could enter the data, identify which values are most and least common, and then correlate those values with success – generating its own numbers rather than simply taking ours and spitting it back at us.
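Here is a hedged sketch of that difference, again assuming our hypothetical `campaigns.csv` with `goal` and `succeeded` columns: instead of checking one hand-picked goal, we let the data nominate its own most common values and then score each of them.

```python
# Sketch: let the data supply the candidate goal amounts, then score them.
import pandas as pd

df = pd.read_csv("campaigns.csv")

# Find the most common goal amounts rather than picking one ourselves.
common_goals = df["goal"].value_counts().head(10).index

# Then ask how each of those common values actually performed.
for goal in sorted(common_goals):
    subset = df[df["goal"] == goal]
    rate = subset["succeeded"].mean()
    print(f"${goal:>8,.0f} goal: {len(subset):4d} campaigns, "
          f"{rate:.0%} success rate")
```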
Taking a sample of data and performing equations on it is one thing, but at its greatest depths data mining can simply look for patterns within the data itself – searching for the strongest correlations – and develop new equations of its own based on what it finds. Essentially, machine learning. At this point, we've entered the realm of data analysis.
The difference between data mining and data analysis comes down to what the program does with the data. If it just takes our question, enters the data and prints the result, it's data mining.
However, if it enters the data, runs tests and simulations, and then returns with a much more complex answer than a simple yes-or-no response, we're looking at data analysis.
Of course, there's a lot more going on here as well. Some terms you may have encountered before are data dredging, data fishing and data snooping, which describe how some software takes a huge dataset (such as an entire country's census data) and then generates a much smaller "random" sample from within it to analyse and draw conclusions from.
This "random" sample may or may not be truly random, and the process might be repeated multiple times to test whether the results of one study are repeatable across a dozen other similarly extracted samples.
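A minimal sketch of that repeated-sampling check, under the same assumptions about our hypothetical `campaigns.csv`:

```python
# Sketch: draw a dozen small random samples and see whether a correlation
# found in one sample holds up in the others.
import pandas as pd

df = pd.read_csv("campaigns.csv")

for seed in range(12):
    sample = df.sample(n=200, random_state=seed)  # small "random" slice
    r = sample["goal"].corr(sample["succeeded"])
    print(f"sample {seed:2d}: goal/success correlation = {r:+.3f}")

# Wildly different values across samples suggest the pattern is an
# artefact of dredging rather than a real effect.
```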
The Process of Data Mining
Data mining is a broad term covering a variety of approaches, but most data-mining software follows the same basic principles to acquire data, test it, and then summarise it for human use. These basic principles are (a short code sketch illustrating a few of them follows the list):
- Anomaly detection (outlier/change/deviation detection) – identifying unusual records for further analysis or exclusion from the main results.
- Association rule learning (dependency modelling) – identifying pre-existing relationships via pattern-seeking techniques.
- Clustering – discovering groups and structures in the data that are in some way or another "similar", without using any known structures in the data.
- Classification – generalising known structures in order to apply them to future incoming data.
- Regression – finding an approach to the data that generates the least amount of error.
- Summarisation – transforming all of the analysis and dissections described above into a visual medium that can be easily understood by you or me.
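Here is the promised sketch, walking a few of those stages end to end with scikit-learn on synthetic stand-in data. Everything in it is illustrative rather than a prescribed pipeline:

```python
# Sketch of a few data-mining stages on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                          # stand-in features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)   # stand-in target

# Anomaly detection: flag unusual rows for exclusion or a closer look.
inliers = IsolationForest(random_state=0).fit_predict(X) == 1
X, y = X[inliers], y[inliers]

# Clustering: discover groups without using any known structure.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Regression: fit an approach to the data that minimises error.
model = LinearRegression().fit(X, y)

# Summarisation: reduce it all to something a human can read at a glance.
print(f"kept {len(y)} of 500 rows after outlier removal")
print(f"cluster sizes: {np.bincount(clusters)}")
print(f"fitted coefficients: {model.coef_.round(2)}")
```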
However, it's worth noting that this process doesn't always yield valid results. Sometimes the patterns found by analysis would be considered wrong or incorrect by the user – it is famous at this point that software trained to play Tetris, told to find a way not to lose, discovered that pausing the game indefinitely guaranteed a no-lose result.
It may sound absurd, but imagine this sort of behaviour applied to data samples that carry a heavy bias, or that treat the very data points the researchers care about most as anomalous outliers.
Although we said earlier that data mining is only as good as the questions we ask, the sample we use, and the presentation style we choose, when it comes to complex analysis it is also only as good as what we remember to program in, and what we remember to program out.
