What pitfalls are there in analytics? That is what we will look at today.
Pitfall 1: Deep Learning With Shallow Data
The use of deep learning models such as neural nets has grown rapidly with the increase in computing power, and we now have the ability to run very complex algorithms to analyze sets of data.
Applying a deep learning model that is too sophisticated for the available data can easily lead to the classic problem of overfitting. While the model may produce a strong result within the sample it was estimated on, it can go haywire when you apply it outside that initial sample for real-world use. Simply put, when you use a methodology that’s too complex for the problem you’re trying to solve, you’re likely to get the wrong answer.
To prevent overfitting, your model must separate the signal from the noise so that it can disregard the randomness in your original sample and demonstrate that it will not be affected by randomness when used for real-world applications.
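The gap between in-sample and out-of-sample performance can be illustrated with a small sketch. It does not use a neural net: as a stand-in for an overly flexible model it uses a 1-nearest-neighbour "memorizer", and compares it with a straight-line fit. The data (a linear signal plus Gaussian noise), the sample sizes, and all function names are assumptions made for illustration.

```python
import random

def mse(model, xs, ys):
    """Mean squared error of `model` (a function x -> prediction) on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (the simple model)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def fit_memorizer(xs, ys):
    """1-nearest-neighbour: returns the y of the closest training x.

    It reproduces the training sample exactly, noise included.
    """
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def make_data(n, rng):
    """Hypothetical data: signal y = 2x plus Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(30, rng)   # small estimation sample
test_x, test_y = make_data(200, rng)    # data the models have never seen

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(name,
          "train MSE:", round(mse(model, train_x, train_y), 3),
          "test MSE:", round(mse(model, test_x, test_y), 3))
```

The memorizer scores a perfect zero error on the training sample because it has stored the noise along with the signal. On new data its error is no longer zero, while the simpler line typically shows a much smaller gap between the two numbers.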
Pitfall 2: Using Open-Source Advanced Algorithms Without Fully Understanding Them
The proliferation of open-source neural networks has helped advance the field of data science, giving many more people access to new and highly advanced tools. This becomes a problem when inexperienced data scientists have enough open-source knowledge to use the tools, but not enough knowledge to use them effectively.
Knowing how to call a neural net function using code without knowing how to prepare data and manipulate the inputs for the neural net won’t get you the right answers to the problem you’re trying to solve. While learning how to call functions for a neural net using code is relatively easy, understanding how to best use those functions for data analysis is both an art and a science that comes with experience.
When using these functions, you must properly manipulate the inputs, select the right method for your problem, carefully interpret outcomes by understanding how the methodology interprets the data, and then iterate the training of the neural net so that it fits your data. The art of working with the data and the business problem has to be combined with the science of the estimation methodology. That combination, rather than a simple call to standardized open-source functions, is what gets you the results you need.
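One common example of input preparation is putting features on a comparable scale before they reach a neural net. The sketch below is an illustration only, not a prescribed recipe: the column meanings (age, income), the values, and the function names are hypothetical. It computes the scaling statistics from the training rows alone and then reuses them for new rows.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Compute per-column mean and standard deviation from training rows only."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has standard deviation 0; fall back to 1.0
    # so that the division below does not fail.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Standardize rows with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical rows of [age, income]; the two columns differ in scale by orders of magnitude.
train_rows = [[25.0, 40000.0], [35.0, 60000.0], [45.0, 80000.0]]
new_rows = [[30.0, 50000.0]]

means, stds = fit_scaler(train_rows)
train_scaled = transform(train_rows, means, stds)
new_scaled = transform(new_rows, means, stds)   # reuses the training statistics

print(train_scaled)
print(new_scaled)
```

Fitting the statistics on the training rows and applying them unchanged to new rows keeps information from unseen data out of the preparation step. Whether standardization is the right transformation at all depends on the data and the method, which is the judgment the text describes.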
Pitfall 3: Not Properly Executing Out-of-Sample Testing
This is another classic pitfall, and one we see on the rise in the industry. As most data scientists know, whether you’re using an open-source neural net or any other statistical model, it’s important to test the model on data that it has never seen before. Many methods set aside a test data set by randomly selecting a portion of your available data. This might be good enough for many traditional statistical methods, but deep learning methods are powerful enough that a single random split often gives misleading results.
To avoid this pitfall, run a series of simulations on truly out-of-sample or holdout data sets, and use different mixes of test and training sets to make sure your model can generalize properly.
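A minimal sketch of this idea follows: a holdout set is put aside before any modeling, and the remaining data is split into training and test sets many times rather than once. The data, the split sizes, the number of splits, and the placeholder model (which always predicts the training mean) are all assumptions chosen to keep the example short.

```python
import random

def mse(model, xs, ys):
    """Mean squared error of `model` (a function x -> prediction)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_mean(xs, ys):
    """Placeholder model: always predicts the training mean of y."""
    m = sum(ys) / len(ys)
    return lambda x: m

def split_indices(n, test_fraction, rng):
    """Shuffle 0..n-1 and return (train_indices, test_indices)."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = max(1, int(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def evaluate_over_splits(xs, ys, fit, score, n_splits=20, test_fraction=0.25, seed=0):
    """Refit and score the model on many different train/test mixes."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        train_idx, test_idx = split_indices(len(xs), test_fraction, rng)
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model,
                            [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
    return scores

# Hypothetical data: y = 2x plus noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(0.0, 10.0) for _ in range(100)]
ys = [2.0 * x + data_rng.gauss(0.0, 1.0) for x in xs]

# Holdout set: put aside first and not touched during development.
dev_x, dev_y = xs[:80], ys[:80]
holdout_x, holdout_y = xs[80:], ys[80:]

scores = evaluate_over_splits(dev_x, dev_y, fit_mean, mse)
spread = max(scores) - min(scores)

# Only at the very end: one evaluation on the holdout set.
final_model = fit_mean(dev_x, dev_y)
holdout_score = mse(final_model, holdout_x, holdout_y)

print("split scores range from", round(min(scores), 2), "to", round(max(scores), 2))
print("holdout score:", round(holdout_score, 2))
```

The spread of scores across splits shows how much a single random split could have misled you. For data with a time order or grouped records, a shuffled split like this one may itself leak information, so the splitting rule would need to respect that structure.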
Pitfall 4: Not Understanding Data Before Technical Development
This is quite possibly the biggest pitfall of all. Data preparation work is often considered a boring task compared to running a complex algorithm and studying the output. Many available tools offer different feature engineering options and subsequent algorithms for data analysis and forecasting. With these advanced tools, you can use machine learning to describe what has happened in the past and to forecast what is likely to happen in the future. The temptation is to just plug and play: run standard feature engineering options, call a neural net to analyze your data, and go. The pitfall here is that you must understand your data before using these tools. If you do not understand the data, you might choose the wrong tool or the wrong input and wind up with misleading, suboptimal outcomes.
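As a small illustration of looking at the data first, the sketch below profiles each column before any modeling: how many values are missing, how many distinct values there are, and what the numeric range is. The records, column names, and the `profile` function are hypothetical, and a real review would go well beyond these few checks.

```python
def profile(rows):
    """Summarize each column of `rows` (a list of dicts; None marks a missing value)."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        numeric = [v for v in present
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        report[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
        }
    return report

# Hypothetical records with typical problems hidden in them.
rows = [
    {"age": 34, "income": 52000.0, "region": "north"},
    {"age": None, "income": 61000.0, "region": "north"},
    {"age": 29, "income": -1.0, "region": "north"},
    {"age": 290, "income": 48000.0, "region": "north"},
]

report = profile(rows)
for col, stats in report.items():
    print(col, stats)
```

Even this short summary raises questions a plug-and-play run would skip: an age of 290 looks like a data entry error, an income of -1.0 may be a sentinel for "unknown", and a region column with a single distinct value carries no information for a model.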