A data analyst is said to be the best statistician among programmers and the best programmer among statisticians. In this list, we will discuss how a programmer can become better at statistics by avoiding the most common mistakes.

1. Cross-validation and panel analysis

You’ve been taught that cross-validation is all it takes. Sklearn even provides a few handy helpers for it, so you think you’re covered. But most cross-validation methods shuffle the data randomly, which means information leaks between the folds and performance gets overestimated.

Solution: Build validation splits that reflect what you would actually be able to predict in real-world use, especially with time series and panel data. In practice this means writing a custom splitter or using rolling (walk-forward) cross-validation.

Example: You have panel data for two different entities (for example, companies) whose series are strongly correlated with each other. If the split is random, you end up making predictions using data that was not actually available at prediction time, which inflates performance. You think you have avoided the cross-validation trap, and you see that a random forest performs much better than linear regression. But after step-by-step (walk-forward) testing that prevents future data from leaking into the test set, the random forest performs worse again: its RMSE rises from 0.047 to 0.211, higher than linear regression!
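Below is a minimal sketch of that difference using synthetic panel-style data and scikit-learn's KFold versus TimeSeriesSplit; the data, model settings and printed numbers are illustrative assumptions, not the figures from the example above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)

# Synthetic panel: two strongly autocorrelated series (two "companies"),
# stacked in time order.
series = np.cumsum(rng.normal(size=(200, 2)), axis=0)
X = series[:-1].reshape(-1, 1)   # today's value as the only feature
y = series[1:].reshape(-1)       # tomorrow's value as the target

model = RandomForestRegressor(n_estimators=100, random_state=0)

# Random K-fold: neighbouring, highly correlated observations land on both
# sides of the split, so the score is optimistic.
random_cv = KFold(n_splits=5, shuffle=True, random_state=0)
# Walk-forward split: every test fold lies strictly after its training data.
walk_forward_cv = TimeSeriesSplit(n_splits=5)

for name, cv in [("random", random_cv), ("walk-forward", walk_forward_cv)]:
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    print(f"{name:>12} CV RMSE: {-scores.mean():.3f}")
```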

2. Incomplete understanding of the objective function

Analysts want to build a “better” model. But better is in the eye of the beholder. If you do not know what the actual task and objective function are, and how the model behaves with respect to them, you are unlikely to build the “best” model. Besides, the real goal may be to improve a business metric, not to optimize some mathematical function.

Solution: Most Kaggle winners take a long time to understand the objective function and how the model and data relate to it. If you are optimizing a business metric, match it to the corresponding objective function.

Example: The F-measure is commonly used to evaluate classification models. We once built a classification model whose business success was measured by the percentage of cases it got right, i.e. accuracy. The F-measure turned out to be misleading: it suggested the model was right about 60% of the time, while its accuracy was only 40%.
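As a minimal sketch of how the two metrics can diverge, here is a hand-made set of labels and predictions (purely illustrative, not the model from the example) evaluated with scikit-learn:

```python
from sklearn.metrics import accuracy_score, f1_score

# 10 cases, 8 of them positive; the predictions are made up to show the gap.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]

print("F1:      ", f1_score(y_true, y_pred))        # ~0.57 - looks passable
print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.40 - what the business sees
```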

3. You don’t have the simplest baseline model

Modern ML libraries make things easy. Almost too easy. Just change one line of code and run the model. And another. And one more. Error metrics keep dropping, add more tuning. Great – they drop even further… Amid all this sophistication, it is easy to forget about the naive way of forecasting. Without that primitive benchmark you have no absolute measure of your models’ quality, and they may be poor in absolute terms.

Solution: What’s the simplest way to predict values? Use the last known value, a (moving) average, or a constant like 0. Then compare your model’s performance against that monkey-level prediction!

Example: On this set of time series, the first model looks better than the second: root mean square error (hereafter RMSE) of 0.21 versus 0.45. But wait! Simply carrying forward the last known value drops the RMSE to 0.003!
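Here is a minimal sketch of such a baseline check on a synthetic random walk; the rolling-mean “model” and the resulting RMSE values are illustrative assumptions, not the figures quoted above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=300))   # a slow-moving synthetic series

# "Sophisticated" model: predict tomorrow from the mean of the last 20 days.
X = np.array([y[i - 20:i].mean() for i in range(20, len(y) - 1)]).reshape(-1, 1)
target = y[21:]
model = LinearRegression().fit(X[:200], target[:200])
model_rmse = np.sqrt(mean_squared_error(target[200:], model.predict(X[200:])))

# Naive baseline: tomorrow equals today (the last known value).
naive_rmse = np.sqrt(mean_squared_error(target[200:], y[220:-1]))

print(f"model RMSE:              {model_rmse:.3f}")
print(f"last-value (naive) RMSE: {naive_rmse:.3f}")
```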

4. Inappropriate out-of-sample testing

This one can ruin your career! The model looked great in R&D but performs badly on real data. A model like that leads to very poor decisions and can cost the company millions. It is the worst mistake of all!

Solution: Make sure you evaluate the model under realistic conditions and understand when it will work and when it will not. Or just try advanced analytics consulting😉

Example: In-sample, the random forest performs much better than the linear regression: RMSE 0.048 versus 0.183. But out-of-sample, the random forest is much worse: 0.259 versus 0.187. The random forest has overfit and will fail in real life!
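A minimal sketch of that comparison, holding out part of the data and reporting both in-sample and out-of-sample RMSE; the synthetic data and the printed numbers are illustrative only, not the figures above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
y = X @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + rng.normal(scale=1.0, size=400)

# Judge the model on data it has never seen, not on the data it was fit to.
X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

for name, model in [("linear regression", LinearRegression()),
                    ("random forest", RandomForestRegressor(random_state=0))]:
    model.fit(X_train, y_train)
    in_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    out_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{name}: in-sample RMSE {in_rmse:.3f}, "
          f"out-of-sample RMSE {out_rmse:.3f}")
```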

5. What data is available when making a decision?

When you run the model in production, it only gets the data available at that exact moment, and that may differ from the data you trained on. For example, some features are published with a delay, or inputs get revised after the fact, so by the time you predict, you are feeding the model the wrong data, or the target you trained on is no longer the one you actually observe.

Solution: Perform step-by-step (walk-forward) out-of-sample testing. If the model were deployed under real-world conditions, what would the training set look like? What data would be available at forecast time? Also think about this: if you acted on the forecast, what would the outcome be at the moment of the decision?
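As a minimal sketch of that last question, assuming pandas and a hypothetical indicator that is published with a two-day delay (the column names and numbers are made up for illustration), the feature has to be lagged to what was really visible at decision time:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=8, freq="D"),
    "indicator": [1.0, 1.2, 1.1, 1.3, 1.4, 1.2, 1.5, 1.6],  # published 2 days late
    "y": [10, 11, 11, 12, 13, 12, 14, 15],
})

# At decision time on day t, the newest indicator value we can actually see is
# the one for day t-2, so the training feature must be lagged accordingly.
df["indicator_available"] = df["indicator"].shift(2)

# Train only on rows where the lagged feature would really have existed.
train = df.dropna(subset=["indicator_available"])
print(train[["date", "indicator_available", "y"]])
```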