Techniques and pitfalls for ML training with small data sets

In machine learning, the relationship between large and small data sets has long been considered a David vs. Goliath battle: even highly sophisticated models trained on small datasets cannot keep up with simple models trained on large datasets. This seems like a hopeless situation if you don’t have access to huge data stores. However, David shouldn’t give up, because solutions do exist! This article provides an overview of several techniques for overcoming data size issues.

Admittedly, it’s true: machine-learning algorithms become more effective as the size of the training dataset grows. But although the "more data is better" approach may be correct in principle, this simplified rule is problematic in several respects:

First, it's not (only) size that matters. It must also be the right data. There’s little point in randomly collecting huge amounts of data and keeping it ready for analysis. What is needed is an intelligent selection of high-quality data, and the quality depends on what you want to do with the data.

Secondly, any process of data collection and analysis has a cost (whether in terms of money, human effort, time or computing capacity). In a perfect world with unlimited resources, you would always go for big data. In practice, however, you have to find a balance between accuracy and feasibility. Sometimes it makes more sense to work on small, high-quality datasets to obtain the right conclusions faster, more reliably and at a lower cost.

Finally, and perhaps most importantly, the “the-more-the-better” approach suggests that the best analyses and deepest insights are reserved for companies that can process petabytes of data every day. In fact, most individuals (and also most companies) are limited to small datasets, and they shouldn’t leave the field to tech giants who use aggressive information-gathering methods.

Because one thing is certain: machine learning can do so much more than just explore large datasets. It excels with large data sets, but the real challenge lies in processing small data sets, where ML can show its full versatility. There are already plenty of well-established techniques for applying ML methods to small datasets, which have largely taken a back seat in the machine-learning community. But before offering an overview of the most important ones, we will start by looking at what "small data" means in the first place and what problems may arise when you have too little data.

What is small data and why does size matter?

Big Data has been one of the hottest buzzwords in technology in the last decade. While big data is characterised by the 3 Vs (volume, variety and velocity), small data is the exact opposite. It is ‘small’ enough for human comprehension. It is data that comes in a volume (usually the data even fits in the memory of a standard machine) and format that make it accessible and actionable for us.

The size of the training data plays a decisive role in avoiding overfitting. An overfitted model is one that fits the training data too closely. If you have too little data for too many features, the model may see patterns that do not exist and is likely to be biased by outliers. The result is that the model performs poorly on unseen data. So, if your model performs much better on the training data than on the test data, the probability is high that the model is overfitted.

Now, how much data do we need to prevent this problem?

Let’s start with the good news: for many ML applications, you don't need big data at all. However, the question of how much data is enough to build a good model is a difficult one. As always, it depends heavily on the method you want to use and the type of problem you want to solve.

With respect to the first, traditional machine-learning algorithms require less data than deep learning models. This is because traditional machine-learning algorithms have a rather simple structure in which the features and the classifier are selected by humans. By contrast, deep learning models figure out their own parameters and learn from their own errors without such a predefined structure. This means that not only do they need far more data, but they also have a much longer learning curve along which additional data keeps improving the model.

Regarding the type of problem, as a general rule one can say that the more complex the problem is, the more data you need. There are several statistical heuristics that come in handy for estimating a suitable sample size. For instance, the much-debated “rule of 10” states that you need at least 10 times as many data points as your model has parameters (see https://medium.com/@malay.haldar/how-much-training-data-do-you-need-da8ec091e956 for a derivation of this rule using learning curves). For linear models, for instance, the number of parameters equals the number of input features, since the model assigns one parameter to each feature; a linear model with 20 features would therefore call for at least 200 training examples.

Another practical and data-driven way of determining whether you have enough data is the learning curve already mentioned. Here, the model is trained repeatedly on increasingly large subsets of the data, and its performance is plotted as a function of the training-set size. In the resulting graph, one can then see at what point adding more data points no longer brings improvements.
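
As a quick illustration, here is a minimal learning-curve sketch using scikit-learn. The synthetic dataset and the logistic-regression classifier are placeholders chosen for this example, not taken from the article.

```python
# Minimal learning-curve sketch with scikit-learn; dataset and model are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy",
)

# If the validation score stops improving as the training size grows,
# collecting more data is unlikely to help much.
for n, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} training samples -> validation accuracy {score:.3f}")
```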

Finally, it is useful to look for published results for ML problems that are similar to yours. Documented sample sizes from previous studies can serve as a good point of orientation.


What can you do if you have less data than needed?

Now that you know how to determine the required amount of data for your task, the question arises as to what to do if this amount cannot be reached. Luckily, there are several techniques you can use if you run short of data (provided that it’s not possible to simply collect more, which is always the first option to consider), and again, the right choice depends heavily on the problem you face. In some cases, the problem is that the dataset is simply too small to allow for generalisation. In other cases, the dataset is too small in relation to its dimensionality ("large p, small n"). Another common issue, specific to classification problems, is data imbalance.

The following chart shows possible approaches to solving these problems.

Let’s look at these in detail in the sections below.

Imbalance

As just mentioned, one possible implication of data scarcity that arises when dealing with classification problems is imbalance. In this case, the total amount of data may not be small at all, but the classes are not represented equally. A typical example is spam detection: in a typical inbox there are usually far more non-spam e-mails (thank goodness!) than spam e-mails. While almost every dataset is slightly skewed, even a moderate imbalance, where the minority class makes up 20% of the data or less, can cause problems, for example a model that only ever predicts the majority class.

Undersampling/Oversampling

If the overall amount of data is large, undersampling can be used to balance the data. Depending on how the data is distributed, you can either randomly remove data points from the majority class, or first cluster the data (e.g., through K-means clustering) and then remove data points by random sampling within each cluster.
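
A minimal sketch of random undersampling, assuming the imbalanced-learn package is available; the synthetic dataset only stands in for your own features and labels:

```python
# Random undersampling sketch using imbalanced-learn; the dataset is illustrative.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Illustrative imbalanced dataset: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)

undersampler = RandomUnderSampler(random_state=0)
X_balanced, y_balanced = undersampler.fit_resample(X, y)
print(Counter(y), "->", Counter(y_balanced))
```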

By contrast, if the overall data size is rather small, you should go for oversampling. Again, one possibility would be to randomly duplicate data points of the minority class. Yet, you should be aware that creating duplicates can slow down the training. A popular alternative is therefore SMOTE (Synthetic Minority Oversampling Technique). Here, you don’t create exact duplicates of data points. Instead, you create data points that lie between existing minority instances. The algorithm takes a minority data point and one of its nearest minority neighbours, computes the distance between the two in feature space, multiplies it by a random number between 0 and 1, and places the new synthetic point at that distance from the first data point.
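
A minimal SMOTE sketch, again assuming imbalanced-learn is installed and using a synthetic placeholder dataset:

```python
# SMOTE sketch using imbalanced-learn; the synthetic dataset is a placeholder.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

smote = SMOTE(k_neighbors=5, random_state=0)  # interpolates between minority neighbours
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```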

Both over- and undersampling are well-established methods to balance out your dataset. Unfortunately, there are risks too. Oversampling can result in overfitting and may increase the time needed for training. The major drawback of undersampling is the loss of information. You run the risk of deleting potentially useful information that could be important for learning. Therefore, this option is only recommended if you have a deep understanding of the data.

If you want to avoid these risks, you can also choose a different approach. Instead of changing the data, you can adjust the loss function.

Asymmetric Loss Function

The basic idea here is to penalise misclassifications of the minority class more heavily than misclassifications of the majority class. This can be done by assigning a higher weight to the loss-function terms that belong to the minority class. In addition to retaining all the information, another advantage of this approach is that it needs less training time than under-/oversampling.
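
One simple way to sketch this is via class weights in scikit-learn; the dataset and the weight values are illustrative assumptions:

```python
# Asymmetric loss via class weights in scikit-learn.
# "balanced" weights each class inversely proportional to its frequency;
# an explicit dict such as {0: 1, 1: 10} penalises minority-class errors harder.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
# or, with an explicit asymmetric weighting:
# clf = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)
```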

Ensemble Learning

Ensemble learning is another technique for dealing with imbalanced data sets. Instead of seeking the single best-performing model, it combines the predictions from multiple models trained on your dataset. In doing so, one can compensate for individual over-learning. There are various approaches in ensemble learning, such as bagging and boosting. “Bootstrap aggregation,” aka “bagging,” randomly draws samples of the training data with replacement, runs the learning algorithm on each sample, and then averages the individual predictions. Boosting is an iterative method that adjusts the weight of each observation depending on how it was classified in the previous round. This method reduces the bias error and creates strong predictive models.
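
A short sketch of both flavours with scikit-learn; the dataset, model choices and hyperparameters are illustrative, not prescriptive:

```python
# Bagging and boosting sketches with scikit-learn; the dataset is illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)

# Bagging: train many trees on bootstrap samples and average their predictions.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: build trees sequentially, each one focusing on the previous errors.
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    score = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy").mean()
    print(name, round(score, 3))
```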

Lack of Generalisation

A different problem is when the overall amount of data is just too small to allow for generalisation.

There are basically two methods to solve this issue: either you increase the size of the data set (“Grow more” approaches) or you make the model learn more from less data (“Know more” approaches). Let’s start with the first one.

“Grow more”

Intuitively, this solution is very logical: if the data set is too small, you make it bigger. One method to generate new data points is SMOTE, which has already been presented. Other possibilities are GANs and data augmentation.

GANs

A Generative Adversarial Network (GAN) is an ML model capable of generating data. It consists of two competing artificial neural networks. One (the generator) has the task of producing real-looking data; the other (the discriminator) classifies data as real or artificial. The generator tries to produce data that the discriminator considers to be real, while the discriminator's goal is to recognise the artificially generated data and distinguish it from real data. The two networks constantly try to "outsmart" each other, and through constant learning over many iterations, the generated data becomes better and better.
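
The following is a deliberately minimal GAN sketch in PyTorch for two-dimensional tabular data. The network sizes, the toy "real" data distribution and the training settings are illustrative assumptions, not a recipe from the article.

```python
# Minimal GAN sketch (PyTorch). The 2-D "real" distribution and sizes are illustrative.
import torch
import torch.nn as nn

latent_dim = 8

generator = nn.Sequential(
    nn.Linear(latent_dim, 32), nn.ReLU(),
    nn.Linear(32, 2),                        # produces fake 2-D data points
)
discriminator = nn.Sequential(
    nn.Linear(2, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),          # probability that the input is real
)

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

def sample_real(n):
    # Stand-in for the small real dataset: points scattered around (2, 2).
    return torch.randn(n, 2) * 0.5 + 2.0

for step in range(2000):
    # Train the discriminator: real -> 1, fake -> 0.
    real = sample_real(64)
    fake = generator(torch.randn(64, latent_dim)).detach()
    loss_d = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Train the generator: try to make the discriminator output 1 for fakes.
    fake = generator(torch.randn(64, latent_dim))
    loss_g = bce(discriminator(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

# generator(torch.randn(100, latent_dim)) now yields synthetic data points.
```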

Data Augmentation

Data augmentation is a popular method used especially in image classification. The basic idea is to enhance the dataset by adding slightly modified copies of already existing data points. For images, this can be done by flipping the image horizontally or vertically, cropping and/or zooming, rotating, or changing the brightness of the image. In doing so, you can increase the diversity of the data without actually having to collect new data.
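
A small sketch of an image-augmentation pipeline with torchvision; the specific transforms and their parameters are illustrative choices:

```python
# Image-augmentation sketch with torchvision; transform parameters are illustrative.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # mirror the image
    transforms.RandomRotation(degrees=15),                 # small random rotation
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # crop and zoom
    transforms.ColorJitter(brightness=0.2),                # vary the brightness
    transforms.ToTensor(),
])
# Applying `augment` inside a Dataset/DataLoader yields a different random
# variant of each image in every epoch, effectively enlarging the dataset.
```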

“Know more”

While “Grow more” approaches focus on creating more data points, “Know more” approaches aim to get the model to learn from a broader base of knowledge than just the training data at hand.

Transfer learning

Let's say you've been playing badminton for years and now you want to start playing tennis. Surely your learning curve will be much steeper than that of a person who has never held a racket before, since you don't have to learn everything from scratch but can transfer your existing skills to the new game. This is the simple but brilliant idea behind transfer learning.

Traditional ML algorithms are trained to work in isolation. They are tailored to solve a specific task on a specific dataset that involves a specific feature space. If any of these changes, you have to rebuild the model from scratch. Transfer learning is intended to overcome this isolation by leveraging knowledge acquired from previous tasks to solve related ones. Because the models are already pre-trained, less data is needed to solve the new problems. For instance, you could use transfer learning for image classification by first training a model on the huge ImageNet dataset and then retraining it on a much smaller dataset.
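
A minimal transfer-learning sketch with torchvision (assuming a reasonably recent torchvision that accepts the `weights` argument); the number of target classes is an illustrative assumption:

```python
# Transfer-learning sketch: reuse an ImageNet-pretrained ResNet-18 and retrain
# only its final layer on a small dataset with, say, 5 classes (illustrative).
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # pretrained on ImageNet

for param in model.parameters():                   # freeze the pretrained feature extractor
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 5)      # new head for the small task
# Now train only model.fc on the small dataset with the usual training loop.
```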

N-Shot Learning

A somewhat similar approach is so-called N-shot learning, which is used for classification problems. You have probably heard of few-shot, one-shot and zero-shot learning, which are particular forms of N-shot learning. In this context, a shot is nothing more than a single instance available for training; in one-shot learning, for example, the model is trained with a single example per class. How is this possible?

Well, the basic mechanism is based on meta-learning. Meta-learning is different from conventional supervised learning. The latter aims to recognise patterns in the training data and generalise to unseen data. In meta-learning, by contrast, the goal is to learn how to learn. In the meta-training phase, the algorithm learns how to recognise similarities and differences between training examples from different classes. In each iteration, it uses a support set, which consists of n labelled examples from each of k classes, and a query set, which contains further, unseen examples of those classes.

During training, the loss function assesses the performance on the query set, based on the knowledge gained from the support set, and the errors are backpropagated through the model.
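
A rough sketch of how one such episode (support set plus query set) could be sampled is shown below; the data layout, function name and episode sizes are illustrative assumptions:

```python
# Sketch of sampling one n-way, k-shot episode for meta-training.
import numpy as np

def sample_episode(data, n_way=5, k_shot=1, n_query=5, rng=np.random.default_rng(0)):
    """Sample an episode from `data`, a dict mapping each class label to an
    array of examples (illustrative layout, not from the article)."""
    classes = rng.choice(list(data.keys()), size=n_way, replace=False)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        idx = rng.permutation(len(data[cls]))
        support += [(data[cls][i], episode_label) for i in idx[:k_shot]]
        query += [(data[cls][i], episode_label) for i in idx[k_shot:k_shot + n_query]]
    # The model must label the query set using only the support set.
    return support, query
```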

A typical application of N-shot learning (especially one-shot learning) is face recognition. With conventional methods, the algorithm would have to be trained on many images of each person in order to recognise them. Since collecting that many images of every person is practically impossible, one-shot learning is used here: it reduces the required number of training samples to a single example per class.

Large p, small n

As already mentioned, high dimensionality (i.e., a large number of features) can also lead to problems. Situations where you have a lot of features in relation to the number of samples are commonly referred to as "large p, small n" problems (p stands for the number of predictors, n for the number of samples in a dataset). Take, for example, a dataset with 50,000 samples. It seems to be anything but small, but if it has a considerable number of features, say 15,000, you run the risk of overfitting. To avoid this, you would need even more data to provide effective coverage of the value range. But you can also go the other way and reduce the dimensionality instead. There are various ways to do that; I will present two of them.

Principal Component Analysis (PCA)

Without a doubt, the best-known method is principal component analysis. The idea of PCA is simple: reduce the number of variables in a dataset while preserving as much information as possible. Mathematically, this works via linear combinations of the original features.

PCA basically learns a linear transformation that projects the data into another space, where the vectors of the projection are defined by the variance of the data. As a result of the transformation, the first principal component has the largest possible variance; the next principal component is determined so that it is orthogonal (i.e., uncorrelated) to the preceding one and again has the highest possible variance, and so on. By selecting the top principal components that explain, say, 80-90% of the variation, the other components can be discarded as they bring no significant benefit to the model.
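
A short PCA sketch with scikit-learn that keeps enough components to explain 90% of the variance; the random high-dimensional dataset is a placeholder for a "large p, small n" situation:

```python
# PCA sketch with scikit-learn; the random dataset is an illustrative placeholder.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2000))            # "large p, small n": 2000 features, 500 samples

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.90, svd_solver="full")   # keep 90% of the variance
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)
```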

Autoencoder

An autoencoder is a multilayer neural network that tries to compress the input information and uses this reduced information to reproduce the input correctly at the output. It has at least three layers: the input layer, a hidden layer for encoding, and the output (decoding) layer. Using backpropagation, the unsupervised algorithm continuously trains itself by setting the target output values equal to the inputs. The key component is the hidden layer, which acts as a bottleneck and enables the dimensionality reduction.
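
A minimal autoencoder sketch in PyTorch: a 50-dimensional input is squeezed through an 8-dimensional bottleneck and reconstructed. The layer sizes, training settings and random placeholder data are illustrative assumptions.

```python
# Minimal autoencoder sketch (PyTorch); sizes and data are illustrative.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(50, 8), nn.ReLU())   # compress to 8 dimensions
decoder = nn.Sequential(nn.Linear(8, 50))              # reconstruct the 50 dimensions
autoencoder = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.randn(1024, 50)                              # placeholder data

for epoch in range(200):
    reconstruction = autoencoder(X)
    loss = loss_fn(reconstruction, X)                  # target output = input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

X_reduced = encoder(X).detach()                        # 8-dimensional representation
```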

The biggest advantage of autoencoders over PCA is that they can compress the data while retaining more of the relevant information: unlike PCA, autoencoders do not assume a linear system and are therefore more flexible. However, autoencoders require more computational resources than PCA.

Conclusion

The goal of this blog post was to give a brief overview of commonly used techniques when training with small data. Importantly, this outline is not exhaustive and is intended only as a first point of reference. Moreover, the presented techniques do not have to be an either/or choice. Instead, you can combine different methods to further optimise your model’s performance. See https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0784-9 for a mixture of over- and undersampling techniques or https://github.com/zc8340311/RobustAutoencoder for a combination of an autoencoder and PCA. In any case, it can be stated that small data can be a challenge, but it holds great potential - you just have to have the right techniques to fully explore it.
