Whenever we want to compare two sets of data, what we essentially have to compare is two distributions. Since this is one of the most common tasks in data analysis, one would think it would be a trivial endeavour; in reality, however, things get complicated very quickly.
In the simple case of comparing two normally distributed samples, most of the time we only have to compare the means and standard deviations. But what if the two distributions are not normal? What if they are discrete? What if we want to compare two distributions from different families? Things can get very complicated very quickly, even when we compare the proverbial “apples to apples”.
Fortunately, many methods have been developed. Unfortunately, they all have their limitations and requirements, which makes choosing a good one no easy task.
In this article, we will talk about the rather uncommon problem of comparing two categorical distributions. A categorical distribution is a discrete probability distribution that describes a random variable which can take one of a set of predefined categorical values; these values generally do not have any underlying order. For example, we can talk about the distribution of probabilities of a customer buying a particular type of goods in a store, or the distribution of probabilities of sensors being activated around the house.
So how do we compare two categorical distributions? Since we cannot calculate means, standard deviations, and many other parameters, we have to compare the probabilities of taking a particular value directly. One such metric is the Kullback–Leibler divergence:

D_KL(P ∥ Q) = Σ_{x ∈ 𝜒} P(x) log( P(x) / Q(x) )

which measures the relative entropy from Q to P, both defined on the same probability space 𝜒, or in other words the distance from distribution Q to distribution P in units of information (nats or bits). It is not a true measure of distance, however, since it is not symmetrical:

D_KL(P ∥ Q) ≠ D_KL(Q ∥ P)
Nor does it satisfy the triangle inequality. But it is still a powerful tool for measuring the difference between two distributions.
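To make the definition concrete, here is a minimal sketch of the divergence for discrete distributions, written with NumPy; the function name and the example probabilities are illustrative, not from the article:

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) in nats.

    p, q: probabilities over the same categories; q must be
    positive wherever p is positive, or the divergence is infinite.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.3, 0.2]       # some skewed three-category distribution
q = [1/3, 1/3, 1/3]       # uniform over the same three categories

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print(d_pq, d_qp)  # the two directions generally differ: D(P||Q) != D(Q||P)
```

Note that the divergence of a distribution from itself is zero, and the two directions give different numbers, which is exactly the asymmetry mentioned above.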
Now let’s have a look at an example. Imagine we have a six-sided dice, and we want to find out which distribution better models the outcomes of rolling it multiple times: is it a weighted dice, or a fair one? This is an example of a categorical distribution. Even though the sides of the dice are numbered, they don’t have any particular order; we could replace the numbers with pictures, for example, and the situation wouldn’t change.
To simulate a fair dice, we draw from a uniform distribution. For the models, we use a beta-binomial distribution to represent a weighted dice, and a uniform distribution to represent the fair one. The graph below shows all three distributions: the beta-binomial with 𝛼 = 𝛽 = 2, the uniform distribution, and our experimental data. In this particular realisation of the experiment, it is not immediately obvious from the graph which model describes the dice better, which is good for this example.
So let’s calculate the Kullback–Leibler divergence for both models. For the beta-binomial distribution we get 0.032, and for the uniform distribution we get 0.017. A lower value tells us that the model distribution is better at capturing the nature of our experiment, i.e. less information is lost by representing our real dice with this model distribution. So the good news is that we got the result we were expecting: the uniform distribution is the correct model in this case.
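The whole experiment can be sketched end to end with the standard library only. The seed, the number of rolls, and the beta-binomial parameterisation (n = 5, giving six outcomes) are assumptions for illustration, so the exact divergence values will differ from the 0.032 and 0.017 quoted above and will vary between runs:

```python
import math
import random

def betabinom_pmf(k, n, a, b):
    # beta-binomial pmf: C(n, k) * B(k + a, n - k + b) / B(a, b),
    # computed via log-gamma for numerical stability
    log_beta = lambda x, y: math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    return math.comb(n, k) * math.exp(log_beta(k + a, n - k + b) - log_beta(a, b))

def kl_divergence(p, q):
    # D_KL(P || Q) in nats; assumes q > 0 wherever p > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

random.seed(42)
rolls = [random.randint(1, 6) for _ in range(1000)]            # simulate a fair dice
empirical = [rolls.count(side) / len(rolls) for side in range(1, 7)]

weighted_model = [betabinom_pmf(k, n=5, a=2, b=2) for k in range(6)]  # weighted dice
uniform_model = [1 / 6] * 6                                           # fair dice

kl_weighted = kl_divergence(empirical, weighted_model)
kl_uniform = kl_divergence(empirical, uniform_model)
print(kl_weighted, kl_uniform)
```

As in the article, the divergence to the uniform model typically comes out lower, identifying it as the better description of a fair dice.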
Following the example above, we can find the correct distribution to model our data, or check whether the process that generates our data changes over time. One such example would be the behaviour of a user in a house: we can look at the distribution of events in the past and compare it to the current distribution to understand whether there was a change in behaviour or habits.
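The behaviour-change idea can be sketched the same way: bin events into categories for two time windows and compare the resulting histograms. The room names and event counts below are entirely made up for illustration:

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q) in nats; assumes q > 0 wherever p > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def normalise(counts):
    # turn raw event counts into a categorical probability distribution
    total = sum(counts.values())
    return [c / total for c in counts.values()]

# hypothetical sensor-event counts per room, past week vs. current week
past_counts = {"kitchen": 120, "living_room": 80, "bedroom": 60, "bathroom": 40}
current_counts = {"kitchen": 60, "living_room": 70, "bedroom": 110, "bathroom": 60}

drift = kl_divergence(normalise(current_counts), normalise(past_counts))
print(drift)  # a large divergence suggests the habits have changed
```

In practice one would pick a threshold on the divergence (or track it over time) to decide when the change is large enough to act on.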