Coronavirus: data analysis of a pandemic

Preamble: I am not an expert in epidemiology. Take any predictions in this post with a large pinch of salt. There are many considerations missing from my analysis. If you are looking for more robust forecasts, a collection of predictions has been made available in the coronavirus tech handbook and crowd-sourced predictions can be found on Metaculus. One of the most useful resources for what individuals and governments should be doing at the time of writing is the 80,000 Hours podcast.

Lots of data about the current coronavirus outbreak are freely available for anyone to analyse. Here's my attempt to give a snapshot of what we are facing and what we might expect in the future, based purely on the data we have available.

At the time of writing, there are over 200,000 confirmed cases of Covid-19 globally, but these cases are not distributed evenly around the world. They are clustered in the worst-hit countries:


Because the data are so skewed, it's more instructive to plot them on a log scale. On this scale, each equal step up in bar height corresponds to 10 times the number of cases:

Bar log
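
To make the log-scale reading concrete, here is a minimal sketch using only the Python standard library. The three case counts are made up purely to illustrate the scale and are not taken from the data set.

```python
import math

# Hypothetical case counts, chosen only to illustrate the scale (not real data).
cases = {"A": 100, "B": 1_000, "C": 10_000}

# On a log10 axis, the height of a bar is proportional to log10(count).
heights = {name: math.log10(count) for name, count in cases.items()}

# Each tenfold increase in cases adds the same fixed amount to the bar height.
step_a_to_b = heights["B"] - heights["A"]
step_b_to_c = heights["C"] - heights["B"]
```

Both steps come out the same, which is why a country with ten times as many cases sits one fixed step higher on the chart no matter how large the numbers are.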

Here is where the cases are geographically:

There are clearly clusters of outbreaks in China, European countries, Iran, and, to a lesser extent, the US, with much smaller numbers of cases in, for example, Australia and in African and South American countries. My worry is what will happen if the illness spreads to countries that are less prepared to deal with hundreds of thousands of people needing medical care.

What is most remarkable about communicable diseases is how illness rates change through time. This video shows how the number of confirmed cases in each country has evolved since 22nd January 2020.

map animation
Whilst China and a few other countries have dramatically reduced the growth-rate of cases, other countries like the UK and the US are yet to experience peak growth-rates. It's harrowing to see substantial numbers of cases in less economically prepared countries in South America and Africa.

Modelling growth-rates

When an infectious disease spreads through a uniform population, the number of cases will follow a logistic curve. This is an s-shaped curve that starts low and increases exponentially until it reaches its maximum growth-rate, after which capacity limitations cause the growth-rate to decrease so that the number of infections reaches a steady state. This is approximately what we see in China's Covid-19 illness rates:

Time series China
We can fit a logistic function to the number of confirmed cases, which gives a model that matches the data well:
Time series China logistic
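
As a rough illustration of what fitting a logistic function involves, here is a minimal standard-library sketch. It is not necessarily the method used for the figures in this post. The function names, the candidate capacities and the synthetic data are all assumptions made for the example. The trick is that, for a fixed capacity K, log(y / (K - y)) is a straight line in time, so each candidate K only needs an ordinary straight-line fit.

```python
import math

def logistic(t, capacity, rate, midpoint):
    """Logistic curve: cumulative cases at time t (in days)."""
    return capacity / (1.0 + math.exp(-rate * (t - midpoint)))

def fit_logistic(times, counts, capacities):
    """Scan candidate capacities; for each, fit a straight line to the logit.

    For fixed capacity K, log(y / (K - y)) = rate * (t - midpoint).
    Counts must be positive. Returns (capacity, rate, midpoint).
    """
    n = len(times)
    mean_t = sum(times) / n
    spread_t = sum((t - mean_t) ** 2 for t in times)
    best = None
    for K in capacities:
        if K <= max(counts):
            continue  # the logit is only defined when K exceeds every count
        z = [math.log(y / (K - y)) for y in counts]
        mean_z = sum(z) / n
        slope = sum((t - mean_t) * (v - mean_z)
                    for t, v in zip(times, z)) / spread_t
        intercept = mean_z - slope * mean_t
        midpoint = -intercept / slope
        # Score each candidate by squared error on the original scale.
        sse = sum((logistic(t, K, slope, midpoint) - y) ** 2
                  for t, y in zip(times, counts))
        if best is None or sse < best[0]:
            best = (sse, K, slope, midpoint)
    return best[1:]

# Synthetic data generated from known parameters (not real case counts).
times = list(range(60))
counts = [logistic(t, 80_000, 0.25, 20.0) for t in times]

capacity, rate, midpoint = fit_logistic(
    times, counts, range(50_000, 120_001, 5_000))
```

On this noise-free synthetic data the scan recovers the parameters used to generate it. With real, noisy counts a nonlinear least-squares routine such as `scipy.optimize.curve_fit` is a more common choice, since the straight-line trick weights early, small counts differently from a direct fit.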

The evolution of global rates of this illness shows an interesting growth curve that is different to the standard logistic curve:

Time series global
Instead of one s-shape, there are two. This occurs when there are several weakly coupled populations, so that rapid growth in one sub-population later sets off growth in another. The first sub-population in this case is China, and the second is part of the rest of the world. Perhaps we will see a third s-shape if Africa and South America experience rapid growth as growth in Europe and the US subsides.

One model we can fit to a growth curve with two s-shapes is the double logistic function. This is basically a function made up of two temporally separated logistic functions. Fitting the double logistic function to the global Covid-19 cases gives a pretty good fit:

Time series global double logistic
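
For readers who want to see the shape in code, here is a minimal sketch of one common form of the double logistic function: the sum of two logistic curves with different midpoints. The exact parameterisation behind the figures is not stated in this post, so treat this form, the function names and all the parameter values as assumptions for illustration only.

```python
import math

def logistic(t, capacity, rate, midpoint):
    """Single logistic curve: cumulative cases at time t (in days)."""
    return capacity / (1.0 + math.exp(-rate * (t - midpoint)))

def double_logistic(t, k1, r1, m1, k2, r2, m2):
    """Sum of two logistic curves whose midpoints m1 < m2 are well separated."""
    return logistic(t, k1, r1, m1) + logistic(t, k2, r2, m2)

# Hypothetical parameters, chosen only to show the two-step shape (not fitted).
params = dict(k1=80_000, r1=0.25, m1=20.0, k2=500_000, r2=0.2, m2=75.0)

# Between the two waves the curve flattens near the first capacity...
first_plateau = double_logistic(45, **params)
# ...and long after both waves it settles at the sum of the two capacities.
final_plateau = double_logistic(200, **params)
```

The first plateau sits close to `k1` and the final level is `k1 + k2`, which is what produces two s-shapes stacked on top of each other. A third term could be added in the same way to represent a third wave.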

We can then extrapolate the curve to predict the evolution of the illness in the future:

Time series global double logistic prediction
This predicts a maximum of around 800,000 cases by mid-April. This seems like a good lower bound on the number of cases. The true number will be significantly higher because this prediction is based only on confirmed cases. Additionally, this model does not take into account the possibility of a third s-shape, which seems quite likely given the difficulty of stopping the spread in Africa and South America. It is sobering to think just how high these numbers could go.

Bringing the focus closer to home, the UK is currently in a period of rapid growth in the number of cases:

Time series UK
Compared to the global population, the UK's population is approximately uniform, so a simple logistic function is a suitable model of the growth in cases:
Time series UK logistic
We can then use this model to extrapolate to predict the future evolution of Covid-19 in the UK:
Time series UK prediction
This predicts that we have not yet reached the maximum growth-rate. A maximum of around 10,000 cases is predicted in the UK by mid-April.
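
To show what "maximum growth-rate" means for a logistic curve, here is a minimal sketch. The growth-rate is the slope of the curve, which peaks at the midpoint with a value of capacity × rate / 4. The capacity below echoes the 10,000 figure quoted above, but the rate and midpoint are invented for illustration and are not the fitted UK values.

```python
import math

def logistic(t, capacity, rate, midpoint):
    """Logistic curve: cumulative cases at time t (in days)."""
    return capacity / (1.0 + math.exp(-rate * (t - midpoint)))

def daily_growth(t, capacity, rate, midpoint):
    """Slope of the logistic curve: new cases per day at time t."""
    y = logistic(t, capacity, rate, midpoint)
    return rate * y * (1.0 - y / capacity)

# Hypothetical parameters: rate and midpoint are assumptions, not fitted values.
capacity, rate, midpoint = 10_000, 0.3, 30.0

# Find the day with the steepest growth over a two-month window.
days = range(61)
peak_day = max(days, key=lambda t: daily_growth(t, capacity, rate, midpoint))
peak_growth = daily_growth(peak_day, capacity, rate, midpoint)
```

The steepest day coincides with the midpoint, when half of the eventual cases have been recorded. Saying that the peak has not yet been reached is therefore the same as saying the fitted midpoint lies in the future.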

Again, the actual number of cases will be substantially higher because we are extrapolating from confirmed cases. The true number of cases of this global pandemic will be strongly dependent on the actions of governments and individuals to mitigate the spread.

The data used in this post are only the confirmed cases. Of course, there will be many times more cases out there that have not been confirmed. Some countries are testing their citizens thoroughly, while others are hardly testing at all. This biases the data towards the countries that are testing more.

Regardless, I hope this post gives you an idea of the fight we are up against. Possibly the toughest fight of our generation.

Feel free to share any of these images and videos, with a link to this article.

The data analysis in this post was conducted using Python. The source code can be found here and the data come from here, collected by the Johns Hopkins University Center for Systems Science and Engineering.

