By Harsh Shah and Harshal Mahadik

The world population hit the 7.8 billion mark in March 2020, just as COVID-19 was declared a global pandemic and brought the world to a standstill. Seven countries hold half of the world’s population, but do the same seven countries hold half of the world’s confirmed COVID-19 cases? They do, and in fact just three of them, the USA, India, and Brazil, account for more than half of all confirmed COVID-19 cases.

Are we finally safe from the deadly virus? Has humankind found a perfect vaccine for COVID-19? We don’t have answers to these questions, but we do know that the pandemic isn’t over yet. The number of confirmed COVID-19 cases is rising rapidly day by day (22.4 million as of 20th August 2020), and the worldwide death toll has reached 788K (as of 20th August 2020). While a few countries have curtailed the spread of the virus, others are still struggling to accommodate patients in hospitals. China, the most populous country and the origin of the COVID-19 virus, is hardly affected relative to its enormous population, while in countries like Qatar and Bahrain, where the population is below 2.5 million, the cumulative cases per capita ratio is higher than in far more populous countries.

We started this project with three questions we wanted to answer:

  • How accurate is the data shared by some of the leading data sources? (Better data = better decision making.)
  • If there ARE differences in “data truths” between data sources, how significant are they? Can they impede research and decision making?
  • Is there value in weighing multiple data sources to achieve better analysis or predictions?

To show an accurate picture of the pandemic, we gathered data and generated visualizations from two sources: the JHU CSSE and Oxford University. The visualizations compare each country’s cases per capita ratio (cumulative number of cases / population) against its population. For clearer visualization and analysis, we excluded India and China from the scatter plots, as they account for almost 34% of the world population; their COVID-19 cases are analyzed later in this post using line graphs. In addition, we predicted the total number of COVID-19 cases using time series analysis on both datasets.
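As a minimal sketch of the cases per capita ratio defined above, the snippet below divides cumulative cases by population. The function name `cases_per_capita` is our own for illustration, and the example pairs the article’s worldwide figures (22.4 million cases, 7.8 billion people), which were recorded a few months apart, so the result is only a rough global ratio.

```python
def cases_per_capita(cumulative_cases: int, population: int) -> float:
    """Cumulative confirmed cases divided by population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cumulative_cases / population


# Worldwide figures quoted in this article: 22.4 million confirmed cases
# (20th August 2020) and a population of 7.8 billion (March 2020).
world_ratio = cases_per_capita(22_400_000, 7_800_000_000)
print(f"{world_ratio:.5f}")
```

The same calculation, applied per country, gives the value plotted against population in the scatter plots.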

Insight #1: Datasets aren’t identical, not in scope and not in data

To begin with, we compared the countries covered by each dataset and found that both are incomplete. Several countries present in the Oxford University dataset are missing from the JHU CSSE dataset, and vice versa. The Oxford University dataset contains data for 185 countries, whereas the JHU CSSE dataset contains data for 213.
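A coverage check of this kind can be sketched with plain set operations. The country lists below are tiny hypothetical excerpts (“Exampleland” is a placeholder, not a real entry), and the helper name `compare_coverage` is our own. With real data, country names would likely need normalizing first, since the two sources may spell them differently.

```python
def compare_coverage(countries_a, countries_b):
    """Return which country names appear in only one list, or in both."""
    a, b = set(countries_a), set(countries_b)
    return {
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
        "in_both": sorted(a & b),
    }


# Hypothetical excerpts; the real datasets list 213 and 185 countries.
jhu_countries = ["Chile", "France", "French Guiana", "Japan"]
oxford_countries = ["Chile", "Exampleland", "France", "Japan"]

coverage = compare_coverage(jhu_countries, oxford_countries)
print("Only in JHU CSSE:", coverage["only_in_a"])
print("Only in Oxford:", coverage["only_in_b"])
```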

Comparison Logic

To compare the datasets, we divided each scatter plot into four quadrants. The logic behind these categories is to highlight the ratio of COVID-19 cases per capita for each country.

  1. Least Affected: Countries with a low population and comparatively fewer cases
  2. Slightly Affected: Countries with a high population and comparatively fewer cases
  3. Moderately Affected: Countries with a high population and comparatively more cases
  4. Most Affected: Countries with a low population and comparatively more cases
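The four categories above can be sketched as a small classification function. The article does not state where the quadrant boundaries lie, so the cutoffs below (`pop_cutoff`, `ratio_cutoff`) and the sample inputs are purely hypothetical, and we assume that “fewer” or “more” cases refers to the cases per capita ratio.

```python
def classify(population, ratio, pop_cutoff, ratio_cutoff):
    """Assign a country to one of the four quadrants.

    pop_cutoff and ratio_cutoff are assumed boundaries between
    "low"/"high" population and "fewer"/"more" cases per capita.
    """
    high_pop = population >= pop_cutoff
    high_ratio = ratio >= ratio_cutoff
    if high_ratio:
        return "Moderately Affected" if high_pop else "Most Affected"
    return "Slightly Affected" if high_pop else "Least Affected"


# Hypothetical cutoffs and inputs, for illustration only.
POP_CUTOFF = 100_000_000
RATIO_CUTOFF = 0.01

print(classify(2_000_000, 0.03, POP_CUTOFF, RATIO_CUTOFF))
print(classify(200_000_000, 0.002, POP_CUTOFF, RATIO_CUTOFF))
```

With these assumed cutoffs, a small country with a high ratio lands in “Most Affected”, and a large country with a low ratio lands in “Slightly Affected”.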




The February data shows that Japan has more cases in the JHU CSSE dataset than in the Oxford University dataset.





The March data shows that Chile has fewer cases in the Oxford University dataset than in the JHU CSSE dataset.

For April, May, June, and July, there is slight variation in the data for Great Britain and France. Notably, for July, French Guiana falls in the “Most Affected” quadrant according to the JHU CSSE dataset, while the Oxford University dataset has no data for it at all.

Insight #2: Nonidentical data yields different analysis results



The stacked bar graphs above may look the same, but there are slight differences between the two datasets:

For July 2020, according to the JHU CSSE dataset, 199 countries were categorized as “Least Affected” countries, compared to just 173 countries according to the Oxford University dataset. 

The number of countries “Slightly Affected” remains identical according to both datasets. 

The JHU CSSE dataset placed 4 countries in the “Most Affected” category, while the Oxford University dataset placed just 3.

Additional discrepancies between the two datasets: 

  • The average cases per capita ratio of “Least Affected” countries was 0.00188 according to the JHU CSSE dataset, compared to 0.00190 according to the Oxford University dataset
  • Similarly, the average for “Slightly Affected” countries was 0.00427 (JHU CSSE), compared to 0.00435 (Oxford University)
  • For “Most Affected” countries, the averages were 0.02558 (JHU CSSE) and 0.02451 (Oxford University)
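To put these averages on a common scale, the sketch below expresses each Oxford University average as a percentage difference from the corresponding JHU CSSE average. The input numbers are the ones reported above; the function name `relative_difference_pct` and the choice of JHU CSSE as the baseline are our own assumptions.

```python
# (JHU CSSE average, Oxford University average) per category, as reported.
averages = {
    "Least Affected": (0.00188, 0.00190),
    "Slightly Affected": (0.00427, 0.00435),
    "Most Affected": (0.02558, 0.02451),
}


def relative_difference_pct(jhu, oxford):
    """Oxford average relative to the JHU CSSE average, in percent."""
    return 100 * (oxford - jhu) / jhu


relative_gaps = {
    category: relative_difference_pct(jhu, oxford)
    for category, (jhu, oxford) in averages.items()
}
for category, gap in relative_gaps.items():
    print(f"{category}: {gap:+.2f}%")
```

The gaps come out at roughly +1%, +2%, and -4% respectively, so the largest relative disagreement is in the “Most Affected” category.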

A more detailed comparison of the datasets can be found in the graphs for the previous months.

India and China cumulative COVID-19 cases comparison for JHU CSSE and Oxford Dataset





In the line graphs for India and China, the curves from the two datasets are indistinguishable: for these two countries, the datasets appear to be identical.
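A visual match can be backed by a numeric check. The sketch below computes the largest absolute gap between two cumulative case series on the dates they share. The series shown are hypothetical placeholder values, not figures from either dataset, and the helper name `max_abs_difference` is our own.

```python
def max_abs_difference(series_a, series_b):
    """Largest absolute gap between two date -> cases mappings,
    measured only on the dates present in both."""
    shared_dates = sorted(set(series_a) & set(series_b))
    if not shared_dates:
        raise ValueError("the two series share no dates")
    return max(abs(series_a[d] - series_b[d]) for d in shared_dates)


# Hypothetical cumulative counts, for illustration only.
jhu_series = {"2020-07-01": 1000, "2020-07-02": 1100, "2020-07-03": 1250}
oxford_series = {"2020-07-01": 1000, "2020-07-02": 1100, "2020-07-03": 1250}

gap = max_abs_difference(jhu_series, oxford_series)
print(gap)
```

A result of zero means the two sources agree exactly on every shared date; any positive value gives the size of the largest disagreement.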

Insight #3: Additional factors, such as misaligned data “age”, impact predictions and outcomes





To predict future COVID-19 case counts, we applied time series analysis.

  • Using the JHU CSSE dataset, our model predicted 20.92 million total COVID-19 cases globally by 18th August, whereas the actual figure for 18th August was 22.23 million
  • Using the Oxford dataset, our model predicted 20.95 million total COVID-19 cases for the same date
  • The two predictions differ by 0.03 million cases, and the prediction curves of the two datasets show no visible difference. The 0.03 million gap may result from how recently each dataset was updated: the JHU CSSE dataset was last updated on July 31st, while the Oxford dataset was last updated on August 6th. With more recent data, the Oxford-based prediction came closer to the actual number of COVID-19 cases than the JHU CSSE-based one.
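The arithmetic behind this comparison can be worked through in a few lines. The sketch below uses only the figures reported above (in millions of cases) and computes each prediction’s absolute and percentage error against the actual 18th August count. It does not reproduce the forecasting model itself, and the function name `forecast_error` is our own.

```python
ACTUAL_MILLIONS = 22.23  # actual global cases on 18th August, as reported
predictions_millions = {"JHU CSSE": 20.92, "Oxford": 20.95}


def forecast_error(predicted, actual):
    """Return (absolute error, percentage error relative to actual)."""
    abs_err = abs(actual - predicted)
    return abs_err, 100 * abs_err / actual


errors = {
    name: forecast_error(predicted, ACTUAL_MILLIONS)
    for name, predicted in predictions_millions.items()
}
for name, (abs_err, pct) in errors.items():
    print(f"{name}: off by {abs_err:.2f} million ({pct:.2f}%)")
```

Both predictions undershoot the actual figure by about 1.3 million cases, a little under 6%, so the 0.03 million gap between the two datasets is small compared with the forecast error itself.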

Additional Conclusion

  • Both datasets contain very similar COVID-19 data for most of the countries they cover
  • Minor differences were found for a few countries, such as Great Britain, France, Japan, and Chile
  • The major difference between the datasets is that each is missing several countries present in the other

No single individual or country can do everything, but we can all do something to fight this global pandemic. Together, we can save lives, protect resources, and care for each other.

View Code used in this Article