Luxi Wei, Ran Chen, George Tolkachev, Cleo Zisang Yang, Susie Fan

This work won second place at the Wharton Hackathon, held September 21-27.
We are so happy for team Golden Kiwi. Congratulations!

Winners of the Wharton Hackathon

Last week, I won second place in the Wharton Hackathon with Luxi Wei, George Tolkachev, Ran Chen, and Cleo Yang, a rock-star team of computer scientists, statisticians, and strategists from the graduate school of the University of Pennsylvania. Sponsored by Wharton and MiPasa.org, it was the first hackathon I joined after graduating from Wharton this May, and the only hackathon I have completed 100% virtually. Although we’ve been taking classes online since the start of the year, it was our first time truly “working together when we can’t be together.” We had some hiccups at first, but eventually, after five long nights on Zoom, we came up with some super interesting insights. Without further ado, let’s get to it! :D

Golden Kiwi’s team spirit at Friday 4:55 pm, minutes before the submission deadline

• • •

COVID-19 changed our lives overnight. Unemployment rose 10%, to a level higher than that of the Great Recession. Office buildings closed down. Graduations were postponed. Young people fled to their parents’ homes. After months of quarantine, all of a sudden restaurants are opening up, Airbnbs are booked up, and bikes and inflatable pools are selling out. The summer is over, and we are still on the “COVID rollercoaster.” Let’s catch our breath, look back, and learn from what happened, how residents in different states reacted, and what we can do to move our economy forward in a sustainable way.

People in different states appear to have had similar experiences at the onset of the pandemic, but their experiences diverged drastically as it unfolded. From January to mid-March, states started reporting their first COVID cases, and some declared a state of emergency. From mid-March to the end of June, businesses and schools were closed and travel was restricted. The economy started to look like it was heading toward a recession, and case and death counts rose rapidly. After the “lock-down”, some states seemed to have “flattened the curve”, and several started to reopen some businesses at reduced capacity. From then to August, as reopening progressed, some states were already seeing a “second wave” of COVID, even though many others were still in their “first wave”.

Today, we will zoom in to see how states differed in their economies, residents’ behavior, and COVID trends, and how we can use the relationships among these factors to predict the economic impact on each state using machine learning algorithms.

1. The Challenge

The COVID-19 outbreak has revealed a global lack of verifiable, timely, and trusted data. How might we identify and analyze data to help the public, researchers, and policymakers better understand the impact of the pandemic on different aspects of the economy?

Prompt from Wharton Hackathon

The ability to monitor and forecast the economic impact of COVID-19 on society is critical for economic policymaking. To take a stab at this challenge, we created a system that analyzes each state’s economic results using cluster analysis and uses machine learning algorithms to group US states by their level of performance. We apply k-means clustering to investigate the economic impact of the pandemic using state-level data, including economic performance, residents’ behavioral characteristics, demographics, and COVID-19 trends. We then examine how the k-means clustering results change across three phases: pre-COVID, lock-down, and re-opening. To further interpret our results, we use a linear model to predict state-level credit card spending and employment for Q4 2020, two good benchmarks for monitoring a region’s economic performance. Finally, we visualize our results in clusters to help policymakers make effective decisions.

2. Choosing Datasets

Unsupervised learning algorithms like k-means clustering can sort almost anything into similar groups, even when the patterns aren’t obvious to humans. How do they do it? It all depends on the dataset. Our models need to be fed a large set of granular data describing each state so that they can learn the patterns. For example, our model might learn that states with higher income levels tend to have more people driving and fewer COVID-19 cases during the initial outbreak, but more cases as the pandemic deepens.

The trick is to find datasets large and diverse enough for our model to pick out interesting patterns that characterize each state’s response. We decided to use income level, unemployment, educational attainment, mobility, card spending, and COVID-19 case number datasets, so that we cover the three most important aspects: economic, behavioral, and demographic. You can find detailed descriptions in our paper.

3. Training Models

Once we had the datasets cleaned and split into pre-COVID, lock-down, and re-opening periods, it was time to train our models. We used a series of tools in sklearn to implement k-means and linear prediction, and visualized our output using matplotlib.
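As a minimal sketch of the split into phases, the function below assigns a calendar day to one of the three periods. The boundary dates (March 15 and July 1, 2020) are assumptions read off the timeline described earlier ("mid-March" and "the end of June"), and the names `phase_of` and `split_by_phase` are hypothetical; this is an illustration, not the code from the hackathon submission.

```python
from datetime import date

# Assumed phase boundaries, inferred from "mid-March" and "end of June"
# in the timeline above; the exact cutoffs are illustrative.
LOCKDOWN_START = date(2020, 3, 15)
REOPENING_START = date(2020, 7, 1)


def phase_of(day: date) -> str:
    """Return the phase label for a calendar day in 2020."""
    if day < LOCKDOWN_START:
        return "pre-COVID"
    if day < REOPENING_START:
        return "lock-down"
    return "re-opening"


def split_by_phase(series: dict[date, float]) -> dict[str, list[float]]:
    """Group a daily series into the three phases, keeping date order."""
    phases: dict[str, list[float]] = {
        "pre-COVID": [],
        "lock-down": [],
        "re-opening": [],
    }
    for day in sorted(series):
        phases[phase_of(day)].append(series[day])
    return phases
```

Each per-phase list can then be summarized into one or two features per state before clustering.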

We choose the k-means algorithm because it provides a simple, efficient, robust, and scalable way to group data according to non-obvious patterns. We then examine the clustering results to investigate the changing dynamics across different states throughout the year in order to surface the strategies leading to more sustainable economic performance during the pandemic.
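To make the clustering step concrete, here is a small sketch of k-means with sklearn on synthetic data. The data, the choice to standardize features first, and the `n_init` and `random_state` settings are all assumptions made for illustration; only the use of k-means with three groups comes from our analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a state-by-feature matrix: 30 "states", 2 features,
# drawn around three well-separated centers. Not real data.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((10, 2)) for c in centers])

# Standardize so that features on different scales contribute equally
# to the Euclidean distances k-means relies on (an assumed preprocessing step).
X_scaled = StandardScaler().fit_transform(X)

# k=3 matches the three groups used in this article; random_state fixes the
# initialization so the run is repeatable.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(labels)                   # one cluster id (0, 1, or 2) per row of X
print(kmeans.cluster_centers_)  # centroids, in standardized units
```

The cluster ids themselves carry no meaning; what matters is which rows end up sharing an id.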

Finally, we use a linear model to predict consumer spending and unemployment based on the fundamental factors described in our clustering analyses.

Model 1: Economic Impact of COVID-19

To cluster the states based on how greatly COVID-19 has impacted them from an economic perspective, we use two features for each time interval: the difference in unemployment rates from the start to the end of the interval, and the rate of card spending throughout the interval.
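The two features can be sketched in a few lines. This example assumes that "rate of card spending throughout the interval" means the average of the spending series over the interval, which is one plausible reading; the function name `economic_features` and the numbers are made up for illustration.

```python
from statistics import mean


def economic_features(unemployment: list[float], card_spending: list[float]) -> tuple[float, float]:
    """Return (change in unemployment, average card spending) for one interval.

    Both lists are assumed to be ordered in time and to cover the same interval.
    """
    if not unemployment or not card_spending:
        raise ValueError("each series needs at least one observation")
    unemployment_change = unemployment[-1] - unemployment[0]  # end minus start
    avg_spending = mean(card_spending)  # assumed summary of the spending series
    return unemployment_change, avg_spending


# Made-up numbers for a single hypothetical state and interval.
change, spending = economic_features(
    unemployment=[4.0, 9.5, 12.0],        # percent, start to end of interval
    card_spending=[-0.10, -0.30, -0.20],  # relative change vs. a baseline
)
```

Stacking one such pair per state gives the two-column matrix that the clustering step takes as input.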

Model 2: Behavioral Impact of COVID-19

To cluster the states based on the behavioral impact of COVID-19, we use mobility data for each time interval, which shows the percent change in Apple Maps searches for three modes of transportation: walking, driving, and transit.

Model 3: Trends of COVID-19

To cluster the states based on how the number of COVID-19 cases fluctuated over the three time intervals, we use the case number data from Johns Hopkins University.

Model 4: Demographics before COVID-19

To cluster the states based on their pre-COVID-19 conditions, we use two features: the average income for each state in 2018, and the percentage of people in each state who have attained a high school degree or higher.

Model 5: Predictive Analysis of Economic Impact of COVID-19

Our last model aims to predict the economic impact of COVID-19 in September 2020 using data on the four categories above from the previous eight months (January to August 2020). The purpose is to investigate whether we can use past data to predict the future economic impact of COVID-19. To this end, we apply a linear regression model in which the independent variables are the pre-COVID-19 conditions, the behavioral impact of COVID-19, and fluctuations in COVID-19 case numbers, and the dependent variables are the difference in unemployment rates and card spending.
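A regression with several predictors and two outcomes can be sketched with sklearn's `LinearRegression`, which accepts a two-column target. Everything below is synthetic: the data, the coefficient values, and the train/test split across rows are assumptions for illustration, and the split is a generic hold-out check rather than the January to August versus September setup described above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 50 "states", 4 predictors, 2 targets. Not real data.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))  # e.g. income, education, mobility, case trend
true_coef = np.array([[1.0, -2.0], [0.5, 0.0], [0.0, 3.0], [-1.0, 1.0]])
# Two targets, standing in for unemployment change and card spending.
Y = X @ true_coef + 0.01 * rng.standard_normal((50, 2))

# Hold out the last 10 rows to check out-of-sample fit.
X_train, X_test = X[:40], X[40:]
Y_train, Y_test = Y[:40], Y[40:]

# LinearRegression accepts a 2-D target and fits one linear model per column.
model = LinearRegression().fit(X_train, Y_train)
Y_pred = model.predict(X_test)

print(model.coef_.shape)            # (2, 4): one row of coefficients per target
print(model.score(X_test, Y_test))  # R^2 averaged over the two targets
```

With only about fifty states and a handful of predictors, a held-out score like this is a rough sanity check rather than a strong guarantee.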

4. Making Sense of the Output

The most interesting insights come from Model 1, Model 2, and Model 5.

Model 1: Economic Impact of COVID-19

Our model of economic impact analyzes how the unemployment rate and card spending evolved as COVID-19 hit. To reflect the impact of COVID-19, we look at the economic indicators during three periods: pre-COVID, lock-down, and re-opening. We then put states into three groups based on k-means clustering.

We compared our clustering results with the political voting map, given the increasing media coverage of how the COVID-19 pandemic has exacerbated the red state/blue state divide.

Contrary to popular belief, we don’t see a clear partisan divide in COVID-related economic performance.

For example, we see that Montana, a red state, and California, a blue state, belong to the same cluster during all three periods. Interestingly, we also see Texas fall into the same cluster as California and Washington during the lock-down, coinciding with recently voiced opinions that Texas is turning blue.


Fig. A: Pre-COVID

To better show the clustering results, we visualize the three groups on a US map. The Yellow cluster, or the “Resilient” cluster, underperformed in both unemployment and card spending pre-COVID but outperformed during the lock-down and re-opening periods. States that belong to this cluster, such as South Carolina and Tennessee, performed well during the pandemic and emerged stronger relative to other states.

The Green cluster is the one hit hardest by COVID-19 from an economic perspective throughout the pandemic. The Blue cluster falls between the Yellow and Green clusters.


Fig. B: Lockdown

After looking at how the economy changed in each cluster on an aggregate basis, we zoom in on individual states that fall into different clusters during different periods, and on what that implies about their economic performance. For example, Georgia is in the Green cluster pre-COVID but falls into the Blue cluster after COVID hits. This suggests that Georgia had a stronger job market and stronger spending prior to COVID, but that its spending fell into the underperforming cluster after COVID, despite Georgia being the first state to reopen.
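Spotting states that move between clusters is a simple comparison once each phase has its own labels. The sketch below assumes the cluster labels have already been matched across phases (k-means cluster ids are arbitrary from one fit to the next), and it uses placeholder states and labels rather than results from our analysis; the function name `cluster_changes` is hypothetical.

```python
def cluster_changes(labels_by_phase: dict[str, dict[str, str]], phases: list[str]) -> dict[str, list[str]]:
    """Return, for each state whose cluster changes, its cluster in every phase."""
    changed = {}
    for state in labels_by_phase[phases[0]]:
        path = [labels_by_phase[phase][state] for phase in phases]
        if len(set(path)) > 1:  # more than one distinct cluster across phases
            changed[state] = path
    return changed


# Placeholder labels for three hypothetical states; not actual results.
labels_by_phase = {
    "pre-COVID":  {"State A": "green", "State B": "blue", "State C": "yellow"},
    "lock-down":  {"State A": "blue",  "State B": "blue", "State C": "yellow"},
    "re-opening": {"State A": "blue",  "State B": "blue", "State C": "green"},
}
phases = ["pre-COVID", "lock-down", "re-opening"]
movers = cluster_changes(labels_by_phase, phases)
print(movers)
```

States that never appear in `movers` stayed in the same cluster through all three phases.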


Fig. C: Re-opening

Model 2: Behavioral Impact of COVID-19

Here we apply the k-means clustering model on the mobility dataset to evaluate the rate of change of primary means of transportation across states throughout the pandemic.


Fig. A: Pre-COVID

We see that at the onset of the pandemic, most states have Walking (Yellow) as their trending mode of transportation, as spring arrives and the weather warms up. During the outbreak this summer, we see a sharp reduction in the number of states where Public Transit (Blue) is trending and an increase in Driving (Green), as people start moving out of big cities and into more remote areas to avoid crowds.


Fig. B: Lockdown


Fig. C: Re-opening

Finally, as reopening starts in the fall, we start to see more states embracing Walking as their primary form of transportation.

Walking has become even more popular than pre-COVID.

(We skip Models 3 and 4 here. To read about them, see our paper.)

Model 5: Predictive Analysis of Economic Impact of COVID-19

Our results for predicting the economic impact of COVID-19 for September 2020 indicate an approximate north/south divide between states that are doing well and those doing poorly.

In general, our predictions for Q4 2020 using data from the first three quarters are not too far off from the actual values of unemployment and spending that we already observed in September 2020.

States such as Montana and Minnesota show a low level of unemployment and an upward trend in spending, while states such as Louisiana and Tennessee exhibit an increase in unemployment as well as an increase in spending. Other states, such as Connecticut, show an increase in unemployment and a drop in spending, indicating that their residents are more conservative during an economic downturn.





5. Putting It All Together

Our approach provides a data-driven representation of the impact of COVID-19 on different states during the pandemic. By visualizing the k-means clustering results on a US map, we identified the states that emerged stronger from the pandemic and the states that suffered the most. Such clustering and visualization provide insights and groundwork for policymakers to better monitor and address the impact of COVID-19 based on the condition of each state cluster.

We also provide a good starting point for identifying the important factors that influence a state’s economic outcome during the pandemic and for predicting that outcome. Although policy plays a crucial role in the evolution of the pandemic, our data indicate that it does so by influencing key indicators, such as means of transportation and consumer behavior. When more data become available, we plan to carry out systematic model selection (e.g. LASSO, elastic net, SCAD, MCP) to find the influential factors that make some states more resilient than others.