COVID-19 Dataset¶
This dataset is publicly available and hosted by Google in a very consumable format. More information about all the different statistics can be found here: https://health.google.com/covid-19/open-data
The dataset we will be looking at covers new cases, deaths, tests, and recoveries per day per country. Since the original dataset is massive, containing everything from search trends to ventilator patients and much more, we will only be looking at the daily counts of cases, deaths, tests, hospitalizations, and vaccinations over time.
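For reference, the per-location files follow a simple URL pattern; `US` is the key this notebook uses, and other location keys are assumed to follow the key scheme documented at the link above. A minimal sketch:

```python
import pandas as pd

# Per-location CSVs live under a predictable URL. "US" is the key used in
# this notebook; other keys are assumed to follow the dataset's documented
# location-key scheme.
def load_location(key: str) -> pd.DataFrame:
    url = f"https://storage.googleapis.com/covid19-open-data/v3/location/{key}.csv"
    return pd.read_csv(url)

us = load_location("US")
```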
Modeling¶
During the course of modeling the cases and deaths, we want to understand which models will work best, starting from simple LinearRegression and ending with a more complex model like ElasticNet.
The main goal is to understand if we can model new cases over time and predict the number of new cases in the future given the features.
import warnings
from sklearn.exceptions import ConvergenceWarning
# disable future, user, deprecation, runtime, and convergence warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=ConvergenceWarning)
import os
import pandas as pd
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
if os.path.exists("data.csv"):
    df = pd.read_csv("data.csv")
else:
    df = pd.read_csv("https://storage.googleapis.com/covid19-open-data/v3/location/US.csv")
    df.to_csv("data.csv", index=False)  # index=False avoids writing a stray index column
columns = [
    'date',
    'new_confirmed',
    'new_deceased',
    'new_tested',
    'new_hospitalized_patients',
    'new_intensive_care_patients',
    'new_ventilator_patients',
    'new_persons_vaccinated',
    'new_persons_fully_vaccinated'
]
df = df[columns]
df.date = pd.to_datetime(df.date)
# days since the first record, used as the numeric time feature
df['breakout'] = (df.date - df.date.min()).dt.days
# fill NaN values with the last valid observation, then 0 for any leading gaps
df = df.ffill().fillna(0)
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 991 entries, 0 to 990
Data columns (total 10 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   date                          991 non-null    datetime64[ns]
 1   new_confirmed                 991 non-null    float64
 2   new_deceased                  991 non-null    float64
 3   new_tested                    991 non-null    float64
 4   new_hospitalized_patients     991 non-null    float64
 5   new_intensive_care_patients   991 non-null    float64
 6   new_ventilator_patients       991 non-null    float64
 7   new_persons_vaccinated        991 non-null    float64
 8   new_persons_fully_vaccinated  991 non-null    float64
 9   breakout                      991 non-null    int64
dtypes: datetime64[ns](1), float64(8), int64(1)
memory usage: 77.5 KB
|   | date | new_confirmed | new_deceased | new_tested | new_hospitalized_patients | new_intensive_care_patients | new_ventilator_patients | new_persons_vaccinated | new_persons_fully_vaccinated | breakout |
|---|------|---------------|--------------|------------|---------------------------|-----------------------------|-------------------------|------------------------|------------------------------|----------|
| 0 | 2020-01-01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | 2020-01-02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 2 | 2020-01-03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
| 3 | 2020-01-04 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
| 4 | 2020-01-05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 |
# correlation matrix
plt.figure(figsize=(10, 10))
# numeric_only excludes the non-numeric date column from the matrix
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="YlGnBu")
plt.show()
sns.pairplot(df)
plt.show()
A couple of interesting correlations to speculate on: there is a significant positive correlation between hospitalizations and deaths, and positive correlations between hospitalizations and both confirmed cases and tests. Of course, correlation is not causation, but it is interesting to see these features move together.
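To make those claims concrete, here is a minimal sketch (reusing the `df` above) that ranks each feature by its correlation with `new_confirmed`:

```python
# rank the numeric features by their correlation with new_confirmed;
# numeric_only=True skips the datetime date column
corr_with_cases = (
    df.corr(numeric_only=True)['new_confirmed']
      .drop('new_confirmed')
      .sort_values(ascending=False)
)
print(corr_with_cases)
```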
Modeling!¶
We will start with the most basic modeling, and then move on to more complex models, throwing in polynomial features along the way if needed. We will be using the following models:
- LinearRegression
- Ridge
- Lasso
- ElasticNet
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import RFECV
# features: everything except the target and the raw date column
x = df.drop(columns=['date', 'new_confirmed'])
y = df['new_confirmed']
# note: no random_state is set, so scores will vary from run to run
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
r2_scores = {}
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
score = r2_score(y_test, y_pred)
print(f"Model R^2: {score}")
r2_scores['linear'] = score
# plot predictions vs actual
plt.figure(figsize=(20, 5))
plt.scatter(x_test.breakout, y_pred)
plt.scatter(x_test.breakout, y_test)
plt.legend(labels=["Predicted", "Actual"])
plt.show()
Model R^2: 0.6148192681133351
With simple LinearRegression, we typically find R^2 values between 0.5 and 0.7, depending on the random split. This gives us a baseline metric for comparing future models, and the best guess is that this model is underfitting.
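Since the split is unseeded, a single test-set R^2 is noisy; a minimal sketch using 5-fold cross-validation on the same `x` and `y` quantifies the spread:

```python
from sklearn.model_selection import cross_val_score

# cross-validation gives a distribution of R^2 scores rather than a
# single split-dependent number
cv_scores = cross_val_score(LinearRegression(), x, y, cv=5, scoring='r2')
print(f"R^2 per fold: {cv_scores}")
print(f"mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}")
```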
model = Pipeline([
('poly', PolynomialFeatures()),
('model', LinearRegression())
])
params = {
'poly__degree': [1, 2, 3, 4, 5],
}
search = GridSearchCV(model, params, cv=5, scoring='r2', n_jobs=-1)
search.fit(x_train, y_train)
best_model = search.best_estimator_
best_params = search.best_params_
best_score = search.best_score_
print(f"Best Score: {best_score}")
print(f"Best Params: {best_params}")
r2_scores['polylinear'] = best_score
y_pred = best_model.predict(x_test)
# plot predictions vs actual
plt.figure(figsize=(20, 5))
plt.scatter(x_test.breakout, y_pred)
plt.scatter(x_test.breakout, y_test)
plt.legend(labels=["Predicted", "Actual"])
plt.show()
Best Score: 0.8604017968416097 Best Params: {'poly__degree': 2}
Adding polynomial features, the gain is run-dependent: in many runs the grid search selects degree 1, eliminating polynomial terms entirely, perhaps because of the limited grid we searched over. In this run, a polynomial degree of 2 performed at a mean cross-validation R^2 of ~0.86. Note that `best_score_` is a cross-validation score on the training split, so it is not directly comparable to the earlier test-set R^2.
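To see how every degree fared, not just the winner, the per-candidate means can be read out of `cv_results_` (a sketch against the `search` object fitted above):

```python
# mean cross-validated R^2 for each polynomial degree in the grid
for degree, mean_score in zip(search.cv_results_['param_poly__degree'],
                              search.cv_results_['mean_test_score']):
    print(f"degree={degree}: mean CV R^2 = {mean_score:.3f}")
```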
model = Pipeline([
('poly', PolynomialFeatures()),
('scaler', MinMaxScaler()),
('model', Ridge())
])
params = {
'poly__degree': [1, 2, 3, 4, 5],
'model__alpha': [0.1, 0.5, 1, 2, 5, 10, 20, 50, 100]
}
search = GridSearchCV(model, params, cv=5, scoring='r2', n_jobs=-1)
search.fit(x_train, y_train)
best_model = search.best_estimator_
best_params = search.best_params_
best_score = search.best_score_
print(f"Best Score: {best_score}")
print(f"Best Params: {best_params}")
r2_scores['polyridge'] = best_score
y_pred = best_model.predict(x_test)
# plot predictions vs actual
plt.figure(figsize=(20, 5))
plt.scatter(x_test.breakout, y_pred)
plt.scatter(x_test.breakout, y_test)
plt.legend(labels=["Predicted", "Actual"])
plt.show()
Best Score: 0.8878717762645195 Best Params: {'model__alpha': 0.1, 'poly__degree': 3}
With ridge regression, we see a further improvement: regularization certainly seemed to help, and some polynomial features were better than none. The best estimator had an R^2 value of ~0.89.
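To visualize what the penalty is doing, here is a sketch of the ridge coefficient paths as alpha grows, reusing the degree-3 polynomial expansion the search selected (the alpha range is an arbitrary choice for illustration):

```python
import numpy as np

# trace how the ridge coefficients shrink as alpha increases, using the
# same preprocessing as the winning pipeline (degree-3 poly + MinMax scaling)
poly = PolynomialFeatures(degree=3)
scaler = MinMaxScaler()
x_poly = scaler.fit_transform(poly.fit_transform(x_train))

alphas = np.logspace(-2, 3, 30)
coefs = [Ridge(alpha=a).fit(x_poly, y_train).coef_ for a in alphas]

plt.figure(figsize=(10, 4))
plt.plot(alphas, coefs)
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('coefficient value')
plt.title('Ridge coefficient paths')
plt.show()
```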
model = Pipeline([
#('poly', PolynomialFeatures()),
# we will leave out polynomial features for now
# so we can see if any features are not important
('scaler', MinMaxScaler()),
('model', Lasso())
])
params = {
#'poly__degree': [1, 2, 3, 4, 5],
'model__alpha': [0.1, 0.5, 1, 2, 5, 10, 20, 50, 100]
}
search = GridSearchCV(model, params, cv=5, scoring='r2', n_jobs=-1)
search.fit(x_train, y_train)
best_model = search.best_estimator_
best_params = search.best_params_
best_score = search.best_score_
print(f"Best Score: {best_score}")
print(f"Best Params: {best_params}")
print(f"Model Coefficients: {best_model['model'].coef_}")
r2_scores['lasso'] = best_score
y_pred = best_model.predict(x_test)
# plot predictions vs actual
plt.figure(figsize=(20, 5))
plt.scatter(x_test.breakout, y_pred)
plt.scatter(x_test.breakout, y_test)
plt.legend(labels=["Predicted", "Actual"])
plt.show()
Best Score: 0.6684711441421003 Best Params: {'model__alpha': 100} Model Coefficients: [ -0. 69726.29836806 549692.71633038 -235830.42772072 -29252.14225937 -94295.30754802 -24356.24698644 0. ]
Surprisingly, even at the largest alpha in the grid (100), Lasso zeroed out only two of the eight coefficients. With a best score of only ~0.67, we can start to suspect that this model is underfitting.
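A quick sketch pairs each coefficient with its feature name, so we can see exactly which features Lasso kept (this best estimator has no polynomial step, so the columns of `x` line up with the coefficients):

```python
# pair each coefficient with its feature name; zeros were eliminated by Lasso
for name, coef in zip(x.columns, best_model['model'].coef_):
    status = "zeroed" if coef == 0 else f"{coef:,.1f}"
    print(f"{name:30s} {status}")
```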
model = Pipeline([
('poly', PolynomialFeatures()),
('scaler', MinMaxScaler()),
('model', Lasso())
])
params = {
'poly__degree': [1, 2, 3, 4, 5],
'model__alpha': [0.1, 0.5, 1, 2, 5, 10, 20, 50, 100]
}
search = GridSearchCV(model, params, cv=5, scoring='r2', n_jobs=-1)
search.fit(x_train, y_train)
best_model = search.best_estimator_
best_params = search.best_params_
best_score = search.best_score_
print(f"Best Score: {best_score}")
print(f"Best Params: {best_params}")
print(f"Model Coefficients: {len(best_model['model'].coef_)}")
print(f"Total Zeroed Coefficients: {len([coef for coef in best_model['model'].coef_ if coef == 0])}")
r2_scores['polylasso'] = best_score
y_pred = best_model.predict(x_test)
# plot predictions vs actual
plt.figure(figsize=(20, 5))
plt.scatter(x_test.breakout, y_pred)
plt.scatter(x_test.breakout, y_test)
plt.legend(labels=["Predicted", "Actual"])
plt.show()
Best Score: 0.8951442817507843 Best Params: {'model__alpha': 50, 'poly__degree': 4} Model Coefficients: 495 Total Zeroed Coefficients: 457
After selecting a polynomial of degree 4, the Lasso regression zeroed out nearly all of the 495 generated features anyway (457 of them). This is a good sign that some of the dataset's features may not be useful, or that the expanded model is simply too complex; with regularization we don't have to worry about that. The best score was ~0.89, the best of the models so far.
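A sketch that names the surviving terms, using `get_feature_names_out` on the fitted polynomial step (the top-10 cutoff is arbitrary):

```python
# recover the generated polynomial feature names and list the terms
# that Lasso left non-zero, largest magnitude first
feature_names = best_model['poly'].get_feature_names_out(x.columns)
coefs = best_model['model'].coef_
survivors = [(n, c) for n, c in zip(feature_names, coefs) if c != 0]
print(f"{len(survivors)} non-zero terms out of {len(coefs)}")
for name, coef in sorted(survivors, key=lambda t: abs(t[1]), reverse=True)[:10]:
    print(f"{name:60s} {coef:,.1f}")
```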
model = Pipeline([
('poly', PolynomialFeatures()),
('scaler', MinMaxScaler()),
('model', ElasticNet())
])
params = {
    'poly__degree': [1, 2, 3, 4, 5],
    'model__alpha': [0.1, 0.5, 1, 2, 5, 10, 20, 50, 100],
    # l1_ratio mixes the L1 and L2 penalties and must lie in [0, 1]
    'model__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9, 1]
}
search = GridSearchCV(model, params, cv=5, scoring='r2', n_jobs=-1)
search.fit(x_train, y_train)
best_model = search.best_estimator_
best_params = search.best_params_
best_score = search.best_score_
print(f"Best Score: {best_score}")
print(f"Best Params: {best_params}")
print(f"Model Coefficients: {len(best_model['model'].coef_)}")
print(f"Total Zeroed Coefficients: {len([coef for coef in best_model['model'].coef_ if coef == 0])}")
r2_scores['polyelastic'] = best_score
y_pred = best_model.predict(x_test)
# plot predictions vs actual
plt.figure(figsize=(20, 5))
plt.scatter(x_test.breakout, y_pred)
plt.scatter(x_test.breakout, y_test)
plt.legend(labels=["Predicted", "Actual"])
plt.show()
Best Score: 0.8951442817507843 Best Params: {'model__alpha': 50, 'model__l1_ratio': 1, 'poly__degree': 4} Model Coefficients: 495 Total Zeroed Coefficients: 457
Finally, when applying ElasticNet, we see that the best score is again ~0.89, with almost exactly the same parameters as the Lasso model. That is no coincidence: with `l1_ratio` of 1 the penalty is pure L1, so ElasticNet reduces to Lasso. This may indicate we are dealing with too complex a model and could probably get away with a simpler one.
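The equivalence can be checked directly; a minimal sketch fitting both models on the same expanded features, with alpha and degree taken from this run's best params:

```python
import numpy as np

# with l1_ratio=1 the ElasticNet objective is exactly Lasso's, so the
# fitted coefficients should agree up to solver tolerance
poly = PolynomialFeatures(degree=4)
scaler = MinMaxScaler()
x_poly = scaler.fit_transform(poly.fit_transform(x_train))

lasso_coef = Lasso(alpha=50).fit(x_poly, y_train).coef_
enet_coef = ElasticNet(alpha=50, l1_ratio=1).fit(x_poly, y_train).coef_
print(np.allclose(lasso_coef, enet_coef, atol=1e-4))
```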
# plot r2 scores
plt.figure(figsize=(7, 3))
plt.bar(r2_scores.keys(), r2_scores.values())
print('best models: ', sorted(r2_scores.items(), key=lambda x: x[1], reverse=True)[:3])
best models: [('polylasso', 0.8951442817507843), ('polyelastic', 0.8951442817507843), ('polyridge', 0.8878717762645195)]
When we compare the best scores of each model, we see that the regularized models performed better than the non-regularized ones, and all the top models included some polynomial features. This probably indicates that some of the features are not linearly related to the target variable; but given how many of the expanded features were zeroed out, not all of them are useful. One caveat: the 'linear' entry is a test-set score while the others are cross-validation scores, so the comparison is only rough.
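For an apples-to-apples comparison, each winner can be refit with its selected hyperparameters and scored on the same held-out split; a sketch, with the hyperparameters hard-coded from this run's best params:

```python
# refit each tuned candidate and score all of them on the same test split
candidates = {
    'linear': LinearRegression(),
    'polyridge': Pipeline([('poly', PolynomialFeatures(degree=3)),
                           ('scaler', MinMaxScaler()),
                           ('model', Ridge(alpha=0.1))]),
    'polylasso': Pipeline([('poly', PolynomialFeatures(degree=4)),
                           ('scaler', MinMaxScaler()),
                           ('model', Lasso(alpha=50))]),
}
for name, est in candidates.items():
    est.fit(x_train, y_train)
    print(f"{name:10s} test R^2 = {r2_score(y_test, est.predict(x_test)):.3f}")
```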
Conclusion¶
With this analysis, our top models reached an R^2 of ~0.89. We chose the features that intuitively seemed likely to be correlated with new cases, and the pairplot shows heavy correlation between them. But the dataset itself contains many more features we could analyze; perhaps using search trends or location data could help, as the virus spread from location to location over time.
We may not even be looking at all the data we need. COVID-19 did come from overseas before it reached the US mainland, so looking at data from other countries could help us predict the spread of the virus in the US. Flight data may also be useful in this case.