One straightforward way to run cross-validation with TimeSeriesSplit() on a DataFrame in an end-to-end Python pipeline is as follows:
import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
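# Load the data with the first column parsed as a datetime index, then sort it
# chronologically, because TimeSeriesSplit assumes rows are already in time order.
df = pd.read_csv('data.csv', parse_dates=[0], index_col=0).sort_index()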
X = df.drop('target_variable', axis=1)
y = df['target_variable'].values
pipe = Pipeline([('scaler', StandardScaler()),
                 ('regressor', LinearRegression())])
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(pipe, X, y, cv=tscv, scoring='neg_mean_squared_error')
print('Mean Squared Error: ', -np.mean(scores))
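This pipeline standardizes the features, fits a linear regression model, and scores it with TimeSeriesSplit() cross-validation, where each fold trains on earlier rows and tests on the rows that immediately follow. Because the scaler sits inside the pipeline, it is refit on each training fold only, so statistics from the test fold do not leak into training. The 'neg_mean_squared_error' scorer returns negated MSE values (scikit-learn scorers follow a higher-is-better convention), which is why the mean is negated before printing.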
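To make the fold structure concrete, here is a minimal sketch that reproduces by hand what cross_val_score does with TimeSeriesSplit(). The data is synthetic and purely illustrative: the column names f1 and f2, the 60 daily rows, and the coefficients are assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, chronologically ordered data (an assumption for illustration only).
rng = np.random.default_rng(0)
index = pd.date_range("2024-01-01", periods=60, freq="D")
X = pd.DataFrame(rng.normal(size=(60, 2)), index=index, columns=["f1", "f2"])
y = (2.0 * X["f1"] - X["f2"] + rng.normal(scale=0.1, size=60)).to_numpy()

pipe = Pipeline([("scaler", StandardScaler()), ("regressor", LinearRegression())])
tscv = TimeSeriesSplit(n_splits=5)

manual_mse = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Every training row precedes every test row, so no future data leaks in.
    assert train_idx.max() < test_idx.min()
    # Fit a fresh copy of the pipeline on the training window only.
    model = clone(pipe).fit(X.iloc[train_idx], y[train_idx])
    mse = mean_squared_error(y[test_idx], model.predict(X.iloc[test_idx]))
    manual_mse.append(mse)
    print(f"fold {fold}: train rows 0-{train_idx.max()}, "
          f"test rows {test_idx.min()}-{test_idx.max()}, MSE={mse:.4f}")

# cross_val_score returns negated MSE, so flip the sign before comparing.
cv_mse = -cross_val_score(pipe, X, y, cv=tscv, scoring="neg_mean_squared_error")
```

The training window grows with each fold while the test window slides forward, and the per-fold values in manual_mse should match cv_mse. Inspecting the individual fold scores, rather than only their mean, can help reveal whether performance drifts over time.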