One straightforward way to implement cross-validation with TimeSeriesSplit() on a dataframe in an end-to-end Python pipeline is as follows:

Import the necessary libraries:

import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

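Load the dataframe and make sure its rows are in chronological order, since TimeSeriesSplit() splits by row position:

# The first column is parsed as dates and used as the index
df = pd.read_csv('data.csv', parse_dates=[0], index_col=0)
# Sort by date so that earlier rows always precede later ones
df = df.sort_index()
X = df.drop('target_variable', axis=1)
y = df['target_variable'].values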

Define the machine learning pipeline:

pipe = Pipeline([('scaler', StandardScaler()),
                 ('regressor', LinearRegression())])

Define the TimeSeriesSplit() cross-validation strategy:

tscv = TimeSeriesSplit(n_splits=5)
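
To see what this splitter actually produces, here is a minimal sketch on a toy array of 12 rows (the array and its size are assumptions chosen only for illustration). It prints the index sets for each split: the training window grows over time and each test window always comes after it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data (assumption): 12 time-ordered rows, one feature
X_toy = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X_toy))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    # Every training index is earlier than every test index
    print(f"split {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Unlike a shuffled K-fold, no split trains on rows that come after the rows it is tested on.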

Apply the cross-validation procedure on the pipeline and data:

scores = cross_val_score(pipe, X, y, cv=tscv, scoring='neg_mean_squared_error')
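
If it helps to see what that call does, the following sketch reproduces it with an explicit loop. The data here is synthetic (random features and a made-up linear target are assumptions, standing in for data.csv), and the variable names fold_mse and auto_scores are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataframe (assumption)
rng = np.random.default_rng(0)
dates = pd.date_range('2020-01-01', periods=120, freq='D')
X = pd.DataFrame(rng.normal(size=(120, 3)), index=dates, columns=['f1', 'f2', 'f3'])
y = X.to_numpy() @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('regressor', LinearRegression())])
tscv = TimeSeriesSplit(n_splits=5)

fold_mse = []
for train_idx, test_idx in tscv.split(X):
    model = clone(pipe)                          # fresh, unfitted copy per split
    model.fit(X.iloc[train_idx], y[train_idx])   # scaler and regressor see training rows only
    pred = model.predict(X.iloc[test_idx])
    fold_mse.append(mean_squared_error(y[test_idx], pred))

# The one-line version; its values are the negated per-split MSEs
auto_scores = cross_val_score(pipe, X, y, cv=tscv, scoring='neg_mean_squared_error')
print(np.round(fold_mse, 4))
print(np.round(-auto_scores, 4))
```

The explicit loop is mainly useful when you want per-split predictions or extra diagnostics; for a plain score, cross_val_score is shorter.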

Print the mean score across the splits:

print('Mean Squared Error: ', -np.mean(scores))

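This pipeline standardizes the features, fits a linear regression model, and evaluates it with a TimeSeriesSplit() cross-validation strategy. Because the scaler is a step of the pipeline, it is refit on the training portion of each split, so the test portion never influences the preprocessing. The 'neg_mean_squared_error' scorer returns negated MSE values (higher is better), which is why the sign is flipped before printing.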
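
As a small check on why the scaler belongs inside the pipeline, this sketch fits the pipeline on the first training split of a synthetic trending feature (the data is an assumption for illustration) and compares the mean learned by the scaler with the mean of the full series.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic trending feature (assumption): values 0..59, so the mean drifts upward over time
X_trend = np.arange(60, dtype=float).reshape(-1, 1)
y_trend = 2.0 * X_trend.ravel() + 1.0

pipe = Pipeline([('scaler', StandardScaler()),
                 ('regressor', LinearRegression())])

# Take only the first split: 10 training rows followed by 10 test rows
train_idx, test_idx = next(iter(TimeSeriesSplit(n_splits=5).split(X_trend)))
pipe.fit(X_trend[train_idx], y_trend[train_idx])

scaler_mean = pipe.named_steps['scaler'].mean_[0]
train_mean = X_trend[train_idx].mean()
full_mean = X_trend.mean()
print(scaler_mean, train_mean, full_mean)
```

The scaler's mean matches the training rows rather than the whole series. Scaling the full dataframe before splitting would instead bake later observations into every split.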