Create a pipeline to score different machine learning models with scikit-learn
After the initial data exploration, I like to get a quick gauge of which model would suit the problem at hand.
A rough estimate helps narrow down which machine-learning model to use and tune later. It also gives a sense of how effective the prospective algorithms will be.
The goal is to get a big-picture overview.
How to Write a Pipeline to Score Different Models
- Prep
I assume that you have a dataset with features (X) and target labels (y).
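Import the models you want to score. This example uses three classifiers:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier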
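If you do not have a dataset at hand yet, a synthetic one is enough to try the steps below. This is only a sketch: the sample count, feature count, and the use of make_classification are assumptions for illustration, not part of the original workflow.

```python
from sklearn.datasets import make_classification

# Hypothetical stand-in data: 500 samples, 20 numeric features, binary target.
X, y = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=8,
    random_state=42,
)

print(X.shape, y.shape)
```

Any numeric feature matrix X with a matching label vector y can take its place.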
- Create preprocessing pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import FeatureUnion, Pipeline
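# Create feature union
# Standardizes the feature matrix and, in parallel, reduces it with TSVD;
# FeatureUnion concatenates the two outputs side by side
tsvd_components = 6  # assumed value; keep it below the number of features in X
features = [('standardize', StandardScaler()),
            ('tsvd', TruncatedSVD(n_components=tsvd_components))]
feature_union = FeatureUnion(features)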
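It is worth checking what a FeatureUnion actually returns: it runs its transformers on the same input and stacks their outputs column-wise, rather than chaining them. The sketch below illustrates this on hypothetical random data (X_demo, demo_union, and the choice of 3 components are assumptions made for the example).

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 samples, 10 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 10))

demo_union = FeatureUnion([
    ("standardize", StandardScaler()),
    ("tsvd", TruncatedSVD(n_components=3, random_state=0)),
])

# Output columns = 10 standardized features + 3 TSVD components.
X_out = demo_union.fit_transform(X_demo)
print(X_out.shape)  # (100, 13)
```

If you want standardization followed by TSVD instead, a Pipeline with the two steps in sequence would be the tool to reach for.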
- Create models pipeline
Now combine the feature_union with a scikit-learn estimator in a Pipeline, one pipeline per model.
# Create pipelines
# Each one combines the feature union with a scikit-learn estimator
# Logistic Regression
estimators_log_r = [('feature_union', feature_union),
                    ('logistic', LogisticRegression(random_state=42))]
model_log_r = Pipeline(estimators_log_r)
# SVC
estimators_svc = [('feature_union', feature_union),
                  ('svc', SVC(probability=True, random_state=42))]
model_svc = Pipeline(estimators_svc)
# Random Forest
estimators_rf = [('feature_union', feature_union),
                 ('rf', RandomForestClassifier(n_jobs=-1, random_state=42))]
model_rf = Pipeline(estimators_rf)
# Collect the pipelines under readable names
models = {'Logistic_Regression': model_log_r,
          'SVC': model_svc,
          'Random_Forest_C': model_rf}
- Score models
from sklearn.model_selection import cross_val_score
scores = {name: cross_val_score(model, X, y) for name, model in models.items()}
Now you have a dictionary that maps each model name to the array of validation scores returned by cross_val_score, one score per fold. For classifiers the default score is accuracy.
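To compare models at a glance, the per-fold arrays can be reduced to a mean and standard deviation and sorted. The following is a minimal sketch under stated assumptions: the data (X_demo, y_demo) is synthetic, the two demo pipelines are simplified stand-ins for the ones above, and cv=5 is set explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data and models, only to make the sketch runnable.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
demo_models = {
    "Logistic_Regression": make_pipeline(
        StandardScaler(), LogisticRegression(random_state=42)),
    "Random_Forest_C": make_pipeline(
        StandardScaler(), RandomForestClassifier(n_estimators=50, random_state=42)),
}

# Same scoring step as above: one array of fold scores per model.
demo_scores = {name: cross_val_score(model, X_demo, y_demo, cv=5)
               for name, model in demo_models.items()}

# Reduce each array to (mean, std) and print the models from best to worst mean.
summary = {name: (fold_scores.mean(), fold_scores.std())
           for name, fold_scores in demo_scores.items()}
for name, (mean, std) in sorted(summary.items(),
                                key=lambda item: item[1][0], reverse=True):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

A large standard deviation relative to the gap between two means suggests the ranking is not yet reliable, which is fine for a first rough pass.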