Today I learned how to reduce the number of features in a data set with Principal Component Analysis (PCA).
From Python Data Science Handbook:
Principal component analysis is a fast and flexible unsupervised method for dimensionality reduction in data, […]
You can use PCA to learn about the relationship between two values:
In principal component analysis, this relationship is quantified by finding a list of the principal axes in the data, and using those axes to describe the dataset.
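To make the idea of principal axes more concrete, here is a minimal sketch on synthetic data. The two correlated variables x and y are made up for illustration and are not from the handbook. After fitting, scikit-learn exposes the axes as `components_` and the variance along each axis as `explained_variance_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, illustrative data: y depends almost linearly on x
rng = np.random.RandomState(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.1, size=200)
X = np.column_stack([x, y])

# Fit PCA and keep both axes so we can inspect them
pca = PCA(n_components=2).fit(X)

axes = pca.components_                 # one unit-length principal axis per row
variance = pca.explained_variance_     # variance of the data along each axis
ratio = pca.explained_variance_ratio_  # same, as a fraction of total variance

print(axes)
print(ratio)
```

Because y is close to 2 * x, the first axis should point roughly along the direction (1, 2) and carry almost all of the variance, while the second axis is orthogonal to it and captures only the leftover noise.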
Let’s assume we have a pandas DataFrame called diabetes_df with 10 different columns (features).
We can use scikit-learn’s PCA estimator to reduce the number of features from 10 to 2. Then we can visualize the data points with matplotlib.
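# Reduce dimensionality with PCA
# Assumes diabetes is the object returned by sklearn.datasets.load_diabetes()
# and diabetes_df holds its 10 feature columns.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# instantiate the model with 2 components
pca = PCA(n_components=2)

# project from 10 to 2 dimensions
project_diab = pca.fit_transform(diabetes_df)

# plot the projected points, colored by target value
plt.scatter(project_diab[:, 0], project_diab[:, 1],
            c=diabetes.target, edgecolor='none', alpha=0.5,
            cmap=plt.get_cmap('Spectral', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();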
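The snippet above does not show where diabetes_df comes from. As a self-contained sketch, I am assuming here that it is built from scikit-learn's bundled diabetes data set via `load_diabetes()`. That is my assumption, not something the original snippet states. The sketch also checks how much of the original variance the two components keep, which is a useful sanity check before trusting a 2D plot.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA

# Assumption: diabetes_df is built from scikit-learn's bundled diabetes data
diabetes = load_diabetes()
diabetes_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Project the 10 features onto the first 2 principal components
pca = PCA(n_components=2)
project_diab = pca.fit_transform(diabetes_df)

print(diabetes_df.shape)   # (442, 10)
print(project_diab.shape)  # (442, 2)

# Fraction of the total variance retained by the 2 components
retained = pca.explained_variance_ratio_.sum()
print(f"variance retained: {retained:.2f}")
```

If the retained fraction is low, the scatter plot hides a lot of structure, and it may be worth keeping more components for anything beyond visualization.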
For a visual explanation of Principal Component Analysis, I can recommend this site: Principal Component Analysis Explained Visually.