.. note::
    :class: sphx-glr-download-link-note

    Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_packages_scikit-learn_auto_examples_plot_eigenfaces.py:


The eigenfaces example: chaining PCA and SVMs
=============================================

The goal of this example is to show how an unsupervised method and a
supervised one can be chained for better prediction. It starts with a
didactic but lengthy way of doing things, and finishes with the
idiomatic approach to pipelining in scikit-learn.

Here we'll take a look at a simple facial recognition example. Ideally,
we would use a dataset consisting of a subset of the
`Labeled Faces in the Wild `__ data that is available with
:func:`sklearn.datasets.fetch_lfw_people`. However, this is a relatively
large download (~200MB), so we will do the tutorial on a simpler, less
rich dataset. Feel free to explore the LFW dataset.


.. code-block:: python


    from sklearn import datasets
    faces = datasets.fetch_olivetti_faces()
    faces.data.shape


Let's visualize these faces to see what we're working with:


.. code-block:: python


    from matplotlib import pyplot as plt
    fig = plt.figure(figsize=(8, 6))
    # plot several images
    for i in range(15):
        ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
        ax.imshow(faces.images[i], cmap=plt.cm.bone)


.. image:: /packages/scikit-learn/auto_examples/images/sphx_glr_plot_eigenfaces_001.png
    :class: sphx-glr-single-img


.. tip::

   Note that these faces have already been localized and scaled to a
   common size. This is an important preprocessing piece for facial
   recognition, and is a process that can require a large collection of
   training data. This can be done in scikit-learn, but the challenge is
   gathering a sufficient amount of training data for the algorithm to
   work. Fortunately, this piece is common enough that it has been done.
   One good resource is `OpenCV `__, the *Open Computer Vision Library*.

We'll perform a Support Vector classification of the images. We'll do a
typical train-test split on the images:


.. code-block:: python


    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(
        faces.data, faces.target, random_state=0)

    print(X_train.shape, X_test.shape)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    (300, 4096) (100, 4096)


Preprocessing: Principal Component Analysis
-------------------------------------------

4096 dimensions is a lot for an SVM. We can use PCA to reduce these 4096
features to a manageable size, while maintaining most of the information
in the dataset.


.. code-block:: python


    from sklearn import decomposition
    pca = decomposition.PCA(n_components=150, whiten=True)
    pca.fit(X_train)


One interesting part of PCA is that it computes the "mean" face, which
can be instructive to examine:


.. code-block:: python


    plt.imshow(pca.mean_.reshape(faces.images[0].shape), cmap=plt.cm.bone)


.. image:: /packages/scikit-learn/auto_examples/images/sphx_glr_plot_eigenfaces_002.png
    :class: sphx-glr-single-img


The principal components measure deviations about this mean along
orthogonal axes.


.. code-block:: python


    print(pca.components_.shape)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    (150, 4096)


It is also interesting to visualize these principal components:


.. code-block:: python


    fig = plt.figure(figsize=(16, 6))
    for i in range(30):
        ax = fig.add_subplot(3, 10, i + 1, xticks=[], yticks=[])
        ax.imshow(pca.components_[i].reshape(faces.images[0].shape),
                  cmap=plt.cm.bone)


.. image:: /packages/scikit-learn/auto_examples/images/sphx_glr_plot_eigenfaces_003.png
    :class: sphx-glr-single-img


The components ("eigenfaces") are ordered by their importance from
top-left to bottom-right. We see that the first few components seem to
primarily take care of lighting conditions; the remaining components
pull out certain identifying features: the nose, eyes, eyebrows, etc.

With this projection computed, we can now project our original training
and test data onto the PCA basis:


.. code-block:: python


    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)
    print(X_train_pca.shape)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    (300, 150)


.. code-block:: python


    print(X_test_pca.shape)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    (100, 150)


These projected components correspond to factors in a linear combination
of component images such that the combination approaches the original
face.
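To make this claim concrete, here is a small optional check that is not
part of the original example (so no generated figure is shown for it):
``pca.inverse_transform`` maps the 150 projected coefficients back to
pixel space, giving an approximate reconstruction of a face that can be
compared with the original image.


.. code-block:: python


    # Optional sketch: approximate reconstruction of the test faces from
    # their 150 PCA coefficients, compared with the original 4096-pixel
    # image for the first test face.
    X_test_approx = pca.inverse_transform(X_test_pca)

    fig = plt.figure(figsize=(4, 2))
    ax = fig.add_subplot(1, 2, 1, xticks=[], yticks=[])
    ax.imshow(X_test[0].reshape(faces.images[0].shape), cmap=plt.cm.bone)
    ax.set_title('original', fontsize='small')
    ax = fig.add_subplot(1, 2, 2, xticks=[], yticks=[])
    ax.imshow(X_test_approx[0].reshape(faces.images[0].shape),
              cmap=plt.cm.bone)
    ax.set_title('150-component reconstruction', fontsize='small')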
Doing the Learning: Support Vector Machines
-------------------------------------------

Now we'll perform support-vector-machine classification on this reduced
dataset:


.. code-block:: python


    from sklearn import svm
    clf = svm.SVC(C=5., gamma=0.001)
    clf.fit(X_train_pca, y_train)


Finally, we can evaluate how well this classification did. First, we
might plot a few of the test-cases with the labels learned from the
training set:


.. code-block:: python


    import numpy as np
    fig = plt.figure(figsize=(8, 6))
    for i in range(15):
        ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
        ax.imshow(X_test[i].reshape(faces.images[0].shape),
                  cmap=plt.cm.bone)
        y_pred = clf.predict(X_test_pca[i, np.newaxis])[0]
        color = ('black' if y_pred == y_test[i] else 'red')
        ax.set_title(y_pred, fontsize='small', color=color)


.. image:: /packages/scikit-learn/auto_examples/images/sphx_glr_plot_eigenfaces_004.png
    :class: sphx-glr-single-img


The classifier is correct on an impressive number of images given the
simplicity of its learning model! Using an SVM classifier on 150
features derived from the pixel-level data, the algorithm correctly
identifies a large number of the people in the images.

Again, we can quantify this effectiveness using one of several measures
from :mod:`sklearn.metrics`. First we can do the classification report,
which shows the precision, recall and other measures of the "goodness"
of the classification:


.. code-block:: python


    from sklearn import metrics
    y_pred = clf.predict(X_test_pca)
    print(metrics.classification_report(y_test, y_pred))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

                  precision    recall  f1-score   support

               0       1.00      0.50      0.67         6
               1       1.00      1.00      1.00         4
               2       0.50      1.00      0.67         2
               3       1.00      1.00      1.00         1
               4       0.33      1.00      0.50         1
               5       1.00      1.00      1.00         5
               6       1.00      1.00      1.00         4
               7       1.00      0.67      0.80         3
               9       1.00      1.00      1.00         1
              10       1.00      1.00      1.00         4
              11       1.00      1.00      1.00         1
              12       0.67      1.00      0.80         2
              13       1.00      1.00      1.00         3
              14       1.00      1.00      1.00         5
              15       1.00      1.00      1.00         3
              17       1.00      1.00      1.00         6
              19       1.00      1.00      1.00         4
              20       1.00      1.00      1.00         1
              21       1.00      1.00      1.00         1
              22       1.00      1.00      1.00         2
              23       1.00      1.00      1.00         1
              24       1.00      1.00      1.00         2
              25       1.00      0.50      0.67         2
              26       1.00      0.75      0.86         4
              27       1.00      1.00      1.00         1
              28       0.67      1.00      0.80         2
              29       1.00      1.00      1.00         3
              30       1.00      1.00      1.00         4
              31       1.00      1.00      1.00         3
              32       1.00      1.00      1.00         3
              33       1.00      1.00      1.00         2
              34       1.00      1.00      1.00         3
              35       1.00      1.00      1.00         1
              36       1.00      1.00      1.00         3
              37       1.00      1.00      1.00         3
              38       1.00      1.00      1.00         1
              39       1.00      1.00      1.00         3

        accuracy                           0.94       100
       macro avg       0.95      0.96      0.94       100
    weighted avg       0.97      0.94      0.94       100


Another interesting metric is the *confusion matrix*, which indicates
how often any two items are mixed up. The confusion matrix of a perfect
classifier would only have nonzero entries on the diagonal, with zeros
on the off-diagonal:


.. code-block:: python


    print(metrics.confusion_matrix(y_test, y_pred))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    [[3 0 0 ... 0 0 0]
     [0 4 0 ... 0 0 0]
     [0 0 2 ... 0 0 0]
     ...
     [0 0 0 ... 3 0 0]
     [0 0 0 ... 0 1 0]
     [0 0 0 ... 0 0 3]]


Pipelining
----------

Above we used PCA as a pre-processing step before applying our support
vector machine classifier. Plugging the output of one estimator directly
into the input of a second estimator is a commonly used pattern; for
this reason scikit-learn provides a ``Pipeline`` object which automates
this process. The above problem can be re-expressed as a pipeline as
follows:


.. code-block:: python


    from sklearn.pipeline import Pipeline
    clf = Pipeline([('pca', decomposition.PCA(n_components=150, whiten=True)),
                    ('svm', svm.LinearSVC(C=1.0))])

    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    print(metrics.confusion_matrix(y_pred, y_test))
    plt.show()


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    [[5 0 0 ... 0 0 0]
     [0 4 0 ... 0 0 0]
     [0 0 1 ... 0 0 0]
     ...
     [0 0 0 ... 3 0 0]
     [0 0 0 ... 0 1 0]
     [0 0 0 ... 0 0 3]]
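A further advantage of pipelining, shown here as an optional sketch that
is not part of the original example, is that the hyperparameters of both
steps can be tuned jointly with :class:`sklearn.model_selection.GridSearchCV`,
using the ``step__parameter`` naming convention. The grid values below
are illustrative choices, not tuned recommendations.


.. code-block:: python


    from sklearn.model_selection import GridSearchCV

    # Search jointly over the number of PCA components and the SVM
    # regularization strength; parameter names refer to the pipeline
    # steps defined above via the "step__parameter" syntax.
    param_grid = {'pca__n_components': [50, 100, 150],
                  'svm__C': [0.1, 1., 10.]}
    # cv=3 keeps each stratified fold valid even for the classes with
    # only a few training images in this split.
    grid = GridSearchCV(clf, param_grid, cv=3)
    grid.fit(X_train, y_train)
    print(grid.best_params_)
    print(grid.score(X_test, y_test))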
A Note on Facial Recognition
----------------------------

Here we have used PCA "eigenfaces" as a pre-processing step for facial
recognition. We chose this because PCA is a broadly applicable technique
that can be useful for a wide array of data types. Research in the field
of facial recognition in particular, however, has shown that other, more
specific feature extraction methods can be much more effective.


**Total running time of the script:** ( 0 minutes 2.276 seconds)


.. _sphx_glr_download_packages_scikit-learn_auto_examples_plot_eigenfaces.py:


.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: sphx-glr-download

     :download:`Download Python source code: plot_eigenfaces.py `


  .. container:: sphx-glr-download

     :download:`Download Jupyter notebook: plot_eigenfaces.ipynb `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_