8 minute read

[Notice] [ML_12]

## Unsupervised learning

- Dimensionality reduction: PCA, LDA, SVD
- Clustering: K-Means Clustering, DBSCAN
- Clustering evaluation

## Dimensionality reduction

```python
from IPython.display import Image
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import datasets
import pandas as pd
```

```python
iris = datasets.load_iris()
data = iris['data']
data[:5]
```
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])
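`load_iris` returns a dictionary-like `Bunch` object; a quick way to see what it carries (a small sketch, not part of the original lesson):

```python
# Keys include 'data', 'target', 'feature_names', and others
print(iris.keys())
```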
```python
df = pd.DataFrame(data, columns = iris['feature_names'])
df.head()
```

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|-------------------|------------------|-------------------|------------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
```python
df['target'] = iris['target']
df.head()
```

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|-------------------|------------------|-------------------|------------------|--------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |

### PCA

Principal Component Analysis (PCA) projects the data onto the orthogonal directions of greatest variance, so the first few components preserve most of the information.

```python
from sklearn.decomposition import PCA

# Standardize the features, then keep the first two principal components
pca = PCA(n_components = 2)
data_scaled = StandardScaler().fit_transform(df.loc[:, 'sepal length (cm)' : 'petal width (cm)'])
pca_data = pca.fit_transform(data_scaled)
```
```python
data_scaled[:5]
```
array([[-0.90068117,  1.01900435, -1.34022653, -1.3154443 ],
       [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
       [-1.38535265,  0.32841405, -1.39706395, -1.3154443 ],
       [-1.50652052,  0.09821729, -1.2833891 , -1.3154443 ],
       [-1.02184904,  1.24920112, -1.34022653, -1.3154443 ]])
```python
pca_data[:5]
```
array([[-2.26470281,  0.4800266 ],
       [-2.08096115, -0.67413356],
       [-2.36422905, -0.34190802],
       [-2.29938422, -0.59739451],
       [-2.38984217,  0.64683538]])

```python
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns

%matplotlib inline

# Scatter the two principal components, colored by the true species label
plt.scatter(pca_data[:, 0], pca_data[:, 1], c = df['target'])
```
<matplotlib.collections.PathCollection at 0x22180d5fa00>

```python
# A float n_components keeps the fewest components that together explain
# at least 99% of the total variance (three components, in this case)
pca = PCA(n_components = 0.99)
pca_data = pca.fit_transform(data_scaled)
pca_data[:5]
```
array([[-2.26470281,  0.4800266 , -0.12770602],
       [-2.08096115, -0.67413356, -0.23460885],
       [-2.36422905, -0.34190802,  0.04420148],
       [-2.29938422, -0.59739451,  0.09129011],
       [-2.38984217,  0.64683538,  0.0157382 ]])
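A quick check that the threshold was met (a small sketch; both attributes exist on a fitted PCA object):

```python
# How many components were kept, and how much variance they explain together
print(pca.n_components_)
print(pca.explained_variance_ratio_.sum())  # should be >= 0.99
```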

```python
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot(111, projection='3d')  # Axes3D object

# Plot all three retained components, colored by the true label
ax.scatter(pca_data[:, 0], pca_data[:, 1], pca_data[:, 2], alpha=0.6, c=df['target'])
plt.savefig('./tmp.svg')
plt.title("ax.plot")
plt.show()
```


### LDA Dimensionality Reduction

Linear Discriminant Analysis (LDA): a supervised dimensionality-reduction method, similar in use to PCA.

- LDA finds the projection axes that best separate the classes by maximizing the ratio of between-class variance to within-class variance (a minimal sketch of this criterion follows).
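To make the criterion concrete, here is a minimal NumPy sketch (my own illustration; `fisher_ratio` is a hypothetical helper, not part of the lesson or scikit-learn):

```python
import numpy as np

# Fisher's criterion: score a candidate projection direction w by the
# between-class scatter of the projected class means, divided by the
# within-class scatter of the projected samples. LDA chooses the directions
# (at most n_classes - 1 of them) that maximize this ratio.
def fisher_ratio(X, y, w):
    overall_mean = X.mean(axis=0)
    between = within = 0.0
    for c in np.unique(y):
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        between += len(Xc) * (w @ (class_mean - overall_mean)) ** 2
        within += (((Xc - class_mean) @ w) ** 2).sum()
    return between / within

# e.g. fisher_ratio(data_scaled, iris['target'], np.array([1.0, 0.0, 0.0, 0.0]))
```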
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
```

```python
df.head()
```

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|-------------------|------------------|-------------------|------------------|--------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
```python
# Unlike PCA, LDA is supervised: the class labels are passed to fit_transform.
# With 3 classes, at most n_classes - 1 = 2 components can be kept.
lda = LinearDiscriminantAnalysis(n_components = 2)
data_scaled = StandardScaler().fit_transform(df.loc[:, 'sepal length (cm)' : 'petal width (cm)'])
lda_data = lda.fit_transform(data_scaled, df['target'])
lda_data[:5]
```
array([[ 8.06179978,  0.30042062],
       [ 7.12868772, -0.78666043],
       [ 7.48982797, -0.26538449],
       [ 6.81320057, -0.67063107],
       [ 8.13230933,  0.51446253]])

```python
plt.scatter(pca_data[:, 0], pca_data[:, 1], c=df['target'])  # PCA projection
```
<matplotlib.collections.PathCollection at 0x22181418fd0>

```python
plt.scatter(lda_data[:, 0], lda_data[:, 1], c=df['target'])  # LDA projection
```
<matplotlib.collections.PathCollection at 0x221814764c0>


### SVD (Singular Value Decomposition)

- Commonly used in recommender systems.
- Decomposes a matrix into its singular values and singular vectors.
- A dimensionality-reduction technique similar to PCA.
- The scikit-learn package implements truncated SVD (also known as LSA); a NumPy sketch of the idea follows.
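To show what truncated SVD computes, a minimal NumPy sketch (my own illustration, reusing `data_scaled` from the sections above):

```python
import numpy as np

# Truncated SVD keeps only the largest singular values of X = U @ diag(S) @ Vt.
# Projecting X onto the first two right singular vectors reproduces
# TruncatedSVD(n_components=2).fit_transform(X), up to the sign of each column.
U, S, Vt = np.linalg.svd(data_scaled, full_matrices=False)
svd_manual = data_scaled @ Vt[:2].T
svd_manual[:2]
```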

```python
from sklearn.decomposition import TruncatedSVD
```

```python
df.head()
```

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|-------------------|------------------|-------------------|------------------|--------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
```python
# Standardize, then keep the two largest singular components
data_scaled = StandardScaler().fit_transform(df.loc[:, 'sepal length (cm)': 'petal width (cm)'])
svd = TruncatedSVD(n_components = 2)
svd_data = svd.fit_transform(data_scaled)
```

```python
plt.scatter(pca_data[:, 0], pca_data[:, 1], c=df['target'])  # PCA
```
<matplotlib.collections.PathCollection at 0x221814d1bb0>

```python
plt.scatter(lda_data[:, 0], lda_data[:, 1], c=df['target'])  # LDA
```
<matplotlib.collections.PathCollection at 0x22181476940>

```python
plt.scatter(svd_data[:, 0], svd_data[:, 1], c=df['target'])  # SVD
```
<matplotlib.collections.PathCollection at 0x22181589a30>


## Clustering

```python
Image('https://image.slidesharecdn.com/patternrecognitionbinoy-06-kmeansclustering-160317135729/95/pattern-recognition-binoy-k-means-clustering-13-638.jpg')
```

-----


### K-Means Clustering


This is the most popular clustering algorithm. It repeatedly assigns each point to the nearest cluster centroid and then moves each centroid to the mean of its assigned points (a minimal NumPy sketch of this update loop appears after the examples below).



**For instance**



- Classification of spam texts

- News article classification
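
Before using scikit-learn, here is a minimal NumPy sketch of the update loop (toy data and variable names are my own illustration, not scikit-learn's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))                        # toy 2-D data
centroids = X[rng.choice(len(X), 3, replace=False)]  # 3 random initial centers

for _ in range(10):
    # Distance from every point to every centroid, shape (150, 3)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                    # assign to nearest centroid
    # Move each centroid to the mean of its assigned points
    # (real implementations also handle clusters that become empty)
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(3)])
```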



```python
from sklearn.cluster import KMeans
```


```python
# Partition the samples into 3 clusters
kmeans = KMeans(n_clusters = 3)
```


```python
# fit_transform returns each sample's distance to each of the 3 cluster
# centers (shape: n_samples x n_clusters)
cluster_data = kmeans.fit_transform(df.loc[:, 'sepal length (cm)' : 'petal width (cm)'])
```


```python
cluster_data[:5]
```

array([[3.41925061, 0.14135063, 5.0595416 ],
       [3.39857426, 0.44763825, 5.11494335],
       [3.56935666, 0.4171091 , 5.27935534],
       [3.42240962, 0.52533799, 5.15358977],
       [3.46726403, 0.18862662, 5.10433388]])
`kmeans.labels_` holds the index of the nearest center for each sample:

```python
kmeans.labels_
```
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
       2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2,
       2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])
-----

```python
sns.countplot(kmeans.labels_)
```
C:\Users\boyka\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(
<AxesSubplot:ylabel='count'>
```python
sns.countplot(df['target'])
```
<AxesSubplot:xlabel='target', ylabel='count'>
```python
kmeans
```
KMeans(n_clusters=3)
```python
# max_iter raises the cap on update iterations (scikit-learn's default is 300)
kmeans = KMeans(n_clusters = 3, max_iter = 500)
cluster_data = kmeans.fit_transform(df.loc[:, 'sepal length (cm)' : 'petal width (cm)'])
sns.countplot(kmeans.labels_)
```
<AxesSubplot:ylabel='count'>
-----

## DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Density-based clustering:

- Groups points that lie in high-density regions.
- If at least `min_samples` points fall within a radius `eps` of a point, they are treated as part of one cluster.
- Unlike KMeans, where the number of clusters must be specified, DBSCAN discovers it on its own.
- It also handles non-convex (geometric) cluster shapes well.

```python
Image('https://image.slidesharecdn.com/pydatanyc2015-151119175854-lva1-app6891/95/pydata-nyc-2015-automatically-detecting-outliers-with-datadog-26-638.jpg')
```

-----



```python
from sklearn.cluster import DBSCAN
```


```python
# eps: neighborhood radius; min_samples: points required within eps
# for a point to seed a dense region
dbscan = DBSCAN(eps = 0.6, min_samples = 2)
```


```python
# fit_predict returns a cluster label per sample; -1 marks noise
dbscan_data = dbscan.fit_predict(df.loc[:, 'sepal length (cm)' : 'petal width (cm)'])
```


```python
dbscan_data
```

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  1,
        1,  1,  1,  1,  1,  1,  2,  1,  1,  2,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  2,  1,  1,  1,  1,  2,  1,  1,  1,
        1,  1,  1,  1, -1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  3,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  3,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
      dtype=int64)
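In the output above, the label -1 marks noise points that DBSCAN could not attach to any dense region. A small sketch to tally the cluster sizes (numpy is imported earlier):

```python
# Cluster sizes; the entry for -1 counts the noise points
labels, counts = np.unique(dbscan_data, return_counts=True)
print(dict(zip(labels, counts)))
```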
-----

## Silhouette Score (Clustering Evaluation)

An indicator that quantitatively evaluates the quality of a clustering. For each sample, the silhouette coefficient is s = (b - a) / max(a, b), where a is the mean distance to points in the same cluster and b is the mean distance to points in the nearest other cluster.

- Close to 1: well clustered
- Close to 0: poor clustering (the clusters carry little meaning)
- Negative: the sample was likely assigned to the wrong cluster

-----

```python
from sklearn.metrics import silhouette_samples, silhouette_score
```

```python
score = silhouette_score(data_scaled, kmeans.labels_)
```

```python
score
```
0.44366157397640527
```python
samples = silhouette_samples(data_scaled, kmeans.labels_)
```

```python
samples[:5]
```
array([0.73318987, 0.57783809, 0.68201014, 0.62802187, 0.72693222])
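As a consistency check (a small sketch), the overall score is just the mean of the per-sample coefficients:

```python
# silhouette_score equals the average of silhouette_samples
print(samples.mean(), score)
```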
-----

```python
def plot_silhouette(X, num_clusters):
    for n_clusters in num_clusters:
        # Create a subplot with 1 row and 2 columns
        fig, (ax1, ax2) = plt.subplots(1, 2)
        fig.set_size_inches(18, 7)

        # The 1st subplot is the silhouette plot
        # The silhouette coefficient can range from -1, 1 but in this example all
        # lie within [-0.1, 1]
        ax1.set_xlim([-0.1, 1])
        # The (n_clusters+1)*10 is for inserting blank space between silhouette
        # plots of individual clusters, to demarcate them clearly.
        ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

        # Initialize the clusterer with n_clusters value and a random generator
        # seed of 10 for reproducibility.
        clusterer = KMeans(n_clusters=n_clusters, random_state=10)
        cluster_labels = clusterer.fit_predict(X)

        # The silhouette_score gives the average value for all the samples.
        # This gives a perspective into the density and separation of the formed
        # clusters
        silhouette_avg = silhouette_score(X, cluster_labels)
        print("For n_clusters =", n_clusters,
              "The average silhouette_score is :", silhouette_avg)

        # Compute the silhouette scores for each sample
        sample_silhouette_values = silhouette_samples(X, cluster_labels)

        y_lower = 10
        for i in range(n_clusters):
            # Aggregate the silhouette scores for samples belonging to
            # cluster i, and sort them
            ith_cluster_silhouette_values = \
                sample_silhouette_values[cluster_labels == i]

            ith_cluster_silhouette_values.sort()

            size_cluster_i = ith_cluster_silhouette_values.shape[0]
            y_upper = y_lower + size_cluster_i

            color = cm.nipy_spectral(float(i) / n_clusters)
            ax1.fill_betweenx(np.arange(y_lower, y_upper),
                              0, ith_cluster_silhouette_values,
                              facecolor=color, edgecolor=color, alpha=0.7)

            # Label the silhouette plots with their cluster numbers at the middle
            ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

            # Compute the new y_lower for next plot
            y_lower = y_upper + 10  # 10 for the 0 samples

        ax1.set_title("The silhouette plot for the various clusters.")
        ax1.set_xlabel("The silhouette coefficient values")
        ax1.set_ylabel("Cluster label")

        # The vertical line for average silhouette score of all the values
        ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

        ax1.set_yticks([])  # Clear the yaxis labels / ticks
        ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

        # 2nd Plot showing the actual clusters formed
        colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
        ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                    c=colors, edgecolor='k')

        # Labeling the clusters
        centers = clusterer.cluster_centers_
        # Draw white circles at cluster centers
        ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                    c="white", alpha=1, s=200, edgecolor='k')

        for i, c in enumerate(centers):
            ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                        s=50, edgecolor='k')

        ax2.set_title("The visualization of the clustered data.")
        ax2.set_xlabel("Feature space for the 1st feature")
        ax2.set_ylabel("Feature space for the 2nd feature")

        plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                      "with n_clusters = %d" % n_clusters),
                     fontsize=14, fontweight='bold')

        plt.show()
```

-----

```python
plot_silhouette(data_scaled, [2, 3, 4, 5])
```
For n_clusters = 2 The average silhouette_score is : 0.5817500491982808
For n_clusters = 3 The average silhouette_score is : 0.45994823920518635
For n_clusters = 4 The average silhouette_score is : 0.383850922475103
For n_clusters = 5 The average silhouette_score is : 0.34273996820787694
