scikit-learn classification: KNeighborsClassifier
Published: 2019-05-26


1. Parameter list:
class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
2. The documentation's explanation of the parameters:

Parameters:

n_neighbors : int, optional (default = 5)

Number of neighbors to use by default for kneighbors queries.

weights : str or callable, optional (default = ‘uniform’)

weight function used in prediction. Possible values:

  • ‘uniform’ : uniform weights. All points in each neighborhood are weighted equally.
  • ‘distance’ : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
  • [callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.

algorithm : {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional

Algorithm used to compute the nearest neighbors:

  • ‘ball_tree’ will use BallTree
  • ‘kd_tree’ will use KDTree
  • ‘brute’ will use a brute-force search.
  • ‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to fit method.

Note: fitting on sparse input will override the setting of this parameter, using brute force.

leaf_size : int, optional (default = 30)

Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

p : integer, optional (default = 2)

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

metric : string or callable, default ‘minkowski’

the distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of the DistanceMetric class for a list of available metrics.

metric_params : dict, optional (default = None)

Additional keyword arguments for the metric function.

n_jobs : int, optional (default = 1)

The number of parallel jobs to run for neighbors search. If -1, then the number of jobs is set to the number of CPU cores. Doesn’t affect fit method.

Important parameters:

n_neighbors: number of neighbors, default 5

leaf_size: leaf size passed to BallTree/KDTree; affects construction/query speed and memory usage

n_jobs: number of parallel jobs used for the neighbor search
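As a minimal construction sketch pulling these parameters together (the values below are illustrative, not tuned recommendations):

from sklearn.neighbors import KNeighborsClassifier

# Illustrative settings: distance-weighted voting, automatic tree selection,
# Euclidean metric (p=2) and all CPU cores for the neighbor search.
clf = KNeighborsClassifier(n_neighbors=5,
                           weights='distance',
                           algorithm='auto',
                           leaf_size=30,
                           p=2,
                           metric='minkowski',
                           n_jobs=-1)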

Main methods:

Methods

fit(X, y) Fit the model using X as training data and y as target values
get_params([deep]) Get parameters for this estimator.
kneighbors([X, n_neighbors, return_distance]) Finds the K-neighbors of a point.
kneighbors_graph([X, n_neighbors, mode]) Computes the (weighted) graph of k-Neighbors for points in X
predict(X) Predict the class labels for the provided data
predict_proba(X) Return probability estimates for the test data X.
score(X, y[, sample_weight]) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of this estimator.
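The examples further down cover fit, kneighbors, predict, predict_proba and score. For completeness, here is a short sketch of kneighbors_graph and get_params/set_params on made-up toy data (illustration only, not from the original post):

from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3]]   # toy data, for illustration only
y = [0, 0, 1, 1]
neigh = KNeighborsClassifier(n_neighbors=2)
neigh.fit(X, y)

# Sparse connectivity graph of the 2 nearest neighbors of each sample in X
print(neigh.kneighbors_graph(X, n_neighbors=2).toarray())

# Read and update hyperparameters without rebuilding the estimator
print(neigh.get_params()['n_neighbors'])
neigh.set_params(n_neighbors=3)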

A few important methods:

predict(X)

Predict the class labels for the provided data

Parameters:

X : array-like, shape (n_query, n_features), or (n_query, n_indexed) if metric == ‘precomputed’

Test samples.

Returns:

y : array of shape [n_samples] or [n_samples, n_outputs]

Class labels for each data sample.

predict_proba(X)

Return probability estimates for the test data X.

Parameters:

X : array-like, shape (n_query, n_features), or (n_query, n_indexed) if metric == ‘precomputed’

Test samples.

Returns:

p : array of shape = [n_samples, n_classes], or a list of n_outputs of such arrays if n_outputs > 1

The class probabilities of the input samples. Classes are ordered by lexicographic order.

score(X, y, sample_weight=None)

Returns the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:

X : array-like, shape = (n_samples, n_features)

Test samples.

y : array-like, shape = (n_samples) or (n_samples, n_outputs)

True labels for X.

sample_weight : array-like, shape = [n_samples], optional

Sample weights.

Returns:

score : float

Mean accuracy of self.predict(X) wrt. y.
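The examples below call score without sample_weight. A minimal sketch with weights (toy numbers, assumed purely for illustration), where the second test sample counts three times as much as the first:

from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
neigh = KNeighborsClassifier(n_neighbors=3).fit(X, y)

print(neigh.score([[0.8], [1.5]], [1, 0]))                        # unweighted accuracy
print(neigh.score([[0.8], [1.5]], [1, 0], sample_weight=[1, 3]))  # weighted accuracy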

Example1

from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
print(neigh.predict([[1.1]]))        # predict the class label of the query sample
print(neigh.predict_proba([[0.9]]))  # predict class probabilities
'''
    [0]
    [[0.66666667 0.33333333]]  # probabilities of label 0 and label 1, respectively
'''

Example2

from sklearn.neighbors import NearestNeighbors

samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(samples)
print(neigh.kneighbors([[1., 1., 1.]]))
'''
    result:
    (array([[0.5]]), array([[2]], dtype=int64))
'''
# The first array holds the nearest distances and the second the indices of the nearest points;
# with return_distance=False only the indices are returned, not the distances.

Example3

from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
print('Distances from each sample to its neighbors, and their indices:')
print(neigh.kneighbors())  # by default, finds the neighbors of every training sample and returns the distances
print('Accuracy:', neigh.score([[0.8], [1.5]], [1, 0]))
'''
Distances from each sample to its neighbors, and their indices:
(array([[1., 2., 3.],   # distances from samples 2, 3, 4 to the first sample
       [1., 1., 2.],
       [1., 1., 2.],
       [1., 2., 3.]]), array([[1, 2, 3],   # indices of the neighbors of the first sample
       [0, 2, 3],
       [3, 1, 0],
       [2, 1, 0]], dtype=int64))
Accuracy: 0.5
'''

A small classification exercise:

from sklearn import datasets, neighbors, linear_model

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

n_samples = len(X_digits)
X_train = X_digits[:int(.9 * n_samples)]
y_train = y_digits[:int(.9 * n_samples)]
X_test = X_digits[int(.9 * n_samples):]
y_test = y_digits[int(.9 * n_samples):]

knn = neighbors.KNeighborsClassifier()
logistic = linear_model.LogisticRegression()

print('KNN score: %f' % knn.fit(X_train, y_train).score(X_test, y_test))
print('LogisticRegression score: %f'
      % logistic.fit(X_train, y_train).score(X_test, y_test))
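The split above simply takes the first 90% of the samples in their stored order. As a sketch of the same comparison with a shuffled split via train_test_split (a variation assumed here, not part of the original exercise):

from sklearn import datasets, neighbors, linear_model
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.1, random_state=0)  # shuffled 90/10 split

knn = neighbors.KNeighborsClassifier()
logistic = linear_model.LogisticRegression(max_iter=1000)  # max_iter raised so the solver converges
print('KNN score: %f' % knn.fit(X_train, y_train).score(X_test, y_test))
print('LogisticRegression score: %f'
      % logistic.fit(X_train, y_train).score(X_test, y_test))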

Comparison of several classification algorithms:

from itertools import product

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

# Loading some example data
iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target

# Training classifiers
clf1 = DecisionTreeClassifier(max_depth=4)
clf2 = KNeighborsClassifier(n_neighbors=7)
clf3 = SVC(kernel='rbf', probability=True)
eclf = VotingClassifier(estimators=[('dt', clf1), ('knn', clf2),
                                    ('svc', clf3)],
                        voting='soft', weights=[2, 1, 2])

clf1.fit(X, y)
clf2.fit(X, y)
clf3.fit(X, y)
eclf.fit(X, y)

# Plotting decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(10, 8))

for idx, clf, tt in zip(product([0, 1], [0, 1]),
                        [clf1, clf2, clf3, eclf],
                        ['Decision Tree (depth=4)', 'KNN (k=7)',
                         'Kernel SVM', 'Soft Voting']):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.4)
    axarr[idx[0], idx[1]].scatter(X[:, 0], X[:, 1], c=y,
                                  s=20, edgecolor='k')
    axarr[idx[0], idx[1]].set_title(tt)

plt.show()
