Introduction
In machine learning and data science, model evaluation is a crucial step in the modeling process. A model that performs well on its training data will not necessarily perform equally well on new data. To accurately assess a model's ability to generalize, we need reliable validation techniques. K-Fold Cross-Validation is one of the most widely used and effective of these methods.
Why Cross-Validation Is Needed
Limitations of the Traditional Train-Test Split
In a traditional machine learning workflow, the dataset is usually split into a training set and a test set (for example, an 80%/20% split). However, this approach has several problems (illustrated by the sketch after this list):
- Low data utilization: the test-set samples are never used during training;
- Unstable evaluation results: different random splits can produce very different scores;
- Small-dataset issues: when the dataset is small, the test set may not represent the overall data distribution.
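A minimal sketch of the instability problem (not part of the original code, and assuming scikit-learn is available): scoring the same model on several different random splits of the iris data can give noticeably different accuracies.

```python
# Hypothetical illustration: the same model, scored on different random splits
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = datasets.load_iris(return_X_y=True)
for seed in (0, 1, 2, 3):
    x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=5).fit(x_tr, y_tr)
    print('seed %d: accuracy = %.3f' % (seed, model.score(x_te, y_te)))
```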
Advantages of Cross-Validation
Cross-validation addresses these problems by systematically reusing the data:
- Higher data utilization: every sample is used for both training and validation;
- More stable evaluation results: averaging over multiple validation rounds reduces the influence of randomness;
- Better model selection: it provides reliable performance estimates for hyperparameter tuning.
How K-Fold Cross-Validation Works
K-fold cross-validation randomly splits the original dataset into K equally sized (or nearly equally sized) subsets called "folds", then performs K rounds of training and validation:
- Iteration 1: use fold 1 as the validation set and the remaining K-1 folds as the training set;
- Iteration 2: use fold 2 as the validation set and the remaining K-1 folds as the training set;
- ...
- Iteration K: use fold K as the validation set and the remaining K-1 folds as the training set.
Finally, the K validation results (accuracy, mean squared error, etc.) are averaged to obtain the final performance estimate for the model.
A 5-fold cross-validation works as follows: the original dataset is divided into 5 equally sized subsets, labeled A through E. In the first round, the model is trained on folds B through E and tested on fold A. In the next round, it is trained on folds A, C, D, and E and tested on fold B. The folds keep rotating until the model has been trained and tested on every fold. Compared with testing on a single split, cross-validation gives a more accurate estimate of model performance.
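The rotation described above can be reproduced with scikit-learn's standard utilities. This is a minimal sketch under that assumption; `KFold` and `cross_val_score` are library functions, not part of the original code below.

```python
# Hypothetical sketch of 5-fold cross-validation with scikit-learn
from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

x, y = datasets.load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=13)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), x, y, cv=kf)
print(scores, scores.mean())  # one accuracy per fold, then the average
```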
Code Walkthrough
We use the iris dataset from sklearn and a hand-written KNN to select the value of k:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.utils import shuffle
from collections import Counter

# Load the iris dataset and shuffle it with a fixed random seed
iris = datasets.load_iris()
x, y = shuffle(iris.data, iris.target, random_state=13)
x = x.astype(np.float32)

# Hold out 30% of the data as the test set
offset = int(x.shape[0] * 0.7)
x_train, y_train = x[:offset], y[:offset]
x_test, y_test = x[offset:], y[offset:]
y_train = y_train.reshape((-1, 1))
y_test = y_test.reshape((-1, 1))
```
Define the Euclidean distance:
```python
def compute_distances(X, X_train):
    '''
    Inputs:
        X: matrix of test samples
        X_train: matrix of training samples
    Output:
        dists: Euclidean distance matrix between test and training samples
    '''
    # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, computed for all pairs at once
    M = np.dot(X, X_train.T)
    te = np.square(X).sum(axis=1)
    tr = np.square(X_train).sum(axis=1)
    dists = np.sqrt(-2 * M + tr + te.reshape(-1, 1))
    return dists
```
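The vectorized computation above relies on expanding the squared Euclidean distance between a test sample $x_i$ and a training sample $z_j$:

$$
\lVert x_i - z_j \rVert^2 = \lVert x_i \rVert^2 - 2\,x_i \cdot z_j + \lVert z_j \rVert^2,
$$

so `te`, `M`, and `tr` supply the three terms for every pair at once, and broadcasting assembles the full distance matrix without an explicit loop.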
Define the prediction function:
```python
def predict_labels(y_train, dists, k=1):
    '''
    Inputs:
        y_train: training set labels
        dists: Euclidean distance matrix between test and training samples
        k: number of nearest neighbors
    Output:
        y_pred: predicted labels for the test set
    '''
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
        # Sort the training labels by distance to the i-th test sample
        labels = y_train[np.argsort(dists[i, :])].flatten()
        closest_y = labels[:k]
        # Majority vote among the k nearest neighbors
        c = Counter(closest_y)
        y_pred[i] = c.most_common(1)[0][0]
    return y_pred
```
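A quick toy check (assumed here, not in the original post) helps confirm the two helpers behave as expected before running cross-validation:

```python
# Hypothetical toy check: three labeled 2-D points, one query point
toy_train = np.array([[0., 0.], [0., 1.], [5., 5.]], dtype=np.float32)
toy_labels = np.array([[0], [0], [1]])
toy_query = np.array([[0.2, 0.4]], dtype=np.float32)

toy_dists = compute_distances(toy_query, toy_train)
print(predict_labels(toy_labels, toy_dists, k=2))  # expected: [0.]
```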
Use 5-fold cross-validation to find the best k:
```python
n_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

# Split the training data into n_folds roughly equal folds
x_train_folds = np.array_split(x_train, n_folds)
y_train_folds = np.array_split(y_train, n_folds)

k_to_accuracies = {}
for k in k_choices:
    for fold in range(n_folds):
        # The current fold is the validation set; the rest form the training set
        validation_X_test = x_train_folds[fold]
        validation_y_test = y_train_folds[fold]
        temp_X_train = np.vstack(x_train_folds[:fold] + x_train_folds[fold + 1:])
        temp_y_train = np.vstack(y_train_folds[:fold] + y_train_folds[fold + 1:])

        # Predict the validation fold and record the accuracy for this k
        temp_dists = compute_distances(validation_X_test, temp_X_train)
        temp_y_test_pred = predict_labels(temp_y_train, temp_dists, k=k)
        temp_y_test_pred = temp_y_test_pred.reshape((-1, 1))
        num_correct = np.sum(temp_y_test_pred == validation_y_test)
        accuracy = float(num_correct) / validation_X_test.shape[0]
        k_to_accuracies[k] = k_to_accuracies.get(k, []) + [accuracy]

for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))
```
Part of the accuracy output looks like this:
```
k = 1, accuracy = 0.904762
k = 1, accuracy = 1.000000
k = 1, accuracy = 0.952381
k = 1, accuracy = 0.857143
k = 1, accuracy = 0.952381
k = 1, accuracy = 0.904762
k = 1, accuracy = 1.000000
k = 1, accuracy = 0.952381
k = 1, accuracy = 0.857143
k = 1, accuracy = 0.952381
k = 1, accuracy = 0.904762
k = 1, accuracy = 1.000000
k = 1, accuracy = 0.952381
k = 1, accuracy = 0.857143
k = 1, accuracy = 0.952381
k = 3, accuracy = 0.857143
k = 3, accuracy = 1.000000
k = 3, accuracy = 0.952381
k = 3, accuracy = 0.857143
k = 3, accuracy = 0.952381
k = 3, accuracy = 0.857143
k = 3, accuracy = 1.000000
```
Next, plot the classification accuracy for different k values and folds:
```python
# Scatter the per-fold accuracies for each k
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# Overlay the mean accuracy with standard-deviation error bars
accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()
```
The resulting accuracy plot is shown below:
(Figure: per-fold accuracies for each k, with the mean and standard deviation shown as error bars.)
As the plot shows, when k is between 1 and 20, the KNN classification accuracy does not fluctuate much and averages around 0.95; when k is between 20 and 50, the accuracy starts to decline; and when k exceeds 50, the accuracy drops off a cliff. In general, k should not be too large: start from a relatively small value and use cross-validation to select the optimal one.
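As a small follow-up (not from the original post), the best k can be selected by its mean cross-validation accuracy and then scored once on the held-out test set built earlier:

```python
# Hypothetical follow-up: pick the k with the highest mean CV accuracy,
# then evaluate it once on the held-out test set
best_k = max(k_to_accuracies, key=lambda k: np.mean(k_to_accuracies[k]))
dists = compute_distances(x_test, x_train)
y_test_pred = predict_labels(y_train, dists, k=best_k).reshape((-1, 1))
test_accuracy = np.sum(y_test_pred == y_test) / x_test.shape[0]
print('best k = %d, test accuracy = %f' % (best_k, test_accuracy))
```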
Code link: https://github.com/kg5kb8lbj6/happy_machine_learning.git