Python Big Data Analysis 3: Decision Trees and Random Forests (Membership Card Prediction)

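A decision tree is also a classification method. It assigns a class by testing a series of conditions on feature values. For example, given a customer's attributes, it can decide whether that customer is a suitable candidate for a credit card; the class in that case is whether or not the card should be issued.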

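This time we use the foodmart 2000 dataset to predict the tier of a customer's membership card. The data comes from a multinational food supermarket and contains a little over 10,000 membership-card customer records. Each record provides 27 features, such as name, address, income, and education, plus a membership-card column with four values: gold, silver, bronze, and normal.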

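Decision tree classification works much like the methods covered earlier. The key step is obtaining the features: feature selection matters a great deal, and good, accurate features can matter even more than the choice of algorithm. As an exercise, let's start by picking a few important features. Standard classification models expect numeric features by default.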

So we first chose three numeric features: number of children, number of cars, and yearly income.

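Yearly income needs some processing before it can be used, because it has to be converted into a real number. You will notice that the column holds a range, so we can start by simply extracting all the digits: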

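Here we use str to access the string form of each yearly income value and then call replace to strip every non-digit character. The pattern is a regular expression: 0-9 stands for the ten digits and ^ means "not", which is clearly more flexible than listing characters one by one. Every match is replaced with the empty string, that is, removed: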

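import pandas as pd

frame = pd.read_csv('C:\\temp\\customer.csv')
print(frame['yearly_income'].head(2))
# regex=True is required in pandas 2.0 and later, where str.replace matches literally by default
frame['yearly_income'] = frame['yearly_income'].str.replace('[^0-9]', '', regex=True)
print(frame['yearly_income'].head(2))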

As the output shows, the resulting digits can stand in, indirectly, for the scale of the income.

Alternatively, we can keep only the lower bound of the income range: take everything up to the first space, then apply the same replacement to it:

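# keep the text before the first space, strip non-digits, and convert to an integer
frame['yearly_income'] = frame['yearly_income'].str.split(" ").str[0].str.replace('[^0-9]', '', regex=True).astype(int)
print(frame['yearly_income'].head(2))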
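To make the two cleaning strategies concrete, here is a minimal standalone sketch that applies them to single strings with Python's re module. The sample values (such as '$30K - $50K') are an assumption about how the yearly_income column is formatted, and the helper names all_digits and lower_bound are made up for illustration.

```python
import re

def all_digits(income: str) -> str:
    """Strip every non-digit character, keeping both ends of the range."""
    return re.sub(r'[^0-9]', '', income)

def lower_bound(income: str) -> int:
    """Keep only the text before the first space, then strip non-digits."""
    first_part = income.split(" ")[0]
    return int(re.sub(r'[^0-9]', '', first_part))

samples = ["$30K - $50K", "$150K +"]  # assumed format, not taken from the file
for s in samples:
    print(s, "->", all_digits(s), "|", lower_bound(s))
```

Under this assumed format, the first strategy glues both bounds together, so an open-ended value like '$150K +' would end up smaller than '$30K - $50K'. That is one reason the lower-bound strategy may be the safer choice.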

That completes the data preparation.

The code is not complicated and looks much like what we wrote before:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

frame = pd.read_csv('C:\\temp\\customer.csv')
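# regex=True is needed in pandas 2.0+; astype(int) turns the digit strings into real numbers
frame['yearly_income'] = frame['yearly_income'].str.split(" ").str[0].str.replace('[^0-9]', '', regex=True).astype(int)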
y = frame["member_card"]
clf = DecisionTreeClassifier()
X = frame[["yearly_income", "total_children", "num_cars_owned"]]
scores = cross_val_score(clf, X, y, scoring='accuracy')
print(np.mean(scores))

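The main differences are these: first, the three numeric columns serve as the feature data and the membership-card column is the prediction target; second, the classifier is DecisionTreeClassifier.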
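For readers without the customer file at hand, the same workflow can be tried on synthetic data. The sketch below is only an illustration: the table is randomly generated, the rule that produces member_card is invented, and cv=5 is written out explicitly (recent scikit-learn versions use five folds by default).

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the customer table (not real foodmart data)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "yearly_income": rng.integers(10, 151, size=n),
    "total_children": rng.integers(0, 6, size=n),
    "num_cars_owned": rng.integers(0, 5, size=n),
})
# Invented rule: higher income -> "Golden", otherwise "Normal"
toy["member_card"] = np.where(toy["yearly_income"] >= 70, "Golden", "Normal")

X = toy[["yearly_income", "total_children", "num_cars_owned"]]
y = toy["member_card"]

clf = DecisionTreeClassifier(random_state=0)
# Each of the 5 folds is held out once and scored by accuracy
scores = cross_val_score(clf, X, y, scoring="accuracy", cv=5)
print(scores)           # one accuracy value per fold
print(np.mean(scores))  # the single averaged number, as in the article
```

cross_val_score returns one score per fold, which is why the article averages them with np.mean.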

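Generally speaking, bringing in more relevant features tends to improve a decision tree classifier. For example, we suspect that education and occupation are often closely tied to membership tier, so we add the education feature: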

X = frame[["yearly_income", "total_children", "num_cars_owned", "education"]]
scores = cross_val_score(clf, X, y, scoring='accuracy')
print(np.mean(scores))

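But this raises an error. The reason is simple: string columns cannot take part in the classifier's computations directly, so they must first be converted to numbers.

LabelEncoder can do the conversion. Let's first look at what it produces: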

import pandas as pd
from sklearn.preprocessing import LabelEncoder

frame = pd.read_csv('C:\\temp\\customer.csv')
encoding = LabelEncoder()
encoding.fit(frame["education"])
education_new = encoding.transform(frame["education"])
print(frame["education"].values)
print(education_new)

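After creating the LabelEncoder, we first fit it on the current education column, which simply lets it learn which distinct strings the column contains, and then have it transform that column.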

As the output shows, each distinct string is assigned its own integer, and identical strings always map to the same integer.
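The mapping is easier to see on a handful of values. The following sketch uses a short hand-written list; the strings are illustrative only and may not match the actual contents of the education column.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; the real column may contain other strings
education = ["High School Degree", "Bachelors Degree",
             "Graduate Degree", "Bachelors Degree"]

encoder = LabelEncoder()
codes = encoder.fit_transform(education)  # fit and transform in one step

print(list(encoder.classes_))  # the distinct strings, in sorted order
print(list(codes))             # position of each value within classes_
# The codes can be mapped back to the original strings
restored = list(encoder.inverse_transform(codes))
print(restored)
```

The integer for each string is simply its position in the sorted list of distinct values, so the numbering follows alphabetical order rather than any notion of educational level.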

We can then add this new education column:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

frame = pd.read_csv('C:\\temp\\customer.csv')
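# regex=True is needed in pandas 2.0+; astype(int) turns the digit strings into real numbers
frame['yearly_income'] = frame['yearly_income'].str.split(" ").str[0].str.replace('[^0-9]', '', regex=True).astype(int)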
encoding = LabelEncoder()
encoding.fit(frame["education"])
frame['education_new'] = encoding.transform(frame["education"])
y = frame["member_card"]
clf = DecisionTreeClassifier()
X = frame[["yearly_income", "total_children", "num_cars_owned", "education_new"]]
scores = cross_val_score(clf, X, y, scoring='accuracy')
print(np.mean(scores))

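Here we first create an education_new column in the DataFrame and assign the encoded education values to it, so that the column can simply be added to the feature list afterwards.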

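You may notice that the result seems to have gotten worse. Why? The feature choice itself may be a problem, but the more likely culprit is the numeric conversion. LabelEncoder turns strings into integers; the integers do tell the categories apart, but their relative magnitudes also appear to express a relationship between them.

For example, in the converted output, 4 is numerically closer to 3 than to 2, which suggests that the category coded 4 resembles the one coded 3. That is wrong: the string behind 4 is no more similar to the string behind 3, and if anything seems closer to the one behind 2. LabelEncoder numbers the strings in sorted order without considering what they mean; the numbers only serve to distinguish them. Even so, such integers can mislead a classifier to some degree.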

A better approach is one-hot encoding. What is one-hot encoding? Let's take a look first:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import numpy as np

frame = pd.read_csv('C:\\temp\\customer.csv')
encoding = OneHotEncoder()
print(frame["education"].values)
newData = encoding.fit_transform(np.vstack(frame["education"].values)).todense()
print(newData)

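Look at the output first: the education column yields one new column per distinct string, and each column corresponds to one of those strings. A 1 in a column means the current row takes that column's value. The correspondence is easy to see, and there are 5 columns.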

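The advantage of this encoding is that the values still tell the categories apart without creating the spurious numeric similarity described above.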
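To show what the encoding amounts to, here is a hand-rolled sketch in plain numpy. It is not how OneHotEncoder is implemented, just an equivalent construction on a few illustrative strings.

```python
import numpy as np

# Illustrative values only
education = np.array(["High School Degree", "Bachelors Degree",
                      "Graduate Degree", "Bachelors Degree"])

# categories: sorted distinct strings; codes: position of each row's string
categories, codes = np.unique(education, return_inverse=True)

# Row i of the identity matrix has a single 1 in column i,
# so indexing it by code yields one one-hot row per input row
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(one_hot)
```

Every row contains exactly one 1, and any two different categories are equally far apart, so no category looks numerically "closer" to another.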

One slightly tricky part is the conversion that happens inside the call, so here is a brief explanation:

import pandas as pd
import numpy as np

frame = pd.read_csv('C:\\temp\\customer.csv')
print(frame["education"].values)
print(np.vstack(frame["education"].values))

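This is a conversion provided by the numpy package. It turns a one-dimensional sequence of values into a two-dimensional matrix with a single column, and that column holds the same sequence.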

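The conversion is needed because the encoder expects a two-dimensional matrix in which each row corresponds to one row of the original data; only then can it generate the new one-hot columns.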
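The change of shape can be checked on a tiny array. This sketch also shows reshape(-1, 1), a common alternative that gives the same single-column result; the array contents are arbitrary.

```python
import numpy as np

values = np.array(["a", "b", "c"])   # one-dimensional, shape (3,)
column = np.vstack(values)           # two-dimensional, shape (3, 1)
same_column = values.reshape(-1, 1)  # alternative way to get one column

print(values.shape, column.shape)
print(column)
print((column == same_column).all())
```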

So that later steps can treat everything uniformly, we first merge the customer DataFrame and the corresponding one-hot encoding into a single DataFrame:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import numpy as np

frame = pd.read_csv('C:\\temp\\customer.csv')
encoding = OneHotEncoder()
newData = encoding.fit_transform(np.vstack(frame["education"].values)).todense()
frame_new = pd.DataFrame(newData)
frame_full = pd.merge(frame[["yearly_income", "total_children", "num_cars_owned"]], frame_new, left_index=True, right_index=True)
print(frame_full)

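Here we convert the encoded matrix directly into a DataFrame and then join it with the required columns of the customer DataFrame. The join condition simply matches rows one-to-one by index; note how the corresponding parameters are written.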
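The index-based join can be tried on two toy tables. In this sketch the values and the right-hand column names (edu_a, edu_b) are made up; it also shows that DataFrame.join produces the same table when both frames share the same index.

```python
import pandas as pd

# Toy stand-ins; the column names on the right are hypothetical
left = pd.DataFrame({"total_children": [1, 0, 3],
                     "num_cars_owned": [2, 1, 0]})
right = pd.DataFrame({"edu_a": [1.0, 0.0, 0.0],
                      "edu_b": [0.0, 1.0, 1.0]})

# Match row 0 with row 0, row 1 with row 1, ... using the index on both sides
merged = pd.merge(left, right, left_index=True, right_index=True)
print(merged)

# With identical indexes, join gives the same rows and columns
joined = left.join(right)
print(merged.values.tolist() == joined.values.tolist())
```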

Finally, everything can be put together!

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

frame = pd.read_csv('C:\\temp\\customer.csv')
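# regex=True is needed in pandas 2.0+; astype(int) turns the digit strings into real numbers
frame['yearly_income'] = frame['yearly_income'].str.split(" ").str[0].str.replace('[^0-9]', '', regex=True).astype(int)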
y = frame["member_card"]
encoding = OneHotEncoder()
newData = encoding.fit_transform(np.vstack(frame["education"].values)).todense()
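# use string column names: scikit-learn 1.2+ rejects a mix of int and str column names
frame_new = pd.DataFrame(newData).rename(columns=str)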
frame_full = pd.merge(frame[["yearly_income", "total_children", "num_cars_owned"]], frame_new, left_index=True, right_index=True)
X = frame_full
clf = DecisionTreeClassifier()
scores = cross_val_score(clf, X, y, scoring='accuracy')
print(np.mean(scores))

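The result is somewhat better than with LabelEncoder, but still below the original run without education, which may indicate that this column is of limited value.

So let's try adding a different column instead!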

newData = encoding.fit_transform(np.vstack(frame["marital_status"].values)).todense()

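We only need to change the column in this one statement, for example to marital status, and the result improves: accuracy reaches 77.4%.

Sometimes we even need to create or supplement new columns to obtain a better predictive model, and in many cases this depends on the characteristics of the specific application. It is worth exploring further on your own.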

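A decision tree makes its judgments using the features we choose, and the choice of rows affects the result too, so a model's effectiveness can depend heavily on the data we select. If we instead build many decision trees from different combinations of features and rows, let each one make its own prediction, and pick the final answer by majority vote, the prediction should generally be more dependable.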

This is essentially how a random forest works.
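The voting step on its own is tiny. The sketch below is a conceptual illustration only: majority_vote is a made-up helper and the five predictions are hypothetical outputs of five trees for a single customer.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most frequent label among the individual tree predictions."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical predictions from five trees for one customer
tree_votes = ["Bronze", "Silver", "Bronze", "Golden", "Bronze"]
print(majority_vote(tree_votes))
```

As its documentation describes it, scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so this sketch conveys the idea rather than the library's exact procedure.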

We only replaced DecisionTreeClassifier with RandomForestClassifier:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

frame = pd.read_csv('C:\\temp\\customer.csv')
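# regex=True is needed in pandas 2.0+; astype(int) turns the digit strings into real numbers
frame['yearly_income'] = frame['yearly_income'].str.split(" ").str[0].str.replace('[^0-9]', '', regex=True).astype(int)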
y = frame["member_card"]
encoding = OneHotEncoder()
newData = encoding.fit_transform(np.vstack(frame["marital_status"].values)).todense()
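# use string column names: scikit-learn 1.2+ rejects a mix of int and str column names
frame_new = pd.DataFrame(newData).rename(columns=str)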
frame_full = pd.merge(frame[["yearly_income", "total_children", "num_cars_owned"]], frame_new, left_index=True,
                      right_index=True)
X = frame_full
clf = RandomForestClassifier()
scores = cross_val_score(clf, X, y, scoring='accuracy')
print(np.mean(scores))

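The result improves slightly.

A random forest also exposes many tunable parameters, and the best settings can even be found automatically by search: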

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

frame = pd.read_csv('C:\\temp\\customer.csv')
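# regex=True is needed in pandas 2.0+; astype(int) turns the digit strings into real numbers
frame['yearly_income'] = frame['yearly_income'].str.split(" ").str[0].str.replace('[^0-9]', '', regex=True).astype(int)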
y = frame["member_card"]
encoding = OneHotEncoder()
newData = encoding.fit_transform(np.vstack(frame["marital_status"].values)).todense()
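# use string column names: scikit-learn 1.2+ rejects a mix of int and str column names
frame_new = pd.DataFrame(newData).rename(columns=str)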
frame_full = pd.merge(frame[["yearly_income", "total_children", "num_cars_owned"]], frame_new, left_index=True, right_index=True)
X = frame_full
parameter_space = {
    # 'auto' was removed in scikit-learn 1.3; for classifiers it meant 'sqrt'
    "max_features": [2, 4, 'sqrt'],
    "n_estimators": [100, ],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}
clf = RandomForestClassifier()
grid = GridSearchCV(clf, parameter_space)
grid.fit(X, y)
print(grid.best_estimator_)
print(grid.best_score_)

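Here we define a set of parameters and candidate values, then let GridSearchCV try every combination and compare them to find the best one. best_estimator_ gives the chosen parameter settings and best_score_ gives the best score achieved.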
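It can help to see how many combinations such a grid actually contains before running the search. The sketch below enumerates a grid of the same shape with scikit-learn's ParameterGrid; no model is trained here.

```python
from sklearn.model_selection import ParameterGrid

parameter_space = {
    "max_features": [2, 4, "sqrt"],
    "n_estimators": [100],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}

grid_points = list(ParameterGrid(parameter_space))
print(len(grid_points))  # 3 * 1 * 2 * 3 combinations
print(grid_points[0])    # one candidate setting, as a dict
```

Each combination is evaluated once per cross-validation fold, so the cost of the search grows quickly as more candidate values are added.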

The settings reported by best_estimator_ can then be plugged back in:

clf = RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                             criterion='gini', max_depth=None, max_features=2,
                             max_leaf_nodes=None, max_samples=None,
                             min_impurity_decrease=0.0,  # min_impurity_split was removed in scikit-learn 1.0
                             min_samples_leaf=4, min_samples_split=2,
                             min_weight_fraction_leaf=0.0, n_estimators=100,
                             n_jobs=None, oob_score=False, random_state=14,
                             verbose=0, warm_start=False, )
scores = cross_val_score(clf, X, y, scoring='accuracy')
print(np.mean(scores))

At this point we have reached the best accuracy of all our attempts so far.

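Note the interesting parameter random_state. Results normally vary a little from run to run because of random choices, such as which rows and features each tree is built from. Setting this parameter fixes that randomness, so every run gives the same result, which makes comparisons easier.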
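The effect of random_state is easy to verify. This sketch runs the same cross-validation twice with the same seed on synthetic data from make_classification; the data and the helper name run are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the customer features
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)

def run(seed):
    """Cross-validate a small forest whose randomness is fixed by seed."""
    clf = RandomForestClassifier(n_estimators=20, random_state=seed)
    return cross_val_score(clf, X, y, scoring="accuracy", cv=5)

first = run(14)
second = run(14)
print(first)
print(np.array_equal(first, second))  # same seed, same fold scores
```

Leaving random_state unset in run would let the two calls differ slightly, which is the run-to-run variation the paragraph above describes.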

李树青
