Python Big Data Analysis 5: Regression Analysis (Predicting Daily Orders)

Regression analysis is mainly used to predict numeric values: a function is fitted over feature attributes to predict a target attribute. It comes in several forms, including linear regression and polynomial regression. Plain linear regression can only fit the input data with a straight line, which is often too restrictive for real-world relationships; nonlinear regression uses more complex functions, so the resulting models are more flexible and the results are sometimes more accurate.

This time we use a daily demand forecasting orders dataset. As before, it comes from the UC Irvine (UCI) machine learning repository; click "Data Folder" on the dataset page to download the file.

The columns in this file are separated by semicolons, so Excel cannot split them into columns out of the box.

Let's read it ourselves with pandas:

import pandas as pd

frame = pd.read_csv('C:\\temp\\Daily_Demand_Forecasting_Orders.csv', sep=';')
pd.set_option('display.max_columns', None)
print(frame.head(1))

Here the sep parameter specifies the delimiter. The data now displays correctly, but some column names are very long, which makes them inconvenient to work with.

So let's rename some of the columns.

import pandas as pd

frame = pd.read_csv('C:\\temp\\Daily_Demand_Forecasting_Orders.csv', sep=';')
pd.set_option('display.max_columns', None)
frame.rename(columns={'Week of the month (first week, second, third, fourth or fifth week': 'week',
                      'Day of the week (Monday to Friday)': 'day',
                      'Orders from the traffic controller sector': 'sector',
                      'Target (Total orders)': 'Target'}, inplace=True)
print(frame.head(1))

There are 12 feature columns, all order-related attributes, plus a final Target column holding the total number of orders. Our task is to predict this total as accurately as possible from the 12 features: a typical regression task.

Let's start with linear regression using just a single input feature, i.e. simple (one-variable) linear regression. Here we pick the Non-urgent order feature.

import pandas as pd

frame = pd.read_csv('C:\\temp\\Daily_Demand_Forecasting_Orders.csv', sep=';')
pd.set_option('display.max_columns', None)
frame.rename(columns={'Week of the month (first week, second, third, fourth or fifth week': 'week',
                      'Day of the week (Monday to Friday)': 'day',
                      'Orders from the traffic controller sector': 'sector',
                      'Target (Total orders)': 'Target'}, inplace=True)
X = frame['Non-urgent order'].values.reshape(-1, 1)
print(X)

Regression in scikit-learn requires the feature data to be two-dimensional, so we use reshape to convert it. The second argument 1 means one column, and the first argument -1 means the number of rows is inferred automatically; in effect the one-dimensional array is stood upright into a many-rows, one-column matrix.
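The effect of reshape(-1, 1) is easiest to see on a tiny standalone array (a toy sketch, independent of the dataset):

```python
import numpy as np

a = np.array([3, 1, 4, 1, 5])   # 1-D array, shape (5,)
b = a.reshape(-1, 1)            # -1 lets numpy infer the row count
print(a.shape)                  # (5,)
print(b.shape)                  # (5, 1): five rows, one column
```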

With that in place, we can predict:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

frame = pd.read_csv('C:\\temp\\Daily_Demand_Forecasting_Orders.csv', sep=';')
pd.set_option('display.max_columns', None)
frame.rename(columns={'Week of the month (first week, second, third, fourth or fifth week': 'week',
                      'Day of the week (Monday to Friday)': 'day',
                      'Orders from the traffic controller sector': 'sector',
                      'Target (Total orders)': 'Target'}, inplace=True)
X = frame['Non-urgent order'].values.reshape(-1, 1)
y = frame['Target']
regressor = LinearRegression()
scores = cross_val_score(regressor, X, y, scoring='r2')
print(np.mean(scores))

Usage is straightforward: just as with the machine learning methods covered earlier, all we change is the model, here LinearRegression. Note that the cross-validation metric cannot be accuracy: for numeric prediction, demanding exactly equal values is unrealistic. The usual choice is the R² score, the coefficient of determination, which measures how well the model predicts unseen samples; the best possible score is 1.0.
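The R² formula itself is simple: one minus the ratio of the residual sum of squares to the total sum of squares. A small hand-made check (toy numbers, not from the dataset) agrees with sklearn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
print(1 - ss_res / ss_tot)                       # 0.968
print(r2_score(y_true, y_pred))                  # 0.968
```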

Cross-validation is convenient, but it hides many details of the model, so let's switch back to the explicit version:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import sklearn.metrics as sm

frame = pd.read_csv('C:\\temp\\Daily_Demand_Forecasting_Orders.csv', sep=';')
pd.set_option('display.max_columns', None)
frame.rename(columns={'Week of the month (first week, second, third, fourth or fifth week': 'week',
                      'Day of the week (Monday to Friday)': 'day',
                      'Orders from the traffic controller sector': 'sector',
                      'Target (Total orders)': 'Target'}, inplace=True)
X = frame['Non-urgent order'].values.reshape(-1, 1)
y = frame['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_test_pred = regressor.predict(X_test)
print(round(sm.mean_squared_error(y_test, y_test_pred), 2))
print(round(sm.r2_score(y_test, y_test_pred), 2))
print(regressor.coef_)
print(regressor.intercept_)

Here we split the dataset ourselves, trained the model, made predictions, and printed the desired metrics directly. The coef_ and intercept_ attributes give the slope (coefficient) and the intercept of the fitted line, respectively.
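For a single feature, predict() is nothing more than coef_ * x + intercept_. A quick sketch on synthetic data that lies exactly on the line y = 2x + 1 (toy values, not the order data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])               # exactly y = 2x + 1
reg = LinearRegression().fit(X, y)
print(reg.coef_[0], reg.intercept_)              # ~2.0 and ~1.0
manual = reg.coef_[0] * 5.0 + reg.intercept_     # prediction by hand
print(manual, reg.predict(np.array([[5.0]]))[0]) # both ~11.0
```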

To get a more intuitive picture, let's plot it:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

frame = pd.read_csv('C:\\temp\\Daily_Demand_Forecasting_Orders.csv', sep=';')
pd.set_option('display.max_columns', None)
frame.rename(columns={'Week of the month (first week, second, third, fourth or fifth week': 'week',
                      'Day of the week (Monday to Friday)': 'day',
                      'Orders from the traffic controller sector': 'sector',
                      'Target (Total orders)': 'Target'}, inplace=True)
X = frame['Non-urgent order'].values.reshape(-1, 1)
y = frame['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_test_pred = regressor.predict(X_test)
plt.figure()
plt.scatter(X_test, y_test, color='green')
plt.plot(X_test, y_test_pred, color='black', linewidth=4)
plt.title('Test data')
plt.show()

Here the original data is drawn as a scatter plot; since the prediction comes from one-variable linear regression, it appears as a straight line.

The plot shows the fitted line clearly: the intercept and the coefficient together fully determine its position.

Of course, one-variable regression usually does not perform well; fitting a single straight line is too idealized.

For many situations we can therefore consider other regression models. For example, if the data contains large outliers, they usually have a clearly harmful effect on ordinary least-squares regression; in that case we can consider a ridge regressor, which adds an L2 regularization penalty that shrinks the coefficients and keeps the fit more stable:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
import sklearn.metrics as sm

frame = pd.read_csv('C:\\temp\\Daily_Demand_Forecasting_Orders.csv', sep=';')
pd.set_option('display.max_columns', None)
frame.rename(columns={'Week of the month (first week, second, third, fourth or fifth week': 'week',
                      'Day of the week (Monday to Friday)': 'day',
                      'Orders from the traffic controller sector': 'sector',
                      'Target (Total orders)': 'Target'}, inplace=True)
X = frame['Non-urgent order'].values.reshape(-1, 1)
y = frame['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
regressor = Ridge()
regressor.fit(X_train, y_train)
y_test_pred = regressor.predict(X_test)
print(round(sm.r2_score(y_test, y_test_pred), 2))

All it takes is swapping the model in place of LinearRegression.

A more effective approach, of course, is to use more features rather than just one, which is called multiple regression. We still use ordinary linear regression:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import sklearn.metrics as sm

frame = pd.read_csv('C:\\temp\\Daily_Demand_Forecasting_Orders.csv', sep=';')
pd.set_option('display.max_columns', None)
frame.rename(columns={'Week of the month (first week, second, third, fourth or fifth week': 'week',
                      'Day of the week (Monday to Friday)': 'day',
                      'Orders from the traffic controller sector': 'sector',
                      'Target (Total orders)': 'Target'}, inplace=True)

y = frame['Target']
X = frame.drop(columns='Target')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_test_pred = regressor.predict(X_test)
print(round(sm.r2_score(y_test, y_test_pred), 2))

The key code is just the two middle lines: Target becomes the column to predict, and all remaining columns become the features. Note that X is already two-dimensional here (a DataFrame), so no reshape is needed.

The R² score is now exactly 1. That indicates a perfectly linear relationship; it likely means the total-orders target is (nearly) a linear combination of the other order columns, helped by the fact that the dataset is small and relatively easy to fit.

We can go further and use more complex regression models, such as polynomial regression. Unlike linear regression, which can only fit the input data with a straight line, a polynomial regression model fits a polynomial equation, which can improve accuracy. The curvature of the model is determined by the degree of the polynomial: a higher degree lets the model follow the data more closely, but it also increases model complexity (and the risk of overfitting), and fitting becomes slower.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import sklearn.metrics as sm
from sklearn.preprocessing import PolynomialFeatures

frame = pd.read_csv('C:\\temp\\Daily_Demand_Forecasting_Orders.csv', sep=';')
pd.set_option('display.max_columns', None)
frame.rename(columns={'Week of the month (first week, second, third, fourth or fifth week': 'week',
                      'Day of the week (Monday to Friday)': 'day',
                      'Orders from the traffic controller sector': 'sector',
                      'Target (Total orders)': 'Target'}, inplace=True)
y = frame['Target']
X = frame.drop(columns='Target')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

polynomial = PolynomialFeatures(interaction_only=True)
X_train_transformed = polynomial.fit_transform(X_train)
regressor = LinearRegression()
regressor.fit(X_train_transformed, y_train)
X_test_transformed = polynomial.transform(X_test)
y_test_pred = regressor.predict(X_test_transformed)

print(round(sm.r2_score(y_test, y_test_pred), 2))

The key point is that the raw features must be transformed before any training or prediction. PolynomialFeatures is fitted on the training data with fit_transform, and the test data is then converted with transform only (fitting again on the test set would be unnecessary and could leak test information). The transformed data is what gets trained on and predicted from.
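What PolynomialFeatures actually generates is easiest to see on a single sample with two features (a toy sketch; with interaction_only=True and the default degree of 2, the squared terms are dropped, leaving the bias, the original features, and their product):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])                 # one sample, features a=2, b=3
poly = PolynomialFeatures(interaction_only=True)
print(poly.fit_transform(X))               # [[1. 2. 3. 6.]] -> 1, a, b, a*b
```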

Compared with the plain one-variable linear regression, this improves the score by about two percentage points.

Another option is decision tree regression:

import pandas as pd
from sklearn.model_selection import train_test_split
import sklearn.metrics as sm
from sklearn.tree import DecisionTreeRegressor

frame = pd.read_csv('C:\\temp\\Daily_Demand_Forecasting_Orders.csv', sep=';')
pd.set_option('display.max_columns', None)
frame.rename(columns={'Week of the month (first week, second, third, fourth or fifth week': 'week',
                      'Day of the week (Monday to Friday)': 'day',
                      'Orders from the traffic controller sector': 'sector',
                      'Target (Total orders)': 'Target'}, inplace=True)
y = frame['Target']
X = frame.drop(columns='Target')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

regressor = DecisionTreeRegressor(max_depth=6)
regressor.fit(X_train, y_train)
y_test_pred = regressor.predict(X_test)
print(round(sm.r2_score(y_test, y_test_pred), 2))

The max_depth parameter sets the depth of the decision tree. As you may have noticed, different methods and different parameter settings perform differently on different data; that is exactly what further study builds on. We need both experience and theory to judge which model and which parameters suit which kind of data.
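One practical way to explore this is to loop over candidate max_depth values and compare cross-validated R² scores. The sketch below uses synthetic data from make_regression so it runs without the CSV file; with the real data you would pass the X and y built above instead:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in data: 200 samples, 12 features
X, y = make_regression(n_samples=200, n_features=12, noise=10, random_state=7)
for depth in [2, 4, 6, 8, 10]:
    scores = cross_val_score(DecisionTreeRegressor(max_depth=depth, random_state=7),
                             X, y, scoring='r2', cv=5)
    print(depth, round(np.mean(scores), 2))
```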

On top of decision tree regression, we can use the more powerful AdaBoost regression method.

import pandas as pd
from sklearn.model_selection import train_test_split
import sklearn.metrics as sm
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor

frame = pd.read_csv('C:\\temp\\Daily_Demand_Forecasting_Orders.csv', sep=';')
pd.set_option('display.max_columns', None)
frame.rename(columns={'Week of the month (first week, second, third, fourth or fifth week': 'week',
                      'Day of the week (Monday to Friday)': 'day',
                      'Orders from the traffic controller sector': 'sector',
                      'Target (Total orders)': 'Target'}, inplace=True)
y = frame['Target']
X = frame.drop(columns='Target')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

regressor = AdaBoostRegressor(DecisionTreeRegressor(max_depth=6))
regressor.fit(X_train, y_train)
y_test_pred = regressor.predict(X_test)
print(round(sm.r2_score(y_test, y_test_pred), 2))

Here AdaBoostRegressor uses a decision tree regressor as its base estimator, and the results are better still.

The AdaBoost regressor can also report which features are the most relevant:

import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor

frame = pd.read_csv('C:\\temp\\Daily_Demand_Forecasting_Orders.csv', sep=';')
pd.set_option('display.max_columns', None)
frame.rename(columns={'Week of the month (first week, second, third, fourth or fifth week': 'week',
                      'Day of the week (Monday to Friday)': 'day',
                      'Orders from the traffic controller sector': 'sector',
                      'Target (Total orders)': 'Target'}, inplace=True)
y = frame['Target']
X = frame.drop(columns='Target')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
regressor = AdaBoostRegressor(DecisionTreeRegressor(max_depth=6))
regressor.fit(X_train, y_train)
y_test_pred = regressor.predict(X_test)

print(regressor.feature_importances_)

The 12 printed values correspond to the 12 features; the largest, 0.42, corresponds to the third feature, Non-urgent order.

Finally, let's wrap up with an interesting plot.

We will draw a bar chart from the feature importances just computed. We have the data; how do we plot it? This takes a little technique.

First the vertical axis: we scale the values so the largest becomes 100:

feature_importances = 100.0 * (regressor.feature_importances_ / max(regressor.feature_importances_))

print(feature_importances)

Now the horizontal axis: the bars should appear in sorted order.

We call numpy's sort:

print(np.sort(feature_importances))

However, it sorts in ascending order by default.

Let's make it descending:

print(np.flipud(np.sort(feature_importances)))

flipud flips the array upside down, which turns the ascending order into descending.

Now we can plot:

values = np.flipud(np.sort(feature_importances))

plt.figure()

plt.bar(np.arange(12), values)

plt.show()

The bar chart's horizontal axis holds the 12 integers 0 through 11, and the bar heights are the feature importances computed above.

You may notice that the bar heights vary a little from run to run; this is because each fit involves random choices. To pin the results down, set a fixed random state on the AdaBoost regressor:

regressor = AdaBoostRegressor(DecisionTreeRegressor(max_depth=6), random_state=7)

The horizontal axis is still unclear, though: which data column does each bar correspond to?

Here is the complete version:

import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor

frame = pd.read_csv('C:\\temp\\Daily_Demand_Forecasting_Orders.csv', sep=';')
pd.set_option('display.max_columns', None)
frame.rename(columns={'Week of the month (first week, second, third, fourth or fifth week': 'week',
                      'Day of the week (Monday to Friday)': 'day',
                      'Orders from the traffic controller sector': 'sector',
                      'Target (Total orders)': 'Target'}, inplace=True)
y = frame['Target']
X = frame.drop(columns='Target')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

regressor = AdaBoostRegressor(DecisionTreeRegressor(max_depth=6), random_state=7)
regressor.fit(X_train, y_train)
y_test_pred = regressor.predict(X_test)

feature_importances = 100.0 * (regressor.feature_importances_ / max(regressor.feature_importances_))
values = np.flipud(np.sort(feature_importances))
index_sorted = np.flipud(np.argsort(feature_importances))
plt.figure()
plt.bar(np.arange(12), values)
plt.xticks(np.arange(12), X.columns.values[index_sorted])
plt.show()

Two lines were added:

In the first, argsort also sorts, but it returns the indices of the elements in sorted order rather than the values themselves. After flipping, we use these indices to set the tick labels on the horizontal axis: the labels are the 12 column names, fetched one by one through the indices, whose order is exactly the columns sorted by importance.
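The argsort trick is worth a tiny standalone demo (toy values, not the real importances):

```python
import numpy as np

imp = np.array([10.0, 42.0, 7.0, 25.0])      # toy importance values
idx = np.flipud(np.argsort(imp))             # indices, most important first
print(idx)                                   # [1 3 0 2]
names = np.array(['w', 'x', 'y', 'z'])
print(names[idx])                            # ['x' 'z' 'w' 'y']
```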

With that, the plot is complete. Give it a try.
