
I'm trying to understand the gradient boosting regression example using the Boston housing data (http://scikit-learn.org/stable/modules/ensemble.html), hoping to apply it to a different task, but I'm a novice Python user and a beginner in ML. I have a general understanding of what the program is doing, but I would like to know what the following code and/or arguments are doing:

  • Line 2: Why do we have to shuffle the data & what is random_state=13 doing?
  • Lines 4-6: The train & test variables are obvious, as is the bracket notation (i.e., subsetting a list/array), but I don't understand the significance of creating the offset integer & the arguments used to create it.

import numpy as np
from sklearn import datasets
from sklearn.utils import shuffle

boston = datasets.load_boston()
X, y = shuffle(boston.data, boston.target, random_state=13)  # "line 2"
X = X.astype(np.float32)
offset = int(X.shape[0] * 0.9)                               # "line 4"
X_train, y_train = X[:offset], y[:offset]                    # "line 5"
X_test, y_test = X[offset:], y[offset:]                      # "line 6"

3 Answers


This code is rolling its own train/test split, but it really should be using train_test_split from sklearn (a sketch of that follows the list below).

In terms of what it's doing in detail:

  • The shuffle sets up their splitting method: they simply take the first 90% of the rows as the training set, so shuffling first makes that 90% a random sample.
  • The random_state=13 manually sets a seed for the random number generator. This ensures that if you run the code twice, you always get the same train/test split.
  • The offset is the index that cuts the data into a 90% / 10% split. X.shape[0] is the number of rows, so X.shape[0] * 0.9 marks the 90% point. Since that product is usually not a whole number, it is truncated to a valid integer index with int().
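
For comparison, here is a minimal sketch of the train_test_split version (not from the original example): test_size=0.1 reproduces the 90/10 split above, and random_state=13 is carried over just for reproducibility.

from sklearn.model_selection import train_test_split

# shuffles internally, so no separate shuffle() call is needed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=13)
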
Matthew Drury

This is not intended as a full answer, as Matthew Drury has already provided a great one; just to expand a little bit and provide some alternatives.

The fact that they used offset = int(X.shape[0] * 0.9) isn't too important; it just means they are using roughly 90% of the data for training and 10% for testing. There are several ways to accomplish this, and as others have mentioned, sklearn has train_test_split, which will create X_train, X_test, y_train, y_test for you:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

Reading the documentation, you will see that you can specify both test_size and train_size; however, by default train_size is the complement of test_size.
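
To see the complement behaviour concretely, these two calls produce identical splits (a sketch, reusing the X and y arrays from above):

# train_size defaults to 1 - test_size, so these two calls are equivalent
split_a = train_test_split(X, y, test_size=0.1, random_state=0)
split_b = train_test_split(X, y, train_size=0.9, random_state=0)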

Another approach is cross-validation. Here's an example in sklearn:

from sklearn.model_selection import KFold
from sklearn.metrics import r2_score
from sklearn import linear_model

# assuming you're performing regression
clf = linear_model.LinearRegression()
cv = KFold(n_splits=5)
results = {'r2': [], 'coef': []}

for train, test in cv.split(X, y):        # train/test are arrays of row indices
    clf.fit(X[train], y[train])           # fit on this fold's training rows
    prediction = clf.predict(X[test])
    r2 = r2_score(y[test], prediction)    # score on the held-out fold
    results['r2'].append(r2)
    results['coef'].append(clf.coef_)     # keep this fold's fitted coefficients

The code above can be written in a more compact way, but for clarity's sake I wrote it as is. Once all CV folds have been run, you can look at the mean r2 score: np.mean(results['r2']). In the example above I saved the estimates of the coefficients at each CV iteration, but in a similar fashion you could save whatever you like.
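
For reference, the "more compact way" could look something like this sketch using cross_val_score; it only collects the scores, so you would still need the explicit loop above to save the coefficients:

from sklearn.model_selection import cross_val_score

# one call runs all 5 folds and returns an array of per-fold r2 scores
scores = cross_val_score(clf, X, y, cv=5, scoring='r2')
print(scores.mean())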

ysa

Line 2: shuffle randomly reorders the rows of the dataset. This matters here because the rows may be stored in some systematic order, and splitting ordered data by position would give a biased train/test split. random_state=13 fixes the seed of the random number generator, so rerunning the same code reproduces the same shuffle (and hence the same split).
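
As a quick illustration of what random_state does (my own sketch, not part of the original answer):

import numpy as np
from sklearn.utils import shuffle

a = np.arange(10)
# the same seed always yields the same permutation
assert (shuffle(a, random_state=13) == shuffle(a, random_state=13)).all()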

To avoid writing lines 4-6 by hand, you can also split the data using train_test_split() from sklearn.model_selection. For this case: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1), where test_size is the proportion of the complete dataset held out for testing (0.1 reproduces the question's 90/10 split).

Harshit Mehta
  • This is being automatically flagged as low quality, probably because it is so short. At present it is more of a comment than an answer by our standards. Can you expand on it? You can also turn it into a comment. – gung - Reinstate Monica Jul 31 '17 at 16:08
  • Thanks @tuomastik for revising the original question and thereby making it easier to follow. – BaseLearner_J Aug 05 '17 at 15:25