
The following Python 3 code snippet implements a simple Random Forest classification model that predicts an output value from a set of input values.

In the example, a tab-separated CSV file is loaded first; it contains several input columns (X) and one column with the value (Y) that the model should predict. The data is then split into a training set and a test set, and the model is trained on the training set. The call model.score checks the quality of the trained model against the test data by returning its mean accuracy, and the attribute model.feature_importances_ holds an array with the importance of each input column.

The example code requires the scikit-learn and pandas libraries (pip install scikit-learn, pip install pandas). NumPy, which the code also imports, is installed automatically as a dependency of both.

from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection
import pandas as pd
import numpy as np

# load csv data
input_data_1 = pd.read_csv("prediction-input-1.txt", sep="\t")

# prepare column to be predicted: keep 0 as is, map every other value to 1 (binary target)
input_data_1['Transactions'] = np.where(input_data_1['Transactions'] == 0, input_data_1['Transactions'], 1)

Y = np.array(input_data_1['Transactions'])

# select features (= columns used as input for prediction)
features = list(input_data_1.columns)

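# remove the columns at positions 5 and 0 from the feature list
# (the higher index is deleted first so the positions do not shift)
del features[5]
del features[0]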
X = input_data_1[features]

# split data into test and train set
test_size = 0.33

seed = 77
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)

# build and train model
model = RandomForestClassifier(n_estimators=100, n_jobs=2, random_state=0, max_depth=10)
model.fit(X_train, Y_train)

# test model quality (mean accuracy on the test set)
result = model.score(X_test, Y_test)
print("Accuracy:", result)

# print vector with feature importances
print(model.feature_importances_)
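
The raw importance array is hard to read on its own because it carries no column names. The following is a minimal sketch of one way to pair each importance with its input column and sort the result. It does not use the file from the example above: the data is generated with make_classification, and the column names feature_0 to feature_5 are hypothetical placeholders chosen only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the input file: 500 rows, 6 input columns, binary target.
X_values, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=77
)
# Hypothetical column names, for illustration only.
feature_names = [f"feature_{i}" for i in range(X_values.shape[1])]
X = pd.DataFrame(X_values, columns=feature_names)

# Same split ratio and seed as in the example above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=77
)

model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
model.fit(X_train, y_train)

# Mean accuracy on the held-out test set (a value between 0 and 1).
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.3f}")

# Pair each importance with its column name and sort, most important first.
importances = pd.Series(model.feature_importances_, index=feature_names)
ranked = importances.sort_values(ascending=False)
print(ranked)
```

With real data, the index would simply be the list of selected feature columns, so each importance value can be read next to the column it belongs to.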