The following Python 3 code snippet demonstrates a simple random forest classification model that predicts an output value from a set of input values.
In the example, a delimited text file (tab-separated in this case) is loaded first. It contains several input columns (X) and one column with the value (Y) that the model should predict. The data is then split into a training set and a test set, and the model is trained on the training set. The call model.score checks the quality of the trained model against the test data by returning the mean accuracy, that is, the share of test rows that were classified correctly. The attribute model.feature_importances_ holds an array with the relative importance of each input column.
The example code requires the scikit-learn and pandas libraries (pip install scikit-learn, pip install pandas). NumPy, which the code also imports, is installed automatically as a dependency of both.
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection
import pandas as pd
import numpy as np
# load csv data
input_data_1 = pd.read_csv("prediction-input-1.txt", sep="\t")
# prepare column to be predicted: binarize it (0 stays 0, any other value becomes 1)
input_data_1['Transactions'] = np.where(input_data_1['Transactions'] == 0, input_data_1['Transactions'], 1)
Y = np.array(input_data_1['Transactions'])
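# select features (= columns used as input for prediction)
# the columns at positions 5 and 0 are excluded; the higher index is deleted first so positions do not shift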
features = list(input_data_1.columns)
del features[5]
del features[0]
X = input_data_1[features]
# split data into test and train set
test_size = 0.33
seed = 77
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)
# build and train model
model = RandomForestClassifier(n_estimators=100, n_jobs=2, random_state=0, max_depth=10)
model.fit(X_train, Y_train)
# test model quality (mean accuracy on the test set)
result = model.score(X_test, Y_test)
print(result)
# print vector with feature importances
print(model.feature_importances_)
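The raw importance array is hard to read without the column names next to it. The following is a minimal sketch of how the two outputs could be interpreted, not part of the original example: it replaces the input file with synthetic data, and the column names visits, noise_a and noise_b are hypothetical. It checks that model.score matches a manually computed accuracy and pairs each importance value with its column name.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the input file (assumption): three hypothetical
# input columns, of which only "visits" determines the target.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "visits": rng.integers(0, 50, size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
target = (data["visits"] > 25).astype(int).to_numpy()

X_train, X_test, Y_train, Y_test = train_test_split(
    data, target, test_size=0.33, random_state=77
)

model = RandomForestClassifier(n_estimators=100, random_state=0, max_depth=10)
model.fit(X_train, Y_train)

# score() returns the mean accuracy: the share of correct test predictions
accuracy = model.score(X_test, Y_test)
manual_accuracy = float(np.mean(model.predict(X_test) == Y_test))

# pair each importance with its column name, most important column first
importances = pd.Series(
    model.feature_importances_, index=data.columns
).sort_values(ascending=False)

print(accuracy, manual_accuracy)
print(importances)
```

The importance values are normalized so that they sum to 1, which makes them comparable across columns of the same model but not across different models or data sets.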