Title

Random Forest Classifier on Census Dataset

Abstract

This code block trains a random forest classifier on the Census dataset and obtains predictions for both the validation data and a holdout dataset. It uses the FSelect features, which were previously selected using L1 regularization with a cost of 0.01 and one-hot encoding applied at a minimum frequency of 15. The classifier is evaluated by accuracy score on the validation and holdout datasets, and 100 predictions are explained.

Data

The Census and CensusHoldout datasets were used, restricted to the FSelect features. The variables tr_xy, tr_x, te_x, tr_y, and te_y are created from the data using pandas. The Census dataset contains both numerical and categorical features. The target label is "target".
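The variables named above could be derived along the following lines. This is only a sketch: the source does not say how the split was made, so the helper name split_features_target, the 80/20 random split, and the seed are assumptions for illustration, and te_x / te_y are assumed here to be the validation portion of the Census data.

```python
import pandas as pd


def split_features_target(df, feature_cols, target_col="target",
                          train_frac=0.8, seed=0):
    """Split a frame into training and validation parts (hypothetical helper).

    Assumes df has a unique index, since the validation rows are obtained
    by dropping the sampled training index.
    """
    # tr_xy keeps features and target together, as in the report.
    tr_xy = df.sample(frac=train_frac, random_state=seed)
    te_xy = df.drop(index=tr_xy.index)

    # Separate the selected feature columns from the target label.
    tr_x, tr_y = tr_xy[feature_cols], tr_xy[target_col]
    te_x, te_y = te_xy[feature_cols], te_xy[target_col]
    return tr_xy, tr_x, te_x, tr_y, te_y
```

Here feature_cols would be the list of FSelect feature names, and the holdout data from CensusHoldout would be separated into features and target in the same way.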

Method

A random forest classifier is used for this analysis. One-hot encoding is applied to the categorical columns with a minimum frequency of 5.

Models & parameters

The random forest classifier pipeline applies one-hot encoding with a minimum frequency of 5 to the categorical features and a quantile transformer to the integer features. The transformed features are then used to train the random forest classifier. The parameters used are n_estimators = 5 and min_samples_leaf = 10.

Result

Model accuracy on the validation set: 0.853. Model accuracy on the holdout data: 0.848.
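Figures of this kind could be computed along these lines. This is a sketch assuming scikit-learn's accuracy_score; the helper name report_accuracy and the dataset labels are hypothetical, and the numbers above come from the original run, not from this snippet.

```python
from sklearn.metrics import accuracy_score


def report_accuracy(model, datasets):
    """Return accuracy per dataset, rounded to three decimals.

    datasets maps a label (e.g. "validation", "holdout") to an (X, y) pair;
    model is any fitted object with a predict method.
    """
    return {
        name: round(float(accuracy_score(y, model.predict(X))), 3)
        for name, (X, y) in datasets.items()
    }
```

Accuracy here is the fraction of rows whose predicted label equals the true "target" value, so the validation and holdout numbers are directly comparable.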

Observation

This code block trains a random forest classifier on the Census dataset using features selected with L1 regularization and one-hot encoding. The classifier is evaluated by accuracy score on the validation and holdout datasets, and 100 predictions are explained in terms of feature importance. The model reaches an accuracy of 0.848 on the holdout data, close to the 0.853 obtained on the validation set, which could be considered fairly good. The dataset could benefit from further preprocessing such as missing value imputation and normalization.