Forest Cover Type Classifier

My work here started with the Forest Cover Type competition on Kaggle. The data for this competition was collected from four wilderness areas in Colorado. There are 12 data fields (though in the raw data some of those fields are further split apart) available for predicting what type of forest cover exists at each sample site. There are 15,120 records in the training set and 565,892 records in the test set (see a more detailed data description here).
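
Loading the data is straightforward with pandas. This is a minimal sketch; the filenames and column counts assume the standard Kaggle download, where the 12 fields expand to 54 columns once the wilderness and soil types are split into binary fields:

```python
import pandas as pd

# Filenames assume the standard Kaggle download for this competition
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape)  # (15120, 56): Id, 54 feature columns, Cover_Type
print(test.shape)   # (565892, 55): same fields, but no Cover_Type
```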

Basic histograms of available fields

One of the first things I wanted to do was get a better feel for how each field in the data sets was distributed. While doing so, I also noticed some interesting differences between the training and test sets. Most fields have similar distributions between the two data sets, but some seem significantly different. Note in particular the elevation, wilderness area, and soil type fields.
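
A sketch of how one of these comparisons can be plotted. `compare_distributions` is just an illustrative helper of mine, and `Elevation` follows the competition's column naming:

```python
import matplotlib.pyplot as plt

def compare_distributions(train, test, field, bins=50):
    """Overlay density-normalized histograms of one field from each data set."""
    plt.hist(train[field], bins=bins, density=True, alpha=0.5, label="train")
    plt.hist(test[field], bins=bins, density=True, alpha=0.5, label="test")
    plt.xlabel(field)
    plt.legend()
    plt.show()

compare_distributions(train, test, "Elevation")
```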

Cover type vs the fields

I also wanted to see how each field was distributed for the different cover types, so I plotted histograms of the training data (where the cover types are known) of each field broken out by cover type. I hoped this would give me a feel for which fields might be most important, and perhaps some ideas for how to further work with the data sets. Looking through these histograms, many of the fields don't show patterns that are visually helpful for classifying cover types; however, a few do seem useful. Note for instance the elevation, wilderness area, and soil type plots.
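
A similar sketch for the per-cover-type histograms, grouping the training data on the `Cover_Type` column (the test set has no labels, so only the training data can be plotted this way):

```python
import matplotlib.pyplot as plt

def field_by_cover_type(train, field, bins=30):
    """Histogram of one field, broken out by cover type (training data only)."""
    for cover_type, group in train.groupby("Cover_Type"):
        plt.hist(group[field], bins=bins, density=True, alpha=0.4,
                 label="type %d" % cover_type)
    plt.xlabel(field)
    plt.legend()
    plt.show()

field_by_cover_type(train, "Elevation")
```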

Looking at what the forest cover types actually are helps explain some of the visible patterns in the distributions. For instance, cover type 7 is krummholz, a stunted growth form that occurs primarily in subalpine and subarctic zones. It is not surprising, then, that it has a notably higher elevation distribution than the other cover types. Cover type 4 is cottonwood/willow, trees that are often associated with wetlands. While most of the cover types in the training data appear to occur fairly close to water, it is perhaps not surprising that cottonwood/willow appears to be the most closely associated with water.

Other Considerations

One thing I noticed while exploring the data is that the training data is split evenly between the 7 cover types (there are 2,160 records of each cover type in the training set). My initial predictions on the test set suggest a very different distribution of cover types. It appears that the training set was not selected as a random sample, but was curated to have an equal number of each type. This hypothesis seems to be borne out when comparing the distributions of the wilderness areas and elevations between the training and test data (see the histograms above).
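
These checks are quick to reproduce (assuming the one-hot `Wilderness_Area1`..`Wilderness_Area4` column names from the Kaggle files):

```python
# Confirm the even class split in the training data
print(train["Cover_Type"].value_counts())  # 2160 of each of the 7 types

# Compare wilderness area proportions between the two data sets
wilderness_cols = ["Wilderness_Area%d" % i for i in range(1, 5)]
print(train[wilderness_cols].mean())  # fraction of records in each area
print(test[wilderness_cols].mean())   # noticeably different proportions
```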

At this point I am not sure what can or should be done to account for the difference in sampling when building a predictive model. One thing I attempted was to create a bootstrapped sample from the training set that matched the wilderness area distribution of the test set; the result was notably worse than the more straightforward approach.
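
For reference, a rough sketch of that resampling. `bootstrap_to_match` is my illustrative helper, and the column names again assume the Kaggle one-hot layout:

```python
import pandas as pd

def bootstrap_to_match(train, test, area_cols):
    """Resample the training set (with replacement) so that its wilderness
    area proportions match the test set's."""
    train_area = train[area_cols].values.argmax(axis=1)  # area index 0..3
    target_frac = test[area_cols].mean().values          # target proportions
    pieces = []
    for area, frac in enumerate(target_frac):
        members = train[train_area == area]
        n = int(round(frac * len(train)))
        pieces.append(members.sample(n, replace=True, random_state=0))
    return pd.concat(pieces, ignore_index=True)

area_cols = ["Wilderness_Area%d" % i for i in range(1, 5)]
resampled = bootstrap_to_match(train, test, area_cols)
```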


Cross-validation and competition results for lightly tuned classifiers

Here "condensed" refers to collapsing the wilderness and soil type fields from separate binary fields into a single field. For example, in the uncondensed data sets there is a field for each of the four wilderness areas, exactly one of which is non-zero for any given record. The condensed data set instead has a single wilderness field with an integer value between 1 and 4. In practice this made the data easier to explore and somewhat improved computation time, but it generally gave worse results than running the classifiers on the uncondensed fields.
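
A minimal sketch of the condensing step, assuming the competition's column layout (`Wilderness_Area1`..`Wilderness_Area4` and `Soil_Type1`..`Soil_Type40`):

```python
def condense(df, prefix, n):
    """Collapse one-hot columns prefix1..prefix{n} into a single integer
    column valued 1..n (exactly one of the one-hot columns is set)."""
    cols = ["%s%d" % (prefix, i) for i in range(1, n + 1)]
    out = df.drop(columns=cols)
    out[prefix] = df[cols].values.argmax(axis=1) + 1
    return out

train_c = condense(condense(train, "Wilderness_Area", 4), "Soil_Type", 40)
```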

With condensed wilderness and soil type fields

Classifier Type                                  Cross-validation score  Kaggle score  Computation time (s)
Random Forest 100                                .869                    .75632        25.7
Random Forest 200                                .864                    .75137        37.4
Extra Trees 100                                  .878                    .76794        21.7
Nearest Neighbors (2 neighbors, not normalized)  .825                    .72061        26
Nearest Neighbors (1 neighbor, normalized)       .772                    -             -

With uncondensed wilderness and soil type fields

Classifier Type                                  Cross-validation score  Kaggle score  Computation time (s)
Extra Trees 100                                  .885                    .78604        49.3
Extra Trees 1000                                 .886                    .78807        153.6
Nearest Neighbors (2 neighbors, not normalized)  .839                    .71820        84.2
Nearest Neighbors (1 neighbor, normalized)       .766                    -             -

Otherwise Tuned Classifiers

Classifier Type                                Cross-validation score  Kaggle score  Computation time (s)
Condensed Extra Trees 100, max features 10     .886                    .78490        22.5
Uncondensed Extra Trees 1000, max features 10  .888                    .79071        161.7
Random Forest 500 binary classifiers into
  Label Spreading (kNN, n=5, alpha=.00001)     .822                    .74886        279.3
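
For reference, a sketch of how cross-validation scores like those above can be produced with scikit-learn. My exact fold setup isn't recorded here, so `cv=5` is an assumption:

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Separate features from labels (Id and Cover_Type per the Kaggle layout)
X = train.drop(columns=["Id", "Cover_Type"])
y = train["Cover_Type"]

clf = ExtraTreesClassifier(n_estimators=1000, max_features=10)
print(cross_val_score(clf, X, y, cv=5).mean())
```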

Confusion Matrix

This is the confusion matrix for a cross-validation using a random forest classifier. The confusion matrices for nearest neighbors and extra trees looked similar. From the chart it appears that the most common errors are confusing cover types 1 and 2 (spruce/fir and lodgepole pine), as well as 3 and 6 (ponderosa pine and Douglas fir).
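
A sketch of one way to produce such a matrix with scikit-learn's cross_val_predict (the chart above may have been generated differently):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

X = train.drop(columns=["Id", "Cover_Type"])
y = train["Cover_Type"]

# Cross-validated predictions give one confusion matrix over the whole
# training set, rather than one matrix per fold
pred = cross_val_predict(RandomForestClassifier(n_estimators=100), X, y, cv=5)
print(confusion_matrix(y, pred))
```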

The Kaggle competition is now closed (though the public leaderboard had not been finalized at the time of writing).

While I attempted a number of techniques beyond what I documented above, the main score improvement came from a little feature engineering: I created a couple of compound fields to use with the random forest type classifiers. (I got the idea for the feature engineering, and some of the specific fields I used, from users in a thread on the Kaggle forum.) Those features contributed to a leaderboard score of .81563, which before final leaderboard verification is good enough for 58th out of 1,694 entrants.
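
Since the exact fields came from the forum thread, the following is only a hypothetical reconstruction of the kind of compound feature I mean, not necessarily the ones I used:

```python
import numpy as np

def add_compound_features(df):
    """Illustrative compound fields combining the existing distance columns;
    the exact formulas here are hypothetical, not my actual features."""
    out = df.copy()
    # Elevation adjusted by vertical distance to surface water
    out["Elev_Minus_VDist"] = (out["Elevation"]
                               - out["Vertical_Distance_To_Hydrology"])
    # Straight-line distance to water from its horizontal/vertical components
    out["Dist_To_Hydrology"] = np.hypot(out["Horizontal_Distance_To_Hydrology"],
                                        out["Vertical_Distance_To_Hydrology"])
    # Combined proximity to water and fire points
    out["Hydro_Plus_Fire"] = (out["Horizontal_Distance_To_Hydrology"]
                              + out["Horizontal_Distance_To_Fire_Points"])
    return out
```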

Given more time and different equipment, I would have liked to try deep neural networks on this problem; my impression is that some of the highest scores on the leaderboard came from that approach. Deep learning libraries benefit greatly from heavily parallel processing (e.g. using the Theano library with a GPU). My computer is solid but not spectacular, and unfortunately my graphics card does not appear to be easily supported by those libraries. It might have been interesting to get something running on a cloud computing service, but I decided to put that time into a side project instead.

Much of the time I spent working with the forest cover data ended up going to that side project (which may or may not have contributed to my results in the competition, but which I feel is of interest regardless). That project is still ongoing, and I have started writing it up on my blog, data, naturally.