**Kaggle Statoil/C-CORE Iceberg Classifier Challenge: Ship or Iceberg, Can You Decide From Space?**

**Contents:**

I. Goals

II. The Challenge

III. Our Approach

A. Starting with the Basics: Logistic Regression

B. Convolutional Neural Networks (CNN)

C. The Key: Ensemble Modeling

IV. Discussion & Conclusions

V. Appendix

# I. Goals

We had two main goals going into this competition. First, **we wanted to apply our academic understanding of machine learning to real world problems**. Second, **we wanted to explore the opportunities in machine learning and data science offered by Kaggle.com.**

# II. The Challenge

We recently competed in the Statoil/C-CORE Iceberg Classifier Challenge hosted on Kaggle. Competitors were asked to classify images as containing either a ship or an iceberg by predicting the probability that each image contained an iceberg. Submissions were ranked by the log loss of their predictions on a provided testing set. During the competition, the public leaderboard showed the log loss computed on 20% of the testing set. At the end of the competition, a private leaderboard computed on the other 80% of the testing set determined the winners. Overall, this was a classic supervised binary image classification challenge, and it lasted three months.
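Since the ranking depends entirely on log loss, it helps to see how the metric behaves. The snippet below is a minimal sketch of binary log loss in NumPy; the function name and the clipping constant `eps` are our own choices for illustration, not part of the competition's scoring code.

```python
import numpy as np

def log_loss(y_true, p_iceberg, eps=1e-15):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    # Clip so that a fully confident prediction never produces log(0).
    p = np.clip(np.asarray(p_iceberg, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Predicting 0.5 for every image gives ln(2), roughly 0.693.
uninformed = log_loss([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])

# Confident but wrong predictions are penalized heavily.
confident_wrong = log_loss([1, 0], [0.01, 0.99])
```

Lower is better, and the heavy penalty for confident mistakes is why well-calibrated probabilities matter more here than raw accuracy.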

All of the images in the competition dataset were taken by a Sentinel-1 satellite 600 kilometers above Earth (shown in the Figure above), which is regularly used to monitor land and ocean. Each image has two channels, one per radar polarization: HH (band 1: transmit and receive horizontally) and HV (band 2: transmit horizontally and receive vertically). Each image is 75×75 pixels, with pixel intensity values in decibels (dB). We were also given, for each image, the incidence angle at which it was taken. Example satellite images are shown in the Figure below.

# III. Our Approach

### A. Starting with the Basics: Logistic Regression

Because the training set only included 1604 images, we decided that models with larger hypothesis spaces, like neural networks, would be too susceptible to overfitting the data, so we limited our options to simpler models, like support vector machines and logistic regression. Logistic regression seemed to be the most straightforward, as it can produce probabilities of the required form with little to no modifications.

The first feature we devised was the “boxiness” of the objects. This feature was created by dividing the area of the object (iceberg or ship) by the area of the tightest box that fits around it, where the box area is the product of the object's maximum extents in the horizontal and vertical dimensions. The idea was that ships would have a boxier, more regular shape than icebergs, so the boxiness metric would be closer to 1 for ships and closer to 0 for icebergs. However, when we looked at the distributions of icebergs and ships in the training set against this metric, they were not separated nearly as well as we had expected.
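As a rough illustration of the boxiness idea, here is a minimal sketch in NumPy. It assumes the object has already been segmented into a boolean mask. The `object_mask` helper and its `n_sigma` threshold are hypothetical stand-ins for that step and are not necessarily how we segmented the images.

```python
import numpy as np

def boxiness(mask):
    """Object area divided by the area of its tightest axis-aligned bounding box."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return 0.0
    rows = np.flatnonzero(mask.any(axis=1))  # rows that contain object pixels
    cols = np.flatnonzero(mask.any(axis=0))  # columns that contain object pixels
    height = rows[-1] - rows[0] + 1
    width = cols[-1] - cols[0] + 1
    return float(mask.sum() / (height * width))

def object_mask(band, n_sigma=2.0):
    """Hypothetical segmentation: keep pixels well above the background level."""
    band = np.asarray(band, dtype=float)
    return band > band.mean() + n_sigma * band.std()

# A filled rectangle fills its bounding box completely.
rect = np.zeros((75, 75), dtype=bool)
rect[30:40, 20:45] = True

# A diamond fills only about half of its bounding box.
yy, xx = np.mgrid[:75, :75]
diamond = (np.abs(yy - 37) + np.abs(xx - 37)) <= 10

rect_score = boxiness(rect)
diamond_score = boxiness(diamond)
```

On these toy shapes the rectangle scores exactly 1 and the diamond roughly 0.5, which is the kind of separation we had hoped to see between ships and icebergs.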

Next, we looked at some less creative features, like the minimum and maximum pixel values and the incidence angle. Of these, the best by far was the maximum pixel value, which separated the distributions better than the “boxiness” metric did. Below is a Figure showing the distributions of the maximum pixel values for band 2. The ships tend to have higher maximum pixel intensity values than the icebergs.

We then built a logistic classifier using scikit-learn’s `LogisticRegression` class along with these features. However, when we submitted our predictions to the leaderboard, we received a rather disappointing log loss score of about 0.4, which was substantially worse than the best scoring kernel posted on Kaggle (at the time, this was a score of 0.1798). This point was the most difficult part of the competition for us. We did not understand what we could do better or what we were missing, so we spent the next two weeks or so reading research papers about image classification as it pertained to our challenge and keeping up with the discussions and kernels on Kaggle.
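For readers who want to see what this kind of baseline looks like, here is a minimal sketch. The data is synthetic, the `simple_features` helper and its exact feature set are assumptions made for illustration, and the scaling step is our own addition rather than a record of what we submitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def simple_features(band_1, band_2, inc_angle):
    """Per-image features: max and min of each band, plus the incidence angle."""
    b1 = band_1.reshape(len(band_1), -1)
    b2 = band_2.reshape(len(band_2), -1)
    return np.column_stack(
        [b1.max(axis=1), b1.min(axis=1), b2.max(axis=1), b2.min(axis=1), inc_angle]
    )

# Synthetic stand-in for the competition data (1 = iceberg, 0 = ship).
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
band_1 = rng.normal(-20.0, 3.0, size=(n, 75, 75))
band_2 = rng.normal(-25.0, 3.0, size=(n, 75, 75))
band_2[y == 0, 37, 37] += 30.0  # give synthetic ships a brighter peak in band 2
inc_angle = rng.uniform(30.0, 45.0, size=n)

X = simple_features(band_1, band_2, inc_angle)

# Standardize the features, then fit the logistic regression.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

# Column 1 of predict_proba is the probability of class 1 (iceberg).
p_iceberg = clf.predict_proba(X)[:, 1]
```

The second column of `predict_proba` is already in the form the competition asks for, which is what made logistic regression such a convenient starting point.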

The most influential source that we found during this period was the paper on using Convolutional Neural Networks (CNN) for image classification that unveiled the famous AlexNet. A Figure of the AlexNet architecture can be seen below. In this paper, the authors recount their astounding result of achieving far better performance than any prior model on the ImageNet dataset in the Large Scale Visual Recognition Challenge (ILSVRC). They achieved this state-of-the-art result using an 8-layer CNN (5 of the layers convolutional) with 60 million parameters and 650,000 neurons. This paper, together with the large number of high performing CNNs among the competition kernels on Kaggle (such as the series of kernels produced by TheGruffalo and its various adaptations), convinced us that we had been hasty in our initial bias against neural networks for this competition and that we should redirect our efforts towards implementing a CNN.

### B. Convolutional Neural Networks (CNN)

Our first attempt at a CNN was based on TheGruffalo’s original post. We added and removed a few layers, made some changes to the parameters in the Keras functions, and played with the dropout parameter, but, for the most part, we left the core model unchanged. Our CNN architecture can be viewed in the code listed in the Appendix. It was good enough, so we focused our attention on deploying the techniques discussed in the AlexNet paper.

One of those techniques was data augmentation, which is the practice of synthetically increasing the size of the training set by applying small transformations to the existing training data. In the case of images, some common transformations are flipping them horizontally, blurring them, and cropping them. An example of this process can be seen in the Figure below. We began by tinkering with the transformations in imgaug. However, only a few of these (horizontal flips, vertical flips, and light Gaussian blurring) resulted in an improvement, although we did not try every permutation of the 25 available. Our best log loss score with these augmentations was 0.1792. For reference, the log loss score with our base model was 0.21.
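To make the idea concrete, the sketch below applies the two flips using plain NumPy rather than imgaug, and it leaves out the Gaussian blur. The function name and the assumed array layout `(n_images, height, width, channels)` are illustrative choices, not our competition code.

```python
import numpy as np

def augment_with_flips(images, labels):
    """Return the originals plus horizontally and vertically flipped copies."""
    images = np.asarray(images)
    labels = np.asarray(labels)
    flipped_lr = images[:, :, ::-1, :]  # mirror along the width axis
    flipped_ud = images[:, ::-1, :, :]  # mirror along the height axis
    aug_images = np.concatenate([images, flipped_lr, flipped_ud], axis=0)
    # A flipped ship is still a ship, so the labels are simply repeated.
    aug_labels = np.concatenate([labels, labels, labels], axis=0)
    return aug_images, aug_labels

# Four random two-band "images" as stand-ins for real data.
rng = np.random.default_rng(0)
images = rng.normal(size=(4, 75, 75, 2))
labels = np.array([0, 1, 1, 0])

aug_images, aug_labels = augment_with_flips(images, labels)
```

This triples the number of training examples without collecting any new data, and the flips are safe here because neither class has a preferred orientation in the image.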

Another way of augmenting the data is via pseudo-labeling. This is a semi-supervised learning technique that is popular whenever you have a relatively large amount of unlabeled data. The core idea is that you can use your trained model to label the unlabeled data (that is, to pseudo-label it) and then use this pseudo-labeled data as additional training data. Our testing set consisted of 8424 unlabeled images, while our training set consisted of only 1604 labeled images, so we had a situation where pseudo-labeling should, at least in theory, work. We therefore decided to pseudo-label the testing set to further augment our data.

Our attempts at pseudo-labeling were disappointing. We tried to implement it on three separate occasions, and each time it produced a worse score. First, we attempted to pseudo-label with the base CNN and with no augmented images, which resulted in a log loss score of 0.3102. Then, we kept only the confident pseudo-labeled data, meaning the images with predicted probabilities greater than 0.95 or less than 0.05, and added this data to the training data augmented via the imgaug transformations. This produced a score of 0.1802, which was still slightly worse than the original score of 0.1792. Finally, we tried to use the best scoring kernel to produce the pseudo-labeled data. This produced a rather embarrassing score, as it was only marginally better than the CNN without this data and quite a bit worse than the kernel from which we produced the pseudo-labels.
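The confidence filter from our second attempt can be sketched as follows. The function name is hypothetical, and turning the kept probabilities into hard 0/1 labels is an assumption made for illustration.

```python
import numpy as np

def confident_pseudo_labels(p_iceberg, lo=0.05, hi=0.95):
    """Select confidently predicted test images and assign them hard labels."""
    p = np.asarray(p_iceberg, dtype=float)
    keep = (p > hi) | (p < lo)            # confident iceberg or confident ship
    idx = np.flatnonzero(keep)            # positions of the kept test images
    labels = (p[idx] > 0.5).astype(int)   # 1 = iceberg, 0 = ship
    return idx, labels

# Toy predicted probabilities for five test images.
p_test = np.array([0.99, 0.50, 0.02, 0.94, 0.97])
idx, pseudo_y = confident_pseudo_labels(p_test)
# The images at positions idx, with labels pseudo_y, would then be
# appended to the labeled training set before retraining.
```

Note that the filter only removes uncertain predictions. It cannot correct confident mistakes, which may be part of why pseudo-labeling did not help us.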

This inspired us to look into what made the best scoring kernel better than our current best model. Said kernel used an ensemble learning technique known as stacking. This then motivated us to do a deep dive into the research surrounding stacking, finding many useful blog posts pertaining to the topic along the way by Anisotropic, MLWave, and Ben Gorman.

For binary image classification, stacking combines the prediction probability distributions of multiple models to harness the strengths of a diverse set of models and to leverage a wide variety of learned image features. For example, a CNN may be predisposed to learning feature A with high confidence but lacks the ability to detect feature B, whereas a support vector machine (SVM) may have the ability to detect feature B with high confidence but totally misses feature A. An ensemble of these two models would be able to detect features A and B with high confidence, thus leading to a better overall model for classifying the images. In practice for Kaggle, this often leads to a collaboration scheme for late-stage team combinations of highly tuned single models among the top competitors.

### C. The Key: Ensemble Modeling

With this newfound understanding of stacking, we realized that we could still make use of our original logistic regression model and possibly a few other simple models. So we stacked our best CNN model with several other CNN models (e.g. DenseNet and those included in the best public stacking kernel) as well as with our original logistic regression model. We found the best results using a “MinMax + BestBase” stacking type. For this, we chose our best CNN model to be our base: when no model in the stack was confident (all iceberg classification probabilities between 0.1 and 0.9), we defaulted to the prediction probability from the base model. We then imposed a MinMax cutoff such that if a model in the stack was very confident in its prediction of an iceberg (classification probability > 0.9), we chose that model’s prediction probability, assuming that the model had picked up on some real signal in the image features. Likewise, if a model in the stack was very confident in its prediction of a ship (classification probability < 0.1), we chose that model’s prediction probability for the same reason. With this stacking strategy among our top single models, we reached our best log loss score of 0.1199, putting us in the top 4% of the competition on the public leaderboard. Our code for our best stacking process can be found in the Appendix as well as in our GitHub repository for this challenge. Furthermore, a Figure of our best stacking process can be seen below, visualized through the classification probability distributions of the various models and the MinMax Stack.
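The rule described above can be sketched in a few lines of NumPy. This is an illustration of the idea rather than our submitted code. The function name is hypothetical, and the handling of images where one model is confident of an iceberg while another is confident of a ship (falling back to the base model) is an assumption, since the description above does not cover that case.

```python
import numpy as np

def minmax_bestbase(preds, base_col, lo=0.1, hi=0.9):
    """Combine per-model iceberg probabilities with a MinMax + BestBase rule.

    preds has shape (n_images, n_models); base_col indexes the base model.
    """
    preds = np.asarray(preds, dtype=float)
    p_max = preds.max(axis=1)
    p_min = preds.min(axis=1)
    base = preds[:, base_col]
    any_iceberg = p_max > hi  # some model is very confident of an iceberg
    any_ship = p_min < lo     # some model is very confident of a ship
    # Assumption: if the models confidently disagree, keep the base prediction.
    out = np.where(any_iceberg & ~any_ship, p_max, base)
    out = np.where(any_ship & ~any_iceberg, p_min, out)
    return out

# Rows are images, columns are models; column 0 is the base model.
preds = np.array([
    [0.60, 0.95, 0.70],  # one model confident of an iceberg -> take the max
    [0.30, 0.20, 0.05],  # one model confident of a ship     -> take the min
    [0.55, 0.40, 0.65],  # nobody confident                  -> take the base
    [0.50, 0.03, 0.95],  # confident disagreement            -> take the base
])
stacked = minmax_bestbase(preds, base_col=0)
```

Because log loss rewards confident correct predictions, taking the most extreme confident value pays off when it is right, and the base model limits the damage on images where no model is sure.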

Despite our successful use of ensemble methods in this Kaggle challenge and their countless other uses in Kaggle challenges at large, ensemble methods have some notable downsides. Top Kaggle solutions often contain ensembles of 100+ models, which raises the issue of practicality. More often than not, real world applications of machine learning require high throughput and short training and inference times. From this standpoint, 100+ model ensembles are far from ideal, requiring many aggregate GPU hours and a large amount of additional complexity for small gains in performance. This issue has been raised by several notable Kagglers, such as Ben Gorman and Andres Torrubia.

# IV. Discussion & Conclusions

In the end, we scored in the top 4% on the public leaderboard and the top 30% on the private leaderboard. Unfortunately, we overfit the public data quite a bit. However, one of our early attempts at stacking would have landed us in the top 6% of the private leaderboard as well, had we chosen it as one of our final submissions. We had decided to submit only one stacked result and our best single model because we were unsure how risky stacking would prove once the private leaderboard was revealed.

The top scorers, including David, the competition winner, did more exploratory data analysis (EDA) than we did and discovered some very interesting things about the incidence angle value. Throughout the competition, we were confused about the role that incidence angle could play, so it would have been natural to make some plots of the data versus incidence angle. But we didn’t, because we rushed through the EDA phase to get to the modeling phase. Next time we will spend more time there.

We also worked on a week by week basis, each time deciding what we’d do the next week. But in some cases, this caused us to miss the forest for the trees, getting lost in implementing some detail that did not have enough promise to warrant the time invested. We would have benefited from having monthly goals and deadlines to keep us from getting off track. For instance, we could have set our goal for month one to be to do EDA thoroughly and get our validation procedures in place. This would have made us comfortable doing more EDA and saved us from many headaches later.

Overall, this was a great experience. We started with the basic model of logistic regression, which involved intensive feature engineering. Then, we learned about the power of neural networks. In particular, we learned of the vast potential of convolutional neural networks in an image classification setting. Finally, we came full circle and realized that we could combine our models via ensemble methods to take the best predictions from both.

# V. Appendix

**GitHub:** https://github.com/bmowry06/glassDoorAnalysis

**Keras CNN:**

**MinMax + BestBase Stacking:**