Splitting Data into Training and Test Data Sets

When image matrices and their associated labels are stored in two separate arrays, it can be awkward to split the data randomly into training and test sets while keeping each image paired with its label. Scikit-learn makes this split relatively easy. The example below uses data from the Statoil/C-CORE Iceberg Classifier Challenge Kaggle competition.

Initial setup:

Splitting Data into Training and Test Data Sets 1.png

Splitting data into a train and test set:

Splitting Data into Training and Test Data Sets 2.png
Splitting Data into Training and Test Data Sets 3.png
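
The code in the screenshots above is not reproduced here, so the following is only a minimal sketch of the same idea using scikit-learn's train_test_split. The random placeholder arrays, the 80/20 split, the random_state value, and the use of stratify are all assumptions for illustration, not the settings from the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the competition arrays (shapes are illustrative).
rng = np.random.default_rng(0)
images = rng.random((100, 75, 75, 2))   # 100 images, 75x75 pixels, 2 bands
labels = rng.integers(0, 2, size=100)   # one binary label per image

# Passing both arrays in a single call applies the same shuffled indices to each,
# so every image stays paired with its label after the split.
X_train, X_test, y_train, y_test = train_test_split(
    images,
    labels,
    test_size=0.2,      # assumed 80/20 split
    random_state=42,    # fixed seed so the split is reproducible
    stratify=labels,    # keep the class balance similar in both sets
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Because the images and labels are split in one call, there is no need to shuffle index lists by hand or to merge the two arrays first.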

Kaggle - Digit Recognizer - CNN 99.4% Accuracy

Imaged below is the basic Convolutional Neural Network (CNN) for the Kaggle - Digit Recognizer competition.

Digit Recognizer - Convolutional Neural Network

Below is an overview of ensembling the CNN model. Executing the fit_model function 5 times resulted in a total of 125 epochs.

Digit Recognizer - CNN-Ensemble.png

Predictions were then run against the test set with each model, and the predictions for each image were averaged.

Digit Recognizer - CNN - Predict.png
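
The averaging step can be sketched with NumPy alone. This is an illustration rather than the notebook's code: the function name average_predictions and the small probability arrays are hypothetical, and it assumes each model outputs one row of class probabilities per image.

```python
import numpy as np

def average_predictions(per_model_probs):
    """Average class probabilities across models and pick the top class per image."""
    stacked = np.stack(per_model_probs)   # shape: (n_models, n_images, n_classes)
    mean_probs = stacked.mean(axis=0)     # shape: (n_images, n_classes)
    return mean_probs, mean_probs.argmax(axis=1)

# Hypothetical outputs from three models for two images and three classes.
model_a = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
model_b = np.array([[0.5, 0.4, 0.1], [0.2, 0.5, 0.3]])
model_c = np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])

mean_probs, predicted = average_predictions([model_a, model_b, model_c])
print(predicted)
```

In this toy data the second model prefers class 1 for the second image, but the averaged probabilities still favor class 2, which is the kind of disagreement an ensemble is meant to smooth out.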

After exporting and submitting results, the basic CNN achieved 99.4% accuracy.

Digit Recognizer _ Kaggle-Submission.png

Kaggle - Invasive Species Monitoring - Ensemble with ROC of 0.95532

An ensemble (bucket of models) of multiple models was used to achieve a ROC score of 0.95532.

2017-09-25 Invasive Species Monitoring _ Kaggle.png

Below is an example of creating 4 models that were trained for 100 epochs each. The last section of code saves the weights for each model in the set. 

Invasive Species - CNN-BN-Ensemble.png

Each model was then used to predict the test image set. The mean prediction for each image was then used for the Kaggle Invasive Species Monitoring competition submission.
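
As a rough sketch of that last step, the snippet below averages per-image probabilities from four models and writes them to a CSV file. The write_submission helper, the example probabilities, and the name/invasive column headers are assumptions made for illustration; check the competition's sample submission file for the exact format.

```python
import csv
import tempfile
from pathlib import Path

import numpy as np

def write_submission(image_names, per_model_probs, out_path):
    """Average each image's predicted probability across models and save it as CSV."""
    mean_probs = np.mean(np.stack(per_model_probs), axis=0)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "invasive"])   # assumed column headers
        for name, prob in zip(image_names, mean_probs):
            writer.writerow([name, f"{prob:.5f}"])
    return mean_probs

# Hypothetical probabilities from four models for three test images.
names = [1, 2, 3]
per_model = [
    np.array([0.9, 0.2, 0.6]),
    np.array([0.8, 0.1, 0.4]),
    np.array([0.7, 0.3, 0.5]),
    np.array([1.0, 0.2, 0.5]),
]

# Written to a temporary directory here so the sketch leaves no files behind.
out_path = Path(tempfile.mkdtemp()) / "submission.csv"
mean_probs = write_submission(names, per_model, out_path)
print(out_path.read_text())
```

Submitting the mean probability instead of a thresholded 0/1 label preserves the ranking of the images, which is what a ROC-based score depends on.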

Kaggle - Leaf Classification: Directory Structure and Moving Files

A Jupyter notebook for setting up the directory structure for Kaggle's Leaf Classification competition has been published. The notebook walks through the process for:

  • Unpacking/Unzipping the competition files
  • Creating directory structure based off the train.csv data set
  • Moving images to appropriate train, valid, and test directories.
    • The train and valid directories contain directories specific to each leaf species
Directory Structure and Moving Files _ Kaggle.png
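
The steps listed above can be sketched with the Python standard library. This is not the published notebook's code: the organize_images function is hypothetical, and it assumes that train.csv has id and species columns, that images are named <id>.jpg, that any image not listed in train.csv belongs to the test set, and that the validation set is chosen by a random 20% split.

```python
import csv
import random
import shutil
from pathlib import Path

def organize_images(train_csv, images_dir, out_dir, valid_fraction=0.2, seed=0):
    """Move labelled images into train/<species>/ or valid/<species>/,
    then move every remaining image into test/."""
    images_dir, out_dir = Path(images_dir), Path(out_dir)
    rng = random.Random(seed)   # seeded so the train/valid split is repeatable

    with open(train_csv, newline="") as f:
        rows = list(csv.DictReader(f))

    for row in rows:
        # Assumed rule: send roughly valid_fraction of labelled images to valid/.
        split = "valid" if rng.random() < valid_fraction else "train"
        dest = out_dir / split / row["species"]
        dest.mkdir(parents=True, exist_ok=True)
        shutil.move(str(images_dir / f"{row['id']}.jpg"), str(dest))

    # Whatever is left in images_dir was not listed in train.csv.
    test_dir = out_dir / "test"
    test_dir.mkdir(parents=True, exist_ok=True)
    for leftover in sorted(images_dir.glob("*.jpg")):
        shutil.move(str(leftover), str(test_dir))

# Example call (paths are placeholders):
# organize_images("train.csv", "images", "data")
```

Keeping one directory per species under train and valid matches the layout that directory-based image loaders typically expect, where the folder name serves as the class label.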

Installing Kaggle CLI and Competition Data Download

In a terminal, enter the following commands to install the unofficial Kaggle CLI (Command Line Interface) and download the competition files:

  1. pip install kaggle-cli
  2. kg config -u [Kaggle username] -p [Kaggle password]
  3. cd Documents/nbs/
  4. kg download -c [Kaggle competition name]

The purpose of each command:

  1. Installs the Kaggle CLI
  2. Sets the username and password for the Kaggle CLI. This is why the username and password parameters do not need to be defined in step 4.
  3. Changes the working directory to the location where the competition files will be stored
  4. Downloads the competition files. IMPORTANT: the competition rules must be accepted before the files can be downloaded. The accept option is at the end of the rules section of the competition on the Kaggle website

Example of installing Kaggle CLI and downloading Dogs vs. Cats competition files:

  1. pip install kaggle-cli
  2. kg config -u KaggleUser -p P@ssw0rd123
  3. cd Documents/nbs/
  4. kg download -c 'dogs-vs-cats'