## Artificial Neural Network (ANN) 7 - Overfitting & Regularization

Continued from Artificial Neural Network (ANN) 6 - Training via BFGS where we trained our neural network via BFGS

We saw our neural network gave a pretty good predictions of our test score based on how many hours we slept, and how many hours we studied the night before.

In this article, we want to check how well our model reflects the real world data.

We want our model to fit the signal but not the noise so that we should be able to avoid overfitting.

picture source : Python machine learning by Sebastian Raschka

First, we'll work on diagnosing overfitting, and then we'll work on fixing it.

Let's start with an input data for training our neural network:

Here is the plot for our input data, scores vs hours of sleep/study:

To train our model, we need to normalize training data:

Let's start training our network with the normalized data set:

The cost function ($J$) plot vs iterations looks like this:

Now we want to generate more data using "numpy.linspace()":

Contour for the newly generated data looks like this:

From the picture, we can see our model is overfitting, but how do we know for sure?

In general, we want to split our data into 2 portions: **training** and **testing**. We won't touch our testing data while training the model, and only use it to see how we're doing since our testing data is a simulation of the real world.

We may want to modify **Trainer** class a bit to check testing error during training:

Let's train our model with the new data:

We can plot the error on our training and testing sets as we train our model and identify the exact point at which overfitting begins.

As we can see from the picture above, our cost function ($\color{green}{J}$) with real data(test) soars around it=125 while the $\color{blue}{J}$ with training data continues to become smaller and smaller.

Now we know we have overfitting issue, but how do we fix it?

A simple rule of thumb is that we should have at least **10** times as many examples as the degrees for freedom in our model. For us, since we have 9 weights that can change, we would need 90 observations, which we certainly don't have.

One of th popular and effective ways of mitigating the overfitting issue is to use a technique called **regularization**.

One way to implement regularization is to add a term to our cost function that **penalizes** overly complex models.

A simple, but effective way to do this is to add together the square of our weights to our cost function so that models with larger magnitudes of weights, cost more.

We'll need to normalize the other part of our cost function to ensure that our ratio of the two error terms does not change with respect to the number of examples.

We're going to introduce a regularization hyper parameter, $\lambda$, that will allow us to tune the relative cost. So, higher values of lambda will impose bigger penalties for high model complexity.

We need to make changes to **costFunction** and **costFunctionPrime** as well as the **__init__()**:

#New complete class, with changes: class NeuralNetwork(object): def __init__(self, Lambda=0): #Define Hyperparameters self.inputLayerSize = 2 self.outputLayerSize = 1 self.hiddenLayerSize = 3 #Weights (parameters) self.W1 = np.random.randn(self.inputLayerSize,self.hiddenLayerSize) self.W2 = np.random.randn(self.hiddenLayerSize,self.outputLayerSize) #Regularization Parameter: self.Lambda = Lambda def forwardPropagation(self, X): #Propogate inputs though network self.z2 = np.dot(X, self.W1) self.a2 = self.sigmoid(self.z2) self.z3 = np.dot(self.a2, self.W2) yHat = self.sigmoid(self.z3) return yHat def sigmoid(self, z): #Apply sigmoid activation function to scalar, vector, or matrix return 1/(1+np.exp(-z)) def sigmoidPrime(self,z): #Gradient of sigmoid return np.exp(-z)/((1+np.exp(-z))**2) def costFunction(self, X, y): #Compute cost for given X,y, use weights already stored in class. self.yHat = self.forwardPropagation(X) J = 0.5*sum((y-self.yHat)**2)/X.shape[0] + (self.Lambda/2)*(np.sum(self.W1**2)+np.sum(self.W2**2)) return J def costFunctionPrime(self, X, y): #Compute derivative with respect to W and W2 for a given X and y: self.yHat = self.forwardPropagation(X) delta3 = np.multiply(-(y-self.yHat), self.sigmoidPrime(self.z3)) #Add gradient of regularization term: dJdW2 = np.dot(self.a2.T, delta3)/X.shape[0] + self.Lambda*self.W2 delta2 = np.dot(delta3, self.W2.T)*self.sigmoidPrime(self.z2) #Add gradient of regularization term: dJdW1 = np.dot(X.T, delta2)/X.shape[0] + self.Lambda*self.W1 return dJdW1, dJdW2 #Helper functions for interacting with other methods/classes def getParams(self): #Get W1 and W2 Rolled into vector: params = np.concatenate((self.W1.ravel(), self.W2.ravel())) return params def setParams(self, params): #Set W1 and W2 using single parameter vector: W1_start = 0 W1_end = self.hiddenLayerSize*self.inputLayerSize self.W1 = np.reshape(params[W1_start:W1_end], \ (self.inputLayerSize, self.hiddenLayerSize)) W2_end = W1_end + self.hiddenLayerSize*self.outputLayerSize self.W2 = np.reshape(params[W1_end:W2_end], \ (self.hiddenLayerSize, self.outputLayerSize)) def computeGradients(self, X, y): dJdW1, dJdW2 = self.costFunctionPrime(X, y) return np.concatenate((dJdW1.ravel(), dJdW2.ravel()))

Since we made some changes, let's make sure our gradients are correct after making those changes:

Let's train our model again.

Here is the data set we're going to use:

Now our training and testing errors are much closer, which is the indication of the success in reducing the overfit on this dataset.

Let's see our contour plot for test scores against sleep/study hours:

3-D plot:

While we see that the fit is still good, but our model is no longer that interested in the fitting accuracy to our data.

To reduce the overfitting further, we may want to increase the regularization parameter, $\lambda$.

Here is the plot of the 6 weights ($W^{(1)}$) for hidden layer and 3 weights ($W^{(2)}$) for output layer of our neural network:

Note: this picture for weights update was plotted later from a separated run. So, it does not reflect the pictures in the previous section though it shows the general trend how the weight are updated during the iterations.

To get the $W$, we need to modify the lines highlighted in our **Trainer** class as shown below:

Please visit Github Artificial-Neural-Networks-with-Jupyter .

**Next**:

# Machine Learning with scikit-learn

scikit-learn installation

scikit-learn : Features and feature extraction - iris dataset

scikit-learn : Machine Learning Quick Preview

scikit-learn : Data Preprocessing I - Missing / Categorical data

scikit-learn : Data Preprocessing II - Partitioning a dataset / Feature scaling / Feature Selection / Regularization

scikit-learn : Data Preprocessing III - Dimensionality reduction vis Sequential feature selection / Assessing feature importance via random forests

Data Compression via Dimensionality Reduction I - Principal component analysis (PCA)

scikit-learn : Data Compression via Dimensionality Reduction II - Linear Discriminant Analysis (LDA)

scikit-learn : Data Compression via Dimensionality Reduction III - Nonlinear mappings via kernel principal component (KPCA) analysis

scikit-learn : Logistic Regression, Overfitting & regularization

scikit-learn : Supervised Learning & Unsupervised Learning - e.g. Unsupervised PCA dimensionality reduction with iris dataset

scikit-learn : Unsupervised_Learning - KMeans clustering with iris dataset

scikit-learn : Linearly Separable Data - Linear Model & (Gaussian) radial basis function kernel (RBF kernel)

scikit-learn : Decision Tree Learning I - Entropy, Gini, and Information Gain

scikit-learn : Decision Tree Learning II - Constructing the Decision Tree

scikit-learn : Random Decision Forests Classification

scikit-learn : Support Vector Machines (SVM)

scikit-learn : Support Vector Machines (SVM) II

Flask with Embedded Machine Learning I : Serializing with pickle and DB setup

Flask with Embedded Machine Learning II : Basic Flask App

Flask with Embedded Machine Learning III : Embedding Classifier

Flask with Embedded Machine Learning IV : Deploy

Flask with Embedded Machine Learning V : Updating the classifier

scikit-learn : Sample of a spam comment filter using SVM - classifying a good one or a bad one

### Machine learning algorithms and concepts

Batch gradient descent algorithmSingle Layer Neural Network - Perceptron model on the Iris dataset using Heaviside step activation function

Batch gradient descent versus stochastic gradient descent

Single Layer Neural Network - Adaptive Linear Neuron using linear (identity) activation function with batch gradient descent method

Single Layer Neural Network : Adaptive Linear Neuron using linear (identity) activation function with stochastic gradient descent (SGD)

Logistic Regression

VC (Vapnik-Chervonenkis) Dimension and Shatter

Bias-variance tradeoff

Maximum Likelihood Estimation (MLE)

Neural Networks with backpropagation for XOR using one hidden layer

minHash

tf-idf weight

Natural Language Processing (NLP): Sentiment Analysis I (IMDb & bag-of-words)

Natural Language Processing (NLP): Sentiment Analysis II (tokenization, stemming, and stop words)

Natural Language Processing (NLP): Sentiment Analysis III (training & cross validation)

Natural Language Processing (NLP): Sentiment Analysis IV (out-of-core)

Locality-Sensitive Hashing (LSH) using Cosine Distance (Cosine Similarity)

### Artificial Neural Networks (ANN)

[Note] Sources are available at Github - Jupyter notebook files1. Introduction

2. Forward Propagation

3. Gradient Descent

4. Backpropagation of Errors

5. Checking gradient

6. Training via BFGS

7. Overfitting & Regularization

8. Deep Learning I : Image Recognition (Image uploading)

9. Deep Learning II : Image Recognition (Image classification)

10 - Deep Learning III : Deep Learning III : Theano, TensorFlow, and Keras

Ph.D. / Golden Gate Ave, San Francisco / Seoul National Univ / Carnegie Mellon / UC Berkeley / DevOps / Deep Learning / Visualization