## Single Layer Neural Network : Adaptive Linear Neuron using linear (identity) activation function with batch gradient method

In this tutorial, we'll learn another type of single-layer neural network (still this is also a **perceptron**) called **Adaline (Adaptive linear neuron)** rule (also known as the **Widrow-Hoff** rule).

The key difference between the Adaline rule (also known as the Widrow-Hoff rule) and Rosenblatt's perceptron is that the weights are updated based on a linear activation function rather than a unit step function like in the Perceptron model.

Perceptron

Adaptive linear neuron

The difference is that we're going to use the continuous valued output from the linear activation function to compute the model error and update the weights, rather than the binary class labels.

The perceptron algorithm enables the model automatically learn the optimal weight coefficients that are then multiplied with the input features in order to make the decision of whether a neuron fires or not.

In supervised learning and classification, such an algorithm could then be used to predict if a sample belonged to one class or the other.

In binary classifiers perceptron algorithm, we refer to our two classes as either 1 (positive class) or -1 (negative class).

In the context of neural networks, a **perceptron** is an **artificial neuron** using the **Heaviside step function** as the activation function.

The perceptron algorithm is also termed the **single-layer perceptron**, to distinguish it from a **multilayer perceptron**. As a linear classifier, the single-layer perceptron is the simplest **feedforward neural network**.

We can then define an **activation function** $\phi(z)$ that takes a linear combination of certain
input values $x$ and a corresponding **weight vector** $w$ , where $z$ is the so-called **net
input** ($z = w_1x_1 + ... + w_mx_m$).

For our case (binary), we can have:

$$\text if \quad \sum_{i=1}w_ix_i > \theta \quad \text then \quad \phi = 1 $$ $$\text else \quad \sum_{i=1}w_ix_i < \theta \quad \text then \quad \phi = -1 $$If we set $x_0=1$ and $w_0=-\theta$, we can have more compact form:

$$\text if \quad \sum_{i=0}w_ix_i > 0 \quad \text then \quad \phi(z) = 1 $$ $$\text else \quad \sum_{i=0}w_ix_i < 0 \quad \text then \quad \phi(z) = -1 $$ where $$z=\sum_{i=0}w_ix_i$$Or more compact form.

- Heaviside step activation function: $$\phi(z) = \begin{cases}1 & z > 0 \\ -1 & \text{otherwise} \end{cases}$$
- Linear activation function: $$\phi(z) = z$$

**treshold**or

**bias**

Update of each weight $w_j$ in the weight vector $w$ can be written as:

$$w_j : = w_j + \Delta w_j$$The value of $\Delta w_j$ , which is used to update the weight $w_j$, is calculated as the following:

- Heaviside step activation function: $$\Delta w_j=\eta(y^{(i)}-\hat y^{(i)})x_j^{(i)}$$
- Linear activation function: $$\Delta w_j=-\eta \frac{\partial J}{\partial w_j}$$ where $\eta$ is learning rate ($0.0 < \eta <1.0$), $y^{(i)}$ is the true class label of the i-th training sample, and $\hat y^{(i)}$ is the predicted class label.

One of the most critical tasks in supervised machine learning algorithms is to minimize cost function.

In the case
of Adaptive linear neuron, we can define the cost function ** J** to learn the weights as the Sum of
Squared Errors (SSE) between the calculated outcome and the true class label:

$$J(w)=\frac{1}{2}\sum_{i}(y^{(i)}-\phi(z^{(i)}))^2$$

COmpared with the **unit step** function, the advantages of this continuous **linear** activation function are:

- This cost function is
**differentiable**. So, the partial derivative of the SSE cost function with respect to the jth weight is: - Because it is convex, we can use a simple and powerful, optimization algorithm
called
**gradient descent**to find the**weights**that minimize our cost function to classify the samples in the Iris dataset.

$$\frac{\partial J}{\partial w_j}=-\sum_{i}(y^{(i)}-\phi(z^{(i)}))x_{j}^{(i)}$$

where $i$ is the sample # and $j$ is the feature # (dimension of a dataset).

Although the adaptive linear learning rule looks identical to the perceptron rule, the $\phi(z^{(i)})$ with $z^{(i)}=w^Tx^{(i)}$ is a real number and not an integer class label.

Also, the weight update is calculated based on all samples in the training set (instead of
updating the weights incrementally after each sample), which is why this approach
is also referred to as **batch** gradient descent.

We can minimize a cost function by taking a step into the opposite direction of a gradient that is calculated from the **whole training set**, and this is why this approach is also called as **batch gradient descent**.

Since the perceptron rule and **Adaptive Linear Neuron** are very similar, we can take the perceptron
implementation that we defined earlier and change the **fit** method so that the
weights are updated by **minimizing the cost function** via **gradient descent**.

Here is the source code:

import numpy as np class AdaptiveLinearNeuron(object): def __init__(self, rate = 0.01, niter = 10): self.rate = rate self.niter = niter def fit(self, X, y): """Fit training data X : Training vectors, X.shape : [#samples, #features] y : Target values, y.shape : [#samples] """ # weights self.weight = np.zeros(1 + X.shape[1]) # Number of misclassifications self.errors = [] # Cost function self.cost = [] for i in range(self.niter): output = self.net_input(X) errors = y - output self.weight[1:] += self.rate * X.T.dot(errors) self.weight[0] += self.rate * errors.sum() cost = (errors**2).sum() / 2.0 self.cost.append(cost) return self def net_input(self, X): """Calculate net input""" return np.dot(X, self.weight[1:]) + self.weight[0] def activation(self, X): """Compute linear activation""" return self.net_input(X) def predict(self, X): """Return class label after unit step""" return np.where(self.activation(X) >= 0.0, 1, -1) import matplotlib.pyplot as plt import numpy as np import pandas as pd df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None) y = df.iloc[0:100, 4].values y = np.where(y == 'Iris-setosa', -1, 1) X = df.iloc[0:100, [0, 2]].values fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 4)) # learning rate = 0.01 aln1 = AdaptiveLinearNeuron(0.01, 10).fit(X,y) ax[0].plot(range(1, len(aln1.cost) + 1), np.log10(aln1.cost), marker='o') ax[0].set_xlabel('Epochs') ax[0].set_ylabel('log(Sum-squared-error)') ax[0].set_title('Adaptive Linear Neuron - Learning rate 0.01') # learning rate = 0.01 aln2 = AdaptiveLinearNeuron(0.0001, 10).fit(X,y) ax[1].plot(range(1, len(aln2.cost) + 1), aln2.cost, marker='o') ax[1].set_xlabel('Epochs') ax[1].set_ylabel('Sum-squared-error') ax[1].set_title('Adaptive Linear Neuron - Learning rate 0.0001') plt.show()

As we can see in the resulting **cost function** plots below, we have two different
types of issues.

The left one shows what could happen if we choose a learning
rate that is too large. Instead of minimizing the cost function, the error becomes
larger in every epoch because we **overshoot** the global minimum.

On the other hand, we can see that the cost decreases for the plot on the right side. That's because we chose the learning rate $\eta=0.0001$ is so small that the algorithm would require a very large number of epochs to converge.

The following figure demonstrates how we change the value of a particular weight parameter to minimize the cost function $J$ (left). The figure on the right illustrates what happens if we choose a learning rate that is too large, we overshoot the global minimum:

Picture from "Python Machine Learning by Sebastian Raschka, 2015"

Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known asdata normalizationand is generally performed during the data preprocessing step.

**Gradient descent** is one of the many algorithms that benefit from feature scaling.

**standardization**, which gives our data the property of a

**standard normal distribution**.

In machine learning, we can handle various types of data, e.g. audio signals and pixel values for image data, and this data can include multiple dimensions.

Feature standardizationmakes the values of each feature in the data havezero-mean(when subtracting the mean in the enumerator) andunit-variance.

This method is widely used for normalization in many machine learning algorithms (e.g.,support vector machines,logistic regression, andneural networks).

This is typically done by calculatingstandard scores.

The general method of calculation is to determine the distribution mean and standard deviation for each feature. Next we subtract the mean from each feature. Then we divide the values (mean is already subtracted) of each feature by its standard deviation.

- from Feature scaling

So, to standardize the $j$-th feature, we just need to subtract the sample mean $\mu_j$ from every training sample and divide it by its standard deviation $sigma_j$:

$$ x_j^\prime=\frac {x_j-\mu_j}{\sigma_j}$$where $x_j$ is a vector consisting of the $j$-th feature values of all training samples $n$.

We can standardize by using the NumPy methods **mean** and **std**:

X_std = np.copy(X) X_std[:,0] = (X[:,0] - X[:,0].mean()) / X[:,0].std() X_std[:,1] = (X[:,1] - X[:,1].mean()) / X[:,1].std()

After the standardization, we will train the Linear model again using the not so small learning rate of $\eta = 0.01$:

Here is our new code for the two pictures above:

import matplotlib.pyplot as plt import numpy as np import pandas as pd from matplotlib.colors import ListedColormap class AdaptiveLinearNeuron(object): def __init__(self, rate = 0.01, niter = 10): self.rate = rate self.niter = niter def fit(self, X, y): """Fit training data X : Training vectors, X.shape : [#samples, #features] y : Target values, y.shape : [#samples] """ # weights self.weight = np.zeros(1 + X.shape[1]) # Number of misclassifications self.errors = [] # Cost function self.cost = [] for i in range(self.niter): output = self.net_input(X) errors = y - output self.weight[1:] += self.rate * X.T.dot(errors) self.weight[0] += self.rate * errors.sum() cost = (errors**2).sum() / 2.0 self.cost.append(cost) return self def net_input(self, X): """Calculate net input""" return np.dot(X, self.weight[1:]) + self.weight[0] def activation(self, X): """Compute linear activation""" return self.net_input(X) def predict(self, X): """Return class label after unit step""" return np.where(self.activation(X) >= 0.0, 1, -1) def plot_decision_regions(X, y, classifier, resolution=0.02): # setup marker generator and color map markers = ('s', 'x', 'o', '^', 'v') colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan') cmap = ListedColormap(colors[:len(np.unique(y))]) # plot the decision surface x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1 x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1 xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution), np.arange(x2_min, x2_max, resolution)) Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T) Z = Z.reshape(xx1.shape) plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap) plt.xlim(xx1.min(), xx1.max()) plt.ylim(xx2.min(), xx2.max()) # plot class samples for idx, cl in enumerate(np.unique(y)): plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1], alpha=0.8, c=cmap(idx), marker=markers[idx], label=cl) df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None) y = df.iloc[0:100, 4].values y = np.where(y == 'Iris-setosa', -1, 1) X = df.iloc[0:100, [0, 2]].values # standardize X_std = np.copy(X) X_std[:,0] = (X[:,0] - X[:,0].mean()) / X[:,0].std() X_std[:,1] = (X[:,1] - X[:,1].mean()) / X[:,1].std() # learning rate = 0.01 aln = AdaptiveLinearNeuron(0.01, 10) aln.fit(X_std,y) # decision region plot plot_decision_regions(X_std, y, classifier=aln) plt.title('Adaptive Linear Neuron - Gradient Descent') plt.xlabel('sepal length [standardized]') plt.ylabel('petal length [standardized]') plt.legend(loc='upper left') plt.show() plt.plot(range(1, len(aln.cost) + 1), aln.cost, marker='o') plt.xlabel('Epochs') plt.ylabel('Sum-squared-error') plt.show()

# Machine Learning with scikit-learn

scikit-learn installation

scikit-learn : Features and feature extraction - iris dataset

scikit-learn : Machine Learning Quick Preview

scikit-learn : Data Preprocessing I - Missing / Categorical data

scikit-learn : Data Preprocessing II - Partitioning a dataset / Feature scaling / Feature Selection / Regularization

scikit-learn : Data Preprocessing III - Dimensionality reduction vis Sequential feature selection / Assessing feature importance via random forests

Data Compression via Dimensionality Reduction I - Principal component analysis (PCA)

scikit-learn : Data Compression via Dimensionality Reduction II - Linear Discriminant Analysis (LDA)

scikit-learn : Data Compression via Dimensionality Reduction III - Nonlinear mappings via kernel principal component (KPCA) analysis

scikit-learn : Logistic Regression, Overfitting & regularization

scikit-learn : Supervised Learning & Unsupervised Learning - e.g. Unsupervised PCA dimensionality reduction with iris dataset

scikit-learn : Unsupervised_Learning - KMeans clustering with iris dataset

scikit-learn : Linearly Separable Data - Linear Model & (Gaussian) radial basis function kernel (RBF kernel)

scikit-learn : Decision Tree Learning I - Entropy, Gini, and Information Gain

scikit-learn : Decision Tree Learning II - Constructing the Decision Tree

scikit-learn : Random Decision Forests Classification

scikit-learn : Support Vector Machines (SVM)

scikit-learn : Support Vector Machines (SVM) II

Flask with Embedded Machine Learning I : Serializing with pickle and DB setup

Flask with Embedded Machine Learning II : Basic Flask App

Flask with Embedded Machine Learning III : Embedding Classifier

Flask with Embedded Machine Learning IV : Deploy

Flask with Embedded Machine Learning V : Updating the classifier

scikit-learn : Sample of a spam comment filter using SVM - classifying a good one or a bad one

### Machine learning algorithms and concepts

Batch gradient descent algorithmSingle Layer Neural Network - Perceptron model on the Iris dataset using Heaviside step activation function

Batch gradient descent versus stochastic gradient descent

Single Layer Neural Network - Adaptive Linear Neuron using linear (identity) activation function with batch gradient descent method

Single Layer Neural Network : Adaptive Linear Neuron using linear (identity) activation function with stochastic gradient descent (SGD)

Logistic Regression

VC (Vapnik-Chervonenkis) Dimension and Shatter

Bias-variance tradeoff

Maximum Likelihood Estimation (MLE)

Neural Networks with backpropagation for XOR using one hidden layer

minHash

tf-idf weight

Natural Language Processing (NLP): Sentiment Analysis I (IMDb & bag-of-words)

Natural Language Processing (NLP): Sentiment Analysis II (tokenization, stemming, and stop words)

Natural Language Processing (NLP): Sentiment Analysis III (training & cross validation)

Natural Language Processing (NLP): Sentiment Analysis IV (out-of-core)

Locality-Sensitive Hashing (LSH) using Cosine Distance (Cosine Similarity)

### Artificial Neural Networks (ANN)

[Note] Sources are available at Github - Jupyter notebook files1. Introduction

2. Forward Propagation

3. Gradient Descent

4. Backpropagation of Errors

5. Checking gradient

6. Training via BFGS

7. Overfitting & Regularization

8. Deep Learning I : Image Recognition (Image uploading)

9. Deep Learning II : Image Recognition (Image classification)

10 - Deep Learning III : Deep Learning III : Theano, TensorFlow, and Keras

Ph.D. / Golden Gate Ave, San Francisco / Seoul National Univ / Carnegie Mellon / UC Berkeley / DevOps / Deep Learning / Visualization