Data Mining Algorithms In R/Packages/nnet

This chapter introduces the feed-forward neural network package for prediction and classification. An artificial neural network (ANN), usually called a "neural network" (NN), is a mathematical or computational model inspired by the structure and functional aspects of biological neural networks. A neural network consists of an interconnected group of artificial neurons, and it processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. Modern neural networks are non-linear statistical data modeling tools. They are usually used to model complex relationships between inputs and outputs or to find patterns in data. This chapter explores the use of feed-forward neural networks through the package NNET^[1]^[2] created by Ripley.

Feed-Forward Neural Network

A feedforward neural network is an artificial neural network where connections between the units do not form a directed cycle. This is different from recurrent neural networks.

The feedforward neural network was the first and arguably simplest type of artificial neural network devised. In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in the network.
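The one-directional flow described above can be sketched in a few lines of R. This is only an illustration, not the code used inside NNET: the function names sigmoid and forward_pass and all weight values are hypothetical, and the sketch assumes logistic hidden units with linear output units.

```r
# Logistic activation, assumed here for the hidden units.
sigmoid <- function(z) 1 / (1 + exp(-z))

# One forward pass: inputs -> hidden layer -> output layer, with no feedback.
forward_pass <- function(x, W1, b1, W2, b2) {
  hidden <- sigmoid(W1 %*% x + b1)   # hidden-unit activations (column matrix)
  as.vector(W2 %*% hidden + b2)      # linear output units
}

# Hypothetical weights: 2 inputs, 2 hidden units, 1 output.
W1 <- matrix(c(0.5, -0.3, 0.8, 0.1), nrow = 2)
b1 <- c(0.1, -0.2)
W2 <- matrix(c(1.0, -1.5), nrow = 1)
b2 <- 0.5

x <- c(3, -2)
out <- forward_pass(x, W1, b1, W2, b2)
print(out)
```

Each layer is computed only from the layer before it, which is what makes the network feed-forward.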

ADALINE

ADALINE stands for Adaptive Linear Element. It was developed by Professor Bernard Widrow and his graduate student Ted Hoff at Stanford University in 1960. It is based on the McCulloch-Pitts model and consists of a weight, a bias and a summation function.

Operation: $y_{i}=wx_{i}+b$

Its adaptation is defined through a cost function (error metric) of the residual $e_{i}=d_{i}-(b+wx_{i})$, where $d_{i}$ is the desired output. With the mean squared error $E={\frac {1}{2N}}\sum _{i=1}^{N}e_{i}^{2}$ as the error metric, the adapted weight and bias become: $b={\frac {\sum _{i}x_{i}^{2}\sum _{i}d_{i}-\sum _{i}x_{i}\sum _{i}x_{i}d_{i}}{N(\sum _{i}(x_{i}-{\bar {x}})^{2})}}$ and $w={\frac {\sum _{i}(x_{i}-{\bar {x}})(d_{i}-{\bar {d}})}{\sum _{i}(x_{i}-{\bar {x}})^{2}}}$
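As a quick check of these closed-form expressions, the following sketch evaluates them on synthetic data and compares the result with ordinary least squares from lm. The data, the seed and the variable names are assumptions made for illustration only.

```r
set.seed(1)

# Synthetic data: inputs x and desired outputs d around the line d = 2x + 1.
x <- runif(50, -1, 1)
d <- 2 * x + 1 + rnorm(50, sd = 0.1)
N <- length(x)

# Closed-form weight and bias that minimize the squared-error criterion.
w <- sum((x - mean(x)) * (d - mean(d))) / sum((x - mean(x))^2)
b <- (sum(x^2) * sum(d) - sum(x) * sum(x * d)) / (N * sum((x - mean(x))^2))

# The same fit obtained by least squares regression.
fit <- lm(d ~ x)

print(c(b = b, w = w))
print(coef(fit))
```

The two printed pairs should agree up to rounding, which illustrates the link between ADALINE and least squares regression.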

The Adaline has practical applications in the controls area. A single neuron with tap delayed inputs (the number of inputs is bounded by the lowest frequency present and the Nyquist rate) can be used to determine the higher order transfer function of a physical system via the bi-linear z-transform. This is done as the Adaline is, functionally, an adaptive FIR filter. Like the single-layer perceptron, ADALINE has a counterpart in statistical modelling, in this case least squares regression.

There is an extension of the Adaline, called the Multiple Adaline (MADALINE) that consists of two or more adalines serially connected.

NNET Package

The package NNET created by Ripley provides methods for using feed-forward neural networks with a single hidden layer, and for multinomial log-linear models. This chapter focuses on the nnet function. Its usage and parameters are briefly described below.

The implementation of NNET for feed-forward neural networks in R is available on CRAN and is already included in the standard R distribution as a recommended package:

nnet

Description

Fit single-hidden-layer neural network, possibly with skip-layer connections.

Usage

nnet(x, ...)

## S3 method for class 'formula':
nnet(formula, data, weights, ...,
     subset, na.action, contrasts = NULL)

## Default S3 method:
nnet(x, y, weights, size, Wts, mask,
     linout = FALSE, entropy = FALSE, softmax = FALSE,
     censored = FALSE, skip = FALSE, rang = 0.7, decay = 0,
     maxit = 100, Hess = FALSE, trace = TRUE, MaxNWts = 1000,
     abstol = 1.0e-4, reltol = 1.0e-8, ...)

Arguments

formula A formula of the form class ~ x1 + x2 + ...

x matrix or data frame of x values for examples.

y matrix or data frame of target values for examples.

weights (case) weights for each example – if missing defaults to 1.

size number of units in the hidden layer. Can be zero if there are skip-layer units.

data Data frame from which variables specified in formula are preferentially to be taken.

subset An index vector specifying the cases to be used in the training sample. (NOTE: If given, this argument must be named.)

na.action A function to specify the action to be taken if NAs are found. The default action is for the procedure to fail. An alternative is na.omit, which leads to rejection of cases with missing values on any required variable. (NOTE: If given, this argument must be named.)

contrasts a list of contrasts to be used for some or all of the factors appearing as variables in the model formula.

Wts initial parameter vector. If missing chosen at random.

mask logical vector indicating which parameters should be optimized (default all).

linout switch for linear output units. Default logistic output units.

entropy switch for entropy (= maximum conditional likelihood) fitting. Default by least-squares.

softmax switch for softmax (log-linear model) and maximum conditional likelihood fitting. linout, entropy, softmax and censored are mutually exclusive.

censored A variant on softmax, in which non-zero targets mean possible classes. Thus for softmax a row of (0, 1, 1) means one example each of classes 2 and 3, but for censored it means one example whose class is only known to be 2 or 3.

skip switch to add skip-layer connections from input to output.

rang Initial random weights on [-rang, rang]. Value about 0.5 unless the inputs are large, in which case it should be chosen so that rang * max(|x|) is about 1.

decay parameter for weight decay. Default 0.

maxit maximum number of iterations. Default 100.

Hess If true, the Hessian of the measure of fit at the best set of weights found is returned as component Hessian.

trace switch for tracing optimization. Default TRUE.

MaxNWts The maximum allowable number of weights. There is no intrinsic limit in the code, but increasing MaxNWts will probably allow fits that are very slow and time-consuming.

abstol Stop if the fit criterion falls below abstol, indicating an essentially perfect fit.

reltol Stop if the optimizer is unable to reduce the fit criterion by a factor of at least 1 - reltol.

... arguments passed to or from other methods.
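The guideline given for rang can be turned into a one-line calculation. The sketch below assumes the four numeric columns of the built-in iris data as inputs and simply solves rang * max(|x|) = 1 for rang. It is a rule of thumb, not a required setting.

```r
# Inputs: the four numeric measurements of the built-in iris data.
x <- as.matrix(iris[, 1:4])

# Choose rang so that rang * max(|x|) is about 1.
rang <- 1 / max(abs(x))
print(rang)
```

For these inputs the result is close to the value rang = 0.1 used in the iris example of this chapter.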

Details

If the response in formula is a factor, an appropriate classification network is constructed; this has one output and entropy fit if the number of levels is two, and a number of outputs equal to the number of classes and a softmax output stage for more levels. If the response is not a factor, it is passed on unchanged to nnet.default.

Optimization is done via the BFGS method of optim.
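The behaviour described for factor responses can be checked on the built-in iris data. The sketch below is illustrative only: the seed, size = 2 and the object names are arbitrary assumptions, and it reads the entropy and softmax components stored on the fitted object.

```r
library(nnet)
set.seed(42)

# Two-level factor response: keep versicolor and virginica only.
iris2 <- droplevels(subset(iris, Species != "setosa"))
fit2 <- nnet(Species ~ ., data = iris2, size = 2, trace = FALSE)

# Three-level factor response: the full iris data.
fit3 <- nnet(Species ~ ., data = iris, size = 2, trace = FALSE)

# Number of output units and the fitting switches recorded on each fit.
print(ncol(fit2$fitted.values))
print(c(entropy = fit2$entropy, softmax = fit2$softmax))
print(ncol(fit3$fitted.values))
print(c(entropy = fit3$entropy, softmax = fit3$softmax))
```

The two-level fit should report a single output with entropy fitting, and the three-level fit three outputs with softmax.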

Value

object of class nnet or nnet.formula. Mostly internal structure, but has components

wts the best set of weights found.
value value of fitting criterion plus weight decay term.
fitted.values the fitted values for the training data.
residuals the residuals for the training data.
convergence 1 if the maximum number of iterations was reached, otherwise 0.
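A short sketch of how these components can be inspected after a fit. The data set (the built-in iris data), the seed and the settings are assumptions chosen only to keep the example small.

```r
library(nnet)
set.seed(1)

fit <- nnet(Species ~ ., data = iris, size = 2, decay = 5e-4,
            maxit = 200, trace = FALSE)

print(length(fit$wts))          # number of fitted weights
print(fit$value)                # fitting criterion plus weight decay term
print(dim(fit$fitted.values))   # one row per training case, one column per output
print(dim(fit$residuals))       # same shape as fitted.values
print(fit$convergence)          # 1 if maxit was reached, otherwise 0
```

A convergence value of 1 suggests rerunning with a larger maxit before trusting the fit.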

Usage Example

# use half the iris data
library("nnet")

ir <- rbind(iris3[,,1], iris3[,,2], iris3[,,3])
targets <- class.ind(c(rep("s", 50), rep("c", 50), rep("v", 50)))
samp <- c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25))

ir1 <- nnet(ir[samp,], targets[samp,], size = 2, rang = 0.1,
            decay = 5e-4, maxit = 200)

# cross-tabulate the true classes against the predicted classes
test.cl <- function(true, pred) {
   true <- max.col(true)
   cres <- max.col(pred)
   table(true, cres)
}

test.cl(targets[-samp,], predict(ir1, ir[-samp,]))

# or, using the formula interface
library("nnet")

ird <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
                  species = factor(c(rep("s", 50), rep("c", 50), rep("v", 50))))

ir.nn2 <- nnet(species ~ ., data = ird, subset = samp, size = 2, rang = 0.1,
               decay = 5e-4, maxit = 200)

table(ird$species[-samp], predict(ir.nn2, ird[-samp,], type = "class"))

Case Study

The case study is designed to illustrate just one among many possible applications of the package NNET.

Scenario

Accurate diagnosis may avoid complications for patients. To establish whether a patient has breast cancer, several factors are analysed together. The goal is to use data collected from many patients to infer the diagnosis of a patient with satisfactory accuracy.

Data Details

The data used in this case study come from the UCI Machine Learning Repository^[3]^[4]^[5]^[6]. The database consists of 10 variables (9 inputs and 1 output) and 569 records of diagnosed patients.

Input

Clump Thickness
Uniformity of Cell Size
Uniformity of Cell Shape
Marginal Adhesion
Single Epithelial Cell Size
Bare Nuclei
Bland Chromatin
Normal Nucleoli
Mitoses

Output (Class)

Diagnosis: Benign or Malignant

Execution and Results

The implementation using the package is simple. The code below loads the training data, fits the network, and evaluates the fitted model on the same data.

# Load the training inputs and outputs (sep must be a single character).
trainingInput <- read.table("trainingInput.data", sep=",", header=TRUE)
trainingOutput <- read.table("trainingOutput.data", sep=",", header=TRUE)

library("nnet")
neuralNetworkModel <- nnet(trainingInput, trainingOutput, size = 19, rang = 0.1, decay = 5e-4, maxit = 2000)

# Cross-tabulate the true classes against the predicted classes.
neuralNetworkTest <- function(true, pred) {
    true <- max.col(true)
    cres <- max.col(pred)
    table(true, cres)
}

neuralNetworkTest(trainingOutput, predict(neuralNetworkModel, trainingInput))

The training call neuralNetworkModel <- nnet(...) prints a trace of the optimization: the number of weights, the initial value of the fit criterion, and its value every 10 iterations, according to the settings chosen for rang, decay and maxit.

  # weights:  230
  initial  value 146.391298 
  iter  10 value 14.225442
  iter  20 value 0.478782
  iter  30 value 0.149068
  iter  40 value 0.140717
  iter  50 value 0.131745
  iter  60 value 0.124368
  iter  70 value 0.116663
  …
  iter 740 value 0.086414
  iter 750 value 0.086414
  final  value 0.086414 
  converged
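The figure of 230 weights in the trace above can be reproduced by hand. The sketch below uses a hypothetical helper, count_weights, and assumes a single hidden layer with one bias per unit and no skip-layer connections. With 9 inputs and 19 hidden units, 230 weights corresponds to 2 output units, which is an inference from the trace rather than something the data files confirm.

```r
# Hypothetical helper: number of weights in a single-hidden-layer network
# without skip-layer connections (each unit also has one bias weight).
count_weights <- function(n_inputs, n_hidden, n_outputs) {
  (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs
}

# 9 inputs, 19 hidden units, 2 outputs (assumed).
n_weights <- count_weights(9, 19, 2)
print(n_weights)
```

Because the count grows quickly with size, a large hidden layer can exceed the MaxNWts limit.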

After training, the function neuralNetworkTest produces the confusion matrix of the results. From this matrix, metrics such as accuracy and error rate can be computed.

       cres
   true   1   2
      1 180   0
      2   0 120

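In this matrix the rows give the true class and the columns the predicted class. Taking class 1 as positive, True Positive = (1,1), False Negative = (1,2), False Positive = (2,1) and True Negative = (2,2). All 300 cases are classified correctly. Because the matrix is computed on the training data itself, it measures how well the network fits those data rather than how it performs on unseen patients.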
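The metrics mentioned above can be derived directly from the confusion matrix. The sketch below re-enters the matrix by hand and treats class 1 as the positive class. That choice and the variable names are assumptions for illustration.

```r
# Confusion matrix from the case study: rows are true classes, columns predictions.
cm <- matrix(c(180, 0,
               0, 120),
             nrow = 2, byrow = TRUE,
             dimnames = list(true = c("1", "2"), cres = c("1", "2")))

accuracy    <- sum(diag(cm)) / sum(cm)         # share of correctly classified cases
error_rate  <- 1 - accuracy
sensitivity <- cm["1", "1"] / sum(cm["1", ])   # true positives / all actual positives
specificity <- cm["2", "2"] / sum(cm["2", ])   # true negatives / all actual negatives

print(c(accuracy = accuracy, error_rate = error_rate,
        sensitivity = sensitivity, specificity = specificity))
```

The same calculation applies unchanged to a matrix computed on held-out test data.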

Analysis

This case study suggests that a neural network model fitted to patient data can support diagnosis. For this to hold, new patient data must resemble the data the network was trained on; otherwise the ANN may produce a misdiagnosis. The package NNET is also convenient to use, which makes it accessible to a wide audience that may use it without a thorough knowledge of the subject.

Reference

[1] B. D. Ripley: "Pattern Recognition and Neural Networks", Cambridge, 1996.
[2] W. N. Venables and B. D. Ripley: "Modern Applied Statistics with S", Fourth edition, Springer, 2002.
[3] O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
[4] W. H. Wolberg and O. L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193-9196.
[5] O. L. Mangasarian, R. Setiono, and W. H. Wolberg: "Pattern recognition via linear programming: Theory and application to medical diagnosis", in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.
[6] K. P. Bennett and O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, pp 23-34 (Gordon & Breach Science Publishers).
