 CHAPTER 1 

Introduction

1. Computational Ecology

Ecology is the scientific study of the relationships between organisms and their environments. The concept was put forward by Haeckel as early as 1866. Through more than one hundred years of development, ecology has become a major branch of knowledge, and since the early 1990s it has evolved into one of the centers of modern science.

There are many sub-disciplines of ecology. According to the organizational levels of organisms, ecology is divided into molecular ecology, physiological ecology, population ecology, community ecology, ecosystems ecology, landscape ecology, etc.; according to the taxa of organisms, there are plant ecology, animal ecology, microbial ecology, insect ecology, etc.; according to landscape and habitat categories, there are terrestrial ecology, marine ecology, wetland ecology, forest ecology, grassland ecology, etc.; according to application categories, there are agro-ecology, urban ecology, pollution ecology, etc.; and according to scientific disciplines, there are mathematical ecology, environmental ecology, chemical ecology, physiological ecology, economic ecology, behavioral ecology, etc.

Among these ecological disciplines, only mathematical ecology is a purely quantitative science. Mathematical ecology stresses the mathematical analysis of ecological issues, mostly by developing analytical models and equations. Due to the complexity, nonlinearity and uncertainty of ecological problems, simple mathematical models or equations are far from enough to address them. As the knowledge of ecology and computational science advances, intensive computation is playing an increasingly important role in ecological studies. Various theories and methods based on intensive computation, such as artificial neural networks, agent-based modeling, systems simulation, and numerical approximation, are increasingly used in ecology. As a result, an ecological discipline, computational ecology, is formally proposed here to integrate and synthesize the computation-intensive areas of ecology.

Research tasks in the discipline of computational ecology are described below:

(1) Computational ecology is a science focusing mainly on the ecological research, construction and application of theories and methods of computational science, including computational mathematics. Intensive computation is one of its major features. Most issues in computational ecology start from modeling, followed by intensive computation based on the model (iteration, training, etc.). It aims at the simulation, approximation, prediction, recognition, and classification of ecological issues. With computational ecology as a unified platform, we may not only apply theories and methods of computational science to ecology, but also construct new theories and methods for computational science. It is an interface, membrane, or gate between ecology and computational science.

(2) Ecology is the main body of computational ecology.
Various sciences are involved in computational ecology, including computational mathematics (such as numerical methods), artificial intelligence (artificial neural networks, machine learning, etc.), computer science (algorithm design, software development, etc.), probability theory, statistics, optimization theory, combinatorics, differential equations, functional analysis, algebraic topology, differential geometry, and others.

(3) The research areas of computational ecology involve (but are not limited to) the following aspects:

(a) Artificial neural networks, knowledge-based systems, machine learning, data exploration, statistical computation (Bayesian computing, randomization, bootstrapping, Monte Carlo techniques, stochastic processes, etc.), computation-intensive inferential methods, heuristics, numerical and optimization methods, individual-based modeling and simulation (differential and difference equation modeling and simulation, etc.), prediction, recognition, classification, agent-based modeling and simulation, network analysis and computation, databases, and other computation-intensive theories and methods.

(b) The development, evaluation and validation of software and algorithms for computational ecology. The development and evaluation of apparatus, instruments and machines for ecological and environmental analysis, investigation and monitoring based on the software of computational ecology.

2. Artificial Neural Networks and Ecological Applications

2.1. A brief history of artificial neural network development

An artificial neural network is a simulation system of the human brain. It can be implemented both in electronic hardware and in computer software. It is a parallel distributed processor with large numbers of connections. Artificial neural networks can acquire knowledge by learning and possess problem-solving ability; the knowledge acquired is stored in the connection weights.

Research on modern artificial neural networks began approximately 60 years ago. The development of artificial neural networks has undergone four phases (Fecit, 2003).

(1) Birth phase. As early as 1943, McCulloch and Pitts described the neural network with mathematical tools and presented a mathematical model of the neuron, i.e., the MP model. The MP model was later developed into the theory of finite automata. Their work demonstrated that artificial neural networks can compute any arithmetic and logic function, and it is recognized as the origin of artificial neural network research. In 1949 Hebb speculated that the conditioned response results from the characteristics of single neurons, and thus presented a hypothesis on the learning law of neurons. The hypothesis was supported over the following 30 years and became widely known as the Hebb learning law. The perceptron network developed by Rosenblatt (1958) was a landmark that initiated the engineering application of artificial neural networks. The adaptive linear element (Adaline), a variant of the perceptron, was subsequently proposed by Widrow and Hoff in 1962 and used in signal analysis and radar antenna control. The Widrow–Hoff learning law is still used in various neural network models. In the late period of this phase, research on neural networks entered a recession because of the limitations of computing ability.

(2) Transition phase.
The key to promoting the development of artificial neural networks is to propose new models and new learning algorithms, while the mathematical principles of artificial neural networks are also indispensable. During the 1970s various network models, theories and learning algorithms were proposed. In this phase Grossberg combined psychology and brain science to form a unified artificial neural theory. From 1971 the Japanese scientist Amari developed theories on the dynamics and stability of artificial neural networks, in particular theories based on manifolds and probability theory. In 1970 and 1973 Fukushima proposed the theory of the cognitive neural network, based on his previous research on artificial system models of the human brain; his models include artificial neural cognition and cognition with selective attention based on the neural cognizer. Research on associative memory made great achievements during this period, and various associative memory models were developed by Kohonen (1972), Anderson (1968, 1973, 1977), and other researchers.

(3) Peak phase. Around 1980, Feldman and Ballard began their neural network research and developed various neural network systems and theories covering natural language, logical reasoning, concept representation, parallel distributed processing, etc. In 1982 and 1984 Hopfield published two papers on a new model and brought neural network research to a climax. The Hopfield network is an interconnected, nonlinear dynamic network. Sejnowski began his neural network research in 1976 and, together with Hinton and Ackley, proposed the Boltzmann machine in 1984 and 1985, drawing on the methods and concepts of statistical physics. During this period a milestone algorithm for multilayer neural networks, the backpropagation (BP) algorithm, was proposed by McClelland and Rumelhart in 1986.

(4) Phase of rapid development. The establishment of the International Neural Network Society in 1987 marked the beginning of a new era of neural network research and applications. Since then annual meetings and symposiums on neural networks have been convened around the world, and neural networks have been used in various areas of society.

2.2. Fundamentals of artificial neural networks

2.2.1. Biological neurons and mechanisms

A typical biological neuron is composed of four parts (Bian and Zhang, 2000; Fig. 1):

(1) Soma. The body of the neuron cell, containing the nucleus and cytoplasm.

(2) Dendrite. A dendrite is typically less than 1 mm long and receives signals from other neurons. There are thousands of branched dendrites on the soma.

(3) Neurite. The neurite outputs signals to other neurons. Signals are transmitted along the neurite at a rate of dozens of meters per second. A neurite may have several branches connected to different neurons.

(4) Synapse. A synapse is a connection between two neurons. A synapse onto a dendrite is excitatory and stimulates the next neuron, while a synapse onto the soma is inhibitory and inhibits the next neuron.

Figure 1. A biological neuron.

A neuron has two different states, i.e., excitation and inhibition (Bian and Zhang, 2000). A neuron in the inhibited state receives excitatory signals from other neurons. The inputs are algebraically summed, and if the sum exceeds a threshold the neuron is excited.
The neuron is then in an excited state and delivers an output pulse to other neurons. After a neuron is excited there is a refractory period during which it will not respond to stimulation from other neurons, and its threshold then drops back gradually. In theory the biological neuron can only transmit Boolean signals. However, the series of pulses emitted by an excited neuron may be treated as a frequency-modulated signal, and the pulse density may represent a continuous signal.

2.2.2. Types and mechanisms of artificial neural networks

A neural network can be regarded as a directed graph with nodes (input nodes and neurons, or input nodes and computation nodes), synaptic connections, and functional connections. As far as connection types are concerned, there are two types of neural networks, i.e., feedforward networks and feedback networks. Feedforward networks are functional mapping networks and are usually used for pattern recognition, function approximation and prediction (Haykin, 1994; Yan and Zhang, 2000; Fecit, 2003). Feedback neural networks are used as associative memorizers and optimization tools.

In a feedforward network, every neuron receives its inputs from the previous layer and yields outputs to the next layer; there is no feedback. A feedback network can be redrawn as an undirected graph in which each connection is bidirectional. In a feedback neural network all nodes are computation nodes, and each node has (n − 1) inputs and one output if the total number of nodes is n.

There are two phases in the workflow of a neural network:

(1) Learning phase. The states of all computation nodes are held constant while the connection weights are adjusted through the learning process.

(2) Working phase. The connection weights are held constant while the states of the computation nodes change until stable states are reached.

2.2.3. Basic architecture of artificial neural networks

(1) One-input neuron

The architecture of a one-input neuron is indicated in Fig. 2.

Figure 2. One-input neuron.

The mathematical expression of the one-input neuron is

y = f(wx + b),

where w = the weight of input x; b = bias; y = output; f = transfer function. In this expression the output of the accumulator, p = wx + b, is also called the net input of the transfer function f. The addition of a bias b increases the adaptability of neurons and neural networks.

(2) Multiple-input neuron

The architecture of a multiple-input neuron is indicated in Fig. 3.

Figure 3. Multiple-input neuron. There are n inputs for the neuron.

The mathematical expression of the multiple-input neuron is

y = f(Σ_{i=1}^{n} w1i xi + b),

where w1i = the connection weight from source neuron i to target neuron 1, i = 1, 2, ..., n; b = bias; y = output; f = transfer function. The architecture of the multiple-input neuron (n inputs) can be represented by the simpler illustration in Fig. 4.

Figure 4. The simpler representation of a multiple-input neuron. In this representation x is an n × 1 input vector, w is the 1 × n weight vector, and b, p and y are scalars.

(3) One-layer feedforward neural network

A single neuron with multiple inputs is not enough to form a useful neural network; a network generally contains several neurons operated in parallel (Hagan et al., 1996). A set of neurons operated in parallel forms a layer (Fig. 5). The mathematical expression of a one-layer feedforward neural network with s neurons is

y = f(wx + b),

where x ∈ Rn, y ∈ Rs, b ∈ Rs, and w = (wij)s×n.
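As a concrete illustration of these expressions, the following Python/NumPy sketch computes the output of a multiple-input neuron and of a one-layer feedforward network y = f(wx + b). The transfer function choice, sizes and numerical values are illustrative assumptions of this sketch, not taken from the text.

```python
import numpy as np

def logsig(p):
    """Log-sigmoid transfer function f(p) = 1 / (1 + exp(-p))."""
    return 1.0 / (1.0 + np.exp(-p))

# Multiple-input neuron: n inputs, scalar bias, scalar output.
n = 3
x = np.array([0.2, -0.5, 1.0])      # input vector (n,)
w1 = np.array([0.4, 0.1, -0.3])     # weight vector of the single neuron
b1 = 0.05                           # bias
p = w1 @ x + b1                     # net input of the transfer function
y_neuron = logsig(p)

# One-layer feedforward network: s neurons in parallel, y = f(Wx + b).
s = 2
W = np.random.default_rng(0).normal(size=(s, n))  # s x n weight matrix
b = np.zeros(s)                                   # s x 1 bias vector
y_layer = logsig(W @ x + b)                       # s x 1 output vector

print(y_neuron, y_layer)
```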
Figure 5. A one-layer feedforward network with s neurons. Each neuron has n inputs.

The architecture of a one-layer feedforward network with s neurons can be represented by the simpler illustration in Fig. 6.

Figure 6. The simpler representation of a one-layer feedforward network with s neurons. In this representation x is an n × 1 input vector; w is the s × n weight matrix; b, p and y are s × 1 vectors.

The number of neurons in a one-layer feedforward neural network is completely determined by the number of network outputs.

(4) Multilayer feedforward neural network

In a multilayer feedforward neural network, each layer has its own bias vector, net input vector, weight matrix, and output vector. The layer whose output is the network output is the output layer, and the remaining layers are hidden layers. A multilayer feedforward neural network, such as a network using a sigmoid transfer function in the first layer and a linear transfer function in the second layer, can approximate most functions arbitrarily well (Hagan et al., 1996). A two-layer feedforward neural network, as indicated in Fig. 7, is represented by the equation

y = g(w2 f(w1 x + b1) + b2),

where x ∈ Rn, y ∈ Rs, b1 ∈ Rs, b2 ∈ Rs, w1 is an s × n weight matrix, and w2 is an s × s weight matrix.

Figure 7. A two-layer feedforward neural network with s neurons in each layer. Each neuron of the first layer receives n inputs.

As for the number of layers, two or three layers are enough in most cases. There is no generally accepted algorithm or rule to determine the number of hidden neurons in a multilayer feedforward neural network.

(5) Recursive neural network

A recursive network is a feedback neural network in which some of the network outputs are redirected to the inputs. Recursive networks are more powerful than feedforward networks. A recursive network contains one or more time-delay modules that form the network feedback. The mathematical representation of a time delay element (Fig. 8) is

y(t) = x(t − 1),

where y(0) is the initial condition. Figure 9 illustrates the architecture of a recursive neural network.

Figure 8. Time delay element.

Figure 9. A recursive neural network.

2.2.4. Learning methods of artificial neural networks

There are three ways of network learning:

(1) Supervised learning. There is a set of training samples, and the neural network adjusts its connection weights according to the difference between the given outputs and the practical outputs.

(2) Unsupervised learning. The neural network adjusts its connection weights according to statistical information carried by the environmental data (Yan and Zhang, 2000); this is a self-organizing process.

(3) Reinforcement learning. The external environment provides an evaluation of the network output, and the neural network adjusts its connection weights by reinforcing those actions encouraged by the environment. It is a learning method between supervised and unsupervised learning.
Three kinds of learning algorithms are used in neural networks (Yan and Zhang, 2000):

(1) Hebb learning law. The strength of the connection between two neurons is expected to increase if the activation of the two neurons is synchronous and to decrease if the activation is asynchronous, e.g., by the following weight change:

Δwij(t) = η xi(t) yj(t),

where xi(t) and yj(t) are the states of the two connected neurons at time t.

(2) Error correction learning law. The error is

ei(t) = yi(t) − zi(t),

where yi(t) = desired output of the ith neuron at time t; zi(t) = practical output of the ith neuron at time t; ei(t) = output error of the ith neuron at time t. The objective is to minimize some function of ei(t). An example is the delta rule:

Δwij(t) = η ei(t) xj(t),

where xj(t) is the jth input at time t, and Δwij(t) is the weight change.

(3) Competitive learning law. All output nodes compete with each other and finally only the strongest node is activated. Generally there are inhibitory connections among the output nodes. The learning law can be represented as follows:

Δwji(t) = η (xi(t) − wji(t)), if node j wins the competition;
Δwji(t) = 0, if node j loses the competition.

2.2.5. Adaptation of artificial neural networks

A neural network may ideally acquire knowledge by learning from a steady environment. However, if the environment is unsteady (time-varying), the neural network must be provided with adaptive ability in order to follow the changing environment (Yan and Zhang, 2000). In this case every new input is treated as a new data point and the neural network is viewed as a predictor:

x(t) = f(x(t − 1), w(t − 1)),
e(t) = z(t) − x(t),

where z(t) = observed output at time t; x(t) = predicted output at time t; e(t) = output error at time t. The objective is to drive e(t) to 0.
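The learning laws above can be sketched as single-step weight updates. The following Python/NumPy functions are illustrative only; the function names, the learning rate value and the winner-selection rule in the competitive case are assumptions of this sketch rather than part of the original text.

```python
import numpy as np

def hebb_update(W, x, y, eta=0.1):
    """Hebb law: dW_ij = eta * x_i * y_j (strengthen synchronous activity)."""
    return W + eta * np.outer(y, x)

def delta_update(W, x, y_desired, y_actual, eta=0.1):
    """Error-correction (delta) rule: dW_ij = eta * e_i * x_j, with e = desired - actual."""
    e = y_desired - y_actual
    return W + eta * np.outer(e, x)

def competitive_update(W, x, eta=0.1):
    """Competitive law: only the winning output node moves its weight vector toward x."""
    j = np.argmax(W @ x)             # winner-take-all (strength assumed = net input)
    W = W.copy()
    W[j] += eta * (x - W[j])
    return W
```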
2.3. Applications of artificial neural networks

Neural networks are currently applied in many areas (Haykin, 1994; Widrow, 1994; Hoffmann, 1998; Zhang, 2007). Some are listed below:

(1) Numerical computation. Function approximation, interpolation, optimization, etc.

(2) Modeling. Chemical modeling, ecological modeling, dynamic modeling of industrial processes, etc.

(3) Data mining and knowledge discovery. Between-variable relationship discovery, classification, etc.

(4) Biological and medical applications. Gene discovery, protein prediction, biodiversity analysis, growth simulation, survival analysis, community prediction, etc.

(5) Environmental applications. Pollutant prediction, environmental monitoring, habitat discrimination, etc.

(6) Visual and audio recognition and processing. Face and signature recognition, radar and sonic image processing (image compression, feature extraction, noise removal, etc.), robot visualization, target identification, audio compression and recognition, recognition of human tissues and cells, etc.

(7) Control systems. Robot control, orbit control, traffic dispatch and control, production flow control, weapon manipulation, target tracking, etc.

(8) Diagnostic systems. Disease diagnosis, vehicle diagnosis, machine and flow diagnosis, cardiograph classification, etc.

(9) Industry and manufacturing. Quality monitoring and analysis, performance analysis, project bidding, product design and analysis, oil exploration, etc.

(10) Communication systems. Route selection, aero-navigation, echo canceling, etc.

(11) Economic and financial applications. Market analysis, stock exchange advisory systems, real estate assessment, loan consultation, financial analysis, price prediction, cheque recognition and cash detection, etc.

2.4. Ecological applications of artificial neural networks

Since the 1970s ecologists have sought to understand ecosystems by constructing mechanistic models. However, as the complexity of the ecosystems studied increased, more and more black boxes emerged in the models and model complexity grew rapidly. The effectiveness and validity of mechanistic models declined as model complexity increased, and such models finally became unsolvable, unstable, and unreliable. Due to the complexity of ecosystems, empirical models have regained popularity in recent years (Tan et al., 2006). On the other hand, ecological relationships are highly nonlinear and thus cannot be reasonably described by classical models (Schultz and Wieland, 1997; Pastor-Barcenas et al., 2005), whether mechanistic or empirical.

Artificial neural networks have been recognized as universal function approximators for complex and nonlinear ecological relationships (Acharya et al., 2006; Nour et al., 2006; Zhang and Barrion, 2006; Zhang, 2007; Zhang et al., 2008). They have the advantages of more automated model synthesis and analytical input–output models (Tan et al., 2006). A large number of studies on ecological applications of artificial neural networks were conducted in the last ten years.

Concerning the dynamic modeling of ecological or environmental processes, artificial neural networks have been used for modeling short- and medium-term concentration levels (Viotti et al., 2002), subsurface processes (Almasri and Kaluarachchi, 2005), sediment transfer (Abrahart and White, 2001), subsurface drain outflow, nitrate–nitrogen concentration in tile effluent, and surface ozone (Sharma et al., 2003; Pastor-Barcenas et al., 2005), flow and phosphorus concentration (Nour et al., 2006), dioxide dispersion (Nagendra and Khare, 2006), the growth of Chinese cabbage (Zhang et al., 2007), and the food intake dynamics of a holometabolous insect (Zhang et al., 2008).

Artificial neural networks are also widely used for classification, recognition, and prediction of ecological issues. They were used to explain the observed structure of functional feeding groups of aquatic macroinvertebrates (Jorgensen et al., 2002). Backpropagation (BP) and radial basis function (RBF) neural networks were used to simulate and predict species richness of rice arthropods (Zhang and Barrion, 2006). They were used in the classification and discrimination of vegetation (Marchant and Onyango, 2003; Filippi and Jensen, 2006), and of habitat zones and functional groups of invertebrates (Zhang, 2007). In addition, artificial neural networks have been used to explain observed changes in species composition and abundance (Jaarsma et al., 2007), to construct transfer functions that implement organism–environment relationships for paleoecological uses (Racca et al., 2007), to classify community assemblages (Zhang, 2007; Tison et al., 2007), and to determine the risk of insect pest invasion (Watts and Worner, 2009). Spatial distribution patterns of invertebrates can be effectively described by artificial neural networks (Cereghino et al., 2001; Zhang et al., 2008), which performed better than partial differential equations and spline functions.
Artificial neural networks have been compared with various conventional models in terms of modeling performance. They proved more effective than differential equations (Zhang et al., 2007; Zhang and Wei, 2008). They were superior to linear models, generalized additive models, and regression trees (Moisen and Frescino, 2002). They outperformed logistic regression, multiple discriminant models and multiple regression in predicting community composition (Olden et al., 2006) and the number of salmonids (McKenna, 2005). They can also provide a feasible alternative to more classical spatial statistical techniques (Pearson et al., 2002).

The need for better techniques, tools and practices to analyze ecological systems within an integrated framework has never been so great (Shanmuganathan et al., 2006). Approaches conditioned on data should thus be preferred (Lukacs et al., 2007). Artificial neural networks are universal and adaptive data-driven models, and wider ecological application of them is expected in the future.

2.5. Important books and journals

There are a large number of books and journals on artificial neural networks. Some books on theories and applications of artificial neural networks are as follows:

(1) Anderson JA. An Introduction to Neural Networks. MIT Press, Cambridge, USA, 1995
(2) Smith SM. Neural Networks for Statistical Modeling. Van Nostrand Reinhold, New York, USA, 1993
(3) Chester M. Neural Networks: A Tutorial. Prentice-Hall, New Jersey, USA, 1993
(4) Hassoun MH. Fundamentals of Artificial Neural Networks. MIT Press, Cambridge, USA, 1995
(5) Kohonen T. Self-Organizing Maps. Springer-Verlag, Germany, 1995
(6) Haykin S. Neural Networks: A Comprehensive Foundation. Macmillan, New York, USA, 1994
(7) Nigrin A. Neural Networks for Pattern Recognition. MIT Press, Cambridge, USA, 1993
(8) Fecit. Analysis and Design of Neural Networks in MATLAB 6.5. Electronics Industry Press, Beijing, China, 2003
(9) Hagan MT, Demuth HB, Beale MH. Neural Network Design. PWS Publishing Company, Boston, USA, 1996
(10) Yan PF, Zhang CS. Artificial Neural Networks and Simulated Evolution. Tsinghua University Press, Beijing, China, 2000
(11) Zhang WJ. Methodology on Ecology Research. Sun Yat-Sen University Press, Guangzhou, China, 2007

Some journals on theories and applications of artificial neural networks are listed below:

(1) Neural Networks, http://www.elsevier.com/locate/neunet
(2) Neural Computation, http://www.mitpressjournals.org/loi/neco
(3) Artificial Intelligence, http://www.elsevier.com/locate/artint
(4) IEEE Transactions on Neural Networks
(5) Journal of Artificial Neural Networks
(6) Machine Learning, http://www.springerlink.com/content/100309/
(7) Network: Computation in Neural Systems, http://www.informaworld.com/smpp/title~db=all~content=t713663148
(8) International Journal of Neural Systems, http://www.worldscinet.com/ijns/ijns.shtml
(9) IEEE Transactions on Circuits and Systems
(10) Ecological Modeling, http://www.elsevier.com/locate/ecolmodel
(11) Ecological Complexity, http://www.elsevier.com/locate/ecocom
PART I

Artificial Neural Networks: Principles, Theories and Algorithms

 CHAPTER 2 

Feedforward Neural Networks

Feedforward neural networks are one of the two most important types of neural networks. In a feedforward neural network, each neuron receives inputs from the previous layer and produces outputs for the next layer (Anderson, 1972; Anderson and Rosenfeld, 1989; Haykin, 1994; Hagan et al., 1996; Fecit, 2003; Marchant and Onyango, 2003). There is no feedback in a feedforward network, which can therefore be illustrated as an acyclic graph. In a sense, a feedforward neural network is a compound function built by repeatedly composing nonlinear functions. Perceptron networks, linear networks, BP networks, RBF networks, etc., are all feedforward neural networks (Zhang, 2007).

The history of feedforward neural networks may be traced back to the perceptron (Rosenblatt, 1958; Minsky and Papert, 1969). Studying the perceptron is instructive for understanding more complex feedforward neural networks. The perceptron was the first layered neural network with a learning capability. An original perceptron contains an input layer (sensory layer, S), an intermediate layer (association layer, A) and an output layer (response layer, R) (Fig. 1). However, the connection weights between the input layer and the intermediate layer are fixed, so the intermediate layer cannot be regarded as a hidden layer. The connection weights between the intermediate layer and the output layer are adjustable by a learning procedure.

Figure 1. A three-layer perceptron.

1. Linear Separability and Perceptron

1.1. Linear threshold unit and Boolean function

A linear threshold unit is composed of a neuron and a set of adjustable weights, in which the activation function is the threshold function. Defining the threshold θ = −b, where b is the bias, the mathematical expression of the linear threshold unit is

y = sgn(wT x − θ),

where x = (x1, x2, ..., xn)T is an n-dimensional input, w = (w1, w2, ..., wn)T is the weight vector, and sgn(z) = 1 if z ≥ 0, sgn(z) = −1 if z < 0.

The linear threshold unit can realize such logical functions as AND, OR and NOT, and in particular NAND. As a consequence we have the following theorem:

Theorem 1. Any Boolean function may be realized by a feedforward network composed of linear threshold units, and it can be realized by a three-layer (including the input layer) feedforward network (Yan and Zhang, 2000; Fecit, 2003).

1.2. Linear separable function

For a given function f, if there exist w and θ such that

f(x) = sgn(wT x − θ),

the function f is called linearly separable. It is obvious that a two-layer network can only realize linearly separable functions. AND, OR and NAND are all linearly separable.

As shown above, a one-layer network, including the one-layer perceptron, can only act as a linear classifier. With the threshold function as the activation function, a multi-layer network can realize Boolean functions. If the activation functions of the neurons are continuous (e.g., the sigmoid function, the sine function, etc.), the neural network will be able to approximate any continuous function.
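As a small check of the claims in Sections 1.1 and 1.2, the following sketch realizes AND, OR and NAND with single linear threshold units. The ±1 encoding of Boolean values and the hand-chosen weights and thresholds are illustrative assumptions; XOR, which is not linearly separable, cannot be realized by a single unit of this kind.

```python
import numpy as np

def ltu(x, w, theta):
    """Linear threshold unit: y = sgn(w^T x - theta), sgn(z) = 1 if z >= 0 else -1."""
    return 1 if np.dot(w, x) - theta >= 0 else -1

# Boolean inputs encoded as -1 (false) and +1 (true).
inputs = [np.array([a, b]) for a in (-1, 1) for b in (-1, 1)]

AND  = lambda x: ltu(x, np.array([1.0, 1.0]), 1.5)     # true only if both inputs are +1
OR   = lambda x: ltu(x, np.array([1.0, 1.0]), -1.5)    # true if at least one input is +1
NAND = lambda x: ltu(x, np.array([-1.0, -1.0]), -1.5)  # negation of AND

for x in inputs:
    print(x, AND(x), OR(x), NAND(x))
```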
It has been proved that a feedforward neural network with n inputs and m outputs can be treated as a nonlinear mapping from n-dimensional Euclidean space to m-dimensional Euclidean space, and that this mapping is able to approximate any continuous function (with or without countably many discrete points) (Hornik et al., 1989). This conclusion can be described by the following theorem:

Theorem 2 (Kolmogorov Theorem). Suppose that φ(·) is a bounded, nonconstant, and monotonically increasing continuous function, In is the n-dimensional unit hypercube [0, 1]n, and C(In) is the set of continuous functions defined on In. Then for any function f ∈ C(In) and ε > 0, there is an integer m and a group of real constants αi, θi and ωij, where i = 1, 2, ..., m, and j = 1, 2, ..., n, such that the network output

F(x1, x2, ..., xn) = Σ_{i=1}^{m} αi φ(Σ_{j=1}^{n} ωij xj − θi)

arbitrarily approximates f(·), i.e.,

|F(x1, x2, ..., xn) − f(x1, x2, ..., xn)| < ε, (x1, x2, ..., xn) ∈ In.

The theorem indicates that a feedforward network with even only one hidden layer may be used as a general approximator. The theorem may also be extended to the pattern classification mapping f : An → {1, 2, ..., m}, where f(x) = j if and only if x ∈ Cj, An is a compact set in Rn, An = ∪ Cj, and Ci ∩ Cj is empty if i ≠ j. Generally, the more inflection points the target function has, the more hidden neurons are needed.

1.3. Learning law of perceptron

The learning law of the perceptron is a supervised learning law. Suppose x(k), y(k), ŷ(k), and b(k) are the network input, practical output, desired output, and bias at step k, respectively, with

x(k) = (x1(k), x2(k), ..., xn(k))T,
y(k) = (y1(k), y2(k), ..., ym(k))T,
ŷ(k) = (ŷ1(k), ŷ2(k), ..., ŷm(k))T,
b(k) = (b1(k), b2(k), ..., bm(k))T,
w(k) = (wij(k)).

The learning law of a perceptron is a gradient-descent-type method, expressed as follows:

wij(k + 1) = wij(k) + (ŷi(k) − yi(k)) xj(k),
bi(k + 1) = bi(k) + ŷi(k) − yi(k).

Through the above adjustment of connection weights and biases, the network output will approximate the desired output. The Matlab functions for the learning law of the perceptron are learnp and learnpn (Mathworks, 2002).

1.4. Limitations of perceptron

Perceptron neural networks have some limitations:

(1) The perceptron can only be used for simple classification problems because its activation function is the threshold function; it can only classify linearly separable problems.

(2) The perceptron is not able to solve the XOR (exclusive OR) problem.

(3) Singular (outlying) input samples will lengthen the learning time.
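A minimal sketch of the perceptron learning law of Section 1.3 on a small linearly separable toy problem follows. The data, the hard-limit 0/1 output encoding and the epoch limit are illustrative assumptions of this sketch (the text itself points to the Matlab functions learnp and learnpn).

```python
import numpy as np

def hardlim(z):
    """Hard-limit threshold output: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

# Toy linearly separable data (illustrative): class 1 roughly where x1 + x2 > 1.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.9, 0.9], [0.1, 0.2]])
y_desired = np.array([0, 0, 0, 1, 1, 0])

w = np.zeros(2)
b = 0.0
for epoch in range(100):
    errors = 0
    for x, t in zip(X, y_desired):
        y = hardlim(w @ x + b)
        # Perceptron rule: w <- w + (desired - practical) * x, b <- b + (desired - practical)
        w += (t - y) * x
        b += (t - y)
        errors += int(t != y)
    if errors == 0:          # all training samples classified correctly
        break

print(w, b)
```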
2. Some Analogies of Multilayer Feedforward Networks

Multilayer linear neural networks can be related to the following areas (Zhang and Fang, 1982; Chen et al., 1987; Yan and Zhang, 2000):

(1) Regression analysis. Regression analysis aims to determine the x–y relationship from observed samples (x, y). A multilayer feedforward network is designed to find a mapping f : Rn → Rm such that

min ∫∫ ‖y − f(x)‖² p(x, y) dx dy.

The solution of this minimization problem is

f(x) = ∫ y py|x(x, y) dy,

where py|x(x, y) is the conditional probability density of y given x; f(x) is just the regression of y on x. If f is a linear function (a linear neural network), we have ŷ = Ax, and the solution A = Ryx Rxx⁻¹ minimizes E‖y − Ax‖².

(2) Discriminant analysis. Discriminant analysis is a supervised pattern classification based on a linear transformation, in which different categories of data are separated as far apart as possible in the new coordinate system.

(3) Principal component analysis (PCA). PCA can be used for data reduction, in order to identify a small number of factors that explain most of the variance observed in a larger number of manifest variables (SPSS, 2006).

3. Functionability of Multilayer Feedforward Networks

From the viewpoint of probability theory, if the true input–output relationship is p(x, y), the probability density of the output is

p(y|x) = p(x, y)/p(x),

where the probability density of the input is

p(x) = ∫ p(x, y) dy.

A neural network tries to learn p(x, y) from samples and thus to approximate p(x, y) with an estimate p̂(x, y, w). The functionality of feedforward neural networks can be summarized as follows (Yan and Zhang, 2000):

(1) In function approximation, the network output is the regression function of y on x.

(2) In pattern classification, the network output is the posterior probability of the corresponding category.

(3) The correctness of the network output is higher for unknown samples with larger occurrence probability p(x), and lower for unknown samples with smaller occurrence probability.

(4) The lower the variance, the higher the confidence level of the network output.

 CHAPTER 3 

Linear Neural Networks

Linear neural networks are the simplest networks; they are composed of several linear neurons. The prototype of linear neural networks was developed by Widrow and Hoff (1960) and was called the Adaptive Linear Element (ADALINE). Unlike the perceptron, the transfer function of linear neural networks is a linear function. Like other feedforward neural networks, linear neural networks are used for function approximation, system modeling, adaptive filtering, prediction, control, and pattern classification (Widrow and Hoff, 1960; Widrow and Stearns, 1985; Widrow and Winter, 1988; Haykin, 1994; Anderson and Rosenfeld, 1989; Zhang, 2007). However, they can only classify linearly separable problems.

1. Linear Neural Networks

Before introducing linear neural networks, a related model, the Generalized Linear Model (GLM), is introduced here. Suppose xi = (x1i, x2i, ..., xni)T, yi is a scalar variable, i = 1, 2, ..., p, where p is the number of samples, and β = (β1, β2, ..., βn)T is the parameter vector of the linear model. A GLM should meet the following rules:

(1) There is a strictly increasing and differentiable function g such that

ηi = g(µi) = (xi)T β,

where µi = E(yi) and g(·) is the link function.

(2) The probability distribution of yi is of the exponential type (e.g., normal distribution, Poisson distribution, binomial distribution, etc.), described by

p(yi, η, φ) = exp[(η yi − b(η))/φ + c(yi, φ)],

where η is the natural parameter and φ is the dispersion parameter; these parameters are related to the variance of yi.
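A compact sketch of fitting a GLM of the kind just described — a Poisson response with log link — by iteratively reweighted least squares (IRLS) follows. The simulated data, the choice of IRLS as the fitting method, and all variable names are illustrative assumptions; the text itself only states the defining rules of a GLM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: Poisson responses with log link, eta = X beta, mu = exp(eta).
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))

# Iteratively reweighted least squares for the Poisson/log-link GLM.
beta = np.zeros(p)
for _ in range(25):
    eta = X @ beta                    # linear predictor eta_i = x_i^T beta
    mu = np.exp(eta)                  # mean mu_i = E(y_i) = g^{-1}(eta_i)
    W = mu                            # working weights for the Poisson/log-link case
    z = eta + (y - mu) / mu           # working response
    XtW = (X * W[:, None]).T
    beta = np.linalg.solve(XtW @ X, XtW @ z)

print(beta)                            # should be close to beta_true
```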
1.1. Adaline

The mathematical expression of ADALINE (Fig. 1) is

y = wx + b,

where x = (x1, x2, ..., xn)T, y = (y1, y2, ..., ys)T, b = (b1, b2, ..., bs)T, and w = (wij)s×n.

Figure 1. Adaline.

ADALINE learns from its environment by adjusting connection weights and thresholds according to a learning law such as the Widrow–Hoff learning law, i.e., the LMS (Least Mean Square) rule (Fecit, 2003).

1.2. Multilayer linear neural network

Figure 2 shows a multilayer linear neural network. In this network, the m × n input-to-hidden weight matrix and the s × m hidden-to-output weight matrix are the between-layer weight matrices. The rank of the overall weight matrix w, given by the product of the two between-layer matrices, is less than or equal to m if all neurons are linear.

Figure 2. A multilayer linear neural network. There are m hidden neurons in the network and m ≤ min(s, n).

2. LMS Rule

The Least Mean Square (LMS) rule is only effective for one-layer linear networks (Fecit, 2003). Suppose x(k), y(k), ŷ(k), and b(k) are the network input, practical output, desired output, and bias at step k, respectively, with

x(k) = (x1(k), x2(k), ..., xn(k))T,
y(k) = (y1(k), y2(k), ..., ys(k))T,
ŷ(k) = (ŷ1(k), ŷ2(k), ..., ŷs(k))T,
b(k) = (b1(k), b2(k), ..., bs(k))T,
w(k) = (wij(k)).

The LMS rule is approximately a gradient descent method, expressed as follows:

wij(k + 1) = wij(k) + η (ŷi(k) − yi(k)) xj(k),
bi(k + 1) = bi(k) + η (ŷi(k) − yi(k)),

where η is the learning rate.

The training procedure of a linear neural network is as follows:

(1) Calculate the network output, y = wx + b, and the error, e = ŷ − y.

(2) Compare the mean square of the output error with the desired value. If the error is less than the desired value or the maximum number of epochs is reached, the training procedure terminates; otherwise continue to train.

(3) Update the weights and thresholds (in the following, threshold and bias are used interchangeably) using the LMS rule and return to (1).

A linear neural network converges if the input vectors are linearly independent and η is appropriately chosen. The LMS rule has been widely used in echo cancellation for long-distance telephony and in other adaptive filter designs (Hagan et al., 1996). The LMS rule is also the foundation of the BP algorithm. The Matlab function for LMS is learnwh (Mathworks, 2002).
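A sketch of the LMS training procedure above for a one-layer linear network follows. The simulated target system, the learning rate and the stopping threshold are illustrative assumptions of this sketch (the text points to the Matlab function learnwh).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear system to be identified (illustrative): y_hat = A x + c.
A_true = np.array([[2.0, -1.0]])
c_true = np.array([0.5])
X = rng.normal(size=(200, 2))
Y_hat = X @ A_true.T + c_true            # desired outputs, shape (200, 1)

w = np.zeros((1, 2))                     # s x n weight matrix (s = 1 output)
b = np.zeros(1)
eta = 0.05                               # learning rate

for epoch in range(50):
    for x, y_hat in zip(X, Y_hat):
        y = w @ x + b                    # network output
        e = y_hat - y                    # error (desired - practical)
        w += eta * np.outer(e, x)        # LMS rule: w_ij <- w_ij + eta * e_i * x_j
        b += eta * e
    mse = np.mean((Y_hat - (X @ w.T + b)) ** 2)
    if mse < 1e-8:                       # stop when the mean square error is small enough
        break

print(w, b)                              # close to A_true and c_true
```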
 CHAPTER 4 

Radial Basis Function Neural Networks

Radial basis function (RBF) neural networks are often used to conduct functional interpolation (Albus, 1971; Broomhead and Lowe, 1988; Powel, 1990; Zhang and Barrion, 2006). RBF neural networks are multilayer feedforward networks. An RBF network is composed of three layers. The first layer is the input layer. Neurons in the second layer do not have connection weights linked to the inputs; their outputs are determined by the distances between the inputs and the centers of the basis functions. The third layer is usually a linear layer which yields a weighted sum of the outputs of the second layer (Haykin, 1994; Hagan et al., 1996). All weights from the input layer to the hidden layer are ones, and the weights from the hidden layer to the output layer are adjustable (Bian and Zhang, 2000). In an RBF network, neurons respond only to inputs adjacent to the centers of their basis functions, i.e., each basis function responds only to a local neighborhood of the input space. RBF networks therefore perform locally tuned approximation.

Learning is fast in an RBF network because only a small number of parameters must be adjusted when learning new data, but more neurons are needed in the case of high-dimensional input (Moody and Darken, 1989). In a sense, probabilistic neural networks, general regression neural networks, wavelet neural networks, functional link neural networks, etc., are all RBF networks.

1. Theory of RBF Neural Network

1.1. Basic theory

An RBF neural network is an approximator of an unknown function f(x). In general any function can be expressed as the weighted sum of a set of basis functions (Bian and Zhang, 2000). In an RBF network the function f(x) is approximated by a set of basis functions composed of the output functions of the hidden neurons. The mapping from the input layer to the hidden layer is nonlinear, and the mapping to the network output is linear.

According to functional analysis theory, suppose H is a Hilbert space with reproducing kernel K(x, z) and {φi} is an orthonormal basis of H. If there is a constant a such that (Hurt, 1989; Yan and Zhang, 2000)

⟨φ(x), φ(z)⟩ = a K(x, z),

where x = (x1, x2, ..., xn)T and z = (z1, z2, ..., zn)T, then any function f ∈ H can be approximated by a linear combination of a set of basis functions (Powel, 1990; Park et al., 1991):

f(x) = Σ_{i=1}^{N} wi φ(x, zi),

where {φ(x, zi) | i = 1, 2, ..., N} is a set of basis functions and zi = (zi1, zi2, ..., zin)T.

Suppose there are N samples (xi, ŷi), i = 1, 2, ..., N, where xi = (x1i, x2i, ..., xni)T. The theory above can be represented by the equation

Φ w = ŷ,

where Φ = (φij)N×N, φij = φ(‖xi − zj‖), ŷ = (ŷ1, ŷ2, ..., ŷN)T, and w = (w1, w2, ..., wN)T. The solution of this equation is

w = Φ⁻¹ ŷ.

The matrix Φ is sometimes singular, and a regularization process is then needed (Yan and Zhang, 2000). Some RBFs used in the above equation are

φ(x) = exp(−‖x − µi‖² / (2σi²))   (Gaussian kernel function),
φ(x) = (σi² + ‖x‖²)^βi.

1.2. Choice of network architecture

In an RBF neural network the basis functions are generally not orthonormal and the hidden representation is thus redundant. The number of hidden neurons and the parameters of the basis functions should be determined empirically (Yan and Zhang, 2000). Based on functional analysis theory, suppose {φi}, i = 1, 2, ..., is a group of orthonormal functions (basis functions) that are continuous on [0, 1]. A continuous function f(x) on [0, 1] then has the unique L2 approximation

f(c, x) = Σ_{i=1}^{n} ci φi(x),

where c = (c1, c2, ..., cn)T can be determined by projection onto the basis functions,

ci = ∫_0^1 f(x) φi(x) dx.

The mean square error of the approximation with n basis functions is

e²n = Σ_{i=n+1}^{∞} ci².

The mean square error e²n declines as the number of basis functions increases. The basis functions φi(x) with the largest coefficients ci can be chosen to minimize e²n (Scott and Mulgrew, 1997). In this way the architecture of the RBF network can be determined.
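A minimal sketch of the interpolation equations of Section 1.1: Gaussian basis functions with centers taken at the training inputs (an assumption of this sketch), and weights obtained by solving Φw = ŷ. The 1-D target function and the kernel width are also illustrative assumptions.

```python
import numpy as np

def gaussian_rbf(r, sigma=0.2):
    """Gaussian kernel phi(r) = exp(-r^2 / (2 sigma^2))."""
    return np.exp(-r**2 / (2.0 * sigma**2))

# N training samples of an unknown 1-D function (illustrative target).
x_train = np.linspace(0.0, 1.0, 9)
y_train = np.sin(2 * np.pi * x_train)

# Interpolation matrix Phi_ij = phi(|x_i - z_j|), centers z_j = training inputs.
Phi = gaussian_rbf(np.abs(x_train[:, None] - x_train[None, :]))
w = np.linalg.solve(Phi, y_train)        # w = Phi^{-1} y_hat

def rbf_net(x):
    """Network output f(x) = sum_i w_i phi(|x - z_i|)."""
    r = np.abs(np.atleast_1d(x)[:, None] - x_train[None, :])
    return gaussian_rbf(r) @ w

print(rbf_net(np.array([0.25, 0.6])))    # approximations of sin(2*pi*x) at these points
```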
2. Regularized RBF Neural Network

Regularization theory tries to find a function hidden in limited data. This is an inverse problem and is usually ill-posed (Yan and Zhang, 2000). In the regularization method a constraint condition is imposed to guarantee the stability of the solution f(x), x = (x1, x2, ..., xn)T. The regularization problem is to find f(x) such that

min E(f) = Σ_{i=1}^{N} (ŷi − f(xi))² / 2 + λ ‖Df‖² / 2,

where N is the number of samples, ŷi is the desired output, and D is a linear differential operator. The Euler equation of this problem is (Poggio and Girosi, 1990)

D*D f(x) = Σ_{i=1}^{N} (ŷi − f(xi)) δ(x − µi) / λ,

where D* is the adjoint operator of D and µi = (µi1, µi2, ..., µin)T. The solution of the regularization problem is

f(x) = Σ_{i=1}^{N} wi G(x, µi),

where G(x, µi) is the Green function of the operator D*D. Suppose that G(x, µi) = G(‖x − µi‖); G is thus an RBF and the solution is

f(x) = Σ_{i=1}^{N} wi G(‖x − µi‖).

The neural network corresponding to this equation is the regularized RBF network with only one hidden layer (Yan and Zhang, 2000; Fig. 1).

Figure 1. A regularized RBF neural network.

The RBF G(‖x − µi‖) can be further normalized as the function of the hidden neurons (Moody and Darken, 1988):

zj(x) = G(‖x − µj‖ / σj²) / Σ_{i=1}^{m} G(‖x − µi‖ / σi²),

where Σ_{j=1}^{m} zj(x) = 1. The regularized RBF network is a universal approximator: it can approximate any multivariable function defined on a compact set of Rn, and it is the best approximator. Moreover, the solution achieved by the regularized network is optimal in the sense that the approximation performance on sampling points and non-sampling points will not be substantially different (Yan and Zhang, 2000).

3. RBF Neural Network Learning

The parameters in RBF neural network learning include µi (RBF centers), σi² (RBF variances), and wi (connection weights), i = 1, 2, ..., N. The center and variance of each RBF can be chosen subjectively. However, the following learning law is usually used (Wettschereck et al., 1992; Yan and Zhang, 2000). First, define the objective function as

E = Σ_{i=1}^{N} ei²,
ei = ŷi − f(xi) = ŷi − Σ_{j=1}^{m} wj G(‖xi − µj‖ cj),

where m is the number of hidden neurons, N is the number of samples, and cj² = 1/(2σj²). The learning law is then

wi(k + 1) = wi(k) − η1 ∂E(k)/∂wi(k) = wi(k) + 2η1 Σ_{j=1}^{N} ej(k) G(‖xj − µi(k)‖ ci),

µi(k + 1) = µi(k) − η2 ∂E(k)/∂µi(k) = µi(k) + 2η2 wi(k) Σ_{j=1}^{N} ej(k) G′(‖xj − µi(k)‖ ci) Σi⁻¹(k) (xj − µi(k)),

Σi⁻¹(k + 1) = Σi⁻¹(k) − η3 ∂E(k)/∂Σi⁻¹(k) = Σi⁻¹(k) − η3 wi(k) Σ_{j=1}^{N} ej(k) G′(‖xj − µi(k)‖ ci) (xj − µi(k))(xj − µi(k))T,

i = 1, 2, ..., m,

where Σi is the covariance matrix and G′ is the derivative function of G. The Matlab functions for the radial basis function network are newrb and newrbe (Mathworks, 2002).

4. Probabilistic Neural Network

4.1. Architecture of probabilistic neural network

The probabilistic neural network (PNN) is a parallel realization of the Bayesian classifier. A PNN has two hidden layers, and an exponential function is used in place of the sigmoid function of the RBF neural network (Specht, 1988, 1990). In the probabilistic neural network the number of pattern neurons equals the number of training samples, and the number of summation neurons equals the number of categories (Yan and Zhang, 2000; Zhang, 2007; Fig. 2). The input of a pattern neuron in the probabilistic neural network is

g(yi) = exp((yi − 1)/σ²).

If σ = c, where c is a constant, the network is a Bayes classifier; if σ = ∞, it tends to a linear classifier; and if σ = 0, the network tends to a nearest-neighbor classifier (Yan and Zhang, 2000). The Matlab function for the probabilistic neural network is newpnn (Mathworks, 2002).

Figure 2. A probabilistic neural network.
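A compact sketch of the probabilistic neural network idea: one pattern unit per training sample, one summation unit per class, and the class with the largest summed activation wins. Here Gaussian kernels of the Euclidean distance replace the dot-product form g(yi) = exp((yi − 1)/σ²) used in the text (the two coincide for inputs and weight vectors normalized to unit length); the toy data and the σ value are illustrative assumptions.

```python
import numpy as np

def pnn_classify(x, X_train, labels, sigma=0.3):
    """PNN-style (Parzen-window) classifier sketch.

    One pattern unit per training sample; one summation unit per class;
    the class with the largest summed kernel activation wins.
    """
    d2 = np.sum((X_train - x) ** 2, axis=1)          # squared distances to all samples
    activations = np.exp(-d2 / (2.0 * sigma ** 2))   # pattern-layer outputs
    classes = np.unique(labels)
    scores = [activations[labels == c].sum() for c in classes]
    return classes[int(np.argmax(scores))]

# Two illustrative classes in the plane.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
labels = np.array([0, 0, 1, 1])
print(pnn_classify(np.array([0.1, 0.0]), X_train, labels))   # expected: 0
print(pnn_classify(np.array([1.0, 0.9]), X_train, labels))   # expected: 1
```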
4.2. Learning of probabilistic neural network

The learning process of the probabilistic neural network is:

(1) Calculate the dot product for each pattern neuron, yi = xT wi, where x is the input vector (a sample) and wi is the weight vector, and obtain the input of each pattern neuron, g(yi) = exp((yi − 1)/σ²).

(2) Choose adequate weights and activation functions of the pattern neurons such that fA(x) and fB(x) represent the distribution density functions of classes A and B, respectively.

(3) Calculate the weighted output.

5. Generalized Regression Neural Network

The generalized regression neural network (GRNN) is a nonparametric regression model developed on the basis of the probabilistic neural network (Specht, 1991). It is usually used for pattern classification. The Matlab function for the generalized regression neural network is newgrnn (Mathworks, 2002).

6. Functional Link Neural Network

The functional link artificial neural network (FLANN) is developed by introducing high-order terms of the input variables into the neural network (Pao et al., 1992; Yan and Zhang, 2000; Fig. 3). It has been successfully used in ecological research (Zhang, 2007; Zhang et al., 2008).

Figure 3. Functional link artificial neural network.

Suppose that φi(x), i = 1, 2, ..., p, are a set of basis functions with the following properties:

(1) the φi are linearly independent;
(2) sup_j (Σ_{i=1}^{j} ‖φi‖²)^{1/2} < ∞.

The practical output of the jth output node is thus

yj = ρ(sj(x)),
sj(x) = Σ_{i=1}^{p} ωji φi(x),   j = 1, 2, ..., m,

where x = (x1, x2, ..., xn)T is the input vector, y = (y1, y2, ..., ym)T is the practical output vector, ωj = (ωj1, ωj2, ..., ωjp)T is the weight vector for node j, j = 1, 2, ..., m, ρ(·) is a nonlinear function, and p is the number of basis functions.

Given K training samples {xk, yk}, k = 1, 2, ..., K, where xk = (x1k, x2k, ..., xnk)T and yk = (y1k, y2k, ..., ymk)T, the value sk under the inverse function s(·) of the nonlinear function ρ(·) should be computed for each added sample. The network matrix equation is

Φ WT = S,

where Φ = (φ(x1), φ(x2), ..., φ(xK))T is a K × p matrix, S = (s1, s2, ..., sK)T is a K × m matrix, W = (ω1, ω2, ..., ωm)T is an m × p matrix, sk = (s1(xk), s2(xk), ..., sm(xk)), and φ(xk) = (φ1(xk), φ2(xk), ..., φp(xk)), k = 1, 2, ..., K. The analytical solution of the matrix equation above is the weight matrix (Zhang, 2007; Zhang et al., 2008)

W = ((ΦT Φ)⁻¹ ΦT S)T.

The RBF neural network and the wavelet neural network are special cases of FLANN. The Matlab function for the functional link artificial neural network is flann (Zhang et al., 2008).
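A sketch of the functional link network above with polynomial basis functions and a linear (identity) output nonlinearity ρ, so that S equals the desired outputs and the weight matrix follows from the least-squares formula W = ((ΦTΦ)⁻¹ΦTS)T. The basis choice, the toy data, and the use of a numerically more stable least-squares solver are illustrative assumptions of this sketch.

```python
import numpy as np

# Basis functions phi_i(x): low-order polynomial terms of a scalar input (illustrative).
def phi(x):
    return np.array([1.0, x, x**2, x**3, x**4, x**5])

# K training samples of a 1-D target (illustrative); rho = identity, so S = desired outputs.
x_train = np.linspace(-1.0, 1.0, 20)
S = np.sin(np.pi * x_train)[:, None]               # K x m matrix (m = 1)

Phi = np.array([phi(x) for x in x_train])          # K x p basis matrix
# W = ((Phi^T Phi)^{-1} Phi^T S)^T, computed here via a least-squares solver.
W = np.linalg.lstsq(Phi, S, rcond=None)[0].T       # m x p weight matrix

def flann_output(x):
    return W @ phi(x)                              # y_j = sum_i w_ji phi_i(x)

print(flann_output(0.5))                           # close to sin(pi * 0.5) = 1
```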
7. Wavelet Neural Network

7.1. Principles of wavelet neural network

Generally, training samples are not evenly distributed in the input space. To fully mine the data information, the wavelet neural network (WNN) can be adopted. Multiresolution learning, in which a higher resolution is applied to data-dense zones, a lower resolution is applied to data-sparse zones, and the various learning results are finally combined, is an efficient method used in WNN (Liang and Page, 1997).

A WNN is generated by using orthonormal wavelets as the basis functions in the FLANN or RBF neural network. WNN has been successfully used in the prediction of time series. There are two types of units in the hidden layer of a WNN (Liang and Page, 1997; Yan and Zhang, 2000; Fig. 4):

(1) Units φLk(x) (k = 1, 2, ..., nL) of the scaling function φ(x). The orthonormal functions φLk(x), k = 1, 2, ..., nL, construct the approximation to the unknown function under different displacements when the coarsest resolution is L, where

φmk(x) = (2⁻m)^(1/2) φ(2⁻m x − k),

φ is an orthogonal function, nL is the number of data points under the coarsest resolution, m = 1, 2, ..., L, and k = 1, 2, ..., 2^(L−m) nL.

(2) Units ψmk(x), m = 1, 2, ..., L; k = 1, 2, ..., nm, of the wavelet function ψ(x). These are orthonormal functions describing the details of a continuous square integrable function F(x) ∈ L2(R), where

ψmk(x) = (2⁻m)^(1/2) ψ(2⁻m x − k),

ψ is an orthogonal function, m = 1, 2, ..., L, and k = 1, 2, ..., 2^(L−m) nL.

Figure 4. Wavelet network under different resolutions. (a) Units φLk (k = 1, 2, ..., s) of scaling function φ under resolution L. (b) Units ψLk (k = 1, 2, ..., s) of the wavelet function are added. (c) Units ψL−1,k (k = 1, 2, ..., T) of the wavelet function are added.

The orthogonal functions φmk(x) and ψmk(x) are also mutually orthogonal. The scaling function φ(x) and the wavelet function ψ(x) may be of the same type and may take one of the following forms: Legendre polynomial, Chebyshev polynomial, Laguerre polynomial, Hermite polynomial, and trigonometric functions (Li et al., 1996; Burden and Faires, 2001).

Suppose the function F(x) ∈ L2(R) is unknown, and only a set of discrete samples of F(x) is available, a0k, k = 1, 2, ..., n0, which is the experimental approximation to F(x) under the coarsest resolution. The following recursion formula can be used (Liang and Page, 1997):

FL(x) = Σ_{k=1}^{nL} aLk φLk(x),
Fm−1(x) = Fm(x) + Σ_{k=1}^{2^(L−m) nL} dmk ψmk(x),   m = L, L − 1, ..., 1,

where

amk = Σ_{k=1}^{2^(L−m) nL} hk am−1,k,
dmk = Σ_{k=1}^{2^(L−m) nL} gk am−1,k,   m = 1, 2, ..., L,

and

hk = ∫_{−∞}^{+∞} φm−1,k(x) φmk(x) dx,
gk = ∫_{−∞}^{+∞} φm−1,k(x) ψmk(x) dx.

Fm−1(x) is the approximation to F(x) when the resolution is m − 1. The error of the above approximation is

Σ_{m=1}^{L} Σ_{k=1}^{2^(L−m) nL} dmk².

7.2. Wavelet neural network learning

The learning procedure of a WNN is (Yan and Zhang, 2000):

(1) Construct a multiresolution coefficient grid for every dimension of the input variables. The grid interval is equal to the sampling interval for every dimension at the highest resolution, i.e., m = 0, while there are only two data points at the coarsest resolution (m = L).

(2) Train the units of the scaling function φ.

(3) Add appropriate units of the wavelet function if the error criterion is not met. Wavelet units are added where the sample space requires higher precision.

(4) Delete the wavelet units with small weights and test the network using new data.

 CHAPTER 5 

BP Neural Network

Feedforward networks are used to approximate nonlinear functions. However, their learning is difficult due to the existence of hidden layers.
The backpropagation (BP) algorithm, a gradient descent algorithm and an extension of the LMS algorithm, makes feedforward network learning easier. The BP neural network is a multilayer feedforward network with one-way signal propagation (Rumelhart and McClelland, 1986; Fecit, 2003). The BP algorithm was initially constructed by Werbos (1974), and later improved by Rumelhart et al. (1986), Parker (1985), and Le Cun (1985). Currently BP is the most widely used neural network (Zhang and Barrion, 2006; Zhang, 2007; Zhang et al., 2008), with applications in pattern recognition (Zhang et al., 2008), function approximation (Zhang and Barrion, 2006; Zhang, 2007), data compression, etc.

1. BP Algorithm

A BP neural network is composed of an input layer, an output layer and one or more hidden layers. No between-neuron connections exist within the same layer. The transfer functions of the output layer are usually linear, and (continuously differentiable) sigmoid transfer functions are used for the input and hidden layers (Haykin, 1994; Hagan et al., 1996; Fecit, 2003; Fig. 1).

Figure 1. Architecture of BP neural network.

Two signals flow in opposite directions between neurons or units in different layers (Yan and Zhang, 2000; Fig. 1):

(1) Working signal. This signal is the information flow from the input layer to the output layer. It is a function of the inputs and the weights.

(2) Error signal. The error, i.e., the difference between the practical and desired outputs, is propagated layer by layer from the output layer back to the previous layers.

The learning procedure consists of two parts, forward propagation and backward propagation (Fecit, 2003). In the forward propagation, the neurons of each layer only exert influence on the neurons of the next layer. If there is an error between the desired and practical outputs, the backward propagation starts to function. In the backward propagation, the error signal flows back along the original path, and the weights of every layer are modified until the input layer is reached. Forward and backward propagation are conducted repeatedly to minimize the error signal.

The BP algorithm can be summarized as follows (Churing, 1995; Yan and Zhang, 2000; Fig. 2):

Figure 2. Flow diagram of BP algorithm.

(1) Perform initialization. Choose a suitable network architecture and set all weights and thresholds to small uniformly distributed values.

(2) Conduct the following computation for every input sample:

(a) Feedforward computation. For neuron j of layer l, we have

u^l_j(k) = Σ_{i=0}^{p} w^l_ji(k) y^{l−1}_i(k),

where y^{l−1}_i(k) is the working signal transferred from unit i in layer l − 1 and p is the number of connections to unit j. If the transfer function of neuron j is a sigmoid function, then

y^l_j(k) = 1/(1 + exp(−u^l_j(k))),
φ′(u^l_j(k)) = ∂y^l_j(k)/∂u^l_j(k) = y^l_j(k)(1 − y^l_j(k)).

If neuron j is in layer 1, i.e., l = 1, set y^0_j(k) = xj(k); if neuron j is in the output layer s, i.e., l = s, set y^s_j(k) = Oj(k) and ej(k) = ŷj(k) − Oj(k), where ŷj(k) is the desired output.

(b) Backward computation. For hidden neurons and output units, we have, respectively,

δ^l_j(k) = y^l_j(k)(1 − y^l_j(k)) Σ_i δ^{l+1}_i(k) w^{l+1}_ij(k),
δ^s_j(k) = ej(k) Oj(k)(1 − Oj(k)).

(3) Modify the weights:

w^l_ji(k + 1) = w^l_ji(k) + η δ^l_j(k) y^{l−1}_i(k).

(4) Set k = k + 1 and input a new sample, until the error

Σ_{k=1}^{N} Σ_{j=1}^{m} ej(k)² / (2N)

is lower than the desired value, where m is the number of units in the output layer, N is the total number of samples, and ej(k) = ŷj(k) − yj(k).
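A compact sketch of the procedure above for a two-layer network (sigmoid hidden layer, linear output layer), using the δ terms of step (2b). The toy regression task, layer sizes, learning rate and epoch count are illustrative assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative task: approximate y = sin(x) on [-pi, pi] with a 1-8-1 network.
X = np.linspace(-np.pi, np.pi, 50)[:, None]
Y_hat = np.sin(X)

w1 = rng.normal(scale=0.5, size=(8, 1)); b1 = np.zeros((8, 1))
w2 = rng.normal(scale=0.5, size=(1, 8)); b2 = np.zeros((1, 1))
eta = 0.05                                   # learning rate

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

for epoch in range(3000):
    for x, t in zip(X, Y_hat):
        x = x[:, None]; t = t[:, None]
        # Forward propagation (working signal)
        o = sigmoid(w1 @ x + b1)             # hidden-layer output
        y = w2 @ o + b2                      # linear output layer
        # Backward propagation (error signal)
        e = t - y
        delta2 = e                           # delta of the linear output unit
        delta1 = o * (1 - o) * (w2.T @ delta2)   # delta of the sigmoid hidden units
        # Weight modification: w <- w + eta * delta * (previous layer's output)
        w2 += eta * delta2 @ o.T; b2 += eta * delta2
        w1 += eta * delta1 @ x.T; b1 += eta * delta1

pred = np.array([(w2 @ sigmoid(w1 @ xi[:, None] + b1) + b2).item() for xi in X])
print(np.mean((Y_hat.ravel() - pred) ** 2))  # mean squared training error
```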
2. BP Theorem

BP can be regarded as a nonlinear mapping from the input space to the output space: $F: R^n \to R^m$, $f(X) = Y$. Given the input and output sets $x^i \in R^n$, $y^i \in R^m$, where $x^i = (x_1^i, x_2^i, \ldots, x_n^i)^T$ and $y^i = (y_1^i, y_2^i, \ldots, y_m^i)^T$, there is a mapping g with $g(x^i) = y^i$, $i = 1, 2, \ldots, N$. It is desirable to obtain a mapping $f \in F = \{f \mid f: R^n \to R^m\}$ which is an optimal approximation to g (Fecit, 2003; Zhang and Barrion, 2006; Zhang et al., 2008).

The Kolmogorov theorem described in Chap. 2 did not provide a method to construct a three-layer feedforward network that approximates any continuous function. The following BP theorem addresses this problem.

BP Theorem. Given any $L^2$ function $f: [0, 1]^n \to R^m$, there is a three-layer BP network that can approximate f with arbitrary square error ε.

Using three-layer BP networks may require a large number of hidden neurons; multilayer BP networks are therefore widely used. In a multilayer BP network, the numbers of hidden neurons and layers are usually determined by empirical methods (Zhang and Barrion, 2006; Zhang, 2007; Zhang et al., 2008).

The Matlab function for the BP neural network is newff (Mathworks, 2002).

3. BP Training

The following learning procedure is for two-layer BP networks, using the Matlab toolbox (Fecit, 2003):

(1) Initialize the weights and thresholds (biases) of all layers with small uniformly distributed random values; set the desired error threshold ERTHR, the maximum number of epochs, and the learning rate LR.
(2) Compute every layer's outputs o, y, and the error e:

o = tansig(w1*x, b1),
y = purelin(w2*o, b2),
e = ŷ − y.

(3) Compute the error changes δ2 and δ1 of every layer in backpropagation and the weight modifications of every layer:

δ2 = deltalin(y, e),
δ1 = deltatan(o, δ2, w2),
[dw1, db1] = learnbp(x, δ1, LR),
[dw2, db2] = learnbp(o, δ2, LR),
w1 = w1 + dw1, w2 = w2 + dw2,
b1 = b1 + db1, b2 = b2 + db2.

(4) Compute the sum of squared errors

sse = sumsqr(ŷ − purelin(w2*tansig(w1*x, b1), b2)).

If sse < ERTHR, or the maximum number of epochs is reached, stop training; otherwise return to (2).

The above training is represented by the Matlab function trainbp.
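The procedure above uses functions from an early version of the Matlab Neural Network Toolbox. With the newff function cited in Sec. 2, the same kind of two-layer network can be created and trained in a few lines; the data, layer sizes and training goal below are illustrative assumptions, not values from the text.

% Hedged example: two-layer BP network via the toolbox function newff (Mathworks, 2002)
x = rand(3, 50); yhat = sum(x.^2, 1);              % hypothetical training data
net = newff(minmax(x), [10 1], {'tansig','purelin'}, 'trainlm');
net.trainParam.epochs = 1000;                      % maximum epochs
net.trainParam.goal = 1e-4;                        % desired error threshold
net = train(net, x, yhat);                         % train the network
y = sim(net, x);                                   % simulate the trained network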
4. Limitations and Improvements of BP Algorithm

A major problem of the basic BP algorithm is its slow rate of convergence. Using the basic BP algorithm may sometimes consume several weeks of computation time. Moreover, there are local minimum points in the goal function of the BP algorithm, and a BP network with more than three layers has a high possibility of encountering the local minimum issue. The issue of local minimum points can be addressed by genetic algorithms (Van Rooij et al., 1996), global optimization (Shang and Wah, 1996), the homotopy method (Gao and Yang, 1996), etc. The convergence rate and the local minimum issue can also be improved by taking the following measures:

(1) Add a momentum term (Rumelhart et al., 1986; Vogl et al., 1988). In the BP algorithm, a larger step η results in instability and a smaller η results in a slow convergence rate. As a consequence a momentum term, α (0 < α < 1), can be added in the algorithm:

$\Delta w_{ji}(k) = \alpha\,\Delta w_{ji}(k-1) + \eta\,\delta_j(k)\,y_i(k)$.

(2) Samples should be presented in a random order (Yan and Zhang, 2000).
(3) The dimension of the input variables should be compressed before use in BP training. For a limited number of samples, the compression of variable dimension is important in order to guarantee the generalization performance of the trained neural network.
(4) Replace the gradient descent method with the conjugate gradient algorithm or the Levenberg–Marquardt algorithm in the BP algorithm (Shanno, 1990; Charalambous, 1992; Hagan and Menhaj, 1994).
(5) Use small uniformly distributed random values as initial values of weights and thresholds. If all weights are zeros or of the same value, there will be no differences between hidden neurons and the computation will not be able to start.
(6) Change the learning rate during the training process (Jacobs, 1988; Tollenaere, 1990).
(7) Batch processing. Network parameters are modified only after the entire training data set has been delivered.

CHAPTER 6
Self-Organizing Neural Networks

Neural networks improve their functionality by supervised (training samples) or unsupervised learning from the environment. Self-organizing networks are neural networks with unsupervised learning. They are used to find hidden laws and relationships in redundant data, and they adjust themselves to adapt to the environment (Fecit, 2003).

There are different architectural designs for self-organizing neural networks (Kohonen, 1982, 1988, 1990, 1995). A self-organizing network may be composed of two layers, i.e., an input layer and an output layer, with between-layer forward connections and lateral connections within the output layer (Zhang and Li, 2007; Fig. 1). A self-organizing network may also be a multilayer feedforward network in which the self-organizing process is performed layer by layer (Yan and Zhang, 2000).

In a self-organizing network, the changes of connection weights relate only to the states of neighborhood neurons (Yan and Zhang, 2000). This is called local interaction. Stochastic local interactions may result in a global order. Local interactions in a self-organizing network follow three principles (Von der Malsburg, 1990): (1) connection strength tends to be strengthened by itself (Hebb rule); (2) all neurons compete with each other, and the strongest neuron is activated while the other neurons are inhibited (winner-take-all); (3) there is coordination among the neurons, because a single neuron cannot function by itself.

There are several types of self-organizing neural networks, such as the self-organizing feature map network, self-organizing competitive learning network, learning vector quantization (LVQ) network, Hamming network, adaptive resonance theory network, etc.

Figure 1. Lateral connections of neurons in the self-organizing neural network.

1. Self-Organizing Feature Map Neural Network

1.1. Principle and algorithm of self-organizing feature map neural network

In a self-organizing feature map (SOM, or SOFM) neural network, the neurons in a layer learn to represent different regions of a sample space and neighboring neurons learn to respond to similar samples (Kohonen, 1982, 1995). This network can learn both the topology of the sample space and the distribution of samples (Song et al., 2007; Zhang, 2007; Zhang and Li, 2007; Fig. 2). A competitive network is used to classify these samples into natural classes (Mathworks, 2002).
SOM can be further used to recognize additional samples. SOM is a mapping from a higher-dimensional space to a 2-dimensional space (a 2-dimensional curved surface in an n-dimensional space, i.e., a principal surface) or a 1-dimensional space (a curve in an n-dimensional space, i.e., a principal curve). In a sense SOM is a generalization of PCA and thus has the functionality of feature extraction. Nevertheless, the mapping is not unique; it depends on the sample sequence and the initial weights (Bian and Zhang, 2000).

The principle of SOM is described below (Kohonen, 1982, 1988, 1990, 1995). Suppose the input is x = (x1, x2, ..., xn)^T, and the weight vector of neuron j is wj = (wj1, wj2, ..., wjn)^T, j = 1, 2, ..., N. First, input a sample x and find the neuron i (winner neuron) that matches x optimally, i.e., find the maximum of wj^T x. If all wj are normalized to a fixed Euclidean norm, find

$i(x) = \arg\min_j \|x - w_j\|$,  j = 1, 2, ..., N.

Second, determine the neighborhood δi(k) of the winner neuron i, which changes as the number of iterations increases. Finally, determine the rule for changing the connection weights.

Figure 2. Architecture of the self-organizing feature map network.

The algorithm of SOM can be summarized as follows (Yan and Zhang, 2000); a minimal coded sketch is given at the end of this section:

(1) Initialize the weights wj(0), j = 1, 2, ..., N, with small random values.
(2) Choose a sample x randomly from the sample set and use x as the input.
(3) For step k, find the neuron i that matches x optimally, i.e.,

$i(x) = \arg\max_j w_j^T x$,  j = 1, 2, ..., N,

or

$i(x) = \arg\min_j \|x - w_j\|$,  j = 1, 2, ..., N.

(4) Determine the neighborhood δi(k) of the winner neuron i.
(5) Modify the weights as follows:

$w_j(k+1) = w_j(k) + \eta(k)[x(k) - w_j(k)]$,  j ∈ δi(k),
$w_j(k+1) = w_j(k)$,  j ∉ δi(k).

(6) Stop training if a desirable network is achieved; otherwise let k = k + 1 and return to (2).

Choosing an appropriate neighborhood δi(k) and learning rate η(k) is important for obtaining a well-trained network (Kohonen, 1988, 1990). In general η(k) should be around 1.0 in the first 1000 iterations and decrease (but not below 0.1) thereafter. For a 2-dimensional SOM the neighborhood δi(k) may be taken as a square, a hexagon, etc. At the beginning δi(k) is large enough even to include all neurons, and it then shrinks step by step until it includes only one or two adjacent neurons. During the convergence stage δi(k) includes only the nearest neuron or even the winner itself.

The SOM algorithm has been improved in different ways, yielding various variants. Amari (1983) developed a theory of self-organizing neural fields and approached the problem of continuous SOM. SOM can also be treated as nonparametric regression, or constrained topological mapping (Cherkassky and Lari-Najafi, 1991): data are adaptively divided into different regions and simpler functions are used to conduct the approximation in each region.

The Matlab function for the SOM neural network is newsom (Mathworks, 2002).

1.2. Performance of self-organizing feature map

SOM performs as follows:

(1) Topology is preserved in the mapping. Points with similar features in the input space are adjacent in the mapped space.
(2) An area with greater distribution density in the original space corresponds to a larger area in the mapped space.
(3) Cluster centers are used to represent the original inputs, as a kind of data compression.
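The following is the minimal sketch referred to above: a 1-dimensional SOM trained on 2-dimensional data using basic Matlab only. The map size, learning-rate schedule and neighborhood radius are illustrative assumptions, not values prescribed by the text.

% Minimal 1-D SOM sketch (illustrative data, map size and schedules)
n = 2; N = 500; M = 20;                          % input dimension, samples, map neurons
x = rand(n, N);                                  % hypothetical sample set
w = rand(n, M);                                  % initial weights: small random values
K = 4000;                                        % total training steps
for k = 1:K
    eta = max(0.1, 1 - k/K);                     % learning rate decreases, not below 0.1
    r   = max(1, round((M/2)*(1 - k/K)));        % neighborhood radius shrinks step by step
    xk  = x(:, ceil(N*rand));                    % choose a sample randomly
    d   = sum((w - repmat(xk, 1, M)).^2, 1);     % squared distances to all neurons
    [dmin, i] = min(d);                          % winner neuron i(x) = arg min ||x - w_j||
    for j = max(1, i-r):min(M, i+r)              % neighborhood of the winner on the 1-D map
        w(:, j) = w(:, j) + eta*(xk - w(:, j));  % w_j(k+1) = w_j(k) + eta*(x - w_j(k))
    end
end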
2. Self-Organizing Competitive Learning Neural Network

In self-organizing competitive learning neural networks, the neurons in a competitive layer learn to represent different regions of a sample space. This network can thus learn the distribution of samples. A competitive network may be used to classify these samples into natural classes (Mathworks, 2002). In this network, the competitive layer is viewed as a classifier and each neuron in this layer corresponds to a different category. Additional samples can be recognized using the trained network (Zhang, 2007; Watts and Worner, 2009).

The Matlab function for the self-organizing competitive learning neural network is newc (Mathworks, 2002).

3. Hamming Neural Network

The Hamming network is a simple competitive neural network that is used to find the class that has the smallest Hamming distance to the existing vectors. It is composed of two layers, i.e., a feedforward layer and a recursive layer. The feedforward layer links the input vector with the prototype vectors, and the recursive layer (output layer) uses a competitive algorithm, by lateral inhibition among the neurons in this layer, to find out which prototype vector is nearest to the input vector (Hagan et al., 1996; Fig. 3).

Figure 3. Hamming neural network.

The algorithm for the Hamming network is described as follows (Yan and Zhang, 2000). Suppose there are m prototype vectors $s^j = (s_1^j, s_2^j, \ldots, s_n^j)^T$, and the weight vectors of the feedforward layer are $w_j^1 = (w_{j1}^1, w_{j2}^1, \ldots, w_{jn}^1)^T$, $j = 1, 2, \ldots, m$. The Hamming distance between the input $x = (x_1, x_2, \ldots, x_n)^T$ and $s^j$ is

$d_h(x, s^j) = (n - x^T s^j)/2$.

Let $w^1 = (s^1, s^2, \ldots, s^m)^T/2$, and let the threshold be n/2. The activation function of the neurons is $f(u_j) = u_j/n$, $j = 1, 2, \ldots, m$, where $u_j = n - d_h(x, s^j)$. For the weight matrix $w^2 = (w_{ij}^2)$ of the recursive layer, $w_{ij}^2 = -\varepsilon$ if $i \ne j$, and $w_{ij}^2 = 1$ if $i = j$. The output is

$y(k+1) = \Lambda(w^2 y(k))$,

where $\Lambda$ is a diagonal operator whose element is $f(u) = u$ if $u \ge 0$, and $f(u) = 0$ if $u < 0$.

4. WTA Neural Network

The winner-take-all (WTA) network is a simple competitive neural network (Fig. 4).

Figure 4. Architecture of the WTA neural network.

Suppose the weight vectors are $w_j = (w_{j1}, w_{j2}, \ldots, w_{jn})^T$, $j = 1, 2, \ldots, m$. The algorithm is:

(1) Normalize the weights, $\hat{w}_j = w_j/\|w_j\|$, $j = 1, 2, \ldots, m$.
(2) Choose the $w_l$ that maximizes $\hat{w}_j^T x$.
(3) Modify the weights:

$\hat{w}_l(k+1) = \hat{w}_l(k) + \eta(k)[x - \hat{w}_l(k)]$,
$\hat{w}_j(k+1) = \hat{w}_j(k)$,  j ≠ l.

(4) Stop training if the desired performance is reached; otherwise repeat the training process.
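A minimal Matlab sketch of the WTA competition and update above follows. The data, the number of competitive neurons and the learning rate are illustrative assumptions; re-normalizing the winner after each update is an added convenience rather than part of the listed steps.

% Minimal winner-take-all (WTA) sketch with normalized weights (illustrative only)
n = 4; m = 3; N = 300; eta = 0.1;
x = rand(n, N);                                  % hypothetical sample set
w = rand(n, m);
for j = 1:m
    w(:, j) = w(:, j)/norm(w(:, j));             % step (1): normalize the weight vectors
end
for k = 1:2000
    xk = x(:, ceil(N*rand));                     % present a random sample
    [smax, l] = max(w'*xk);                      % step (2): winner l maximizes w_j' * x
    w(:, l) = w(:, l) + eta*(xk - w(:, l));      % step (3): move the winner toward the sample
    w(:, l) = w(:, l)/norm(w(:, l));             % keep the winner's weights normalized
end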
5. LVQ Neural Network

The learning vector quantization (LVQ) network is a competitive neural network (Kohonen, 1990). It is a mixed network that generates a classification through both supervised and unsupervised learning. The LVQ network is composed of two layers: a competitive layer and a linear layer. In the competitive layer several neurons are assigned to the same class, and each class is assigned to a neuron in the linear layer. The competitive layer learns a classification (subclasses) of the input vectors, and the linear layer transforms the classification generated in the competitive layer into the categories (target classes) defined by the user. The number of neurons in the competitive layer is usually larger than that in the linear layer.

The algorithm of the LVQ network is (Yan and Zhang, 2000):

(1) Find the neuron i with the largest output after a vector x is delivered to the network.
(2) Given that x belongs to class p, and neuron i was assigned to class q in the last learning step, modify the weights:

$w_i(k+1) = w_i(k) + \eta(k)[x(k) - w_i(k)]$,  if p = q,
$w_i(k+1) = w_i(k) - \eta(k)[x(k) - w_i(k)]$,  if p ≠ q,
$w_j(k+1) = w_j(k)$,  j ≠ i.

Both LVQ and SOM are able to conduct clustering, but their performance depends largely on the initial values. Some new algorithms for LVQ have been developed, e.g., the GLVQ network (Pal et al., 1993).

The Matlab function for the LVQ neural network is newlvq (Mathworks, 2002).

6. Adaptive Resonance Theory

A key issue for competitive networks is how to guarantee the stability of the learning process while at the same time maintaining a high adaptability; the weight matrix may not converge in many cases. Adaptive resonance theory (ART) is an improvement and extension of competitive learning, developed to solve the stability/adaptability dilemma (Carpenter and Grossberg, 1987, 1990; Carpenter et al., 1991). ART realizes self-stabilizing and self-organizing recognition of complex environmental patterns. There are three types of ART, i.e., ART1, ART2, and ART3. They are used for Boolean inputs, continuous inputs, and hierarchical search, respectively. ART may adapt to nonstationary environments and can learn in real time. It has a stable recognition performance on learned objects and quickly adapts to new objects that have never been learned.

There are two layers — an input layer and an output layer — in ART networks (Hagan et al., 1996; Fig. 5). The input layer compares the input pattern with the desired pattern returned from the output layer. If these patterns cannot match each other, the output layer is reset: the winner neuron and the desired pattern are cancelled and a new round of competition is performed in the output layer. The new winner neuron of the output layer returns a desired pattern to the input layer. This process continues until the input and desired patterns match each other. If they match, the input layer generates a new prototype pattern by combining the desired and input patterns.

Figure 5. General architecture of ART neural networks (Hagan et al., 1996).

Suppose there are n neurons in the input (Boolean input) layer and m neurons in the output layer, $\dot{w}_{ij}$ are the connection weights from the input layer to the output layer, and $\ddot{w}_{ji}$ are the connection weights from the output layer to the input layer, i = 1, 2, ..., n; j = 1, 2, ..., m. The algorithm of ART1 is (Carpenter and Grossberg, 1987; Yan and Zhang, 2000):

(1) Initialize the connection weights. Assign values $\dot{w}_{ij} = a_j$, i = 1, 2, ..., n; j = 1, 2, ..., m, where $a_m < \cdots < a_1 < 1/(\alpha + n)$, $0 < \alpha \ll 1$, and $\ddot{w}_{ji} = 1$, j = 1, 2, ..., m; i = 1, 2, ..., n.
(2) Input a Boolean sample x, x = (x1, x2, ..., xn)^T.
(3) Compute the weighted value for every output neuron:

$y_j = \sum_{i=1}^{n} \dot{w}_{ij}\,x_i$,  j = 1, 2, ..., m.

(4) Choose the winner neuron K with the WTA network, $K = \arg\max_j y_j$.
(5) If

$\sum_{i=1}^{n} \ddot{w}_{Ki}\,x_i \Big/ \sum_{i=1}^{n} x_i < \beta$,

where β is the warning (vigilance) threshold, then the input and the desired patterns do not match each other; otherwise continue with the following steps.
(6) For the winner neuron K, modify the weights:

$\ddot{w}_{Ki} = \ddot{w}_{Ki}\,x_i$,  $\dot{w}_{iK} = \ddot{w}_{Ki}\Big/\Big(\alpha + \sum_{i=1}^{n} \ddot{w}_{Ki}\,x_i\Big)$,  i = 1, 2, ..., n,

while the weights of the other neurons (j ≠ K) remain unchanged, $\dot{w}_{ij} = \dot{w}_{ij}$, $\ddot{w}_{ji} = \ddot{w}_{ji}$.

(7) Return to step (2) to input a new sample.

CHAPTER 7
Feedback Neural Networks

Most feedforward neural networks are learning-type networks and do not possess dynamic behaviors. Different from feedforward networks, a feedback neural network converges to a stable state, and associative memory can thus be achieved through the transition of the states of neurons (Fig. 1). Because of the feedback mechanism, feedback neural networks are nonlinear dynamic systems (Fecit, 2003). Their nonlinear behaviors are diverse and complex. Nonlinear properties like stability, attractors, chaos, unpredictability, randomness, etc., are attractive topics in research on feedback networks.

In a feedback neural network, all neurons or units are the same and can connect to one another (Yan and Zhang, 2000; Fig. 1). A feedback neural network is generated from a two-layer feedforward network in which the numbers of neurons in the input layer and the output layer are equal, if each output is directly connected to the corresponding input (Bian and Zhang, 2000). The most used feedback neural networks include the Elman network and Hopfield networks.

Figure 1. Architecture of feedback neural networks.

1. Elman Neural Network

The Elman neural network consists of several layers. The first layer has weights coming from the input, and each subsequent layer has weights coming from the previous layer. All layers except the last one have a recursive weight. The last layer is the network output (Mathworks, 2002; Fig. 2).

Figure 2. Architecture of Elman neural network.

The Elman network is a nonlinear system. It can be used to approximate any function with the desired accuracy in a definite time (Zhang, 2007). It may store information for future use. This model is able to learn not only spatial patterns but also temporal patterns (Hagan et al., 1996; Fecit, 2003).

The Matlab function for the Elman neural network is newelm (Mathworks, 2002).

2. Hopfield Neural Networks

Hopfield neural networks possess the general properties of feedback or recursive networks (Hopfield, 1982, 1984; Fecit, 2003). Additionally, they meet the following conditions (Bian and Zhang, 2000): (1) the connection weights are symmetrical (wij = wji) and the weight matrix is thus a symmetrical matrix; (2) there is no self-feedback of neurons (wii = 0). Because of the symmetry, Hopfield neural networks are stable and there are only isolated attractors.

Hopfield neural networks are used to store one or more stable target vectors. These vectors may be evoked at a certain time. They are also used for optimization, linear programming, A/D transformation, and pattern recognition (Hopfield and Tank, 1985; Hagan et al., 1996). Hopfield neural networks are designed to store some equilibrium points: given an initial condition, the network converges to these points.

2.1. Discrete Hopfield neural network

Suppose that a discrete Hopfield neural network (DHNN) is an n-order network (Fig. 3), in which $W_{n\times n}$ is the (symmetrical) weight matrix, T is an n-dimensional vector, and $T_i$ is the threshold of neuron i. Neurons may only take the states 1 or −1.
The state equation of the discrete Hopfield network is a group of nonlinear difference equations:

$x_i(t+1) = \mathrm{sgn}\Big(\sum_{j=1}^{n} w_{ij}\,x_j(t) - T_i\Big)$,  i = 1, 2, ..., n,

where $x_j(t)$ is the state of neuron j at time t, and t is a positive integer. The network state is $x(t) = (x_1(t), x_2(t), \ldots, x_n(t))^T$, $x_i(t) \in \{1, -1\}$.

Figure 3. Architecture of discrete Hopfield neural network.

Discrete Hopfield networks work in two ways (Bian and Zhang, 2000):

(1) Asynchronous way. Only one neuron changes its state at a time; the other neurons do not change their states.
(2) Synchronous way. Several neurons change their states simultaneously at a time.

If there is a time t such that x(t + Δt) = x(t) for any Δt > 0, the state equation is stable. A Lyapunov function, i.e., an energy function, is defined to analyze the stability of the network:

$E(t) = \sum_{i=1}^{n} T_i\,x_i(t) - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,x_i(t)\,x_j(t)$.

For asynchronous operation, the change of the energy function when $x_i$ changes its state is

$\Delta E(t) = -[x_i(t+1) - x_i(t)]\Big(\sum_{j \ne i} w_{ij}\,x_j(t) - T_i\Big)$.

It is obvious that the energy function declines whenever the state of the network changes. The energy function will reach a minimum, i.e., an equilibrium state, because the network has only $2^n$ states. Due to the nonlinearity, the system has multiple isolated equilibrium states (isolated attractors). For synchronous operation, the attractors are isolated attractors or oscillation cycles of length 2.

2.2. Continuous Hopfield neural network

The architecture of a continuous Hopfield neural network (CHNN) is analogous to that of a discrete network (Hopfield, 1984). The changes of the states of neurons can be represented by a group of nonlinear differential equations:

$C_i\,dx_i/dt = \sum_{j=1}^{n} w_{ij}\,y_j - x_i/R_i + I_i$,  $y_i = g(x_i)$,  i = 1, 2, ..., n.

The energy function is

$E(t) = -\sum_{i=1}^{n} I_i\,y_i(t) - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,y_i(t)\,y_j(t) + \sum_{i=1}^{n}\frac{1}{R_i}\int_0^{y_i} g^{-1}(y)\,dy$.

If $g^{-1}(\cdot)$ is a monotonic and continuous function, $C_i > 0$, and $w_{ij} = w_{ji}$, then $dE/dt \le 0$.
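A minimal sketch of the asynchronous update rule of the discrete network (Sec. 2.1), written in basic Matlab, is given below. The stored pattern, the outer-product (Hebb) construction of the weights, and the zero thresholds are illustrative assumptions rather than prescriptions of the text.

% Minimal discrete Hopfield sketch: store one pattern, recall from a noisy probe
n = 8;
s = 2*(rand(n,1) > 0.5) - 1;                     % hypothetical stored pattern (entries 1/-1)
W = s*s' - eye(n);                               % outer-product (Hebb) weights, no self-feedback
T = zeros(n,1);                                  % thresholds set to zero for simplicity
x = s; x(1:2) = -x(1:2);                         % probe: stored pattern with two flipped bits
for t = 1:50
    i = ceil(n*rand);                            % asynchronous mode: pick one neuron at a time
    u = W(i,:)*x - T(i);
    if u >= 0, x(i) = 1; else x(i) = -1; end     % x_i(t+1) = sgn(sum_j w_ij x_j(t) - T_i)
end
E = sum(T.*x) - 0.5*x'*W*x;                      % energy of the final state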
3. Simulated Annealing

The neural networks discussed so far assume that signals transmit deterministically between neurons. In real systems, however, signal transmission is disturbed by noise and the states of neurons change randomly (Yan and Zhang, 2000), i.e., s = 1 with probability P(v), and s = −1 with probability 1 − P(v). The probability P(v) is a sigmoid function:

$P(v) = 1/[1 + \exp(-2v/T)]$,

where T is analogous to temperature and there is no randomness if T = 0.

If the temperature of a heated object drops slowly, the inner part of the object keeps its balance and the inner energy of the object reaches a minimum when the temperature declines to a given point. This is called annealing (Yan and Zhang, 2000). The optimization algorithm that simulates annealing is the simulated annealing (SA) algorithm, which is summarized as follows:

(1) Initialize the various parameters randomly, including the initial temperature T0, and determine the annealing rule.
(2) Compute x′ = x + Δx and ΔE = E(x′) − E(x), where x = (x1, x2, ..., xn)^T, Δx is a uniformly distributed random perturbation, and ΔE is the energy change of the system. If ΔE < 0, x′ is accepted as the new state; otherwise x′ is accepted with probability P = exp(−ΔE/(kT)), where k is the Boltzmann constant. Changes of the temperature T usually follow the rule T(t) = αT(t − 1), 0.85 ≤ α ≤ 0.98. Repeat this step until an equilibrium state is realized.
(3) Perform annealing based on the rule given in (1), and repeat (2), until T = 0 or the given temperature is reached.

SA has been improved by researchers to raise the convergence rate (Ingber, 1993; Ingber and Rosen, 1992).

4. Boltzmann Machine

The Boltzmann machine is an all-connected feedback network (Hinton et al., 1984; Fecit, 2003; Fig. 4). It is analogous to the Hopfield network. There are only two states for neurons in a Boltzmann machine, i.e., 1 and −1. Only one neuron changes its state at a time. Hidden neurons are permitted and all neurons are stochastic neurons. They have no self-feedback connections, and all between-neuron connections are bidirectional and symmetrical. The Boltzmann machine can be trained in a supervised way (Yan and Zhang, 2000).

Figure 4. Architecture of Boltzmann machine.

In Fig. 4, visible neurons interact with the environment. Visible neurons are fixed at some states and hidden neurons change their states according to the inputs received. The energy function is defined as

$E = -\sum_i\sum_{j \ne i} w_{ji}\,s_j\,s_i$.

The probability of a state change of neuron j from $s_j$ to $-s_j$ is

$P(s_j \to -s_j) = 1/[1 + \exp(-\Delta E_j/T)]$,

where the energy change $\Delta E_j$ resulting from the state change is

$\Delta E_j = -2 s_j v_j = -2 s_j \sum_i w_{ji}\,s_i$.

The probability of a state change of neuron j from $-s_j$ to $s_j$ is $1 - P(s_j \to -s_j)$. The objective of Boltzmann machine learning is to minimize E.

Supposing that the visible neurons are divided into input neurons and output neurons, the algorithm of the Boltzmann machine is:

(1) Initialize the connection weights $w_{ji}$ with uniformly distributed random values in [−a, a], where a can be assigned a value, e.g., 1.
(2) Fix the states of the visible neurons according to the desired values of the data pattern. Relax the network using SA, i.e., change the states of neurons following the rule: $s_j = 1$ with probability $P(v_j)$, and $s_j = -1$ with probability $1 - P(v_j)$, where

$v_j = \sum_{i \ne j} w_{ji}\,s_i$,  $P(v_j) = 1/[1 + \exp(-2 v_j/T)]$.

Perform the specified number of iterations for every temperature T, until a given lower temperature is reached. Finally compute

$p_{ji}^{+} = \langle s_j s_i\rangle^{+} = \sum_x\sum_y Q_{xy}^{+}\,s_{j|xy}\,s_{i|xy}$,  i, j = 1, 2, ..., n; i ≠ j,

where x and y are the states of the visible and hidden neurons, respectively, $s_{j|xy}$ is the state of neuron j when the state of the visible neurons is x and the state of the hidden neurons is y, n is the total number of neurons, and $Q_{xy}^{+} = Q_{y|x} Q_x^{+}$. $Q_x^{+}$ is the probability of the visible neurons being in state x when the states of the visible neurons are fixed, and $Q_x^{-}$ is the probability of the visible neurons being in state x when the states of the visible neurons are free. State x may take the values 1 to $2^c$, and state y the values 1 to $2^d$; here c and d are the numbers of visible and hidden neurons, respectively.

(3) Fix the states of the input neurons only, and repeat (2) to compute

$p_{ji}^{-} = \langle s_j s_i\rangle^{-} = \sum_x\sum_y Q_{xy}^{-}\,s_{j|xy}\,s_{i|xy}$,  i, j = 1, 2, ..., n; i ≠ j.

(4) Modify the connection weights by the rule

$w_{ji} = w_{ji} + \eta\,(p_{ji}^{+} - p_{ji}^{-})$,  i, j = 1, 2, ..., n; i ≠ j.

(5) Repeat steps (2) to (4) until $w_{ji}$ no longer changes, i.e., the network converges to its stable state.
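Both the SA algorithm of Sec. 3 and the relaxation in step (2) above rest on the same probabilistic acceptance of uphill moves. A minimal Matlab sketch of the SA loop is given below; the energy (objective) function, perturbation size, initial temperature and cooling factor are illustrative assumptions, and the physical Boltzmann constant k is set to 1.

% Minimal simulated annealing sketch (illustrative energy function and parameters)
E  = @(x) sum((x - 2).^2) + sin(5*x(1));         % hypothetical energy (objective) function
x  = rand(2,1); T = 1.0; alpha = 0.95;           % initial state, temperature, cooling factor
for t = 1:200
    for step = 1:50                              % inner loop at a fixed temperature
        xp = x + 0.1*(2*rand(2,1) - 1);          % uniformly distributed random perturbation
        dE = E(xp) - E(x);
        if dE < 0 || rand < exp(-dE/T)           % accept downhill moves, uphill with prob exp(-dE/T)
            x = xp;
        end
    end
    T = alpha*T;                                 % annealing rule T(t) = alpha*T(t-1)
end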
The Boltzmann machine will converge to the global minimum of the energy function, and the network distribution will match the probability distribution of the environment. However, it is a computationally expensive network. A mean field approximation method has been developed to improve its performance (Peterson and Anderson, 1987; Haykin, 1994). In this method the stochastic neuron is replaced with a deterministic neuron. The mean field theory may improve the training efficiency of the Boltzmann machine by one to two orders of magnitude.

CHAPTER 8
Design and Customization of Artificial Neural Networks

Classical neural networks may approximate any nonlinear function and are widely used. Nevertheless, they do not naturally ensure a satisfactory exploitation of the intrinsic mechanisms of some systems. Special problems usually demand special architectures of neural networks (Zhang and Wei, 2008). Moreover, sometimes a complex system needs to be divided into simpler subsystems, and a composite architecture (modular network) of several neural networks is thus required. For example, a complex input space may be divided into several subspaces, and in each subspace a separate neural network is used. As a result, designing and customizing neural networks for specific systems or problems is widely practiced.

1. Mixture of Experts

The mixture of experts (ME) is a modular neural network (MNN). An MNN uses high-order computational units to perform multiple tasks by dividing a problem into simpler subtasks. The division is conducted by partitioning the input–output response patterns into regions of identical features. A complex problem can be understood more precisely by using an MNN than by using conventional neural networks (Haykin, 1994; Almasri and Kaluarachchi, 2005).

ME is composed of k modules (expert networks) and a control module (gating network) (Jacobs et al., 1991; Fig. 1). The control module assigns different features of the input space to the different modules (Neural Ware, 2000). Each expert network yields an output corresponding to the input vector, and the network output is the weighted sum of these outputs, with the weights equal to the outputs of the gating network. The output of the gating network is considered to be the prior probability that the expert network connected to that output of the gating network will be used for a given input vector (Almasri and Kaluarachchi, 2005).

Figure 1. Architecture of ME (mixture of experts).

Suppose that the network input is x = (x1, x2, ..., xn)^T, the network output is y, the desired output is ŷ, the output of module i is $y^i = (y_1^i, y_2^i, \ldots, y_m^i)^T$, and $g_i$ is the output of output neuron i of the gating network ($0 \le g_i \le 1$ and $\sum_i g_i = 1$, i = 1, 2, ..., k); then the network output is

$y = \sum_{i=1}^{k} g_i\,y^i$.

Assume that the outputs of every expert network have the same covariance, and that the desired output of expert network (module) i is a normally distributed variable:

$f(\hat{y}/x, i) = \exp(-\|\hat{y} - y^i\|^2/2)/(2\pi)^{1/2}$,  i = 1, 2, ..., k.

The network output is thus represented by the sum of k distributions:

$f(\hat{y}/x) = \sum_{i=1}^{k} g_i\,f(\hat{y}/x, i)$.

The objective of network learning is to maximize the logarithmic likelihood function of $f(\hat{y}/x)$, i.e., $\ln f(\hat{y}/x)$ (Jacobs et al., 1991).
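A minimal Matlab sketch of the ME forward pass described above is given below: two linear experts are combined by a softmax gating network. The expert and gating weight values, and the softmax form of the gate, are illustrative assumptions; training (maximizing the log-likelihood) is not shown.

% Minimal mixture-of-experts forward pass (illustrative weights; no training shown)
n = 3; k = 2; m = 1;
x  = rand(n,1);                                  % input vector
We = {rand(m,n), rand(m,n)};                     % hypothetical weights of the k linear experts
V  = rand(k,n);                                  % hypothetical weights of the gating network
g  = exp(V*x); g = g/sum(g);                     % gating outputs: 0 <= g_i <= 1, sum(g) = 1
y  = zeros(m,1);
for i = 1:k
    yi = We{i}*x;                                % output y^i of expert i
    y  = y + g(i)*yi;                            % network output y = sum_i g_i * y^i
end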
2. Hierarchical Mixture of Experts

The hierarchical mixture of experts (HME) is an expansion of ME (Jordan and Jacobs, 1992; Fig. 2). In HME the input space is divided into several regions and data fitting is conducted separately for each region. Some data points may simultaneously belong to different regions. The boundaries of the regions are automatically adjusted during the learning process (Yan and Zhang, 2000).

Figure 2. Architecture of a two-layer hierarchical mixture of experts.

Suppose the input is x ∈ R^n and the output is y ∈ R^m. The output of expert network (module) (i, j) is a continuous generalized linear function

$\mu_{ij} = f(w_{ij}, x)$,

where $w_{ij}$ is the connection weight. The function f is the logistic function ($f(\xi) = 1/(1 + e^{-\xi})$) if the HME is used for Boolean classification. The output i of the first layer of the gating network is

$g_i = \exp(\xi_i)\Big/\sum_j \exp(\xi_j)$,

where $\xi_i = v_i^T x$, $v_i$ is the weight vector, and $g_i$ defines the first partition of the input space. The output j of gating network i in the second layer is

$g_{j|i} = \exp(\xi_{ij})\Big/\sum_k \exp(\xi_{ik})$,

where $\xi_{ij} = v_{ij}^T x$. $g_{j|i}$ is a partition of the region generated by the first partition. The output of gating network i in the second layer is

$\mu_i = \sum_j g_{j|i}\,\mu_{ij}$.

Finally, the network output is

$y = \sum_i g_i\,\mu_i$.

The EM (Dempster et al., 1977; Laird, 1993; Yan and Zhang, 2000) and IRLS (Chen et al., 1999) algorithms have been used jointly in HME computation. The algorithm of HME using the EM algorithm is:

(1) For a given data pair (x, y), compute the posterior probabilities $h_i$ and $h_{j|i}$.
(2) Use the IRLS algorithm to modify the weight $w_{ij}$ of expert network (i, j).
(3) Use the IRLS algorithm to modify the weight $v_i$ of the first layer of the gating network.
(4) Use the IRLS algorithm to modify the weight $v_{ij}$ of every layer of the gating network (except the first layer).
(5) Start a new round of iteration based on the modified weights.

3. Neural Network Controller

Some neural networks have been designed to perform the control of dynamic systems (Mathworks, 2002). There are usually two steps in neural network control, i.e., system identification and control design. The objective of system identification is to design a neural network model of the system to be controlled. The neural network model is then used to train the controller in the control design stage. The feedback linearization controller, or NARMA-L2 controller, is described here. This controller is designed to remove the nonlinearity from a nonlinear system and transform it into a linear system (Mathworks, 2002; Fecit, 2003).

Figure 3. NARMA-L2 neural network.

In order to design a NARMA-L2 controller, the first step is to identify the system to be controlled (Fecit, 2003; Fig. 3). A neural network is trained to simulate the dynamics of the system.

(1) Choose a model architecture. The nonlinear autoregressive moving average (NARMA) model is a standard model of discrete nonlinear systems:

y(k + d) = f(y(k), y(k − 1), ..., y(k − n + 1), u(k), u(k − 1), ..., u(k − n + 1)),

where d ≥ 2 and u(k) is the input of the system. Our objective is to train a neural network to approximate the nonlinear function f.

(2) If the system output is required to track a reference curve, y(k + d) = yτ(k + d), a nonlinear controller of the following form should be developed:

u(k) = g(y(k), y(k − 1), ..., y(k − n + 1), yτ(k + d), u(k − 1), ..., u(k − n + 1)).
To avoid dynamic feedback and slow training, a new model was developed to approximate NARMA (Narendra and Mukhopadhyay, 1994, 1997):

y(k + d) = f(y(k), y(k − 1), ..., y(k − n + 1), u(k − 1), ..., u(k − n + 1)) + g(y(k), y(k − 1), ..., y(k − n + 1), u(k − 1), ..., u(k − n + 1)) · u(k).

The controller is

u(k) = [yτ(k + d) − f(y(k), y(k − 1), ..., y(k − n + 1), u(k − 1), ..., u(k − n + 1))] / g(y(k), y(k − 1), ..., y(k − n + 1), u(k − 1), ..., u(k − n + 1)).

4. Customization of Neural Networks

4.1. Definition of neural network

Various new architectures of neural networks can be created in the Matlab toolkit. A new network can be created by executing the command "network" in the command window (Fig. 4).

Figure 4. Create a new neural network in the Matlab environment.

The neural network possesses the following properties, with initial parameters, structures, or functions, etc.:

(1) Architecture properties. numInputs (0; the number of inputs); numLayers (0; the number of layers); biasConnect ([]; bias connections); inputConnect ([]; input-layer connections); layerConnect ([]; between-layer connections); outputConnect ([]; layer-output connections); targetConnect ([]; layer-target connections); numOutputs (0; read-only; the number of outputs); numTargets (0; read-only; the number of targets); numInputDelays (0; read-only; the number of input delays); numLayerDelays (0; read-only; the number of layer delays).
(2) Subobject structures. inputs (0 × 1 cell of inputs); layers (0 × 1 cell of layers); outputs (1 × 0 cell containing no outputs); targets (1 × 0 cell containing no targets); biases (0 × 1 cell containing no biases); inputWeights (0 × 0 cell containing no input weights); layerWeights (0 × 0 cell containing no layer weights).
(3) Functions. adaptFcn (none; adapt function); gradientFcn (none; gradient function); initFcn (none; initialization function); performFcn (none; performance function); trainFcn (none; training function).
(4) Parameters. adaptParam (none; parameters for adapting the network); gradientParam (none; parameters for the gradient function); initParam (none; parameters for initialization); performParam (none; parameters for network performance); trainParam (none; parameters for training the network).
(5) Weight and bias values. IW (0 × 0 cell containing no input weight matrices; input-layer connection weights); LW (0 × 0 cell containing no layer weight matrices; between-layer connection weights); b (0 × 1 cell containing no bias vectors; biases).

Figure 5. Architecture of the neural network designed.

A neural network can be developed, for example, using the following settings (Fecit, 2003; Fig. 5):

%Generate an empty neural network
net=network;
%Architecture properties
%Set the number of inputs and network layers
net.numInputs=2; %Number of inputs
net.numLayers=3; %Number of layers
%Set bias connections. The bias connection matrix is a 3 x 1 matrix; setting
%net.biasConnect(i)=1 gives layer i a bias connection. Suppose layers 1 and 3
%have bias connections in this network
net.biasConnect=[1;0;1];
%Set input-layer connections. The input connection matrix is a 3 x 2 matrix.
%For example, input 1 to layer 1: net.inputConnect(1,1)=1;
%input 1 to layer 2: net.inputConnect(2,1)=1;
%input 2 to layer 2: net.inputConnect(2,2)=1;
net.inputConnect=[1 0;1 1;0 0];
%Set between-layer connections. The layer connection matrix is a 3 x 3 matrix.
%The syntax net.layerConnect(i,j) means a weight connection from layer j to layer i.
%Here there are weight connections from layer 1 to layer 3 and from layer 2 to layer 3
net.layerConnect=[0 0 0;0 0 0;1 1 0];
%Set output and target connections. Both the output and target connection matrices are 1 x 3 matrices
net.outputConnect=[0 1 1]; %Connect layers 2 and 3 to the network output
net.targetConnect=[0 0 1]; %Set the layer-3-to-target connection. The output of layer 3 will be
%compared to the target of layer 3 and yields an error signal which is used in the training
%Subobject structures
%Some layer properties may be changed. Here layer 3 has default settings except for its
%initialization function
net.inputs{1}.range=[-5 10;-5 10]; %Set the range of input 1 (two elements)
net.inputs{2}.range=[-5 5;-5 5;-5 5;-5 5;-5 5;-5 5]; %Set the range of input 2 (six elements)
net.layers{1}.size=8; %There are 8 neurons in layer 1
net.layers{1}.transferFcn='tansig'; %Set the transfer function
net.layers{1}.initFcn='initnw'; %The Nguyen-Widrow function is the initialization function
net.layers{2}.size=5; %There are 5 neurons in layer 2
net.layers{2}.transferFcn='logsig';
net.layers{2}.initFcn='initnw';
net.layers{3}.initFcn='initnw';
%Here delays of input or layer weights are set
net.inputWeights{2,1}.delays=[0 1];
net.inputWeights{2,2}.delays=1;
net.layerWeights{3,1}.delays=1;
%Set some functions
net.trainFcn='trainlm'; %Training function: Levenberg-Marquardt
net.performFcn='mse'; %Performance function: mse
net.initFcn='initlay'; %Network initialization function: initlay. The network will
%initialize itself using the layer initialization functions

The following codes initialize, simulate, and train the network, and define the training goals:

net=init(net);
x={[2;3] [-3;4]; [2;-1;0;3;1;4] [1;-2;0;4;1;3]};
y={1 -1};
z=sim(net,x);
net.trainParam.epochs=1000;
net.trainParam.goal=1e-8;
net=train(net,x,y);

In addition to neural network models, various functions used in neural networks, e.g., transfer functions, topological functions, distance functions, initialization functions, training functions, etc., can also be customized, designed, and loaded by users (Zhang and Li, 2008).

CHAPTER 9
Learning Theory, Architecture Choice and Interpretability of Neural Networks

1. Learning Theory

Artificial neural networks exploit the mechanisms or laws hidden in samples by learning from the samples. Learning theory and methods are attractive topics in neural network research and applications. Learning theory focuses on questions such as (Yan and Zhang, 2000): is the learning result able to approximate the hidden mechanism as the number of training samples increases (statistical performance of learning, or learnability)? How many samples are required, and how long is the computation time, for learning the mechanism (complexity of learning)? How about the convergence of the learning algorithm? It should be noted that some issues are by nature unlearnable, irrespective of the architecture, performance, and learning method of the neural network used.
1.1. Statistical performance of learning

For supervised learning, the objective is to make the network output y approximate the desired output ŷ by finding a specific F(x, w) (adjusting the weights) from a given set of functions (a given architecture of neural network), i.e., y = F(x, w), where x ∈ R^n, y ∈ R^m, and w is the weight matrix. The question is whether a given set of samples contains enough information for the trained neural network to have a good generalization capacity. The theory on the minimization of empirical risk was proposed to address this problem (Vapnik, 1995, 1999).

Suppose the practical (expected) risk is

$R(w) = \int (\hat{y} - F(x, w))^2\,dP(x, \hat{y})$,

where $P(x, \hat{y})$ is the joint distribution of x and ŷ. Suppose $w_e$ is the weight matrix that minimizes the empirical risk $R_e(w)$; if $R_e(w)$ uniformly converges to the practical risk R(w), i.e.,

$p\{\sup_{w \in W} |R(w) - R_e(w)| > \varepsilon\} \to 0$,  N → ∞,

then $R(w_e)$ will probabilistically converge to the possible minimum of R(w). The above theory proves the validity of minimizing the empirical risk. However, the number of training samples is not limitless, and the convergence speed is therefore the focus of learning (Yan and Zhang, 2000). The convergence speed directly relates to the VC dimension of the problem (the VC dimension of a function class F is the largest cardinality of a set S of points in R^n that can be divided in every possible way by F). If the VC dimension is a finite number, the problem will be learnable and the empirical risk will converge to a minimum.

For feedforward neural networks, the VC dimension (the capacity of multilayer neural networks) represents their classification capacity. For a feedforward network with only one hidden layer, its VC dimension d satisfies (Sontag, 1998)

$2n\lfloor h/2\rfloor \le d \le 2|W|\log_2(eN)$,

where h is the number of hidden neurons, |W| is the number of adjustable connection weights, N is the number of neurons, $\lfloor\cdot\rfloor$ denotes the largest integer not exceeding the argument, n is the input dimension, and e = 2.7183. Sometimes the VC dimension is simply represented by |W|.

The VC dimension should be kept as low as possible. Some measures can be taken to reduce the VC dimension: (1) limit the number of hidden neurons; (2) limit the number of connection weights; and (3) reduce the dimension of the input space. Amari et al. (1997) suggested a simple principle for network learning, i.e., the model is optimally trained when the ratio of the number of training samples to the number of connection weights exceeds 30.

The number of hidden neurons (N) is sometimes determined by using the following "rules of thumb" (Nagendra and Khare, 2006):

(1) N = number of input neurons + number of output neurons;
(2) the maximum number of neurons in the hidden layer (Nmax) is two times the number of input layer neurons (Swingler, 1996; Berry and Linoff, 1997);
(3) N = the number of training patterns divided by five times the number of input and output neurons.
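As an illustration (the counts below are hypothetical, not taken from the text), consider a network with 10 input neurons, 2 output neurons and 420 training patterns. Rule (1) gives N = 10 + 2 = 12 hidden neurons; rule (2) bounds the hidden layer at Nmax = 2 × 10 = 20 neurons; and rule (3) gives N = 420/(5 × (10 + 2)) = 7 hidden neurons. The three rules need not agree; they only bracket a reasonable range from which the final size is chosen empirically.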
1.2. Complexity of learning

It is impossible to engineer a network to perfectly learn the unknown. As a consequence, PAC learning is a better choice for neural network learning. PAC learning only demands approximate learning, in the sense of probability, at a given learning error (Valiant, 1984; Anthony and Biggs, 1992; Yan and Zhang, 2000).

Given a probability distribution p and real values 0 < ε, δ ≤ 1, suppose there is an algorithm A which, by using samples of C, is able to output a hypothesis h in time polynomial in 1/ε, 1/δ, and N (the size or scale of the problem studied, or the complexity of c ∈ C), such that the error of h with respect to c satisfies

$P(\mathrm{err}(h, c) < \varepsilon) \ge 1 - \delta$,  ∀c ∈ C.

Then C is PAC learnable. Suppose the VC dimension of a neural network is d, and d → ∞ if n → ∞ (n is the input dimension). When n → ∞, for any ε > 0, an algorithm with a number of samples less than (d − 1)/(32ε) is not PAC learnable (Venkatesh, 1992). For a given neural network with Boolean output, the relationship between the sample size (N) and the error (ε) should meet the following rule (Yan and Zhang, 2000):

N ≥ d/ε,

where d is the VC dimension (often approximated by |W|, the number of adjustable connection weights).

1.3. Dynamic learning

Neural network learning is a dynamic process. A learning rule with continuous time can be considered to be a differential equation (Amari, 1990; Yan and Zhang, 2000).

Suppose a single neuron has the input x ∈ R^n and the weight vector w ∈ R^n, and the desired output is ŷ. The input is disturbed by the environment and represented by p(x, ŷ). Amari (1990) developed a unified learning equation:

$\tau\,dw(t)/dt = -w(t) + \eta(t)\,r(w(t), x(t), \hat{y})\,x(t)$,

which is equivalent to

$\Delta w = \eta\,r(w, x, \hat{y})\,x - \lambda w$  (discrete equation),
$dw/dt = \eta\,r(w, x, \hat{y})\,x - \lambda w$  (continuous equation),

where λ, η > 0 and r(w, x, ŷ) is the learning signal. If λ = 0 and r(w, x, ŷ) = ŷ − y = ŷ − sgn(w^T x) in the discrete equation, we obtain the perceptron learning rule; if λ = 0 and r(w, x, ŷ) = ŷ − y = ŷ − w^T x, we obtain the Widrow–Hoff learning rule; and if λ = 0 and r(w, x, ŷ) = y, the Hebb learning rule is obtained. Various new learning rules may be developed based on the unified learning equation, and the dynamic properties of learning rules can be analyzed according to the equation.

1.4. Generalization capacity

The generalization capacity represents a neural network's ability to respond to unknown samples (generated by the same mechanism as the training samples). Using neural networks to interpolate, predict, recognize, etc., demands a high generalization capacity. An over-fitted network fits the details (noise) of the training samples and misses the general trends or mechanisms; such a network has a lower generalization capacity. Some measures of generalization capacity are described in the following.

(1) A general rule describing the relationship between generalization error and training error is (Lee and Tenorio, 1993)

$\mathrm{GMEE} \le \varepsilon + \lambda\,((d/N)\ln(N/d))^{1/2}$,

where GMEE is the generalization error (prediction error, or generalized minimum empirical error), ε is the training error, d is the VC dimension, and N is the number of training samples.

(2) An extension of the VC dimension, the regular dimension, has also been used to describe the generalization–training relationship (Gu and Takahashi, 1996):

$R = \varepsilon + \sigma\,(d_{Re}/N)^{1/2}$,

where R is the generalization (prediction) risk, ε is the training error, $d_{Re}$ is the regular dimension, σ is the variance of the loss function R(w), and N is the number of training samples. The training error ε can be expressed as the ratio of misclassified samples to total samples.
(3) A network capacity based measure was developed to represent the generalization capacity (Yan and Zhang, 2000):

$G = \log_2 C/m$,

where C is the network capacity, i.e., the number of different mappings or functions that can be realized by adjusting the connection weights, and m is the output dimension. A small G represents a large generalization capacity.

2. Architecture Choice

The objective of architecture choice is to maximize the generalization capacity of the neural network. Compared to complex networks, a simple network demands a longer training time and has larger training errors, but it is easier to understand, to extract knowledge and rules from, and to realize in hardware; moreover, it has a larger generalization capacity. As a result, the general principle is to choose the simplest architecture under the same conditions.

2.1. Regular method

An available method for architecture choice is to follow this rule:

criterion of architecture choice = logarithmic likelihood function + λ × architecture complexity,

where the first term represents the goodness of fit of the model output to the input samples, the second term penalizes the model for its complexity, and λ is a factor adjusting the constraint strength.

2.2. Network pruning and construction

Network pruning, i.e., removing some unimportant connection weights or neurons from a redundant network, has been suggested (Reed, 1993; Castellano et al., 1997; Setiono, 1997; Yan and Zhang, 2000). For example, in the weight decay method, a penalty function is added to the goal function of the BP algorithm, and the weights not involved in the learning process decline exponentially.

The iterative pruning algorithm proposed by Castellano et al. (1997) can be used to prune not only hidden neurons but also connections in feedforward neural networks. In this algorithm, network pruning and modification of weights are conducted simultaneously to keep the network outputs unchanged. The computational procedure of the algorithm is as follows (a simplified coded sketch of the pruning idea is given at the end of this section):

(1) Assign k = 0.
(2) Remove the neuron h which has the smallest influence on the network $N^k$, following the rule

$h = \arg\min_{h \in H}\sum_{i \in I_h}\|w_{hi}\,y_h\|_2^2$,

where $w_{hi}$ is the connection weight from neuron h to neuron i, H is the set of hidden neurons, $I_h$ is the sending domain of neuron h, $I_h = \{j \in V \mid (h, j) \in L\}$, V is the set of all neurons, L is the set of all connections, and $y_h$ is the signal sent by neuron h to $I_h$.

(3) Solve the following equation for δ using the conjugate gradient algorithm:

$Y^k \delta = Z^k$.

(4) Generate a new network $N^{k+1} = \{V^{k+1}, L^{k+1}, w^{k+1}\}$, where

$V^{k+1} = V^k - \{h\}$,
$L^{k+1} = L^k - (\{h\} \times I_h^k \cup R_h^k \times \{h\})$,
$w_{ji}^{k+1} = w_{ji}^k$,  $i \notin I_h$,
$w_{ji}^{k+1} = w_{ji}^k + \delta_{ji}$,  $i \in I_h$.

$R_h$ is the receiving domain of neuron h, $R_h = \{j \in V \mid (j, h) \in L\}$, and $\delta_{ji}$ is the value for weight modification.

(5) Assign k = k + 1 and repeat steps (2) to (4) until the performance of the network $N^k$ declines significantly.

A neural network can also be constructed step by step from a simple and small architecture. There are different algorithms for network construction, for example, the tiling algorithm (Mezard and Nadal, 1989), sequential addition (Marchand et al., 1990), the upstart algorithm (Frean, 1990), etc.

In general, there is no universal method or algorithm for the choice of architecture. In addition to the choice of the basic model, the number of hidden layers, neurons, etc., should be determined according to the specific system at hand (Zhang, 2007; Zhang et al., 2007; Zhang et al., 2008).
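The following is the simplified sketch referred to above. It is not the iterative algorithm of Castellano et al. (1997); it only illustrates the simpler magnitude-based idea of removing the hidden neurons whose outgoing weights contribute least, using basic Matlab and hypothetical weight matrices.

% Simplified magnitude-based pruning sketch (not Castellano et al.'s full algorithm)
n = 4; h = 10; m = 1;
w1 = randn(h, n); w2 = randn(m, h);              % hypothetical hidden and output weight matrices
keep = true(1, h);
for r = 1:3                                      % remove the three weakest hidden neurons
    contrib = sum(w2.^2, 1);                     % influence of each hidden neuron on the output
    contrib(~keep) = Inf;                        % ignore neurons already removed
    [cmin, idx] = min(contrib);
    keep(idx) = false;
end
w1p = w1(keep, :); w2p = w2(:, keep);            % pruned weight matrices (7 hidden neurons remain)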
3. Interpretability of Neural Networks

The interpretability of neural networks has long been a major concern for neural network models. For example, screening the relative importance of the input variables of a system is a valuable research topic in ecology and environmental science. Developing methods for interpreting neural networks is the subject of recent research (Kemp et al., 2007). So far, various methods for addressing the interpretability of neural networks, such as neural interpretation diagrams (Özesmi and Özesmi, 1999), sensitivity analysis (Lek et al., 1996; Scardi, 1996; Recknagel et al., 1997; Gevrey et al., 2006), inference rule extraction, randomization tests of significance (Scardi and Harding, 1999; Olden, 2000; Olden and Jackson, 2002; Kemp et al., 2007), partial derivatives (Dimopoulos et al., 1999; Reyjol et al., 2001), connection weight methods (Olden et al., 2004; Zhang and Wei, 2008), etc., have been developed for practical use (Özesmi et al., 2006; Gevrey et al., 2003).

3.1. Sensitivity analysis

In sensitivity analysis (Lek et al., 1996), the network output corresponding to each input variable is determined by assigning designed values to that variable, one variable at a time, while holding the other input variables constant (Lek et al., 1996; Özesmi et al., 2006). The variables that are held constant are assigned their minimum, first quartile, median (or mean), third quartile, and maximum values successively. The relative importance of the variable is determined by comparing the input–output relationships. Moreover, white noise can be added to each input variable and the output error examined (Scardi and Harding, 1999).

3.2. Neural interpretation diagrams

Neural interpretation diagrams are used to analyze how the neural network weights every input variable and how the input variables interact to yield the network output (Özesmi and Özesmi, 1999; Özesmi et al., 2006). Neural interpretation diagrams are drawn by scaling the thickness of the connection lines between neurons according to the relative values of their weights, with positive and negative signs represented by different colors. By scaling the thickness of the connection lines from the input neurons, the most important variables can be found, and interactions between input variables can also be observed. Neural interpretation diagrams are most useful for simpler network architectures.

3.3. Randomization test of significance

A randomization test was developed to assess the statistical significance of connection weights and input variables (Olden, 2000; Olden and Jackson, 2002). In this method the neural network is trained on randomized data and all connection weights (input–hidden–output connection weights) are recorded. This procedure is repeated a given number of times, e.g., 1000 or 10000 times, to yield a null distribution for each connection weight, which is then compared to the actual value to calculate the significance level (Özesmi et al., 2006). Connections with little influence on the network output can be removed. Through the randomization test, the independent variables that contribute significantly to the network prediction can be identified.
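A hedged Matlab sketch of the randomization test just described is given below, using the toolbox functions newff, train and sim that the text cites elsewhere. The data set, network size, the use of input–hidden × hidden–output weight products as the test statistic, and the number of randomizations are illustrative assumptions.

% Hedged sketch of a connection weight randomization test (illustrative only)
x = rand(4, 60); yhat = sum(x([1 2], :), 1);        % hypothetical data; inputs 1 and 2 matter
net = newff(minmax(x), [5 1], {'tansig','purelin'});
net.trainParam.epochs = 200; net.trainParam.show = NaN;
net = train(net, x, yhat);
obs = net.LW{2,1}*net.IW{1,1};                      % observed importance per input (weight products)
R = 100;                                            % number of randomizations (1000+ in practice)
nullstat = zeros(R, size(x,1));
for r = 1:R
    yr = yhat(randperm(length(yhat)));              % randomize the response variable
    netr = newff(minmax(x), [5 1], {'tansig','purelin'});
    netr.trainParam.epochs = 200; netr.trainParam.show = NaN;
    netr = train(netr, x, yr);
    nullstat(r, :) = netr.LW{2,1}*netr.IW{1,1};     % record the null connection weight products
end
p = mean(abs(nullstat) >= repmat(abs(obs), R, 1));  % empirical significance level per input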
HIPR (Holdback Input Randomization) is a network-independent randomization test method (Kemp et al., 2007). It is a refinement of the method presented by Scardi and Harding (1999). The procedure of the HIPR method can be summarized as follows:

(1) Optimize the neural network.
(2) Use the test data set to determine the relative importance of the input variables: (a) sequentially input each data point in the test data set to the neural network, but replace the values of one input variable with uniformly distributed random values in the interval (0.1, 0.9), the range over which the network was originally trained; (b) calculate the MSE (mean squared error) of the neural network when the randomized test set has been presented; and (c) repeat the procedure for each input variable, each time substituting the original values with uniformly distributed random values.

3.4. Partial derivatives

The partial derivatives of the network output with respect to the input variables can be used to determine the relative importance of the input variables (Dimopoulos et al., 1999; Reyjol et al., 2001). By plotting the partial derivatives of the network output with respect to an input variable, the changes of the network output with increasing values of that input variable can be determined (Özesmi et al., 2006).

3.5. Connection weight methods

Connection weights themselves represent the importance of connections and can thus be used to evaluate the relative importance of the input variables. The input variable relevance is essentially a connection weight method. In this method the relevance of an input variable is the sum of squares of the weights for that input variable divided by the total sum of squares of the weights for all input variables. Variables with higher relevance are more important (Özesmi and Özesmi, 1999).

A connection weight method proposed by Olden et al. (2004) is considered to outperform all other previous methods in determining the relative importance of input variables. For each input variable, it sums the products of the connection weights from the input neuron to the hidden neurons with the connection weights from the hidden neurons to the output neuron (Kemp et al., 2007). The larger the sum of connection weights, the greater the importance of the variable connected to this input neuron. The relative importance of an input variable is determined by

$V_i = \sum_{k=1}^{N} w_{ik}\,w_{ko}$,

where $V_i$ is the relative importance of variable i, N is the total number of hidden neurons, $w_{ik}$ is the connection weight from input variable i to hidden neuron k, and $w_{ko}$ is the connection weight from hidden neuron k to output neuron o.

A connection weight method, IDM (importance detection method; Zhang and Wei, 2008), was presented for a specific neural network, ANNSSM (artificial neural network for state space modeling). In this algorithm the total importance of input variable i may be calculated by the following formula:

$V_i = \sum_{k=1}^{n}\sum_{j=1}^{m} LW\{k+n, j\}\big/\big(1 + \exp(-IW\{j, i\})\big) = \sum_{k=1}^{n}\sum_{j=1}^{m} w_{kj}\big/\big(1 + \exp(-w'_{ji})\big)$,  i = 1, 2, ..., n,

where n is the number of state variables, m is the number of hidden neurons, $w_{kj}$ is the hidden–output layer connection weight, and $w'_{ji}$ is the input–hidden layer connection weight. The relative importance of input variable i to output variable k may be obtained by

$V_{ki} = \sum_{j=1}^{m} LW\{k+n, j\}\big/\big(1 + \exp(-IW\{j, i\})\big)$,  k, i = 1, 2, ..., n.
The larger the importance value, the greater the importance of the input variable. For ANNSSM, IDM performs better than the earlier connection weight method (Olden et al., 2004).

Many methods that determine the relative importance of input variables implicitly assume that all input variables are unit independent or share the same unit of measurement, e.g., the numbers of individuals of different biological taxa (Zhang and Wei, 2008). However, in many cases the input variables are different physical quantities, e.g., temperature and humidity. Temperature may appear mathematically more important than humidity according to a sensitivity analysis of the neural network model, yet humidity may be practically more influential than temperature if the humidity variation of the environment is much larger than the temperature variation. For this reason, in applications of these methods (e.g., sensitivity analysis), the range of each input variable (or of the training set) must be deliberately determined based on the practical variation of the studied input variable in order to achieve reasonable results. In addition, sequentially removing one input variable at a time and checking model performance (fitting error, prediction error, etc.) is an alternative method.

CHAPTER 10

Mathematical Foundations of Artificial Neural Networks

Artificial neural networks involve many areas of mathematics, such as probability theory, differential geometry, topology, numerical computation, differential equation theory, etc. Some mathematical principles for the design and analysis of artificial neural networks are discussed in this chapter.

1. Bayesian Methods

The learning process and design of neural networks can be based on Bayesian methods (Yan and Zhang, 2000). The fundamental idea of Bayesian methods is that the probability that a postulate, A, is true is proportional to the product of the postulate's prior probability and the conditional probability of the information, I, being observed given that A is true. Given a discrete sample space, S, and $A_i \in S$, $i = 1, 2, \ldots$, where $\cup A_i = S$ and $A_i \cap A_j = \emptyset$ for $i \ne j$, Bayesian rule is represented by:

$$p(A_i/I) = p(I/A_i)p(A_i)\Big/\sum_j p(I/A_j)p(A_j).$$

For a continuous sample space, Bayesian rule is represented by:

$$p(A/I) = p(I/A)p(A)\Big/\int p(I/A)p(A)\,dA.$$

1.1. Model selection

Bayesian rule has been used in the selection of models (Mackey, 1992). Suppose that there are several models (e.g., BP, RBF, etc.). Given the data or information, I, the posterior probability of model $A_i$ is given by Bayesian rule

$$p(A_i/I) = p(I/A_i)p(A_i)/p(I),$$

where $p(A_i)$ is the prior probability of model $A_i$, and $p(I/A_i)$ is the evidence of model $A_i$, which can be represented by

$$p(I/A_i) = \int p(I/w, A_i)\,p(w/A_i)\,dw,$$

where $w = (w_1, w_2, \ldots, w_n)$ is a weight vector.

1.2. Bayesian learning

Let $p(w)$ be a distribution of network weights (including various thresholds), $w = (w_1, w_2, \ldots, w_n)$, without information or samples, I. Given information or samples, the posterior distribution is

$$p(w/I) = p(I/w)p(w)\Big/\int p(I/w)p(w)\,dw.$$
Without any prior information, the prior distribution, p(w), and conditional distribution, p(I/w), may be represented by exponential distributions (Yan and Zhang, 2000): p(w) = exp(−cf (w))/ h(c), p(I/w) = exp(−ag(I))/q(a). The posterior distribution of network weights can be achieved by   p(w/I) = p(w/c, a, I)p(c, a/I)dcda. 1.3. Bayesian Ying-Yang system Bayesian Ying-Yang system was proposed to unify various neural network models and learning algorithms (Xu, 1997). Ying-Yang system is the March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 3 generalization of Bayesian estimation of weights and Bayesian selection of models. Perception and association are two of the most important mechanisms in neural network models. They are conducted by supervised or unsupervised learning procedures in the neural networks (Yan and Zhang, 2000), yielding an output (y) from input (x). Perception is targeted to realize pattern recognition, extraction of characteristics, etc. Association is performed by neural networks for classification, function approximation, and control. The relationship between network input, x, and output, y, can be statistically represented by the joint distribution, p(x, y): p(x, y) = p(x)p(y/x), p(x, y) = p(y)p(x/y). Corresponding to the above there are two models, M1 = {My/x , Mx } and M2 = {Mx/y , My }. Mx and My are called Yang model (visible model) and Ying model (invisible model), respectively. The above joint distribution is thus expressed as pM1 (x, y) = pMx (x)pMy/x (y/x), pM2 (x, y) = pMy (y)pMx/y (x/y). A Yang learning machine and a Ying learning machine are used to realize pM1 (x, y) and pM2 (x, y), respectively. The learning model is named as a Ying-Yang system. Learning is performed in the following way (Xu, 1997): (1) (2) (3) (4) Obtain an appropriate representation; Design a basic structure of the model; Determine the scale and size of the model; Learn the parameters in the model. The learning algorithm tries to reach pMx (x)pMy/x (y/x) = pMy (y)pMx/y (x/y). Various neural network models, like RBF, multi-tier feedforward network, etc., and learning algorithms, like PCA, Helmholtz machine, EM learning algorithm, etc., are specific realizations of this theory (Xu, 1997). March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading 4 Computational Ecology 2. Randomization, Bootstrap and Monte Carlo Techniques Randomization, bootstrap and Monte Carlo techniques are widely used in the researches of artificial neural networks. Details on these techniques can be found in Gentle (2002), Manly (1997) and Zhang (2007). 2.1. Random numbers 2.1.1. General random numbers Random numbers are the basis in randomization, bootstrap and Monte Carlo techniques. To produce a series of uniformly distributed random numbers (0 ∼ 1), the Matlab codes are: x=rand; m=rand(5,6); To arrange integer numbers, 1 ∼ n, randomly: n=10; x=randperm(n); The random number from Matlab generator is a pseudo-random number. The generator will reset once Matlab is initiated. Thus, the sequence of random numbers will be the same. The following codes record the present status (R) of generator, and set the status as R: R=rand(‘state’); rand(‘state’,R); Different status of generator can be set at any time: rand(’state’,sum(100*clock)); Theoretically, Matlab generator may yield 21492 different random numbers, which is enough for general research purposes. 2.1.2. 
Probability distributions and random numbers Random numbers of normal distribution (norm), χ2 distribution (chi2), t distribution (t), F distribution (f), β distribution (beta), uniform March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 5 distribution (unif), and exponential distribution (exp), etc., can be generated in Matlab environment. The commands of probability density, probability distribution, inverse probability distribution, mean and variance, and random number are pdf, cdf, inv, stat, rnd. Various combinations of corresponding commands will yield different probability distributions and random numbers. x=5; %Yield a density value of normal distribution with mean 5 %and standard deviation 30 p=normpdf(x,5,30); %Yield a random number of normal distribution with mean 3 %and standard deviation 25 r=normrnd(3,25); %Yield a value of probability distribution of t-distribution with %degree of freedom 8 s=tcdf(x,8); %Yield a 0.95 percentile of F -distribution with degrees of freedom %3 and 9 g=finv(0.95,3,9); The random numbers of probability distribution can be generated based on uniformly distributed random numbers (Gentle, 2002; Zhang, 2007). The common techniques include: (1) Inverse transformation method. Given the probability distribution F(x) and its inverse function F −1 , generate an uniformly distributed random number on [0,1], X ∼ U (0,1), and take X = F −1 (U). (2) Convolution method. Given that random variable Y = X1 + X2 + · · · + Xn , where Xi , i = 1, 2, . . . , n, are independent and share the same probability distribution, firstly generate Xi , i = 1, 2, . . . , n, and take Y = X1 + X2 + · · · + Xn . (3) Accept–reject method. Suppose the density function to be found is f(x); firstly get a density function g(x) and a constant c, such that f(x) ≤ cg(x); then generate a random number, x, of g(x); take r = cg(x)/f(x); finally, generate a uniformly distributed random number, u; x is the desired random number if ru < 1, or else repeat the above procedure. March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading 6 Computational Ecology The Matlab codes for generating random numbers of various probability distributions are: %Generate a matrix of uniformly distributed random numbers rand(3,4) %Generate a matrix of normally distributed random numbers, %N(3,252 ) random(‘norm’,3,25,5,7); %Generate a matrix of F -distributed random numbers with %degrees of freedom10 and 20 random(‘f’,10,20,6,8) %Generate a matrix of binomially distributed random numbers, %B(5,0.3) random(‘bino’,5,0.3) Latin hypercube is a method to determine inputs (McKay et al., 1979). In Latin hypercube method, m factors (i.e., inputs) are all uniformly distributed random variables, U(0,1). Set the i-th realization of j-th factor as vj = (pj (i) + uj − 1)/n, where pj (·), j = 1, 2, . . . , m, are independent and random selections from 1, 2, . . . , n. pj (i) is the i-th element of j-th permutation, uj is a value sampled from U(0,1), and n is the sample size. vj will completely disperse over factor space. 2.1.3. Markov chain and random numbers Markov chain can be used to generate random numbers. Using different transition matrix will result in different methods for producing random numbers (Gentle, 2002). 
One of these methods is the Metropolis-Hastings algorithm: given the probability density function, $p_X$, of random variable $X$, and the conditional probability density function, $g_{Y_{t+1}|Y_t}$, of the transition of the Markov chain, then

(1) assign $i = 0$ and choose an initial value $x_i$ with probability density $p_X$;
(2) generate $y$ with probability density function $g_{Y_{t+1}|Y_t}(y/x_i)$;
(3) calculate the Hastings ratio, $r = p_X(y)\,g_{Y_{t+1}|Y_t}(x_i/y)\big/[p_X(x_i)\,g_{Y_{t+1}|Y_t}(y/x_i)]$;
(4) assign $x_{i+1} = y$ if $r \ge 1$; or else generate a uniformly distributed random number, $v$, and assign $x_{i+1} = y$ if $v < r$, or else $x_{i+1} = x_i$;
(5) assign $i = i + 1$ and return to step (2).

2.1.4. Multivariable random numbers

Multivariable random numbers can be generated based on single random variables (Gentle, 2002). Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random vector, where the $X_i$, $i = 1, 2, \ldots, n$, are independent and share the same probability distribution with mean 0 and variance 1. Construct a random vector, $Y = AX$, with covariance matrix $AA^T$, where $|A| \ne 0$. Try to find $A$ such that $AA^T = \Sigma$, where $\Sigma$ is the desired covariance matrix.

2.2. Randomization-based data partition

Data partition includes cross validation, jackknife, etc. These techniques are used to mine more information from a limited data set (Gentle, 2002).

2.2.1. Cross validation

The principle of cross validation is to randomly (or systematically) divide the data set, $(x_i, y_i)$, $i = 1, 2, \ldots, n$, into two parts, i.e., a training set (T) and a validation set (V). The training set is used to determine the parameters of a fitted model, $y = f(x)$, and the validation set is used to validate and test the model. Moreover, the training set and validation set can be exchanged to estimate the prediction error:

$$E(R(Y_0, f(x_0))) = \Big[\sum_{i \in V} R(Y_i, f_1(x_i)) + \sum_{i \in T} R(Y_i, f_2(x_i))\Big]\Big/n,$$

where $f_1(x)$ and $f_2(x)$ are the functions fitted using training sets T and V, respectively, and $R(Y_0, f(x_0))$ is the predictive error. A data set can also be divided into several disjoint subsets (Breiman, 2001), for example, K subsets of similar size. One subset is chosen as the validation set and the remaining subsets as the training set, in order to obtain the predictive error. The error averaged over the K subsets is the predictive error to be estimated.

2.2.2. Jackknife method

A data set is systematically divided in the jackknife method in order to obtain estimates, e.g., of the variance or mean, from the data set (Gentle, 2002). Suppose the statistic T of a random sample, $X_1, X_2, \ldots, X_n$, is the estimate of the population parameter θ. Divide the data set into r subsets, each with m elements (m is usually 1 or 2). Remove subset i from the data set and calculate the estimate $T_{-i}$ from the remaining subsets. The estimate of the population parameter θ is thus

$$T' = \sum_{i=1}^{r} T_{-i}/r.$$

Moreover, the jackknife T is

$$J(T) = T^* = \sum_{i=1}^{r} T_i^*/r = rT - (r-1)T',$$

where $T_i^* = rT - (r-1)T_{-i}$. If the $T_i^*$ are independent, the variance of T may be estimated with the jackknife variance $V(J(T))$:

$$V(J(T)) = \sum_{i=1}^{r} (T_i^* - J(T))^2/(r(r-1)),$$

or

$$V(J(T)) = \sum_{i=1}^{r} (T_i^* - T)^2/(r(r-1)).$$

Suppose m = 1; the deviation (bias) of T may be expanded as (Gentle, 2002)

$$D(T) = \sum_{i=1}^{\infty} a_i/n^i.$$

T is unbiased if $a_i = 0$, $i = 1, 2, \ldots$; T has second-order precision if $a_1 = 0$. The jackknife estimate of the deviation of T is

$$D(J(T)) = E(J(T)) - \theta = n\sum_{i=1}^{\infty} a_i/n^i - (n-1)\sum_{i=1}^{\infty} a_i/(n-1)^i.$$
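The following Matlab codes are a minimal sketch of the jackknife estimate and jackknife variance defined above, taking m = 1 and using the sample mean as the statistic (both choices are only illustrative):

x = randn(30, 1);                %an illustrative sample
stat = @(v) mean(v);             %the statistic T (here the mean)
n = numel(x);
T = stat(x);
Tminus = zeros(n, 1);
for i = 1:n
    idx = true(n, 1); idx(i) = false;
    Tminus(i) = stat(x(idx));    %T_{-i}: the statistic with subset i removed
end
Tprime = mean(Tminus);           %T'
Tstar = n*T - (n-1)*Tminus;      %pseudo-values T_i*
JT = n*T - (n-1)*Tprime;         %jackknife estimate J(T)
VJT = sum((Tstar - JT).^2)/(n*(n-1));   %jackknife variance V(J(T))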
2.3. Bootstrap method

The principle of the bootstrap method is to treat the observed sample as a population and to resample this population. The statistical distribution is inferred from the conditional distribution of the samples taken from this population. In statistical language (Gentle, 2002), suppose the observed sample is $x_i$, $i = 1, 2, \ldots, n$, the population parameter is θ, and the statistic T is used to estimate θ. First, the sampling distribution of T should be determined in order to obtain the confidence interval of θ. The statistic T is a functional of the empirical cumulative distribution function, $P_n$,

$$T = T(P_n) = \theta(P_n) = \int g(x)\,dP_n(x).$$

In the bootstrap method, the observed sample $x_i$, $i = 1, 2, \ldots, n$, is resampled to generate a resampled sample, $x_i^*$, $i = 1, 2, \ldots, n$, and the corresponding statistic is $T^*$. The variance of T is

$$V(T) = V(T^*) = \sum_j (T^{*j} - \bar{T}^*)^2/(m-1),$$

where $T^{*j}$ is the j-th value of $T^*$, $\bar{T}^*$ is the average of the $T^{*j}$, and $V(T^*)$ is the variance over m bootstrap samples, all drawn from $P_n$ and each of size n. The confidence interval of θ can be derived from the relationship between θ and T, e.g., $f(T, \theta)$, and the confidence interval is (Gentle, 2002; Zhang, 2007):

$$P(f_{\alpha/2} \le f(T, \theta) \le f_{1-\alpha/2}) = 1 - \alpha.$$

$f_{\alpha/2}$ and $f_{1-\alpha/2}$ may be approximated with the bootstrap method when the probability distribution of the population is not available. $f_{\alpha/2}$ and $f_{1-\alpha/2}$ are determined by percentiles of the Monte Carlo samples of $T^* - T_o$, where $T_o$ is the value of T for the observed sample.

2.4. Monte Carlo method

The Monte Carlo method is used to test the characteristics of statistical methods, to approximate the distribution of statistics by asymptotic approximation, and to compute the expectation of functions of random variables (Manly, 1987). It is particularly efficient for hidden models without explicit relationships. The Monte Carlo method has been used in function approximation. For example, to estimate F(x) with f(x), generate a set of random inputs $x_i$, $i = 1, 2, \ldots, n$, and compute the corresponding outputs $y_i = f(x_i)$, $i = 1, 2, \ldots, n$; from these the mean and variance of f(x) can be obtained.

The Monte Carlo method is also used to estimate variances, to test statistical significance, etc. A procedure for a statistical test is as follows (Gentle, 2002; Zhang, 2007): generate a random sample from the observed sample and compute the characteristic of the random sample; repeat this procedure n times (i.e., n randomizations); finally, test the null hypothesis, i.e., that the observed samples are from the same distribution. The statistic, p, is

$$p = r/n, \quad \text{or} \quad p = (r+1)/(n+1),$$

where r is the number of randomizations in which the characteristic of the random sample is greater than that of the observed sample.

Random sampling of a data set and subset generation of a data set are additional applications of the Monte Carlo method. Moreover, missing data can be generated by the Monte Carlo method (Gentle, 2002). A data matrix X is composed of an observed data block and a missing data block. We can generate m versions of the missing data and analyze the corresponding m complete data blocks.

3. Stochastic Process and Stochastic Differential Equation

Stochastic processes and stochastic differential equations are fundamental to the design and analysis of distributable artificial neural networks.

3.1.
Stationary stochastic process A stationary stochastic process is a stochastic process that meets the condition p(x(ti ) = ci |i = 1, 2, . . . , n) = p(x(ti + τ) = ci |i = 1, 2, . . . , n), ∀ti , ci , τ ∈ R. A stochastic process is reversible if p(x(ti ) = ci |i = 1, 2, . . . , n) = p(x(τ − ti ) = ci |i = 1, 2, . . . , n), ∀ti , ci , τ ∈ R. March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 11 3.2. Markovian process A stochastic process is a Markovian process if p(x(ti ) = ct |x(t1 ) = c1 , . . . , x(ti−1 ) = ct−1 ) = p(x(ti ) = ci |x(ti−1 ) = ct−1 ), ∀ti , ci ∈ R. A Markovian process with both discrete time and discrete states is A Markov chain: p(x(1) = c1 , . . . , x(t) = ct ) = p(x(t) = ct |x(t − 1) = ct−1 )p(x(t − 1) = ct−1 | x(t − 2) = ct−2 ) · · · p(x(2) = c2 |x(1) = c1 ), where p(x(t) = ct |x(t − 1) = ct−1 ), t = 2, 3, . . . , are state transition probabilities. A Markov chain is stationary if state transition probabilities are independent of time. A stationary Markov chain with limited states is a homogeneous Markov chain, if p(x(t) = cj |x(t − a) = ci ) = p(x(t + a) = cj |x(t) = ci ). Suppose {x(t)} is a stochastic process. Define p(u, τ|v, t) = p(x(τ) = u|x(t) = v), p(u, τ) = p(x(τ) = u), ω(u|v, t) = lim [p(u, t + t|v, t) − p(u, t|v, t)]/t, t→0 where ω is the velocity of transition probability, and ω(u|v) = ω(u|v, t) if ω is time independent. For a homogeneous Markov chain of continuous time, suppose 0 ≤ n ≤ N, and define ω− (n) = ω(n − 1|n), ω+ (n) = ω(n + 1|n), ω(m|n) = 0, if m = n ± 1, March 23, 2010 12 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology and x = n/N, x = n/N = 1/N = ε, p(x, t) = Np(n, t). The Fokker–Planck equation is ∂p(x, t)/∂t = −∂[g(x)p(x, t)]/∂x + (2/ε)∂2 [h(x)p(x, t)]/∂x2 , where the drift coefficient g(x), and diffusion coefficient h(x), are defined as g(x) = (ω+ (n) − ω− (n))/N, h(x) = (ω+ (n) + ω− (n))/N. Markov process–based models are not suitable for conducting discriminant decision-making. However, the latter can be achieved by artificial neural networks. Therefore the combination of Markov models and artificial neural networks is advantageous to decision-making. Hidden Markov Model (HMM) may be used to describe the time series (Yan and Zhang, 2000). In a sense, the forward–backward algorithm of HMM is equivalent to that of BP neural network (Hochberg et al., 1991). A HMM is a Markov chain with n states, xi , i = 1, 2, . . . , n. Given the state at t, x(t), and possible outputs, yi , i = 1, 2, . . . , m. A discrete HMM is thus represented by a state transition matrix, P = (pij )n×n , a observation matrix, B = (bij )n×m , and an initial distribution vector L = (li )n . The probabilities of observed output series, yηi , i = 1, 2, . . . , T , are calculated from the forward–backward algorithm (Rabiner, 1989). αi = li , αi (t) = n  αj (t − 1)pji biηT , j βi = δin , βi (t) = n  βj (t + 1)pij bjηt+1 , j where αi (t) is the feedforward probability, and βi (t) is the backward prob ability. The probability of the entire observed output series is αi (t)βi (t). March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 13 3.3. Stochastic differential equation When a continuous dynamic system is perturbed by an external random variable, a chaotic term, E(t), must be incorporated in its differential equation (Langevin equation) dx/dt = f(x, t) + V(t). The chaotic term, E(t), is a discrete function of time. 
The differential equation has a solution of the following form

$$x(t) = x(t_0) + \int_{t_0}^{t} f(x, \tau)\,d\tau + v(t) - v(t_0).$$

It is equivalent to

$$dx(t) = f(x, t)\,dt + dv(t).$$

The above equation is a stochastic differential equation. A coefficient, $w(t)$, may be included in the chaotic term to generate a new equation

$$dx(t) = f(x, t)\,dt + w(t)\,dv(t),$$

and the corresponding solution is

$$x(t) = x(t_0) + \int_{t_0}^{t} f(x, \tau)\,d\tau + \int_{t_0}^{t} w(t)\,dv(t),$$

where the latter integral term is defined as a Wiener integral. If $w(t)$ also depends on x, the following Ito integral or Stratonovich integral should be used:

(1) Ito integral:
$$\int_{t_0}^{t} w(x(t))\,dv(t) = \lim_{\Delta t_i \to 0} \sum_i w(x(t_{i-1}))\,(v(t_i) - v(t_{i-1})).$$

(2) Stratonovich integral:
$$\int_{t_0}^{t} w(x(t))\,dv(t) = \lim_{\Delta t_i \to 0} \sum_i w(x(\tau_i))\,(v(t_i) - v(t_{i-1})),$$

where $\tau_i = (t_i + t_{i-1})/2$. The Ito and Stratonovich stochastic differential equations and their solution expressions are represented by

$$dx_i = f_i(x(t), t)\,dt + \sum_j w_{ij}(x(t))\,dv_j(t),$$
$$x_i(t) = x_i(s) + \int_s^t f_i(x, \tau)\,d\tau + \sum_j \int_s^t w_{ij}(x(\tau))\,dv_j(\tau), \quad i = 1, 2, \ldots, n.$$

3.4. Canonical system

A canonical system

$$dx_i/dt = f_i(x_1, x_2, \ldots, x_n), \quad i = 1, 2, \ldots, n,$$

must satisfy the following condition

$$\sum_i \partial f_i/\partial x_i = 0.$$

4. Interpolation

Interpolation means using discrete data to construct a functional relationship, or simplifying a given functional relationship, in order to compute intermediate values. The functional relationship achieved by interpolation is called the interpolation formula. Through interpolation, we can calculate functional values between the given nodes (Li et al., 2001; Burden and Faires, 2001; Zhang, 2007). Similar to ANNs, interpolation methods are used to generate missing data, to smooth data series, to construct empirical models, to simplify complex models, as well as to predict population dynamics. The interpolation formulae based on algebraic polynomials will be discussed here.

Firstly, introduce the space of continuous functions, C[a, b]. It is the space composed of all continuous functions on [a, b]. C[a, b] is a normed space, and the norm is defined as the ∞-norm

$$\|f\| = \max_{a \le x \le b} |f(x)|.$$

It is a metric space and the metric is defined as

$$d(f, p) = \|f - p\|, \quad f(x), p(x) \in C[a, b].$$

C[a, b] is also a Banach space. Suppose there are n + 1 different nodes, i.e., interpolation nodes, of $f(x) \in C[a, b]$ on [a, b],

$$a \le x_0, x_1, \ldots, x_n \le b,$$

and the functional values at these nodes are $f(x_i)$, $i = 0, 1, \ldots, n$. If there exists $L_n(x) \in H_n$, where $H_n$ is the set of polynomials with degree less than or equal to n, such that $L_n(x_i) = f(x_i)$, $i = 0, 1, \ldots, n$, then $L_n(x)$ is the interpolation polynomial, i.e., interpolation function, of f(x) on [a, b] based on $(x_i, f(x_i))$, $i = 0, 1, \ldots, n$. $L_n(x)$ exists and is unique. It is possible to calculate approximate values of f(x) from $L_n(x)$ for $x \ne x_i$, $i = 0, 1, \ldots, n$. This is called interpolation if $x \in [a, b]$, or else it is called extrapolation.

4.1. Lagrange interpolation

The Lagrange interpolation polynomial is

$$L_n(x) = \sum_{i=0}^{n} l_i(x) f(x_i),$$

where $l_i(x)$ is the basis function and

$$l_i(x) = [(x - x_0)(x - x_1) \cdots (x - x_{i-1})(x - x_{i+1}) \cdots (x - x_n)]\big/[(x_i - x_0)(x_i - x_1) \cdots (x_i - x_{i-1})(x_i - x_{i+1}) \cdots (x_i - x_n)], \quad i = 0, 1, \ldots, n.$$

Lagrange interpolation is a linear interpolation if n = 1, a quadratic interpolation if n = 2, and a cubic interpolation if n = 3.
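As an illustration, the Lagrange polynomial can be evaluated directly from the basis functions. The following Matlab codes are a minimal sketch with made-up nodes; for practical work the built-in interp1 function used later in this chapter is usually preferable.

xi = [0 1 2 4]; fi = [1 2 0 3];    %illustrative interpolation nodes and values
xq = 0:0.05:4;                     %query points
n = numel(xi); Lq = zeros(size(xq));
for i = 1:n
    li = ones(size(xq));
    for j = [1:i-1, i+1:n]
        li = li.*(xq - xi(j))/(xi(i) - xi(j));   %basis function l_i(x)
    end
    Lq = Lq + fi(i)*li;            %L_n(x) = sum of l_i(x)*f(x_i)
end
plot(xq, Lq, 'k-', xi, fi, 'k.');  %interpolation polynomial and the nodes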
The error of interpolation, i.e., the remainder of $L_n(x)$, $R_n(x) = f(x) - L_n(x)$, is given by the following theorem: suppose that f(x) is n times continuously differentiable on [a, b] and $f^{(n+1)}(x)$ exists on (a, b); then the remainder term of the interpolation polynomial, $L_n(x)$, is

$$R_n(x) = f^{(n+1)}(\xi)\,\omega_{n+1}(x)/(n+1)!, \quad x \in [a, b],$$

where $\xi = \xi(x) \in (a, b)$ and $\omega_{n+1}(x) = (x - x_0)(x - x_1) \cdots (x - x_n)$.

4.2. Newton interpolation

Newton interpolation is equivalent to Lagrange interpolation, but the calculation is simpler. First, define the difference quotients of different orders:

• First-order difference quotient: $f[x_i, x_{i+1}] = (f(x_i) - f(x_{i+1}))/(x_i - x_{i+1})$, $i = 0, 1, \ldots, n-1$;
• Second-order difference quotient: $f[x_i, x_{i+1}, x_{i+2}] = (f[x_i, x_{i+1}] - f[x_{i+1}, x_{i+2}])/(x_i - x_{i+2})$, $i = 0, 1, \ldots, n-2$;
...
• n-th order difference quotient: $f[x_0, x_1, \ldots, x_n] = (f[x_0, x_1, \ldots, x_{n-1}] - f[x_1, x_2, \ldots, x_n])/(x_0 - x_n)$.

Then the Newton interpolation polynomial is

$$N_n(x) = f(x_0) + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) + \cdots + f[x_0, x_1, \ldots, x_n](x - x_0)(x - x_1) \cdots (x - x_{n-1}).$$

4.3. Hermite interpolation

If the interpolation conditions are as follows

$$H(x_i) = f(x_i), \quad H'(x_i) = f'(x_i), \quad i = 0, 1, \ldots, n,$$

the following Hermite interpolation polynomial can be used:

$$H(x) = \sum_{i=0}^{n} \alpha_i(x) f(x_i) + \sum_{i=0}^{n} \beta_i(x) f'(x_i),$$

where

$$\alpha_i(x) = \Big[1 - 2(x - x_i) \sum_{j=0,\, j \ne i}^{n} 1/(x_i - x_j)\Big]\,\omega_{n+1}^2(x)\big/\{(x - x_i)^2 [\omega'_{n+1}(x_i)]^2\},$$
$$\beta_i(x) = \omega_{n+1}^2(x)\big/\{(x - x_i)[\omega'_{n+1}(x_i)]^2\},$$
$$\omega'_{n+1}(x_i) = (x_i - x_0)(x_i - x_1) \cdots (x_i - x_{i-1})(x_i - x_{i+1}) \cdots (x_i - x_n).$$

Suppose that $f(x) \in C[a, b]$ has a derivative of order 2n + 2 on (a, b); then the remainder term of the Hermite interpolation polynomial is

$$R(x) = f(x) - H(x) = f^{(2n+2)}(\xi)\,\omega_{n+1}^2(x)/(2n+2)!, \quad x \in [a, b],$$

where $\xi = \xi(x) \in (a, b)$.

4.4. Spline interpolation

The above-mentioned interpolation methods become complex and unstable as the order of the polynomial increases. For example, the error of interpolation can be considerably large near the end nodes of the interpolation interval. The spline interpolation method can avoid such issues. Spline interpolation is the most effective interpolation method, among which the cubic spline function is the most widely used. Cubic spline interpolation is a piecewise cubic interpolation. Its first- and second-order derivatives are continuous at the inner nodes and it is thus relatively smooth.

Suppose there are n + 1 different nodes on the given interval [a, b],

$$a \le x_0 < x_1 < \cdots < x_n \le b.$$

If the order of the polynomial S(x) on each sub-interval $[x_i, x_{i+1}]$ is less than or equal to m and larger than or equal to 1, and its (m - 1)-th order derivative, $S^{(m-1)}(x)$, is continuous at the inner nodes, $x_1, \ldots, x_{n-1}$, then S(x) is called an m-degree spline function. If $f(x) \in C[a, b]$ and $S(x_i) = f(x_i)$, $i = 0, 1, \ldots, n$, then S(x) is called the m-degree spline interpolation polynomial of f(x) on [a, b]. A cubic spline interpolation polynomial

$$S(x) = a_i x^3 + b_i x^2 + c_i x + d_i, \quad x \in [x_i, x_{i+1}], \quad i = 0, 1, \ldots, n-1,$$
should meet the following conditions:

(1) there are in total 2n conditions for interpolation and continuity;
(2) the first- and second-order derivatives should be continuous at the inner nodes, $S'(x_i - 0) = S'(x_i + 0)$, $S''(x_i - 0) = S''(x_i + 0)$, $i = 1, 2, \ldots, n-1$.

Moreover, one of the following three boundary conditions is usually required to be met:

(3) $S'(x_0) = f'(x_0)$, $S'(x_n) = f'(x_n)$;
(4) $S''(x_0) = f''(x_0)$, $S''(x_n) = f''(x_n)$;
(5) $S'(x_0 + 0) = S'(x_n - 0)$, $S''(x_0 + 0) = S''(x_n - 0)$.

The cubic spline interpolation polynomial is expressed as

$$S(x) = M_{i+1}(x - x_i)^3/(6l_i) + M_i(x_{i+1} - x)^3/(6l_i) + (f(x_{i+1})/l_i - M_{i+1}l_i/6)(x - x_i) + (f(x_i)/l_i - M_i l_i/6)(x_{i+1} - x), \quad x \in [x_i, x_{i+1}], \quad i = 0, 1, \ldots, n-1,$$

where $M_i = S''(x_i)$, $i = 0, 1, \ldots, n$, and $l_i = x_{i+1} - x_i$, $i = 0, 1, \ldots, n-1$. The $M_i$, $i = 0, 1, \ldots, n$, are obtained by solving the three-moment equations. The cubic spline function exists and is unique under certain conditions. Moreover, it shows the property of best approximation.

The time-changing survivorship of a moth, Spodoptera litura F., at 20°, was interpolated using linear interpolation (Zhang, 2007), Hermite interpolation, and spline interpolation (Fig. 1). The following are the Matlab codes.

x=0.5:1:25.5;    %Time: 0.5,1.5,2.5,...,24.5,25.5
fx=[1 1 1 1 1 1 1 0.96 1 1 1 1 0.98 1 0.98 0.92 0.94 0.85 0.9 0.59 0.38 0.33 0.27 0.17 0.17 0.17];    %Survivorship
intp=0.5:0.1:25.5;    %Interpolation points: 0.5,0.6,0.7,...,25.4,25.5
subplot(4,1,1);plot(x,fx,'k.');
ylabel('Observed');
lx=interp1(x,fx,intp,'linear');    %Linear interpolation
LinearInterp=[intp;lx]    %Results
subplot(4,1,2);plot(intp,lx,'k.');
ylabel('Linear Interpolation');
hx=interp1(x,fx,intp,'cubic');    %Cubic Hermite interpolation
HermiteInterp=[intp;hx]    %Results
subplot(4,1,3);plot(intp,hx,'k.');
ylabel('Hermite Interpolation');
sx=interp1(x,fx,intp,'spline');    %Cubic spline interpolation
SplineInterp=[intp;sx]    %Results
subplot(4,1,4);plot(intp,sx,'k.');
ylabel('Spline Interpolation');
xlabel('Time');

Figure 1. Survivorship interpolation of Spodoptera litura F.

The interpolation methods may be extended to higher dimensions (Fig. 2). The Matlab codes for two dimensions are listed (Zhang, 2007):

x=1:5;    %x: 1,2,3,4,5
y=1:5;    %y: 1,2,3,4,5
fx=[1.2 3.5 4.2 2.5 2.0;1.9 5.8 4.1 3.9 3.5;1.5 4.2 6.6 3.2 2.9;1.4 4.1 3.7 2.9 1.6;1.1 5.8 2.6 2.2 1.1];
xi=1:0.1:5;
yi=1:0.1:5;
subplot(4,1,1);mesh(x,y',fx);
zlabel('Observed');
lx=interp2(x,y,fx,xi,yi','linear');    %Linear interpolation
LinearInterp=lx    %Results
subplot(4,1,2);mesh(xi,yi',lx);
zlabel('Linear Interpolation');
hx=interp2(x,y,fx,xi,yi','cubic');    %Cubic Hermite interpolation
HermiteInterp=hx    %Results
subplot(4,1,3);mesh(xi,yi',hx);
zlabel('Hermite Interpolation');
sx=interp2(x,y,fx,xi,yi','spline');    %Cubic spline interpolation
SplineInterp=sx    %Results
subplot(4,1,4);mesh(xi,yi',sx);
zlabel('Spline Interpolation');

Figure 2. Two-dimensional interpolation.

5. Function Approximation

In ecological studies, we often encounter functions with complex forms and tedious calculation.
Function approximation methods can be used to simplify those functions, which will simplify and speed up the calculation without losing accuracy (Shi and Gu, 1999; Li et al., 2001; Burden and Faires, 2001; Zhang and Barrion, 2006; Zhang, 2007). Some ANNs have used function approximation methods as mathematical algorithms (Zhang et al., 2008). Suppose that f(x) is a function on [a, b]. We try to construct a simple function, p(x), to approximate the given function, f(x). It is given as function approximation. Function approximation can also be stated as follows: for a given function f(x) ∈ C[a, b], to find a function p(x) ∈ C[a, b], such that the error between f(x) and p(x) on [a, b] is the smallest. Common error metrics include: (1) uniform approximation: find a function, p(x), so that lim d(f, p) = 0. n→∞ (2) Lq approximation: find a function, p(x), so that  b |f(x) − p(x)|q w(x)dx = 0, lim n→∞ a where q ≥ 1, and w(x) is the weight function on [a, b]. It is the quadratic approximation if q = 2. The weight function, w(x), satisfies the following conditions: (a) w(x) ≥ 0, x ∈ [a, b]; b (b) a |x|n w(x)dx is existential, n = 0, 1, . . .; (c) for the continuous function, g(x) ≥ 0, g(x) ∈ C[a, b], if  b g(x)w(x)dx = 0, a then g(x) ≡ 0 on (a, b). (3) least square estimation: find a function, p(x), so that min = m  i=1 w(xi )(p(xi ) − f(xi ))2 . March 23, 2010 22 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology We discuss these methods using algebraic polynomial p(x) to approximate f(x) in the following text. 5.1. Uniform approximation 5.1.1. Existence of algebraic polynomial of uniform approximation For uniform approximation, Weierstrass theorem can be used: suppose f(x) ∈ C[a, b], then for any ε > 0, there is an algebraic polynomial p(x), so that f(x) − p(x) < ε uniformly holds on [a, b]. Weierstrass theorem shows the existence of algebraic polynomial of uniform approximation. 5.1.2. Chebyshev’s best uniform approximation Suppose 1, x, . . . , xn are a group of linearly independent functions on [a, b], Hn is a set of polynomials with degree less than or equal to n, Hn = span{1, x, . . . , xn }, Hn = C[a, b]. Any pn (x) ∈ Hn can be expressed as pn (x) = a0 + a1 x + · · · + an xn , where ai ∈ R, i = 0, 1, . . . , n. Suppose f(x) ∈ C[a, b], then f(x)−pn (x) is defined as the deviation between f(x) and pn (x) on [a, b], and En = min f(x) − pn (x) , pn (x) ∈ Hn , is the least deviation. Find p∗n (x), so that f(x) − p∗n (x) = En = min f(x) − pn (x) , pn (x) ∈ Hn . This is the Chebyshev’s best uniform approximation; p∗n (x) is the Chebyshev polynomial of best uniform approximation. March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 23 Chebyshev’s alternate node pattern. Suppose p(x) ∈ C[a, b], if there is a node pattern xi , i = 1, 2, . . . , n, a ≤ x1 < · · · < xn ≤ b, such that |p(xi )| = max |p(x)|, i = 1, 2, . . . , n, a≤x≤b and p(xi ) = −p(xi+1 ), i = 1, 2, . . . , n − 1, then it is said to be the Chebyshev’s alternate node pattern of p(x) on [a, b]. The following Chebyshev theorem shows the existence and uniqueness of Chebyshev’s polynomial of best uniform approximation: suppose f(x) ∈ C[a, b], p(x) ∈ Hn , then p(x) is the polynomial of best uniform approximation of f(x), if and only if there is a node pattern of f(x) − p(x) on [a, b], which contains at least n + 2 nodes. 
By the theorem, we can see that if f(x) ∈ C[a, b], then there is a unique polynomial of best uniform approximation in Hn , and it is a Lagrange interpolation polynomial of f(x). The polynomial of best uniform approximation, p(x), is always hard to be found. Generally, it can be obtained based on the following theorem: suppose f(x) is the n+1 th derivative on [a, b], and f (n+1) (x) does not alter its sign on [a, b], then the end nodes a and b belong to the node pattern of f(x) − p(x), if p(x) ∈ Hn is the polynomial of best uniform approximation of f(x). 5.2. Quadratic approximation If there is a p∗n (x) such that  b a |f(x) − p∗n (x)|2 w(x)dx = inf  b |f(x) − pn (x)|2 w(x)dx, a pn (x) ∈ Hn , then p∗n (x) is the best quadratic approximation of f(x) on [a, b] with respect to the weight function, w(x). March 23, 2010 24 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology 5.2.1. Orthogonal function system Suppose f(x), g(x) ∈ C[a, b], and w(x) is the weight function on [a, b], then the inner product of f(x) and g(x) on [a, b] is  b (f, g) = f(x)g(x)w(x)dx. a The continuous function space C[a, b] with a definition of inner product is an inner product space. f(x) and g(x) are w(x)-weighted orthogonal on [a, b], if (f, g) = 0. If functions ϕ0 (x), ϕ1 (x), . . . , ϕn (x), such that  b (ϕi , ϕj ) = ϕi (x)ϕj (x)w(x)dx = 0, i = j, a (ϕi , ϕj ) = Cij , i = j, {ϕi } is defined as the orthogonal function system on [a, b] with weight function w(x). If Cij ≡ 1, it is the orthonormal function system. If ϕi (x), i = 0, 1, . . . , are algebraic polynomials, then they are said to be orthogonal polynomials. The common orthogonal polynomials are as follows: (1) Legendre polynomial Legendre polynomial is defined on [−1, 1], its weight function w(x) ≡ 1, and orthogonal function system {ϕi } = {1, x, . . . , xn , . . .}, then Legendre polynomial is ϕ0 (x) = 1, ϕ1 (x) = x, ϕn+1 (x) = ((2n + 1)xϕn (x) − nϕn−1 (x))/(n + 1), x ∈ [−1, 1], n = 1, 2, . . . (2) Chebyshev polynomial Chebyshev polynomial is defined on [−1, 1], its weight function w(x) = 1/(1 − x2 )1/2 , and orthogonal function system {ϕi } = {1, x, . . . , xn , . . .}, then Chebyshev polynomial is ϕ0 (x) = 1, ϕ1 (x) = x, ϕn+1 (x) = 2xϕn (x) − ϕn−1 (x), x∈ ∈ [−1, 1], n = 1, 2, . . . March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 25 (3) Laguerre polynomial Laguerre polynomial is defined on [0, ∞), its weight function w(x) = e−x , and orthogonal function system {ϕi } = {1, x, . . . , xn , . . .}, then Laguerre polynomial is ϕ0 (x) = 1, ϕ1 (x) = 1 − x, ϕn+1 (x) = (2n + 1 − x)ϕn (x) − n2 ϕn−1 (x), x ∈ [0, ∞), n = 1, 2, . . . , (4) Hermite polynomial Hermite polynomial is defined on (−∞, ∞), its weight function w(x) = e−x∗x , and orthogonal function system {ϕi } = {1, x, . . . , xn , . . .}, then Hermite polynomial is ϕ0 (x) = 1, ϕ1 (x) = 2x, ϕn+1 (x) = 2xϕn (x) − 2nϕn−1 (x), x ∈ (−∞, ∞), n = 1, 2, . . . (5) Trigonometric function Trigonometric function is defined on [0, 2π], and its weight function w(x) ≡ 1. It has the following form ϕ0 (x) = 1, ϕ1 (x) = cos(x), ϕ2 (x) = sin(x), ϕ3 (x) = cos(2x), ϕ4 (x) = sin(2x), ϕ5 (x) = cos(3x), ϕ6 (x) = sin(3x), . . . , x ∈ [0, 2π]. 5.2.2. Linearly independent function system If ϕk (x) ∈ C[a, b], k = 0, 1, . . . , n − 1, such that a0 ϕ0 (x) + a1 ϕ1 (x) + · · · + an−1 ϕn−1 (x) = 0, if and only if a0 = a1 = · · · = an−1 = 0, we say that ϕk (x), k = 0, 1, . . . , n − 1, are linearly independent. 
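The three-term recurrences listed in Sec. 5.2.1 above are straightforward to evaluate numerically. As a brief aside, the following Matlab codes are a minimal sketch for the Legendre and Chebyshev polynomials (the degree N and the evaluation grid are arbitrary illustrative choices):

x = linspace(-1, 1, 201); N = 5;    %evaluation grid on [-1,1] and highest degree
P = zeros(N+1, numel(x)); T = zeros(N+1, numel(x));
P(1,:) = 1; P(2,:) = x;             %Legendre: phi_0 = 1, phi_1 = x
T(1,:) = 1; T(2,:) = x;             %Chebyshev: phi_0 = 1, phi_1 = x
for n = 1:N-1
    P(n+2,:) = ((2*n+1)*x.*P(n+1,:) - n*P(n,:))/(n+1);    %Legendre recurrence
    T(n+2,:) = 2*x.*T(n+1,:) - T(n,:);                    %Chebyshev recurrence
end
%row k of P (or T) now holds phi_{k-1}(x) on the grid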
{ϕi } is a linearly independent function system, if a finite number of arbitrary ϕk (x) in function system {ϕi } are linearly independent. Expanded from linearly independent function March 23, 2010 26 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology system, ϕk (x), k = 0, 1, . . . , n − 1, S = span{ϕ0 , ϕ1 , . . . , ϕn−1 } = {a0 ϕ0 (x) + a1 ϕ1 (x) + · · · + an−1 ϕn−1 (x)|a0 , a1 , . . . , an−1 ∈ R} is the subspace of C[a, b]. 5.2.3. Best quadratic approximation The following theorem shows the existence and uniqueness of best quadratic approximation, L2 : suppose f(x) ∈ C[a, b], then f(x) has one and only one best quadratic approximation. Suppose ϕk (x), k = 0, 1, . . . , n, are linearly independent, solve the following equations i = 0, 1, . . . , n. (ϕi , ϕ0 )a0 + (ϕi , ϕ1 )a1 + · · · + (ϕi , ϕn )an = (ϕi , f), The best quadratic approximation of f(x) is a0 ϕ0 (x) + a1 ϕ1 (x) + · · · + an ϕn (x). A simple way is taking ϕi (x) = xi , w(x) ≡ 1. Orthogonal function systems, in particular orthogonal polynomials, are usually used as linearly independent function systems. For this situation, we obtain the solution of the above equations ai = (ϕi , f)/(ϕi , ϕi ), that is ai =  b ϕi (x)f(x)w(x)dx/ a  b (ϕi (x))2 w(x)dx, i = 0, 1, . . ., n, a and ai are said to be the generalized Fourier coefficients of f(x) with respect to the orthogonal function system, {ϕi }. The expansion of f(x) ∞  ai ϕi (x) i=0 is the generalized Fourier series. As a result, any function, f(x) ∈ C[a, b], can be expanded as a generalized Fourier series. March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 27 5.3. Least Square Approximation Find the relationship between independent and dependent variables from experimental data (xi , f(xi )), i = 0, 1, . . . , m, so that the error m  w(xi )(p(xi ) − f(xi ))2 i=0 is minimal. p(x) is the least square approximation. Here, w(x) is the weight function representing the weights of various data nodes. By constructing orthogonal functions, ϕk (x), k = 0, 1, . . . , n, n ≤ m, the least square approximation is n  aj ϕj (x). j=0 The inner product is defined as follows (ϕi , ϕj ) = m  w(xk )ϕi (xk )ϕj (x), k=0 (ϕi , f) = m  w(xk )ϕi (xk )f(xk ), k=0 then ai = (ϕi , f)/(ϕi , ϕi ). As an example, suppose that the population dynamics of an animal is f(t) ∈ C[a, b], the best quadratic approximation of f(t) is found based on trigonometric function. The expansion of f(t) is p(t) = a0 /2 + ∞  (ai cos it + bi sin it), i=1 where ai = (1/π)  bi = (1/π)  2π f(t) cos itdt, i = 0, 1, 2, . . . 0 2π f(t) sin itdt, i = 1, 2, . . . 0 March 23, 2010 28 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology This is just the Fourier series. The Fourier series will uniformly converge to f(t), if f ′ (t) is piecewise continuous on [0,2π]. Thus its partial sum can be taken as the best quadratic approximation of f(t). Periodicity of population dynamics can also be analyzed according to Fourier series. 6. Optimization Methods The error function in a neural network can be minimized through various optimization methods. 6.1. Steepest descent method The steepest descent method is a basic unconstrained optimization technique. Given an unconstrained optimization problem: min f(x), where x = (x1 , x2 , . . . , xn )T , and f(x) is a nonlinear function of x. Calculate t i ∈ R, such that f(xi + t i pi ) = min f(xi + tpi ). t The direction of search, pi , is determined by pi = −∇f(xi ) = −(∂f/∂x1 , ∂f/∂x2 , . . . , ∂f/∂xn )|x = xi . 
Finally, the next point, xi+1 = xi + t i pi , is achieved. 6.2. Conjugate gradient method Suppose the objective function, f(x), is approximately a quadratic function in the neighborhood of the extreme point, x∗ f(x) ≈ a + bT x + xT Ax/2. Calculate p0 = −b − Ax 0 , gi = b + Ax i , βi−1 = gi 2 / gi−1 2 , pi = −gi + βi−1 pi−1 , March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 29 such that f(xi + t i pi ) = min f(xi + tpi ), t the next point, xi+1 = xi + t i pi , is thus achieved. The iteration terminates if gi ≤ ε. 6.3. Newton method Suppose the objective function, f(x), can be expanded as a quadratic Taylor polynomial near in the neighborhood of the point, xi f(x) ≈ f(xi ) + ∇f(xi )T x + xT Ai x/2, where x = x − xi . Calculate t i ∈ R, such that f(xi + t i pi ) = min f(xi + tpi ). t The direction of search, pi , is determined by: pi = −(Ai )−1 ∇f(xi ). The next point, xi+1 = xi + t i pi , is therefore achieved. 7. Manifold and Differential Geometry Neural field theory was developed based on differential manifold (Amari, 1985,1987,1998; Luo, 2004), which treats the global and macroscopic properties of neural networks. It is dependent on the relationship between probability distribution and neural network. Probability space is a non-Euclidean space and is thus generally studied using manifold theory. Manifold always shows invariant properties under chart transformation and belongs to the category of differential geometry. Manifold is a topological space and thus an important topic in algebraic topology and differential geometry, the latter treats the global properties of space and manifold, in particular the relationship between local and global properties (Chen and Chen, 1980; Meng and Liang, 1999; Wu, 1981). In general, a neural network can be represented as a manifold. It may be mapped to a point on statistical manifold and the parameters of neural network are coordinates on the manifold. A neural network model can be embedded into a flat dual manifold. That is, a family of parameterized probability distributions are embedded into a Riemannian March 23, 2010 30 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology manifold in probability distribution space, and every probability distribution is a point on the manifold and thus represents a neural network. The relationship between manifold and submanifold, and coordination among hierarchical, multi-model neural networks can be studied by neural field theory and manifold theory, details may be found in the studies of Amari (1985,1987,1995,1998) and Luo (2004). 7.1. Differential manifold Manifold is the generalization and extension of Euclidean space, curve, or curvature. The neighborhood of any point on manifold is homeomorphic to an open set in Euclidean space. Local isomorphism between manifold and Euclidean space may be constructed by a homeomorphic mapping (Amari, 1985, 1987; Luo, 2004). Suppose M is a Hausdoff space, and H n is a half closed space of Rn M is an n-dimensional manifold if for every x ∈ M, there is an open neighborhood of x, which is homeomorphic to Rn or H n . For x ∈ M, if there is an open neighborhood of x, which is homeomorphic to H n , x is called a bound point. The set of all bound points of M is the bound of M, ∂M. If ∂M = ϕ, M is a (non-bound) manifold, or else it is a bound manifold. Differential manifold and chart. 
The set M is a topological (differential) manifold, if M is a topological space, and (1) M is a Hausdoff space; (2) M is a countable base of topological space; (3) for any point p ∈ M, there is a neighborhood U of p, and an open set of homeomorphic mapping fu : U → f(U), where (U, fu ) is a chart of M. Cr differential manifold. given a chart set A = {(U, fu ), (V, fv ), . . . , (X, fx )} on a m-dimensional manifold M, A is defined as a Cr differential structure of M if the following conditions hold: (1) {U, V, . . . , X} is an open covering of M; (2) any two charts in A are Cr compatible; March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 31 (3) for any chart (U, fu ) of M, if it is Cr compatible with every chart in A, it must be in A. Given a Cr differential structure on M, then M is called a Cr differential manifold. Cr differential construction. A chart set on M : ψ = {(U∂ , f∂ )|∂ ∈ A}, satisfies these conditions: (1) M = ∪∂ U∂ ; (2) ψ∂◦ ψγ−1 : ψγ (U∂ ∩ Uγ ) ⊂ Rn → ψ∂ (U∂ ∩ Uγ ) ⊂ Rn , and ψγ◦ ψ∂−1 : ψ∂ (U∂ ∩ Uγ ) ⊂ Rn → ψγ (U∂ ∩ Uγ ) ⊂ Rn , are Cr differential homeomorphic, where (U∂ , ψ∂ ) ∈ φ, (Uγ , ψγ ) ∈ φ, and U∂ ∩ Uγ  = φ; (3) (U, f) ∈ ψ, if (U, f) is Cr compatible with every chart in ψ. ψ is called Cr differential construction on n-dimensional manifold M. Consider a probability distribution, M = {p(x; θ)}, where x ∈ X is a random variable, p(x; θ) > 0 is the density function of x, θ = (θ 1 , θ 2 , . . . , θ n ) ∈ , is analogous to the chart on manifold,  ⊂ Rn is an open set. M will exhibit the differential manifold structure when p(x; θ) is sufficiently smooth in the neighborhood of every point of θ. Several coordinate functions can be used in manifold, for examples, α = f(s), and β = g(s). A one-to-one correspondent coordinate transformation exists between α and β : α = f(g−1 (β)), β = g(f −1 (α)). Suppose X is a compact set composed of states of system, and X is a smooth manifold in Rn , the observation will be y = f(x), x ∈ X, y ∈ R. The following Takens embedding theorem is fundamental to time series modeling: given the observation series, Y(n) = [y(n), y(n − τ), . . . , y(n − (m − 1)τ)], the state x of system at time n, can be reconstructed by mdimensional vector, Y(n), where m ≥ 2d + 1, d is the dimension of phase space of the system. The minimum m to conduct the construction is named as the embedding dimension (Yan and Zhang, 2000). 7.2. Riemannian manifold All tangent vectors of a point p on manifold M, are localized around p and thus form a tangent vector space Tp of p. For the probability distribution March 23, 2010 32 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology manifold, M = {p(x; θ)}, take l(x; θ) = log p(x; θ), then ∂i = ∂l(x; θ)/∂θ i , i = 1, 2, . . . , n, are linearly independent. {∂i } is the base function of Tp , i.e., v= n  vi ∂i , i=1 where v is any tangent vector, v ∈ Tp , and vi is the i-th component of v. Riemannian manifold. If there is a metrics of inner product on tangent vector space of every point of manifold M hij (u) = ∂i , ∂j  = ∂/∂ui , ∂/∂uj , i, j = 1, 2, . . . , n, M is called a Riemannian manifold. Matrix H = (hij (u)) is called the metric tensor, {∂i } is base function of Tp , (U, ui ) is a chart on M. 
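As a standard illustration of such a metric (stated here for concreteness, not taken from the text above), consider the two-parameter family of normal densities $p(x; \mu, \sigma)$ with coordinates $(u^1, u^2) = (\mu, \sigma)$ and $l(x; \mu, \sigma) = \log p(x; \mu, \sigma)$. Then

$$\partial_1 l = (x - \mu)/\sigma^2, \qquad \partial_2 l = (x - \mu)^2/\sigma^3 - 1/\sigma,$$

and taking the expectation $h_{ij} = E[\partial_i l\,\partial_j l]$ (the Fisher information) as the inner product gives

$$(h_{ij}) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix},$$

so this family of probability distributions carries the structure of a Riemannian manifold.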
Suppose x, y are tangent vectors, hx is an inner product function and satisfies the following conditions: (1) the mapping (x, y) → hx (x, y) is double linear, and hx (x, y) = hx (y, x), ∀x, y ∈ Tx (M); (2) hx (x, y) ≥ 0, ∀x, y ∈ Tx (M), and x = 0 if hx (x, y) = 0; (3) hx (x, y) is the smooth function on M, if both x and y are smooth vector fields. The length of tangent vector x, is defined as |x|2 = x, x = xi xj hij . If x, y = xi yj hij = 0, tangent vectors x and y are orthogonal. Affine connection. ∇∂j is defined as the internal change of j-th base tangent vector when point θ on manifold changes to θ + dθ, and ∇∂i ∂j , i.e., the covariant derivative, is defined as the internal change of ∂j when θ changes in the direction of ∂i , where ∇∂i ∂j = Ŵki,j (θ)∂k (θ) and Ŵki,j (θ) = ∇∂i ∂j , ∂k . Suppose T(M) is the smooth vector field on Riemannian manifold M, the affine connection on M is a covariant derivative ∇ : T(M)×T(M) → T(M), which satisfies the following condition: for vector fields A, B ∈ T(M), the covariant derivative of B in the direction of A is a vector field C. A curve ρ(t) satisfying ∇ρ ρ = 0 is called geodesic curve. Given a family of probability distributions, S = log P(x; θ), Amari (1985, 1987, 1995, 1998) defined March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 33 Ŵki,j (θ) as Ŵαi,jk (θ) = E[(∂i ∂j l(x; θ) + ∂i l(x, θ)∂j l(x; θ)(1 − α)/2)∂k l(x; θ)]. If α = 1, the curve is an e-geodesic curve and if α = −1, the curve is an m-geodesic curve. 7.3. Submanifold Embeded manifold. For a smooth mapping f : M → N, where M and N are smooth manifolds, such that f is a monomorphism and f : Tx M → Tf(x) N is a monomorphism, ∀x ∈ M, (M, f) is an embedded manifold of N, and f is the embedded mapping. Closed submanifold. (M, f) is a closed submanifold of manifold N, if (1) f(M) is a closed subset of N; (2) ∀y ∈ f(M), there is a chart (U, ui ) of N, such that y ∈ U; f(M) ∩ U is defined by um+1 = um+2 = · · · = un = 0, where m = dim M, n = dim N. If (M, f) is a submanifold of the manifold N, and f : M → f(M) is a homeomorphic mapping when f(M) is a topological subspace of N, then (M, f) is called a regular submanifold of N. 7.4. Dual flat manifold The differential manifold–based neural computation treats the invariant geometrical and topological structures on manifold that are composed of neural networks (Amari, 1985, 1987). A structure of differential geometry, dual flat manifold, has been used in neural networks (Amari, 1995, 1998; Luo, 2004). Riemannian connection. Riemannian connection is an affine connection with invariant Riemannian measure, which is featured by the following components: Ŵ0i,jk (θ) = (∂i gjk + ∂j gik − ∂k gij )/2. Dual flat manifold. The tangent vector space Tθ is composed of base tangent vectors ei = ∂i l(x; θ), i = 1, 2, . . . , n; Riemannian measure on manifold N is represented by gij (θ) = (ei , ej ) = E[∂i l(x; θ)∂j l(x; θ)], March 23, 2010 34 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology i, j = 1, 2, . . . , n. Covariant derivatives ∇ and ∇ ∗ , i.e., dual connections on manifold N, are dual if the equality holds: ST, V  = ∇Sα T, V  + T, ∇S−α V , where S, T , and V are arbitrary vector fields on N. If the Riemannian torsion of N concerning ∇ and ∇ ∗ is 0, N is defined as dual flat manifold (Amari, 1985, 1987, 1995, 1998). 
If N is a dual flat manifold, there are affine coordinate systems, θ and η, and potential energy functions, φ(θ) and ϕ(η), such that gij (θ) = ∂2 φ(θ)/(∂θi ∂j ), θi = ∂ϕ(η)/∂ηi , gij∗ (η) = ∂2 ϕ(η)/(∂ηi ηj ), ηi = ∂φ(θ)/∂θi . Base vectors, ei and e∗i , are dual and j ei , e∗j  = ∂/∂θi , ∂/∂ηj  = δi A statistical model is a family of probability distributions. A probability space, usually represented by a set of parameters, i.e., parameter space, is a manifold if a topological structure is defined. This is a dual flat manifold (Amari, 1985,1998; Luo, 2004). A parameter set θ = (θ 1 , θ 2 , . . . , θ n ) of probability distribution of a random variable x, generates an n-dimensional manifold N = {p(x; θ)}. Take l(x, θ) = log p(x; θ), then the tangent vector space is Tθα = {T(x)}, where T(x) = {T i ∂i lα (x; θ)}, and lα (x; θ) = l(x; θ), if α = 1, lα (x; θ) = 2p(x; θ)2/(1−α) /(1 − α), if α  = 1. The inner product may be defined for S, T, V ∈ Tθα : S, T  = Eα (Slα Tlα ), ∇Sα T, V  = Eα (STl α Vl α ). 7.5. Manifold of exponential family If a neural network with a fixed topological structure can be recognized by the probability model of exponential family with the parameter set θ = March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 35 (θ1 , θ2 , . . . , θn ), i.e., the structure of a neural network can be expressed as the probability distribution of exponential family of the random variable, x p(x; u) = exp(θi (u)ri (x) + k(x) − ϕ(θ(u))), where ϕ(θ) is the potential energy function; the neural network model is called a neural network of manifold of exponential family. The probability distribution family, S, is called an n-dimensional manifold of distribution of exponential family. S is a dual flat manifold. When r = (r1 , r2 , . . . , rn ) is a random real variable, the probability distribution is p(r; θ) = exp(θi ri + k(r) − ϕ(θ)). By these equalities (Amari, 1985, 1987, 1995, 1998): ∂i l(r; θ) = ri − ∂i ϕ(θ), ∂i ∂j l(r; θ) = ∂i ∂j ϕ(θ), and ∂i ∂j ∂k l(r; θ) = ∂i ∂j ∂k ϕ(θ), the metrics of manifold S can be achieved: Eθ (ri ) = ∂i ϕ(θ), gij (θ) = ∂i ∂j ϕ(θ), Tijk (θ) = ∂i ∂j ∂k l(r; θ), Ŵαi,jk (θ) = (1 − α)Tijk (θ)/2. Neural networks of manifold of exponential family include Boltzmann machine with hidden tiers, etc. Most of neural networks can be described by probability distributions and manifolds of exponential family (Amari, 1985, 1987, 1995, 1998; Luo, 2004). 7.6. Topological structure of manifold According to the topological properties of input–output relationships, we may define a representation of topological structure (Li, 2004; Chen, 1987). Suppose both X and Y are simplicial complex, if there is a continuous mapping H : X × I → Y , such that: X = H(x, 0), Y = H(y, 1), then X is homotopic to Y , and H is a homotopic mapping. Suppose X and Y are homotopic, the mapping f : X → Y , and g : Y → X, such that: f ◦ g = Iy , g◦ f = Ix , then H(x, t) = tf ◦ g + (1 − t)g◦ f , is the homotopic mapping of X to Y . March 23, 2010 36 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology Suppose Cp (K) is an r-dimensional simplicial complex, and ∂p : Cp (K) → Cp−1 (K) is a homotopic transformation, ∂p is called the boundary operator (i.e., boundary resolution). A structure of simplicial complex can be chain-decomposed to achieve the hierarchical structure of chain and tree. 
For example, an m-dimensional simplicial complex Xm , can be decomposed as the following (Li, 2004) ∂m Tree : Xm −→ ∂m  ∂m−1 Xm−1 −−−→  ∂m−1 ∂m−2 ∂1 → Xm−2 −−−→ · · · − ∂m−2  ···  X0 ; ∂1 → X0 . Chain : Xm −→ Xm−1 −−−→ Xm−2 −−−→ · · · − 8. Functional Analysis 8.1. Functional representation Technically, the functional relationship, s = f(r), where s = y(t), r = x(t), is called an operator. The functional relationship, y(t) = f(t, x(τ)|τ ≤ t), is generally called a functional (Yan and Zhang, 2000). Usually a linear functional can be represented by  f(t, x(τ)|τ ≤ t) = g(t, τ)x(τ)dτ. A common nonlinear functional is the n-th order regular homogeneous functional    f(t, x(τ)|τ ≤ t) = · · · g(t, τ1 , τ2 , . . . , τn ) × x(τ1 )x(τ2 ) · · · x(τn )dτ1 dτ2 · · · dτn . A functional, f(t, x(τ)|τ ≤ t), can be expanded as a Volterra series:   f(t, x(τ)|τ ≤ t) = · · · g(t, τ1 , τ2 , . . ., τn )x(τ1 )x(τ1 ) · · ·  × x(τn )dτ1 dτ2 · · · dτn = g0 (t) + g(t, τ)x(τ)dτ   + g(t, τ1 , τ2 )x(τ1 )x(τ2 )dτ1 dτ2 March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks +··· +  ···  37 g(t, τ1 , τ2 , . . ., τn )x(τ1 )x(τ1 ) · · · × x(τn )dτ1 dτ2 · · · dτn + · · · , which is equivalent to the three-tier feedforward artificial neural network (Yan and Zhang, 2000). 8.2. Functional analysis Some methods of functional analysis may be used in the mathematical analysis of ANNs. Some principles and methods of functional analysis are discussed in the following (Rudin, 1991; Liu, 2000; Men and Feng, 2005; Zhang, 2007). Deflation Principle 1. Suppose (X, d) is a complete metric space, a mapping T : X → X, such that d(Tx, Ty) ≤ θd(x, y), ∀x, y ∈ X, where θ ∈ (0, 1), then there is only one fixed point x′ ∈ X, such that Tx ′ = x′ . Deflation Principle 2. suppose (X, d) is a complete metric space and T : X → X, is a mapping, if there is a natural number n0 , such that d(T n0 x, T n0 y) ≤ θd(x, y), ∀x, y ∈ X, where θ ∈ [0, 1), then there is only one fixed point x′ ∈ X, such that Tx ′ = x′ . According to Deflation Principle 1, take x0 ∈ X, and conduct iterative calculation xn+1 = Tx n , if {xn } is sequential convergent, then the limit of {xn } is the fixed point x′ ; moreover, the error estimation to approximate x′ with xn is d(xn , x′ ) ≤ θ n d(x0 , Tx 0 )/(1 − θ). The nearer x0 approaches Tx 0 , the smaller the error is. As an example, consider a differential equation dx/dt = f(x, t), x|t=t0 = x0 , March 23, 2010 38 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology where f(x, t) is continuous on R2 , and satisfies the condition |f(x1 , t) − f(x2 , t)| ≤ K|x2 − x1 | with respect to x. To show that the above problem has a unique solution in the neighborhood of t0 : Take δ > 0, such that Kδ < 1. Define an operator on C[t0 − δ, t0 + δ]  t f(x(τ), τ)dτ + x0 , Tx(t) = t0 then T is the mapping of R2 to itself, and  t (f(x1 (τ), τ) − f(x2 (τ), τ))dτ d(Tx 1 , Tx 2 ) = max |t−t0|≤δ ≤ max |t−t0|≤δ t0 t  K|x2 (τ) − x1 (τ)|dτ t0 ≤ Kδ max |x2 (τ) − x1 (τ)| |t−t0|≤δ = Kδd(x1 , x2 ). The space C[t0 − δ, t0 + δ] is complete, and 0 ≤ Kδ < 1. The existence and uniqueness of solution is thus shown by the Deflation Principle. Suppose X and Z are normed spaces, S(T) is the linear subspace of X. If the mapping T :S(T) → Z, satisfies the following conditions: T(x + y) = Tx + Ty, T(αx) = αTx, ∀x, y ∈ S(T), α ∈ K, then T is said to be a linear operator from inside Xto inside Z. S(T) is the domain of definition of T , then T is said to be a linear operator on X into Z, if S(T) = X. 
f is called a linear functional if it is a linear operator on normed space X into a number field K. For the linear operator T on normed space X into normed space Z, ∃M ∈ K, such that Tx ≤ M x , ∀x ∈ X, T is called a bounded linear operator. Then T is continuous if and only if T is bounded. March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 39 Hahn–Banach Theorem on Real Space. Suppose M is the linear subspace of linear real space X, g : X → R, and g(x + y) ≤ g(x) + g(y), g(αx) = αg(x), ∀x, y ∈ X, α ≥ 0, Moreover, f is a linear functional on M and f(x) ≤ g(x), x ∈ M, then there is a linear functional p(x) on X, such that p(x) = f(x), x ∈ M, −g(−x) ≤ p(x) ≤ g(x), x ∈ X. Riesz Theorem 1. Suppose f is the bounded linear functional on C[a, b], then there is a function of bounded variation, v(t), on [a, b], such that  b x(t)dv(t), x ∈ C[a, b], f(x) = a f = V(v), where V(v) is the total variation of v(t) on [a, b]. Moreover, based on any function of bounded variation, v(t) on [a, b], we may define a bounded linear functional on C[a,b] through the above expression. Riesz Theorem 2. Suppose H is a Hilbert space, f is an arbitrary bounded linear functional on H, then there is only one yf ∈ H, such that f(x) = (x, yf ), ∀x ∈ H, f = yf . Hilbert–Schmidt Theorem. Suppose T is a self-conjugate compact operator on Hilbert space H, then there is an orthonormal system {en }, which is composed of eigenvectors corresponding to eigenvalues {λn }, λn = 0, such that  x= αn en + x0 , Tx 0 = 0,  Tx = λn αn en . March 23, 2010 40 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology If {en } is infinite, then lim λn = 0. n→∞ The decomposition of eigenspace is an important topic in signal processing. 9. Algebraic Topology Homotopy method (Lin, 1998) has been used to solve the local minimum problem in BP algorithm. Using homotopy method to find the zero point of nonlinear function f(x), we need to get a related and simpler function g(x). Firstly, the zero point of g(x) is obtained, and gradually via transition the zero point of f(x) is obtained. A homotopy function is constructed as follows: H(t, x) = (1 − t)g(x) + tf(x), where t is the parameter variable. During the training twill gradually change from 0 to 1, and H(0, x) = g(x), if t = 0; zero point is easy to obtain; H(1, x) = f(x), if t = 1; zero point is what we need. The trajectory x0 (t) of zero point of H(t, x) is traced when t changes from 0 to 1, and the solution transits from x0 (0) to x0 (1) — the zero point is thus obtained. 10. Motion Stability Given the motion equation of a system dx/dt = f(x, t), where x(t) = (x1 (t), x2 (t), . . . , xn (t))T , f(x, t) = (f1 (x, t), f2 (x, t), . . . , fn (x, t))T , March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 41 the system can be categorized by the form of f(x, t): if f(x, t) = f(x, t), this is a nonlinear and nonstationary system; if f(x, t) = f(x), a nonlinear stationary system; if f(x, t) = Ax, a linear stationary system; if f(x, t) = A(t)x, a nonstationary linear system. 10.1. Motion stability and discrimination Suppose x ∈  ⊂ Rn , t ∈ I = (t1 , t2 ), the domain of definition of f(x, t) is  × I and f(x, t) is continuous on  × I. If there is a constant, K, such that |f(x, t) − f(y, t)| ≤ K|x − y|, ∀x, y ∈ , t ∈ I, then f(x, t) is said to satisfy Lipschitz condition on  × I. f(x, t) satisfies Lipschitz condition, if ∂fi (x, t)/∂xi , i = 1, 2, . . . , n, are finite. 
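Before turning to the existence theorem that follows, the homotopy construction H(t, x) = (1 − t)g(x) + tf(x) of Sec. 9 can also be traced numerically: increase t in small steps and re-solve for the zero point with a few Newton corrections at each step. A minimal one-dimensional Matlab sketch, where the functions f and g are illustrative choices rather than examples from the text:

%Trace the zero of H(t,x)=(1-t)*g(x)+t*f(x) from t=0 to t=1
f=@(x) x.^3-2*x-5;           %function whose zero point is wanted
g=@(x) x-2;                  %simpler function with known zero x=2
df=@(x) 3*x.^2-2;
dg=@(x) 1;
x=2;                         %zero of g, i.e., x0(0)
for t=0:0.01:1
   for k=1:5                 %Newton corrections at this value of t
      H=(1-t)*g(x)+t*f(x);
      dH=(1-t)*dg(x)+t*df(x);
      x=x-H/dH;
   end
end
x                            %x0(1), approximate zero point of f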
Given a system, dx/dt = f(x, t), if f(x, t) is continuous and satisfies Lipschitz condition on  × I, then there is a constant c > 0, and an unique solution x = x(t) on [t0 −c, t0 +c] for any (x0 , t0 ) ∈ ×I, where x(t) is continuous and x(t0 ) = x0 . Suppose z(t), y(t), and x(t) are the given motion, perturbation, and observed motion, respectively, i.e., z(t) = x(t) − y(t). It is obvious that dy/dt = f(y(t) + z(t), t) − f(z(t), t) = g(y, t). Given the perturbation equation of a given system, dz/dt = f(z, t), dy/dt = g(y, t), g(0, t) = 0. Inside a B-neighborhood of y(t) = 0, i.e., {(y1 (t), y2 (t), . . . , yn (t))|yi (t) < B, i = 1, 2, . . . , n}, given a real value 0 < ε < B, if there is a real value, δ = δ(ε, t0 ), such that the perturbed motion, yi (t), satisfies the following: |yi (t)| < ε, i = 1, 2, . . . , n; ∀t ≥ t0 , when |yi (0)| ≤ δ, i = 1, 2, . . . , n; the given motion, z(t), is said to be stable. If the given motion, z(t), is stable and lim yi (t) = 0, t→∞ i = 1, 2, . . . , n, March 23, 2010 42 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology it is asymptotic stable. If there is a real value ε > 0, such that a value, δ, that satisfies the stability condition, is not existent, the given motion is unstable. If x(t) = xe = (x1e , x2e , . . . , xne )T , or f(x, t) = 0, the system is in an equilibrium state, and xe is the equilibrium state. For any real value ε > 0, if there is a value, δ, such that |xi (t) − xie | < ε, i = 1, 2, . . . , n; ∀t ≥ t0 , when |xi (0) − xie | ≤ δ, i = 1, 2, . . . , n; xe is stable. If xe is stable and lim xi (t) = xie , t→∞ i = 1, 2, . . . , n, the state xe is asymptotic stable. If there is a real value ε > 0, such that a value, δ, that satisfies the stability condition, is not existent, the state xe is unstable (Yan and Zhang, 2000). If the nonlinear function, f(x, t), is considerably smooth, the system can be linearized around the equilibrium state, xe , dx/dt = A(t)x(t), by taking x(t) = xe + x(t), f(x, t) = xe + A(t)x(t), where A(t) = ∂f(x, t)/∂x|x=xe . Suppose A(t) = A, and |A−1 | = 0: if the eigenvalues of A are negative real values, xe is a stable node; if the eigenvalues are conjugate complex numbers with negative real parts, xe is a stable focal point. Given a real function V(x) defined in a neighborhood, , of the origin. V(0) = 0. V is single valued on , and is continuously derivative with respect to xi , i = 1, 2, . . . , n. V(x) is positive definite (negative definite), if for any x ∈ , in exception of x = 0, V(x) > 0 (V(x) < 0); V(x) is half positive definite (half negative definite), if for any x ∈ , V(x) ≥ 0 (V(x) ≤ 0); V(x) is sign changing, if V(x) will be positive, zero, or negative values for different x ∈ . 
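Returning to the linearization described above (before the Liapunov criteria that follow), the eigenvalue test is straightforward to carry out numerically: evaluate the Jacobian A = ∂f/∂x at the equilibrium state and inspect its eigenvalues. A small Matlab sketch for a hypothetical two-dimensional system; the Symbolic Math Toolbox is assumed only for computing the Jacobian, and A could equally be entered by hand:

%Stability of the equilibrium xe=(0,0) of an illustrative nonlinear system
%dx1/dt=-x1+x2^2,  dx2/dt=-2*x2+x1*x2
syms x1 x2
f=[-x1+x2^2; -2*x2+x1*x2];
A=double(subs(jacobian(f,[x1 x2]),[x1 x2],[0 0]));
lambda=eig(A)
%Both eigenvalues (-1 and -2) are negative real values,
%so xe=(0,0) is a stable node in the sense of Sec. 10.1.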
March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 43 Based on the definition of V(x), the Liapunov theorem, which is used to discriminate the stability of nonlinear systems, is as follows (1) the system is stable on the origin, if there is a positive (negative) definite function V(x) in the neighborhood, , of the origin, and the derivative of V(x) is half negative (positive) definite; (2) the system is asymptotic stable on the origin, if there is a positive (negative) definite function V(x) in the neighborhood, , of the origin, and the derivative of V(x) is negative (positive) definite; (3) the system is unstable on the origin, if there is a function V(x) in the neighborhood, , of the origin, and the derivative of V(x) is positive (negative) definite, but V(x) itself is not half negative (positive) definite. 10.2. Stability of feedback networks Feedback neural networks are nonlinear dynamic systems and therefore the stability of a system is an important topic (Yan and Zhang, 2000). The stability of neural networks can be analyzed in two ways. One way is to treat a neural network as a deterministic system and use a group of nonlinear differential equations to describe it; another way is to treat it as a stochastic system and use a group of nonlinear stochastic differential equations to describe it. A feedback neural network may correspond to a model of continuous time–continuous state, discrete time–discrete state, discrete time– continuous state, and continuous time–discrete state. Given a completely connected feedback neural network that contains n neurons, in which any neuron is connected to remaining n − 1 neurons, the weight between neuron i to neuron j is wij , The output of i-th neuron at time t is xi (t) = 1 or −1, i = 1, 2, . . . , n. The behavior of feedback neural network can be considered to be a process of state transition, i.e., a dynamic process (Luo, 2004), which is represented by dx/dt = x(t) + f(wx). The stability of the system can be discriminated by using Liapunov theorem. March 23, 2010 44 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology 11. Entropy of a System The probability model of a dynamic system can be developed based on the principle of maximum entropy and the principle of minimum relative information. 11.1. Principle of maximum entropy If the prior distribution of micro-states is not given, the principle of maximum entropy will be expressed as the following optimum problem max H = − n  pi log pi , i=1 n  pi f(xi ) = f(xi ), i=1 n  pi = 1, i=1 where x(t) = (x1 (t), x2 (t), . . . , xn (t))T ; f(x) = (f(x1 ), f(x2 ), . . . , f(xn ))T ≥ 0, is the state function, which is an analogue of energy function; f(x) is the mean of f(x), and pi is the occurrence probability of micro-state xi . Using Lagrange multiplier method, the maximum entropy is achieved as the following: Hmax = µ + µf(x), where µ ≥ 0, is the Lagrange parameter (1/µ is the analogue of temperature in thermodynamics), and L(µ) = n  exp(−µf(xi )), i=1 µ = log L(µ), pi = exp(−µf(xi ))/L(µ), 11.2. i = 1, 2, . . . , n. Principle of minimum relative information If the prior distribution of micro-states, (p01 , p02 , . . . , p0n ), is given, the principle of minimum relative information will be represented by min I = n  i=1 pi log(pi /p0i ), March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks n  45 pi f(xi ) = f(xi ), i=1 n  pi = 1. 
i=1 The principle of minimum relative information minimizes the difference between {pi } and {p0i }. This principle is the generalization of principle of maximum entropy. The former is applicable to both continuous and discrete systems. For a continuous system, the principle of minimum relative information is represented by min I =   p(x) log(p(x)/p0 (x))dx, p(x)f(x)dx = f(x),  p(x)dx = 1. The solution to this problem is L(µ) =  p0 (x) exp(−µf(x))dx, p(x) = p0 (x) exp(−µf(x))/L(µ), Imin = −µf(x) − log L(µ). 11.3. Principle of minimum mean energy The principle of minimum mean energy describes the convergence degree of system to its limit state, given a certain degree of disorder. The principle of minimum mean energy is represented by min f(x) = n  i=1 pi f(xi ), March 23, 2010 46 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology − n  pi log pi = E, i=1 n  pi = 1. i=1 The solution is L(µ) = n  exp(−µf(xi )), i=1 pi = exp(−µf(xi ))/L(µ), 11.4. i = 1, 2, . . . , n. Probability distribution with maximum entropy The entropy of exponential distribution p(x) = λ exp(−λx), p(x) = 0, x > 0, λ > 0; x≤0 is H = log λ − 1. Among all of the density functions with the same mean the exponential distribution has the maximum entropy. The entropy of normal distribution is  ∞ H =− p0 (x) log p(x)dx, −∞ where p0 (x) is an arbitrary density function having the same mean and variance with the normal distribution, p(x). Among all of the density functions with the same mean and variance the normal distribution has the maximum entropy. 12. Distance or Similarity Measures Measures of distance and similary are frequently used in the design of neural network models. There are many measures of distance and similary (Zhang and Fang, 1982; Zhang, 2007). Given two n-dimensional vectors, x = (x1 , x2 , . . . , xn ), y = (y1 , y2 , . . . , yn ), the mathematical representations of some measures are described bellow. March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 12.1. 47 Similarity measures (1) Angular cosine: s = xyT /(xx T yyT )1/2 (2) Alliance coefficient: s = (r 2 /(r 2 + n.. ))1/2 where there are pdiscrete values, t1 , t2 , . . . , tp , for x, and q discrete values, r1 , r2 , . . . , rq , for y. nkl is the number of elements of which x has tk and y has rl , k = 1, 2, . . . , p; l = 1, 2, . . . , q, and   q p   n2ij /(ni. n.j ) − 1 r 2 = n..  i=1 j=1 n.. = p  ni.. q  nij p  nij i=1 ni.. = j=1 n.j. = i=1 (3) Linkage coefficient 1: s = (r 2 /(n.. max(p − 1, q − 1)))1/2 (4) Linkage coefficient 2: s = (r 2 /(n.. min(p − 1, q − 1)))1/2 (5) Linkage coefficient 3: s = (r 2 /(n.. ((p − 1)(q − 1))1/2 ))1/2 (6) Point correlation coefficient: s = (ad − bc)/((a + b)(c + d)(a + c)(b + d))1/2 March 23, 2010 48 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology where a is the number of element pairs that both x and y are zero; d is the number of element pairs that both x and y are non-zero; b is the number of element pairs that x is zero and y is non-zero; c is the number of element pairs that x is non-zero and y is zero. (7) Quarter correlation coefficient: s = sin((a + d − (b + c))/(a + b + c + d) ∗ 3.1415926/2) (8) Angular cosine variant 1: s = (a ∗ a/((a + b)(a + c)))1/2 (9) Angular cosine variant 2: s = (a ∗ a ∗ d ∗ d/((a + b)(a + c)(b + d)(c + d)))1/2 12.2. 
Distance measures (1) Euclidean distance: d = ((x − y)(x − y)T )1/2 /n (2) Manhattan distance: d= n  |xk − yk |/n k=1 (3) Chebyshov distance: d = max |xk − yk | (4) Jaccard coefficient: d = (bx + by )/(cx + cy − a) where bx is the number of element pairs that x is non-zero and y is zero; by is the number of element pairs that y is non-zero and x is zero; cx and cy are the number of non-zero elements of x and y, respectively; a is the number of element pairs that both x and y are non-zero. March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 49 The Matlab codes for measures above are as follows Euclidean distance: function distance=euclideandis(x,y) %x and y: two vectors to be tested. %label1 if (max(size(x))˜=max(size(y))) error(’Array sizes do not match.’); end if ((min(size(x))˜=1) | (min(size(y))˜=1)) error(’Both x and y are vectors’); end %label2 distance=sqrt(sum((x-y).ˆ2))/max(size(x)); Manhattan distance: function distance=manhattandis(x,y) %x and y: two vectors to be tested. %insert here the contents between label1 to lable2 in the codes of %Euclidean distance distance=sum(abs(x-y))/max(size(x)); Chebyshov distance: function distance=chebyshovdis(x,y) %x and y: two vectors to be tested. %insert here the contents between label1 to lable2 in the codes of Euclidean distance distance=max(abs(x-y)); Jaccard coefficient: function distance=jaccarddis(x,y) %x and y: two vectors to be tested. %insert here the contents between label1 to lable2 in the codes of Euclidean distance bb=0; cc=0; dd=0; nn1=0;rr1=0; for kk=1:max(size(x)) if (x(kk)˜=0) nn1=nn1+1; end if (y(kk)˜=0) rr1=rr1+1; end if ((x(kk)==0) & (y(kk)˜=0)) bb=bb+1; end if ((x(kk)˜=0) & (y(kk)==0)) cc=cc+1; end if ((x(kk)˜=0) & (y(kk)˜=0)) dd=dd+1; end end distance=(cc+bb)/(nn1+rr1-dd); Angular cosine: function similarity=angularcosinesim(x,y) %x and y: two vectors to be tested. 
%insert here the contents between label1 to lable2 in the codes March 23, 2010 50 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology of Euclidean distance aa=sum(x.*y,2); bb=sum(x.ˆ2,2); cc=sum(y.ˆ2,2); similarity=aa./sqrt(bb.*cc); Alliance coefficient function similarity=linkagesim(x,y) %insert here the contents between label1 to lable2 in the codes of Euclidean distance pn=1; qn=1; %label3 pp(1)=y(1); ww(1)=x(1); for kk=1:max(size(x)) jj=0; for ii=1:pn if (y(kk)˜=pp(ii)) jj=jj+1; end; end if (jj==pn) pn=pn+1; pp(pn)=y(kk); end jj=0; for ii=1:qn if (x(kk)˜=ww(ii)) jj=jj+1; end; end if (jj==qn) qn=qn+1; ww(qn)=x(kk); end end for kk=1:pn for jj=1:qn temp(kk,jj)=0; for ii=1: max(size(x)) if ((y(ii)˜=pp(kk))&(x(ii)˜=ww(jj))) temp(kk,jj)=temp(kk,jj)+1; end; end end end summ=0; for kk=1:pn pp(kk)=0; for jj=1:qn pp(kk)=pp(kk)+temp(kk,jj); end summ=summ+pp(kk); end for kk=1:qn ww(kk)=0; for jj=1:pn˜ ww(kk)=ww(kk)+temp(jj,kk); end; end xsquare=0; for kk=1:pn for jj=1:qn xsquare=xsquare+temp(kk,jj)*temp(kk,jj)/(pp(kk)*ww(jj)); end; end xsquare=summ*(xsquare-1); %label4 similarity=sqrt(xsquare/(xsquare+summ)); Linkage coefficient 1: function similarity=colinkage1sim(x,y) %insert here the contents between label1 to lable2 in the codes March 23, 2010 20:1 9in x 6in B-922 b922-ch10 1st Reading Mathematical Foundations of Artificial Neural Networks 51 of Euclidean distance %insert here the contents between label3 to lable4 in the codes of alliance coefficient similarity =sqrt(xsquare/(summ*max(pn-1,qn-1))); Linkage coefficient 2: function similarity=colinkage2sim(x,y) %insert here the contents between label1 to lable2 in the codes of Euclidean distance %insert here the contents between label3 to lable4 in the codes of alliance coefficient similarity=sqrt(xsquare/(summ*min(pn-1,qn-1))); Linkage coefficient 3: function similarity = colinkage3sim(x,y) %insert here the contents between label1 to lable2 in the codes of Euclidean distance %insert here the contents between label3 to lable4 in the codes of alliance coefficient similarity=sqrt(xsquare/(summ*sqrt((pn-1)*(qn-1)))); Point correlation coefficient: function similarity=pointcorresim(x,y) %insert here the contents between label1 to lable2 in the codes of Euclidean distance aa=0; bb=0; cc=0; %label5 dd=0; %label6 for kk=1:max(size(x)) if ((x(kk)==0)&(y(kk)==0)) aa=aa+1; end if ((x(kk)==0)&(y(kk)˜=0)) bb=bb+1; end if ((x(kk)˜=0)&(y(kk)==0)) cc=cc+1; end if ((x(kk)˜=0)&(y(kk)˜=0)) dd=dd+1; end %label7 end %label8 similarity =(aa*dd-bb*cc)/sqrt((aa+bb)*(cc+dd)*(aa+cc)*(bb+dd)); Quarter correlation coefficient: function similarity=quadraticcorresim(x,y) %insert here the contents between label1 to lable2 in the codes of Euclidean distance %insert here the contents between label5 to lable8 in the codes of point correlation coefficient similarity =sin((aa+dd-(bb+cc))/(aa+bb+cc+dd)*3.1415926/2); March 23, 2010 52 20:1 9in x 6in B-922 b922-ch10 1st Reading Computational Ecology Angular cosine variant 1: function similarity = angularcosine1sim(x,y) %insert here the contents between label1 to lable2 in the codes of Euclidean distance %insert here the contents between label5 to lable8 in the codes of point correlation coefficient, %without codes of label7 and label7 similarity = sqrt(aa*aa/((aa+bb)*(aa+cc))); Angular cosine variant 2: function similarity=angularcosine2sim(x,y) %insert here the contents between label1 to lable2 in the codes of Euclidean distance %insert here the contents between label5 to lable8 in the codes of point correlation 
coefficient similarity = sqrt(aa*aa*dd*dd/((aa+bb)*(aa+cc)*(bb+dd)*(cc+dd))); March 22, 2010 18:24 9in x 6in B-922 b922-ch11 1st Reading  CHAPTER 11  Matlab Neural Network Toolkit Matlab is one of the most popular softwares for scientific and engineering computation in the world. A neural network tool box (Neural Network Toolbox) is provided in Matlab (Mathworks, 2002). As Matlab is updated with new versions, more neural network models are supplemented to the neural network tool box. Basic network models in Matlab include perceptron model, linear networks, BP network, RBF network, self-organizing networks, ELMAN network, feedback networks, etc. Matlab provides various learning algorithms. Moreover, it permits the users to independently design their own neural network models. Some Matlab functions of neural networks are described in this chapter. More details can be found in Mathworks (2002) and Fecit (2002). 1. Functions of Perceptron 1.1. Neural network functions (1) newp Newp creates a perceptron used to make simple classification (Mathworks, 2002; Fecit, 2002). Syntax: net=newp; net=newp(mr,s,tf,lf); where mr: r × 2 matrix of minimum and maximum values of r input elements; s: number of neurons; tf: transfer function, i.e., hardlims, 1 March 22, 2010 18:24 9in x 6in B-922 b922-ch11 1st Reading 2 Computational Ecology or hardlim (default); lf: learning function, i.e., learnpn, or learnp (default). Default functions used in newp are adaptFcn (adapt function): trains; gradientFcn (gradient function): calcgrad; initFcn (initialization function): initlay; performFcn (performance function): mae; trainFcn (training function): trainc. Examples: net=newp([-1 1; -5 5],2, ‘hardlims’, ‘learnpn’); net=newp([-2 1; -3 9; -1 1],3); (2) init Init assigns initial parameter values for iteration. Syntax: net=init(net0); net=init(net0,var,mea,sp); where net0: original network; net: network with initial parameter values; var: variance of initial parameters (default: 1); mea: mean of initial parameters (default: [], i.e., the parameter values of net0); sp: stability requirement of predictor or system (sp=s (system is stable), p (predictor is stable), or b (both system and predictor are stable); default: p, i.e., predictor is stable). (3) sim Sim simulates a Simulink model with user’s parameter settings in dialog box. Syntax: [tim,sta,out]=sim(net, tspn, opt, uptin); [tim,sta,out1,out2,...,outn]=sim(net, tspn, opt, uptin); where tim: time vector; sta: state (matrix or stuucture format); out: output (matrix or structure format); out1,out2,...,outn: outputs of n root-level outport blocks; net: a network model; tspn: time span (a time, or time interval, or specified time points); opt: optional parameters; optin: optional input. March 22, 2010 18:24 9in x 6in B-922 b922-ch11 1st Reading Matlab Neural Network Toolkit 3 Example: p=[1 2 3 4 5 6 7 8 9 10]; net=sim(net,p); (4) train This function is used to train a neural network according to net.trainFcn and net.trainParam. 
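The perceptron functions introduced above (newp, sim, and train, whose full syntax is given next) can be combined into a short working script. The data below are illustrative two-dimensional points with binary class labels, not taken from the text:

%A minimal perceptron classification example
p=[0 0 1 1; 0 1 0 1];        %four 2-dimensional input vectors
t=[0 0 0 1];                 %target classes (logical AND)
net=newp([0 1; 0 1],1);      %one neuron, inputs in [0,1]x[0,1]
net.trainParam.epochs=20;
net=train(net,p,t);          %adjust weights and bias
y=sim(net,p)                 %simulated outputs, ideally equal to t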
Syntax: [net,tr,out,err,find,flad]=train(net0,in,tar,ind,lad,strv,strt); where net: trained network; net0: initial network; tr: training epoch and performance; out: outputs of network; err: errors of network; find: final conditions of input delay; flad: final conditions of layer delay conditions; in: inputs of network; tar: targets of network (default: zeros); ind: initial conditions of input delay (default: zeros); lad: initial conditions of layer delay (default: zeros); strv: structure of validation vectors (default: []); strt: structure of test vectors (default: []). Example: p=[1 2 03 4 5 6 7 8 9 10]; t=[0.1 0.2 0.4 0.1 0.3 -0.2 0.5 -0.6 0.7 -0.4]; net.trainParam.epochs=1000; net.trainParam.goal=0.001; net=train(net,p,t); 1.2. Initilization functions (1) initlay Initlay initializes each layer i according to initialization function net.layers{i}.initFcn. Syntax: net=initlay(net); %returns a layer-updated network inf=initlay(para); %returns function information Application: net. layers{i}.initFcn=’initlay’; %The weights and biases of layer i are initialized March 22, 2010 18:24 9in x 6in B-922 b922-ch11 1st Reading 4 Computational Ecology (2) initwb Layer initialization function initwb initializes a layer’s weights and biases according to its own initialization functions. It returns the initialized network with some layer’s weights and biases updated. Syntax: net=initwb(net,i); where i: layer index. 1.3. Input function The input function netsum combines a layer’s weighted inputs and bias to achieve a layer’s net input. Syntax: inp=netsum({x1,x2,...,xn},fp); %takes x1 −xn and optional function parameters (fp) and returns elementwise sum of x1 , ..., xn deri=netsum(dx,i,x,y,fp); %returns the derivative of y with respect to xi %Default values are used if fp is not supplied inf=netsum(para); %returns function information where xi: s × q matrices; fp: function parameters. Example: x1=[2 1 3; 5 9 6]; x2=[7 4 5; -1 3 4]; su=netsum({x1,x2}); 1.4. Weight function (1) dotprod The weight function dotprod yields dot product weights and applies weights to an input to get weighted inputs. Syntax: dp=dotprod(w,p,fp); %returns the s×q dot product of w and p. There are s layers and q input vectors dim=dotprod(size,s,r,fp); March 22, 2010 18:24 9in x 6in B-922 b922-ch11 1st Reading Matlab Neural Network Toolkit 5 %takes the layer dimension s, input dimention r, and function parameters, and returns the %weight size s×r. dp=dotprod(dp,w,p,x,fp); %returns the derivative of x with respect to p dw=dotprod(dw,w,p,x,fp); %returns the derivative of x with respect to w inf=dotprod(para); %returns function information where w: s × r weight matrix; p: r × q weight matrix of q input vectors; fp: row cell array of function parameters (optional); s: dimension of layer; r: dimension of input. Applications: net.inputWeight{i,j}.weightFcn=’dotprod’; %change to dotprod for input weight use net.inputWeight{i,j}.weightFcn=’dotprod’; %change to dotprod for a layer weight use (2) normprod Normprod is the normalized dotprod. Syntax: dp=normprod(w,p,fp); 1.5. Transfer functions (1) hardlim Hard limit transfer function hardlim calculates a layer’s output from the net input. It takes the form: y = hardlim(x) = 0, if x < 0; y = 1, if x ≥ 0. Syntax: out=hardlim(in); %returns 1 if in is non-negative, and 0 if in is %negative inf=hardlim(para); %returns function information where in is a s × q matrix of s-dimensional net input vectors. 
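Before the hardlim application below, the weight, net-input, and transfer functions described above can be chained by hand to reproduce what a one-layer network computes. A small illustrative sketch, with weights, bias, and input chosen arbitrarily:

%Compute a layer output manually: weight fcn -> net input fcn -> transfer fcn
w=[1 -2; 0.5 1];             %2x2 weight matrix (s=2 neurons, r=2 inputs)
b=[0.5; -1];                 %bias vector
p=[1; 1];                    %one input vector
z=dotprod(w,p);              %weighted input, w*p
n=netsum({z,b});             %net input, z+b
a=hardlim(n)                 %layer output, 1 where n>=0 and 0 elsewhere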
Application and example: net.layers{i}.transferFcn = ’hardlim’; %Assignthis transfer function to layer i of network a=-3:0.1:8; b=hardlim(a); March 22, 2010 18:24 9in x 6in B-922 b922-ch11 1st Reading 6 Computational Ecology (2) hardlims Hardlims calculates a layer’s output from the net input. It takes the form: y = hardlim(x) = −1, if x < 0; y = 1, if x ≥ 0. Syntax: out=hardlims(in); %returns 1 if in is non-negative and -1 if in is %negative inf=hardlims(para); %returns function information where in: s × q matrix of s-dimensional net input vectors. Example: a=-3:0.1:8; b=hardlim(a); 1.6. Learning functions (1) learnp Learnp is a perceptron weight/bias learning function. Syntax: [dw,nls]=learnp(w,in,win,nin,out,ltv,lev,gp,ogp,nd,lp,ls); where dw: s × r weight/bias change matrix; nls: new learning state; w: s × r weight matrix; in: r × q input vectors; win: s × q weighted input vectors; nin: s × q net input vectors; out: s × q output vectors; ltv: s × q layer target vectors; lev: s × q layer error vectors; gp: s × r gradient with respect to performance; ogp: s × q output gradient with respect to performance; nd: s × s neuron distances; lp: learning parameters (default: []); ls: learning state (default: []). Applications: net.inputWeights{i,j}.learnFcn=‘learnp’; net.layerWeights{i,j}.learnFcn=‘learnp’; net.biases{i}.learnFcn=‘learnp’; (2) learnpn It is a normalized perceptron weight/bias learning function, which performs faster learning than learnp when input vectors have widely varying magnitudes (Mathworks, 2002). March 22, 2010 18:24 9in x 6in B-922 b922-ch11 1st Reading Matlab Neural Network Toolkit 7 1.7. Performance functions (1) mae Performance function mae is a mean absolute error function. Syntax: per=mae(er,x,pp); %returns mean absolute error of network per=mae(er,net,pp); inf=mae(para); %returns function information where er: matrix or cell array of error vectors; x: vector of all weight and bias values, which can be obtained from a network; pp: performance parameters. Application: net.performFcn=‘mae’; %set performance function of network (2) mse Mse calculates mean squared error of network output. Syntax: per=mse(er,out,webi,fp) deri=mse(‘dy’,er,out,webi,per,fp); %returns derivative of per with respect to out deri=mse(‘dx’,er,out,webi,perf,fp); %returns derivative of per with respect to webi where er: matrix or cell array of error vectors; out: matrix or cell array of output vectors; webi: vector of all weight and bias values; fp: function parameters. 2. Functions of Linear Neural Networks 2.1. Neural network functions (1) newlind Linear neural network function newlind is used to design a linear layer that yields an output given an input with minimum sum square error. March 22, 2010 18:24 9in x 6in B-922 b922-ch11 1st Reading 8 Computational Ecology Syntax: net=newlind(in,tcla); net=newlind(in,tcla,ind); where in: r × q matrix with q input vectors; tcla: s × q matrix with q target class vectors; ind: initial input delay states. Example: in={2 3 5 2 1 6}; ini={7 2}; tcla={1 1 2 1 1 2}; net=newlind(in,tcla,ini); (2) newlin Newlin generates a linear layer that is generally used as adaptive filters for signal processing and prediction (Mathworks, 2002). 
Syntax: net=newlin(in,noel,ind,lr); net=newlin(in,noel,0,inp); %returns a linear layer with the maximum stable %learning rate for input inp where in: r × 2 matrix of minimum and maximum values for r input elements; noel: size of output vector; ind: input delay vector (default: [0]); lr: learning rate (default: 0.01); inp: matrix of input vectors. Default functions used in newlin are adaptFcn: trains; gradientFcn: calcgrad; initFcn: initlay; performFcn: mse; trainFcn: trainb. Example: net=newlin([-2 2],2,[0 1],0.001); in={1 2 -2 0 -1 -1 2 1 0 0}; out=sim(net,in); 2.2. Learning function (1) learnwh Widrow–Hoff function learnwh is a weight/bias learning function, known as the least mean squared (LMS) rule (Mathworks, 2002). Syntax: [dw,nls]=learnp(w,in,win,nin,out,ltv,lev,gp,ogp,nd,lp,ls); March 22, 2010 18:24 9in x 6in B-922 b922-ch11 1st Reading Matlab Neural Network Toolkit 9 where dw: s × r weight/bias change matrix; nls: new learning state; w: s × r weight matrix; in: r × q input vectors; win: s × q weighted input vectors; nin: s × q net input vectors; out: s × q output vectors; ltv: s × q layer target vectors; lev: s × q layer error vectors; gp: s × r gradient with respect to performance; ogp: s × q output gradient with respect to performance; nd: s × s neuron distances; lp: learning parameters (default: []); ls: learning state (default: []). Applications: net.inputWeights{i,j}.learnFcn=‘learnwh’; net.layerWeights{i,j}.learnFcn=‘learnwh’; net.biases{i}.learnFcn=‘learnwh’; 2.3. Analysis function The analysis function maxlinlr is used to calculate learning rates of newlin (Mathworks, 2002). Syntax: lr=maxlinlr(inp); %returns the maximum learning rate for a linear %layer without a bias that is to be trained only %on the vectors in inp. lr=maxlinlr(in,‘bias’); Example: inp=[2 5 3 1; -1 4 7 0]; lr=maxlinlr(inp); 3. Functions of BP Neural Network 3.1. Neural network functions (1) newff Newff is used to create a feedforward backpropagation network. Syntax: net=newff(in,[s1 s2...sn],tf1 tf2...tfn,btf,blf,per); %returns n layer feed-forward backpropagation network where in: r × 2 matrix of minimum and maximum values for r input elements; si: size of ith layer; tfi: transfer function for ith layer (tansig, March 22, 2010 10 18:24 9in x 6in B-922 b922-ch11 1st Reading Computational Ecology logsig, purelin. Default: tansig); btf: backpropagation training function (trainlm, traingd, trainbfg, trainrp, etc. Default: trainlm); blf: backpropagation weight/bias learning function (default: learngdm); per: performance function (default: mse). Example: in=[1 2 3 4 5 6 7 8]; y=[5 4 3 2 1 0 -1 -2]; net=newff(minmax(in),[8 1],{‘tansig’ ‘purelin’}); net=train(net,in,y); out=sim(net,in); (2) newcf This function is used to create a cascade-forward backpropagation network. Syntax: net=newcf(in,[s1 s2...sn],{tf1 tf2...tfn},btf,blf,per); %returns n layer cascade-forward backpropagation %network where in: r × 2 matrix of minimum and maximum values of r input elements; si: size of ith layer; tfi: transfer function for ith layer (tansig, logsig, purelin. Default: tansig); btf: backpropagation training function (trainlm, traingd, traingdm, trainrp, etc. Default: trainlm); blf: backpropagation weight/bias learning function (default: learngdm); per: performance function (default: mse). 3.2. Transfer functions (1) purelin The linear function purelin calculate a layer’s output from its input. It takes the form: y = purelin(x) = x. 
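Since linear layers created with newlin use the purelin transfer function just described (its syntax continues below), the analysis function maxlinlr of Sec. 2.3 pairs naturally with it. A short illustrative sketch, with made-up data defining a simple linear relation:

%Fit a linear relation with newlin, using maxlinlr to set the learning rate
p=[1 2 3 4 5 6 7 8];         %inputs (illustrative)
t=2*p+1;                     %targets of a linear relation
lr=maxlinlr(p,'bias');       %maximum stable learning rate with a bias
net=newlin(minmax(p),1,[0],lr);
net.trainParam.epochs=200;
net.trainParam.goal=1e-4;
net=train(net,p,t);
y=sim(net,[9 10])            %predictions for new inputs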
Syntax: a=purelin(in,fp); %returns a s×q matrix deri=purelin(‘dn’,in,x,fp); %returns s×q derivative of x with respect to in where in: s × q matrix of s-dimensional input vectors; fp: function parameters. March 22, 2010 18:24 9in x 6in B-922 b922-ch11 1st Reading Matlab Neural Network Toolkit 11 Example and application: in=-1:0.1:1 a=purelin(in); net.layersi.transferFcn=‘purelin’; %assign purelin to layer i of network (2) tansig Tansig (hyperbolic tangent sigmoid function) calculates a layer’s output from its input. It takes the form: y = tansig(x) = (exp(x) − exp(−x))/(exp(x) + exp(−x)). Syntax: a=tansig(in,fp) %returns a s×q matrix deri=tansig(‘dn’,in,x,fp); %returns s×q derivative of x with respect to in where in: s × q matrix of s-dimensional input vectors; fp: function parameters. Example and application: in=-1:0.1:1 a=tansig(in); %Fig. 1 net.layers{i}.transferFcn=‘tansig’; %assign tansig to layer i of network (3) logsig Logsig calculates a layer’s output from its input. It takes the form: y = logsig(x) = 1/(1 + exp(−x)). (4) dtansig This function is the derivative function of tansig. Syntax: deri=dtansig(in,fp); Similarly, there are transfer derivative functions, dpurelin, and dlogsig. March 22, 2010 12 18:24 9in x 6in B-922 b922-ch11 1st Reading Computational Ecology Figure 1. Curve of tansig. 3.3. Training functions (1) trainlm Trainlm is a training algorithm that updates weight and bias values using Levenberg–Marquardt optimization. It can train any network as long as its weight, net input, and transfer functions have derivative functions (Mathworks, 2002). Syntax: [net,inf]=trainlm(net0,din,ltar,idin,size,ts,vv,tv); where net0: original network; net: trained network; inf: training information over each epoch (inf.epoch: epoch number; inf.perf: training performance; inf.vperf: validation performance; inf.tperf: test performance); din: delayed input vectors; ltar: target vectors of layer; idin: intial conditions of input delay; size: batch size; ts: time steps; March 22, 2010 18:24 9in x 6in B-922 b922-ch11 1st Reading Matlab Neural Network Toolkit 13 vv: structure of validation vectors, or empty matrix; tv: structure of test vectors, or empty matrix. Application and example: net.trainFcn=‘trainlm’; %assign trainlm to network net.trainParam.epochs=100; %maximum training epochs with default 100 net.trainParam.goal=0.001; %performance goal with default 0 net.trainParam.min_grad=1e-15; %minimum performance gradient with default 1e-10 net.trainParam.show=10; %epochs between displays with default 25 net.trainParam.time=100000; %max time for training (seconds) with default infinity (2) traingd Traingd is a training function, which updates weight and bias values according to gradient descent (Mathworks, 2002). Syntax: [net,inf,lout,lerr]=traingd(net0,din,ltar,idin,size,ts,vv,tv); where lout: collective layer outputs of last epoch; lerr: layer errors of last epoch. The meanings of the remaining variables are the same as trainlm. (3) traingdm Traingdm is a training function, which updates weight and bias values according to gradient descent with momentum (Mathworks, 2002). Syntax: [net,inf,lout,lerr]=traingdm(net0,din,ltar,idin,size,ts,vv,tv); The meanings of all variables are the same as traingdm. 3.4. Learning functions (1) learngdm Learngdm is the gradient descent function with momentum weight/bias learning. 
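Gradient descent with momentum can also be selected at the network level through traingdm, described above; the learning rate and momentum constant are then set through trainParam (the field names lr and mc are the usual ones; the data are illustrative). The syntax of learngdm itself follows below.

%Train a small BP network with gradient descent plus momentum (traingdm)
in=[1 2 3 4 5 6 7 8];
y=sin(in);                   %illustrative target values
net=newff(minmax(in),[5 1],{'tansig' 'purelin'},'traingdm');
net.trainParam.epochs=2000;
net.trainParam.goal=0.001;
net.trainParam.lr=0.05;      %learning rate
net.trainParam.mc=0.9;       %momentum constant
net=train(net,in,y);
out=sim(net,in);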
March 22, 2010 14 18:24 9in x 6in B-922 b922-ch11 1st Reading Computational Ecology Syntax: [dw,nls]=leargdm(w,in,win,nin,out,ltv,lev,gp,ogp,nd,lp,ls); where dw: s × r weight/bias change matrix; nls: new learning state; w: s × r weight matrix; in: r × q input vectors; win: s × q weighted input vectors; nin: s × q net input vectors; out: s × q output vectors; ltv: s × q layer target vectors; lev: s × q layer error vectors; gp: s × r gradient with respect to performance; ogp: s × q output gradient with respect to performance; nd: s ×s neuron distances; lp: learning parameters (lp.lr: learning rate with default 0.01; lp.mc: momentum constant with default 0.9; default: []); ls: learning state (default: []). Applications: net.inputWeights{i,j}.learnFcn=‘learngdm’; net.layerWeights{i,j}.learnFcn=‘learngdm’; net.biases{i}.learnFcn=‘learngdm’; (2) learngd Learngd is the gradient descent function weight/bias learning function (Mathworks, 2002). Syntax: [dw,nls]=leargd(w,in,win,nin,out,ltv,lev,gp,ogp,nd,lp,ls); The meanings of all variables are the same as learngdm. 4. Functions of Self-Organizing Neural Networks 4.1. Neural network functions (1) newsom Newsom creates a self-organizing map network and resultant competitive layers are used to make classification. Syntax: net=newsom(in,[s1,s2,...sn],tf,df,olr,st,tlr,tnd); %returns a self-organizing map neural network where in: r × 2 matrix of minimum and maximum values of r input elements; si: size of ith layer (defaults: [5 8]); tf: topology function March 22, 2010 18:24 9in x 6in B-922 b922-ch11 1st Reading Matlab Neural Network Toolkit 15 (hextop, gridtop, or randtop, etc. Default: hextop); df: distance function (dist, linkdist, or mandist, etc.); olr: ordering phase learning rate (default: 0.9); st: ordering phase steps (default: 1000); tlr: tuning phase learning rate (default: 0.02); tnd: tuning phase neighborhood distance (default: 1). Default functions used in newsom are adaptFcn: trains; gradientFcn: calcgrad; initFcn: initlay; trainFcn: trainr. Example: in=[rand(1,50)/5;rand(1,50)/2]; net=newsom([1 6; 0 5],[5 10]); %distribute samples over 2-dimensional input space net=train(net,in); out=sim(net,in); (2) nnt2som Nnt2som is an update of self-organizing map in older neural network toolbox. Syntax: net=nnt2som(in,[s1,s2,...sn],w,olr,st,tlr,tnd); %returns a self-organizing map neural network where in: r × 2 matrix of minimum and maximum values of r input elements; si: size of ith layer; w: s × r weight matrix; olr: ordering phase learning rate (default: 0.9); st: ordering phase steps (default: 1000); tlr: tuning phase learning rate (default: 0.02); tnd: tuning phase neighborhood distance (default: 1). In nnt2som the topology function and distance function of selforganizing map are gridtop and linkdist respectively. (3) newc Newc creates a competitive layer. Competitive layers are used to make classification (Mathworks, 2002). Syntax: net=newc(in,n,klr,clr); where in: r × 2 matrix of minimum and maximum values of r input elements; n: number of neurons; klr: Kohonen learning rate (default: 0.01); clr: conscience learning rate (default: 0.001). March 22, 2010 16 18:24 9in x 6in B-922 b922-ch11 1st Reading Computational Ecology Example: in=[2 4 3 7 5;5 9 0 1 0;7 1 0 6 4]; net=newc([0 1; 0 1;0 1],8); %a five neuron layer with three input elements net=train(net,in); out=sim(net,in); (4) nnt2c Nnt2c is an update of self-organizing competitive network in older neural network toolbox. 
Syntax: net=nnt2c(in,w,klr,clr); where in: r × 2 matrix of minimum and maximum values of r input elements; w: s × r weight matrix; klr: Kohonen learning rate (default: 0.01); clr: conscience learning rate (default: 0.001). 4.2. Topology functions Hextop (gridtop) is usded to determine the neuron positions for layers whose neurons are arranged in an n-dimensional hexagonal pattern (grid). (1) hextop Syntax: pos=hextop(d1,d2,...,dN); %returns n×s matrix made of n coordinate vectors %s=d1×d2× ... ×dN where di: layer size of dimension i. (2) gridtop Syntax: pos=gridtop(d1,d2,...,dN); %returns n×s matrix made of n coordinate vectors %s=d1×d2×...×dN where di: layer size of dimension i. March 22, 2010 18:24 9in x 6in B-922 b922-ch11 1st Reading Matlab Neural Network Toolkit 17 4.3. Distance functions Dist, mandist, and linkdist are distance weight functions that can assign weights to inputs to yield weighted inputs, and also layer distance functions used to determine between-neuron distances in a layer. (1) dist Dist is a Euclidean distance function. Syntax: d=dist(w,in); %returns s×q matrix of between-vector distances where w: s × r weight matrix; in: r × q matrix of r-dimensional vectors. Applications: net.inputWeight{i,j}.weightFcn=‘dist’; net.layers{i}.distanceFcn=‘dist’; %make layer i use dist in its topology (2) mandist Mandist is a Manhattan distance function. Syntax: d=mandist(w,in); %returns s×q matrix of between-vector distances (3) linkdist Linkdist is a link distance function. Syntax: d=linkdist(w,in); %returns s×q matrix of between-vector distances 4.4. Learning functions (1) learnk Learngk is a Kohonen weight learning function. Syntax: [dw,nls]=learngk(w,in,win,nin,out,ltv,lev,gp,ogp,nd,lp,ls); The meanings of all variables are the same as learngdm. March 22, 2010 18 18:24 9in x 6in B-922 b922-ch11 1st Reading Computational Ecology (2) learnsom Learnsom is a weight learning function for self-organizing map. Syntax: [dw,nls]=leargsom(w,in,win,nin,out,ltv,lev,gp,ogp,nd,lp,ls); where the learning parameters, lp, can be parameterized as default values [lp.order_lr (ordering phase learning rate): 0.9; lp.order_steps (ordering phase steps): 1000; lp.tune_lr (tuning phase learning rate.): 0.02; lp.tune_nd (tuning phase neighborhood distance): 1]. The meanings of all remaining variables are the same as learngdm. 4.5. Transfer function Compet is a competitive transfer function. It takes the form: y = 1 for the neuron with maximum x; y = 0 for the remaining neurons. Syntax: a=compet(in,fp); % returns a s×q matrix deri=compet(‘dn’,in,x,fp); %returns derivative of x with respect to in where in: s × q matrix of input vectors; fp: optional function parameters. Application: net.layers{i}.transferFcn=‘compet’; 4.6. Plot function Plotsom is a function plotting self-organizing map. Syntax: plotsom(pos); %plots positions of neurons and links neurons %within a Euclidean distance of 1 plotsom(w,d,nd); %Fig. 2 where pos: n × s matrix of n-dimensional neural positions; w: s × r weight matrix; d: s × s distance matrix; nd: neighborhood distance (default: 1). March 22, 2010 18:24 9in x 6in B-922 b922-ch11 1st Reading Matlab Neural Network Toolkit 19 Figure 2. A spatial representation of plotsom (w,d,nd). Example: pos=hextop(20,15); plotsom(pos); 5. Functions of Radial Basis Neural Networks 5.1. Neural network functions (1) newrb The function newrb creates a radial basis network. 
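Before continuing with newrb below, the topology and distance functions of Secs. 4.2 and 4.3 can be combined to inspect the layout of a competitive layer before it is trained. A short illustrative sketch, in which dist and linkdist are called with a position matrix, i.e., as layer distance functions:

%Neuron positions and between-neuron distances of a 4x3 grid layer
pos=gridtop(4,3);            %2x12 matrix of neuron coordinates
d_euc=dist(pos);             %12x12 Euclidean distances between neurons
d_link=linkdist(pos);        %12x12 link (step) distances
plotsom(pos);                %plot neuron positions and links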
Neurons are added to the hidden layer of neural network by newrb until the specified mean squared error goal is met (Mathworks, 2002). March 22, 2010 20 18:24 9in x 6in B-922 b922-ch11 1st Reading Computational Ecology Syntax: [net,tr]=newrb(in,tar,goal,spr,mnn,nn); %returns a radial basis neural network where in: r × q matrix of q input vectors; tar: s × q matrix of q vectors of target class; goal: mse goal (default: 0); spr: spread of radial basis functions (default:1); mnn: maximum number of neurons (default: q); nn: number of neurons to be added between displays (default: 25). In newrb a larger value of spr means a smoother function approximation. Too small value of spr means many neurons will be required to fit a smooth function, and the network may not generalize well (Mathworks, 2002). (2) newrbe The function newrbe creates a radial basis network very quickly. Syntax: net=newrbe(in,tar,,spr); %creats an exact radial basis network 5.2. Transfer function Radbas is a transfer function for radial basis network. Syntax: a=radbas(in,fp); %returns an s×q matrix deri=radbas(‘dn’,in,x,fp); %returns derivative of x with respect to in where in: s × q matrix of input vectors; fp: optional function parameters. 6. Functions of Probabilistic Neural Network 6.1. Neural network function The probabilistic neural network function newpnn is usually used to make classification. net=newpnn(in,tar,spr); %creats a probabilistic neural network network March 22, 2010 18:24 9in x 6in B-922 b922-ch11 1st Reading Matlab Neural Network Toolkit 21 where in: r×q matrix of q input vectors; fp: optional function parameters; tar: s × q matrix of q vectors of target class; spr: spread of radial basis functions (default: 0.1). A very small value of spr will result in a network similar to a nearest neighbor classifier. 6.2. Vector-index convertion functions (1) ind2vec This function converts indices to vectors (Mathworks, 2002). Syntax: vec=ind2vec(in); %converts row vectors of indices to sparse %matrix of vectors Example: in=[2 4 8 3 10]; vec=ind2vec(in); (2) vec2ind Vec2ind converts vectors to indices. 7. Function of Generalized Regression Neural Network Generalized regression network function newgrnn is usually used for function approximation. Syntax: net=newgrnn(in,tar,spr); %returns a generalized regression neural network where in: r × q matrix of q input vectors; tar: s × q matrix of q vectors of target class; spr: spread of radial basis functions (default: 1). 8. Functions of Hopfield Neural Network (1) newhop Newhop creates a Hopfield recurrent neural network used for pattern recall. March 22, 2010 22 18:24 9in x 6in B-922 b922-ch11 1st Reading Computational Ecology Syntax: net=newhop(tar); %returns a Hopfield network with stable points %at the vectors in tar where tar: r × q matrix of q target vectors with 1 or −1 as elements. Example: tar=[1 1 -1 -1; -1 1 -1 1; 1 -1 -1 1]; net=newhop(tar); (2) satlins Satlins is a transfer function that calculates a lyer’s output from the input. It takes the form: y = −1, if x < −1; y = x, if −1 ≤ x ≤ 1; y = 1, if x > 1. Syntax: a=satlins(in,fp); %returns s×q matrix of in’s elements truncated %into the intervals [-1,1] deri=satlins(‘dn’,in,x,fp); %returns s×q derivative of x with respect to in where in: s × q matrix of input vectors; fp: optional function parameters. 9. Function of Elman Neural Network Newelm creates an Elman network. 
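The probabilistic network and the index–vector conversion functions above work together naturally in classification; the syntax of newelm follows below. A minimal illustrative sketch with made-up data:

%Classify with a probabilistic neural network (newpnn)
in=[1 2 3 4 5 6 7];          %one-dimensional inputs (illustrative)
cls=[1 1 1 2 2 3 3];         %class index of each input
tar=ind2vec(cls);            %convert indices to target vectors
net=newpnn(in,tar,0.5);      %spread of radial basis functions = 0.5
out=vec2ind(sim(net,in))     %predicted class indices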
Syntax: net=newelm(in,[s1,s2,...sn],[tf1,tf2,...,tfn],btf,blf,per); %returns an Elman network where in: r × 2 matrix of minimum and maximum values of r input elements; si: size of ith layer; tfi: transfer function of ith layer (default: tansig); btf: backpropagation network training function (traingd, traingdm, traingda, traingdx, etc. Default: traingdx); blf: backpropagation weight/bias learning function (learngd, learngdm, etc. Default: learngdm); per: performance function (default: mse). March 22, 2010 15:11 9in x 6in B-922 b922-ch12 1st Reading PART II Applications of Artificial Neural Networks in Ecology 1 March 22, 2010 15:11 9in x 6in B-922 2 b922-ch12 1st Reading March 22, 2010 15:11 9in x 6in B-922 b922-ch12 1st Reading  CHAPTER 12  Dynamic Modeling of Survival Process Survival process means the survivorship–time curve of a living individual (Zhang, 2007; Zhang and Zhang, 2008). Survival process varies with organisms. For example, an individual holometabolous insect, like ecologically and economically important lepidopterans, coleopterans, hymenopterans, and dipterans, etc., must survive different developmental stages (i.e., egg, the 1st to nth instar larvae, pupa, and adult) to finish the life cycle. There are several static stages in the life cycle. Usually mortality declines markedly during the transition of two developmental stages. Survival process is also affected by environmental conditions, like food resource, temperature, and humidity. Factors influencing survival process are complex and the latter is always a nonlinear process. Due to the lack of theoretical background (Schultz and Wieland, 1997), a mechanistic model, even if it usually involves specific assumptions and limitations, is hard to be developed (Zhang et al., 1997). At present the most often used method to depict survival process is to model mortality distribution with probability density functions (Wagner et al., 1984; Pu, 1990). Artificial neural networks are flexible function approximators to describe nonlinear systems (Kuo et al., 2007; Zhang and Barrion, 2006; Zhang et al., 2007; Zhang et al., 2008). It was reported that for ecosystems with time-changing dynamics, neural networks were considered to be the 3 March 22, 2010 15:11 9in x 6in B-922 b922-ch12 1st Reading 4 Computational Ecology most suitable and robust models (Tan et al., 2006). Theoretically neural networks can be used to model survival process of organisms. In this chapter a BP network is evaluated for its effectiveness and performance in modeling survival process and mortality distribution of a holometabolous insect, Spodoptera litura F. (Lepidoptera: Noctuidae). Empirical models, probability density functions, etc., were used to test their modeling performances and to compare to BP network. Details can be found in the paper by Zhang and Zhang (2008). 1. Model Description 1.1. BP neural network See previous chapters for the principles and algorithms of BP neural network. A complete Matlab algorithm of BP network is developed to model the survivorship and mortality of S. litura, as indicated in the following (Mathworks, 2002; Zhang and Zhang, 2008): %The 1st row is days and the 2nd row is survival rates or mortality. If both time and temperature are %used as input variables, then the 1st and 2nd rows are days and temperatures respectively (i.e., t=data(1:2,:)), %and the 3rd row is survival %rates or mortality (i.e., gt=data(3,:)) t=data(1,:); gt=data(2,:); %Generate a two-layer BP neural network with 5 hidden neurons and 1 output neuron. 
%If a three-layer BP neural network is to be built, then the following syntax should be revised, for instance, %net=newff(minmax(t),[3,2,1],{’tansig’ ’tansig’ ’purelin’}, ’trainlm’, ’learngd’, ’mse’); net=newff(minmax(t),[5,1],{’tansig’ ’purelin’},’trainlm’,’learngd’, ’mse’); net.trainParam.epochs=1000; %Maximum epochs net.trainParam.goal=0.001; %Training goal, mse=0.001 net=train(net,t,gt); %Set a time series to be predicted. If both time and temperature are used as input variables, then for example, %T=[27 28 29 30 31;20 33 26 29 34]. T=[27 28 29 30 31]; tT=[t T]; ft=sim(net,tT); %Print the simulated output. [tT;ft] figure; March 22, 2010 15:11 9in x 6in B-922 b922-ch12 1st Reading Dynamic Modeling of Survival Process 5 %If there is only one input variable in BP network, then the following codes %may be used to draw a graph plot(tT,ft,’-’); hold on plot(t,gt,’*’); %Print input weights, between-layer weights, and bias net.IW{1,1} %Input weights net.LW{2,1} %Between-layer weights (layer 1 to the output neuron) net.b{1} %Bias of the first layer net.b{2} %Bias of the output neuron 1.2. Empirical models 1.2.1. Models for survival process and mortality distribution The following empirical models are used to model survival process of S. litura. f(t) = k/(a + bect ), (1) f(t) = (at + b)/(ct + d), (2) f(t) = a + b1 t + b2 t 2 + b3 t 3 , (3) where f(t) is the survivorship at time t (days), k, a, b, c, d, b1 , b2 , and b3 are model parameters. Several probability density functions are also used to describe the mortality distribution of S. litura. Normal function: f(t) = 1/((2π)1/2 σ) exp(−(t − µ)2 /(2σ 2 )), (4) Logarithmic Normal function: f(t) = lg e/((2π)1/2 σt) exp(−(lg t − µ)2 /(2σ 2 )), (5) Cauchy function: f(t) = 1/π∗ λ/(λ2 + (t − µ)2 ), (6) Chi-squared function: f(t) = 1/(2n/2 Ŵ(n/2))t n/2−1 exp(−t/2), (7) March 22, 2010 15:11 9in x 6in B-922 b922-ch12 1st Reading 6 Computational Ecology Weibull function: f(t) = m/b∗ (t − µ)m−1 exp(−(t − µ)m /b)), (8) where f(t): the mortality frequency at time t (day); µ and σ in model (4) and (5), and µ and λ in model (6): the parameters relevant to mean and standard deviation of probability distribution; n in model (7): mean of chi-squared distribution; µ, m, and b in model (8): the parameters relevant to mean and standard deviation of Weibull distribution. The above empirical models and density functions are fitted with data using a nonlinear least square method with Gauss–Newton algorithm (He, 2001). The last is a dynamic model based on the multi-stage parametric models (Zhang et al., 1997). |dui (t)/dt/ui (t)| = ai + bi T (9) where T : temperature (◦ C); ui (t): the population size during development stage i (i = 1, egg; i = 2 ∼ 7, 1st ∼ 6th instar larvae) at time t; ai , bi : parameters. Parameters ai and bi for various development stages in model (9) are fitted using regression method (Zhang et al., 1997). The function 1 − |dui (t)/dt/ui (t)| is the dynamic survivorship of stage i. Finally, the time-changing survivorship is computed by the following model:  f(t) = (1 − |dui (τ)/dt/ui (τ)|) (10) τ<t i 1.2.2. Model for time–temperature dependent survivorship and mortality distribution The trend surface model (TSM; He, 2001) is used to simulate time– temperature dependent survivorship and mortality distribution: f(t, T) = a + b1 t + b2 T + b3 tT + b4 t 2 + b5 T 2 (11) where f(t, T) is the survivorship or mortality frequency at temperature T (◦ C) at time t (days); a, b1 , b2 , b3 , b4 , b5 : parameters. 
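As noted above, the empirical models and density functions were fitted by nonlinear least squares with a Gauss–Newton algorithm. One possible route in Matlab is lsqcurvefit from the Optimization Toolbox, sketched here for model (1); the data and starting values below are illustrative assumptions, not the study's data, and the estimation of model (11) continues below.

%Fit the survival model (1), f(t)=k/(a+b*exp(c*t)), by nonlinear least squares
t=0:3:30;                                        %days (illustrative)
ft=1./(1+0.05*exp(0.25*t))+0.01*randn(size(t));  %noisy observed survivorship
model=@(beta,t) beta(1)./(beta(2)+beta(3)*exp(beta(4)*t));
beta0=[1 1 0.1 0.2];                             %initial guess [k a b c]
beta=lsqcurvefit(model,beta0,t,ft);              %estimated parameters
mse=mean((model(beta,t)-ft).^2)                  %simulation MSE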
The parameters in model (11) were estimated using the Matlab algorithm of TSM model (Mathworks, 2002). March 22, 2010 15:11 9in x 6in B-922 b922-ch12 1st Reading Dynamic Modeling of Survival Process 2. 7 Data Description In total 26, 20, 13, 12, and 16 data points ((ti , f(ti )), i = 1, 2, . . . , 26) for temperatures 20, 24, 32, 36, and 40◦ C respectively, are used in the training and simulation of survivorship and mortality distribution. In the cross validation of BP network, each data point is separately removed from the above data set and the remaining data in the data set are used to train BP network and to predict the removed data using the trained BP network. Comparisons between the predicted and observed are made and Pearson correlation coefficient and statistic significance were calculated to validate the trained BP network. In the prediction of survival process for certain temperature, the observed survivorship for five predicted days (pupa stage) is the survivorship just before larvae pupated, which is used to test generalization performance of models. Data were submitted to BP network for training or testing in their natural sequences, i.e., increasing order of dates (days) and temperatures (◦ C). 3. Results 3.1. Modeling survival process and mortality distribution 3.1.1. Simulation and prediction of survival process A two-layer BP network is built to model the survival process of S. litura. Transfer functions of hidden layer and output layer are hyperbolic tangent sigmoid transfer function (tansig) and linear transfer function (purelin), respectively. Training and learning functions are Levenberg– Marquardt algorithm (trainlm) and gradient descent weight/bias function (learngd). Desired performance function is mean squared error (MSE = 0.01). The network is trained 1000 epochs. An illustration for the trained weights of BP network is shown in Fig. 1. It can be found that with different number of hidden neurons, BP network performs better in the simulation and the five-day prediction (Table 1 and Fig. 2). The mean epoch decreases as the number of hidden neurons increase. MSE is the lowest if five hidden neurons are used. With the constant survivorships, BP network simulates the static stages, egg and pupa satisfactorily. In general the performance of simulation and prediction of March 22, 2010 15:11 9in x 6in B-922 b922-ch12 1st Reading 8 Computational Ecology Figure 1. Connection weights of two-layer BP network with 5 neurons (20◦ C). BP network is the best when five hidden neurons are used, while the performance with eight hidden neurons and two hidden neurons is not satisfactory due to over-learning or deficient learning. The performance of empirical models (1) to (3) and model (10) is not better than the BP network (Figs. 3 and 4). Model (3) (Averaged MSE for simulation and prediction: 0.00549 and 0.0769) shows the best performance and model (2) is the second best (Averaged MSE for simulation and prediction: 0.00899 and 0.0263). Nevertheless the prediction of model (3) is unpractical. As compared to empirical models and probability density functions, model (10) has been directly derived from the survival process of holometabolous insect but its simulation performance is the worst among all models (Averaged MSE: 0.1550). 3.1.2. Simulation of mortality distribution Mortality distributions of S.litura for various temperatures are fitted by the density functions (4) to (8) and the two-layer BP network with five hidden neurons (Fig. 5). 
Table 1. Performance of BP network in the simulation (Sim.) and the five-day prediction (Pred.) of survival process of S. litura, where maximum epochs = 1000.

Temperature   Performance   2 neurons   3 neurons   5 neurons   8 neurons
20°C          Epochs        71          56          5           6
              Sim. MSE      0.00099     0.00098     0.00087     0.00095
              Pred. MSE     0           0.0001      0.0006      0.0005
24°C          Epochs        9           5           4           3
              Sim. MSE      0.00067     0.00086     0.00094     0.00095
              Pred. MSE     0.001       0.0002      0           0.0118
32°C          Epochs        1000        4           6           2
              Sim. MSE      0.00221     0.00048     0.00056     0.00098
              Pred. MSE     0.0343      0.0318      0.0155      0.2898
36°C          Epochs        29          11          6           6
              Sim. MSE      0.00049     0.00087     0.00003     0.00057
              Pred. MSE     0           0           0.0003      0.0039
40°C          Epochs        1000        109         20          3
              Sim. MSE      0.00242     0.00099     0.00098     0.00090
              Pred. MSE     0.0008      0.0017      0           0
Mean          Epochs        422         37          8           4
Simulation    Mean MSE      0.00136     0.00084     0.00068     0.00087
Prediction    Mean MSE      0.00722     0.00676     0.00328     0.06120

The mortality distribution of an insect with a single developmental stage may probably be described by probability density functions because of its homogeneous morphology, physiology, and behavior, or at least because of the slow changes, rather than sudden jumps, in its developmental process. However, dramatic changes of mortality can occur across the distinctive developmental stages of a holometabolous insect. The mortality distribution of the holometabolous insect is thus more complex and cannot be well fitted by probability density functions. BP network is the best model in the simulation of mortality distribution. The simulation performance of the Cauchy density function is similar to that of the BP network, and the chi-squared function is the worst. The Weibull function could not be fitted by the nonlinear least squares method.

3.1.3. Cross validation of BP network for time dependent relationship

Using the same settings but a different training goal (MSE = 0.001) in the BP network with five hidden neurons, cross validation shows that BP network is robust in the prediction of unknown data, particularly for survival process (Fig. 6).

Figure 2. Performance of BP network in the simulation and prediction of survival process of S. litura.

4. Discussion

The above study demonstrates that BP network is able to effectively model the survival process and mortality distribution of holometabolous insects. BP network outperforms empirical models, the trend surface model, and probability density functions. It yields reasonable and robust predictions in all situations. It has the advantages of simplified and more automated model synthesis and analytical input–output models (Abdel-Aal, 2004).

Figure 3. Performance of empirical models (1) to (3) in the simulation and prediction of survival process of S. litura.

To build a satisfactory BP network, the following principles are suggested: (1) Set an appropriate number of hidden layers and neurons. Using a network with multiple hidden layers helps to reduce the number of neurons needed. An excessive number of hidden neurons would result in over-learning and reduce generalization reliability. Overall, the number of hidden layers and neurons should increase as the complexity of the ecosystem studied increases. (2) Set appropriate maximum epochs and an appropriate training goal. An overly demanding training goal and excessive training epochs would result in an over-learned network.
Over-learning can be avoided by using methods such as limitation of the complexity of March 22, 2010 12 15:11 9in x 6in B-922 b922-ch12 1st Reading Computational Ecology Figure 4. Performance of model (10) in the simulation of survival process of S. litura. the model, weight decay, training with noise, etc (Ozesmi et al., 2006). (3) Gather a representative data set. A better data set well representing the sample space is crucial for building a neural network with great predictive power. In addition to the good experimental or sampling design for data acquisition, data quality may also be improved by using various methods to eliminate data redundancy and enhance calculation efficiency (Kilic et al., 2007). A model describes a generation mechanism for data (Gentle, 2002). To build a neural network means training the network with raw data and then March 22, 2010 15:11 9in x 6in B-922 b922-ch12 1st Reading Dynamic Modeling of Survival Process 13 Figure 5. The simulation and prediction of mortality distribution using BP network and probability density functions (4) to (7). Figure 6. Cross validation of BP network in the prediction of survivorship and mortality distribution of S. litura. One data point is removed from the data set and the remaining data in the data set are used to train BP network and to predict the data point removed. March 22, 2010 14 15:11 9in x 6in B-922 b922-ch12 1st Reading Computational Ecology storing the mechanism of data into connection weights of the network. BP network is thus an adaptive and flexible model. Compared to other engineering applications, the studies on processbased neural network modeling are fewer in ecological and environmental areas (Lek and Baran, 1997; Abrahart and White, 2001; Viotti et al., 2002; Sharma et al., 2003; Abdel-Aal, 2004; Almasri and Kaluarachchi, 2005; Pastor-Barcenas et al., 2005; Nour et al., 2006; Nagendra and Khare, 2006). Further studies are desirable in the research of dynamic analysis of ecosystems. March 22, 2010 18:14 9in x 6in B-922 b922-ch13 1st Reading  CHAPTER 13  Simulation of Plant Growth Process Plant growth process is a nonlinear system. Mechanistic models are always recommended for modeling nonlinear ecosystems. However, they involve certain assumptions and limitations and are so highly specialized that they can be manipulated only by experienced researchers who have a deep understanding of the underlying theories (Schultz and Wieland, 1997; PastorBarcenas et al., 2005). Because of the lack of a consistent theoretical background, it is usually hard to develop a specialized mechanistic model for a complex ecosystem like plant growth process (Zhang et al., 2007; Zhang et al., 2008). Artificial neural networks are flexible function approximators and system simulators, which have been used to model nonlinear systems (Acharya et al., 2006; Zhang and Barrion, 2006; Zhang, 2007; Zhang et al., 2007; Zhang and Zhang, 2008). In this chapter the Chinese cabbage growth process is modeled with Elman neural network and linear neural network. Ordinary differential equation is used to compare the performance of neural networks. Sensitivity analysis is performed to detect the robustness of these models. Further details can be found in Zhang et al. (2007). 1. Model Description 1.1. Neural networks 1.1.1. Elman neural network See Chap. 000. 1 March 22, 2010 18:14 9in x 6in B-922 b922-ch13 1st Reading 2 Computational Ecology 1.1.2. Linear neural network See Chap. 000. 
Elman and linear neural networks are specially developed to simulate x(t+ △ t) from x(t), where x(t) = (x1 (t), x2 (t), . . . , xn (t))T , and n is the dimension of input. Matlab codes of Elman and linear neural networks used are as follows: %Read multivariable time series (dyndata.*) to P. In this file, rows represent variables, and columns represent time series. P=dyndata; variables=size(P,1); %The number of variables times=size(P,2); %The number of time points %Develop a two-layer Elman neural network if it is used. %Transfer function of $i$th layer is tansig (or logsig) and purelin, i=1,2. net=newelm(minmax(P),[30 5],{’tansig’,’purelin’},’traingdm’, ’learngdm’,’mse’); %Develop a linear neural network if it is used. There are 5 output variables. Learning rate is 0.01. Time delay is 0. net=newlin(minmax(P),5,[0],0.01); net.trainParam.epochs=1000; %Train 1000 epochs net.trainParam.goal = 0.000001; %Set the training goal for i=1:times-1; %x(t), x(t+ △ t), t = 1, 2, . . . , times − 1 R=P(:,i); Y=P(:,i+1); %Train network with input x(t) and output x(t+ △ t) in ODEs, dx(t)/ dt = f(x(t), t) net=train(net,R,Y); output(:,i)=sim(net,R); %Compute x(t+ △ t) from x(t) end output %Print input weights and between-layer weights net.IW{1,1} %Input weights net.LW{1,1} %Between-layer weights (layer 1 to layer 1) net.LW{2,1} %Between-layer weights (layer 1 to layer 2) 1.2. Ordinary differential equation Mechanistic models to describe plant growth process are usually expressed as a nonlinear ordinary differential equation (Department of Mathematics of Nanjing University, 1978; Qi et al., 2001): dx(t)/dt = f(x(t), t), (1) where x(t): the vector of state variables, x(t) = (x1 (t), x2 (t), . . . , xn (t))T , t: time. March 22, 2010 18:14 9in x 6in B-922 b922-ch13 1st Reading Simulation of Plant Growth Process 3 The linear forms of the nonlinear ordinary differential equation are mostly used as the following: dx(t)/dt = A(t)x(t), (2) dx(t)/dt = Ax(t), (3) and where A(t) and A are matrices with time-varied elements and time-constant elements respectively. Models (2) and (3) require little information on the mechanism of plant growth process. The difference equations corresponding to the models (1) and (3) are as follows: x(t+ △t) = x(t) + f(x(t), t) △t, (4) x(t+ △t) = x(t) + Ax(t) △t. (5) and 2. Data Source Chinese cabbage was planted in a field. The weight of dry matter of leaves, stem, and root (g/plant), the area of leaves (cm2 /plant), and the water content of soil (g water/g dry soil) were measured every three days. In doing so a multivariable data set was gathered (Qi et al., 2001). 3. Results 3.1. Model comparison Elman neural network, linear neural network and linear ordinary differential equation (3) are used to simulate Chinese cabbage growth process. The trained weights of Elman network are illustrated in Fig. 1. The results indicate that Elman network (training goal = 0.000001) will be in accordance with the system dynamics after training for a very short time (Fig. 2). Linear network (training goal = 0.000001, learning rate = 0.01) fits the dynamics within a certain time, while linear ordinary differential equation yields divergent dynamics from the beginning of the simulation (Fig. 2). Chinese cabbage growth process is a nonlinear system (Qi et al., 2001). The March 22, 2010 18:14 9in x 6in B-922 b922-ch13 1st Reading 4 Computational Ecology Figure 1. Distribution of connection weights of two-layer Elman network (neurons: (30,5)). Upper: 5 × 30; Middle: 30 × 30; Lower: 30 × 5. 
The eigenvalues of the system matrix A of the linear ordinary differential equation are −1.3418, 0.6823, −0.4948 + 0.4333i, −0.4948 − 0.4333i, and −0.0754. Since one eigenvalue has a positive real part, the system is unstable, and this model is therefore not able to simulate the dynamics well. The dynamics simulated by the linear network diverges after 21 days, even though the linear network is able to simulate a weakly nonlinear system. Overall, the Elman network is the best at simulating the dynamics of the multivariate nonlinear ecosystem. The linear network is only able to simulate a weakly nonlinear system, and the linear ordinary differential equation will yield unstable and unpractical simulations for multivariate nonlinear systems.

Figure 2. Dynamic simulation of Chinese cabbage growth process using Elman network, linear network, and the linear ordinary differential equation.

3.2. Sensitivity analysis

Transfer functions, training functions, and learning functions are changed to compare the simulation performance of the Elman network (Table 1). It is found that all transfer functions of the second layer, except for "purelin", yield bad simulation performance (Table 1). Changes of the transfer function of the first layer, the training functions, and the learning functions have little influence on simulation performance, and the influence seems to be somewhat distinctive at the beginning of the simulation. Sensitivity analysis indicates that the Elman neural network is considerably robust in the simulation of a multivariate dynamic system.

Table 1. Sensitivity analysis of network functions in the Elman neural network. The benchmark settings in the Matlab codes are: net = newelm(minmax(P),[30 5],{'tansig','purelin'},'traingdm','learngdm','mse'); training epochs = 1000; training goal = 0.000001. The five output variables in the table are leaf weight, stem weight, root weight, leaf area, and water content of soil. (The table lists the five output variables at t = 3, 6, . . . , 48 days under changes of transfer functions — logsig, purelin, tansig, tansig — and under changes of training and learning functions — traingdx/learngd and traingd/learngd.)

By changing the learning rate and training goal of the linear network, the results show that the simulation performance varies considerably with these settings. In some cases, for instance if the training goal is 0.000001 and the learning rate is 0.001, the dynamics cannot be simulated. The correct choice of learning rate and training goal is important for an ideal simulation.

3.3. Model performance using various data sets

Similar conclusions are drawn when using various data sets (i.e., different combinations of variables) or functions and parameters to test the performance and robustness of the Elman network, the linear network, and the linear ordinary differential equation. The Elman network yields stable and best results, the linear network performs better in some cases, and the linear ordinary differential equation produces the worst simulation.

4. Discussion

Between-model comparisons indicate that the Elman neural network is able to simulate the dynamics of a multivariate nonlinear ecosystem. The linear network is able to simulate a simple system with weak nonlinearity. The linear ordinary differential equation is not able to simulate the multivariate nonlinear system. Sensitivity analysis proves that the choices of network functions and training parameters influence the simulation performance of neural networks. The choice of transfer function of the second layer in the Elman network significantly affects the simulation performance, while changes of other functions do not yield considerable influence. Besides system simulation, the Elman network can be used to learn the system structure in both the temporal and spatial realms, and the patterns of system dynamics can be detected by using this model. The number of neurons in the Elman network can also be adjusted to yield the dynamics at a certain scale. Further details can be found in engineering theories (Hagan et al., 1996; Mathworks, 2002; Fecit, 2003).

It is argued that several neural networks may be jointly used in the simulation in order to achieve more reasonable results (Sharma et al., 2003; Zhang and Barrion, 2006).
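One simple way to combine networks in this setting is sketched below: average the one-step predictions of the Elman and linear networks. The averaging rule is only an illustration and is not taken from the text; it assumes two networks netE (Elman) and netL (linear) have already been trained on the series P, as in the code listed earlier in this chapter.

%Sketch (illustrative only): joint use of two trained networks by averaging
%Assumes netE (Elman) and netL (linear) were trained on the series P as above
for i=1:times-1;
yE(:,i)=sim(netE,P(:,i)); %Elman prediction of x(t+dt)
yL(:,i)=sim(netL,P(:,i)); %linear-network prediction of x(t+dt)
yJ(:,i)=(yE(:,i)+yL(:,i))/2; %combined one-step prediction
end
msejoint=sum(sum((yJ-P(:,2:times)).^2))/((times-1)*size(P,1))

More elaborate combinations (e.g., weighting each network by its training error) could be substituted for the plain average.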
The use of a single neural network would sometimes not yield a better understanding of the system (Abdel-Aal, 2004; Nour et al., 2006; Nagendra and Khare, 2006; Yu et al., 2006).

CHAPTER 14

Simulation of Food Intake Dynamics

Insect pests harm plants and cause economic loss by feeding on crops. The dynamics of food intake of an insect is a function of time. For a holometabolous insect, an individual needs to survive several developmental stages, e.g., egg, the 1st to nth instar larva, pupa, and adult (Zhang and Zhang, 2008). Food intake changes largely within a developmental stage and particularly between adjacent stages (Zhang et al., 1997). Food intake varies with insect species, food sources, and environmental conditions such as temperature and humidity. It is a system that lacks a theoretical background (Schultz and Wieland, 1997) and it is a nonlinear system (Pastor-Barcenas et al., 2005). Therefore it is hard to build mechanistic models for food intake dynamics (Zhang, 2007; Zhang et al., 2008). In this chapter an algorithm for the functional link artificial neural network is used to simulate insect food intake dynamics. Conventional models are compared to FLANN for their simulation performance. More details can be found in Zhang et al. (2008).

1. Model Description

1.1. Functional link artificial neural network

The basic algorithm of the functional link artificial neural network (FLANN) can be found in Chap. 000. In FLANN one of the following functions may be chosen as the nonlinear function ρ(·):

Linear function: ρ(S) = a + bS
Negative exponential function: ρ(S) = a e^(−bS)
Decreasing function with lower asymptote: ρ(S) = a + b/S
Logarithmic linear function: ρ(S) = a + b ln(S)
Power function: ρ(S) = a S^b
Logistic function: ρ(S) = 1/(a + b e^(−S))
Anti-exponential function: ρ(S) = a e^(b/S)
Transcendental tangent function: ρ(S) = tanh(S) = (1 − e^(−2S))/(1 + e^(−2S))

where a and b are constants.

Given K training samples {x_k, y_k}, k = 1, 2, . . . , K, where x_k = (x_1^k, x_2^k, . . . , x_n^k)^T and y_k = (y_1^k, y_2^k, . . . , y_m^k)^T, when the kth sample is added, its value under the inverse of the nonlinear function ρ(·) should be computed. For example, for ρ(S) = (1 − e^(−2S))/(1 + e^(−2S)), we have y_k = (1 − e^(−2s_k))/(1 + e^(−2s_k)) and s_k = ln[(1 + y_k)/(1 − y_k)]/2; for ρ(S) = a + bS, we have y_k = a + b s_k and s_k = (y_k − a)/b.

The basis functions are orthogonal functions: Legendre functions, Chebyshov functions, Laguerre functions, Hermite functions, or trigonometric functions, as described in previous chapters.

The norm of the input vector (Rudin, 1991; Li et al., 2001) is defined as x = ‖x‖ = (Σ_i x_i^2)^(1/2), where the resultant x is a scalar variable and the x in ‖x‖ is a vector. The norm is normalized in order to coincide with the domain of definition of the orthogonal functions. The normalized x in Legendre functions and Chebyshov functions is x = (2x − (max x + min x))/(max x − min x). In Laguerre functions, the normalized x is x = x − min x, and in trigonometric functions, the normalized x is x = 2πx/(max x − min x) − 2π min x/(max x − min x).
The Matlab codes of FLANN are listed bellow (Zhang et al., 2008) %Variables and data used in all of the Matlab functions are defined as the follows %basefunselec: 1:Legendre Function; 2:Chebyshov Function; 3:Laguerre Function; 4:Hermite Function; March 22, 2010 18:4 9in x 6in B-922 b922-ch14 1st Reading Simulation of Food Intake Dynamics 3 %5:Trigonometric Function. %nolinfunsele: 1:Linear Function (ρ(S )= a+ bS); 2: Negative Exponential Function (ρ(S) = a e−bS ); %3:Decreasing Function With Lower Asymptote (ρ(S) = a + b/S); 4:Logarithmic Linear Function %(ρ(S) = a + b ln(S)); 5: Power Function (ρ(S) = a S b ); 6:Logistic Function (ρ(S) = 1/(a + b e−S ));: Anti %Exponential Function (ρ(S) = a eb/S ); 8:Transcendental Tangent Function (ρ(S) = (1 − e−2S )/(1 + e−2S )). %nodimin: the number of dimensions for input vector. nodimout: the number of dimensions for output vector. %notrns: the number of training samples. nobasefun: the number of basis functions; %a, b: the constants in nonlinear function. %traindata: input matrix (x) and output matrix (y) in the file for training samples. The number of training %samples is the number of rows in training sample file; the number of columns in training sample file is %the summation of the number of dimensions of input vector and output vector; In each row of the training %sample file, the first are the values for input vector and the second are the values for output vector. % predidata: input matrix ( xp) for predicted samples, the same format as x. nodimin=2; nodimout=1; notrns=80; nobasefun=50; basefunselec=1; nonlinfunsele=1; %If ρ(S) = tanh(S) is chosen as the nonlinear function, a and b can be set to be %arbitrary values. a=1; b=2; x=traindata(:,1:nodimin); y=traindata(:,(nodimin+1):(nodimin+nodimout)); xp=predidata; net=flann(nodimout,notrns,nobasefun,basefunselec,nonlinfunsele, a,b,x,y) %The Simulated Output Vector for Each Training Sample disp(’Simulated results:’); simu(net,nodimout,notrns,nobasefun,basefunselec,nonlinfunsele, a,b,x); mse=sum(sum((si-y).ˆ2))/(notrns*nodimout) %The predicted output vector for each sample to be predicted disp(’Predicted results:’); simu(net,nodimout,size(xp,1),nobasefun,basefunselec,nonlinfunsele, a,b,xp); March 22, 2010 18:4 9in x 6in B-922 b922-ch14 1st Reading 4 Computational Ecology function net=flann(nodimout,notrns,nobasefun,basefunselec, nonlinfunsele,a,b,x,y) xmm=vectornorm(basefunselec,notrns,x); for k=1:notrns; xx(k,:)=basefun(basefunselec,nobasefun,xmm(k)); for i=1:nodimout; [rr,yy(k,i)]=nonlinfun(nonlinfunsele,a,b,y(k,i)); end end net=(inv(xx’*xx)*xx’*yy)’; function result=simu(net,nodimout,nosam,nobasefun,basefunselec, nonlinfunsele,a,b,mat) xmm=vectornorm(basefunselec,nosam,mat); for k=1:nosam; p=basefun(basefunselec,nobasefun,xmm(k)); for j=1:nodimout; temp=sum(net(j,:).*p); [result(k,j),rt]=nonlinfun(nonlinfunsele,a,b,temp); end end result function p=basefun(basefunselec,nobasefun,x) switch basefunselec case 1 p(1)=x; for i=1:nobasefun-1; if (i==1) p(2)=(3*x*p(1)-1)/2; continue; end p(i+1)=((2*i+1)*x*p(i)-i*p(i-1))/(i+1); end %Legendre case 2 p(1)=x; for i=1:nobasefun-1; if (i==1) p(2)=2*x*p(1)-1; continue; end p(i+1)=2*x*p(i)-p(i-1); end %Chebyshov case 3 p(1)=1-x; for i=1:nobasefun-1; if (i==1) p(2)=(3-x)*p(1)-1; continue; end p(i+1)=(2*i+1-x)*p(i)-iˆ2*p(i-1); end %Laguerre case 4 p(1)=2*x; for i=1:nobasefun-1; if (i==1) p(2)=2*x*p(1)-2; continue; end March 22, 2010 18:4 9in x 6in B-922 b922-ch14 1st Reading Simulation of Food Intake Dynamics end p(i+1)=2*x*p(i)-2*i*p(i-1); %Hermite case 5 for 
i=1:nobasefun; if (round(i/2)==(i/2)) p(i)=sin(i/2*x); else p(i)=cos((i+1)/2*x); end end %Trigonometric end function [rr,rt]=nonlinfun(nonlinfunsele,a,b,x) %Nonlinear value, nonlinear inverse value switch nonlinfunsele case 1 rr=a+b*x; rt=(x-a)/b; case 2 rr=a*exp(-b*x); rt=(log(a)-log(x))/b; case 3 rr=a+b/x; rt=b/(x-a); case 4 rr=a+b*log(x); rt=exp((x-a)/b); case 5 rr=a*xˆb; rt=(x/a)ˆ(1/b); case 6 rr=1/(a+b*exp(-x)); rt=-log((1-a*x)/(b*x)); case 7 rr=a*exp(b/x); rt=b/(log(x)-log(a)); case 8 rr=(1-exp(-2*x))/(1+exp(-2*x)); rt=0.5*log((1+x)/(1-x)); end function xmm=vectornorm(basefunselec,nosam,mat) for i=1:nosam; xmm(i)=sqrt(sum((mat(i,:)).ˆ2)); end if (basefunselec˜=4) maxx=max(xmm); minn=min(xmm); if (maxx==minn) warning(’Divided by zero!’); exit; end for i=1:nosam; switch basefunselec case {1,2} xmm(i)=(2*xmm(i)-(maxx+minn))/(maxx-minn); case 3 xmm(i)=xmm(i)-minn; 5 March 22, 2010 18:4 9in x 6in B-922 b922-ch14 1st Reading 6 Computational Ecology case 5 xmm(i)=2*3.1415926*xmm(i)/(maxx-minn)-2*3.1415926*minn/ (maxx-minn); end end end 1.2. Conventional models The following empirical models are used to model food intake dynamics: Fractional function: f(t) = (at + b)/(ct + d) Polynomial function: f(t) = a + b1 t + b2 t 2 + b3 t 3 Exponential function: f(t) = aect Multivariate linear regression: f(t, T) = a + b1 t + b2 T Trend surface model: f(t, T) = a + b1 t + b2 T + b3 tT + b4 t 2 + b5 T 2 2. Data Description Six temperatures, i.e., 20, 24, 28, 32, 36, and 40◦ C, were used to measure the food intake dynamics of Spodoptera litura. Feed the larvae with clean leaves of Chinese cabbage after eggs hatched into larvae, and measure the averaged fresh weight (g) of leaves consumed by a larva each day until all larvae become pupae. Fresh weight of leaves consumed per larva was daily accumulated and was used as food intake dynamics. A data set was thus gathered (Zhang et al., 1997). 3. Results 3.1. Modeling food intake dynamics of S. litura In FLANN, 50 Legendre functions are set as basis functions. Nonlinear function is chosen to be ρ(S) = a + bS (linear function), and parameter values are a = 1, b = 2. The results indicate that the averaged MSE (Mean Squared Error) of FLANN, fractional function, polynomial function, and exponential functions are 0.01674, 0.15482, 0.07565, 0.1289, respectively. It is obvious that FLANN has better performance than conventional models in the simulation of food intake dynamics at six temperatures (Fig. 1). Among these March 22, 2010 18:4 9in x 6in B-922 b922-ch14 1st Reading Simulation of Food Intake Dynamics 7 Figure 1. Simulation of food intake dynamics using FLANN and conventional models. conventional models polynomial function has the lowest error and fractional function yields a larger deviation in simulation. Daily food intake of S. litura larva in a stage (instar) will change from zero to a maximum and decline to zero, preparing to develop into the next stage (instar or pupa). As a result, the food intake dynamics of a larva is not a strict increasing function of time. Most conventional models tend to smooth these fluctuations. Theoretically FLANN performs well in describing these details of dynamics. March 22, 2010 18:4 9in x 6in B-922 b922-ch14 1st Reading 8 Computational Ecology 3.2. Modeling temperature–time relationship of food intake In the simulation of temperature–time dependent food intake relationship, both multivariate linear regression (MSE = 1.8575) and trend surface model (MSE = 1.2238) are capable of pursuing the trend of this relationship (Fig. 
2). In total 80 data points are used in the FLANN (the same setting as above) simulation of temperature–time dependent food intake relationship. It is found that FLANN has a better fitting goodness than conventional models above (MSE = 0.3461; Fig. 3). 3.3. Sensitivity analysis 3.3.1. Basis functions Different types and numbers of basis functions are used in FLANN to detect the sensitivity of temperature–time dependent food intake Figure 2. Simulation of temperature–time dependent food intake using multivariate linear regression and trend surface model. March 22, 2010 18:4 9in x 6in B-922 b922-ch14 1st Reading Simulation of Food Intake Dynamics 9 Figure 3. FLANN simulation of temperature–time dependent food intake relationship. Table 1. Mean squared error (MSE) of FLANN simulation for various types and numbers of basis functions. In total 10, 20, 30, 40, and 50 basis functions are used respectively (Zhang et al., 2008). Legendre Chebyshov Laguerre Hermite Trigonometric 10 20 30 40 2.5415 2.5366 2.5707 2.4913 2.8268 1.0305 0.94284 50.642 14.396 0.79089 0.43109 0.41592 127.93 1244.2 0.62204 0.3767 0.37695 22.719 32.382 0.564 50 0.34606 0.34605 1.6892 1837.2 0.50087 (Table 1 and Fig. 4). Nonlinear function was ρ(S) = a + bS, and parameter values a = 1, b = 2. The results demonstrate that the simulation performance using Legendre functions, Chebyshov functions, and trigonometric functions is better than that using Laguerre functions and Hermite functions. The fitted error (MSE) of Legendre functions, Chebyshov functions, and trigonometric functions decrease as the number of these basis functions increases. On the contrary, simulation performances of Laguerre functions and Hermite March 22, 2010 10 18:4 9in x 6in B-922 b922-ch14 1st Reading Computational Ecology Figure 4. Output sensitivity of FLANN to different type and parameter values of nonlinear functions in the simulation of temperature–time dependent food intake relationship of S. litura, where a and b are parameters in nonlinear functions. Nonlinear functions 1–8 denote the functions indicated in the mathematical description above. Fifty Legendre functions are used in FLANN. functions are unstable as the number of basis functions changes. Overall Legendre functions and Chebyshov functions are the best basis functions in FLANN modeling (Table 1). 3.3.2. Nonlinear functions Different nonlinear functions and parameter values are used in FLANN simulation (50 Legendre functions) of temperature–time dependent food intake relationship (Fig. 4). March 22, 2010 18:4 9in x 6in B-922 b922-ch14 1st Reading Simulation of Food Intake Dynamics 11 Simulation performance varies with the change of type of nonlinear functions and parameter values in the function. Nonlinear functions 1 (linear function), 2 (negative exponential function) and 5 (power function) are the best functions that yield relatively stable outputs at the change of parameter values (Fig. 4). Logarithmic linear function yields better results in most cases. The remaining nonlinear functions, e.g., transcendental tangent function and logistic function, have bad simulation performance. 4. Discussion As indicated above, FLANN performs better than conventional models in simulating the food intake dynamics of S. litura. The type and number of basis functions, and the type and parameter values of nonlinear functions in FLANN will to a certain extent influence the simulation performance. 
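Such a sensitivity check can be scripted directly on top of the flann and simu functions listed above. The sketch below is a minimal example, not the original analysis: it assumes traindata is loaded as described earlier (columns t, T, and food intake) and that the linear nonlinear function with a = 1 and b = 2 is used, and it simply recomputes the simulation MSE for each basis-function type and number, as in Table 1.

%Sketch: sensitivity sweep over basis-function type and number
%Assumes the flann() and simu() functions above and the matrix traindata are available
x=traindata(:,1:2); y=traindata(:,3); %inputs (t,T) and output (food intake)
a=1; b=2; nonlinfunsele=1; %linear nonlinear function rho(S)=a+b*S
for basefunselec=1:5; %1 Legendre, 2 Chebyshov, 3 Laguerre, 4 Hermite, 5 trigonometric
for nobasefun=10:10:50;
net=flann(1,size(x,1),nobasefun,basefunselec,nonlinfunsele,a,b,x,y);
si=simu(net,1,size(x,1),nobasefun,basefunselec,nonlinfunsele,a,b,x);
msetab(basefunselec,nobasefun/10)=sum((si-y).^2)/numel(y);
end
end
msetab %rows: basis-function types; columns: 10 to 50 basis functions

A similar loop over nonlinfunsele and over values of a and b reproduces the nonlinear-function sensitivity analysis.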
The weight matrix W in FLANN may also be approximated by using the iterative algorithm (Yan and Zhang, 2000), that is, W(k + 1) = W(k) + ηδ(k)ϕ(xk ), where δ(k) = (δ1 (k), δ2 (k), . . . , δm (k))T , and δj (k) = ρ′ (Sj )ej (k), where ρ′ (S) is the derivative function of ρ(S). For example, for ρ(S) = tanh(S), we have ρ′ (S) = 1 − ρ2 (S), and for ρ(S) = a + bS, we have ρ′ (S) = b. FLANN is a function generator using orthogonal series (Gentle, 2002). It is substantially a generalized form of radial basis function neural network (Yan and Zhang, 2000; Zhang and Qi, 2002; Zhang and Barrion, 2006). In addition to the choices of basis functions and nonlinear functions, a data set with quality is also indispensable for a better simulation performance of FLANN. March 22, 2010 15:45 9in x 6in B-922 b922-ch15 1st Reading  CHAPTER 15  Species Richness Estimation and Sampling Data Documentation 1. Estimation of Plant Species Richness on Grassland Biodiversity presents vast numbers of unexploited opportunities for solving environmental problems (Brown, 1991; Cowell, 1992). Plants account for 20% of total number of species globally (Chen and Ma, 2001). On a temperate grassland, plants have the largest biomass (20 000 kg/ha), followed by microorganisms (7000 kg/ha) (Pimental et al., 1992; Chen and Ma, 2001). Plant diversity is the basis of animal diversity (Andow, 1991; Dong et al., 2005; Jia et al., 2006). Researches on plant biodiversity always start with the estimation of species richness. A large number of studies estimating plant species richness have been reported (Dony, 1963; Williams, 1964; Rosenzweig, 1995; Chen and Ma, 2001). Grassland is a natural ecosystem with high productivity. Grasslands account for 16 ∼ 30% of terrestrial land. However, global grasslands have been degrading in recent decades due to overgrazing and reclamation for various uses. To evaluate grassland biodiversity is an important subject in ecological researches. Various methods were used to estimate species richness in previous studies, among which many of them are nonparametric estimators (Chao, 1984; Burnham and Overton, 1978, 1979; Smith and van Belle, 1984; Chao 1 March 22, 2010 15:45 9in x 6in B-922 b922-ch15 1st Reading 2 Computational Ecology and Lee, 1992). The performance of richness estimators could be assessed by comparing estimators to the measured species richness at one or a few areas (Miller and Wiegert, 1989) or by simulation (Smith and van Belle, 1984). According to the past studies, the performances of these estimators varied with the habitats, growing systems, and distribution of species (Efron and Gong, 1983; Palmer, 1990, 1991; Bunger and Fitzpatrick, 1993; Colwell and Coddington, 1994; Walther and Morand, 1998; Hellman et al.,1999; Zhang et al., 2004). Besides nonparametric estimators, the parametric models like Arrhenius model, logarithmic normal model, etc., were widely applied in earlier studies (Preston, 1960). These parametric models were usually used to describe various richness–area relationships. Artificial neural networks are recognized as universal function approximators (Acharya et al., 2006; Zhang and Barrion, 2006; Zhang et al., 2007). They have been used to predict invertebrate species richness in rice field (Zhang, 2007). This section aims to estimate plant species richness on grassland using a neural network and compare its performance to some nonparametric estimators and conventional models. 1.1. Model description 1.1.1. 
Nonparametric estimators Seven nonparametric models (Colwell and Coddington, 1994; Zhang and Schoenly, 1999) are used to estimate plant species richness on the grassland. These estimators have been denoted as Chao 1 and Chao 2 (Chao, 1984), Jackknife 1 and Jackknife 2 (i.e., first-order jackknife and secondorder jackknife, see Burnham and Overton, 1978, 1979), Bootstrap (Smith and van Belle, 1984), Chao 3 and Chao 4 (Chao and Lee 1992). See Colwell and Coddington (1994), Zhang and Schoenly (1999) for mathematical description of these nonparametric estimators. To evaluate performances of these models, two bias indices, i.e., absolute bias (AB) and relative bias (RB), are used in this study (Zhang and Schoenly, 1999) AB =  |Si − S|/sim  RB = (|Si − S|/S)/sim ∗ 100% March 22, 2010 15:45 9in x 6in B-922 b922-ch15 1st Reading Species Richness Estimation and Sampling Data Documentation 3 where Si denotes the estimated richness in the ith randomization and S is the observed species richness for all samples in the data set. AB and RB take the average over all randomizations (sim) of the nonparametric estimators. Among these indices, RB was considered as the most revealing index for evaluating estimators (Zhang and Schoenly, 1999). 1.1.2. Artificial neural network A three-layer neural network is developed for modeling the relationship between total species richness and cumulative sample size (Fig. 1). In this network, an input set, xi ∈ R; the corresponding output set, yi ∈ R. xi is the cumulative sample size up to sample i, and yi is the total number of plant species up to sample i, i = 1, 2, . . . , n. Both the first and second layers contained fifteen neurons, and bias is used on each layer. Transfer functions for layers 1–3 are hyperbolic tangent sigmoid transfer function, logistic sigmoid transfer function, and linear transfer function, respectively. Initialization of network, and weights and bias for each layer, is performed by a function that initializes each layer i (i = 1, 2, 3) according to its own initialization function (Hagan et al., 1996; Mathworks, 2002; Fecit, 2003). Network is trained by Levenberg– Marquardt backpropagation algorithm. Performance function is mean Figure 1. Neural network developed in present study. March 22, 2010 15:45 9in x 6in B-922 b922-ch15 1st Reading 4 Computational Ecology squared error function (mse). Both the first and second layers receive the same inputs from sample space and yield outputs for the third layer. The third layer learns from the input space. For each layer the net input functions calculated the layer’s net input by combining its weighted inputs and biases. Network performance is evaluated using mse, Pearson correlation coefficient, and significance level for the linear relationship between simulated and observed species richness. The artificial neural network is developed using Matlab (Mathworks, 2002). 
Matlab codes of the neural network are listed as bellow: clc %Raw data dat(mm,nn) in which mm is number of plant species and %nn is number of samples mm=48; nn=50; ram=100; numn=15; %Number of neurons in layers 1 and 2 respectively obs=zeros(nn-1,ram); yy=zeros(ram,ram-1); y=yy’; for simm=1:ram; da=zeros(nn-1,2); temp=2:nn; da(:,1)=temp’; ra=randperm(nn); for i=2:nn; da(i-1,2)=0; for k=1:mm; u(k)=0; for j=1:i; u(k)=u(k)+dat(k,ra(j)); end; if u(k)˜=0 da(i-1,2)=da(i-1,2)+1;end; end; end; obs(:,simm)=obs(:,simm)+da(:,2); clear net; disp([’Simu=’ num2str(simm)]) n=1; data=da(:,1); net=network; net.numInputs=1; net.numLayers=3; net.inputs{1}.size=n; net.biasConnect=[1;1;1]; net.inputConnect=[1;1;0]; March 22, 2010 15:45 9in x 6in B-922 b922-ch15 1st Reading Species Richness Estimation and Sampling Data Documentation 5 net.layerConnect=[0 0 1;0 0 1;1 1 0]; net.outputConnect=[0 0 1]; net.targetConnect=[0 0 1]; mi=min(data);ma=max(data); for i=1:n; tt(i,1)=mi(i);tt(i,2)=ma(i); end; net.inputs{1}.range=tt; net.layers{1}.size=numn; net.layers{2}.size=numn; net.layers{1}.transferFcn=’tansig’; net.layers{2}.transferFcn=’logsig’; net.layers{3}.transferFcn=’purelin’; net.layers{1}.initFcn=’initlay’; net.layers{2}.initFcn=’initlay’; net.layers{3}.initFcn=’initlay’; net.layerWeights{1,3}.delays=1; net.layerWeights{2,3}.delays=1; net.initFcn=’initlay’; net.performFcn=’mse’; net.trainFcn=’trainlm’; net.trainParam.goal=1e-05; net.trainParam.epochs=500; net=train(net,data’,da(:,2)’); pred=51:100; yy(simm,:)=sim(net,[data’ pred]); y=yy’; end; obs y disp(’Mean Standard Dev’) [mean(y,2) std(y,2)] %Print input weights and between-layer weights net.IW{1,1} %Input weights net.IW{2,1} %Input weights net.LW{3,1} %Between-layer weights net.LW{3,2} %Between-layer weights net.LW{1,3} %Between-layer weights net.LW{2,3} %Between-layer weights 1.1.3. Conventional models The polynomial function, a flexible and adaptable model, is used to model the above relationship (He, 2001; Mathworks, 2002). The polynomial function with a lower order could not achieve the satisfied fitting goodness, and the function with a higher order would over-fit the curve. As a result, the two- and three-order polynomial functions, and asymptotic function are March 22, 2010 15:45 9in x 6in B-922 b922-ch15 1st Reading 6 Computational Ecology used to model the relationship. y = a2 x2 + a1 x + b, (1) y = a3 x3 + a2 x2 + a1 x + b, (2) y = (ax + b)/(cx + d), (3) where x is the cumulative sample size, and y is the total number of plant species. Confidence intervals are obtained from bootstrap procedure. The last conventional model is lognormal distribution (Krebs, 1989). The methods of Cohen (1959, 1961) are used to fit lognormal distribution to species abundance (cover-degree) data and total species richness for a given cumulative sample size is estimated by means of these methods. 1.1.4. Bootstrap procedure Bootstrap procedure is used to produce yield–effort curve in simulation analysis (Zhang and Schoenly, 1999). The yield–effort curve plots the cumulative number of species, defined as the sum of the number of species in the previous sample(S) and the number of species in the present sample that were not observed in any previous sample. For the first sample, the cumulative number of species is defined to equal its number of species. The order in which samples are added to the total number of samples affects the shape of the curve. 
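A minimal sketch of generating such curves, following the procedure completed in the next paragraph, is given below. It assumes the data are held in a species-by-sample matrix dat (mm species, nn samples), as in the richness code above, and it reorders the sample columns with randperm in the same way as that code; resampling with replacement could be substituted, and the number of randomizations ram is illustrative.

%Sketch: randomized-sample-order (bootstrap) yield-effort curves
%Assumes dat is the species-by-sample matrix (mm species, nn samples) as above
[mm,nn]=size(dat);
ram=1000; %number of randomizations (illustrative)
curves=zeros(ram,nn);
for r=1:ram;
cols=dat(:,randperm(nn)); %reorder the sample columns
seen=zeros(mm,1);
for j=1:nn;
seen=seen | (cols(:,j)>0); %species recorded in at least one of the first j samples
curves(r,j)=sum(seen); %cumulative number of species
end
end
meancurve=mean(curves,1) %mean yield-effort curve over randomizations
sdcurve=std(curves,0,1) %its standard deviation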
Variation in curve shape due to sample order is different from sampling error caused by between-sample heterogeneity (Zhang and Schoenly, 1999). The present study bootstrapped the columns of the sample-by-species matrix. Repeating this process, e.g., 100 or 1000 randomizations, generates a family of curves from which the mean number of species and its standard deviation (or confidence interval) can be calculated for each cumulative sample size in the curve. 1.2. Data source In total 50 samples, each with the area of 1 m × 1 m, were surveyed on the natural grassland in Zhuhai, China. Plant species and their cover-degrees (%) were recorded and measured for each sample. March 22, 2010 15:45 9in x 6in B-922 b922-ch15 1st Reading Species Richness Estimation and Sampling Data Documentation 7 1.3. Results In total 48 plant species and 17 families have been found on the grassland. 1.3.1. Nonparametric estimation of species richness 1.3.1.1. Performance evaluation Among seven nonparametric estimators, Bootstrap yields the least absolute bias (AB) and relative bias (RB) in the estimation of plant species richness (Table 1). In addition, Bootstrap is basically insensitive to cumulative sample size as compared to other estimators (Fig. 2). It yields similar species richness estimates under various cumulative sample sizes. Overall Bootstrap is the best nonparametric estimator. Chao 2 is the robust model second-best to Bootstrap. Its estimate tends to be stable after cumulative sample size reaches 10 samples (Fig. 2). Chao 3 and Chao 4 show a similar trend. Estimates of Chao 3, Chao 4, and Jackknife 1 are stable when cumulative sample size is between approximately 20 and 40 samples. Estimates of Jackknife 2 tend to be lower than observed richness if cumulative sample size is larger than 40–50 samples. Chao 1 is the most sensitive to cumulative sample size. Its estimation curve is almost the same as the observed. Chao 1 is in this sense the worst model. Maximum richness of plant species on the grassland, estimated by Chao 1, Chao 2, Jackknife 1, Jackknife 2, Bootstrap, Chao 3, and Chao 4, was 50(±6), 54(±22), 55(±7), 58(±17), 54(±2), 60(±13), and 68(±26) species, respectively. On average, the maximum richness was 70. The cumulative sample sizes to achieve maximum estimates of species richness are 41, 30, 30, 22, 19, 30, and 29 samples for above estimators. 1.3.1.2. Nonparametric estimation of plant species richness Total plant species richness on the grassland is estimated using seven nonparametric models, as illustrated in Table 2. It can be found that the averaged Table 1. Bias of seven nonparametric estimators. AB RB (%) Chao 1 Chao 2 Jackknife 1 Jackknife 2 Bootstrap Chao 3 Chao 4 8.55 17.81 9.06 18.88 8.65 18.02 10.18 21.21 5.16 10.75 11.27 23.49 16.83 35.06 March 22, 2010 15:45 9in x 6in B-922 b922-ch15 1st Reading 8 Computational Ecology Figure 2. Performance of nonparametric models for the simulation of species richness vs. sample size relationship. In total 1000 randomizations are used in bootstrap procedure. richness estimate is similar to the estimate of Bootstrap. On average, the total species richness on the grassland, estimated with nonparametric models from all samples, is 48 to 55 species. 1.3.2. Estimation of species richness using other models Using neural network developed above (Fig. 1), polynomial functions [Eqs. (1) and (2)], asymptotic function [Eq. (3)], and lognormal function (Cohen, 1959, 1961; Krebs, 1989) are used to model species richness vs. sample size curve (Figs. 
3 and 4). In total 100 randomizations are used in bootstrap procedure. Simulation performance of neural network and lognormal function is the best. Combining simulation and prediction performance, neural March 22, 2010 15:45 9in x 6in B-922 b922-ch15 1st Reading Species Richness Estimation and Sampling Data Documentation 9 Table 2. Estimates of plant species richness using seven nonparametric models. Chao 1 Chao 2 Jackknife 1 Jackknife 2 Bootstrap Chao 3 Chao 4 Average Species Richness 95% Lower Limit 95% Upper Limit 49 48.35 50.94 41.59 51.18 51.76 52.19 49.28 47.42 47.77 29.94 — 48.2 49.8 50.12 45.54 50.58 48.92 71.94 — 54.16 53.72 54.26 55.59 Figure 3. Performance of neural network for modeling species richness vs. sample size curve. network is considered to be the best model, and polynomial functions yield the worst predication (Fig. 3). From neural network model, the estimated total species richness on the grassland is about 48 to 60. The asymptotic function [Eq. (3)] is fitted as follows: y = (2.823x + 5.520)/(0.046x + 0.808). March 22, 2010 10 15:45 9in x 6in B-922 b922-ch15 1st Reading Computational Ecology Figure 4. Performance of Arrhenius function, lognormal function, asymptotic function, and polynomial functions for modeling species richness vs. sample size curve. According to this model, the total plant species richness on the grassland is estimated to be 61 species (x → ∞). 1.4. Conclusions Among seven nonparametric estimators tested, Bootstrap yields the least absolute bias and relative bias, and it yields similar estimates under various cumulative sample sizes. Bootstrap is considered to be the best nonparametric estimator. Chao 2 is the second most robust estimator in seven nonparametric models. Chao 1 is most sensitive to cumulative sample size, March 22, 2010 15:45 9in x 6in B-922 b922-ch15 1st Reading Species Richness Estimation and Sampling Data Documentation 11 and is considered to be the worst model in estimating species richness. The total species richness on the grassland, estimated with nonparametric models and Bootstrap, is on average 48 to 55 species. Artificial neural network, developed in present study, yields an estimate of 48 to 60 species. Estimate of asymptotic function is 60 species. In general the artificial neural network is reliable in the estimation of species richness. It proves that neural network models are more effective than conventional models including nonparametric estimators used above. 2. Documentation of Sampling Data of Invertebrates Biodiversity and conservation studies in ecology often begin with issues of sampling. Descriptions of sampling data are important topics in biodiversity analysis (Steele et al., 1984; Miller and White, 1986; Miller and Wiegert, 1989; Moreno and Halffter, 2000). Researchers may need to know, for example, how representative and complete their ecological community being sampled is. If taking a few samples in a single field captures the same abundant taxa as taking more samples does, then the future surveys can be done at a lower cost and with a minimal loss of essential ecological information (Zhang and Schoenly, 1999). To measure completeness of sampling, a curve may be drawn (Cohen, 1978; Cohen et al., 1993; Colwell and Coddington, 1994; Zhang and Schoenly, 1999), which plots the number of taxa sampled vs. the sample size, the number of taxa sampled vs. the abundance, or the mean abundance of newly sampled species vs. the sample size. 
These curves increase ecological understanding of dominance–diversity relationships and spatial distributions of taxa. A large number of studies have dedicated to these problems (Bunge and Fitzpatrick, 1993; Coleman et al., 1982; Colwell and Coddington, 1994; Krebs, 1989; Schoenly et al., 1999, 2003; Shahid et al., 2003; Zhang and Schoenly, 1999). As the indicator of biodiversity conservation and pest management, invertebrate diversity in farmlands has been an attractive topic in recent years (Brown, 1991; Kremaen et al., 1993). A pilot study in agro-biodiversity might constitute a set of samples gathered from one March 22, 2010 12 15:45 9in x 6in B-922 b922-ch15 1st Reading Computational Ecology experimental plot or farmer’s field. Sampling analysis is therefore one of the major topics of invertebrate biodiversity. Artificial neural networks have been widely used in numerical computation, pattern recognition, classification, and system control, etc (Hagan et al., 1996). This section aims to conduct documentation (i.e., sampling information is stored in trained neural networks) on the sampling data using two artificial neural network models, BP and RBF networks, based on the invertebrate data sampled in irrigated rice field. Several ecological functions are also tested to compare their performance with neural networks. More details can be found in Zhang and Barrion (2006). 2.1. Model description 2.1.1. Neural networks 2.1.1.1. BP neural network See chap. 000. 2.1.1.2. RBF neural network See previous chapters. The transfer functions in the hidden layer are Gaussian kernel functions: uj = exp(−(x − cj )2 /(2σj2 )), j = 1, 2, . . . , N, where uj is the output of jth hidden neuron, x is the input, cj is the centralized value of Gaussian function, σj is the standardized constant, and N is the number of neurons in the hidden layer. Matlab codes of RBF network and BP network used in the present study are listed as follows: %Read input x to variable p1 (cumulative sample size) and target y to variable t1 (cumulative %number of species sampled), …. 
p1=Data1(:,1);p2=Data2(:,1);p3=Data3(:,1);p4=Data4(:,1); t1=Data1(:,2);t2=Data2(:,2);t3=Data3(:,2);t4=Data4(:,2); %Develop a RBF network if it is used eg = 0.01; %Set square sum error sc = 1; %Set expansion constant net1=newrb(p1,t1,eg,sc); net2=newrb(p2,t2,eg,sc); net3=newrb(p3,t3,eg,sc); net4=newrb(p4,t4,eg,sc); %Simulate input-output function with trained RBF network March 22, 2010 15:45 9in x 6in B-922 b922-ch15 1st Reading Species Richness Estimation and Sampling Data Documentation 13 Y1=sim(net1,p1);Y2=sim(net2,p2);Y3=sim(net3,p3);Y4=sim(net4,p4); %Develop a BP networks if it is used n=5; %Set 5 neurons in hidden layer net1=newff(minmax(p1’),[n,1],{’tansig’ ’purelin’},’trainlm’); net2=newff(minmax(p2’),[n,1],{’tansig’ ’purelin’},’trainlm’); net3=newff(minmax(p3’),[n,1],{’tansig’ ’purelin’},’trainlm’); net4=newff(minmax(p4’),[n,1],{’tansig’ ’purelin’},’trainlm’); %Training BP network net1.trainParam.epochs =1000; net2.trainParam.epochs = 1000; net3.trainParam.epochs =1000; net4.trainParam.epochs = 1000; net1.trainParam.goal = 0.01; net2.trainParam.goal = 0.01; net3.trainParam.goal = 0.01; net4.trainParam.goal = 0.01; %Set maximum mean square error net1=train(net1,p1’,t1’); net2 =train(net2,p2’,t2’); net3=train(net3,p3’,t3’); net4 =train(net4,p4’,t4’); %Simulate input-output function with trained BP network m=100; Y5=sim(net1,1:m);Y6=sim(net2,1:m);Y7=sim(net3,1:m);Y8=sim(net4,1:m); %Draw figures figure; subplot(4,2,1);plot(1:m,Y5,’-’);hold on;plot(p1’,t1’,’*’); legend(’BP Fitted’,’Data1’); for i=1:m-1; if (Y5(i+1)-Y5(i))<=0.01 plot(0:i,ones(1,i+1)*Y5(i),’b:’); plot(ones(1,i)*i,linspace(20,Y5(i),i),’b:’); disp([i Y5(i)]); break end end if i==m-1 plot(0:m,ones(1,m+1)*Y5(m),’b:’); disp([m Y5(m)]); end subplot(4,2,2);plot(p1,Y1,’-’);hold on;plot(p1,t1,’*’); legend(’RBF Fitted’,’Data1’); subplot(4,2,3);plot(1:m,Y6,’-’);hold on;plot(p2’,t2’,’*’); legend(’BP Fitted’,’Data2’); for i=1:m-1; if (Y6(i+1)-Y6(i))<=0.01 plot(0:i,ones(1,i+1)*Y6(i),’b:’); plot(ones(1,i)*i,linspace(20,Y6(i),i),’b:’); disp([i Y6(i)]); break end end if i==m-1 plot(0:m,ones(1,m+1)*Y6(m),’b:’); disp([m Y6(m)]); end March 22, 2010 14 15:45 9in x 6in B-922 b922-ch15 1st Reading Computational Ecology ylabel(’Cumulated Number of Species Sampled’); subplot(4,2,4);plot(p2,Y2,’-’);hold on;plot(p2,t2,’*’); legend(’RBF Fitted’,’Data2’); subplot(4,2,5);plot(1:m,Y7,’-’);hold on;plot(p3’,t3’,’*’); legend(’BP Fitted’,’Data3’); for i=1:m-1; if (Y7(i+1)-Y7(i))<=0.01 plot(0:i,ones(1,i+1)*Y7(i),’b:’); plot(ones(1,i)*i,linspace(20,Y7(i),i),’b:’); disp([i Y7(i)]); break end end if i==m-1 plot(0:m,ones(1,m+1)*Y7(m),’b:’); disp([m Y7(m)]); end subplot(4,2,6);plot(p3,Y3,’-’);hold on;plot(p3,t3,’*’); legend(’RBF Fitted’,’Data3’); subplot(4,2,7);plot(1:m,Y8,’-’);hold on;plot(p4’,t4’,’*’); xlabel(’Cumulative Sample Size’); legend(’BP Fitted’,’Data4’); for i=1:m-1; if (Y8(i+1)-Y8(i))<=0.01 plot(0:i,ones(1,i+1)*Y8(i),’b:’); plot(ones(1,i)*i,linspace(20,Y8(i),i),’b:’); disp([i Y8(i)]); break end end if i==m-1 plot(0:m,ones(1,m+1)*Y8(m),’b:’); disp([m Y8(m)]); end subplot(4,2,8);plot(p4,Y4,’-’);hold on;plot(p4,t4,’*’); xlabel(’Cumulative Sample Size’); legend(’RBF Fitted’,’Data4’); %Print input weights and between-layer weights net1.IW{1,1} net2.IW{1,1} net3.IW{1,1} net4.IW{1,1} net1.LW{2,1} net2.LW{2,1} net3.LW{2,1} net4.LW{2,1} 2.1.2. Conventional models (1) Arrhenius model is used to fit the species richness vs. the sample size curve (Preston, 1960): N = aS b , where N is the number of the species when sample size is S. The species richness vs. 
the sample size curves capture the information that describes relationships between sample March 22, 2010 15:45 9in x 6in B-922 b922-ch15 1st Reading Species Richness Estimation and Sampling Data Documentation 15 size (i.e., cumulative number of samples) and the number of species sampled (Cohen, 1978; Cohen et al., 1993; Coleman et al., 1982; Colwell and Coddington, 1994). (2) Species richness is positively correlated with abundance. Samples containing large numbers of individuals thus harbor more taxa than smaller samples. The rarefaction curve (Sanders, 1968; Hurlbert, 1971; Simberloff, 1972; Gotelli and Graves, 1996; Schoenly and Zhang, 1999) is used to test the null hypothesis that two or more sampled communities come from the same parent distribution and have the same species richness. 2.1.3. Bootstrap method Bootstrap procedure is used to produce the species richness vs. the sample size curves. The curves plot the cumulative number of species, defined as the sum of the number of species in the previous sample(s) and the number of species in the present sample that were not observed in any previous sample. For the first sample, the cumulative number of species is defined to equal the number of species found in this sample (Zhang and Schoenly, 1999; Zhang, 2007). This study bootstraps the columns of the sample-by-species matrix. By doing so it produces a different sampling pathway through the field. Repeating this process, for instance, 1000 randomizations, will generate a family of curves from which the mean number of species can be calculated for each sample size in the curve. The bootstrap procedure is also used to yield rarefaction curves, from all 60 samples. 2.2. Data description Samples for invertebrates were collected in the irrigated rice field. In total 60 samples were collected for each of four sampling dates. Invertebrates were sorted to stage (immatures, adults) and then identified to the lowest possible taxon. Data from the records were stored as sample-by-functional species (immatures and adults were listed separately, and defined as different functional species in the present study) matrices and their spreadsheet files were represented as Data 1, Data 2, Data 3, and Data 4, respectively (Zhang and Schoenly, 2004). March 22, 2010 16 15:45 9in x 6in B-922 b922-ch15 1st Reading Computational Ecology 2.3. Results In total 99 invertebrate families are found in the four sampling sets, of which 50 families are identified as the abundant families that comprise 99% of the total invertebrate abundance. Invertebrate fauna of these sampling sets are proved to be different from each other (Zhang and Schoenly, 2004). 2.3.1. Species richness vs. sample size curves Both BP network and RBF network perform better in fitting species richness vs. sample size data (Fig. 5). The asymptote of BP approximation is achieved when the absolute error for asymptote is set to 0.01. RBF networks are quickly trained with training goal (SSE = 0.01; SSE: square sum error) but more neurons (60 neurons) are needed (the number of neurons is automatically produced by the RBF network). The total number of functional species of rice invertebrates for sampling sets Data 1, Data 2, Data 3, and Data 4, extrapolated from the function asymptotes of trained BP network, is 140, 149, 147 and 144, respectively (Fig. 5), while the observed functional species richness is 126, 141, 140, and 131 when the sample size is fixed at 60. 
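For reference, the Arrhenius comparison reported below can be reproduced along the following lines. The sketch assumes that the model N = aS^b is fitted by ordinary least squares on log-transformed data (the fitting method used in the study is not stated, so this is only one possible choice) and that p1 and t1 hold the cumulative sample sizes and species counts for Data 1, as in the code above.

%Sketch: fit the Arrhenius model N = a*S^b by linear regression on logarithms
%Assumes p1 = cumulative sample sizes and t1 = cumulative species richness (Data 1)
c=polyfit(log(p1),log(t1),1); %log N = log a + b*log S
b=c(1); a=exp(c(2));
Nfit=a*p1.^b;
msearr=sum((Nfit-t1).^2)/numel(t1) %mean squared error, cf. Table 3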
MSEs (mean squared error) of Arrhenius model are larger than that of BP fitted and RBF fitted functions (Table 3 and Fig. 5. The MSEs of RBF networks are zeros in Table 3). Overall BP network and RBF network are superior to Arrhenius model in function approximation of species richness vs. sample size data. Table 3. Comparisons of fitting goodness to curves of species richness vs. sample size. Performance functions: MSE (for BP and Arrhenius), SSE (for RBF). Data 1 Data 2 Data 3 Data 4 BP RBF Arrhenius 0.022 0.327 0.009 0.011 0 0 0 0 5.2722 25.501 17.4749 15.4121 March 22, 2010 15:45 9in x 6in B-922 b922-ch15 1st Reading Species Richness Estimation and Sampling Data Documentation 17 Figure 5. Curves of species richness vs. sample size. Each point is the mean of 1000 randomizations of bootstrap procedure. Asymptotes are indicated based on BP extrapolation. 2.3.2. Rarefaction curves In the rarefaction analysis, BP network and RBF network demonstrate better performance than rarefaction method (Table 4 and Fig. 6). 2.4. Discussion Compared to conventional models above, both BP network and RBF network are better models in the documentation of sampling information. The mathematical function for sampling data is able to be satisfactorily fitted March 22, 2010 18 15:45 9in x 6in B-922 b922-ch15 1st Reading Computational Ecology Table 4. Comparisons of fitting goodness to the rarefaction curves among various methods. Performance functions: MSE (for BP and Rarefaction), SSE (for RBF). Data 1 Data 2 Data 3 Data 4 BP RBF Rarefaction 1.9244 5.3682 3.4713 7.0201 0 0 0 0 9.0187 15.1631 5.1665 8.9528 Figure 6. Curves of rarefaction method. Each point is the mean of 1000 randomizations of bootstrap procedure. March 22, 2010 15:45 9in x 6in B-922 b922-ch15 1st Reading Species Richness Estimation and Sampling Data Documentation 19 using artificial neural network. It is found that the total numbers of functional species of rice invertebrates, extrapolated from trained BP network, are between 140 and 149 in the irrigated rice field. In their regional flora research, Miller and Wiegert (1989) used different models, canonical lognormal, uniform, and random, etc., to fit sampling data. Significant errors were produced when fitting with these models (Fig. 1 in Miller and Wiegert, 1989). These models will yield good fitness only if some simplified assumptions on source data are met (Colwell and Coddington, 1994). Contrary to the explicit mathematical functions, BP network and RBF network can learn from sampling information, as well as allow for the network to mine intrinsic mechanism hidden in sampling data, and no mathematical assumptions are needed to fit the sampling data. As pointed out in previous chapters, the sampling information is represented by the connection weights of neural network. This is the essence of documentation even though connection weights are not explicitly listed. March 22, 2010 16:52 9in x 6in B-922 b922-ch16 1st Reading  CHAPTER 16  Modeling Arthropod Abundance from Plant Composition of Grassland Community A large number of studies have been dedicated to the relationship between arthropod diversity and plant composition. It was reported that weeds influence insect diversity in a crop-weed-insect system (Altieri and Letourmeau, 1984; Altieri, 1994, 1995). Community with more complex plant species composition will contain more diverse insects (Sheng et al., 1997). 
Some forest studies showed that the relationship between the plant community and the insect community is significant (Dong et al., 2005; Jia et al., 2006). Furthermore, there is a positive correlation between the plant community and the predatory and parasitic insect community, and a negative correlation between the plant community and the defoliator insect community (Dong et al., 2005). Dominant arthropod populations on farmland are negatively regulated by vegetational diversity, although positive regulation occurs in some cases (Andow, 1991). Many findings have revealed that significant but complex relationships exist between arthropods and plant composition. These relationships appear to be nonlinear, and the mechanisms that generate them cannot yet be clearly explained. In the research areas of arthropods and plant communities, artificial neural networks have been used successfully for simulation, prediction, recognition, and classification (Moisen and Frescino, 2002; Worner and Gevrey, 2006; Zhang, 2007; Zhang et al., 2007; Zhang and Wei, 2008; Zhang and Zhang, 2008). So far, research on modeling arthropod abundance from plant composition on grassland is lacking. This chapter aims to find the relationship between arthropod abundance and plant composition on grassland, to develop a neural network for modeling this relationship, and to compare the simulation performance of the neural network with conventional models.

1. Model Description

1.1. Neural network

A three-layer neural network is developed for modeling arthropod abundance from plant composition (Fig. 1). For this network, the input set is xi ∈ Rp and the corresponding output set is yi ∈ R, where p is the number of indices for plant composition, xi is the vector of plant composition for sample i, and yi is the arthropod abundance for sample i, i = 1, 2, ..., n. Thirty neurons are used in both the first and second layers. Bias is used in all of the layers. The transfer functions for layers 1 to 3 are the hyperbolic tangent sigmoid transfer function (tansig), the logistic sigmoid transfer function (logsig), and the linear transfer function (purelin), respectively. Weights and biases for each layer are initialized by the Nguyen-Widrow algorithm (Hagan et al., 1996; Mathworks, 2002; Fecit, 2003). Network initialization is made with a function that initializes each layer i according to its own initialization function (initlay). The network is trained by the Levenberg-Marquardt algorithm (trainlm). The performance function is the mean squared error function (mse). Both the first and second layers receive the same inputs from the sample space and yield outputs for the third layer, which combines them to produce the network output. For each layer, the net input function (netsum) calculates the layer's net input by combining its weighted inputs and biases. The neural network is developed using Matlab (Mathworks, 2002) in this chapter.

Figure 1. Architecture of the neural network designed.
Matlab codes are listed as follows:

clear net; clc
n=17;      %Dimension of the input (number of plant families)
numn=30;   %Number of neurons in layers 1 and 2
%data: sample matrix; columns 1 to n hold plant composition, column n+1 holds arthropod abundance
net=network;
net.numInputs=1;
net.numLayers=3;
net.inputs{1}.size=n;
net.biasConnect=[1;1;1];
net.inputConnect=[1;1;0];
net.layerConnect=[0 0 1;0 0 1;1 1 1];
net.outputConnect=[0 0 1];
net.targetConnect=[0 0 1];
mi=min(data);
ma=max(data);
for i=1:n;
  tt(i,1)=mi(i);
  tt(i,2)=ma(i);
end;
net.inputs{1}.range=tt;
net.layers{1}.size=numn;
net.layers{2}.size=numn;
%Transfer functions: logsig,tansig,purelin,radbas,satlins,tribas
net.layers{1}.transferFcn='tansig';
net.layers{2}.transferFcn='logsig';
net.layers{3}.transferFcn='purelin';
net.layers{1}.initFcn='initnw';
net.layers{2}.initFcn='initnw';
net.layers{3}.initFcn='initnw';
net.layerWeights{1,3}.delays=1;
net.layerWeights{2,3}.delays=1;
net.layerWeights{3,3}.delays=1;
net.initFcn='initlay';
net.performFcn='mse';
net.trainFcn='trainlm';
net.trainParam.goal=1e-05;
net.trainParam.epochs=10000;
net=train(net,data(:,1:n)',data(:,n+1)');
y=sim(net,data(:,1:n)');
disp('Observed   Simulated')
[data(:,n+1) y']
mmse=sum((y-data(:,n+1)').^2)/50
plot(data(:,n+1)',y,'*');
%Print input weights and between-layer weights
net.IW{1,1}   %Input weights
net.IW{2,1}   %Input weights
net.LW{3,1}   %Between-layer weights
net.LW{3,2}   %Between-layer weights
net.LW{1,3}   %Between-layer weights
net.LW{2,3}   %Between-layer weights
net.LW{3,3}   %Between-layer weights

1.2. Conventional models

1.2.1. Multivariate model

Multivariate regression (He, 2001; Mathworks, 2002) is used for modeling arthropod abundance from plant composition:

f(x) = a + b^T x,

where f(x) is arthropod abundance (individuals per sample); a is a constant; b = (b1, b2, ..., bp)^T is the parametric vector; and x = (x1, x2, ..., xp)^T is the vector of plant composition (p plant taxa, e.g., species, families, etc.).

1.2.2. Response surface model (RSM)

The response surface model (He, 2001; Mathworks, 2002) is also used in the present modeling:

f(x) = a + b^T x + x^T c x,

where f(x) is arthropod abundance (individuals per sample); x = (x1, x2, ..., xp)^T is the vector of plant composition; b = (b1, b2, ..., bp)^T and c = (c1, c2, ..., cp)^T are parametric vectors; and a is a constant.

1.2.3. Principal components extraction (PCE)

PCE is often used in data reduction to identify a small number of factors that explain most of the variance observed in a much larger number of manifest variables (SPSS, 2006). In this chapter it is used to reduce the dimensionality of the input space and to generate independent principal components from a larger number of plant taxa without significant loss of variance information.

2. Data Description

Plant composition and arthropod abundance were recorded on natural grassland in Zhuhai, China. In total 50 samples, each with an equal size of 1 × 1 m, were investigated. Plant species and their cover-degrees were recorded and measured, and individuals of the various arthropods were collected and counted for each sample.

(1) Plant family data. In the modeling of arthropod abundance from plant family data, in total 17 plant families (a 17-dimensional input space, R17) and 50 samples (n = 50) are used to train the neural network or to build the multivariate regression. The output space is a one-dimensional space (arthropod abundance).
(2) PCE-based data.
The PCE procedure applied to the plant family data yields p principal components. The 17-dimensional input space is thus transformed into a p-dimensional input space (Rp). Fifty samples in the p-dimensional input space are used to train the neural network, or to build the multivariate regression and the response surface model. The output space is a one-dimensional space, i.e., the real domain R (arthropod abundance).
(3) Cross validation. Each sample is separately removed from the input set of 50 samples, and the remaining samples are used to train the model and to predict the removed sample using the trained model. Comparisons between the predicted and observed arthropod abundances are made, and the Pearson correlation coefficient (r) and statistical significance are calculated to validate the models.
(4) Samples are submitted to the neural network in two ways, i.e., in their natural IDs and in randomized sequences.

3. Results

3.1. Complexity of plant-arthropod interactions

Using the algorithm for biological interaction networks (Zhang, 2007), an interaction network for plant families-arthropod orders is obtained. There are many direct or indirect interactions between plants and arthropods in the network, as follows: (Oxalidaceae, Araneida), (Leguminosae, Diptera), (Leguminosae, Araneida), (Gramineae, Hemiptera), (Gramineae, Diptera), (Gramineae, Coleoptera), (Gramineae, Odonata), (Apocynaceae, Lepidoptera), (Malvaceae, Diptera), (Compositae, Hemiptera), (Compositae, Diptera), (Compositae, Orthoptera), (Onagraceae, Diptera), (Connaraceae, Araneida), (Cyperaceae, Coleoptera), (Cyperaceae, Isoptera), (Lycopodiaceae, Orthoptera), (Convolvulaceae, Diptera), (Commelinaceae, Coleoptera), (Commelinaceae, Araneida). The results for the interaction network indicate that many of the plant-arthropod interactions on the grassland are positive interactions, except for the negative interactions (Leguminosae, Araneida), (Gramineae, Hemiptera), (Gramineae, Diptera), and (Compositae, Orthoptera). Theoretically, it should therefore be possible to model arthropod abundance from plant family composition.

3.2. Modeling arthropod abundance

Multivariate regression fitted with the plant species data reveals that arthropod abundance cannot be reasonably described by plant species and their cover-degrees (r = 0.1995, p = 0.05). However, the multivariate regression fitted with the plant family data (50 samples, 17 plant families) demonstrates that arthropod abundance is significantly dependent upon plant families and their cover-degrees (r = 0.4182, p < 0.005; Fig. 2). The neural network performs better than multivariate regression in the simulation of arthropod abundance based on the plant family data, as illustrated in Fig. 2.

Sets of p (p = 2, 3, 4, 5, 6) principal components are extracted from the plant family data using the PCE procedure. Fifty samples in the p-dimensional input space are used to train the neural network for 10,000 epochs. The results reveal that the simulation based on four principal components has the best goodness of fit. The four principal components explain about 50% of the variation observed in the plant family data. The simulation performance of the neural network trained on the raw plant family data is worse than that obtained from the PCE-extracted data (Fig. 2), and is not even better than the performance based on two principal components (2 PCs, Table 1).
Simulation performance with four PCs (regression constant ≈ 0, regression coefficient ≈ 1, p < 0.05, mse = 5.9944) is much better than that with the plant family data.

Table 1. Comparisons between simulated and observed arthropod abundance. Simulation is conducted based on p (p = 2, 3, 4, 5, 6) principal components (PCs). Samples are submitted to the neural network in natural IDs.

2 PCs: Simulated = 6.5152 + 0.3674 × Observed, r = 0.6451, p < 0.05, mse = 68.7176
3 PCs: Simulated = 3.3760 + 0.6803 × Observed, r = 0.8967, p < 0.05, mse = 25.075
4 PCs: Simulated = 0.7896 + 0.9240 × Observed, r = 0.9743, p < 0.05, mse = 5.9944
5 PCs: Simulated = 2.3718 + 0.7801 × Observed, r = 0.9271, p < 0.05, mse = 17.2497
6 PCs: Simulated = 6.6830 + 0.3652 × Observed, r = 0.6733, p < 0.05, mse = 65.7586

From the results above, we find that a suitable dimensionality of the input space is necessary to produce a soundly trained neural network. For situations with a large number of indices for plant composition, reduction of the indices, for example by PCE, is suggested before training the neural network. A high-dimensional input space combined with few samples in the input set results in deficient learning of the neural network.

Figure 2. Simulation performance of the neural network and multivariate regression. Family data (50 samples, 17 plant families) are used to train the models. Samples are submitted to the neural network in natural IDs.

With the data of four principal components, multivariate regression and the response surface model are developed. Compared with the neural network, the response surface model does not sufficiently fit arthropod abundance (Fig. 3). Multivariate regression yields the worst performance among all of these models.

Figure 3. Simulation performance of the neural network, multivariate regression, and the response surface model. Four principal components extracted from the family data (50 samples, 17 plant families) are used to train the models. Samples are submitted to the neural network in natural IDs.

Different from the above cases, if samples are submitted to the neural network in randomized sequences, the neural network still functions effectively (regression constant = 2.688, regression coefficient = 0.74, p < 0.05, mse = 15.1725; Fig. 4), but its performance is worse than that of the neural network trained with samples submitted in natural IDs.

Figure 4. Neural network simulation. Samples are submitted to the neural network in randomized sequences. Left: simulation (mean of ten randomizations of sample sequences); right: arthropod abundance, i.e., the observed values and the simulated 95% confidence interval from ten randomizations of sample sequences.

3.3. Cross validation of models

Cross validation is based on four principal components, with samples submitted to the neural network in natural IDs. In cross validation the neural network demonstrates better robustness in predicting unknown samples (r = 0.2296, p = 0.1; Fig. 5), while multivariate regression (r = 0.0131, p = 0.9143 > 0.05) and the response surface model (r = 0.1096, p = 0.4479 > 0.05) do not effectively predict unknown samples.

Figure 5. Cross validation of multivariate regression, the response surface model, and the neural network. Four principal components are used to train the models. Samples are submitted to the neural network in natural IDs.
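A minimal sketch of this leave-one-out procedure is given below, using a standard feed-forward network (newff) as a stand-in for the network defined in Sec. 1.1; the variable names pcs (50 × 4 matrix of principal component scores) and abund (50 × 1 vector of arthropod abundance), as well as the reduced number of epochs, are assumptions:

% Leave-one-out cross validation on the 50 samples (4 principal components as inputs)
n = size(pcs,1);
pred = zeros(n,1);
for i = 1:n
    idx = setdiff(1:n, i);                 % remove sample i from the training set
    net = newff(minmax(pcs(idx,:)'), [30 30 1], {'tansig','logsig','purelin'}, 'trainlm');
    net.trainParam.goal = 1e-5;
    net.trainParam.epochs = 1000;          % fewer epochs than in the text, for speed
    net.trainParam.show = NaN;             % suppress training display
    net = train(net, pcs(idx,:)', abund(idx)');
    pred(i) = sim(net, pcs(i,:)');         % predict the removed sample
end
[R, P] = corrcoef(abund, pred);            % Pearson correlation and its significance
r = R(1,2); p = P(1,2);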
If samples are submitted to the neural network in randomized sequences, the neural network performs poorly in the cross validation. About half of the observed data fall beyond the 95% confidence interval of the predicted values, as illustrated in Fig. 6.

Figure 6. Cross validation of the neural network. Left: prediction (mean of ten randomizations of sample sequences); right: arthropod abundance, i.e., the observed values and the predicted 95% confidence interval from ten randomizations of sample sequences.

4. Discussion

This study shows that arthropod abundance on grassland is dependent upon plant families and their cover-degrees (plant composition). The neural network model is superior to multivariate regression and the response surface model in modeling arthropod abundance from plant composition.

A reasonable dimensionality of the input space is conducive to training a better neural network. Algorithms for dimension reduction, such as PCE, are suggested for data pre-treatment in neural network modeling. A high-dimensional input space combined with few samples in the input set results in deficient learning of the neural network.

The randomization procedure is useful for reducing sequence correlation in sample submission. Sequence correlation may thus be eliminated, but the performance of the neural network modeling is lowered. Sequential submission of samples would yield a neural network carrying undetermined sequence information and thus produce impractical predictions for unknown samples. It is suggested that the randomization procedure be used in sample submission for situations with a large number of samples and a low-dimensional input space.

CHAPTER 17

Pattern Recognition and Classification of Ecosystems and Functional Groups

Invertebrate diversity in farmland is a natural control force for crop pests and usually serves as an indicator of agricultural environmental health (Brown, 1991; Kremen et al., 1993; Way and Heong, 1994). It has been studied in various aspects using neural network models. For instance, BP and radial basis function neural networks were used for function approximation of sampling data, and the possible richness of functional invertebrate species was predicted (Zhang and Barrion, 2006). A stream classification based on characteristic invertebrate species assemblages was also satisfactorily conducted using a self-organizing map neural network, and theoretical assemblages were suggested to define representative or reference sites for biological surveillance (Cereghino et al., 2001). The self-organizing map neural network has also been used to determine the risk of insect pest invasion (Worner and Gevrey, 2006; Watts and Worner, 2009) and to assess communities (Song et al., 2007). Overall, applications of neural network models to invertebrate diversity still fall short of practical requirements.
This chapter aims to present some topological functions for the self-organizing map neural network (SOM) and to evaluate the effectiveness of several neural network models in the recognition and classification of invertebrate habitat zones (ecosystems) and functional groups. Further details can be found in Zhang (2007) and Zhang and Li (2007).

1. Model Description

1.1. Neural networks

1.1.1. Probabilistic neural network

See Chap. 000.

1.1.2. Generalized regression neural network

See Chap. 000.

1.1.3. Linear neural network

See Chap. 000.

Matlab codes of the probabilistic network, generalized regression network, and linear network used in this study are as follows:

%Load sampling data file (SamplingData.*). In this file, columns represent samples
%and rows represent variables (or indices, attributes, etc.) (P); the last row gives
%the classes these samples fall into (lastrow)
varis=size(SamplingData,1)-1;
samples=size(SamplingData,2);
lastrow=SamplingData(varis+1,:);
P=SamplingData(1:varis,:);
C=ind2vec(lastrow);
%Load the file (RecSamples.*) for the samples to be recognized. Same format as the
%sampling data file
Q=RecSamples;
%Generate a probabilistic neural network (newpnn) and set the spread of the radial
%basis functions to 0.1, if this neural network is to be used
net=newpnn(P,C,0.1);
%Generate a generalized regression neural network (newgrnn) and set the spread of the
%radial basis functions to 0.2 (a smaller spread fits the data better but is less
%smooth), if this neural network is to be used
net=newgrnn(P,C,0.2);
%Generate a linear neural network (newlind) if this neural network is to be used
net=newlind(P,C);
%Make classification on the trained samples
outputclass=sim(net,P)
%Make recognition on the samples with unknown classification
recognition=sim(net,Q)

1.1.4. Self-organizing map neural network

See Chap. 000.

1.1.5. Self-organizing competitive learning neural network

See Chap. 000.

Matlab codes of the SOM and self-organizing competitive learning neural networks are developed as follows:

%Load sampling data file (SamplingData.*). In this file, rows represent samples and
%columns represent variables (or indices, attributes, etc.)
da=SamplingData(:,:);
P=da(:,:)';
%Load the file (RecSamples.*) for the samples to be recognized. Same format as the
%sampling data file
newsamples=RecSamples(:,:);
Q=newsamples(:,:)';
%Generate a self-organizing map neural network (newsom) with 5 neurons if it is used
net=newsom(minmax(P),[5]);
%Generate a self-organizing competitive learning neural network (newc) with 5 neurons,
%Kohonen learning rate = 0.01, and conscience learning rate = 0.001 if it is used
net=newc(minmax(P),5,0.01,0.001);
%Specify the distance measure as the Chebyshev distance if necessary
net.layers{1}.distanceFcn = 'chebyshovdist';
net.inputWeights{1,1}.weightFcn = 'chebyshovdist';
%Train the network; 1000 epochs are set
net.trainParam.epochs=1000;
net=init(net);
net=train(net,P);
%Obtain samples and updated weights, and make classification on the samples
for i=1:size(P,2);
  a=vec2ind(sim(net,P(:,i)));
  outputclass(1,i)=i;
  outputclass(2,i)=a;
end
outputclass
%Make recognition on the samples with unknown classification
recognition=vec2ind(sim(net,Q))
%Print input weights
net.IW{1}
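The name chebyshovdist assigned above is not a built-in function of the Matlab neural network toolbox, so an M-file with that name must be supplied on the search path. A minimal sketch for the neuron-distance role, assuming the standard Chebyshev (maximum-coordinate) metric, is given below; note that the built-in boxdist implements the same metric, and that the weight-function role (net.inputWeights{1,1}.weightFcn) formally expects a function of a weight matrix and an input matrix, a variant omitted here:

function d = chebyshovdist(pos)
%CHEBYSHOVDIST Chebyshev distance between neuron positions.
%   pos: N-by-S matrix of S neuron position vectors.
%   d:   S-by-S matrix of pairwise maximum-coordinate distances.
s = size(pos,2);
d = zeros(s,s);
for i = 1:s
  for j = 1:s
    d(i,j) = max(abs(pos(:,i) - pos(:,j)));
  end
end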
1.1.6. New topological functions for self-organizing map neural network

Four topological functions are established based on the template of the topological function mytopf in the Matlab neural network toolkit, as indicated in the following:

cossintopf: cos(sin(cx))
sincostopf: sin(cx) + cos(cx)
acossintopf: acos(sin(cx))
expsintopf: e^sin(cx)

(1) Source codes of topological function cossintopf

function pos=cossintopf(varargin)
dim=[varargin{:}];    %The dimensions as a row vector
size=prod(dim);       %Total number of neurons
dims=length(dim);     %Number of dimensions
pos=zeros(dims,size); %The size that POS will need to be set
len=1;
pos(1,1)=0;
for i=1:length(dim)
  dimi=dim(i);
  newlen=len*dimi;
  pos(1:(i-1),1:newlen)=pos(1:(i-1),rem(0:(newlen-1),len)+1);
  posi=0:(dimi-1);
  pos(i,1:newlen)=posi(floor((0:(newlen-1))/len)+1);
  len=newlen;
end
for i=1:length(dim)
  pos(i,:)=pos(i,:)*0.7+cos(sin([1:size]*exp(1)/5*i))*0.3;
end

(2) Source codes of topological function sincostopf

function pos=sincostopf(varargin)
%The source codes here are the same as cossintopf, except for the final loop
for i=1:length(dim)
  pos(i,:)=pos(i,:)*0.6+sin([1:size]*exp(1)/5*i)*0.2+cos([1:size]*exp(1)/5*i)*0.2;
end

(3) Source codes of topological function acossintopf

function pos=acossintopf(varargin)
%The source codes here are the same as cossintopf, except for the final loop
for i=1:length(dim)
  pos(i,:)=pos(i,:)*0.7+acos(sin([1:size]*exp(1)/5*i))*0.3;
end

(4) Source codes of topological function expsintopf

function pos=expsintopf(varargin)
%The source codes here are the same as cossintopf, except for the final loop
for i=1:length(dim)
  pos(i,:)=pos(i,:)*0.7+exp(sin([1:size]*exp(1)/5*i))*0.3;
end

1.2. Conventional model

Linear discriminant analysis (SPSS, 2006) is conducted to evaluate the effectiveness of the neural networks.

2. Data Source

Invertebrates were recorded at sixty sites in an irrigated rice field. Invertebrates were sorted to stage (immatures, adults) and then identified to the lowest possible taxon. Invertebrate taxa were lumped into habitat zones (Schoenly and Zhang, 1999). Four sampling sets (Data 1, Data 2, Data 3, and Data 4) were obtained and then combined into a site-by-habitat zone matrix and a sample-by-functional group matrix.

If invertebrate habitat zones are to be classified and recognized, they are treated as samples. In total five habitat classes were defined: (1) plant canopy (terrestrial); (2) neustonic (water surface); (3) planktonic (water column); (4) benthic (bottom dwelling); (5) soil-dweller (dryland). Habitat zones 1-18 represent plant canopy (terrestrial), planktonic (water column), neustonic (water surface), benthic (bottom dwelling), plant canopy (terrestrial), planktonic (water column), neustonic (water surface), benthic (bottom dwelling), plant canopy (terrestrial), planktonic (water column), neustonic (water surface), soil-dweller (dryland), benthic (bottom dwelling), plant canopy (terrestrial), planktonic (water column), neustonic (water surface), benthic (bottom dwelling), and soil-dweller (dryland), respectively (Schoenly and Zhang, 1999).

If invertebrate functional groups are to be classified and recognized, they are treated as the samples to be classified or recognized.
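Before turning to the results, the geometry generated by the topological functions of Sec. 1.1.6 can be inspected directly by plotting the neuron positions they return; a minimal sketch, assuming the M-files above are saved on the Matlab path:

% Visualize neuron positions produced by the custom topological functions
pos = cossintopf(5,5);       % positions for a 5-by-5 grid of neurons
figure; plotsom(pos); title('cossintopf');
pos = expsintopf(5,5);
figure; plotsom(pos); title('expsintopf');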
3. Results

3.1. Supervised classification of habitat zones

3.1.1. Recognition of known habitat zones

The recognition results of the neural network models [i.e., the probabilistic network (spread = 0.1), the generalized regression network (spread = 0.2), and the linear network] and of linear discriminant analysis (with prior probabilities computed from the sizes of the habitat zones) show that the three neural network models yield 100% coincidence with the known habitat zones (Table 1). One habitat zone is incorrectly recognized by linear discriminant analysis; its recognition coincidence is 94.4%.

Table 1. Recognition of invertebrate habitat zones using the probabilistic network, generalized regression network, linear network, and linear discriminant analysis.*

Habitat zone   Practical classification   Probabilistic network   Gen. regr. network   Linear network   Linear discri. anal.
1              1                          1                       1                    1                1
2              3                          3                       3                    3                3
3              2                          2                       2                    2                2
4              4                          4                       4                    4                4
5              1                          1                       1                    1                1
6              3                          3                       3                    3                3
7              2                          2                       2                    2                2
8              4                          4                       4                    4                4
9              1                          1                       1                    1                1
10             3                          3                       3                    3                3
11             2                          2                       2                    2                2
12             5                          5                       5                    5                5
13             4                          4                       4                    4                5
14             1                          1                       1                    1                1
15             3                          3                       3                    3                3
16             2                          2                       2                    2                2
17             4                          4                       4                    4                4
18             5                          5                       5                    5                5

*Classes 1-5: (1) plant canopy (terrestrial); (2) neustonic (water surface); (3) planktonic (water column); (4) benthic (bottom dwelling); (5) soil-dweller (dryland).

3.1.2. Recognition of an unknown habitat zone

An unknown habitat zone, represented by invertebrate species ID 10766 from Lepidoptera in an earlier study, can be recognized using the trained neural networks. All neural network models and linear discriminant analysis recognize this species as class 5, i.e., soil-dweller (dryland).

3.1.3. Influence of topological functions

As illustrated in Fig. 1, different topological functions yield distinctive topological structures of neuron positions.

Figure 1. Neuron positions of different topological functions.

The topological functions in an SOM network are changed to analyze the resulting changes in network output. Some of the Matlab codes above may be revised, for example, as follows:

net=newsom(minmax(P),[5]);
%Set cossintopf as the topological function in the first layer
net.layers{1}.topologyFcn='cossintopf';

Using the SOM with the different topological functions described above, and with the other default functions of the Matlab SOM, a self-organizing unsupervised clustering was conducted on invertebrate orders. The results for the neural networks with the four topological functions and the default functions are as follows (Zhang and Li, 2007):

(1) Data 1
• sincostopf: (Ephemeroptera, Orthoptera, Thysanoptera, Blattodea); the rest of the orders are of the same category;
• cossintopf: (Ephemeroptera, Orthoptera, Thysanoptera, Blattodea); the rest of the orders are of the same category;
• acossintopf: (Ephemeroptera, Orthoptera, Thysanoptera, Blattodea); the rest of the orders are of the same category;
• expsintopf: (Ephemeroptera, Orthoptera, Thysanoptera, Blattodea); the rest of the orders are of the same category;
• System default topological function: (Orthoptera, Thysanoptera, Blattodea); the rest of the orders are of the same category.
(2) Data 2
• sincostopf: (Lepidoptera, Thysanoptera, undetermined order); the rest of the orders are of the same category;
• cossintopf: (Lepidoptera, Thysanoptera, undetermined order); the rest of the orders are of the same category;
• acossintopf: (Lepidoptera, Thysanoptera, undetermined order); the rest of the orders are of the same category;
• expsintopf: (Lepidoptera, Thysanoptera, undetermined order); the rest of the orders are of the same category;
• System default topological function: (Lepidoptera, Thysanoptera, undetermined order); the rest of the orders are of the same category.

(3) Data 3
• sincostopf: (Ephemeroptera, Thysanoptera), (Dermaptera, Strepsiptera); the rest of the orders are of the same category;
• cossintopf: (Ephemeroptera, Dermaptera, Strepsiptera); the rest of the orders are of the same category;
• acossintopf: (Ephemeroptera, Odonata, Dermaptera, Strepsiptera, Thysanoptera, Blattaria); the rest of the orders are of the same category;
• expsintopf: (Ephemeroptera, Dermaptera, Strepsiptera, Thysanoptera, Blattaria); the rest of the orders are of the same category;
• System default topological function: (Ephemeroptera, Dermaptera, Strepsiptera, Thysanoptera, Blattaria); the rest of the orders are of the same category.

(4) Data 4
• sincostopf: (Hymenoptera, Odonata), (Strepsiptera, Neuroptera); the rest of the orders are of the same category;
• cossintopf: (Hymenoptera, Odonata), (Strepsiptera, Neuroptera); the rest of the orders are of the same category;
• acossintopf: (Lepidoptera, Strepsiptera, Neuroptera, Blattodea); the rest of the orders are of the same category;
• expsintopf: (Hymenoptera, undetermined order), (Strepsiptera, Neuroptera, Blattodea); the rest of the orders are of the same category;
• System default topological function: (Hymenoptera, Odonata), (Strepsiptera, Neuroptera, Blattodea); the rest of the orders are of the same category.

It is found that the general trends of the various classifications are similar. Nevertheless, the results obtained with the different topological functions differ somewhat.

3.2. Unsupervised classification of functional groups

3.2.1. Classification using different neural networks

Overall, the SOM and the self-organizing competitive learning neural network produce similar classifications under the same distance (similarity) measure and number of neurons. Between-model differences can also be found, particularly if more neurons are set in the networks. For example, using the Euclidean distance and two neurons, the classification of the functional group terrestrial crawler, walker, jumper or hunter and of the functional group mixed differs between the two models, while the remaining functional groups are classified into the same classes. If eight neurons are used, the classification differs for the following functional groups: external plant feeder; flying adult that is searching, ovipositing, or larvipositing; neustonic (water surface) swimmer (semi-aquatic); and shredder, chewer of coarse particulate matter; most of the other functional groups belong to the same classes for the two models.

3.2.2. Classification using different distance (similarity) measures

Using different distance (similarity) measures yields different classifications.
Generally, distance (similarity) measures of the same category, e.g., the Euclidean and Chebyshev distances, tend to produce similar classifications. Pearson correlation shows a different classification pattern. Using two neurons, the functional groups have the same classifications, except for collector (filterer, suspension feeder), terrestrial web-builder, herbivore, predator, detritivore, and mixed. Nevertheless, a unique classification is obtained with 2 and 8 neurons if Pearson correlation is used in the self-organizing competitive learning neural network.

3.2.3. Classification using different numbers of neurons

The number of neurons in the network represents the maximum number of classes in the classification process. Statistically, the number of classes increases as the number of neurons increases. The classification using two neurons yields more consistent results. Using the Chebyshev distance and 2, 5, and 8 neurons in the SOM produces 2, 4, and 5 classes of functional groups, while for the Euclidean distance these numbers of neurons yield 2, 4, and 4 classes, respectively. With Pearson correlation, however, various results are produced.

3.2.4. General results

The SOM can learn both the topology of the sample space and the distribution of the samples, whereas the self-organizing competitive learning neural network learns only the distribution of the samples. The SOM is therefore superior to the latter. For example, if Pearson correlation is used in the networks, the classification from the SOM is more informative than that from the competitive learning network.

3.2.5. Classification of functional groups

If the between-functional group difference in individual numbers is the focus of consideration, i.e., the Chebyshev distance is used, then the following classifications should be acceptable:

• Two classes:
Class I: external plant feeder; flying adult that is searching, ovipositing, or larvipositing; terrestrial crawler, walker, jumper or hunter; neustonic (water surface) swimmer (semi-aquatic); collector (filterer, suspension feeder); terrestrial web-builder; herbivore, predator, and detritivore; mixed;
Class II: terrestrial blood sucker; terrestrial flyer; planktonic (water column) swimmer and diver; tourist (nonpredatory species with no known functional role other than as prey in the ecosystem); gall former; collector (gatherer, deposit feeder); predator and parasitoid; shredder, chewer of coarse particulate matter; leaf miner; pollen feeder; idiobiont (acarine ectoparasitoid); leaf roller/webber.
The two classes above represent two different invertebrate indicator systems in the rice field.

• Four classes:
Class I: external plant feeder; terrestrial crawler, walker, jumper or hunter; mixed;
Class II: flying adult that is searching, ovipositing, or larvipositing; collector (gatherer, deposit feeder); collector (filterer, suspension feeder); terrestrial web-builder; herbivore, predator, and detritivore; leaf miner;
Class III: planktonic (water column) swimmer and diver; neustonic (water surface) swimmer (semi-aquatic); shredder, chewer of coarse particulate matter;
Class IV: terrestrial blood sucker; terrestrial flyer; tourist (nonpredatory species with no known functional role other than as prey in the ecosystem); gall former; predator and parasitoid; pollen feeder; idiobiont (acarine ectoparasitoid); leaf roller/webber.
• Five classes:
Class I: external plant feeder;
Class II: shredder, chewer of coarse particulate matter;
Class III: mixed;
Class IV: flying adult that is searching, ovipositing, or larvipositing; planktonic (water column) swimmer and diver; collector (gatherer, deposit feeder); collector (filterer, suspension feeder); herbivore, predator, and detritivore; leaf miner;
Class V: terrestrial blood sucker; terrestrial flyer; terrestrial crawler, walker, jumper or hunter; tourist (nonpredatory species with no known functional role other than as prey in the ecosystem); neustonic (water surface) swimmer (semi-aquatic); gall former; predator and parasitoid; terrestrial web-builder; pollen feeder; idiobiont (acarine ectoparasitoid); leaf roller/webber.

The distribution of the connection weights of the SOM for five classes is indicated in Fig. 2.

Figure 2. Distribution of the connection weights of the SOM with 5 neurons.

There is a major trend of consistency between the classification results that yield different numbers of classes, and this is interpretable to some extent. For example, in the classification with two classes, plant pests (external plant feeder) and natural enemies (terrestrial web-builder; flying adult that is searching, ovipositing, or larvipositing; etc.) show a similar magnitude or change in individual numbers. Terrestrial blood sucker, terrestrial flyer, and tourist correlate closely with each other in all of the classifications.

4. Discussion

The probabilistic network, generalized regression network, linear network, and linear discriminant analysis are all capable of recognizing known and unknown habitat zones. The neural network models prove to be better than conventional linear discriminant analysis in pattern recognition. The linear neural network may have better recognition ability than linear discriminant analysis even for linearly separable problems. Previous research also showed that neural network models of demersal fish distribution outperformed linear discriminant analysis and attained better recognition and prediction performance for the distribution of those species (Maravelias et al., 2003).

The generalized regression neural network model outperforms traditional regression-based models such as linear discriminant analysis. This finding supports past studies. In research on temporal prediction of functional attributes of ecosystems at regional scales, a neural network model was much better than the traditional regression model (Paruelo and Tomasel, 1997).

Both the SOM and the self-organizing competitive learning network prove to be effective models for the pattern classification and recognition of sampling information. Overall, the SOM is superior to the self-organizing competitive learning neural network. The settings of the distance (similarity) measures and topological functions will, to a certain extent, affect the network output.

CHAPTER 18

Modeling Spatial Distribution of Arthropods

Spatial distribution means the distribution of animal or plant individuals in a space, particularly on the ground. Many probability distribution functions and aggregation indices have been developed and used to describe the spatial distribution of individuals (Krebs, 1989).
In such methods the number of individuals found in a sample is assumed to be a random variable that follows some probability distribution, e.g., the binomial, Poisson, or negative binomial distribution. Because such functions lack spatial variables, they cannot be used to predict the abundance at a given location. On the other hand, due to the lack of theoretical background, it is also hard to construct a mechanistic model that calculates individual distribution from spatial information. Questions on spatial distribution are therefore data-driven. As with most other ecological problems, the relationship between individual distribution and spatial information is usually nonlinear (Gevrey et al., 2006; Moisen and Frescino, 2002; Zhang, 2007; Zhang and Barrion, 2006; Zhang et al., 2008).

This chapter aims to present several models and evaluate their effectiveness in modeling the spatial distribution of arthropods. A self-designed neural network, a BP network, an LVQ network, a linear network, a response surface model, linear discriminant analysis, a spline function, and a partial differential equation are developed or used to model the field distribution of arthropods. The models are validated and compared for their predictive power. More details can be found in Zhang et al. (2008).

1. Model Description

1.1. Neural networks

1.1.1. Self-designed artificial neural network

The artificial neural network for modeling the spatial distribution of arthropods is a mapping from the input space (with the spatial coordinates of a quadrat as the element) to the output space (with the number of arthropod individuals in the quadrat as the element), U: R2 → R and u(x) = v, where u ∈ U = {u | u: R2 → R}. For an input set xi ∈ R2 and an output set vi ∈ R, there is a mapping f that satisfies f(xi) = vi, i = 1, 2, ..., n. A mapping u ∈ U = {u | u: R2 → R}, represented by this network, should approximate f(x) and satisfy the condition

|u(x) − f(x)| < ε,  x ∈ R2,

where x = (x, y)T and ε > 0 is the known error threshold.

A three-layer neural network is developed for modeling the spatial distribution of arthropods. Both the first and second layers contain 30 neurons, and bias is used for each layer. The transfer functions for layers 1 to 3 are the hyperbolic tangent sigmoid function, the logistic sigmoid function, and the linear transfer function, respectively. Initialization of the network, including the weights and bias of each layer, is performed by a function that initializes each layer i (i = 1, 2, 3) according to its own initialization function (Fecit, 2003; Hagan et al., 1996). The network is trained using the Levenberg-Marquardt algorithm. The performance function is the mean squared error (mse) function. The first and second layers receive inputs from the input space and produce outputs for the third layer. There is a closed loop for the third layer. For each layer, the net input function calculates the layer's net input by combining its weighted inputs and biases. Mathematically, the network output is

f(x) ≈ u(x) = Σ_{k=1}^{3} w_k a_k(·),    (1)

where

a_1(·) = 2/(1 + exp(−2(w_11 x + w_21 y + b_1))) − 1,
a_2(·) = 1/(1 + exp(−(w_12 x + w_22 y + b_2))),
a_3(·) = Σ_{k=1}^{3} w_k3 a_k(·) + b_3.

In Eq. (1), x = (x, y)T is the input and u = u(x) is the output; w_k, k = 1, 2, 3; w_ij, i, j = 1, 2; w_k3, k = 1, 2, 3; and b_i, i = 1, 2, 3, are the parameters.
The artificial neural network is developed using Matlab (Mathworks, 2002). The modeling performance of the neural network is represented by the mse, the Pearson correlation coefficient, and the significance level of the linear regression between the simulated and observed values. The Matlab algorithm for the neural network developed and used in this study is as follows:

clc
clear net;
data=arth;
n=30;    %Number of neurons in layers 1 and 2, respectively
rows=8;
cols=8;
s=0;
for i=1:rows;
  for j=1:cols;
    s=s+1;
    da(s,1)=j;
    da(s,2)=i;
    da(s,3)=data(i,j);
  end;
end;
datax=da(:,1);
datay=da(:,2);
net=network;
net.numInputs=2;
net.numLayers=3;
net.inputs{1}.size=1;
net.inputs{2}.size=1;
net.biasConnect=[1;1;1];
net.inputConnect=[1 1;1 1;0 0];
net.layerConnect=[0 0 0;0 0 0;1 1 1];
net.outputConnect=[0 0 1];
net.targetConnect=[0 0 1];
mi=min(datax);
ma=max(datax);
tt(1,1)=mi;
tt(1,2)=ma;
net.inputs{1}.range=tt;
mi=min(datay);
ma=max(datay);
tt(1,1)=mi;
tt(1,2)=ma;
net.inputs{2}.range=tt;
net.layers{1}.size=n;
net.layers{2}.size=n;
%Transfer functions: logsig,tansig,purelin,radbas,satlins,tribas
net.layers{1}.transferFcn='tansig';
net.layers{2}.transferFcn='logsig';
net.layers{3}.transferFcn='purelin';
net.layers{1}.initFcn='initlay';
net.layers{2}.initFcn='initlay';
net.layers{3}.initFcn='initlay';
net.inputWeights{2,1}.delays=[0 1];
net.inputWeights{1,2}.delays=[0 1];
net.layerWeights{3,3}.delays=1;
net.initFcn='initlay';
net.performFcn='mse';
net.trainFcn='trainlm';
net.trainParam.goal=1e-05;
net.trainParam.epochs=10000;
net=train(net,[da(:,1)';da(:,2)'],da(:,3)');
y=sim(net,[da(:,1)';da(:,2)']);
y
%Print input weights and between-layer weights
net.IW{1,1}   %Input weights
net.IW{2,1}   %Input weights
net.LW{3,1}   %Between-layer weights
net.LW{3,2}   %Between-layer weights
net.LW{3,3}   %Between-layer weights

1.1.2. BP neural network

See Chap. 5, Secs. 1-3.

1.1.3. LVQ neural network

See Chap. 6, Sec. 5.

1.1.4. Linear neural network

See Chap. 3.

Matlab algorithms for the BP, LVQ, and linear neural networks used in the present study are as follows:

%Load sampling data file (SamplingData.*). In this file, columns represent quadrats
%and rows represent the spatial coordinates of the input vectors (P); the last row
%gives the classes these quadrats fall into (lastrow)
%n is the size of the input vector, i.e., the dimension of the input space Rn
%m is the number of classes, i.e., the size of the output vector, or the dimension of
%the output space Rm
n=size(SamplingData,1)-1;
samples=size(SamplingData,2);
lastrow=SamplingData(n+1,:);
P=SamplingData(1:n,:);
C=ind2vec(lastrow);
m=max(lastrow);
%Load the file (RecSamples.*) for the samples to be recognized.
%Same format as the sampling data file
Q=RecSamples;
%Generate a BP neural network (newff) with 30 hidden neurons and m output neurons
net=newff(minmax(P),[30,m],{'tansig','purelin'},'trainlm','learngd','mse');
%net=newff(minmax(P),[10,10,10,m],{'tansig','tansig','tansig','purelin'},'trainlm','learngd','mse');
net.trainParam.epochs=1000;
net.trainParam.goal=0.001;
net=train(net,P,C);
%Produce a vector of the percentages of samples that fall into each class, if the LVQ
%neural network is used
for i=min(lastrow):m;
  percentages(i)=0;
  for j=1:samples;
    if lastrow(j)==i
      percentages(i)=percentages(i)+1;
    end
  end
  percentages(i)=percentages(i)/samples;
end
percentages
%Generate an LVQ neural network (newlvq) with 200 hidden neurons. Learning rate is
%0.01; learning function is learnlv1.
net=newlvq(minmax(P),200,percentages,0.01,'learnlv1');
%Train the network; 1000 epochs are set
net.trainParam.epochs=1000;
net=train(net,P,C);
%Generate a linear neural network (newlind)
net=newlind(P,C);
%Make classification on the trained quadrats
out=sim(net,P);
maxout=max(out);
for i=1:samples;
  for j=1:m;
    if (out(j,i)==maxout(i))
      outputclass(1,i)=i;
      outputclass(2,i)=j;
      break;
    end
  end
end
outputclass
%Make recognition on the quadrats with unknown classification
recog=sim(net,Q);
maxout=max(recog);
for i=1:size(Q,2);
  for j=1:m;
    if (recog(j,i)==maxout(i))
      recognized(1,i)=i;
      recognized(2,i)=j;
      break;
    end
  end
end
recognized

1.2. Conventional models

1.2.1. Linear discriminant analysis

See Chap. 00 Sec. 000.

1.2.2. Response surface model (RSM)

See Chap. 00 Sec. 000.

1.2.3. Spline function

The cubic spline function used in the present study is

u(x) = M_{i+1}(x − x_i)^3/(6l_i) + M_i(x_{i+1} − x)^3/(6l_i) + (f(x_{i+1})/l_i − M_{i+1}l_i/6)(x − x_i) + (f(x_i)/l_i − M_i l_i/6)(x_{i+1} − x),
x ∈ [x_i, x_{i+1}], i = 0, 1, ..., n − 1,    (2)

where x_i = i + 1 and M_i = S''(x_i), i = 0, 1, ..., n; l_i = x_{i+1} − x_i, i = 0, 1, ..., n − 1. The M_i, i = 0, 1, ..., n, are obtained from the three-bending-moment equation (Zhang, 2007b).

1.2.4. Partial differential equation

The following partial differential equation (PDE) is developed to model the spatial distribution of arthropods in a bounded region G with boundary Γ:

∂²u/∂x² + p(x, y)∂²u/∂y² + q(x, y)∂u/∂x + v(x, y)∂u/∂y + w(x, y)u = f(x, y),  (x, y) ∈ G,
u|Γ = φ(x, y),  (x, y) ∈ Γ,    (3)

where u(x, y) is the number of arthropod individuals at (x, y), and p(x, y), q(x, y), v(x, y), w(x, y), and f(x, y) are continuous functions on G ∪ Γ. Let x_i = x_0 + ih, y_j = y_0 + jτ, u_i,j = u(x_i, y_j), p_i,j = p(x_i, y_j), q_i,j = q(x_i, y_j), v_i,j = v(x_i, y_j), w_i,j = w(x_i, y_j), f_i,j = f(x_i, y_j), i = 0, ±1, ±2, ...; j = 0, ±1, ±2, ...; then the differences are derived as follows:

(∂²u/∂x²)_ij ≈ (u_{i+1,j} − 2u_{i,j} + u_{i−1,j})/h²,
(∂²u/∂y²)_ij ≈ (u_{i,j+1} − 2u_{i,j} + u_{i,j−1})/τ²,
(∂u/∂x)_ij ≈ (u_{i+1,j} − u_{i−1,j})/(2h),
(∂u/∂y)_ij ≈ (u_{i,j+1} − u_{i,j−1})/(2τ),

and the difference equation for Eq. (3) is expressed as

(u_{i+1,j} − 2u_{i,j} + u_{i−1,j})/h² + p_ij(u_{i,j+1} − 2u_{i,j} + u_{i,j−1})/τ² + q_ij(u_{i+1,j} − u_{i−1,j})/(2h) + v_ij(u_{i,j+1} − u_{i,j−1})/(2τ) + w_ij u_{i,j} = f_{i,j}.    (4)

Let h = τ = 1.
The functions p(x, y), q(x, y), v(x, y), w(x, y), and f(x, y) are expressed as the following linear functions:

p(x, y) = a_p + b_p x + c_p y,
q(x, y) = a_q + b_q x + c_q y,
v(x, y) = a_v + b_v x + c_v y,
w(x, y) = a_w + b_w x + c_w y,
f(x, y) = a_f + b_f x + c_f y.    (5)

The 15 parameters in Eq. (5) are obtained by fitting the spatial distribution data with a group of linear equations based on Eqs. (4) and (5).

2. Data Description

Arthropods were collected, identified, and counted for every quadrat (1 × 1 m each) of an 8 × 8 grid of quadrats on the grassland. Insects were sorted and identified to order level, and the other arthropods were identified to class level.

For the self-designed neural network, partial differential equation, spline function, and response surface model, the following are noted:

(1) Training data. In the modeling of spatial distribution, in total 64 quadrats (n = 64) are used to train the neural network and the response surface model. The input space is a two-dimensional space [the coordinates of a quadrat, e.g., (1,2), (5,7), etc.], and the output space is a one-dimensional space (arthropod abundance).
(2) Cross validation. One of the cross validation methods is adopted. In this method, each quadrat is separately removed from the input set of 64 quadrats, and the remaining quadrats are used to train the model and to predict the removed quadrat using the trained model. As a consequence, the cross validation can be conducted within the data set of the same study. Comparisons between the predicted and observed arthropod abundances are made, and the Pearson correlation coefficient (r) and statistical significance are calculated to validate the models.
(3) Quadrats are submitted to the neural network in two ways, i.e., in fixed sequences and in randomized sequences of quadrats.

For the BP neural network, LVQ neural network, linear neural network, and linear discriminant analysis, a norm of the input vector is used:

z = Σ_{i=1}^{n} |z_i|,

where z is the total number of insects in a quadrat, z_i is the number of individuals of the ith insect order, and n is the number of insect orders in the quadrat. Four types of classification are designed to represent spatial distribution patterns at different ecological scales (Zhang et al., 2008):

Five classes: I: z ≤ 10; II: 11 ≤ z ≤ 20; III: 21 ≤ z ≤ 30; IV: 31 ≤ z ≤ 40; V: z > 40.
Four classes: I: z ≤ 20; II: 21 ≤ z ≤ 40; III: 41 ≤ z ≤ 60; IV: z > 60.
Three classes: I: z ≤ 30; II: 31 ≤ z ≤ 60; III: z > 60.
Two classes: I: z ≤ 10; II: z > 10.

3. Results

3.1. Modeling population size-based spatial distribution

Most arthropods found on the grassland are insects, which belong to the orders Homoptera (523 individuals), Orthoptera (230 individuals), Hymenoptera (110 individuals), Coleoptera (55 individuals), and Diptera (40 individuals), etc. Other arthropods are sparsely distributed on the grassland.

3.1.1. Modeling spatial distribution with the neural network and response surface model

The artificial neural network developed above [Eq. (1)] is used to simulate the spatial distributions of all arthropods and of the most abundant orders Orthoptera, Hymenoptera, and Homoptera. The neural network is trained for 10,000 epochs and the training goal (mse) is 0.00001. The results reveal that the neural network has excellent simulation performance. The simulated spatial distribution coincides almost perfectly with the observed data (intercept ≈ 0, slope ≈ 1, r ≈ 1, p < 0.0001), as illustrated by Fig. 1.
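A minimal sketch of this assessment, reusing the variables y (simulated counts) and da (quadrat coordinates and observed counts) produced by the code of Sec. 1.1.1:

% Assess simulation quality: regress the simulated counts on the observed counts
obs = da(:,3)';                      % observed counts per quadrat
b = polyfit(obs, y, 1);              % slope b(1) and intercept b(2) of simulated vs. observed
[R, P] = corrcoef(obs, y);           % Pearson r and its significance
slope = b(1); intercept = b(2); r = R(1,2); p = P(1,2);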
Using a deviation function, stmse = mse/ū², where ū is the mean number of individuals per quadrat, together with Fig. 1, it is found that lower abundance leads the neural network to yield better simulation performance [arthropods (stmse = 6.39 × 10⁻³), Homoptera (stmse = 2.16 × 10⁻²), Orthoptera (stmse = 5.88 × 10⁻⁷), Hymenoptera (stmse = 1.46 × 10⁻⁶)]. The response surface model fits the spatial distribution of arthropods well, but its simulation performance is not as good as that of the neural network (Table 1).

Figure 1. Neural network simulation of the spatial distribution of arthropods. Quadrats are submitted to the neural network in fixed sequences.

Table 1. Simulation performance of the response surface model.

Arthropods:   Observed = 6.2978 + 0.5957 × Simulated, r = 0.9496, p < 0.0001, mse = 77.1757
Orthoptera:   Observed = 1.7310 + 0.5183 × Simulated, r = 1, p < 0.0001, mse = 7.2065
Hymenoptera:  Observed = 0.7731 + 0.5502 × Simulated, r = 0.9647, p < 0.0001, mse = 2.7761
Homoptera:    Observed = 3.4850 + 0.5735 × Simulated, r = 0.9555, p < 0.0001, mse = 47.0975

3.1.2. Cross validation of models

Cross validation, in which quadrats are submitted to the neural network in fixed sequences, demonstrates that the neural network performs much better than the response surface model in predicting unknown quadrats (Fig. 2).

Figure 2. Cross validation of the neural network and the response surface model for predicting the spatial distribution of arthropods. Quadrats are submitted to the neural network in fixed sequences.

In most cases, the response surface model produces a negative correlation between the predicted and observed abundances. As a result, using the response surface model to predict the spatial distribution of arthropods is not recommended. The neural network shows better generalization performance in cases of larger abundance than of lower abundance on the grassland (Fig. 2), which means that, compared with simulation, the neural network needs more information to train itself to produce a reasonable prediction.

In an additional cross validation of the neural network for predicting the spatial distribution of arthropods, quadrats are submitted in randomized sequences and five randomizations are used. The results show that the neural network performs better (r = 0.5323, p < 0.0001). More than 60% of the quadrats are correctly predicted, i.e., they fall inside the 95% confidence intervals of the predicted data (Fig. 3).

Figure 3. Cross validation of the neural network for predicting the spatial distribution of arthropods. Quadrats are submitted to the neural network in randomized sequences. Five randomizations are conducted.

The cross validation of the spline function [Eq. (2)] reveals that the spline function performs worse than both the neural network and the response surface model (Table 2).

Table 2. Cross validation of the spline interpolation.

Arthropods:   Observed = 15.6520 − 0.0041 × Simulated, r = −0.0100, p > 0.01 (0.9341), mse = 1164.4
Orthoptera:   Observed = 3.8052 − 0.0545 × Simulated, r = −0.0975, p > 0.01 (0.4436), mse = 37.7524
Hymenoptera:  Observed = 1.7092 + 0.0067 × Simulated, r = 0.0100, p > 0.01 (0.9289), mse = 15.4195
Homoptera:    Observed = 8.3154 − 0.0148 × Simulated, r = −0.0316, p > 0.01 (0.8072), mse = 445.9147
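A comparable spline cross validation can be reproduced without deriving the three-bending-moment equations by using Matlab's built-in cubic spline interpolant as a stand-in for Eq. (2); a minimal leave-one-out sketch along one row of the quadrat grid, assuming the 8 × 8 count matrix arth used in the code of Sec. 1.1.1:

% Leave-one-out prediction along one row of the 8 x 8 grid, with Matlab's built-in
% cubic spline as a stand-in for Eq. (2)
row = arth(3,:);                  % counts in row 3 of the quadrat grid (row choice is arbitrary)
x = 1:numel(row);
pred = zeros(size(row));
for i = 1:numel(row)
    xi = x;  yi = row;
    xi(i) = [];  yi(i) = [];              % remove quadrat i
    pred(i) = spline(xi, yi, x(i));       % predict it from the remaining quadrats
end
[R, P] = corrcoef(row, pred);             % Pearson correlation and significance
r = R(1,2);  p = P(1,2);
mse = mean((row - pred).^2);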
3.1.3. Describing spatial distribution with the PDE

A group of linear equations is developed according to Eqs. (4) and (5). The coefficient matrix of the linear equations is derived from the distribution data of the arthropods. For total arthropods and for the three most abundant orders, Orthoptera, Hymenoptera, and Homoptera, the coefficient matrices (15 × 36) are of full rank (rank = 15). The parameters obtained are listed in Table 3, from which the partial differential equation can be obtained for a specific taxonomic group.

Table 3. Parameters of the partial differential equation [Eqs. (3) to (5)].

       Arthropods   Orthoptera   Hymenoptera   Homoptera
a_p    1.2535       −0.5867      1.2676        1.6496
b_p    −0.0645      −0.0301      −0.0113       −0.2635
c_p    −0.1102      0.1902       −0.2396       −0.0236
a_q    −0.7771      −4.322       −0.8041       −1.6311
b_q    0.4671       0.4873       0.1044        0.216
c_q    −0.5469      0.2556       0.0068        −0.0558
a_v    2.8476       −0.7527      −0.0643       0.9427
b_v    −0.5548      −0.1756      0.0505        −0.3421
c_v    −0.1927      0.3112       −0.1443       0.0872
a_w    3.8523       −0.8764      3.0785        6.1561
b_w    −0.3495      0.0595       0.0512        −0.7088
c_w    0.1113       0.6596       −0.3247       −0.1529
a_f    21.9082      −10.6057     6.0289        32.9485
b_f    5.2626       1.3989       0.7485        −0.9004
c_f    −0.551       2.972        −1.1149       −2.8578

For example, the partial differential equation for the spatial distribution of arthropods is

∂²u/∂x² + (1.254 − 0.065x − 0.110y)∂²u/∂y² + (−0.777 + 0.467x − 0.547y)∂u/∂x + (2.848 − 0.555x − 0.193y)∂u/∂y + (3.852 − 0.349x + 0.111y)u = 21.908 + 5.263x − 0.551y,  (x, y) ∈ G,
u|Γ = φ(x, y),  (x, y) ∈ Γ.

This equation is used to describe the spatial distribution of arthropods, and it fully fits the spatial distribution. Theoretically, this equation can also be used to extrapolate the spatial distribution of arthropods.

3.2. Modeling population scaling-based spatial distribution

3.2.1. Modeling spatial distribution with neural networks and the linear discriminant model

The four fineness levels of the spatial distribution of the total insect population per quadrat are fitted using the BP network, LVQ network, linear network, and linear discriminant model, of which the BP and LVQ neural networks are designed with the following settings:

BP: mse = 0.001; training 1000 epochs
net=newff(minmax(P),[30,m],{'tansig','purelin'},'trainlm','learngd','mse');
LVQ: training 1000 epochs
net=newlvq(minmax(P),200,percentages,0.01,'learnlv1');

It is found that the BP network fits the spatial distribution patterns with zero error (Fig. 4 and Table 4). A BP network with more than two hidden layers, each with 10 neurons, also fits these patterns with zero error. This suggests that the BP network is the best model for fitting the spatial distribution patterns of grassland insects.

Figure 4. Fitting spatial distribution patterns of grassland insects using neural networks and the linear discriminant model.

Table 4. Goodness of fit of spatial distribution patterns using various models. Total observed = sum of the classifications of all quadrats; total differences = sum of the absolute differences between observed and fitted classifications.

              BP fitted                                              LVQ fitted
No. classes   Correctly fitted (%)  Total diff.  Tot. diff./Tot. obs.   Correctly fitted (%)  Total diff.  Tot. diff./Tot. obs.
Five          100                   0            0                      75                    25           0.198
Four          100                   0            0                      86                    10           0.118
Three         100                   0            0                      91                    7            0.097
Two           100                   0            0                      89                    7            0.071

              Linear NN fitted                                       Linear discri. fitted
No. classes   Correctly fitted (%)  Total diff.  Tot. diff./Tot. obs.   Correctly fitted (%)  Total diff.  Tot. diff./Tot. obs.
Five          67                    31           0.246                  64                    30           0.238
Four          89                    9            0.106                  88                    10           0.118
Three         89                    8            0.111                  89                    8            0.111
Two           80                    13           0.131                  80                    13           0.131
17, 2009 16 16:46 9in x 6in B-922 b922-ch18 1st Reading Computational Ecology Simulated patterns of linear network are analogous to linear discriminant model at four ecological scales (Fig. 4 and Table 4). At the finest classification, i.e., five classes, there are three different areas on the grassland. The capability of linear methods to recognize the general trends tends to be weak as the ecological scale increase (from the finer to the coarser, Fig. 4). Performance of LVQ network is between BP network and linear methods. It is concluded that BP network is capable of approximating the details of spatial distribution patterns. On the contrary, the general trend will be readily detected from the fitted or recognized pattern of linear network. 3.2.2. Cross validation of models Each of the 64 quadrats is predicted by the model trained with the remaining 63 quadrats. Combining both training time and recognition performance, BP and LVQ neural networks are created with the following settings: BP: mse==0.001, training 1000 epochs net=newff(minmax(P),[10,10,10,10,10,10,m],{‘tansig’ ‘purelin’},‘trainlm’,‘learngd’, ‘mse’); LVQ: training 500 epochs net=newlvq(minmax(P),50,percentages,0.01, ’learnlv1’). The results reveal that the correctly recognized quadrats increase but the general trend tends to be weak at the expansion of ecological scale (from the finer to the coarser, Fig. 5 and Table 5). BP network tends to yield details but linear methods tend to produce an overall trend. LVQ network is still an intermediate method. Recognition performance is proved to be dependent on not only ecological scale but also classification criteria. 4. Discussion The artificial neural network developed in this study shows excellent performance in the simulation of spatial distribution of arthropods. Both response surface model and spline function are also capable of fitting the Dec. 17, 2009 16:46 9in x 6in B-922 b922-ch18 1st Reading Modeling Spatial Distribution of Arthropods 17 Figure 5. Reconstruction of spatial distribution patterns of grassland insects using neural networks and linear discriminant model. Table 5. Recognition performance of spatial distribution patterns using various models. BP LVQ No. classes Correctly fitted (%) Total differences Tot. diff./ Tot. obs. Correctly fitted (%) Total differences Tot. diff./ Tot. obs. Five Four Three Two 48 78 83 66 54 15 13 22 0.429 0.176 0.181 0.222 55 77 84 70 45 20 12 19 0.357 0.235 0.167 0.192 Linear NN Five Four Three Two 58 84 89 77 39 12 8 15 Linear discri. 0.309 0.141 0.111 0.152 64 83 89 78 34 13 8 14 0.269 0.153 0.111 0.141 Total observed = summation of classifications of all quadrats; total differences = summation of absolute differences between observed and fitted classifications. Dec. 17, 2009 18 16:46 9in x 6in B-922 b922-ch18 1st Reading Computational Ecology spatial distribution of arthropods. However, the simulation performance of response surface model is worse than neural network. Cross validation proves that neural network has much better performance than response surface model and spline function in predicting unknown quadrats. Submitting quadrats in randomized sequences helps to yield the confidence interval of the results in the neural network modeling because a series of stochast outputs are produced. The partial differential equation developed in this study may also be used to extrapolate spatial distribution of arthropods. 
Among BP, LVQ and linear neural networks, BP neural network is the best algorithm to fit spatial distribution patterns of insects. BP network is able to describe the spatial details of distribution patterns. Linear neural network performs better in detecting the general trends of distribution patterns. Performance of LVQ network is always between that of BP network and linear methods. March 22, 2010 15:57 9in x 6in B-922 b922-ch19 1st Reading  CHAPTER 19  Risk Assessment of Species Invasion and Establishment To assess the establishment risk of an invasive species, the characteristics of the invasive species and the ecosystems susceptible to establishment need to be examined (Worner and Gevery, 2006). Successful establishment of a species depends on both biotic and abiotic factors in the new environment, including climate and environmental conditions of the habitat in the area of invasion. Climate is considered to be the most important factor influencing the establishment of insect pests in new locations (Worner, 1988; Peacock et al., 2006). It is necessary to develop quantitative methods that can detect the climatic factors influencing the establishment for individual species from existing climatic and species distribution data (Anderson, 2005; Dentener et al., 2002; Park et al., 2003; Sutherst and Maywald, 2005; Worner, 1994). Various models and tools have been developed and applied to predict the establishment of invasive species in new regions (Baker et al., 2005; Watts and Worner, 2008, 2009; Worner and Gevery, 2006). These include regression techniques (Eyre et al., 2005; Lehmann et al., 2002), multivariate analysis (Peacock et al., 2006), generalized additive models (Guisan et al., 2002), evolving computation (Soltic et al., 2004) and stochastic simulations (Rossi et al., 1993). Because of the nonlinearity of invasion problems, it is hard for those conventional methods to have good performance in the prediction of species invasion and establishment. For this reason, artificial neural networks are being used in the risk assessment of species invasion and establishment. 1 March 22, 2010 15:57 9in x 6in B-922 b922-ch19 1st Reading 2 Computational Ecology This chapter presents some of the recent applications on neural network risk assessment of species invasion and establishment. 1. Invasion Risk Assessment Based on Species Assemblages Worner and Gevrey (2006) used the SOM network to model insect pest species assemblages for species invasion risk assessment. Their study is described briefly here. 1.1. Data source In this study, the data of 844 phytophagous insect pests for 459 geographical areas worldwide are examined. The presence and the absence of each species in each geographical area constitute a data matrix with the size 844 × 459. 1.2. Model description According to Worner and Gevery, the SOM projection of data onto twodimensional space allows the classification of global geographical areas according to the similarity of their pest species assemblages. The SOM network is composed of two layers: the input layer and the output layer (Kohonen, 1995). The output layer is represented by a map or a rectangular grid with l × m neurons. A batch learning algorithm (Kohonen and Somervuo, 1998), significantly faster and no need to specify learning rate, is used in SOM of this study. The input layer has 844 neurons connected to the 459 geographical areas. In total 459 sample vectors are thus achieved. 
The number of neurons in output layer is defined by 5(n)1/2 , where n is the number of training samples. There are connection weights between input neurons (844 neurons) and output neurons. Several SOM networks with different sizes are generated and compared to achieve a final network. Combining both network errors and computer limitation, a SOM network with 108 neurons, gridded as 12 rows and 9 columns, is finally obtained. Each neuron of the output layer in the trained SOM has a vector with element values between 0 and 1, which represents a pest species assemblage. March 22, 2010 15:57 9in x 6in B-922 b922-ch19 1st Reading Risk Assessment of Species Invasion and Establishment 3 The value of element denotes the risk index, i.e., a species’ potential to be present or to be associated with the geographical areas within each neuron. All the geographical areas associated with the same neuron have a similar pest species assemblage composition. A species present in one area but not in another in the same neuron is considered to have a high risk of invasion to that geographical area. 1.3. Results The resultant geographical areas on two-dimensional space of SOM are again clustered using cluster analysis. Neighbors on the grid are combined to a cluster. The results indicate that two species, Mediterranean fruit fly Ceratitis capitata (Wiedemann), and the gypsy moth Lymantria dispar L., both absent in New Zealand, are high-risk pests. C. capitata has a higher risk (risk index = 0.73) and L. dispar has a lower risk of invasion (risk index = 0.31). 2. Determination of Abiotic Factors Influencing Species Invasion The MLP network was used by Watts and Worner (2008) to determine the relative importance of abiotic factors that influence the establishment of invasive insect species. In this method, insect pest species are divided into two groups, those species that are recorded as being present in a region, and those that are not. The non-established species are ordered according to the threat posed by the species. 2.1. Data source In this study the climate data for each of 459 geographical areas worldwide, and the presence and the absence of each species of 844 phytophagous insect pests are used. In total 135 climate variables involving temperature, rainfall, soil moisture, heat (degree-days), and their ranges, etc., are used as the input variables of MLP. The data for each variable are linearly normalized to the range of 0 to 1. March 22, 2010 15:57 9in x 6in B-922 b922-ch19 1st Reading 4 Computational Ecology The data were randomly split into two major sets. The first data set (training and test set) contains 80% of the data. The second is the validation set used to perform an independent evaluation of the prediction performance for each target species. 2.2. Model description The standard MLP, with unmodified backpropagation with momentum learning algorithm (Rumelhart et al., 1986), is used in the study of Watts and Worner. The numbers of hidden neurons, training epochs, and learning rate and momentum are determined by examining training and generalization performance. Three hidden neurons are finally built in MLP. The method used in their study is similar to that suggested in Flexer (1996) and Prechelt (1996). The contributions of each input neuron to the output of the network are calculated by the connection weight method of Olden et al. (2004). In this study the performance of MLP for prediction is evaluated by Cohen’s Kappa statistics (Cohen, 1960). 
In the sensitivity analysis, all input variables but the variable investigated are given their mean values, and the values of the input variable being investigated vary across the range of 0 to 1. 2.3. Results The results reveal that MLP network is able to learn the relationships between climate variables and species presence–absence. It is able to predict the establishment of insect pest species from climate variables. The trainin g performance for the non-established species is in general lower than that for established species. Spring and summer rainfalls, and autumn temperatures are significant positive factors influencing the establishment of those species investigated. March 22, 2010 16:57 9in x 6in B-922 b922-ch20 1st Reading  CHAPTER 20  Prediction of Surface Ozone Ozone (O3 ) is a trace reactive oxidant that absorbs ultraviolet radiation and stabilizes atmospheric temperature (Lars, 2007; Yazdanpanah et al., 2008). Ozone distributes over the stratosphere (20–30 km above the earth’s surface) and the troposphere (0–15 km above the earth’s surface). Ozone in the stratosphere is naturally generated and protects life from injury. Nevertheless, ozone of the troposphere originates in both human’s activities and natural changes, and is harmful to life (Agirre-Basurko et al., 2006; Yazdanpanah et al., 2008). Overall, global ozone concentration has been declining at the mid-latitudes of the Northern Hemisphere during the last two decades (Aires et al., 2002). The thickness of the stratospheric ozone layer has decreased by about 0.5% per year for all latitudes since 1974 as a result of ozone breakdown by chlorine released from emitted chlorofluorocarbons (Rozema et al., 2005; Yazdanpanah et al., 2008). However, ozone concentration always changes with space and time. The prediction of daily total ozone therefore is an important issue (Cardelino et al., 2001; Simpson and Layton, 1983). Surface ozone is largely dependent upon a number of highly nonlinear meteorological factors (Agirre-Basurko et al., 2006; Zanetti, 1990). It has been predicted by various researchers using artificial neural networks (Agirre-Basurko et al., 2006; Ballester et al., 2002; Pastor-Barcenas et al., 2005; Gardner and Dorling, 1998; Hornik et al., 1989; Yazdanpanah et al., 2008). This chapter presents some of the successful case studies. 1 March 22, 2010 16:57 9in x 6in B-922 b922-ch20 1st Reading 2 Computational Ecology 1. BP Prediction of Daily Total Ozone A BP network has been used successfully to predict daily total ozone. Further details can be found in Yazdanpanah et al. (2008). 1.1. Model description In this study, a BP network, the standard backpropagation training algorithm with the extended Delta Bar Delta (DBD) learning rule, is used. There is only a hidden layer in the BP network. The transfer functions of the BP network are sigmoid hyperbolic tangent and sine functions. The sine transfer function takes the trigonometric sine of the input modified by a gain. Extended DBD is an extension of DBD which calculates a momentum term for each connection. In the Delta rule, the error in the output layer is computed as the difference between the desired output and the actual output. The error is transformed by the derivative of the transfer function, and is propagated to prior layers where it is accumulated (Paschalidou et al., 2007). The best BP architecture is determined by the selection procedure for forecasting ozone with meteorological variables (Vahidinasab, 2008; Yazdanpanah, 2002). 1.2. 
Data description Daily geopotential height, air temperature, dew point temperature, wind direction and speed, and previous day’s total ozone are selected as input variables (6 input variables and 1 output variable, i.e., today’s total ozone) of BP network. Data of these variables were collected from an ozonometric station where radio-sounding measurement was done during 1997–2004. 1.3. Results The results reveal that the BP network with 11 neurons in hidden layer is the best. Training the network (transfer function: sigmoid) 50 000 epochs, the corrected R2 and MSE for simulation are 0.8145 and 0.0380, respectively, and that for prediction are 0.8131 and 0.0399, respectively. This study shows that different learning rules and transfer functions result in different model performance. March 22, 2010 16:57 9in x 6in B-922 b922-ch20 1st Reading Prediction of Surface Ozone 3 In the sensitivity analysis, sequentially one of the input variables is removed and the network with 5 input variables is trained again. Yazdanpanah et al. use a weight index, wi , to evaluate the importance of ith input variable:   Fi = (MSEi )1/2 − (MSEt )1/2 /(MSEt )1/2 ,  wi = Fi / Fj , where MSEi is MSE with variables in exception of variable i, and MSEt is MSE with all input variables. It is demonstrated that the input variables with importance value from large to small are geopotential height, total ozone of the previous day, air temperature, wind speed, wind direction, and dew point temperature. Yazdanpanah et al. suggest that the predictive power of neural network is excellent when the input variables consist of temperatures (dry and dew point) and geopotential heights at standard levels of 100, 50, 30, 20 and 10 hPa with their wind speed and direction, together with previous day’s total ozone. 2. MLP Prediction of Hourly Ozone Levels MLP was used in predicting hourly ozone levels. Further details can be found in Agirre-Basurko et al. (2006). 2.1. Model description Both multilayer perceptron (MLP) and multiple linear regression are used in the study of Agirre-Basurko et al. (2006). The MLP used in their study has an input layer (N neurons), a hidden layer (S neurons) and an output layer (1 neuron). The transfer functions are the hyperbolic tangent (tansig) for the hidden layer and the linear function for the output layer. The scaled conjugated gradient algorithm is used to train MLP network (Moller, 1993). A stopping rule is used to avoid overfitting (Sarle, 1995). With the difference in input variables, MLP is used as MLP1 and MLP2. In total 5 variables are determined to be input variables of MLP1. There are 9 input variables for MLP2. The number of hidden neurons of MLP is calculated using a simple rule of Amari et al. (1997). March 22, 2010 16:57 9in x 6in B-922 b922-ch20 1st Reading 4 Computational Ecology 2.2. Data description The data used in the work of Agirre-Basurko et al. are hourly current (at time t) data and historical (at time t-z, z = 1, . . . , 15) data from the air pollution network and the traffic network of Bilbao during the years 1993 to 1994. Data from 1993 are used to build the models and data from 1994 are used to test the models. Potential input variables include wind speed, wind direction, temperature, relative humidity, atmospheric pressure, solar radiation, thermal gradient, number of vehicles in a unit time (NV), occupation percentage (OP: the fraction of time for which the area of road is occupied by a vehicle), velocity (NV/OP), ozone level, etc. 
In multiple linear regression, NO2 level is also used as an independent variable. Correction coefficient between the observed and predicted (R), normalized MSE (NMSE), the factor of two (FA2), fractional bias (FB), and fractional variance (FV), are used to represent model performance. 2.3. Results The results indicate that the best forecast has NMSE, FV and FB equal to zero, and R and FA2 are approximately 1. The ozone forecasts O3 (t + i), i = 1, 2, . . . , 8, show better coincidence with the observed ones. In this study the ozone forecasts of MLP networks are proved to be better than multiple linear regression. March 20, 2010 14:48 9in x 6in B-922 b922-ch21 1st Reading  CHAPTER 21  Modeling Dispersion and Distribution of Oxide and Nitrate Pollutants Oxide and nitrate pollutants are important sources that lead to environmental pollution. The levels of oxide and nitrate pollutants vary with space and time. Mechanistic and statistical models have been developed to forecast pollutant levels. Mechanistic models are based on explicit mathematical relationships that describe the processes involved in the formation of pollutants (Zanetti, 1990; Agirre-Basurko et al., 2006). A widely used mechanistic model is UAM (Urban Airshed Model; Scheffe and Morris, 1993). However, mechanistic models are overall more suitable over large areas and require detailed data on the emission and transportation of pollutants and meteorological conditions. Statistical models are empirical models. They are used to establish the input–output relationships without understanding the intrinsic mechanisms of the formation of pollutants (Agirre-Basurko et al., 2006). The examples of statistical models used in pollutant prediction include time series analysis (Simpson and Layton, 1983; Hsu, 1992) and multiple linear regression (Cassmassi, 1998; Cardelino et al., 2001). Oxide and nitrate levels are dependent on human activities and meteorological conditions and their relationships are highly nonlinear (Gardner and Dorling, 1999, 2000). Neural networks are therefore used to model these relationships (Gardner and Dorling, 1998, 1999, 2000; Elkamel et al., 2001). For example, they have been developed to model nitrogen dioxide dispersion (Nagendra and Khare, 2006), to assess nitrate contamination of 1 March 20, 2010 14:48 9in x 6in B-922 b922-ch21 1st Reading 2 Computational Ecology rural private wells (Ray and Klindworth, 2000), and simulate nitrate leaching to ground water (Kaluli et al., 1998). This chapter presents some of the recent case studies. 1. Modeling Nitrogen Dioxide Dispersion Nagendra and Khare (2006) used a feedforward three-layer neural network to model nitrogen dioxide dispersion and confirmed the effectiveness of the neural network in the pollutant prediction. 1.1. Model description In the neural network of Nagendra and Khare, the number of neurons of input layer is the number of input variables, i.e., meteorological and traffic variables. The output layer contains one neuron, i.e., the 24 h average NO2 concentration. There is one hidden layer. The number of hidden neurons is determined by training neural network and comparing the training errors. Transfer functions of the hidden neurons are hyperbolic tangent functions. The input and output neurons use identity function as target values (Gardner and Dorling, 2000). The neural network is trained by a supervised back-propagation learning algorithm (Haykin, 2001). 
Computed by the gradient descent algorithm, the back-propagation training algorithm yields an “approximation” of the trajectory in weight space (Battiti, 1992). All weights and bias are initialized as the uniformly distributed random values in the range [−2.4/Fi , 2.4/Fi ], where Fi is the total number of inputs. A smaller range will reduce the probability of the saturation of the neurons in the network and thus avoid the occurrence of the error gradients (Wasserman, 1989). The performance of neural network is evaluated by using RMSE (Root Mean Square Error), MBE (mean bias error), and R2 (determinant coefficient), etc. The “d” index, a measure of the degree to which the predictions are error free, is also used in this study (Willmott, 1982):   d = 1 − (pi − oi )2 / (|pi − ō| + |oi − ō|)2 , where pi : predicted value, oi : observed value, i = 1, 2, . . ., n, and ō: average of the observed values. For the neural modeling of NO2 with both meteorological and traffic variables (ANNNO2A ), 11 meteorological (cloud cover, humidity, mixing height, pressure, Pasquill stability, sun shine hour, temperature, visibility, March 20, 2010 14:48 9in x 6in B-922 b922-ch21 1st Reading Modeling Dispersion and Distribution of Oxide and Nitrate Pollutants 3 sine and cosine wind direction, wind speed) and 6 traffic variables [two wheeler, three wheeler, four wheeler (gasoline), four wheeler (diesel), CO and NO2 source strength] are used as the input variables. There are 17 input neurons, 5 hidden neurons, and 1 output neuron. For the neural modeling with only meteorological variables (ANNNO2B ), 10 meteorological variables are used as the input variables, and for the neural modeling with only traffic variables (ANNNO2C ), 5 traffic variables are used as the input variables. 1.2. Data description The data of 24 h average NO2 concentrations, meteorological and traffic variables are the observed values for a period of three years from 1997 to 1999. Two-year data from 1997 to 1998 are used for the model training and one-year data in 1999 are used for model test and evaluation. 1.3. Results The results of this study indicate that ANNNO2A performs better than ANNNO2B . ANNNO2C performs pooly in the evaluation of the model. The model performance is improved as the time-averaging interval increases, which reveals an increasing trend of nonlinearity to linearity. 2. Simulation of Nitrate Distribution in Ground Water Besides conventional neural networks, modular neural networks have been used also in the pollutant prediction. For example, they were used to predict two-dimensional nitrate distribution in ground water by Almasri and Kaluarachchi (2005). 2.1. Model description Modular neural network (MNN) is composed of multiple expert networks (modules) competing to learn different aspects of a problem and a gating network (control module) (Haykin, 1994). The gating network assigns different features of the input space to the different expert networks (Neural March 20, 2010 14:48 9in x 6in B-922 b922-ch21 1st Reading 4 Computational Ecology Ware, 2000). Each expert network yields an output corresponding to the input vector and the output of MNN is the weighted sum of these outputs with the weights equal to the output of the gating network. MNN is trained by backpropagation algorithm (Rumelhart et al., 1986; Maier and Dandy, 1998). In the study of Almasri and Kaluarachchi (2005), There is only one hidden layer in the MNN. There are 13 input variables, 14 hidden neurons, and 1 output variable. 
MNN is developed using the Neural Works Professional II/Plus (Neural Ware, 2000). In this software MNN has parameters such as learning count (194850), learning rate for the output (0.15) and hidden layer (0.3), momentum (0.4), seed (257), transfer function (tanh), epoch (16) , scaling intervals (Input: [−1, 1]; output: [−0.8, 0.8]), and learning rule (the extended delta-bar-delta), etc. Model performance is evaluated using RMSE, correlation coefficient, etc. 2.2. Data description The study area is divided into 100 × 100 m2 of cells. The distributions of on-ground nitrogen loadings and ground water recharge are estimated based on practical data. The nitrate concentration data are obtained from various agencies. A total of 665 input–output patterns are divided into training and testing data sets. 2.3. Results According toAlmasri and Kaluarachchi, MNN is superior when considering the upgradient contributing areas of nitrate receptors in formulating the input–output response patterns. However, it is to some extent inferior due to the lack of training patterns. It performs poorly for the extreme input– output patterns. It is found that using the total number of input and output neurons as the number of the hidden neurons yields the optimal MNN performance and the training time is also shorter. A comparison between MNN and artificial neural network (ANN; setting of neurons: 13-14-1) indicates that MNN is superior to ANN. March 22, 2010 16:57 9in x 6in B-922 b922-ch22 1st Reading  CHAPTER 22  Modeling Terrestrial Biomasss Terrestrial earth harbors huge amount of biomass and supports numerous organisms. On a temperate grassland, plants have the largest biomass (20 000 kg/ha), followed by arthropods (1 000 kg/ha), microorganisms (7 000 kg/ha), mammals (1.2 kg/ha), birds (0.3 kg/ha), and nematodes (120 kg/ha) (Pimental et al., 1992; Chen and Ma, 2001). The biomass estimation for various terrestrial landscapes is a major challenge (Schino et al., 2003). Some cases of biomass estimation based on neural networks are presented in this chapter. 1. Estimation of Aboveground Grassland Biomass Grasslands and savannas cover nearly 40% of the terrestrial earth (Chapin et al., 2001). Grassland biomass is determined by plants composition, soil condition and topography, and meterological conditions. Three types of methods can be used to estimate grassland biomass (Moreau et al., 2003; Xie et al., 2009): (1) the empirical relationships of spectral vegetation indices, (2) Monteith’s efficiency model, and (3) the canopy process-based models (van der Werf et al., 2007). In recent years the artificial neural networks are suggested to be used in the biomass estimation. This section describes the MLP approach of biomass estimation of grassland. Details can found in Xie et al. (2009). 1.1. Data description In total 568 sample sites were selected and the geographical coordinate of each site was recorded. Five quadrats (1 × 1 m2 ) of each site 1 March 22, 2010 16:57 9in x 6in B-922 b922-ch22 1st Reading 2 Computational Ecology were harvested and the aboveground biomass recorded. Different spectral bands of images, i.e., Band 1, Band 2, Band 3 (spectral reflection of read band), Band 4 (spectral reflection of near infrared band), and NDVI [NDVI = (Band 4 − Band 3)/(Band 4 + band 3)] and aspect were recorded and normalized. 1.2. Model description The MLP model is developed using Statistica 6.0 neural network module. MLP has one input layer, one hidden layer, and one output layer. 
The output variable of MLP is the aboveground biomass and 7 statistically important input variables are Band 1, Band 3, Band 4, Band 5, Band 7, NDVI, and aspect. The input layer uses a linear function. The hidden layer is set with 5 to 10 neurons respectively and the resultant networks are tested for their performance. Finally 7 hidden neurons are determined based on the balance between model complexity and performance. The transfer functions of the hidden neurons are sigmoid transfer functions. At the beginning, a backpropagation algorithm is used to train the network and a conjugate gradient descent algorithm is then used to converge and optimize MLP. The learning rate is 0.75 and the momentum is 0.45. Model performance is evaluated using RMSE and RMASEŴ (relative RMSE). 1.3. Results According to Xie et al., Band 7 is the most sensitive variable to predict biomass, followed by Band 1, Band 3, Band 5, Band 4, NDVI, and aspect. Overall, both MLP and multiple linear regression have the relative RMSE. However, the simulation and prediction performance of MLP is superior to that of multiple linear regression in terms of RMSE and MLPŴ . The performance of MLP and multiple linear regression is the best for the training set, medium for the entire dataset, and worst for the testing set. 2. Estimation of Trout Biomass In their study, Lek and Baran (1997) analyzed relationships between macrohabitat variables and the biomass of brown trouts, Sulmo trutta L., in Pyrenees mountain streams. March 22, 2010 16:57 9in x 6in B-922 b922-ch22 1st Reading Modeling Terrestrial Biomasss 3 2.1. Data description In total 232 units are sampled. Eight habitat variables (mean Froude number, mean depth, mean bottom velocity, mean surface velocity, surface of shelter, surface of total cover, surface of deep water, and elevation) and two biomass variables (biomass of total trouts and cacheable trouts) are recorded for these units. 2.2. Model description Lek and Baran use the BP network in their study. The BP network contains three layers, i.e., the input layer (8 neurons), the hidden layer (10 neurons), and the output layer (1 neuron). The backpropagation learning rule is used to train the network (Rumelhart et al., 1986; Smith, 1994). The determinant coefficient between the observed and estimated is used to evaluate the network model. 2.3. Results With the determinant coefficients between 0.85 and 0.9, the neural network performs well in the training process. The network model is also superior in the prediction of trout biomass. The determinant coefficients are between 0.74 to 0.88 for the biomass prediction of total and cacheable trouts. March 23, 2010 19:50 9in x 6in B-922  b922-ref 1st Reading  References Abdel-Aal RE. Hourly temperature forecasting using abductive networks. Engineering Applications of Artificial Intelligence, 17: 543–556, 2004. Abrahart RJ, White SM. Modelling sediment transfer in Malawi: Comparing backpropagation neural network solutions against a multiple linear regression benchmark using small data set. Physics and Chemistry of the Earth (B), 26(1): 19–24, 2001. Acharya C, Mohanty S, Sukla LB et al. Prediction of sulphur removal with Acidithiobacillus sp. using artificial neural networks. Ecological Modelling, 190: 223–230, 2006. Agirre-Basurko E, Ibarra-Berastegi G, Madariaga I. Regression and multilayer perceptronbased models to forecast hourly O3 and NO2 levels in the Bilbao area. Environmental Modelling & Software, 21(4): 430–446, 2006. Aires F et al. 
A regularized neural net approach for retrieval of atmospheric and surface temperature with the IASI instrument. Journal of Applied Meteorology, 41: 144–159, 2002. Albus JS. A theory of cerebellar function. Mathematical Biosciences, 10: 25–61, 1971. Almasri MN, Kaluarachchi JJ. Modular neural networks to predict the nitrate distribution in ground water using the on-ground nitrogen loading and recharge data. Environmental Modelling and Software, 20: 851–871, 2005. Altieri MA. Biodiversity and Pest Management in Agroecosystems. Haworth Press, New York, USA, 1994. Altieri MA. Agroecology: The Science of Sustainable Agriculture. Westview Press, Boulder, Colorado, USA, 1995. Altieri MA, Letourneau DK. Vegetation diversity and insect pest outbreaks. CRC Critical Review in Plant Science, 2: 131–169, 1984. Amari SI. Field theory of self-organizing neural nets. IEEE Transactions on Systems, Man and Cybernetics, 13: 741–748, 1983. Amari SI. Mathematical foundation of Neurocomputing. Proceedings of the IEEE on Neural Networks, 78: 1443–1463, 1990. Amari S. Differential Geometrical Methods in Statistics. Springer Lecture Notes in Statistics Vol. 28. Springer-Verlag, Berlin, Germany, 1985. 1 March 23, 2010 19:50 9in x 6in B-922 b922-ref 1st Reading 2 Computational Ecology Amari S. Differential geometry in statistical inference. Proc. ISI, 46th Session of the ISI, 52(2): 321–338, 1987. Amari S. Differential geometry of a parametric family of invertible linear systemsRimannian metric, dual affine connections and divergence. Mathematical Systems Theory, 20: 53–82, 1987. Amari S. On mathematical methods in the theory of neural networks. Proc. 1st IEEE ICNN, Vol. III, pp. IU 3-Ell 10, 1987. Amari S. Information geometry of EM and EM algorithm for neural networks. Neural Networks, 8(9): 1379–1408. Amari S. The natural gradient learning algorithm for neural networks. Theoretical Aspects of Neural Computation — A Multidisciplinary Perspective, Wong KYM, King I, Yeung DY (Eds.). Hong Kong International Workshop. Springer, 1998. Amari S, Murata N, Muller K, Finke M,Yang HH. Asymptotic statistical theory of overtraining and cross-validation. IEEE Transactions on Neural Networks, 85: 985–996, 1997. Anderson JA. A simple neural network generating an interactive memory. Mathematical Biosciences, 14: 197–220, 1972. Anderson JA, Rosenfeld E. Neurocomputing: Foundations of Research. MIT Press, Cambridge, USA, 1989. Andersen MC. Potential applications of population viability analysis to risk assessment for invasive species. Human and Ecological Risk Assessment, 11: 1083–1095, 2005. Andow DA. Vegetational diversity and arthropod population response. Annual Review of Entomology, 36: 561–586, 1991. Anthony M, Biggs N. Computational Learning Theory. Cambridge University Press, Boston, USA, 1992. Baker R, Cannon R, Bartlett P, Barker I. Novel strategies for assessing and managing the risks posed by invasive alien species to global crop production and biodiversity. Annals of Applied Biology, 146: 177–191, 2005. Ballester EB, Valls GCI, Carrasco-Rodriguez JL et al. Effective 1-day ahead prediction of hourly surface ozone concentrations in eastern Spain using linear models and neural networks. Ecological Modelling, 156: 27–41, 2002. Battiti R. First and second order methods for learning between steepest descent and Newton’s method. Neural Computation, 4: 141–166, 1992. Berry T, Linoff J. Data Mining Techniques. JohnWiley and Sons, New York, USA, 1997. Bian ZQ, Zhang XG. Pattern Recognition (2nd Edition). 
Tsinghua University Press, Beijing, China, 2000. Bork EW, Hudson RJ, Bailey AW. Upland plant community classification in Elk Island National Park, Alberta, Canada, using disturbance history and physical site factors. Plant Ecology, 130: 171–190, 1997. Bradshaw CJA, Davis LS, Purvis M et al. Using artificial neural networks to model the suitability of coastline for breeding by New Zealand fur seals (Arctocephalus forsteri). Ecological Modelling, 148: 111–131, 2002. March 23, 2010 19:50 9in x 6in B-922 b922-ref 1st Reading References 3 Breiman L. Statistical modeling: The two cultures (with discussion). Statistical Science, 16: 199–231, 2001. Broomhead DS, Lowe D. Multivariable functional interpolation and adaptive networks. Complex Systems, 2: 321–355, 1988. Brown KS Jr. Conservation of neotropical insects: Insects as indicators. The Conservation of Insects and Their Habitats, Collins NM, Thomas JA (Eds.). Academic Press, London, England, 1991. Bunge J, Fitzpatrick M. Estimating the number of species: A review. Journal of American Statistician Association, 88: 364–373, 1993. Burden R, Faires JD. Numerical Analysis (7th Edition). Thomson Learning, Inc., New York, USA, 2001. Burnham KP, Overton WS. Estimation of the size of a closed population when capture probabilities vary among animals. Biometrika, 65: 623–633, 1978. Burnham KP, Overton WS. Robust estimation of population when capture probabilities vary among animals. Ecology, 60: 927–936, 1979. Cardelino C, Chang M, St. John J et al. Ozone predictions in Atlanta, Georgia: Analysis of the 1999 ozone season. Journal of the Air and Waste Management Association, 51: 1227–1236, 2001. Carpenter GA, Grossberg S.A massive parallel architecture for a self-organing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37: 54–115, 1987. Carpenter GA, Grossberg S. ART2: Self-organzation of stable category recognition codes for analog input patterns. Applied Optics, 26(23): 4019–4930, 1987. Carpenter GA, Grossberg S. ART3: Hierachical search using chemical transmitters in selforganizing pattern recognition architectures. Neural Networks, 3(23): 129–152, 1990. Carpenter GA, Grossberg S, Reynolds J. ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, 4(5): 169–181, 1991. Cassmassi JC. Objective ozone forecasting in the South Coast Air Basin: Updating the objective prediction models for the late 1990s and southern California ozone study (SCOS97NARSTO) application. Proceedings of the 12th Conference on Numerical Weather Prediction 54–58, American Meteorology Society, Boston, MA, USA, 1998. Castellano G, Fanelli AM, Pelillo M. An iterative pruning algorithm for feedforward neural networks. IEEE Transactions on Neural Networks, 8(3): 519–531, 1997. Cereghino R, Giraudel JL, Compin A. Spatial analysis of stream invertebrates distribution in the Adour-Garonne drainage basin (France), using Kohonen self organizing maps. Ecological Modelling, 146: 167–180, 2001. Chao A. Non-parametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 11: 265–270, 1984. ChaoA, Lee SM. Estimating the number of classes via sample coverage. Journal of American Statistician Association, 87: 210–217, 1992. March 23, 2010 19:50 9in x 6in B-922 b922-ref 1st Reading 4 Computational Ecology Chapin FS. Effects of plant traits on ecosystem and regional processes: A conceptual framework for predicting the consequences of global change. 
Annals Botany, 91: 1–9, 2003. Chapin FS, Sala OE, Huber-Sannwald E. Global Biodiversity in a Changing Environment: Scenarios for the 21st Century. Springer-Verlag, New York, USA, 2001. Charalambous C. Conjugate gradient algorithm for efficient training of artificial neural networks. IEEE Proceedings G, 139(3): 301–310, 1992. Chen JX. Course Notes on Algebraic Topology. Higher Education Press, Beijing, China, 1987. Chen LZ, Ma KP. Biodiversity Science: Principles and Practices. Shanghai Science and Technology Press, Shanghai, China, 2001. Chen XS, Chen WH. Course Notes on Differential Geometry. Beijing University Press, Beijing, China, 1980. Chen XR et al. Modern Regression Analysis. Anhui Education Press, Hefei, China, 1987. Chen K, Xu L, Chi H. Improved learning algorithms for mixture of experts in multiclass classification. Neural Networks, 12(9): 1229–1252, 1999. Cherkassky V, Lari-Najafi H. Constrained topological mapping for nonparametric regression analysis. Neural Networks, 4: 27–40, 1991. Churing Y. Backpropagation, Theory, Architecture and Applications. Lawrence Erbaum Publishers, New York, USA, 1995. Cohen ACJ. Simplified estimators for the normal distribution when samples are singly censored sample. Technometrics, 1: 217–237, 1959. Cohen ACJ. Tables for maximum likelihood estimates: Singly truncated and singly censored sample. Technometrics, 3: 535–541, 1961. Cohen JE. Food webs and niche space. Monographs in Population Biology 11, Princeton University Press, Princeton, USA, 1978. Cohen JE et al. Improving food webs. Ecology, 74: 252–258, 1993. Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 18(20): 3746, 1960. Coleman BD, Mares MA, Willig MR, HsiehYH. Randomness, area, and species richness. Ecology, 63: 1121–1133, 1982. Cowell RK. Human aspects of biodiversity: An evolutionary perspective. Biological Diversity and Global Change, Solbrig OT, van Emden HM, van Oordt PGWJ (Eds.). International Union of Biological Sciences, Monograph No. 8. IUBS Press, Paris, France, 1992. Colwell RK, Coddington JA. Estimating terrestrial biodiversity through extrapolation. Philosophical Transactions of the Royal Society London B, 345: 101–108, 1994. Dempster AP, Laird NM,Rubin DB. Maximum likelihood from imcomplete data via EM algorithm. Journal of the Royal Statistical Society (Series B), 39: 1–38, 1977. Dentener PR, Whiting DC, Connolly PG. Thrips palmi Karny (Thysanoptera: Thripidae): Could it survive in New Zealand? New Zealand Plant Protection, 55: 18–22, 2002. Department of Mathematics of Nanjing University. Ordinary Differential Equations. Science Press, Beijing, China, 1978. March 23, 2010 19:50 9in x 6in B-922 b922-ref 1st Reading References 5 Dimopoulos I, Chronopoulos J, Chronopoulou-Sereli A, Lek S. Neural network models to study relationships between lead concentration in grasses and permanent urban descriptors in Athens city (Greece). Ecological Modelling, 120: 157–165, 1999. Dong BL, Ji LZ, Wei CY et al. Relationship between plant community and insect community in Korean pine broad-leaved mixed forest of Changbai Mountain. Chinese Journal of Ecology, 24(9): 1013–1016, 2005. Dony JG. The expectation of plant records from prescribed areas. Watsonia, 5: 377–385, 1963. Efron B, Gong G. A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician, 37: 36–48, 1983. Eyre MD, Rushton SP, Luff ML, Telfer MG. 
Investigating the relationships between the distribution of British ground beetle species (Coleoptera, Carabidae) and temperature, precipitation and altitude. Journal of Biogeography, 32: 973–983, 2005. Fecit. Analysis and Design of Neural Networks in MATLAB 6.5. Electronics Industry Press, Beijing, China, 2003. Filippi AM, Jensen JR. Fuzzy learning vector quantization for hyperspectral coastal vegetation classification. Remote Sensing of Environment, 100: 512–530, 2006. Flexer A. Statistical evaluation of neural network experiments: Minimum requirements and current practice. Cybernetics and Systems ’96 Proceedings of the 13th European Meeting on Cybernetics and Systems Research, Trappl R (Ed.). 1005–1008, Austrian Society for Cybernetic Studies, 1996. Frean M. The upstart algorithm: A method for constructing and training feed forward neural networks. Neural Computation, 2(2): 198–209, 1990. Gao XR, Yang FS. A new method for training MLP neural networks. Chinese Journal of Computers, 19(6): 687–694, 1996. Gardner MW, Dorling SR.Artificial neural networks (the multiplayer perceptron) — a review of applications in the atmospheric sciences. Atmospheric Environment, 32: 2627–2636, 1998. Gardner MW, Dorling SR. Neural network modelling and prediction of hourly NOx and NO2 concentrations in urban air in London. Atmospheric Environment, 33(5): 709–719, 1999. Gardner MW, Dorling SR. Statistical surface ozone models: An improved methodology to account for non-linear behaviour. Atmospheric Environment, 34(1): 21–34, 2000. Gentle JE. Elements of Computational Statistics. Springer Science+Business Media, Inc., Netherlands, 2002. Gevrey M, Dimopoulos I, Lek S. Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160: 249–264, 2003. Gevrey M, Dimopoulos I, Lek S. Two-way interaction of input variables in the sensitivity analysis of neural network models. Ecological Modelling, 195: 43–50, 2006. Gotelli NJ, Graves GR. Null Models in Ecology. Smithsonian Institution Press, Washington DC, USA, 1996. March 23, 2010 19:50 9in x 6in B-922 b922-ref 1st Reading 6 Computational Ecology Grime JP. Benefits of plant diversity to ecosystems: Immediate, filter and founder effects. Journal of Ecology, 86: 902–910, 1998. Grossberg S. Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biological Cybernetics, 23: 121–134, 1976. Grossberg S. How does the brain build a cognitive code? Psychological Review, 88: 375–407, 1980. Gu HZ, Takahashi H. Towards more practical average bounds on supervised learning. IEEE Transactions on Neural Networks, 7(44): 953–968, 1996. Guisan A, Edwards TC, Hastie T. Generalized linear and generalized additive models in studies of species distributions: Setting the scene. Ecological Modelling, 157: 89–100, 2002. Hagan MT, Demuth HB, Beale MH. Neural Network Design. PWS Publishing Company, Boston, USA, 1996. Hagan MT, Menhaj M. Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5(6): 989–993, 1994. Haykin S. Neural Networks: A Comprehensive Foundation. Macmillan College Publishing Company, New York, USA, 1994. He RB. MATLAB 6: Engineering Computation and Applications. Chongqing University Press, Chongqing, 2001. Hebb DO. The Organization of Behavior. Wiley, New York, USA, 1949. Hellmann JJ, Fowler GW. Bias, precision, and accuracy of foue measures of species richness. 
Ecological Applications, 3: 824–834, 1999. Heltshe JF, Forester NE. Estimating species richness using the jackknife procedure. Biometrics, 39: 1–11, 1983. Herrick JE, Bestelmeyer BT, Archer S, Tugel AJ, Brown JR. An integrated framework for science-based arid land management. Journal of Arid Environments, 65: 319–335, 2006. Hinton GE, Sejnowski TJ, Ackley DH. Boltzmann Machines: Constraint Satisfaction Networks that Learn. Carnegie-Mellon University Computer Science Technical Report: CMU-CS-84-119, Carnegie-Mellon University, Pittsburgh, USA, 1984. Hochberg MM et al. HMM/NN training techniques for connected alphadigit speech recognition. Proc. IEEE ICASSP’91, Toronto, Canada, 109–112, 1991. Hoffmann A. Paradigms of Artificial Intelligence: A methodological and Computational Analysis. Springer, Singapore, 1998. Hopfield JJ. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79: 2554–2558, 1982. Hopfield JJ. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences USA, 81: 3088–3092, 1984. Hopfield JJ, Tank DW. Neural computation of decisions in optimization problems. Biological Cybernetics, 52: 141–154, 1985. Hornik KM, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5): 359–366, 1989. March 23, 2010 19:50 9in x 6in B-922 b922-ref 1st Reading References 7 Hsu KJ. Time series analysis of the interdependences among air pollutant. Atmospheric Environment, 26(4): 491–503, 1992. Hurt NE. Phase Retrieval and Zero Crossing. Kluwer Academic Publisher, NewYork, USA, 1989. Hurlbert SH. The concept of species diversity: A critique and alternative parameters. Ecology, 52: 577–585, 1971. Ingber L, Rosen B. Generic algorithms and very fast simulated reannealing: A comparison. Ibid, 16: 87–100. Ingber L. Simulated annealing: Practice versus theory. Mathematical Computer and Modelling, 18: 29–57, 1993. Jacobs RA. Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4): 295–308, 1988. Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE. Adaptive mixtures of local experts. Neural Computation, 3: 79–87, 1991. Jackson RD, Bartolome JW. A state-transition approach to understanding nonequilibrium plant community dynamics in Californian grasslands. Plant Ecology, 162: 49–65, 2002. Jasinski JP, Payette S. The creation of alternative stable states in the sourthern boreal forest, Québec, Canada. Ecological Monographs, 75: 561–583, 2005. Jia CS, Chi DF, Hu YY. Effects of forest plant communities on forest insect communities. Journal of Anhui Agricultural Science, 34(9): 1871–1872, 2006. Jordan MI, Jacobs RA. Hierarchies of adaptive experts. Advances in Neural Information Processing Systems 4. San Mateo, USA, 1992. Jorgensen SE, Verdonschot P, Lek S. Explanation of the observed structure of functional feeding groups of aquatic macro-invertebrates by an ecological model and the maximum exergy principle. Ecological Modelling, 158: 223–231, 2002. Kaluli JW, Madramootoo CA, Djebbar Y. Modeling nitrate leaching using neural networks. Water Science and Technology, 38(7): 127–134, 1998. Kemp SJ, Zaradic P, Hansen F. An approach for determining relative input parameter importance and significance in artificial neural networks. Ecological Modelling, 204: 326–334, 2007. Kilic H, Soyupak S, Tuzun I et al. 
An automata networks based preprocessing technique for artificial neural network modelling of primary production levels in reservoirs. Ecological Modelling, 201: 359–368, 2007. Kohonen T. Correlation matrix memories. IEEE Transactions on Computers, 21: 353–359, 1972. Kohonen T. Self-organizing formation of topologically correct feature maps. Biological Cybernatics, 43: 59–69, 1982. Kohonen T. Self-Organization and Associative Memory (3rd Edition). Springer-Verlag, New York, USA, 1988. Kohonen T. The self-organizing map. Proceedings of the IEEE, 78(9): 1464–1480, 1990. Kohonen T. Self-Organizing Maps. Springer-Verlag, Heidelberg, Germany, 1995. March 23, 2010 19:50 9in x 6in B-922 b922-ref 1st Reading 8 Computational Ecology Kohonen T, Somervuo P. Self-organizing maps of symbol strings. Neurocomputing, 21: 19–30, 1998. Krebs CJ. Ecological Methodology. HarperCollinsPublishers, Inc., USA, 1989. Kremen C, Colwell RK, Erwin TL, Murphy DD. Invertebrate assemblges: Their use as indicators in conservation planning. Conservation Biology, 7: 796–808, 1993. Kuo JT, Hsieh MH, Lung WS et al. Using artificial neural network for reservoir eutrophication prediction. Ecological Modelling, 200: 171–177, 2007. Laird N. The EM algorithm. Handbook of Statistics Vol 9, Rao CR (Ed.). 509–520, Elsevier Science Publisher, The Netherlands, 1993. Lars OB. Stratospheric ozone, ultraviolet radiation, and cryptogams. Biological Conservation, 135(3): 326–333, 2007. Lavorel S, Garnier E. Predicting the effects of environmental change on plant community composition and ecosystem functioning: Revising the Holy Grail. Functional Ecology, 16: 545–556, 2002. Le Cun Y. Une procedure d’apprentissage pour reseau a seuil assymetrique. Cognitiva, 85: 599–604, 1985. Lee WT, Tenorio MF. On an asymptotically optimal adaptive classifier design criterion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(3): 312–318, 1991. Lehmann A, Overton J, Leathwick JJ. GRASP: Generalized regression analysis and spatial prediction. Ecological Modelling, 157: 189–207, 2002. Lek S, Baran P. Estimations of trout density and biomass: A neural networks approach. Nonlinear Analysis, Theory, Methods & Applications, 30(8): 4985–4990, 1997. Lek S, Belaud A, Baran P, Dimopoulos I, Delacoste M. Role of some environmental variables in trout abundance models using neural networks. Aquatic Living Resources, 9: 23–29, 1996. Li QY, Wang NC, Yi DY. Numerical Analysis (Fourth Edition). Tsinghua University Press, Springer, Beijing, China, 2001. Liang Y, Page EW. Multiresolution learning paradigm and signal prediction. IEEE Transactions on Signal processing, 45: 2858–2864, 1997. Lin JK. Foundations of Topology. Science Press, Beijing, China, 1998. Liu BC. Functional Analysis. Science Press, Beijing, China, 2000. Loot G, Giraudel JL, Lek S. A non-destructive morphometric technique to predict Ligula intestinalis L. plerocercoid load in roach (Rutilus rutilus L.) abdominal cavity. Ecological Modelling, 156: 1–11, 2002. Luo SW. Theoretical Principles of Large-scale Artificial Neural Networks. Tsinghua University Press, Northern Jiaotong University Press, 2004. Mackey DJC. Bayesian interpolation. Neural Computation, 4: 415–447, 1992. Maier HR, Dandy GC. Understanding the behaviour and optimising the performance of back-propagation neural networks: An empirical study. Environmental Modelling & Software, 13(2): 179–191, 1998. Manly BFJ. Randomization, Bootstrap and Monte Carlo Methods in Biology (2nd Edition). Chapman & Hall, London, Britain, 1997. 
March 23, 2010 19:50 9in x 6in B-922 b922-ref 1st Reading References 9 Maravelias CD, Haralabous J, Papaconstantinou C. Predicting demersal fish species distributions in the Mediterranean Sea using artificial neural networks. Marine Ecology-Progress Series, 255: 249–258, 2003. Marchand M, Golea M, Ruján P. A convergence theorem for sequential learning in twolayer perceptrons. Europhysics Letters, 11(65): 487–492, 1990. Marchant JA, Onyango CM. Comparison of a Bayesian classifier with a multilayer feedforward neural network using the example of plant/weed/soil discrimination. Computers and Electronics in Agriculture, 39: 3–22, 2003. Mathworks. MATLAB 6.5: Neural Network Toolbox. Mathworks, Natick, USA, 2002. McCulloch W, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5: 115–133, 1943. McKenna JE. Application of neural networks to prediction of fish diversity and salmonid production in the Lake Ontario basin. Transactions of The American Fisheries Society, 134(1): 28–43, 2005. Men SP, Feng JH. Applied Functional Analysis. Science Press, Beijing, China, 2005. Meng DJ, Liang K. Differential Geometry. Science Press, Beijing, China, 1999. Mezard M, Nadal JP. Learning in feedforward layered network: The tiling algorithm. Journal of Physics A, 22: 2191–2204, 1989. Miller RJ, Wiegert RG. Documenting completeness, species-area relations, and the speciesabundance distribution of a regional flora. Ecology, 70: 16–22, 1989. Miller RJ, White PS. Condiderations for preserve design based on the distribution of rare plants in Great Smoky Mountains National Park. U.S.A. Journal of Environmental Management, 10: 119–124, 1986. Minsky M, Papert S. Perceptrons. MIT Press, Cambridge, USA, 1969. Moisen GG, Frescino TS. Comparing five modelling techniques for predicting forest characteristics. Ecological Modelling, 157: 209–225, 2002. Moller MF. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6: 525–533, 1993. Moody JE, Darken CJ. Learning with a localized receptive fields. Proceedings of the 1988 Connectionist Model Summer School. Morgan Kaufmann Publishers, San Mateo, USA, 133–143, 1988. Moody JE, Darken CJ. Fast learning in networks of locally-tuned processing units. Neural Computation, 1: 281–294, 1989. Moreau S, Bosseno R, Gu X., Baret F. Assessing the biomass dynamics of Andean bofedal and totora high-protein wetland grasses from NOAA/AVHRR. Remote Sensing of Environment, 85: 516–529, 2003. Moreno CE, Halffter G. Assessing the completeness of bat biodiversity inventories using species accumulation curves. Journal of Applied Ecology, 37: 149–158, 2000. Nagendra SMS, Khare M. Artificial neural network approach for modelling nitrogen dioxide dispersion from vehicular exhaust emissions. Ecological Modelling, 190: 99–115, 2006. Narendra KS, Mukhopadhyay S. Adaptive control of nonlinear multivariable systems using neural networks. IEEE Transactions on Neural Networks, 8(3): 475–485, 1997. March 23, 2010 10 19:50 9in x 6in B-922 b922-ref 1st Reading Computational Ecology Narendra KS, Mukhopadhyay S. Adaptive control of nonlinear multivariable systems using neural networks. Neural Networks, 7(5): 737–752, 1994. Neural Ware. Neural Computing. Neural Ware Inc., Pittsburgh, USA, 2000. Nour MH, Smith DW, El-Din MG et al. The application of artificial neural networks to flow and phosphorus dynamics in small streams on the Boreal Plain, with emphasis on the role of wetlands. Ecological Modelling, 191: 19–32, 2006. Olden JD. 
An artificial neural network approach for studying phytoplankton succession. Hydrobiology, 436: 131–143, 2000. Olden JD, Jackson DA. Illuminating the “black box”: A randomization approach for understanding variable contributions in artificial neural networks. Ecological Modelling, 154: 135–150, 2002. Olden JD, Joy MK, Death RG. Rediscovering the species in community-wide predictive modeling. Ecological Applications, 16(4): 1449–1460, 2006. Olden JD, Joy MK, Death RG. An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecological Modelling, 178: 389–397, 2004. Özesmi SL, Özesmi U. An artificial neural network approach to spatial habitat modelling with interspecific interaction. Ecological Modelling, 116: 15–31, 1999. Özesmi SL, Tan CO, Özesmi U. Methodological issues in building, training, and testing artificial neural networks in ecological applications. Ecological Modelling, 195: 83–93, 2006. Pal NR, Bezdek JC, Tsao ECK. Generalized clustering networks and Kohonen’s selforganizing scheme. IEEE Transcations on Neural Networks, 4: 549–557, 1993. Palmer MW. The estimation of species richness by extrapolation. Ecology, 71: 1195–1198, 1990. Palmer MW. Estimating species richness: The second-order jackknife reconsidered. Ecology, 72: 1512–1513, 1991. Pao YH et al. Neural-net computing and intelligent control systems. International Journal of Control, 56: 263–289, 1992. Park J et al. Universal approximations using RBF networks. Neural Computation, 3: 246– 257, 1991. Park YS, Chang JB, Lek S, Cao WX, Brosse S. Conservation strategies for endemic fish species threatened by the Three Gorges Dam. Conservation Biology, 17: 1748–1758, 2003. Parker DB. Learning-logic: Casting the cortex of the human brain in silicon. Technical Report TR-47, Center for Computational Research in Economics and Management Science, MIT, Cambridge, USA, 1985. Paruelo JM, Tomasel F. Prediction of functional characteristics of ecosystems: A comparison of artificial neural networks and regression models. Ecological Modelling, 98: 173–186, 1997. Paschalidou AK, Iliadis LS, Kassomenos P, Bezirtzoglou C. Neural modelling of the tropospheric ozone concentrations in an urban site. Proceedings of the 10th International March 23, 2010 19:50 9in x 6in B-922 b922-ref 1st Reading References 11 Conference on Engineering Applications of Neural Networks. Thessaloniki, Hellas, 29–31 Aug, 2007. Pastor-Barcenas O, Soria-Olivas E, Martın-Guerrero JD. Unbiased sensitivity analysis and pruning techniques in neural networks for surface ozone modeling. Ecological Modelling, 182: 149–158, 2005. Pearson RG, Dawson TP, Berry PM et al. SPECIES: A spatial evaluation of climate impact on the envelope of species. Ecological Modelling, 154(3): 289–300, 2002. Peacock L, Worner SP, Sedcole R. Climate variables and their role in site discrimination of invasive insect species distributions. Environmental Entomology, 35(4): 958–963, 2006. Perterson C, Anderson JR. A mean field theory learning algorithm for NN. Complex Systems, 1(5): 995–1019, 1987. Pidgeon IM, Ashby E. Studies in applied ecology I: A statistical analysis of regeneration following protection from grazing. Proceedings of the Linneon Society New South Wales, 65: 123–143, 1940. Pimental D, Stachow U, Takacs DA, Brubaker HW. Conserving biological diversity in agricultural/forestry systems. Bioscience, 42(5): 354–362, 1992. Poggio T, Girosi F. Networks for approximation and learning. Proceedings of IEEE, 78(9): 1481–1496, 1990. 
Powell MJD. The theory of radial basis function approximation. University of Cambridge Numerical Analysis Reports. University of Cambridge, Cambridge, UK, 1990.
Prechelt L. A quantitative study of experimental evaluations of neural network learning algorithms: Current research practice. Neural Networks, 9(3): 457–462, 1996.
Preston FW. Time and space and the variation of species. Ecology, 41: 785–790, 1960.
Pu ZL. Mathematical Models and Applications in the Management of Crop Insect Pests. Guangdong Science and Technology Publishing House, Guangzhou, China, 1990.
Quétier F, Thébault A, Lavorel S. Plant traits in a state and transition framework as markers of ecosystem response to land-use change. Ecological Monographs, 77(1): 33–52, 2007.
Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77: 257–286, 1989.
Ray C, Klindworth K. Neural networks for agrichemical vulnerability assessment of rural private wells. Journal of Hydrologic Engineering, 5(2): 162–171, 2000.
Recknagel F, French M, Harkonen P, Yabunaka K. Artificial neural network approach for modelling and prediction of algal blooms. Ecological Modelling, 96: 11–28, 1997.
Reed R. Pruning algorithms: A survey. IEEE Transactions on Neural Networks, 4(5): 740–747, 1993.
Reyjol Y, Lim P, Belaud A, Lek S. Modelling of microhabitat used by fish in natural and regulated flows in the river Garonne (France). Ecological Modelling, 146: 131–142, 2001.
Rosenblatt F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65: 388–408, 1958.
Rosenzweig ML. Species Diversity in Space and Time. Cambridge University Press, Cambridge, USA, 1995.
Rossi RE, Borth PW, Tollefson JJ. Stochastic simulation for characterizing ecological spatial patterns and appraising risk. Ecological Applications, 3(4): 719–735, 1993.
Rozema J, Boelen P, Blokker P. Depletion of stratospheric ozone over the Antarctic and Arctic: Responses of plants of polar terrestrial ecosystems to enhanced UV-B. Environmental Pollution, 137(3): 428–442, 2005.
Rudin W. Functional Analysis (Second Edition). McGraw-Hill, Columbus, USA, 1991.
Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature, 323: 533–536, 1986.
Rumelhart DE, McClelland JL. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press, Cambridge, USA, 1986.
Sanders HL. Marine benthic diversity: A comparative study. The American Naturalist, 102: 243–282, 1968.
Sarle WS. Stopped training and other remedies for overfitting. Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 352–360, 1995.
Scardi M. Artificial neural networks as empirical models for estimating phytoplankton production. Marine Ecology Progress Series, 139: 289–299, 1996.
Scardi M, Harding Jr LW. Developing an empirical model of phytoplankton primary production: A neural network case study. Ecological Modelling, 120: 213–223, 1999.
Scheffe RD, Morris RE. A review of the development and application of the urban airshed model. Atmospheric Environment, 27B(1): 23–39, 1993.
Schino G, Iannetta M, Martini S et al. Satellite estimate of grass biomass in a mountainous range in central Italy. Agroforestry Systems, 59: 157–162, 2003.
Schoenly KG, Cohen MB, Barrion AT, Zhang WJ, Gaolach B, Viajante VD. Effects of Bacillus thuringiensis on non-target herbivore and natural enemy assemblages in tropical irrigated rice. Environment and Biosafety Research, 3: 181–206, 2003.
Schoenly KG, Zhang WJ. IRRI Biodiversity Software Series. I. LUMP, LINK, AND JOIN: Utility programs for biodiversity research. IRRI Technical Bulletin No. 1. International Rice Research Institute, Manila, Philippines, 1999.
Schoenly KG, Zhang WJ. IRRI Biodiversity Software Series. V. RARE, SPPDISS, and SPPANK: Programs for detecting between-sample difference in community structure. IRRI Technical Bulletin No. 5. International Rice Research Institute, Manila, Philippines, 1999.
Schultz A, Wieland R. The use of neural networks in agroecological modeling. Computers and Electronics in Agriculture, 18: 73–90, 1997.
Scott I, Mulgrew B. Nonlinear system identification and prediction using orthonormal functions. IEEE Transactions on Signal Processing, 45: 1842–1853, 1997.
Setiono R. A penalty-function approach for pruning feedforward neural networks. Neural Computation, 9(1): 185–204, 1997.
Shahid SA, Schoenly KG, Haskell NH, Hall RD, Zhang WJ. Carcass enrichment does not alter decay rates or arthropod community structure: A test of the arthropod saturation hypothesis at the anthropology research facility in Knoxville, Tennessee. Journal of Medical Entomology, 4: 559–569, 2003.
Shang Y, Wah BW. Global optimization for neural network training. IEEE Computer, 29(3): 45–54, 1996.
Shanno DF. Recent advances in numerical techniques for large-scale optimization. Neural Networks for Control, Miller, Sutton, and Werbos (Eds.). MIT Press, Cambridge, USA, 1990.
Sharma V, Negi SC, Rudra RP et al. Neural networks for predicting nitrate-nitrogen in drainage water. Agricultural Water Management, 63: 169–183, 2003.
Sheng SG, Gao BJ, Zhang YH, Wang ZW, Wang HX, Wang XJ. Studies on the structure of the insect communities in different plant types. Journal of Agricultural University of Hebei, 20(4): 61–65, 1997.
Simberloff D. Properties of the rarefaction diversity measurements. The American Naturalist, 106: 414–418, 1972.
Simpson RW, Layton AP. Forecasting peak ozone levels. Atmospheric Environment, 17: 1649–1654, 1983.
Smith EP, van Belle G. Non-parametric estimation of species richness. Biometrics, 40: 119–129, 1984.
Smith M. Neural Networks for Statistical Modeling. Van Nostrand Reinhold, New York, USA, 1994.
Soltic S, Pang S, Peacock L, Worner SP. Evolving computation offers potential for estimation of pest establishment. International Journal of Computers, Systems and Signals, 5(2): 36–43, 2004.
Song MY, Hwang HJ, Kwark IS et al. Self-organizing mapping of benthic macroinvertebrate communities implemented to community assessment and water quality evaluation. Ecological Modelling, 203: 18–25, 2007.
Sontag ED. VC dimension of neural networks. Neural Networks and Machine Learning, Bishop CM (Ed.). 69–95, NATO ASI Series, Springer, 1998.
Specht DF. Probabilistic neural networks for classification, mapping and associative memory. IEEE ICNN. San Diego, USA, 1988.
Specht DF. Probabilistic neural networks. Neural Networks, 3(1): 109–118, 1990.
Specht DF. A general regression neural network. IEEE Transactions on Neural Networks, 2(6): 568–576, 1991.
SPSS Inc. SPSS 15.0 for Windows release 15.0.0.0. SPSS Inc., Chicago, USA, 2006.
Steele BB, Bayn RL Jr, Val Grant C. Environmental monitoring using populations of birds and small mammals: Analysis of sampling effort. Biological Conservation, 30: 157–172, 1984.
Sutherst RW, Maywald GA. Climate model of the red imported fire ant, Solenopsis invicta Buren (Hymenoptera: Formicidae): Implications for invasion of new regions, particularly Oceania. Environmental Entomology, 34(2): 317–335, 2005.
Swingler K. Applying Neural Networks: A Practical Guide. Academic Press, London, UK, 1996.
Tan CO, Özesmi U, Beklioglu M et al. Predictive models in ecology: Comparison of performances and assessment of applicability. Ecological Informatics, 1(2): 195–211, 2006.
Tollenaere T. SuperSAB: Fast adaptive back propagation with good scaling properties. Neural Networks, 3(5): 561–573, 1990.
Valiant LG. A theory of the learnable. Communications of the ACM, 27(11): 1134–1142, 1984.
Vahidinasab V. Day-ahead price forecasting in restructured power systems using artificial neural networks. Electric Power Systems Research, 78(8): 1332–1342, 2008.
van der Werf W, Keesman K, Burgess P et al. Yield-SAFE: A parameter-sparse, process-based dynamic model for predicting resource capture, growth, and production in agroforestry systems. Ecological Engineering, 29(4): 419–433, 2007.
Van Rooij AJF, Jain LC, Johnson RP. Neural Network Training Using Genetic Algorithms. World Scientific, Singapore, 1996.
Vapnik V. The Nature of Statistical Learning Theory. Springer-Verlag, New York, USA, 1995.
Vapnik V. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10: 988–999, 1999.
Venkatesh S. Computation and learning in the context of neural network capacity. Neural Networks for Perception, Vol. 2, Wechsler H (Ed.). 173–207, Academic Press, 1992.
Viotti P, Liuti G, Di Genova P. Atmospheric urban pollution: Applications of an artificial neural network (ANN) to the city of Perugia. Ecological Modelling, 148(1): 27–46, 2002.
Vogl TP, Mangis JK, Rigler AK, Zink WT, Alkon DL. Accelerating the convergence of the back-propagation method. Biological Cybernetics, 59: 256–264, 1988.
Von der Malsburg C. Network self-organization. An Introduction to Neural and Electronic Networks, Zornetzer SF et al. (Eds.). Academic Press, USA, 1990.
Wagner TL, Wu H, Sharpe PJH et al. Modeling distribution of insect development time: A literature review and application of the Weibull function. Annals of the Entomological Society of America, 77: 475–487, 1984.
Walther BA, Morand S. Comparative performance of species richness estimation methods. Parasitology, 116: 395–405, 1998.
Wasserman PD. Neural Computing: Theory and Practice. Van Nostrand Reinhold, New York, USA, 1989.
Watts MJ, Worner SP. Using artificial neural networks to determine the relative contribution of abiotic factors influencing the establishment of insect pest species. Ecological Informatics, 3: 64–74, 2008.
Watts MJ, Worner SP. Estimating the risk of insect species invasion: Kohonen self-organising maps versus k-means clustering. Ecological Modelling, 220: 821–829, 2009.
Way MJ, Heong KL. The role of biodiversity in the dynamics and management of insect pests of tropical irrigated rice: A review. Bulletin of Entomological Research, 84: 567–587, 1994.
Werbos PJ. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD Thesis, Harvard University, Cambridge, USA, 1974.
Wettschereck D et al. Improving the performance of RBF networks by learning center locations. Advances in Neural Information Processing Systems. San Mateo, USA, 1992.
Widrow B. Neural networks application in industry, business and science. Communications of the ACM, 37: 93–105, 1994.
Widrow B, Hoff ME. Adaptive switching circuits. 1960 IRE WESCON Convention Record, IRE Part 4, New York, USA, 1960.
Widrow B, Stearns SD. Adaptive Signal Processing. Prentice-Hall, New Jersey, USA, 1985.
Widrow B, Winter R. Neural nets for adaptive filtering and adaptive pattern recognition. IEEE Computer Magazine, 3: 25–39, 1988.
Williams CB. Patterns in the Balance of Nature. Academic Press, London, England, 1964.
Willmott CJ. Some comments on the evaluation of model performance. Bulletin of the American Meteorological Society, 63: 1309–1313, 1982.
Wilson EO. The little things that run the world. Conservation Biology, 1: 344–346, 1987.
Worner SP. Predicting the establishment of exotic pests in relation to climate. Quarantine Treatments for Pests of Food Plants, Sharp JL, Hallman GJ (Eds.). Westview Press, Boulder, USA, 1994.
Worner SP, Gevrey M. Modelling global insect pest species assemblages to determine risk of invasion. Journal of Applied Ecology, 43(5): 858–867, 2006.
Wu DR. Course Notes on Differential Geometry. Higher Education Press, Beijing, China, 1981.
Xie YC, Sha ZY, Yu M et al. A comparison of two models with Landsat data for estimating above ground grassland biomass in Inner Mongolia, China. Ecological Modelling, 220: 1810–1818, 2009.
Xu L. Bayesian ying-yang system and theory as a unified statistical learning approach (I): For unsupervised and semi-unsupervised learning. Brain-Like Computing and Intelligent Information Systems, Amari S, Kasabov N (Eds.). Springer-Verlag, Germany, 241–274, 1997.
Yan PF, Zhang CS. Artificial Neural Networks and Computation of Simulated Evolution. Tsinghua University Press, Beijing, China, 2000.
Yazdanpanah H. A neural network model to predict wheat yield. Proceedings of Map India Conference, New Delhi, India, 2002.
Yazdanpanah H, Karimi M, Hejazizadeh Z. Forecasting of daily total atmospheric ozone in Isfahan. Environmental Monitoring and Assessment (DOI 10.1007/s10661-008-0531-z), 2008.
Yu R, Leung PS, Bienfang P. Predicting shrimp growth: Artificial neural network versus nonlinear regression models. Aquacultural Engineering, 34: 26–32, 2006.
Zanetti P. Air Pollution Modelling: Theories, Computational Methods and Available Software. Computational Mechanics Publications, Southampton, Boston, USA, 1990.
Zhang WJ. Computer inference of network of ecological interactions from sampling data. Environmental Monitoring and Assessment, 124: 253–261, 2007a.
Zhang WJ. Methodology on Ecology Research. Sun Yat-Sen University Press, Guangzhou, China, 2007b.
Zhang WJ. Supervised neural network recognition of habitat zones of rice invertebrates. Stochastic Environmental Research and Risk Assessment, 21: 729–735, 2007c.
Zhang WJ. Pattern classification and recognition of invertebrate functional groups using self-organizing neural networks. Environmental Monitoring and Assessment, 130: 415–422, 2007d.
Zhang WJ, Bai CJ, Liu GD. Neural network modeling of ecosystems: A case study on cabbage growth system. Ecological Modelling, 201: 317–325, 2007.
Zhang WJ, Barrion AT. Function approximation and documentation of sampling data using artificial neural networks. Environmental Monitoring and Assessment, 122: 185–201, 2006.
Zhang WJ, Feng YJ, Schoenly KG. Performance of non-parametric richness estimators to hierarchical invertebrate taxa in irrigated rice field. International Rice Research Notes, 29(1): 39–41, 2004.
Zhang WJ, Li QH. Development of topological functions in neural networks and their application in SOM learning to biodiversity. The Proceedings of the China Association for Science and Technology, 4(2): 583–586, 2007.
Zhang WJ, Liu GH, Dai HQ. Simulation of food intake dynamics of holometabolous insect using functional link artificial neural network. Stochastic Environmental Research and Risk Assessment, 22: 123–133, 2008.
Zhang WJ, Pang Y, Qi YH et al. Relationship between temperature and development of Spodoptera litura F. Acta Scientiarum Naturalium Universitatis Sunyatseni, 36(2): 6–9, 1997.
Zhang WJ, Qi YH. Functional link artificial neural network and agri-biodiversity analysis. Biodiversity Science, 3: 345–350, 2002.
Zhang WJ, Schoenly KG. IRRI Biodiversity Software Series. II. COLLECT1 and COLLECT2: Programs for calculating statistics of collectors’ curves. IRRI Technical Bulletin No. 2. International Rice Research Institute, Manila, Philippines, 1999.
Zhang WJ, Schoenly KG. IRRI Biodiversity Software Series. IV. EXTSPP1 and EXTSPP2: Programs for comparing and performance-testing eight extrapolation-based estimators of total species richness. IRRI Technical Bulletin No. 4. International Rice Research Institute, Manila, Philippines, 1999.
Zhang WJ, Schoenly KG. Lumping and correlation analyses of invertebrate taxa in tropical irrigated rice field. International Rice Research Notes, 1: 41–43, 2004.
Zhang WJ, Wei W. Spatial succession modeling of biological communities: A multi-model approach. Environmental Monitoring and Assessment (DOI 10.1007/s10661-008-0574-1), 2008.
Zhang WJ, Zhang XY. Neural network modeling of survival dynamics of holometabolous insects: A case study. Ecological Modelling, 211: 433–443, 2008.
Zhang WJ, Zhong XQ, Liu GH. Recognizing spatial distribution patterns of grassland insects: Neural network approaches. Stochastic Environmental Research and Risk Assessment, 22(2): 207–216, 2008.
Zhang YT, Fang KT. Introduction to Multivariate Statistical Analysis. Science Press, Beijing, China, 1982.