TECHNIQUES FOR RESOLVING INCOMPLETE SYSTEMS IN K-SYSTEMS ANALYSIS

A Dissertation

Submitted to the Graduate Faculty of the Louisiana State University and the Agricultural and Mechanical College in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Department of Computer Science

by
Gary J. Asmus
B.S., Louisiana State University, 1992
December 1998

Acknowledgments

I would like to thank all of the committee members for their advice and support.
In particular, I would like to thank Dr. S. Sitharama Iyengar for his guidance of the department and his personal support and guidance of my work. Also, Dr. Robert Mathews was integral in helping me to look beyond the bounds of what is known and teaching me how to effectively communicate my thoughts. Most importantly, Dr. J. Bush Jones provided me with leadership and insight about all aspects of academic endeavors and investigations. It is not an exaggeration to say that without his unwavering support of my work, I would not have been able to complete this journey. Finally, I must thank my parents for their patience and encouragement throughout my life. They taught me that any task, no matter how difficult, can be completed if I were to simply begin and persevere. Also, I must acknowledge the help and support of my friend and peer, Chris Branton, who has traveled this academic road with me from its beginning. The single most important person in my life, who has been a constant joy and perfect traveling companion throughout this journey, is my beautiful wife, Tammy. Without her understanding and love, I would not have been able to complete this work. She gives meaning to all that I do.

Table Of Contents

Acknowledgments
List Of Figures
Abstract
Chapter 1. Overview of K-systems Analysis
1.1 Introduction to K-Systems Analysis
1.1.1 General Definitions
1.1.3 Unbiased Reconstructions
1.2 Reconstructability Analysis Algorithms
1.3 Reconstruction of General Functions
1.4 Reconstructions with Arbitrary Data
1.4.2 State Contradictions
1.4.3 Data Scattering
1.4.4 Missing Data
1.5 The Importance Of Reconstructions With Arbitrary Data
1.5.1 Missing Data, Known States, and the Entropy Fill
1.5.2 Data Scattering and Clustering
1.6 Existing Missing Data and Clustering Algorithms
1.6.1 Missing Data Algorithms
1.6.2 Clustering Algorithms
Chapter 2. Incomplete Systems: Missing Data
2.1 Systems, Structure and Missing Data
2.2 Distance Between States
2.2 Closest States
2.3 Use of the Closest States
2.4 Closest States Algorithm
2.4.1 Space and Time Complexity of the Closest States Algorithm
2.4.2 Closest State Algorithm Examples
2.5 Alternative Closest State Algorithm
2.5.1 Iterated Closest States
2.5.2 Mean Deviation from the Mean of the Closest States
2.5.3 Test for Selecting an Imputation Algorithm
Chapter 3. Incomplete Systems: Data Scattering and Clustering
3.1 Using Clustered Data in K-systems Analysis
3.2 Entropy Similarity
3.2.1 Entropy Similarity in the taxmap Algorithm
3.2.2 Entropy Similarity taxmap Examples
Chapter 4. Summary and Conclusions
4.1 A Methodology for Resolving Incomplete Systems
4.2 Conclusions and Final Remarks
Bibliography
Vita
List Of Figures

Figure 1: A probabilistic system
Figure 2: A g-system
Figure 3: A K-system
Figure 4: Data Scattering Example
Figure 5: Resolution of data scattering
Figure 6: Data for two-dimensional clustering
Figure 7: Two dimensional clusters
Figure 8: Example of an incomplete system
Figure 9: Example of a structure system
Figure 10: Example of a relabeled structure system
Figure 11: Example System
Figure 12: Original System - No Missing Data
Figure 13: Comparison of entropy fill and closest states
Figure 14: Comparison of entropy fill and closest states
Figure 15: States 000, 111, and 222
Figure 16: Added States 022, 101, and 011
Figure 17: Added States 122, 002, 220, and 020
Figure 18: Added States 121, 100, and 202
Figure 19: Closest States to the Members of the Closest State Sets
Figure 20: Deviations from the Mean
Figure 21: Comparison of entropy fill and closest states
Figure 22: Data Scattering Example
Figure 23: Joint Probability Distribution
Figure 24: Maximum Entropy Distribution
Figure 25: Entropy Similarity Joint Distribution
Figure 26: Plot of Example Points
Figure 27: Entropy Distribution
Figure 28: Triangle Inequality Counterexample
Figure 29: Plot of Counterexample
Figure 30: Relationship of Euclidean Distance to Entropy Dissimilarity
Figure 31: T = 0.990, Hc = 0.96157
Figure 32: T = 0.993, Hc = 0.96197
Figure 33: T = 0.994, Hc = 0.96226
Figure 34: T = 0.995, Hc = 0.96171
Figure 35: T = 0.90 to 1.0, incremented by 0.01
Figure 36: Non-spherical Clusters, T = 0.9999
Figure 37: Clusters with Different Shapes, T = 0.99
Figure 38: Sparse Linearly Non-Separable Clusters, T = 0.984
Figure 39: Dense Linearly Non-Separable Cluster, T = 0.998
Figure 40: Clusters with a Bridge, T = 0.997
Figure 41: Four Clusters Sharing Coordinates, T = 0.991
Figure 42: Four Clusters, T = 0.999

Abstract

K-systems analysis is a generalization of reconstructability analysis (RA), where any general, complete multivariate system (g-system) can be transformed into an isomorphic, dimensionless system (a K-system) that has sufficient properties to be analyzed using probabilistic RA algorithms. In particular, a g-system consists of a set of states formed from a complete combination of the variables assigned specific values from a finite set of possible values and an associated system function value. The g-system must be complete in that all possible states must have an associated system function value. K-systems analysis has been applied to a variety of systems, but many real-world systems consist of data that is incomplete. Impediments in real-world systems have been previously identified as state contradictions, data scattering and missing data [JONE 85d]. The problem of state contradictions has been adequately addressed, but while techniques for the resolution of data scattering and missing data have been proposed, additional issues remain. The author has condensed the understanding of data scattering and missing data into the single problem of an incomplete system. Within this context, techniques for resolving incomplete systems and, thereby, inducing a complete system have been developed. If a g-system is incomplete, it may be viewed solely from the perspective of missing data. A new algorithm has been developed that is based on the state distance and uses this distance to determine unbiased estimates of the values for the system function. The state distance is a generalized Hamming distance and is shown to satisfy the properties of a metric on the state space and to be superior to current methods for imputing system function values. An incomplete system may be viewed from the perspective of data scattering. In general, scattered data may be resolved through clustering and, previously, this clustering has been done in one dimension.
A method is developed that allows the meaningful use of two dimensions in the clustering. Further, a new pairwise similarity measure is developed based on the maximum entropy principle and mathematics that form the foundation of K-systems analysis. Use of this similarity measure is demonstrated within the context of an existing clustering algorithm.

Chapter 1. Overview of K-systems Analysis

Reconstructability analysis is the study of the relationship of subsystems to systems, the relationship of parts to wholes, the analysis of a system based solely on the information contained within it. The development of reconstructability analysis began in the 1960s with the work of Ross Ashby [HIGA 83]. Further developments occurred throughout the 1970s, but reconstructability analysis did not become fully developed until 1981, when a whole issue of the International Journal of General Systems was devoted to the evaluation of the reconstruction hypothesis. In particular, the work of Cavallo and Klir provided a detailed overview and definition of reconstructability analysis [CAVA 81a-b]. During the mid-1980s, Bush Jones published a series of papers that provided efficient algorithms which addressed the practical needs for performing reconstructability analysis. The topics covered in these papers include the determination of reconstruction families [JONE 82], the determination of unbiased reconstructions [JONE 85a], and a greedy algorithm for the generalization of the reconstruction problem [JONE 85b]. One paper of particular interest here concerned the reconstruction of general functions with arbitrary data [JONE 85c]. Previously, reconstructability analysis applied only to probabilistic or possibilistic systems. Jones was able to develop a method whereby a general system (g-system) could be transformed into an isomorphic Klir system (K-system) that could be analyzed using the probabilistic reconstructability analysis algorithms. The results of the analysis could then be mapped directly back to the original g-system. Following this work, Jones published a paper on reconstructability considerations with arbitrary data, in which he proposed methods whereby impediments in real world data could be overcome so that reconstructability analysis could be applied [JONE 85d]. The main focus of the research reported here is on these impediments and their resolution. In particular, the focus will be on the problems referred to as data scattering and missing data.

First, we provide a cohesive and condensed understanding of the problems of data scattering and missing data. In effect, these two problems are actually a single problem that may be viewed from two perspectives. The one issue that ties these perspectives together is that both are interpretations of an incomplete system. Both interpretations are valid in particular contexts and discriminating between the two is often difficult. The missing data view is the more general of the two in that it does not assume that the source of the data has any particular attributes; it only assumes the existing data is all of the information that is available. Data scattering assumes that the observation or control of the variable values was imprecise and can be refined based on the existing system information.
Additionally, the view of data scattering may be useful if the amount of missing data is large; clustering to resolve scattered data may reduce the amount of missing data in the system.

A new algorithm for imputing missing data is developed based on the distance between states. The state distance that is developed for the Cartesian state space is shown to be a true distance because it satisfies the symmetry axiom, the identity axiom and the triangle inequality. The state distance is used to identify the set of closest states, and this set is used to impute an unbiased estimation of the missing state. It is shown that the closest state set is superior to the existing technique by proving that it shares, on average, more information with the missing state than does the set of all states. The algorithm is defined and analyzed and shown to have a time complexity of O(n²), where n is the number of states.

If the incompleteness of the system is determined to be due to data scattering, the data will require clustering. Currently, clustering for K-systems analysis is done in one dimension for each variable that is considered to be scattered. The author describes the shortcomings of using clusters that are inherently one dimensional and develops a general methodology for using higher dimensional clusterings. In particular, focus is on the use of two dimensional clusters so that the system may still be submitted for K-systems analysis. While data that is clustered may be easily presented for K-systems analysis, use of the resulting analysis for prediction requires additional considerations. The author presents a method whereby previously unobserved variable values may be projected to the clusters used for the analysis and, subsequently, be used to predict the system response. In addition to predicting the system response, the methodology also enables the numerical qualification of the results independent of the algorithm that was used to produce the clustering.

Next, a technique that is similar to the one used to perform the K-system transformation will be applied in the context of a similarity measure based on the information entropy. This entropy similarity measure is specifically defined for the pairwise comparison of n-dimensional points, but is also generalized to the overall similarity of an arbitrary number of n-dimensional points. Using only a K-systems type of transformation and the definition of information entropy, this similarity measure is analyzed in the context of its corresponding dissimilarity measure. It is shown that this dissimilarity measure does not meet all the properties of a metric as did the state distance; specifically, it does not satisfy the triangle inequality. The behavior of the entropy similarity is further explored to demonstrate why it does not satisfy this property. Finally, the use of entropy similarity is demonstrated using a density-based clustering algorithm known as the taxmap algorithm [CARM 69]. This algorithm is intended to simulate the way in which a human observer would detect clusters in two and three dimensions. It makes use of a similarity matrix that contains pairwise measures of similarity for all data points. It proceeds by finding the two most similar points to initiate a cluster and continues to add the next most similar point to that cluster until a measure of discontinuity is exceeded. It then repeats this process until all the points have been assigned to a cluster. The author uses the entropy similarity and derives a corresponding measure of discontinuity for use in the taxmap algorithm. Examples of the results of this algorithm are presented for a variety of data sets.
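The seed-and-grow behavior just described can be illustrated with a short sketch. This is not the taxmap algorithm of [CARM 69] or the entropy similarity developed in Chapter 3: a plain inverse-distance similarity and a fixed similarity threshold T stand in for the similarity matrix and discontinuity measure, and all names are assumptions of this sketch.

    # Illustrative sketch of a taxmap-style joining procedure (not the published algorithm).
    # Assumptions: similarity is 1 / (1 + Euclidean distance); a cluster is closed
    # when the best attainable similarity drops below a fixed threshold T.
    import math

    def similarity(p, q):
        return 1.0 / (1.0 + math.dist(p, q))

    def taxmap_like(points, T):
        unassigned = set(range(len(points)))
        clusters = []
        while unassigned:
            if len(unassigned) == 1:
                clusters.append([unassigned.pop()])
                break
            # Seed a new cluster with the most similar unassigned pair.
            seed = max(((i, j) for i in unassigned for j in unassigned if i < j),
                       key=lambda ij: similarity(points[ij[0]], points[ij[1]]))
            cluster = list(seed)
            unassigned -= set(seed)
            # Grow the cluster until the discontinuity threshold is exceeded.
            while unassigned:
                best = max(unassigned,
                           key=lambda i: max(similarity(points[i], points[j]) for j in cluster))
                best_sim = max(similarity(points[best], points[j]) for j in cluster)
                if best_sim < T:
                    break
                cluster.append(best)
                unassigned.remove(best)
            clusters.append(cluster)
        return clusters

    print(taxmap_like([(0, 0), (0, 1), (1, 0), (9, 9), (9, 10)], T=0.4))   # [[0, 1, 2], [3, 4]]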
The remainder of this work will be organized as follows. First, the work of Cavallo and Klir [CAVA 81a-b] and Jones [JONE 85a-d] will be reviewed to form a foundation from which to work. Then the impediments in real world data will be described along with the existing techniques that are applied for their resolution. Limitations of these techniques will be described along with the motivation for finding new methods. Next follows a discussion of what exactly an incomplete system is and why we wish to work mainly with complete systems. An algorithm for filling in (or imputing) system function values for missing states will be derived and defined, and its computational complexity will be analyzed. Next, a new similarity measure will be derived, defined and used in a clustering algorithm that may also be used to resolve incomplete systems. Finally, the missing state algorithm and the clustering algorithm will be discussed in the context of their use for resolving incomplete systems.

1.1 Introduction to K-Systems Analysis

1.1.1 General Definitions

We begin with the work of Cavallo and Klir, which established the basic terminology and concepts that are known as reconstructability analysis. When doing any type of systems modeling, it can be very useful to represent the overall system as a set of coupled subsystems. As Cavallo and Klir point out, a complex system which is represented by a set of coupled subsystems has a number of advantages. "It is usually easier to understand a large system when it can be decomposed into smaller systems. It is also easier to further develop such systems, as each subsystem may be investigated independently of other subsystems. Once sufficiently developed, it is usually easier to utilize it for various purposes. Moreover, the decomposed system may be easier to document or simulate on computer." [CAVA 81a]

Difficulties arise when trying to decompose a system into subsystems; it is possible that the overall system cannot be reconstructed from the set of subsystems. This leads to two distinct yet related problems for systems modeling. The first is the reconstructability problem, which has been defined as the problem of determining which subsystems of an overall system are adequate for describing the overall system within some desired level of approximation. The identification problem can be viewed as the complement to the reconstructability problem. Given that the overall system is unknown, the problem is one of identifying properties of the overall system based solely on appropriate properties of the known subsystems [CAVA 81a]. The process of solving these two problems is known as reconstructability analysis (RA).
Specifically, the term reconstructability analysis will be used when referring to systems that are inherently probabilistic and complete, such that they can be directly submitted for analysis using the reconstructability algorithms. Arbitrary multivariate systems may require prior analysis and transformation into a form that is suitable for analysis using the probabilistic RA algorithms.

Reconstructability analysis concerns itself with systems that are composed of states. A system may be viewed as a set of multivariate data that consists of a set of variables and a system function which is defined on these variables. Thus, a system consists of a tuple of the form <v₁, v₂, ..., vₙ, f>, where v₁, v₂, ..., vₙ are variables and f is a function defined over these variables. The function may be a probabilistic behavior function, a possibilistic behavior function, a fuzzy set membership function, or any arbitrary (and possibly non-linear) function over the set of variables. A state is a complete combination of values from a state set assigned to each variable, and the function f is associated with each state.

Formally, a system may be defined as the following six-tuple:

B = (V, W, s, A, Q, f)    (1.1)

where

• V = {vᵢ | i ∈ {1, 2, ..., n}} is the set of variables;
• W = {wⱼ | j ∈ {1, 2, ..., m}, m ≤ n} is the family of state sets;
• s: V → W is an onto mapping that assigns each variable in V to one of the state sets;
• A = s(v₁) × s(v₂) × ... × s(vₙ) is the set of all states;
• Q is the set of real numbers;
• f: A → Q is a system function that represents the meaning of the information regarding the total set of states in the system.

Typically, the function f and the system B referred to probabilistic or possibilistic systems as in [CAVA 81], but the use of these symbols here is more general in that they may represent any arbitrary system that can be represented as a set of states and a corresponding system function. This is due to the fact that the early RA methodology was expanded to include functions from arbitrary general systems [JONE 85a-e] [TRIV 93]. The algorithms that were developed in [CAVA 81] and [JONE 85a] are targeted for probabilistic functions for use on completely specified systems. An example of a complete, probabilistic system is shown in figure 1. In this example, V = {v₁, v₂, v₃}; W = {{0, 1}, {0, 1, 2}}; s: V → W is the onto mapping that assigns {0, 1} to v₁ and v₂, and {0, 1, 2} to v₃; A = {000, 001, 002, 010, 011, 012, 100, 101, 102, 110, 111, 112}; Q = [0, 1] is a set of real numbers; and f: A → Q is a probabilistic function.

v₁:   0    0    0    0    0    0    1    1    1    1    1    1
v₂:   0    0    0    1    1    1    0    0    0    1    1    1
v₃:   0    1    2    0    1    2    0    1    2    0    1    2
f(α): 0.1  0.1  0.0  0.2  0.1  0.2  0.1  0.0  0.0  0.1  0.0  0.1

Figure 1: A probabilistic system

Note that the current conception of a system for reconstructability analysis is a probabilistic system, but the RA algorithms may be applied to arbitrary systems through application of a transformation of a general system into a Klir-system (K-system). This transformation and its properties will be defined and demonstrated below.
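Before introducing further notation, it may help to see how such a completely specified system can be held in code. The following is a minimal, illustrative sketch of the system of Figure 1; the dictionary layout and names are assumptions made here, not part of the original formulation.

    # Illustrative representation of the probabilistic system of Figure 1.
    # Each state is a tuple of variable values (v1, v2, v3); f maps states to values.
    from itertools import product

    state_sets = {"v1": (0, 1), "v2": (0, 1), "v3": (0, 1, 2)}   # s assigns a state set to each variable
    A = list(product(*state_sets.values()))                       # the set of all states

    f = {                                                          # system function f: A -> Q
        (0, 0, 0): 0.1, (0, 0, 1): 0.1, (0, 0, 2): 0.0,
        (0, 1, 0): 0.2, (0, 1, 1): 0.1, (0, 1, 2): 0.2,
        (1, 0, 0): 0.1, (1, 0, 1): 0.0, (1, 0, 2): 0.0,
        (1, 1, 0): 0.1, (1, 1, 1): 0.0, (1, 1, 2): 0.1,
    }

    assert set(f) == set(A)                    # the system is complete: every state has a value
    assert abs(sum(f.values()) - 1.0) < 1e-9   # and probabilistic: the values sum to one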
We define some further notation so that we may understand the notion of states and substates. Using notation similar to [CAVA 81b]:

For each state α = (αᵢ | i ∈ Nₙ) ∈ A, where Nₙ = {1, 2, ..., n} indexes the variables of a system defined by (1.1), and for each state β = (βⱼ | j ∈ X, X ⊆ Nₙ) associated with the variables in the set

Z = {vⱼ | j ∈ X, X ⊆ Nₙ} ⊆ V,

let β be called a substate of α (or α be called a superstate of β) if and only if βⱼ = αⱼ for all j ∈ X. Let β ≺ α (and α ≻ β) denote that β is a substate of α [CAVA 81b].

Let [f ↓ Z] denote the projection of f which disregards all variables in V except those in the set Z ⊆ V. Then [f ↓ Z] is a mapping from a set of states (substates of states in A) to Q,

[f ↓ Z]: ×_{vⱼ ∈ Z} s(vⱼ) → Q,

such that [f ↓ Z](β) = g({f(α) | α ≻ β}), where the function g is determined by the nature of the function f. For instance,

[f ↓ Z](β) = Σ_{α ≻ β} f(α)

when f is a probabilistic function [CAVA 81b].

One final definition is needed before moving on to the notion of an unbiased reconstruction. Any system may also be viewed as a subsystem of an overall system. Given a system B, as defined above, a collection of q subsystems is defined as

S = {ᵏB} = {(ᵏV, ᵏW, ᵏs, ᵏA, ᵏQ, ᵏf) | k ∈ {1, 2, ..., q}}.

Elements of the set S are referred to as subsystems of B if and only if, for each k, the elements satisfy the following conditions:

• ᵏV ⊆ V;
• ᵏW ⊆ W such that ᵏs is onto;
• ᵏs: ᵏV → ᵏW such that ᵏs(vⱼ) = s(vⱼ) for each vⱼ ∈ ᵏV;
• ᵏA = ×_{vⱼ ∈ ᵏV} ᵏs(vⱼ);
• ᵏQ = Q;
• ᵏf = [f ↓ ᵏV] [CAVA 81b].

We can also use the concise definition of a subsystem using the terminology of Jones. Given an overall system B with a behavior function f and a subsystem ᵏB, the behavior function ᵏf must satisfy the following condition:

ᵏf(β) = Σ_{β ≺ α} f(α),

where β ∈ ᵏA and β ≺ α (β is a substate of α) [JONE 82].
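For a probabilistic f, this projection is simply a sum of f over the superstates of each substate. The following is a minimal sketch using the system of Figure 1 and Z = {v₁, v₂}; the helper name and layout are illustrative assumptions, not notation from [CAVA 81b].

    # Illustrative computation of the projection [f ↓ Z](β) = Σ f(α) over α ≻ β,
    # for the probabilistic system of Figure 1 with Z = {v1, v2}.
    from itertools import product

    f = dict(zip(product((0, 1), (0, 1), (0, 1, 2)),
                 [0.1, 0.1, 0.0, 0.2, 0.1, 0.2, 0.1, 0.0, 0.0, 0.1, 0.0, 0.1]))

    def project(f, keep):
        """Sum f over all superstates of each substate beta on the variable positions in `keep`."""
        proj = {}
        for state, value in f.items():
            beta = tuple(state[i] for i in keep)
            proj[beta] = proj.get(beta, 0.0) + value
        return proj

    print(project(f, keep=(0, 1)))   # e.g. beta = (1, 1) gets f(110) + f(111) + f(112) = 0.2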
1.1.3 Unbiased Reconstructions

Based on the definitions provided so far, we can now review the idea of unbiased reconstructions. Given an overall system, there is a family of reconstructions that are compatible with this system. Given a particular reconstruction, there also exists a family of overall systems that are compatible with it [CAVA 81a]. It is desirable to select one member of the family of reconstructions, say fₛ, to represent the overall system, f, and this selection requires some assumption which will justify this choice. When the overall system f is representing a probability distribution function, the principle of maximum entropy can be invoked to select that function which is maximally non-committal to all matters except the requirements that

[fₛ ↓ ᵏV] = ᵏf = [f ↓ ᵏV]    for all k ∈ {1, 2, ..., q} [CAVA 81b].

The use of the principle of maximum entropy for selecting the function from the family of possible reconstructions has been justified by the following arguments:

• The maximum entropy probability distribution is the only unbiased distribution, that is, the only distribution which takes into account all available information about the system, but no additional information [CAVA 81b].
• The maximum entropy probability distribution is the most likely distribution. Given a reconstruction hypothesis, each element of the reconstruction family of that hypothesis could have been generated by any number of actual data sets. The largest number of possible data sets which are mutually comparable and compatible with the given reconstruction hypothesis are those which are also compatible with the maximum entropy overall probability distribution [CAVA 81b].
• Maximizing any function but entropy leads to inconsistencies unless that function has the same maxima as entropy [CAVA 81b].
• Every real world system can be represented by the maximum entropy reconstruction because joining the subsystems that represent the real world system always results in the maximum entropy distribution [CAVA 81b].

In addition to these arguments, Pittarelli has evaluated the use of the maximum entropy principle in detail and finds that while there are qualifications for the arguments provided above, these arguments along with others provide compelling evidence that justifies the use of maximum entropy distributions [PITT 89].

1.2 Reconstructability Analysis Algorithms

Cavallo and Klir provided several algorithms for computing the unbiased reconstruction, but it was not until the mid-1980s, when Bush Jones invented a more efficient algorithm, that implementation became a reality. Jones also created a greedy algorithm for a generalization of the reconstruction problem.

The first Jones algorithm is for the determination of unbiased reconstructions. It makes use of only independent states for determining the reconstruction. Also, the algorithm is general in that it can be employed on arbitrary collections of states and substates. This makes it especially useful for the greedy algorithm for the reconstruction problem. This algorithm was proven to converge and provides the maximum entropy solution to the reconstruction hypothesis [JONE 85a].

Jones also provided a greedy algorithm for the generalization of the reconstruction problem. The generalization of the problem can be stated as follows: "Given an overall system B with known probabilistic behavior function f and hence known behavior functions ᵏf for the set of substates {β}, determine a subset of {β} of given size or whose unbiased reconstruction is within acceptable tolerance of f to represent the system." [JONE 85b] The algorithm will work on an independent set of substates or on the complete set. While the independent set of substates is guaranteed to avoid using redundant information, the complete set may provide faster or more compact reconstructions.

The algorithm works on the set E, which is the set of all substates that are being used for the reconstruction (either independent states or all states). Let D represent the set of substates currently used for the reconstruction during execution of the algorithm, and let U(D) → f_U represent the computation of the unbiased reconstruction f_U for the substates of D. Finally, let γ(.) represent the function which is used to select the next substate to be included in the reconstruction; the greater the value of γ(β), the more desirable that substate is for inclusion into the reconstruction. The algorithm starts by initializing f_U to a flat distribution and letting D be initially empty. It selects one β to add to D such that γ(β) is a maximum and computes the unbiased reconstruction U(D) for the new D. If either a size limit on D is exceeded or if the approximation is sufficiently close to the true f, then the algorithm stops. Otherwise it continues to add substates to the set D until one of the two criteria is satisfied.
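The selection loop just described can be sketched as follows. This is an illustrative outline only, not Jones's published implementation: unbiased_reconstruction stands in for U(D), the maximum entropy computation of [JONE 85a], and gamma, distance, max_size and tolerance are hypothetical parameters assumed to be supplied by the analyst.

    # Illustrative sketch of the greedy substate-selection loop.
    def greedy_reconstruction(candidates, f_true, unbiased_reconstruction, gamma,
                              max_size, tolerance, distance):
        D = []                                # substates currently used in the reconstruction
        f_u = unbiased_reconstruction(D)      # with D empty this is the flat distribution
        remaining = list(candidates)          # the set E of candidate substates
        while remaining:
            best = max(remaining, key=lambda beta: gamma(beta, f_u, f_true))
            remaining.remove(best)
            D.append(best)
            f_u = unbiased_reconstruction(D)  # recompute U(D) for the enlarged D
            if len(D) >= max_size or distance(f_u, f_true) <= tolerance:
                break                         # size limit reached or approximation close enough
        return D, f_u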
Using these two algorithms to form the reconstruction that approximates the overall distribution is what is known as reconstructability analysis. It provides the list of substates, in order of influence, that have the most effect on system behavior.

1.3 Reconstruction of General Functions

With the two preceding algorithms in hand, reconstructability analysis was fully defined and could be effectively applied to any probabilistic behavior function. The results of such an analysis would be correct for the information given and would introduce no extraneous information to the analysis. In order to utilize reconstructability analysis for non-behavior systems, Jones invented a transformation which can be applied to practically any multivariate function on discrete variables. Any general system (or g-system) can be transformed to an isomorphic Klir system (or K-system) which can then be analyzed using reconstructability analysis.

We begin by defining the g-system and will make use of the definitions of states and substates from the previous sections. First, associated with a g-system is a general behavior function f(α). If A is the set of all possible states of a system and R⁺ is the set of positive real numbers, then f: A → R⁺ is the function that represents the information that is associated with the system states. Note that, without loss of generality, we restrict the function values to the positive real numbers, since these values may be easily scaled to R⁺. Two additional definitions are necessary before we can define a g-system. There is a set of functions defined for each subsystem:

ᵐf(β) = Σ_{α ≻ β} f(α), where m uniquely identifies a substate.

And a parameter:

τ = Σ_{α ∈ A} f(α).

Now we can define a g-system as a six-tuple:

(τ, {vᵢ}, {α}, {β}, f(.), {ᵐf(.)})

where, as defined above, (1) τ is a parameter; (2) {vᵢ} is a set of variables; (3) {α} is a set of states; (4) {β} is a set of substates; (5) f(.) is a function on {α}; (6) {ᵐf(.)} are functions on {β}. [JONE 85c]

Figure 2 is an example of a g-system. Note that it is a discrete-valued multivariate system.

v₁:   0     0     0     0     0     0      1     1     1      1     1     1
v₂:   0     0     0     1     1     1      0     0     0      1     1     1
v₃:   0     1     2     0     1     2      0     1     2      0     1     2
f(α): 3.70  6.10  9.50  3.70  7.10  14.50  3.70  6.40  10.20  3.70  8.40  16.00

ᵛ¹ᵛ²f(11) = f(110) + f(111) + f(112)    τ = 93.0

Figure 2: A g-system

Note that this particular g-system is completely defined in that all possible states in the system also have an associated system function value. The definition of a g-system does not require this property, but it will be shown that it is highly desirable.

With this definition of a general system, we can now define a K-system. The first part of the transformation to a K-system makes use of the parameter τ that we defined above. A normalization is performed that removes the units from the system, and we define the function:

k(α) = f(α)/τ, ∀α.

This converts the function f(α) to a dimensionless system, k(α), and makes the following properties true for the K-system:

0 < k(α) < 1, ∀α, and Σ_α k(α) = 1. [JONE 85c]
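This normalization is simple enough to sketch directly. The following illustrative code applies k(α) = f(α)/τ to the g-system of Figure 2 and checks the two properties just stated; the variable names are assumptions made for this sketch.

    # Illustrative sketch of the g-system to K-system normalization k(α) = f(α)/τ,
    # using the g-system of Figure 2.
    from itertools import product

    f = {s: v for s, v in zip(product((0, 1), (0, 1), (0, 1, 2)),
                              [3.7, 6.1, 9.5, 3.7, 7.1, 14.5,
                               3.7, 6.4, 10.2, 3.7, 8.4, 16.0])}

    tau = sum(f.values())                        # τ = 93.0
    k = {state: value / tau for state, value in f.items()}

    # The K-system has the probability-like properties required by the RA algorithms.
    assert all(0.0 < v < 1.0 for v in k.values())
    assert abs(sum(k.values()) - 1.0) < 1e-9

    # Substate functions transform the same way, e.g. for m = (v1, v2) = (1, 1):
    mk_11 = k[(1, 1, 0)] + k[(1, 1, 1)] + k[(1, 1, 2)]   # = (3.7 + 8.4 + 16.0) / 93.0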
[JONE 85c] a It is important to note that these properties are the same properties that a probabilistic system has, but that this transformation does not make this into a probabilistic system. It creates a system with sufficient properties that the probabilistic reconstructability analysis algorithms may be applied. Finally, another set o f functions is defined: "*(P)S £ * ( “ )• a>fJ Now we define a K-system: (x, {v/}, {a}, (P},k(.), {” « .)}) where as defined above (1) t is a transformation factor; (2) {v/} is a set o f variables; (3) {a} is a set o f states; (4) {P} is a set o f substates; (5) k(.) is a function on {a}; (6) {mk(.)} are functions on {P}. [JONE 85c] Figure 3 is an example of a K-system. It is the g-system from figure 2 shown above after the transformation has been performed. The g-system and the K-system are isomorphic in the sense the they both contain the same system information. The system function was scaled so that it has the Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. V/ 0 0 0 0 0 0 1 1 1 1 1 1 V2 0 0 0 1 1 1 0 0 0 1 1 1 v3 0 1 2 0 1 2 0 1 2 0 1 2 17 m m 3.70 6.10 9.50 3.70 7.10 14.50 3.70 6.40 10.20 3.70 8.40 16.00 0.04 0.07 0.10 0.04 0.08 0.16 0.04 0.07 0.11 0.04 0.09 0.17 v/ v2*(l 1) = *(110) + *(111) + *(112) t=93.0 Figure 3: A K-system properties stated above. The relationships between the states and substates remain the same, so the results o f reconstructability analysis performed on the ksystem clearly map directly back to the g-system. The information is never modified, nor is any information removed or added to the system. Again, it should be noted that this K-system is also completely defined as was noted above for the corresponding gsystem. 1.4 Reconstructions with Arbitrary Data With the advent o f the K-system transformation, reconstructability analysis could now be performed with general functions. While this provided the ability to analyze a greater number o f function types, there still existed possible impediments in real world data that would prevent the use of reconstructability analysis. The general functions defined in the previous section still required that the variable values be discrete; continuously valued variables were not allowed. While it is trivially true that a finite data set has a finite set o f discrete values for each variable, reconstructability Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. analy sis also requires that the variable values be repeated as shown in figures 2 and 3. If the values are not discrete, we have an impediment referred to as data scattering [JONE 85e]. The reconstructability analysis algorithms also require that the data is complete; that is, there must be a system function value for every possible combination o f variable values (states). I f the system under consideration has possible states that do not have system function values, this is referred to as the problem of missing data. Finally, in real world data there may be redundantly defined states in the sense that there are repeated states in the table and these redundant states may have different function values; this is the problem referred to as state contradictions. Overcoming these impediments in real world data is at the core o f the research reported here. 
Jones provided basic techniques for overcoming these impediments in [JONE 85d], but while the methods allow reconstructability analysis to proceed, there are significant questions about their use and effectiveness. The focus o f this section will be on the problems o f data scattering and missing data. The problem of state contradictions has been adequately and reasonably addressed [JONE 85d]. An example will be provided and the method for resolving the contradiction will be described. In general, these three problems should be addressed in the following order. First, data scattering should be resolved, then state contradictions, and finally the problem o f missing data [JONE 85d]. Our focus here is on data scattering and missing data, so we will deal briefly with the state contradiction problem first and then move on to the core of the research. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 1.4.2 State Contradictions State contradictions occur when there are multiple system function values for the same state. For example, suppose we have a system that has three variables, each of which can take the values o f 0 or 1. Further, suppose that for the state where v,=0, v2 =1, and v3 = 1 (or more concisely, state 001), we have two different values for the system function. We may take an average of the two values and use that single value to represent the state system function value. This does not add any information to the system, it condenses redundant information for a single state. Rather than being an impediment to reconstructability analysis, it can add new dimensions. It is also possible to use other common statistical techniques besides the mean, such as the mode, median, maximum or minimum[JONE 85e]. The analyst is free to choose whatever method that is appropriate for the data being analyzed. It is important to note that while some information is possibly, in some sense, “lost” when handling state contradictions, there is no information added to the system. 1.4.3 Data Scattering Data scattering refers to the lack o f repeated distinct values for each variable in a system that is submitted for K-systems analysis. This problem can best be illustrated by an example from [JONE 85d] reproduced here as figure 4. Inspection of the table provides a solution to data scattering for this particular set of data. One can see that vj is taking the approximate values o f {3, 7}, V2 takes the values {9,3} and yj is taking the values o f {9,6, 7}. Relabeling the states by Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 20 V/ 12 6.9 6.7 7.1 6.9 7.5 2.8 3.1 3.2 2.7 2.7 2.7 V2 3.1 2.9 2.7 9.2 8.9 8.6 3.3 2.9 3.0 8.7 8.9 9.2 v3 9.4 6.2 7.1 9.2 5.9 6.9 9.3 5.7 7.0 9.2 6.3 7.1 ft) 3.7 6.1 9.5 3.7 7.1 14.5 3.7 6.4 10.2 3.7 8.4 16.0 Figure 4: Data Scattering Example using these values as keys and replacing the actual values with 0, 1, or 2, as appropriate gives us the table shown in figure 5 which is in the exact form required for reconstructability analysis. There are many one dimensional clustering techniques which can be applied that will resolve the problem o f data scattering in this example. When the data is not quite so clearly grouped, the technique applied here may not give as consistent and understandable results as those shown here. 1.4.4 Missing Data The final problem that must be resolved with real world systems is missing data. 
Up to this point we have only considered systems which are complete; that is, there is a system function value for every possible state or combination o f variable values. Previously, Jones has defined two methods for conducting K-systems analysis when there is missing data [JONE 85d, JONE 89]. One is to modify the algorithms to Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 21 ensure that only known states are considered in the analysis and the other is to use an “entropy fill” to provide values for the missing states. V/ 0 0 0 0 0 0 1 1 1 1 1 1 v2 0 0 0 1 1 1 0 0 0 1 1 1 VJ 0 1 2 0 1 2 0 1 2 0 1 2 f(.) 3.7 6.1 9.5 3.7 7.1 14.5 3.7 6.4 10.2 3.7 8.4 16.0 Figure 5: Resolution o f data scattering First, we will review the method of using only known states. This technique is based on the idea that reconstructability analysis can only be as good as the data which is submitted for analysis. Given the general algorithm for the reconstructability problem, we need not use the complete set of states; we only use those states that exist and provide the most information for determining system behavior. The only changes required to the algorithms are in the computation of the transformation factor, x, and the calculation o f the K-system function k,. First, x is calculated for only those states which are known, T= Z /(° 0 ajmown The function values for k are then calculated as, Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 22 k (a )= ^~a ^,a known, T and the corresponding K-system has the property that, ^ k ( a ) = 1. [JONE 85d] a jb to w n An important point is that the values o f k(a) for missing states are treated as being unknown, not simply being set to zero. The formation o f equivalence classes may then proceed once the states are relabeled in order to ensure that we use the most information possible given the data. Relabeling is done for each variable, where the variable value that occurs most frequently in the data is relabeled as zero and the rest of the values may then be relabeled arbitrarily. Once this has been done, the equivalence classes can be formed as in [JONE 85d] and this was shown to minimize the number o f missing equivalence classes. The greedy algorithm may then be executed on this data considering only the existing classes and the analysis will account for only the data which exists. Since the equivalence classes only used known states and the K-system function was constructed using only these states, no information is introduced into the system and the analysis is correct for the given data. The second method for addressing missing state data is known as the entropy fill. This is a matter o f using the overall mean o f the known states as the system function value for those states which are unknown. The reason that we wish to fill in these missing states is two fold. First, we may wish to predict the behavior o f the system state which is missing, The previous technique ignores the missing data and can provide no prediction about the system behavior for the missing state. Secondly, Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. we may want to determine the interaction effect of the variables for a missing state. That is, we would like to know what portion of the effect o f a state is due to the interaction o f the variables and what portion is due to the variables acting alone [JONE 89]. 
1.5 The Importance O f Reconstructions With Arbitrary Data The ability to apply K-systems analysis to arbitrary data greatly expands the set o f systems that can be analyzed using only the maximum entropy mathematics embedded within reconstructability analysis. Previously, only systems which were explicitly probabilistic could be addressed. With the advent o f the K-system transformation, a greater number of systems could be analyzed, but without the resolution o f the impediments in real world data the systems which could be analyzed were still limited to only discretely valued systems that were completely specified. Overcoming these impediments allows the formalities o f reconstructability analysis to apply to practically any multivariate system. This enables the correct model o f the system to be induced without making any o f the assumptions o f classical statistical analysis [JONE 86]. While the methods described above enable reconstructability analysis for a variety of systems, there are limitations that can be overcome so that the results of the analysis are more meaningful. 1.5.1 Missing Data, Known States, and the Entropy Fill A reconstruction using only known states is correct for the data as given, but there are aspects o f the system that are missed. In particular, predictions o f the effects o f missing states and the interaction o f variables are not possible when using only Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. known states [GOUW 96]. Also, there is information that exists in the system substates that may be used for determining the values o f the missing states. By using the entropy fill technique, these limitations are overcome insofar as predictions may be possible, but the assumption that all missing states assume the overall system mean is, both, too restrictive and too broad. The entropy fill technique will reduce the amount o f variability that the system displays. Obviously, due to the nature o f the variance calculation, the variance of a system may be greatly effected by using the same mean value for every missing state. While the effect will not be very significant if only one state is missing, if there are a large number o f missing states, the system variability and, by extension, the entropy o f the total system will be greatly reduced. In effect the system is “flattened” by assuming the overall mean. The reconstruction will then be weighted toward the mean, and the dynamics o f the system will be subdued. The entropy fill technique is used in order to minimize the amount of information that is added to the system. We propose that the amount o f information added to the system when replacing missing states is the same regardless o f the system function value used. That is, if we add one state to the system, we have added a set quantity o f information. The system function value associated with the state added provides the meaning o f that state, but does not effect the quantity o f information added. Each state that is added to the system increases the amount o f information available about the system exactly the same amount Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 1.5.2 Data Scattering and Clustering Real world data about a system will, more often than not, have scattered variable values. 
Even in the case of a designed experiment, where a finite discrete set of variable values is selected, the ability to control the values that the variables take is limited. For example, one may wish to use preset temperatures o f 25 and 50 to control a chemical process, but given the nature o f temperature controls it is possible that temperatures of {24.8, 25.2, ...} and {51.1, 50.2, ...} will actually be recorded. The clustering technique outlined above will adequately handle this specific case, but situations may arise where the solution to the problem is not so clear cut. An example o f a more difficult situation is the following. Suppose we have a variable, y/, that takes the following values: {0.9, 1.0, 1.1, 2.8, 2.9, 3.0, 3.1}. Upon inspection it appears that this variable is taking two different values, approximately {1.0, 3.0}. Any good clustering algorithm would give the same results for this one dimensional system. Let us also suppose that there are system function values associated with each variable value as shown in figure 6. If we only cluster the data based on the variable values, we miss the obvious fact that by using the system function as a second dimension we have three clusters as shown by figure 7. VI 0.9 1.0 1.1 2.8 2.9 3.0 3.1 ft) 5.0 4.8 5.2 6.0 12.0 6.1 11.9 Figure 6: Data for two-dimensional clustering Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 14.0 12.0 10.0 8.0 - - - - - A-) 6 0 ■ 4.0 -2.0 - - 0.0 0.0 0.5 1.0 2.0 1.5 2.5 3.0 3.5 vi Figure 7: Two dimensional clusters The reason that there are three clusters is not clear by looking at this data, but clearly there seem to be three. There could be another variable that interacts with this variable or there could be a missing variable. What can be done is for the data to be clustered into three clusters, label each cluster as a variable value, and proceed with the rest of the pre-processing and K-systems analysis. The results of the analysis will tell us which clusters are having the greatest effect in determining system behavior. The first aspect o f this research related to data scattering is to develop a meaningful understanding of how to incorporate two dimensional clusters in reconstructability analysis. While we can directly use two dimensional clusterings in reconstructability analysis, it is not clear how to interpret the results. The second aspect of this part o f the research is to develop a clustering algorithm which is congruent with the overall spirit of reconstructability analysis; that is, the idea that we use only information contained in the data, the formalities o f information theory and Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. the principle o f maximum entropy. Currently, most clustering algorithms assume a model and then fit the data to that model. We provide a method where the variable values are clustered using only the principle of maximum entropy and the formalities o f information theory. 1.6 Existing M issing Data and Clustering Algorithms There are many existing algorithms for imputing missing data and even more algorithms for clustering data. This section will provide a brief review o f existing techniques and place the new methods derived here into specific contexts. 
1.6 Existing Missing Data and Clustering Algorithms

There are many existing algorithms for imputing missing data and even more algorithms for clustering data. This section provides a brief review of existing techniques and places the new methods derived here in context. The goal is not to review every existing technique, but to identify possible existing alternatives to those currently used in K-systems analysis and to establish a context for the new techniques reported here. Attention will be directed to the limitations and assumptions of the algorithms reported in the existing literature that helped to motivate the development of new methods.

1.6.1 Missing Data Algorithms

Statistics has a long history of analysis in the presence of missing data. Additionally, a large amount of research has been reported in which missing data is handled in an ad hoc fashion. Most of the literature that specifically addresses missing data is fairly recent, mostly dating from 1970 onward. A few review papers present a comprehensive overview of methods applied for statistical analysis when data is missing [HART 71], [ORCH 72], [LITT 87].

Procedures for dealing with missing data can be grouped into a few non-mutually exclusive categories [LITT 87]. First, there is the simple method of using only units of data that are completely specified. Any incomplete or missing data is discarded and analysis proceeds using only those data units that are complete. Second, there are imputation-based procedures in which any missing data is filled in based on the existing data. Common methods for imputing missing data include hot deck imputation, where values are randomly selected from existing data to fill in the missing data; mean imputation, where the overall mean is used for all missing data units (this is the entropy fill); and regression imputation, where missing values are imputed based on a regression of the existing data. Third, there are weighting procedures that are used for non-response in sample surveys. These address missing responses for some portions of the data unit, where the design weights assigned to possible choices prior to a survey are re-weighted afterwards to adjust the results for non-response. The final class of procedures falls into the category of model-based missing data procedures. This is perhaps the most actively investigated type of missing data procedure. This class of procedures is based on defining a model for the partially missing data and basing inferences on the likelihood under that model. The focus of model-based procedures is on estimating parameters, such as the mean and the variance, related to the total set of data. Missing data values are not imputed in this approach, but the resulting analysis accounts for the missing data in the estimation of the overall parameters.

K-systems analysis requires some form of imputation, since a complete set of states is required for the full power of K-systems analysis to be used. K-systems analysis can proceed without using the missing data, as in the first class of missing data procedures; only known states are used and the rest are treated as missing. This provides an incomplete analysis of the system and, in particular, the interactions of the variables in states and substates will not be found. As we noted above, the entropy fill technique is a specific instantiation of an unconditional mean imputation procedure from the second class of missing data procedures.
Mean imputation suffers from a number of problems, the most serious of which is the distortion of the empirical distribution due to the under-prediction of the variance. The final two classes of procedures, weighting and model-based, are specifically the type of procedures that we wish to avoid. First, each assumes that the data being analyzed fits some specific model, and predictions of the effects of the missing data are then made based on this model. K-systems analysis assumes no model; only the information contained within the data is used to induce the correct model. Second, these procedures do not provide values that may be used for K-systems analysis. Both account for missing data so that summary parameters about the system may still be calculated and the results will be consistent with the model that is assumed to underlie the data being analyzed.

There is one existing imputation procedure that is similar to the technique proposed here. It is referred to as nearest neighbor hot deck imputation [SAND 83]. It consists of defining a metric to measure the units based on the values of the covariates and then selecting a value from the set of closest units for the missing value. This is similar to the method proposed here, except that here the distance is defined on the states and no search is conducted for the closest states; the closest states are defined by the structure of the system. Also, no single closest state is used to impute the missing values; instead, an unbiased estimator over all closest states is used for the imputed values.

1.6.2 Clustering Algorithms

There is extensive literature on clustering, and many different methods and techniques are available. Clustering is often referred to as a form of unsupervised learning or self-organization [BEZD 92]. Much of the current work related to clustering is in the field of fuzzy clustering. A clustering algorithm classically produces a hard clustering; each data unit is assigned to a single cluster. In fuzzy clustering, the general idea is to assign each data unit some level of membership in each cluster, in accordance with the principles of fuzzy logic [ZADE 65]. The focus of the research here is to generate hard clusters so that the results of the clustering can be directly applied in K-systems analysis. Even if a fuzzy clustering procedure is applied to a data set, virtually all fuzzy partitions can be hardened by applying some method of determining the maximum cluster membership for each point.

Clustering algorithms may be divided into a number of groups based on the manner in which the clusters are formed. There are joining algorithms, where each data unit is initially considered to be in a separate cluster and clusters are then iteratively joined based on a measure of similarity. Alternatively, there are splitting algorithms, where all of the data units are initially assigned to a single cluster and this initial cluster is then divided into smaller clusters. There are also switching-style algorithms, where initial clusters are formed and data units are then switched between clusters in order to optimize some criterion.
While this classification of clustering algorithms is by no means complete, it demonstrates the wide variety of clustering techniques that have been applied. There is extensive literature about various clustering algorithms and their application to specific fields. Some general references for clustering include [ANDE 73], [HART 75], [EVER 93], and [BEZD 92]. There are too many different clustering algorithms available to attempt an overview of all of them, but two recent algorithms that are relevant to the results reported here will be briefly reviewed. Both algorithms are based on the principle of maximum entropy, and the similarity measure and algorithm proposed here are also based on this principle.

One is referred to as a least bias clustering algorithm in that it selects cluster centroids such that there is no initial bias towards any of the points [BENI 94]. It then proceeds to make each data point iterate towards one of the cluster centroids. The number of clusters is determined by varying a resolution parameter between zero and one and counting the number of clusters that results over all values; the number of clusters that results most often is considered the most probable number of clusters. One possible problem with this algorithm is that it assumes that the entropy can be modeled by a particular type of distribution and uses this assumption to generate the results. Also, as is typical of clustering algorithms, the results are reported for two relatively simple cases and are judged to be perceptually correct.

The other algorithm uses a simulated annealing approach for determining the clusters and implements the principle of maximum entropy based on the same distribution used in the preceding algorithm [HOFM 97]. This algorithm is based on the use of pairwise measures of similarity between all data points. It is similar to the method proposed here in that it is also based on the principle of maximum entropy and includes a pairwise measure of similarity. The annealing algorithm is able to use a similarity measure that is not a true metric and then applies an annealing procedure to maximize the entropy as approximated by a distribution. There is no requirement placed on the type of similarity measure that may be used.

Both of the preceding algorithms are based on the principle of maximum entropy, but approximate the maximum entropy distribution by assuming a particular type of distribution, the Gibbs distribution. The work reported here does not use an approximation to the entropy, but instead calculates the entropy directly, based solely on the data as presented to the algorithm. Both algorithms use the principle, but neither uses the underlying mathematics of information theory to determine the answer. There are many other clustering algorithms based on many other measures and assumptions. The two briefly outlined here are closest to the approach presented here, and the work in [HOFM 97] may be used within the context of the entropy similarity measure that will be described in chapter 3.

Chapter 2. Incomplete Systems: Missing Data

The impediments of missing data and data scattering are actually the same problem viewed from two different perspectives.
Missing data corresponds to the belief that all the variables in the system are non-continuously valued. Even if the values that a variable takes are listed as {1.2, 1.3, 1.4, 2.2, 2.3, 2.4}, these values are already discrete and do not require any type of clustering or clumping together of values. If these variable values, and the other variables and their values, are all viewed as deriving from a discrete-variable system, then there may be some combinations of variable values (states) that are missing. Alternatively, if these values are believed to have been derived from a system whose variables are continuously valued, the values are viewed as being scattered. If this belief is true, then it is perfectly valid to cluster these variable values into groups of points centered around 1.3 and 2.3. These values may then be used in place of the original values, and analysis, of any type, may proceed.

It is useful to consider the two problems of data scattering and missing states as a single impediment to the use of K-systems analysis. This single problem may be readily viewed as having an incomplete system. Suppose that the data shown in Figure 8 is submitted for analysis. Both of the views described above are possible based on inspection of this data: either the values for variable v1 are scattered, or a significant number of states involving the values of variable v1 are missing from this system.

v1    1.3   1.2   1.4   1.4   1.2   1.3   2.4   2.2   2.4   2.3   2.3   2.2
v2    0     0     0     1     1     1     0     0     0     1     1     1
v3    0     1     2     0     1     2     0     1     2     0     1     2
f(α)  3.7   6.1   9.5   3.7   7.1  14.5   3.7   6.4  10.2   3.7   8.4  16.0

Figure 8: Example of an incomplete system

Without some prior knowledge about the source of this data, it is possible to make very poor decisions about how to resolve this incomplete system. The question that must be asked is how a complete system can be induced from the existing system that is consistent with the known data, yet adds only minimal assumptions about the source from which the data is derived. While a completely general answer that will apply to all possible situations is unlikely, it is possible to develop algorithms that use only known information about the system and add minimal information based solely on the existing system information.

One part of the research presented here is to apply the notion of the distance between states in order to find a better value to use for a missing state. Instead of using the overall system mean, we use a local mean based on those states that are close to the missing state. Since we are adding the same amount of information (the same number of states) to the system as the entropy fill, we do not add more information to the system. By using only local data to determine a value for the missing state, we can calculate "better" values; values that more accurately capture the dynamics of the system. This will enable better predictions of the effects of missing states on the reconstructed system and more accurate calculations of interactions. The second part of the research, reported in chapter 3, is related to using the overall entropy of the data and the calculated clusters for determining the appropriate clustering.
Starting with the basic definition of information entropy, a similarity measure will be developed for use in a density-based clustering algorithm.

2.1 Systems, Structure and Missing Data

First, the problem of incomplete systems will be addressed from the point of view of missing data. This section develops a method whereby only the existing information is used, along with the existing structure provided by the information contained in the system. The structure of a system will be defined based solely on the information present in the data, and this structure will then be used to fill in, or impute, the missing system function values.

A general system can be defined as consisting of a structure system that consists of the set of variables, V, and the set of values that each variable may be assigned, Vi; this also defines the set of states, A. So the structure of the system is completely defined by the states that exist in the system. Associated with each state is a system function value, which may be interpreted as giving the state some meaning. That is, for an arbitrary structure system there may be many different system functions. These system functions describe the meaning of the structure, and without the system function the structure system has no meaning. We place no constraints on the structure system other than requiring each state to be complete in itself; we do not explicitly consider substates. This means that if there are three variables in the structure system, each existing combination of variables consists of a value for each of the three variables. Each variable may take values from any arbitrary, finite set, whether it is a set of real numbers, integers, natural numbers, or some arbitrary set of values (e.g., {red, green, blue}). Also, the variables are not required to be homogeneous; each may take different types of values. Figure 9 is an example of an arbitrary structure system and the associated system function values.

v1    red    blue   red    green  green  red    blue
v2    0.01   0.12   0.12   0.12   0.12   0.01   0.01
v3    2      15     2      15     2      15     2
f(α)  3.7    6.1    9.5    3.7    7.1    14.5   16.0

Figure 9: Example of a structure system

This system may be viewed as the subset of the complete structure system where the values of the system function are known. From the known structure system, we wish to impute values for the complete structure system from the parts of the structure that are currently known. The first step in inducing a complete system is to identify the overall structure system based solely on the information given. The variable values for each variable are relabeled so that all types of variables may be handled in the same manner. Each variable value is relabeled by counting the distinct values starting from zero and then relabeling each value with its index, assigning the same index to values that are the same. The example in Figure 9 is thereby transformed as shown in Figure 10. Note that this transformation is an isomorphism, in that one system maps directly to the other and it neither adds information to nor removes information from the original system.

v1    0    1    0    2    2    0    1
v2    0    1    1    1    1    0    0
v3    0    1    0    1    0    1    0
f(α)  3.7  6.1  9.5  3.7  7.1  14.5 16.0

Figure 10: Example of a relabeled structure system
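The relabeling step can be made concrete with a short sketch (Python; the helper name relabel is ours, introduced only for illustration). It maps each variable's distinct values to the indices 0, 1, 2, ... in order of first appearance, which is one straightforward way to realize the counting described above, and it reproduces the transformation from Figure 9 to Figure 10.

    # A minimal sketch of the relabeling step (the helper name `relabel` is ours).
    # Each variable's values are replaced by integer indices assigned in order of
    # first appearance, which reproduces Figure 10 from Figure 9.
    def relabel(values):
        index = {}
        out = []
        for v in values:
            if v not in index:
                index[v] = len(index)   # next unused index, starting from zero
            out.append(index[v])
        return out

    v1 = ["red", "blue", "red", "green", "green", "red", "blue"]
    v2 = [0.01, 0.12, 0.12, 0.12, 0.12, 0.01, 0.01]
    v3 = [2, 15, 2, 15, 2, 15, 2]

    print(relabel(v1))   # [0, 1, 0, 2, 2, 0, 1]
    print(relabel(v2))   # [0, 1, 1, 1, 1, 0, 0]
    print(relabel(v3))   # [0, 1, 0, 1, 0, 1, 0]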
This relabeling results in a structure system that is similar in form to the original behavior systems defined in the previous chapter. The transformation also allows the computation of the total number of states in the structure system, namely the product of the number of values that each variable takes or, equivalently, the product of the cardinalities of the sets of variable values. For the example currently under discussion, there are a total of (2)(2)(3) = 12 possible states. Given the seven existing states in the example, we are therefore missing five states.

Previously, the entropy fill was used to find values for the missing states by calculating the average of all known states and using this value for all of the missing states. While this technique provided a method for imputing the missing function values, it created a system that did not capture the full dynamics of the original system: by using the value of the mean repeatedly, it reduced the overall variability of the system. This was due, at least in part, to the assumption that all missing states are similar to the system mean regardless of the structure of the system. This is the same as assuming that every state in the system is equally similar to every other state; there is no way to distinguish one particular state as being more similar to another state than to any other state in the system. The algorithm described here will use the existing structure and associated information of the given system to impute the values for missing states.

2.2 Distance Between States

In order to further illuminate the structure of a system, a distance function will be defined on the set of states. This distance function is applicable to any set of states and will be shown to be a true distance function, or metric, because it satisfies the identity and symmetry axioms and the triangle inequality. By defining a distance function for states, it becomes possible to use the resulting structure of the system to impute values for missing states.

Without loss of generality, states will be represented by fixed-length strings, where the length of the string is the number of variables and each position in the string may take one of the values from the corresponding variable's set of values. The ordering of variables in the string is fixed yet arbitrary; changing the order of the variables does not affect the results presented here, so long as the ordering of the variables is constant throughout the application of the algorithm. Using this representation, the set of known states from the previous example is {000, 111, 010, 211, 210, 001, 100}. The distance between two states is defined in a fashion similar to the Hamming distance; that is, the distance between two states is the number of positions in the state strings where they differ. The only difference from the standard Hamming distance is that the values available for each position are not restricted to the set {0, 1}. Let the distance between two states, α1 and α2, be denoted d(α1, α2).
For example, if we have the states α1 = 211, α2 = 210, and α3 = 101, then d(α1, α2) = 1, d(α1, α3) = 2, and d(α2, α3) = 3. This definition of distance is a true metric in the sense that it satisfies the following properties:

1. d(α1, α2) = 0 if and only if α1 = α2 (the identity axiom)
2. d(α1, α2) = d(α2, α1) (the symmetry axiom)
3. d(α1, α3) ≤ d(α1, α2) + d(α2, α3) (the triangle inequality)

The identity and symmetry axioms are obviously true based on the definition of the state distance. The triangle inequality could be assumed to hold because the state distance is similar to the Hamming distance, but here the states are not restricted to variables that take only the values 0 or 1; therefore, we provide the following proof that the state distance satisfies the triangle inequality.

Proof. Suppose that we have three states, α1, α2, and α3, such that α1 ≠ α2 ≠ α3 and that d(α1, α3) = k; that is, the string representations of α1 and α3 differ in k positions. Assume that there exists an α2 such that d(α1, α2) + d(α2, α3) < k, and let d(α1, α2) = q. Then α1 can be transformed into α2 by changing q positions, and changing a single position can only leave the distance to α3 the same (if a position where they already differ is changed to a value where they still differ) or, at best, decrease that distance by one (if the one differing position is made the same). Since d(α1, α3) = k, after q single-position changes the distance to α3 can be reduced to at best k − q, so d(α2, α3) ≥ k − q. The assumption d(α1, α2) + d(α2, α3) < k therefore implies q + (k − q) < k, that is, k < k, which is false. Therefore d(α1, α3) ≤ d(α1, α2) + d(α2, α3), the triangle inequality holds, and the state distance is a true metric. □

Now we have the structure of the system defined as the states and a distance metric on the Cartesian space of the states. This allows a type of similarity calculation to be performed so that, instead of using all of the existing states to determine the value of a missing state, only the most similar, or closest, states are used. Applying this idea to the set of missing states leads directly to an algorithm that uses the closest states to determine appropriate values for those states. The state distance is a true metric on the Cartesian state space. This identifies additional structure in the state space, along with the structure that may already be discovered through the application of K-systems analysis to a complete system. We can now use the state distance to impute the missing values of an incomplete system. It is useful to note that the state distance is a direct generalization of the Hamming distance defined on binary strings. Instead of binary strings, states can be viewed as generalized strings, where the length of a string is the number of variables in the system and each position may be assigned values from a different set of symbols, not just a symbol from the binary set {0, 1}.
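A direct transcription of the state distance into code is straightforward; the short sketch below (Python, with states written as strings over each variable's relabeled alphabet; the function name is ours) checks the three distances computed above.

    # A minimal sketch of the state distance: the number of positions at which
    # two equal-length state strings differ (a Hamming distance over arbitrary alphabets).
    def state_distance(a, b):
        assert len(a) == len(b), "states must involve the same variables"
        return sum(1 for x, y in zip(a, b) if x != y)

    print(state_distance("211", "210"))   # 1
    print(state_distance("211", "101"))   # 2
    print(state_distance("210", "101"))   # 3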
2.2 Closest States

Given the state distance, it is now possible to rank all of the states in a system relative to a single particular state. In particular, we can find the set of states that are closest to any state that is missing a system function value and use this set to impute a value. Since the state distance takes values in the natural numbers, the set of closest states consists of all states that are a distance of one from the state that is missing a value:

A^1c_α = {αi ∈ A | d(α, αi) = 1}.

The following example illustrates the idea of the set of closest states. Suppose we are given a system that has three variables, v1, v2, and v3, that take values from the sets {0, 1}, {0, 1}, and {0, 1, 2}, respectively. We can therefore construct the set of states shown in Figure 11. The figure also shows the known values of the system function f(α).

v1    0    0    0    0    0    0    1    1    1    1    1    1
v2    0    0    0    1    1    1    0    0    0    1    1    1
v3    0    1    2    0    1    2    0    1    2    0    1    2
f(α)  3.7  6.1  9.5  3.7  7.1  14.5 3.7  6.4  10.2 3.7  8.4  16.0

Figure 11: Example system

Let us consider two states as examples, 010 and 102. First, we construct the set of closest states for the state 010. We apply the definition of the state distance and find all states that are a distance of one from this state. This can be readily done by changing each variable value, in turn, to another value from that variable's set of possible values, which produces the set A^1c_010 = {110, 000, 011, 012}. Each of these states differs in one variable position from the state 010. We can perform the same construction on the state 102 and generate the set A^1c_102 = {002, 112, 100, 101}. In general, it is also possible to construct sets of states at any distance from a given state, up to the maximum distance, which is the number of variables in the system.

2.3 Use of the Closest States

Given some arbitrary set of states and the associated system function values, we can now define an algorithm for imputing system function values for all missing states. The idea that underlies this algorithm is the following: use only the states that are most similar (or closest) to impute a value for a missing state; this is a form of maximum likelihood estimation. The distance metric defined in the previous section can be used to determine which states from the set of all states are closest to the missing state. Intuitively, this is a matter of using states where the values of all variables are the same except for one variable.

The set of closest states may be readily formed based on the state distance, but it is not immediately clear how this set of states should be used to impute a value for the missing state. We observe that the structure of the system based on the state distance is similar to the structure of an error-correcting code. This similarity can be used to indicate how the set of closest states can be used. Error-correcting codes are constructed so that the Hamming distance between code words is maximized.
This allows errors to be detected by recognizing that the current code word is not in the set of valid code words, and allows error correction by selecting the valid code word that is closest to the code word containing an error. This is a type of maximum likelihood decoding, in that it selects the valid code word based on the closest code word and the probability that there is a specific number of errors. In the case of a system consisting of variables and the values that each can take, the states correspond to the code words in a communication system, and code words with errors correspond to missing states. Since the system was not explicitly constructed to maximize the distance between valid code words, all of the states exist and there is no single state that is closest to the missing state. Instead, there is a set of states that are closest to the missing state, and all the closest states are equally probable corrections to the missing state. The most likely value of the missing state is the expected value over the set of closest states, namely, the mean. The mean of the set is an unbiased estimator of the missing state. This is in keeping with the maximum entropy principle that is the basis of the reconstructability algorithms. Using the mean of the closest state set is the analog of the maximum likelihood solution for error-correcting codes.

Alternatively, other estimators for the missing value may be used, in a fashion similar to the resolution of state contradictions. The analyst may wish to use the maximum, the minimum, the median, or the mode as the estimator for the missing state. Also, other methods may be applied using the distance function and additional sets of states that are further away from the missing state. It may be possible to use the relationship of the members of the closest state set to their own corresponding closest state sets to determine the values that should be used for the closest states. In particular, it is possible to use the deviation of the closest state set members from the means of their corresponding closest state sets. In effect, this incorporates information from states that are one step further away from the missing state and weights this effect by adjusting the mean value of the closest states by the average deviation of the actual values from the means of the closest states' own closest state sets.

The mean of the closest state set is used here for a number of reasons. First, the mean is an unbiased estimator for an unknown distribution. An unbiased estimator is in keeping with the principle of maximum entropy in that it makes the fewest assumptions regarding the underlying structure. Also, if the original variable values are not drawn from an ordered set, it may not be possible to assign a magnitude to these values and thereby use the variable values to perform some other type of estimation. Using the mean of the closest states places the least constraint on the type of systems that may be considered for the closest state algorithm and, by extension, submitted for K-systems analysis. In addition to the previous arguments, the set of closest states may be characterized by the amount of information that it shares with the missing state.
Using the state distance metric, it is possible to define the amount of information that states share based on their distance. In addition, the average amount of information that a state shares with a particular set of states may also be determined. The state distance is defined by the number of variable positions that differ between a pair of states, and each variable may be considered a single information entity. We can use this notion of similarity and distance to show why using the closest states is superior to the entropy fill. We begin by defining the amount of information shared by two states as the proportion of positions in the state strings that are the same. This is the following function of the state distance:

I(αi, αj) = (Nv − d(αi, αj)) / Nv    (2.1)

where Nv is the number of variables and d is the state distance as defined above. For a given system and a particular state α, all members of the closest state set have a distance of one from that state. By equation 2.1, the information shared between the state and each member of its closest state set is therefore

I(A^1c_α, α) = (Nv − 1) / Nv.    (2.2)

We claim that the closest state set shares more information with the missing state than the set of all states used in the entropy fill and, in general, shares the most information possible with the missing state.

Proof: As defined by 2.1, the average amount of information shared between a particular state and its closest state set is I(A^1c_α, α) = (Nv − 1)/Nv. The entropy fill technique uses the set of all existing states. The average amount of information shared by this set with the missing state can be defined using the sets formed for each possible state distance from α:

A_n = {αi ∈ A | d(α, αi) = n},  n = 1, 2, ..., Nv.    (2.3)

The amount of information shared by each member of A_n with α is

I(A_n, α) = (Nv − n) / Nv.    (2.4)

The average amount of information that the total set A′ = A − α shares with the missing state is then the average, over the possible distances, of these per-distance values:

I(A′, α) = (1/Nv) [ (Nv − 1)/Nv + (Nv − 2)/Nv + ... + 1/Nv + 0 ]    (2.5)
         = (1/Nv) Σ_{n=1..Nv} (Nv − n)/Nv.    (2.6)

Note that the term for the states at the maximum distance from the state of interest is included; this term is zero, but it is counted as a member in the calculation of the average shared information, which yields the Nv in the denominator instead of Nv − 1. The expression simplifies to

(1/Nv²) Σ_{n=1..Nv} (Nv − n)    (2.7)
= (Nv − 1) / (2 Nv)    (2.8)
= 1/2 − 1/(2 Nv).    (2.9)

For every Nv this value is strictly less than one half,

I(A′, α) < 1/2,    (2.10)

while for any system with at least two variables

1/2 ≤ (Nv − 1)/Nv = I(A^1c_α, α).    (2.11)

Therefore, the set of closest states shares more information, on average, with a missing state than does the set of all states.

It is also apparent from the preceding that the set of closest states is the largest set of states that shares the maximum amount of information, on average, with any particular state. A subset of the closest state set will have the same average shared information, but will be a smaller set. Any larger set must include states that share less information with the initial state and, therefore, will have a lower average shared information. This implies that the closest state set consists of the most similar states that share the most information possible with the initial state. The closest states can then be used to impute values for the system function; the only assumption about the function on the states is that each variable that appears in the states is also used in the function.
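A small numerical check of this comparison can be written directly from equations 2.1 through 2.9 for the 2 × 2 × 3 state space of Figure 11 (Python; the helper names are ours and the state 010 is chosen arbitrarily):

    # Sketch (helper names ours): the shared-information comparison of eqs. 2.1-2.9
    # for the 2 x 2 x 3 state space of Figure 11.
    from itertools import product

    NV = 3
    states = ["".join(map(str, s)) for s in product("01", "01", "012")]

    def d(a, b):
        return sum(x != y for x, y in zip(a, b))            # state distance

    def info(a, b):
        return (NV - d(a, b)) / NV                          # eq. 2.1

    alpha = "010"
    closest = [s for s in states if d(alpha, s) == 1]        # A^1c, the n = 1 set of eq. 2.3
    closest_avg = sum(info(alpha, s) for s in closest) / len(closest)

    # the entropy-fill comparison used in the proof: the average, over distances
    # n = 1..Nv, of the per-distance shared information (Nv - n)/Nv
    all_avg = sum((NV - n) / NV for n in range(1, NV + 1)) / NV

    print(closest_avg)   # 0.666... = (Nv - 1)/Nv, eq. 2.2
    print(all_avg)       # 0.333... = 1/2 - 1/(2 Nv), eq. 2.9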
2.4 Closest States Algorithm

Now that we have the set of closest states, created from the state distance function, the mean of this set can be determined and used as the unbiased estimator for the value of the missing state. Based on these definitions and principles, we can define an algorithm that imputes values for the missing states. The following notation will be used for the algorithms that follow: the number of variables is denoted Nv, the number of values that each variable vi takes is denoted nv(i), and the total number of states is denoted Nn. The following is the basic algorithm for calculating the system function values for all missing states.

First, we define a data structure that will be used to store and track the states and the associated system function. It should be noted that the data structure used here is purposely very primitive. Various programming languages may provide capabilities, such as multidimensional arrays, that may simplify the implementation of the algorithm. The following data structure consists of an array of the variable values and the system function value.

(1) data structure State
(2)     Variable(Nv) : Integer
(3)     SystemFunc : Double
(4) end data structure

The following routine is the high-level portion of the algorithm; it maintains two copies of the structure system. Since the algorithm imputes values for every missing state based on the existing system, this allows the original system values to be used and the imputed values to be filled into the other copy, without inadvertently filling in values that are considered missing and that will subsequently be used for imputing other missing states. This routine assumes that the state array has already been populated with all existing states and that the number of variables and the number of values that each variable is allowed to take are already known. It also assumes that the missing states have been assigned a special value referred to as MISSING_FLAG. This allows the algorithm to account for any missing state whose closest state set is itself missing all of its system function values. The routine iterates through the complete set of states, filling in missing states where possible, and then checks whether all the states have been imputed. If not, it performs the closest state loop again until every state has been imputed. There is an assignment that copies all of the states in one copy to the states in the other copy; the implementation of this operation is trivial and, at its most straightforward, iterates through every state, assigning the values from one state to the other. Again, specific languages may have capabilities that make this operation even more elementary, allowing direct assignment of one array to the other.

(1) Routine ImputeMissingStates
(2)     CurrentStates(0 to Nn − 1) : State,  NextStates(0 to Nn − 1) : State
(3)     Complete : Boolean
(4)     Complete = false
(5)     while ¬Complete
(6)         Complete = true
(7)         NextStates() = CurrentStates()
(8)         for i = 0 to Nn − 1
(9)             if (NextStates(i).SystemFunc = MISSING_FLAG) then
(10)                NextStates(i).SystemFunc = ClosestStateMean(CurrentStates(i), CurrentStates(), Nv, nv())
(11)                if (NextStates(i).SystemFunc = MISSING_FLAG) then
(12)                    Complete = false
(13)                fi
(14)            fi
(15)        next i
(16)        CurrentStates() = NextStates()
(17)    Wend

The preceding routine iterates through each state, checks whether its system function value is missing, and, if it is, calls the following function, which calculates the mean of the corresponding closest state set.

(1) Function ClosestStateMean (CurrentState : State, theStates() : State, Nv, nv())
(2)     i, j : Integer;  AState : State;  stateSum : Real;  divisor : Real;  adder : Real
(3)     stateSum = 0
(4)     divisor = 0
(5)     for i = 1 to Nv
(6)         for j = 0 to nv(i) − 1
(7)             if ¬(j = CurrentState.Variable(i)) then
(8)                 AState = CurrentState
(9)                 AState.Variable(i) = j
(10)                adder = theStates(GetIndex(Nv, nv(), AState)).SystemFunc
(11)                if ¬(adder = MISSING_FLAG) then
(12)                    stateSum = stateSum + adder
(13)                    divisor = divisor + 1
(14)                fi
(15)            fi
(16)        next j
(17)    next i
(18)    if ¬(divisor = 0) then
(19)        ClosestStateMean = stateSum / divisor
(20)    else
(21)        ClosestStateMean = MISSING_FLAG
(22)    fi

The following routine is responsible for returning the index of a particular state based on the variable values assigned to it. Again, note that the variable values are assumed to have been relabeled so that each variable vj takes values from the set {0, 1, ..., nv(j) − 1}. The routine returns the index into the state array that corresponds to the particular state and thereby allows access to the corresponding system function value.

(1) Routine GetIndex (Nv, nv(), AState : State)
(2)     i, multiplier, rv : Integer
(3)     rv = 0
(4)     multiplier = 1
(5)     for i = Nv downto 1
(6)         rv = rv + AState.Variable(i) * multiplier
(7)         multiplier = multiplier * nv(i)
(8)     next i
(9)     return rv
(10) end routine
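For readers who prefer a concrete implementation, the following sketch performs the same closest-state mean imputation in Python. It is written here for illustration only: the dictionary-keyed representation of states replaces the index arithmetic of GetIndex, and the helper names are ours.

    # A sketch (ours) of closest-state mean imputation.  States are tuples of
    # relabeled values; `nv` lists the number of values per variable.
    from itertools import product

    def closest_states(state, nv):
        # all states at distance one: change exactly one position to another value
        for i, count in enumerate(nv):
            for v in range(count):
                if v != state[i]:
                    yield state[:i] + (v,) + state[i + 1:]

    def impute(known_values, nv):
        """known_values maps known states to function values; missing states are absent."""
        all_states = list(product(*[range(c) for c in nv]))
        values = dict(known_values)
        while len(values) < len(all_states):
            new_values = dict(values)
            progress = False
            for s in all_states:
                if s not in values:
                    near = [values[c] for c in closest_states(s, nv) if c in values]
                    if near:                           # otherwise wait for a later pass
                        new_values[s] = sum(near) / len(near)
                        progress = True
            values = new_values
            if not progress:
                break                                  # nothing left that can be imputed
        return values

    # The system of Figure 12 with states 001, 012, 101 and 111 removed.
    known = {(0,0,0): 3.7, (0,0,2): 9.5, (0,1,0): 3.7, (0,1,1): 7.1,
             (1,0,0): 3.7, (1,0,2): 10.2, (1,1,0): 3.7, (1,1,2): 16.0}
    filled = impute(known, nv=(2, 2, 3))
    for s in [(0,0,1), (0,1,2), (1,0,1), (1,1,1)]:
        print(s, filled[s])
    # approximately 6.77, 9.08, 6.95, 8.93 -- rounded in Figure 13 to 6.8, 9.1, 7.0, 8.9

Running the same sketch with the states 000, 001, 011, 101, and 002 removed instead exercises the second pass of the while loop, because state 001 initially has no known closest states; this is the situation treated in the second example of the next section.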
2.4.1 Space and Time Complexity of the Closest States Algorithm

Given these routines, we can determine the space and time complexity of the overall algorithm for calculating imputed values for missing states. First, the ImputeMissingStates routine iterates through each of the states at least once in the loop starting at statement 8, which implies a total of Nn iterations through this loop. In addition, there is a while loop starting at statement 5 that ensures that every missing state receives an imputed value. This loop is necessary because the closest state set of any specific missing state may be empty of values; that is, none of the closest states may have a system function value. The maximum number of times that this loop may be executed is readily seen to be the maximum distance between any two states in the system, which is the number of variables, Nv. This implies that the complexity of this routine is O(Nv Nn).

Within this loop is a call to the ClosestStateMean routine, which has two loops in it. One loop, beginning at statement 5, iterates through all of the variables, and the other loop iterates through all of the values of each variable. This means that there are nv(i) iterations of the inner loop for each of the Nv iterations of the outer loop; this total is bounded by the total number of states, so the complexity of this routine is O(Nn). Finally, the routine that returns the index of an arbitrary state based on the assignment of values to the variables has a complexity of O(Nv), the number of variables in the system. The overall complexity of the algorithm is then the product of all of these complexities, namely O(Nv Nn Nn Nv), which simplifies to O(Nv² Nn²). In general, the number of states is far greater than the number of variables in the system, Nn >> Nv. This is particularly true if there are many different values that each variable can take, so we may characterize the complexity of the overall algorithm primarily in terms of the number of states, O(Nn²). Finally, we observe that two copies of the states are maintained throughout the algorithm, so the space complexity is 2Nn, or O(Nn).

2.4.2 Closest State Algorithm Examples

To demonstrate the effectiveness of the closest state algorithm, the following examples demonstrate the results that can be expected; in particular, the results are compared to the entropy fill technique. Consider the following system, which is generated by a fixed function of the three variables; its values on the full set of states are shown in Figure 12.

v1    0    0    0    0    0    0    1    1    1    1    1    1
v2    0    0    0    1    1    1    0    0    0    1    1    1
v3    0    1    2    0    1    2    0    1    2    0    1    2
f(α)  3.7  6.1  9.5  3.7  7.1  14.5 3.7  6.4  10.2 3.7  8.4  16.0

Figure 12: Original system - no missing data

For this example, assume that the following states are missing from the system: 001, 012, 101, and 111. The first step of the algorithm is to form the set of closest states for each missing state, so we have the following sets:

A^1c_001 = {101, 011, 000, 002}
A^1c_012 = {112, 002, 010, 011}
A^1c_101 = {001, 111, 100, 102}
A^1c_111 = {011, 101, 110, 112}

For each set we calculate the mean of the closest state set, counting only those closest states that are not themselves missing. We have the following results; note that the symbol φ indicates a missing function value, which affects the number of values used in calculating the mean of the closest state set.

f(001) = (f(101) + f(011) + f(000) + f(002)) / n = (φ + 7.1 + 3.7 + 9.5) / 3 = 6.8
f(012) = (f(112) + f(002) + f(010) + f(011)) / n = (16.0 + 9.5 + 3.7 + 7.1) / 4 = 9.1
f(101) = (f(001) + f(111) + f(100) + f(102)) / n = (φ + φ + 3.7 + 10.2) / 2 = 7.0
f(111) = (f(011) + f(101) + f(110) + f(112)) / n = (7.1 + φ + 3.7 + 16.0) / 3 = 8.9

The overall system mean of the existing states is 7.2.
The following table provides a direct comparison of the known values, f(α), the values calculated using the entropy fill, fe(α), and the values calculated using the closest states algorithm, fc(α).

v1     0     0     1     1
v2     0     1     0     1
v3     1     2     1     1
fe(α)  7.2   7.2   7.2   7.2
fc(α)  6.8   9.1   7.0   8.9
f(α)   6.1  14.5   6.4   8.4

Figure 13: Comparison of entropy fill and closest states

The following example demonstrates the behavior of the algorithm when one of the missing states has an empty closest state set. The system is the same as in the previous example, but the missing states are now 001, 000, 011, 101, and 002. The mean of the remaining states is 8.6. We proceed with the algorithm as before, first forming the sets of closest states for each missing state and then calculating the mean of each closest state set.

f(000) = (f(100) + f(010) + f(001) + f(002)) / n = (3.7 + 3.7 + φ + φ) / 2 = 3.7
f(001) = (f(101) + f(011) + f(000) + f(002)) / n = (φ + φ + φ + φ) / 0 = undefined
f(011) = (f(111) + f(001) + f(010) + f(012)) / n = (8.4 + φ + 3.7 + 14.5) / 3 = 8.87
f(101) = (f(001) + f(111) + f(100) + f(102)) / n = (φ + 8.4 + 3.7 + 10.2) / 3 = 7.43
f(002) = (f(102) + f(012) + f(000) + f(001)) / n = (10.2 + 14.5 + φ + φ) / 2 = 12.35

Note that the state 001 had no states in its closest state set with an associated function value. This is easily remedied in the algorithm by checking whether all of the states have values and, if not, making another pass through the closest states algorithm. Only those values that are still undefined are considered missing, and the algorithm proceeds as before, except that the imputed values for the missing states are then used to attempt to fill in the remaining states. So the example proceeds by using the imputed values for the states in the closest state set of 001; see Figure 14 for a summary.

f(001) = (f(101) + f(011) + f(000) + f(002)) / n = (7.43 + 8.87 + 3.7 + 12.35) / 4 = 8.09

v1     0     0     0     0     1
v2     0     0     0     1     0
v3     0     1     2     1     1
fe(α)  8.6   8.6   8.6   8.6   8.6
fc(α)  3.7   8.1  12.4   8.9   7.4
f(α)   3.7   6.1   9.5   7.1   6.4

Figure 14: Comparison of entropy fill and closest states

2.5 Alternative Closest State Algorithm

The state distance and the general idea of using the closest states can be extended to a number of different realizations of a closest states algorithm. In this section, two alternatives are briefly considered along with some examples of their use. Using the mean of the closest states is convenient and makes the fewest assumptions regarding the structure of the systems and the breadth of effect of the variables, but adding information from more distant states may frequently be justified and may also provide superior results. The two alternatives briefly considered here are an iterated version of the preceding closest states algorithm and a method that uses the closest states of the members of the initial closest state set.

2.5.1 Iterated Closest States

A direct extension of the closest state algorithm is an iterated version. Imputing values for the missing states based on the set of closest states provides reasonable and quick estimates of the missing values. It is also possible to take the idea of state distance and closest states and apply it iteratively until the imputed values converge to some fixed point.
The algorithm proceeds as before, but instead of just checking whether all of the states have values and then terminating, it iteratively recalculates the means for all of the states that were originally missing. It proceeds in this manner until all of the states have reached a stable value. Examples have been executed using this algorithm and the imputed values that it provides seem reasonable. Compared to the non-iterated version, the values tend to be closer to the entropy fill values. This appeals to intuition: each iteration propagates values from states that are further away from the missing state, and with enough missing states the imputed values would be expected to be quite close to the overall system mean.

The following series of plots compares the imputed and original values for all of the states as more known states are added to the system. The system being compared is generated by a fixed equation in three variables, each of which takes values from the set {0, 1, 2}. The plots begin with only three known states (000, 111, 222) and proceed to add known states to the system, which reduces the number of missing states. The plots make it obvious that, as the amount of missing data in the system is reduced, both the original and the iterated versions yield increasingly better results. It can be seen from these figures that the iterated version and the original version of the closest states algorithm provide reasonable results for the missing states. In general, the non-iterated version seems to over-predict the amount of variance in the system and the iterated version seems to under-predict it. Taken together, they seem to provide bounds on the variability of the missing states.

Figure 15: States 000, 111, and 222 (imputed versus original values for the iterated and closest-state algorithms)

Figure 16: Added states 022, 101, and 011

Figure 17: Added states 122, 002, 220, and 020

Figure 18: Added states 121, 100, and 202

One final note about the iterated version is that, while it seems reasonable that it will converge, this has not been proven. The imputed values of the missing states are obviously bounded by the maximum and minimum values of the known states, and it seems likely that the iterated version should converge for all states to values near the mean.
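A sketch of the iterated variant is given below (Python; written by us as an illustration of the idea rather than as the author's implementation). It re-applies the closest-state mean to the originally missing states until no value changes by more than a small tolerance. Seeding any state that has no known closest states with the overall mean on the first pass is our choice; the text does not specify the starting point.

    # Sketch of the iterated closest-states variant (ours).
    from itertools import product

    def neighbors(state, nv):
        for i, count in enumerate(nv):
            for v in range(count):
                if v != state[i]:
                    yield state[:i] + (v,) + state[i + 1:]

    def iterated_impute(known, nv, tol=1e-9, max_rounds=1000):
        all_states = list(product(*[range(c) for c in nv]))
        missing = [s for s in all_states if s not in known]
        values = dict(known)
        overall = sum(known.values()) / len(known)
        # initial pass: closest-state mean, or the overall mean if no neighbor is known
        for s in missing:
            near = [values[c] for c in neighbors(s, nv) if c in values]
            values[s] = sum(near) / len(near) if near else overall
        for _ in range(max_rounds):
            change = 0.0
            for s in missing:                      # only originally missing states are updated
                near = [values[c] for c in neighbors(s, nv)]
                new = sum(near) / len(near)
                change = max(change, abs(new - values[s]))
                values[s] = new
            if change < tol:
                break
        return values

    known = {(0,0,0): 3.7, (0,0,2): 9.5, (0,1,0): 3.7, (0,1,1): 7.1,
             (1,0,0): 3.7, (1,0,2): 10.2, (1,1,0): 3.7, (1,1,2): 16.0}
    print(iterated_impute(known, (2, 2, 3)))

On the Figure 12 example, the value imputed for 012 is unchanged by the iteration (all of its closest states are known), while the other three imputed values move toward the overall mean of 7.2 relative to the single-pass results, consistent with the behavior described above.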
2.5.2 Mean Deviation from the Mean of the Closest States

The preceding examples for both the iterated and non-iterated versions of the closest states algorithm indicate that quite often the mean of the closest states is offset from the known value. In practice, it is impossible to know what the value of the missing state is, but this observation suggests another alternative form of the closest states algorithm.

The members of the set of closest states also have their own sets of closest states. For the members of the closest state set that have a known value, these closest state sets can be determined and their means can be calculated. Each calculated mean can then be compared to the known value, and the deviation of the prediction from the known value can be calculated directly. The deviations of all the members of the set of closest states can be calculated, and the mean deviation of the closest state set can be determined. This mean deviation for the closest state set can then be used to adjust the calculated mean for the missing state.

The following results are calculated for the example system shown in Figure 12, starting with the same missing states as before: 001, 012, 101, and 111. As previously, each of the closest state sets is formed for the missing states:

A^1c_001 = {101, 011, 000, 002}
A^1c_012 = {112, 002, 010, 011}
A^1c_101 = {001, 111, 100, 102}
A^1c_111 = {011, 101, 110, 112}

The difference in this algorithm begins with finding the closest state sets of all of the states in each of the sets listed above. There is some redundancy among the members of these closest state sets, so the table in Figure 19 shows each unique state only once. Note that each closest state set has four members, since there are two distinct values for v1 and v2 and three values for v3.

State   Closest states
000     010, 100, 001, 002
001     011, 101, 000, 002
002     012, 102, 000, 001
010     000, 110, 011, 012
011     001, 111, 010, 012
100     110, 000, 101, 102
101     111, 001, 100, 102
102     112, 002, 100, 101
110     100, 010, 111, 112
111     101, 011, 110, 112
112     102, 012, 110, 111

Figure 19: Closest States to the Members of the Closest State Sets

The mean of each of these closest state sets and the deviation of this mean from the known value can be calculated. These deviations for each of the states in a closest state set can then be averaged, and this value added to the calculated mean of the closest state set. This technique may capture the more remote behavior of the system relative to each of the states and then weight that behavior when imputing values for the missing states. Figure 20 displays the results of the calculations of the means and the deviations.

Figure 20: Deviations from the Mean (for each missing state: the members of its closest state set, the known value of each member, the mean of that member's own closest state set, the deviation of the known value from that mean, and the resulting average deviation — 0.461 for state 001, 0.767 for 012, −0.909 for 101, and 2.783 for 111)

These results can then be used to adjust the imputed values from the standard closest states algorithm by the average deviation for each missing state:

f(001) = closest state mean + average deviation = 6.8 + 0.461 = 7.261
f(012) = 9.1 + 0.767 = 9.867
f(101) = 7.0 + (−0.909) = 6.091
f(111) = 8.9 + 2.783 = 11.683

The following table summarizes the results and compares them to the standard closest state results.
v1       0     0     1     1
v2       0     1     0     1
v3       1     2     1     1
fdev(α)  7.3   9.9   6.1  11.7
fc(α)    6.8   9.1   7.0   8.9
f(α)     6.1  14.5   6.4   8.4

Figure 21: Comparison of the mean deviation adjustment, the closest states results, and the known values

For this example, the results obtained by using the mean deviation as an adjustment to the closest state algorithm are comparable to, and possibly better than, the standard results. It has not been determined, nor is it clear, whether this version of a closest states algorithm is superior to the previous versions. Additional research is required to make any determination, but the next section provides a method for selecting an imputation algorithm.

2.5.3 Test for Selecting an Imputation Algorithm

While the closest states algorithm is the most general, in that it makes the fewest assumptions and shares the most information with the missing states, it is possible that for specific instances an alternative algorithm may provide superior results. This section provides a test to determine which of the algorithms may generate the best results for a specific set of data. The test proceeds by imputing values for states in the system that are already known and calculating the sum of squares error with respect to the known values. This procedure may be repeated for each of the algorithms defined previously, for a variant of the closest state algorithm, or for any other procedure for imputing data that the analyst may choose. The algorithm that has the minimum error is selected to impute values for the missing states. While this test is not foolproof, it may provide some guidance for deciding which technique to use for imputing the data.

The test may be stated as follows. For each imputation algorithm Algj and a given g-system as previously defined, determine the set of existing states Ae. For each state αi in Ae, impute a value, f̂(αi), using algorithm Algj. Calculate the error for each state as

ei = f(αi) − f̂(αi),    (2.12)

and the total sum of squares error for the algorithm as

E(Algj) = Σ_{αi ∈ Ae} ei².    (2.13)

The minimum value of E(Algj) indicates that algorithm j had the minimum error in predicting function values for existing states. This may indicate that algorithm j is the optimal algorithm to use to impute values for the missing states.
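The selection test of equations 2.12 and 2.13 is easy to mechanize. The sketch below (Python; ours) scores any imputation routine by temporarily hiding each known state, re-imputing it, and accumulating the squared error. Hiding one known state at a time is our reading of "imputing values for states that are already known"; the text does not prescribe exactly how the known states are withheld.

    # Sketch of the selection test of eqs. 2.12 and 2.13 (ours).  `algorithm` is any
    # callable mapping (known_values, nv) to a complete dict of imputed values,
    # e.g. the closest-state or iterated sketches given earlier.
    def sum_of_squares_error(algorithm, known, nv):
        total = 0.0
        for state, true_value in known.items():
            held_out = {s: v for s, v in known.items() if s != state}   # hide one known state
            imputed = algorithm(held_out, nv)
            value = imputed.get(state)
            if value is None:
                continue                      # the algorithm could not impute this state
            e = true_value - value            # eq. 2.12
            total += e * e                    # eq. 2.13
        return total

    # choose the algorithm with the smallest error, for example:
    # best = min(candidate_algorithms, key=lambda alg: sum_of_squares_error(alg, known, nv))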
As was discussed above, these are not necessarily two separate problems, but two alternate interpretations of the same problem. It is perhaps more general to initially view an incomplete system as missing data and then, if necessary, regard the data as scattered if too much data is missing. In chapter 4, two criteria for deciding if there is too much missing data will be presented.

Assuming that the given variable values are scattered led to the solution of applying one dimensional clustering techniques to each set of variable values that were determined to be scattered. One limitation of this approach is that the clustering is performed on the variable values alone, in effect assuming that the variable values and the system function values are independent.

Given the closest state algorithm, which is capable of taking any incomplete system and imputing a complete system, there are still reasons for using some clustering technique on the variable values. First, the true reason that the system is incomplete may indeed be that the variable values are scattered. Frequently, in the execution of a designed experiment, the control of the variable values may not be fine enough for every trial to use exactly the same variable setting. Obviously, any analysis of the results of the designed experiment should take this fact into account. Second, clustering the variable values may help in reducing the number of missing states in the data set.

Suppose that the system under consideration is the example that was previously considered, reproduced here as figure 22. If the system is treated as having missing states, there are a very large number of possible states missing. In the example, there are nine unique values for v_1 and v_2 and ten unique values for v_3. This yields a total of 810 possible states, so with twelve states defined in the table, there are 798 missing states. Obviously, even with a good algorithm for imputing missing function values for all of the missing states, it is not reasonable to think that this small percentage of existing states will provide enough information to generate good values for the missing states. In addition, it seems obvious from an inspection of the table that the values may very well be reasonably clustered; in this case, visual inspection provides a good solution, as was described previously. The need for clustering for K-systems analysis is obvious, but the question remains as to how the clustering should be done, and that is the question that we propose to answer here.

v_1   v_2   v_3   f(.)
7.2   3.1   9.4    3.7
6.9   2.9   6.2    6.1
6.7   2.7   7.1    9.5
7.1   9.2   9.2    3.7
6.9   8.9   5.9    7.1
7.5   8.6   6.9   14.5
2.8   3.3   9.3    3.7
3.1   2.9   5.7    6.4
3.2   3.0   7.0   10.2
2.7   8.7   9.2    3.7
2.7   8.9   6.3    8.4
2.7   9.2   7.1   16.0

Figure 22: Data Scattering Example

First, we propose a method for performing clustering of the variable values that takes into account the system function value. The clustering method discussed here is focused on the needs of K-systems analysis, specifically the need to reduce the amount of data scattering and the number of variable values so that K-systems analysis may proceed. A general method will be described that allows K-systems analysis to proceed utilizing any type of clustering algorithm.
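As a quick check of the state-count arithmetic above, the following snippet tallies the distinct values in the figure 22 data (the columns are transcribed from the table) and the implied size of the state space.

v1 = [7.2, 6.9, 6.7, 7.1, 6.9, 7.5, 2.8, 3.1, 3.2, 2.7, 2.7, 2.7]
v2 = [3.1, 2.9, 2.7, 9.2, 8.9, 8.6, 3.3, 2.9, 3.0, 8.7, 8.9, 9.2]
v3 = [9.4, 6.2, 7.1, 9.2, 5.9, 6.9, 9.3, 5.7, 7.0, 9.2, 6.3, 7.1]

# Distinct values per variable and the implied number of possible states.
n1, n2, n3 = len(set(v1)), len(set(v2)), len(set(v3))
possible = n1 * n2 * n3                      # 9 * 9 * 10 = 810
observed = len(set(zip(v1, v2, v3)))         # 12 distinct observed states
print(n1, n2, n3, possible, possible - observed)   # 9 9 10 810 798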
3.1 Using Clustered Data in K-systems Analysis

K-systems analysis may be performed, and its results are valid for the given data system, independent of the underlying system. In the same way that the K-system function must be induced from a general system function, the variables and the values that they take must be in the correct form for the analysis to proceed. It is desirable to have each variable assigned a state set of the form of the natural numbers, V_i = {0, 1, ..., n}. This can easily be accomplished for a finite set of data by assigning a unique integer value to each distinguishably unique variable value.

Current methodology for using clustered data in K-systems analysis requires that the clustering be done in one dimension. Each set of variable values is clustered independently of all other variables and of the system function. This leads to a problem in the use and interpretation of the cluster values that are used in place of the original variable values. The problem is not one of relabeling the values, since the relabeling results in an isomorphism; it is related to the meaning of the clusters that are submitted for analysis. Generating clusters independently of the system function assumes that the system function is independent of the variable values, that is, that the variable values do not affect or control the system function. The reason for doing any type of data analysis is to find what effect the variable values have in determining the behavior of the system. This is especially true in K-systems analysis, because not only are the variable effects being sought, but even finer effects may be discovered through the states and substates.

We propose that the clustering be done across two dimensions, a variable and the system function, in what is effectively a preprocessing step to K-systems analysis. The preceding discussion made it clear that clustering based solely on the variable values is insufficient and that the results from the analysis may be ambiguous. If we propose doing clustering using two dimensions, it may seem a natural extension for the clustering to proceed across all of the variables and the system function. The true purpose of the clustering for K-systems analysis, however, is to induce a system that has properties sufficiently close to a probabilistic system that the probabilistic reconstructability analysis algorithms can be applied. In effect, K-systems analysis is itself a form of clustering or unsupervised learning, in that it finds the important subsystems that affect system behavior and is able to group variables and their values in different combinations that make the controlling subsystems obvious. Limiting the clustering to the two dimensions of each variable and the associated system function values allows the results to be submitted to the finer analysis of K-systems analysis. As in the closest states algorithm, the system function provides the meaning of the variable clusters. Clustering generally proceeds across all dimensions of the data and groups like data vectors together.
K-systems analysis also proceeds across all dimensions in the data, but it is able to produce more refined groupings of the data; it finds categories of effects based on all possible combinations of values and distributes their effects through maximum entropy mathematics. Clustering each variable in combination with the system function allows the high level behavior of the system to be found. This can then be submitted for K-systems analysis so that the finer effects of the variable clusters, and of combinations of variable clusters, on the overall system can be found.

One final aspect of clustering for K-systems analysis is the use of these two dimensional clusters and, in particular, the use of the results of the analysis to make predictions about the effect of previously unknown variable values. Given a reconstruction of a system, the analyst may wish to predict the system function output based on some combination of specific variable values. The variable values must be mapped into one of the existing clusters, and this must be done in the absence of a system function value. If the existing clusters are not linearly separable perpendicular to the axis of the variable, the variable value being considered may not map into a single unique cluster. This means that the variable value cannot be unambiguously assigned to a single cluster, and there is not a single prediction about the effect that the variable values will have on the system. Instead of being a drawback, this can add new insight to the analysis of the system. The variable value can be determined to have some probability of inclusion in multiple clusters, based on whether it falls in the range of a cluster and on the distance of the value from the centroids of the clusters. While this does not provide a single solution for the effect that the variable values will have, it provides a set of possible solutions along with associated probabilities based on the possible inclusion of the variable values in the existing clusters. This extends K-systems analysis from a method that provides deterministic answers for every possible state into a stochastic type of modeling system that provides a number of solutions, each qualified by the possibility that specific values fall into specific clusters.

For each two dimensional cluster generated by any clustering algorithm, the following information is generated by, in effect, projecting the two dimensional cluster onto the variable axis: each cluster c_i has maximum and minimum variable values, v_max^i and v_min^i, and a centroid or mean value, v_c^i, which can also be calculated. Given a specific variable value, v, it must first be determined whether this value falls into the range of any of the existing clusters. If it falls into one or more, then the probability that the value falls into a specific cluster can be calculated as

Pr(v \in c_i) = 1 - |v - v_c^i| / (2 (v_max^i - v_min^i)).    (3.1)

If the value does not fall inside the range of any of the clusters, it may be assigned to the single closest cluster, or the probability may be calculated for all clusters. Once these probabilities have been calculated, they may be used to qualify the predictions that the K-systems analysis would provide for specific cluster assignments.
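A minimal sketch of this membership calculation follows, assuming the reconstructed form of equation 3.1 above (the exact scaling of the distance term should be checked against the original); the cluster records, with min, max, and centroid fields, are a hypothetical representation of the projected clusters.

def cluster_membership(v, clusters):
    """Probability that variable value v falls into each projected cluster.

    clusters : list of dicts with keys 'min', 'max', and 'centroid',
               obtained by projecting each two dimensional cluster onto
               the variable axis (hypothetical representation).
    """
    probs = {}
    for i, c in enumerate(clusters):
        width = c['max'] - c['min']
        if c['min'] <= v <= c['max']:
            # Assumed form of equation 3.1: values closer to the centroid
            # receive a higher probability of membership.
            probs[i] = 1.0 if width == 0 else \
                1.0 - abs(v - c['centroid']) / (2 * width)
    if not probs:
        # Outside every cluster range: assign to the nearest centroid.
        nearest = min(range(len(clusters)),
                      key=lambda i: abs(v - clusters[i]['centroid']))
        probs[nearest] = 1.0
    return probs

The probabilities returned for each variable can then be multiplied across variables to qualify a prediction, in the manner described next.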
Specifically, these probabilities may be used in one of two ways. First, the analyst may simply select the single cluster for which Pr(v \in c_i) is a maximum. Alternatively, all possible clusters and combinations of clusters may be used to generate predictions, and the product of the probabilities for each cluster can be used as a qualifier. In effect, the analysis is stochastic and provides multiple possible solutions that are qualified by the product probabilities.

3.2 Entropy Similarity

Clearly, some kind of clustering may be necessary for K-systems analysis to be effective on the widest possible range of systems. It is also quite clear that clustering independently for each variable may lead to results which are ambiguous. The previous section described in some detail the problems that may be encountered and outlined a general methodology whereby they may be overcome. There remains a question about the details of how best to determine the clusters that should be submitted for K-systems analysis.

Any common clustering algorithm may be applied to the problem of determining the clusters. While all of these algorithms are useful in certain contexts, none were developed specifically for application in the field of reconstructability analysis. Here we present an algorithm that is directly based on information entropy and uses the same techniques used to induce a K-system in order to induce a system that may be analyzed using information entropy.

The algorithm that will be developed here is based on the taxmap algorithm developed by Carmichael et al. [CARM 68] and Carmichael and Sneath [CARM 69]. That algorithm attempts to imitate the procedure used by a human who is manually detecting clusters through observation in two and three dimensions: it tries to detect relative distances between pairs of points and searches for continuous and relatively dense regions of space that are surrounded by mostly empty space. The method is based on the use of a similarity matrix that contains the relative similarity of all pairs of points. In general, the matrix values are based on some general notion of similarity, but the specific similarity function or relation is not explicitly defined. This section develops a measure of similarity that is based on the formalities of information theory and the entropy mathematics used for K-systems analysis.

We begin by defining the pairwise similarity measure using the information entropy. First, consider the equation for the joint entropy of two independent variables [SHAN 48]:

H(x, y) = -\sum_{x, y} p(x, y) \log_2 p(x, y),    (3.2)

where p(x, y) is the joint probability distribution. The maximum value of the entropy occurs when the distribution p(x, y) is uniform. This is the basic equation that will be used to calculate the similarity of each pair of points. Note that the similarity measure will be defined for two dimensional space, but it may be extended to multiple dimensions. Given two points, the question is then how the joint probability distribution can be calculated based solely on those two points. A joint distribution can be calculated given the marginals of a system, and this distribution will be the maximum entropy distribution given that the two marginal distributions are independent.
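As a quick numerical illustration of this construction (using the two marginal distributions of the worked example that follows), the product of independent marginals gives the maximum entropy joint distribution, and its entropy can be checked directly; numpy is used here purely for convenience.

import numpy as np

# Two marginal distributions (those of the worked example below).
px = np.array([0.25, 0.30, 0.45])
py = np.array([0.30, 0.40, 0.30])

# Maximum entropy joint distribution consistent with independent
# marginals: p(x_i, y_j) = p(x_i) p(y_j).
joint = np.outer(px, py)

# Joint entropy, equation 3.2: H(x, y) = -sum p log2 p.
H = -(joint * np.log2(joint)).sum()
print(round(H, 2))   # 3.11, matching the worked example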
If the two marginal distributions are not independent, the distribution calculated as below is then one solution in a family of distributions that satisfy the constraints of the joint distribution. For example, suppose that we have the two discrete marginal distributions p(x) = {0.25, 0.30, 0.45} and p(y) = {0.30, 0.40, 0.30}. The problem may then be set up in the tabular format of a joint probability distribution as shown below.

x\y        1            2            3
 1     p(x_1,y_1)   p(x_1,y_2)   p(x_1,y_3)   0.25
 2     p(x_2,y_1)   p(x_2,y_2)   p(x_2,y_3)   0.30
 3     p(x_3,y_1)   p(x_3,y_2)   p(x_3,y_3)   0.45
          0.30         0.40         0.30

Figure 23: Joint Probability Distribution

Given this tabular statement, the joint distribution can be directly calculated as p(x_i, y_j) = p(x_i) p(y_j) for all i, j. This results in the distribution shown in figure 24.

x\y      1       2       3
 1     0.075   0.100   0.075   0.25
 2     0.090   0.120   0.090   0.30
 3     0.135   0.180   0.135   0.45
       0.30    0.40    0.30

Figure 24: Maximum Entropy Distribution

Based on this distribution, the entropy of the system can be calculated directly from equation 3.2; in this particular case, H(x, y) = 3.11. While this allows the calculation of the joint distribution and its overall joint entropy, it still does not lead us directly to the similarity measure. It shows that, given the marginals of a system, it is possible to calculate the joint maximum entropy distribution. Next, we show how to induce the properties of marginal distributions directly from a pair of data points. The technique used is similar to the transformations that result in a K-system [JONE 85c]. Again, we note that the transformation used here is for two dimensions, but it can be readily extended to multiple dimensions.

Suppose that we are given two data points, (x_1, y_1) and (x_2, y_2), such that x_i, y_i are positive reals. Without loss of generality, we restrict these points to the positive real numbers, because any points outside this range can easily be mapped into the range of positive real numbers. We define the following transformation factors:

T_x = x_1 + x_2,    (3.3)
T_y = y_1 + y_2.    (3.4)

These factors can then be used to transform the points as follows:

x_1' = x_1 / T_x,    (3.5)
x_2' = x_2 / T_x,    (3.6)
y_1' = y_1 / T_y,    (3.7)
y_2' = y_2 / T_y.    (3.8)

Clearly, the resulting values have the following properties of marginal probability distributions:

\sum x_i' = 1.0, with 0 <= x_i' <= 1, and \sum y_i' = 1.0, with 0 <= y_i' <= 1.

Note that information has neither been added nor removed by this scaling of the variable values. The results of the transformations are not probability distributions, but the scaling creates values with sufficient properties that they can be used in the same manner as marginal distributions. The entropy of two points, given the transformation and the use of the points as marginals, can then be calculated from the joint distribution derived from those marginals as follows. Given two arbitrary points in two dimensional space, (x_1, y_1) and (x_2, y_2), they can first be transformed as in equations 3.5 through 3.8 into the points (x_1', y_1') and (x_2', y_2'). The joint maximum entropy distribution can then be formed as if these transformed points were marginals. The tabular form of the problem is shown below:
x\y          1              2
 1      p(x_1',y_1')   p(x_1',y_2')   x_1'
 2      p(x_2',y_1')   p(x_2',y_2')   x_2'
            y_1'           y_2'

Figure 25: Entropy Similarity Joint Distribution

Note that p(x_i', y_j') is equal to the product x_i' y_j'. The entropy of this distribution can then be calculated from the equation for information entropy, equation 3.2. Note that the maximum value of the two dimensional joint entropy occurs when the distribution is uniform. This implies that the scaled values must all be equal and, since they must sum to one, all values of p(x_i', y_j') must equal 0.25. This, in turn, implies that the maximum value the entropy can take is 2.0. This value will be used to normalize the calculated entropies to the interval [0, 1]. The normalized entropies will then serve as a measure of similarity between two points and will conveniently fall within the [0, 1] interval.

In general, this normalization factor depends on the number of dimensions and the number of points being assessed in the joint entropy. We now derive an expression that can be used to determine this normalization factor for n-dimensional points. As was stated previously, the joint entropy is at a maximum when the distribution is uniform. Let n_d stand for the number of dimensions (or variables) in the data. Since the similarity involves two points, the uniform distribution for each marginal can be expressed as

p(v_i) = 1 / n_p,    (3.9)

where n_p is the number of points to be compared; in the pairwise comparison, n_p = 2. Therefore, each cell of the joint distribution can be expressed as

p(v_1, v_2, ..., v_{n_d}) = 1 / (n_p)^{n_d}.    (3.10)

The n_p in the denominator is due to the fact that the comparison involves n_p points, and the power n_d is due to the number of marginals that make up the joint distribution, so the joint distribution is divided into (n_p)^{n_d} equal cells. Note that for a general comparison of two data points, n_p = 2. The distribution is then used in the calculation of the joint entropy, so that

H(v_1, v_2, ..., v_{n_d}) = -\sum p(v_1, v_2, ..., v_{n_d}) \log_2 p(v_1, v_2, ..., v_{n_d}).    (3.11)

The summation occurs over all combinations of values of the n_d dimensions, which yields (n_p)^{n_d} terms. Substituting equation 3.10 into equation 3.11 yields

H = -\sum_{i=1}^{(n_p)^{n_d}} (1 / (n_p)^{n_d}) \log_2 (1 / (n_p)^{n_d}),    (3.12)

and since there are (n_p)^{n_d} identical summands,

H = -(n_p)^{n_d} (1 / (n_p)^{n_d}) \log_2 (1 / (n_p)^{n_d}) = \log_2 (n_p)^{n_d},    (3.13)

which finally yields the maximum value of the entropy for distributions with n_d dimensions and n_p points,

n_H = n_d \log_2 n_p.    (3.14)

For the pairwise comparison of data points, n_p = 2 and the normalization factor simplifies to

n_H = n_d.    (3.15)

So we find that the pairwise normalization factor, n_H, is equal to n_d, the number of dimensions.

Given the equations set forth above, we can explicitly derive the two dimensional entropy similarity measure as follows. We begin with two points, (x_1, y_1) and (x_2, y_2), which are scaled to the points (x_1', y_1') and (x_2', y_2') as in equations 3.5 through 3.8. The joint maximum entropy distribution can then be calculated, resulting in the values p(x_i', y_j').
The value of the two dimensional entropy can then be explicitly calculated as

H(x, y) = -[ p(x_1',y_1') \log_2 p(x_1',y_1') + p(x_1',y_2') \log_2 p(x_1',y_2') + p(x_2',y_1') \log_2 p(x_2',y_1') + p(x_2',y_2') \log_2 p(x_2',y_2') ].    (3.16)

We define the two dimensional entropy similarity, S_H^2, to be

S_H^2(p_1, p_2) = H(x, y) / 2,    (3.17)

where p_1 and p_2 are two points in two dimensional space and the 2 in the denominator scales the result to the [0, 1] interval as described above. In general, the entropy can be defined as in equation 3.11, and we can now provide a corresponding general equation for the entropy similarity for any dimension n_d:

S_H^{n_d}(p_1, p_2) = H(v_1, v_2, ..., v_{n_d}) / n_H,    (3.18)

where n_H is defined as in equation 3.14. We may now use this as a measure of similarity, form a similarity matrix, and apply a taxmap style clustering algorithm. First, we present some examples to help characterize the behavior of this function. Figure 26 plots the example points used in what follows.

Figure 26: Plot of Example Points

Suppose that we are given the two points p_1 = (0.8, 2.7) and p_2 = (1.1, 3.0). Upon inspection of figure 26, it is obvious that these two points are very similar. Transforming these points as specified in equations 3.3 through 3.8, we calculate T_x = 1.9 and T_y = 5.7, which gives scaled values of (0.421, 0.579) for x and (0.474, 0.526) for y. Using these values as marginals yields the following distribution:

x\y       1        2
 1      0.199    0.222    0.421
 2      0.274    0.305    0.579
        0.474    0.526

Figure 27: Entropy Distribution

Calculating the entropy as in equation 3.16 yields H(x, y) = 1.9799, and normalizing as in equation 3.17 yields S_H^2(p_1, p_2) = 0.990. This agrees with our intuition that the two points are very similar to each other.

Using the previous example as a starting point, suppose that we are given the two points p_2 = (1.1, 3.0) and p_3 = (80.8, 2.7). Inspection yields the observation that these points are not very similar in general, but they are very similar along one dimension: the y coordinates are 3.0 and 2.7. Calculating the similarity as above yields S_H^2(p_2, p_3) = 0.550. This value captures the fact that, while the points are not very similar in terms of the distance between them, they are very similar along one of their dimensions.

Suppose that the points are p_2 = (1.1, 3.0) and p_4 = (80.8, 20.7). The observation this time is that the points are not very similar at all. Calculating the entropy similarity yields S_H^2(p_2, p_4) = 0.325. This result is fairly intuitive, because one might expect this value to be lower than that of the previous pair, where S_H^2(p_2, p_3) = 0.550.

Suppose that we wish to use the Euclidean distance to calculate the similarity between the respective pairs of points, where the distance is defined as

d_E(p_i, p_j) = sqrt((x_i - x_j)^2 + (y_i - y_j)^2).    (3.19)

Then the distances are d_E(p_1, p_2) = 0.424, d_E(p_2, p_3) = 79.70, and d_E(p_2, p_4) = 81.64 for the three examples. Using the inverse of the distance as a measure of similarity, defined as

s_E(p_i, p_j) = 1 / d_E(p_i, p_j),    (3.20)

yields values of s_E(p_1, p_2) = 2.36, s_E(p_2, p_3) = 0.0125, and s_E(p_2, p_4) = 0.0122.
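Before comparing these measures further, a small self-contained Python sketch of the entropy similarity itself may be useful, generalized to n_d dimensions as in equation 3.18. Coordinates are assumed to be strictly positive, as required by the transformation; the printed values reproduce the first two worked examples above.

import math
from itertools import product

def entropy_similarity(p1, p2):
    """Entropy similarity of two n_d-dimensional points (equation 3.18).

    Each coordinate pair is scaled to sum to one and treated as a marginal
    of a product (maximum entropy) distribution; the joint entropy of that
    distribution is normalized by n_H = n_d.
    """
    nd = len(p1)
    marginals = [(a / (a + b), b / (a + b)) for a, b in zip(p1, p2)]
    h = 0.0
    for cell in product((0, 1), repeat=nd):      # the 2**nd joint cells
        prob = 1.0
        for dim, idx in enumerate(cell):
            prob *= marginals[dim][idx]
        h -= prob * math.log2(prob)
    return h / nd

print(f"{entropy_similarity((0.8, 2.7), (1.1, 3.0)):.3f}")    # 0.990
print(f"{entropy_similarity((1.1, 3.0), (80.8, 2.7)):.3f}")   # 0.550

Extending the same construction to more than two points per comparison gives the cluster cohesiveness measure used later in section 3.2.1.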
This Euclidean measure does not greatly discriminate between the last two pairs of points, while the maximum entropy measure captures the similarity of the points in the second pair based on their similarity along one of the dimensions. Suppose that we also consider the city block distance that is recommended for use in the original taxmap algorithm [CARM 69]. The city block distance was used because the authors wanted points to have the same distance whether they were, say, two units apart on each variable or one unit apart on one variable and three on the other:

d_CB(p_i, p_j) = |x_i - x_j| + |y_i - y_j|.    (3.22)

For the preceding points, the city block distances are d_CB(p_1, p_2) = 0.6, d_CB(p_2, p_3) = 80, and d_CB(p_2, p_4) = 97.4. Again using the inverse as the measure of similarity,

s_CB(p_i, p_j) = 1 / d_CB(p_i, p_j),    (3.23)

yields s_CB(p_1, p_2) = 1.67, s_CB(p_2, p_3) = 0.0125, and s_CB(p_2, p_4) = 0.0103. While the city block distance does a somewhat better job of recognizing similarity along a dimension, it still does not greatly discriminate between the pairs p_2, p_3 and p_2, p_4.

Suppose that we consider the final combination of pairs of points, p_3 = (80.8, 2.7) and p_4 = (80.8, 20.7). The entropy similarity between these two points is S_H^2(p_3, p_4) = 0.758, the Euclidean distance is d_E(p_3, p_4) = 18.0, the Euclidean similarity is s_E(p_3, p_4) = 0.0556, the city block distance is d_CB(p_3, p_4) = 18.0, and the city block similarity is s_CB(p_3, p_4) = 0.056. Again, the entropy similarity captures the fact that the points are very similar, in fact exactly the same, along one dimension. It also captures the same information as the Euclidean similarity, namely that the distance between these two points is less than for the preceding pairs, and it yields a correspondingly greater value for similarity.

The entropy similarity satisfies some of the typical properties of similarity relations, but it does not satisfy any common type of transitivity or, equivalently, the triangle inequality. This is not unusual, since the relation is not a fuzzy membership function, which is typically the subject of similarity relations [ZADE 71]. Further, measures of similarity that produce results in the interval [0, 1] have a corresponding measure of dissimilarity defined as [EVER 93]

d(p_i, p_j) = 1 - s(p_i, p_j),    (3.24)

which, in this case, is symmetric and non-negative. Some dissimilarity measures also satisfy the triangle inequality, in which case they qualify as distance measures, as was the case with the state distance in the previous section. The entropy similarity has a corresponding dissimilarity measure defined exactly as in equation 3.24. The similarity is obviously symmetric, so that S_H(p_i, p_j) = S_H(p_j, p_i), and it is non-negative, so the dissimilarity defined in equation 3.24 has these properties as well. We denote the entropy dissimilarity as d_H(p_i, p_j) and will show that it does not satisfy the triangle inequality. We prove this by a counterexample and then explain the behavior. Suppose that we assume the triangle inequality holds for the entropy dissimilarity, which means that

d_H(p_i, p_j) <= d_H(p_i, p_k) + d_H(p_j, p_k).    (3.25)

Suppose that we have the three data points shown in the following table and plot:

p_i    x     y
 1     7    0.5
 2     6    10
 3     8     2

Figure 28: Triangle Inequality Counterexample

For the triangle inequality to be true, we must have

d_H(p_1, p_2) <= d_H(p_1, p_3) + d_H(p_2, p_3).    (3.26)
Pi 1 2 3 X 7 6 8 y 0.5 10 2 Figure 28: Triangle Inequality Counterexample For the triangle inequality to be true, we must have du(Pl*Pl)—d H CPl»P i) + d H (p i» Pi ) Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. (3*26) 85 Calculation of the similarities yields Sh(Pi>Pi)= 0.659, S * ( p „ p ,) = 0.931, and S 2H(p 2 , p 3)= 0.780. 1 1 9 ♦ _ 2 75 3 1 1 ♦ 3 ♦ I5 6 7 8 9 1 Figure 29: Plot o f Counterexample The corresponding dissimilarities are d w0 ,l,p 2)=O.341,</w0 7,,/73)= 0.069, and d H(p 2 , p 3)= 0.220. Substituting these values into 2.20 yields 0.341 <0.069+ 0.220 and 0.341 <0.289, which is a contradiction so we know that the entropy dissimilarity does not satisfy the triangle inequality. Obviously, the Euclidean distance satisfies the triangle inequality and we would therefore expect to get a different ordering for these points if we compare the Euclidean and the maximum entropy dissimilarities. Calculating the Euclidean distance between these points yields the following: Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. d E (P b P 2 ) = 9.55. d £ (p b P3) = 1.80, and d £ (p 2 > P3) = 8.25. Based on these distances we can observe that p / and p 3 are closest, p 2 and p j are next closest, and p j and p 2 are furthest apart. The previous results for the entropy similarity provide the following ordering: p i and p j are closest, p2 and p j are next closest, and p i and p 2 are furthest apart; this is the same as the Euclidean ordering. We find that while the triangle inequality is not satisfied by this measure for these points, that the entropy dissimilarity still yields the same relative ordering for these pairs o f points. Note that, in general, the same relative ordering o f pairs o f points does not hold between the Euclidean distance and the entropy dissimilarity. The reason for this behavior is due the normalization o f the points in the first step of the calculation o f the entropy similarity. The similarity between the points is determined solely on their relative relation to each other and does not account for other points that may be included in the set of data. The entropy similarity and dissimilarity asymptotically approach zero and one, respectively, as the Euclidean distance increases. For this reason, once the points o f interest begin to get far apart, the triangle inequality becomes false. Even though the points in the previous example maintained the same relative relationship in terms of pairwise distances, the entropy dissimilarity will fail to satisfy this ordering when compared to the Euclidean orderings. An interesting property o f the entropy similarity is that it captures the similarity of points relative to their relationship along each dimension. This means that if, say, the x-coordinate is exactly equal (xj = x 2 ) between two 2-D points, that the similarity between the two points will be, at least, 0.5, regardless o f the values o f the Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 87 ^-coordinate. As the value o f y 2 increases from y i to positive infinity, the similarity starts at 1.0 and asymptotically approaches 0.5. The plot in figure 30 shows the relationship between the Euclidean distance and the entropy dissimilarity. The plot shows how the dissimilarity changes as the distance remains constant. 
The data underlying the plot is for different series o f points that are exactly the same Euclidean distance from a single point. The difference between each series is their relationship to one the axes. The series labeled as along axis, is a series of points where one coordinate is the same as the source point and the other coordinate varies directly in relation to the distance. The series labeled as 0.01 degrees, has the coordinates from the first series projected so that they are oriented 0.01 degrees off the axis. The rest o f the series are represented in the same way; each point is projected to new coordinates at a particular angle from the axis by the following equations: * « w = * o + rf£ c o s ( 0 ) , a n d (3.27) ynew= y 0 + d E sin(0 ) , (3.28) where (xq, yg) is the source point from which the distance is measured and (xnew, ynew) is the new point that is a distance of dg from the original point. This type o f relationship and bounding is found in higher dimensional entropy similarity calculations as well. For example, when nrf = 3, we find two limits for similarity at 0.667 and 0.333, when two and one coordinates, respectively, are exactly the same. The type o f relationship to the Euclidean distance as shown in figure 30 will Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 88 also be m aintained, but will be more complicated as it will depend upon the angle offset from two axes. 100000 10000 uI) Along Axis 1000 -A_ 1 Degree Angle _x u 3 tu i m - 0.01 Degree Angle 100 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 i ! _x_ 30 Degree Angle j 45 Degree Angle j 0 0 10 Degree Angle 1 Entropy Dissimilarity Figure 30: Relationship of Euclidean Distance to Entropy Dissimilarity 3.2.1 Entropy Similarity in the taxmap Algorithm The taxmap algorithm was developed with the intention o f mimicking the way that a human would visually detect clusters in two and three dimensions. An individual would compare relative distances between points and search for continuous and relatively dense regions o f space that are surrounded by continuous relatively empty spaces [CARM 69]. The algorithm proceeds by first identifying the two most Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. similar points in the data and creating an initial cluster consisting o f these points. Then it searches for the point that is the most similar to the points already in the cluster and considers it for admission to the existing cluster. There are some criteria for judging whether this new point should be included in the cluster. If it meets the criteria, then it is added to the cluster and if it does not meet the criteria, the current cluster is considered complete and a new cluster is begun. The new cluster is started in the same way as the first; the two closest points that are not already in a cluster initiate the new cluster. While this algorithm may be simply stated, it requires some measure of similarity or distance and a corresponding criterion for determining whether a new point is added to an existing cluster. The original statement o f the algorithm suggested the use of the city block distance function for populating a similarity matrix. The criterion for determining inclusion o f a new point was somewhat arbitrary. This criterion was based on a measure o f discontinuity that was derived from the change in the average similarity if the new point was added. 
Specifically, the drop in similarity was defined as the average similarity before the point was added minus the average similarity after the point was added. The measure of discontinuity was defined to be the average similarity after the point was added minus the drop in similarity. If this value was considered low, the point would not be added and a new cluster would be started. The entropy similarity measure should also have a corresponding drop in similarity and a discontinuity measure for use in the taxmap algorithm. Given the Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 90 nature o f this measure, the preceding definitions would not be appropriate due to the logarithm in the calculations; the similarity asymptotically approaches one. Based on the equations in the previous sections, the drop in similarity will be defined specifically for use with the entropy similarity as follows. First, the entropy o f the current cluster will be calculated by using the same technique and equations as was defined for the similarity comparison of two data points. The equations need only be modified to the extent that the 2 should be replaced by rip, the number o f points in the comparison or, in this case, the number of points in the current cluster. This enables the calculation o f the overall cohesiveness o f the current cluster though the use o f the equations for entropy. The value for entropy can then be normalized based on the number of points based on equation this equation which is equation 3.14: / \ (3.29) The normalized entropy o f the current cluster can be calculated, the next candidate point can be added to the cluster and the normalized entropy for the proposed new cluster can also be calculated. The difference between these two entropies can then be used as the drop in similarity due to the addition of the new cluster point. Let As be the drop in similarity, then we have, (3.28) Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 91 The measure o f discontinuity may then be calculated similarly to the original taxmap algorithm by defining it as the drop in similarity minus the new similarity. Let M d denote the measure o f discontinuity: (3.29) The measure of discontinuity is then compared to some threshold value, T, that determines whether the point is added to the current cluster (Md > T) or a new cluster is begun (Md < T). The value that is used for the threshold will depend on the data that is being clustered. One interpretation o f the meaning o f the threshold value is that is signifies the density or cohesiveness that the analyst desires in the data. The threshold may take values in the [0, 1] interval and values that are close to one indicate that a high level of cohesiveness is required in the clusters. If a value if one is used, then the number of clusters will be the same as the number of unique points. If a value of zero is used, then there will only be a single cluster that contains all o f the points. In general, using values which are close to one will be most effective. This is due to the manner in which the similarity rapidly approaches one as the distance between points decreases. The taxmap algorithm is based on the idea of looking for areas that have a high density of points and forming a cluster. This leads to the behavior that the first cluster that is formed tends to be the most dense cluster in the data and the following clusters get gradually less dense. 
This tendency is not present in all cases, but it indicates that the threshold value used for determining when to start a new cluster may need to be adjusted downward as each cluster is finalized and a new cluster is started. We therefore add an additional parameter, rho, to the algorithm that allows the threshold value, T, to be reduced after each cluster is formed: T_new = T_old - rho. While this technique will not be optimal in all cases, it was used effectively in most of the examples presented below.

One final issue related to clustering algorithms in general will be discussed before we present examples of the results that can be expected from the algorithm. This issue relates to determining the correct number of clusters in the data, often referred to as a validity measure for the clustering that has been generated [XIE 91]. Many clustering algorithms, such as the k-means algorithm, require the analyst to select the number of clusters that are desired, after which some particular criterion or criteria are optimized to yield that number of clusters [BEZD 80]. Determining the appropriate number of clusters is done either through inspection by the analyst or through the use of some validity measure. This validity measure is generally specific to the type of algorithm being used to perform the clustering, and it usually consists of a calculation that tries to assess the compactness of the clusters generated and the separation between them. In the original version of the taxmap algorithm, as well as the new version proposed here, the number of clusters is determined by the threshold value. Depending on the initial value and whether it is adjusted over time, different clusterings may be produced.

A method that can be applied here is similar to one used by another clustering algorithm based on the principle of maximum entropy, the least bias fuzzy clustering algorithm [BENI 94]. That algorithm is very different in that it is an optimization technique, but it also has a threshold or resolution parameter. The method starts by assuming that there is initially no particular bias as to the value that the threshold should take. It proceeds by testing a finite quantized set of values from the [0, 1] interval based on a quantization parameter, 1/Q, where Q is the number of threshold values that are tested. All values of the threshold that yield the same number of clusters, y, are counted as p(y). Their fraction of the total, P(y) = p(y)/Q, is regarded as the probability that the solution to the clustering problem will yield y clusters. Therefore, the value of y that has the maximum value of P(y) is the most likely number of clusters. This same technique can be applied here, with the note that the threshold values, T, may either be held constant or varied for each cluster created, as was described above. In practice, it has been found that the correct number of clusters can be found without varying T, but that the quality of the clusters is not as good. So, the most probable number of clusters can be found using a constant T, but the best clusters are often found by allowing T to decrease during the creation of the clusters.
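A minimal sketch of this threshold sweep follows, reusing the taxmap_entropy function sketched above; the value of Q, the sweep range starting at 0.9, and the handling of the single-cluster outcome (left to the analyst, as discussed below) are assumptions.

from collections import Counter

def most_probable_cluster_count(points, Q=20, t_low=0.9, t_high=1.0):
    """Sweep a quantized set of Q threshold values and count how often each
    number of clusters occurs; the count with the largest fraction P(y) is
    taken as the most probable number of clusters."""
    counts = Counter()
    for i in range(Q):
        T = t_low + (t_high - t_low) * i / Q
        counts[len(taxmap_entropy(points, T))] += 1
    # The single-cluster outcome dominates for low thresholds and is left
    # to the analyst's judgment, as noted below.
    counts.pop(1, None)
    return counts.most_common(1)[0][0] if counts else 1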
One final note about this technique is that, when applied here, one cluster will tend to be the most frequently occurring number. This is because of the non-linearity inherent in the similarity measure. In [BENI 94], the resolution parameter is allowed to vary between 0 and 1, but for use here it is generally started at 0.9 and swept with a smaller increment. Generally, the analyst is responsible for determining whether there is a single cluster, and the algorithm determines the optimal number of clusters by evaluating the measure for two or more clusters.

The preceding discussion indicated that the correct number of clusters may be found, but that the specific clustering may not be optimal. This leads to one final problem: determining the best clustering once the correct number of clusters is known. This is often done through visual inspection of the resulting clusters and the subjective judgment of a human being. Ideally, we would have a measure of the quality of the clusters that does not depend on human judgment. This can be done in a manner consistent with the previous measures applied in the clustering algorithm. Since the clustering done here determines cluster values for each variable and the system function, the centroids of one of the coordinates (the variable) of the two dimensional clusters generated for the correct number of clusters can be substituted for the original values, and the overall joint entropy, H_c, can be calculated as specified previously for each clustering that yields the most probable number of clusters. The optimal clustering is the one that has the maximum joint entropy, H_c^max, when the substituted values are used. Note that H_c is itself an indicator of the number of clusters, in that it tends to reach a maximum for the optimal number of clusters; the drawback to using it as an indicator by itself is that its true maximum occurs when the number of clusters equals the number of points. In conjunction with the method for finding the correct number of clusters, selecting the clustering that has the maximum entropy leads to the best clustering using the most probable number of clusters. The following section contains examples that illustrate the use of T and H_c^max.

3.2.2 Entropy Similarity taxmap Examples

This section presents a number of examples of the application of the taxmap type of algorithm using the entropy similarity measure developed in the previous section. There are many examples identified in the literature that are difficult for existing algorithms to cluster correctly. The examples presented here are all in two dimensions. They begin with simple clusterings that illustrate the effects of using different values of T and the resulting values of H_c; the more difficult examples and the results generated by the algorithm specified here follow.

First, we begin with a simple example that is readily clustered by most common clustering algorithms and can be easily verified through visual inspection. Figures 31 through 34 display the results of the algorithm for different starting values of T and show the resulting values of H_c. Note that the value of H_c increases as the value of T increases and that the clusterings get visibly better. Also note that in figure 34 the value of H_c decreased and the result was three clusters.

Figure 31: T = 0.990, H_c = 0.96157
Figure 32: T = 0.993, H_c = 0.96197
Figure 33: T = 0.994, H_c = 0.96226
Figure 34: T = 0.995, H_c = 0.96171

The plot in figure 35 shows the results of varying the starting threshold, T, and the resulting number of clusters for each starting value. These results indicate that the most probable number of clusters is two. Again, it should be noted that the threshold value will tend to generate many clusterings that have only a single cluster
Note that the value o f Hc increases as the value of T increases and that the clusterings get visibly better. Also note that in figure 34, the value o f Hc decreased and that this resulted in three clusters. The plot in figure 35 shows the results of varying the starting threshold, T, and the resultant number o f clusters based on that starting value. These results indicate that the most probable number of clusters is two. Again, it should be noted that the threshold value will tend to generate many clusterings that have only a single cluster Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 30 -- ♦♦ Cluster 1 . Cluster 2 10 - - 0 20 10 30 Figure 31: T = 0.990, Hc = 0.96157 35 30 25 ♦♦ 20 C lu ste r 1 . C lu ste r 2 15 10 5 0 0 10 20 30 Figure 32: T = 0.993, Hc = 0.96197 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 97 35 3 0 42 5 - |- 20 ♦ ♦ ♦ ♦♦ ♦ .. » Cluster 1 . Cluster 2 15 - - 10 - - v - 5 0 10 20 30 Figure 33: T = 0.994, Hc = 0.96226 35 30 25 ♦♦ 20 + C lu s te r 1 15 _ C lu s te r 2 ! A C lu s te r 3 I 10 5 0 0 10 20 30 Figure 34: T = 0.995, Hc = 0.96171 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 98 and that the determination that there is more than a single cluster must be made manually. The next example is one o f the more difficult problems for clustering algorithms to solve. In consists of data that does not readily form circular clusters, which is a problem for many clustering techniques [RUSP 69]. 60 50 -40 -_ 30 -o3CJ ° 20 -- 10 .. 1 2 3 4 5 Number O f Clusters Figure 35: T = 0.90 to 1.0, incremented by 0.01 Note that the threshold value required to generate these clusters was very high. This is because the points are very similar along one dimension, in fact, the x coordinates are all exactly the same in each cluster. The only difference between these two clusters is along the y axis. Therefore, the similarity between points in different Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. clusters will be very high and pairs o f points between clusters will be guaranteed to have a similarity o f at least 0.5 140 .. 130 -- 55 r A Cluster 1! I ! m Cluster 2 j 55.5 56 56.5 57 57.5 Figure 36: Non-spherical Clusters, T = 0.9999 The next example consists of two different shaped clusters. One is a very narrow and linear type of cluster and the other is a rather diffuse roughly spherical cluster. Cluster which are linearly non-separable are also known to be problematic for clustering algorithms. The following two examples are both linearly non-separable and help to demonstrate why the taxmap method is known as a density based clustering algorithm. The first example in figure 38 is a rather sparse crescent shaped cluster that has a circular cluster contained within the arms o f the crescent. The second example contains exactly the same points as the previous example, but includes additional Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 100 points in both clusters; this example has clusters that are denser than the one before. The algorithm is unable to determine good clusters for the first crescent, but is able to generate a better answer for the denser version of the same type o f data. Another problem type for clustering algorithms is when the clusters have a bridge that connects the two clusters. 
This makes it difficult to determine where one cluster starts and another cluster begins. The following figure shows that the 5 4.5 4 3.5 3 |! _ Cluster 1 t1 i ■ Cluster 2! 2.5 2 1.5 1 0.5 0 0 2 4 6 8 10 12 Figure 37: Clusters with Different Shapes, T = 0.99 entropy similarity measure based algorithm also has some problem with this type. While it gathers points based on the next closest point, it tends to add more points than it should; it includes points in one cluster that can be easily seen to be a part of another cluster. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 101 o°o o Cluster 1 j x Cluster 2 j 0 Figure 38: Sparse Linearly Non-Separable Clusters, T = 0.984 4.5 4 3.5 3 Cluster 1 Cluster 2 2.5 2 1.5 1 0.5 0 0 1 2 3 4 5 6 Figure 39: Dense Linearly Non-Separable Cluster, T = 0.998 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 102 The next two examples expose a limitation o f using the entropy similarity measure in the taxmap algorithm. This limitation is related to the behavior of the similarity measure for points that share exactly the same coordinate. Previously, it was shown that when the two clusters include points that share a coordinate that the algorithm required a very high value o f T to distinguish the clusters. As shown in figure 33, there are four clusters and each o f the four clusters shares coordinates with the other clusters. The algorithm is unable to correctly distinguish the clusters, but in figure 34, the clusters have been rotated so that they no longer share coordinates. The density of each cluster and the relationship o f the points within the cluster remain the same. 1.6 1.5 1.4 1.3 M. k 1.2 □ □ cb i a A a □ □ §□ □ a. 1.1 □ □ □□ 1 D 0.9 □ k 4 k a A Cluster 1 | D Cluster 2 ! a k 4 k a a a a a a a a O A 0.8 0.7 0.6 . , , . . 0 0.5 1 1.5 2 . 2.5 3 Figure 40: Clusters with a Bridge, T = 0.997 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 103 * Cluster 1 □ Cluster 2 A Cluster 3 x Cluster 4 10 15 20 Figure 41: Four Clusters Sharing Coordinates, T = 0.991 * Cluster 1 j a Cluster 2 | I □ Cluster 3 w iI x Cluster 4 i Figure 42: Four Clusters, T = 0.999 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Chapter 4. Summary and Conclusions Reconstructability analysis is a potent tool for the analysis o f multivariate systems that has been applied in a wide variety o f fields (for example, see [KLIR 86], [KUMA 89], [TRTV 93]). These applications were limited to small systems that were, for the most part, completely defined. In cases where the knowledge o f the system was incomplete, various techniques were used to induce the complete system. These techniques include the existing techniques of the entropy fill and the one dimensional clustering, as well as, ad hoc techniques used that were based on expert knowledge of the system being analyzed. The author has presented a more complete understanding o f the problems o f incomplete systems and provided new techniques that yield answers that are superior to existing techniques and are consistent with the underlying theory and algorithms of K-systems analysis. The rest o f this chapter will present the preceding results as a comprehensive whole. 
First, the author will outline a methodology for using these new techniques to achieve results which are comprehensible and consistent with the underlying theory of K-systems analysis. Finally, future areas of research will be identified within the field of K-systems analysis, and possible applications of these new techniques outside the field will be briefly explored.

4.1 A Methodology for Resolving Incomplete Systems

The understanding of incomplete systems is one of determining whether there is simply missing data or whether the variable values are scattered, either through faulty control of the variables or imprecise observation. The best resolution of an incomplete system is likely to come from a source of information that is external to the data submitted for analysis. If the analyst knows that the variable values are drawn from a discrete set of values and that slight variations are due to imperfect control or observation, then the incompleteness of the system is known to be solely due to missing data; some of the possible states have simply not been observed. In effect, the clustering is done manually by the analyst, with knowledge external to the data set, before the data are submitted for analysis. The system can then be processed by the missing data algorithms and the results can be directly submitted for K-systems analysis.

Alternatively, if the analyst has no prior knowledge of the system structure, determining whether the incompleteness of the data is due to missing states or to scattered variable values is more problematic. Ideally, there is sufficient data that the missing states can be addressed using the closest state algorithms, since there is then none of the loss of information associated with applying a clustering algorithm. By imputing values for the missing states, all the existing information about the system is used and only minimal assumptions are made about the missing states. If clustering is performed, information is, in some sense, potentially lost by clumping together known information into a single information entity.

The general question about incomplete systems is whether the data should be considered missing and whether the data should be considered scattered. If there is no other knowledge about the source of the system data, there is uncertainty as to what
Since a completely general answer to this problem seems highly unlikely, a number o f possible approaches are suggested. First, one reasonable approach is to determine whether any o f the closest state sets o f a missing state are empty. If so, the closest state algorithms will still provide an answer, but the answer will actually consist o f estimates based on states that are not in the closest state set. We have shown that as the distance between states becomes greater, the amount o f shared information becomes far less and we would expect that the estimates calculated would be proportionally more biased. For systems that consist o f a large number o f variables and states, the previous approach may be too restrictive due to the large amount o f data that is required about Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 107 the system. An alternative approach may be that the missing data may be imputed so long as the m axim um gap between existing states is less than h a lf o f the maximum distance between possible states defined for the system. This allows closest state sets to be empty, but requires that the values used to impute the missing values share at least half o f their information with the missing states. If neither of the two previous criteria are met, it is clear that some other method for inducing a complete system must be tried. If the data are experimental and the system is relatively small, it may be possible to gather more information about the system with the express purpose o f filling in some o f the missing data. If the data is observational and not experimental or if the system is large and gathering more experimental data is costly or difficult, it may be impossible to gather more information about the system. For these types of cases and others, we can assume that the only information about the system is that which already exists and additional information to reduce the amount o f missing data is not available; the analysis must proceed with only the existing data. Clustering the data in some fashion may reduce the amount o f states that are missing in the system. If the amount o f missing data can be eliminated by clustering or reduced so that the system meets one o f the previous criteria for imputing missing data, a meaningful analysis may be performed. An essential feature o f the clustering is that the clusters themselves must be meaningful. Creating clusters based on the variable values themselves along with the system function assures that, at least, the clusters are relevant to the context o f the system, that is the system function. These Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 108 two dimensional clusters map directly back to the original system and are easily understood to relate to the original system. The clustered data may then be re-analyzed to determine if there is still missing data and to what extent The closest states algorithms may then be used to impute any remaining missing values and the results of the analysis may be readily mapped back to the original system. In addition, predictions o f the effects o f previously unknown variable values may be determined as was described previously. 
4.2 Conclusions and Final Remarks

The author has proposed the use of the closest states algorithm and the entropy similarity measure as the methods for performing the imputation and clustering of the data; applied together, these techniques can be used to induce a complete system. In general, these methods use the same principles and mathematics that are already embodied within the existing K-systems algorithms and techniques. They are based on well-known principles that enable a consistent approach to K-systems analysis. While the use of other techniques yields systems that have properties sufficient for the use of the RA algorithms, these new techniques provide solutions that are either superior to previous techniques or more meaningful and more easily understood when applied within the context of K-systems analysis.

Additionally, a general methodology for inducing a complete system has been introduced. This includes criteria for determining when to treat an incomplete system as containing solely missing data and when to treat it as including scattered data as well. After this determination has been made, new algorithms for resolving each case have been developed. The new algorithms are based on the same principles and mathematics as the existing techniques that allow the probabilistic reconstructability analysis algorithms to be applied to g-systems.

The basic methodology for performing K-systems analysis is as follows:

1. Determine if the system is complete by assessing whether all possible states have an associated system function value.

2. If the system is complete, apply the K-systems algorithms. If not, assess the extent of the missing data by determining whether the system has either a) any empty closest state sets or b) existing states separated by more than half the maximum distance between states.

3. If neither of the two criteria is met, the closest states algorithm may be immediately applied and the results submitted for K-systems analysis. No further processing is required to assess or use the results of this analysis.

4. If one of these criteria is met, perform two-dimensional clustering (using each variable and the system function as the two dimensions) with the entropy similarity taxmap method. Determine whether the resulting system is complete. If it is complete, apply the K-systems algorithms. If not, use the closest states algorithm and then complete the analysis.

5. Results generated using the two-dimensional clustering require some additional computation if predictions of previously unknown variable values are desired. This is done by projecting the two-dimensional clusters onto the variable axis and calculating the probability that a particular variable value falls into one or more of the clusters. Predictions are made for all non-zero probabilities, and these predictions include the probability products of the possible clusters.

The author has provided a comprehensive methodology for resolving incomplete systems so that they may be submitted for K-systems analysis. An algorithm for imputing missing values has been presented that is superior to existing techniques. A unifying methodology for distinguishing between missing data and data scattering has been presented.
A new similarity measure based on the mathematics of information theory has been discovered and its use illustrated within an existing clustering algorithm.

Note that more research is needed on both missing data and clustering as they apply to K-systems analysis. In particular, a comprehensive comparison of variants of the closest state algorithm may provide further insight into the applicability of the general algorithm and of the variants that are possible. Also, the entropy similarity and the corresponding dissimilarity have been characterized in terms of the properties they possess and their relationship to the Euclidean distance. The use of the entropy similarity has been demonstrated within an existing algorithm, but additional applications using other algorithms, or the development of an algorithm specific to the measure, may yield additional benefits. One algorithm in particular, based on the principle of maximum entropy and a pairwise similarity matrix, seems especially promising for application of the entropy similarity measure [HOFM 97].

Bibliography

[ANDE 73] Anderberg, Michael R. (1973). Cluster Analysis for Applications. Academic Press, New York, New York.

[BEZD 80] Bezdek, James C. (1980). A convergence theorem for the fuzzy ISODATA clustering algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2, 1, 1-8.

[BEZD 92] Bezdek, James C. and Pal, Sankar K. (1992). Fuzzy Models for Pattern Recognition. The Institute of Electrical and Electronics Engineers, New York, New York.

[BENI 94] Beni, Gerardo and Liu, Xiaomin (1994). A least biased fuzzy clustering method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16, 9, 954-960.

[CARM 68] Carmichael, J. W., George, J. A. and Julius, R. S. (1968). Finding natural clusters. Syst. Zool., 17, 144-150.

[CARM 69] Carmichael, J. W. and Sneath, P. H. A. (1969). Taxometric maps. Syst. Zool., 18, 402-415.

[CAVA 81a] Cavallo, Roger E. and Klir, George J. (1981). Reconstructability Analysis: Overview and Bibliography. International Journal of General Systems, 7, 1-6.

[CAVA 81b] Cavallo, Roger E. and Klir, George J. (1981). Reconstructability Analysis: Evaluation of Reconstruction Hypotheses. International Journal of General Systems, 7, 7-32.

[CAVA 82] Cavallo, Roger E. and Klir, George J. (1982). Decision Making in Reconstructability Analysis. International Journal of General Systems, 8, 243-255.

[COVE 91] Cover, Thomas M. and Thomas, Joy A. (1991). Elements of Information Theory. John Wiley and Sons, Inc., New York, New York.

[EVER 93] Everitt, Brian S. (1993). Cluster Analysis. John Wiley and Sons, Inc., New York, New York.

[GOUW 96] Gouw, Deky and Jones, Bush (1996). The Interaction of K-Systems Theory. International Journal of General Systems, 24, 163-169.

[GUIA 85] Guiasu, Silviu and Shenitzer, Abe (1985). The Principle of Maximum Entropy. The Mathematical Intelligencer, 7, 42-48.

[HART 71] Hartley, H. O. and Hocking, R. R. (1971). The analysis of incomplete data. Biometrics, 27, 783-808.

[HART 75] Hartigan, John A. (1975). Clustering Algorithms. John Wiley and Sons, Inc., New York, New York.

[HOFM 97] Hofmann, Thomas and Buhmann, Joachim M. (1997). Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 1, 1-14.
[JONE 82] Jones, Bush (1982). Determination of Reconstruction Families. International Journal of General Systems, 8, 225-228.

[JONE 85a] Jones, Bush (1985). Determination of Unbiased Reconstructions. International Journal of General Systems, 10, 169-176.

[JONE 85b] Jones, Bush (1985). A Greedy Algorithm for a Generalization of the Reconstruction Problem. International Journal of General Systems, 11, 63-68.

[JONE 85c] Jones, Bush (1985). Reconstructability Analysis for General Functions. International Journal of General Systems, 11, 133-142.

[JONE 85d] Jones, Bush (1985). Reconstructability Considerations with Arbitrary Data. International Journal of General Systems, 11, 143-151.

[JONE 85e] Jones, Bush (1985). The Cognitive Content of System Substates. IEEE Workshop on Languages for Automation.

[JONE 86] Jones, Bush (1986). K-systems Versus Classical Multivariate Systems. International Journal of General Systems, 12, 1-6.

[JONE 89] Jones, Bush (1989). A Program for Reconstructability Analysis. International Journal of General Systems, 15, 199-205.

[KLIR 86] Klir, George J. (1986). The Role of Reconstructability Analysis in Social Science Research. Mathematical Social Sciences, 12, 205-225.

[KUMA 89] Kumar, Vinod, Kumar, Uma, and Hoshino, Kyuoji (1989). An Application of the Entropy Maximization Approach in Shopping Area Planning. International Journal of General Systems, 16, 25-42.

[LITT 82] Little, R. J. A. (1982). Models for nonresponse in sample surveys. Journal of the American Statistical Association, 77, 237-250.

[ORCH 72] Orchard, T. and Woodbury, M. A. (1972). A missing information principle: Theory and applications. Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, 1, 697-715.

[PITT 89] Pittarelli, Michael (1989). Uncertainty and Estimation in Reconstructability Analysis. International Journal of General Systems, 15, 1-58.

[RUSP 69] Ruspini, Enrique H. (1969). A new approach to clustering. Information and Control, 15, 1, 22-32.

[SAND 83] Sande, I. G. (1983). Hot deck imputation procedures. In Incomplete Data in Sample Surveys, Vol. III: Symposium on Incomplete Data, Proceedings. Academic Press, New York.

[SHAN 48] Shannon, Claude E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379-423, 623-656.

[TRIV 93] Trivedi, Sudhir K. (1993). Reconstructability Theory for General Systems and its Application to Automated Rule Learning. Ph.D. dissertation, Louisiana State University, Baton Rouge, LA.

[XIE 91] Xie, Xuanli L. and Beni, Gerardo (1991). A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 8, 841-847.

[ZADE 71] Zadeh, Lotfi A. (1971). Similarity relations and fuzzy orderings. Information Sciences, 3, 177-200.

[ZADE 65] Zadeh, Lotfi A. (1965). Fuzzy Sets. Information and Control, 8, 338-353.

Vita

Gary J. Asmus was born in Hampton, Virginia, and grew up in Wisconsin and then Missouri, where he graduated from Washington High School in Washington, Missouri. After beginning his college career in Missouri, he completed his undergraduate studies at Louisiana State University and Agricultural and Mechanical College in December 1992. He returned to Louisiana State University in 1994 as a Board of Regents Fellow and completed his doctoral work in 1998.
His interests center on the role of analogy in human and artificial intelligence. He pursues this interest through the study of numerical and K-systems analysis, clustering or unsupervised learning, and genetic and evolutionary algorithms. In particular, he is interested in the application and development of simple techniques that can be applied to model complex systems of all types, from meteorology to the human mind.

DOCTORAL EXAMINATION AND DISSERTATION REPORT

Candidate: Gary J. Asmus
Major Field: Computer Science
Title of Dissertation: Techniques for Resolving Incomplete Systems in K-systems Analysis
Date of Examination: October 15, 1998