Optimal Classification/Printable version

From Wikibooks, open books for an open world
Jump to navigation Jump to search


Optimal Classification

The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at
https://en.wikibooks.org/wiki/Optimal_Classification

Permission is granted to copy, distribute, and/or modify this document under the terms of the Creative Commons Attribution-ShareAlike 3.0 License.

Rypka Method

Rypka's Method[edit | edit source]

Rypka's[1] method[2] utilizes the theoretical and empirical separatory equations shown below to perform the task of optimal classification. The method finds the optimal order of the fewest attributes, which in combination define a bounded class of elements.

Application of the method begins with construction of an attribute-valued system in truth table[3] or spreadsheet form with elements listed in the left most column beginning in the second row. Characteristics[4] are listed in the first row beginning in the second column with the title of the attributes in the upper left most cell. Normally the file name of the data is given the title of the element class. The values which connect each characteristic with each element are placed in the intersecting cells. Selecting characteristics which all elements share may be the most difficult part of creating a database which can utilizing this method.

The elements are first sorted in descending order according to their truth table value, which is calculated from the existing sequence and value of characteristics for each element. Duplicate truth table values or multisets for the entire bounded class reveal either the need to eliminate duplicate elements or the need to include additional characteristics.

An empirical separatory value is calculated for each characteristic in the set and the characteristic with the greatest empirical separatory value is exchanged with the characteristic which occupies the most significant attribute position.

Next the second most significant characteristic is found by calculating an empirical separatory value for each remaining characteristic in combination with the first characteristic. The characteristic which produces the greatest separatory value is then exchanged with the characteristic which occupies the second most significant attribute position.

Next the third most significant characteristic is found by calculating an empirical separatory value for each remaining characteristic in combination with the first and second characteristics. The characteristic which produces the greatest empirical separatory value is then exchanged with the characteristic which occupies the third most significant attribute position. This procedure may continue until all characteristics have been processed or until one hundred percent separation of the elements has been achieved.

A larger radix will allow faster identification by excluding a greater percentage of elements per characteristic. A binary radix for instance excludes only fifty percent of the elements per characteristic whereas a five-valued radix excludes eighty percent of the elements per characteristic.[5] What follows is an elucidation of the matrix and separatory equations.[6]

Equations[edit | edit source]

  1. Truth Table Size-Related Equations
  2. Separatory Equations
    1. Element-Related equations
    2. Characteristic-related equations
      1. Theoretical separation
      2. Empirical separation
        1. Target Set Truth Table Values
        2. Separation Stages

Examples[edit | edit source]

  1. Computational Example
  2. Application Example

Notes and References[edit | edit source]

  1. :Eugene Weston Rypka passed away on April 27, 2006. Gene was born on May 6, 1925 in Owatonna, MN to Charles Frederick and Ethel Marie Rypka. He served in World War II as a paramedic in Iwo Jima and received several medals and commendations. In 1958, Gene received a Ph.D. in Medical Microbiology from Stanford University. He had a long and distinguished career, including work with Russian scientists at Lovelace Medical Center and the University of New Mexico. Bicycle racing was a lifetime love and occupation, and in later years, he also studied martial arts.
  2. Primary reference: Biological Identification with Computers edited by R.J. Pankhurst, British museum (natural history) London, England proceedings of a meeting held at Kings College, Cambridge 27 and 28 September 1973 of the Systematics Association Special Volume Number 7 and published by the Academic Press 1975 noting the work of Eugene W. Rypka, Dept. of Microbiology, Lovelace Center for Health Sciences, Albuquerque, New Mexico, "Pattern Recognition and Microbial Identification." ISBN 0125448503
  3. Characteristics and attributes are used interchangeably.
  4. See Table II. page 158 of the primary reference
  5. The primary reference should be consulted for a more detailed and in depth explanation of the method.


Rypka Method/Equations/Truth Table Size

Truth Table Size-Related Equations[edit | edit source]

These equations derive the truth table[1] size from the highest value of logic (V) and the highest number of characteristics (C} in the target set and in the group of the bounded class[2] of elements which denote the range for the truth table values of the set and the group.[3]

Group size[edit | edit source]

, where:
  • Q is the group size[4]
  • V is the highest value of logic in the group,
  • C is the highest number of characteristics in the group.

Target Set size[edit | edit source]

, where:
  • R is the target set size[5]
  • V is the highest value of logic in the group,
  • K is the highest number of characteristics in the target set.

Notes[edit | edit source]

  1. See page 158, Table II of the primary reference
  2. See page 172 of the primary reference.
  3. group - Columns C1 to C5 (0 to C, C=5)
  4. Target set - Columns C1 to C3 (0 to K, K=3)


Rypka Method/Equations/Separatory/Elements

Element Related Equations[edit | edit source]

Order of elements[edit | edit source]

The elements are arranged in descending order according to their truth table value, i.e., the value calculated as the sum of each characteristic's logic state value times the highest value of logic raised to the power of the order of the characteristic.[1] The element truth table value allows elements to be sorted and identified as unique or equivalent and the bounded class identified as a set or multiset.

, where:
  • ei is the element truth table value in the group,
  • V is the highest value of logic in the group,
  • v is the value of logic of each characteristic in the group,
  • j is the jth characteristic index, where:
j = 0..C and where:
  • C is the number of characteristics in the group,
  • i is the ith element index, where:
i = 0..G and where:
  • G is the number of elements in the bounded class.

Maximum number of pairs of elements to separate[edit | edit source]

Maximum number of pairs of elements to separate refers to triangularization of the matrix to permit comparison of each element with every other element to determine the number of pairs that are separable or disjoint. Pairs are separable or disjoint whenever the logic values of the elements that make up a pair are different. In theory, therefore the maximum possible number of pairs that can be separated is determined by the following equation:[2]

, where:[3]
  • pmax is the maximum number of pairs to separate, and
  • G is the number of elements in the bounded class.

Notes[edit | edit source]

  1. See Number Systems - essentially using the characteristic values to compute a network or memory address, followed by sorting.
  2. See page 176 Table XI of the primary reference.
  3. Triangular Number


Rypka Method/Equations/Separatory/Characteristic/Theoretical

Theoretical Separation[edit | edit source]

The general identification equation[edit | edit source]

, where:[1]
  • Sj is the theoretical separatory value per jth characteristic,
  • C is the number of characteristics in the group,
  • V is the highest value of logic in the group and
  • j is the jth characteristic index in the target set, where:
j = 0..K and where:
  • K is the number of characteristics in the target set.

Minimal number of characteristics to result in theoretical separation[edit | edit source]

, where:[2]
  • tmin is the minimal number of characteristics to result in theoretical separation,
  • G is the number of elements in the bounded class and
  • V is the highest value of logic in the group.

Notes[edit | edit source]

  1. See page 153, page 167, Fig. 2. & page 175 of the primary reference
  2. See page 157, Primary Schemes footnote of the primary reference


Rypka Method/Equations/Separatory/Characteristic/Empirical/Target Set

Target Set Truth Table Value[edit | edit source]

, where:[1]

  • vi,j is the element's attribute value,
  • i is the ith element's index value, where,
i = 0...G' where G is the number of elements in the bounded class, and,
  • j is the j'th characteristic's index value, where,
j = 0...K and where,
  • K is the number of characteristics in the target set,
  • V is highest value of logic in the group,
  • V(K-j) is the positional value of the jth characteristic.

, where,

nti contains the multiset count for each truth table value[2].

Notes[edit | edit source]

  1. As the characteristic with the greatest separatory value is moved to the next most significant position, K is incremented to expand the target set size from two characteristics to the number of characteristics in the group or the number of characteristics which result in 100% separation. For the initial target set with one characteristic the separatory value is computed individually for each characteristic in the group to find the initial characteristic with the highest separatory value.
  2. coefficient of association, , see page 172 of the primary reference


Rypka Method/Equations/Separatory/Characteristic/Empirical/Separation

Separation Stages[edit | edit source]

Initial separation[edit | edit source]

, where:[1]

  • Sj is the initial empirical separatory value for each characteristic, where,
j = 0...C and is the index of the jth characteristic in the group and C is the number of characteristics in the group, and,
l = 0...R and is the truth table value of the jth characteristic, where R is the truth table size, where:
R = V0, and,
  • V is the highest value of logic in the group and,
  • 0 is the target set exponent for a single characteristic, and,
  • G is the number of elements in the bounded class.

Subsequent separation[edit | edit source]

, where:

  • Sj is the initial empirical separatory value for each characteristic, where,
l = 0...R and is the target set truth table index value, where R is the target set truth table size value, where:
R = VK, and,
  • V is the highest value of logic in the group and,
  • K is the number of characteristics in the target set, and,
  • G is the number of elements in the bounded class.

Notes[edit | edit source]

  1. Must be applied initially to each characteristic. (The equations have been implemented in Mathcad and using the Visual Basic programming language. See Application_Example


Rypka Method/Examples/Computational

Bounded Class Data[edit | edit source]

Bounded Class Dimensions[edit | edit source]

G = 28 - 28 elements - i = 0...G-1[1]

C = 10 - 10 characteristics or attributes - j = 0...C-1

V = 5 - 5 valued logic - l = 0...V-1

Order of Elements[edit | edit source]

eft

Count multisets[edit | edit source]

Squared multiset Counts[edit | edit source]

Squared Multiset Counts Summation[edit | edit source]

Separatory Values[edit | edit source]


max(T) = 309 = S8 = highest initial separatory value

Notes[edit | edit source]

  1. Mathcad's ORIGIN function applies to all arrays such that if more than one array is being used and one array requires a zero origin then the other arrays must use a zero origin with all variables being adapted as well.


Rypka Method/Examples/Application

Flag Recognition[edit | edit source]

Although the following example lacks an intuitive sense of optimization it serves as a good example of how optimization can reduce the number of queries. While the method of optimal classification is highly beneficial for reducing the number of queries required for manual identification, automated identification may be better served by use of a neural network.

Flag overlay grid[edit | edit source]

Designated areas (characteristics) for sampling background colors (states) of all flags (elements).

The overlay is used to determine the color of each area for each flag and the color is recorded in the table as the logical state of the area. The table data is then submitted to the optimization program and processed until an optimal empirical separatory value is obtained.
Data set[edit | edit source]
FLAGS/LOC,A,B,C,D,E,F,G,H,I
BELGIUM,BLACK,YELLOW,ORANGE,BLACK,YELLOW,ORANGE,BLACK,YELLOW,ORANGE
FRANCE,BLUE,WHITE,RED,BLUE,WHITE,RED,BLUE,WHITE,RED
GERMANY,BLACK,BLACK,BLACK,RED,RED,RED,YELLOW,YELLOW,YELLOW
IRELAND,GREEN,WHITE,ORANGE,GREEN,WHITE,ORANGE,GREEN,WHITE,ORANGE
ITALY,GREEN,WHITE,RED,GREEN,WHITE,RED,GREEN,WHITE,RED
JAPAN,WHITE,WHITE,WHITE,WHITE,RED,WHITE,WHITE,WHITE,WHITE
LUXEMBOURG,RED,RED,RED,WHITE,WHITE,WHITE,BABY,BABY,BABY
NETHERLANDS,RED,RED,RED,WHITE,WHITE,WHITE,BLUE,BLUE,BLUE
SPAIN,RED,RED,RED,YELLOW,YELLOW,YELLOW,RED,RED,RED
HELIGOLAND,GREEN,GREEN,GREEN,RED,RED,RED,WHITE,WHITE,WHITE
Original order[edit | edit source]

Systematic query[edit | edit source]

Starting with area "A" the query begins by asking for the color in this area of the flag. Suppose we have in our possession the flag of the Netherlands. The answer to the first query in regard to area "A" is RED which would remove 2/3 of the flags from further consideration. The next query for the color in area "B" would be RED which would serve to eliminate none of the remaining flags. In fact, since the colors in columns "D", "E" and "F" are the same for each remaining flag, we would not be able to eliminate any remaining flags until column "G" where the color BLUE would provide a unique answer to the final necessary query. Here all remaining flags except the flag of the Netherlands would be eliminated from further consideration. It would therefore take a minimum of seven queries using the systematic query method to establish the identity of the flag in our possession as belonging to the Netherlands.

Optimized order[edit | edit source]

Minimized query[edit | edit source]

The results of optimization are shown above and include a listing of the theoretical and empirical percentages. The original characteristic sequence is indexed in the bottom row. Starting with area "G" the query begins by asking for the color in this area of the flag. Suppose we have in our possession the flag of Ireland. The answer to this first query would be GREEN. The next query is for the color in area "F" to which we would answer ORANGE. Since no other flags have this combination of GREEN and ORANGE in these areas our query can end here. The method has minimized the number of necessary queries required to identify the flag by optimizing the order of characteristics. (Please note that there may be more than one minimal solution.)

Computer program GUI[edit | edit source]

Identification by Polychotomous Key[edit | edit source]