Talk:Optimal Classification
From Wikibooks, the open-content textbooks collection
This page was previously nominated for deletion, but was kept. Please see the discussion in the Wikibooks:Requests for deletion archives for justifications and discussion. Old discussions should be taken into account before nominating again for deletion.
Data submittal
You may submit an unoptimized data set in CSV format here and I will provide the optimization below it. The topic name goes in the first column of the first row, followed by the names or labels of each characteristic. Each following row begins with the name of an element in its first column, followed by the states that link each characteristic to that element. Please limit the number of elements to 1,000, the number of characteristics to 50 and the number of states to 20. Typative (talk) 11:54, 2 August 2008 (UTC)
Example
FLAGS/LOC,A,B,C,D,E,F,G,H,I
BELGIUM,BLACK,YELLOW,ORANGE,BLACK,YELLOW,ORANGE,BLACK,YELLOW,ORANGE
FRANCE,BLUE,WHITE,RED,BLUE,WHITE,RED,BLUE,WHITE,RED
GERMANY,BLACK,BLACK,BLACK,RED,RED,RED,YELLOW,YELLOW,YELLOW
IRELAND,GREEN,WHITE,ORANGE,GREEN,WHITE,ORANGE,GREEN,WHITE,ORANGE
ITALY,GREEN,WHITE,RED,GREEN,WHITE,RED,GREEN,WHITE,RED
JAPAN,WHITE,WHITE,WHITE,WHITE,RED,WHITE,WHITE,WHITE,WHITE
LUXEMBOURG,RED,RED,RED,WHITE,WHITE,WHITE,BABY,BABY,BABY
NETHERLANDS,RED,RED,RED,WHITE,WHITE,WHITE,BLUE,BLUE,BLUE
SPAIN,RED,RED,RED,YELLOW,YELLOW,YELLOW,RED,RED,RED
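For anyone preparing a submission, the layout requested above (topic name and characteristic labels in the first row, one element per following row) can be checked before posting with something like the sketch below. The function name `load_data_set` and the constant names are made up for illustration, and reading "number of states" as the count of distinct states across the whole data set is an assumption; the request above does not say whether the limit applies per characteristic or overall.

```python
import csv
import io

# Limits quoted in the submission request above.
MAX_ELEMENTS = 1000
MAX_CHARACTERISTICS = 50
MAX_STATES = 20  # assumption: counted as distinct states over the whole data set


def load_data_set(text):
    """Parse a submission and return (topic, characteristics, elements).

    `elements` maps each element name to its list of states, one state per
    characteristic. Hypothetical helper, not part of the book's program.
    """
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header, body = rows[0], rows[1:]
    topic, characteristics = header[0], header[1:]

    elements = {}
    for row in body:
        if len(row) != len(header):
            raise ValueError(f"{row[0]!r}: expected {len(characteristics)} states")
        elements[row[0]] = row[1:]

    states = {state for values in elements.values() for state in values}
    if len(elements) > MAX_ELEMENTS:
        raise ValueError("too many elements")
    if len(characteristics) > MAX_CHARACTERISTICS:
        raise ValueError("too many characteristics")
    if len(states) > MAX_STATES:
        raise ValueError("too many states")
    return topic, characteristics, elements


# A shortened sample in the same layout as the flag example.
SAMPLE = """FLAGS/LOC,A,B,C
BELGIUM,BLACK,YELLOW,ORANGE
FRANCE,BLUE,WHITE,RED
"""

topic, characteristics, elements = load_data_set(SAMPLE)
print(topic, characteristics, elements)
```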
comments and questions
This book is a work in progress and is undergoing revisions, expansion, etc. Please feel free to add your comments and questions regarding anything about this topic, including the method, formatting, etc., to help editors improve this book should you choose not to make any edits yourself. Thanks. Typative (talk) 17:32, 1 August 2008 (UTC)
Splitting into smaller sub-pages
Why should this be split into smaller sub-pages? I don't find the length confusing in the least. On the contrary, keeping everything here allows for a quick overview. --Swift (talk) 23:10, 31 July 2008 (UTC)
- I felt the same way since for one thing the application example has a tendency to immediately capture one's attention. One might otherwise keep on "thumbing". Never hurts to give something a shot but let me know if you do not like it and I'll be happy to restore the single page. Typative (talk) 11:20, 1 August 2008 (UTC)
- I also don't see a need to split it into sub-pages at this time (the book is too recent and still very small in size). The split request tag was added by Mike.lifeguard, but with no specific comment or post to indicate the reason. I'll leave a post on his talk page. --Panic (talk) 18:28, 3 August 2008 (UTC)
- It is customary (though you are by no means bound by custom) to have the main page of a book as a cover or TOC, and have content on subpages. — Mike.lifeguard | talk 18:48, 3 August 2008 (UTC)
- One of the problems I find when contributing to a book, even when its size is an impediment to a monolithic approach, is that if it is too segmented there is no way to see the whole structure. It becomes very hard to give the work any flow or sense of direction, or to reduce duplication of content. --Panic (talk) 00:43, 4 August 2008 (UTC)
Expand this
The theory of optimal classification is a large area of research. This book, however, covers only a single algorithm, like an article would. This book really needs to be expanded to include more information, like a book should. There are a lot of important topics, like the Bayes classification algorithm, that aren't covered here. To help expand this book quickly, you can request imports of related articles from Wikipedia, and turn those into book pages. --Whiteknight (Page) (Talk) 23:53, 31 July 2008 (UTC)
- What I had in mind was limiting it to algorithms and processes that perform the same function. Is that what you have in mind as well? Typative (talk) 11:23, 1 August 2008 (UTC)
- Just tested a method of hierarchical expansion, similar to a computer folder tree (as in Microsoft Windows), with:
- "Optimal Classification/Application Example/Flag Recognition".
- The tree structure can easily accommodate expansion or contraction.
- For instance, if one assumes Optimal Classification to be at "book" level then a "chapter" level for each method of Optimal Classification can be inserted into the hierarchy as follows:
- Optimal Classification/Chapter 1/Application Example/Flag Recognition.
- Revisions of the tree will require that links be updated manually or perhaps by bot.
- Typative (talk) 14:47, 1 August 2008 (UTC)
- This is the benefit of the flat database structure of the wiki. Don't introduce a hierarchy except where absolutely needed. So, to indicate which pages are part of a book, you place them under <Bookname>/<pagename>. Where there is clear use in mutual back-links (as is the case with Japanese/Vocabulary: there are links to this page just under the title on the sub-pages, e.g. Japanese/Vocabulary/Animals), pages can be placed in deeper sub-pages. Otherwise, the structure of the book should be created simply by linking.
- That allows for a more flexible system of interlinking and breaks us out of the confines of linear books. It also allows for reorganising of content such as moving a page between chapters simply by changing the link, rather than having to move the page as well. A page can even appear in the learning path more than once. --Swift (talk) 00:46, 4 August 2008 (UTC)
- Further testing indicates that a lot of work is required to manually fix the links when the hierarchy structure is changed. Although a tree structure is logically superior it may not be feasible without a bot to follow and make the changes. Typative (talk) 15:23, 1 August 2008 (UTC)
- No tags; it's based only on the location of the pages. Take a look at C++ Programming: notice that if you select TOC1 on the top of the window it shows "< C++ Programming". The order is given by the location of the page in the book namespace. Move a page to another subpage location and you reorganize the tree, but that implies that the top page (the root) is the TOC. --Panic (talk) 17:23, 4 August 2008 (UTC)
- Is there a tutorial for using this method? Typative (talk) 10:15, 7 August 2008 (UTC)
- It's not a method but a feature provided in the software. See the corresponding links at Optimal Classification/Application Example/Flag Recognition (the ones that link to Optimal Classification and Application Example). --Swift (talk) 11:28, 7 August 2008 (UTC)
Additional methods being sought to fill additional chapters...
Starting with this query at the Wikipedia mathematics reference desk.
Note: Because of the set size limit shown in the primary reference and the notation of a sub-scheme, the method of permutation was used in the evaluation program. It is assumed that the need for a set size limit and a sub-scheme was due to the limitations of the computing facility at the time (1971), namely a Burroughs 5700 time-sharing terminal. Current PC technology can accommodate a much larger set size before a sub-scheme must be used for the remaining characteristics that fall outside the set limit. If there is a decision tree trimming method that can accomplish the same function without requiring a limit on set size or a sub-scheme, then it should be included here. Typative (talk) 13:42, 12 August 2008 (UTC)
more general book
I think a discussion of classification algorithms, such as this algorithm, should stay at Wikibooks. However, rather than put each algorithm in its own book, I think it would be better to collect several algorithms per book. So I suggest moving this "Optimal Classification" "book", to make it part of a more general book.
Which book is the appropriate place for this algorithm?
- Algorithm implementation?
- Artificial Intelligence?
- Advanced Data Structures and Algorithms?
- Systems Theory/Decision Structure?
- Start a new book that covers only classification and clustering algorithms?
- Some other book I've overlooked?
Which book do you feel is most appropriate for discussing this algorithm? --DavidCary (talk) 15:56, 13 August 2008 (UTC)
- I agree with you; my preference goes to a work that "covers only classification and clustering algorithms". But then why not let this evolve into a more complete book, or just wait for someone else to create a book along the same lines and propose a merge? (From what I've understood, the actual author only intended to cover this one algorithm. That gives us a usable book on a given subject, not the beginning of a great project; with luck and more contributions maybe it will get there.)
I take it you aren't committing to help extend it, just advancing the proposal? --Panic (talk) 17:34, 13 August 2008 (UTC)
Although it is still in need of improvement and expansion, I liked this "book" much better in the form of an article than in the form of a book. As a book, I agree that it is very thin, although very potent, and it has already been expanded about as much as possible without making it more complicated than it needs to be by covering every minute detail, as the primary reference does. My goal in creating it online in the first place, after all, was to show its universal application and practical simplicity in combination with its solid mathematical base.
I feel, therefore, that a new book project which "...covers only classification and clustering algorithms", so as to show and compare the advantages, disadvantages and applications of each method, is not only acceptable but may also be highly beneficial to the understanding of Dr. Rypka's method. In that case Dr. Rypka's method might become only a section within a chapter covering optimal classification, with additional sections to cover other methods of optimal classification. Although I find one swimsuit contestant attractive, I find it much easier to judge one by looking at the others as well! ;-} Typative (talk) 13:27, 14 August 2008 (UTC)
clustering
notes
↑ Biological Identification with Computers, edited by R.J. Pankhurst, British Museum (Natural History), London, England. Proceedings of a meeting held at King's College, Cambridge, 27 and 28 September 1973. Systematics Association Special Volume Number 7, Academic Press, 1975. ISBN 0125448503. Notes the work of Eugene W. Rypka, Dept. of Microbiology, Lovelace Center for Health Sciences, Albuquerque, New Mexico, "Pattern Recognition and Microbial Identification."
I will have to look at the other "clustering" algorithms before I can comment further. Over the years I have studied many algorithms to learn about them and many of those algorithms may have changed and I have not looked at the changes.
Wikipedia Cluster analysis - "Elbow criterion" - obvious misinterpretation/original research
In the context we are discussing, a set or multiset of values or states is called an attribute when it is used in combination with other attributes to define a bounded class of elements that have the same attributes in common. It appears that the point of confusion here is that you (and others) are using the term "cluster" to refer to a "subset of a set", or to a group of attributes which define a bounded class, rather than to the multiset count for each value or state of an attribute. In this context, the set or multiset of values of an attribute will always have a number of clusters equal to the count of its values or states.
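To make that usage concrete, here is a small sketch using the values of grid area A from the flag data on this page. Treating each distinct color as one "cluster" with its multiset count is my reading of the paragraph above, and the variable names are only illustrative.

```python
from collections import Counter

# Values of attribute (grid area) "A" for the nine flags in the example data.
area_a = {
    "BELGIUM": "BLACK",
    "FRANCE": "BLUE",
    "GERMANY": "BLACK",
    "IRELAND": "GREEN",
    "ITALY": "GREEN",
    "JAPAN": "WHITE",
    "LUXEMBOURG": "RED",
    "NETHERLANDS": "RED",
    "SPAIN": "RED",
}

# The multiset of values: each distinct state with the number of elements
# that share it.
counts = Counter(area_a.values())

# In the sense used above, the number of "clusters" for this attribute is
# simply the number of distinct states it takes.
print(counts)
print(len(counts))
```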
The number of attributes selected to derive the target set size of the subset is usually not fixed. It is initially set to one and then incremented progressively until 100% separation is achieved, or until the target set size exceeds computer capacity or the time allocated for classification runs out. The minimum number of attributes can be determined mathematically as follows:
$t_{min} = \frac{\log G}{\log V}$, where:[1]
$t_{min}$ is the minimal number of characteristics to result in theoretical separation, $G$ is the number of elements in the bounded class and $V$ is the highest value of logic in the group.
Typative (talk) 11:27, 10 September 2008 (UTC)
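As a rough illustration of the two ideas in the comment above, the sketch below computes the theoretical minimum as the smallest whole number t for which V to the power t is at least G, and then increments the subset size from one until some set of characteristics separates every element. This is a brute-force sketch over the flag data from this page, not the evaluation program referred to elsewhere in this discussion; the function names are invented, and taking V as the number of distinct color names in the data (8) is an assumption.

```python
from itertools import combinations

# Flag example from this page: nine elements, nine characteristics (A..I).
FLAGS = {
    "BELGIUM": ["BLACK", "YELLOW", "ORANGE"] * 3,
    "FRANCE": ["BLUE", "WHITE", "RED"] * 3,
    "GERMANY": ["BLACK"] * 3 + ["RED"] * 3 + ["YELLOW"] * 3,
    "IRELAND": ["GREEN", "WHITE", "ORANGE"] * 3,
    "ITALY": ["GREEN", "WHITE", "RED"] * 3,
    "JAPAN": ["WHITE"] * 4 + ["RED"] + ["WHITE"] * 4,
    "LUXEMBOURG": ["RED"] * 3 + ["WHITE"] * 3 + ["BABY"] * 3,
    "NETHERLANDS": ["RED"] * 3 + ["WHITE"] * 3 + ["BLUE"] * 3,
    "SPAIN": ["RED"] * 3 + ["YELLOW"] * 3 + ["RED"] * 3,
}


def t_min(g, v):
    """Smallest whole t with v**t >= g (the ceiling of log g / log v).

    Integer arithmetic is used so that exact powers are not pushed up by
    floating-point rounding.
    """
    if g < 1 or v < 2:
        raise ValueError("need g >= 1 and v >= 2")
    t, capacity = 0, 1
    while capacity < g:
        capacity *= v
        t += 1
    return t


def separates(elements, columns):
    """True if the chosen characteristic columns give every element a unique pattern."""
    patterns = {tuple(states[c] for c in columns) for states in elements.values()}
    return len(patterns) == len(elements)


def smallest_separating_set(elements):
    """Try subsets of size 1, 2, ... and return the first that separates all elements."""
    width = len(next(iter(elements.values())))
    for size in range(1, width + 1):
        for columns in combinations(range(width), size):
            if separates(elements, columns):
                return columns
    return None  # no subset achieves 100% separation


distinct_states = {s for states in FLAGS.values() for s in states}
bound = t_min(len(FLAGS), len(distinct_states))
found = smallest_separating_set(FLAGS)
print(bound, found)
```

The exhaustive search grows quickly with the number of characteristics, which is presumably why a capacity or time limit is mentioned above.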
edit break
Look at it again. Recall the statement under the Flag overlay grid? The areas represent attributes, the colors represent attribute values and the flags represent elements. Typative (talk) 18:50, 10 September 2008 (UTC)
To sort you first need to separate. To separate you first need a criterion upon which to separate. The criterion in regard to rocks might be weight or size or color or shape. What criteria are you going to use to separate the rocks? Typative (talk) 18:50, 10 September 2008 (UTC)
Correct but irrelevant. The rock itself cannot be split between two buckets without using a mallet or sledgehammer. Typative (talk) 18:50, 10 September 2008 (UTC)
What makes these rocks a cluster? The weight or size or color or shape? Being in the same bucket without a criterion does not make them a cluster, like a "cluster" of stars in the heavens with no other criterion except proximity, i.e., you can't rationally say they are a cluster because they are close together in the bucket. Typative (talk) 18:50, 10 September 2008 (UTC)
No. It is improper to use the term cluster in this way. The word set is a generalized term but the word cluster is not. Think about the flag above. A flag is made up of different colors in different locations but a flag is not properly defined as a cluster of colors. Typative (talk) 18:50, 10 September 2008 (UTC)
Okay, I believe I see the problem. You are deriving the meaning of the word cluster from its colloquial uses. Another example would be an oak leaf cluster. Also, you are using "forest", "plains", and "houses/roads" as if they were attributes when in fact they are elements in your bounded class. (See attribute-valued system.) Typative (talk) 18:50, 10 September 2008 (UTC)
You are describing "trees" here as if they are elements but using them above as if they are attributes. Typative (talk) 18:50, 10 September 2008 (UTC)
edit break
Dear fellow Wikibookian,
I see that what I've been doing is actually vector quantization. So we shouldn't be surprised that what I am doing is technically not exactly the same thing as "classification".
I've been asked: "To separate you first need a criteria upon which to separate. ... What criteria are you going to use to separate the rocks? What makes these rocks a cluster? The weight or size or color or shape?"
I don't know ahead of time which criteria I will use.
I have a bunch of rocks. (I actually work with other things, but I don't want to bring in a bunch of irrelevant details, so let's pretend I'm working with rocks). For each one, I measure its density and hardness. I suppose I *could* decide, before looking at any rocks (top-down), exactly which range of density and hardness to assign to each bucket. However, many times I do something more like spread them across the ground in a density-vs-hardness graph. In an ideal world, there would be a few discrete points in the density-vs-hardness graph ("sandstone", "granite", "salt"), and every rock would fall exactly on top of one of those points, making a tower. And in that ideal world, it would be easy to discover exactly how many kinds of rocks I have, *after* I have already divided up all the rocks into a few discrete towers (bottom-up), by counting how many towers I see. Alas, because of my own measurement error, and also because of variations in the rocks, very rarely do 2 rocks fall at exactly the same point in this graph. Still, I can usually visually pick out one group of closely spaced rocks, call them all "Type 1" and put them into bucket #1, and pick out another group of closely spaced rocks (far away from the first group), call them "Type 2", put them into bucket #2, etc.
Now it may turn out that *all* of tomorrow's rocks have the same hardness, and my criterion ends up being density alone. Or it may turn out that *all* of tomorrow's rocks have the same density, and my criterion ends up being hardness alone. Or it may turn out that *all* of tomorrow's rocks fall along a diagonal line, and *either* density alone *or* hardness alone is sufficient to separate the piles I discover. Or it may turn out that they end up in one tight pile, and I may decide to use color or shape or some other measurement to divide them. But I won't know until I try it.
I don't know ahead of time which criteria I will use. Tomorrow, *after* I divide up all the rocks, *then* I could give you some criteria -- a range of hardness and density (or something else) that describes the rocks in each bucket.
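For readers who want to see the bottom-up grouping described above in code, here is a minimal sketch of one common way to do vector quantization: a Lloyd-style loop that assigns each rock to its nearest representative point and then moves each representative to the mean of its rocks. The rock measurements are invented for illustration, and the number of buckets and the starting points are supplied by hand, standing in for the "visually pick out a group" step; this is not a claim about the exact procedure used in the comment above.

```python
import math


def assign(points, centroids):
    """Put each point into the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        buckets[nearest].append(p)
    return buckets


def quantize(points, centroids, rounds=10):
    """Lloyd-style refinement of the centroids, followed by a final assignment."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(rounds):
        buckets = assign(points, centroids)
        centroids = [
            # Move the centroid to the mean of its bucket; keep it if the bucket is empty.
            tuple(sum(axis) / len(bucket) for axis in zip(*bucket)) if bucket else old
            for bucket, old in zip(buckets, centroids)
        ]
    return centroids, assign(points, centroids)


# Hypothetical (density, hardness) measurements for six rocks.
rocks = [
    (2.2, 2.5), (2.3, 2.4), (2.1, 2.6),   # one closely spaced group
    (2.7, 6.4), (2.6, 6.6), (2.8, 6.5),   # another group, far from the first
]

# Start from two hand-picked rocks, one per expected bucket.
centroids, buckets = quantize(rocks, [rocks[0], rocks[-1]])
print(centroids)
print(buckets)
```

Only after the loop has run can each bucket be described by a range of density and hardness, which matches the point that the criteria are known after the division rather than before it.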
Let me say again -- what I'm actually doing in this example is vector quantization. The reason I'm bringing up vector quantization is that I think that the "Optimal Classification" algorithm *should* go into some book at Wikibooks. However, I think a book that discussed that algorithm *and* other closely-related algorithms would be *better* than a book about a single algorithm. And "vector quantization" is the most closely related algorithm that I am familiar with. I suspect other algorithms have been developed that are even more closely related -- the book should talk about those as well. If we discover that there are 100 other algorithms that are even more closely related to "Optimal Classification" than "vector quantization", then I would agree that "vector quantization" doesn't belong in this book -- we would put the most-closely-related algorithms in this book, and put the other algorithms (and vector quantization) in some other book or books.
I agree that "vector quantization" and "optimal classification" are "two different things". But how can a book discuss several algorithms unless those algorithms are different?
If X is the first step in the process of doing Y, then I think a Wikibook about Y should *either* also cover X, *or* have a link to some other prerequisite book. (For example, Microprocessor Design points out that Digital Circuits is a prerequisite). So would you prefer me to keep talking about clustering and classification and "vector quantization" in this book? Or are you going to point out a more appropriate prerequisite book?
edit break
Yes. I agree that it would be good to open this book up to all kinds of classification algorithms and related techniques. --DavidCary (talk) 23:11, 3 October 2008 (UTC)