

Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany

5390

James F. Peters, Andrzej Skowron, Henryk Rybiński (Eds.)

Transactions on Rough Sets IX


Volume Editors

James F. Peters
University of Manitoba, Department of Electrical and Computer Engineering
Winnipeg, Manitoba, R3T 5V6, Canada
E-mail: [email protected]

Andrzej Skowron
Warsaw University, Institute of Mathematics
Banacha 2, 02-097 Warsaw, Poland
E-mail: [email protected]

Henryk Rybiński
Warsaw University of Technology, Institute of Computer Science
Nowowiejska 15/19, 00-665 Warsaw, Poland
E-mail: [email protected]

Library of Congress Control Number: 2008942076
CR Subject Classification (1998): F.4.3, I.5, H.2.8, I.2, G.2
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues

ISSN 0302-9743 (Lecture Notes in Computer Science)
ISSN 1861-2059 (Transactions on Rough Sets)
ISBN-10 3-540-89875-1 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-89875-7 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2008
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12586622 06/3180 5 4 3 2 1 0

Preface

Volume IX of the Transactions on Rough Sets (TRS) provides evidence of the continuing growth of a number of research streams that were either directly or indirectly begun by the seminal work on rough sets by Zdzisław Pawlak (1926–2006)¹. One of these research streams inspired by Prof. Pawlak is rough set-based intelligent systems, a topic that was an important part of his early 1970s work on knowledge description systems prior to his discovery of rough sets during the early 1980s. Evidence of intelligent systems as a recurring motif over the past two decades can be found in the rough-set literature that now includes over 4,000 publications by more than 1,600 authors in the rough set database². This volume of the TRS includes articles that are extensions of papers included in the first conference on Rough Sets and Intelligent Systems Paradigms³. In addition to research on intelligent systems, this volume also presents papers that reflect the profound influence of a number of other research initiatives by Zdzisław Pawlak. In particular, this volume introduces a number of new advances in the foundations and applications of artificial intelligence, engineering, image processing, logic, mathematics, medicine, music, and science. These advances have significant implications in a number of research areas such as attribute reduction, approximation schemes, category-based inductive reasoning, classifiers, classifying mappings, context algebras, data mining, decision attributes, decision rules, decision support, diagnostic feature analysis, EEG classification, feature analysis, granular computing, hierarchical classifiers, indiscernibility relations, information granulation, information systems, musical rhythm retrieval, probabilistic dependencies, reducts, rough-fuzzy C-means, rough inclusion functions, roughness, singing voice recognition, and vagueness. A total of 47 researchers are represented in this volume.

This volume has been made possible thanks to the laudable efforts of a great many generous persons and organizations. The editors and authors of this volume also extend an expression of gratitude to Alfred Hofmann, Ursula Barth, Christine Günther, and the LNCS staff at Springer for their support in making this volume of the TRS possible. In addition, the editors of this volume extend their thanks to Marcin Szczuka for his consummate skill and care in the compilation of this volume.

¹ See, e.g., Pawlak, Z., Skowron, A.: Rudiments of rough sets. Information Sciences 177 (2007) 3–27; Pawlak, Z., Skowron, A.: Rough sets: Some extensions. Information Sciences 177 (2007) 28–40; Pawlak, Z., Skowron, A.: Rough sets and Boolean reasoning. Information Sciences 177 (2007) 41–73.
² http://rsds.wsiz.rzeszow.pl/rsds.php
³ Int. Conf. on Rough Sets and Emerging Intelligent Systems Paradigms, Lecture Notes in Artificial Intelligence 4585. Springer, Berlin, 2007.


The editors of this volume were supported by the Ministry of Science and Higher Education of the Republic of Poland, research grants No. NN516 368334 and 3T11C 002 29, by the Ministry of Regional Development of the Republic of Poland, grant "Decision Support – New Generation Systems" of the Innovative Economy Operational Programme 2007–2013 (Priority Axis 1: Research and development of new technologies), and by the Natural Sciences and Engineering Research Council of Canada (NSERC), research grant 185986.

October 2008

James F. Peters
Andrzej Skowron
Henryk Rybiński

LNCS Transactions on Rough Sets

This journal subline has as its principal aim the fostering of professional exchanges between scientists and practitioners who are interested in the foundations and applications of rough sets. Topics include foundations and applications of rough sets as well as foundations and applications of hybrid methods combining rough sets with other approaches important for the development of intelligent systems. The journal includes high-quality research articles accepted for publication on the basis of thorough peer reviews. Dissertations and monographs up to 250 pages that include new research results can also be considered as regular papers. Extended and revised versions of selected papers from conferences can also be included in regular or special issues of the journal. Editors-in-Chief:

James F. Peters, Andrzej Skowron

Editorial Board

M. Beynon, G. Cattaneo, M.K. Chakraborty, A. Czyżewski, J.S. Deogun, D. Dubois, I. Düntsch, S. Greco, J.W. Grzymala-Busse, M. Inuiguchi, J. Järvinen, D. Kim, J. Komorowski, C.J. Liau, T.Y. Lin, E. Menasalvas, M. Moshkov, T. Murai, M. do C. Nicoletti, H.S. Nguyen, S.K. Pal, L. Polkowski, H. Prade, S. Ramanna, R. Słowiński, J. Stefanowski, J. Stepaniuk, Z. Suraj, R. Świniarski, M. Szczuka, S. Tsumoto, G. Wang, Y. Yao, N. Zhong, W. Ziarko

Table of Contents

Vagueness and Roughness ..... 1
   Zbigniew Bonikowski and Urszula Wybraniec-Skardowska

Modified Indiscernibility Relation in the Theory of Rough Sets with Real-Valued Attributes: Application to Recognition of Fraunhofer Diffraction Patterns ..... 14
   Krzysztof A. Cyran

On Certain Rough Inclusion Functions ..... 35
   Anna Gomolińska

Automatic Rhythm Retrieval from Musical Files ..... 56
   Bożena Kostek, Jarosław Wójcik, and Piotr Szczuko

FUN: Fast Discovery of Minimal Sets of Attributes Functionally Determining a Decision Attribute ..... 76
   Marzena Kryszkiewicz and Piotr Lasek

Information Granulation: A Medical Case Study ..... 96
   Urszula Kużelewska and Jarosław Stepaniuk

Maximum Class Separability for Rough-Fuzzy C-Means Based Brain MR Image Segmentation ..... 114
   Pradipta Maji and Sankar K. Pal

Approximation Schemes in Logic and Artificial Intelligence ..... 135
   Victor W. Marek and Mirosław Truszczyński

Decision Rule Based Data Models Using NetTRS System Overview ..... 145
   Marcin Michalak and Marek Sikora

A Rough Set Based Approach for ECG Classification ..... 157
   Sucharita Mitra, M. Mitra, and B.B. Chaudhuri

Universal Problem of Attribute Reduction ..... 187
   Mikhail Ju. Moshkov, Marcin Piliszczuk, and Beata Zielosko

Extracting Relevant Information about Reduct Sets from Data Tables ..... 200
   Mikhail Ju. Moshkov, Andrzej Skowron, and Zbigniew Suraj

Context Algebras, Context Frames, and Their Discrete Duality ..... 212
   Ewa Orlowska and Ingrid Rewitzky

A Study in Granular Computing: On Classifiers Induced from Granular Reflections of Data ..... 230
   Lech Polkowski and Piotr Artiemjew

On Classifying Mappings Induced by Granular Structures ..... 264
   Lech Polkowski and Piotr Artiemjew

The Neurophysiological Bases of Cognitive Computation Using Rough Set Theory ..... 287
   Andrzej W. Przybyszewski

Diagnostic Feature Analysis of a Dobutamine Stress Echocardiography Dataset Using Rough Sets ..... 318
   Kenneth Revett

Rules and Apriori Algorithm in Non-deterministic Information Systems ..... 328
   Hiroshi Sakai, Ryuji Ishibashi, Kazuhiro Koba, and Michinori Nakata

On Extension of Dependency and Consistency Degrees of Two Knowledges Represented by Covering ..... 351
   P. Samanta and Mihir K. Chakraborty

A New Approach to Distributed Algorithms for Reduct Calculation ..... 365
   Tomasz Strąkowski and Henryk Rybiński

From Information System to Decision Support System ..... 379
   Alicja Wakulicz-Deja and Agnieszka Nowak

Debellor: A Data Mining Platform with Stream Architecture ..... 405
   Marcin Wojnarski

Category-Based Inductive Reasoning: Rough Set Theoretic Approach ..... 428
   Marcin Wolski

Probabilistic Dependencies in Linear Hierarchies of Decision Tables ..... 444
   Wojciech Ziarko

Automatic Singing Voice Recognition Employing Neural Networks and Rough Sets ..... 455
   Paweł Żwan, Piotr Szczuko, Bożena Kostek, and Andrzej Czyżewski

Hierarchical Classifiers for Complex Spatio-temporal Concepts ..... 474
   Jan G. Bazan

Author Index ..... 751

Vagueness and Roughness

Zbigniew Bonikowski¹ and Urszula Wybraniec-Skardowska²

¹ Institute of Mathematics and Informatics, University of Opole, Opole, Poland
  [email protected]
² Autonomous Section of Applied Logic, Poznań School of Banking, Faculty in Chorzów, Poland
  [email protected]

Abstract. The paper proposes a new formal approach to vagueness and vague sets, taking inspiration from Pawlak's rough set theory. Following a brief introduction to the problem of vagueness, an approach to conceptualization and representation of vague knowledge is presented from a number of different perspectives: those of logic, set theory, algebra, and computer science. The central notion of the vague set, in relation to the rough set, is defined as a family of sets approximated by the so-called lower and upper limits. The family is simultaneously considered as a family of all denotations of sharp terms representing a suitable vague term, from the agent's point of view. Some algebraic operations on vague sets and their properties are defined. Some important conditions concerning the membership relation for vague sets, in connection to Blizard's multisets and Zadeh's fuzzy sets, are established as well. A classical outlook on a logic of vague sentences (vague logic) based on vague sets is also discussed.

Keywords: vagueness, roughness, vague sets, rough sets, knowledge, vague knowledge, membership relation, vague logic.

1 Introduction

Logicians and philosophers have been interested in the problem area of vague knowledge for a long time, looking for logical foundations of a theory of vague notions (terms) constituting such knowledge. Recently, vagueness and, more generally, imperfection have become the subject of investigations of computer scientists interested in the problems of AI, in particular in the problems of reasoning on the basis of imperfect information and in the application of computers to support and represent such reasoning in the computer memory (see, e.g., Parsons [15]). Imperfection is considered in a general information-based framework, where objects are described by an agent in terms of attributes and their values. Bonissone and Tong [5] indicated three types of imperfections relating to information: incompleteness, uncertainty and imprecision. Incompleteness arises from the absence of a value of an attribute for some objects. Uncertainty arises from a lack of information; as a result, an object's attribute may have a finite set of values rather than a single value. Imprecision occurs when an attribute's value cannot be measured with adequate precision. There are also other classifications of imperfect information (see, e.g., Słowiński, Stefanowski [26]).

Marcus [12] thought of imprecision more generally. He distinguished, e.g., such types of imprecision as vagueness, fuzziness and roughness. Both fuzziness and roughness are mathematical models of vagueness. Fuzziness is closely related to Zadeh's fuzzy sets [28]. In fuzzy set theory, vagueness is described by means of a specific membership relation. Fuzziness is often identified with vagueness; however, Zadeh [29] noted that vagueness comprises fuzziness. Roughness is connected with Pawlak's rough sets [19]. Classical, set-theoretical sets (orthodox sets) are not sufficient to deal with vagueness. Non-orthodox sets, namely rough sets and fuzzy sets, are used in two different approaches to vagueness (Pawlak [22]): while Zadeh's fuzzy set theory represents a quantitative approach, Pawlak's rough set theory represents a qualitative approach to vagueness. Significant results obtained by computer scientists in the area of imprecision and vagueness, such as Zadeh's fuzzy set theory [28], Shafer's theory of evidence [24] and Pawlak's rough set theory [19,21], greatly contributed to advancing and intensifying research into vagueness.

This paper is an extended version of a previous article by the same authors [4]. It proposes a new approach to vagueness taking into account the main ideas of roughness. Roughness considered as a mathematical model of vagueness is here replaced by an approach to vagueness in which vague sets, defined in this paper, play the role of rough sets. Vague sets are connected with vague knowledge and, at the same time, are understood as denotations of vague notions. The paper also attempts to lay logical foundations for the theory of vague notions (terms) and thus to bring an essential contribution to research in this area.

The structure of the paper is as follows. In Sect. 2, we introduce the notion of unit information (unit knowledge) and vague information (vague knowledge). The central notion of the vague set, inspired by Pawlak's notion of a rough set, is defined in Sect. 3. Section 4 is devoted to the problem of multiplicity of an object's membership to a vague set. In Sect. 5 some operations on vague sets and their algebraic properties are given. A view on the logic of vague concepts (terms) is discussed in Sect. 6. The paper ends with Sect. 7 which delivers some final remarks.

2 Unit Knowledge and Vague Knowledge

In the process of cognition of a definite fragment of reality, the cognitive agent (a man, an expert, a group of men or experts, a robot) attempts to discover information contained in it or, more adequately, about its objects. Each fragment of reality recognized by the agent can be interpreted as the following relational structure:

ℜ = ⟨𝒰, R1, R2, . . . , Rn⟩,   (1)


where 𝒰, the universe of objects of reality ℜ, is a non-empty set, and Ri, for i = 1, 2, . . . , n, is the set of i-ary relations on 𝒰. One-ary relations are regarded as subsets of 𝒰 and understood as properties of objects of 𝒰, and multi-argument relations as relationships among its objects. Formally, every k-ary relation of Rk is a subset of 𝒰^k. We assume that reality ℜ is objective with respect to cognition. Objective knowledge about it consists of pieces of unit information (knowledge) about objects of 𝒰 with respect to all particular relations of Rk (k = 1, 2, . . . , n). We introduce the notion of knowledge and vague knowledge in accordance with some conceptions of the second co-author of this paper ([27]).

Definition 1 (Unit information (knowledge)). Unit information (knowledge) about the object o ∈ 𝒰 with respect to the relation R ∈ Rk (k = 1, 2, . . . , n) is the image →R(o) of the object o with respect to the relation R.¹

Discovering unit knowledge about objects of reality is realized through asking questions which include certain aspects, called attributes, of the objects of the universe 𝒰. Then, we usually choose a finite set U ⊆ 𝒰 as the universe and we put it forward as a generalized attribute-value system Σ, also called an information system (cf. Codd [6]; Pawlak [16], [18], [19]; Marek and Pawlak [13]). Its definition is as follows:

Definition 2 (Information system). Σ is an information system iff it is an ordered system

Σ = ⟨U, A1, A2, . . . , An⟩,   (2)

where U ⊆ 𝒰, card(U) < ω, and Ak (k = 1, 2, . . . , n) is the set of k-ary attributes understood as k-ary functions, i.e.

∀a ∈ Ak: a : U^k → Va,   (3)

where Va is the set of all values of the attribute a.

Example 1. Let us consider the following information system: S = ⟨S, A1, A2⟩, where S = {p1, p2, . . . , p5} is a set of 5 papers and A1 = {IMPACT FACTOR (IF), QUOTATIONS (Q)}, A2 = {TOPIC CONNECTION (TC)}. The attribute IF is a function which assigns to every paper p ∈ S the impact factor of the journal in which p was published. We assume that V_IF = [0, 100]. The value of the attribute Q for any paper p ∈ S is the number of quotations of p. We assume that V_Q = {0, 1, 2, . . . , 2000}. We also assume that TC assigns to every pair of papers the quotient of the number of common references by the number of all references, and that V_TC = [0, 1].

¹ For R ∈ R1: →R(o) = R if o ∈ R, and →R(o) = ∅ otherwise. For R ∈ Rk (k = 2, . . . , n): →R(o) = {⟨x1, . . . , x_{i−1}, x_{i+1}, . . . , x_k⟩ : ⟨x1, . . . , x_{i−1}, o, x_{i+1}, . . . , x_k⟩ ∈ R}.


The information system S can be clearly presented in the following tables:

      IF      Q
p1    0.203   125
p2    0.745   245
p3    0.498   200
p4    0.105   150
p5    1.203   245

TC    p1      p2      p3      p4      p5
p1    1       3/10    0       6/7     0
p2    3/10    1       0       0       4/17
p3    0       0       1       0       1/12
p4    6/7     0       0       1       0
p5    0       4/17    1/12    0       1
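For readers who prefer executable notation, the system S and the relations R_{a,W} of the forthcoming Definition 3 can be transcribed directly. The sketch below is our illustration, not part of the original paper; value sets W are rendered as predicates on V_a rather than as explicit subsets, and all identifier names are ours:

```python
from fractions import Fraction as F

# Attribute tables of the information system S from Example 1.
IF = {"p1": 0.203, "p2": 0.745, "p3": 0.498, "p4": 0.105, "p5": 1.203}
Q  = {"p1": 125,   "p2": 245,   "p3": 200,   "p4": 150,   "p5": 245}
TC = {("p1", "p2"): F(3, 10), ("p1", "p4"): F(6, 7),     # symmetric off-diagonal
      ("p2", "p5"): F(4, 17), ("p3", "p5"): F(1, 12)}    # values; diagonal is 1, rest 0

def relation(attribute, W):
    """R_{a,W} of Definition 3: all arguments whose attribute value lies in W."""
    return {u for u, v in attribute.items() if W(v)}

print(sorted(relation(IF, lambda v: 0.1 <= v <= 0.5)))   # ['p1', 'p3', 'p4']
print(sorted(relation(Q,  lambda v: v >= 200)))          # ['p2', 'p3', 'p5']
```

The two printed relations anticipate R_{IF,S_0.1^0.5} and R_{Q,S_200} computed in Example 2 below.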

Every attribute of the information system Σ and every value of this attribute explicitly indicates a relation belonging to the so-called relational system determined by Σ. The unit information (knowledge) about an object o ∈ U should be considered with respect to relations of the system.

Definition 3 (System determined by the information system). ℜ(Σ) is a system determined by the information system Σ (see (2)) iff

ℜ(Σ) = ⟨U, {R_{a,W} : a ∈ A1, ∅ ≠ W ⊆ Va}, . . . , {R_{a,W} : a ∈ An, ∅ ≠ W ⊆ Va}⟩,

where R_{a,W} = {(o1, o2, . . . , ok) ∈ U^k : a((o1, o2, . . . , ok)) ∈ W} for any k ∈ {1, 2, . . . , n}, a ∈ Ak.

Let us see that ⋃{R_{a,{v}} : a ∈ A1, v ∈ Va} = U, i.e. the family {R_{a,{v}} : a ∈ A1, v ∈ Va} is a covering of U. It is easy to see that

Fact 1. The system Σ uniquely determines the system ℜ(Σ).

Example 2. Let S be the information system given above. Then the system determined by S is ℜ(S) = ⟨U, R_A1, R_A2⟩, where R_A1 = {R_{IF,S} : ∅ ≠ S ⊆ V_IF} ∪ {R_{Q,S} : ∅ ≠ S ⊆ V_Q} and R_A2 = {R_{TC,S} : ∅ ≠ S ⊆ V_TC}. For any attribute a of the system S and any i, j ∈ ℝ we adopt the following notation: S_i^j = {v ∈ Va : i ≤ v ≤ j}, S^j = {v ∈ Va : v ≤ j}, S_i = {v ∈ Va : i ≤ v}. Then, in particular, we can easily state that: R_{IF,S_0.1^0.5} = {p1, p3, p4}, R_{IF,S_0.7} = {p2, p5}, R_{IF,S^0.3} = {p1, p4}, R_{Q,S_150^150} = R_{Q,{150}} = {p4}, R_{Q,S_200} = {p2, p3, p5} and R_{TC,{1/12}} = {(p3, p5), (p5, p3)}, R_{TC,{1}} = {(pi, pi)}_{i=1,...,5}.

The notion of knowledge about the attributes of the system Σ depends on the cognitive agent discovering the fragment of reality Σ. According to Skowron's understanding of the notion of knowledge determined by any unary attribute (cf. Pawlak [17], Skowron et al. [25], Demri, Orlowska [8], pp. 16–17), we can adopt the following generalized definition of the notion of knowledge Ka about any k-ary attribute a:

Definition 4 (Knowledge Ka about the attribute a). Let Σ be the information system satisfying (2) and a ∈ Ak (k = 1, 2, . . . , n). Then


(a) Ka = {((o1, o2, . . . , ok), V_{a,u}) : u = (o1, o2, . . . , ok) ∈ U^k}, where V_{a,u} ⊆ P(Va), V_{a,u} is the family of all sets of possible values of the attribute a for the object u from the viewpoint of the agent, and P(Va) is the family of all subsets of Va.
(b) The knowledge Ka of the agent about the attribute a and its value for the object u is
  (0) empty if card(⋃_{W∈V_{a,u}} W) = 0,
  (1) definite if card(⋃_{W∈V_{a,u}} W) = 1,
  (>1) imprecise, in particular vague, if card(⋃_{W∈V_{a,u}} W) > 1.

Let us observe that vague knowledge about some attribute of the information system Σ is connected with the assignation of a vague value to the object u.

Example 3. Let us consider again the information system S = ⟨S, A1, A2⟩. The agent's knowledge K_IF, K_Q, K_TC about the attributes of the information system S can be characterized by means of the following tables:

      V_{IF,p}                  V_{Q,p}
p1    {S_0.2, S_0.3, S_0.25}    {S_100, S_150, S_90, S_80}
p2    {S_0.5, S_0.7, S_0.8}     {S_180, S_200, S_250, S_240}
p3    {S_0.5, S_0.6, S_0.4}     {S_170, S_230, S_180, S_150}
p4    {S_0.1, S_0.2, S_0.15}    {S_100, S_90, S_10, S_140}
p5    {S_0.7, S_1.5, S_1.0}     {S_270, S_150, S_240, S_200}

V_{TC,(p,p')}   p1                p2                p3                p4                p5
p1              {S_1^1}           {S^0.3, S^0.5}    {S^0.1, S^0.2}    {S_0.5, S_0.8}    {S^0.1, S^0.2}
p2              {S^0.3, S^0.5}    {S_1^1}           {S^0.1, S^0.2}    {S^0.1, S^0.2}    {S^0.3, S^0.4}
p3              {S^0.1, S^0.2}    {S^0.1, S^0.2}    {S_1^1}           {S^0.1, S^0.2}    {S^0.3, S^0.1}
p4              {S_0.5, S_0.8}    {S^0.1, S^0.2}    {S^0.1, S^0.2}    {S_1^1}           {S^0.1, S^0.2}
p5              {S^0.1, S^0.2}    {S^0.3, S^0.4}    {S^0.3, S^0.1}    {S^0.1, S^0.2}    {S_1^1}
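Definition 4(b) is, in effect, a cardinality test on the union of the agent's candidate value sets. The minimal sketch below is our illustration; it assumes the candidate sets are finite, whereas in Example 3 they are intervals of V_a:

```python
def classify_knowledge(candidate_value_sets):
    """Definition 4(b): the status of the agent's knowledge K_a about an
    attribute's value for one object, judged by the cardinality of the
    union of all candidate value sets W in V_{a,u}."""
    union = set().union(*candidate_value_sets)
    if len(union) == 0:
        return "empty"
    if len(union) == 1:
        return "definite"
    return "imprecise (in particular, vague)"

print(classify_knowledge([]))                    # empty
print(classify_knowledge([{150}, {150}]))        # definite
print(classify_knowledge([{150, 200}, {240}]))   # imprecise (in particular, vague)
```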

From Definitions 1 and 3 we arrive at:

Fact 2. Unit information (knowledge) about the object o ∈ U with respect to a relation R of the system ℜ(Σ) is the image →R(o) of the object o with respect to the relation R.

Contrary to the objective unit knowledge →R(o) about the object o of 𝒰 in the reality ℜ with regard to its relation R, the subjective unit knowledge (the unit knowledge of an agent) about the object o of U in the reality ℜ(Σ) depends on an attribute of Σ determining the relation R and its possible values from the viewpoint of the knowledge of the agent discovering ℜ(Σ). The subjective unit knowledge →R_ag(o) depends on the agent's ability to solve the following equation:

→R_ag(o) = x,   (e)

where x is an unknown quantity.


Solutions of (e) for a k-ary relation R should be images of the object o with respect to k-ary relations R_{a,W} from ℜ(Σ), where ∅ ≠ W ⊆ Va. Let us note that for each unary relation R solutions of (e) are unary relations R_{a,W}, where ∅ ≠ W ∈ V_{a,o}. A solution of the equation (e) can be correct – then the agent's knowledge about the object o is exact. If the knowledge is inexact, then at least one solution of (e) is not an image of the object o with respect to the relation R.

Definition 5 (Empty, definite and imprecise unit knowledge). Unit knowledge of the agent about the object o ∈ U in ℜ(Σ) with respect to its relation R is
(0) empty iff the equation (e) does not have a solution for the agent (the agent knows nothing about the value of the function →R for the object o),
(1) definite iff the equation (e) has exactly one solution for the agent (either the agent's knowledge is exact – the agent knows the value of the function →R for the object o – or he accepts only one, but not necessarily accurate, value of the function),
(>1) imprecise iff the equation (e) has at least two solutions for the agent (the agent allows at least two possible values of the function →R for the object o).

From Definitions 4 and 5 we arrive at:

Fact 3. Unit knowledge of the agent about the object o ∈ U in ℜ(Σ) with respect to its relation R is
(0) empty if the agent's knowledge Ka about the attribute a and its value for the object o is empty,
(1) definite if the agent's knowledge Ka about the attribute a and its value for the object o is definite,
(>1) imprecise if the agent's knowledge Ka about the attribute a and its value for the object o is imprecise.

When the unit knowledge of the agent about the object o is imprecise, then most often we replace the unknown quantity x in (e) with a vague value.

Example 4. Consider the relation R = R_{Q,S_200} within the previous system ℜ(S), i.e. the set of all papers of S that have been quoted in at least 200 other papers. The unit knowledge about the paper p5 with respect to R can be the following vague information:

→R_ag(p5) = VALUABLE,   (e1)

where VALUABLE is an unknown, indefinite, vague quantity. Then the agent refers to the paper p5 non-uniquely, assigning to it different images of the paper p5 with respect to the relations that are possible from his point of view. Then the equation (e1) usually has, for him, at least two solutions. From Example 3, it follows that each of these relations: R_{Q,S_270}, R_{Q,S_150}, R_{Q,S_240}, R_{Q,S_200} can be a solution to (e1). Let us observe that R_{Q,S_270} = ∅, R_{Q,S_150} = {p2, p3, p4, p5}, R_{Q,S_240} = {p2, p5}, R_{Q,S_200} = {p2, p3, p5}.

3 Vague Sets and Rough Sets

Let ℜ(Σ) be the system determined by the information system Σ. In order to simplify our considerations in the subsequent sections of the paper, we will limit ourselves to the unary relation R (property) – a subset of U of the system ℜ(Σ).

Definition 6 (Inexact unit knowledge of the agent). Unit knowledge of the agent about the object o in ℜ(Σ) with respect to R is inexact iff the equation (e) has for him at least one solution and at least one of the solutions is not the image →R(o).

The equation (e) has then the form:

→R_ag(o) = X,   (ine)

where X is an unknown quantity from the viewpoint of the agent, and (ine) has for him at least one solution and at least one of the solutions is not the image →R(o). The equation (ine) can be called the equation of inexact knowledge of the agent. All solutions of (ine) are unary relations in the system ℜ(Σ).

Definition 7 (Vague unit knowledge of the agent). Unit knowledge of the agent about the object o in ℜ(Σ) with respect to R is vague iff the equation (e) has at least two different solutions for the agent.

The equation (e) has then the form:

→R_ag(o) = VAGUE,   (ve)

where VAGUE is an unknown quantity, and (ve) has at least two different solutions for the agent. The equation (ve) can be called the equation of vague knowledge of the agent.

Fact 4. Vague unit knowledge is a particular case of inexact unit knowledge.

Definition 8 (Vague (proper vague) set). The family of all solutions (sets) of (ine), respectively (ve), is called the vague set for the object o determined by R, respectively the proper vague set for the object o determined by R.

Example 5. The family of all solutions of (e1) from Example 4 is a vague set V_p5 for the paper p5 determined by R_{Q,S_200}, and V_p5 = {R_{Q,S_270}, R_{Q,S_150}, R_{Q,S_240}, R_{Q,S_200}}.

Vague sets, thus also proper vague sets, determined by a set R are here some generalizations of sets approximated by representations (see Bonikowski [3]). They are non-empty families of unary relations from ℜ(Σ) (such that at least one of them includes R) and sub-families of the family P(U) of all subsets of the set U, determined by the set R. They have the greatest lower bound (the lower limit) and the least upper bound (the upper limit) in P(U) with respect to inclusion. We will denote the greatest lower bound of any family X by ⋀X, and the least upper bound of X by ⋁X. So, we can note


Fact 5. For each vague set V determined by the set (property) R,

V ⊆ {Y ∈ P(U) : ⋀V ⊆ Y ⊆ ⋁V}.   (4)

The idea of vague sets was conceived upon Pawlak's idea of rough sets [19], who defined them by means of the operations of lower approximation (·)_* and upper approximation (·)^*, defined on subsets of U. The lower approximation of a set is defined as the union of the indiscernibility classes of a given relation in U² which are included in this set, whereas the upper approximation of a set is defined as the union of the indiscernibility classes of the relation which have a non-empty intersection with this set.

Definition 9 (Rough set). A rough set determined by a set R ⊆ U is a family P of all sets satisfying the condition (5):

P = {Y ∈ P(U) : Y_* = R_* ∧ Y^* = R^*}.²   (5)

Let us observe that, because R ∈ P and R ⊆ R, the family P is a non-empty family of sets such that at least one of them includes R (cf. Definition 8). By analogy to Fact 5, we have

Fact 6. For each rough set P determined by the set (property) R,

P ⊆ {Y ∈ P(U) : R_* ⊆ Y ⊆ R^*}.   (6)

It is obvious that

Fact 7. If V is a vague set and X_* = ⋀V and X^* = ⋁V for any X ∈ V, then V is a subset of a rough set determined by any set of V.

For every rough set P determined by R we have: ⋀P = R_* and ⋁P = R^*. We can therefore consider the following generalization of the notion of the rough set:

Definition 10 (Generalized rough set). A non-empty family G of subsets of U is called a generalized rough set determined by a set R iff it satisfies the condition (7):

⋀G = R_* and ⋁G = R^*.   (7)

It is easily seen that

Fact 8. Every rough set determined by a set R is a generalized rough set determined by R.

Fact 9. If V is a vague set and there exists a set X ⊆ U such that X_* = ⋀V and X^* = ⋁V, then V is a generalized rough set determined by the set X.

² Some authors define a rough set as a pair of sets (lower approximation, upper approximation) (cf., e.g., Iwiński [10], Pagliani [14]).
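Both approximations, and the rough set family P of Definition 9, can be enumerated mechanically on a small universe. The following sketch is our illustration under an assumed toy partition; the universe, the partition and the set R are hypothetical:

```python
from itertools import combinations

U = {1, 2, 3, 4, 5, 6}
partition = [{1, 2}, {3, 4}, {5, 6}]   # hypothetical indiscernibility classes

def lower(X):
    """Lower approximation X_*: union of the classes included in X."""
    return set().union(*(E for E in partition if E <= X))

def upper(X):
    """Upper approximation X^*: union of the classes meeting X."""
    return set().union(*(E for E in partition if E & X))

def subsets(S):
    S = sorted(S)
    return (set(c) for r in range(len(S) + 1) for c in combinations(S, r))

R = {1, 2, 3}
P = [Y for Y in subsets(U) if lower(Y) == lower(R) and upper(Y) == upper(R)]
print(sorted(sorted(Y) for Y in P))   # [[1, 2, 3], [1, 2, 4]]
```

For this partition and R = {1, 2, 3}, the family P contains exactly {1, 2, 3} and {1, 2, 4}: both have lower approximation {1, 2} and upper approximation {1, 2, 3, 4}.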

4 Multiplicity of Membership to a Vague Set

For every object o ∈ U and every vague set V_o, we can count the multiplicity of membership of o to this set.

Definition 11 (Multiplicity of membership). The number i is the multiplicity of membership of the object o to the vague set V_o iff o belongs to i sets of V_o (i ∈ N).

The notion of multiplicity of an object's membership to a vague set is closely related to the so-called degree of an object's membership to the set.

Definition 12 (Degree of an object's membership). Let V_o be a vague set for the object o and card(V_o) = n. The function μ is called a degree of membership of o to V_o iff

μ(o) = 0, if the multiplicity of membership of o to V_o equals 0;
μ(o) = k/n, if the multiplicity of membership of o to V_o equals k (0 < k < n);
μ(o) = 1, if the multiplicity of membership of o to V_o equals n.

Example 6. The degree of the membership of the paper p5 to the vague set V_p5 (see Example 5) is equal to 3/4.

It is clear that

Fact 10. 1. Any vague set is a multiset in Blizard's sense [1]. 2. Any vague set is a fuzzy set in Zadeh's sense [28] with μ as its membership function.
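Definitions 11 and 12 translate directly into code. The sketch below (ours) recomputes Example 6 from the family V_p5 of Example 5, with the member relations written out as the concrete sets found in Example 4:

```python
from fractions import Fraction

# V_p5 from Example 5, with its member relations written out as the concrete
# sets computed in Example 4: R_{Q,S_270}, R_{Q,S_150}, R_{Q,S_240}, R_{Q,S_200}.
V_p5 = [set(), {"p2", "p3", "p4", "p5"}, {"p2", "p5"}, {"p2", "p3", "p5"}]

def multiplicity(o, V):
    """Definition 11: the number of member sets of V that contain o."""
    return sum(o in S for S in V)

def degree(o, V):
    """Definition 12: multiplicity normalized by card(V)."""
    return Fraction(multiplicity(o, V), len(V))

print(degree("p5", V_p5))   # 3/4, as in Example 6
```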

5 Operations on Vague Sets

Let us denote by V the family of all vague sets determined by relations in the system ℜ(Σ). In the family V we can define a unary operation of negation ¬ on vague sets, and a union operation ⊕ and an intersection operation ⊗ on any two vague sets.

Definition 13 (Operations on vague sets). Let V1 = {Ri}_{i∈I} and V2 = {Si}_{i∈I} be vague sets determined by the sets R ⊆ U and S ⊆ U, respectively. Then

(a) V1 ⊕ V2 = {Ri}_{i∈I} ⊕ {Si}_{i∈I} = {Ri ∪ Si}_{i∈I},
(b) V1 ⊗ V2 = {Ri}_{i∈I} ⊗ {Si}_{i∈I} = {Ri ∩ Si}_{i∈I},
(c) ¬V1 = ¬{Ri}_{i∈I} = {U \ Ri}_{i∈I}.

The family V1 ⊕ V2 is called the union of the vague sets V1 and V2 determined by the relations R and S. The family V1 ⊗ V2 is called the intersection of the vague sets V1 and V2 determined by the relations R and S. The family ¬V1 is called the negation of the vague set V1 determined by the relation R.


Theorem 1. Let V1 = {Ri}_{i∈I} and V2 = {Si}_{i∈I} be vague sets determined by the sets R and S, respectively. Then

(a) ⋀(V1 ⊕ V2) = ⋀V1 ∪ ⋀V2 and ⋁(V1 ⊕ V2) = ⋁V1 ∪ ⋁V2,
(b) ⋀(V1 ⊗ V2) = ⋀V1 ∩ ⋀V2 and ⋁(V1 ⊗ V2) = ⋁V1 ∩ ⋁V2,
(c) ⋀(¬V1) = U \ ⋁V1 and ⋁(¬V1) = U \ ⋀V1.

Theorem 2. The structure B = (V, ⊕, ⊗, ¬, 0, 1) is a Boolean algebra, where 0 = {∅} and 1 = {U}.

We can easily observe that the above-defined operations on vague sets differ from Zadeh's operations on fuzzy sets, from standard operations in any field of sets and, in particular, from the operations on rough sets defined by Pomykala & Pomykala [23] and Bonikowski [2]. The family of all rough sets with operations defined in the latter two works is a Stone algebra.
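The operations of Definition 13 and the limit identities of Theorem 1 can be checked mechanically on small families. The sketch below is our illustration: vague sets are modeled as equally indexed lists of frozensets, and the lower and upper limits are taken as the intersection and union of a family, which are its greatest lower and least upper bounds in P(U). Note that the equalities in Theorem 1(a) and (b) exploit the structure that vague sets carry; for arbitrary families of sets only one inclusion holds in general:

```python
U = frozenset({"p1", "p2", "p3", "p4", "p5"})

def v_union(V1, V2):   # (a) V1 ⊕ V2: member-wise unions over the common index set
    return [R | S for R, S in zip(V1, V2)]

def v_meet(V1, V2):    # (b) V1 ⊗ V2: member-wise intersections
    return [R & S for R, S in zip(V1, V2)]

def v_neg(V1):         # (c) ¬V1: member-wise complements in U
    return [U - R for R in V1]

def glb(V):            # greatest lower bound (lower limit) of a family in P(U)
    return frozenset.intersection(*V)

def lub(V):            # least upper bound (upper limit) of a family in P(U)
    return frozenset.union(*V)

V1 = [frozenset({"p2", "p5"}), frozenset({"p2", "p3", "p5"})]
V2 = [frozenset({"p4"}), frozenset({"p1", "p4"})]

assert glb(v_union(V1, V2)) == glb(V1) | glb(V2)   # Theorem 1(a), lower limits
assert lub(v_union(V1, V2)) == lub(V1) | lub(V2)   # Theorem 1(a), upper limits
assert glb(v_meet(V1, V2)) == glb(V1) & glb(V2)    # Theorem 1(b), lower limits
assert lub(v_meet(V1, V2)) == lub(V1) & lub(V2)    # Theorem 1(b), upper limits
assert glb(v_neg(V1)) == U - lub(V1)               # Theorem 1(c)
assert lub(v_neg(V1)) == U - glb(V1)               # Theorem 1(c)
```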

6 On Logic of Vague Terms

How can we solve the problem of a logic of vague terms, a logic of vague sentences (vague logic), based on the vague sets characterized in the previous sections? Answering this question requires a brief description of the problem of language representation of unit knowledge. On the basis of our examples, let us consider two pieces of unit information about the paper p5, with respect to the set R of all papers that have been referenced in at least 200 other papers: first, exact unit knowledge

→R_ag(p5) = {p2, p3, p5},   (ee)

next, vague unit knowledge:

→R_ag(p5) = VALUABLE.   (e1)

Let p5 be the designator of the proper name a, R – the denotation (extension) of the name-predicate P ('a paper that has been quoted in at least 200 other papers'), and the vague name-predicate V ('a paper which is valuable') – a language representation of the vague quantity VALUABLE. Then a representation of the first equation (ee) is the logical atomic sentence

a is P   (re)

and a representation of the second equation (e1) is the vague sentence

a is V.   (re1)

In a similar way, we can represent, respectively, (ee) and (e1) by means of a logical atomic sentence:

aP or P(a),   (re′)

where P is the predicate ('has been quoted in at least 200 other papers'), and by means of a vague sentence

aV or V(a),   (re1′)

where V is the vague predicate ('is valuable').


The sentence (re1) (resp. the sentence (re1′)) is not a logical sentence, but it can be treated as a sentential form which represents all logical sentences, in particular the sentence (re) (resp. the sentence (re′)), that arise by replacing the vague name-predicate (resp. vague predicate) V by allowable sharp name-predicates (resp. sharp predicates), whose denotations (extensions) constitute the vague set V_p5 being the denotation of V and, at the same time, the set of solutions to the equation (e1) from the agent's point of view.

By analogy, we can consider every atomic vague sentence of the form V(a), where a is an individual term and V its vague predicate, as a sentential form with V as a vague variable running over all denotations of sharp predicates that can be substituted for V in order to get precise, true or false, logical sentences from the form V(a). Then the scope of the variable V is the vague set V_o determined by the designator o of the term a.

All the above remarks lead to a 'conservative', classical approach in searching for a logic of vague terms, or vague sentences, here referred to as vague logic (cf. Fine [9], Cresswell [7]). It is easy to see that all counterparts of laws of classical logic are laws of vague logic because, to name just one reason, vague sentences have an interpretation in the Boolean algebra B of vague sets (see Theorem 2). We can distinguish two directions in seeking such a logic:

1a) all counterparts of tautologies of classical sentential calculus that are obtained by replacing sentence variables with atomic expressions of this logic (of the form V(x)), representing vague atomic sentences (sentential functions of the form V(a)), are tautologies of vague logic,
1b) all counterparts of tautologies of classical predicate calculus that can be obtained by replacing predicate variables with vague predicate variables, representing vague predicates, are tautologies of vague logic;
2) vague logic should be a finite-valued logic, in which the value of any vague sentence V(a), represented by its vague atomic expression (of the form V(x)), is the multiplicity of membership of the designator o of a to the vague set V_o being the denotation of V, and the multiplicities of membership of the designators of the subjects of any composed vague sentence, represented by its composed vague formula, to the denotation (a vague set) corresponding to this sentence, are functions of the multiplicities of membership of every designator of the subject of its atomic component to the denotation of its vague predicate.

It should be noticed that sentential connectives for vague logic should not satisfy standard conditions (see Malinowski [11]). For example, an alternative of two vague sentences V(a) and V(b) can be a 'true' vague sentence (sentential form) despite the fact that its arguments V(a) and V(b) are neither 'true' nor 'false' sentential forms, i.e. in certain cases they represent true sentences, and in some other cases they represent false sentences. This is not contrary to the statement that all vague sentential forms which we obtain by a suitable substitution of sentential variables (resp. predicate variables) by vague sentences (resp. vague predicates) in laws of classical logic always represent true sentences. Thus they are laws of vague logic.

7 Final Remarks

1. The concept of vagueness was defined in the paper as an indefinite, vague quantity or property corresponding to the knowledge of an agent discovering a fragment of reality, and delivered in the form of the equation of inexact knowledge of the agent. A vague set was defined as a set (family) of all possible solutions (sets) of this equation and, although our considerations were limited to the case of unary relations, they can easily be generalized to encompass any k-ary relations.
2. The idea of vague sets was derived from the idea of rough sets originating in the work of Zdzisław Pawlak, whose theory of rough sets takes a non-numerical, qualitative approach to the issue of vagueness, as opposed to the quantitative interpretation of vagueness provided by Lotfi Zadeh.
3. Vague sets, like rough sets, are based on the idea of a set approximation by two sets called the lower and the upper limits of this set. These two kinds of sets are families of sets approximated by suitable limits.
4. Pawlak's approach and the approach discussed in this paper both make a reference to the concept of a cognitive agent's knowledge about the objects of the reality being investigated (see Pawlak [20]). This knowledge is determined by the system of concepts that is determined by a system of their extensions (denotations). When the concept is vague, its denotation, in Pawlak's sense, is a rough set, while in the authors' sense it is a vague set which, under some conditions, is a subset of the rough set.
5. In language representation, the equation of inexact, vague knowledge of the agent can be expressed by means of vague sentences containing a vague predicate. Its denotation (extension) is a family of all scopes of sharp predicates which, from the agent's viewpoint, can be substituted for the vague predicate. The denotation is, at the same time, the vague set of all solutions to the equation of the agent's vague knowledge.
6. Because vague sentences can be treated as sentential forms whose variables are vague predicates, all counterparts of tautologies of classical logic are laws of vague logic (logic of vague sentences).
7. Vague logic is based on classical logic, but it is a many-valued logic, because its sentential connectives are not extensional.

References

1. Blizard, W.D.: Multiset Theory. Notre Dame J. Formal Logic 30(1), 36–66 (1989)
2. Bonikowski, Z.: A Certain Conception of the Calculus of Rough Sets. Notre Dame J. Formal Logic 33, 412–421 (1992)
3. Bonikowski, Z.: Sets Approximated by Representations (in Polish; doctoral dissertation prepared under the supervision of Prof. U. Wybraniec-Skardowska), Warszawa (1996)
4. Bonikowski, Z., Wybraniec-Skardowska, U.: Rough Sets and Vague Sets. In: Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A. (eds.) RSEISP 2007. LNCS, vol. 4585, pp. 122–132. Springer, Heidelberg (2007)


5. Bonissone, P., Tong, R.: Editorial: reasoning with uncertainty in expert systems. Int. J. Man–Machine Studies 22, 241–250 (1985)
6. Codd, E.F.: A Relational Model of Data for Large Shared Data Banks. Comm. ACM 13, 377–387 (1970)
7. Cresswell, M.J.: Logics and Languages. Methuen, London (1973)
8. Demri, S., Orlowska, E.: Incomplete Information: Structure, Inference, Complexity. Springer, Heidelberg (2002)
9. Fine, K.: Vagueness, Truth and Logic. Synthese 30, 265–300 (1975)
10. Iwiński, T.: Algebraic Approach to Rough Sets. Bull. Pol. Acad. Sci. Math. 35, 673–683 (1987)
11. Malinowski, G.: Many-Valued Logics. Oxford University Press, Oxford (1993)
12. Marcus, S.: A Typology of Imprecision. In: Brainstorming Workshop on Uncertainty in Membrane Computing Proceedings, Palma de Mallorca, pp. 169–191 (2004)
13. Marek, W., Pawlak, Z.: Rough Sets and Information Systems. ICS PAS Report 441 (1981)
14. Pagliani, P.: Rough Set Theory and Logic-Algebraic Structures. In: Orlowska, E. (ed.) Incomplete Information: Rough Set Analysis, pp. 109–190. Physica Verlag, Heidelberg (1998)
15. Parsons, S.: Current approaches to handling imperfect information in data and knowledge bases. IEEE Trans. Knowl. Data Eng. 8(3), 353–372 (1996)
16. Pawlak, Z.: Information Systems. ICS PAS Report 338 (1979)
17. Pawlak, Z.: Information Systems – Theoretical Foundations (in Polish). PWN – Polish Scientific Publishers, Warsaw (1981)
18. Pawlak, Z.: Information Systems – Theoretical Foundations. Information Systems 6, 205–218 (1981)
19. Pawlak, Z.: Rough Sets. Intern. J. Comp. Inform. Sci. 11, 341–356 (1982)
20. Pawlak, Z.: Rough Sets. Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht (1991)
21. Pawlak, Z.: Vagueness and uncertainty: A rough set perspective. Computat. Intelligence 11(2), 227–232 (1995)
22. Pawlak, Z.: Orthodox and Non-orthodox Sets – some Philosophical Remarks. Found. Comput. Decision Sci. 30(2), 133–140 (2005)
23. Pomykala, J., Pomykala, J.A.: The Stone Algebra of Rough Sets. Bull. Pol. Acad. Sci. Math. 36, 495–508 (1988)
24. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press, Princeton (1976)
25. Skowron, A., Komorowski, J., Pawlak, Z., Polkowski, L.: Rough Sets Perspective on Data and Knowledge. In: Klösgen, W., Żytkow, J.M. (eds.) Handbook of Data Mining and Knowledge Discovery, pp. 134–149. Oxford University Press, Oxford (2002)
26. Słowiński, R., Stefanowski, J.: Rough-Set Reasoning about Uncertain Data. Fund. Inform. 23(2–3), 229–244 (1996)
27. Wybraniec-Skardowska, U.: Knowledge, Vagueness and Logic. Int. J. Appl. Math. Comput. Sci. 11, 719–737 (2001)
28. Zadeh, L.A.: Fuzzy Sets. Information and Control 8, 338–353 (1965)
29. Zadeh, L.A.: PRUF: A meaning representation language for natural languages. Int. J. Man–Machine Studies 10, 395–460 (1978)

Modified Indiscernibility Relation in the Theory of Rough Sets with Real-Valued Attributes: Application to Recognition of Fraunhofer Diffraction Patterns

Krzysztof A. Cyran

Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
[email protected]

Abstract. The goal of the paper is to present a modification of the classical indiscernibility relation, dedicated to rough set theory in a real-valued attribute space. Contrary to some other known generalizations, the indiscernibility relation modified here remains an equivalence relation, and it is obtained by introducing a structure into the collection of attributes. It defines real-valued subspaces, used in a multidimensional cluster analysis, partitioning the universe in a more natural way as compared to the one-dimensional discretization iterated in the classical model. Since the classical model is a special, extreme case of our modification, the modified version can be considered more general. More importantly, it allows for natural processing of real-valued attributes in rough set theory, broadening the scope of applications of the classical as well as the variable precision rough set model, since the latter can utilize the proposed modification equally well. In a case study, we show a real application of the modified relation: a hybrid, opto-electronic recognizer of Fraunhofer diffraction patterns. Modified rough sets are used in an evolutionary optimization of the optical feature extractor implemented as a holographic ring-wedge detector. The classification is performed by a probabilistic neural network, whose error, assessed in an unbiased way, is compared to earlier works.

Keywords: rough sets, indiscernibility relation, holographic ring-wedge detector, evolutionary optimization, probabilistic neural network, hybrid pattern recognition.

1 Introduction

In the classical theory of rough sets, originated by Pawlak [32], the indiscernibility relation is generated by the information describing objects belonging to some finite set called the universe. If this information is of a discrete nature, then the classical form of this relation is a natural and elegant notion. For many applications processing discrete attributes describing objects of the universe, such a definition of the indiscernibility relation is adequate, which implies that the area of successful use


of classical rough set methodology covers problems having a natural discrete representation, consistent with the granular nature of knowledge in this theory [32]. Such a classical rough set model is particularly useful in automatic machine learning, knowledge acquisition and decision rule generation, applied to problems with discrete data not having enough size for the application of statistical methods, which demand reliable estimation of the distributions characterizing the underlying process [29,30].

If, however, the problem is defined in a continuous domain, the classical indiscernibility relation almost surely builds one-element abstract classes, and therefore is not suitable for any generalization. To overcome this disadvantage, different approaches have been proposed. The simplest is discretization, but if this process is iterated separately for single attributes, it induces an artificial and highly nonlinear transformation of the attribute space. Other approaches concentrate on generalizing the notion of indiscernibility relation into a tolerance relation [25,36] or a similarity relation [15,37,38]. A comparative study focused upon even more general approaches, assuming the indiscernibility relation to be any binary reflexive relation, is given by Gomolinska [20]. Another interesting generalization of the indiscernibility relation into a characteristic relation, applicable to attributes with missing values (lost values or don't care conditions), is proposed by Grzymala-Busse [21,22].

In this paper we propose a methodology based on introducing a structure into the collection of conditional attributes, and treating certain groups defining this structure as multidimensional subspaces in a forthcoming cluster analysis. In this way we do not have to resign from an equivalence relation and, at the same time, we obtain abstract classes uniting similar objects, belonging to the same clusters, in a continuous multidimensional space, as required by the majority of classification problems. Since the area of the author's interests is focused on hybrid opto-electronic pattern recognition systems, the practical illustration of the proposed modification concerns such a system. However, with some exceptions indicated at the end of Section 2, the modification can find many more applications, especially as it can be equally well adopted in the generalized variable precision rough set model, introduced by Ziarko [40], to meet the requirements of the analysis of huge data sets.

Automatic recognition of images constitutes an important area in pattern recognition problems. Mait et al. [28], in a review article, state that "an examination of recent trends in imaging reveals a movement towards systems that balance processing between optics and electronics". Such systems are designed to perform heavy computations in optical mode, practically contributing no time delays, while post-processing is made in computers, often with the use of artificial intelligence (AI) methods. The foundations of one such system have been proposed by Casasent and Song [4], presenting the design of holographic ring-wedge detectors (HRWD), and by George and Wang, who combined a commercially available ring-wedge detector (RWD) and a neural network (NN) in one complete image recognition system [19]. Despite the completeness of the solution, their system was of little practical importance, since the commercially available


RWD was very expensive and, moreover, could not be adapted to a particular problem. Casasent's HRWD, originally named by him a computer generated hologram (CGH), had a lot of advantages over the commercial RWD, the most important being a much lower cost and adaptability. According to its optical characteristics, the HRWD belongs to a wider class of grating based diffractive optical variable devices (DOVDs) [11], which can be relatively easily obtained from computer generated masks, and are used for sampling the Fraunhofer diffraction pattern. The pioneering works proposing the method of optimization of HRWD masks for a given application have been published by Cyran and Mrozek [10] and by Jaroszewicz et al. [23]. The mentioned method was successfully applied to a multi layer perceptron (MLP) based system, in a recognition of the type of subsurface stress in materials with embedded optical fiber [9,12,14]. Examples of application of RWD-based feature extraction together with an MLP-based classification module include systems designed by Podeszwa et al. [34], devoted to the monitoring of the engine condition, and by Jaroszewicz et al. [24], dedicated to airplane engines. Some other notable examples of applications of ring-wedge detectors and neural network systems include the works of Ganotra et al. [17] and Berfanger and George [3], concerning fingerprint recognition, face recognition [18], or image quality assessment [3]. The ring-wedge detector has also been used, as a light scatter detector, in a classification of airborne particles performed by Kaye et al. [26], and for accurate characterization of particles or defects present on or under the surface, useful in the fabrication of integrated circuits, as presented by Nebeker and Hirleman [31]. The purely optical version of the HRWD-MLP recognition system was considered by Cyran and Jaroszewicz [7]; however, such a system is limited by the development of optical neural networks. A simplified, rings-only version of the device is reported by Fares et al. [16], applied in rotation-invariant recognition of letters. With all these applications, no wonder that Mait et al. [28] concluded: "few attempts have been made to design detectors with much consideration for the optics. A notable exception is ring-wedge detector designed for use in the Fourier plane of a coherent optical processor."

Obviously, the MLP (or, more generally, any type of NN) is not the only classifier which could be applied for the classification of patterns occurring in a feature space generated by the HRWD. Moreover, the first version of the optimization procedure favored rough set based classifiers, due to the identical (and therefore fully compatible) discrete nature of knowledge representation in the theory of rough sets applied both to HRWD optimization and to subsequent rough set based classification. The application of the general ideas of obtaining such a rough classifier was presented by Cyran and Jaroszewicz [8], and a fast rough classifier implemented as a PAL 26V12 element was considered and designed by Cyran [6]. Despite the inherent compatibility between the optimization procedure and the classifier, the system remained suboptimal, because features extracted from the HRWD generate a continuous space, subject to the unnatural discretization required by both the rough set based optimization and the classifier.

The mentioned problems led to the idea that, in order to obtain the enhanced optimization method, the discretization required by the classical indiscernibility relation


in rough set theory should be eliminated in a way that does not require resigning from an equivalence relation in favor of some weaker form (like a tolerance relation, for example). We achieved this by a modification of the indiscernibility relation which allows natural processing of real-valued attributes. The paper presents this problem in Section 2. After focusing on indiscernibility relation related problems in Section 2, Section 3 starts with the optical foundations of the recognition system considered, followed by experimental results obtained from the application of the enhanced optimization methodology. The discussion and conclusions are included in Section 4.

Remarkably, the experimental application of the modified indiscernibility relation in the system considered improved the results of the evolutionary optimization of the holographic RWD and, equivalently, enhanced the optimization of the HRWD-generated feature space, dedicated to real-valued classifiers. It also gave a theoretical basis for the latest design of a two-way, neural network and rough set based classification system [5].

2 Modification of Indiscernibility Relation

Let us start with a brief analysis of the classical theory of rough sets, and of its generalization, named the theory of rough sets with variable precision, in the context of data representation requirements. Next, the modification of the indiscernibility relation is given. With the modified indiscernibility relation, the majority of notions defined in rough set theory (both in the classical and the generalized, variable precision form) can be naturally applied to attributes having a real-valued domain.

2.1 Analysis of the Theory of Rough Sets with Discrete Attributes

The notion of a rough set has been defined for the representation, processing and understanding of imperfect knowledge. Such knowledge must often suffice in control, machine learning or pattern recognition. The rough approach is based on the assumption that each object is associated with some information describing it, not necessarily in an accurate and certain way. Objects described by the same information are not discernible. The indiscernibility relation, introduced here in an informal way, expresses the fact that the theory of rough sets does not deal with individual objects, but with classes of objects which are indiscernible. Therefore the knowledge represented by classical rough sets is granular [32]. The simple consequence is that objects with a natural real-valued representation hardly match that scheme, and some preprocessing has to be performed before such objects can be considered in a rough set based frame. The goal of this preprocessing is to make "indiscernible" those objects which are close enough (but certainly discernible) in the real-valued space. In the majority of applications of rough set theory, this is obtained by a subsequent discretization of all real-valued attributes. This highly nonlinear process is not natural and is disadvantageous in many applications (such as the application presented in Section 3). Before we present an alternative way of addressing the problem (in Subsection 2.2), a formal definition of the classical indiscernibility relation is given.


Let S = ⟨U, Q, v, f⟩ be an information system composed of a universe U, a set of attributes Q, an information function f, and a mapping v. The mapping v associates each attribute q ∈ Q with its domain Vq. The information function f : U × Q → V is defined in such a way that f(x, q) reads as the value of attribute q for the element x ∈ U, and V denotes the domain of all attributes q ∈ Q, defined as the union of the domains of the single attributes, i.e., V = ⋃_{q∈Q} Vq. Then each nonempty set of attributes C ⊆ Q defines the indiscernibility relation I0(C) ⊆ U × U for x, y ∈ U as

x I0(C) y ⇔ ∀q ∈ C : f(x, q) = f(y, q).    (1)

Such a definition, although theoretically applicable to both discrete and continuous domains V, is of practical value only for discrete domains. For continuous domains the relation is too strong, because in practice all elements would be discernible. Consequently, all abstract classes generated by I0 would consist of exactly one element, which would make the application of rough set notions possible but pointless. The problem is that in the theory of rough sets we can associate with each information system some knowledge KQ generated by the indiscernibility relation I0(Q); for continuous attributes the corresponding knowledge would be too specific to allow for any generalizations, which are required for the classification of similar objects into common categories.
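As a minimal executable illustration of the classical relation (not taken from the paper; the toy information system and all names below are invented for this sketch), the abstract classes of I0(C) can be computed by grouping objects on their attribute-value signatures:

```python
from collections import defaultdict

def indiscernibility_classes(objects, attributes):
    # Group objects by their value signature on the attributes in C:
    # identical signatures <=> indiscernible in the sense of Eq. (1).
    classes = defaultdict(list)
    for x, values in objects.items():
        signature = tuple(values[q] for q in attributes)
        classes[signature].append(x)
    return list(classes.values())

# A toy information system with two discrete attributes.
U = {
    "x1": {"colour": "red",  "shape": "round"},
    "x2": {"colour": "red",  "shape": "round"},
    "x3": {"colour": "blue", "shape": "round"},
}
print(indiscernibility_classes(U, ["colour", "shape"]))
# [['x1', 'x2'], ['x3']] -- x1 and x2 are indiscernible w.r.t. C
```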

2.2 Indiscernibility Relation in Rough Sets with Real-Valued Attributes

The consequence of the discussion ending the previous subsection is the need for discretization. If a problem is originally defined for real-valued attributes, then before rough set theory can be applied, some clustering and discretization of the continuous attribute values must be performed. Let this process be denoted as a transformation described by a vector function Λ : ℝ^card(C) → {1, 2, . . . , ξ}^card(C), where ξ is called the discretization factor. The discretization factor simply denotes the number of clusters covering the domain of each individual attribute q ∈ C. Theoretically, this factor could be different for different attributes, but without loss of generality we assume it is constant over the set of attributes. Then the discretization of any individual attribute q ∈ C can be denoted as a transformation defined by a scalar function Λ : ℝ → {1, 2, . . . , ξ}. In this case, we obtain the classical form of the indiscernibility relation, defined as

x I0(Λ[C]) y ⇔ ∀q ∈ C : f(x, Λ[q]) = f(y, Λ[q]).    (2)
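A sketch of this attribute-by-attribute discretization (equal-width binning is only one possible choice of Λ; the paper does not prescribe a particular clustering, and the ranges and values below are invented):

```python
def discretize(value, lo, hi, xi):
    # Scalar Lambda: map a real value to one of xi equal-width
    # clusters covering [lo, hi] (one possible choice of clustering).
    if value >= hi:
        return xi
    return int((value - lo) / (hi - lo) * xi) + 1

def signature(values, ranges, xi):
    # Apply Lambda attribute by attribute, as Eq. (2) requires.
    return tuple(discretize(values[q], *ranges[q], xi) for q in sorted(ranges))

ranges = {"q1": (0.0, 1.0), "q2": (0.0, 10.0)}
x = {"q1": 0.42, "q2": 7.3}
y = {"q1": 0.45, "q2": 7.4}
# With xi = 4 both objects get the signature (2, 3),
# so x I0(Lambda[C]) y although their raw values differ.
print(signature(x, ranges, 4), signature(y, ranges, 4))
```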

Below we argue that the majority (though not all) of the notions defined in the theory of rough sets do not, in fact, demand the strong version of the indiscernibility relation I0 defined by equation (1) (or by (2), if discretization is required). From a formal point of view, what really matters is that we assume the indiscernibility relation to be an equivalence relation, i.e., it must be reflexive, symmetric, and transitive. From a practical point of view, objects indiscernible in the sense of rough set theory should be objects which are close in the real-valued space. Any relation having these properties we denote by I, without any subscript, reserving subscripts for particular forms of I. The exact form of I, defined as I0 in (1) or (2), is not required for processing rough information, except for some notions which we discuss later. One can easily verify (by confronting the general form of the indiscernibility relation I with the notions presented below) that the following constructs form a logically consistent system, no matter what the specific form of the indiscernibility relation is. In particular, this is true for forms of this relation which differ from the classical form, both for discrete (1) and continuous (2) types of attributes, as presented below.

C-elementary sets. A set Z is C-elementary when all elements x ∈ Z are C-indiscernible, i.e., they belong to the same class [x]I(C) of the relation I(C). If C = Q then Z is an elementary set in S. A C-elementary set is therefore the atomic unit of knowledge about the universe U with respect to C. Since C-elementary sets are defined by abstract classes of the relation I, any equivalence relation can be used as I.

C-definable sets. If a set X is a union of C-elementary sets, then X is C-definable, i.e., it is definable with respect to the knowledge KC. A complement, an intersection, or a union of C-definable sets is also C-definable. Therefore the indiscernibility relation I(C), by generating the knowledge KC, defines all that can be accurately expressed with the use of the set of attributes C. Two information systems S and S′ are equivalent if they have the same elementary sets; then the knowledge KQ is the same as the knowledge KQ′. The knowledge KQ is more general than the knowledge KQ′ iff I(Q′) ⊆ I(Q), i.e., when each abstract class of the relation I(Q′) is included in some abstract class of I(Q). C-definable sets, as unions of C-elementary sets, are likewise defined for any equivalence relation I.

C-rough set X. Any set which is a union of C-elementary sets is a C-crisp set; any other collection of objects in the universe U is called a C-rough set. A rough set contains a border, composed of elements for which, based on the knowledge generated by the indiscernibility relation I, it is impossible to decide whether or not the element belongs to the set. Each rough set can be defined by two crisp sets, called the lower and upper approximations of the rough set. Since C-crisp sets are unions of C-elementary sets, and a C-rough set is defined by two C-crisp sets, the notion of a C-rough set is defined for any equivalence relation I, not necessarily for I0.

C-lower approximation of a rough set X ⊆ U. The lower approximation of a rough set X is composed of those elements of the universe which certainly belong to X, based on the indiscernibility relation I. Formally, the C-lower approximation of a set X ⊆ U, denoted C̲X, is defined in the information system S as C̲X = {x ∈ U : [x]I(C) ⊆ X}, and since it is a C-crisp set, it can be defined for an arbitrary equivalence relation I.

C-upper approximation of a rough set X ⊆ U. The upper approximation of a rough set X is composed of those elements of the universe which perhaps belong to X, based on the indiscernibility relation I. Formally, the C-upper approximation of a set X ⊆ U, denoted C̄X, is defined in the information system S as C̄X = {x ∈ U : [x]I(C) ∩ X ≠ ∅}, and since it is a C-crisp set, it can be defined for an arbitrary equivalence relation I.

C-border of a rough set X ⊆ U. The border of a rough set is the difference between its upper and lower approximations. Formally, the C-border of a set X, denoted BnC(X), is defined as BnC(X) = C̄X − C̲X, and as a difference of two C-crisp sets, its definition is based on an arbitrary equivalence relation I.

Other notions based on the notion of the upper and/or lower approximation of a set X ⊆ U with respect to a set of attributes C include: the C-positive region of the set X ⊆ U, the C-negative region of the set X ⊆ U, sets roughly C-definable, sets internally C-undefinable, sets externally C-undefinable, sets totally C-undefinable, the roughness of a set, the C-accuracy of approximation of a set αC(X), and the C-quality of approximation of a set γC(X). An interesting comparison of this latter coefficient with the Dempster–Shafer theory of evidence is given by Skowron and Grzymala-Busse [35].

Rough membership function of an element x: μ_X^C(x). The coefficient describing the level of uncertainty whether the element x ∈ U belongs to a set X ⊆ U, when the indiscernibility relation I(C) generates the knowledge KC in the information system S, is a function denoted by μ_X^C(x) and defined as

μ_X^C(x) = card(X ∩ [x]I(C)) / card([x]I(C)).

This coefficient is also referred to as the rough membership function of an element x, due to its similarity to the membership function known from the theory of fuzzy sets. This function gave the basis for the generalization of rough set theory called the variable precision rough set model [40]. This model assumes that the lower and upper approximations depend on an additional coefficient β, such that 0 ≤ β ≤ 0.5, and are defined as C̲_β X = {x ∈ U : μ_X^C(x) ≥ 1 − β} and C̄_β X = {x ∈ U : μ_X^C(x) > β}, respectively. The boundary in this model is defined as Bn_C^β(X) = {x ∈ U : β < μ_X^C(x) < 1 − β}. It is easy to observe that the classical rough set theory is the special case of the variable precision model with β = 0. Since ∀X ⊆ U, C̲X ⊆ C̲_β X ⊆ C̄_β X ⊆ C̄X, the variable precision model is a weaker form of the theory compared to the classical model, and therefore it is often preferable in the analysis of large information systems with some amount of contradictory data. The membership function of an element x can also be defined for a family of sets 𝒳 as

μ_𝒳^C(x) = card((⋃_{Xn∈𝒳} Xn) ∩ [x]I(C)) / card([x]I(C)).

If all subsets Xn of the family 𝒳 are mutually disjoint, then ∀x ∈ U, μ_𝒳^C(x) = Σ_{Xn∈𝒳} μ_{Xn}^C(x). Since the definition of the rough membership function μ_X^C(x) assumes only the existence of equivalence classes of the relation I, and the variable precision model formally differs from the classical model only in the definition of the lower and upper approximations by means of this coefficient, all the notions presented above are defined for arbitrary I also in this generalized model.

Notions of rough set theory applicable to a single set X are generally applicable also to families of sets 𝒳 = {X1, X2, . . . , XN}, where Xn ⊆ U and n = 1, . . . , N.

The lower approximation of a family of sets is the family of the lower approximations of the sets belonging to the family considered. Formally, C̲𝒳 = {C̲X1, C̲X2, . . . , C̲XN}. As a family of C-crisp sets, the definition of the C-lower approximation of a family of sets is based on an arbitrary equivalence relation I. Similarly, the C-upper approximation of a family of sets is the family of the upper approximations of the sets belonging to the family considered. Formally, C̄𝒳 = {C̄X1, C̄X2, . . . , C̄XN}. This notion is valid for any equivalence relation I, for reasons identical to those presented for the C-lower approximation of a family of sets. Other notions based on the upper and/or lower approximation of a family of sets 𝒳 with respect to a set of attributes C include: the C-border of a family of sets, the C-positive region of a family of sets, the C-negative region of a family of sets, the C-accuracy of approximation of a family of sets, and the C-quality of approximation of a family of sets. This latter coefficient is especially interesting for the application presented in the subsequent section, since it is used as the objective function in the procedure of optimization of the feature extractor. For this purpose, the considered family of sets is the family of abstract classes generated by the decision attribute d, which is the class of the image to be recognized (see Section 3). Here we define this coefficient for any family of sets 𝒳 as γC(𝒳) = card(PosC(𝒳))/card(U).

Conclusion. The analysis of the above notions indicates that they do not require any particular form of the indiscernibility relation (such as the classical form referred to as I0). They are defined for any form of the indiscernibility relation (satisfying reflexivity, symmetry, and transitivity), denoted by I, and are strict analogs of the classical notions defined under the assumption of the original form of the indiscernibility relation I0 given in (1) and (2). Therefore, the exact form of the indiscernibility relation, as proposed by the classical theory of rough sets as well as by its generalization, the variable precision model, is not actually required for the presented notions to form a coherent logical system. Some papers, referred to in the introduction, go further in this generalizing tendency, resigning from the requirement of an equivalence relation; working with such generalizations, however, is often not natural in problems such as classification, where the notion of abstract classes, inherently tied to an equivalence relation, is of great importance. Therefore, we propose a modification of the indiscernibility relation which is particularly useful in pattern recognition problems dealing with a space of continuous attributes, and which is defined in terms of an equivalence relation.

To introduce the modification formally, let us change the notation of the indiscernibility relation so that it depends on a family of sets of attributes instead of simply on a set of attributes. By a family of sets of attributes we understand a subset of the power set of the set of attributes such that all elements of this subset (these elements are subsets of the set of attributes) are mutually disjoint and their union is equal to the considered set of attributes. This allows us to introduce some structure into the originally unstructured set of attributes on which the relation depends [13].

Let 𝒞 = {C1, C2, . . . , CN} be the family, introduced above, of disjoint sets of attributes Cn ⊆ Q, such that the unstructured set of attributes C ⊆ Q is equal to the union of the members of the family 𝒞, i.e., C = ⋃_{Cn∈𝒞} Cn. Then let the indiscernibility relation depend on 𝒞 instead of on C. Observe that both C and 𝒞 contain the same collection of single attributes; however, 𝒞 includes additional structure compared to C. If this structure is irrelevant for the problem considered, it can simply be ignored, and we obtain, as a special case, the classical version of the indiscernibility relation I0. However, we can also obtain other versions of this modified relation for which the introduced structure is meaningful. Let the relation I1(𝒞) ⊆ U × U be the form of the relation I, different from I0, given by

x I1(𝒞) y ⇔ ∀Cn ∈ 𝒞 : Clus(x, Cn) = Clus(y, Cn),    (3)

where x, y ∈ U, and Clus(x, Cn) denotes the number of the cluster to which the element x belongs. Cluster analysis is therefore required to be performed in the continuous vector spaces defined by the sets of real-valued conditional attributes Cn ∈ 𝒞. There are two extreme cases of this relation: one obtained when the family 𝒞 is composed of exactly one set of conditional attributes C, and one obtained when the family 𝒞 is composed of card(C) sets, each containing exactly one conditional attribute q ∈ C. The classical form I0 of the indiscernibility relation is obtained as the latter extreme special case of the modified version I1, because then clustering and discretization are performed separately for each continuous attribute. Formally, this can be written as

I0(Λ[C]) ≡ I1(𝒞) ⇔ 𝒞 = { {qn} : C = ⋃_{qn∈C} {qn} } ∧ Clus(x, {qn}) = f(x, Λ[qn]).    (4)

In other words, the classical form I0 of the indiscernibility relation is obtained as a special case of the modified version I1 if we assume that the family 𝒞 is composed of subsets Cn each containing just one attribute, and that the discretization of each continuous attribute is based on a separate cluster analysis, as required by the scalar function Λ applied to each attribute qn.

Here we discuss some of the notions of rough set theory that cannot be used in their usual sense with the modified indiscernibility relation. We start with the so-called basic sets, which are the abstract classes of the relation I({q}) defined for a single attribute q. These are simply sets composed of elements indiscernible with respect to the single attribute q. Obviously, this notion loses its meaning when I1 is used instead of I0, because abstract classes generated by I0({q}) are always unions of some abstract classes generated by I0(C), whereas abstract classes generated by I1({q}) are not necessarily unions of abstract classes generated by I1(𝒞). Therefore the conclusion that the knowledge K{q} generated by I0({q}) is always more general than the knowledge KC generated by I0(C) no longer holds when I1 is used instead of I0. Similarly, the notions of reducts, relative reducts, cores, and relative cores are no longer applicable in their classical sense, since their definitions are strongly tied to single attributes. Joining these attributes into members of the family 𝒞 destroys the individual treatment of attributes, required for these notions to have their well-known meaning.

However, as long as the use of rough set theory in a continuous attribute space does not go beyond the collection of notions described ahead of definition (3), the modified version I1 should be considered more advantageous than the classical form I0. In particular, this is true in the processing of knowledge obtained from the holographic ring-wedge detector, where the quality of approximation of a family of sets plays the major role. We present this application as an illustrative example.
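A minimal computational sketch of this scheme (not the author's implementation; the grouping into two blocks, the use of k-means as Clus, and all parameter values are assumptions made for illustration): abstract classes of I1(𝒞) are obtained by clustering each attribute group jointly, and γ is the fraction of objects whose class is pure with respect to the decision.

```python
import numpy as np
from sklearn.cluster import KMeans

def i1_classes(X, groups, n_clusters=4, seed=0):
    # Abstract classes of I1(C) from Eq. (3): cluster each attribute
    # group Cn jointly (a multidimensional cluster analysis), then
    # group objects on their tuples of cluster labels Clus(x, Cn).
    labels = [KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
              .fit_predict(X[:, list(cols)]) for cols in groups]
    classes = {}
    for i, sig in enumerate(zip(*labels)):
        classes.setdefault(sig, []).append(i)
    return list(classes.values())

def gamma(classes, d):
    # Quality of approximation of the family D generated by the
    # decision attribute d: the fraction of objects whose I1-class
    # is pure w.r.t. d (i.e. lies in the positive region).
    pos = sum(len(c) for c in classes if len({d[i] for i in c}) == 1)
    return pos / len(d)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))          # e.g. 8 ring + 8 wedge features
d = rng.integers(0, 8, size=100)        # decision attribute (class labels)
print(gamma(i1_classes(X, [range(0, 8), range(8, 16)]), d))
```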

3 Application to Fraunhofer Pattern Recognizer

The system presented below belongs to the class of fast hybrid opto-electronic pattern recognizers. The feature extraction subsystem processes information optically. We start the description of this feature extractor by giving the physical basis required to understand the properties of the feature vectors generated by this subsystem, followed by the description of the enhanced method of HRWD optimization and experimental results of the use of this method. This illustrative section is completed with the description of a probabilistic neural network (PNN) based classifier and experimental results of its application to Fraunhofer pattern recognition.

3.1 Optical Foundations

In a homogeneous and isotropic medium free of charges (ρ = 0) and currents (j = 0), the Maxwell equations result in the wave equation

∇²G − εμ ∂²G/∂t² = 0,    (5)

where G denotes the electric (E) or magnetic (H) field, and the product εμ is the reciprocal of the squared velocity of a wave in the medium. Applying this equation to a space with obstacles such as apertures or diaphragms should result in equations describing the diffraction of light at these obstacles. However, the solution is very complicated in special cases and impossible in the general case. Therefore a simplification is used which assumes a scalar field u instead of the vector field G. In this case the information about the polarization of the light is lost, and it holds that

∇²u − (1/ν²) ∂²u/∂t² = 0.    (6)

The theory simplified in this way, called the scalar Kirchhoff theory, describes the diffraction of light at various obstacles. According to this theory, the scalar complex amplitude u0(P) of a light oscillation caused by the diffraction is given at an observation point P by the Kirchhoff integral [33]

u0(P) = (1/4π) ∮_Σ [ (e^{ikr}/r) (du0/dn) − u0 (d/dn)(e^{ikr}/r) ] dΣ,    (7)

where Σ denotes a closed surface containing the point P but not the light source, n is the external normal to the surface Σ, k = 2π/λ is the propagation constant, u0 denotes the scalar amplitude on the surface Σ, and r is the distance from any point enclosed by the surface Σ to the observation point P. Formula (7) states that the amplitude u0 at the point P does not depend on the state of oscillations in the whole area surrounding this point (as would follow from the Huygens theory) but only on the state of oscillations on the surface Σ. All other oscillations inside this surface cancel each other. Applying the Kirchhoff theorem to diffraction on a flat diaphragm with an aperture of any shape and size gives the integral stretched only over the surface ΣA covering the aperture. Such an integral can be transformed to [33]

u0(P) = −(ik/4π) ∫_{ΣA} u0 (1 + cos θ) (e^{ikr}/r) dΣA,    (8)

where θ denotes the angle between the radius r from a point of the aperture to the point of observation and the internal normal of the aperture. Since any transparent image is, in fact, a collection of diaphragms and apertures of various shapes and sizes, such an image, when illuminated by coherent light, generates a diffraction pattern described in the scalar approximation by the Kirchhoff integral (7). Let the coordinates of any point A in the image plane be denoted by (x, y), and let the amplitude of the light oscillation at this point be ν(x, y). Furthermore, let the coordinates (ξ, η) of an observation point P be chosen as

ξ = (2π/λ) sin θ,  η = (2π/λ) sin ϕ,    (9)

where λ denotes the wavelength of the light, whereas θ and ϕ are the angles between the radius from the point of observation P to the point A and the planes (x, z) and (y, z), respectively. These planes are two planes of the coordinate system (x, y, z) whose axes x and y lie in the image plane and whose axis z is perpendicular to the image plane (it is called the optical axis). Let the coordinate system (x′, y′) be the system with its origin at the point P and such that its plane (x′, y′) is parallel to the plane of the coordinate system (x, y). It is worth noticing that the coordinates of one particular point in the observation system (ξ, η) correspond to the coordinates of all points P of the system (x′, y′) such that the angles between the axis z and a line connecting these points with some points A of the plane (x, y) are θ and ϕ, respectively. In other words, all radii AP connecting points A of the plane (x, y) and points P of the plane (x′, y′) which are parallel to each other are represented in the system (ξ, η) by one point. Such a transformation of the coordinate systems is physically realized in the back focal plane of a lens placed perpendicularly to the optical axis z. In this case, all parallel radii represent parallel light beams, diffracted on the image (see Fig. 1) and focused at the same point in the focal plane. Moreover, the integral (7), when expressed in the coordinate system (ξ, η), can be transformed to [33]

u0(ξ, η) = (1/2π) ∫_{−∞}^{∞} ∫_{−∞}^{∞} ν(x, y) e^{−i(ξx+ηy)} dx dy.    (10)


Fig. 1. The operation of the spherical lens (labels in the figure: P, α, r_f, R, f, l, l′)

Geometrical relationships (Fig. 1) reveal that

r_f = R (l′ − f) / l′.    (11)

On the other hand, the operation of the lens is given by

1/f = 1/l + 1/l′.    (12)

Substituting this equation into (11), after elementary algebra we obtain

r_f / l = R / f.    (13)

Since the angles θ and ϕ (corresponding to the angle α in Fig. 1, in the planes (x, z) and (y, z), respectively) are small, equations (9), with (13) in mind, can be rewritten as

ξ = (2π/λ)(x_f/f),  η = (2π/λ)(y_f/f),    (14)

where x_f and y_f denote Cartesian coordinates in the focal plane of the lens. Equation (10) expressed in these coordinates can be written as

u0(x_f, y_f) = (1/2π) ∫_{−∞}^{∞} ∫_{−∞}^{∞} ν(x, y) e^{−i2π(x_f x/(λf) + y_f y/(λf))} dx dy.    (15)

Setting new coordinates (u, v) as

u = x_f/(λf),  v = y_f/(λf),    (16)

we finally obtain the equation

u0(u, v) = (1/2π) ∫_{−∞}^{∞} ∫_{−∞}^{∞} ν(x, y) e^{−i2π(ux+vy)} dx dy,    (17)


which is (up to a constant factor k) a Fourier integral. This is essentially the Fraunhofer approximation of the Kirchhoff integral, and it is also referred to as the Fraunhofer diffraction pattern [27]. The complex amplitude of the Fraunhofer diffraction pattern obtained in the back focal plane of the lens is therefore the Fourier transform of the complex amplitude in the image plane:

u0(u, v) = k F{ν(x, y)}.    (18)

This fact is very often used in the design of hybrid systems for the recognition of images in the spatial frequency domain. One prominent example is the system with a feature extractor built as an HRWD placed in the back focal plane of the lens. The HRWD itself consists of two parts: a part composed of rings Ri and a part containing wedges Wj. Each of the elements Ri or Wj is covered with a grating of a particular spatial frequency and orientation, so that the light passing through a given region is diffracted and focused, by some other lens, at a certain cell of an array of photodetectors. The photodetector, in turn, integrates the intensity of the light and generates one feature used in classification.
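The sampling scheme can also be emulated digitally (as is done for the experiments reported later, where diffraction patterns are computed by a discrete Fourier transform). The sketch below is an assumption-laden stand-in for the optical device, with an invented image size; it integrates the Fraunhofer-plane intensity (|FFT|², cf. Eq. (18)) over 8 annular and 8 angular sectors:

```python
import numpy as np

def ring_wedge_features(img, n_rings=8, n_wedges=8):
    # Emulate RWD-style sampling: sum the focal-plane intensity over
    # n_rings annuli and n_wedges angular sectors. Wedges are taken
    # over a half-plane because the pattern of a real image is
    # centrosymmetric.
    F = np.fft.fftshift(np.fft.fft2(img))
    intensity = np.abs(F) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = np.hypot(y - cy, x - cx)
    theta = np.mod(np.arctan2(y - cy, x - cx), np.pi)
    r_max = r.max()
    rings = [intensity[(r >= i * r_max / n_rings) &
                       (r < (i + 1) * r_max / n_rings)].sum()
             for i in range(n_rings)]
    wedges = [intensity[(theta >= j * np.pi / n_wedges) &
                        (theta < (j + 1) * np.pi / n_wedges)].sum()
              for j in range(n_wedges)]
    return np.array(rings + wedges)

img = np.zeros((64, 64)); img[24:40, 28:36] = 1.0   # toy aperture
print(ring_wedge_features(img).shape)                # (16,)
```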

3.2 Enhanced Optimization Method

The system considered above can be used for the recognition of images invariant with respect to translation, rotation, and size, based on the properties of the Fourier transform and on the way the HRWD samples the Fraunhofer diffraction pattern. The standard HRWD-based feature extractor can be optimized to obtain even better recognition properties of the system. To perform any optimization, one needs an objective function and a method of searching the space of solutions. These two problems are discussed below.

Let the ordered 5-tuple T = ⟨U, C, {d}, v, f⟩ be the decision table obtained from the information system S = ⟨U, Q, v, f⟩ by a decomposition of the set of attributes Q into two mutually disjoint sets: the set of conditional attributes C and the set {d} composed of one decision attribute d. Let each conditional attribute c ∈ C be one feature obtained from the HRWD, and let the decision attribute d be the number of the class to be recognized. Obviously, the domain of any such conditional attribute is ℝ, and the domain of the decision attribute d is a subset of the first natural numbers, with cardinality equal to the number of recognized classes. Furthermore, let D = {[xn]_{I0({d})} : xn ∈ U} be the family of sets of images in which each set contains all images belonging to the same class. Observe that the classical form of the indiscernibility relation I0 is used in this definition, due to the discrete nature of the domain of the decision attribute d.

Based on the results of the discussion given by Cyran and Mrozek [10], we argue that the rough set based coefficient called the quality of approximation of the family D by the conditional attributes belonging to C, denoted γC(D), is a good objective function in the optimization of a feature extractor in problems with a multimodal distribution of classes in the feature space. This is so because this coefficient indicates the level of determinism of the decision table, which in turn is relevant for classification. On the other hand, based on the conclusion given in Subsection 2.2, in the case of real-valued attributes C the preferred form of the indiscernibility relation, so crucial for rough set theory in general (and therefore for the computation of the γC(D) objective in particular), is the form defined by (3). Therefore, optimization with the objective function γC(D) computed with respect to the classical form of the indiscernibility relation for real-valued attributes C given in (2) produces sub-optimal solutions. This drawback can be eliminated if the modified version proposed in (3) is used instead of the classical form defined in (2). However, the generalized form (3) requires the definition of some structure on the set of conditional attributes. This is task dependent; in our case, the architecture of the feature extractor, having different properties of wedges and rings, defines a natural structure as a family 𝒞 = {CR, CW} composed of two sets: a set of attributes CR corresponding to rings and a set of attributes CW corresponding to wedges. With this structure introduced into the set of conditional attributes, the coefficient γ𝒞(D) computed with respect to the modified indiscernibility relation (3) is an enhanced objective function for the optimization of the HRWD.

Since the enhanced objective function defined above is not differentiable, gradient-based search methods are excluded. However, the HRWD can be optimized within the framework of an evolutionary algorithm. The maximum fitness value of 97%, meaning γ𝒞(D) = 0.97, was obtained in generation 976 for a population composed of 50 individuals (Fig. 2). The computer-generated mask of the optimal HRWD, named x_opt, is designed for a system with a coherent light wavelength λ = 635 nm, emitted by a laser diode, and for a lens L with a focal length f_L = 1 m. In order to preserve the resolution capability of the system, the diameter of the HRWD in the Fourier plane should be equal to the diameter of the Airy disc given by s_HRWD = 4 × 1.22 × λ × f_L / s_min = 2.07 mm, if the assumed minimum size of recognizable objects is s_min = 1.5 mm. Assuming also a rectangular array of photodetectors of size s = 5 mm, forming four rows (i = 1, . . . , 4) and four columns (j = 1, . . . , 4), and setting the distance in the vertical direction from the optical axis to the upper edge of the array to H = 50 mm, we obtain the values of the angles θij presented in Table 1. Similar results for the distances dij are given in Table 2.

Fig. 2. Process of evolutionary optimization of the HRWD. The curves present the fitness of x_opt, expressed in percent (vertical axis, 70–100%), versus the generation number: a) linear scale (0–1000), b) logarithmic horizontal scale (1–1000).


Table 1. The values of angles θij (expressed in degrees) defining the HRWD gratings

i \ j    1      2      3      4
1       3.01   8.97  14.74  20.22
2       3.37  10.01  16.39  22.38
3       3.81  11.31  18.43  25.02
4       4.40  12.99  21.04  28.30

Table 2. Distances dij between striae [μm]

i \ j    1      2      3      4
1      13.35  13.20  12.93  12.54
2      14.92  14.71  14.33  13.82
3      16.90  16.60  16.06  15.34
4      19.48  19.04  18.24  17.20

Table 3. Distances dij between striae, in units used by the software generating HRWD masks

i \ j    1      2      3      4
1      12.92  12.78  12.52  12.14
2      14.44  14.24  13.88  13.38
3      16.36  16.08  15.55  14.86
4      18.86  18.43  17.65  16.65

Since the software generating HRWD masks has been designed in such a way that the distances dij are given in units equal to one-tenth of a percent of the radius of the HRWD, for R_HRWD = s_HRWD/2 = 1.035 mm we give in Table 3 the proper values expressed in these units.
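A bare-bones sketch of the evolutionary loop used for this optimization (the (μ+λ) scheme, the mutation operator, and all rates below are illustrative assumptions, not the paper's exact algorithm; in a real run the stand-in fitness would be replaced by γ𝒞(D) computed on the feature space induced by the candidate mask parameters):

```python
import numpy as np

def evolve(fitness, dim, pop_size=50, generations=1000, sigma=0.1, seed=0):
    # Elitist evolutionary search: mutate a population of candidate
    # mask parameter vectors and keep the pop_size fittest of
    # parents plus children in each generation.
    rng = np.random.default_rng(seed)
    pop = rng.normal(size=(pop_size, dim))
    fit = np.array([fitness(ind) for ind in pop])
    for _ in range(generations):
        parents = pop[rng.integers(0, pop_size, size=pop_size)]
        children = parents + sigma * rng.normal(size=parents.shape)
        child_fit = np.array([fitness(ind) for ind in children])
        merged = np.vstack([pop, children])
        merged_fit = np.concatenate([fit, child_fit])
        best = np.argsort(merged_fit)[-pop_size:]
        pop, fit = merged[best], merged_fit[best]
    return pop[np.argmax(fit)], fit.max()

# Smoke test with a stand-in fitness function:
best, f = evolve(lambda v: -np.sum(v ** 2), dim=16, generations=50)
print(round(f, 4))
```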

3.3 PNN-Based Classification

In our design, the input layer of the probabilistic neural network (PNN) used as the classifier is composed of N elements, to process the N-dimensional feature vectors generated by the HRWD (N = NR + NW). The pattern layer consists of M pools of pattern neurons, associated with the M classes of intermodal interference to be recognized. In that layer we used RBF neurons with a Gaussian transfer function as the kernel function; the width of the kernel is then simply the standard deviation σ of the Gaussian bell. Each neuron of the pattern layer is connected with every neuron of the input layer, and the weight vectors of the pattern layer are equal to the feature vectors present in the training set. In contrast to the pattern layer, the summation layer, consisting of M neurons, is organized in such a way that each output neuron is connected only with the neurons of one pattern-layer pool. When using such networks as classifiers there is, formally, a need to multiply the output values by the prior probabilities Pj; however, in our case all priors are equal, and therefore the results can be read directly from the outputs of the network.
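A compact sketch of such a PNN (a generic Parzen-window formulation consistent with the description above, not the paper's exact network; the toy data are invented, and σ = 0.125 follows the experiments reported below under the assumption that feature vectors are normalized):

```python
import numpy as np

class PNN:
    # One Gaussian pattern neuron per training vector, one summation
    # neuron per class; with equal priors the decision is the argmax
    # of the summed kernel activations.
    def __init__(self, sigma=0.125):
        self.sigma = sigma

    def fit(self, X, y):
        self.X, self.y = np.asarray(X), np.asarray(y)
        self.classes = np.unique(self.y)
        return self

    def predict(self, X):
        X = np.asarray(X)
        d2 = ((X[:, None, :] - self.X[None, :, :]) ** 2).sum(-1)
        k = np.exp(-d2 / (2 * self.sigma ** 2))      # pattern layer
        sums = np.stack([k[:, self.y == c].mean(1)   # summation layer
                         for c in self.classes], axis=1)
        return self.classes[np.argmax(sums, axis=1)]

Xtr = np.random.default_rng(1).normal(size=(120, 16))
ytr = np.repeat(np.arange(8), 15)
print(PNN().fit(Xtr, ytr).predict(Xtr[:5]))   # reproduces ytr[:5]
```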


We verified the recognition abilities by classifying images of the speckle structures obtained at the output of an optical fiber. The experiments were conducted on a set of 128 images of speckle patterns generated by intermodal interference occurring in the optical fiber, belonging to eight classes and taken in 16 sessions Sl (l = 1, . . . , 16). The Fraunhofer diffraction patterns of the input images were obtained by calculating the intensity patterns from the discrete Fourier transform equivalent to (17). The training set consisted of 120 images taken in 15 sessions, and the testing set contained the 8 images, each belonging to a different class, of the remaining session Sl. The process of training and testing was performed 16 times, according to the delete-8 jackknife method, i.e., in each iteration another session composed of 8 images was used as the testing set, and all remaining sessions were used as the training set. This gave the basis for reliable cross-validation with a still reasonable number of images used for training and a reasonable computational time. This time was eight times shorter than for the classical leave-one-out method, which, for all discussions in this paper, is equivalent to the delete-1 jackknife method, since the only difference, the resubstitution error of a prediction model, is not addressed here. The jackknife method was used for cross-validation of the PNN results because it gives an unbiased estimate of the true error in probabilistic classification (contrary to the underestimated error, though with smaller variance, obtained by the bootstrap method) [1,39]. Therefore the choice of the delete-8 jackknife method was a trade-off between accuracy (the standard deviation of the estimated normalized decision error was 0.012), an unbiased estimate of the error, and computational effort. The results of such testing of the PNN applied to the classification of images in the feature spaces obtained from the standard HRWD, the optimized HRWD, and the HRWD optimized with the modified indiscernibility relation are presented in Table 4. More detailed results of all jackknife tests are presented in Table 5, Fig. 3, and Fig. 4. The normalized decision errors, ranging from 1.5 to 2 percent, indicate good overall recognition abilities of the system. A 20% reduction of this error is obtained by optimization of the HRWD with the classical indiscernibility relation. A further 6% error reduction is caused solely by the modification of the indiscernibility relation according to (3).

Table 4. Results of testing the classification abilities of the system. The classifier is a PNN with a Gaussian radial function with standard deviation σ = 0.125. In the last column the improvement is computed with respect to the standard HRWD (first value) and with respect to the HRWD optimized with the standard indiscernibility relation (value in parentheses).

                                                     Correct        Normalized          Improvement
                                                     decisions [%]  decision error [%]  [%]
Standard HRWD                                        84.4           1.95                0.0 (−25.0)
HRWD optimized with standard indiscernibility rel.   87.5           1.56                20.0 (0.0)
HRWD optimized with modified indiscernibility rel.   88.3           1.46                25.1 (6.4)
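The improvement column can be read as the relative reduction of the normalized decision error with respect to each reference, as the following check of the table's numbers confirms (a reading of the table, not code from the paper):

```python
# Relative reduction of the normalized decision error w.r.t. a reference.
errors = {"standard": 1.95, "opt_classical": 1.56, "opt_modified": 1.46}
rel = lambda e, ref: 100 * (ref - e) / ref
print(round(rel(errors["opt_classical"], errors["standard"]), 1))      # 20.0
print(round(rel(errors["opt_modified"], errors["standard"]), 1))       # 25.1
print(round(rel(errors["opt_modified"], errors["opt_classical"]), 1))  # 6.4
```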


Fig. 3. Results of testing the HRWD-PNN system. The horizontal axis represents the number of the test, the vertical axis the cumulative number of bad decisions (plotted for the standard HRWD, the optimized HRWD, and the HRWD optimized with the modified indiscernibility relation). Starting from test 9 to the end, the cumulative number of bad decisions is better for the optimization of the HRWD with the modified indiscernibility relation than for the optimization with the classical version of this relation.

Fig. 4. Results of testing the HRWD-PNN system. The horizontal axis represents the number of the test, while the vertical axis is the normalized decision error [%] averaged over the tests from the first to the given one (plotted for the standard HRWD, the optimized HRWD, and the HRWD optimized with the modified indiscernibility relation). Observe that, when averaging over more than 8 tests, the results for recognition with the HRWD optimized with the modified indiscernibility relation outperform both the results for the HRWD optimized with the classical version of the indiscernibility relation and the results for the standard HRWD.

In order to understand the scale of this improvement, which may not look too impressive at first glance, one should refer to Fig. 2 and take into consideration that this additional 6% error reduction is obtained over an already optimized solution.


Table 5. Results of PNN testing for tests number 1 to 16 (number of bad decisions per test session). The entries differing between optimization with the standard and the modified version of the indiscernibility relation (set in bold in the original) occur in tests 6, 7, and 9: tests 7 and 9 show an improvement when the modified relation is used instead of the classical one, test 6 the opposite.

Test session:                         1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
Standard HRWD                         1  2  2  1  2  0  1  0  1  1  4  0  0  1  0  4
Optimized, standard indiscernibility  1  1  3  0  1  0  2  0  2  1  1  0  0  1  1  2
Optimized, modified indiscernibility  1  1  3  0  1  1  1  0  1  1  1  0  0  1  1  2

The level of difficulty can be grasped by observing that, on average, the increase of the objective function is well mimicked by a straight line if the generation-number axis is drawn in a log scale. This means that the growth of the objective is, on average, well approximated by a logarithmic function of the generation number. This experimentally reflects the well-known fact that the better the current solution is, the harder it is to optimize it further (harder meaning that it requires more generations of the evolutionary process).
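The session-wise cross-validation protocol described above can be sketched as follows (a hypothetical helper, not the paper's code; train_and_test stands for any classifier wrapper, such as the PNN sketched earlier):

```python
import numpy as np

def delete8_jackknife(X, y, sessions, train_and_test):
    # Each of the 16 sessions (8 images, one per class) serves once
    # as the test set while the other 15 form the training set.
    # train_and_test(Xtr, ytr, Xte, yte) -> number of bad decisions.
    errors = []
    for s in np.unique(sessions):
        test = sessions == s
        errors.append(train_and_test(X[~test], y[~test], X[test], y[test]))
    return np.array(errors)          # 16 per-session error counts

# Usage with any classifier exposing fit/predict:
# bad = delete8_jackknife(X, y, sessions,
#         lambda Xtr, ytr, Xte, yte:
#             int((PNN().fit(Xtr, ytr).predict(Xte) != yte).sum()))
```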

4 Discussion and Conclusions

The paper presents a modification of the indiscernibility relation used in the theory of rough sets. This theory has been successfully applied to many machine learning and artificial intelligence oriented problems. However, it is a well-known limitation of this theory that it processes continuous attributes in an unnatural way. To support more natural processing, a modification of the indiscernibility relation has been proposed (3), such that the indiscernibility relation remains an equivalence relation, but the processing of continuous attributes becomes more natural. This modification introduces information about structure into the collection of attributes on which the relation depends, a collection which is unstructured in the classical version. It has been shown that the classical relation is a special case of the modified version; the proposed modification can therefore be recognized as more general (yet not as general as indiscernibility relations which are no longer equivalence relations). Remarkably, the proposed generalization is equally valid for the classical theory of rough sets and for the variable precision model, predominantly used in machine learning applied to huge data sets.

The modification of the indiscernibility relation proposed in this paper introduces flexibility in defining the particular special case which is most natural for a given application. In the case of real-valued attributes, our modification allows for performing multidimensional cluster analysis, as opposed to the multiple one-dimensional analyses required by the classical form. In the majority of cases, the cluster analysis should be performed in the space generated by all attributes. This corresponds to a family 𝒞 composed of one set (card(𝒞) = 1) containing all conditional attributes, and it is the opposite extreme to the classical relation, which assumes that the family 𝒞 is composed of one-element disjoint sets and therefore satisfies card(𝒞) = card(C). However, other, less extreme cases are allowed as well, and in the experimental study we use a family 𝒞 = {CR, CW} composed of two sets containing 8 elements each. Such a structure seems natural for an application with a two-way architecture, like the HRWD-based feature extractor.

The presented modification has been applied in the optimization procedure of the hybrid opto-electronic pattern recognition system composed of an HRWD and a PNN. It allowed the recognition abilities to be improved by reducing the normalized decision error by 6.5%, if the system optimized with the classical indiscernibility relation is treated as the reference. One should notice that this improvement is achieved with respect to a reference which is already an optimized solution, which makes any further improvement difficult. The obtained results experimentally confirm our claims concerning the sub-optimality of the earlier solutions. The presented experiment is an illustration of the application of the proposed methodology to a hybrid pattern recognizer. However, we think that the presented modification of the indiscernibility relation will find many more applications in rough set based machine learning, since it gives a natural way of processing real-valued attributes within a rough set based formalism.

Certainly, there are also limitations. Because some notions known in rough set theory lose their meaning when the modified relation is applied, the proposed modification can hardly be applied in any form other than the classical special case if, for any reason, such notions are supposed to play a relevant role in a problem. One prominent example concerns the so-called basic sets in the universe U, defined by the indiscernibility relation computed with respect to single attributes, as opposed to the modified relation, predominantly designed to deal with sets of attributes defining a vector space used for joint cluster analysis. The modification is especially useful in the case of information systems with real-valued conditional attributes representing a vector space ℝ^N, such as systems of non-syntactic pattern recognition. The experimental example belongs to this class of problems and illustrates the potential of the modified indiscernibility relation for processing real-valued data in a rough set based theory.

References

1. Azuaje, F.: Genomic data sampling and its effect on classification performance assessment. BMC Bioinformatics 4(1), 5–16 (2003)
2. Berfanger, D.M., George, N.: All-digital ring-wedge detector applied to fingerprint recognition. App. Opt. 38(2), 357–369 (1999)
3. Berfanger, D.M., George, N.: All-digital ring wedge detector applied to image quality assessment. App. Opt. 39(23), 4080–4097 (2000)
4. Casasent, D., Song, J.: A computer generated hologram for diffraction-pattern sampling. Proc. SPIE 523, 227–236 (1985)


5. Cyran, K.A.: Integration of classifiers working in discrete and real valued feature space applied in two-way opto-electronic image recognition system. In: Proc. of IASTED Conference: Visualization, Imaging & Image Processing, Benidorm, Spain (accepted, 2005)
6. Cyran, K.A.: PLD-based rough classifier of Fraunhofer diffraction pattern. In: Proc. Int. Conf. Comp. Comm. Contr. Tech., Orlando, pp. 163–168 (2003)
7. Cyran, K.A., Jaroszewicz, L.R.: Concurrent signal processing in optimized hybrid CGH-ANN system. Opt. Appl. 31, 681–689 (2001)
8. Cyran, K.A., Jaroszewicz, L.R.: Rough set based classification of interferometric images. In: Jacquot, P., Fournier, J.M. (eds.) Interferometry in Speckle Light. Theory and Applications, pp. 413–420. Springer, Heidelberg (2000)
9. Cyran, K.A., Jaroszewicz, L.R., Niedziela, T.: Neural network based automatic diffraction pattern recognition. Opto-elect. Rev. 9, 301–307 (2001)
10. Cyran, K.A., Mrozek, A.: Rough sets in hybrid methods for pattern recognition. Int. J. Intell. Sys. 16, 149–168 (2001)
11. Cyran, K.A., Niedziela, T., Jaroszewicz, L.R.: Grating-based DOVDs in high-speed semantic pattern recognition. Holography 12(2), 10–12 (2001)
12. Cyran, K.A., Niedziela, T., Jaroszewicz, J.R., Podeszwa, T.: Neural classifiers in diffraction image processing. In: Proc. Int. Conf. Comp. Vision Graph., Zakopane, Poland, pp. 223–228 (2002)
13. Cyran, K.A., Stanczyk, U.: Indiscernibility relation for continuous attributes: Application in image recognition. In: Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A. (eds.) RSEISP 2007. LNCS, vol. 4585, pp. 726–735. Springer, Heidelberg (2007)
14. Cyran, K.A., Stanczyk, U., Jaroszewicz, L.R.: Subsurface stress monitoring system based on holographic ring-wedge detector and neural network. In: McNulty, G.J. (ed.) Quality, Reliability and Maintenance, pp. 65–68. Professional Engineering Publishing, Bury St Edmunds, London (2002)
15. Doherty, P., Szalas, A.: On the correspondence between approximations and similarity. In: Tsumoto, S., Slowinski, R., Komorowski, J., Grzymala-Busse, J.W. (eds.) RSCTC 2004. LNCS, vol. 3066, pp. 143–152. Springer, Heidelberg (2004)
16. Fares, A., Bouzid, A., Hamdi, M.: Rotation invariance using diffraction pattern sampling in optical pattern recognition. J. of Microwaves and Optoelect. 2(2), 33–39 (2000)
17. Ganotra, D., Joseph, J., Singh, K.: Modified geometry of ring-wedge detector for sampling Fourier transform of fingerprints for classification using neural networks. Proc. SPIE 4829, 407–408 (2003)
18. Ganotra, D., Joseph, J., Singh, K.: Neural network based face recognition by using diffraction pattern sampling with a digital ring-wedge detector. Opt. Comm. 202, 61–68 (2002)
19. George, N., Wang, S.: Neural networks applied to diffraction-pattern sampling. Appl. Opt. 33, 3127–3134 (1994)
20. Gomolinska, A.: A comparative study of some generalized rough approximations. Fundamenta Informaticae 51(1), 103–119 (2002)
21. Grzymala-Busse, J.W.: Data with missing attribute values: Generalization of indiscernibility relation and rule induction. In: Peters, J.F., Skowron, A., Grzymala-Busse, J.W., Kostek, B., Swiniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 78–95. Springer, Heidelberg (2004)


22. Grzymala-Busse, J.W.: Rough set strategies to data with missing attribute values. In: Proceedings of the Workshop on Foundations and New Directions in Data Mining, associated with the third IEEE International Conference on Data Mining, Melbourne, FL, USA, November 19–22, 2003, pp. 56–63 (2003)
23. Jaroszewicz, L.R., Cyran, K.A., Podeszwa, T.: Optimized CGH-based pattern recognizer. Opt. Appl. 30, 317–333 (2000)
24. Jaroszewicz, L.R., Merta, I., Podeszwa, T., Cyran, K.A.: Airplane engine condition monitoring system based on artificial neural network. In: McNulty, G.J. (ed.) Quality, Reliability and Maintenance, pp. 179–182. Professional Engineering Publishing, Bury St Edmunds, London (2002)
25. Jarvinen, J.: Approximations and rough sets based on tolerances. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 182–189. Springer, Heidelberg (2001)
26. Kaye, P.H., Barton, J.E., Hirst, E., Clark, J.M.: Simultaneous light scattering and intrinsic fluorescence measurement for the classification of airborne particles. App. Opt. 39(21), 3738–3745 (2000)
27. Kreis, T.: Holographic interferometry: Principles and methods. Akademie Verlag Series in Optical Metrology, vol. 1. Akademie-Verlag (1996)
28. Mait, J.N., Athale, R., van der Gracht, J.: Evolutionary paths in imaging and recent trends. Optics Express 11(18), 2093–2101 (2003)
29. Mrozek, A.: A new method for discovering rules from examples in expert systems. Man-Machine Studies 36, 127–143 (1992)
30. Mrozek, A.: Rough sets in computer implementation of rule-based control of industrial processes. In: Slowinski, R. (ed.) Intelligent decision support. Handbook of applications and advances of the rough sets, pp. 19–31. Kluwer Academic Publishers, Dordrecht (1992)
31. Nebeker, B.M., Hirleman, E.D.: Light scattering by particles and defects on surfaces: semiconductor wafer inspection. Lecture Notes in Physics, vol. 534, pp. 237–257 (2000)
32. Pawlak, Z.: Rough sets: theoretical aspects of reasoning about data. Kluwer Academic, Dordrecht (1991)
33. Piekara, A.H.: New aspects of optics – introduction to quantum electronics and in particular to nonlinear optics and optics of coherent light [in Polish]. PWN, Warsaw (1976)
34. Podeszwa, T., Jaroszewicz, L.R., Cyran, K.A.: Fiberscope based engine condition monitoring system. Proc. SPIE 5124, 299–303 (2003)
35. Skowron, A., Grzymala-Busse, J.W.: From rough set theory to evidence theory. In: Yager, R.R., Fedrizzi, M., Kacprzyk, J. (eds.) Advances in the Dempster-Shafer theory of evidence, pp. 193–236. Wiley & Sons, NY (1994)
36. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundamenta Informaticae 27, 245–253 (1996)
37. Slowinski, R., Vanderpooten, D.: A generalized definition of rough approximations based on similarity. IEEE Transactions on Data and Knowledge Engineering 12(2), 331–336 (2000)
38. Slowinski, R., Vanderpooten, D.: Similarity relation as a basis for rough approximations. In: Wang, P.P. (ed.) Advances in machine intelligence and soft computing, pp. 17–33. Bookwrights, Raleigh (1997)
39. Twomey, J.M., Smith, A.E.: Bias and variance of validation methods for function approximation neural networks under conditions of sparse data. IEEE Trans. Sys., Man, and Cyber. 28(3), 417–430 (1998)
40. Ziarko, W.: Variable precision rough set model. J. Comp. Sys. Sci. 46, 39–59 (1993)

On Certain Rough Inclusion Functions

Anna Gomolińska

Bialystok University, Department of Mathematics, Akademicka 2, 15267 Bialystok, Poland
[email protected]

Abstract. In this article we further explore the idea which led to the standard rough inclusion function. As a result, two more rough inclusion functions (RIFs, in short) are obtained, different from the standard one and from each other. With every RIF we associate a mapping which is in some sense complementary to it. Next, these complementary mappings (co-RIFs) are used to define certain metrics. As it turns out, one of these distance functions is an instance of the Marczewski–Steinhaus metric. While the distance functions may directly be used to measure the degree of dissimilarity of sets of objects, their complementary mappings – also discussed here – are useful in measuring the degree of mutual similarity of sets.

Keywords: rough inclusion function, rough mereology, distance and similarity between sets.

1 Introduction

Broadly speaking, rough inclusion functions (RIFs) are mappings with which one can measure the degree of inclusion of a set in a set. The formal notion of rough inclusion was worked out within rough mereology, a theory proposed by Polkowski and Skowron [2,3,4]. Rough mereology extends Leśniewski's mereology [5,6], a formal theory of being-part, to the case of being-part-to-degree. The standard RIF is certainly the most famous RIF. Its definition, based on frequency counts, is closely related to the definition of conditional probability. The idea underlying the standard RIF was explored by Łukasiewicz, in his research on the probability of the truth of logical expressions (in particular, implicative formulas), about one hundred years ago [7,8].

This article is an extended version of the paper presented at the International Conference on Rough Sets and Emerging Intelligent Systems Paradigms In Memoriam Zdzislaw Pawlak (RSEISP 2007). In comparison with [1], we study the properties of the mappings complementary to the title rough inclusion functions more intensively. In addition, certain distance functions and their complementary mappings are introduced and investigated. Many thanks to the anonymous referees whose comments helped improve the paper. The research was partly supported by grant N N516 368334 from the Ministry of Science and Higher Education of the Republic of Poland and by an Innovative Economy Operational Programme 2007–2013 (Priority Axis 1. Research and development of new technologies) grant managed by the Ministry of Regional Development of the Republic of Poland.


Apart from the standard RIF, only several functions of this sort are described in the literature (see, e.g., [4,9,10]). Although the notion of a RIF is dispensable when approximating sets of objects in line with the classical Pawlak approach [11,12,13,14,15], it is of particular importance for more general rough set models. Namely, the concept of a RIF is a basic component of Skowron and Stepaniuk's approximation spaces [16,17,18,19,20,21,22], where lower and upper rough approximations of sets of objects are defined by means of RIFs. In the variable-precision rough set model with its extensions [23,24,25,26] and in the decision-theoretic rough set model and its extensions [27,28,29], the standard RIF is taken as an estimator of certain conditional probabilities which, in turn, are used to define variable-precision positive and negative regions of sets of objects. Moreover, starting with a RIF, one can derive a family of rough membership functions, as was already observed by Pawlak and Skowron in [30]. Various functions measuring the degree of similarity between sets can also be defined by means of RIFs (see, e.g., [4,10,31,32] for the rough set approach). Last but not least, a method of knowledge reduction is proposed in [33] which is based, among other things, on the degree of rough inclusion.

In this paper we explore further the idea which led to the standard RIF. The aim is to discover other RIFs which have a similar origin to the standard one. Our investigations are motivated, among other things, by the fact that in spite of the well-groundedness, usefulness, and popularity of the standard RIF, some of its properties may seem too strong (e.g., Proposition 2a,b). In addition, it would be good to have alternative RIFs at our disposal. As a result, we have obtained two more RIFs. One of them is new, at least to the author's knowledge, whereas the remaining one was mentioned in [9]. We investigate the properties of the three RIFs with emphasis on the mutual relationships among them. As regards the standard RIF, some of its properties have already been known, but a few of them are new. Unlike the standard RIF, the new RIFs do not, at first glance, seem very useful for estimating conditional probability. It turns out, however, that they are different from, yet definable in terms of, the standard RIF. On the other hand, the latter RIF can be derived from the former two.

In the sequel, we introduce mappings complementary to our RIFs, called co-RIFs, and we present their properties. The co-RIFs give rise to certain distance functions which turn out to be metrics on the power set of the universe of objects. The distance functions may directly be used to measure the degree of dissimilarity between sets. It is interesting that one of these metrics is an instance of the Marczewski–Steinhaus metric [34]. Finally, we arrive at mappings complementary to the distance functions. They may, in turn, serve as indices of similarity between sets. It is worth noting that these similarity indices are known from the literature [35,36,37,38].

The rest of the paper is organized as follows. Section 2 is fully devoted to the standard RIF. In Section 3 we recall the axioms of rough mereology, a formal theory of being-part-to-degree introduced by Polkowski and Skowron in [2], which provides us with the fundamentals of a formal notion of rough inclusion.


We also explain what we actually mean by a RIF. In the same section we argue that RIFs indeed realize the formal concept of rough inclusion proposed by Polkowski and Skowron. Some authors [39,40] (see also [28]) require rough inclusion measures to fulfil conditions somewhat different from ours. Let us emphasize that our standpoint is that of rough mereology, whose axioms provide us with a list of postulates to be satisfied by functions measuring the degree of inclusion. In Section 4, two alternatives to the standard RIF are derived. In Section 5 we consider the co-RIFs corresponding to our three RIFs and investigate their properties. Certain distance functions and their complementary mappings, induced by the co-RIFs, are discussed in Section 6. The last section summarizes the results.

2 The Standard Rough Inclusion Function

The idea underlying the notion of the standard rough inclusion function was explored by Jan Łukasiewicz, a famous Polish logician who, among other things, conducted research on the probability of the truth of propositional formulas [7,8]. The standard RIF is the most popular among functions measuring the degree of inclusion of a set in a set. Let us recall that both the decision-theoretic rough set model [27,29] and the variable-precision rough set model [23,24] make use of the standard RIF. It is also commonly used to estimate the confidence (or accuracy) of decision rules and association rules [10,41,42,43]. Last but not least, the standard RIF counts among the functions with which one can measure similarity between clusterings [44,45].

Consider a structure M with a non-empty universe U and a propositional language L interpretable over M. For any formula α and u ∈ U, u ⊨ α reads as 'α is satisfied by u' or 'u satisfies α'. The extension of α is defined as the set ||α|| = {u ∈ U | u ⊨ α}. α is satisfiable in M if its extension is non-empty, and unsatisfiable otherwise. Moreover, α is called true in M, written ⊨ α, if ||α|| = U. Finally, α entails a formula β, written α ⊨ β, if and only if every object satisfying α satisfies β as well, i.e., ||α|| ⊆ ||β||.

In classical logic, an implicative formula α → β is true in M if and only if α entails β. Clearly, many interesting formulas are not true in this sense. Since implicative formulas with unsatisfiable predecessors are true, we limit our considerations to satisfiable α. Then one can assess the degree of truth of α → β by calculating the probability that an object satisfying α satisfies β as well. Where U is finite, this probability may be estimated by the fraction of objects of ||α|| which also satisfy β. That is, the degree of truth of α → β may be defined as #(||α|| ∩ ||β||)/#||α||, where #||α|| denotes the cardinality of ||α||. By a straightforward generalization, we arrive at the well-known notion of the standard RIF, commonly used in rough set theory. It owes its popularity to the clarity of the underlying idea and to the ease of computation. Since conditional probability may be estimated by the standard RIF, the latter has also been used successfully in the decision-theoretic rough set model [27,29] (see also [28]) and in the variable-precision rough set model and its extensions [23,24,25].


model [27, 29] (see also [28]) and the variable-precision rough set model and its extensions [23, 24, 25].

Given a non-empty finite set of objects U and its power set ℘U, the standard RIF upon U is a mapping κ£ : ℘U × ℘U → [0, 1] such that for any X, Y ⊆ U,

κ£(X, Y) = #(X ∩ Y)/#X if X ≠ ∅, and κ£(X, Y) = 1 otherwise.   (1)

To assess the degree of inclusion of a set of objects X in a set of objects Y by means of κ£, one needs to measure the relative overlap of X with Y. The larger the overlap of two sets, the higher the degree of inclusion, viz., for any X, Y, Z ⊆ U,

#(X ∩ Y) ≤ #(X ∩ Z) ⇒ κ£(X, Y) ≤ κ£(X, Z).

The success of the standard RIF also lies in its mathematical properties. Where 𝒳 is a family of sets, we write Pair(𝒳) to say that the elements of 𝒳 are pairwise disjoint, i.e., ∀X, Y ∈ 𝒳.(X ≠ Y ⇒ X ∩ Y = ∅). It is assumed that conjunction and disjunction take precedence over implication and double implication.

Proposition 1. For any sets X, Y, Z ⊆ U and any families of sets ∅ ≠ 𝒳, 𝒴 ⊆ ℘U, it holds that:
(a) κ£(X, Y) = 1 ⇔ X ⊆ Y,
(b) Y ⊆ Z ⇒ κ£(X, Y) ≤ κ£(X, Z),
(c) Z ⊆ Y ⊆ X ⇒ κ£(X, Z) ≤ κ£(Y, Z),
(d) κ£(X, ∪𝒴) ≤ Σ_{Y∈𝒴} κ£(X, Y),
(e) X ≠ ∅ & Pair(𝒴) ⇒ κ£(X, ∪𝒴) = Σ_{Y∈𝒴} κ£(X, Y),
(f) κ£(∪𝒳, Y) ≤ Σ_{X∈𝒳} κ£(X, Y) · κ£(∪𝒳, X),
(g) Pair(𝒳) ⇒ κ£(∪𝒳, Y) = Σ_{X∈𝒳} κ£(X, Y) · κ£(∪𝒳, X).

Proof. We prove (f) only. Consider any Y ⊆ U and any non-empty family 𝒳 ⊆ ℘U. First suppose that ∪𝒳 = ∅, i.e., 𝒳 = {∅}. The property obviously holds since κ£(∪𝒳, Y) = 1 and κ£(∅, Y) · κ£(∪𝒳, ∅) = 1 · 1 = 1. Now let ∪𝒳 be non-empty. In such a case, κ£(∪𝒳, Y) = #(∪𝒳 ∩ Y)/#∪𝒳 = #∪{X ∩ Y | X ∈ 𝒳}/#∪𝒳 ≤ Σ{#(X ∩ Y) | X ∈ 𝒳}/#∪𝒳 = Σ{#(X ∩ Y)/#∪𝒳 | X ∈ 𝒳}. Observe that if some element X of 𝒳 is empty, then #(X ∩ Y)/#∪𝒳 = 0. On the other hand, κ£(X, Y) · κ£(∪𝒳, X) = 1 · (#X/#∪𝒳) = 1 · 0 = 0 as well. For every non-empty element X of 𝒳, we have #(X ∩ Y)/#∪𝒳 = (#(X ∩ Y)/#X) · (#X/#∪𝒳) = κ£(X, Y) · κ£(∪𝒳, X) as required. Summing up, κ£(∪𝒳, Y) ≤ Σ_{X∈𝒳} κ£(X, Y) · κ£(∪𝒳, X).


Some comments can be useful here. (a) says that the standard RIF yields 1 if and only if the first argument is included in the second one. Property (b) expresses monotonicity of κ£ in the second variable, whereas (c) states a weak form of co-monotonicity of the standard RIF in the first variable. It follows from (d) that for any covering of a set of objects, say Z, the sum of the degrees of inclusion of a set X in the sets constituting the covering is at least as high as the degree of inclusion of X in Z. The non-strict inequality in (d) may be strengthened to equality for non-empty X and coverings consisting of pairwise disjoint sets, as stated by (e). Due to (f), for any covering of a set of objects, say Z, the degree of inclusion of Z in a set Y is not higher than a weighted sum of the degrees of inclusion of the sets constituting the covering in Y, where the weights are the degrees of inclusion of Z in the members of the covering of Z. In virtue of (g), the inequality may be strengthened to equality if the elements of the covering are pairwise disjoint. Let us observe that (g) is in some sense a counterpart of the total probability theorem. The following conclusions can be drawn from the facts above.

Proposition 2. For any X, Y, Z, W ⊆ U (X ≠ ∅) and a family 𝒴 of pairwise disjoint sets of objects such that ∪𝒴 = U, we have:
(a) Σ_{Y∈𝒴} κ£(X, Y) = 1,
(b) κ£(X, Y) = 0 ⇔ X ∩ Y = ∅,
(c) κ£(X, ∅) = 0,
(d) X ∩ Y = ∅ ⇒ κ£(X, Z − Y) = κ£(X, Z ∪ Y) = κ£(X, Z),
(e) Z ∩ W = ∅ ⇒ κ£(Y ∪ Z, W) ≤ κ£(Y, W) ≤ κ£(Y − Z, W),
(f) Z ⊆ W ⇒ κ£(Y − Z, W) ≤ κ£(Y, W) ≤ κ£(Y ∪ Z, W).

Proof. We show (d) only. To this end, consider any sets of objects X, Y where X ≠ ∅ and X ∩ Y = ∅. Immediately (d1) κ£(X, Y) = 0 by (b). Hence, for any Z ⊆ U, κ£(X, Z) = κ£(X, (Z ∩ Y) ∪ (Z − Y)) = κ£(X, Z ∩ Y) + κ£(X, Z − Y) ≤ κ£(X, Y) + κ£(X, Z − Y) = κ£(X, Z − Y) in virtue of Proposition 1b,e. In the sequel, κ£(X, Z ∪ Y) ≤ κ£(X, Z) + κ£(X, Y) = κ£(X, Z) due to (d1) and Proposition 1d. The remaining inequalities are consequences of Proposition 1b.

Let us make a few remarks. (a) states that the degrees of inclusion of a non-empty set of objects X in pairwise disjoint sets will sum up to 1 when these sets, taken together, cover the universe. In virtue of (b), the degree of inclusion of a non-empty set in an arbitrary set of objects equals 0 just in case the two sets are disjoint. (b) obviously implies (c). The latter property says that the degree of inclusion of a non-empty set in ∅ is equal to 0. Thanks to (d), removing (resp., adding) objects not belonging to a non-empty set X from (to) a set Z does not influence the degree of inclusion of X in Z. As follows from (e), adding (resp., removing) objects not belonging to a set W to (from) a set Y does not increase (decrease) the degree of inclusion of Y in W. Finally, removing (resp., adding) members of a set of objects W from (to) a set Y does not increase (decrease) the degree of inclusion of Y in W due to (f).


Example 1. Let U = {0, . . . , 9}, X = {0, . . . , 3}, Y = {0, . . . , 3, 8}, and Z = {2, . . . , 6}. Note that X ∩ Z = Y ∩ Z = {2, 3}. Thus, κ£(X, Z) = 1/2 and κ£(Z, X) = 2/5, which means that the standard RIF is not symmetric. Moreover, κ£(Y, Z) = 2/5 < 1/2. Thus, X ⊆ Y may not imply κ£(X, Z) ≤ κ£(Y, Z), i.e., κ£ is not monotone in the first variable.
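For readers who wish to experiment, Definition (1) translates directly into a few lines of code. The following Python sketch (the function name kappa_std and the surrounding script are illustrative only, not part of any rough-set library) reproduces the values of Example 1:

def kappa_std(X, Y):
    # standard rough inclusion function, Definition (1)
    X, Y = set(X), set(Y)
    return len(X & Y) / len(X) if X else 1.0

U = set(range(10))
X = {0, 1, 2, 3}
Y = {0, 1, 2, 3, 8}
Z = {2, 3, 4, 5, 6}
print(kappa_std(X, Z))  # 0.5 -- kappa(X, Z) = 1/2
print(kappa_std(Z, X))  # 0.4 -- kappa(Z, X) = 2/5: the standard RIF is not symmetric
print(kappa_std(Y, Z))  # 0.4 -- X is a subset of Y, yet kappa(Y, Z) < kappa(X, Z)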

3 Rough Mereology: A Formal Framework for Rough Inclusion

The notion of the standard RIF was generalized and formalized by Polkowski and Skowron within rough mereology, a theory of the notion of being-part-to-degree [2, 3, 4]. The starting point is a pair of formal theories introduced by Leśniewski [5, 6], viz., mereology and ontology, where the former theory extends the latter one. Mereology is a theory of the notion of being-part, whereas ontology is a theory of names and plays the role of set theory. Leśniewski's mereology is also known as a theory of collective sets, as opposed to ontology, which is a theory of distributive sets. In this section we only recall a very small part of rough mereology, pivotal for the notion of rough inclusion. We somewhat change the original notation (e.g., 'el' to 'ing', 'μt' to 'ingt'), yet try to keep with the underlying ideas.

In ontology, built upon the classical predicate logic with identity, two basic semantical categories are distinguished: the category of non-empty names¹ and the category of propositions. We use x, y, z, with subscripts if needed, as name variables and we denote the set of all such variables by Var. The only primitive notion of ontology is the copula 'is', denoted by ε and characterized by the axiom

(L0) xεy ↔ (∃z.zεx ∧ ∀z, z′.(zεx ∧ z′εx → zεz′) ∧ ∀z.(zεx → zεy))   (2)

where 'xεy' is read as 'x is y'. The first two conjuncts on the right-hand side say that x ranges over non-empty, individual names only. The third conjunct says that each of x's is y as well. In particular, the intended meaning of 'xεx' is simply that x ranges over individual names.

¹ Empty names are denied by Leśniewski on philosophical grounds.

Mereology is built upon ontology and introduces a name-forming functor pt, where 'xεpt(y)' reads as 'x is a part of y'. The functor pt is described by the following axioms:

(L1) xεpt(y) → xεx ∧ yεy,
(L2) xεpt(y) ∧ yεpt(z) → xεpt(z),
(L3) ¬(xεpt(x)).

(L1) stipulates that both x and y range over individual names. According to (L2) and (L3), being-part is transitive and irreflexive, respectively. The reflexive counterpart of pt is the notion of being-ingredient, ing, given by

xεing(y) ↔def xεpt(y) ∨ x = y.   (3)


One can see that

(L1′) xεing(y) → xεx ∧ yεy,
(L2′) xεing(y) ∧ yεing(z) → xεing(z),
(L3′) xεing(x),
(L4′) xεing(y) ∧ yεing(x) → x = y.

Axioms (L1′), (L2′) are counterparts of (L1), (L2), respectively. (L3′), (L4′) postulate reflexivity and antisymmetry of ing, respectively. It is worth noting that one can start with ing characterized by (L1′)–(L4′) and define pt by

xεpt(y) ↔def xεing(y) ∧ x ≠ y.   (4)

Polkowski and Skowron's rough mereology extends Leśniewski's mereology by a family of name-forming functors ingt. These functors, constituting a formal counterpart of the notion of being-ingredient-to-degree, are described by the following axioms, for any name variables x, y, z and s, t ∈ [0, 1]:

(PS1) ∃t.xεingt(y) → xεx ∧ yεy,
(PS2) xεing1(y) ↔ xεing(y),
(PS3) xεing1(y) → ∀z.(zεingt(x) → zεingt(y)),
(PS4) x = y ∧ xεingt(z) → yεingt(z),
(PS5) xεingt(y) ∧ s ≤ t → xεings(y).

The expression 'xεingt(y)' reads as 'x is an ingredient of y to degree t'. The axiom (PS1) claims x, y to range over individual names. According to (PS2), being an ingredient to degree 1 is equivalent to being an ingredient. (PS3) states a weak form of transitivity of the graded ingredienthood. (PS4) says that '=' is congruential with respect to being-ingredient-to-degree. As postulated by (PS5), ingt is, in fact, a formalization of the notion of being an ingredient to degree at least t. Furthermore, being-part-to-degree may be defined as a special case of the graded ingredienthood, viz.,

xεptt(y) ↔def xεingt(y) ∧ x ≠ y.   (5)

The axioms (PS1)–(PS5) are minimal conditions to be fulfilled by the formal concept of graded ingredienthood². According to the standard interpretation, being an ingredient (part) is understood as being included (included in the proper sense). In the same vein, the graded ingredienthood may be interpreted as a graded inclusion, called rough inclusion in line with Polkowski and Skowron.

² For instance, nothing has been said about the property of being external yet. For this and other concepts of rough mereology see, e.g., [4].

Now we describe a model for the part of rough mereology presented above, simplifying the picture as much as possible. Consider a non-empty set of objects U and a structure M = (℘U, ⊆, κ), where the set of all subsets of U, ℘U, serves as the universe of M, ⊆ is the usual inclusion relation on ℘U, and κ is a mapping κ : ℘U × ℘U → [0, 1] satisfying the conditions rif₁, rif₂ below:

rif₁(κ) ⇔def ∀X, Y ⊆ U.(κ(X, Y) = 1 ⇔ X ⊆ Y),
rif₂(κ) ⇔def ∀X, Y, Z ⊆ U.(Y ⊆ Z ⇒ κ(X, Y) ≤ κ(X, Z)).

According to rif₁, κ is a generalization of ⊆. Moreover, κ achieves the greatest value (equal to 1) only for such pairs of sets that the second element of a pair contains the first element. The condition rif₂ postulates κ to be monotone in the second variable. We call any mapping κ as above a rough inclusion function (RIF) over U. For simplicity, the reference to U will be dropped if no confusion results. Observe that, having assumed rif₁, the second condition is equivalent to rif₂* given by

rif₂*(κ) ⇔def ∀X, Y, Z ⊆ U.(κ(Y, Z) = 1 ⇒ κ(X, Y) ≤ κ(X, Z)).
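Since U is finite, conditions such as rif₁ and rif₂ are decidable by exhaustive search over ℘U. A minimal Python sketch of such a check is given below (the helper names subsets, rif1, and rif2 are ours, introduced only for illustration); it confirms both conditions for the standard RIF on a five-element universe by brute force:

from itertools import chain, combinations

def subsets(U):
    # all subsets of U, as frozensets
    U = list(U)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(U, r) for r in range(len(U) + 1))]

def kappa_std(X, Y):
    return len(X & Y) / len(X) if X else 1.0

def rif1(kappa, U):
    S = subsets(U)
    return all((kappa(X, Y) == 1) == (X <= Y) for X in S for Y in S)

def rif2(kappa, U):
    S = subsets(U)
    return all(kappa(X, Y) <= kappa(X, Z)
               for X in S for Y in S for Z in S if Y <= Z)

U = frozenset(range(5))   # keep U tiny: the check is cubic in 2^|U|
print(rif1(kappa_std, U), rif2(kappa_std, U))   # True True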

Subsets of U are viewed as concepts, and RIFs are intended as functions measuring the degrees of inclusion of concepts in concepts. It is worth noting that any RIF over U is a fuzzy set on ℘U × ℘U or, in other words, a fuzzy binary relation on ℘U (see [46] and the more recent, ample literature on fuzzy set theory). Clearly, RIFs may satisfy various additional postulates as well. Examples of such postulates are:

rif₃(κ) ⇔def ∀∅ ≠ X ⊆ U. κ(X, ∅) = 0,
rif₄(κ) ⇔def ∀X, Y ⊆ U.(κ(X, Y) = 0 ⇒ X ∩ Y = ∅),
rif₄⁻¹(κ) ⇔def ∀∅ ≠ X ⊆ U. ∀Y ⊆ U.(X ∩ Y = ∅ ⇒ κ(X, Y) = 0),
rif₅(κ) ⇔def ∀∅ ≠ X ⊆ U. ∀Y ⊆ U.(κ(X, Y) = 0 ⇔ X ∩ Y = ∅),
rif₆(κ) ⇔def ∀∅ ≠ X ⊆ U. ∀Y ⊆ U. κ(X, Y) + κ(X, U − Y) = 1,
rif₇(κ) ⇔def ∀X, Y, Z ⊆ U.(Z ⊆ Y ⊆ X ⇒ κ(X, Z) ≤ κ(Y, Z)).

As follows from Propositions 1 and 2, the standard RIF satisfies all the conditions above. Moreover, for any RIF κ, rif₁(κ) and rif₆(κ) imply rif₅(κ); rif₅(κ) is equivalent to the conjunction of rif₄(κ) and rif₄⁻¹(κ); and rif₄⁻¹(κ) implies rif₃(κ). It is worth mentioning that some authors stipulate functions measuring the degree of inclusion to satisfy rif₂, rif₇, and the 'if' part of rif₁ [39, 40].

Names and name-forming functors are interpreted in M by means of a mapping I as follows. Every name is interpreted as a non-empty set of concepts, i.e., subsets of U, and individual names are interpreted as singletons. For any singleton 𝒴 = {X} where X ⊆ U, let

e(𝒴) =def X.   (6)

The identity symbol is interpreted as the identity relation on ℘U (the same symbol '=' is used in both cases for simplicity). The copula ε is interpreted as a binary relation εI ⊆ ℘(℘U) × ℘(℘U) such that for any 𝒳, 𝒴 ⊆ ℘U,

𝒳 εI 𝒴 ⇔def #𝒳 = 1 & 𝒳 ⊆ 𝒴.   (7)

Observe that 𝒳 ⊆ 𝒴 above may equivalently be written as e(𝒳) ∈ 𝒴. In the sequel, the name-forming functors ing, pt, ingt, and ptt (t ∈ [0, 1]) are interpreted as mappings ingI, ptI, ingt,I, ptt,I : ℘U → ℘(℘U) such that for any X ⊆ U,

ingI(X) =def ℘X,
ptI(X) =def ℘X − {X},
ingt,I(X) =def {Y ⊆ U | κ(Y, X) ≥ t},
ptt,I(X) =def {Y ⊆ U | κ(Y, X) ≥ t & Y ≠ X};   (8)

thus, e.g., ing is interpreted as the power-set operator. The pair MI = (M, I) is an interpretation of the language of the part of rough mereology considered here.

In the next step, we assign non-empty sets of concepts to name variables. Given an interpretation MI, any such variable assignment v : Var → ℘(℘U) may be extended to a term assignment vI as follows. For any x ∈ Var, t ∈ [0, 1], and f ∈ {ing, pt, ingt, ptt},

vI(x) =def v(x),
vI(f(x)) =def fI(e(v(x))) if #v(x) = 1, and vI(f(x)) is undefined otherwise.   (9)

Finally, we can define satisfiability of formulas by variable assignments in MI. For any formula α and any variable assignment v, 'MI, v |= α' reads as 'α is satisfied by v in MI'. Along the standard lines, α is true in MI, written MI |= α, if α is satisfied by every variable assignment in MI. The relation of satisfiability of formulas is defined as follows, for any formulas α, β, any name variables x, y, any degree variable t, and f ∈ {ing, pt, ingt, ptt}:

MI, v |= x = y ⇔def vI(x) = vI(y),
MI, v |= xεy ⇔def vI(x) εI vI(y),
MI, v |= xεf(y) ⇔def vI(x) εI vI(f(y)),
MI, v |= α ∧ β ⇔def MI, v |= α & MI, v |= β,
MI, v |= ¬α ⇔def MI, v ⊭ α,
MI, v |= ∀x.α ⇔def MI, w |= α for any w different from v at most for x,
MI, v |= ∀t.α ⇔def for every t ∈ [0, 1], MI, v |= α.   (10)

The remaining cases can easily be obtained from those above. Let us observe that the first three conditions may be simplified to the following ones:

MI, v |= x = y ⇔ v(x) = v(y),
MI, v |= xεy ⇔ (#v(x) = 1 & v(x) ⊆ v(y)) ⇔ ∃X ⊆ U.(v(x) = {X} & X ∈ v(y)),
MI, v |= xεing(y) ⇔ (#v(x) = #v(y) = 1 & e(v(x)) ⊆ e(v(y))) ⇔ ∃X, Y ⊆ U.(v(x) = {X} & v(y) = {Y} & X ⊆ Y),
MI, v |= xεpt(y) ⇔ (#v(x) = #v(y) = 1 & e(v(x)) ⊂ e(v(y))) ⇔ ∃X, Y ⊆ U.(v(x) = {X} & v(y) = {Y} & X ⊂ Y),
MI, v |= xεingt(y) ⇔ (#v(x) = #v(y) = 1 & κ(e(v(x)), e(v(y))) ≥ t) ⇔ ∃X, Y ⊆ U.(v(x) = {X} & v(y) = {Y} & κ(X, Y) ≥ t),
MI, v |= xεptt(y) ⇔ (#v(x) = #v(y) = 1 & v(x) ≠ v(y) & κ(e(v(x)), e(v(y))) ≥ t) ⇔ ∃X, Y ⊆ U.(v(x) = {X} & v(y) = {Y} & X ≠ Y & κ(X, Y) ≥ t).   (11)

By a straightforward inspection one can check that MI is a model of the considered part of rough mereology, i.e., all axioms are true in MI. By way of example, we only show that (PS3) is true in MI, i.e., for any name variables x, y, any t ∈ [0, 1], and any variable assignment v,

MI, v |= xεing1(y) → ∀z.(zεingt(x) → zεingt(y)).   (12)

To this end, assume MI, v |= xεing1(y) first. Hence, (a) #v(x) = #v(y) = 1 and κ(e(v(x)), e(v(y))) ≥ 1 by (11). The latter is equivalent to (b) e(v(x)) ⊆ e(v(y)) due to rif₁(κ). Next consider any variable assignment w, different from v at most for z. As a consequence, (c) w(x) = v(x) and w(y) = v(y). In the sequel assume MI, w |= zεingt(x). Hence, (d) #w(z) = 1 and (e) κ(e(w(z)), e(w(x))) ≥ t by (11). It holds that (f) κ(e(w(z)), e(w(x))) ≤ κ(e(w(z)), e(w(y))) by (b), (c), and rif₂(κ). From the latter and (e) we obtain (g) κ(e(w(z)), e(w(y))) ≥ t. Hence, MI, w |= zεingt(y) in virtue of (a), (c), (d), and (11).

4 In Search of New RIFs

According to rough mereology, rough inclusion is a generalization of the set-theoretical inclusion of sets. While keeping with this idea, we try to obtain RIFs different from the standard one. Let U be a non-empty finite set of objects. Observe that for any X, Y ⊆ U, the following formulas are equivalent:

(i) X ⊆ Y,  (ii) X ∩ Y = X,  (iii) X ∪ Y = Y,  (iv) (U − X) ∪ Y = U,  (v) X − Y = ∅.   (13)

The equivalence of the first two statements gave rise to the standard RIF. Now we explore (i) ⇔ (iii) and (i) ⇔ (iv). In the case of (iii), '⊇' always holds true.


Conversely, '⊆' always takes place in (iv). The remaining inclusions may or may not hold, so we may introduce degrees of inclusion. Thus, let us define mappings κ₁, κ₂ : ℘U × ℘U → [0, 1] such that for any X, Y ⊆ U,

κ₁(X, Y) = #Y/#(X ∪ Y) if X ∪ Y ≠ ∅, and κ₁(X, Y) = 1 otherwise;
κ₂(X, Y) = #((U − X) ∪ Y)/#U.   (14)

It is worth noting that κ₂ was mentioned in [9]. Now we show that both κ₁, κ₂ are RIFs, different from the standard one and from each other.

Proposition 3. Each of κᵢ (i = 1, 2) is a RIF upon U, i.e., rif₁(κᵢ) and rif₂(κᵢ) hold.

Proof. We only prove the property for i = 1. Let X, Y, Z be any sets of objects. To show rif₁(κ₁), we only examine the non-trivial case where X, Y ≠ ∅. Then, κ₁(X, Y) = 1 if and only if #Y = #(X ∪ Y) if and only if Y = X ∪ Y if and only if X ⊆ Y. In the case of rif₂ assume that (a1) Y ⊆ Z. First suppose that X = ∅. If Z is empty as well, then Y = ∅. As a result, κ₁(X, Y) = 1 ≤ 1 = κ₁(X, Z). Conversely, if Z is non-empty, then κ₁(X, Z) = #Z/#Z = 1 ≥ κ₁(X, Y). Now assume that X ≠ ∅. Then X ∪ Y, X ∪ Z ≠ ∅. Moreover, Z = Y ∪ (Z − Y) and Y ∩ (Z − Y) = ∅ by (a1). As a consequence, (a2) #Z = #Y + #(Z − Y). Additionally, (a3) #(X ∪ Z) ≤ #(X ∪ Y) + #(Z − Y) and (a4) #Y ≤ #(X ∪ Y). Hence, κ₁(X, Y) = #Y/#(X ∪ Y) ≤ (#Y + #(Z − Y))/(#(X ∪ Y) + #(Z − Y)) ≤ (#Y + #(Z − Y))/#(X ∪ Y ∪ (Z − Y)) = #Z/#(X ∪ Z) = κ₁(X, Z) by (a2)–(a4).

Example 2. Consider U = {0, . . . , 9} and its subsets X = {0, . . . , 4}, Y = {2, . . . , 6}. Notice that X ∩ Y = {2, 3, 4}, X ∪ Y = {0, . . . , 6}, and (U − X) ∪ Y = {2, . . . , 9}. Hence, κ£(X, Y) = 3/5, κ₁(X, Y) = 5/7, and κ₂(X, Y) = 4/5, i.e., κ£, κ₁, and κ₂ are different RIFs.

Proposition 4. For any X, Y ⊆ U, we have:
(a) X ≠ ∅ ⇒ (κ₁(X, Y) = 0 ⇔ Y = ∅),
(b) κ₂(X, Y) = 0 ⇔ X = U & Y = ∅,
(c) rif₄(κ₁) & rif₄(κ₂),
(d) κ£(X, Y) ≤ κ₁(X, Y) ≤ κ₂(X, Y),
(e) κ₁(X, Y) = κ£(X ∪ Y, Y),
(f) κ₂(X, Y) = κ£(U, (U − X) ∪ Y) = κ£(U, U − X) + κ£(U, X ∩ Y),
(g) κ£(X, Y) = κ£(X, X ∩ Y) = κ₁(X, X ∩ Y) = κ₁(X − Y, X ∩ Y),
(h) X ∪ Y = U ⇒ κ₁(X, Y) = κ₂(X, Y).

Proof. By way of illustration we show (d) and (h). To this end, consider any sets of objects X, Y. In case (d), if X is empty, then (U − X) ∪ Y = U. Hence,


by the definitions, κ£(X, Y) = κ₁(X, Y) = κ₂(X, Y) = 1. Now suppose that X ≠ ∅. Obviously (d1) #(X ∩ Y) ≤ #X and (d2) #Y ≤ #(X ∪ Y). Since X ∪ Y = X ∪ (Y − X) and X ∩ (Y − X) = ∅, (d3) #(X ∪ Y) = #X + #(Y − X). Similarly, it follows from Y = (X ∩ Y) ∪ (Y − X) and (X ∩ Y) ∩ (Y − X) = ∅ that (d4) #Y = #(X ∩ Y) + #(Y − X). Observe also that (U − X) ∪ Y = ((U − X) − Y) ∪ Y = (U − (X ∪ Y)) ∪ Y and (U − (X ∪ Y)) ∩ Y = ∅. Hence, (d5) #((U − X) ∪ Y) = #(U − (X ∪ Y)) + #Y. In the sequel, κ£(X, Y) = #(X ∩ Y)/#X ≤ (#(X ∩ Y) + #(Y − X))/(#X + #(Y − X)) = #Y/#(X ∪ Y) = κ₁(X, Y) ≤ (#(U − (X ∪ Y)) + #Y)/(#(U − (X ∪ Y)) + #(X ∪ Y)) = #((U − X) ∪ Y)/#U = κ₂(X, Y) by (d1)–(d5) and the definitions of the RIFs. For (h) assume that X ∪ Y = U. Then Y − X = U − X, and κ₁(X, Y) = #Y/#U = #((Y − X) ∪ Y)/#U = #((U − X) ∪ Y)/#U = κ₂(X, Y) as required.

Let us briefly comment upon the properties. According to (a), if X is non-empty, then the emptiness of Y is both sufficient³ and necessary to have κ₁(X, Y) = 0. Property (b) states that κ₂ yields 0 solely for (U, ∅). Due to (c), κᵢ(X, Y) = 0 (i = 1, 2) implies the emptiness of the overlap of X, Y. Property (d) says that the degrees of inclusion yielded by κ₂ are at least as high as those given by κ₁, and the degrees of inclusion provided by κ₁ are not lower than those estimated by means of the standard RIF. (e) and (f) provide us with characterizations of κ₁ and κ₂ in terms of κ£, respectively. On the other hand, the standard RIF may be defined by means of κ₁ in virtue of (g). Finally, (h) states that κ₁, κ₂ are equal on the set of all pairs (X, Y) such that X, Y cover the universe.

³ Compare the optional postulate rif₃(κ).
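The two new RIFs are equally easy to compute. The sketch below (again purely illustrative; exact rational arithmetic via Python's fractions module avoids rounding issues) reproduces Example 2 and tests the ordering of Proposition 4d on random pairs of sets:

import random
from fractions import Fraction

def kappa_std(X, Y):
    return Fraction(len(X & Y), len(X)) if X else Fraction(1)

def kappa_1(X, Y):
    return Fraction(len(Y), len(X | Y)) if X | Y else Fraction(1)

def kappa_2(X, Y, U):
    return Fraction(len((U - X) | Y), len(U))

U = set(range(10))
X, Y = set(range(5)), set(range(2, 7))
print(kappa_std(X, Y), kappa_1(X, Y), kappa_2(X, Y, U))   # 3/5 5/7 4/5

for _ in range(1000):   # Proposition 4d: kappa_std <= kappa_1 <= kappa_2
    A = {i for i in U if random.random() < 0.5}
    B = {i for i in U if random.random() < 0.5}
    assert kappa_std(A, B) <= kappa_1(A, B) <= kappa_2(A, B, U)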

5 Mappings Complementary to RIFs

Now we define mappings which are in some sense complementary to the RIFs considered. We also investigate properties of these functions and give one more characterization of the standard RIF. Namely, with every mapping f : ℘U × ℘U → [0, 1] one can associate a complementary mapping f̄ : ℘U × ℘U → [0, 1] defined by

f̄(X, Y) =def 1 − f(X, Y)   (15)

for any sets X, Y ⊆ U. Clearly, f is complementary to f̄ as well. In particular, we obtain

κ̄£(X, Y) = #(X − Y)/#X if X ≠ ∅, and κ̄£(X, Y) = 0 otherwise;
κ̄₁(X, Y) = #(X − Y)/#(X ∪ Y) if X ∪ Y ≠ ∅, and κ̄₁(X, Y) = 0 otherwise;
κ̄₂(X, Y) = #(X − Y)/#U.   (16)


For the sake of simplicity, κ̄ where κ is a RIF will be referred to as a co-RIF. Observe that each of the co-RIFs measures the difference between its first and second arguments, i.e., the equivalence (i) ⇔ (v) (cf. (13)) is explored here. It is worth noting that for any X, Y ⊆ U,

κ̄£(X, Y) = κ£(X, U − Y).   (17)

However, the same is not true of κ̄ᵢ for i = 1, 2. Indeed, κ₁(X, U − Y) = #(U − Y)/#(X ∪ (U − Y)) if X ∪ (U − Y) ≠ ∅, and κ₂(X, U − Y) = #(U − (X ∩ Y))/#U, so the counterparts of (17) do not hold in general.

Example 3. Let U and X, Y be as in Example 2, i.e., U = {0, . . . , 9}, X = {0, . . . , 4}, and Y = {2, . . . , 6}. It is easy to see that κ₁(X, U − Y) = 5/8 and κ₂(X, U − Y) = 7/10, whereas κ̄₁(X, Y) = 2/7 and κ̄₂(X, Y) = 1/5.

We can characterize the standard RIF in terms of κᵢ (i = 1, 2) and their co-RIFs as follows:

Proposition 5. For any sets of objects X, Y where X ≠ ∅,

κ£(X, Y) = κ̄₁(X, U − Y)/κ₁(U − Y, X) = κ̄₂(X, U − Y)/κ₂(U, X).

Proof. Consider any set of objects Y and any non-empty set of objects X. Hence, X ∪ (U − Y) ≠ ∅ as well. Moreover, κ₁(U − Y, X), κ₂(U, X) > 0. Then κ̄₁(X, U − Y) = #(X − (U − Y))/#(X ∪ (U − Y)) = #(X ∩ Y)/#(X ∪ (U − Y)) = (#(X ∩ Y)/#X) · (#X/#(X ∪ (U − Y))) = κ£(X, Y) · κ₁(U − Y, X) by the definitions of κ£, κ₁, and κ̄₁. Hence, κ£(X, Y) = κ̄₁(X, U − Y)/κ₁(U − Y, X) as required. Similarly, κ̄₂(X, U − Y) = #(X − (U − Y))/#U = #(X ∩ Y)/#U = (#(X ∩ Y)/#X) · (#X/#U) = κ£(X, Y) · κ₂(U, X) by the definitions of κ£, κ₂, and κ̄₂. Immediately, κ£(X, Y) = κ̄₂(X, U − Y)/κ₂(U, X), which ends the proof.

Henceforth the symmetric difference of sets X, Y will be denoted by X ÷ Y. We can prove the following properties of co-RIFs:

Proposition 6. For any X, Y, Z ⊆ U, an arbitrary RIF κ, and i = 1, 2,
(a) κ̄(X, Y) = 0 ⇔ X ⊆ Y,
(b) Y ⊆ Z ⇒ κ̄(X, Z) ≤ κ̄(X, Y),
(c) κ̄₂(X, Y) ≤ κ̄₁(X, Y) ≤ κ̄£(X, Y),
(d) κ̄ᵢ(X, Y) + κ̄ᵢ(Y, Z) ≥ κ̄ᵢ(X, Z),
(e) 0 ≤ κ̄ᵢ(X, Y) + κ̄ᵢ(Y, X) ≤ 1,
(f) (X = ∅ & Y ≠ ∅) or (X ≠ ∅ & Y = ∅) ⇒ κ̄£(X, Y) + κ̄£(Y, X) = κ̄₁(X, Y) + κ̄₁(Y, X) = 1.

Proof. We only prove (d) for i = 1, and (e). To this end, consider any sets of objects X, Y, Z. In case (d), if X = ∅, then κ̄₁(X, Z) = 0 in virtue of (a). Hence, (d) obviously holds. Now suppose that X ≠ ∅. If Y = ∅, then κ̄₁(X, Y) = 1. On the other hand, if Y ≠ ∅ and Z = ∅, then κ̄₁(Y, Z) = 1. In both cases κ̄₁(X, Y) + κ̄₁(Y, Z) ≥ 1 ≥ κ̄₁(X, Z).

Finally, assume that X, Y, Z ≠ ∅. Let m = #(X ∪ Y ∪ Z), m₀ = #(X − (Y ∪ Z)), m₁ = #(Y − (X ∪ Z)), m₂ = #((X ∩ Y) − Z), m₃ = #((X ∩ Z) − Y), and m₄ = #(Z − (X ∪ Y)). Observe that #(X − Y) = m₀ + m₃, #(X − Z) = m₀ + m₂, #(Y − Z) = m₁ + m₂, #(X ∪ Y) = m − m₄, #(X ∪ Z) = m − m₁, and #(Y ∪ Z) = m − m₀. Hence, κ̄₁(X, Y) = #(X − Y)/#(X ∪ Y) = (m₀ + m₃)/(m − m₄). On the same grounds, κ̄₁(Y, Z) = (m₁ + m₂)/(m − m₀) and κ̄₁(X, Z) = (m₀ + m₂)/(m − m₁). It is easy to see that

(m₀ + m₃)/(m − m₄) + (m₁ + m₂)/(m − m₀) ≥ (m₀ + m₃)/m + (m₁ + m₂)/m ≥ (m₀ + m₁ + m₂)/m ≥ (m₀ + m₂)/(m − m₁),

which ends the proof of (d).

The first inequality of (e) is obvious, so we only show the second one. For i = 1 assume that X ∪ Y ≠ ∅, since the case X = Y = ∅ is trivial. Thus, κ̄₁(X, Y) + κ̄₁(Y, X) = #(X − Y)/#(X ∪ Y) + #(Y − X)/#(X ∪ Y) = #(X ÷ Y)/#(X ∪ Y) ≤ 1 because X ÷ Y ⊆ X ∪ Y. The property just proved implies the second inequality for i = 2 due to (c).

According to (a), every co-RIF yields 0 exactly in the case the first argument is included in the second one. As a consequence, (*) κ̄(X, X) = 0 for every set of objects X. That is, κ̄ may serve as a (non-symmetric) distance function. (b) states that co-RIFs are co-monotone in the second variable. (c) provides us with a comparison of our three co-RIFs. Properties (e), (f) will prove their usefulness in the next section. (d) expresses the triangle inequality condition for κ̄ᵢ (i = 1, 2). Let us note that the triangle inequality does not hold for κ̄£ in general.

Example 4. Consider sets of objects X, Y, Z such that X − Z, Z − X ≠ ∅ and Y = X ∪ Z. We show that

κ̄£(X, Y) + κ̄£(Y, Z) < κ̄£(X, Z).

By the assumptions each of X, Y, Z is non-empty and Y − Z = X − Z. Next, #X < #Y since X ⊂ Y. Moreover, κ̄£(X, Y) = 0 in virtue of (a). As a consequence,

κ̄£(X, Y) + κ̄£(Y, Z) = #(Y − Z)/#Y < #(Y − Z)/#X = #(X − Z)/#X = κ̄£(X, Z)

as expected.

Additionally, it can happen that κ̄£(X, Y) + κ̄£(Y, X) > 1. Indeed, if X, Y ≠ ∅ and X ∩ Y = ∅, then Σ = κ̄£(X, Y) + κ̄£(Y, X) = (#X/#X) + (#Y/#Y) = 1 + 1 = 2. Nevertheless, 2 is the greatest value taken by Σ.
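The failure of the triangle inequality for κ̄£ is easy to witness numerically; the following Python sketch (illustrative only) instantiates Example 4 with the smallest possible sets:

def co_kappa_std(X, Y):
    # co-RIF of the standard RIF: #(X - Y)/#X, and 0 for empty X
    return len(X - Y) / len(X) if X else 0.0

X, Z = {0, 1}, {1, 2}        # X - Z and Z - X are non-empty
Y = X | Z
lhs = co_kappa_std(X, Y) + co_kappa_std(Y, Z)   # 0 + 1/3
rhs = co_kappa_std(X, Z)                        # 1/2
print(lhs, rhs, lhs < rhs)   # True: the triangle inequality fails for co-kappa-std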

6 RIFs and Their Complementary Mappings vs. Similarity and Distance between Sets

In this section we use the three co-RIFs to define certain normalized distance functions with which one can measure (dis)similarity between sets. Namely, let δ£, δᵢ : ℘U × ℘U → [0, 1] (i = 1, 2) be mappings such that for any X, Y ⊆ U,

δ£(X, Y) =def (κ̄£(X, Y) + κ̄£(Y, X))/2,
δᵢ(X, Y) =def κ̄ᵢ(X, Y) + κ̄ᵢ(Y, X).   (18)

It is easy to see that

δ£(X, Y) = (#(X − Y)/#X + #(Y − X)/#Y)/2 if X, Y ≠ ∅; δ£(X, Y) = 0 if X = Y = ∅; and δ£(X, Y) = 1/2 in the remaining cases;
δ₁(X, Y) = #(X ÷ Y)/#(X ∪ Y) if X ∪ Y ≠ ∅, and δ₁(X, Y) = 0 otherwise;
δ₂(X, Y) = #(X ÷ Y)/#U.   (19)

It is worth mentioning that δ₁ is an instance of the Marczewski–Steinhaus metric [34]. As we shall see, the remaining two functions are metrics on ℘U as well. Namely, we can prove the following:

Proposition 7. For any sets X, Y, Z ⊆ U and δ ∈ {δ£, δ₁, δ₂},
(a) δ(X, Y) = 0 ⇔ X = Y,
(b) δ(X, Y) = δ(Y, X),
(c) δ(X, Y) + δ(Y, Z) ≥ δ(X, Z),
(d) max{δ£(X, Y), δ₂(X, Y)} ≤ δ₁(X, Y) ≤ 2δ£(X, Y).

Proof. Property (a) is an easy consequence of Proposition 6a. (b) directly follows from the definitions of δ£, δ₁, and δ₂. Property (c) for δ₁, δ₂ can easily be obtained from Proposition 6d. Now we show (c) for δ£. To this end, consider any sets of objects X, Y, Z. If both X, Z are empty, then δ£(X, Z) = 0 in virtue of (a), and (c) follows immediately. Next, if X, Y ≠ ∅ and Z = ∅, or X, Y = ∅ and Z ≠ ∅, then δ£(X, Z) = δ£(Y, Z) = 1/2 by (19). In consequence, (c) is fulfilled regardless of the value δ£(X, Y). In the same vein, if X ≠ ∅ and Y, Z = ∅, or X = ∅ and Y, Z ≠ ∅, then δ£(X, Y) = δ£(X, Z) = 1/2. Here (c) is satisfied regardless of δ£(Y, Z). In the sequel, if X, Z ≠ ∅ and Y = ∅, then δ£(X, Y) = δ£(Y, Z) = 1/2. Hence, δ£(X, Y) + δ£(Y, Z) = 1 ≥ δ£(X, Z) for any value of δ£ at (X, Z).


Finally, we prove (c) for X, Y, Z ≠ ∅. Let m and mᵢ (i = 0, . . . , 4) be as earlier. Additionally, let m₅ = #((Y ∩ Z) − X). Notice that

2δ£(X, Y) = #(X − Y)/#X + #(Y − X)/#Y = (m₀ + m₃)/(m − (m₁ + m₄ + m₅)) + (m₁ + m₅)/(m − (m₀ + m₃ + m₄)),
2δ£(Y, Z) = #(Y − Z)/#Y + #(Z − Y)/#Z = (m₁ + m₂)/(m − (m₀ + m₃ + m₄)) + (m₃ + m₄)/(m − (m₀ + m₁ + m₂)),
2δ£(X, Z) = #(X − Z)/#X + #(Z − X)/#Z = (m₀ + m₂)/(m − (m₁ + m₄ + m₅)) + (m₄ + m₅)/(m − (m₀ + m₁ + m₂)).

Hence we obtain

2(δ£(X, Y) + δ£(Y, Z) − δ£(X, Z)) = ((m₀ + m₃) − (m₀ + m₂))/(m − (m₁ + m₄ + m₅)) + ((m₁ + m₅) + (m₁ + m₂))/(m − (m₀ + m₃ + m₄)) + ((m₃ + m₄) − (m₄ + m₅))/(m − (m₀ + m₁ + m₂)) ≥ (m₃ − m₂)/m + (2m₁ + m₂ + m₅)/m + (m₃ − m₅)/m = 2(m₁ + m₃)/m ≥ 0.

As a result, δ£(X, Y) + δ£(Y, Z) ≥ δ£(X, Z) as needed.

As regards (d), we only prove that (*) δ£(X, Y) ≤ δ₁(X, Y) for any X, Y ⊆ U. The rest easily follows from Proposition 6c. Consider any sets of objects X, Y. If at least one of X, Y is empty, (*) directly holds by the definitions of δ£, δ₁. For the remaining case observe that

#(X ∩ Y)/#(X ∪ Y) ≤ min{#(X ∩ Y)/#X, #(X ∩ Y)/#Y}

since max{#X, #Y} ≤ #(X ∪ Y). Hence, we obtain in the sequel:

max{1 − #(X ∩ Y)/#X, 1 − #(X ∩ Y)/#Y} ≤ 1 − #(X ∩ Y)/#(X ∪ Y),
max{#(X − Y)/#X, #(Y − X)/#Y} ≤ #(X ÷ Y)/#(X ∪ Y),
#(X − Y)/#X + #(Y − X)/#Y ≤ 2 · #(X ÷ Y)/#(X ∪ Y),
(#(X − Y)/#X + #(Y − X)/#Y)/2 ≤ #(X ÷ Y)/#(X ∪ Y).

From the latter we derive (*) by the definitions of δ£, δ₁.


Summing up, δ£ and δᵢ (i = 1, 2) are metrics on ℘U due to (a)–(c), and they may be used to measure the distance between sets. According to (d), the double distance between sets X, Y estimated by means of δ£ is not smaller than the distance between X, Y yielded by δ₁. In turn, the distance measured by the latter metric is greater than or equal to the distance given by each of δ£, δ₂. In view of the fact that κ̄ᵢ, underlying δᵢ, satisfy the triangle inequality (see Proposition 6d), it is not very surprising that δᵢ are metrics. The really unexpected result is that δ£ fulfils the triangle inequality as well.

The distance between two sets may be interpreted as the degree of their dissimilarity. Thus, δ£ and δᵢ may serve as measures (indices) of dissimilarity of sets. On the other hand, the mappings δ̄£ and δ̄ᵢ (i = 1, 2), complementary in the sense of (15) to δ£ and δᵢ, respectively, may be used as similarity measures (see, e.g., [44] for a discussion of various indices used to measure the degree of similarity between clusterings). Let us note that for any X, Y ⊆ U, the following dependencies hold:

δ̄£(X, Y) = (κ£(X, Y) + κ£(Y, X))/2,
δ̄ᵢ(X, Y) = κᵢ(X, Y) + κᵢ(Y, X) − 1.   (20)

More precisely,

δ̄£(X, Y) = #(X ∩ Y) · (1/#X + 1/#Y)/2 if X, Y ≠ ∅; δ̄£(X, Y) = 1 if X = Y = ∅; and δ̄£(X, Y) = 1/2 in the remaining cases;
δ̄₁(X, Y) = #(X ∩ Y)/#(X ∪ Y) if X ∪ Y ≠ ∅, and δ̄₁(X, Y) = 1 otherwise;
δ̄₂(X, Y) = #((U − (X ∪ Y)) ∪ (X ∩ Y))/#U.   (21)

Thus, starting with the standard RIF and two other RIFs of a similar origin, we have finally arrived at similarity measures known from the literature [35, 36, 37, 38]. More precisely, δ̄£ is the function proposed by Kulczyński to estimate biotopical similarity [36]. The similarity index δ̄₁, complementary to the Marczewski–Steinhaus metric δ₁, is attributed to Jaccard [35]. The function δ̄₂ was introduced (at least) twice, viz., by Sokal and Michener [38] and by Rand [37]. Let us note the following observations:

Proposition 8. For any sets of objects X, Y and δ ∈ {δ£, δ₁, δ₂}, we have that:
(a) δ̄(X, Y) = 1 ⇔ X = Y,
(b) δ̄(X, Y) = δ̄(Y, X),
(c) δ̄£(X, Y) = 0 ⇔ X ∩ Y = ∅ & X, Y ≠ ∅,
(d) δ̄₁(X, Y) = 0 ⇔ X ∩ Y = ∅ & X ∪ Y ≠ ∅,


(e) δ̄₂(X, Y) = 0 ⇔ X ∩ Y = ∅ & X ∪ Y = U,
(f) 2δ̄£(X, Y) − 1 ≤ δ̄₁(X, Y) ≤ min{δ̄£(X, Y), δ̄₂(X, Y)}.

The proof is easy and, hence, omitted. However, some remarks may be useful. (a) states that every set is similar to itself to the highest degree 1. According to (b), similarity is assumed to be symmetric here. Properties (c)–(e) describe conditions characterizing the lowest degree of similarity between sets. A comparison of the three similarity indices is provided by (f).

An example illustrating a possible application of the Marczewski–Steinhaus metric to estimate differences between biotopes can be found in [34]. In that example, two real forests from Lower Silesia (Poland) are considered. We slightly modify the example and extend it to the other distance measures investigated.

Example 5. As the universe we take a collection of tree species U = {a, b, h, l, o, p, r, s}, where a stands for 'alder', b for 'birch', h for 'hazel', l for 'larch', o for 'oak', p for 'pine', r for 'rowan', and s for 'spruce'. Consider two forests represented by the collections A, B of the tree species which occur in those forests, where A = {a, b, h, p, r} and B = {b, o, p, s}. First we compute the degrees of inclusion of A in B, and vice versa. Next we measure the biotopical differences between A and B using δ£ and δᵢ for i = 1, 2. Finally we estimate the degrees of biotopical similarity of the forests investigated.

It is easy to see that κ£(A, B) = 2/5, κ£(B, A) = 1/2, κ₁(A, B) = 4/7, κ₁(B, A) = 5/7, κ₂(A, B) = 5/8, and κ₂(B, A) = 3/4. Hence,

δ£(A, B) = (3/5 + 1/2)/2 = 11/20,

δ₁(A, B) = 5/7, and δ₂(A, B) = 5/8. As expected, the distance functions δ£, δ₁, δ₂ (and so the corresponding similarity measures δ̄£, δ̄₁, δ̄₂) may give us different values when measuring the distance (resp., similarity) between A and B. Due to Proposition 7d, this distance is the greatest (equal to 5/7) when measured by δ₁. Conversely, δ̄₁ yields the least degree of similarity, equal to 2/7. Therefore, these measures seem to be particularly attractive to cautious reasoners. For those who accept a higher risk, both δ£, δ₂ (and similarly, δ̄£, δ̄₂) are reasonable alternatives too. Incidentally, δ£ gives the least distance, equal to 11/20, and its complementary mapping δ̄£ yields the greatest degree of similarity, equal to 9/20. In this particular case, the values provided by δ₂ and δ̄₂, 5/8 and 3/8, respectively, are in between. Clearly, the choice of the most appropriate distance function (or similarity measure) may also depend on factors other than the level of risk.
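Example 5 can be reproduced end to end with the functions introduced so far. The Python sketch below (illustrative only; the complementary mappings are obtained as 1 − f, per (15)) prints the inclusion degrees, the three distances, and the three similarity indices for the two forests:

from fractions import Fraction

def k_std(X, Y):  return Fraction(len(X & Y), len(X)) if X else Fraction(1)
def k1(X, Y):     return Fraction(len(Y), len(X | Y)) if X | Y else Fraction(1)
def k2(X, Y, U):  return Fraction(len((U - X) | Y), len(U))

def d_std(X, Y):  return ((1 - k_std(X, Y)) + (1 - k_std(Y, X))) / 2
def d1(X, Y):     return (1 - k1(X, Y)) + (1 - k1(Y, X))
def d2(X, Y, U):  return (1 - k2(X, Y, U)) + (1 - k2(Y, X, U))

U = {'a', 'b', 'h', 'l', 'o', 'p', 'r', 's'}
A = {'a', 'b', 'h', 'p', 'r'}   # first forest
B = {'b', 'o', 'p', 's'}        # second forest
print(k_std(A, B), k_std(B, A))                        # 2/5 1/2
print(d_std(A, B), d1(A, B), d2(A, B, U))              # 11/20 5/7 5/8
print(1 - d_std(A, B), 1 - d1(A, B), 1 - d2(A, B, U))  # 9/20 2/7 3/8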

7 Summary

In this article, an attempt was made to discover RIFs different from the standard one, yet having a similar origin. First we overviewed the notion of the standard RIF, κ£. In the next step, a general framework for the discussion of RIFs and their properties was recalled. As a result, a minimal set of postulates specifying a RIF


was derived. Also, several optional conditions were proposed. Then we defined two RIFs, κ₁ and κ₂, which turned out to be different from the standard one. The latter RIF was mentioned in [9], yet the former one seems to be new. We examined properties of these RIFs with a special stress laid on their relationship to the standard RIF. In the sequel, we introduced functions complementary to RIFs (co-RIFs), which resulted in a new characterization of the standard RIF in terms of the remaining two RIFs and their complementary mappings. We examined properties of each of the three co-RIFs: κ̄£, κ̄₁, and κ̄₂. We easily found out that they might serve as distance functions. However, only the latter two functions proved to satisfy the triangle inequality. In the next step, the co-RIFs were used to define certain distance functions, δ£, δ₁, and δ₂, which turned out to be metrics on the power set of the set of all objects considered. δ₁ has already been known in the literature [34]. From the distance functions mentioned above we finally derived their complementary mappings, δ̄£, δ̄₁, and δ̄₂, serving as similarity measures. As it turned out, they were discovered many years ago [35, 36, 37, 38]. In this way, starting with an idea which led to the standard RIF and going through intermediate stages (co-RIFs and certain metrics based on them), we finally arrived at similarity indices known in machine learning, relational learning, and statistical learning, to name a few areas of application.

References

1. Gomolińska, A.: On three closely related rough inclusion functions. In: Kryszkiewicz, M., Peters, J.F., Rybiński, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 142–151. Springer, Heidelberg (2007)
2. Polkowski, L., Skowron, A.: Rough mereology. In: Raś, Z.W., Zemankova, M. (eds.) ISMIS 1994. LNCS (LNAI), vol. 869, pp. 85–94. Springer, Heidelberg (1994)
3. Polkowski, L., Skowron, A.: Rough mereology: A new paradigm for approximate reasoning. Int. J. Approximate Reasoning 15, 333–365 (1996)
4. Polkowski, L., Skowron, A.: Rough mereological calculi of granules: A rough set approach to computation. Computational Intelligence 17, 472–492 (2001)
5. Leśniewski, S.: Foundations of the General Set Theory 1 (in Polish). Works of the Polish Scientific Circle, Moscow, vol. 2 (1916); also in: [6], pp. 128–173
6. Surma, S.J., Srzednicki, J.T., Barnett, J.D. (eds.): Stanisław Leśniewski. Collected Works. Kluwer/Polish Scientific Publ., Dordrecht/Warsaw (1992)
7. Borkowski, L. (ed.): Jan Łukasiewicz – Selected Works. North Holland/Polish Scientific Publ., Amsterdam/Warsaw (1970)
8. Łukasiewicz, J.: Die logischen Grundlagen der Wahrscheinlichkeitsrechnung, Cracow (1913); English translation in: [7], pp. 16–63
9. Drwal, G., Mrózek, A.: System RClass – software implementation of a rough classifier. In: Kłopotek, M.A., Michalewicz, M., Raś, Z.W. (eds.) Proc. 7th Int. Symp. Intelligent Information Systems (IIS 1998), Malbork, Poland, June 1998, pp. 392–395 (1998)


10. Stepaniuk, J.: Knowledge discovery by application of rough set models. In: Polkowski, L., Tsumoto, S., Lin, T.Y. (eds.) Rough Set Methods and Applications: New Developments in Knowledge Discovery in Information Systems, pp. 137–233. Physica, Heidelberg (2001)
11. Pawlak, Z.: Rough sets. Int. J. Computer and Information Sciences 11, 341–356 (1982)
12. Pawlak, Z.: Information Systems. Theoretical Foundations (in Polish). Wydawnictwo Naukowo-Techniczne, Warsaw (1983)
13. Pawlak, Z.: Rough Sets. Theoretical Aspects of Reasoning About Data. Kluwer, Dordrecht (1991)
14. Pawlak, Z.: Rough set elements. In: Polkowski, L., Skowron, A. (eds.) Rough Sets in Knowledge Discovery, vol. 1, pp. 10–30. Physica, Heidelberg (1998)
15. Pawlak, Z.: A treatise on rough sets. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets IV. LNCS, vol. 3700, pp. 1–17. Springer, Heidelberg (2005)
16. Bazan, J.G., Skowron, A., Swiniarski, R.: Rough sets and vague concept approximation: From sample approximation to adaptive learning. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets V. LNCS, vol. 4100, pp. 39–63. Springer, Heidelberg (2006)
17. Peters, J.F.: Approximation spaces for hierarchical intelligent behavioral system models. In: Dunin-Kęplicz, B., Jankowski, A., Skowron, A., Szczuka, M. (eds.) Monitoring, Security, and Rescue Techniques in Multiagent Systems, pp. 13–30. Springer, Heidelberg (2005)
18. Peters, J.F., Skowron, A., Stepaniuk, J.: Nearness of objects: Extension of approximation space model. Fundamenta Informaticae 79, 497–512 (2007)
19. Skowron, A., Stepaniuk, J.: Generalized approximation spaces. In: Lin, T.Y., Wildberger, A.M. (eds.) Soft Computing, pp. 18–21. Simulation Councils, San Diego (1995)
20. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundamenta Informaticae 27, 245–253 (1996)
21. Skowron, A., Stepaniuk, J., Peters, J.F., Swiniarski, R.: Calculi of approximation spaces. Fundamenta Informaticae 72, 363–378 (2006)
22. Skowron, A., Swiniarski, R., Synak, P.: Approximation spaces and information granulation. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets III. LNCS, vol. 3400, pp. 175–189. Springer, Heidelberg (2005)
23. Ziarko, W.: Variable precision rough set model. J. Computer and System Sciences 46, 39–59 (1993)
24. Ziarko, W.: Probabilistic decision tables in the variable precision rough set model. Computational Intelligence 17, 593–603 (2001)
25. Ziarko, W.: Probabilistic rough sets. In: Ślęzak, D., Wang, G., Szczuka, M.S., Düntsch, I., Yao, Y. (eds.) RSFDGrC 2005. LNCS, vol. 3641, pp. 283–293. Springer, Heidelberg (2005)
26. Ziarko, W.: Stochastic approach to rough set theory. In: Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H.S., Słowiński, R. (eds.) RSCTC 2006. LNCS, vol. 4259, pp. 38–48. Springer, Heidelberg (2006)
27. Yao, Y.Y.: Decision-theoretic rough set models. In: Yao, J., Lingras, P., Wu, W.-Z., Szczuka, M.S., Cercone, N.J., Ślęzak, D. (eds.) RSKT 2007. LNCS, vol. 4481, pp. 1–12. Springer, Heidelberg (2007)
28. Yao, Y.Y.: Probabilistic rough set approximations. Int. J. of Approximate Reasoning (in press, 2007), doi:10.1016/j.ijar.2007.05.019
29. Yao, Y.Y., Wong, S.K.M.: A decision theoretic framework for approximating concepts. Int. J. of Man–Machine Studies 37, 793–809 (1992)


30. Pawlak, Z., Skowron, A.: Rough membership functions. In: Fedrizzi, M., Kacprzyk, J., Yager, R.R. (eds.) Advances in the Dempster–Shafer Theory of Evidence, pp. 251–271. John Wiley & Sons, Chichester (1994)
31. Gomolińska, A.: Possible rough ingredients of concepts in approximation spaces. Fundamenta Informaticae 72, 139–154 (2006)
32. Nguyen, H.S., Skowron, A., Stepaniuk, J.: Granular computing: A rough set approach. Computational Intelligence 17, 514–544 (2001)
33. Zhang, M., Xu, L.D., Zhang, W.X., Li, H.Z.: A rough set approach to knowledge reduction based on inclusion degree and evidence reasoning theory. Expert Systems 20, 298–304 (2003)
34. Marczewski, E., Steinhaus, H.: On a certain distance of sets and the corresponding distance of functions. Colloquium Mathematicum 6, 319–327 (1958)
35. Jaccard, P.: Nouvelles recherches sur la distribution florale. Bull. de la Société Vaudoise des Sciences Naturelles 44, 223–270 (1908)
36. Kulczyński, S.: Die Pflanzenassociationen der Pieninen. Bull. Internat. Acad. Polon. Sci. Lett., Sci. Math. et Naturelles, série B, suppl. II 2, 57–203 (1927)
37. Rand, W.: Objective criteria for the evaluation of clustering methods. J. of the American Statistical Association 66, 846–850 (1971)
38. Sokal, R.R., Michener, C.D.: A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin 38, 1409–1438 (1958)
39. Xu, Z.B., Liang, J.Y., Dang, C.Y., Chin, K.S.: Inclusion degree: A perspective on measures for rough set data analysis. Information Sciences 141, 227–236 (2002)
40. Zhang, W.X., Leung, Y.: Theory of including degrees and its applications to uncertainty inference. In: Proc. of 1996 Asian Fuzzy System Symposium, pp. 496–501 (1996)
41. An, A., Cercone, N.: Rule quality measures for rule induction systems: Description and evaluation. Computational Intelligence 17, 409–424 (2001)
42. Kryszkiewicz, M.: Fast discovery of representative association rules. In: Polkowski, L., Skowron, A. (eds.) RSCTC 1998. LNCS (LNAI), vol. 1424, pp. 214–221. Springer, Heidelberg (1998)
43. Tsumoto, S.: Modelling medical diagnostic rules based on rough sets. In: Polkowski, L., Skowron, A. (eds.) RSCTC 1998. LNCS (LNAI), vol. 1424, pp. 475–482. Springer, Heidelberg (1998)
44. Albatineh, A.N., Niewiadomska-Bugaj, M., Mihalko, D.: On similarity indices and correction for chance agreement. J. of Classification 23, 301–313 (2006)
45. Wallace, D.L.: A method for comparing two hierarchical clusterings: Comment. J. of the American Statistical Association 78, 569–576 (1983)
46. Zadeh, L.A.: Fuzzy sets. Information and Control 8, 338–353 (1965)

Automatic Rhythm Retrieval from Musical Files

Bożena Kostek, Jarosław Wójcik, and Piotr Szczuko

Gdańsk University of Technology, Multimedia Systems Department
Narutowicza 11/12, 80-952 Gdańsk, Poland
{bozenka,szczuko}@sound.eti.pg.gda.pl, [email protected]

Abstract. This paper presents a comparison of the effectiveness of two computational intelligence approaches applied to the task of retrieving the rhythmic structure of musical files. The method proposed by the authors of this paper generates rhythmic levels first, and then uses these levels to compose rhythmic hypotheses. Three phases are examined within this study: creating periods, creating simplified hypotheses, and creating full hypotheses. All experiments are conducted on a database of national anthems. Decision systems such as Artificial Neural Networks and Rough Sets are employed to search for the metric structure of musical files. The search is based on examining the physical attributes of sound that are important in determining whether a particular sound is placed in an accented location of a musical piece. The results of the experiments show that both decision systems indicate note duration as the most significant parameter in the automatic search for the metric rhythm structure of musical files. Also, a brief description of an application realizing automatic rhythm accompaniment is presented.

Keywords: Rhythm Retrieval, Metric Rhythm, Music Information Retrieval, Artificial Neural Networks, Rough Sets.

1 Introduction

The aim of this article is to present a comparative study of the effectiveness of two computational intelligence approaches applied to the task of retrieving the rhythmic structure of musical files. Existing research on metric rhythm usually focuses on retrieving low rhythmic levels, going down to the level of a measure. Typically, those methods are sufficient to emulate human perception of a local rhythm. According to McAuley and Semple [14], trained musicians perceive more levels, though. High-level perception is required from drum players, thus a computational approach needs to retrieve the so-called hypermetric structure of a piece. If it reaches rhythmic levels as high as phrases, sentences, and periods, then automatic drum accompaniment applications can be developed.

Rhythm retrieval research is a broad field and, among other issues, involves the quantization of the beginnings and lengths of notes, the extraction of rhythm events from audio recordings, and the search for the meter of compositions.


Rhythm is an element of a piece determining musical style, which may be valuable in retrieval. The rhythmic structure, together with the patterns retrieved, carries information about the genre of a piece. Content-based methods of music retrieval are nowadays developed by researchers from the multimedia retrieval and computational intelligence domains. The most common classes of rhythm retrieval models are: rule-based, multiple-agent, multiple-oscillator, and probabilistic. The rhythm retrieval methods can be classified within the context of what type of actions they take, i.e., whether they quantize musical data, or find the tempo of a piece (e.g., van Belle [2]), time signatures, positions of barlines, a metric structure, or an entire hypermetric hierarchy.

Rhythm finding systems very often rank the hypotheses of rhythm based on a sound salience function. Since scientists differ in their opinions on the aspect of salience, the Authors carried out special experiments to address the salience problem. A number of research studies are based on the theory published by Lerdahl and Jackendoff [13], who claim that such physical attributes of sounds as pitch (frequency), duration, and velocity (amplitude) influence the rhythmical salience of sounds. Another approach, proposed by Rosenthal [19], ranks higher the hypotheses in which long sounds are placed in accented positions. In Dixon's [4] multiple-agent approach, two salience functions are proposed, combining duration, pitch, and velocity. The first is a linear combination of physical attributes; Dixon calls it an additive function. The other one is a multiplicative function. Dahl [3] notices that drummers play accented strokes with higher amplitude than unaccented ones. Parncutt, in his book [15], claims that lower sounds fall on the beat. In a review of Parncutt's book, Huron [5] notices that the high salience of low sounds is "neither an experimentally determined fact nor an established principle in musical practice". A duration-based hypothesis predominated in rhythm-related works; however, this approach seemed to be based on intuition only. The experimental confirmation of this thesis, based on Data Mining (DM) association rules and Artificial Neural Networks (ANNs), can be found in former works by the Authors of this paper [6], [7], [8] and also in the doctoral thesis of Wójcik [27]. The experiments employing rough sets, which are the subject of this paper, were performed in order to confirm the results obtained from the DM and ANN approaches. Another reason was to verify whether all three computational intelligence models applied to the salience problem return similar findings, which may prove the correctness of these approaches.

This article is an extended version of a paper included in the Proceedings of Rough Sets and Intelligent Systems Paradigms [12]. The remainder of the paper is organized as follows: in Section 2 a short review of computational intelligence methods used in research related to the emulation of human perception is presented. Then, Section 3 presents some issues concerning hypermetric rhythm retrieval, which lead towards the experiments on rhythm retrieval. A brief description of the application realizing automatic rhythm accompaniment is given in Section 4, along with an approach to the computational complexity of the algorithm creating hypermetric rhythmic hypotheses (Section 5). Finally, Section 6 puts forward a summary of results as well as some concluding remarks.

2 Emulation of Human Perception by Computational Intelligence Techniques

The domain of computational intelligence has grown into an independent and very attractive research area over the last few years, with many applications dedicated to data mining in the musical domain [8], [9], [23], [24]. Computational Intelligence (CI) is a branch of Artificial Intelligence which deals with the soft facets of AI, i.e., programs behaving intelligently. CI is understood in a number of ways, e.g., as a study of the design of intelligent agents or as a subbranch of AI which aims "to use learning, adaptive, or evolutionary computation to create programs that are, in some sense, intelligent" [25]. Researchers are trying to classify the branches of CI to designate the ways in which CI methods help humans discover how their perception works. However, this is a multi-faceted task with numerous overlapping definitions, thus the map of this discipline is ambiguous. The domain of CI groups several approaches, the most common being: Artificial Neural Networks (ANNs), Fuzzy Systems, Evolutionary Computation, Machine Learning including Data Mining, Soft Computing, Rough Sets, Bayesian Networks, Expert Systems, and Intelligent Agents [18]. Currently, in the age of CI, people are trying to build machines emulating human behaviors, and one such application concerns rhythm perception. This paper presents an example of how to design and build an algorithm which is able to emulate human perception of rhythm. Two CI approaches, namely ANNs and Rough Sets (RS), are used in the experiments aiming at the estimation of musical salience. The first of them, the ANN model, concerns processes which are not entirely known, e.g., human perception of rhythm. The latter is the RS approach, introduced by Pawlak [16] and used by many researchers in data discovery and intelligent management [17], [18].

Since the applicability of ANNs in recognition has been experimentally confirmed in a number of areas, neural networks are also used here to estimate the rhythmic salience of sounds. There exists a vast literature on ANNs, and for this reason only a brief introduction to this area is presented in this paper. The structure of an ANN usually employs the McCulloch-Pitts neuron model with a modified, usually sigmoidal, activation function. All neurons are interconnected. Within the context of network topology, ANNs can be classified as feedforward or recurrent networks, the latter also being called feedback networks. In the case of recurrent ANNs the connections between units form cycles, while in feedforward ANNs the information moves in only one direction, i.e., forward. The elements of a vector of object features constitute the values which are fed to the input of an ANN. The type of data accepted at the input and/or returned at the output of an ANN is also a differentiating factor. Quantitative variable values are continuous by nature, whereas categorical variables belong to a finite set (small, medium, big, large). ANNs with continuous values at the input are able to determine the degree of membership to a certain class. The output of networks based on categorical variables may be Boolean, in which case the network decides whether an object belongs to a class or not. In the case of the salience problem the number of categorical output variables equals two, and it is determined whether the sound is accented or not.


In the experiments the Authors examined whether a supervised categorical network such as Learning Vector Quantization (LVQ) is sufficient to resolve the salience problem. The classification task of the network was to recognize a sound as accented or not. LVQs are self-organizing networks with the ability to learn and detect the regularities and correlations at their input, and then to adapt their responses to that input. An LVQ network is trained in a supervised manner; it consists of a competitive layer and a linear layer. The first one classifies the input vectors into subclasses, and the latter transforms the input vectors into target classes. On the other hand, the aim of the RS-based experiments was two-fold. First, it was to compare the results with the ones coming from the ANN. In addition, two schemes of data discretization were applied; in the case of k-means discretization, accuracies of predictions are reported.
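For illustration, a minimal LVQ1 variant can be sketched in a few lines of Python with NumPy. This is a generic textbook LVQ1, not the exact network configuration used in the experiments, and all function names and the toy data are ours: prototypes are attracted to training samples of their own class and repelled by samples of other classes.

import numpy as np

def train_lvq1(X, y, protos_per_class=2, lr=0.05, epochs=30, seed=0):
    rng = np.random.default_rng(seed)
    W, Wy = [], []
    for c in np.unique(y):   # initialise prototypes as random samples of each class
        idx = rng.choice(np.flatnonzero(y == c), protos_per_class, replace=True)
        W.append(X[idx]); Wy.append(np.full(protos_per_class, c))
    W, Wy = np.vstack(W).astype(float), np.concatenate(Wy)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            j = np.argmin(np.linalg.norm(W - X[i], axis=1))   # winning prototype
            sign = 1.0 if Wy[j] == y[i] else -1.0             # attract or repel
            W[j] += sign * lr * (X[i] - W[j])
    return W, Wy

def predict_lvq(W, Wy, X):
    return Wy[np.argmin(np.linalg.norm(W[None] - X[:, None], axis=2), axis=1)]

# toy data: one feature per sound (e.g., duration on the 0..127 MIDI scale)
X = np.array([[100.], [90.], [110.], [20.], [30.], [25.]])
y = np.array([1, 1, 1, 0, 0, 0])   # 1 = accented, 0 = not accented
W, Wy = train_lvq1(X, y)
print(predict_lvq(W, Wy, np.array([[95.], [28.]])))   # expected: [1 0]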

3 Experiments

3.1 Database

The presented experiments were conducted on MIDI files of eighty national anthems retrieved from the Internet. Storing information about meter in the files is necessary to indicate accented sounds in a musical piece. This information, however, is optional in MIDI files, thus the information whether a sound is accented or not is not always available. In addition, in a number of musical files retrieved from the Internet, the assigned meter is incorrect or there is no information about meter at all. This is why the correctness of meter was checked by inserting an additional simple drum track into the melody. The hits of the snare drum were inserted at the locations of the piece calculated with Formula (1), where T is a period computed with the autocorrelation function, and i indicates subsequent hits of the snare drum:

i · T,  i = 0, 1, 2, . . .   (1)

The Authors listened to the musical files with the snare drum hits inserted, and rejected all the files in which accented locations were indicated incorrectly. Also, some anthems with changes in time signature could not be included in the training and testing sets, because this metric rhythm retrieval method deals with hypotheses based on rhythmic levels of a constant period. Usually a change in time signature results in a change in the period of the rhythmic level corresponding to the meter; an example of such a change might be from 3/4 into 4/4. Conversely, an example of a change in time signature which does not influence the correct indication of accented sounds could be from 2/4 into 4/4. The salience experiments presented in this paper were conducted on polyphonic MIDI tracks containing melodies; overlapping sounds coming from tracks other than the melodic ones were not included in the experimental sets.

For the purpose of the experiments, the values of the physical attributes of sounds were normalized and discretized with the equal subrange technique. The minimum and maximum values within the domain of each attribute are found. The whole range


is then divided into m subranges, with thresholds between the subranges placed at the locations computed with the aid of Formula (2):

MinValue + (MaxValue − MinValue) · j/m  for j = 0, 1, 2, . . . , m.   (2)
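A direct reading of Formula (2) is shown in the short Python sketch below (illustrative only; the function names are ours): the thresholds are equally spaced between the minimum and maximum attribute values, and each value is mapped to the index of its subrange.

def subrange_thresholds(values, m):
    # thresholds of Formula (2): MinValue + (MaxValue - MinValue) * j / m, j = 0..m
    lo, hi = min(values), max(values)
    return [lo + (hi - lo) * j / m for j in range(m + 1)]

def discretize(value, thresholds):
    # index of the subrange containing value (last subrange closed on the right)
    for k in range(1, len(thresholds)):
        if value <= thresholds[k]:
            return k - 1
    return len(thresholds) - 2

durations = [30, 60, 60, 120, 240, 480]
t = subrange_thresholds(durations, 4)
print(t)                                      # [30.0, 142.5, 255.0, 367.5, 480.0]
print([discretize(d, t) for d in durations])  # [0, 0, 0, 0, 1, 3]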

3.2 ANN-Based Experiment

For the training phase, accented locations in each melody were found with the methods described in Section 3.1. One of the tested networks had three separate inputs, one for each physical attribute of a sound (duration, frequency, and amplitude; DPV). The three remaining networks had one input each. Each input took a different physical attribute of a given sound, namely D (duration), P (pitch, i.e., frequency), or V (velocity, i.e., amplitude). All attributes were from the range 0 to 127. The network output was binary: 1 if the sound was accented, or 0 if it was not. Musical data were provided to the networks to train them to recognize accented sounds on the basis of physical attributes. In this study the LVQ network recognized a sound as 'accented' or 'not accented'. Since physical attributes are not the only features determining whether a sound is accented, some network answers may be incorrect. The network accuracy NA was formulated as the ratio of the number of accented sounds which were correctly detected by the network to the total number of accented sounds in a melody, as stated in Formula (3):

NA = number of accented sounds correctly detected by the network / number of all accented sounds.   (3)

The hazard accuracy HA is the ratio of the number of accents given by the network to the number of all sounds in a set, as stated in Formula (4):

HA = number of accented sounds detected by the network / number of all sounds.   (4)

The melodies of the anthems were used to create 10 training/testing sets. Each set included 8 entire pieces. Each sound with an index divisible by 3 was assumed to be a training sound; the remaining sounds were treated as testing sounds. As a consequence, the testing set was twice as large as the training set. Accuracies in the datasets were averaged for each network separately; evaluating a separate accuracy for each ANN allowed for comparing their preciseness. Standard deviations were also calculated, and for each network the fraction equal to the standard deviation divided by the average value was computed. Such fractions help compare the stability of results: the lower the value of the fraction, the more stable the results. All results are shown on the right side of Table 1. The accuracy of finding accented sounds estimated for the four networks can be seen in Fig. 1; the plots are drawn on the basis of the data from Table 1.
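Formulas (3) and (4) amount to two one-line ratios; the small Python sketch below is illustrative only (toy labels, our function names):

def network_accuracy(pred, truth):
    # NA, Formula (3): correctly detected accented sounds / all accented sounds
    hits = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    return hits / sum(truth)

def hazard_accuracy(pred, truth):
    # HA, Formula (4): accents reported by the network / all sounds
    return sum(pred) / len(pred)

truth = [1, 0, 0, 1, 0, 0, 1, 0]   # 1 = accented
pred  = [1, 0, 1, 1, 0, 0, 0, 0]   # network answers
na, ha = network_accuracy(pred, truth), hazard_accuracy(pred, truth)
print(na, ha, na / ha)   # NA/HA measures the gain over a blind choice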



Table 1. Parameters of training and testing data and performance of ANNs

            Number of sounds              Acc./all        NA/HA
Set No.     All    Accented  Not acc.     [%]        D     P     V     DPV
1           937    387       550          41         1.90  1.01  0.95  1.96
2           1173   386       787          33         2.28  0.89  1.23  2.19
3           1054   385       669          37         2.14  0.96  0.11  2.13
4           937    315       622          34         2.25  1.13  0.79  2.49
5           801    293       508          37         1.98  1.02  1.04  1.95
6           603    245       358          41         1.67  1.02  0.93  1.24
7           781    332       449          43         1.93  0.98  1.16  1.89
8           880    344       536          39         2.06  0.97  1.13  2.14
9           867    335       532          39         1.91  0.87  0.83  1.73
10          1767   509       1258         29         2.14  0.72  1.62  2.66
Avg.        980    353       626          37         2.03  0.96  0.98  2.03
StdDev      317    71        251          4          0.19  0.11  0.39  0.39
StdDev/Avg                                           0.09  0.12  0.40  0.19

Fig. 1. Accuracy of four networks for melodies of anthems

There are three plots presenting the results of networks fed with one attribute only, and one plot for the network presented with all three physical attributes at its input (line DPV). The consecutive pairs of training and testing sets are on the horizontal axis; the fraction NA/HA, signifying how many times an approach is more accurate than a blind choice, is on the vertical axis.

3.3 Rough Set-Based Experiments

The aim of this experiment was to obtain results analogous to the ones coming from the ANN and to compare them with each other. In particular, it was expected to confirm whether physical attributes influence the tendency of sounds



to be located in accented positions. Further, it was to answer how complex the dependence of the rhythmic salience of a sound on its physical attributes is, and to observe the stability of the accuracies obtained in the RS-based experiments. In the rough set-based experiments, the dataset named RSESdata1 was split into training and testing sets in the 3:1 ratio. Then the rules were generated, utilizing a genetic algorithm available in the Rough Set Exploration System [1], [22]. For dataset RSESdata1, 7859 rules were obtained, resulting in a classification accuracy of 0.75 with the coverage equal to 1. It should be remembered that accuracy is a measure of classification success, defined as the ratio of the number of properly classified new cases (objects) to the total number of new cases. Rules with support less than 10 were then removed. The set of rules was thus reduced to 427 and the accuracy dropped to 0.736, with the coverage still remaining 1. Then the next attempt to further decrease the number of rules was made, and rules with support less than 30 were excluded. In this case, 156 rules were still valid but the accuracy dropped significantly, i.e. to 0.707, and at the same time the coverage fell to 0.99. It was decided that for a practical implementation of a rough set-based classifier, a set of 427 rules is suitable. Reducts used in rule generation are presented in Table 2.

The same approach was used for dataset RSESdata2, and resulted in 11121 rules with an accuracy of 0.742 and a coverage of 1. After removing rules with support less than 10, only 384 rules remained, and the accuracy dropped to 0.735. Again, such a number of rules is practically applicable. Reducts used in rule generation are presented in Table 3.

Table 2. Reducts for the RSESdata1 dataset

Reducts                Positive Region  Stability Coefficient
{duration, pitch}      0.460            1
{duration, velocity}   0.565            1
{pitch, velocity}      0.369            1
{duration}             0.039            1
{pitch}                0.002            1
{velocity}             0.001            1

Table 3. Reducts for the RSESdata2 dataset

Reducts                Positive Region  Stability Coefficient
{duration, velocity}   0.6956           1
{duration, pitch}      0.6671           1
{pitch, velocity}      0.4758           1
{duration}             0.0878           1
{pitch}                0.0034           1
{velocity}             0.0028           1



Table 4. Parameters of training and testing data and performance of RSES (RSA is a Rough Set factor, analogous to NA in ANNs)

            Number of sounds                  Acc/all       RSA/HA
Set No.     All testing  Accented  Not acc.   [%]        D     P     V      DPV
1           1679         610       1069       36.33      1.81  1.06  1.21   1.75
2           1679         608       1071       36.21      1.90  1.08  1.09   1.74
3           1679         594       1085       35.37      1.84  1.12  1.19   1.74
4           1679         638       1041       37.99      1.68  1.08  1.12   1.62
5           1679         632       1047       37.64      1.67  1.07  1.12   1.64
6           1679         605       1074       36.03      1.87  1.16  1.13   1.88
7           1679         573       1106       34.12      1.77  1.09  1.18   1.68
8           1679         618       1061       36.80      1.90  1.06  1.17   1.73
9           1679         603       1076       35.91      1.77  1.08  1.11   1.70
10          1679         627       1052       37.34      1.77  1.08  1.15   1.66
Avg.        1679         610       1068       36.37      1.80  1.09  1.15   1.72
StdDev      0            19.2      19.2       1.14       0.08  0.02  0.039  0.07
StdDev/Avg                                               0.04  0.02  0.033  0.04

The approach taken for the LVQ network was also implemented for rough sets. Ten different training/test sets were acquired by randomly splitting data into five pairs, and then each set in a pair was further divided into two sets – a training and a testing one – with the 2:1 ratio. Therefore, testing sets contained 1679 objects each. The experiments, however, were based on the RSESdata1 set because of its higher generalization ability (see Table 4). It should be remembered that a reduct is a set of attributes that discerns objects with different decisions. The positive region shows what part of the indiscernibility classes for a reduct is inside the rough set. The larger the boundary regions are, the more rules are nondeterministic, and the smaller the positive region is. The stability coefficient reveals whether the reduct appears also for subsets of the original dataset, which are calculated during the reduct search. For the reduct {duration} the positive region is very small, but during classification a voting method is used to infer the correct outcome from many nondeterministic rules, and, finally, high accuracy is obtained. Adding another dimension, e.g. {duration, velocity}, results in a higher number of deterministic rules and a larger positive region, but it does not guarantee an accuracy increase (Table 4). Rules were generated utilizing different reduct sets (compare with Table 1):
- D – {duration} only;
- P – {pitch} only;
- V – {velocity} only;
- DPV – all 6 reducts {duration, velocity}, {duration, pitch}, {pitch, velocity}, {duration}, {pitch}, {velocity} have been employed.

k-NN Discretization. The data were also analyzed employing the k-NN method, which is implemented as a part of the RSES system [22]. The experiment was carried out differently in comparison to the previously performed experiments using ANN (LVQ) and RS. The reason for this was to observe the accuracy of classification while various values of k were set. It may easily be observed that a lower number of clusters implies better accuracy of the predictions and a smaller


Table 5. Cut points in the case of k=3

Duration  45.33   133.88
Pitch     44.175  78.285
Velocity  44.909  75.756

Table 6. Classification results for k=3

                     1     0      No. of obj.  Accuracy  Coverage
1                    665   206    899          0.763     0.969
0                    375   1,190  1,570        0.76      0.997
True positive rate   0.64  0.85

Table 7. Cut points in the case of k=4

Duration  38.577  98.989  198.56
Pitch     25.622  51.85   79.988
Velocity  41.18   65.139  89.67

Table 8. Classification results for k=4

                     1     0      No. of obj.  Accuracy  Coverage
1                    640   220    899          0.744     0.957
0                    353   1,202  1,570        0.773     0.99
True positive rate   0.64  0.85

number of rules generated. In the following experiments, full attribute vectors [“duration”, “pitch”, “velocity”] are used as reducts. The k-means discretization is performed, where the k values are set manually: k = {3, 4, 5, 10, 15, 20}. For a given k, exactly k clusters are calculated, represented by their center points. A cut point is set as the middle point between two neighboring cluster centers (a code sketch of this procedure is given after Experiment VI). Cut points are used for attribute discretization and then rough set rules are generated. The training set comprises 7407 objects and the testing one 2469 objects (3:1 ratio).

Experiment I – k-means discretization (k=3) of each attribute (“duration”, “pitch”, “velocity”), 872 rules. Cut points are shown in Table 5 and classification results in Table 6 (total accuracy: 0.761; total coverage: 0.987).

Experiment II – k-means discretization (k=4) of each attribute (“duration”, “pitch”, “velocity”), 1282 rules. Cut points are shown in Table 7 and classification results in Table 8 (total accuracy: 0.763; total coverage: 0.978).

Experiment III – k-means discretization (k=5) of each attribute (“duration”, “pitch”, “velocity”), 1690 rules. Cut points are shown in Table 9 and classification results in Table 10 (total accuracy: 0.766; total coverage: 0.967).



Table 9. Cut points in the case of k=5

Duration  31.733  72.814  133.91  259.15
Pitch     24.224  47.536  68.629  94.84
Velocity  27.826  48.708  66.759  89.853

Table 10. Classification results for k=5

                     1     0      No. of obj.  Accuracy  Coverage
1                    619   232    899          0.727     0.947
0                    326   1,211  1,570        0.788     0.979
True positive rate   0.66  0.84

Table 11. Cut points in the case of k=10

Duration  11.11   27.319  44.375  62.962  86.621  121.62  174.66  264.32  642.94
Pitch     18.089  35.259  47.046  56.279  64.667  73.161  82.648  95.963  119.15
Velocity  14.558  27.307  36.02   42.846  49.768  57.81   67.94   81.992  102.45

Table 12. Classification results for k=10

                     1     0      No. of obj.  Accuracy  Coverage
1                    533   227    899          0.701     0.845
0                    253   1,162  1,570        0.821     0.901
True positive rate   0.68  0.84

Table 13. Cut points in the case of k=15

Duration  3.2372  8.8071  13.854  20.693  29.284  37.857  46.806  58.942  76.842  101.33  132.65  183.95  283.39  656.29
Pitch     9.4427  23.023  33.148  41.285  48.225  53.264  57.875  64.17   70.125  74.465  79.687  86.909  97.988  119.79
Velocity  15.139  28.128  37.217  44.091  49.015  52.011  53.913  56.751  60.378  63.963  68.747  75.963  86.914  104.36

Experiment IV – k-means discretization (k=10) of each attribute (“duration”, “pitch”, “velocity”), 2987 rules. Cut points are shown in Table 11 and classification results in Table 12 (total accuracy: 0.779; total coverage: 0.881).


Table 14. Classification results for k=15

                     1     0      No. of obj.  Accuracy  Coverage
1                    492   217    899          0.694     0.789
0                    229   1,121  1,570        0.83      0.86
True positive rate   0.68  0.84

Table 15. Cut points in the case of k=20

Duration  2.920   7.388   11.601  16.635  21.486  26.464  31.96   38.505  45.68   53.632
          63.169  74.84   88.631  107.31  132.67  173.43  235.67  326.34  672.55
Pitch     7.584   18.309  25.782  31.629  35.553  38.184  40.174  41.5    42.674  44.977
          49.35   54.714  60.045  65.348  70.631  76.774  84.981  97.402  119.79
Velocity  9.603   22.588  33.807  43.158  50.163  55.096  58.528  61.725  65.283  68.596
          71.363  73.564  75.624  78.691  83.434  89.438  96.862  106.57  121.49

Table 16. Classification results for k=20

                     1     0      No. of obj.  Accuracy  Coverage
1                    476   223    899          0.681     0.778
0                    233   1,122  1,570        0.828     0.863
True positive rate   0.67  0.83

Experiment V – k-means discretization (k=15) of each attribute (“duration”, “pitch”, “velocity”), 3834 rules. Cut points are shown in Table 13 and classification results in Table 14 (total accuracy: 0.783; total coverage: 0.834).

Experiment VI – k-means discretization (k=20) of each attribute (“duration”, “pitch”, “velocity”), 4122 rules. Cut points are shown in Table 15 and classification results in Table 16 (total accuracy: 0.778; total coverage: 0.832).

Retrieving rhythmical patterns, together with the hierarchical structure of rhythm acquired with machine learning, is a step towards an application capable of creating an automatic drum accompaniment to a given melody. Such a computer system is presented in Section 4.
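As announced above, the cut-point procedure can be sketched as follows (our illustration with a plain one-dimensional k-means, not the RSES implementation; kmeans_1d and cut_points are our names):

import random

def kmeans_1d(values, k, iters=50):
    # Plain 1-D k-means; returns the k sorted cluster centers.
    centers = sorted(random.sample(values, k))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        centers = sorted(sum(c) / len(c) if c else centers[i]
                         for i, c in enumerate(clusters))
    return centers

def cut_points(values, k):
    # A cut point is the middle point between two neighboring centers.
    centers = kmeans_1d(values, k)
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]

Note that k clusters yield k − 1 cut points, which agrees with the counts in Tables 5, 7, 9, 11, 13 and 15.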

4 Automatic Drum Accompaniment Application

The hypermetric rhythm retrieval approach proposed in this article is illustrated with a practical application: a system automatically generating a drum accompaniment to a given melody. A stream of sounds in MIDI format is introduced at the system input; on the basis of the musical content the method retrieves a hypermetric structure of rhythm of a musical piece, consisting of rhythmic motives, phrases, and sentences. The method does not use any information about rhythm



Fig. 2. The tree of periods

(time signature), which is often present in MIDI files. Neither rhythmic tracks nor harmonic information are used to support the method. The only information analyzed is a melody, which might be monophonic as well as polyphonic. Two elements are combined, namely the recurrence of melodic and rhythmic patterns and the rhythmic salience of sounds, to create a machine able to find the metric structure of rhythm for a given melody. The method proposed by the authors of this paper generates rhythmic levels first, and then uses these levels to compose rhythmic hypotheses. The lowest rhythmic level has the phase of the first sound of the piece and its period is atomic. The following levels have periods of values achieved by recursive multiplication of periods that have already been calculated (starting from the atomic value) by the most common prime numbers in Western music, i.e. 2 and 3. The process of period generation may be illustrated as a process of tree structure formation (Figure 2), with a root representing the atomic period equal to 1. Each node is represented by a number which is the node's ancestor number multiplied by either 2 or 3. The tree holds some duplicates: a node holding a duplicated value would generate a sub-tree all of whose nodes would also be duplicates of already existing values. Thus duplicate subtrees are eliminated and we obtain a graphical interpretation in the form of the period triangle (see Figure 3), where the top row refers to a quarter-note, and consecutively to a half-note, whole note (motive), phrase, sentence and period. When the phase of period creation is completed, each period must have all its phases (starting from phase 0) generated. The last phase of a given rhythmic level has a value equal to the size of the period decreased by one atomic period. In order to obtain hypotheses from the generated rhythmic levels, it is necessary to find all families of related rhythmic levels; a level may belong to many families. The generated hypotheses are instantly ranked to extract the one which designates the appropriate rhythm of the piece. The hypotheses that cover notes of significant rhythmic weights are ranked higher. The weights are calculated based on the knowledge gathered by learning systems that know how to assess the importance of the physical characteristics of the sounds that comprise the piece.
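A minimal sketch of the period generation described above (our illustrative Python, assuming an atomic period equal to 1; it reproduces the layers of the triangle in Figure 3):

def period_layers(n_layers):
    # Layer 1 holds the atomic period; each next layer holds the previous
    # layer's values multiplied by 2 or 3, with duplicates removed.
    layers = [[1]]
    for _ in range(n_layers - 1):
        layers.append(sorted({p * m for p in layers[-1] for m in (2, 3)}))
    return layers

# period_layers(6) -> [1], [2, 3], [4, 6, 9], [8, 12, 18, 27],
# [16, 24, 36, 54, 81], [32, 48, 72, 108, 162, 243]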



Fig. 3. Triangle of periods

Table 17. Drum instruments added at a particular rhythmic level

Rhythmic level  Name of the instrument
1               Closed hi-hat
2               Bass drum
3               Snare drum
4               Open triangle
5               Splash cymbal
6               Long whistle
7               Chinese cymbal

The system proposed by the authors employs rules obtained in the process of data mining [11], [26], as well as from the operation of neural networks [6], and through employing rough sets [12]. Taking a set of representative musical objects as grounds, these systems learn how to assess the influence of a sound's relative frequency, amplitude and length on its rhythmic weight. The second group of methods used to rank hypotheses is based on one of the elementary rules known in music composition, i.e. the recurrence of melodic and rhythmic patterns – this group is described in the authors' works [10], [27]. The application realizing an automatic accompaniment, called DrumAdd, accepts a MIDI file at its input. The accompaniment is added to the melody by inserting a drum channel, whose number is 10 in the MIDI file. Hi-hat hits are inserted in the locations of rhythmic events associated with the first rhythmic



Fig. 4. User interface of an automatic drum accompaniment application

level. The consecutive drum instruments associated with higher rhythmic levels are: bass drum, snare drum, open triangle, splash cymbal, long whistle and Chinese cymbal, as shown in Table 17. The DrumAdd system was developed in Java. The main window of the system can be seen in Figure 4 – the user interface shown is entitled ‘Hypotheses’. Default settings of quantization are as follows:
- onsets of sounds are shifted to a time grid of one-eighth note,
- durations of sounds are natural multiples of one-eighth note,
- notes shorter than one-sixteenth note are deleted.
A user may easily change the quantization settings. A hypothesis ranking method can be chosen in a drop-down list (‘Salience – Duration’ in the case presented). A user may listen to the accompaniment made on the basis of the hypothesis (link ‘Listen’), change the drum sounds associated with the consecutive rhythmic levels (link ‘Next...’) or acknowledge the given hypothesis as correct (link ‘Correct’). A user also has access to a report and a ranking of hypotheses, which presents a table with accuracies corresponding to the hypothesis ranking methods. The drum accompaniment is generated automatically for the sample melodies contained in the system. As a result, some sample pieces contain a drum track created strictly with the approach presented earlier. In the second group of examples, the accompaniment is created on the basis of a metric structure retrieved automatically.


5 Algorithm Complexity

This section addresses the problem of the computational complexity of the algorithm. Three phases of the algorithm engineered by the authors, namely creating periods, simplified hypotheses and full hypotheses, are examined. The analyses of the computational complexity of the proposed method assume that the engineered method is expected to rank rhythmic hypotheses formed of three rhythmic levels above meter. This proved to be sufficient for providing automatic drum accompaniment for a given melody without delay. The method creates all possible rhythmic structures. However, their number is limited and depends on the following factors [28]:
– The level designated as the lowest among all the created hypotheses (this defines the parameter of sound length quantization). The authors observed that quantization with the resolution of a quarter-note is sufficient.
– The intricacy of the hypotheses, i.e. how many levels they contain. The method was examined for at most three rhythmic levels above meter, similarly as in the research conducted by Rosenthal [20], and Temperley and Sleator [21].
Taking the above assumptions into consideration, i.e. the quantization parameter being a quarter-note and the analysis of a hypothesis concerning three levels above meter, we obtain the number of periods from the first 6 layers of the triangle shown in Figure 3. The atomic period is a quarter-note (layer 1), the layer containing periods 4, 6, 9 is the level of meter, and the sixth layer holding the values of 32, 48, 72 . . . is the last examined rhythmic level.

Calculating periods. The number of periods is n·(n+1)/2, where n is the number of layers, so the algorithm is polynomial; the function of the computational complexity is of class O(n²). The basic operation that calculates periods is multiplication. The number of periods calculated for 6 layers is 21, and these are the elements of a period list.

Creating hypotheses. Hypotheses (with periods only) are lists of related rhythmic levels that include pairs of values. If we take only periods into consideration, the number of hypotheses is the number of paths starting from the highest rhythmic level (layer 6) and ending at the level of the atomic period (layer 1). For the assumed parameters, this gives 32 hypotheses if only periods are defined. The number is a result of the following computations:
– from period 32 there is one path (32, 16, 8, 4, 2, 1),
– from period 48 there are 5 paths,
– from period 72 there are 10 paths.
For the left half of the triangle we may specify 16 paths. The computations for the right half, i.e. the paths including periods 108, 162, and 243, are analogous. This gives 32 paths altogether in a 6-layer triangle. The function of computational complexity is of class O(2ⁿ), where n is the number of layers. Thus, the complexity is exponential; with n limited to 6 layers, the number of hypotheses is confined to 32.



Table 18. Rhythmic hypotheses (without phases) for a 6-layer triangle of periods

Layer:  1   2   3   4    5    6
        1   2   4   8    16   32
        1   2   4   8    16   48
        1   2   4   8    24   48
        1   2   4   8    24   72
        1   2   4   12   24   48
        1   2   4   12   24   72
        1   2   4   12   36   72
        1   2   4   12   36   108
        1   2   6   12   24   48
        1   2   6   12   24   72
        1   2   6   12   36   72
        1   2   6   12   36   108
        1   2   6   18   36   72
        1   2   6   18   36   108
        1   2   6   18   54   108
        1   2   6   18   54   162
        1   3   6   12   24   48
        1   3   6   12   24   72
        1   3   6   12   36   72
        1   3   6   12   36   108
        1   3   6   18   36   72
        1   3   6   18   36   108
        1   3   6   18   54   108
        1   3   6   18   54   162
        1   3   9   18   36   72
        1   3   9   18   36   108
        1   3   9   18   54   108
        1   3   9   18   54   162
        1   3   9   27   54   108
        1   3   9   27   54   162
        1   3   9   27   81   162
        1   3   9   27   81   243

The rows of Table 18 show the subsequent simplified hypotheses, i.e. the ones that contain only periods (phases are ignored), for the example from Figures 2 and 3. The algorithm that creates hypotheses with periods only ranks rhythmic hypotheses based on the recurrence of melorhythmic patterns (16 methods proposed in the thesis of Wojcik [27]). The basic operation of the pattern recurrence evaluation is in this case addition. The only hypothesis ranking method examined by the authors that requires the phases to be defined is the method based on rhythmic weights.
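A small sketch enumerating these simplified hypotheses (ours; it confirms the count of 32 for a 6-layer triangle):

def simplified_hypotheses(n_layers):
    # Each hypothesis is a path of periods, extended at every layer by
    # multiplying the last period by 2 or 3.
    hyps = [[1]]
    for _ in range(n_layers - 1):
        hyps = [h + [h[-1] * m] for h in hyps for m in (2, 3)]
    return hyps

assert len(simplified_hypotheses(6)) == 2 ** (6 - 1) == 32
# The 32 resulting paths are exactly the rows of Table 18.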



Creating hypotheses with phases. Each hypothesis may have as many versions, with regard to phases, as its longest period is; e.g. the first hypothesis from Table 18 (the first row: 1, 2, 4, 8, 16, 32) will have 32 different phases. On condition that n = 6, the number of all hypotheses for the discussed example will amount to 3125, which is the sum of all periods from layer 6. Thus, the number of all hypotheses is the sum of the values from the last column of Table 18. The algorithm that forms hypotheses with phases is used in a method ranking rhythmic hypotheses based on rhythmic weight. The elementary operation of this method is addition. To analyze a piece of music with regard to its motives, phrases, sentences and periods when its atomic period is defined as a quarter-note, the number of 6 layers (n=6) is sufficient. Despite the exponential complexity of the method, the number of elementary operations is not more than 10⁴ on a 1.6 GHz computer. The total time of all operations for a single piece of music is imperceptible for a system user, which was proved by the experimental system engineered by the authors. This means that the method provides high quality automatic drum accompaniment without delay.

6 Concluding Remarks

Employing a computational approach is helpful in retrieving the time signature and the locations of barlines from a piece on the basis of its content only. The rhythmic salience approach worked out and described in this paper may also be valuable in ranking rhythmic hypotheses and in music transcription. A system creating a drum accompaniment to a given melody automatically, on the basis of a highly ranked rhythmic hypothesis, is a useful practical application of the rhythmic salience method. A prototype of such a system, using the salience approach, was developed on the basis of the findings of the authors of this paper, and it works without delay, even though its computational complexity is quite considerable. On the basis of the results (see Tables 1, 4) obtained for both the RS and ANN experiments, it may be observed that the average accuracy of all approaches taking duration D into account – solely or in the combination of all three attributes DPV – is about twice as good as hazard accuracy (values of 1.72 for Rough Set DPV, 1.80 for Rough Set D, and a value of 2.03 both for Network D and for Network DPV were achieved). The performance of approaches considering pitch P and velocity V separately is very close to random accuracy; the values are equal to 1.09 and 1.15 for Rough Sets. For the ANN, the values are 0.96 and 0.98, respectively. Thus, it can be concluded that the location of a sound depends only on its duration. The algorithms with the combination of DPV attributes performed as well as the one based only on duration; however, this is especially valid for ANNs, while rough sets did slightly worse. Additional attributes do not increase the performance of the ANN approach. It can thus be concluded that the rhythmic salience depends on physical attributes in a simple way, namely it depends on a single physical attribute – duration.



Network D is the ANN that returns the most stable results. The value of the stability fraction in the last (StdDev/Avg) row of Table 1 is low for this network and equals 0.09. Network DPV, which takes all attributes into account, is much less reliable, because its stability fraction is about twice as high as that of Network D and equals 0.19. The stability of Network P, considering the pitch, is quite high (it equals 0.12), but its performance is close to a random choice. For the learning and testing data used in this experiment, velocity appeared to be the most data-sensitive attribute (see the results of Network V). Additionally, this network appeared to be unable to find accented sounds. In the case of Rough Sets, the duration-based approaches D and DPV returned less stable results than the P and V approaches. Values of 0.045, 0.043, 0.026, and 0.033 were obtained for D, DPV, P, and V, respectively. The ANN salience-based experiments described in the earlier work by the authors [7] were conducted on a database of musical files containing various musical genres. It consisted of monophonic (non-polyphonic) and polyphonic files. Also, a verification of the association rules model of the Data Mining domain for musical salience estimation was presented in that paper. The conclusions derived from the experiments conducted on national anthems for the purpose of this paper are consistent with the ones described in the work by Kostek et al. [7]. Thus, the ANNs can be used in systems of musical rhythm retrieval in a wide range of genres and regardless of whether the music is monophonic or polyphonic. The average relative accuracy for duration-based approaches where Rough Sets are used is lower than that obtained by LVQ ANNs. However, the same tendency is noticeable – utilization of the duration parameter leads to successful classification. The P (pitch) and V (velocity) parameters appeared not to be important in making decisions about the rhythmical structure of a melody. Finally, using different discretization schemes instead of the equal subrange technique does not change the accuracy of rough sets-based rhythm classification significantly.

Acknowledgments The research was partially supported by the Polish Ministry of Science and Education within the project No. PBZ-MNiSzW-02/II/2007.

References

1. Bazan, J.G., Szczuka, M.S.: The Rough Set Exploration System. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets III. LNCS, vol. 3400, pp. 37–56. Springer, Heidelberg (2005)
2. van Belle, W.: BPM Measurement of Digital Audio by means of Beat Graphs & Ray Shooting. Department of Computer Science, University of Tromsø (retrieved 2004), http://bio6.itek.norut.no/werner/Papers/bpm04/
3. Dahl, S.: On the beat – Human movement and timing in the production and perception of music. Ph.D. Thesis, KTH Royal Institute of Technology, Stockholm, Sweden (2005)



4. Dixon, S.: Automatic Extraction of Tempo and Beat from Expressive Performances. J. of New Music Research 30(1), Swets & Zeitlinger, 39–58 (2001)
5. Huron, D.: Review of Harmony: A Psychoacoustical Approach (Parncutt, 1989). Psychology of Music 19(2), 219–222 (1991)
6. Kostek, B., Wójcik, J.: Machine Learning System for Estimation Rhythmic Salience of Sounds. Int. J. of Knowledge-Based and Intelligent Engineering Systems 9, 1–10 (2005)
7. Kostek, B., Wójcik, J., Holonowicz, P.: Estimation the Rhythmic Salience of Sound with Association Rules and Neural Networks. In: Proc. of the Intern. IIS: IIPWM 2005, Intel. Information Proc. and Web Mining, Advances in Soft Computing, pp. 531–540. Springer, Sobieszewo (2005)
8. Kostek, B.: Perception-Based Data Processing in Acoustics. Applications to Music Information Retrieval and Psychophysiology of Hearing. Series on Cognitive Technologies. Springer, Heidelberg (2005)
9. Kostek, B.: Applying computational intelligence to musical acoustics. Archives of Acoustics 32(3), 617–629 (2007)
10. Kostek, B., Wójcik, J.: Automatic Retrieval of Musical Rhythmic Patterns. In: 119th Audio Engineering Soc. Convention, New York (2005)
11. Kostek, B., Wójcik, J.: Automatic Salience-Based Hypermetric Rhythm Retrieval. In: International Workshop on Interactive Multimedia and Intelligent Services in Mobile and Ubiquitous Computing, Seoul, Korea. IEEE CS, Los Alamitos (2007)
12. Kostek, B., Wójcik, J., Szczuko, P.: Searching for Metric Structure of Musical Files. In: Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 774–783. Springer, Heidelberg (2007)
13. Lerdahl, F., Jackendoff, R.: A Generative Theory of Tonal Music. MIT Press, Cambridge (1983)
14. McAuley, J.D., Semple, P.: The effect of tempo and musical experience on perceived beat. Australian Journal of Psychology 51(3), 176–187 (1999)
15. Parncutt, R.: Harmony: A Psychoacoustical Approach. Springer, Berlin (1989)
16. Pawlak, Z.: Rough Sets. Internat. J. Computer and Information Sciences 11, 341–356 (1982)
17. Pawlak, Z., Skowron, A.: Rudiments of rough sets. Information Sciences 177, 3–27 (2007)
18. Peters, J.F., Skowron, A. (eds.): Transactions on Rough Sets V. LNCS, vol. 4100. Springer, Heidelberg (2004-2008)
19. Rosenthal, D.F.: Emulation of human rhythm perception. Comp. Music J. 16(1), 64–76 (Spring, 1992)
20. Rosenthal, D.F.: Machine Rhythm: Computer Emulation of Human Rhythm Perception. Ph.D. Thesis, MIT Media Lab, Cambridge, Mass. (1992)
21. Temperley, D., Sleator, D.: Modeling meter and harmony: A preference-rule approach. Comp. Music J. 15(1), 10–27 (1999)
22. RSES Homepage, http://logic.mimuw.edu.pl/~rses
23. Wieczorkowska, A., Czyzewski, A.: Rough Set Based Automatic Classification of Musical Instrument Sounds. Electr. Notes Theor. Comput. Sci. 82(4) (2003)
24. Wieczorkowska, A., Raś, Z.W.: Editorial: Music Information Retrieval. J. Intell. Inf. Syst. 21(1), 5–8 (2003)
25. Wikipedia homepage
26. Wójcik, J., Kostek, B.: Intelligent Methods for Musical Rhythm Finding Systems. In: Nguyen, N.T. (ed.) Intelligent Technologies for Inconsistent Processing. International Series on Advanced Intelligence, vol. 10, pp. 187–202 (2004)



27. Wójcik, J.: Methods of Forming and Ranking Rhythmic Hypotheses in Musical Pieces. Ph.D. Thesis, Electronics, Telecommunications and Informatics Faculty, Gdansk Univ. of Technology, Gdansk (2007)
28. Wójcik, J., Kostek, B.: Computational Complexity of the Algorithm Creating Hypermetric Rhythmic Hypotheses. Archives of Acoustics 33(1), 57–63 (2008)

FUN: Fast Discovery of Minimal Sets of Attributes Functionally Determining a Decision Attribute

Marzena Kryszkiewicz and Piotr Lasek

Institute of Computer Science, Warsaw University of Technology
Nowowiejska 15/19, 00-665 Warsaw, Poland
{mkr,p.lasek}@ii.pw.edu.pl

Abstract. In this paper, we present our Fun algorithm for discovering minimal sets of conditional attributes functionally determining a given dependent attribute. In particular, the algorithm is capable of discovering Rough Sets certain decision, generalized decision, and membership distribution reducts. Fun can operate either on partitions of objects or alternatively on stripped partitions, which do not store singleton groups. It is capable of using functional dependencies occurring among conditional attributes for pruning candidate dependencies. In this paper, we offer a further reduction of stripped partitions, which allows correct determination of minimal functional dependencies provided optional candidate pruning is not carried out. In the paper we consider six variants of Fun, including two new variants using reduced stripped partitions. We have carried out a number of experiments on benchmark data sets to test the efficiency of all variants of Fun. We have also tested the efficiency of the Fun variants against the Rosetta and RSES toolkits' algorithms computing all reducts and against Tane, which is one of the most efficient algorithms computing all minimal functional dependencies. The experiments prove that Fun is up to 3 orders of magnitude faster than the Rosetta and RSES toolkits' algorithms and up to 30 times faster than Tane.

Keywords: Rough Sets, information system, decision table, reduct, functional dependency.

1 Introduction

The determination of minimal functional dependencies is a standard task in the area of relational databases. Tane [6] and Dep-Miner [14] are examples of efficient algorithms for discovering minimal functional dependencies from relational databases. A variant of the task, which consists in discovering minimal sets of conditional attributes that functionally or approximately determine a given decision attribute, is one of the topics of Artificial Intelligence and Data Mining. Such sets of conditional attributes can be used, for instance, for building classifiers. In terms of Rough Sets, such minimal sets of conditional attributes are called reducts [18]. One can distinguish a number of types of reducts. Generalized decision reducts (or equivalently, possible/approximate reducts [9]), membership distribution reducts (or equivalently, membership reducts [9]), and



certain decision reducts belong to the most popular Rough Sets reducts. In general, these types of reducts do not determine the decision attribute functionally. However, it was shown in [10] that these types of reducts are minimal sets of conditional attributes functionally determining appropriate modifications of the decision attribute. Thus, the task of searching for such reducts is equivalent to looking for minimal sets of attributes functionally determining a given attribute. In this paper, we focus on finding all such minimal sets of attributes. To this end, one might consider either applying methods for discovering Rough Sets reducts, or discovering all minimal functional dependencies and then selecting those that determine a requested attribute. A number of methods for discovering different types of reducts have already been proposed in the literature, e.g. [3-5], [7-8], [11-12], [15-29]. The most popular methods are based on discernibility matrices [21]. Unfortunately, the existing methods for discovering all reducts are not scalable. The recently offered algorithms for finding all minimal functional dependencies are definitely faster. In this paper, we focus on the direct discovery of all minimal functional dependencies with a given dependent attribute, and expect this process to be faster than the discovery of all minimal functional dependencies. First, we present the efficient Fun algorithm, which we offered recently [12]. Fun discovers minimal functional dependencies with a given dependent attribute and, in particular, is capable of discovering the three above-mentioned types of reducts. Fun can operate either on partitions of objects or alternatively on stripped object partitions, which do not store singleton groups. It is capable of using functional dependencies occurring among conditional attributes, which are found as a side effect, for pruning candidate dependencies. In this paper, we extend our proposal from [12]. We offer further full and partial reduction of stripped partitions, which allows correct determination of minimal functional dependencies provided optional candidate pruning is not carried out. Then, we compare the efficiency of the two new variants of Fun and the four other variants of this algorithm proposed in [12]. We also test the efficiency of the Fun variants against the Rosetta and RSES toolkits' algorithms computing all reducts and against Tane, which is one of the most efficient algorithms computing all minimal functional dependencies. The layout of the paper is as follows: basic notions of information systems, functional dependencies, decision tables and reducts are recalled in Section 2. In Section 3, we present the Fun algorithm. An entirely new contribution is presented in Subsection 3.5, where we describe how to reduce stripped partitions and provide two new variants of the Fun algorithm. The experimental evaluation of the 6 variants of Fun, as well as of the Rosetta and RSES toolkits' algorithms and Tane, is reported in Section 4. We conclude our results in Section 5.

2 Basic Notions

2.1 Information Systems

An information system is a pair S = (O, AT), where O is a non-empty finite set of objects and AT is a non-empty finite set of attributes of these objects. In the



sequel, a(x), a ∈ AT and x ∈ O, denotes the value of attribute a for object x, and Va denotes the domain of a. Each subset of attributes A ⊆ AT determines a binary A-indiscernibility relation IND(A) consisting of pairs of objects indiscernible wrt. attributes A; that is, IND(A) = {(x, y) ∈ O×O | ∀a∈A a(x) = a(y)}. IND(A) is an equivalence relation and determines a partition of O, which is denoted by πA. The set of objects indiscernible with an object x with respect to A in S is denoted by IA(x) and is called an A-indiscernibility class; that is, IA(x) = {y ∈ O | (x, y) ∈ IND(A)}. Clearly, πA = {IA(x) | x ∈ O}.

Table 1. Sample information system S = (O, AT), where AT = {a, b, c, e, f}

oid  a  b  c  e  f
1    1  0  0  1  1
2    1  1  1  1  2
3    0  1  1  0  3
4    0  1  1  0  3
5    0  1  1  2  2
6    1  1  0  2  2
7    1  1  0  2  2
8    1  1  0  2  2
9    1  1  0  3  2
10   1  0  0  3  2

Example 2.1.1. Table 1 presents a sample information system S = (O, AT), where O is the set of ten objects and AT = {a, b, c, e, f} is the set of attributes of these objects.
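To make the notion concrete, here is a minimal Python sketch (ours, not the authors' code) computing the partition πA for the information system of Table 1:

from collections import defaultdict

# The information system of Table 1: oid -> attribute values.
TABLE1 = {
    1: dict(a=1, b=0, c=0, e=1, f=1),  2: dict(a=1, b=1, c=1, e=1, f=2),
    3: dict(a=0, b=1, c=1, e=0, f=3),  4: dict(a=0, b=1, c=1, e=0, f=3),
    5: dict(a=0, b=1, c=1, e=2, f=2),  6: dict(a=1, b=1, c=0, e=2, f=2),
    7: dict(a=1, b=1, c=0, e=2, f=2),  8: dict(a=1, b=1, c=0, e=2, f=2),
    9: dict(a=1, b=1, c=0, e=3, f=2), 10: dict(a=1, b=0, c=0, e=3, f=2),
}

def partition(system, attrs):
    # Group oids by their value vector on attrs; the groups are the
    # A-indiscernibility classes forming pi_A.
    groups = defaultdict(list)
    for oid, row in system.items():
        groups[tuple(row[a] for a in attrs)].append(oid)
    return sorted(groups.values())

print(partition(TABLE1, ['c', 'e']))
# [[1], [2], [3, 4], [5], [6, 7, 8], [9, 10]]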

2.2 Functional Dependencies

Functional dependencies are of high importance in designing relational databases. We recall this notion after [2]. Let S = (O, AT) and A, B ⊆ AT. A → B is defined to be a functional dependency (or A is defined to determine B functionally), if ∀x∈O IA(x) ⊆ IB(x). A functional dependency A → B is called minimal, if ∀C⊂A C → B is not functional.

Example 2.2.1. Let us consider the information system in Table 1. {ce} → {a} is a functional dependency; nevertheless, {c} → {a}, {e} → {a}, and ∅ → {a} are not. Hence, {ce} → {a} is a minimal functional dependency.

Property 2.2.1. Let A, B, C ⊆ AT.
a) If A → B is a functional dependency, then ∀C⊃A C → B is functional.
b) If A → B is not functional, then ∀C⊂A C → B is not functional.
c) If A → B is a functional dependency, then ∀C⊃A C → B is a non-minimal functional dependency.
d) If A → B and B → C are functional dependencies, then A → C is a non-minimal functional dependency.
e) If A ⊂ B, B ∩ C = ∅, and A → B is a functional dependency, then B → C is not a minimal functional dependency.



Functional dependencies can be calculated by means of partitions [6] as follows:

Property 2.2.2. Let A, B ⊆ AT. A → B is a functional dependency iff πA = πAB iff |πA| = |πAB|.

Example 2.2.2. Let us consider the information system in Table 1. We observe that π{ce} = π{cea} = {{1}, {2}, {3, 4}, {5}, {6, 7, 8}, {9, 10}}. The equality of π{ce} and π{cea} (or of their cardinalities) is sufficient to conclude that {ce} → {a} is a functional dependency.

The next property recalls a method of calculating a partition with respect to an attribute set C by intersecting partitions with respect to subsets of C. Let A, B ⊆ AT. The product of partitions πA and πB, denoted by πA ∩ πB, is defined as πA ∩ πB = {Y ∩ Z | Y ∈ πA and Z ∈ πB}.

Property 2.2.3. Let A, B, C ⊆ AT and C = A ∪ B. Then, πC = πA ∩ πB.
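Property 2.2.2 translates directly into code; a minimal sketch (ours), reusing the partition function and TABLE1 from the sketch in Section 2.1:

def holds_fd(system, A, B):
    # A -> B is a functional dependency iff |pi_A| = |pi_AB| (Property 2.2.2).
    AB = sorted(set(A) | set(B))
    return len(partition(system, A)) == len(partition(system, AB))

print(holds_fd(TABLE1, ['c', 'e'], ['a']))  # True: {ce} -> {a} holds
print(holds_fd(TABLE1, ['c'], ['a']))       # False: {c} -> {a} does not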

2.3 Decision Tables, Reducts and Functional Dependencies

A decision table is an information system DT = (O, AT ∪ {d}), where d ∉ AT is a distinguished attribute called the decision, and the elements of AT are called conditions. A decision class is defined as the set of all objects with the same decision value. By Xdi we will denote the decision class consisting of objects the decision value of which equals di, where di ∈ Vd. Clearly, for any object x in O, Id(x) is a decision class. It is often of interest to find minimal subsets of AT (or strict reducts) that functionally determine d. It may happen, nevertheless, that such minimal sets of conditional attributes do not exist.

Table 2. Sample decision table DT = (O, AT ∪ {d}), where AT = {a, b, c, e, f} and d is the decision attribute, extended with the derived attributes d^N_AT, ∂AT, and μ^AT_d

oid  a  b  c  e  f  d  d^N_AT  ∂AT     μ^AT_d = <μ^AT_1, μ^AT_2, μ^AT_3>
1    1  0  0  1  1  1  1       {1}     <1, 0, 0>
2    1  1  1  1  2  1  1       {1}     <1, 0, 0>
3    0  1  1  0  3  1  N       {1, 2}  <1/2, 1/2, 0>
4    0  1  1  0  3  2  N       {1, 2}  <1/2, 1/2, 0>
5    0  1  1  2  2  2  2       {2}     <0, 1, 0>
6    1  1  0  2  2  2  N       {2, 3}  <0, 1/3, 2/3>
7    1  1  0  2  2  3  N       {2, 3}  <0, 1/3, 2/3>
8    1  1  0  2  2  3  N       {2, 3}  <0, 1/3, 2/3>
9    1  1  0  3  2  3  3       {3}     <0, 0, 1>
10   1  0  0  3  2  3  3       {3}     <0, 0, 1>

Example 2.3.1. Table 2 describes a sample decision table DT = (O, AT ∪ {d}), where AT = {a, b, c, e, f}. Partition πAT = {{1}, {2}, {3, 4}, {5}, {6, 7, 8}, {9}, {10}} contains all AT-indiscernibility classes, whereas π{d} = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9, 10}} contains all decision classes. There is no functional dependency between AT and d, since there is no decision class in π{d} containing the AT-indiscernibility class {3, 4} (or {6, 7, 8}). As AT → d is not functional, C → d, where C ⊆ AT, is not functional either.



Rough Sets theory deals with the problem of the non-existence of strict reducts by means of other types of reducts, which always exist, irrespective of whether AT → d is a functional dependency or not. We will now recall three such types of reducts, namely certain decision reducts, generalized decision reducts, and membership distribution reducts.

Certain decision reducts. Certain decision reducts are defined based on the notion of a positive region of DT, thus we start with introducing this notion. A positive region of DT, denoted as POS, is the set-theoretical union of all AT-indiscernibility classes, each of which is contained in a decision class of DT; that is, POS = ∪{X ∈ πAT | X ⊆ Y, Y ∈ πd} = {x ∈ O | IAT(x) ⊆ Id(x)}. A set of attributes A ⊆ AT is called a certain decision reduct of DT, if A is a minimal set such that ∀x∈POS IA(x) ⊆ Id(x) [18]. Now, we will introduce a derivable decision attribute for an object x ∈ O as a modification of the decision attribute d, which we will denote by d^N_AT(x) and define as follows: d^N_AT(x) = d(x) if x ∈ POS, and d^N_AT(x) = N otherwise (see Table 2 for illustration). Clearly, all objects with values of d^N_AT that are different from N belong to POS.

Please, see Table 2 for illustration of μAT d . A ⊆ AT is a called a μ-decision reduct (or membership distribution reduct) of DT , if A is a minimal set such AT that ∀x∈O μA d (x) = μd (x). Property 2.3.3 [10]. Let A ⊆ AT . A is a μ-decision reduct iﬀ A → {μAT d } is a minimal functional dependency.


3 Computing Minimal Sets of Attributes Functionally Determining a Given Dependent Attribute with Fun

In this section, we present the Fun algorithm for computing all minimal subsets of the conditional attributes AT that functionally determine a given dependent attribute ∂. First, we recall the variants of Fun that apply partitions of objects or, so called, stripped partitions of objects [12]. Then, in Section 3.5, we introduce the idea of reduced stripped partitions and offer two new variants of Fun based on them. The Fun algorithm can be used for calculating Rough Sets reducts provided the dependent attribute is determined properly, namely Fun will return certain decision reducts for ∂ = d^N_AT, generalized decision reducts for ∂ = ∂AT, and μ-decision reducts for ∂ = μ^AT_d. For brevity, a minimal subset of AT that functionally determines a given dependent attribute ∂ will be called a ∂-reduct.

3.1 Main Algorithm

The Fun algorithm takes two arguments: a set of conditional attributes AT and a functionally dependent attribute ∂. As a result, it returns all ∂-reducts. Fun starts with creating singleton candidates C1 for ∂-reducts from each attribute in AT. Then, the partitions (π) and their cardinalities (groupNo) wrt. ∂ and all attributes in C1 are determined.

Notation for Fun:
• Ck – candidate k attribute sets (potential ∂-reducts);
• Rk – k attribute ∂-reducts;
• C.π – the representation of the partition πC of the candidate attribute set C; it is stored as the list of groups of object identifiers (oids);
• C.groupNo – the number of groups in the partition of the candidate attribute set C; that is, |πC|;
• ∂.T – an array representation of π∂.

Algorithm Fun(attribute set AT, dependent attribute ∂);
  C1 = {{a} | a ∈ AT}; // create singleton candidates from conditional attributes in AT
  forall C in C1 ∪ {∂} do begin C.π = πC; C.groupNo = |πC| endfor;
  /* calculate an array representation of π∂ for later multiple use in the Holds function */
  ∂.T = PartitionArrayRepresentation(∂);
  // Main loop
  for (k = 1; Ck ≠ ∅; k++) do begin
    Rk = {};
    forall candidates C ∈ Ck do begin
      if Holds(C → {∂}) then // Is C → {∂} a functional dependency?
        remove C from Ck to Rk; // store C as a k attribute ∂-reduct
      endif
    endfor;
    /* create (k + 1) attribute candidates for ∂-reducts from k attribute non-∂-reducts */
    Ck+1 = FunGen(Ck);
  endfor;
  return ∪k Rk;

Next, the PartitionArrayRepresentation function (see Section 3.3) is called to create an array representation of π∂. This representation shall be used multiple times in the Holds function, called later in the algorithm, for efficient checking



whether candidate attribute sets determine ∂ functionally. Now, the main loop starts. In each k-th iteration, the following is performed:
– The Holds function (see Section 3.3) is called to check if the k attribute candidates Ck determine ∂ functionally. The candidates that do are removed from the set of k attribute candidates to the set of ∂-reducts Rk.
– The FunGen function (see Section 3.2) is called to create the (k + 1) attribute candidates Ck+1 from the k attribute candidates that remained in Ck.
The algorithm stops when the set of candidates becomes empty.
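A compact, illustrative rendering of this levelwise search (ours; it recomputes partitions from scratch via the holds_fd sketch of Section 2.2 instead of maintaining partition products, and omits the optional pruning step of FunGen):

from itertools import combinations

def fun(system, AT, dep):
    # All minimal subsets of AT functionally determining dep (∂-reducts).
    reducts, candidates = [], [(a,) for a in sorted(AT)]
    while candidates:
        non_reducts = []
        for C in candidates:
            (reducts if holds_fd(system, list(C), [dep])
             else non_reducts).append(C)
        survivors = set(non_reducts)
        # Merge pairs differing only on their last attribute ...
        merged = [A + (B[-1],)
                  for A in non_reducts for B in non_reducts
                  if A[:-1] == B[:-1] and A[-1] < B[-1]]
        # ... and prune candidates having a k attribute subset
        # that is no longer a candidate (a reduct lies below it).
        candidates = [C for C in merged
                      if all(S in survivors
                             for S in combinations(C, len(C) - 1))]
    return reducts

Applied to the system of Table 2 with dep = 'd', the sketch returns no reducts, since AT → d does not hold; applied to a derived attribute such as d^N_AT added as a column, it would return the corresponding reducts.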

3.2 Generating Candidates for ∂-reducts

The FunGen function creates the (k + 1) attribute candidates Ck+1 by merging k attribute candidates Ck which are not ∂-reducts. The algorithm adopts the manner of creating and pruning candidates introduced in [1] (here: candidate sets of attributes instead of candidate frequent itemsets). Only those pairs of k attribute candidates Ck that differ merely on their last attributes are merged (see [1] for justification that this method is lossless and non-redundant). For each new candidate C, πC is calculated as the product of the partitions wrt. the merged k attribute sets (see Section 3.3 for the Product function). The cardinality (groupNo) of πC is also calculated. Now, it is checked for each new (k + 1) attribute candidate C, if there exists its k attribute subset A not present in Ck. If so, it means that either A or its subset was found earlier as a ∂-reduct. This implies that the candidate C is a proper superset of a ∂-reduct, thus it is not a ∂-reduct, and hence C is deleted from the set Ck+1. Optionally, for each tested k attribute subset A that is present in Ck, it is checked if |πA| equals |πC|. If so, then A → C holds (by Property 2.2.2). Hence, C → {∂} is not a minimal functional dependency (by Property 2.2.1e), and thus C is deleted from Ck+1.

function FunGen(Ck);
  /* Merging */
  forall A, B ∈ Ck do
    if A[1] = B[1] ∧ . . . ∧ A[k − 1] = B[k − 1] ∧ A[k] < B[k] then begin
      C = A[1] · A[2] · . . . · A[k] · B[k];
      /* compute partition C.π as a product of A.π and B.π, and the number of groups in C.π */
      C.groupNo = Product(A.π, B.π, C.π);
      add C to Ck+1
    endif;
  endfor;
  /* Pruning */
  forall C ∈ Ck+1 do
    forall k attribute sets A such that A ⊂ C do
      if A ∉ Ck then
        /* A ⊂ C and ∃B ⊆ A such that B → {∂} holds, so C → {∂} holds, but is not minimal */
        begin delete C from Ck+1; break end
      elseif A.groupNo = C.groupNo then // optional candidate pruning step
        /* A ⊂ C and A → C holds, so C → {∂} is not a minimal functional dependency */
        begin delete C from Ck+1; break end
      endif
    endfor
  endfor;
  return Ck+1;


3.3 Using Partitions in Fun

Computing Array Representation of Partition. The PartitionArrayRepresentation function returns an array T of length equal to the number of objects O in DT. For a given attribute set C, each element j of T is assigned the index of the group in C.π to which the object with oid = j belongs. As a result, the j-th element of T informs to which group in C.π the j-th object in DT belongs, j = 1..|O|.

function PartitionArrayRepresentation(attribute set C);
  /* assert: T is an array[1 . . . |O|] */
  i = 1;
  for i-th group G in partition C.π do begin
    for each oid ∈ G do T[oid] = i endfor;
    i = i + 1
  endfor;
  return T;

Verifying Candidate Dependency. The Holds function checks if there is a functional dependency between the set of attributes C and an attribute ∂. It is checked for successive groups G in C.π whether there is an oid in G that belongs to a group in ∂.π different from the group in ∂.π to which the first oid in G belongs (for the purpose of efficiency, the pre-calculated ∂.T representation of the partition for ∂ is applied instead of ∂.π). If so, this means that G is not contained in one group of ∂.π and thus C → {∂} is not a functional dependency. In such a case, the function stops, returning false as a result. Otherwise, if no such group G is found, the function returns true, which means that C → {∂} is a functional dependency.

function Holds(C → {∂});
  /* assert: ∂.T is an array representation of ∂.π */
  for each group G in partition C.π do begin
    oid = first element in group G;
    ∂-firstGroup = ∂.T[oid]; // the identifier of the group in ∂.π to which oid belongs
    for each next element oid ∈ G do begin
      ∂-nextGroup = ∂.T[oid];
      if ∂-firstGroup ≠ ∂-nextGroup then
        /* there are oids in G that identify objects indiscernible wrt. C, but discernible wrt. ∂ */
        return false // hence, C → {∂} does not hold
      endif
    endfor;
  endfor;
  return true; // C → {∂} holds

Computing Product of Partitions. The Product function computes the partition wrt. the attribute set C, and its cardinality, from the partitions wrt. the attribute sets A and B. The function examines successive groups of the partition for B. The objects in a given group G in B.π are split into maximal subgroups in such a way that the objects in each resultant subgroup are contained in the same group in A.π. The obtained set of subgroups equals {G ∩ Y | Y ∈ A.π}. The product C.π is calculated as the set of all subgroups obtained from all groups in B.π; i.e., C.π = ∪G∈B.π {G ∩ Y | Y ∈ A.π} = {G ∩ Y | Y ∈ A.π and G ∈ B.π} = B.π ∩ A.π. In order to calculate the product of the partitions efficiently (with time complexity linear wrt. the number of objects in DT), we follow the idea presented in [6] and use two static arrays T and S: T is used to store an array representation of



the partition wrt. A; S is used to store the subgroups obtained from a given group G in B.π.

function Product(A.π, B.π; var C.π);
  /* assert: T[1..|O|] is a static array */
  /* assert: S[1..|O|] is a static array with all elements initially equal to ∅ */
  C.π = {}; groupNo = 0;
  /* calculate an array representation of A.π for later multiple use in the Product function */
  T = PartitionArrayRepresentation(A);
  i = 1;
  for i-th group G in partition B.π do begin
    A-GroupIds = ∅;
    for each element oid ∈ G do begin
      j = T[oid]; // the identifier of the group in A.π to which oid belongs
      insert oid into S[j]; insert j into A-GroupIds
    endfor;
    for each j ∈ A-GroupIds do begin
      insert S[j] into C.π; groupNo = groupNo + 1; S[j] = ∅
    endfor;
    i = i + 1
  endfor;
  return groupNo;

3.4 Using Stripped Partitions in Fun

The representation of partitions that requires storing the object identifiers (oids) of all objects in DT may be too memory consuming. In order to alleviate this problem, it was proposed in [6] to store oids only for objects belonging to non-singleton groups in a partition representation. Such a representation of a partition is called a stripped one and will be denoted by π^s. Clearly, the stripped representation is lossless.

Example 3.4.1. In Table 2, the partition wrt. {ce} is π{ce} = {{1}, {2}, {3, 4}, {5}, {6, 7, 8}, {9, 10}}, whereas the stripped partition wrt. {ce} is π^s{ce} = {{3, 4}, {6, 7, 8}, {9, 10}}.

function StrippedHolds(C → {∂});
  i = 1;
  for i-th group G in partition C.π do begin
    oid = first element in group G;
    ∂-firstGroup = ∂.T[oid]; // the identifier of the group in ∂.π to which oid belongs
    if ∂-firstGroup = null then return false endif;
    /* ∂.T[oid] = null indicates that oid constitutes a singleton group in the partition for ∂. */
    /* Hence, no next object in G belongs to this group in ∂.π, so C → {∂} does not hold. */
    for each next element oid ∈ G do begin
      ∂-nextGroup = ∂.T[oid];
      if ∂-firstGroup ≠ ∂-nextGroup then
        /* there are oids in G that identify objects indiscernible wrt. C, but discernible wrt. ∂ */
        return false // hence, C → {∂} does not hold
      endif
    endfor;
    i = i + 1
  endfor;
  return true; // C → {∂} holds

When applying stripped partitions in our Fun algorithm instead of the usual partitions, one should call the StrippedHolds function instead of Holds, and the StrippedProduct function instead of Product. The modified parts of the functions are indicated by the comments in the code below. We note, however, that the groupNo field



still stores the number of groups in the unstripped partition (singleton groups are not stored in a stripped partition, but are counted!).

function StrippedProduct(A.π, B.π; var C.π);
  C.π = {}; groupNo = B.groupNo;
  T = PartitionArrayRepresentation(A);
  i = 1;
  for i-th group G in partition B.π do begin
    A-GroupIds = ∅;
    for each element oid ∈ G do begin
      j = T[oid]; // the identifier of the group in A.π to which oid belongs
      if j = null then
        groupNo = groupNo + 1; // respect singleton subgroups
      else begin
        insert oid into S[j]; insert j into A-GroupIds
      endif
    endfor;
    for each j ∈ A-GroupIds do begin
      if |S[j]| > 1 then
        insert S[j] into C.π // store only non-singleton groups
      endif;
      groupNo = groupNo + 1; // but count all groups, including singleton ones
      S[j] = ∅
    endfor;
    groupNo = groupNo − 1;
    i = i + 1
  endfor;
  /* Clearing of array T for later use */
  for i-th group G in partition A.π do
    for each element oid ∈ G do T[oid] = null endfor
  endfor;
  return groupNo;

3.5 Using Reduced Stripped Partitions in Fun

In this section, we offer a further reduction of stripped partitions wrt. conditional attributes. Our proposal is based on the following observations. Let C be a conditional attribute set and d be the decision attribute. Let G be any group in the stripped partition wrt. C that is contained in a group belonging to the stripped partition wrt. d.
a) Group G operates in favour of a functional dependency between C and d.
b) Any subgroup G′ ⊆ G that occurs in the stripped partition wrt. a superset C′ ⊇ C also operates in favour of a functional dependency between C′ and d.
Thus, the verification of the containment of G in a group of the stripped partition wrt. d is dispensable in testing the existence of a functional dependency between C and d. We define a reduced stripped partition wrt. attribute set A (and denote it by π^rs_A) as the set of those groups in the stripped partition wrt. A that are not contained in any group in the stripped partition wrt. the decision d; that is, π^rs_A = {G ∈ π^s_A | ¬∃D∈π^s{d} G ⊆ D}.

Example 3.5.1. In Table 2, the stripped partition wrt. conditional attribute e: π^s_{e} = {{1, 2}, {3, 4}, {5, 6, 7, 8}, {9, 10}}, whereas the stripped partition wrt. decision attribute d: π^s_{d} = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9, 10}}. We note that group {1, 2} ∈ π^s_{e} and its subsets are contained in group {1, 2, 3} ∈ π^s_{d}. Similarly, group {9, 10} ∈ π^s_{e} and its subsets are contained in group {7, 8, 9, 10} ∈ π^s_{d}. There is no group in π^s_{d} containing {3, 4} or {5, 6, 7, 8}. Thus, the groups {1, 2} and {9, 10} in π^s_{e}, unlike the remaining two groups {3, 4} and {5, 6, 7, 8} in π^s_{e}, operate in favour of a functional dependency between {e} and {d}. Hence, the reduced stripped partition π^rs_{e} = {{3, 4}, {5, 6, 7, 8}}, and the reduced stripped partitions wrt. supersets of {e} will contain neither {1, 2}, nor {9, 10}, nor their subsets.
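In Python the reduction is a one-line filter over the stripped partition (a sketch with our own names, reproducing Example 3.5.1):

def reduce_stripped(pi_s, d_pi_s):
    # pi^rs: drop every group contained in some group of the stripped decision partition.
    d_groups = [set(dg) for dg in d_pi_s]
    return [g for g in pi_s if not any(set(g) <= dg for dg in d_groups)]

pi_e_s = [[1, 2], [3, 4], [5, 6, 7, 8], [9, 10]]   # pi^s_{e} from Example 3.5.1
pi_d_s = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]     # pi^s_{d} from Example 3.5.1
print(reduce_stripped(pi_e_s, pi_d_s))             # [[3, 4], [5, 6, 7, 8]] = pi^rs_{e}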

It is easy to observe that the reduced stripped partition wrt. attribute set C can be calculated based on the product of reduced stripped partitions wrt. subsets of C, as shown in Proposition 3.5.1.

Proposition 3.5.1. Let A, B, C ⊆ AT and C = A ∪ B. Then the reduced stripped partition wrt. C equals the set of the groups in the product of the reduced stripped partitions wrt. A and B that are not contained in any group of the stripped partition wrt. decision d; that is, π^rs_C = {G ∈ π^rs_A ∩ π^rs_B | ¬∃ D ∈ π^s_{d}: G ⊆ D}.

function ReducedStrippedHolds(C → {∂});
  i = 1; holds = true;
  for i-th group G in partition C.π do begin
    oid = first element in group G;
    ∂-firstGroup = ∂.T[oid]; // the identifier of the group in ∂.π to which oid belongs
    if ∂-firstGroup = null then
      holds = false;
      /* ∂.T[oid] = null indicates that oid constitutes a singleton group in the
         partition for ∂. Hence, no next object in G belongs to this group in ∂.π,
         so C → {∂} does not hold. */
    else begin
      for each next element oid ∈ G do begin
        ∂-nextGroup = ∂.T[oid];
        if ∂-firstGroup ≠ ∂-nextGroup then
          /* there are oids in G that identify objects indiscernible wrt. C,
             but discernible wrt. ∂ */
          holds = false; // hence, C → {∂} does not hold
          break;
        endif
      endfor;
      if ∂-firstGroup = ∂-nextGroup then
        delete G from C.π; // G is contained in a group of the partition for ∂
      endif;
      i = i + 1;
    endif;
  endfor;
  return holds;

In our proposal, the product π^rs_A ∩ π^rs_B of the reduced stripped partitions wrt. A and B is calculated by means of the StrippedProduct function. The reduced stripped partition π^rs_C is determined from the product π^rs_A ∩ π^rs_B by the new ReducedStrippedHolds function. The function is a modification of StrippedHolds. The ReducedStrippedHolds function, like StrippedHolds, verifies whether there is a functional dependency between C and d. In addition, ReducedStrippedHolds removes those groups in the product π^rs_A ∩ π^rs_B that are contained in π^s_{d}. The modified parts of the code in ReducedStrippedHolds have been shaded.


Please note that the StrippedHolds function reads groups of π^s_C = π^s_A ∩ π^s_B only until the first group that is not contained in a group of π^s_{d} is found. To the contrary, ReducedStrippedHolds reads all groups of the product π^rs_A ∩ π^rs_B. This means that the execution of ReducedStrippedHolds may last longer than the execution of StrippedHolds when π^rs_A ∩ π^rs_B and π^s_A ∩ π^s_B are of similar length. On the other hand, the execution of ReducedStrippedHolds may take less time than the execution of StrippedHolds when π^rs_A ∩ π^rs_B is shorter than π^s_A ∩ π^s_B. As an alternative to both ways of shortening partitions, we propose the PartReducedStrippedHolds function, which deletes groups from the product π^s_A ∩ π^s_B until the first group in this product that is not contained in a group of π^s_{d} is found. The result of PartReducedStrippedHolds is a group set that is a subset of π^s_A ∩ π^s_B and a superset of π^rs_A ∩ π^rs_B.

function PartReducedStrippedHolds(C → {∂});
  i = 1;
  for i-th group G in partition C.π do begin
    oid = first element in group G;
    ∂-firstGroup = ∂.T[oid]; // the identifier of the group in ∂.π to which oid belongs
    if ∂-firstGroup = null then return false endif;
      /* ∂.T[oid] = null indicates that oid constitutes a singleton group in the
         partition for ∂. Hence, no next object in G belongs to this group in ∂.π,
         so C → {∂} does not hold. */
    for each next element oid ∈ G do begin
      ∂-nextGroup = ∂.T[oid];
      if ∂-firstGroup ≠ ∂-nextGroup then
        /* there are oids in G that identify objects indiscernible wrt. C,
           but discernible wrt. ∂ */
        return false // hence, C → {∂} does not hold
      endif
    endfor;
    delete G from C.π; // G is contained in a group of the partition for ∂
    i = i + 1;
  endfor;
  return true; // C → {∂} holds

We note that it is impossible to determine the number of groups in the product π^s_A ∩ π^s_B as a side-effect of calculating the product of the reduced stripped partitions π^rs_A ∩ π^rs_B. The same observation holds when the product is calculated from the partially reduced stripped partitions. The lack of this knowledge prevents the use of the optional pruning step in the FunGen algorithm. The usefulness of fully or partially reduced stripped partitions will be examined experimentally in Section 4.

4

Experimental Results

We have performed a number of experiments on a few data sets available in the UCI Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html).

Table 3. Six variants of the Fun algorithm wrt. Holds algorithm, partitions' type, and candidate pruning option

Fun's variant               H      H/P    SH              SH/P            PRSH                          RSH
Holds method                Holds  Holds  Stripped Holds  Stripped Holds  Part Reduced Stripped Holds   Reduced Stripped Holds
Stripped partitions         No     No     Yes             Yes             Yes                           Yes
Optional candidate pruning  No     Yes    No              Yes             No                            No

Table 4. Reference external tools

Label  Tool     Algorithm                Comments
Tane   Tane     Tane
RSES   RSES     Exhaustive
RSGR   Rosetta  SAV Genetic Reducer
RRER   Rosetta  RSES Exhaustive Reducer  Limitation to 500 records

Table 5. Execution times in seconds for the letter-recognition data set. * - results are not available, the data set was too large to be analyzed; ** - RSES was written in Java, which could cause an additional overhead; *** - the times provided by Rosetta have a 1 second granularity.

   |O|    H        H/P     SH      SH/P    PRSH    RSH     RSES**   Tane  RSGR***  RRER***
1  100    0.32     0.52    0.36    0.23    0.30    0.24    9.50     0.55
2  200    2.54     2.72    0.87    0.25    0.97    0.97    13.50
3  500    7.92     7.24    1.63    0.79    1.71    1.57    9.00
4  1000   26.70    19.41   3.72    2.03    3.28    2.94    14.00
5  2000   38.29    27.60   7.97    4.28    7.99    6.74    20.00
6  5000   126.19   76.48   28.52   19.54   28.66   28.91   130.00
7  10000  1687.15  976.04  51.51   52.97   59.97   131.79  960.00
8  15000  N/A*     N/A*    154.79  137.21  144.86  368.67  1860.00
9  20000  N/A*     N/A*    444.70  421.89  440.31  727.69  3060.00

Simple cells are linear (F1/F0 > 1), whereas complex cells are nonlinear (F1/F0 < 1). The classical V1 RF properties can be found using small flashing light spots, moving white or dark bars, or gratings. We will give an example of the decision rules for the RF mapped with moving white and dark bars [5]. A moving white bar gives the following decision rule:

DR V1 1: xp_i ∧ yp_0 ∧ xs_k ∧ ys_1 ∧ s_2 → r_1

(4)

The decision rule for a moving dark bar is given as:

DR V1 2: xp_j ∧ yp_0 ∧ xs_l ∧ ys_1 ∧ s_2 → r_1

(5)

where xp_i is the x-position of the incremental subfield, xp_j is the x-position of the decremental subfield, yp_0 is the y-position of both subfields, xs_k, xs_l, ys_1 are the horizontal and vertical sizes of the RF subfields, and s_2 is a vertical bar, which means that this cell is tuned to the vertical orientation. We have skipped other stimulus attributes like movement velocity, direction, amplitude, etc. For simplicity we assume that the cell is not direction sensitive, that it gives the same responses to both directions of bar movement and to the dark and light bars, and that cell responses are symmetric around the x middle position (xp). An overlap index [10] is defined as:

OI = (0.5(xs_k + xs_l) − |xp_i − xp_j|) / (0.5(xs_k + xs_l) + |xp_i − xp_j|)

OI compares the sizes of the increment (xs_k) and decrement (xs_l) subfields to their separation (|xp_i − xp_j|). After [11], if OI ≤ 0.3 ("non-overlapping" subfields), it is a simple cell with a dominating first-harmonic response (linear), and r_1 is the amplitude of the first harmonic. If OI ≥ 0.5 (overlapping subfields), it is a complex cell with a dominating F0 response (nonlinear), and r_1 is the change in the mean cell activity.
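As a worked illustration, a small Python function (ours; the thresholds follow [11]) computes the overlap index and the resulting simple/complex classification:

def overlap_index(xs_k, xs_l, xp_i, xp_j):
    # OI = (0.5*(xs_k + xs_l) - |xp_i - xp_j|) / (0.5*(xs_k + xs_l) + |xp_i - xp_j|)
    half_width_sum = 0.5 * (xs_k + xs_l)
    separation = abs(xp_i - xp_j)
    return (half_width_sum - separation) / (half_width_sum + separation)

def classify_cell(oi):
    # After [11]: OI <= 0.3 -> simple (F1-dominated); OI >= 0.5 -> complex (F0-dominated).
    if oi <= 0.3:
        return "simple"
    if oi >= 0.5:
        return "complex"
    return "unclassified"

# Subfields of width 1 deg whose centers lie 1.5 deg apart do not overlap:
oi = overlap_index(1.0, 1.0, 0.0, 1.5)
print(round(oi, 2), classify_cell(oi))   # -0.2 simple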


Hubel and Wiesel [9] have proposed that the complex cell RF is created by the convergence of several simple cells, in a similar way as V1 RF properties are related to the RFs of LGN cells (Fig. 1). However, there is recent experimental evidence that the nonlinearity of the complex cell RF may be related to feedback or horizontal connections [12].

Decision Rules for area V4. The properties of the RFs in area V4 are more complex than those in area V1 or in the LGN, and in most cases they are nonlinear. It is not clear what exactly the optimal stimuli for cells in V4 are, but a popular hypothesis states that V4 cells code simple, robust shapes. Below is an example from [13] of the decision rules for a narrow (0.4 deg) and long (4 deg) horizontal or vertical bar placed in different positions of an area V4 RF:

DR V4 1: o_0 ∧ ypr_m ∧ (yp_{-2.2} ∨ yp_{0.15}) ∧ xs_4 ∧ ys_{0.4} → r_2

(6)

DR V4 2: o_90 ∧ xpr_m ∧ (xp_{-0.6} ∨ xp_{1.3}) ∧ xs_{0.4} ∧ ys_4 → r_1

(7)

The first rule relates area V4 cell responses to a moving horizontal bar (o_0); the stimulus in the second rule is a moving vertical bar (o_90). Here ypr_m and xpr_m denote the tolerance for the y or x bar positions (more details in the Results section). The horizontal bar placed narrowly in two different y-positions (yp_{-2.2}, yp_{0.15}) gives strong cell responses (DR V4 1), and the vertical bar placed with a wide range in two different x-positions (xp_{-0.6}, xp_{1.3}) gives medium cell responses.

Decision Rules for feedforward connections from LGN → V1. Thalamic axons target specific cells in layers 4 and 6 of the primary visual cortex (V1). Generally, we assume that there is a linear summation of LGN cells (approximately 10-100 of them [14]) onto one V1 cell. It was proposed [9] that the LGN cells determine the orientation of the V1 cell in the following way: LGN cells which have a direct synaptic connection to V1 neurons have their receptive fields arranged along a straight line on the retina (Fig. 1). In this classical model of Hubel and Wiesel [9], the major assumption is that the activity of all (four in Fig. 1) LGN cells is necessary for a V1 cell to be sensitive to the specific stimulus (an oriented light bar). This principle determines the syntax of the LGN to V1 decision rule by using logical AND, meaning that if one LGN cell does not respond then there is no V1 cell response. After Sherman and Guillery [15], we will call such inputs drivers. Alonso et al. [14] showed that there is a high specificity between the RF properties of the LGN cells which have monosynaptic connections to a V1 simple cell. This precision goes beyond simple retinotopy and includes such RF properties as RF sign, timing, subregion strength, and sign [14]. The decision rules for the feedforward LGN to V1 connections are the following:

DR LGN V1 1: r_1^LGN(x_i, y_i) ∧ r_1^LGN(x_{i+1}, y_i) ∧ … ∧ r_1^LGN(x_{i+n}, y_i) → r_1^V1

(8)


Fig. 1. On the left: modified schematic of the model proposed by [9]. Four LGN cells with circular receptive fields arranged along a straight line on the retina have direct synaptic connections to a V1 neuron. This V1 neuron is orientation sensitive, as marked by the thick, interrupted lines. On the right: receptive fields of two types of LGN cells and two types of area V1 cells.

DR LGN V1 2: r_1^LGN(x_i, y_i) ∧ r_1^LGN(x_{i+1}, y_{i+1}) ∧ … ∧ r_1^LGN(x_{i+n}, y_{i+n}) → r_1^V1

(9)

where the first rule determines the response of cells in V1 with optimal horizontal orientation, and the second rule says that the optimal orientation is 45 degrees; (x_i, y_i) is the localization of the RF in the x-y Euclidean coordinates of the visual field. Notice that these rules assume that the V1 RF is completely determined by the FF pathway from the LGN.

Decision Rules for feedback connections from V1 → LGN. There are several papers showing the existence of feedback connections from V1 to LGN [16-20]. In [20], the authors have quantitatively compared the visuotopic extent of geniculate feedforward afferents to V1 with the size of the RF center and surround of neurons in V1 input layers, and the visuotopic extent of V1 feedback connections to the LGN with the RF size of cells in V1. Area V1 feedback connections restrict their influence to LGN regions visuotopically coextensive with the size of the classical RF of V1 layer 6 cells and commensurate with the LGN region from which they receive feedforward connections. In agreement with [15], we will call feedback inputs modulators, with the following decision rules:


DR V1 LGN 1: (r_1^V1 ∨ r_1^LGN(x_i, y_i)), (r_1^V1 ∨ r_1^LGN(x_i, y_{i+1})), (r_1^V1 ∨ r_1^LGN(x_{i+1}, y_{i+1})), …, (r_1^V1 ∨ r_1^LGN(x_{i+2n}, y_{i+2n})) → r_2^LGN

(10)

This rule says that when the activity of a particular V1 cell is in agreement with the activity of some LGN cells, their responses increase from r_1 to r_2. Here r_1^LGN(x_i, y_i) means the r_1 response of the LGN cell with coordinates (x_i, y_i) in the visual field, and r_2^LGN means the r_2 response of all LGN cells in the decision rule whose activity was coincidental with the feedback excitation; it is a pattern of LGN cell activity.

Decision Rules for feedforward connections V1 → V4. There are relatively sparse direct connections from V1 to V4 bypassing area V2 [20], but we must also take into account the V1 to V2 [21] and V2 to V4 connections, which are highly organized but variable, especially in V4 [22]. We assume for simplicity that V2 has properties similar to V1 but a larger RF size. We assume that, as with the connections from the retina to the LGN and from the LGN to V1, direct or indirect connections from V1 to V4 provide driver input and fulfill the following decision rules:

DR V1 V4 1: r_1^V1(x_i, y_i) ∧ r_1^V1(x_{i+1}, y_i) ∧ … ∧ r_1^V1(x_{i+n}, y_i) → r_1^V4

(11)

DR V1 V4 2: r_1^V1(x_i, y_i) ∧ r_1^V1(x_{i+1}, y_{i+j}) ∧ … ∧ r_1^V1(x_{i+n}, y_{i+m}) → r_1^V4

(12)

We assume that the RF in area V4 sums up driver inputs from regions of cells in areas V1 and V2 with highly specific RF properties, not only retinotopically correlated ones.

Decision Rules for feedback connections from V4 → V1. Anterograde anatomical tracing [23] has shown axons backprojecting from area V4 directly to area V1, sometimes with branches in area V2. Axons of V4 cells span large territories in area V1, with most terminations in layer 1, which can be either distinct clusters or linear arrays. These axon-specific branches determine decision rules that will have similar syntax (see below), but the anatomical structure of a particular axon may introduce different semantics. Their anatomical structures may be related to the specific receptive field properties of different V4 cells. Distinct clusters may have terminals on V1 cells near pinwheel centers (cells with different orientations arranged radially), whereas a linear array of terminals may be connected to V1 neurons with similar orientation preference. In consequence, some parts of the V4 RF would have a preference for certain orientations, and others may have a preference for certain locations but be more flexible about different orientations. This hypothesis is supported by recent intracellular recordings from neurons located near pinwheel centers which,


in contrast to other narrowly tuned neurons, showed subthreshold responses to all orientations [24]. However, neurons which have a fixed orientation can change other properties of their receptive field, for example spatial frequency; therefore the feedback from area V4 can tune them to expected spatial details in the RF (M. Sur, Brenda Milner Symposium, 22 Sept. 2008, MNI McGill University, Montreal). The V4 input modulates V1 cells in the following way:

DR V4 V1 1: (r_1^V4 ∨ r_1^V1(x_i, y_i)), (r_1^V4 ∨ r_1^V1(x_i, y_{i+1})), (r_1^V4 ∨ r_1^V1(x_{i+1}, y_{i+1})), …, (r_1^V4 ∨ r_1^V1(x_{i+n}, y_{i+m})) → r_2^V1

(13)

The meanings of r_1^V1(x_i, y_i) and r_2^V1 are the same as explained above for the V1 to LGN decision rule.

Decision Rules for feedback connections V4 → LGN. Anterograde tracing from area V4 showed axons projecting to different layers of the LGN, and some of them also to the pulvinar [25]. These axons have widespread terminal fields with branches non-uniformly spread over several millimeters (Fig. 2). Like descending axons in V1, axons from area V4 have, within their LGN terminations, distinct clusters or linear branches (Fig. 2). These clusters and branches are characteristic of different axons and, as mentioned above, their differences may be related to different semantics in the decision rule below:

DR V4 LGN 1: (r_1^V4 ∨ r_1^LGN(x_i, y_i)), (r_1^V4 ∨ r_1^LGN(x_i, y_{i+1})), (r_1^V4 ∨ r_1^LGN(x_{i+1}, y_{i+1})), …, (r_1^V4 ∨ r_1^LGN(x_{i+n}, y_{i+m})) → r_2^LGN

(14)

The meanings of r_1^LGN(x_i, y_i) and r_2^LGN are the same as explained above for the V1 to LGN decision rule. Notice that the interaction between the FF and FB pathways extends the classical view that the brain, as a computer, uses two-valued logic. In psychophysics, this effect can be paraphrased as: "I see it but it does not fit my predictions." In neurophysiology, we assume that a substructure could be optimally tuned to the stimulus although its activity does not fit the FB predictions. Such an interaction can be interpreted as a third logical value. If there is no stimulus, the response in the local structure should have logical value 0; if the stimulus is optimal for the local structure, it should have logical value 1/2; and if it is also tuned to the expectations of higher areas (positive feedback), then the response should have logical value 1. Generally it becomes more complicated if we consider many interacting areas, but in this work we use only three-valued logic.
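The following toy Python sketch (our illustration of the three-valued scheme, not a biophysical model) encodes the driver and modulator rules with the values 0, 1/2, and 1:

from fractions import Fraction

NO_RESPONSE, OPTIMAL, CONFIRMED = Fraction(0), Fraction(1, 2), Fraction(1)

def driver(inputs):
    # Driver rule (logical AND): every feedforward input must be active.
    return OPTIMAL if all(r > NO_RESPONSE for r in inputs) else NO_RESPONSE

def modulate(local, feedback_active):
    # Modulator rule: feedback coincident with local activity raises the response
    # from 1/2 to 1; feedback alone cannot create activity.
    if local == OPTIMAL and feedback_active:
        return CONFIRMED
    return local

# Four aligned LGN inputs drive a V1 cell (cf. DR LGN V1 1) ...
v1 = driver([OPTIMAL] * 4)
# ... and coincident V4 feedback confirms it (cf. DR V4 V1 1).
print(modulate(v1, feedback_active=True))                # 1
print(driver([OPTIMAL, NO_RESPONSE, OPTIMAL, OPTIMAL]))  # 0: one silent driver vetoes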


Fig. 2. Boutons of a descending axon from area V4 with terminals in different parvocellular layers of the LGN: layer 6 in black, layer 5 in red, layer 4 in yellow. The total number of boutons for this and other axons was between 1150 and 2075. We estimated that each descending V4 axon connects to approximately 500 to over 1000 LGN (mostly parvocellular) cells [25]. Thick lines outline the LGN; thin lines show layers 5 and 6, the dotted line shows the azimuth, and dashed lines show the elevation of the visual field covered by the descending axon. The extent of this axon's arborization is approximately the size of a V4 RF.

3

Results

We have used our model as a basis for an analysis of the experimental data from neurons recorded in the monkey's area V4 [2]. In [2], it was shown that the RF in V4 can be divided into several subfields that, stimulated separately, give us a first approximation of the concept of the shape to which the cell is tuned [13]. We have also shown that subfields are tuned to stimuli with similar orientation [2]. In Fig. 3, we demonstrate that the receptive field subfields have not only similar preferred orientations but also similar spatial frequencies [2]. We have divided cell responses into three categories (see Methods), separated by horizontal lines in plots A-D of Fig. 3. We have drawn a line near the spike frequency of 17 spikes/s, which separates responses of category r_1 (above) from r_0 (below the threshold line). Horizontal lines plotted near the spike frequency of 34 spikes/s separate responses of category r_2 (above) from r_1 (below). The stimulus attributes related to these three response categories were extracted into the decision table (Table 1). We summarize the results of our analysis in Figs. 3H and G from Table 1. Fig. 3H presents a schematic of a possible stimulus that would give medium cell responses (r_1). One can imagine


Fig. 3. Modified plots from [2]. Curves represent responses of V4 neurons to grating stimulation of their RF subfields with different spatial frequencies (SF). (A-D) SF selectivity curves across the RF, with positions indicated in the insets. The centers of the tested subfields were 2 deg apart. (E-H) Schematic representations summarizing the orientation and SF selectivity of the subfields presented in A-D and in [2]. These figures are based on Decision Table 1; for stimuli in E, F cell responses were r_1, and for stimuli in G, H cell responses were r_2. (F) and (G) represent a possible stimulus configuration from schematics (E) and (F).

several classes of possible stimuli, assuming that subfield responses sum linearly (for example, see Fig. 3F). Fig. 3G shows a schematic of a possible stimulus set-up which would give the r_2 response that, as we have assumed, is related not only to local but also to global visual cortex tuning. One can notice that in the last case only subfields in the vertical row give strong independent responses (Fig. 3H). We assign the narrow (ob_n), medium (ob_m), and wide (ob_w) orientation bandwidths as follows: ob_n if 0 < ob < 50 deg, ob_m if 50 deg < ob < 100 deg, ob_w if ob > 100 deg. We assign the narrow (sfb_n), medium (sfb_m), and wide (sfb_w) spatial frequency bandwidths: sfb_n if 0 < sfb < 2 c/deg, sfb_m if 2 c/deg < sfb < 2.5 c/deg, sfb_w if sfb > 2.5 c/deg. For simplicity, in the following decision rules we assume that the subfields are not direction sensitive; therefore responses to stimulus orientations 0 and 180 deg should be the same.
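These bandwidth categories translate directly into code; the sketch below is ours, and the boundary conventions (in particular treating ob = 0 and sfb = 0 as narrow, which the zero-bandwidth cells in Table 1 below require) are assumptions:

def ob_category(ob):
    # Orientation bandwidth (deg) -> narrow / medium / wide.
    if ob < 50:
        return "ob_n"
    if ob < 100:
        return "ob_m"
    return "ob_w"

def sfb_category(sfb):
    # Spatial frequency bandwidth (c/deg) -> narrow / medium / wide.
    if sfb < 2:
        return "sfb_n"
    if sfb < 2.5:
        return "sfb_m"
    return "sfb_w"

print(ob_category(20), sfb_category(0.9))   # ob_n sfb_n (e.g. cells 3c2 and 5a1 in Table 1)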


Table 1. Decision table for one cell's responses to subfield stimulation, Fig. 3C-F and Fig. 5 in [2]. Attributes xpr, ypr, sf = 2 c/deg, and s are constant and are not presented in the table. Cells 3* are from Fig. 3 in [2] and cells 5* are from Fig. 5 in [2], processed in Fig. 3.

cell  o    ob   sfb  xp  yp  r
3c    172  105  0     0   0  1
3c1    10  140  0     0   0  1
3c2   180   20  0     0   0  2
3d    172  105  0     0  -2  1
3d1     5  100  0     0  -2  1
3d2   180   50  0     0  -2  2
3e    180    0  0    -2   0  0
3f    170  100  0     0   2  1
3f1    10  140  0     0   2  1
3f2   333   16  0     0   2  2
5a    180    0  3     0  -2  1
5a1   180    0  0.9   0  -2  2
5b    180    0  3.2   0   2  1
5b1   180    0  1     0   2  2
5c    180    0  3     0   0  1
5c1   180    0  1.9   0   0  2
5d    180    0  0.8   0   0  1

Our results from the separate subfield stimulation study can be presented as the following decision rules:

DR V4 3: o_180 ∧ sf_2 ∧ ((ob_w ∧ sfb_w ∧ xp_0 ∧ (yp_{-2} ∨ yp_0 ∨ yp_2)) ∨ (ob_n ∧ sfb_n ∧ yp_0 ∧ (xp_{-2} ∨ xp_2))) → r_1

(15)

DR V4 4: o_180 ∧ sf_2 ∧ ob_n ∧ sfb_n ∧ xp_0 ∧ (yp_{-2} ∨ yp_0 ∨ yp_2) → r_2

(16)
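Before interpreting the rules, note that DR V4 4 can be encoded directly over the discretized attributes (a sketch using the category helpers above; the encoding and boundary conventions are our assumptions):

def dr_v4_4(o, sf, ob, sfb, xp, yp):
    # DR V4 4: o_180 and sf_2 and ob_n and sfb_n and xp_0 and (yp in {-2, 0, 2}) -> r_2
    return (o == 180 and sf == 2
            and ob_category(ob) == "ob_n" and sfb_category(sfb) == "sfb_n"
            and xp == 0 and yp in (-2, 0, 2))

# Cell 5a1 from Table 1 (o = 180, ob = 0, sfb = 0.9, xp = 0, yp = -2, sf = 2 c/deg):
print(dr_v4_4(o=180, sf=2, ob=0, sfb=0.9, xp=0, yp=-2))   # True, i.e. predicts r_2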

These decision rules can be interpreted as follows: disc-shaped grating stimuli with wide bandwidths of orientation or spatial frequency evoke medium cell responses when placed along the vertical axis of the receptive field. However, similar discs placed horizontally, to the left or to the right of the middle of the RF, must have narrow orientation and spatial frequency bandwidths to evoke medium cell responses. Only a disc narrowly tuned in spatial frequency and orientation, placed vertically from the middle of the receptive field, can evoke strong cell responses. Notice that Figs. 3F and 3H show possible configurations of the optimal stimulus. This approach is similar to the assumption that an image of the object is initially represented in terms of the activation of a spatially arrayed set of multiscale, multioriented detectors like arrangements of simple cells in V1 (metric


templates in subordinate-level object classification of Lades et al. [26]). However, this approach does not take into account interactions between several stimuli when more than one subfield is stimulated, and as we will show below, there is a strong nonlinear interaction between subfields. We analyzed experiments where the RF is stimulated at first with a single small vertical bar and later with two bars changing their horizontal positions. One example of V4 cell responses to thin (0.25 deg) vertical bars in different horizontal positions is shown in the upper left part of Fig. 4 (Fig. 4E). The cell response has maximum amplitude for the middle (XPos = 0) bar position along the x-axis. Cell responses are not symmetrical around 0. In Fig. 4F, the same cell (cell 61 in Table 2) is tested with two bars. The first bar stays at the 0 position, while the second bar changes its position along the x-axis. Cell responses show several maxima dividing the receptive field into four areas. However, this is not always the case, as responses to two bars in another cell (cell 62 in Table 2) show only two minima (Fig. 4G). Horizontal lines in the plots of both figures divide cell responses into the three categories r_0, r_1, r_2, which are related to the mean response frequency (see Methods). Stimulus attributes and cell responses classified into categories are shown in Table 2 for cells in Fig. 4 and in Table 3 for cells in Fig. 5. We assign the narrow (xpr_n), medium (xpr_m), and wide (xpr_w) x-position ranges as follows: xpr_n if 0 < xpr ≤ 0.6, xpr_m if 0.6 < xpr ≤ 1.2, xpr_w if xpr > 1.2. We assign the narrow (ypr_n), medium (ypr_m), and wide (ypr_w) y-position ranges: ypr_n if 0 < ypr ≤ 1.2, ypr_m if 1.2 < ypr ≤ 1.6, ypr_w if ypr > 1.6. On the basis of Fig. 4 and decision Table 2 (also compare with [18]), the one-bar study can be presented as the following decision rules:

DR V4 5: o_90 ∧ xpr_n ∧ xp_{0.1} ∧ xs_{0.25} ∧ ys_{0.4} → r_2

(17)

DR V4 6: o_90 ∧ xpr_w ∧ xp_{-0.2} ∧ xs_{0.25} ∧ ys_{0.4} → r_1

(18)

We interpret these rules to mean that the r_1 response in eq. (18) does not effectively involve the feedback to the lower areas, V1 and LGN. The descending V4 axons have excitatory synapses not only on relay cells in the LGN and pyramidal cells in V1, but also on inhibitory interneurons in the LGN and inhibitory double bouquet cells in layer 2/3 of V1. As an effect of the feedback, only a narrow range of the area V4 RF responded with high r_2 activity to a single bar stimulus, whereas in the outside area excitatory and inhibitory feedback influences compensated each other. On the basis of Fig. 4 and the decision table, the two-bar horizontal interaction study can be presented as the following Two-bar Decision Rules (DRT):

DRT V4 1: o_90 ∧ xpr_n ∧ ((xp_{-1.9} ∨ xp_{0.1} ∨ xp_{1.5}) ∧ xs_{0.25} ∧ ys_{0.4})_1 ∧ (o_90 ∧ xp_0 ∧ xs_{0.25} ∧ ys_{0.4})_0 → r_2

(19)


DRT V4 2: o_90 ∧ xpr_m ∧ ((xp_{-1.8} ∨ xp_{-0.4} ∨ xp_{0.4} ∨ xp_{1.2}) ∧ xs_{0.25} ∧ ys_{0.4})_1 ∧ (o_90 ∧ xp_0 ∧ xs_{0.25} ∧ ys_{0.4})_0 → r_1

(20)

The one-bar decision rules can be interpreted as follows: the narrow vertical bar evokes a strong response in the central positions, and medium responses in a larger area near the central position. The two-bar decision rules state that the cell responses to two bars are strong if one bar is in the middle of the RF (the bar with index 0 in the decision rules) and the second narrow bar (the bar with index 1 in the decision rules) is in certain specific positions in the RF, eq. (19). But when the second bar is in less precise positions, cell responses become weaker, eq. (20). Responses of other cells are sensitive to other bar positions (Fig. 4G). These differences could be correlated with the anatomical variability of the descending

Fig. 4. Modiﬁed plots from [2]. Curves represent responses of several cells from area V4 to small single (E) and double (F, G) vertical bars. Bars change their position along x-axis (Xpos). Responses are measured in spikes/sec. Mean cell responses ± SE are marked in E, F, and G. Cell responses are divided into three ranges by thin horizontal lines. Below each plot are schematics showing bar positions giving r1 (gray) and r2 (black) responses; below (E) for a single bar, below (F and G) for double bars (one bar was always in position 0). (H) This schematic extends responses for horizontally placed bars (E) to the whole RF: white color shows excitatory, black color inhibitory interactions between bars. Bars’ interactions are asymmetric in the RF.


Table 2. Decision table for cells shown in Fig. 4. Attributes o, ob, sf, and sfb were constant and are not presented in the table.

cell  xp     xpr  xs    ys  s   r
61e   -0.7   1.4  0.25  4   2   1
61f1  -1.9   0.2  0.25  4   22  2
61f2   0.1   0.2  0.25  4   22  2
61f3   1.5   0.1  0.25  4   22  2
61f4  -1.8   0.6  0.25  4   12  1
61f5  -0.4   0.8  0.25  4   22  1
61f6   0.4   0.8  0.25  4   22  1
61f7   1.2   0.8  0.25  4   22  1
62g1  -1.5   0.1  0.25  4   22  2
62g2  -0.15  0.5  0.25  4   22  2
62g3  -1.5   0.6  0.25  4   22  1
62g4  -0.25  1.3  0.25  4   22  1
62g5   1     0.6  0.25  4   22  1
63h1  -0.5   0    0.5   1   44  2
63h2   1     1    1     1   44  1
63h3   0.2   0.1  0.25  4   22  2

Table 3. Decision table for one cell shown in Fig. 5. Attributes yp and ypr are constant and are not presented in the table. We introduce another stimulus parameter, the difference in the drift directions of the gratings in the two patches: ddg = 0 when the gratings drift in the same direction, and ddg = 1 when they drift in opposite directions.

cell  xp     xpr  xs  ys  ddg  r
64c   -4.5   3    1   1   1    2
64c1  -1.75  1.5  1   1   1    1
64c2  -0.5   1    1   1   1    2
64d   -6     0    1   8   0    2
64d1  -3.5   4.8  1   8   0    1

axons' connections. As mentioned above, V4 axons in V1 have distinct clusters or linear branches. Descending pathways are modulators, which means that they follow the logical "or" rule. This rule states that cells in area V1 become more active as a result of the feedback only if their patterns "fit" the area V4 cell's "expectation". The decision table (Table 3), based on Fig. 5, describes cell responses to two patches placed in different positions along the x-axis of the receptive field (RF). Figure 5 shows that adding a second patch reduced the single-patch cell responses. We have assumed that the cell response to a single patch placed in the middle of the RF is r_2. The second patch suppresses cell responses to a greater extent when it is more similar to the first patch (Fig. 5D).


Fig. 5. Modified plots from [2]. Curves represent V4 cell responses to two patches with gratings moving in opposite directions, for patches of 1 deg diameter (C), and in the same direction, for patches 1 deg wide and 8 deg long (D). One patch is always at x-axis position 0 and the second patch changes its position, as marked in the XPos coordinates. The horizontal lines represent 95% confidence intervals for the response to a single patch in position 0. Below C and D, schematics show the positions of the patches and their influences on cell responses. Arrows show the direction of the moving gratings. The lower part of the figure shows two schematics of the excitatory (white) and inhibitory (black) interactions between patches in the RF. Patches with gratings moving in the same direction (right schematic) show larger inhibitory areas (more dark color) than patches moving in opposite directions (left schematic).

Two-patch horizontal interaction decision rules are as follows:

DRT V4 3: ddg_1 ∧ (o_0 ∧ xpr_3 ∧ xp_{4.5} ∧ xs_1 ∧ ys_1)_1 ∧ (o_0 ∧ xp_0 ∧ xs_1 ∧ ys_1)_0 → r_2

(21)

DRT V4 4: ddg_1 ∧ (o_0 ∧ xpr_1 ∧ xp_{0.5} ∧ xs_1 ∧ ys_1)_1 ∧ (o_0 ∧ xp_0 ∧ xs_1 ∧ ys_1)_0 → r_2

(22)

DRT V4 5: ddg_0 ∧ (o_0 ∧ xpr_{4.8} ∧ xp_{3.5} ∧ xs_1 ∧ ys_8)_1 ∧ (o_0 ∧ xp_0 ∧ xs_1 ∧ ys_1)_0 → r_1

(23)


Table 4. Decision table for cells in Fig. 6. Attributes yp, ypr, xs = ys = 0.5 deg, and s = 33 (two discs) are constant and are not presented in the table. We introduce another stimulus parameter, the difference in the polarities of the two patches: dp = 0 if the polarities are the same, and dp = 1 if the polarities are opposite.

cell  xp     xpr  dp  r
81a   -0.1   0.5  0   1
81a1  -1.75  0.3  0   1
81a2  -1.2   1    1   1
81a3   1.25  1.5  1   1
81a4  -1.3   0.3  1   2
81a5  -1.3   0.3  1   2
81a6   1.5   0.4  1   2
81b   -1.4   0.6  1   1
81b1   0.9   0.8  1   1
81b2   0.9   0.2  1   2

These decision rules can be interpreted as follows: patches with gratings drifting in opposite directions give strong responses when positioned very near (overlapping) or 150% of their width apart from one another, eqs. (21, 22). Interactions of patches with similar gratings evoked small responses over a large extent of the RF, eq. (23). Generally, interactions between similar stimuli evoke stronger and more extended inhibition than interactions between different stimuli. These and other examples can be generalized to other classes of objects. Two-spot horizontal interaction decision rules are as follows:

DRT V4 6: dp_0 ∧ s_33 ∧ (((xp_{-0.1} ∧ xpr_{0.5}) ∨ (xp_{-1.75} ∧ xpr_{0.3})) ∧ xs_{0.5})_1 ∧ (xp_0 ∧ xs_{0.5})_0 → r_1

(24)

DRT V4 7: dp_1 ∧ s_33 ∧ (((xp_{-1.2} ∧ xpr_1) ∨ (xp_{1.25} ∧ xpr_{1.5})) ∧ xs_{0.5})_1 ∧ (xp_0 ∧ xs_{0.5})_0 → r_1

(25)

DRT V4 8: dp_1 ∧ s_33 ∧ (((xp_{-1.3} ∧ xpr_{0.2}) ∨ (xp_{1.5} ∧ xpr_{0.4})) ∧ xs_{0.5})_1 ∧ (xp_0 ∧ xs_{0.5})_0 → r_2

(26)

DRT V4 9: dp_1 ∧ s_33 ∧ (((xp_{-1.4} ∧ xpr_{0.6}) ∨ (xp_{0.9} ∧ xpr_{0.8})) ∧ xs_{0.5})_1 ∧ (xp_0 ∧ xs_{0.5})_0 → r_1

(27)

DRT V4 10: dp_1 ∧ s_33 ∧ ((xp_{0.9} ∧ xpr_{0.2}) ∧ xs_{0.5})_1 ∧ (xp_0 ∧ xs_{0.5})_0 → r_2

(28)

where dp is the difference in light polarities between the two light spots (s_33), subscript 1 refers to the spot changing its x-axis position, and subscript 0 refers to the spot at position 0 on the x-axis.


Fig. 6. Modified plots from [2]. Curves represent V4 cell responses to a pair of 0.5 deg diameter bright and dark discs tested along the width axis. Continuous lines mark the curves for responses to different-polarity stimuli, and same-polarity stimuli are marked by a dashed line. Schematics for the cell responses shown in (A) are in (C-F) and (I, J). Schematics for the cell responses in (B) are in (G) and (H). Interactions between same-polarity light spots (C) are different from interactions between different-polarity patches (D-H). Small responses (class 1) are in (C), (D), (G), and larger responses (class 2) are in (E), (F), (H). (E) shows that there are no r_2 responses in same-polarity two-spot interactions. (I) shows small short-range excitatory (gray) and strong inhibitory (black) interactions between same-polarity spots, and (J) shows short-range inhibitory (dark) and longer-range excitatory interactions between different-polarity spots.

We propose the following classes of the object's Parts Interaction Rules (PIRs):

PIR1: Facilitation when the stimulus consists of multiple similar thin bars with small distances (about 0.5 deg) between them, and suppression when the distance between bars is larger than 0.5 deg. Suppression/facilitation is very often a nonlinear function of the distance. In our experiments (Fig. 4), cell responses to two bars were periodic along the receptive field, with dominating periods of about 30, 50, or 70% of the RF width. These nonlinear interactions were also observed along the vertical axis and the diagonals of the RF and often show strong asymmetries in relation to the RF middle.

PIR2: Strong inhibition when the stimulus consists of multiple similar patches filled with gratings, with the distance between patch edges ranging from 0 deg (touching) to 2 deg; weak inhibition when the distance is between 3 and 5 deg across the RF width.


PIR3: If bars or patches have different attributes, like polarity or drift direction, their mutual suppression is smaller, and localized facilitation at small distances between stimuli is present. As in the bar interactions, suppressions/facilitations between patches or bright/dark discs can be periodic along different RF axes and are often asymmetric in the RF.

We have tested the above rules in nine cells from area V4 by using disc- or annulus-shaped stimuli filled with optimally oriented drifting gratings of variable spatial frequency (Pollen et al. [2], Figs. 9, 10). Our assumption was that if there is a strong inhibitory mechanism as described in rule PIR2, then responses to an annulus with an inner diameter of at least 2 deg will be stronger than responses to the disc. In addition, by changing the spatial frequencies of the gratings inside the annulus, we expected eventually to find other periodicities along the RF width, as described by PIR3. In summary, we wanted to find out what relations there are between stimulus properties and area V4 cell responses: whether the B-elementary granules determine the equivalence classes of the relation IND{r}, the V4-elementary granules, i.e., whether [u]_B ⇒ [u]_{B4}. It was evident from the beginning that, because different area V4 cells have different properties, their responses to the same stimuli will be different; therefore we wanted to know whether rough set theory would help us in our data modeling. We assign low (sf_l), medium (sf_m), and high (sf_h) spatial frequencies as follows: sf_l if 0 < sf ≤ 1 c/deg, sf_m if 1 c/deg < sf ≤ 4 c/deg, sf_h if sf > 4 c/deg. On the basis of this definition, we calculate for each row in Table 5 the spatial frequency range by taking into account the spatial frequency bandwidth (sfb). Therefore 107a is divided into 107al and 107am, 108a into 108al and 108am, and 108b into 108bl, 108bm, and 108bh. The stimuli used in these experiments can be placed in the following ten categories:

Y0 = |sf_l xo_7 xi_0 s_4| = {101, 105}
Y1 = |sf_l xo_7 xi_2 s_5| = {101a, 105a}
Y2 = |sf_l xo_8 xi_0 s_4| = {102, 104}
Y3 = |sf_l xo_8 xi_3 s_5| = {102a, 104a}
Y4 = |sf_l xo_6 xi_0 s_4| = {103, 106, 107, 108, 20a, 20b}
Y5 = |sf_l xo_6 xi_2 s_5| = {103a, 106a, 107al, 108bl}
Y6 = |sf_l xo_4 xi_0 s_4| = {108al}
Y7 = |sf_m xo_6 xi_2 s_5| = {107am, 108bm}
Y8 = |sf_m xo_4 xi_0 s_4| = {107b, 108am}
Y9 = |sf_h xo_6 xi_2 s_5| = {108bh}


Table 5. Decision table for eight cells comparing the center-surround interaction. All stimuli were concentric, and therefore the attributes are not xs, ys but xo (outer diameter) and xi (inner diameter). All stimuli were localized around the middle of the receptive field, so that xp = yp = xpr = ypr = 0 were constant and we did not put them in the table. The optimal orientation was normalized, denoted as 1, and removed from the table.

cell  sf   sfb   xo  xi  s  r
101   0.5  0     7   0   4  0
101a  0.5  0     7   2   5  1
102   0.5  0     8   0   4  0
102a  0.5  0     8   3   5  0
103   0.5  0     6   0   4  0
103a  0.5  0     6   2   5  1
104   0.5  0     8   0   4  0
104a  0.5  0     8   3   5  2
105   0.5  0     7   0   4  0
105a  0.5  0     7   2   5  1
106   0.5  0     6   0   4  1
106a  0.5  0     6   3   5  2
107   0.5  0.25  6   0   4  2
107a  2.1  3.8   6   2   5  2
107b  2    0     4   0   4  1
108   0.5  0     6   0   4  1
108a  2    0     4   0   4  2
108b  5    9     6   2   5  2
20a   0.5  0     6   0   4  1
20b   0.5  0     6   0   4  2

These are equivalence classes for the stimulus attributes, which means that within each class the stimuli are indiscernible, IND(B). We have normalized the orientation bandwidth to 0 in {20a, 20b} and the spatial frequency bandwidth to 0 in {107, 107a, 108a, 108b}, and put the values covered by the bandwidth into the spatial frequency parameters. There are three ranges of responses, denoted r_0, r_1, r_2. Therefore, on the basis of the neurological data, there are the following three categories of cell responses:

|r_0| = {101, 102, 102a, 103, 104, 105}
|r_1| = {101a, 103a, 105a, 107b, 108, 20a}
|r_2| = {104a, 106a, 107, 107al, 107am, 108al, 108am, 108bl, 108bm, 108bh, 20b}

which are denoted as X_0, X_1, X_2. We will calculate the lower and upper approximations [1] of the brain's basic concepts in terms of the stimulus basic categories:

B̲X_0 = Y0 ∪ Y2 = {101, 105, 102, 104}
B̄X_0 = Y0 ∪ Y2 ∪ Y3 ∪ Y4 = {101, 105, 102, 104, 102a, 104a, 103, 106, 107, 108, 20a, 20b}
B̲X_1 = Y1 = {101a, 105a}
B̄X_1 = Y1 ∪ Y5 ∪ Y6 ∪ Y4 = {101a, 105a, 103a, 107al, 108bl, 106a, 20b, 107b, 108al, 103, 107, 106, 108, 20a}


B̲X_2 = Y7 ∪ Y9 = {107am, 108bm, 108bh}
B̄X_2 = Y7 ∪ Y9 ∪ Y8 ∪ Y6 ∪ Y3 ∪ Y4 ∪ Y5 = {107am, 108bm, 108bh, 107b, 108am, 102a, 104a, 103a, 107al, 108bl, 106a, 20b, 103, 107, 106, 108, 20a, 108al}

Concept 0 and concept 1 are roughly B-defined, which means that only with some approximation have we found that the stimuli do not evoke a response, or evoke a weak or strong response, in the area V4 cells. Certainly, a stimulus such as Y0 or Y2 does not evoke a response in any of our examples, i.e., in cells 101, 105, 102, 104. Also, stimulus Y1 evokes a weak response in all our examples: 101a, 105a. We are interested in stimuli that evoke strong responses, because they are specific for area V4 cells. We find two such stimuli, Y7 and Y9. Meanwhile, other stimuli, such as Y3 and Y4, evoke no, weak, or strong responses in our data. We can find the quality [1] of our experiments by comparing the properly classified stimuli POS_B(r) = {101, 101a, 105, 105a, 102, 104, 107am, 108bm, 108bh} to all stimuli and to all responses:

γ{r} = card{101, 101a, 105, 105a, 102, 104, 107am, 108bm, 108bh} / card{101, 101a, …, 20a, 20b} = 0.38.

We can also ask what percentage of cells we have fully classified. We obtain consistent responses from 2 of 9 cells, which means that γ = 0.22. This is related to the fact that for some cells we have tested more than two stimuli. What is also important from an electrophysiological point of view is that there are negative cases. There are many negative instances for concept 0, which means that in many cases this brain area responds to our stimuli; however, it seems that our concepts are still only roughly defined. We have the following decision rules:

DR V4 7: sf_l ∧ xo_7 ∧ xi_2 ∧ s_5 → r_1

(29)

DR V4 8: sf_l ∧ xo_7 ∧ xi_0 ∧ s_4 → r_0

(30)

DR V4 9: sf_l ∧ xo_8 ∧ xi_0 ∧ s_4 → r_0

(31)

DR V4 10: (sf_m ∨ sf_h) ∧ xo_6 ∧ xi_2 ∧ s_5 → r_2

(32)

These can be interpreted as the statement that a large annulus (s_5) evokes a weak response, but a large disc (s_4) evokes no response, when modulated with low spatial frequency gratings. However, a somewhat smaller annulus containing medium or high spatial frequency objects evokes strong responses. It is unexpected that certain stimuli evoke inconsistent responses in different cells (Table 5):

103: sf_l ∧ xo_6 ∧ xi_0 ∧ s_4 → r_0
106: sf_l ∧ xo_6 ∧ xi_0 ∧ s_4 → r_1
107: sf_l ∧ xo_6 ∧ xi_0 ∧ s_4 → r_2

A disc of not very large dimensions containing a low spatial frequency grating can evoke no response (103), a small response (106), or a strong response (107).
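The rough-set computations of this section can be reproduced with a few lines of Python (a sketch; the granules Y0-Y9, the concept X_0, and the positive region are hard-coded from the text above):

Y = [                                   # B-elementary granules (stimulus categories)
    {"101", "105"}, {"101a", "105a"}, {"102", "104"}, {"102a", "104a"},
    {"103", "106", "107", "108", "20a", "20b"},
    {"103a", "106a", "107al", "108bl"}, {"108al"},
    {"107am", "108bm"}, {"107b", "108am"}, {"108bh"},
]
X0 = {"101", "102", "102a", "103", "104", "105"}    # concept r_0: no response

def lower_approx(concept):
    # Union of the granules wholly contained in the concept.
    return set().union(*[g for g in Y if g <= concept])

def upper_approx(concept):
    # Union of the granules that intersect the concept.
    return set().union(*[g for g in Y if g & concept])

print(sorted(lower_approx(X0)))   # ['101', '102', '104', '105'] = Y0 u Y2
print(sorted(upper_approx(X0)))   # Y0 u Y2 u Y3 u Y4, as computed in the text

# Positive region of {r} as listed in the text; 24 stimuli after splitting by bandwidth.
POS = {"101", "101a", "105", "105a", "102", "104", "107am", "108bm", "108bh"}
print(round(len(POS) / 24, 2))    # 0.38 = gamma{r}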


4


Discussion

The physical properties of objects are different from their psychological representation. Gärdenfors [27] proposed to describe the principle of the human perceptual system as grouping objects by similarities in a conceptual space. Human perceptual systems group together similar objects with unsharp boundaries [27], which means that objects are related to their parts by rough inclusion, or that different parts belong to objects with some approximation (degree) [28]. We suggest that the similarity relations between objects and their parts are related to the hierarchical relationships between different visual areas. These similarities may be related to synchronizations of multi-resolution, parallel computations, and are difficult to simulate using a digital computer [29]. Treisman [30] proposed that our brains extract features related to different objects using two different procedures: parallel and serial processing. The "basic features" were identified in psychophysical experiments as elementary features that can be extracted in parallel. Evidence of parallel feature extraction comes from experiments showing that the extraction time is independent of the number of objects. Other features need serial searches, so that the extraction time is proportional to the number of objects. High-level serial processing is associated with the integration and consolidation of items, combined with conscious awareness. Low-level parallel processes, on the other hand, are rapid, global, related to high-efficiency categorization of items, and largely unconscious [30]. Treisman [30] showed that instances of a disjunctive set of at least four basic features could be detected through parallel processing. Other researchers have provided evidence for the parallel detection of more complex features, such as shape from shading [31] or experience-based learning of features of intermediate complexity [32]. Thorpe et al. [33], however, found in recent experiments that human and nonhuman primates can rapidly and accurately categorize briefly flashed natural images. Human and monkey observers are very good at deciding whether or not a novel image contains an animal, even when more than one image is presented simultaneously [34]. The underlying visual processing reflecting the decision that a target was present takes under 150 ms [33]. These findings contradict the classical view that only simple "basic features", likely related to early visual areas like V1 and V2, are processed in parallel [30]. Certainly, natural scenes contain more complex stimuli than "simple" geometric shapes. It seems that the conventional two-stage perception-processing model needs correction, because to the "basic features" we must add a set of unknown intermediate features. We propose that at least some intermediate features are related to the receptive field properties in area V4. Area V4 has been associated with shape processing because its neurons respond to shapes [35] and because lesions in this area disrupt shape discriminations, complex-grouping discriminations [36], multiple-viewpoint shape discriminations [37], and rotated shape discriminations [38]. Area V4 responses are also driven by curvature or circularity, which was recently observed by means of human fMRI [39]. By applying rough sets to V4 neuron responses, we have differentiated between bottom-up information (hypothesis testing) related to the sensory input,


and predictions, some of which can be learned but which are generally related to positive feedback from higher areas. If a prediction is in agreement with a hypothesis, the object classification will change from category 1 to category 2. Our research suggests that such decisions can be made very effectively during pre-attentive, parallel processing in multiple visual areas. In addition, we found that the decision rules of different neurons can be inconsistent. One should take into account that modeling complex phenomena demands the use of local models (captured by local agents, if one would like to use the multiagent terminology [6]) that should be fused afterwards. This process involves negotiations between agents [6] to resolve contradictions and conflicts in local modeling. One of the possible approaches to developing methods for complex concept approximation can be based on layered learning [41]. Inducing concept approximations should be developed hierarchically, starting from concepts that can be directly approximated using sensor measurements, toward complex target concepts related to perception. This general idea can be realized using additional domain knowledge represented in natural language. We have proposed decision rules for different visual areas and for the FF and FB connections between them. However, in processing our V4 experimental data, we have also found inconsistent decision rules. These inconsistencies could help process different aspects of the properties of complex objects. The principle is similar to that observed in the orientation-tuned cells of the primary visual cortex. Neurons in V1 with overlapping receptive fields show different preferred orientations. It is assumed that this overlap helps extract local orientations in different parts of an object. However, it is still not clear which cell will dominate if several cells with overlapping receptive fields are tuned to different attributes of a stimulus. Most models assume a "winner takes all" strategy, meaning that, using a convergence (synaptic weighted averaging) mechanism, the most dominant cells will take control over other cells, and less represented features will be lost. This approach is equivalent to a two-valued logic implementation. Our findings from area V4 seem to support a different strategy than the "winner takes all" approach. It seems that different features are processed in parallel and then compared with the initial hypothesis in higher visual areas. We think that descending pathways play a major role in this verification process. At first, the activity of a single cell is compared with the feedback modulator by logical conjunction, to avoid hallucinations. Next, the global, logical disjunction ("modulators") operation allows the brain to choose a preferred pattern from the activities of different cells. This process of choosing the right pattern may have a strong anatomical basis, because individual axons have variable and complex terminal shapes, facilitating some regions and features over others, the so-called salient features (for example, Fig. 2). Learning can probably modify the synaptic weights of the feedback boutons, fine-tuning the modulatory effects of the feedback. Neurons in area V4 integrate an object's attributes from the properties of its parts in two ways: (1) within the area, via horizontal or intra-laminar local excitatory-inhibitory interactions; (2) between areas, via feedback connections tuned to lower visual areas. Our research puts more emphasis on feedback


connections because they are probably faster than horizontal interactions [42]. Different neurons have different Parts Interaction Rules (PIRs, as described in the Results section) and perceive objects by way of multiple "unsharp windows" (Figs. 4, 6). If an object's attributes fit the unsharp window, a neuron sends positive feedback [3] to lower areas which, as described above, use "modulator logical rules" to sharpen the attribute-extracting window and therefore change the neuron's response from class 1 to class 2 (Fig. 4 J and K; Fig. 6 C to D, E to F, and G to H). The above analysis of our experimental data leads us to suggest that the central nervous system chiefly uses at least two different logical rules: the "driver logical rule" and the "modulator logical rule." The first, the "driver logical rule," processes data using a large number of possible algorithms (over-representation). The second, the "modulator logical rule," supervises decisions and chooses the right algorithm. Below we look at possible cognitive interpretations of our model, using the shape categorization task as an example. The classification of different objects by their different attributes has been regarded as a single process termed "subordinate classification" [40]. Relevant perceptual information is related to subordinate-level shape classification by distinctive information about the object, like its size, surface, curvature of contours, etc. There are two theoretical approaches to shape representation: metric templates and invariant-parts models. As mentioned above, both theories assume that an image of the object is represented in terms of cell activation in areas like V1: a spatially arrayed set of multi-scale, multi-oriented detectors ("Gabor jets"). Metric templates [26] map object values directly onto units in an object layer, or onto hidden units, which can be trained to differentially activate or inhibit object units in the next layer [41]. Metric templates preserve the metrics of the input without the extraction of edges, viewpoint-invariant properties, parts, or the relations among parts. This model discriminates shape similarities and human psychophysical similarities of complex shapes or faces [25]. Matching a new image against those in the database is done by allowing the Gabor jets to independently change their own best fit (change their position). The similarity of two objects will be the sum of the correlations in corresponding jets. When this method is used, 95% accuracy can be achieved across several hundred faces, despite changes in object or face position or changes in facial expressions [43]. The main problems with the Lades model [26] described above are that it does not distinguish among the largest effects in object recognition: it is insensitive to contour variations, which are very important psychophysically, and it is insensitive to salient features (non-accidental properties [NAPs]) [3]. The model we propose here suggests that these features are probably related to the effects of the feedback pathways, which may strengthen differences, signal salient features, and also assemble other features, making it possible to extract contours. A geon structural description (GSD) is a two-dimensional representation of an arrangement of parts, each specified in terms of its non-accidental characterization and the relations amongst these parts [38]. Across objects, the parts (geons) can differ in their NAPs. NAPs are properties that do not change with


Fig. 7. Comparison of diﬀerences in nonaccidental properties between a brick and a cylinder using geon [3] and our model. The geon shows attributes from psychological space like curves, parallels or vertices, which may be diﬀerent in diﬀerent subjects. The neurological model compares properties of both objects on the basis of a single cell recordings from the visual system. Both objects can stimulate similar receptive ﬁelds in area V4. These receptive ﬁelds are sensitive in annuli - they extract orientation change in diﬀerent parts of the RF [2]. Area V1 RFs are sensitive to edge orientations, whereas LGN RFs extract spots related to corners. All these diﬀerent attributes are put together by FF and FB pathways.

small depth rotations of an object. The presence or absence of the NAP of some geons or the diﬀerent relations between them may be the basis for subordinate level discrimination [38]. The advantage of the GSD is that the representation of objects in terms of their parts and the relations between them is accessible to cognition and fundamental for viewpoint invariant perception. Our neurological model introduces interactions between RF parts as in the geon model; however, our parts are deﬁned diﬀerently than the somewhat subjective parts of the GSD model. Fig. 7 shows diﬀerences in a simple objects understanding between geon and our neurological approach. The top part of this ﬁgure shows diﬀerences in nonaccidental properties between a brick and a cylinder [3]. We propose hierarchical deﬁnition of parts based on neurophysiological recordings from the visual system. Both objects may be classiﬁed in V4 by the receptive ﬁeld discriminating


between different stimulus orientations in its central and peripheral parts, as schematically presented in Fig. 7 [2]. Another, different classification is performed by area V1, where oriented edges are extracted from both objects (Fig. 7). However, an even more precise classification is performed in the LGN, where objects are seen as sets of small circular shapes similar to the receptive fields in the retina (bottom part of Fig. 7). In our model, interactions between parts and NAPs are associated with the role of area V4 in visual discrimination, as described in the lesion experiments above [36-38]. However, feedback from area V4 to the LGN and area V1 could be responsible for the possible mechanism associated with the properties of the GSD model. The different interactions between parts may be related to the complexity and the individual shapes of the different axons descending from V4. Their separated cluster terminals may be responsible for the invariance related to small rotations (NAPs). These are the anatomical bases of the GSD model, although we hypothesize that the electrophysiological properties of the descending pathways (FB), defined above as modulators, are even more important. The modulating role of the FB is related to the logic implied by the anatomical properties of the descending pathways. Through this logic, multiple patterns of coincidental activity between the LGN or V1 and the FB can be extracted. One may imagine that these differently extracted patterns of activity correlate with the multiple viewpoints or shape rotations defined as NAPs in the GSD model. In summary, by applying rough set theory to model neurophysiological data, we have shown a new approach to object categorization in psychophysical space. Two different logical rules are applied to the indiscernibility classes of the LGN, V1, and V4 receptive fields: "driver logical rules" put many possible object properties together, and "modulator logical rules" choose those attributes which are in agreement with our previous experiences.

Acknowledgement. Thanks to Carmelo Milo for his technical help, as well as to Farah Averill and Dana Hayward for their help in editing the manuscript.

References

1. Pawlak, Z.: Rough Sets - Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht (1991)
2. Pollen, D.A., Przybyszewski, A.W., Rubin, M.A., Foote, W.: Spatial receptive field organization of macaque V4 neurons. Cereb. Cortex 12, 601–616 (2002)
3. Biederman, I.: Recognition-by-components: a theory of human image understanding. Psychol. Rev. 94(2), 115–147 (1987)
4. Przybyszewski, A.W., Gaska, J.P., Foote, W., Pollen, D.A.: Striate cortex increases contrast gain of macaque LGN neurons. Vis. Neurosci. 17, 485–494 (2000)
5. Przybyszewski, A.W., Kagan, I., Snodderly, M.: Eye position influences contrast responses in V1 of alert monkey [Abstract]. Journal of Vision 3(9), 698, 698a (2003), http://journalofvision.org/3/9/698/
6. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 2nd edn. Prentice Hall Series in Artificial Intelligence (2003)


7. Przybyszewski, A.W., Kon, M.A.: Synchronization-based model of the visual system supports recognition. Program No. 718.11. 2003 Abstract Viewer/Itinerary Planner. Society for Neuroscience, Washington, DC (2003)
8. Kuffler, S.W.: Neurons in the retina; organization, inhibition and excitation problems. Cold Spring Harb. Symp. Quant. Biol. 17, 281–292 (1952)
9. Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. 160, 106–154 (1962)
10. Schiller, P.H., Finlay, B.L., Volman, S.F.: Quantitative studies of single-cell properties in monkey striate cortex. I. Spatiotemporal organization of receptive fields. J. Neurophysiol. 39, 1288–1319 (1976)
11. Kagan, I., Gur, M., Snodderly, D.M.: Spatial organization of receptive fields of V1 neurons of alert monkeys: comparison with responses to gratings. J. Neurophysiol. 88, 2557–2574 (2002)
12. Bardy, C., Huang, J.Y., Wang, C., FitzGibbon, T., Dreher, B.: 'Simplification' of responses of complex cells in cat striate cortex: suppressive surrounds and 'feedback' inactivation. J. Physiol. 574, 731–750 (2006)
13. Przybyszewski, A.W.: Checking Brain Expertise Using Rough Set Theory. In: Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 746–755. Springer, Heidelberg (2007)
14. Alonso, J.M., Usrey, W.M., Reid, R.C.: Rules of connectivity between geniculate cells and simple cells in cat primary visual cortex. J. Neurosci. 21(11), 4002–4015 (2001)
15. Sherman, S.M., Guillery, R.W.: The role of the thalamus in the flow of information to the cortex. Philos. Trans. R. Soc. Lond. B Biol. Sci. 357(1428), 1695–1708 (2002)
16. Lund, J.S., Lund, R.D., Hendrickson, A.E., Bunt, A.H., Fuchs, A.F.: The origin of efferent pathways from the primary visual cortex, area 17, of the macaque monkey as shown by retrograde transport of horseradish peroxidase. J. Comp. Neurol. 164, 287–303 (1975)
17. Fitzpatrick, D., Usrey, W.M., Schofield, B.R., Einstein, G.: The sublaminar organization of corticogeniculate neurons in layer 6 of macaque striate cortex. Vis. Neurosci. 11, 307–315 (1994)
18. Ichida, J.M., Casagrande, V.A.: Organization of the feedback pathway from striate cortex (V1) to the lateral geniculate nucleus (LGN) in the owl monkey (Aotus trivirgatus). J. Comp. Neurol. 454, 272–283 (2002)
19. Angelucci, A., Sainsbury, K.: Contribution of feedforward thalamic afferents and corticogeniculate feedback to the spatial summation area of macaque V1 and LGN. J. Comp. Neurol. 498, 330–351 (2006)
20. Nakamura, H., Gattass, R., Desimone, R., Ungerleider, L.G.: The modular organization of projections from areas V1 and V2 to areas V4 and TEO in macaques. J. Neurosci. 13, 3681–3691 (1993)
21. Rockland, K.S., Virga, A.: Organization of individual cortical axons projecting from area V1 (area 17) to V2 (area 18) in the macaque monkey. Vis. Neurosci. 4, 11–28 (1990)
22. Rockland, K.S.: Configuration, in serial reconstruction, of individual axons projecting from area V2 to V4 in the macaque monkey. Cereb. Cortex 2, 353–374 (1992)
23. Rockland, K.S., Saleem, K.S., Tanaka, K.: Divergent feedback connections from areas V4 and TEO in the macaque. Vis. Neurosci. 11, 579–600 (1994)
24. Schummers, J., Mario, J., Sur, M.: Synaptic integration by V1 neurons depends on location within the orientation map. Neuron 36, 969–978 (2002)


25. Przybyszewski, A.W., Potapov, D.O., Rockland, K.S.: Feedback connections from area V4 to LGN. In: Ann. Meet. Society for Neuroscience, San Diego, USA (2001), http://sfn.scholarone.com/itin2001/prog#620.9
26. Lades, M., Vortbrueggen, J.C., Buhmann, J., Lange, J., von der Malsburg, C., Wuertz, R.P., Konen, W.: Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers 42, 300–311 (1993)
27. Gärdenfors, P.: Conceptual Spaces. MIT Press, Cambridge (2000)
28. Polkowski, L., Skowron, A.: Rough Mereological Calculi of Granules: A Rough Set Approach to Computation. Computational Intelligence 17, 472–492 (2001)
29. Przybyszewski, A.W., Linsay, P.S., Gaudiano, P., Wilson, C.: Basic Difference Between Brain and Computer: Integration of Asynchronous Processes Implemented as Hardware Model of the Retina. IEEE Trans. Neural Networks 18, 70–85 (2007)
30. Treisman, A.: Features and objects: the fourteenth Bartlett memorial lecture. Q. J. Exp. Psychol. A 40, 201–237 (1988)
31. Ramachandran, V.S.: Perception of shape from shading. Nature 331, 163–166 (1988)
32. Ullman, S., Vidal-Naquet, M., Sali, E.: Visual features of intermediate complexity and their use in classification. Nature Neuroscience 5, 682–687 (2002)
33. Thorpe, S., Fize, D., Marlot, C.: Speed of processing in the human visual system. Nature 381, 520–522 (1996)
34. Rousselet, G.A., Fabre-Thorpe, M., Thorpe, S.J.: Parallel processing in high-level categorization of natural images. Nat. Neurosci. 5, 629–630 (2002)
35. David, S.V., Hayden, B.Y., Gallant, J.L.: Spectral receptive field properties explain shape selectivity in area V4. J. Neurophysiol. 96, 3492–3505 (2006)
36. Merigan, W.H.: Cortical area V4 is critical for certain texture discriminations, but this effect is not dependent on attention. Vis. Neurosci. 17(6), 949–958 (2000)
37. Merigan, W.H., Pham, H.A.: V4 lesions in macaques affect both single- and multiple-viewpoint shape discriminations. Vis. Neurosci. 15(2), 359–367 (1998)
38. Girard, P., Lomber, S.G., Bullier, J.: Shape discrimination deficits during reversible deactivation of area V4 in the macaque monkey. Cereb. Cortex 12(11), 1146–1156 (2002)
39. Dumoulin, S.O., Hess, R.F.: Cortical specialization for concentric shape processing. Vision Research 47, 1608–1613 (2007)
40. Biederman, I., Subramaniam, S., Bar, M., Kalocsai, P., Fiser, J.: Subordinate-level object classification reexamined. Psychol. Res. 62, 131–153 (1999)
41. Poggio, T., Edelman, S.: A network that learns to recognize three-dimensional objects. Nature 343, 263–266 (1990)
42. Girard, P., Hupé, J.M., Bullier, J.: Feedforward and feedback connections between areas V1 and V2 of the monkey have similar rapid conduction velocities. J. Neurophysiol. 85(3), 1328–1331 (2001)
43. Wiskott, L., Fellous, J.-M., Krüger, N., von der Malsburg, C.: Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 775–779 (1997)

Diagnostic Feature Analysis of a Dobutamine Stress Echocardiography Dataset Using Rough Sets

Kenneth Revett

University of Westminster, Harrow School of Computer Science, London, England, HA1 3TP

Abstract. Stress echocardiography is an important functional diagnostic and prognostic tool that is now routinely applied to evaluate the risk of coronary artery disease (CAD). In patients who are unable to safely undergo a stress-based test, dobutamine is administered, which provides an effect on the cardiovascular system similar to stress. In this work, a complete dataset containing data on 558 subjects undergoing a prospective longitudinal study is employed to investigate which diagnostic features correlate with the final outcome. The dataset was examined using rough sets, which produced a series of decision rules that predict which features influence the outcomes measured clinically and recorded in the dataset. The results indicate that the ECG attribute was the most informative diagnostic feature. In addition, prehistory information has a significant impact on the classification accuracy.

Keywords: dobutamine, ECG, LTF-C, reducts, rough sets, stress echocardiography.

1 Introduction

Heart disease remains the number one cause of mortality in the western world. Coronary artery disease (CAD) is a primary cause of morbidity and mortality in patients with heart disease. The early detection of CAD was in part made possible in the late 1970s by the introduction of echocardiography - a technique for measuring the physical properties of the heart using a variety of imaging techniques such as ultrasound and Doppler flow measurements [1], [2], [3]. The purpose of these imaging studies is to identify structural malformations such as aneurysms and valvular deformities. Although useful, structural information may not provide the full clinical picture in the way that functional imaging techniques such as stress echocardiography (SE) may. This imaging technique is a versatile tool that allows clinicians to diagnose patients with CAD efficiently and accurately. In addition, it provides information concerning the prognosis of the patient - which can be used to provide on-going clinical support to help reduce morbidity. The underlying basis for SE is the induction of cardiovascular stress, which generates cardiac ischemia, resulting in cardiac wall motion (a distension type of motion). The motion should reflect the ability of the vasculature to adapt to stressful situations such as enhanced physical activity. The extent to which the vessels (and the heart itself) expand under strenuous activity reflects the viability of the vasculature system. In coronary artery disease, the ability of the vasculature to adapt is limited as a result of episodes of ischemia - a reduction in local blood supply - which causes tissue damage. Normally, the walls of the heart (in particular the left ventricle) move in a typical fashion in response to stress (i.e. heavy exercise). A quantitative measure called the wall motion score is computed, and its magnitude is directly related to the extent of wall motion abnormality (WMA). The WMA score provides a quantitative measure of how the heart responds to stress. Stress echocardiography was originally induced under conditions of strenuous exercise on, for example, bicycles and treadmills. In many cases though, patients are not able to exercise to the level required, and pharmacological agents such as dobutamine or dipyridamole have been used to induce approximately the same level of stress on the heart as physical exercise. Dobutamine in particular emulates the effects of physical exercise on the cardiovascular system by increasing the heart rate and blood pressure and by impacting cardiac contractility - which drives cardiac oxygen demand [4]. A number of reports have indicated that though there are subtle differences between exercise and pharmacologically induced stress, they essentially provide the same stimulus to the heart and can therefore, in general, be used interchangeably [5], [6]. The focus of this paper is to investigate the effectiveness of dobutamine stress echocardiography (DSE) by analysing the results of a large study of 558 patients undergoing DSE. The purpose is to determine which attributes collected in this study correlate most closely with the decision outcome. After a careful investigation of this dataset, a set of rules is presented that relates conditional features (attributes) to decision outcomes. This rule set is generated through the application of rough sets, a data mining technique developed by the late Professor Pawlak [7]. The antecedents of the rule set contain information about which features are involved in the decision outcome. In addition, the values of the relevant features provide quantitative information regarding the values that are relevant for each feature in the respective decision class. This provides very useful information regarding the features that are directly relevant in predicting the outcome: in this case, whether SE provides prognostic value beyond other relevant and routinely collected medical information with respect to the likelihood of cardiac events. In the next section, a literature review of previous work involving the clinical application of stress echocardiography is presented.

1.1 Previous Work

In 1998, Chuah and colleagues published a report on the investigation of a follow-up study of 860 patients who underwent dobutamine stress echocardiography over a 2-year period [8]. The principal features examined in this study were wall motion abnormalities (WMA), cardiovascular risk factors, and clinical status (collected at the time the dobutamine stress test was administered). Any prior myocardial infarctions were determined by patient history or the presence of significant Q waves. The patient group (consisting of 479 men and 381 women, mean age 70 +/- 10) was monitored for a period of 52 months subsequent to the SE test. The follow-up results indicate that 86 patients had cardiac events, including 36 myocardial infarctions and cardiac death in 50 patients. Those patients with events tended to have a lower resting ejection fraction and more extensive WMAs at rest and with stress. The authors also examined how outcomes (as measured by the likelihood of an event) correlated with the SE results. Of the patients with normal SE results, 4% (12 of 302) had an event. Of the 321 patients with new or worsening WMAs, 44 (14%) had subsequent cardiac events during the follow-up period. Lastly, of the 237 patients with fixed WMAs (at rest and under stress), 30 (13%) had cardiac events during the follow-up period. The authors then examined the relationship between the feature space and the likelihood of a follow-up event (identifying univariate predictors of cardiac events). The independent predictors were: a history of congestive heart failure, the percentage of abnormal segments at peak stress (measured via SE), and an abnormal left ventricular end-systolic volume response to stress. In the study by Krivokapich and colleagues [3], the prognostic value of dobutamine SE was directly assessed with respect to predicting cardiac events in patients with, or suspected of having, coronary artery disease. The study was a retrospective examination of 1,183 patients that underwent DSE (dobutamine stress echocardiography). The patients were monitored for 12 months after a DSE examination in order to determine whether the results of the DSE were predictive of (or at least correlated with) subsequent cardiac events. The authors examined several features using bivariate logistic regression and forward and backward stepwise multiple logistic regression. The independent variables examined were: history of hypertension, diabetes mellitus, myocardial infarction, coronary artery bypass grafting surgery, age, gender, peak dose of dobutamine, rest and peak dobutamine heart rate, blood pressure, rate pressure product, presence of chest pain, abnormal electrocardiogram (ECG), WMA, and a positive SE. The results from this study indicate that a positive SE and an abnormal ECG were most indicative of a subsequent cardiac event (defined as a myocardial infarction, death or CABG). Patients that had a positive SE and an abnormal ECG had a 42% cardiac incidence rate, versus a 7% cardiac incidence rate for negative SE and ECG. A positive SE alone yielded a 34% cardiac incidence rate during the 12-month follow-up period. These results indicate the predictive power of a positive SE in terms of predicting cardiac events within a relatively short time window. A study by Marwick and colleagues [6] sought to determine whether dobutamine echocardiography could be used as an independent predictor of cardiac mortality in a group of 3,156 patients (1,801 men and 1,355 women, mean age 63 +/- 12 years) in a nine-year longitudinal follow-up study (1988-1994). At the time of the SE examination, several clinical variables and the patient history were recorded for subsequent univariate and multivariate analysis of predictors of cardiac death. During the follow-up period, 259 (8%) deaths attributed to cardiac failure occurred. The authors analysed the patient data with respect to clinical features in order to examine their predictive capacity generally - and to determine if SE was correlated in any way with the outcome. Age, gender, and heart failure therapy were predictive of cardiac failure during the follow-up period. The addition of resting left ventricular function and SE testing data further improved the predictive capacity of a sequential model (Kaplan-Meier survival curves and Cox proportional hazards models). In those patients with a negative dobutamine echocardiogram (1,581 patients), the average rate of cardiac mortality was 1% per year, compared to 8% in those patients with SE abnormalities. The final result from this study indicates that the inclusion of SE, in addition to standard clinical data, significantly increases the predictive power with respect to cardiac events. Though not an exhaustive list of published examinations of the predictive capacity of dobutamine echocardiography, the cases presented here are indicative of the approach used to examine whether this technique provides positive predictive information that can assist clinicians in patient care (see [8], [9] for additional studies). The approach is typically a longitudinal study utilising a substantial patient cohort. As most subjects are in clinical care for suspected heart disease, there is a substantial amount of clinical information that is acquired as part of the routine care of these patients. Typically, clinical data provide a predictive capacity on the order of 60%. The deployment of stress echocardiography enhances the predictive capacity over typical clinical data - even that acquired within the context of the disease based on previous medical exposure. Univariate and multivariate models provide quantitative information with respect to which variables appear to be correlated with the decision outcome. The reality for busy clinicians is that they may not be prepared to perform the complex analyses required to extract useful information from their data. This study attempts to provide a rational basis for the examination of the feature space of a typical SE dataset. The goal is to determine if the features are indeed correlated with the decision outcomes - and if so - what subset of features is relevant and what range of values is expected for predictive features. The next section presents a description of the dataset and some of the pre-processing stages employed for subsequent data analysis.

1.2 The Dataset

The data employed in this study were obtained from a prospective dobutamine stress echocardiography (DSE) study at the UCLA Adult Cardiac Imaging and Hemodynamics Laboratory conducted between 1991 and 1996. The patients were monitored during a five-year period and then observed for a further twelve months to determine if the DSE results could predict patient outcome. The outcomes were categorised into the following cardiac events: cardiac death, myocardial infarction (MI), and revascularisation by percutaneous transluminal coronary angioplasty (PTCA) or coronary artery bypass graft surgery (CABG) [5]. After normal exclusionary processes, the patient cohort consisted of 558 subjects (220 women and 338 men) with a median age of 67 (range 26-93). Dobutamine was administered intravenously using a standard delivery system, yielding a maximum dose of 40 μg/kg/min. There were a total of 30 attributes collected in this study, which are listed in Table 1.


Table 1. The decision table attributes and their data types (continuous, ordinal, or discrete) employed in this study (see for details). Note the range of correlation coefficients was -0.013 to 0.2476 (specific data not shown).

Attribute name                                                      Attribute type
bhr        basal heart rate                                         Integer
basebp     basal blood pressure                                     Integer
basedp     basal double product (= bhr x basebp)                    Integer
pkhr       peak heart rate                                          Integer
sbp        systolic blood pressure                                  Integer
dp         double product (= pkhr x sbp)                            Integer
dose       dose of dobutamine given                                 Integer
maxhr      maximum heart rate                                       Integer
mphr(b)    % of maximum predicted heart rate                        Integer
mbp        maximum blood pressure                                   Integer
dpmaxdo    double product on maximum dobutamine dose                Integer
dobdose    dobutamine dose at which maximum double product occurred Integer
age                                                                 Integer
gender     (male = 0)                                               Level (2)
baseef     baseline cardiac ejection fraction                       Integer
dobef      ejection fraction on dobutamine                          Integer
chestpain  (0 = experienced chest pain)                             Integer
posecg     signs of heart attack on ECG (0 = yes)                   Integer
equivecg   ECG is equivocal (0 = yes)                               Integer
restwma    wall motion anomaly on echocardiogram (0 = yes)          Integer
posse      stress echocardiogram was positive (0 = yes)             Integer
newMI      new myocardial infarction, or heart attack (0 = yes)     Integer
newPTCA    recent angioplasty (0 = yes)                             Level (2)
newCABG    recent bypass surgery (0 = yes)                          Level (2)
death      died (0 = yes)                                           Level (2)
hxofht     history of hypertension (0 = yes)                        Level (2)
hxofptca   history of angioplasty (0 = yes)                         Level (2)
hxofcabg   history of bypass surgery (0 = yes)                      Level (2)
hxofdm     history of diabetes (0 = yes)                            Level (2)
hxofMI     history of heart attack (0 = yes)                        Level (2)
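As a concrete illustration of how such a decision table can be assembled, the following minimal Python sketch (not part of the original study; the local file name 'cardiac.csv' and the use of pandas are assumptions) loads the attributes of Table 1 and isolates one outcome at a time as the decision attribute, following the per-outcome protocol described below.

import pandas as pd

# Hypothetical local copy of the UCLA dataset referenced in the
# acknowledgements; column names follow Table 1.
OUTCOMES = ["newMI", "newPTCA", "newCABG", "death"]

df = pd.read_csv("cardiac.csv")  # 558 rows, 30 attributes

def decision_table(df, outcome):
    # Keep one outcome as the decision attribute, drop the other three.
    drop = [o for o in OUTCOMES if o != outcome]
    return df.drop(columns=drop)

table_death = decision_table(df, "death")
print(table_death.shape)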

The attributes were a mixture of categorical and continuous values. The decision class used to evaluate this dataset was the outcomes as listed above and in Table 1. As a preliminary evaluation of the dataset, the data were evaluated with respect to each of the four possible measured outcomes included in the decision table individually, excluding each of the other three possible outcomes. This process was repeated for each of the outcomes in the decision table. Next, the effect of the electrocardiogram (ECG) was investigated. Reports indicate that this is a very informative attribute with respect to predicting the clinical outcome of a patient [3]. To evaluate the effect of the ECG on the outcomes, the base case investigation (all four possible outcomes) was performed with (base case) and without the ECG attribute. Lastly, the information content of any prehistory information was investigated to examine whether there was a correlation between the DSE and the outcome. There were a total of six different history attributes (see Table 1) that were tested to determine if each in isolation had a positive correlation with the outcomes. In the next section, we describe the experiments that were performed using rough sets (RSES 2.2.1).
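A rough sketch of this masking protocol is given below. The paper's rough-set rule classifier (RSES 2.2.1) is not reproduced here; a scikit-learn decision tree is used purely as a hedged stand-in, and 'decision_table' and 'df' come from the previous sketch, assuming the categorical columns are numerically coded as Table 1 suggests.

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def cv_accuracy(table, outcome, masked=()):
    # 10-fold cross validation, mirroring the protocol described in the text.
    X = table.drop(columns=[outcome, *masked])
    y = table[outcome]
    clf = DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X, y, cv=10).mean()

t = decision_table(df, "death")
print("base case :", cv_accuracy(t, "death"))
print("ECG masked:", cv_accuracy(t, "death", masked=["posecg", "equivecg"]))
for h in ["hxofht", "hxofptca", "hxofcabg", "hxofdm", "hxofMI"]:
    print(h, "masked:", cv_accuracy(t, "death", masked=[h]))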

2 Results

In the first experiment, each outcome was used as the sole decision attribute. The four outcomes were: new myocardial infarction (MI) (28 cases), death (24 cases), newPTCA (27 cases), and newCABG (33 cases). All continuous attributes were discretised using the MDL algorithm within RSES [9], [10]. Note there were no missing values in the dataset. A 10-fold cross validation was performed, using decision rules and dynamic reducts. Without any filtering of the reducts or rules, Table 2 presents randomly selected confusion matrices that were generated for each of the decision outcomes for the base case. The number of rules was quite large, and initially no filtering was performed to reduce either the number of reducts or the number of rules. The number of reducts for panels 'A'-'D' in Table 2 was: 104, 159, 245, and 122, respectively. On average, the length of the reducts ranged from 5-9, out of a total of 27 attributes (minus the 3 other outcome decision classes). The number of rules (all of which were deterministic) was quite large, with a range of 23,356-45,330 for the cases listed in Table 2. Filtering was performed on both reducts (based on support) and rule coverage in order to reduce the cardinality of the decision rules. The resulting decision rule sets were reduced to a range of 314-1,197 rules, and the corresponding accuracy was reduced by approximately 4% (range 3-6%). Filtering can be performed on a variety of conditions, such as LHS support, coverage, and RHS support; please consult [10], [11] for an excellent discussion of this topic.

Table 2. Confusion matrices for the 'base' cases of the four different outcomes. The label 'A' corresponds to death, 'B' to MI, 'C' to newPTCA, and 'D' to newCABG. Note the overall accuracy is placed at the lower right-hand corner of each subtable (italicized). [Four 2x2 confusion matrices; the tabular entries could not be reliably recovered from the extracted text.]


Table 3. The classification accuracy obtained using the exact same protocol as for the results reported in Table 2 (note the ECG attribute was excluded from the decision table). The results are the average over the four different outcomes. [Four 2x2 confusion matrices; the tabular entries could not be reliably recovered from the extracted text.]

Table 4. The classification accuracy obtained using the same protocol as for the data reported in Table 2 (note the ECG attribute was included in the decision table). The results are the average over the four different outcomes.

Attribute name                                 Classification accuracy
History of hypertension                        91.1%
History of diabetes                            85.3%
History of smoking                             86.3%
History of angioplasty                         90.3%
History of coronary artery bypass surgery      82.7%

In the next experiment, the correlation between the outcome and the ECG result was examined. It has been reported that the ECG, which is a standard cardiological test measuring the functional activity of the heart, should be correlated with the outcome [2]. We therefore repeated the experiment of Table 2 with the ECG attribute excluded (masked) from the decision table. The results are reported in Table 3. Lastly, we examined the effect of the historical information that was collected and incorporated into the dataset (see Table 1). These historical attributes include: history of hypertension, diabetes, smoking, myocardial infarction, angioplasty, and coronary artery bypass surgery. We repeated the base set of experiments (including ECG) and withheld each of the historical attributes one at a time; the results are reported as a set of classification accuracies, listed in Table 4. In addition to classification accuracy, rough sets provide a collection of decision rules in conjunctive normal form. These rules contain the attributes and


Table 5. Sample set of rules from the base case (+ECG) with death as the decision outcome. The right-hand column indicates the support (LHS) for the corresponding rule. Note that these rules were selected randomly from the full set.

Rule                                                                                              Support
dp([20716,*]) AND dobdose(40) AND hxofDM(0) AND anyevent(0) ⇒ death(0)                            19
dp([*,13105]) AND dobdose(40) AND hxofDM(0) AND anyevent(0) ⇒ death(0)                            18
basebp([*,159]) AND sbp([115,161]) AND hxofDM(0) AND anyevent(0) ⇒ death(0)                       24
dp([*,13105]) AND dobdose(35) AND dobEF([53,61]) AND hxofDM(1) ⇒ death(1)                         10
dp([20633,20716]) AND dobdose(4) AND baseEF([56,76]) AND hxofDM(0) AND anyevent(1) ⇒ death(1)     1
dp([*,13]) AND dobdose(30) AND hxofCABG(0) AND anyevent(1) AND ecg([*,2]) ⇒ death(1)              12

their values that are antecedents in a rule base. Therefore, the decision rules provide a codification of the knowledge contained within the decision table. Examples of the resulting rule set for the base case, using death as the decision attribute, are presented in Table 5.
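To make the role of these rules concrete, the sketch below (an illustration, not the RSES implementation) evaluates the LHS support and accuracy of one Table 5-style rule against the decision table 'df' from the earlier sketches. The 'anyevent' descriptor appearing in Table 5 is omitted, since it is not among the Table 1 attributes.

def rule_stats(df, conds, decision_col, decision_val):
    # conds maps a column name to a predicate encoding its descriptor,
    # e.g. the interval descriptor sbp([115,161]).
    lhs = df
    for col, test in conds.items():
        lhs = lhs[lhs[col].apply(test)]
    support = len(lhs)  # LHS support, as reported in Table 5
    accuracy = (lhs[decision_col] == decision_val).mean() if support else 0.0
    return support, accuracy

# basebp([*,159]) AND sbp([115,161]) AND hxofdm(0) => death(0)
conds = {"basebp": lambda v: v <= 159,
         "sbp": lambda v: 115 <= v <= 161,
         "hxofdm": lambda v: v == 0}
print(rule_stats(df, conds, "death", 0))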

3 Conclusion

This dataset contained a complete set of attributes (30) that was a mixture of continuous and categorical data. The data were obtained from a prospective study of cardiovascular health conducted by professional medical personnel (cardiographers). The attributes were obtained from patients undergoing stress echocardiography, a routine medical technique employed to diagnose coronary artery disease. From the initial classification results, the specificity of the classification using rough sets was quite high (90+%), consistent with some literature reports [2], [6]. As can be seen in Table 2, the sensitivity of the test was reasonably high, and consistent with several literature reports. The effect of the ECG, the attribute most correlated with the clinical outcome of CAD, was measured by masking this attribute. The results indicate that this attribute did not have a significant impact on the overall classification accuracy, but the sensitivity was reduced slightly when it was excluded from the decision table. This result requires further examination to quantify the role of an abnormal ECG, and the interaction/information content of an abnormal ECG and other medical indicators. The effect of patient history was examined, and the results (see Table 4) indicate that, in general, relevant medical history did have a positive impact on the classification accuracy. This result was quantified by examining the classification accuracy when these five history factors were removed from the decision table (one at a time). The effect of their combination was not examined in this paper, and is left for future work. The data clearly indicate that a positive SE result was highly correlated with a subsequent cardiac event. This result was demonstrated by examining the rule set, looking at the occurrences of this attribute in the consequent. Lastly, the rule set that was produced yielded a consistently reduced set of attributes, ranging from 4-9 attributes, greatly reducing the size of the dataset. As displayed in Table 5, and generally across the rule set, the dp and dobdose attributes appear consistently (have large support) within all decision outcomes (data not displayed). This type of analysis is a major product of the rough sets approach to data analysis: the extraction of knowledge from data. This is a preliminary study that will be pursued in conjunction with a qualified cardiologist. The results generated so far are interesting, and certainly consistent with, and in many cases superior to, other studies [1], [3]. To this author's knowledge, this is the first report which examines the dobutamine SE literature using rough sets. Komorowski and Øhrn have examined a similar dataset, but the imaging technique and attributes selected were different from those used in the study investigated in this work [12]. In a preliminary examination of this dataset, Revett [13] published similar results to this study. A principal addition in this study is the confirmation of the 2007 study through the application of a novel neural network (LTF-C) to corroborate the reduced attribute set extracted from the rough sets examination. The application of LTF-C did indeed confirm that the classification accuracy was maximal with the selected set of attributes, compared to an exhaustive investigation of the other attributes with respect to training speed and classification accuracy. The results from this study indicate that a rough sets approach to rule extraction from this dataset provided evidence that corroborates much of the results reported in the literature. The basis for applying rough sets is that it provides evidence with regard to the features and their values that are predictive with respect to the decision class. Further analysis of this dataset is possible, and this analysis would benefit from a close collaboration between medical experts and data mining engineers. Acknowledgements. The author would like to acknowledge the source of the dataset used in this study: Alan Garfinkle, UCLA (at the time of submission): http://www.stat.ucla.edu:16080/projects/datasets/cardiacexplanation.html

References

1. Tsutsui, J.M., Elhendy, A., Anderson, J.A., Xie, F., McGrain, A.C., Porter, T.R.: Prognostic value of dobutamine stress myocardial contrast perfusion echocardiography. Circulation 112, 1444–1450 (2005)
2. Armstrong, W.F., Zoghbi, W.A.: Stress Echocardiography: Current Methodology and Clinical Applications. J. Am. Coll. Cardiology 45, 1739–1747 (2005)
3. Krivokapich, J., Child, J.S., Walter, D.O., Garfinkel, A.: Prognostic value of dobutamine stress echocardiography in predicting cardiac events in patients with known or suspected coronary artery disease. J. Am. Coll. Cardiology 33, 708–716 (1999)


4. Bergeron, S., Hillis, G., Haugen, E., Oh, J., Bailey, K., Pellikka, P.: Prognostic value of dobutamine stress echocardiography in patients with chronic kidney disease. American Heart Journal 153(3), 385–391 (2007)
5. Marwick, T.H., Case, C., Poldermans, D., Boersma, E., Bax, J., Sawada, S., Thomas, J.D.: A clinical and echocardiographic score for assigning risk of major events after dobutamine echocardiography. Journal of the American College of Cardiology 43(11), 2102–2107 (2004)
6. Marwick, T.H., Case, C., Sawada, S., Timmerman, C., Brenneman, P., Kovacs, R., Short, L., Lauer, M.: Prediction of mortality using dobutamine echocardiography. Journal of the American College of Cardiology 37(3), 754–760 (2001)
7. Pawlak, Z.: Rough Sets - Theoretical Aspects of Reasoning about Data. Kluwer, Dordrecht (1991)
8. Chuah, S.-C., Pellikka, P.A., Roger, V.L., McCully, R.B., Seward, J.B.: Role of dobutamine stress echocardiography in predicting outcome of 860 patients with known or suspected coronary artery disease. Circulation 97, 1474–1480 (1998)
9. Senior, R.: Stress echocardiography - current status. Business Briefing: European Cardiology, 26–29 (2005)
10. Bazan, J., Szczuka, M.: The Rough Set Exploration System. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets III. LNCS, vol. 3400, pp. 37–56. Springer, Heidelberg (2005), http://logic.mimuw.edu.pl/~rses
11. Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough sets: A tutorial. In: Pal, S.K., Skowron, A. (eds.) Rough Fuzzy Hybridization - A New Trend in Decision Making, pp. 3–98. Springer, Heidelberg (1999)
12. Komorowski, J., Øhrn, A.: Modelling prognostic power of cardiac tests using rough sets. Artificial Intelligence in Medicine 15, 167–191 (1999)
13. Revett, K.: Analysis of a dobutamine stress echocardiography dataset using rough sets. In: Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 756–762. Springer, Heidelberg (2007)

Rules and Apriori Algorithm in Non-deterministic Information Systems

Hiroshi Sakai 1, Ryuji Ishibashi 1, Kazuhiro Koba 1, and Michinori Nakata 2

1 Mathematical Sciences Section, Department of Basic Sciences, Faculty of Engineering, Kyushu Institute of Technology, Tobata, Kitakyushu 804, Japan
[email protected]
2 Faculty of Management and Information Science, Josai International University, Gumyo, Togane, Chiba 283, Japan
[email protected]

Abstract. This paper presents a framework of rule generation in Non-deterministic Information Systems (NISs), which follows rough sets based rule generation in Deterministic Information Systems (DISs). Our previous work on NISs coped with certain rules, minimal certain rules and possible rules. These rules are characterized by the concept of consistency. This paper relates possible rules to rules defined by the criteria support and accuracy in NISs. On the basis of the information incompleteness in NISs, it is possible to define new criteria, i.e., minimum support, maximum support, minimum accuracy and maximum accuracy. Then, two strategies of rule generation are proposed based on these criteria. The first strategy is the Lower Approximation strategy, which defines rule generation under the worst condition. The second strategy is the Upper Approximation strategy, which defines rule generation under the best condition. To implement these strategies, we extend the Apriori algorithm in DISs to an Apriori algorithm in NISs. A prototype system is implemented, and this system is applied to some data sets with incomplete information.

Keywords: Rough sets, Non-deterministic information, Incomplete information, Rule generation, Lower and upper approximations, Apriori algorithm.

1 Introduction

Rough set theory has been used as a mathematical tool of soft computing for approximately two decades. This theory usually handles tables with deterministic information. Many applications of this theory, such as rule generation, machine learning and knowledge discovery, have been presented [5, 9, 15, 21, 22, 23, 24, 25, 36, 38]. We follow rule generation in Deterministic Information Systems (DISs) [21, 22, 23, 24, 33], and we describe rule generation in Non-deterministic Information Systems (NISs). NISs were proposed by Pawlak [21], Orlowska [19, 20] and Lipski [13, 14] to handle information incompleteness in DISs, like null values, unknown values and missing values. Since the emergence of incomplete information research, NISs have been playing an important role. Therefore, rule generation in NISs will also be an important framework for rule generation from incomplete information. The following are some important studies on rule generation from incomplete information. In [13, 14], Lipski showed a question-answering system besides an axiomatization of logic, and Orlowska established rough set analysis for non-deterministic information [3, 19, 20]. Grzymala-Busse developed a system named LERS, which depends upon the LEM1 and LEM2 algorithms [5, 6, 7], and recently proposed four interpretations of missing attribute values [8]. Stefanowski and Tsoukias defined non-symmetric similarity relations and valued tolerance relations for analyzing incomplete information [34, 35]. Kryszkiewicz proposed a framework of rules in incomplete information systems [10, 11, 12]. To the authors' knowledge, these are the most important studies on incomplete information. We have also discussed several issues related to non-deterministic information and incomplete information [16, 17, 18], and proposed a framework named Rough Non-deterministic Information Analysis (RNIA) [26, 27, 28, 29, 30, 31, 32]. In this paper, we briefly review RNIA, including certain and possible rules, and then develop rule generation by the criteria support and accuracy in NISs. In this rule generation, we extend the Apriori algorithm in DISs to a new algorithm in NISs. The computational complexity of this new algorithm is almost the same as that of the Apriori algorithm. Finally, we investigate a prototype system, and apply it to some data sets with incomplete information.

2 Basic Definitions and Background of the Research

This section summarizes basic definitions, and reviews the background of this research in [28, 31, 32].

2.1 Basic Definitions

A Deterministic Information System (DIS) is a quadruplet (OB, AT, {VAL_A | A ∈ AT}, f), where OB is a finite set whose elements are called objects, AT is a finite set whose elements are called attributes, VAL_A is a finite set whose elements are called attribute values, and f is a mapping f : OB × AT → ∪_{A∈AT} VAL_A, which is called a classification function. If f(x, A)=f(y, A) for every A ∈ ATR ⊂ AT, we say there is a relation between x and y for ATR. This relation is an equivalence relation over OB, and it is called an indiscernibility relation. We usually define two sets: CON ⊆ AT, which we call condition attributes, and DEC ⊆ AT, which we call decision attributes. An object x ∈ OB is consistent (with any distinct object y ∈ OB) if f(x, A)=f(y, A) for every A ∈ CON implies f(x, A)=f(y, A) for every A ∈ DEC.
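As a small illustration (our sketch, not from the paper), the indiscernibility relation of a DIS can be computed by grouping objects on their tuples of values over ATR; the three-object table is invented for the example.

from collections import defaultdict

def indiscernibility_classes(f, OB, ATR):
    # f maps (object, attribute) -> value; objects with equal tuples
    # over ATR fall into the same equivalence class.
    classes = defaultdict(set)
    for x in OB:
        classes[tuple(f[(x, A)] for A in ATR)].add(x)
    return list(classes.values())

f = {(1, "Color"): "red", (1, "Size"): "small",
     (2, "Color"): "red", (2, "Size"): "big",
     (3, "Color"): "blue", (3, "Size"): "big"}
print(indiscernibility_classes(f, [1, 2, 3], ["Color"]))  # [{1, 2}, {3}]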


A Non-deterministic Information System (NIS) is also a quadruplet (OB, AT, {VAL_A | A ∈ AT}, g), where g : OB × AT → P(∪_{A∈AT} VAL_A) (a power set of ∪_{A∈AT} VAL_A). Every set g(x, A) is interpreted as follows: there is an actual value in this set, but this value is not known. For a NIS=(OB, AT, {VAL_A | A ∈ AT}, g) and a set ATR ⊆ AT, we name a DIS=(OB, ATR, {VAL_A | A ∈ ATR}, h) satisfying h(x, A) ∈ g(x, A) a derived DIS (for ATR) from the NIS. For a set ATR={A_1, ..., A_n} ⊆ AT and any x ∈ OB, let PT(x, ATR) denote the Cartesian product g(x, A_1) × ... × g(x, A_n). We name every element a possible tuple (for ATR) of x. For a possible tuple ζ=(ζ_1, ..., ζ_n) ∈ PT(x, ATR), let [ATR, ζ] denote the formula ∧_{1≤i≤n} [A_i, ζ_i]. Every [A_i, ζ_i] is called a descriptor. Let PI(x, CON, DEC) (x ∈ OB) denote the set {[CON, ζ] ⇒ [DEC, η] | ζ ∈ PT(x, CON), η ∈ PT(x, DEC)}. We name an element of PI(x, CON, DEC) a possible implication (from CON to DEC) of x. In the following, τ denotes a possible implication, and τ^x denotes a possible implication obtained from an object x. Now, we define six classes of possible implications, certain rules and possible rules. For any τ^x ∈ PI(x, CON, DEC), let DD(τ^x, x, CON, DEC) denote the set {ϕ | ϕ is such a derived DIS for CON ∪ DEC that the implication from x in ϕ is equal to τ^x}. If PI(x, CON, DEC) is a singleton set {τ^x}, we say τ^x is definite. Otherwise we say τ^x is indefinite. If the set {ϕ ∈ DD(τ^x, x, CON, DEC) | x is consistent in ϕ} is equal to DD(τ^x, x, CON, DEC), we say τ^x is globally consistent (GC). If this set is equal to {}, we say τ^x is globally inconsistent (GI). Otherwise, we say τ^x is marginal (MA). By combining the two cases, i.e., 'D(efinite) or I(ndefinite)' and 'GC, MA or GI', we define six classes, DGC, DMA, DGI, IGC, IMA, IGI, in Table 1.

Table 1. Six classes of possible implications in NISs

             GC    MA    GI
Definite     DGC   DMA   DGI
Indefinite   IGC   IMA   IGI

Now, we give necessary and sufficient conditions for characterizing the GC, MA and GI classes. For any ζ ∈ PT(x, ATR), we define two sets:
inf(x, ATR, ζ) = {y ∈ OB | PT(y, ATR)={ζ}} ∪ {x},
sup(x, ATR, ζ) = {y ∈ OB | ζ ∈ PT(y, ATR)}.
Intuitively, inf(x, ATR, ζ) is a set of objects whose tuples are ζ and definite. If a tuple ζ ∈ PT(x, ATR) is not definite, the object x does not satisfy PT(x, ATR)={ζ}. Therefore, we added the set {x} in the definition of inf. The set sup(x, ATR, ζ) is a set of objects whose tuples may be ζ. Even though x does not appear on the right-hand side of sup, we employ the sup(x, ATR, ζ) notation due to the inf(x, ATR, ζ) notation. Generally, {x} ⊆ inf(x, ATR, ζ) = sup(x, ATR, ζ) holds in DISs, and {x} ⊆ inf(x, ATR, ζ) ⊆ sup(x, ATR, ζ) holds in NISs.

Theorem 1 [28, 29]. For a NIS, let us consider a possible implication τ^x : [CON, ζ] ⇒ [DEC, η] ∈ PI(x, CON, DEC). Then, the following holds.
(1) τ^x belongs to the GC class if and only if sup(x, CON, ζ) ⊆ inf(x, DEC, η).
(2) τ^x belongs to the MA class if and only if inf(x, CON, ζ) ⊆ sup(x, DEC, η).
(3) τ^x belongs to the GI class if and only if inf(x, CON, ζ) ⊄ sup(x, DEC, η).

Proposition 2 [28, 29]. For any NIS, let ATR ⊆ AT be {A_1, ..., A_n}, and let a possible tuple ζ ∈ PT(x, ATR) be (ζ_1, ..., ζ_n). Then, the following holds.
(1) inf(x, ATR, ζ) = ∩_i inf(x, {A_i}, (ζ_i)).
(2) sup(x, ATR, ζ) = ∩_i sup(x, {A_i}, (ζ_i)).

2.2 An Illustrative Example

Let us consider NIS_1 in Table 2. There are four derived DISs, shown in Table 3.

Table 2. A table of NIS_1

OB  Color         Size
1   {red, green}  {small}
2   {red, blue}   {big}
3   {blue}        {big}

Table 3. Four derived DISs from NIS_1. The tables are ϕ_1, ϕ_2, ϕ_3, ϕ_4 from left to right.

ϕ_1: OB  Color  Size     ϕ_2: OB  Color  Size
     1   red    small         1   red    small
     2   red    big           2   blue   big
     3   blue   big           3   blue   big

ϕ_3: OB  Color  Size     ϕ_4: OB  Color  Size
     1   green  small         1   green  small
     2   red    big           2   blue   big
     3   blue   big           3   blue   big

Let us focus on the possible implication τ_1^3 : [Color, blue] ⇒ [Size, big] ∈ PI(3, {Color}, {Size}). This τ_1^3 denotes the first implication from object 3, and τ_1^3 appears in all four derived DISs. Since
{2, 3} = sup(3, {Color}, (blue)) ⊆ inf(3, {Size}, (big)) = {2, 3}
holds, τ_1^3 belongs to the DGC class according to Theorem 1. Namely, τ_1^3 is consistent in each derived DIS. As for the second possible implication, τ_2^1 : [Color, red] ⇒ [Size, small] ∈ PI(1, {Color}, {Size}), the following holds:
{1, 2} = sup(1, {Color}, (red)) ⊄ inf(1, {Size}, (small)) = {1},
{1} = inf(1, {Color}, (red)) ⊆ sup(1, {Size}, (small)) = {1}.
According to Theorem 1, τ_2^1 belongs to the IMA class; namely, τ_2^1 appears in ϕ_1 and ϕ_2, and τ_2^1 is consistent just in ϕ_2.
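The example can be checked mechanically. The sketch below (our Python rendering, not the authors' Prolog system) computes inf and sup from g for NIS_1 and applies the conditions of Theorem 1.

g = {(1, "Color"): {"red", "green"}, (1, "Size"): {"small"},
     (2, "Color"): {"red", "blue"}, (2, "Size"): {"big"},
     (3, "Color"): {"blue"}, (3, "Size"): {"big"}}
OB = [1, 2, 3]

def inf(x, A, v):
    # objects whose value set for A is exactly {v}, plus x itself
    return {y for y in OB if g[(y, A)] == {v}} | {x}

def sup(x, A, v):
    # objects that may take value v on attribute A
    return {y for y in OB if v in g[(y, A)]}

def classify(x, A, zeta, D, eta):
    if sup(x, A, zeta) <= inf(x, D, eta):
        return "GC"
    return "MA" if inf(x, A, zeta) <= sup(x, D, eta) else "GI"

print(classify(3, "Color", "blue", "Size", "big"))   # GC (tau_1^3)
print(classify(1, "Color", "red", "Size", "small"))  # MA (tau_2^1)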

2.3 Certain Rule Generation in Non-deterministic Information Systems

This subsection briefly reviews the previous research on certain rule generation in NISs [28, 29]. We have named possible implications in the DGC class certain rules. For certain rule generation, we dealt with the following problem.

Problem 1 [29]. For a NIS, let DEC be the decision attributes and let η be a tuple of decision attribute values for DEC. Then, find minimal certain rules in the form of [CON, ζ] ⇒ [DEC, η].

According to Theorem 1, Problem 1 is reduced to finding some minimal sets of descriptors [CON, ζ] satisfying sup(x, CON, ζ) ⊆ inf(x, DEC, η). For solving this problem, we employed the discernibility function in DISs [33]. We adjusted the discernibility function to NISs, and implemented utility programs [29].

Example 1. Let us focus on the possible implication τ_1^3 : [Color, blue] ⇒ [Size, big] in Table 2, again. Since inf(3, {Size}, (big))={2, 3}, it is necessary to discriminate object 1 ∉ {2, 3} from object 3. The descriptor [Color, blue] discriminates object 1 from object 3, because sup(3, {Color}, (blue))={2, 3} and 1 ∉ sup(3, {Color}, (blue)) hold. In this way, the discernibility function DF(3) becomes [Color, blue], and we obtain the minimal certain rule τ_1^3. The following is a real execution.

% ./plc
?- consult(dgc_rule.pl).
yes
?- trans.
File Name for Read Open: 'data.pl'.
Decision Definition File: 'attrib.pl'.
File Name for Write Open: 'data.rs'.
EXEC_TIME=0.01796603203(sec)
yes
?- minimal.
/* [1,blue](=[Color,blue]), [2,big](=[Size,big]) */
> Descriptor [1,blue] is a core for object 1
[1,blue]=>[2,big]
[4/4(=4/4,1/1), Definite, GC: Only Core Descriptors]
EXEC_TIME=0.01397013664(sec)
yes


This program is implemented in Prolog [28, 29, 30]. Each attribute is identified with its ordinal number; namely, Color and Size are identified with 1 and 2, respectively. The underlined parts are specified by the user.

2.4 Non-deterministic Information and Incomplete Information

This subsection clarifies the semantic difference between non-deterministic information and incomplete information.

Table 4. A table of a DIS with incomplete information

OB  Color  Size
1   ∗      small
2   ∗      big
3   blue   big

Let us consider Table 4. The symbol "∗" is often employed for indicating incomplete information. Table 4 is generated by replacing the non-deterministic information in Table 2 with ∗. There are several interpretations of this ∗ symbol [4, 7, 8, 10, 17, 34]. In the simplest interpretation of incomplete information, the symbol ∗ may be each attribute value. Namely, ∗ may be either red, blue or green, and there are 9 (=3×3) possible tables for Table 4. In such a possible table, the implication from object 1 may be [Color, blue] ⇒ [Size, small], and this contradicts τ_1^3 : [Color, blue] ⇒ [Size, big]. On the other hand, in Table 2 the function gives g(1, {Color})={red, green} ⊂ {red, blue, green}, and we dealt with four derived DISs. In Table 2, we did not handle [Color, blue] ⇒ [Size, small] from object 1. Thus, τ_1^3 is globally consistent in Table 2, but τ_1^3 is inconsistent in Table 4. The function g(x, A) and the set sup(x, ATR, ζ) are employed for handling information incompleteness, and they cause the semantic difference between non-deterministic information and incomplete information. In RNIA, the interpretation of the information incompleteness comes from the meaning of the function g(x, A). There is no other assumption on this interpretation.
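The difference can be made concrete by counting completions (a sketch under the semantics just stated; values as in Tables 2 and 4):

from itertools import product

VAL = ["red", "blue", "green"]
# Under '*', each starred cell of Table 4 ranges over all of VAL:
star_tables = list(product(VAL, VAL))            # (Color of 1, Color of 2)
# Under NIS_1 (Table 2), the cells are restricted by g:
nis_tables = list(product(["red", "green"], ["red", "blue"]))

print(len(star_tables), len(nis_tables))         # 9 4
# [Color,blue] => [Size,small] from object 1 arises only under '*':
print(any(c1 == "blue" for c1, _ in star_tables))  # True
print(any(c1 == "blue" for c1, _ in nis_tables))   # False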

2.5 A Problem of Possible Rule Generation in Non-deterministic Information Systems

We have defined possible rules as possible implications which belong to either the DGC, DMA, IGC or IMA classes. In this case, there may be a large number of possible implications satisfying condition (2) in Theorem 1. For example, in Table 2 there are four possible implications including τ_1^3 and τ_2^1, and every possible implication is consistent in at least one derived DIS. Thus, every possible implication is a possible rule. This implies that the definition of possible rules may be too weak. Therefore, we need to employ other criteria for defining rules other than certain rules.


In the subsequent sections, we follow the standard framework of rule generation [1, 2, 22, 36, 38], and employ the criteria support and accuracy for defining rules, including possible rules.

3 New Criteria: Minimum Support, Minimum Accuracy, Maximum Support and Maximum Accuracy

This section proposes new criteria in NISs, and investigates the calculation of these criteria. These new criteria depend upon each element in DD(τ^x, x, CON, DEC), but the complexity of the calculation does not depend upon the number of elements in DD(τ^x, x, CON, DEC).

3.1 Definition of New Criteria

In a DIS, the criteria support and accuracy are usually applied to defining rules [1, 2, 36]. In a NIS, we define the following four criteria, i.e., minimum support minsupp(τ^x), maximum support maxsupp(τ^x), minimum accuracy minacc(τ^x) and maximum accuracy maxacc(τ^x):

(1) minsupp(τ^x) = min_{ϕ ∈ DD(τ^x, x, CON, DEC)} {support(τ^x) in ϕ},
(2) maxsupp(τ^x) = max_{ϕ ∈ DD(τ^x, x, CON, DEC)} {support(τ^x) in ϕ},
(3) minacc(τ^x) = min_{ϕ ∈ DD(τ^x, x, CON, DEC)} {accuracy(τ^x) in ϕ},
(4) maxacc(τ^x) = max_{ϕ ∈ DD(τ^x, x, CON, DEC)} {accuracy(τ^x) in ϕ}.

If τ^x is definite, DD(τ^x, x, CON, DEC) is equal to the set of all derived DISs. If τ^x is indefinite, DD(τ^x, x, CON, DEC) is a proper subset of all derived DISs. If we employed all derived DISs instead of DD(τ^x, x, CON, DEC) in the above definition, minsupp(τ^x) and minacc(τ^x) would be 0, because there exist some derived DISs in which τ^x does not appear. This property holds trivially for each indefinite τ^x, so we define minsupp(τ^x) and minacc(τ^x) over DD(τ^x, x, CON, DEC).

Example 2. In Table 2, let us focus on the possible implication τ_1^3 : [Color, blue] ⇒ [Size, big] ∈ PI(3, {Color}, {Size}). In DD(τ_1^3, 3, {Color}, {Size})={ϕ_1, ϕ_2, ϕ_3, ϕ_4}, the following holds:
1/3 = minsupp(τ_1^3) ≤ maxsupp(τ_1^3) = 2/3,
1 = minacc(τ_1^3) ≤ maxacc(τ_1^3) = 1.
As for the second possible implication, τ_2^1 : [Color, red] ⇒ [Size, small] ∈ PI(1, {Color}, {Size}), in DD(τ_2^1, 1, {Color}, {Size})={ϕ_1, ϕ_2}, the following holds:
1/3 = minsupp(τ_2^1) ≤ maxsupp(τ_2^1) = 1/3,
1/2 = minacc(τ_2^1) ≤ maxacc(τ_2^1) = 1.
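Example 2 can be verified by brute force, directly following definitions (1)-(4) (our sketch; g and OB are as in the earlier NIS_1 sketch):

from fractions import Fraction
from itertools import product

def derived_DISs():
    # one value choice per (object, attribute) cell of the NIS
    keys = sorted(g)
    for choice in product(*(sorted(g[k]) for k in keys)):
        yield dict(zip(keys, choice))

def criteria(x, A, zeta, D, eta):
    supports, accuracies = [], []
    for h in derived_DISs():
        if h[(x, A)] != zeta or h[(x, D)] != eta:
            continue  # this derived DIS is outside DD(tau, x, CON, DEC)
        match = [y for y in OB if h[(y, A)] == zeta]
        hit = [y for y in match if h[(y, D)] == eta]
        supports.append(Fraction(len(hit), len(OB)))
        accuracies.append(Fraction(len(hit), len(match)))
    return min(supports), max(supports), min(accuracies), max(accuracies)

print(criteria(3, "Color", "blue", "Size", "big"))   # (1/3, 2/3, 1, 1)
print(criteria(1, "Color", "red", "Size", "small"))  # (1/3, 1/3, 1/2, 1)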

3.2 A Simple Method for Calculating Criteria

In order to obtain minsupp(τ^x), minacc(τ^x), maxsupp(τ^x) and maxacc(τ^x), the simplest method is to examine each support(τ^x) and accuracy(τ^x) in every ϕ ∈ DD(τ^x, x, CON, DEC). This method is simple; however, the number of elements in DD(τ^x, x, CON, DEC) is ∏_{A ∈ CON, B ∈ DEC, y ≠ x} |g(y, A)| |g(y, B)|, and this number increases in exponential order. Therefore, this simple method will not be applicable to NISs with a large number of derived DISs.

3.3 Effective Calculation of Minimum Support and Minimum Accuracy

Let us consider how to calculate minsupp(τ^x) and minacc(τ^x) for τ^x : [CON, ζ] ⇒ [DEC, η] from object x. Each object y with the descriptors [CON, ζ] or [DEC, η] influences minsupp(τ^x) and minacc(τ^x). Table 5 shows all possible implications with the descriptors [CON, ζ] or [DEC, η]. For example, in CASE 1 we can obtain just one implication. However, in CASE 2 we can obtain either (C2.1) or (C2.2). Every possible implication depends upon the selection of a value in g(y, DEC). This selection of attribute values specifies some derived DISs from a NIS.

Table 5. Seven cases of possible implications (related to [CON, ζ] ⇒ [DEC, η] from object x; η ≠ η′, ζ ≠ ζ′) in NISs

        Condition: CON   Decision: DEC   Possible Implications
CASE 1  g(y,CON)={ζ}     g(y,DEC)={η}    [CON,ζ] ⇒ [DEC,η] (C1.1)
CASE 2  g(y,CON)={ζ}     η ∈ g(y,DEC)    [CON,ζ] ⇒ [DEC,η] (C2.1), [CON,ζ] ⇒ [DEC,η′] (C2.2)
CASE 3  g(y,CON)={ζ}     η ∉ g(y,DEC)    [CON,ζ] ⇒ [DEC,η′] (C3.1)
CASE 4  ζ ∈ g(y,CON)     g(y,DEC)={η}    [CON,ζ] ⇒ [DEC,η] (C4.1), [CON,ζ′] ⇒ [DEC,η] (C4.2)
CASE 5  ζ ∈ g(y,CON)     η ∈ g(y,DEC)    [CON,ζ] ⇒ [DEC,η] (C5.1), [CON,ζ] ⇒ [DEC,η′] (C5.2), [CON,ζ′] ⇒ [DEC,η] (C5.3), [CON,ζ′] ⇒ [DEC,η′] (C5.4)
CASE 6  ζ ∈ g(y,CON)     η ∉ g(y,DEC)    [CON,ζ] ⇒ [DEC,η′] (C6.1), [CON,ζ′] ⇒ [DEC,η′] (C6.2)
CASE 7  ζ ∉ g(y,CON)     Any             [CON,ζ′] ⇒ Decision (C7.1)

Now, we revise the definition of the inf and sup information given in the previous section. We handled both inf and sup information for every object x. However, in the subsequent sections it is enough to handle the minimum and maximum sets of the equivalence class defined by a descriptor [ATR, val]. This revision is very simple, and it reduces the manipulation in each calculation.

Definition 1. For each descriptor [ATR, val] (= [{A_1, ..., A_k}, (ζ_1, ..., ζ_k)], k ≥ 1) in a NIS, Descinf and Descsup are defined as follows:

(1) Descinf([A_i, ζ_i]) = {x ∈ OB | PT(x, {A_i})={(ζ_i)}} = {x ∈ OB | g(x, {A_i})={ζ_i}}.
(2) Descinf([ATR, val]) = Descinf(∧_i [A_i, ζ_i]) = ∩_i Descinf([A_i, ζ_i]).
(3) Descsup([A_i, ζ_i]) = {x ∈ OB | (ζ_i) ∈ PT(x, {A_i})} = {x ∈ OB | ζ_i ∈ g(x, {A_i})}.
(4) Descsup([ATR, val]) = Descsup(∧_i [A_i, ζ_i]) = ∩_i Descsup([A_i, ζ_i]).

The definition of Descinf requires that every element of this set be definite. Even though the definition of Descsup is the same as that of sup, we employ the Descsup([ATR, ζ]) notation due to the Descinf([ATR, ζ]) notation. Clearly, Descinf([CON, ζ]) is the set of objects belonging to either CASE 1, 2 or 3 in Table 5, and Descsup([CON, ζ]) is the set of objects belonging to one of CASE 1 to CASE 6. Descsup([CON, ζ]) − Descinf([CON, ζ]) is the set of objects belonging to either CASE 4, 5 or 6.

Proposition 3. Let |X| denote the cardinality of a set X. Under the selections in Table 6, the support value of τ^x : [CON, ζ] ⇒ [DEC, η] from x is minimum. If τ^x is definite, namely τ^x belongs to CASE 1,
minsupp(τ^x) = |Descinf([CON, ζ]) ∩ Descinf([DEC, η])| / |OB|.
If τ^x is indefinite, namely τ^x does not belong to CASE 1,
minsupp(τ^x) = (|Descinf([CON, ζ]) ∩ Descinf([DEC, η])| + 1) / |OB|.

Proof. This selection of attribute values in a NIS excludes every [CON, ζ] ⇒ [DEC, η] from any object y ≠ x. In reality, we remove (C2.1), (C4.1) and (C5.1) from Table 5. Therefore, the support value of τ^x is minimum in a derived DIS with such selections of attribute values. If τ^x is definite, object x is in the set Descinf([CON, ζ]) ∩ Descinf([DEC, η]). Otherwise, τ^x belongs to either (C2.1), (C4.1) or (C5.1). Thus, it is necessary to add 1 to the numerator.

Proposition 4. Table 7 is a part of Table 5. Under the selections in Table 7, the accuracy value of τ^x : [CON, ζ] ⇒ [DEC, η] from x is minimum. Let OUTACC denote [Descsup([CON, ζ]) − Descinf([CON, ζ])] − Descinf([DEC, η]).

Table 6. Selections from Table 5. These selections make the support value of [CON, ζ] ⇒ [DEC, η] minimum.

        Condition: CON   Decision: DEC   Selection
CASE 1  g(y,CON)={ζ}     g(y,DEC)={η}    [CON,ζ] ⇒ [DEC,η] (C1.1)
CASE 2  g(y,CON)={ζ}     η ∈ g(y,DEC)    [CON,ζ] ⇒ [DEC,η′] (C2.2)
CASE 3  g(y,CON)={ζ}     η ∉ g(y,DEC)    [CON,ζ] ⇒ [DEC,η′] (C3.1)
CASE 4  ζ ∈ g(y,CON)     g(y,DEC)={η}    [CON,ζ′] ⇒ [DEC,η] (C4.2)
CASE 5  ζ ∈ g(y,CON)     η ∈ g(y,DEC)    [CON,ζ] ⇒ [DEC,η′] (C5.2), [CON,ζ′] ⇒ [DEC,η] (C5.3), [CON,ζ′] ⇒ [DEC,η′] (C5.4)
CASE 6  ζ ∈ g(y,CON)     η ∉ g(y,DEC)    [CON,ζ] ⇒ [DEC,η′] (C6.1), [CON,ζ′] ⇒ [DEC,η′] (C6.2)
CASE 7  ζ ∉ g(y,CON)     Any             [CON,ζ′] ⇒ Decision (C7.1)

Rules and Apriori Algorithm in Non-deterministic Information Systems

337

Table 7. Selections from Table 5. These selections make the accuracy value of [CON, ζ] ⇒ [DEC, η] minimum. CASE1 CASE2 CASE3 CASE4 CASE5 CASE6 CASE7

Condition : CON g(y, CON ) = {ζ} g(y, CON ) = {ζ} g(y, CON ) = {ζ} ζ ∈ g(y, CON ) ζ ∈ g(y, CON ) ζ ∈ g(y, CON ) ζ ∈ g(y, CON )

Decision : DEC g(y, DEC) = {η} η ∈ g(y, DEC) η ∈ g(y, DEC) g(y, DEC) = {η} η ∈ g(y, DEC) η ∈ g(y, DEC) Any

Selection [CON, ζ] ⇒ [DEC, η](C1.1) [CON, ζ] ⇒ [DEC, η ](C2.2) [CON, ζ] ⇒ [DEC, η ](C3.1) [CON, ζ ] ⇒ [DEC, η](C4.2) [CON, ζ] ⇒ [DEC, η ](C5.2) [CON, ζ] ⇒ [DEC, η ](C6.1) [CON, ζ ] ⇒ Decision(C7.1)

If τ x is deﬁnite, ([CON,ζ])∩Descinf ([DEC,η])| minacc(τ x )= |Descinf . |Descinf ([CON,ζ])|+|OUT ACC|

If τ x is indeﬁnite, |Descinf ([CON,ζ])∩Descinf ([DEC,η])|+1 minacc(τ x )= |Descinf ([CON,ζ])∪{x}|+|OUT ACC−{x}| .

Proof. Since m/n ≤ (m + k)/(n + k) (0 ≤ m ≤ n, n = 0, k > 0) holds, we excludes every [CON, ζ] ⇒ [DEC, η] from object y = x. We select possible implications [CON, ζ] ⇒ [DEC, η ], which increase the denominator. The accuracy value of τ x is minimum in a derived DIS with such selection of attribute values. The set OU T ACC deﬁnes objects in either CASE 5 or CASE 6. As for CASE 4 and CASE 7, the condition part is not [CON, ζ]. Therefore, we can omit such implications for calculating minacc(τ x ). If τ x is deﬁnite, the numerator is |Descinf ([CON, ζ]) ∩ Descinf ([DEC, η])| and the denominator is |Descinf ([CON, ζ])|+|OU T ACC|. If τ x is indeﬁnite, τ x belongs to either (C2.1), (C4.1) or (C5.1). The denominator is |Descinf ([CON, ζ]) ∪ {x}| + |OU T ACC − {x}| in every case, and the numerator is |Descinf ([CON, ζ]) ∩ Descinf ([DEC, η])|+1. Theorem 5. For a N IS, let us consider a possible implication τ x :[CON, ζ] ⇒ [DEC, η] ∈ P I(x, CON, DEC). Let SU P Pmin ={ϕ|ϕ is a derived DIS from N IS, and support(τ x ) is minimum in ϕ}. Then, accuracy(τ x ) is minimum in some ϕ ∈ SU P Pmin . Proof. Table 7 is a special case of Table 6. Namely, in CASE 5 of Table 6, either (C5.2), (C5.3) or (C5.4) may hold. In CASE 6 of Table 6, either (C6.1) or (C6.2) may hold. In every selection, the minimum support value is the same. In Table 7, (C5.2) in CASE 5 and (C6.1) in CASE 6 are selected. Theorem 5 assures that there exists a derived DIS, where both support(τ x ) and accuracy(τ x ) are minimum. DISworst denotes such a derived DIS, and we name

338

H. Sakai et al.

Table 8. Selections from Table 5. These selections make the support and accuracy values of [CON, ζ] ⇒ [DEC, η] maximum. CASE1 CASE2 CASE3 CASE4 CASE5 CASE6 CASE7

Condition(CON ) g(y, CON ) = {ζ} g(y, CON ) = {ζ} g(y, CON ) = {ζ} ζ ∈ g(y, CON ) ζ ∈ g(y, CON ) ζ ∈ g(y, CON ) ζ ∈ g(y, CON )

Decision(DEC) g(y, DEC) = {η} η ∈ g(y, DEC) η ∈ g(y, DEC) g(y, DEC) = {η} η ∈ g(y, DEC) η ∈ g(y, DEC) Any

Selection [CON, ζ] ⇒ [DEC, η](C1.1) [CON, ζ] ⇒ [DEC, η](C2.1) [CON, ζ] ⇒ [DEC, η ](C3.1) [CON, ζ] ⇒ [DEC, η](C4.1) [CON, ζ] ⇒ [DEC, η](C5.1) [CON, ζ ] ⇒ [DEC, η ](C6.2) [CON, ζ ] ⇒ Decision(C7.1)

DISworst a derived DIS with the worst condition for τ x . This is an important property for Problem 3 in the subsequent section. 3.4

Eﬀective Calculation of Maximum Support and Maximum Accuracy

In this subsection, we show an eﬀective method to calculate maxsupp(τ x ) and maxacc(τ x ) based on Descinf and Descsup. The following can be proved according the same manner as Proposition 3, 4 and Theorem 5. A derived DIS deﬁned in Table 8 makes both support and accuracy maximum. Proposition 6. For τ x : [CON, ζ] ⇒ [DEC, η] from x, the following holds. maxsupp(τ x )=|Descsup([CON, ζ]) ∩ Descsup([DEC, η])|/|OB|. Proposition 7. For τ x : [CON, ζ] ⇒ [DEC, η] from x, let IN ACC denote [Descsup([CON, ζ]) − Descinf ([CON, ζ])] ∩ Descsup ([DEC, η]). If τ x is deﬁnite, ACC| maxacc(τ x )= |Descinf ([CON,ζ])∩Descsup([DEC,η])|+|IN . |Descinf ([CON,ζ])|+|IN ACC|

If τ x is indeﬁnite, ACC−{x}|+1 maxacc(τ x )= |Descinf ([CON,ζ])∩Descsup([DEC,η])−{x}|+|IN . |Descinf ([CON,ζ])∪{x}|+|IN ACC−{x}|

Theorem 8. For a N IS, let us consider a possible implication τ x :[CON, ζ] ⇒ [DEC, η] ∈ P I(x, CON, DEC). Let SU P Pmax ={ϕ|ϕ is a derived DIS from N IS, and support(τ x ) is maximum in ϕ}. Then, accuracy(τ x ) is maximum in some ϕ ∈ SU P Pmax . Theorem 8 assures that there exists a derived DIS, where both support(τ x ) and accuracy(τ x ) are maximum. DISbest denotes such a derived DIS, and we name DISbest a derived DIS with the best condition for τ x . This is also an important property for Problem 4 in the subsequent section.

Rules and Apriori Algorithm in Non-deterministic Information Systems

4

339

Rule Generation by New Criteria in Non-deterministic Information Systems

This section applies Proposition 3, 4, 6, 7 and Theorem 5, 8 to rule generation in N ISs. 4.1

Rules by the Criteria in Deterministic Information Systems

In DISs, rule generation by the criteria is often deﬁned as the following. Problem 2. In a table or a DIS, ﬁnd every implication τ that support(τ ) ≥ α and accuracy(τ ) ≥ β for given α and β (0 < α, β ≤ 1). For solving this problem, Apriori algorithm was proposed by Agrawal [1, 2]. In this framework, association rules in transaction data are obtained. The application of the large item set is the key point in Apriori algorithm. This Problem 2 has also been considered in [22, 36, 38]. 4.2

Rules by New Criteria and Two Strategies in Non-deterministic Information Systems

Now, we extend Problem 2 to Problem 3 and Problem 4 in the following. Problem 3 (Rule Generation by Lower Approximation Strategy). For a N IS, let CON ⊆ AT and DEC ⊆ AT be condition attributes and the decision attribute, respectively. Find every possible implication τ x : [CON, ζ] ⇒ [DEC, η] satisfying minsupp(τ x ) ≥ α and minacc(τ x ) ≥ β for given α and β (0 < α, β ≤ 1). Problem 4 (Rule Generation by Upper Approximation Strategy). For a N IS, let CON ⊆ AT and DEC ⊆ AT be condition attributes and the decision attribute, respectively. Find every possible implication τ x : [CON, ζ] ⇒ [DEC, η] satisfying maxsupp(τ x ) ≥ α and maxacc(τ x ) ≥ β for given α and β (0 < α, β ≤ 1). It is necessary to remark that both minsupp(τ x ) and minacc(τ x ) are deﬁned over DD(τ x , x, CON, DEC). For deﬁnite τ x , DD(τ x , x, CON, DEC) is equal to all derived DISs. However for indeﬁnite τ x , DD(τ x , x, CON, DEC) is not equal to all derived DISs, and minsupp(τ x )=0 and minacc(τ x )=0 may hold. This may be an important issue in lower approximation strategy. However in this paper, we employ a set DD(τ x , x, CON, DEC) instead of all derived DISs. As for upper approximation strategy, maxsupp(τ x ) and maxacc(τ x ) over DD(τ x , x, CON , DEC) are the same as maxsupp(τ x ) and maxacc(τ x ) over all derived DISs. We employed terms M in-M ax and M ax-M ax strategies in [31, 32]. According to rough sets based concept, we rename these terms lower approximation strategy and upper approximation strategy, respectively. Next Proposition 9 clariﬁes the relation between certain rules, possible rules and rules by new criteria.

340

H. Sakai et al.

Proposition 9. For a possible implication τ x , the following holds. (1) τ x is a certain rule in Section 2.1, if and only if τ x is deﬁnite and minacc(τ x )=1. (2) τ x is a possible rule in Section 2.1, if and only if maxacc(τ x )=1. The concept of consistency deﬁnes certain and possible rules, therefore there is no deﬁnition about support. In certain rule generation, we often have a possible implication whose minacc(τ x )=1 and minsupp(τ x ) is quite small. Proposition 10, 11 and 12 clarify the properties of rule generation. Proposition 10. For a given α and β (0 < α, β ≤ 1), let Rule(α, β, LA) denote a set of rules deﬁned by lower approximation strategy with α and β, and let Rule(α, β, U A) denote a set of rules deﬁned by upper approximation strategy with α and β. Then, Rule(α, β, LA) ⊆ Rule(α, β, U A) holds. Proposition 11. The following, which are related to a possible implication τ x : [CON, ζ] ⇒ [DEC, η], are equivalent. (1) τ x is obtained according to lower approximation strategy, namely minsupp(τ x ) ≥ α and minacc(τ x ) ≥ β. (2) support(τ x ) ≥ α and accuracy(τ x ) ≥ β in each ϕ ∈ DD(τ x , x, CON, DEC). (3) In a derived DISworst deﬁned in Table 7, support(τ x ) ≥ α and accuracy(τ x ) ≥ β hold. Proof. For each ϕ ∈ DD(τ x , x, CON, DEC), support(τ x ) ≥ minsupp(τ x ) and accuracy(τ x ) ≥ minacc(τ x ) hold, therefore (1) and (2) are equivalent. According to Theorem 5, a derived DISworst (depending upon τ x ) deﬁned in Table 7 assigns minimum values to both support(τ x ) and accuracy(τ x ). Thus, (1) and (3) are equivalent. Proposition 12. The following, which are related to a possible implication τ x : [CON, ζ] ⇒ [DEC, η], are equivalent. (1) τ x is obtained according to upper approximation strategy, namely maxsupp(τ x ) ≥ α and maxacc(τ x ) ≥ β. (2) support(τ x ) ≥ α and accuracy(τ x ) ≥ β in a ϕ ∈ DD(τ x , x, CON, DEC). (3) In a derived DISbest deﬁned in Table 8, support(τ x ) ≥ α and accuracy(τ x ) ≥ β hold. Proof: For each ϕ ∈ DD(τ x , x, CON, DEC), support(τ x ) ≤ maxsupp(τ x ) and accuracy(τ x ) ≤ maxacc(τ x ) hold. According to Theorem 8, a derived DISbest (depending upon τ x ) deﬁned in Table 8 assigns maximum values to both support(τ x ) and accuracy(τ x ). In this DISbest , maxsupp(τ x )=support(τ x ) and maxacc(τ x )=accuracy(τ x ) hold. Thus, (1), (2) and (3) are equivalent. Due to Proposition 10, 11 and 12, Rule(α, β, LA) deﬁnes a set of possible implications in a DISworst , and Rule(α, β, U A) deﬁnes a set of possible implications in a DISbest . This implies that we do not have to examine each derived DIS in

Rules and Apriori Algorithm in Non-deterministic Information Systems

341

DD(τ x , x, CON, DEC), but we have only to examine a DISworst for the lower approximation strategy and a DISbest for the upper approximation strategy. 4.3

Extended Apriori Algorithms for Two Strategies and A Simulation

This subsection proposes two extended Apriori algorithms in Algorithm 1 and 2. In DISs, Descinf ([A, ζ])=Descsup([A, ζ]) holds, however Descinf ([A, ζ]) ⊆ Descsup([A, ζ]) holds in N ISs. Apriori algorithm handles transaction data, and employs the sequential search for obtaining large item sets [1, 2]. In DISs, we employ the manipulation of Descinf and Descsup instead of the sequential search. According to this manipulation, we obtain the minimum set and maximum set of an equivalence class. Then, we calculate minsupp(τ x ) and minacc(τ x ) by using Descinf and Descsup. The rest is almost the same as Apriori algorithm. Now, we show an example, which simulates Algorithm 1. Algorithm 1. Extended Apriori Algorithm for Lower Approximation Strategy Input : A N IS, a decision attribute DEC, threshold value α and β. Output: Every rule deﬁned by lower approximation strategy. for (every A ∈ AT ) do Generate Descinf ([A, ζ]) and Descsup([A, ζ]); end For the condition minsupp(τ x )=|SET |/|OB| ≥ α, obtain the number N U M of elements in SET ; Generate a set CAN DIDAT E(1), which consists of descriptors [A, ζA ] satisfying either (CASE A) or (CASE B) in the following; (CASE A) |Descinf ([A, ζA ])| ≥ N U M , (CASE B) |Descinf ([A, ζA ])|=(N U M − 1) and (Descsup([A, ζA ]) − Descinf ([A, ζA ])) = {}. Generate a set CAN DIDAT E(2) according to the following procedures; (Proc 2-1) For every [A, ζA ] and [DEC, ζDEC ] (A = DEC) in CAN DIDAT E(1), generate a new descriptor [{A, DEC}, (ζA , ζDEC )]; (Proc 2-2) Examine condition (CASE A) and (CASE B) for each [{A, DEC}, (ζA , ζDEC )]; If either (CASE A) or (CASE B) holds and minacc(τ ) ≥ β display τ : [A, ζA ] ⇒ [DEC, ζDEC ] as a rule; If either (CASE A) or (CASE B) holds and minacc(τ ) < β, add this descriptor to CAN DIDAT E(2); Assign 2 to n; while CAN DIDAT E(n) = {} do Generate CAN DIDAT E(n + 1) according to the following procedures; (Proc 3-1) For DESC1 and DESC2 ([DEC, ζDEC ] ∈ DESC1 ∩ DESC2 ) in CAN DIDAT E(n), generate a new descriptor by using a conjunction of DESC1 ∧ DESC2 ; (Proc 3-2) Examine the same procedure as (Proc 2-2). Assign n + 1 to n; end

342

H. Sakai et al.

Algorithm 2. Extended Apriori Algorithm for Upper Approximation Strategy Input : A N IS, a decision attribute DEC, threshold value α and β. Output: Every rule deﬁned by upper approximation strategy. Algorithm 2 is proposed as Algorithm 1 with the following two revisions : 1. (CASE A) and (CASE B) in Algorithm 1 are replaced with (CASE C). (CASE C) |Descsup([A, ζA ])| ≥ N U M . 2. minacc(τ ) in Algorithm 1 is replaced with maxacc(τ ).

Example 3. Let us consider Descinf and Descsup, which are obtained from N IS2 in Table 9, and let us consider Problem 3. We set α=0.3, β=0.8, condition attribute CON ⊆ {P, Q, R, S} and decision attribute DEC={T }. Since |OB|=5 and minsupp(τ )=|SET |/5 ≥ 0.3, |SET | ≥ 2 must hold. According to Table 10, we generate Table 11 satisfying either (CASE A) or (CASE B) in the following: (CASE A) |Descinf ([A, ζA ] ∧ [T, η])| ≥ 2 (A ∈ {P, Q, R, S}). (CASE B) |Descinf ([A, ζA ] ∧ [T, η])|=1 and Descsup([A, ζA ] ∧ [T, η])− Descinf ([A, ζA ] ∧ [T, η]) = {} (A ∈ {P, Q, R, S}). Table 9. A Table of N IS2 OB 1 2 3 4 5

P {3} {2} {1, 2} {1} {3}

Q {1, 3} {2, 3} {2} {3} {1}

R {3} {1, 3} {1, 2} {3} {1, 2}

S T {2} {3} {1, 3} {2} {3} {1} {2, 3} {1, 2, 3} {3} {3}

Table 10. Descinf and Descsup information in Table 9 [P, 1] [P, 2] [P, 3] [Q, 1] [Q, 2] [Q, 3] [R, 1] [R, 2] [R, 3] Descinf {4} {2} {1, 5} {5} {3} {4} {} {} {1, 4} Descsup {3, 4} {2, 3} {1, 5} {1, 5} {2, 3} {1, 2, 4} {2, 3, 5} {3, 5} {1, 2, 4} [S, 1] [S, 2] [S, 3] [T, 1] [T, 2] [T, 3] Descinf {} {1} {3, 5} {3} {2} {1, 5} Descsup {2} {1, 4} {2, 3, 4, 5} {3, 4} {2, 4} {1, 4, 5}

Table 11. Conjunctions of descriptors satisfying either (CASE A) or (CASE B) in Table 10 Descinf Descsup

[P, 3] ∧ [T, 3] [Q, 1] ∧ [T, 3] [R, 3] ∧ [T, 3] [S, 2] ∧ [T, 3] [S, 3] ∧ [T, 1] [S, 3] ∧ [T, 3] {1, 5} {5} {1} {1} {3} {5} {1, 5} {1, 5} {1, 4} {1, 4} {3, 4} {4, 5}

Rules and Apriori Algorithm in Non-deterministic Information Systems

343

The conjunction [P, 3] ∧ [T, 3] in Table 11 means an implication τ31 , τ35 : [P, 3] ⇒ [T, 3]. Because Descsup([P, 3] ∧ [T, 3])={1, 5} holds, τ31 and τ35 come from object 1 and 5, respectively. Since 1, 5 ∈ Descinf ([P, 3] ∧ [T, 3]) holds, minsupp(τ31 )= minsupp(τ35 )=|{1, 5}|/5=0.4 holds. Then, the conjunction [Q, 1] ∧ [T, 3] in Table 11 means an implication τ41 , τ45 : [Q, 1] ⇒ [T, 3]. Since 5 ∈ Descinf ([Q, 1]∧[T, 3]) holds, minsupp(τ45 )=|{5}|/5=0.2 holds. On the other hand, 1 ∈ Descsup([Q, 1]∧ [T, 3]) − Descinf ([Q, 1] ∧ [T, 3]) holds, so minsupp(τ41 )=(|{5}| + 1)/5=0.4 holds in object 1. According to this consideration, we obtain the candidates of rules, which satisfy minsupp(τ x ) ≥ 0.3, as follows: τ31 , τ35 : [P, 3] ⇒ [T, 3], τ41 : [Q, 1] ⇒ [T, 3], τ54 : [R, 3] ⇒ [T, 3], τ64 : [S, 2] ⇒ [T, 3], τ74 : [S, 3] ⇒ [T, 1], τ84 : [S, 3] ⇒ [T, 3]. For these candidates, we examine each minacc(τ x ) according to Proposition 4. For τ31 and τ35 , Descsup([P, 3])={1, 5}, Descinf ([P, 3])={1, 5}, Descinf ([P, 3] ∧ [T, 3])={1, 5} and OU T ACC=[{1, 5}−{1, 5}]−{1, 5}={}. Since 1, 5 ∈ Descinf ( [P, 3] ∧ [T, 3]) holds, minacc(τ31 )= minacc(τ35 )=|{1, 5}|/(|{1, 5}| + |{}|)=1 is derived. For τ74 : [S, 3] ⇒ [T, 1], Descsup([S, 3])={2, 3, 4, 5}, Descinf ([S, 3])={3, 5}, Descinf ([S, 3]∧[T, 1])={3}, Descsup([S, 3]∧[T, 1])= {3, 4} and OU T ACC=[{2, 3, 4, 5} − {3, 5}] − {3}={2, 4} holds, so minacc(τ74 )=(|{3}| + 1)/(|{3, 5} ∪ {4}| + |{2, 4} − {4}|)=0.5 is derived. In this way, we obtain three rules satisfying minsupp(τ x ) ≥ 0.3 and minacc(τ x ) ≥ 0.8 in the following: τ31 , τ35 : [P, 3] ⇒ [T, 3] (minsupp=0.4, minacc=1), τ41 : [Q, 1] ⇒ [T, 3] (minsupp=0.4, minacc=1), τ64 : [S, 2] ⇒ [T, 3] (minsupp=0.4, minacc=1). Any possible implication including [R, 3] ∧ [T, 3] does not satisfy minsupp(τ x ) ≥ 0.3. As for [S, 3] ∧ [T, 1] and [S, 3] ∧ [T, 3], the same results hold. The following shows a real execution on Example 3. % ./nis apriori version 1.2.8 File Name:’nis2.dat’ ======================================== Lower Approximation Strategy ======================================== CAN(1)=[P,1],[P,2],[P,3],[Q,1],[Q,2],[Q,3],[R,3],[S,2],[S,3],[T,1], [T,2],[T,3](12) CAN(2)=[S,3][T,1](0.250,0.500),[P,3][T,3](1.000, 1.000),[Q,1][T,3](1.000,1.000),[R,3][T,3](0.333, 0.667),[S,2][T,3](0.500,1.000),[S,3][T,3](0.250, 0.500)(6) ========== OBTAINED RULE ========== [P,3]=>[T,3](minsupp=0.400,minsupp=0.400,minacc=1.000, minacc=1.000) (from 1,5) (from ) [Q,1]=>[T,3](minsupp=0.200,minsupp=0.400,minacc=1.000, minacc=1.000) (from ) (from 1)

344

H. Sakai et al.

[S,2]=>[T,3](minsupp=0.200,minsupp=0.400,minacc=0.500, minacc=1.000) (from ) (from 4) EXEC TIME=0.0000000000(sec) ======================================== Upper Approximation Strategy ======================================== CAN(1)=[P,1],[P,2],[P,3],[Q,1],[Q,2],[Q,3],[R,3],[S,2],[S,3],[T,1], [T,2],[T,3](12) CAN(2)=[S,3][T,1](0.667,0.667),[P,3][T,3](1.000, 1.000),[Q,1][T,3](1.000,1.000),[R,3][T,3](1.000, 1.000),[S,2][T,3](1.000,1.000),[S,3][T,3](0.667, 0.667)(6) ========== OBTAINED RULE ========== [P,3]=>[T,3](maxsupp=0.400,maxsupp=0.400,maxacc=1.000, maxacc=1.000) (from 1,5) (from ) [Q,1]=>[T,3](maxsupp=0.400,maxsupp=0.400,maxacc=1.000, maxacc=1.000) (from 5) (from 1) [R,3]=>[T,3](maxsupp=0.400,maxsupp=0.400,maxacc=1.000, maxacc=1.000) (from 1) (from 4) [S,2]=>[T,3](maxsupp=0.400,maxsupp=0.400,maxacc=1.000, maxacc=1.000) (from 1) (from 4) EXEC TIME=0.0000000000(sec)

According to this execution, we know Rule(0.3, 0.8, LA)={[P, 3] ⇒ [T, 3], [Q, 1] ⇒ [T, 3], [S, 2] ⇒ [T, 3]}, Rule(0.3, 0.8, U A)={[P, 3] ⇒ [T, 3], [Q, 1] ⇒ [T, 3], [S, 2] ⇒ [T, 3], [R, 3] ⇒ [T, 3]}.

The possible implication [R, 3] ⇒ [T, 3] ∈ Rule(0.3, 0.8, U A)− Rule(0.3, 0.8, LA) depends upon the information incompleteness. This can not be obtained by the lower approximation strategy, but this can be obtained by the upper approximation strategy. 4.4

Main Program for Lower Approximation Strategy

A program nis apriori is implemented on a Windows PC with Pentium 4 (3.40 GHz), and it consists of about 1700 lines in C. This nis apriori mainly consists of two parts, i.e., a part for lower approximation strategy and a part for upper approximation strategy. As for lower approximation strategy, a function GenRuleByLA() (Generate Rules By LA strategy) is coded. GenRuleByLA(table.obj,table.att,table.kosuval,table.con num, table.dec num,table.con,table.dec,thresh,minacc thresh);

In GenRuleByLA(), a function GenCandByLA() is called, and generates a candidate CAN DIDAT E(n).

Rules and Apriori Algorithm in Non-deterministic Information Systems

345

GenCandByLA(desc,cand,conj num max,ob,at,desc num,c num, d num,co,de,thr,minacc thr);

At the same time, minsupp(τ ) and minacc(τ ) are calculated according to Proposition 3 and 4. As for upper approximation strategy, the similar functions are implemented.

5

Computational Issues in Algorithm 1

This section focuses on the computational complexity of Algorithm 1. As for Algorithm 2, the result is almost the same as Algorithm 1. 5.1

A Simple Method for Lower Approximation Strategy

Generally, a possible implication τ x depends upon the number of derived DISs, i.e., x∈OB,A∈AT |g(x, A)|, and condition attributes CON (CON ⊆ 2AT −DEC ). x x Furthermore, minsupp(τ x ) and minacc(τ ) depend on DD(τ , x, CON, DEC), whose number of elements is A∈CON,B∈DEC,x=y |g(y, A)||g(y, B)|. Therefore, it will be impossible to employ a simple method that we sequentially pick up every possible implication τ x and sequentially examine minsupp(τ x ) and minacc(τ x ). 5.2

Complexity on Extended Apriori Algorithm for Lower Approximation Strategy

In order to solve this computational issue, we focus on descriptors [A, ζ] (A ∈ AT , ζ ∈ V ALA ). The number of all descriptors is usually very small. Furthermore, Proposition 3 and 4 show us the methods to calculate minsupp(τ x ) and minacc(τ x ). These methods do not depend upon the number of element in DD(τ x , x, CON, DEC). Now, we analyze each step in Algorithm 1. (STEP 1) (Generation of Descinf , Descsup and CAN DIDAT E(1) ) We ﬁrst prepare two arrays DescinfA,val [] and DescsupA,val [] for each val ∈ V ALA (A ∈ AT ). For each object x ∈ OB, we apply (1) and (2) in the following: (1) If g(x, A)={val}, add x to DescinfA,val [] and DescsupA,val []. (2) If g(x, A) = {val} and val ∈ g(x, A), add x to DescsupA,val []. Then, all descriptors satisfying either (CASE A) or (CASE B) in Algorithm 1 are added to CAN DIDAT E(1). For each A ∈ AT , this procedure is applied, and the complexity depends upon |OB| × |AT |. (STEP 2) (Generation of CAN DIDAT E(2) ) For each [A, valA ], [DEC, valDEC ] ∈ CAN DIDAT E(1), we produce [A, valA ] ∧ [DEC, valDEC ], and generate Descinf ([A, valA ] ∧ [DEC, valDEC ]) =Descinf ([A, valA]) ∩ Descinf ([DEC, valDEC ]),

346

H. Sakai et al.

Descsup([A, valA ] ∧ [DEC, valDEC ]) =Descsup([A, valA]) ∩ Descsup([DEC, valDEC ]). If [A, valA ] ∧ [DEC, valDEC ] satisﬁes either (CASE A) or (CASE B) in Algorithm 1, this descriptor is added to CAN DIDAT E(2). Furthermore, we examine minacc([A, valA ] ∧ [DEC, valDEC ]) in (Proc 2-2) according to Proposition 4. The complexity of (STEP 2) depends upon the number of combined descriptors [A, valA ] ∧ [DEC, valDEC ]. (STEP 3) (Repetition of STEP 2 on CAN DIDAT E(n) ) For each DESC1 and DESC2 in CAN DIDAT E(n), we generate a conjunction DESC1 ∧ DESC2 . For such conjunctions, we apply the same procedure as (STEP 2). In the execution, two sets Descinf ([CON, ζ]) and Descsup([CON, ζ]) are stored in arrays, and we can obtain Descinf ([CON, ζ] ∧ [DEC, η]) by using the intersection operation Descinf ([CON, ζ]) ∩ Descinf ([DEC, η]). The same property holds for Descsup([CON, ζ] ∧ [DEC, η]). Therefore, it is easy to obtain CAN DIDAT E(n + 1) from CAN DIDAT E(n). This is a merit of employing equivalence classes, and this is the characteristics of rough set theory. In Apriori algorithm, such Descinf and Descsup([CON, ζ]) are not employed, and the total search of a database is executed for generating every combination of descriptors. It will be necessary to consider the merit and demerit of handling two sets Descinf ([CON, ζ]) and Descsup([CON, ζ]) in the next research. Apriori algorithm employs an equivalence class for each descriptors, and handles only deterministic information. On the other hand, Algorithm 1 employs the minimum and the maximum sets of an equivalence class, i.e., Descinf and Descsup, and handles non-deterministic information as well as deterministic information. In Algorithm 1, it takes twice steps of Apriori algorithm for manipulating equivalence classes. The rest is almost the same as Apriori algorithm, therefore the complexity of Algorithm 1 will be almost the same as Apriori algorithm.

6

Concluding Remarks and Future Work

We proposed rule generation based on lower approximation strategy and upper approximation strategy in N ISs. We employed Descinf , Descsup and the concept of large item set in Apriori algorithm, and proposed two extended Apriori algorithms in N ISs. These extended algorithms do not depend upon the number of derived DISs, and the complexity of these extended algorithms is almost the same as Apriori algorithm. We implemented the extended algorithms, and applied them to some data sets. According to these utility programs, we can explicitly handle not only deterministic information but also non-deterministic information. Now, we brieﬂy show the application to Hepatitis data in UCI Machine Learning Repository [37]. In reality, we applied our programs to Hepatitis data. This data consists of 155 objects, 20 attributes. There are 167 missing values, which

Rules and Apriori Algorithm in Non-deterministic Information Systems

347

are about 5.4% of total data. The number of objects without missing values is 80, namely the number is about the half of total data. In usual analyzing tools, it may be diﬃcult to handle total 155 objects. We employ a list for expressing non-deterministic information, for example, [red,green], [red,blue] for {red, green} and {red, blue} in Table 2. This syntax is so simple that we can easily generate data of N ISs by using Excel. As for Hepatitis data, we loaded this data into Excel, and replaced each missing value (? symbol) with a list of all possible attribute values. For some numerical values, the discretized attribute values are also given in the data set. For example, in the 15th attribute BILIRUBIN, attribute values are discretized to the six attribute values, i.e., 0.39, 0.80, 1.20, 2.00, 3.00, 4.00. We employed these discretized values in some attributes. The following is a part of the real revised Hepatitis data in Excel. There are 78732 (=2 × 6 × 94) derived DISs for these six objects. Probably, it seems hard to handle all derived DISs for total 155 objects sequentially. 155 20 2 30 2 50 2 70 2 30 2 30

2 1 1 1 1

2 30 1

//Number of objects //Number of Attributes 1 2 2 2 2 1 2 2 2 2 2 0.8 80 13 3.8 [10,20,30,40,50,60,70,80,90] 1 1 2 1 2 2 1 2 2 2 2 2 0.8 120 13 3.8 [10,20,30,40,50,60,70,80,90] 1 2 2 1 2 2 2 2 2 2 2 2 0.8 80 13 3.8 [10,20,30,40,50,60,70,80,90] 1 [1,2] 1 2 2 2 2 2 2 2 2 2 0.8 33 13 3.8 80 1 2 2 2 2 2 2 2 2 2 2 2 0.8 [33,80,120,160,200,250] 200 3.8 [10,20,30,40,50,60,70,80,90] 1 2 2 2 2 2 2 2 2 2 2 2 0.8 80 13 3.8 70 1 : : :

The decision attribute is the ﬁrst attribute CLASS (1:die, 2:live), and we ﬁxed α=0.25 and β=0.85. Let us show the results of two cases. (CASE 1) Obtained Rules from 80 Objects without Missing Values It is possible to apply our programs to the standard DISs. For 80 objects, it took 0.015(sec), and 14 rules including the following are generated. [AGE,30]=>[Class,live] (support=0.287,accuracy=0.958), [ASCITES,yes]=>[CLASS,live] (support=0.775,accuracy=0.912), [ALBUMIN,4.5]=>[CLASS,live] (support=0.287,accuracy=0.958).

(CASE 2) Obtained Rules from 155 Objects with 167 Missing Values Due to two strategies, 22 rules and 25 rules are generated, respectively. It took 0.064(sec). Let us show every rule, which is obtained by upper approximation strategy but is not obtained by lower approximation strategy. Namely, every rule is in boundary set Rule(0.25, 0.85, U A) − Rule(0.25, 0.85, LA). There are three such rules. [Alk PHOSPHATE,80]=>[CLASS,live] (minsupp=0.25,minacc=0.841,maxsupp=0.348,maxacc=0.857) [ANOREXIA,yes]&[SGOT,13]=>[CLASS,live] (minsupp=0.25,minacc=0.829,maxsupp=0.381,maxacc=0.855)

348

H. Sakai et al.

[SPLEEN PALPABLE,yes]&[SGOT,13]=>[CLASS,live] (minsupp=0.25,minacc=0.848,maxsupp=0.368,maxacc=0.877)

In the 17th attribute SGOT, there are four missing values. The above two rules with descriptor [SGOT,13] depend upon these four missing values. These rules show us the diﬀerence between lower approximation strategy and upper approximation strategy. We are also focusing on the diﬀerence between rule generation in DISs and N ISs. Let us suppose a N IS. We remove every object with non-deterministic information from the N IS, and we obtain a DIS. We are interested in rules, which are not obtained from the DIS but obtained from the N IS. According to some experiments including Hepatitis data and Mammographic data in UCI repository, we veriﬁed our utility programs work well, even if there are huge number of derived DISs. However, we have not analyzed the meaning of the obtained rules. Because, the main issue of this paper is to establish the framework and to implement algorithms. From now on, we will apply our utility programs to real data with missing values, and we want to obtain meaningful rules from N ISs. Our research is not toward rule generation from data with a large number of objects, but it is toward rule generation from incomplete data with a large number of derived DISs. This paper is a revised and extended version of papers [31, 32]. Acknowledgment. The authors would be grateful to anonymous referees for their useful comments. This work is partly supported by the Grant-in-Aid for Scientiﬁc Research (C) (No.16500176, No.18500214), Japan Society for the Promotion of Science.

References 1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: Proceedings of the 20th Very Large Data Base, pp. 487–499 (1994) 2. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.: Fast Discovery of Association Rules. In: Advances in Knowledge Discovery and Data Mining, pp. 307–328. AAAI/MIT Press (1996) 3. Demri, S., Orlowska, E.: Incomplete Information: Structure, Inference, Complexity. Monographs in Theoretical Computer Science. Springer, Heidelberg (2002) 4. Grzymala-Busse, J.: On the Unknown Attribute Values in Learning from Examples. In: Ra´s, Z.W., Zemankova, M. (eds.) ISMIS 1991. LNCS (LNAI), vol. 542, pp. 368– 377. Springer, Heidelberg (1991) 5. Grzymala-Busse, J.: A New Version of the Rule Induction System LERS. Fundamenta Informaticae 31, 27–39 (1997) 6. Grzymala-Busse, J., Werbrouck, P.: On the Best Search Method in the LEM1 and LEM2 Algorithms. Incomplete Information: Rough Set Analysis 13, 75–91 (1998) 7. Grzymala-Busse, J.: Data with Missing Attribute Values: Generalization of Indiscernibility Relation and Rule Induction. Transactions on Rough Sets 1, 78–95 (2004)

Rules and Apriori Algorithm in Non-deterministic Information Systems

349

8. Grzymala-Busse, J.: Incomplete data and generalization of indiscernibility relation, ´ deﬁnability, and approximations. In: Sezak, D., Wang, G., Szczuka, M.S., D¨ untsch, I., Yao, Y. (eds.) RSFDGrC 2005. LNCS (LNAI), vol. 3641, pp. 244–253. Springer, Heidelberg (2005) 9. Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough Sets: a tutorial. In: Pal, S., Skowron, A. (eds.) Rough Fuzzy Hybridization, pp. 3–98. Springer, Heidelberg (1999) 10. Kryszkiewicz, M.: Rules in Incomplete Information Systems. Information Sciences 113, 271–292 (1999) 11. Kryszkiewicz, M., Rybinski, H.: Computation of Reducts of Composed Information Systems. Fundamenta Informaticae 27, 183–195 (1996) 12. Kryszkiewicz, M.: Maintenance of Reducts in the Variable Precision Rough Sets Model. ICS Research Report 31/94, Warsaw University of Technology (1994) 13. Lipski, W.: On Semantic Issues Connected with Incomplete Information Data Base. ACM Trans. DBS 4, 269–296 (1979) 14. Lipski, W.: On Databases with Incomplete Information. Journal of the ACM 28, 41–70 (1981) 15. Nakamura, A., Tsumoto, S., Tanaka, H., Kobayashi, S.: Rough Set Theory and Its Applications. Journal of Japanese Society for AI 11, 209–215 (1996) 16. Nakamura, A.: A Rough Logic based on Incomplete Information and Its Application. International Journal of Approximate Reasoning 15, 367–378 (1996) 17. Nakata, M., Sakai, H.: Rough-set-based Approaches to Data Containing Incomplete Information: Possibility-based Cases. In: Nakamatsu, K., Abe, J. (eds.) Advances in Logic Based Intelligent Systems. Frontiers in Artiﬁcial Intelligence and Applications, vol. 132, pp. 234–241. IOS Press, Amsterdam (2005) 18. Nakata, M., Sakai, H.: Lower and Upper Approximations in Data Tables Containing Possibilistic Information. Transactions on Rough Sets 7, 170–189 (2007) 19. Orlowska, E.: What You Always Wanted to Know about Rough Sets. In: Incomplete Information: Rough Set Analysis, vol. 13, pp. 1–20. Physica-Verlag (1998) 20. Orlowska, E., Pawlak, Z.: Representation of Nondeterministic Information. Theoretical Computer Science 29, 27–39 (1984) 21. Pawlak, Z.: Rough Sets. Kluwer Academic Publisher, Dordrecht (1991) 22. Pawlak, Z.: Some Issues on Rough Sets. Transactions on Rough Sets 1, 1–58 (2004) 23. Polkowski, L., Skowron, A. (eds.): Rough Sets in Knowledge Discovery 1. Studies in Fuzziness and Soft Computing, vol. 18. Physica-Verlag (1998) 24. Polkowski, L., Skowron, A. (eds.): Rough Sets in Knowledge Discovery 2. Studies in Fuzziness and Soft Computing, vol. 19. Physica-Verlag (1998) 25. Rough Set Software. Bulletin of Int’l. Rough Set Society 2, 15–46 (1998) 26. Sakai, H.: Eﬀective Procedures for Handling Possible Equivalence Relations in Nondeterministic Information Systems. Fundamenta Informaticae 48, 343–362 (2001) 27. Sakai, H.: Eﬀective Procedures for Data Dependencies in Information Systems. In: Rough Set Theory and Granular Computing. Studies in Fuzziness and Soft Computing, vol. 125, pp. 167–176. Springer, Heidelberg (2003) 28. Sakai, H., Okuma, A.: Basic Algorithms and Tools for Rough Non-deterministic Information Analysis. In: Peters, J.F., Skowron, A., Grzymala-Busse, J.W., Kostek, ´ B.z., Swiniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 209–231. Springer, Heidelberg (2004) 29. Sakai, H., Nakata, M.: An Application of Discernibility Functions to Generating Minimal Rules in Non-deterministic Information Systems. Journal of Advanced Computational Intelligence and Intelligent Informatics 10, 695–702 (2006)

350

H. Sakai et al.

30. Sakai, H.: On a Rough Sets Based Data Mining Tool in Prolog: An Overview. In: Umeda, M., Wolf, A., Bartenstein, O., Geske, U., Seipel, D., Takata, O. (eds.) INAP 2005. LNCS (LNAI), vol. 4369, pp. 48–65. Springer, Heidelberg (2006) 31. Sakai, H., Nakata, M.: On Possible Rules and Apriori Algorithm in Nondeterministic Information Systems. In: Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H.S., Slowi´ nski, R. (eds.) RSCTC 2006. LNCS (LNAI), vol. 4259, pp. 264–273. Springer, Heidelberg (2006) 32. Sakai, H., Ishibashi, R., Koba, K., Nakata, M.: On Possible Rules and Apriori Algorithm in Non-deterministic Information Systems 2. In: An, A., Stefanowski, J., Ramanna, S., Butz, C.J., Pedrycz, W., Wang, G. (eds.) RSFDGrC 2007. LNCS (LNAI), vol. 4482, pp. 280–288. Springer, Heidelberg (2007) 33. Skowron, A., Rauszer, C.: The Discernibility Matrices and Functions in Information Systems. In: Intelligent Decision Support - Handbook of Advances and Applications of the Rough Set Theory, pp. 331–362. Kluwer Academic Publishers, Dordrecht (1992) 34. Stefanowski, J., Tsoukias, A.: On the Extension of Rough Sets under Incomplete Information. In: Zhong, N., Skowron, A., Ohsuga, S. (eds.) RSFDGrC 1999. LNCS (LNAI), vol. 1711, pp. 73–81. Springer, Heidelberg (1999) 35. Stefanowski, J., Tsoukias, A.: Incomplete Information Tables and Rough Classiﬁcation. Computational Intelligence 7, 212–219 (2001) 36. Tsumoto, S.: Knowledge Discovery in Clinical Databases and Evaluation of Discovered Knowledge in Outpatient Clinic. Information Sciences 124, 125–137 (2000) 37. UCI Machine Learning Repository, http://mlearn.ics.uci.edu/MLRepository.html 38. Ziarko, W.: Variable Precision Rough Set Model. Journal of Computer and System Sciences 46, 39–59 (1993)

On Extension of Dependency and Consistency Degrees of Two Knowledges Represented by Covering P. Samanta1 and Mihir K. Chakraborty2 1

2

Department of Mathematics, Katwa College Katwa, Burdwan, West Bengal, India pulak [email protected] Department of Pure Mathematics, University of Calcutta 35, Ballygunge Circular Road, Kolkata-700019, India [email protected]

Abstract. Knowledge of an agent depends on the granulation procedure adopted by the agent. The knowledge granules may form a partition of the universe or a covering. In this paper dependency degrees of two knowledges have been considered in both the cases. A measure of consistency and inconsistency of knowledges are also discussed. This paper is a continuation of our earlier work [3]. Keywords: Rough sets, elementary category(partition, covering of knowledge), dependency degree, consistency degree.

1

Introduction

Novotn´ y and Pawlak deﬁned a dependency degree between two knowledges given by two partitions on a set [6,7,8,9]. Knowledge is given by indiscernibility relations on the universe and indiscernibility relation is taken to be an equivalence relation. But in many situations the indiscernibility relation fails to be transitive. Hence the clusters or granules of knowledge overlap. This observation gives rise to the study of Rough Set Theory based on coverings instead of partitions [2,10,11,13,14,15,16]. In [3] the present authors introduced the notions of consistency degree and inconsistency degree of two knowledges given by partitions of the universe using the dependency degree deﬁned by Novotn´ y and Pawlak. In this paper some more investigations in that direction have been carried out but the main emphasis is laid on deﬁning the dependency degree of two knowledges when they are given by coverings in general, not by partitions only. Now, in the covering based approximation systems lower and upper approximations of a set are deﬁned in at least ﬁve diﬀerent ways [10]. All of these approximations reduce to the standard Pawlakian approximations when the underlying indiscernibility relation turns out to be equivalence. We have in this paper used four of them of which one is the classical one. As a consequence, four diﬀerent dependency degrees arise. J.F. Peters et al. (Eds.): Transactions on Rough Sets IX, LNCS 5390, pp. 351–364, 2008. c Springer-Verlag Berlin Heidelberg 2008

352

P. Samanta and M.K. Chakraborty

It is interestingly observed that the properties of partial dependency that were developed in [3,6,9] hold good in the general case of covering based approximation system. The main results on covering are placed in section 3. Depending upon this generalized notion of dependency, consistency degree and inconsistency degree between two such knowledges have been deﬁned.

2

Dependency of Knowledge Based on Partition

We would accept the basic philosophy that a knowledge of an agent about an universe is her ability to categorize objects inhabiting it through information received from various sources or perception in the form of attribute-value data. For this section we start with the indiscernibility relation caused by the attributevalue system. So, knowledge is deﬁned as follows. Definition 1. Knowledge : A knowledge is a pair, < U, P > where U is a nonempty ﬁnite set and P is an equivalence relation on U . P will also denote the partition generated by the equivalence relation. Definition 2. Finer and Coarser Knowledge : A knowledge P is said to be ﬁner than the knowledge Q if every block of the partition P is included in some block of the partition Q. In such a case Q is said to coarser than P . We shall write it as P Q. We recall a few notions due to Pawlak (and others) e.g P -positive region of Q and based upon it dependency-degree of knowledges. Definition 3. Let P and Q be two equivalence relations over U . The P -positive PX , where region of Q, denoted by P osP (Q) is deﬁned by P osP (Q) = X∈U/Q ¯ PX = { Y ∈ U/P : Y ⊆ X} called P -lower approximation of X. ¯ Definition 4. Dependency degree : Knowledge Q depends in a degree k (0 ≤ osP (Q) k ≤ 1) on knowledge P , written as P ⇒k Q, iﬀ k = CardP where card CardU denotes cardinality of the set. If k = 1 , we say that Q totally depends on P and we write P ⇒ Q; and if k = 0 we say that Q is totally independent of P . Viewing from the angle of multi-valuedness one can say that the sentence ‘The knowledge Q depends on the knowledge P ’ instead of being only ‘true’(1) or ‘false’(0) may receive other intermediate truth-values, the value k being determined as above. This approach justiﬁes the term ‘partial dependency’ as well. In propositions 1,2 and 3, we enlist some elementary, often trivial, properties of dependency degree some of them being newly exercised but most of which are present in [6,9]. Some of these properties e.g. proposition 3(v) will constitute the basis of deﬁnitions and results of the next section.

Consistency of Knowledge

353

Proposition 1 (i) [x]P1 ∩P2 = [x]P1 ∩ [x]P2 , (ii) If P ⇒ Q and R P then R ⇒ Q, (iii)If P ⇒ Q and Q R then P ⇒ R, (iv)If P ⇒ Q and Q ⇒ R then P ⇒ R, (v)If P ⇒ R and Q ⇒ R then P ∩ Q ⇒ R, (vi) If P ⇒ R ∩ Q then P ⇒ R and P ⇒ Q, (vii) If P ⇒ Q and Q ∩ R ⇒ T then P ∩ R ⇒ T , (viii) If P ⇒ Q and R ⇒ T then P ∩ R ⇒ Q ∩ T . Proposition 2 (i) If P P then P X ⊇ P X, (ii) If P ⇒a Q and P P then P ⇒b Q where b ≥ a, (iii) If P ⇒a Q and P P then P ⇒b Q where b ≤ a, (iv) If P ⇒a Q and Q Q then P ⇒b Q where b ≤ a, (v) If P ⇒a Q and Q Q then P ⇒b Q where a ≤ b. Proposition 3 (i) If R ⇒a P and Q ⇒b P then R ∩ Q ⇒c P for some c ≥ M ax(a, b), (ii) If R ∩ P ⇒a Q then R ⇒b Q and P ⇒c Q for some b, c ≤ a, (iii) If R ⇒a Q and R ⇒b P then R ⇒c Q ∩ P for some c ≤ M in(a, b), (iv) If R ⇒a Q ∩ P then R ⇒b Q and R ⇒c P for some b, c ≥ a, (v) If R ⇒a P and P ⇒b Q then R ⇒c Q for some c ≥ a + b − 1.

3

Dependency of Knowledge Based on Covering

A covering C of a set U is a collection of subsets {Ci } of U such that ∪Ci = U . It is often important to deﬁne a knowledge in terms of covering and not by partition which is a special case of covering. Given a covering C one can deﬁne a binary relation RC on U which is a tolerance relation (reﬂexive, symmetric) by xRC y holds iﬀ x, y ∈ Ci for some i, where the set {Ci } constitute the covering. Definition 5. A tolerance space is a structure S = < U, R >, where U is a nonempty set of objects and R is a reﬂexive and symmetric binary relation deﬁned on U . A tolerance class of a tolerance space < U, R > is a maximal subset of U such that any two elements of it are mutually related. In the context of knowledge when the indiscernibility relation R is only reﬂexive and symmetric (and not necessarily transitive) the approximation system < U, R > is a tolerance space. In such a case the granules of the Knowledge may be formed in many diﬀerent ways. Since the granules are not necessarily disjoint it is worthwhile to talk about granulation around an object x ∈ U . Now the most natural granule at x is the set {y : xRy}. This set is generally denoted by Rx . But any element Ci of the covering C can also be taken as a granule around x where x ∈ Ci . There may be others. So, depending upon various ways

354

P. Samanta and M.K. Chakraborty

of perceiving a granule, various deﬁnitions of lower approximations (and hence the upper approximations as their duals) of a set may be given. We shall consider them below. Now any covering gives rise to a unique partition. By P we denote the partition corresponding to the covering C. Definition 6. [1,2] A covering is said to be genuine covering if Ci ⊆ Cj implies Ci = Cj . For any genuine covering C it is immediate that the elements of C are all tolerance classes of the relation RC . Definition 7. Let two ﬁnite coverings C1 and C2 be given by C1 = {C1 , C2 , ...Cn } }. Then C1 ∩ C2 is the collection {Ci ∩ Cj where i = and C2 = {C1 , C2 , ...Cm 1, 2, ...n; j = 1, 2, ...m}. Example 1. Let C1 = {{1, 2, 3}, {2, 3, 4}, {5, 6, 7}, {6, 7, 8}} and C2 = {{1, 2, 3, 4}, {3, 4, 5, 6}, {5, 6, 7, 8}}. Then C1 ∩ C2 = {{1, 2, 3}, {3}, {2, 3, 4}, {3, 4}, {5, 6}, {5, 6, 7}, {6}, {6, 7, 8}}. Definition 8. We shall say that a covering C1 is ﬁner than a covering C2 written as C1 C2 iﬀ ∀Cj ∈ C2 ∃ Cj1 , Cj2 , ..., Cjn such that Cj = Cj1 ∪ Cj2 ∪ ... ∪ Cjn where, Cj1 , Cj2 , ..., Cjn ∈ C1 i.e. every element of C2 may be expressed as the union of some elements of C1 . Let R be a tolerance relation in U . Then the family C(R) of all tolerances classes of R is a covering of U . The pair (U, C) will be called generalized approximation space, where U is a set and C is a covering of U . We shall however assume U to be ﬁnite in the sequel. Let (U, C) be a generalized approximation space and C = {C1 , C2 , ...C n }. The indiscernibility neighborhood of an element x ∈ U is the set NxC = {Ci : x ∈ Ci }. In fact NxC is the same as RxC . For any x ∈ U the set PxC = {y ∈ U : ∀Ci (x ∈ Ci ⇔ y ∈ Ci )} will be called kernel of x. Let P be the family of all kernels (U, C) i.e. P = {PxC : x ∈ U }. Clearly P is a partition of U . Definition 9. [10] Let X be a subset of U . Then the lower and upper approximations are deﬁned as follows : C 1 (X) = {x : NxC ⊆ X} 1 C (X) = {Ci : Ci ∩ X = φ} C 2 (X) =

C {Nx : NxC ⊆ X}

C 3 (X) =

{Ci , Ci ⊆ X f or some Ci ∈ C1 }

2

C (X) = {z : ∀y(z ∈ NxC ⇒ NxC ∩ X = φ)} 3

C (X) = {y : ∀Ci (y ∈ Ci ⇒ Ci ∩ X = φ)}

Consistency of Knowledge

355

C {Px : PxC ⊆ X} 4 C (X) = {{PxC : {PxC ∩ X = φ} C 4 (X) =

Proposition 4. If C1 C2 then P1 P2 where P1 , P2 are the partitions corresponding to C1 , and C2 respectively. i

Proposition 5. If C1 C2 then for any X ⊆ U , C1 i (X) ⊇ C2 i (X) and C1 (X) ⊆ i

C2 (X) for i = 1, 2, 3, 4 . Example 2. Let U = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and C = {{1, 2}, {1, 2, 3}, {4, 6}, {6, 7, 9}, {8, 9}, {5, 10}}. Let A = {1, 2, 4, 6, 9, 10}. Then C 1 (A) = {4}, C 2 (A) = {4, 6}, C 3 (A) = {1, 2, 4, 6}, C 4 (A) = {1, 2, 4, 6, 9}. 1

2

Let B = {3, 9, 10}. Then C (B) = {1, 2, 3, 5, 6, 7, 8, 9, 10}, C (B) = {1, 2, 3, 5, 3 4 7, 8, 9, 10}, C (B) = {3, 5, 7, 8, 9, 10}, C (B) = {3, 9} Proposition 6. Propositions 1, 2, 3 except 3(v) of section 2 also hold in this generalized case. C1 (X). Definition 10. We deﬁne C1 -Positive region of C2 as P osC1 C2 = X∈C2

Definition 11. Dependency degree with respect to covering : C1 depends in a |P os

C2 |

C1 where |X| degree k (0 ≤ k ≤ 1) on C2 , written as C1 ⇒k C2 , iﬀ k = |U| denotes cardinality of the set X. We shall also write k = Dep(C1 , C2 ). If k = 1 , C1 is said to be totally dependent on C2 and we write C1 ⇒ C2 ; and if k = 0 we say that C2 is totally independent of C1 .

Since we have four kinds of lower approximations, we have, four diﬀerent C1 Positive region of C2 viz. P osiC1 C2 with respect to C i (X) for i = 1, 2, 3, 4 and also i four diﬀerent viz. DepC1 (C1 , C2 ) for Ci 1= 1, 2, 3, 4. kinds of Dependencies C1 {x : Nx ⊆ X} ⊆ {Nx : x ∈ C1 , Nx ⊆ X} ⊆ {Ci , Clearly, X∈C2 X∈C2 X∈C2 C C Ci ⊆ X f or some Ci ∈ C1 } ⊆ {Px : Px ⊆ X}. X∈C2

This implies, P os1C1 C2 ⊆ P os2C1 C2 ⊆ P os3C1 C2 ⊆ P os4C1 C2 . So, we have,

|P os1 C2 |

|P os2 C2 |

|P os3 C2 |

|P os4 C2 |

C1 C1 C1 ≤ ≤ ≤ . |U| |U| |U| |U| So, the following proposition is obtained. C1

Proposition 7. Dep1 (C1 , C2 ) ≤ Dep2 (C1 , C2 ) ≤ Dep3 (C1 , C2 ) ≤ Dep4 (C1 , C2 ). Example 3. Consider C1 = {{1, 2, 3}, {2, 3, 4}, {5, 6, 7}, {6, 7, 8}} and C2 = {{1, 2, 3, 4}, {3, 4, 5, 6}, {5, 6, 7, 8}}. Then Dep1 (C1 , C2 ) = 1, Dep2 (C1 , C2 ) = 1, Dep3 (C1 , C2 ) = 1, Dep4 (C1 , C2 ) = 1. Also, Dep1 (C2 , C1 ) = 0, Dep2 (C2 , C1 ) = 0, Dep3 (C2 , C1 ) = 0, Dep4 (C2 , C1 ) = 1. Example 4. Let us Consider C1 = {{1, 2, 3}, {3, 4, 8}, {6, 7, 8}, {8, 9}} and C2 = {{1, 2, 3, 4}, {5, 8}, {6, 7}, {8, 9}}. Then Dep1 (C1 , C2 ) = 13 , Dep2 (C1 , C2 ) = 13 , Dep3 (C1 , C2 ) = 59 , Dep4 (C1 , C2 ) = 1. Also, Dep1 (C2 , C1 ) = 13 , Dep2 (C2 , C1 ) = 13 , Dep3 (C2 , C1 ) = 49 , Dep4 (C2 , C1 ) = 59 .

356

P. Samanta and M.K. Chakraborty

Observation (i) C1 ⇒ C2 iﬀ P osC1 C2 = U C1 (X) = U iﬀ X∈C2 1 C iﬀ {x : Nx ⊆ X} = U X∈C2

iﬀ ∀x ∈ U, NxC1 ⊆ X for some X ∈ C2 . Also C1 ⇒0 C2 iﬀ P osC1 C2 = φ iﬀ C1 (X) = φ X∈C2 1 C1 {x : Nx ⊆ X} = φ iﬀ X∈C2

iﬀ ∀x ∈ U , there does not exists any X ∈ C2 such that NxC1 ⊆ X . (ii) C1 ⇒ C2 iﬀ P osC1 C2 = U iﬀ C1 (X) = U X∈C2 2 C1 iﬀ {Nx : NxC1 ⊆ X} = U x X∈C2

iﬀ ∀x ∈ U, NxC1 ⊆ X for some X ∈ C2 .

Also C1 ⇒0 C2 iﬀ P osC1 C2 = φ iﬀ C1 (X) = φ X∈C2 2 C1 iﬀ {Nx : NxC1 ⊆ X} = φ x X∈C2

iﬀ ∀x ∈ U , there does not exists any X ∈ C2 such that NxC1 ⊆ X .

(iii) C1 ⇒ C2 iﬀ P osC1 C2 = U C1 (X) = U iﬀ X∈C2 3 iﬀ X∈C {Ci ∈ C1 : Ci ⊆ X} = U 2 i iﬀ each Ci (∈ C1 ) ⊆ X, for some X ∈ C2 . Also C1 ⇒0 C2 iﬀ P osC1 C2 = φ C1 (X) = φ iﬀ X∈C2 3 iﬀ X∈C {Ci ∈ C1 : Ci ⊆ X} = φ 2 i iﬀ for any Ci ∈ C1 there does not exists any X ∈ C2 such that Ci ⊆ X. (iv) C1 ⇒ C2 iﬀ P osC1 C2 = U iﬀ C1 4 (X) = U X∈C2

Consistency of Knowledge

357

iﬀ X∈C {PxC1 : PxC1 ⊆ X} = U x 2 iﬀ for all x, PxC1 ⊆ X for some X ∈ C2 . Also C1 ⇒0 C2 iﬀ P osC1 C2 = φ iﬀ C1 (X) = φ X∈C2 4C1 iﬀ X∈C {Px : PxC1 ⊆ X} = φ 2 iﬀ for all x, there does not exists any X ∈ C2 such that PxC1 ⊆ X. The sets C1 and C2 may be considered as two groups of classifying properties of the objects of the universe U . Properties belonging to any group may have overlapping extensions. Now if C1 ⇒ C2 holds i.e. the dependency degree of C2 on C1 is 1 then the following is its characterization in the ﬁrst two cases (i) and (ii): given any element x of the universe, the set of all objects satisfying at least one of the properties of x is included in the extension of at least one of the classifying properties belonging to the second group. If, on the other hand C1 ⇒0 C2 holds, it follows that, ∀x ∈ U, NxC1 is not a subset of X for any X ∈ C2 ; that means for any element x there is at least one element y which shares at least one of the classiﬁcatory properties of the ﬁrst group and does not have any of the classiﬁcatory properties belonging to the second group. In the third case (iii) C1 ⇒ C2 iﬀ ∀Ci ∈ C1 , ∃Cj ∈ C2 such that Ci ⊆ Cj and C1 ⇒0 C2 iﬀ ∀Ci ∈ C1 there does not exist any Cj ∈ C2 such that Ci ⊆ Cj . The ﬁrst condition means that the extension of any of the classiﬁcatory properties of the ﬁrst group is a subset of the extension of at least one of the classiﬁcatory properties of the second. On the other hand the second one means : no classiﬁcatory property belonging to the ﬁrst group implies any one of the classiﬁcatory property of the second group. In the fourth case (iv) if x and y are equivalent with respect to the classiﬁcatory properties in the group C1 then x and y will share at least one of the classiﬁcatory properties with respect to C2 and vice versa.

4

Consistency of Knowledge Based on Partition

Two knowledges P and Q on U where P and Q are partitions may be considered as fully consistent if and only if U/P = U/Q, that is P ,Q generate exactly the same granules. This is equivalent to P ⇒ Q and Q ⇒ P . So, a natural measure of consistency degree of P and Q might be the truth-value of the non-classical sentence “Q depends on P ∧ P depends on Q” computed by a suitable conjunction operator applied on the truth-values of the two component sentences Thus a binary predicate Cons may be created such that Cons(P, Q) will stand for the above conjunctive sentence. A triangular norm (or t-norm) used in fuzzyliterature and many-valued logic scenario is a potential candidate for computing ∧. A t-norm is a mapping t : [0, 1] → [0, 1] satisfying (i) t(a, 1) = a, (ii) b ≤ d

358

P. Samanta and M.K. Chakraborty

implies t(a, b) ≤ t(a, d), (iii) t(a, b) = t(b, a), (iv) t(a, t(b, d)) = t(t(a, b), d). It follows that t(a, 0) = 0. Typical examples of t-norm are : min(a, b) (G¨ odel), max(0, a + b − 1) (Lukasicwicz), a × b (Godo,Hajek). These are conjunction operators used extensively and are in some sense the basic t-norms [4]. With 1 − x as negation operator the De-Morgan dual of t-norms called s-norms are obtained as s(a, b) = 1 − t(1 − a, 1 − b). Values of disjunctive sentences are computed by s-norms. There is however a diﬃculty in using a t-norm in the present context. We would like to have the following assumptions to hold. Assumption 1. Knowledges P ,Q shall be fully consistent iﬀ they generate the same partition. Assumption 2. Knowledges P ,Q shall be fully inconsistent iﬀ no granule generated by one is contained in any granule generated by the other. The translation of the above demands in mathematical terms is that the conjunction operator should fulﬁll the conditions: (a, b) = 1 iﬀ a = 1, b = 1 and (a, b) = 0 iﬀ a = 0, b = 0. No t-norm satisﬁes the second. So we deﬁne consistency degree as follows: Definition 12. Let P and Q be two knowledges such that P ⇒a Q and Q ⇒b P . The consistency degree between the two knowledges denoted by Cons(P, Q) is given by Cons(P, Q) = a+b+nab n+2 , where n is a non negative integer. Definition 13. Two knowledges P and Q are said to be fully consistent if Cons(P, Q) = 1. Two knowledge P and Q are said to be fully inconsistent if Cons(P, Q) = 0. Example 5. (i) Let U = {1, 2, 3, 4, 5, 6, 7, 8} and the partitions be taken as P = {{1, 3, 5}, {2, 4, 6}, {7, 8}} and Q = {{1, 2, 7}, {3, 4, 8}, {5, 6}}. Then P ⇒0 Q and Q ⇒0 P . So, Cons(P, Q) = 0. (ii) Let U = {1, 2, 3, 4, 5, 6, 7, 8} and partitions P = {{1, 3, 5}, {2, 4, 6}, {7, 8}} and Q = {{1, 3, 5}, {2, 4, 6}, {7, 8}}. Then P ⇒1 Q and Q ⇒1 P . So, Cons(P, Q) = 1. (iii) Let U = {1, 2, 3, 4, 5, 6, 7, 8} and partitions P = {{1, 4, 5}, {2, 8}, {6, 7}, {3}} and Q = {{1, 3, 5}, {2, 4, 7, 8}, {6}}. Then P ⇒ 83 Q and Q ⇒ 18 P . So, Cons(P, Q) =

3 1 3 1 8 + 8 +n 8 8

n+2

, where n is a non-negative integer.

Although any choice of n satisﬁes the initial requirements, some special values for it may be of special signiﬁcance e.g n = 0, n = Card(U ) and n as deﬁned in proposition 5. We shall make discussions on two of such values latter. ‘n’ shall

Consistency of Knowledge

359

be referred to as the ‘consistency constant’ or simply ‘constant’ in the sequel. The constant is a kind of constraint on consistency measure as shown in the next proposition. Proposition 8. For two knowledges P and Q if n1 ≤ n2 then Cons1 (P, Q) ≥ Cons2 (P, Q) where Consi (P, Q) is the consistency degree when ni is the constant taken. Proof. Let P ⇒a Q and Q ⇒b P . Since n1 ≤ n2 , so, n2 − n1 ≥ 0. So a+b+n1 ab 1 ab 2 ab 2 ab and Cons2 (P, Q) = a+b+n - a+b+n Cons1 (P, Q) = a+b+n n1 +2 n2 +2 . Now, n1 +2 n2 +2 (n2 −n1 )(a+b−2ab) (n1 +2)(n2 +2)

≥ 0 iﬀ (n2 − n1 )(a + b − 2ab) ≥ 0 iﬀ (a + b − 2ab) ≥ 0 iﬀ √ a + b ≥ 2ab. Now, a+b ≥ ab ≥ ab. So a + b ≥ 2ab holds. This shows that 2 Cons1 (P, Q) ≥ Cons2 (P, Q).

=

Proposition 9. If n = the number of elements a ∈ U such that [a]P ⊆ [a]Q and [a]Q ⊆ [a]P , then n = CardU - [Card PX + Card QX X∈U/Q ¯ X∈U/P ¯ Card( PX QX)]. X∈U/Q ¯ X∈U/P ¯ Proof. Here the number of elements a ∈ U such that [a]P ⊆ [a]Q = Card X∈U/Q PX ...(i). Now the number of elements a∈ U such that [a]Q ⊆ [a]P =Card X∈U/P ¯ QX ...(ii). So the number of elements common to (i) and (ii) = Card( PX X∈U/Q ¯ ¯ QX)] ...(iii) . From (i), (ii) and (iii) the proposition follows. X∈U/P ¯ One can observe that the deﬁnition of a consistent object in [5,7] may be generalized relative to any pair (P, Q) of partitions of the Universe, not only restricted to the partitions caused due to the pair (CON, DEC) where CON is the set of condition attributes and DEC is the decision attributes. With this extension of the notion, n is the count of all those objects a such that a is not consistent relative to both the pairs (P, Q) and (Q, P ). In the following examples n is taken to be this number. Example 6. (i) Let U = {1, 2, 3, 4, 5, 6, 7, 8} and partitions P = {{1, 3, 5}, {2, 4, 6}, {7, 8}} and Q = {{1, 2, 7}, {3, 4, 8}, {5, 6}}. Then P ⇒0 Q and Q ⇒0 P . Here n = 8. So, Cons(P, Q) = 0+0+8.0.0 = 0. 8+2 (ii) Let U = {1, 2, 3, 4, 5, 6, 7, 8} and partitions P = {{1, 3, 5}, {2, 4, 6}, {7, 8}} and Q = {{1, 3, 5}, {2, 4, 6}, {7, 8}}. Then P ⇒1 Q and Q ⇒1 P . Here n = 0. = 1. So, Cons(P, Q) = 1+1+0.1.1 0+2 (iii) Let U = {1, 2, 3, 4, 5, 6, 7, 8} and partitions P = {{1, 4, 5}, {2, 8}, {6, 7}, {3}} and Q = {{1, 3, 5}, {2, 4, 7, 8}, {6}}. Then P ⇒ 38 Q and Q ⇒ 18 P . Here n = 4. So, Cons(P, Q) =

3 1 3 1 8 + 8 +4. 8 . 8

4+2

=

11 96 .

If the t-norm is taken to be max(0, a + b − 1), then the corresponding s-norm is min(1, a + b). For the t-norm min(a, b), the s-norm is max(a, b). There is an order relation in the t-norms/ s-norms, viz. any t-norm ≤ min ≤ max ≤ any s-norm.

360

P. Samanta and M.K. Chakraborty

In particular max(0, a + b − 1) ≤ min(a, b) ≤ max(a, b) ≤ min(1, a + b). Where does the Cons function situate itself in this chain - might be an interesting and useful query. The following proposition answers this question. Proposition 10. max(0, a + b − 1) ≤ Cons(P, Q) ≤ max(a, b) if P ⇒a Q and Q ⇒b P . To compare Cons(P, Q) and min(a, b), we have, Proposition 11. Let P and Q be two knowledges and P ⇒a Q and Q ⇒b P. Then (i) a = b = 1 iﬀ min(a, b) = Cons(P, Q) = 1, (ii) If either a = 1 or b = 1 then min(a, b) ≤ Cons(P, Q), a−b , a = 0, b = 1, (iii) min(a, b) = a ≤ Cons(P, Q) iﬀ n ≤ a(b−1) a−b (iv) min(a, b) = a ≥ Cons(P, Q) iﬀ n ≥ a(b−1) , a = 0, b = 1, (v) max(0, a + b − 1) ≤ Cons(P, Q) ≤ max(a, b) ≤ s(a, b) = min(1, a + b).

The Cons function seems to be quite similar to a t-norm but not the same. So a closer look into the function is worthwhile. We deﬁne a function : [0, 1] × [0, 1] → [0, 1] as follows (a, b) = a+b+nab n+2 where n is a non-negative integer. Proposition 12. (i) 0 ≤ (a, b) ≤ 1, (ii) If a ≤ b then (a, b) ≤ (a, c), (iii) (a, b) = (b, a), (iv) (a, (b, c)) = ((a, b), c) iﬀ a = c ; (a, (b, c)) ≤ ((a, b), c) iﬀ a ≤ c; (a, (b, c)) ≥ ((a, b), c) iﬀ a ≥ c, (v) (a, 1) ≥ a, equality occurring iﬀ a = 1, (vi) (a, 0) ≤ a, equality occurring iﬀ a = 0, (vii) (a, b) = 1 iﬀ a = b = 1 and (a, b) = 0 iﬀ a = b = 0, (viii) (a, a) = a iﬀ either a = 0 or a = 1, The consistency function Cons gives a measure of similarity between two knowledges. It would be natural to deﬁne a measure of inconsistency or dissimilarity now. In [6] a notion of distance is available. Definition 14. If P ⇒a Q and Q ⇒b P then the distance function is denoted by ρ(P, Q) and deﬁned as ρ(P, Q) = 2−(a+b) . 2 Proposition 13. The distance function ρ satisﬁes the conditions : (i) o ≤ ρ(P, Q) ≤ 1 (ii) ρ(P, P ) = 0 (iii) ρ(P, Q) = ρ(Q, P ) (iv) ρ(P, R) ≤ ρ(P, Q) + ρ(Q, R). For proof the reader is referred to [6].


Definition 15. We now define a measure of inconsistency by InCons(P, Q) = 1 − Cons(P, Q).

Proposition 14. (i) 0 ≤ InCons(P, Q) ≤ 1, (ii) InCons(P, P) = 0, (iii) InCons(P, Q) = InCons(Q, P), (iv) InCons(P, R) ≤ InCons(P, Q) + InCons(Q, R) for a fixed constant n.

Proof of (iv): Let P ⇒_x R, R ⇒_y P, P ⇒_a Q, Q ⇒_b P, Q ⇒_l R, R ⇒_m Q ...(i). Now InCons(P, R) = (n + 2 − x − y − nxy)/(n + 2) ≤ InCons(P, Q) + InCons(Q, R) = (n + 2 − a − b − nab)/(n + 2) + (n + 2 − l − m − nlm)/(n + 2) = (2(n + 2) − n(ab + lm) − (a + b + l + m))/(n + 2) iff n + 2 − x − y − nxy ≤ 2(n + 2) − n(ab + lm) − (a + b + l + m) iff n(ab + lm − xy − 1) ≤ 2 + x + y − (a + b + l + m) ...(ii). From (i), by Proposition 3(v), we have x ≥ a + m − 1 and y ≥ b + l − 1. Hence ab + lm − xy − 1 ≤ ab + lm − (a + m − 1)(b + l − 1) − 1 = a(1 − l) + b(1 − m) + (m − 1) + (l − 1) ≤ (1 − l) + (1 − m) + (m − 1) + (l − 1) (because 0 ≤ a, b ≤ 1) = 0 ...(iii). Now, 2 + x + y − (a + b + l + m) = 2((2 − a − b)/2 + (2 − l − m)/2 − (2 − x − y)/2) = 2(ρ(P, Q) + ρ(Q, R) − ρ(P, R)) ≥ 0 ...(iv) [by Proposition 13(iv)]. Thus the left hand side of inequality (ii) is non-positive and the right hand side of (ii) is non-negative. So (iv), i.e. the triangle inequality, is established.

Proposition 14 shows that for any fixed n the inconsistency measure of knowledge is a metric. It is also a generalization of the distance function ρ in [6]; InCons reduces to ρ when n = 0. Again, n acts as a kind of constraint on the inconsistency measure: as n increases, the inconsistency increases too.

4.1 Consistency Degree w.r.t. Covering

Definition 16. We define the consistency degree in the same way: Cons_i(C1, C2) = (a + b + nab)/(n + 2), where Dep_i(C1, C2) = a, i.e. C1 ⇒_a C2, and Dep_i(C2, C1) = b, i.e. C2 ⇒_b C1, for i = 1, 2, 3, 4.

Example 7. Let C1 = {{1, 2, 3}, {3, 4, 8}, {6, 7, 8}, {8, 9}} and C2 = {{1, 2, 3, 4}, {5, 8}, {6, 7}, {8, 9}}. Then Dep_1(C1, C2) = 1/3, Dep_2(C1, C2) = 1/3, Dep_3(C1, C2) = 5/9, Dep_4(C1, C2) = 1. Also, Dep_1(C2, C1) = 1/3, Dep_2(C2, C1) = 1/3, Dep_3(C2, C1) = 4/9, Dep_4(C2, C1) = 5/9. So, Cons_i(C1, C2) for i = 1, 2, 3, 4 are as follows:

Cons_1(C1, C2) = (1/3 + 1/3 + n·(1/3)·(1/3))/(n + 2) = (n + 6)/(9(n + 2)),
Cons_2(C1, C2) = (1/3 + 1/3 + n·(1/3)·(1/3))/(n + 2) = (n + 6)/(9(n + 2)),
Cons_3(C1, C2) = (5/9 + 4/9 + n·(5/9)·(4/9))/(n + 2) = (20n + 81)/(81(n + 2)),
Cons_4(C1, C2) = (1 + 5/9 + n·1·(5/9))/(n + 2) = (5n + 14)/(9(n + 2)).

Observation (a) Consi (C1 , C2 ) = 1 iﬀ Depi (C1 , C2 ) = 1 and Depi (C2 , C1 ) = 1.


Its interpretations for i = 1, 2, 3, 4 are given by:
Cons_1(C1, C2) = 1 iff ∀x ∈ U, NxC1 ⊆ X for some X ∈ C2, and ∀x ∈ U, NxC2 ⊆ X for some X ∈ C1.
Cons_2(C1, C2) = 1 iff ∀x ∈ U, NxC1 ⊆ X for some X ∈ C2, and ∀x ∈ U, NxC2 ⊆ X for some X ∈ C1.
Cons_3(C1, C2) = 1 iff each Ci (∈ C1) ⊆ X for some X ∈ C2, and each Ci (∈ C2) ⊆ X for some X ∈ C1.
Cons_4(C1, C2) = 1 iff for all x, PxC1 ⊆ X for some X ∈ C2, and for all x, PxC2 ⊆ X for some X ∈ C1.

(b) Cons_i(C1, C2) = 0 iff Dep_i(C1, C2) = 0 and Dep_i(C2, C1) = 0. So, the interpretations are:
Cons_1(C1, C2) = 0 iff ∀x ∈ U there does not exist any X ∈ C2 such that NxC1 ⊆ X, and ∀x ∈ U there does not exist any X ∈ C1 such that NxC2 ⊆ X.
Cons_2(C1, C2) = 0 iff ∀x ∈ U there does not exist any X ∈ C2 such that NxC1 ⊆ X, and ∀x ∈ U there does not exist any X ∈ C1 such that NxC2 ⊆ X.
Cons_3(C1, C2) = 0 iff for any Ci ∈ C1 there does not exist any X ∈ C2 such that Ci ⊆ X, and for any Ci ∈ C2 there does not exist any X ∈ C1 such that Ci ⊆ X.
Cons_4(C1, C2) = 0 iff for all x there does not exist any X ∈ C2 such that PxC1 ⊆ X, and for all x there does not exist any X ∈ C1 such that PxC2 ⊆ X.

Definition 17. A measure of inconsistency for the case of coverings is defined in the same way: InCons(C1, C2) = 1 − Cons(C1, C2).

5 Towards a Logic of Consistency of Knowledge

We are now at the threshold of a logic of consistency (of knowledge). Along with the usual propositional connectives, the language shall contain two binary predicates, 'Cons' and 'Dep', for consistency and dependency respectively. At least the following features of this logic are present:
(i) 0 ≤ Cons(P, Q) ≤ 1,
(ii) Cons(P, P) = 1,
(iii) Cons(P, Q) = Cons(Q, P),
(iv) Cons(P, Q) = 0 iff Dep(P, Q) = 0 and Dep(Q, P) = 0


and Cons(P, Q) = 1 iff Dep(P, Q) = 1 and Dep(Q, P) = 1.
In case P, Q, R are partitions we also get
(v) Cons(P, Q) and Cons(Q, R) implies Cons(P, R).

(i) shows that the logic is many-valued; (ii) and (iii) are natural expectations; (iv) conforms to assumptions 1 and 2 (Section 2); (v) shows transitivity of the predicate Cons in the special case of partitions. That the transitivity holds is shown below. We want to show that Cons(P, Q) and Cons(Q, R) implies Cons(P, R), i.e. Cons(P, Q) and Cons(Q, R) ≤ Cons(P, R). We use the Łukasiewicz t-norm to compute 'and'. Let n be the fixed constant. So, what is needed is max(0, Cons(P, Q) + Cons(Q, R) − 1) ≤ Cons(P, R). Clearly, Cons(P, R) ≥ 0 ...(i). We shall now show Cons(P, R) ≥ Cons(P, Q) + Cons(Q, R) − 1. Let P ⇒_x R, R ⇒_y P, P ⇒_a Q, Q ⇒_b P, Q ⇒_l R, R ⇒_m Q. So x ≥ a + l − 1 and y ≥ b + m − 1 [cf. Proposition 3(v)] ...(ii). So, Cons(P, Q) + Cons(Q, R) − 1 = (a + b + nab)/(n + 2) + (l + m + nlm)/(n + 2) − 1 = ((a + l − 1) + (b + m − 1) + n(ab + lm − 1))/(n + 2) ≤ (x + y + n(ab + lm − 1))/(n + 2) [using (ii)] ...(iii). Here, xy ≥ (a + l − 1)(b + m − 1) = ab + lm + (m − 1)(a − 1) + (b − 1)(l − 1) − 1 ≥ ab + lm − 1 [as m − 1 ≤ 0 and a − 1 ≤ 0, so (m − 1)(a − 1) ≥ 0, and b − 1 ≤ 0, l − 1 ≤ 0, so (b − 1)(l − 1) ≥ 0] ...(iv). So (iii) and (iv) imply Cons(P, Q) + Cons(Q, R) − 1 ≤ (x + y + nxy)/(n + 2) = Cons(P, R) ...(v). (i)–(v) pave the way for formulating axioms of a possible logic of knowledge.

6 Concluding Remarks

This paper is only the beginning of research on a many-valued logic of dependency and consistency of knowledges, where knowledge, in the context of incomplete information, is understood basically as proposed by Pawlak. The various ways of defining lower and upper approximations indicate that the modalities are also different, and hence the corresponding logics would also be different. We foresee interesting logics being developed and significant applications of the concepts Dep, Cons and the operator ∘.

Acknowledgement The ﬁrst author acknowledges the ﬁnancial support from the University Grants Commission, Government of India.

References
1. Bianucci, D., Cattaneo, G., Ciucci, D.: Entropies and co-entropies of coverings with application to incomplete information systems. Fundamenta Informaticae 75, 77–105 (2007)
2. Cattaneo, G., Ciucci, D.: Lattice Properties of Preclusive and Classical Rough Sets. Personal Collection


3. Chakraborty, M.K., Samanta, P.: Consistency-Degree Between Knowledges. In: Kryszkiewicz, M., et al. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 133–141. Springer, Heidelberg (2007)
4. Klir, G.J., Yuan, B.: Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice-Hall of India, Englewood Cliffs (1997)
5. Nguyen, N.T., Malowiecki, M.: Consistency Measures for Conflict Profiles. In: Peters, J.F., Skowron, A., Grzymala-Busse, J.W., Kostek, B., Swiniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 169–186. Springer, Heidelberg (2004)
6. Novotný, M., Pawlak, Z.: Partial Dependency of Attributes. Bull. Polish Acad. of Sci., Math. 36, 453–458 (1988)
7. Pawlak, Z.: Rough Sets. International Journal of Computer and Information Sciences 11, 341–356 (1982)
8. Pawlak, Z.: On Rough Dependency of Attributes in Information Systems. Bull. Polish Acad. of Sci., Math. 33, 551–559 (1985)
9. Pawlak, Z.: Rough Sets - Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishers, Dordrecht (1991)
10. Pomykala, J.A.: Approximation, Similarity and Rough Constructions. ILLC Prepublication Series for Computation and Complexity Theory CT-93-07, University of Amsterdam
11. Qin, K., Gao, Y., Pei, Z.: On Covering Rough Sets. In: Yao, J.T., et al. (eds.) RSKT 2007. LNCS, vol. 4481, pp. 34–41. Springer, Heidelberg (2007)
12. Sakai, H., Okuma, A.: Basic Algorithms and Tools for Rough Non-deterministic Information Analysis. In: Peters, J.F., Skowron, A., Grzymala-Busse, J.W., Kostek, B., Swiniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 209–231. Springer, Heidelberg (2004)
13. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundamenta Informaticae 27, 245–253 (1996)
14. Slezak, D., Wasilewski, P.: Granular Sets - Foundations and Case Study of Tolerance Spaces. In: An, A., et al. (eds.) RSFDGrC 2007. LNCS, vol. 4482, pp. 435–442. Springer, Heidelberg (2007)
15. Yao, Y.: Semantics of Fuzzy Sets in Rough Set Theory. In: Peters, J.F., et al. (eds.) Transactions on Rough Sets II. LNCS, vol. 3135, pp. 297–318. Springer, Heidelberg (2004)
16. Zakowski, W.: Approximations in the space (u, π). Demonstratio Mathematica 16, 761–769 (1983)

A New Approach to Distributed Algorithms for Reduct Calculation

Tomasz Strąkowski and Henryk Rybiński

Warsaw University of Technology, Poland
[email protected], [email protected]

Abstract. Calculating reducts is a very important process. Unfortunately, computing all reducts is NP-hard. There are a lot of heuristic solutions for computing reducts, but they do not guarantee achieving the complete set of reducts. We propose here three versions of an exact algorithm, designed for parallel processing. We show how to decompose the problem of calculating reducts so that parallel calculations are efficient.

Keywords: Rough set theory, reduct calculation, distributed computing.

1 Introduction

Nowadays, the ability to collect data is much higher than the ability to process them. Rough Set Theory (RST) provides means for discovering knowledge from data. One of the main concepts in RST is the notion of a reduct, which can be seen as a minimal set of conditional attributes preserving the required classification features [1]. In other words, having a reduct of a decision table we are able to classify objects (i.e. take decisions) with the same quality as with all the attributes. However, the main restriction in the practical use of RST is that computing all reducts is NP-hard. It is therefore important to find algorithms that compute reducts efficiently. There are many ideas on how to speed up the computing of reducts [2], [3], [4], [5]. Many of the presented algorithms are based on some heuristics. The disadvantage of a heuristic solution is that it does not necessarily give us the complete set of reducts; in addition, some results can be over-reducts. Another way to speed up the calculation process, not yet explored sufficiently, is to distribute the computations over a set of processors and perform the calculations in parallel.

The research has been partially supported by grant No. 3 T11C 002 29 from the Polish Ministry of Education and Science, and partially by grant No. 503/G/1032/4200/000 from the Rector of Warsaw University of Technology.



In this paper we analyze how to speed up the calculation of the complete sets of reducts by distributing the processing over a number of available processors. A parallel version of a genetic algorithm for computing reducts was presented in [3]. The main disadvantage of this approach is that the algorithm does not necessarily find all the reducts. In this paper we present various types of problem decomposition for calculating reducts. We present three versions of distributing the processing, each of them generating all the reducts of a given information system. We will also discuss the conditions for decomposing the problem, and present criteria that enable one to find the best decomposition. The paper is organized as follows. In Section 2 we recall basic notions related to the rough set theory, and present the analogies between finding reducts in RST and the transformations of logical clauses. We also present a naïve algorithm for finding a complete set of reducts and discuss its complexity. In Section 3 we present three ways of decomposing the process of reduct calculation. Section 4 is devoted to experimental results, performed with all three proposed approaches. We conclude the paper with a discussion of the effectiveness of the approaches and their areas of application.

2 Computing Reducts and Logic Operations

Let us start by recalling basic notions of the rough set theory. In practical terms, knowledge is coded in an information system (IS). An IS is a pair (U, A), where U is a finite set of elements and A is a finite set of attributes which describe each element. For every a ∈ A there is a function a : U → V_a, assigning a value v ∈ V_a of the attribute a to the objects u ∈ U, where V_a is the domain of a. The indiscernibility relation is defined as follows: IND(A) = {(u, v) : u, v ∈ U, a(u) = a(v) for all a ∈ A}. Informally speaking, two objects u and v are indiscernible for the attribute a if they have the same value of that attribute. The indiscernibility relation can be defined for a set of attributes B ⊆ A as IND(B) = ⋂_{a∈B} IND(a). One of the most important ideas in RST is the notion of a reduct. A reduct is a minimal set of attributes B, B ⊆ A, for which the indiscernibility relation in U is exactly the same as for the set A, i.e. IND(B) = IND(A). A super-reduct is a superset of a reduct. Given a set of attributes B, B ⊆ A, we define a B-related reduct as a set C of attributes, B ∩ C = ∅, which preserves the partition of IND(B) over U. Given u ∈ U, we define a local reduct as a minimal set of attributes capable of distinguishing this particular object from the other objects as well as the total set of attributes does. Let us introduce a discernibility function (denoted by disc(B, u)) as the set of all objects v discernible from u for the set of attributes B: disc(B, u) = {v ∈ U | ∀a ∈ B (a(u) ≠ a(v))}


A local reduct for the element u ∈ U is a minimal set of attributes B, B ⊆ A, such that disc(B, u) = disc(A, u).

Now, let us show some similarities between reducts and some logic operations. The relationships between reducts and logical expressions were first presented in [6]. Let us consider a decision table, as in Table 1. We have here five elements (u1 − u5), three conditional attributes, namely a, b, c, and one decision attribute d.

Table 1. Decision Table

     a  b  c  d
u1   1  2  3  1
u2   1  2  1  2
u3   2  2  3  2
u4   2  2  3  2
u5   3  5  1  3

The indiscernibility matrix for this table is shown in Table 2.

Table 2. Indiscernibility matrix

     u1   u2   u3   u4   u5
u1        c    a    a    abc
u2   c         ac   ac   ab
u3   a    ac             abc
u4   a    ac             abc
u5   abc  ab   abc  abc

The interpretation of the above indiscernibility matrix can be presented in the form of Table 3.

Table 3. Interpretation of the indiscernibility matrix

     Discernibility Function     CNF (after reduction)   DNF (Prime Implicants)   Local reducts
u1   c ∧ a ∧ (a ∨ b ∨ c)         c ∧ a                   a ∧ c                    {a,c}
u2   c ∧ (a ∨ c) ∧ (a ∨ b)       c ∧ (a ∨ b)             (a ∧ c) ∨ (b ∧ c)        {a,c}; {b,c}
u3   a ∧ (a ∨ c) ∧ (a ∨ b ∨ c)   a                       a                        {a}
u4   a ∧ (a ∨ c) ∧ (a ∨ b ∨ c)   a                       a                        {a}
u5   (a ∨ b ∨ c) ∧ (a ∨ b)       (a ∨ b)                 a ∨ b                    {a}; {b}

The i-th row shows the following: column 1 contains the rule saying which attributes have to be used to discern the i-th object (ui) from the other objects of the IS from Table 1 (the discernibility function). The second column shows the same rule in the form of CNF (after reduction), the third one presents the rule in disjunctive normal form (DNF), whereas the last column provides the local reducts for ui.
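For illustration, the entries of Table 2 can be generated mechanically. Here is a small Python sketch (our own illustration, not the authors' code) that prints, for every object, the attributes discerning it from each of the other objects:

    objects = {'u1': (1, 2, 3), 'u2': (1, 2, 1), 'u3': (2, 2, 3),
               'u4': (2, 2, 3), 'u5': (3, 5, 1)}
    attrs = ('a', 'b', 'c')

    def entry(u, v):
        # one cell of the indiscernibility matrix: attributes on which u and v differ
        return ''.join(a for a, x, y in zip(attrs, objects[u], objects[v]) if x != y)

    for u in objects:
        print(u, [entry(u, v) or '-' for v in objects if v != u])
    # e.g. u2 -> ['c', 'ac', 'ac', 'ab'], matching the u2 row of Table 2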


Algorithm 2.1. Reduct Set Computation(DT)

    compute the indiscernibility matrix M(A) = (C_ij)
    transform M to a one-dimensional table T
    reduce T using the absorption laws        {from CNF to prime implicant}
    sort T by clause length                   {sorting is our modification; d = number of elements in T}
    build the families R_0, R_1, ..., R_d in the following way:
        R_0 := ∅
        for i := 1 to d do
            if the stop condition is true     {this is our modification}
                then break the loop; R_d := R_i
                else R_i := S_i ∪ K_i, where
                     S_i = {r ∈ R_{i−1} : r ∩ T_i ≠ ∅}
                     K_i = {r ∪ {a} : a ∈ T_i, r ∈ R_{i−1}, r ∩ T_i = ∅}
    remove redundant elements from R_d
    remove super-reducts
    RED(A) := R_d
    return RED(A)

Now let us recall a naïve method for calculating reducts. It is a slight modification of the algorithm presented in [2], and is given above in the form of pseudo code. This code differs from the original one in two places. First, we sort the clauses of the discernibility function by length (the shortest clauses come first). Second, we introduce the stop condition. Here T_i is the i-th clause of the prime implicant and R_i is the set of candidate reducts. In the i-th step we check each r ∈ R_i, and if r ∩ T_i ≠ ∅, then R_{i+1} := R_{i+1} ∪ {r}; otherwise the clause is split into separate attributes, and each attribute is added to r, making a new reduct candidate to be included in R_{i+1}. As the clauses T_i are sorted, we can stop the algorithm when k + l_i > |A|, where k is the length of the shortest candidate reduct and l_i is the length of T_i. Let us reconsider the time and space complexities of the naïve algorithm. There are three sequential parts in the algorithm: (1) generating the indiscernibility matrix (IND matrix); (2) converting the matrix to the discernibility function (using the absorption laws); and (3) converting to the DNF form (prime implicants), i.e. reducts. The IND matrix is square and symmetric (with null values on the diagonal). The size of the matrix is |U| × |U|, where |U| denotes the number of elements in DT. So, the time and space complexities are:

    O((|U|² − |U|)/2)                          (1)
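A direct Python rendering of the candidate-expansion loop may make the pseudo code easier to follow. This is our own sketch: clauses are frozensets of attributes, the absorption laws are applied after every step, and the stop condition is omitted for brevity.

    def absorb(family):
        # absorption laws: keep only the minimal sets
        family = set(family)
        return {s for s in family if not any(t < s for t in family)}

    def reducts(clauses):
        clauses = sorted(absorb(frozenset(c) for c in clauses), key=len)
        R = {frozenset()}
        for T in clauses:                      # one clause of the prime implicant per step
            S = {r for r in R if r & T}        # candidates already covering T
            K = {r | {a} for r in R if not r & T for a in T}
            R = absorb(S | K)                  # removes redundant elements and super-reducts
        return R

    # the clauses read off Table 2 (a string is iterated as single attributes):
    print(reducts(['c', 'a', 'a', 'abc', 'ac', 'ac', 'ab', 'abc', 'abc', 'abc']))
    # -> {frozenset({'a', 'c'})}: the only reduct for the clauses of Table 2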

The complexity of the process of converting from the IND matrix to the discernibility function formulae is linear, so it can be ignored.


The complexity of the conversion from the discernibility function to CNF is O(n²) (in the worst case), where n is the number of clauses in the discernibility function. No additional data structures are needed, so the space complexity can be ignored. The hardest to estimate is the complexity of converting from CNF to DNF. The space complexity is estimated as:

    O(C(|A|, |A|/2))                           (2)

where C(|A|, |A|/2) is the binomial coefficient "|A| over |A|/2". It is the maximal number of the candidate reducts in the conversion process. The proof of the maximal number of reducts was presented in [4]. More complicated is the estimation of the time complexity. Given n as the number of clauses in the discernibility function, we can estimate it as:

    O(C(|A|, |A|/2) × n)                       (3)

During the conversion process from the discernibility function to prime implicants, in one step we compare every candidate reduct with the i-th clause of the discernibility function. The number of steps is equal to the number of clauses. The maximum number of clauses in the discernibility function is n = (|U|² − |U|)/2 (in the worst case, when the absorption laws cannot be used). Hence, the time complexity is:

    O(C(|A|, |A|/2) × (|U|² − |U|)/2)          (4)

Let us now summarize our considerations:
1. The maximal space requirement depends only on the number of attributes in DT.
2. The time of computing IND depends polynomially on the number of objects in DT.
3. The time of computing all the reducts depends exponentially on the number of attributes (for a constant number of objects).

The exponential explosion of the complexity appears in the last part of the algorithm, during the conversion from CNF to prime implicants. The best practice is to decompose the most complex part, though it is not the only possible place. Sometimes computing IND is more time consuming than evaluating the conversions. We will discuss the options in the next section.
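The exponential factor is easy to appreciate numerically; for the attribute counts used later in the experiments, the bound C(|A|, |A|/2) already reaches millions of candidate sets:

    from math import comb  # Python 3.8+

    for k in (10, 19, 20, 23):
        print(k, comb(k, k // 2))
    # 10 -> 252, 19 -> 92378, 20 -> 184756, 23 -> 1352078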

3 Decomposition of the Problem

There are several ways of decomposing the algorithm. One possibility is to split DT, compute reducts for each part independently and merge the results. Another idea is to compute the IND matrix sequentially, convert it to the discernibility function and CNF, and then split the discernibility function into several parts to be calculated


separately, so that the conversions to DNF are made in separate nodes of the algorithm, and then the final result is obtained by merging the partial results. The two proposals above consist in a horizontal decomposition, in the sense that we split the table DT into some sub-tables, and then use the partial results to compute the final reducts. Certainly, the partial results are not necessarily reducts. They can, though, be related reducts, and additional (post)processing is needed to calculate the reducts. Both proposals will be described in more detail in this section, in 3.1 and 3.2 respectively. In the paper we propose yet another solution, based on a vertical decomposition. In particular, during the process of converting CNF to DNF we split the set of candidate reducts among a number of processors, which then serve in parallel for processing the consecutive clauses. We call this decomposition vertical because it splits the set of candidate reducts (subsets of the attributes) into separate subsets, instead of splitting the set of objects. For each subset of the candidate reducts, the conversion is completed in a separate node of the algorithm (processor). Let us note that every candidate reduct passes comparisons with every clause. This guarantees that the partial results in each node are reducts or super-reducts. Having computed the partial results, in the last phase of the algorithm we join them into the final reduct set. Let us also note that there is a difference between using partial results obtained from horizontal and vertical decompositions. In the first case we have to merge partial reducts, which is a complex and time-consuming process, whereas in the second case we have to join the partial results and remove duplicates and super-reducts. This process is fairly simple. The third proposal is presented in Section 3.3. Below we describe the three proposals in more detail.

3.1 Splitting Decision Table

Let us present the process of decomposing DT. We split DT into two separate, randomly selected subsets, X1 and X2, and for each of them we compute the reducts. If we now want to "merge" the results, the combined result does not take into account indiscernibilities between objects from X1 and X2. It is therefore necessary to compute another part of the IND matrix to calculate the discernibility for the pairs (xi, xj), xi ∈ X1 and xj ∈ X2. In Fig. 1 it is shown how the decomposition of DT influences the splitting of the IND matrix (denoted by M). M(Xk), k = 1, 2, are the parts related to the discernibility of objects both from Xk. M(X1 ∪ X2) is the part of M with information about the discernibility between xi, xj such that xi ∈ X1 and xj ∈ X2. In this sense the decomposition of DT is not disjoint. However, in the sense of splitting M into disjoint parts, the decomposition is disjoint. We can thus conclude that for splitting DT into two sets we need three processing nodes. Similarly, if we split DT into three sets, we need six processing nodes. In general, if we split DT into n subsets we need (n² + n)/2 processing nodes.


[Figure omitted: the indiscernibility matrix split into the parts M(X1), M(X2) and M(X1, X2) according to the decomposition of DT into the sets X1 and X2.]

Fig. 1. Splitting DT

3.2 Splitting Discernibility Function

Another idea for decomposing the problem of computing all reducts is to split the discernibility function into separate sections, and then to treat each section as a separate discernibility function. The conversion to DNF is made for every such function, and then the partial results are merged as a multiplication of clauses. Let us illustrate it by the following example.

Example 1. Suppose that after applying the absorption laws we receive the discernibility function below:

    (a ∨ b) ∧ (a ∨ c) ∧ (b ∨ d) ∧ (d ∨ e)    (*)

We can convert it to the DNF form in the following sequential steps:
1. (a ∨ ac ∨ ab ∨ bc) ∧ (b ∨ d) ∧ (d ∨ e) = (a ∨ bc) ∧ (b ∨ d) ∧ (d ∨ e)
2. (ab ∨ ad ∨ bcd ∨ bc) ∧ (d ∨ e) = (ab ∨ ad ∨ bc) ∧ (d ∨ e)
3. (abd ∨ abe ∨ ad ∨ ade ∨ bcd ∨ bce) = (ad ∨ abe ∨ bcd ∨ bce)

Instead of processing (*) sequentially, let us split it into two parts:
1. (a ∨ b) ∧ (a ∨ c)
2. (b ∨ d) ∧ (d ∨ e)

The tasks (1) and (2) can be continued in two separate processing nodes, which leads to the forms:
1. (a ∨ ac ∨ ab ∨ bc) = (a ∨ bc)
2. (bd ∨ be ∨ d ∨ de) = (be ∨ d)
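Merging two such partial results is a pairwise union of their clauses followed by absorption. A sketch in Python (our own encoding: a DNF is represented as a set of attribute sets):

    def absorb(family):
        # keep only the minimal sets (absorption laws)
        family = set(family)
        return {s for s in family if not any(t < s for t in family)}

    def merge(p1, p2):
        # "multiplication of clauses": every clause of p1 with every clause of p2
        return absorb(c1 | c2 for c1 in p1 for c2 in p2)

    p1 = {frozenset('a'), frozenset('bc')}  # (a v bc)
    p2 = {frozenset('be'), frozenset('d')}  # (be v d)
    print(merge(p1, p2))                    # {ad, abe, bcd, bce} -- the final result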


[Figure omitted: timeline of three nodes over the phases of computing the IND matrix, computing reducts and merging reducts.]

Fig. 2. The parallel processing with 3 nodes

Having the partial results from the nodes (1) and (2), we merge them: (a ∨ bc) ∧ (be ∨ d). So we receive the final result: (abe ∨ ad ∨ bce ∨ bcd).

In Fig. 2 we present the general idea of processing the algorithm in a parallel way, as sketched above. As one can see, in this approach we can split the calculations among as many nodes as there are pairs of clauses in the discernibility function (obviously, we can split the task onto a smaller number of nodes as well). There is, though, a final part of the algorithm, which is devoted to merging the partial results coming from the nodes. This process is performed sequentially and its efficiency depends on the number of processing nodes. Obviously, we should avoid the cases when the cost of merging is higher than the savings from parallel processing. We discuss the issue in the next paragraph.

Merging of partial results. The process of merging the partial results is time consuming. It is equivalent to the process of finding the Cartesian product of n sets, so the time requirement for this process depends on the number of the partial results, i.e. O(∏|m_i|), i = 1, 2, ..., n, where |m_i| is the number of elements in the i-th partial result. There is, though, a way to perform also this process in a parallel way, as prototyped in the listing below. Let us consider the case when we have two partial results to merge, p1 and p2. We split p1 into a few separate subsets, so p1 = ⋃_i p1_i. Thus p1 ∧ p2 = ⋃_i (p1_i ∧ p2), and each component p1_i ∧ p2 can be processed in a separate processing node. The process of summing the partial conjunction results consists in removing duplicates and super-reducts from the final result set. The more components p1_i we have in p1, the more processors we can use.

Optimal use of the processors. In Fig. 3 we present an example of using 5 processors for computing reducts by splitting the prime implicant. We distinguish here four phases. The first one is for computing the IND matrix and the prime implicant (marked by very light grey), then the conversion from prime implicant to DNF starts (light grey) on five nodes.
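The parallel merging of p1_i ∧ p2 described above can be prototyped with a standard process pool. This is a rough sketch of ours (it reuses the merge and absorb helpers from the previous listing, which must be defined at module top level; on some platforms the call must run under an if __name__ == '__main__': guard):

    from concurrent.futures import ProcessPoolExecutor

    def merge_bundles(p1, p2, workers=4):
        # split p1 into `workers` bundles and merge each bundle with p2 in parallel
        bundles = [set(list(p1)[i::workers]) for i in range(workers)]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            parts = pool.map(merge, bundles, [p2] * workers)
        # summing the partial results: remove duplicates and super-reducts
        return absorb(clause for part in parts for clause in part)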


[Figure omitted: Gantt-style timeline for nodes 1-5 and the central node, showing the phases of computing the IND matrix, computing reducts, merging reducts, removing duplicates and super-reducts, and inactive time.]

Fig. 3. Sample usage of processors for 5 nodes

[Figure omitted: the same timeline with the merging decomposed into bundles processed asynchronously.]

Fig. 4. Merging by bundles

When we have two conversions completed, the merging can start on the free nodes (dark grey). When all the partial reduct results are provided, the final process of removing duplicates is performed sequentially (black). This solution is not optimal with respect to the use of processors. There are a lot of periods when some nodes of the algorithm have to wait, even if all nodes have the same speed. The problem gets worse if the nodes differ in speed. To solve this problem we propose, in every merging of partial results P1 and P2, to split P1 into more parts than there are available free processors. Thus, we decompose merging into many independent bundles. Each bundle can be processed asynchronously. Each processor processes as many bundles as it can.


In this case, the maximal time of waiting in every partial merging is the time of processing one bundle in the slowest node. Let us consider this proposal in more detail (Fig. 4). In this case the node N3 does not have to wait for N2, but it helps nodes N4 and N5 by merging bundles from P4 and P5. This task can be finished faster than in the previous example. After computing its DNF part, N2 takes P2 and P3 from the queue and starts computing the set ∧(P2, P3). After computing ∧(P4, P5), the nodes N3, N4, N5 join N2. Having finished P1, the node N1 takes the next task from the queue (∧(P1, P4, P5)). Having finished processing ∧(P2, P3), the remaining free nodes join the computations of ∧(P1, P4, P5). The last task is to compute ∧(P1, P2, P3, P4, P5) by all the nodes.

3.3 Splitting Set of Candidate Reducts - Vertical Decomposition

Now we present the third way of decomposing the calculation of reducts, the vertical one. The main idea is that during the conversion of CNF to DNF we split the formula into two parts across a (disjunctive) component. The idea of this decomposition was originally presented in [7]. Here we make a slight modification of this method. Let us go back again to the conversion process from CNF to DNF. Sequentially, the process can be performed as below:

1. (a ∨ b) ∧ (a ∨ c) ∧ (b ∨ d) ∧ (d ∨ e)
2. (a ∨ ac ∨ ab ∨ bc) ∧ (b ∨ d) ∧ (d ∨ e) = (a ∨ bc) ∧ (b ∨ d) ∧ (d ∨ e)
3. (ab ∨ ad ∨ bcd ∨ bc) ∧ (d ∨ e) = (ab ∨ ad ∨ bc) ∧ (d ∨ e)
4. (abd ∨ abe ∨ ad ∨ ade ∨ bcd ∨ bce) = (ad ∨ abe ∨ bcd ∨ bce)

The clauses a and bc in step 2 relate to the "candidate reducts". Let us make the decomposition after the second step¹, and perform the process in two nodes:

Table 4. Decomposition of computation after the second step

Node 1                                      Node 2
(a) ∧ (b ∨ d) ∧ (d ∨ e)                     (bc) ∧ (b ∨ d) ∧ (d ∨ e)
(ab ∨ ad) ∧ (d ∨ e)                         (bc ∨ bcd) ∧ (d ∨ e) = (bc) ∧ (d ∨ e)
(abd ∨ abe ∨ ad ∨ ade) = (ad ∨ abe)         (bcd ∨ bce)
              joined result: (ad ∨ abe ∨ bcd ∨ bce)
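In code, the vertical split of Table 4 amounts to partitioning the current candidate set and pushing each part through the remaining clauses independently; joining is a plain union plus absorption. A sketch (reusing the absorb helper from the earlier listing):

    def expand(candidates, clauses):
        # push a set of candidate reducts through the remaining CNF clauses
        for T in clauses:
            candidates = absorb({c if c & T else c | {a}
                                 for c in candidates for a in T})
        return candidates

    rest = [frozenset('bd'), frozenset('de')]   # (b v d) and (d v e)
    node1 = expand({frozenset('a')}, rest)      # {ad, abe}
    node2 = expand({frozenset('bc')}, rest)     # {bcd, bce}
    print(absorb(node1 | node2))                # {ad, abe, bcd, bce}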

The advantage of this decomposition is the easiness of joining partial results: one should only add the sets of reducts and remove super-reducts. This method reduces the time of processing and the space needed for storing candidate reducts. If we have one processor without enough memory for the candidate reducts, we can decompose the process into two parts. The first part can be continued, whereas the second one can wait frozen and restart after the first one has finished. This is more effective than using virtual memory, because the algorithm can

¹ It could have been done also after the 1st step, as well as after the 3rd step.


decide what should be frozen and what is executed. The disadvantage is that the decomposition is done in a late phase of the algorithm. This causes that the time saved by the decomposition can be inessential. Another disadvantage is that the algorithm depends on too many parameters. In particular, one has to choose the right moment to split the formula. In our experiments we have used the following rules:
1. do not split before doing 10% of the conversion steps;
2. the last split must be done before 60% of the conversion;
3. make a split if the number of candidates is greater than u (u is a parameter).

The main difference between our proposal and the one presented in [7] is in the splitting of the candidate sets. In [7] it is proposed to split the set of candidates onto n processors once the number of "candidate reducts" is higher than the branching factor [7]. The disadvantage of this approach is that we do not know the number of candidate reducts before completing the computations, so it is hard to estimate the optimal value of the branching factor.

4 Experiments and Results

There are some measures in the literature for distributed algorithms. In our experiments we used two indicators: speedup and efficiency. Following [8], we define speedup as S_p = T_1/T_p, and efficiency as E_p = S_p/p, where T_1 is the time of execution of the algorithm on one processor, T_p is the time needed by p processors, and p is the number of processors. We have tested all the presented algorithms. For the experiments we used three base data sets: (a) 4000 records and 23 condition attributes; (b) 5000 records and 20 condition attributes; and (c) 20000 records and 19 condition attributes. The sets (a) and (b) were randomly generated. The set (c) is based on the set "Letter recognition" from [9]. To the original set we added three additional columns, each being a combination of selected columns from the original set (so that more reducts should appear in the results). For each of the databases we prepared a number of data sets: 5 sets for (a), 6 sets for (b) and 11 sets for (c). Every set of data was prepared by a random selection of objects from the base sets. For each series of data sets we performed one experiment for the sequential algorithm and, additionally, three experiments, one for each way of decomposition. Below we present the results of the experiments. Tables 5-7 contain the execution times for the sequential version of the algorithm for each of the three testing data sets respectively. In these tables column 2 shows the total execution time, and column 3 shows the execution time of computing the IND matrix and the reduced discernibility function. It is not possible to split the times for processing the IND matrix and the discernibility function without a loss of efficiency.


Let us note that the computing of the IND matrix and discernibility function for the first case (Table 5) takes less than 1% of the total processing time. In the 2nd case (Table 6) the computing of IND is about 50% of the total time of processing. The number of clauses in the prime implicant is smaller for this data set. In the 3rd case (Table 7), the computing of IND takes more than 99% of the total computing time. Let us note that only for this case the decomposition of DT can be justified. Now we present Tables 8-10. In each table the results of the 3 distributed algorithms are presented for each data set respectively. From Table 8 we can see that for the data sets where the discernibility function is long and we expect many results, it is better to use the vertical decomposition. The vertical decomposition has two advantages: (a) we decompose the phase that

Table 5. Time of computing for the sequential method, data set 1

Size (records)  Total time (ms)  IND matrix time (ms)  IND matrix size  Reducts number
2000            2483422          29344                 513              5131
2500            2144390          41766                 475              4445
3000            2587125          60766                 555              5142
3500            3137750          80532                 532              4810
4000            191390           100266                116              1083

Table 6. Time of computing for the sequential method, data set 2

Size (records)  Total time (ms)  IND matrix time (ms)  IND matrix size  Reducts number
2500            70735            31457                 77               202
3000            68140            46016                 41               107
3500            79234            61078                 33               72
4000            99500            77407                 37               109
4500            127015           100235                42               127
5000            151235           120094                46               131

Table 7. Time of computing for the sequential method, data set 3

Size (records)  Total time (ms)  IND matrix time (ms)  IND matrix size  Reducts number
11000           668063           650641                16               5
12000           798906           780391                16               5
13000           936375           916641                16               5
14000           1086360          1065375               16               5
15000           1245188          1223016               16               5
16000           1413032          1389782               16               5
17000           1597250          1572843               16               5
18000           1787578          1762015               16               5
19000           1993640          1966718               16               5
20000           2266016          2238078               16               5


Table 8. Parallel methods for data set 1

                  DT            DISC FUNCTION               CANDIDATE REDUCTS
Size (records)    S3     E3     S2     E2     S3     E3     S2     E2     S3     E3
2000              0.41   0.14   3.01   1.50   2.53   0.84   4.33   2.16   7.66   2.55
2500              0.75   0.25   3.03   1.51   2.46   0.82   3.91   1.95   6.64   2.21
3000              0.53   0.18   2.11   1.05   2.11   0.70   3.34   1.67   6.08   2.02
3500              0.70   0.24   3.60   1.80   2.80   0.93   3.55   1.77   6.90   2.30
4000              0.46   0.15   1.57   0.78   1.57   0.52   0.64   0.32   1.00   0.33

Table 9. Parallel methods for data set 2

                  DT            DISC FUNCTION               CANDIDATE REDUCTS
Size (records)    S3     E3     S2     E2     S3     E3     S2     E2     S3     E3
2500              0.72   0.24   1.86   0.93   1.91   0.64   0.79   0.39   0.88   0.29
3000              0.69   0.23   1.36   0.68   1.37   0.46   0.84   0.42   0.84   0.28
3500              0.54   0.18   1.18   0.59   1.22   0.40   0.93   0.47   0.71   0.23
4000              0.73   0.24   1.19   0.60   1.16   0.39   0.96   0.48   0.97   0.32
4500              1.02   0.34   1.20   0.60   1.08   0.36   0.84   0.42   0.90   0.30
5000              0.99   0.33   1.18   0.59   1.10   0.37   0.98   0.49   1.06   0.35

Table 10. Parallel methods for data set 3

                  DT            DISC FUNCTION               CANDIDATE REDUCTS
Size (records)    S3     E3     S2     E2     S3     E3     S2     E2     S3     E3
11000             1.57   0.52   0.99   0.49   1.00   0.33   0.99   0.49   1.00   0.33
12000             1.55   0.52   1.00   0.50   1.00   0.33   1.00   0.50   1.00   0.33
13000             1.58   0.53   1.00   0.50   1.00   0.33   1.00   0.50   1.00   0.33
14000             1.59   0.53   1.00   0.50   0.99   0.33   1.00   0.50   0.99   0.33
15000             1.59   0.53   0.99   0.49   0.99   0.33   0.99   0.49   0.99   0.33
16000             1.60   0.53   1.00   0.50   1.00   0.33   1.00   0.50   1.00   0.33
17000             1.61   0.54   1.00   0.50   1.00   0.33   1.00   0.50   1.00   0.33
18000             1.63   0.54   1.00   0.50   0.99   0.33   1.00   0.50   0.99   0.33
19000             1.64   0.54   1.00   0.50   1.00   0.33   1.00   0.50   1.00   0.33
20000             1.73   0.58   1.00   0.50   0.50   0.33   1.00   0.50   1.00   0.33

takes the majority of the time; and (b) joining partial results is less time-consuming than merging. For the methods with horizontal decomposition, the time of computing depends on the time of merging partial results. By adding another processor we do not necessarily get better results: although the conversion to DNF is faster, the merging of three sets is more complicated. In the second case (Table 9) only the method with discernibility function decomposition gives good results. Splitting candidate reducts was not effective, because the conversion from CNF to DNF takes less than 50% of the total processing


time, so the decomposition was made too late. Also splitting DT was not effective, as this method may cause redundancy in the partial results. The best method here is splitting the discernibility function. It may also cause redundancy in the partial results, but much less than the DT decomposition. In Table 10 we have an unusual case, because of the big number of objects and the small number of attributes. The processing of IND takes more than 99% of the total time, so we can expect that only the decomposition of DT can give us satisfactory results.

5 Conclusions and Future Work

We have investigated the possibilities of decomposing the process of computing reducts. Three points where the decomposition is feasible have been identified. Based on this, three algorithms for parallel computing of reducts have been presented and tested. The performed experiments have shown that each of the algorithms has its own specific kind of data sets for which it is the best. It is therefore an important task to identify, at the beginning of the computations, which way of parallelizing the reduct computations is the most appropriate. We also expect that for some kinds of data combining the three methods can bring positive results. Special heuristics have to be prepared in order to decide (perhaps dynamically, during the computations) when and how to split the computations. This is the subject of our future research.

References
1. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer, Dordrecht (1991)
2. Bazan, J., Nguyen, H.S., Nguyen, S.H., Synak, P., Wróblewski, J.: Rough set algorithms in classification problem. In: Polkowski, L., Tsumoto, S., Lin, T. (eds.) Rough Set Methods and Applications, pp. 49–88. Springer, Physica-Verlag, Heidelberg (2000)
3. Wróblewski, J.: A parallel algorithm for knowledge discovery system. In: PARELEC 1998, pp. 228–230. The Press Syndicate of the Technical University of Bialystok (1998)
4. Wróblewski, J.: Adaptacyjne Metody Klasyfikacji Obiektów. Ph.D. thesis, Uniwersytet Warszawski, Wydział Matematyki, Informatyki i Mechaniki (2001)
5. Bakar, A.A., Sulaiman, M., Othman, M., Selamat, M.: Finding minimal reduct with binary integer programming in data mining. In: Proc. of the IEEE TENCON 2000, vol. 3, pp. 141–146 (2000)
6. Skowron, A., Rauszer, C.: The discernibility matrices and functions in information systems. In: Slowinski, R. (ed.) Decision Support: Handbook of Applications and Advances of Rough Sets Theory, pp. 331–362. Kluwer, Dordrecht (1992)
7. Susmaga, R.: Parallel computation of reducts. In: Polkowski, L., Skowron, A. (eds.) Rough Sets and Current Trends in Computing, pp. 450–457. Springer, Heidelberg (1998)
8. Karbowski, A., Niewiadomska-Szymkiewicz, E. (eds.): Obliczenia równoległe i rozproszone. Oficyna Wydawnicza Politechniki Warszawskiej (in Polish) (2001)
9. Asuncion, A., Newman, D.: UCI machine learning repository (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html

From Information System to Decision Support System

Alicja Wakulicz-Deja and Agnieszka Nowak

Institute of Computer Science, University of Silesia
Będzińska 39, 41–200 Sosnowiec, Poland
{wakulicz,nowak}@us.edu.pl

Abstract. In the paper we present the definition of Pawlak's model of an information system. The model covers information systems with history, systems with the decomposition of objects or attributes, and dynamic information systems. Information systems are closely related to rough set theory and decision support systems. The aim of the paper is to characterize the research, stimulated by Professor Pawlak, of the group at the Silesian University on information retrieval based on different information systems and on decision support based on rough sets, and to outline the current research projects of this group on modern decision systems.

Keywords: information system, decision support system, rough set theory, clustering methods.

1 Introduction

Information systems and decision support systems are strongly related. The paper shows that we can treat a decision system as an information system of some objects for which we have information about their classification. Recently, not much attention has been paid in the literature to the classification of information systems. We deal with the problem of classification based on changes of information systems in time, which leads in a natural way to the concept of dynamic systems. Data analysis in a given information system is possible thanks to defining: the decomposition of the system (done on the set of attributes or objects); dependent and independent attributes in the data (to remove the attributes that are dependent); whether the attributes or even objects are equivalent; and the comparison of objects, attributes and even whole systems. The paper also shows that the model of an information system created by Professor Pawlak is very useful for retrieving information. One of the different methods of retrieving information, the so-called atomic components method, was proposed by Professor Pawlak, and it is presented in the paper with all its basic assumptions. The relation of information systems and rough set theory with decision support systems, where research is concerned with the classificatory analysis of imprecise, uncertain or incomplete information or knowledge expressed in terms of data acquired from experience, is also presented in the paper. The paper also considers the methods of reduction of the set of attributes and the rule induction methods that have been


applied to knowledge discovery in databases; the empirical results obtained show that they are very powerful and that some important knowledge can be extracted from databases. Because of that, the paper presents the results of the stages of different research efforts already carried out (e.g. a diagnosis support system used in child neurology, a notable example of a complex multistage diagnosis process) and the research planned at the Silesian University. It is supposed to explain Professor Pawlak's invaluable contribution to the domain of information and decision support systems. The notion of an information system, formulated by Professor Pawlak and developed with his co-workers, is now a well-developed branch of data analysis formalisms. It is strongly related to (but different from) the relational database theory on the one hand and to fuzzy set theory on the other. In this paper we consider the connection of the theory of information and information retrieval systems with rough set theory and decision support systems. It is obvious that the model of a system created by Professor Pawlak makes data description and analysis simple and very reliable.

2 Information System

An information system consists of a set of objects and attributes defined on this set. In information systems with a finite number of attributes, there are classes created by these attributes (for each class, the values of the attributes are constant on elements from the class). Any collection of data, specified as a structure S = ⟨X, A, V, q⟩ such that X is a non-empty set of objects, A is a non-empty set of attributes, V is a non-empty set of attribute values, V = ⋃_{a∈A} V_a, and q is an information function q : X × A → V, is referred to as an information system. The set {q(x, a) : a ∈ A} is called the information about the object x or, in short, a record of x or the row determined by x. Each attribute a is viewed as a mapping a : X → V_a which assigns a value a(x) ∈ V_a to every object x. A pair (a, v), where a ∈ A and v ∈ V_a, is called a descriptor. In information systems, the descriptor language is a formal language commonly used to express and describe properties of objects and concepts. More formally, an information system is a pair A = (U, A), where U is a non-empty finite set of objects called the universe and A is a non-empty finite set of attributes such that a : U → V_a for every a ∈ A. The set V_a is called the value set of a. Now we will discuss which sets of objects can be expressed (defined) by formulas constructed using attributes and their values. The simplest formulas, called descriptors, have the form (a, v), where a ∈ A and v ∈ V_a. In each information system S the information language L_S = ⟨AL, G⟩ is defined, where AL is the alphabet and G is the grammar part of that language.
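For illustration, the structure S = ⟨X, A, V, q⟩ can be encoded directly, with the information function q stored as a dictionary. This is merely our own sketch of the definition, not part of the original formalism:

    X = ['x1', 'x2', 'x3']
    A = ['a', 'b']
    q = {('x1', 'a'): 'a1', ('x1', 'b'): 'b1',
         ('x2', 'a'): 'a1', ('x2', 'b'): 'b2',
         ('x3', 'a'): 'a2', ('x3', 'b'): 'b1'}
    V = {q[x, a] for x in X for a in A}  # V is the union of all attribute values

    def information(x):
        # the record of x: {q(x, a) : a in A}
        return {a: q[x, a] for a in A}

    print(information('x1'))  # {'a': 'a1', 'b': 'b1'}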


AL is simply the set of all symbols which can be used to describe the information in such a system, e.g.:
1. {0, 1} (constant symbols),
2. A, the set of all attributes,
3. V, the set of all values of the attributes,
4. symbols of logical operations like ˜, + and ∗,
5. and naturally brackets, which are required to represent more complex information.

G, the grammar part of the language L_S, defines the syntax, with T_S as the set of all possible forms of terms (a term is a unit of information in S) and its meaning (semantics). A simple descriptor (a, v) ∈ T_S (a ∈ A, v ∈ V_a). If we denote such a descriptor (a, v) as the term t, then the following term formations are also possible: ¬t, t + t′, t ∗ t′, where t, t′ ∈ T_S. The meaning is defined as a function σ which maps the set of terms in a system S into the set of subsets of X, σ : T_S → P(X), where P(X) is the set of the subsets of X. The value of σ for a given descriptor (a, v) is defined as follows [49]:
1. σ(a, v) = {x ∈ X : q_x(a) = v},
2. σ(¬t) = X \ σ(t),
3. σ(t + t′) = σ(t) ∪ σ(t′),
4. σ(t ∗ t′) = σ(t) ∩ σ(t′).

2.1 Information Table

Information systems are often represented in the form of tables, with the first column containing objects and the remaining columns, separated by vertical lines, containing values of attributes. Such tables are called information tables (an example is presented in Table 1). The definition of this system is as follows: S = ⟨X, A, V, q⟩, where X = {x1, ..., x8}, A = {a, b, c}, V = Va ∪ Vb ∪ Vc, Va = {a1, a2}, Vb = {b1, b2}, Vc = {c1, c2, c3, c4} and q : X × A → V. For instance, q(x1, a) = a1 and q(x3, b) = b1.

Table 1. An information system - an information table

student  a   b   c
x1       a1  b1  c1
x2       a1  b1  c2
x3       a2  b1  c3
x4       a2  b1  c4
x5       a1  b2  c1
x6       a1  b2  c2
x7       a2  b2  c3
x8       a2  b2  c4
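The semantics σ defined above can be run directly against Table 1. A short Python sketch (our own encoding of the table; + is set union, ∗ intersection, ¬ complement):

    X = {f'x{i}' for i in range(1, 9)}
    row = {'x1': ('a1', 'b1', 'c1'), 'x2': ('a1', 'b1', 'c2'),
           'x3': ('a2', 'b1', 'c3'), 'x4': ('a2', 'b1', 'c4'),
           'x5': ('a1', 'b2', 'c1'), 'x6': ('a1', 'b2', 'c2'),
           'x7': ('a2', 'b2', 'c3'), 'x8': ('a2', 'b2', 'c4')}
    idx = {'a': 0, 'b': 1, 'c': 2}

    def sigma(a, v):
        # sigma(a, v) = {x in X : q_x(a) = v}
        return {x for x in X if row[x][idx[a]] == v}

    # sigma((a, a1) * ~(c, c1)) = sigma(a, a1) & (X - sigma(c, c1)):
    print(sigma('a', 'a1') & (X - sigma('c', 'c1')))  # {'x2', 'x6'}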


Before we start considering the properties of an information system, it is necessary to explain what information in such a system means. The information in the system S is a function ρ with arguments in the attribute set A and values in the set V (ρ(a) ∈ V_a). As long as the sets of objects, attributes and their values are finite, we know exactly how many (different) pieces of information a given system S comprises, and the number is equal to ∏_{a∈A} card(V_a). The information ρ determines a set of objects X_ρ such that X_ρ = {x ∈ X : q_x = ρ}. We call them indiscernible, because they have the same description. If we assume that B ⊆ A, then each subset B of A determines a binary relation IND_A(B), called an indiscernibility relation. By the indiscernibility relation determined by B, denoted by IND_A(B), we understand the equivalence relation

IND_A(B) = {⟨x, x′⟩ ∈ X × X : ∀a ∈ B [a(x) = a(x′)]}.

For a given information system it is possible to define the comparison of objects, attributes and even whole systems. We can find dependent and independent attributes in the data, and we can check whether attributes or even objects are equivalent. An important issue in data analysis is to discover dependencies between attributes. Intuitively, a set of attributes D depends totally on a set of attributes C if the values of the attributes from C uniquely determine the values of the attributes from D. If D depends totally on C, then IND_A(C) ⊆ IND_A(D). This means that the partition generated by C is finer than the partition generated by D. Assume that a and b are attributes from the set A in a system S. We say that b depends on a (a → b) if the indiscernibility relation determined by a is contained in the indiscernibility relation determined by b: IND(a) ⊆ IND(b). If IND(a) = IND(b), then the attributes are equivalent. The attributes are dependent if either of the conditions IND(a) ⊆ IND(b) or IND(b) ⊆ IND(a) is satisfied. Two objects x, y ∈ X are indiscernible in a system S relative to the attribute a ∈ A (x e_a y) if and only if q_x(a) = q_y(a). In the presented example, the objects x1 and x2 are indiscernible relative to the attributes a and b. The objects x, y ∈ X are indiscernible in a system S relative to all of the attributes a ∈ A (x e_S y) if and only if q_x = q_y. In the example there are no indiscernible objects in the system S. Each information system determines unequivocally a partition of the set of objects, which is some kind of classification. Finding the dependencies between attributes lets us reduce the amount of information, which is crucial in systems with a huge number of attributes. Defining a system as a set of objects, attributes and their values is necessary to define the algorithms for searching the system and updating the data it contains. Moreover, all information retrieval systems are required to be implemented in this way. The ability to discern between perceived objects is also important for constructing various entities, not only reducts, but also decision rules and decision algorithms.
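These notions translate directly into code. The sketch below (our own, reusing X, row and idx from the previous listing) computes the partition generated by IND(B) and tests the dependency a → b on Table 1:

    from collections import defaultdict

    def ind_classes(B):
        # the partition of X generated by IND(B)
        classes = defaultdict(set)
        for x in X:
            classes[tuple(row[x][idx[a]] for a in B)].add(x)
        return list(classes.values())

    def depends(a, b):
        # a -> b iff every IND(a)-class is contained in some IND(b)-class
        return all(any(c <= d for d in ind_classes([b])) for c in ind_classes([a]))

    print(ind_classes(['a']))  # [{x1, x2, x5, x6}, {x3, x4, x7, x8}]
    print(depends('c', 'a'))   # True: the value of c determines the value of a
    print(depends('a', 'c'))   # False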

2.2 An Application in Information Retrieval Area

The information retrieval issue is the main area of employment of information systems. An information retrieval system, in which the objects are described by


their features (properties), can be defined as follows. Let us have a set of objects X and a set of attributes A. These objects can be books, magazines, people, etc. The attributes are used to define the properties of the objects. For a system of books, the attributes can be author, year, number of sheets. An information system which is used for information retrieval should allow finding the answer to a query. There are different methods of retrieving information. Professor Pawlak proposed the atomic components method [2,49]. Its mathematical foundation was defined in [5] and [6]. This method is based on the assumption that each question can be presented in the normal form, which is the sum of products with exactly one descriptor of each attribute. To make the system capable of retrieving information it is required to create an information language (query language). This language should permit describing objects and forming users' queries. Naturally enough, such a language has to be universal for both the natural and the system language. Owing to this, all steps are done on the language level rather than on the database level. The advantages of information languages are not limited to the aforementioned features. There are a lot of systems that need to divide the information, which is called the decomposition of the system. It allows improving the time efficiency and makes the updating process easy, but also enables the organization of the information in the systems. Information systems allow collecting data over a long term. It means that some information changes in time, and because of that, the system has a special property, which is called the dynamics of the system. Matching unstructured, natural-language queries and documents is difficult because both queries and documents (objects) must be represented in a suitable way. Most often, it is a set of terms, where a term is a unit of a semantic expression, e.g. a word or a phrase. Before a retrieval process can start, sentences are preprocessed with stemming and removing too frequent words (stopwords). The computational complexity increases when we move from simpler systems to more compound ones. For example, for the atomic component retrieval method, the problem of the rapidly growing number of atomic components is very important. Assuming that A is a set of attributes and V_a is the set of values of attribute a, a ∈ A, in a given system we obtain ∏_{a∈A} card(V_a) elements to remember. For example, if we have 10 attributes in a given system S, and each attribute has 10 values, we have to remember 10^10 elements.

2.3 System with Decomposition

When a system consists of a huge set of data, it is very difficult to analyse those data in a given time. Instead, it is better to analyze smaller pieces (subsets) of the data and, at the end of the analysis, connect them into one major system. There are two main methods of decomposition: by attributes or by objects. A lot of systems are implemented with this type of decomposition.

System with objects' decomposition. If it is possible to decompose the system S = ⟨X, A, V, q⟩ in such a way that we gain subsystems with a smaller number of objects, it means that:


S = ⋃_{i=1}^{n} S_i,

where S_i = ⟨X_i, A, V, q_i⟩, X_i ⊆ X, ⋃_i X_i = X, q_i : X_i × A → V, and q_i = q|_{X_i × A}.

System with attributes’s decomposition. When in system S there are often the same types of queries, about the same group of attributes, it means that such system should be divided to subsystems Si in a way that: S=

Si ,

i

where Si = X, Ai , Vi , qi , Ai ⊆ A and i Ai = A, Vi ⊆ V , qi : X × Ai → Vi , qi = q|X×Ai . Decomposition lets for optimization of the retrieval information process in the system S. The choice between those two kind of decomposition depends only on the type and main goal of such system. 2.4
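Both decompositions amount to slicing the information function q. A minimal sketch (our own, with row and idx as in the earlier Table 1 listing, mapping each object to its tuple of attribute values):

    def split_by_objects(parts):
        # objects' decomposition: S_i = <X_i, A, V, q restricted to X_i x A>
        return [{x: row[x] for x in part} for part in parts]

    def split_by_attributes(parts):
        # attributes' decomposition: S_i = <X, A_i, V_i, q restricted to X x A_i>
        return [{x: tuple(row[x][idx[a]] for a in part) for x in row}
                for part in parts]

    s1, s2 = split_by_objects([{'x1', 'x2', 'x3', 'x4'}, {'x5', 'x6', 'x7', 'x8'}])
    t1, t2 = split_by_attributes([['a', 'b'], ['c']])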

2.4 Dynamic Information System and System with the History

In the literature, information systems are classified according to their purposes: documentational, medical or management information systems. We propose a different classification: one with respect to the dynamics of systems. Such a classification gives the possibility to:
1. perform a joint analysis of systems belonging to the same class,
2. distinguish basic mechanisms occurring in each class of systems,
3. unify design techniques for all systems of a given class,
4. simplify the teaching of system operation and system design principles.

Analysing the performance of information systems, it is easy to see that the data stored in those systems are subject to changes. Those changes occur at definite moments of time. For example, in a system which contains personal data (age, address, education), the values of these attributes may be changed. Thus time is a parameter determining the state of the system, although it does not appear in the system in an explicit way. There are systems in which data do not change in time, at least during a given period of time. But there are also systems in which changes occur permanently, in a determined or quite accidental way. In order to describe the classification which we are going to propose, we introduce the notion of a dynamic information system, being an extension of the notion of an information system presented by Professor Pawlak.

Definition 1. A dynamic information system is a family of ordered quadruples

S = {⟨X_t, A_t, V_t, q_t⟩}_{t∈T}     (1)

where:
– T is the discrete set of time moments, denoted by numbers 0, 1, ..., N,
– X_t is the set of objects at the moment t ∈ T,
– A_t is the set of attributes at the moment t ∈ T,
– V_t(a) is the set of values of the attribute a ∈ A_t,
– V_t := ⋃_{a∈A_t} V_t(a) is the set of attribute values at the moment t ∈ T,
– q_t is a function which assigns to each pair ⟨x, a⟩, x ∈ X_t, a ∈ A_t, an element of the set V_t, i.e. q_t : X_t × A_t → V_t.

An ordered pair ⟨a, v⟩, a ∈ At, v ∈ Vt(a), is called a descriptor of the attribute a. We denote by qt,x the map defined as follows:

qt,x : At → Vt,  (2)

qt,x(a) := qt(x, a) for a ∈ At, x ∈ Xt, t ∈ T.  (3)

Let Inf(S) = {Vt^At}t∈T be the set of all functions from At to Vt, for all t ∈ T. Functions belonging to Inf(S) will be called informations at instant t; similarly, the functions qt,x will be called the information about object x at instant t in the information system S. Therefore, the information about an object x at instant t is nothing else but a description of object x, at instant t, obtained by means of descriptors. We will examine more closely the changes which particular elements (X, A, V, q) of a dynamic system may undergo at certain time moments (see also [46,47]). Systems whose parameters do not depend on time at all are discussed in [7]. Here we deal with dynamic systems in which the descriptions of objects depend essentially on time. It is useful to observe at the beginning that any dynamic system belongs to one of two classes of systems: time-invariant and time-varying systems.

Definition 2. A time-invariant system is a dynamic system such that:

1. ZT := ⋂t∈T Dqt ≠ ∅, and
2. ∀ t,t′ ∈ T, ∀ (x,a) ∈ ZT : qt(x, a) = qt′(x, a),

where Dqt denotes the domain of the function qt.

Definition 3. A time-varying system is a dynamic system such that:

1. ZT := ⋂t∈T Dqt = ∅, or
2. ZT ≠ ∅ and ∃ t,t′ ∈ T, ∃ (x,a) ∈ ZT : qt(x, a) ≠ qt′(x, a).
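The condition of Definition 2 can be checked mechanically. The following Python sketch is our own illustration, under the assumption that a dynamic system is represented as a mapping from time moments to the functions qt:

```python
# system: dict {t: {(x, a): value}}; Z_T is the common domain of all q_t.
def is_time_invariant(system):
    """Returns True iff Z_T is non-empty and all q_t agree on Z_T (Def. 2)."""
    domains = [set(q_t) for q_t in system.values()]
    z_T = set.intersection(*domains) if domains else set()
    if not z_T:
        return False
    return all(len({q_t[pair] for q_t in system.values()}) == 1
               for pair in z_T)

library = {1980: {("b1", "Year"): 1977, ("b2", "Year"): 1972},
           1981: {("b1", "Year"): 1977, ("b2", "Year"): 1972,
                  ("b4", "Year"): 1971}}
assert is_time_invariant(library)   # every q_t agrees on the common pairs
```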

2.5 Time-Invariant Systems

Let XT, AT be the sets of objects and attributes of the dynamic system, defined as follows:

– XT := ⋂t∈T Xt,
– AT := ⋂t∈T At.


It is evident from the definition of the time-invariant system that a dynamic system is time-invariant if and only if

q := qt|ZT  (4)

does not depend on t and ZT = XT × AT. It means that any time-invariant system S = {⟨Xt, At, Vt, qt⟩}t∈T has a subsystem S′, in Pawlak's notion, which is time-independent:

S′ = ⟨XT, AT, q(ZT), q⟩.

Let us consider a library information system in which books are objects; the set of attributes is given by author's name, title, publisher's name, year of issue, subject, etc., and attribute values are given in natural language [48,49]. Let us consider the time evolution of this system on the example given by its subsystem connected with four books:

– b1 = C.J. Date, An Introduction to Database Systems,
– b2 = G.T. Lancaster, Programming in COBOL,
– b3 = Ch.T. Meadow, The Analysis of Information Systems,
– b4 = G. Salton, The SMART Retrieval System,

and four attributes: publisher, year of issue, number of pages, subject. The history of our library in the years 1980, 1981, 1982 described by our subsystem depends on two events. Beginning from 1981 our library information was enriched with the information about the subject of a book, and the book b4 was bought; in 1982 the book b3 was lost. This situation is given by the dynamic system S = {⟨Xt, At, Vt, qt⟩}t=1980,1981,1982 described in Tables 2, 3, and 4. Table 5 presents the time-invariant subsystem S′ = {⟨Xt, At, Vt, q⟩}. It is easy to see that in the dynamic system described above XT = {b1, b2}, AT = {Publisher, Year, Pages}, and VT is given below, which proves that q|XT×AT is time-independent, i.e., the system described in the example is time-invariant.

Table 2. S = {⟨Xt, At, Vt, qt⟩}t=1980

X1980\A1980  Publisher                                 Year  Pages
b1           Addison-Wesley Publish. Comp. Inc., USA   1977  493
b2           Pergamon Press, Oxford, New York          1972  180
b3           John Wiley & Sons Inc., New York          1967  339

Table 3. S = {⟨Xt, At, Vt, qt⟩}t=1981

X1981\A1981  Publisher                                  Year  Pages  Subject
b1           Addison-Wesley Publish. Comp. Inc., USA    1977  493    Databases
b2           Pergamon Press, Oxford, New York           1972  180    Programming
b3           John Wiley & Sons Inc., New York           1967  339    Information Sys.
b4           Prentice-Hall Inc., Englewood-Cliffs, USA  1971  585    Retrieval Sys.


Table 4. S = {⟨Xt, At, Vt, qt⟩}t=1982

X1982\A1982  Publisher                                  Year  Pages  Subject
b1           Addison-Wesley Publish. Comp. Inc., USA    1977  493    Databases
b2           Pergamon Press, Oxford, New York           1972  180    Programming
b4           Prentice-Hall Inc., Englewood-Cliffs, USA  1971  585    Retrieval Systems

Table 5. Time-invariant subsystem S′ = {⟨Xt, At, Vt, q⟩}

XT\AT  Publisher                                 Year  Pages
b1     Addison-Wesley Publish. Comp. Inc., USA   1977  493
b2     Pergamon Press, Oxford, New York          1972  180

2.6 Time-Varying Systems

If ⋂t∈T Xt = ∅ or ⋂t∈T At = ∅, i.e., ZT = ∅, then the system is obviously time-dependent on T, since there does not exist an element x belonging to all Xt or an attribute a belonging to all At. If ZT ≠ ∅, then the dynamic system S = {⟨Xt, At, Vt, qt⟩}t∈T has a subsystem

S′ = {⟨XT, AT, qt(ZT), qt|ZT⟩}, t ∈ T,

and we can observe that this system is not time-invariant, since by the definition of the time-varying system there exist t, t′ ∈ T and (x, a) ∈ ZT such that qt(x, a) ≠ qt′(x, a). A system which contains information about students [27] is a good example of a system with time-varying information. The set of objects is the set of all students of a fixed University [Faculty, Course]. As the set of attributes we may choose, for example: STUDY-YEAR, GROUP, MARK-OF-MATH, MARK-OF-PHYSICS, AV-MARK and so on. Descriptors are, as before, pairs of the form ⟨attribute, value⟩, where the sets of attribute values are as follows: ||STUDY-YEAR|| = {I, II, III, . . .}, ||GROUP|| = {1, 2, 3, . . .}, ||MARK-OF-MATH|| = {2, 3, 4, 5}, ||MARK-OF-PHYSICS|| = {2, 3, 4, 5}, ||AVERAGE-MARK|| = {2, 2.1, 2.2, . . . , 5}. Let us assume that a student advances to the next study year if his average mark lies between 3 and 5; otherwise the student remains in the same year of studies. If there is no change in the study year, the student can change the student group. Let us consider the history of three students s1, s2, s3, beginning with the first year of their studies, during the following three years. The situation in the system


Table 6. First year of observation

X1\A1  Year  Group  Av.mark
s1     I     1      −
s2     I     1      −
s3     I     2      −

Table 7. Second year of observation

X2\A2  Year  Group  Av.mark
s1     I     3      3.1
s2     II    1      4.1
s3     II    2      3.3

Table 8. Third year of observation

X3\A3  Year  Group  Av.mark
s1     II    3      3.1
s2     III   1      4.8
s3     II    1      3.7

is described in Tables 6, 7, and 8. One can observe that XT = {s1, s2, s3}, AT = {STUDY-YEAR, GROUP, AV-MARK}, and

qt(s1, STUDY-YEAR) =
  I   for t = 1st year of observation,
  I   for t = 2nd year of observation,   (5)
  II  for t = 3rd year of observation,

which means that the system is a time-varying system.

2.7 Variability of Information in Dynamic Systems

In time-varying systems we can observe various types of information changes. If the set ZT = (⋂t∈T Xt) × (⋂t∈T At) ≠ ∅, then the important features of the character of changes of information in time are described by the dynamic subsystem S′:

S′ = {⟨XT, AT, qt(ZT), qt|ZT⟩}t∈T.

In the subclass of dynamic systems represented by the system S′, the state of the system depends on time t only through the family {qt}t∈T. Due to the way this subclass of systems is realized in practice, it is sensible to consider such realizations of systems which allow determining values of the function:

f(x, a, qt−1(x, a), . . . , qt−i(x, a))

for all x ∈ XT, a ∈ AT and t ∈ T.


By f we denote any function which is feasible in the considered realization, and by i we denote the so-called depth of information, which can assume values 0, 1, . . . , I. When i = 0, the function f depends on x and a only. One can observe that such realizations of systems do not give the possibility of determining values of a function which explicitly depends on t. This is one of the features which distinguish dynamic information systems from data processing systems. From the point of view of the realizations described above, any dynamic system belongs to one of the following classes:

1. Systems with determined variability (SDV). A dynamic system belongs to SDV if and only if:
– for every (x, a) ∈ ZT there exist initial values q−1(x, a), . . . , q−i(x, a) ∈ ⋃t∈T Vt such that

∀ t ∈ T, ∀ (x, a) ∈ ZT : qt(x, a) = f(x, a, qt−1(x, a), . . . , qt−i(x, a))

for a properly chosen (feasible) function f.

2. Systems with predictable variability (SPV). A dynamic system belongs to SPV if and only if:
– it does not belong to SDV,
– there exist T1, . . . , TM ⊂ T (with ⋃j=1,…,M Tj = T, Tj ∩ Tk = ∅ for j ≠ k, j, k = 1, . . . , M, and card Tj > 1 for j = 1, . . . , M) and feasible functions f1, . . . , fM such that

∀ t ∈ Tj, ∀ (x, a) ∈ ZT : qt(x, a) = fj(x, a, qt−1(x, a), . . . , qt−ij(x, a))

for properly chosen initial values q−1(x, a), . . . , q−ij(x, a).

3. Systems with unpredictable variability (SUV). A dynamic system belongs to SUV if and only if:
– it does not belong to SDV or SPV.

It is worth underlining that systems whose structure is formally simple can belong to SUV. For example, the system whose information function is determined as follows:

qt(x, a) = f1(x, a, qt−1(x, a), . . . , qt−i1(x, a)) or f2(x, a, qt−1(x, a), . . . , qt−i2(x, a))  (6)

belongs to SUV as long as it is not determined for which t the function f1 and for which f2 is applied.

2.8 Examples of Time-Varying Systems

Examples of systems belonging to the SDV, SPV and SUV classes are given here. An example of a system with determined variability (SDV) can be a system of patient supervision (medical information). The objects of this system are


Table 9. The prescriptions of medicaments and tests for patients

X\A  Blood test  Lungs X-ray  Penicillin injections  Vitamins
p1   1           −            0.2                    C
p2   1           −            −                      −
p3   +           −            0.03                   B1
p4   −           1            −                      −

Table 10. The physician's prescription for a given patient

A\t  0  1  2  3  4  5  6  7  8  9  10
P    1  1  1  1  1  1  0  0  0  0  0
T    0  0  1  0  0  1  0  0  1  0  0

patients. The attributes are, for example, a test of blood morphology, a lungs X-ray, prescribed penicillin, prescribed doses of vitamins (in mg), etc. Table 9 presents the prescriptions of medicaments and tests for patients p1, p2, p3, p4 at the beginning of the considered system performance (t = 0). Let us describe the system performance on the example of the patient p2, who after a small surgery got a bacterial infection. The physician's prescription is as follows: penicillin injections P for the six forthcoming days, a blood morphology test T every third day. This prescription gives the table (Table 10) of the functions qt(p2, P) and qt(p2, T). One can observe that, using Boolean algebra notation, these functions can be written in the following form

qt(p2, P) = qt−1(p2, P) · [¬qt−2(p2, P) + ¬qt−7(p2, P)]  (∗)
qt(p2, T) = ¬qt−1(p2, T) · ¬qt−2(p2, T)  (∗∗)  (7)

if only the initial values are given as follows:

q−1(p2, P) = 1, q−j(p2, P) = 0 for j = 2, 3, . . . , 7 (information depth = 7),
q−k(p2, T) = 1 for k = 1, 2 (information depth = 2).  (8)
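The recurrences can be checked against Table 10 with a short Python sketch (ours; the placement of the negations in (∗) and (∗∗) is our reading of the garbled source, validated here by reproducing the table):

```python
# Simulate formulas (*) and (**) from the given initial values (8).
def simulate(days=11):
    P = {-j: 0 for j in range(2, 8)}; P[-1] = 1   # initial values, depth 7
    T = {-1: 1, -2: 1}                            # initial values, depth 2
    for t in range(days):
        P[t] = P[t-1] & ((1 - P[t-2]) | (1 - P[t-7]))   # formula (*)
        T[t] = (1 - T[t-1]) & (1 - T[t-2])              # formula (**)
    return [P[t] for t in range(days)], [T[t] for t in range(days)]

P, T = simulate()
assert P == [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # penicillin row of Table 10
assert T == [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]   # blood-test row of Table 10
```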

The formulas (∗), (∗∗) convince us that the described system (at least reduced to the object p2 and the attributes P and T) is SDV. Other systems of the class SDV can be found in [29,28,30]. As an example of SPV we use the system with students, and we assume that T1, T2, T3 are time intervals determined as follows: T1: from Oct. 1st 1980 to Sept. 30th 1981, T2: from Oct. 1st 1981 to Sept. 30th 1982, T3: from Oct. 1st 1982 to Sept. 30th 1983. It is easy to see that the functions qt(si, Y), qt(si, G), qt(si, A.m), i = 1, 2, 3, are constant in each time interval T1, T2, T3. Therefore on each interval T1, T2, T3 these functions are realizable (information depth = 0) and the system belongs to


Table 11. An example of a system belonging to SUV

X\A  Storage  Prod. division (I)  Prod. division (II)
M1   200      50                  30
M2   100      10                  20
M3   0        5                   4

SPV. Finally, let us consider a system which describes materials management in a factory. Objects in this system are different types of materials. The attributes can be production divisions and/or workstands and the main storage. The attribute values are given in units (of weight, measure, etc.) which are natural for the described object. Let us consider a system of this kind reduced to three objects, with a storage and two production divisions as attributes. Let the attribute values be given in Table 11. It is obvious that the state of resources of an object Mi in the production division K (K = I, II) depends not only on the information function qt−1(Mi, K) but also on information functions defined on other attributes, i.e., it depends on qt−1(M1, II), qt−1(M1, St.), qt−1(M1, I); therefore it is not a function which can be used as an information function according to the definition of a dynamic system. Moreover, the values of the functions qt(Mi, St.) are not determined a priori and generally we cannot determine the moments at which these values will change. This system of course does not belong to SDV or SPV; therefore it belongs to SUV. Examples of systems belonging to the SUV class can also be found in [27,31].

2.9 Influence of Foundations of a System on Its Classification

Analysing the foundations of a real system we can determine to which of the classes described above the system belongs. Thus, e.g., if we assume that the objects of the system are documents with static or rarely changing descriptions, then this system will belong to the class of invariant systems. The characteristics of most library systems directly imply that they belong to the class of time-invariant systems. In the same way, an assumption about variability in document descriptions will suggest that a system containing such documents belongs to the class of systems with time-varying information. Of course, if we are able to determine the moments at which description changes will occur, then it will be a system with predictable variability (SPV). If we are not able to determine these moments, we will obtain a system with unpredictable variability (SUV). Some systems are a priori classified as systems with determined variability (SDV), because knowledge of the "histories" of objects is one of the requirements, as in medical systems for example. So, the foundations of the realized information system decide a priori about its classification, which, in consequence, suggests a priori certain performance mechanisms of this system. Many existing systems are actually packages of systems belonging to different classes (e.g., a medical system may consist of a registration module, which is a time-invariant system, and a module of patient supervision, which belongs to the class of time-varying systems). In this case every module is designed as a system of an appropriate


class. The classification resulting from the analysis of the performance of information systems can be a convenient tool for design purposes. When somebody starts designing a system, he has good knowledge of the system's foundations and parameters, but generally he cannot predict the proper mechanisms of system performance. In this situation, as stated above, he can determine the class to which the system belongs. This allows him to choose adequate mechanisms of system performance.

2.10 Performance Mechanisms in Dynamic Systems

In a realization of information systems we should make decisions about the structure of the database and the way of updating it, about the retrieval method and the retrieval language we are going to use, and about the mode of operation which will be used in the system. In what follows we give some remarks on how these decisions depend on the fact that the considered system belongs to one of the determined classes, i.e., the class of invariant systems, the class of systems with determined variability (SDV), the class of systems with predictable variability (SPV), or the class of systems with unpredictable variability (SUV).

Database and its updating. At first let us consider invariant systems. The database of an invariant system is static throughout the period of performance. A reorganization of the database, if desired, is realized after the period of performance and consists in creating a new database. In systems with time-varying information the database changes during the action of the system. In systems with determined variability (SDV) we have to store information about an object in the past, because this information is necessary for determining the actual information about this object. Thus the "history", with the prescribed depth of information about objects, should be stored in the database. In systems with predictable variability (SPV) actualization and reorganization of the database ought to be executed at the certain moments at which changes are predicted. These are mainly changes in descriptions of objects. The database reorganization (actualization) does not necessarily involve changes in programs operating on the database. In systems with unpredictable variability (SUV) any execution of the retrieval process ought to be preceded by the actualization of the descriptions of objects. In all systems with time-varying information we can have at the same time an actualization of the set of objects, the set of attributes and the set of descriptors, as in invariant systems.

Retrieval method and information retrieval language. Because of the specific character of the database and the actualization process, one prefers exhaustive search as a retrieval method for invariant systems. In such a case an extension of the database does not affect the retrieval method. At most, in order to speed up the system performance, one may apply the methods of inverted files or linked lists. These methods are more useful for some systems with predictable variability (information depth = 0). There, when the system action is stopped, the database can be actualized along with updating the inverted files or linked lists. In these systems there is no need for developing special information


retrieval languages, because languages based on thesauruses, indexing or decimal classification seem to be sufficiently efficient. However, in the invariant systems and systems with predictable variability one can prefer a specific method of retrieval. For a realization of the systems with time-varying information, a grouping of informations and random access to the descriptions of these groups or to an individual description of an object is essential. Mathematical methods of retrieval seem to be the most convenient in this case (for example, Lum's methods [23] or the atomic component method with decomposition of the system). These retrieval algorithms allow us to find a particular piece of information quickly; they also simplify the updating process. In the case of systems with determined variability (SDV) this problem looks a bit different, because new information is constantly created and has to be stored. In this case the method of linked lists seems to be as good as the mathematical methods (e.g., the method of atomic components). In the method of linked lists, the actual information about an object is obtained by considering a chain of a determined length given by the depth of information. In the systems with time-varying information, a language based on descriptors is the most convenient for information retrieval, since it allows us to easily write/read informations described by means of codes which are equivalents of descriptors. Moreover, in this case the descriptions of objects are determined by the values of attributes. Informations in time-varying systems are always described by means of codes; therefore all output informations are translated into the natural language. Consequently, from the user's point of view, there is no difference whether the system uses the descriptor language or another one. In some cases, when this translation can be omitted (e.g., in medical systems which are used by the medical service), the descriptors ought to be introduced in accordance with codes accepted by a user. Here we ought to mention interactive languages, which seem to be necessary for most systems with time-varying information (the necessity of a dialogue with the system), but they will be discussed later on, along with the operation mode of dynamic systems.

Operation mode. Let us now consider the continuous operation mode and the batch operation mode in an information system. The continuous operation mode consists in current (i.e., ∀t∈T) information feeding; therefore we have current database updating. This operation mode will occur in systems with unpredictable variability, where actualization processes are executed in turns with retrieval processes. In most cases, however, information systems work in the batch operation mode, which means that actualization and reorganization processes take place at certain moments. This operation mode can be used in invariant systems and time-varying systems with predictable variability (SPV). The case of the interactive operation mode is a bit different, since a user is able to communicate with the system. If this mode is used only for retrieval purposes (to find more complete or relevant information), then it can be applied to a system of an arbitrary class. But if the goal of this dialogue is to create a new database structure (internal changes), then interactive systems are limited to the class of systems with unpredictable variability (SUV). At the end let us mention that, due to the structure of the dynamic model discussed


here (the definition of the dynamic information system), performance mechanisms are applied to any pair (x, a), x ∈ XT, a ∈ AT, separately. Thus all reorganizations of the model which are based on concurrent processing and multi-access give high efficiency of the information system in practice.

Conclusion. In this paper a possibility of introducing dynamics into Pawlak's model of systems is presented. In most practical situations this model is more convenient than the classical (relational) model. This is due to the fact that in Pawlak's model information about an object is given by functions, while in the classical model informations are determined by relations. This simplifies the description of systems and their analysis, which is important not only for system design but also for teaching system operation. The authors think that the only way of teaching how to use the system and how to design it goes through understanding the system operation mechanisms. For the model presented here, the proposed classification allows fulfilling this goal more easily. The model of information system created by Pawlak is very useful for building and analysing different types of information retrieval systems. Document information systems are a very specific type of information systems, and Pawlak's model is very well suited to defining the informations in them.

3 Decision Support Systems

When data mining first appeared, several disciplines related to data analysis, like statistics or artificial intelligence, were combined towards a new topic: extracting significant patterns from data. The original data sources were small datasets and, therefore, traditional machine learning techniques were the most common tools for this task. As the volume of data grew, these traditional methods were reviewed and extended with the knowledge of experts working in the field of data management and databases. Because of that, information systems equipped with some data-mining methods started to become decision support systems. A decision support system is a kind of information system which classifies each object to some class denoted by one of the attributes, called the decision attribute. While an information system is simply a pair of the form (U, A), a decision support system is also a pair S = (U, C ∪ {d}) with a distinguished attribute d. In the case of a decision table, the attributes belonging to C are called conditional attributes or simply conditions, while d is called the decision. We will further assume that the set of decision values is finite. The i-th decision class is the set of objects Ci = {x ∈ U : d(x) = di}, where di is the i-th decision value taken from the decision value set Vd = {d1, . . . , d|Vd|}. Let us consider the decision table presented as Table 12. In the presented system (with informations about students): C = {a, b, c}, D = {d}.


Table 12. Decision table

student  a   b   c   d
x1       a1  b1  c1  T
x2       a1  b1  c2  T
x3       a2  b1  c3  T
x4       a2  b1  c4  N
x5       a1  b2  c1  N
x6       a1  b2  c2  T
x7       a2  b2  c3  T
x8       a2  b2  c4  N

Having the indiscernibility relation, we may define the notion of a reduct. In the case of decision tables, a decision reduct is a set B ⊂ C of attributes which cannot be further reduced and such that IND(B) ⊆ IND(d). A decision rule is a formula of the form

(ai1 = v1) ∧ . . . ∧ (aik = vk) ⇒ (d = vd),

where 1 ≤ i1 < . . . < ik ≤ m and vj ∈ Vaij. We can simply interpret such a formula, similarly to natural language, with if and then elements. In the given decision table the decision rule for object x1 is given as: if (a = a1) and (b = b1) and (c = c1) then (d = T), the same as (a = a1) ∧ (b = b1) ∧ (c = c1) → (d = T). The atomic subformulas (ai1 = v1) are called conditions, or premises. We say that a rule r is applicable to an object, or alternatively that the object matches the rule, if its attribute values satisfy the premise of the rule. Each object x in a decision table determines a decision rule

∀a∈C (a = a(x)) ⇒ (d = d(x)),

where C is the set of conditional attributes and d is the decision attribute. Decision rules corresponding to some objects can have the same condition parts but different decision parts. We use decision rules to classify given information. When the information is uncertain or incomplete, additional techniques for information systems are needed. Numerous methods based on the rough set approach combined with Boolean reasoning techniques have been developed for decision rule generation.
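A minimal Python sketch of rule applicability (our illustration; the dictionary representation of premises is an assumption, not the paper's notation):

```python
# A rule's premise is a dict of attribute = value conditions; an object
# matches the rule when its attribute values satisfy all of them.
def matches(obj, premise):
    return all(obj.get(a) == v for a, v in premise.items())

def classify(obj, rules):
    """rules: list of (premise, decision); returns decisions of applicable rules."""
    return [d for premise, d in rules if matches(obj, premise)]

x1 = {"a": "a1", "b": "b1", "c": "c1"}           # object x1 from Table 12
rule = ({"a": "a1", "b": "b1", "c": "c1"}, "T")  # its generated decision rule
assert classify(x1, [rule]) == ["T"]
```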

4 Rough Sets

Rough set theory has been successfully applied in such fields as machine learning, data mining, etc., since Professor Pawlak developed it in 1982.


Reduction of a decision table is one of the key problems of rough set theory. The methodology is concerned with the classificatory analysis of imprecise, uncertain or incomplete information or knowledge expressed in terms of data acquired from experience. The primary notions of the theory of rough sets are the approximation space and the lower and upper approximations of a set. The approximation space is a classification of the domain of interest into disjoint categories. The membership status with respect to an arbitrary subset of the domain may not always be clearly definable. This fact leads to the definition of a set in terms of lower and upper approximations [9,10,11].

4.1 The Basic Notions

One of the foundations of rough set theory is the indiscernibility relation, which is generated using information about particular objects of interest. Information about objects is represented in the form of a set of attributes and their associated values for each object. The indiscernibility relation is intended to express the fact that, due to a lack of knowledge, we are unable to discern some objects from others simply by employing the available information about those objects. Any set of all indiscernible (similar) objects is called an elementary set, and forms a basic granule (atom) of knowledge about the universe. Any union of some elementary sets in a universe is referred to as a crisp set; otherwise the set is referred to as a rough set. Then, two separate unions of elementary sets can be used to approximate the imprecise set. Vague or imprecise concepts, in contrast to precise concepts, cannot be characterized solely in terms of information about their elements, since the elements are not always discernible from each other. The assumption is that any vague or imprecise concept is replaced by a pair of precise concepts called the lower and the upper approximation of the vague or imprecise concept.

4.2 Lower/Upper Approximation

The lower approximation is a description of the domain objects which are known with certainty to belong to the subset of interest, whereas the upper approximation is a description of the objects which possibly belong to the subset. Any subset defined through its lower and upper approximations is called a rough set. It must be emphasized that the concept of a rough set should not be confused with the idea of a fuzzy set, as they are fundamentally different, although in some sense complementary, notions. The rough set approach allows precisely defining the notion of concept approximation. It is based on the indiscernibility relation between objects, defining a partition of the universe U of objects. The indiscernibility of objects follows from the fact that they are perceived by means of the values of the available attributes; hence objects having the same (or similar) values of attributes are indiscernible. Let S = (U, C ∪ D) be an information system; then with any B ⊆ C there is associated an equivalence relation INDS(B), called the B-indiscernibility relation, whose classes are denoted by [x]B.


For B ⊆ C and X ⊆ U, we can approximate X using only the information contained in B by constructing the B-lower (B̲X) and B-upper (B̄X) approximations of X, where:

B̲X = {x : [x]B ⊆ X} and B̄X = {x : [x]B ∩ X ≠ ∅}.

The B-lower approximation of X is the set of all objects which can be certainly classified to X using attributes from B. The difference between the upper and the lower approximation constitutes the boundary region of a vague or imprecise concept. Upper and lower approximations are two of the basic operations in rough set theory.
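These two operations can be sketched in Python as follows (our illustration, assuming a decision table stored as a dictionary from objects to attribute-value maps); the example reuses Table 12, with X being the decision class d = T:

```python
# Equivalence classes [x]_B of the B-indiscernibility relation.
def blocks(table, B):
    classes = {}
    for x, row in table.items():
        classes.setdefault(tuple(row[a] for a in B), set()).add(x)
    return classes.values()

def lower(table, B, X):
    return {x for c in blocks(table, B) if c <= X for x in c}

def upper(table, B, X):
    return {x for c in blocks(table, B) if c & X for x in c}

T12 = {"x1": {"a": "a1", "b": "b1", "c": "c1"}, "x2": {"a": "a1", "b": "b1", "c": "c2"},
       "x3": {"a": "a2", "b": "b1", "c": "c3"}, "x4": {"a": "a2", "b": "b1", "c": "c4"},
       "x5": {"a": "a1", "b": "b2", "c": "c1"}, "x6": {"a": "a1", "b": "b2", "c": "c2"},
       "x7": {"a": "a2", "b": "b2", "c": "c3"}, "x8": {"a": "a2", "b": "b2", "c": "c4"}}
X = {"x1", "x2", "x3", "x6", "x7"}               # decision class d = T
assert lower(T12, ["b"], X) == set()             # nothing certain from b alone
assert upper(T12, ["b"], X) == set(T12)          # every object possibly in X
```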

4.3 Reduct and Core of Attributes

In the rough set area there is also the very important problem of finding (selecting) relevant features (attributes), whose source is the so-called core of the information system S. A reduct is a minimal set of attributes B ⊆ C such that INDS(B) = INDS(C), which means that it is a minimal set of attributes from C that preserves the original classification defined by the set C of attributes. The intersection of all reducts is the so-called core. In the example, both the core and the reduct consist of the attributes b and c (CORE(C) = {b, c}, RED(C) = {b, c}).
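For tiny tables the reducts and the core can be found by the following brute-force Python sketch (ours; it is exponential in the number of attributes, so it only illustrates the definition and is not a practical algorithm):

```python
from itertools import combinations

# Partition of objects induced by attribute set B, as a frozenset of blocks.
def ind(table, B):
    classes = {}
    for x, row in table.items():
        classes.setdefault(tuple(row[a] for a in B), set()).add(x)
    return frozenset(frozenset(c) for c in classes.values())

def reducts(table, C):
    """B is a reduct iff IND(B) = IND(C) and no proper subset preserves it."""
    full = ind(table, C)
    candidates = [set(B) for k in range(1, len(C) + 1)
                  for B in combinations(C, k) if ind(table, B) == full]
    return [B for B in candidates if not any(P < B for P in candidates)]

rows = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a2", "b1", "c3"), ("a2", "b1", "c4"),
        ("a1", "b2", "c1"), ("a1", "b2", "c2"), ("a2", "b2", "c3"), ("a2", "b2", "c4")]
table = {f"x{i+1}": dict(zip("abc", r)) for i, r in enumerate(rows)}
assert reducts(table, ["a", "b", "c"]) == [{"b", "c"}]  # core = intersection = {b, c}
```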

4.4 Rule Induction

Rough set based rule induction methods have been applied to knowledge discovery in databases; the empirical results obtained show that they are very powerful and that some important knowledge has been extracted from databases. For rule induction, lower/upper approximations and reducts play important roles, and the approximations can be extended to the variable precision model; however, using accuracy and coverage for rule induction has never been discussed. We can use the discernibility function fS to form a minimal decision rule for a given decision table [1]. For an information system S = (U, C ∪ {d}) with n objects, the discernibility matrix of S is a symmetric n × n matrix with entries cij defined as:

cij = {a ∈ C | a(xi) ≠ a(xj)} for i, j = 1, 2, . . . , n such that d(xi) ≠ d(xj).

Each entry consists of the set of attributes upon which objects xi and xj differ. A discernibility function fS for an information system S is a Boolean function of m Boolean variables a∗1, . . . , a∗m (corresponding to the attributes a1, . . . , am) defined by:

fS = ⋀ { ⋁ c∗ij : 1 ≤ j ≤ i ≤ n, cij ≠ ∅ },  (9)

where c∗ij = {a∗ : a ∈ cij}.
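A sketch of the discernibility matrix in Python (our illustration; the discernibility function is then the conjunction, over all non-empty entries, of the disjunctions of the attributes in each entry):

```python
# Entry c_ij collects the attributes on which objects x_i and x_j differ,
# computed only for pairs with different decisions, as in the definition above.
def discernibility_matrix(objects, C, d):
    """objects: list of (name, row) pairs; row maps attributes (and d) to values."""
    n = len(objects)
    M = {}
    for i in range(n):
        for j in range(i):
            (_, ri), (_, rj) = objects[i], objects[j]
            if ri[d] != rj[d]:
                M[(i, j)] = {a for a in C if ri[a] != rj[a]}
    return M

objs = [("x1", {"a": "a1", "b": "b1", "c": "c1", "d": "T"}),
        ("x4", {"a": "a2", "b": "b1", "c": "c4", "d": "N"})]
M = discernibility_matrix(objs, ["a", "b", "c"], "d")
assert M[(1, 0)] == {"a", "c"}   # x4 and x1 differ on a and c (and on decision d)
```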


For the given decision table the following set of rules was formed:

– rule nr 1: if a = a1 and b = b1 then d = T
– rule nr 2: if b = b1 and c = c1 then d = T
– rule nr 3: if b = b1 and c = c2 then d = T
– rule nr 4: if c = c3 then d = T
– rule nr 5: if c = c4 then d = N
– rule nr 6: if b = b2 and c = c1 then d = N
– rule nr 7: if c = c2 then d = T

4.5 Rough Set Theory and Decision Systems in Practice

The main specific problems addressed by the theory of rough sets are not only the representation of uncertain or imprecise knowledge, or knowledge acquisition from experience, but also the analysis of conflicts, the identification and evaluation of data dependencies, and the reduction of the amount of information. A number of practical applications employing this approach have been developed in recent years in areas such as medicine, drug research, process control and others. The recent publication of a monograph on the theory and a handbook on applications facilitates the development of new applications. One of the primary applications of rough sets in artificial intelligence is knowledge analysis and data mining [12,13,16,17]. Of the two expert systems implemented at the Silesian University, MEM is the one with the decision table in the form of the knowledge base. It is a diagnosis support system used in child neurology, and it is a notable example of a complex multistage diagnosis process. It permits the reduction of attributes, which allows improving the rules acquired by the system. MEM was developed on the basis of real data provided by the Second Clinic of the Department of Paediatrics of the Silesian Academy of Medicine. The system is employed there to support the classification of children having mitochondrial encephalopathies and considerably reduces the number of children directed for further invasive testing in the consecutive stages of the diagnosis process [18,19]. The work contains an example of applying rough set theory to support decision making. The created system maximally limits the indications for invasive diagnostic methods that finally decide about the diagnosis. The system has arisen using induction (machine learning from examples), one of the methods of artificial intelligence. A three-stage classification has been created. The most important problem was the appropriate choice of attributes for the classification process and the generation of a set of rules, a base for making decisions in new cases. Rough set theory provides the appropriate methods to solve this problem. A detailed analysis of the medical problem resulted in creating a three-staged diagnostic process, which allows classifying children into those suffering from mitochondrial encephalomyopathy and those suffering from other diseases. The data on which the decisions were based, like any real data, contained errors. Incomplete information was one of them. It resulted from the fact that some observations or examinations could not be made for all patients. Inconsistency of information was another


problem. Inconsistency occurred because there were patients who were diagnosed differently at the same values of the analyzed parameters. Additionally, developing a decision support system for diagnosing was connected with the reduction of knowledge, the generation of decision rules and a suitable classification of new information. The first stages of research on decision support systems concentrated on methods to represent the knowledge in a given system and methods of verification and validation of a knowledge base [14]. Recent works, however, deal with the following problems: a huge number of rules in a knowledge base with numerous premises in each rule, a large set of attributes, many of which are dependent, complex inference processes, and the problem of the proper interpretation of the decision rules by users. Fortunately, cluster analysis brings very useful techniques for the smart organisation of the rules, one of which is a hierarchical structure. It is based on the assumption that rules that are similar can be placed in one group. Consequently, in each inference process we can find the most similar group and run the forward chaining procedure on this significantly smaller group only. The method reduces the time consumption of all processes and explores only the new facts that are actually necessary, rather than all facts that can be retrieved from a given knowledge base. In our opinion, clustering rules for inference processes in decision support systems could prove useful to improve the efficiency of those systems [3,4]. A very important issue for knowledge base modularization is the concept proposed in [26], where the conception of decision units was presented. Both methods, cluster analysis and decision units, are the subject of our recent research. We propose such methods to represent knowledge in composite (large, complex) knowledge bases. Using a modular representation we can limit the number of rules to process during the inference. Thanks to the properties of clusters and decision units we can perform different optimizations; large knowledge bases are an important problem in decision systems. It is well known that the main problem of forward chaining is that it fires a lot of rules that are unnecessary to fire, because they are not the inference goal. A lot of fired rules produce a lot of new facts that are difficult to interpret properly. That is why the optimization of the inference processes in rule-based systems is very important in the artificial intelligence area. Fortunately, there are some methods to solve this problem. For example, we may reorganize the knowledge base from a list of unrelated rules into groups of similar rules (thanks to the cluster analysis method) or decision units. Thanks to this it is possible to make the inference process very efficient, even for really large and composite knowledge bases. Simplifying, when we cluster rules, then in inference processes we search only the small subset of rules (cluster) that is most similar to the given facts or hypothesis [25]. In the case of the decision units concept, thanks to such constructed units, in the backward chaining technique we perform the inference process only on the proper decision unit (the one with the given conclusion attribute). That is why we propose to change the structure of the knowledge base to a cluster or decision unit structure, with corresponding inference algorithm optimizations, depending on user requirements. At this stage of our work we can only present the general conception of modular


rule base organization. We cannot formally prove that our conception will really cause a growth in efficiency. But in our opinion a hierarchical organization of the rule knowledge base allows us to decrease the number of rules necessary to process during inference; thus we hope that global inference efficiency will grow. At this stage of our research, decision units (with Petri net extensions) and rule clusters are parallel tools for rule base decomposition rather than one coherent approach. Therefore we have two methods of rule base decomposition: into rule clusters, if we want to perform forward chaining inference, and into decision units, if we want to do backward chaining inference. The main goal of our future work is to create a coherent conception of the modularization of large rule bases. This conception shall join two main subgoals: optimization of the forward and backward chaining inference processes, and a practical approach to rule base modelling and verification. In our opinion, the two methods of rule base decomposition described in this work allow us to attain our goals. It is very important that software tools dedicated to rule clustering and the decision units approach exist. Practical tests allow us to say that we need specialized software tools when we work with large, composite rule bases. We expect that our mixed approach is a base for creating such software tools. Rough set theory enables solving the problem of a huge number of attributes and the removal of dependent attributes. The accuracy of classification can be increased by selecting subsets of strong attributes, which is performed by using several classification learners. The processed data are classified by diverse learning schemes and the generation of rules is supervised by domain experts. The implementation of this method in automated decision support software can improve the accuracy and reduce the time consumption as compared to full syntax analysis [20,21,22]. Pawlak's theory is also widely used by Zielosko and Piliszczuk to build classifiers based on partial reducts and partial decision rules [43,44]. Recently, partial reducts and partial decision rules were studied intensively by Moshkov and also by Zielosko and Piliszczuk. Partial reducts and partial decision rules depend on the noise to a lesser degree than exact reducts and rules [42]. Moreover, it is possible to construct more compact classifiers based on partial reducts and rules. The experiments with classifiers presented in [45] show that the accuracy of classifiers based on such reducts and rules is often better than the accuracy based on exact reducts and rules. It is a very important fact that in 1976 Dempster and Shafer created a mathematical theory of evidence, called the Dempster-Shafer theory, which is based on belief functions and plausible reasoning [32]. It allows combining separate pieces of information (evidence) to calculate the probability of an event. Pawlak's rough set theory, as an innovative mathematical tool created in 1982, lets us describe knowledge, including uncertain and inexact knowledge [8]. Finally, in 1994 the basic functions of the evidence theory were defined based on notions from rough set theory [33]. All the dependences between these theories have allowed further research on their practical usage. Some papers that tried to show the relationships between rough set theory and evidence theory, which could be used to find the minimal templates


for a given decision table, were also published [34,35]. Extracting templates from data is a problem that consists in finding some set of attributes with a minimal number of attributes that warrants, among others, a sufficiently small difference between the belief function and the plausibility function. This small difference between these functions allows reducing the number of attributes (together with a decrease in the values of the attributes) and forming the templates. Moreover, MTP (the minimal templates problem) gives the recipe for which decision values may be grouped. At the end we get decision rules with suitably large support. Of course, in recent years it has been possible to witness a rapid growth of interest in the application of rough set theory in many other domains such as, for instance, vibration analysis, conflict resolution, intelligent agents, pattern recognition, control theory, signal analysis, process industry, marketing, etc. Swiniarski in [36] presented an application of rough set methods to feature selection and reduction as a front end of neural network based texture image recognition. The role of the rough sets is to show their ability to select a reduced set of a pattern's features. In another paper, presented by Nguyen, we can observe a multi-agent system based on rough set theory [37]. The task of creating effective methods of web search result clustering based on rough sets was presented in [41] by Nguyen. Pawlak's theory was also used to develop a new methodology for data mining in distributed and multiagent systems [38]. Recently, rough set based methods have been proposed for data mining in very large relational databases [39,40].

4.6 Conclusions

Classification is an important problem in the field of Data Mining. Data acquisition and warehousing capabilities of computer systems are sufficient for wide application of computer-aided Knowledge Discovery. Inductive learning is employed in various domains such as medical data analysis or customer activity monitoring. Due to various factors, these data suffer from impreciseness and incompleteness. There are many classification approaches, like "nearest neighbours", "naive Bayes", "decision tree", "decision rule set", "neural networks" and many others. Unfortunately, there are opinions that rough set based methods can be used for small data sets only. The main objection is related to their lack of scalability (more precisely: there is a lack of proof showing that they can be scalable). The biggest troubles lie in the rule induction step. As we know, the potential number of all rules is exponential. All heuristics for rule induction algorithms have at least O(n²) time complexity, where n is the number of objects in the data set, and they require multiple data scans. Rough set theory has been applied to build classifiers by exploring symbolic relations in data. Indiscernibility relations combined with the concept notion, and the application of set operations, lead to knowledge discovery in an elegant and intuitive way. Knowledge discovered from data tables is often presented in terms of "if . . . then . . ." decision rules. With each rule a confidence measure is associated. Rough sets provide a symbolic representation of data and the representation of knowledge in


terms of attributes, information tables, semantic decision rules, rough measures of inclusion and closeness of information granules, and so on. Rough set methods make it possible to reduce the size of a dataset by removing some of the attributes while preserving the partitioning of the universe of an information system into equivalence classes.

5 Summary

Information systems and decision support systems are strongly related. The paper shows that we can treat a decision system as an information system of some objects for which we have information about their classification. When the information is not complete, or the system contains some uncertain data, we can use rough set theory to separate the uncertain part from what we are sure about. By defining the reduct for a decision table, we can optimize the system and then, using the methods for minimal rule generation, we can easily classify new objects. We see, therefore, that Prof. Pawlak's contribution to the domain of information and decision support systems is invaluable [24].

References

1. Bazan, J.: Metody wnioskowań aproksymacyjnych dla syntezy algorytmów decyzyjnych, praca doktorska, Wydział Informatyki, Matematyki i Mechaniki, Uniwersytet Warszawski, Warszawa (1998)
2. Grzelak, K., Kochańska, J.: System wyszukiwania informacji metodą składowych atomowych MSAWYSZ, ICS PAS Reports No. 511, Warsaw (1983)
3. Nowak, A., Wakulicz-Deja, A.: Effectiveness comparison of classification rules based on k-means clustering and Salton's method. In: Advances in Soft Computing, pp. 333–338. Springer, Heidelberg (2004)
4. Nowak, A., Wakulicz-Deja, A.: The concept of the hierarchical clustering algorithms for rules based systems. In: Advances in Soft Computing, pp. 565–570. Springer, Heidelberg (2005)
5. Pawlak, Z.: Mathematical foundation of information retrieval. CC PAS Reports No. 101, Warsaw (1973)
6. Pawlak, Z., Marek, W.: Information storage and retrieval system - mathematical foundations. CC PAS Reports No. 149, Warsaw (1974)
7. Pawlak, Z.: Information systems - theoretical foundations. Information Systems 6(3) (1981)
8. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Boston (1991)
9. Pawlak, Z., Skowron, A.: Rudiments of rough sets. Information Sciences 177, 3–27 (2007)
10. Pawlak, Z., Skowron, A.: Rough sets: some extensions. Information Sciences 177, 28–40 (2007)
11. Pawlak, Z., Skowron, A.: Rough sets and Boolean reasoning. Information Sciences 177, 41–73 (2007)


12. Roddick, J.F., Hornsby, K., Spiliopoulou, M.: YABTSSTDMR - Yet Another Bibliography of Temporal, Spatial and Spatio-Temporal Data Mining Research. In: Unnikrishnan, K.P., Uthurusamy, R. (eds.) Proc. SIGKDD Temporal Data Mining Workshop, San Francisco, CA, pp. 167–175. ACM, New York (2001)
13. Roddick, J.F., Egenhofer, M.J., Hoel, E., Papadias, D., Salzberg, B.: Spatial, Temporal and Spatio-Temporal Databases - Hot Issues and Directions for Ph.D Research. SIGMOD Record 33(2), 126–131 (2004)
14. Simiński, R., Wakulicz-Deja, A.: Circularity in Rule Knowledge Bases - Detection using Decision Unit Approach. In: Advances in Soft Computing, pp. 273–280. Springer, Heidelberg (2004)
15. Skowron, A.: From the Rough Set Theory to the Evidence Theory. In: Yager, R.R., Fedrizzi, M., Kacprzyk, J. (eds.) Advances in the Dempster-Shafer Theory of Evidence, pp. 193–236. Wiley, New York (1994)
16. Skowron, A., Bazan, J., Stepaniuk, J.: Modelling Complex Patterns by Information Systems. Fundamenta Informaticae 67(1-3), 203–217 (2005)
17. Bazan, J., Peters, J., Skowron, A., Synak, P.: Spatio-temporal approximate reasoning over complex objects. Fundamenta Informaticae 67, 249–269 (2005)
18. Wakulicz-Deja, A.: Podstawy systemów ekspertowych. Zagadnienia implementacji. Studia Informatica 26(3(64)) (2005)
19. Wakulicz-Deja, A., Paszek, P.: Optimalization on Decision Problems on Medical Knowledge Bases. In: 5th European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany (1997)
20. Wakulicz-Deja, A., Ilczuk, G.: Attribute Selection and Rule Generation Techniques for Medical Diagnosis Systems. In: Ślęzak, D., Yao, J., Peters, J.F., Ziarko, W.P., Hu, X. (eds.) RSFDGrC 2005. LNCS, vol. 3642, pp. 352–361. Springer, Heidelberg (2005)
21. Wakulicz-Deja, A., Ilczuk, G., Kargul, W., Mynarski, R., Drzewiecka, A., Pilat, E.: Artificial intelligence in echocardiography - from data to conclusions. Eur. J. Echocardiography Supplement 7(suppl. 1) (2006)
22. Wakulicz-Deja, A., Paszek, P.: Applying rough set theory to multi stage medical diagnosing. Fundamenta Informaticae XX, 1–22 (2003)
23. Lum, V.Y.: Multi-Attribute Retrieval with Combined Indexes. Communications of the ACM 13(11) (1970)
24. Wakulicz-Deja, A., Nowak, A.: From an information system to a decision support system. In: Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 454–464. Springer, Heidelberg (2007)
25. Nowak, A., Wakulicz-Deja, A.: The inference processes on clustered rules. In: Advances in Soft Computing, vol. 5, pp. 403–411. Springer, Heidelberg (2006)
26. Nowak, A., Simiński, R., Wakulicz-Deja, A.: Towards modular representation of knowledge base. In: Advances in Soft Computing, vol. 5, pp. 421–428. Springer, Heidelberg (2006)
27. Effelsberg, W., Harder, T., Reuter, A.: An experiment in learning DBTG data-base administration. Information Systems 5, 137–147 (1980)
28. Michalski, R.S., Chilausky, R.L.: Knowledge acquisition by encoding expert rules versus computer induction from examples: a case study involving soybean pathology. International Journal of Man-Machine Studies 12, 63–87 (1977)
29. Slamecka, V., Comp, H.N., Bodre, A.: MARIS - A knowledge system for internal medicine. Information Processing and Management 5, 273–276 (1977)
30. Masui, S., Shioya, M., Salaniski, T., Tayama, Y., Iungawa, T., Fujite: Evaluation of a diffusion model applicable to environmental assessment for air pollution abatement, System Development Lab., Hitachi Ltd., Tokyo, Japan (1980)


31. Cash, J., Whinston, A.: Security for GPLAN system. Information Systems 2(2) (1976)
32. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press, Princeton (1976)
33. Skowron, A., Grzymala-Busse, J.: From the Rough Set Theory to the Evidence Theory. In: Yager, R.R., Fedrizzi, M., Kacprzyk, J. (eds.) Advances in the Dempster-Shafer Theory of Evidence, pp. 193–236. Wiley, New York (1994)
34. Marszal-Paszek, B., Paszek, P.: Minimal Templates Problem. In: Intelligent Information Processing and Web Mining, Advances in Soft Computing, vol. 35, pp. 397–402. Springer, Heidelberg (2006)
35. Marszal-Paszek, B., Paszek, P.: Extracting Minimal Templates in a Decision Table. In: Monitoring, Security, and Rescue Techniques in Multiagent Systems, Advances in Soft Computing, pp. 339–344. Springer, Heidelberg (2005)
36. Swiniarski, R., Hargis, L.: Rough Sets as a Front End of Neural Networks Texture Classifiers. Neurocomputing 36(1-4), 85–102 (2001)
37. Nguyen, H.S., Nguyen, S.H., Skowron, A.: Decomposition of Task Specification. In: Raś, Z.W., Skowron, A. (eds.) ISMIS 1999. LNCS, vol. 1609, p. 310. Springer, Heidelberg (1999)
38. Polkowski, L., Skowron, A. (eds.): Rough Sets in Knowledge Discovery 2: Applications, Case Studies and Software Systems. Physica Verlag, Heidelberg (1998)
39. Stepaniuk, J.: Relational Data and Rough Sets. Fundamenta Informaticae 79(3-4), 525–539 (2007)
40. Stepaniuk, J.: Approximation Spaces in Multi Relational Knowledge Discovery. Rough Sets 6, 351–365 (2007)
41. Ngo, C.L., Nguyen, H.S.: A Method of Web Search Result Clustering Based on Rough Sets. In: Web Intelligence, pp. 673–679 (2005)
42. Moshkov, M., Piliszczuk, M., Zielosko, B.: On construction of partial reducts and irreducible partial decision rules. Fundamenta Informaticae 75(1-4), 357–374 (2007)
43. Piliszczuk, M.: On greedy algorithm for partial reduct construction. In: Proceedings of the Concurrency, Specification and Programming Workshop, Ruciane Nida, Poland, pp. 400–411 (2005)
44. Zielosko, B.: On partial decision rules. In: Proceedings of the Concurrency, Specification and Programming Workshop, Ruciane Nida, Poland, pp. 598–609 (2005)
45. Zielosko, B., Kocjan, A., Piliszczuk, M.: Classifiers Based on Partial Reducts and Partial Decision Rules. In: Intelligent Information Systems XVI, Proceedings of the International IIS 2008 Conference held in Zakopane, Challenging Problems of Science, Computer Science, pp. 431–438. Academic Publishing House EXIT, Warsaw (2008)
46. Orłowska, E.: Dynamic information systems. IPI PAN Reports No. 434, Warsaw, Poland (1981)
47. Sernadas, A.: Temporal aspects of logical procedure definition. Information Systems (3) (1980)
48. Wakulicz-Deja, A.: Classification of time-varying information systems. Information Systems (3), Warsaw, Poland (1984)
49. Wakulicz-Deja, A.: Podstawy systemów wyszukiwania informacji. Analiza metod, Problemy Współczesnej Nauki. Teoria i Zastosowania. Informatyka, Akademicka Oficyna Wydawnicza PLJ, Warsaw, Poland (1995)

Debellor: A Data Mining Platform with Stream Architecture

Marcin Wojnarski

Warsaw University, Faculty of Mathematics, Informatics and Mechanics
ul. Banacha 2, 02-097 Warszawa, Poland
[email protected]

Abstract. This paper introduces Debellor (www.debellor.org) – an open source extensible data mining platform with stream-based architecture, where all data transfers between elementary algorithms take the form of a stream of samples. Data streaming enables implementation of scalable algorithms, which can efficiently process large volumes of data, exceeding available memory. This is very important for data mining research and applications, since the most challenging data mining tasks involve voluminous data, either produced by a data source or generated at some intermediate stage of a complex data processing network. Advantages of data streaming are illustrated by experiments with clustering time series. The experimental results show that even for moderate-size data sets streaming is indispensable for successful execution of algorithms; otherwise the algorithms run hundreds of times slower or just crash due to memory shortage. Stream architecture is particularly useful in such application domains as time series analysis, image recognition or mining data streams. It is also the only efficient architecture for implementation of online algorithms. The algorithms currently available on the Debellor platform include all classifiers from the Rseslib and Weka libraries and all filters from Weka.

Keywords: Pipeline, Online Algorithms, Software Environment, Library.

1 Introduction

In the fields of data mining and machine learning, there is frequently a need to process large volumes of data, too big to fit in memory. This is particularly the case in some application domains, like computer vision or mining data streams [1,2], where input data are usually voluminous. But even in other domains, where input data are small, they can abruptly expand at an intermediate stage of processing, e.g., due to extraction of windows from a time series or an image [3,4]. Most ordinary algorithms are not suitable for such tasks, because they try to keep all data in memory. Instead, special algorithms are necessary, which make efficient use of memory. Such algorithms will be called scalable.


Another feature of data mining algorithms – besides scalability – which is very desirable nowadays is interoperability, i.e., the capability of an algorithm to be easily connected with other algorithms. This property is more and more important, as basically all newly created data mining systems – whether experimental or end-user solutions – incorporate much more than just one algorithm. It would be very valuable if algorithms were both scalable and interoperable. Unfortunately, combining these two features is very difficult. Interoperability requires that every algorithm is implemented as a separate module, with clearly defined input and output. Obviously, a data mining algorithm must take data as its input, so the data must be fully materialized – generated and stored in a data structure – just to invoke the algorithm, no matter what it actually does. And materialization automatically precludes scalability of the algorithm. In order to provide scalability and interoperability at the same time, algorithms must be implemented in a special software architecture which does not enforce data materialization. Debellor¹ – the data mining platform introduced in this paper – defines such an architecture, based on the concept of data streaming. In Debellor, data are passed between interconnected algorithms sample-by-sample, as a stream of samples, so they can be processed on the fly, without full materialization. The idea of data streaming is inspired by architectures of database management systems, which enable fast query execution on very large data tables. It should be noted that Debellor is not a library, like e.g. Rseslib² [5,6,7] or Weka³ [8], but a data mining platform. Although its distribution contains implementations of a number of algorithms, the primary goal of Debellor is to provide not the algorithms themselves, but a common architecture in which various types of data processing algorithms may be implemented and combined, even if they are created by independent researchers. Debellor can handle a wide range of algorithm types: classifiers, clusterers, data filters, generators etc. Moreover, extendability of data types is provided, so it will be possible to process not only ordinary feature vectors, but also images, text, DNA microarray data etc. It is worth mentioning that Debellor's modular and stream-oriented architecture will enable easy parallelization of composite data mining algorithms. This aspect will be investigated elsewhere. Debellor is written in Java and distributed under the GNU General Public License. Its current version, Debellor 0.5, is available at www.debellor.org. The currently available algorithms include all classifiers from the Rseslib and Weka libraries, all filters from Weka and a reader of ARFF files. There are also several algorithms implemented by Debellor itself, like the Train&Test evaluation procedure. The algorithms from Rseslib and Weka, except the ARFF reader, are not scalable – this is enforced by the architectures of both libraries.




2 Related Work

There is a large amount of software that can be used to facilitate the implementation of new data mining algorithms. A common choice is to use an environment for numerical calculations – R (http://www.r-project.org) [9], Matlab (http://www.mathworks.com), Octave (http://www.octave.org) [10,11] or Scilab (http://www.scilab.org) – and implement the algorithm in the scripting language defined by the environment. Many data mining and machine learning algorithms are available for each of these environments, usually in the form of external packages, so the environments can be seen as common platforms for different data mining algorithms. However, they do not define a common architecture for algorithms, so they do not automatically provide interoperability. Moreover, the scripting languages of these environments have low efficiency, no static typing and only weak support for object-oriented programming, so they are suitable for fast prototyping and running small experiments, but not for the implementation of scalable and interoperable algorithms.

Another possible choice is to take a data mining library written in a general-purpose programming language (usually Java) – examples of such libraries are Weka (http://www.cs.waikato.ac.nz/ml/weka) [8], Rseslib (http://rsproject.mimuw.edu.pl) [5,6,7] and RapidMiner (http://rapid-i.com) [12] – and try to fit the new algorithm into the architecture of the library. However, these libraries preclude scalability of algorithms, because the whole training data must be materialized in memory before they can be passed to an algorithm.

The concept of data streaming, also called pipelining, has been used in database management systems [13,14,15,16] for efficient query execution. The elementary units capable of processing streams are called iterators in [13,14].

The issue of scalability is related to the concept of online algorithms. In the machine learning literature [17,18], the term online has been used to denote training algorithms which update the underlying decision model after every single presentation of a sample. Algorithms which update the model only when the whole training set has been presented are called batch. Usually online algorithms can be more memory-efficient than their batch counterparts, because they do not have to store samples for later use. They are also more flexible, e.g. they can be used in incremental learning or allow the training process to be stopped at any time during a scan of the data. This is why extensive research has been done to devise online variants of existing batch algorithms [19,20,21,22,23]. Certainly, online algorithms are the best candidates for implementation in a stream architecture. Note, however, that many batch algorithms also do not have to keep all samples in memory and thus can benefit from data streaming. In many cases it is enough to keep only some statistics calculated during a scan of the data set, used afterwards to make the final update of the model. For example, the standard k-means algorithm [17,24,25] performs batch updates of the model, but despite this it can be scalable if implemented in a stream architecture, as will be shown in Sect. 5.8.

3 Motivation

3.1 Scalability

Scalable algorithms are indispensable in most data mining tasks – every time data become larger than the available memory. Even if memory initially seems capacious enough to hold the data, it may turn out during experiments that the data are larger and the memory smaller than expected. There are many reasons for this:

1. Not all of the physical memory is available to the data mining algorithm at a given time. Some part is used by the operating system and other applications.
2. The experiment may incorporate many algorithms run in parallel. In such a case, the available memory must be partitioned between all of them. In the future, parallelization will become more and more common due to the parallelization of hardware architectures, e.g. expressed by the increasing number of cores in processors.
3. In a complex experiment, composed of many elementary algorithms, every intermediate algorithm will generate another set of data. The total amount of data will be much larger than the amount of source data alone.
4. For architectural reasons data must be stored in memory in some general data structures, which take more memory than would be necessary in a given experiment. For example, data may be composed of binary attributes and each value could be stored on a single bit, but in fact each value takes 8 bytes or more, because every attribute – whether numeric or binary – is stored in the same way. The internal data representation used by a given platform is always a compromise between generality and efficient memory usage.
5. Data generated at intermediate processing stages may be many times larger than the source data. For example:
   – Input data may require decompression, e.g. JPEG images must be converted to raw bitmaps to undergo processing. This may increase data size even by a factor of 100.
   – In image recognition, a single input image may be used to generate thousands of subwindows that undergo further processing [4,26]. An input image of 1MB size may easily generate windows of 1GB size or more. A similar situation occurs in speech recognition or time series analysis, where the sliding-window technique is used.
   – Synthetic attributes may be generated, e.g. by taking all products of pairs of original attributes, which leads to a quadratic increase in the number of attributes.
   – Synthetic samples may be generated in order to increase the size of the training set and improve learning of a decision system. For example, this method is used in [27], which studies the problem of Optical Character Recognition. Training images of hand-written characters are randomly distorted by planar affine transformations and added to the training set. Every image undergoes 9 random distortions, which leads to a 10-fold increase in the training set size (from 60 to 600 thousand images).
6. In some applications, like mining data streams [1], input data are potentially infinite, so scalability obviously becomes an issue.
7. Even if the volume of data is small at the stage of experiments, it may become much bigger when the algorithm is deployed in a final product and must process real-world instead of experimental data.

The above arguments show clearly that memory is indeed a critical issue for data mining algorithms. Every moderately complex experiment will exhibit one or more of the characteristics listed above. This is why we need scalable algorithms and – for this purpose – an architecture that will enable algorithms to process data on the fly, without full materialization of a data set.

3.2 Interoperability

Nowadays, it is impossible to solve a data mining task or conduct an experiment using only one algorithm. For example, even to experiment with a single algorithm, like a new classification method, you at least have to access data on disk, so you need an algorithm that reads a given file format (e.g. ARFF, http://www.cs.waikato.ac.nz/ml/weka/arff.html). You would also like to evaluate your classifier, so you need an algorithm which implements an evaluation scheme, like cross-validation or bootstrap. And in most cases you will also need several algorithms for data preprocessing, like normalization, feature selection, imputation of missing values etc. – note that preprocessing is an essential step in knowledge discovery [28,29] and usually several different preprocessing methods must be applied before data can be passed to a decision system.

To build a data mining system, there must be a way to connect all these different algorithms together. Thus, they must possess the property of interoperability. Without this property, even the most efficient algorithm is practically useless.

Further on, the graph of data flow between elementary algorithms in a data mining system will be called a Data Processing Network (DPN). In general, we will assume that a DPN is a directed acyclic graph, so there are no loops of data flow. Moreover, in the current version of Debellor, a DPN can only have the form of a single chain, without branches. An example of a DPN is shown in Figure 1.

Fig. 1. Example of a Data Processing Network (DPN), composed of five elementary algorithms (boxes). Arrows depict data flow between the algorithms.


4 Data Streaming

To provide interoperability, data mining algorithms must be implemented in a common software architecture, which specifies:

– a method for connecting algorithms,
– a model of data transfer,
– a common data representation.

Architectures of existing data mining systems utilize the batch model of data transfer. In this model, algorithms must take the whole data set as an argument for execution. To run a composite experiment, represented by a DPN with a number of algorithms, an additional supervisor module is needed, responsible for invoking consecutive algorithms and passing data sets between them. Figure 3 presents a UML sequence diagram [30] with an example of batch processing in a DPN composed of three algorithms. The DPN itself is presented in Fig. 2.

Batch data transfer enforces data materialization, which precludes scalability of the algorithms and of the DPN as a whole. For example, in Weka, every classifier must be implemented as a subclass of the Classifier class (in the weka.classifiers package). Its training algorithm must be implemented in the method:

    buildClassifier(Instances) : void

The argument of type Instances is an array of training samples. This argument must be created before calling buildClassifier, so the data must be fully materialized in memory just to invoke the training algorithm, no matter what the algorithm actually does. A similar situation takes place for clustering methods, which must inherit from the weka.clusterers.Clusterer class and override the method:

    buildClusterer(Instances) : void
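For concreteness, the batch pattern just described looks roughly as follows in Weka. This is only a sketch assuming Weka's classic 3.x API; the choice of the J48 classifier and the file name are ours, not taken from the paper:

    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.classifiers.Classifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class BatchTraining {
        public static void main(String[] args) throws Exception {
            // The whole data set must be materialized in memory up front...
            Instances data =
                new Instances(new BufferedReader(new FileReader("iris.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            // ...before the training algorithm can even be invoked.
            Classifier classifier = new J48();
            classifier.buildClassifier(data);
        }
    }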

Rseslib and RapidMiner also enforce data materialization before a training algorithm can be invoked. In Rseslib, classifiers must be trained in the class constructor, which takes an argument of type DoubleDataTable. In RapidMiner, training of any decision system takes place in the method apply(IOContainer) of the class com.rapidminer.operator.Operator. Both Rseslib's DoubleDataTable and RapidMiner's IOContainer represent materialized input data.

If a large data set must be materialized, execution of the experiment is practically impossible. If the data fit in virtual memory [31] but exceed the available physical memory, the operating system temporarily swaps [31] part of the data (stores it in the swap file on disk), which makes the execution tens or hundreds of times slower, as access to disk is orders of magnitude slower than access to memory.

Fig. 2. DPN used as an example for analysis of data transfer models


Fig. 3. UML diagram of batch data transfer in a DPN composed of three algorithms: LoadData, Preprocess and TrainClassifier, controlled by the Supervisor module. The Supervisor invokes the algorithms (methods run) and passes data between them. All samples of a given data set are generated and transferred together, so the available memory must be large enough to hold all the data. Vertical lines denote the lifetime of the modules, with time passing down the lines. Horizontal lines represent messages (method calls and/or data transfers) between the modules. Vertical boxes depict execution of a module's code.

If the data set is so large that it exceeds even the available virtual memory, execution of the experiment is terminated with an out-of-memory error. This problem could be avoided if the class that represents a data set (e.g. Instances in Weka) internally implemented buffering of data on disk. Then, however, the same performance degradation would occur as in the case of system swapping, because swapping and buffering on disk are actually the same thing, only implemented at different levels: in the operating system or in the data mining environment.

The only way to avoid severe performance degradation when processing large data is to generate the data iteratively, sample-by-sample, and instantly process the created samples, as presented in Fig. 4. In this way, data may be generated and consumed on the fly, without materialization of the whole set. This model of data transfer will be called iterative.


Fig. 4. UML diagram of iterative data transfer. The supervisor invokes the algorithms separately for each sample of the data set (sample x y denotes sample no. x generated by algorithm no. y). In this way, memory requirements are very low (the memory complexity is constant), but the supervisor's control over the data flow becomes very difficult.

Iterative data transfer solves the problem of high memory consumption, because the memory requirements imposed by the architecture are constant – only a fixed number of samples must be kept in memory at a given moment, no matter how large the full data set is. However, another problem arises: the supervisor becomes responsible for controlling the flow of samples and the order of execution of the algorithms. This control may be very complex, because each elementary algorithm may have different input-output characteristics. The number of possible variants is practically infinite, for example:

1. A preprocessing algorithm may filter out some samples, in which case more than one input sample may be needed to produce one output sample.
2. A preprocessing algorithm may produce a number of output samples from a single input sample, e.g. when extracting windows from an image or time series.
3. The training algorithm of a decision system usually has to scan the data many times, not only once.
4. Generation of output samples may be delayed relative to the flow of input samples, e.g. an algorithm may require that 10 input samples are given before it starts producing output samples.
5. The input data to an algorithm may be infinite, e.g. when they are generated synthetically. In such a case, the control mechanism must stop data generation at an appropriate moment.
6. Some algorithms may have more than one input or output, e.g. an algorithm for merging data from several different sources (many inputs) or an algorithm for splitting data into training and test parts (many outputs). In such a case, the control of data flow through all the inputs and outputs becomes even more complex, because there are additional dependencies between the many inputs/outputs of the same algorithm.

Note that the diagram in Fig. 4 depicts a simplified case where the DPN is a single chain of three algorithms, without branches; preprocessing generates exactly one output sample for every input sample; and the training algorithm scans the data only once.

Fig. 5. UML diagram of control and data ﬂow in the stream model of data transfer. The supervisor invokes only method build() of the last component (TrainClassiﬁer). This triggers a cascade of messages (calls to methods next()) and transfers of samples, as needed to fulﬁll the initial build() request.


The way data flow should be controlled depends on which algorithms are used in a given DPN. For this reason, the algorithms themselves – not the supervisor – should be responsible for controlling the data flow. To this end, each algorithm must be implemented as a component which can communicate with other components without external control by the supervisor. The supervisor's responsibility must be limited to linking components together (building the DPN) and invoking the last algorithm in the DPN, which is the final receiver of all samples. Communication should take the form of a stream of samples: (i) the sample is the unit of data transfer; (ii) samples are transferred sequentially, in a fixed order decided by the sender. This model of data transfer will be called the stream model. An example of control and data flow in this model is presented in Fig. 5. Component architecture and data streaming are the features of Debellor which enable scalability of algorithms implemented on this platform.

5 Debellor Data Mining Platform

5.1 Data Streams

Debellor's components are called cells. Every cell is a Java class inheriting from the base class Cell (package org.debellor.core). Cells may implement all kinds of data processing algorithms, for example:

1. Decision algorithms: classification, regression, clustering, density estimation etc.
2. Transformations of samples and attributes.
3. Removal or insertion of samples and attributes.
4. Loading data from a file, database etc.
5. Generation of synthetic data.
6. Buffering and reordering of samples.
7. Evaluation schemes: train&test, cross-validation, leave-one-out etc.
8. Collecting statistics.
9. Data visualization.

Cells may be connected into a DPN by calling the setSource(Cell) method on the receiving cell, for example:

    Cell cell1 = ..., cell2 = ..., cell3 = ...;
    cell2.setSource(cell1);
    cell3.setSource(cell2);

The first cell will usually represent a file reader or a generator of synthetic data. Intermediate cells may apply different kinds of data transformations, while the last cell will usually implement a decision system or an evaluation procedure. A DPN can be used to process data by calling the methods open(), next() and close() on the last cell of the DPN, for example:


    cell3.open();
    sample1 = cell3.next();
    sample2 = cell3.next();
    sample3 = cell3.next();
    ...
    cell3.close();

The above calls open a communication session with cell3, retrieve some number of processed samples and close the session. In order to serve each request, cell3 may communicate with its source cell, cell2, by invoking the same methods (open, next, close) on cell2. And cell2 may in turn communicate with cell1. In this way it is possible to generate output samples on the fly. The stream of samples may flow through consecutive cells of the DPN without buffering, so the input data may have unlimited volume.

Note that the user of the DPN does not have to control the sample flow by hand. To obtain the next sample of processed data it is enough to call cell3.next(), which will invoke – if needed – a cascade of calls to the preceding cells. Moreover, different cells may control the flow of samples differently. For example, cells that implement classification algorithms will take one input sample in order to generate one output sample. Filtering cells will take a couple of input samples in order to generate one output sample that matches the filtering rule. An image subwindow generator will produce many output samples out of a single input sample.

We can see that the cell's interface is very flexible. It enables implementation of various types of algorithms in the same framework and makes it easy to combine the algorithms into a complex DPN. A sketch of a simple filtering cell is given below.
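For concreteness, here is a minimal sketch of such a filtering cell, written in the style of the pseudocode used later in Fig. 9. It assumes that a subclass may provide next() directly, that source is the field of Cell referencing the preceding cell, and that null marks the end of the stream; the predicate matches() is a hypothetical helper, not part of the Debellor API:

    class FilterCell extends Cell {
        // Pull samples from the source until one satisfies the filtering rule.
        public Sample next() {
            Sample s = source.next();
            while (s != null && !matches(s)) {
                s = source.next();        // skip samples rejected by the filter
            }
            return s;                     // null propagates the end of the stream
        }

        // Hypothetical predicate encoding the filtering rule.
        private boolean matches(Sample s) {
            return s.label != null;       // e.g. keep only labelled samples
        }
    }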

5.2 Buildable Cells

Some cells may be buildable, in which case their content must be built before the cell can be used. The building procedure is invoked by calling the method

    build() : void

on the cell object. This method is declared in the base class Cell. Building a cell may mean different things for different types of cells, for example:

– training a decision system of some kind (classifier, clusterer, . . . ),
– running an evaluation scheme (train&test, cross-validation, . . . ),
– reading all data from the input stream and buffering them in memory.

Note that all these different types of algorithms are encapsulated under the same interface (the method build()). This increases the simplicity and modularity of the platform. Usually the cell reads input data during building, so it must be properly connected to a source cell before build() is invoked. Afterwards, the cell may be reconnected and used to process another stream of data. Some buildable cells may also implement the erase() method, which clears the content of the cell. After erasure, the cell may be built once again. A sketch of this lifecycle follows.
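A minimal usage sketch of the build/erase lifecycle, assuming only the methods named above (ArffReader and KMeans are cells mentioned elsewhere in this paper; the file name is ours):

    Cell reader = new ArffReader();
    reader.set("filename", "iris.arff");
    Cell kmeans = new KMeans();
    kmeans.setSource(reader);    // connect to a source before building
    kmeans.build();              // train on the stream provided by the source
    // ... use the built cell via open(), next(), close() ...
    kmeans.erase();              // clear the model; the cell may now be rebuilt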

5.3 State of the Cell

Every cell object has a state variable attached, which indicates which cell operations are allowed at a given moment. There are three possible states: EMPTY, CLOSED and OPEN. The transitions between them are presented in Fig. 6. Each transition is invoked by a call to an appropriate method: build(), erase(), open() or close().

Fig. 6. Diagram of cell states and allowed transitions

Only some of the cell's methods may be called in a given state. For example, next() can be called only in the OPEN state, while setSource() is allowed only in the EMPTY or CLOSED state. The base class implementation guarantees that disallowed calls immediately end with an exception being thrown. Thanks to this automatic state control, connecting different cells together and building composite algorithms becomes easier and safer, because many possible mistakes or bugs related to inter-cell communication are detected early. Otherwise, they could remain unnoticed, generating incorrect results during data processing. Moreover, it is easier to implement new cells, because their authors do not have to check the correctness of method calls by themselves. A sketch of such a guard is given below.
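The state checking described above might be implemented along the following lines. This is only our sketch of the mechanism, not Debellor's actual source; the State enum, the exception type and the method bodies are assumptions (the pseudocode in Fig. 9 elides this layer):

    public class CellStateDemo {
        enum State { EMPTY, CLOSED, OPEN }
        private State state = State.EMPTY;

        public void open()  { state = State.OPEN; }
        public void close() { state = State.CLOSED; }

        public Object next() {
            // Disallowed calls fail immediately instead of silently
            // corrupting results later during data processing.
            if (state != State.OPEN)
                throw new IllegalStateException(
                    "next() may be called only in the OPEN state");
            return null;   // a real cell would produce the next sample here
        }
    }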

5.4 Parametrization

Most cells require a number of parameters to be set before the cell can start working. Certainly, every type of cell requires different parameters, but for the sake of interoperability and simplicity of usage, there should be a common interface for passing parameters, no matter what number and types of parameters are expected by a given cell. Debellor defines such an interface. Parameters for a given cell are stored in an object of class Parameters (package org.debellor.core), which keeps a dictionary of parameter names and associated String values (in the future we plan to extend the permitted value types; note, however, that all simple types can easily be converted to String). Thanks to the use of a dictionary, the names do not have to be hard-coded as fields of cell objects, so parameters can be added dynamically, according to the requirements of a given cell. The object of class Parameters can be passed to the cell by calling Cell's method:

    setParameters(Parameters) : void


It is also possible (and usually more convenient) to pass single parameter values directly to the cell, without an intermediate Parameters object, by calling:

    set(String name, String value) : void

This method call delegates to the analogous method of the Cell's internal Parameters object. Both routes are illustrated below.
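A short sketch of both routes; set() on the cell is taken directly from the description above, while the method used to populate the Parameters object (here also called set) is our assumption about its dictionary-style API:

    /* direct route, as used in the example code in Fig. 10 */
    Cell kmeans = new KMeans();
    kmeans.set("numClusters", "10");

    /* route via an explicit Parameters object (populating method assumed) */
    Parameters params = new Parameters();
    params.set("numClusters", "10");
    kmeans.setParameters(params);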

5.5 Data Representation

The basic unit of data transfer between cells is the sample. Samples are represented by objects of class Sample. Every sample contains two fields, data and label, which hold the input data and the associated decision label, respectively. Either of the fields can be null, if the corresponding information is missing or simply not necessary at the given point of data processing. Cells are free to use whichever part of the input data they want. For example, the build() method of a classifier (i.e. the training algorithm) would use both data and label, interpreting label as a target classification of data, given by a supervisor. During the operation phase, the classifier would ignore the input label, if present. Instead, it would classify data and assign the generated label to the label field of the output sample.

Data and labels are represented in an abstract way. Both the data and label fields reference objects of type Data (package org.debellor.core). Data is a base class for classes that represent data items, like single features or vectors of features. When a cell wants to use the information stored in data or label, it must downcast the object to the specific subclass it expects. Thanks to this abstract method of data representation, new data types can be added easily, by creating a new subclass of Data. Authors of new cells are not limited to a single data type hard-coded into the platform, as for example in Weka. Data objects may be nested. For example, objects of class DataVector (in org.debellor.core.data) hold arrays of other data objects, like simple features (classes NumericFeature and SymbolicFeature) or other DataVectors.
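For instance, a cell expecting feature vectors might access its input as in the sketch below; the field names and class names come from the text, while the body is only illustrative (the accessors of DataVector are not shown in the paper):

    class InspectingCell extends Cell {
        public Sample next() {
            Sample s = source.next();
            if (s == null) return null;     // end of the stream
            // Downcast to the concrete data type this cell expects.
            DataVector v = (DataVector) s.data;
            // ... use v here ...
            return s;                       // pass the sample through
        }
    }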

5.6 Immutability of Data

A very important concept related to data representation is immutability. Objects which store data – instances of the Sample class or of Data subclasses – are immutable, i.e. they cannot be modified after creation. Thanks to this property, data objects can be safely shared by cells, without the risk of an accidental modification in one cell affecting the operation of another cell. Immutability of data objects yields many benefits:

1. Safety – cells written by different people may work together in a complex DPN without interference.
2. Simplicity – the author of a new cell does not have to care about the correctness of access to data objects.


3. Efficiency – data objects do not have to be copied when transferred to another cell. Without immutability, copying would be necessary to provide a basic level of safety. Also, a number of samples may keep references to the same data object.
4. Parallelization – if a DPN is executed concurrently, no synchronization is needed when accessing shared data objects. This simplifies parallelization and makes it more efficient.

5.7 Metadata

Many cells have to know some basic characteristics (the "type") of input samples before processing of the data starts. For example, the training algorithm of a neural network has to know the number of input features in order to allocate weight arrays of the appropriate size. To provide such information, the method open() returns an object of class MetaSample (a static inner class of Sample), which describes the common properties of all samples generated by the stream being opened. Similarly to Sample, MetaSample has separate fields describing the input data and labels, both of type MetaData (a static inner class of Data).

Metadata have a structure and properties analogous to the data being described. The hierarchy of metadata classes, rooted at MetaData, mirrors the hierarchy of data classes, rooted at Data. The nesting of MetaData and Data objects is also similar, e.g. if the stream generates DataVectors of 10 SymbolicFeatures, the corresponding MetaData object will be an instance of MetaDataVector, containing an array of 10 MetaSymbolicFeatures, one describing each feature. Similarly to Data, MetaData objects are immutable, so they can be safely shared by cells.

5.8 Example

To illustrate the usage of Debellor, we will show how to implement the standard k-means algorithm in the stream architecture and how to employ it for data processing in a several-cell DPN. K-means [17,24,25] is a popular clustering algorithm. Given n input samples – numeric vectors of fixed length, x_1, x_2, . . . , x_n – it tries to find cluster centers c_1, . . . , c_k which minimize the sum of squared distances of the samples to their closest centers:

    E(c_1, . . . , c_k) = Σ_{i=1}^{n} min_{j=1,...,k} ||x_i − c_j||² .    (1)

This is done through an iterative process with two steps repeated alternately in a loop: (i) assignment of each sample to the nearest cluster and (ii) repositioning of each center to the centroid of all samples in its cluster. The algorithm is presented in Fig. 7. As we can see, the common implementation of k-means as a function is non-scalable, because it employs the batch model of data transfer: training data are passed as an array of samples, so they must be generated and accumulated in memory before the function is called.


function kmeans(data) returns an array of centers
    Initialize array centers
    repeat
        Set sum[1], . . . , sum[k], count[1], . . . , count[k] to zero
        for i = 1..n do                  /* assign samples to clusters */
            x = data[i]
            j = clusterOf(x)
            sum[j] = sum[j] + x
            count[j] = count[j] + 1
        end
        for j = 1..k do                  /* reposition centers */
            centers[j] = sum[j] / count[j]
        end
    until no center has been changed
    return centers

Fig. 7. Pseudocode of the k-means clustering algorithm implemented as a regular stand-alone function. The function takes an array of n samples (data) as its argument and returns k cluster centers. Both samples and centers are real-valued vectors. The function clusterOf(x) returns the index of the center that is closest to x.

class KMeans extends Cell
    method build()
        Initialize array centers
        repeat
            Set sum[1], . . . , sum[k], count[1], . . . , count[k] to zero
    (*)     source.open()
            for i = 1..n do
    (*)         x = source.next()
                j = clusterOf(x)
                sum[j] = sum[j] + x
                count[j] = count[j] + 1
            end
    (*)     source.close()
            for j = 1..k do
                centers[j] = sum[j] / count[j]
            end
        until no center has been changed

Fig. 8. Pseudocode of the implementation of k-means as a Debellor cell. Since k-means is a training algorithm (it generates a decision model), it must be implemented in the method build() of a Cell subclass. Input data are provided by the source cell, the reference source being a field of Cell. The generated model is stored in the field centers of class KMeans; the method build() does not return anything. The lines of code inserted or modified relative to the standard implementation are marked with an asterisk (*).


class KMeans extends Cell
    method next()
        x = source.next()
        if x == null then return null
        return x.setLabel(clusterOf(x))

Fig. 9. Pseudocode of the method next() of the KMeans cell. This method employs the clustering model generated by build() and stored inside the KMeans object to label new samples with the identifiers of their clusters.

/* 3 cells are created and linked into a DPN */
Cell arff = new ArffReader();
arff.set("filename", "iris.arff");        /* parameter filename is set */

Cell remove = new WekaFilter("attribute.Remove");
remove.set("attributeIndices", "last");
remove.setSource(arff);                   /* cells arff and remove are linked */

Cell kmeans = new KMeans();
kmeans.set("numClusters", "10");
kmeans.setSource(remove);

/* the k-means algorithm is executed */
kmeans.build();

/* the clusterer is used to label 3 training samples with cluster identifiers */
kmeans.open();
Sample s1 = kmeans.next(), s2 = kmeans.next(), s3 = kmeans.next();
kmeans.close();

/* labelled samples are printed on screen */
System.out.println(s1 + "\n" + s2 + "\n" + s3);

Fig. 10. Java code showing sample usage of Debellor cells: reading data from an ARFF ﬁle, removal of an attribute, training and application of a k-means clusterer

The stream implementation of k-means – as a Debellor cell – is presented in Fig. 8. In contrast to the standard implementation, training data are not passed explicitly as an array of samples. Instead, the algorithm retrieves samples one-by-one from the source cell, so it can process arbitrarily large data sets. In addition, Fig. 9 shows how to implement the method next(), responsible for applying the generated clustering model to new samples.

Note that although the algorithm presented in Fig. 8 employs the stream method of data transfer, it uses a batch method of updating the decision model (the updates are performed after all samples have been scanned). These two things – the method of data transfer and the way the model is updated – are separate and independent issues. It is possible for batch (in terms of model update) algorithms to utilize and benefit from the stream architecture.

The listing in Fig. 10 shows how to run a simple experiment: train a k-means clusterer and apply it to several training samples, to label them with the identifiers of their clusters. Data are read from an ARFF file and a simple preprocessing step – removal of the last attribute – is applied to all samples. Note that loading data from the file and preprocessing are executed only when the next input sample is requested by the kmeans cell – in the methods build() and next().

6 Experimental Evaluation

6.1 Setup

In existing data mining systems, when the data to be processed are too large to fit in memory, they must be put in virtual memory. During execution of the algorithm, parts of the data are swapped to disk by the operating system to make space for other, currently requested parts. In this way, portions of data constantly move between memory and disk, generating a huge overhead on the execution time of the algorithm. In the presented experiments we wanted to estimate this overhead and the performance gain that can be obtained through the use of Debellor's data streaming instead of swapping.

For this purpose, we trained the k-means [17,24,25] clustering algorithm on time windows extracted from the time series that was used in the EUNITE (EUropean Network on Intelligent TEchnologies for Smart Adaptive Systems, http://www.eunite.org) 2003 data mining competition. We compared the execution times of two variants of the experiment:

1. batch, with time windows created in advance and buffered in memory,
2. stream, with time windows generated on the fly.

The Data Processing Networks of both variants are presented in Figs. 11 and 12. In both variants, we employed our stream implementation of k-means, sketched in Sect. 5.8 (the KMeans cell in Figs. 11 and 12). In the first variant, we inserted a buffer into the DPN just before the KMeans cell – in this way we effectively obtained a batch algorithm. In the second variant, the buffer was placed earlier in the chain of algorithms, before window extraction. We could have dropped buffering altogether, but then the data would be loaded from disk anew in every training cycle, which was not necessary, as the source data were small enough to fit in memory.



Fig. 11. DPN of the ﬁrst (batch) variant of experiment

Fig. 12. DPN of the second (stream) variant of experiment

The source data were composed of a series of real-valued measurements from a glass production process, recorded at 9408 different time points separated by 15-minute intervals. There were two kinds of measurements: 29 "input" and 5 "output" values. In the experiment we used only the "input" values; the "output" ones were filtered out by the Weka filter for attribute removal (the WekaFilter cell). After loading from disk and dropping the unnecessary attributes, the data occupied 5.7MB of memory. They were subsequently passed to the TimeWindows cell, which generated time windows of length W, at every possible offset from the beginning of the input time series. Each window was created as a concatenation of W consecutive samples of the series. Therefore, for an input series of length T, composed of A attributes, the resulting stream contained T − W + 1 samples, each composed of W · A attributes. In this way, relatively small source data (5.7MB) generated a large volume of data at further stages of the DPN, e.g. 259MB for W = 50.

In the experiments, we compared the training times of both variants of k-means. Since the time effectiveness of swapping and memory management depends highly on the hardware setup, the experiments were repeated in two different hardware environments: (A) a laptop PC with an Intel Mobile Celeron 1.7 GHz CPU and 256MB RAM; (B) a desktop PC with an AMD Athlon XP 2100+ (1.74 GHz) and 1GB RAM. Both systems ran under Microsoft Windows XP. Sun's Java Virtual Machine (JVM) 1.6.0_03 was used. The number of clusters for k-means was set to 5.

6.2 Results

The results of the experiments are presented in Tables 1 and 2. They are also depicted graphically in Figs. 13 and 14. Different lengths of time windows were checked; for every length the size of the generated training data was different (given in the second column of the tables). In each trial, the training time of k-means was measured. The times are reported in normalized form, i.e. the total training time in seconds is divided by the number of training cycles and by the data size in MB. Normalized times can thus be directly compared across different trials.
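In symbols, the normalization just described can be written as follows (our restatement; the symbol names are ours):

\[ t_{\mathrm{norm}} = \frac{T_{\mathrm{total}}}{n_{\mathrm{cycles}} \cdot s_{\mathrm{MB}}} \]

where T_total is the total training time in seconds, n_cycles is the number of training cycles, and s_MB is the size of the training data in megabytes.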


Table 1. Normalized training times of k-means for the batch and stream variants of the experiment and different lengths of time windows. The corresponding sizes of training data are given in the second column. Hardware environment A.

Window length | Data size [MB] | Normalized time (batch) | Normalized time (stream)
      10      |       53       |          3.1            |          5.6
      20      |      104       |          3.2            |          5.3
      30      |      156       |          3.1            |          5.0
      40      |      208       |          5.1            |          4.9
      50      |      259       |        244.4            |          5.0
      60      |      311       |        326.9            |          8.3
      70      |      362       |        370.6            |         10.7
      80      |      413       |        386.0            |         10.9
      90      |      464       |        475.3            |         11.1

Table 2. Normalized training times of k-means for the batch and stream variants of the experiment and different lengths of time windows. The corresponding sizes of training data are given in the second column. Hardware environment B.

Window length | Data size [MB] | Normalized time (batch) | Normalized time (stream)
      50      |      259       |          4.0            |          5.3
     100      |      515       |          4.0            |          5.4
     120      |      617       |          4.0            |          6.5
     150      |      769       |          5.3            |          8.7
     170      |      869       |          6.3            |          8.8
     180      |      919       |         23.8            |          8.8
     190      |      969       |         36.4            |          8.8
     200      |     1019       |         50.7            |          8.8
     210      |     1069       |         71.3            |          8.8
     220      |     1119       |         85.1            |          8.8
     230      |     1168       |        100.4            |          9.1
     240      |     1218       |        111.1            |          9.1
     250      |     1267       |        140.2            |          9.4
     260      |     1317       |        crash            |          9.3

Every table and figure presents the results of both variants of the algorithm. The time complexity of a single training cycle of k-means is linear in the data size, so normalized execution times should be similar across different values of the window length. However, for the batch variant the times are constant only for small data sizes. At the point where the data size gets close to the amount of physical memory installed on the system, the execution time suddenly jumps to a very high value, many times larger than for smaller data sizes.


(Line plot omitted: normalized training time versus size of training data [MB], comparing the batch and stream variants.)

Fig. 13. Normalized training times of k-means for batch and stream variant of experiment and diﬀerent lengths of time windows. Hardware environment A.

(Line plot omitted: normalized training time versus size of training data [MB], comparing the batch and stream variants.)

Fig. 14. Normalized training times of k-means for batch and stream variant of experiment and diﬀerent lengths of time windows. Hardware environment B. Note that the measurement which caused the batch variant to crash (last row in Table 2) is not presented here.


It may even happen that from some point on the execution crashes due to memory shortage (see Table 2), despite the JVM heap size being set to the highest possible value (1300 MB on a 32-bit system). This is because swapping must be activated to handle this large volume of data. And because access to disk is orders of magnitude slower than access to memory, algorithm execution also becomes very slow.

This dramatic slowdown is not present in the case of the stream algorithm, which always requires the same amount of memory, at the level of 6MB. For small data sizes this algorithm runs a bit slower, because the training data must be generated from scratch in each training cycle. But for large data sizes it can be 40 times better, or even more (the curves in Figures 13 and 14 rise very quickly, so we may suspect that for larger data sizes the disparity between the two variants is even bigger) – the batch variant is actually not usable.

What is also important, every stream implementation of a data mining algorithm can be used in a batch manner by simply preceding it with a buffer in the DPN. Thus, the user can choose the faster variant depending on the data size. On the other hand, a batch implementation cannot be used in a stream-based manner; instead, the algorithm must be redesigned and implemented anew.

7 Conclusions

In this paper we introduced Debellor – a data mining platform with a stream architecture. We presented the concept of data streaming and showed through experimental evaluation that it enables much more efficient processing of large data than the currently used method of batch data transfer. The stream architecture is also more general: every stream-based implementation can be used in a batch manner, but the opposite is not true. Thanks to data streaming, algorithms implemented on the Debellor platform can be scalable and interoperable at the same time.

We also analysed the significance of the scalability issue for the design of composite data mining systems and showed that even when source data are relatively small, lack of memory may still pose a problem, since large volumes of data may be generated at intermediate stages of the data processing network.

The stream architecture also has weaknesses. Because of the sequential access to data, implementation of algorithms may be conceptually more difficult – batch data transfer is more intuitive for the programmer. Moreover, some algorithms may inherently require random access to data. Although they can be implemented in the stream architecture, they have to buffer all data internally, so they will not benefit from streaming. However, these algorithms can still benefit from the interoperability provided by Debellor – they can be connected with other algorithms to form a complex data mining system.

Development of Debellor will be continued. We plan to extend the architecture to handle multi-input and multi-output cells as well as nesting of cells (e.g., to implement meta-learning algorithms). We also want to implement parallel execution of DPNs and serialization of cells (i.e., saving them to a file).

Acknowledgement. The research has been partially supported by grant N N516 368334 from the Ministry of Science and Higher Education of the Republic of Poland and by the grant "Decision support – new generation systems" of the Innovative Economy Operational Programme 2008-2012 (Priority Axis 1. Research and development of new technologies) managed by the Ministry of Regional Development of the Republic of Poland.

References

1. Aggarwal, C.C. (ed.): Data Streams: Models and Algorithms. Springer, Heidelberg (2007)
2. Gama, J., Gaber, M.M. (eds.): Learning from Data Streams: Processing Techniques in Sensor Networks. Springer, Heidelberg (2007)
3. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Prentice Hall, Englewood Cliffs (2002)
4. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. IEEE Computer Vision and Pattern Recognition 1, 511–518 (2001)
5. Bazan, J.G., Szczuka, M.: RSES and RSESlib – a collection of tools for rough set computations. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS, vol. 2005, pp. 106–113. Springer, Heidelberg (2001)
6. Bazan, J.G., Szczuka, M.S., Wojna, A., Wojnarski, M.: On the evolution of rough set exploration system. In: Tsumoto, S., Słowiński, R., Komorowski, J., Grzymala-Busse, J.W. (eds.) RSCTC 2004. LNCS, vol. 3066, pp. 592–601. Springer, Heidelberg (2004)
7. Wojna, A., Kowalski, L.: Rseslib: Programmer's Guide (2008), http://rsproject.mimuw.edu.pl
8. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
9. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2005)
10. Eaton, J.W.: Octave: past, present, and future. In: International Workshop on Distributed Statistical Computing (2001)
11. Eaton, J.W., Rawlings, J.B.: Ten years of Octave – recent developments and plans for the future. In: International Workshop on Distributed Statistical Computing (2003)
12. Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T.: YALE: rapid prototyping for complex data mining tasks. In: KDD 2006: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006)
13. Garcia-Molina, H., Ullman, J.D., Widom, J.: Database System Implementation. Prentice Hall, Englewood Cliffs (1999)
14. Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems: The Complete Book. Prentice Hall, Englewood Cliffs (2001)
15. Arasu, A., Babcock, B., Babu, S., Cieslewicz, J., Datar, M., Ito, K., Motwani, R., Srivastava, U., Widom, J.: STREAM: the Stanford data stream management system (2004), http://dbpubs.stanford.edu:8090/pub/2004-20
16. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: Symposium on Principles of Database Systems, pp. 1–16. ACM Press, New York (2002)
17. Ripley, B.D.: Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge (1996)
18. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
19. Bradley, P.S., Fayyad, U.M., Reina, C.: Scaling clustering algorithms to large databases. In: Knowledge Discovery and Data Mining (1998)
20. Balakrishnan, S., Madigan, D.: Algorithms for sparse linear classifiers in the massive data setting. Journal of Machine Learning Research 9, 313–337 (2008)
21. Amit, Y., Shalev-Shwartz, S., Singer, Y.: Online learning of complex prediction problems using simultaneous projections. Journal of Machine Learning Research 9, 1399–1435 (2008)
22. Furao, S., Hasegawa, O.: An incremental network for on-line unsupervised classification and topology learning. Neural Networks 19, 90–106 (2006)
23. Kivinen, J., Smola, A.J., Williamson, R.C.: Online learning with kernels. IEEE Transactions on Signal Processing 52, 2165–2176 (2004)
24. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Computing Surveys 31, 264–323 (1999)
25. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs (1995)
26. Wojnarski, M.: Absolute contrasts in face detection with AdaBoost cascade. In: Yao, J., Lingras, P., Wu, W.-Z., Szczuka, M.S., Cercone, N.J., Ślęzak, D. (eds.) RSKT 2007. LNCS, vol. 4481, pp. 174–180. Springer, Heidelberg (2007)
27. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998)
28. Kriegel, H.P., Borgwardt, K.M., Kröger, P., Pryakhin, A., Schubert, M., Zimek, A.: Future trends in data mining. Data Mining and Knowledge Discovery 15, 87–97 (2007)
29. Pyle, D.: Data Preparation for Data Mining. Morgan Kaufmann Publishers, San Francisco (1999)
30. Booch, G., Rumbaugh, J., Jacobson, I.: Unified Modeling Language User Guide. Addison-Wesley, Reading (2005)
31. Silberschatz, A., Galvin, P., Gagne, G.: Operating System Concepts, 7th edn. Wiley, Chichester (2004)

Category-Based Inductive Reasoning: Rough Set Theoretic Approach

Marcin Wolski

Department of Logic and Methodology of Science,
Maria Curie-Skłodowska University, Poland
[email protected]

Abstract. The present paper is concerned with rough set theory (RST) and a particular approach to human-like induction, namely the similarity coverage model (SCM). It redefines basic concepts of RST – such as a decision rule and the accuracy and coverage of decision rules – in the light of SCM and explains how RST may be viewed as a similarity-based model of human-like inductive reasoning. Furthermore, following the knowledge-based theory of induction, we enrich RST with the concept of an ontology and, in consequence, present an RST-driven conceptualisation of SCM. The paper also discusses a topological representation of information systems in terms of non-Archimedean structures. It allows us to present an ontology-driven interpretation of finite non-Archimedean nearness spaces and, to some extent, to complete recent papers about RST and the topological concepts of nearness.

1 Introduction

Category-based induction is an approach to human-like inductive reasoning in which both conceptual knowledge and the similarity of objects play the key role. So far this type of reasoning has been a subject of study mainly in ethnobiology, or better still, in cognitive science. In this paper we shall apply the main ideas underlying category-based induction to computer science, especially to rough set theory (RST) [10,12]. This will allow us to introduce some new interesting interpretations of basic concepts and structures from RST and topology.

There are, in general, two basic theories explaining the mechanism of (human-like) induction: the knowledge-based theory and the similarity-based theory. According to the former, induction is driven by a prior categorisation of the given objects, often called conceptual knowledge or an ontology. On this view, people first identify some category of which a given object is an element and then generalise properties to the members of this category, and vice versa. For example, knowing that bluejays require vitamin K for their liver to function, one can generalise that all birds require this vitamin too. On the other hand, the similarity-based theory argues that induction is based on the overall similarity of the compared objects rather than on conceptual knowledge. For example, students from Michigan are reported to conclude – on the basis that skunks have some biological property – that it is more likely that opossums have this property than that bears do. Skunks, however, are taxonomically closer to bears than to opossums [2].


Summing up, according to the knowledge-based approach the generalisation from one object to another is supported by categories (agents ignore the appearances of objects and rely on category membership), whereas according to the similarity-based approach such generalisations are based on perceptual similarity (agents ignore their knowledge about category membership and rely on appearances) [7].

Inductive reasoning which takes into account both conceptual knowledge and the similarity of objects is generally called category-based induction. There are a number of formal models of such reasoning, e.g. [2,6,9,15]. In this paper we shall study the similarity coverage model (SCM) [2,9], mainly for its simplicity and its strong influence on the development of other models. According to SCM, the strength of an inductive argument increases with (a) the degree to which the premise categories are similar to the conclusion category, and (b) the degree to which the premise categories cover the lowest-level knowledge category (e.g. from a taxonomy) that includes both the premise and conclusion categories. Thus, step (a) represents the similarity-based approach, whereas step (b) represents the knowledge-based approach.

The main aim of this paper is to give an account of SCM within the conceptual framework of RST. First, we re-interpret some notions from RST – such as a decision rule, the accuracy and coverage of decision rules, and rough inclusion functions – from the standpoint of SCM. On this view, RST may be regarded as a similarity-based approach to induction. Then we enrich RST with a proper ontology and show how the knowledge-based approach can correct the assessment of decision rules. In consequence, the paper proposes an RST-driven model of category-based induction.

Furthermore, we discuss topological aspects of RST and the category-based theory of induction. More specifically, we examine a topological counterpart of an information system. Usually, information systems have been represented as approximation spaces or approximation topological spaces. In consequence, only the indiscernibility (or similarity) relation induced by the set of all attributes has been considered. In contrast to this approach, we consider all indiscernibility relations, or better still, all partitions, induced by an information system. Mathematically, these partitions induce a non-Archimedean structure which, in turn, gives rise to a topological nearness space. Recently a number of attempts have been made to connect RST with nearness-type structures, e.g. [11,19]. To some extent we complete the previous results and show some intuitive reasons to consider such structures. Specifically, every ontology induced over a non-Archimedean structure is taxonomic.

The paper is organised as follows. Section 2 contains a brief and informal introduction to SCM. Section 3 describes basic concepts from RST which are relevant to inductive reasoning. Section 4 discusses the concepts introduced in Section 3 against the background of inductive reasoning and SCM. Finally, Section 5 presents topological aspects of RST and the category-based approach to induction.

2 Similarity Coverage Model

In this section we informally introduce category-based induction, which has been of special importance mainly for ethnobiology. There are different accounts of such induction – in this paper we focus on the very influential similarity coverage model (SCM) introduced by Osherson et al. [9].


Ethnobiology or folk biology is a branch of cognitive science which studies the ways in which people categorise the local fauna and flora and project their knowledge about a certain category onto other ones [2,6,9,15]. For example, given that bobcats secrete uric acid crystals and cows secrete uric acid crystals, subjects, on the basis that all mammals may have this property, infer that foxes secrete uric acid crystals. According to SCM, the subject performing an induction task first calculates the similarity of the premise categories (i.e. bobcats, cows) to the conclusion category (i.e. foxes). Then the subject calculates the average similarity (coverage) of the premise categories to the superordinate category including both the premise and conclusion categories (i.e. mammals). Let us consider the following example:

    Horses have an ileal vein,
    Donkeys have an ileal vein.
    Gophers have an ileal vein.

This argument is weaker than:

    Horses have an ileal vein,
    Gophers have an ileal vein.
    Cows have an ileal vein.

Of course, the similarity of horses to cows is much higher than the similarity of horses or donkeys to gophers. Thus the strength of an inductive inference depends on the maximal similarity of the conclusion category to some of the premise categories. Now let us shed some light on the coverage principle:

    Horses have an ileal vein,
    Cows have an ileal vein.
    All mammals have an ileal vein.

According to SCM this argument is weaker than the following one:

    Horses have an ileal vein,
    Gophers have an ileal vein.
    All mammals have an ileal vein.

The reason is that the average similarity of horses to other mammals is almost the same as that of cows. In other words, the set H of all animals considered to be similar to horses is almost equal to the set C of all animals similar to cows. Thus the second premise brings us nothing in terms of coverage. By contrast, gophers are similar to different mammals than horses are, and thus this premise makes the coverage higher. That is, the set H ∪ G, where G is the set of all animals similar to gophers, has more elements than the set H ∪ C. Thus, the following inductive inference:

    Horses have an ileal vein,
    All mammals have an ileal vein.

is stronger than:

    Bats have an ileal vein,
    All mammals have an ileal vein.


The range of mammals similar to cows is much wider than the range of mammals similar to bats. One can say that cows are more typical examples of mammals than bats or gophers.

Now, let us summarise the above examples in a more formal way. First, we are given a set of categories C to reason about. This set is provided with a binary "kind of" relation K, which is acyclic and thus irreflexive and asymmetric. We call K taxonomic if and only if it is transitive and no item is of two distinct kinds.

Definition 1. A transitive relation K is taxonomic over C iff for any a, b, c ∈ C such that aKb and aKc, it holds that b = c or bKc or cKb.

For example, collie is a kind of dog and dog is a kind of mammal. Items x ∈ C such that there is no t satisfying tKx constitute the basic categories. An example of a non-taxonomic relation is as follows: wheelchair is a kind of furniture and a kind of vehicle. Now, neither furniture = vehicle nor furniture K vehicle nor vehicle K furniture.

Subjects reasoning about C are additionally provided with a default notion of similarity R defined on the basic categories C_Basic, i.e. the minimal elements of C with respect to K. People usually assume that R is at least reflexive and symmetric. Very often R is represented as an equivalence relation, that is, R is additionally transitive. Given that c_1 ∈ C_Basic has a property p, a subject may infer that c_2 ∈ C_Basic also satisfies p, if there exists c_3 ∈ C such that c_1 K c_3, c_2 K c_3 and {c ∈ C_Basic : c_1 Rc} is a "substantial" subset of {c ∈ C_Basic : cKc_3}. Informally speaking, one can transfer knowledge from a category c_1 to c_2 if the set of all elements considered to be similar to c_1 is a substantial subset of the set of all C_Basic-instantiations of the minimal taxonomic category c_3 whose examples are c_1 and c_2. Summing up, one can say that (C, K) represents gathered information, while R is an inductive "engine" making inferences about unknown features of objects belonging to C. A computational sketch of the two SCM components follows.
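The two SCM components – maximal similarity and coverage – can be made computational. The following sketch is our own formalization, not part of the paper: similarity is assumed to be given as a symmetric function on basic categories, and the two components are combined as a weighted sum (the weight alpha is an assumption; Section 2 only states that argument strength increases with both components):

    import java.util.List;
    import java.util.function.BiFunction;

    public class Scm {
        /** Maximal similarity of the conclusion category to any premise category. */
        static double maxSimilarity(List<String> premises, String conclusion,
                                    BiFunction<String, String, Double> sim) {
            double best = 0.0;
            for (String p : premises)
                best = Math.max(best, sim.apply(p, conclusion));
            return best;
        }

        /** Average, over the superordinate category's basic members, of their
            maximal similarity to the premises ("coverage" in SCM terms). */
        static double coverage(List<String> premises, List<String> superordinate,
                               BiFunction<String, String, Double> sim) {
            double total = 0.0;
            for (String member : superordinate)
                total += maxSimilarity(premises, member, sim);
            return total / superordinate.size();
        }

        /** Argument strength as a weighted sum of the two components (assumed form). */
        static double strength(List<String> premises, String conclusion,
                               List<String> superordinate, double alpha,
                               BiFunction<String, String, Double> sim) {
            return alpha * maxSimilarity(premises, conclusion, sim)
                 + (1 - alpha) * coverage(premises, superordinate, sim);
        }
    }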

3 Rough Set Theory

In this section we briefly recall basic notions from RST which are relevant to inductive reasoning. We start by introducing the concept of an information system, then we discuss decision rules and different measures of their strength. We conclude by recalling some notions from the rough–mereological approach.

Definition 2. An information system is a quadruple ⟨U, A, V, f⟩ where:

– U is a non–empty finite set of objects;
– A is a non–empty finite set of attributes;
– V = ⋃_{a∈A} Va, where Va is the value–domain of the attribute a;
– f : U × A → V is an information function such that for all a ∈ A and u ∈ U, f(u, a) ∈ Va.

It is often useful to view an information system ⟨U, A, V, f⟩ as a decision table, assuming that A = C ∪ D and C ∩ D = ∅, where C is a set of conditional attributes and D is a set of decision attributes. For example, Fig. 1 presents a decision table where:
– U = {Beaver, Squirrel, Mouse, Muskrat, Otter, Skunk},
– C = {Environment, Diet, Tail, Size},


Animals  | Environment  | Diet        | Tail      | Size       | Poland
---------+--------------+-------------+-----------+------------+-------
Beaver   | semi-aquatic | herbivorous | flattened | medium     | yes
Squirrel | terrestrial  | omnivorous  | round     | small      | yes
Mouse    | terrestrial  | omnivorous  | round     | very small | yes
Muskrat  | semi-aquatic | omnivorous  | round     | medium     | yes
Otter    | semi-aquatic | carnivorous | round     | medium     | yes
Skunk    | terrestrial  | omnivorous  | round     | medium     | no

Fig. 1. An example of a dataset

– D = {Poland},
– e.g. VDiet = {herbivorous, carnivorous, omnivorous} for Diet ∈ C.

Each subset of attributes S ⊆ A determines an equivalence relation IND(S) ⊆ U × U defined as follows:

IND(S) = {(u, v) : (∀a ∈ S) f(u, a) = f(v, a)}.

As usual, IND(S) is called the indiscernibility relation induced by S, the partition induced by the relation IND(S) is denoted by U/IND(S), and [u]S denotes the equivalence class of IND(S) defined by u ∈ U. For instance, if S = {Environment, Diet}, then U/IND(S) = {{Beaver}, {Squirrel, Mouse, Skunk}, {Muskrat}, {Otter}}. Obviously, U/IND(A) refines every other partition U/IND(S), where S ⊆ A. Furthermore,

[u]_A = ⋂_{S⊆A} [u]_S.

Intuitively, any subset X ⊆ U which can be defined by a formula of some knowledge representation language L is a concept in L. For example, one can use the following simple descriptor language, say LDesc, based on a given information system ⟨U, A, V, f⟩:

fml ::= [a = val] | ¬fml | fml ∧ fml | fml ∨ fml

where a ∈ A and val ∈ Va. We say that α ∈ LDesc is a formula over C if all attributes a occurring in α belong to C. For any formula α ∈ LDesc, |α| denotes the meaning of α in U, i.e. the concept in LDesc which is defined as follows:

– If α is of the form [a = val], then |α| = {u ∈ U : f(u, a) = val};
– |¬α| = U \ |α|, |α ∧ β| = |α| ∩ |β|, |α ∨ β| = |α| ∪ |β|.

For example, for α = [Poland = no] we have |α| = {Skunk}. Let α be a formula of LDesc over C and β a formula over D. Then the expression α ⇒ β is called a decision rule if |α| ∩ |β| ≠ ∅.

Definition 3. Let α ⇒ β be a decision rule and let Card(B) denote the cardinality of the set B. Then the accuracy Accα(β) and the coverage Covα(β) of α ⇒ β are defined as follows:

Accα(β) = Card(|α| ∩ |β|) / Card(|α|)   and   Covα(β) = Card(|α| ∩ |β|) / Card(|β|).

Example 1. Let us assume that α = [Environment = semi-aquatic] and β = [Poland = yes]. Then α ⇒ β is a decision rule over the information system depicted in Fig. 1, Accα(β) = 3/3 = 1, and Covα(β) = 3/5.

It is worth emphasising that if Accα(β) = 1, then u ∈ |β| whenever u ∈ |α|. On the other hand, if Covα(β) = 1, then u ∈ |α| whenever u ∈ |β|. Thus, Accα(β) measures the sufficiency of α ⇒ β, whereas Covα(β) measures the necessity of α ⇒ β; for details see e.g. [16]. Later, several attempts were made to introduce other measures of how good a given decision rule is. However, the meaning of these measures remains fixed. For a given decision rule α ⇒ β they answer the following question:

Given that u ∈ |α|, what is the chance that u ∈ |β|?   (1)
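As an illustration of Definition 3 and Example 1, the following Python sketch (ours, not part of the original paper) computes both measures over the Fig. 1 dataset; the encoding of the dataset and the helper names are assumptions of the sketch.

```python
# The dataset of Fig. 1, encoded as object -> attribute-value map.
DATA = {
    "Beaver":   {"Environment": "semi-aquatic", "Diet": "herbivorous",
                 "Tail": "flattened", "Size": "medium", "Poland": "yes"},
    "Squirrel": {"Environment": "terrestrial", "Diet": "omnivorous",
                 "Tail": "round", "Size": "small", "Poland": "yes"},
    "Mouse":    {"Environment": "terrestrial", "Diet": "omnivorous",
                 "Tail": "round", "Size": "very small", "Poland": "yes"},
    "Muskrat":  {"Environment": "semi-aquatic", "Diet": "omnivorous",
                 "Tail": "round", "Size": "medium", "Poland": "yes"},
    "Otter":    {"Environment": "semi-aquatic", "Diet": "carnivorous",
                 "Tail": "round", "Size": "medium", "Poland": "yes"},
    "Skunk":    {"Environment": "terrestrial", "Diet": "omnivorous",
                 "Tail": "round", "Size": "medium", "Poland": "no"},
}

def meaning(attr, val):
    """|[attr = val]|: the extension of a descriptor in U."""
    return {u for u, row in DATA.items() if row[attr] == val}

def accuracy(ext_a, ext_b):
    """Acc: Card(|alpha| ∩ |beta|) / Card(|alpha|)."""
    return len(ext_a & ext_b) / len(ext_a)

def coverage(ext_a, ext_b):
    """Cov: Card(|alpha| ∩ |beta|) / Card(|beta|)."""
    return len(ext_a & ext_b) / len(ext_b)

alpha = meaning("Environment", "semi-aquatic")
beta = meaning("Poland", "yes")
print(accuracy(alpha, beta), coverage(alpha, beta))  # 1.0 0.6, as in Example 1
```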

In the evolution of RST, the rough–mereological approach [13] is of special importance. This approach is based on an inclusion function, called a rough inclusion function (RIF), which generalises the fuzzy set and rough set approaches. Generally speaking, RIFs measure the degree of inclusion of a set of objects X in a set of objects Y. In this paper we follow the definition of RIF proposed in [5]:

Definition 4. A RIF upon U is any function κ : 2^U × 2^U → [0, 1] such that:
– (∀X, Y)(κ(X, Y) = 1 ⇔ X ⊆ Y),
– (∀X, Y, Z)(Y ⊆ Z ⇒ κ(X, Y) ≤ κ(X, Z)).

The most famous RIF is the so-called standard RIF, denoted by κ£, which is based on J. Łukasiewicz's ideas concerning the probability of truth of propositional formulas:

κ£(X, Y) = Card(X ∩ Y) / Card(X) if X ≠ ∅, and κ£(X, Y) = 1 otherwise.

Another RIF, κ1, which is really interesting in the context of induction, was proposed by A. Gomolińska in [5]:

κ1(X, Y) = Card(Y) / Card(X ∪ Y) if X ∪ Y ≠ ∅, and κ1(X, Y) = 1 otherwise.

As one can easily observe, for any X, Y ⊆ U and any decision rule α ⇒ β,

κ1(X, Y) = κ£(X ∪ Y, Y), Accα(β) = κ£(|α|, |β|), and Covα(β) = κ£(|β|, |α|).
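A minimal sketch of the two RIFs (ours, with hypothetical sets), together with a check of the identity κ1(X, Y) = κ£(X ∪ Y, Y) stated above:

```python
def kappa_std(X, Y):
    """Standard (Lukasiewicz) RIF: Card(X ∩ Y)/Card(X), or 1 when X is empty."""
    return len(X & Y) / len(X) if X else 1.0

def kappa_1(X, Y):
    """Gomolinska's RIF: Card(Y)/Card(X ∪ Y), or 1 when X ∪ Y is empty."""
    return len(Y) / len(X | Y) if X | Y else 1.0

X, Y = {1, 2, 3}, {3, 4}
assert kappa_1(X, Y) == kappa_std(X | Y, Y) == 0.5  # both equal 2/4
```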

Summing up, many ideas from RST are based upon the notion of RIF. In what follows, we shall be interested in the question of how RIFs can be used to assess the strength of decision rules and their generalisations. It is worth noting that we view these rules from the perspective of inductive reasoning and, in consequence, we change their standard interpretation.

4 Inductive Reasoning: RST Approach

Now, let us examine the above ideas from RST against the background of SCM. As said earlier, each formula α ∈ LDesc represents a concept in LDesc. It is easy to observe that |[a = val]| ∈ U/IND({a}) and

|α| = ⋃A, for some A ⊆ U/IND(A).

Thus, elements of U/IND(A) can be regarded as atomic concepts, and any other concept in LDesc can be built by means of atomic concepts and ∪. Furthermore, any formula α will be regarded as a concept name or, better still, a category. Generally speaking, SCM tries to answer the question of how safe it is to transfer knowledge about a value val of some attribute a from one category α to another category β. In other words,

given that |α| ⊆ |[a = val]|, what is the chance that |β| ⊆ |[a = val]|?   (2)

Observe that this question, in contrast to (1), makes sense even for a rule α ⇒ β such that |α| ∩ |β| = ∅. We shall call such a rule an inductive rule. Furthermore, the examples from Section 2 require multi-premise inductive rules, represented by expressions α, β, γ ⇒ δ, rather than simple rules of the form α ⇒ δ. Let us recall that in Gentzen's multi-conclusion sequent calculus α, β, γ ⇒ δ means that δ follows from α ∧ β ∧ γ. However, in SCM-like inductive reasoning, α, β, γ ⇒ δ means that δ follows from α ∨ β ∨ γ. Indeed, the following decision rule

[Size = very small], [Size = small] ⇒ [Poland = yes]   (3)

based on the dataset from Fig. 1, where

|[Size = very small]| = {Mouse}, |[Size = small]| = {Squirrel},
|[Poland = yes]| = {Beaver, Squirrel, Mouse, Muskrat, Otter},


might represent the following inductive inference:

Mice have an ileal vein, Squirrels have an ileal vein,
All animals living in Poland have an ileal vein.

As one can easily observe, |[Size = very small] ∧ [Size = small]| = ∅, so the conjunctive interpretation of the premises would lead us to wrong conclusions. Thus, given the rule α, β, γ ⇒ δ, we shall regard the premises as the category α ∨ β ∨ γ representing the concept |α ∨ β ∨ γ| in LDesc.

Now we can answer question (2). According to SCM we should (a) compute the similarity of the premise category to the conclusion category, and (b) compute the degree to which the premise category covers the lowest-level knowledge category that includes both the premise and conclusion categories. Intuitively, identity is the highest level of similarity. It is easy to observe that |α| = |β| iff Accα(β) = 1 and Covα(β) = 1. Thus, the measures Accα(β) and Covα(β) taken together tell us to what extent the categories |α| and |β| are similar. In this paper we use the following measure, denoted by Sim, which was first introduced by S. Kulczyński in the context of clustering methods [8]; see also [1]:

Sim(α, β) = (Accα(β) + Covα(β)) / 2.

It is easy to compute that for α ⇒ β from Example 1, Sim(α, β) = 4/5. Now, let us consider step (b) of SCM. It can be formalised as follows:

Cov(α, β) = Card(|α|) / Card(|C|),

where α ⇒ β is an inductive rule and C represents the smallest category from the underlying ontology O containing both the premise and conclusion categories, i.e. |α| ∪ |β| ⊆ |C|. Since RST assumes that any formula α ∈ LDesc representing a concept |α| in LDesc is a category, the smallest category containing both α and β is α ∨ β. In other words, RST assumes the richest possible ontology, representing all concepts definable in LDesc. Thus, we have

CovRST(α, β) = Card(|α|) / Card(|α| ∪ |β|).

Observe that for α ⇒ β from Example 1, CovRST(α, β) = κ1(|β|, |α|). Thus, assessing the strength of a rule α ⇒ β consists in computing the values of a couple of RIFs. In our example, for the decision rule α ⇒ β defined as above, we have CovRST(α, β) = 3/5.
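A small Python sketch (ours; the helper names are assumptions) of the two SCM steps, evaluated on the extensions from Example 1:

```python
def acc(A, B): return len(A & B) / len(A)
def cov(A, B): return len(A & B) / len(B)

def sim(A, B):
    """Kulczynski similarity: (Acc + Cov) / 2."""
    return (acc(A, B) + cov(A, B)) / 2

def cov_rst(A, B):
    """Cov_RST: Card(|alpha|) / Card(|alpha| ∪ |beta|)."""
    return len(A) / len(A | B)

alpha = {"Beaver", "Muskrat", "Otter"}                      # |[Environment = semi-aquatic]|
beta = {"Beaver", "Squirrel", "Mouse", "Muskrat", "Otter"}  # |[Poland = yes]|
print(sim(alpha, beta), cov_rst(alpha, beta))  # 0.8 0.6, i.e. 4/5 and 3/5
```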


Some comments may be useful here. The standard RIF κ£ is the most popular RIF in the rough-set community, the main reason seemingly being the clarity of its interpretation. On the other hand, RIFs like κ1 or κ2 lack an obvious interpretation. Our present inquiry into RST against the background of SCM provides us with an intuitive meaning at least for κ1: it computes the coverage of the premise category with respect to the smallest category containing both the premise and conclusion categories (relative to the ontology representing all concepts in LDesc). It is quite likely that other models of inductive reasoning may bring us new interpretations of other RIFs as well.

Let us now return to issues concerning the underlying ontology O. First, RST assumes that the ontology O represents all concepts in LDesc, which, in turn, belong to some U/IND(S) for S ⊆ A. That is, atomic concepts are indiscernibility classes and all other concepts are built up from them. Second, O is in fact the Boolean algebra ⟨U, ∪, ∩, \⟩ generated by U/IND(A). Thus, what actually contributes to induction is the indiscernibility relation IND(A): when you know IND(A), you also know the corresponding RST ontology O. On this view, RST may be regarded as a kind of similarity-based approach to inductive reasoning. Only similarity (indiscernibility) classes affect induction and, in consequence, only step (a) of SCM (i.e. the similarity-based step) is performed. Since step (b) of SCM assumes additional conceptual knowledge, applying it to the RST ontology, which actually brings us nothing more than IND(A), may lead to wrong results.

Example 2. Consider the information system given by Fig. 1. Let

α = [Environment = terrestrial] ∧ [Diet = omnivorous] ∧ [Tail = round]

and β = [Poland = no], so that ¬β is equivalent to [Poland = yes]. Then α ⇒ β is a decision rule, and Accα(β) = 1/3, Covα(β) = 1/1 = 1, Sim(α, β) = 2/3, and CovRST(α, β) = 3/3 = 1. Also α ⇒ ¬β is a decision rule, for which we have Accα(¬β) = 2/3, Covα(¬β) = 2/5, Sim(α, ¬β) = 8/15, and CovRST(α, ¬β) = 3/6. Observe that, according to the above measures, the argument represented by α ⇒ β is stronger than α ⇒ ¬β (50/30 and 31/30, respectively). However, our intuition suggests the opposite ranking, e.g.:

Skunks have a property P, Mice have a property P, Squirrels have a property P,
All animals not living in Poland have a property P.


Skunks have a property P, Mice have a property P, Squirrels have a property P,
All animals living in Poland have a property P.

Given that mice and squirrels live in Poland but skunks do not, it is obvious that the second argument should be recognised as stronger than the first one. In order to correct our result we have to consider a proper ontology O, which brings us new information about U. Let us look at the scientific ontology given by Fig. 2, which is built for the dataset from Fig. 1 – in fact, it is a fragment of the well-known biological taxonomy. In this case, the smallest concept containing both |α| and |β| is the set of all objects from the dataset. Thus, Cov(α, β) = 1/2 and the overall result for α ⇒ β is 7/6. The same computation for α ⇒ ¬β brings us Cov(α, ¬β) = 1/2, and the overall result is 31/30. This time the strength of both arguments is quite similar: the difference equals 4/30 in favour of the first argument. Thus, this time we have obtained a better result. Our example also shows that all categories used in induction should have proper extensions in the given dataset. For instance, the categories not living in Poland and skunk represent the same concept {Skunk}, which actually makes the first argument stronger than the second one even under the scientific ontology. Observe also that in the case of the scientific ontology beavers and squirrels belong to the same family, yet they differ on all conditional attributes. Thus, this ontology really brings new knowledge about the dataset. However, sometimes it is better to have an ontology which reflects the knowledge already encoded by means of the attributes from A. For example, the taxonomy in Fig. 3 represents the way people could categorise the animals from Fig. 1. Which ontology is more useful depends on the features we want to reason about. For instance, in ethnobiology it is widely agreed that the scientific ontology is better for reasoning about hidden properties of animals, whereas the common-sense ontology is better for reasoning about their behaviour.

Fig. 2. The scientific taxonomy for the dataset: ORDER {Beaver, Squirrel, Mouse, Muskrat, Otter, Skunk}; SUBORDERS {Beaver, Squirrel, Mouse, Muskrat} and {Otter, Skunk}; FAMILIES {Beaver, Squirrel}, {Mouse, Muskrat}, and {Otter, Skunk}.

Fig. 3. A common-sense taxonomy for the dataset: Folk ORDER {Beaver, Squirrel, Mouse, Muskrat, Otter, Skunk}; Folk SUBORDERS {Beaver, Otter, Squirrel, Mouse, Muskrat} and {Skunk}; Folk FAMILIES {Beaver, Muskrat, Otter}, {Mouse, Squirrel}, and {Skunk}.


Thus, the ontology must be carefully chosen with respect to the goal properties. As said above, the common-sense ontology is mainly based on attributes of objects. On the basis of this fact, one can regard concept lattices from formal concept analysis (FCA) [17,18] as such ontologies. Let us recall that any binary relation R ⊆ U × V induces two operators:

R⁺(A) = {b ∈ V : (∀a ∈ A) (a, b) ∈ R},
R⁺(B) = {a ∈ U : (∀b ∈ B) (a, b) ∈ R}.

Definition 5. A concept induced by R ⊆ U × V is a pair (A, B), where A ⊆ U and B ⊆ V, such that A = R⁺(B) and B = R⁺(A). A set A ⊆ U is called an extent if A = R⁺(R⁺(A)); similarly, if B ⊆ V is such that B = R⁺(R⁺(B)), then B is called an intent.

The set of all concepts of any information system is a complete lattice [17,18]. Since the lattice induced by our dataset from Fig. 1 is quite complicated, we present here only a list of concepts (see Fig. 4) instead of the Hasse diagram. As one can see, it is quite a large ontology when compared with the common-sense ontology. As a consequence, for a dataset as small as the one in this paper, the results are very similar to those obtained for the RST ontology. However, for large datasets the results may differ substantially. Checking how useful FCA ontologies are for inductive reasoning is a task for future work.

1. <{Beaver}, {semi-aquatic, herbivorous, flattened, medium, yes}>
2. <{Squirrel}, {terrestrial, omnivorous, round, small, yes}>
3. <{Mouse}, {terrestrial, omnivorous, round, very small, yes}>
4. <{Muskrat}, {semi-aquatic, omnivorous, round, medium, yes}>
5. <{Otter}, {semi-aquatic, carnivorous, round, medium, yes}>
6. <{Skunk}, {terrestrial, omnivorous, round, medium, no}>
7. <{Beaver, Squirrel, Mouse, Muskrat, Otter}, {yes}>
8. <{Beaver, Muskrat, Otter}, {semi-aquatic, medium, yes}>
9. <{Beaver, Muskrat, Otter, Skunk}, {medium}>
10. <{Squirrel, Mouse}, {terrestrial, omnivorous, round, yes}>
11. <{Squirrel, Mouse, Muskrat}, {omnivorous, round, yes}>
12. <{Squirrel, Mouse, Muskrat, Otter}, {round, yes}>
13. <{Squirrel, Mouse, Skunk}, {terrestrial, omnivorous, round}>
14. <{Muskrat, Otter}, {semi-aquatic, round, medium, yes}>
15. <{Muskrat, Skunk}, {omnivorous, round, medium}>
16. <{Squirrel, Mouse, Muskrat, Skunk}, {omnivorous, round}>
17. <{Muskrat, Otter, Skunk}, {round, medium}>
18. <{Squirrel, Mouse, Muskrat, Otter, Skunk}, {round}>
19. <{Beaver, Squirrel, Mouse, Muskrat, Otter, Skunk}, {}>
20. <{}, {semi-aquatic, terrestrial, herbivorous, omnivorous, carnivorous, flattened, round, very small, small, medium, yes, no}>

Fig. 4. Concepts induced by the dataset from Fig. 1
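The list in Fig. 4 can be recomputed mechanically from the operators of Definition 5. The following sketch (ours; the encoding of the incidence relation and the function names are assumptions) closes every subset of objects and collects the resulting concepts:

```python
from itertools import combinations

# Incidence relation R ⊆ U × V from Fig. 1: object -> set of attribute values.
R = {
    "Beaver":   {"semi-aquatic", "herbivorous", "flattened", "medium", "yes"},
    "Squirrel": {"terrestrial", "omnivorous", "round", "small", "yes"},
    "Mouse":    {"terrestrial", "omnivorous", "round", "very small", "yes"},
    "Muskrat":  {"semi-aquatic", "omnivorous", "round", "medium", "yes"},
    "Otter":    {"semi-aquatic", "carnivorous", "round", "medium", "yes"},
    "Skunk":    {"terrestrial", "omnivorous", "round", "medium", "no"},
}
U, V = set(R), set().union(*R.values())

def up(A):    # R+(A): values shared by every object in A
    return set.intersection(*(R[a] for a in A)) if A else set(V)

def down(B):  # R+(B): objects possessing every value in B
    return {a for a in U if B <= R[a]}

# Closing each subset of U yields every concept extent, hence every concept.
concepts = set()
for k in range(len(U) + 1):
    for A in combinations(sorted(U), k):
        extent = frozenset(down(up(set(A))))
        concepts.add((extent, frozenset(up(extent))))

print(len(concepts))  # 20 concepts, as listed in Fig. 4
```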


Summing up this section, let us say a few words about inductive rules. As said above, under the interpretation expressed by question (2), apart from decision rules, inductive rules also make sense. Let α ∈ LDesc be a description of beavers, i.e. |α| = {Beaver}, β ∈ LDesc a description of squirrels, |β| = {Squirrel}, and γ a description of skunks, |γ| = {Skunk}. Then |α| ∩ |β| = ∅, |α| ∩ |γ| = ∅, and both α ⇒ β and α ⇒ γ are inductive rules. In consequence, Sim(α, β) = Sim(α, γ) = 0. Observe also that CovRST cannot distinguish these rules either, for we have CovRST(α, β) = CovRST(α, γ) = 1/2. However, under the scientific ontology (Fig. 2), Cov(α, β) = 1/2, whereas Cov(α, γ) = 1/6.

5 Induction over Nearness Spaces

In this section we consider a topological counterpart of RST enriched by the concept of ontology. Recently, a number of attempts have been made to connect RST with nearness-type structures, e.g. [11,19]. These structures (such as nearness spaces or merotopic spaces) are actually quite abstract, and this section aims to provide the reader with an intuitive interpretation of them. We start with some ideas concerning RST and inductive reasoning, and then we develop them into a nearness space.

An information system ⟨U, A, V, f⟩ is often represented as an approximation space (U, IND(A)), that is, a non-empty set U equipped with an equivalence relation. This representation allows one to connect RST with relational structures which underlie many branches of mathematics, e.g. topology, logic, or universal algebra. Here we would like to change this approach and consider IND(S) for all S ⊆ A.

Definition 6. Let A, B ⊆ 2^X; then the refinement relation ⪯ is defined by:

A ⪯ B ⇔ (∀A ∈ A)(∃B ∈ B) A ⊆ B.

Obviously, for any information system ⟨U, A, V, f⟩, U/IND(A) refines every other partition U/IND(S), for all S ⊆ A. A simple mathematical structure which generalises this observation is called a non-Archimedean structure.

Definition 7. A non-Archimedean structure μ on a non-empty set U is a set of partitions of U satisfying:

A ⪯ B and A ∈ μ ⇒ B ∈ μ,

and the couple (U, μ) is called a non-Archimedean space.

Let IND_S = {U/IND(S) : S ⊆ A}. Observe that (U, IND_S) may fail to be a non-Archimedean space. Take as an example the dataset from Fig. 1 and consider the partition P = {{Beaver}, {Squirrel, Mouse, Muskrat, Otter, Skunk}}. Then U/IND(A) ⪯ P, yet there is no S ⊆ A such that U/IND(S) = P. Furthermore, any formula α ∈ LDesc induces a partition Pα = {|α|, |¬α|} of U, and U/IND(A) ⪯ Pα. For example, when α = [Diet = herbivorous], then Pα = P. Thus, what we actually need is the non-Archimedean structure IND_A induced by U/IND(A):

IND_A = {P : P is a partition of U and U/IND(A) ⪯ P}.
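The refinement relation of Definition 6 is straightforward to check computationally; a small sketch (ours), using the partition P discussed above:

```python
def refines(P, Q):
    """P ⪯ Q: every block of P is contained in some block of Q."""
    return all(any(a <= b for b in Q) for a in P)

# U/IND(A) for Fig. 1 consists of singletons (all six animals are discernible),
# so it refines the coarser partition P = {{Beaver}, {the other five}}.
animals = {"Beaver", "Squirrel", "Mouse", "Muskrat", "Otter", "Skunk"}
ind_A = [frozenset({a}) for a in animals]
P = [frozenset({"Beaver"}), frozenset(animals - {"Beaver"})]
print(refines(ind_A, P))  # True
```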


Proposition 1. Let ⟨U, A, V, f⟩ be an information system, and let C = {|α| : α ∈ LDesc} be the set of all non-empty concepts in LDesc. Then C = ⋃ IND_A.

Proof. We have to prove that C ⊆ ⋃ IND_A and ⋃ IND_A ⊆ C. First, for every non-empty |α| in LDesc it holds that U/IND(A) ⪯ Pα, and thus C ⊆ ⋃ IND_A. Second, assume that C ∈ ⋃ IND_A. This means that C ∈ P for some partition P ∈ IND_A. Since U/IND(A) ⪯ P, it follows that C = ⋃A for some A ⊆ U/IND(A). Every element Ci of A is a concept in LDesc for some αi ∈ LDesc, and thus C is a concept in LDesc for α1 ∨ α2 ∨ . . . ∨ αn. Hence ⋃ IND_A ⊆ C.

In other words, all non-empty concepts in LDesc belong to some partition of the non-Archimedean structure IND_A, and every element of such a partition is a concept in LDesc. Since an ontology is a subset of the family of all concepts in LDesc, the space (U, IND_A) sets the stage for conceptual knowledge about U.

Definition 8. Let ⟨U, A, V, f⟩ be an information system. Then by an ontology OA over (U, IND_A) we mean an ordered set of partitions (P, ⪯) such that P ⊆ IND_A and for all Pi, Pj ∈ P it holds that Pi ≠ Pj for i ≠ j.

We say that C is a concept from an ontology OA = (P, ⪯) if C ∈ ⋃ P. In other words, C is a concept from OA if there is a partition P ∈ P such that C ∈ P. The set of all concepts from OA will be denoted by C_OA.

Example 3. The scientific taxonomy from Fig. 2 can be represented as OA = {P1, P2, P3, P4}, where

P1 = {{Beaver}, {Squirrel}, {Mouse}, {Muskrat}, {Otter}, {Skunk}},
P2 = {{Beaver, Squirrel}, {Mouse, Muskrat}, {Otter, Skunk}},
P3 = {{Beaver, Squirrel, Mouse, Muskrat}, {Otter, Skunk}},
P4 = {{Beaver, Squirrel, Mouse, Muskrat, Otter, Skunk}}.

Definition 9. For all C1, C2 ∈ C_OA, we say that C1 K C2 iff C1 ⊆ C2 and there exist Pi, Pj ∈ OA such that C1 ∈ Pi, C2 ∈ Pj and Pi ⪯ Pj.

Proposition 2. The relation K is taxonomic over C_OA for every ontology OA induced by the non-Archimedean space (U, IND_A).

Proof. It follows from the definition of K and the definition of ontology.

For an information system ⟨U, A, V, f⟩ the associated taxonomy over OA will be denoted by (C_OA, K). In order to generalise this description to non-taxonomic ontologies it suffices to define the ontology over a family of covers.


Definition 10. stack μ = {B ⊆ 2^X : (∃A ∈ μ) A ⪯ B}.

First, stack IND_A is a family of covers of U. Second, for an information system ⟨U, A, V, f⟩, as a generalised ontology we take an ordered set of covers (P, ⪯).

Definition 11. Let ⟨U, A, V, f⟩ be an information system. Then by a generalised ontology GOA over (U, IND_A) we mean an ordered set of covers (P, ⪯) such that P ⊆ stack IND_A and for all Pi, Pj ∈ P it holds that Pi ≠ Pj for i ≠ j.

Since stack IND_A provides the most general stage for inductive reasoning, we now examine it in some detail. First, observe that for an information system ⟨U, A, V, f⟩ it holds that:

stack IND_A = stack {U/IND(A)}.

Thus, {U/IND(A)} suffices to generate the whole of stack IND_A. Furthermore, the stack operation allows us to connect IND_A with nearness-type structures.

Definition 12. Let X be a set and ν a non-empty set of coverings of X such that:

A ⪯ B and A ∈ ν ⇒ B ∈ ν.

Then (X, ν) is called a pre-nearness space. When stack Eν = ν, for Eν = {P ∈ ν : P is a partition of X}, then (X, ν) is called a non-Archimedean pre-nearness space and Eν is its base.

Thus, the non-Archimedean structure IND_A on U is a base of the non-Archimedean pre-nearness space (U, stack IND_A).

Definition 13. Let (X, ν) be a pre-nearness space such that:

A ∈ ν and B ∈ ν ⇒ {A ∩ B : A ∈ A and B ∈ B} ∈ ν.

Then (X, ν) is called a merotopic space.

Definition 14. A merotopic space (X, ν) which satisfies:

A ∈ ν ⇒ {Intν(A) : A ∈ A} ∈ ν, where Intν(A) = {x ∈ X : {A, X \ {x}} ∈ ν},

is called a nearness space.

Proposition 3. Let ⟨U, A, V, f⟩ be an information system. Then (U, stack IND_A) is a non-Archimedean nearness space.

Proof. In order not to overload the paper with definitions, we give just a hint of how to prove this theorem. First, as is well known, every partition star-refines itself. Therefore (U, stack IND_A) is a uniform pre-nearness space. Second, every uniform pre-nearness space satisfies Definition 14; see, e.g., [4] for the proof. Finally, since every cover in stack IND_A is refined by the partition U/IND(A), the space (U, stack IND_A) is closed under intersections, as required by Definition 13. Thus, (U, stack IND_A) is a non-Archimedean nearness space.


Observe that this very simple description of SCM in terms of subsets of U has led us to (U, stack IND_A) as a proper stage for human-like inductive reasoning. Surprisingly, this stage is nothing more than a representation of a basic concept of RST. Let us recall that any information system ⟨U, A, V, f⟩ may be regarded as a finite approximation space (U, IND(A)), and in many cases this representation is more convenient, e.g. in algebraic investigations into RST. Actually, the same remark may be applied to (U, stack IND_A).

Proposition 4. Let U be a non-empty finite set. Then there is a one-to-one correspondence between finite approximation spaces (U, E) and non-Archimedean nearness spaces (U, ν) over U.

Proof. For the same reason as above, we give only a sketch of the proof. Every finite non-Archimedean nearness space (U, ν) is induced by a partition P. Since P is the minimal open basis for the topology induced by Intν, it follows that (U, ν) is a topological nearness space. On the other hand, every finite topological nearness space (U, ν) has a minimal open basis P for the topology induced by Intν. Since Intν is symmetric, P is a partition and thus (U, ν) is a non-Archimedean nearness space. Finally, there is a one-to-one correspondence between finite topological nearness spaces and finite approximation spaces; see also [19].

Thus, non-Archimedean nearness spaces over finite sets may be considered as yet another special representation of information systems. Approximation spaces are useful when one considers, e.g., relational structures and modal logics, whereas nearness spaces are suitable for ontologies and inductive reasoning.

6 Final Remarks

The article presents an account of preliminary results concerning Rough Set Theory (RST) and the Similarity Coverage Model of category-based induction (SCM). In the first part of this paper we have shown how decision rules may be regarded as induction tasks and how rough inclusion functions may be used to compute the strength of inductive reasoning. In the second part we have presented a model of SCM based on non-Archimedean structures and non-Archimedean nearness spaces. Recently, a number of attempts have been made to connect RST with nearness-type structures, e.g. [11,19]. Thus, the paper has presented some intuitive reasons to consider these abstract topological spaces. The model based on a non-Archimedean space has the nice property that every ontology over it is taxonomic.

Acknowledgement. The research has been supported by grant N N516 368334 from the Ministry of Science and Higher Education of the Republic of Poland and by a grant from the Innovative Economy Operational Programme 2007-2013 (Priority Axis 1. Research and development of new technologies) managed by the Ministry of Regional Development of the Republic of Poland.


References

1. Albatineh, A., Niewiadomska-Bugaj, M., Mihalko, D.: On Similarity Indices and Correction for Chance Agreement. Journal of Classification 23, 301–313 (2006)
2. Atran, S.: Classifying Nature Across Cultures. In: Osherson, D., Smith, E. (eds.) An Invitation to Cognitive Science. Thinking, pp. 131–174. MIT Press, Cambridge (1995)
3. Deses, D., Lowen-Colebunders, E.: On Completeness in a Non-Archimedean Setting via Firm Reflections. Bulletin of the Belgian Mathematical Society, Special volume: p-adic Numbers in Number Theory, Analytic Geometry and Functional Analysis, 49–61 (2002)
4. Deses, D.: Completeness and Zero-dimensionality Arising from the Duality Between Closures and Lattices. Ph.D. Thesis, Free University of Brussels (2003), http://homepages.vub.ac.be/~diddesen/phdthesis.pdf
5. Gomolińska, A.: On Three Closely Related Rough Inclusion Functions. In: Kryszkiewicz, M., Peters, J., Rybiński, H., Skowron, A. (eds.) RSEISP 2007. LNCS, vol. 4585, pp. 142–151. Springer, Heidelberg (2007)
6. Heit, E.: Properties of Inductive Reasoning. Psychonomic Bulletin & Review 7, 569–592 (2000)
7. Kloos, H., Sloutsky, V., Fisher, A.: Dissociation Between Categorization and Induction Early in Development: Evidence for Similarity-Based Induction. In: Proceedings of the XXVII Annual Conference of the Cognitive Science Society (2005)
8. Kulczyński, S.: Die Pflanzenassociationen der Pieninen. Bulletin International de l'Academie Polonaise des Sciences et des Lettres, Classe des Sciences Mathematiques et Naturelles, Serie B, Supplement II 2, 57–203 (1927)
9. Osherson, D.N., Smith, E.E., Wilkie, O., Lopez, A., Shafir, E.: Category-Based Induction. Psychological Review 97(2), 185–200 (1990)
10. Pawlak, Z.: Rough Sets. International Journal of Computer and Information Sciences 11, 341–356 (1982)
11. Peters, J., Skowron, A., Stepaniuk, J.: Nearness of Objects: Extension of Approximation Space Model. Fundamenta Informaticae 79, 497–512 (2007)
12. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishers, Boston (1991)
13. Polkowski, L., Skowron, A.: Rough Mereology: A New Paradigm for Approximate Reasoning. Int. J. Approx. Reasoning 15(4), 333–365 (1996)
14. Skowron, A., Stepaniuk, J.: Tolerance Approximation Spaces. Fundamenta Informaticae 27, 245–253 (1996)
15. Sloman, S.A.: Feature-Based Induction. Cognitive Psychology 25, 231–280 (1993)
16. Tsumoto, S.: Extraction of Experts' Decision Rules from Clinical Databases Using Rough Set Model. Intelligent Data Analysis 2(1-4), 215–227 (1998)
17. Wille, R.: Restructuring Lattice Theory: An Approach Based on Hierarchies of Concepts. In: Rival, I. (ed.) Ordered Sets, pp. 445–470. Reidel, Dordrecht-Boston (1982)
18. Wille, R.: Concept Lattices and Conceptual Knowledge Systems. Computers & Mathematics with Applications 23, 493–515 (1992)
19. Wolski, M.: Approximation Spaces and Nearness Type Structures. Fundamenta Informaticae 79, 567–577 (2007)

Probabilistic Dependencies in Linear Hierarchies of Decision Tables

Wojciech Ziarko

Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2

Abstract. The article is a study of probabilistic dependencies between attribute-defined partitions of a universe in hierarchies of probabilistic decision tables. The dependencies are expressed through two measures: the probabilistic generalization of Pawlak's measure of the dependency between attributes, and the expected certainty gain measure introduced by the author. The expected certainty gain measure reflects the subtle grades of probabilistic dependence of events. Both dependency measures are developed, and it is shown how they can be extended from flat decision tables to dependencies existing in hierarchical structures of decision tables.

1 Introduction

The notion of a decision table has been around for a long time and was widely used in circuit design, software engineering, business, and other application areas. In the original formulation, decision tables are static, lacking the ability to automatically learn and adapt their structures based on new information. Decision tables representing data-acquired classification knowledge were introduced by Pawlak [1]. In Pawlak's approach, the decision tables are dynamic structures derived from data, with the ability to adjust to new information. This fundamental difference makes novel uses of decision tables possible in applications related to reasoning from data, such as data mining, machine learning or complex pattern recognition. Decision tables are typically used for making predictions about the value of the target decision attribute, such as a medical diagnosis, based on combinations of values of condition attributes, for example symptoms and test results, as measured on new, previously unseen objects (for example, patients). However, decision tables often suffer from the following problems, related to the fact that they are typically computed from a subset, a sample, of the universe of all possible objects. Firstly, the decision table may have an excessive decision boundary, often due to poor quality of the descriptive condition attributes, which may be weakly correlated with the decision attribute. The excessive decision boundary leads to an excessive number of incorrect predictions. Secondly, the decision table may be highly incomplete, i.e. excessively many new measurement vectors of condition attributes of new objects are not matched


by any combination of condition attribute values present in the decision table. Such a highly incomplete decision table leads to an excessive number of new, unrepresented observations for which the prediction of the decision attribute value is not possible. With condition attributes weakly correlated with the decision attribute, increasing their number does not rectify the first problem. Attempting to increase the number of condition attributes, or the number of possible values of the attributes, results in an exponential explosion of the complexity of decision table learning and leads to a rapid increase in its degree of incompleteness [8]. In general, the decision boundary reduction problem conflicts with the decision table incompleteness minimization problem.

To deal with these fundamental difficulties, an approach involving building hierarchies of decision tables was proposed [6]. The approach is focused on learning hierarchical structures of decision tables rather than learning individual tables, subject to learning complexity constraints. In this approach, a linear hierarchy of decision tables is formed, in which the parent layer's decision boundary defines a universe of discourse for the child layer's table. The decision tables on each layer are size-limited by reducing the number of condition attributes and their values, thus bounding their learning complexity [8]. Each layer contributes a degree of decision boundary reduction, while providing a shrinking decision boundary to the next layer. In this way, even in the presence of relatively weak condition attributes, a significant total boundary reduction can be achieved, while preserving the learning complexity constraints on each level.

Similar to a single-layer decision table, the hierarchy of decision tables needs to be evaluated from the point of view of its quality as a potential classifier of new observations. The primary evaluative measure for decision tables, as introduced by Pawlak, is the measure of partial functional dependency between attributes [1] and its probabilistic extension [7]. Another measure is the recently introduced expected gain measure, which captures more subtle probabilistic associations between attributes [7]. In this paper, these measures are reviewed and generalized to hierarchical structures of decision tables. A simple recursive method of their computation is also discussed. The measures, referred to as γ and λ measures respectively, provide a tool for the assessment of decision table-based classifiers derived from data.

The basics of rough set theory and the techniques for analysis of decision tables are presented in this article in a probabilistic context, with the underlying assumption that the universe of discourse U is potentially infinite and is known only partially through a finite collection of observation vectors (the sample data). This assumption is consistent with the great majority of applications in the areas of statistical analysis, data mining and machine learning.

2 Attribute-Based Probabilistic Approximation Spaces

In this section, we briefly review the essential assumptions, definitions and notation of rough set theory in the context of probability theory.

2.1 Attributes and Classifications

We assume that observations about objects are expressed through values of attributes, which are assumed to be functions a : U → Va, where Va is a finite set of values called the domain. The attributes represent some properties of the objects in U. It should be mentioned, however, that in practice the attributes may not be functions but general relations, due to the influence of random measurement noise. The presence of noise may cause the appearance of multiple attribute values associated with an object. Traditionally, the attributes are divided into two disjoint categories: condition attributes, denoted as C, and decision attributes D = {d}. In many rough set-oriented applications, attributes are finite-valued functions obtained by discretizing values of real-valued variables representing measurements taken on objects e ∈ U.

As with individual attributes, any non-empty subset of attributes B ⊆ C ∪ D defines a mapping from the set of objects U into the set of vectors of values of attributes in B. This leads to the idea of an equivalence relation on U, called the indiscernibility relation IND_B = {(e1, e2) ∈ U × U : B(e1) = B(e2)}. According to this relation, objects having identical values of attributes in B are equivalent, that is, indistinguishable in terms of values of attributes in B. The collection of classes of identical objects will be denoted as U/B, and the pair (U, U/B) will be called an approximation space. The object sets G ∈ U/(C ∪ D) will be referred to as atoms. The sets E ∈ U/C will be referred to as elementary sets. The sets X ∈ U/D will be called decision categories. Each elementary set E ∈ U/C and each decision category X ∈ U/D is a union of some atoms. That is, E = ∪{G ∈ U/(C ∪ D) : G ⊆ E} and X = ∪{G ∈ U/(C ∪ D) : G ⊆ X}.

2.2 Probabilities

We assume that all subsets X ⊆ U under consideration are measurable by a probability measure function P, normally estimated from collected data in a standard way, with 0 < P(X) < 1, which means that they are likely to occur but their occurrence is not certain. In particular, each atom G ∈ U/(C ∪ D) is assigned a joint probability P(G). From our initial assumption and from the basic properties of the probability measure P, it follows that for all atoms G ∈ U/(C ∪ D) we have 0 < P(G) < 1 and Σ_{G∈U/(C∪D)} P(G) = 1. Based on the joint probabilities of atoms, the probabilities of an elementary set E and of a decision category X can be calculated by P(E) = Σ_{G⊆E} P(G). The probability P(X) of the decision category X in the universe U is the prior probability of the category X. It represents the degree of confidence in the occurrence of the decision category X in the absence of any information expressed by attribute values. The conditional probability of a decision category X, P(X|E) = P(X ∩ E)/P(E), conditioned on the occurrence of the elementary set E, represents the degree


of confidence in the occurrence of the decision category X, given information indicating that E occurred. The conditional probability can be expressed in terms of joint probabilities of atoms by

P(X|E) = Σ_{G⊆X∩E} P(G) / Σ_{G⊆E} P(G).

This property allows for simple computation of the conditional probabilities of decision categories.
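A sketch (ours, with a hypothetical sample and hypothetical helper names) of how these probabilities are estimated from data by counting atoms:

```python
from collections import Counter

# Hypothetical sample: (tuple of condition attribute values, decision value).
sample = [(("1", "1"), "yes"), (("1", "1"), "yes"), (("1", "0"), "no"),
          (("0", "1"), "no"), (("1", "0"), "yes"), (("0", "1"), "no")]

n = len(sample)
p_atom = {g: c / n for g, c in Counter(sample).items()}  # joint P(G) of atoms

def p_elem(cond):
    """P(E): probability of the elementary set for condition values `cond`."""
    return sum(p for (c, _), p in p_atom.items() if c == cond)

def p_cond(dec, cond):
    """P(X|E) computed from joint probabilities of atoms."""
    joint = sum(p for (c, d), p in p_atom.items() if c == cond and d == dec)
    return joint / p_elem(cond)

print(p_cond("yes", ("1", "0")))  # 0.5 in this sample
```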

2.3 Variable Precision Rough Sets

Rough set theory underlies the methods for derivation, optimization and analysis of decision tables acquired from data. In this part, we review the basic definitions and assumptions of the variable precision rough set model (VPRSM) [5,7]. The VPRSM is a direct generalization of Pawlak rough sets [1]. One of the main objectives of rough set theory is the formation and analysis of approximate definitions of otherwise undefinable sets [1]. The approximate definitions, in the form of the lower approximation and boundary area of a set, allow for the determination of an object's membership in a set with varying degrees of certainty. The lower approximation permits uncertainty-free membership determination, whereas the boundary defines an area of objects which are not certain, but possible, members of the set [1]. The VPRSM extends these ideas by parametrically defining the positive region as an area where the certainty degree of an object's membership in a set is relatively high, the negative region as an area where the certainty degree of an object's membership in a set is relatively low, and by defining the boundary as an area where the certainty of an object's membership in a set is deemed neither high nor low. The defining criteria in the VPRSM are expressed in terms of conditional probabilities and of the prior probability P(X) of the set X in the universe U. The prior probability P(X) is used as a reference value here, as it represents the likelihood of the occurrence of X in the extreme case characterized by the absence of any attribute-based information.

In the context of the attribute-value representation of sets of the universe U, as described in the previous section, we will assume that the sets of interest are decision categories X ∈ U/D. Two precision control parameters are used: the lower limit l, 0 ≤ l < P(X) < 1, representing the highest acceptable degree of the conditional probability P(X|E) to include the elementary set E in the negative region of the set X; and the upper limit u, 0 < P(X) < u ≤ 1, reflecting the least acceptable degree of the conditional probability P(X|E) to include the elementary set E in the positive region, or u-lower approximation, of the set X. The l-negative region of the set X, denoted as NEGl(X), is defined by:

NEGl(X) = ∪{E : P(X|E) ≤ l}.   (1)

The l-negative region of the set X is a collection of objects for which the probability of membership in the set X is significantly lower than the prior probability P(X). The u-positive region of the set X, POSu(X), is defined as

POSu(X) = ∪{E : P(X|E) ≥ u}.   (2)

The u-positive region of the set X is a collection of objects for which the probability of membership in the set X is significantly higher than the prior probability P(X). The objects which are classified neither in the u-positive region nor in the l-negative region belong to the (l, u)-boundary region of the decision category X, denoted as

BNDl,u(X) = ∪{E : l < P(X|E) < u}.   (3)

The boundary is a specification of objects for which the associated probability of belonging, or not belonging, to the decision category X is not much different from the prior probability P(X) of the decision category. The VPRSM reduces to standard rough sets when l = 0 and u = 1.
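The three regions follow directly from definitions (1)-(3); here is a sketch (ours), where the elementary sets are identified only by their conditional probabilities P(X|E), and the choice l = 0.1, u = 0.9 is our assumption:

```python
def vprsm_regions(p_x_given_e, l, u):
    """Split elementary sets into POS/NEG/BND regions by definitions (1)-(3)."""
    pos = {e for e, p in p_x_given_e.items() if p >= u}
    neg = {e for e, p in p_x_given_e.items() if p <= l}
    bnd = {e for e, p in p_x_given_e.items() if l < p < u}
    return pos, neg, bnd

# The P(X|E) column of Table 1 below; with l = 0.1 and u = 0.9 the rows fall
# into the regions shown there: POS {E1, E4}, NEG {E5}, BND {E2, E3}.
probs = {"E1": 1.00, "E2": 0.61, "E3": 0.27, "E4": 1.00, "E5": 0.06}
pos, neg, bnd = vprsm_regions(probs, l=0.1, u=0.9)
print(sorted(pos), sorted(neg), sorted(bnd))
```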

3 Structures of Decision Tables Acquired from Data

To describe functional or partial functional connections between attributes of objects of the universe U, Pawlak introduced the idea of a decision table acquired from data [1]. Probabilistic decision tables and their hierarchies extend this idea into the probabilistic domain by forming representations of probabilistic relations between attributes.

3.1 Probabilistic Decision Tables

For a given decision category X ∈ U/D and set values of the VPRSM lower and upper limit parameters l and u, we define the probabilistic decision table DT_{l,u}^{C,D} as a mapping C(U) → {POS, NEG, BND} derived from the classification table as follows. The mapping assigns each tuple of condition attribute values t ∈ C(U) to the unique designation of the VPRSM approximation region POSu(X), NEGl(X) or BNDl,u(X) that the corresponding elementary set Et is included in, along with the associated elementary set probability P(Et) and conditional probability P(X|Et):

DT_{l,u}^{C,D}(t) = (P(Et), P(X|Et), POS) ⇔ Et ⊆ POSu(X)
DT_{l,u}^{C,D}(t) = (P(Et), P(X|Et), NEG) ⇔ Et ⊆ NEGl(X)   (4)
DT_{l,u}^{C,D}(t) = (P(Et), P(X|Et), BND) ⇔ Et ⊆ BNDl,u(X)

The probabilistic decision table is an approximate representation of the probabilistic relation between condition and decision attributes via a collection of uniform-size probabilistic rules corresponding to rows of the table. An example probabilistic decision table is shown in Table 1. In this table, the condition attributes are a, b, c; the attribute-value combinations correspond to elementary sets E, and Region is a designation of the approximation region the corresponding elementary set belongs to: positive (POS), negative (NEG) or boundary (BND).

The probabilistic decision tables are most useful for decision making or prediction when the relation between condition and decision attributes is largely non-deterministic. However, they suffer from an inherent contradiction between


Table 1. An example of a probabilistic decision table

a b c | P(E) | P(X|E) | Region
1 1 2 | 0.23 | 1.00   | POS
1 0 1 | 0.33 | 0.61   | BND
2 2 1 | 0.11 | 0.27   | BND
2 0 2 | 0.01 | 1.00   | POS
0 2 1 | 0.32 | 0.06   | NEG

the accuracy and completeness. In the presence of a boundary region, higher accuracy, i.e. reduction of the boundary region, can be achieved either by adding new condition attributes or by increasing the precision of the existing ones (for instance, by making the discretization procedure finer). Both solutions lead to exponential growth in the maximum number of attribute-value combinations to be stored in the decision table [8]. In practice, this results in such negative effects as excessive size of the decision table, a likely high degree of table incompleteness (in the sense of missing many feasible attribute-value combinations), weak data support for the elementary sets represented in the table and, consequently, unreliable estimates of probabilities. The use of hierarchies of decision tables rather than individual tables in the process of classifier learning from data provides a partial solution to these problems [6].

3.2 Probabilistic Decision Table Hierarchies

Since the VPRSM boundary region BNDl,u(X) is a definable subset of the universe U, it allows one to structure the decision tables into hierarchies by treating the boundary region BNDl,u(X) as a sub-universe of U, denoted as U′ = BNDl,u(X). The "child" sub-universe U′ so defined can be made completely independent from its "parent" universe U by having its own collection of condition attributes C′ to form a "child" approximation sub-space (U′, U′/C′). As on the parent level, in the approximation space (U′, U′/C′), the decision table for the subset X′ ⊆ X of the target decision category X, X′ = X ∩ BNDl,u(X), can be derived by adapting formula (4). By repeating this step recursively, a linear hierarchy of probabilistic decision tables can be grown until either the boundary area disappears in one of the child tables, or no attributes can be identified to produce a non-boundary decision table at the final level. Other termination conditions are possible, but this issue is outside the scope of this article.

The nesting of approximation spaces obtained as a result of the recursive computation of decision tables, as described above, creates a new approximation space on U. The resulting hierarchical approximation space (U, R) cannot be expressed by the indiscernibility relation, as defined in Section 2, in terms of the attributes used to form the local sub-spaces on individual levels of the hierarchy. This leads to the basic question of how to measure the degree of the mostly probabilistic dependency between the hierarchical partition R of U and the partition (X, ¬X) corresponding to the decision category X ⊆ U. Some probabilistic inter-partition dependency measures are explored in the next section.

4 Dependencies in Decision Table Hierarchies

The dependencies between partitions are fundamental to rough set-based non-probabilistic and probabilistic reasoning and prediction. They allow one to predict the occurrence of a class of one partition based on the information that a class of another partition occurred. There are several ways dependencies between partitions can be defined in decision tables. In Pawlak's early works, functional and partial functional dependencies were explored [1]. A probabilistic generalization of these dependencies was also defined and investigated in the framework of the variable precision rough set model. All these dependencies represent the relative size of the positive and negative regions of the target set X. They reflect the quality of approximation of the target category in terms of the elementary sets of the approximation space. Following Pawlak's original terminology, we will refer to these dependencies as γ-dependencies. Other dependencies, based on the notion of the certainty gain measure, reflect the average degree of improvement of the certainty of occurrence of the decision category X, or ¬X, relative to its prior probability P(X) [7] (see also [2] and [4]). We will refer to these dependencies as λ-dependencies. Both the γ-dependencies and the λ-dependencies can be extended to hierarchies of probabilistic decision tables, as described below. Because there is no single collection of attributes defining the partition of U, the dependencies of interest in this case are dependencies between the hierarchical partition R generated by the decision table hierarchy, forming the approximation space (U, R), and the partition (X, ¬X) defined by the target set.

4.1 Γ-Dependencies for Decision Tables

The partial functional dependency between attributes, referred to as the γ-dependency measure γ(D|C), was introduced by Pawlak [1]. It can be expressed in terms of the probability of the positive region of the partition U/D defining the decision categories:

γ(D|C) = P(POS^{C,D}(U)),   (5)

where POS^{C,D}(U) is the positive region of the partition U/D in the approximation space induced by the partition U/C. In the binary case of two decision categories, X and ¬X, the γ(D|C)-dependency can be extended to the VPRSM by defining it as the combined probability of the u-positive and l-negative regions:

γl,u(X|C) = P(POSu(X) ∪ NEGl(X)).   (6)

The γ-dependency measure reflects the proportion of objects in U which can be classified with sufficiently high certainty as being members, or non-members, of the set X.

4.2 Computation of Γ-Dependencies in Decision Table Hierarchies

In the case of an approximation space formed via the hierarchical classification process, the γ-dependency between the hierarchical partition R and


the partition (X, ¬X) can be computed directly by analyzing all classes of the hierarchical partition. However, an easier to implement recursive computation is also possible. This is done by recursively applying, starting from the leaf table of the hierarchy and going up to the root table, the following formula (7) for computing the dependency γ^U_{l,u}(X|R) of the parent table in the hierarchical approximation space (U, R), given the dependency γ^{U′}_{l,u}(X|R′) of the child-level table in the sub-approximation space (U′, R′):

γ^U_{l,u}(X|R) = γ^U_{l,u}(X|C) + P(U′) γ^{U′}_{l,u}(X|R′),   (7)

where C is the collection of attributes inducing the approximation space on U and U′ = BNDl,u(X). As in the flat-table case, this dependency measure represents the fraction of objects that can be classified with acceptable certainty into the decision categories X or ¬X by applying the decision tables in the hierarchy. The dependency of the whole structure of decision tables, that is, the last dependency computed by the recursive application of formula (7), will be called the global γ-dependency. Alternatively, the global γ-dependency can be computed straight from the definition (5). This computation requires checking all elementary sets of the hierarchical partition for inclusion in POSu(X) ∪ NEGl(X), which seems to be less elegant and more time-consuming than the recursive method.
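A sketch (ours, with hypothetical numbers) of the recursion in formula (7), where each level of the hierarchy is summarized by its flat-table dependency γl,u(X|C) and by the probability of its boundary region, which is the child's universe:

```python
def global_gamma(levels):
    """levels: (gamma_flat, p_boundary) per layer, root first; formula (7)."""
    gamma = 0.0
    for gamma_flat, p_bnd in reversed(levels):  # start from the leaf table
        gamma = gamma_flat + p_bnd * gamma
    return gamma

# Three hypothetical layers: each resolves part of its boundary region.
print(global_gamma([(0.6, 0.4), (0.5, 0.5), (0.7, 0.3)]))
# 0.6 + 0.4 * (0.5 + 0.5 * 0.7) = 0.94
```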

4.3 Certainty Gain Functions

Based on the probabilistic information contained in the data, as given by the joint probabilities of atoms, it is also possible to evaluate the degree of probabilistic dependency between any elementary set and a decision category. The dependency measure is called the absolute certainty gain (gabs) [7]. It represents the degree of influence the occurrence of an elementary set E has on the likelihood of occurrence of the decision category X. The occurrence of E can increase, decrease, or have no effect on the probability of occurrence of X. The probability of occurrence of X, in the absence of any other information, is given by its prior probability P(X). The degree of variation of the probability of X due to the occurrence of E is reflected by the absolute certainty gain function:

gabs(X|E) = |P(X|E) − P(X)|,   (8)

where |∗| denotes the absolute value function. The values of the absolute gain function fall in the range 0 ≤ gabs(X|E) ≤ max(P(¬X), P(X)) < 1. In addition, if the sets X and E are independent in the probabilistic sense, that is, if P(X ∩ E) = P(X)P(E), then gabs(X|E) = 0. The definition of the absolute certainty gain provides a basis for the definition of a new probabilistic dependency measure between attributes. This dependency can be expressed as the average degree of change of the occurrence certainty of the decision category X, or of its complement ¬X, due to the occurrence of any elementary set [7], as defined by the expected certainty gain function:

egabs(X|C) = Σ_{E∈U/C} P(E) gabs(X|E),   (9)


where X ∈ U/D. The expected certainty gain is a more subtle inter-partition dependency than the γ-dependency, since it takes into account the probabilistic distribution information in the boundary region of X. The egabs(X|C) measure can be computed directly from the joint probabilities of atoms. It can be proven [7] that the expected gain function falls in the range 0 ≤ egabs(X|C) ≤ 2P(X)(1 − P(X)), where X ∈ U/D.

4.4 Attribute Λ-Dependencies in Decision Tables

The strongest dependency between attributes of a decision table occurs when the decision category X is definable, i.e. when the dependency is functional. Consequently, the dependency in this deterministic case can be used as a reference value to normalize the certainty gain function. The following normalized expected gain function λ(X|C) measures the expected degree of probabilistic dependency between the elementary sets and the decision categories belonging to U/D [7]:

λ(X|C) = egabs(X|C) / (2P(X)(1 − P(X))),   (10)

where X ∈ U/D. The λ-dependency quantifies in relative terms the average degree of deviation of the elementary sets from statistical independence with the decision class X ∈ U/D. The dependency function reaches its maximum λ(X|C) = 1 only if the dependency is deterministic (functional), and it is at its minimum when all events represented by elementary sets E ∈ U/C are unrelated to the occurrence of the decision class X ∈ U/D. In the latter case, the conditional distribution of the decision class P(X|E) equals its prior distribution P(X). The value of the λ(X|C) dependency function can be easily computed from the joint probabilities of atoms. As opposed to the generalized γ(X|C) dependency, the λ(X|C) dependency has the monotonicity property [3], that is, λ(X|C) ≤ λ(X|C ∪ {a}), where a is an extra condition attribute outside the set C. This monotonicity property allows for dependency-preserving reduction of attributes and leads to the notion of a probabilistic λ-reduct of attributes, as defined in [3].
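A sketch (ours) of formulas (9) and (10), evaluated on the P(E) and P(X|E) columns of Table 1; obtaining the prior P(X) by total probability is our assumption here:

```python
def egabs(elem, p_x):
    """Expected certainty gain, formula (9): sum of P(E)*|P(X|E) - P(X)|."""
    return sum(p_e * abs(p_xe - p_x) for p_e, p_xe in elem)

def lam(elem, p_x):
    """Normalized expected gain, formula (10)."""
    return egabs(elem, p_x) / (2 * p_x * (1 - p_x))

# (P(E), P(X|E)) pairs from Table 1.
elem = [(0.23, 1.00), (0.33, 0.61), (0.11, 0.27), (0.01, 1.00), (0.32, 0.06)]
p_x = sum(p_e * p_xe for p_e, p_xe in elem)  # prior P(X) ≈ 0.49
print(lam(elem, p_x))  # ≈ 0.65
```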

4.5 Computation of Λ-Dependencies in Decision Table Hierarchies

The λ-dependencies can be computed directly from any known partitioning of the universe U. In cases where the approximation space is formed through hierarchical classification, the λ-dependency between the partition R so created and the target category X can be computed via a recursive formula derived below. Let

egabs_{l,u}(X|C) = Σ_{E ∈ POSu ∪ NEGl} P(E) gabs(X|E)   (11)

denote the conditional expected gain function, i.e. the expected gain restricted to the union of the positive and negative regions of the target set X in the approximation space generated by the attributes C. The maximum value of egabs_{l,u}(X|C), achievable


in the deterministic case, is 2P(X)(1 − P(X)). Thus, the normalized conditional λ-dependency function can be defined as:

λl,u(X|C) = egabs_{l,u}(X|C) / (2P(X)(1 − P(X))).   (12)

As with γ-dependencies, the λ-dependencies between the target partition (X, ¬X) and the hierarchical partition R can be computed recursively. The following formula (13) describes the relationship between the λ-dependency computed in the approximation space (U, R) and the dependency computed over the approximation sub-space (U′, R′), where R and R′ are hierarchical partitions of the universes U and U′ = BNDl,u(X), respectively. Let λl,u(X|R) and λl,u(X|R′) denote the λ-dependency measures in the approximation spaces (U, R) and (U′, R′), respectively. The λ-dependencies in those approximation spaces are related by the following:

λl,u(X|R) = λl,u(X|C) + P(BNDl,u(X)) λl,u(X|R′).   (13)

The proof of the above formula follows directly from Bayes' equation. In practical terms, formula (13) provides a method for efficient computation of the conditional λ-dependency in a hierarchical arrangement of probabilistic decision tables. According to this method, to compute the conditional λ-dependency at each level of the hierarchy, it suffices to compute the conditional λ-dependency of that level and to know the "child" BND_{l,u}(X)-level conditional λ-dependency. That is, the conditional λ-dependency is computed first for the bottom-level table using formula (12), and then for each subsequent level in a bottom-up fashion by successively applying (13). In a similar way, the "unconditional" λ-dependency λ(X|R) can be computed over all elementary sets of the hierarchical approximation space. This is made possible by the following variant of formula (13):

λ(X|R) = λ_{l,u}(X|C) + P(BND_{l,u}(X)) λ(X|R′).   (14)

The recursive process based on formula (14) is essentially the same as in the case of (13), except that the bottom-up procedure starts with the computation of the "unconditional" λ-dependency by formula (10) for the bottom-level table.
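The bottom-up procedure can be sketched as follows. This is an illustrative Python rendering of the recursion (13)/(14), under the assumption that each level of the hierarchy is summarized by its conditional λ-dependency and the probability of its boundary region within its own universe; the function and variable names are hypothetical.

# Minimal sketch of the bottom-up recursion (13)/(14), assuming each level
# of the hierarchy is given as (lambda_cond, p_bnd): the conditional
# lambda-dependency lambda_{l,u}(X|C) of that level's decision table and
# the probability P(BND_{l,u}(X)) of its boundary region.

def hierarchical_lambda(levels, bottom_lambda):
    """levels: top-to-bottom list of (lambda_cond, p_bnd) pairs;
    bottom_lambda: lambda computed for the bottom-level table by (10)/(12)."""
    result = bottom_lambda
    for lambda_cond, p_bnd in reversed(levels):   # apply (13)/(14) bottom-up
        result = lambda_cond + p_bnd * result
    return result

# Example: two levels above a bottom table whose lambda equals 0.6
print(hierarchical_lambda([(0.30, 0.25), (0.40, 0.20)], 0.6))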

5 Concluding Remarks

Learning and evaluation of hierarchical structures of probabilistic decision tables is the main focus of this article. The earlier introduced measures of gamma and lambda dependencies between attributes [7] for decision tables acquired from data are not directly applicable to approximation spaces corresponding to hierarchical structures of decision tables. The main contribution of this work is the extension of these measures to the case of decision table hierarchies and the derivation of recursive formulas for their easy computation.



The gamma dependency measure allows for the assessment of the prospective ability of a classifier based on the hierarchy of decision tables to predict the values of the decision attribute at the required level of certainty. The lambda dependency measure captures the relative degree of probabilistic correlation between the classes of the partitions corresponding to condition and decision attributes, respectively. The degree of the correlation in this case represents the average improvement of the ability to predict the occurrence of the target set X, or its complement ¬X. Jointly, both measures enable the user to evaluate the progress of learning with the addition of new training data and to assess the quality of the empirical classifier. Three experimental applications of the presented approach are currently under development. The first one is concerned with face recognition using photos to develop a classifier in the form of a hierarchy of decision tables, the second one aims at adaptive learning of spam recognition among e-mails, and the third one is focused on stock price movement prediction using historical data.

Acknowledgment. This paper is an extended version of the article included in the Proceedings of the International Conference on Rough Sets and Emerging Intelligent Systems Paradigms, devoted to the memory of Professor Zdzislaw Pawlak, held in Warsaw, Poland in 2007. The support of the Natural Sciences and Engineering Research Council of Canada in funding the research presented in this article is gratefully acknowledged.

References

1. Pawlak, Z.: Rough Sets – Theoretical Aspects of Reasoning About Data. Kluwer, Dordrecht (1991)
2. Greco, S., Matarazzo, B., Slowinski, R.: Rough membership and Bayesian confirmation measures for parametrized rough sets. In: Ślęzak, D., Wang, G., Szczuka, M.S., Düntsch, I., Yao, Y. (eds.) RSFDGrC 2005. LNCS, vol. 3641, pp. 314–324. Springer, Heidelberg (2005)
3. Ślęzak, D., Ziarko, W.: The investigation of the Bayesian rough set model. International Journal of Approximate Reasoning 40, 81–91 (2005)
4. Yao, Y.: Probabilistic approaches to rough sets. Expert Systems 20(5), 287–291 (2003)
5. Ziarko, W.: Variable precision rough sets model. Journal of Computer and System Sciences 46(1), 39–59 (1993)
6. Ziarko, W.: Acquisition of hierarchy-structured probabilistic decision tables and rules from data. In: Proc. of the IEEE Intl. Conf. on Fuzzy Systems, Honolulu, pp. 779–784 (2002)
7. Ziarko, W.: Probabilistic rough sets. In: Ślęzak, D., Wang, G., Szczuka, M.S., Düntsch, I., Yao, Y. (eds.) RSFDGrC 2005. LNCS, vol. 3641, pp. 283–293. Springer, Heidelberg (2005)
8. Ziarko, W.: On learnability of decision tables. In: Tsumoto, S., Słowiński, R., Komorowski, J., Grzymała-Busse, J.W. (eds.) RSCTC 2004. LNCS, vol. 3066, pp. 394–401. Springer, Heidelberg (2004)

Automatic Singing Voice Recognition Employing Neural Networks and Rough Sets

Paweł Żwan, Piotr Szczuko, Bożena Kostek, and Andrzej Czyżewski

Gdańsk University of Technology, Multimedia Systems Department, Narutowicza 11/12, 80-952 Gdańsk, Poland
{zwan,szczuko,bozenka,ac}@sound.eti.pg.gda.pl

Abstract. The aim of the research study presented in this paper is the automatic recognition of a singing voice. For this purpose, a database containing sample recordings of trained and untrained singers was constructed. Based on these recordings, certain voice parameters were extracted. Two recognition categories were defined – one reflecting the skills of a singer (quality), and the other reflecting the type of the singing voice (type). The paper also presents the parameters designed especially for the analysis of a singing voice and gives their physical interpretation. Decision systems based on artificial neural networks and rough sets are used for automatic voice quality/type classification. Results obtained from both decision systems are then compared and conclusions are derived.

Keywords: Singing Voice, Feature Extraction, Automatic Classification, Artificial Neural Networks, Rough Sets, Music Information Retrieval.

1 Introduction

The area of automatic content indexing and classification is related to the Music Information Retrieval (MIR) domain, which is now growing very rapidly and induces many discussions on automatic speech recognition and the development of appropriate systems. Speech, however, is not the only product of the human voice organ. Singing is another one, and it is considered a musical instrument by musicologists. Its artistic and musical aspects are the reason why singing must be analyzed with specially designed additional parameters. These parameters should obviously be based on speech parameters, but additionally they must focus on articulation and timbre. A parametric description is necessary in many applications of automatic sound recognition. The very complicated biomechanics of the singing voice [10], [27] and the distinct character of its intonation and timbre require numerous features to describe its operation. Such a parametric representation needs intelligent decision systems in the classification process. In the presented study, artificial neural network (ANN) and rough set-based (RS) decision systems were employed for the purpose of singing voice quality/type recognition. The systems were trained with sound samples, of which a large part (1700 samples) was recorded in the studio and 1200 samples were extracted from professional CD recordings.



For every sound sample, a feature vector (FV) containing 331 parameters was formed. The parameters were divided into two groups: the so-called dedicated ones (designed to allow for the specifics of a singing voice) and more general ones known from the literature on MIR and speech recognition. The ability of the decision systems to automatically classify a singing voice is discussed in the context of comparing the efficiency of ANN and RS systems in two recognition categories: ‘voice type’ (classes: bass, baritone, tenor, alto, mezzo-soprano, soprano) and ‘voice quality’ (classes: amateur, semi-professional, professional). Additionally, the parameters were judged using statistical and rough set methods. For different methods of reducing the feature vector redundancy, new classifiers were trained. The results were compared by analyzing the accuracy of the trained recognition systems. This article is an extended version of the paper presented at the RSEISP’07 conference held in Warsaw [34]. The paper is organized as follows. In Section 2 the organization of the database of singing samples is described. The automatic classification process requires an efficient feature extraction block, thus Section 3 presents the parameters that were used in the experiments and discusses them in the context of their relationship with voice production mechanisms. The analysis shown in Section 4 concentrates on redundancy elimination in the feature vector. For this purpose three methods, i.e. the Fisher and Sebestyen statistics and a rough set-based method, are employed. The main core of the experiments is presented in Section 5, and finally Section 6 summarizes the results obtained in this study.

2 The Database of Singing Voices

The prepared singing voice database contains over 2900 sound samples. Some 1700 of them were recorded from 42 singers in a studio. The vocalists consisted of three groups: amateurs (Gdańsk University of Technology Choir vocalists), semi-professionals (Gdańsk Academy of Music, Vocal Faculty students), and professionals (qualified vocalists who graduated from the Vocal Faculty of the Gdańsk Academy of Music). Each of them recorded 5 vowels: ‘a’, ‘e’, ‘i’, ‘o’, ‘u’ at several sound pitches belonging to their natural voice scale. These recordings formed the first singing category – singing quality. The singing voice type category was formed by assigning the voices to one of the following classes: bass, baritone, tenor, alto, mezzo-soprano and soprano. The second group of samples was prepared on the basis of CD audio recordings of famous singers. The database of professionals needed to be extended due to the fact that voice type recognition is possible only among professional voices. Amateur voices do not show many differences within the groups of male and female voices, as has already been reported in the literature [2], [27].

3 Parametrization of the Singing Voice

In order to parameterize the singing voice properly, an understanding of the voice production mechanism is required.



The biomechanism of singing voice creation is rather complicated, but in the domain of its spectral energy (while not taking phase changes into account) it can be simplified by assuming an FIR model of singing voice production. As in every classical FIR model, the vocal tract is a filter which changes the spectrum of a source (the glottis) by a sum of resonances with given frequencies, amplitudes and qualities (the vocal tract). The singing sound is therefore produced by the vibration of the human vocal cords and by resonances in the throat and head cavities. The resonances produce formants in the spectrum of sounds. Formants are not only related to articulation, enabling the production of different vowels, but also characterize timbre and voice type qualities. For example, the formant of the middle frequency band (3.5 kHz) is described in the literature as the “singer’s formant”, and its relation to voice quality has been proven [2], [20], [27]. This concept is well recognized in a rich literature related to singing. However, the interaction between two factors, namely the glottal source and the resonance characteristics, shapes the timbre and power of an outgoing vocal sound, and both factors are equally important. Thus, singing voice parameters can be divided into two groups related to those two factors. Since this is a classical FIR model, inverse filtration methods are required in order to deconvolve the source from the filter in the output signal. In the literature, some inverse filtration methods for deriving glottis parameters are presented; however, they prove to be inefficient due to phase problems [10]. In this respect, only the parameters of the vocal tract formants can be calculated directly from the inverse filtering analysis, since they are defined in the frequency domain. The assumption of linearity is the reason why the time parameters of the source signal must be extracted by other methods, which will be presented later on. Vocal tract parameters are, in speech analysis, most often derived by the LPC method, but an adequate separation of frequency resonances demands high resolution for the lower frequencies, where the resonances are located. Moreover, methods of analysis with a spectrum resolution controlled by a function of the sound pitch are required. The warped LPC method [8], [18] (further called the WLPC analysis) fulfills those conditions and enables the analysis of the frequencies and levels of formants with a higher, controlled low-frequency resolution (below 5 kHz). It is based on nonlinear sampling of the unit circle in the z transform, thus the resolution at lower frequencies is better compared to a classical LPC analysis with the same length of the analyzed frame. The phase response frequency is transformed non-linearly to a warped frequency ω_W according to Equation (1):

ω_W = ω + 2 · arctan( (λ · sin ω) / (1 − λ · cos ω) )   (1)

where λ is a parameter which determines the non-linearity of the transformation and the low-frequency resolution of the WLPC analysis. The larger λ is, the more densely the lower frequencies are sampled. Mathematical aspects of this transformation are presented in detail in some literature sources [8], [9], [18] and in the previous works of the authors of this paper [34]. Since the analysis is applied to small signal frames, it can be performed for several parts of the analyzed sounds.



Therefore, any parameter F (which can be, for example, the level of one of the formants) forms a vector which describes its values in consecutive frames. In order to characterize the whole signal, and not only a single frame, the median value of this vector is represented by a so-called static parameter F_med. In this case, the median is better than the mean value, because it is more resistant to short atypical values of a parameter, which do not drastically change the median. On the other hand, in order to investigate stability, the variances of the vector values (denoted as F_var) are also taken into account. Some of the singing voice parameters must be calculated for a whole sound rather than for single frames. Those parameters are defined on the basis of the fundamental frequency contour analysis, and they are related to vibrato and intonation. Vibrato is defined as the modulation of the fundamental frequency of sounds performed by singers in order to change the timbre of sounds, while intonation is their ability to produce sounds perceived as stable and precise in tune. The parameters based on the singing voice analysis (‘dedicated’ parameters) form an important group, but they should be supplemented with general descriptors normally used for the classification of instrument sounds. This group of parameters was investigated in detail in the domain of automatic sound recognition at the Multimedia Systems Department at Gdańsk University of Technology. The usefulness of those parameters in automatic musical sound recognition was proven, which motivated their application to the field of singing voice recognition. In this study, 331 parameters were derived from the analyses, of which 124 are defined by the authors and are the so-called ‘dedicated parameters’, especially designed to address the specifics of the singing voice.

3.1 Estimation of the Vocal Tract Parameters

As already described, the estimation of formants requires methods of analysis with a good frequency resolution, dependent on the pitch of sounds. If the resolution is not set properly, single harmonics can be erroneously recognized as formants. For this purpose the WLPC analysis seems to be the most appropriate, because the λ parameter is a function of the pitch of the analyzed sounds [9] and thus can be changed in this analysis. The function λ = f(f) is presented in Eq. (2). The problem of how to determine the appropriate λ is presented in detail in the work of one of the authors [32], [34]:

λ = 10⁻⁶ · f[Hz]² − 0.0022 · f[Hz] + 0.9713   (2)
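For illustration, a minimal Python sketch of the two warping relations follows: λ as a function of pitch per Eq. (2), and the warped frequency mapping of Eq. (1). The conversion of a frequency in Hz to a normalized angular frequency, and the sampling rate value used, are assumptions for illustration only.

import math

def warp_lambda(f0_hz):
    # Eq. (2): lambda as a function of the pitch of the analyzed sound
    return 1e-6 * f0_hz**2 - 0.0022 * f0_hz + 0.9713

def warped_frequency(omega, lam):
    # Eq. (1): non-linear mapping to the warped frequency omega_W
    return omega + 2.0 * math.atan(lam * math.sin(omega) /
                                   (1.0 - lam * math.cos(omega)))

fs = 44100.0                          # assumed sampling rate
lam = warp_lambda(440.0)              # lambda for a 440 Hz sound
omega = 2.0 * math.pi * 1000.0 / fs   # 1 kHz as normalized angular frequency
print(lam, warped_frequency(omega, lam))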

However, parameters related to the ‘singing formant’ can also be extracted on the basis of the FFT power spectrum parametrization. Correlation between the WLPC and FFT parameters is not a problematic issue: various methods, among them statistical analysis and the rough set method, make it possible to reduce redundancy in feature vectors (FVs) and to compare the significance of the features. The results of the WLPC and FFT analyses are presented in Fig. 1. Maxima and minima of the WLPC curves are determined automatically by an algorithm elaborated by one of the authors [32]. WLPC smoothes the power spectrum with a good resolution for frequencies below 5 kHz.



Fig. 1. WLPC analysis shown along with the FFT power spectrum analysis of sound

Extracted WLPC maxima are related to one of three formants: the articulation formant (frequencies 1–2.5 kHz), the singer’s (singing) formant (frequencies 3–4 kHz), and the high singing formant (frequencies over 5 kHz). Since the literature offers no formal prescription of how to define these formants mathematically, three definitions for each of them can be proposed, based on the three WLPC minima:

F_nm = WLPCmx_n − WLPCmn_m   (3)

where WLPCmx_n is the value of the n-th WLPC maximum and WLPCmn_m is the value of the m-th WLPC minimum. Since the WLPC analysis is applied to short signal frames, it can be performed for several fragments of the analyzed sounds. Therefore, any formant parameter F_nm forms a vector which describes its values in consecutive frames. Median values of this vector represent a so-called static parameter F_nm_med, while the values of the variances are a dynamic representation, denoted as F_nm_var. The maximum and minimum values from expression (3) are also calculated in all consecutive frames; they are denoted as F_nm_max and F_nm_min, respectively. The singer’s formant is presented by many authors as significant for the estimation of singing quality. Parameters related to the “singer’s formant” were extracted on the basis of linear combinations of the parameters F_nm and, additionally, by using the FFT power spectrum parametrization. The combinations of the parameters defined on the basis of the WLPC analysis are presented in Eqs. (4) and (5); those equations show a direct relationship between formants. The parameter related to the formant energy, defined on the basis of the FFT power spectrum, is presented in (6):

F2/F1 = F21 − F11   (4)

F2/F3 = F21 − F31   (5)

SFL = E_SF / E_total   (6)



where the ratios F2/F1 and F2/F3 represent differences in the formant levels F11, F21 and F31 expressed in [dB], SFL denotes the singer’s formant energy, E_SF is the power spectrum energy for the band (2.5 kHz–4 kHz) in which the ‘singing formant’ is present, and E_total is the total energy of the analyzed signal.
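The following Python sketch illustrates the formant parameters (3)–(6) for a single analysis frame. It assumes that the WLPC envelope maxima and minima (in dB) and the FFT power spectrum are already available; the function name and the band edges of the singer's formant follow the text (2.5–4 kHz), and everything else is illustrative.

import numpy as np

def formant_params(wlpc_max_db, wlpc_min_db, power_spec, freqs):
    F11 = wlpc_max_db[0] - wlpc_min_db[0]   # Eq. (3): F_nm = WLPCmx_n - WLPCmn_m
    F21 = wlpc_max_db[1] - wlpc_min_db[0]
    F31 = wlpc_max_db[2] - wlpc_min_db[0]
    f2_f1 = F21 - F11                       # Eq. (4): level difference in dB
    f2_f3 = F21 - F31                       # Eq. (5)
    band = (freqs >= 2500) & (freqs <= 4000)
    sfl = power_spec[band].sum() / power_spec.sum()   # Eq. (6): E_SF / E_total
    return f2_f1, f2_f3, sfl

# Example with stand-in data
freqs = np.linspace(0, 10000, 512)
spec = np.random.rand(512) + 1e-9           # stand-in power spectrum
print(formant_params([30.0, 24.0, 15.0], [10.0, 8.0, 6.0], spec, freqs))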

3.2 Estimation of the Glottal Source Parameters

The interaction between the vocal tract filter and the glottal shape, along with phase problems, are obstacles to an accurate automatic extraction of the glottal source shape [10], [12], [27], [32]. Glottal source parameters, which are defined in the time domain, are not easy to compute from inverse filtration. However, within the context of singing voice quality, their stability rather than their objective values seems to be important. The analysis must be done for single periods of sound, and a sonogram analysis with small analyzing frames and a big overlap should be employed. For each of the frequency bands, the sonogram consists of a set of n sequences S_n(k), where n is the number of a frequency band and k is the number of a sample. Since the aim of the parametrization is to describe the stability of energy changes in the sub-bands, the autocorrelation in time is a function of the sequences S_n(k). The more frequent and stable the energy changes in a sub-band were, the higher were the values of the autocorrelation function maximum (for index not equal to 0). The analysis was performed for 16- and 32-sample frames. In the first case the energy band of 0–10 kHz was related to the first four indexes n, and the maximum of the autocorrelation function of sub-band n is denoted as KX_n (7); in the second case n = 1...8 and the resulting parameter is defined as LX_n (8). Two different analyzing frames were used for comparison purposes only; the redundancy in the feature vector (FV) was further eliminated by statistical methods.

KX_n = max_k(Corr_k(S_n^16(k))), n = 1...4   (7)

LX_n = max_k(Corr_k(S_n^32(k))), n = 1...8   (8)

where Corr_k(.) is the autocorrelation function in the time domain, k is the sample number, n is the number of the frequency sub-band, S_n^16 is the sonogram sample sequence for an analyzed frame of 16 samples and frequency sub-band n, and S_n^32 denotes the sonogram sample sequence for an analyzed frame of 32 samples and frequency sub-band n. Conversely, the minimum of the correlation function Corr_k(S_n(k)) is connected with the symmetry or anti-symmetry of energy changes in the sub-bands, which relates to the open quotient of the glottis source [32]. Therefore, in each of the analyzed sub-bands the KY_n and LY_n parameters are defined as (9) and (10), respectively:

KY_n = min_k(Corr_k(S_n^16(k))), n = 1...4   (9)

LY_n = min_k(Corr_k(S_n^32(k))), n = 1...8   (10)

where Corr_k(.), k, n, S_n^16, S_n^32 are defined as in formulas (7) and (8).



Another parameter defined for each analyzed sub-band is a threshold parameter KP_n, defined as the number of samples exceeding the average energy level of the sub-band n divided by the total number of samples in the sub-band. For the frame of 32 samples a similar parameter is defined and denoted as LP_n. The parameters KP_n and LP_n are also related to the open quotient of the glottal signal [32], [33].
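A minimal Python sketch of the glottal-source descriptors for one sub-band sequence S_n(k) follows, covering the maxima/minima of (7)–(10) and the threshold parameter KP_n/LP_n. The mean-removed, lag-0-normalized autocorrelation used here is one common normalization and is an assumption.

import numpy as np

def autocorr(seq):
    s = seq - seq.mean()
    c = np.correlate(s, s, mode='full')[len(s) - 1:]   # lags 0..N-1
    return c / c[0] if c[0] != 0 else c

def glottal_params(s_n):
    s = np.asarray(s_n, dtype=float)
    c = autocorr(s)
    kx = c[1:].max()                   # Eq. (7)/(8): maximum for lag != 0
    ky = c[1:].min()                   # Eq. (9)/(10): minimum for lag != 0
    kp = float(np.mean(s > s.mean()))  # KP_n / LP_n threshold parameter
    return kx, ky, kp

# Example with a stand-in 16-sample sub-band sequence
print(glottal_params([0.1, 0.9, 0.2, 0.8, 0.15, 0.85, 0.1, 0.9]))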

3.3 Estimation of Intonation and Vibrato Parameters

Proper vibrato and intonation play a very important role in the perception of voice quality [4], [5], [7], [25], [31]. It is clear that a person who does not hold the pitch steadily and does not have a consistent vibrato cannot be judged a good singer. Intonation and vibrato of singing are defined in the frequency domain, thus a pitch contour needs to be extracted. There are several methods of automatic sound pitch extraction, of which the autocorrelation method seems to be the most appropriate [6], [14]. The autocorrelation pitch extraction method is based on the determination of the maximum of an autocorrelation function defined for overlapped segments of the audio signal. Since this method is well presented in a rich literature [6], [23] on this subject, it will not be recalled here. The fundamental frequency (f0) within each analyzed frame was determined, and at the same time an improvement of the frequency resolution of the analysis was achieved by interpolating three samples around the maximum of the autocorrelation function. The length of the frame was set to 512 samples. The value was determined experimentally in order to give a satisfactory time resolution; it is presented in detail in other papers of the authors [6], [32]. The interpolation improves the frequency resolution significantly. The pitch of the analyzed sounds is not always stable in time, especially when sounds of untrained singers are concerned. In order to accurately parametrize the vibrato and intonation of the analyzed sound, an equivalent pitch contour of the sound, but without vibrato, should be determined. The result of such an analysis is a so-called ‘base contour’, which is calculated by smoothing the pitch contour (using the moving average method) with a frame length equal to the reciprocal of half the vibrato frequency. If bc(n) are samples of the base contour (defined in frequency) and v(n) are samples of the vibrato contour, the modified vibrato contour is calculated as vm(n) = v(n) − bc(n), and it is used for the vibrato parametrization. On the other hand, bc(n) is used for the intonation parametrization to define how quickly the singer is able to obtain a given pitch of the sound and how stable its frequency is. The parametrization of vibrato depth and frequency (f_VIB) may not be sufficient in the category of singing quality. Since the stability of vibrato reflects the quality of the sound parameters in time [5], [27], three additional vibrato parameters were defined [5], [34]:
– “periodicity” of vibrato VIB_P (Eq. 11), defined as the maximum value of the autocorrelation function of the pitch contour (for index not equal to 0);



– “harmonicity” of vibrato VIB_H (Eq. 12), obtained by calculating the Spectral Flatness Measure for the spectrum of the pitch contour;
– “sinusoidality” of vibrato VIB_S (Eq. 13), defined as the similarity of the parameterized pitch contour to a sine waveform.

VIB_P = max_n(Corr(f0(n)))   (11)

VIB_H = ( Π_{n=1}^{N} F0(n) )^{1/N} / ( (1/N) Σ_{n=1}^{N} F0(n) )   (12)

VIB_S = max_n(F0(n)) / Σ_{n=1}^{N} F0(n)   (13)

where F0(n) denotes the spectrum of the pitch contour and N is its length.
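A minimal Python sketch of the three vibrato descriptors (11)–(13) computed from a pitch contour f0(n) follows. The normalized autocorrelation, the use of the FFT magnitude spectrum with the DC bin excluded, and the synthetic test contour are all assumptions for illustration.

import numpy as np

def vibrato_params(f0):
    f0 = np.asarray(f0, dtype=float)
    d = f0 - f0.mean()
    c = np.correlate(d, d, mode='full')[len(d) - 1:]
    vib_p = (c[1:] / c[0]).max()                     # Eq. (11): max for lag != 0
    F0 = np.abs(np.fft.rfft(d))[1:] + 1e-12          # spectrum of the contour
    vib_h = np.exp(np.mean(np.log(F0))) / F0.mean()  # Eq. (12): spectral flatness
    vib_s = F0.max() / F0.sum()                      # Eq. (13): sinusoidality
    return vib_p, vib_h, vib_s

# Example: synthetic pitch contour with a 6-cycle vibrato
n = np.arange(1024)
contour = 440 + 5 * np.sin(2 * np.pi * 6 * n / 1024)
print(vibrato_params(contour))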

Bad singers often try to use vibration of the sounds not for artistic purposes but to hide the “false” intonation of a sound. In addition, false sounds directly reveal a lack of proficiency in vocal skills. Since intonation seems important in voice quality determination, the base contour must be parametrized. In order to calculate the intonation parameters, two methods were proposed. The first method calculates the mean value of a differential sequence of the base contour (IR). The second method does not analyze all base contour samples but only the first and the last one, and returns the IT parameter. The parameters IR and IT are also defined for the first and last N/2 samples of the pitch contour separately (N is the number of samples of the pitch contour) and are denoted as IR_att, IT_att, IR_rel, IT_rel, where att means the attack and rel the release of the sound.

3.4 Other Parameters

Another way of determining singing voice parameters is to use a more general signal description, such as the descriptors of audio content contained in the MPEG-7 standard. Although those parameters are not related to singing voice biomechanics, they may be useful in the recognition process. The MPEG-7 parameters [11], [19] will not be presented in detail here, since they were reviewed in previous works by the authors [15], [16], [17]. The MPEG-7 audio parameters can be divided into the following groups:
– ASE (Audio Spectrum Envelope) describes the short-term power spectrum of the waveform. The mean values and variances of each coefficient over time are denoted as ASE_1 ... ASE_34 and ASE_1var ... ASE_34var, respectively.
– ASC (Audio Spectrum Centroid) describes the center of gravity of the log-frequency power spectrum. The mean value and the variance are denoted as ASC and ASC_var, respectively.
– ASS (Audio Spectrum Spread). The mean value and the variance over time are denoted as ASS and ASS_var, respectively.



– SFM (Spectral Flatness Measure), calculated for each frequency band. The mean values and the variances are denoted as SFM_1 ... SFM_24 and SFM_1var ... SFM_24var.
– Parameters related to discrete harmonic values: HSD (Harmonic Spectral Deviation), HSS (Harmonic Spectral Spread), HSV (Harmonic Spectral Variation).

The level of the first harmonic changes for different voice type qualities [27], thus in automatic voice recognition some parameters related to the behavior of the harmonics can be useful. The parameters employed in the analysis were defined for the harmonic decomposition of sounds. They are: the mean value of the differences between the amplitudes of a harmonic in adjacent time frames (s_n, where n is the number of a harmonic), the mean value of the amplitudes A_h of a harmonic over time (m_n, where n is the number of a harmonic), and the standard deviation of the amplitudes A_h of a harmonic over time (md_n, where n is the number of a harmonic). Other parameters used in the experiments were: brightness (br) (the center of spectrum gravity) [13], [14] and the mel-cepstrum coefficients mcc_n [3], where n is the number of a coefficient.

4 Analysis of Parameters

All 2900 sound samples from the database described in Section 2 were described by the presented parameters. Since the total number of parameters is large (331), they will not all be listed here. We can, however, divide them into the following groups:
– parameters of formants – 46 parameters,
– parameters of the glottal source – 59 parameters,
– parameters of the pitch contour (intonation and vibrato) – 18 parameters,
– other parameters (general) – 208 parameters.

4.1 Statistical Analysis

Some chosen pairs of the parameters can be represented graphically in a 2D space. In Fig. 2, an example of the distribution of two parameters for sound samples of professional and amateur singers is presented. It can be noticed that the majority of these sound samples are separated using only two features. The large number of features in the FV and the large number of voice samples are the reason for using statistical methods for feature evaluation. With these, every feature can be analyzed, and the feature vector can be reduced to the parameters with the biggest values of the statistics. Another way is to use rough sets. Three methods of data reduction, namely the Fisher statistic (F), the Sebestyen statistic (S), and rough sets, are described in the following sections of this paper. The Fisher statistic has the ability to test the separation between the pairs of classes being recognized, while the Sebestyen criterion tests the database globally for all pairs of classes in one calculation.



Fig. 2. An example of a 2D space of the values of selected parameters

Table 1. Sebestyen criterion values for the 20 best parameters in the categories of voice quality (a) and voice type (b)

a. Voice quality:

parameter   S value
F1/F2       1.282
VIB_H       1.047
ASE_16      0.844
F31_min     0.672
ASE_23      0.654
ASE_24      0.637
F22_med     0.556
SFL_min     0.545
LAT         0.545
SFL_med     0.529
ASE_21      0.519
ASE_22      0.489
SFL_min     0.468
ASE_14      0.407
ASE_15      0.406
F2/F3       0.297
br          0.281
F31_med     0.278
F22_min     0.252
F22_max     0.248

b. Voice type:

parameter   S value
ASE_10      1.006
ASE_9       0.680
LP_5        0.518
F22_med     0.509
ASE_23      0.501
ASE_16      0.419
MCC_6       0.384
SFM_17      0.37
ASE_25      0.36
KP_4        0.358
mfcc_9var   0.358
ASE_12      0.355
LX_1        0.351
ASE_13      0.320
MCC_10      0.307
MCC_10var   0.307
F22_min     0.290
ASE_19      0.258
MCC_8       0.258
LP_6        0.241



These statistical methods are presented in [13], [15], [26]. The Sebestyen criterion has an advantage when compared to the F statistic: its global character makes it possible to sort the parameters from the most to the least appropriate for all pairs of classes. In Table 1 the results of the S criterion for the 20 best parameters are presented for the two categories of classification. The Fisher statistic allows for comparing parameters only in selected pairs of classes; therefore the parameters cannot be compared globally. Below are presented the most interesting conclusions drawn from the statistical analysis of singing voice parameters within the categories of singing voice quality and type. Detailed studies of the parameter redundancy using the Fisher criterion are presented in P. Żwan's PhD thesis [33]. In the category of voice quality, the "dedicated" parameters obtained higher Fisher values than the "general" parameters, while the "general" descriptors were more appropriate for class separation in the voice type category. In the category of voice quality, the best F results were obtained by the glottal source parameters: LX_2, LX_3, SFL, VD, F22_max, F1/F2, F2/F3. Among the "general" descriptors the best F results were obtained by some chosen MPEG-7 parameters, ASE and HSV (of various indexes), and the parameters describing value changes of harmonics in neighboring frames. For the pair of "amateur" – "professional" classes the best parameters (with regard to the Fisher statistic) were those related to the energy of the singer's formant: SFL, F22_med, F1/F3, F2/F3, F22_min. It is evident that the energy of the band of the singer's formant is crucial for distinguishing professional singers from amateurs. For the pair of "semi-professional" – "professional" classes the parameters related to the singer's formant energy do not have such great significance. In this case, the glottal source time parameters are essential. They relate to the invariability and periodicity of energy changes in singing, down to the level of single signal periods in the voice samples. High values of the Fisher statistic were obtained by the parameters related to vibrato: VD, VIB_P, VIB_H, VIB_S. Such a good result for the vibrato parameters is very valuable, because these descriptors are not correlated with the parameters of the singer's formant (they describe different elements of singing technique). In the category of voice type, the highest F values were obtained by the threshold parameters KP_2, LP_4, LP_5, LP_8, the parameters LX_1 and KX_1, the SFL_max parameter related to the singer's formant level, and the parameters related to the higher formant FW, namely F1/F3. Among the parameters defined on the basis of the WLPC analysis, the highest F values were obtained by the parameters F22_med and F22_max, which indicates the significance of defining the singer's formant values in relation to the second minima of the WLPC function. The results of the Sebestyen criterion and the Fisher statistic cannot be compared directly, but in general the results are consistent. The majority of parameters with a high S value also obtained a high F value for a certain pair of classes, and similarly, the majority of parameters with a big Fisher criterion value had a high position in the list of parameters sorted by the S value. The consistency of the results proves the statistical methods to be good tools for comparing the parameters in the context of their usability in the automatic recognition of singing voices.
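For concreteness, a per-feature, per-class-pair Fisher criterion can be sketched as follows in Python; the form F = (m1 − m2)² / (v1 + v2) is one common variant, and the exact normalization used in the study may differ, so this is illustrative only.

import numpy as np

def fisher_score(feature_a, feature_b):
    # One common form of the Fisher criterion for two classes:
    # squared mean difference divided by the sum of variances.
    a, b = np.asarray(feature_a), np.asarray(feature_b)
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var() + 1e-12)

# Example with stand-in SFL values for two classes
amateurs = np.random.normal(0.2, 0.05, 100)
professionals = np.random.normal(0.5, 0.07, 100)
print(fisher_score(amateurs, professionals))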

4.2 Rough Set-Based Analysis

Rough sets, introduced by Pawlak [21], are often employed in the analysis of data which aims to discover significant data and eliminate redundant ones. A rich literature on rough sets covers many applications [22]; they are also used in music information retrieval [15], [28], [29]. Within the context of this paper, the rough set method was used for the analysis of the descriptors defined for the purpose of this study. In the experiments, the rough set decision system RSES was employed [24]. Since this system is widely used by many researchers, the details concerning its algorithmic implementation and performance will not be provided here. The FVs were divided into training and testing sets. The parameters were quantized according to the RSES system principles. Local and global discretization were used to obtain reducts calculated with genetic and exhaustive algorithms [1]. Since two discretization methods and two algorithms for reduct calculation were used, two sets of reduced parameters and four sets of reducts containing the selected parameters were extracted. In the category of voice quality the vector of parameters was reduced to the parameters listed below:

a) the global discretization:

FV_1 = [F11, F2/F1, KX_2, KY_7, f_VIB, VIB_p, ASE_21, ASC, ASC_v, SFM_10, s_2]   (15)

b) the local discretization:

FV_2 = [F11, F21, F31, F33, F12_var, F13_min, F13_var, KX_1, KX_2, KP_1, LP_3, f_VIB, VIB_p, LAT, TC, ASE_6, ASE_7, ASE_8, ASE_21, s_2]   (16)

The sets of parameters selected by the global and local discretization methods differ. The global discretization is a simpler method, thus the number of parameters is lower: it tries to separate the groups of classes globally by selecting a smaller number of discretization cut points. However, the parameters chosen by the rough set method for both discretization methods in general match the parameters chosen by the statistical methods of data reduction. Among the reduced set of parameters, descriptors related to the WLPC analysis of formants can be found, and these can thus be qualified as significant for classification purposes. They are related to all three formants, which proves that in the category of voice quality all formants need to be parameterized and the extracted descriptors should be contained in the FV. It is interesting that among those parameters F31 and F33, which are related to the ‘high formant’ (middle frequency higher than 5 kHz), appeared. The significance of this formant is not described in the literature concerning automatic singing voice parametrization. Among the glottal source parameters, descriptors such as KX_1, KX_2, KP_1 and LP_3 were selected. On the other hand, the frequency (f_VIB) and periodicity (VIB_p) related to vibrato modulation found their place among the other important descriptors. From the remaining parameters, a few MPEG-7 parameters, namely LAT, TC, ASE_6, ASE_7, ASE_8 and ASE_21, were qualified.



In addition, one parameter related to the analysis of the spectrum, represented by s_2 and related to the variation of the second harmonic, was chosen. In order to define the reducts, two algorithms were used: the genetic and the exhaustive algorithm. In the case of global discretization, those two algorithms calculated one and the same reduct, containing all the parameters of (15). For both algorithms, all the parameters had an equal influence on the decision. In the case of local discretization, the reducts obtained by the two algorithms differed significantly. The resulting reducts are presented in (17) and (18). For the selection of the number of ‘best’ reducts, the stability coefficient values were taken into account.

– reducts for the genetic algorithm, limited to a few best reducts:

{F11, F31, F12_var, KX_2, f_VIB, VIB_p, LAT, TC, ASE_6, ASE_7, ASE_8}
{F11, F31, F12_var, F13_min, KX_2, f_VIB, VIB_p, LAT, ASE_6, ASE_7, ASE_8, ASE_21}
{F11, F31, F13_var, KX_2, f_VIB, VIB_p, TC, ASE_6, ASE_7, ASE_8, ASE_21}
{F11, F31, KX_2, f_VIB, VIB_p, LAT, TC, ASE_6, ASE_7, ASE_8, s_2}
{F11, F31, KX_2, f_VIB, VIB_p, LAT, TC, ASE_6, ASE_7, ASE_8, ASE_21}
{F11, F13_min, KX_2, KP_1, f_VIB, VIB_p, TC, ASE_6, ASE_7, ASE_8, ASE_18, s_2}
{F31, F12_var, KX_2, KP_1, f_VIB, VIB_p, LAT, TC, ASE_6, ASE_7, ASE_8}
{F31, F13_var, KX_2, f_VIB, VIB_p, TC, ASE_6, ASE_7, ASE_8, ASE_21, s_2}
{F13_var, KX_2, KP_1, f_VIB, VIB_p, TC, ASE_6, ASE_7, ASE_8, ASE_21, s_2}
{F12_var, KX_2, KP_1, f_VIB, VIB_p, LAT, TC, ASE_6, ASE_7, ASE_8, ASE_21}
{F13_min, KX_2, KP_1, f_VIB, VIB_p, LAT, TC, ASE_6, ASE_7, ASE_8, s_2}   (17)

– reducts for the exhaustive algorithm, limited to the best 6 reducts:

{KX_2, VIB_p, LAT, ASE_6, ASE_8}
{VIB_p, LAT, TC, ASE_6, ASE_8}
{F11, VIB_p, TC, ASE_6, ASE_8}
{F11, f_VIB, VIB_p, ASE_6, ASE_8}
{KX_2, LAT, TC, ASE_6, ASE_8}
{KX_2, f_VIB, VIB_p, ASE_6, ASE_8}   (18)

In the category of voice type, over 200 out of the total number of 331 parameters remained, independently of the discretization method or the type of algorithm used for the calculation. It was not possible to reduce the parameter representation as much as in the case of the voice quality category. In this context, automatic voice type recognition seems to be more complex. One of the reasons can be the diversity of registers among different voice types and the individual voice qualities, which vary among singers of the same voice type. Additionally, some singers’ voices were not easy to assign to a voice type category, e.g. the low registers of soprano voices were similar in timbre to mezzo-soprano and even alto voices.


5 Automatic Singing Voice Recognition

The next step of the experiment was to train an automatic recognition system on both the reduced and the full feature vectors. Since three reduction methods were performed for each of the categories, several decision systems (DS) were trained for the purposes of their comparison:
– DS 1 – for the full vector (331 parameters),
– DS 2 – for the vector with the 100 parameters with the biggest S values,
– DS 3 – for the vector with the 100 parameters with the biggest F values (all pairs of the classes were considered),
– DS 4 – for the vector with the 50 parameters with the biggest S values,
– DS 5 – for the vector with the 50 parameters with the biggest F values,
– DS 6 – for the vector with the 20 parameters with the biggest S values,
– DS 7 – for the vector with the 20 parameters with the biggest F values,
– DS 8 – for the vector reduced by rough sets with the global discretization method,
– DS 9 – for the vector reduced by rough sets with the local discretization method.

Since Artificial Neural Networks are widely used in automatic sound recognition [13], [14], [15], [32], [35], an ANN classifier was used. The ANN was a simple feed-forward, three-layer network with 100 neurons in the hidden layer and 3 or 6 neurons in the output layer, respectively (depending on the number of classes being recognized). Since there were 331 parameters in the FV, the input layer consisted of 331 neurons. The sounds from the database were divided into three groups: the first part of the samples (70%) was used for training, the second part (10%) for validation and the third part (20%) for testing. The samples in the training, validation and testing sets consisted of sounds of different vowels and pitches. The network trained smoothly, with the validation error increasing after approx. 3000 cycles. To train the network optimally, the minimum of the global validation error function had to be found. If the validation error kept increasing for 50 successive cycles, the last minimum of the validation error function was assumed to be global, and the learning was halted (see the sketch below). In Table 2, the automatic recognition results are presented for the nine decision systems DS1 – DS9 and the two recognition categories. V331 is the vector of all 331 parameters, Sn are the vectors reduced by the Sebestyen criterion to n parameters, Fm are the vectors reduced by the Fisher statistic to m parameters, RS_L is the vector of parameters selected by the rough set local discretization method, and RS_G is the vector of parameters selected by the rough set global discretization method.
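The early-stopping rule described above can be rendered as the following Python sketch. The train_cycle and validation_error callables are placeholders standing in for an actual ANN; a real implementation would additionally checkpoint the network weights at the last minimum.

def train_with_early_stopping(train_cycle, validation_error, patience=50,
                              max_cycles=10000):
    # Halt when the validation error has not reached a new minimum for
    # 'patience' successive cycles; the last minimum is assumed global.
    best_err, best_cycle, since_best = float('inf'), 0, 0
    for cycle in range(max_cycles):
        train_cycle()                      # one pass over the training set
        err = validation_error()
        if err < best_err:
            best_err, best_cycle, since_best = err, cycle, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_cycle, best_err

# Toy stand-ins: the error falls, then rises (overtraining after ~3000 cycles)
state = {'t': 0}
def train_cycle(): state['t'] += 1
def validation_error(): return abs(state['t'] - 3000) / 3000.0
print(train_with_early_stopping(train_cycle, validation_error))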



Table 2. Results of automatic recognition accuracy [%] for various FV size reduction methods

Category  V331  S100  F100  S50   F50   S20   F20   RS_L  RS_G
Quality   93.4  90.5  92.1  85.3  87.2  73.2  75.5  75.2  62
Type      90.0  82.3  81.5  79.0  76.0  58.3  60.1  83.1  82.5

Table 3. Comparison of the accuracy of RS- and ANN-based classifiers for various RS-based data reduction methods

Data reduction method                        RS [%]  ANN [%]
global discretization, both algorithms       96.8    62
local discretization, genetic algorithm      97.6    72.5
local discretization, exhaustive algorithm   89      72.5

vectors have the same number of parameters. The recognition accuracy is very similar for all three methods. Rough set-based algorithms can serve not only for data reduction purposes but also as classifiers. A comparison between RS and ANN classifiers acting on vectors reduced by rough set-based methods seems very interesting. Since the reducts were extracted only for the vocal quality category, the experiment was carried out for that category, and the results are presented in Table 3. The automatic recognition results are much better when an RS classifier is used. The RS method is specialized for the classification task on a reduced set of parameters: the discretization algorithm used in RS selects the most important cut points in terms of the discernibility between classes. Thus, the rules generated from the parameters by RS are strictly dedicated to the analyzed case, and, following the RS methodology, the proper rules are easy to obtain. For ANNs, since they are trained and tested using single objects, generalization is harder to obtain and every single training object can influence the results. In the case of a smaller number of parameters this is particularly significant, which can be clearly observed in Table 3. Conversely, when the number of parameters (the number of reducts) is bigger, the ANN decision system starts to perform better than RS. This may be observed in the results of the automatic recognition in the voice type category (parts b of Tables 4 and 5). In order to make a detailed comparison between the best trained ANN recognition system and the best trained RSES system, the detailed recognition results for both recognition categories are presented in Tables 4 and 5. Rows in these tables correspond to the actual classes, and columns to the classification results. In the case of the quality category, the automatic recognition results of the RS system are better compared to the ANN. The rough set system achieved very good results with a reduced FV of 20 parameters in the classification of the voice quality category. In the category of voice type, the results are worse. Moreover, in the case of the voice type category erroneous classification is not always related to neighboring classes.



Table 4. ANN singing voice recognition results for (a) Voice Quality (VQ) and (b) Voice Type (VT) categories

a.
VQ recognition [%]  amateur  semi-professional  professional
amateur             96.3     2.8                0.9
semi-professional   4.5      94.3               1.1
professional        3.5      7                  89.5

b.
VT recognition [%]  bass  baritone  tenor  alto  mezzo  soprano
bass                90.6  6.3       3.1    0     0      0
baritone            3.3   90        6.7    0     0      0
tenor               0     3.6       89.3   7.1   0      0
alto                0     0         4      80    12     4
mezzo               0     0         0      0     93.8   6.3
soprano             0     0         2.9    0     2.9    94.1

Table 5. RSES-based singing voice classification results for (a) Voice Quality (VQ) and (b) Voice Type (VT) categories

a.
VQ recognition [%]  amateur  semi-professional  professional
amateur             94.7     4.2                1.1
semi-professional   1.3      95.4               3.3
professional        0        1.6                96.7

b.
VT recognition [%]  bass  baritone  tenor  alto  mezzo  soprano
bass                84.0  10.0      4.0    2.0   0      0
baritone            13.0  64.8      13.0   0     1.9    7.3
tenor               6.0   18.0      54.0   10.0  6.0    6.0
alto                0     4.7       16.3   51.2  16.3   11.6
mezzo               3.8   0         2.6    1.3   73.1   19.2
soprano             2.9   2.9       2.9    1.4   11.4   78.6

Thus, the RSES system was not able to perform the classification as well as the ANN when trained and tested on vectors of more than 200 parameters in the category of voice type, where further vector size reduction was not possible (the total accuracy obtained equals 0.664). It is interesting to notice that voice types lying at the extremes of the voice type category were recognized with better efficiency than those contained between other classes. Also, there is not much difference whether this concerns male or female voices.

6 Conclusions

By comparing the automatic recognition results of neural networks and rough set systems, several conclusions may be reached. The recognition performed by the rough set system was better for the quality category and worse for the voice type category in comparison to the ANN. In the case of the voice quality category, it was possible for the RS system to reduce a large number of parameters to 20 descriptors, and the extraction of rules went very smoothly. Descriptors of the level of formants and of the stability of the glottal parameters, together with those related to vibrato and, in addition, MPEG-7 descriptors, made it possible to derive linear IF-THEN rules. This proves that automatic recognition in the quality category is possible with a significantly reduced number of descriptors. In the case of voice type it was not possible to achieve very good recognition results with the RS classifier, as the extraction of a small number of rules was not possible. Neural networks made it possible to classify particular types of singing voices effectively, while the rough set system achieved worse efficiency. The diversity of voice registers and the individual timbre characteristics of singers are the reason why non-linear classification systems such as ANNs should perhaps be used for automatic recognition in the category of voice type. Another reason for the lower recognition results may be that the database of singing voices was represented by too few different singers. Moreover, it has been shown that all the presented data reduction algorithms enabled a significant decrease in the feature vector size. The results obtained by the ANNs trained on vectors of the same length but produced by different data reduction methods were very similar. The parameters selected by those algorithms as the most appropriate for automatic singing voice recognition were also very similar.

Acknowledgments. The research was partially supported by the Polish Ministry of Science and Education within the project No. PBZ-MNiSzW-02/II/2007.

References

1. Bazan, J.G., Szczuka, M.S.: The Rough Set Exploration System. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets III. LNCS, vol. 3400, pp. 37–56. Springer, Heidelberg (2005)
2. Bloothoof, G.: The sound level of the singer's formant in professional singing. J. Acoust. Soc. Am. 79(6), 2028–2032 (1986)
3. Childers, D.G., Skinner, D.P., Kemerait, R.C.: The Cepstrum: A Guide to Processing. Proc. IEEE 65, 1428–1443 (1977)
4. Dejonckere, P.H., Olek, M.P.: Exactness of intervals in singing voice: A comparison between singing students and professional singers. In: Proc. 17th International Congress on Acoustics, Rome, VIII, pp. 120–121 (2001)
5. Diaz, J.A., Rothman, H.B.: Acoustic parameters for determining the differences between good and poor vibrato in singing. In: Proc. 17th International Congress on Acoustics, Rome, VIII, pp. 110–116 (2001)



6. Dziubiński, M., Kostek, B.: Octave Error Immune and Instantaneous Pitch Detection Algorithm. J. of New Music Research 34, 273–292 (2005)
7. Fry, D.B.: Basis for the acoustical study of singing. J. Acoust. Soc. Am. 28, 789–798 (1957)
8. Harma, A.: Evaluation of a warped linear predictive coding scheme. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 897–900 (2000)
9. Harma, A.: A comparison of warped and conventional linear predictive coding. IEEE Transactions on Speech and Audio Processing 5, 579–588 (2001)
10. Herzel, H., Titze, I., Steinecke, I.: Nonlinear dynamics of the voice: signal analysis and biomechanical modeling. CHAOS 5, 30–34 (1995)
11. Herrera, P., Serra, X., Peeters, G.: A proposal for the description of audio in the context of MPEG-7. In: Proc. CBMI European Workshop on Content-Based Multimedia Indexing, Toulouse, France (1999)
12. Joliveau, E., Smith, J., Wolfe, J.: Vocal tract resonances in singing: the soprano voice. J. Acoust. Soc. America 116, 2434–2439 (2004)
13. Kostek, B.: Soft Computing in Acoustics, Applications of Neural Networks, Fuzzy Logic and Rough Sets to Music Acoustics. Studies in Fuzziness and Soft Computing. Physica Verlag, Heidelberg (1999)
14. Kostek, B., Czyżewski, A.: Representing Musical Instrument Sounds for Their Automatic Classification. J. Audio Eng. Soc. 49, 768–785 (2001)
15. Kostek, B.: Perception-Based Data Processing in Acoustics. Applications to Music Information Retrieval and Psychophysiology of Hearing. Series on Cognitive Technologies. Springer, Heidelberg (2005)
16. Kostek, B., Szczuko, P., Żwan, P., Dalka, P.: Processing of Musical Data Employing Rough Sets and Artificial Neural Networks. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets III. LNCS, vol. 3400, pp. 112–133. Springer, Heidelberg (2005)
17. Kostek, B.: Applying computational intelligence to musical acoustics. Archives of Acoustics 32(3), 617–629 (2007)
18. Kruger, E., Strube, H.W.: Linear prediction on a warped frequency scale. IEEE Trans. on Acoustics, Speech, and Signal Processing 36(9), 1529–1531 (1988)
19. Lindsay, A., Herre, J.: MPEG-7 and MPEG-7 Audio – An Overview. J. Audio Eng. Society 49(7/8), 589–594 (2001)
20. Mendes, A.: Acoustic effect of vocal training. In: Proc. 17th International Congress on Acoustics, Rome, VIII, pp. 106–107 (2001)
21. Pawlak, Z.: Rough Sets. International J. Computer and Information Sciences 11, 341–356 (1982)
22. Peters, J.F., Skowron, A. (eds.): Transactions on Rough Sets V. LNCS, vol. 4100. Springer, Heidelberg (2006)
23. Rabiner, L.: On the use of autocorrelation analysis for pitch detection. IEEE Trans. ASSP 25, 24–33 (1977)
24. Rough Set Exploration System, logic.mimuw.edu.pl/~rses/RSES_doc_eng.pdf
25. Schutte, H.K., Miller, D.G.: Acoustic Details of Vibrato Cycle in Tenor High Notes. J. of Voice 5, 217–231 (1990)
26. Sebestyen, G.S.: Decision-making processes in pattern recognition. Macmillan Publishing Co., Indianapolis (1965)
27. Sundberg, J.: The science of the singing voice. Northern Illinois University Press (1987)
28. Wieczorkowska, A., Czyżewski, A.: Rough Set Based Automatic Classification of Musical Instrument Sounds. Electr. Notes Theor. Comput. Sci. 82(4) (2003)



29. Wieczorkowska, A., Raś, Z.W.: Editorial: Music Information Retrieval. J. Intell. Inf. Syst. 21(1), 5–8 (2003)
30. Wieczorkowska, A., Ras, Z.W., Zhang, X., Lewis, R.A.: Multi-way Hierarchic Classification of Musical Instrument Sounds, pp. 897–902. MUE, IEEE (2007)
31. Wolf, S.K.: Quantitative studies on the singing voice. J. Acoust. Soc. Am. 6, 255–266 (1935)
32. Żwan, P.: Expert System for Automatic Classification and Quality Assessment of Singing Voices. 121 Audio Eng. Soc. Convention, San Francisco, USA (2006)
33. Żwan, P.: Expert system for objectivization of judgments of singing voices (in Polish). Ph.D. Thesis (supervisor: Kostek, B.), Gdansk Univ. of Technology, Electronics, Telecommunications and Informatics Faculty, Multimedia Systems Department, Gdansk, Poland (2007)
34. Żwan, P., Kostek, B., Szczuko, P., Czyżewski, A.: Automatic Singing Voice Recognition Employing Neural Networks and Rough Sets. In: Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 793–802. Springer, Heidelberg (2007)
35. Żwan, P.: Automatic singing quality recognition employing artificial neural networks. Archives of Acoustics 33(1), 65–71 (2008)

Hierarchical Classifiers for Complex Spatio-temporal Concepts

Jan G. Bazan

Chair of Computer Science, University of Rzeszów, Rejtana 16A, 35-310 Rzeszów, Poland
[email protected]

Abstract. The aim of the paper is to present rough set methods of constructing hierarchical classifiers for the approximation of complex concepts. The classifiers are constructed on the basis of experimental data sets and domain knowledge that is mainly represented by a concept ontology. Information systems, decision tables and decision rules are the basic tools for modeling and constructing such classifiers. The general methodology presented here is applied to approximate spatial complex concepts and spatio-temporal complex concepts defined for (un)structured complex objects, to identify the behavioral patterns of complex objects, and to automated behavior planning for such objects when the states of the objects are represented by spatio-temporal concepts requiring approximation. We describe the results of computer experiments performed on real-life data sets from a vehicular traffic simulator and on medical data concerning infant respiratory failure.

Keywords: rough set, concept approximation, complex dynamical system, ontology of concepts, behavioral pattern identification, automated planning.

1 Introduction

Reasoning based on concepts constitutes one of the main elements of the thinking process, because it is closely related to the skill of categorization and classification of objects. The term concept means a mental picture of a group of objects (see [1]), while the term conceptualize is commonly understood to mean forming a concept or idea about something (see [1]). In the context of this work, the interest lies in classifying conceptualized sets of objects. Concepts themselves provide a means of describing (forming a mental picture of) sets of objects (for a similar understanding of the term concept, see, e.g., [2, 3, 4]). Definability of concepts is a term well known in classical logic (see, e.g., [5]). Yet in numerous applications, the concepts of interest may only be defined approximately on the basis of available, incomplete information about them (represented, e.g., by positive and negative examples) and of selected primary concepts and methods for creating new concepts out of them. This brings about the necessity to work out approximate reasoning methods based on inductive reasoning (see, e.g., [6, 7, 8, 9, 10, 11, 12, 13]).

In machine learning, this issue is known under the term learning concepts by examples (see, e.g., [10]). The main problem of learning concepts by examples is that the description of a concept under examination needs to be created on the basis of known examples of that concept. By creating a concept description we understand the detection of such properties of exemplary objects belonging to this concept that enable further examination of examples in terms of their membership in the concept under examination. A natural way to solve this problem is inductive reasoning: while obtaining further examples of objects belonging to the concept (the so-called positive examples) and examples of objects not belonging to the concept (the so-called negative examples), an attempt is made to find a description that correctly matches all or almost all examples of the concept under examination. Moreover, instead of speaking of learning concepts by examples, one may consider the more general learning of so-called classifications, which are partitions of all examples into a family of concepts (called decision classes) creating a partition of the object universe. A description of such a classification makes it possible to recognize the decision that should be made about examples unknown so far; that is, it gives us the answer as to what decision should be made also about examples not occurring in the process of classification learning.

Classifiers, also known in the literature as decision algorithms, classifying algorithms or learning algorithms, may be treated as constructive, approximate descriptions of concepts (decision classes). These algorithms constitute the kernel of decision systems that are widely applied in solving many problems occurring in such domains as pattern recognition, machine learning, expert systems, data mining and knowledge discovery (see, e.g., [6, 8, 9, 10, 11, 12, 13]). The literature contains descriptions of numerous approaches to constructing classifiers, based on such paradigms of machine learning theory as classical and modern statistical methods (see, e.g., [11, 13]), neural networks (see, e.g., [11, 13]), decision trees (see, e.g., [11]), decision rules (see, e.g., [10, 11]), and inductive logic programming (see, e.g., [11]). Many of the approaches mentioned above resulted in decision systems intended for computer support of decision making (see, e.g., [11]). An example of such a system is RSES (Rough Set Exploration System [14, 15]), which has been developed for over ten years and utilizes rough set theory, originated by Professor Zdzislaw Pawlak (see [16, 17, 18]), in combination with Boolean reasoning (see [19, 20, 21]).

With the development of modern civilization, not only the scale of the data gathered but also the complexity of the concepts and phenomena which they concern are increasing rapidly. This crucial change has brought new challenges requiring new data mining methods. In particular, data more and more often concern complex processes which do not yield to classical modeling methods. Medical and financial data, data coming from vehicle monitoring, or data about users gathered on the Internet may be of such a form. Exploration methods for such data are at the center of attention in many leading research centers in the world, and at the same time the detection of models of complex processes and their properties (patterns) from data is becoming more and more attractive for applications

(see, e.g., [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]). Making progress in this field is extremely crucial, among other things, for the development of intelligent systems which support decision making on the basis of results of analysis of the available data sets. Therefore, working out methods for the detection of process models and their properties from data, and proving their effectiveness in different applications, is of particular importance for the further development of decision support systems in many domains such as medicine, finance, industry, transport, and telecommunication.

However, in the last few years essential limitations have been discovered in the existing data mining methods for very large data sets concerning complex concepts, phenomena, or processes (see, e.g., [41, 42, 43, 44, 45, 46]). A crucial limitation of the existing methods is, among other things, the fact that they do not support effective approximation of complex concepts, that is, concepts whose approximation requires the discovery of extremely complex patterns. Intuitively, such concepts are too distant, in the semantical sense, from the available concepts, e.g., sensory ones. As a consequence, the spaces which should be searched in order to find patterns crucial for approximation are so large that an effective search of these spaces very often becomes unfeasible using the existing methods and technology. Thus, as it has turned out, the ambition to approximate complex concepts with high quality from available concepts (most often defined by sensor data) in a fully automatic way, pursued by the existing systems and by most systems under construction, faces a serious obstacle: the classifiers obtained are often of unsatisfactory quality.

Recently, it has been noticed in the literature (see, e.g., [42, 47, 48, 49, 50, 51, 52]) that one of the challenges for data mining is the discovery of methods linking the detection of patterns and concepts with domain knowledge. The latter term denotes knowledge about concepts occurring in a given domain and various relations among them. This knowledge greatly exceeds the knowledge gathered in data sets; it is often represented in a natural language and usually acquired during a dialogue with an expert in a given domain. One of the ways to represent domain knowledge is to record it in the form of a so-called concept ontology, where an ontology is usually understood as a finite hierarchy of concepts and relations among them, linking concepts from different levels (see, e.g., [53, 54]).

In the paper, we discuss methods for the approximation of complex concepts in real-life projects. The reported research is closely related to such areas as machine learning and data mining (feature selection and extraction [55, 56, 57], classifier construction [9, 10, 11, 12], analytical learning and explanation-based learning [12, 58, 59, 60, 61]), temporal and spatio-temporal reasoning [62, 63, 64], hierarchical learning and modeling [42, 52, 65, 66, 67, 68], adaptive control [67, 69], automated planning (hierarchical planning, reconstruction of plans, adaptive learning of plans) [70, 71, 72, 73, 74, 75, 76], rough sets and fuzzy sets (approximation of complex vague concepts) [77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87], granular computing (searching for compound patterns) [88, 89, 90, 91], complex adaptive systems [92, 93, 94, 95, 96, 97], autonomous multiagent systems [98, 99, 100, 101], swarm systems [102, 103, 104], and ontology development [53, 54, 105, 106, 107].
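
To make the notion of learning concepts by examples discussed above more concrete, the following Python sketch shows how a toy decision table of positive and negative examples can yield simple decision rules serving as an approximate concept description. All attribute names and data in it are hypothetical illustrations; this is a minimal sketch, not the method developed in this paper.

    # A minimal sketch of learning a concept by positive/negative examples.
    # Attribute names and data are hypothetical.
    from collections import defaultdict

    # Decision table: each row pairs attribute values of an example with
    # a decision (1 = positive example of the concept, 0 = negative).
    table = [
        ({"speed": "low",  "lane": "right"}, 1),
        ({"speed": "low",  "lane": "right"}, 1),
        ({"speed": "high", "lane": "left"},  0),
        ({"speed": "high", "lane": "right"}, 0),
    ]

    def induce_rules(table):
        """Keep single-descriptor rules that are consistent with the table."""
        seen = defaultdict(set)
        for attrs, dec in table:
            for descriptor in attrs.items():
                seen[descriptor].add(dec)
        # A descriptor occurring with only one decision becomes a rule.
        return {d: decs.pop() for d, decs in seen.items() if len(decs) == 1}

    def classify(rules, attrs):
        """Vote among matching rules; None if no rule recognizes the object."""
        votes = [dec for (a, v), dec in rules.items() if attrs.get(a) == v]
        return max(set(votes), key=votes.count) if votes else None

    rules = induce_rules(table)
    print(classify(rules, {"speed": "low", "lane": "right"}))
    # -> 1 (the inconsistent descriptor ("lane", "right") was dropped)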

It is also worthwhile mentioning that the reported research is closely related to the domain of clinical decision support for medical diagnosis and therapy (see, e.g., [108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121]). Many results reported in this domain can be characterized as methods for solving specific problems such as the temporal abstraction problem [117, 120, 121] or the medical planning problem [108, 111, 112, 119]. Many methods and algorithms proposed in this paper can also be used for solving such problems.

The main aim of the paper is to present the developed methods for the approximation of complex vague concepts involved in the specification of real-life problems and approximate reasoning used in solving these problems. The methods presented in the paper assume, however, that additional domain knowledge in the form of a concept ontology is given. Concepts from the ontology are often vague and expressed in natural language. Approximation of the ontology is used to create hints in searching for approximations of complex concepts from sensory (low-level) data.

The need for domain knowledge expressed in the form of a concept ontology can be noticed in intensively developing domains connected with data analysis and processing, as in the case of reinforcement learning (see, e.g., [12, 122, 123, 124]). In the latter field, methods of learning new strategies with reinforcement take into account concept ontologies obtained from an expert, with the help of which it is possible to construct an approximation of a function estimating the quality of performed actions. Similarly, in a Service Oriented Architecture (SOA) [47, 49], the distribution of varied Web Services can be performed with the use of domain knowledge expressed using a concept ontology. Proposals have also appeared (see, e.g., [42, 51, 52]) to use domain knowledge to search for approximations of complex concepts in a hierarchical way, which would lead to hierarchical classifiers able to approximate complex concepts with high quality, e.g., by analogy to biological systems [42]. This idea can also be related to learning of complex (e.g., nonlinear) functions for the fusion of information from different sources [125]. Therefore, currently, the problem of construction of such hierarchical classifiers is fundamental for complex concept approximation, and its solution will be crucial for the construction of many methods of intelligent data analysis. These are, for example:

– methods of classification of objects into complex spatial concepts which are semantically distant from sensor data, e.g., concepts such as safe vehicle driving on a highway or a hazardous arrangement of two cooperating robots which puts them both at risk of being damaged,

– methods of classification of objects into complex spatio-temporal concepts semantically distant from sensor data, which require observation of single objects or many objects over a certain period of time (e.g., acceleration of a vehicle on the road, gradual decrease of a patient's body temperature, a robot's backward movement while turning right),

– methods of behavioral pattern or high-risk pattern identification, where these types of patterns may be treated as complex concepts representing dynamic properties of objects; such concepts are expressed in a natural language on a

high level of abstraction, describing specific behaviors of a single object (or of many complex objects) over a certain period of time (e.g., overtaking of one vehicle by another, a traffic jam, a chase of one vehicle after another, behavior of a patient under a high life threat, ineffective cooperation of a robot team),

– methods of automatic learning of plans of complex object behavior, where a plan may be treated as a complex value of the decision which needs to be made for complex objects such as vehicles, robots, groups of vehicles, teams of robots, or patients undergoing treatment.

In the paper, we propose to link automatic methods of complex concept learning and of detection of process models and their properties with domain knowledge obtained in a dialogue with an expert. Interaction with a domain expert facilitates guiding the process of discovery of patterns and models of processes and makes the process computationally feasible. Thus, the presentation of new approximation methods for complex concepts, based on experimental data and domain knowledge represented using a concept ontology, is the main aim of this paper. In our opinion, the presented methods are useful for solving typical problems appearing when modeling complex dynamical systems.

1.1 Complex Dynamical Systems

When modeling the complex real-world phenomena and processes mentioned above and solving problems under conditions that require access to various distributed data and knowledge sources, so-called complex dynamical systems (CDS) are often applied (see, e.g., [92, 93, 94, 95, 96, 97]), or, to put it another way, autonomous multiagent systems (see, e.g., [98, 99, 100, 101]) or swarm systems (see, e.g., [104]). These are collections of complex interacting objects characterized by constant change of the parameters of their components over time, numerous relationships between the objects, the possibility of cooperation/competition among the objects, and the ability of objects to perform more or less compound actions. Examples of such systems are road traffic, a patient observed during treatment, a team of robots performing some task, etc.

It is also worthwhile mentioning that a description of CDS dynamics is often not possible with purely analytical methods, as it includes many complex vague concepts (see, e.g., [126, 127, 128]). Such concepts concern properties of chosen fragments of the CDS, which may be treated as more or less complex objects occurring in the CDS. Hence, appropriate methods are needed for extracting such fragments that are sufficient to draw conclusions about the global state of the CDS in the context of the analyzed types of changes and behaviors. In this approach, the CDS state is described by providing information about the membership of the complex objects isolated from the CDS in the established complex concepts describing properties of complex objects and relations among these objects. Apart from that, the description of CDS dynamics requires following changes of the CDS state in time, which leads to the so-called trajectory (history), that is, a sequence of the CDS states observed over a certain period of time. Therefore, methods are also needed for following changes of the selected

fragments of the CDS and changes of relations between the extracted fragments. In this paper, we use complex spatio-temporal concepts concerning properties describing the dynamics of complex objects occurring in CDSs to represent and monitor such changes. They are expressed in natural language on a much higher level of abstraction than the so-called sensor data, so far mostly applied to the approximation of concepts. Examples of such concepts are safe car driving, safe overtaking, a patient's behavior when faced with a life threat, and ineffective behavior of a robot team. However, the identification of complex spatio-temporal concepts and their use in monitoring a CDS requires the approximation of these concepts. In this paper, we propose to approximate complex spatio-temporal concepts by the hierarchical classifiers mentioned above, based on data sets and domain knowledge.
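
As a purely illustrative sketch of the representation just described (the object and concept names below are hypothetical), a CDS state can be recorded as the membership degrees of the complex objects isolated from the CDS in the established concepts, and a trajectory as a sequence of such states:

    # Illustrative data structures for CDS states and trajectories.
    # Object and concept names are hypothetical.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class CDSState:
        time: int
        # For each complex object isolated from the CDS: its membership
        # degrees in the established concepts describing its properties.
        memberships: Dict[str, Dict[str, float]] = field(default_factory=dict)

    # A trajectory (history): a sequence of CDS states observed over time.
    trajectory: List[CDSState] = [
        CDSState(0, {"vehicle_7": {"safe_driving": 0.9, "accelerating": 0.1}}),
        CDSState(1, {"vehicle_7": {"safe_driving": 0.7, "accelerating": 0.8}}),
    ]

    def concept_history(traj, obj, concept):
        """Follow how the membership of one object in one concept changes."""
        return [(s.time, s.memberships.get(obj, {}).get(concept)) for s in traj]

    print(concept_history(trajectory, "vehicle_7", "accelerating"))
    # -> [(0, 0.1), (1, 0.8)]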

1.2 Problems in Modeling Complex Dynamical Systems

In modeling complex dynamical systems there appear many problems related to the approximation of the complex concepts used to describe the dynamics of the systems. One of these problems is obviously the problem, mentioned above, of the gap between complex concepts and sensor data. Apart from that, a series of other problems may be formulated whose solution is very important for complex concept approximation and for complex dynamical system monitoring. Below, we present a list of such problems, including particularly those whose solution is the aim of this paper.

1. Problem of the gap between complex concepts and sensor data, preventing an effective direct usage of sensor data to induce approximations of complex concepts by fully automatic methods.
2. Problem of complex concept stratification in classifier construction.
3. Problem of identification of behavioral patterns of complex objects in monitoring of complex dynamical systems.
4. Problem of the context of complex object parts while monitoring complex dynamical systems.
5. Problem of time speed-up in identification of behavioral patterns.
6. Problem of automated planning of complex object behavior when the object states are represented by complex concepts requiring approximation.
7. Problem of solving conflicts between actions in automated planning of complex object behavior.
8. Problem of synchronization of plans constructed for parts of a structured complex object.
9. Problem of plan adaptation.
10. Problem of approximation of a similarity relation between complex objects, complex object states, and complex object behavioral plans using data sets and domain knowledge.

In further subsections, a brief overview of the problems mentioned above is presented.

Problem of the Gap between Complex Concepts and Sensor Data. As we mentioned before, in the approximation of spatio-temporal complex concepts using sensor data, major difficulties result from the fact that between spatio-temporal complex concepts and sensor data there exists a gap which prevents an effective direct usage of sensor data for the approximation of complex concepts. Therefore, in the paper we propose to fill the gap using domain knowledge represented mainly by a concept ontology and data sets chosen appropriately for this ontology (see Section 1.3).

Problem of Complex Concept Stratification. When we create classifiers for concepts on the basis of uncertain and imprecise data and knowledge semantically distant from the concepts under approximation, it is frequently not possible to construct a classifier which decisively classifies objects, unknown during classifier learning, to the concept or its complement. There appears a need to construct classifiers that, instead of stating clearly whether the tested object belongs to the concept or not, allow us to obtain only a certain kind of membership degree of the tested object in the concept. In other words, we would like to determine, for the tested object, how certain the fact is that this object belongs to the concept. Let us notice that this type of mechanism stratifies the concepts under approximation, that is, divides the tested objects into layers labeled with individual values of the membership degree in the concept. Such a mechanism can be obtained using different kinds of probability distributions (see [6, 43]). However, in this paper, instead of learning a probability distribution, we learn layers of concepts relevant for the construction of classifiers. We call such classifiers stratifying classifiers and we present two methods of stratifying classifier construction (see Section 1.3). Our approach is inspired by papers about linguistic variables written by Professor Lotfi Zadeh (see [129, 130, 131]).

Problem of Identifying Behavioral Patterns. The study of collective behavior in complex dynamical systems is now one of the more challenging research problems (see, e.g., [93, 99, 100, 102, 104, 132, 133, 134]), especially if one considers the introduction of some form of learning by cooperating agents (see, e.g., [103, 122, 123, 124, 135, 136, 137]). For example, efficient monitoring of complex dynamical systems very often requires the identification of so-called behavioral patterns or of a specific type of such patterns called high-risk patterns or emergent patterns (see, e.g., [93, 99, 100, 132, 138, 139, 140, 141, 142, 143, 144]). These are complex concepts concerning dynamic properties of complex objects, expressed in a natural language on a high level of abstraction and describing specific behaviors of these objects. Examples of behavioral patterns may be: overtaking of one vehicle by another vehicle, driving of a group of vehicles in a traffic jam, behavior of a patient under a high life threat, etc. These types of concepts are difficult to identify automatically because they require watching complex object behavior over a longer period of time, and this observation is usually based on the identification of a sequence of less complex spatio-temporal concepts. Moreover, a crucial role

in the identification of a given behavioral pattern is played by the sequence of less complex concepts which identify it. For example, in order to identify the behavioral pattern of overtaking of one vehicle by another, it should first be determined whether the overtaking vehicle approaches the overtaken vehicle; next, whether the overtaking vehicle changes lanes appropriately and passes the overtaken vehicle; and finally, whether the overtaking vehicle returns to the previous lane, driving in front of the overtaken vehicle. The methodology of dynamical system modeling proposed in the paper enables the approximation of behavioral patterns on the basis of data sets and domain knowledge expressed using a concept ontology (see Section 1.3).

Problem of Context for Complex Object Parts. In this paper, any complex dynamical system (CDS) is represented using descriptions of its global states or trajectories (histories), that is, sequences of CDS states observed over a certain period of time (see, e.g., [145, 146, 147, 148, 149, 150, 151, 152] and Section 1.1). Properties of such states or trajectories are often dependent on specific parts of these states or trajectories. This requires considering the relevant structure of states or trajectories, making it possible to extract parts and the relevant context of parts. Moreover, each structured object occurring in a complex dynamical system is understood as a set of parts extracted from states or trajectories of a given complex dynamical system. Such parts are often related by relations representing links or interactions between parts. That is why both learning behavioral patterns concerning structured objects and identifying such patterns, in relation to specific structured objects, require the isolation of structured objects as sets of potential parts of such objects, that is, sets of objects of lesser complexity. The elementary approach to isolating structured objects, consisting in the examination of all possible subsets (of an established size) of the set of potential parts of structured objects, cannot be applied because of the potentially high number of such subsets. For example, during an observation of a highway from a helicopter (see, e.g., [89, 153]), in order to identify a group of vehicles which are involved in a maneuver of dangerous overtaking, it would be necessary to follow (in real time) the behavior of all possible groups of vehicles of an established size (e.g., six vehicles, see Appendix A) that may be involved in this maneuver, which already with a relatively small number of visible vehicles becomes computationally too difficult. Another possibility is the application of methods which use the context in which the objects being parts of structured objects occur. Methods of this type isolate structured objects not by a direct indication of the set of parts of the searched structured object but by establishing one part of the searched structured object and attaching to it other parts which are in the same context as the established part. Unfortunately, also here, the elementary approach to determining the context of the part of the structured object, consisting in the examination of all possible subsets (of an established size) of the set of potential structured objects to which the established part of the structured object belongs, cannot be applied because of the large number of such subsets. For example, in order to identify a group of vehicles which are involved in a dangerous maneuver

and to which the vehicle under observation belongs, it would be necessary to follow (in real time) the behavior of all possible groups of vehicles of an established size (e.g., six vehicles, see Appendix A) to which the considered vehicle belongs, which is, even with a relatively small number of visible vehicles, still computationally too difficult. Therefore, special methods are needed for determining the context of the established part of the structured object, based on domain knowledge, which make it possible to limit the number of analyzed sets of parts of structured objects. In the paper, we propose the so-called sweeping method, which enables fast determination of the context of the established object treated as one of the parts of the structured object (see Section 1.3).

Problem of Time Speed-Up in Identification of Behavioral Patterns. Identification of a behavioral pattern in relation to a specific complex object may be performed by observing the behavior of the object over a certain period of time. Attempts to shorten this time are usually inadvisable, because they may cause false identification of behavioral patterns in relation to some complex objects. However, in many applications there exists a need for fast decision making (often in real time) about whether or not a given object matches the established behavioral pattern. This is extremely crucial in terms of computational complexity because it enables a rapid elimination of those complex objects which certainly do not match the pattern. Therefore, in the paper, a method is presented for the elimination of complex objects in the identification of a behavioral pattern, based on rules of fast elimination of behavioral patterns determined on the basis of data sets and domain knowledge (see Section 1.3).

Problem of Automated Planning. In monitoring the behavior of complex dynamical systems (e.g., by means of behavioral pattern identification) there may appear a need to apply methods of automated planning of complex object behavior. For example, if during observation of a complex dynamical system a behavioral pattern describing inconvenient or unsafe behavior of a complex object (i.e., a part of a system state or trajectory) is identified, then the system control module may try, using appropriate actions, to change the behavior of this object in such a way as to lead the object out of the inconvenient or unsafe situation. However, this type of short-term intervention may not be sufficient to lead the object out of the undesired situation permanently. Therefore, the possibility of automated planning is often considered, which means the construction of plans, that is, sequences of actions alternating with states, to be performed by the complex object or on the complex object in order to bring it to a specific state. In the literature, descriptions of many automated planning methods may be found (see, e.g., [70, 71, 72, 73, 74, 75, 76]). However, applying the latter approaches, it has to be assumed that the current complex object state is known and results from a simple analysis of the current values of the available parameters of this object. Meanwhile, in complex dynamical systems, a complex object state is often described in a natural language using vague spatio-temporal conditions whose satisfiability cannot be tested on the basis of a simple analysis of available information about the object. For example, when planning the treatment of an infant suffering from

respiratory failure, the infant's condition may be described as follows:

– Patient with RDS type IV, persistent PDA and sepsis with mild internal organ involvement (see Appendix B for more medical details).

Stating that a given patient is in the above condition requires an analysis of the examination results of this patient registered over a certain period of time, with substantial support from domain knowledge provided by experts (medical doctors). Conditions of this type may be represented using complex spatio-temporal concepts. Identification of these conditions requires, however, an approximation of the concepts representing them with the help of classifiers. Therefore, in the paper, we describe methods of automated planning of the behavior of complex objects whose states are described using complex concepts requiring approximation (see Section 1.3).

Problem of Solving Conflicts between Actions. In automated planning methods, during plan construction there usually appears the problem of a nondeterministic choice of one of the actions possible to apply in a given state. Therefore, there may usually be many solutions to a given planning problem, consisting in bringing a complex object from the initial state to the final one using different plans. Meanwhile, in practical applications there often appears a situation in which the automatically generated plan must be compatible with the plan proposed by an expert (e.g., the treatment plan should be compatible with the plan proposed by human experts from a medical clinic). Hence, we inevitably need tools which may be used during plan generation to solve the conflicts appearing between actions which may be performed at a given planning state. This also concerns making the decision about which state results from the performed action. That is why, in the paper, we propose a method which indicates the action to be performed in a given state or indicates the state which is the result of the chosen action. This method uses a special classifier constructed on the basis of data sets and domain knowledge (see Section 1.3).

Problem of Synchronizing Plans. In planning the behavior of structurally complex objects consisting of parts which are objects of lesser complexity, it is often not possible to effectively plan the behavior of such an object as a whole. That is why, in such cases, the behavior of each part is usually planned separately. However, such an approach to behavior planning for a complex object requires synchronization of the plans constructed for the individual parts in such a way that these plans do not contradict one another but rather complement one another, in order to plan the best behavior for the whole complex object. For example, treatment of a certain illness A which is the result of illnesses B and C requires such treatment planning for illnesses B and C that their treatments are not contradictory but support and complement one another during the treatment of illness A. In the paper, a plan synchronization method for the parts of a complex object is presented. It uses two classifiers constructed on the basis of data sets and domain knowledge (see Section 1.3). If we treat plans constructed for parts

of a structured object as processes of some kind, then the method of synchronizing those plans is a method of synchronization of the processes corresponding to the parts of the structured object. It should be emphasized, however, that the significant novelty of the method of process synchronization presented here, in relation to the ones known from the literature (see, e.g., [154, 155, 156, 157, 158, 159]), is the fact that the synchronization is carried out using classifiers determined on the basis of data sets and domain knowledge.

Plan Adaptation Problem. After constructing a plan for a complex object, the execution of this plan may take place. However, the execution of the whole plan is not always possible in practice. It may happen that, during plan execution, the complex object reaches a state that is not compatible with the state predicted by the plan. Then, the question arises whether the plan should still be executed or whether it should be reconstructed (updated). If the current complex object state differs only slightly from the state expected by the plan, then the execution of the current plan may perhaps be continued. If, however, the current state differs significantly from the state in the plan, then the current plan has to be reconstructed. It would seem that the easiest way to reconstruct the plan is the construction of a new plan which commences at the current state of the complex object and ends at the final state of the old plan (a total reconstruction of the plan). However, in practical applications, a total reconstruction can be too costly in terms of computation or resources. Therefore, we need other methods which can effectively reconstruct the original plan in such a way that it is realized at least partially. Hence, in the paper, we propose a method of plan reconstruction called partial reconstruction. It consists of constructing a short so-called repair plan which quickly brings the complex object to a so-called return state of the current plan. Next, on the basis of the repair plan, a reconstruction of the current plan is performed by replacing its fragment beginning with the current state and ending with the return state by the repair plan (see Section 1.3). It is worth noticing that this issue is related to the domain of artificial intelligence called reasoning about changes (see, e.g., [160, 161]). Research works in this domain very often concern the construction of a method for reasoning about changes in the satisfiability of concepts on a higher level of a certain concept hierarchy as a basis for the discovery of plans aimed at restoring the satisfiability of the desired concepts on a lower level of this hierarchy.

Problem of Similarity Relation Approximation. In building classifiers approximating complex spatio-temporal concepts, there may appear a need to estimate the similarity or the difference of two elements of a similar type, such as complex objects, complex object states or plans generated for complex objects. This is a classical case of the problem of defining a similarity relation (or perhaps the dissimilarity relation complementary to it), which is still one of the greatest challenges of data mining and knowledge discovery. The existing methods of defining similarity relations are based on building similarity functions on the basis of simple strategies of fusion of local similarities of the compared elements. Optimization of the established similarity formula is performed

by tuning both the parameters of the local similarities and the parameters linking them (see, e.g., [162, 163, 164, 165, 166, 167, 168, 169, 170, 171]). Frequently, however, experts from a given domain are not able to provide a formula that would not raise their doubts, and they limit themselves to the presentation of a set of examples of similarity function values, that is, a set of pairs of compared elements labeled with degrees representing the similarity function value. In this case, defining the similarity function requires its approximation with the help of a classifier, and at the same time such properties of the compared elements should be defined as make it possible to approximate the similarity function. The main difficulty of similarity function approximation is an appropriate choice of these properties. Meanwhile, according to the domain knowledge, there are usually many different aspects of similarity between compared elements. For example, when comparing medical plans constructed for the treatment of infants with respiratory failure (see Appendix B), the similarity of antibiotic therapies, the similarity of the applied mechanical ventilation methods, the similarity of PDA closing, and others should be taken into account. Each of these aspects should be considered in a specific way, and the presentation of formulas describing them can be extremely difficult for an expert. Frequently, an expert may only give examples of pairs of compared elements together with their similarity in each of these aspects. Moreover, a fusion of the different similarity aspects into a global similarity should also be performed in a way resulting from the domain knowledge. This way may be expressed, for example, using a concept ontology. In the paper, we propose a method of similarity relation approximation based on the usage of data sets and domain knowledge expressed, among other things, in the form of a concept ontology (see Section 1.3).
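
A minimal sketch of this idea, assuming hypothetical aspect names, data and expert labels, is given below: the local similarity degrees of a pair of plans in several aspects form the attributes of a table labeled by an expert, and a simple classifier (here a nearest-neighbor rule, standing in for the ontology-driven fusion described above) approximates the global similarity.

    # A sketch of approximating a global similarity relation from
    # expert-labeled examples. Aspect names and data are hypothetical;
    # in the approach of this paper, the fusion of aspects would follow
    # a concept ontology rather than the flat 1-NN rule used here.

    # Each example: local similarity degrees of a pair of treatment plans
    # in several aspects, plus the expert's global similarity label.
    examples = [
        ({"antibiotics": 0.9, "ventilation": 0.8, "pda_closing": 0.7}, "similar"),
        ({"antibiotics": 0.2, "ventilation": 0.9, "pda_closing": 0.1}, "dissimilar"),
        ({"antibiotics": 0.8, "ventilation": 0.3, "pda_closing": 0.9}, "similar"),
    ]

    def approximate_similarity(pair_aspects, examples):
        """Label a new pair with the label of its nearest labeled example."""
        def dist(a, b):
            return sum((a[x] - b[x]) ** 2 for x in a)
        return min(examples, key=lambda ex: dist(pair_aspects, ex[0]))[1]

    print(approximate_similarity(
        {"antibiotics": 0.85, "ventilation": 0.75, "pda_closing": 0.8},
        examples))
    # -> similar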

1.3 Overview of the Results Achieved

As we mentioned before, the aim of this paper is to present a set of methods for the approximation of complex spatio-temporal concepts and for approximate reasoning about these concepts, assuming that the information about the concepts is given mainly in the form of a concept ontology. The results described in the paper may be divided into the following groups:

1. methods for the construction of classifiers stratifying a given concept,
2. a general methodology of concept approximation with the usage of data sets and domain knowledge represented mainly in the form of a concept ontology,
3. methods for the approximation of spatial concepts from an ontology,
4. methods for the approximation of spatio-temporal concepts from an ontology defined for unstructured objects,
5. methods for the approximation of spatio-temporal concepts from an ontology defined for structured objects,
6. methods for behavioral pattern identification of complex objects in states of complex dynamical systems,

7. methods for the automated planning of the behavior of complex objects when the object states are represented by vague complex concepts requiring approximation,
8. implementation of all of the more important methods described in the paper as an extension of the RSES system.

In further subsections we briefly characterize the above groups of results. At this point we present the publications on which the main results of our research have been partially based. The initial version of the method for the approximation of spatial concepts from an ontology was described in [172]. Methods for the approximation of spatio-temporal concepts and methods for behavioral pattern identification were presented in [88, 173, 174, 175, 176, 177, 178]. Papers [173, 176, 177, 178] concern the recognition of behavioral patterns of a vehicle or a group of vehicles on the road. The traffic simulator used to generate data for the computer experiments was described in [179]. The paper [174] concerns medical applications related to the recognition of a high death risk pattern for infants suffering from respiratory failure, whereas papers [88, 175] concern both of the applications mentioned above. Finally, methods for the automated planning of the behavior of complex objects were described in [88, 180, 181].

Methods for Construction of Classifiers Stratifying Concepts. In practice, the construction of classifiers often takes place on the basis of data sets containing uncertain and imprecise information (knowledge). That is why it is often not possible to construct a classifier which decisively classifies objects to the concept or its complement. This phenomenon occurs particularly when there is a need to classify objects not occurring in the learning set of objects, that is, those which were not used to construct the classifier. One possible approach is to search for classifiers approximating a probability distribution (see, e.g., [6, 43]). However, in applications, one may often require a less exact method based on classifying objects to different linguistic layers of the concept. This idea is inspired by the papers of Professor Lotfi Zadeh (see, e.g., [129, 130, 131]). In our approach, the discovered concept layers are used as patterns in searching for the approximation of a more compound concept. In the paper, we present methods for the construction of classifiers which, instead of stating clearly whether a tested object belongs to the concept or not, make it possible to obtain some membership degree of the tested object in the concept. We define the concept of a stratifying classifier as a classifying algorithm stratifying concepts, that is, classifying objects to different concept layers (see Section 3). We propose two approaches to the construction of these classifiers. The first is the expert approach, based on an expert defining an additional attribute in the data which describes the membership of an object in individual concept layers; next, a classifier differentiating the layers as decision classes is constructed. The second approach, called the automated approach, is based on designing algorithms which are extensions of classifiers and enable classifying objects to concept layers on the basis of certain premises and experimental observations. In the paper, a new method of this type is proposed, based on the shortening of decision rules relative to various consistency coefficients.
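
The following sketch (with hypothetical rules, thresholds and layer labels) illustrates the general idea behind the automated approach: the degree of agreement among the decision rules matching a tested object is mapped to a linguistic concept layer instead of a yes/no answer.

    # A sketch of a stratifying classifier. Rules, layer labels and
    # thresholds are hypothetical and serve illustration only.

    rules = [  # (condition, decision): 1 = in the concept, 0 = outside
        ({"speed": "low"}, 1),
        ({"lane": "right"}, 1),
        ({"speed": "high"}, 0),
    ]

    LAYERS = [(0.8, "certainly in the concept"),
              (0.6, "possibly in the concept"),
              (0.4, "boundary"),
              (0.2, "possibly outside the concept"),
              (0.0, "certainly outside the concept")]

    def stratify(obj):
        """Assign the tested object to a concept layer, not a yes/no answer."""
        matching = [dec for cond, dec in rules
                    if all(obj.get(a) == v for a, v in cond.items())]
        degree = sum(matching) / len(matching) if matching else 0.5
        for threshold, layer in LAYERS:
            if degree >= threshold:
                return layer

    print(stratify({"speed": "low", "lane": "right"}))
    # -> certainly in the concept (both positive rules match)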

General Methodology of Concept Approximation from an Ontology. One of the main results presented in this paper is a methodology for approximating concepts from an ontology. Generally, in order to approximate concepts, the method of concept approximation on the basis of positive and negative examples, classical in machine learning [10], is applied. It is based on the construction, for each concept, of a data table known in rough set theory as a decision table (a special information system with a distinguished attribute called the decision [16]), with rows (called objects) corresponding to positive and negative examples of the approximated concept and columns describing properties (features, attributes) of examples, expressed by formulas in a considered language. The last column, called the decision column, is treated as a description of the membership of individual examples in the approximated concept. For a table constructed in such a way, classifiers approximating the concept are built. In such an approach, the main problem is the choice of examples of a given concept and of properties of these examples.

The specificity of the methodology of concept approximation proposed here, in comparison with other methods (see, e.g., [11, 52, 182]), is the usage of domain knowledge expressed in the form of a concept ontology together with rough set methods. For concepts from the lowest level of the ontology hierarchy (the sensor level), which do not depend on the remaining concepts, we assume that so-called sensor attributes are also available which, on the basis of given positive and negative examples, enable approximating these concepts using classical methods of classifier construction. The concept approximation methods applied on a higher level of the ontology, however, consist in the approximation of concepts using concepts from the lower ontology level. In this way, hierarchical classifiers are created which use domain knowledge recorded in the form of ontology levels. In other words, patterns discovered for the approximation of concepts on a given hierarchy level are used in the construction of more compound patterns relevant for the approximation of concepts on the next hierarchy level. To approximate concepts from a higher ontology level, sensor attributes cannot be applied directly because the “semantical distance” of the higher level concepts from the sensor attributes is too large and they are defined on different abstraction levels, i.e., searching for relevant features to approximate such concepts directly from sensory features becomes unfeasible (see the first problem from Section 1.2). For example, it is hardly believable that, given only sensor attributes describing simple parameters of driving a vehicle (e.g., location, speed, acceleration), one can approximate such a complex concept as safe driving of a vehicle. Therefore, we propose a method by means of which concepts from a higher ontology level are approximated exclusively by concepts from one level below.

The proposed approach to the approximation of a higher level concept is based on the assumption that the concept from the higher ontology level is semantically not too far from the concepts lying on the lower level of the ontology. “Not too far” means that it may be expected that it is possible to approximate a concept

from the higher ontology level with the help of lower ontology level concepts and of patterns used for, or derived from, their construction, for which classifiers have already been built. If we assume that the approximation of concepts on the higher ontology level takes place using lower level concepts, then, according to the established concept approximation methodology, positive and negative examples of the approximated concept are needed, as well as their properties serving the purpose of approximation. However, because of the semantical differences, mentioned above, between concepts on different ontology levels, examples of lower ontology level concepts cannot be used directly to approximate a higher ontology level concept. For example, if the higher level concept concerns a group of vehicles (e.g., driving in a traffic jam, a chase of one vehicle after another, overtaking), whereas the lower level concepts concern single vehicles (e.g., accelerating, decelerating, changing lanes), then the properties of a single vehicle (defined in order to approximate lower ontology level concepts) are usually insufficient to describe the properties of the whole group of vehicles. Difficulties with concept approximation on the higher ontology level using examples of the lower ontology level also appear when the higher ontology level contains concepts concerning a time period different from the one related to the concepts on the lower ontology level. For example, a higher level concept may concern a time window, that is, a certain period of time (e.g., vehicle acceleration, vehicle deceleration), whereas the lower level concepts may concern a certain instant, that is, a time point (e.g., a small vehicle speed, location of a vehicle in the right lane).

Hence, we present a method for the construction of positive and negative examples of a concept of a higher ontology level consisting, in the general case, in the arrangement (putting together) of sets of examples of concepts of the lower ontology level. At the same time, we define and represent such sets using patterns expressed in languages describing properties of examples of concepts of the lower ontology level. These sets (represented by patterns) are arranged according to so-called constraints resulting from the domain knowledge and determining which sets (patterns) may, and which may not, be arranged for the construction of examples of higher level concepts. Thus, object structures on higher hierarchical levels come into being through the linking (with the consideration of certain constraints) of objects from lower levels (more precisely, of sets of these objects described by patterns). Such an approach enables gradual modeling of the properties of more and more complex objects. Starting with elementary objects, objects being sets of them, sequences of such objects, sets of sequences, etc. are gradually modeled. Different languages expressing properties of, e.g., elementary objects, object sequences, or sets of sequences correspond to different model levels. A crucial innovative feature of the methods presented here is the fact that, to define patterns describing examples of a lower ontology level, the classifiers constructed for these concepts are used.

The example construction process for higher ontology level concepts on the basis of lower level concepts proceeds in the following way. Objects which are positive and negative examples of lower ontology level concepts are elements of a

certain relational structure domain. Relations occurring in such a structure express relationships between these objects and may be used to extract sets of objects of the lower ontology level. Each extracted set of objects is also the domain of a certain relational structure, in which relations are defined using information from the lower level. The process of extraction of relational structures is performed in order to approximate a higher ontology level concept with the help of lower ontology level concepts. Hence, to extract relational structures, we necessarily need information about the membership of lower level objects in the concepts from this level. Such information may be made available for any tested object by the application of the previously created classifiers for the lower ontology level concepts. Let us note that classifiers stratifying concepts are of special importance here. The language in which we define formulas (patterns) to extract new relational structures, using relational structures and lower ontology level concepts, is called the language for extracting relational structures (ERS-language). For relational structures extracted in such a way, properties (attributes) may be defined which lead to an information system whose objects are the extracted relational structures and whose attributes are the properties of these structures (RS-information system). Relational structure properties may be defined using patterns which are formulas in a language specially constructed for this purpose, i.e., a language for defining features of relational structures (FRS-language). For example, some of the languages used to define the properties of extracted relational structures presented in this paper use elements of temporal logics with linear time, e.g., Linear Temporal Logic (see, e.g., [183, 184, 185]).

Objects of an RS-information system are often inappropriate for making their properties relevant for the approximation of the higher ontology level concepts. This is due to the fact that there are too many such objects and their descriptions are too detailed. Hence, when applied to the approximation of a higher ontology level concept, the extension of the created classifier would be too low, that is, the classifier would classify too small a number of tested objects. Apart from that, the problem of computational complexity would appear: because of the large number of objects in such information systems, the number of objects in a linking table, constructed in order to approximate concepts determined on a set of objects of a complex structure, would be too large to construct a classifier effectively (see below). That is why a grouping (clustering) of such objects is applied, which leads to obtaining more general objects, i.e., clusters of relational structures. This grouping may take place using a language chosen by an expert and called the language for extracting clusters of relational structures (ECRS-language). Within this language, a family of patterns may be selected to extract relevant clusters of relational structures from the initial information system. For the clusters of relational structures obtained, an information system may be constructed whose objects are the clusters defined by patterns from this family and whose attributes are the properties of these clusters. The properties of these clusters may be defined by patterns which are formulas of a language specially constructed for this purpose, i.e., a language for defining features of clusters

of relational structures (FCRS-language). For example, some of the languages assigned to define the properties of relational structure clusters presented in this paper use elements of temporal logics with branching time, e.g., Branching Temporal Logic (see, e.g., [183, 184, 185]). The information system whose objects are clusters of relational structures (CRS-information system) may already be used to approximate the concept of the higher ontology level. In order to do this, a new attribute is added to the system by the expert; it informs about the membership of individual clusters in the approximated concept, and owing to that we obtain an approximation table of the higher ontology level concept.

The method of construction of the approximation table of a higher ontology level concept may be generalized to concepts determined on a set of structured objects, that is, objects consisting of a set of parts (e.g., a group of vehicles on the road, a group of interacting illnesses, a robot team performing a task together). This generalization means that the CRS-information systems constructed for the individual parts may be linked in order to obtain an approximation table of a higher ontology level concept determined for structured objects. Objects of this table are obtained through an arrangement (linking) of all possible objects of the linked information systems. From the mathematical point of view, this arrangement is a Cartesian product of the sets of objects of the linked information systems. However, in terms of domain knowledge, not all object links belonging to such a Cartesian product are possible (see [78, 84, 186, 187]). For example, if we approximate the concept of safe overtaking, it makes sense to arrange objects concerning only such vehicle pairs as are in the process of the overtaking maneuver. For the reason mentioned above, that is, the elimination of unrealistic combinations of objects, so-called constraints are defined, which are formulas built on the basis of the features of the arranged objects. The constraints determine which objects may be arranged in order to obtain an example of an object from a higher level and which may not. Additionally, we assume that to each arrangement allowed by the constraints the expert adds a decision value informing whether a given arrangement belongs or does not belong to the approximated concept of the higher level. The table constructed in such a way serves the purpose of the approximation of a concept describing structured objects.

However, in order to approximate a concept concerning structured objects, it is often necessary to construct features describing not only all parts of the structured object but also the relations between the parts. For example, for driving of one vehicle after another, apart from features describing the behavior of the two vehicles separately, features describing the location of these vehicles in relation to one another ought to be constructed as well. That is why, in the construction of a table of concept approximation for structured objects, an additional CRS-information system is constructed whose attributes describe the whole structured object in terms of the relations between the parts of this object. In the approximation of a concept concerning structured objects, this system is arranged together with the other CRS-information systems constructed for the individual parts of the structured objects.
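
A minimal sketch of this construction, with hypothetical attributes, constraint and expert labeling, is the following: the objects of two information systems (one per part of the structured object) are linked by a constraint-filtered Cartesian product and labeled by the expert, yielding the approximation table.

    # A sketch of building an approximation table for a structured object
    # by linking two information systems under a constraint. Attributes,
    # the constraint and the expert labeling are hypothetical.
    from itertools import product

    overtaking = [{"id": "a1", "accelerating": 1, "lane": "left"},
                  {"id": "a2", "accelerating": 0, "lane": "right"}]
    overtaken = [{"id": "b1", "decelerating": 0, "lane": "right"}]

    def constraint(x, y):
        # Link only vehicle pairs that could take part in overtaking:
        # here, the two vehicles must not occupy the same lane.
        return x["lane"] != y["lane"]

    def link(system_a, system_b, constraint, expert_decision):
        """Constraint-filtered Cartesian product, labeled by the expert."""
        table = []
        for x, y in product(system_a, system_b):
            if constraint(x, y):
                row = {**{"a_" + k: v for k, v in x.items()},
                       **{"b_" + k: v for k, v in y.items()}}
                row["decision"] = expert_decision(x, y)
                table.append(row)
        return table

    # A hypothetical stand-in for the expert's labeling.
    table = link(overtaking, overtaken, constraint,
                 lambda x, y: "safe" if x["accelerating"] else "unsafe")
    print(len(table))  # -> 1 (the pair sharing a lane was filtered out)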

[Fig. 1. Three cases of complex concept approximation in an ontology. In each case a higher ontology level concept C is approximated from lower ontology level concepts C1, ..., Cl. Case 1: a spatial concept of the higher level approximated from spatial concepts of the lower level, defined for the same type of complex objects. Case 2: a spatio-temporal concept of the higher level approximated from spatial concepts of the lower level, defined for the same type of complex objects. Case 3: a spatio-temporal concept of the higher level, defined for structured complex objects, approximated from spatio-temporal concepts of the lower level, defined for parts of structured complex objects.]

A fundamental problem in the construction of an approximation table of a higher ontology level concept is, therefore, the choice of the four appropriate languages used during its construction. The first language serves the purpose of defining patterns, in a set of examples of a concept of the lower ontology level, which enable the relational structure extraction. The second one enables defining the properties of these structures. The third one makes it possible to define relational structure clusters and, finally, the fourth one, the properties of these clusters. All these languages must be defined in such a way as to make the properties of the created relational structure clusters useful on the higher ontology level for the approximation of the concept occurring there. Moreover, when the approximated concept concerns structured objects, each of the parts of this type of object may require another four languages similar to those already mentioned above. The definitions of the above four languages depend on the semantical difference between the concepts from the two ontology levels. In the paper, the above methodology is applied in the three following cases, in which the above four languages are defined in completely different ways:

1. The concept of the higher ontology level is a spatial concept (it does not require observing changes of objects over time) and it is defined on the set of the same objects (examples) as the concepts of the lower ontology level; at the same time, the lower ontology level concepts are also spatial concepts (see Case 1 in Fig. 1).

2. The concept of the higher ontology level is a spatio-temporal concept (it requires observing object changes over time) and it is defined on a set of the same objects (examples) as the lower ontology level concepts; moreover, the lower ontology level concepts are exclusively spatial concepts (see Case 2 in Fig. 1).
3. The concept of the higher ontology level is a spatio-temporal concept defined on a set of objects which are structured objects in relation to the objects (examples) of the lower ontology level concepts, that is, the lower ontology level objects are parts of objects from the higher ontology level; additionally, the lower ontology level concepts are also spatio-temporal concepts (see Case 3 in Fig. 1).

The methods described in the next three subsections concern the above three cases. These methods have also found application in the construction of methods of behavioral pattern identification and in automated planning.

Methods of Approximation of Spatial Concepts. In the paper, a method of approximating concepts from an ontology is proposed for the case when the higher ontology level concept is a spatial concept (not requiring an observation of changes over time) and it is defined on a set of the same objects (examples) as the lower ontology level concepts; at the same time, the lower level concepts are also spatial concepts. An exemplary situation of this type is the approximation of the concept Safe overtaking (concerning single vehicles on the road) using concepts such as Safe distance from the opposite vehicle during overtaking, Possibility of going back to the right lane and Possibility of safe stopping before the crossroads.

The concept approximation method described in this subsection is an example of the general methodology of approximating concepts from an ontology described previously. Its specificity is therefore the usage of domain knowledge expressed in the form of a concept ontology and the application of rough set methods, mainly in terms of classifier construction methods. The basic terms used in the presented method are pattern and production rule. Patterns are descriptions of examples of concepts from an ontology, and they are constructed using classifiers stratifying these concepts. A production rule is a decision rule constructed over two adjacent levels of the ontology. In the predecessor of this rule there are patterns for concepts from the lower level of the ontology, whereas in the successor there is a pattern for one concept from the higher level of the ontology (connected with the concepts from the rule predecessor), where the patterns of both the predecessor and the successor are chosen from patterns constructed earlier for concepts from the two adjacent levels of the ontology. A rule constructed in such a way may serve as a simple classifier or as an argument “for”/“against” the given concept, enabling the classification of objects which match the patterns from the rule predecessor to the pattern from the rule successor. In the paper, an algorithmic method for the induction of production rules is proposed, consisting in an appropriate search of data tables with attributes describing the membership of training objects in particular layers of concepts (see Section 5.4). These tables are constructed using so-called constraints between concepts, thanks to which the information put into the tables


only concerns those objects/examples which might be found there according to the production rule under construction. Although a single production rule may be used as a classifier for the concept appearing in its successor, it is not yet a complete classifier, i.e., one classifying all objects belonging to the approximated concept and not only those matching the patterns of the rule predecessor. Therefore, in practice, production rules are grouped into so-called productions (see Section 5.3), i.e., collections of production rules, in such a way that each production contains rules having patterns for the same concepts in the predecessor and the successor, but corresponding to their different layers. Such a production is able to classify many more objects than a single production rule, and these objects are classified into different layers of the concept occurring in the rule successor. Both productions and single production rules are constructed for two adjacent levels of an ontology only. Therefore, in order to make full use of the whole ontology, so-called AR-schemes, i.e., approximate reasoning schemes (see, e.g., [77, 89, 172, 188, 189, 190, 191, 192, 193, 194]), are constructed as hierarchical compositions of production rules (see Section 5.7). The synthesis of an AR-scheme is carried out in such a way that to a production rule from a lower hierarchical level of the AR-scheme under construction another production rule from a higher level may be attached, but only one in which one of the concepts for which a predecessor pattern was constructed is the concept connected with the successor of the rule from the previous level. Additionally, it is required that the set of objects matching the pattern occurring in the rule successor from the lower level is included in the set of objects matching the pattern occurring in the rule predecessor from the higher level. To the two combined production rules further production rules can be attached (from above, from below, or from the side), and in this way a multilevel structure is built which is a composition of many production rules. An AR-scheme constructed in this way can be used as a hierarchical classifier whose input consists of the predecessors of the production rules from the lowest part of the AR-scheme hierarchy and whose output is the successor of the rule from the highest part of the hierarchy. Thus, each AR-scheme is a classifier for the concept occurring in the successor of the rule from the highest part of the hierarchy of the scheme or, to be precise, for the concept determined by the pattern occurring in that successor. However, similarly to the case of a single production rule, an AR-scheme is not a full classifier yet. That is why, in practice, many AR-schemes are constructed for a particular concept, approximating its different layers or regions. In this paper, two approaches to constructing AR-schemes are proposed (see Section 5.7). The first approach is based on a memory of AR-schemes and consists in building many AR-schemes after determining the productions; these schemes are later stored and used for the classification of tested objects. The second approach is based on a dynamic construction of AR-schemes: during the classification of a given tested object, an


appropriate AR-scheme for classifying this particular object is built on the basis of a given collection of productions ("lazy" classification). In order to test the quality and effectiveness of classifier construction methods based on AR-schemes, experiments on data generated by the traffic simulator were performed (see Section 5.8). The experiments showed that the classification quality obtained by classifiers based on AR-schemes is higher than that obtained by traditional classifiers based on decision rules. Apart from that, the time spent on constructing classifiers based on AR-schemes is shorter than that needed to construct classical rule classifiers, their structure is less complicated than that of classical rule classifiers (a considerably smaller average number of decision rules), and their performance is much more stable under differences in the data samples supplied for learning (e.g., a changed simulation scenario).

Methods of Approximation of Spatio-temporal Concepts. We also propose a method of approximating concepts from an ontology when the higher ontology level concept is a spatio-temporal concept (it requires observing changes of complex objects over time) defined on a set of the same objects as the lower ontology level concepts; at the same time, the lower ontology level concepts are spatial concepts only. This case concerns a situation when, during an observation of a single object aimed at capturing its behavior described by a higher ontology level concept, we have to observe it longer than is required to capture the behaviors described by the lower ontology level concepts. For example, the lower ontology level concepts may concern simple vehicle behaviors such as a small increase in speed, a small decrease in speed, or a small move towards the left lane, whereas the higher ontology level concept may be a more complex concept such as acceleration in the right lane. Let us notice that determining whether a vehicle accelerates in the right lane requires observing it for some time, called a time window, while determining whether the speed of a vehicle increases requires only registering its speed at two neighboring instants (time points). That is why spatio-temporal concepts are more difficult to approximate than spatial concepts, whose approximation does not require observing changes of objects over time. Similarly to the spatial concept approximation described above, the method of concept approximation described in this subsection is an example of the general methodology of approximating concepts from an ontology described earlier. Its specificity is, therefore, the usage of domain knowledge expressed in the form of a concept ontology and the application of rough set methods, mainly classifier construction methods. However, in this case more complex ontologies are used, containing both spatial and spatio-temporal concepts. The starting point for the proposed method is the observation that spatio-temporal concept identification requires observing a complex object over a longer period of time called a time window (see Section 6.4). To describe complex object changes in the time window, so-called temporal patterns (see Section 6.6) are used, which are defined as functions determined on a given time window. These patterns, being in fact formulas of a certain language, also characterize


certain spatial properties of the examined complex object, observed in a given time window. They are constructed using lower ontology level concepts, and that is why determining whether an object matches these patterns requires the application of classifiers constructed for the concepts of the lower ontology level. On a slightly higher abstraction level, spatio-temporal concepts (also called temporal concepts) are used directly to describe complex object behaviors (see Section 6.5). These concepts are defined by an expert in a natural language and are usually formulated using questions about the current status of spatio-temporal objects, e.g., Does the examined vehicle accelerate in the right lane?, Does the vehicle maintain a constant speed during lane changing? The method proposed here is based on approximating temporal concepts by temporal patterns with the help of classifiers. To this end a special decision table is constructed, called a temporal concept table (see Section 6.9). The rows of this table represent the parameter vectors of lower ontology level concepts observed in a time window (more precisely, clusters of such parameter vectors). The columns of this table (apart from the last one) are determined using temporal patterns, while the last column represents the membership of the object described by the parameters (features, attributes) from a given row in the approximated temporal concept. Temporal concepts may be treated as nodes of a certain directed graph which is called a behavioral graph. The links (directed edges) of this graph are temporal relations between temporal concepts, expressing that two temporal concepts are satisfied one after another in temporal sequence. These graphs are of great significance in the approximation of concepts concerning structured objects (see below).

Methods of Approximation of Spatio-temporal Concepts for Structured Objects. The method of spatio-temporal concept approximation presented in the previous subsection is extended to the case when higher ontology level concepts are defined on a set of objects which are structured objects in relation to the objects (examples) of the lower ontology level concepts, that is, the lower ontology level objects are parts of the objects from the higher ontology level. Moreover, the lower ontology level concepts are also spatio-temporal concepts. This case concerns a situation when, during an observation of a structured object aimed at capturing its behavior described by a higher ontology level concept, we must observe this object longer than is required to capture the behavior of a single part of the structured object described by the lower ontology level concepts. For example, the lower ontology level concepts may concern complex behaviors of a single vehicle such as acceleration in the right lane, acceleration and changing lanes from right to left, or decelerating in the left lane, whereas the higher ontology level concept may be an even more complex concept describing the behavior of a structured object consisting of two vehicles (the overtaking one and the overtaken one) over a certain period of time, for example, the overtaking vehicle changes lanes from right to left while the overtaken vehicle drives in the right lane. Let us notice that the behavior described by this concept is a crucial fragment of the overtaking maneuver, and determining whether the observed group of two vehicles behaved exactly that way requires observing a sequence of


behaviors of the vehicles taking part in this maneuver over a certain period of time. These behaviors may be: acceleration in the right lane, acceleration and changing lanes from right to left, maintaining a stable speed in the right lane. Analogously to the case of spatial and spatio-temporal concept approximation for unstructured objects, the method of concept approximation described in this subsection is an example of the general methodology of approximating concepts from an ontology described previously. Hence, its specificity is also the usage of domain knowledge expressed in the form of a concept ontology and rough set methods. However, in this case the ontologies may be extremely complex, containing concepts concerning unstructured objects, concepts concerning structured objects, as well as concepts concerning relations between parts of structured objects. The starting point for the proposed method is the observation that identification of spatio-temporal concepts concerning structured objects requires observing changes of these objects over a longer period of time (so-called longer time windows) than in the case of the complex objects which are parts of structured objects. Moreover, identification of spatio-temporal concepts concerning structured objects requires not only an observation of the changes of all constituent parts of a given structured object individually, but also an observation of the relations between these constituent parts and of the changes concerning these relations. Therefore, in order to identify spatio-temporal concepts concerning structured objects, we may observe, in behavioral graphs, the paths of their constituent objects corresponding to the behaviors of the constituent parts in a given period. Apart from that, paths in behavioral graphs describing relation changes between parts of structured objects should be observed. The properties of these paths may be defined using functions which we call temporal patterns for temporal paths (see Section 6.17). These patterns, being in fact formulas of a certain language, characterize spatio-temporal properties of the examined structured object in terms of its parts and the constraints between these parts. On a slightly higher abstraction level, so-called temporal concepts for structured objects (see Section 6.20) are used to describe behaviors of structured objects; they are defined by an expert in a natural language and are usually formulated with the help of questions about the current status of structured objects, e.g., Does one of the two observed vehicles approach the other one driving behind it in the right lane?, Does one of the two observed vehicles change lanes from the right to the left one driving behind the second vehicle? The method of approximating temporal concepts concerning structured objects proposed here is based on approximating them using temporal patterns for paths in the behavioral graphs of the parts of structured objects together with temporal patterns for paths in the behavioral graphs reflecting relation changes between the constituent parts. To this end a special decision table is constructed, called a temporal concept table of structured objects (see Section 6.20). The rows of this table are obtained by arranging the feature (attribute) value vectors of paths from the behavioral graphs corresponding to the parts of the structured objects observed in the data set (more precisely, value vectors of cluster features of such paths) and the value vectors of path features from the behavioral graph


reflecting relation changes between the parts of the structured object (more precisely, value vectors of cluster features of such paths). From the mathematical point of view, such an arrangement is a Cartesian product of the linked feature vectors. However, in terms of domain knowledge, not all links belonging to such a Cartesian product are possible or meaningful (see [78, 84, 186, 187]). According to the general methodology presented above, to eliminate arrangements of feature vectors that are impossible or meaningless, we define so-called constraints, which are formulas obtained on the basis of the values occurring in the arranged vectors. The constraints determine which vectors may be arranged in order to obtain an example of a concept from the higher level and which may not. Additionally, we assume that to each feature vector arrangement acceptable by the constraints the expert adds a decision value indicating whether the given arrangement belongs to the approximated concept from the higher level.

Methods of Behavioral Pattern Identification. Similarly to the case of spatio-temporal concepts for unstructured complex objects, the spatio-temporal concepts defined for structured objects may also be treated as nodes of a certain directed graph which is called a behavioral graph for a structured object (see Section 6.22). These graphs may be used to represent and identify so-called behavioral patterns, which are complex concepts concerning dynamic properties of complex structured objects, expressed in a natural language and depending on time and space. Examples of behavioral patterns are: overtaking on the road, driving in a traffic jam, behavior of a patient connected with a high life threat. Concepts of this type are even more difficult to approximate than many temporal concepts. In the paper, a new method of behavioral pattern identification is presented which is based on interpreting the behavioral graph of a structured object as a complex classifier enabling identification of the behavioral pattern described by this graph. This is done by observing the behavior of the structured object for a longer time and checking whether the behavior matches a path of the chosen behavioral graph. If this is so, then the behavior is judged to match the behavioral pattern represented by this graph, which enables the detection of specific behaviors of structured objects (see Section 6.23). The effective application of the above behavioral pattern identification method encounters, however, two problems in practice. The first of them concerns extracting a relevant context for the parts of structured objects (see the fourth problem in Section 1.2). To solve this problem, a sweeping method enabling a rapid extraction of structured objects is proposed in this paper. This method works on the basis of simple heuristics, called sweeping algorithms around complex objects, which are constructed with the use of domain knowledge supported by data sets (see Section 6.13). The second problem appearing in behavioral pattern identification is the problem of fast elimination of objects that certainly do not match a given behavioral pattern (see the fifth problem in Section 1.2). As one of the methods of solving this problem, we propose the so-called method of fast


elimination of specific behavioral patterns in relation to the analyzed structured objects. This method is based on so-called rules of fast elimination of behavioral patterns, which are determined from the data and on the basis of domain knowledge (see Section 6.24). It leads to a great acceleration of behavioral pattern identification, because structured objects whose behavior certainly does not match a given behavioral pattern may be eliminated very quickly. For these objects it is not necessary to apply the method based on behavioral graphs, which greatly accelerates the global perception. In order to test the quality and effectiveness of the classifier construction methods based on behavioral patterns, experiments were performed on data generated by the road simulator and on medical data connected with the detection of high death risk in infants suffering from respiratory failure (see Section 6.25 and Section 6.26). The experiments showed that the algorithmic methods presented in this paper provide very good results in detecting behavioral patterns and may be useful in monitoring complex dynamical systems.

Methods of Automated Planning. Automated planning methods for unstructured complex objects have also been worked out. These methods work on the basis of data sets and domain knowledge represented by a concept ontology. A crucial novelty of the method proposed here, in comparison with the existing ones, is the fact that performing actions according to a plan depends on satisfying complex vague spatio-temporal conditions expressed in a natural language, which leads to the necessity of approximating these conditions as complex concepts. Moreover, these conditions describe changes of complex concepts, which should be reflected in the concept ontology. The behavior of unstructured complex objects is modeled using so-called planning rules, which are formulas of the type: (state before performing an action) → action → (state 1 after performing the action | ... | state k after performing the action), defined on the basis of data sets and domain knowledge (see Section 7.4). Each rule includes the description of the complex object state before applying the rule (that is, before performing an action), expressed in a language of features proposed by an expert, the name of the action (one of the actions specified by the expert which may be performed in the particular state), and the descriptions of the states into which the complex object may pass after applying the action mentioned above. This means that the application of such a rule has nondeterministic effects, i.e., after performing the same action the system may pass into different states. All planning rules may be represented in the form of a so-called planning graph, whose nodes are state descriptions (occurring in predecessors and successors of planning rules) and the action names occurring in planning rules (see Section 7.4). In the graphical interpretation, solving the problem of automated planning amounts to finding a path in the planning graph from the initial state to an expected final state. It is worth noticing that the conditions for performing an action (object states) are described by vague spatio-temporal complex concepts which are expressed in the natural language and require approximation. For specific applications connected with the situation when the proposed plan of complex object behavior is expected to be strictly compatible with


the determined experts' instructions (e.g., the treatment in a specialist clinic is to be compatible with the treatment schemes used there), an additional mechanism has also been proposed enabling resolution of the nondeterminism occurring in the application of planning rules. This mechanism is an additional classifier based on data sets and domain knowledge. Such a classifier suggests the action to be performed in a given state and indicates the state which results from the indicated action (see Section 7.7). The automated planning method for unstructured objects has also been generalized in the paper to the case of planning the behavior of structured objects (consisting of parts connected with one another by dependencies). The generalization is based on the fact that on the level of a structured object an additional planning graph is defined with two types of nodes and directed edges between the nodes (see Section 7.11). The nodes of the first type describe vague features of the states (meta-states) of the whole structured object, whereas the nodes of the second type concern complex actions (meta-actions) performed by the whole structured object (all its constituent parts) over a longer period of time (a time window). The edges between the nodes represent temporal dependencies between meta-states and meta-actions as well as between meta-actions and meta-states. Similarly to the previous case of unstructured objects, planning the behavior of a structured object is based on finding a path in the planning graph from the initial meta-state to the expected final meta-state; at the same time, each meta-action occurring on such a path must be planned separately on the level of each constituent part of the structured object. In other words, it should be planned what actions each part of the structured object must perform in order for the whole structured object to be able to perform the planned meta-action. During the planning of a meta-action, a mechanism of synchronization (compatibility determination) of the plans proposed for the parts of the structured object is used, which works on the basis of a family of classifiers determined from data sets with great support of domain knowledge. Apart from that, an additional classifier is applied (also based on data sets and domain knowledge) which makes it possible to determine whether the juxtaposition and execution of the plans determined for the constituent parts in fact lead to the execution of the meta-action planned on the level of the whole structured object (see Section 7.13). During an attempt to execute the constructed plan, a need to reconstruct the plan often appears: during plan execution the complex object may reach a state that is not compatible with the state suggested by the plan. A total reconstruction of the plan (building the whole plan from the beginning) may be computationally too costly. Therefore, we propose another plan reconstruction method called a partial reconstruction. It is based on constructing a short so-called repair plan, which quickly brings the complex object to a so-called return state appearing in the current plan. Next, on the basis of the repair plan, the current plan is reconstructed by replacing its fragment beginning with the current state and ending with the return state with the repair plan (see Section 7.9 and Section 7.17).
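
To illustrate the planning-rule format and its graphical interpretation, the following sketch encodes a few planning rules as a mapping from (state, action) pairs to sets of possible successor states and searches the corresponding planning graph for a path from an initial state to a final state. This is only a minimal illustration of the idea: the state names, the actions, and the optimistic search strategy are invented here and do not come from the system described in the paper, where states are vague concepts requiring approximation by classifiers.

from collections import deque

# Hypothetical planning rules: (state before action, action) -> the set of
# states the object may pass into (nondeterministic effects).
planning_rules = {
    ("severe", "ventilation"): {"moderate", "severe"},
    ("moderate", "medication"): {"mild", "moderate"},
    ("mild", "observation"): {"recovered", "mild"},
}

def find_plan(initial_state, goal_state):
    # Breadth-first search for a path in the planning graph, optimistically
    # assuming the favorable outcome of each rule can be reached.
    queue = deque([(initial_state, [])])
    visited = {initial_state}
    while queue:
        state, plan = queue.popleft()
        if state == goal_state:
            return plan
        for (s, action), outcomes in planning_rules.items():
            if s != state:
                continue
            for nxt in outcomes:
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append((nxt, plan + [(state, action, nxt)]))
    return None  # no plan exists in the graph

print(find_plan("severe", "recovered"))

In the method described in the paper, deciding whether a concrete situation matches a state description is itself a classification problem, which is why the nondeterminism-resolving classifier of Section 7.7 is needed on top of such a search.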


In the construction and application of classifiers approximating complex spatio-temporal concepts, there may appear a need to construct, with great support of domain knowledge, a similarity relation between two elements of a similar type, such as complex objects, complex object states, or plans generated for complex objects. Hence, in this paper we propose a new method of similarity relation approximation based on the use of data sets and domain knowledge expressed mainly in the form of a concept ontology. We apply this method, among other things, to verify the automated planning methods, that is, to compare the plan generated automatically with the plan suggested by experts from a given domain (see Section 7.18, Section 7.19, and Section 7.20). In order to check the effectiveness of the automated planning methods proposed here, experiments were performed concerning the planning of the treatment of infants suffering from respiratory failure (see Section 7.21). The experimental results showed that the proposed method gives good results, also in the opinion of medical experts (the generated plans were sufficiently compatible with the plans suggested by the experts), and may be applied in medical practice as a tool supporting the planning of the treatment of infants suffering from respiratory failure.

Implementation and Data Sets. The result of the work conducted is also a programming system supporting the approximation of complex spatio-temporal concepts in a given concept ontology in a dialogue with the user. The system includes an implementation of the algorithmic methods presented in this paper and is available on the web site of the RSES system (see [15]). Sections 5, 6 and 7, apart from the method descriptions, contain the results of computing experiments conducted on real-life data sets, supported by domain knowledge. It is worth mentioning that the requirements regarding data sets which can be used for computing experiments with modeling spatio-temporal phenomena are much greater than the requirements regarding data used for testing classical classifiers. Not only does the data have to be representative of the decision-making problem under consideration, but it also has to be related to the available domain knowledge (usually cooperation with experts in a particular domain is essential). It is important that such data fully and appropriately reflect the complex spatio-temporal phenomena connected with the environment in which the data were collected. The author of the paper acquired such data sets from two sources. The first source is the traffic simulator made by the author (see Appendix A). The simulator is a computing tool for generating data sets connected with the traffic on a road and at crossroads. During a simulation, each vehicle appearing on the simulation board behaves as an independently acting agent. On the basis of the observation of its surroundings (other vehicles, its own location, weather conditions, etc.), this agent makes independent decisions about what maneuvers to make in order to achieve its aim, which is to cross the simulation board safely and leave it using the outbound road given in advance. At any moment of the simulation all crucial vehicle parameters may be recorded, and thanks to this, data sets for experiments can be obtained.


The second collection of data sets used in the computer experiments was provided by the Neonatal Intensive Care Unit, First Department of Pediatrics, Polish-American Institute of Pediatrics, Collegium Medicum, Jagiellonian University, Krakow, Poland. This data constitutes a detailed description of the treatment of 300 infants, i.e., treatment results, diagnoses, operations, medication (see Section 6.26 and Appendix B).

1.4 Organization of the Paper

This paper is organized as follows. In Section 2 we briefly describe selected classical methods of classifier construction and concept approximation which are used in subsequent parts of the paper. These methods are based on rough set theory and were described in the author's previous papers (see, e.g., [14, 195, 196, 197, 198, 199, 200, 201, 202, 203]). In Section 3 we describe methods of construction of concept stratifying classifiers. The general methodology of approximating concepts with the use of data sets and domain knowledge represented mainly in the form of a concept ontology is described in Section 4. Methods of approximating spatial concepts from an ontology are described in Section 5, whereas methods of approximating spatio-temporal concepts from an ontology and methods of behavioral pattern identification are described in Section 6. Methods of automated planning of complex object behavior, when object states are represented with the help of complex concepts requiring approximation with the use of data sets and domain knowledge, are presented in Section 7. Finally, in Section 8 we summarize the results and give directions for future research. The paper also contains two appendices. The first appendix contains the description of the traffic simulator used to generate experimental data (see Appendix A). The second one describes medical issues connected with the infant respiratory failure (see Appendix B), concerning one of the data sets used in the experiments.

2 Classical Classifiers

In general, the term classify means to arrange objects in groups or classes based on shared characteristics (see [1]). In this work, the term classification has a special meaning: classification connotes any context in which some decision or forecast about object grouping is made on the basis of the currently available knowledge or information (see, e.g., [11, 204]). A classification algorithm (classifier) is an algorithm which enables us to make such a forecast repeatedly, on the basis of accumulated knowledge, in new situations (see, e.g., [11]). Here we consider the classification provided by a classifying


algorithm which is applied to a number of cases in order to classify previously unseen objects. Each new object is assigned to a class belonging to a predefined set of classes on the basis of observed values of suitably chosen attributes (features). Many approaches have been proposed for constructing classification algorithms. Among them we would like to mention classical and modern statistical techniques (see, e.g., [11, 13]), neural networks (see, e.g., [11, 13, 205]), decision trees (see, e.g., [11, 206, 207, 208, 209, 210, 211, 212]), decision rules (see, e.g., [10, 11, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223]), and inductive logic programming (see, e.g., [11, 224]). In this section, we consider methods implemented in our system RSES (Rough Set Exploration System) (see [14, 225, 226, 227, 228, 229, 230, 231]). RSES is a computer software system developed for the purpose of data analysis (the data is assumed to be in the form of an information system or a decision table, see Section 2.1). In the construction of classifiers, which is the main step in the process of data analysis with RSES, elements of rough set theory are used. In this paper, we call these algorithms the standard RSES methods of classifier construction. The majority of the standard RSES methods of classifier construction have been applied in the more advanced methods of classifier construction which will be presented in Sections 3, 5, 6, and 7. Therefore, in this section we only give a brief overview of these methods. They are based on rough set theory (see [16, 17, 232]). In Section 2.1 we introduce the basic rough set terminology and notation necessary for the rest of this paper. The analysis of data in the RSES system proceeds according to the scheme presented in Fig. 2. First, the data for analysis has to be loaded/imported into the system. Next, in order to have a better chance of constructing (learning) a proper classifier, it is frequently advisable to transform the initial data set. Such a transformation, usually referred to as preprocessing, may consist of several steps.

[Fig. 2. The RSES data analysis process: Load/Import data table → Data preprocessing → Knowledge reduction → Classifier construction → Classifier evaluation → Classification of new cases]


RSES supports preprocessing methods which make it possible to manage missing parts of the data, discretize numeric attributes, and create new attributes (see [14] and Section 2.2 for more details). When the data is preprocessed, we may be interested in learning about its internal structure. By using classical rough set concepts such as reducts (see Section 2.1), dynamic reducts (see [14, 195, 196, 198, 201, 202, 203]), and the positive region (see Section 2.1), one can discover dependencies that occur in the data set. Knowledge of reducts can lead to a reduction of the data by removing some of the redundant attributes. Next, the classifier construction may be started. In the RSES system, classifiers may be constructed using various methods (see [14] and Sections 2.3, 2.4, 2.5, 2.6, 2.7 for more details). A classifier is constructed on the basis of a training set consisting of labeled examples (objects with decisions). Such a classifier may further be evaluated on a test set or applied to new, unseen and unlabeled cases in order to determine the value of the decision (classification) for them (see Section 2.9). If the quality of the constructed classifier is insufficient, one may return to data preprocessing and/or knowledge reduction; another method of classifier construction may be applied as well.

2.1 Rough Set Basic Notions

In order to make the description further in the paper clear and to avoid misunderstandings, we recall here some essential definitions from rough set theory. We will frequently refer to the notions introduced in this section. Quite a comprehensive description of notions and concepts related to classical rough set theory may be found in [189]. An information system (see [16, 17]) is a pair A = (U, A), where U is a non-empty, finite set called the universe of A, and A is a non-empty, finite set of attributes, i.e., mappings a : U → V_a, where V_a is called the value set of a ∈ A. Elements of U are called objects and interpreted as, e.g., cases, states, processes, patients, observations. Attributes are interpreted as features, variables, characteristic conditions. We also consider a special case of information systems called decision tables. A decision table is an information system of the form A = (U, A, d), where d ∉ A is a distinguished attribute called the decision. The elements of A are called condition attributes or conditions. One can interpret the decision attribute as a kind of partition of the universe of objects given by an expert, a decision-maker, an operator, a physician, etc. In machine learning, decision tables are called training sets of examples (see [10]). The cardinality of the image d(U) = {k : d(s) = k for some s ∈ U} is called the rank of d and is denoted by r(d). We assume that the set V_d of values of the decision d is equal to {v_d^1, ..., v_d^r(d)}. Let us observe that the decision d determines a partition CLASS_A(d) = {X_A^1, ..., X_A^r(d)} of the universe U, where X_A^k = {x ∈ U : d(x) = v_d^k} for 1 ≤ k ≤ r(d). CLASS_A(d) is called the classification of the objects of A determined


by the decision d. The set X_A^i is called the i-th decision class of A. By X_A(u) we denote the decision class {x ∈ U : d(x) = d(u)}, for any u ∈ U. Let A = (U, A) be an information system. For every set of attributes B ⊆ A, an equivalence relation, denoted by IND_A(B) and called the B-indiscernibility relation, is defined by

IND_A(B) = {(u, u') ∈ U × U : ∀ a ∈ B, a(u) = a(u')}.        (1)

Objects u, u' that are in the relation IND_A(B) are indiscernible by the attributes from B. By [u]_IND_A(B) we denote the equivalence class of the relation IND_A(B) to which u belongs. An attribute a ∈ B ⊆ A is dispensable in B if IND_A(B) = IND_A(B \ {a}); otherwise a is indispensable in B. A set B ⊆ A is independent in A if every attribute from B is indispensable in B; otherwise the set B is dependent in A. A set B ⊆ A is called a reduct in A if B is independent in A and IND_A(B) = IND_A(A). The set of all reducts in A is denoted by RED_A(A). This is the classical notion of a reduct; it is sometimes referred to as a global reduct. Let A = (U, A) be an information system with n objects. By M(A) (see [21]) we denote the n × n matrix (c_ij), called the discernibility matrix of A, such that

c_ij = {a ∈ A : a(x_i) ≠ a(x_j)} for i, j = 1, ..., n.        (2)

A discernibility function f_A for an information system A is a Boolean function of m Boolean variables ā_1, ..., ā_m corresponding to the attributes a_1, ..., a_m, respectively, and defined by

f_A(ā_1, ..., ā_m) = ∧ { ∨ c̄_ij : 1 ≤ j < i ≤ n and c_ij ≠ ∅ },        (3)

where c̄_ij = {ā : a ∈ c_ij}. It can be shown (see [21]) that the set of all prime implicants of f_A determines the set of all reducts of A. We present an exemplary deterministic algorithm for the computation of the whole reduct set RED_A(A) (see, e.g., [199]). This algorithm computes the discernibility matrix of A (see Algorithm 2.1). The time cost of the reduct set computation using this algorithm can be too high when the decision table consists of many objects, many attributes, or many different attribute values. The reason is that, in general, the size of the reduct set can be exponential with respect to the size of the decision table, and the problem of minimal reduct computation is NP-hard (see [21]). Therefore, we are often forced to apply approximation algorithms to obtain some knowledge about the reduct set. One way is to use approximation algorithms that need not give optimal solutions but require a short computing time. Among these algorithms are the following: Johnson's algorithm, covering algorithms, algorithms based on simulated annealing and Boltzmann machines, algorithms using neural networks, and algorithms based on genetic algorithms (see, e.g., [196, 198, 199] for more details).
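
The following sketch illustrates, on an invented toy information system, the discernibility-matrix characterization used above: a set of attributes B preserves IND_A(A) iff it intersects every non-empty entry c_ij, and reducts are the minimal such sets. The brute-force enumeration below is exponential and only feasible for very small attribute sets, which is exactly why the approximation algorithms listed above matter in practice.

from itertools import combinations

# Toy information system (all names and values are made up).
objects = ["u1", "u2", "u3", "u4"]
attributes = ["a", "b", "c"]
value = {
    "u1": {"a": 1, "b": 0, "c": 0},
    "u2": {"a": 0, "b": 1, "c": 0},
    "u3": {"a": 1, "b": 1, "c": 1},
    "u4": {"a": 0, "b": 0, "c": 1},
}

def discernibility_matrix():
    # c_ij = set of attributes on which objects x_i and x_j differ
    return {(x, y): {a for a in attributes if value[x][a] != value[y][a]}
            for x, y in combinations(objects, 2)}

def preserves_discernibility(B, matrix):
    # B yields the same indiscernibility as A iff B hits every non-empty c_ij
    return all(not c or (B & c) for c in matrix.values())

def all_reducts():
    matrix = discernibility_matrix()
    candidates = [set(B) for r in range(1, len(attributes) + 1)
                  for B in combinations(attributes, r)
                  if preserves_discernibility(set(B), matrix)]
    # a reduct is a minimal candidate with respect to inclusion
    return [B for B in candidates if not any(C < B for C in candidates)]

print(all_reducts())   # for this table: the three reducts {a,b}, {a,c}, {b,c}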


Algorithm 2.1. Reduct set computation
Input: information system A = (U, A)
Output: the set RED_A(A) of all reducts of A
begin
  Compute the discernibility matrix M(A)
  Reduce M(A) using the absorption laws
  // Let C_1, ..., C_d be the non-empty fields of the reduced M(A)
  Build a family of sets R_0, R_1, ..., R_d in the following way:
  begin
    R_0 = ∅
    for i = 1 to d do
      R_i = S_i ∪ T_i, where S_i = {R ∈ R_(i-1) : R ∩ C_i ≠ ∅}
        and T_i = {R ∪ {a} : a ∈ C_i, R ∈ R_(i-1), R ∩ C_i = ∅}
    end
  end
  Remove dispensable attributes from each element of the family R_d
  Remove redundant elements from R_d
  RED_A(A) = R_d
end

If A = (U, A) is an information system, B ⊆ A is a set of attributes, and X ⊆ U is a set of objects (usually called a concept), then the sets {u ∈ U : [u]_IND_A(B) ⊆ X} and {u ∈ U : [u]_IND_A(B) ∩ X ≠ ∅} are called the B-lower and the B-upper approximations of X in A, and they are denoted by B_*(X) and B^*(X), respectively. The set BN_B(X) = B^*(X) − B_*(X) is called the B-boundary of X (boundary region, for short). When B = A, we also write BN(X) instead of BN_A(X). Sets which are unions of some classes of the indiscernibility relation IND_A(B) are called definable by B (or B-definable, in short). A set X is, thus, B-definable iff B_*(X) = B^*(X). Some subsets (categories) of objects in an information system cannot be exactly expressed in terms of the available attributes, but they can be defined roughly. The set B_*(X) is the set of all elements of U which can be classified with certainty as elements of X, given the knowledge about these elements in the form of the values of the attributes from B; the set BN_B(X) is the set of elements of U which one can classify neither to X nor to −X with the knowledge about objects represented by B. If the boundary region of X ⊆ U is the empty set, i.e., BN_B(X) = ∅, then the set X is called crisp (exact) with respect to B; in the opposite case, i.e., if BN_B(X) ≠ ∅, the set X is referred to as rough (inexact) with respect to B (see, e.g., [17]). If X_1, ..., X_r(d) are the decision classes of A, then the set B_*(X_1) ∪ ... ∪ B_*(X_r(d)) is called the B-positive region of A and denoted by POS_B(d).
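
As a small illustration of these definitions, the sketch below computes the classes of IND_A(B) and the B-lower and B-upper approximations of a concept X for an invented table; the attribute names and values are made up for the example only.

U = ["x1", "x2", "x3", "x4", "x5", "x6"]
A_values = {
    "x1": {"color": "red",  "size": "big"},
    "x2": {"color": "red",  "size": "big"},
    "x3": {"color": "blue", "size": "small"},
    "x4": {"color": "blue", "size": "small"},
    "x5": {"color": "red",  "size": "small"},
    "x6": {"color": "blue", "size": "big"},
}
X = {"x1", "x3", "x5"}          # the concept to approximate
B = ["color", "size"]           # attributes defining indiscernibility

def equivalence_classes(B):
    # objects with equal values on all attributes from B fall into one class
    classes = {}
    for u in U:
        classes.setdefault(tuple(A_values[u][a] for a in B), set()).add(u)
    return classes.values()

lower, upper = set(), set()
for cls in equivalence_classes(B):
    if cls <= X:       # class entirely inside X -> certain members of X
        lower |= cls
    if cls & X:        # class intersecting X -> possible members of X
        upper |= cls

print(lower)           # B_*(X) = {'x5'}
print(upper)           # B^*(X) = {'x1', 'x2', 'x3', 'x4', 'x5'}
print(upper - lower)   # BN_B(X) is non-empty, so X is rough w.r.t. B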


If A = (U, A, d) is a decision table and B ⊆ A, then we define a function ∂_B : U → P(V_d), called the B-generalized decision of A, by

∂_B(x) = {v ∈ V_d : ∃ x' ∈ U (x' IND_A(B) x and d(x') = v)}.        (4)

The A-generalized decision ∂_A of A is called the generalized decision of A. A decision table A is called consistent (deterministic) if card(∂_A(x)) = 1 for any x ∈ U; otherwise A is inconsistent (non-deterministic). Non-deterministic information systems were introduced by Witold Lipski (see [233]), while deterministic information systems were introduced independently by Zdzislaw Pawlak [234] (see also [235, 236]). It is easy to see that a decision table A is consistent iff POS_A(d) = U. Moreover, if ∂_B = ∂_B', then POS_B(d) = POS_B'(d) for any pair of non-empty sets B, B' ⊆ A. A subset B of the set A of attributes of a decision table A = (U, A, d) is a relative reduct of A iff B is a minimal set with respect to the following property: ∂_B = ∂_A. The set of all relative reducts of A is denoted by RED(A, d). Let A = (U, A, d) be a consistent decision table and let M(A) = (c_ij) be its discernibility matrix. We construct a new matrix M'(A) = (c'_ij), assuming c'_ij = ∅ if d(x_i) = d(x_j), and c'_ij = c_ij − {d} otherwise. The matrix M'(A) is called the relative discernibility matrix of A. Now, one can construct the relative discernibility function f_M'(A) of M'(A) in the same way as the discernibility function. It can be shown (see [21]) that the set of all prime implicants of f_M'(A) determines the set of all relative reducts of A. Another important type of reducts are local reducts. A local reduct r(x_i) ⊆ A (or a reduct relative to the decision and an object x_i ∈ U, where x_i is called a base object) is a subset of A such that:

1. ∀ x_j ∈ U: d(x_i) ≠ d(x_j) ⟹ ∃ a_k ∈ r(x_i): a_k(x_i) ≠ a_k(x_j),
2. r(x_i) is minimal with respect to inclusion.

If A = (U, A, d) is a decision table, then any system B = (U', A, d) such that U' ⊆ U is called a subtable of A. A template of A is a formula ∧(a_i = v_i), where a_i ∈ A and v_i ∈ V_a_i. A generalized template is a formula of the form ∧(a_i ∈ T_i), where T_i ⊂ V_a_i. An object satisfies (matches) a template if for every attribute a_i occurring in the template the value of this attribute on the considered object is equal to v_i (belongs to T_i in the case of a generalized template). A template splits the original information system into two distinct subtables containing the objects that satisfy and do not satisfy the template, respectively. It is worth mentioning that the notion of a template can be treated as a particular case of a more general notion, viz., that of a pattern (see Section 4.9).
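
A short sketch of template matching may be helpful here; the table and the template below are invented. A generalized template is represented as a mapping from attributes to admissible value sets (a singleton set encodes the ordinary descriptor a_i = v_i), and the template splits the table into matching and non-matching objects.

rows = [
    {"a": 1, "b": "x", "d": 0},
    {"a": 2, "b": "x", "d": 1},
    {"a": 1, "b": "y", "d": 0},
]

def matches(obj, template):
    # template: attribute -> set of admissible values
    return all(obj[attr] in T for attr, T in template.items())

template = {"a": {1}, "b": {"x", "y"}}       # (a = 1) AND (b in {x, y})
satisfying = [r for r in rows if matches(r, template)]
not_satisfying = [r for r in rows if not matches(r, template)]
print(len(satisfying), len(not_satisfying))  # 2 1 -- the split into subtables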

2.2 Discretization

Suppose we have a decision table A = (U, A, d) where card(V_a) is high for some a ∈ A. Then there is a very low chance that a new object will be recognized by rules


generated directly from this table, because the attribute value vector of the new object will not match any of these rules. Therefore, for decision tables with real (numerical) value attributes, discretization strategies are used in order to obtain a higher quality of classification. This problem has been studied intensively (see, e.g., [199, 237, 238] for more details). The process of discretization is usually realized in the following two steps (see, e.g., [14, 199, 237, 238]). First, the algorithm generates a set of cuts. By a cut for an attribute a_i ∈ A such that V_a_i is an ordered set we denote a value c ∈ V_a_i. The cuts can then be used to transform the decision table. As a result we obtain a decision table with the same set of attributes, but the attributes have different values. Instead of a(x) = v for an attribute a ∈ A and an object x ∈ U, we rather get a(x) ∈ [c_1, c_2], where c_1 and c_2 are cuts generated for the attribute a by the discretization algorithm. The cuts are generated in such a way that the resulting intervals contain sets of objects as uniform as possible with respect to the decision. The discretization method available in RSES has two versions (see, e.g., [14, 199, 238]), usually called global and local. Both methods belong to bottom-up approaches, which add cuts for a given attribute one by one in subsequent iterations of the algorithm. The difference between the two methods lies in the way in which a candidate for a new cut is evaluated. In the global method, we evaluate all objects in the data table at every step. In the local method, we only consider the part of the objects related to the candidate cut, i.e., those having the value of the currently considered attribute in the same range as the cut candidate. Naturally, the second (local) method is faster, as fewer objects have to be examined at every step. In general, the local method produces more cuts. The local method is also capable of dealing with nominal (symbolic) attributes. Grouping (quantization) of a nominal attribute domain with the use of the local method always results in two subsets of attribute values (see, e.g., [14, 199, 238] for more details).
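
The sketch below shows one simple way of scoring candidate cuts in the bottom-up spirit described above: midpoints between consecutive attribute values are ranked by the number of object pairs with different decisions that they separate. This scoring criterion is a common choice in the discretization literature, but it is given here only as an illustration and is not claimed to be the exact criterion implemented in RSES.

# Greedy cut scoring for one numeric attribute (toy data, invented).
values =    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
decisions = [0,   0,   1,   1,   0,   1]

def candidate_cuts(values):
    # midpoints between consecutive distinct attribute values
    vs = sorted(set(values))
    return [(x + y) / 2 for x, y in zip(vs, vs[1:])]

def score(cut):
    # pairs of objects with different decisions on opposite sides of the cut
    left = [d for v, d in zip(values, decisions) if v < cut]
    right = [d for v, d in zip(values, decisions) if v >= cut]
    return sum(1 for di in left for dj in right if di != dj)

cuts = sorted(candidate_cuts(values), key=score, reverse=True)
print(cuts[0])   # 2.5 -- the cut separating the most mixed-decision pairs

A full bottom-up procedure would add the best cut, recompute the scores on the objects still indiscernible, and iterate until the resulting intervals are sufficiently uniform with respect to the decision.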

2.3 Decision Rules

Let A = (U, A, d) be a decision table and let V = ∪{V_a : a ∈ A} ∪ V_d. Atomic formulas over B ⊆ A ∪ {d} and V are expressions of the form a = v, called descriptors over B and V, where a ∈ B and v ∈ V_a. The set F(B, V) of formulas over B and V is the least set containing all atomic formulas over B and V and closed with respect to the classical propositional connectives ∨ (disjunction), ∧ (conjunction), and ¬ (negation). Let ϕ ∈ F(B, V). Then, by |ϕ|_A we denote the meaning of ϕ in the decision table A, i.e., the set of all objects of U with the property ϕ, defined inductively by:

1. if ϕ is of the form a = v, then |ϕ|_A = {x ∈ U : a(x) = v},
2. |ϕ ∧ ϕ'|_A = |ϕ|_A ∩ |ϕ'|_A,
3. |ϕ ∨ ϕ'|_A = |ϕ|_A ∪ |ϕ'|_A,
4. |¬ϕ|_A = U − |ϕ|_A.


The set F(A, V) is called the set of conditional formulas of A and is denoted by C(A, V). Any formula of the form (a_1 = v_1) ∧ ... ∧ (a_l = v_l), where v_i ∈ V_a_i (for i = 1, ..., l) and P = {a_1, ..., a_l} ⊆ A, is called a P-basic formula of A. If ϕ is a P-basic formula of A and Q ⊆ P, then by ϕ/Q we mean the Q-basic formula obtained from the formula ϕ by removing from ϕ all its elementary subformulas (a = v_a) such that a ∈ P \ Q. A decision rule for A is any expression of the form ϕ ⇒ d = v, where ϕ ∈ C(A, V), v ∈ V_d, and |ϕ|_A ≠ ∅. The formulas ϕ and d = v are referred to as the predecessor (premise of the rule) and the successor of the decision rule ϕ ⇒ d = v, respectively. If r is a decision rule in A, then by Pred(r) we denote the predecessor of r and by Succ(r) the successor of r. An object u ∈ U is matched by a decision rule ϕ ⇒ d = v_d^k (where 1 ≤ k ≤ r(d)) iff u ∈ |ϕ|_A. If u is matched by ϕ ⇒ d = v_d^k, then we say that the rule classifies u to the decision class X_A^k. The number of objects matched by a decision rule ϕ ⇒ d = v, denoted by Match_A(ϕ ⇒ d = v), is equal to card(|ϕ|_A). The number Supp_A(ϕ ⇒ d = v) = card(|ϕ|_A ∩ |d = v|_A) is called the number of objects supporting the decision rule ϕ ⇒ d = v. A decision rule ϕ ⇒ d = v for A is true in A, symbolically ϕ ⇒_A d = v, iff |ϕ|_A ⊆ |d = v|_A. If the decision rule ϕ ⇒ d = v is true in A, we say that the decision rule is consistent in A; otherwise ϕ ⇒ d = v is inconsistent or approximate in A. If r is a decision rule in A, then the number μ_A(r) = Supp_A(r) / Match_A(r) is called the coefficient of consistency of the rule r. The coefficient μ_A(r) may be understood as the degree of consistency of the decision rule r. It is easy to see that a decision rule r for A is consistent iff μ_A(r) = 1. The coefficient of consistency of r can also be treated as the degree of inclusion of |Pred(r)|_A in |Succ(r)|_A (see, e.g., [239]). If ϕ ⇒ d = v is a decision rule for A and ϕ is a P-basic formula of A (where P ⊆ A), then the decision rule ϕ ⇒ d = v is called a P-basic decision rule for A, or a basic decision rule in short. Let ϕ ⇒ d = v be a P-basic decision rule of A (where P ⊆ A) and let a ∈ P. We will say that the attribute a is dispensable in the rule ϕ ⇒ d = v iff |ϕ ⇒ d = v|_A = U implies |ϕ/(P \ {a}) ⇒ d = v|_A = U; otherwise the attribute a is indispensable in the rule ϕ ⇒ d = v. If all attributes a ∈ P are indispensable in the rule ϕ ⇒ d = v, then ϕ ⇒ d = v is called independent in A. A subset of attributes R ⊆ P is called a reduct of the P-basic decision rule ϕ ⇒ d = v if ϕ/R ⇒ d = v is independent in A and |ϕ ⇒ d = v|_A = U implies |ϕ/R ⇒ d = v|_A = U. If R is a reduct of the P-basic decision rule ϕ ⇒ d = v, then ϕ/R ⇒ d = v is said to be reduced. If R is a reduct of an A-basic decision rule ϕ ⇒ d = v, then ϕ/R ⇒ d = v is said to be an optimal basic decision rule of A (a basic decision rule with a minimal number of descriptors). The set of all optimal basic decision rules of A is denoted by RUL(A).
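
The following sketch computes Match_A(r), Supp_A(r), and the coefficient of consistency μ_A(r) for a decision rule over an invented toy decision table, with a rule represented as a (premise, decision) pair.

# Toy decision table and rule (all data invented for illustration).
table = [
    {"a": 1, "b": 0, "d": "yes"},
    {"a": 1, "b": 1, "d": "yes"},
    {"a": 1, "b": 0, "d": "no"},
    {"a": 0, "b": 0, "d": "no"},
]
rule = {"premise": {"a": 1}, "decision": "yes"}   # (a = 1) => (d = yes)

def match(rule, table):
    # objects whose attribute values satisfy every descriptor of the premise
    return [x for x in table
            if all(x[a] == v for a, v in rule["premise"].items())]

matched = match(rule, table)                                  # Match_A(r) = 3
support = [x for x in matched if x["d"] == rule["decision"]]  # Supp_A(r)  = 2
mu = len(support) / len(matched)                              # mu_A(r)    = 2/3
print(len(matched), len(support), round(mu, 2))               # 3 2 0.67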

2.4 Two Methods for Decision Rule Synthesis

Classifiers based on a set of decision rules are the most elaborated methods in RSES. Several methods for the calculation of decision rule sets are implemented; various methods for transforming and utilizing rule sets are available as well. However, in our computer experiments we usually use two methods of decision rule synthesis, which we describe here. The first method returns all basic decision rules with a minimal number of descriptors (see, e.g., [196, 198, 199, 240]) and is therefore often called the exhaustive method. From the practical point of view, the method consists in applying an algorithm computing all reducts (see Algorithm 2.1) to each object individually, which results in decision rules with a minimal number of descriptors relative to the individual objects (see, e.g., [196, 198, 199]). The second method of basic decision rule synthesis is the covering algorithm called LEM2 (see, e.g., [216, 222, 223]). In LEM2, a separate-and-conquer technique is paired with rough set notions such as upper and lower approximations. This method tends to produce fewer rules than algorithms based on the exhaustive local reduct calculation (as in the previous method) and seems to be faster. On the downside, the LEM2 method sometimes returns too few valuable and meaningful rules (see also Section 2.10).

2.5 Operations on Rule Sets

In general, the methods used by RSES to generate rules may produce a large number of them. Naturally, some of the rules may be marginal, erroneous, or redundant. In order to provide better control over rule-based classifiers, some simple techniques for transforming rule sets should be used. The simplest way to alter a set of decision rules is to filter it. It is possible to eliminate from the rule set those rules that have insufficient support on the training sample, or those that point at a decision class other than the desired one. More advanced operations on rule sets are shortening and generalization. Rule shortening is a method that attempts to eliminate descriptors from the premise of a rule. The resulting rule is shorter and more general (applicable to more training objects), but it may lose some of its precision, i.e., it may give wrong answers (decisions) for some of the matching training objects. We present an exemplary method of approximate rule computation (see, e.g., [196, 198, 199]) that we use in our experiments. We begin with an algorithm for the synthesis of optimal decision rules from a given decision table (see Section 2.4). Next, we compute approximate rules from the optimal decision rules already calculated. Our method is based on the notion of the consistency of a decision rule (see Section 2.3). The original optimal rule is reduced to an approximate rule with the coefficient of consistency exceeding a fixed threshold. Let A = (U, A, d) be a decision table and r_0 ∈ RUL(A). The approximate rule (based on the rule r_0) is computed using Algorithm 2.2.


Algorithm 2.2. Approximate rule synthesis (by descriptor dropping)
Input:
  1. decision table A = (U, A, d)
  2. decision rule r_0 ∈ RUL(A)
  3. threshold of consistency μ_0 (e.g., μ_0 = 0.9)
Output: the approximate rule r_app (based on the rule r_0)
begin
  Calculate the coefficient of consistency μ_A(r_0)
  if μ_A(r_0) < μ_0 then
    STOP   // in this case there is no approximate rule
  end
  μ_max = μ_A(r_0) and r_app = r_0
  while μ_max > μ_0 do
    μ_max = 0
    for i = 1 to the number of descriptors in Pred(r_app) do
      r = r_app
      Remove the i-th descriptor from Pred(r)
      Calculate the coefficient of consistency μ_A(r) and set μ = μ_A(r)
      if μ > μ_max then
        μ_max = μ and i_max = i
      end
    end
    if μ_max > μ_0 then
      Remove the i_max-th conditional descriptor from r_app
    end
  end
  return r_app
end

It is easy to see that the time and space complexity of Algorithm 2.2 are of order O(l^2 · m · n) and O(C), respectively, where l is the number of conditional descriptors in the original optimal decision rule r_0, n and m are the numbers of objects and attributes of A, and C is a constant. The approximate rules generated by the above method can help to extract interesting laws from the decision table. By applying approximate rules instead of optimal rules, one may slightly decrease the quality of classification of objects from the training set, but in return one expects to receive more general rules with a higher quality of classification of new objects (see [196]). Generalization of rules, on the other hand, is a process which consists in replacing descriptors having a single attribute value in rule predecessors with more general descriptors. In the RSES system there is an algorithm available which, instead of simple descriptors of the type a(x) = v, where a ∈ A, v ∈ V_a and x ∈ U, tries to use so-called generalized descriptors of the form a(x) ∈ V, where V ⊂ V_a (see, e.g., [14]). In addition, such a replacement is performed


only when the coefficient of consistency of the new rule is not smaller than the established threshold. Let us notice that such an operation is crucial for enlarging the extension of decision rules, since generalized decision rules are able to classify a greater number of tested objects. It is worth mentioning that the application of the rule generalization method described above only makes sense for tables with attributes having a small number of values; such attributes are usually attributes with symbolic values. Using this method for tables with numerical attributes requires a prior discretization of the values of these attributes.

2.6 Negotiations Among Rules

Suppose we have a set of decision rules. When we attempt to classify an object from the test sample with the use of the generated rule set, it may happen that various rules suggest different decision values. In such conflict situations, we need a strategy to resolve the controversy and reach a final result (decision). This problem has been studied intensively (see, e.g., [198, 199]). In its current version, RSES provides a conflict resolution strategy based on voting among rules. In this method, each rule that matches the object under consideration casts a vote in favor of the decision value it points at. The votes are summed up and the decision that has received the majority of votes is chosen. This simple method may be extended by assigning weights to rules; each rule then votes with its weight, and the decision with the highest total of weighted votes is the final one. In RSES, this method is known as standard voting and is based on a basic strength (weight) of decision rules (see Section 2.8). Of course, there are many other methods that can be used to resolve conflicts between decision rules (see, e.g., [196, 198, 199, 216, 217, 241]).
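
The voting scheme described above can be sketched in a few lines; the rule format and the weights below are illustrative only. With all weights equal to 1, the procedure reduces to simple majority voting.

from collections import defaultdict

# Illustrative rules: (premise, decision, weight).
rules = [
    ({"a": 1}, "yes", 3.0),
    ({"b": 0}, "no",  1.0),
    ({"a": 1, "b": 0}, "yes", 2.0),
]

def classify(obj, rules):
    votes = defaultdict(float)
    for premise, decision, weight in rules:
        if all(obj.get(a) == v for a, v in premise.items()):
            votes[decision] += weight   # each matching rule votes with its weight
    return max(votes, key=votes.get) if votes else None

print(classify({"a": 1, "b": 0}, rules))   # 'yes' wins 5.0 : 1.0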

2.7 Decomposition Trees

In the case of larger decision tables, the computation of decision rules can be extremely difficult or even impossible. This problem arises from the relatively high computational complexity of rule computing algorithms. Unfortunately, it frequently concerns covering algorithms such as LEM2 as well (see Section 2.4). One of the solutions to this problem is so-called decomposition. Decomposition consists in partitioning the input data table into parts (subtables) in such a way that decision rules can be calculated for these parts using standard methods. Naturally, a method is also necessary for aggregating the obtained rule sets in order to build a general classifier. In this paper, we present a decomposition method based on a decomposition tree (see [165, 226, 242]), which may be constructed according to Algorithm 2.3. This algorithm creates the decomposition tree in steps, where each step leads to the construction of the next level of the tree. At a given step of the algorithm execution, a binary partition of the decision table takes place using the best template (see Section 2.1) found for the table being partitioned. In this way, with each tree node (leaf) there is associated a template partitioning the subtable in this node into objects matching and not matching the template. This


Algorithm 2.3. Decomposition tree synthesis
Input: decision table A = (U, A, d)
Output: the decomposition tree for the decision table A
1 begin
2   Find the best template T in A (see Section 2.1)
3   Divide A into two subtables: A1 containing all objects satisfying T and A2 = A − A1
4   if obtained subtables are of acceptable size in the sense of rough set methods then
5     STOP // The decomposition is finished
6   end
7   repeat lines 2-7 for all "too large" subtables
8 end

This template and its negation are transferred as templates describing subtables to the next step of decomposition. Decomposition finishes when the subtables obtained are so small that decision rules can be calculated for them using standard methods. After determining the decomposition tree, decision rule sets are calculated for all the leaves of this tree, i.e., for the subtables occurring in the individual leaves. The tree and the rules calculated for the training sample can be used in the classification of unseen cases. Suppose we have a binary decomposition tree. Let u be a new object, A(T) be a subtable containing all objects matching a template T, and A(¬T) be a subtable containing all objects not matching T. We classify object u starting from the root of the tree using Algorithm 2.4.

Algorithm 2.4. Classification by decomposition tree
1 begin
2   if u matches template T found for A then
3     go to subtree related to A(T)
4   else
5     go to subtree related to A(¬T)
6   end
7   if u is at the leaf of the tree then
8     go to line 12
9   else
10    repeat lines 2-11 substituting A(T) (or A(¬T)) for A
11  end
12  Classify u using decision rules for subtable attached to the leaf
13 end


This algorithm works by first seeking the leaf of the decomposition tree such that the tested object matches the template describing the objects of that leaf. Next, the object is classified with the help of the decision rules calculated for the leaf that was found. The type of the decomposition method depends on the method of determining the best template. For instance, if decomposition is needed only because it is impossible to compute rules for a given decision table, then the best template for this table is the template which divides the table into two equal parts. If, however, we are concerned with the table partition that is most compatible with the partition introduced by decision classes, then the measure of template quality may be, for example, the number of pairs of objects from different decision classes differentiated by the partition introduced by a given template. Clearly, the best template in this case is the template with the largest number of differentiated pairs. The templates determined may have different forms (see, e.g., [165] for more details). In the simplest case, for a symbolic attribute, the best template might be of the form a(x) = v or a(x) ≠ v, where a ∈ A, v ∈ Va, and x ∈ U, whereas for a numerical attribute, the templates might be a(x) > v, a(x) < v, a(x) ≤ v, or a(x) ≥ v, where a ∈ A, v ∈ Va, and x ∈ U. The classifier presented in this section uses a binary decision tree; however, it should not be mistaken for C4.5 or ID3 (see, e.g., [210, 243]) because, as we said before, rough set methods are used in the leaves of the decomposition tree in the construction of the classifying algorithm.
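The second template quality measure mentioned above can be sketched as follows (a quadratic-time illustration; all names are hypothetical):

def template_quality(rows, matches):
    # rows: list of (object, decision) pairs; matches: predicate telling
    # whether an object matches the template; returns the number of pairs
    # of objects from different decision classes differentiated by the
    # binary partition the template introduces
    inside = [d for (obj, d) in rows if matches(obj)]
    outside = [d for (obj, d) in rows if not matches(obj)]
    return sum(1 for d1 in inside for d2 in outside if d1 != d2)

# e.g., the quality of the template a(x) = v for a symbolic attribute a:
# template_quality(rows, lambda obj: obj["a"] == v)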

2.8 Concept Approximation and Classifiers

Definability of concepts is a term well known in classical logic (see, e.g., [5, 244, 245]). In this classical approach, a definable concept (set) is a relation on the domain of a given structure whose elements are precisely those elements satisfying some formula in the structure. The semantics of such a formula makes it possible to determine precisely, for a given element (object), whether it belongs to the concept or not. However, the issue of definability of concepts is complicated by the pervasive presence of vagueness and ambiguity in natural language (see [126, 127, 244]). Therefore, in numerous applications, the concepts of interest may only be defined approximately on the basis of available, incomplete, imprecise or noisy information about them, represented, e.g., by positive and negative examples (see [6, 7, 8, 9, 10, 11, 12, 13]). Such concepts are often called vague (imprecise) concepts. We say that a concept is vague when there may be cases (elements, objects) for which there is no clear fact of the matter whether the concept applies or not. Hence, the classical approach to concept definability known from classical logic cannot be applied to vague concepts. Instead, an approximation of a vague concept consists in the construction of an algorithm (called a classifier) for this concept, which may be treated as a constructive, approximate description of the concept. This description makes it possible to classify tested objects, that is, to determine whether, and to what degree, a given object belongs to the approximated concept. There is a long debate in philosophy on vague concepts (see, e.g., [126, 127, 128]), and recently computer scientists (see, e.g., [79, 82, 83, 246, 247, 248, 249]) as well
as other researchers have become interested in vague concepts. Since the classical approach to concept definability known from classical logic cannot be applied to vague concepts, new methods of definability have been proposed. Professor Lotfi Zadeh (see [250]) introduced a very successful approach to the definability of vague concepts, in which sets are defined by partial membership, in contrast to the crisp membership used in the classical definition of a set. Rough set theory proposed a method of concept definability employing the lower and upper approximation and the boundary region of the concept (see Section 2.1). If the boundary region of a set is empty, the set is crisp; otherwise the set is rough (inexact). A non-empty boundary region of the set means that our knowledge about the set is not sufficient to define the set precisely. Using the lower and upper approximation and the boundary region of a given concept, a classifier can be constructed. Assume there is given a decision table A = (U, A, d) whose binary decision attribute with values 1 and 0 partitions the set of objects into two disjoint sets C and C′: the set C contains the objects with decision attribute value 1, and the set C′ contains the objects with decision attribute value 0. The sets C and C′ may also be interpreted in such a way that C is a certain concept to be approximated and C′ is the complement of this concept (C′ = U \ C). If we define, for the concept C and its complement C′, their A-lower approximations $\underline{A}C$ and $\underline{A}C'$, the A-upper approximation $\overline{A}C$, and the A-boundary $BN_A(C) = \overline{A}C \setminus \underline{A}C$, we obtain a simple classifier which operates in such a way that a given tested object u is classified to the concept C if it belongs to the lower approximation $\underline{A}C$. Otherwise, if object u belongs to the lower approximation $\underline{A}C'$, it is classified to the complement of the concept C. However, if the object belongs neither to $\underline{A}C$ nor to $\underline{A}C'$ but to $BN_A(C)$, then the classifier cannot make an unambiguous decision about the membership of the object and has to respond that the tested object belongs simultaneously to the concept C and its complement C′, which means it is a border object. In this case the membership degree of a tested object u ∈ U in a concept C ⊆ U is expressed numerically with the help of a rough membership function (see, e.g., [16, 17]). The rough membership function μC quantifies the degree of relative overlap between the concept C and the equivalence class to which u belongs. It is the function μC : U → [0, 1] defined by

μC(u) = card([u]_{IND_A(A)} ∩ C) / card([u]_{IND_A(A)}).
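Computing the rough membership function reduces to grouping the decision table by the condition-attribute vector; the sketch below is a minimal, library-free illustration with hypothetical names.

from collections import defaultdict

def rough_membership(objects, concept, u):
    # objects: list of dicts of condition-attribute values; concept: set of
    # indices of objects belonging to C; u: index of the evaluated object
    classes = defaultdict(set)
    for i, obj in enumerate(objects):
        classes[tuple(sorted(obj.items()))].add(i)  # equivalence classes of IND
    eq = classes[tuple(sorted(objects[u].items()))]
    return len(eq & concept) / len(eq)
# value 1 places u in the lower approximation of C, value 0 in the lower
# approximation of the complement, and anything in between in the boundary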

As we can see, in order for the classifier described above to work, the tested object must belong to one of the equivalence classes of the relation IND_A(A). There remains, however, the case when the tested object does not belong to any equivalence class of IND_A(A). In that case, the classifier under consideration cannot make any decision about the membership of the tested object and has to say: "I do not know". Unfortunately, this case occurs frequently in practical applications. This is due to the fact that if the objects under testing do not belong to the decision
table that was known at the beginning but to its extension, the chances are small that the given decision table contains an object (called a training object) whose condition attribute values are identical to those of the tested object. Moreover, it follows from the definition of the relation IND_A(A) that a tested object for which there is no such training object cannot be classified by the classifier described above. In such a case, one can say that the extension of this classifier is very small. For the above reason, the classic approach to classifying objects in rough set theory (described above) requires generalization. It is worth noticing that in machine learning and pattern recognition (see, e.g., [6, 8, 9, 10, 11, 12, 13]) this issue is known under the term learning concepts by examples (see, e.g., [10]). The main problem of learning concepts by examples is that the description of the concept under examination needs to be created on the basis of known examples of that concept. By creating a concept description we understand detecting such properties of the exemplary objects belonging to this concept that enable further examination of examples in terms of their membership in the concept. A natural way to solve the problem of learning concepts by examples is inductive reasoning (see, e.g., [251, 252]), in which we accept as true a sentence stating a general regularity on the basis of accepting sentences stating individual instances of this regularity. This is the kind of reasoning by which decisions in the real world are often made on the basis of incomplete or even flawed information, as happens when answering questions connected with forecasting, checking hypotheses, or making decisions. In the case of learning concepts by examples, the use of inductive reasoning means that, while obtaining further examples of objects belonging to the concept (so-called positive examples) and examples of objects not belonging to the concept (so-called negative examples), an attempt is made to find a description that correctly matches all or almost all examples of the concept under examination. From the theoretical point of view, the classic rough set approach to concept approximation was generalized by Professor Skowron and Professor Stepaniuk (see [253]). This approach is consistent with the philosophical view (see, e.g., [126, 127]) and the logical view (see, e.g., [128]). The main element of this generalization is an approximation space. An approximation space (see, e.g., [246, 253, 254, 255]) is a tuple AS = (U, I, ν), where
– U is a non-empty set of objects,
– I : U → P(U) is an uncertainty function, where P(U) denotes the powerset of U,
– ν : P(U) × P(U) → [0, 1] is a rough inclusion function.
The uncertainty function I defines for every object u ∈ U a set of objects indistinguishable from u or similar to u. The set I(u) is called the neighborhood of u. If U is the set of objects of a certain decision table A = (U, A, d), then in the simplest case the set I(u) may be the equivalence class [u]_{IND_A(A)}. However, in the general case the set I(u) is usually defined with the help of a special language such as GDL or NL (see Section 4.7).


The rough inclusion function ν defines the degree of inclusion of X in Y, where X, Y ⊆ U. In the simplest case, rough inclusion can be defined by

ν(X, Y) = card(X ∩ Y) / card(X) if X ≠ ∅, and ν(X, Y) = 1 if X = ∅.

This measure is widely used by the data mining and rough set communities (see, e.g., [16, 17, 246, 253]). However, rough inclusion can have a much more general form than inclusion of sets to a degree (see [192, 247, 249]). It is worth noticing that in the literature (see, e.g., [247]) a parameterized approximation space is considered instead of the approximation space. A parameterized approximation space consists of a family of approximation spaces creating the search space for data models; any approximation space in this family is distinguished by some parameters. Search strategies for optimal (sub-optimal) parameters are basic rough set tools in searching for data models and knowledge. There are two main types of parameters: the first are used to define object sets (neighborhoods), while the second measure the inclusion or closeness of neighborhoods. For an approximation space AS = (U, I, ν) and any subset X ⊆ U, the lower and the upper approximations are defined by
– LOW(AS, X) = {u ∈ U : ν(I(u), X) = 1},
– UPP(AS, X) = {u ∈ U : ν(I(u), X) > 0},
respectively. The lower approximation of a set X with respect to the approximation space AS is the set of all objects which can be classified with certainty as objects of X with respect to AS. The upper approximation of a set X with respect to AS is the set of all objects which can possibly be classified as objects of X with respect to AS. Several known approaches to concept approximation can be covered using the approximation spaces discussed here, e.g., the approach given in [16, 17], approximations based on the variable precision rough set model (see, e.g., [256]), or tolerance (similarity) rough set approximations (see, e.g., [253]). Similarly to the classic approach, the lower and upper approximation in the approximation space AS for a given concept C may be used to classify objects to this concept; in order to do this, one may examine the membership of the tested objects in LOW(AS, C), LOW(AS, C′) and UPP(AS, C) \ LOW(AS, C). However, in machine learning and pattern recognition (see, e.g., [6, 8, 9, 10, 11, 12, 13]) we often search for an approximation of a concept C ⊆ U∗ in an approximation space AS∗ = (U∗, I∗, ν∗), having only partial information about AS∗ and C, i.e., information restricted to a sample U ⊆ U∗. Let us denote the restriction of AS∗ to U by AS = (U, I, ν), i.e., I(x) = I∗(x) ∩ U and ν(X, Y) = ν∗(X, Y) for x ∈ U and X, Y ⊆ U (see Fig. 3). To decide if a given object u ∈ U∗ belongs to the lower approximation or to the upper approximation of C ⊆ U∗, it is necessary to know the value ν∗(I∗(u), C).


Fig. 3. An approximation space AS = (U, I, ν) and its extension AS∗ = (U∗, I∗, ν∗); the diagram shows a tested object u ∈ U∗ together with its neighborhoods I(u) ⊆ U and I∗(u) ⊆ U∗

However, when there is only partial information about the approximation space AS∗ available, one must estimate such a value ν∗(I∗(u), C) rather than compute its exact value. In machine learning, pattern recognition or data mining, different heuristics are used for the estimation of the values of ν∗. Using different heuristic strategies, values of another function ν′ are computed and used as estimates of the values of ν∗; then the function ν′ is used for deciding whether objects belong to C or not. Hence, we define an approximation of C in the approximation space AS′ = (U∗, I∗, ν′) rather than in AS∗ = (U∗, I∗, ν∗). Usually, it is required that the approximations of C ∩ U in AS′ and AS are close (or the same). The approach presented above (see, e.g., [83, 246, 248, 249]) became an inspiration for developing a number of methods enabling the enlargement of the extension of constructed classifiers, that is, making the classifiers under construction able to classify any objects, not only those belonging to a given decision table. Some other issues concerning the rough set approach to vague concept approximation are discussed, e.g., in [83, 128, 248, 249]; among them are higher-order vagueness (i.e., the nondefinability of boundary regions), adaptive learning of concept approximation, concept drift, and sorites paradoxes. One of the basic ways of increasing the extension of classifiers is to approximate the concepts not with the help of the equivalence classes of the relation IND (see above) but with the help of patterns of an established language, which can be matched by different objects, both from the training table and from its extension. A given object matches a pattern if it is compatible with the description of this pattern.
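Under the simple set-theoretic rough inclusion defined earlier, the lower and upper approximations in an approximation space can be sketched as follows (illustrative names; I stands for any neighborhood function):

def nu(X, Y):
    # standard rough inclusion: degree of inclusion of X in Y
    return 1.0 if not X else len(X & Y) / len(X)

def approximations(universe, I, X):
    # LOW(AS, X) and UPP(AS, X) for AS = (U, I, nu), where I maps an
    # object to its neighborhood (a set of objects)
    low = {u for u in universe if nu(I(u), X) == 1.0}
    upp = {u for u in universe if nu(I(u), X) > 0.0}
    return low, upp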


Usually, the pattern is constructed in such a way that all or almost all of its matching objects belong to the concept under study (the decision class). Moreover, it is required that objects from many equivalence classes of the relation IND can match the patterns. Thus, the extension of classifiers based on patterns is dramatically greater than the extension of classifiers working on the basis of equivalence classes of the relation IND. Patterns of this type are often called decision rules (see Section 2.3). In the literature one may encounter many methods of computing decision rules from data, as well as methods of preprocessing the data in order to construct effective classifiers. Methods of this type include, for example, discretization of attribute values (see Section 2.2), methods of computing decision rules (see Section 2.3), and shortening and generalization of decision rules (see Section 2.5). The determined decision rules may be applied to classifier construction. For instance, let us examine the situation when a classifier is created on the basis of decision rules from the set RUL(A) computed for a given decision table A = (U, A, d), where the decision attribute d describes membership in a certain concept C and its complement C′ (for simplicity of reasoning we consider only binary classifiers, i.e., classifiers with two decision classes; one can easily extend the approach to the case of classifiers with more decision classes). The set of rules RUL(A) is the union of two subsets RUL(A, C) and RUL(A, C′), where RUL(A, C) is the set of rules classifying objects to C and RUL(A, C′) is the set of rules classifying objects to C′. For any tested object u, by MRul(A, C, u) ⊆ RUL(A, C) and MRul(A, C′, u) ⊆ RUL(A, C′) we denote the sets of those rules whose predecessors match object u and which classify objects to C and C′, respectively. Let AS = (U, I, ν) be an approximation space, where:
1. ∀u ∈ U : I(u) = ⋃_{r ∈ MRul(A,C,u)} Supp_A(r) ∪ ⋃_{r ∈ MRul(A,C′,u)} Supp_A(r),
2. ∀X, Y ⊆ U : ν(X, Y) = card(X ∩ Y) / card(X) if X ≠ ∅, and ν(X, Y) = 1 if X = ∅.
The above approximation space AS may be extended in a natural way to the approximation space AS′ = (U∗, I∗, ν′), where:
1. I∗ : U∗ → P(U∗) is such that ∀u ∈ U : I∗(u) = I(u),
2. ∀X, Y ⊆ U∗ : ν′(X, Y) = card(X ∩ Y) / card(X) if X ≠ ∅, and ν′(X, Y) = 1 if X = ∅.

Let us notice that such a simple generalization of the functions I to I∗ and ν to ν′ is possible because the function I may determine the neighborhood also for an object belonging to U∗; this results from the fact that decision rules from the set RUL(A) may recognize objects not only from the set U but also from the set U∗ \ U. The approximation space AS′ may now also be used to construct a classifier which classifies objects from the set U∗ to the concept C or its complement C′. In creating such a classifier, the key problem is to resolve the conflict between the rules
classifying the tested object to the concept or to its complement. Let us notice that this conflict occurs because in practice we do not know the function ν∗ but only its approximation ν′. That is why there may exist a tested object u_t such that the values ν′({u_t}, C) and ν′({u_t}, C′) are both high (that is, close to 1), while the values ν∗({u_t}, C) and ν∗({u_t}, C′) are very different (e.g., ν∗({u_t}, C) is close to 1 and ν∗({u_t}, C′) is close to 0). Below, we present the definition of such a classifier in the form of a function that returns the value YES when the tested object belongs to C and the value NO when the tested object belongs to C′:

∀u ∈ U∗ : Classifier(u) = YES if ν′({u}, C) > 0.5, and NO otherwise.   (5)

Obviously, other rough inclusion functions may be defined (see, e.g., [192, 247, 249]); thus, we obtain different classifiers. Unfortunately, a classifier defined with the help of Equation (5) is impractical because the function ν′ used in it does not introduce additional parameters enabling the recognition of the membership of objects in the concept and its complement, whereas in practical applications of constructing classifiers based on decision rules, functions are applied which give the strength (weight) of the classification of a given tested object to the concept C or its complement C′ (see, e.g., [196, 199, 216, 217, 241]). Below, we present a few instances of such weights (see [199]).

1. The simple strength of a decision rule set is defined by

   SimpleStrength(C, u_t) = card(MRul(A, C, u_t)) / card(RUL(A, C)).

2. The maximal strength of a decision rule set is defined by

   MaximalStrength(C, u_t) = max_{r ∈ MRul(A, C, u_t)} card(Supp_A(r)) / card(C).

3. The basic strength (or standard strength) of a decision rule set is defined by

   BasicStrength(C, u_t) = ( Σ_{r ∈ MRul(A, C, u_t)} card(Supp_A(r)) ) / ( Σ_{r ∈ RUL(A, C)} card(Supp_A(r)) ).

4. The global strength of a decision rule set is defined by

   GlobalStrength(C, u_t) = card( ⋃_{r ∈ MRul(A, C, u_t)} Supp_A(r) ) / card(C).
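Treating Supp_A(r) as the set of training objects supporting rule r (as in the neighborhood definition above), the four weights can be rendered schematically as follows; the function is an illustration with hypothetical names, not RSES code.

def strengths(matching_supports, all_supports, concept):
    # matching_supports: supports of the rules from MRul(A, C, u_t);
    # all_supports: supports of all rules from RUL(A, C);
    # concept: the set of training objects of C
    simple = len(matching_supports) / len(all_supports)
    maximal = max((len(s) for s in matching_supports), default=0) / len(concept)
    basic = (sum(len(s) for s in matching_supports)
             / sum(len(s) for s in all_supports))
    covered = set().union(*matching_supports) if matching_supports else set()
    global_strength = len(covered) / len(concept)
    return simple, maximal, basic, global_strength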

Using each of the above rule weights, a corresponding rough inclusion function may be defined. Let us denote any established weight of rule sets by S. For the weight S we define an exemplary rough inclusion function νS in the following way:


∀X, Y ⊆ U : νS(X, Y) =
  0                                      if Y = ∅ and X ≠ ∅,
  1                                      if X = ∅,
  S(Y, u) / (S(Y, u) + S(U \ Y, u))      if X = {u} and S(Y, u) + S(U \ Y, u) ≠ 0,
  1/2                                    if X = {u} and S(Y, u) + S(U \ Y, u) = 0,
  ( Σ_{u ∈ X} νS({u}, Y) ) / card(X)     if card(X) > 1,

where, for an established set Y and object u, the weights S(Y, u) and S(U \ Y, u) are computed using the decision rule set generated for the table A = (U, A, dY), in which the attribute dY describes the membership of objects from U in the set Y. The rough inclusion function defined above may be used to construct a classifier as in Equation (5). Such a classifier executes a simple negotiation method between the rules classifying the tested object to the concept and the rules classifying it to the complement of the concept (see Section 2.6): the tested object u is classified to the concept C only when, with the established weight of rule sets S, the value νS({u}, C) is greater than νS({u}, C′); otherwise, u is classified to the complement of the concept C. In this paper, the weight BasicStrength is used to resolve conflicts between rule sets in the experiments related to the construction of classifiers based on decision rules.
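For a single tested object, this negotiation reduces to comparing two weights; a minimal sketch follows (S stands for any of the weights above, e.g., BasicStrength; all names are illustrative).

def classify_with_weight(S, u, concept_label, complement_label):
    # applies Equation (5) to nu_S({u}, C); the neutral case
    # S(C, u) + S(C', u) = 0 (where nu_S takes the value 1/2) is returned
    # here as UNKNOWN rather than NO, to make it visible
    w_c = S(concept_label, u)
    w_not_c = S(complement_label, u)
    if w_c + w_not_c == 0:
        return "UNKNOWN"
    return "YES" if w_c / (w_c + w_not_c) > 0.5 else "NO"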

2.9 Evaluation of Classifiers

In order to evaluate classifier quality with respect to the data analyzed, a given decision table is, in the general case, partitioned into two tables (see, e.g., [11, 257, 258]):
1. the training table, containing the objects on the basis of which the algorithm learns to classify objects into decision classes,
2. the test table, by means of which the classifier learned on the training table may be evaluated by classifying all objects belonging to this table.
A frequent numerical measure of classifier evaluation is the number of mistakes made by the classifier during classification of objects from the test table relative to the number of all classified objects (the error rate; see, e.g., [11, 196, 198]). However, the most often used method of numerical classifier evaluation is based on a confusion matrix. The confusion matrix (see, e.g., [15, 257, 259]) contains information about the actual and predicted classifications made by a classifier, and performance is commonly evaluated using the data in this matrix. Table 1 shows the confusion matrix for a two-class classifier, i.e., for a classifier constructed for a concept.


Table 1. The confusion matrix

                    Predicted
                 Negative   Positive
Actual Negative     TN         FP
       Positive     FN         TP

The entries in the confusion matrix have the following meaning in the context of our study (see, e.g., [260]):
– TN (True Negatives) is the number of correct predictions that an object is a negative example of a concept of the test table,
– FP (False Positives) is the number of incorrect predictions that an object is a positive example of a concept of the test table,
– FN (False Negatives) is the number of incorrect predictions that an object is a negative example of a concept of the test table,
– TP (True Positives) is the number of correct predictions that an object is a positive example of a concept of the test table.
Several standard terms (parameters) have been defined for the two-class confusion matrix:
– the accuracy (ACC), defined for a given classifier by

  ACC = (TN + TP) / (TN + FN + FP + TP),

– the accuracy for positive examples, also called the sensitivity (see, e.g., [260]) or the true positive rate (TPR) (see, e.g., [257]), defined for a given classifier by

  TPR = TP / (TP + FN),

– the accuracy for negative examples, also called the specificity (see, e.g., [260]) or the true negative rate (TNR) (see, e.g., [257]), defined for a given classifier by

  TNR = TN / (TN + FP).

An essential parameter is also the number of classified objects from the test table relative to the number of all objects of this table, since classifiers may not always be able to classify an object. This parameter, called the coverage (see, e.g., [11, 15]), may be treated as an extension measure of the classifier. Thus, in order to evaluate classifiers, the following numerical parameters are also applied in this paper:
1. the coverage (COV), defined for a given classifier by

   COV = (TN + FP + FN + TP) / (the number of all objects of the test table),


2. the coverage for positive examples (PCOV), defined for a given classifier by

   PCOV = (FN + TP) / (the number of all positive examples of a concept of the test table),

3. the coverage for negative examples (NCOV), defined for a given classifier by

   NCOV = (TN + FP) / (the number of all negative examples of a concept of the test table),

4. the real accuracy, defined for a given classifier by ACC · COV,
5. the real accuracy for positive examples (the real true positive rate), defined for a given classifier by TPR · PCOV,
6. the real accuracy for negative examples (the real true negative rate), defined for a given classifier by TNR · NCOV.
Besides these, still other parameters are applied in order to evaluate classifiers, for instance, the time of constructing a classifier on the basis of a training table or the complexity of the constructed classifier (e.g., the number of generated decision rules). In summary, the main parameters applied to the evaluation of classifiers in this paper are: the accuracy, the coverage, the real accuracy, the accuracy for positive examples, the coverage for positive examples, the real accuracy for positive examples, the accuracy for negative examples, the coverage for negative examples, and the real accuracy for negative examples. They are used in experiments with AR schemes (see Section 5.8) and in experiments related to detecting behavioral patterns (see Section 6.25 and Section 6.26). However, in the experiments with automated planning, another method of classifier quality evaluation was applied (see Section 7.21). This results from the fact that in that case the value of a complex decision, namely a plan, i.e., a sequence of actions alternated with states, is generated automatically, and the above-mentioned parameters cannot be used to compare such complex decision values. Therefore, to compare the plans generated automatically with the plans available in the data set, we use a special classifier based on a concept ontology which measures the similarity between any pair of plans (see Section 7.18). It is worth noticing that in the literature one may find another frequently applied method of measuring the quality of created classifiers, based on the ROC curve (Receiver Operating Characteristic curve) (see, e.g., [260, 261, 262]). This method is available, for instance, in the ROSETTA system (see, e.g., [259, 263, 264]). It is also worth mentioning that the author of this paper participated in the construction of the programming library RSES-lib, which forms the computational kernel of the ROSETTA system (see [230, 259] for more details).
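The confusion-matrix parameters listed above reduce to a few arithmetic operations; the sketch below computes them for a two-class classifier (objects left unclassified make n_objects larger than TN + FP + FN + TP; names are illustrative).

def evaluate(tn, fp, fn, tp, n_objects, n_pos, n_neg):
    # n_objects, n_pos, n_neg: numbers of all objects, of all positive
    # examples and of all negative examples of the concept in the test table
    acc = (tn + tp) / (tn + fn + fp + tp)
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    cov = (tn + fp + fn + tp) / n_objects
    pcov = (fn + tp) / n_pos
    ncov = (tn + fp) / n_neg
    return {"ACC": acc, "TPR": tpr, "TNR": tnr,
            "COV": cov, "PCOV": pcov, "NCOV": ncov,
            "realACC": acc * cov, "realTPR": tpr * pcov, "realTNR": tnr * ncov}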


In order not to make the value of the determined classifier evaluation parameter dependent on a specific partitioning of the whole decision table into training and test parts, a number of methods are applied which perform tests to determine which parameter values of the classifier evaluation are credible. The methods of this type applied most often are train-and-test and cross-validation (see, e.g., [11, 258, 265]). The train-and-test method is usually applied to decision tables having at least 1000 objects (see, e.g., [11]). It consists in a random isolation of two subtables from the whole available data, treating one of them as a training subtable and the other as a test subtable. The training and test subtables are usually disjoint (although not always) and together make up the available decision table. It is crucial, however, that at least some of the objects from the test subtable do not occur in the training subtable. The proportion between the numbers of objects in the test and training subtables depends on the given experiment, but usually the number of objects in the test part constitutes from 20 to 50 percent of the number of objects in the whole available data (see, e.g., [11]). The cross-validation method is applied to evaluate a classifier when the number of objects in the decision table is less than 1000 (see, e.g., [11]). This method consists in randomly partitioning the data into m equal parts and then performing m experiments with them. In each of these experiments, a local coefficient of the classifier evaluation is calculated for the situation when one of the parts into which the data was divided is the set of tested objects, and the remaining m − 1 parts (temporarily combined) are treated as the set of training objects. Finally, the coefficient of classifier evaluation is calculated as the arithmetic mean over all experiments. The number m is determined depending on the specific data and should be selected in such a way that the test parts do not have too few objects; in practice, m is an integer ranging from 5 to 15 (see, e.g., [11]). In this paper, all decision tables used in experiments have more than 1000 objects. That is why, in order to determine the classifier quality parameters, the train-and-test method is always applied. Moreover, each experiment is repeated 10 times for ten random partitions into two separate (training and test) tables. Hence, the result of each experiment is the arithmetic mean of the results of its repetitions; additionally, the standard deviation of the obtained result is given.
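A minimal cross-validation loop along the lines just described; train and evaluate stand for any classifier construction and evaluation routines (hypothetical names).

import random

def cross_validation(rows, m, train, evaluate):
    # split the data randomly into m (roughly) equal parts; in each of the
    # m experiments one part is the test set and the remaining m - 1 parts,
    # temporarily combined, form the training set
    rows = rows[:]
    random.shuffle(rows)
    folds = [rows[i::m] for i in range(m)]
    scores = []
    for i in range(m):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(evaluate(train(training), folds[i]))
    return sum(scores) / m  # arithmetic mean of the local coefficients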

2.10 Problem of Low Coverage

If a given tested object matches the predecessor of a certain basic decision rule (that is, the values of the condition attributes of this object are the same as the values of the corresponding descriptors in the rule predecessor), then this rule may be used to classify the object, that is, the object is classified to the decision class occurring in the rule successor. In this case we also say that the tested object is recognized by the decision rule. However, if a given tested object is recognized by different decision rules which classify it to more than one decision class, then negotiation methods between rules are applied (see Section 2.6 and Section 2.8). In practice, it may happen that a given tested object does not match the predecessor of any of the available decision rules. We then say that this object is not recognized by the given rule-based classifier and, consequently, cannot be classified by it. This is an unfavorable situation, for we often expect
from the classifiers to classify all or almost all tested objects. If there are many unclassified objects, then we say that the given classifier has too low an extension; this is expressed numerically by a low value of the coverage parameter (see Section 2.9). A number of approaches enabling the avoidance of low coverage of classifiers based on decision rules have been described in the literature, for example:
1. The application of classifiers based on the set of all rules with a minimum number of descriptors (see Section 2.4), which usually have a high extension (see, e.g., [196, 198]).
2. The application of rule classifiers constructed on the basis of covering algorithms and a mechanism of partial matching of objects to the rules (see, e.g., [10, 213, 214, 216, 217, 222, 223, 266]).
3. The application of classifiers based on decision rules which underwent the process of rule generalization, owing to which the classifier extension usually increases (see Section 2.5).
4. The application of classifiers based on lazy learning, which does not require preliminary computation of decision rules, for the decision rules needed for object classification are discovered directly in the given decision table during the classification of the tested object (see, e.g., [197, 198, 267]).
All the methods mentioned above have their advantages and disadvantages. The first method has an exponential time complexity, which results from the complexity of the algorithm computing all reducts (see Section 2.4). The second method is very quick, for it is based on rules computed with the help of the covering method; however, in this method approximate rules (determined as a result of partial matching of objects to the rules) are often applied to classify objects, and the quality of classification on the basis of such rules may be unsatisfactory. The third method uses the operation of rule generalization, owing to which the extension of the obtained rules increases; however, it does not lead to as high an extension as in the case of the first, second and fourth methods, and the operation of rule generalization is quite time consuming. The fourth method, although it does not require preliminary computation of decision rules, has a pessimistic computational time complexity of classifying each tested object of order O(n² · m), where n is the number of objects in the training table and m is the number of condition attributes; hence, for bigger decision tables this method cannot be applied effectively. There remains one more possibility: to build classifiers on the basis of rules computed with the covering method without using partial matching of tested objects to the rules. Obviously, classifiers based on such rules may have a low coverage; however, they usually have a high quality of classification. This is extremely crucial in many applications (for example, medical and financial ones) where it is required that the decisions generated by classifiers be always or almost always correct. In such applications it is sometimes better for the classifier to say "I do not know" rather than make a wrong decision. That is why in this paper we use classifiers based on rules computed with the covering method (without partial matching of objects to the rules), agreeing to a low coverage of such
classifiers in cases when classifiers based on the set of all rules with a minimum number of descriptors cannot be applied (i.e., when the analyzed decision tables are too large).

3 Methods of Constructing Stratifying Classifiers

The algorithm of concept approximation presented in Subsection 2.8 consists in classifying the tested objects to the lower approximation of the concept, to the lower approximation of the complement of the concept, or to its boundary. Many methods enabling the increase of the extension of constructed classifiers have been proposed in rough set theory (see Section 2.8); discretization of attribute values (see Section 2.2), methods of calculating and modifying decision rules (see Sections 2.3, 2.4, 2.5), and the partial matching method (see Section 2.10) are examples of such methods. As a result of applying these methods, classifiers are constructed that are able to classify almost every tested object to the concept or its complement. At first glance this state of affairs is encouraging, for approximation methods can be extended to tested objects from beyond a given decision table, which is necessary in inductive learning (see Section 2.8). Unfortunately, this process of generalizing concept approximation encounters difficulties in classifying new tested objects (unknown during classifier learning). Namely, after expanding the set of objects U of a given information system with new objects, the equivalence classes of these objects are often disjoint from U. This means that if such objects match the description of a given concept C constructed on the basis of the set U, this match is often incidental. Indeed, due to the absence of these new objects when the concept description was created, the generalization of decision rules may go too far (e.g., decision rules may become too short), or it may happen that the properties (attributes) used to describe the concept were chosen in a wrong way. So, if a certain tested object from outside the decision table is classified, it may turn out that, in the light of the knowledge gathered in the given decision table, this object should be classified neither to the concept nor to its complement but to the concept boundary, which expresses our uncertainty about the classification of this object. Meanwhile, most of the classifiers currently constructed classify the object to the concept or to its complement. A need thus arises to use the knowledge from the given table in order to determine a coefficient of certainty that the tested object belongs to the approximated concept. In other words, we would like to determine, for the tested object, how certain it is that this object belongs to the concept, and it would be best to express this certainty coefficient by a number, e.g., from [0, 1]. In the literature such a numerical coefficient is expressed using different kinds of rough membership functions (see Section 2.8). If a method of determining such a coefficient is given, it may be assumed that the coefficient values are discretized, which leads to a linearly ordered sequence of concept layers. The first layer in this sequence represents objects which, without any doubt, do not belong to the concept (the lower approximation of the concept complement).


Fig. 4. Layers of a given concept C: the lower approximation of C, the layers of the boundary region of C, and the lower approximation of U − C

The next layers in the sequence represent objects belonging to the concept more and more certainly (the boundary layers of the concept). The last layer in this sequence represents objects certainly belonging to the concept, that is, those belonging to the lower approximation of the concept (see Fig. 4). Let us add that concept layers of this type may be defined both on the basis of the knowledge gathered in data tables and using additional domain knowledge provided by experts.
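If a certainty coefficient from [0, 1] is available (e.g., a rough membership value), its discretization into linearly ordered layers can be sketched as follows; the number of layers and the equal-width slicing are illustrative choices, not prescribed by the method.

def layer(mu, k=5):
    # map a certainty coefficient mu in [0, 1] to one of k >= 3 linearly
    # ordered layers: layer 1 = certainly outside the concept (lower
    # approximation of the complement), layer k = certainly inside (lower
    # approximation), layers 2..k-1 = equal-width slices of the boundary
    if mu == 0.0:
        return 1
    if mu == 1.0:
        return k
    return 2 + min(int(mu * (k - 2)), k - 3)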

3.1 Stratifying Classifier

In order to examine the membership of tested objects in individual concept layers, classifiers are needed that can approximate all layers of a given concept at the same time. Such classifiers are called in this paper stratifying classifiers.

Definition 1 (A stratifying classifier). Let A = (U, A, d) be a decision table whose objects are positive and negative examples of a concept C (described by a binary attribute d).
1. A partition of the set U is a family {U1, ..., Uk} of non-empty subsets of the set U (where k > 1) such that the following two conditions are satisfied:
(a) U = U1 ∪ ... ∪ Uk,
(b) ∀ i ≠ j : Ui ∩ Uj = ∅.
2. A partition of the set U into a family {U_C^1, ..., U_C^k} is called a partition of U into layers in relation to the concept C when the following three conditions are satisfied:
(a) the set U_C^1 includes objects which, according to an expert, certainly do not belong to the concept C (so they belong to the lower approximation of its complement),
(b) for every two sets U_C^i, U_C^j (where i < j), the set U_C^i includes objects which, according to an expert, belong to the concept C with a degree of certainty lower than that of the objects of U_C^j,
(c) the set U_C^k includes objects which, according to an expert, certainly belong to the concept C, viz., to its lower approximation.
3. Each algorithm which assigns (classifies) tested objects to one of the layers belonging to a partition of the set U in relation to the concept C is called a stratifying classifier of the concept C.


4. In practice, instead of the layer markings U_C^1, ..., U_C^k, elements of a set E = {e1, ..., ek} are used to label the layers, and the stratifying classifier constructed for the concept C which classifies each tested object into one of the layers labeled with labels from the set E is denoted by μ_C^E.
5. If the stratifying classifier μ_C^E classifies a tested object u into the layer labeled by e ∈ E, then this fact is denoted by the equality μ_C^E(u) = e.
An expert may divide the set of objects U into layers in the two following ways:
1. by an arbitrary assignment of weight labels to all training objects (see Section 3.2),
2. by providing heuristics which may be applied in the construction of a stratifying classifier (see Section 3.3).
Stratifying classifiers can be very useful when we need to estimate realistically the certainty of membership of a tested object in a concept, without determining whether the object belongs to the concept or not. Apart from that, stratifying classifiers may be used to construct so-called production rules (see Section 5.3). In this paper, two general ways of constructing stratifying classifiers are presented. The first one is the expert approach, consisting in an expert defining an additional attribute in the data which describes the membership of objects in particular layers of the concept; next, a classifier differentiating the layers as decision classes is built (see Section 3.2). The second approach, called the automatic approach, is based on designing algorithms which are extensions of classifiers, enabling the classification of objects into the layers of a concept on the basis of certain premises and experimental observations (see Section 3.3).

3.2 Stratifying Classifiers Based on the Expert Approach

In the construction of stratifying classifiers using expert knowledge, it is assumed that for all training objects not only the binary classification to the concept or outside the concept is known, but also the assignment of all training objects to specific concept layers. In this approach, additional knowledge needs to be acquired from domain knowledge. Owing to that, a classical classifier may be built (e.g., one based on a set of rules with a minimal number of descriptors) which directly classifies objects into the different concept layers. This classifier is built on the basis of a decision attribute which has as many values as there are concept layers, each value being a label of one of the layers.

3.3 Stratifying Classifiers Based on the Automatic Approach

In the construction of stratifying classifiers using the automatic approach, the assignment of training objects to specific concept layers is unknown; we only know the binary classification of training objects to a concept or its complement. The performance of a stratifying classifier is, in this case,
connected with a certain heuristic which supports discerning objects belonging to the concept to a lesser or greater degree, that is, objects belonging to different layers of this concept. Such a heuristic determines the way an object is classified into the different layers and is therefore called a stratifying heuristic. Many different types of concept-stratifying heuristics may be proposed; these may be, e.g., heuristics based on the difference of weights of the decision rules classifying tested objects to the concept and to its complement, or heuristics using the k nearest neighbors (k-NN) algorithm (compare with [78, 200, 268]). In this paper, however, we are concerned with a new type of stratifying heuristic using the operation of decision rule shortening (see Section 2.5). The starting point of the presented heuristic is the following observation. Let us assume that for a certain consistent decision table A, whose decision is a binary attribute with values 1 (objects belonging to the concept C) and 2 (objects belonging to the complement of the concept C, denoted by C′), a set of decision rules RUL(A) was calculated. The set RUL(A) is the union of two disjoint subsets of rules, RUL1(A) (classifying objects to C) and RUL2(A) (classifying objects to C′). Now, let us shorten the decision rules from RUL1(A) to the consistency coefficient 0.9, placing the shortened decision rules in the set RUL1(A, 0.9), and let RUL′(A) = RUL1(A, 0.9) ∪ RUL2(A). In this way, we have increased the extension of the input set of decision rules RUL(A) in relation to the concept C: as a result of shortening the rules, the chance increases that a given tested object is recognized by the rules classifying to the concept C. In other words, the classifier based on the rule set RUL′(A) classifies objects to the concept C more often. Now, if a certain tested object u not belonging to table A is classified to C′ by the classifier based on the rule set RUL′(A), then the chance that object u actually belongs to C′ is much bigger than in the case of using the rule set RUL(A). The reason is that it is harder for a classifier based on the rule set RUL′(A) to classify objects to C′, for the rules classifying objects to C are shortened in it and therefore recognize objects more often; if an object u is nevertheless classified to C′, then some of its crucial properties identified by the rules classifying it to C′ must determine this decision. If the shortening of the decision rules is greater (to a lower consistency coefficient), then the change in the extension of the rule set is even bigger. Summing up the above discussion, we conclude that rule shortening makes it possible to change the extensions of decision rule sets in relation to chosen concepts (decision classes), and owing to that one can obtain a certain type of approximation based on the degree of certainty of the membership of tested objects in the concept under consideration, where different layers of the concept are modeled by applying different rule shortening coefficients. In the construction of algorithms producing stratifying classifiers based on the shortening of decision rules, there occurs the problem of selecting the consistency coefficient thresholds to which the decision rules are shortened. In other words, what we mean here is the range and the step with which the threshold must be selected in order to obtain sets of rules enabling an effective
description of the actual layers of the approximated concept. On the basis of previous experimental experience (see, e.g., [196, 198]), in this paper we establish that the shortening thresholds of the decision rule consistency coefficient are selected from the range 0.5 to 1.0. The lower threshold limit (that is, 0.5) results from the experimental observation that if we shorten the rules classifying objects to a certain concept C below the limit 0.5 (without simultaneously shortening the rules classifying objects to C′), then although their extension increases dramatically (they classify objects to the concept C very often), their certainty falls to an absolutely unsatisfactory level. The upper threshold limit (that is, 1.0) simply means leaving only exact rules in the set of rules and rejecting the approximate rules which could have occurred for a given decision table. As for the change step of the chosen consistency coefficient threshold, we set it at 0.1. This step is dictated by the fact that it enables a general search of thresholds from 0.5 to 1.0 while, at the same time, the number of rule shortening operations is not too high, which is essential for keeping the time needed to conduct computer experiments within acceptable bounds. We now present an algorithm of stratifying classifier construction based on rule shortening (see Algorithm 3.1).

Algorithm 3.1. Stratifying classifier construction
Input: decision table A = (U, A, d) and concept C ⊆ U
Output: classifier list L representing a stratifying classifier
1 begin
2   Calculate decision rules for table A, denoted by RUL(A) = RUL1(A) ∪ RUL2(A)
3   Create empty classifier list L
4   for a := 0.5 to 0.9 with step 0.1 do
5     Shorten rules RUL1(A) to the consistency coefficient a and place the shortened decision rules in RUL1(A, a)
6     RUL := RUL1(A, a) ∪ RUL2(A)
7     Add RUL to the end of the list L
8   end
9   Add RUL(A) to the end of the list L
10  for a := 0.9 to 0.5 with step 0.1 do
11    Shorten rules RUL2(A) to the consistency coefficient a and place the shortened decision rules in RUL2(A, a)
12    RUL := RUL1(A) ∪ RUL2(A, a)
13    Add RUL to the end of the list L
14  end
15  return L
16 end

Let us notice that after the above algorithm completes, the list L contains eleven decision rule sets.


Algorithm 3.2. Classification using the stratifying classifier
Input:
1. classifier list L representing a stratifying classifier,
2. set of labels of layers E = {e1, ..., e_{size(L)+1}},
3. tested object u
Output: the label of the layer to which the object u is classified
1 begin
2   for i := size(L) down to 1 do
3     Classify u using the classifier L[i]
4     if u is classified by L[i] to the concept C then
5       return e_{i+1}
6     end
7   end
8   return e1
9 end

The first classifier on this list contains the most shortened rules classifying to C. That is why, if it classifies an object to C′, the degree of certainty that this object belongs to the concept C′ is the highest. The last classifier on the list L, in turn, contains the most shortened rules classifying to C′; that is why the classification of an object to the concept C by this classifier gives the highest degree of certainty that the object really belongs to C. The time complexity of Algorithm 3.1 depends on the time complexity of the chosen algorithm for computing decision rules and of the algorithm for approximate rule synthesis (see Section 2.5). On the basis of the classifier list constructed by Algorithm 3.1, the tested object is classified to a specific layer with the help of the successive classifiers, starting from the last one and moving towards the first: if the object is classified to C by the i-th classifier, then the tested object belongs to the (i + 1)-th layer of C; if the object is not classified to C by any classifier, then the tested object belongs to the first layer (layer number 1), that is, to the complement of the concept C. We present the detailed classification procedure as Algorithm 3.2. Let us notice that if the size of the list L is equal to 11 (as generated by Algorithm 3.1), the above classifier classifies objects into 12 concept layers, where layer number 12 contains the objects with the highest degree of certainty of membership in the concept and layer number 1 those with the lowest.

4 General Methodology of Complex Concept Approximation

Many real-life problems may be modeled with the help of so-called complex dynamical systems (see, e.g., [92, 93, 94, 95, 96, 97]) or, in other
words, autonomous multiagent systems (see, e.g., [98, 101]) or swarm systems (see, e.g., [104]). These are systems consisting of complex objects characterized by the constant change of the parameters of their components over time, numerous relationships among the objects, the possibility of cooperation/competition among the objects, and the ability of objects to perform more or less complicated actions. Examples of systems of this kind are: road traffic, a patient observed during treatment, and a team of robots performing some task. The description of the dynamics of such a system is often impossible using purely classical analytical methods, and the description itself contains many vague concepts. For instance, in order to monitor complex dynamical systems effectively, complex spatio-temporal concepts concerning the dynamic properties of the complex objects occurring in these systems are used very often. These concepts are expressed in natural language on a much higher level of abstraction than the so-called sensor data, to which approximation of concepts has mostly been applied so far. Examples of such concepts are safe car driving, safe overtaking, patient's behavior when faced with a life threat, and ineffective behavior of a robot team. Much attention has been devoted to spatio-temporal exploration methods in the literature (see, e.g., [63, 64]). Current experience indicates more and more that approximation of such concepts requires the support of knowledge of the domain to which the approximated concepts apply, i.e., domain knowledge. This usually means knowledge about the concepts occurring in a given domain and the various relations among these concepts. This knowledge significantly exceeds the knowledge gathered in data sets; it is often represented in natural language, and it is usually obtained in a dialogue with an expert in the given domain (see, e.g., [41, 42, 43, 44, 45, 46, 52, 269]). One of the methods of representing this knowledge is recording it in the form of a so-called concept ontology. A concept ontology is usually understood as a finite set of concepts forming a hierarchy, together with relationships among these concepts which connect concepts from different hierarchical levels (see the next section). In this subsection, we present a general methodology of approximating complex spatio-temporal concepts on the basis of experimental data and domain knowledge represented mainly by a concept ontology.

4.1 Ontology as a Representation of Domain Knowledge

The word ontology was originally used by philosophers to describe a branch of metaphysics concerned with the nature and relations of being (see, e.g., [270]). However, the definition of ontology itself has been a matter of dispute for a long time, and the controversies concern mainly the thematic scope to be embraced by this branch. Discussions on the subject of the definition of ontology appear in the works of Gottfried Leibniz, Immanuel Kant, Bernard Bolzano, Franz Brentano, and Kazimierz Twardowski (see, e.g., [271]). Most of them treat ontology as a field of science concerning the types and structures of objects, properties, events, processes, relations, and reality domains (see, e.g., [106]). Therefore, ontology is a science neither about the functioning of the world nor about the ways a human being perceives it. It poses questions such as: How do we classify everything? What
classes of beings are indispensable for describing and reasoning about ongoing processes? What classes of beings enable us to reason about truth? What classes of beings enable us to reason about the future? (see, e.g., [106, 270]).

Ontology in Informatics. The term ontology appeared in the information technology context at the end of the 1960s as a specific way of knowledge formalization, mainly in the context of database development and artificial intelligence (see, e.g., [53, 272]). The growth in popularity of database systems caused an avalanche-like increase of their capacity. The size of the data and the multitude of tools used for storing, entering, or transferring data made databases difficult to manage and to interface with the outside world. Database schemas are determined to a high extent not only by the requirements of an application or by database theory but also by cultural conditions, knowledge, and the vocabulary used by the designers. As a result, the same class of objects may possess differently named sets of attributes in various schemas, while identical terms often describe completely different things. Ontologies are supposed to be a solution to this problem, since they can be treated as tools for establishing standards of database schema creation. The second pillar of ontology development is artificial intelligence (AI), mainly because of the view according to which drawing conclusions requires knowledge resources concerning the outside world, and ontology is a way of formalizing and representing such knowledge (see, e.g., [7, 273]). It is worth noticing that, in recent years, one of the main applications of ontologies has been their use for intelligent search of information on the Internet (see, e.g., [53] and [54] for more details).

Definition of Ontology. In philosophy as well as in information technology, there is a lack of agreement when it comes to the definition of ontology. Let us now consider three definitions of ontology well known from the literature. Guarino states (see [53]) that, in the most prevalent use of the term, an ontology refers to an engineering artifact constituted by a specific vocabulary used to describe a certain reality (or some part of reality), plus a set of explicit assumptions regarding the intended meaning of the vocabulary words. In this approach, an ontology describes a hierarchy of concepts related by relationships, whereas in more sophisticated cases suitable axioms are added to express other relationships among concepts and to constrain the interpretation of those concepts. Another well-known definition of ontology has been proposed by Gruber (see [105]), who defines an ontology as an explicit specification of a conceptualization. He explains that, for AI systems, what exists is that which can be represented. When the knowledge of a domain is represented in a declarative formalism, the set of objects that can be represented is called the universe of discourse. This set of objects and the describable relationships among them are reflected in the representational vocabulary with which a knowledge-based program represents knowledge. Thus, according to Gruber, in the context of AI we can describe the ontology of a knowledge-based program by defining a set of representational terms. In such an ontology, definitions associate the names of

Hierarchical Classiﬁers for Complex Spatio-temporal Concepts

533

entities in the universe of discourse (e.g., classes, relations, functions, or other objects) with human-readable text describing what the names mean, and with formal axioms that constrain the interpretation and the well-formed use of these terms. Finally, we present the view of ontology recommended by the World Wide Web Consortium (W3C) (see [107]). W3C explains that an ontology defines the terms used to describe and represent an area of knowledge. Ontologies are used by people, databases, and applications that need to share domain information (a domain being simply a specific subject area or area of knowledge, such as medicine, tool manufacturing, real estate, automobile repair, financial management, etc.). Ontologies include computer-usable definitions of basic concepts in the domain and of the relationships among them. They encode knowledge in a domain and also knowledge that spans domains. In this way, they make that knowledge reusable.

Structure of Ontology. Concept ontologies share many structural similarities, regardless of the language in which they are expressed. In particular, most ontologies describe individuals (objects, instances), concepts (classes), attributes (properties), and relations (see, e.g., [53, 54, 105, 107]). Individuals (objects, instances) are the basic, “ground level” components of an ontology. They may include concrete objects such as people, animals, tables, automobiles, and planets, as well as abstract individuals such as numbers and words. Concepts (classes) are abstract groups, sets, or collections of objects. They may contain individuals or other concepts. Some examples of concepts are vehicle (the class of all vehicles), patient (the class of all patients), influenza (the class of all patients suffering from influenza), player (the class of all players), and team (the class of all players from some team). Objects belonging to concepts in an ontology can be described by assigning attributes to them. Each attribute has at least a name and a value, and is used to store information that is specific to the object the attribute is attached to. For example, an object from the concept participant (see the ontology in Fig. 5²) has attributes such as first name, last name, address, and affiliation. If no attributes were defined for the concepts, we would have either a taxonomy (if concept relationships are described) or a controlled vocabulary. These are useful, but are not considered true ontologies. There are the following three types of relations between concepts in an ontology: a subsumption relation (written as the is-a relation), a meronymy relation (written as the part-of relation), and domain-specific relations. The first type is the subsumption relation (written as is-a). If a class A subsumes a class B, then any member of the class B is-a member of the class A. For example, the class participant subsumes the class author. This means that anything that is a member of the class author is also a member of the class participant (see the ontology in Fig. 5). Where A subsumes B, A is called the superclass, whereas B is the subclass. The subsumption relation is very similar to the notion of inheritance well known from object-oriented programming (see, e.g., [274, 275]).

² This example has been inspired by Jarrar (see [54]).
Fig. 5. The graph of a simple ontology (nodes: Person, Organizer, Organizing committee, Author, Participant, Reviewer, Paper, Program committee)

Such a relation can be used to create a hierarchy of concepts, typically with a maximally general concept like person at the top and more specific concepts like author or reviewer at the bottom. The hierarchy of concepts is usually visualized by the graph of the ontology (see Fig. 5), where every subsumption relation is represented by a thin solid line with an arrow in the direction from the superclass to the subclass. Another common type of relation is the meronymy relation (written as part-of), which represents how objects combine to form composite objects. For example, in the ontology from Fig. 5, we would say that any reviewer is-part-of the program committee. Every meronymy relation is represented graphically by a broken line with an arrow in the direction from the part to the composite object (see Fig. 5). From the technical point of view, this type of relation between ontology terms is represented with the help of attributes of the objects belonging to the concepts. It is done in such a way that the value of an attribute of an object u, which is to be a part of some object u′ belonging to a different concept, carries information about u′. Apart from the standard is-a and part-of relations, ontologies often include additional types of relations that further refine the semantics modeled by the ontologies. These relations are often domain-specific and are used to answer particular types of questions. For example, in the domain of conferences, we might define a written-by relation between the concepts paper and author, which tells us who the author of a paper is. In the same domain, we may also define a writes relation between the concepts author and paper, which tells us which papers have been written by each author. Every domain-specific relation is represented by
a thick solid line with an arrow. From the technical point of view, this type of relation between ontology concepts is also represented with the help of attributes of the objects belonging to the concepts. In this paper, we use many ontologies, constructed on the basis of domain knowledge concerning the analyzed data sets, to approximate complex concepts. In these ontologies all the types of relations mentioned above occur. However, the relations of the individual types do not occur in these ontologies simultaneously; in each ontology only one type of relation occurs. The reason for this is the fact that the individual relation types serve to approximate different types of complex concepts. For example, relations of the type is-a occur in the ontology in Fig. 6, which is an example of an ontology used to approximate spatial concepts (see Section 5). Ontologies showing dependencies between temporal concepts for structured objects and temporal concepts for the constituent parts of these objects (used to approximate temporal concepts for structured objects) are examples of ontologies in which relations of the type part-of occur (see Section 6). On the other hand, domain-specific relations occur in the numerous examples of behavior graphs presented in Section 6 and are used to approximate behavioral patterns. The planning graphs presented in Section 7 are also examples of ontologies in which domain-specific relations occur. Incidentally, planning graphs are, in a way, ontologies even more complex than those mentioned above, because two types of concepts occur in them simultaneously, namely, concepts representing states of complex objects and concepts representing actions performed on complex objects. Obviously, there are many ways of linking the ontologies mentioned above, provided they concern the same domain. For example, an ontology describing the behavior graph of a group of vehicles may be linked with ontologies describing dependencies between temporal concepts for such groups of vehicles and temporal concepts describing the behavior of individual vehicles or changes in the relationships among these vehicles. In such an ontology, relations of two types would then occur simultaneously, that is, domain-specific and part-of relations. Although such ways of linking different ontologies are not essential for the complex concept approximation methods presented in this paper, they cause a significant increase in the complexity of the ontologies examined.

General Recommendations Concerning Building of an Ontology. Currently, there are many papers which describe the experience of various designer groups obtained in the process of ontology construction (see, e.g., [276]). Although they do not yet constitute formal frames enabling the creation of an integral methodology, general recommendations on how to create an ontology may be formed on their basis. Each project connected with the creation of an ontology has the following phases:
– Motivation for creating an ontology.
– Definition of the ontology range.
– Ontology building.
  • Building of a lexicon.
  • Identification of concepts.
  • Building of the concept structure.
  • Modeling relations in the ontology.
– The evaluation of the obtained ontology.
– Ontology implementation.

Motivation for creating an ontology is an initial process resulting from a need, arising inside a certain organization, to change the existing ontology or to create a new one. At this stage, clarity about the aim for which the ontology is built is extremely crucial for the whole further process. This is also the moment when the potential sources of knowledge needed for the ontology construction should be defined. These sources may be divided into two groups: those requiring human engagement (e.g., interviews, discussions) and those in which a human does not appear as a knowledge source (e.g., documents, dictionaries and publications from the modeled domain, intranet and Internet resources, and other ontologies). By the ontology range we understand that part of the real world which should be included in the model under creation in the form of concepts and relations among them. One of the easier, and at the same time very effective, ways to determine the ontology range accurately is to use the so-called “competency questions” (see, e.g., [277]). The starting point for this method is defining a list of questions which the database built on the basis of the ontology under construction should be able to answer. Having defined the range, the process of ontology building can be started. The first step in ontology building is defining a list of expressions, phrases, and terms crucial for the given domain and the specific context of application. From this list a lexicon should be composed, that is, a dictionary containing the terms used by the ontology together with their definitions. The lexicon is a starting point for the most difficult stage in ontology building, that is, the construction of the concepts (classes) of the ontology and of the relations among these concepts. It should be remembered that it is not possible to perform these two activities one after the other; they have to be performed in parallel. We should also bear in mind that each relation is itself a concept. Thus, finding the answer to the question What should constitute a concept and what should constitute a relation? is not easy and depends on the target application and, often, on the designer's experience. As far as building hierarchies of classes is concerned, three approaches to building such a hierarchy are given in [278]:
1. Top-down. We start with a concept superior to all concepts included in the knowledge base and we arrive at the next levels of inferior concepts by applying atomization.
2. Bottom-up. We start with the most inferior concepts contained in the knowledge base and we arrive at the concepts on higher levels of the hierarchy by applying generalization.
3. Middle-out. We start with the concepts which are the most crucial in terms of the project and we perform atomization or generalization as needed.
In order to evaluate the obtained ontology, it should be checked whether the ontology possesses the following qualities ([277]):
– Consistency. The ontology is internally consistent, that is, contradictory conclusions cannot be drawn from it.
– Completeness. The ontology is complete if all expected elements (concepts, relations, etc.) are included in the model.
– Conciseness. All information gathered in the ontology is concise and accurate.
– The possibility of answering the “competency questions” posed previously.

Summing up, ontology building is a laborious process requiring a huge amount of knowledge concerning the modeling process itself, the tools used, and the domain being modeled.

Ontology Applications. Practical ontology applications relate to so-called general ontologies, which have a rather general character and may be applied in building knowledge bases for different domains, and to domain ontologies, that is, ontologies describing knowledge about a specific domain or a specific fragment of the real world. Many such ontologies have been worked out and they are often available on the Internet, e.g., Dublin Core (see [279]), GFO (General Formal Ontology [280]), OpenCyc/ResearchCyc (see [281]), SUMO (Suggested Upper Merged Ontology [282]), WordNet (see [283]), DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering [284]), and others. Generally, ontologies are applied when the semantics of the gathered data is crucial. It turns out that such a situation occurs quite often, particularly when intelligent methods of data analysis are supposed to act effectively. That is why ontologies are becoming more and more useful in information technology projects. Some examples of applications of ontologies are e-commerce, bioinformatics, geographical information systems, regulatory and legal information systems, digital libraries, e-learning, agent technology, database design and integration, software engineering, natural language processing, information access and retrieval, the Semantic Web, Web services, and medicine (see, e.g., [53] and [54] for more details).

Computer Systems for Creating and Using Ontologies. There is a series of formal languages for representing ontologies, such as the Web Ontology Language (OWL [107]), the Resource Description Framework (RDF [285]), the Ontology Inference Layer (OIL [286]), the DARPA Agent Markup Language (DAML [287]), CycL (see [288]), etc. However, the most dynamically developed one is OWL, which came into existence as an improvement of the DAML, OIL, and RDF languages. There are also many computer systems for creating and using ontologies, e.g., Cyc (see [288]), OpenCyc (see [289]), Protege (see [290]), OntoStudio (previously OntoEdit [291]), Ontolingua (see [292]), Chimaera (see [293]), OilEd (see [294]), and others. Within these systems, the ontology is usually created using convenient graphical tools which make it possible to enter all the elements of the ontology as well as to edit and visualize them further. Ontological systems very often possess mechanisms for reasoning on the basis of the constructed ontology. These mechanisms work in such a way that, after creating an ontology, the system may be asked quite complex questions. They concern
the existence of instances of a concept which satisfy certain logical conditions defined using the concepts, attributes, and relations occurring in the ontology. For instance, in the ontology in Fig. 5, we could pose the following questions:
– Who is the author of a given paper?
– Which papers have been reviewed by a given reviewer?
– Which persons belong to the program committee?
From the technical point of view, information searching based on an ontology is performed with the help of queries formed in a formal language used to represent the ontology or in a special extension of such a language. For instance, the language RDQL (RDF Data Query Language [295]) is a query language, similar to SQL, extending the RDF language. Usually, ontological systems also make it possible to form queries using a graphical interface (see, e.g., [290, 291]).
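To make the three relation types and this query mechanism concrete, the following minimal sketch models the conference ontology of Fig. 5 in Python. All identifiers, paper names, and the dictionary-based encoding are illustrative assumptions of ours; they do not come from the ontology systems or query languages cited above.

```python
# A minimal, illustrative model of the Fig. 5 conference ontology.
# Relation types: is-a (subsumption), part-of (meronymy), domain-specific.

IS_A = {            # subclass -> superclass
    "author": "participant",
    "reviewer": "participant",
    "organizer": "person",
    "participant": "person",
}
PART_OF = {         # part -> composite object
    "reviewer": "program committee",
    "organizer": "organizing committee",
}
WRITTEN_BY = {      # domain-specific relation: paper -> set of authors
    "paper-17": {"Smith", "Jones"},
    "paper-42": {"Kowalski"},
}

def superclasses(concept: str) -> list[str]:
    """All concepts subsuming `concept`, following is-a edges upward."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain

def authors_of(paper: str) -> set[str]:
    """Answer the competency question 'Who is the author of a given paper?'."""
    return WRITTEN_BY.get(paper, set())

# Every author is (transitively) a person:
assert "person" in superclasses("author")
print(authors_of("paper-17"))  # {'Smith', 'Jones'}
```

Real ontological systems, of course, answer such questions by evaluating queries over a formally represented ontology rather than over ad hoc dictionaries; the sketch only mirrors the structure of the answers.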

4.2 Motivations for Approximation of Concepts and Relations from Ontology

In current systems operating on the basis of ontologies, it is assumed that we possess complete information about concepts, that is, for each concept all objects belonging to this concept are known. Under this assumption, in order to examine the membership of an object in a concept, it is enough to check whether this object occurs as an instance of this concept or not. Meanwhile, in practical applications we often possess only incomplete information about concepts, that is, for each concept only certain sets of objects constituting examples and counterexamples, respectively, are given. This causes the necessity of approximating concepts with the help of classifiers. For instance, using the ontology in Fig. 6, which concerns safe vehicle driving on a road, it cannot be assumed that all instances of the concepts of this ontology are available. For example, for the concept safe driving, it cannot be assumed that information about all possible cars driving safely is available. That is why, for such a concept, a classifier is constructed which is expected to be able to classify examples of vehicles into those belonging and those not belonging to the concept. Apart from that, the relations between concepts defined in current ontology-based systems are usually precise (exact, crisp). For example, for the relation is-a in the ontology from Fig. 5, if the relation between the concepts author and participant is to be precise (exact, crisp), then each author of a paper at a conference is a participant of this conference. In practice, however, it does not always have to be that way. It is possible that some authors of papers are not conference participants, particularly in the case of articles having many coauthors. So, a relation between concepts can be imprecise (inexact, vague). Besides, in classical ontology-based systems, when we possess complete information about concepts, the problem of vagueness of the above relation may be solved by adding to the ontology an additional concept representing those authors who are not conference participants and binding this new concept with the concept person by the is-a relation. However, in practical applications, when the available information about concepts is not complete, we are not even able to check whether the relations under consideration are precise (exact, crisp).
Fig. 6. An ontology for safe driving (concepts: Safe driving, Safe overtaking, Forcing the right of way, Safe distance from the front vehicle, Safe distance from the opposite vehicle during overtaking, Possibility of going back to the right lane, Possibility of safe stopping before the crossroad)

That is why relations among concepts also require approximation. In the approximation of concepts occurring in an ontology, the following problem often appears. In practical applications, usually only so-called sensor data are available (that is, data obtained by measurement using sensors, and thus obtained on a low level of abstraction). For example, by observing a situation on a road, data such as speed, acceleration, location, and the current driving lane may be obtained. Meanwhile, some concepts occurring in an ontology are so complex that they are separated by a considerable semantical distance from the sensor data, i.e., they are defined and interpreted on very different levels of abstraction. Hence, approximation of such concepts using sensor data alone does not lead to classifiers of satisfactory quality (see, e.g., [42, 44, 45, 46, 48]). For instance, in the ontology from Fig. 6, such a complex concept is without a doubt the concept safe driving, because it is not possible to determine directly whether a given vehicle drives safely on the basis of simple sensor data only. If, however, apart from complex concepts there are simple concepts in the ontology, that is, concepts which may be approximated using sensor data, and they are directly or indirectly linked by relations to the complex concepts, then there appears a need to use the knowledge about the concepts and the relations among them to approximate the complex concepts more effectively. For example, in order to determine whether a given vehicle drives safely, other concepts from the ontology in Fig. 6, linked by relations to the concept safe driving, may be used. One of such concepts is the possibility of safe stopping before the crossroad. The aim of this paper is to present a set of methods for approximating complex spatio-temporal concepts and relations among them, assuming that the information about concepts and relations is given in the form of an ontology.
Fig. 7. The ontology for safe driving revisited (the concepts of Fig. 6 arranged hierarchically, with Safe driving at the top and block arrows leading from SENSOR DATA to the lowest-level concepts)

To meet these needs, by an ontology we understand a finite set of concepts forming a hierarchy, together with relations among these concepts which link concepts from different levels of the hierarchy. At the top of this hierarchy there is always the most complex concept, whose approximation we are interested in with a view to practical applications. For example, the ontology from Fig. 6 may be presented hierarchically as in Fig. 7. At the same time, we assume that the ontology specification contains incomplete information about the concepts and relations occurring in the ontology; in particular, for each concept, sets of objects constituting examples and counterexamples of this concept are given. Additionally, for concepts from the lowest level of the hierarchy (the sensor level), it is assumed that sensor attributes are also available which make it possible to approximate these concepts on the basis of the given examples and counterexamples. This fact is marked in Fig. 7 by block arrows.
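As a rough illustration of this layered setup, the sketch below trains one classifier per sensor-level concept and feeds their outputs, as new binary attributes, into a classifier for the top concept. The attribute meanings, the synthetic labels, and the use of decision trees are illustrative assumptions only; this is not the specific approximation method developed in this paper.

```python
# Layered approximation: sensor-level concepts first, then the top concept
# built over the outputs of the sensor-level classifiers (illustrative sketch).
from sklearn.tree import DecisionTreeClassifier
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 4))  # synthetic sensor readings (speed, distance, ...)

# Hypothetical sensor-level concepts with expert labels (synthetic rules here)
y_safe_distance = (X[:, 1] > 0.3).astype(int)
y_safe_stopping = (X[:, 0] < 0.7).astype(int)

low1 = DecisionTreeClassifier().fit(X, y_safe_distance)
low2 = DecisionTreeClassifier().fit(X, y_safe_stopping)

# Outputs of the lower-level classifiers become attributes for the top concept
Z = np.column_stack([low1.predict(X), low2.predict(X)])
y_safe_driving = y_safe_distance & y_safe_stopping  # expert labels, synthetic here
top = DecisionTreeClassifier().fit(Z, y_safe_driving)

z_new = np.column_stack([low1.predict(X[:5]), low2.predict(X[:5])])
print(top.predict(z_new))  # membership in 'safe driving' for five vehicles
```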

4.3 Unstructured, Structured, and Complex Objects

Every concept mentioned in this paper is understood as a subset of a certain set called the universe. Elements of the universe are called objects, and they are interpreted as states, incidents, vehicles, processes, patients, illnesses, and sets or sequences of the entities mentioned previously. If such objects come from the real world, then their perception takes place by detecting their structure. Discovery
of relevant object structure for particular tasks is a complex problem strongly related to perception, which is usually understood as the process of acquiring, interpreting, selecting, and organizing sensory information (see, e.g., [45, 86, 145, 146, 147, 148, 149, 150, 151, 152, 296, 297, 298, 299]). Much interdisciplinary research has been conducted in this scope in the overlapping areas of such fields as cognitive science, psychology, neuroscience, and pattern recognition (see, e.g., [26, 27, 35, 300, 301, 302, 303]). The structure of objects is used to define compound patterns over objects with simple or structured structure. The construction of such compound patterns may be hierarchical. We search for patterns relevant for the approximation of complex concepts. Notice that, together with the granularity of patterns, one should consider the computational complexity of satisfiability testing for such patterns. The structure of the perceived objects may be more or less complex, because the objects may differ in complexity, with respect both to the degree of spatial complexity and to spatio-temporal complexity. Speaking about spatial complexity, we mean not only the fact that objects differ in features such as location, size, shape, color, and weight, but also that objects may consist of parts related to each other by dependencies (e.g., one may examine objects which are groups of objects in road traffic). Spatio-temporal complexity, in turn, results from the fact that the perception of objects may be extended over time (e.g., one may examine objects which are single vehicles observed at a single time point, and objects which are also single vehicles but observed over a certain period of time). Both of these aspects of object complexity may accumulate, which additionally increases the diversity of the objects appearing (e.g., objects which are vehicle groups observed over a certain period of time are more complex than both objects which are vehicle groups observed at a single time point and objects which are single vehicles observed over a certain period of time). In practice, however, the perception of objects always takes place at an established level of detail. This means that, depending on the needs, during the perception of objects only those details concerning their structure are taken into account that are necessary to conduct effective reasoning about the perceived objects. For example, if we want to identify vehicles driven dangerously on the road, then we are not interested in the internal construction of each vehicle but rather in the behavior of each vehicle as a certain whole. Hence, in this paper, we examine objects of two types. The first type are unstructured objects, that is, objects which may be treated as indivisible wholes. We deal with this type of objects when we analyze patients, bank clients, or vehicles using their parameters observed at a single time point. The second type of objects occurring in practical applications are structured objects, which cannot be treated as indivisible wholes and are often registered over some period. Examples of this type of objects are a group of vehicles driving on a highway, a set of illnesses occurring in a patient, or a robot team performing a task.
In terms of spatiality, structured objects often consist of disjoint parts which are objects of uniform structure connected by dependencies. In general, however, the construction of structured objects is hierarchical, that is, their parts may themselves be structured objects. Additionally, the great spatial complexity of structured objects means that conducting effective reasoning about these objects usually requires their observation over a certain period of time. Thus, the hierarchy of such objects' structure may concern not only their spatial but also their spatio-temporal structure. For example, to observe simple behaviors of a single vehicle (e.g., a speed increase, a slight turn towards the left lane), it is sufficient to observe the vehicle over a short period of time, whereas to recognize more complex behaviors of a single vehicle (e.g., acceleration, changing lanes from the right to the left one), the vehicle should be observed for a longer period of time; at the same time, repeated observation of the above-mentioned simple behaviors may be extremely helpful here (e.g., if over a certain period the vehicle increased speed repeatedly, it means that this vehicle is probably accelerating). Finally, observation of the behavior of a vehicle group requires observation for an even longer period of time. This is because the behavior of a vehicle group is usually the aggregation or consequence of the behaviors of the vehicles which belong to the group (e.g., observation of an overtaking maneuver of one vehicle by another requires following specific behaviors of both the overtaking and the overtaken vehicle for a certain period of time). Obviously, each structured object may usually also be treated as an unstructured object. If we treat an object as unstructured at a given moment, it means that its internal structure does not interest us from the point of view of the decision problems considered. On the other hand, it is extremely difficult to find real-life unstructured objects, that is, objects without parts. In the real world, almost every object has some kind of internal structure and consists of certain spatial, temporal, or spatio-temporal parts. In particular, objects which are examples and counterexamples of complex concepts (both spatial and spatio-temporal), being more or less semantically distant from sensor data, have a complex structure. Therefore, one can say that they are complex objects. That is why the division of complex objects into unstructured and structured ones is of a symbolic character only and depends on the interpretation of these objects. If we are interested in their internal structure, then we treat them as structured objects; otherwise, we treat them as unstructured ones.
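Although this distinction is conceptual, it has a direct data-modeling reading. The sketch below (Python; all names invented for illustration) renders an unstructured object as a flat record and a structured object as parts plus relations among them; this is only one of many possible encodings.

```python
# Unstructured vs. structured objects (illustrative names only).
from dataclasses import dataclass, field

@dataclass
class Vehicle:                       # unstructured: an indivisible whole
    speed: float
    lane: str

@dataclass
class VehicleGroup:                  # structured: parts + relations among them
    parts: list                      # e.g., Vehicle snapshots or sub-groups
    relations: dict = field(default_factory=dict)  # name -> set of index pairs

group = VehicleGroup(
    parts=[Vehicle(60.0, "right"), Vehicle(85.0, "left")],
    relations={"overtakes": {(1, 0)}},   # part 1 is overtaking part 0
)
print(group.relations["overtakes"])
```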

4.4 Representation of Complex Object Collections

If complex objects are gathered into a collection, then in order to represent the available information about these objects, one may use information systems. Below, we present an example of such an information system, whose objects are vehicles and whose attributes describe the parameters of a vehicle recorded at a given time point.

Example 1. Let us consider an information system A = (U, A) such that A = {x, y, l, v, t, id}. Each object of this system represents the condition of a considered
vehicle at one time moment. The attributes x and y provide the current location of a vehicle, and the attributes l and v provide the current traffic lane on which the vehicle is driving and the current vehicle speed, respectively. The attribute t represents time as the number of seconds which have passed since the first observation of the vehicle (Vt is a subset of the set of positive integers). The attribute id provides identifiers of vehicles.

The second, extremely crucial example of an information system used in this paper is an information system whose objects represent patient conditions at different time points.

Example 2. Let us consider an information system A = (U, A) such that U = {u1, ..., un} and A = {a1, ..., am, at, aid}. Each object of this system represents the medical parameters of a certain patient during one day of his/her hospitalization. Attributes a1, ..., am describe medical parameters of the patient (examination results, diagnoses, treatments, medications, etc.), whereas the attribute at represents time as the number of days which have passed since the first observation of the patient (Vat is a subset of the set of positive integers). Finally, the attribute aid provides identifiers of patients.

As in the two examples above, the attributes of complex objects may be based on sensor data. In the general case, however, the properties of complex objects may be defined in languages constructed specifically for a given purpose (see Section 4.7).
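Operationally, such an information system is simply a table indexed by objects and attributes. The following sketch encodes a fragment of the vehicle system from Example 1 in Python; the attribute values are invented solely for illustration.

```python
# A tiny fragment of the information system from Example 1:
# each row is one vehicle condition at one time moment.
# Attribute values below are invented for illustration.
rows = [
    {"x": 120.5, "y": 34.2, "l": "right", "v": 62.0, "t": 0, "id": "car-1"},
    {"x": 137.8, "y": 34.6, "l": "right", "v": 65.5, "t": 1, "id": "car-1"},
    {"x": 118.0, "y": 30.1, "l": "left",  "v": 80.0, "t": 0, "id": "car-2"},
]

def attribute_value(u: int, a: str):
    """a(u): the value of attribute a on object u (here: a row index)."""
    return rows[u][a]

print(attribute_value(1, "v"))  # 65.5 -- speed of car-1 one second later
```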

4.5 Relational Structures

As we have written before, structured objects consist of parts which are structured objects of lesser complexity (a hierarchical structure) or unstructured objects connected by dependencies. Additionally, the great spatial complexity of structured objects means that conducting effective reasoning about these objects usually requires their observation over a certain period of time. Hence, there is a need to follow the spatio-temporal dependencies between the parts of complex objects. Therefore, an effective description of the structure of objects requires not only providing the spatial properties of the individual parts of these objects, but also describing the spatio-temporal relations between these parts. For this reason, in order to describe the structure of complex objects and the relations between complex objects, in this paper we use relational structures (see, e.g., [5, 89]). In order to define relational structures using the language and semantics of first-order logic, we assume that a set of relation symbols REL = {Ri : i ∈ I} and a set of function symbols FUN = {fj : j ∈ J} are given, where I, J are some finite sets (see, e.g., [89]). To every functional or relational symbol a natural number called the arity of the symbol is assigned. Functional and relational symbols of arity 0 are called constants. The set of constants is denoted by CONST. Symbols of arity 1 are called unary and symbols of arity 2 are called binary. In the case of binary relational or functional symbols we usually use the traditional infix notation; for instance, we write x ≤ y rather than ≤(x, y). The set of functional
and relational symbols together with their arities is called the signature. The interpretation of a functional symbol fi (a relational symbol Ri) over a set A is a function (a relation) defined over the set A, denoted by fi^A (Ri^A). The number of arguments of the function fi^A (the relation Ri^A) is equal to the arity of fi (Ri). Now, we can define the relational structure of a given signature (see, e.g., [5, 89]).

Definition 2 (A relational structure of a given signature). Let Σ = REL ∪ FUN be a signature, where REL = {Ri : i ∈ I} is a set of relation symbols and FUN = {fj : j ∈ J} is a set of function symbols, with I, J some finite sets.
1. A relational structure of signature Σ is a triple (D, R, F), where:
– D is a non-empty finite set called the domain of the relational structure,
– R = {R1^D, ..., Rk^D} is a finite (possibly empty) family of relations defined over D such that Ri^D corresponds to the symbol Ri ∈ REL and Ri^D ⊆ D^{ni}, where 0 < ni ≤ card(D) and ni is the arity of Ri, for i = 1, ..., k,
– F = {f1^D, ..., fl^D} is a finite (possibly empty) family of functions such that fj^D corresponds to the symbol fj ∈ FUN and fj^D : D^{mj} −→ D, where 0 ≤ mj ≤ card(D) and mj is the arity of fj, for j = 1, ..., l.
2. If for some f ∈ F we have f : D^0 −→ D, then we call such a function a constant and identify it with the single element of the set D corresponding to f.
3. If (D, R, F) is a relational structure and F is empty, then such a relational structure is called a pure relational structure and is denoted by (D, R).

A classical example of a relational structure is the set of real numbers with the operations of addition and multiplication and the ordering relation. A typical example of a pure relational structure is a directed graph, whose domain is the set of graph nodes and whose family of relations consists of one relation described by the set of graph edges. The example below illustrates how relational structures may be used to describe the spatial structure of a complex object.

Example 3. Let us examine the complex object which is perceived as the image in Fig. 8. In this image one may notice a group of six cars: A, B, C, D, E, F. In order to define the spatial structure of this car group, the most crucial thing is to define the location of the cars relative to each other and the diversity of the driving directions of the individual cars. Hence, the spatial structure of such a group may be described with the help of a relational structure (S, R), where:
– S = {A, B, C, D, E, F},
– R = {R1, R2, R3, R4}, where:
  • ∀(X, Y) ∈ S × S : (X, Y) ∈ R1 iff X is driving directly before Y,
  • ∀(X, Y) ∈ S × S : (X, Y) ∈ R2 iff X is driving directly behind Y,
  • ∀(X, Y) ∈ S × S : (X, Y) ∈ R3 iff X is coming from the opposite direction in comparison with Y,
  • ∀(X, Y) ∈ S × S : (X, Y) ∈ R4 iff X is driving in the same direction as Y.
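A pure relational structure such as the one in Example 3 can be encoded directly as a domain together with named relations stored as sets of tuples. The sketch below (Python) does this for the pairs listed in the text; it is an illustrative encoding, not a fragment of any particular system.

```python
# A pure relational structure (S, R) for the car group of Example 3.
# Each relation is a set of ordered pairs over the domain S.
S = {"A", "B", "C", "D", "E", "F"}
R1 = {("B", "A"), ("C", "B"), ("D", "C"), ("F", "E")}   # directly before
R2 = {(y, x) for (x, y) in R1}                          # directly behind (inverse of R1)
R3 = {("E", "C"), ("E", "B"), ("F", "A")}               # opposite direction
R4 = {("A", "C"), ("B", "D"), ("E", "F")}               # same direction

structure = {"domain": S, "relations": {"R1": R1, "R2": R2, "R3": R3, "R4": R4}}

def holds(rel: str, x: str, y: str) -> bool:
    """Check whether the pair (x, y) belongs to the relation named rel."""
    return (x, y) in structure["relations"][rel]

print(holds("R1", "B", "A"))  # True: B is driving directly before A
```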


Fig. 8. An example of spatial complex object (six cars A, B, C, D, E, F on a road)

Fig. 9. An example of spatio-temporal complex object (three frames F1, F2, F3, showing cars A, B, C, D; E, F, G, H; and I, J, K, L, respectively)

For instance, it is easy to see that (B, A), (C, B), (D, C), (F, E) ∈ R1; (A, B), (B, C), (C, D), (E, F) ∈ R2; (E, C), (E, B), (F, A) ∈ R3; and (A, C), (B, D), (E, F) ∈ R4. Complex objects may also have a spatio-temporal structure. The example below shows this type of structured object.

Example 4. Let us examine the complex object which is represented with the help of three images F1, F2, and F3 recorded at three consecutive time points (see Fig. 9). In image F1 one may notice cars A, B, C, and D, whereas in image
F2 we see cars E, F, G, and H. Finally, in image F3 we see cars I, J, K, and L (see Fig. 9). It is easy to notice that the pictures F1, F2, and F3 may be treated as three frames chosen from a certain film made, e.g., from an unmanned helicopter conducting road observation, each consecutive frame being distant in time from the previous one by about one second. Therefore, in all these pictures we see the same four cars; the first car is perceived as car A, E, or J, the second car is perceived as car B, F, or I, the third car is perceived as car C, G, or L, and the fourth car is perceived as car D, H, or K. The spatial structure of the complex object ST = {A, B, C, D, E, F, G, H, I, J, K, L} may be described with the help of a relational structure similar to the one in Example 3. However, the object ST has a spatio-temporal structure which should be reflected in the relational structure describing it. That is why, to the relation family R from Example 3 we add a relation Rt determined in the following way:

∀(X, Y) ∈ ST × ST : (X, Y) ∈ Rt iff X represents the same vehicle as Y and X was recorded earlier than Y.

For instance, it is easy to see that (A, E), (H, K) ∈ Rt, but (G, C), (I, F) ∉ Rt and (C, H), (F, K) ∉ Rt. Moreover, we modify the definition of the remaining relations from the family R:
– ∀(X, Y) ∈ ST × ST : (X, Y) ∈ R1 iff X, Y were noticed in the same frame and X is driving directly before Y,
– ∀(X, Y) ∈ ST × ST : (X, Y) ∈ R2 iff X, Y were noticed in the same frame and X is driving directly behind Y,
– ∀(X, Y) ∈ ST × ST : (X, Y) ∈ R3 iff X, Y were noticed in the same frame and X is coming from the opposite direction in comparison with Y,
– ∀(X, Y) ∈ ST × ST : (X, Y) ∈ R4 iff X, Y were noticed in the same frame and X is driving in the same direction as Y.
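Continuing the sketch given after Example 3, the temporal relation Rt can be derived mechanically once each car symbol is mapped to its vehicle and its frame; the mappings below follow the correspondences stated in the text, while the encoding itself is our illustrative assumption.

```python
# Deriving the temporal relation Rt of Example 4: (X, Y) is in Rt iff
# X and Y represent the same vehicle and X was recorded in an earlier frame.
# (The dict-union operator "|" requires Python 3.9+.)
vehicle = {"A": 1, "E": 1, "J": 1, "B": 2, "F": 2, "I": 2,
           "C": 3, "G": 3, "L": 3, "D": 4, "H": 4, "K": 4}
frame = {c: 1 for c in "ABCD"} | {c: 2 for c in "EFGH"} | {c: 3 for c in "IJKL"}

Rt = {(x, y) for x in vehicle for y in vehicle
      if vehicle[x] == vehicle[y] and frame[x] < frame[y]}

print(("A", "E") in Rt, ("H", "K") in Rt)  # True True
print(("G", "C") in Rt, ("C", "H") in Rt)  # False False
```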

If a set of complex objects is perceived as an unstructured object (its parts are not distinguished) and these objects belong to the object set of a certain information system, then the structure of such a set of complex objects is described by a relational structure which we call a trivial relational structure.

Definition 3. Let A = (U, A) be an information system. For any set of objects U′ ⊆ U we define a relational structure (Dom, R, F) such that Dom = {U′} and R and F are empty families. Such a relational structure is called a trivial relational structure.

The above trivial relational structures are used to approximate spatial concepts (see Section 5). In each collection of complex objects there may occur relations between the objects belonging to this collection. That is why each collection of complex objects may be treated as a complex object whose parts are the objects belonging to the collection. Hence, the structure of a complex object collection may be described using a relational structure whose domain elements are the objects belonging to this collection (see Section 4.7).

4.6 Languages and Property Systems

In this paper, we use many special languages to define features of complex objects. Any language L is understood as a set of formulas over a given finite alphabet and is constructed in the following way.
1. First, we define an alphabet of L, some atomic formulas, and their semantics by means of a satisfiability relation |=L. The satisfiability relation is a binary relation in X × L, where X denotes a universe of elements (objects). We write x |=L α to denote the fact that |=L holds for the pair (x, α) consisting of the object x and the formula α.
2. Next, we extend, in the standard way, the satisfiability relation |=L to Boolean combinations of atomic formulas, i.e., to the least set of formulas including the atomic formulas and closed with respect to the classical propositional connectives: disjunction (∨), conjunction (∧), and negation (¬), using the following rules:
(a) x |=L (α ∨ β) iff x |=L α or x |=L β,
(b) x |=L (α ∧ β) iff x |=L α and x |=L β,
(c) x |=L ¬α iff not (x |=L α),
where α, β are formulas, x is an object, and the symbol |=L denotes the satisfiability relation of the defined language.
3. Finally, for any formula α ∈ L, the set |α|L = {x ∈ X : x |=L α} can be constructed; it is called the meaning (semantics) of α in L.
Hence, in the sequel, when specifying languages and their semantics we only define the atomic formulas and their semantics, assuming that the extension to Boolean combinations is the standard one. Moreover, in definitions of the alphabets over which languages are constructed, we often omit listing parentheses, assuming that the relevant parentheses are always included in the alphabets. Besides, in modeling complex objects we often use structures called property systems.

Definition 4 (A property system). A property system is any triple P = (X, L, |=), where X is a set of objects, L is a language over a given finite alphabet, and |= ⊆ X × L is a satisfiability relation. We also use the following notation:
1. We write, if necessary, XP, LP, |=P instead of X, L, and |=, respectively.
2. |α|P = {x ∈ X : x |=P α} is the meaning (semantics) of α in P.
3. By aα, for α ∈ LP, we denote the function (attribute) from XP into {0, 1} defined by aα(x) = 1 iff x |=P α, for x ∈ XP.
4. Any property system P with a finite set of objects and a finite set of formulas defines an information system AP = (XP, A), where A = {aα}α∈L.

It is worthwhile mentioning that the definition of any information system A = (U, A) constructed in hierarchical modeling should start from the definition of the universe of objects of this information system. For this purpose, we select
a language in which a set U∗ of complex objects is defined, where U ⊆ U∗. For specifying the universe of objects of A, we construct a property system Q over the universe U∗ of already constructed objects. The language LQ consists of formulas which are used for specifying properties of the already constructed objects from U∗. To define the universe of objects of A, we select a formula α from LQ. Such a formula is called the type of the constructed information system A. Now, we assume that an object x belongs to the universe of A iff x satisfies (in Q) the formula α, i.e., x |=Q α, where x ∈ U∗. Observe that the universe of objects of A can be an extension of the set U, because U is usually only a sample of the possible objects of A. Notice that the type α selected for a constructed information system defines a binary attribute aα for this system. Certainly, this attribute can be used to define the universe of the information system A (see Section 4.7 for more details). Notice also that the property system Q is constructed using the property systems and information systems used in modeling the lower level of the concept hierarchy.
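A minimal executable reading of this machinery is sketched below, under the assumption that objects are attribute-value dictionaries and that formulas are represented as Python predicates; all function names are invented for illustration.

```python
# A property system P = (X, L, |=) in miniature: objects are dicts,
# atomic formulas are predicates, Boolean combinations are built functionally.
X = [{"v": 40}, {"v": 75}, {"v": 120}]

def atom(pred):            # an atomic formula with its satisfiability test
    return pred

def f_and(a, b): return lambda x: a(x) and b(x)
def f_or(a, b):  return lambda x: a(x) or b(x)
def f_not(a):    return lambda x: not a(x)

def meaning(alpha, objects):
    """|alpha| = {x in X : x |= alpha}."""
    return [x for x in objects if alpha(x)]

def binary_attribute(alpha):
    """a_alpha : X -> {0, 1}, the attribute induced by the formula alpha."""
    return lambda x: 1 if alpha(x) else 0

slow = atom(lambda x: x["v"] < 60)
fast = atom(lambda x: x["v"] > 100)
print(meaning(f_or(slow, fast), X))        # [{'v': 40}, {'v': 120}]
print(binary_attribute(f_not(slow))(X[0])) # 0
```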

4.7 Basic Languages for Defining Features of Complex Objects

As we have written before, the perception of each complex object coming from the real world takes place by detecting its structure (see Section 4.3), and the features of a given complex object may be determined only by establishing the features of this structure. The structures of complex objects which result from the perception of complex objects may be modeled with the help of relational structures (see Section 4.5). Therefore, by the features of complex objects represented by relational structures we understand the features of these structures. Each collection of complex objects K may be represented using an information system A = (U, A), where the object set U is equal to the collection K and the attributes from the set A describe the properties of the complex objects from the collection K, or more precisely, the properties of the relational structures representing the individual objects of this collection. In the simplest case, the attributes from the set A may be sensor attributes, that is, attributes representing the readings of sensors recorded for the objects from the set U (see Example 1 and Example 2 in Section 4.4). However, in the case of structured objects, whose properties usually cannot be described with the help of sensor attributes, the attributes from the set A may be defined with the use of the properties of these objects' parts, the relations between the parts, and information about the hierarchy of parts expressed, e.g., with the help of a concept ontology (see Section 4.10). In practice, apart from the properties of complex objects described above and represented using the attributes from the set A, other properties of complex objects are also possible, describing these objects on a slightly higher level of abstraction than the attributes from the set A. Such properties are usually defined by experts on the basis of domain knowledge and are often represented with the help of concepts, that is, attributes which have only two values. For the table in Example 1, e.g., “safe driving” could be such a concept.
By adding such an attribute, usually called a decision attribute or decision and denoted by d, to the information system, we obtain a decision table (U, A, d). However, effective approximation of a decision attribute d using attributes from the set A usually requires defining new attributes, which are often binary attributes representing concepts. Such concepts may be defined in an established language on the basis of the attributes available in the set A. In this paper, such a language is called a language for defining features of complex objects. In the simplest case such a language may be the language of mathematical formulas, in which formulas enabling the calculation of specific properties of a complex object are formed. For example, if the complex object is a certain subset of the set of rational numbers with addition and multiplication and the order relation, then the attributes of such a complex object may be the minimal value, the maximal value, or the arithmetic average over this set. In many cases, however, in order to define attributes of complex objects, special languages should be defined. In this paper, to define a specific language for describing complex object properties, Tarski's approach is used, which requires providing the language's alphabet, the set of language formulas, and the semantics of the language formulas (see, e.g., [304] and Section 4.6). For example, in order to define concepts describing new properties of objects from a given information system, a well-known language called the generalized descriptor language may be used (see, e.g., [16, 165]).

Definition 5 (A generalized descriptor language). Let A = (U, A) be an information system. A generalized descriptor language of the information system A (denoted by GDL(A), or GDL-language when A is fixed) is defined in the following way:
• the set ALGDL(A) = A ∪ ⋃a∈A Va ∪ {¬, ∨, ∧} is the alphabet of the language GDL(A),
• expressions of the form (a ∈ V), where a ∈ A and V ⊆ Va, are the atomic formulas of the language GDL(A).

Now, we determine the semantics of the language GDL(A). The formulas of the language GDL(A) may be treated as descriptions of objects occurring in the system A.

Definition 6. Let A = (U, A) be an information system. The satisfiability of an atomic formula φ = (a ∈ V) ∈ GDL(A) by an object u ∈ U from the table A (denoted by u |=GDL(A) φ) is defined in the following way:

u |=GDL(A) (a ∈ V) iff a(u) ∈ V.

We still need to answer the question of how the atomic formulas (expressions of the form a ∈ V) belonging to the set of formulas of the above language are defined. In the case of symbolic attributes, in practical applications the formulas of the form a ∈ V are usually defined using the relations “=” or “≠” (e.g., a = va or a ≠ va for some symbolic attribute a, where va ∈ Va). However, if the attribute a is
a numeric one, then the correct atomic formulas may be a < va, a ≤ va, a > va, or a ≥ va. Atomic formulas may also be defined using intervals, for example: a ∈ [v1, v2], a ∈ (v1, v2], a ∈ [v1, v2), or a ∈ (v1, v2), where v1, v2 ∈ Va. We present a few examples of formulas of the language GDL(A), where A = (U, A), A = {a1, a2, a3}, and v1 ∈ Va1, v2 ∈ Va2, and v3, v4 ∈ Va3:
– (a1 = v1) ∧ (a2 = v2) ∧ (a3 ∈ [v3, v4)),
– (a1 = v1) ∨ (a2 = v2),
– ((a1 = v1) ∨ (a2 = v2)) ∧ (a3 > v3),
– ¬((a1 = v1) ∧ (a3 ≤ v3)) ∨ ((a2 = v2) ∧ (a3 ∈ (v3, v4])).
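To see the satisfiability relation of Definition 6 at work, the following sketch evaluates generalized descriptors over objects represented as dictionaries; the attribute names and values are invented, and the functional encoding of formulas is merely one convenient choice.

```python
# Evaluating GDL atomic formulas (a in V) and their Boolean combinations
# over objects of an information system, represented here as dicts.
objects = [
    {"a1": "x", "a2": "p", "a3": 4},
    {"a1": "y", "a2": "q", "a3": 9},
]

def descriptor(a, V):
    """Atomic formula (a in V): u |= (a in V) iff a(u) in V."""
    return lambda u: u[a] in V

phi = descriptor("a1", {"x"})                      # (a1 = x)
psi = descriptor("a3", range(3, 7))                # (a3 in [3, 7))
conj = lambda u: phi(u) and psi(u)                 # their conjunction

print([u for u in objects if conj(u)])  # the meaning of the conjunction
```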

Another example of a language for defining complex object properties is a neighborhood language. In order to define the neighborhood language, a dissimilarity function on pairs of objects of an information system is needed.

Definition 7. Let A = (U, A) be an information system.
1. We call a function DISMA : U × U −→ [0, 1] the dissimilarity function of pairs of objects in the information system A if the following conditions are satisfied:
(a) for any pair (u1, u2) ∈ U × U : DISMA(u1, u2) = 0 ⇔ ∀a ∈ A : a(u1) = a(u2),
(b) for any pair (u1, u2) ∈ U × U : DISMA(u1, u2) = DISMA(u2, u1),
(c) for any u1, u2, u3 ∈ U : DISMA(u1, u3) ≤ DISMA(u1, u2) + DISMA(u2, u3).
2. For any u1, u2, u3, u4 ∈ U, if DISMA(u1, u2) < DISMA(u3, u4), then we say that the objects from the pair (u3, u4) are more different than the objects from the pair (u1, u2), relative to DISMA.
3. If u1, u2 ∈ U satisfy DISMA(u1, u2) = 0, then we say that the objects from the pair (u1, u2) are not different relative to DISMA, i.e., they are indiscernible relative to DISMA.
4. If u1, u2 ∈ U satisfy DISMA(u1, u2) = 1, then we say that the objects from the pair (u1, u2) are completely different relative to DISMA.

Let us notice that the above dissimilarity function is not a metric (distance) but a pseudometric. The reason is that the first metric condition is not satisfied; in the case of the DISMA function it would state that the distance between a pair of objects is equal to 0 if and only if they are the same object. This condition is not satisfied because of the possible existence of non-one-element abstraction classes of the indiscernibility relation IND(A), that is, because of the possible repetition of objects in the set U. We present an example of a dissimilarity function on pairs of objects of an information system.
Example 5. Let A = (U, A) be an information system, where A = {a1, ..., am} is a set of binary attributes. We define the dissimilarity function of pairs of objects in the following way:

∀(u1, u2) ∈ U × U : DISMA(u1, u2) = card({a ∈ A : a(u1) ≠ a(u2)}) / card(A).
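Under the assumption that objects are given as tuples of binary attribute values, the dissimilarity function of Example 5 can be computed directly, as the following sketch shows.

```python
# Normalized Hamming dissimilarity from Example 5:
# the fraction of attributes on which two objects differ.
def dism(u1, u2):
    assert len(u1) == len(u2), "objects must share the same attribute set"
    return sum(a != b for a, b in zip(u1, u2)) / len(u1)

u1 = (1, 0, 1, 1)
u2 = (1, 1, 1, 0)
print(dism(u1, u2))  # 0.5 -- the objects differ on 2 of 4 attributes
```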

Let us notice that the dissimilarity function defined above is based on the widely known measure introduced by Hamming, which expresses the dissimilarity of two sequences of the same length as the number of places (positions) on which these two sequences differ. Now, we can define the neighborhood language.

Definition 8 (A neighborhood language). Let A = (U, A) be an information system. A neighborhood language for the information system A (denoted by NL(A), or NL-language when A is fixed) is defined in the following way:
• the set ALNL(A) = U ∪ (0, 1] ∪ {¬, ∨, ∧} is the alphabet of the language NL(A),
• expressions of the form (u, ε), where u ∈ U and ε ∈ (0, 1], called neighborhoods of objects, are the atomic formulas of the language NL(A).

Now, we determine the semantics of the language NL(A). The formulas of the language NL(A) may be treated as descriptions of objects occurring in the system A.

Definition 9. Let A = (U, A) be an information system and let DISMA be a dissimilarity function on pairs of objects from the system A. The satisfiability of an atomic formula φ = (u0, ε) ∈ NL(A) by an object u ∈ U from the table A relative to the dissimilarity function DISMA (denoted by u |=NL(A) φ) is defined in the following way:

u |=NL(A) (u0, ε) ⇔ DISMA(u0, u) ≤ ε.

Each formula of the languages GDL or NL describes a certain set of objects, namely those which satisfy this formula (see Fig. 10). According to Definitions 5 and 8, the set of such objects is included in the set of objects U. It is worth noticing, however, that these formulas may also be satisfied by objects from outside the set U, that is, objects belonging to an extension of the set U (if we assume that attribute values for such objects can be obtained) (see Fig. 10). An explanation is needed regarding the issue of defining a dissimilarity function on pairs of objects of an information system. For information systems, many such functions may be defined applying various approaches; a review of such approaches may be found, e.g., in [162, 163, 164, 165, 166, 167, 168, 169, 170, 171]. However, the approaches known from the literature usually do not take into account the full specificity of a particular information system. That is why, in the general case, the dissimilarity function on pairs of objects should be defined by experts individually for each information system on the basis of domain knowledge. Such a definition may be given in the form of an arithmetical expression (see Example 5). Very often, however, experts in a given domain are not able to present such an expression and content themselves with presenting a set of examples of values of this function, that is, a set of pairs of objects labeled with the value of the dissimilarity function existing between these objects.
Fig. 10. The illustration of the meaning of a given formula (the meaning of the formula φ lies within U, the set of objects from the system A, which is itself contained in U∗, an extension of the set U)

In this last case, defining the dissimilarity function requires approximation with the help of classifiers. The classifier approximating the dissimilarity function is called a dissimilarity classifier of pairs of objects for an information system.

Definition 10. Let A = (U, A) be an information system (where A = {a1, ..., am}) and let DISMA be a given dissimilarity function of pairs of objects from the system A.
1. A dissimilarity function table for the system A relative to the dissimilarity function DISMA is a decision table AD = (UD, AD, d), where:
– UD ⊆ U × U,
– AD = {b1, ..., bm, bm+1, ..., b2m}, where the attributes from AD are defined in the following way:
∀u = (u1, u2) ∈ UD ∀bi ∈ AD : bi(u) = ai(u1) if i ≤ m, and bi(u) = ai−m(u2) otherwise,
– ∀u = (u1, u2) ∈ UD : d(u) = DISMA(u1, u2).
2. If AD = (UD, AD, d) is the dissimilarity function table for the system A, then any classifier for the table AD is called a dissimilarity classifier for the system A. Such a classifier is denoted by μDISMA.

Let us notice that the dissimilarity table of the information system A does not contain all possible pairs of objects of the system A, but only a certain chosen
subset of the set of these pairs. This limitation is necessary, for the number of pairs in the product U × U may be so large that the expert is not able to give the values of the decision attribute d for all of them. That is why the dissimilarity table usually contains only pairs chosen by the expert, which represent typical cases of determining the dissimilarity function and which may be generalized with the help of a classifier. The dissimilarity classifier may serve to determine the value of the dissimilarity function for pairs of objects from the information system. According to Definition 10, such pairs come from the set U × U, that is, they are pairs of objects from the given information system A. However, it should be stressed that the dissimilarity classifier may also determine the values of the dissimilarity function for pairs of objects which do not belong to the system A, that is, which belong to an extension of A. Hence, dissimilarity classifiers may be treated as a way to define concepts (new two-argument relations). The described approach to the measurement of dissimilarity is applied in this paper to the measurement of dissimilarity between objects in information systems (see Section 6.7 and Section 6.19), between states in planning graphs (see Section 7.9), and between plans (see Section 7.20).
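Read constructively, Definition 10 simply doubles the attribute set: the row for a pair (u1, u2) concatenates the attribute vectors of u1 and u2 and is labeled with the expert-given dissimilarity. The sketch below illustrates this construction; the expert labeling is simulated here by the function from Example 5, which in practice would be replaced by values elicited from domain experts.

```python
# Building a dissimilarity function table (Definition 10):
# the row for a pair (u1, u2) is b_1..b_m = attributes of u1,
# b_{m+1}..b_{2m} = attributes of u2, decision d = DISM_A(u1, u2).
def dissimilarity_table(pairs, expert_dism):
    table = []
    for u1, u2 in pairs:
        row = list(u1) + list(u2)          # b_1, ..., b_2m
        table.append((row, expert_dism(u1, u2)))
    return table

# Synthetic stand-in for expert labeling (here: Example 5's function).
expert = lambda u1, u2: sum(a != b for a, b in zip(u1, u2)) / len(u1)

pairs = [((1, 0, 1), (1, 1, 1)), ((0, 0, 1), (1, 1, 0))]
for row, d in dissimilarity_table(pairs, expert):
    print(row, "->", d)
# Any classifier trained on these rows is a dissimilarity classifier.
```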

4.8 Types of Complex Objects

In a given complex dynamical system there may occur many different complex objects. The collection of all such objects may be represented with the help of an information system, where the set of objects of this system corresponds to the objects of this collection and the attributes of this system describe the properties of the complex objects from the collection, or more precisely, the properties of the relational structures representing the individual objects of this collection. Such a system for a given complex dynamical system is called in this paper a total information system (TIS) for the given complex dynamical system. The attributes of the system TIS may be sensor attributes, or they may be defined in an established language which helps to express the properties of complex objects (see Section 4.7). To the attribute set of the system TIS one may add a binary decision attribute representing a concept describing an additional property of the complex objects. The decision attribute may be further approximated with the help of the attributes available in the system TIS (see Section 4.7). However, in practice the concepts which are examined are defined only on the set of complex objects of a certain type occurring in a given complex dynamical system. In the example concerning road traffic (see Example 1), such a concept may concern cars only (e.g., safe overtaking of one car by another), whereas in the example concerning patient treatment (see Example 2), the examined concepts may concern the treatment of infants only, and not of other people such as children, adults, or the elderly, whose treatment differs from the treatment of infants. Therefore, we need a mechanism which enables an appropriate selection of complex objects, or more precisely of the relational structures which they represent and in which we are interested at the moment. In other words, we need a method which enables selecting objects of a certain type from the system TIS.
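The following sketch previews this selection mechanism (cf. Example 6 below): a type attribute is computed from a defining formula and used to select a subsystem of the TIS. The attribute names, the UNKNOWN convention (introduced later in this section), and the data are all illustrative assumptions.

```python
# Selecting objects of a given type from a total information system (TIS)
# via a binary type attribute; cf. Example 6 below. The value UNKNOWN marks
# objects for which the defining formula cannot be evaluated.
YES, NO, UNKNOWN = "YES", "NO", "UNKNOWN"

tis = [
    {"a_age": 10,  "a_t": 2, "a_id": "p1"},   # hypothetical patient rows
    {"a_age": 400, "a_t": 1, "a_id": "p2"},
    {"a_t": 5, "a_id": "p3"},                 # a_age missing: type undecidable
]

def type_infant(u):
    """Type attribute for 'infant', defined by the formula (a_age <= 28)."""
    if "a_age" not in u:
        return UNKNOWN
    return YES if u["a_age"] <= 28 else NO

subsystem = [u for u in tis if type_infant(u) == YES]
print([u["a_id"] for u in subsystem])  # ['p1']
```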


In this paper, we propose a method of adding a binary attribute to the TIS to define a type of complex objects, more precisely, a type of the relational structures representing these objects. The value YES of such an attribute in a given row means that the row represents a complex object which is of the examined type, whereas the value NO means that the row represents a complex object which is not of the examined type. The attributes defining types may be defined with the help of attributes from the system TIS in the language GDL or in any other language in which the attributes from the system TIS were defined. The example below shows how the attributes defining the types of complex objects may be defined.

Example 6. Let us assume that in the children's ward of a certain hospital an information system A = (U, A) was applied to represent information about the patients' treatment, such that U = {u1, ..., un} and A = {a1, ..., am, aage, at, aid}. Each object of this system represents the medical parameters of a certain child in one day of his/her hospitalization. The attributes a1, ..., am describe medical parameters of the patient (examination results, diagnoses, treatments, medications, etc.), while the attribute aage represents the age of the patient (a number of days of life), the attribute at represents the value of a time unit (a number of days) which has elapsed since the first observation of the patient, and the attribute aid provides identifiers of patients. If the system A is treated as the total information system for a complex dynamical system understood as the set of all patients, then the "infant" type of patient (a child not older than 28 days), labeled Tinf, may be defined with the help of the formula (aage ≤ 28).

A slightly more difficult situation appears in the case of the information system from Example 1, when we want to define the passenger car type of object. A written description of the formula defining such a type may be as follows: the object is perceived as a rectangle whose length is two to five times bigger than its width, and the movement of the object takes place in the direction parallel to the longer side of the rectangle. It is easy to see that in order to define such a formula the information system from Example 1 would have to be complemented with sensor attributes determining the coordinates of the characteristic points of the object, needed for determining its size, shape and movement direction.

If we define in the system TIS an additional attribute determining the type of object, then we can select an information subsystem in which all objects have the same value of this attribute. Using a subsystem selected in such a way, one may analyze concepts concerning the established type of objects. Obviously, during the approximation of these concepts the attribute determining the type according to which the object selection was previously performed is useless, because its value is the same for all selected objects. Therefore, the attributes defining the type of object are not used to approximate concepts, but only for an initial selection of objects for the needs of concept approximation. In a given complex dynamical system very different complex objects may be observed. The diversity of objects may express itself both through the degree of spatial complexity and through spatio-temporal complexity (see Section 4.3). Therefore, in the general case it should be assumed that in order to


describe the properties of all complex objects occurring in a given dynamical system, many languages must be used. For instance, to describe the properties of a single vehicle at a single time point, the information obtained directly from the sensors (e.g., speed, location) is usually used; to describe the properties of a vehicle observed for a certain period of time (a time window), a language may be used which makes it possible to define the so-called temporal patterns observed in time windows (see Section 6.6); whereas in order to describe the properties of groups of vehicles, a language may be used which makes it possible to define temporal patterns observed in sequences of time windows (see Section 6.17). Moreover, it usually happens that not every one of these languages is appropriate for expressing the properties of all complex objects occurring in a given complex dynamical system. For example, applying the language of temporal patterns to determine the properties of a vehicle at a single time point is not feasible, because this language requires information about the vehicle collected in a whole time window, not at a single time point. Therefore, the approach to recognizing types of complex objects described above must be complemented. Namely, the attributes defining types of complex objects, apart from the values YES and NO mentioned before, may also take the value UNKNOWN. This value means that for a given complex object it is not possible to correctly compute the value of the attribute. Summarizing, if we examine complex objects from a certain complex dynamical system and claim that a given complex object u is a complex object of type T, then it means that in the total information system constructed for this system there exists an attribute aT which takes the value YES for the object u. One may also say that a given complex object u is not a complex object of type T, which means that the attribute aT corresponding to the type T takes the value NO for the object u. The attribute aT may also take the value UNKNOWN for the object u, which in practice also means that the object u is not of type T. A given complex object may be an object of many types, because there may exist many attributes identifying types in the TIS which take the value YES for this object. For example, in the information system from Example 6 the type of object Tr may be defined, described in words as a patient recently admitted to hospital (that is, admitted not earlier than three days ago), with the help of the formula (at ≤ 3). Then, an infant admitted to hospital for treatment two days ago is a patient of both type Tinf and type Tr. Finally, let us notice that the above approach to determining types of objects may be applied not only to complex objects which were observed at the moment of defining the formula determining the type, but also to those complex objects which appeared later, that is, belong to the extension of the system TIS. This results from the properties of the formulas of the language GDL which define the types of objects in the discussed approach.
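The following sketch (all names are ours) illustrates the mechanism just described: a type-defining attribute with the values YES, NO and UNKNOWN is added to a table of TIS objects, and the subsystem of objects of a given type is then selected; UNKNOWN is produced when the information needed to evaluate the type formula is not available for an object.

```python
# Illustrative sketch: type-defining attributes over a total information
# system (TIS), with the three values YES / NO / UNKNOWN.

YES, NO, UNKNOWN = "YES", "NO", "UNKNOWN"

def add_type_attribute(rows, name, formula):
    """rows: list of dicts, one per TIS object; formula: predicate over a row
    that raises KeyError when the required information is missing."""
    for row in rows:
        try:
            row[name] = YES if formula(row) else NO
        except KeyError:
            row[name] = UNKNOWN  # the attribute value cannot be computed
    return rows

def select_type(rows, name):
    # Select objects of the given type and drop the type attribute itself:
    # it is constant on the selection, hence useless for approximation.
    return [{k: v for k, v in r.items() if k != name}
            for r in rows if r.get(name) == YES]

# Example 6 in code: the "infant" type T_inf defined by (a_age <= 28).
patients = [{"a_age": 10, "a_t": 2}, {"a_age": 400, "a_t": 1}, {"a_t": 3}]
add_type_attribute(patients, "T_inf", lambda r: r["a_age"] <= 28)
infants = select_type(patients, "T_inf")  # only the first patient remains
```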

4.9 Patterns

If an attribute of a complex object collection is a binary attribute (i.e., it describes a certain concept), then a formula which makes it possible to determine its values is usually called


a pattern for the concept. Below, we present a pattern definition assuming that there is given a language L defining features of complex objects of a determined type, defined using Tarski's approach (see, e.g., [304]).

Definition 11 (A pattern). Let S be a collection of complex objects of a fixed type T. We assume that C ⊆ S is a concept and L is a language of formulas defining (under a given interpretation of L given by a satisfiability relation) features of complex objects from the collection S (i.e., subsets of S defined by formulas under the given interpretation).

1. A formula α ∈ L is called a pattern for the concept C expressed in the language L if there exists s ∈ S such that s ∈ C and s |=L α (s satisfies α in the language L).
2. If s |=L α, then we say that s matches the pattern α or s supports the pattern α. Otherwise, s does not match (does not support) the pattern α.
3. A pattern α ∈ L is called exact relative to the concept C when for any s ∈ S, if s |=L α then s ∈ C. Otherwise, the pattern α is called inexact.
4. The number support(α) = card(|α|L) is called the support of the pattern α.
5. The confidence of the pattern α relative to the concept C is denoted by confidenceC(α) and defined in the following way:

confidenceC(α) = card({s ∈ C : s |=L α}) / support(α).

Thus, patterns are a simple but convenient way of defining complex object properties, and they may be applied to the construction of information systems representing complex object collections. Despite the fact that, according to Definition 11, patterns are supposed to describe the properties of complex objects belonging to a given complex object collection S, they may also describe the properties of complex objects from outside of the collection S. However, these always have to be complex objects of the same type as the objects gathered in the collection S. Patterns may be defined by experts on the basis of domain knowledge. In such a case the expert must define the needed formula in a chosen language which makes it possible to test objects for their membership in the pattern. In the general case, patterns may also be approximated with the help of classifiers. In this case, the expert is required to give only examples of objects belonging to the pattern and counterexamples of objects not belonging to it. Then, however, attributes which may be used to approximate the pattern are needed. Sometimes one of the attributes in an information system representing a complex object collection is distinguished. For example, it may represent a concept distinguished by the expert which requires approximation using the rest of the attributes. Such an information system is then called a decision table (see Section 2.1).
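The support and confidence from Definition 11 translate directly into code; the sketch below is illustrative (the names are ours) and assumes a pattern is given as a predicate implementing the satisfiability relation s |=L α.

```python
# A direct reading of items 4 and 5 of Definition 11 (illustrative only).

def support(collection, alpha):
    """support(alpha) = card(|alpha|_L): the number of objects matching alpha."""
    return sum(1 for s in collection if alpha(s))

def confidence(collection, concept, alpha):
    """confidence_C(alpha) = card({s in C : s |= alpha}) / support(alpha);
    concept is a subset of the collection given as a set. Returns None when
    the support is zero (the confidence is then undefined)."""
    supp = support(collection, alpha)
    if supp == 0:
        return None
    return sum(1 for s in collection if s in concept and alpha(s)) / supp
```

Note that a pattern α is exact relative to C exactly when its confidence equals 1.0, and inexact otherwise.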


The decision table constructed for a complex object collection may be used to construct a classifier which provides an approximation of the distinguished decision attribute. The approximation may be performed with the help of classical classifiers (see Section 2) or stratifying classifiers (see Section 3). As we wrote before, the formulas of a language serving to define complex object properties may be satisfied by complex objects from outside a given collection of complex objects. Thus, any complex object of the same type as the complex objects from a given collection may be classified using the above mentioned classifier.

4.10 Approximation of Concepts from Ontology

The method of using ontology for the approximation of concepts presented in this section consists in approximating concepts from the higher level of an ontology using concepts from the lower levels. For the concepts from the lowest hierarchical level of the ontology (the sensor level), which do not depend on the rest of the concepts, it is assumed that there are also available so-called sensor attributes which make it possible to approximate these concepts on the basis of provided positive and negative examples of objects. Below, we present an example of concept approximation using sensor attributes in a certain ontology.

Example 7. Let us consider the ontology from Fig. 11. Each vehicle satisfying an established condition expressed in natural language belongs to some concept of this ontology. For example, to the concept Safe overtaking belong vehicles which overtake safely, while to the concept Possibility of safe stopping before the crossroads belong vehicles whose speed is small enough that they may safely stop before the crossroads. The concepts of the lowest ontology level, that is, Safe distance from the opposite vehicle during overtaking, Possibility of going back to the right lane, Possibility of safe stopping before the crossroads, Safe distance from the front vehicle and Forcing the right of way, are sensor concepts, that is, they may be approximated directly using sensor data. For instance, the concept Possibility of safe stopping before the crossroads may be approximated using such sensor attributes as vehicle speed, vehicle acceleration, distance to the crossroads, visibility and road humidity. On the higher levels of the ontology, however, sensor attributes may not be used directly to approximate concepts, because the semantical distance of the approximated concepts from the sensor attributes is too large: they are defined on different levels of abstraction. For example, if we wish to approximate the concept of safe driving on the higher level and on the sensor level we have at our disposal only attributes giving simple parameters of vehicle driving (that is, location, speed, acceleration, etc.), then it is hard to expect that these parameters make the approximation of such a complex concept as safe driving possible. That is why in this paper we propose a method of approximating a concept from the higher level of the ontology only with the help of concepts from the ontology level that is lower by one, which are closer to the concept under approximation


[Fig. 11. An ontology as a hierarchy of concepts for approximation. The concept Safe driving is at the top; below it are Forcing the right of way, Safe overtaking and Safe distance from the front vehicle; the lowest level consists of Safe distance from the opposite vehicle during overtaking, Possibility of going back to the right lane and Possibility of safe stopping before the crossroads; the whole hierarchy rests on sensor data.]

than the sensor data. The proposed approach to the approximation of concepts is based on the assumption that a concept from the higher ontology level is "not too far" semantically from the concepts lying on the lower level of the ontology. "Not too far" means that it can be expected that it is possible to approximate a concept from the higher level of the ontology using concepts from the lower level for which classifiers have already been built. The proposed method of approximating concepts of the higher ontology level is based on constructing a decision table for a concept on the higher ontology level whose objects represent positive and negative examples of the concept approximated on this level; at the same time, a stratifying classifier is constructed for this table. In this paper, such a table is called a concept approximation table of the higher ontology level concept. One of the main problems related to the construction of the concept approximation table is providing positive and negative examples of the approximated concept on the basis of data sets. It would seem that objects which are positive and negative examples of the lower ontology level concepts may be used at once (without any changes) for concept approximation on the higher ontology level. If this were possible, any ontology concept could be approximated using the positive and negative examples available from the data sets. However, in the general case, because of semantical differences between concepts and examples on different levels of the ontology, objects of the lower level cannot be directly used to approximate concepts of the


higher ontology level. For example, if on a higher level of a concept hierarchy we have a concept concerning a group of vehicles, and on a lower one concepts concerning single vehicles, then usually the properties of single vehicles (defined in order to approximate the concepts of the lower levels of the ontology) are not sufficient to describe the properties of the whole group of vehicles. Difficulties with the approximation of concepts on the higher ontology level with the help of object properties from the lower ontology level also appear when on the higher ontology level there are concepts concerning a different (e.g., longer) period of time than the concepts on the lower ontology level. For example, on the higher level we examine a concept concerning a time window (a certain time period), while on the lower level there are concepts concerning a certain instant, i.e., a time point (see Section 6). That is why in this paper we propose a method for constructing the objects of an approximation table of a concept from the higher ontology level (that is, positive and negative examples of this concept) by arranging sets of objects which are positive and negative examples of the lower ontology level concepts. These sets must be constructed in such a way that the properties of these sets, considered together with the relationships between their elements, can be used for the approximation of the higher ontology level concept. However, it should be stressed here that the complex objects mentioned above (being positive and negative examples of concepts from the higher and lower ontology levels) are only representations of real-life objects. In other words, we assume that the relational structures express the result of perception of real-life objects (see Section 4.5 and Fig. 12). Therefore, by the features of complex objects represented with

[Fig. 12. Real-life complex objects and representations of their structures: real-life complex objects are represented by relational structures.]


[Fig. 13. The general scheme for construction of the concept approximation table. The scheme proceeds bottom-up through the labeled stages: L1: an information system A = (U, A) whose objects are positive and negative examples of concepts from the lower ontology level; L2: definition of a family of relations over U, based on relational structures for the attributes from the set A and domain knowledge about them; L3: the global relational structure S = (U, R); L4: definition of a new set of objects represented by relational structures (ERS-language, domain knowledge about the extraction of relational structures); L5: the new set of objects represented by relational structures; L6: selection of features of relational structures and of the structures acceptable by constraints (FRS-language + constraint relation, domain knowledge about features of relational structures); L7: the RS-information system; L8: clustering of the objects of the RS-information system into a family of clusters of relational structures; L9: definition of the clusters by formulas of an ECRS-language, using domain knowledge about the extraction of clusters; L10: selection of features of clusters and of the clusters acceptable by constraints (FCRS-language + constraint relation, domain knowledge about features of clusters); L11: the CRS-information system; L12: adding the decision attribute, using domain knowledge about the concept C from the higher ontology level; L13: the concept approximation table for the concept C from the higher ontology level.]

relational structures we understand the features of these structures. Such features are defined using attributes from the information systems of the higher and lower ontology levels.


In Fig. 13, we illustrate the general scheme for construction of the concept approximation table for a given concept C depending in some ontology on concepts from the lower level (relative to the concept C). In the further part of this subsection, this scheme is explained in detail. As we have written before, in this paper we assume that for the concepts of the lower ontology level a collection of objects which are positive and negative examples of these concepts is available. Let us also assume that they are objects of a certain information system A = (U, A), where the attributes from the set A represent all available properties of these objects (see label L1 in Fig. 13). It should be stressed here that information about the membership degree of objects from the set U in the concepts from the lower ontology level may serve to define new attributes which are appended to the set A. However, providing such information for an arbitrarily chosen object (also for an object which will appear in the future) requires a previous approximation of the concepts of the lower level with the help of classical or stratifying classifiers. At this point, we assume that such classifiers have already been constructed for the concepts of the lower ontology level, while our aim is to approximate a concept of the higher ontology level. Incidentally, in the simplest case, the concepts of the lower ontology level may be approximated with the help of sensor attributes (see Example 7). Apart from attributes defined on the basis of the membership of objects in the concepts or in the layers of the concepts, there may be other attributes in the set A. For example, there may be an attribute identifying the recording time of the values of the remaining attributes from the set A for a given object from the set U, or an attribute unambiguously identifying individual objects or groups of objects from the set U. Objects being positive and negative examples of the lower ontology level concepts can very often be used to define new objects represented by relational structures, by using the available information about these objects. Relations defined in such structures may also be used to filter (extract) sets of objects or, in a more general case, sets of relational structures or their clusters as new objects for a higher level concept. Relations among objects may be defined on the basis of attributes from the information system A, with the use of relational structures defined on the value sets of the attributes from the set A (see label L2 in Fig. 13). For example, the value set Vat of the attribute at from Example 2 is a subset of the set of integer numbers. Therefore, it is a domain of a relational structure (Vat, {Rat}), where the relation Rat is defined in the following way:

∀(t1, t2) ∈ Vat × Vat : t1 Rat t2 ⇔ t1 ≤ t2.

The relation Rat may be, in a natural way, generalized to the relation Rt ⊆ U × U in the following way:

∀(u1, u2) ∈ U × U : u1 Rt u2 ⇔ at(u1) Rat at(u2).

Let us notice that the relation Rt orders in time the objects of the information system from Example 2. Moreover, it is also worthwhile mentioning that for any


pair of objects (u1, u2) ∈ U∗ × U∗, where U∗ is an extension of the set U (U ⊆ U∗), the relation Rt is also defined, provided that the attribute values for such objects can be obtained (see Fig. 10). Analogously, a relation ordering objects in time on the basis of the attribute t from the information system from Example 1 may be obtained. Obviously, relations defined on the basis of the attributes of an information system A are not always related to ordering objects in time. The example below illustrates how structural relations may be defined on the basis of the distance between objects.

Example 8. Let us consider an information system A = (U, A) whose object set U = {u1, ..., un} is a finite set of vehicles going from a town T1 to a town T2, and whose attribute set A contains two attributes d and v. The attribute d represents the distance of a given vehicle from the town T2, while the attribute v represents the speed of a given vehicle. The value sets of these attributes are subsets of the set of real numbers. Besides, the set Vd is a domain of the relational structure (Vd, {Rdε}), where the relation Rdε is defined in the following way:

∀(v1, v2) ∈ Vd × Vd : v1 Rdε v2 ⇔ |v1 − v2| ≤ ε,

where ε is a fixed real number greater than 0. The relation Rdε may be, in a natural way, generalized to the relation Rε ⊆ U × U in the following way:

∀(u1, u2) ∈ U × U : u1 Rε u2 ⇔ d(u1) Rdε d(u2).

As we see, a pair of vehicles belongs to the relation Rε when the vehicles are distant from each other by no more than ε. Therefore, we call the relation Rε the nearness relation of vehicles, and the parameter ε is called the nearness parameter of vehicles. The relation Rε may be defined for different values of ε. That is why in the general case the number of nearness relations is infinite. However, if it is assumed that the parameter ε takes values from a finite set (e.g., ε = 1, 2, ..., 100), then the number of nearness relations is finite. If Rε is a nearness relation defined on the set U × U (where ε > 0), then the set of vehicles U is a domain of the pure relational structure S = (U, {Rε}). Exemplary concepts characterizing the properties of individual vehicles may be high (average, low) speed of the vehicle or high (average, low) distance from the town T2. These concepts are defined by an expert and may be approximated on the basis of the sensor attributes d and v. However, more complex concepts may be defined which cannot be approximated with the help of these attributes. An example of such a concept is vehicle driving in a traffic jam. A traffic jam is defined as a number of vehicles blocking one another until they can scarcely move (see, e.g., [305]). It is easy to notice that on the basis of observation of the vehicle's membership in the above mentioned sensor concepts (concerning a single vehicle), and even observation of the values of the sensor attributes for a given vehicle, it is not possible to recognize whether the vehicle is driving in a traffic jam or not. It is necessary to examine the neighborhood of a given vehicle, more precisely, to check whether there are other vehicles right behind and before the examined one. Therefore, to approximate the concept vehicle driving


in a traffic jam we need a certain kind of vehicle grouping, which may be performed with the help of the above mentioned relation Rε (see Example 9). Let us add that in recognizing the vehicle's membership in the concept vehicle driving in a traffic jam, it is also important that the speed of the examined vehicle and the speeds of the vehicles in its neighborhood are available. However, to simplify the examples, in this subsection we assume that in recognizing the vehicle's membership in the concept vehicle driving in a traffic jam it is sufficient to check the presence of other vehicles in the neighborhood of a given vehicle, and considering the speed of these vehicles is not necessary. Thus, for a given information system A = (U, A) representing positive and negative examples of the lower ontology level concepts, a pure relational structure S = (U, R) may be defined (see label L3 in Fig. 13). Next, using the relations from the family R, a special language may be defined in which patterns are expressed which describe sets of objects (new concepts) for the needs of approximation of the higher ontology level concepts (see label L4 in Fig. 13). The extracted sets of objects of the lower level are usually also nontrivial relational structures, for the relations determined on the whole set of objects of the lower ontology level are in a natural way defined on the extracted sets. Time windows (see Section 6.4) or sequences of time windows (see Section 6.15) may be relational structures of this kind. In modeling, we use pure relational structures (without functions) over sets of objects extracted from the initial relational structures whose domains are sets of objects of the lower ontology level. The reason is that these structures are defined by extension of the relational structures defined on information about objects of the lower ontology level, and even if functions are defined in the latter structures, after the extension we obtain relations over objects rather than functions.

Example 9. Let us consider the information system A = (U, A) from Example 8. Let Rε be the nearness relation defined on the set U × U for a fixed ε > 0. Then, the vehicle set U is the domain of the relational structure S = (U, {Rε}), and the relation Rε may be used to extract relational structures from the structure S. In order to do this, we define the family of subsets F(S) of the set U in the following way:

F(S) = {Nε(u1), ..., Nε(un)}, where Nε(ui) = {u ∈ U : ui Rε u}, for i = 1, ..., n.

Let us notice that each set from the family F(S) is connected with one of the vehicles from the set U. Therefore, each of the sets from the family F(S) should be interpreted as the set of vehicles which are distant from the established vehicle u by no more than the established nearness parameter ε. In other words, each such set is the set of vehicles which are in the neighborhood of a given vehicle, with an established radius of the neighborhood area. For instance, if ε = 20 meters, then vehicles u3, u4, u5, u6 and u7 belong to the neighborhood of vehicle u5 (see Fig. 14). Finally, let us notice that each set N ∈ F(S) is a domain of the relational structure (N, {Rε}). Thus, we obtain a family of relational structures extracted from the structure S.
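Example 9 can be computed directly; the sketch below (with invented distances laid out to be consistent with Fig. 14) extracts the family F(S) of neighborhoods from the nearness relation Rε defined by the distance attribute d.

```python
# Illustrative sketch of Example 9: extracting the neighborhoods
# N_eps(u) = {u' in U : |d(u) - d(u')| <= eps} from the nearness relation.

def neighborhoods(d, eps):
    """d: dict vehicle -> distance from the town T2; eps: nearness parameter.
    Returns the family F(S) as a dict u -> N_eps(u)."""
    return {u: {v for v in d if abs(d[u] - d[v]) <= eps} for u in d}

# Invented positions consistent with the text: u3..u7 lie within 20 m of u5.
d = {"u1": 300, "u2": 200, "u3": 120, "u4": 110,
     "u5": 100, "u6": 90, "u7": 80, "u8": 10}
F_S = neighborhoods(d, eps=20)
assert F_S["u5"] == {"u3", "u4", "u5", "u6", "u7"}
```

Each value of the returned dictionary is the domain of one extracted relational structure (N, {Rε}).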


[Fig. 14. A vehicle and its neighborhood: vehicles u1, ..., u8 on the road; the neighborhood N(u5) of radius ε = 20 meters contains the vehicles u3, u4, u5, u6 and u7.]

The language in which, using the relational structures, we define formulas expressing extracted relational structures is called a language for extracting relational structures (ERS-language). The formulas of an ERS-language determine a type of relational structures, i.e., the relational structures which can appear in the constructed information system. These new relational structures represent the structure of more compound objects composed out of less compound ones. We call them extracted relational structures (see label L5 in Fig. 13). In this paper, we use the three following ERS-languages:

1. the language assigned to extract trivial relational structures such as those presented in Definition 3; this method of relational structure extraction is used in the construction of the concept approximation table using stratifying classifiers (see Section 5.2),
2. the ETW-language, assigned to extract relational structures which are time windows (see Section 6.4),
3. the ESTW-language, assigned to extract relational structures which are sequences of time windows (see Section 6.15).

However, the above mentioned process of extracting relational structures is carried out in order to approximate a concept of the higher ontology level with the help of lower ontology level concepts. Therefore, to extract relational structures it is necessary to use information about the membership of objects of the lower level in the concepts from this level. Such information may be made available for any tested object thanks to the application of previously created classifiers for the lower ontology level concepts (see Section 6.4 and Section 6.15). For relational structures extracted using an ERS-language, features (properties, attributes) may be defined using a specially constructed language that we call a language for defining features of relational structures (FRS-language) (see label L6 in Fig. 13). The FRS-language leads to an information system whose objects are the extracted relational structures and whose attributes are the features of these structures. Such a system will be called an information system of extracted relational structures (RS-information system) (see label L7 in Fig. 13). However, from the point of view of domain knowledge, not all objects (relational structures) extracted using an ERS-language are appropriate for the approximation of a given concept of the higher level of ontology. For instance, if we approximate the


concept of safe overtaking, it is reasonable to use objects representing examples of vehicles that are in the process of an overtaking maneuver, for using objects representing vehicles which are not in the process of an overtaking maneuver does not help to distinguish the pairs of vehicles which take part in a safe overtaking from the pairs of vehicles which overtake unsafely. For the above reason, that is, to eliminate objects which are unreal or unreasonable, the so-called constraints are defined, which are formulas defined on the basis of the object features used to create the attributes of the RS-information system. The constraints determine which objects may be used in order to obtain a concept example for the higher level and which cannot be used (see label L6 in Fig. 13). In this paper, constraints are represented by a constraint relation and are defined as formulas of the language GDL (see Definition 5) on the basis of the attributes appearing in the RS-information system. The example below illustrates how RS-information systems may be defined.

Example 10. Let us consider the information system A = (U, A), the relational structure S = (U, {Rε}) and the family F(S) extracted from the relational structure S (see Example 9). We construct an information system F = (F(S), {af, ab}), where for any neighborhood Nε(u) ∈ F(S) the value af(Nε(u)) is the number of vehicles in the neighborhood Nε(u) going in the right lane before the vehicle u, and ab(Nε(u)) is the number of vehicles in the neighborhood Nε(u) going in the right lane behind the vehicle u. Let us notice that these attributes were chosen in such a way that the objects from the information system F are relevant to approximate the concept vehicle driving in a traffic jam. For example, if ε = 20 meters and for the neighborhood of a vehicle u the values af = 2 and ab = 2, then the vehicle u is driving in a traffic jam (see vehicle u4 in Fig. 15), whereas if af = 0 and ab = 0, then the vehicle u is not driving in a traffic jam (see vehicle u7 in Fig. 15). For the system F we define the following formula: φ = ((af > 0) ∨ (ab > 0)) ∈ GDL(F). It is easy to notice that the formula φ is not satisfied only by neighborhoods related to vehicles which are definitely not driving in a traffic jam. Therefore, in terms of neighborhood classification to the concept driving in a traffic jam, these neighborhoods may be called trivial ones. Hence, the formula φ may be treated as a constraint formula which is used to eliminate the above mentioned trivial neighborhoods from F. After such a reduction we obtain an RS-information system whose set of objects is {N ∈ F(S) : N |=GDL(F) φ} and whose attributes are af and ab.
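Example 10 can be sketched as follows; the predicates before and behind are assumptions standing in for the sensor information about lane and position that the example presupposes, and the names are ours.

```python
# Illustrative sketch of Example 10: the RS-information system with the
# attributes a_f, a_b and the constraint phi = (a_f > 0) or (a_b > 0).

def rs_system(F_S, before, behind):
    """F_S: dict u -> neighborhood N_eps(u);
    before(u, v) / behind(u, v): True when v goes in the right lane
    before / behind the vehicle u (assumed to be available)."""
    return {u: {"a_f": sum(1 for v in N if v != u and before(u, v)),
                "a_b": sum(1 for v in N if v != u and behind(u, v))}
            for u, N in F_S.items()}

def apply_constraint(rows):
    # Keep only neighborhoods satisfying phi; the eliminated ones are the
    # trivial neighborhoods of vehicles definitely not driving in a jam.
    return {u: r for u, r in rows.items() if r["a_f"] > 0 or r["a_b"] > 0}
```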

Let us notice that the definition of attributes of extracted relational structures leads to a granulation of relational structures. For example, we obtain granules of relational structures defined by the indiscernibility relation determined by the new attributes. A question arises how to construct languages defining features of relational structures, particularly when it comes to the approximation of spatio-temporal concepts, that is, those whose recognition requires following the changes of complex objects over time. One of the more developed languages of this type is a temporal


[Fig. 15. Two vehicle neighborhoods: the neighborhood N(u4), in which the vehicle u4 has two vehicles before it and two behind it, and the neighborhood N(u7), which contains no vehicles other than u7.]

logic language. In the literature, many systems of temporal logics have been defined which offer many useful mechanisms (see, e.g., [183, 184, 185]). Therefore, in this paper, we use temporal logics to define our own languages describing features of relational structures. Especially interesting for us are the elements appearing in the definitions of temporal logics of linear time (e.g., Linear Temporal Logic) and of branching time (e.g., Branching Temporal Logic). Temporal logic of linear time assumes that time has a linear nature, that is, one without branches. In other words, it describes only one world in which every two events are sequentially ordered. In linear time logics the following four temporal operators are introduced: □, ♦, ○ and U. Generally speaking, these operators enable us to determine the satisfiability of temporal formulas in a certain time period. The operator □ (often also marked as G) determines the satisfiability of a formula at all instants (states) of the time period under observation. The operator ♦ (often marked as F) determines the satisfiability of a formula at least at one instant (state) of the time period under observation. The operator ○ (often marked as X) determines the satisfiability of a formula at the instant (state) right after the instant of reference. Finally, the operator U determines the satisfiability of a formula until another formula is satisfied. Therefore, linear time temporal logics may be used to express object properties which aggregate the behavior of complex objects observed over a certain period of linear time, e.g., features of time windows or features of temporal paths in behavioral graphs (see Section 6.6 and Section 6.17).
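To make the four linear-time operators tangible, the sketch below gives them a simplified finite-trace semantics, which is the reading relevant for features of time windows (finite sequences of states); the function names and state encoding are ours, and this is only an illustration, not how full temporal logic model checking is implemented.

```python
# Simplified finite-trace semantics of the linear-time operators
# (illustrative only). A trace is a list of states; p and q are predicates.

def always(trace, i, p):      # box / G: p holds at every state from i on
    return all(p(s) for s in trace[i:])

def eventually(trace, i, p):  # diamond / F: p holds at some state from i on
    return any(p(s) for s in trace[i:])

def next_state(trace, i, p):  # circle / X: p holds at the state right after i
    return i + 1 < len(trace) and p(trace[i + 1])

def until(trace, i, p, q):    # p U q: q eventually holds, p holds until then
    for j in range(i, len(trace)):
        if q(trace[j]):
            return True
        if not p(trace[j]):
            return False
    return False

# E.g., "the vehicle keeps accelerating until it enters the left lane":
# until(window, 0, lambda s: s["accelerating"], lambda s: s["left_lane"])
```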


Temporal logic of branching time, however, assumes that time has a branching nature, that is, at a given instant it may branch into parallel worlds representing various possible future states. In branching time logics two additional path operators, A and E, are introduced. They enable us to determine the satisfiability of temporal formulas for various variants of the future. The first operator means that the temporal formula before which it occurs is satisfied for all variants of the future; the second one means that the formula is satisfied for a certain variant of the future. The path operators combined with the three temporal operators G, F and X give six possible combinations: AG, AF, AX, EG, EF and EX. These combinations make it possible to describe multi-variant behaviors extended over time. Therefore, temporal logics of branching time may be used to express complex object properties that aggregate multi-variant behaviors of objects changing over time, e.g., features of clusters of time windows or features of clusters of temporal paths in behavioral graphs (see Section 6.8 and Section 6.19).

We assume that in extracted relational structures the time flow has a linear character. Therefore, languages using elements of temporal logics of linear time are applied to define their features. In this paper, we use the three following languages defining features of extracted relational structures:

1. the language assigned to define features of trivial relational structures such as in Definition 3; this method of defining features is applied together with the extraction of trivial relational structures (see Definition 3) and is based on using the features of objects taken from the information system as the features of the relational structures after extraction (the objects in a given information system and the elements of the domains of the relational structures extracted from this system are the same) (see Section 5.2),
2. the language FTW, using elements of a temporal logic language and assigned to define the properties of relational structures which are time windows (see Section 6.6),
3. the language FTP, also using elements of a temporal logic language and assigned to define the properties of relational structures which are paths in behavioral graphs (see Section 6.17).

However, the objects of RS-information systems are often not suitable for approximating concepts of the higher ontology level by their properties. This happens because the number of these objects is too large and their descriptions are too detailed. Hence, if they were applied to approximate a concept from the higher ontology level, the coverage of the constructed classifier would be too small, that is, the classifier could classify too few tested objects. Apart from that, a problem of computational complexity would appear: due to the large number of objects of such an information system, the number of objects in the concept approximation table for structured objects (see the further part of this subsection) would be too large to construct a classifier effectively. That is why a clustering of such objects is applied, leading to a family of object clusters (see label L8 in Fig. 13). The example below illustrates in a very simple way how clusters of relational structures may be defined.

Example 11. Let A = (U, A) be the RS-information system from Example 10, whose objects are vehicle neighborhoods. We are going to define clusters of these neighborhoods. For this purpose we propose a relation Rσ ⊆ U × U defined in the following way:

∀(u1, u2) ∈ U × U : u1 Rσ u2 ⇔ |af(u1) − af(u2)| ≤ σ ∧ |ab(u1) − ab(u2)| ≤ σ,

where σ is a fixed integer greater than 0. As we see, to the relation Rσ belong such pairs of vehicle neighborhoods which differ only slightly (by no more than σ) in the values of the attributes af and ab. Therefore, the relation Rσ is called the nearness relation of vehicle neighborhoods, and the parameter σ is called the nearness


parameter of vehicle neighborhoods. The relation Rσ may be defined for different values of σ. That is why in the general case the number of such nearness relations is infinite. However, if it is assumed that the parameter σ takes values from a finite set (e.g., σ = 1, 2, ..., 10), then the number of nearness relations is finite. Let Rσ be the nearness relation of neighborhoods determined for an established σ > 0. Then the set U of vehicle neighborhoods is the domain of a pure relational structure S = (U, {Rσ}). The relational structure S is the starting point for extracting clusters of vehicle neighborhoods. In order to do this, we define the family of subsets F(S) of the set U in the following way:

F(S) = {Nσ(u1), ..., Nσ(un)}, where Nσ(ui) = {u ∈ U : ui Rσ u}, for i = 1, ..., n.

Let us notice that each of the sets from the family F(S) is connected with one vehicle neighborhood from the set U. For any u ∈ U the set Nσ(u) will also be denoted briefly by u when no confusion arises. Moreover, these sets are interpreted as clusters of neighborhoods which are distant from the central neighborhood in the cluster by no more than the established nearness parameter. In other words, each such family is a cluster of vehicle neighborhoods which are close to a given neighborhood, with an established nearness parameter. For instance, if ε = 20 meters and σ = 1, then the neighborhoods Nε(u3), Nε(u5) and obviously the neighborhood Nε(u4) belong to the neighborhood cluster Nσ(u4) (see Fig. 16), whereas the neighborhood Nε(u7) does not belong to this cluster. Finally, let us notice that each set X ∈ F(S) is a domain of the relational structure (X, {Rσ}). Hence, we obtain a family of relational structures extracted from the structure S. The grouping of objects in an RS-information system may be performed using a language of extraction of clusters of relational structures chosen by an expert, which in this case is called a language for extracting clusters of relational structures (ECRS-language). The formulas of an ECRS-language express families of clusters of relational structures from the input RS-information system (see label L9 in Fig. 13). Such formulas can be treated as a type of clusters of relational structures which will create objects in a new information system. In an ECRS-language we may define a family of patterns corresponding to a family of expected clusters. In this paper, the two following ECRS-languages are used:

[Fig. 16. Four vehicle neighborhoods: the neighborhoods N(u3), N(u5) and, obviously, N(u4) belong to the cluster of N(u4), whereas the neighborhood N(u7) does not.]


1. the language ECTW, assigned to define clusters of relational structures which are families of time windows (see Section 6.8),
2. the language ECTP, assigned to define clusters of relational structures which are families of paths in behavioral graphs of complex objects (see Section 6.19).

For clusters of relational structures extracted in such a way, features may be defined using a specially constructed language that we call a language for defining features of clusters of relational structures (FCRS-language) (see label L10 in Fig. 13). A formula of this language is satisfied (or not satisfied) on a given cluster of relational structures if and only if it is satisfied for all relational structures from this cluster. The FCRS-language leads to an information system whose objects are the extracted clusters of relational structures and whose attributes are the features of these clusters (see label L11 in Fig. 13). Such an information system we call an information system of clusters of relational structures (CRS-information system). Similarly to the case of relational structures extracted using an ERS-language, not all objects (clusters of relational structures) extracted using an ECRS-language are appropriate for the approximation of a given concept of the higher level of ontology. Therefore, in this case we also define constraints, which are formulas defined on the basis of the object features used to create the attributes of the CRS-information system. Such constraints determine which objects may be used in order to obtain a concept example for the higher level and which cannot be used. The example below illustrates how CRS-information systems may be defined.

Example 12. Let F(S) be the family of clusters extracted from the relational structure S (see Example 11). One can construct an information system F = (F(S), {āf, āb}), where for any cluster u ∈ F(S) the values of the attributes āf(u) and āb(u) are computed as the arithmetical averages of the values of the attributes af and ab over the neighborhoods belonging to the cluster represented by u. These attributes were chosen in such a way that the objects of the system F are appropriate for the approximation of the concept vehicle driving in a traffic jam. For example, if ε = 20 meters, σ = 1 and the values āf(u) and āb(u) are close to 2, then the neighborhoods from the cluster represented by the object u contain vehicles which definitely drive in a traffic jam, whereas if āf(u) and āb(u) are close to 0, then the neighborhoods from the cluster represented by the object u contain vehicles which definitely do not drive in a traffic jam. For the system F we define the following formula: Φ = ((āf > 0.5) ∨ (āb > 0.5)) ∈ GDL(F). It is easy to notice that the formula Φ is not satisfied only by such clusters to which belong neighborhoods of vehicles definitely not driving in a traffic jam. Therefore, in terms of cluster classification to the concept driving in a traffic jam, these clusters may be called trivial ones. Hence, the formula Φ may be treated as a constraint formula which is used to eliminate the above mentioned trivial clusters from F.


After such a reduction we obtain a CRS-information system whose set of objects is {u ∈ F(S) : u |=GDL(F) Φ} and whose attributes are āf and āb.
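Examples 11 and 12 together can be sketched as follows; the code (names are ours) groups the neighborhoods of Example 10 into clusters with the nearness relation Rσ, describes each cluster by averaged attributes, and applies the constraint Φ.

```python
# Illustrative sketch of Examples 11 and 12: clusters of neighborhoods and
# the CRS-information system with averaged attributes.

def clusters(rs_rows, sigma):
    """rs_rows: dict u -> {"a_f": int, "a_b": int} (the RS-information system).
    Returns dict u -> N_sigma(u), the cluster centred at the neighborhood u."""
    def near(r1, r2):
        return (abs(r1["a_f"] - r2["a_f"]) <= sigma
                and abs(r1["a_b"] - r2["a_b"]) <= sigma)
    return {u: {v for v in rs_rows if near(rs_rows[u], rs_rows[v])}
            for u in rs_rows}

def crs_system(rs_rows, cluster_family):
    rows = {}
    for u, members in cluster_family.items():
        n = len(members)  # every cluster contains at least its own centre
        rows[u] = {"avg_f": sum(rs_rows[v]["a_f"] for v in members) / n,
                   "avg_b": sum(rs_rows[v]["a_b"] for v in members) / n}
    # The constraint Phi eliminates trivial clusters (vehicles definitely
    # not driving in a traffic jam).
    return {u: r for u, r in rows.items()
            if r["avg_f"] > 0.5 or r["avg_b"] > 0.5}
```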

Unlike in single relational structures, in clusters of relational structures the time flow has a branching character, because in various elements of a given cluster we observe various variants of the dynamically changing reality. Therefore, to define the properties of clusters of relational structures we use elements of the language of temporal logics of branching time. In this paper, we use the two following languages defining cluster properties:

1. the language FCTW, using elements of a temporal logic language and assigned to define the features of clusters which are families of time windows (see Section 6.8),
2. the language FCTP, also using elements of a temporal logic language and assigned to define the features of clusters which are families of temporal paths in behavioral graphs, that is, sub-graphs of behavioral graphs (see Section 6.19).

Finally, we assume that to each object acceptable by the constraints an expert adds a decision value determining whether the given object belongs to the approximated higher level concept or not (see label L12 in Fig. 13). After adding the decision attribute we obtain the concept approximation table for a concept from the higher ontology level (see label L13 in Fig. 13). The notion of the concept approximation table concerning a concept from the higher ontology level for an unstructured complex object may be generalized to the case of concept approximation for structured objects (that is, objects consisting of parts). Let us assume that the concept is defined for structured objects of a type T which consist of parts being complex objects of types T1, ..., Tk. In Fig. 17 we illustrate the general scheme for construction of the concept approximation table for such structured objects. We see that in order to construct a table for approximating a concept defined for structured objects of type T, CRS-information systems are constructed for all types of structured object parts, that is, the types T1, ..., Tk (see labels L3-1, ..., L3-k in Fig. 17). Next, these systems are joined in order to obtain a table approximating the concept of the higher ontology level determined for structured objects. The objects of this table are obtained by arranging (linking) all possible objects of the linked information systems (see label L4 in Fig. 17). From the mathematical point of view such an arrangement is a Cartesian product of the sets of objects of the linked information systems. However, from the point of view of domain knowledge not all object links belonging to such a Cartesian product are possible and reasonable (see [78, 84, 186, 187]). For instance, if we approximate the concept of overtaking, it is reasonable to arrange objects into such pairs of vehicles that drive close to each other. For the above reason, constraints are defined which are formulas defined on the basis of the properties of the arranged objects. The constraints determine which objects may be arranged in order to obtain a concept example for the higher level and which


[Fig. 17. The general scheme for construction of the concept approximation table for structured objects. Starting from the information system A = (U, A) of positive and negative examples of the lower ontology level concepts (label L1) and a family of relations defined over U (label L2), relational structures S1, ..., Sk are built whose domains are the sets of parts of the types T1, ..., Tk (labels L2-1, ..., L2-k), together with a relational structure Sk+1 whose domain is a Cartesian product of the parts of the types T1, ..., Tk (label L2-c). From each of these structures a CRS-information system of clusters of relational structures is extracted (labels L3-1, ..., L3-k and L3-c). Finally, the objects (clusters) from all CRS-information systems are linked, the arrangements of clusters acceptable by the constraint relation are selected, and the decision attribute is added (label L4), which yields the concept approximation table for the concept defined for structured objects of the type T (label L5).]

cannot be arranged. Additionally, we assume that to each arrangement of objects acceptable by the constraints an expert adds a decision value determining whether the given arrangement belongs to the approximated higher level concept or not (see label L4 in Fig. 17). A table constructed in such a way serves the approximation of a concept determined on a set of structured objects (see label L5 in Fig. 17). However, it frequently happens that in order to describe a structured object, apart from describing all parts of this object, the relations between the parts of this object should be described. Therefore, in constructing a table of concept approximation for structured objects, an additional CRS-information system is constructed whose attributes describe the whole structured object in terms of the relations between the parts of this object (see label L3-c in Fig. 17). In the approximation of a concept concerning structured objects, this system is


arranged together with the other CRS-information systems constructed for the individual parts of the structured objects (see label L4 in Fig. 17). Similarly to the case of the concept approximation table for unstructured objects, the constraint relation is usually defined as a formula in the language GDL (see Definition 5) on the basis of the attributes appearing in the obtained table. However, the constraint relation may also be approximated using classifiers. In such a case, providing examples of objects belonging and not belonging to the constraint relation is required (see, e.g., [78]). The construction of a specific approximation table of a higher ontology level concept requires defining all the elements appearing in Figs. 13 and 17. A fundamental problem connected with the construction of an approximation table of the higher ontology level concept is, therefore, the choice of the four appropriate languages used during its construction. The first language serves the purpose of defining patterns in a set of lower ontology level concept examples which enable the relational structure extraction. The second one enables defining the features of these structures. The third one enables defining clusters of relational structures, and finally the fourth one the properties of these clusters. All these languages must be defined in such a way as to make the properties of the created clusters of relational structures useful on the higher ontology level for the approximation of the concept occurring there. Moreover, in the case when the approximated concept concerns structured objects, each of the parts of this type of objects may require another four of the languages mentioned above.
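The linking step of Fig. 17 can be sketched as a constrained Cartesian product; in the code below (all names invented) each CRS-information system is a dictionary of attribute rows, the constraint is a predicate over a tuple of rows, and the expert's decision is supplied as a labelling function.

```python
# Illustrative sketch of the linking step (label L4 in Fig. 17): the tuples
# of the Cartesian product of part-level CRS-systems that satisfy the
# constraints become objects of the concept approximation table.

from itertools import product

def link(crs_systems, constraint, expert_decision):
    """crs_systems: list of dicts (one CRS-information system per part type
    T_1, ..., T_k, plus optionally one describing relations between parts);
    constraint: predicate over a tuple of rows;
    expert_decision: expert-supplied labelling of the accepted tuples."""
    table = []
    for combo in product(*[list(s.values()) for s in crs_systems]):
        if constraint(combo):  # e.g., the two vehicles drive close to each other
            merged = {f"part{i}_{k}": v
                      for i, row in enumerate(combo) for k, v in row.items()}
            merged["decision"] = expert_decision(combo)
            table.append(merged)
    return table
```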

[Fig. 18. Three cases of complex concepts approximation in ontology. Case 1: a spatial concept C of the higher ontology level, defined for complex objects, is approximated using spatial concepts C1, ..., Cl of the lower ontology level defined for the same type of complex objects. Case 2: a spatio-temporal concept C of the higher ontology level, defined for complex objects, is approximated using spatial concepts C1, ..., Cl of the lower ontology level defined for the same type of complex objects. Case 3: a spatio-temporal concept C of the higher ontology level, defined for structured complex objects, is approximated using spatio-temporal concepts C1, ..., Cl of the lower ontology level defined for the parts of the structured complex objects.]


However, the definition of these languages depends on the semantical difference between the concepts from both ontology levels. In this paper, we examine the following three situations in which the above four languages are defined in completely different ways (see Fig. 18).

1. The approximated concept C of the higher ontology level is a spatial concept (it does not require observing the changes of objects over time) and it is defined on the set of the same objects as the lower ontology level concepts (see Case 1 in Fig. 18). On the lower level we have a family of concepts {C1, ..., Cl} which are also spatial concepts. Apart from that, the concepts {C1, ..., Cl} are defined for unstructured objects without following their changes over time. That is why these concepts are defined on the basis of an observation of the object state at a single time point or in a time period established identically for all concepts. For example, the concept C and the concepts C1, ..., Cl may concern the situation of the same vehicle, while the concept C may be the concept Safe overtaking. On the other hand, to the family of concepts C1, ..., Cl may belong such concepts as Safe distance from the opposite vehicle during overtaking, Possibility of going back to the right lane and Possibility of safe stopping before the crossroads. The methods of approximation of the concept C for this case are described in Section 5.

2. The concept C under approximation is a spatio-temporal one (it requires observing object changes over time) and it is defined on the set of the same objects as the lower ontology level concepts (see Case 2 in Fig. 18). On the lower level we have a family of concepts {C1, ..., Cl} which are spatial concepts. The concept C concerns an object property defined over a longer time period than the concepts from the family {C1, ..., Cl}. This case concerns a situation when, in following an unstructured object in order to capture its behavior described by the concept C, we have to observe it longer than is required to capture the behaviors described by the concepts from the family {C1, ..., Cl}. For example, the concepts C1, ..., Cl may concern simple behaviors of a vehicle such as acceleration, deceleration or moving towards the left lane, while the concept C may be a more complex concept: accelerating in the right lane. Let us notice that determining whether a vehicle accelerates in the right lane requires its observation for some time, which is called a time window.



LNCS Transactions on Rough Sets

This journal subline has as its principal aim the fostering of professional exchanges between scientists and practitioners who are interested in the foundations and applications of rough sets. Topics include foundations and applications of rough sets as well as foundations and applications of hybrid methods combining rough sets with other approaches important for the development of intelligent systems. The journal includes high-quality research articles accepted for publication on the basis of thorough peer reviews. Dissertations and monographs up to 250 pages that include new research results can also be considered as regular papers. Extended and revised versions of selected papers from conferences can also be included in regular or special issues of the journal.

Editors-in-Chief: James F. Peters, Andrzej Skowron

Editorial Board: M. Beynon, G. Cattaneo, M.K. Chakraborty, A. Czyżewski, J.S. Deogun, D. Dubois, I. Düntsch, S. Greco, J.W. Grzymala-Busse, M. Inuiguchi, J. Järvinen, D. Kim, J. Komorowski, C.J. Liau, T.Y. Lin, E. Menasalvas, M. Moshkov, T. Murai, M. do C. Nicoletti, H.S. Nguyen, S.K. Pal, L. Polkowski, H. Prade, S. Ramanna, R. Słowiński, J. Stefanowski, J. Stepaniuk, Z. Suraj, R. Świniarski, M. Szczuka, S. Tsumoto, G. Wang, Y. Yao, N. Zhong, W. Ziarko

Table of Contents

Vagueness and Roughness (Zbigniew Bonikowski and Urszula Wybraniec-Skardowska)
Modified Indiscernibility Relation in the Theory of Rough Sets with Real-Valued Attributes: Application to Recognition of Fraunhofer Diffraction Patterns (Krzysztof A. Cyran)
On Certain Rough Inclusion Functions (Anna Gomolińska)
Automatic Rhythm Retrieval from Musical Files (Bożena Kostek, Jarosław Wójcik, and Piotr Szczuko)
FUN: Fast Discovery of Minimal Sets of Attributes Functionally Determining a Decision Attribute (Marzena Kryszkiewicz and Piotr Lasek)
Information Granulation: A Medical Case Study (Urszula Kużelewska and Jarosław Stepaniuk)
Maximum Class Separability for Rough-Fuzzy C-Means Based Brain MR Image Segmentation (Pradipta Maji and Sankar K. Pal)
Approximation Schemes in Logic and Artificial Intelligence (Victor W. Marek and Mirosław Truszczyński)
Decision Rule Based Data Models Using NetTRS System Overview (Marcin Michalak and Marek Sikora)
A Rough Set Based Approach for ECG Classification (Sucharita Mitra, M. Mitra, and B.B. Chaudhuri)
Universal Problem of Attribute Reduction (Mikhail Ju. Moshkov, Marcin Piliszczuk, and Beata Zielosko)
Extracting Relevant Information about Reduct Sets from Data Tables (Mikhail Ju. Moshkov, Andrzej Skowron, and Zbigniew Suraj)
Context Algebras, Context Frames, and Their Discrete Duality (Ewa Orlowska and Ingrid Rewitzky)
A Study in Granular Computing: On Classifiers Induced from Granular Reflections of Data (Lech Polkowski and Piotr Artiemjew)
On Classifying Mappings Induced by Granular Structures (Lech Polkowski and Piotr Artiemjew)
The Neurophysiological Bases of Cognitive Computation Using Rough Set Theory (Andrzej W. Przybyszewski)
Diagnostic Feature Analysis of a Dobutamine Stress Echocardiography Dataset Using Rough Sets (Kenneth Revett)
Rules and Apriori Algorithm in Non-deterministic Information Systems (Hiroshi Sakai, Ryuji Ishibashi, Kazuhiro Koba, and Michinori Nakata)
On Extension of Dependency and Consistency Degrees of Two Knowledges Represented by Covering (P. Samanta and Mihir K. Chakraborty)
A New Approach to Distributed Algorithms for Reduct Calculation (Tomasz Strąkowski and Henryk Rybiński)
From Information System to Decision Support System (Alicja Wakulicz-Deja and Agnieszka Nowak)
Debellor: A Data Mining Platform with Stream Architecture (Marcin Wojnarski)
Category-Based Inductive Reasoning: Rough Set Theoretic Approach (Marcin Wolski)
Probabilistic Dependencies in Linear Hierarchies of Decision Tables (Wojciech Ziarko)
Automatic Singing Voice Recognition Employing Neural Networks and Rough Sets (Paweł Żwan, Piotr Szczuko, Bożena Kostek, and Andrzej Czyżewski)
Hierarchical Classifiers for Complex Spatio-temporal Concepts (Jan G. Bazan)
Author Index

Vagueness and Roughness

Zbigniew Bonikowski and Urszula Wybraniec-Skardowska

Institute of Mathematics and Informatics, University of Opole, Opole, Poland
Autonomous Section of Applied Logic, Poznań School of Banking, Faculty in Chorzów, Poland

Abstract. The paper proposes a new formal approach to vagueness and vague sets, taking its inspiration from Pawlak's rough set theory. Following a brief introduction to the problem of vagueness, an approach to the conceptualization and representation of vague knowledge is presented from a number of different perspectives: those of logic, set theory, algebra, and computer science. The central notion of the vague set, in relation to the rough set, is defined as a family of sets approximated by the so-called lower and upper limits. The family is simultaneously considered as the family of all denotations of sharp terms representing a suitable vague term, from the agent's point of view. Some algebraic operations on vague sets and their properties are defined. Some important conditions concerning the membership relation for vague sets, in connection with Blizard's multisets and Zadeh's fuzzy sets, are established as well. A classical outlook on a logic of vague sentences (vague logic) based on vague sets is also discussed.

Keywords: vagueness, roughness, vague sets, rough sets, knowledge, vague knowledge, membership relation, vague logic.

1 Introduction

Logicians and philosophers have been interested in the problem area of vague knowledge for a long time, looking for logical foundations of a theory of the vague notions (terms) constituting such knowledge. Recently vagueness and, more generally, imperfection have become the subject of investigations of computer scientists interested in the problems of AI, in particular in the problems of reasoning on the basis of imperfect information and in the application of computers to support such reasoning and represent it in computer memory (see, e.g., Parsons [15]). Imperfection is considered in a general information-based framework, where objects are described by an agent in terms of attributes and their values. Bonissone and Tong [5] indicated three types of imperfection relating to information: incompleteness, uncertainty and imprecision. Incompleteness arises from the absence of a value of an attribute for some objects. Uncertainty arises from a lack


of information; as a result, an object's attribute may have a finite set of values rather than a single value. Imprecision occurs when an attribute's value cannot be measured with adequate precision. There are also other classifications of imperfect information (see, e.g., Słowiński, Stefanowski [26]). Marcus [12] thought of imprecision more generally. He distinguished, e.g., such types of imprecision as vagueness, fuzziness and roughness. Both fuzziness and roughness are mathematical models of vagueness. Fuzziness is closely related to Zadeh's fuzzy sets [28]. In fuzzy set theory, vagueness is described by means of a specific membership relation. Fuzziness is often identified with vagueness; however, Zadeh [29] noted that vagueness comprises fuzziness. Roughness is connected with Pawlak's rough sets [19]. Classical, set-theoretical sets (orthodox sets) are not sufficient to deal with vagueness. Non-orthodox sets, i.e. rough sets and fuzzy sets, are used in two different approaches to vagueness (Pawlak [22]): while Zadeh's fuzzy set theory represents a quantitative approach, Pawlak's rough set theory represents a qualitative approach to vagueness. Significant results obtained by computer scientists in the area of imprecision and vagueness, such as Zadeh's fuzzy set theory [28], Shafer's theory of evidence [24] and Pawlak's rough set theory [19,21], have greatly contributed to advancing and intensifying research into vagueness. This paper is an extended version of a previous article by the same authors [4]. It proposes a new approach to vagueness taking into account the main ideas of roughness. Roughness considered as a mathematical model of vagueness is here replaced by an approach to vagueness in which vague sets, defined in this paper, play the role of rough sets. Vague sets are connected with vague knowledge and, at the same time, are understood as denotations of vague notions. The paper also attempts to lay logical foundations for the theory of vague notions (terms) and thus bring an essential contribution to research in this area. The structure of the paper is as follows. In Sect. 2, we introduce the notions of unit information (unit knowledge) and vague information (vague knowledge). The central notion of the vague set, inspired by Pawlak's notion of a rough set, is defined in Sect. 3. Section 4 is devoted to the problem of the multiplicity of an object's membership to a vague set. In Sect. 5 some operations on vague sets and their algebraic properties are given. A view on the logic of vague concepts (terms) is discussed in Sect. 6. The paper ends with Sect. 7, which delivers some final remarks.

2 Unit Knowledge and Vague Knowledge

In the process of cognition of a definite fragment of reality, the cognitive agent (a man, an expert, a group of men or experts, a robot) attempts to discover the information contained in it or, more adequately, about its objects. Each fragment of reality recognized by the agent can be interpreted as the following relational structure:

ℜ = ⟨𝒰, R1, R2, ..., Rn⟩, (1)


where 𝒰, the universe of objects of the reality ℜ, is a non-empty set, and Ri, for i = 1, 2, ..., n, is the set of i-ary relations on 𝒰. Unary relations are regarded as subsets of 𝒰 and understood as properties of objects of 𝒰, and multi-argument relations as relationships among its objects. Formally, every k-ary relation of Rk is a subset of 𝒰^k. We assume that the reality ℜ is objective with respect to cognition. Objective knowledge about it consists of pieces of unit information (knowledge) about objects of 𝒰 with respect to all particular relations of Rk (k = 1, 2, ..., n). We introduce the notions of knowledge and vague knowledge in accordance with some conceptions of the second co-author of this paper ([27]).

Definition 1. Unit information (knowledge). Unit information (knowledge) about the object o ∈ 𝒰 with respect to the relation R ∈ Rk (k = 1, 2, ..., n) is the image $\vec{R}(o)$ of the object o with respect to the relation R.¹

Discovering unit knowledge about objects of reality is realized through asking questions which involve certain aspects, called attributes, of the objects of the universe 𝒰. Then, we usually choose a finite set U ⊆ 𝒰 as the universe and we put it forward as a generalized attribute-value system Σ, also called an information system (cf. Codd [6]; Pawlak [16], [18], [19]; Marek and Pawlak [13]). Its definition is as follows:

Definition 2. Information system. Σ is an information system iff it is an ordered system

Σ = ⟨U, A1, A2, ..., An⟩, (2)

where U ⊆ 𝒰, card(U) < ω, and Ak (k = 1, 2, ..., n) is the set of k-ary attributes understood as k-ary functions, i.e.

∀a ∈ Ak, a : U^k → Va, (3)

where Va is the set of all values of the attribute a.

Example 1. Let us consider the following information system: S = ⟨S, A1, A2⟩, where S = {p1, p2, ..., p5} is a set of 5 papers and A1 = {IMPACT FACTOR (IF), QUOTATIONS (Q)}, A2 = {TOPIC CONNECTION (TC)}. The attribute IF is a function which assigns to every paper p ∈ S the impact factor of the journal in which p was published. We assume that V_IF = [0, 100]. The value of the attribute Q for any paper p ∈ S is the number of quotations of p. We assume that V_Q = {0, 1, 2, ..., 2000}. We also assume that TC assigns to every pair of papers the quotient of the number of common references by the number of all references, and that V_TC = [0, 1].

¹ $\vec{R}(o) = \begin{cases} R, & \text{if } o \in R,\\ \emptyset, & \text{otherwise,}\end{cases}$ for R ∈ R1;
$\vec{R}(o) = \{\langle x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_k\rangle : \langle x_1, \ldots, x_{i-1}, o, x_{i+1}, \ldots, x_k\rangle \in R\}$ for R ∈ Rk (k = 2, ..., n).


The information system S can be clearly presented in the following tables:

      IF      Q
p1    0.203   125
p2    0.745   245
p3    0.498   200
p4    0.105   150
p5    1.203   245

TC    p1     p2     p3     p4     p5
p1    1      3/10   0      6/7    0
p2    3/10   1      0      0      4/17
p3    0      0      1      0      1/12
p4    6/7    0      0      1      0
p5    0      4/17   1/12   0      1

Every attribute of the information system Σ and every value of this attribute explicitly indicates a relation belonging to the so-called relational system determined by Σ. The unit information (knowledge) about an object o ∈ U should be considered with respect to relations of this system.

Definition 3. System determined by the information system. ℜ(Σ) is the system determined by the information system Σ (see (2)) iff

ℜ(Σ) = ⟨U, {R_{a,W} : a ∈ A1, ∅ ≠ W ⊆ Va}, ..., {R_{a,W} : a ∈ An, ∅ ≠ W ⊆ Va}⟩,

where R_{a,W} = {(o1, o2, ..., ok) ∈ U^k : a((o1, o2, ..., ok)) ∈ W} for any k ∈ {1, 2, ..., n}, a ∈ Ak.

Let us note that ⋃{R_{a,{v}} : a ∈ A1, v ∈ Va} = U, i.e. the family {R_{a,{v}} : a ∈ A1, v ∈ Va} is a covering of U. It is easy to see that

Fact 1. The system Σ uniquely determines the system ℜ(Σ).

Example 2. Let S be the information system given above. Then the system determined by the system S is ℜ(S) = ⟨U, R_{A1}, R_{A2}⟩, where R_{A1} = {R_{IF,W}}_{∅≠W⊆V_IF} ∪ {R_{Q,W}}_{∅≠W⊆V_Q} and R_{A2} = {R_{TC,W}}_{∅≠W⊆V_TC}. For any attribute a of the system S and any i, j ∈ ℝ we adopt the following notation: S_i^j = {v ∈ Va : i ≤ v ≤ j}, S^j = {v ∈ Va : v ≤ j}, S_i = {v ∈ Va : i ≤ v}. Then, in particular, we can easily state that: R_{IF,S_{0.1}^{0.5}} = {p1, p3, p4}, R_{IF,S_{0.7}} = {p2, p5}, R_{IF,S^{0.3}} = {p1, p4}, R_{Q,S_{150}^{150}} = R_{Q,{150}} = {p4}, R_{Q,S_{200}} = {p2, p3, p5} and R_{TC,{1/12}} = {(p3, p5), (p5, p3)}, R_{TC,{1}} = {(pi, pi)}_{i=1,...,5}.

The notion of knowledge about the attributes of the system Σ depends on the cognitive agent discovering the fragment of reality ℜ(Σ). According to Skowron's understanding of the notion of knowledge determined by any unary attribute (cf. Pawlak [17], Skowron et al. [25], Demri, Orlowska [8], pp. 16–17), we can adopt the following generalized definition of the notion of knowledge Ka about any k-ary attribute a:

Definition 4. Knowledge Ka about the attribute a. Let Σ be the information system satisfying (2) and a ∈ Ak (k = 1, 2, ..., n). Then
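To make Definition 3 and Example 2 concrete, here is a minimal sketch in Python (all identifiers are illustrative, not part of the paper): it encodes the attribute values of the system S from Example 1 and computes relations R_{a,W} for value sets W given as predicates; it reproduces the values stated in Example 2.

```python
# A sketch of the relations R_{a,W} from Definition 3, computed over the
# information system S of Example 1. Identifiers are illustrative only.

IF = {"p1": 0.203, "p2": 0.745, "p3": 0.498, "p4": 0.105, "p5": 1.203}
Q = {"p1": 125, "p2": 245, "p3": 200, "p4": 150, "p5": 245}

def relation(attr, W):
    """R_{a,W}: objects whose value of the attribute a lies in W.
    W is given as a predicate on values, so the sets S_i^j = {v : i <= v <= j},
    S^j = {v : v <= j} and S_i = {v : i <= v} become simple lambdas."""
    return sorted(x for x, v in attr.items() if W(v))

print(relation(IF, lambda v: 0.1 <= v <= 0.5))  # ['p1', 'p3', 'p4']
print(relation(IF, lambda v: v >= 0.7))         # ['p2', 'p5']
print(relation(IF, lambda v: v <= 0.3))         # ['p1', 'p4']
print(relation(Q, lambda v: v >= 200))          # ['p2', 'p3', 'p5']
```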


(a) Ka = {((o1, o2, ..., ok), V_{a,u}) : u = (o1, o2, ..., ok) ∈ U^k}, where V_{a,u} ⊆ P(Va); V_{a,u} is the family of all sets of possible values of the attribute a for the object u from the viewpoint of the agent, and P(Va) is the family of all subsets of Va.
(b) The knowledge Ka of the agent about the attribute a and its value for the object u is
(0) empty if card(⋃_{W ∈ V_{a,u}} W) = 0,
(1) definite if card(⋃_{W ∈ V_{a,u}} W) = 1,
(> 1) imprecise, in particular vague, if card(⋃_{W ∈ V_{a,u}} W) > 1.

Let us observe that vague knowledge about some attribute of the information system Σ is connected with the assignation of a vague value to the object u.

Example 3. Let us consider again the information system S = ⟨S, A1, A2⟩. The agent's knowledge K_IF, K_Q, K_TC about the attributes of the information system S can be characterized by means of the following tables:

      V_{IF,p}                    V_{Q,p}
p1    {S_0.2, S_0.3, S_0.25}      {S_100, S_150, S_90, S_80}
p2    {S_0.5, S_0.7, S_0.8}       {S_180, S_200, S_250, S_240}
p3    {S_0.5, S_0.6, S_0.4}       {S_170, S_230, S_180, S_150}
p4    {S_0.1, S_0.2, S_0.15}      {S_100, S_90, S_10, S_140}
p5    {S_0.7, S_1.5, S_1.0}       {S_270, S_150, S_240, S_200}

V_{TC,(p,p')}   p1               p2               p3               p4               p5
p1              {S_1^1}          {S^0.3, S^0.5}   {S^0.1, S^0.2}   {S_0.5, S_0.8}   {S^0.1, S^0.2}
p2              {S^0.3, S^0.5}   {S_1^1}          {S^0.1, S^0.2}   {S^0.1, S^0.2}   {S^0.3, S^0.4}
p3              {S^0.1, S^0.2}   {S^0.1, S^0.2}   {S_1^1}          {S^0.1, S^0.2}   {S^0.3, S^0.1}
p4              {S_0.5, S_0.8}   {S^0.1, S^0.2}   {S^0.1, S^0.2}   {S_1^1}          {S^0.1, S^0.2}
p5              {S^0.1, S^0.2}   {S^0.3, S^0.4}   {S^0.3, S^0.1}   {S^0.1, S^0.2}   {S_1^1}
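The case analysis of Definition 4(b) can be sketched as follows (Python; reading the big operator in (b) as a union is an assumption about a symbol lost in typesetting, and all names are illustrative): the agent's knowledge is classified by the cardinality of the union of the candidate value sets.

```python
# A sketch of Definition 4(b): the knowledge about an attribute value for an
# object is empty / definite / imprecise according to the cardinality of the
# union of the candidate value sets. Reading the operator as a union is an
# assumption.
V_Q = range(0, 2001)                      # domain of the attribute Q

def S_at_least(i):                        # S_i = {v in V_Q : v >= i}
    return {v for v in V_Q if v >= i}

def classify(V_a_u):
    union = set().union(*V_a_u)
    if len(union) == 0:
        return "empty"
    if len(union) == 1:
        return "definite"
    return "imprecise (possibly vague)"

# The agent's candidate value sets for Q(p1), as in the table above:
V_Q_p1 = [S_at_least(i) for i in (100, 150, 90, 80)]
print(classify(V_Q_p1))    # imprecise (possibly vague)
print(classify([{150}]))   # definite
```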

From Definitions 1 and 3 we arrive at:

Fact 2. Unit information (knowledge) about the object o ∈ U with respect to a relation R of the system ℜ(Σ) is the image $\vec{R}(o)$ of the object o with respect to the relation R.

Contrary to the objective unit knowledge $\vec{R}(o)$ about the object o of 𝒰 in the reality ℜ with regard to its relation R, the subjective unit knowledge (the unit knowledge of an agent) about the object o of U in the reality ℜ(Σ) depends on an attribute of Σ determining the relation R and its possible values from the viewpoint of the knowledge of the agent discovering ℜ(Σ). The subjective unit knowledge $\vec{R}_{ag}(o)$ depends on the agent's ability to solve the following equation:

$\vec{R}_{ag}(o) = x$, where x is an unknown quantity. (e)


Solutions of (e) for a k-ary relation R should be images of the object o with respect to k-ary relations R_{a,W} from ℜ(Σ), where ∅ ≠ W ⊆ Va. Let us note that for each unary relation R the solutions of (e) are unary relations R_{a,W}, where ∅ ≠ W ∈ V_{a,o}. A solution of the equation (e) can be correct – then the agent's knowledge about the object o is exact. If the knowledge is inexact, then at least one solution of (e) is not an image of the object o with respect to the relation R.

Definition 5. Empty, definite and imprecise unit knowledge. Unit knowledge of the agent about the object o ∈ U in ℜ(Σ) with respect to its relation R is
(0) empty iff the equation (e) does not have a solution for the agent (the agent knows nothing about the value of the function $\vec{R}$ for the object o),
(1) definite iff the equation (e) has exactly one solution for the agent (either the agent's knowledge is exact – the agent knows the value of the function $\vec{R}$ for the object o – or he accepts only one, but not necessarily accurate, value of the function),
(> 1) imprecise iff the equation (e) has at least two solutions for the agent (the agent allows at least two possible values of the function $\vec{R}$ for the object o).

From Definitions 4 and 5 we arrive at:

Fact 3. Unit knowledge of the agent about the object o ∈ U in ℜ(Σ) with respect to its relation R is
(0) empty if the agent's knowledge Ka about the attribute a and its value for the object o is empty,
(1) definite if the agent's knowledge Ka about the attribute a and its value for the object o is definite,
(> 1) imprecise if the agent's knowledge Ka about the attribute a and its value for the object o is imprecise.

When the unit knowledge of the agent about the object o is imprecise, then most often we replace the unknown quantity x in (e) with a vague value.

Example 4. Consider the relation R = R_{Q,S_{200}} within the previous system ℜ(S), i.e. the set of all papers of S that have been quoted in at least 200 other papers. The unit knowledge about the paper p5 with respect to R can be the following vague information:

$\vec{R}_{ag}(p_5) = \text{VALUABLE}$, (e1)

where VALUABLE is an unknown, indefinite, vague quantity. The agent then refers to the paper p5 non-uniquely, assigning to it different images of the paper p5 with respect to the relations that are possible from his point of view. The equation (e1) then usually has, for him, at least two solutions. From Example 3, it follows that each of the relations R_{Q,S_{270}}, R_{Q,S_{150}}, R_{Q,S_{240}}, R_{Q,S_{200}} can be a solution to (e1). Let us observe that R_{Q,S_{270}} = ∅, R_{Q,S_{150}} = {p2, p3, p4, p5}, R_{Q,S_{240}} = {p2, p5}, R_{Q,S_{200}} = {p2, p3, p5}.
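A short continuation of the sketches above (Python, illustrative names): the four candidate relations that the agent allows as solutions of (e1) can be computed directly from the Q values, yielding the family that the next section calls the vague set V_{p5}.

```python
# Solutions of (e1) from Example 4: the candidate relations R_{Q,S_i}.
Q = {"p1": 125, "p2": 245, "p3": 200, "p4": 150, "p5": 245}

def R_Q_at_least(i):                   # R_{Q,S_i} = {p : Q(p) >= i}
    return frozenset(p for p, v in Q.items() if v >= i)

V_p5 = [R_Q_at_least(i) for i in (270, 150, 240, 200)]
for R in V_p5:
    print(sorted(R))
# []; ['p2', 'p3', 'p4', 'p5']; ['p2', 'p5']; ['p2', 'p3', 'p5']
```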

3 Vague Sets and Rough Sets

Let ℜ(Σ) be the system determined by the information system Σ. In order to simplify our considerations in the subsequent sections of the paper, we will limit ourselves to a unary relation R (a property) – a subset of U of the system ℜ(Σ).

Definition 6. Inexact unit knowledge of the agent. Unit knowledge of the agent about the object o in ℜ(Σ) with respect to R is inexact iff the equation (e) has for him at least one solution and at least one of the solutions is not the image $\vec{R}(o)$. The equation (e) then has the form:

$\vec{R}_{ag}(o) = X$, (ine)

where X is an unknown quantity from the viewpoint of the agent, and (ine) has for him at least one solution and at least one of the solutions is not the image $\vec{R}(o)$. The equation (ine) can be called the equation of inexact knowledge of the agent. All solutions of (ine) are unary relations in the system ℜ(Σ).

Definition 7. Vague unit knowledge of the agent. Unit knowledge of the agent about the object o in ℜ(Σ) with respect to R is vague iff the equation (e) has at least two different solutions for the agent. The equation (e) then has the form:

$\vec{R}_{ag}(o) = \text{VAGUE}$, (ve)

where VAGUE is an unknown quantity, and (ve) has at least two different solutions for the agent. The equation (ve) can be called the equation of vague knowledge of the agent.

Fact 4. Vague unit knowledge is a particular case of inexact unit knowledge.

Definition 8. Vague (proper vague) set. The family of all solutions (sets) of (ine), respectively (ve), is called the vague set for the object o determined by R, respectively the proper vague set for the object o determined by R.

Example 5. The family of all solutions of (e1) from Example 4 is a vague set V_{p5} for the paper p5 determined by R_{Q,S_{200}}, and V_{p5} = {R_{Q,S_{270}}, R_{Q,S_{150}}, R_{Q,S_{240}}, R_{Q,S_{200}}}.

Vague sets, and thus also proper vague sets, determined by a set R are here generalizations of sets approximated by representations (see Bonikowski [3]). They are non-empty families of unary relations from ℜ(Σ) (such that at least one of them includes R) and sub-families of the family P(U) of all subsets of the set U, determined by the set R. They have the greatest lower bound (the lower limit) and the least upper bound (the upper limit) in P(U) with respect to inclusion. We will denote the greatest lower bound of any family X by glb(X). The least upper bound of X will be denoted by lub(X). So, we can note


Fact 5. For each vague set V determined by the set (property) R,

V ⊆ {Y ∈ P(U) : glb(V) ⊆ Y ⊆ lub(V)}. (4)

The idea of vague sets was conceived upon Pawlak's idea of rough sets [19]; Pawlak defined rough sets by means of the operations of lower approximation, denoted here by the subscript *, and upper approximation, denoted by the superscript *, defined on subsets of U. The lower approximation of a set is defined as the union of those indiscernibility classes of a given relation in U² which are included in this set, whereas the upper approximation of a set is defined as the union of the indiscernibility classes of the relation which have a non-empty intersection with this set.

Definition 9. Rough set. A rough set determined by a set R ⊆ U is the family P of all sets satisfying the condition (5):

P = {Y ∈ P(U) : Y_* = R_* ∧ Y^* = R^*}.² (5)

Let us observe that, because R ∈ P (and R ⊆ R), the family P is a non-empty family of sets such that at least one of them includes R (cf. Definition 8). By analogy to Fact 5, we have

Fact 6. For each rough set P determined by the set (property) R,

P ⊆ {Y ∈ P(U) : R_* ⊆ Y ⊆ R^*}. (6)

It is obvious that

Fact 7. If V is a vague set and X_* = glb(V) and X^* = lub(V) for any X ∈ V, then V is a subset of a rough set determined by any set of V.

For every rough set P determined by R we have: glb(P) = R_* and lub(P) = R^*. We can therefore consider the following generalization of the notion of the rough set:

Definition 10. Generalized rough set. A non-empty family G of subsets of U is called a generalized rough set determined by a set R iff it satisfies the condition (7):

glb(G) = R_* and lub(G) = R^*. (7)

It is easily seen that

Fact 8. Every rough set determined by a set R is a generalized rough set determined by R.

Fact 9. If V is a vague set and there exists a set X ⊆ U such that X_* = glb(V) and X^* = lub(V), then V is a generalized rough set determined by the set X.

² Some authors define a rough set as a pair of sets (lower approximation, upper approximation) (cf., e.g., Iwiński [10], Pagliani [14]).
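For comparison with Definition 9, a toy sketch (Python; the partition and the set R are made up purely for illustration) that computes lower and upper approximations from indiscernibility classes and then enumerates the rough set P as the family of all subsets sharing both approximations with R.

```python
from itertools import combinations

# Toy indiscernibility classes partitioning U, and approximations as in
# Definition 9. The data are illustrative.
U = {1, 2, 3, 4, 5, 6}
classes = [{1, 2}, {3, 4}, {5, 6}]

def lower(X):  # X_*: union of classes included in X
    return set().union(*(B for B in classes if B <= X))

def upper(X):  # X^*: union of classes with non-empty intersection with X
    return set().union(*(B for B in classes if B & X))

def subsets(S):
    s = sorted(S)
    return (set(c) for r in range(len(s) + 1) for c in combinations(s, r))

R = {1, 2, 3}
P = [Y for Y in subsets(U) if lower(Y) == lower(R) and upper(Y) == upper(R)]
print(lower(R), upper(R))       # {1, 2} {1, 2, 3, 4}
print(sorted(map(sorted, P)))   # [[1, 2, 3], [1, 2, 4]]
```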

4 Multiplicity of Membership to a Vague Set

For every object o ∈ U and every vague set V_o, we can count the multiplicity of the membership of o to this set.

Definition 11. Multiplicity of membership. The number i is the multiplicity of membership of the object o to the vague set V_o iff o belongs to i sets of V_o (i ∈ N).

The notion of the multiplicity of an object's membership to a vague set is closely related to the so-called degree of an object's membership to the set.

Definition 12. Degree of an object's membership. Let V_o be a vague set for the object o and card(V_o) = n. The function μ is called a degree of membership of o to V_o iff

$\mu(o) = \begin{cases} 0, & \text{if the multiplicity of membership of } o \text{ to } V_o \text{ equals } 0,\\ k/n, & \text{if the multiplicity of membership of } o \text{ to } V_o \text{ equals } k \ (0 < k < n),\\ 1, & \text{if the multiplicity of membership of } o \text{ to } V_o \text{ equals } n. \end{cases}$

Example 6. The degree of the membership of the paper p5 to the vague set V_{p5} (see Example 5) is equal to 3/4.

It is clear that

Fact 10. 1. Any vague set is a multiset in Blizard's sense [1]. 2. Any vague set is a fuzzy set in Zadeh's sense [28] with μ as its membership function.
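A minimal sketch of Definitions 11 and 12 (Python, illustrative names): the multiplicity counts the members of the family containing the object, and the degree normalizes it by the family size; on V_{p5} from Example 5 it reproduces the value 3/4 of Example 6.

```python
from fractions import Fraction

def multiplicity(o, V):            # Definition 11
    return sum(1 for X in V if o in X)

def degree(o, V):                  # Definition 12: mu(o) = k/n
    return Fraction(multiplicity(o, V), len(V))

V_p5 = [frozenset(), frozenset({"p2", "p3", "p4", "p5"}),
        frozenset({"p2", "p5"}), frozenset({"p2", "p3", "p5"})]
print(degree("p5", V_p5))          # 3/4, as in Example 6
```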

5 Operations on Vague Sets

Let us denote by V the family of all vague sets determined by relations in the system ℜ(Σ). In the family V we can define a unary negation operation ¬ on vague sets, and a union operation ⊕ and an intersection operation ⊙ on any two vague sets.

Definition 13. Operations on vague sets. Let V1 = {Ri}_{i∈I} and V2 = {Si}_{i∈I} be vague sets determined by the sets R ⊆ U and S ⊆ U, respectively. Then
(a) V1 ⊕ V2 = {Ri}_{i∈I} ⊕ {Si}_{i∈I} = {Ri ∪ Si}_{i∈I},
(b) V1 ⊙ V2 = {Ri}_{i∈I} ⊙ {Si}_{i∈I} = {Ri ∩ Si}_{i∈I},
(c) ¬V1 = ¬{Ri}_{i∈I} = {U \ Ri}_{i∈I}.

The family V1 ⊕ V2 is called the union of the vague sets V1 and V2 determined by the relations R and S. The family V1 ⊙ V2 is called the intersection of the vague sets V1 and V2 determined by the relations R and S. The family ¬V1 is called the negation of the vague set V1 determined by the relation R.
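A sketch of Definition 13 (Python; a vague set is represented here as a list of sets indexed by position, so the two operands must be aligned over the same index set I, the ⊙ glyph stands in for a symbol lost in extraction, and all names are illustrative):

```python
# Elementwise operations on vague sets (Definition 13). V1 and V2 are
# families indexed by the same set I, realized as equal-length lists.
def v_union(V1, V2):                 # V1 (+) V2 = {R_i | S_i}
    return [R | S for R, S in zip(V1, V2)]

def v_intersection(V1, V2):          # V1 (.) V2 = {R_i & S_i}
    return [R & S for R, S in zip(V1, V2)]

def v_negation(V1, U):               # not V1 = {U \ R_i}
    return [U - R for R in V1]

U = {"p1", "p2", "p3", "p4", "p5"}
V1 = [{"p2", "p5"}, {"p2", "p3", "p5"}]
V2 = [{"p1", "p2"}, {"p3"}]
print(v_union(V1, V2))       # [{p1, p2, p5}, {p2, p3, p5}] (order may vary)
print(v_negation(V1, U))     # [{p1, p3, p4}, {p1, p4}] (order may vary)
```

Since P(U) ordered by inclusion is a complete lattice, the lower and upper limits of such a family are simply its intersection and union, which makes the bounds of Fact 5 easy to check on small examples like this one.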


Theorem 1. Let V1 = {Ri}_{i∈I} and V2 = {Si}_{i∈I} be vague sets determined by the sets R and S, respectively. Then
(a) glb(V1 ⊕ V2) = glb(V1) ∪ glb(V2) and lub(V1 ⊕ V2) = lub(V1) ∪ lub(V2),
(b) glb(V1 ⊙ V2) = glb(V1) ∩ glb(V2) and lub(V1 ⊙ V2) = lub(V1) ∩ lub(V2),
(c) glb(¬V1) = U \ lub(V1) and lub(¬V1) = U \ glb(V1).

Theorem 2. The structure B = (V, ⊕, ⊙, ¬, 0, 1) is a Boolean algebra, where 0 = {∅} and 1 = {U}.

We can easily observe that the above-defined operations on vague sets differ from Zadeh's operations on fuzzy sets, from the standard operations in any field of sets and, in particular, from the operations on rough sets defined by Pomykala & Pomykala [23] and Bonikowski [2]. The family of all rough sets with the operations defined in the latter two works is a Stone algebra.

6 On Logic of Vague Terms

How can one solve the problem of a logic of vague terms, a logic of vague sentences (vague logic), based on the vague sets characterized in the previous sections? Answering this question requires a brief description of the problem of the language representation of unit knowledge. On the basis of our examples, let us consider two pieces of unit information about the paper p5 with respect to the set R of all papers that have been quoted in at least 200 other papers: first, exact unit knowledge

$\vec{R}_{ag}(p_5) = \{p_2, p_3, p_5\}$, (ee)

next, vague unit knowledge:

$\vec{R}_{ag}(p_5) = \text{VALUABLE}$. (e1)

Let p5 be the designator of the proper name a, R the denotation (extension) of the name-predicate P ('a paper that has been quoted in at least 200 other papers'), and the vague name-predicate V ('a paper which is valuable') a language representation of the vague quantity VALUABLE. Then a representation of the first equation (ee) is the logical atomic sentence

a is P, (re)

and a representation of the second equation (e1) is the vague sentence

a is V. (re1)

In a similar way, we can represent, respectively, (ee) and (e1) by means of a logical atomic sentence

aP′ or P′(a), (re′)

where P′ is the predicate ('has been quoted in at least 200 other papers'), and by means of a vague sentence

aV′ or V′(a), (re1′)

where V′ is the vague predicate ('is valuable').


The sentence (re1) (resp. the sentence (re1′)) is not a logical sentence, but it can be treated as a sentential form which represents all the logical sentences, in particular the sentence (re) (respectively the sentence (re′)), that arise by replacing the vague name-predicate (resp. vague predicate) V by allowable sharp name-predicates (resp. sharp predicates), whose denotations (extensions) constitute the vague set V_{p5} being the denotation of V and, at the same time, the set of solutions to the equation (e1) from the agent's point of view. By analogy, we can consider every atomic vague sentence of the form V(a), where a is an individual term and V its vague predicate, as a sentential form with V as a vague variable running over all denotations of sharp predicates that can be substituted for V in order to get precise, true or false, logical sentences from the form V(a). Then, the scope of the variable V is the vague set V_o determined by the designator o of the term a. All the above remarks lead to a 'conservative', classical approach in searching for a logic of vague terms, or vague sentences, here referred to as vague logic (cf. Fine [9], Cresswell [7]). It is easy to see that all counterparts of laws of classical logic are laws of vague logic because, to name just one reason, vague sentences have an interpretation in the Boolean algebra B of vague sets (see Theorem 2). We can distinguish two directions in seeking such a logic:
1a) all counterparts of tautologies of the classical sentential calculus that are obtained by replacing sentence variables with atomic expressions of this logic (of the form V(x)), representing vague atomic sentences (sentential functions of the form V(a)), are tautologies of vague logic;
1b) all counterparts of tautologies of the classical predicate calculus that can be obtained by replacing predicate variables with vague predicate variables, representing vague predicates, are tautologies of vague logic;
2) vague logic should be a finite-valued logic, in which the value of any vague sentence V(a) represented by its vague atomic expression (of the form V(x)) is the multiplicity of membership of the designator o of a to the vague set V_o being the denotation of V, and the multiplicities of membership of the designators of the subjects of any composed vague sentence, represented by its composed vague formula, to the denotation (a vague set) corresponding to this sentence are functions of the multiplicities of membership of every designator of the subject of its atomic components to the denotations of their vague predicates.
It should be noticed that sentential connectives for vague logic should not be expected to satisfy the standard conditions (see Malinowski [11]). For example, an alternative (disjunction) of two vague sentences V(a) and V(b) can be a 'true' vague sentence (sentential form) despite the fact that its arguments V(a) and V(b) are neither 'true' nor 'false' sentential forms, i.e. in certain cases they represent true sentences, and in some other cases they represent false sentences. This is not contrary to the statement that all vague sentential forms which we obtain by a suitable substitution of sentential variables (resp. predicate variables) by vague sentences (resp. vague predicates) in laws of classical logic always represent true sentences. Thus they are laws of vague logic.

7 Final Remarks

1. The concept of vagueness was defined in the paper as an indefinite, vague quantity or property corresponding to the knowledge of an agent discovering a fragment of reality, and delivered in the form of the equation of inexact knowledge of the agent. A vague set was defined as a set (family) of all possible solutions (sets) of this equation and, although our considerations were limited to the case of unary relations, they can easily be generalized to encompass any k-ary relations.
2. The idea of vague sets was derived from the idea of rough sets originating in the work of Zdzislaw Pawlak, whose theory of rough sets takes a non-numerical, qualitative approach to the issue of vagueness, as opposed to the quantitative interpretation of vagueness provided by Lotfi Zadeh.
3. Vague sets, like rough sets, are based on the idea of a set approximation by two sets, called the lower and the upper limits of this set. These two kinds of sets are families of sets approximated by suitable limits.
4. Pawlak's approach and the approach discussed in this paper both make a reference to the concept of a cognitive agent's knowledge about the objects of the reality being investigated (see Pawlak [20]). This knowledge is determined by the system of concepts that is determined by a system of their extensions (denotations). When a concept is vague, its denotation, in Pawlak's sense, is a rough set, while in the authors' sense it is a vague set which, under some conditions, is a subset of the rough set.
5. In language representation, the equation of inexact, vague knowledge of the agent can be expressed by means of vague sentences containing a vague predicate. Its denotation (extension) is the family of all scopes of sharp predicates which, from the agent's viewpoint, can be substituted for the vague predicate. The denotation is, at the same time, the vague set of all solutions to the equation of the agent's vague knowledge.
6. Because vague sentences can be treated as sentential forms whose variables are vague predicates, all counterparts of tautologies of classical logic are laws of vague logic (the logic of vague sentences).
7. Vague logic is based on classical logic but it is a many-valued logic, because its sentential connectives are not extensional.

References

1. Blizard, W.D.: Multiset Theory. Notre Dame J. Formal Logic 30(1), 36–66 (1989)
2. Bonikowski, Z.: A Certain Conception of the Calculus of Rough Sets. Notre Dame J. Formal Logic 33, 412–421 (1992)
3. Bonikowski, Z.: Sets Approximated by Representations (in Polish; doctoral dissertation prepared under the supervision of Prof. U. Wybraniec-Skardowska), Warszawa (1996)
4. Bonikowski, Z., Wybraniec-Skardowska, U.: Rough Sets and Vague Sets. In: Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A. (eds.) RSEISP 2007. LNCS, vol. 4585, pp. 122–132. Springer, Heidelberg (2007)


5. Bonissone, P., Tong, R.: Editorial: reasoning with uncertainty in expert systems. Int. J. Man–Machine Studies 22, 241–250 (1985)
6. Codd, E.F.: A Relational Model of Data for Large Shared Data Banks. Comm. ACM 13, 377–387 (1970)
7. Cresswell, M.J.: Logics and Languages. Methuen, London (1973)
8. Demri, S., Orlowska, E.: Incomplete Information: Structure, Inference, Complexity. Springer, Heidelberg (2002)
9. Fine, K.: Vagueness, Truth and Logic. Synthese 30, 265–300 (1975)
10. Iwiński, T.: Algebraic Approach to Rough Sets. Bull. Pol. Acad. Sci. Math. 35, 673–683 (1987)
11. Malinowski, G.: Many-Valued Logics. Oxford University Press, Oxford (1993)
12. Marcus, S.: A Typology of Imprecision. In: Brainstorming Workshop on Uncertainty in Membrane Computing Proceedings, Palma de Mallorca, pp. 169–191 (2004)
13. Marek, W., Pawlak, Z.: Rough Sets and Information Systems, ICS PAS Report 441 (1981)
14. Pagliani, P.: Rough Set Theory and Logic-Algebraic Structures. In: Orlowska, E. (ed.) Incomplete Information: Rough Set Analysis, pp. 109–190. Physica Verlag, Heidelberg (1998)
15. Parsons, S.: Current approaches to handling imperfect information in data and knowledge bases. IEEE Trans. Knowl. Data Eng. 8(3), 353–372 (1996)
16. Pawlak, Z.: Information Systems, ICS PAS Report 338 (1979)
17. Pawlak, Z.: Information Systems – Theoretical Foundations (in Polish). PWN – Polish Scientific Publishers, Warsaw (1981)
18. Pawlak, Z.: Information Systems – Theoretical Foundations. Information Systems 6, 205–218 (1981)
19. Pawlak, Z.: Rough Sets. Intern. J. Comp. Inform. Sci. 11, 341–356 (1982)
20. Pawlak, Z.: Rough Sets. Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht (1991)
21. Pawlak, Z.: Vagueness and uncertainty: A rough set perspective. Computat. Intelligence 11(2), 227–232 (1995)
22. Pawlak, Z.: Orthodox and Non-orthodox Sets – some Philosophical Remarks. Found. Comput. Decision Sci. 30(2), 133–140 (2005)
23. Pomykala, J., Pomykala, J.A.: The Stone Algebra of Rough Sets. Bull. Pol. Acad. Sci. Math. 36, 495–508 (1988)
24. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press, Princeton (1976)
25. Skowron, A., Komorowski, J., Pawlak, Z., Polkowski, L.: Rough Sets Perspective on Data and Knowledge. In: Klösgen, W., Żytkow, J.M. (eds.) Handbook of Data Mining and Knowledge Discovery, pp. 134–149. Oxford University Press, Oxford (2002)
26. Słowiński, R., Stefanowski, J.: Rough-Set Reasoning about Uncertain Data. Fund. Inform. 23(2–3), 229–244 (1996)
27. Wybraniec-Skardowska, U.: Knowledge, Vagueness and Logic. Int. J. Appl. Math. Comput. Sci. 11, 719–737 (2001)
28. Zadeh, L.A.: Fuzzy Sets. Information and Control 8, 338–353 (1965)
29. Zadeh, L.A.: PRUF: A meaning representation language for natural languages. Int. J. Man–Machine Studies 10, 395–460 (1978)

Modified Indiscernibility Relation in the Theory of Rough Sets with Real-Valued Attributes: Application to Recognition of Fraunhofer Diffraction Patterns

Krzysztof A. Cyran

Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland

Abstract. The goal of the paper is to present a modification of the classical indiscernibility relation, dedicated to rough set theory in a real-valued attribute space. Contrary to some other known generalizations, the indiscernibility relation modified here remains an equivalence relation, and it is obtained by introducing a structure into the collection of attributes. It defines real-valued subspaces, used in a multidimensional cluster analysis, which partition the universe in a more natural way compared to the one-dimensional discretization iterated in the classical model. Since the classical model is a special, extreme case of our modification, the modified version can be considered more general. More importantly, it allows for natural processing of real-valued attributes in rough set theory, broadening the scope of applications of the classical as well as the variable precision rough set model, since the latter can utilize the proposed modification equally well. In a case study, we show a real application of the modified relation: a hybrid, opto-electronic recognizer of Fraunhofer diffraction patterns. Modified rough sets are used in an evolutionary optimization of the optical feature extractor implemented as a holographic ring-wedge detector. The classification is performed by a probabilistic neural network, whose error, assessed in an unbiased way, is compared with earlier works.

Keywords: rough sets, indiscernibility relation, holographic ring-wedge detector, evolutionary optimization, probabilistic neural network, hybrid pattern recognition.

1 Introduction

In the classical theory of rough sets, originated by Pawlak [32], the indiscernibility relation is generated by the information describing objects belonging to some finite set called the universe. If this information is of a discrete nature, then the classical form of this relation is a natural and elegant notion. For many applications processing discrete attributes describing objects of the universe, such a definition of the indiscernibility relation is adequate, which implies that the area of successful use


of the classical rough set methodology covers problems having a natural discrete representation, consistent with the granular nature of knowledge in this theory [32]. The classical rough set model is particularly useful in automatic machine learning, knowledge acquisition and decision rule generation, applied to problems with discrete data sets that are not large enough for the application of statistical methods, which demand reliable estimation of the distributions characterizing the underlying process [29,30]. If, however, the problem is defined in a continuous domain, the classical indiscernibility relation almost surely builds one-element abstract classes, and therefore is not suitable for any generalization. To overcome this disadvantage, different approaches have been proposed. The simplest is discretization, but if this process is performed separately for single attributes, it induces an artificial and highly nonlinear transformation of the attribute space. Other approaches concentrate on generalizing the notion of the indiscernibility relation into a tolerance relation [25,36] or a similarity relation [15,37,38]. A comparative study focused upon even more general approaches, assuming the indiscernibility relation to be any binary reflexive relation, is given by Gomolińska [20]. Another interesting generalization of the indiscernibility relation into a characteristic relation, applicable to attributes with missing values (lost values or don't care conditions), is proposed by Grzymala-Busse [21,22]. In this paper we propose a methodology based on the introduction of a structure into the collection of conditional attributes, treating certain groups defining this structure as multidimensional subspaces in a subsequent cluster analysis. In this way we do not have to resign from an equivalence relation and, at the same time, we obtain abstract classes uniting similar objects belonging to the same clusters in a continuous multidimensional space, as required by the majority of classification problems. Since the area of the author's interests is focused on hybrid opto-electronic pattern recognition systems, the practical illustration of the proposed modification concerns such a system. However, with some exceptions indicated at the end of Section 2, the modification can find many more applications, especially as it can be equally well adopted in the generalized variable precision rough set model, introduced by Ziarko [40], to meet the requirements of the analysis of huge data sets. Automatic recognition of images constitutes an important area in pattern recognition problems. Mait et al. [28], in a review article, state that "an examination of recent trends in imaging reveals a movement towards systems that balance processing between optics and electronics". Such systems are designed to perform heavy computations in optical mode, practically contributing no time delays, while post-processing is performed in computers, often with the use of artificial intelligence (AI) methods. The foundations of one such system were laid by Casasent and Song [4], who presented the design of holographic ring-wedge detectors (HRWD), and by George and Wang, who combined a commercially available ring-wedge detector (RWD) and a neural network (NN) in one complete image recognition system [19]. Despite the completeness of the solution, their system was of little practical importance, since the commercially available


RWD was very expensive and, moreover, could not be adapted to a particular problem. Casasent's HRWD, originally named by him a computer generated hologram (CGH), had a lot of advantages over the commercial RWD, the most important being its much lower cost and adaptability. According to its optical characteristics, the HRWD belongs to a wider class of grating-based diffractive optical variable devices (DOVDs) [11], which can be relatively easily obtained from computer-generated masks and are used for sampling the Fraunhofer diffraction pattern. The pioneering works proposing the method of optimization of HRWD masks for a given application were published by Cyran and Mrozek [10] and by Jaroszewicz et al. [23]. This method was successfully applied in a multilayer perceptron (MLP) based system for recognition of the type of subsurface stress in materials with an embedded optical fiber [9,12,14]. Examples of applications of RWD-based feature extraction together with an MLP-based classification module include systems designed by Podeszwa et al. [34], devoted to monitoring engine condition, and by Jaroszewicz et al. [24], dedicated to airplane engines. Some other notable examples of applications of ring-wedge detector and neural network systems include the works of Ganotra et al. [17] and Berfanger and George [3], concerning fingerprint recognition, face recognition [18], or image quality assessment [3]. The ring-wedge detector has also been used, as a light scatter detector, in the classification of airborne particles performed by Kaye et al. [26] and for accurate characterization of particles or defects present on or under the surface, useful in the fabrication of integrated circuits, as presented by Nebeker and Hirleman [31]. A purely optical version of the HRWD-MLP recognition system was considered by Cyran and Jaroszewicz [7]; however, such a system is limited by the state of development of optical neural networks. A version of the device simplified to rings only is reported by Fares et al. [16] to be applied in rotation-invariant recognition of letters. Given all these applications, it is no wonder that Mait et al. [28] concluded: "few attempts have been made to design detectors with much consideration for the optics. A notable exception is the ring-wedge detector designed for use in the Fourier plane of a coherent optical processor." Obviously, the MLP (or, more generally, any type of NN) is not the only classifier which could be applied to the classification of patterns occurring in the feature space generated by the HRWD. Moreover, the first version of the optimization procedure favored rough set based classifiers, due to the identical (and therefore fully compatible) discrete nature of knowledge representation in the theory of rough sets applied both to HRWD optimization and to the subsequent rough set based classification. The application of the general ideas for obtaining such a rough classifier was presented by Cyran and Jaroszewicz [8], and a fast rough classifier implemented as a PAL 26V12 element was considered and designed by Cyran [6]. Despite the inherent compatibility between the optimization procedure and the classifier, the system remained suboptimal, because the features extracted from the HRWD generate a continuous space, subject to the unnatural discretization required by both the rough set based optimization and the classifier. These problems led to the idea that, in order to obtain an enhanced optimization method, the discretization required by the classical indiscernibility relation


in rough set theory should be eliminated in such a way that does not require resigning from an equivalence relation in favor of some weaker form (like a tolerance relation, for example). We achieved this by a modification of the indiscernibility relation which allows natural processing of real-valued attributes. The paper presents this problem in Section 2. After the focus on indiscernibility relation related problems in Section 2, Section 3 starts with the optical foundations of the recognition system considered, followed by experimental results obtained from the application of the enhanced optimization methodology. The discussion and conclusions are included in Section 4. Remarkably, the experimental application of the modified indiscernibility relation in the system considered improved the results of the evolutionary optimization of the holographic RWD and, equivalently, enhanced the optimization of the HRWD-generated feature space dedicated to real-valued classifiers. It also gave a theoretical basis for the latest design of a two-way, neural network and rough set based classification system [5].

2 Modification of Indiscernibility Relation

Let us start with a brief analysis of the classical theory of rough sets and of its generalization, the theory of rough sets with variable precision, in the context of data representation requirements. Next, the modification of the indiscernibility relation is given. With the modified indiscernibility relation, the majority of notions defined in rough set theory (both in its classical and its generalized, variable precision form) can be naturally applied to attributes having a real-valued domain.

2.1 Analysis of the Theory of Rough Sets with Discrete Attributes

The notion of a rough set has been defined for the representation, processing and understanding of imperfect knowledge. Such knowledge must often suffice in control, machine learning or pattern recognition. The rough approach is based on the assumption that each object is associated with some information describing it, not necessarily in an accurate and certain way. Objects described by the same information are not discernible. The indiscernibility relation, introduced here in an informal way, expresses the fact that the theory of rough sets does not deal with individual objects, but with classes of objects which are indiscernible. Therefore the knowledge represented by classical rough sets is granular [32]. The simple consequence is that objects with a natural real-valued representation hardly match that scheme, and some preprocessing has to be performed before such objects can be considered in a rough set based frame. The goal of this preprocessing is to make "indiscernible" those objects which are close enough (but certainly discernible) in the real-valued space. In the majority of applications of rough set theory, this is obtained by the discretization of all real-valued attributes. This highly nonlinear process is not natural and is disadvantageous in many applications (such as the application presented in Section 3). Before we present an alternative way of addressing the problem (in subsection 2.2), a formal definition of the classical indiscernibility relation is given.


Let S = ⟨U, Q, v, f⟩ be the information system composed of a universe U, a set of attributes Q, an information function f, and a mapping v. This latter mapping associates each attribute q ∈ Q with its domain Vq. The information function f : U × Q → V is defined in such a way that f(x, q) reads as the value of attribute q for the element x ∈ U, and V denotes the domain of all attributes q ∈ Q, defined as the union of the domains of the single attributes, i.e., V = ⋃q∈Q Vq. Then each nonempty set of attributes C ⊆ Q defines the indiscernibility relation I0(C) ⊆ U × U for x, y ∈ U as

x I0(C) y ⇔ ∀q ∈ C, f(x, q) = f(y, q).    (1)
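To make the construction concrete, the following sketch (hypothetical code, not from the original paper; the attribute names and data are illustrative only) computes the equivalence classes of I0(C) for a small discrete information system:

```python
from collections import defaultdict

def indiscernibility_classes(objects, f, C):
    """Partition `objects` into classes of I0(C): x and y fall into the
    same class iff f(x, q) == f(y, q) for every attribute q in C."""
    classes = defaultdict(set)
    for x in objects:
        signature = tuple(f[x][q] for q in C)  # the information describing x
        classes[signature].add(x)
    return list(classes.values())

# Illustrative information system with discrete attributes.
f = {
    "x1": {"color": "red",  "size": 1},
    "x2": {"color": "red",  "size": 1},
    "x3": {"color": "blue", "size": 2},
}
print(indiscernibility_classes(f.keys(), f, C=("color", "size")))
# [{'x1', 'x2'}, {'x3'}]
```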

Such a definition, although theoretically applicable both for discrete and continuous domains V, is practically valuable only for discrete domains. For continuous domains the relation is too strong, because in practice all elements would be discernible. Consequently, all abstract classes generated by I0 would be composed of exactly one element, which would make the application of rough set notions possible, but pointless. The problem is that in the theory of rough sets, with each information system we can associate some knowledge KQ generated by the indiscernibility relation I0(Q); for continuous attributes the corresponding knowledge would be too specific to allow for any generalization, required for the classification of similar objects into common categories.

2.2 Indiscernibility Relation in Rough Sets with Real-Valued Attributes

The consequence of the discussion ending the previous section is the need for discretization. If a problem is originally defined for real-valued attributes, then before the application of rough set theory, some clustering and discretization of the continuous attribute values should be performed. Let this process be denoted as a transformation described by a vector function Λ : ℝ^card(C) → {1, 2, . . . , ξ}^card(C), where ξ is called the discretization factor. The discretization factor simply denotes the number of clusters covering the domain of each individual attribute q ∈ C. Theoretically, this factor could be different for different attributes, but without loss of generality we assume it is constant over the set of attributes. Then the discretization of any individual attribute q ∈ C can be denoted as a transformation defined by a scalar function Λ : ℝ → {1, 2, . . . , ξ}. In this case, we obtain the classical form of the indiscernibility relation, defined as

x I0(Λ[C]) y ⇔ ∀q ∈ C, f(x, Λ[q]) = f(y, Λ[q]).    (2)
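As an illustration, the sketch below (hypothetical code; an equal-width binning is chosen only for simplicity, any per-attribute clustering would do) discretizes each real-valued attribute separately with a common factor ξ and then reuses the partition routine from the previous sketch:

```python
import numpy as np

def discretize(values, xi):
    """Equal-width binning Lambda : R -> {1, ..., xi} for one attribute."""
    lo, hi = min(values), max(values)
    edges = np.linspace(lo, hi, xi + 1)[1:-1]      # xi - 1 inner cut points
    return {v: int(np.searchsorted(edges, v)) + 1 for v in values}

# Discretize every attribute q in C independently, as required by (2).
data = {"x1": {"q1": 0.12, "q2": 3.4},
        "x2": {"q1": 0.15, "q2": 9.8},
        "x3": {"q1": 0.90, "q2": 9.7}}
C, xi = ("q1", "q2"), 2
maps = {q: discretize([data[x][q] for x in data], xi) for q in C}
f_disc = {x: {q: maps[q][data[x][q]] for q in C} for x in data}
print(indiscernibility_classes(f_disc.keys(), f_disc, C))
# [{'x1'}, {'x2'}, {'x3'}] for this toy data
```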

Below we summarize the fact that the majority (though not all) of the notions defined in the theory of rough sets de facto do not demand the strong version of the indiscernibility relation I0 defined by equation (1) (or by (2), if discretization is required). From a formal point of view, what is really important is that we assume the indiscernibility relation to be an equivalence relation, i.e., it must be reflexive, symmetric, and transitive. From a practical point of view, objects indiscernible in the sense of rough set theory should be objects which


are close in the real-valued space. Any relation having these properties we denote by I, without any subscript, reserving subscripts for particular forms of I. The exact form of I, defined as I0 in (1) or (2), is not required for the processing of rough information, except for some notions which we discuss later. One can easily verify (by confronting the general form of the indiscernibility relation I with the notions presented below) that the following constructs form a logically consistent system, no matter what the specific form of the indiscernibility relation is. In particular, this holds for forms of the relation which differ from the classical form, both for discrete (1) and continuous (2) attributes, as presented below.

C-elementary sets. A set Z is C-elementary when all elements x ∈ Z are C-indiscernible, i.e., they belong to the same class [x]I(C) of the relation I(C). If C = Q then Z is an elementary set in S. A C-elementary set is therefore the atomic unit of knowledge about the universe U with respect to C. Since C-elementary sets are defined by the abstract classes of the relation I, any equivalence relation can be used as I.

C-definable sets. If a set X is a union of C-elementary sets then X is C-definable, i.e., it is definable with respect to the knowledge KC. A complement, a product, or a union of C-definable sets is also C-definable. Therefore the indiscernibility relation I(C), by generating the knowledge KC, defines all that can be accurately expressed with the use of the set of attributes C. Two information systems S and S′ are equivalent if they have the same elementary sets; then the knowledge KQ is the same as the knowledge KQ′. Knowledge KQ is more general than knowledge KQ′ iff I(Q′) ⊆ I(Q), i.e., when each abstract class of the relation I(Q′) is included in some abstract class of I(Q). C-definable sets, as unions of C-elementary sets, are also defined for any equivalence relation I.

C-rough set X. Any set being a union of C-elementary sets is a C-crisp set; any other collection of objects in the universe U is called a C-rough set. A rough set contains a border, composed of elements for which, based on the knowledge generated by the indiscernibility relation I, it is impossible to decide whether or not they belong to the set. Each rough set can be defined by two crisp sets, called the lower and the upper approximation of the rough set. Since C-crisp sets are unions of C-elementary sets, and a C-rough set is defined by two C-crisp sets, the notion of a C-rough set is defined for any equivalence relation I, not necessarily for I0.

C-lower approximation of a rough set X ⊆ U. The lower approximation of a rough set X is composed of those elements of the universe which certainly belong to X, based on the indiscernibility relation I. Formally, the C-lower approximation of a set X ⊆ U, denoted C̲X, is defined in the information system S as C̲X = {x ∈ U : [x]I(C) ⊆ X}, and since it is a C-crisp set, it can be defined for an arbitrary relation I.

C-upper approximation of a rough set X ⊆ U. The upper approximation of a rough set X is composed of those elements of the universe which perhaps belong


to X, based on the indiscernibility relation I. Formally, the C-upper approximation of a set X ⊆ U, denoted C̄X, is defined in the information system S as C̄X = {x ∈ U : [x]I(C) ∩ X ≠ ∅}, and since it is a C-crisp set, it can be defined for an arbitrary relation I.

C-border of a rough set X ⊆ U. The border of a rough set is the difference between its upper and lower approximation. Formally, the C-border of a set X, denoted BnC(X), is defined as BnC(X) = C̄X − C̲X, and as a difference of two C-crisp sets, its definition is based on an arbitrary equivalence relation I.

Other notions based on the upper and/or lower approximation of a set X ⊆ U with respect to a set of attributes C include: the C-positive region of the set X ⊆ U, the C-negative region of the set X ⊆ U, sets roughly C-definable, sets internally C-undefinable, sets externally C-undefinable, sets totally C-undefinable, the roughness of a set, the C-accuracy of approximation of a set, αC(X), and the C-quality of approximation of a set, γC(X). An interesting comparison of this latter coefficient with the Dempster-Shafer theory of evidence is given by Skowron and Grzymala-Busse [35].

Rough membership function of an element x: μ^C_X(x). The coefficient describing the level of uncertainty whether an element x ∈ U belongs to a set X ⊆ U, when the indiscernibility relation I(C) generates the knowledge KC in the information system S, is a function denoted by μ^C_X(x) and defined as μ^C_X(x) = card{X ∩ [x]I(C)}/card{[x]I(C)}. This coefficient is also referred to as the rough membership function of an element x, due to its similarity to the membership function known from the theory of fuzzy sets. This function gave the basis for the generalization of rough set theory called the rough set model with variable precision [40]. This model assumes that the lower and upper approximations depend on an additional coefficient β, such that 0 ≤ β < 0.5, and are defined as C̲βX = {x ∈ U : μ^C_X(x) ≥ 1 − β} and C̄βX = {x ∈ U : μ^C_X(x) > β}, respectively. The boundary in this model is defined as Bn^β_C(X) = {x ∈ U : β < μ^C_X(x) < 1 − β}. It is easy to observe that classical rough set theory is the special case of the variable precision model with β = 0. Since ∀X ⊆ U, C̲X ⊆ C̲βX ⊆ C̄βX ⊆ C̄X, the variable precision model is a weaker form of the theory as compared to the classical model, and therefore it is often preferable in the analysis of large information systems containing some amount of contradictory data. The membership function of an element x can also be defined for a family of sets 𝐗 as μ^C_𝐗(x) = card{(⋃Xn∈𝐗 Xn) ∩ [x]I(C)}/card{[x]I(C)}. If all subsets Xn of the family 𝐗 are mutually disjoint, then ∀x ∈ U, μ^C_𝐗(x) = ΣXn∈𝐗 μ^C_Xn(x). Since the definition of the rough membership function μ^C_X(x) assumes only the existence of equivalence classes of the relation I, and the variable precision model formally differs from the classical model only in the definition of the lower and upper approximation with the use of this coefficient, all the notions presented above are defined for an arbitrary I also in this generalized model.
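The following sketch (hypothetical code, continuing the sketches above) expresses the approximations just defined; `classes` is any partition of the universe generated by an equivalence relation I(C), e.g., the output of `indiscernibility_classes`, and β is the variable precision coefficient:

```python
def approximations(classes, X, beta=0.0):
    """Lower/upper approximations of X (beta = 0 gives the classical model;
    0 <= beta < 0.5 gives the variable precision model of [40])."""
    lower, upper = set(), set()
    for E in classes:                      # E is one C-elementary set
        mu = len(E & X) / len(E)           # rough membership of E's elements
        if mu >= 1.0 - beta:
            lower |= E
        if mu > beta:
            upper |= E
    return lower, upper

U = {1, 2, 3, 4, 5, 6}
classes = [{1, 2}, {3, 4}, {5, 6}]
X = {1, 2, 3}
print(approximations(classes, X))          # ({1, 2}, {1, 2, 3, 4})
```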


Notions of rough set theory applicable to a single set X are generally applicable also to families of sets 𝐗 = {X1, X2, . . . , XN}, where Xn ⊆ U and n = 1, . . . , N. The lower approximation of a family of sets is the family of the lower approximations of the sets belonging to the family considered. Formally, C̲𝐗 = {C̲X1, C̲X2, . . . , C̲XN}. As a family of C-crisp sets, the C-lower approximation of a family of sets is defined for an arbitrary equivalence relation I. Similarly, the C-upper approximation of a family of sets is the family of the upper approximations of the sets belonging to the family considered. Formally, C̄𝐗 = {C̄X1, C̄X2, . . . , C̄XN}. This notion is valid for any equivalence relation I, for reasons identical to those presented for the C-lower approximation of a family of sets. Other notions based on the upper and/or lower approximation of a family of sets 𝐗 with respect to a set of attributes C include: the C-border of a family of sets, the C-positive region of a family of sets, the C-negative region of a family of sets, the C-accuracy of approximation of a family of sets, and the C-quality of approximation of a family of sets. This latter coefficient is especially interesting for the application presented in the subsequent section, since it is used as an objective function in the optimization procedure of the feature extractor. For this purpose, the considered family of sets is the family of abstract classes generated by the decision attribute d, being the class of the image to be recognized (see Section 3). Here, we define this coefficient for any family of sets 𝐗 as γC(𝐗) = card[PosC(𝐗)]/card(U).

Conclusion. The analysis of the above notions indicates that they do not require any particular form of the indiscernibility relation (such as the classical form referred to as I0). They are defined for any form of the indiscernibility relation (satisfying reflexivity, symmetry, and transitivity), denoted by I, and are strict analogs of the classical notions defined under the assumption of the original form of the indiscernibility relation I0 given in (1) and (2). Therefore, the exact form of the indiscernibility relation, as proposed by the classical theory of rough sets as well as by its generalization named the variable precision model, is not actually required for the presented notions to create a coherent logical system. Some papers, referred to in the introduction, go further in this generalizing tendency, resigning from the requirement of an equivalence relation; working with such generalizations, however, is often not natural in problems, such as classification, where the notion of abstract classes, inherent in an equivalence relation, is of great importance. Therefore, we propose a modification of the indiscernibility relation which is particularly useful in pattern recognition problems dealing with a space of continuous attributes, and which is defined in terms of an equivalence relation.

To introduce the modification formally, let us change the notation of the indiscernibility relation so that it now depends on a family of sets of attributes instead of simply on a set of attributes. By a family of sets of attributes we understand a subset of the power set of the set of attributes such that all elements of this subset (these elements are subsets of the set of attributes) are mutually disjoint and their union is equal to the considered set of attributes. This allows us to introduce some structure into the originally unstructured set of attributes on which the relation depends [13]. Let 𝐂 = {C1, C2, . . . , CN} be the above-introduced family of disjoint sets of attributes Cn ⊆ Q, such that the unstructured set of attributes C ⊆ Q is equal to the union of the members of the family 𝐂, i.e., C = ⋃Cn∈𝐂 Cn.
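As a small illustration (hypothetical code, continuing the sketches above), the quality of approximation of a family of decision classes can be computed from the lower approximations alone, since PosC(𝐗) is the union of the lower approximations of the members of 𝐗:

```python
def quality_of_approximation(classes, family):
    """gamma_C(X) = card(Pos_C(X)) / card(U), where Pos_C(X) is the union
    of the lower approximations of the sets X_n in the family."""
    universe = set().union(*classes)
    positive = set()
    for X in family:
        lower, _ = approximations(classes, X)
        positive |= lower
    return len(positive) / len(universe)

# Family of decision classes over U = {1,...,6} with the partition above.
print(quality_of_approximation([{1, 2}, {3, 4}, {5, 6}],
                               [{1, 2, 3}, {4, 5, 6}]))   # 4/6 ~ 0.67
```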


Then, let the indiscernibility relation be dependent on 𝐂 instead of on C. Observe that both 𝐂 and C contain the same collection of single attributes; however, 𝐂 includes additional structure as compared to C. If this structure is irrelevant for the problem considered, it can simply be ignored, and we obtain, as a special case, the classical version of the indiscernibility relation I0. However, we can also obtain other versions of this modified relation for which the introduced structure is meaningful. Let the relation I1(𝐂) ⊆ U × U be the form of the relation I, different from I0, given by

x I1(𝐂) y ⇔ ∀Cn ∈ 𝐂, Clus(x, Cn) = Clus(y, Cn),    (3)

where x, y ∈ U, and Clus(x, Cn) denotes the number of the cluster that the element x belongs to. The cluster analysis is therefore required to be performed in the continuous vector spaces defined by the sets of real-valued conditional attributes Cn ∈ 𝐂. There are two extreme cases of this relation, obtained when the family 𝐂 is composed of exactly one set of conditional attributes C, and when the family 𝐂 is composed of card(C) sets, each containing exactly one conditional attribute q ∈ C. The classical form I0 of the indiscernibility relation is obtained as the latter extreme special case of the modified version I1, because then clustering and discretization are performed separately for each continuous attribute. Formally, this can be written as

I0(Λ[C]) ≡ I1(𝐂) ⇔ 𝐂 = {{qn} : qn ∈ C} ∧ Clus(x, {qn}) = f(x, Λ[qn]).    (4)

In other words, the classical form I0 of the indiscernibility relation can be obtained as a special case of the modified version I1 if we assume that the family 𝐂 is composed of subsets Cn each containing just one attribute, and that the discretization of each continuous attribute is based on a separate cluster analysis, as required by the scalar function Λ applied to each attribute qn.

Here we discuss some of the notions of rough set theory that cannot be used in their usual sense with the modified indiscernibility relation. We start with the so-called basic sets, which are the abstract classes of the relation I({q}) defined for a single attribute q. These are simply sets composed of elements indiscernible with respect to the single attribute q. Obviously, this notion loses its meaning when I1 is used instead of I0, because abstract classes generated by I0({q}) are always unions of some abstract classes generated by I0(C), whereas abstract classes generated by I1({q}) are not necessarily unions of abstract classes generated by I1(𝐂). Therefore the conclusion that the knowledge K{q} generated by I0({q}) is always more general than the knowledge KC generated by I0(C) no longer holds when I1 is used instead of I0. Similarly, the notions of reducts, relative reducts, cores, and relative cores are no longer applicable in their classical sense, since their definitions are strongly associated with single attributes. Joining these attributes into members of the family 𝐂 destroys the individual treatment of attributes required for these notions to have their well-known meaning.
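A minimal sketch of I1 (hypothetical code; scikit-learn's KMeans stands in for any clustering method, and the attribute grouping is illustrative) clusters each attribute group jointly and declares two objects indiscernible when they share a cluster in every group:

```python
import numpy as np
from sklearn.cluster import KMeans

def modified_indiscernibility_classes(data, groups, n_clusters=3):
    """Classes of I1(C): objects x, y are indiscernible iff, for every
    attribute group C_n, they fall into the same cluster of the vector
    space spanned by C_n (equation (3))."""
    labels = []
    for cols in groups:                       # one clustering per C_n
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        labels.append(km.fit_predict(data[:, cols]))
    signatures = list(zip(*labels))           # Clus(x, C_1), ..., Clus(x, C_N)
    classes = {}
    for i, sig in enumerate(signatures):
        classes.setdefault(sig, set()).add(i)
    return list(classes.values())

X = np.random.RandomState(1).rand(40, 4)      # 40 objects, 4 real attributes
# Family C = {C_1, C_2}: attributes 0-1 clustered jointly, likewise 2-3.
print(modified_indiscernibility_classes(X, groups=[[0, 1], [2, 3]]))
```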


However, as long as the usage of rough set theory in a continuous attribute space does not exceed the collection of notions described ahead of definition (3), the modified version I1 should be considered more advantageous than the classical form I0. In particular, this is true in the processing of knowledge obtained from the holographic ring wedge detector, where the quality of approximation of a family of sets plays the major role. We present this application as an illustrative example.

3 Application to Fraunhofer Pattern Recognizer

The system presented below belongs to a class of fast hybrid opto-electronic pattern recognizers. The feature extraction subsystem processes the information optically. Let us start the description of this feature extractor by giving the physical basis required to understand the properties of the feature vectors generated by this subsystem, followed by the description of the enhanced method of HRWD optimization and the experimental results of its use. This illustrative section is completed with the description of the probabilistic neural network (PNN) based classifier and the experimental results of its application to Fraunhofer pattern recognition.

3.1 Optical Foundations

In a homogeneous and isotropic medium which is free of charges (ρ = 0) and currents (j = 0), the Maxwell equations result in the wave equation

∇²G − εμ ∂²G/∂t² = 0,    (5)

where G denotes the electric (E) or magnetic (H) field, and the product εμ is the reciprocal of the squared velocity of the wave in the medium. Application of this equation to a space with obstacles like apertures or diaphragms should result in equations describing the diffraction of light at these obstacles. However, the solution is very complicated for special cases and impossible for the general case. Therefore a simplification is used which assumes a scalar field u instead of the vector field G. In such a case the information about the light polarization is lost, and it holds that

∇²u − (1/ν²) ∂²u/∂t² = 0.    (6)

The theory simplified in this way, called the scalar Kirchhoff theory, describes the diffraction of light at various obstacles. According to this theory, the scalar complex amplitude u₀(P) of a light oscillation caused by the diffraction is given at a point of observation P by the Kirchhoff integral [33]

u₀(P) = (1/4π) ∮_Σ [ (e^{ikr}/r) (du₀/dn) − u₀ (d/dn)(e^{ikr}/r) ] dΣ,    (7)

where Σ denotes a closed surface containing the point P and not containing the light source, n is the external normal to the surface Σ, k = 2π/λ is the propagation constant,


u₀ denotes the scalar amplitude on the surface Σ, and r is the distance from a point of the surface Σ to the observation point P. Formula (7) states that the amplitude u₀ at the point P does not depend on the state of oscillations in the whole area surrounding this point (which would result from the Huygens theory), but depends only on the state of oscillations on the surface Σ; all other oscillations inside this surface cancel each other. Application of the Kirchhoff theorem to diffraction on a flat diaphragm with an aperture of any shape and size gives an integral stretched only over a surface Σ_A covering the aperture. Such an integral can be transformed to [33]

u₀(P) = −(ik/4π) ∫_{Σ_A} u₀ (1 + cos θ) (e^{ikr}/r) dΣ_A,    (8)

where θ denotes the angle between the radius r from a point of the aperture to the point of observation and the internal normal of the aperture. Since any transparent image is, in fact, a collection of diaphragms and apertures of various shapes and sizes, such an image, when illuminated by coherent light, generates a diffraction pattern described in the scalar approximation by the Kirchhoff integral (7). Let the coordinates of any point A in the image plane be denoted by (x, y), and let the amplitude of the light oscillation at this point be ν(x, y). Furthermore, let the coordinates (ξ, η) of an observation point P be chosen as

ξ = (2π/λ) sin θ,  η = (2π/λ) sin ϕ,    (9)

where λ denotes the wavelength of the light, whereas θ and ϕ are the angles between the radius from the point of observation P to the point A and the planes (x, z) and (y, z), respectively. These planes are two planes of the coordinate system (x, y, z) whose axes x and y lie in the image plane, and whose axis z is perpendicular to the image plane (it is called the optical axis). Let the coordinate system (x′, y′) be the system with its origin at the point P and such that its plane (x′, y′) is parallel to the plane of the coordinate system (x, y). It is worth noticing that the coordinates of one particular point in the observation system (ξ, η) correspond to the coordinates of all points P of the system (x′, y′) such that the angles between the axis z and a line connecting these points with some points A of the plane (x, y) are θ and ϕ, respectively. In other words, all radii AP connecting points A of the plane (x, y) and points P of the plane (x′, y′) which are parallel to each other are represented in the system (ξ, η) by one point. Such a transformation of the coordinate systems is physically obtained in the back focal plane of a lens placed perpendicularly to the optical axis z. In this case, all parallel radii represent parallel light beams, diffracted on the image (see Fig. 1) and focused at the same point in the focal plane. Moreover, the integral (7), when expressed in the coordinate system (ξ, η), can be transformed to [33]

u₀(ξ, η) = (1/2π) ∫∫_{−∞}^{∞} ν(x, y) e^{−i(ξx+ηy)} dx dy.    (10)


Fig. 1. The operation of the spherical lens (the diagram labels the point P, the angle α, the heights R and r_f, the focal length f, and the distances l and l′)

Geometrical relationships (Fig. 1) reveal that

r_f = R (l′ − f)/l′.    (11)

On the other hand, the operation of the lens is given by

1/f = 1/l + 1/l′.    (12)

Putting this equation into (11), after elementary algebra we obtain

r_f/R = f/l.    (13)

Since the angles θ and ϕ (corresponding to the angle α in Fig. 1, in the planes (x, z) and (y, z), respectively) are small, equations (9), having in mind (13), can be rewritten as

ξ = (2π/λ)(x_f/f),  η = (2π/λ)(y_f/f),    (14)

where x_f and y_f denote the Cartesian coordinates in the focal plane of the lens. Equation (10) expressed in these coordinates can be written as

u₀(x_f, y_f) = (1/2π) ∫∫_{−∞}^{∞} ν(x, y) e^{−i2π(x_f x + y_f y)/(λf)} dx dy.    (15)

Setting new coordinates (u, v) as

u = x_f/(λf),  v = y_f/(λf),    (16)

we finally have the equation

u₀(u, v) = (1/2π) ∫∫_{−∞}^{∞} ν(x, y) e^{−i2π(ux+vy)} dx dy,    (17)


which is (up to a constant factor k) a Fourier integral. This is essentially the Fraunhofer approximation of the Kirchhoff integral, and it is also referred to as the Fraunhofer diffraction pattern [27]. The complex amplitude of the Fraunhofer diffraction pattern obtained in the back focal plane of the lens is therefore the Fourier transform of the complex amplitude from the image plane:

u₀(u, v) = k F{ν(x, y)}.    (18)

This fact is very often used in the design of hybrid systems for the recognition of images in the spatial frequency domain. One prominent example is a system with a feature extractor built as an HRWD placed in the back focal plane of the lens. The HRWD itself consists of two parts: a part composed of rings Ri and a part containing wedges Wj. Each of the elements Ri or Wj is covered with a grating of a particular spatial frequency and orientation, so that the light passing through the given region is diffracted and focused, by some other lens, at a certain cell of an array of photodetectors. Each photodetector, in turn, integrates the intensity of the light and generates one feature used in classification.
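The whole optical chain can be mimicked numerically: by (18), the Fraunhofer pattern is a Fourier transform of the image, and a ring or wedge feature is the integrated intensity over the corresponding region. The sketch below (hypothetical code, with the ring/wedge geometry simplified) illustrates this digital equivalent:

```python
import numpy as np

def hrwd_features(image, n_rings=8, n_wedges=8):
    """Digital analogue of the HRWD: integrate the Fraunhofer intensity
    |F{image}|^2 over concentric rings and angular wedges (cf. (18))."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    h, w = spectrum.shape
    y, x = np.mgrid[:h, :w]
    r = np.hypot(x - w / 2, y - h / 2)
    phi = np.arctan2(y - h / 2, x - w / 2) % np.pi  # symmetric spectrum,
    r_bins = np.linspace(0, r.max() + 1e-9, n_rings + 1)   # half-plane suffices
    w_bins = np.linspace(0, np.pi, n_wedges + 1)
    rings = [spectrum[(r >= r_bins[i]) & (r < r_bins[i + 1])].sum()
             for i in range(n_rings)]
    wedges = [spectrum[(phi >= w_bins[j]) & (phi < w_bins[j + 1])].sum()
              for j in range(n_wedges)]
    return np.array(rings + wedges)           # N = N_R + N_W features

img = np.zeros((64, 64)); img[28:36, 20:44] = 1.0   # a toy aperture
print(hrwd_features(img).round(1))
```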

3.2 Enhanced Optimization Method

The system considered above can be used for the recognition of images invariant with respect to translation, rotation, and size, based on the properties of the Fourier transform and on the way the Fraunhofer diffraction pattern is sampled by the HRWD. The standard HRWD based feature extractor can be optimized to obtain even better recognition properties of the system. To perform any optimization, one needs an objective function and a method of searching the space of solutions. These two problems are discussed in more detail below. Let the ordered 5-tuple T = ⟨U, C, {d}, v, f⟩ be the decision table obtained from the information system S = ⟨U, Q, v, f⟩ by a decomposition of the set of attributes Q into two mutually disjoint sets: the set of conditional attributes C and the set {d} composed of one decision attribute d. Let each conditional attribute c ∈ C be one feature obtained from the HRWD, and let the decision attribute d be the number of the class to be recognized. Obviously, the domain of any such conditional attribute is ℝ, and the domain of the decision attribute d is a subset of the first natural numbers, with cardinality equal to the number of recognized classes. Furthermore, let D = {[xn]I0({d}) : xn ∈ U} be the family of sets of images where each set contains all images belonging to the same class. Observe that the classical form of the indiscernibility relation I0 is used in this definition, due to the discrete nature of the domain of the decision attribute d. Based on the results of the discussion given by Cyran and Mrozek [10], we argue that the rough set based coefficient called the quality of approximation of the family D by the conditional attributes belonging to C, denoted by γC(D), is a good objective function in the optimization of a feature extractor in problems with a multimodal distribution of classes in the feature space. This is so because this coefficient indicates the level of determinism of the decision table, which in turn is relevant for the classification. On the other hand, based on the conclusion given


in Subsection 2.2, in the case of real-valued attributes C, the preferred form of the indiscernibility relation, so crucial for rough set theory in general (and therefore for the computation of the γC(D) objective in particular), is the form defined by (3). Therefore the optimization with the objective function γC(D) computed with respect to the classical form of the indiscernibility relation for real-valued attributes C, given in (2), produces sub-optimal solutions. This drawback can be eliminated if the modified version proposed in (3) is used instead of the classical form defined in (2). However, the generalized form (3) requires the definition of some structure in the set of conditional attributes. This is task dependent; in our case the architecture of the feature extractor, having different properties of wedges and rings, defines a natural structure as a family 𝐂 = {CR, CW} composed of two sets: a set of attributes CR corresponding to the rings and a set of attributes CW corresponding to the wedges. With this structure introduced into the set of conditional attributes, the coefficient γC(D) computed with respect to the modified indiscernibility relation (3) is an enhanced objective function for the optimization of the HRWD.

Since the enhanced objective function defined above is not differentiable, gradient-based search methods are excluded. However, the HRWD can be optimized in the framework of an evolutionary algorithm. The maximum fitness value of 97%, having the meaning of γC(D*) = 0.97, was obtained in generation 976 for a population composed of 50 individuals (Fig. 2). The computer-generated mask of the optimal HRWD, named x_opt, is designed for a system with coherent light of wavelength λ = 635 nm, emitted by a laser diode, and for a lens L with a focal length f_L = 1 m. In order to keep the resolution capability of the system, the diameter of the HRWD in the Fourier plane should be equal to the diameter of the Airy disc given by s_HRWD = 4 × 1.22 × λ × f_L / s_min = 2.07 mm, if the assumed minimum size of recognizable objects is s_min = 1.5 mm. Assuming also a rectangular array of photodetectors of size s = 5 mm, forming four rows (i = 1, . . . , 4) and four columns (j = 1, . . . , 4), and setting the vertical distance from the optical axis to the upper edge of the array to H = 50 mm, we obtain the values of the angles θij presented in Table 1; similar results for the distances dij are given in Table 2.
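As a quick sanity check of the sizing formula (a hypothetical helper snippet; the constants are those quoted above):

```python
# Airy-disc sizing of the HRWD: s_HRWD = 4 * 1.22 * lambda * f_L / s_min.
lam, f_L, s_min = 635e-9, 1.0, 1.5e-3          # metres
s_hrwd = 4 * 1.22 * lam * f_L / s_min
print(f"s_HRWD = {s_hrwd * 1e3:.2f} mm")       # 2.07 mm, as in the text
```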

Fig. 2. Process of evolutionary optimization of the HRWD. The curves present the fitness of x_opt expressed in percent: a) linear scale, b) logarithmic horizontal scale.


Table 1. The values of angles θij (expressed in degrees) defining the HRWD gratings

i ↓ \ j →      1      2      3      4
1           3.01   8.97  14.74  20.22
2           3.37  10.01  16.39  22.38
3           3.81  11.31  18.43  25.02
4           4.40  12.99  21.04  28.30

Table 2. Distances dij between striae [μm]

i ↓ \ j →      1      2      3      4
1          13.35  13.20  12.93  12.54
2          14.92  14.71  14.33  13.82
3          16.90  16.60  16.06  15.34
4          19.48  19.04  18.24  17.20

Table 3. Distances dij between striae, in the units used by the software generating HRWD masks

i ↓ \ j →      1      2      3      4
1          12.92  12.78  12.52  12.14
2          14.44  14.24  13.88  13.38
3          16.36  16.08  15.55  14.86
4          18.86  18.43  17.65  16.65

Since the software for generating HRWD masks has been designed in such a way that the distances dij are given in units equal to one-tenth of a percent of the radius of the HRWD, for R_HRWD = s_HRWD/2 = 1.035 mm we give in Table 3 the corresponding values expressed in these units.

3.3 PNN-Based Classification

In our design, the input layer of the probabilistic neural network (PNN) used as the classifier is composed of N elements processing the N-dimensional feature vectors generated by the HRWD (N = NR + NW). The pattern layer consists of M pools of pattern neurons, associated with the M classes of intermodal interference to be recognized. In that layer we used RBF neurons with a Gaussian transfer function as the kernel function; the width of the kernel function is then simply the standard deviation σ of the Gaussian bell. Each neuron of the pattern layer is connected with every neuron of the input layer, and the weight vectors of the pattern layer are equal to the feature vectors present in the training set. Contrary to the pattern layer, the summation layer, consisting of M neurons, is organized in such a way that each of its neurons is connected only with the pattern neurons of a single pool. When using such networks as classifiers, formally there is a need to multiply the output values by the prior probabilities Pj. However, in our case all priors are equal, and therefore the results can be obtained directly at the outputs of the network.
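A compact sketch of such a classifier (hypothetical code; a textbook PNN with Gaussian kernels, not the authors' implementation) could look as follows:

```python
import numpy as np

class PNN:
    """Probabilistic neural network: the pattern layer stores the training
    vectors, the summation layer averages Gaussian kernels per class."""
    def __init__(self, sigma=0.125):
        self.sigma = sigma

    def fit(self, X, y):
        self.X, self.y = np.asarray(X), np.asarray(y)
        self.classes = np.unique(self.y)
        return self

    def predict(self, X):
        out = []
        for x in np.asarray(X):
            # Pattern layer: one Gaussian per stored training vector.
            k = np.exp(-np.sum((self.X - x) ** 2, axis=1)
                       / (2 * self.sigma ** 2))
            # Summation layer: mean activation within each class pool
            # (equal priors, so no reweighting is needed).
            scores = [k[self.y == c].mean() for c in self.classes]
            out.append(self.classes[int(np.argmax(scores))])
        return np.array(out)

# Toy usage with 16-dimensional HRWD-like features.
rng = np.random.RandomState(0)
Xtr = rng.rand(24, 16); ytr = rng.randint(0, 8, size=24)
print(PNN(sigma=0.125).fit(Xtr, ytr).predict(Xtr[:3]))
```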


We verified the recognition abilities by the classification of speckle structure images obtained from the output of an optical fiber. The experiments were conducted for a set of 128 images of speckle patterns generated by the intermodal interference occurring in the optical fiber, belonging to eight classes and taken in 16 sessions Sl (l = 1, . . . , 16). The Fraunhofer diffraction patterns of the input images were obtained by calculating the intensity patterns from the discrete Fourier transform equivalent to (17). The training set consisted of 120 images taken in 15 sessions, and the testing set contained 8 images, belonging to different classes, representing one session Sl. The process of training and testing was performed 16 times, according to the delete-8 jackknife method, i.e., for each iteration another session composed of 8 images was used as the testing set, and all the remaining sessions were used as the training set. This gave the basis for reliable cross-validation with a still reasonable number of images used for training and reasonable computational time. This time was eight times shorter compared to the classical leave-one-out method, which, for all discussions in this paper, is equivalent to the delete-1 jackknife method, since the only difference, the resubstitution error of a prediction model, is not addressed in the paper. The jackknife method was used for cross-validation of the PNN results because of its unbiased estimation of the true error in probabilistic classification (contrary to the underestimated error, although with smaller variance, obtained by the bootstrap method) [1,39]. Therefore, the choice of the delete-8 jackknife method was a sort of tradeoff between accuracy (the standard deviation of the estimated normalized decision error was 0.012), an unbiased estimate of the error, and computational effort. The results of such testing of the PNN applied to the classification of images in the feature spaces obtained from the standard HRWD, the optimized HRWD, and the HRWD optimized with the modified indiscernibility relation are presented in Table 4. More detailed results of all jackknife tests are presented in Table 5, Fig. 3, and Fig. 4. The normalized decision errors, ranging from 1.5 to 2 percent, indicate good overall recognition abilities of the system. A 20% reduction of this error is obtained by the optimization of the HRWD with the classical indiscernibility relation. A further 6% error reduction is caused solely by the modification of the indiscernibility relation according to (3).

Table 4. Results of testing the classification abilities of the system. The classifier is a PNN with a Gaussian radial function with standard deviation σ = 0.125. In the last column the improvement is computed with respect to the standard HRWD (first value) and with respect to the HRWD optimized with the standard indiscernibility relation (value in parentheses).

Variant                                                  Correct decisions [%]  Normalized decision error [%]  Improvement [%]
Standard HRWD                                            84.4                   1.95                           0.0 (-25.0)
HRWD optimized with standard indiscernibility relation   87.5                   1.56                           20.0 (0.0)
HRWD optimized with modified indiscernibility relation   88.3                   1.46                           25.1 (6.4)


Fig. 3. Results of testing the HRWD-PNN system: cumulative number of bad decisions for the standard HRWD, the optimized HRWD, and the HRWD optimized with the modified indiscernibility relation. The horizontal axis represents the number of the test, the vertical axis the cumulative number of bad decisions. From test 9 onwards, the cumulative number of bad decisions is better for the optimization of the HRWD with the modified indiscernibility relation than for the optimization with the classical version of this relation.

Fig. 4. Results of testing the HRWD-PNN system: normalized decision errors [%] for the standard HRWD, the optimized HRWD, and the HRWD optimized with the modified indiscernibility relation. The horizontal axis represents the number of the test, while the vertical axis is the normalized decision error averaged over the tests from the first to the one indicated on the horizontal axis. Observe that, for averaging over more than 8 tests, the results for recognition with the HRWD optimized with the modified indiscernibility relation outperform both the results for the HRWD optimized with the classical version of the indiscernibility relation and the results for the standard HRWD.

In order to appreciate the scale of this improvement, which may not look too impressive at first glance, one should refer to Fig. 2 and take into consideration that this additional 6% error reduction is obtained over an already optimized solution.


Table 5. Results of PNN testing for tests number 1 to 16 (number of bad decisions per test session). The results differing between optimization with the standard and the modified version of the indiscernibility relation occur in tests 6, 7, and 9; the modified relation improves the result in tests 7 and 9 and worsens it in test 6.

Number of test session:                              1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Standard HRWD                                        1 2 2 1 2 0 1 0 1  1  4  0  0  1  0  4
Optimized with standard indiscernibility relation    1 1 3 0 1 0 2 0 2  1  1  0  0  1  1  2
Optimized with modified indiscernibility relation    1 1 3 0 1 1 1 0 1  1  1  0  0  1  1  2

The level of difficulty can be grasped by observing that, on average, the increase of the objective function is well mimicked by a straight line if the generation-number axis is drawn in a log scale. This means that the growth of the objective is, on average, well approximated by a logarithmic function of the generation number. It experimentally reflects the well-known fact that the better the current solution is, the harder it is to optimize it further (harder meaning that it requires more generations of the evolutionary process).

4 Discussion and Conclusions

The paper presents a modification of the indiscernibility relation used in the theory of rough sets. This theory has been successfully applied to many machine learning and artificial intelligence oriented problems. However, it is a well-known limitation of this theory that it processes continuous attributes in an unnatural way. To support more natural processing, a modification of the indiscernibility relation has been proposed (3), such that the indiscernibility relation remains an equivalence relation, but the processing of continuous attributes becomes more natural. This modification introduces information about structure into the collection of attributes on which the relation depends, unstructured in the classical version. It has been shown that the classical relation is a special case of the modified version; therefore the proposed modification can be recognized as more general (yet not as general as indiscernibility relations which are no longer equivalence relations). Remarkably, the proposed generalization is equally valid for the classical theory of rough sets and for the variable precision model, predominantly used in machine learning applied to huge data sets. The modification of the indiscernibility relation proposed in the paper introduces flexibility in the definition of the particular special case which is most natural for a given application. In the case of real-valued attributes, our modification allows for performing a multidimensional cluster analysis, contrary to the multiple one-dimensional analyses required by the classical form. In the majority of cases, the cluster analysis should be performed in the space generated by all attributes. This corresponds to a family 𝐂 composed of one set (card(𝐂) = 1) containing all conditional


attributes, and it is the opposite case compared with the classical relation, which assumes that the family 𝐂 is composed of one-element disjoint sets and therefore satisfies the equation card(𝐂) = card(C). However, other, less extreme cases are allowed as well, and in the experimental study we use a family 𝐂 = {CR, CW} composed of two sets containing 8 elements each. Such a structure seems natural for an application having a two-way architecture, like the HRWD based feature extractor. The presented modification has been applied in the optimization procedure of a hybrid opto-electronic pattern recognition system composed of an HRWD and a PNN. It allowed the recognition abilities to be improved by reducing the normalized decision error by 6.5%, if the system optimized with the classical indiscernibility relation is treated as the reference. One should notice that this improvement is achieved with respect to a reference which is an already optimized solution, which makes any further improvement difficult. The obtained results experimentally confirm our claims concerning the sub-optimality of earlier solutions. The presented experiment is an illustration of the application of the proposed methodology to a hybrid pattern recognizer. However, we think that the presented modification of the indiscernibility relation will find many more applications in rough set based machine learning, since it gives a natural way of processing real-valued attributes within a rough set based formalism. Certainly, there are also limitations. Because some notions known in rough set theory lose their meaning when the modified relation is applied, if for any reason they are supposed to play a relevant role in a problem, the proposed modification can hardly be applied in any form other than the classical special case. One prominent example concerns the so-called basic sets in a universe U, defined by the indiscernibility relation computed with respect to single attributes, as opposed to the modified relation, predominantly designed to deal with sets of attributes defining a vector space used for a common cluster analysis. This modification is especially useful in the case of information systems with real-valued conditional attributes representing a vector space ℝ^N, such as systems of non-syntactic pattern recognition. The experimental example belongs to this class of problems and illustrates the potential of the modified indiscernibility relation for processing real-valued data in a rough set based theory.

References

1. Azuaje, F.: Genomic data sampling and its effect on classification performance assessment. BMC Bioinformatics 4(1), 5–16 (2003)
2. Berfanger, D.M., George, N.: All-digital ring-wedge detector applied to fingerprint recognition. App. Opt. 38(2), 357–369 (1999)
3. Berfanger, D.M., George, N.: All-digital ring wedge detector applied to image quality assessment. App. Opt. 39(23), 4080–4097 (2000)
4. Casasent, D., Song, J.: A computer generated hologram for diffraction-pattern sampling. Proc. SPIE 523, 227–236 (1985)


5. Cyran, K.A.: Integration of classifiers working in discrete and real valued feature space applied in two-way opto-electronic image recognition system. In: Proc. of IASTED Conference: Visualization, Imaging & Image Processing, Benidorm, Spain (accepted, 2005)
6. Cyran, K.A.: PLD-based rough classifier of Fraunhofer diffraction pattern. In: Proc. Int. Conf. Comp. Comm. Contr. Tech., Orlando, pp. 163–168 (2003)
7. Cyran, K.A., Jaroszewicz, L.R.: Concurrent signal processing in optimized hybrid CGH-ANN system. Opt. Appl. 31, 681–689 (2001)
8. Cyran, K.A., Jaroszewicz, L.R.: Rough set based classification of interferometric images. In: Jacquot, P., Fournier, J.M. (eds.) Interferometry in Speckle Light: Theory and Applications, pp. 413–420. Springer, Heidelberg (2000)
9. Cyran, K.A., Jaroszewicz, L.R., Niedziela, T.: Neural network based automatic diffraction pattern recognition. Opto-elect. Rev. 9, 301–307 (2001)
10. Cyran, K.A., Mrozek, A.: Rough sets in hybrid methods for pattern recognition. Int. J. Intell. Sys. 16, 149–168 (2001)
11. Cyran, K.A., Niedziela, T., Jaroszewicz, L.R.: Grating-based DOVDs in high-speed semantic pattern recognition. Holography 12(2), 10–12 (2001)
12. Cyran, K.A., Niedziela, T., Jaroszewicz, J.R., Podeszwa, T.: Neural classifiers in diffraction image processing. In: Proc. Int. Conf. Comp. Vision Graph., Zakopane, Poland, pp. 223–228 (2002)
13. Cyran, K.A., Stanczyk, U.: Indiscernibility relation for continuous attributes: Application in image recognition. In: Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A. (eds.) RSEISP 2007. LNCS, vol. 4585, pp. 726–735. Springer, Heidelberg (2007)
14. Cyran, K.A., Stanczyk, U., Jaroszewicz, L.R.: Subsurface stress monitoring system based on holographic ring-wedge detector and neural network. In: McNulty, G.J. (ed.) Quality, Reliability and Maintenance, pp. 65–68. Professional Engineering Publishing, Bury St Edmunds, London (2002)
15. Doherty, P., Szalas, A.: On the correspondence between approximations and similarity. In: Tsumoto, S., Slowiński, R., Komorowski, J., Grzymala-Busse, J.W. (eds.) RSCTC 2004. LNCS, vol. 3066, pp. 143–152. Springer, Heidelberg (2004)
16. Fares, A., Bouzid, A., Hamdi, M.: Rotation invariance using diffraction pattern sampling in optical pattern recognition. J. of Microwaves and Optoelect. 2(2), 33–39 (2000)
17. Ganotra, D., Joseph, J., Singh, K.: Modified geometry of ring-wedge detector for sampling Fourier transform of fingerprints for classification using neural networks. Proc. SPIE 4829, 407–408 (2003)
18. Ganotra, D., Joseph, J., Singh, K.: Neural network based face recognition by using diffraction pattern sampling with a digital ring-wedge detector. Opt. Comm. 202, 61–68 (2002)
19. George, N., Wang, S.: Neural networks applied to diffraction-pattern sampling. Appl. Opt. 33, 3127–3134 (1994)
20. Gomolinska, A.: A comparative study of some generalized rough approximations. Fundamenta Informaticae 51(1), 103–119 (2002)
21. Grzymala-Busse, J.W.: Data with missing attribute values: Generalization of indiscernibility relation and rule induction. In: Peters, J.F., Skowron, A., Grzymala-Busse, J.W., Kostek, B., Świniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 78–95. Springer, Heidelberg (2004)


22. Grzymala-Busse, J.W.: Rough set strategies to data with missing attribute values. In: Proceedings of the Workshop on Foundations and New Directions in Data Mining, associated with the third IEEE International Conference on Data Mining, Melbourne, FL, USA, November 19–22, 2003, pp. 56–63 (2003)
23. Jaroszewicz, L.R., Cyran, K.A., Podeszwa, T.: Optimized CGH-based pattern recognizer. Opt. Appl. 30, 317–333 (2000)
24. Jaroszewicz, L.R., Merta, I., Podeszwa, T., Cyran, K.A.: Airplane engine condition monitoring system based on artificial neural network. In: McNulty, G.J. (ed.) Quality, Reliability and Maintenance, pp. 179–182. Professional Engineering Publishing, Bury St Edmunds, London (2002)
25. Jarvinen, J.: Approximations and rough sets based on tolerances. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 182–189. Springer, Heidelberg (2001)
26. Kaye, P.H., Barton, J.E., Hirst, E., Clark, J.M.: Simultaneous light scattering and intrinsic fluorescence measurement for the classification of airborne particles. App. Opt. 39(21), 3738–3745 (2000)
27. Kreis, T.: Holographic interferometry: Principles and methods. Akademie Verlag Series in Optical Metrology, vol. 1. Akademie-Verlag (1996)
28. Mait, J.N., Athale, R., van der Gracht, J.: Evolutionary paths in imaging and recent trends. Optics Express 11(18), 2093–2101 (2003)
29. Mrozek, A.: A new method for discovering rules from examples in expert systems. Man-Machine Studies 36, 127–143 (1992)
30. Mrozek, A.: Rough sets in computer implementation of rule-based control of industrial processes. In: Slowinski, R. (ed.) Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory, pp. 19–31. Kluwer Academic Publishers, Dordrecht (1992)
31. Nebeker, B.M., Hirleman, E.D.: Light scattering by particles and defects on surfaces: semiconductor wafer inspection. Lecture Notes in Physics, vol. 534, pp. 237–257 (2000)
32. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic, Dordrecht (1991)
33. Piekara, A.H.: New aspects of optics: introduction to quantum electronics and in particular to nonlinear optics and optics of coherent light [in Polish]. PWN, Warsaw (1976)
34. Podeszwa, T., Jaroszewicz, L.R., Cyran, K.A.: Fiberscope based engine condition monitoring system. Proc. SPIE 5124, 299–303 (2003)
35. Skowron, A., Grzymala-Busse, J.W.: From rough set theory to evidence theory. In: Yager, R.R., Fedrizzi, M., Kacprzyk, J. (eds.) Advances in the Dempster-Shafer Theory of Evidence, pp. 193–236. Wiley & Sons, NY (1994)
36. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundamenta Informaticae 27, 245–253 (1996)
37. Slowinski, R., Vanderpooten, D.: A generalized definition of rough approximations based on similarity. IEEE Transactions on Knowledge and Data Engineering 12(2), 331–336 (2000)
38. Slowinski, R., Vanderpooten, D.: Similarity relation as a basis for rough approximations. In: Wang, P.P. (ed.) Advances in Machine Intelligence and Soft Computing, pp. 17–33. Bookwrights, Raleigh (1997)
39. Twomey, J.M., Smith, A.E.: Bias and variance of validation methods for function approximation neural networks under conditions of sparse data. IEEE Trans. Sys., Man, and Cyber. 28(3), 417–430 (1998)
40. Ziarko, W.: Variable precision rough set model. J. Comp. Sys. Sci. 46, 39–59 (1993)

On Certain Rough Inclusion Functions

Anna Gomolińska

Bialystok University, Department of Mathematics, Akademicka 2, 15-267 Bialystok, Poland
[email protected]

Abstract. In this article we further explore the idea which led to the standard rough inclusion function. As a result, two more rough inclusion functions (RIFs, in short) are obtained, different from the standard one and from each other. With every RIF we associate a mapping which is in some sense complementary to it. Next, these complementary mappings (co-RIFs) are used to define certain metrics. As it turns out, one of these distance functions is an instance of the Marczewski–Steinhaus metric. While the distance functions may directly be used to measure the degree of dissimilarity of sets of objects, their complementary mappings, also discussed here, are useful in measuring the degree of mutual similarity of sets.

Keywords: rough inclusion function, rough mereology, distance and similarity between sets.

1 Introduction

Broadly speaking, rough inclusion functions (RIFs) are mappings with which one can measure the degree of inclusion of a set in a set. The formal notion of rough inclusion was worked out within rough mereology, a theory proposed by Polkowski and Skowron [2,3,4]. Rough mereology extends Leśniewski's mereology [5,6], a formal theory of being-a-part, to the case of being-a-part-to-a-degree. The standard RIF is certainly the most famous RIF. Its definition, based on frequency counts, is closely related to the definition of conditional probability. The idea underlying the standard RIF was explored by Łukasiewicz in

This article is an extended version of the paper presented at the International Conference on Rough Sets and Emerging Intelligent Systems Paradigms In Memoriam Zdzislaw Pawlak (RSEISP'2007). In comparison to [1], we study the properties of the mappings complementary to the title rough inclusion functions more intensively. In addition, certain distance functions and their complementary mappings are introduced and investigated. Many thanks to the anonymous referees whose comments helped improve the paper. The research was partly supported by the grant N N516 368334 from the Ministry of Science and Higher Education of the Republic of Poland and by an Innovative Economy Operational Programme 2007–2013 (Priority Axis 1. Research and development of new technologies) grant managed by the Ministry of Regional Development of the Republic of Poland.



his research on the probability of the truth of logical expressions (in particular, implicative formulas) about one hundred years ago [7,8]. Apart from the standard RIF, only several functions of this sort are described in the literature (see, e.g., [4,9,10]). Although the notion of a RIF is dispensable when approximating sets of objects in line with the classical Pawlak approach [11,12,13,14,15], it is of particular importance for more general rough-set models. Namely, the concept of a RIF is a basic component of Skowron and Stepaniuk's approximation spaces [16,17,18,19,20,21,22], where lower and upper rough approximations of sets of objects are defined by means of RIFs. In the variable-precision rough-set model with extensions [23,24,25,26] and in the decision-theoretic rough set model and its extensions [27,28,29], the standard RIF is taken as an estimator of certain conditional probabilities which, in turn, are used to define variable-precision positive and negative regions of sets of objects. Moreover, starting with a RIF, one can derive a family of rough membership functions, which was already observed by Pawlak and Skowron in [30]. Also, various functions measuring the degree of similarity between sets can be defined by means of RIFs (see, e.g., [4,10,31,32] for the rough-set approach). Last but not least, a method of knowledge reduction is proposed in [33] which is based, among other things, on the degree of rough inclusion.

In this paper we explore further the idea which led to the standard RIF. The aim is to discover other RIFs which have a similar origin to the standard one. Our investigations are motivated, among other things, by the fact that in spite of the well-groundedness, usefulness, and popularity of the standard RIF, some of its properties may seem too strong (e.g., Proposition 2a,b). In addition, it would be good to have alternative RIFs at our disposal. As a result, we have obtained two more RIFs. One of them is new, at least to the author's knowledge, whereas the remaining one was mentioned in [9]. We investigate the properties of the three RIFs with emphasis on the mutual relationships among them. As regards the standard RIF, some of its properties have already been known, but a few of them are new. Unlike the standard RIF, the new RIFs do not, at first glance, seem to be very useful for estimating the conditional probability. It turns out, however, that they are different from, yet definable in terms of, the standard RIF. On the other hand, the latter RIF can be derived from the former two.

In the sequel, we introduce mappings complementary to our RIFs, called co-RIFs, and we present their properties. The co-RIFs give rise to certain distance functions which turn out to be metrics on the power set of the universe of objects. The distance functions may directly be used to measure the degree of dissimilarity between sets. It is interesting that one of these metrics is an instance of the Marczewski–Steinhaus metric [34]. Finally, we arrive at mappings complementary to the distance functions. They may, in turn, serve as indices of similarity between sets. It is worth noting that these similarity indices are known from the literature [35,36,37,38].

The rest of the paper is organized as follows. Section 2 is fully devoted to the standard RIF. In Sect. 3 we recall the axioms of rough mereology, a formal theory of


being-a-part-to-a-degree introduced by Polkowski and Skowron in [2], which provides us with the fundamentals of a formal notion of rough inclusion. We also explain what we actually mean by a RIF. In the same section we argue that RIFs indeed realize the formal concept of rough inclusion proposed by Polkowski and Skowron. Some authors [39,40] (see also [28]) require rough inclusion measures to fulfil conditions somewhat different from ours. Let us emphasize that our standpoint is that of rough mereology, and its axioms just provide us with a list of postulates to be satisfied by functions measuring the degree of inclusion. In Sect. 4, two alternatives to the standard RIF are derived. In Sect. 5 we consider the co-RIFs corresponding to our three RIFs and investigate their properties. Certain distance functions and their complementary mappings, induced by the co-RIFs, are discussed in Sect. 6. The last section summarizes the results.

2 The Standard Rough Inclusion Function

The idea underlying the notion of the standard rough inclusion function was explored by Jan Łukasiewicz, a famous Polish logician who, among other things, conducted research on the probability of the truth of propositional formulas [7,8]. The standard RIF is the most popular among functions measuring the degree of inclusion of a set in a set. Let us recall that both the decision-theoretic rough set model [27,29] and the variable-precision rough set model [23,24] make use of the standard RIF. It is also commonly used to estimate the confidence (or accuracy) of decision rules and association rules [10,41,42,43]. Last but not least, the standard RIF is counted as a function with which one can measure similarity between clusterings [44,45].

Consider a structure M with a non-empty universe U and a propositional language L interpretable over M. For any formula α and u ∈ U, u |= α reads as 'α is satisfied by u' or 'u satisfies α'. The extension of α is defined as the set ||α|| = {u ∈ U | u |= α}. α will be satisfiable in M if its extension is non-empty, and unsatisfiable otherwise. Moreover, α is called true in M, |= α, if ||α|| = U. Finally, α entails a formula β, written α |= β, if and only if every object satisfying α satisfies β as well, i.e., ||α|| ⊆ ||β||. In classical logic, an implicative formula α → β is true in M if and only if α entails β. Clearly, many interesting formulas are not true in this sense. Since implicative formulas with unsatisfiable predecessors are true, we limit our considerations to satisfiable α. Then, one can assess the degree of truth of α → β by calculating the probability that an object satisfying α satisfies β as well. Where U is finite, this probability may be estimated by the fraction of objects of ||α|| which also satisfy β. That is, the degree of truth of α → β may be defined as #(||α|| ∩ ||β||)/#||α||, where #||α|| means the cardinality of ||α||. By a straightforward generalization, we arrive at the well-known notion of the standard RIF, commonly used in rough set theory. It owes its popularity to the clarity of the underlying idea and to the ease of computation by means of this notion. Since conditional probability may be estimated by the standard RIF, the latter has also been used successfully in the decision-theoretic rough set


model [27,29] (see also [28]) and the variable-precision rough set model and its extensions [23,24,25]. Given a non-empty finite set of objects U and its power set ℘U, the standard RIF upon U is a mapping κ£ : ℘U × ℘U → [0, 1] such that for any X, Y ⊆ U,

κ£(X, Y) =def #(X ∩ Y)/#X if X ≠ ∅, and 1 otherwise.    (1)

To assess the degree of inclusion of a set of objects X in a set of objects Y by means of κ£, one needs to measure the relative overlap of X with Y. The larger the overlap of the two sets, the higher the degree of inclusion, viz., for any X, Y, Z ⊆ U, #(X ∩ Y) ≤ #(X ∩ Z) ⇒ κ£(X, Y) ≤ κ£(X, Z).

The success of the standard RIF also lies in its mathematical properties. Where 𝒳 is a family of sets, we write Pair(𝒳) to say that the elements of 𝒳 are pairwise disjoint, i.e., ∀X, Y ∈ 𝒳.(X ≠ Y ⇒ X ∩ Y = ∅). It is assumed that conjunction and disjunction take precedence over implication and double implication.

Proposition 1. For any sets X, Y, Z ⊆ U and any families of sets ∅ ≠ 𝒳, 𝒴 ⊆ ℘U, it holds:

(a) κ£(X, Y) = 1 ⇔ X ⊆ Y,
(b) Y ⊆ Z ⇒ κ£(X, Y) ≤ κ£(X, Z),
(c) Z ⊆ Y ⊆ X ⇒ κ£(X, Z) ≤ κ£(Y, Z),
(d) κ£(X, ⋃𝒴) ≤ ΣY∈𝒴 κ£(X, Y),
(e) X ≠ ∅ & Pair(𝒴) ⇒ κ£(X, ⋃𝒴) = ΣY∈𝒴 κ£(X, Y),
(f) κ£(⋃𝒳, Y) ≤ ΣX∈𝒳 κ£(X, Y) · κ£(⋃𝒳, X),
(g) Pair(𝒳) ⇒ κ£(⋃𝒳, Y) = ΣX∈𝒳 κ£(X, Y) · κ£(⋃𝒳, X).

Proof. We prove (f) only.Consider any Y ⊆ U and any non-empty family X ⊆ ℘U . Firstsuppose that X = holds ∅, i.e., X = {∅}. The property obviously (∅, Y ) = 1 ·1 = 1. Now let X be nonsince κ£ ( X , Y ) = 1 and κ£ ( X , ∅)·κ£ empty.In sucha case, κ£ ( X , Y ) = #( X ∩ Y)/# X = # {X ∩ Y | X ∈ X }/# X ≤ {#(X ∩ Y ) | X ∈ X }/# X = {#(X ∩ Y )/# X |X ∈ X }. Observe that if some element X of X is empty, then #(X ∩ Y )/# X = 0. 0 = 0 On the other hand, κ£ (X, Y ) · κ£ ( X , X) = 1 · (#X/# X ) = 1 · as well. For every non-empty element X of X , we have #(X ∩ Y )/# X = £ £ (X, Y ) · κ ( X , X) as required. Summing (#(X ∩ Y )/#X) · (#X/# X ) = κ up, κ£ ( X , Y ) ≤ X∈X κ£ (X, Y ) · κ£ ( X , X).
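Since κ£ is defined purely in terms of cardinalities of finite sets, Formula (1) and the properties listed in Proposition 1 can be checked directly by computation. The following minimal Python sketch (all function and variable names are ours, chosen for illustration) implements the standard RIF and spot-checks properties (a), (e) and (g) on a small universe:

    from itertools import combinations

    def kappa_std(X, Y):
        """Standard rough inclusion function, Formula (1)."""
        X, Y = set(X), set(Y)
        return len(X & Y) / len(X) if X else 1.0

    # Property (a): kappa_std(X, Y) == 1 iff X is a subset of Y.
    U = set(range(6))
    subsets = [set(s) for r in range(len(U) + 1) for s in combinations(U, r)]
    assert all((kappa_std(X, Y) == 1.0) == (X <= Y)
               for X in subsets for Y in subsets)

    # Properties (e) and (g): additivity over a pairwise disjoint family.
    X = {0, 1, 2, 3}
    family = [{0, 1}, {2}, {3, 4, 5}]
    union = set().union(*family)
    assert abs(kappa_std(X, union)
               - sum(kappa_std(X, Y) for Y in family)) < 1e-9            # (e)
    assert abs(kappa_std(union, X)
               - sum(kappa_std(Z, X) * kappa_std(union, Z)
                     for Z in family)) < 1e-9                            # (g)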


Some comments may be useful here. (a) says that the standard RIF yields 1 if and only if the first argument is included in the second one. Property (b) expresses monotonicity of κ£ in the second variable, whereas (c) states a weak form of co-monotonicity of the standard RIF in the first variable. It follows from (d) that for any covering of a set of objects, say Z, the sum of the degrees of inclusion of a set X in the sets constituting the covering is at least as high as the degree of inclusion of X in Z. The non-strict inequality in (d) may be strengthened to = for non-empty X and coverings consisting of pairwise disjoint sets, as stated by (e). Due to (f), for any covering of a set of objects, say Z, the degree of inclusion of Z in a set Y is not higher than a weighted sum of the degrees of inclusion of the sets constituting the covering in Y, where the weights are the degrees of inclusion of Z in the members of the covering of Z. In virtue of (g), the inequality may be strengthened to = if the elements of the covering are pairwise disjoint. Let us observe that (g) is in some sense a counterpart of the total probability theorem.

The following conclusions can be drawn from the facts above.

Proposition 2. For any X, Y, Z, W ⊆ U (X ≠ ∅) and a family Y of pairwise disjoint sets of objects such that ⋃Y = U, we have:
(a) Σ_{Y∈Y} κ£(X, Y) = 1,
(b) κ£(X, Y) = 0 ⇔ X ∩ Y = ∅,
(c) κ£(X, ∅) = 0,
(d) X ∩ Y = ∅ ⇒ κ£(X, Z − Y) = κ£(X, Z ∪ Y) = κ£(X, Z),
(e) Z ∩ W = ∅ ⇒ κ£(Y ∪ Z, W) ≤ κ£(Y, W) ≤ κ£(Y − Z, W),
(f) Z ⊆ W ⇒ κ£(Y − Z, W) ≤ κ£(Y, W) ≤ κ£(Y ∪ Z, W).

Proof. We show (d) only. To this end, consider any sets of objects X, Y where X ≠ ∅ and X ∩ Y = ∅. Immediately, (d1) κ£(X, Y) = 0 by (b). Hence, for any Z ⊆ U, κ£(X, Z) = κ£(X, (Z ∩ Y) ∪ (Z − Y)) = κ£(X, Z ∩ Y) + κ£(X, Z − Y) ≤ κ£(X, Y) + κ£(X, Z − Y) = κ£(X, Z − Y) in virtue of Propositions 1b, e. In the sequel, κ£(X, Z ∪ Y) ≤ κ£(X, Z) + κ£(X, Y) = κ£(X, Z) due to (d1) and Proposition 1d. The remaining inequalities are consequences of Proposition 1b.

Let us note a few remarks. (a) states that the degrees of inclusion of a non-empty set of objects X in pairwise disjoint sets sum up to 1 when these sets, taken together, cover the universe. In virtue of (b), the degree of inclusion of a non-empty set in an arbitrary set of objects equals 0 exactly when the two sets are disjoint. (b) obviously implies (c). The latter property says that the degree of inclusion of a non-empty set in ∅ is equal to 0. Thanks to (d), removing (resp., adding) objects that are not members of a non-empty set X from (to) a set Z does not influence the degree of inclusion of X in Z. As follows from (e), adding (resp., removing) objects not belonging to a set W to (from) a set Y does not increase (decrease) the degree of inclusion of Y in W. Finally, removing (resp., adding) members of a set of objects W from (to) a set Y does not increase (decrease) the degree of inclusion of Y in W, due to (f).


Example 1. Let U = {0, . . . , 9}, X = {0, . . . , 3}, Y = {0, . . . , 3, 8}, and Z = {2, . . . , 6}. Note that X ∩ Z = Y ∩ Z = {2, 3}. Thus, κ£(X, Z) = 1/2 and κ£(Z, X) = 2/5, which means that the standard RIF is not symmetric. Moreover, κ£(Y, Z) = 2/5 < 1/2. Thus, X ⊆ Y need not imply κ£(X, Z) ≤ κ£(Y, Z), i.e., κ£ is not monotone in the first variable.
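Using the kappa_std sketch from the previous section, the numbers of Example 1 can be reproduced in a few lines:

    X, Y, Z = {0, 1, 2, 3}, {0, 1, 2, 3, 8}, {2, 3, 4, 5, 6}

    print(kappa_std(X, Z))   # 0.5
    print(kappa_std(Z, X))   # 0.4  -- kappa_std is not symmetric
    print(kappa_std(Y, Z))   # 0.4  -- X is a subset of Y, yet kappa_std(X, Z) > kappa_std(Y, Z)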

3 Rough Mereology: A Formal Framework for Rough Inclusion

The notion of the standard RIF was generalized and formalized by Polkowski and Skowron within rough mereology, a theory of the notion of being-part-to-degree [2, 3, 4]. The starting point is a pair of formal theories introduced by Leśniewski [5, 6], viz., mereology and ontology, where the former theory extends the latter one. Mereology is a theory of the notion of being-part, whereas ontology is a theory of names and plays the role of set theory. Leśniewski's mereology is also known as a theory of collective sets, as opposed to ontology, which is a theory of distributive sets. In this section we recall only a very small part of rough mereology, pivotal for the notion of rough inclusion. We somewhat change the original notation (e.g., 'el' to 'ing', 'μt' to 'ingt'), while trying to keep to the underlying ideas.

In ontology, built upon classical predicate logic with identity, two basic semantical categories are distinguished: the category of non-empty names¹ and the category of propositions. We use x, y, z, with subscripts if needed, as name variables, and we denote the set of all such variables by Var. The only primitive notion of ontology is the copula 'is', denoted by ε and characterized by the axiom

(L0) xεy ↔ (∃z.zεx ∧ ∀z, z′.(zεx ∧ z′εx → zεz′) ∧ ∀z.(zεx → zεy))  (2)

where 'xεy' is read as 'x is y'. The first two conjuncts on the right-hand side say that x ranges over non-empty, individual names only. The third conjunct says that each of the x's is a y as well. In particular, the intended meaning of 'xεx' is simply that x ranges over individual names.

Mereology is built upon ontology and introduces a name-forming functor pt, where 'xεpt(y)' reads as 'x is a part of y'. The functor pt is described by the following axioms:

(L1) xεpt(y) → xεx ∧ yεy,
(L2) xεpt(y) ∧ yεpt(z) → xεpt(z),
(L3) ¬(xεpt(x)).

(L1) stipulates that both x and y range over individual names. According to (L2) and (L3), being-part is transitive and irreflexive, respectively. The reflexive counterpart of pt is the notion of being-ingredient, ing, given by

xεing(y) ↔def xεpt(y) ∨ x = y.  (3)

¹ Empty names are denied by Leśniewski on philosophical grounds.


One can see that

(L1′) xεing(y) → xεx ∧ yεy,
(L2′) xεing(y) ∧ yεing(z) → xεing(z),
(L3′) xεing(x),
(L4′) xεing(y) ∧ yεing(x) → x = y.

Axioms (L1′), (L2′) are counterparts of (L1), (L2), respectively. (L3′), (L4′) postulate reflexivity and antisymmetry of ing, respectively. It is worth noting that one can start with ing characterized by (L1′)–(L4′) and define pt by

xεpt(y) ↔def xεing(y) ∧ x ≠ y.  (4)

Polkowski and Skowron's rough mereology extends Leśniewski's mereology by a family of name-forming functors ingt. These functors, constituting a formal counterpart of the notion of being-ingredient-to-degree, are described by the following axioms, for any name variables x, y, z and s, t ∈ [0, 1]:

(PS1) ∃t.xεingt(y) → xεx ∧ yεy,
(PS2) xεing1(y) ↔ xεing(y),
(PS3) xεing1(y) → ∀z.(zεingt(x) → zεingt(y)),
(PS4) x = y ∧ xεingt(z) → yεingt(z),
(PS5) xεingt(y) ∧ s ≤ t → xεings(y).

The expression 'xεingt(y)' reads as 'x is an ingredient of y to degree t'. Axiom (PS1) claims that x, y range over individual names. According to (PS2), being an ingredient to degree 1 is equivalent to being an ingredient. (PS3) states a weak form of transitivity of the graded ingredienthood. (PS4) says that '=' is a congruence with respect to being-ingredient-to-degree. As postulated by (PS5), ingt is, in fact, a formalization of the notion of being an ingredient to degree at least t. Furthermore, being-part-to-degree may be defined as a special case of the graded ingredienthood, viz.,

xεptt(y) ↔def xεingt(y) ∧ x ≠ y.  (5)

The axioms (PS1)–(PS5) are minimal conditions to be fulfilled by the formal concept of graded ingredienthood². According to the standard interpretation, being an ingredient (part) is understood as being included (included in the proper sense). In the same vein, the graded ingredienthood may be interpreted as a graded inclusion, called rough inclusion in line with Polkowski and Skowron.

Now we describe a model for the part of rough mereology presented above, simplifying the picture as much as possible. Consider a non-empty set of objects U and a structure M = (℘U, ⊆, κ), where the set of all subsets of U, ℘U, serves as the universe of M, ⊆ is the usual inclusion relation on ℘U, and κ is a mapping κ : ℘U × ℘U → [0, 1] satisfying the conditions rif₁, rif₂ below:

rif₁(κ) ⇔def ∀X, Y ⊆ U.(κ(X, Y) = 1 ⇔ X ⊆ Y),
rif₂(κ) ⇔def ∀X, Y, Z ⊆ U.(Y ⊆ Z ⇒ κ(X, Y) ≤ κ(X, Z)).

According to rif₁, κ is a generalization of ⊆. Moreover, κ achieves the greatest value (equal to 1) only for such pairs of sets that the second element of a pair contains the first element. The condition rif₂ postulates that κ be monotone in the second variable. We call any mapping κ as above a rough inclusion function (RIF) over U. For simplicity, the reference to U will be dropped if no confusion results. Observe that, having assumed rif₁, the second condition is equivalent to rif₂* given by

rif₂*(κ) ⇔def ∀X, Y, Z ⊆ U.(κ(Y, Z) = 1 ⇒ κ(X, Y) ≤ κ(X, Z)).

² For instance, nothing has been said about the property of being external yet. For this and other concepts of rough mereology see, e.g., [4].

Subsets of U are viewed as concepts, and RIFs are intended as functions measuring the degrees of inclusion of concepts in concepts. It is worth noting that any RIF over U is a fuzzy set on ℘U × ℘U or, in other words, a fuzzy binary relation on ℘U (see [46] and the more recent, ample literature on fuzzy set theory). Clearly, RIFs may satisfy various additional postulates as well. Examples of such postulates are:

rif₃(κ) ⇔def ∀∅ ≠ X ⊆ U. κ(X, ∅) = 0,
rif₄(κ) ⇔def ∀X, Y ⊆ U.(κ(X, Y) = 0 ⇒ X ∩ Y = ∅),
rif₄⁻¹(κ) ⇔def ∀∅ ≠ X ⊆ U. ∀Y ⊆ U.(X ∩ Y = ∅ ⇒ κ(X, Y) = 0),
rif₅(κ) ⇔def ∀∅ ≠ X ⊆ U. ∀Y ⊆ U.(κ(X, Y) = 0 ⇔ X ∩ Y = ∅),
rif₆(κ) ⇔def ∀∅ ≠ X ⊆ U. ∀Y ⊆ U. κ(X, Y) + κ(X, U − Y) = 1,
rif₇(κ) ⇔def ∀X, Y, Z ⊆ U.(Z ⊆ Y ⊆ X ⇒ κ(X, Z) ≤ κ(Y, Z)).

As follows from Propositions 1 and 2, the standard RIF satisfies all the conditions above. Moreover, for any RIF κ, rif₁(κ) and rif₆(κ) imply rif₅(κ); rif₅(κ) is equivalent to the conjunction of rif₄(κ) and rif₄⁻¹(κ); and rif₄⁻¹(κ) implies rif₃(κ). It is worth mentioning that some authors stipulate that functions measuring the degree of inclusion satisfy rif₂, rif₇, and the 'if' part of rif₁ [39, 40].

Names and name-forming functors are interpreted in M by means of a mapping I as follows. Every name is interpreted as a non-empty set of concepts, i.e., subsets of U, and individual names are interpreted as singletons. For any singleton Y = {X} where X ⊆ U, let

e(Y) =def X.  (6)

The identity symbol is interpreted as the identity relation on ℘U (the same symbol '=' is used in both cases for simplicity). The copula ε is interpreted as a binary relation εI ⊆ ℘(℘U) × ℘(℘U) such that for any X, Y ⊆ ℘U,

XεI Y ⇔def #X = 1 & X ⊆ Y.  (7)

Observe that X ⊆ Y above may equivalently be written as e(X) ∈ Y. In the sequel, the name-forming functors ing, pt, ingt, and ptt (t ∈ [0, 1]) are interpreted as mappings ingI, ptI, ingt,I, ptt,I : ℘U → ℘(℘U) such that for any X ⊆ U,

ingI(X) =def ℘X,
ptI(X) =def ℘X − {X},
ingt,I(X) =def {Y ⊆ U | κ(Y, X) ≥ t},
ptt,I(X) =def {Y ⊆ U | κ(Y, X) ≥ t & Y ≠ X};  (8)

thus, e.g., ing is interpreted as the power-set operator. The pair MI = (M, I) is an interpretation of the language of the part of rough mereology considered here. In the next step, we assign non-empty sets of concepts to name variables. Given an interpretation MI, any such variable assignment v : Var → ℘(℘U) may be extended to a term assignment vI as follows. For any x ∈ Var, t ∈ [0, 1], and f ∈ {ing, pt, ingt, ptt},

vI(x) =def v(x),
vI(f(x)) =def fI(e(v(x))) if #v(x) = 1, and vI(f(x)) is undefined otherwise.  (9)

Finally, we can define satisfiability of formulas by variable assignments in MI. For any formula α and any variable assignment v, 'MI, v |= α' reads as 'α is satisfied by v in MI'. Along the standard lines, α will be true in MI, written MI |= α, if α is satisfied by every variable assignment in MI. The relation of satisfiability of formulas is defined as follows, for any formulas α, β, any name variables x, y, any degree variable t, and f ∈ {ing, pt, ingt, ptt}:

MI, v |= x = y ⇔def vI(x) = vI(y),
MI, v |= xεy ⇔def vI(x) εI vI(y),
MI, v |= xεf(y) ⇔def vI(x) εI vI(f(y)),
MI, v |= α ∧ β ⇔def MI, v |= α & MI, v |= β,
MI, v |= ¬α ⇔def MI, v ⊭ α,
MI, v |= ∀x.α ⇔def MI, w |= α for any w different from v at most for x,
MI, v |= ∀t.α ⇔def for every t ∈ [0, 1], MI, v |= α.  (10)

The remaining cases can easily be obtained from those above. Let us observe that the first three conditions may be simplified to the following ones:

MI, v |= x = y ⇔ v(x) = v(y),
MI, v |= xεy ⇔ (#v(x) = 1 & v(x) ⊆ v(y)) ⇔ ∃X ⊆ U.(v(x) = {X} & X ∈ v(y)),
MI, v |= xεing(y) ⇔ (#v(x) = #v(y) = 1 & e(v(x)) ⊆ e(v(y))) ⇔ ∃X, Y ⊆ U.(v(x) = {X} & v(y) = {Y} & X ⊆ Y),
MI, v |= xεpt(y) ⇔ (#v(x) = #v(y) = 1 & e(v(x)) ⊂ e(v(y))) ⇔ ∃X, Y ⊆ U.(v(x) = {X} & v(y) = {Y} & X ⊂ Y),
MI, v |= xεingt(y) ⇔ (#v(x) = #v(y) = 1 & κ(e(v(x)), e(v(y))) ≥ t) ⇔ ∃X, Y ⊆ U.(v(x) = {X} & v(y) = {Y} & κ(X, Y) ≥ t),
MI, v |= xεptt(y) ⇔ (#v(x) = #v(y) = 1 & v(x) ≠ v(y) & κ(e(v(x)), e(v(y))) ≥ t) ⇔ ∃X, Y ⊆ U.(v(x) = {X} & v(y) = {Y} & X ≠ Y & κ(X, Y) ≥ t).  (11)

By a straightforward inspection one can check that MI is a model of the considered part of rough mereology, i.e., all axioms are true in MI. By way of example, we only show that (PS3) is true in MI, i.e., for any name variables x, y, any t ∈ [0, 1], and any variable assignment v,

MI, v |= xεing1(y) → ∀z.(zεingt(x) → zεingt(y)).  (12)

To this end, assume MI, v |= xεing1(y) first. Hence, (a) #v(x) = #v(y) = 1 and κ(e(v(x)), e(v(y))) ≥ 1 by (11). The latter is equivalent to (b) e(v(x)) ⊆ e(v(y)) due to rif₁(κ). Next consider any variable assignment w, different from v at most for z. As a consequence, (c) w(x) = v(x) and w(y) = v(y). In the sequel assume MI, w |= zεingt(x). Hence, (d) #w(z) = 1 and (e) κ(e(w(z)), e(w(x))) ≥ t by (11). It holds that (f) κ(e(w(z)), e(w(x))) ≤ κ(e(w(z)), e(w(y))) by (b), (c), and rif₂(κ). From the latter and (e) we obtain (g) κ(e(w(z)), e(w(y))) ≥ t. Hence, MI, w |= zεingt(y) in virtue of (a), (c), (d), and (11).

4 In Search of New RIFs

According to rough mereology, rough inclusion is a generalization of the set-theoretical inclusion of sets. While keeping with this idea, we try to obtain RIFs different from the standard one. Let U be a non-empty finite set of objects. Observe that for any X, Y ⊆ U, the following formulas are equivalent:

(i) X ⊆ Y,  (ii) X ∩ Y = X,  (iii) X ∪ Y = Y,  (iv) (U − X) ∪ Y = U,  (v) X − Y = ∅.  (13)

The equivalence of the first two statements gave rise to the standard RIF. Now we explore (i) ⇔ (iii) and (i) ⇔ (iv). In the case of (iii), '⊇' always holds true.


Conversely, '⊆' always takes place in (iv). The remaining inclusions may or may not hold, so we may introduce degrees of inclusion. Thus, let us define mappings κ₁, κ₂ : ℘U × ℘U → [0, 1] such that for any X, Y ⊆ U,

κ₁(X, Y) =def #Y/#(X ∪ Y) if X ∪ Y ≠ ∅, and κ₁(X, Y) =def 1 otherwise;
κ₂(X, Y) =def #((U − X) ∪ Y)/#U.  (14)

It is worth noting that κ₂ was mentioned in [9]. We now show that both κ₁ and κ₂ are RIFs, different from the standard one and from each other.

Proposition 3. Each of κi (i = 1, 2) is a RIF upon U, i.e., rif₁(κi) and rif₂(κi) hold.

Proof. We only prove the property for i = 1. Let X, Y, Z be any sets of objects. To show rif₁(κ₁), we only examine the non-trivial case where X, Y ≠ ∅. Then, κ₁(X, Y) = 1 if and only if #Y = #(X ∪ Y), if and only if Y = X ∪ Y, if and only if X ⊆ Y. In the case of rif₂, assume that (a1) Y ⊆ Z. First suppose that X = ∅. If Z is empty as well, then Y = ∅. As a result, κ₁(X, Y) = 1 ≤ 1 = κ₁(X, Z). Conversely, if Z is non-empty, then κ₁(X, Z) = #Z/#Z = 1 ≥ κ₁(X, Y). Now assume that X ≠ ∅. Then X ∪ Y ≠ ∅ and X ∪ Z ≠ ∅. Moreover, Z = Y ∪ (Z − Y) and Y ∩ (Z − Y) = ∅ by (a1). As a consequence, (a2) #Z = #Y + #(Z − Y). Additionally, (a3) #(X ∪ Z) ≤ #(X ∪ Y) + #(Z − Y) and (a4) #Y ≤ #(X ∪ Y). Hence, κ₁(X, Y) = #Y/#(X ∪ Y) ≤ (#Y + #(Z − Y))/(#(X ∪ Y) + #(Z − Y)) ≤ (#Y + #(Z − Y))/#(X ∪ Y ∪ (Z − Y)) = #Z/#(X ∪ Z) = κ₁(X, Z) by (a2)–(a4).

Example 2. Consider U = {0, . . . , 9} and its subsets X = {0, . . . , 4} and Y = {2, . . . , 6}. Notice that X ∩ Y = {2, 3, 4}, X ∪ Y = {0, . . . , 6}, and (U − X) ∪ Y = {2, . . . , 9}. Hence, κ£(X, Y) = 3/5, κ₁(X, Y) = 5/7, and κ₂(X, Y) = 4/5, i.e., κ£, κ₁, and κ₂ are different RIFs.

Proposition 4. For any X, Y ⊆ U, we have:
(a) X ≠ ∅ ⇒ (κ₁(X, Y) = 0 ⇔ Y = ∅),
(b) κ₂(X, Y) = 0 ⇔ X = U & Y = ∅,
(c) rif₄(κ₁) & rif₄(κ₂),
(d) κ£(X, Y) ≤ κ₁(X, Y) ≤ κ₂(X, Y),
(e) κ₁(X, Y) = κ£(X ∪ Y, Y),
(f) κ₂(X, Y) = κ£(U, (U − X) ∪ Y) = κ£(U, U − X) + κ£(U, X ∩ Y),
(g) κ£(X, Y) = κ£(X, X ∩ Y) = κ₁(X, X ∩ Y) = κ₁(X − Y, X ∩ Y),
(h) X ∪ Y = U ⇒ κ₁(X, Y) = κ₂(X, Y).

Proof. By way of illustration we show (d) and (h). To this end, consider any sets of objects X, Y. In case (d), if X is empty, then (U − X) ∪ Y = U. Hence, by the definitions, κ£(X, Y) = κ₁(X, Y) = κ₂(X, Y) = 1. Now suppose that X ≠ ∅. Obviously, (d1) #(X ∩ Y) ≤ #X and (d2) #Y ≤ #(X ∪ Y). Since X ∪ Y = X ∪ (Y − X) and X ∩ (Y − X) = ∅, (d3) #(X ∪ Y) = #X + #(Y − X). Similarly, it follows from Y = (X ∩ Y) ∪ (Y − X) and (X ∩ Y) ∩ (Y − X) = ∅ that (d4) #Y = #(X ∩ Y) + #(Y − X). Observe also that (U − X) ∪ Y = ((U − X) − Y) ∪ Y = (U − (X ∪ Y)) ∪ Y and (U − (X ∪ Y)) ∩ Y = ∅. Hence, (d5) #((U − X) ∪ Y) = #(U − (X ∪ Y)) + #Y. In the sequel, κ£(X, Y) = #(X ∩ Y)/#X ≤ (#(X ∩ Y) + #(Y − X))/(#X + #(Y − X)) = #Y/#(X ∪ Y) = κ₁(X, Y) ≤ (#(U − (X ∪ Y)) + #Y)/(#(U − (X ∪ Y)) + #(X ∪ Y)) = #((U − X) ∪ Y)/#U = κ₂(X, Y) by (d1)–(d5) and the definitions of the RIFs. For (h), assume that X ∪ Y = U. Then Y − X = U − X, and κ₁(X, Y) = #Y/#U = #((Y − X) ∪ Y)/#U = #((U − X) ∪ Y)/#U = κ₂(X, Y), as required.

Let us briefly comment upon these properties. According to (a), if X is non-empty, then the emptiness of Y is both sufficient³ and necessary for κ₁(X, Y) = 0. Property (b) states that κ₂ yields 0 solely for (U, ∅). Due to (c), κi(X, Y) = 0 (i = 1, 2) implies the emptiness of the overlap of X, Y. Property (d) says that the degrees of inclusion yielded by κ₂ are at least as high as those given by κ₁, and the degrees of inclusion provided by κ₁ are not lower than those estimated by means of the standard RIF. (e) and (f) provide us with characterizations of κ₁ and κ₂ in terms of κ£, respectively. On the other hand, the standard RIF may be defined by means of κ₁ in virtue of (g). Finally, (h) states that κ₁ and κ₂ are equal on the set of all pairs (X, Y) such that X, Y cover the universe.

³ Compare the optional postulate rif₃(κ).
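Both new RIFs are as easy to compute as the standard one, so Example 2 and the ordering κ£ ≤ κ₁ ≤ κ₂ of Proposition 4d can be verified numerically. A minimal sketch (names ours; kappa_std is the function given in Section 2):

    def kappa_1(X, Y):
        """First new RIF, Formula (14)."""
        X, Y = set(X), set(Y)
        return len(Y) / len(X | Y) if X | Y else 1.0

    def kappa_2(X, Y, U):
        """Second new RIF, Formula (14); U is the finite universe."""
        X, Y, U = set(X), set(Y), set(U)
        return len((U - X) | Y) / len(U)

    U = set(range(10))
    X, Y = set(range(5)), set(range(2, 7))
    print(kappa_std(X, Y), kappa_1(X, Y), kappa_2(X, Y, U))
    # 0.6 0.7142857142857143 0.8  -- Example 2, and the chain of Proposition 4d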

5 Mappings Complementary to RIFs

Now we define mappings which are in some sense complementary to the RIFs considered. We also investigate properties of these functions and give one more characterization of the standard RIF. Namely, with every mapping f : ℘U × ℘U → [0, 1] one can associate a complementary mapping f̄ : ℘U × ℘U → [0, 1] defined by

f̄(X, Y) =def 1 − f(X, Y)  (15)

for any sets X, Y ⊆ U. Clearly, f is, in turn, complementary to f̄. In particular, we obtain

κ̄£(X, Y) = #(X − Y)/#X if X ≠ ∅, and 0 otherwise;
κ̄₁(X, Y) = #(X − Y)/#(X ∪ Y) if X ∪ Y ≠ ∅, and 0 otherwise;
κ̄₂(X, Y) = #(X − Y)/#U.  (16)


For the sake of simplicity, κ̄ where κ is a RIF will be referred to as a co-RIF. Observe that each of the co-RIFs measures the difference between its first and second arguments, i.e., the equivalence (i) ⇔ (v) (cf. (13)) is explored here. It is worth noting that for any X, Y ⊆ U,

κ̄£(X, Y) = κ£(X, U − Y).  (17)

However, the same is not true of κ̄i for i = 1, 2. Indeed, κ₁(X, U − Y) = #(U − Y)/#(X ∪ (U − Y)) if X ∪ (U − Y) ≠ ∅, and κ₂(X, U − Y) = #(U − (X ∩ Y))/#U, so the counterparts of (17) do not hold in general.

Example 3. Let U and X, Y be as in Example 2, i.e., U = {0, . . . , 9}, X = {0, . . . , 4}, and Y = {2, . . . , 6}. It is easy to see that κ₁(X, U − Y) = 5/8 and κ₂(X, U − Y) = 7/10, whereas κ̄₁(X, Y) = 2/7 and κ̄₂(X, Y) = 1/5.

We can characterize the standard RIF in terms of κi (i = 1, 2) and their co-RIFs as follows:

Proposition 5. For any sets of objects X, Y where X ≠ ∅,

κ£(X, Y) = κ̄₁(X, U − Y)/κ₁(U − Y, X) = κ̄₂(X, U − Y)/κ₂(U, X).

Proof. Consider any set of objects Y and any non-empty set of objects X. Hence, X ∪ (U − Y) ≠ ∅ as well. Moreover, κ₁(U − Y, X), κ₂(U, X) > 0. Then κ̄₁(X, U − Y) = #(X − (U − Y))/#(X ∪ (U − Y)) = #(X ∩ Y)/#(X ∪ (U − Y)) = (#(X ∩ Y)/#X) · (#X/#(X ∪ (U − Y))) = κ£(X, Y) · κ₁(U − Y, X) by the definitions of κ£, κ₁, and κ̄₁. Hence, κ£(X, Y) = κ̄₁(X, U − Y)/κ₁(U − Y, X), as required. Similarly, κ̄₂(X, U − Y) = #(X − (U − Y))/#U = #(X ∩ Y)/#U = (#(X ∩ Y)/#X) · (#X/#U) = κ£(X, Y) · κ₂(U, X) by the definitions of κ£, κ₂, and κ̄₂. Immediately, κ£(X, Y) = κ̄₂(X, U − Y)/κ₂(U, X), which ends the proof.

Henceforth the symmetric difference of sets X, Y will be denoted by X ÷ Y. We can prove the following properties of co-RIFs:

Proposition 6. For any X, Y, Z ⊆ U, an arbitrary RIF κ, and i = 1, 2:
(a) κ̄(X, Y) = 0 ⇔ X ⊆ Y,
(b) Y ⊆ Z ⇒ κ̄(X, Z) ≤ κ̄(X, Y),
(c) κ̄₂(X, Y) ≤ κ̄₁(X, Y) ≤ κ̄£(X, Y),
(d) κ̄i(X, Y) + κ̄i(Y, Z) ≥ κ̄i(X, Z),
(e) 0 ≤ κ̄i(X, Y) + κ̄i(Y, X) ≤ 1,
(f) (X ≠ ∅ & Y = ∅) or (X = ∅ & Y ≠ ∅) ⇒ κ̄£(X, Y) + κ̄£(Y, X) = κ̄₁(X, Y) + κ̄₁(Y, X) = 1.

Proof. We only prove (d) for i = 1, and (e). To this end, consider any sets of objects X, Y, Z. In case (d), if X = ∅, then κ̄₁(X, Z) = 0 in virtue of (a). Hence, (d) obviously holds. Now suppose that X ≠ ∅. If Y = ∅, then κ̄₁(X, Y) = 1. On the other hand, if Y ≠ ∅ and Z = ∅, then κ̄₁(Y, Z) = 1. In both cases κ̄₁(X, Y) + κ̄₁(Y, Z) ≥ 1 ≥ κ̄₁(X, Z).

Finally, assume that X, Y, Z ≠ ∅. Let m = #(X ∪ Y ∪ Z), m₀ = #(X − (Y ∪ Z)), m₁ = #(Y − (X ∪ Z)), m₂ = #((X ∩ Y) − Z), m₃ = #((X ∩ Z) − Y), and m₄ = #(Z − (X ∪ Y)). Observe that #(X − Y) = m₀ + m₃, #(X − Z) = m₀ + m₂, #(Y − Z) = m₁ + m₂, #(X ∪ Y) = m − m₄, #(X ∪ Z) = m − m₁, and #(Y ∪ Z) = m − m₀. Hence, κ̄₁(X, Y) = #(X − Y)/#(X ∪ Y) = (m₀ + m₃)/(m − m₄). On the same grounds, κ̄₁(Y, Z) = (m₁ + m₂)/(m − m₀) and κ̄₁(X, Z) = (m₀ + m₂)/(m − m₁). It is easy to see that

(m₀ + m₃)/(m − m₄) + (m₁ + m₂)/(m − m₀) ≥ (m₀ + m₃)/m + (m₁ + m₂)/m ≥ (m₀ + m₁ + m₂)/m ≥ (m₀ + m₂)/(m − m₁),

which ends the proof of (d).

The first inequality of (e) is obvious, so we only show the second one. For i = 1, assume that X ∪ Y ≠ ∅, since the case X = Y = ∅ is trivial. Thus, κ̄₁(X, Y) + κ̄₁(Y, X) = (#(X − Y)/#(X ∪ Y)) + (#(Y − X)/#(X ∪ Y)) = #(X ÷ Y)/#(X ∪ Y) ≤ 1 because X ÷ Y ⊆ X ∪ Y. The property just proved implies the second inequality for i = 2 due to (c).

According to (a), every co-RIF yields 0 exactly in the case where the first argument is included in the second one. As a consequence, (*) κ̄(X, X) = 0 for every set of objects X. That is, κ̄ may serve as a (non-symmetric) distance function. (b) states that co-RIFs are co-monotone in the second variable. (c) provides us with a comparison of our three co-RIFs. Properties (e), (f) will prove their usefulness in the next section. (d) expresses the triangle inequality condition for κ̄i (i = 1, 2). Let us note that the triangle inequality does not hold for κ̄£ in general.

Example 4. Consider sets of objects X, Y, Z such that X − Z ≠ ∅, Z − X ≠ ∅, and Y = X ∪ Z. We show that

κ̄£(X, Y) + κ̄£(Y, Z) < κ̄£(X, Z).

By the assumptions, each of X, Y, Z is non-empty and Y − Z = X − Z. Next, #X < #Y since X ⊂ Y. Moreover, κ̄£(X, Y) = 0 in virtue of (a). As a consequence,

κ̄£(X, Y) + κ̄£(Y, Z) = #(Y − Z)/#Y < #(Y − Z)/#X = #(X − Z)/#X = κ̄£(X, Z),

as expected. Additionally, it can happen that κ̄£(X, Y) + κ̄£(Y, X) > 1. Indeed, if X, Y ≠ ∅ and X ∩ Y = ∅, then Σ = κ̄£(X, Y) + κ̄£(Y, X) = (#X/#X) + (#Y/#Y) = 1 + 1 = 2. Nevertheless, 2 is the greatest value taken by Σ.
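The behaviour of the co-RIFs, including the failure of the triangle inequality for κ̄£ exhibited in Example 4, can be checked on concrete sets with the functions defined earlier (again, the names are ours):

    def co(kappa):
        """Complementary mapping of a RIF, Formula (15)."""
        return lambda *args: 1.0 - kappa(*args)

    co_std, co_1 = co(kappa_std), co(kappa_1)

    # Example 4: X - Z and Z - X non-empty, Y = X ∪ Z.
    X, Z = {0, 1}, {1, 2, 3}
    Y = X | Z
    print(co_std(X, Y) + co_std(Y, Z) < co_std(X, Z))    # True: 0.25 < 0.5
    print(co_1(X, Y) + co_1(Y, Z) >= co_1(X, Z))         # True: triangle law holds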

6 RIFs and Their Complementary Mappings vs. Similarity and Distance between Sets

In this section we use the three co-RIFs to define certain normalized distance functions with which one can measure (dis)similarity between sets. Namely, let δ£, δi : ℘U × ℘U → [0, 1] (i = 1, 2) be mappings such that for any X, Y ⊆ U,

δ£(X, Y) =def (1/2)(κ̄£(X, Y) + κ̄£(Y, X)),
δi(X, Y) =def κ̄i(X, Y) + κ̄i(Y, X).  (18)

It is easy to see that

δ£(X, Y) = (1/2)(#(X − Y)/#X + #(Y − X)/#Y) if X, Y ≠ ∅; δ£(X, Y) = 0 if X = Y = ∅; and δ£(X, Y) = 1/2 in the remaining cases;
δ₁(X, Y) = #(X ÷ Y)/#(X ∪ Y) if X ∪ Y ≠ ∅, and δ₁(X, Y) = 0 otherwise;
δ₂(X, Y) = #(X ÷ Y)/#U.  (19)

It is worth mentioning that δ₁ is an instance of the Marczewski–Steinhaus metric [34]. As we shall see, the remaining two functions are metrics on ℘U as well. Namely, we can prove the following:

Proposition 7. For any sets X, Y, Z ⊆ U and δ ∈ {δ£, δ₁, δ₂}:
(a) δ(X, Y) = 0 ⇔ X = Y,
(b) δ(X, Y) = δ(Y, X),
(c) δ(X, Y) + δ(Y, Z) ≥ δ(X, Z),
(d) max{δ£(X, Y), δ₂(X, Y)} ≤ δ₁(X, Y) ≤ 2δ£(X, Y).

Proof. Property (a) is an easy consequence of Proposition 6a. (b) directly follows from the definitions of δ£, δ₁, and δ₂. Property (c) for δ₁, δ₂ can easily be obtained from Proposition 6d. Now we show (c) for δ£. To this end, consider any sets of objects X, Y, Z. If both X, Z are empty, then δ£(X, Z) = 0 in virtue of (a), and (c) follows immediately. Next, if X, Y ≠ ∅ and Z = ∅, or X, Y = ∅ and Z ≠ ∅, then δ£(X, Z) = δ£(Y, Z) = 1/2 by (19). In consequence, (c) is fulfilled regardless of the value δ£(X, Y). In the same vein, if X = ∅ and Y, Z ≠ ∅, or X ≠ ∅ and Y, Z = ∅, then δ£(X, Y) = δ£(X, Z) = 1/2. Here (c) is satisfied regardless of δ£(Y, Z). In the sequel, if X, Z ≠ ∅ and Y = ∅, then δ£(X, Y) = δ£(Y, Z) = 1/2. Hence, δ£(X, Y) + δ£(Y, Z) = 1 ≥ δ£(X, Z) for any value of δ£ at (X, Z).

Finally, we prove (c) for X, Y, Z ≠ ∅. Let m and mi (i = 0, . . . , 4) be as earlier. Additionally, let m₅ = #((Y ∩ Z) − X). Notice that

2δ£(X, Y) = #(X − Y)/#X + #(Y − X)/#Y = (m₀ + m₃)/(m − (m₁ + m₄ + m₅)) + (m₁ + m₅)/(m − (m₀ + m₃ + m₄)),
2δ£(Y, Z) = #(Y − Z)/#Y + #(Z − Y)/#Z = (m₁ + m₂)/(m − (m₀ + m₃ + m₄)) + (m₃ + m₄)/(m − (m₀ + m₁ + m₂)),
2δ£(X, Z) = #(X − Z)/#X + #(Z − X)/#Z = (m₀ + m₂)/(m − (m₁ + m₄ + m₅)) + (m₄ + m₅)/(m − (m₀ + m₁ + m₂)).

Hence we obtain

2(δ£(X, Y) + δ£(Y, Z) − δ£(X, Z)) = ((m₀ + m₃) − (m₀ + m₂))/(m − (m₁ + m₄ + m₅)) + ((m₁ + m₅) + (m₁ + m₂))/(m − (m₀ + m₃ + m₄)) + ((m₃ + m₄) − (m₄ + m₅))/(m − (m₀ + m₁ + m₂)) ≥ (m₃ − m₂)/m + (2m₁ + m₂ + m₅)/m + (m₃ − m₅)/m = 2(m₁ + m₃)/m ≥ 0.

As a result, δ£(X, Y) + δ£(Y, Z) ≥ δ£(X, Z), as needed.

As regards (d), we only prove that (*) δ£(X, Y) ≤ δ₁(X, Y) for any X, Y ⊆ U; the rest easily follows from Proposition 6c. Consider any sets of objects X, Y. If at least one of X, Y is empty, (*) holds directly by the definitions of δ£, δ₁. For the remaining case observe that

#(X ∩ Y)/#(X ∪ Y) ≤ min{#(X ∩ Y)/#X, #(X ∩ Y)/#Y}

since max{#X, #Y} ≤ #(X ∪ Y). Hence, we obtain in the sequel:

max{1 − #(X ∩ Y)/#X, 1 − #(X ∩ Y)/#Y} ≤ 1 − #(X ∩ Y)/#(X ∪ Y),
max{#(X − Y)/#X, #(Y − X)/#Y} ≤ #(X ÷ Y)/#(X ∪ Y),
#(X − Y)/#X + #(Y − X)/#Y ≤ 2 · #(X ÷ Y)/#(X ∪ Y),
(1/2)(#(X − Y)/#X + #(Y − X)/#Y) ≤ #(X ÷ Y)/#(X ∪ Y).

From the latter we derive (*) by the definitions of δ£, δ₁.


Summing up, δ£ and δi (i = 1, 2) are metrics on ℘U due to (a)–(c), and they may be used to measure the distance between sets. According to (d), double the distance between sets X, Y estimated by means of δ£ is not smaller than the distance between X, Y yielded by δ₁. In turn, the distance measured by the latter metric is greater than or equal to the distance given by each of δ£, δ₂. In view of the fact that the κ̄i underlying δi satisfy the triangle inequality (see Proposition 6d), it is not very surprising that the δi are metrics. The really unexpected result is that δ£ fulfils the triangle inequality as well.

The distance between two sets may be interpreted as the degree of their dissimilarity. Thus, δ£ and δi may serve as measures (indices) of dissimilarity of sets. On the other hand, the mappings complementary in the sense of (15) to δ£ and δi, i.e., δ̄£ and δ̄i (i = 1, 2), respectively, may be used as similarity measures (see, e.g., [44] for a discussion of various indices used to measure the degree of similarity between clusterings). Let us note that for any X, Y ⊆ U, the following dependencies hold:

δ̄£(X, Y) = (1/2)(κ£(X, Y) + κ£(Y, X)),
δ̄i(X, Y) = κi(X, Y) + κi(Y, X) − 1.  (20)

More precisely,

δ̄£(X, Y) = (1/2)(#(X ∩ Y)/#X + #(X ∩ Y)/#Y) if X, Y ≠ ∅; δ̄£(X, Y) = 1 if X = Y = ∅; and δ̄£(X, Y) = 1/2 in the remaining cases;
δ̄₁(X, Y) = #(X ∩ Y)/#(X ∪ Y) if X ∪ Y ≠ ∅, and δ̄₁(X, Y) = 1 otherwise;
δ̄₂(X, Y) = #((U − (X ∪ Y)) ∪ (X ∩ Y))/#U.  (21)

Thus, starting with the standard RIF and two other RIFs of a similar origin, we have finally arrived at similarity measures known from the literature [35, 36, 37, 38]. More precisely, δ̄£ is the function proposed by Kulczyński to estimate biotopical similarity [36]. The similarity index δ̄₁, complementary to the Marczewski–Steinhaus metric δ₁, is attributed to Jaccard [35]. The function δ̄₂ was introduced (at least) twice, viz., by Sokal and Michener [38] and by Rand [37]. Let us note the following observations:

Proposition 8. For any sets of objects X, Y and δ ∈ {δ£, δ₁, δ₂}, we have:
(a) δ̄(X, Y) = 1 ⇔ X = Y,
(b) δ̄(X, Y) = δ̄(Y, X),
(c) δ̄£(X, Y) = 0 ⇔ X ∩ Y = ∅ & X, Y ≠ ∅,
(d) δ̄₁(X, Y) = 0 ⇔ X ∩ Y = ∅ & X ∪ Y ≠ ∅,
(e) δ̄₂(X, Y) = 0 ⇔ X ∩ Y = ∅ & X ∪ Y = U,
(f) 2δ̄£(X, Y) − 1 ≤ δ̄₁(X, Y) ≤ min{δ̄£(X, Y), δ̄₂(X, Y)}.

The proof is easy and, hence, omitted. However, some remarks may be useful. (a) states that every set is similar to itself to the highest degree 1. According to (b), similarity is assumed to be symmetric here. Properties (c)–(e) describe the conditions characterizing the lowest degree of similarity between sets. A comparison of the three similarity indices is provided by (f).

An example illustrating a possible application of the Marczewski–Steinhaus metric to estimate differences between biotopes can be found in [34]. In that example, two real forests from Lower Silesia (Poland) are considered. We slightly modify the example and extend it to the other distance measures investigated.

Example 5. As the universe we take a collection of tree species U = {a, b, h, l, o, p, r, s}, where a stands for 'alder', b for 'birch', h for 'hazel', l for 'larch', o for 'oak', p for 'pine', r for 'rowan', and s for 'spruce'. Consider two forests represented by the collections A, B of the tree species which occur in those forests, where A = {a, b, h, p, r} and B = {b, o, p, s}. First we compute the degrees of inclusion of A in B, and vice versa. Next we measure the biotopical differences between A and B using δ£ and δi for i = 1, 2. Finally we estimate the degrees of biotopical similarity of the forests investigated. It is easy to see that κ£(A, B) = 2/5, κ£(B, A) = 1/2, κ₁(A, B) = 4/7, κ₁(B, A) = 5/7, κ₂(A, B) = 5/8, and κ₂(B, A) = 3/4. Hence,

δ£(A, B) = (1/2)(3/5 + 1/2) = 11/20,

δ₁(A, B) = 5/7, and δ₂(A, B) = 5/8. As expected, the distance functions δ£, δ₁, δ₂ (and so the corresponding similarity measures δ̄£, δ̄₁, δ̄₂) may give us different values when measuring the distance (resp., similarity) between A and B. Due to Proposition 7d, this distance is the greatest (equal to 5/7) when measured by δ₁. Conversely, δ̄₁ yields the least degree of similarity, equal to 2/7. Therefore, these measures seem to be particularly attractive to cautious reasoners. For those who accept a higher risk, both δ£, δ₂ (and, similarly, δ̄£, δ̄₂) are reasonable alternatives too. Incidentally, δ£ gives the least distance, equal to 11/20, and its complementary mapping δ̄£ yields the greatest degree of similarity, equal to 9/20. In this particular case, the values provided by δ₂ and δ̄₂, 5/8 and 3/8, respectively, are in between. Clearly, the choice of the most appropriate distance function (or similarity measure) may also depend on factors other than the level of risk.
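All numbers of Example 5 follow from a few set operations; exact fractions make the comparison with the text immediate (a sketch with our own naming):

    from fractions import Fraction as F

    U = {'a', 'b', 'h', 'l', 'o', 'p', 'r', 's'}
    A = {'a', 'b', 'h', 'p', 'r'}        # first forest
    B = {'b', 'o', 'p', 's'}             # second forest

    d_std = F(1, 2) * (F(len(A - B), len(A)) + F(len(B - A), len(B)))
    d_1 = F(len(A ^ B), len(A | B))      # Marczewski-Steinhaus distance
    d_2 = F(len(A ^ B), len(U))
    print(d_std, d_1, d_2)               # 11/20 5/7 5/8
    print(1 - d_std, 1 - d_1, 1 - d_2)   # 9/20 2/7 3/8 (Kulczynski, Jaccard, Rand)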

7 Summary

In this article, an attempt was made to discover RIFs different from the standard one, yet having a similar origin. First we overviewed the notion of the standard RIF, κ£. In the next step, a general framework for the discussion of RIFs and their properties was recalled. As a result, a minimal set of postulates specifying a RIF was derived. Several optional conditions were also proposed. Then we defined two RIFs, κ₁ and κ₂, which turned out to be different from the standard one. The latter RIF was mentioned in [9], yet the former one seems to be new. We examined the properties of these RIFs, with special stress laid on the relationship to the standard RIF.

In the sequel, we introduced functions complementary to RIFs (co-RIFs), which resulted in a new characterization of the standard RIF in terms of the remaining two RIFs and their complementary mappings. We examined the properties of each of the three co-RIFs: κ̄£, κ̄₁, and κ̄₂. We easily found out that they might serve as distance functions; however, only the latter two functions proved to satisfy the triangle inequality. In the next step, the co-RIFs were used to define certain distance functions, δ£, δ₁, and δ₂, which turned out to be metrics on the power set of the set of all objects considered. δ₁ was already known in the literature [34]. From the distance functions mentioned above we finally derived their complementary mappings, δ̄£, δ̄₁, and δ̄₂, serving as similarity measures. As it turned out, they were discovered many years ago [35, 36, 37, 38]. In this way, starting with an idea which led to the standard RIF and going through intermediate stages (co-RIFs and certain metrics based on them), we finally arrived at similarity indices known in machine learning, relational learning, and statistical learning, to name a few areas of application.

References

1. Gomolińska, A.: On three closely related rough inclusion functions. In: Kryszkiewicz, M., Peters, J.F., Rybiński, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 142–151. Springer, Heidelberg (2007)
2. Polkowski, L., Skowron, A.: Rough mereology. In: Raś, Z.W., Zemankova, M. (eds.) ISMIS 1994. LNCS (LNAI), vol. 869, pp. 85–94. Springer, Heidelberg (1994)
3. Polkowski, L., Skowron, A.: Rough mereology: A new paradigm for approximate reasoning. Int. J. Approximate Reasoning 15, 333–365 (1996)
4. Polkowski, L., Skowron, A.: Rough mereological calculi of granules: A rough set approach to computation. Computational Intelligence 17, 472–492 (2001)
5. Leśniewski, S.: Foundations of the General Set Theory 1 (in Polish). Works of the Polish Scientific Circle, Moscow, vol. 2 (1916); also in [6], pp. 128–173
6. Surma, S.J., Srzednicki, J.T., Barnett, J.D. (eds.): Stanisław Leśniewski. Collected Works. Kluwer/Polish Scientific Publishers, Dordrecht/Warsaw (1992)
7. Borkowski, L. (ed.): Jan Łukasiewicz – Selected Works. North Holland/Polish Scientific Publishers, Amsterdam/Warsaw (1970)
8. Łukasiewicz, J.: Die logischen Grundlagen der Wahrscheinlichkeitsrechnung. Cracow (1913); English translation in [7], pp. 16–63
9. Drwal, G., Mrózek, A.: System RClass – software implementation of a rough classifier. In: Kłopotek, M.A., Michalewicz, M., Raś, Z.W. (eds.) Proc. 7th Int. Symp. Intelligent Information Systems (IIS 1998), Malbork, Poland, June 1998, pp. 392–395 (1998)


10. Stepaniuk, J.: Knowledge discovery by application of rough set models. In: Polkowski, L., Tsumoto, S., Lin, T.Y. (eds.) Rough Set Methods and Applications: New Developments in Knowledge Discovery in Information Systems, pp. 137–233. Physica, Heidelberg (2001)
11. Pawlak, Z.: Rough sets. Int. J. Computer and Information Sciences 11, 341–356 (1982)
12. Pawlak, Z.: Information Systems. Theoretical Foundations (in Polish). Wydawnictwo Naukowo-Techniczne, Warsaw (1983)
13. Pawlak, Z.: Rough Sets. Theoretical Aspects of Reasoning About Data. Kluwer, Dordrecht (1991)
14. Pawlak, Z.: Rough set elements. In: Polkowski, L., Skowron, A. (eds.) Rough Sets in Knowledge Discovery, vol. 1, pp. 10–30. Physica, Heidelberg (1998)
15. Pawlak, Z.: A treatise on rough sets. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets IV. LNCS, vol. 3700, pp. 1–17. Springer, Heidelberg (2005)
16. Bazan, J.G., Skowron, A., Swiniarski, R.: Rough sets and vague concept approximation: From sample approximation to adaptive learning. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets V. LNCS, vol. 4100, pp. 39–63. Springer, Heidelberg (2006)
17. Peters, J.F.: Approximation spaces for hierarchical intelligent behavioral system models. In: Dunin-Kęplicz, B., Jankowski, A., Skowron, A., Szczuka, M. (eds.) Monitoring, Security, and Rescue Techniques in Multiagent Systems, pp. 13–30. Springer, Heidelberg (2005)
18. Peters, J.F., Skowron, A., Stepaniuk, J.: Nearness of objects: Extension of approximation space model. Fundamenta Informaticae 79, 497–512 (2007)
19. Skowron, A., Stepaniuk, J.: Generalized approximation spaces. In: Lin, T.Y., Wildberger, A.M. (eds.) Soft Computing, pp. 18–21. Simulation Councils, San Diego (1995)
20. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundamenta Informaticae 27, 245–253 (1996)
21. Skowron, A., Stepaniuk, J., Peters, J.F., Swiniarski, R.: Calculi of approximation spaces. Fundamenta Informaticae 72, 363–378 (2006)
22. Skowron, A., Swiniarski, R., Synak, P.: Approximation spaces and information granulation. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets III. LNCS, vol. 3400, pp. 175–189. Springer, Heidelberg (2005)
23. Ziarko, W.: Variable precision rough set model. J. Computer and System Sciences 46, 39–59 (1993)
24. Ziarko, W.: Probabilistic decision tables in the variable precision rough set model. Computational Intelligence 17, 593–603 (2001)
25. Ziarko, W.: Probabilistic rough sets. In: Ślęzak, D., Wang, G., Szczuka, M.S., Düntsch, I., Yao, Y. (eds.) RSFDGrC 2005. LNCS, vol. 3641, pp. 283–293. Springer, Heidelberg (2005)
26. Ziarko, W.: Stochastic approach to rough set theory. In: Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H.S., Słowiński, R. (eds.) RSCTC 2006. LNCS, vol. 4259, pp. 38–48. Springer, Heidelberg (2006)
27. Yao, Y.Y.: Decision-theoretic rough set models. In: Yao, J., Lingras, P., Wu, W.-Z., Szczuka, M.S., Cercone, N.J., Ślęzak, D. (eds.) RSKT 2007. LNCS, vol. 4481, pp. 1–12. Springer, Heidelberg (2007)
28. Yao, Y.Y.: Probabilistic rough set approximations. Int. J. of Approximate Reasoning (in press, 2007), doi:10.1016/j.ijar.2007.05.019
29. Yao, Y.Y., Wong, S.K.M.: A decision theoretic framework for approximating concepts. Int. J. of Man–Machine Studies 37, 793–809 (1992)


30. Pawlak, Z., Skowron, A.: Rough membership functions. In: Fedrizzi, M., Kacprzyk, J., Yager, R.R. (eds.) Advances in the Dempster–Shafer Theory of Evidence, pp. 251–271. John Wiley & Sons, Chichester (1994)
31. Gomolińska, A.: Possible rough ingredients of concepts in approximation spaces. Fundamenta Informaticae 72, 139–154 (2006)
32. Nguyen, H.S., Skowron, A., Stepaniuk, J.: Granular computing: A rough set approach. Computational Intelligence 17, 514–544 (2001)
33. Zhang, M., Xu, L.D., Zhang, W.X., Li, H.Z.: A rough set approach to knowledge reduction based on inclusion degree and evidence reasoning theory. Expert Systems 20, 298–304 (2003)
34. Marczewski, E., Steinhaus, H.: On a certain distance of sets and the corresponding distance of functions. Colloquium Mathematicum 6, 319–327 (1958)
35. Jaccard, P.: Nouvelles recherches sur la distribution florale. Bull. de la Société Vaudoise des Sciences Naturelles 44, 223–270 (1908)
36. Kulczyński, S.: Die Pflanzenassociationen der Pieninen. Bull. Internat. Acad. Polon. Sci. Lett., Sci. Math. et Naturelles, série B, suppl. II 2, 57–203 (1927)
37. Rand, W.: Objective criteria for the evaluation of clustering methods. J. of the American Statistical Association 66, 846–850 (1971)
38. Sokal, R.R., Michener, C.D.: A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin 38, 1409–1438 (1958)
39. Xu, Z.B., Liang, J.Y., Dang, C.Y., Chin, K.S.: Inclusion degree: A perspective on measures for rough set data analysis. Information Sciences 141, 227–236 (2002)
40. Zhang, W.X., Leung, Y.: Theory of including degrees and its applications to uncertainty inference. In: Proc. of 1996 Asian Fuzzy System Symposium, pp. 496–501 (1996)
41. An, A., Cercone, N.: Rule quality measures for rule induction systems: Description and evaluation. Computational Intelligence 17, 409–424 (2001)
42. Kryszkiewicz, M.: Fast discovery of representative association rules. In: Polkowski, L., Skowron, A. (eds.) RSCTC 1998. LNCS (LNAI), vol. 1424, pp. 214–221. Springer, Heidelberg (1998)
43. Tsumoto, S.: Modelling medical diagnostic rules based on rough sets. In: Polkowski, L., Skowron, A. (eds.) RSCTC 1998. LNCS (LNAI), vol. 1424, pp. 475–482. Springer, Heidelberg (1998)
44. Albatineh, A.N., Niewiadomska-Bugaj, M., Mihalko, D.: On similarity indices and correction for chance agreement. J. of Classification 23, 301–313 (2006)
45. Wallace, D.L.: A method for comparing two hierarchical clusterings: Comment. J. of the American Statistical Association 78, 569–576 (1983)
46. Zadeh, L.A.: Fuzzy sets. Information and Control 8, 338–353 (1965)

Automatic Rhythm Retrieval from Musical Files

Bożena Kostek, Jarosław Wójcik, and Piotr Szczuko

Gdańsk University of Technology, Multimedia Systems Department, Narutowicza 11/12, 80-952 Gdańsk, Poland
{bozenka,szczuko}@sound.eti.pg.gda.pl, [email protected]

Abstract. This paper presents a comparison of the effectiveness of two computational intelligence approaches applied to the task of retrieving the rhythmic structure of musical files. The method proposed by the authors of this paper generates rhythmic levels first, and then uses these levels to compose rhythmic hypotheses. Three phases – creating periods, creating simplified hypotheses, and creating full hypotheses – are examined within this study. All experiments are conducted on a database of national anthems. Decision systems such as Artificial Neural Networks and Rough Sets are employed to search for the metric structure of musical files. This was based on examining the physical attributes of sound that are important in determining the placement of a particular sound in an accented location of a musical piece. The results of the experiments show that both decision systems award note duration as the most significant parameter in the automatic search for the metric structure of rhythm in musical files. Also, a brief description of an application realizing automatic rhythm accompaniment is presented.

Keywords: Rhythm Retrieval, Metric Rhythm, Music Information Retrieval, Artificial Neural Networks, Rough Sets.

1 Introduction

The aim of this article is to present a comparative study of the effectiveness of two computational intelligence approaches applied to the task of retrieving the rhythmic structure of musical files. Existing metric rhythm research usually focuses on retrieving low rhythmic levels – it goes down to the level of a measure. Typically those methods are sufficient to emulate human perception of local rhythm. According to McAuley & Semple [14], trained musicians perceive more levels, though. High-level perception is required of drum players, thus a computational approach needs to retrieve the so-called hypermetric structure of a piece. If it reaches high rhythmic levels such as phrases, sentences and periods, then automatic drum accompaniment applications can be developed. Rhythm retrieval research is a broad field and, among other issues, involves the quantization of the beginnings and lengths of notes, the extraction of rhythm events from audio recordings, and the search for the meter of compositions.


Rhythm is an element of a piece determining musical style, which may be valuable in retrieval. The rhythmic structure, together with the patterns retrieved, carries information about the genre of a piece. Content-based methods of music retrieval are nowadays developed by researchers from the multimedia retrieval and computational intelligence domains. The most common classes of rhythm retrieval models are: rule-based, multiple-agent, multiple-oscillator and probabilistic. The rhythm retrieval methods can be classified within the context of what type of actions they take, i.e. whether they quantize musical data, or find the tempo of a piece (e.g. van Belle [2]), time signatures, positions of barlines, a metric structure or an entire hypermetric hierarchy.

Rhythm finding systems very often rank the hypotheses of rhythm based on a sound salience function. Since scientists differ in their opinions on the aspect of salience, the Authors carried out dedicated experiments to address the salience problem. A number of research studies are based on the theory published by Lerdahl & Jackendoff [13], who claim that such physical attributes of sounds as pitch (frequency), duration and velocity (amplitude) influence the rhythmical salience of sounds. Another approach, proposed by Rosenthal [19], ranks higher the hypotheses in which long sounds are placed in accented positions. In Dixon's [4] multiple-agent approach, two salience functions are proposed, combining duration, pitch and velocity. The first is a linear combination of physical attributes; Dixon calls it an additive function. The other one is a multiplicative function. Dahl [3] notices that drummers play accented strokes with higher amplitude than unaccented ones. Parncutt, in his book [15], claims that lower sounds fall on the beat. In the review of Parncutt's book, Huron [5] notices that the high salience of low sounds is "neither an experimentally determined fact nor an established principle in musical practice". A duration-based hypothesis predominated in rhythm-related works; however, this approach seemed to be based on intuition only. The experimental confirmation of this thesis – based on Data Mining (DM) association rules and Artificial Neural Networks (ANNs) – can be found in former works by the Authors of this paper [6], [7], [8] and also in the doctoral thesis of Wójcik [27]. The experiments employing rough sets, which are the subject of this paper, were performed in order to confirm the results obtained with the DM and ANN approaches. Another reason was to verify whether all three computational intelligence models applied to the salience problem return similar findings, which may prove the correctness of these approaches.

This article is an extended version of a paper included in the Proceedings of Rough Sets and Intelligent Systems Paradigms [12]. The remainder of the paper is organized as follows: in Section 2 a short review of computational intelligence methods used in research related to the emulation of human perception is presented. Then, Section 3 presents some issues concerning hypermetric rhythm retrieval, which lead towards the experiments on rhythm retrieval. A brief description of the application realizing automatic rhythm accompaniment is given in Section 4, along with an approach to the computational complexity of the algorithm creating hypermetric rhythmic hypotheses (Section 5). Finally, Section 6 puts forward a summary of the results as well as some concluding remarks.

2 Emulation of Human Perception by Computational Intelligence Techniques

The domain of computational intelligence has grown into an independent and very attractive research area over the last few years, with many applications dedicated to data mining in the musical domain [8], [9], [23], [24]. Computational Intelligence (CI) is a branch of Artificial Intelligence which deals with the soft facets of AI, i.e. programs behaving intelligently. CI is understood in a number of ways, e.g. as the study of the design of intelligent agents, or as a subbranch of AI which aims "to use learning, adaptive, or evolutionary computation to create programs that are, in some sense, intelligent" [25]. Researchers are trying to classify the branches of CI to designate the ways in which CI methods help humans discover how their perception works. However, this is a multi-faceted task with numerous overlapping definitions, thus the map of this discipline is ambiguous. The domain of CI groups several approaches, the most common being: Artificial Neural Networks (ANNs), Fuzzy Systems, Evolutionary Computation, Machine Learning including Data Mining, Soft Computing, Rough Sets, Bayesian Networks, Expert Systems and Intelligent Agents [18]. Currently, in the age of CI, people are trying to build machines emulating human behaviors, and one such application concerns rhythm perception. This paper presents an example of how to design and build an algorithm which is able to emulate human perception of rhythm.

Two CI approaches, namely ANNs and Rough Sets (RS), are used in the experiments aiming at the estimation of musical salience. The first of them, the ANN model, concerns processes which are not entirely known, e.g. human perception of rhythm. The latter is the RS approach, introduced by Pawlak [16] and used by many researchers in data discovery and intelligent management [17], [18]. Since the applicability of ANNs in recognition has been experimentally confirmed in a number of areas, neural networks are also used here to estimate the rhythmic salience of sounds. There exists a vast literature on ANNs, and for this reason only a brief introduction to this area is presented in this paper. The structure of an ANN usually employs the McCulloch-Pitts model, involving a modification of the neuron activation function, which is usually sigmoidal. All neurons are interconnected. Within the context of network topology, ANNs can be classified as feedforward or recurrent networks, the latter also called feedback networks. In the case of recurrent ANNs the connections between units form cycles, while in feedforward ANNs the information moves in only one direction, i.e. forward. The elements of a vector of object features constitute the values which are fed to the input of an ANN. The type of data accepted at the input and/or returned at the output of an ANN is also a differentiating factor. Quantitative variable values are continuous by nature, while categorical variables belong to a finite set (small, medium, big, large). ANNs with continuous values at the input are able to determine the degree of membership in a certain class. The output of networks based on categorical variables may be Boolean, in which case the network decides whether an object belongs to a class or not. In the case of the salience problem the number of categorical output variables equals two, and it is determined whether the sound is accented or not.


In the experiments the Authors examined whether a supervised categorical network such as Learning Vector Quantization (LVQ) is sufficient to resolve the salience problem. The classification task of the network was to recognize a sound as accented or not. LVQs are self-organizing networks with the ability to learn and detect the regularities and correlations in their input, and then to adapt their responses to that input. An LVQ network is trained in a supervised manner; it consists of a competitive layer and a linear layer. The first one classifies the input vectors into subclasses, and the latter transforms the input vectors into target classes. On the other hand, the aim of the RS-based experiments was two-fold. First, it was to compare the results with the ones coming from the ANN. In addition, two schemes of data discretization were applied; in the case of k-means discretization, accuracies of predictions are reported.
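To make the classification setting concrete, the sketch below implements a plain LVQ1 training loop of the kind outlined above. It is a simplification of the competitive-plus-linear architecture actually used in the experiments; the prototype count, learning rate and all names are our illustrative assumptions:

    import numpy as np

    def train_lvq1(X, y, protos_per_class=2, lr=0.05, epochs=30, seed=0):
        """Minimal LVQ1: move the winning prototype towards a training
        vector of the same class, away from one of a different class."""
        rng = np.random.default_rng(seed)
        protos, labels = [], []
        for c in np.unique(y):
            idx = rng.choice(np.flatnonzero(y == c), protos_per_class, replace=False)
            protos.append(X[idx].astype(float))
            labels += [c] * protos_per_class
        protos, labels = np.vstack(protos), np.array(labels)
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                w = np.argmin(np.linalg.norm(protos - xi, axis=1))   # winner
                step = lr * (xi - protos[w])
                protos[w] += step if labels[w] == yi else -step
        return protos, labels

    def lvq_predict(protos, labels, X):
        d = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
        return labels[np.argmin(d, axis=1)]

Here each row of X would hold the (normalized) duration, pitch and velocity of one sound, and y the accented/not-accented flag.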

3 Experiments

3.1 Database

The presented experiments were conducted on MIDI files of eighty national anthems retrieved from the Internet. Storing information about meter in the files is necessary to indicate accented sounds in a musical piece. This information, however, is optional in MIDI files, thus information on whether a sound is accented or not is not always available. In addition, in a number of musical files retrieved from the Internet, the assigned meter is incorrect or there is no information about meter at all. This is why the correctness of the meter was checked by inserting an additional simple drum track into the melody. The hits of the snare drum were inserted at the locations of the piece calculated with Formula (1), where T is a period computed with the autocorrelation function, and i indicates the subsequent hits of the snare drum:

i · T, i = 0, 1, 2, . . .  (1)
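The verification step can be sketched as follows: the period T is estimated from the autocorrelation of a binary onset grid, and the snare hits are then generated at the positions i · T of Formula (1). The grid resolution, lag range and the representation of onsets are our illustrative assumptions, not the Authors' exact settings:

    import numpy as np

    def estimate_period(onset_ticks, grid=120, min_lag=2, max_lag=64):
        """Autocorrelation of a binary onset grid; returns the dominant period in ticks."""
        n = int(max(onset_ticks) // grid) + 1
        g = np.zeros(max(n, max_lag + 1))
        g[[int(t // grid) for t in onset_ticks]] = 1.0
        ac = np.correlate(g, g, mode='full')[len(g) - 1:]   # non-negative lags only
        lag = min_lag + int(np.argmax(ac[min_lag:max_lag]))
        return lag * grid                                   # the period T

    def snare_hits(T, piece_len_ticks):
        """Hit locations i*T, i = 0, 1, 2, ... (Formula (1))."""
        return list(range(0, piece_len_ticks, T))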

The Authors listened to the musical files with the snare drum hits inserted, and rejected all the files in which accented locations were indicated incorrectly. Also, some anthems with changes in time signature could not be included in the training and testing sets, because this metric rhythm retrieval method deals with hypotheses based on rhythmic levels of a constant period. Usually a change in time signature results in a change in the period of the rhythmic level corresponding to the meter; an example of such a change is from 3/4 to 4/4. Conversely, an example of a change in time signature which does not influence the correct indication of accented sounds is from 2/4 to 4/4. The salience experiments presented in this paper are conducted on polyphonic MIDI tracks containing melodies; overlapping sounds coming from tracks other than the melodic ones were not included in the experimental sets.

For the purpose of the experiments, the values of the sounds' physical attributes were normalized and discretized with the equal subrange technique. The minimum and maximum values within the domain of each attribute are found. The whole range is then divided into m subranges, with the thresholds between the subranges placed at the locations computed with the aid of Formula (2):

MinValue + (MaxValue − MinValue) · j/m, for j = 0, 1, 2, . . . , m.  (2)
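Formula (2) amounts to standard equal-width binning. A short sketch (parameter names ours):

    def equal_subranges(values, m):
        """Equal-width discretization with thresholds as in Formula (2)."""
        lo, hi = min(values), max(values)
        width = (hi - lo) / m
        thresholds = [lo + width * j for j in range(m + 1)]
        # The last subrange is closed on the right so that hi falls into bin m-1.
        bins = [min(int((v - lo) / width), m - 1) if width else 0 for v in values]
        return bins, thresholds

    bins, th = equal_subranges([60, 62, 64, 67, 72, 76], m=3)
    print(th)      # [60.0, 65.33..., 70.66..., 76.0]
    print(bins)    # [0, 0, 0, 1, 2, 2]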

3.2 ANN-Based Experiment

For the training phase, accented locations in each melody were found with the methods described in Section 3.1. One of the tested networks had three separate inputs – one for each physical attribute of a sound (duration, frequency and amplitude – DPV). The three remaining networks had one input each. Each input took a different physical attribute of a given sound, namely D – duration, P – pitch (frequency) or V – velocity (amplitude). All attributes were in the range from 0 to 127. The network output was binary: 1 if the sound was accented, or 0 if it was not. Musical data were provided to the networks to train them to recognize accented sounds on the basis of physical attributes. In this study the LVQ network recognized a sound as 'accented' or 'not accented'. Since physical attributes are not the only features determining whether a sound is accented, some network answers may be incorrect. The network accuracy NA was formulated as the ratio of the number of accented sounds correctly detected by the network to the total number of accented sounds in a melody, as stated in Formula (3):

NA = number of accented sounds correctly detected by the network / number of all accented sounds.  (3)

The hazard accuracy HA is the ratio of the number of accents given by the network to the number of all sounds in a set, as stated in Formula (4):

HA = number of accented sounds detected by the network / number of all sounds.  (4)

The melodies of the anthems were used to create 10 training/testing sets. Each set included 8 entire pieces. Each sound with an index divisible by 3 was assumed to be a training sound; the remaining sounds were treated as testing sounds. As a consequence, the testing set was twice as large as the training set. Accuracies in the datasets were averaged for each network separately; evaluating a separate accuracy for each ANN allowed for comparing their preciseness. Standard deviations were also calculated, and fractions equal to the standard deviations divided by the average values were derived. Such fractions help compare the stability of the results: the lower the value of the fraction, the more stable the results. All results are shown on the right side of Table 1, where a single accuracy value is assigned to each ANN, together with the standard deviation and the resultant stability fraction. The accuracy of finding accented sounds estimated for the four networks can be seen in Fig. 1; the plots are drawn on the basis of the data from Table 1.
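Formulas (3) and (4) translate directly into code, and the NA/HA fraction reported in Table 1 is then a single division (a sketch; the 0/1 encoding is our assumption):

    def na_ha_ratio(predicted, actual):
        """predicted, actual: sequences of 0/1 flags, 1 meaning 'accented'."""
        hits = sum(p and a for p, a in zip(predicted, actual))
        na = hits / sum(actual)                  # Formula (3)
        ha = sum(predicted) / len(predicted)     # Formula (4)
        return na / ha                           # > 1: better than a blind choice

    print(na_ha_ratio([1, 0, 1, 0, 0, 1], [1, 0, 0, 1, 0, 1]))   # (2/3)/(1/2) = 1.33...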


Table 1. Parameters of training and testing data and performance of ANNs

Set No.      Number of sounds              Acc./all      NA/HA
             All    Accented  Not accented  [%]        D     P     V     DPV
1            937    387       550           41         1.90  1.01  0.95  1.96
2            1173   386       787           33         2.28  0.89  1.23  2.19
3            1054   385       669           37         2.14  0.96  0.11  2.13
4            937    315       622           34         2.25  1.13  0.79  2.49
5            801    293       508           37         1.98  1.02  1.04  1.95
6            603    245       358           41         1.67  1.02  0.93  1.24
7            781    332       449           43         1.93  0.98  1.16  1.89
8            880    344       536           39         2.06  0.97  1.13  2.14
9            867    335       532           39         1.91  0.87  0.83  1.73
10           1767   509       1258          29         2.14  0.72  1.62  2.66
Avg.         980    353       626           37         2.03  0.96  0.98  2.03
StdDev       317    71        251           4          0.19  0.11  0.39  0.39
StdDev/Avg                                             0.09  0.12  0.40  0.19

Fig. 1. Accuracy of four networks for melodies of anthems

There are three plots presenting the results of networks fed with one attribute only, and one plot for the network presented with all three physical attributes at its input (line DPV). The consecutive pairs of training and testing sets are on the horizontal axis; the fraction NA/HA, signifying how many times an approach is more accurate than a blind choice, is on the vertical axis.

3.3 Rough Set-Based Experiments

The aim of this experiment was to obtain results analogous to those coming from the ANN and to confront the two sets of results with each other. In particular, it was expected to confirm whether physical attributes influence the tendency of sounds to be located in accented positions.


Further, the experiment was to answer how complex the dependence of the rhythmic salience of a sound on its physical attributes is, and to observe the stability of the accuracies obtained in the RS-based experiments. In the rough set-based experiments, the dataset named RSESdata1 was split into training and testing sets in the 3:1 ratio. Then the rules were generated, utilizing a genetic algorithm available in the Rough Set Exploration System [1], [22]. For dataset RSESdata1, 7859 rules were obtained, resulting in a classification accuracy of 0.75 with a coverage equal to 1. It should be remembered that accuracy is a measure of classification success, defined as the ratio of the number of properly classified new cases (objects) to the total number of new cases. Rules with support less than 10 were then removed. The set of rules was thus reduced to 427 and the accuracy dropped to 0.736, with the coverage still remaining 1. Then the next attempt to further decrease the number of rules was made, and rules with support less than 30 were excluded. In this case, 156 rules were still valid, but the accuracy dropped significantly, i.e. to 0.707, and at the same time the coverage fell to 0.99. It was decided that for a practical implementation of a rough set-based classifier, the set of 427 rules is suitable. Reducts used in rule generation are presented in Table 2. The same approach was used for dataset RSESdata2, and resulted in 11121 rules with an accuracy of 0.742 and a coverage of 1. After removing rules with support less than 10, only 384 rules remained, and the accuracy dropped to 0.735. Again, such a number of rules is practically applicable. Reducts used in rule generation are presented in Table 3. The approach taken for the LVQ network was also implemented for rough sets. Ten different training/test sets were acquired by randomly splitting the data into five pairs; each set in a pair was then further divided into two sets – a training and a testing one – in the 2:1 ratio. Therefore the testing sets contained 1679 objects each.

Table 2. Reducts for the RSESdata1 dataset

Reducts                  Positive Region  Stability Coefficient
{ duration, pitch }      0.460            1
{ duration, velocity }   0.565            1
{ pitch, velocity }      0.369            1
{ duration }             0.039            1
{ pitch }                0.002            1
{ velocity }             0.001            1

Table 3. Reducts for the RSESdata2 dataset

Reducts                  Positive Region  Stability Coefficient
{ duration, velocity }   0.6956           1
{ duration, pitch }      0.6671           1
{ pitch, velocity }      0.4758           1
{ duration }             0.0878           1
{ pitch }                0.0034           1
{ velocity }             0.0028           1


Table 4. Parameters of training and testing data and performance of RSES (RSA is a Rough Set factor, analogous to NA in ANNs)

Set No.      Number of sounds                     Acc/all      RSA/HA
             All testing  Accented  Not accented  [%]        D     P     V     DPV
1            1679         610       1069          36.33      1.81  1.06  1.21  1.75
2            1679         608       1071          36.21      1.90  1.08  1.09  1.74
3            1679         594       1085          35.37      1.84  1.12  1.19  1.74
4            1679         638       1041          37.99      1.68  1.08  1.12  1.62
5            1679         632       1047          37.64      1.67  1.07  1.12  1.64
6            1679         605       1074          36.03      1.87  1.16  1.13  1.88
7            1679         573       1106          34.12      1.77  1.09  1.18  1.68
8            1679         618       1061          36.80      1.90  1.06  1.17  1.73
9            1679         603       1076          35.91      1.77  1.08  1.11  1.70
10           1679         627       1052          37.34      1.77  1.08  1.15  1.66
Avg.         1679         610       1068          36.37      1.80  1.09  1.15  1.72
StdDev       0            19.2      19.2          1.14       0.08  0.02  0.039 0.07
StdDev/Avg                                                   0.04  0.02  0.033 0.04

The experiments, however, were based on the RSESdata1 set because of its higher generalization ability (see Table 4). It should be remembered that a reduct is a set of attributes that discerns objects with different decisions. The positive region shows what part of the indiscernibility classes for a reduct lies inside the rough set: the larger the boundary regions are, the more rules are nondeterministic, and the smaller the positive region is. The stability coefficient reveals whether the reduct appears also for the subsets of the original dataset which are calculated during the reduct search. For the reduct {duration} the positive region is very small, but during classification a voting method is used to infer the correct outcome from many nondeterministic rules, and, finally, high accuracy is obtained. Adding another dimension, e.g. {duration, velocity}, results in a higher number of deterministic rules and a larger positive region, but it does not guarantee an accuracy increase (Table 4). Rules were generated utilizing different reduct sets (compare with Table 1): D – {duration} only; P – {pitch} only; V – {velocity} only; DPV – all 6 reducts {duration, velocity}, {duration, pitch}, {pitch, velocity}, {duration}, {pitch}, {velocity} were employed.

k-NN Discretization. The data were also analyzed employing the k-NN method, which is implemented as a part of the RSES system [22]. The experiment was carried out differently in comparison to the previously performed experiments using the ANN (LVQ) and RS. The reason for this was to observe the accuracy of classification for various values of k. It may easily be observed that a lower number of clusters implies a better accuracy of the predictions and a smaller number of generated rules.

Table 5. Cut points in the case of k=3

Duration   45.33    133.88
Pitch      44.175   78.285
Velocity   44.909   75.756

Table 6. Classification results for k=3

                    1      0      No. of obj.  Accuracy  Coverage
1                   665    206    899          0.763     0.969
0                   375    1,190  1,570        0.76      0.997
True positive rate  0.64   0.85

Table 7. Cut points in the case of k=4

Duration   38.577   98.989   198.56
Pitch      25.622   51.85    79.988
Velocity   41.18    65.139   89.67

Table 8. Classification results for k=4

                    1      0      No. of obj.  Accuracy  Coverage
1                   640    220    899          0.744     0.957
0                   353    1,202  1,570        0.773     0.99
True positive rate  0.64   0.85

In the following experiments, full attribute vectors [“duration”, “pitch”, “velocity”] are used as reducts. The k-means discretization is performed, where the k values are set manually: k = {3, 4, 5, 10, 15, 20}. For a given k, exactly k clusters are calculated, represented by their center points. A cut point is set as the middle point between two neighboring cluster centers (see the sketch after the experiment descriptions below). Cut points are used for attribute discretization, and then rough set rules are generated. The training set comprises 7407 objects and the testing one 2469 objects (the 3:1 ratio).

Experiment I – k-means discretization (k=3) of each attribute (“duration”, “pitch”, “velocity”), 872 rules. Cut points are shown in Table 5 and classification results in Table 6 (total accuracy: 0.761, total coverage: 0.987).

Experiment II – k-means discretization (k=4) of each attribute (“duration”, “pitch”, “velocity”), 1282 rules. Cut points are shown in Table 7 and classification results in Table 8 (total accuracy: 0.763, total coverage: 0.978).

Experiment III – k-means discretization (k=5) of each attribute (“duration”, “pitch”, “velocity”), 1690 rules. Cut points are shown in Table 9 and classification results in Table 10 (total accuracy: 0.766, total coverage: 0.967).
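The cut-point derivation can be sketched as follows, with a plain Lloyd iteration standing in for the k-means step actually performed by RSES (the helper names are ours); note that k clusters yield k − 1 cut points, in agreement with Tables 5–15:

def kmeans_1d(values, k, iters=100):
    # Lloyd's algorithm on a single attribute; returns sorted cluster centers
    vals = sorted(values)
    centers = [vals[i * (len(vals) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vals:
            clusters[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        new = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new == centers:
            break
        centers = new
    return sorted(centers)

def cut_points(centers):
    # a cut point is the midpoint between two neighboring cluster centers
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]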


Table 9. Cut points in the case of k=5

Duration   31.733   72.814   133.91   259.15
Pitch      24.224   47.536   68.629   94.84
Velocity   27.826   48.708   66.759   89.853

Table 10. Classification results for k=5

                    1      0      No. of obj.  Accuracy  Coverage
1                   619    232    899          0.727     0.947
0                   326    1,211  1,570        0.788     0.979
True positive rate  0.66   0.84

Table 11. Cut points in the case of k=10

Duration   11.11    27.319   44.375   62.962   86.621   121.62   174.66   264.32   642.94
Pitch      18.089   35.259   47.046   56.279   64.667   73.161   82.648   95.963   119.15
Velocity   14.558   27.307   36.02    42.846   49.768   57.81    67.94    81.992   102.45

Table 12. Classification results for k=10

                    1      0      No. of obj.  Accuracy  Coverage
1                   533    227    899          0.701     0.845
0                   253    1,162  1,570        0.821     0.901
True positive rate  0.68   0.84

Table 13. Cut points in the case of k=15

Duration   3.2372   8.8071   13.854   20.693   29.284   37.857   46.806
           58.942   76.842   101.33   132.65   183.95   283.39   656.29
Pitch      9.4427   23.023   33.148   41.285   48.225   53.264   57.875
           64.17    70.125   74.465   79.687   86.909   97.988   119.79
Velocity   15.139   28.128   37.217   44.091   49.015   52.011   53.913
           56.751   60.378   63.963   68.747   75.963   86.914   104.36

Experiment IV – k-means discretization (k=10) of each attribute (“duration”, “pitch”, “velocity”), 2987 rules. Cut points are shown in Table 11 and classification results in Table 12 (total accuracy: 0.779, total coverage: 0.881).

Table 14. Classification results for k=15

                    1      0      No. of obj.  Accuracy  Coverage
1                   492    217    899          0.694     0.789
0                   229    1,121  1,570        0.83      0.86
True positive rate  0.68   0.84

Table 15. Cut points in the case of k=20

Duration   2.920    7.388    11.601   16.635   21.486   26.464   31.96
           38.505   45.68    53.632   63.169   74.84    88.631   107.31
           132.67   173.43   235.67   326.34   672.55
Pitch      7.584    18.309   25.782   31.629   35.553   38.184   40.174
           41.5     42.674   44.977   49.35    54.714   60.045   65.348
           70.631   76.774   84.981   97.402   119.79
Velocity   9.603    22.588   33.807   43.158   50.163   55.096   58.528
           61.725   65.283   68.596   71.363   73.564   75.624   78.691
           83.434   89.438   96.862   106.57   121.49

Table 16. Classification results for k=20

                    1      0      No. of obj.  Accuracy  Coverage
1                   476    223    899          0.681     0.778
0                   233    1,122  1,570        0.828     0.863
True positive rate  0.67   0.83

Experiment V – k-means discretization (k=15) of each attribute (“duration”, “pitch”, “velocity”), 3834 rules. Cut points are shown in Table 13 and classification results in Table 14 (total accuracy: 0.783, total coverage: 0.834).

Experiment VI – k-means discretization (k=20) of each attribute (“duration”, “pitch”, “velocity”), 4122 rules. Cut points are shown in Table 15 and classification results in Table 16 (total accuracy: 0.778, total coverage: 0.832).

Retrieving rhythmic patterns together with the hierarchical structure of rhythm acquired with machine learning is a step towards an application capable of creating an automatic drum accompaniment to a given melody. Such a computer system is presented in Section 4.

4 Automatic Drum Accompaniment Application

The hypermetric rhythm retrieval approach proposed in this article is illustrated with a practical application: a system automatically generating a drum accompaniment to a given melody. A stream of sounds in the MIDI format is introduced at the system input; on the basis of the musical content, the method retrieves the hypermetric structure of rhythm of a musical piece, consisting of rhythmic motives, phrases, and sentences.


Fig. 2. The tree of periods

The method does not use any information about the rhythm (time signature), even though it is often present in MIDI files. Neither rhythmic tracks nor harmonic information are used to support the method. The only information analyzed is a melody, which may be monophonic as well as polyphonic. Two elements, namely the recurrence of melodic and rhythmic patterns and the rhythmic salience of sounds, are combined to create a machine able to find the metric structure of rhythm of a given melody. The method proposed by the authors of this paper generates rhythmic levels first, and then uses these levels to compose rhythmic hypotheses. The lowest rhythmic level has the phase of the first sound of the piece and its period is atomic. The following levels have periods of values achieved by recursive multiplication of the periods that have already been calculated (starting from the atomic value) by the most common prime numbers in Western music, i.e. 2 and 3. The process of period generation may be illustrated as the formation of a tree structure (Figure 2) with a root representing the atomic period equal to 1. Each node is represented by a number which is its ancestor's number multiplied by either 2 or 3. The tree holds some duplicates: a node holding a duplicated value would generate a sub-tree whose nodes would all also be duplicates of already existing values. Thus duplicate subtrees are eliminated, and we obtain a graphical interpretation in the form of the period triangle (see Figure 3), where the top row refers to a quarter-note, and the consecutive rows to a half-note, whole note (motive), phrase, sentence and period. When the phase of period creation is completed, each period must have all its phases (starting from phase 0) generated. The last phase of a given rhythmic level has a value equal to the size of the period decreased by one atomic period. In order to obtain hypotheses from the generated rhythmic levels, it is necessary to find all families of related rhythmic levels. A level may belong to many families. The generated hypotheses are instantly ranked to extract the one which designates the appropriate rhythm of the piece. The hypotheses that cover notes of significant rhythmic weights are ranked higher. The weights are calculated based on the knowledge gathered by learning systems that know how to assess the importance of the physical characteristics of the sounds that comprise the piece.
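The layer-by-layer construction of the period triangle can be reproduced with a few lines of Python (a sketch; the names are ours):

def period_layers(n):
    # multiply every period of the previous layer by 2 and by 3, dropping
    # duplicates, so layer k holds all values 2^i * 3^j with i + j = k - 1
    layers = [[1]]  # the root: the atomic period
    for _ in range(n - 1):
        layers.append(sorted({p * f for p in layers[-1] for f in (2, 3)}))
    return layers

# period_layers(6)[-1] == [32, 48, 72, 108, 162, 243], the sixth layer of Figure 3;
# the total number of periods is 1 + 2 + ... + 6 = 21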


Fig. 3. Triangle of periods

Table 17. Drum instruments added at a particular rhythmic level

Rhythmic level   Name of the instrument
1                Closed hi-hat
2                Bass drum
3                Snare drum
4                Open triangle
5                Splash cymbal
6                Long whistle
7                Chinese cymbal

The system proposed by the authors employs rules obtained in the process of data mining [11], [26], from the operation of neural networks [6], and through employing rough sets [12]. Taking a set of representative musical objects as grounds, these systems learn how to assess the influence of a sound's relative frequency, amplitude and length on its rhythmic weight. The second group of methods used to rank hypotheses is based on one of the elementary rules known in music composition, i.e. the recurrence of melodic and rhythmic patterns; this group is described in the authors' works [10], [27]. The application realizing the automatic accompaniment, called DrumAdd, accepts a MIDI file at its input. The accompaniment is added to the melody by inserting a drum channel, whose number in the MIDI file is 10. Hi-hat hits are inserted in the locations of rhythmic events associated with the first rhythmic level.


Fig. 4. User interface of an automatic drum accompaniment application

The consecutive drum instruments associated with the higher rhythmic levels are: bass drum, snare drum, open triangle, splash cymbal, long whistle and Chinese cymbal, as shown in Table 17. The DrumAdd system was developed in Java. The main window of the system can be seen in Figure 4; the user interface tab shown is entitled 'Hypotheses'. The default quantization settings are as follows (a sketch of this step follows the list):
- onsets of sounds are shifted to a time grid of one-eighth note,
- durations of sounds are natural multiples of one-eighth note,
- notes shorter than one-sixteenth note are deleted.
A user may easily change the quantization settings. A hypothesis ranking method can be chosen from a drop-down list ('Salience – Duration' in the case presented). A user may listen to the accompaniment made on the basis of a hypothesis (link 'Listen'), change the drum sounds associated with the consecutive rhythmic levels (link 'Next. . . ') or acknowledge the given hypothesis as correct (link 'Correct'). A user also receives access to a report and ranking of hypotheses, which presents a table with the accuracies corresponding to the hypothesis ranking methods. The drum accompaniment is generated automatically for the sample melodies contained in the system. As a result, some sample pieces contain a drum track created strictly with the approach presented earlier; in the second group of examples, the accompaniment is created on the basis of the metric structure retrieved automatically.
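A minimal sketch of the default quantization step, assuming time is expressed in quarter notes and notes are (onset, duration, pitch) triples (this representation is ours, not DrumAdd's internal one):

def quantize(notes, grid=0.5, min_len=0.25):
    # grid = 0.5 quarter notes (an eighth note); min_len = a sixteenth note
    out = []
    for onset, dur, pitch in notes:
        if dur < min_len:
            continue  # notes shorter than a sixteenth note are deleted
        onset = round(onset / grid) * grid          # snap onset to the eighth-note grid
        dur = max(grid, round(dur / grid) * grid)   # durations become multiples of an eighth note
        out.append((onset, dur, pitch))
    return out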

5 Algorithm Complexity

This section addresses the problem of the computational complexity of the algorithm. Three phases of the algorithm engineered by the authors are examined, namely creating periods, creating simplified hypotheses, and creating full hypotheses. The analysis of the computational complexity of the proposed method assumes that the engineered method is expected to rank rhythmic hypotheses formed of three rhythmic levels above the meter. This proved to be sufficient for providing an automatic drum accompaniment to a given melody without delay. The method creates all possible rhythmic structures. However, their number is limited and depends on the following factors [28]:

– The level designated as the lowest among all the created hypotheses (this defines the parameter of sound length quantization). The authors observed that quantization with the resolution of a quarter-note is sufficient.
– The intricacy of the hypotheses, i.e. how many levels they contain. The method was examined for at most three rhythmic levels above the meter, similarly as in the research conducted by Rosenthal [20], and Temperley and Sleator [21].

Taking the above assumptions into consideration, i.e. the quantization parameter being a quarter-note and the analysis of a hypothesis concerning three levels above the meter, we obtain the periods from the first 6 layers of the triangle shown in Figure 3. The atomic period is a quarter-note (layer 1), the layer containing periods 4, 6, 9 is the level of the meter, and the sixth layer, holding the values 32, 48, 72, . . . , is the last examined rhythmic level.

Calculating periods. The number of periods is n·(n+1)/2, where n is the number of layers, so the algorithm is polynomial; the computational complexity function is of class O(n²). The basic operation that calculates periods is multiplication. The number of periods calculated for 6 layers is 21, and these are the elements of a period list.

Creating hypotheses. Hypotheses are lists of related rhythmic levels, each level described by a pair of values (a period and a phase); simplified hypotheses include periods only. If we take only periods into consideration, the number of hypotheses is the number of paths starting from the highest rhythmic level (layer 6) and ending at the level of the atomic period (layer 1). For the assumed parameters, this gives 32 hypotheses if only periods are defined. The number is a result of the following computations:

– from period 32 there is one path (32, 16, 8, 4, 2, 1),
– from period 48 there are 5 paths,
– from period 72 there are 10 paths.

For the left half of the triangle we may thus specify 16 paths. The computations for the right half, i.e. the paths including periods 108, 162, and 243, are analogous. This gives 32 paths altogether in a 6-layer triangle. The computational complexity function is of class O(2ⁿ), where n is the number of layers. Thus, the complexity is exponential, but with n limited to 6 layers the number of hypotheses is confined to 32.
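The counts quoted above can be verified with a short sketch that propagates path counts layer by layer (the names are ours):

def count_hypotheses(n):
    # paths maps a top-layer period to the number of paths reaching it
    paths = {1: 1}
    for _ in range(n - 1):
        nxt = {}
        for p, c in paths.items():
            for f in (2, 3):
                nxt[p * f] = nxt.get(p * f, 0) + c
        paths = nxt
    period_only = sum(paths.values())                   # hypotheses with periods only
    with_phases = sum(p * c for p, c in paths.items())  # each hypothesis has as many
    return period_only, with_phases                     # phase variants as its longest period

assert count_hypotheses(6) == (32, 3125)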


Table 18. Rhythmic hypotheses (without phases) for a 6-layer triangle of periods

Layer:  1   2   3   4    5    6
        1   2   4   8    16   32
        1   2   4   8    16   48
        1   2   4   8    24   48
        1   2   4   8    24   72
        1   2   4   12   24   48
        1   2   4   12   24   72
        1   2   4   12   36   72
        1   2   4   12   36   108
        1   2   6   12   24   48
        1   2   6   12   24   72
        1   2   6   12   36   72
        1   2   6   12   36   108
        1   2   6   18   36   72
        1   2   6   18   36   108
        1   2   6   18   54   108
        1   2   6   18   54   162
        1   3   6   12   24   48
        1   3   6   12   24   72
        1   3   6   12   36   72
        1   3   6   12   36   108
        1   3   6   18   36   72
        1   3   6   18   36   108
        1   3   6   18   54   108
        1   3   6   18   54   162
        1   3   9   18   36   72
        1   3   9   18   36   108
        1   3   9   18   54   108
        1   3   9   18   54   162
        1   3   9   27   54   108
        1   3   9   27   54   162
        1   3   9   27   81   162
        1   3   9   27   81   243

The rows of Table 18 show the subsequent simplified hypotheses, i.e. the ones that contain only periods (phases are ignored), for the example from Figures 2 and 3. The algorithm that creates hypotheses with periods only ranks rhythmic hypotheses based on the recurrence of melorhythmic patterns (16 methods proposed in the thesis of Wojcik [27]). The basic operation of pattern recurrence evaluation is in this case addition. The only hypothesis ranking method examined by the authors that requires the phases to be defined is the method based on rhythmic weights.


Creating hypotheses with phases. Each hypothesis may have as many versions, with regard to phases, as its longest period; e.g., the first hypothesis from Table 18 (the first row: 1, 2, 4, 8, 16, 32) will have 32 different phases. On condition that n = 6, the number of all hypotheses for the discussed example amounts to 3125, which is the sum of the longest periods over all hypotheses; that is, the number of all hypotheses is the sum of the values from the last column of Table 18. The algorithm that forms hypotheses with phases is used in the method ranking rhythmic hypotheses based on rhythmic weight. The elementary operation of this method is addition. To analyze a piece of music with regard to its motives, phrases, sentences and periods when its atomic period is defined as a quarter-note, 6 layers (n = 6) are sufficient. Despite the exponential complexity of the method, the number of elementary operations is no more than 10^4, and the total time of all operations for a single piece of music on a 1.6 GHz computer is imperceptible to the system user, as demonstrated by the experimental system engineered by the authors. This means that the method provides a high quality automatic drum accompaniment without delay.

6 Concluding Remarks

Employing a computational approach is helpful in retrieving the time signature and the locations of barlines from a piece on the basis of its content only. The rhythmic salience approach worked out and described in this paper may also be valuable in ranking rhythmic hypotheses and in music transcription. A system creating a drum accompaniment to a given melody automatically, on the basis of a highly ranked rhythmic hypothesis, is a useful practical application of the rhythmic salience method. A prototype of such a system using the salience approach was developed on the basis of the findings of the authors of this paper, and it works without delay, even though its computational complexity is quite considerable. On the basis of the results (see Tables 1 and 4) obtained for both the RS and ANN experiments, it may be observed that the average accuracy of all approaches taking duration D into account – solely or in the combination of all three attributes DPV – is about twice as good as the hazard accuracy (values of 1.72 for Rough Set DPV, 1.80 for Rough Set D, and a value of 2.03 both for Network D and for Network DPV were achieved). The performance of the approaches considering pitch P and velocity V separately is very close to random accuracy; the values are equal to 1.09 and 1.15 for Rough Sets, and 0.96 and 0.98, respectively, for the ANN. Thus, it can be concluded that the location of a sound in an accented position depends only on its duration. The algorithms with the combination of DPV attributes performed as well as the one based only on duration; however, this is especially valid for the ANNs, while the rough sets did slightly worse. Additional attributes do not increase the performance of the ANN approach. It can thus be concluded that the rhythmic salience depends on physical attributes in a simple way, namely on a single physical attribute – duration.


Network D is the ANN that returns the most stable results: the value of its stability fraction in the StdDev/Avg row of Table 1 is low, equal to 0.09. Network DPV, which takes all attributes into account, is much less reliable, because its stability fraction, equal to 0.19, is about twice as bad as the stability of Network D. The stability of Network P, considering the pitch, is quite high (it equals 0.12), but its performance is close to random choice. For the learning and testing data used in this experiment, velocity appeared to be the most data-sensitive attribute (see the results of Network V); additionally, this network appeared to be unable to find accented sounds. In the case of Rough Sets, the duration-based approaches D and DPV returned less stable results than the P and V approaches: values of 0.045, 0.043, 0.026 and 0.033 were obtained for D, DPV, P and V, respectively. The ANN salience-based experiments described in the earlier work by the authors [7] were conducted on a database of musical files containing various musical genres; it consisted of both monophonic (non-polyphonic) and polyphonic files. Also, a verification of the association rules model of the Data Mining domain for musical salience estimation was presented in that paper. The conclusions derived from the experiments conducted on national anthems for the purpose of this paper are consistent with the ones described in the work by Kostek et al. [7]. Thus, ANNs can be used in systems of musical rhythm retrieval for a wide range of genres and regardless of whether the music is monophonic or polyphonic. The average relative accuracy for duration-based approaches where Rough Sets are used is lower than that obtained by the LVQ ANNs. However, the same tendency is noticeable – utilization of the duration parameter leads to successful classification. The P (pitch) and V (velocity) parameters appeared not to be important in making decisions about the rhythmical structure of a melody. Finally, using different discretization schemes instead of the equal subrange technique does not change the accuracy of rough set-based rhythm classification significantly.

Acknowledgments

The research was partially supported by the Polish Ministry of Science and Education within the project No. PBZ-MNiSzW-02/II/2007.

References

1. Bazan, J.G., Szczuka, M.S.: The Rough Set Exploration System. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets III. LNCS, vol. 3400, pp. 37–56. Springer, Heidelberg (2005)
2. van Belle, W.: BPM Measurement of Digital Audio by means of Beat Graphs & Ray Shooting. Department of Computer Science, University of Tromsø (retrieved 2004), http://bio6.itek.norut.no/werner/Papers/bpm04/
3. Dahl, S.: On the beat – Human movement and timing in the production and perception of music. Ph.D. Thesis, KTH Royal Institute of Technology, Stockholm, Sweden (2005)


4. Dixon, S.: Automatic Extraction of Tempo and Beat from Expressive Performances. J. of New Music Research 30(1), 39–58. Swets & Zeitlinger (2001)
5. Huron, D.: Review of Harmony: A Psychoacoustical Approach (Parncutt, 1989). Psychology of Music 19(2), 219–222 (1991)
6. Kostek, B., Wójcik, J.: Machine Learning System for Estimation Rhythmic Salience of Sounds. Int. J. of Knowledge-Based and Intelligent Engineering Systems 9, 1–10 (2005)
7. Kostek, B., Wójcik, J., Holonowicz, P.: Estimation the Rhythmic Salience of Sound with Association Rules and Neural Networks. In: Proc. of the Intern. IIS: IIPWM 2005, Intel. Information Proc. and Web Mining, Advances in Soft Computing, pp. 531–540. Springer, Sobieszewo (2005)
8. Kostek, B.: Perception-Based Data Processing in Acoustics. Applications to Music Information Retrieval and Psychophysiology of Hearing. Series on Cognitive Technologies. Springer, Heidelberg (2005)
9. Kostek, B.: Applying computational intelligence to musical acoustics. Archives of Acoustics 32(3), 617–629 (2007)
10. Kostek, B., Wójcik, J.: Automatic Retrieval of Musical Rhythmic Patterns. 119th Audio Engineering Soc. Convention, New York (2005)
11. Kostek, B., Wójcik, J.: Automatic Salience-Based Hypermetric Rhythm Retrieval. In: International Workshop on Interactive Multimedia and Intelligent Services in Mobile and Ubiquitous Computing, Seoul, Korea. IEEE CS, Los Alamitos (2007)
12. Kostek, B., Wójcik, J., Szczuko, P.: Searching for Metric Structure of Musical Files. In: Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 774–783. Springer, Heidelberg (2007)
13. Lerdahl, F., Jackendoff, R.: A Generative Theory of Tonal Music. MIT Press, Cambridge (1983)
14. McAuley, J.D., Semple, P.: The effect of tempo and musical experience on perceived beat. Australian Journal of Psychology 51(3), 176–187 (1999)
15. Parncutt, R.: Harmony: A Psychoacoustical Approach. Springer, Berlin (1989)
16. Pawlak, Z.: Rough Sets. Internat. J. Computer and Information Sciences 11, 341–356 (1982)
17. Pawlak, Z., Skowron, A.: Rudiments of rough sets. Information Sciences 177, 3–27 (2007)
18. Peters, J.F., Skowron, A. (eds.): Transactions on Rough Sets V. LNCS, vol. 4100. Springer, Heidelberg (2004–2008)
19. Rosenthal, D.F.: Emulation of human rhythm perception. Comp. Music J. 16(1), 64–76 (Spring 1992)
20. Rosenthal, D.F.: Machine Rhythm: Computer Emulation of Human Rhythm Perception. Ph.D. thesis, MIT Media Lab, Cambridge, Mass. (1992)
21. Temperley, D., Sleator, D.: Modeling meter and harmony: A preference-rule approach. Comp. Music J. 15(1), 10–27 (1999)
22. RSES Homepage, http://logic.mimuw.edu.pl/~rses
23. Wieczorkowska, A., Czyzewski, A.: Rough Set Based Automatic Classification of Musical Instrument Sounds. Electr. Notes Theor. Comput. Sci. 82(4) (2003)
24. Wieczorkowska, A., Raś, Z.W.: Editorial: Music Information Retrieval. J. Intell. Inf. Syst. 21(1), 5–8 (2003)
25. Wikipedia homepage
26. Wójcik, J., Kostek, B.: Intelligent Methods for Musical Rhythm Finding Systems. In: Nguyen, N.T. (ed.) Intelligent Technologies for Inconsistent Processing. International Series on Advanced Intelligence, vol. 10, pp. 187–202 (2004)


27. Wójcik, J.: Methods of Forming and Ranking Rhythmic Hypotheses in Musical Pieces. Ph.D. Thesis, Electronics, Telecommunications and Informatics Faculty, Gdansk Univ. of Technology, Gdansk (2007)
28. Wójcik, J., Kostek, B.: Computational Complexity of the Algorithm Creating Hypermetric Rhythmic Hypotheses. Archives of Acoustics 33(1), 57–63 (2008)

FUN: Fast Discovery of Minimal Sets of Attributes Functionally Determining a Decision Attribute

Marzena Kryszkiewicz and Piotr Lasek

Institute of Computer Science, Warsaw University of Technology
Nowowiejska 15/19, 00-665 Warsaw, Poland
{mkr,p.lasek}@ii.pw.edu.pl

Abstract. In this paper, we present our Fun algorithm for discovering minimal sets of conditional attributes functionally determining a given dependent attribute. In particular, the algorithm is capable of discovering Rough Sets certain decision, generalized decision, and membership distribution reducts. Fun can operate either on partitions of objects or alternatively on stripped partitions, which do not store singleton groups. It is capable of using functional dependencies occurring among conditional attributes for pruning candidate dependencies. In this paper, we offer a further reduction of stripped partitions, which allows the correct determination of minimal functional dependencies provided the optional candidate pruning is not carried out. In the paper we consider six variants of Fun, including two new variants using reduced stripped partitions. We have carried out a number of experiments on benchmark data sets to test the efficiency of all variants of Fun. We have also tested the efficiency of the Fun variants against the Rosetta and RSES toolkits' algorithms computing all reducts, and against Tane, which is one of the most efficient algorithms computing all minimal functional dependencies. The experiments prove that Fun is up to 3 orders of magnitude faster than the Rosetta and RSES toolkits' algorithms, and up to 30 times faster than Tane.

Keywords: Rough Sets, information system, decision table, reduct, functional dependency.

1 Introduction

The determination of minimal functional dependencies is a standard task in the area of relational databases. Tane [6] and Dep-Miner [14] are examples of efficient algorithms for discovering minimal functional dependencies from relational databases. A variant of the task, which consists in discovering minimal sets of conditional attributes that functionally or approximately determine a given decision attribute, is one of the topics of Artificial Intelligence and Data Mining. Such sets of conditional attributes can be used, for instance, for building classifiers. In terms of Rough Sets, such minimal sets of conditional attributes are called reducts [18]. One can distinguish a number of types of reducts. Generalized decision reducts (or equivalently, possible/approximate reducts [9]), membership distribution reducts (or equivalently, membership reducts [9]), and


certain decision reducts belong to the most popular Rough Sets reducts. In general, these types of reducts do not determine the decision attribute functionally. However, it was shown in [10] that these types of reducts are minimal sets of conditional attributes functionally determining appropriate modifications of the decision attribute. Thus, the task of searching for such reducts is equivalent to looking for minimal sets of attributes functionally determining a given attribute. In this paper, we focus on finding all such minimal sets of attributes. To this end, one might consider applying either methods for discovering Rough Sets reducts, or discovering all minimal functional dependencies and then selecting those that determine a requested attribute. A number of methods for discovering different types of reducts have already been proposed in the literature, e.g. [3-5], [7-8], [11-12], [15-29]. The most popular methods are based on discernibility matrices [21]. Unfortunately, the existing methods for discovering all reducts are not scalable. The recently offered algorithms for finding all minimal functional dependencies are definitely faster. In this paper, we focus on the direct discovery of all minimal functional dependencies with a given dependent attribute, and expect this process to be faster than the discovery of all minimal functional dependencies. First, we present the efficient Fun algorithm, which we offered recently [12]. Fun discovers minimal functional dependencies with a given dependent attribute and, in particular, is capable of discovering the three above-mentioned types of reducts. Fun can operate either on partitions of objects or alternatively on stripped object partitions, which do not store singleton groups. It is capable of using functional dependencies occurring among conditional attributes, which are found as a side effect, for pruning candidate dependencies. In this paper, we extend our proposal from [12]. We offer further full and partial reduction of stripped partitions, which allows the correct determination of minimal functional dependencies provided the optional candidate pruning is not carried out. Then, we compare the efficiency of the two new variants of Fun and the four other variants of this algorithm we proposed in [12]. We also test the efficiency of the Fun variants against the Rosetta and RSES toolkits' algorithms computing all reducts and against Tane, which is one of the most efficient algorithms computing all minimal functional dependencies. The layout of the paper is as follows: basic notions of information systems, functional dependencies, decision tables and reducts are recalled in Section 2. In Section 3, we present the Fun algorithm. An entirely new contribution is presented in subsection 3.5, where we describe how to reduce stripped partitions and provide two new variants of the Fun algorithm. The experimental evaluation of the 6 variants of Fun, as well as the Rosetta and RSES toolkits' algorithms and Tane, is reported in Section 4. We conclude our results in Section 5.

2 Basic Notions

2.1 Information Systems

An information system is a pair S = (O, AT), where O is a non-empty finite set of objects and AT is a non-empty finite set of attributes of these objects.


In the sequel, a(x), where a ∈ AT and x ∈ O, denotes the value of attribute a for object x, and Va denotes the domain of a. Each subset of attributes A ⊆ AT determines a binary A-indiscernibility relation IND(A) consisting of the pairs of objects indiscernible wrt. attributes A; that is, IND(A) = {(x, y) ∈ O×O | ∀a∈A a(x) = a(y)}. IND(A) is an equivalence relation and determines a partition of O, which is denoted by πA. The set of objects indiscernible with an object x with respect to A in S is denoted by IA(x) and is called an A-indiscernibility class; that is, IA(x) = {y ∈ O | (x, y) ∈ IND(A)}. Clearly, πA = {IA(x) | x ∈ O}.

Table 1. Sample information system S = (O, AT), where AT = {a, b, c, e, f}

oid   a   b   c   e   f
1     1   0   0   1   1
2     1   1   1   1   2
3     0   1   1   0   3
4     0   1   1   0   3
5     0   1   1   2   2
6     1   1   0   2   2
7     1   1   0   2   2
8     1   1   0   2   2
9     1   1   0   3   2
10    1   0   0   3   2

Example 2.1.1. Table 1 presents a sample information system S = (O, AT), where O is the set of ten objects and AT = {a, b, c, e, f} is the set of attributes of these objects.

2.2 Functional Dependencies

Functional dependencies are of high importance in designing relational databases. We recall this notion after [2]. Let S = (O, AT) and A, B ⊆ AT. A → B is defined to be a functional dependency (or A is said to determine B functionally) if ∀x∈O IA(x) ⊆ IB(x). A functional dependency A → B is called minimal if for every C ⊂ A, C → B is not functional.

Example 2.2.1. Let us consider the information system in Table 1. {ce} → {a} is a functional dependency; nevertheless, {c} → {a}, {e} → {a}, and ∅ → {a} are not. Hence, {ce} → {a} is a minimal functional dependency.

Property 2.2.1. Let A, B, C ⊆ AT.
a) If A → B is a functional dependency, then for every C ⊃ A, C → B is functional.
b) If A → B is not functional, then for every C ⊂ A, C → B is not functional.
c) If A → B is a functional dependency, then for every C ⊃ A, C → B is a non-minimal functional dependency.
d) If A → B and B → C are functional dependencies, then A → C is a non-minimal functional dependency.
e) If A ⊂ B and A → B is a functional dependency, then B → C is not a minimal functional dependency.


Functional dependencies can be calculated by means of partitions [6] as follows:

Property 2.2.2. Let A, B ⊆ AT. A → B is a functional dependency iff πA = πAB iff |πA| = |πAB|.

Example 2.2.2. Let us consider the information system in Table 1. We observe that π{ce} = π{cea} = {{1}, {2}, {3, 4}, {5}, {6, 7, 8}, {9, 10}}. The equality of π{ce} and π{cea} (or of their cardinalities) is sufficient to conclude that {ce} → {a} is a functional dependency.

The next property recalls a method of calculating the partition with respect to an attribute set C by intersecting partitions with respect to subsets of C. Let A, B ⊆ AT. The product of partitions πA and πB, denoted by πA ∩ πB, is defined as πA ∩ πB = {Y ∩ Z | Y ∈ πA and Z ∈ πB}.

Property 2.2.3. Let A, B, C ⊆ AT and C = A ∪ B. Then πC = πA ∩ πB.
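Properties 2.2.2 and 2.2.3 translate directly into code. The sketch below (the dictionary-based representation and the names are ours) checks the functional dependency of Example 2.2.1 on the data of Table 1:

def partition(rows, attrs):
    # pi_A: group object identifiers by their value tuples on attrs
    groups = {}
    for oid, row in rows.items():
        groups.setdefault(tuple(row[a] for a in attrs), []).append(oid)
    return list(groups.values())

def holds(rows, lhs, rhs):
    # A -> B is a functional dependency iff |pi_A| = |pi_AB| (Property 2.2.2)
    return len(partition(rows, lhs)) == len(partition(rows, lhs + rhs))

rows = {1: dict(a=1, b=0, c=0, e=1, f=1), 2: dict(a=1, b=1, c=1, e=1, f=2),
        3: dict(a=0, b=1, c=1, e=0, f=3), 4: dict(a=0, b=1, c=1, e=0, f=3),
        5: dict(a=0, b=1, c=1, e=2, f=2), 6: dict(a=1, b=1, c=0, e=2, f=2),
        7: dict(a=1, b=1, c=0, e=2, f=2), 8: dict(a=1, b=1, c=0, e=2, f=2),
        9: dict(a=1, b=1, c=0, e=3, f=2), 10: dict(a=1, b=0, c=0, e=3, f=2)}

assert holds(rows, ['c', 'e'], ['a'])   # {ce} -> {a}
assert not holds(rows, ['e'], ['a'])    # {e} -> {a} does not hold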

2.3 Decision Tables, Reducts and Functional Dependencies

A decision table is an information system DT = (O, AT ∪ {d}), where d ∉ AT is a distinguished attribute called the decision, and the elements of AT are called conditions. A decision class is defined as the set of all objects with the same decision value. By Xdi we will denote the decision class consisting of the objects whose decision value equals di, where di ∈ Vd. Clearly, for any object x in O, Id(x) is a decision class. It is often of interest to find minimal subsets of AT (or strict reducts) that functionally determine d. It may happen, nevertheless, that such minimal sets of conditional attributes do not exist.

Table 2. Sample decision table DT = (O, AT ∪ {d}), where AT = {a, b, c, e, f} and d is the decision attribute, extended with the derived attributes d^N_AT, ∂_AT, and μ^AT_d

a 1 1 0 0 0 1 1 1 1 1

b 0 1 1 1 1 1 1 1 1 0

c 0 1 1 1 1 0 0 0 0 0

e 1 1 0 0 2 2 2 2 3 3

f 1 2 3 3 2 2 2 2 2 2

d dN AT 1 1 1 1 1 N 2 N 2 2 2 N 3 N 3 N 3 3 3 3

∂AT μAT d {1} {1} {1, 2} {1, 2} {2} {2, 3} {2, 3} {2, 3} {3} {3}

AT AT :< μAT > 1 , μ2 , μ3 < 1, 0, 0 > < 1, 0, 0 > < 1/2, 1/2, 0 > < 1/2, 1/2, 0 > < 0, 1, 0 > < 0, 1/3, 2/3 > < 0, 1/3, 2/3 > < 0, 1/3, 2/3 > < 0, 0, 1 > < 0, 0, 1 >

Example 2.3.1. Table 2 describes a sample decision table DT = (O, AT ∪ {d}), where AT = {a, b, c, e, f }. Partition πAT = {{1}, {2}, {3, 4}, {5}, {6, 7, 8}, {9}, {10}} contains all AT -indiscernibility classes, whereas π{d} = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9, 10}} contains all decision classes. There is no functional dependency between AT and d, since there is no decision class in π{d} containing AT -indiscernibility class {3, 4} (or {6, 7, 8}). As AT → d is not functional, then C → d, where C ⊆ AT , is not functional either.


Rough Sets theory deals with the problem of the non-existence of strict reducts by means of other types of reducts, which always exist, irrespective of whether AT → d is a functional dependency or not. We will now recall three such types of reducts, namely certain decision reducts, generalized decision reducts, and membership distribution reducts.

Certain decision reducts. Certain decision reducts are defined based on the notion of a positive region of DT, thus we start with introducing this notion. A positive region of DT, denoted as POS, is the set-theoretical union of all AT-indiscernibility classes, each of which is contained in a decision class of DT; that is, POS = ∪{X ∈ πAT | X ⊆ Y, Y ∈ πd} = {x ∈ O | IAT(x) ⊆ Id(x)}. A set of attributes A ⊆ AT is called a certain decision reduct of DT if A is a minimal set such that ∀x∈POS IA(x) ⊆ Id(x) [18]. Now, we introduce a derivable decision attribute for an object x ∈ O as a modification of the decision attribute d, which we denote by d^N_AT(x) and define as follows: d^N_AT(x) = d(x) if x ∈ POS, and d^N_AT(x) = N otherwise (see Table 2 for illustration). Clearly, all objects with values of d^N_AT different from N belong to POS.

Property 2.3.1 [10]. Let A ⊆ AT. A is a certain decision reduct iff A → {d^N_AT} is a minimal functional dependency.

Generalized decision reducts. Generalized decision reducts are defined based on a generalized decision; let us thus start with introducing this notion. An A-generalized decision for object x in DT (denoted by ∂_A(x)), A ⊆ AT, is defined as the set of all decision values of all objects indiscernible with x wrt. A; i.e., ∂_A(x) = {d(y) | y ∈ IA(x)} [21]. For A = AT, an A-generalized decision is also called a generalized decision (see Table 2 for illustration). A ⊆ AT is defined to be a generalized decision reduct of DT if A is a minimal set such that ∀x∈O ∂_A(x) = ∂_AT(x).

Property 2.3.2 [10]. Let A ⊆ AT. Attribute set A is a generalized decision reduct iff A → {∂_AT} is a minimal functional dependency.

μ-Decision reducts. The generalized decision informs on the decision classes to which an object may belong, but does not inform on the degree of membership to these classes, which could also be of interest. A membership distribution function μ^A_d : O → [0, 1]^n, A ⊆ AT, n = |Vd|, is defined as follows [9], [23-24]:

μ^A_d(x) = (μ^A_d1(x), . . . , μ^A_dn(x)), where {d1, . . . , dn} = Vd and μ^A_di(x) = |IA(x) ∩ Xdi| / |IA(x)|.

See Table 2 for an illustration of μ^AT_d. A ⊆ AT is called a μ-decision reduct (or membership distribution reduct) of DT if A is a minimal set such that ∀x∈O μ^A_d(x) = μ^AT_d(x).

Property 2.3.3 [10]. Let A ⊆ AT. A is a μ-decision reduct iff A → {μ^AT_d} is a minimal functional dependency.
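The three derived attributes of Table 2 can be computed in one pass over the AT-indiscernibility classes; the following sketch (our names and representation) returns, for each object, the triple (d^N_AT(x), ∂_AT(x), μ^AT_d(x)):

def derived_decisions(rows, conds, d):
    classes = {}
    for oid, row in rows.items():
        classes.setdefault(tuple(row[a] for a in conds), []).append(oid)
    out = {}
    for group in classes.values():
        decisions = [rows[o][d] for o in group]
        gen = sorted(set(decisions))                                # generalized decision
        mu = {v: decisions.count(v) / len(decisions) for v in gen}  # membership distribution
        certain = decisions[0] if len(gen) == 1 else 'N'            # d^N_AT
        for o in group:
            out[o] = (certain, gen, mu)
    return out

# With the rows of Table 2 (Table 1 plus d), object 3 yields
# ('N', [1, 2], {1: 0.5, 2: 0.5}), in agreement with the table.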

3 Computing Minimal Sets of Attributes Functionally Determining a Given Dependent Attribute with Fun

In this section, we present the Fun algorithm for computing all minimal subsets of the conditional attributes AT that functionally determine a given dependent attribute ∂. First, we recall the variants of Fun that apply partitions of objects or, so called, stripped partitions of objects [12]. Then, in Section 3.5, we introduce the idea of reduced stripped partitions and offer two new variants of Fun based on them. The Fun algorithm can be used for calculating Rough Sets reducts provided the dependent attribute is determined properly; namely, Fun will return certain decision reducts for ∂ = d^N_AT, generalized decision reducts for ∂ = ∂_AT, and μ-decision reducts for ∂ = μ^AT_d. For brevity, a minimal subset of AT that functionally determines a given dependent attribute ∂ will be called a ∂-reduct.

3.1 Main Algorithm

The Fun algorithm takes two arguments: a set of conditional attributes AT and a functionally dependent attribute ∂. As a result, it returns all ∂-reducts. Fun starts with creating singleton candidates C1 for ∂-reducts from each attribute in AT. Then, the partitions (π) and their cardinalities (groupNo) wrt. ∂ and all attributes in C1 are determined.

Notation for Fun
• Ck – candidate k attribute sets (potential ∂-reducts);
• Rk – k attribute ∂-reducts;
• C.π – the representation of the partition πC of the candidate attribute set C; it is stored as the list of groups of object identifiers (oids);
• C.groupNo – the number of groups in the partition of the candidate attribute set C; that is, |πC|;
• ∂.T – an array representation of π∂;

Algorithm. Fun(attribute set AT, dependent attribute ∂);
C1 = {{a} | a ∈ AT}; // create singleton candidates from conditional attributes in AT
forall C in C1 ∪ {∂} do begin C.π = πC; C.groupNo = |πC| endfor;
/* calculate an array representation of π∂ for later multiple use in the Holds function */
∂.T = PartitionArrayRepresentation(∂);
// Main loop
for (k = 1; Ck ≠ ∅; k++) do begin
  Rk = {};
  forall candidates C ∈ Ck do begin
    if Holds(C → {∂}) then // Is C → {∂} a functional dependency?
      remove C from Ck to Rk; // store C as a k attribute ∂-reduct
    endif
  endfor;
  /* create (k + 1) attribute candidates for ∂-reducts from k attribute non-∂-reducts */
  Ck+1 = FunGen(Ck);
endfor;
return ∪k Rk;

Next, the PartitionArrayRepresentation function (see Section 3.3) is called to create an array representation of π∂. This representation shall be used multiple times in the Holds function, called later in the algorithm, for efficient checking of whether candidate attribute sets determine ∂ functionally.


Now, the main loop starts. In each k-th iteration, the following is performed:

– The Holds function (see Section 3.3) is called to check which k attribute candidates in Ck determine ∂ functionally. The candidates that do are moved from the set of k attribute candidates to the set of ∂-reducts Rk.
– The FunGen function (see Section 3.2) is called to create the (k + 1) attribute candidates Ck+1 from the k attribute candidates that remained in Ck.

The algorithm stops when the set of candidates becomes empty.

3.2 Generating Candidates for ∂-reducts

The FunGen function creates the (k + 1) attribute candidates Ck+1 by merging k attribute candidates from Ck which are not ∂-reducts. The algorithm adopts the manner of creating and pruning candidates introduced in [1] (here: candidate sets of attributes instead of candidate frequent itemsets). Only those pairs of k attribute candidates in Ck that differ merely in their last attributes are merged (see [1] for the justification that this method is lossless and non-redundant). For each new candidate C, πC is calculated as the product of the partitions wrt. the merged k attribute sets (see Section 3.3 for the Product function). The cardinality (groupNo) of πC is also calculated. Now, it is checked for each new (k + 1) attribute candidate C if there exists a k attribute subset A of C that is not present in Ck. If so, it means that either A or a subset of A was found earlier to be a ∂-reduct. This implies that the candidate C is a proper superset of a ∂-reduct, thus it is not a ∂-reduct, and hence C is deleted from the set Ck+1. Optionally, for each tested k attribute subset A that is present in Ck, it is checked if |πA| equals |πC|. If so, then A → C holds (by Property 2.2.2). Hence, C → {∂} is not a minimal functional dependency (by Property 2.2.1e), and thus C is deleted from Ck+1.

function FunGen(Ck);
/* Merging */
forall A, B ∈ Ck do
  if A[1] = B[1] ∧ . . . ∧ A[k − 1] = B[k − 1] ∧ A[k] < B[k] then begin
    C = A[1] · A[2] · . . . · A[k] · B[k];
    /* compute partition C.π as the product of A.π and B.π, and the number of groups in C.π */
    C.groupNo = Product(A.π, B.π, C.π);
    add C to Ck+1
  endif;
endfor;
/* Pruning */
forall C ∈ Ck+1 do
  forall k attribute sets A such that A ⊂ C do
    if A ∉ Ck then
      /* A ⊂ C and ∃B ⊆ A such that B → {∂} holds, so C → {∂} holds, but is not minimal */
      begin delete C from Ck+1; break end
    elseif A.groupNo = C.groupNo then // optional candidate pruning step
      /* A ⊂ C and A → C holds, so C → {∂} is not a minimal functional dependency */
      begin delete C from Ck+1; break end
    endif
  endfor
endfor;
return Ck+1;
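The merge-and-prune scheme can be sketched over lexicographically sorted attribute tuples as follows (a simplification that omits the partition bookkeeping and the optional groupNo test; the names are ours):

def fun_gen(candidates):
    # merge pairs of sorted k-tuples that differ only in the last attribute,
    # then prune candidates having a k-subset that is no longer in Ck
    k = len(candidates[0])
    merged = [a + (b[-1],)
              for i, a in enumerate(candidates) for b in candidates[i + 1:]
              if a[:k - 1] == b[:k - 1] and a[-1] < b[-1]]
    ck = set(candidates)
    return [c for c in merged
            if all(c[:j] + c[j + 1:] in ck for j in range(k + 1))]

# fun_gen([('a','b'), ('a','c'), ('a','e'), ('b','c')]) == [('a', 'b', 'c')]
# ('a','b','e') and ('a','c','e') are pruned: ('b','e') and ('c','e') are absent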

3.3 Using Partitions in Fun

Computing the Array Representation of a Partition. The PartitionArrayRepresentation function returns an array T of length equal to the number of objects O in DT. For a given attribute set C, each element j of T is assigned the index of the group in C.π to which the object with oid = j belongs. As a result, the j-th element of T informs to which group in C.π the j-th object in DT belongs, j = 1..|O|.

function PartitionArrayRepresentation(attribute set C);
/* assert: T is an array[1 . . . |O|] */
i = 1;
for i-th group G in partition C.π do begin
  for each oid ∈ G do T[oid] = i endfor;
  i = i + 1
endfor
return T;

Verifying a Candidate Dependency. The Holds function checks if there is a functional dependency between the set of attributes C and an attribute ∂. It is checked for successive groups G in C.π whether there is an oid in G that belongs to a group in ∂.π different from the group in ∂.π to which the first oid in G belongs (for the purpose of efficiency, the pre-calculated ∂.T representation of the partition for ∂ is applied instead of ∂.π). If so, this means that G is not contained in one group of ∂.π, and thus C → {∂} is not a functional dependency. In such a case, the function stops, returning false as the result. Otherwise, if no such group G is found, the function returns true, which means that C → {∂} is a functional dependency.

function Holds(C → {∂});
/* assert: ∂.T is an array representation of ∂.π */
for each group G in partition C.π do begin
  oid = first element in group G;
  ∂-firstGroup = ∂.T[oid]; // the identifier of the group in ∂.π to which oid belongs
  for each next element oid ∈ G do begin
    ∂-nextGroup = ∂.T[oid];
    if ∂-firstGroup ≠ ∂-nextGroup then
      /* there are oids in G that identify objects indiscernible wrt. C, but discernible wrt. ∂ */
      return false // hence, C → {∂} does not hold
    endif
  endfor;
endfor;
return true; // C → {∂} holds

Computing the Product of Partitions. The Product function computes the partition wrt. the attribute set C, and its cardinality, from the partitions wrt. the attribute sets A and B. The function examines successive groups of the partition for B. The objects in a given group G in B.π are split into maximal subgroups in such a way that the objects in each resultant subgroup are contained in the same group in A.π. The obtained set of subgroups equals {G ∩ Y | Y ∈ A.π}. The product C.π is calculated as the set of all subgroups obtained from all groups in B.π; i.e., C.π = ∪G∈B.π {G ∩ Y | Y ∈ A.π} = {G ∩ Y | Y ∈ A.π and G ∈ B.π} = B.π ∩ A.π. In order to calculate the product of the partitions efficiently (with time complexity linear wrt. the number of objects in DT), we follow the idea presented in [6] and use two static arrays T and S: T is used to store an array representation of the partition wrt. A; S is used to store the subgroups obtained from a given group G in B.π.


function Product(A.π, B.π; var C.π);
/* assert: T[1..|O|] is a static array */
/* assert: S[1..|O|] is a static array with all elements initially equal to ∅ */
C.π = {}; groupNo = 0;
/* calculate an array representation of A.π for later multiple use in the Product function */
T = PartitionArrayRepresentation(A);
i = 1;
for i-th group G in partition B.π do begin
  A-GroupIds = ∅;
  for each element oid ∈ G do begin
    j = T[oid]; // the identifier of the group in A.π to which oid belongs
    insert oid into S[j]; insert j into A-GroupIds
  endfor;
  for each j ∈ A-GroupIds do begin
    insert S[j] into C.π; groupNo = groupNo + 1; S[j] = ∅
  endfor;
  i = i + 1
endfor;
return groupNo;

3.4 Using Stripped Partitions in Fun

The representation of partitions that requires storing the object identifiers (oids) of all objects in DT may be too memory consuming. In order to alleviate this problem, it was proposed in [6] to store oids only for objects belonging to non-singleton groups in a partition representation. Such a representation of a partition is called a stripped one and will be denoted by π^s. Clearly, the stripped representation is lossless.

Example 3.4.1. In Table 2, the partition wrt. {ce} is π{ce} = {{1}, {2}, {3, 4}, {5}, {6, 7, 8}, {9, 10}}, whereas the stripped partition wrt. {ce} is π^s_{ce} = {{3, 4}, {6, 7, 8}, {9, 10}}.

function StrippedHolds(C → {∂});
i = 1;
for i-th group G in partition C.π do begin
  oid = first element in group G;
  ∂-firstGroup = ∂.T[oid]; // the identifier of the group in ∂.π to which oid belongs
  if ∂-firstGroup = null then return false endif;
  /* ∂.T[oid] = null indicates that oid constitutes a singleton group in the partition for ∂. */
  /* Hence, no next object in G belongs to this group in ∂.π, so C → {∂} does not hold. */
  for each next element oid ∈ G do begin
    ∂-nextGroup = ∂.T[oid];
    if ∂-firstGroup ≠ ∂-nextGroup then
      /* there are oids in G that identify objects indiscernible wrt. C, but discernible wrt. ∂ */
      return false // hence, C → {∂} does not hold
    endif
  endfor;
  i = i + 1
endfor;
return true; // C → {∂} holds

When applying stripped partitions in our Fun algorithm instead of the usual partitions, one should call the StrippedHolds function instead of Holds, and the StrippedProduct function instead of Product. The parts of the functions modified with respect to their unstripped counterparts are indicated by the comments in the code below. We note, however, that the groupNo field still stores the number of groups in the unstripped partition (singleton groups are not stored in a stripped partition, but they are counted!). A minimal sketch of the stripping itself is given below.
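The stripping is trivial; a sketch on the partition of Example 3.4.1 (the names are ours):

def strip(partition):
    # keep only non-singleton groups but remember the full group count,
    # mirroring the groupNo bookkeeping described above
    return [g for g in partition if len(g) > 1], len(partition)

pi_ce = [[1], [2], [3, 4], [5], [6, 7, 8], [9, 10]]
stripped, group_no = strip(pi_ce)
# stripped == [[3, 4], [6, 7, 8], [9, 10]], group_no == 6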


function StrippedProduct(A.π, B.π; var C.π);
C.π = {}; groupNo = B.groupNo;
T = PartitionArrayRepresentation(A);
i = 1;
for i-th group G in partition B.π do begin
  A-GroupIds = ∅;
  for each element oid ∈ G do begin
    j = T[oid]; // the identifier of the group in A.π to which oid belongs
    if j = null then
      groupNo = groupNo + 1; // respect singleton subgroups
    else begin insert oid into S[j]; insert j into A-GroupIds endif
  endfor;
  for each j ∈ A-GroupIds do begin
    if |S[j]| > 1 then
      insert S[j] into C.π // store only non-singleton groups
    endif;
    groupNo = groupNo + 1; // but count all groups, including singleton ones
    S[j] = ∅
  endfor;
  groupNo = groupNo − 1;
  i = i + 1
endfor;
/* Clearing of array T for later use */
for i-th group G in partition A.π do
  for each element oid ∈ G do T[oid] = null endfor
endfor;
return groupNo;

3.5 Using Reduced Stripped Partitions in Fun

In this section, we offer a further reduction of the stripped partitions wrt. conditional attributes. Our proposal is based on the following observations. Let C be a conditional attribute set and d be the decision attribute, and let G be any group in the stripped partition wrt. C that is contained in a group belonging to the stripped partition wrt. d.

a) Group G operates in favour of a functional dependency between C and d.
b) Any subgroup G′ ⊆ G that occurs in the stripped partition wrt. a superset C′ ⊇ C also operates in favour of a functional dependency between C′ and d.

Thus, the verification of the containment of G in a group of the stripped partition wrt. d is dispensable in testing the existence of a functional dependency between C and d. We define a reduced stripped partition wrt. attribute set A (and denote it by π^rs_A) as the set of those groups in the stripped partition wrt. A that are not contained in any group in the stripped partition wrt. the decision d; that is, π^rs_A = {G ∈ π^s_A | ¬∃D ∈ π^s_{d} : G ⊆ D}.
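A sketch of the reduction itself (the names are ours; the data anticipates Example 3.5.1 below):

def reduce_stripped(stripped, stripped_d):
    # pi^rs_A: drop groups already contained in a group of pi^s_{d};
    # such groups can never violate A -> d
    decision_groups = [set(g) for g in stripped_d]
    return [g for g in stripped
            if not any(set(g) <= dg for dg in decision_groups)]

pi_e = [[1, 2], [3, 4], [5, 6, 7, 8], [9, 10]]   # pi^s_{e} of Table 2
pi_d = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]     # pi^s_{d} of Table 2
assert reduce_stripped(pi_e, pi_d) == [[3, 4], [5, 6, 7, 8]]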

Example 3.5.1. In Table 2, the stripped partition wrt. conditional attribute e: π^s_{e} = {{1, 2}, {3, 4}, {5, 6, 7, 8}, {9, 10}}, whereas the stripped partition wrt. decision attribute d: π^s_{d} = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9, 10}}. We note that group {1, 2} ∈ π^s_{e} and its subsets are contained in group {1, 2, 3} ∈ π^s_{d}. Similarly, group {9, 10} ∈ π^s_{e} and its subsets are contained in group {7, 8, 9, 10} ∈ π^s_{d}. There is no group in π^s_{d} containing {3, 4} or {5, 6, 7, 8}. Thus, the groups {1, 2} and {9, 10} in π^s_{e}, unlike the remaining two groups {3, 4} and {5, 6, 7, 8} in π^s_{e}, operate in favour of a functional dependency between {e} and {d}. Hence, the reduced stripped partition π^rs_{e} = {{3, 4}, {5, 6, 7, 8}}, and the reduced stripped partitions wrt. supersets of {e} will contain neither {1, 2}, nor {9, 10}, nor their subsets.
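A one-line reduction step suffices to reproduce this example in Python (again a sketch under the conventions of the previous snippets):

    def reduced_stripped(a_stripped, d_stripped):
        d_sets = [set(D) for D in d_stripped]
        # drop every group already contained in some decision group
        return [G for G in a_stripped
                if not any(set(G) <= D for D in d_sets)]

    pi_e = [[1, 2], [3, 4], [5, 6, 7, 8], [9, 10]]   # stripped partition wrt. e
    pi_d = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]     # stripped partition wrt. d
    print(reduced_stripped(pi_e, pi_d))              # [[3, 4], [5, 6, 7, 8]]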

It is easy to observe that the reduced stripped partition wrt. attribute set C can be calculated based on the product of the reduced stripped partitions wrt. subsets of C, as shown in Proposition 3.5.1.

Proposition 3.5.1. Let A, B, C ⊆ AT and C = A ∪ B. Then, the reduced stripped partition wrt. C equals the set of the groups in the product of the reduced stripped partitions wrt. A and B that are not contained in any group of the stripped partition wrt. decision d; that is, π^rs_C = {G ∈ π^rs_A ∩ π^rs_B | ¬∃ D ∈ π^s_{d} : G ⊆ D}.

function ReducedStrippedHolds(C → {∂});
  i = 1; holds = true;
  for i-th group G in partition C.π do begin
    oid = first element in group G;
    ∂-firstGroup = ∂.T[oid]; // the identifier of the group in ∂.π to which oid belongs
    if ∂-firstGroup = null then
      holds = false
      /* ∂.T[oid] = null indicates that oid constitutes a singleton group in the partition for ∂. */
      /* Hence, no next object in G belongs to this group in ∂.π, so C → {∂} does not hold. */
    else begin
      for each next element oid ∈ G do begin
        ∂-nextGroup = ∂.T[oid];
        if ∂-firstGroup ≠ ∂-nextGroup then
          /* there are oids in G that identify objects indiscernible wrt. C, but discernible wrt. ∂ */
          holds = false; break // hence, C → {∂} does not hold
        endif
      endfor;
      if ∂-firstGroup = ∂-nextGroup then
        delete G from C.π // G is contained in a group of ∂.π: remove it from the partition
      endif;
      i = i + 1
    endif
  endfor;
  return holds;

In our proposal, the product π^rs_A ∩ π^rs_B of the reduced stripped partitions wrt. A and B is calculated by means of the StrippedProduct function. The reduced stripped partition π^rs_C is determined from the product π^rs_A ∩ π^rs_B by the new ReducedStrippedHolds function. The function is a modification of StrippedHolds. Like StrippedHolds, it verifies whether there is a functional dependency between C and d. In addition, ReducedStrippedHolds removes those groups in the product π^rs_A ∩ π^rs_B that are contained in groups of π^s_{d}. The modified parts of the code in ReducedStrippedHolds are marked with comments.
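The same behaviour can be sketched in Python: the test visits every group and, as a side effect, removes the groups that are contained in a decision group (function name and list encoding are ours):

    def reduced_stripped_holds(c_partition, d_T):
        holds, kept = True, []
        for G in c_partition:
            first = d_T[G[0]]
            if first is not None and all(d_T[oid] == first for oid in G[1:]):
                continue               # G inside one decision group: drop it
            holds = False              # FD violated; keep G for supersets of C
            kept.append(G)
        c_partition[:] = kept          # the reduced partition, updated in place
        return holds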


Please note that the StrippedHolds function reads groups of π^s_C = π^s_A ∩ π^s_B only until the first group that is not contained in a group of π^s_{d} is found. On the contrary, ReducedStrippedHolds reads all groups of the product π^rs_A ∩ π^rs_B. This means that the execution of ReducedStrippedHolds may take longer than the execution of StrippedHolds when π^rs_A ∩ π^rs_B and π^s_A ∩ π^s_B are of similar length. On the other hand, ReducedStrippedHolds may execute faster than StrippedHolds when π^rs_A ∩ π^rs_B is shorter than π^s_A ∩ π^s_B. As an alternative to both solutions of shortening partitions, we propose the PartReducedStrippedHolds function, which deletes the groups from the product π^s_A ∩ π^s_B until the first group in this product that is not contained in a group of π^s_{d} is found. The result of PartReducedStrippedHolds is a group set that is a subset of π^s_A ∩ π^s_B and a superset of π^rs_A ∩ π^rs_B.

function PartReducedStrippedHolds(C → {∂});
  i = 1;
  for i-th group G in partition C.π do begin
    oid = first element in group G;
    ∂-firstGroup = ∂.T[oid]; // the identifier of the group in ∂.π to which oid belongs
    if ∂-firstGroup = null then return false endif;
      /* ∂.T[oid] = null indicates that oid constitutes a singleton group in the partition for ∂. */
      /* Hence, no next object in G belongs to this group in ∂.π, so C → {∂} does not hold. */
    for each next element oid ∈ G do begin
      ∂-nextGroup = ∂.T[oid];
      if ∂-firstGroup ≠ ∂-nextGroup then
        /* there are oids in G that identify objects indiscernible wrt. C, but discernible wrt. ∂ */
        return false // hence, C → {∂} does not hold
      endif
    endfor;
    delete G from C.π; // G is contained in a group of ∂.π: remove it
    i = i + 1
  endfor;
  return true; // C → {∂} holds
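A sketch of the partial variant: it stops at the first violating group, so only a prefix of the product gets reduced (same conventions as the previous snippets):

    def part_reduced_stripped_holds(c_partition, d_T):
        while c_partition:
            G = c_partition[0]
            first = d_T[G[0]]
            if first is None or any(d_T[oid] != first for oid in G[1:]):
                return False           # first violating group: stop deleting
            del c_partition[0]         # contained in a decision group: drop it
        return True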

We note that it is impossible to determine the number of groups in the product π_A ∩ π_B as a side-effect of calculating the product of the reduced stripped partitions π^rs_A ∩ π^rs_B. The same observation holds when the product is calculated from the partially reduced stripped partitions. The lack of this knowledge disallows using the optional pruning step in the FunGen algorithm. The usefulness of using fully or partially reduced stripped partitions will be examined experimentally in Section 4.

4 Experimental Results

We have performed a number of experiments on a few data sets available in the UCI Repository (http://www.ics.uci.edu/∼mlearn/MLRepository.html), and compared the six variants of Fun listed in Table 3 with the external tools listed in Table 4.

Table 3. Six variants of the Fun algorithm wrt. Holds algorithm, partitions' type and candidate pruning option

Fun's variant  Holds method                 Stripped partitions  Optional candidate pruning
H              Holds                        No                   No
H/P            Holds                        No                   Yes
SH             Stripped Holds               Yes                  No
SH/P           Stripped Holds               Yes                  Yes
PRSH           Part Reduced Stripped Holds  Yes                  No
RSH            Reduced Stripped Holds       Yes                  No

Table 4. Reference external tools

Label  Tool     Algorithm                Comments
Tane   Tane     Tane
RSES   RSES     Exhaustive
RSGR   Rosetta  SAV Genetic Reducer
RRER   Rosetta  RSES Exhaustive Reducer  Limitation to 500 records

Table 5. Execution times in seconds for the letter-recognition data set. ∗ – results are not available, the data set was too large to be analyzed; ∗∗ – RSES was written in Java, which could cause an additional overhead; ∗∗∗ – the times provided by Rosetta have a 1 second granularity.

   |O|    H        H/P     SH      SH/P    PRSH    RSH     RSES∗∗   Tane  RSGR∗∗∗  RRER∗∗∗
1  100    0.32     0.52    0.36    0.23    0.30    0.24    9.50     0.55  …        …
2  200    2.54     2.72    0.87    0.25    0.97    0.97    13.50    …     …        …
3  500    7.92     7.24    1.63    0.79    1.71    1.57    9.00     …     …        …
4  1000   26.70    19.41   3.72    2.03    3.28    2.94    14.00    …     …        …
5  2000   38.29    27.60   7.97    4.28    7.99    6.74    20.00    …     …        …
6  5000   126.19   76.48   28.52   19.54   28.66   28.91   130.00   …     …        …
7  10000  1687.15  976.04  51.51   52.97   59.97   131.79  960.00   …     …        …
8  15000  N/A∗     N/A∗    154.79  137.21  144.86  368.67  1860.00  …     …        …
9  20000  N/A∗     N/A∗    444.70  421.89  440.31  727.69  3060.00  …     …        …

The Neurophysiological Bases of Cognitive Computation

A.W. Przybyszewski

Simple cells are linear (F1/F0 > 1), whereas complex cells are nonlinear (F1/F0 < 1). The classical V1 RF properties can be found using small flashing light spots, moving white or dark bars, or gratings. We will give an example of the decision rules for the RF mapped with the moving white and dark bars [5]. A moving white bar gives the following decision rule:

DR V1 1: xp_i ∧ yp_0 ∧ xs_k ∧ ys_1 ∧ s_2 → r_1    (4)

The decision rule for a moving dark bar is given as:

DR V1 2: xp_j ∧ yp_0 ∧ xs_l ∧ ys_1 ∧ s_2 → r_1    (5)

where xp_i is the x-position of the incremental subfield, xp_j is the x-position of the decremental subfield, yp_0 is the y-position of both subfields, xs_k, xs_l, ys_1 are the horizontal and vertical sizes of the RF subfields, and s_2 is a vertical bar, which means that this cell is tuned to the vertical orientation. We have skipped other stimulus attributes such as movement velocity, direction, amplitude, etc. For simplicity, we assume that the cell is not direction sensitive, that it gives the same responses to both directions of bar movement and to the dark and light bars, and that cell responses are symmetric around the middle x position (xp). An overlap index [10] is defined as:

OI = (0.5(xs_k + xs_l) − |xp_i − xp_j|) / (0.5(xs_k + xs_l) + |xp_i − xp_j|)

OI compares the sizes of the increment (xs_k) and decrement (xs_l) subfields to their separation (|xp_i − xp_j|). After [11], if OI ≤ 0.3 ("non-overlapping" subfields), it is a simple cell with a dominating first harmonic response (linear), and r_1 is the amplitude of the first harmonic. If OI ≥ 0.5 (overlapping subfields), it is a complex cell with a dominating F0 response (nonlinear), and r_1 are changes in the mean cell activity.
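The overlap index and the simple/complex classification can be expressed directly; a small Python sketch with the thresholds quoted above (function names are ours):

    def overlap_index(xs_k, xs_l, xp_i, xp_j):
        """OI = (0.5(xs_k + xs_l) - |xp_i - xp_j|) / (0.5(xs_k + xs_l) + |xp_i - xp_j|)."""
        half_sum = 0.5 * (xs_k + xs_l)
        sep = abs(xp_i - xp_j)
        return (half_sum - sep) / (half_sum + sep)

    def classify_cell(oi):
        if oi <= 0.3:
            return "simple"        # non-overlapping subfields, F1 dominates (linear)
        if oi >= 0.5:
            return "complex"       # overlapping subfields, F0 dominates (nonlinear)
        return "unclassified"      # intermediate overlap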

Hubel and Wiesel [9] have proposed that the complex cell RF is created by a convergence of several simple cells, in a similar way as V1 RF properties are related to the RFs of LGN cells (Fig. 1). However, there is recent experimental evidence that the nonlinearity of the complex cell RF may be related to feedback or horizontal connections [12].

Decision Rules for area V4. The properties of the RFs in area V4 are more complex than those in area V1 or in the LGN, and in most cases they are nonlinear. It is not clear what exactly the optimal stimuli for cells in V4 are, but a popular hypothesis states that the V4 cells code simple, robust shapes. Below is an example from [13] of the decision rules for a narrow (0.4 deg) and long (4 deg) horizontal or vertical bar placed in different positions of the area V4 RF:

DR V4 1: o_0 ∧ ypr_m ∧ (yp_{−2.2} ∨ yp_{0.15}) ∧ xs_4 ∧ ys_{0.4} → r_2    (6)

DR V4 2: o_{90} ∧ xpr_m ∧ (xp_{−0.6} ∨ xp_{1.3}) ∧ xs_{0.4} ∧ ys_4 → r_1    (7)

The first rule relates area V4 cell responses to a moving horizontal bar (o_0), and the stimulus in the second rule is a moving vertical bar (o_{90}); ypr_m and xpr_m denote the tolerance for the y or x bar positions (more details in the Results section). The horizontal bar placed narrowly in two different y-positions (yp_{−2.2}, yp_{0.15}) gives strong cell responses (DR V4 1), and the vertical bar placed with a wide range in two different x-positions (xp_{−0.6}, xp_{1.3}) gives medium cell responses.

Decision Rules for feedforward connections LGN → V1. Thalamic axons target specific cells in layers 4 and 6 of the primary visual cortex (V1). Generally, we assume that there is a linear summation of LGN cells (approximately 10–100 of them [14]) onto one V1 cell. It was proposed [9] that the LGN cells determine the orientation of the V1 cell in the following way: LGN cells which have a direct synaptic connection to V1 neurons have their receptive fields arranged along a straight line on the retina (Fig. 1). In this classical model of Hubel and Wiesel [9], the major assumption is that activity of all (four in Fig. 1) LGN cells is necessary for a V1 cell to be sensitive to the specific stimulus (an oriented light bar). This principle determines the syntax of the LGN to V1 decision rule by using logical conjunction, meaning that if one LGN cell does not respond then there is no V1 cell response. After Sherman and Guillery [15], we will call such inputs drivers. Alonso et al. [14] showed that there is a high specificity between the RF properties of the LGN cells which have monosynaptic connections to a V1 simple cell. This precision goes beyond simple retinotopy and includes such RF properties as RF sign, timing, subregion strength and sign [14]. The decision rules for the feedforward LGN to V1 connections are the following:

DR LGN V1 1: r_1^{LGN}(x_i, y_i) ∧ r_1^{LGN}(x_{i+1}, y_i) ∧ … ∧ r_1^{LGN}(x_{i+n}, y_i) → r_1^{V1}    (8)


Fig. 1. On the left: modified schematic of the model proposed by [9]. Four LGN cells with circular receptive fields arranged along a straight line on the retina have direct synaptic connections to a V1 neuron. This V1 neuron is orientation sensitive, as marked by the thick, interrupted lines. On the right: receptive fields of two types of LGN cells, and two types of area V1 cells.

DR LGN V1 2: r_1^{LGN}(x_i, y_i) ∧ r_1^{LGN}(x_{i+1}, y_{i+1}) ∧ … ∧ r_1^{LGN}(x_{i+n}, y_{i+n}) → r_1^{V1}    (9)

where the first rule determines the response of cells in V1 with an optimal horizontal orientation, and the second rule says that the optimal orientation is 45 degrees; (x_i, y_i) is the localization of the RF in the x-y Euclidean coordinates of the visual field. Notice that these rules assume that the V1 RF is completely determined by the FF pathway from the LGN.
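As a toy illustration of the driver semantics of eqs. (8)-(9), here is a hypothetical Python sketch in which a V1 response appears only if every LGN cell along a line responds; the dict encoding of LGN activity is our own, not part of the model specification:

    def v1_driver_response(lgn, x0, y0, dx, dy, n):
        """Conjunction over n+1 LGN cells along a line from (x0, y0):
        one silent LGN cell vetoes the V1 response (driver input)."""
        line = [(x0 + k * dx, y0 + k * dy) for k in range(n + 1)]
        return 1 if all(lgn.get(p, 0) >= 1 for p in line) else 0

    # Four collinear LGN cells, as in Fig. 1 (horizontal rule, eq. (8)):
    lgn = {(0, 0): 1, (1, 0): 1, (2, 0): 1, (3, 0): 1}
    print(v1_driver_response(lgn, 0, 0, 1, 0, 3))   # 1: all drivers active
    print(v1_driver_response(lgn, 0, 0, 1, 1, 3))   # 0: the 45-degree rule (eq. (9)) fails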

Decision Rules for feedback connections V1 → LGN. There are several papers showing the existence of feedback connections from V1 to the LGN [16-20]. In [20], the authors quantitatively compared the visuotopic extent of geniculate feedforward afferents to V1 with the size of the RF center and surround of neurons in the V1 input layers, and the visuotopic extent of V1 feedback connections to the LGN with the RF size of cells in V1. Area V1 feedback connections restrict their influence to LGN regions visuotopically coextensive with the size of the classical RF of V1 layer 6 cells and commensurate with the LGN region from which they receive feedforward connections. In agreement with [15], we will denote feedback inputs as modulators, with the following decision rules:

DR V1 LGN 1: (r_1^{V1} ∨ r_1^{LGN}(x_i, y_i)), (r_1^{V1} ∨ r_1^{LGN}(x_i, y_{i+1})), (r_1^{V1} ∨ r_1^{LGN}(x_{i+1}, y_{i+1})), …, (r_1^{V1} ∨ r_1^{LGN}(x_{i+2n}, y_{i+2n})) → r_2^{LGN}    (10)

This rule says that when the activity of a particular V1 cell is in agreement with the activity of some LGN cells, their responses increase from r_1 to r_2; r_1^{LGN}(x_i, y_i) means an r_1 response of the LGN cell with coordinates (x_i, y_i) in the visual field, and r_2^{LGN} means r_2 responses of all the LGN cells in the decision rule whose activity was coincidental with the feedback excitation; it is a pattern of LGN cell activity.

Decision Rules for feedforward connections V1 → V4. There are relatively small direct connections from V1 to V4, bypassing area V2 [20], but we must also take into account the V1 to V2 [21] and V2 to V4 [22] feedforward connections, which are highly organized but variable, especially in V4. We simplify by assuming that V2 has properties similar to V1 but a larger RF size. We assume that, as from the retina to the LGN and from the LGN to V1, direct or indirect connections from V1 to V4 provide driver input and fulfill the following decision rules:

DR V1 V4 1: r_1^{V1}(x_i, y_i) ∧ r_1^{V1}(x_{i+1}, y_i) ∧ … ∧ r_1^{V1}(x_{i+n}, y_i) → r_1^{V4}    (11)

DR V1 V4 2: r_1^{V1}(x_i, y_i) ∧ r_1^{V1}(x_{i+1}, y_{i+j}) ∧ … ∧ r_1^{V1}(x_{i+n}, y_{i+m}) → r_1^{V4}    (12)

We assume that the RF in area V4 sums up driver inputs from regions in areas V1 and V2 of cells with highly specific RF properties, not only retinotopically correlated.

Decision Rules for feedback connections V4 → V1. Anterograde anatomical tracing [23] has shown axons backprojecting from area V4 directly to area V1, or sometimes with branches in area V2. Axons of V4 cells span large territories in area V1, with most terminations in layer 1, which can be either distinct clusters or linear arrays. These axon-specific branches determine decision rules that have similar syntax (see below), but the anatomical structure of a particular axon may introduce different semantics. Their anatomical structures may be related to the specific receptive field properties of different V4 cells. Distinct clusters may have terminals on V1 cells near pinwheel centers (cells with different orientations arranged radially), whereas a linear array of terminals may be connected to V1 neurons with similar orientation preference. In consequence, some parts of the V4 RF would have a preference for certain orientations, and others may have a preference for certain locations but be more flexible to different orientations. This hypothesis is supported by recent intracellular recordings from neurons located near pinwheel centers which, in contrast to other narrowly tuned neurons, showed subthreshold responses to all orientations [24]. However, neurons which have a fixed orientation can change other properties of their receptive field, for example spatial frequency; therefore the feedback from area V4 can tune them to expected spatial details in the RF (M. Sur, Brenda Milner Symposium, 22 Sept. 2008, MNI McGill University, Montreal). The V4 input modulates V1 cells in the following way:

DR V4 V1 1: (r_1^{V4} ∨ r_1^{V1}(x_i, y_i)), (r_1^{V4} ∨ r_1^{V1}(x_i, y_{i+1})), (r_1^{V4} ∨ r_1^{V1}(x_{i+1}, y_{i+1})), …, (r_1^{V4} ∨ r_1^{V1}(x_{i+n}, y_{i+m})) → r_2^{V1}    (13)

The meanings of r_1^{V1}(x_i, y_i) and r_2^{V1} are the same as explained above for the V1 to LGN decision rule.

Decision Rules for feedback connections V4 → LGN. Anterograde tracing from area V4 showed axons projecting to different layers of the LGN, and some of them also to the pulvinar [25]. These axons have widespread terminal fields, with branches non-uniformly spread over several millimeters (Fig. 2). Like descending axons in V1, axons from area V4 have, within their LGN terminations, distinct clusters or linear branches (Fig. 2). These clusters and branches are characteristic for different axons and, as mentioned above, their differences may be related to different semantics in the decision rule below:

DR V4 LGN 1: (r_1^{V4} ∨ r_1^{LGN}(x_i, y_i)), (r_1^{V4} ∨ r_1^{LGN}(x_i, y_{i+1})), (r_1^{V4} ∨ r_1^{LGN}(x_{i+1}, y_{i+1})), …, (r_1^{V4} ∨ r_1^{LGN}(x_{i+n}, y_{i+m})) → r_2^{LGN}    (14)

The meanings of r_1^{LGN}(x_i, y_i) and r_2^{LGN} are the same as explained above for the V1 to LGN decision rule. Notice that the interaction between FF and FB pathways extends the classical view that the brain, as a computer, uses two-valued logic. In psychophysics, this effect can be paraphrased as: "I see it but it does not fit my predictions". In neurophysiology, we assume that a substructure could be optimally tuned to the stimulus while its activity does not fit the FB predictions. Such an interaction can be interpreted as a third logical value. If there is no stimulus, the response in the local structure should have the logical value 0; if the stimulus is optimal for the local structure, it should have the logical value 1/2; and if it is also tuned to the expectations of higher areas (positive feedback), then the response should have the logical value 1. Generally, it becomes more complicated if we consider many interacting areas, but in this work we use only three-valued logic.
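Our reading of this three-valued scheme can be sketched as follows (a toy model, not the authors' code): drivers combine by conjunction, and feedback acts as a modulator that lifts r_1 to r_2:

    from fractions import Fraction

    def local_response(driver_inputs, feedback_matches):
        """driver_inputs: activity of the driving cells (booleans);
        feedback_matches: does the higher-area prediction agree (boolean)."""
        if not all(driver_inputs):
            return Fraction(0)         # no (or incomplete) stimulus: logical 0
        if feedback_matches:
            return Fraction(1)         # locally optimal and confirmed by FB: r2
        return Fraction(1, 2)          # locally optimal only: r1

    print(local_response([True] * 4, False))   # 1/2
    print(local_response([True] * 4, True))    # 1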


Fig. 2. Boutons of a descending axon from area V4 with terminals in different parvocellular layers of the LGN: layer 6 in black, layer 5 in red, layer 4 in yellow. The total number of boutons for this and other axons was between 1150 and 2075. We estimated that each descending V4 axon connects to approximately 500 to over 1000 LGN (mostly parvocellular) cells [25]. Thick lines outline the LGN; thin lines show layers 5 and 6; the dotted line shows azimuth, and dashed lines show elevation of the visual field covered by the descending axon. This axon's arborization extension is approximately the size of a V4 RF.

3 Results

We have used our model as a basis for an analysis of the experimental data from neurons recorded in the monkey's area V4 [2]. In [2], it was shown that the RF in V4 can be divided into several subfields that, stimulated separately, can give us a first approximation of the concept of the shape to which the cell is tuned [13]. We have also shown that the subfields are tuned to stimuli with similar orientation [2]. In Fig. 3, we demonstrate that the receptive field subfields have not only similar preferred orientations but also similar spatial frequencies [2]. We have divided cell responses into three categories (see Methods), separated by horizontal lines in plots A-D of Fig. 3. We have drawn a line near the spike frequency of 17 spikes/s, which separates responses of category r_1 (above) from r_0 (below the threshold line). Horizontal lines plotted near the spike frequency of 34 spikes/s separate responses of category r_2 (above) from r_1 (below). The stimulus attributes related to these three response categories were extracted into the decision table (Table 1). We summarize the results of our analysis from Table 1 in Figs. 3E-H. Fig. 3E presents a schematic of a possible stimulus that would give medium cell responses (r_1).


Fig. 3. Modified plots from [2]. Curves represent responses of V4 neurons to grating stimulation of their RF subfields with different spatial frequencies (SF). (A-D) SF selectivity curves across the RF, with positions indicated in insets. The centers of the tested subfields were 2 deg apart. (E-H) Schematic representation summarizing the orientation and SF selectivity of the subfields presented in A-D and in [2]. These figures are based on decision table 1; for stimuli in E, F cell responses were r_1, for stimuli in G, H cell responses were r_2. (F) and (H) represent possible stimulus configurations from schematics (E) and (G).

One can imagine several classes of possible stimuli, assuming that subfield responses sum up linearly (for example, see Fig. 3F). Fig. 3G shows a schematic of a possible stimulus set-up which would give an r_2 response that, as we have assumed, is related not only to the local but also to the global visual cortex tuning. One can notice that in the last case only subfields in the vertical row give strong independent responses (Fig. 3H). We assign the narrow (ob_n), medium (ob_m), and wide (ob_w) orientation bandwidths as follows: ob_n if (ob: 0 < ob < 50 deg), ob_m if (ob: 50 deg < ob < 100 deg), ob_w if (ob: ob > 100 deg). We assign the narrow (sfb_n), medium (sfb_m), and wide (sfb_w) spatial frequency bandwidths: sfb_n if (sfb: 0 < sfb < 2 c/deg), sfb_m if (sfb: 2 c/deg < sfb < 2.5 c/deg), sfb_w if (sfb: sfb > 2.5 c/deg). For simplicity, in the following decision rules we assume that the subfields are not direction sensitive; therefore responses to stimulus orientations 0 and 180 deg should be the same.
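These thresholds amount to a granulation of continuous attributes into symbolic values before they enter the decision table; a direct Python sketch (function names are ours):

    def ob_category(ob_deg):
        if ob_deg < 50:  return "ob_n"    # narrow orientation bandwidth
        if ob_deg < 100: return "ob_m"    # medium
        return "ob_w"                     # wide

    def sfb_category(sfb_cpd):
        if sfb_cpd < 2:   return "sfb_n"  # narrow SF bandwidth (c/deg)
        if sfb_cpd < 2.5: return "sfb_m"  # medium
        return "sfb_w"                    # wide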


Table 1. Decision table for one cell's responses to subfield stimulation, Fig. 3C-F and Fig. 5 in [2]. Attributes xpr, ypr, sf = 2 c/deg, s are constant and are not presented in the table. Cells 3* are from Fig. 3 in [2] and cells 5* are from Fig. 5 in [2], processed in Fig. 3.

cell  o    ob   sfb  xp  yp  r
3c    172  105  0    0   0   1
3c1   10   140  0    0   0   1
3c2   180  20   0    0   0   2
3d    172  105  0    0   -2  1
3d1   5    100  0    0   -2  1
3d2   180  50   0    0   -2  2
3e    180  0    0    -2  0   0
3f    170  100  0    0   2   1
3f1   10   140  0    0   2   1
3f2   333  16   0    0   2   2
5a    180  0    3    0   -2  1
5a1   180  0    0.9  0   -2  2
5b    180  0    3.2  0   2   1
5b1   180  0    1    0   2   2
5c    180  0    3    0   0   1
5c1   180  0    1.9  0   0   2
5d    180  0    0.8  0   0   1

Our results from the separate subfield stimulation study can be presented as the following decision rules:

DR V4 3: o_{180} ∧ sf_2 ∧ ((ob_w ∧ sfb_w ∧ xp_0 ∧ (yp_{−2} ∨ yp_0 ∨ yp_2)) ∨ (ob_n ∧ sfb_n ∧ yp_0 ∧ (xp_{−2} ∨ xp_2))) → r_1    (15)

DR V4 4: o_{180} ∧ sf_2 ∧ ob_n ∧ sfb_n ∧ xp_0 ∧ (yp_{−2} ∨ yp_0 ∨ yp_2) → r_2    (16)

These decision rules can be interpreted as follows: disc-shaped grating stimuli with wide bandwidths of orientation or spatial frequency evoke medium cell responses when placed along the vertical axis of the receptive field. However, similar discs, when placed horizontally to the left or to the right of the middle of the RF, must have narrow orientation and spatial frequency bandwidths to evoke medium cell responses. Only a disc narrowly tuned in spatial frequency and orientation, placed vertically from the middle of the receptive field, can evoke strong cell responses. Notice that Figs. 3F and 3H show possible configurations of the optimal stimulus.
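Written as predicates over a granulated stimulus description (a dict of attribute values; this encoding is ours), rules (15)-(16) read:

    def dr_v4_3(s):   # eq. (15): -> r1
        return (s["o"] == 180 and s["sf"] == 2 and
                ((s["ob"] == "ob_w" and s["sfb"] == "sfb_w" and
                  s["xp"] == 0 and s["yp"] in (-2, 0, 2)) or
                 (s["ob"] == "ob_n" and s["sfb"] == "sfb_n" and
                  s["yp"] == 0 and s["xp"] in (-2, 2))))

    def dr_v4_4(s):   # eq. (16): -> r2
        return (s["o"] == 180 and s["sf"] == 2 and
                s["ob"] == "ob_n" and s["sfb"] == "sfb_n" and
                s["xp"] == 0 and s["yp"] in (-2, 0, 2))

    def response_category(s):
        if dr_v4_4(s): return 2
        if dr_v4_3(s): return 1
        return 0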


This approach is similar to the assumption that an image of an object is initially represented in terms of the activation of a spatially arrayed set of multiscale, multioriented detectors, like arrangements of simple cells in V1 (metric templates in the subordinate-level object classification of Lades et al. [26]). However, this approach does not take into account interactions between several stimuli when more than one subfield is stimulated, and we will show below that there is a strong nonlinear interaction between subfields. We analyzed experiments where the RF was stimulated at first with a single small vertical bar and later with two bars changing their horizontal positions. One example of V4 cell responses to thin (0.25 deg) vertical bars in different horizontal positions is shown in the upper left part of Fig. 4 (Fig. 4E). The cell response has maximum amplitude for the middle (XPos = 0) bar position along the x-axis. Cell responses are not symmetrical around 0. In Fig. 4F, the same cell (cell 61 in Table 2) is tested with two bars. The first bar stays at the 0 position, while the second bar changes its position along the x-axis. Cell responses show several maxima, dividing the receptive field into four areas. However, this is not always the case, as responses to two bars in another cell (cell 62 in Table 2) show only two minima (Fig. 4G). Horizontal lines in the plots of both figures divide cell responses into the three categories r_0, r_1, r_2, which are related to the mean response frequency (see Methods). Stimulus attributes and cell responses classified into categories are shown in Table 2 for the cells in Fig. 4 and in Table 3 for the cells in Fig. 5. We assign the narrow (xpr_n), medium (xpr_m), and wide (xpr_w) x position ranges as follows: xpr_n if (xpr: 0 < xpr ≤ 0.6), xpr_m if (xpr: 0.6 < xpr ≤ 1.2), xpr_w if (xpr: xpr > 1.2). We assign the narrow (ypr_n), medium (ypr_m), and wide (ypr_w) y position ranges: ypr_n if (ypr: 0 < ypr ≤ 1.2), ypr_m if (ypr: 1.2 < ypr ≤ 1.6), ypr_w if (ypr: ypr > 1.6). On the basis of Fig. 4 and decision table 2 (also compare with [18]), the one-bar study can be presented as the following decision rules:

DR V4 5: o_{90} ∧ xpr_n ∧ xp_{0.1} ∧ xs_{0.25} ∧ ys_{0.4} → r_2    (17)

DR V4 6: o_{90} ∧ xpr_w ∧ xp_{−0.2} ∧ xs_{0.25} ∧ ys_{0.4} → r_1    (18)

We interpret these rules as meaning that the r_1 response in eq. (18) does not effectively involve the feedback to the lower areas V1 and LGN. The descending V4 axons have excitatory synapses not only on relay cells in the LGN and pyramidal cells in V1, but also on inhibitory interneurons in the LGN and inhibitory double bouquet cells in layer 2/3 of V1. As an effect of the feedback, only a narrow range of the area V4 RF responded with high r_2 activity to a single bar stimulus, whereas in the outside area excitatory and inhibitory feedback influences compensated each other. On the basis of Fig. 4 and the decision table, the two-bar horizontal interaction study can be presented as the following Two-bar Decision Rules (DRT):

DRT V4 1: o_{90} ∧ xpr_n ∧ ((xp_{−1.9} ∨ xp_{0.1} ∨ xp_{1.5}) ∧ xs_{0.25} ∧ ys_{0.4})_1 ∧ (o_{90} ∧ xp_0 ∧ xs_{0.25} ∧ ys_{0.4})_0 → r_2    (19)


DRT V4 2: o_{90} ∧ xpr_m ∧ ((xp_{−1.8} ∨ xp_{−0.4} ∨ xp_{0.4} ∨ xp_{1.2}) ∧ xs_{0.25} ∧ ys_{0.4})_1 ∧ (o_{90} ∧ xp_0 ∧ xs_{0.25} ∧ ys_{0.4})_0 → r_1    (20)

One-bar decision rules can be interpreted as follows: the narrow vertical bar evokes a strong response in the central positions, and medium responses in a larger area near the central position. Two-bar decision rules state that the cell responses to two bars are strong if one bar is in the middle of the RF (the bar with index 0 in the decision rules) and the second narrow bar (the bar with index 1 in the decision rules) is in certain specific positions in the RF, eq. (19). But when the second bar is in less precise positions, cell responses become weaker, eq. (20). Responses of other cells are sensitive to other bar positions (Fig. 4G). These differences could be correlated with the anatomical variability of the descending axons' connections.

Fig. 4. Modified plots from [2]. Curves represent responses of several cells from area V4 to small single (E) and double (F, G) vertical bars. Bars change their position along the x-axis (XPos). Responses are measured in spikes/sec. Mean cell responses ± SE are marked in E, F, and G. Cell responses are divided into three ranges by thin horizontal lines. Below each plot are schematics showing bar positions giving r_1 (gray) and r_2 (black) responses: below (E) for a single bar, below (F and G) for double bars (one bar was always in position 0). (H) This schematic extends the responses for horizontally placed bars (E) to the whole RF: white shows excitatory, black inhibitory interactions between bars. The bars' interactions are asymmetric in the RF.


Table 2. Decision table for cells shown in Fig. 4. Attributes o, ob, sf, sfb were constant and are not presented in the table.

cell  xp     xpr  xs    ys  s   r
61e   -0.7   1.4  0.25  4   2   1
61f1  -1.9   0.2  0.25  4   22  2
61f2  0.1    0.2  0.25  4   22  2
61f3  1.5    0.1  0.25  4   22  2
61f4  -1.8   0.6  0.25  4   12  1
61f5  -0.4   0.8  0.25  4   22  1
61f6  0.4    0.8  0.25  4   22  1
61f7  1.2    0.8  0.25  4   22  1
62g1  -1.5   0.1  0.25  4   22  2
62g2  -0.15  0.5  0.25  4   22  2
62g3  -1.5   0.6  0.25  4   22  1
62g4  -0.25  1.3  0.25  4   22  1
62g5  1      0.6  0.25  4   22  1
63h1  -0.5   0    0.5   1   44  2
63h2  1      1    1     1   44  1
63h3  0.2    0.1  0.25  4   22  2

Table 3. Decision table for one cell shown in Fig. 5. Attributes yp, ypr are constant and are not presented in the table. We introduce another parameter of the stimulus, the difference in the drifting direction of the gratings of the two patches: ddg = 0 when the gratings drift in the same direction, and ddg = 1 if the gratings in the two patches drift in opposite directions.

cell  xp     xpr  xs  ys  ddg  r
64c   -4.5   3    1   1   1    2
64c1  -1.75  1.5  1   1   1    1
64c2  -0.5   1    1   1   1    2
64d   -6     0    1   8   0    2
64d1  -3.5   4.8  1   8   0    1

As mentioned above, V4 axons in V1 have distinct clusters or linear branches. Descending pathways are modulators, which means that they follow the logical "or" rule. This rule states that cells in area V1 become more active as a result of the feedback only if their patterns "fit" the area V4 cell's "expectation". The decision table (Table 3), based on Fig. 5, describes cell responses to two patches placed in different positions along the x-axis of the receptive field (RF). Figure 5 shows that adding the second patch reduced the single-patch cell responses. We have assumed that the cell response to a single patch placed in the middle of the RF is r_2. The second patch suppresses cell responses to a greater extent when it is more similar to the first patch (Fig. 5D).


Fig. 5. Modified plots from [2]. Curves represent V4 cell responses to two patches with gratings moving in opposite directions, for a patch 1 deg in diameter (C), and in the same direction (D), for a patch 1 deg wide and 8 deg long. One patch is always at x-axis position 0 and the second patch changes its position, as marked in the XPos coordinates. The horizontal lines represent 95% confidence intervals for the response to a single patch in position 0. Below C and D, schematics show the positions of the patches and their influences on cell responses. Arrows show the direction of the moving gratings. The lower part of the figure shows two schematics of the excitatory (white) and inhibitory (black) interactions between patches in the RF. Patches with gratings moving in the same direction (right schematic) show larger inhibitory areas (more dark color) than patches moving in opposite directions (left schematic).

Two-patch horizontal interaction decision rules are as follows:

DRT V4 3: ddg_1 ∧ (o_0 ∧ xpr_3 ∧ xp_{4.5} ∧ xs_1 ∧ ys_1)_1 ∧ (o_0 ∧ xp_0 ∧ xs_1 ∧ ys_1)_0 → r_2    (21)

DRT V4 4: ddg_1 ∧ (o_0 ∧ xpr_1 ∧ xp_{0.5} ∧ xs_1 ∧ ys_1)_1 ∧ (o_0 ∧ xp_0 ∧ xs_1 ∧ ys_1)_0 → r_2    (22)

DRT V4 5: ddg_0 ∧ (o_0 ∧ xpr_{4.8} ∧ xp_{3.5} ∧ xs_1 ∧ ys_8)_1 ∧ (o_0 ∧ xp_0 ∧ xs_1 ∧ ys_1)_0 → r_1    (23)


Table 4. Decision table for cells in Fig. 6. Attributes yp, ypr, xs = ys = 0.5 deg, s = 33 (two discs) are constant and are not presented in the table. We introduce another parameter of the stimulus, the difference in the polarities of the two patches: dp = 0 if the polarities are the same, and dp = 1 if the polarities are opposite.

cell  xp     xpr  dp  r
81a   -0.1   0.5  0   1
81a1  -1.75  0.3  0   1
81a2  -1.2   1    1   1
81a3  1.25   1.5  1   1
81a4  -1.3   0.3  1   2
81a5  -1.3   0.3  1   2
81a6  1.5    0.4  1   2
81b   -1.4   0.6  1   1
81b1  0.9    0.8  1   1
81b2  0.9    0.2  1   2

These decision rules can be interpreted as follows: patches with gratings drifting in opposite directions give strong responses when positioned very near (overlapping) or 150% of their width apart from one another, eqs. (21, 22). Interaction of patches with similar gratings evoked small responses over a large extent of the RF, eq. (23). Generally, interactions between similar stimuli evoke stronger and more extended inhibition than those between different stimuli. These and other examples can be generalized to other classes of objects. Two-spot horizontal interaction decision rules are as follows:

DRT V4 6: dp_0 ∧ s_{33} ∧ (((xp_{−0.1} ∧ xpr_{0.5}) ∨ (xp_{−1.75} ∧ xpr_{0.3})) ∧ xs_{0.5})_1 ∧ (xp_0 ∧ xs_{0.5})_0 → r_1    (24)

DRT V4 7: dp_1 ∧ s_{33} ∧ (((xp_{−1.2} ∧ xpr_1) ∨ (xp_{1.25} ∧ xpr_{1.5})) ∧ xs_{0.5})_1 ∧ (xp_0 ∧ xs_{0.5})_0 → r_1    (25)

DRT V4 8: dp_1 ∧ s_{33} ∧ (((xp_{−1.3} ∧ xpr_{0.2}) ∨ (xp_{1.5} ∧ xpr_{0.4})) ∧ xs_{0.5})_1 ∧ (xp_0 ∧ xs_{0.5})_0 → r_2    (26)

DRT V4 9: dp_1 ∧ s_{33} ∧ (((xp_{−1.4} ∧ xpr_{0.6}) ∨ (xp_{0.9} ∧ xpr_{0.8})) ∧ xs_{0.5})_1 ∧ (xp_0 ∧ xs_{0.5})_0 → r_1    (27)

DRT V4 10: dp_1 ∧ s_{33} ∧ ((xp_{0.9} ∧ xpr_{0.2}) ∧ xs_{0.5})_1 ∧ (xp_0 ∧ xs_{0.5})_0 → r_2    (28)

where dp is the difference in light polarities between the two light spots (s_{33}), subscript 1 denotes the spot changing its x-axis position, and subscript 0 denotes the spot at position 0 on the x-axis.


Fig. 6. Modified plots from [2]. Curves represent V4 cell responses to a pair of 0.5 deg diameter bright and dark discs tested along the width axis. Continuous lines mark the curves for responses to different-polarity stimuli, and same-polarity stimuli are marked by a dashed line. Schematics for the cell responses shown in (A) are in (C-F) and (I, J). Schematics for the cell responses in (B) are in (G) and (H). Interactions between same-polarity light spots (C) are different from interactions between different-polarity patches (D-H). Small responses (class 1) are in (C), (D), (G), and larger responses (class 2) are in (E), (F), (H). (E) shows that there are no r_2 responses in same-polarity two-spot interactions. (I) shows small short-range excitatory (gray) and strong inhibitory (black) interactions between same-polarity spots, and (J) shows short-range inhibitory (dark) and longer-range excitatory interactions between different-polarity spots.

We propose the following classes of the object's Parts Interaction Rules (PIR):

PIR1: Facilitation when the stimulus consists of multiple similar thin bars with small distances (about 0.5 deg) between them, and suppression when the distance between bars is larger than 0.5 deg. Suppression/facilitation is very often a nonlinear function of the distance. In our experiments (Fig. 4), cell responses to two bars were periodic along the receptive field, with dominating periods of about 30, 50, or 70% of the RF width. These nonlinear interactions were also observed along the vertical and diagonal axes of the RF and often showed strong asymmetries in relation to the RF middle.

PIR2: Strong inhibition when the stimulus consists of multiple similar patches filled with gratings, with the distance between patch edges ranging from 0 deg (touching) to 2 deg; weak inhibition when the distance is between 3 and 5 deg across the RF width.


PIR3: If bars or patches have different attributes, such as polarity or drifting direction, their suppression is smaller and a localized facilitation at small distances between stimuli is present. As in the bar interactions, suppressions/facilitations between patches or bright/dark discs can be periodic along different RF axes and are often asymmetric in the RF.

We have tested the above rules in nine cells from area V4 by using disc- or annulus-shaped stimuli filled with optimally oriented drifting gratings of variable spatial frequency (Pollen et al. [2], Figs. 9, 10). Our assumption was that if there is a strong inhibitory mechanism as described in rule PIR2, then responses to an annulus with an inner diameter of at least 2 deg will be stronger than responses to the disc. In addition, by changing the spatial frequencies of the gratings inside the annulus, we expected eventually to find other periodicities along the RF width, as described by PIR3. In summary, we wanted to find out what relations there are between stimulus properties and area V4 cell responses: whether B-elementary granules have equivalence classes of the relation IND{r} or V4-elementary granules, or whether [u]_B ⇒ [u]_{B4}. It was evident from the beginning that because different area V4 cells have different properties, their responses to the same stimuli will be different; therefore we wanted to know if rough set theory would help us in our data modeling. We assign the spatial frequency categories low (sf_l), medium (sf_m), and high (sf_h) as follows: sf_l if (sf: 0 < sf ≤ 1 c/deg), sf_m if (sf: 1 c/deg < sf ≤ 4 c/deg), sf_h if (sf: sf > 4 c/deg). On the basis of this definition, we calculate for each row in Table 5 the spatial frequency range by taking into account the spatial frequency bandwidth (sfb). Therefore 107a is divided into 107al and 107am, 108a into 108al and 108am, and 108b into 108bl, 108bm, and 108bh. The stimuli used in these experiments can be placed in the following ten categories:

Y0 = |sf_l xo_7 xi_0 s_4| = {101, 105}
Y1 = |sf_l xo_7 xi_2 s_5| = {101a, 105a}
Y2 = |sf_l xo_8 xi_0 s_4| = {102, 104}
Y3 = |sf_l xo_8 xi_3 s_5| = {102a, 104a}
Y4 = |sf_l xo_6 xi_0 s_4| = {103, 106, 107, 108, 20a, 20b}
Y5 = |sf_l xo_6 xi_2 s_5| = {103a, 106a, 107al, 108bl}
Y6 = |sf_l xo_4 xi_0 s_4| = {108al}
Y7 = |sf_m xo_6 xi_2 s_5| = {107am, 108bm}
Y8 = |sf_m xo_4 xi_0 s_4| = {107b, 108am}
Y9 = |sf_h xo_6 xi_2 s_5| = {108bh}
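The granules Y0-Y9 above are simply the indiscernibility classes of the stimulus attributes; a Python sketch of their construction from a decision table (the dict encoding is ours):

    def ind_classes(table, attrs):
        """Group rows of a decision table into IND(B) classes,
        i.e. sets of objects with identical values on attrs."""
        classes = {}
        for obj, row in table.items():
            key = tuple(row[a] for a in attrs)
            classes.setdefault(key, set()).add(obj)
        return list(classes.values())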


Table 5. Decision table for eight cells comparing the center-surround interaction. All stimuli were concentric, and therefore the attributes are not xs, ys, but xo (outer diameter) and xi (inner diameter). All stimuli were localized around the middle of the receptive field, so that xp = yp = xpr = ypr = 0 were constant and are not in the table. The optimal orientation was normalized, denoted as 1, and removed from the table.

cell  sf   sfb   xo  xi  s  r
101   0.5  0     7   0   4  0
101a  0.5  0     7   2   5  1
102   0.5  0     8   0   4  0
102a  0.5  0     8   3   5  0
103   0.5  0     6   0   4  0
103a  0.5  0     6   2   5  1
104   0.5  0     8   0   4  0
104a  0.5  0     8   3   5  2
105   0.5  0     7   0   4  0
105a  0.5  0     7   2   5  1
106   0.5  0     6   0   4  1
106a  0.5  0     6   3   5  2
107   0.5  0.25  6   0   4  2
107a  2.1  3.8   6   2   5  2
107b  2    0     4   0   4  1
108   0.5  0     6   0   4  1
108a  2    0     4   0   4  2
108b  5    9     6   2   5  2
20a   0.5  0     6   0   4  1
20b   0.5  0     6   0   4  2

These are the equivalence classes for the stimulus attributes, which means that within each class the stimuli are indiscernible, IND(B). We have normalized the orientation bandwidth to 0 in {20a, 20b} and the spatial frequency bandwidth to 0 in the cases {107, 107a, 108a, 108b}, and put the values covered by the bandwidth into the spatial frequency parameters. There are three ranges of responses, denoted r_0, r_1, r_2. Therefore, on the basis of the neurological data, there are the following three categories of cell responses:

|r_0| = {101, 102, 102a, 103, 104, 105}
|r_1| = {101a, 103a, 105a, 107b, 108, 20a}
|r_2| = {104a, 106a, 107, 107al, 107am, 108al, 108am, 108bl, 108bm, 108bh, 20b}

which are denoted as X_0, X_1, X_2. We will calculate the lower and upper approximations [1] of the brain's basic concepts in terms of the stimulus basic categories:

B̲X_0 = Y0 ∪ Y2 = {101, 105, 102, 104}
B̄X_0 = Y0 ∪ Y2 ∪ Y3 ∪ Y4 = {101, 105, 102, 104, 102a, 104a, 103, 106, 107, 108, 20a, 20b}
B̲X_1 = Y1 = {101a, 105a}
B̄X_1 = Y1 ∪ Y5 ∪ Y8 ∪ Y4 = {101a, 105a, 103a, 107al, 108bl, 106a, 20b, 107b, 108am, 103, 107, 106, 108, 20a}


B̲X_2 = Y7 ∪ Y9 = {107am, 108bm, 108bh}
B̄X_2 = Y7 ∪ Y9 ∪ Y8 ∪ Y6 ∪ Y3 ∪ Y4 ∪ Y5 = {107am, 108bm, 108bh, 107b, 108am, 102a, 104a, 103a, 107al, 108bl, 106a, 20b, 103, 107, 106, 108, 20a, 108al}

Concept 0 and concept 1 are roughly B-defined, which means that only with some approximation have we found that the stimuli do not evoke a response, or evoke a weak or strong response in the area V4 cells. Certainly, a stimulus such as Y0 or Y2 does not evoke a response in all our examples (cells 101, 105, 102, 104). Also, stimulus Y1 evokes a weak response in all our examples: 101a, 105a. We are interested in stimuli that evoke strong responses, because they are specific for area V4 cells. We find two such stimuli, Y7 and Y9. In the meantime, other stimuli such as Y3, Y4 evoke no response, weak, or strong responses in our data. We can find the quality [1] of our experiments by comparing the properly classified stimuli POS_B(r) = {101, 101a, 105, 105a, 102, 104, 107am, 108bm, 108bh} to all stimuli and to all responses: γ{r} = card{101, 101a, 105, 105a, 102, 104, 107am, 108bm, 108bh} / card{101, 101a, …, 20a, 20b} = 0.38. We can also ask what percentage of cells we fully classified. We obtain consistent responses from 2 of 9 cells, which means that γ = 0.22. This is related to the fact that for some cells we tested more than two stimuli. What is also important from an electrophysiological point of view is that there are negative cases. There are many negative instances for concept 0, which means that in many cases this brain area responds to our stimuli; however, it seems that our concepts are still only roughly defined. We have the following decision rules:

DR V4 7: sf_l ∧ xo_7 ∧ xi_2 ∧ s_5 → r_1    (29)

DR V4 8: sf_l ∧ xo_7 ∧ xi_0 ∧ s_4 → r_0    (30)

DR V4 9: sf_l ∧ xo_8 ∧ xi_0 ∧ s_4 → r_0    (31)

DR V4 10: (sf_m ∨ sf_h) ∧ xo_6 ∧ xi_2 ∧ s_5 → r_2    (32)

These can be interpreted as the statement that a large annulus (s_5) evokes a weak response, whereas a large disc (s_4) evokes no response, when there is modulation with low spatial frequency gratings. However, a somewhat smaller annulus containing medium or high spatial frequency gratings evokes strong responses. It is unexpected that certain stimuli evoke inconsistent responses in different cells (Table 5):

103: sf_l ∧ xo_6 ∧ xi_0 ∧ s_4 → r_0
106: sf_l ∧ xo_6 ∧ xi_0 ∧ s_4 → r_1
107: sf_l ∧ xo_6 ∧ xi_0 ∧ s_4 → r_2

A disc of not very large dimensions containing a low spatial frequency grating can evoke no response (103), a small response (106), or a strong response (107).
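The approximations and the quality measure used above are easy to compute; a Python sketch, exercised on a fragment of the categories defined in the text:

    def lower(X, classes):
        """Union of the elementary classes fully contained in X."""
        return set().union(*(Y for Y in classes if Y <= X))

    def upper(X, classes):
        """Union of the elementary classes that intersect X."""
        return set().union(*(Y for Y in classes if Y & X))

    def gamma(concepts, classes, universe):
        """Quality of classification: |positive region| / |universe|."""
        pos = set().union(*(lower(X, classes) for X in concepts))
        return len(pos) / len(universe)

    # A fragment of the categories defined above:
    Y0 = {"101", "105"};  Y1 = {"101a", "105a"}
    Y2 = {"102", "104"};  Y3 = {"102a", "104a"}
    X0 = {"101", "102", "102a", "103", "104", "105"}   # concept r0
    print(lower(X0, [Y0, Y1, Y2, Y3]))   # Y0 ∪ Y2
    print(upper(X0, [Y0, Y1, Y2, Y3]))   # Y0 ∪ Y2 ∪ Y3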

4 Discussion

Physical properties of objects are different from their psychological representations. Gärdenfors [27] proposed describing the principle of the human perceptual system as grouping objects by similarities in a conceptual space. Human perceptual systems group together similar objects with unsharp boundaries [27], which means that objects are related to their parts by rough inclusion, or that different parts belong to objects with some approximation (degree) [28]. We suggest that similarity relations between objects and their parts are related to the hierarchical relationships between different visual areas. These similarities may be related to synchronizations of multi-resolution, parallel computations and are difficult to simulate using a digital computer [29]. Treisman [30] proposed that our brains extract features related to different objects using two different procedures: parallel and serial processing. The "basic features" were identified in psychophysical experiments as elementary features that can be extracted in parallel. Evidence of parallel feature extraction comes from experiments showing that the extraction time is independent of the number of objects. Other features need serial searches, so that the extraction time is proportional to the number of objects. High-level serial processing is associated with the integration and consolidation of items, combined with conscious awareness. Low-level parallel processes are rapid, global, related to high-efficiency categorization of items, and largely unconscious [30]. Treisman [30] showed that instances of a disjunctive set of at least four basic features can be detected through parallel processing. Other researchers have provided evidence for the parallel detection of more complex features, such as shape from shading [31] or experience-based learning of features of intermediate complexity [32]. Thorpe et al. [33], however, found in recent experiments that human and nonhuman primates can rapidly and accurately categorize briefly flashed natural images. Human and monkey observers are very good at deciding whether or not a novel image contains an animal, even when more than one image is presented simultaneously [34]. The underlying visual processing reflecting the decision that a target was present takes under 150 ms [33]. These findings are in contradiction to the classical view that only simple "basic features", likely related to early visual areas like V1 and V2, are processed in parallel [30]. Certainly, natural scenes contain more complex stimuli than "simple" geometric shapes. It seems that the conventional two-stage perception-processing model needs correction, because to the "basic features" we must add a set of unknown intermediate features. We propose that at least some intermediate features are related to receptive field properties in area V4. Area V4 has been associated with shape processing because its neurons respond to shapes [35] and because lesions in this area disrupt shape discriminations, complex-grouping discriminations [36], multiple-viewpoint shape discriminations [37], and rotated shape discriminations [38]. Area V4 responses are also driven by curvature or circularity, which was recently observed by means of human fMRI [39]. By applying rough sets to V4 neuron responses, we have differentiated between bottom-up information (hypothesis testing) related to the sensory input,


and predictions, some of which can be learned but are generally related to positive feedback from higher areas. If a prediction is in agreement with a hypothesis, the object classification will change from category 1 to category 2. Our research suggests that such decisions can be made very effectively during pre-attentive, parallel processing in multiple visual areas. In addition, we found that the decision rules of different neurons can be inconsistent. One should take into account that modeling complex phenomena demands the use of local models (captured by local agents, if one would like to use the multiagent terminology [6]) that should be fused afterwards. This process involves negotiations between agents [6] to resolve contradictions and conflicts in local modeling. One of the possible approaches to developing methods for complex concept approximations can be based on layered learning [41]. Inducing concept approximations should be developed hierarchically, starting from concepts that can be directly approximated using sensor measurements, toward complex target concepts related to perception. This general idea can be realized using additional domain knowledge represented in natural language. We have proposed decision rules for different visual areas and for the FF and FB connections between them. However, in processing our V4 experimental data we have also found inconsistent decision rules. These inconsistencies could help process different aspects of the properties of complex objects. The principle is similar to that observed in the orientation-tuned cells of the primary visual cortex. Neurons in V1 with overlapping receptive fields show different preferred orientations. It is assumed that this overlap helps extract local orientations in different parts of an object. However, it is still not clear which cell will dominate if several cells with overlapping receptive fields are tuned to different attributes of a stimulus. Most models assume a "winner takes all" strategy, meaning that using a convergence (synaptic weighted averaging) mechanism, the most dominant cells will take control over other cells, and less represented features will be lost. This approach is equivalent to a two-valued logic implementation. Our findings from area V4 seem to support a different strategy than the "winner takes all" approach. It seems that different features are processed in parallel and then compared with the initial hypothesis in higher visual areas. We think that descending pathways play a major role in this verification process. At first, the activity of a single cell is compared with the feedback modulator by logical conjunction, to avoid hallucinations. Next, the global logical disjunction ("modulators") operation allows the brain to choose a preferred pattern from the activities of different cells. This process of choosing the right pattern may have a strong anatomical basis, because individual axons have variable and complex terminal shapes, facilitating some regions and features over others, the so-called salient features (for example, Fig. 2). Learning can probably modify the synaptic weights of the feedback boutons, fine-tuning the modulatory effects of feedback. Neurons in area V4 integrate an object's attributes from the properties of its parts in two ways: (1) within the area, via horizontal or intra-laminar local excitatory-inhibitory interactions; (2) between areas, via feedback connections tuned to lower visual areas. Our research puts more emphasis on feedback


connections, because they are probably faster than horizontal interactions [42]. Different neurons have different Parts Interaction Rules (PIR, as described in the Results section) and perceive objects by way of multiple "unsharp windows" (Figs. 4, 6). If an object's attributes fit the unsharp window, a neuron sends positive feedback [3] to lower areas, which, as described above, use "modulator logical rules" to sharpen the attribute-extracting window and therefore change the neuron's response from class 1 to class 2 (Fig. 4 J and K; Fig. 6 C to D, E to F, and G to H). The above analysis of our experimental data leads us to suggest that the central nervous system chiefly uses at least two different logical rules: the "driver logical rule" and the "modulator logical rule." The first, the "driver logical rule," processes data using a large number of possible algorithms (over-representation). The second, the "modulator logical rule," supervises decisions and chooses the right algorithm. Below we will look at possible cognitive interpretations of our model, using the shape categorization task as an example. The classification of different objects by their different attributes has been regarded as a single process termed "subordinate classification" [40]. Relevant perceptual information is related to subordinate-level shape classification by distinctive information about the object, such as its size, surface, curvature of contours, etc. There are two theoretical approaches to shape representation: metric templates and invariant-parts models. As mentioned above, both theories assume that an image of the object is represented in terms of cell activation in areas like V1: a spatially arrayed set of multi-scale, multi-oriented detectors ("Gabor jets"). Metric templates [26] map object values directly onto units in an object layer, or onto hidden units, which can be trained to differentially activate or inhibit object units in the next layer [41]. Metric templates preserve the metrics of the input without the extraction of edges, viewpoint-invariant properties, parts, or the relations among parts. This model discriminates shape similarities and human psychophysical similarities of complex shapes or faces [25]. Matching a new image against those in the database is done by allowing the Gabor jets to independently change their own best fit (change their position). The similarity of two objects will be the sum of the correlations in corresponding jets. With this method, under changes in object or face position or changes in facial expressions, one can achieve 95% accuracy across several hundred faces [43]. The main problems with the Lades model [26] described above are that it does not distinguish among the largest effects in object recognition: it is insensitive to contour variations, which are very important psychophysically speaking, and it is insensitive to salient features (non-accidental properties, NAP) [3]. The model we propose here suggests that these features are probably related to the effects of feedback pathways, which may strengthen differences, signal salient features, and also assemble other features, making it possible to extract contours. A geon structural description (GSD) is a two-dimensional representation of an arrangement of parts, each specified in terms of its non-accidental characterization and the relations amongst these parts [38]. Across objects, the parts (geons) can differ in their NAP. NAP are properties that do not change with


Fig. 7. Comparison of differences in non-accidental properties between a brick and a cylinder using geons [3] and our model. The geon model shows attributes from psychological space, like curves, parallels or vertices, which may differ between subjects. The neurological model compares properties of both objects on the basis of single-cell recordings from the visual system. Both objects can stimulate similar receptive fields in area V4. These receptive fields are sensitive in annuli: they extract orientation changes in different parts of the RF [2]. Area V1 RFs are sensitive to edge orientations, whereas LGN RFs extract spots related to corners. All these different attributes are put together by the FF and FB pathways.

small depth rotations of an object. The presence or absence of the NAP of some geons, or the different relations between them, may be the basis for subordinate-level discrimination [38]. The advantage of the GSD is that the representation of objects in terms of their parts and the relations between them is accessible to cognition and fundamental for viewpoint-invariant perception. Our neurological model introduces interactions between RF parts as in the geon model; however, our parts are defined differently than the somewhat subjective parts of the GSD model. Fig. 7 shows the differences in the understanding of simple objects between the geon approach and our neurological approach. The top part of this figure shows the differences in non-accidental properties between a brick and a cylinder [3]. We propose a hierarchical definition of parts based on neurophysiological recordings from the visual system. Both objects may be classified in V4 by the receptive field discriminating


between different stimulus orientations in its central and peripheral parts, as schematically presented in Fig. 7 [2]. Another, different classification is performed by area V1, where oriented edges are extracted from both objects (Fig. 7). However, an even more precise classification is performed in the LGN, where objects are seen as sets of small circular shapes similar to the receptive fields in the retina (bottom part of Fig. 7). In our model, interactions between parts and NAPs are associated with the role of area V4 in visual discrimination, as described in the lesion experiments above [34-36]. However, feedback from area V4 to the LGN and area V1 could be responsible for the possible mechanism associated with the properties of the GSD model. The different interactions between parts may be related to the complexity and the individual shapes of the different axons descending from V4. Their separated cluster terminals may be responsible for the invariance related to small rotations (NAP). These are the anatomical bases of the GSD model, although we hypothesize that the electrophysiological properties of the descending pathways (FB), defined above as modulators, are even more important. The modulating role of the FB is related to the logic of the anatomical properties of the descending pathways. Through this logic, multiple patterns of coincidental activity between the LGN or V1 and the FB can be extracted. One may imagine that these differently extracted patterns of activity correlate with the multiple viewpoints or shape rotations defined as NAP in the GSD model. In summary, by applying rough set theory to model neurophysiological data, we have shown a new approach to object categorization in psychophysical space. Two different logical rules are applied to indiscernibility classes of LGN, V1, and V4 receptive fields: "driver logical rules" put many possible object properties together, and "modulator logical rules" choose those attributes which are in agreement with our previous experiences.

Acknowledgement. Thanks to Carmelo Milo for his technical help, as well as to Farah Averill and Dana Hayward for their help in editing the manuscript.

References

1. Pawlak, Z.: Rough Sets – Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht (1991)
2. Pollen, D.A., Przybyszewski, A.W., Rubin, M.A., Foote, W.: Spatial receptive field organization of macaque V4 neurons. Cereb. Cortex 12, 601–616 (2002)
3. Biederman, I.: Recognition-by-components: a theory of human image understanding. Psychol. Rev. 94(2), 115–147 (1987)
4. Przybyszewski, A.W., Gaska, J.P., Foote, W., Pollen, D.A.: Striate cortex increases contrast gain of macaque LGN neurons. Vis. Neurosci. 17, 485–494 (2000)
5. Przybyszewski, A.W., Kagan, I., Snodderly, M.: Eye position influences contrast responses in V1 of alert monkey [Abstract]. Journal of Vision 3(9), 698, 698a (2003), http://journalofvision.org/3/9/698/
6. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 2nd edn. Prentice Hall Series in Artificial Intelligence (2003)
7. Przybyszewski, A.W., Kon, M.A.: Synchronization-based model of the visual system supports recognition. Program No. 718.11. 2003 Abstract Viewer/Itinerary Planner. Society for Neuroscience, Washington, DC (2003)
8. Kuffler, S.W.: Neurons in the retina; organization, inhibition and excitation problems. Cold Spring Harb. Symp. Quant. Biol. 17, 281-292 (1952)
9. Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. 160, 106-154 (1962)
10. Schiller, P.H., Finlay, B.L., Volman, S.F.: Quantitative studies of single-cell properties in monkey striate cortex. I. Spatiotemporal organization of receptive fields. J. Neurophysiol. 39, 1288-1319 (1976)
11. Kagan, I., Gur, M., Snodderly, D.M.: Spatial organization of receptive fields of V1 neurons of alert monkeys: comparison with responses to gratings. J. Neurophysiol. 88, 2557-2574 (2002)
12. Bardy, C., Huang, J.Y., Wang, C., FitzGibbon, T., Dreher, B.: 'Simplification' of responses of complex cells in cat striate cortex: suppressive surrounds and 'feedback' inactivation. J. Physiol. 574, 731-750 (2006)
13. Przybyszewski, A.W.: Checking Brain Expertise Using Rough Set Theory. In: Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 746-755. Springer, Heidelberg (2007)
14. Alonso, J.M., Usrey, W.M., Reid, R.C.: Rules of connectivity between geniculate cells and simple cells in cat primary visual cortex. J. Neurosci. 21(11), 4002-4015 (2001)
15. Sherman, S.M., Guillery, R.W.: The role of the thalamus in the flow of information to the cortex. Philos. Trans. R. Soc. Lond. B Biol. Sci. 357(1428), 1695-1708 (2002)
16. Lund, J.S., Lund, R.D., Hendrickson, A.E., Bunt, A.H., Fuchs, A.F.: The origin of efferent pathways from the primary visual cortex, area 17, of the macaque monkey as shown by retrograde transport of horseradish peroxidase. J. Comp. Neurol. 164, 287-303 (1975)
17. Fitzpatrick, D., Usrey, W.M., Schofield, B.R., Einstein, G.: The sublaminar organization of corticogeniculate neurons in layer 6 of macaque striate cortex. Vis. Neurosci. 11, 307-315 (1994)
18. Ichida, J.M., Casagrande, V.A.: Organization of the feedback pathway from striate cortex (V1) to the lateral geniculate nucleus (LGN) in the owl monkey (Aotus trivirgatus). J. Comp. Neurol. 454, 272-283 (2002)
19. Angelucci, A., Sainsbury, K.: Contribution of feedforward thalamic afferents and corticogeniculate feedback to the spatial summation area of macaque V1 and LGN. J. Comp. Neurol. 498, 330-351 (2006)
20. Nakamura, H., Gattass, R., Desimone, R., Ungerleider, L.G.: The modular organization of projections from areas V1 and V2 to areas V4 and TEO in macaques. J. Neurosci. 13, 3681-3691 (1993)
21. Rockland, K.S., Virga, A.: Organization of individual cortical axons projecting from area V1 (area 17) to V2 (area 18) in the macaque monkey. Vis. Neurosci. 4, 11-28 (1990)
22. Rockland, K.S.: Configuration, in serial reconstruction, of individual axons projecting from area V2 to V4 in the macaque monkey. Cereb. Cortex 2, 353-374 (1992)
23. Rockland, K.S., Saleem, K.S., Tanaka, K.: Divergent feedback connections from areas V4 and TEO in the macaque. Vis. Neurosci. 11, 579-600 (1994)
24. Schummers, J., Mariño, J., Sur, M.: Synaptic integration by V1 neurons depends on location within the orientation map. Neuron 36, 969-978 (2002)
25. Przybyszewski, A.W., Potapov, D.O., Rockland, K.S.: Feedback connections from area V4 to LGN. In: Ann. Meet. Society for Neuroscience, San Diego, USA (2001), http://sfn.scholarone.com/itin2001/prog#620.9
26. Lades, M., Vortbrueggen, J.C., Buhmann, J., Lange, J., von der Malsburg, C., Wuertz, R.P., Konen, W.: Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers 42, 300-311 (1993)
27. Gärdenfors, P.: Conceptual Spaces. MIT Press, Cambridge (2000)
28. Polkowski, L., Skowron, A.: Rough Mereological Calculi of Granules: A Rough Set Approach to Computation. Computational Intelligence 17, 472-492 (2001)
29. Przybyszewski, A.W., Linsay, P.S., Gaudiano, P., Wilson, C.: Basic Difference Between Brain and Computer: Integration of Asynchronous Processes Implemented as Hardware Model of the Retina. IEEE Trans. Neural Networks 18, 70-85 (2007)
30. Treisman, A.: Features and objects: the fourteenth Bartlett memorial lecture. Q. J. Exp. Psychol. A 40, 201-237 (1988)
31. Ramachandran, V.S.: Perception of shape from shading. Nature 331, 163-166 (1988)
32. Ullman, S., Vidal-Naquet, M., Sali, E.: Visual features of intermediate complexity and their use in classification. Nature Neuroscience 5, 682-687 (2002)
33. Thorpe, S., Fize, D., Marlot, C.: Speed of processing in the human visual system. Nature 381, 520-522 (1996)
34. Rousselet, G.A., Fabre-Thorpe, M., Thorpe, S.J.: Parallel processing in high-level categorization of natural images. Nat. Neurosci. 5, 629-630 (2002)
35. David, S.V., Hayden, B.Y., Gallant, J.L.: Spectral receptive field properties explain shape selectivity in area V4. J. Neurophysiol. 96, 3492-3505 (2006)
36. Merigan, W.H.: Cortical area V4 is critical for certain texture discriminations, but this effect is not dependent on attention. Vis. Neurosci. 17(6), 949-958 (2000)
37. Merigan, W.H., Pham, H.A.: V4 lesions in macaques affect both single- and multiple-viewpoint shape discriminations. Vis. Neurosci. 15(2), 359-367 (1998)
38. Girard, P., Lomber, S.G., Bullier, J.: Shape discrimination deficits during reversible deactivation of area V4 in the macaque monkey. Cereb. Cortex 12(11), 1146-1156 (2002)
39. Dumoulin, S.O., Hess, R.F.: Cortical specialization for concentric shape processing. Vision Research 47, 1608-1613 (2007)
40. Biederman, I., Subramaniam, S., Bar, M., Kalocsai, P., Fiser, J.: Subordinate-level object classification reexamined. Psychol. Res. 62, 131-153 (1999)
41. Poggio, T., Edelman, S.: A network that learns to recognize three-dimensional objects. Nature 343, 263-266 (1990)
42. Girard, P., Hupé, J.M., Bullier, J.: Feedforward and feedback connections between areas V1 and V2 of the monkey have similar rapid conduction velocities. J. Neurophysiol. 85(3), 1328-1331 (2001)
43. Wiskott, L., Fellous, J.-M., Krueger, N., von der Malsburg, C.: Face recognition by elastic graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 775-779 (1997)

Diagnostic Feature Analysis of a Dobutamine Stress Echocardiography Dataset Using Rough Sets

Kenneth Revett

University of Westminster, Harrow School of Computer Science, London, England, HA1 3TP

Abstract. Stress echocardiography is an important functional diagnostic and prognostic tool that is now routinely applied to evaluate the risk of coronary artery disease (CAD). In patients who are unable to safely undergo a stress-based test, dobutamine is administered, which provides an effect on the cardiovascular system similar to stress. In this work, a complete dataset containing data on 558 subjects from a prospective longitudinal study is employed to investigate which diagnostic features correlate with the final outcome. The dataset was examined using rough sets, which produced a series of decision rules that predict which features influence the outcomes measured clinically and recorded in the dataset. The results indicate that the ECG attribute was the most informative diagnostic feature. In addition, prehistory information has a significant impact on the classification accuracy.

Keywords: dobutamine, ECG, LTF-C, reducts, rough sets, stress echocardiography.

1 Introduction

Heart disease remains the number one cause of mortality in the western world. Coronary artery disease (CAD) is a primary cause of morbidity and mortality in patients with heart disease. The early detection of CAD was in part made possible in the late 1970s by the introduction of echocardiography - a technique for measuring the physical properties of the heart using a variety of imaging techniques such as ultrasound and Doppler flow measurements [1], [2], [3]. The purpose of these imaging studies is to identify structural malformations such as aneurysms and valvular deformities. Although useful, structural information may not provide the full clinical picture in the way that functional imaging techniques such as stress echocardiography (SE) may. This imaging technique is a versatile tool that allows clinicians to diagnose patients with CAD efficiently and accurately. In addition, it provides information concerning the prognosis of the patient, which can be used to provide on-going clinical support to help reduce morbidity. The underlying basis for SE is the induction of cardiovascular stress, which generates cardiac ischemia, resulting in cardiac wall motion (a distension
type of motion). The motion should reflect the ability of the vasculature to adapt to stressful situations such as enhanced physical activity. The extent to which the vessels (and the heart itself) expand under strenuous activity reflects the viability of the vasculature. In coronary artery disease, the ability of the vasculature to adapt is limited as a result of episodes of ischemia - reductions in local blood supply - which cause tissue damage. Normally, the walls of the heart (in particular the left ventricle) move in a typical fashion in response to stress (i.e., heavy exercise). A quantitative measure called the wall motion score is computed, and its magnitude is directly related to the extent of wall motion abnormality (WMA). The WMA score thus provides a quantitative measure of how the heart responds to stress. Stress echocardiography was originally induced under conditions of strenuous exercise, such as bicycle and treadmill tests. In many cases, though, patients are not able to exercise to the level required, and pharmacological agents such as dobutamine or dipyridamole have been used to induce approximately the same level of stress on the heart as physical exercise. Dobutamine in particular emulates the effects of physical exercise on the cardiovascular system by increasing the heart rate and blood pressure and by enhancing cardiac contractility - which drives cardiac oxygen demand [4]. A number of reports have indicated that though there are subtle differences between exercise-induced and pharmacologically induced stress, they essentially provide the same stimulus to the heart and can therefore, in general, be used interchangeably [5], [6].

The focus of this paper is to investigate the effectiveness of dobutamine stress echocardiography (DSE) by analysing the results of a large study of 558 patients undergoing DSE. The purpose is to determine which attributes collected in this study correlate most closely with the decision outcome. After a careful investigation of this dataset, a set of rules is presented that relates conditional features (attributes) to decision outcomes. This rule set is generated through the application of rough sets, a data mining technique developed by the late Professor Pawlak [7]. The antecedents of the rules contain information about which features are involved in the decision outcome. In addition, the values of the relevant features provide quantitative information regarding the values that are relevant for each feature in the respective decision class. This provides very useful information regarding the features that are directly relevant in predicting the outcome: in this case, whether SE provides prognostic value, beyond other relevant and routinely collected medical information, with respect to the likelihood of cardiac events. In the next section, a literature review of previous work involving the clinical application of stress echocardiography is presented.

1.1 Previous Work

In 1998, Chuah and colleagues published a report on a follow-up study of 860 patients who underwent dobutamine stress echocardiography over a 2-year period [8]. The principal features examined in this study were wall motion abnormalities (WMAs), cardiovascular risk factors, and clinical status (collected at the time the dobutamine stress test was administered). Any prior myocardial infarctions were determined by patient history or the presence of
significant Q waves. The patient group (consisting of 479 men and 381 women; mean age 70 +/- 10 years) was monitored for a period of 52 months subsequent to the SE test. The follow-up results indicate that 86 patients had cardiac events, including 36 myocardial infarctions and cardiac death in 50 patients. Those patients with events tended to have a lower rest ejection fraction and more extensive WMAs at rest and with stress. The authors also examined how outcomes (as measured by the likelihood of an event) correlated with the SE results. Of the patients with normal SE results, 4% (12 of 302) had an event. Of the 321 patients with new or worsening WMAs, 44 (14%) had subsequent cardiac events during the follow-up period. Lastly, of the 237 patients with fixed WMAs (at rest and with stress), 30 (13%) had cardiac events during the follow-up period. The authors then examined the relationship between the feature space and the likelihood of a follow-up event (identifying univariate predictors of cardiac events). The independent predictors were: a history of congestive heart failure, percentage of abnormal segments at peak stress (measured via SE), and an abnormal left ventricular end-systolic volume response to stress. In the study by Krivokapich and colleagues [3], the prognostic value of dobutamine SE was directly assessed with respect to predicting cardiac events in patients with known or suspected coronary artery disease. The study was a retrospective examination of 1,183 patients who underwent DSE (dobutamine stress echocardiography). The patients were monitored for 12 months after the DSE examination in order to determine whether the results of the DSE were predictive of (or at least correlated with) subsequent cardiac events. The authors examined several features using bivariate logistic regression and forward and backward stepwise multiple logistic regression. The independent variables examined were: history of hypertension, diabetes mellitus, myocardial infarction, coronary artery bypass grafting surgery, age, gender, peak dose of dobutamine, rest and peak dobutamine heart rate, blood pressure, rate pressure product, presence of chest pain, abnormal electrocardiogram (ECG), WMA abnormality, and a positive SE. The results from this study indicate that a positive SE and an abnormal ECG were most indicative of a subsequent cardiac event (defined as a myocardial infarction, death or CABG). Patients who had a positive SE and an abnormal ECG had a 42% cardiac incidence rate, versus a 7% cardiac incidence rate for those with a negative SE and ECG. A positive SE alone yielded a 34% cardiac incidence rate during the 12-month follow-up period. These results demonstrate the predictive power of a positive SE for cardiac events within a relatively short time window. A study by Marwick and colleagues [6] sought to determine whether dobutamine echocardiography could be used as an independent predictor of cardiac mortality in a group of 3,156 patients (1,801 men and 1,355 women, mean age 63 +/- 12 years) in a nine-year longitudinal follow-up study (1988-1994). At the time of the SE examination, several clinical variables and patient history were recorded for subsequent uni- and multivariate analysis of predictors of cardiac death. During the follow-up period, 259 (8%) deaths attributed to cardiac failure occurred. The authors analysed the patient data with respect to clinical features in order to examine their predictive capacity
generally - and to determine if SE was correlated in any way with the outcome. Age, gender, and heart failure therapy were predictive of cardiac failure during the follow-up period. The addition of resting left ventricular function and SE testing data further improved the predictive capacity of a sequential model (Kaplan-Meier survival curves and Cox proportional hazards models). In those patients with a negative dobutamine echocardiogram (1,581 patients), the average rate of cardiac mortality was 1% per year, compared to 8% in those patients with SE abnormalities. The final result from this study indicates that the inclusion of SE, in addition to standard clinical data, significantly increases the ability to predict cardiac events. Though not an exhaustive list of published examinations of the predictive capacity of dobutamine echocardiography, the cases presented here are indicative of the approach used to examine whether this technique provides positive predictive information that can assist clinicians in patient care (see [8], [9] for additional studies). The approach is typically a longitudinal study utilising a substantial patient cohort. As most subjects are in clinical care for suspected heart disease, there is a substantial amount of clinical information that is acquired as part of the routine care of these patients. Typically, clinical data provide a predictive capacity on the order of 60%. The deployment of stress echocardiography enhances the predictive capacity over typical clinical data - even data acquired in the context of the disease from previous medical history. The reality for busy clinicians is that they may not be prepared to perform the complex analyses required to extract useful information from their data. This study attempts to provide a rational basis for the examination of the feature space of a typical SE dataset. The goal is to determine if the features are indeed correlated with the decision outcomes - and if so - which subset of features is relevant and what range of values is expected for predictive features. The next section presents a description of the dataset and some of the pre-processing stages employed for subsequent data analysis.

1.2 The Dataset

The data employed in this study were obtained from a prospective dobutamine stress echocardiography (DSE) study at the UCLA Adult Cardiac Imaging and Hemodynamics Laboratory conducted between 1991 and 1996. The patients were monitored during a five-year period and then observed for a further twelve months to determine if the DSE results could predict patient outcome. The outcomes were categorised into the following cardiac events: cardiac death, myocardial infarction (MI), and revascularisation by percutaneous transluminal coronary angioplasty (PTCA) or coronary artery bypass graft surgery (CABG) [5]. After normal exclusionary processes, the patient cohort consisted of 558 subjects (220 women and 338 men) with a median age of 67 (range 26-93). Dobutamine was administered intravenously using a standard delivery system yielding a maximum dose of 40 µg/kg/min. There were a total of 30 attributes collected in this study, which are listed in Table 1.


Table 1. The decision table attributes and their data types (continuous, ordinal, or discrete) employed in this study (see the text for details). Note that the range of correlation coefficients was -0.013 to 0.2476 (specific data not shown).

Attribute   Description                                               Type
bhr         basal heart rate                                          Integer
basebp      basal blood pressure                                      Integer
basedp      basal double product (= bhr x basebp)                     Integer
pkhr        peak heart rate                                           Integer
sbp         systolic blood pressure                                   Integer
dp          double product (= pkhr x sbp)                             Integer
dose        dose of dobutamine given                                  Integer
maxhr       maximum heart rate                                        Integer
mphr(b)     % of maximum predicted heart rate                         Integer
mbp         maximum blood pressure                                    Integer
dpmaxdo     double product on maximum dobutamine dose                 Integer
dobdose     dobutamine dose at which maximum double product occurred  Integer
age         age                                                       Integer
gender      gender (male = 0)                                         Level (2)
baseef      baseline cardiac ejection fraction                        Integer
dobef       ejection fraction on dobutamine                           Integer
chestpain   experienced chest pain (0 = yes)                          Integer
posecg      signs of heart attack on ECG (0 = yes)                    Integer
equivecg    ECG is equivocal (0 = yes)                                Integer
restwma     wall motion anomaly on echocardiogram (0 = yes)           Integer
posse       stress echocardiogram was positive (0 = yes)              Integer
newMI       new myocardial infarction, or heart attack (0 = yes)      Integer
newPTCA     recent angioplasty (0 = yes)                              Level (2)
newCABG     recent bypass surgery (0 = yes)                           Level (2)
death       died (0 = yes)                                            Level (2)
hxofht      history of hypertension (0 = yes)                         Level (2)
hxofptca    history of angioplasty (0 = yes)                          Level (2)
hxofcabg    history of bypass surgery (0 = yes)                       Level (2)
hxofdm      history of diabetes (0 = yes)                             Level (2)
hxofMI      history of heart attack (0 = yes)                         Level (2)
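To make the encoding of Table 1 concrete, the following Python sketch (not part of the original study, which used RSES) shows one hypothetical patient record under this schema; all values are invented for illustration, and 0 encodes "yes" for the binary Level (2) attributes.

# One hypothetical record under the Table 1 schema (values invented).
patient = {
    "bhr": 75, "basebp": 140, "basedp": 75 * 140,        # basal measurements
    "pkhr": 130, "sbp": 150, "dp": 130 * 150,            # peak measurements
    "dose": 40, "maxhr": 135, "mphr(b)": 85, "mbp": 160,
    "dpmaxdo": 19500, "dobdose": 40, "age": 67, "gender": 0,
    "baseef": 55, "dobef": 60, "chestpain": 1,
    "posecg": 1, "equivecg": 1, "restwma": 1, "posse": 1,
    "newMI": 1, "newPTCA": 1, "newCABG": 1, "death": 1,  # the four outcomes
    "hxofht": 0, "hxofptca": 1, "hxofcabg": 1, "hxofdm": 1, "hxofMI": 1,
}
OUTCOMES = ["newMI", "newPTCA", "newCABG", "death"]      # decision attributes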

The attributes were a mixture of categorical and continuous values. The decision classes used to evaluate this dataset were the outcomes listed above and in Table 1. As a preliminary evaluation of the dataset, the data were evaluated with respect to each of the four possible measured outcomes in the decision table individually, excluding each of the other three possible outcomes. This process was repeated for each of the outcomes in the decision table. Next, the effect of the electrocardiogram (ECG) was investigated. Reports indicate that this is a very informative attribute with respect to predicting the clinical outcome of a patient [3]. To evaluate the effect of the ECG on the outcomes, the base-case investigation (all four possible outcomes) was performed with (base case) and without the ECG attribute. Lastly, the information content of any
prehistory information was investigated to examine whether there was a correlation between the DSE and the outcome. There were a total of six different history attributes (see Table 1) that were tested to determine if each in isolation had a positive correlation with the outcomes. In the next section, we describe the experiments that were performed using rough sets (RSES 2.2.1).
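The evaluation protocol just described can be summarized in a short Python sketch. This is an illustration only: the hypothetical evaluate() argument stands in for the RSES pipeline (discretisation, reduct computation, rule generation, and cross validation), which is not reproduced here, and the decision table is assumed to be a list of attribute-value dictionaries as in the record example above.

OUTCOMES = ["newMI", "newPTCA", "newCABG", "death"]
HISTORY = ["hxofht", "hxofptca", "hxofcabg", "hxofdm", "hxofMI"]

def project(table, decision, masked=()):
    # Keep one outcome as the decision attribute; drop the other three
    # outcomes and any masked conditional attributes.
    drop = (set(OUTCOMES) - {decision}) | set(masked)
    return [{a: v for a, v in row.items() if a not in drop} for row in table]

def run_experiments(table, evaluate):
    results = {}
    for outcome in OUTCOMES:
        # base case: all conditional attributes present
        results[(outcome, "base")] = evaluate(project(table, outcome))
        # ECG masked (both ECG-related attributes withheld)
        results[(outcome, "-ecg")] = evaluate(
            project(table, outcome, masked=("posecg", "equivecg")))
        # each history attribute withheld one at a time
        for h in HISTORY:
            results[(outcome, "-" + h)] = evaluate(
                project(table, outcome, masked=(h,)))
    return results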

2 Results

In the first experiment, each outcome was used as the sole decision attribute. The four outcomes were: new myocardial infarction (MI) (28 cases), death (24 cases), newPTCA (27 cases), and newCABG (33 cases). All continuous attributes were discretised using the MDL algorithm within RSES [9], [10]. Note that there were no missing values in the dataset. A 10-fold cross validation was performed, using decision rules and dynamic reducts. Without any filtering of the reducts or rules, Table 2 presents randomly selected confusion matrices that were generated for each of the decision outcomes for the base case. The number of rules was quite large, and initially no filtering was performed to reduce either the number of reducts or the number of rules. The numbers of reducts for panels 'A'-'D' in Table 2 were 104, 159, 245, and 122, respectively. On average, the length of the reducts ranged from 5-9, out of a total of 27 attributes (minus the 3 other outcome decision classes). The number of rules (all of which were deterministic) was quite large, with a range of 23,356-45,330 for the cases listed in Table 2. Filtering was performed on both reducts (based on support) and rule coverage in order to reduce the cardinality of the decision rules. The resulting decision rule sets were reduced to a range of 314-1,197 rules, and the corresponding accuracy was reduced by approximately 4% (range 3-6%). Filtering can be performed on a variety of conditions, such as LHS support, coverage, and RHS support; please consult [10], [11] for an excellent discussion of this topic.

Table 2. Confusion matrices for the 'base' cases of the four different outcomes. The label 'A' corresponds to death, 'B' to MI, 'C' to newPTCA, and 'D' to newCABG. Note that the overall accuracy is placed at the lower right hand corner of each subtable.

A        0      1                 C        0      1
0      204      7    0.97         0      207      9    0.96
1        2     10    0.80         1        6      1    0.14
      0.95   0.22    0.92               0.97   0.10    0.96

B        0      1                 D        0      1
0      205      4    0.98         0      191     25    0.88
1        0     14    1.0          1        7      0    0.0
      0.94      0    0.92               0.93    0.0    0.86
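The filtering step described above can be sketched as follows (a simplified Python stand-in for the RSES filtering facilities; the rule representation and the threshold values are assumptions made for illustration).

def filter_rules(rules, n_objects=558, min_support=10, min_coverage=0.01):
    # rules: list of (antecedent, decision, lhs_support) triples
    kept = []
    for antecedent, decision, support in rules:
        coverage = support / n_objects  # fraction of objects matched by LHS
        if support >= min_support and coverage >= min_coverage:
            kept.append((antecedent, decision, support))
    return kept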


Table 3. Confusion matrices obtained using the exact same protocol as for the results reported in Table 2, with the ECG attribute excluded (masked) from the decision table. The labels 'A'-'D' correspond to the same outcomes as in Table 2.

A        0      1                 C        0      1
0      206      5    0.98         0      209      7    0.98
1        3      9    0.75         1        1      6    0.86
      0.95   0.22    0.92               0.97   0.10    0.96

B        0      1                 D        0      1
0      205      4    0.98         0      191     25    0.88
1        0     14    1.0          1        0      7    1.00
      0.94      0    0.96               0.93    0.0    0.94

Table 4. The classification accuracy obtained using the same protocol as for the data reported in Table 2 (note that the ECG attribute was included in the decision table), with each history attribute withheld in turn. The results are the average over the four different outcomes.

Attribute name                              Classification accuracy
History of hypertension                     91.1%
History of diabetes                         85.3%
History of smoking                          86.3%
History of angioplasty                      90.3%
History of coronary artery bypass surgery   82.7%

In the next experiment, the correlation between the outcome and the ECG result was examined. It has been reported that the ECG, which is a standard cardiological test measuring the functional activity of the heart, should be correlated with the outcome [2]. We therefore repeated the experiment of Table 2 with the ECG attribute excluded (masked) from the decision table. The results are reported in Table 3. Lastly, we examined the effect of the historical information that was collected and incorporated into the dataset (see Table 1). These historical attributes include: history of hypertension, diabetes, smoking, myocardial infarction, angioplasty, and coronary artery bypass surgery. We repeated the base set of experiments (including ECG), withheld each of the historical attributes one at a time, and report the results as a set of classification accuracies, listed in Table 4. In addition to classification accuracy, rough sets provide a collection of decision rules in conjunctive normal form. These rules contain the attributes and their values that appear as antecedents in the rule base.


Table 5. Sample set of rules from the base case (+ECG) with death as the decision outcome. The right hand column indicates the support (LHS) for the corresponding rule. Note that these rules were selected randomly from the full set.

Rule                                                                              Support
dp([20716, *]) AND dobdose(40) AND hxofDM(0) AND anyevent(0) ⇒ death(0)               19
dp([*, 13105]) AND dobdose(40) AND hxofDM(0) AND anyevent(0) ⇒ death(0)               18
basebp([*, 159]) AND sbp([115, 161]) AND hxofDM(0) AND anyevent(0) ⇒ death(0)         24
dp([*, 13105]) AND dobdose(35) AND dobEF([53, 61]) AND hxofDM(1) ⇒ death(1)           10
dp([20633, 20716]) AND dobdose(4) AND baseEF([56, 76]) AND hxofDM(0)
    AND anyevent(1) ⇒ death(1)                                                         1
dp([*, 13105]) AND dobdose(30) AND hxofCABG(0) AND anyevent(1)
    AND ecg([*, 2]) ⇒ death(1)                                                        12

Therefore, the decision rules provide a codification of the knowledge contained within the decision table. Examples of the resulting rule set for the base case, using death as the decision attribute, are presented in Table 5.
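To illustrate how such a rule reads as executable knowledge, the sketch below encodes the third rule of Table 5 in Python. The open interval ends marked * in the rules are represented as None; treating the intervals as half-open is an assumption made for illustration.

def in_interval(value, lo, hi):
    return (lo is None or value >= lo) and (hi is None or value < hi)

def rule3(row):
    # basebp([*,159]) AND sbp([115,161]) AND hxofDM(0) AND anyevent(0)
    return (in_interval(row["basebp"], None, 159)
            and in_interval(row["sbp"], 115, 161)
            and row["hxofDM"] == 0
            and row["anyevent"] == 0)  # if all hold, predict death(0)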

3 Conclusion

This dataset contained a complete set of attributes (30) that was a mixture of continuous and categorical data. The data were obtained from a prospective study of cardiovascular health conducted by professional medical personnel (cardiographers). The attributes were obtained from patients undergoing stress echocardiography, a routine medical technique employed to diagnose coronary artery disease. From the initial classification results, the specificity of the classification using rough sets was quite high (90+%), consistent with some literature reports [2], [6]. As can be seen in Table 2, the sensitivity of the test was also reasonably high, and consistent with several literature reports. The effect of the ECG, the attribute most correlated with the clinical outcome of CAD, was measured by masking this attribute. The results indicate that this attribute did not have a significant impact on the overall classification accuracy, although the sensitivity was reduced slightly when it was excluded from the decision table. This result requires further examination to quantify the role of an abnormal ECG, and the interaction/information content of an abnormal ECG and other medical indicators. The effect of patient history was examined, and the results (see Table 4) indicate that, in general, relevant medical history did have a positive impact on the classification accuracy. This result was quantified by examining the classification accuracy when these 5 history factors were removed from the decision table (one at a time). The effect of their combination
was not examined in this paper, and is left for future work. The data clearly indicate that a positive SE result was highly correlated with a subsequent cardiac event. This result was demonstrated by examining the rule set and noting the occurrences of this attribute in the antecedents. Lastly, the rule set that was produced yielded a consistently reduced set of attributes - ranging from 4-9 attributes - greatly reducing the size of the dataset. As displayed in Table 5, and generally across the rule set, the dp and dobdose attributes appear consistently (with large support) within all decision outcomes (data not displayed). This type of analysis is a major product of the rough sets approach to data analysis: the extraction of knowledge from data. This is a preliminary study that will be pursued in conjunction with a qualified cardiologist. The results generated so far are interesting - and certainly consistent with, and in many cases superior to, other studies [1], [3]. To this author's knowledge, this is the first report which examines the dobutamine SE literature using rough sets. Komorowski and Øhrn have examined a similar dataset, but the imaging technique and attributes selected were different from those used in the study investigated in this work [12]. In a preliminary examination of this dataset, Revett [13] published similar results to this study. A principal addition in this study is the confirmation of the 2007 study through the application of a novel neural network (LTF-C) to corroborate the reduced attribute set extracted from the rough sets examination. The application of LTF-C did indeed confirm that the classification accuracy was maximal with the selected set of attributes, compared to an exhaustive investigation of the other attributes with respect to training speed and classification accuracy. The results from this study indicate that a rough sets approach to rule extraction from this dataset provided evidence that corroborates much of the results reported in the literature. The basis for applying rough sets is that it provides evidence regarding the features and their values that are predictive with respect to the decision class. Further analysis of this dataset is possible, and this analysis would benefit from a close collaboration between medical experts and data mining engineers.

Acknowledgements. The author would like to acknowledge the source of the dataset used in this study: Alan Garfinkle, UCLA (at the time of submission): http://www.stat.ucla.edu:16080/projects/datasets/cardiacexplanation.html

References

1. Tsutsui, J.M., Elhendy, A., Anderson, J.A., Xie, F., McGrain, A.C., Porter, T.R.: Prognostic value of dobutamine stress myocardial contrast perfusion echocardiography. Circulation 112, 1444-1450 (2005)
2. Armstrong, W.F., Zoghbi, W.A.: Stress Echocardiography: Current Methodology and Clinical Applications. J. Am. Coll. Cardiology 45, 1739-1747 (2005)
3. Krivokapich, J., Child, J.S., Walter, D.O., Garfinkel, A.: Prognostic value of dobutamine stress echocardiography in predicting cardiac events in patients with known or suspected coronary artery disease. J. Am. Coll. Cardiology 33, 708-716 (1999)
4. Bergeron, S., Hillis, G., Haugen, E., Oh, J., Bailey, K., Pellikka, P.: Prognostic value of dobutamine stress echocardiography in patients with chronic kidney disease. American Heart Journal 153(3), 385-391 (2007)
5. Marwick, T.H., Case, C., Poldermans, D., Boersma, E., Bax, J., Sawada, S., Thomas, J.D.: A clinical and echocardiographic score for assigning risk of major events after dobutamine echocardiography. Journal of the American College of Cardiology 43(11), 2102-2107 (2004)
6. Marwick, T.H., Case, C., Sawada, S., Timmerman, C., Brenneman, P., Kovacs, R., Short, L., Lauer, M.: Prediction of mortality using dobutamine echocardiography. Journal of the American College of Cardiology 37(3), 754-760 (2001)
7. Pawlak, Z.: Rough sets - Theoretical aspects of reasoning about data. Kluwer, Dordrecht (1991)
8. Chuah, S.-C., Pellikka, P.A., Roger, V.L., McCully, R.B., Seward, J.B.: Role of dobutamine stress echocardiography in predicting outcome of 860 patients with known or suspected coronary artery disease. Circulation 97, 1474-1480 (1998)
9. Senior, R.: Stress echocardiography - current status. Business Briefing: European Cardiology, 26-29 (2005)
10. Bazan, J., Szczuka, M.: The Rough Set Exploration System. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets III. LNCS, vol. 3400, pp. 37-56. Springer, Heidelberg (2005), http://logic.mimuw.edu.pl/~rses
11. Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough sets: A tutorial. In: Pal, S.K., Skowron, A. (eds.) Rough Fuzzy Hybridization - A New Trend in Decision Making, pp. 3-98. Springer, Heidelberg (1999)
12. Komorowski, J., Øhrn, A.: Modelling prognostic power of cardiac tests using rough sets. Artificial Intelligence in Medicine 15, 167-191 (1999)
13. Revett, K.: Analysis of a dobutamine stress echocardiography dataset using rough sets. In: Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 756-762. Springer, Heidelberg (2007)

Rules and Apriori Algorithm in Non-deterministic Information Systems

Hiroshi Sakai1, Ryuji Ishibashi1, Kazuhiro Koba1, and Michinori Nakata2

1 Mathematical Sciences Section, Department of Basic Sciences, Faculty of Engineering, Kyushu Institute of Technology, Tobata, Kitakyushu 804, Japan
[email protected]
2 Faculty of Management and Information Science, Josai International University, Gumyo, Togane, Chiba 283, Japan
[email protected]

Abstract. This paper presents a framework of rule generation in Non-deterministic Information Systems (NISs), which follows rough-set-based rule generation in Deterministic Information Systems (DISs). Our previous work on NISs coped with certain rules, minimal certain rules and possible rules. These rules are characterized by the concept of consistency. This paper relates possible rules to rules defined by the criteria support and accuracy in NISs. On the basis of the information incompleteness in NISs, it is possible to define new criteria, i.e., minimum support, maximum support, minimum accuracy and maximum accuracy. Then, two strategies of rule generation are proposed based on these criteria. The first is the Lower Approximation strategy, which defines rule generation under the worst condition. The second is the Upper Approximation strategy, which defines rule generation under the best condition. To implement these strategies, we extend the Apriori algorithm in DISs to an Apriori algorithm in NISs. A prototype system is implemented, and this system is applied to some data sets with incomplete information.

Keywords: Rough sets, Non-deterministic information, Incomplete information, Rule generation, Lower and upper approximations, Apriori algorithm.

1 Introduction

Rough set theory has been used as a mathematical tool of soft computing for approximately two decades. This theory usually handles tables with deterministic information. Many applications of this theory, such as rule generation, machine learning and knowledge discovery, have been presented [5, 9, 15, 21, 22, 23, 24, 25, 36, 38]. We follow rule generation in Deterministic Information Systems (DISs) [21, 22, 23, 24, 33], and we describe rule generation in Non-deterministic Information
Systems (NISs). NISs were proposed by Pawlak [21], Orlowska [19, 20] and Lipski [13, 14] to handle information incompleteness in DISs, like null values, unknown values and missing values. Since the emergence of incomplete information research, NISs have been playing an important role. Therefore, rule generation in NISs will also be an important framework for rule generation from incomplete information. The following are some important studies on rule generation from incomplete information. In [13, 14], Lipski presented a question-answering system besides an axiomatization of its logic, and Orlowska established rough set analysis for non-deterministic information [3, 19, 20]. Grzymala-Busse developed a system named LERS, which depends upon the LEM1 and LEM2 algorithms [5, 6, 7], and recently proposed four interpretations of missing attribute values [8]. Stefanowski and Tsoukias defined non-symmetric similarity relations and valued tolerance relations for analyzing incomplete information [34, 35]. Kryszkiewicz proposed a framework of rules in incomplete information systems [10, 11, 12]. To the authors' knowledge, these are the most important studies on incomplete information. We have also discussed several issues related to non-deterministic information and incomplete information [16, 17, 18], and proposed a framework named Rough Non-deterministic Information Analysis (RNIA) [26, 27, 28, 29, 30, 31, 32]. In this paper, we briefly review RNIA, including certain and possible rules, and then develop rule generation by the criteria support and accuracy in NISs. For this rule generation, we extend the Apriori algorithm in DISs to a new algorithm in NISs. The computational complexity of this new algorithm is almost the same as that of the Apriori algorithm. Finally, we investigate a prototype system, and apply it to some data sets with incomplete information.

2 Basic Definitions and Background of the Research

This section summarizes basic definitions, and reviews the background of this research in [28, 31, 32].

2.1 Basic Definitions

A Deterministic Information System (DIS) is a quadruplet (OB, AT, {VALA | A ∈ AT}, f), where OB is a finite set whose elements are called objects, AT is a finite set whose elements are called attributes, VALA is a finite set whose elements are called attribute values, and f is a mapping f: OB × AT → ∪A∈AT VALA, which is called a classification function. If f(x, A)=f(y, A) for every A ∈ ATR ⊂ AT, we say there is a relation between x and y for ATR. This relation is an equivalence relation over OB, and it is called an indiscernibility relation. We usually define two sets: CON ⊆ AT, which we call condition attributes, and DEC ⊆ AT, which we call decision attributes. An object x ∈ OB is consistent (with any distinct object y ∈ OB) if f(x, A)=f(y, A) for every A ∈ CON implies f(x, A)=f(y, A) for every A ∈ DEC.
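A minimal Python sketch (not from the paper) of these definitions: the equivalence classes of the indiscernibility relation for ATR, and the consistency test for an object with respect to CON and DEC. A DIS is represented here as a dictionary f mapping each object to its attribute-value dictionary.

def eq_classes(f, OB, ATR):
    # equivalence classes of the indiscernibility relation for ATR
    classes = {}
    for x in OB:
        key = tuple(f[x][A] for A in ATR)
        classes.setdefault(key, set()).add(x)
    return classes

def consistent(f, OB, x, CON, DEC):
    # x is consistent if every y agreeing with x on CON also agrees on DEC
    return all(tuple(f[y][B] for B in DEC) == tuple(f[x][B] for B in DEC)
               for y in OB
               if tuple(f[y][A] for A in CON) == tuple(f[x][A] for A in CON))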


A Non-deterministic Information System (NIS) is also a quadruplet (OB, AT, {VALA | A ∈ AT}, g), where g: OB × AT → P(∪A∈AT VALA) (the power set of ∪A∈AT VALA). Every set g(x, A) is interpreted as follows: there is an actual value in this set, but this value is not known. For a NIS=(OB, AT, {VALA | A ∈ AT}, g) and a set ATR ⊆ AT, we name a DIS=(OB, ATR, {VALA | A ∈ ATR}, h) satisfying h(x, A) ∈ g(x, A) a derived DIS (for ATR) from the NIS. For a set ATR={A1, ..., An} ⊆ AT and any x ∈ OB, let PT(x, ATR) denote the Cartesian product g(x, A1) × ... × g(x, An). We name every element a possible tuple (for ATR) of x. For a possible tuple ζ=(ζ1, ..., ζn) ∈ PT(x, ATR), let [ATR, ζ] denote the formula ∧1≤i≤n [Ai, ζi]. Every [Ai, ζi] is called a descriptor. Let PI(x, CON, DEC) (x ∈ OB) denote the set {[CON, ζ] ⇒ [DEC, η] | ζ ∈ PT(x, CON), η ∈ PT(x, DEC)}. We name an element of PI(x, CON, DEC) a possible implication (from CON to DEC) of x. In the following, τ denotes a possible implication, and τx denotes a possible implication obtained from an object x. Now, we define six classes of possible implications, certain rules and possible rules. For any τx ∈ PI(x, CON, DEC), let DD(τx, x, CON, DEC) denote the set {ϕ | ϕ is such a derived DIS for CON ∪ DEC that the implication from x in ϕ is equal to τx}. If PI(x, CON, DEC) is a singleton set {τx}, we say τx is definite. Otherwise we say τx is indefinite. If the set {ϕ ∈ DD(τx, x, CON, DEC) | x is consistent in ϕ} is equal to DD(τx, x, CON, DEC), we say τx is globally consistent (GC). If this set is equal to {}, we say τx is globally inconsistent (GI). Otherwise, we say τx is marginal (MA). By combining the two cases, i.e., 'D(efinite) or I(ndefinite)' and 'GC, MA or GI', we define six classes, DGC, DMA, DGI, IGC, IMA, IGI, in Table 1.

Table 1. Six classes of possible implications in NISs

             GC    MA    GI
Definite     DGC   DMA   DGI
Indefinite   IGC   IMA   IGI

Now, we give necessary and sufficient conditions for characterizing the GC, MA and GI classes. For any ζ ∈ PT(x, ATR), we define two sets:
inf(x, ATR, ζ) = {y ∈ OB | PT(y, ATR) = {ζ}} ∪ {x},
sup(x, ATR, ζ) = {y ∈ OB | ζ ∈ PT(y, ATR)}.
Intuitively, inf(x, ATR, ζ) is the set of objects whose tuples are definitely ζ. If a tuple ζ ∈ PT(x, ATR) is not definite, the object x itself does not satisfy
PT(x, ATR)={ζ}. Therefore, we added the set {x} in the definition of inf. The set sup(x, ATR, ζ) is the set of objects whose tuples may be ζ. Even though x does not appear in the right hand side of sup, we employ the sup(x, ATR, ζ) notation due to the inf(x, ATR, ζ) notation. Generally, {x} ⊆ inf(x, ATR, ζ) = sup(x, ATR, ζ) holds in DISs, and {x} ⊆ inf(x, ATR, ζ) ⊆ sup(x, ATR, ζ) holds in NISs.

Theorem 1 [28, 29]. For a NIS, let us consider a possible implication τx: [CON, ζ] ⇒ [DEC, η] ∈ PI(x, CON, DEC). Then, the following holds.
(1) τx belongs to the GC class, if and only if sup(x, CON, ζ) ⊆ inf(x, DEC, η).
(2) τx belongs to the MA class, if and only if inf(x, CON, ζ) ⊆ sup(x, DEC, η).
(3) τx belongs to the GI class, if and only if inf(x, CON, ζ) ⊈ sup(x, DEC, η).

Proposition 2 [28, 29]. For any NIS, let ATR ⊆ AT be {A1, ..., An}, and let a possible tuple ζ ∈ PT(x, ATR) be (ζ1, ..., ζn). Then, the following holds.
(1) inf(x, ATR, ζ) = ∩i inf(x, {Ai}, (ζi)).
(2) sup(x, ATR, ζ) = ∩i sup(x, {Ai}, (ζi)).

2.2 An Illustrative Example

Let us consider NIS1 in Table 2. There are four derived DISs, shown in Table 3.

Table 2. A table of NIS1

OB   Color          Size
1    {red, green}   {small}
2    {red, blue}    {big}
3    {blue}         {big}

Table 3. The four derived DISs from NIS1: ϕ1, ϕ2, ϕ3, ϕ4 from left to right.

ϕ1: OB Color Size      ϕ2: OB Color Size      ϕ3: OB Color Size      ϕ4: OB Color Size
    1  red   small         1  red   small         1  green small         1  green small
    2  red   big           2  blue  big           2  red   big           2  blue  big
    3  blue  big           3  blue  big           3  blue  big           3  blue  big

Let us focus on the possible implication τ13: [Color, blue] ⇒ [Size, big] ∈ PI(3, {Color}, {Size}). Here τ13 denotes the first possible implication from object 3, and τ13 appears in all four derived DISs. Since
{2, 3} = sup(3, {Color}, (blue)) ⊆ inf(3, {Size}, (big)) = {2, 3},
τ13 belongs to the DGC class according to Theorem 1. Namely, τ13 is consistent in each derived DIS. As for the second possible implication,
τ21: [Color, red] ⇒ [Size, small] ∈ PI(1, {Color}, {Size}), the following holds:
{1, 2} = sup(1, {Color}, (red)) ⊈ inf(1, {Size}, (small)) = {1},
{1} = inf(1, {Color}, (red)) ⊆ sup(1, {Size}, (small)) = {1}.
According to Theorem 1, τ21 belongs to the IMA class; namely, τ21 appears in ϕ1 and ϕ2, and τ21 is consistent only in ϕ2.
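The example can be checked mechanically. The following Python sketch (an illustration, not the authors' Prolog implementation) encodes NIS1, enumerates its derived DISs, and tests the conditions of Theorem 1 for τ13 and τ21.

from itertools import product

g = {1: {"Color": {"red", "green"}, "Size": {"small"}},
     2: {"Color": {"red", "blue"},  "Size": {"big"}},
     3: {"Color": {"blue"},         "Size": {"big"}}}
OB = [1, 2, 3]

def inf(x, A, v):   # objects whose value on A is definitely v, plus x itself
    return {y for y in OB if g[y][A] == {v}} | {x}

def sup(x, A, v):   # objects whose value on A may be v
    return {y for y in OB if v in g[y][A]}

# the four derived DISs of Table 3
derived = [dict(zip(OB, combo)) for combo in product(
    *[[{"Color": c, "Size": s} for c in g[x]["Color"] for s in g[x]["Size"]]
      for x in OB])]
print(len(derived))                                       # 4

# tau13: [Color, blue] => [Size, big] from object 3
print(sup(3, "Color", "blue") <= inf(3, "Size", "big"))   # True: GC
# tau21: [Color, red] => [Size, small] from object 1
print(sup(1, "Color", "red") <= inf(1, "Size", "small"))  # False: not GC
print(inf(1, "Color", "red") <= sup(1, "Size", "small"))  # True: MA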

2.3 Certain Rule Generation in Non-deterministic Information Systems

This subsection briefly reviews the previous research on certain rule generation in NISs [28, 29]. We have named possible implications in the DGC class certain rules. For certain rule generation, we dealt with the following problem.

Problem 1 [29]. For a NIS, let DEC be decision attributes and let η be a tuple of decision attribute values for DEC. Then, find minimal certain rules in the form of [CON, ζ] ⇒ [DEC, η].

According to Theorem 1, Problem 1 is reduced to finding some minimal sets of descriptors [CON, ζ] satisfying sup(x, CON, ζ) ⊆ inf(x, DEC, η). For solving this problem, we employed a discernibility function in DISs [33]. We adjusted the discernibility function to NISs, and implemented utility programs [29].

Example 1. Let us focus on the possible implication τ13: [Color, blue] ⇒ [Size, big] in Table 2 again. Since inf(3, {Size}, (big))={2, 3}, it is necessary to discriminate every object not in {2, 3} (here, object 1) from object 3. The descriptor [Color, blue] discriminates object 1 from object 3, because sup(3, {Color}, (blue))={2, 3} and 1 ∉ sup(3, {Color}, (blue)) hold. In this way, the discernibility function DF(3) becomes [Color, blue], and we obtain the minimal certain rule τ13. The following is a real execution.

% ./plc
?- consult(dgc_rule.pl).
yes
?- trans.
File Name for Read Open: 'data.pl'.
Decision Definition File: 'attrib.pl'.
File Name for Write Open: 'data.rs'.
EXEC TIME=0.01796603203 (sec)
yes
?- minimal.
/* [1,blue](=[Color,blue]), [2,big](=[Size,big]) */
> Descriptor [1,blue] is a core for object 1
[1,blue]=>[2,big] [4/4(=4/4,1/1), Definite, GC: Only Core Descriptors]
EXEC TIME=0.01397013664 (sec)
yes


This program is implemented in Prolog [28, 29, 30]. Each attribute is identified with its ordinal number; namely, Color and Size are identified with 1 and 2, respectively. The underlined parts are specified by a user.

2.4 Non-deterministic Information and Incomplete Information

This subsection clarifies the semantic difference between non-deterministic information and incomplete information.

Table 4. A table of a DIS with incomplete information

OB   Color   Size
1    *       small
2    *       big
3    blue    big

Let us consider Table 4. The symbol '*' is often employed for indicating incomplete information. Table 4 is generated by replacing the non-deterministic information in Table 2 with *. There are several interpretations of this * symbol [4, 7, 8, 10, 17, 34]. In the simplest interpretation of incomplete information, the symbol * may be any attribute value. Namely, * may be either red, blue or green, and there are 9 (=3×3) possible tables for Table 4. In such a possible table, the implication from object 1 may be [Color, blue] ⇒ [Size, small], and this contradicts τ13: [Color, blue] ⇒ [Size, big]. On the other hand, in Table 2 we have g(1, {Color})={red, green} ⊊ {red, blue, green}, and we dealt with four derived DISs. In Table 2, we did not handle [Color, blue] ⇒ [Size, small] from object 1. Thus, τ13 is globally consistent in Table 2, but τ13 is inconsistent in Table 4. The function g(x, A) and the set sup(x, ATR, ζ) are employed for handling information incompleteness, and they cause the semantic difference between non-deterministic information and incomplete information. In RNIA, the interpretation of the information incompleteness comes from the meaning of the function g(x, A). There is no other assumption on this interpretation.

2.5 A Problem of Possible Rule Generation in Non-deterministic Information Systems

We have defined possible rules as possible implications which belong to either the DGC, DMA, IGC or IMA class. In this case, there may be a large number of possible implications satisfying condition (2) in Theorem 1. For example, in Table 2 there are four possible implications, including τ13 and τ21, and every possible implication is consistent in at least one derived DIS. Thus, every possible implication is a possible rule. This implies that the definition of possible rules may be too weak. Therefore, we need to employ other criteria for defining rules besides certain rules.


In the subsequent sections, we follow the framework of rule generation in [1, 2, 22, 36, 38], and employ the criteria support and accuracy for defining rules, including possible rules.

3 New Criteria: Minimum Support, Minimum Accuracy, Maximum Support and Maximum Accuracy

This section proposes new criteria in NISs, and investigates the calculation of these criteria. The new criteria depend upon each element in DD(τx, x, CON, DEC), but the complexity of the calculation does not depend upon the number of elements in DD(τx, x, CON, DEC).

3.1 Definition of New Criteria

In a DIS, the criteria support and accuracy are usually applied to defining rules [1, 2, 36]. In a NIS, we define four criteria, i.e., minimum support: minsupp(τx), maximum support: maxsupp(τx), minimum accuracy: minacc(τx) and maximum accuracy: maxacc(τx), in the following:

(1) minsupp(τx) = Minimum_{ϕ ∈ DD(τx, x, CON, DEC)} {support(τx) in ϕ},
(2) maxsupp(τx) = Maximum_{ϕ ∈ DD(τx, x, CON, DEC)} {support(τx) in ϕ},
(3) minacc(τx) = Minimum_{ϕ ∈ DD(τx, x, CON, DEC)} {accuracy(τx) in ϕ},
(4) maxacc(τx) = Maximum_{ϕ ∈ DD(τx, x, CON, DEC)} {accuracy(τx) in ϕ}.

If τx is definite, DD(τx, x, CON, DEC) is equal to the set of all derived DISs. If τx is indefinite, DD(τx, x, CON, DEC) is a subset of all derived DISs. If we employed all derived DISs instead of DD(τx, x, CON, DEC) in the above definition, minsupp(τx) and minacc(τx) would be 0, because there exist derived DISs in which τx does not appear. This property holds trivially for each indefinite τx, so we define minsupp(τx) and minacc(τx) over DD(τx, x, CON, DEC).

Example 2. In Table 2, let us focus on the possible implication τ13: [Color, blue] ⇒ [Size, big] ∈ PI(3, {Color}, {Size}). In DD(τ13, 3, {Color}, {Size}) = {ϕ1, ϕ2, ϕ3, ϕ4}, the following holds:
1/3 = minsupp(τ13) ≤ maxsupp(τ13) = 2/3,
1 = minacc(τ13) ≤ maxacc(τ13) = 1.
As for the second possible implication, τ21: [Color, red] ⇒ [Size, small] ∈ PI(1, {Color}, {Size}), in DD(τ21, 1, {Color}, {Size}) = {ϕ1, ϕ2} the following holds:
1/3 = minsupp(τ21) ≤ maxsupp(τ21) = 1/3,
1/2 = minacc(τ21) ≤ maxacc(τ21) = 1.

3.2 A Simple Method for Calculating Criteria

In order to obtain minsupp(τx), minacc(τx), maxsupp(τx) and maxacc(τx), the simplest method is to examine support(τx) and accuracy(τx) in every ϕ ∈ DD(τx, x, CON, DEC). This method is simple; however, the number of elements in DD(τx, x, CON, DEC) is Π_{A ∈ CON, B ∈ DEC, y ≠ x} |g(y, A)|·|g(y, B)|, which increases in exponential order. Therefore, this simple method is not applicable to NISs with a large number of derived DISs.
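For small NISs the simple method is nevertheless easy to state. The Python sketch below (an illustration, not the authors' implementation) enumerates DD(τx, x, CON, DEC) by brute force and reproduces the values of Example 2; g encodes NIS1 of Table 2.

from itertools import product
from fractions import Fraction

g = {1: {"Color": {"red", "green"}, "Size": {"small"}},
     2: {"Color": {"red", "blue"},  "Size": {"big"}},
     3: {"Color": {"blue"},         "Size": {"big"}}}

def criteria(g, OB, x, CON, DEC, zeta, eta):
    ATR = list(CON) + list(DEC)
    target = tuple(zeta) + tuple(eta)
    # fix the implication from x to tau; enumerate the other objects' values
    choices = [[target] if y == x else
               list(product(*[sorted(g[y][A]) for A in ATR])) for y in OB]
    supports, accuracies = [], []
    for combo in product(*choices):          # one derived DIS per combo
        h = dict(zip(OB, combo))
        match_con = [y for y in OB if h[y][:len(CON)] == tuple(zeta)]
        match_all = [y for y in match_con if h[y][len(CON):] == tuple(eta)]
        supports.append(Fraction(len(match_all), len(OB)))
        accuracies.append(Fraction(len(match_all), len(match_con)))
    return min(supports), max(supports), min(accuracies), max(accuracies)

print(criteria(g, [1, 2, 3], 3, ["Color"], ["Size"], ("blue",), ("big",)))
# (1/3, 2/3, 1, 1), as in Example 2 for tau13
print(criteria(g, [1, 2, 3], 1, ["Color"], ["Size"], ("red",), ("small",)))
# (1/3, 1/3, 1/2, 1), as in Example 2 for tau21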

3.3 Effective Calculation of Minimum Support and Minimum Accuracy

Let us consider how to calculate minsupp(τx) and minacc(τx) for τx: [CON, ζ] ⇒ [DEC, η] from object x. Each object y with descriptors [CON, ζ] or [DEC, η] influences minsupp(τx) and minacc(τx). Table 5 shows all possible implications with descriptors [CON, ζ] or [DEC, η]. For example, in CASE 1 we can obtain just one implication. However, in CASE 2 we can obtain either (C2.1) or (C2.2). Every possible implication depends upon the selection of a value in g(y, DEC). This selection of attribute values specifies some derived DISs from a NIS.

Table 5. Seven cases of possible implications (related to [CON, ζ] ⇒ [DEC, η] from object x; η ≠ η′, ζ ≠ ζ′) in NISs

        Condition : CON     Decision : DEC     Possible Implications
CASE1   g(y, CON) = {ζ}     g(y, DEC) = {η}    [CON, ζ] ⇒ [DEC, η]   (C1.1)
CASE2   g(y, CON) = {ζ}     η ∈ g(y, DEC)      [CON, ζ] ⇒ [DEC, η]   (C2.1)
                                               [CON, ζ] ⇒ [DEC, η′]  (C2.2)
CASE3   g(y, CON) = {ζ}     η ∉ g(y, DEC)      [CON, ζ] ⇒ [DEC, η′]  (C3.1)
CASE4   ζ ∈ g(y, CON)       g(y, DEC) = {η}    [CON, ζ] ⇒ [DEC, η]   (C4.1)
                                               [CON, ζ′] ⇒ [DEC, η]  (C4.2)
CASE5   ζ ∈ g(y, CON)       η ∈ g(y, DEC)      [CON, ζ] ⇒ [DEC, η]   (C5.1)
                                               [CON, ζ] ⇒ [DEC, η′]  (C5.2)
                                               [CON, ζ′] ⇒ [DEC, η]  (C5.3)
                                               [CON, ζ′] ⇒ [DEC, η′] (C5.4)
CASE6   ζ ∈ g(y, CON)       η ∉ g(y, DEC)      [CON, ζ] ⇒ [DEC, η′]  (C6.1)
                                               [CON, ζ′] ⇒ [DEC, η′] (C6.2)
CASE7   ζ ∉ g(y, CON)       Any                [CON, ζ′] ⇒ Decision  (C7.1)

Now, we revise the definitions of the inf and sup information given in the previous section. We handled both inf and sup information for every object x. However, in the subsequent sections it is enough to handle the minimum and maximum sets of an equivalence class defined by a descriptor [ATR, val]. This revision is very simple, and it reduces the manipulation in each calculation.

Definition 1. For each descriptor [ATR, val] (= [{A1, ..., Ak}, (ζ1, ..., ζk)], k ≥ 1) in a NIS, Descinf and Descsup are defined as follows:

(1) Descinf([Ai, ζi]) = {x ∈ OB | PT(x, {Ai}) = {ζi}} = {x ∈ OB | g(x, {Ai}) = {ζi}}.
(2) Descinf([ATR, val]) = Descinf(∧i [Ai, ζi]) = ∩i Descinf([Ai, ζi]).
(3) Descsup([Ai, ζi]) = {x ∈ OB | ζi ∈ PT(x, {Ai})} = {x ∈ OB | ζi ∈ g(x, {Ai})}.
(4) Descsup([ATR, val]) = Descsup(∧i [Ai, ζi]) = ∩i Descsup([Ai, ζi]).
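Definition 1 transcribes directly into code. The following Python sketch (an illustration; g and OB encode NIS1 of Table 2) computes Descinf and Descsup for a descriptor given as a dictionary of attribute-value pairs.

g = {1: {"Color": {"red", "green"}, "Size": {"small"}},
     2: {"Color": {"red", "blue"},  "Size": {"big"}},
     3: {"Color": {"blue"},         "Size": {"big"}}}
OB = [1, 2, 3]

def Descinf(descriptor):
    result = set(OB)
    for A, v in descriptor.items():     # (2): intersect over the attributes
        result &= {x for x in OB if g[x][A] == {v}}        # (1)
    return result

def Descsup(descriptor):
    result = set(OB)
    for A, v in descriptor.items():     # (4): intersect over the attributes
        result &= {x for x in OB if v in g[x][A]}          # (3)
    return result

print(Descinf({"Color": "blue"}))       # {3}
print(Descsup({"Color": "blue"}))       # {2, 3}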

The definition of Descinf requires that every element in this set is definite. Even though the definition of Descsup is the same as sup, we employ the Descsup([ATR, ζ]) notation due to the Descinf([ATR, ζ]) notation. Clearly, Descinf([CON, ζ]) is the set of objects belonging to either CASE 1, 2 or 3 in Table 5, and Descsup([CON, ζ]) is the set of objects belonging to one of CASE 1 through CASE 6. Descsup([CON, ζ]) − Descinf([CON, ζ]) is the set of objects belonging to either CASE 4, 5 or 6.

Proposition 3. Let |X| denote the cardinality of a set X. Under the selections in Table 6, the support value of τx: [CON, ζ] ⇒ [DEC, η] from x is minimum. If τx is definite, namely τx belongs to CASE 1,
minsupp(τx) = |Descinf([CON, ζ]) ∩ Descinf([DEC, η])| / |OB|.
If τx is indefinite, namely τx does not belong to CASE 1,
minsupp(τx) = (|Descinf([CON, ζ]) ∩ Descinf([DEC, η])| + 1) / |OB|.

Proof. The selection of attribute values in Table 6 excludes every [CON, ζ] ⇒ [DEC, η] from any object y ≠ x. In reality, we remove (C2.1), (C4.1) and (C5.1) from Table 5. Therefore, the support value of τx is minimum in a derived DIS with such selections of attribute values. If τx is definite, object x is in the set Descinf([CON, ζ]) ∩ Descinf([DEC, η]). Otherwise, τx belongs to either (C2.1), (C4.1) or (C5.1). Thus, it is necessary to add 1 to the numerator.

Proposition 4. Table 7 is a part of Table 5. Under the selections in Table 7, the accuracy value of τx: [CON, ζ] ⇒ [DEC, η] from x is minimum. Let OUTACC denote [Descsup([CON, ζ]) − Descinf([CON, ζ])] − Descinf([DEC, η]).

Table 6. Selections from Table 5. These selections make the support value of [CON, ζ] ⇒ [DEC, η] minimum.

        Condition : CON     Decision : DEC     Selection
CASE1   g(y, CON) = {ζ}     g(y, DEC) = {η}    [CON, ζ] ⇒ [DEC, η]   (C1.1)
CASE2   g(y, CON) = {ζ}     η ∈ g(y, DEC)      [CON, ζ] ⇒ [DEC, η′]  (C2.2)
CASE3   g(y, CON) = {ζ}     η ∉ g(y, DEC)      [CON, ζ] ⇒ [DEC, η′]  (C3.1)
CASE4   ζ ∈ g(y, CON)       g(y, DEC) = {η}    [CON, ζ′] ⇒ [DEC, η]  (C4.2)
CASE5   ζ ∈ g(y, CON)       η ∈ g(y, DEC)      [CON, ζ] ⇒ [DEC, η′]  (C5.2)
                                               [CON, ζ′] ⇒ [DEC, η]  (C5.3)
                                               [CON, ζ′] ⇒ [DEC, η′] (C5.4)
CASE6   ζ ∈ g(y, CON)       η ∉ g(y, DEC)      [CON, ζ] ⇒ [DEC, η′]  (C6.1)
                                               [CON, ζ′] ⇒ [DEC, η′] (C6.2)
CASE7   ζ ∉ g(y, CON)       Any                [CON, ζ′] ⇒ Decision  (C7.1)


Table 7. Selections from Table 5. These selections make the accuracy value of [CON,ζ] ⇒ [DEC,η] minimum.

        Condition: CON      Decision: DEC     Selection
CASE1   g(y,CON)={ζ}        g(y,DEC)={η}      [CON,ζ] ⇒ [DEC,η] (C1.1)
CASE2   g(y,CON)={ζ}        η ∈ g(y,DEC)      [CON,ζ] ⇒ [DEC,η′] (C2.2)
CASE3   g(y,CON)={ζ}        η ∉ g(y,DEC)      [CON,ζ] ⇒ [DEC,η′] (C3.1)
CASE4   ζ ∈ g(y,CON)        g(y,DEC)={η}      [CON,ζ′] ⇒ [DEC,η] (C4.2)
CASE5   ζ ∈ g(y,CON)        η ∈ g(y,DEC)      [CON,ζ] ⇒ [DEC,η′] (C5.2)
CASE6   ζ ∈ g(y,CON)        η ∉ g(y,DEC)      [CON,ζ] ⇒ [DEC,η′] (C6.1)
CASE7   ζ ∉ g(y,CON)        Any               [CON,ζ′] ⇒ Decision (C7.1)

If τ^x is definite,
minacc(τ^x) = |Descinf([CON,ζ]) ∩ Descinf([DEC,η])| / (|Descinf([CON,ζ])| + |OUTACC|).
If τ^x is indefinite,
minacc(τ^x) = (|Descinf([CON,ζ]) ∩ Descinf([DEC,η])| + 1) / (|Descinf([CON,ζ]) ∪ {x}| + |OUTACC − {x}|).

Proof. Since m/n ≤ (m+k)/(n+k) (0 ≤ m ≤ n, n ≠ 0, k > 0) holds, we exclude every [CON,ζ] ⇒ [DEC,η] from any object y ≠ x, and we select possible implications [CON,ζ] ⇒ [DEC,η′], which increase the denominator. The accuracy value of τ^x is therefore minimum in a derived DIS with such selections of attribute values. The set OUTACC consists of objects in either CASE 5 or CASE 6. As for CASE 4 and CASE 7, the condition part is not [CON,ζ], so we can omit such implications in calculating minacc(τ^x). If τ^x is definite, the numerator is |Descinf([CON,ζ]) ∩ Descinf([DEC,η])| and the denominator is |Descinf([CON,ζ])| + |OUTACC|. If τ^x is indefinite, τ^x belongs to (C2.1), (C4.1) or (C5.1); the denominator is |Descinf([CON,ζ]) ∪ {x}| + |OUTACC − {x}| in every case, and the numerator is |Descinf([CON,ζ]) ∩ Descinf([DEC,η])| + 1.

Theorem 5. For a NIS, let us consider a possible implication τ^x : [CON,ζ] ⇒ [DEC,η] ∈ PI(x,CON,DEC). Let SUPPmin = {ϕ | ϕ is a derived DIS from the NIS, and support(τ^x) is minimum in ϕ}. Then, accuracy(τ^x) is minimum in some ϕ ∈ SUPPmin.

Proof. Table 7 is a special case of Table 6. Namely, in CASE 5 of Table 6, any of (C5.2), (C5.3) or (C5.4) may hold, and in CASE 6 any of (C6.1) or (C6.2) may hold; every such selection yields the same minimum support value. In Table 7, (C5.2) in CASE 5 and (C6.1) in CASE 6 are selected.

Theorem 5 assures that there exists a derived DIS in which both support(τ^x) and accuracy(τ^x) are minimum. DISworst denotes such a derived DIS, and we name DISworst a derived DIS with the worst condition for τ^x. This is an important property for Problem 3 in the subsequent section.


Table 8. Selections from Table 5. These selections make the support and accuracy values of [CON,ζ] ⇒ [DEC,η] maximum.

        Condition: CON      Decision: DEC     Selection
CASE1   g(y,CON)={ζ}        g(y,DEC)={η}      [CON,ζ] ⇒ [DEC,η] (C1.1)
CASE2   g(y,CON)={ζ}        η ∈ g(y,DEC)      [CON,ζ] ⇒ [DEC,η] (C2.1)
CASE3   g(y,CON)={ζ}        η ∉ g(y,DEC)      [CON,ζ] ⇒ [DEC,η′] (C3.1)
CASE4   ζ ∈ g(y,CON)        g(y,DEC)={η}      [CON,ζ] ⇒ [DEC,η] (C4.1)
CASE5   ζ ∈ g(y,CON)        η ∈ g(y,DEC)      [CON,ζ] ⇒ [DEC,η] (C5.1)
CASE6   ζ ∈ g(y,CON)        η ∉ g(y,DEC)      [CON,ζ′] ⇒ [DEC,η′] (C6.2)
CASE7   ζ ∉ g(y,CON)        Any               [CON,ζ′] ⇒ Decision (C7.1)

3.4 Effective Calculation of Maximum Support and Maximum Accuracy

In this subsection, we show an effective method to calculate maxsupp(τ^x) and maxacc(τ^x) based on Descinf and Descsup. The following can be proved in the same manner as Propositions 3 and 4 and Theorem 5. A derived DIS defined by Table 8 makes both support and accuracy maximum.

Proposition 6. For τ^x : [CON,ζ] ⇒ [DEC,η] from x, the following holds:
maxsupp(τ^x) = |Descsup([CON,ζ]) ∩ Descsup([DEC,η])| / |OB|.

Proposition 7. For τ^x : [CON,ζ] ⇒ [DEC,η] from x, let INACC denote [Descsup([CON,ζ]) − Descinf([CON,ζ])] ∩ Descsup([DEC,η]).
If τ^x is definite,
maxacc(τ^x) = (|Descinf([CON,ζ]) ∩ Descsup([DEC,η])| + |INACC|) / (|Descinf([CON,ζ])| + |INACC|).
If τ^x is indefinite,
maxacc(τ^x) = (|Descinf([CON,ζ]) ∩ Descsup([DEC,η]) − {x}| + |INACC − {x}| + 1) / (|Descinf([CON,ζ]) ∪ {x}| + |INACC − {x}|).

Theorem 8. For a NIS, let us consider a possible implication τ^x : [CON,ζ] ⇒ [DEC,η] ∈ PI(x,CON,DEC). Let SUPPmax = {ϕ | ϕ is a derived DIS from the NIS, and support(τ^x) is maximum in ϕ}. Then, accuracy(τ^x) is maximum in some ϕ ∈ SUPPmax.

Theorem 8 assures that there exists a derived DIS in which both support(τ^x) and accuracy(τ^x) are maximum. DISbest denotes such a derived DIS, and we name DISbest a derived DIS with the best condition for τ^x. This is also an important property for Problem 4 in the subsequent section.
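To make Propositions 3, 4, 6 and 7 concrete, here is a hedged C sketch of our own (not the authors' implementation) that evaluates all four criterion values for the indefinite possible implication τ7^4 : [S,3] ⇒ [T,1] of NIS2, using the Descinf/Descsup sets of Table 10; the bitmask encoding and the names inf_con, outacc, etc. are assumptions of the sketch. The printed values agree with Example 3 (minacc = 0.5) and with the execution trace shown there (maxacc = 0.667).

/* Hedged sketch: minsupp/minacc (Prop. 3, 4) and maxsupp/maxacc (Prop. 6, 7)
   from Descinf/Descsup bitmasks; bit x-1 encodes object x.                  */
#include <stdio.h>

typedef unsigned int set_t;

static int card(set_t s) { int n; for (n = 0; s; s >>= 1) n += (int)(s & 1u); return n; }

int main(void) {
    const int OB = 5;                       /* |OB| of NIS2 (Table 9)        */
    set_t inf_con = (1u<<2)|(1u<<4);                  /* Descinf([S,3])={3,5}     */
    set_t sup_con = (1u<<1)|(1u<<2)|(1u<<3)|(1u<<4);  /* Descsup([S,3])={2,3,4,5} */
    set_t inf_dec = (1u<<2);                          /* Descinf([T,1])={3}       */
    set_t sup_dec = (1u<<2)|(1u<<3);                  /* Descsup([T,1])={3,4}     */
    int   x  = 4;                           /* tau_7^x from object 4         */
    set_t xb = 1u << (x - 1);

    int definite = ((inf_con & xb) && (inf_dec & xb));
    set_t outacc = (sup_con & ~inf_con) & ~inf_dec;   /* OUTACC of Prop. 4   */
    set_t inacc  = (sup_con & ~inf_con) &  sup_dec;   /* INACC of Prop. 7    */

    double minsupp, minacc, maxsupp, maxacc;
    maxsupp = (double)card(sup_con & sup_dec) / OB;   /* Prop. 6             */
    if (definite) {
        minsupp = (double)card(inf_con & inf_dec) / OB;
        minacc  = (double)card(inf_con & inf_dec)
                / (card(inf_con) + card(outacc));
        maxacc  = (double)(card(inf_con & sup_dec) + card(inacc))
                / (card(inf_con) + card(inacc));
    } else {
        minsupp = (double)(card(inf_con & inf_dec) + 1) / OB;
        minacc  = (double)(card(inf_con & inf_dec) + 1)
                / (card(inf_con | xb) + card(outacc & ~xb));
        maxacc  = (double)(card((inf_con & sup_dec) & ~xb) + card(inacc & ~xb) + 1)
                / (card(inf_con | xb) + card(inacc & ~xb));
    }
    printf("minsupp=%.3f minacc=%.3f maxsupp=%.3f maxacc=%.3f\n",
           minsupp, minacc, maxsupp, maxacc);  /* 0.400 0.500 0.400 0.667    */
    return 0;
}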

4 Rule Generation by New Criteria in Non-deterministic Information Systems

This section applies Propositions 3, 4, 6 and 7 and Theorems 5 and 8 to rule generation in NISs.

4.1 Rules by the Criteria in Deterministic Information Systems

In DISs, rule generation by the criteria is often defined as follows.

Problem 2. In a table or a DIS, find every implication τ such that support(τ) ≥ α and accuracy(τ) ≥ β for given α and β (0 < α, β ≤ 1).

For solving this problem, the Apriori algorithm was proposed by Agrawal [1,2]. In this framework, association rules in transaction data are obtained; the application of the large item set is the key point of the Apriori algorithm. Problem 2 has also been considered in [22,36,38].

4.2 Rules by New Criteria and Two Strategies in Non-deterministic Information Systems

Now, we extend Problem 2 to Problem 3 and Problem 4 as follows.

Problem 3 (Rule Generation by Lower Approximation Strategy). For a NIS, let CON ⊆ AT and DEC ⊆ AT be the condition attributes and the decision attributes, respectively. Find every possible implication τ^x : [CON,ζ] ⇒ [DEC,η] satisfying minsupp(τ^x) ≥ α and minacc(τ^x) ≥ β for given α and β (0 < α, β ≤ 1).

Problem 4 (Rule Generation by Upper Approximation Strategy). For a NIS, let CON ⊆ AT and DEC ⊆ AT be the condition attributes and the decision attributes, respectively. Find every possible implication τ^x : [CON,ζ] ⇒ [DEC,η] satisfying maxsupp(τ^x) ≥ α and maxacc(τ^x) ≥ β for given α and β (0 < α, β ≤ 1).

It is necessary to remark that both minsupp(τ^x) and minacc(τ^x) are defined over DD(τ^x,x,CON,DEC). For a definite τ^x, DD(τ^x,x,CON,DEC) is equal to the set of all derived DISs. For an indefinite τ^x, however, DD(τ^x,x,CON,DEC) is not equal to the set of all derived DISs, and over all derived DISs minsupp(τ^x)=0 and minacc(τ^x)=0 may hold. This may be an important issue in the lower approximation strategy; in this paper, however, we employ the set DD(τ^x,x,CON,DEC) instead of all derived DISs. As for the upper approximation strategy, maxsupp(τ^x) and maxacc(τ^x) over DD(τ^x,x,CON,DEC) are the same as maxsupp(τ^x) and maxacc(τ^x) over all derived DISs. We employed the terms Min-Max and Max-Max strategies in [31,32]; following rough-set concepts, we rename these terms lower approximation strategy and upper approximation strategy, respectively. Proposition 9 below clarifies the relation between certain rules, possible rules and rules by the new criteria.


Proposition 9. For a possible implication τ^x, the following hold.
(1) τ^x is a certain rule in Section 2.1 if and only if τ^x is definite and minacc(τ^x)=1.
(2) τ^x is a possible rule in Section 2.1 if and only if maxacc(τ^x)=1.

Certain and possible rules are defined by the concept of consistency alone, so there is no condition on support in their definition. In certain rule generation, we often obtain a possible implication whose minacc(τ^x)=1 but whose minsupp(τ^x) is quite small. Propositions 10, 11 and 12 clarify the properties of rule generation.

Proposition 10. For given α and β (0 < α, β ≤ 1), let Rule(α,β,LA) denote the set of rules defined by the lower approximation strategy with α and β, and let Rule(α,β,UA) denote the set of rules defined by the upper approximation strategy with α and β. Then, Rule(α,β,LA) ⊆ Rule(α,β,UA) holds.

Proposition 11. The following, which are related to a possible implication τ^x : [CON,ζ] ⇒ [DEC,η], are equivalent.
(1) τ^x is obtained according to the lower approximation strategy, namely minsupp(τ^x) ≥ α and minacc(τ^x) ≥ β.
(2) support(τ^x) ≥ α and accuracy(τ^x) ≥ β in each ϕ ∈ DD(τ^x,x,CON,DEC).
(3) In a derived DISworst defined by Table 7, support(τ^x) ≥ α and accuracy(τ^x) ≥ β hold.

Proof. For each ϕ ∈ DD(τ^x,x,CON,DEC), support(τ^x) ≥ minsupp(τ^x) and accuracy(τ^x) ≥ minacc(τ^x) hold, therefore (1) and (2) are equivalent. According to Theorem 5, a derived DISworst (depending upon τ^x) defined by Table 7 assigns the minimum values to both support(τ^x) and accuracy(τ^x). Thus, (1) and (3) are equivalent.

Proposition 12. The following, which are related to a possible implication τ^x : [CON,ζ] ⇒ [DEC,η], are equivalent.
(1) τ^x is obtained according to the upper approximation strategy, namely maxsupp(τ^x) ≥ α and maxacc(τ^x) ≥ β.
(2) support(τ^x) ≥ α and accuracy(τ^x) ≥ β in some ϕ ∈ DD(τ^x,x,CON,DEC).
(3) In a derived DISbest defined by Table 8, support(τ^x) ≥ α and accuracy(τ^x) ≥ β hold.

Proof. For each ϕ ∈ DD(τ^x,x,CON,DEC), support(τ^x) ≤ maxsupp(τ^x) and accuracy(τ^x) ≤ maxacc(τ^x) hold. According to Theorem 8, a derived DISbest (depending upon τ^x) defined by Table 8 assigns the maximum values to both support(τ^x) and accuracy(τ^x). In this DISbest, maxsupp(τ^x)=support(τ^x) and maxacc(τ^x)=accuracy(τ^x) hold. Thus, (1), (2) and (3) are equivalent.

Due to Propositions 10, 11 and 12, Rule(α,β,LA) defines a set of possible implications in a DISworst, and Rule(α,β,UA) defines a set of possible implications in a DISbest. This implies that we do not have to examine each derived DIS in


DD(τ^x,x,CON,DEC); we only have to examine a DISworst for the lower approximation strategy and a DISbest for the upper approximation strategy.

4.3 Extended Apriori Algorithms for Two Strategies and a Simulation

This subsection proposes two extended Apriori algorithms, Algorithm 1 and Algorithm 2. In DISs, Descinf([A,ζ]) = Descsup([A,ζ]) holds, whereas in NISs only Descinf([A,ζ]) ⊆ Descsup([A,ζ]) holds. The Apriori algorithm handles transaction data and employs a sequential search for obtaining large item sets [1,2]. Here, we employ the manipulation of Descinf and Descsup instead of the sequential search. By this manipulation we obtain the minimum and maximum sets of an equivalence class, and we calculate minsupp(τ^x) and minacc(τ^x) by using Descinf and Descsup. The rest is almost the same as the Apriori algorithm. An example simulating Algorithm 1 is given in Example 3 below.

Algorithm 1. Extended Apriori Algorithm for Lower Approximation Strategy
Input: A NIS, a decision attribute DEC, threshold values α and β.
Output: Every rule defined by the lower approximation strategy.
  for (every A ∈ AT) do
    Generate Descinf([A,ζ]) and Descsup([A,ζ]);
  end
  For the condition minsupp(τ^x) = |SET|/|OB| ≥ α, obtain the number NUM of elements in SET;
  Generate a set CANDIDATE(1), which consists of descriptors [A,ζA] satisfying either (CASE A) or (CASE B) below:
    (CASE A) |Descinf([A,ζA])| ≥ NUM,
    (CASE B) |Descinf([A,ζA])| = NUM − 1 and Descsup([A,ζA]) − Descinf([A,ζA]) ≠ {};
  Generate a set CANDIDATE(2) according to the following procedures:
    (Proc 2-1) For every [A,ζA] and [DEC,ζDEC] (A ≠ DEC) in CANDIDATE(1), generate a new descriptor [{A,DEC},(ζA,ζDEC)];
    (Proc 2-2) Examine conditions (CASE A) and (CASE B) for each [{A,DEC},(ζA,ζDEC)];
      If either (CASE A) or (CASE B) holds and minacc(τ) ≥ β, display τ : [A,ζA] ⇒ [DEC,ζDEC] as a rule;
      If either (CASE A) or (CASE B) holds and minacc(τ) < β, add this descriptor to CANDIDATE(2);
  Assign 2 to n;
  while CANDIDATE(n) ≠ {} do
    Generate CANDIDATE(n+1) according to the following procedures:
      (Proc 3-1) For DESC1 and DESC2 ([DEC,ζDEC] ∈ DESC1 ∩ DESC2) in CANDIDATE(n), generate a new descriptor by the conjunction DESC1 ∧ DESC2;
      (Proc 3-2) Apply the same procedure as (Proc 2-2);
    Assign n+1 to n;
  end


Algorithm 2. Extended Apriori Algorithm for Upper Approximation Strategy
Input: A NIS, a decision attribute DEC, threshold values α and β.
Output: Every rule defined by the upper approximation strategy.
Algorithm 2 is Algorithm 1 with the following two revisions:
1. (CASE A) and (CASE B) in Algorithm 1 are replaced with (CASE C):
   (CASE C) |Descsup([A,ζA])| ≥ NUM.
2. minacc(τ) in Algorithm 1 is replaced with maxacc(τ).
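The candidate test at the heart of Algorithm 1 is small enough to sketch directly. The following hedged C fragment (ours, not the nis_apriori source; the arrays and names are assumptions of the sketch) applies (CASE A)/(CASE B) with NUM = 2 to descriptor sets given as bitmask pairs, reproducing the twelve-descriptor CAN(1) of the lower-strategy execution trace in Example 3.

/* Hedged sketch of the CANDIDATE(1) test of Algorithm 1 (NUM = 2), applied
   to the Descinf/Descsup pairs of Table 10; bit x-1 encodes object x.
   (Algorithm 2 would replace the test by CASE C: card(sup) >= NUM.)        */
#include <stdio.h>

typedef unsigned int set_t;

static int card(set_t s) { int n; for (n = 0; s; s >>= 1) n += (int)(s & 1u); return n; }

struct desc { const char *name; set_t inf, sup; };

int main(void) {
    const int NUM = 2;            /* from minsupp = |SET|/5 >= 0.3          */
    struct desc d[15] = {         /* the fifteen descriptors of Table 10    */
        {"[P,1]",0x08,0x0C},{"[P,2]",0x02,0x06},{"[P,3]",0x11,0x11},
        {"[Q,1]",0x10,0x11},{"[Q,2]",0x04,0x06},{"[Q,3]",0x08,0x0B},
        {"[R,1]",0x00,0x16},{"[R,2]",0x00,0x14},{"[R,3]",0x09,0x0B},
        {"[S,1]",0x00,0x02},{"[S,2]",0x01,0x09},{"[S,3]",0x14,0x1E},
        {"[T,1]",0x04,0x0C},{"[T,2]",0x02,0x0A},{"[T,3]",0x11,0x19},
    };
    printf("CAN(1)=");
    for (int i = 0; i < 15; i++) {
        int caseA = card(d[i].inf) >= NUM;
        int caseB = card(d[i].inf) == NUM - 1 && (d[i].sup & ~d[i].inf) != 0;
        if (caseA || caseB) printf("%s,", d[i].name);   /* 12 descriptors   */
    }
    printf("\n");
    return 0;
}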

Example 3. Let us consider Descinf and Descsup obtained from NIS2 in Table 9, and let us consider Problem 3. We set α=0.3, β=0.8, condition attributes CON ⊆ {P,Q,R,S} and decision attribute DEC={T}. Since |OB|=5 and minsupp(τ)=|SET|/5 ≥ 0.3, |SET| ≥ 2 must hold. According to Table 10, we generate Table 11, whose entries satisfy either (CASE A) or (CASE B) below:
(CASE A) |Descinf([A,ζA] ∧ [T,η])| ≥ 2 (A ∈ {P,Q,R,S}).
(CASE B) |Descinf([A,ζA] ∧ [T,η])| = 1 and Descsup([A,ζA] ∧ [T,η]) − Descinf([A,ζA] ∧ [T,η]) ≠ {} (A ∈ {P,Q,R,S}).

Table 9. A table of NIS2

OB   P       Q       R       S       T
1    {3}     {1,3}   {3}     {2}     {3}
2    {2}     {2,3}   {1,3}   {1,3}   {2}
3    {1,2}   {2}     {1,2}   {3}     {1}
4    {1}     {3}     {3}     {2,3}   {1,2,3}
5    {3}     {1}     {1,2}   {3}     {3}

Table 10. Descinf and Descsup information in Table 9

          [P,1]   [P,2]   [P,3]   [Q,1]   [Q,2]   [Q,3]     [R,1]     [R,2]   [R,3]
Descinf   {4}     {2}     {1,5}   {5}     {3}     {4}       {}        {}      {1,4}
Descsup   {3,4}   {2,3}   {1,5}   {1,5}   {2,3}   {1,2,4}   {2,3,5}   {3,5}   {1,2,4}

          [S,1]   [S,2]   [S,3]       [T,1]   [T,2]   [T,3]
Descinf   {}      {1}     {3,5}       {3}     {2}     {1,5}
Descsup   {2}     {1,4}   {2,3,4,5}   {3,4}   {2,4}   {1,4,5}

Table 11. Conjunctions of descriptors satisfying either (CASE A) or (CASE B) in Table 10

          [P,3]∧[T,3]   [Q,1]∧[T,3]   [R,3]∧[T,3]   [S,2]∧[T,3]   [S,3]∧[T,1]   [S,3]∧[T,3]
Descinf   {1,5}         {5}           {1}           {1}           {3}           {5}
Descsup   {1,5}         {1,5}         {1,4}         {1,4}         {3,4}         {4,5}


The conjunction [P,3]∧[T,3] in Table 11 corresponds to the implications τ3^1, τ3^5 : [P,3] ⇒ [T,3]. Because Descsup([P,3]∧[T,3])={1,5} holds, τ3^1 and τ3^5 come from objects 1 and 5, respectively. Since 1,5 ∈ Descinf([P,3]∧[T,3]) holds, minsupp(τ3^1) = minsupp(τ3^5) = |{1,5}|/5 = 0.4. Similarly, the conjunction [Q,1]∧[T,3] corresponds to the implications τ4^1, τ4^5 : [Q,1] ⇒ [T,3]. Since 5 ∈ Descinf([Q,1]∧[T,3]) holds, minsupp(τ4^5) = |{5}|/5 = 0.2. On the other hand, 1 ∈ Descsup([Q,1]∧[T,3]) − Descinf([Q,1]∧[T,3]) holds, so minsupp(τ4^1) = (|{5}|+1)/5 = 0.4 for object 1. In this way, we obtain the candidates of rules satisfying minsupp(τ^x) ≥ 0.3 as follows:
τ3^1, τ3^5 : [P,3] ⇒ [T,3], τ4^1 : [Q,1] ⇒ [T,3], τ5^4 : [R,3] ⇒ [T,3], τ6^4 : [S,2] ⇒ [T,3], τ7^4 : [S,3] ⇒ [T,1], τ8^4 : [S,3] ⇒ [T,3].
For these candidates, we examine each minacc(τ^x) according to Proposition 4. For τ3^1 and τ3^5, Descsup([P,3])={1,5}, Descinf([P,3])={1,5}, Descinf([P,3]∧[T,3])={1,5} and OUTACC=[{1,5}−{1,5}]−{1,5}={}. Since 1,5 ∈ Descinf([P,3]∧[T,3]) holds, minacc(τ3^1) = minacc(τ3^5) = |{1,5}|/(|{1,5}|+|{}|) = 1. For τ7^4 : [S,3] ⇒ [T,1], Descsup([S,3])={2,3,4,5}, Descinf([S,3])={3,5}, Descinf([S,3]∧[T,1])={3}, Descsup([S,3]∧[T,1])={3,4} and OUTACC=[{2,3,4,5}−{3,5}]−{3}={2,4}, so minacc(τ7^4) = (|{3}|+1)/(|{3,5}∪{4}|+|{2,4}−{4}|) = 0.5. In this way, we obtain three rules satisfying minsupp(τ^x) ≥ 0.3 and minacc(τ^x) ≥ 0.8:
τ3^1, τ3^5 : [P,3] ⇒ [T,3] (minsupp=0.4, minacc=1),
τ4^1 : [Q,1] ⇒ [T,3] (minsupp=0.4, minacc=1),
τ6^4 : [S,2] ⇒ [T,3] (minsupp=0.4, minacc=1).
No possible implication including [R,3]∧[T,3] satisfies minacc(τ^x) ≥ 0.8, and the same holds for [S,3]∧[T,1] and [S,3]∧[T,3].

The following shows a real execution on Example 3.

% ./nis_apriori
version 1.2.8
File Name:'nis2.dat'
========================================
Lower Approximation Strategy
========================================
CAN(1)=[P,1],[P,2],[P,3],[Q,1],[Q,2],[Q,3],[R,3],[S,2],[S,3],[T,1],[T,2],[T,3] (12)
CAN(2)=[S,3][T,1](0.250,0.500),[P,3][T,3](1.000,1.000),[Q,1][T,3](1.000,1.000),
       [R,3][T,3](0.333,0.667),[S,2][T,3](0.500,1.000),[S,3][T,3](0.250,0.500) (6)
========== OBTAINED RULE ==========
[P,3]=>[T,3](minsupp=0.400,minsupp=0.400,minacc=1.000,minacc=1.000) (from 1,5) (from )
[Q,1]=>[T,3](minsupp=0.200,minsupp=0.400,minacc=1.000,minacc=1.000) (from ) (from 1)


[S,2]=>[T,3](minsupp=0.200,minsupp=0.400,minacc=0.500,minacc=1.000) (from ) (from 4)
EXEC TIME=0.0000000000(sec)
========================================
Upper Approximation Strategy
========================================
CAN(1)=[P,1],[P,2],[P,3],[Q,1],[Q,2],[Q,3],[R,3],[S,2],[S,3],[T,1],[T,2],[T,3] (12)
CAN(2)=[S,3][T,1](0.667,0.667),[P,3][T,3](1.000,1.000),[Q,1][T,3](1.000,1.000),
       [R,3][T,3](1.000,1.000),[S,2][T,3](1.000,1.000),[S,3][T,3](0.667,0.667) (6)
========== OBTAINED RULE ==========
[P,3]=>[T,3](maxsupp=0.400,maxsupp=0.400,maxacc=1.000,maxacc=1.000) (from 1,5) (from )
[Q,1]=>[T,3](maxsupp=0.400,maxsupp=0.400,maxacc=1.000,maxacc=1.000) (from 5) (from 1)
[R,3]=>[T,3](maxsupp=0.400,maxsupp=0.400,maxacc=1.000,maxacc=1.000) (from 1) (from 4)
[S,2]=>[T,3](maxsupp=0.400,maxsupp=0.400,maxacc=1.000,maxacc=1.000) (from 1) (from 4)
EXEC TIME=0.0000000000(sec)

According to this execution, we obtain
Rule(0.3, 0.8, LA) = {[P,3] ⇒ [T,3], [Q,1] ⇒ [T,3], [S,2] ⇒ [T,3]},
Rule(0.3, 0.8, UA) = {[P,3] ⇒ [T,3], [Q,1] ⇒ [T,3], [S,2] ⇒ [T,3], [R,3] ⇒ [T,3]}.

The possible implication [R,3] ⇒ [T,3] ∈ Rule(0.3, 0.8, UA) − Rule(0.3, 0.8, LA) depends upon the information incompleteness: it cannot be obtained by the lower approximation strategy, but it can be obtained by the upper approximation strategy.

4.4 Main Program for Lower Approximation Strategy

The program nis_apriori is implemented on a Windows PC with a Pentium 4 (3.40 GHz), and it consists of about 1700 lines of C. nis_apriori mainly consists of two parts, i.e., a part for the lower approximation strategy and a part for the upper approximation strategy. For the lower approximation strategy, a function GenRuleByLA() (Generate Rules By LA strategy) is coded:

GenRuleByLA(table.obj, table.att, table.kosuval, table.con_num,
            table.dec_num, table.con, table.dec, thresh, minacc_thresh);

Within GenRuleByLA(), a function GenCandByLA() is called, which generates the candidate set CANDIDATE(n).


GenCandByLA(desc, cand, conj_num_max, ob, at, desc_num, c_num,
            d_num, co, de, thr, minacc_thr);

At the same time, minsupp(τ) and minacc(τ) are calculated according to Propositions 3 and 4. For the upper approximation strategy, similar functions are implemented.

5 Computational Issues in Algorithm 1

This section focuses on the computational complexity of Algorithm 1; for Algorithm 2, the result is almost the same.

5.1 A Simple Method for Lower Approximation Strategy

Generally, a possible implication τ^x depends upon the number of derived DISs, i.e., Π_{x∈OB, A∈AT} |g(x,A)|, and upon the condition attributes CON (CON ∈ 2^{AT−DEC}). Furthermore, minsupp(τ^x) and minacc(τ^x) depend on DD(τ^x,x,CON,DEC), whose number of elements is Π_{A∈CON, B∈DEC, y≠x} |g(y,A)||g(y,B)|. Therefore, it is practically impossible to employ a simple method that sequentially picks up every possible implication τ^x and sequentially examines minsupp(τ^x) and minacc(τ^x).

5.2 Complexity of the Extended Apriori Algorithm for Lower Approximation Strategy

In order to solve this computational issue, we focus on descriptors [A,ζ] (A ∈ AT, ζ ∈ VAL_A). The number of all descriptors is usually very small. Furthermore, Propositions 3 and 4 give us methods to calculate minsupp(τ^x) and minacc(τ^x) that do not depend upon the number of elements in DD(τ^x,x,CON,DEC). We now analyze each step of Algorithm 1.

(STEP 1) (Generation of Descinf, Descsup and CANDIDATE(1))
We first prepare two arrays Descinf_{A,val}[] and Descsup_{A,val}[] for each val ∈ VAL_A (A ∈ AT). For each object x ∈ OB, we apply (1) and (2) below:
(1) If g(x,A) = {val}, add x to Descinf_{A,val}[] and Descsup_{A,val}[].
(2) If g(x,A) ≠ {val} and val ∈ g(x,A), add x to Descsup_{A,val}[].
Then, all descriptors satisfying either (CASE A) or (CASE B) in Algorithm 1 are added to CANDIDATE(1). This procedure is applied for each A ∈ AT, and its complexity depends upon |OB| × |AT|.

(STEP 2) (Generation of CANDIDATE(2))
For each [A,val_A], [DEC,val_DEC] ∈ CANDIDATE(1), we produce [A,val_A] ∧ [DEC,val_DEC], and generate
Descinf([A,val_A] ∧ [DEC,val_DEC]) = Descinf([A,val_A]) ∩ Descinf([DEC,val_DEC]),


Descsup([A,val_A] ∧ [DEC,val_DEC]) = Descsup([A,val_A]) ∩ Descsup([DEC,val_DEC]).
If [A,val_A] ∧ [DEC,val_DEC] satisfies either (CASE A) or (CASE B) in Algorithm 1, this descriptor is added to CANDIDATE(2). Furthermore, we examine minacc([A,val_A] ∧ [DEC,val_DEC]) in (Proc 2-2) according to Proposition 4. The complexity of (STEP 2) depends upon the number of combined descriptors [A,val_A] ∧ [DEC,val_DEC].

(STEP 3) (Repetition of STEP 2 on CANDIDATE(n))
For each DESC1 and DESC2 in CANDIDATE(n), we generate a conjunction DESC1 ∧ DESC2, and we apply the same procedure as (STEP 2) to such conjunctions. In the execution, the two sets Descinf([CON,ζ]) and Descsup([CON,ζ]) are stored in arrays, and we can obtain Descinf([CON,ζ] ∧ [DEC,η]) by the intersection operation Descinf([CON,ζ]) ∩ Descinf([DEC,η]); the same property holds for Descsup([CON,ζ] ∧ [DEC,η]). Therefore, it is easy to obtain CANDIDATE(n+1) from CANDIDATE(n). This is a merit of employing equivalence classes, and it is characteristic of rough set theory. In the Apriori algorithm, such sets Descinf([CON,ζ]) and Descsup([CON,ζ]) are not employed, and a total search of the database is executed for generating every combination of descriptors. It will be necessary to consider the merits and demerits of handling the two sets Descinf([CON,ζ]) and Descsup([CON,ζ]) in future research. The Apriori algorithm employs an equivalence class for each descriptor and handles only deterministic information. Algorithm 1, on the other hand, employs the minimum and maximum sets of an equivalence class, i.e., Descinf and Descsup, and handles non-deterministic information as well as deterministic information. Algorithm 1 thus takes twice the steps of the Apriori algorithm for manipulating equivalence classes; the rest is almost the same as the Apriori algorithm, so the complexity of Algorithm 1 will be almost the same as that of the Apriori algorithm.

6 Concluding Remarks and Future Work

We proposed rule generation based on the lower approximation strategy and the upper approximation strategy in NISs. We employed Descinf, Descsup and the concept of the large item set in the Apriori algorithm, and proposed two extended Apriori algorithms in NISs. These extended algorithms do not depend upon the number of derived DISs, and their complexity is almost the same as that of the Apriori algorithm. We implemented the extended algorithms and applied them to some data sets. With these utility programs, we can explicitly handle not only deterministic information but also non-deterministic information.

Now, we briefly show the application to the Hepatitis data in the UCI Machine Learning Repository [37]. This data set consists of 155 objects and 20 attributes. There are 167 missing values, which


are about 5.4% of the total data. The number of objects without missing values is 80, namely about half of the data. With usual analysis tools, it may be difficult to handle all 155 objects. We employ a list for expressing non-deterministic information, for example, [red,green] and [red,blue] for {red,green} and {red,blue} in Table 2. This syntax is so simple that we can easily generate NIS data by using Excel. As for the Hepatitis data, we loaded the data into Excel and replaced each missing value (the ? symbol) with a list of all possible attribute values. For some numerical attributes, discretized attribute values are also given in the data set; for example, in the 15th attribute BILIRUBIN, the attribute values are discretized into six values, i.e., 0.39, 0.80, 1.20, 2.00, 3.00 and 4.00. We employed these discretized values in some attributes. The following is a part of the real revised Hepatitis data in Excel. There are 78732 (= 2 × 6 × 9^4) derived DISs for these six objects, and it seems hard to handle all derived DISs for all 155 objects sequentially.

155  //Number of objects
20   //Number of Attributes
2 30 2 1 2 2 2 2 1 2 2 2 2 2 0.8 80 13 3.8 [10,20,30,40,50,60,70,80,90] 1
2 50 1 1 2 1 2 2 1 2 2 2 2 2 0.8 120 13 3.8 [10,20,30,40,50,60,70,80,90] 1
2 70 1 2 2 1 2 2 2 2 2 2 2 2 0.8 80 13 3.8 [10,20,30,40,50,60,70,80,90] 1
2 30 1 [1,2] 1 2 2 2 2 2 2 2 2 2 0.8 33 13 3.8 80 1
2 30 1 2 2 2 2 2 2 2 2 2 2 2 0.8 [33,80,120,160,200,250] 200 3.8 [10,20,30,40,50,60,70,80,90] 1
2 30 1 2 2 2 2 2 2 2 2 2 2 2 0.8 80 13 3.8 70 1
: : :

The decision attribute is the first attribute CLASS (1: die, 2: live), and we fixed α=0.25 and β=0.85. Let us show the results for two cases.

(CASE 1) Rules Obtained from the 80 Objects without Missing Values
It is possible to apply our programs to standard DISs. For the 80 objects, execution took 0.015 sec, and 14 rules were generated, including the following:

[AGE,30]=>[CLASS,live] (support=0.287, accuracy=0.958),
[ASCITES,yes]=>[CLASS,live] (support=0.775, accuracy=0.912),
[ALBUMIN,4.5]=>[CLASS,live] (support=0.287, accuracy=0.958).

(CASE 2) Rules Obtained from All 155 Objects with 167 Missing Values
By the two strategies, 22 rules and 25 rules were generated, respectively; execution took 0.064 sec. Let us show every rule that is obtained by the upper approximation strategy but not by the lower approximation strategy, namely every rule in the boundary set Rule(0.25, 0.85, UA) − Rule(0.25, 0.85, LA). There are three such rules:

[ALK PHOSPHATE,80]=>[CLASS,live]
(minsupp=0.25, minacc=0.841, maxsupp=0.348, maxacc=0.857)
[ANOREXIA,yes]&[SGOT,13]=>[CLASS,live]
(minsupp=0.25, minacc=0.829, maxsupp=0.381, maxacc=0.855)


[SPLEEN PALPABLE,yes]&[SGOT,13]=>[CLASS,live]
(minsupp=0.25, minacc=0.848, maxsupp=0.368, maxacc=0.877)

In the 17th attribute SGOT, there are four missing values, and the above two rules with the descriptor [SGOT,13] depend upon these four missing values. These rules show us the difference between the lower approximation strategy and the upper approximation strategy. We are also focusing on the difference between rule generation in DISs and in NISs: suppose a NIS, remove every object with non-deterministic information from it, and obtain a DIS; we are interested in rules that are not obtained from the DIS but are obtained from the NIS. In experiments including the Hepatitis data and the Mammographic data in the UCI repository, we verified that our utility programs work well even when there is a huge number of derived DISs. However, we have not yet analyzed the meaning of the obtained rules, because the main issue of this paper is to establish the framework and to implement the algorithms. From now on, we will apply our utility programs to real data with missing values, and we hope to obtain meaningful rules from NISs. Our research is directed not toward rule generation from data with a large number of objects, but toward rule generation from incomplete data with a large number of derived DISs. This paper is a revised and extended version of the papers [31,32].

Acknowledgment. The authors are grateful to the anonymous referees for their useful comments. This work is partly supported by the Grant-in-Aid for Scientific Research (C) (No.16500176, No.18500214), Japan Society for the Promotion of Science.

References

1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: Proceedings of the 20th Very Large Data Base, pp. 487–499 (1994)
2. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.: Fast Discovery of Association Rules. In: Advances in Knowledge Discovery and Data Mining, pp. 307–328. AAAI/MIT Press (1996)
3. Demri, S., Orlowska, E.: Incomplete Information: Structure, Inference, Complexity. Monographs in Theoretical Computer Science. Springer, Heidelberg (2002)
4. Grzymala-Busse, J.: On the Unknown Attribute Values in Learning from Examples. In: Raś, Z.W., Zemankova, M. (eds.) ISMIS 1991. LNCS (LNAI), vol. 542, pp. 368–377. Springer, Heidelberg (1991)
5. Grzymala-Busse, J.: A New Version of the Rule Induction System LERS. Fundamenta Informaticae 31, 27–39 (1997)
6. Grzymala-Busse, J., Werbrouck, P.: On the Best Search Method in the LEM1 and LEM2 Algorithms. Incomplete Information: Rough Set Analysis 13, 75–91 (1998)
7. Grzymala-Busse, J.: Data with Missing Attribute Values: Generalization of Indiscernibility Relation and Rule Induction. Transactions on Rough Sets 1, 78–95 (2004)


8. Grzymala-Busse, J.: Incomplete data and generalization of indiscernibility relation, definability, and approximations. In: Ślęzak, D., Wang, G., Szczuka, M.S., Düntsch, I., Yao, Y. (eds.) RSFDGrC 2005. LNCS (LNAI), vol. 3641, pp. 244–253. Springer, Heidelberg (2005)
9. Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough Sets: a tutorial. In: Pal, S., Skowron, A. (eds.) Rough Fuzzy Hybridization, pp. 3–98. Springer, Heidelberg (1999)
10. Kryszkiewicz, M.: Rules in Incomplete Information Systems. Information Sciences 113, 271–292 (1999)
11. Kryszkiewicz, M., Rybinski, H.: Computation of Reducts of Composed Information Systems. Fundamenta Informaticae 27, 183–195 (1996)
12. Kryszkiewicz, M.: Maintenance of Reducts in the Variable Precision Rough Sets Model. ICS Research Report 31/94, Warsaw University of Technology (1994)
13. Lipski, W.: On Semantic Issues Connected with Incomplete Information Data Base. ACM Trans. DBS 4, 269–296 (1979)
14. Lipski, W.: On Databases with Incomplete Information. Journal of the ACM 28, 41–70 (1981)
15. Nakamura, A., Tsumoto, S., Tanaka, H., Kobayashi, S.: Rough Set Theory and Its Applications. Journal of Japanese Society for AI 11, 209–215 (1996)
16. Nakamura, A.: A Rough Logic based on Incomplete Information and Its Application. International Journal of Approximate Reasoning 15, 367–378 (1996)
17. Nakata, M., Sakai, H.: Rough-set-based Approaches to Data Containing Incomplete Information: Possibility-based Cases. In: Nakamatsu, K., Abe, J. (eds.) Advances in Logic Based Intelligent Systems. Frontiers in Artificial Intelligence and Applications, vol. 132, pp. 234–241. IOS Press, Amsterdam (2005)
18. Nakata, M., Sakai, H.: Lower and Upper Approximations in Data Tables Containing Possibilistic Information. Transactions on Rough Sets 7, 170–189 (2007)
19. Orlowska, E.: What You Always Wanted to Know about Rough Sets. In: Incomplete Information: Rough Set Analysis, vol. 13, pp. 1–20. Physica-Verlag (1998)
20. Orlowska, E., Pawlak, Z.: Representation of Nondeterministic Information. Theoretical Computer Science 29, 27–39 (1984)
21. Pawlak, Z.: Rough Sets. Kluwer Academic Publishers, Dordrecht (1991)
22. Pawlak, Z.: Some Issues on Rough Sets. Transactions on Rough Sets 1, 1–58 (2004)
23. Polkowski, L., Skowron, A. (eds.): Rough Sets in Knowledge Discovery 1. Studies in Fuzziness and Soft Computing, vol. 18. Physica-Verlag (1998)
24. Polkowski, L., Skowron, A. (eds.): Rough Sets in Knowledge Discovery 2. Studies in Fuzziness and Soft Computing, vol. 19. Physica-Verlag (1998)
25. Rough Set Software. Bulletin of Int'l. Rough Set Society 2, 15–46 (1998)
26. Sakai, H.: Effective Procedures for Handling Possible Equivalence Relations in Non-deterministic Information Systems. Fundamenta Informaticae 48, 343–362 (2001)
27. Sakai, H.: Effective Procedures for Data Dependencies in Information Systems. In: Rough Set Theory and Granular Computing. Studies in Fuzziness and Soft Computing, vol. 125, pp. 167–176. Springer, Heidelberg (2003)
28. Sakai, H., Okuma, A.: Basic Algorithms and Tools for Rough Non-deterministic Information Analysis. In: Peters, J.F., Skowron, A., Grzymala-Busse, J.W., Kostek, B., Świniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 209–231. Springer, Heidelberg (2004)
29. Sakai, H., Nakata, M.: An Application of Discernibility Functions to Generating Minimal Rules in Non-deterministic Information Systems. Journal of Advanced Computational Intelligence and Intelligent Informatics 10, 695–702 (2006)


30. Sakai, H.: On a Rough Sets Based Data Mining Tool in Prolog: An Overview. In: Umeda, M., Wolf, A., Bartenstein, O., Geske, U., Seipel, D., Takata, O. (eds.) INAP 2005. LNCS (LNAI), vol. 4369, pp. 48–65. Springer, Heidelberg (2006)
31. Sakai, H., Nakata, M.: On Possible Rules and Apriori Algorithm in Non-deterministic Information Systems. In: Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H.S., Słowiński, R. (eds.) RSCTC 2006. LNCS (LNAI), vol. 4259, pp. 264–273. Springer, Heidelberg (2006)
32. Sakai, H., Ishibashi, R., Koba, K., Nakata, M.: On Possible Rules and Apriori Algorithm in Non-deterministic Information Systems 2. In: An, A., Stefanowski, J., Ramanna, S., Butz, C.J., Pedrycz, W., Wang, G. (eds.) RSFDGrC 2007. LNCS (LNAI), vol. 4482, pp. 280–288. Springer, Heidelberg (2007)
33. Skowron, A., Rauszer, C.: The Discernibility Matrices and Functions in Information Systems. In: Intelligent Decision Support - Handbook of Advances and Applications of the Rough Set Theory, pp. 331–362. Kluwer Academic Publishers, Dordrecht (1992)
34. Stefanowski, J., Tsoukias, A.: On the Extension of Rough Sets under Incomplete Information. In: Zhong, N., Skowron, A., Ohsuga, S. (eds.) RSFDGrC 1999. LNCS (LNAI), vol. 1711, pp. 73–81. Springer, Heidelberg (1999)
35. Stefanowski, J., Tsoukias, A.: Incomplete Information Tables and Rough Classification. Computational Intelligence 7, 212–219 (2001)
36. Tsumoto, S.: Knowledge Discovery in Clinical Databases and Evaluation of Discovered Knowledge in Outpatient Clinic. Information Sciences 124, 125–137 (2000)
37. UCI Machine Learning Repository, http://mlearn.ics.uci.edu/MLRepository.html
38. Ziarko, W.: Variable Precision Rough Set Model. Journal of Computer and System Sciences 46, 39–59 (1993)

On Extension of Dependency and Consistency Degrees of Two Knowledges Represented by Covering

P. Samanta^1 and Mihir K. Chakraborty^2

^1 Department of Mathematics, Katwa College, Katwa, Burdwan, West Bengal, India
   pulak [email protected]
^2 Department of Pure Mathematics, University of Calcutta, 35 Ballygunge Circular Road, Kolkata-700019, India
   [email protected]

Abstract. Knowledge of an agent depends on the granulation procedure adopted by the agent. The knowledge granules may form a partition of the universe or a covering. In this paper, dependency degrees of two knowledges are considered in both cases. Measures of consistency and inconsistency of knowledges are also discussed. This paper is a continuation of our earlier work [3].

Keywords: Rough sets, elementary category (partition, covering of knowledge), dependency degree, consistency degree.

1 Introduction

Novotný and Pawlak defined a dependency degree between two knowledges given by two partitions of a set [6,7,8,9]. Knowledge is given by indiscernibility relations on the universe, and the indiscernibility relation is taken to be an equivalence relation. In many situations, however, the indiscernibility relation fails to be transitive, and the clusters or granules of knowledge overlap. This observation gives rise to the study of rough set theory based on coverings instead of partitions [2,10,11,13,14,15,16]. In [3] the present authors introduced the notions of consistency degree and inconsistency degree of two knowledges given by partitions of the universe, using the dependency degree defined by Novotný and Pawlak. In this paper some more investigations in that direction are carried out, but the main emphasis is laid on defining the dependency degree of two knowledges when they are given by coverings in general, not by partitions only. Now, in covering-based approximation systems, lower and upper approximations of a set are defined in at least five different ways [10]. All of these approximations reduce to the standard Pawlakian approximations when the underlying indiscernibility relation turns out to be an equivalence. In this paper we use four of them, one of which is the classical one. As a consequence, four different dependency degrees arise.


It is interesting to observe that the properties of partial dependency developed in [3,6,9] hold good in the general case of covering-based approximation systems. The main results on covering are placed in Section 3. Based upon this generalized notion of dependency, the consistency degree and inconsistency degree between two such knowledges are defined.

2 Dependency of Knowledge Based on Partition

We accept the basic philosophy that a knowledge of an agent about a universe is her ability to categorize the objects inhabiting it, through information received from various sources or through perception, in the form of attribute-value data. In this section we start with the indiscernibility relation caused by an attribute-value system. So, knowledge is defined as follows.

Definition 1 (Knowledge). A knowledge is a pair <U,P>, where U is a nonempty finite set and P is an equivalence relation on U. P will also denote the partition generated by the equivalence relation.

Definition 2 (Finer and Coarser Knowledge). A knowledge P is said to be finer than the knowledge Q if every block of the partition P is included in some block of the partition Q. In such a case Q is said to be coarser than P. We shall write this as P ≼ Q.

We recall a few notions due to Pawlak (and others), e.g., the P-positive region of Q and, based upon it, the dependency degree of knowledges.

Definition 3. Let P and Q be two equivalence relations over U. The P-positive region of Q, denoted by PosP(Q), is defined by PosP(Q) = ∪_{X∈U/Q} P̲X, where P̲X = ∪{Y ∈ U/P : Y ⊆ X} is the P-lower approximation of X.

Definition 4 (Dependency degree). Knowledge Q depends in a degree k (0 ≤ k ≤ 1) on knowledge P, written as P ⇒k Q, iff k = Card PosP(Q) / Card U, where Card denotes the cardinality of a set. If k = 1, we say that Q totally depends on P and we write P ⇒ Q; if k = 0 we say that Q is totally independent of P.

Viewing from the angle of multi-valuedness, one can say that the sentence 'The knowledge Q depends on the knowledge P', instead of being only 'true' (1) or 'false' (0), may receive other intermediate truth-values, the value k being determined as above. This approach justifies the term 'partial dependency' as well. In Propositions 1, 2 and 3 we enlist some elementary, often trivial, properties of the dependency degree, some of them newly exercised but most of them present in [6,9]. Some of these properties, e.g., Proposition 3(v), will constitute the basis of definitions and results of the next section.


Proposition 1
(i) [x]_{P1∩P2} = [x]_{P1} ∩ [x]_{P2},
(ii) If P ⇒ Q and R ≼ P then R ⇒ Q,
(iii) If P ⇒ Q and Q ≼ R then P ⇒ R,
(iv) If P ⇒ Q and Q ⇒ R then P ⇒ R,
(v) If P ⇒ R and Q ⇒ R then P ∩ Q ⇒ R,
(vi) If P ⇒ R ∩ Q then P ⇒ R and P ⇒ Q,
(vii) If P ⇒ Q and Q ∩ R ⇒ T then P ∩ R ⇒ T,
(viii) If P ⇒ Q and R ⇒ T then P ∩ R ⇒ Q ∩ T.

Proposition 2
(i) If P′ ≼ P then P̲′X ⊇ P̲X,
(ii) If P ⇒a Q and P′ ≼ P then P′ ⇒b Q where b ≥ a,
(iii) If P ⇒a Q and P ≼ P′ then P′ ⇒b Q where b ≤ a,
(iv) If P ⇒a Q and Q′ ≼ Q then P ⇒b Q′ where b ≤ a,
(v) If P ⇒a Q and Q ≼ Q′ then P ⇒b Q′ where a ≤ b.

Proposition 3
(i) If R ⇒a P and Q ⇒b P then R ∩ Q ⇒c P for some c ≥ Max(a,b),
(ii) If R ∩ P ⇒a Q then R ⇒b Q and P ⇒c Q for some b, c ≤ a,
(iii) If R ⇒a Q and R ⇒b P then R ⇒c Q ∩ P for some c ≤ Min(a,b),
(iv) If R ⇒a Q ∩ P then R ⇒b Q and R ⇒c P for some b, c ≥ a,
(v) If R ⇒a P and P ⇒b Q then R ⇒c Q for some c ≥ a + b − 1.

3 Dependency of Knowledge Based on Covering

A covering C of a set U is a collection of subsets {Ci} of U such that ∪Ci = U. It is often important to define a knowledge in terms of a covering, not only a partition, which is a special case of a covering. Given a covering C, one can define a binary relation RC on U which is a tolerance relation (reflexive, symmetric): xRCy holds iff x, y ∈ Ci for some i, where the sets {Ci} constitute the covering.

Definition 5. A tolerance space is a structure S = <U,R>, where U is a nonempty set of objects and R is a reflexive and symmetric binary relation defined on U. A tolerance class of a tolerance space <U,R> is a maximal subset of U such that any two elements of it are mutually related.

In the context of knowledge, when the indiscernibility relation R is only reflexive and symmetric (and not necessarily transitive), the approximation system <U,R> is a tolerance space. In such a case, the granules of the knowledge may be formed in many different ways. Since the granules are not necessarily disjoint, it is worthwhile to talk about granulation around an object x ∈ U. The most natural granule at x is the set {y : xRy}, generally denoted by Rx. But any element Ci of the covering C can also be taken as a granule around x where x ∈ Ci. There may be others. So, depending upon various ways


of perceiving a granule, various definitions of lower approximations (and hence upper approximations as their duals) of a set may be given. We shall consider them below. Any covering gives rise to a unique partition; by P we denote the partition corresponding to the covering C.

Definition 6 [1,2]. A covering is said to be a genuine covering if Ci ⊆ Cj implies Ci = Cj. For any genuine covering C it is immediate that the elements of C are all tolerance classes of the relation RC.

Definition 7. Let two finite coverings C1 and C2 be given by C1 = {C1, C2, ..., Cn} and C2 = {C′1, C′2, ..., C′m}. Then C1 ∩ C2 is the collection {Ci ∩ C′j : i = 1, 2, ..., n; j = 1, 2, ..., m}.

Example 1. Let C1 = {{1,2,3}, {2,3,4}, {5,6,7}, {6,7,8}} and C2 = {{1,2,3,4}, {3,4,5,6}, {5,6,7,8}}. Then C1 ∩ C2 = {{1,2,3}, {3}, {2,3,4}, {3,4}, {5,6}, {5,6,7}, {6}, {6,7,8}}.

Definition 8. We shall say that a covering C1 is finer than a covering C2, written as C1 ≼ C2, iff ∀Cj ∈ C2 ∃ Cj1, Cj2, ..., Cjn ∈ C1 such that Cj = Cj1 ∪ Cj2 ∪ ... ∪ Cjn, i.e., every element of C2 may be expressed as the union of some elements of C1.

Let R be a tolerance relation on U. Then the family C(R) of all tolerance classes of R is a covering of U. The pair (U,C) will be called a generalized approximation space, where U is a set and C is a covering of U; we assume U to be finite in the sequel. Let (U,C) be a generalized approximation space and C = {C1, C2, ..., Cn}. The indiscernibility neighborhood of an element x ∈ U is the set N^C_x = ∪{Ci : x ∈ Ci}. In fact N^C_x is the same as R^C_x. For any x ∈ U, the set P^C_x = {y ∈ U : ∀Ci (x ∈ Ci ⇔ y ∈ Ci)} will be called the kernel of x. Let P be the family of all kernels of (U,C), i.e., P = {P^C_x : x ∈ U}. Clearly P is a partition of U.

Definition 9 [10]. Let X be a subset of U. The lower and upper approximations are defined as follows:

C̲1(X) = {x : N^C_x ⊆ X},
C̄1(X) = ∪{Ci : Ci ∩ X ≠ φ},
C̲2(X) = ∪{N^C_x : N^C_x ⊆ X},
C̄2(X) = {z : ∀x (z ∈ N^C_x ⇒ N^C_x ∩ X ≠ φ)},
C̲3(X) = ∪{Ci ∈ C : Ci ⊆ X},
C̄3(X) = {y : ∀Ci (y ∈ Ci ⇒ Ci ∩ X ≠ φ)},
C̲4(X) = ∪{P^C_x : P^C_x ⊆ X},
C̄4(X) = ∪{P^C_x : P^C_x ∩ X ≠ φ}.
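As a concrete check of Definition 9, the following hedged C sketch (our own illustration; the helpers mk and show and the names lo1..lo4 are assumptions) computes the four lower approximations for the covering C and the set A of Example 2 below, and its output coincides with the values stated there.

/* Hedged sketch: the four lower approximations of Definition 9 for the
   covering of Example 2; object sets are bitmasks, bit x-1 = object x.   */
#include <stdio.h>

#define U 10
#define NC 6

typedef unsigned int set_t;

static set_t mk(const int *v, int n) { set_t s = 0; while (n--) s |= 1u << (v[n]-1); return s; }
static void show(set_t s) { printf("{"); for (int x = 1; x <= U; x++) if (s & (1u<<(x-1))) printf("%d ", x); printf("} "); }

int main(void) {
    int c1[]={1,2}, c2[]={1,2,3}, c3[]={4,6}, c4[]={6,7,9}, c5[]={8,9}, c6[]={5,10};
    set_t C[NC] = { mk(c1,2), mk(c2,3), mk(c3,2), mk(c4,3), mk(c5,2), mk(c6,2) };
    int a[]={1,2,4,6,9,10};
    set_t X = mk(a,6), N[U+1], P[U+1];

    for (int x = 1; x <= U; x++) {                 /* neighborhoods N_x   */
        N[x] = 0;
        for (int i = 0; i < NC; i++) if (C[i] & (1u<<(x-1))) N[x] |= C[i];
    }
    for (int x = 1; x <= U; x++) {                 /* kernels P_x         */
        P[x] = 0;
        for (int y = 1; y <= U; y++) {
            int same = 1;
            for (int i = 0; i < NC; i++)
                if (!!(C[i] & (1u<<(x-1))) != !!(C[i] & (1u<<(y-1)))) same = 0;
            if (same) P[x] |= 1u << (y-1);
        }
    }
    set_t lo1 = 0, lo2 = 0, lo3 = 0, lo4 = 0;
    for (int x = 1; x <= U; x++) {
        if ((N[x] & ~X) == 0) { lo1 |= 1u<<(x-1); lo2 |= N[x]; }   /* C_1, C_2 */
        if ((P[x] & ~X) == 0)   lo4 |= P[x];                       /* C_4      */
    }
    for (int i = 0; i < NC; i++) if ((C[i] & ~X) == 0) lo3 |= C[i]; /* C_3     */

    show(lo1); show(lo2); show(lo3); show(lo4); printf("\n");
    /* expected (Example 2): {4} {4 6} {1 2 4 6} {1 2 4 6 9}               */
    return 0;
}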

Proposition 4. If C1 ≼ C2 then P1 ≼ P2, where P1, P2 are the partitions corresponding to C1 and C2, respectively.

Proposition 5. If C1 ≼ C2 then, for any X ⊆ U and each i = 1, 2, 3, 4, the lower approximation C̲i(X) taken with respect to C1 includes the one taken with respect to C2, and the upper approximation C̄i(X) taken with respect to C1 is included in the one taken with respect to C2.

Example 2. Let U = {1,2,3,4,5,6,7,8,9,10} and C = {{1,2}, {1,2,3}, {4,6}, {6,7,9}, {8,9}, {5,10}}. Let A = {1,2,4,6,9,10}. Then C̲1(A) = {4}, C̲2(A) = {4,6}, C̲3(A) = {1,2,4,6}, C̲4(A) = {1,2,4,6,9}. Let B = {3,9,10}. Then C̄1(B) = {1,2,3,5,6,7,8,9,10}, C̄2(B) = {1,2,3,5,7,8,9,10}, C̄3(B) = {3,5,7,8,9,10}, C̄4(B) = {3,9}.

Proposition 6. Propositions 1, 2 and 3, except 3(v), of Section 2 also hold in this generalized case.

Definition 10. We define the C1-positive region of C2 as PosC1(C2) = ∪_{X∈C2} C̲1(X), the lower approximation being taken with respect to C1.

Definition 11 (Dependency degree with respect to covering). C1 depends in a degree k (0 ≤ k ≤ 1) on C2, written as C1 ⇒k C2, iff k = |PosC1(C2)| / |U|, where |X| denotes the cardinality of the set X. We shall also write k = Dep(C1,C2). If k = 1, C1 is said to be totally dependent on C2 and we write C1 ⇒ C2; if k = 0 we say that C2 is totally independent of C1.

Since we have four kinds of lower approximations, we have four different C1-positive regions of C2, viz. Pos^i_{C1}(C2) with respect to C̲i(X) for i = 1, 2, 3, 4, and hence four different kinds of dependencies, viz. Dep^i(C1,C2) for i = 1, 2, 3, 4. Clearly,

∪_{X∈C2} {x : N^{C1}_x ⊆ X} ⊆ ∪_{X∈C2} ∪{N^{C1}_x : N^{C1}_x ⊆ X} ⊆ ∪_{X∈C2} ∪{Ci ∈ C1 : Ci ⊆ X} ⊆ ∪_{X∈C2} ∪{P^{C1}_x : P^{C1}_x ⊆ X}.

This implies Pos^1_{C1}(C2) ⊆ Pos^2_{C1}(C2) ⊆ Pos^3_{C1}(C2) ⊆ Pos^4_{C1}(C2), and hence

|Pos^1_{C1}(C2)|/|U| ≤ |Pos^2_{C1}(C2)|/|U| ≤ |Pos^3_{C1}(C2)|/|U| ≤ |Pos^4_{C1}(C2)|/|U|.

So, the following proposition is obtained.

C1 C1 C1 ≤ ≤ ≤ . |U| |U| |U| |U| So, the following proposition is obtained. C1

Proposition 7. Dep1 (C1 , C2 ) ≤ Dep2 (C1 , C2 ) ≤ Dep3 (C1 , C2 ) ≤ Dep4 (C1 , C2 ). Example 3. Consider C1 = {{1, 2, 3}, {2, 3, 4}, {5, 6, 7}, {6, 7, 8}} and C2 = {{1, 2, 3, 4}, {3, 4, 5, 6}, {5, 6, 7, 8}}. Then Dep1 (C1 , C2 ) = 1, Dep2 (C1 , C2 ) = 1, Dep3 (C1 , C2 ) = 1, Dep4 (C1 , C2 ) = 1. Also, Dep1 (C2 , C1 ) = 0, Dep2 (C2 , C1 ) = 0, Dep3 (C2 , C1 ) = 0, Dep4 (C2 , C1 ) = 1. Example 4. Let us Consider C1 = {{1, 2, 3}, {3, 4, 8}, {6, 7, 8}, {8, 9}} and C2 = {{1, 2, 3, 4}, {5, 8}, {6, 7}, {8, 9}}. Then Dep1 (C1 , C2 ) = 13 , Dep2 (C1 , C2 ) = 13 , Dep3 (C1 , C2 ) = 59 , Dep4 (C1 , C2 ) = 1. Also, Dep1 (C2 , C1 ) = 13 , Dep2 (C2 , C1 ) = 13 , Dep3 (C2 , C1 ) = 49 , Dep4 (C2 , C1 ) = 59 .


Observation
(i) C1 ⇒ C2 iff PosC1(C2) = U iff ∪_{X∈C2} C̲1(X) = U iff ∪_{X∈C2} {x : N^{C1}_x ⊆ X} = U iff ∀x ∈ U, N^{C1}_x ⊆ X for some X ∈ C2.
Also C1 ⇒0 C2 iff PosC1(C2) = φ iff ∪_{X∈C2} C̲1(X) = φ iff ∪_{X∈C2} {x : N^{C1}_x ⊆ X} = φ iff ∀x ∈ U there does not exist any X ∈ C2 such that N^{C1}_x ⊆ X.

(ii) C1 ⇒ C2 iff PosC1(C2) = U iff ∪_{X∈C2} C̲2(X) = U iff ∪_{X∈C2} ∪{N^{C1}_x : N^{C1}_x ⊆ X} = U iff ∀x ∈ U, N^{C1}_x ⊆ X for some X ∈ C2.
Also C1 ⇒0 C2 iff PosC1(C2) = φ iff ∪_{X∈C2} ∪{N^{C1}_x : N^{C1}_x ⊆ X} = φ iff ∀x ∈ U there does not exist any X ∈ C2 such that N^{C1}_x ⊆ X.

(iii) C1 ⇒ C2 iff PosC1(C2) = U iff ∪_{X∈C2} C̲3(X) = U iff ∪_{X∈C2} ∪{Ci ∈ C1 : Ci ⊆ X} = U iff each Ci (∈ C1) satisfies Ci ⊆ X for some X ∈ C2.
Also C1 ⇒0 C2 iff PosC1(C2) = φ iff ∪_{X∈C2} ∪{Ci ∈ C1 : Ci ⊆ X} = φ iff for any Ci ∈ C1 there does not exist any X ∈ C2 such that Ci ⊆ X.

(iv) C1 ⇒ C2 iff PosC1(C2) = U iff ∪_{X∈C2} C̲4(X) = U iff ∪_{X∈C2} ∪{P^{C1}_x : P^{C1}_x ⊆ X} = U iff for all x, P^{C1}_x ⊆ X for some X ∈ C2.
Also C1 ⇒0 C2 iff PosC1(C2) = φ iff ∪_{X∈C2} ∪{P^{C1}_x : P^{C1}_x ⊆ X} = φ iff for all x there does not exist any X ∈ C2 such that P^{C1}_x ⊆ X.

The sets C1 and C2 may be considered as two groups of classifying properties of the objects of the universe U. Properties belonging to one group may have overlapping extensions. Now, if C1 ⇒ C2 holds, i.e., the dependency degree of C2 on C1 is 1, then the following is its characterization in the first two cases (i) and (ii): given any element x of the universe, the set of all objects sharing at least one of the properties of x is included in the extension of at least one of the classifying properties of the second group. If, on the other hand, C1 ⇒0 C2 holds, it follows that ∀x ∈ U, N^{C1}_x is not a subset of X for any X ∈ C2; that means for any element x there is at least one element y which shares at least one of the classificatory properties of the first group with x and does not have any of the classificatory properties belonging to the second group.

In the third case (iii), C1 ⇒ C2 iff ∀Ci ∈ C1 ∃C′j ∈ C2 such that Ci ⊆ C′j, and C1 ⇒0 C2 iff for every Ci ∈ C1 there does not exist any C′j ∈ C2 such that Ci ⊆ C′j. The first condition means that the extension of any of the classificatory properties of the first group is a subset of the extension of at least one of the classificatory properties of the second. The second one means that no classificatory property belonging to the first group implies any of the classificatory properties of the second group. In the fourth case (iv), if x and y are equivalent with respect to the classificatory properties in the group C1, then x and y share at least one of the classificatory properties with respect to C2, and vice versa.

4 Consistency of Knowledge Based on Partition

Two knowledges P and Q on U, where P and Q are partitions, may be considered fully consistent if and only if U/P = U/Q, that is, P and Q generate exactly the same granules. This is equivalent to P ⇒ Q and Q ⇒ P. So, a natural measure of the consistency degree of P and Q might be the truth-value of the non-classical sentence "Q depends on P ∧ P depends on Q", computed by a suitable conjunction operator applied to the truth-values of the two component sentences. Thus a binary predicate Cons may be created such that Cons(P,Q) stands for the above conjunctive sentence. A triangular norm (or t-norm), used in the fuzzy literature and in many-valued logic, is a potential candidate for computing ∧. A t-norm is a mapping t : [0,1] × [0,1] → [0,1] satisfying (i) t(a,1) = a, (ii) b ≤ d


implies t(a,b) ≤ t(a,d), (iii) t(a,b) = t(b,a), (iv) t(a,t(b,d)) = t(t(a,b),d). It follows that t(a,0) = 0. Typical examples of t-norms are min(a,b) (Gödel), max(0, a+b−1) (Łukasiewicz) and a × b (Godo, Hájek). These conjunction operators are used extensively and are in some sense the basic t-norms [4]. With 1 − x as negation operator, the De Morgan duals of t-norms, called s-norms, are obtained as s(a,b) = 1 − t(1−a, 1−b); values of disjunctive sentences are computed by s-norms. There is, however, a difficulty in using a t-norm in the present context. We would like the following assumptions to hold.

Assumption 1. Knowledges P, Q shall be fully consistent iff they generate the same partition.

Assumption 2. Knowledges P, Q shall be fully inconsistent iff no granule generated by one is contained in any granule generated by the other.

The translation of the above demands into mathematical terms is that the conjunction operator ∗ should fulfill the conditions ∗(a,b) = 1 iff a = 1 and b = 1, and ∗(a,b) = 0 iff a = 0 and b = 0. No t-norm satisfies the second. So we define the consistency degree as follows.

Definition 12. Let P and Q be two knowledges such that P ⇒a Q and Q ⇒b P. The consistency degree between the two knowledges, denoted by Cons(P,Q), is given by Cons(P,Q) = (a + b + nab)/(n + 2), where n is a non-negative integer.

Definition 13. Two knowledges P and Q are said to be fully consistent if Cons(P,Q) = 1, and fully inconsistent if Cons(P,Q) = 0.

Example 5.
(i) Let U = {1,2,3,4,5,6,7,8} and let the partitions be P = {{1,3,5}, {2,4,6}, {7,8}} and Q = {{1,2,7}, {3,4,8}, {5,6}}. Then P ⇒0 Q and Q ⇒0 P, so Cons(P,Q) = 0.
(ii) Let U = {1,2,3,4,5,6,7,8} and partitions P = {{1,3,5}, {2,4,6}, {7,8}} and Q = {{1,3,5}, {2,4,6}, {7,8}}. Then P ⇒1 Q and Q ⇒1 P, so Cons(P,Q) = 1.
(iii) Let U = {1,2,3,4,5,6,7,8} and partitions P = {{1,4,5}, {2,8}, {6,7}, {3}} and Q = {{1,3,5}, {2,4,7,8}, {6}}. Then P ⇒_{3/8} Q and Q ⇒_{1/8} P, so Cons(P,Q) = (3/8 + 1/8 + n·(3/8)·(1/8))/(n+2), where n is a non-negative integer.

Although any choice of n satisfies the initial requirements, some special values of it may be of particular significance, e.g., n = 0, n = Card(U), and n as defined in Proposition 9 below. We shall discuss two of these values later. 'n' shall


be referred to as the 'consistency constant', or simply 'constant', in the sequel. The constant is a kind of constraint on the consistency measure, as shown in the next proposition.

Proposition 8. For two knowledges P and Q, if n1 ≤ n2 then Cons1(P,Q) ≥ Cons2(P,Q), where Consi(P,Q) is the consistency degree when ni is the constant taken.

Proof. Let P ⇒a Q and Q ⇒b P. Since n1 ≤ n2, we have n2 − n1 ≥ 0. Now Cons1(P,Q) = (a+b+n1·ab)/(n1+2) and Cons2(P,Q) = (a+b+n2·ab)/(n2+2), and
(a+b+n1·ab)/(n1+2) − (a+b+n2·ab)/(n2+2) = (n2−n1)(a+b−2ab)/((n1+2)(n2+2)) ≥ 0
iff (n2−n1)(a+b−2ab) ≥ 0 iff a+b−2ab ≥ 0 iff a+b ≥ 2ab. Since (a+b)/2 ≥ √(ab) ≥ ab, a+b ≥ 2ab holds. This shows that Cons1(P,Q) ≥ Cons2(P,Q).

Proposition 9. If n is the number of elements a ∈ U such that [a]P ⊄ [a]Q and [a]Q ⊄ [a]P, then
n = Card U − [Card ∪_{X∈U/Q} P̲X + Card ∪_{X∈U/P} Q̲X − Card(∪_{X∈U/Q} P̲X ∩ ∪_{X∈U/P} Q̲X)].

Proof. The number of elements a ∈ U such that [a]P ⊆ [a]Q equals Card ∪_{X∈U/Q} P̲X ...(i). The number of elements a ∈ U such that [a]Q ⊆ [a]P equals Card ∪_{X∈U/P} Q̲X ...(ii). So the number of elements common to (i) and (ii) equals Card(∪_{X∈U/Q} P̲X ∩ ∪_{X∈U/P} Q̲X) ...(iii). From (i), (ii) and (iii) the proposition follows.

One can observe that the definition of a consistent object in [5,7] may be generalized relative to any pair (P,Q) of partitions of the universe, not only the partitions caused by the pair (CON,DEC), where CON is the set of condition attributes and DEC the set of decision attributes. With this extension of the notion, n is the count of all those objects a such that a is not consistent relative to both pairs (P,Q) and (Q,P). In the following examples n is taken to be this number.

Example 6.
(i) Let U = {1,2,3,4,5,6,7,8} and partitions P = {{1,3,5}, {2,4,6}, {7,8}} and Q = {{1,2,7}, {3,4,8}, {5,6}}. Then P ⇒0 Q and Q ⇒0 P. Here n = 8, so Cons(P,Q) = (0+0+8·0·0)/(8+2) = 0.
(ii) Let U = {1,2,3,4,5,6,7,8} and partitions P = {{1,3,5}, {2,4,6}, {7,8}} and Q = {{1,3,5}, {2,4,6}, {7,8}}. Then P ⇒1 Q and Q ⇒1 P. Here n = 0, so Cons(P,Q) = (1+1+0·1·1)/(0+2) = 1.
(iii) Let U = {1,2,3,4,5,6,7,8} and partitions P = {{1,4,5}, {2,8}, {6,7}, {3}} and Q = {{1,3,5}, {2,4,7,8}, {6}}. Then P ⇒_{3/8} Q and Q ⇒_{1/8} P. Here n = 4, so Cons(P,Q) = (3/8 + 1/8 + 4·(3/8)·(1/8))/(4+2) = 11/96.
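The quantities a, b, n and Cons(P,Q) of Example 6(iii) can be computed directly from the two partitions. The following hedged C sketch (our illustration; the block-label encoding and variable names are assumptions) reproduces a = 3/8, b = 1/8, n = 4 and Cons(P,Q) = 11/96.

/* Hedged sketch: dependency degrees a, b, the constant n (objects
   inconsistent w.r.t. both (P,Q) and (Q,P)), and Cons of Definition 12,
   checked on Example 6(iii). Partitions are stored as block labels.     */
#include <stdio.h>

#define U 8

int main(void) {
    /* block label of each object 1..8 under P and Q (Example 6(iii)) */
    int P[U+1] = {0, 1,2,4,1,1,3,3,2};   /* {1,4,5},{2,8},{6,7},{3}   */
    int Q[U+1] = {0, 1,2,1,2,1,3,2,2};   /* {1,3,5},{2,4,7,8},{6}     */

    int posPQ = 0, posQP = 0, n = 0;
    for (int x = 1; x <= U; x++) {
        int pq = 1, qp = 1;              /* [x]_P in [x]_Q?  [x]_Q in [x]_P? */
        for (int y = 1; y <= U; y++) {
            if (P[y] == P[x] && Q[y] != Q[x]) pq = 0;
            if (Q[y] == Q[x] && P[y] != P[x]) qp = 0;
        }
        posPQ += pq;                     /* x in Pos_P(Q) iff [x]_P in [x]_Q */
        posQP += qp;
        if (!pq && !qp) n++;             /* inconsistent both ways           */
    }
    double a = (double)posPQ / U;        /* P =>_a Q */
    double b = (double)posQP / U;        /* Q =>_b P */
    double cons = (a + b + n*a*b) / (n + 2);
    printf("a=%.3f b=%.3f n=%d Cons=%.4f\n", a, b, n, cons);
    /* expected: a=0.375 b=0.125 n=4 Cons=0.1146 (= 11/96) */
    return 0;
}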

If the t-norm is taken to be max(0, a+b−1), the corresponding s-norm is min(1, a+b); for the t-norm min(a,b), the s-norm is max(a,b). There is an order relation among t-norms and s-norms, viz. any t-norm ≤ min ≤ max ≤ any s-norm.


In particular, max(0, a+b−1) ≤ min(a,b) ≤ max(a,b) ≤ min(1, a+b). Where the Cons function situates itself in this chain is an interesting and useful query, and the following proposition answers it.

Proposition 10. max(0, a+b−1) ≤ Cons(P,Q) ≤ max(a,b) if P ⇒a Q and Q ⇒b P.

To compare Cons(P,Q) and min(a,b), we have:

Proposition 11. Let P and Q be two knowledges with P ⇒a Q and Q ⇒b P. Then
(i) a = b = 1 iff min(a,b) = Cons(P,Q) = 1,
(ii) if either a = 1 or b = 1 then min(a,b) ≤ Cons(P,Q),
(iii) min(a,b) = a ≤ Cons(P,Q) iff n ≤ (a−b)/(a(b−1)), a ≠ 0, b ≠ 1,
(iv) min(a,b) = a ≥ Cons(P,Q) iff n ≥ (a−b)/(a(b−1)), a ≠ 0, b ≠ 1,
(v) max(0, a+b−1) ≤ Cons(P,Q) ≤ max(a,b) ≤ s(a,b) = min(1, a+b).

The Cons function seems quite similar to a t-norm but is not the same, so a closer look into the function is worthwhile. We define a function ∗ : [0,1] × [0,1] → [0,1] by ∗(a,b) = (a+b+nab)/(n+2), where n is a non-negative integer.

Proposition 12
(i) 0 ≤ ∗(a,b) ≤ 1,
(ii) if b ≤ c then ∗(a,b) ≤ ∗(a,c),
(iii) ∗(a,b) = ∗(b,a),
(iv) ∗(a,∗(b,c)) = ∗(∗(a,b),c) iff a = c; ∗(a,∗(b,c)) ≤ ∗(∗(a,b),c) iff a ≤ c; ∗(a,∗(b,c)) ≥ ∗(∗(a,b),c) iff a ≥ c,
(v) ∗(a,1) ≥ a, equality occurring iff a = 1,
(vi) ∗(a,0) ≤ a, equality occurring iff a = 0,
(vii) ∗(a,b) = 1 iff a = b = 1, and ∗(a,b) = 0 iff a = b = 0,
(viii) ∗(a,a) = a iff either a = 0 or a = 1.

The consistency function Cons gives a measure of similarity between two knowledges; it is natural to define a measure of inconsistency or dissimilarity as well. In [6] a notion of distance is available.

Definition 14. If P ⇒a Q and Q ⇒b P, then the distance function, denoted ρ(P,Q), is defined as ρ(P,Q) = (2−(a+b))/2.

Proposition 13. The distance function ρ satisfies the conditions:
(i) 0 ≤ ρ(P,Q) ≤ 1,
(ii) ρ(P,P) = 0,
(iii) ρ(P,Q) = ρ(Q,P),
(iv) ρ(P,R) ≤ ρ(P,Q) + ρ(Q,R).
For the proof the reader is referred to [6].
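Proposition 12(iv) says that ∗ is associative only on the diagonal a = c. A quick numerical check, as a hedged C sketch of ours (the constant n is fixed to 1 purely for the demonstration):

/* Hedged numerical check of Proposition 12(iv) for *(a,b) = (a+b+n*a*b)/(n+2),
   here with the constant n = 1.                                              */
#include <stdio.h>

static const double n = 1.0;

static double star(double a, double b) { return (a + b + n*a*b) / (n + 2.0); }

int main(void) {
    double a = 0.2, b = 0.5, c = 0.9;           /* a < c, so left <= right    */
    printf("*(a,*(b,c)) = %.6f\n", star(a, star(b, c)));   /* 0.313333       */
    printf("*(*(a,b),c) = %.6f\n", star(star(a, b), c));   /* 0.468889       */
    printf("equal when a = c: %.6f vs %.6f\n",
           star(a, star(b, a)), star(star(a, b), a));      /* both 0.173333  */
    return 0;
}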

The Cons function seems to be quite similar to a t-norm but not the same. So a closer look into the function is worthwhile. We deﬁne a function : [0, 1] × [0, 1] → [0, 1] as follows (a, b) = a+b+nab n+2 where n is a non-negative integer. Proposition 12. (i) 0 ≤ (a, b) ≤ 1, (ii) If a ≤ b then (a, b) ≤ (a, c), (iii) (a, b) = (b, a), (iv) (a, (b, c)) = ((a, b), c) iﬀ a = c ; (a, (b, c)) ≤ ((a, b), c) iﬀ a ≤ c; (a, (b, c)) ≥ ((a, b), c) iﬀ a ≥ c, (v) (a, 1) ≥ a, equality occurring iﬀ a = 1, (vi) (a, 0) ≤ a, equality occurring iﬀ a = 0, (vii) (a, b) = 1 iﬀ a = b = 1 and (a, b) = 0 iﬀ a = b = 0, (viii) (a, a) = a iﬀ either a = 0 or a = 1, The consistency function Cons gives a measure of similarity between two knowledges. It would be natural to deﬁne a measure of inconsistency or dissimilarity now. In [6] a notion of distance is available. Definition 14. If P ⇒a Q and Q ⇒b P then the distance function is denoted by ρ(P, Q) and deﬁned as ρ(P, Q) = 2−(a+b) . 2 Proposition 13. The distance function ρ satisﬁes the conditions : (i) o ≤ ρ(P, Q) ≤ 1 (ii) ρ(P, P ) = 0 (iii) ρ(P, Q) = ρ(Q, P ) (iv) ρ(P, R) ≤ ρ(P, Q) + ρ(Q, R). For proof the reader is referred to [6].


Definition 15. We now define a measure of inconsistency by: InCons(P, Q) = 1 − Cons(P, Q).

Proposition 14. (i) 0 ≤ InCons(P, Q) ≤ 1, (ii) InCons(P, P) = 0, (iii) InCons(P, Q) = InCons(Q, P), (iv) InCons(P, R) ≤ InCons(P, Q) + InCons(Q, R) for a fixed constant n.

Proof of (iv): Let P ⇒x R, R ⇒y P, P ⇒a Q, Q ⇒b P, Q ⇒l R, R ⇒m Q ...(i). Now InCons(P, R) = (n + 2 − x − y − nxy)/(n + 2) ≤ InCons(P, Q) + InCons(Q, R) = (n + 2 − a − b − nab)/(n + 2) + (n + 2 − l − m − nlm)/(n + 2) = (2(n + 2) − n(ab + lm) − (a + b + l + m))/(n + 2) iff n + 2 − x − y − nxy ≤ 2(n + 2) − n(ab + lm) − (a + b + l + m) iff n(ab + lm − xy − 1) ≤ 2 + x + y − (a + b + l + m) ...(ii). From (i), by Proposition 3(v), we have x ≥ (a + m − 1) and y ≥ (b + l − 1). Hence (ab + lm − xy − 1) ≤ (ab + lm − (a + m − 1)(b + l − 1) − 1) = (a(1 − l) + b(1 − m) + (m − 1) + (l − 1)) ≤ ((1 − l) + (1 − m) + (m − 1) + (l − 1)) = 0 (because 0 ≤ a, b ≤ 1) ...(iii). Now, 2 + x + y − (a + b + l + m) = 2((2 − a − b)/2 + (2 − l − m)/2 − (2 − x − y)/2) = 2(ρ(P, Q) + ρ(Q, R) − ρ(P, R)) ≥ 0 ...(iv) [by Proposition 13(iv)]. Thus the left-hand side of inequality (ii) is non-positive and the right-hand side is non-negative, so (ii) holds, and (iv), i.e. the triangle inequality, is established.

Proposition 14 shows that for any fixed n the inconsistency measure of knowledge is a metric. It is also a generalization of the distance function ρ of [6]; InCons reduces to ρ when n = 0. Again, n acts as a kind of constraint on the inconsistency measure: as n increases, the inconsistency increases too.

4.1 Consistency Degree w.r.t. Covering

Definition 16. We define the consistency degree in the same way: Consi(C1, C2) = (a + b + nab)/(n + 2), where Depi(C1, C2) = a, i.e., C1 ⇒a C2, and Depi(C2, C1) = b, i.e., C2 ⇒b C1, for i = 1, 2, 3, 4.

Example 7. Let C1 = {{1, 2, 3}, {3, 4, 8}, {6, 7, 8}, {8, 9}} and C2 = {{1, 2, 3, 4}, {5, 8}, {6, 7}, {8, 9}}. Then Dep1(C1, C2) = 1/3, Dep2(C1, C2) = 1/3, Dep3(C1, C2) = 5/9, Dep4(C1, C2) = 1. Also, Dep1(C2, C1) = 1/3, Dep2(C2, C1) = 1/3, Dep3(C2, C1) = 4/9, Dep4(C2, C1) = 5/9. So Consi(C1, C2) for i = 1, 2, 3, 4 are as follows:
Cons1(C1, C2) = (1/3 + 1/3 + n·(1/3)·(1/3))/(n + 2) = (n + 6)/(9(n + 2)),
Cons2(C1, C2) = (1/3 + 1/3 + n·(1/3)·(1/3))/(n + 2) = (n + 6)/(9(n + 2)),
Cons3(C1, C2) = (5/9 + 4/9 + n·(5/9)·(4/9))/(n + 2) = (20n + 81)/(81(n + 2)),
Cons4(C1, C2) = (1 + 5/9 + n·1·(5/9))/(n + 2) = (5n + 14)/(9(n + 2)).

Observation (a) Consi (C1 , C2 ) = 1 iﬀ Depi (C1 , C2 ) = 1 and Depi (C2 , C1 ) = 1.


Its interpretations for i = 1, 2, 3, 4 are given by:
Cons1(C1, C2) = 1 iff ∀x ∈ U, N_x^{C1} ⊆ X for some X ∈ C2, and ∀x ∈ U, N_x^{C2} ⊆ X for some X ∈ C1.
Cons2(C1, C2) = 1 iff ∀x ∈ U, N_x^{C1} ⊆ X for some X ∈ C2, and ∀x ∈ U, N_x^{C2} ⊆ X for some X ∈ C1.
Cons3(C1, C2) = 1 iff each Ci (∈ C1) ⊆ X for some X ∈ C2, and each Ci (∈ C2) ⊆ X for some X ∈ C1.
Cons4(C1, C2) = 1 iff for all x, P_x^{C1} ⊆ X for some X ∈ C2, and for all x, P_x^{C2} ⊆ X for some X ∈ C1.

(b) Consi(C1, C2) = 0 iff Depi(C1, C2) = 0 and Depi(C2, C1) = 0. So the interpretations are:
Cons1(C1, C2) = 0 iff for every x ∈ U there does not exist any X ∈ C2 such that N_x^{C1} ⊆ X, and for every x ∈ U there does not exist any X ∈ C1 such that N_x^{C2} ⊆ X.
Cons2(C1, C2) = 0 iff for every x ∈ U there does not exist any X ∈ C2 such that N_x^{C1} ⊆ X, and for every x ∈ U there does not exist any X ∈ C1 such that N_x^{C2} ⊆ X.
Cons3(C1, C2) = 0 iff for any Ci ∈ C1 there does not exist any X ∈ C2 such that Ci ⊆ X, and for any Ci ∈ C2 there does not exist any X ∈ C1 such that Ci ⊆ X.
Cons4(C1, C2) = 0 iff for all x there does not exist any X ∈ C2 such that P_x^{C1} ⊆ X, and for all x there does not exist any X ∈ C1 such that P_x^{C2} ⊆ X.

Definition 17. A measure of inconsistency for the case of coverings is defined in the same way: InCons(C1, C2) = 1 − Cons(C1, C2).

5 Towards a Logic of Consistency of Knowledge

We are now at the threshold of a logic of consistency (of knowledge). Along with the usual propositional connectives, the language shall contain two binary predicates, 'Cons' and 'Dep', for consistency and dependency respectively. At least the following features of this logic are present:
(i) 0 ≤ Cons(P, Q) ≤ 1,
(ii) Cons(P, P) = 1,
(iii) Cons(P, Q) = Cons(Q, P),
(iv) Cons(P, Q) = 0 iff Dep(P, Q) = 0 and Dep(Q, P) = 0


and Cons(P, Q) = 1 iff Dep(P, Q) = 1 and Dep(Q, P) = 1.
In case P, Q, R are partitions we also get:
(v) Cons(P, Q) and Cons(Q, R) implies Cons(P, R).
(i) shows that the logic is many-valued; (ii) and (iii) are natural expectations; (iv) conforms to assumptions 1 and 2 (Section 2); (v) shows transitivity of the predicate Cons in the special case of partitions. That the transitivity holds is shown below. We want to show that Cons(P, Q) and Cons(Q, R) implies Cons(P, R), i.e., Cons(P, Q) and Cons(Q, R) ≤ Cons(P, R). We use the Lukasiewicz t-norm to compute 'and'. Let n be the fixed constant. So what is needed is max(0, Cons(P, Q) + Cons(Q, R) − 1) ≤ Cons(P, R). Clearly, Cons(P, R) ≥ 0 ...(i). We shall now show Cons(P, R) ≥ Cons(P, Q) + Cons(Q, R) − 1. Let P ⇒x R, R ⇒y P, P ⇒a Q, Q ⇒b P, Q ⇒l R, R ⇒m Q. So x ≥ (a + m − 1) and y ≥ (b + l − 1) [cf. Proposition 3(v)] ...(ii). So, Cons(P, Q) + Cons(Q, R) − 1 = (a + b + nab)/(n + 2) + (l + m + nlm)/(n + 2) − 1 = ((a + l − 1) + (b + m − 1) + n(ab + lm − 1))/(n + 2) ≤ (x + y + n(ab + lm − 1))/(n + 2) [using (ii)] ...(iii). Here, xy ≥ (a + l − 1)(b + m − 1) = ab + lm + (m − 1)(a − 1) + (b − 1)(l − 1) − 1 ≥ ab + lm − 1 [as m − 1 ≤ 0 and a − 1 ≤ 0, so (m − 1)(a − 1) ≥ 0, and b − 1 ≤ 0, l − 1 ≤ 0, so (b − 1)(l − 1) ≥ 0] ...(iv). So (iii) and (iv) imply Cons(P, Q) + Cons(Q, R) − 1 ≤ (x + y + nxy)/(n + 2) = Cons(P, R) ...(v).
(i)-(v) pave the way for formulating axioms of a possible logic of knowledge.

6 Concluding Remarks

This paper is only the beginning of research on a many-valued logic of dependency and consistency of knowledges, where knowledge, in the context of incomplete information, is understood basically as proposed by Pawlak. Various ways of defining lower and upper approximations indicate that the modalities are also different, and hence the corresponding logics would also be different. We foresee interesting logics being developed and significant applications of the concepts Dep, Cons and the operator ⊛.

Acknowledgement The ﬁrst author acknowledges the ﬁnancial support from the University Grants Commission, Government of India.

References
1. Bianucci, D., Cattaneo, G., Ciucci, D.: Entropies and co-entropies of coverings with application to incomplete information systems. Fundamenta Informaticae 75, 77–105 (2007)
2. Cattaneo, G., Ciucci, D.: Lattice Properties of Preclusive and Classical Rough Sets. Personal Collection


3. Chakraborty, M.K., Samanta, P.: Consistency-Degree Between Knowledges. In: Kryszkiewicz, M., et al. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 133–141. Springer, Heidelberg (2007)
4. Klir, G.J., Yuan, B.: Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice-Hall of India, Englewood Cliffs (1997)
5. Nguyen, N.T., Malowiecki, M.: Consistency Measures for Conflict Profiles. In: Peters, J.F., Skowron, A., Grzymala-Busse, J.W., Kostek, B.z., Swiniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 169–186. Springer, Heidelberg (2004)
6. Novotný, M., Pawlak, Z.: Partial Dependency of Attributes. Bull. Polish Acad. of Sci., Math. 36, 453–458 (1988)
7. Pawlak, Z.: Rough Sets. International Journal of Computer and Information Sciences 11, 341–356 (1982)
8. Pawlak, Z.: On Rough Dependency of Attributes in Information Systems. Bull. Polish Acad. of Sci., Math. 33, 551–559 (1985)
9. Pawlak, Z.: Rough Sets - Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishers, Dordrecht (1991)
10. Pomykala, J.A.: Approximation, Similarity and Rough Constructions. ILLC Prepublication Series for Computation and Complexity Theory CT-93-07. University of Amsterdam
11. Qin, K., Gao, Y., Pei, Z.: On Covering Rough Sets. In: Yao, J.T., et al. (eds.) RSKT 2007. LNCS, vol. 4481, pp. 34–41. Springer, Heidelberg (2007)
12. Sakai, H., Okuma, A.: Basic Algorithms and Tools for Rough Non-deterministic Information Analysis. In: Peters, J.F., Skowron, A., Grzymala-Busse, J.W., Kostek, B.z., Swiniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 209–231. Springer, Heidelberg (2004)
13. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundamenta Informaticae 27, 245–253 (1996)
14. Slezak, D., Wasilewski, P.: Granular Sets - Foundations and Case Study of Tolerance Spaces. In: An, A., et al. (eds.) RSFDGrC 2007. LNCS, vol. 4482, pp. 435–442. Springer, Heidelberg (2007)
15. Yao, Y.: Semantics of Fuzzy Sets in Rough Set Theory. In: Peters, J.F., et al. (eds.) Transactions on Rough Sets II. LNCS, vol. 3135, pp. 297–318. Springer, Heidelberg (2004)
16. Zakowski, W.: Approximations in the space (u, π). Demonstratio Mathematica 16, 761–769 (1983)

A New Approach to Distributed Algorithms for Reduct Calculation

Tomasz Strąkowski and Henryk Rybiński

Warsaw University of Technology, Poland
[email protected], [email protected]

Abstract. Calculating reducts is a very important process. Unfortunately, computing all reducts is NP-hard. There are many heuristic solutions for computing reducts, but they do not guarantee obtaining the complete set of reducts. We propose here three versions of an exact algorithm, designed for parallel processing. We show how to decompose the problem of calculating reducts so that parallel calculations are efficient. Keywords: Rough set theory, reduct calculation, distributed computing.

1 Introduction

Nowadays, the ability to collect data is much higher than the ability to process them. Rough Set Theory (RST) provides means for discovering knowledge from data. One of the main concepts in RST is the notion of reduct, which can be seen as a minimal set of conditional attributes preserving the required classification features [1]. In other words, having a reduct of a decision table we are able to classify objects (i.e. take decisions) with the same quality as with all attributes. However, the main restriction in practical use of RST is that computing all reducts is NP-hard. It is therefore of high importance to find efficient algorithms for computing reducts. There are many ideas on how to speed up the computing of reducts [2], [3], [4], [5]. Many of the presented algorithms are based on some heuristics. The disadvantage of a heuristic solution is that it does not necessarily give us the complete set of reducts; in addition, some results can be over-reducts. Another way to speed up the calculation process, not yet explored sufficiently, could be distributing the computations over a set of processors and performing the calculations in parallel.

The research has been partially supported by grant No 3 T11C 002 29, received from Polish Ministry of Education and Science, and partially supported by grant received from rector of Warsaw University of Technology No 503/G/1032/4200/000.



In this paper we analyze how to speed up the calculation of the complete set of reducts by distributing the processing over a number of available processors. A parallel version of a genetic algorithm for computing reducts has been presented in [3]. The main disadvantage of this approach is that the algorithm does not necessarily find all the reducts. In this paper we present various types of problem decomposition for calculating reducts. We present three versions of distributing the processing, each of them generating all the reducts of a given information system. We will also discuss the conditions for decomposing the problem, and present criteria that enable one to find the best decomposition. The paper is organized as follows. In Section 2 we recall basic notions related to the rough set theory, and present the analogies between finding reducts in RST and the transformations of logical clauses. We also present a naïve algorithm for finding a complete set of reducts and discuss its complexity. In Section 3 we present three ways of decomposing the process of reduct calculation. Section 4 is devoted to experimental results, obtained with all three proposed approaches. We conclude the paper with a discussion of the effectiveness of the approaches and their areas of application.

2 Computing Reducts and Logic Operations

Let us start by recalling basic notions of the rough set theory. In practical terms, knowledge is coded in an information system (IS). An IS is a pair (U, A), where U is a finite set of elements, and A is a finite set of attributes which describe each element. For every a ∈ A there is a function U → Va assigning a value v ∈ Va of the attribute a to the objects u ∈ U, where Va is the domain of a. The indiscernibility relation is defined as follows: IND(A) = {(u, v) : u, v ∈ U, a(u) = a(v), a ∈ A}. Informally speaking, two objects u and v are indiscernible for the attribute a if they have the same value of that attribute. The indiscernibility relation can be defined for a set of attributes B ⊆ A: IND(B) = ⋂_{a∈B} IND(a). One of the most important ideas in RST is the notion of reduct. A reduct is a minimal set of attributes B, B ⊆ A, for which the indiscernibility relation on U is exactly the same as for the set A, i.e. IND(B) = IND(A). A superreduct is a superset of a reduct. Given a set of attributes B, B ⊆ A, we define a B-related reduct as a set C of attributes, B ∩ C = ∅, which preserves the partition of IND(B) over U. Given u ∈ U, we define a local reduct as a minimal set of attributes capable of distinguishing this particular object from the other objects as well as the total set of attributes does. Let us introduce a discernibility function (denoted by disc(B, u)) as the set of all objects v discernible from u for the set of attributes B: disc(B, u) = {v ∈ U | ∃a ∈ B (a(u) ≠ a(v))}
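As an illustration, the indiscernibility relation IND(B) can be materialized as a partition of U by grouping objects on their B-values. A minimal sketch of ours (the dictionary-based representation of the decision table is our assumption, not the authors' code); it uses the data of Table 1 below:

from collections import defaultdict

def ind_partition(objects, B):
    # Group objects into IND(B) equivalence classes by their values on B.
    classes = defaultdict(list)
    for name, values in objects.items():
        classes[tuple(values[a] for a in B)].append(name)
    return list(classes.values())

U = {'u1': {'a': 1, 'b': 2, 'c': 3}, 'u2': {'a': 1, 'b': 2, 'c': 1},
     'u3': {'a': 2, 'b': 2, 'c': 3}, 'u4': {'a': 2, 'b': 2, 'c': 3},
     'u5': {'a': 3, 'b': 5, 'c': 1}}

# {a, c} induces the same partition as {a, b, c}, so it preserves IND(A)
print(ind_partition(U, ['a', 'c']) == ind_partition(U, ['a', 'b', 'c']))  # True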


Table 1. Decision Table

      a  b  c  d
u1    1  2  3  1
u2    1  2  1  2
u3    2  2  3  2
u4    2  2  3  2
u5    3  5  1  3

Table 2. Indiscernibility matrix

      u1    u2    u3    u4    u5
u1    -     c     a     a     abc
u2    c     -     ac    ac    ab
u3    a     ac    -     -     abc
u4    a     ac    -     -     abc
u5    abc   ab    abc   abc   -

Table 3. Interpretation of the indiscernibility matrix

     Discernibility Function   CNF form (after reduction)   DNF Formula (Prime Implicants)   Local reducts
u1   c ∧ a ∧ (a ∨ b ∨ c)       c ∧ a                        a ∧ c                            {a,c}
u2   c ∧ (a ∨ c) ∧ (a ∨ b)     c ∧ (a ∨ b)                  (a ∧ c) ∨ (b ∧ c)                {a,c}; {b,c}
u3   a ∧ (a ∨ c) ∧ (a ∨ b)     a                            a                                {a}
u4   a ∧ (a ∨ c) ∧ (a ∨ b)     a                            a                                {a}
u5   (a ∨ b ∨ c) ∧ (a ∨ b)     (a ∨ b)                      a ∨ b                            {a}; {b}

A local reduct for the element u ∈ U is a minimal set of attributes B, B ⊆ A, such that disc(B, u) = disc(A, u). Now let us show some similarities between reducts and logic operations. The relationships between reducts and logical expressions were first presented in [6]. Let us consider a decision table, as in Table 1. We have here five elements (u1-u5), three conditional attributes, namely a, b, c, and one decision attribute d. The indiscernibility matrix for this table is shown in Table 2. The interpretation of this indiscernibility matrix is presented in Table 3. The i-th row shows the following: in column 1 there is a rule saying which attributes have to be used to discern the i-th object (ui) from any other object of the IS from Table 1 (discernibility). The second column shows the same rule in the form of CNF (after reduction), the third one presents the rule in disjunctive normal form (DNF), whereas the last column provides the local reducts for ui.


Algorithm 2.1. Reduct Set Computation(DT)

  compute the indiscernibility matrix M(A) = (Cij)
  transform M to a one-dimensional table T
  reduce T using the absorption laws        // from CNF towards prime implicants
  sort T by clause length                   // sorting is our modification; d is the number of elements in T
  build the families R1, R2, ..., Rd in the following way:
    R0 = ∅
    for i ← 1 to d
      if the stop condition is true         // our modification
        then Rd = Ri; break
        else Ri = Si ∪ Ki, where
             Si = {r ∈ Ri−1 : r ∩ Ti ≠ ∅}
             Ki = {r ∪ {a} : a ∈ Ti, r ∈ Ri−1, r ∩ Ti = ∅}
  remove redundant elements from Rd
  remove super-reducts
  RED(A) = Rd
  return RED(A)

Now let us recall a naïve method for calculating reducts. It is a slight modification of the algorithm presented in [2], given above in the form of pseudo code. This code differs from the original one in two places. First, we sort the clauses of the discernibility function by length (the shortest clauses come first). Second, we change the stop condition. Ti is the i-th clause of the prime implicant, and Ri is the set of candidate reducts. In the i-th step we check every r ∈ Ri, and if r ∩ Ti ≠ ∅, then Ri+1 := Ri+1 ∪ {r}; otherwise the clause is split into separate attributes, and each attribute is added to r, making a new reduct candidate to be included in Ri+1. As the clauses Ti are sorted, we can stop the algorithm when k + li > |A|, where k is the length of the shortest candidate reduct and li is the length of Ti. Let us reconsider the time and space complexities of the naïve algorithm. There are three sequential parts in the algorithm: (1) generating the indiscernibility matrix (IND matrix); (2) converting the matrix to the discernibility function (using the absorption laws); and (3) converting to the DNF form (prime implicants), i.e. reducts. The IND matrix is square and symmetric (with null values on the diagonal). The size of the matrix is |U| × |U|, where |U| denotes the number of elements in DT. So, the time and space complexities are:

O((|U|² − |U|)/2)    (1)

The complexity of the process of converting from the IND matrix to the discernibility function is linear, so it can be ignored. The complexity of the conversion from the


discernibility function to reduced CNF is O(n²) (in the worst case), where n is the number of clauses in the discernibility function. No additional data structures are needed, so the space complexity can be ignored. The hardest to estimate is the complexity of converting from CNF to DNF. The space complexity is estimated as:

O(C(|A|, |A|/2))    (2)

where C denotes the binomial coefficient.

It is the maximal number of candidate reducts in the conversion process. The proof of the maximal number of reducts was presented in [4]. More complicated is the estimation of the time complexity. Given n as the number of clauses in the discernibility function, we can estimate it as:

O(C(|A|, |A|/2) × n)    (3)

During the conversion from the discernibility function to prime implicants, in one step we compare every candidate reduct with the i-th clause of the discernibility function. The number of steps is equal to the number of clauses. The maximum number of clauses in the discernibility function is n = (|U|² − |U|)/2 (in the worst case, when the absorption laws cannot be applied). Hence, the time complexity is:

O(C(|A|, |A|/2) × (|U|² − |U|)/2)    (4)

Let us now summarize our considerations:
1. The maximal space requirement depends only on the number of attributes in DT.
2. The time of computing IND depends polynomially on the number of objects in DT.
3. The time of computing all the reducts depends exponentially on the number of attributes (for a constant number of objects).
The exponential explosion of the complexity appears in the last part of the algorithm, during the conversion from CNF to prime implicants. The best practice is to decompose the most complex part, though it is not the only possible place. Sometimes computing IND is more time consuming than evaluating the conversions. We will discuss the options in the next section.
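To make the conversion concrete, here is a compact sketch of the naïve scheme discussed in this section. It is our illustration with hypothetical helper names, not the authors' implementation: it builds the discernibility clauses, applies the absorption laws, and converts the CNF to prime implicants.

from itertools import combinations

def discernibility_clauses(objects, attrs):
    # One clause per pair of objects: the attributes on which they differ.
    clauses = []
    for u, v in combinations(objects, 2):
        c = frozenset(a for a in attrs if u[a] != v[a])
        if c:
            clauses.append(c)
    return clauses

def absorb(clauses):
    # Absorption laws: a clause containing another clause is redundant.
    kept = []
    for c in sorted(set(clauses), key=len):   # shortest first, as in Algorithm 2.1
        if not any(k <= c for k in kept):
            kept.append(c)
    return kept

def all_reducts(objects, attrs):
    # Convert the reduced CNF to prime implicants (the reducts).
    R = {frozenset()}
    for t in absorb(discernibility_clauses(objects, attrs)):
        R = {r if r & t else r | {a} for r in R for a in t}
    return [r for r in R if not any(s < r for s in R)]   # drop super-reducts

U = [{'a': 1, 'b': 2, 'c': 3}, {'a': 1, 'b': 2, 'c': 1},
     {'a': 2, 'b': 2, 'c': 3}, {'a': 2, 'b': 2, 'c': 3},
     {'a': 3, 'b': 5, 'c': 1}]
print(all_reducts(U, ['a', 'b', 'c']))   # [frozenset({'a', 'c'})], as for Table 1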

3 Decomposition of the Problem

There are several ways of decomposing the algorithm. One possibility is to split DT, compute the reducts for each part independently, and merge the results. Another idea is to compute the IND matrix sequentially, convert it to the discernibility function and CNF, and then split the discernibility function into several parts to be calculated


separately, so that the conversions to DNF are made in separate nodes of the algorithm, and the final result is obtained by merging the partial results. The two proposals above consist in a horizontal decomposition, in the sense that we split the table DT into some sub-tables and then use the partial results to compute the final reducts. Certainly, the partial results are not necessarily reducts. They can, though, be related reducts, and additional (post)processing is needed to calculate the reducts. Both proposals are described in more detail in this section, in 3.1 and 3.2 respectively. In the paper we propose yet another solution, based on a vertical decomposition. In particular, during the process of converting CNF to DNF we split the set of candidate reducts among a number of processors, which then serve in parallel for processing the consecutive clauses. We call this decomposition vertical because it splits the set of candidate reducts (subsets of the attributes) into separate subsets, instead of splitting the set of objects. For each subset of the candidate reducts, the conversion is completed in a separate node of the algorithm (processor). Let us note that every candidate reduct passes comparisons with every clause. This guarantees that the partial results in each node are reducts or super-reducts. Having computed the partial results, in the last phase of the algorithm we join them into the final reduct set. Let us also note that there is a difference between using partial results obtained from horizontal and vertical decompositions. In the first case we have to merge partial reducts, which is a complex and time-consuming process, whereas in the second case we have to join the partial results and remove duplicates and super-reducts. This process is fairly simple. The third proposal is presented in Section 3.3. Below we describe the three proposals in more detail.

3.1 Splitting Decision Table

Let us present the process of decomposing DT. We split DT into two separate, randomly selected subsets X1 and X2, and for each of them we compute the reducts. If we now "merge" the results, the final result does not take into account indiscernibilities between objects from X1 and X2. It is therefore necessary to compute another part of the IND matrix to calculate the discernibility for the pairs (xi, xj), xi ∈ X1 and xj ∈ X2. Fig. 1 shows how the decomposition of DT influences the splitting of the IND matrix (denoted by M). M(Xk), k = 1, 2, are the parts related to the discernibility of objects both from Xk. M(X1, X2) is the part of M with information about the discernibility between xi, xj such that xi ∈ X1 and xj ∈ X2. In this sense the decomposition of DT is not disjoint. However, in the sense of splitting M into disjoint parts, the decomposition is disjoint. We can thus conclude that for splitting DT into two sets we need three processing nodes. Similarly, if we split DT into three sets, we need six processing nodes. In general, if we split DT into n subsets we need (n² + n)/2 processing nodes.


Fig. 1. Splitting DT (the indiscernibility matrix over the sets X1 and X2 decomposed into the blocks M(X1), M(X2) and M(X1, X2))
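A sketch of how the work could be assigned when DT is split into n subsets, as our illustration of the (n² + n)/2 count: every subset gets a node for its diagonal block, and every pair of subsets gets a node for the cross block.

from itertools import combinations

def node_tasks(n):
    # Blocks of the IND matrix when DT is split into n subsets:
    # one diagonal block M(Xi) per subset plus one block M(Xi, Xj) per pair.
    tasks = ['M(X%d)' % i for i in range(1, n + 1)]
    tasks += ['M(X%d,X%d)' % (i, j) for i, j in combinations(range(1, n + 1), 2)]
    return tasks

for n in (2, 3, 4):
    print(n, len(node_tasks(n)))   # 2 -> 3, 3 -> 6, 4 -> 10 = (n*n + n) / 2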

3.2 Splitting Discernibility Function

Another idea for decomposing the problem of computing all reducts is to split the discernibility function into separate sections and then treat each section as a separate discernibility function. The conversion to DNF is made for every such function, and then the partial results are merged as a multiplication of clauses. Let us illustrate it by the following example.

Example 1. Provided that after applying the absorption laws we receive the discernibility function

(a ∨ b) ∧ (a ∨ c) ∧ (b ∨ d) ∧ (d ∨ e)    (∗)

we can convert it to the DNF form in the following sequential steps:
1. (a ∨ ac ∨ ab ∨ bc) ∧ (b ∨ d) ∧ (d ∨ e) = (a ∨ bc) ∧ (b ∨ d) ∧ (d ∨ e)
2. (ab ∨ ad ∨ bcd ∨ bc) ∧ (d ∨ e) = (ab ∨ ad ∨ bc) ∧ (d ∨ e)
3. (abd ∨ abe ∨ ad ∨ ade ∨ bcd ∨ bce) = (ad ∨ abe ∨ bcd ∨ bce)

Instead of processing (∗) sequentially, let us split it into two parts:
1. (a ∨ b) ∧ (a ∨ c)
2. (b ∨ d) ∧ (d ∨ e)
The tasks (1) and (2) can be continued in two separate processing nodes, which leads to the forms:
1. (a ∨ ac ∨ ab ∨ bc) = (a ∨ bc)
2. (bd ∨ be ∨ d ∨ de) = (be ∨ d)
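The two partial conversions, and the merging described just below, can be sketched as follows. This is our code, not the paper's: a DNF is represented as a set of frozensets of attributes, and merging is the Cartesian-product conjunction followed by absorption.

def cnf_to_dnf(clauses):
    # Convert a CNF (list of attribute sets) to an absorbed DNF.
    dnf = {frozenset()}
    for clause in clauses:
        dnf = {t | {a} for t in dnf for a in clause}
        dnf = {t for t in dnf if not any(s < t for s in dnf)}   # absorption
    return dnf

def merge(dnf1, dnf2):
    # Conjunction of two partial DNFs: Cartesian product, then absorption.
    prod = {t1 | t2 for t1 in dnf1 for t2 in dnf2}
    return {t for t in prod if not any(s < t for s in prod)}

cnf = [{'a', 'b'}, {'a', 'c'}, {'b', 'd'}, {'d', 'e'}]
part1 = cnf_to_dnf(cnf[:2])          # a or bc
part2 = cnf_to_dnf(cnf[2:])          # d or be
print(sorted(map(sorted, merge(part1, part2))))
# [['a', 'b', 'e'], ['a', 'd'], ['b', 'c', 'd'], ['b', 'c', 'e']]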


Fig. 2. The parallel processing with 3 nodes (phases over time: computing the IND matrix, computing reducts, merging reducts)

Having the partial results from the nodes (1) and (2), we merge them: (a ∨ bc) ∧ (be ∨ d). So we receive the final result: (abe ∨ ad ∨ bce ∨ bcd). Fig. 2 presents the general idea of processing the algorithm in parallel, as sketched above. As one can see, in this approach we can split the calculations among as many nodes as there are pairs of clauses in the discernibility function (obviously, we can split the task among a smaller number of nodes as well). There is, though, a final part of the algorithm devoted to merging the partial results coming from the nodes. This process is performed sequentially, and its efficiency depends on the number of processing nodes. Obviously, we should avoid the cases when the cost of merging is higher than the savings from parallel processing. We discuss the issue in the next paragraph.

Merging of partial results. The process of merging the partial results is time consuming. It is equivalent to the process of finding the Cartesian product of n sets, so the time requirement for this process depends on the number of the partial results, i.e. O(∏ |mi|), i = 1, 2, ..., n, where |mi| is the number of elements in the i-th partial result. There is, though, a way to perform this process in parallel as well. Let us consider the case where we have two partial results to merge, p1 and p2. We split p1 into a few separate subsets, so p1 = ⋃i p1i. Thus p1 ∧ p2 = ⋃i (p1i ∧ p2), and each component p1i ∧ p2 can be processed in a separate processing node. The process of summing the partial conjunction results consists in removing duplicates and super-reducts from the final result set. The more components p1i we have in p1, the more processors we can use.

Optimal use of the processors. Fig. 3 presents an example of using 5 processors for computing reducts by splitting the prime implicant. We distinguish here four phases. The first one is for computing the IND matrix and the prime implicant (marked by very light grey); then the conversion from prime implicant to DNF starts (light grey) on five nodes.


Fig. 3. Sample usage of processors for 5 nodes and the central node (phases: computing the IND matrix, computing reducts, merging reducts, removing duplicates and superreducts; inactive time shown per node)

Fig. 4. Merging by bundles (same legend as Fig. 3)

When we have 2 conversions completed, the merging can start on the free nodes (dark grey). When all partial reduct results are provided, the final process of removing duplicates is performed sequentially (black). This solution is not optimal with respect to the use of processors. There are a lot of periods where some nodes of the algorithm have to wait, even if all the nodes have the same speed. The problem gets worse if the nodes differ in speed. To solve this problem we propose, in every merging of partial results P1 and P2, to split P1 into more parts than there are free available processors. Thus, we decompose the merging into many independent bundles. Each bundle can be processed asynchronously. Each processor processes as many bundles as it can.


In this case, the maximal waiting time in every partial merging is the time of processing one bundle in the slowest node. Let us consider this proposal in more detail (Fig. 4). In this case the node N3 does not have to wait for N2, but helps nodes N4 and N5 by merging bundles from P4 and P5. This task can be finished faster than in the previous example. After computing the DNF from P2, node N2 takes P2 and P3 from the queue and starts computing the set ∧(P2, P3). After computing ∧(P4, P5), the nodes N3, N4, N5 join N2. Having finished P1, the node N1 takes the next task from the queue (∧(P1, P4, P5)). Having finished processing ∧(P2, P3), the remaining free nodes join the computation of ∧(P1, P4, P5). The last task is to compute ∧(P1, P2, P3, P4, P5) by all the nodes.
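The bundle strategy can be imitated with a process pool: split one DNF into bundles, submit each conjunction as an independent task, and let whichever worker is free pick up the next bundle. A sketch of ours (not the authors' code), reusing the set-of-frozensets DNF representation from above:

from concurrent.futures import ProcessPoolExecutor

def conjoin(bundle, dnf2):
    # Merge one bundle of partial results with dnf2 (Cartesian product).
    return {t1 | t2 for t1 in bundle for t2 in dnf2}

def parallel_merge(dnf1, dnf2, workers=4, bundles=16):
    chunks = [list(dnf1)[i::bundles] for i in range(bundles)]
    chunks = [c for c in chunks if c]
    # Note: when run as a script this call belongs under
    # an `if __name__ == '__main__':` guard.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(conjoin, chunks, [dnf2] * len(chunks))
    merged = set().union(*parts)
    # Final sequential step: remove duplicates and super-reducts.
    return {t for t in merged if not any(s < t for s in merged)}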

3.3 Splitting Set of Candidate Reducts - Vertical Decomposition

Now we present the third way of decomposing the calculation of reducts, the vertical one. The main idea is that during the conversion of CNF to DNF we split the formula into two parts across a (disjunctive) component. The idea of this decomposition was originally presented in [7]. Here we make a slight modification of that method. Let us go back to the conversion process from CNF to DNF. Sequentially, the process can be performed as below:
1. (a ∨ b) ∧ (a ∨ c) ∧ (b ∨ d) ∧ (d ∨ e)
2. (a ∨ ac ∨ ab ∨ bc) ∧ (b ∨ d) ∧ (d ∨ e) = (a ∨ bc) ∧ (b ∨ d) ∧ (d ∨ e)
3. (ab ∨ ad ∨ bcd ∨ bc) ∧ (d ∨ e) = (ab ∨ ad ∨ bc) ∧ (d ∨ e)
4. (abd ∨ abe ∨ ad ∨ ade ∨ bcd ∨ bce) = (ad ∨ abe ∨ bcd ∨ bce)

The bold clauses a and bc relate to the "candidate reducts". Let us make the decomposition after the second step¹ and perform the process in two nodes:

Table 4. Decomposition of computation after the second step

Node 1                                   Node 2
(a) ∧ (b ∨ d) ∧ (d ∨ e)                  (bc) ∧ (b ∨ d) ∧ (d ∨ e)
(ab ∨ ad) ∧ (d ∨ e)                      (bc ∨ bcd) ∧ (d ∨ e) = (bc) ∧ (d ∨ e)
(abd ∨ abe ∨ ad ∨ ade) = (ad ∨ abe)      (bcd ∨ bce)
                (ad ∨ abe ∨ bcd ∨ bce)
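The vertical decomposition itself is then just a partition of the current candidate set; each part finishes the remaining clauses independently, and the union of the partial results needs only duplicate and super-reduct removal. A sketch of ours, reusing the DNF representation above:

def finish(candidates, remaining_clauses):
    # Process the remaining clauses for one part of the candidate set.
    R = set(candidates)
    for t in remaining_clauses:
        R = {r if r & t else r | {a} for r in R for a in t}
    return R

# State after the second step: candidates {a} and {bc}, clauses (b or d), (d or e)
part1 = finish([frozenset('a')], [set('bd'), set('de')])
part2 = finish([frozenset('bc')], [set('bd'), set('de')])
joined = part1 | part2
print({r for r in joined if not any(s < r for s in joined)})
# ad, abe, bcd, bce - the same result as in Table 4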

The advantage of this decomposition is the easiness of joining the partial results: one should only add the sets of reducts and remove super-reducts. This method reduces the time of processing and the space needed for storing candidate reducts. If we have one processor without enough memory for the candidate reducts, we can decompose the process into two parts. The first part can be continued, whereas the second one can wait frozen and restart after the first one has finished. This is more effective than using virtual memory, because the algorithm can

¹ It could have been done after the 1st step as well as after the 3rd step.


decide what should be frozen and what is executed. The disadvantage is that the decomposition is done in a late phase of the algorithm, so the time saved by the decomposition can be insignificant. Another disadvantage is that the algorithm depends on too many parameters. In particular, one has to choose the right moment to split the formula. In our experiments we have used the following rules: 1. do not split before completing 10% of the conversion steps; 2. the last split must be done before 60% of the conversion; 3. split if the number of candidates is greater than u (u is a parameter). The main difference between our proposal and the one presented in [7] is in the splitting of the candidate sets. In [7] it is proposed to split the set of candidates among n processors once the number of "candidate reducts" exceeds the branching factor [7]. The disadvantage of this approach is that we do not know the number of candidate reducts before completing the computations, so it is hard to estimate the optimal value of the branching factor.

4 Experiments and Results

There are some measures in the literature for distributed algorithms. In our experiments we used two indicators: 1. speedup, 2. efficiency. Following [8], we define speedup as Sp = T1/Tp, and efficiency as Ep = Sp/p, where T1 is the execution time of the algorithm on one processor, Tp is the time needed by p processors, and p is the number of processors. We have tested all the presented algorithms. For the experiments we used three base data sets: (a) 4000 records and 23 condition attributes; (b) 5000 records and 20 condition attributes; and (c) 20000 records and 19 condition attributes. The sets (a) and (b) were randomly generated. The set (c) is based on the set "Letter recognition" from [9]. To the original set we added three additional columns, each being a combination of selected columns from the original set (so that more reducts should appear in the results). For each of the databases we prepared a number of data sets: 5 sets for (a), 6 sets for (b) and 11 sets for (c). Every data set was prepared by a random selection of objects from the base sets. For each series of data sets we performed one experiment for the sequential algorithm and, additionally, three experiments, one for each way of decomposition. Below we present the results of the experiments. Tables 5-7 contain the execution times for the sequential version of the algorithm for each of the three testing data sets respectively. In these tables, column 2 shows the total execution time and column 3 shows the execution time of computing the IND matrix and the reduced discernibility function. It is not possible to split the times for processing the IND matrix and the discernibility function without a loss of efficiency.
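A one-line check of these definitions on hypothetical example timings (the numbers below are ours, not taken from the tables):

def speedup(t1, tp):
    # Sp = T1 / Tp
    return t1 / tp

def efficiency(t1, tp, p):
    # Ep = Sp / p
    return speedup(t1, tp) / p

# e.g. a job that took 600 s sequentially and 250 s on 3 processors
print(speedup(600, 250))          # 2.4
print(efficiency(600, 250, 3))    # 0.8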


Let us note that the computing of the IND matrix and discernibility function in the first case (Table 5) takes less than 1% of the total processing time. In the second case (Table 6) the computing of IND is about 50% of the total processing time. The number of clauses in the prime implicant is smaller for this data set. In the third case (Table 7), the computing of IND takes more than 99% of the total computing time. Let us note that only for this case the decomposition of DT can be justified. Now we present Tables 8-10. In each table the results of the three distributed algorithms are presented for each data set respectively. From Table 8 we can see that for the data sets where the discernibility function is long and we expect many results, it is better to use the vertical decomposition. The vertical decomposition has two advantages: (a) we decompose the phase that

Table 5. Time of computing for the sequential method, data set 1

Size (records)  Total time (ms)  IND matrix time (ms)  IND matrix size  Reducts number
2000            2483422          29344                 513              5131
2500            2144390          41766                 475              4445
3000            2587125          60766                 555              5142
3500            3137750          80532                 532              4810
4000            191390           100266                116              1083

Table 6. Time of computing for the sequential method, data set 2

Size (records)  Total time (ms)  IND matrix time (ms)  IND matrix size  Reducts number
2500            70735            31457                 77               202
3000            68140            46016                 41               107
3500            79234            61078                 33               72
4000            99500            77407                 37               109
4500            127015           100235                42               127
5000            151235           120094                46               131

Table 7. Time of computing for the sequential method, data set 3

Size (records)  Total time (ms)  IND matrix time (ms)  IND matrix size  Reducts number
11000           668063           650641                16               5
12000           798906           780391                16               5
13000           936375           916641                16               5
14000           1086360          1065375               16               5
15000           1245188          1223016               16               5
16000           1413032          1389782               16               5
17000           1597250          1572843               16               5
18000           1787578          1762015               16               5
19000           1993640          1966718               16               5
20000           2266016          2238078               16               5


Table 8. Parallel methods for data set 1

                      DT             DISC FUNCTION             CANDIDATE REDUCTS
Size (records)   S3     E3      S2    E2    S3    E3        S2    E2    S3    E3
2000             0.41   0.14    3.01  1.50  2.53  0.84      4.33  2.16  7.66  2.55
2500             0.75   0.25    3.03  1.51  2.46  0.82      3.91  1.95  6.64  2.21
3000             0.53   0.18    2.11  1.05  2.11  0.70      3.34  1.67  6.08  2.02
3500             0.70   0.24    3.60  1.80  2.80  0.93      3.55  1.77  6.90  2.30
4000             0.46   0.15    1.57  0.78  1.57  0.52      0.64  0.32  1.00  0.33

Table 9. Parallel methods for data set 2

                      DT             DISC FUNCTION             CANDIDATE REDUCTS
Size (records)   S3     E3      S2    E2    S3    E3        S2    E2    S3    E3
2500             0.72   0.24    1.86  0.93  1.91  0.64      0.79  0.39  0.88  0.29
3000             0.69   0.23    1.36  0.68  1.37  0.46      0.84  0.42  0.84  0.28
3500             0.54   0.18    1.18  0.59  1.22  0.40      0.93  0.47  0.71  0.23
4000             0.73   0.24    1.19  0.60  1.16  0.39      0.96  0.48  0.97  0.32
4500             1.02   0.34    1.20  0.60  1.08  0.36      0.84  0.42  0.90  0.30
5000             0.99   0.33    1.18  0.59  1.10  0.37      0.98  0.49  1.06  0.35

Table 10. Parallel methods for data set 3

                      DT             DISC FUNCTION             CANDIDATE REDUCTS
Size (records)   S3     E3      S2    E2    S3    E3        S2    E2    S3    E3
11000            1.57   0.52    0.99  0.49  1.00  0.33      0.99  0.49  1.00  0.33
12000            1.55   0.52    1.00  0.50  1.00  0.33      1.00  0.50  1.00  0.33
13000            1.58   0.53    1.00  0.50  1.00  0.33      1.00  0.50  1.00  0.33
14000            1.59   0.53    1.00  0.50  0.99  0.33      1.00  0.50  0.99  0.33
15000            1.59   0.53    0.99  0.49  0.99  0.33      0.99  0.49  0.99  0.33
16000            1.60   0.53    1.00  0.50  1.00  0.33      1.00  0.50  1.00  0.33
17000            1.61   0.54    1.00  0.50  1.00  0.33      1.00  0.50  1.00  0.33
18000            1.63   0.54    1.00  0.50  0.99  0.33      1.00  0.50  0.99  0.33
19000            1.64   0.54    1.00  0.50  1.00  0.33      1.00  0.50  1.00  0.33
20000            1.73   0.58    1.00  0.50  0.50  0.33      1.00  0.50  1.00  0.33

takes the majority of the time; and (b) joining partial results is less time consuming than merging. For the methods with horizontal decomposition, the time of computing depends on the time of merging partial results. By adding another processor we do not necessarily get better results: although the conversion to DNF is faster, the merging of three sets is more complicated. In the second case (Table 9) only the method with discernibility function decomposition gives good results. Splitting candidate reducts was not effective, because the conversion from CNF to DNF takes less than 50% of the total processing


time, so the decomposition was made too late. Also splitting DT was not effective, as this method may cause redundancy in the partial results. The best method here is splitting the discernibility function. It may also cause redundancy in the partial results, but much less than the DT decomposition. In Table 10 we have an unusual case, because of the large number of objects and the small number of attributes. The processing of IND takes more than 99% of the total time, so we can expect that only the decomposition of DT can give us satisfactory results.

5 Conclusions and Future Work

We have investigated possibilities of decomposing the process of computing the reducts. Three points where the decomposition is feasible have been identified. Based on this, three algorithms for parallel computing of the reducts have been presented and tested. The performed experiments have shown that each of the algorithms has its own specific kind of data sets for which it is the best. It is therefore an important task to identify, at the beginning of the computations, which way of parallelizing the reduct computations is the most appropriate. We also expect that for some kinds of data combining the three methods can bring positive results. Special heuristics have to be prepared in order to decide (perhaps dynamically, during the computations) when and how to split the computations. This is the subject of our future research.

References
1. Pawlak, Z.: Rough sets: Theoretical aspects of reasoning about data. Kluwer, Dordrecht (1991)
2. Bazan, J., Nguyen, H.S., Nguyen, S.H., Synak, P., Wróblewski, J.: Rough set algorithms in classification problem. In: Polkowski, L., Tsumoto, S., Lin, T. (eds.) Rough Set Methods and Applications, pp. 49–88. Springer, Physica-Verlag, Heidelberg (2000)
3. Wróblewski, J.: A parallel algorithm for knowledge discovery system. In: PARELEC 1998, pp. 228–230. The Press Syndicate of the Technical University of Bialystok (1998)
4. Wróblewski, J.: Adaptacyjne Metody Klasyfikacji Obiektów. Ph.D. thesis, Uniwersytet Warszawski, Wydział Matematyki, Informatyki i Mechaniki (2001)
5. Bakar, A.A., Sulaiman, M., Othman, M., Selamat, M.: Finding minimal reduct with binary integer programming in data mining. In: Proc. of the IEEE TENCON 2000, vol. 3, pp. 141–146 (2000)
6. Skowron, A., Rauszer, C.: The discernibility matrices and functions in information systems. In: Slowinski, R. (ed.) Decision Support: Handbook of Applications and Advances of Rough Sets Theory, pp. 331–362. Kluwer, Dordrecht (1992)
7. Susmaga, R.: Parallel computation of reducts. In: Polkowski, L., Skowron, A. (eds.) Rough Sets and Current Trends in Computing, pp. 450–457. Springer, Heidelberg (1998)
8. Karbowski, A., Niewiadomska-Szynkiewicz, E. (eds.): Obliczenia równoległe i rozproszone. Oficyna Wydawnicza Politechniki Warszawskiej (in Polish) (2001)
9. Asuncion, A., Newman, D.: UCI machine learning repository (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html

From Information System to Decision Support System

Alicja Wakulicz-Deja and Agnieszka Nowak

Institute of Computer Science, University of Silesia, Będzińska 39, 41–200 Sosnowiec, Poland
{wakulicz,nowak}@us.edu.pl

Abstract. In the paper we present the definition of Pawlak's model of an information system. The model covers information systems with history, systems with the decomposition of objects or attributes, and dynamic information systems. Information systems are closely related to rough set theory and decision support systems. The aim of the paper is to characterize the research of the group at the Silesian University, stimulated by Professor Pawlak, on information retrieval based on different information systems and on decision support based on rough sets, and to outline the current research projects of this group on modern decision systems. Keywords: information system, decision support system, rough set theory, clustering methods.

1 Introduction

Information systems and decision support systems are strongly related. The paper shows that we can treat a decision system as an information system over some objects for which we have information about their classification. Recently, not much attention has been paid in the literature to the classification of information systems. We deal with a classification based on changes of information systems in time, which leads in a natural way to the concept of dynamic systems. Data analysis in a given information system is possible thanks to defining: the decomposition of the system (done on the set of attributes or objects), dependent and independent attributes in the data (to remove the attributes that are dependent), whether the attributes or even objects are equivalent, and the comparison of objects, attributes and even whole systems. The paper also shows that the model of an information system created by Professor Pawlak is very useful for retrieving information. One of the different methods of retrieving information, the so-called atomic components method, was proposed by Professor Pawlak, and it is presented in the paper with all basic assumptions. The relation of information systems and rough set theory to decision support systems, where research is concerned with the classificatory analysis of imprecise, uncertain or incomplete information or knowledge expressed in terms of data acquired from experience, is also presented in the paper. It also considers methods of attribute set reduction and rule induction methods that have been


applied to knowledge discovery in databases; the empirical results obtained show that they are very powerful and that some important knowledge has been extracted from databases. Because of that, the paper presents the results of the stages of different research efforts that were done (e.g. a diagnosis support system used in child neurology, a notable example of a complex multistage diagnosis process) as well as the research planned at the Silesian University. It is supposed to explain Professor Pawlak's invaluable contribution to the domain of information and decision support systems. The notion of an information system, formulated by Professor Pawlak and developed with his co-workers, is now a well-developed branch of data analysis formalisms. It is strongly related to (but different from) the relational database theory on the one hand and fuzzy set theory on the other. In this paper we consider the connection of the theory of information and information retrieval systems with rough set theory and decision support systems. It is obvious that the model of a system created by Professor Pawlak makes data description and analysis simple and very reliable.

2 Information System

An information system consists of a set of objects and attributes defined on this set. In information systems with a finite number of attributes, there are classes created by these attributes (for each class, the values of the attributes are constant on elements from the class). Any collection of data, specified as a structure S = ⟨X, A, V, q⟩ such that X is a non-empty set of objects, A is a non-empty set of attributes, V is a non-empty set of attribute values, V = ⋃_{a∈A} Va, and q is an information function q : X × A → V, is referred to as an information system. The set {q(x, a) : a ∈ A} is called the information about the object x or, in short, a record of x or a row determined by x. Each attribute a is viewed as a mapping a : X → Va which assigns a value a(x) ∈ Va to every object x. A pair (a, v), where a ∈ A and v ∈ Va, is called a descriptor. In information systems, the descriptor language is a formal language commonly used to express and describe properties of objects and concepts. More formally, an information system is a pair A = (U, A), where U is a non-empty finite set of objects called the universe and A is a non-empty finite set of attributes such that a : U → Va for every a ∈ A. The set Va is called the value set of a. Now we will discuss which sets of objects can be expressed (defined) by formulas constructed by using attributes and their values. The simplest formulas, called descriptors, have the form (a, v), where a ∈ A and v ∈ Va. In each information system S the information language LS = ⟨AL, G⟩ is defined, where AL is the alphabet and G is the grammar part of that language.


AL is simply the set of all symbols which can be used to describe the information in such a system, e.g.:
1. {0, 1} (constant symbols),
2. A - the set of all attributes,
3. V - the set of all values of the attributes,
4. symbols of logical operations like ˜, + and ∗,
5. and naturally brackets, which are required to represent more complex information.

G - the grammar part of the language LS - defines the syntax, with TS as the set of all possible forms of terms (a term is a unit of information in S), and its meaning (semantics). A simple descriptor (a, v) ∈ TS (a ∈ A, v ∈ Va). If we denote such a descriptor (a, v) as the term t, then the following term formations are also possible: ¬t, t + t′, t ∗ t′, where t, t′ ∈ TS. The meaning is defined as a function σ which maps the set of terms of a system S into the set of objects X, σ : TS → P(X), where P(X) is the set of subsets of X. The value of σ for a given descriptor (a, v) is defined as follows [49]:
1. σ(a, v) = {x ∈ X : qx(a) = v},
2. σ(¬t) = X \ σ(t),
3. σ(t + t′) = σ(t) ∪ σ(t′),
4. σ(t ∗ t′) = σ(t) ∩ σ(t′).
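These semantics translate directly into a small evaluator. The sketch below is our illustration (the nested-tuple term encoding is our assumption); it maps a term to the set of matching objects:

def sigma(term, X, q):
    # Evaluate a term of the information language on the system (X, q).
    op = term[0]
    if op == 'desc':                      # descriptor (a, v)
        _, a, v = term
        return {x for x in X if q[x][a] == v}
    if op == 'not':
        return X - sigma(term[1], X, q)
    if op == '+':                         # sum of terms
        return sigma(term[1], X, q) | sigma(term[2], X, q)
    if op == '*':                         # product of terms
        return sigma(term[1], X, q) & sigma(term[2], X, q)
    raise ValueError(op)

X = {'x1', 'x2', 'x3'}
q = {'x1': {'a': 'a1', 'b': 'b1'}, 'x2': {'a': 'a1', 'b': 'b2'},
     'x3': {'a': 'a2', 'b': 'b1'}}
# objects with a = a1 and not b = b2
print(sigma(('*', ('desc', 'a', 'a1'), ('not', ('desc', 'b', 'b2'))), X, q))
# {'x1'}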

2.1 Information Table

Information systems are often represented in the form of tables, with the first column containing objects and the remaining columns, separated by vertical lines, containing the values of attributes. Such tables are called information tables (an example is presented in Table 1). The definition of this system is as follows: S = ⟨X, A, V, q⟩, where X = {x1, . . . , x8}, A = {a, b, c}, V = Va ∪ Vb ∪ Vc, Va = {a1, a2}, Vb = {b1, b2}, Vc = {c1, c2, c3, c4} and q : X × A → V. For instance, q(x1, a) = a1 and q(x3, b) = b1.

Table 1. An information system - an information table

student  a   b   c
x1       a1  b1  c1
x2       a1  b1  c2
x3       a2  b1  c3
x4       a2  b1  c4
x5       a1  b2  c1
x6       a1  b2  c2
x7       a2  b2  c3
x8       a2  b2  c4


Before we start considering the properties of an information system, it is necessary to explain what information in such a system means. The information in the system S is a function ρ with arguments on the attribute set A and values belonging to the set V (ρ(a) ∈ Va). As long as the sets of objects, attributes and their values are finite, we know exactly how many (different) pieces of information a given system S comprises; the number is equal to ∏_{a∈A} card(Va). The information ρ determines a set of objects Xρ such that Xρ = {x ∈ X : qx = ρ}. We call them indiscernible, because they have the same description. If we assume that B ⊆ A, then each subset B of A determines a binary relation IND_A(B), called an indiscernibility relation. By the indiscernibility relation determined by B, denoted by IND_A(B), we understand the equivalence relation IND_A(B) = {⟨x, x′⟩ ∈ X × X : ∀a∈B [a(x) = a(x′)]}. For a given information system it is possible to define the comparison of objects, attributes and even whole systems. We can find dependent and independent attributes in the data, and we can check whether attributes or even objects are equivalent. An important issue in data analysis is to discover dependencies between attributes. Intuitively, a set of attributes D depends totally on a set of attributes C if the values of attributes from C uniquely determine the values of the attributes from D. If D depends totally on C, then IND_A(C) ⊆ IND_A(D). This means that the partition generated by C is finer than the partition generated by D. Assume that a and b are attributes from the set A in a system S. We say that b depends on a (a → b) if the indiscernibility relation on a is contained in the indiscernibility relation on b: a ⊆ b. If a = b then the attributes are equivalent. The attributes are dependent if either of the conditions a ⊆ b or b ⊆ a is satisfied. Two objects x, y ∈ X are indiscernible in a system S relative to the attribute a ∈ A (x e_a y) if and only if qx(a) = qy(a). In the presented example, the objects x1 and x2 are indiscernible relative to the attributes a and b. The objects x, y ∈ X are indiscernible in a system S relative to all of the attributes a ∈ A (x e_S y) if and only if qx = qy. In the example there are no indiscernible objects in the system S. Each information system determines unequivocally a partition of the set of objects, which is a kind of classification. Finding the dependence between attributes lets us reduce the amount of information, which is crucial in systems with a huge number of attributes. Defining a system as a set of objects, attributes and their values is necessary to define the algorithm for searching the system and updating the data contained in it. Moreover, all information retrieval systems are also required to be implemented in this way. The ability to discern between perceived objects is also important for constructing various entities, not only reducts, but also decision rules and decision algorithms.
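The dependency test described above reduces to a comparison of the induced partitions: C → D holds when every IND(C)-class is contained in some IND(D)-class. A sketch of ours, reusing the idea of grouping objects by their attribute values:

from collections import defaultdict

def classes(q, B):
    # IND(B) equivalence classes of the system described by q.
    groups = defaultdict(set)
    for x, row in q.items():
        groups[tuple(row[a] for a in B)].add(x)
    return list(groups.values())

def depends(q, C, D):
    # True iff D depends totally on C, i.e. IND(C) is finer than IND(D).
    d_classes = classes(q, D)
    return all(any(c <= d for d in d_classes) for c in classes(q, C))

# With the system of Table 1: depends(q, ['a', 'b', 'c'], ['a']) -> True,
# while depends(q, ['a'], ['c']) -> False.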

IN DA (B) = {x, x ∈ X × X : ∀a∈B [a(x) = a(x )]}. For a given information system it is possible to deﬁne the comparison of the objects, attributes and even the whole systems. We can ﬁnd some dependent and independent attributes in data, we can check whether the attributes or even objects are equivalent. An important issue in data analysis is to discover dependencies between attributes. Intuitively, a set of attributes D depends totally on a set of attributes C if the values of attributes from C uniquely determine the values of the attributes from D. If D depends totally on C then IN DA (C) ⊆ IN DA (D). This means that the partition generated by C is ﬁner than the partition generated by D. Assume that a and b are attributes from the set A in a system S. We say that b depends on a (a → b), if the indiscernibility relation on a contains in the indiscernibility relation on b: a ⊆ b. If a = b then the attributes are equivalent. The attributes are dependent if any of the conditions: a ⊆ b or b ⊆ a is satisﬁed. Two objects x, y ∈ X are indiscernible in a system S relatively to the attribute a ∈ A (xea y) if and only if qx (a) = qy (a). In the presented example, the objects x1 and x2 are indiscernible relatively to the attributes a and b. The objects x, y ∈ X are indiscernible in a system S relatively to all of the attributes a ∈ A (xSey) if and only if qx = qy . In the example there are no indiscernible objects in the system S. Each information system determines unequivocally a partition of the set of objects, which is some kind of classiﬁcation. Finding the dependence between attributes let us to reduce the amount of the information which is crucial in systems with a huge numbers of attributes. Deﬁning a system as a set of objects, attributes and their values is necessary to deﬁne the algorithm for searching the system and updating the data consisted in it. Moreover, all information retrieval systems are also required to be implemented in this way. The ability to discern between perceived objects is also important for constructing various entities not only to form reducts, but also decision rules and decision algorithms. 2.2

An Application in Information Retrieval Area

The information retrieval issue is the main area of the employment of information systems. An information retrieval system, in which the objects are described by

From Information System to Decision Support System

383

their features (properties), we can deﬁne as follows: Let us have a set of objects X and a set of attributes A. These objects can be books, magazines, people, etc. The attributes are used to deﬁne the properties of the objects. For the system of books, the attributes can be author, year, number of sheets. An information system which is used for information retrieval should allow to ﬁnd the answer for a query. There are diﬀerent methods of retrieving information. Professor Pawlak proposed the atomic components method [2,49]. Its mathematical foundation was deﬁned in [5] and [6]. This method bases on the assumption that each question can be presented in the normal form, which is the sum of the products with one descriptor of each attribute only. To make the system capable of retrieving information it is required to create the information language (query language). This language should permit describing objects and forming user’s queries. Naturally enough, such a language has to be universal for both the natural and system language. Owing to this, all steps are done on the language level rather than on the database level. The advantages of information languages are not limited to the aforementioned features. There are a lot of systems that need to divide the information, which is called the decomposition of the system. It allows improving the time eﬃciency and make the updating process easy, but also enables the organization of the information in the systems. Information systems allow collecting data in a long term. It means that some information changes in time, and because of that, the system has a special property, which is called the dynamics of the system. Matching unstructured, natural-language queries and documents is diﬃcult because both queries and documents (objects) must be represented in a suitable way. Most often, it is a set of terms, where aterm is a unit of a semantic expression, e.g. a word or a phrase. Before a retrieval process can start, sentences are preprocessed with stemming and removing too frequent words (stopwords). The computational complexity when we move from simpler systems to more compound increases. For example, for atomic component retrieval method, the problem of rapidly growing number of atomic component elements is very important. Assuming that A is a set of attributes, and Va is a set of values of attribute a, where a ∈ A, in a given system we achieve a a∈A Va objects to remember. For example, if we have a 10 attributes in a given system S, and each of such attributes has 10 values, we have to remember 1010 of elements. 2.3

System with Decomposition

When the system consists of huge set of data it is very diﬃcult in given time to analyse those data. Instead of that, it is better to analyze the smaller pieces (subsets) of data, and at the end of the analysing, connect them to one major system. There are two main method of decomposition: with attributes or objects. A lot of systems are implemented with such type of decomposition. System with object’s decomposition. If it is possible to decompose the system S = X, A, V, q in a way that we gain subsystems with smaller number of objects, it means that:

384

A. Wakulicz-Deja and A. Nowak

S=

n

Si ,

i=1

where Si = Xi , A, V, qi , Xi ⊆ X and

i

Xi = X, qi : Xi × A → V , qi = q|Xi ×A .

System with attributes’s decomposition. When in system S there are often the same types of queries, about the same group of attributes, it means that such system should be divided to subsystems Si in a way that: S=

Si ,

i

where Si = X, Ai , Vi , qi , Ai ⊆ A and i Ai = A, Vi ⊆ V , qi : X × Ai → Vi , qi = q|X×Ai . Decomposition lets for optimization of the retrieval information process in the system S. The choice between those two kind of decomposition depends only on the type and main goal of such system. 2.4

Dynamic Information System and System with the History

In the literature information systems are classiﬁed according to their purposes: documentational, medical or management information systems. We propose different classiﬁcation: those with respect to dynamics of systems. Such a classiﬁcation gives possibility to: 1. 2. 3. 4.

Perform a joint analysis of systems belonging to the same class, Distinguish basic mechanisms occuring in each class of systems, Unify design techniques for all systems of a given class, Simplify the teaching of system operation and system design principles.

Analysing the performance of information systems, it is easy to see that the data stored in those systems are subject to changes. Those changes occur in deﬁnite moments of time. For example: in a system which contains personal data: age, address, education, the values of these attributes may be changed. Thus time is a parameter determining the state of the system, although it does not appear in the system in an explicit way. There are systems in which data do not change in time, at least during a given period of time. But there are also systems in which changes occur permanently in a determined or quite accidental way. In order to describe the classiﬁcation, which we are going to propose, we introduce the notion of a dynamic information system, being an extension of the notion of an information system presented by Professor Pawlak. Definition 1. A dynamic information system is a family of ordered quadruples: S = {Xt , At , Vt , qt }t∈T where: – T - is the discrete set of time moments, denoted by numbers 0, 1, . . . , N , – Xt - is the set of objects at the moment t ∈ T ,

(1)

From Information System to Decision Support System

– – – –

385

At - is the set of attributes at the moment t ∈ T , the set of values of the attribute a ∈ At , Vt (a) - is S Vt := a∈At Vt (a) - is the set of attribute values at the moment t ∈ T , qt - is a function which assigns to each pair x, a, x ∈ Xt , a ∈ At , an element of the set Vt , i.e. qt : Xt × At → Vt .

An ordered pair ⟨a, v⟩, a ∈ At, v ∈ Vt(a), is called a descriptor of the attribute a. We will denote by qt,x the map defined as follows:

qt,x : At → Vt,    (2)

qt,x(a) := qt(x, a)  for a ∈ At, x ∈ Xt, t ∈ T.    (3)

Let Inf(S) = {Vt^At}t∈T be the set of all functions from At to Vt for all t ∈ T. Functions belonging to Inf(S) will be called information at instant t; similarly, the functions qt,x will be called the information about object x at instant t in the information system S. Therefore, the information about an object x at instant t is nothing else but a description of object x at instant t, obtained by means of descriptors. We will examine more closely the changes which the particular elements (X, A, V, q) of a dynamic system may undergo at certain time moments (see also [46,47]). Systems whose parameters do not depend on time are discussed in [7]. Here we deal with dynamic systems in which the descriptions of objects depend essentially on time. It is useful to observe at the beginning that any dynamic system belongs to one of two classes of systems: time-invariant and time-varying systems.

Definition 2. A time-invariant system is a dynamic system such that:

1. ZT := ⋂t∈T Dqt ≠ ∅, and
2. for all t, t′ ∈ T and all (x, a) ∈ ZT: qt(x, a) = qt′(x, a),

where Dqt denotes the domain of the function qt.

Definition 3. A time-varying system is a dynamic system such that:

1. ZT := ⋂t∈T Dqt = ∅, or
2. ZT ≠ ∅ and there exist t, t′ ∈ T and (x, a) ∈ ZT such that qt(x, a) ≠ qt′(x, a).
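The following sketch (our illustration, with hypothetical names) encodes these definitions: a dynamic system stores one information function per time moment, and time-invariance is tested by checking both conditions of Definition 2 on the common domain ZT. Object/attribute pairs are encoded as strings, assuming names contain no '|' character.

```java
import java.util.*;

// Sketch of a dynamic information system: q.get(t).get(x).get(a) is the
// value of attribute a on object x at moment t (Definition 1).
public class DynamicSystem {

    final List<Map<String, Map<String, String>>> q = new ArrayList<>();

    // Definition 2: Z_T nonempty and every q_t equal on Z_T.
    boolean isTimeInvariant() {
        Set<String> zT = null;                        // pairs encoded as "x|a"
        for (Map<String, Map<String, String>> qt : q) {
            Set<String> pairs = new HashSet<>();
            qt.forEach((x, row) -> row.keySet().forEach(a -> pairs.add(x + "|" + a)));
            if (zT == null) zT = pairs; else zT.retainAll(pairs);
        }
        if (zT == null || zT.isEmpty()) return false; // condition 1 fails
        for (String pair : zT) {                      // condition 2
            String[] xa = pair.split("\\|");
            String first = q.get(0).get(xa[0]).get(xa[1]);
            for (Map<String, Map<String, String>> qt : q)
                if (!first.equals(qt.get(xa[0]).get(xa[1]))) return false;
        }
        return true;
    }
}
```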

2.5 Time-Invariant Systems

Let XT, AT be the sets of objects and attributes of the dynamic system, defined as follows:

– XT := ⋂t∈T Xt,
– AT := ⋂t∈T At.



It is evident from the definition of the time-invariant system that a dynamic system is time-invariant if and only if

q := qt|ZT    (4)

does not depend on t and ZT = XT × AT. It means that any time-invariant system S = {⟨Xt, At, Vt, qt⟩}t∈T has a subsystem S′ in Pawlak's notion which is time independent:

S′ = ⟨XT, AT, q(ZT), q⟩.

Let us consider a system of library information in which books are objects, the set of attributes is given by author's name, title, publisher's name, year of issue, subject, etc., and attribute values are given in natural language [48,49]. Let us consider the time evolution of this system on the example given by its subsystem connected with four books:

– b1 = C.J. Date, An Introduction to Database Systems,
– b2 = G.T. Lancaster, Programming in COBOL,
– b3 = Ch.T. Meadow, The Analysis of Information Systems,
– b4 = G. Salton, The SMART Retrieval System,

and four attributes: publisher, year of issue, number of pages, subject. The history of our library in the years 1980, 1981, 1982, described by our subsystem, depends on two events. Beginning from 1981, our library information was enriched with information about the subject of each book and the book b4 was bought; in 1982 the book b3 was lost. This situation is given by the dynamic system S = {⟨Xt, At, Vt, qt⟩}t=1980,1981,1982, described in Tables 2, 3, and 4. Table 5 presents a time-invariant subsystem S′ = {⟨Xt, At, Vt, q⟩}. It is easy to see that in the dynamic system described above XT = {b1, b2}, AT = {Publisher, Year, Pages}, and VT is as given in Table 5, which proves that q|XT×AT is time independent, i.e. the system described in the example is time-invariant.

Table 2. S = {⟨Xt, At, Vt, qt⟩}t=1980

X1980 \ A1980 | Publisher                               | Year | Pages
b1            | Addison-Wesley Publish. Comp. Inc., USA | 1977 | 493
b2            | Pergamon Press, Oxford, New York        | 1972 | 180
b3            | John Wiley & Sons Inc., New York        | 1967 | 339

Table 3. S = {⟨Xt, At, Vt, qt⟩}t=1981

X1981 \ A1981 | Publisher                                 | Year | Pages | Subject
b1            | Addison-Wesley Publish. Comp. Inc., USA   | 1977 | 493   | Databases
b2            | Pergamon Press, Oxford, New York          | 1972 | 180   | Programming
b3            | John Wiley & Sons Inc., New York          | 1967 | 339   | Information Sys.
b4            | Prentice-Hall Inc., Englewood-Cliffs, USA | 1971 | 585   | Retrieval Sys.



Table 4. S = {⟨Xt, At, Vt, qt⟩}t=1982

X1982 \ A1982 | Publisher                                 | Year | Pages | Subject
b1            | Addison-Wesley Publish. Comp. Inc., USA   | 1977 | 493   | Databases
b2            | Pergamon Press, Oxford, New York          | 1972 | 180   | Programming
b4            | Prentice-Hall Inc., Englewood-Cliffs, USA | 1971 | 585   | Retrieval Systems

Table 5. Time-invariant subsystem S′ = {⟨Xt, At, Vt, q⟩}

XT \ AT | Publisher                               | Year | Pages
b1      | Addison-Wesley Publish. Comp. Inc., USA | 1977 | 493
b2      | Pergamon Press, Oxford, New York        | 1972 | 180

2.6 Time-Varying Systems

If ⋂t∈T Xt = ∅ or ⋂t∈T At = ∅, i.e. ZT = ∅, then the system is obviously time dependent on T, since there does not exist an element x belonging to all Xt or an attribute a belonging to all At. If ZT ≠ ∅, then the dynamic system S = {⟨Xt, At, Vt, qt⟩}t∈T has a subsystem

S′ = {⟨XT, AT, qt(ZT), qt|ZT⟩}, t ∈ T,

and we can observe that this system is not time-invariant, since by the definition of the time-varying system there exist t, t′ ∈ T and (x, a) ∈ ZT such that qt(x, a) ≠ qt′(x, a). A system which contains information about students [27] is a good example of a system with time-varying information. The set of objects is the set of all students of a fixed University [Faculty, Course]. As the set of attributes we may choose, for example: STUDY-YEAR, GROUP, MARK-OF-MATH, MARK-OF-PHYSICS, AV-MARK and so on. Descriptors are, as before, pairs of the form ⟨attribute, value⟩, where the sets of attribute values are as follows:

||STUDY-YEAR|| = {I, II, III, …},
||GROUP|| = {1, 2, 3, …},
||MARK-OF-MATH|| = {2, 3, 4, 5},
||MARK-OF-PHYSICS|| = {2, 3, 4, 5},
||AVERAGE-MARK|| = {2, 2.1, 2.2, …, 5}.

Let us assume that a student advances to the next study year if his average mark lies between 3 and 5. If not, the student remains in the same year of studies. If there is no change in the study year, the student can change the student group. Let us consider the history of three students s1, s2, s3, beginning with the first year of their studies, during the following three years. The situation in the system


Table 6. First year of observation

X1 \ A1 | Year | Group | Av.mark
s1      | I    | 1     | −
s2      | I    | 1     | −
s3      | I    | 2     | −

Table 7. Second year of observation

X2 \ A2 | Year | Group | Av.mark
s1      | I    | 3     | 3.1
s2      | II   | 1     | 4.1
s3      | II   | 2     | 3.3

Table 8. Third year of observation

X3 \ A3 | Year | Group | Av.mark
s1      | II   | 3     | 3.1
s2      | III  | 1     | 4.8
s3      | II   | 1     | 3.7

is described in Tables 6, 7, and 8. One can observe that XT = {s1, s2, s3}, AT = {STUDY-YEAR, GROUP, AV-MARK}, and

qt(s1, STUDY-YEAR) = I   for t = 1st year of observation,
qt(s1, STUDY-YEAR) = I   for t = 2nd year of observation,    (5)
qt(s1, STUDY-YEAR) = II  for t = 3rd year of observation,

which means that the system is a time-varying system.

2.7 Variability of Information in Dynamic Systems

In time-varying systems we can observe various types of information changes. If the set ZT = (⋂t∈T Xt) × (⋂t∈T At) ≠ ∅, then the important features of the character of the changes of information in time are described by the dynamic subsystem S′:

S′ = {⟨XT, AT, qt(ZT), qt|ZT⟩}t∈T.

In the subclass of dynamic systems represented by the system S′, the state of the system depends on time t only through the family {qt}t∈T. Due to the way this subclass of systems is realized in practice, it is sensible to consider such realizations of systems which allow determining the values of the function:

f(x, a, qt−1(x, a), …, qt−i(x, a))  for all x ∈ XT, a ∈ AT and t ∈ T.



By f we denote any function which is feasible in the considered realization, and by i the so-called depth of information, which can assume the values 0, 1, …, I. When i = 0 the function f depends on x and a only. One can observe that such realizations of systems do not give the possibility of determining values of a function which explicitly depends on t. This is one of the features which distinguish dynamic information systems from data processing systems. From the point of view of the realizations described above, any dynamic system belongs to one of the following classes:

1. Systems with determined variability (SDV). A dynamic system belongs to SDV if and only if:
   – for every (x, a) ∈ ZT there exist initial values q−1(x, a), …, q−i(x, a) ∈ ⋃t∈T Vt such that qt(x, a) = f(x, a, qt−1(x, a), …, qt−i(x, a)) for all t ∈ T and all (x, a) ∈ ZT, for a properly chosen (feasible) function f.

2. Systems with predictable variability (SPV). A dynamic system belongs to SPV if and only if:
   – it does not belong to SDV,
   – there exist T1, …, TM ⊂ T (⋃j=1,…,M Tj = T, Tj ∩ Tk = ∅ for j ≠ k, j, k = 1, …, M, card Tj > 1 for j = 1, …, M) and feasible functions f1, …, fM such that qt(x, a) = fj(x, a, qt−1(x, a), …, qt−ij(x, a)) for all t ∈ Tj and all (x, a) ∈ ZT, for properly chosen initial values q−1(x, a), …, q−ij(x, a).

3. Systems with unpredictable variability (SUV). A dynamic system belongs to SUV if and only if:
   – it does not belong to SDV or SPV.

It is worth underlining that systems whose structure is formally simple can belong to SUV. For example, the system whose information function is determined as follows:

qt(x, a) = f1(x, a, qt−1(x, a), …, qt−i1(x, a))  or  qt(x, a) = f2(x, a, qt−1(x, a), …, qt−i2(x, a))    (6)

belongs to SUV as long as it is not determined for which t the function f1 is applied and for which f2.

2.8 Examples of Time-Varying Systems

Examples of systems belonging to the SDV, SPV and SUV classes are given here. An example of a system with determined variability (SDV) can be a system of patient supervision (medical information). The objects of this system are patients. The attributes are, for example, test of blood morphology, lungs X-ray, prescribed penicillin, prescribed doses of vitamins (in mg), etc. Table 9 presents the prescriptions of medicaments and tests for patients p1, p2, p3, p4 at the beginning of the considered system performance (t = 0). Let us describe the system performance on the example of the patient p2 who after a minor surgery got a bacterial infection. The physician's prescription is as follows: penicillin injections P for the six forthcoming days, blood morphology test T every third day. This prescription gives the table (Table 10) of the functions qt(p2, P) and qt(p2, T).

Table 9. The prescriptions of medicaments and tests for patients

X \ A | Blood test | Lungs X-ray | Penicillin injections | Vitamins
p1    | 1          | −           | −                     | 0.2 C
p2    | −          | −           | 1                     | −
p3    | +          | −           | −                     | 0.03 B1
p4    | −          | 1           | −                     | −

Table 10. The physician's prescription for a given patient

A \ t | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
P     | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0
T     | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0

One can observe that, using Boolean algebra notation, these functions can be written in the following form:

qt(p2, P) = qt−1(p2, P)·[¬qt−2(p2, P) + ¬qt−7(p2, P)],    (∗)
qt(p2, T) = ¬qt−1(p2, T)·¬qt−2(p2, T),    (∗∗)    (7)

if only the initial values are given as follows:

q−1(p2, P) = 1,
q−j(p2, P) = 0 for j = 2, 3, …, 7 (information depth = 7),    (8)
q−k(p2, T) = 1 for k = 1, 2 (information depth = 2).
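As a check, the following sketch (our illustration, assuming the 0/1 encoding of Table 10, with not(·) denoting the Boolean complement used in (∗) and (∗∗)) replays both recurrences from the initial values (8) and reproduces the two rows of Table 10:

```java
// Sketch verifying formulas (*) and (**) against Table 10 (our illustration).
public class PrescriptionRecurrence {

    static int not(int b) { return 1 - b; }   // Boolean complement of 0/1

    public static void main(String[] args) {
        int depthP = 7, depthT = 2, horizon = 11;
        int[] p = new int[horizon + depthP];   // p[j] holds q_{j-depthP}(p2, P)
        int[] t = new int[horizon + depthT];   // t[j] holds q_{j-depthT}(p2, T)
        p[depthP - 1] = 1;                     // q_{-1}(p2,P)=1, q_{-j}=0 for j=2..7
        t[0] = 1; t[1] = 1;                    // q_{-1}(p2,T) = q_{-2}(p2,T) = 1
        for (int i = 0; i < horizon; i++) {
            int jp = i + depthP, jt = i + depthT;
            // (*):  q_t(P) = q_{t-1}(P) * [not q_{t-2}(P) + not q_{t-7}(P)]
            p[jp] = p[jp - 1] * Math.min(1, not(p[jp - 2]) + not(p[jp - 7]));
            // (**): q_t(T) = not q_{t-1}(T) * not q_{t-2}(T)
            t[jt] = not(t[jt - 1]) * not(t[jt - 2]);
        }
        // Expected: P = 1 1 1 1 1 1 0 0 0 0 0,  T = 0 0 1 0 0 1 0 0 1 0 0
        for (int i = 0; i < horizon; i++) System.out.print(p[i + depthP] + " ");
        System.out.println();
        for (int i = 0; i < horizon; i++) System.out.print(t[i + depthT] + " ");
    }
}
```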

The formulas (∗), (∗∗) convince us that the described system (at least reduced to the object p2 and the attributes P and T) is SDV. Other systems of the SDV class can be found in [29,28,30]. As an example of SPV we use the system with students, and we assume that T1, T2, T3 are time intervals determined as follows: T1: from Oct. 1st 1980 to Sept. 30th 1981, T2: from Oct. 1st 1981 to Sept. 30th 1982, T3: from Oct. 1st 1982 to Sept. 30th 1983. It is easy to see that the functions qt(si, Y), qt(si, G), qt(si, A.m), i = 1, 2, 3, are constant in each time interval T1, T2, T3. Therefore on each interval T1, T2, T3 these functions are realizable (information depth = 0) and the system belongs to



Table 11. An example of a system belonging to SUV

X \ A | Storage | Prod. division (I) | Prod. division (II)
M1    | 200     | 50                 | 30
M2    | 100     | 10                 | 20
M3    | 0       | 5                  | 4

SPV. Finally, let us consider a system which describes materials management in a factory. Objects in this system are different types of materials. The attributes can be production divisions and/or workstands and the main storage. The attribute values are given in units (of weight, measure, etc.) which are natural for the described object. Let us consider a system of this kind reduced to three objects, with the storage and two production divisions as attributes. Let the attribute values be given in Table 11. It is obvious that the state of resources of an object Mi at the production division K (K = I, II) depends not only on the information function qt−1(Mi, K) but also on information functions defined on other attributes, i.e. it depends on qt−1(M1, II), qt−1(M1, St.), qt−1(M1, I); therefore it is not a function which can be used as an information function according to the definition of a dynamic system. Moreover, the values of the functions qt(Mi, St.) are not determined a priori and generally we cannot determine the moments at which these values will be changed. This system of course does not belong to SDV or SPV. Therefore it belongs to SUV. Examples of systems belonging to the SUV class can also be found in [27,31].

2.9 Influence of Foundations of a System on Its Classification

Analysing the foundations of a real system, we can determine to which of the classes described above the system belongs. Thus, e.g., if we assume that the objects of the system are documents with static or rarely changing descriptions, then this system will belong to the class of invariant systems. The characteristics of most library systems directly imply that they belong to the class of time-invariant systems. In the same way, an assumption about variability in document descriptions will suggest that a system containing such documents belongs to the class of systems with time-varying information. Of course, if we are able to determine the moments at which description changes will occur, then it will be a system with predictable variability (SPV). If we are not able to determine these moments, we will obtain a system with unpredictable variability (SUV). Some systems are a priori classified as systems with determined variability (SDV), because knowledge of the "histories" of objects is one of the requirements, as in medical systems for example. So, the foundations of the realized information system decide a priori about its classification, which, in consequence, suggests a priori certain performance mechanisms of this system. Many existing systems are actually packages of systems belonging to different classes (e.g. a medical system may consist of a registration module, which is a time-invariant system, and a module of patient supervision, which belongs to the class of time-varying systems). In this case each of the modules is designed as a system of an appropriate



class. The classification resulting from the analysis of the performance of information systems can be a convenient tool for design purposes. When somebody starts designing a system, he has good knowledge of the system's foundations and parameters, but generally he cannot predict the proper mechanisms of system performance. In this situation, as was stated above, he can determine the class to which the system belongs. This allows him to choose adequate mechanisms of system performance.

2.10 Performance Mechanisms in Dynamic Systems

In a realization of information systems we should make decisions about the structure of the database and the way of updating it, about the retrieval method and the retrieval language we are going to use, and about the mode of operation which will be used in the system. In the following we give some remarks on how these decisions depend on the fact that the considered system belongs to one of the determined classes, i.e. the class of invariant systems, the class of systems with determined variability (SDV), the class of systems with predictable variability (SPV), or the class of systems with unpredictable variability (SUV).

Database and its updating. At first let us consider invariant systems. The database of an invariant system is static throughout the period of performance. A reorganization of the database, if desired, is realized after the period of performance and consists in creating a new database. In systems with time-varying information the database changes during the action of the system. In systems with determined variability (SDV) we have to store information about an object in the past, because this information is necessary for determining the actual information about this object. Thus the "history", with a prescribed depth of information about objects, should be stored in the database. In systems with predictable variability (SPV) actualization and reorganization of the database ought to be executed at certain moments at which changes are predicted. These are mainly changes in the descriptions of objects. The database reorganization (actualization) does not necessarily involve changes in programs operating on the database. In systems with unpredictable variability (SUV) any execution of the retrieval process ought to be preceded by the actualization of the descriptions of objects. In all systems with time-varying information we can have, in the same period, an actualization of the set of objects, the set of attributes and the set of descriptors, as in invariant systems.

Retrieval method and information retrieval language. Because of the specific character of the database and the actualization process, one prefers exhaustive search as a retrieval method for invariant systems. In such a case an extension of the database does not affect the retrieval method. At most, in order to speed up the system performance, one may apply the methods of inverted files or linked lists. These methods are more useful for some systems with predictable variability (information depth = 0). There, when the system action is stopped, the database can be actualized along with updating of the inverted files or linked lists. In these systems there is no need for developing special information



retrieval languages, because languages based on thesauruses, indexing or decimal classification seem to be sufficiently efficient. However, in the invariant systems and systems with predictable variability one can prefer a specific method of retrieval. For the realization of systems with time-varying information, a grouping of information and random access to the descriptions of these groups, or to an individual description of an object, is essential. Mathematical methods of retrieval seem to be the most convenient in this case (for example: Lum's methods [23] or the atomic component method with decomposition of the system). These retrieval algorithms allow us to find a particular piece of information quickly, and they also simplify the updating process. In the case of systems with determined variability (SDV) this problem looks a bit different, because new information is constantly created and has to be stored. In this case the method of linked lists seems to be as good as the mathematical methods (e.g. the method of atomic components). In the method of linked lists the actual information about an object is obtained by consideration of a chain of a determined length given by the depth of information. In systems with time-varying information a language based on descriptors is the most convenient for information retrieval, since it allows us to easily write/read information described by means of codes which are equivalents of descriptors. Moreover, in this case the descriptions of objects are determined by the values of attributes. Information in time-varying systems is always described by means of codes; therefore all output information is translated into natural language. Consequently, from the user's point of view, there is no difference whether the system uses the descriptor language or another language. In some cases, when this translation can be omitted (e.g. in medical systems which are used by medical service), the descriptors ought to be introduced in accordance with codes accepted by the user. Here we ought to mention interactive languages, which seem to be necessary for most systems with time-varying information (the necessity of a dialogue with the system), but they will be discussed later on, along with the operation mode of dynamic systems.

Operation mode. Let us consider now the continuous operation mode and the batch operation mode in an information system. The continuous operation mode consists in current (i.e. ∀t∈T) information feeding; therefore we have current database updating. This operation mode will occur in systems with unpredictable variability; actualization processes will be executed in turns with retrieval processes. In most cases, however, information systems work in the batch operation mode, which means that actualization and reorganization processes take place at certain moments. This operation mode can be used in invariant systems and time-varying systems with predictable variability (SPV). The case of the interactive operation mode is a bit different, since the user is able to communicate with the system. If this mode is used only for retrieval purposes (to find more complete or relevant information), then it can be applied to a system of an arbitrary class. But if the goal of this dialogue is to create a new database structure (internal changes), then interactive systems are limited to the class of systems with unpredictable variability (SUV). At the end, let us mention that due to the structure of the dynamic model discussed



here (the definition of the dynamic information system), performance mechanisms are applied to any pair (x, a), x ∈ XT, a ∈ AT, separately. Thus all reorganizations of the model which are based on concurrent processing and multi-access give high efficiency of the information system in practice.

Conclusion. In this paper a possibility of introducing dynamics into Pawlak's model of systems is presented. In most practical situations this model is more convenient than the classical (relational) model. This is due to the fact that in Pawlak's model information about an object is given by functions, while in the classical model information is determined by relations. This simplifies the description of systems and their analysis, which is important not only for system design but also for teaching system operation. The authors think that the only way of teaching how to use a system and how to design it goes through understanding the mechanisms of system operation. For the model presented here, the proposed classification allows fulfilling this goal more easily. The model of information system created by Pawlak is very useful for building and analysing different types of information retrieval systems. Document information systems are a very specific type of information systems, and Pawlak's model is very well suited to defining the information in them.

3 Decision Support Systems

When data mining first appeared, several disciplines related to data analysis, like statistics or artificial intelligence, were combined towards a new topic: extracting significant patterns from data. The original data sources were small datasets and, therefore, traditional machine learning techniques were the most common tools for these tasks. As the volume of data grew, these traditional methods were reviewed and extended with the knowledge of experts working in the field of data management and databases. Because of that, information systems equipped with some data mining methods started to become decision support systems. A decision support system is a kind of information system which classifies each object to some class denoted by one of the attributes, called the decision attribute. While an information system is simply a pair of the form (U, A), a decision support system is a pair S = (U, C ∪ {d}) with a distinguished attribute d. In the case of a decision table, the attributes belonging to C are called conditional attributes or simply conditions, while d is called the decision. We will further assume that the set of decision values is finite. The i-th decision class is a set of objects Ci = {x ∈ U : d(x) = di}, where di is the i-th decision value taken from the decision value set Vd = {d1, …, d|Vd|}. Let us consider the decision table presented as Table 12. In the presented system (with information about students): C = {a, b, c}, D = {d}.



Table 12. Decision table

student | a  | b  | c  | d
x1      | a1 | b1 | c1 | T
x2      | a1 | b1 | c2 | T
x3      | a2 | b1 | c3 | T
x4      | a2 | b1 | c4 | N
x5      | a1 | b2 | c1 | N
x6      | a1 | b2 | c2 | T
x7      | a2 | b2 | c3 | T
x8      | a2 | b2 | c4 | N

Having the indiscernibility relation, we may define the notion of a reduct. In the case of decision tables, a decision reduct is a set B ⊆ C of attributes which cannot be further reduced and IND(B) ⊆ IND(d). A decision rule is a formula of the form

(ai1 = v1) ∧ … ∧ (aik = vk) ⇒ (d = vd),

where 1 ≤ i1 < … < ik ≤ m, vj ∈ Vaij. We can simply interpret such a formula similarly to natural language, with if and then elements. In the given decision table, the decision rule for object x1 is given as: if (a = a1) and (b = b1) and (c = c1) then (d = T), the same as (a = a1) ∧ (b = b1) ∧ (c = c1) → (d = T). Atomic subformulas (ai1 = v1) are called conditions or premises. We say that a rule r is applicable to an object, or alternatively that the object matches the rule, if its attribute values satisfy the premise of the rule. Each object x in a decision table determines a decision rule

∀a∈C (a = a(x)) ⇒ (d = d(x)),

where C is the set of conditional attributes and d is the decision attribute. Decision rules corresponding to some objects can have the same condition parts but different decision parts. We use decision rules to classify given information. When the information is uncertain or just incomplete, there is a need to use some additional techniques for information systems. Numerous methods based on the rough set approach combined with Boolean reasoning techniques have been developed for decision rule generation.
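A minimal sketch of rule applicability (our illustration; the names are hypothetical): a rule stores its premise as attribute-value conditions and matches an object exactly when all conditions are satisfied by the object's attribute values:

```java
import java.util.*;

// Sketch: a decision rule as a premise map plus a decision value,
// checked against objects of Table 12.
public class DecisionRule {

    final Map<String, String> premise;   // e.g. {a=a1, b=b1, c=c1}
    final String decision;               // e.g. "T"

    DecisionRule(Map<String, String> premise, String decision) {
        this.premise = premise;
        this.decision = decision;
    }

    // The rule is applicable to x (x matches the rule) iff the attribute
    // values of x satisfy every condition of the premise.
    boolean appliesTo(Map<String, String> object) {
        for (var c : premise.entrySet())
            if (!c.getValue().equals(object.get(c.getKey()))) return false;
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> x1 = Map.of("a", "a1", "b", "b1", "c", "c1");
        DecisionRule r = new DecisionRule(Map.of("a", "a1", "b", "b1", "c", "c1"), "T");
        System.out.println(r.appliesTo(x1));   // true -> classify x1 as d = T
    }
}
```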

4 Rough Sets

Rough set theory has been applied successfully in such fields as machine learning, data mining, etc., since Professor Pawlak developed it in 1982. Reduction



of decision tables is one of the key problems of rough set theory. The methodology is concerned with the classificatory analysis of imprecise, uncertain or incomplete information or knowledge expressed in terms of data acquired from experience. The primary notions of the theory of rough sets are the approximation space and the lower and upper approximations of a set. The approximation space is a classification of the domain of interest into disjoint categories. The membership status with respect to an arbitrary subset of the domain may not always be clearly definable. This fact leads to the definition of a set in terms of lower and upper approximations [9,10,11].

4.1 The Basic Notions

One of the basic foundations of rough set theory is the indiscernibility relation, which is generated using information about particular objects of interest. Information about objects is represented in the form of a set of attributes and their associated values for each object. The indiscernibility relation is intended to express the fact that, due to lack of knowledge, we are unable to discern some objects from others simply by employing the available information about those objects. Any set of all indiscernible (similar) objects is called an elementary set, and forms a basic granule (atom) of knowledge about the universe. Any union of some elementary sets in a universe is referred to as a crisp set; otherwise the set is referred to as a rough set. Then, two separate unions of elementary sets can be used to approximate the imprecise set. Vague or imprecise concepts, in contrast to precise concepts, cannot be characterized solely in terms of information about their elements, since elements are not always discernible from each other. There is an assumption that any vague or imprecise concept is replaced by a pair of precise concepts called the lower and the upper approximation of the vague or imprecise concept.

4.2 Lower/Upper Approximation

The lower approximation is a description of the domain objects which are known with certainty to belong to the subset of interest, whereas the upper approximation is a description of the objects which possibly belong to the subset. Any subset defined through its lower and upper approximations is called a rough set. It must be emphasized that the concept of a rough set should not be confused with the idea of a fuzzy set, as they are fundamentally different, although in some sense complementary, notions. The rough set approach allows us to precisely define the notion of concept approximation. It is based on the indiscernibility relation between objects, defining a partition of the universe U of objects. The indiscernibility of objects follows from the fact that they are perceived by means of values of available attributes. Hence some objects having the same (or similar) values of attributes are indiscernible. Let S = (U, C ∪ D) be an information system; then with any B ⊆ C there is associated an equivalence relation INDS(B), called the B-indiscernibility relation, whose classes are denoted by [x]B.



For B ⊆ C and X ⊆ U, we can approximate X using only the information contained in B by constructing the B-lower and B-upper approximations of X, denoted B̲X and B̄X respectively, where:

B̲X = {x : [x]B ⊆ X} and B̄X = {x : [x]B ∩ X ≠ ∅}.

The B-lower approximation of X is the set of all objects which can be certainly classified to X using attributes from B. The difference between the upper and the lower approximation constitutes the boundary region of a vague or imprecise concept. Upper and lower approximations are two of the basic operations in rough set theory.
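Both approximations translate directly into a set computation over the B-indiscernibility classes, as in the following sketch (our illustration, not a library API):

```java
import java.util.*;

// Sketch: B-lower and B-upper approximations of X, computed from the
// partition of U into B-indiscernibility classes.
public class Approximations {

    // Signature of object x on attributes B, used as the label of [x]_B.
    static String signature(Map<String, String> x, Set<String> b) {
        StringBuilder s = new StringBuilder();
        for (String a : new TreeSet<>(b)) s.append(a).append('=').append(x.get(a)).append(';');
        return s.toString();
    }

    // Returns {lower, upper} approximations of the concept X ⊆ U.
    static List<Set<String>> approximate(Map<String, Map<String, String>> u,
                                         Set<String> b, Set<String> x) {
        Map<String, Set<String>> classes = new HashMap<>();  // [x]_B classes
        u.forEach((id, obj) ->
            classes.computeIfAbsent(signature(obj, b), k -> new HashSet<>()).add(id));
        Set<String> lower = new HashSet<>(), upper = new HashSet<>();
        for (Set<String> cls : classes.values()) {
            if (x.containsAll(cls)) lower.addAll(cls);              // [x]_B ⊆ X
            if (!Collections.disjoint(cls, x)) upper.addAll(cls);   // [x]_B ∩ X ≠ ∅
        }
        return List.of(lower, upper);
    }
}
```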

4.3 Reduct and Core of Attributes

In the rough set area there is also a very important problem of finding (selecting) relevant features (attributes), whose source is the so-called core of the information system S. A reduct is a minimal set of attributes B ⊆ C such that INDS(B) = INDS(C), which means that it is a minimal set of attributes from C that preserves the original classification defined by the set C of attributes. The intersection of all reducts is the so-called core. In the example, both the core and the reduct consist of attributes b and c (CORE(C) = {b, c}, RED(C) = {b, c}).

4.4 Rule Induction

Rough set based rule induction methods have been applied to knowledge discovery in databases; the empirical results obtained show that they are very powerful and that some important knowledge has been extracted from databases. For rule induction, lower/upper approximations and reducts play important roles, and the approximations can be extended to the variable precision model, although using accuracy and coverage for rule induction has not been widely discussed. We can use the discernibility function fS to form minimal decision rules for a given decision table [1]. For an information system S = (U, C ∪ {d}) with n objects, the discernibility matrix of S is a symmetric n × n matrix with entries cij defined as:

cij = {a ∈ C | a(xi) ≠ a(xj)} for i, j = 1, 2, …, n such that d(xi) ≠ d(xj).

Each entry consists of the set of attributes upon which objects xi and xj differ. A discernibility function fS for an information system S is a Boolean function of m Boolean variables a∗1, …, a∗m (corresponding to the attributes a1, …, am) defined by:

fS = ⋀ { ⋁ c∗ij : 1 ≤ j ≤ i ≤ n, cij ≠ ∅ },    (9)

where c∗ij = {a∗ : a ∈ cij}.
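A sketch of the discernibility matrix computation (our illustration); the discernibility function fS is then obtained as the conjunction of the disjunctions over the nonempty entries returned here:

```java
import java.util.*;

// Sketch: entries c_ij of the discernibility matrix of a decision table,
// i.e. the attributes on which x_i and x_j differ, computed only for
// pairs of objects with different decisions.
public class DiscernibilityMatrix {

    static Set<String>[][] build(List<Map<String, String>> objects,
                                 List<String> conditions, List<String> decisions) {
        int n = objects.size();
        @SuppressWarnings("unchecked")
        Set<String>[][] c = new Set[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < i; j++) {
                c[i][j] = new HashSet<>();
                if (decisions.get(i).equals(decisions.get(j))) continue; // d(x_i) = d(x_j)
                for (String a : conditions)
                    if (!objects.get(i).get(a).equals(objects.get(j).get(a)))
                        c[i][j].add(a);   // x_i and x_j differ on attribute a
            }
        return c;
    }
}
```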



For the given decision table we formed the following set of rules:

– rule nr 1: if a = a1 and b = b1 then d = T
– rule nr 2: if b = b1 and c = c1 then d = T
– rule nr 3: if b = b1 and c = c2 then d = T
– rule nr 4: if c = c3 then d = T
– rule nr 5: if c = c4 then d = N
– rule nr 6: if b = b2 and c = c1 then d = N
– rule nr 7: if c = c2 then d = T

4.5 Rough Set Theory and Decision Systems in Practice

The main specific problems addressed by the theory of rough sets are not only the representation of uncertain or imprecise knowledge and knowledge acquisition from experience, but also the analysis of conflicts, the identification and evaluation of data dependencies, and the reduction of the amount of information. A number of practical applications employing this approach have been developed in recent years in areas such as medicine, drug research, process control and others. The recent publication of a monograph on the theory and a handbook on applications facilitates the development of new applications. One of the primary applications of rough sets in artificial intelligence is knowledge analysis and data mining [12,13,16,17]. Of the two expert systems implemented at the Silesian University, MEM is the one with the decision table in the form of the knowledge base. It is a diagnosis support system used in child neurology and it is a notable example of a complex multistage diagnosis process. It permits the reduction of attributes, which allows improving the rules acquired by the system. MEM was developed on the basis of real data provided by the Second Clinic of the Department of Paediatrics of the Silesian Academy of Medicine. The system is employed there to support the classification of children having mitochondrial encephalopathies and considerably reduces the number of children directed for further invasive testing in the consecutive stages of the diagnosis process [18,19]. The work contains an example of applying rough set theory to support decision making. The created system maximally limits the indications for invasive diagnostic methods that finally decide about the diagnosis. The system arose using induction (machine learning from examples), one of the methods of artificial intelligence. A three-stage classification has been created. The most important problem was an appropriate choice of attributes for the classification process and the generation of a set of rules, a base for making decisions in new cases. Rough set theory provides appropriate methods to solve this problem. A detailed analysis of the medical problem results in creating a three-staged diagnostic process, which allows classifying children into those suffering from mitochondrial encephalomyopathy and those suffering from other diseases. The data on which the decisions were based, like any real data, contained errors. Incomplete information was one of them. It resulted from the fact that some observations or examinations could not be made for all patients. Inconsistency of information was another



problem. Inconsistency occurred because there were patients who were differently diagnosed at the same values of the analyzed parameters. Additionally, developing a decision support system for diagnosing involved the reduction of knowledge, generating decision rules and a suitable classification of new information. The first stages of research on decision support systems concentrated on methods to represent the knowledge in a given system and methods of verification and validation of a knowledge base [14]. Recent works, however, deal with the following problems: a huge number of rules in a knowledge base with numerous premises in each rule, a large set of attributes, many of which are dependent, complex inference processes, and the problem of the proper interpretation of the decision rules by users. Fortunately, cluster analysis brings very useful techniques for the smart organisation of the rules, one of which is a hierarchical structure. It is based on the assumption that rules that are similar can be placed in one group. Consequently, in each inference process we can find the most similar group and run the forward chaining procedure on this, significantly smaller, group only. The method reduces the time consumption of all processes and explores only the new facts that are actually necessary, rather than all facts that can be retrieved from a given knowledge base. In our opinion, clustering rules for inference processes in decision support systems could prove useful for improving the efficiency of those systems [3,4]. A very important issue for knowledge base modularization is the concept proposed in [26], where the conception of decision units was presented. Both methods, cluster analysis and decision units, are the subject of our recent research. We propose such methods to represent knowledge in composite (large, complex) knowledge bases. Using a modular representation we can limit the number of rules to process during the inference. Thanks to the properties of clusters and decision units we can perform different optimizations. Large knowledge bases are an important problem in decision systems. It is well known that the main problem of forward chaining is that it fires a lot of rules that are unnecessary to fire, because they are not the inference goal. A lot of fired rules form a lot of new facts that are difficult to interpret properly. That is why the optimization of the inference processes in rule based systems is very important in the artificial intelligence area. Fortunately, there are some methods to solve this problem. For example, we may reorganize the knowledge base from a list of unrelated rules into groups of similar rules (thanks to the cluster analysis method) or decision units. Thanks to this it is possible to make the inference process very efficient, even for really large and composite knowledge bases. Simplifying, when we cluster rules, then in inference processes we search only the small subset of rules (cluster) that is most similar to the given facts or hypothesis [25]. In the case of using the decision units concept, thanks to such constructed units, in the backward chaining technique we make the inference process only on the proper decision unit (the one with the given conclusion attribute). That is why we propose to change the structure of the knowledge base to a cluster or decision unit structure, with inference algorithm optimizations depending on user requirements. At this stage of our work we can only present the general conception of modular



rule base organization. We cannot formally prove that our conception will really cause a growth of efficiency. But in our opinion a hierarchical organization of a rule knowledge base allows us to decrease the number of rules necessary to process during inference, thus we hope that global inference efficiency will grow. At this stage of our research, decision units (with Petri net extensions) and rule clusters are parallel tools for rule base decomposition rather than one coherent approach. Therefore we have two methods of rule base decomposition: into rule clusters if we want to perform forward chaining inference, and into decision units if we want to do backward chaining inference. The main goal of our future work is to create a coherent conception of modularization of large rule bases. This conception shall join two main subgoals: optimization of forward and backward chaining inference processes, and a practical approach to rule base modelling and verification. In our opinion, the two methods of rule base decomposition described in this work allow us to obtain our goals. It is very important that software tools dedicated to the rule clustering and decision units approaches exist. Practical tests allow us to say that we need specialized software tools when we work with large, composite rule bases. We expect that our mixed approach is a basis for creating such software tools. Rough set theory enables solving the problem of a huge number of attributes and the removal of dependent attributes. The accuracy of classification can be increased by selecting subsets of strong attributes, which is performed by using several classification learners. The processed data are classified by diverse learning schemes and the generation of rules is supervised by domain experts. The implementation of this method in automated decision support software can improve the accuracy and reduce the time consumption as compared to full syntax analysis [20,21,22]. Pawlak's theory is also widely used by Zielosko and Piliszczuk to build classifiers based on partial reducts and partial decision rules [43,44]. Recently, partial reducts and partial decision rules were studied intensively by Moshkov and also by Zielosko and Piliszczuk. Partial reducts and partial decision rules depend on the noise to a lesser degree than exact reducts and rules [42]. Moreover, it is possible to construct more compact classifiers based on partial reducts and rules. The experiments with classifiers presented in [45] show that the accuracy of classifiers based on such reducts and rules is often better than the accuracy based on exact reducts and rules. It is also an important fact that in 1976 Dempster and Shafer created a mathematical theory of evidence called Dempster-Shafer theory, which is based on belief functions and plausible reasoning [32]. It allows combining separate pieces of information (evidence) to calculate the probability of an event. Pawlak's rough set theory, as an innovative mathematical tool created in 1982, lets us describe knowledge, including uncertain and inexact knowledge [8]. Finally, in 1994 the basic functions of the evidence theory were defined based on notions from rough set theory [33]. All the dependences between these theories have allowed further research on their practical usage. Some papers that tried to show the relationships between the rough set theory and the evidence theory, which could be used to find the minimal templates



for a given decision table, were also published [34,35]. Extracting templates from data is a problem that consists in finding some set of attributes with a minimal number of attributes that warrants, among others, a sufficiently small difference between the belief function and the plausibility function. This small difference between these functions allows reducing the number of attributes (together with a decrease in the values of the attributes) and creating the templates. Moreover, MTP (the minimal templates problem) gives a recipe for which decision values may be grouped. At the end we get decision rules with suitably large support. Of course, in recent years it has been possible to witness a rapid growth of interest in the application of rough set theory in many other domains such as, for instance, vibration analysis, conflict resolution, intelligent agents, pattern recognition, control theory, signal analysis, process industry, marketing, etc. Swiniarski in [36] presented an application of rough set methods to feature selection and reduction as a front end of neural network based texture image recognition. The role of the rough sets is to show their ability to select a reduced set of pattern features. In another paper, presented by Nguyen, we can observe a multi-agent system based on rough set theory [37]. The task of creating effective methods of web search result clustering, based on rough sets, was presented in [41] by Nguyen. Pawlak's theory was also used to develop a new methodology for data mining in distributed and multiagent systems [38]. Recently, rough set based methods have been proposed for data mining in very large relational databases [39,40].

4.6 Conclusions

Classification is an important problem in the field of Data Mining. Data acquisition and warehousing capabilities of computer systems are sufficient for wide application of computer aided Knowledge Discovery. Inductive learning is employed in various domains such as medical data analysis or customer activity monitoring. Due to various factors, the data suffer from impreciseness and incompleteness. There are many classification approaches like "nearest neighbours", "naive Bayes", "decision tree", "decision rule set", "neural networks" and many others. Unfortunately, there are opinions that rough set based methods can be used for small data sets only. The main objection is related to their lack of scalability (more precisely: there is a lack of proof showing that they can be scalable). The biggest troubles lie in the rule induction step. As we know, the potential number of all rules is exponential. All heuristics for rule induction algorithms have at least O(n^2) time complexity, where n is the number of objects in the data set, and they require multiple data scans. Rough set theory has been applied to build classifiers by exploring symbolic relations in data. Indiscernibility relations combined with the concept notion, and the application of set operations, lead to knowledge discovery in an elegant and intuitive way. Knowledge discovered from data tables is often presented in terms of "if...then..." decision rules. With each rule a confidence measure is associated. Rough sets provide a symbolic representation of data and the representation of knowledge in



terms of attributes, information tables, semantic decision rules, rough measures of inclusion and closeness of information granules, and so on. Rough set methods make it possible to reduce the size of a dataset by removing some of the attributes while preserving the partitioning of the universe of an information system into equivalence classes.

5 Summary

Information systems and decision support systems are strongly related. The paper shows that we can treat a decision system as an information system of some objects for which we have information about their classification. When the information is not complete, or the system contains some uncertain data, we can use rough set theory to separate the uncertain part from what we are sure about. By defining the reduct for a decision table we can optimize the system, and then, using the methods for minimal rule generation, we can easily classify new objects. We see, therefore, that Prof. Pawlak's contribution to the domain of information and decision support systems is invaluable [24].

References

1. Bazan, J.: Metody wnioskowań aproksymacyjnych dla syntezy algorytmów decyzyjnych. Praca doktorska, Wydział Informatyki, Matematyki i Mechaniki, Uniwersytet Warszawski, Warszawa (1998)
2. Grzelak, K., Kochańska, J.: System wyszukiwania informacji metodą składowych atomowych MSAWYSZ. ICS PAS Reports No. 511, Warsaw (1983)
3. Nowak, A., Wakulicz-Deja, A.: Effectiveness comparison of classification rules based on k-means clustering and Salton's method. In: Advances in Soft Computing, pp. 333–338. Springer, Heidelberg (2004)
4. Nowak, A., Wakulicz-Deja, A.: The concept of the hierarchical clustering algorithms for rules based systems. In: Advances in Soft Computing, pp. 565–570. Springer, Heidelberg (2005)
5. Pawlak, Z.: Mathematical foundation of information retrieval. CC PAS Reports No. 101, Warsaw (1973)
6. Pawlak, Z., Marek, W.: Information storage and retrieval system – mathematical foundations. CC PAS Reports No. 149, Warsaw (1974)
7. Pawlak, Z.: Information systems – theoretical foundation. Information Systems 6(3) (1981)
8. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Boston (1991)
9. Pawlak, Z., Skowron, A.: Rudiments of rough sets. Information Sciences 177, 3–27 (2007)
10. Pawlak, Z., Skowron, A.: Rough sets: some extensions. Information Sciences 177, 28–40 (2007)
11. Pawlak, Z., Skowron, A.: Rough sets and Boolean reasoning. Information Sciences 177, 41–73 (2007)



12. Roddick, J.F., Hornsby, K., Spiliopoulou, M.: YABTSSTDMR – Yet Another Bibliography of Temporal, Spatial and Spatio-Temporal Data Mining Research. In: Unnikrishnan, K.P., Uthurusamy, R. (eds.) Proc. SIGKDD Temporal Data Mining Workshop, San Francisco, CA, pp. 167–175. ACM, New York (2001)
13. Roddick, J.F., Egenhofer, M.J., Hoel, E., Papadias, D., Salzberg, B.: Spatial, Temporal and Spatio-Temporal Databases – Hot Issues and Directions for Ph.D Research. SIGMOD Record 33(2), 126–131 (2004)
14. Simiński, R., Wakulicz-Deja, A.: Circularity in Rule Knowledge Bases – Detection using Decision Unit Approach. In: Advances in Soft Computing, pp. 273–280. Springer, Heidelberg (2004)
15. Skowron, A.: From the Rough Set Theory to the Evidence Theory. In: Yager, R.R., Fedrizzi, M., Kacprzyk, J. (eds.) Advances in the Dempster-Shafer Theory of Evidence, pp. 193–236. Wiley, New York (1994)
16. Skowron, A., Bazan, J., Stepaniuk, J.: Modelling Complex Patterns by Information Systems. Fundamenta Informaticae 67(1-3), 203–217 (2005)
17. Bazan, J., Peters, J., Skowron, A., Synak, P.: Spatio-temporal approximate reasoning over complex objects. Fundamenta Informaticae 67, 249–269 (2005)
18. Wakulicz-Deja, A.: Podstawy systemów ekspertowych. Zagadnienia implementacji. Studia Informatica 26(3(64)) (2005)
19. Wakulicz-Deja, A., Paszek, P.: Optimalization on Decision Problems on Medical Knowledge Bases. In: 5th European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany (1997)
20. Wakulicz-Deja, A., Ilczuk, G.: Attribute Selection and Rule Generation Techniques for Medical Diagnosis Systems. In: Ślęzak, D., Yao, J., Peters, J.F., Ziarko, W.P., Hu, X. (eds.) RSFDGrC 2005. LNCS, vol. 3642, pp. 352–361. Springer, Heidelberg (2005)
21. Wakulicz-Deja, A., Ilczuk, G., Kargul, W., Mynarski, R., Drzewiecka, A., Pilat, E.: Artificial intelligence in echocardiography – from data to conclusions. Eur. J. Echocardiography Supplement 7(supl.1) (2006)
22. Wakulicz-Deja, A., Paszek, P.: Applying rough set theory to multi stage medical diagnosing. Fundamenta Informaticae XX, 1–22 (2003)
23. Lum, V.Y.: Multi-Attribute Retrieval with Combined Indexes. Communications of the ACM 13(11) (1970)
24. Wakulicz-Deja, A., Nowak, A.: From an information system to a decision support system. In: Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 454–464. Springer, Heidelberg (2007)
25. Nowak, A., Wakulicz-Deja, A.: The inference processes on clustered rules. In: Advances in Soft Computing, vol. 5, pp. 403–411. Springer, Heidelberg (2006)
26. Nowak, A., Simiński, R., Wakulicz-Deja, A.: Towards modular representation of knowledge base. In: Advances in Soft Computing, vol. 5, pp. 421–428. Springer, Heidelberg (2006)
27. Effelsberg, W., Harder, T., Reuter, A.: An experiment in learning DBTG data-base administration. Information Systems 5, 137–147 (1980)
28. Michalski, R.S., Chilausky, R.L.: Knowledge acquisition by encoding expert rules versus computer induction from examples: a case study involving soybean pathology. International Journal of Man-Machine Studies 12, 63–87 (1977)
29. Slamecka, V., Comp, H.N., Bodre, A.: MARIS – A knowledge system for internal medicine. Information Processing and Management 5, 273–276 (1977)
30. Masui, S., Shioya, M., Salaniski, T., Tayama, Y., Iungawa, T., Fujite: Evaluation of a diffusion model applicable to environmental assessment for air pollution abatement. System Development Lab., Hitachi Ltd., Tokyo, Japan (1980)



31. Cash, J., Whinston, A.: Security for GPLAN system. Information Systems 2(2) (1976)
32. Shafer, G.: A mathematical theory of evidence. Princeton University Press, Princeton (1976)
33. Skowron, A., Grzymala-Busse, J.: From the Rough Set Theory to the Evidence Theory. In: Yager, R.R., Fedrizzi, M., Kacprzyk, J. (eds.) Advances in the Dempster-Shafer Theory of Evidence, pp. 193–236. Wiley, New York (1994)
34. Marszal-Paszek, B., Paszek, P.: Minimal Templates Problem. In: Intelligent Information Processing and Web Mining, Advances in Soft Computing, vol. 35, pp. 397–402. Springer, Heidelberg (2006)
35. Marszal-Paszek, B., Paszek, P.: Extracting Minimal Templates in a Decision Table. In: Monitoring, Security, and Rescue Techniques in Multiagent Systems, Advances in Soft Computing, pp. 339–344. Springer, Heidelberg (2005)
36. Swiniarski, R., Hargis, L.: Rough Sets as a Front End of Neural Networks Texture Classifiers. Neurocomputing 36(1-4), 85–102 (2001)
37. Nguyen, H.S., Nguyen, S.H., Skowron, A.: Decomposition of Task Specification. In: Raś, Z.W., Skowron, A. (eds.) ISMIS 1999. LNCS, vol. 1609, p. 310. Springer, Heidelberg (1999)
38. Polkowski, L., Skowron, A. (eds.): Rough Sets in Knowledge Discovery 2: Applications, Case Studies and Software Systems. Physica Verlag, Heidelberg (1998)
39. Stepaniuk, J.: Relational Data and Rough Sets. Fundamenta Informaticae 79(3-4), 525–539 (2007)
40. Stepaniuk, J.: Approximation Spaces in Multi Relational Knowledge Discovery. Rough Sets 6, 351–365 (2007)
41. Ngo, C.L., Nguyen, H.S.: A Method of Web Search Result Clustering Based on Rough Sets. In: Web Intelligence, pp. 673–679 (2005)
42. Moshkov, M., Piliszczuk, M., Zielosko, B.: On construction of partial reducts and irreducible partial decision rules. Fundamenta Informaticae 75(1-4), 357–374 (2007)
43. Piliszczuk, M.: On greedy algorithm for partial reduct construction. In: Proceedings of the Concurrency, Specification and Programming Workshop, Ruciane Nida, Poland, pp. 400–411 (2005)
44. Zielosko, B.: On partial decision rules. In: Proceedings of the Concurrency, Specification and Programming Workshop, Ruciane Nida, Poland, pp. 598–609 (2005)
45. Zielosko, B., Kocjan, A., Piliszczuk, M.: Classifiers Based on Partial Reducts and Partial Decision Rules. In: Intelligent Information Systems XVI, Proceedings of the International IIS 2008 Conference held in Zakopane, Challenging Problems of Science, Computer Science, pp. 431–438. Academic Publishing House EXIT, Warsaw (2008)
46. Orlowska, E.: Dynamic information systems. IPI PAN papers No. 434, Warsaw, Poland (1981)
47. Sarnadas, A.: Temporal aspects of logical procedure definition. Information Systems (3) (1980)
48. Wakulicz-Deja, A.: Classification of time-varying information systems. Information Systems (3), Warsaw, Poland (1984)
49. Wakulicz-Deja, A.: Podstawy systemów wyszukiwania informacji. Analiza Metod. Problemy Współczesnej Nauki, Teoria i Zastosowania: Informatyka. Akademicka Oficyna Wydawnicza PLJ, Warsaw, Poland (1995)

Debellor: A Data Mining Platform with Stream Architecture

Marcin Wojnarski

Warsaw University, Faculty of Mathematics, Informatics and Mechanics
ul. Banacha 2, 02-097 Warszawa, Poland
[email protected]

Abstract. This paper introduces Debellor (www.debellor.org) – an open source extensible data mining platform with stream-based architecture, where all data transfers between elementary algorithms take the form of a stream of samples. Data streaming enables implementation of scalable algorithms, which can efficiently process large volumes of data, exceeding available memory. This is very important for data mining research and applications, since the most challenging data mining tasks involve voluminous data, either produced by a data source or generated at some intermediate stage of a complex data processing network. Advantages of data streaming are illustrated by experiments with clustering time series. The experimental results show that even for moderate-size data sets streaming is indispensable for successful execution of algorithms; otherwise the algorithms run hundreds of times slower or just crash due to memory shortage. Stream architecture is particularly useful in such application domains as time series analysis, image recognition or mining data streams. It is also the only efficient architecture for implementation of online algorithms. The algorithms currently available on the Debellor platform include all classifiers from the Rseslib and Weka libraries and all filters from Weka.

Keywords: Pipeline, Online Algorithms, Software Environment, Library.

1 Introduction

In the fields of data mining and machine learning, there is frequently a need to process large volumes of data, too big to fit in memory. This is particularly the case in some application domains, like computer vision or mining data streams [1,2], where input data are usually voluminous. But even in other domains, where input data are small, they can abruptly expand at an intermediate stage of processing, e.g. due to extraction of windows from a time series or an image [3,4]. Most ordinary algorithms are not suitable for such tasks, because they try to keep all data in memory. Instead, special algorithms are necessary, which make efficient use of memory. Such algorithms will be called scalable.



Another feature of data mining algorithms – besides scalability – which is very desired nowadays is interoperability, i.e. the capability of an algorithm to be easily connected with other algorithms. This property is more and more important, as basically all newly created data mining systems – whether experimental or end-user solutions – incorporate much more than just one algorithm. It would be very valuable if algorithms were both scalable and interoperable. Unfortunately, combining these two features is very difficult. Interoperability requires that every algorithm is implemented as a separate module, with clearly defined input and output. Obviously, a data mining algorithm must take data as its input, so the data must be fully materialized – generated and stored in a data structure – just to invoke the algorithm, no matter what it actually does. And materialization automatically precludes scalability of the algorithm. In order to provide scalability and interoperability at the same time, algorithms must be implemented in a special software architecture which does not enforce data materialization. Debellor¹ – the data mining platform introduced in this paper – defines such an architecture, based on the concept of data streaming. In Debellor, data are passed between interconnected algorithms sample-by-sample, as a stream of samples, so they can be processed on the fly, without full materialization. The idea of data streaming is inspired by architectures of database management systems, which enable fast query execution on very large data tables. It should be noted that Debellor is not a library, like e.g. Rseslib² [5,6,7] or Weka³ [8], but a data mining platform. Although its distribution contains implementations of a number of algorithms, the primary goal of Debellor is to provide not the algorithms themselves, but a common architecture in which various types of data processing algorithms may be implemented and combined, even if they are created by independent researchers. Debellor can handle a wide range of algorithm types: classifiers, clusterers, data filters, generators etc. Moreover, extensibility of data types is provided, so it will be possible to process not only ordinary feature vectors, but also images, text, DNA microarray data etc. It is worth mentioning that Debellor's modular and stream-oriented architecture will enable easy parallelization of composite data mining algorithms. This aspect will be investigated elsewhere. Debellor is written in Java and distributed under the GNU General Public License. Its current version, Debellor 0.5, is available at www.debellor.org. The algorithms currently available include all classifiers from the Rseslib and Weka libraries, all filters from Weka and a reader of ARFF files. There are also several algorithms implemented by Debellor itself, like the Train&Test evaluation procedure. The algorithms from Rseslib and Weka, except the ARFF reader, are not scalable – this is enforced by the architectures of both libraries.

1 The name originates from Latin debello (to conquer) and debellator (conqueror).
2 http://rsproject.mimuw.edu.pl/
3 http://www.cs.waikato.ac.nz/ml/weka/

2 Related Work

There is a large amount of software that can be used to facilitate the implementation of new data mining algorithms. A common choice is to use an environment for numerical calculations – R4 [9], Matlab5, Octave6 [10,11] or Scilab7 – and implement the algorithm in the scripting language defined by the environment. Many data mining and machine learning algorithms are available for each of these environments, usually in the form of external packages, so the environments can be seen as common platforms for different data mining algorithms. However, they do not define a common architecture for algorithms, so they do not automatically provide interoperability. Moreover, the scripting languages of these environments have low efficiency, no static typing and only weak support for object-oriented programming, so they are suitable for fast prototyping and running small experiments, but not for the implementation of scalable and interoperable algorithms.

Another possible choice is to take a data mining library written in a general-purpose programming language (usually Java) – examples of such libraries are Weka8 [8], Rseslib9 [5,6,7] and RapidMiner10 [12] – and try to fit the new algorithm into the architecture of the library. However, these libraries preclude scalability of algorithms, because the whole training data must be materialized in memory before they can be passed to an algorithm.

The concept of data streaming, also called pipelining, has been used in database management systems [13,14,15,16] for efficient query execution. The elementary units capable of processing streams are called iterators in [13,14].

The issue of scalability is related to the concept of online algorithms. In the machine learning literature [17,18], the term online has been used to denote training algorithms which update the underlying decision model after every single presentation of a sample. The algorithms which update the model only when the whole training set has been presented are called batch. Usually online algorithms can be more memory-efficient than their batch counterparts, because they do not have to store samples for later use. They are also more flexible, e.g., they can be used in incremental learning or allow the training process to be stopped at any time during a scan of the data. This is why extensive research has been done to devise online variants of existing batch algorithms [19,20,21,22,23]. Certainly, online algorithms are the best candidates for implementation in a stream architecture. Note, however, that many batch algorithms also do not have to keep all samples in memory and thus can benefit from data streaming. In many cases it is enough to keep only some statistics calculated during a scan of the data set, used afterwards to make the final update of the model. For example, the standard k-means [17,24,25] algorithm performs batch

4 http://www.r-project.org
5 http://www.mathworks.com
6 http://www.octave.org
7 http://www.scilab.org
8 http://www.cs.waikato.ac.nz/ml/weka
9 http://rsproject.mimuw.edu.pl
10 http://rapid-i.com


updates of the model, but despite this it can be scalable if implemented in the stream architecture, as will be shown in Sect. 5.8.
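To make the pattern of "statistics calculated during a scan" concrete, the following minimal Java sketch computes the per-attribute mean of a data set in a single pass, never storing the samples themselves. The representation of a sample as a double[] and the samples iterable are assumptions made purely for illustration:

double[] sum = null;
long count = 0;
for (double[] x : samples) {                     // single scan over the data
    if (sum == null) sum = new double[x.length];
    for (int i = 0; i < x.length; i++) sum[i] += x[i];
    count++;
}
double[] mean = new double[sum.length];
for (int i = 0; i < sum.length; i++)
    mean[i] = sum[i] / count;                    // final batch update of the model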

3 Motivation

3.1 Scalability

Scalable algorithms are indispensable in most data mining tasks – whenever the data become larger than the available memory. Even if memory initially seems capacious enough to hold the data, it may turn out during experiments that the data are larger and memory smaller than expected. There are many reasons for this:
1. Not all physical memory is available to the data mining algorithm at a given time. Some part is used by the operating system and other applications.
2. The experiment may incorporate many algorithms run in parallel. In such cases, the available memory must be partitioned between all of them. In the future, parallel execution will become more and more common due to the parallelization of hardware architectures, e.g., expressed by the increasing number of cores in processors.
3. In a complex experiment, composed of many elementary algorithms, every intermediate algorithm will generate another set of data. The total amount of data will be much larger than the amount of source data alone.
4. For architectural reasons, data must be stored in memory in some general data structures, which take more memory than would be necessary in a given experiment. For example, the data may be composed of binary attributes and each value could be stored on a single bit, but in fact each value takes 8 bytes or more, because every attribute – whether it is numeric or binary – is stored in the same way. The internal data representation used by a given platform is always a compromise between generality and efficient memory usage.
5. Data generated at intermediate processing stages may be many times larger than the source data. For example:
– Input data may require decompression, e.g. JPEG images must be converted to raw bitmaps to undergo processing. This may increase the data size even by a factor of 100.
– In image recognition, a single input image may be used to generate thousands of subwindows that undergo further processing [4,26]. An input image of 1MB may easily generate windows of 1GB or more. A similar situation occurs in speech recognition or time series analysis, where the sliding-window technique is used.
– Synthetic attributes may be generated, e.g. by taking all products of pairs of original attributes, which leads to a quadratic increase in the number of attributes.
– Synthetic samples may be generated in order to increase the size of the training set and improve the learning of a decision system. For example, this method is used in [27], which studies the problem of Optical Character Recognition. Training images of hand-written characters are randomly distorted by planar affine transformations and added to the training set. Every image undergoes 9 random distortions, which leads to a 10-fold increase in the training set size (from 60 to 600 thousand images).
6. In some applications, like mining data streams [1], input data are potentially infinite, so scalability obviously becomes an issue.
7. Even if the volume of data is small at the stage of experiments, it may become much bigger when the algorithm is deployed in a final product and must process real-world instead of experimental data.
The above arguments clearly show that memory is indeed a critical issue for data mining algorithms. Every moderately complex experiment will exhibit one or more of the characteristics listed above. This is why we need scalable algorithms and – for this purpose – an architecture that enables algorithms to process data on the fly, without full materialization of a data set.

3.2 Interoperability

Nowadays, it is impossible to solve a data mining task or conduct an experiment using only one algorithm. For example, even if you want to experiment with a single algorithm, like a new classification method, you at least have to access data on disk, so you need an algorithm that reads a given file format (e.g. ARFF11). Also, you would like to evaluate your classifier, so you need an algorithm which implements an evaluation scheme, like cross-validation or bootstrap. And in most cases you will also need several algorithms for data preprocessing, like normalization, feature selection, imputation of missing values etc. – note that preprocessing is an essential step in knowledge discovery [28,29] and usually several different preprocessing methods must be applied before data can be passed to a decision system. To build a data mining system, there must be a way to connect all these different algorithms together. Thus, they must possess the property of interoperability. Without this property, even the most efficient algorithm is practically useless.
Further on, the graph of data flow between the elementary algorithms in a data mining system will be called a Data Processing Network (DPN). In general, we will assume that a DPN is a directed acyclic graph, so there are no loops of data flow. Moreover, in the current version of Debellor, a DPN can only have the form of a single chain, without branches. An example of a DPN is shown in Figure 1.

Fig. 1. Example of a Data Processing Network (DPN), composed of five elementary algorithms (boxes). Arrows depict the data flow between the algorithms.

11 http://www.cs.waikato.ac.nz/ml/weka/arff.html

4 Data Streaming

To provide interoperability, data mining algorithms must be implemented in a common software architecture, which specifies:
– a method for connecting algorithms,
– a model of data transfer,
– a common data representation.
Architectures of existing data mining systems utilize the batch model of data transfer. In this model, algorithms must take the whole data set as an argument for execution. To run a composite experiment, represented by a DPN with a number of algorithms, an additional supervisor module is needed, responsible for invoking consecutive algorithms and passing data sets between them. Figure 3 presents a UML sequence diagram [30] with an example of batch processing in a DPN composed of three algorithms. The DPN itself is presented in Fig. 2.
Batch data transfer enforces data materialization, which precludes scalability of the algorithms and of the DPN as a whole. For example, in Weka, every classifier must be implemented as a subclass of the Classifier class (in the weka.classifiers package). Its training algorithm must be implemented in the method:

buildClassifier(Instances) : void

The argument of type Instances is an array of training samples. This argument must be created before calling buildClassifier, so the data must be fully materialized in memory just to invoke the training algorithm, no matter what the algorithm actually does. A similar situation takes place for clustering methods, which must inherit from the weka.clusterers.Clusterer class and overload the method:

buildClusterer(Instances) : void
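For illustration, a typical batch invocation in Weka looks roughly as follows. This is a sketch only – DataSource stands for weka.core.converters.ConverterUtils.DataSource, the J48 classifier and the file name are arbitrary choices, and the exact loader calls may differ between Weka versions:

Instances data = DataSource.read("train.arff");  // the whole data set is materialized here
data.setClassIndex(data.numAttributes() - 1);
Classifier c = new J48();                        // weka.classifiers.trees.J48
c.buildClassifier(data);                         // training can start only now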

Rseslib and RapidMiner also enforce data materialization before a training algorithm can be invoked. In Rseslib, classifiers must be trained in the class constructor, which takes an argument of type DoubleDataTable. In RapidMiner, training of any decision system takes place in the method apply(IOContainer) of the class com.rapidminer.operator.Operator. Both Rseslib's DoubleDataTable and RapidMiner's IOContainer represent materialized input data.
If a large data set must be materialized, execution of the experiment is practically impossible. If the data fit in virtual memory [31] but exceed the available physical memory, the operating system temporarily swaps [31] part of the data, storing it in the swap file on disk.

Fig. 2. DPN used as an example for analysis of data transfer models


Fig. 3. UML diagram of batch data transfer in a DPN composed of three algorithms: LoadData, Preprocess and TrainClassifier, controlled by the Supervisor module. The Supervisor invokes the algorithms (methods run) and passes data between them. All samples of a given data set are generated and transferred together, so the available memory must be large enough to hold all data. Vertical lines denote the life of modules, with time passing down the lines. Horizontal lines represent messages (method calls and/or data transfers) between the modules. Vertical boxes depict execution of the module's code.

Swapping makes the execution tens or hundreds of times slower, as access to disk is orders of magnitude slower than access to memory. If the data set is so large that it exceeds even the available virtual memory, execution of the experiment is terminated with an out-of-memory error. This problem could be avoided if the class that represents a data set (e.g., Instances in Weka) internally implemented buffering of data on disk. Then, however, the same performance degradation would occur as in the case of system swapping, because swapping and buffering on disk are actually the same thing, only implemented at different levels: the operating system or the data mining environment. The only way to avoid severe performance degradation when processing large data is to generate the data iteratively, sample-by-sample, and instantly process the created samples, as presented in Fig. 4. In this way, data may be generated and consumed on the fly, without materialization of the whole set. This model of data transfer will be called iterative.


Fig. 4. UML diagram of iterative data transfer. The supervisor invokes the algorithms separately for each sample of the data set (sample x y denotes sample no. x generated by algorithm no. y). In this way, memory requirements are very low (memory complexity is constant), but the supervisor's control over the data flow becomes very difficult.

Iterative data transfer solves the problem of high memory consumption, because the memory requirements imposed by the architecture are constant – only a fixed number of samples must be kept in memory at any given moment, no matter how large the full data set is. However, another problem arises: the supervisor becomes responsible for controlling the flow of samples and the order of execution of algorithms. This control may be very complex, because each elementary algorithm may have different input-output characteristics. The number of possible variants is practically infinite, for example:
1. A preprocessing algorithm may filter out some samples, in which case more than one input sample may be needed to produce one output sample.
2. A preprocessing algorithm may produce a number of output samples from a single input sample, e.g. when extracting windows from an image or time series.
3. The training algorithm of a decision system usually has to scan the data many times, not just once.


4. Generation of output samples may be delayed relative to the flow of input samples, e.g. an algorithm may require that 10 input samples are given before it starts producing output samples.
5. Input data to an algorithm may be infinite, e.g. when they are generated synthetically. In such a case, the control mechanism must stop data generation at an appropriate moment.
6. Some algorithms may have more than one input or output, e.g. an algorithm for merging data from several different sources (many inputs) or an algorithm for splitting data into training and test parts (many outputs). In such a case, the control of data flow through all the inputs and outputs becomes even more complex, because there are additional dependencies between the many inputs/outputs of the same algorithm.
Note that the diagram in Fig. 4 depicts a simplified case in which the DPN is a single chain of three algorithms, without branches; preprocessing generates exactly one output sample for every input sample; and the training algorithm scans the data only once.

Fig. 5. UML diagram of control and data ﬂow in the stream model of data transfer. The supervisor invokes only method build() of the last component (TrainClassiﬁer). This triggers a cascade of messages (calls to methods next()) and transfers of samples, as needed to fulﬁll the initial build() request.


How the data flow should be controlled depends on which algorithms are used in a given DPN. For this reason, the algorithms themselves – not the supervisor – should be responsible for controlling the data flow. To this end, each algorithm must be implemented as a component which can communicate with other components without external control by the supervisor. The supervisor's responsibility must be limited to linking components together (building the DPN) and invoking the last algorithm in the DPN, which is the final receiver of all samples. Communication should take the form of a stream of samples: (i) a sample is the unit of data transfer; (ii) samples are transferred sequentially, in a fixed order decided by the sender. This model of data transfer will be called the stream model. An example of control and data flow in this model is presented in Fig. 5. Component architecture and data streaming are the features of Debellor which enable the scalability of algorithms implemented on this platform.

5 Debellor Data Mining Platform

5.1 Data Streams

Debellor's components are called cells. Every cell is a Java class inheriting from the base class Cell (package org.debellor.core). Cells may implement all kinds of data processing algorithms, for example:
1. Decision algorithms: classification, regression, clustering, density estimation etc.
2. Transformations of samples and attributes.
3. Removal or insertion of samples and attributes.
4. Loading data from a file, database etc.
5. Generation of synthetic data.
6. Buffering and reordering of samples.
7. Evaluation schemes: train&test, cross-validation, leave-one-out etc.
8. Collecting statistics.
9. Data visualization.
Cells may be connected into a DPN by calling the setSource(Cell) method on the receiving cell, for example:

Cell cell1 = ..., cell2 = ..., cell3 = ...;
cell2.setSource(cell1);
cell3.setSource(cell2);

The first cell will usually represent a file reader or a generator of synthetic data. Intermediate cells may apply different kinds of data transformations, while the last cell will usually implement a decision system or an evaluation procedure. The DPN can be used to process data by calling the methods open(), next() and close() on the last cell of the DPN, for example:


cell3.open();
sample1 = cell3.next();
sample2 = cell3.next();
sample3 = cell3.next();
...
cell3.close();

The above calls open a communication session with cell3, retrieve some number of processed samples and close the session. In order to realize each request, cell3 may communicate with its source cell, cell2, by invoking the same methods (open, next, close) on cell2. And cell2 may in turn communicate with cell1. In this way it is possible to generate output samples on the fly. The stream of samples may flow through consecutive cells of the DPN without buffering, so the input data may have unlimited volume.
Note that the user of the DPN does not have to control the sample flow by hand. To obtain the next sample of processed data it is enough to call cell3.next(), which will invoke – if needed – a cascade of calls to the preceding cells. Moreover, different cells may control the flow of samples differently. For example, cells that implement classification algorithms will take one input sample in order to generate one output sample. Filtering cells will take a number of input samples in order to generate one output sample that matches the filtering rule. An image subwindow generator will produce many output samples out of a single input sample. We can see that the cell's interface is very flexible. It enables implementation of various types of algorithms in the same framework and allows the algorithms to be easily combined into a complex DPN.
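As an illustration of how a cell can control the flow of samples by itself, the sketch below shows a hypothetical filtering cell. Only the elements described above (the next() method and the source reference) come from the text; the Predicate field and the constructor are assumptions added purely for this example:

/* Hypothetical filtering cell: emits only samples accepted by a predicate. */
public class FilterCell extends Cell {
    private final java.util.function.Predicate<Sample> rule;

    public FilterCell(java.util.function.Predicate<Sample> rule) {
        this.rule = rule;
    }

    public Sample next() {
        Sample s;
        while ((s = source.next()) != null)   // pull as many input samples as needed
            if (rule.test(s))
                return s;                     // first sample matching the filtering rule
        return null;                          // end of the input stream
    }
}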

5.2 Buildable Cells

Some cells may be buildable, in which case their content must be built before the cell can be used. The building procedure is invoked by calling the method

build() : void

on the cell object. This method is declared in the base class Cell. Building a cell may mean different things for different types of cells. For example:
– training a decision system of some kind (classifier, clusterer, ...),
– running an evaluation scheme (train&test, cross-validation, ...),
– reading all data from the input stream and buffering them in memory.
Note that all these different types of algorithms are encapsulated under the same interface (the method build()). This increases the simplicity and modularity of the platform. Usually, a cell reads input data during building, so it must be properly connected to a source cell before build() is invoked. Afterwards, the cell may be reconnected and used to process another stream of data. Some buildable cells may also implement the erase() method, which clears the content of the cell. After erasure, the cell may be built once again.
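A typical lifecycle of a buildable cell, using only the methods described above, might look as follows (the trainData and testData cells are placeholders):

classifier.setSource(trainData);
classifier.build();               // trains the model, reading samples from trainData

classifier.setSource(testData);   // reconnect the built cell to another stream
classifier.open();
Sample s = classifier.next();     // classify samples coming from testData
classifier.close();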

5.3 State of the Cell

Every cell object has a state variable attached, which indicates what cell operations are allowed at a given moment. There are three possible states: EMPTY, CLOSED and OPEN. Transitions between them are presented in Fig. 6. Each transition is invoked by a call to an appropriate method: build(), erase(), open() or close().

Fig. 6. Diagram of cell states and allowed transitions

Only a subset of cell methods may be called in a given state. For example, next() can be called only in the OPEN state, while setSource() is allowed only in the EMPTY or CLOSED state. The base class implementation guarantees that disallowed calls immediately end with an exception being thrown. Thanks to this automatic state control, connecting different cells together and building composite algorithms becomes easier and safer, because many possible mistakes or bugs related to inter-cell communication are detected early. Otherwise, they could remain unnoticed, generating incorrect results during data processing. Moreover, it is easier to implement new cells, because the authors do not have to check the correctness of method calls by themselves.
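Conceptually, every state-dependent method of the base class begins with a guard like the following sketch (the state field and enum names are illustrative assumptions, not Debellor's actual code):

if (state != State.OPEN)   // e.g. at the beginning of next()
    throw new IllegalStateException("next() is allowed only in the OPEN state");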

5.4 Parametrization

Most cells require a number of parameters to be set before the cell can start working. Certainly, every type of cell requires different parameters, but for the sake of interoperability and simplicity of usage, there should be a common interface for passing parameters, no matter what number and types of parameters are expected by a given cell. Debellor defines such an interface. Parameters for a given cell are stored in an object of class Parameters (package org.debellor.core), which keeps a dictionary of parameter names and associated String values (in the future we plan to extend the permitted value types; note, however, that all simple types can be easily converted to String). Thanks to the use of a dictionary, the names do not have to be hard-coded as fields of cell objects, hence parameters can be added dynamically, according to the requirements of a given cell. The object of class Parameters can be passed to the cell by calling Cell's method:

setParameters(Parameters) : void


It is also possible (and usually more convenient) to pass single parameter values directly to the cell, without an intermediate Parameters object, by calling:

set(String name, String value) : void

This method call delegates to the analogous method of Cell's internal Parameters object.
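Both styles in a short sketch (the kmeans cell and the numClusters parameter are taken from the example in Sect. 5.8; the rest is illustrative):

Parameters params = new Parameters();
params.set("numClusters", "10");     /* Parameters exposes the same set() method */
kmeans.setParameters(params);

/* equivalent shortcut, delegating to the cell's internal Parameters object: */
kmeans.set("numClusters", "10");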

5.5 Data Representation

The basic unit of data transfer between cells is a sample. Samples are represented by objects of class Sample. Every sample contains two fields, data and label, which hold the input data and the associated decision label, respectively. Either of the fields can be null, if the corresponding information is missing or simply not necessary at the given point of data processing. Cells are free to use whichever part of the input data they want. For example, the build() method of a classifier (i.e. its training algorithm) would use both data and label, interpreting label as a target classification of data, given by a supervisor. During the operation phase, the classifier would ignore the input label, if present. Instead, it would classify data and assign the generated label to the label field of the output sample.
Data and labels are represented in an abstract way. Both the data and label fields reference objects of type Data (package org.debellor.core). Data is a base class for classes that represent data items, like single features or vectors of features. When a cell wants to use information stored in data or label, it must downcast the object to a specific subclass, as expected by the cell. Thanks to this abstract method of data representation, new data types can be added easily, by creating a new subclass of Data. Authors of new cells are not limited to a single data type hard-coded into the platform, as for example in Weka. Data objects may be nested. For example, objects of class DataVector (in org.debellor.core.data) hold arrays of other data objects, like simple features (classes NumericFeature and SymbolicFeature) or other DataVectors.
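For example, a cell consuming feature vectors might access an incoming sample roughly as follows (the direct field access and the class names follow the description above; the shape of the surrounding code is an assumption):

Sample s = source.next();
DataVector vector = (DataVector) s.data;               // downcast to the expected subclass
SymbolicFeature decision = (SymbolicFeature) s.label;  // decision label, may be null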

5.6 Immutability of Data

A very important concept related to data representation is immutability. Objects which store data – instances of the Sample class or Data subclasses – are immutable, i.e. they cannot be modified after creation. Thanks to this property, data objects can be safely shared by cells, without the risk that an accidental modification in one cell would affect the operation of another cell. Immutability of data objects yields many benefits:
1. Safety – cells written by different people may work together in a complex DPN without interference.
2. Simplicity – the author of a new cell does not have to care about the correctness of access to data objects.


3. Efficiency – data objects do not have to be copied when transferred to another cell. Without immutability, copying would be necessary to provide a basic level of safety. Also, a number of samples may keep references to the same data object.
4. Parallelization – if the DPN is executed concurrently, no synchronization is needed when accessing shared data objects. This simplifies parallelization and makes it more efficient.

5.7 Metadata

Many cells have to know some basic characteristics (the "type") of input samples before processing of the data starts. For example, the training algorithm of a neural network has to know the number of input features, to be able to allocate arrays of weights of appropriate size. To provide such information, the method open() returns an object of class MetaSample (a static inner class of Sample), which describes the common properties of all samples generated by the stream being opened. Similarly to Sample, MetaSample has separate fields describing input data and labels, both of type MetaData (a static inner class of Data). Metadata have a structure and properties analogous to those of the data being described. The hierarchy of metadata classes, rooted at MetaData, mirrors the hierarchy of data classes, rooted at Data. The nesting of MetaData and Data objects is also similar, e.g. if the stream generates DataVectors of 10 SymbolicFeatures, the corresponding MetaData object will be an instance of MetaDataVector, containing an array of 10 MetaSymbolicFeatures describing every feature. Similarly to Data, MetaData objects are immutable, so they can be safely shared by cells.
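A training algorithm could therefore inspect the stream type before allocating its model, roughly as follows (the field names mirror Sample as described above; the size() accessor is a hypothetical name):

Sample.MetaSample meta = source.open();            // describes all samples in the stream
MetaDataVector type = (MetaDataVector) meta.data;  // type of the input data
int numFeatures = type.size();                     // e.g. to allocate weight arrays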

5.8 Example

To illustrate the usage of Debellor, we will show how to implement the standard k-means algorithm in the stream architecture and how to employ it for data processing in a several-cell DPN. K-means [17,24,25] is a popular clustering algorithm. Given n input samples – numeric vectors of fixed length, x1, x2, ..., xn – it tries to find cluster centers c1, ..., ck which minimize the sum of squared distances of the samples to their closest centers:

$$E(c_1, \ldots, c_k) = \sum_{i=1}^{n} \min_{j=1,\ldots,k} \| x_i - c_j \|^2 . \qquad (1)$$

This is done through an iterative process with two steps repeated alternately in a loop: (i) assignment of each sample to the nearest cluster and (ii) repositioning of each center to the centroid of all samples in a given cluster. The algorithm is presented in Fig. 7. As we can see, the common implementation of k-means as a function is non-scalable, because it employs the batch model of data transfer: training data are passed as an array of samples, so they must be generated and accumulated in memory before the function is called.


function kmeans(data) returns an array of centers
  Initialize array centers
  repeat
    Set sum[1], ..., sum[k], count[1], ..., count[k] to zero
    for i = 1..n do                      /* assign samples to clusters */
      x = data[i]
      j = clusterOf(x)
      sum[j] = sum[j] + x
      count[j] = count[j] + 1
    end
    for j = 1..k do                      /* reposition centers */
      centers[j] = sum[j]/count[j]
    end
  until no center has been changed
  return centers

Fig. 7. Pseudocode illustrating the k-means clustering algorithm implemented as a regular stand-alone function. The function takes an array of n samples (data) as argument and returns k cluster centers. Both samples and centers are real-valued vectors. The function clusterOf(x) returns the index of the center that is closest to x.

class KMeans extends Cell
  method build()
    Initialize array centers
    repeat
      Set sum[1], ..., sum[k], count[1], ..., count[k] to zero
(*)   source.open()
      for i = 1..n do
(*)     x = source.next()
        j = clusterOf(x)
        sum[j] = sum[j] + x
        count[j] = count[j] + 1
      end
(*)   source.close()
      for j = 1..k do
        centers[j] = sum[j]/count[j]
      end
    until no center has been changed

Fig. 8. Pseudocode illustrating the implementation of k-means as a Debellor cell. Since k-means is a training algorithm (it generates a decision model), it must be implemented in the method build() of a subclass of Cell. Input data are provided by the source cell, the reference source being a field of Cell. The generated model is stored in the field centers of class KMeans; the method build() does not return anything. The lines of code inserted or modified relative to the standard implementation are marked with an asterisk (*).


class KMeans extends Cell
  method next()
    x = source.next()
    if x == null then return null
    return x.setLabel(clusterOf(x))

Fig. 9. Pseudocode illustrating the implementation of the method next() of the KMeans cell. This method employs the clustering model generated by build() and stored inside the KMeans object to label new samples with the identifiers of their clusters.

/* 3 cells are created and linked into a DPN */
Cell arff = new ArffReader();
arff.set("filename", "iris.arff");            /* parameter filename is set */

Cell remove = new WekaFilter("attribute.Remove");
remove.set("attributeIndices", "last");
remove.setSource(arff);                       /* cells arff and remove are linked */

Cell kmeans = new KMeans();
kmeans.set("numClusters", "10");
kmeans.setSource(remove);

/* the k-means algorithm is executed */
kmeans.build();

/* the clusterer is used to label 3 training samples with cluster identifiers */
kmeans.open();
Sample s1 = kmeans.next(), s2 = kmeans.next(), s3 = kmeans.next();
kmeans.close();

/* labelled samples are printed on screen */
System.out.println(s1 + "\n" + s2 + "\n" + s3);

Fig. 10. Java code showing sample usage of Debellor cells: reading data from an ARFF file, removal of an attribute, training and application of a k-means clusterer

The stream implementation of k-means – as a Debellor cell – is presented in Fig. 8. In contrast to the standard implementation, training data are not passed explicitly, as an array of samples. Instead, the algorithm retrieves samples one-by-one from the source cell, so it can process arbitrarily large data sets. In addition,


Fig. 9 shows how to implement the method next(), responsible for applying the generated clustering model to new samples.
Note that although the algorithm presented in Fig. 8 employs the stream method of data transfer, it employs the batch method of updating the decision model (the updates are performed after all samples have been scanned). These two things – the method of data transfer and the way the model is updated – are separate and independent issues. It is possible for batch (in terms of model update) algorithms to utilize and benefit from the stream architecture.
The listing in Fig. 10 shows how to run a simple experiment: train a k-means clusterer and apply it to several training samples, to label them with the identifiers of their clusters. Data are read from an ARFF file and simple preprocessing – removal of the last attribute – is applied to all samples. Note that loading data from the file and preprocessing are executed only when the next input sample is requested by the kmeans cell – in the methods build() and next().

6 Experimental Evaluation

6.1 Setup

In existing data mining systems, when the data to be processed are too large to fit in memory, they must be put in virtual memory. During execution of the algorithm, parts of the data are swapped to disk by the operating system, to make space for other parts, currently requested. In this way, portions of data are constantly moving between memory and disk, generating a huge overhead on the execution time of the algorithm. In the presented experiments we wanted to estimate this overhead and the performance gain that can be obtained through the use of Debellor's data streaming instead of swapping. For this purpose, we trained the k-means [17,24,25] clustering algorithm on time windows extracted from the time series that was used in the EUNITE12 2003 data mining competition. We compared the execution times of two variants of the experiment:
1. batch, with time windows created in advance and buffered in memory,
2. stream, with time windows generated on the fly.
The Data Processing Networks of both variants are presented in Figs. 11 and 12. In both variants, we employed our stream implementation of k-means, sketched in Sect. 5.8 (the KMeans cell in Figs. 11 and 12). In the first variant, we inserted a buffer into the DPN just before the KMeans cell – in this way we effectively obtained a batch algorithm. In the second variant, the buffer was placed earlier in the chain of algorithms, before window extraction. We could have dropped buffering altogether, but then the data would be loaded from disk again in every training cycle, which was not necessary, as the source data were small enough to fit in memory.

12 EUropean Network on Intelligent TEchnologies for Smart Adaptive Systems, http://www.eunite.org


Fig. 11. DPN of the ﬁrst (batch) variant of experiment

Fig. 12. DPN of the second (stream) variant of experiment

The source data were composed of a series of real-valued measurements from a glass production process, recorded at 9408 different time points separated by 15-minute intervals. There were two kinds of measurements: 29 "input" and 5 "output" values. In the experiment we used only the "input" values; the "output" ones were filtered out by the Weka filter for attribute removal (the WekaFilter cell). After loading from disk and dropping the unnecessary attributes, the data occupied 5.7MB of memory. They were subsequently passed to the TimeWindows cell, which generated time windows of length W, at every possible offset from the beginning of the input time series. Each window was created as a concatenation of W consecutive samples of the series. Therefore, for an input series of length T, composed of A attributes, the resulting stream contained T − W + 1 samples, each composed of W · A attributes. In this way, the relatively small source data (5.7MB) generated a large volume of data at further stages of the DPN, e.g. 259MB for W = 50.
In the experiments, we compared the training times of both variants of k-means. Since the time effectiveness of swapping and memory management depends highly on the hardware setup, the experiments were repeated in two different hardware environments: (A) a laptop PC with an Intel Mobile Celeron 1.7 GHz CPU and 256MB RAM; (B) a desktop PC with an AMD Athlon XP 2100+ (1.74 GHz) and 1GB RAM. Both systems ran under Microsoft Windows XP. Sun's Java Virtual Machine (JVM) 1.6.0_03 was used. The number of clusters for k-means was set to 5.

6.2 Results

The results of the experiments are presented in Tables 1 and 2. They are also depicted graphically in Figs. 13 and 14. Different lengths of time windows were checked; for every length, the size of the generated training data was different (given in the second column of the tables). In each trial, the training time of k-means was measured. These times are reported in normalized form, i.e. the total training time in seconds is divided by the number of training cycles and the data size in MB. Normalized times can be directly compared across different trials.


Table 1. Normalized training times of k-means for the batch and stream variants of the experiment and different lengths of time windows. The corresponding sizes of the training data are given in the second column. Hardware environment A.

Window length | Data size [MB] | Normalized execution time (batch variant) | Normalized execution time (stream variant)
10 | 53 | 3.1 | 5.6
20 | 104 | 3.2 | 5.3
30 | 156 | 3.1 | 5.0
40 | 208 | 5.1 | 4.9
50 | 259 | 244.4 | 5.0
60 | 311 | 326.9 | 8.3
70 | 362 | 370.6 | 10.7
80 | 413 | 386.0 | 10.9
90 | 464 | 475.3 | 11.1

Table 2. Normalized training times of k-means for the batch and stream variants of the experiment and different lengths of time windows. The corresponding sizes of the training data are given in the second column. Hardware environment B.

Window length | Data size [MB] | Normalized execution time (batch variant) | Normalized execution time (stream variant)
50 | 259 | 4.0 | 5.3
100 | 515 | 4.0 | 5.4
120 | 617 | 4.0 | 6.5
150 | 769 | 5.3 | 8.7
170 | 869 | 6.3 | 8.8
180 | 919 | 23.8 | 8.8
190 | 969 | 36.4 | 8.8
200 | 1019 | 50.7 | 8.8
210 | 1069 | 71.3 | 8.8
220 | 1119 | 85.1 | 8.8
230 | 1168 | 100.4 | 9.1
240 | 1218 | 111.1 | 9.1
250 | 1267 | 140.2 | 9.4
260 | 1317 | crash | 9.3

Every table and figure presents the results of both variants of the algorithm. The time complexity of a single training cycle of k-means is linear in the data size, so the normalized execution times should be similar across different values of the window length. However, for the batch variant, the times are constant only for small sizes of data. At the point when the data size gets close to the amount of physical memory installed on the system, the execution time suddenly jumps to a very high value, many times larger than for smaller data sizes.


[Plot: normalized training time vs. size of training data [MB], for the batch and stream variants.]

Fig. 13. Normalized training times of k-means for the batch and stream variants of the experiment and different lengths of time windows. Hardware environment A.

[Plot: normalized training time vs. size of training data [MB], for the batch and stream variants.]

Fig. 14. Normalized training times of k-means for the batch and stream variants of the experiment and different lengths of time windows. Hardware environment B. Note that the measurement which caused the batch variant to crash (last row in Table 2) is not presented here.


It may even happen that from some point on the execution crashes due to memory shortage (see Table 2), despite the JVM heap size being set to the highest possible value (1300 MB on a 32-bit system). This is because swapping must be activated to handle such a large volume of data. And because access to disk is orders of magnitude slower than access to memory, the execution of the algorithm also becomes very slow.
This dramatic slowdown is not present in the case of the stream algorithm, which always requires the same amount of memory, at the level of 6MB. For small data sizes this algorithm runs a bit slower, because the training data must be generated from scratch in each training cycle. But for large data sizes it can be 40 times faster, or even more – the curves in Figs. 13 and 14 rise very quickly, so we may suspect that for larger data sizes the disparity between the two variants is even bigger; the batch variant is then practically unusable.
It is also important that every stream implementation of a data mining algorithm can be used in a batch manner by simply preceding it with a buffer in the DPN. Thus, the user can choose the faster variant, depending on the data size. On the other hand, a batch implementation cannot be used in a stream-based manner; rather, the algorithm must be redesigned and implemented again.

7 Conclusions

In this paper we introduced Debellor – a data mining platform with stream architecture. We presented the concept of data streaming and demonstrated through experimental evaluation that it enables much more efficient processing of large data than the currently used method of batch data transfer. The stream architecture is also more general: every stream-based implementation can be used in a batch manner, while the opposite is not true. Thanks to data streaming, algorithms implemented on the Debellor platform can be scalable and interoperable at the same time.
We also analysed the significance of the scalability issue for the design of composite data mining systems and showed that even when the source data are relatively small, lack of memory may still pose a problem, since large volumes of data may be generated at intermediate stages of the data processing network.
The stream architecture also has weaknesses. Because of sequential access to data, the implementation of algorithms may be conceptually more difficult. Batch data transfer is more intuitive for the programmer. Moreover, some algorithms may inherently require random access to data. Although they can be implemented in the stream architecture, they have to buffer all data internally, so they will not benefit from streaming. However, these algorithms can still benefit from the interoperability provided by Debellor – they can be connected with other algorithms to form a complex data mining system.
Development of Debellor will be continued. We plan to extend the architecture to handle multi-input and multi-output cells as well as the nesting of cells (e.g., to implement meta-learning algorithms). We also want to implement parallel execution of DPNs and serialization of cells (i.e., saving them to a file).
Acknowledgement. The research has been partially supported by the grant N N516 368334 from the Ministry of Science and Higher Education of the Republic


of Poland and by the grant "Decision support – new generation systems" of the Innovative Economy Operational Programme 2008-2012 (Priority Axis 1: Research and development of new technologies), managed by the Ministry of Regional Development of the Republic of Poland.

References

1. Aggarwal, C.C. (ed.): Data Streams: Models and Algorithms. Springer, Heidelberg (2007)
2. Gama, J., Gaber, M.M. (eds.): Learning from Data Streams: Processing Techniques in Sensor Networks. Springer, Heidelberg (2007)
3. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Prentice Hall, Englewood Cliffs (2002)
4. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. IEEE Computer Vision and Pattern Recognition 1, 511–518 (2001)
5. Bazan, J.G., Szczuka, M.: RSES and RSESlib – a collection of tools for rough set computations. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS, vol. 2005, pp. 106–113. Springer, Heidelberg (2001)
6. Bazan, J.G., Szczuka, M.S., Wojna, A., Wojnarski, M.: On the evolution of rough set exploration system. In: Tsumoto, S., Słowiński, R., Komorowski, J., Grzymala-Busse, J.W. (eds.) RSCTC 2004. LNCS, vol. 3066, pp. 592–601. Springer, Heidelberg (2004)
7. Wojna, A., Kowalski, L.: Rseslib: Programmer's Guide (2008), http://rsproject.mimuw.edu.pl
8. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
9. R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2005)
10. Eaton, J.W.: Octave: past, present, and future. In: International Workshop on Distributed Statistical Computing (2001)
11. Eaton, J.W., Rawlings, J.B.: Ten years of Octave – recent developments and plans for the future. In: International Workshop on Distributed Statistical Computing (2003)
12. Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T.: YALE: rapid prototyping for complex data mining tasks. In: KDD 2006: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006)
13. Garcia-Molina, H., Ullman, J.D., Widom, J.: Database System Implementation. Prentice Hall, Englewood Cliffs (1999)
14. Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems: The Complete Book. Prentice Hall, Englewood Cliffs (2001)
15. Arasu, A., Babcock, B., Babu, S., Cieslewicz, J., Datar, M., Ito, K., Motwani, R., Srivastava, U., Widom, J.: STREAM: The Stanford data stream management system (2004), http://dbpubs.stanford.edu:8090/pub/2004-20
16. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: Symposium on Principles of Database Systems, pp. 1–16. ACM Press, New York (2002)
17. Ripley, B.D.: Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge (1996)


18. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
19. Bradley, P.S., Fayyad, U.M., Reina, C.: Scaling clustering algorithms to large databases. In: Knowledge Discovery and Data Mining (1998)
20. Balakrishnan, S., Madigan, D.: Algorithms for sparse linear classifiers in the massive data setting. Journal of Machine Learning Research 9, 313–337 (2008)
21. Amit, Y., Shalev-Shwartz, S., Singer, Y.: Online learning of complex prediction problems using simultaneous projections. Journal of Machine Learning Research 9, 1399–1435 (2008)
22. Furao, S., Hasegawa, O.: An incremental network for on-line unsupervised classification and topology learning. Neural Networks 19, 90–106 (2006)
23. Kivinen, J., Smola, A.J., Williamson, R.C.: Online learning with kernels. IEEE Transactions on Signal Processing 52, 2165–2176 (2004)
24. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Computing Surveys 31, 264–323 (1999)
25. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs (1995)
26. Wojnarski, M.: Absolute contrasts in face detection with AdaBoost cascade. In: Yao, J., Lingras, P., Wu, W.-Z., Szczuka, M.S., Cercone, N.J., Ślęzak, D. (eds.) RSKT 2007. LNCS, vol. 4481, pp. 174–180. Springer, Heidelberg (2007)
27. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998)
28. Kriegel, H.P., Borgwardt, K.M., Kröger, P., Pryakhin, A., Schubert, M., Zimek, A.: Future trends in data mining. Data Mining and Knowledge Discovery 15, 87–97 (2007)
29. Pyle, D.: Data Preparation for Data Mining. Morgan Kaufmann Publishers, San Francisco (1999)
30. Booch, G., Rumbaugh, J., Jacobson, I.: Unified Modeling Language User Guide. Addison-Wesley, Reading (2005)
31. Silberschatz, A., Galvin, P., Gagne, G.: Operating System Concepts, 7th edn. Wiley, Chichester (2004)

Category-Based Inductive Reasoning: Rough Set Theoretic Approach

Marcin Wolski

Department of Logic and Methodology of Science, Maria Curie-Skłodowska University, Poland
[email protected]

Abstract. The present paper is concerned with rough set theory (RST) and a particular approach to human-like induction, namely the similarity coverage model (SCM). It redefines basic concepts of RST – such as decision rules and the accuracy and coverage of decision rules – in the light of SCM and explains how RST may be viewed as a similarity-based model of human-like inductive reasoning. Furthermore, following the knowledge-based theory of induction, we enrich RST with the concept of an ontology and, in consequence, present an RST-driven conceptualisation of SCM. The paper also discusses a topological representation of information systems in terms of non-Archimedean structures. This allows us to present an ontology-driven interpretation of finite non-Archimedean nearness spaces and, to some extent, to complete recent papers about RST and the topological concepts of nearness.

1 Introduction

Category-based induction is an approach to human-like inductive reasoning in which both conceptual knowledge and the similarity of objects play the key role. So far this type of reasoning has been a subject of study mainly in ethnobiology or, more broadly, in cognitive science. In this paper we shall apply the main ideas underlying category-based induction to computer science, especially to rough set theory (RST) [10,12]. This will allow us to introduce some new and interesting interpretations of basic concepts and structures from RST and topology.
There are, in general, two basic theories explaining the mechanism of (human-like) induction: the knowledge-based theory and the similarity-based theory. According to the former, induction is driven by a prior categorisation of the given objects, often called conceptual knowledge or an ontology. On this view, people first identify some category of which a given object is an element and then generalise properties to the members of this category and vice versa. For example, knowing that bluejays require vitamin K for their liver to function, one can generalise that all birds require this vitamin too. On the other hand, the similarity-based theory argues that induction is based on the overall similarity of the compared objects rather than on conceptual knowledge. For example, students from Michigan are reported to conclude – on the basis that skunks have some biological property – that it is more likely that opossums have this property than bears. Skunks, however, are taxonomically closer to bears than to opossums [2]. Summing up, according to the knowledge-based approach the generalisation from one


object to another is supported by categories (agents ignore the appearances of objects and rely on category membership), whereas according to the similarity-based approach such generalisations are based on perceptual similarity (agents ignore their knowledge about category membership and rely on appearances) [7].
Inductive reasoning which takes into account both conceptual knowledge and the similarity of objects is generally called category-based induction. There are a number of formal models of such reasoning, e.g. [2,6,9,15]. In this paper we shall study the similarity coverage model (SCM) [2,9], mainly for its simplicity and strong influence on the development of other models. According to SCM, the strength of an inductive argument increases with (a) the degree to which the premise categories are similar to the conclusion category, and (b) the degree to which the premise categories cover the lowest-level knowledge category (e.g. from a taxonomy) that includes both the premise and conclusion categories. Thus, step (a) represents the similarity-based approach, whereas step (b) represents the knowledge-based approach.
The main aim of this paper is to give an account of SCM within the conceptual framework of RST. First, we re-interpret some notions from RST – such as decision rules, the accuracy and coverage of decision rules, and rough inclusion functions – from the standpoint of SCM. On this view, RST may be regarded as a similarity-based approach to induction. Then we enrich RST with a proper ontology and show how the knowledge-based approach can correct the assessment of decision rules. In consequence, the paper proposes an RST-driven model of category-based induction.
Furthermore, we discuss topological aspects of RST and the category-based theory of induction. More specifically, we examine a topological counterpart of an information system. Usually, information systems have been represented as approximation spaces or approximation topological spaces. In consequence, only the indiscernibility (or similarity) relation induced by the set of all attributes has been considered. In contrast to this approach, we consider all indiscernibility relations, or better still, all partitions, induced by an information system. Mathematically, these partitions induce a non-Archimedean structure which, in turn, gives rise to a topological nearness space. Recently a number of attempts have been made to connect RST with nearness-type structures, e.g. [11,19]. To some extent we complete the previous results and show some intuitive reasons to consider such structures. Specifically, every ontology induced over a non-Archimedean structure is taxonomic.
The paper is organised as follows. Section 2 contains a brief and informal introduction to SCM. Section 3 describes basic concepts from RST which are relevant to inductive reasoning. Section 4 discusses the concepts introduced in Section 3 against the background of inductive reasoning and SCM. Finally, Section 5 presents topological aspects of RST and the category-based approach to induction.

2 Similarity Coverage Model

In this section we informally introduce category-based induction, which has been of special importance mainly for ethnobiology. There are different accounts of such induction – in this paper we focus on the very influential similarity coverage model (SCM) introduced by Osherson et al. [9].


Ethnobiology or folk biology is a branch of cognitive science which studies the ways in which people categorise the local fauna and flora and project their knowledge about a certain category onto other ones [2,6,9,15]. For example, given that bobcats secrete uric acid crystals and cows secrete uric acid crystals, subjects, on the basis that all mammals may have this property, infer that foxes secrete uric acid crystals. According to SCM, the subject performing an induction task first calculates the similarity of the premise categories (i.e. bobcats, cows) to the conclusion category (i.e. foxes). Then the subject calculates the average similarity (coverage) of the premise categories to the superordinate category including both the premise and conclusion categories (i.e. mammals). Let us consider the following example:

  Horses have an ileal vein,
  Donkeys have an ileal vein.
  -----------------------------
  Gophers have an ileal vein.

This argument is weaker than:

  Horses have an ileal vein,
  Gophers have an ileal vein.
  -----------------------------
  Cows have an ileal vein.

Of course, the similarity of horses to cows is much higher than the similarity of horses or donkeys to gophers. Thus the strength of an inductive inference depends on the maximal similarity of the conclusion category to some of the premise categories. Now let us shed some light on the coverage principle:

  Horses have an ileal vein,
  Cows have an ileal vein.
  -----------------------------
  All mammals have an ileal vein.

According to SCM this argument is weaker than the following one:

  Horses have an ileal vein,
  Gophers have an ileal vein.
  -----------------------------
  All mammals have an ileal vein.

The reason is that the average similarity of horses to other mammals is almost the same as that of cows. In other words, the set H of all animals considered to be similar to horses is almost equal to the set C of all animals similar to cows. Thus the second premise brings us nothing in terms of coverage. By contrast, gophers are similar to different mammals than horses are, and thus this premise makes the coverage higher. That is, the set H ∪ G, where G is the set of all animals similar to gophers, has more elements than the set H ∪ C. Thus, the following inductive inference

  Horses have an ileal vein,
  -----------------------------
  All mammals have an ileal vein.

is stronger than

  Bats have an ileal vein,
  -----------------------------
  All mammals have an ileal vein.


The range of mammals similar to cows is much wider than the range of mammals similar to bats. One can say that cows are more typical examples of mammals than bats or gophers.
Now, let us summarise the above examples in a more formal way. Firstly, there is given a set of categories C we reason about. This set is provided with a binary "kind of" relation K, which is acyclic and thus irreflexive and asymmetric. We call K taxonomic if and only if it is transitive and no item is of two distinct kinds.

Definition 1. A transitive relation K is taxonomic over C iff for any a, b, c ∈ C such that aKb and aKc, it holds that b = c or bKc or cKb.

For example, collie is a kind of dog and dog is a kind of mammal. Items x ∈ C such that there is no t satisfying tKx constitute basic categories. An example of a non-taxonomic relation is as follows: a wheelchair is a kind of furniture and a kind of vehicle. Now, neither furniture = vehicle nor furnitureKvehicle nor vehicleKfurniture. Subjects reasoning about C are additionally provided with a default notion of similarity R defined on the basic categories CBasic, i.e. the minimal elements of C with respect to K. People usually assume that R is at least reflexive and symmetric. Very often R is represented as an equivalence relation, that is, R is additionally transitive. Given that c1 ∈ CBasic has a property p, a subject may infer that c2 ∈ CBasic also satisfies p, if there exists c3 ∈ C such that c1Kc3, c2Kc3 and {c ∈ CBasic : c1Rc} is a "substantial" subset of {c ∈ CBasic : cKc3}. Informally speaking, one can transfer knowledge from a category c1 to c2 if the set of all elements considered to be similar to c1 is a substantial subset of the set of all CBasic-instantiations of the minimal taxonomic category c3 whose examples are c1 and c2. Summing up, one can say that (C, K) represents the gathered information, while R is an inductive "engine" making inferences about unknown features of objects belonging to C.
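Definition 1 can be checked mechanically on a finite category set. The following minimal Java sketch is illustrative only: K is given extensionally as a set of ordered pairs and is assumed to be transitive, and all names are choices made for this example.

import java.util.*;

static boolean isTaxonomic(Set<String> C, Set<List<String>> K) {
    for (String a : C)
        for (String b : C)
            for (String c : C)
                if (K.contains(List.of(a, b)) && K.contains(List.of(a, c)))
                    // Definition 1 requires: b = c or bKc or cKb
                    if (!b.equals(c) && !K.contains(List.of(b, c)) && !K.contains(List.of(c, b)))
                        return false;
    return true;
}

For K = {(wheelchair, furniture), (wheelchair, vehicle)} the method returns false, in accordance with the wheelchair example above.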

3 Rough Set Theory

In this section we briefly recall basic notions from RST which are relevant to inductive reasoning. We start by introducing the concept of an information system, then we discuss decision rules and different measures of their strength. We conclude by recalling some notions from the rough-mereological approach.

Definition 2. An information system is a quadruple U, A, V, f where:
– U is a non-empty finite set of objects;
– A is a non-empty finite set of attributes;
– V = ∪a∈A Va, where Va is the value-domain of the attribute a;
– f : U × A → V is an information function such that for all a ∈ A and u ∈ U, f(u, a) ∈ Va.

It is often useful to view an information system U, A, V, f as a decision table, assuming that A = C ∪ D and C ∩ D = ∅, where C is a set of conditional attributes and D is a set of decision attributes. For example, Figure 1 presents a decision table where:
– U = {Beaver, Squirrel, Mouse, Muskrat, Otter, Skunk},
– C = {Environment, Diet, Tail, Size},


Animals   Environment   Diet         Tail       Size        Poland
Beaver    semi-aquatic  herbivorous  flattened  medium      yes
Squirrel  terrestial    omnivorous   round      small       yes
Mouse     terrestial    omnivorous   round      very small  yes
Muskrat   semi-aquatic  omnivorous   round      medium      yes
Otter     semi-aquatic  carnivorous  round      medium      yes
Skunk     terrestial    omnivorous   round      medium      no

Fig. 1. An example of a dataset

– D = {Poland},
– e.g. VDiet = {herbivorous, carnivorous, omnivorous} for Diet ∈ C.
Each subset of attributes S ⊆ A determines an equivalence relation IND(S) ⊆ U × U defined as follows: IND(S) = {(u, v) : (∀a ∈ S) f(u, a) = f(v, a)}. As usual, IND(S) is called an indiscernibility relation induced by S, the partition induced by the relation IND(S) is denoted by U/IND(S), and [u]S denotes the equivalence class of IND(S) defined by u ∈ U. For instance, if S = {Environment, Diet}, then U/IND(S) = {{Beaver}, {Squirrel, Mouse, Skunk}, {Muskrat}, {Otter}}. Obviously, U/IND(A) refines every other partition U/IND(S), where S ⊆ A. Furthermore,

[u]A = ∩S⊆A [u]S.
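The indiscernibility classes of the Fig. 1 dataset can be computed mechanically. The following sketch (ours, not part of the paper) encodes the dataset as a dictionary and groups objects by their value vectors on a set S of attributes.

```python
# Illustrative sketch: the dataset of Fig. 1 and the partition U/IND(S).
DATA = {
    "Beaver":   {"Environment": "semi-aquatic", "Diet": "herbivorous", "Tail": "flattened", "Size": "medium",     "Poland": "yes"},
    "Squirrel": {"Environment": "terrestial",   "Diet": "omnivorous",  "Tail": "round",     "Size": "small",      "Poland": "yes"},
    "Mouse":    {"Environment": "terrestial",   "Diet": "omnivorous",  "Tail": "round",     "Size": "very small", "Poland": "yes"},
    "Muskrat":  {"Environment": "semi-aquatic", "Diet": "omnivorous",  "Tail": "round",     "Size": "medium",     "Poland": "yes"},
    "Otter":    {"Environment": "semi-aquatic", "Diet": "carnivorous", "Tail": "round",     "Size": "medium",     "Poland": "yes"},
    "Skunk":    {"Environment": "terrestial",   "Diet": "omnivorous",  "Tail": "round",     "Size": "medium",     "Poland": "no"},
}

def partition(data, attributes):
    """Return U/IND(S): objects grouped by identical values on `attributes`."""
    classes = {}
    for obj, row in data.items():
        key = tuple(row[a] for a in attributes)
        classes.setdefault(key, set()).add(obj)
    return list(classes.values())

print(partition(DATA, ["Environment", "Diet"]))
# four classes: {Beaver}, {Squirrel, Mouse, Skunk}, {Muskrat}, {Otter}
```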

Intuitively, any subset X ⊆ U which can be defined by a formula of some knowledge representation language L is a concept in L. For example, one can use the following simple descriptor language, say LDesc, based on a given information system U, A, V, f:

fml ::= [a = val] | ¬fml | fml ∧ fml | fml ∨ fml

where a ∈ A and val ∈ Va. We say that α ∈ LDesc is a formula over C if all attributes a in α belong to C. For any formula α ∈ LDesc, |α| denotes the meaning of α in U, i.e. the concept in LDesc which is defined as follows:
– If α is of the form [a = val], then |α| = {u ∈ U : f(u, a) = val};
– |¬α| = U \ |α|, |α ∧ β| = |α| ∩ |β|, |α ∨ β| = |α| ∪ |β|.
For example, if α = [Poland = no], then |α| = {Skunk}. Let α be a formula of LDesc over C and β a formula over D. Then the expression α ⇒ β is called a decision rule if |α| ∩ |β| ≠ ∅.

Definition 3. Let α ⇒ β be a decision rule and Card(B) denote the cardinality of the set B. Then, the accuracy Accα(β) and the coverage Covα(β) for α ⇒ β are defined as follows:

Accα(β) = Card(|α| ∩ |β|) / Card(|α|)   and   Covα(β) = Card(|α| ∩ |β|) / Card(|β|).

Example 1. Let us assume that α = [Environment = semi-aquatic] and β = [Poland = yes]. Then α ⇒ β is a decision rule over the information system depicted in Fig. 1, Accα(β) = 3/3 = 1, and Covα(β) = 3/5. It is worth emphasising that if Accα(β) = 1, then it holds that u ∈ |β| provided that u ∈ |α|. On the other hand, if Covα(β) = 1, then we have u ∈ |α| provided that u ∈ |β|. Thus, Accα(β) measures the sufficiency of α ⇒ β, whereas Covα(β) measures the necessity of α ⇒ β; for details see e.g. [16]. Over the years, several attempts have been made to introduce other measures of how good a given decision rule is. However, the meaning of these measures remains fixed: for a given decision rule α ⇒ β they answer the following question:

Given that u ∈ |α|, what is the chance that u ∈ |β|?     (1)
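The numbers of Example 1 can be verified directly; a small self-contained sketch (ours, not part of the paper) working on the extensions of α and β:

```python
# Illustrative sketch: accuracy and coverage of the Example 1 rule on Fig. 1.
ALPHA = {"Beaver", "Muskrat", "Otter"}                       # |[Environment = semi-aquatic]|
BETA  = {"Beaver", "Squirrel", "Mouse", "Muskrat", "Otter"}  # |[Poland = yes]|

def accuracy(A, B):
    """Acc_alpha(beta) = Card(|alpha| ∩ |beta|) / Card(|alpha|)."""
    return len(A & B) / len(A)

def coverage(A, B):
    """Cov_alpha(beta) = Card(|alpha| ∩ |beta|) / Card(|beta|)."""
    return len(A & B) / len(B)

print(accuracy(ALPHA, BETA))   # 1.0  (= 3/3)
print(coverage(ALPHA, BETA))   # 0.6  (= 3/5)
```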

In the evolution of RST, the rough-mereological approach is of special importance [13]. This approach is based on an inclusion function, called a rough inclusion function (RIF), which generalises the fuzzy set and rough set approaches. Generally speaking, RIFs measure the degree of inclusion of a set of objects X in a set of objects Y. In this paper we follow the definition of a RIF proposed in [5]:

Definition 4. A RIF upon U is any function κ : 2^U × 2^U → [0, 1] such that:
– (∀X, Y)(κ(X, Y) = 1 ⇔ X ⊆ Y),
– (∀X, Y, Z)(Y ⊆ Z ⇒ κ(X, Y) ≤ κ(X, Z)).

The most famous RIF is the so-called standard RIF, denoted by κ£, which is based on J. Łukasiewicz's ideas concerning the probability of truth of propositional formulas:

κ£(X, Y) = Card(X ∩ Y)/Card(X) if X ≠ ∅, and κ£(X, Y) = 1 otherwise.

Another RIF, κ1, which is really interesting in the context of induction, was proposed by A. Gomolińska in [5]:

κ1(X, Y) = Card(Y)/Card(X ∪ Y) if X ∪ Y ≠ ∅, and κ1(X, Y) = 1 otherwise.

As one can easily observe, for any X, Y ⊆ U and any decision rule α ⇒ β,

κ1(X, Y) = κ£(X ∪ Y, Y), Accα(β) = κ£(|α|, |β|), and Covα(β) = κ£(|β|, |α|).
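Both RIFs are one-liners over finite sets. The sketch below (ours) implements κ£ and κ1 and checks the identity κ1(X, Y) = κ£(X ∪ Y, Y) on the extensions from Example 1.

```python
# Illustrative sketch: the standard RIF and Gomolinska's kappa_1 on finite sets.

def kappa_std(X, Y):
    """Standard (Lukasiewicz) RIF: Card(X ∩ Y) / Card(X), and 1 when X is empty."""
    return len(X & Y) / len(X) if X else 1.0

def kappa_1(X, Y):
    """kappa_1: Card(Y) / Card(X ∪ Y), and 1 when X ∪ Y is empty."""
    return len(Y) / len(X | Y) if (X | Y) else 1.0

X = {"Beaver", "Muskrat", "Otter"}                        # |alpha| of Example 1
Y = {"Beaver", "Squirrel", "Mouse", "Muskrat", "Otter"}   # |beta| of Example 1

print(kappa_std(X, Y))                        # 1.0  -> Acc of the Example 1 rule
print(kappa_std(Y, X))                        # 0.6  -> Cov of the Example 1 rule
print(kappa_1(Y, X) == kappa_std(Y | X, X))   # True: kappa_1(X, Y) = kappa_std(X ∪ Y, Y)
```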

Summing up, many ideas from RST are based upon the notion of a RIF. In what follows, we shall be interested in the question of how RIFs can be used to assess the strength of decision rules and their generalisations. It is worth noting that we view these rules from the perspective of inductive reasoning and, in consequence, we change their standard interpretations.

4 Inductive Reasoning: RST Approach

Now, let us examine the above ideas from RST against the background of SCM. As said earlier, each formula α ∈ LDesc represents a concept in LDesc. It is easy to observe that |[a = val]| ∈ U/IND({a}), and

|α| = ∪A, for some A ⊆ U/IND(A).

Thus, the elements of U/IND(A) can be regarded as atomic concepts, and any other concept in LDesc can be built from atomic concepts by means of ∪. Furthermore, any formula α will be regarded as a concept name or, better still, a category. Generally speaking, SCM tries to answer the question of how safe it is to transfer knowledge about a value val of some attribute a from one category α to another category β. In other words:

Given that |α| ⊆ |[a = val]|, what is the chance that |β| ⊆ |[a = val]|?     (2)

Observe that this question, in contrast to (1), makes sense even for a rule α ⇒ β such that |α| ∩ |β| = ∅. We shall call such a rule an inductive rule. Furthermore, the examples from Section 2 require multi-premise inductive rules, represented by expressions α, β, γ ⇒ δ, rather than simple rules of the form α ⇒ δ. Let us recall that in Gentzen's sequent calculus, α, β, γ ⇒ δ means that δ follows from α ∧ β ∧ γ. However, in SCM-like inductive reasoning, α, β, γ ⇒ δ means that δ follows from α ∨ β ∨ γ. Indeed, consider for example the following decision rule based on the dataset from Fig. 1:

[Size = verysmall], [Size = small] ⇒ [Poland = yes],     (3)

where |[Size = verysmall]| = {Mouse}, |[Size = small]| = {Squirrel}, and |[Poland = yes]| = {Beaver, Squirrel, Mouse, Muskrat, Otter}. This rule


might represent the following inductive inference:

Mice have an ileal vein.
Squirrels have an ileal vein.
All animals living in Poland have an ileal vein.

As one can easily observe, |[Size = verysmall] ∧ [Size = small]| = ∅, and the conjunctive interpretation of the premises would lead us to wrong conclusions. Thus, given the rule α, β, γ ⇒ δ, we shall regard the premises as the category α ∨ β ∨ γ representing the concept |α ∨ β ∨ γ| in LDesc. Now we answer question (2). According to SCM we should (a) compute the similarity of the premise category to the conclusion category, and (b) compute the degree to which the premise category covers the lowest-level knowledge category that includes both the premise and conclusion categories. Intuitively, identity is the highest level of similarity. It is easy to observe that |α| = |β| iff Accα(β) = 1 and Covα(β) = 1. Thus, the measures Accα(β) and Covα(β) taken together tell us to what extent the categories |α| and |β| are similar. In this paper we use the following measure, denoted by Sim, which was first introduced by S. Kulczyński in the context of clustering methods [8]; see also [1]:

Sim(α, β) = (Accα(β) + Covα(β)) / 2.

It is easy to compute that for α ⇒ β from Example 1, Sim(α, β) = 4/5. Now, let us consider step (b) of SCM. It can be formalised as follows:

Cov(α, β) = Card(|α|) / Card(|C|),

where α ⇒ β is an inductive rule, and C represents the smallest category from the underlying ontology O containing both the premise and conclusion categories, i.e. |α| ∪ |β| ⊆ |C|. Since RST assumes that any formula α ∈ LDesc representing a concept |α| in LDesc is a category, the smallest category containing both α and β is α ∨ β. In other words, RST assumes the richest possible ontology, representing all concepts definable in LDesc. Thus, we have

CovRST(α, β) = Card(|α|) / Card(|α| ∪ |β|).

Observe that for α ⇒ β from Example 1, CovRST(α, β) = κ1(|β|, |α|). Thus, assessing the strength of a rule α ⇒ β consists in computing the values of a pair of RIFs. In our example, for the decision rule α ⇒ β defined as above, we have CovRST(α, β) = 3/5.
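For completeness, the two SCM components used here can be evaluated directly on the extensions of the Example 1 rule; a self-contained sketch (ours, not part of the paper):

```python
# Illustrative sketch: Sim and Cov_RST for the Example 1 rule.
ALPHA = {"Beaver", "Muskrat", "Otter"}                       # |[Environment = semi-aquatic]|
BETA  = {"Beaver", "Squirrel", "Mouse", "Muskrat", "Otter"}  # |[Poland = yes]|

def sim(A, B):
    """Sim = (Acc + Cov) / 2, on the extensions A = |alpha| and B = |beta|."""
    return (len(A & B) / len(A) + len(A & B) / len(B)) / 2

def cov_rst(A, B):
    """Cov_RST = Card(|alpha|) / Card(|alpha| ∪ |beta|)."""
    return len(A) / len(A | B)

print(sim(ALPHA, BETA))      # 0.8  (= 4/5)
print(cov_rst(ALPHA, BETA))  # 0.6  (= 3/5)
```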


Some comments may be useful here. The standard RIF κ£ is the most popular RIF in the rough-set community. The main reason seems to be the clarity of its interpretation. On the other hand, RIFs like κ1 or κ2 lack an obvious interpretation. Our present inquiry into RST against the background of SCM provides us with an intuitive meaning at least for κ1: it computes the coverage of the premise category with respect to the smallest category containing both the premise and the conclusion categories (relative to the ontology representing all concepts in LDesc). It is quite likely that other models of inductive reasoning may bring us new interpretations of other RIFs as well.

Let us now return to issues concerning the underlying ontology O. First, RST assumes that the ontology O represents all concepts in LDesc, which, in turn, belong to some U/IND(S) for S ⊆ A. That is, atomic concepts are indiscernibility classes and all other concepts are built up from them. Second, O is in fact the Boolean algebra of subsets of U (with operations ∪, ∩, \) generated by U/IND(A). Thus, what actually contributes to induction is the indiscernibility relation IND(A): when you know IND(A), you also know the corresponding RST ontology O. On this view, RST may be regarded as a kind of similarity-based approach to inductive reasoning. Only similarity (indiscernibility) classes affect induction and, in consequence, only step (a) of SCM (i.e. the similarity-based step) is performed. Since step (b) of SCM assumes additional conceptual knowledge, applying it to the RST ontology, which actually brings us nothing more than IND(A), may lead to wrong results.

Example 2. Consider the information system given by Fig. 1. Let
α = [Environment = terrestial] ∧ [Diet = omnivorous] ∧ [Tail = round]
and
β = [Poland = no], ¬β = [Poland = yes].
Then α ⇒ β is a decision rule and Accα(β) = 1/3, Covα(β) = 1/1 = 1, Sim(α, β) = 2/3, and CovRST(α, β) = 3/3 = 1. Also α ⇒ ¬β is a decision rule, for which we have Accα(¬β) = 2/3, Covα(¬β) = 2/5, Sim(α, ¬β) = 8/15, and CovRST(α, ¬β) = 3/6. Observe that, according to the above measures, the argument represented by α ⇒ β is stronger than the one represented by α ⇒ ¬β (50/30 and 31/30, respectively). However, our intuition suggests the opposite ranking, e.g.:

Skunks have a property P.
Mice have a property P.
Squirrels have a property P.
All animals not living in Poland have a property P.


Skunks have a property P.
Mice have a property P.
Squirrels have a property P.
All animals in Poland have a property P.

Given that mice and squirrels live in Poland, but skunks do not, it is obvious that the second argument should be recognised as stronger than the first one. In order to correct our result we have to consider a proper ontology O, which brings new information about U. Let us look at the scientific ontology given by Fig. 2, which is built for the dataset of Fig. 1; in fact, it is a fragment of the well-known biological taxonomy. In this case, the smallest concept containing both |α| and |β| is the set of all objects from the dataset. Thus, Cov(α, β) = 1/2 and the overall result for α ⇒ β is 7/6. The same computation for α ⇒ ¬β gives Cov(α, ¬β) = 1/2 and the overall result 31/30. This time the strength of both arguments is quite similar: the difference equals 4/30 in favour of the first argument. Thus, this time we have obtained a better result. Our example also shows that all categories used in induction should have proper extensions in the given dataset. For instance, the categories not living in Poland and skunk represent the same concept {Skunk}, which actually makes the first argument stronger than the second one even when the scientific ontology is applied. Observe also that in the case of the scientific ontology beavers and squirrels belong to the same family, yet they differ on all conditional attributes. Thus, this ontology really brings new knowledge about the dataset. However, sometimes it is better to have an ontology which reflects the knowledge encoded by means of the attributes from A. For example, the taxonomy in Fig. 3 represents the way people could categorise the animals from Fig. 1. Which ontology is more useful depends on the features we want to reason about. For instance, in ethnobiology it is widely agreed that the scientific ontology is better for reasoning about hidden properties of animals, whereas the common-sense ontology is better for reasoning about their behaviour (see Figs. 2 and 3 below).

Fig. 2. The scientific taxonomy for the dataset:
ORDER:    {Beaver, Squirrel, Mouse, Muskrat, Otter, Skunk}
SUBORDER: {Beaver, Squirrel, Mouse, Muskrat}, {Otter, Skunk}
FAMILY:   {Beaver, Squirrel}, {Mouse, Muskrat}, {Otter, Skunk}

Fig. 3. A common-sense taxonomy for the dataset:
Folk ORDER:    {Beaver, Squirrel, Mouse, Muskrat, Otter, Skunk}
Folk SUBORDER: {Beaver, Otter, Squirrel, Mouse, Muskrat}, {Skunk}
Folk FAMILY:   {Beaver, Muskrat, Otter}, {Mouse, Squirrel}, {Skunk}


Thus, the ontology must be carefully chosen with respect to the goal properties. As said above, the common-sense ontology is mainly based on attributes of objects. On the basis of this fact, one can regard concept lattices from formal concept analysis (FCA) [17,18] as such ontologies. Let us recall that any binary relation R ⊆ U × V induces two operators:

R+(A) = {b ∈ V : (∀a ∈ A) (a, b) ∈ R}, for A ⊆ U,
R+(B) = {a ∈ U : (∀b ∈ B) (a, b) ∈ R}, for B ⊆ V.

Definition 5. A concept induced by R ⊆ U × V is a pair (A, B), where A ⊆ U and B ⊆ V, such that A = R+(B) and B = R+(A). A set A ⊆ U is called an extent concept if A = R+R+(A). Similarly, if B ⊆ V is such that B = R+R+(B), then B is called an intent concept.

The set of all concepts of any information system is a complete lattice [17,18]. Since the lattice induced by our dataset from Fig. 1 is quite complicated, we present here only a list of concepts (see Fig. 4) instead of the Hasse diagram. As one can see, it is quite a large ontology when compared with the common-sense ontology. As a consequence, for a dataset as small as the one in this paper, the results are very similar to those obtained for the RST ontology. However, for large datasets the results may differ substantially. Checking how useful FCA ontologies are for inductive reasoning is a task for future work.

1. < {Beaver}, {semiaquatic, herbivorous, flattened, medium, yes} >
2. < {Squirrel}, {terrestial, omnivorous, round, small, yes} >
3. < {Mouse}, {terrestial, omnivorous, round, very small, yes} >
4. < {Muskrat}, {semiaquatic, omnivorous, round, medium, yes} >
5. < {Otter}, {semiaquatic, carnivorous, round, medium, yes} >
6. < {Skunk}, {terrestial, omnivorous, round, medium, no} >
7. < {Beaver, Squirrel, Mouse, Muskrat, Otter}, {yes} >
8. < {Beaver, Muskrat, Otter}, {semiaquatic, medium, yes} >
9. < {Beaver, Muskrat, Otter, Skunk}, {medium} >
10. < {Squirrel, Mouse}, {terrestial, omnivorous, round, yes} >
11. < {Squirrel, Mouse, Muskrat}, {omnivorous, round, yes} >
12. < {Squirrel, Mouse, Muskrat, Otter}, {round, yes} >
13. < {Squirrel, Mouse, Skunk}, {terrestial, omnivorous, round} >
14. < {Muskrat, Otter}, {semiaquatic, round, medium, yes} >
15. < {Muskrat, Skunk}, {omnivorous, round, medium} >
16. < {Squirrel, Mouse, Muskrat, Skunk}, {omnivorous, round} >
17. < {Muskrat, Otter, Skunk}, {round, medium} >
18. < {Squirrel, Mouse, Muskrat, Otter, Skunk}, {round} >
19. < {Beaver, Squirrel, Mouse, Muskrat, Otter, Skunk}, {} >
20. < {}, {semiaquatic, terrestial, herbivorous, omnivorous, carnivorous, flattened, round, verysmall, small, medium, yes, no} >

Fig. 4. Concepts induced by the dataset from Fig. 1
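The concepts of Fig. 4 can be recomputed by closing each subset of objects under the two derivation operators. The brute-force sketch below (ours, not part of the paper) represents each object of Fig. 1 by the set of its attribute values.

```python
# Illustrative sketch: recomputing the formal concepts of Fig. 4 by brute force.
from itertools import combinations

CONTEXT = {
    "Beaver":   {"semi-aquatic", "herbivorous", "flattened", "medium", "yes"},
    "Squirrel": {"terrestial", "omnivorous", "round", "small", "yes"},
    "Mouse":    {"terrestial", "omnivorous", "round", "very small", "yes"},
    "Muskrat":  {"semi-aquatic", "omnivorous", "round", "medium", "yes"},
    "Otter":    {"semi-aquatic", "carnivorous", "round", "medium", "yes"},
    "Skunk":    {"terrestial", "omnivorous", "round", "medium", "no"},
}
ALL_VALUES = set().union(*CONTEXT.values())

def intent(objs):
    """Attribute values shared by all objects in objs (all values for the empty set)."""
    return set.intersection(*(CONTEXT[o] for o in objs)) if objs else set(ALL_VALUES)

def extent(values):
    """Objects possessing every attribute value in `values`."""
    return {o for o in CONTEXT if values <= CONTEXT[o]}

concepts = set()
for r in range(len(CONTEXT) + 1):
    for objs in combinations(CONTEXT, r):
        A = extent(intent(set(objs)))                      # closure of the object set
        concepts.add((frozenset(A), frozenset(intent(A))))

print(len(concepts))   # 20, matching the list in Fig. 4
```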


Summing up this section, let us say a few words about inductive rules. As said above, under the interpretation expressed by Equation (2), apart from decision rules also inductive rules make sense. Let α ∈ LDesc be a description of beavers, i.e. |α| = {Beaver}, β ∈ LDesc be a description of squirrels, |β| = {Squirrel}, and γ be a description of skunks, |γ| = {Skunk}. Then |α| ∩ |β| = ∅, |α| ∩ |γ| = ∅, and both α ⇒ β and α ⇒ γ are inductive rules. In consequence, Sim(α, β) = Sim(α, γ) = 0. Observe also that CovRST cannot distinguish these rules either, for we have CovRST(α, β) = CovRST(α, γ) = 1/2. However, under the scientific ontology (Fig. 2), Cov(α, β) = 1/2, whereas Cov(α, γ) = 1/6.

5 Induction over Nearness Spaces

In this section we consider a topological counterpart of RST enriched by the concept of ontology. Recently a number of attempts have been made to connect RST with nearness-type structures, e.g. [11,19]. These structures (such as nearness spaces or merotopic spaces) are actually quite abstract, and this section aims to provide the reader with an intuitive interpretation of them. We start with some ideas concerning RST and inductive reasoning, and then we develop them into a nearness space. An information system U, A, V, f is often represented as an approximation space (U, IND(A)), that is, a non-empty set U equipped with an equivalence relation. This representation allows one to connect RST with relational structures which underlie many branches of mathematics, e.g. topology, logic, or universal algebra. Here we would like to change this approach and consider IND(S) for all S ⊆ A.

Definition 6. Let A, B ⊆ 2^X; then the refinement relation ≺ is defined by:

A ≺ B ⇔ (∀A ∈ A)(∃B ∈ B) A ⊆ B.

Obviously, for any information system U, A, V, f, U/IND(A) refines every other partition U/IND(S), for all S ⊆ A. A simple mathematical structure which generalises this observation is called a non-Archimedean structure.

Definition 7. A non-Archimedean structure μ on a non-empty set U is a set of partitions of U satisfying:

A ≺ B & A ∈ μ ⇒ B ∈ μ,

and the couple (U, μ) is called a non-Archimedean space.

Let IND_S = {U/IND(S) : S ⊆ A}. Observe that (U, IND_S) may fail to be a non-Archimedean space. Take as an example the dataset from Fig. 1 and consider the partition P = {{Beaver}, {Squirrel, Mouse, Muskrat, Otter, Skunk}}. Then U/IND(A) ≺ P, yet there is no S ⊆ A such that U/IND(S) = P. Furthermore, any concept α ∈ LDesc induces a partition Pα = {|α|, |¬α|} of U and U/IND(A) ≺ Pα. For example, when α = [Diet = herbivorous], then Pα = P. Thus, what we actually need is the non-Archimedean structure IND_A induced by U/IND(A):

IND_A = {P : P is a partition of U & U/IND(A) ≺ P}.
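Definition 6 and membership in IND_A are easy to operationalise; a small sketch (ours, not part of the paper) using the Fig. 1 objects:

```python
# Illustrative sketch: the refinement relation of Definition 6 and membership in
# IND_A = {P : P is a partition of U and U/IND(A) refines P}.

def refines(A, B):
    """A ≺ B: every block of A is contained in some block of B."""
    return all(any(a <= b for b in B) for a in A)

U_IND_A = [{"Beaver"}, {"Squirrel"}, {"Mouse"}, {"Muskrat"}, {"Otter"}, {"Skunk"}]

P = [{"Beaver"}, {"Squirrel", "Mouse", "Muskrat", "Otter", "Skunk"}]
print(refines(U_IND_A, P))   # True: P belongs to IND_A although no S ⊆ A induces it
```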


Proposition 1. Let U, A, V, f be an information system, and let C = {|α| : α ∈ LDesc} be the set of all non-empty concepts in LDesc. Then C = ∪ IND_A.

Proof. We have to prove that C ⊆ ∪ IND_A and ∪ IND_A ⊆ C. First, for every non-empty |α| in LDesc it holds that U/IND(A) ≺ Pα, and thus |α| ∈ Pα ∈ IND_A; hence C ⊆ ∪ IND_A. Second, assume that C ∈ ∪ IND_A. It means that C ∈ P for some partition P ∈ IND_A. Since U/IND(A) ≺ P, it follows that C = ∪ A for some A ⊆ U/IND(A). Every element Ci of A is a concept in LDesc for some αi ∈ LDesc, and thus C is a concept in LDesc, namely for α1 ∨ α2 ∨ . . . ∨ αi. Hence ∪ IND_A ⊆ C.

In other words, all non-empty concepts in LDesc belong to some partition of the non-Archimedean structure IND_A, and every element of such a partition is a concept in LDesc. Since an ontology is a subset of the family of all concepts in LDesc, the space (U, IND_A) sets the stage for conceptual knowledge about U.

Definition 8. Let U, A, V, f be an information system. Then by an ontology OA over (U, IND_A) we mean an ordered set of partitions (P, ≺) such that P ⊆ IND_A and for all Pi, Pj ∈ P it holds that Pi ≠ Pj for i ≠ j.

We say that C is a concept from an ontology OA = (P, ≺) if C ∈ ∪ P. In other words, C is a concept from OA if there is a partition P ∈ P such that C ∈ P. The set of all concepts from OA will be denoted by C_OA.

Example 3. The scientific taxonomy from Fig. 2 can be represented as follows: OA = {P1, P2, P3, P4}, ordered by ≺, where
P1 = {{Beaver}, {Squirrel}, {Mouse}, {Muskrat}, {Otter}, {Skunk}},
P2 = {{Beaver, Squirrel}, {Mouse, Muskrat}, {Otter, Skunk}},
P3 = {{Beaver, Squirrel, Mouse, Muskrat}, {Otter, Skunk}},
P4 = {{Beaver, Squirrel, Mouse, Muskrat, Otter, Skunk}}.

Definition 9. We say that C1 K C2 iff C1 ⊆ C2 and there exist Pi, Pj ∈ OA such that C1 ∈ Pi, C2 ∈ Pj and Pi ≺ Pj, for all C1, C2 ∈ C_OA.

Proposition 2. The relation K is taxonomic over C_OA for every ontology OA induced by the non-Archimedean space (U, IND_A).

Proof. It follows from the definition of K and the definition of ontology.

For an information system U, A, V, f, the associated taxonomy over OA will be denoted by (C_OA, K). In order to generalise this description to non-taxonomic ontologies, it suffices to define the ontology over a family of covers.


Definition 10. stack μ = {B ⊆ 2^X : (∃A ∈ μ) A ≺ B}.

First, stack IND_A is a family of covers of U. Second, for an information system U, A, V, f, as a generalised ontology we take an ordered set of covers (P, ≺).

Definition 11. Let U, A, V, f be an information system. Then by a generalised ontology GOA over (U, IND_A) we mean an ordered set of covers (P, ≺) such that P ⊆ stack IND_A and for all Pi, Pj ∈ P it holds that Pi ≠ Pj for i ≠ j.

Since stack IND_A provides the most general stage for inductive reasoning, we examine it in some detail now. First, observe that for an information system U, A, V, f, it holds that:

stack IND_A = stack {U/IND(A)}.

Thus, U/IND(A) suffices to generate the whole stack IND_A. Furthermore, the stack operation allows us to connect IND_A with nearness-type structures.

Definition 12. Let X be a set and ν be a non-empty set of coverings of X such that:

A ≺ B & A ∈ ν ⇒ B ∈ ν.

Then (X, ν) is called a pre-nearness space. When stack Eν = ν, for Eν = {P ∈ ν : P is a partition of X}, then (X, ν) is called a non-Archimedean pre-nearness space and Eν is its base.

Thus, the non-Archimedean structure IND_A on U is a base of the non-Archimedean pre-nearness space (U, stack IND_A).

Definition 13. Let (X, ν) be a pre-nearness space such that:

A ∈ ν & B ∈ ν ⇒ {A ∩ B : A ∈ A and B ∈ B} ∈ ν.

Then (X, ν) is called a merotopic space.

Definition 14. A merotopic space (X, ν) which satisfies:

A ∈ ν ⇒ {Int_ν(A) : A ∈ A} ∈ ν, where Int_ν(A) = {x ∈ X : {A, X \ {x}} ∈ ν},

is called a nearness space.

Proposition 3. Let U, A, V, f be an information system. Then (U, stack IND_A) is a non-Archimedean nearness space.

Proof. In order not to overload the paper with definitions, we give just a hint of how to prove this proposition. First, as is well known, every partition star-refines itself. Therefore (U, stack IND_A) is a uniform pre-nearness space. Second, every uniform pre-nearness space satisfies Definition 14; see, e.g., [4] for the proof. Finally, since U/IND(A) ∈ E_stack IND_A, the uniform pre-nearness space (U, stack IND_A) is closed under intersections, as required by Definition 13. Thus, (U, stack IND_A) is a non-Archimedean nearness space.


Please observe that this very simple description of SCM in terms of subsets of U has led us to (U, stack IND_A) as a proper stage for human-like inductive reasoning. Surprisingly, this stage is nothing more than a representation of a basic concept of RST. Let us recall that any information system U, A, V, f may be regarded as a finite approximation space (U, IND(A)), and in many cases this representation is more handy, e.g. in algebraic investigations into RST. Actually, the same remark may be applied to (U, stack IND_A).

Proposition 4. Let U be a non-empty finite set. Then there is a one-to-one correspondence between finite approximation spaces (U, E) and non-Archimedean nearness spaces (U, ν) over U.

Proof. For the same reason as above, we give only a sketch of the proof. Every finite non-Archimedean nearness space (U, ν) is induced by a partition P. Since P is the minimal open basis for the topology induced by Int_ν, it follows that (U, ν) is a topological nearness space. On the other hand, every finite topological space (U, ν) has the minimal open basis P for its topology Int_ν. Since Int_ν is symmetric, P is a partition and thus (U, ν) is a non-Archimedean nearness space. Finally, there is a one-to-one correspondence between finite topological nearness spaces and finite approximation spaces. See also [19].

Thus, non-Archimedean nearness spaces over finite sets may be considered as another special representation of information systems. Approximation spaces are useful when one considers, e.g., relational structures and modal logics, whereas nearness spaces are suitable for ontologies and inductive reasoning.

6 Final Remarks

The article presents an account of preliminary results concerning Rough Set Theory (RST) and the Similarity Coverage Model of category-based induction (SCM). In the first part of this paper we have shown how decision rules may be regarded as induction tasks and how rough inclusion functions may be used to compute the strength of inductive reasoning. In the second part we have presented a model of SCM based on non-Archimedean structures and non-Archimedean nearness spaces. Recently a number of attempts have been made to connect RST with nearness-type structures, e.g. [11,19]. Thus, the paper has presented some intuitive reasons to consider these abstract topological spaces. The model based on a non-Archimedean space has the nice property that every ontology over it is taxonomic.

Acknowledgement. The research has been supported by grant N N516 368334 from the Ministry of Science and Higher Education of the Republic of Poland and by the grant Innovative Economy Operational Programme 2007-2013 (Priority Axis 1: Research and development of new technologies) managed by the Ministry of Regional Development of the Republic of Poland.


References

1. Albatineh, A., Niewiadomska-Bugaj, M., Mihalko, D.: On Similarity Indices and Correction for Chance Agreement. Journal of Classification 23, 301–313 (2006)
2. Atran, S.: Classifying Nature Across Cultures. In: Osherson, D., Smith, E. (eds.) An Invitation to Cognitive Science. Thinking, pp. 131–174. MIT Press, Cambridge (1995)
3. Deses, D., Lowen-Colebunders, E.: On Completeness in a Non-Archimedean Setting via Firm Reflections. Bulletin of the Belgian Mathematical Society, Special volume: p-adic Numbers in Number Theory, Analytic Geometry and Functional Analysis, 49–61 (2002)
4. Deses, D.: Completeness and Zero-dimensionality Arising from the Duality Between Closures and Lattices. Ph.D. Thesis, Free University of Brussels (2003), http://homepages.vub.ac.be/~diddesen/phdthesis.pdf
5. Gomolińska, A.: On Three Closely Related Rough Inclusion Functions. In: Kryszkiewicz, M., Peters, J., Rybiński, H., Skowron, A. (eds.) RSEISP 2007. LNCS, vol. 4585, pp. 142–151. Springer, Heidelberg (2007)
6. Heit, E.: Properties of Inductive Reasoning. Psychonomic Bulletin & Review 7, 569–592 (2000)
7. Kloos, H., Sloutsky, V., Fisher, A.: Dissociation Between Categorization and Induction Early in Development: Evidence for Similarity-Based Induction. In: Proceedings of the XXVII Annual Conference of the Cognitive Science Society (2005)
8. Kulczyński, S.: Die Pflanzenassociationen der Pieninen. Bulletin International de L'Academie Polonaise des Sciences et des Lettres, Classe des Sciences Mathematiques et Naturelles, Serie B, Supplement II 2, 57–203 (1927)
9. Osherson, D.N., Smith, E.E., Wilkie, O., Lopez, A., Shafir, E.: Category-Based Induction. Psychological Review 97(2), 185–200 (1990)
10. Pawlak, Z.: Rough Sets. International Journal of Computer and Information Sciences 11, 341–356 (1982)
11. Peters, J., Skowron, A., Stepaniuk, J.: Nearness of Objects: Extension of Approximation Space Model. Fundamenta Informaticae 79, 497–512 (2007)
12. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishers, Boston (1991)
13. Polkowski, L., Skowron, A.: Rough Mereology: A New Paradigm for Approximate Reasoning. Int. J. Approx. Reasoning 15(4), 333–365 (1996)
14. Skowron, A., Stepaniuk, J.: Tolerance Approximation Spaces. Fundamenta Informaticae 27, 245–253 (1996)
15. Sloman, S.A.: Feature-Based Induction. Cognitive Psychology 25, 231–280 (1993)
16. Tsumoto, S.: Extraction of Experts' Decision Rules from Clinical Databases Using Rough Set Model. Intelligent Data Analysis 2(1-4), 215–227 (1998)
17. Wille, R.: Restructuring Lattice Theory: An Approach Based on Hierarchies of Concepts. In: Rival, I. (ed.) Ordered Sets, pp. 445–470. Reidel, Dordrecht-Boston (1982)
18. Wille, R.: Concept Lattices and Conceptual Knowledge Systems. Computers & Mathematics with Applications 23, 493–515 (1992)
19. Wolski, M.: Approximation Spaces and Nearness Type Structures. Fundamenta Informaticae 79, 567–577 (2007)

Probabilistic Dependencies in Linear Hierarchies of Decision Tables

Wojciech Ziarko

Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2

Abstract. The article is a study of probabilistic dependencies between attribute-defined partitions of a universe in hierarchies of probabilistic decision tables. The dependencies are expressed through two measures: the probabilistic generalization of Pawlak's measure of the dependency between attributes and the expected certainty gain measure introduced by the author. The expected certainty gain measure reflects the subtle grades of probabilistic dependence of events. Both dependency measures are developed, and it is shown how they can be extended from flat decision tables to dependencies existing in hierarchical structures of decision tables.

1 Introduction

The notion of a decision table has been around for a long time and has been widely used in circuit design, software engineering, business, and other application areas. In the original formulation, decision tables are static, lacking the ability to automatically learn and adapt their structures based on new information. Decision tables representing data-acquired classification knowledge were introduced by Pawlak [1]. In Pawlak's approach, the decision tables are dynamic structures derived from data, with the ability to adjust with new information. This fundamental difference opens the way to novel uses of decision tables in applications related to reasoning from data, such as data mining, machine learning or complex pattern recognition. The decision tables are typically used for making predictions about the value of the target decision attribute, such as a medical diagnosis, based on combinations of values of condition attributes, for example symptoms and test results, as measured on new, previously unseen objects (for example, patients). However, decision tables often suffer from the following problems, related to the fact that they are typically computed based on a subset, a sample, of the universe of all possible objects. Firstly, the decision table may have an excessive decision boundary, often due to poor quality of the descriptive condition attributes, which may be weakly correlated with the decision attribute. The excessive decision boundary leads to an excessive number of incorrect predictions. Secondly, the decision table may be highly incomplete, i.e. excessively many new measurement vectors of condition attributes of new objects are not matched


by any combination of condition attribute values present in the decision table. Such a highly incomplete decision table leads to an excessive number of new, unrepresented observations, for which the prediction of the decision attribute value is not possible. With condition attributes weakly correlated with the decision attribute, increasing their number does not rectify the first problem. Attempting to increase the number of condition attributes, or the number of possible values of the attributes, results in an exponential explosion of the complexity of decision table learning and leads to a rapid increase of its degree of incompleteness [8]. In general, the decision boundary reduction problem conflicts with the decision table incompleteness minimization problem. To deal with these fundamental difficulties, an approach involving building hierarchies of decision tables was proposed [6]. The approach is focused on learning hierarchical structures of decision tables rather than learning individual tables, subject to learning complexity constraints. In this approach, a linear hierarchy of decision tables is formed, in which the parent layer's decision boundary defines a universe of discourse for the child layer table. The decision tables on each layer are size-limited by reducing the number of condition attributes and their values, thus bounding their learning complexity [8]. Each layer contributes a degree of decision boundary reduction, while providing a shrinking decision boundary to the next layer. In this way, even in the presence of relatively weak condition attributes, a significant total boundary reduction can be achieved, while preserving the learning complexity constraints on each level. Similar to a single-layer decision table, the hierarchy of decision tables needs to be evaluated from the point of view of its quality as a potential classifier of new observations. The primary evaluative measure for decision tables, as introduced by Pawlak, is the measure of partial functional dependency between attributes [1] and its probabilistic extension [7]. Another measure is the recently introduced expected gain measure, which captures more subtle probabilistic associations between attributes [7]. In this paper, these measures are reviewed and generalized to hierarchical structures of decision tables. A simple recursive method of their computation is also discussed. The measures, referred to as γ and λ measures respectively, provide a tool for the assessment of decision table-based classifiers derived from data. The basics of rough set theory and the techniques for analysis of decision tables are presented in this article in the probabilistic context, with the underlying assumption that the universe of discourse U is potentially infinite and is known only partially through a finite collection of observation vectors (the sample data). This assumption is consistent with the great majority of applications in the areas of statistical analysis, data mining and machine learning.

2 Attribute-Based Probabilistic Approximation Spaces

In this section, we briefly review the essential assumptions, definitions and notations of rough set theory in the context of probability theory.

2.1 Attributes and Classifications

We assume that observations about objects are expressed through values of attributes, which are assumed to be functions a : U → Va, where Va is a finite set of values called the domain. The attributes represent some properties of the objects in U. It should, however, be mentioned here that, in practice, the attributes may not be functions but general relations, due to the influence of random measurement noise. The presence of noise may cause the appearance of multiple attribute values associated with an object. Traditionally, the attributes are divided into two disjoint categories: condition attributes, denoted as C, and decision attributes D = {d}. In many rough set-oriented applications, attributes are finite-valued functions obtained by discretizing values of real-valued variables representing measurements taken on objects e ∈ U.
As with individual attributes, any non-empty subset of attributes B ⊆ C ∪ D defines a mapping from the set of objects U into the set of vectors of values of attributes in B. This leads to the idea of an equivalence relation on U, called the indiscernibility relation IND_B = {(e1, e2) ∈ U × U : B(e1) = B(e2)}. According to this relation, objects having identical values of attributes in B are equivalent, that is, indistinguishable in terms of values of attributes in B. The collection of classes of identical objects will be denoted as U/B, and the pair (U, U/B) will be called an approximation space. The object sets G ∈ U/(C ∪ D) will be referred to as atoms. The sets E ∈ U/C will be referred to as elementary sets. The sets X ∈ U/D will be called decision categories. Each elementary set E ∈ U/C and each decision category X ∈ U/D is a union of some atoms. That is, E = ∪{G ∈ U/(C ∪ D) : G ⊆ E} and X = ∪{G ∈ U/(C ∪ D) : G ⊆ X}.

2.2 Probabilities

We assume that all subsets X ⊆ U under consideration are measurable by a probability measure function P, normally estimated from collected data in a standard way, with 0 < P(X) < 1, which means that they are likely to occur but their occurrence is not certain. In particular, each atom G ∈ U/(C ∪ D) is assigned a joint probability P(G). From our initial assumption and from the basic properties of the probability measure P, it follows that for all atoms G ∈ U/(C ∪ D), we have 0 < P(G) < 1 and

∑_{G ∈ U/(C ∪ D)} P(G) = 1.

Based on the joint probabilities of atoms, the probabilities of elementary sets E and of a decision category X can be calculated by P(E) = ∑_{G ⊆ E} P(G). The probability P(X) of the decision category X in the universe U is the prior probability of the category X. It represents the degree of confidence in the occurrence of the decision category X, in the absence of any information expressed by attribute values. The conditional probability of a decision category X, P(X|E) = P(X ∩ E)/P(E), conditioned on the occurrence of the elementary set E, represents the degree


of confidence in the occurrence of the decision category X, given information indicating that E occurred. The conditional probability can be expressed in terms of the joint probabilities of atoms by

P(X|E) = (∑_{G ⊆ X ∩ E} P(G)) / (∑_{G ⊆ E} P(G)).

This property allows for simple computation of the conditional probabilities of decision categories.
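These formulas can be evaluated directly from a table of joint atom probabilities. The sketch below (ours, with made-up numbers chosen only to echo one row of Table 1 later in the paper) mirrors the computation of P(E) and P(X|E).

```python
# Illustrative sketch (hypothetical numbers): probabilities of elementary sets and
# conditional probabilities of a decision category from joint atom probabilities.
# An atom is a (condition-value tuple, decision value) pair; the P(G) values sum to 1.

atoms = {
    (("1", "1", "2"), "yes"): 0.23,
    (("1", "0", "1"), "yes"): 0.20,
    (("1", "0", "1"), "no"):  0.13,
    (("0", "2", "1"), "yes"): 0.02,
    (("0", "2", "1"), "no"):  0.42,
}

def p_elementary(t):
    """P(E_t) = sum of P(G) over atoms G contained in the elementary set E_t."""
    return sum(p for (cond, _), p in atoms.items() if cond == t)

def p_conditional(x, t):
    """P(X|E_t) = (sum over atoms of X inside E_t) / P(E_t)."""
    num = sum(p for (cond, dec), p in atoms.items() if cond == t and dec == x)
    return num / p_elementary(t)

print(p_elementary(("1", "0", "1")))                      # 0.33
print(round(p_conditional("yes", ("1", "0", "1")), 2))    # 0.61
```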

2.3 Variable Precision Rough Sets

The theory of rough sets underlies the methods for derivation, optimization and analysis of decision tables acquired from data. In this part, we review the basic definitions and assumptions of the variable precision rough set model (VPRSM) [5,7]. The VPRSM is a direct generalization of Pawlak rough sets [1]. One of the main objectives of rough set theory is the formation and analysis of approximate definitions of otherwise undefinable sets [1]. The approximate definitions, in the form of the lower approximation and the boundary area of a set, allow for determination of an object's membership in a set with varying degrees of certainty. The lower approximation permits uncertainty-free membership determination, whereas the boundary defines an area of objects which are not certain, but possible, members of the set [1]. The VPRSM extends these ideas by parametrically defining the positive region as an area where the certainty degree of an object's membership in a set is relatively high, the negative region as an area where the certainty degree of an object's membership in a set is relatively low, and by defining the boundary as an area where the certainty of an object's membership in a set is deemed neither high nor low. The defining criteria in the VPRSM are expressed in terms of conditional probabilities and of the prior probability P(X) of the set X in the universe U. The prior probability P(X) is used as a reference value here, as it represents the likelihood of the occurrence of X in the extreme case characterized by the absence of any attribute-based information. In the context of the attribute-value representation of sets of the universe U, as described in the previous section, we will assume that the sets of interest are decision categories X ∈ U/D. Two precision control parameters are used: the lower limit l, 0 ≤ l < P(X) < 1, representing the highest acceptable degree of the conditional probability P(X|E) to include the elementary set E in the negative region of the set X; and the upper limit u, 0 < P(X) < u ≤ 1, reflecting the least acceptable degree of the conditional probability P(X|E) to include the elementary set E in the positive region, or u-lower approximation, of the set X. The l-negative region of the set X, denoted as NEG_l(X), is defined by:

NEG_l(X) = ∪{E : P(X|E) ≤ l}.     (1)

The l-negative region of the set X is a collection of objects for which the probability of membership in the set X is significantly lower than the prior probability P(X). The u-positive region of the set X, POS_u(X), is defined as

POS_u(X) = ∪{E : P(X|E) ≥ u}.     (2)

The u-positive region of the set X is a collection of objects for which the probability of membership in the set X is significantly higher than the prior probability P(X). The objects which are not classified as being in the u-positive region nor in the l-negative region belong to the (l, u)-boundary region of the decision category X, denoted as

BND_{l,u}(X) = ∪{E : l < P(X|E) < u}.     (3)

The boundary is a speciﬁcation of objects about which it is known that their associated probability of belonging, or not belonging to the decision category X, is not much diﬀerent from the prior probability of the decision category P (X). The VPRSM reduces to standard rough sets when l = 0 and u = 1.
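As a concrete illustration (ours, not part of the article), the region assignment can be computed directly from the conditional probabilities P(X|E); the elementary-set labels and the values of l and u below are hypothetical.

```python
# Illustrative sketch: VPRSM positive, negative and boundary regions from the
# conditional probabilities P(X|E) of elementary sets.

def vprsm_regions(cond_prob, l, u):
    """Split elementary sets E into POS (P(X|E) >= u), NEG (P(X|E) <= l), BND otherwise."""
    pos = {E for E, p in cond_prob.items() if p >= u}
    neg = {E for E, p in cond_prob.items() if p <= l}
    bnd = set(cond_prob) - pos - neg
    return pos, neg, bnd

cond_prob = {"E1": 1.00, "E2": 0.61, "E3": 0.27, "E4": 1.00, "E5": 0.06}
pos, neg, bnd = vprsm_regions(cond_prob, l=0.1, u=0.8)
print(sorted(pos), sorted(neg), sorted(bnd))
# ['E1', 'E4'] ['E5'] ['E2', 'E3']
```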

3 Structures of Decision Tables Acquired from Data

To describe functional or partial functional connections between attributes of objects of the universe U, Pawlak introduced the idea of a decision table acquired from data [1]. Probabilistic decision tables and their hierarchies extend this idea into the probabilistic domain by forming representations of probabilistic relations between attributes.

3.1 Probabilistic Decision Tables

For a given decision category X ∈ U/D and set values of the VPRSM lower and upper limit parameters l and u, we define the probabilistic decision table DT^{C,D}_{l,u} as a mapping C(U) → {POS, NEG, BND} derived from the classification table as follows. The mapping assigns to each tuple of condition attribute values t ∈ C(U) the unique designation of the VPRSM approximation region (POS_u(X), NEG_l(X) or BND_{l,u}(X)) in which the corresponding elementary set E_t is included, along with the associated elementary set probability P(E_t) and conditional probability P(X|E_t):

DT^{C,D}_{l,u}(t) = (P(E_t), P(X|E_t), POS) ⇔ E_t ⊆ POS_u(X)
DT^{C,D}_{l,u}(t) = (P(E_t), P(X|E_t), NEG) ⇔ E_t ⊆ NEG_l(X)     (4)
DT^{C,D}_{l,u}(t) = (P(E_t), P(X|E_t), BND) ⇔ E_t ⊆ BND_{l,u}(X)

The probabilistic decision table is an approximate representation of the probabilistic relation between condition and decision attributes via a collection of uniform-size probabilistic rules corresponding to the rows of the table. An example probabilistic decision table is shown in Table 1. In this table, the condition attributes are a, b, c, the attribute-value combinations correspond to elementary sets E, and Region is the designation of the approximation region the corresponding elementary set belongs to: positive (POS), negative (NEG) or boundary (BND). Probabilistic decision tables are most useful for decision making or prediction when the relation between condition and decision attributes is largely non-deterministic. However, they suffer from the inherent contradiction between

Table 1. An example of a probabilistic decision table

a  b  c   P(E)   P(X|E)   Region
1  1  2   0.23   1.00     POS
1  0  1   0.33   0.61     BND
2  2  1   0.11   0.27     BND
2  0  2   0.01   1.00     POS
0  2  1   0.32   0.06     NEG

the accuracy and completeness. In the presence of a boundary region, higher accuracy, i.e. reduction of the boundary region, can be achieved either by adding new condition attributes or by increasing the precision of existing ones (for instance, by making the discretization procedure finer). Both solutions lead to exponential growth in the maximum number of attribute-value combinations to be stored in the decision table [8]. In practice, this results in such negative effects as an excessive size of the decision table, a likely high degree of table incompleteness (in the sense of missing many feasible attribute-value combinations), weak data support for the elementary sets represented in the table and, consequently, unreliable estimates of probabilities. The use of hierarchies of decision tables rather than individual tables in the process of classifier learning from data provides a partial solution to these problems [6].
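One way to read definition (4) operationally is sketched below (our illustration, not part of the article): a function that turns a list of labelled observations into the rows of a probabilistic decision table; the observation counts and the l, u values are hypothetical.

```python
# Illustrative sketch: building a probabilistic decision table, i.e. rows of
# (condition tuple, P(E), P(X|E), region), from labelled observations.
from collections import Counter

def probabilistic_decision_table(observations, l, u):
    """observations: list of (condition-value tuple, in_X boolean)."""
    n = len(observations)
    per_t = Counter(t for t, _ in observations)        # counts estimating P(E_t)
    in_x  = Counter(t for t, x in observations if x)   # counts of X within E_t
    table = []
    for t, cnt in per_t.items():
        p_e, p_x_e = cnt / n, in_x[t] / cnt
        region = "POS" if p_x_e >= u else "NEG" if p_x_e <= l else "BND"
        table.append((t, p_e, p_x_e, region))
    return table

obs = [((1, 1, 2), True)] * 23 + [((1, 0, 1), True)] * 20 + [((1, 0, 1), False)] * 13 \
    + [((0, 2, 1), True)] * 2 + [((0, 2, 1), False)] * 30
for row in probabilistic_decision_table(obs, l=0.1, u=0.8):
    print(row)
```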

3.2 Probabilistic Decision Table Hierarchies

Since the VPRSM boundary region BND_{l,u}(X) is a definable subset of the universe U, it allows one to structure the decision tables into hierarchies by treating the boundary region BND_{l,u}(X) as a sub-universe of U, denoted as U' = BND_{l,u}(X). The "child" sub-universe U' so defined can be made completely independent from its "parent" universe U, by having its own collection of condition attributes C' to form a "child" approximation sub-space (U', U'/C'). As on the parent level, in the approximation space (U', U'/C'), the decision table for the subset X' ⊆ X of the target decision category X, X' = X ∩ BND_{l,u}(X), can be derived by adapting formula (4). By repeating this step recursively, a linear hierarchy of probabilistic decision tables can be grown until either the boundary area disappears in one of the child tables, or no attributes can be identified to produce a non-boundary decision table at the final level. Other termination conditions are possible, but this issue is outside the scope of this article. The nesting of approximation spaces obtained as a result of the recursive computation of decision tables, as described above, creates a new approximation space on U. The resulting hierarchical approximation space (U, R) cannot be expressed by the indiscernibility relation, as defined in Section 2, in terms of the attributes used to form the local sub-spaces on the individual levels of the hierarchy. This leads to the basic question: how to measure the degree of the (mostly probabilistic) dependency between the hierarchical partition R of U and the partition (X, ¬X) corresponding to the decision category X ⊆ U. Some probabilistic inter-partition dependency measures are explored in the next section.

4 Dependencies in Decision Table Hierarchies

Dependencies between partitions are fundamental to rough set-based non-probabilistic and probabilistic reasoning and prediction. They allow one to predict the occurrence of a class of one partition based on the information that a class of another partition occurred. There are several ways dependencies between partitions can be defined in decision tables. In Pawlak's early works, functional and partial functional dependencies were explored [1]. The probabilistic generalization of these dependencies was also defined and investigated in the framework of the variable precision rough set model. All these dependencies represent the relative size of the positive and negative regions of the target set X. They reflect the quality of the approximation of the target category in terms of the elementary sets of the approximation space. Following the original Pawlak terminology, we will refer to these dependencies as γ-dependencies. Another kind of dependency, based on the notion of the certainty gain measure, reflects the average degree of improvement of the certainty of occurrence of the decision category X, or ¬X, relative to its prior probability P(X) [7] (see also [2] and [4]). We will refer to these dependencies as λ-dependencies. Both the γ-dependencies and the λ-dependencies can be extended to hierarchies of probabilistic decision tables, as described below. Because there is no single collection of attributes defining the partition of U, the dependencies of interest in this case are dependencies between the hierarchical partition R generated by the decision table hierarchy, forming the approximation space (U, R), and the partition (X, ¬X) defined by the target set.

4.1 γ-Dependencies for Decision Tables

The partial functional dependency among attributes, referred to as the γ-dependency measure γ(D|C), was introduced by Pawlak [1]. It can be expressed in terms of the probability of the positive region of the partition U/D defining the decision categories:

γ(D|C) = P(POS_{C,D}(U)),     (5)

where POS_{C,D}(U) is the positive region of the partition U/D in the approximation space induced by the partition U/C. In the binary case of two decision categories, X and ¬X, the γ(D|C)-dependency can be extended to the VPRSM by defining it as the combined probability of the u-positive and l-negative regions:

γ_{l,u}(X|C) = P(POS_u(X) ∪ NEG_l(X)).     (6)

The γ-dependency measure reflects the proportion of objects in U which can be classified with sufficiently high certainty as being members, or non-members, of the set X.
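A direct computation of (6) from a flat decision table is a one-line sum; the sketch below (ours, not part of the article) uses the rows of Table 1 and hypothetical values l = 0.1 and u = 0.8.

```python
# Illustrative sketch: the VPRSM gamma dependency (6) from elementary-set data.

def gamma_lu(elementary, l, u):
    """elementary: iterable of (P(E), P(X|E)) pairs; returns P(POS_u ∪ NEG_l)."""
    return sum(p_e for p_e, p_x_e in elementary if p_x_e >= u or p_x_e <= l)

table = [(0.23, 1.00), (0.33, 0.61), (0.11, 0.27), (0.01, 1.00), (0.32, 0.06)]
print(gamma_lu(table, l=0.1, u=0.8))   # 0.56: the probability mass outside the boundary
```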

4.2 Computation of γ-Dependencies in Decision Table Hierarchies

In the case of an approximation space formed via the hierarchical classification process, the γ-dependency between the hierarchical partition R and


the partition (X, ¬X) can be computed directly by analyzing all classes of the hierarchical partition. However, an easier-to-implement recursive computation is also possible. This is done by recursively applying, starting from the leaf table of the hierarchy and going up to the root table, the following formula (7) for computing the dependency γ^U_{l,u}(X|R) of the parent table in the hierarchical approximation space (U, R), given the dependency γ^{U'}_{l,u}(X|R') of a child-level table in the sub-approximation space (U', R'):

γ^U_{l,u}(X|R) = γ^U_{l,u}(X|C) + P(U') γ^{U'}_{l,u}(X|R'),     (7)

where C is the collection of attributes inducing the approximation space on U and U' = BND_{l,u}(X). As in the flat-table case, this dependency measure represents the fraction of objects that can be classified with acceptable certainty into the decision categories X or ¬X by applying the decision tables in the hierarchy. The dependency of the whole structure of decision tables, that is, the last dependency computed by the recursive application of formula (7), will be called the global γ-dependency. Alternatively, the global γ-dependency can be computed straight from the definition (5). This computation requires checking all elementary sets of the hierarchical partition for inclusion in POS_u(X) ∪ NEG_l(X), which seems to be less elegant and more time consuming than the recursive method.

4.3 Certainty Gain Functions

Based on the probabilistic information contained in the data, as given by the joint probabilities of atoms, it is also possible to evaluate the degree of probabilistic dependency between any elementary set and a decision category. The dependency measure is called the absolute certainty gain (gabs) [7]. It represents the degree of influence the occurrence of an elementary set E has on the likelihood of occurrence of the decision category X. The occurrence of E can increase, decrease, or have no effect on the probability of occurrence of X. The probability of occurrence of X, in the absence of any other information, is given by its prior probability P(X). The degree of variation of the probability of X, due to the occurrence of E, is reflected by the absolute certainty gain function:

gabs(X|E) = |P(X|E) − P(X)|,     (8)

where | ∗ | denotes the absolute value function. The values of the absolute gain function fall in the range 0 ≤ gabs(X|E) ≤ max(P(¬X), P(X)) < 1. In addition, if the sets X and E are independent in the probabilistic sense, that is, if P(X ∩ E) = P(X)P(E), then gabs(X|E) = 0. The definition of the absolute certainty gain provides a basis for the definition of a new probabilistic dependency measure between attributes. This dependency can be expressed as the average degree of change of the occurrence certainty of the decision category X, or of its complement ¬X, due to the occurrence of any elementary set [7], as defined by the expected certainty gain function:

egabs(X|C) = ∑_{E ∈ U/C} P(E) gabs(X|E),     (9)


where X ∈ U/D. The expected certainty gain is a more subtle inter-partition dependency than the γ-dependency, since it takes into account the probabilistic distribution information in the boundary region of X. The egabs(X|C) measure can be computed directly from the joint probabilities of atoms. It can be proven [7] that the expected gain function falls in the range 0 ≤ egabs(X|C) ≤ 2P(X)(1 − P(X)), where X ∈ U/D.

4.4 Attribute λ-Dependencies in Decision Tables

The strongest dependency between attributes of a decision table occurs when the decision category X is definable, i.e. when the dependency is functional. Consequently, the dependency in this deterministic case can be used as a reference value to normalize the certainty gain function. The following normalized expected gain function λ(X|C) measures the expected degree of the probabilistic dependency between elementary sets and the decision categories belonging to U/D [7]:

λ(X|C) = egabs(X|C) / (2P(X)(1 − P(X))),     (10)

where X ∈ U/D. The λ-dependency quantifies in relative terms the average degree of deviation of the elementary sets from statistical independence with the decision class X ∈ U/D. The dependency function reaches its maximum λ(X|C) = 1 only if the dependency is deterministic (functional), and is at its minimum when all events represented by elementary sets E ∈ U/C are unrelated to the occurrence of the decision class X ∈ U/D. In the latter case, the conditional distribution of the decision class P(X|E) equals its prior distribution P(X). The value of the λ(X|C) dependency function can be easily computed from the joint probabilities of atoms. As opposed to the generalized γ(X|C) dependency, the λ(X|C) dependency has the monotonicity property [3], that is, λ(X|C) ≤ λ(X|C ∪ {a}), where a is an extra condition attribute outside the set C. This monotonicity property allows for dependency-preserving reduction of attributes and leads to the notion of a probabilistic λ-reduct of attributes, as defined in [3].
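Formulas (8)-(10) can be combined into one short routine; the sketch below (ours, not part of the article) reuses the Table 1 rows as a hypothetical flat decision table and recovers P(X) by the law of total probability.

```python
# Illustrative sketch: absolute certainty gain (8), expected gain (9) and the
# normalized lambda dependency (10) from elementary-set probabilities.

def lambda_dependency(elementary):
    """elementary: list of (P(E), P(X|E)); P(X) is recovered as sum of P(E) P(X|E)."""
    p_x = sum(p_e * p_x_e for p_e, p_x_e in elementary)               # prior P(X)
    egabs = sum(p_e * abs(p_x_e - p_x) for p_e, p_x_e in elementary)  # formula (9)
    return egabs / (2 * p_x * (1 - p_x))                              # formula (10)

table = [(0.23, 1.00), (0.33, 0.61), (0.11, 0.27), (0.01, 1.00), (0.32, 0.06)]
print(round(lambda_dependency(table), 3))   # ~0.648 for the Table 1 numbers
```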

4.5 Computation of λ-Dependencies in Decision Table Hierarchies

The λ-dependencies can be computed directly based on any known partitioning of the universe U. In cases when the approximation space is formed through hierarchical classification, the λ-dependency between the partition R so created and the target category X can be computed via a recursive formula derived below. Let

egabs_{l,u}(X|C) = ∑_{E ⊆ POS_u(X) ∪ NEG_l(X)} P(E) gabs(X|E)     (11)

denote the conditional expected gain function, i.e. the expected gain restricted to the union of the positive and negative regions of the target set X in the approximation space generated by the attributes C. The maximum value of egabs_{l,u}(X|C), achievable


in the deterministic case, is 2P(X)(1 − P(X)). Thus, the normalized conditional λ-dependency function can be defined as:

λ_{l,u}(X|C) = egabs_{l,u}(X|C) / (2P(X)(1 − P(X))).     (12)

As with γ-dependencies, λ-dependencies between the target partition (X, ¬X) and the hierarchical partition R can be computed recursively. The following formula (13) describes the relationship between the λ-dependency computed in the approximation space (U, R) and the dependency computed over the approximation sub-space (U', R'), where R and R' are hierarchical partitions of the universes U and U' = BND_{l,u}(X), respectively. Let λ_{l,u}(X|R) and λ_{l,u}(X|R') denote the λ-dependency measures in the approximation spaces (U, R) and (U', R'), respectively. The λ-dependencies in those approximation spaces are related by the following:

λ_{l,u}(X|R) = λ_{l,u}(X|C) + P(BND_{l,u}(X)) λ_{l,u}(X|R').     (13)

The proof of the above formula follows directly from Bayes' equation. In practical terms, formula (13) provides a method for efficient computation of the conditional λ-dependency in a hierarchical arrangement of probabilistic decision tables. According to this method, to compute the conditional λ-dependency for each level of the hierarchy, it suffices to compute the conditional λ-dependency at that level and to know the "child" BND_{l,u}(X)-level conditional λ-dependency. That is, the conditional λ-dependency should be computed first for the bottom-level table using formula (12), and then computed for each subsequent level in a bottom-up fashion by successively applying (13). In a similar way, the "unconditional" λ-dependency λ(X|R) can be computed over all elementary sets of the hierarchical approximation space. This is made possible by the following variant of formula (13):

λ(X|R) = λ_{l,u}(X|C) + P(BND_{l,u}(X)) · λ(X|R′).    (14)

The recursive process based on formula (14) is essentially the same as in the case of (13), except that the bottom-up procedure starts with the computation of the "unconditional" λ-dependency by formula (10) for the bottom-level table.
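The bottom-up procedure can be sketched as follows in Python. The sketch assumes that each non-bottom level of the hierarchy supplies its conditional dependency λ_{l,u}(X|C) from formula (12) and the probability of its boundary region, and that the bottom-level table supplies its unconditional dependency from formula (10); the variable names are illustrative only.

```python
# A sketch of the bottom-up evaluation of formula (14). Each non-bottom level
# of the hierarchy is assumed to provide its conditional dependency
# lambda_cond[i] (formula (12)) and the probability p_bnd[i] of its boundary
# region; the bottom-level table provides its unconditional dependency
# lambda_bottom (formula (10)). Lists are ordered from the root table down.

def hierarchical_lambda(lambda_cond, p_bnd, lambda_bottom):
    lam = lambda_bottom                      # start at the bottom-level table
    for lc, pb in zip(reversed(lambda_cond), reversed(p_bnd)):
        lam = lc + pb * lam                  # formula (14), one level up
    return lam

# A two-level hierarchy feeding a bottom-level table:
print(hierarchical_lambda(lambda_cond=[0.40, 0.30], p_bnd=[0.35, 0.50],
                          lambda_bottom=0.60))
```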

5 Concluding Remarks

Learning and evaluation of hierarchical structures of probabilistic decision tables is the main focus of this article. The earlier introduced measures of gamma and lambda dependencies between attributes [7] for decision tables acquired from data are not directly applicable to approximation spaces corresponding to hierarchical structures of decision tables. The main contribution of this work is the extension of the measures to the decision table hierarchies case and the derivation of recursive formulas for their easy computation. The gamma dependency


measure allows for the assessment of the prospective ability of the classifier based on the hierarchy of decision tables to predict the values of the decision attribute at the required level of certainty. The lambda dependency measure captures the relative degree of probabilistic correlation between classes of the partitions corresponding to condition and decision attributes, respectively. The degree of correlation in this case represents the average improvement of the ability to predict the occurrence of the target set X, or its complement ¬X. Jointly, both measures enable the user to evaluate the progress of learning with the addition of new training data and to assess the quality of the empirical classifier. Three experimental applications of the presented approach are currently under development. The first one is concerned with face recognition using photos to develop a classifier in the form of a hierarchy of decision tables, the second one is aimed at adaptive learning of spam recognition among e-mails, and the third one is focused on stock price movement prediction using historical data. Acknowledgment. This paper is an extended version of the article included in the Proceedings of the International Conference on Rough Sets and Emerging Intelligent Systems Paradigms, devoted to the memory of Professor Zdzislaw Pawlak, held in Warsaw, Poland in 2007. The support of the Natural Sciences and Engineering Research Council of Canada in funding the research presented in this article is gratefully acknowledged.

References 1. Pawlak, Z.: Rough sets - Theoretical Aspects of Reasoning About Data. Kluwer, Dordrecht (1991) 2. Greco, S., Matarazzo, B., Slowinski, R.: Rough membership and Bayesian conﬁrma´ ezak, D., Wang, G., Szczuka, M.S., tion measures for parametrized rough sets. In: Sl D¨ untsch, I., Yao, Y. (eds.) RSFDGrC 2005. LNCS, vol. 3641, pp. 314–324. Springer, Heidelberg (2005) 3. Slezak, D., Ziarko, W.: The Investigation of the Bayesian rough set model. International Journal of Approximate Reasoning 40, 81–91 (2005) 4. Yao, Y.: Probabilistic approaches to rough sets. Expert Systems 20(5), 287–291 (2003) 5. Ziarko, W.: Variable precision rough sets model. Journal of Computer and Systems Sciences 46(1), 39–59 (1993) 6. Ziarko, W.: Acquisition of hierarchy-structured probabilistic decision tables and rules from data. In: Proc. of IEEE Intl. Conf. on Fuzzy Systems, Honolulu, pp. 779–784 (2002) ´ ezak, D., Wang, G., Szczuka, M.S., 7. Ziarko, W.: Probabilistic rough sets. In: Sl D¨ untsch, I., Yao, Y. (eds.) RSFDGrC 2005. LNCS, vol. 3641, pp. 283–293. Springer, Heidelberg (2005) 8. Ziarko, W.: On learnability of decision tables. In: Tsumoto, S., Slowi´ nski, R., Komorowski, J., Grzymala-Busse, J.W. (eds.) RSCTC 2004. LNCS, vol. 3066, pp. 394– 401. Springer, Heidelberg (2004)

Automatic Singing Voice Recognition Employing Neural Networks and Rough Sets
Paweł Żwan, Piotr Szczuko, Bożena Kostek, and Andrzej Czyżewski
Gdańsk University of Technology, Multimedia Systems Department, Narutowicza 11/12, 80-952 Gdańsk, Poland
{zwan,szczuko,bozenka,ac}@sound.eti.pg.gda.pl

Abstract. The aim of the research study presented in this paper is the automatic recognition of a singing voice. For this purpose, a database containing sample recordings of trained and untrained singers was constructed. Based on these recordings, certain voice parameters were extracted. Two recognition categories were defined – one reflecting the skills of a singer (quality), and the other reflecting the type of the singing voice (type). The paper also presents the parameters designed especially for the analysis of a singing voice and gives their physical interpretation. Decision systems based on artificial neural networks and rough sets are used for automatic voice quality/type classification. Results obtained from both decision systems are then compared and conclusions are derived. Keywords: Singing Voice, Feature extraction, Automatic Classification, Artificial Neural Networks, Rough Sets, Music Information Retrieval.

1 Introduction

The area of automatic content indexing and classification is related to the Music Information Retrieval (MIR) domain, which is now growing very rapidly and induces many discussions on automatic speech recognition and the development of appropriate systems. Speech is not the only product of the human voice organ. Singing is another one, and it is considered a musical instrument by musicologists. However, its artistic and musical aspects are the reason why singing must be analyzed by specially designed additional parameters. These parameters obviously should be based on speech parameters, but additionally they must focus on articulation and timbre. A parametric description is necessary in many applications of automatic sound recognition. The very complicated biomechanics of the singing voice [10], [27] and the distinct character of its intonation and timbre require numerous features to describe its operation. Such a parametric representation needs intelligent decision systems in the classification process. In the presented study, artificial neural network (ANN) and rough set-based (RS) decision systems were employed for the purpose of singing voice quality/type recognition. The systems were trained with sound samples, of which a large part (1700 samples) was


recorded in the studio and 1200 samples were extracted from professional CD recordings. For every sound sample, a feature vector (FV) containing 331 parameters was formed. The parameters were divided into two groups: the so-called dedicated ones (designed to allow for singing voice specifics) and more general ones known from the literature on MIR and speech recognition. The ability of the decision systems to automatically classify a singing voice is discussed in the context of comparing the efficiency of ANN and RS systems in two recognition categories: 'voice type' (classes: bass, baritone, tenor, alto, mezzo-soprano, soprano) and 'voice quality' (classes: amateur, semi-professional, professional). Additionally, the parameters were judged using statistical and rough set methods. For different methods of reducing the feature vector redundancy, new classifiers were trained. The results were compared by analyzing the accuracy of the trained recognition systems. This article is an extended version of the paper presented at the RSEISP'07 conference held in Warsaw [34]. The paper is organized as follows. In Section 2, the organization of the database of singing samples is described. The automatic classification process requires an efficient feature extraction block, thus Section 3 presents the parameters that were used in the experiments and discusses them in the context of their relationship with voice production mechanisms. The analysis shown in Section 4 concentrates on redundancy elimination in the feature vector. For this purpose three methods, i.e. the Fisher and Sebestyen statistics and the rough set-based method, are employed. The main core of the experiments is presented in Section 5, and finally Section 6 summarizes the results obtained in this study.

2 The Database of Singing Voices

The prepared singing voice database contains over 2900 sound samples. Some 1700 of them were recorded from 42 singers in a studio. The vocalists consisted of three groups: amateurs (Gdańsk University of Technology Choir vocalists), semi-professionals (Gdańsk Academy of Music, Vocal Faculty students), and professionals (qualified vocalists who graduated from the Vocal Faculty of the Gdańsk Academy of Music). Each of them recorded 5 vowels: 'a', 'e', 'i', 'o', 'u' at several sound pitches belonging to their natural voice scale. These recordings formed the first singing category – singing quality. The singing voice type category was formed by assigning the voices to one of the following classes: bass, baritone, tenor, alto, mezzo-soprano and soprano. The second group of samples was prepared on the basis of CD audio recordings of famous singers. The database of professionals needed to be extended due to the fact that voice type recognition is possible only among professional voices. Amateur voices do not show many differences within the groups of male and female voices, as has already been reported in the literature [2], [27].

3 Parametrization of the Singing Voice

In order to parameterize the singing voice properly, the understanding of the voice production mechanism is required. The biomechanism of the singing voice


creation is rather complicated, but in the domain of its spectral energy (while not taking phase changes into account) it can be simplified by assuming an FIR model of singing voice production. As in any classical FIR model, the vocal tract is a filter which changes the spectrum of a source (the glottis) by a sum of resonances with given frequencies, amplitudes and qualities (the vocal tract). The singing sound is therefore produced by the vibration of the human vocal cords and by resonances in the throat and head cavities. The resonances produce formants in the spectrum of sounds. Formants are not only related to articulation and enable producing different vowels, but also characterize timbre and voice type qualities. For example, the formant of the middle frequency band (3.5 kHz) is described in the literature as the "singer's formant", and its relation to voice quality has been proved [2], [20], [27]. This concept is well recognized in the rich literature related to singing. However, the interaction between two factors, namely the glottal source and the resonance characteristics, shapes the timbre and power of an outgoing vocal sound, and both factors are equally important. Thus, singing voice parameters can be divided into two groups related to those two factors. Since this is a classical FIR model, inverse filtration methods are required in order to deconvolve the source from the filter in the output signal. In the literature, some inverse filtration methods for deriving glottis parameters are presented; however, they prove to be inefficient due to phase problems [10]. In this respect, only parameters of vocal tract formants can be calculated directly from the inverse filtering analysis since they are defined in frequency. The assumption of linearity is the reason why time parameters of the source signal must be extracted by other methods, which will be presented later on. Vocal tract parameters are in speech analysis most often derived from the LPC method, but an adequate separation of frequency resonances demands high resolution for lower frequencies, where the resonances are located. Moreover, methods of analysis with a spectrum resolution controlled as a function of sound pitch are required. The warped LPC method [8], [18] (further called the WLPC analysis) fulfills those conditions and enables analyzing the frequencies and levels of formants with a controlled, higher low-frequency resolution (below 5 kHz). It is based on nonlinear sampling of the unit circle in the z transform, thus the resolution at lower frequencies is better compared to a classical LPC analysis with the same length of the analyzed frame. The frequency of the phase response is transformed non-linearly to a warped frequency ω_W according to Equation (1):

ω_W = ω + 2 · arctan(λ · sin ω / (1 − λ · cos ω))    (1)

where λ is a parameter which determines the non-linearity of the transformation and the low-frequency resolution of the WLPC analysis. The larger λ is, the more densely the lower frequencies are sampled. Mathematical aspects of this transformation are presented in detail in some literature sources [8], [9], [18] and in the previous works of the authors of this paper [34]. Since the analysis is applied to small signal frames, it can be performed for several parts of the analyzed sounds. Therefore, any parameter F (which can be


for example the level of one of the formants) forms a vector which describes its values in consecutive frames. In order to focus on the whole signal, and not only on a single frame, the median value of this vector is represented by a so-called static parameter F_med. In this case, a median value is better than a mean value, because it is more resistant to short, atypical values of a parameter, which do not drastically change the median. On the other hand, in order to investigate stability, the variances of the vector values (denoted as F_var) are also taken into account. Some of the singing voice parameters must be calculated for a whole sound rather than for single frames. Those parameters are defined on the basis of the fundamental frequency contour analysis, and they are related to vibrato and intonation. Vibrato is described as the modulation of the fundamental frequency of sounds performed by singers in order to change the timbre of sounds, while intonation is their ability to produce sounds perceived as stable and precise in tune. The parameters based on the singing voice analysis ('dedicated' parameters) form an important group, but they should be supplemented with general descriptors normally used for the classification of musical instruments. This group of parameters was investigated in detail in the domain of automatic sound recognition at the Multimedia Systems Department of Gdańsk University of Technology. The usefulness of those parameters in automatic musical sound recognition was proved, which implied their application to the field of singing voice recognition. In this study, 331 parameters were derived from the analyses, of which 124 are defined by the authors and are the so-called 'dedicated parameters' especially designed to address singing voice specifics.

3.1 Estimation of the Vocal Tract Parameters

As already described, the estimation of formants requires methods of analysis with good frequency resolution which are dependent on the pitch of sounds. If the resolution is not properly set, single harmonics can be erroneously recognized as formants. For this purpose the WLPC analysis seems to be the most appropriate, because the λ parameter is a function of the pitch of the analyzed sounds [9] and thus can be adjusted in this analysis. The function λ = f(f) is presented in Eq. (2). The problem of how to determine the appropriate λ is presented in detail in the work of one of the authors [32], [34].

λ = 10⁻⁶ · f[Hz]² − 0.0022 · f[Hz] + 0.9713    (2)
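As an illustration, a minimal sketch of the frequency warping (1) and the pitch-dependent choice of λ (2) is given below; the function names are illustrative and the full WLPC analysis of [32] is not reproduced here.

```python
import math

# A sketch of the frequency warping (1) and the pitch-dependent warping
# coefficient (2); the complete WLPC analysis of [32] is not reproduced.

def warp_frequency(omega, lam):
    """Map a normalized angular frequency omega to the warped axis, Eq. (1)."""
    return omega + 2.0 * math.atan(lam * math.sin(omega) /
                                   (1.0 - lam * math.cos(omega)))

def wlpc_lambda(f0_hz):
    """Warping coefficient as a function of the sound pitch f0, Eq. (2)."""
    return 1e-6 * f0_hz ** 2 - 0.0022 * f0_hz + 0.9713

lam = wlpc_lambda(220.0)                 # e.g. a sung A3
print(lam, warp_frequency(math.pi / 8, lam))
```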

However, parameters related to the 'singing formant' can also be extracted on the basis of the FFT power spectrum parametrization. Correlation between the WLPC and FFT parameters is not a problematic issue. Various methods, among them statistical analysis and the rough set method, make it possible to reduce redundancy in feature vectors (FVs) and to compare the significance of the features. The results of the WLPC and FFT analyses are presented in Fig. 1. Maxima and minima of the WLPC curves are determined automatically by an algorithm elaborated by one of the authors [32]. WLPC smooths the power spectrum with good resolution for frequencies below 5 kHz.


Fig. 1. WLPC analysis shown along with the FFT power spectrum analysis of sound

Extracted WLPC maxima are related to one of the three formants: articulation (frequencies 1-2.5 kHz), singer's (singing) (frequencies 3-4 kHz), and high singing formants (frequencies over 5 kHz). Since a formal prescription of how to define these formants mathematically does not exist in the literature, three definitions for each of them can be proposed based on the three WLPC minima:

F_nm = WLPCmx_n − WLPCmn_m    (3)

where WLPCmx_n is the value of the n-th WLPC maximum and WLPCmn_m is the value of the m-th WLPC minimum. Since the WLPC analysis is applied to short signal frames, it can be performed for several fragments of the analyzed sounds. Therefore, any formant parameter F_nm forms a vector which describes its values in consecutive frames. Median values of this vector represent a so-called static parameter F_nm_med, while the values of the variances are a dynamic representation and are denoted as F_nm_var. The maximum and minimum values of F_nm are also calculated in all consecutive frames. They are denoted as F_nm_max and F_nm_min, respectively. The singer's formant is presented by many authors as significant for the estimation of singing quality. Parameters related to the "singer's formant" were extracted on the basis of linear combinations of the parameters F_nm and, additionally, by using the FFT power spectrum parametrization. The combinations of the parameters defined on the basis of the WLPC analysis are presented in Eqs. (4) and (5). Those equations show the direct relationship between formants. The parameter related to the formant energy defined on the basis of the FFT power spectrum is presented in (6).

F2/F1 = F_21 − F_11    (4)

F2/F3 = F_21 − F_31    (5)

SFL = E_SF / E_total    (6)


where the ratios F2/F1 and F2/F3 represent differences in the formant levels F_11, F_21 and F_31 expressed in dB, SFL denotes the singer's formant energy, E_SF is the power spectrum energy for the band (2.5 kHz - 4 kHz) in which the 'singing formant' is present, and E_total is the total energy of the analyzed signal.
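The following sketch illustrates the computation of the formant parameters (3)-(6) for a single analysis frame. It assumes that the WLPC envelope maxima and minima (in dB) have already been detected by the peak-picking procedure of [32]; the variable names and the synthetic spectrum are illustrative only.

```python
import numpy as np

# A sketch of the formant parameters (3)-(6) for one analysis frame, assuming
# the WLPC envelope maxima/minima (in dB) have already been detected.

def formant_params(wlpc_max_db, wlpc_min_db):
    """F_nm = WLPCmx_n - WLPCmn_m (3) and the ratios (4), (5), all in dB."""
    F = {(n, m): wlpc_max_db[n] - wlpc_min_db[m]
         for n in range(len(wlpc_max_db)) for m in range(len(wlpc_min_db))}
    f2_f1 = F[(1, 0)] - F[(0, 0)]          # F2/F1 = F21 - F11, Eq. (4)
    f2_f3 = F[(1, 0)] - F[(2, 0)]          # F2/F3 = F21 - F31, Eq. (5)
    return F, f2_f1, f2_f3

def singer_formant_level(power_spectrum, freqs_hz):
    """SFL = E_SF / E_total, Eq. (6), with E_SF taken over the 2.5-4 kHz band."""
    band = (freqs_hz >= 2500) & (freqs_hz <= 4000)
    return power_spectrum[band].sum() / power_spectrum.sum()

spec = np.abs(np.fft.rfft(np.random.randn(2048))) ** 2     # placeholder frame
freqs = np.fft.rfftfreq(2048, d=1 / 44100)
_, r21, r23 = formant_params([32.0, 40.0, 25.0], [10.0, 14.0, 8.0])
print(r21, r23, singer_formant_level(spec, freqs))
```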

3.2 Estimation of the Glottal Source Parameters

Interaction between the vocal tract filter and the glottal shape, along with phase problems, are obstacles to an accurate automatic extraction of the glottal source shape [10], [12], [27], [32]. Glottal source parameters, which are defined in the time domain, are not easy to compute from the inverse filtration. However, within the context of singing voice quality, their stability rather than their objective values seems to be important. The analysis must be done for single periods of sound, and a sonogram analysis with small analyzing frames and a big overlap should be employed. For each of the frequency bands, the sonogram consists of a set of n sequences S_n(k), where n is the number of a frequency band and k is the number of a sample. Since the aim of the parametrization is to describe the stability of energy changes in sub-bands, the autocorrelation in time is computed for the sequences S_n(k). The more frequent and stable the energy changes in a sub-band were, the higher were the values of the autocorrelation function maximum (for index not equal to 0). The analysis was performed for 16- and 32-sample frames. In the first case the energy band of 0-10 kHz was related to the first four indexes n and the maximum of the autocorrelation function of sub-band n is denoted as KX_n (7); in the second case n = 1...8 and the resulting parameter is defined as LX_n (8). Two different analyzing frames were used for comparison purposes only. The redundancy in the feature vector (FV) was further eliminated by statistical methods.

KX_n = max(Corr_k(S_n^16(k))), n = 1...4    (7)

LX_n = max(Corr_k(S_n^32(k))), n = 1...8    (8)

where Corr_k(·) is the autocorrelation function in the time domain, k is the sample number, n is the number of the frequency sub-band, S_n^16 is the sonogram sample sequence for the analyzed frame of 16 samples and frequency sub-band n, and S_n^32 denotes the sonogram sample sequence for the analyzed frame of 32 samples and frequency sub-band n. Conversely, the minimum of the correlation function Corr_k(S_n(k)) is connected with the symmetry or anti-symmetry of energy changes in sub-bands, which relates to the open quotient of the glottal source [32]. Therefore, in each of the analyzed sub-bands the KY_n and LY_n parameters are defined as (9) and (10), respectively:

KY_n = min(Corr_k(S_n^16(k))), n = 1...4    (9)

LY_n = min(Corr_k(S_n^32(k))), n = 1...8    (10)

where Corr_k(·), k, n, S_n^16, S_n^32 are defined as in formulas (7) and (8).


Another parameter defined for each analyzed sub-band is a threshold parameter KP_n, defined as the number of samples exceeding the average energy level of sub-band n divided by the total number of samples in the sub-band. For the frame of 32 samples a similar parameter is defined and denoted as LP_n. Parameters KP_n and LP_n are also related to the open quotient of the glottal signal [32], [33].
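A minimal sketch of the per-sub-band parameters described above is given below. It assumes that the sonogram sub-band energy sequences S_n(k) have already been computed (one 1-D array per sub-band) and uses a normalized autocorrelation; the helper names are illustrative only.

```python
import numpy as np

# A sketch of the sub-band parameters (7)-(10) and the threshold parameter
# KP_n / LP_n for a single sonogram sub-band sequence S_n(k).

def _autocorr(x):
    x = x - x.mean()
    c = np.correlate(x, x, mode="full")[len(x) - 1:]
    return c / c[0] if c[0] != 0 else c           # normalized, lag 0 == 1

def glottal_subband_params(s_n):
    """Return (KX_n, KY_n, KP_n) for one sub-band sequence s_n = S_n(k)."""
    s_n = np.asarray(s_n, dtype=float)
    c = _autocorr(s_n)
    kx = c[1:].max()                              # Eqs. (7)/(8): max for lag != 0
    ky = c[1:].min()                              # Eqs. (9)/(10): min for lag != 0
    kp = float(np.mean(s_n > s_n.mean()))         # fraction above the mean energy
    return kx, ky, kp

print(glottal_subband_params(np.abs(np.sin(np.linspace(0, 20, 200)))))
```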

3.3 Estimation of Intonation and Vibrato Parameters

Proper vibrato and intonation play a very important role in the perception of voice quality [4], [5], [7], [25], [31]. It is clear that a person who does not hold the pitch steadily and does not have a consistent vibrato cannot be judged as a good singer. Intonation and vibrato of singing are defined in the frequency domain, thus a pitch contour needs to be extracted. There are several methods of automatic sound pitch extraction, of which the autocorrelation method seems to be the most appropriate [6], [14]. The autocorrelation pitch extraction method is based on the determination of the maximum of an autocorrelation function defined for overlapped segments of the audio signal. Since this method is well presented in the rich literature [6], [23] on this subject, it will not be recalled here. The fundamental frequency (f0) within each analyzed frame was determined, and at the same time the frequency resolution of the analysis was improved by interpolating three samples around the maximum of the autocorrelation function. The length of the frame was set to 512 samples. The value was determined experimentally in order to give a satisfactory time resolution. It is presented in detail in other papers of the authors [6], [32]. The interpolation improves the frequency resolution significantly. The pitch of the analyzed sounds is not always stable in time, especially when sounds of untrained singers are concerned. In order to accurately parametrize vibrato and intonation of the analyzed sound, an equivalent pitch contour of the sound but without vibrato should be determined. The result of such an analysis is a so-called 'base contour', which is calculated by smoothing the pitch contour (using the moving average method) with the frame length equal to the reciprocal of half the vibrato frequency. When bc(n) are samples of the base contour (defined in frequency) and v(n) are samples of the vibrato contour, the vibrato-modified contour is calculated as v_m(n) = v(n) − bc(n) and is used for the vibrato parametrization. On the other hand, bc(n) is used for the intonation parametrization to define how quickly the singer is able to obtain a given pitch of the sound and how stable its frequency is. The parametrization of vibrato depth and frequency (f_VIB) may not be sufficient in the category of singing quality. Since the stability of vibrato reflects the quality of sound parameters in time [5], [27], three additional vibrato parameters were defined [5], [34]:
– "periodicity" of vibrato VIB_P (Eq. 11), defined as the maximum value of the autocorrelation of the pitch contour function (for index not equal to 0);


– "harmonicity" of vibrato VIB_H (Eq. 12), obtained by calculating the Spectrum Flatness Measure for the spectrum of the pitch contour;
– "sinusoidality" of vibrato VIB_S (Eq. 13), defined as the similarity of the parameterized pitch contour to the sine waveform.

VIB_P = max_n(Corr(f0(n)))    (11)

VIB_H = (∏_{n=1}^{N} F0(n))^{1/N} / ((1/N) Σ_{n=1}^{N} F0(n))    (12)

VIB_S = max_n(F0(n)) / Σ_{n=1}^{N} F0(n)    (13)
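A minimal computational sketch of the vibrato parameters (11)-(13) is given below. It assumes that the pitch contour has already been extracted and the smoothed base contour subtracted, as described above, and that F0(n) is the magnitude spectrum of that contour; the input signal is synthetic and purely illustrative.

```python
import numpy as np

# A sketch of the vibrato parameters (11)-(13) for a vibrato-modified pitch
# contour v_m(n) = v(n) - bc(n); F0(n) is the magnitude spectrum of v_m.

def vibrato_params(vm):
    vm = np.asarray(vm, dtype=float)
    z = vm - vm.mean()
    c = np.correlate(z, z, mode="full")[len(vm) - 1:]
    c = c / c[0]
    vib_p = c[1:].max()                                 # periodicity, Eq. (11)
    F0 = np.abs(np.fft.rfft(vm))[1:] + 1e-12            # contour spectrum (no DC)
    vib_h = np.exp(np.mean(np.log(F0))) / np.mean(F0)   # spectral flatness, Eq. (12)
    vib_s = F0.max() / F0.sum()                         # sinusoidality, Eq. (13)
    return vib_p, vib_h, vib_s

# A nearly sinusoidal 6 Hz vibrato sampled at 100 contour points per second:
t = np.arange(300) / 100.0
print(vibrato_params(30 * np.sin(2 * np.pi * 6 * t)))
```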

Bad singers often try to use vibration of the sounds not for artistic purposes but to hide "false" intonation of a sound. In addition, false sounds directly reveal a lack of proficiency in vocal skills. Since intonation seems important in voice quality determination, the base contour must be parametrized. In order to calculate intonation parameters, two methods were proposed. The first method calculates the mean value of a differential sequence of the base contour (IR). The second method does not analyze all base contour samples but only the first and the last one, and returns the IT parameter. Parameters IR and IT are also defined for the first and last N/2 samples of the pitch contour separately (N is the number of samples of the pitch contour) and are denoted as IR_att, IT_att, IR_rel, IT_rel, where att means the attack and rel the release of the sound.
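A small sketch of the intonation parameters described above is given below. Since the exact form of IT is not spelled out in the text, it is taken here, as an assumption, to be the difference between the last and the first base-contour sample; the attack/release variants use the first and last N/2 samples.

```python
import numpy as np

# A sketch of the intonation parameters: IR as the mean of the differential
# sequence of the base contour, IT as the difference between its last and
# first samples (an assumption), plus the attack/release variants.

def intonation_params(bc):
    bc = np.asarray(bc, dtype=float)
    ir = np.mean(np.diff(bc))                 # mean of the differential sequence
    it = bc[-1] - bc[0]                       # last minus first sample (assumed)
    half = len(bc) // 2
    ir_att, it_att = np.mean(np.diff(bc[:half])), bc[half - 1] - bc[0]
    ir_rel, it_rel = np.mean(np.diff(bc[half:])), bc[-1] - bc[half]
    return ir, it, ir_att, it_att, ir_rel, it_rel

print(intonation_params(440 + 3 * np.linspace(0, 1, 100) ** 2))
```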

3.4 Other Parameters

Another way of determining singing voice parameters is to use a more general signal description such as the descriptors of audio content contained in the MPEG-7 standard. Although those parameters are not related to the singing voice biomechanics, they may be useful in the recognition process. The MPEG-7 parameters [11], [19] will not be presented in detail here, since they were reviewed in previous works by the authors [15], [16], [17]. The MPEG-7 audio parameters can be divided into the following groups:
– ASE (Audio Spectrum Envelope) describes the short-term power spectrum of the waveform. The mean values and variances of each coefficient over time are denoted as ASE_1 ... ASE_34 and ASE_1var ... ASE_34var, respectively.
– ASC (Audio Spectrum Centroid) describes the center of gravity of the log-frequency power spectrum. The mean value and the variance are denoted as ASC and ASC_var, respectively.
– ASS (Audio Spectrum Spread). The mean value and the variance over time are denoted as ASS and ASS_var, respectively.


– SFM (Spectral Flatness Measure) calculated for each frequency band. The mean values and the variances are denoted as SFM_1 ... SFM_24 and SFM_1var ... SFM_24var.
– Parameters related to discrete harmonic values: HSD (Harmonic Spectral Deviation), HSS (Harmonic Spectral Spread), HSV (Harmonic Spectral Variation).

The level of the first harmonic changes for different voice type qualities [27], thus in automatic voice recognition some parameters related to the behavior of harmonics can be useful. Parameters employed in the analysis were defined for the harmonic decomposition of sounds. They are the mean value of differences between the amplitudes of a harmonic in adjacent time frames (s_n, where n is the number of a harmonic), the mean value of the amplitudes A_h of a harmonic over time (m_n, where n is the number of a harmonic), and the standard deviation of the amplitudes A_h of a harmonic over time (md_n, where n is the number of a harmonic). Other parameters used in the experiments were: brightness (br) (the center of spectrum gravity) [13], [14] and mel-cepstrum coefficients mcc_n [3], where n is the number of a coefficient.

4 Analysis of Parameters

All 2900 sound samples from the database were described by the parameters presented in Section 3. Since the total number of parameters is big (331), they will not all be listed here. We can, however, divide them into the following groups:
– parameters of formants – 46 parameters,
– parameters of the glottal source – 59 parameters,
– parameters of the pitch contour (intonation and vibrato) – 18 parameters,
– other (general) parameters – 208 parameters.

4.1 Statistical Analysis

Some chosen pairs of the parameters can be represented graphically in a 2D space. In Fig. 2, an example of the distribution of two parameters for sound samples of professional and amateur singers is presented. It can be noticed that the majority of these sound samples are separated by using only two features. A large number of features in the FV and a large number of voice samples are the reason to use statistical methods for the feature evaluation. Therefore, every feature can be analyzed and the feature vector can be reduced to the parameters with the biggest values of the statistics. Another way is to use rough sets. Three methods of data reduction are described in the following sections of this paper, namely the Fisher statistic (F), the Sebestyen statistic (S), and rough sets. The Fisher statistic has the ability to test the separation between the pairs of classes being recognized, while the Sebestyen criterion tests the database globally for


Fig. 2. An example of a 2D space of the values of selected parameters

Table 1. Sebestyen criterion for the 20 best parameters in the categories of voice quality (a) and voice type (b)

a. (voice quality)
parameter   S value   parameter   S value   parameter   S value
F1/F2       1.282     SFLmin      0.545     ASE_15      0.406
VIB_H       1.047     LAT         0.545     F2/F3       0.297
ASE_16      0.844     SFLmed      0.529     br          0.281
F31min      0.672     ASE_21      0.519     F31med      0.278
ASE_23      0.654     ASE_22      0.489     F22min      0.252
ASE_24      0.637     SFLmin      0.468     F22max      0.248
F22med      0.556     ASE_14      0.407

b. (voice type)
parameter   S value   parameter   S value   parameter   S value
ASE_10      1.006     SFM_17      0.37      MCC_10      0.307
ASE_9       0.680     ASE_25      0.36      MCC_10var   0.307
LP_5        0.518     KP_4        0.358     F22min      0.290
F22med      0.509     mfcc_9var   0.358     ASE_19      0.258
ASE_23      0.501     ASE_12      0.355     MCC_8       0.258
ASE_16      0.419     LX_1        0.351     LP_6        0.241
MCC_6       0.384     ASE_13      0.320


all pairs of classes in one calculation. Those statistical methods are presented in [13], [15], [26]. The Sebestyen criterion has an advantage when compared to the F statistic. Its global character makes it possible to sort the parameters from the most to the least appropriate for all pairs of classes. In Table 1 the results of the S criterion for the 20 best parameters are presented in the two categories of classification. The Fisher statistic allows for comparing parameters only in selected pairs of classes. Therefore, the parameters cannot be globally compared. Below are presented the most interesting conclusions coming from the statistical analysis of singing voice parameters that fall within the categories of the singing voice quality and type. The detailed studies of the parameter redundancy using the Fisher criterion are presented in P. Zwan's PhD thesis [33]. In the category of voice quality, "dedicated" parameters obtained higher Fisher values than "general" parameters, while "general" descriptors were more appropriate for class separation in the voice type category. In the category of voice quality, the best F results were obtained by glottal source parameters: LX_2, LX_3, SFL, VD, F22max, F1/F2, F2/F3. Among "general" descriptors the best F results were obtained by some chosen MPEG-7 parameters: ASE and HSV (of various indexes) and the parameters describing value changes of harmonics in neighboring frames. For the pair of "amateur" – "professional" classes the best parameters (with regard to the Fisher statistic) were the parameters related to the energy of the singer's formant: SFL, F22med, F1/F3, F2/F3, F22min. It is evident that the energy of the band of the singer's formant is crucial for distinguishing professional singers from amateurs. For the pair of "semi-professional" – "professional" classes the parameters related to the singer's formant energy do not have such great significance. In this case, glottal source time parameters are essential. They relate to the invariability and periodicity of energy changes in singing down to the level of single signal periods in voice samples. High values of the Fisher statistic were obtained by parameters related to vibrato: VD, VIB_P, VIB_H, VIB_S. Such a good result for vibrato parameters is very valuable, because these descriptors are not correlated with the parameters of the singer's formant (they describe different elements of singing technique). In the category of voice type, the highest F values were obtained by the threshold parameters KP_2, LP_4, LP_5, LP_8, the parameters LX_1, KX_1, the SFLmax parameter related to the singer's formant level, and the parameters related to a higher formant FW, namely F1/F3. Among the parameters defined on the basis of the WLPC analysis, the highest F values were obtained by the parameters F22med and F22max, which indicates the significance of defining the singer's formant values in relation to the second minima of the WLPC function. The results of the Sebestyen criterion and the Fisher statistic cannot be compared directly, but generally the results are consistent. The majority of parameters with a high S value also obtained a high F value for a certain pair of classes, and similarly, the majority of parameters with a big Fisher criterion value had a high position in the list of parameters sorted by the S value. The consistency of the results proves the statistical methods to be good tools for a comparison of parameters in the context of their usability in the automatic recognition of singing voices.
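The exact forms of the Fisher and Sebestyen statistics used in [33] are not restated in this paper. As a purely illustrative sketch, the common two-class Fisher ratio F = (m1 − m2)² / (v1 + v2) can be used to rank features for a chosen pair of classes, e.g. "amateur" vs. "professional"; the data below are synthetic placeholders.

```python
import numpy as np

# A sketch of per-feature ranking with the common two-class Fisher ratio;
# the variant actually used in [33] may differ in detail.

def fisher_ranking(X1, X2, names):
    """X1, X2: (samples x features) matrices of the two compared classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    v1, v2 = X1.var(axis=0), X2.var(axis=0)
    f = (m1 - m2) ** 2 / (v1 + v2 + 1e-12)
    return sorted(zip(names, f), key=lambda p: -p[1])

rng = np.random.default_rng(0)
amateur = rng.normal([0.2, 5.0], [0.1, 2.0], size=(50, 2))       # placeholder data
professional = rng.normal([0.6, 5.5], [0.1, 2.0], size=(50, 2))  # placeholder data
print(fisher_ranking(amateur, professional, ["SFL", "ASE16"]))
```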

4.2 Rough Set-Based Analysis

Rough sets, introduced by Pawlak [21], are often employed in the analysis of data which aims to discover significant data and eliminate redundant ones. A rich literature on rough sets covers many applications [22]; they are also used in music information retrieval [15], [28], [29]. Within the context of this paper, the rough set method was used for the analysis of the descriptors defined for the purpose of this study. In the experiments, the rough set decision system RSES was employed [24]. Since this system is widely used by many researchers, the details concerning its algorithmic implementation and performance will not be provided here. FVs were divided into training and testing sets. Parameters were quantized according to the RSES system principles. Local and global discretization were used to obtain reducts calculated with the genetic and exhaustive algorithms [1]. Since two discretization methods and two algorithms for reduct calculation were used, two sets of reduced parameters and four sets of reducts containing the selected parameters were extracted. In the category of voice quality the vector of parameters was reduced to the parameters listed below:

a) the global discretization:
FV_1 = [F11, F2/F1, KX_2, KY_7, f_VIB, VIB_P, ASE_21, ASC, ASC_v, SFM_10, s_2]    (15)

b) the local discretization:
FV_2 = [F11, F21, F31, F33, F12var, F13min, F13var, KX_1, KX_2, KP_1, LP_3, f_VIB, VIB_P, LAT, TC, ASE_6, ASE_7, ASE_8, ASE_21, s_2]    (16)

The sets of parameters selected by the global and local discretization methods differ. The global discretization is a simpler method, thus the number of parameters is lower. The global discretization tries to separate groups of classes globally by selecting a smaller number of discretization cut points. However, the parameters chosen by the rough set method for both discretization methods in general match the parameters chosen by the statistical methods of data reduction. Among the reduced sets of parameters, descriptors related to the WLPC analysis of formants can be found, and thus they can be qualified as significant for classification purposes. They are related to all three formants, which proves that in the category of voice quality all formants are required to be parameterized and the extracted descriptors should be contained in the FV. It is interesting that among those parameters F31 and F33, which are related to the 'high formant' (middle frequency higher than 5 kHz), appeared. The significance of this formant is not described in the literature concerning automatic singing voice parametrization. Among the glottal source parameters, descriptors such as KX_1, KX_2, KP_1, LP_3 were selected. On the other hand, the frequency (f_VIB) and periodicity (VIB_P) related to vibrato modulation found their place among other important descriptors. From the remaining parameters, a few MPEG-7 parameters, namely LAT, TC, ASE_6, ASE_7, ASE_8, ASE_21, were qualified. In addition, one parameter related to the


analysis of the spectrum was chosen, namely s_2, which is related to the variation of the second harmonic. In order to define the reducts, two algorithms were used: the genetic and the exhaustive algorithm. In the case of global discretization, those two algorithms calculated one and the same reduct containing all the parameters of (15). For both algorithms, all the parameters had equal influence on the decision. In the case of local discretization, the reducts obtained by the two algorithms differed significantly. The resulting reducts are presented in (17) and (18). For the selection of the 'best' reducts, the stability coefficient values were taken into account.

– reducts for the genetic algorithm, limited to a few best reducts:
{F11, F31, F12var, KX_2, f_VIB, VIB_P, LAT, TC, ASE_6, ASE_7, ASE_8}
{F11, F31, F12var, F13min, KX_2, f_VIB, VIB_P, LAT, ASE_6, ASE_7, ASE_8, ASE_21}
{F11, F31, F13var, KX_2, f_VIB, VIB_P, TC, ASE_6, ASE_7, ASE_8, ASE_21}
{F11, F31, KX_2, f_VIB, VIB_P, LAT, TC, ASE_6, ASE_7, ASE_8, s_2}
{F11, F31, KX_2, f_VIB, VIB_P, LAT, TC, ASE_6, ASE_7, ASE_8, ASE_21}
{F11, F13min, KX_2, KP_1, f_VIB, VIB_P, TC, ASE_6, ASE_7, ASE_8, ASE_18, s_2}
{F31, F12var, KX_2, KP_1, f_VIB, VIB_P, LAT, TC, ASE_6, ASE_7, ASE_8}
{F31, F13var, KX_2, f_VIB, VIB_P, TC, ASE_6, ASE_7, ASE_8, ASE_21, s_2}
{F13var, KX_2, KP_1, f_VIB, VIB_P, TC, ASE_6, ASE_7, ASE_8, ASE_21, s_2}
{F12var, KX_2, KP_1, f_VIB, VIB_P, LAT, TC, ASE_6, ASE_7, ASE_8, ASE_21}
{F13min, KX_2, KP_1, f_VIB, VIB_P, LAT, TC, ASE_6, ASE_7, ASE_8, s_2}    (17)

– reducts for the exhaustive algorithm, limited to the best 6 reducts:
{KX_2, VIB_P, LAT, ASE_6, ASE_8}
{VIB_P, LAT, TC, ASE_6, ASE_8}
{F11, VIB_P, TC, ASE_6, ASE_8}
{F11, f_VIB, VIB_P, ASE_6, ASE_8}
{KX_2, LAT, TC, ASE_6, ASE_8}
{KX_2, f_VIB, VIB_P, ASE_6, ASE_8}    (18)

In the category of voice type, over 200 of the total number of 331 parameters remained, independently of the discretization method or the type of algorithm used for the calculation. It was not possible to reduce the parameter representation as much as in the case of the voice quality category. Within this context, automatic voice type recognition seems to be more complex. One of the reasons can be the diversity of registers among different voice types and the individual voice qualities which differ between singers of the same voice type. Additionally, some singers' voices were not easy to assign to a voice type category, e.g. low registers of soprano voices were similar in timbre to mezzo-soprano and even alto voices.

5 Automatic Singing Voice Recognition

The next step of the experiment was to train an automatic recognition system based on both the reduced and the full feature vectors. Since three reduction methods were performed for each of the categories, several decision systems (DS) were trained for the purposes of their comparison:
– DS 1 – for the full vector (331 parameters),
– DS 2 – for the vector with the 100 parameters with the biggest S values,
– DS 3 – for the vector with the 100 parameters with the biggest F values (all pairs of the classes were concerned),
– DS 4 – for the vector with the 50 parameters with the biggest S values,
– DS 5 – for the vector with the 50 parameters with the biggest F values,
– DS 6 – for the vector with the 20 parameters with the biggest S values,
– DS 7 – for the vector with the 20 parameters with the biggest F values,
– DS 8 – for the vector reduced by rough sets with the global discretization method,
– DS 9 – for the vector reduced by rough sets with the local discretization method.

Since Artificial Neural Networks are widely used in automatic sound recognition [13], [14], [15], [32], [35], an ANN classifier was used. The ANN was a simple feed-forward, three-layer network with 100 neurons in the hidden layer and 3 or 6 neurons in the output layer, respectively (depending on the number of classes being recognized). Since there were 331 parameters in the FV, the input layer consisted of 331 neurons. Sounds from the database were divided into three groups. The first part of the samples (70%) was used for training, the second part (10%) for validation and the third part (20%) for testing. Samples in the training, validation and testing sets consisted of sounds of different vowels and pitches. The network was trained smoothly, with the validation error increasing after approx. 3000 cycles. To train the network optimally, the minimum of the global validation error function had to be found. If the validation error was increasing for 50 successive cycles, the last validation error function minimum was assumed to be global, and the learning was halted. In Table 2, the automatic recognition results are presented for the nine decision systems DS1 – DS9 and the two recognition categories. V331 is the vector of all 331 parameters, Sn are the vectors reduced by the Sebestyen criterion to n parameters, Fm are the vectors reduced by the Fisher statistic to m parameters, RS_L is the vector of parameters selected by the rough set local discretization method and RS_G is the vector of parameters selected by the rough set global discretization method. The results in Table 2 show that whatever the data reduction method is, artificial neural networks generate similar results of automatic recognition. The efficiency for the full vector of 331 parameters is 93.4% for the voice quality category and 90% for the voice type category, and it decreases to approx. 75% when the size of the vector is reduced to 20 parameters. The better recognition accuracy for the voice type category when the rough set data reduction method is used comes from the fact that for this category the vector was not significantly reduced. In the case of the voice quality, the results of automatic recognition for F20, S20 and RS_L can be directly compared because in those cases the


Table 2. Results of automatic recognition accuracy [%] for various FV size reduction methods

Category   V331   S100   F100   S50    F50    S20    F20    RS_L   RS_G
Quality    93.4   90.5   92.1   85.3   87.2   73.2   75.5   75.2   62
Type       90.0   82.3   81.5   79.0   76.0   58.3   60.1   83.1   82.5

Table 3. The comparison of the accuracy of RS- and ANN-based classifiers for various RS-based data reduction methods

Data reduction method                         RS [%]   ANN [%]
global discretization, both algorithms       96.8     62
local discretization, genetic algorithm      97.6     72.5
local discretization, exhaustive algorithm   89       72.5

vectors have the same number of parameters. The recognition accuracy is very similar for all three methods. Rough set-based algorithms can serve not only for data reduction purposes but also as classifiers. A comparison between RS and ANN classifiers acting on vectors reduced by rough set-based methods seems very interesting. Since the reducts were extracted only for the vocal quality category, the experiment was carried out for that category and the results are presented in Table 3. The automatic recognition results are much better when an RS classifier is used. The RS method is specialized for the classification task on a reduced set of parameters. The discretization algorithm used in RS selects the most important cut points in terms of the discernibility between classes. Thus, the rules generated from the parameters by RS are strictly dedicated to the analyzed case. Following the RS methodology, the proper rules are easy to obtain. For ANNs, since they are trained and tested using single objects, generalization is harder to obtain and every single training object can have an influence on the results. In the case of a smaller number of parameters this is of particular importance, which can be clearly observed in Table 3. Conversely, when the number of parameters (the number of reducts) is bigger, the ANN decision system starts to perform better than RS. This may be observed in the results of the automatic recognition in the voice type category (parts of Tables 4 and 5). In order to make a detailed comparison between the best trained ANN recognition system and the best trained RSES system, the detailed recognition results for both recognition categories are presented in Tables 4 and 5. Rows in these tables describe the recognized classes, and columns correspond to the classification. In the case of the quality category, the automatic recognition results of the rough set system are better compared to the ANN. The rough set system achieved very good results with a reduced FV of 20 parameters in the classification of the voice quality category. In the category of voice type, the results are worse. Moreover, in the case of the voice type category, erroneous classification is not always related to neighboring


Table 4. ANN singing voice recognition results for (a) Voice Quality (VQ) and (b) Voice Type (VT) categories

a. VQ recognition [%]
                    amateur   semi-professional   professional
amateur             96.3      2.8                 0.9
semi-professional   4.5       94.3                1.1
professional        3.5       7                   89.5

b. VT category recognition [%]
           bass   baritone   tenor   alto   mezzo   soprano
bass       90.6   6.3        3.1     0      0       0
baritone   3.3    90         6.7     0      0       0
tenor      0      3.6        89.3    7.1    0       0
alto       0      0          4       80     12      4
mezzo      0      0          0       0      93.8    6.3
soprano    0      0          2.9     0      2.9     94.1

Table 5. RSES-based singing voice classification results for (a) Voice Quality (VQ) and (b) Voice Type (VT) categories

a. VQ recognition [%]
                    amateur   semi-professional   professional
amateur             94.7      4.2                 1.1
semi-professional   1.3       95.4                3.3
professional        0         1.6                 96.7

b. VT category recognition [%]
           bass   baritone   tenor   alto   mezzo   soprano
bass       84.0   10.0       4.0     2.0    0       0
baritone   13.0   64.8       13.0    0      1.9     7.3
tenor      6.0    18.0       54.0    10.0   6.0     6.0
alto       0      4.7        16.3    51.2   16.3    11.6
mezzo      3.8    0          2.6     1.3    73.1    19.2
soprano    2.9    2.9        2.9     1.4    11.4    78.6

classes. Thus, the RSES system was not able to perform the classification as well as the ANN when trained and tested on vectors of more than 200 parameters in the category of voice type, where further vector size reduction was not possible (the total accuracy obtained equals 0.664). It is interesting to notice that the voice types at the extremes of the voice type category were recognized with better efficiency than those located between other classes. Also, there is not much difference whether this concerns male or female voices.
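For illustration, the kind of feed-forward network described at the beginning of this section (one hidden layer of 100 neurons, a 70/10/20 train/validation/test split, and training halted after 50 validation cycles without improvement) can be sketched as below. scikit-learn is used here only as an illustrative stand-in; it is not the authors' original implementation, and the data are random placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# A sketch of the classifier configuration of Section 5: a feed-forward
# network with 100 hidden neurons, a 70/10/20 split and early stopping on
# the validation error. Data below are placeholders, not the actual FVs.

X = np.random.randn(2900, 331)                    # 2900 feature vectors, 331 params
y = np.random.randint(0, 3, size=2900)            # 3 quality classes (placeholder)

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2)
clf = MLPClassifier(hidden_layer_sizes=(100,),
                    early_stopping=True,          # hold out part of the training
                    validation_fraction=0.125,    # data (~10% of the whole set)
                    n_iter_no_change=50,          # stop after 50 non-improving epochs
                    max_iter=3000)
clf.fit(X_trainval, y_trainval)
print("test accuracy:", clf.score(X_test, y_test))
```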

6 Conclusions

By comparing the automatic recognition results of neural networks and rough set systems, several conclusions may be reached. The recognition performed by the rough set system was better for the quality category and worse for the voice type category in comparison to the ANN. In the case of the voice quality category, it was possible for the RS system to reduce a large number of parameters to 20 descriptors and the extraction of rules went very smoothly. Descriptors of the level of formants and the stability of glottal parameters, along with those related to vibrato and, in addition, MPEG-7 descriptors, made it possible to derive linear IF-THEN rules. This proves that automatic recognition of the quality category is possible for a significantly reduced number of descriptors. In the case of the voice type category, it was not possible to achieve very good recognition results for the RS classifier, as the extraction of a small number of rules was not possible. Neural networks made it possible to classify particular types of singing voices effectively, while the rough set system achieved worse efficiency. The diversity of voice registers and the individual timbre characteristics of singers are the reason why non-linear classification systems such as ANNs should perhaps be used for automatic recognition in the category of voice type. Another reason for the lower recognition results may be that the database of singing voices was represented by too few different singers. Moreover, it has been proven that all the presented data reduction algorithms enabled a significant decrease in the feature vector size. The vectors of the same length but produced by different data reduction methods gave very similar recognition results with the trained ANN. The parameters selected by those algorithms as the most appropriate for automatic singing voice recognition were very similar. Acknowledgments. The research was partially supported by the Polish Ministry of Science and Education within the project No. PBZ-MNiSzW-02/II/2007.

References 1. Bazan, J.G., Szczuka, M.S.: The Rough Set Exploration System. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets III. LNCS, vol. 3400, pp. 37–56. Springer, Heidelberg (2005) 2. Bloothoof, G.: The sound level of the singers formant in professional singing. J. Acoust. Soc. Am. 79(6), 2028–2032 (1986) 3. Childers, D.G., Skinner, D.P., Kemerait, R.C.: The Cepstrum: A Guide to Processing. Proc. IEEE 65, 1428–1443 (1977) 4. Dejonckere, P.H., Olek, M.P.: Exactness of intervals in singing voice: A comparison between singing students and professional singers. In: Proc. 17th International Congress on Acoustics, Rome, VIII, pp. 120–121 (2001) 5. Diaz, J.A., Rothman, H.B.: Acoustic parameters for determining the diﬀerences between good and poor vibrato in singing. In: Proc. 17th International Congress on Acoustics, Rome, VIII, pp. 110–116 (2001)


6. Dziubinski, ˙ M., Kostek, B.: Octave Error Immune and Instantaneous Pitch Detection Algorithm. J. of New Music Research 34, 273–292 (2005) 7. Fry, D.B.: Basis for the acoustical study of singing. J. Acoust. Soc. Am. 28, 789–798 (1957) 8. Harma, A.: Evaluation of a warped linear predictive coding scheme. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 897–900 (2000) 9. Harma, A.: A comparison of warped and conventional linear predictive coding. IEEE Transactions on Speech and Audio Processing 5, 579–588 (2001) 10. Herzel, H., Titze, I., Steinecke, I.: Nonlinear dynamics of the voice: signal analysis and biomechanical modeling. CHAOS 5, 30–34 (1995) 11. Herrera, P., Serra, X., Peeters, G.: A proposal for the description of audio in the context of MPEG-7. In: Proc. CBMI European Workshop on Content-Based Multimedia Indexing, Toulouse, France (1999) 12. Joliveau, E., Smith, J., Wolfe, J.: Vocal tract resonances in singing: the soprano voice. J. Acoust. Soc. America 116, 2434–2439 (2004) 13. Kostek, B.: Soft Computing in Acoustics, Applications of Neural Networks, Fuzzy Logic and Rough Sets to Music Acoustics, Studies in Fuzziness and Soft Computing. Physica Verlag, Heidelberg (1999) 14. Kostek, B., Czy˙zewski, A.: Representing Musical Instrument Sounds for Their Automatic Classiﬁcation. J. Audio Eng. Soc. 49, 768–785 (2001) 15. Kostek, B.: Perception-Based Data Processing in Acoustics. In: Applications to Music Information Retrieval and Psychophysiology of Hearing. Series on Cognitive Technologies. Springer, Heidelberg (2005) ˙ 16. Kostek, B., Szczuko, P., Zwan, P., Dalka, P.: Processing of Musical Data Employing Rough Sets and Artiﬁcial Neural Networks. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets III. LNCS, vol. 3400, pp. 112–133. Springer, Heidelberg (2005) 17. Kostek, B.: Applying computational intelligence to musical acoustics. Archives of Acoustics 32(3), 617–629 (2007) 18. Kruger, E., Strube, H.W.: Linear prediction on a warped frequency scale. IEEE Trans. on Acoustics, Speech, and Signal Processing 36(9), 1529–1531 (1988) 19. Lindsay, A., Herre, J.: MPEG-7 and MPEG-7 Audio - An Overview. J. Audio Eng. Society 49(7/8), 589–594 (2001) 20. Mendes, A.: Acoustic eﬀect of vocal training. In: Proc. 17th International Congress on Acoustics, Rome, VIII, pp. 106–107 (2001) 21. Pawlak, Z.: Rough Sets. International J. Computer and Information Sciences 11, 341–356 (1982) 22. Peters, J.F., Skowron, A. (eds.): Transactions on Rough Sets V. LNCS, vol. 4100. Springer, Heidelberg (2006) 23. Rabiner, L.: On the use of autocorrelation analysis for pitch detection. IEEE Trans., ASSP 25, 24–33 (1977) 24. Rough-set Exploration System, logic.mimuw.edu.pl/∼ rses/RSES doc eng.pdf 25. Schutte, H.K., Miller, D.G.: Acoustic Details of Vibrato Cycle in Tenor High Notes. J. of Voice 5, 217–231 (1990) 26. Sebestyen, G.S.: Decision-making processes in pattern recognition. Macmillan Publishing Co., Indianapolis (1965) 27. Sundberg, J.: The science of the singing voice. Northern Illinois University Press (1987) 28. Wieczorkowska, A., Czy˙zewski, A.: Rough Set Based Automatic Classiﬁcation of Musical Instrument Sounds. Electr. Notes Theor. Comput. Sci. 82(4) (2003)


29. Wieczorkowska, A., Ra˙s, Z.W.: Editorial: Music Information Retrieval. J. Intell. Inf. Syst. 21(1), 5–8 (2003) 30. Wieczorkowska, A., Ras, Z.W., Zhang, X., Lewis, R.A.: Multi-way Hierarchic Classiﬁcation of Musical Instrument Sounds, pp. 897–902. MUE, IEEE (2007) 31. Wolf, S.K.: Quantitative studies on the singing voice. J. Acoust. Soc. Am. 6, 255– 266 (1935) ˙ 32. Zwan, P.: Expert System for Automatic Classiﬁcation and Quality Assessment of Singing Voices. 121 Audio Eng. Soc. Convention, San Francisco, USA (2006) ˙ 33. Zwan, P.: Expert system for objectivization of judgments of singing voices (in Polish), Ph.D. Thesis (supervisor: Kostek B.), Gdansk Univ. of Technology, Electronics, Telecommunications and Informatics Faculty, Multimedia Systems Department, Gdansk, Poland (2007) ˙ 34. Zwan, P., Kostek, B., Szczuko, P., Czy˙zewski, A.: Automatic Singing Voice Recognition Employing Neural Networks and Rough Sets. In: Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 793–802. Springer, Heidelberg (2007) ˙ 35. Zwan, P.: Automatic singing quality recognition employing artiﬁcial neural networks. Archives of Acoustics 33(1), 65–71 (2008)

Hierarchical Classifiers for Complex Spatio-temporal Concepts
Jan G. Bazan
Chair of Computer Science, University of Rzeszów, Rejtana 16A, 35-310 Rzeszów, Poland
[email protected]

Abstract. The aim of the paper is to present rough set methods of constructing hierarchical classiﬁers for approximation of complex concepts. Classiﬁers are constructed on the basis of experimental data sets and domain knowledge that are mainly represented by concept ontology. Information systems, decision tables and decision rules are basic tools for modeling and constructing such classiﬁers. The general methodology presented here is applied to approximate spatial complex concepts and spatio-temporal complex concepts deﬁned for (un)structured complex objects, to identify the behavioral patterns of complex objects, and to the automated behavior planning for such objects when the states of objects are represented by spatio-temporal concepts requiring approximation. We describe the results of computer experiments performed on real-life data sets from a vehicular traﬃc simulator and on medical data concerning the infant respiratory failure. Keywords: rough set, concept approximation, complex dynamical system, ontology of concepts, behavioral pattern identiﬁcation, automated planning.

1 Introduction

Reasoning based on concepts constitutes one of the main elements of a thinking process because it is closely related to the skill of categorization and classification of objects. The term concept means a mental picture of a group of objects (see [1]), while the term conceptualize is commonly understood to mean forming a concept or idea about something (see [1]). In the context of this work, there is interest in classifying conceptualized sets of objects. Concepts themselves provide a means of describing (forming a mental picture of) sets of objects (for a similar understanding of the term concept, see, e.g., [2, 3, 4]). Definability of concepts is a term well known in classical logic (see, e.g., [5]). Yet in numerous applications, the concepts of interest may only be defined approximately on the basis of available, incomplete information about them (represented, e.g., by positive and negative examples) and selected primary concepts and methods for creating new concepts out of them. This brings about the necessity to work out approximate reasoning methods based on inductive reasoning (see, e.g., [6, 7, 8, 9, 10, 11, 12, 13]).


In machine learning, this issue is known under the term learning concepts by examples (see, e.g., [10]). The main problem of learning concepts by examples is that the description of the concept under examination needs to be created on the basis of known examples of that concept. By creating a concept description we understand the detection of such properties of exemplary objects belonging to this concept that enable further examination of examples in terms of their membership in the concept. A natural way to solve the problem of learning concepts by examples is inductive reasoning: while obtaining further examples of objects belonging to the concept (so-called positive examples) and examples of objects not belonging to the concept (so-called negative examples), an attempt is made to find a description that correctly matches all or almost all examples of the concept under examination. Moreover, instead of learning concepts by examples, one may consider the more general learning of so-called classifications, that is, partitions of all examples into a family of concepts (called decision classes) forming a partition of the object universe. A description of such a classification makes it possible to recognize the decision that should be made about examples unknown so far, including examples not occurring in the process of classification learning.

Classifiers, also known in the literature as decision algorithms, classifying algorithms, or learning algorithms, may be treated as constructive, approximate descriptions of concepts (decision classes). These algorithms constitute the kernel of decision systems that are widely applied in solving many problems occurring in such domains as pattern recognition, machine learning, expert systems, data mining, and knowledge discovery (see, e.g., [6, 8, 9, 10, 11, 12, 13]). The literature describes numerous approaches to constructing classifiers, based on such paradigms of machine learning theory as classical and modern statistical methods (see, e.g., [11, 13]), neural networks (see, e.g., [11, 13]), decision trees (see, e.g., [11]), decision rules (see, e.g., [10, 11]), and inductive logic programming (see, e.g., [11]). Many of the approaches mentioned above resulted in decision systems intended for computer support of decision making (see, e.g., [11]). An example of such a system is RSES (Rough Set Exploration System [14, 15]), which has been developed for over ten years and utilizes rough set theory, originated by Professor Zdzislaw Pawlak (see [16, 17, 18]), in combination with Boolean reasoning (see [19, 20, 21]).

With the development of modern civilization, not only the scale of the data gathered but also the complexity of the concepts and phenomena which they concern are increasing rapidly. This change has brought new challenges for working out new data mining methods. In particular, data more and more often concern complex processes which do not yield to classical modeling methods. Medical and financial data, data coming from vehicle monitoring, or data about users gathered on the Internet may be of such a form. Exploration methods for such data are at the center of attention in many powerful research centers in the world, and at the same time detection of models of complex processes and their properties (patterns) from data is becoming more and more attractive for applications


(see, e.g., [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]). Making progress in this field is extremely important, among other things, for the development of intelligent systems which support decision making on the basis of the results of analysis of the available data sets. Therefore, working out methods for detecting process models and their properties from data, and proving their effectiveness in different applications, is of particular importance for the further development of decision support systems in many domains such as medicine, finance, industry, transport, telecommunication, and others.

However, in the last few years essential limitations have been discovered in the existing data mining methods for very large data sets concerning complex concepts, phenomena, or processes (see, e.g., [41, 42, 43, 44, 45, 46]). A crucial limitation of the existing methods is, among other things, the fact that they do not support effective approximation of complex concepts, that is, concepts whose approximation requires the discovery of extremely complex patterns. Intuitively, such concepts are too far, in the semantical sense, from the available concepts, e.g., sensory ones. As a consequence, the sizes of the spaces which should be searched in order to find patterns crucial for approximation are so large that an effective search of these spaces very often becomes unfeasible using the existing methods and technology. Thus, as it turned out, the ambition to approximate complex concepts with high quality from available concepts (most often defined by sensor data) in a fully automatic way, pursued by the existing systems and by most systems under construction, faces a serious obstacle: the classifiers obtained are often of unsatisfactory quality.

Recently, it has been noticed in the literature (see, e.g., [42, 47, 48, 49, 50, 51, 52]) that one of the challenges for data mining is the discovery of methods linking the detection of patterns and concepts with domain knowledge. The latter term denotes knowledge about the concepts occurring in a given domain and the various relations among them. This knowledge greatly exceeds the knowledge gathered in data sets; it is often represented in a natural language and usually acquired during a dialogue with an expert in a given domain. One of the ways to represent domain knowledge is to record it in the form of a so-called concept ontology, where an ontology is usually understood as a finite hierarchy of concepts and relations among them, linking concepts from different levels (see, e.g., [53, 54]).

In the paper, we discuss methods for approximation of complex concepts in real-life projects. The reported research is closely related to such areas as machine learning and data mining (feature selection and extraction [55, 56, 57], classifier construction [9, 10, 11, 12], analytical learning and explanation-based learning [12, 58, 59, 60, 61]), temporal and spatio-temporal reasoning [62, 63, 64], hierarchical learning and modeling [42, 52, 65, 66, 67, 68], adaptive control [67, 69], automated planning (hierarchical planning, reconstruction of plans, adaptive learning of plans) [70, 71, 72, 73, 74, 75, 76], rough sets and fuzzy sets (approximation of complex vague concepts) [77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87], granular computing (searching for compound patterns) [88, 89, 90, 91], complex adaptive systems [92, 93, 94, 95, 96, 97], autonomous multiagent systems [98, 99, 100, 101], swarm systems [102, 103, 104], and ontology development [53, 54, 105, 106, 107].


It is also worth mentioning that the reported research is closely related to the domain of clinical decision support for medical diagnosis and therapy (see, e.g., [108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121]). Many reported results in this domain can be characterized as methods for solving specific problems such as the temporal abstraction problem [117, 120, 121] or the medical planning problem [108, 111, 112, 119]. Many methods and algorithms proposed in this paper can also be used for solving such problems.

The main aim of the paper is to present the developed methods for approximation of complex vague concepts involved in the specification of real-life problems and for approximate reasoning used in solving these problems. The methods presented in the paper assume, however, that additional domain knowledge in the form of a concept ontology is given. Concepts from the ontology are often vague and expressed in natural language. Approximation of the ontology is used to create hints in searching for approximations of complex concepts from sensory (low-level) data.

The need for domain knowledge expressed in the form of a concept ontology can be noticed in intensively developing domains connected with data analysis and processing, as in the case of reinforcement learning (see, e.g., [12, 122, 123, 124]). In the latter field, methods of learning new strategies with reinforcement take into account concept ontologies obtained from an expert, with the help of which it is possible to construct an approximation of a function estimating the quality of the actions performed. Similarly, in a Service Oriented Architecture (SOA) [47, 49], the distribution of varied Web Services can be performed with the use of domain knowledge expressed using a concept ontology. Proposals have also appeared (see, e.g., [42, 51, 52]) to use domain knowledge to search for approximations of complex concepts in a hierarchical way, which would lead to hierarchical classifiers able to approximate complex concepts with high quality, e.g., by analogy to biological systems [42]. This idea can also be related to learning complex (e.g., nonlinear) functions for fusion of information from different sources [125]. Therefore, currently, the problem of construction of such hierarchical classifiers is fundamental for complex concept approximation, and its solution will be crucial for the construction of many methods of intelligent data analysis. These are, for example,

– methods of classification of objects into complex spatial concepts which are semantically distant from sensor data, e.g., such concepts as safe vehicle driving on a highway or a hazardous arrangement of two cooperating robots which puts them both at risk of being damaged,
– methods of classification of objects into complex spatio-temporal concepts semantically distant from sensor data which require observation of single objects or many objects over a certain period of time (e.g., acceleration of a vehicle on the road, gradual decrease of a patient's body temperature, a robot's backward movement while turning right),
– methods of identification of behavioral patterns or high-risk patterns, where these types of patterns may be treated as complex concepts representing dynamic properties of objects; such concepts are expressed in a natural language on a


high level of abstraction and describe specific behaviors of a single object (or many complex objects) over a certain period of time (e.g., overtaking of one vehicle by another, a traffic jam, chasing of one vehicle after another, behavior of a patient under a high life threat, ineffective cooperation of a robot team),
– methods of automatic learning of plans of complex object behavior, where a plan may be treated as a complex value of the decision which needs to be made for complex objects such as vehicles, robots, groups of vehicles, teams of robots, or patients undergoing treatment.

In the paper, we propose to link automatic methods of complex concept learning and of detection of models of processes and their properties with domain knowledge obtained in a dialogue with an expert. Interaction with a domain expert facilitates guiding the process of discovery of patterns and process models and makes the process computationally feasible. Thus, the presentation of new approximation methods for complex concepts based on experimental data and domain knowledge, represented using a concept ontology, is the main aim of this paper. In our opinion, the presented methods are useful for solving typical problems appearing when modeling complex dynamical systems.

1.1 Complex Dynamical Systems

When modeling the complex real-world phenomena and processes mentioned above and solving problems under conditions that require access to various distributed data and knowledge sources, the so-called complex dynamical systems (CDS) are often applied (see, e.g., [92, 93, 94, 95, 96, 97]), or, putting it another way, autonomous multiagent systems (see, e.g., [98, 99, 100, 101]) or swarm systems (see, e.g., [104]). These are collections of complex interacting objects characterized by constant change of the parameters of their components over time, numerous relationships between the objects, the possibility of cooperation/competition among the objects, and the ability of objects to perform more or less compound actions. Examples of such systems are road traffic, a patient observed during treatment, a team of robots performing some task, etc.

It is also worth mentioning that a description of the dynamics of a CDS is often not possible with purely analytical methods, as it involves many complex vague concepts (see, e.g., [126, 127, 128]). Such concepts concern properties of chosen fragments of the CDS, which may be treated as more or less complex objects occurring in the CDS. Hence, appropriate methods are needed for extracting such fragments that are sufficient for drawing conclusions about the global state of the CDS in the context of the analyzed types of changes and behaviors. In this approach, the CDS state is described by providing information about the membership of the complex objects isolated from the CDS in the complex concepts already established, describing the properties of complex objects and the relations among these objects. Apart from that, the description of the CDS dynamics requires following the changes of the CDS state in time, which leads to the so-called trajectory (history), that is, a sequence of CDS states observed over a certain period of time. Therefore, methods are also needed for following changes of the selected


fragments of the CDS and changes of the relations between the extracted fragments. In this paper, we use complex spatio-temporal concepts concerning properties describing the dynamics of complex objects occurring in CDSs to represent and monitor such changes. They are expressed in natural language on a much higher level of abstraction than the so-called sensor data, so far mostly applied in the approximation of concepts. Examples of such concepts are safe car driving, safe overtaking, a patient's behavior when faced with a life threat, and ineffective behavior of a robot team. However, the identification of complex spatio-temporal concepts and their use in monitoring a CDS require approximation of these concepts. In this paper, we propose to approximate complex spatio-temporal concepts by the hierarchical classifiers mentioned above, based on data sets and domain knowledge.
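To fix intuitions, the following minimal sketch (in Python; all object, concept, and layer names are hypothetical and of our own choosing) shows one possible representation of a CDS state as the membership of extracted complex objects in established concepts, and of a trajectory as a time-indexed sequence of such states.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class CDSState:
        time: int
        # object id -> {concept name -> membership layer of the concept}
        memberships: Dict[str, Dict[str, str]] = field(default_factory=dict)

    Trajectory = List[CDSState]

    # A toy road-traffic trajectory for a single vehicle:
    trajectory: Trajectory = [
        CDSState(0, {"car7": {"safe_driving": "certainly", "overtaking": "no"}}),
        CDSState(1, {"car7": {"safe_driving": "possibly", "overtaking": "possibly"}}),
        CDSState(2, {"car7": {"safe_driving": "no", "overtaking": "certainly"}}),
    ]

    # Monitoring the CDS then reduces to following how memberships change in time.
    for state in trajectory:
        print(state.time, state.memberships["car7"]["overtaking"])

Under this representation, spatio-temporal concepts are properties of fragments of such trajectories rather than of single states.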

1.2 Problems in Modeling Complex Dynamical Systems

In modeling complex dynamical systems there appear many problems related to the approximation of complex concepts used to describe the dynamics of these systems. One of them is obviously the problem of the gap between complex concepts and sensor data mentioned above. Apart from that, a series of other problems may be formulated whose solution is very important for complex concept approximation and for the monitoring of complex dynamical systems. Below, we present a list of such problems, including particularly those whose solution is the aim of this paper.

1. Problem of the gap between complex concepts and sensor data, preventing an effective direct usage of sensor data to induce approximations of complex concepts by fully automatic methods.
2. Problem of complex concept stratification in classifier construction.
3. Problem of identification of behavioral patterns of complex objects in the monitoring of complex dynamical systems.
4. Problem of the context of complex object parts in the monitoring of complex dynamical systems.
5. Problem of time speed-up in the identification of behavioral patterns.
6. Problem of automated planning of complex object behavior when the object states are represented by complex concepts requiring approximation.
7. Problem of solving conflicts between actions in automated planning of complex object behavior.
8. Problem of synchronization of plans constructed for the parts of a structured complex object.
9. Problem of plan adaptation.
10. Problem of similarity relation approximation between complex objects, complex object states, and complex object behavioral plans using data sets and domain knowledge.

In further subsections, a brief overview of the problems mentioned above is presented.


Problem of the Gap between Complex Concepts and Sensor Data. As we mentioned before, in the approximation of spatio-temporal complex concepts using sensor data, major difficulties occur because between spatio-temporal complex concepts and sensor data there exists a gap which prevents an effective direct usage of sensor data for the approximation of complex concepts. Therefore, in the paper we propose to fill this gap using domain knowledge, represented mainly by a concept ontology, and data sets chosen appropriately for this ontology (see Section 1.3).

Problem of Complex Concept Stratification. When we create classifiers for concepts on the basis of uncertain and imprecise data and knowledge semantically distant from the concepts under approximation, it is frequently not possible to construct a classifier which decisively classifies objects unknown during classifier learning to the concept or its complement. There appears a need to construct classifiers that, instead of stating clearly whether the tested object belongs to the concept or not, allow us to obtain a certain kind of membership degree of the tested object to the concept. In other words, we would like to determine how certain it is that the tested object belongs to the concept. Let us notice that this type of mechanism stratifies the concepts under approximation, that is, divides the tested objects into layers labeled with individual values of membership degree to the concept. Such a mechanism can be obtained using different kinds of probability distributions (see [6, 43]). However, in this paper, instead of learning a probability distribution, we learn layers of concepts relevant for the construction of classifiers. We call such classifiers stratifying classifiers, and we present two methods of stratifying classifier construction (see Section 1.3). Our approach is inspired by the papers on linguistic variables by Professor Lotfi Zadeh (see [129, 130, 131]).

Problem of Identifying Behavioral Patterns. The study of collective behavior in complex dynamical systems is now one of the more challenging research problems (see, e.g., [93, 99, 100, 102, 104, 132, 133, 134]), especially if one considers the introduction of some form of learning by cooperating agents (see, e.g., [103, 122, 123, 124, 135, 136, 137]). For example, efficient monitoring of complex dynamical systems very often requires the identification of so-called behavioral patterns, or of specific types of such patterns called high-risk patterns or emergent patterns (see, e.g., [93, 99, 100, 132, 138, 139, 140, 141, 142, 143, 144]). These are complex concepts concerning dynamic properties of complex objects, expressed in a natural language on a high level of abstraction and describing specific behaviors of these objects. Examples of behavioral patterns may be: overtaking of one vehicle by another, driving of a group of vehicles in a traffic jam, behavior of a patient under a high life threat, etc. These types of concepts are difficult to identify automatically because they require watching complex object behavior over a longer period of time, and this watching is usually based on the identification of a sequence of less complex spatio-temporal concepts. Moreover, a crucial role


in the identification of a given behavioral pattern is played by the sequence of less complex concepts which identify it. For example, in order to identify the behavioral pattern of overtaking of one vehicle by another, it should first be determined whether the overtaking vehicle approaches the overtaken vehicle; next, whether the overtaking vehicle changes lanes appropriately and passes the other vehicle; and finally, whether the overtaking vehicle returns to the previous lane, driving in front of the overtaken vehicle. The methodology of dynamical system modeling proposed in the paper enables the approximation of behavioral patterns on the basis of data sets and domain knowledge expressed using a concept ontology (see Section 1.3).

Problem of Context for Complex Object Parts. In this paper, any complex dynamical system (CDS) is represented using descriptions of its global states or trajectories (histories), that is, sequences of CDS states observed over a certain period of time (see, e.g., [145, 146, 147, 148, 149, 150, 151, 152] and Section 1.1). Properties of such states or trajectories often depend on specific parts of these states or trajectories. This requires considering the relevant structure of states or trajectories, making it possible to extract parts and the relevant context of parts. Moreover, each structured object occurring in a complex dynamical system is understood as a set of parts extracted from the states or trajectories of a given complex dynamical system. Such parts are often related by relations representing links or interactions between parts. That is why both learning behavioral patterns concerning structured objects and identifying such patterns in relation to specific structured objects require the isolation of structured objects as sets of potential parts of such objects, that is, sets of objects of lesser complexity. The elementary approach to isolating structured objects, consisting in the examination of all possible subsets (of an established size) of the set of potential parts of structured objects, cannot be applied because of the potentially high number of such subsets. For example, during an observation of a highway from a helicopter (see, e.g., [89, 153]), in order to identify a group of vehicles involved in the maneuver of dangerous overtaking, it would be necessary to follow (in real time) the behavior of all possible groups of vehicles of an established size (e.g., six vehicles, see Appendix A) that may be involved in this maneuver, which already with a relatively small number of visible vehicles becomes computationally too difficult. Another possibility is the application of methods which use the context in which the objects being parts of structured objects occur. Methods of this type isolate structured objects not by a direct indication of the set of parts of the searched structured object, but by establishing one part of the searched structured object and attaching to it other parts which are in the same context as the established part. Unfortunately, also here, the elementary approach to determining the context of a part of the structured object, consisting in the examination of all possible subsets (of an established size) of the set of potential structured objects to which the established part belongs, cannot be applied because of the large number of such subsets. For example, in order to identify a group of vehicles which are involved in a dangerous maneuver


and to which the vehicle under observation belongs, it would be necessary to follow (in real time) the behavior of all possible groups of vehicles of an established size (e.g., six vehicles, see Appendix A) to which the considered vehicle belongs, which is, with a relatively small number of visible vehicles, still computationally too difficult. Therefore, special methods are needed for determining the context of the established part of the structured object, based on domain knowledge, which make it possible to limit the number of analyzed sets of parts of structured objects. In the paper, we propose the so-called sweeping method, which enables fast determination of the context of the established object treated as one of the parts of the structured object (see Section 1.3).

Problem of Time Speed-Up in Identification of Behavioral Patterns. Identification of a behavioral pattern in relation to a specific complex object may be performed by observing the behavior of this object over a certain period of time. Attempts to shorten this time are usually inadvisable, because they may cause false identification of the behavioral pattern in relation to some complex objects. However, in many applications there exists a need for fast decision making (often in real time) about whether or not a given object matches the established behavioral pattern. This is extremely crucial in terms of computational complexity because it enables rapid elimination of those complex objects which certainly do not match the pattern. Therefore, the paper presents a method for the elimination of complex objects in the identification of a behavioral pattern, based on rules of fast elimination of behavioral patterns determined on the basis of data sets and domain knowledge (see Section 1.3).

Problem of Automated Planning. In monitoring the behavior of complex dynamical systems (e.g., by means of behavioral pattern identification) there may appear a need to apply methods of automated planning of complex object behavior. For example, if during observation of a complex dynamical system a behavioral pattern describing inconvenient or unsafe behavior of a complex object (i.e., a part of a system state or trajectory) is identified, then the system control module may try, using appropriate actions, to change the behavior of this object in such a way as to lead the object out of the inconvenient or unsafe situation. However, this type of short-term intervention may not be sufficient to lead the object out of the undesired situation permanently. Therefore, the possibility of automated planning is often considered, which means the construction of sequences of actions alternating with states (plans) to be performed by the complex object, or on the complex object, in order to bring it to a specific state. In the literature, descriptions of many automated planning methods may be found (see, e.g., [70, 71, 72, 73, 74, 75, 76]). However, when applying the latter approaches, it has to be assumed that the current complex object state is known and results from a simple analysis of the current values of the available parameters of this object. Meanwhile, in complex dynamical systems, a complex object state is often described in a natural language using vague spatio-temporal conditions whose satisfiability cannot be tested on the basis of a simple analysis of the available information about the object. For example, when planning the treatment of an infant suffering from


respiratory failure, the infant's condition may be described by the following condition:

– Patient with RDS type IV, persistent PDA and sepsis with mild internal organs involvement (see Appendix B for more medical details).

Stating the fact that a given patient is in the above condition requires an analysis of the examination results of this patient registered over a certain period of time, with substantial support of domain knowledge provided by experts (medical doctors). Conditions of this type may be represented using complex spatio-temporal concepts. Identification of these conditions requires, however, approximation of the concepts representing them with the help of classifiers. Therefore, in the paper, we describe methods for the automated planning of the behavior of complex objects whose states are described using complex concepts requiring approximation (see Section 1.3).

Problem of Solving Conflicts between Actions. In automated planning methods, during plan construction there usually appears the problem of a nondeterministic choice of one of the actions possible to apply in a given state. Therefore, there may usually be many solutions to a given planning problem, consisting in bringing a complex object from the initial state to the final one using different plans. Meanwhile, in practical applications there often appears a situation where the automatically generated plan must be compatible with the plan proposed by an expert (e.g., the treatment plan should be compatible with the plan proposed by human experts from a medical clinic). Hence, we inevitably need tools which may be used during plan generation to solve the conflicts appearing between actions which may be performed in a given planning state. This also concerns making the decision about what state results from the action performed. That is why, in the paper, we propose a method which indicates the action to be performed in a given state or indicates the state which is the result of the chosen action. This method uses a special classifier constructed on the basis of data sets and domain knowledge (see Section 1.3).

Problem of Synchronizing Plans. In planning the behavior of structurally complex objects consisting of parts which are objects of lesser complexity, it is often not possible to effectively plan the behavior of such an object as a whole. That is why, in such cases, the behavior of all parts is usually planned separately. However, such an approach to behavior planning for a complex object requires synchronization of the plans constructed for the individual parts, in such a way that these plans do not contradict one another but rather complement one another, in order to plan the best behavior for the whole complex object. For example, the treatment of a certain illness A which is the result of illnesses B and C requires such treatment planning for illnesses B and C that their treatments do not contradict one another, but support and complement one another during the treatment of illness A. In the paper, a plan synchronization method for the parts of a complex object is presented. It uses two classifiers constructed on the basis of data sets and domain knowledge (see Section 1.3). If we treat plans constructed for the parts


of a structured object as processes of some kind, then the method of synchronizing those plans is a method of synchronization of the processes corresponding to the parts of a structured object. It should be emphasized, however, that the significant novelty of the process synchronization method presented here, in relation to those known from the literature (see, e.g., [154, 155, 156, 157, 158, 159]), is the fact that the synchronization is carried out using classifiers determined on the basis of data sets and domain knowledge.

Plan Adaptation Problem. After constructing a plan for a complex object, the execution of this plan may take place. However, the execution of the whole plan is not always possible in practice. It may happen that, during plan execution, the complex object reaches a state that is not compatible with the state predicted by the plan. Then, the question arises whether the plan should still be executed or whether it should be reconstructed (updated). If the current complex object state differs only slightly from the state expected by the plan, then the execution of the current plan may perhaps be continued. If, however, the current state differs significantly from the state in the plan, then the current plan has to be reconstructed. It would seem that the easiest way to reconstruct the plan is to construct a new plan which commences at the current state of the complex object and ends at the final state of the old plan (a total reconstruction of the plan). However, in practical applications, a total reconstruction can be too costly in terms of computation or resources. Therefore, we need other methods which can effectively reconstruct the original plan in such a way as to realize it at least partially. Hence, in the paper, we propose a method of plan reconstruction called partial reconstruction. It consists in constructing a short, so-called repair plan which quickly brings the complex object to a so-called return state of the current plan; next, on the basis of the repair plan, a reconstruction of the current plan is performed by replacing with the repair plan the fragment beginning with the current state and ending with the return state (see Section 1.3). It is worth noticing that this issue is related to the domain of artificial intelligence called reasoning about changes (see, e.g., [160, 161]). Research in this domain very often concerns the construction of methods for reasoning about changes in the satisfiability of concepts on a higher level of a certain concept hierarchy as a basis for the discovery of plans aimed at restoring the satisfiability of the desired concepts on a lower level of this hierarchy.

Problem of Similarity Relation Approximation. In building classifiers approximating complex spatio-temporal concepts, there may appear a need to estimate the similarity or the difference of two elements of a similar type, such as complex objects, complex object states, or plans generated for complex objects. This is a classical instance of the problem of defining a similarity relation (or perhaps the dissimilarity relation complementary to it), which is still one of the greatest challenges of data mining and knowledge discovery. The existing methods of defining similarity relations are based on building similarity functions on the basis of simple strategies of fusion of the local similarities of the compared elements. Optimization of the established similarity formula is performed


by tuning both the parameters of the local similarities and the parameters linking them (see, e.g., [162, 163, 164, 165, 166, 167, 168, 169, 170, 171]). Frequently, however, experts from a given domain are not able to provide a formula that would not raise their doubts, and they limit themselves to presenting a set of examples of similarity function values, that is, a set of pairs of compared elements labeled with degrees representing the similarity function value. In this case, defining the similarity function requires its approximation with the help of a classifier, and at the same time such properties of the compared elements should be defined as enable approximation of the similarity function. The main difficulty of similarity function approximation is an appropriate choice of these properties. Meanwhile, according to the domain knowledge, there are usually many various aspects of similarity between compared elements. For example, when comparing medical plans constructed for the treatment of infants with respiratory failure (see Appendix B), similarity of antibiotic therapies, similarity of the applied mechanical ventilation methods, similarity of PDA closing, and others should be taken into account. Each of these aspects should be considered in a specific way, and presenting formulas describing them can be extremely difficult for an expert. Frequently, an expert may only give examples of pairs of compared elements together with their similarity in each of these aspects. Moreover, the fusion of different similarity aspects into a global similarity should also be performed in a way resulting from the domain knowledge. This way may be expressed, for example, using a concept ontology. In the paper, we propose a method of similarity relation approximation based on the usage of data sets and domain knowledge expressed, among other things, on the basis of a concept ontology (see Section 1.3).
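As a rough illustration of learning such a fusion from expert examples rather than from a hand-written formula, the following minimal sketch (in Python; the aspect names, degrees, and labels are hypothetical and of our own choosing) predicts the global similarity of a pair of treatment plans from its local aspect similarities by a nearest-neighbour lookup among expert-labelled pairs; in the paper itself the fusion is guided by a concept ontology, which this sketch does not model.

    # Expert examples: local similarity degrees of plan pairs in three aspects
    # (antibiotic therapy, mechanical ventilation, PDA closing), with the
    # expert's global similarity label.
    training = [
        ((0.9, 0.8, 0.7), "similar"),
        ((0.2, 0.3, 0.9), "dissimilar"),
        ((0.8, 0.2, 0.8), "dissimilar"),
        ((0.7, 0.9, 0.9), "similar"),
    ]

    def global_similarity(aspects):
        # Nearest-neighbour fusion: return the label of the closest expert example.
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return min(training, key=lambda example: dist(example[0], aspects))[1]

    print(global_similarity((0.85, 0.75, 0.8)))   # -> "similar"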

1.3 Overview of the Results Achieved

As we mentioned before, the aim of this paper is to present a set of methods for the approximation of complex spatio-temporal concepts and for approximate reasoning concerning these concepts, assuming that the information about the concepts is given mainly in the form of a concept ontology. The results described in the paper may be divided into the following groups:

1. methods for construction of classifiers stratifying a given concept,
2. a general methodology of concept approximation with the usage of data sets and domain knowledge represented mainly in the form of a concept ontology,
3. methods for approximation of spatial concepts from an ontology,
4. methods for approximation of spatio-temporal concepts from an ontology defined for unstructured objects,
5. methods for approximation of spatio-temporal concepts from an ontology defined for structured objects,
6. methods for identification of behavioral patterns of complex objects in states of complex dynamical systems,


7. methods for automated planning of the behavior of complex objects when the object states are represented by vague complex concepts requiring approximation,
8. implementation of all the more crucial methods described in the paper as an extension of the RSES system.

In further subsections we briefly characterize the above groups of results. At this point we present the publications on which the main results of our research have been partially based. The initial version of the method for approximation of spatial concepts from an ontology was described in [172]. Methods for approximation of spatio-temporal concepts and methods for behavioral pattern identification were presented in [88, 173, 174, 175, 176, 177, 178]. Papers [173, 176, 177, 178] concern the recognition of behavioral patterns of a vehicle or a group of vehicles on the road. The traffic simulator used to generate data for the computer experiments was described in [179]. The paper [174] concerns medical applications related to the recognition of a high death risk pattern of infants suffering from respiratory failure, whereas papers [88, 175] concern both of the applications mentioned above. Finally, methods for automated planning of the behavior of complex objects were described in [88, 180, 181].

Methods for Construction of Classifiers Stratifying Concepts. In practice, the construction of classifiers often takes place on the basis of data sets containing uncertain and imprecise information (knowledge). That is why it is often not possible to construct a classifier which decisively classifies objects to the concept or its complement. This phenomenon occurs particularly when there is a need to classify objects not occurring in the learning set, that is, objects which were not used to construct the classifier. One possible approach is to search for classifiers approximating a probability distribution (see, e.g., [6, 43]). However, in applications one may often require a less exact method based on classifying objects to different linguistic layers of the concept. This idea is inspired by the papers of Professor Lotfi Zadeh (see, e.g., [129, 130, 131]). In our approach, the discovered concept layers are used as patterns in searching for approximations of more compound concepts. In the paper, we present methods for the construction of classifiers which, instead of stating clearly whether a tested object belongs to the concept or not, enable obtaining a membership degree of the tested object to the concept. We define the notion of a stratifying classifier as a classifying algorithm stratifying concepts, that is, classifying objects to different concept layers (see Section 3). We propose two approaches to the construction of such classifiers. The first is the expert approach, based on the definition, by an expert, of an additional attribute in the data which describes the membership of an object to individual concept layers; next, a classifier differentiating the layers as decision classes is constructed. The second approach, called the automated approach, is based on designing algorithms which are extensions of classifiers and enable classifying objects to concept layers on the basis of certain premises and experimental observations. In the paper, a new method of this type is proposed, based on shortening decision rules relative to various coefficients of consistency.
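The following minimal sketch (in Python; the rules, consistency values, and layer names are hypothetical and of our own choosing) illustrates the flavor of the automated approach: decision rules shortened at several consistency thresholds assign a tested object to the concept layer corresponding to the highest consistency at which some rule still matches it.

    # Shortened decision rules for a concept C: (conditions, consistency).
    rules = [
        ({"speed": "high", "lane": "left", "distance": "small"}, 1.00),
        ({"speed": "high", "lane": "left"},                      0.90),
        ({"speed": "high"},                                      0.75),
    ]

    # Linguistic layers of C, ordered from the most certain downwards.
    layers = [(1.00, "certainly in C"), (0.90, "rather in C"), (0.75, "possibly in C")]

    def stratify(obj):
        matched = [cons for conds, cons in rules
                   if all(obj.get(a) == v for a, v in conds.items())]
        if not matched:
            return "certainly outside C"
        best = max(matched)
        # Map the best consistency of a matching rule onto a concept layer.
        for threshold, layer in layers:
            if best >= threshold:
                return layer
        return "certainly outside C"

    print(stratify({"speed": "high", "lane": "left", "distance": "small"}))
    # -> "certainly in C"
    print(stratify({"speed": "high", "lane": "right"}))
    # -> "possibly in C": only the most shortened, least consistent rule matches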


General Methodology of Concept Approximation from an Ontology. One of the main results presented in this paper is a methodology for approximating concepts from an ontology. Generally, in order to approximate concepts, a method of concept approximation on the basis of positive and negative examples, classical in machine learning [10], is applied. It is based on the construction of a data table for each concept, known in rough set theory as a decision table (a special information system with a distinguished attribute called the decision [16]), with rows (called objects) corresponding to positive and negative examples of the approximated concept and with columns describing properties (features, attributes) of the examples, expressed by formulas in a considered language. The last column, called the decision column, is treated as a description of the membership of the individual examples to the approximated concept. For a table constructed in such a way, classifiers approximating the concept are built. In such an approach, the main problem is the choice of the examples of a given concept and of the properties of these examples.

The specificity of the methodology of concept approximation proposed here, in comparison with other methods (see, e.g., [11, 52, 182]), is the usage of domain knowledge expressed in the form of a concept ontology together with rough set methods. For concepts from the lowest level of the ontology hierarchy (the sensor level), which do not depend on the remaining concepts, we assume that so-called sensor attributes are available which, on the basis of the given positive and negative examples, enable approximating these concepts using classical methods of classifier construction. The concept approximation methods applied on a higher level of the ontology, however, consist in approximating concepts using concepts from the lower ontology level. In this way, hierarchical classifiers are created which use domain knowledge recorded in the form of ontology levels. In other words, patterns discovered for the approximation of concepts on a given hierarchy level are used in the construction of more compound patterns relevant for the approximation of concepts on the next hierarchy level.

To approximate concepts from a higher ontology level, sensor attributes cannot be applied directly, because the "semantical distance" of the higher level concepts from the sensor attributes is too long and they are defined on different abstraction levels; i.e., searching for features relevant for approximating such concepts directly from sensory features becomes unfeasible (see the first problem in Section 1.2). For example, it is hardly believable that, given only sensor attributes describing simple parameters of driving a vehicle (e.g., location, speed, acceleration), one could approximate such a complex concept as safe driving of a vehicle. Therefore, we propose a method by means of which concepts from a higher ontology level are approximated exclusively by concepts from one level below.

The proposed approach to the approximation of a higher level concept is based on the assumption that the concept from the higher ontology level is semantically not too far from the concepts lying on the lower level of the ontology. "Not too far" means that it may be expected that it is possible to approximate a concept


from the higher ontology level with the help of the lower ontology level concepts and of patterns used for, or derived from, their construction, for which classifiers have already been built.

If we assume that the approximation of concepts on the higher ontology level takes place using lower level concepts, then, according to the established concept approximation methodology, positive and negative examples of the approximated concept are needed, as well as their properties which serve the purpose of approximation. However, because of the semantical differences between concepts on different ontology levels mentioned above, examples of lower ontology level concepts cannot be used directly to approximate a higher ontology level concept. For example, if the higher level concept concerns a group of vehicles (e.g., driving in a traffic jam, chasing of one vehicle after another, overtaking), whereas the lower level concepts concern single vehicles (e.g., accelerating, decelerating, changing lanes), then the properties of a single vehicle (defined in order to approximate the lower ontology level concepts) are usually insufficient to describe the properties of the whole group of vehicles. Difficulties with concept approximation on the higher ontology level using examples from the lower ontology level also appear when the higher ontology level contains concepts concerning a time period different from the one related to the concepts on the lower ontology level. For example, a higher level concept may concern a time window, that is, a certain period of time (e.g., vehicle acceleration, vehicle deceleration), whereas the lower level concepts may concern a certain instant, that is, a time point (e.g., a small vehicle speed, location of a vehicle in the right lane).

Hence, we present a method for the construction of positive and negative examples of a concept of the higher ontology level consisting, in the general case, in the arrangement (putting together) of sets of examples of concepts of the lower ontology level. At the same time, we define and represent such sets using patterns expressed in languages describing the properties of examples of concepts of the lower ontology level. These sets (represented by patterns) are arranged according to so-called constraints resulting from the domain knowledge and determining which sets (patterns) may, and which may not, be arranged for the construction of examples of higher level concepts. Thus, object structures on higher hierarchical levels come into being through the linking (with the consideration of certain constraints) of objects from lower levels (more precisely, of sets of these objects described by patterns). Such an approach enables gradual modeling of the properties of more and more complex objects: starting with elementary objects, one gradually models objects being their sets, sequences of such objects, sets of sequences, etc. Different languages expressing the properties of, e.g., elementary objects, object sequences, or sets of sequences correspond to different model levels. A crucial innovative feature of the methods presented here is the fact that, in order to define patterns describing examples of a lower ontology level, the classifiers constructed for these concepts are used.

The process of constructing examples for higher ontology level concepts on the basis of lower level concepts proceeds in the following way. Objects which are positive and negative examples of lower ontology level concepts are elements of a


certain relational structure domain. Relations occurring in such a structure express relations between these objects and may be used to extract sets of objects of the lower ontology level. Each extracted set of objects is also a domain of a certain relational structure, in which relations are defined using information from the lower level. The process of extraction of relational structures is performed in order to approximate a higher ontology level concept with the help of lower ontology level concepts. Hence, to extract relational structures we necessarily need information about the membership of the lower level objects to the concepts from this level. Such information may be obtained for any tested object by applying the previously created classifiers for the lower ontology level concepts. Let us note that classifiers stratifying concepts are of special importance here. The language in which we define the formulas (patterns) used to extract new relational structures, using relational structures and lower ontology level concepts, is called the language for extracting relational structures (ERS-language).

For relational structures extracted in such a way, properties (attributes) may be defined, which leads to an information system whose objects are the extracted relational structures and whose attributes are the properties of these structures (RS-information system). Relational structure properties may be defined using patterns which are formulas in a language specially constructed for this purpose, i.e., in a language for defining features of relational structures (FRS-language). For example, some of the languages used in this paper to define the properties of extracted relational structures use elements of temporal logics with linear time, e.g., Linear Temporal Logic (see, e.g., [183, 184, 185]).

The objects of an RS-information system are often inappropriate for making their properties relevant for the approximation of higher ontology level concepts. This is due to the fact that there are too many such objects and their descriptions are too detailed. Hence, when applied to higher ontology level concept approximation, the extension of the created classifier would be too low, that is, the classifier would classify too small a number of tested objects. Apart from that, a problem of computational complexity would appear: because of the large number of objects in such information systems, the number of objects in a linking table, constructed in order to approximate concepts determined on a set of objects of a complex structure, would be too large to construct a classifier effectively (see below). That is why grouping (clustering) of such objects is applied, which leads to more general objects, i.e., clusters of relational structures. This grouping may take place using a language chosen by an expert and called the language for extracting clusters of relational structures (ECRS-language). Within this language, a family of patterns may be selected to extract relevant clusters of relational structures from the initial information system. For the obtained clusters of relational structures, an information system may be constructed whose objects are the clusters defined by patterns from this family and whose attributes are the properties of these clusters. The properties of these clusters may be defined by patterns which are formulas of a language specially constructed for this purpose, i.e., a language for defining features of clusters


of relational structures (FCRS-language). For example, some of the languages used in this paper to define the properties of relational structure clusters use elements of temporal logics with branching time, e.g., Branching Temporal Logic (see, e.g., [183, 184, 185]). The information system whose objects are clusters of relational structures (CRS-information system) may already be used to approximate the concept of the higher ontology level. In order to do this, a new attribute, which informs about the membership of the individual clusters to the approximated concept, is added to the system by the expert, and owing to that we obtain an approximation table of the higher ontology level concept.

The method of construction of the approximation table of a higher ontology level concept may be generalized to concepts determined on a set of structured objects, that is, objects consisting of a set of parts (e.g., a group of vehicles on the road, a group of interacting illnesses, a robot team performing a task together). This generalization means that the CRS-information systems constructed for the individual parts may be linked in order to obtain an approximation table of a higher ontology level concept determined for structured objects. The objects of this table are obtained through an arrangement (linking) of all possible objects of the linked information systems. From the mathematical point of view, this is a Cartesian product of the sets of objects of the linked information systems. However, in terms of domain knowledge, not all object links belonging to such a Cartesian product are possible (see [78, 84, 186, 187]). For example, if we approximate the concept of safe overtaking, it makes sense to arrange objects concerning only such vehicle pairs which are in the process of the overtaking maneuver. For the reason mentioned above, that is, the elimination of unrealistic complexes of objects, so-called constraints are defined, which are formulas built on the basis of the features of the arranged objects. The constraints determine which objects may be arranged in order to obtain an example of an object from a higher level and which may not. Additionally, we assume that to each arrangement allowed by the constraints the expert adds a decision value informing whether a given arrangement belongs or does not belong to the approximated concept of the higher level. The table constructed in such a way serves the purpose of approximating a concept describing structured objects.

However, in order to approximate a concept concerning structured objects, it is often necessary to construct not only all the parts of the structured object but also features describing the relations between the parts. For example, for driving of one vehicle after another, apart from features describing the behavior of those two vehicles separately, features describing the location of these vehicles in relation to one another ought to be constructed as well. That is why, in the construction of a concept approximation table for structured objects, an additional CRS-information system is constructed whose attributes describe the whole structured object in terms of the relations between the parts of this object. In the approximation of a concept concerning structured objects, this system is linked together with the other CRS-information systems constructed for the individual parts of the structured objects.
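To illustrate the linking step, the following minimal sketch (in Python; all attribute names, data, and the constraint are hypothetical and of our own choosing) builds rows of an approximation table for a structured object with two parts by filtering the Cartesian product of two CRS-information systems through a domain constraint and leaving the decision value to the expert.

    from itertools import product

    # Toy CRS-information systems for the two parts of a structured object:
    overtaking_cars = [{"id": "A", "accelerating": True,  "lane": "left"},
                       {"id": "B", "accelerating": False, "lane": "left"}]
    overtaken_cars  = [{"id": "C", "decelerating": False, "lane": "right"},
                       {"id": "D", "decelerating": True,  "lane": "left"}]

    def constraint(x, y):
        # Domain knowledge: only pairs that can actually form an overtaking
        # configuration (here: vehicles in different lanes) may be linked.
        return x["lane"] != y["lane"]

    def expert_decision(row):
        return "?"  # placeholder: the label is supplied by a domain expert

    approximation_table = []
    for x, y in product(overtaking_cars, overtaken_cars):
        if constraint(x, y):                     # eliminate unrealistic links
            row = {"p1_" + k: v for k, v in x.items()}
            row.update({"p2_" + k: v for k, v in y.items()})
            row["decision"] = expert_decision(row)
            approximation_table.append(row)

    print(len(approximation_table))              # 2 of the 4 product pairs survive

In the paper, the constraints are formulas over the features of the arranged objects, and further attributes describing relations between the parts would come from the additional CRS-information system mentioned above.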


Fig. 1. Three cases of complex concept approximation in an ontology. Case 1: a spatial concept C of the higher ontology level (defined for complex objects) is approximated using spatial concepts C1, ..., Cl of the lower ontology level (defined for the same type of complex objects). Case 2: a spatio-temporal concept of the higher ontology level (defined for complex objects) is approximated using spatial concepts of the lower ontology level (defined for the same type of complex objects). Case 3: a spatio-temporal concept of the higher ontology level (defined for structured complex objects) is approximated using spatio-temporal concepts of the lower ontology level (defined for parts of structured complex objects). [Figure: diagram omitted; only the caption and panel descriptions are recoverable.]

A fundamental problem in the construction of an approximation table of a higher ontology level concept is, therefore, the choice of the four appropriate languages used during its construction. The first language serves the purpose of defining patterns, in a set of examples of concepts of the lower ontology level, which enable relational structure extraction. The second one enables defining the properties of these structures. The third one makes it possible to define relational structure clusters and, finally, the fourth one, the properties of these clusters. All these languages must be defined in such a way as to make the properties of the created relational structure clusters useful on the higher ontology level for the approximation of the concept occurring there. Moreover, when the approximated concept concerns structured objects, each of the parts of this type of object may require another four languages similar to those already mentioned above. The definitions of the above four languages depend on the semantical difference between the concepts from both ontology levels. In the paper, the above methodology is applied in the three following cases, in which the above four languages are defined in completely different ways:

1. The concept of the higher ontology level is a spatial concept (it does not require observing changes of objects over time) and it is defined on the set of the same objects (examples) as the concepts of the lower ontology level; at the same time, the lower ontology level concepts are also spatial concepts (see Case 1 in Fig. 1).


2. The concept of the higher ontology level is a spatio-temporal concept (it requires observing object changes over time) and it is defined on a set of the same objects (examples) as the lower ontology level concepts; moreover, the lower ontology level concepts are exclusively spatial concepts (see Case 2 in Fig. 1).
3. The concept of the higher ontology level is a spatio-temporal concept defined on a set of objects which are structured objects in relation to the objects (examples) of the lower ontology level concepts, that is, the lower ontology level objects are parts of the objects from the higher ontology level; at the same time, the lower ontology level concepts are also spatio-temporal concepts (see Case 3 in Fig. 1).

The methods described in the next three subsections concern the above three cases. These methods have also found application in the construction of methods for behavioral pattern identification and in automated planning.

Methods of Approximation of Spatial Concepts. In the paper, a method of approximating concepts from an ontology is proposed for the case when the higher ontology level concept is a spatial concept (not requiring observation of changes over time) and it is defined on a set of the same objects (examples) as the lower ontology level concepts; at the same time, the lower level concepts are also spatial concepts. An exemplary situation of this type is the approximation of the concept Safe overtaking (concerning single vehicles on the road) using concepts such as Safe distance from the opposite vehicle during overtaking, Possibility of going back to the right lane, and Possibility of safe stopping before the crossroads. The concept approximation method described in this subsection is an example of the general methodology of approximating concepts from an ontology described previously. Its specificity is, therefore, the usage of domain knowledge expressed in the form of a concept ontology and the application of rough set methods, mainly in terms of classifier construction methods.

The basic notions used in the presented method are pattern and production rule. Patterns are descriptions of examples of concepts from an ontology, and they are constructed by classifiers stratifying these concepts. A production rule is a decision rule constructed over two adjacent levels of an ontology: in the predecessor of the rule there are patterns for concepts from the lower level of the ontology, whereas in the successor there is a pattern for one concept from the higher level of the ontology (connected with the concepts from the rule predecessor), where both the patterns from the predecessor and the pattern in the successor are chosen from the patterns constructed earlier for concepts from both adjacent ontology levels. A rule constructed in such a way may serve as a simple classifier or as an argument "for"/"against" the given concept, enabling the classification of objects which match the patterns from the rule predecessor to the pattern from the rule successor. In the paper, an algorithmic method for the induction of production rules is proposed, consisting in an appropriate search of data tables with attributes describing the membership of training objects to particular layers of concepts (see Section 5.4). These tables are constructed using so-called constraints between concepts, thanks to which the information put into the tables


only concerns those objects/examples which might be found there according to the production rule under construction. Although a single production rule may be used as a classifier for the concept appearing in its successor, it is not yet a complete classifier, i.e., one classifying all objects belonging to the approximated concept and not only those matching the patterns of the rule predecessor. Therefore, in practice, production rules are grouped into the so-called productions (see Section 5.3), i.e., collections of production rules, in such a way that each production contains rules having patterns for the same concepts in the predecessor and the successor, but corresponding to their different layers. Such a production is able to classify many more objects than a single production rule, and these objects are classified into different layers of the concept occurring in the rule successor. Both productions and production rules themselves are constructed only for two adjacent levels of the ontology. Therefore, in order to use the whole ontology fully, the so-called AR-schemes, i.e., approximate reasoning schemes (see, e.g., [77, 89, 172, 188, 189, 190, 191, 192, 193, 194]), are constructed, which are hierarchical compositions of production rules (see Section 5.7). The synthesis of an AR-scheme is carried out in such a way that to a particular production rule from a lower hierarchical level of the AR-scheme under construction another production rule from a higher level may be attached, but only one in which one of the concepts for which a predecessor pattern was constructed is the concept connected with the successor of the rule from the previous level. Additionally, it is required that the pattern occurring in the predecessor of the higher-level rule is a subset of the pattern occurring in the successor of the lower-level rule (in the sense of inclusion of the object sets matching both patterns). To the two combined production rules other production rules can be attached (from above, from below or from the side), and in this way a multilevel structure is made which is a composition of many production rules. The AR-scheme constructed in such a way can be used as a hierarchical classifier whose inputs are the predecessors of the production rules from the lowest part of the AR-scheme hierarchy and whose output is the successor of the rule from the highest part of the AR-scheme hierarchy. In this way, each AR-scheme is a classifier for the concept occurring in the rule successor from the highest part of the scheme hierarchy or, to be precise, for the concept for which the pattern occurring in that rule successor is determined. However, similarly to the case of a single production rule, an AR-scheme is not a full classifier yet. That is why, in practice, many AR-schemes are constructed for a particular concept, approximating its different layers or regions. In this paper, two approaches to constructing AR-schemes are proposed (see Section 5.7). The first approach is based on a memory of AR-schemes and consists in building many AR-schemes after the productions have been determined; these schemes are later stored and used for the classification of tested objects. The second approach is based on a dynamic construction of AR-schemes: during the classification of a given tested object, an


appropriate AR-scheme for classifying this particular object is built on the basis of a given collection of productions (“lazy” classification). In order to test the quality and effectiveness of classifier construction methods based on AR-schemes, experiments on data generated by the traffic simulator were performed (see Section 5.8). The experiments showed that the classification quality obtained by classifiers based on AR-schemes is higher than that obtained by traditional classifiers based on decision rules. Apart from that, the time spent on constructing classifiers based on AR-schemes is shorter than the time needed to construct classical rule classifiers, their structure is less complicated (a considerably smaller average number of decision rules), and their performance is much more stable with respect to differences in the data samples supplied for learning (e.g., a changed simulation scenario). Methods of Approximation of Spatio-temporal Concepts. We also propose a method of approximating concepts from an ontology when the higher ontology level concept is a spatio-temporal concept (it requires observing changes of complex objects over time) defined on a set of the same objects as the lower ontology level concepts; at the same time, the lower ontology level concepts are spatial concepts only. This case concerns a situation when, in order to capture the behavior of a single object described by a higher ontology level concept, we have to observe it longer than is required to capture the behaviors described by the lower ontology level concepts. For example, lower ontology level concepts may concern simple vehicle behaviors such as small increase in speed, small decrease in speed or small move towards the left lane, whereas the higher ontology level concept may be a more complex concept such as acceleration in the right lane. Let us notice that determining whether a vehicle accelerates in the right lane requires its observation for some time, called a time window. On the other hand, determining whether the speed of a vehicle increases in the right lane requires only a registration of the speed of the vehicle at two neighboring instants (time points). That is why spatio-temporal concepts are more difficult to approximate than spatial concepts, whose approximation does not require observing changes of objects over time. Similarly to spatial concept approximation (see above), the method of concept approximation described in this subsection is an example of the general methodology of approximating concepts from an ontology described earlier. Its specificity is, therefore, the use of domain knowledge expressed in the form of a concept ontology and the application of rough set methods, mainly classifier construction methods. However, in this case more complex ontologies are used, containing both spatial and spatio-temporal concepts. The starting point of the proposed method is the remark that spatio-temporal concept identification requires an observation of a complex object over a longer period of time called a time window (see Section 6.4). To describe complex object changes in the time window, the so-called temporal patterns (see Section 6.6) are used, which are defined as functions determined on a given time window. These patterns, being in fact formulas from a certain language, also characterize


certain spatial properties of the complex object examined, observed in a given time window. They are constructed using lower ontology level concepts, and that is why determining whether an object matches these patterns requires the application of classifiers constructed for the concepts of the lower ontology level. On a slightly higher abstraction level, the spatio-temporal concepts (also called temporal concepts) are used directly to describe complex object behaviors (see Section 6.5). These concepts are defined by an expert in a natural language and they are usually formulated as questions about the current status of spatio-temporal objects, e.g., Does the examined vehicle accelerate in the right lane?, Does the vehicle maintain a constant speed during lane changing? The method proposed here is based on approximating temporal concepts by temporal patterns with the help of classifiers. In order to do this, a special decision table is constructed, called a temporal concept table (see Section 6.9). The rows of this table represent the parameter vectors of lower ontology level concepts observed in a time window (and, more precisely, clusters of such parameter vectors). The columns of this table (apart from the last one) are determined using temporal patterns, whereas the last column represents the membership of an object, described by the parameters (features, attributes) from a given row, in the approximated temporal concept. Temporal concepts may be treated as nodes of a certain directed graph which is called a behavioral graph. Links (directed edges) in this graph are temporal relations between temporal concepts, expressing that two temporal concepts are satisfied one after another in a temporal sequence. These graphs are of great significance in the approximation of concepts concerning structured objects (see below). Methods of Approximation of Spatio-temporal Concepts for Structured Objects. The method of spatio-temporal concept approximation presented in the previous subsection is extended to the case when higher ontology level concepts are defined on a set of objects which are structured objects in relation to the objects (examples) of the lower ontology level concepts, that is, the lower ontology level objects are parts of objects from the higher ontology level. Moreover, the lower ontology level concepts are also spatio-temporal concepts. This case concerns a situation when, in order to capture the behavior of a structured object described by a higher ontology level concept, we must observe this object longer than is required to capture the behavior of a single part of the structured object described by lower ontology level concepts. For example, lower ontology level concepts may concern complex behaviors of a single vehicle such as acceleration in the right lane, acceleration and changing lanes from right to left, decelerating in the left lane. However, a higher ontology level concept may be an even more complex concept describing the behavior of a structured object consisting of two vehicles (the overtaking and the overtaken one) over a certain period of time, for example, the overtaking vehicle changes lanes from right to left, whereas the overtaken vehicle drives in the right lane. Let us notice that the behavior described by this concept is a crucial fragment of the overtaking maneuver, and determining whether the observed group of two vehicles behaved exactly that way requires observing a sequence of


behaviors of vehicles taking part in this maneuver for a certain period of time. These may be: acceleration in the right lane, acceleration and changing lanes from right to left, maintaining a stable speed in the right lane. Analogously to the case of spatial and spatio-temporal concept approximation for unstructured objects, the method of concept approximation described in this subsection is an example of the general methodology of approximating concepts from an ontology described previously. Hence, its specificity is also the use of domain knowledge expressed in the form of a concept ontology and of rough set methods. However, in this case, ontologies may be extremely complex, containing concepts concerning unstructured objects, concepts concerning structured objects as well as concepts concerning relations between parts of structured objects. The starting point for the proposed method is the remark that spatio-temporal concept identification concerning structured objects requires observing changes of these objects over a longer period of time (so-called longer time windows) than in the case of the complex objects which are parts of structured objects. Moreover, spatio-temporal concept identification concerning structured objects requires not only an observation of changes of all constituent parts of a given structured object individually, but also an observation of the relations between these constituent parts and of changes concerning these relations. Therefore, in order to identify spatio-temporal concepts concerning structured objects, we may observe paths in the behavioral graphs of their constituent objects, corresponding to the behaviors of the constituent parts in a given period. Apart from that, paths in behavioral graphs describing relation changes between parts of structured objects should be observed. The properties of these paths may be defined using functions which we call temporal patterns for temporal paths (see Section 6.17). These patterns, being in fact formulas from a certain language, characterize spatio-temporal properties of the examined structured object in terms of its parts and the constraints between these parts. On a slightly higher abstraction level, to describe behaviors of structured objects, the so-called temporal concepts for structured objects (see Section 6.20) are used, which are defined by an expert in a natural language and are usually formulated with the help of questions about the current status of structured objects, e.g., Does one of the two observed vehicles approach the other, driving behind it in the right lane?, Does one of the two observed vehicles change lanes from the right to the left one, driving behind the second vehicle? The method of temporal concept approximation concerning structured objects proposed here is based on the approximation of temporal concepts using temporal patterns for paths in the behavioral graphs of the parts of structured objects together with temporal patterns for paths in the behavioral graphs reflecting relation changes between the constituent parts. In order to do this, a special decision table is constructed, called a temporal concept table of structured objects (see Section 6.20). The rows of this table are obtained by arranging feature (attribute) value vectors of paths from behavioral graphs corresponding to parts of the structured objects observed in the data set (and, more precisely, value vectors of cluster features of such paths) and value vectors of path features from the behavioral graph


reflecting relation changes between parts of the structured object (and, more precisely, value vectors of cluster features of such paths). From the mathematical point of view, such an arrangement is a Cartesian product of the linked feature vectors. However, in terms of domain knowledge, not all links belonging to such a Cartesian product are possible or make sense (see [78, 84, 186, 187]). According to the general methodology presented above, to eliminate such arrangements of feature vectors that are unrealistic or do not make sense, we define the so-called constraints, which are formulas obtained on the basis of the values occurring in the arranged vectors. The constraints determine which vectors may be arranged in order to obtain an example of a concept from the higher level and which may not. Additionally, we assume that to each feature vector arrangement accepted by the constraints the expert adds the decision value informing whether a given arrangement belongs to the approximated concept from the higher level. Methods of Behavioral Pattern Identification. Similarly to the case of spatio-temporal concepts for unstructured complex objects, the spatio-temporal concepts defined for structured objects may also be treated as nodes of a certain directed graph which is called a behavioral graph for a structured object (see Section 6.22). These graphs may be used to represent and identify the so-called behavioral patterns, which are complex concepts concerning dynamic properties of complex structured objects, expressed in a natural language and depending on time and space. Examples of behavioral patterns may be: overtaking on the road, driving in a traffic jam, behavior of a patient connected with a high threat to life. These types of concepts are even more difficult to approximate than many temporal concepts. In the paper, a new method of behavioral pattern identification is presented which is based on interpreting the behavioral graph of a structured object as a complex classifier enabling identification of the behavioral pattern described by this graph. This is possible by observing the behavior of the structured object for a longer time and checking whether the behavior matches a path of the chosen behavioral graph. If this is so, then it is determined that the behavior matches the behavioral pattern represented by this graph, which enables the detection of specific behaviors of structured objects (see Section 6.23). The effective application of the above behavioral pattern identification method encounters, however, two problems in practice. The first of them concerns extracting a relevant context for the parts of structured objects (see the fourth problem from Section 1.2). To solve this problem, a sweeping method enabling a rapid structured object extraction is proposed in this paper. This method works on the basis of simple heuristics, called algorithms of sweeping around complex objects, which are constructed with the use of domain knowledge supported by data sets (see Section 6.13). The second problem appearing with behavioral pattern identification is the problem of fast elimination of those objects that certainly do not match a given behavioral pattern (see the fifth problem from Section 1.2). As one of the methods of solving this problem, we propose the so-called method of fast


elimination of specific behavioral patterns in relation to the analyzed structured objects. This method is based on the so-called rules of fast elimination of behavioral patterns, which are determined from the data and on the basis of domain knowledge (see Section 6.24). It leads to a great acceleration of behavioral pattern identification, because structured objects whose behavior certainly does not match a given behavioral pattern may be eliminated very quickly. For these objects it is not necessary to apply the method based on behavioral graphs, which greatly accelerates the global perception. In order to test the quality and effectiveness of classifier construction methods based on behavioral patterns, experiments were performed on data generated from the road simulator and on medical data connected with the detection of a high death risk in infants suffering from respiratory failure (see Section 6.25 and Section 6.26). The experiments showed that the algorithmic methods presented in this paper provide very good results in detecting behavioral patterns and may be useful in the monitoring of complex dynamical systems. Methods of Automated Planning. Automated planning methods for unstructured complex objects have also been worked out. These methods work on the basis of data sets and domain knowledge represented by a concept ontology. A crucial novelty of the method proposed here, in comparison with the already existing ones, is the fact that performing actions according to a plan depends on satisfying complex vague spatio-temporal conditions expressed in a natural language, which leads to the necessity of approximating these conditions as complex concepts. Moreover, these conditions describe complex concept changes which should be reflected in the concept ontology. The behavior of unstructured complex objects is modeled using the so-called planning rules, being formulas of the type: the state before performing an action → action → state 1 after performing the action | ... | state k after performing the action, which are defined on the basis of data sets and domain knowledge (see Section 7.4). Each rule includes the description of the complex object state before applying the rule (that is, before performing the action), expressed in a language of features proposed by an expert, the name of the action (one of the actions specified by the expert which may be performed in a particular state), and the descriptions of the states which the complex object may turn into after applying the action mentioned above. This means that the application of such a rule gives nondeterministic effects, i.e., after performing the same action the system may turn into different states. All planning rules may be represented in the form of the so-called planning graphs, whose nodes are state descriptions (occurring in predecessors and successors of planning rules) and action names occurring in planning rules (see Section 7.4). In the graphical interpretation, solving the problem of automated planning is based on finding a path in the planning graph from the initial state to an expected final state. It is worth noticing that the conditions for performing an action (object states) are described by vague spatio-temporal complex concepts which are expressed in a natural language and require an approximation. For specific applications connected with the situation when it is expected that the proposed plan of complex object behavior is to be strictly compatible with


the determined experts' instructions (e.g., the course of treatment in a specialist clinic is to be compatible with the treatment schemes used there), an additional mechanism has also been proposed that makes it possible to resolve the nondeterminism occurring in the application of planning rules. This mechanism is an additional classifier based on data sets and domain knowledge. Such classifiers suggest the action to be performed in a given state and indicate the state which is the result of the indicated action (see Section 7.7). The automated planning method for unstructured objects has also been generalized in the paper to the case of planning the behavior of structured objects (consisting of parts connected with one another by dependencies). The generalization is based on the fact that on the level of a structured object an additional planning graph is defined, with two types of nodes and directed edges between the nodes (see Section 7.11). The nodes of the first type describe vague features of states (meta-states) of the whole structured object, whereas the nodes of the second type concern complex actions (meta-actions) performed by the whole structured object (all its constituent parts) over a longer period of time (a time window). The edges between the nodes represent temporal dependencies between meta-states and meta-actions as well as between meta-actions and meta-states. Similarly to the previous case of unstructured objects, planning of structured object behavior is based on finding a path in the planning graph from the initial meta-state to the expected final meta-state; at the same time, each meta-action occurring in such a path must be planned separately on the level of each constituent part of the structured object. In other words, it should be planned what actions each part of the structured object must perform in order for the whole structured object to be able to perform the meta-action which has been planned. During the planning of a meta-action, a synchronization mechanism (determining compatibility) of the plans proposed for the parts of the structured object is used, which works on the basis of a family of classifiers determined from data sets with a great support of domain knowledge. Apart from that, an additional classifier is applied (also based on a data set and domain knowledge) which makes it possible to determine whether the juxtaposition and execution of the plans determined for the constituent parts in fact lead to the execution of the meta-action planned on the level of the whole structured object (see Section 7.13). During the attempt to execute the constructed plan there often appears a need to reconstruct the plan, which means that during the plan execution there may appear a state of the complex object that is not compatible with the state suggested by the plan. A total reconstruction of the plan (building the whole plan from the beginning) may be computationally too costly. Therefore, we propose another plan reconstruction method called a partial reconstruction. It is based on constructing a short, so-called repair plan, which rapidly brings the complex object to a so-called return state appearing in the current plan. Next, on the basis of the repair plan, a reconstruction of the current plan is performed by replacing its fragment beginning with the current state and ending with the return state with the repair plan (see Section 7.9 and Section 7.17).
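To illustrate the planning machinery sketched above, the following minimal Python example is offered as a sketch only; it is not part of the system described in the paper, and the state and action names are hypothetical. It represents planning rules of the form state → action → state 1 | ... | state k as a planning graph and searches for a plan, i.e., a path of alternating states and actions leading from an initial state to an expected final state.

# A minimal sketch of a planning graph built from nondeterministic planning
# rules "state -> action -> state_1 | ... | state_k" and a breadth-first
# search for a plan from an initial state to a goal state.
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical planning rules (invented for illustration): for each state,
# the actions applicable in it and the alternative resulting states.
PLANNING_RULES: Dict[str, Dict[str, List[str]]] = {
    "severe respiratory failure": {"mechanical ventilation": ["moderate respiratory failure",
                                                              "severe respiratory failure"]},
    "moderate respiratory failure": {"CPAP support": ["mild respiratory failure"]},
    "mild respiratory failure": {"oxygen therapy": ["stable condition"]},
}

def find_plan(initial: str, goal: str) -> Optional[List[Tuple[str, str]]]:
    """Return one possible (state, action) sequence from initial to goal,
    treating every alternative outcome of a rule as reachable (an optimistic
    plan; resolving the nondeterminism at execution time is what the paper's
    additional classifiers are for)."""
    queue = deque([(initial, [])])
    visited = {initial}
    while queue:
        state, plan = queue.popleft()
        if state == goal:
            return plan
        for action, outcomes in PLANNING_RULES.get(state, {}).items():
            for next_state in outcomes:           # nondeterministic effects
                if next_state not in visited:
                    visited.add(next_state)
                    queue.append((next_state, plan + [(state, action)]))
    return None

print(find_plan("severe respiratory failure", "stable condition"))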


In the construction and application of classifiers approximating complex spatio-temporal concepts, there may appear a need to construct, with a great support of domain knowledge, a similarity relation between two elements of a similar type, such as complex objects, complex object states, or plans generated for complex objects. Hence, in this paper we propose a new method of similarity relation approximation based on the use of data sets and domain knowledge expressed mainly in the form of a concept ontology. We apply this method, among other things, to verify the automated planning methods, that is, to compare the plan generated automatically with the plan suggested by experts from a given domain (see Section 7.18, Section 7.19 and Section 7.20). In order to check the effectiveness of the automated planning methods proposed here, experiments concerning planning of the treatment of infants suffering from respiratory failure were performed (see Section 7.21). The experimental results showed that the proposed method gives good results, also in the opinion of medical experts (compatible enough with the plans suggested by the experts), and may be applied in medical practice as a tool supporting the planning of the treatment of infants suffering from respiratory failure. Implementation and Data Sets. A further result of the work conducted is a programming system supporting the approximation of spatio-temporal complex concepts in a given concept ontology in a dialogue with the user. The system also includes an implementation of the algorithmic methods presented in this paper and is available on the website of the RSES system (see [15]). Sections 5, 6 and 7, apart from the method descriptions, contain the results of computing experiments conducted on real-life data sets, supported by domain knowledge. It is worth mentioning that the requirements regarding data sets which can be used for computing experiments with modeling spatio-temporal phenomena are much greater than the requirements for data used in testing classical classifiers. Not only does the data have to be representative of the decision-making problem under consideration, but it also has to be related to the available domain knowledge (usually cooperation with experts in a particular domain is essential). It is important that such data fully and appropriately reflect the complex spatio-temporal phenomena connected with the environment in which the data were collected. The author of the paper acquired such data sets from two sources. The first source of data is the traffic simulator made by the author (see Appendix A). The simulator is a computing tool for generating data sets connected with the traffic on a street and at crossroads. During a simulation, each vehicle appearing on the simulation board behaves as an independently acting agent. On the basis of observation of the surroundings (other vehicles, its own location, weather conditions, etc.), this agent makes an independent decision what maneuvers it should make to achieve its aim, which is to go safely across the simulation board and to leave the board using the outbound way given in advance. At any given moment of the simulation, all crucial vehicle parameters may be recorded, and thanks to this, data sets for experiments can be obtained.


The second collection of data sets used in the computer experiments was provided by the Neonatal Intensive Care Unit, First Department of Pediatrics, Polish-American Institute of Pediatrics, Collegium Medicum, Jagiellonian University, Krakow, Poland. This data constitutes a detailed description of the treatment of 300 infants, i.e., treatment results, diagnoses, operations, and medication (see Section 6.26 and Appendix B).

1.4 Organization of the Paper

This paper is organized as follows. In Section 2 we briefly describe selected classical methods of classifier construction and concept approximation which are used in subsequent sections of the paper. These methods are based on rough set theory achievements and were described in the author's previous papers (see, e.g., [14, 195, 196, 197, 198, 199, 200, 201, 202, 203]). In Section 3 we describe methods of construction of a concept stratifying classifier. The general methodology of approximating concepts with the use of data sets and domain knowledge represented mainly in the form of a concept ontology is described in Section 4. Methods of approximating spatial concepts from an ontology are described in Section 5, whereas methods of approximating spatio-temporal concepts from an ontology and methods of behavioral pattern identification are described in Section 6. Methods of automated planning of complex object behavior, when object states are represented by complex concepts requiring approximation with the use of data sets and domain knowledge, are presented in Section 7. Finally, in Section 8 we summarize the results and give directions for future research. The paper also contains two appendices. The first appendix contains the description of the traffic simulator used to generate experimental data (see Appendix A). The second one describes medical issues connected with infant respiratory failure (see Appendix B), concerning one of the data sets used in the experiments.

2 Classical Classifiers

In general, the term classify means to arrange objects in groups or classes based on shared characteristics (see [1]). In this work, the term classification has a special meaning, i.e., classification connotes any context in which some decision or forecast about object grouping is made on the basis of currently available knowledge or information (see, e.g., [11, 204]). A classification algorithm (classifier) is an algorithm which enables us to make a forecast repeatedly, in new situations, on the basis of accumulated knowledge (see, e.g., [11]). Here we consider the classification provided by a classifying


algorithm which is applied to a number of cases in order to classify previously unseen objects. Each new object is assigned to a class belonging to a predefined set of classes on the basis of observed values of suitably chosen attributes (features). Many approaches have been proposed to construct classification algorithms. Among them we would like to mention classical and modern statistical techniques (see, e.g., [11, 13]), neural networks (see, e.g., [11, 13, 205]), decision trees (see, e.g., [11, 206, 207, 208, 209, 210, 211, 212]), decision rules (see, e.g., [10, 11, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223]) and inductive logic programming (see, e.g., [11, 224]). In this section, we consider methods implemented in our system RSES (Rough Set Exploration System) (see [14, 225, 226, 227, 228, 229, 230, 231]). RSES is a computer software system developed for the purpose of data analysis (the data is assumed to be in the form of an information system or a decision table, see Section 2.1). In the construction of classifiers, which is the main step in the process of data analysis with RSES, elements of rough set theory are used. In this paper, we call these algorithms the standard RSES methods of classifier construction. The majority of the standard RSES methods of classifier construction have been applied in more advanced methods of classifier construction, which will be presented in Sections 3, 5, 6, and 7. Therefore, in this section we only give a brief overview of these methods of classifier construction. These methods are based on rough set theory (see [16, 17, 232]). In Section 2.1 we start with an introduction of the basic rough set terminology and notation necessary for the rest of this paper. The analysis of data in the RSES system proceeds according to the scheme presented in Fig. 2. First, the data for analysis has to be loaded/imported into the system. Next, in order to have a better chance of constructing (learning) a proper classifier, it is frequently advisable to transform the initial data set. Such a transformation, usually referred to as preprocessing, may consist of several steps. RSES supports preprocessing methods which make it possible to manage

[Fig. 2. The RSES data analysis process: load/import data table, data preprocessing, knowledge reduction, classifier construction, classifier evaluation, classification of new cases.]


missing parts in data, discretize numeric attributes, and create new attributes (see [14] and Section 2.2 for more details). When the data is preprocessed, we may be interested in learning about its internal structure. By using classical rough set concepts such as reducts (see Section 2.1), dynamic reducts (see [14, 195, 196, 198, 201, 202, 203]), and the positive region (see Section 2.1), one can discover dependencies that occur in the data set. Knowledge of reducts can lead to a reduction of the data by removing some of the redundant attributes. Next, the classifier construction may be started. In the RSES system, these classifiers may be constructed using various methods (see [14] and Sections 2.3, 2.4, 2.5, 2.6, 2.7 for more details). A classifier is constructed on the basis of a training set consisting of labeled examples (objects with decisions). Such a classifier may further be used for evaluation on a test set or applied to new, unseen and unlabeled cases in order to determine the decision value (classification) for them (see Section 2.9). If the quality of the constructed classifier is insufficient, one may return to data preprocessing and/or knowledge reduction; another method of classifier construction may be applied as well.

2.1 Rough Set Basic Notions

In order to provide a clear description further in the paper and to avoid any misunderstandings, we recall here some essential definitions from rough set theory. We will frequently refer to the notions introduced in this section. Quite a comprehensive description of notions and concepts related to classical rough set theory may be found in [189]. An information system (see [16, 17]) is a pair A = (U, A) where U is a non-empty, finite set called the universe of A and A is a non-empty, finite set of attributes, i.e., mappings a : U → Va, where Va is called the value set of a ∈ A. Elements of U are called objects and are interpreted as, e.g., cases, states, processes, patients, observations. Attributes are interpreted as features, variables, characteristic conditions. We also consider a special case of information systems called decision tables. A decision table is an information system of the form A = (U, A, d) where d ∉ A is a distinguished attribute called the decision. The elements of A are called condition attributes or conditions. One can interpret the decision attribute as a kind of partition of the universe of objects given by an expert, a decision-maker, an operator, a physician, etc. In machine learning, decision tables are called training sets of examples (see [10]). The cardinality of the image d(U) = {k : d(s) = k for some s ∈ U} is called the rank of d and is denoted by r(d). We assume that the set Vd of values of the decision d is equal to {v_d^1, ..., v_d^r(d)}. Let us observe that the decision d determines a partition CLASS_A(d) = {X_A^1, . . . , X_A^r(d)} of the universe U, where X_A^k = {x ∈ U : d(x) = v_d^k} for 1 ≤ k ≤ r(d). CLASS_A(d) is called the classification of objects of A determined


by the decision d. The set X_A^i is called the i-th decision class of A. By X_A(u) we denote the decision class {x ∈ U : d(x) = d(u)}, for any u ∈ U. Let A = (U, A) be an information system. For every set of attributes B ⊆ A, an equivalence relation, denoted by IND_A(B) and called the B-indiscernibility relation, is defined by

IND_A(B) = {(u, u′) ∈ U × U : ∀a ∈ B  a(u) = a(u′)}.   (1)

Objects u, u′ being in the relation IND_A(B) are indiscernible by attributes from B. By [u]_IND_A(B) we denote the equivalence class of the relation IND_A(B) such that u belongs to this class. An attribute a ∈ B ⊆ A is dispensable in B if IND_A(B) = IND_A(B \ {a}), otherwise a is indispensable in B. A set B ⊆ A is independent in A if every attribute from B is indispensable in B, otherwise the set B is dependent in A. A set B ⊆ A is called a reduct in A if B is independent in A and IND_A(B) = IND_A(A). The set of all reducts in A is denoted by RED_A(A). This is the classical notion of a reduct and it is sometimes referred to as a global reduct. Let A = (U, A) be an information system with n objects. By M(A) (see [21]) we denote an n × n matrix (c_ij), called the discernibility matrix of A, such that

c_ij = {a ∈ A : a(x_i) ≠ a(x_j)} for i, j = 1, . . . , n.   (2)

A discernibility function f_A for an information system A is a Boolean function of m Boolean variables ā1, . . . , ām corresponding to the attributes a1, . . . , am, respectively, and defined by

f_A(ā1, . . . , ām) = ∧ { ∨ c̄_ij : 1 ≤ j < i ≤ n ∧ c_ij ≠ ∅ },   (3)

where c̄_ij = {ā : a ∈ c_ij}. It can be shown (see [21]) that the set of all prime implicants of f_A determines the set of all reducts of A. We present an exemplary deterministic algorithm for computation of the whole reduct set RED_A(A) (see, e.g., [199]). This algorithm computes the discernibility matrix of A (see Algorithm 2.1). The time cost of the reduct set computation using the algorithm presented above can be too high in case the decision table consists of too many objects, attributes, or different values of attributes. The reason is that, in general, the size of the reduct set can be exponential with respect to the size of the decision table and the problem of minimal reduct computation is NP-hard (see [21]). Therefore, we are often forced to apply approximation algorithms to obtain some knowledge about the reduct set. One way is to use approximation algorithms that need not give optimal solutions but require a short computing time. Among these algorithms are the following ones: Johnson's algorithm, covering algorithms, algorithms based on simulated annealing and Boltzmann machines, algorithms using neural networks and algorithms based on genetic algorithms (see, e.g., [196, 198, 199] for more details).


Algorithm 2.1. Reduct set computation
Input: Information system A = (U, A)
Output: Set RED_A(A) of all reducts of A
 1  begin
 2      Compute the discernibility matrix M(A)
 3      Reduce M(A) using the absorption laws
        // Let C1, ..., Cd be the non-empty fields of the reduced M(A)
 4      Build a family of sets R0, R1, ..., Rd in the following way:
 5      begin
 6          R0 = ∅
 7          for i = 1 to d do
 8              Ri = Si ∪ Ti, where Si = {R ∈ Ri−1 : R ∩ Ci ≠ ∅} and
 9              Ti = {R ∪ {a} : a ∈ Ci, R ∈ Ri−1, R ∩ Ci = ∅}
10          end
11      end
12      Remove dispensable attributes from each element of the family Rd
13      Remove redundant elements from Rd
14      RED_A(A) = Rd
15  end

If A = (U, A) is an information system, B ⊆ A is a set of attributes and X ⊆ U is a set of objects (usually called a concept), then the sets B̲X = {u ∈ U : [u]_IND_A(B) ⊆ X} and B̄X = {u ∈ U : [u]_IND_A(B) ∩ X ≠ ∅} are called the B-lower and the B-upper approximations of X in A, respectively. The set BN_B(X) = B̄X − B̲X is called the B-boundary of X (boundary region, for short). When B = A, we also write BN_A(X) instead of BN_B(X). Sets which are unions of some classes of the indiscernibility relation IND_A(B) are called definable by B (or B-definable, in short). A set X is, thus, B-definable iff B̲X = B̄X. Some subsets (categories) of objects in an information system cannot be exactly expressed in terms of the available attributes, but they can be defined roughly. The set B̲X is the set of all elements of U which can be classified with certainty as elements of X, given knowledge about these elements in the form of values of attributes from B; the set BN_B(X) is the set of elements of U which one can classify neither to X nor to −X having knowledge about objects represented by B. If the boundary region of X ⊆ U is the empty set, i.e., BN_B(X) = ∅, then the set X is called crisp (exact) with respect to B; in the opposite case, i.e., if BN_B(X) ≠ ∅, the set X is referred to as rough (inexact) with respect to B (see, e.g., [17]). If X1, . . . , X_r(d) are the decision classes of A, then the set B̲X1 ∪ · · · ∪ B̲X_r(d) is called the B-positive region of A and is denoted by POS_B(d).
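The rough set notions just recalled are straightforward to compute from a decision table. The following small Python sketch is illustrative only (the toy table and attribute names are invented, and the code is not part of RSES); it computes B-indiscernibility classes, the B-lower and B-upper approximations of a concept, and the B-positive region.

# A minimal sketch of indiscernibility classes, lower/upper approximations
# and the positive region for a toy decision table (attribute values invented).
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple

# Toy decision table: each object is a dict of condition attributes plus 'd'.
TABLE: List[Dict[str, Hashable]] = [
    {"speed": "low",  "lane": "right", "d": "safe"},
    {"speed": "low",  "lane": "right", "d": "safe"},
    {"speed": "high", "lane": "left",  "d": "unsafe"},
    {"speed": "high", "lane": "right", "d": "safe"},
    {"speed": "high", "lane": "right", "d": "unsafe"},   # conflicts with the previous object
]

def ind_classes(table, attrs: Tuple[str, ...]) -> List[Set[int]]:
    """Partition object indices into B-indiscernibility classes."""
    groups = defaultdict(set)
    for i, obj in enumerate(table):
        groups[tuple(obj[a] for a in attrs)].add(i)
    return list(groups.values())

def approximations(table, attrs, concept: Set[int]) -> Tuple[Set[int], Set[int]]:
    """Return the B-lower and B-upper approximations of a concept (set of indices)."""
    lower, upper = set(), set()
    for cls in ind_classes(table, attrs):
        if cls <= concept:
            lower |= cls
        if cls & concept:
            upper |= cls
    return lower, upper

B = ("speed", "lane")
safe = {i for i, obj in enumerate(TABLE) if obj["d"] == "safe"}
lower, upper = approximations(TABLE, B, safe)
positive_region = set().union(*(approximations(TABLE, B,
    {i for i, o in enumerate(TABLE) if o["d"] == v})[0] for v in {o["d"] for o in TABLE}))
print(lower, upper, positive_region)   # objects 3 and 4 form the boundary region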


If A = (U, A, d) is a decision table and B ⊆ A, then we define a function ∂_B : U → P(Vd), called the B-generalized decision of A, by

∂_B(x) = {v ∈ Vd : ∃x′ ∈ U (x′ IND_A(B) x and d(x′) = v)}.   (4)

The A-generalized decision ∂_A of A is called the generalized decision of A. A decision table A is called consistent (deterministic) if card(∂_A(x)) = 1 for any x ∈ U, otherwise A is inconsistent (non-deterministic). Non-deterministic information systems were introduced by Witold Lipski (see [233]), while deterministic information systems were introduced independently by Zdzislaw Pawlak [234] (see also [235, 236]). It is easy to see that a decision table A is consistent iff POS_A(d) = U. Moreover, if ∂_B = ∂_B′, then POS_B(d) = POS_B′(d) for any pair of non-empty sets B, B′ ⊆ A. A subset B of the set A of attributes of a decision table A = (U, A, d) is a relative reduct of A iff B is a minimal set with respect to the following property: ∂_B = ∂_A. The set of all relative reducts of A is denoted by RED(A, d). Let A = (U, A, d) be a consistent decision table and let M(A) = (c_ij) be its discernibility matrix. We construct a new matrix M′(A) = (c′_ij) assuming c′_ij = ∅ if d(x_i) = d(x_j), and c′_ij = c_ij − {d} otherwise. The matrix M′(A) is called the relative discernibility matrix of A. Now, one can construct the relative discernibility function f_M′(A) of M′(A) in the same way as the discernibility function. It can be shown (see [21]) that the set of all prime implicants of f_M′(A) determines the set of all relative reducts of A. Another important type of reducts are local reducts. A local reduct r(x_i) ⊆ A (or a reduct relative to the decision and an object x_i ∈ U, where x_i is called a base object) is a subset of A such that: 1. ∀ x_j ∈ U: d(x_i) ≠ d(x_j) =⇒ ∃ a_k ∈ r(x_i): a_k(x_i) ≠ a_k(x_j), 2. r(x_i) is minimal with respect to inclusion. If A = (U, A, d) is a decision table, then any system B = (U′, A, d) such that U′ ⊆ U is called a subtable of A. A template of A is a formula ∧(a_i = v_i), where a_i ∈ A and v_i ∈ V_a_i. A generalized template is a formula of the form ∧(a_i ∈ T_i), where T_i ⊂ V_a_i. An object satisfies (matches) a template if for every attribute a_i occurring in the template the value of this attribute at the considered object is equal to v_i (belongs to T_i in the case of the generalized template). The template splits the original information system into two distinct subtables containing objects that satisfy and do not satisfy the template, respectively. It is worth mentioning that the notion of a template can be treated as a particular case of a more general notion, viz., that of a pattern (see Section 4.9).
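As a complement to Algorithm 2.1, the sketch below (illustrative Python, not the RSES implementation) builds the non-empty entries of the relative discernibility matrix of a small decision table and then applies a greedy, Johnson-style heuristic, one of the approximation strategies mentioned above, to find a single short relative reduct rather than the whole reduct set.

# A sketch of the relative discernibility matrix and a greedy (Johnson-style)
# heuristic that picks attributes covering the most non-empty matrix entries.
from collections import Counter
from typing import Dict, List, Set

TABLE: List[Dict[str, str]] = [                     # toy decision table, values invented
    {"speed": "low",  "lane": "right", "dist": "far",  "d": "safe"},
    {"speed": "high", "lane": "left",  "dist": "near", "d": "unsafe"},
    {"speed": "high", "lane": "right", "dist": "far",  "d": "safe"},
    {"speed": "low",  "lane": "left",  "dist": "near", "d": "unsafe"},
]
ATTRS = ["speed", "lane", "dist"]

def relative_discernibility(table) -> List[Set[str]]:
    """Non-empty entries c'_ij: condition attributes discerning objects with different decisions."""
    entries = []
    for i in range(len(table)):
        for j in range(i):
            if table[i]["d"] != table[j]["d"]:
                diff = {a for a in ATTRS if table[i][a] != table[j][a]}
                if diff:
                    entries.append(diff)
    return entries

def greedy_reduct(entries: List[Set[str]]) -> Set[str]:
    """Johnson-style heuristic: repeatedly take the attribute occurring in most uncovered entries."""
    reduct: Set[str] = set()
    uncovered = list(entries)
    while uncovered:
        counts = Counter(a for e in uncovered for a in e)
        best = counts.most_common(1)[0][0]
        reduct.add(best)
        uncovered = [e for e in uncovered if best not in e]
    return reduct

print(greedy_reduct(relative_discernibility(TABLE)))   # a small attribute subset preserving decision discernibility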

2.2 Discretization

Suppose we have a decision table A = (U, A, d) where card(Va ) is high for some a ∈ A. Then, there is a very low chance that a new object is recognized by rules


generated directly from this table, because the attribute value vector of a new object will not match any of these rules. Therefore, for decision tables with real (numerical) value attributes, discretization strategies are applied in order to obtain a higher quality of classification. This problem has been intensively studied (see, e.g., [199, 237, 238] for more details). The process of discretization is usually realized in the following two steps (see, e.g., [14, 199, 237, 238]). First, the algorithm generates a set of cuts. By a cut for an attribute ai ∈ A such that Vai is an ordered set we denote a value c ∈ Vai. The cuts can then be used to transform the decision table. As a result we obtain a decision table with the same set of attributes, but the attributes have different values. Instead of a(x) = v for an attribute a ∈ A and an object x ∈ U, we rather get a(x) ∈ [c1, c2], where c1 and c2 are cuts generated for attribute a by a discretization algorithm. The cuts are generated in such a way that the resulting intervals contain sets of objects that are as uniform as possible with respect to the decision. The discretization method available in RSES has two versions (see, e.g., [14, 199, 238]), usually called global and local. Both are bottom-up approaches which add cuts for a given attribute one by one in subsequent iterations of the algorithm. The difference between these two methods lies in the way in which the candidate for a new cut is evaluated. In the global method, we evaluate all objects in the data table at every step. In the local method, we only consider the part of the objects that is related to the candidate cut, i.e., the objects which have the value of the currently considered attribute in the same range as the cut candidate. Naturally, the second (local) method is faster, as fewer objects have to be examined at every step. In general, the local method produces more cuts. The local method is also capable of dealing with nominal (symbolic) attributes. Grouping (quantization) of a nominal attribute domain with the use of the local method always results in two subsets of attribute values (see, e.g., [14, 199, 238] for more details).
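The sketch below (illustrative Python; the cut-selection criterion is a simple decision-purity heuristic chosen for brevity and is not the exact RSES criterion) shows the two steps described above: generating candidate cuts for a numerical attribute and then transforming attribute values into the intervals determined by those cuts.

# A sketch of cut generation for one numerical attribute and of transforming
# values into the intervals determined by the cuts.
from bisect import bisect_right
from typing import List, Tuple

# (attribute value, decision) pairs for a single numerical attribute; values invented.
SAMPLE: List[Tuple[float, str]] = [
    (30.0, "safe"), (35.0, "safe"), (42.0, "safe"),
    (55.0, "unsafe"), (61.0, "unsafe"), (70.0, "unsafe"),
]

def candidate_cuts(sample) -> List[float]:
    """Midpoints between consecutive attribute values whose decisions differ."""
    ordered = sorted(sample)
    return [(v1 + v2) / 2
            for (v1, d1), (v2, d2) in zip(ordered, ordered[1:])
            if d1 != d2]

def discretize(value: float, cuts: List[float]) -> str:
    """Return a symbolic label of the interval (between neighboring cuts) the value falls into."""
    cuts = sorted(cuts)
    k = bisect_right(cuts, value)
    left = "-inf" if k == 0 else str(cuts[k - 1])
    right = "+inf" if k == len(cuts) else str(cuts[k])
    return f"({left}, {right})"

cuts = candidate_cuts(SAMPLE)
print(cuts)                     # [48.5] -- a single cut separates the two decision classes
print(discretize(40.0, cuts))   # (-inf, 48.5)
print(discretize(60.0, cuts))   # (48.5, +inf)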

2.3 Decision Rules

Let A = (U, A, d) be a decision table and let V = ⋃{Va : a ∈ A} ∪ Vd. Atomic formulas over B ⊆ A ∪ {d} and V are expressions of the form a = v, called descriptors over B and V, where a ∈ B and v ∈ Va. The set F(B, V) of formulas over B and V is the least set containing all atomic formulas over B, V and closed with respect to the classical propositional connectives ∨ (disjunction), ∧ (conjunction), and ¬ (negation). Let ϕ ∈ F(B, V). Then, by |ϕ|_A we denote the meaning of ϕ in the decision table A, i.e., the set of all objects of U with the property ϕ, defined inductively by:
1. if ϕ is of the form a = v, then |ϕ|_A = {x ∈ U : a(x) = v},
2. |ϕ ∧ ϕ′|_A = |ϕ|_A ∩ |ϕ′|_A,
3. |ϕ ∨ ϕ′|_A = |ϕ|_A ∪ |ϕ′|_A,
4. |¬ϕ|_A = U − |ϕ|_A.


The set F(A, V) is called the set of conditional formulas of A and is denoted by C(A, V). Any formula of the form (a1 = v1) ∧ ... ∧ (al = vl), where vi ∈ Vai (for i = 1, ..., l) and P = {a1, ..., al} ⊆ A, is called a P-basic formula of A. If ϕ is a P-basic formula of A and Q ⊆ P, then by ϕ/Q we mean the Q-basic formula obtained from the formula ϕ by removing from ϕ all its elementary subformulas (a = va) such that a ∈ P \ Q. A decision rule for A is any expression of the form ϕ ⇒ d = v, where ϕ ∈ C(A, V), v ∈ Vd, and |ϕ|_A ≠ ∅. The formulas ϕ and d = v are referred to as the predecessor (premise) and the successor of the decision rule ϕ ⇒ d = v, respectively. If r is a decision rule in A, then by Pred(r) we denote the predecessor of r and by Succ(r) we denote the successor of r. An object u ∈ U is matched by a decision rule ϕ ⇒ d = v_d^k (where 1 ≤ k ≤ r(d)) iff u ∈ |ϕ|_A. If u is matched by ϕ ⇒ d = v_d^k, then we say that the rule classifies u to the decision class X_A^k. The number of objects matched by a decision rule ϕ ⇒ d = v, denoted by Match_A(ϕ ⇒ d = v), is equal to card(|ϕ|_A). The number Supp_A(ϕ ⇒ d = v) = card(|ϕ|_A ∩ |d = v|_A) is called the number of objects supporting the decision rule ϕ ⇒ d = v. A decision rule ϕ ⇒ d = v for A is true in A, symbolically ϕ ⇒_A d = v, iff |ϕ|_A ⊆ |d = v|_A. If the decision rule ϕ ⇒ d = v is true in A, we say that the decision rule is consistent in A, otherwise ϕ ⇒ d = v is inconsistent or approximate in A. If r is a decision rule in A, then the number μ_A(r) = Supp_A(r) / Match_A(r) is called the coefficient of consistency of the rule r. The coefficient μ_A(r) may be understood as the degree of consistency of the decision rule r. It is easy to see that a decision rule r for A is consistent iff μ_A(r) = 1. The coefficient of consistency of r can also be treated as the degree of inclusion of |Pred(r)|_A in |Succ(r)|_A (see, e.g., [239]). If ϕ ⇒ d = v is a decision rule for A and ϕ is a P-basic formula of A (where P ⊆ A), then the decision rule ϕ ⇒ d = v is called a P-basic decision rule for A, or a basic decision rule for short. Let ϕ ⇒ d = v be a P-basic decision rule of A (where P ⊆ A) and let a ∈ P. We will say that the attribute a is dispensable in the rule ϕ ⇒ d = v iff |ϕ ⇒ d = v|_A = U implies |ϕ/(P \ {a}) ⇒ d = v|_A = U, otherwise the attribute a is indispensable in the rule ϕ ⇒ d = v. If all attributes a ∈ P are indispensable in the rule ϕ ⇒ d = v, then ϕ ⇒ d = v is called independent in A. A subset of attributes R ⊆ P is called a reduct of the P-basic decision rule ϕ ⇒ d = v if ϕ/R ⇒ d = v is independent in A and |ϕ ⇒ d = v|_A = U implies |ϕ/R ⇒ d = v|_A = U. If R is a reduct of the P-basic decision rule ϕ ⇒ d = v, then ϕ/R ⇒ d = v is said to be reduced. If R is a reduct of the A-basic decision rule ϕ ⇒ d = v, then ϕ/R ⇒ d = v is said to be an optimal basic decision rule of A (a basic decision rule with a minimal number of descriptors). The set of all optimal basic decision rules of A is denoted by RUL(A).
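To make these definitions concrete, the following Python sketch (illustrative only; the toy rule and table are invented) represents a basic decision rule as a set of descriptors and computes Match_A, Supp_A and the coefficient of consistency μ_A on a small decision table.

# A sketch of a basic decision rule "speed = high AND lane = left => d = unsafe"
# together with Match_A, Supp_A and the coefficient of consistency mu_A.
from typing import Dict, List, Tuple

TABLE: List[Dict[str, str]] = [                       # toy decision table, values invented
    {"speed": "high", "lane": "left",  "d": "unsafe"},
    {"speed": "high", "lane": "left",  "d": "unsafe"},
    {"speed": "high", "lane": "left",  "d": "safe"},   # an object contradicting the rule
    {"speed": "low",  "lane": "right", "d": "safe"},
]

Rule = Tuple[Dict[str, str], Tuple[str, str]]          # (predecessor descriptors, (decision attr, value))
rule: Rule = ({"speed": "high", "lane": "left"}, ("d", "unsafe"))

def matches(obj: Dict[str, str], descriptors: Dict[str, str]) -> bool:
    return all(obj[a] == v for a, v in descriptors.items())

def rule_statistics(table, rule: Rule) -> Tuple[int, int, float]:
    predecessor, (dec_attr, dec_val) = rule
    matched = [obj for obj in table if matches(obj, predecessor)]
    support = [obj for obj in matched if obj[dec_attr] == dec_val]
    consistency = len(support) / len(matched) if matched else 0.0
    return len(matched), len(support), consistency

match_a, supp_a, mu_a = rule_statistics(TABLE, rule)
print(match_a, supp_a, round(mu_a, 2))   # 3 2 0.67 -- an approximate (inconsistent) rule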


2.4 Two Methods for Decision Rule Synthesis

Classifiers based on a set of decision rules are the most elaborated methods in RSES. Several methods for the calculation of decision rule sets are implemented. Also, various methods for transforming and utilizing rule sets are available. However, in our computer experiments we usually use two methods for decision rule synthesis, which we would like to mention here. The first method returns all basic decision rules with a minimal number of descriptors (see, e.g., [196, 198, 199, 240]). Therefore, this method is often called the exhaustive method. From the practical point of view, the method consists in applying an algorithm computing all reducts (see Algorithm 2.1) for each object individually, which results in obtaining decision rules with a minimal number of descriptors in relation to individual objects (see, e.g., [196, 198, 199]). The second method for basic decision rule synthesis is the covering algorithm called LEM2 (see, e.g., [216, 222, 223]). In LEM2, a separate-and-conquer technique is paired with rough set notions such as upper and lower approximations. This method tends to produce fewer rules than algorithms based on exhaustive local reduct calculation (as in the previous method) and seems to be faster. On the downside, the LEM2 method sometimes returns too few valuable and meaningful rules (see also Section 2.10).
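The covering idea behind the second method can be illustrated by a deliberately simplified sequential-covering sketch in Python (this is not the LEM2 algorithm itself, only a toy illustration of the covering strategy: grow a conjunction of descriptors until it is consistent for the chosen decision class, then remove the covered objects and repeat).

# A toy sequential-covering sketch (NOT the actual LEM2): for a chosen decision
# class, greedily grow a conjunction of descriptors until only objects of that
# class are matched, then remove the covered objects and repeat.
from typing import Dict, List

TABLE: List[Dict[str, str]] = [                      # toy decision table, values invented
    {"speed": "high", "lane": "left",  "d": "unsafe"},
    {"speed": "high", "lane": "right", "d": "safe"},
    {"speed": "low",  "lane": "left",  "d": "safe"},
    {"speed": "low",  "lane": "right", "d": "safe"},
]
ATTRS = ["speed", "lane"]

def covers(obj, descriptors):
    return all(obj[a] == v for a, v in descriptors.items())

def cover_class(table, dec_val: str) -> List[Dict[str, str]]:
    remaining = [o for o in table if o["d"] == dec_val]
    rules = []
    while remaining:
        descriptors: Dict[str, str] = {}
        # grow the conjunction until it matches no object of another class
        while (any(covers(o, descriptors) and o["d"] != dec_val for o in table)
               and len(descriptors) < len(ATTRS)):
            best = max(((a, remaining[0][a]) for a in ATTRS if a not in descriptors),
                       key=lambda av: sum(covers(o, {**descriptors, av[0]: av[1]})
                                          and o["d"] == dec_val for o in table))
            descriptors[best[0]] = best[1]
        rules.append(descriptors)
        remaining = [o for o in remaining if not covers(o, descriptors)]
    return rules

print(cover_class(TABLE, "unsafe"))   # e.g. [{'speed': 'high', 'lane': 'left'}]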

2.5 Operations on Rule Sets

In general, the methods used by RSES to generate rules may produce quite a large number of them. Naturally, some of the rules may be marginal, erroneous or redundant. In order to provide better control over rule-based classifiers, some simple techniques for transforming rule sets should be used. The simplest way to alter a set of decision rules is by filtering it. It is possible to eliminate from the rule set those rules that have insufficient support on the training sample, or those that point at a decision class other than the desired one. More advanced operations on rule sets are shortening and generalization. Rule shortening is a method that attempts to eliminate descriptors from the premise of a rule. The resulting rule is shorter and more general (applicable to more training objects), but it may lose some of its precision, i.e., it may give wrong answers (decisions) for some of the matching training objects. We present an exemplary method of approximate rule computation (see, e.g., [196, 198, 199]) that we use in our experiments. We begin with an algorithm for the synthesis of optimal decision rules from a given decision table (see Section 2.4). Next, we compute approximate rules from the optimal decision rules already calculated. Our method is based on the notion of consistency of a decision rule (see Section 2.1). The original optimal rule is reduced to an approximate rule with a coefficient of consistency exceeding a fixed threshold. Let A = (U, A, d) be a decision table and r0 ∈ RUL(A). The approximate rule (based on rule r0) is computed using Algorithm 2.2.


Algorithm 2.2. Approximate rule synthesis (by descriptor dropping)
Input: 1. decision table A = (U, A, d)
       2. decision rule r0 ∈ RUL(A)
       3. threshold of consistency μ0 (e.g., μ0 = 0.9)
Output: the approximate rule r_app (based on rule r0)
 1  begin
 2      Calculate the coefficient of consistency μ_A(r0)
 3      if μ_A(r0) < μ0 then
 4          STOP   // In this case no approximate rule is generated
 5      end
 6      μ_max = μ_A(r0) and r_app = r0
 7      while μ_max > μ0 do
 8          μ_max = 0
 9          for i = 1 to the number of descriptors from Pred(r_app) do
10              r = r_app
11              Remove the i-th descriptor from Pred(r)
12              Calculate the coefficient of consistency μ_A(r) and set μ = μ_A(r)
13              if μ > μ_max then
14                  μ_max = μ and i_max = i
15              end
16          end
17          if μ_max > μ0 then
18              Remove the i_max-th conditional descriptor from r_app
19          end
20      end
21      return r_app
22  end
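For readers who prefer an executable form, here is a rough Python transcription of the descriptor-dropping idea (a sketch only, reusing the toy rule representation from the earlier example; it is not the RSES implementation, it differs from Algorithm 2.2 in minor details, and ties are broken arbitrarily).

# A sketch of approximate rule synthesis by descriptor dropping: repeatedly
# remove the descriptor whose removal keeps the coefficient of consistency
# highest, as long as that coefficient stays above the threshold mu_0.
from typing import Dict, List

def consistency(table: List[Dict[str, str]], descriptors: Dict[str, str],
                dec_attr: str, dec_val: str) -> float:
    matched = [o for o in table if all(o[a] == v for a, v in descriptors.items())]
    if not matched:
        return 0.0
    return sum(o[dec_attr] == dec_val for o in matched) / len(matched)

def shorten_rule(table, descriptors: Dict[str, str], dec_attr: str, dec_val: str,
                 mu_0: float = 0.9) -> Dict[str, str]:
    current = dict(descriptors)
    while len(current) > 1:
        # try dropping each descriptor and keep the best resulting rule
        candidates = [{a: v for a, v in current.items() if a != dropped}
                      for dropped in current]
        best = max(candidates, key=lambda c: consistency(table, c, dec_attr, dec_val))
        if consistency(table, best, dec_attr, dec_val) > mu_0:
            current = best
        else:
            break
    return current

# Usage with a hypothetical table and rule (values invented):
TABLE = [
    {"speed": "high", "lane": "left",  "dist": "near", "d": "unsafe"},
    {"speed": "high", "lane": "left",  "dist": "far",  "d": "unsafe"},
    {"speed": "high", "lane": "right", "dist": "far",  "d": "safe"},
]
print(shorten_rule(TABLE, {"speed": "high", "lane": "left", "dist": "near"}, "d", "unsafe"))
# a shorter rule whose coefficient of consistency still exceeds mu_0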

It is easy to see that the time and space complexity of Algorithm 2.2 are of order O(l² · m · n) and O(C), respectively (where l is the number of conditional descriptors in the original optimal decision rule r0 and C is a constant). The approximate rules, generated by the above method, can help to extract interesting laws from the decision table. By applying approximate rules instead of optimal rules one can slightly decrease the quality of classification of objects from the training set but we expect, in return, to receive more general rules with a higher quality of classification of new objects (see [196]). On the other hand, generalization of rules is a process which consists in replacement of the descriptors having a single attribute value in rule predecessors with more general descriptors. In the RSES system there is an algorithm available which instead of simple descriptors of type a(x) = v, where a ∈ A, v ∈ Va and x ∈ U tries to use the so-called generalized descriptors of the form a(x) ∈ V where V ⊂ Va (see, e.g., [14]). In addition, such a replacement is performed


only when the coefficient of consistency of the new rule is not smaller than the established threshold. Let us notice that such an operation is crucial for enlarging the extension of decision rules, since generalized decision rules are able to classify a greater number of tested objects. It is worth mentioning that the application of the rule generalization method described above only makes sense for tables with attributes having a small number of values. Such attributes are usually attributes with symbolic values. On the other hand, using this method for tables with numerical attributes requires a prior discretization of the values of these attributes.
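A generalized descriptor can be illustrated with a short Python sketch (illustrative only, not the RSES algorithm; the widening strategy, which simply merges in the values observed for matching objects of the target class, is an assumption made here for brevity).

# A sketch of rule generalization: replace a simple descriptor a(x) = v by a
# generalized descriptor a(x) in V, accepting the wider rule only if its
# coefficient of consistency stays above a threshold.
from typing import Dict, List, Set, Union

Descriptor = Union[str, Set[str]]                     # a single value or a set of values

def matches(obj: Dict[str, str], pred: Dict[str, Descriptor]) -> bool:
    return all(obj[a] in v if isinstance(v, set) else obj[a] == v
               for a, v in pred.items())

def consistency(table, pred, dec_attr, dec_val) -> float:
    matched = [o for o in table if matches(o, pred)]
    return (sum(o[dec_attr] == dec_val for o in matched) / len(matched)) if matched else 0.0

def generalize(table, pred: Dict[str, Descriptor], attr: str,
               dec_attr: str, dec_val: str, mu_0: float = 0.9) -> Dict[str, Descriptor]:
    """Try widening the descriptor on `attr` to the set of values seen for the target class."""
    widened = dict(pred)
    widened[attr] = {o[attr] for o in table if o[dec_attr] == dec_val}
    return widened if consistency(table, widened, dec_attr, dec_val) >= mu_0 else pred

TABLE: List[Dict[str, str]] = [                       # toy decision table, values invented
    {"weather": "rain",  "speed": "high", "d": "unsafe"},
    {"weather": "snow",  "speed": "high", "d": "unsafe"},
    {"weather": "sunny", "speed": "high", "d": "safe"},
]
rule_pred: Dict[str, Descriptor] = {"weather": "rain", "speed": "high"}
print(generalize(TABLE, rule_pred, "weather", "d", "unsafe"))
# {'weather': {'rain', 'snow'}, 'speed': 'high'} -- the generalized rule keeps consistency 1.0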

2.6 Negotiations Among Rules

Suppose we have a set of decision rules. When we attempt to classify an object from the test sample with the use of a generated rule set, it may happen that various rules suggest different decision values. In such conflict situations, we need a strategy to resolve the controversy and reach a final result (decision). This problem has been intensively studied (see, e.g., [198, 199]). In its current version, RSES provides a conflict resolution strategy based on voting among rules. In this method, each rule that matches the object under consideration casts a vote in favor of the decision value it points at. Votes are summed up and the decision that has the majority of votes is chosen. This simple method may be extended by assigning weights to rules. Each rule then votes with its weight, and the decision that has the highest total of weighted votes is the final one. In RSES, this method is known as standard voting and is based on the basic strength (weight) of decision rules (see Section 2.8). Of course, there are many other methods that can be used to resolve conflicts between decision rules (see, e.g., [196, 198, 199, 216, 217, 241]).
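A minimal sketch of weighted voting among matching rules is shown below (illustrative Python; taking a rule's support as its weight is an assumption made here for simplicity and is not necessarily the exact weighting used by RSES).

# A sketch of conflict resolution by weighted voting among the rules that
# match a tested object; each rule votes for its decision with its weight.
from collections import defaultdict
from typing import Dict, List, Tuple

# Each rule: (predecessor descriptors, decision value, weight). Here the weight
# is assumed to be the rule's support on the training sample (an assumption).
RULES: List[Tuple[Dict[str, str], str, float]] = [
    ({"speed": "high", "lane": "left"}, "unsafe", 12.0),
    ({"speed": "high"},                 "safe",    5.0),
    ({"lane": "left"},                  "unsafe",  7.0),
]

def classify(obj: Dict[str, str]) -> str:
    votes: Dict[str, float] = defaultdict(float)
    for descriptors, decision, weight in RULES:
        if all(obj.get(a) == v for a, v in descriptors.items()):
            votes[decision] += weight
    if not votes:
        return "unknown"                       # no rule matches the object
    return max(votes, key=votes.get)

print(classify({"speed": "high", "lane": "left"}))   # 'unsafe' (19.0 vs 5.0 weighted votes)
print(classify({"speed": "low",  "lane": "right"}))  # 'unknown' -- no matching rule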

2.7 Decomposition Trees

In the case of larger decision tables, the computation of decision rules can be extremely difficult or even impossible. This problem arises from the relatively high computational complexity of rule computing algorithms. Unfortunately, it frequently concerns covering algorithms such as LEM2 as well (see Section 2.4). One of the solutions to this problem is the so-called decomposition. Decomposition consists in partitioning the input data table into parts (subtables) in such a way that decision rules can be calculated for these parts using standard methods. Naturally, a method is also necessary which aggregates the obtained rule sets in order to build a general classifier. In this paper, we present a decomposition method based on a decomposition tree (see [165, 226, 242]), which may be constructed according to Algorithm 2.3. This algorithm creates the decomposition tree in steps, where each step leads to the construction of the next level of the tree. At a given step of the algorithm execution, a binary partition of the decision table takes place using the best template (see Section 2.1) found for the table being partitioned. In this way, with each tree node (leaf) there is associated a template partitioning the subtable in this node into objects matching and not matching the template. This


Algorithm 2.3. Decomposition tree synthesis
Input: decision table A = (U, A, d)
Output: the decomposition tree for the decision table A
1 begin
2   Find the best template T in A (see Section 2.1)
3   Divide A in two subtables: A1 containing all objects satisfying T and A2 = A − A1
4   if obtained subtables are of acceptable size in the sense of rough set methods then
5     STOP // The decomposition is finished
6   end
7   repeat lines 2-7 for all "too large" subtables
8 end

This template and its contradiction are transferred as templates describing subtables to the next step of decomposition. Decomposition finishes when the obtained subtables are so small that decision rules can be calculated for them using standard methods. After determining the decomposition tree, decision rule sets are calculated for all the leaves of this tree, more precisely, for the subtables occurring in the single leaves. The tree and the rules calculated for the training sample can be used in classification of unseen cases. Suppose we have a binary decomposition tree. Let u be a new object, A(T) be the subtable containing all objects matching a template T, and A(¬T) be the subtable containing all objects not matching the template T. We classify object u starting from the root of the tree using Algorithm 2.4.

Algorithm 2.4. Classification by decomposition tree
1  begin
2    if u matches template T found for A then
3      go to subtree related to A(T)
4    else
5      go to subtree related to A(¬T)
6    end
7    if u is at the leaf of the tree then
8      go to line 12
9    else
10     repeat lines 2-11 substituting A(T) (or A(¬T)) for A
11   end
12   Classify u using decision rules for subtable attached to the leaf
13 end


This algorithm works in such a way that first a leaf of the decomposition tree is sought such that the tested object matches the template describing the objects of that leaf. Next, the object is classified with the help of the decision rules calculated for the leaf that was found.

The type of the decomposition method depends on the method of determining the best template. For instance, if decomposition is needed only because it is impossible to compute rules for a given decision table, then the best template for this table is the template which divides the table into two equal parts. If, however, we are concerned with the table partition that is most compatible with the partition introduced by decision classes, then the measure of template quality may be, for example, the number of pairs of objects from different decision classes differentiated with the help of the partition introduced by a given template. Clearly, the best template in this case is the template with the largest number of differentiated pairs. The templates determined may have different forms (see, e.g., [165] for more details). In the simplest case, for a symbolic attribute, the best template might be of the form a(x) = v or a(x) ≠ v, where a ∈ A, v ∈ Va, and x ∈ U, whereas for a numerical attribute, the templates might be a(x) > v, a(x) < v, a(x) ≤ v, or a(x) ≥ v, where a ∈ A, v ∈ Va, and x ∈ U. The classifier presented in this section uses a binary decision tree; however, it should not be confused with C4.5 or ID3 (see, e.g., [210, 243]) because, as we said before, rough set methods are used in the leaves of the decomposition tree in the construction of the classifying algorithm. A sketch of the whole procedure is given below.
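
The following Python sketch summarizes Algorithms 2.3 and 2.4 under simplifying assumptions: a template is any Boolean predicate over objects, while find_best_template and induce_rule_classifier stand for the methods of Sections 2.1 and 2.3-2.5 and are assumed to be supplied externally.

class Node:
    def __init__(self, template=None, yes=None, no=None, classifier=None):
        self.template, self.yes, self.no = template, yes, no
        self.classifier = classifier  # set only in leaves

def build_tree(table, max_size, find_best_template, induce_rule_classifier):
    # Algorithm 2.3: split by the best template until subtables are small
    # enough for standard rule computation.
    if len(table) <= max_size:
        return Node(classifier=induce_rule_classifier(table))
    t = find_best_template(table)
    a1 = [u for u in table if t(u)]      # objects matching the template
    a2 = [u for u in table if not t(u)]  # objects matching its contradiction
    if not a1 or not a2:                 # degenerate split: stop here
        return Node(classifier=induce_rule_classifier(table))
    return Node(template=t,
                yes=build_tree(a1, max_size, find_best_template, induce_rule_classifier),
                no=build_tree(a2, max_size, find_best_template, induce_rule_classifier))

def classify(node, u):
    # Algorithm 2.4: descend to the leaf whose template path matches u,
    # then use the rule classifier attached to that leaf.
    while node.classifier is None:
        node = node.yes if node.template(u) else node.no
    return node.classifier(u)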

2.8 Concept Approximation and Classifiers

Definability of concepts is a term well known in classical logic (see, e.g., [5, 244, 245]). In this classical approach a definable concept (set) is a relation on the domain of a given structure whose elements are precisely those elements satisfying some formula in the structure. The semantics of such a formula makes it possible to determine precisely, for a given element (object), whether it belongs to the concept or not. However, the issue of definability of concepts is complicated by the pervasive presence of vagueness and ambiguity in natural language (see [126, 127, 244]). Therefore, in numerous applications, the concepts of interest may only be defined approximately on the basis of available, incomplete, imprecise or noisy information about them, represented, e.g., by positive and negative examples (see [6, 7, 8, 9, 10, 11, 12, 13]). Such concepts are often called vague (imprecise) concepts. We say that a concept is vague when there may be cases (elements, objects) for which there is no clear fact of the matter whether the concept applies or not. Hence, the classical approach to concept definability known from classical logic cannot be applied to vague concepts. Instead, an approximation of a vague concept consists in the construction of an algorithm (called a classifier) for this concept, which may be treated as a constructive, approximate description of the concept. This description makes it possible to classify tested objects, that is, to determine for a given object whether, and to what degree, it belongs to the approximated concept. There is a long debate in philosophy on vague concepts (see, e.g., [126, 127, 128]) and recently computer scientists (see, e.g., [79, 82, 83, 246, 247, 248, 249]) as well
as other researchers have become interested in vague concepts. Since the classical approach to concept definability known from classical logic cannot be applied to vague concepts, new methods of definability have been proposed. Professor Lotfi Zadeh (see [250]) introduced a very successful approach to definability of vague concepts. In this approach, sets are defined by partial membership, in contrast to the crisp membership used in the classical definition of a set. Rough set theory proposed a method of concept definability employing the lower and upper approximations and the boundary region of the concept (see Section 2.1). If the boundary region of a set is empty, the set is crisp; otherwise the set is rough (inexact). A non-empty boundary region of the set means that our knowledge about the set is not sufficient to define the set precisely. Using the lower and upper approximations and the boundary region of a given concept, a classifier can be constructed. Assume there is given a decision table A = (U, A, d) whose binary decision attribute, with values 1 and 0, partitions the set of objects into two disjoint sets C and C′. The set C contains the objects with decision attribute value equal to 1, and the set C′ contains the objects with decision attribute value equal to 0. The sets C and C′ may also be interpreted in such a way that C is a certain concept to be approximated and C′ is the complement of this concept (C′ = U \ C). If we define, for the concept C and its complement C′, their A-lower approximations, denoted AC and AC′, the A-upper approximation, denoted ĀC, and the A-boundary BN_A(C) (where BN_A(C) = ĀC \ AC), we obtain a simple classifier which operates in such a way that a given tested object u is classified to the concept C if it belongs to the lower approximation AC. Otherwise, if the object u belongs to the lower approximation AC′, it is classified to the complement of the concept C. However, if the object belongs neither to AC nor to AC′, but belongs to BN_A(C), then the classifier cannot make an unambiguous decision about the membership of the object, and it has to respond that the tested object simultaneously belongs to the concept C and its complement C′, which means it is a boundary object. In this case the membership degree of a tested object u ∈ U in the concept C ⊆ U is expressed numerically with the help of a rough membership function (see, e.g., [16, 17]). The rough membership function μ_C quantifies the degree of relative overlap between the concept C and the equivalence class to which u belongs. It is defined as follows:

μ_C : U → [0, 1],   μ_C(u) = card([u]_IND(A) ∩ C) / card([u]_IND(A)).
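
A direct Python rendering of this definition may look as follows; the representation (a decision table as a list of attribute-value dictionaries and the concept C as a set of object indices) is assumed for illustration only.

def rough_membership(i, C, table, attrs):
    # mu_C(u) = card([u]_IND(A) ∩ C) / card([u]_IND(A)) for u = table[i].
    signature = tuple(table[i][a] for a in attrs)
    eq_class = {j for j, obj in enumerate(table)
                if tuple(obj[a] for a in attrs) == signature}
    return len(eq_class & C) / len(eq_class)

A value of 1 corresponds to the lower approximation of C, 0 to the lower approximation of its complement, and intermediate values to the boundary region.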

As we can see, for the classifier described above to work it is necessary that the tested object belong to one of the equivalence classes of the relation IND(A). Hence, there is one more case remaining, namely when the tested object does not belong to any equivalence class of the relation IND(A). In such a case, the classifier under consideration cannot make any decision about the membership of the tested object and has to answer: "I do not know". Unfortunately, the case when the tested object does not belong to any equivalence class of the relation IND(A) occurs frequently in practical applications. This is due to the fact that if the objects under testing do not belong to the decision
table that was known at the beginning, but to its extension, the chances are small that in the given decision table there exists an object (called a training object) whose conditional attribute values are identical to those of the testing object. However, it follows from the definition of the relation IND(A) that a testing object for which there is no such training object cannot be classified by the classifier described above. In such a case, one can say that the extension of this classifier is very small. For the above reason, the classic approach to classifying objects in rough set theory (described above) requires generalization. It is worth noticing that in machine learning and pattern recognition (see, e.g., [6, 8, 9, 10, 11, 12, 13]) this issue is known under the term learning concepts by examples (see, e.g., [10]). The main problem of learning concepts by examples is that the description of a concept under examination needs to be created on the basis of known examples of that concept. By creating a concept description we understand detecting such properties of exemplary objects belonging to this concept that enable further examination of examples in terms of their membership in the concept under examination. A natural way to solve the problem of learning concepts by examples is inductive reasoning (see, e.g., [251, 252]). In inductive reasoning we accept as true a sentence stating a general regularity on the basis of accepting sentences stating individual instances of this regularity (see, e.g., [251, 252]). This is the kind of reasoning by which decisions in the real world are often made, relying on incomplete or even flawed information; it takes place when answering questions connected with forecasting, checking hypotheses or making decisions. In the case of the problem of learning concepts by examples, the usage of inductive reasoning means that, while obtaining further examples of objects belonging to the concept (so-called positive examples) and examples of objects not belonging to the concept (so-called negative examples), an attempt is made to find a description that correctly matches all or almost all examples of the concept under examination. From the theoretical point of view, in rough set theory the classic approach to concept approximation was generalized by Professor Skowron and Professor Stepaniuk (see [253]). This approach is consistent with the philosophical view (see, e.g., [126, 127]) and the logical view (see, e.g., [128]). The main element of this generalization is an approximation space. An approximation space (see, e.g., [246, 253, 254, 255]) is a tuple AS = (U, I, ν), where
– U is a non-empty set of objects,
– I : U → P(U) is an uncertainty function, where P(U) denotes the powerset of U,
– ν : P(U) × P(U) → [0, 1] is a rough inclusion function.
The uncertainty function I defines for every object u ∈ U a set of objects indistinguishable from u or similar to u. The set I(u) is called the neighborhood of u. If U is the set of objects of a certain decision table A = (U, A, d), then in the simplest case the set I(u) may be the equivalence class [u]_IND(A). In a general case, however, the set I(u) is usually defined with the help of a special language such as GDL or NL (see Section 4.7).


The rough inclusion function ν defines the degree of inclusion of X in Y, where X, Y ⊆ U. In the simplest case, rough inclusion can be defined by:

ν(X, Y) = card(X ∩ Y) / card(X)  if X ≠ ∅,   and   ν(X, Y) = 1  if X = ∅.

This measure is widely used by the data mining and rough set communities (see, e.g., [16, 17, 246, 253]). However, rough inclusion can have a much more general form than inclusion of sets to a degree (see [192, 247, 249]). It is worth noticing that in the literature (see, e.g., [247]) a parameterized approximation space is considered instead of the approximation space. A parameterized approximation space consists of a family of approximation spaces creating the search space for data models. Any approximation space in this family is distinguished by some parameters. Searching strategies for optimal (sub-optimal) parameters are basic rough set tools in searching for data models and knowledge. There are two main types of parameters: the first are used to define object sets (neighborhoods), while the second measure the inclusion or closeness of neighborhoods. For an approximation space AS = (U, I, ν) and any subset X ⊆ U, the lower and the upper approximations are defined by:
– LOW(AS, X) = {u ∈ U : ν(I(u), X) = 1},
– UPP(AS, X) = {u ∈ U : ν(I(u), X) > 0},
respectively. The lower approximation of a set X with respect to the approximation space AS is the set of all objects which can be classified with certainty as objects of X with respect to AS. The upper approximation of a set X with respect to AS is the set of all objects which can possibly be classified as objects of X with respect to AS. Several known approaches to concept approximation can be covered using the approximation spaces discussed here, e.g., the approach given in [16, 17], approximations based on the variable precision rough set model (see, e.g., [256]), or tolerance (similarity) rough set approximations (see, e.g., [253]). Similarly to the classic approach, the lower and upper approximations in the approximation space AS for a given concept C may be used to classify objects to this concept. In order to do this, one may examine the membership of the tested objects in LOW(AS, C), LOW(AS, C′) and UPP(AS, C) \ LOW(AS, C). However, in machine learning and pattern recognition (see, e.g., [6, 8, 9, 10, 11, 12, 13]), we often search for an approximation of a concept C ⊆ U∗ in an approximation space AS∗ = (U∗, I∗, ν∗) having only partial information about AS∗ and C, i.e., information restricted to a sample U ⊆ U∗. Let us denote the restriction of AS∗ to U by AS = (U, I, ν), i.e., I(x) = I∗(x) ∩ U and ν(X, Y) = ν∗(X, Y) for x ∈ U and X, Y ⊆ U (see Fig. 3). To decide if a given object u ∈ U∗ belongs to the lower or to the upper approximation of C ⊆ U∗, it is necessary to know the value ν∗(I∗(u), C). However, when only partial information about the approximation space AS∗ is available, one must estimate the value ν∗(I∗(u), C) rather than compute it exactly.
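
The basic operations of an approximation space are easy to prototype. In the sketch below, the uncertainty function I is assumed to be supplied as a Python function from objects to neighborhoods (sets of objects), and ν is the standard rough inclusion defined above.

def nu(X, Y):
    # Standard rough inclusion: the degree of inclusion of X in Y.
    return 1.0 if not X else len(X & Y) / len(X)

def lower_approx(U, I, X):
    return {u for u in U if nu(I(u), X) == 1.0}

def upper_approx(U, I, X):
    return {u for u in U if nu(I(u), X) > 0.0}

With I(u) equal to the indiscernibility class of u these functions reduce to the classic approximations; tolerance or similarity neighborhoods can be plugged in without changing the code.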


Fig. 3. An approximation space AS = (U, I, ν) and its extension AS∗ = (U∗, I∗, ν∗): a tested object u ∈ U∗ with its neighborhoods I(u) ⊆ U and I∗(u) ⊆ U∗

In machine learning, pattern recognition and data mining, different heuristics are used for estimation of the values of ν∗. Using different heuristic strategies, values of another function ν′ are computed and used for the estimation of the values of ν∗. Then, the function ν′ is used to decide if objects belong to C or not. Hence, we define an approximation of C in the approximation space AS′ = (U∗, I∗, ν′) rather than in AS∗ = (U∗, I∗, ν∗). Usually, it is required that the approximations of C ∩ U in AS′ and AS be close (or the same). The approach presented above (see, e.g., [83, 246, 248, 249]) became an inspiration for developing a number of methods which enlarge the extension of constructed classifiers, that is, make the constructed classifiers able to classify arbitrary objects, and not only those belonging to a given decision table. Some other issues concerning the rough set approach to vague concept approximation are discussed, e.g., in [83, 128, 248, 249]. Among these issues are higher order vagueness (i.e., non-definability of boundary regions), adaptive learning of concept approximation, concept drift, and sorites paradoxes. One of the basic ways of increasing the extension of classifiers is to approximate concepts not with the help of the equivalence classes of the relation IND (see above) but with the help of patterns of an established language, which different objects may match, both from the training table and from its extension. A given object matches a pattern if it is compatible with the description of this pattern. Usually, the pattern is constructed in such a way that all or almost
all its matching objects belong to the concept under study (the decision class). Moreover, it is required that objects from many equivalence classes of the relation IND can match the patterns. Thus, the extension of classifiers based on patterns is dramatically greater than the extension of classifiers working on the basis of equivalence classes of the relation IND. These types of patterns are often called decision rules (see Section 2.3). In the literature one may encounter many methods of computing decision rules from data and methods enabling preprocessing of the data in order to construct effective classifiers. Among methods of this type one may include, for example, discretization of attribute values (see Section 2.2), methods of computing decision rules (see Section 2.3), and shortening and generalization of decision rules (see Section 2.5). The determined decision rules may be applied to classifier construction. For instance, let us examine the situation when a classifier is created on the basis of decision rules from the set RUL(A) computed for a given decision table A = (U, A, d), where the decision attribute d describes the membership in a certain concept C and its complement C′ (for simplicity of reasoning we consider only binary classifiers, i.e., classifiers with two decision classes; one can easily extend the approach to classifiers with more decision classes). The set of rules RUL(A) is the sum of two subsets RUL(A, C) and RUL(A, C′), where RUL(A, C) is the set of rules classifying objects to C and RUL(A, C′) is the set of rules classifying objects to C′. For any tested object u, by MRul(A, C, u) ⊆ RUL(A, C) and MRul(A, C′, u) ⊆ RUL(A, C′) we denote the sets of those rules whose predecessors match the object u and which classify objects to C and to C′, respectively. Let AS = (U, I, ν) be an approximation space, where:

1. for every u ∈ U:  I(u) = ⋃_{r ∈ MRul(A,C,u)} Supp_A(r)  ∪  ⋃_{r ∈ MRul(A,C′,u)} Supp_A(r),
2. for all X, Y ⊆ U:  ν(X, Y) = card(X ∩ Y) / card(X)  if X ≠ ∅,  and  ν(X, Y) = 1  if X = ∅,

where Supp_A(r) denotes the set of objects from U supporting the rule r. The above approximation space AS may be extended in a natural way to the approximation space AS′ = (U∗, I∗, ν′), where:

1. I∗ : U∗ → P(U∗) is such that, for every u ∈ U∗, the neighborhood I∗(u) is determined by the same formula as I(u) above,
2. for all X, Y ⊆ U∗:  ν′(X, Y) = card(X ∩ Y) / card(X)  if X ≠ ∅,  and  ν′(X, Y) = 1  if X = ∅.

Let us notice that such a simple generalization of the functions I to I∗ and ν to ν′ is possible because the formula defining I may determine the neighborhood for any object belonging to U∗. This results from the fact that decision rules from the set RUL(A) may recognize objects not only from the set U but also from the set U∗ \ U. The approximation space AS′ may now also be used to construct a classifier which classifies objects from the set U∗ to the concept C or its complement C′. In creating such a classifier the key problem is to resolve the conflict between the rules
classifying the tested object to the concept and those classifying it to its complement. Let us notice that this conflict occurs because in practice we do not know the function ν∗ but only its approximation ν′. That is why there may exist a tested object u_t such that the values ν′({u_t}, C) and ν′({u_t}, C′) are both high (that is, close to 1), while the values ν∗({u_t}, C) and ν∗({u_t}, C′) are very different (e.g., ν∗({u_t}, C) is close to 1 and ν∗({u_t}, C′) is close to 0). Below, we present the definition of such a classifier in the form of a function that returns the value YES when the tested object belongs to C and the value NO when the tested object belongs to C′:

for every u ∈ U∗:  Classifier(u) = YES if ν′({u}, C) > 0.5, and NO otherwise.   (5)

Obviously, other rough inclusion functions may be defined (see, e.g., [192, 247, 249]); thus, we obtain different classifiers. Unfortunately, a classifier defined with the help of Equation (5) is impractical because the function ν′ used in it does not introduce additional parameters for tuning how objects are recognized as belonging to the concept or its complement, whereas in practical applications, in constructing classifiers based on decision rules, functions are applied which give the strength (weight) of the classification of a given tested object to the concept C or its complement C′ (see, e.g., [196, 199, 216, 217, 241]). Below, we present a few instances of such weights (see [199]), where Supp_A(r) again denotes the set of training objects supporting the rule r:

1. The simple strength of a decision rule set is defined by
   SimpleStrength(C, u_t) = card(MRul(A, C, u_t)) / card(RUL(A, C)).

2. The maximal strength of a decision rule set is defined by
   MaximalStrength(C, u_t) = max_{r ∈ MRul(A,C,u_t)} card(Supp_A(r)) / card(C).

3. The basic strength (or standard strength) of a decision rule set is defined by
   BasicStrength(C, u_t) = Σ_{r ∈ MRul(A,C,u_t)} card(Supp_A(r)) / Σ_{r ∈ RUL(A,C)} card(Supp_A(r)).

4. The global strength of a decision rule set is defined by
   GlobalStrength(C, u_t) = card( ⋃_{r ∈ MRul(A,C,u_t)} Supp_A(r) ) / card(C).
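
Under the same illustrative assumptions as before, the four weights can be sketched as follows, with MRul_C standing for the set of rules from RUL(A, C) matching the tested object and supp(r) returning the set of training objects supporting a rule r.

def simple_strength(MRul_C, RUL_C):
    return len(MRul_C) / len(RUL_C)

def maximal_strength(MRul_C, C, supp):
    return max((len(supp(r)) / len(C) for r in MRul_C), default=0.0)

def basic_strength(MRul_C, RUL_C, supp):
    return (sum(len(supp(r)) for r in MRul_C)
            / sum(len(supp(r)) for r in RUL_C))

def global_strength(MRul_C, C, supp):
    # Number of distinct training objects covered by the matching rules.
    covered = set().union(*(supp(r) for r in MRul_C)) if MRul_C else set()
    return len(covered) / len(C)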

Using each of the above rule weights, a corresponding rough inclusion function may be defined. Let us denote any established weight of rule sets by S. For the weight S we define an exemplary rough inclusion function ν_S in the following way:


for all X, Y ⊆ U:

ν_S(X, Y) =
  0                                      if Y = ∅ and X ≠ ∅,
  1                                      if X = ∅,
  S(Y, u) / (S(Y, u) + S(U \ Y, u))      if X = {u} and S(Y, u) + S(U \ Y, u) ≠ 0,
  1/2                                    if X = {u} and S(Y, u) + S(U \ Y, u) = 0,
  ( Σ_{u ∈ X} ν_S({u}, Y) ) / card(X)    if card(X) > 1,

where for an established set Y and object u the weights S(Y, u) and S(U \ Y, u) are computed using the decision rule sets generated for the table A = (U, A, d_Y), in which the attribute d_Y describes the membership of objects from U in the set Y. The rough inclusion function defined above may be used to construct a classifier in the same way as in Equation (5). Such a classifier executes a simple negotiation method between the rules classifying the tested object to the concept and the rules classifying it to the complement of the concept (see Section 2.6): it classifies a tested object u to the concept C only when, for the established weight of rule sets S, the value ν_S({u}, C) is bigger than ν_S({u}, C′); otherwise, the object u is classified to the complement of the concept C. In this paper, the weight BasicStrength is used to resolve conflicts between rule sets in the experiments related to the construction of classifiers based on decision rules.
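
For a singleton argument, ν_S reduces to a comparison of two weights, which is all the classifier of Equation (5) needs. A minimal sketch, assuming S(Y, u) is any of the strength functions above computed for the rule sets of Y and of its complement:

def nu_S_singleton(S, Y, U, u):
    # S(Y, u): weight of the rules classifying u to Y;
    # S(U - Y, u): weight of the rules classifying u to the complement of Y.
    s_yes, s_no = S(Y, u), S(U - Y, u)
    if s_yes + s_no == 0:
        return 0.5  # no rule set recognizes the object u
    return s_yes / (s_yes + s_no)

def classifier_eq5(S, C, U, u):
    # YES iff nu_S({u}, C) > 0.5, i.e. iff S(C, u) > S(U - C, u).
    return "YES" if nu_S_singleton(S, C, U, u) > 0.5 else "NO"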

2.9 Evaluation of Classifiers

In order to evaluate the quality of a classifier in relation to the analyzed data, in the general case a given decision table is partitioned into two tables (see, e.g., [11, 257, 258]):
1. the training table, containing objects on the basis of which the algorithm learns to classify objects to decision classes,
2. the test table, by means of which the classifier learned on the training table may be evaluated by classifying all objects belonging to this table.
A common numerical measure of classifier evaluation is the number of mistakes made by the classifier during classification of objects from the test table, relative to the number of all objects under classification (the error rate; see, e.g., [11, 196, 198]). However, the most frequently used method of numerical classifier evaluation is based on a confusion matrix. The confusion matrix (see, e.g., [15, 257, 259]) contains information about actual and predicted classifications done by a classifier; performance of such systems is commonly evaluated using the data in this matrix. Table 1 shows the confusion matrix for a two-class classifier, i.e., for a classifier constructed for a concept.


Table 1. The confusion matrix

                     Predicted
                 Negative   Positive
Actual Negative     TN         FP
       Positive     FN         TP

The entries in the confusion matrix have the following meaning in the context of our study (see, e.g., [260]):
– TN (True Negatives) is the number of correct predictions that an object is a negative example of a concept of the test table,
– FP (False Positives) is the number of incorrect predictions that an object is a positive example of a concept of the test table,
– FN (False Negatives) is the number of incorrect predictions that an object is a negative example of a concept of the test table,
– TP (True Positives) is the number of correct predictions that an object is a positive example of a concept of the test table.

Several standard terms (parameters) have been defined for the two-class confusion matrix:
– the accuracy (ACC), defined for a given classifier by ACC = (TN + TP) / (TN + FN + FP + TP),
– the accuracy for positive examples, also called the sensitivity (see, e.g., [260]) or the true positive rate (TPR) (see, e.g., [257]), defined by TPR = TP / (TP + FN),
– the accuracy for negative examples, also called the specificity (see, e.g., [260]) or the true negative rate (TNR) (see, e.g., [257]), defined by TNR = TN / (TN + FP).

An essential parameter is also the number of classified objects from the test table in comparison to the number of all objects from this table, since classifiers may not always be able to classify all objects. This parameter, called the coverage (see, e.g., [11, 15]), may be treated as a measure of the extension of the classifier. Thus, in order to evaluate classifiers, the following numerical parameters are also applied in this paper:
1. the coverage (COV), defined for a given classifier by COV = (TN + FP + FN + TP) / (the number of all objects of the test table),
2. the coverage for positive examples (PCOV), defined for a given classifier by PCOV = (FN + TP) / (the number of all positive examples of a concept of the table),
3. the coverage for negative examples (NCOV), defined for a given classifier by NCOV = (TN + FP) / (the number of all negative examples of a concept of the table),

4. the real accuracy, defined for a given classifier by ACC · COV,
5. the real accuracy for positive examples, or the real true positive rate, defined for a given classifier by TPR · PCOV,
6. the real accuracy for negative examples, or the real true negative rate, defined for a given classifier by TNR · NCOV.

Besides that, still other parameters are applied in order to evaluate classifiers, for instance, the time of construction of a classifier on the basis of a training table, or the degree of complexity of the constructed classifier (e.g., the number of generated decision rules).

In summary, the main parameters applied to the evaluation of classifiers in this paper are: the accuracy, the coverage, the real accuracy, the accuracy for positive examples, the coverage for positive examples, the real accuracy for positive examples, the accuracy for negative examples, the coverage for negative examples, and the real accuracy for negative examples. They are used in experiments with AR schemes (see Section 5.8) and in experiments related to detecting behavioral patterns (see Section 6.25 and Section 6.26). However, in experiments with automated planning another method of classifier quality evaluation was applied (see Section 7.21). This results from the fact that in that case the value of a complex decision, namely a plan, is generated automatically, and a plan is a sequence of actions alternated with states. Hence, the parameters mentioned above cannot be used to compare complex decision values of this type. Therefore, to compare the plans generated automatically with the plans available in the data set, we use a special classifier based on a concept ontology which measures the similarity between any pair of plans (see Section 7.18).

It is worth noticing that in the literature one may find another frequently applied method of measuring the quality of created classifiers, based on the ROC curve (Receiver Operating Characteristic curve) (see, e.g., [260, 261, 262]). This method is available, for instance, in the ROSETTA system (see, e.g., [259, 263, 264]). It is also worth mentioning that the author of this paper participated in the construction of the programming library RSES-lib, which forms the computational kernel of the ROSETTA system (see [230, 259] for more details).

In order not to make the value of the determined classifier evaluation parameter dependent on one specific partition of the whole decision table into training and test parts, a number of methods are applied which perform tests to determine which values of the classifier evaluation parameters are credible.


The methods of this type applied most often are train-and-test and cross-validation (see, e.g., [11, 258, 265]). The train-and-test method is usually applied to decision tables having at least 1000 objects (see, e.g., [11]). It consists in a random isolation of two subtables from the whole available data, treating one of them as a training subtable and the other as a test subtable. The training and test subtables are usually disjoint (although not always) and together make up the available decision table. It is crucial, however, that at least some part of the objects from the test subtable does not occur in the training subtable. The proportion between the numbers of objects in the test and training subtables depends on a given experiment, but it is usually such that the number of objects in the test part constitutes from 20 to 50 percent of the number of objects in the whole available data (see, e.g., [11]). The cross-validation method is applied to evaluate a classifier when the number of objects in the decision table is less than 1000 (see, e.g., [11]). This method consists in partitioning the data in a random way into m equal parts and then performing m experiments with them. In each of these experiments, a local coefficient of classifier evaluation is calculated for the situation in which one of the parts into which the data was divided is the set of tested objects, and the remaining m − 1 parts (temporarily combined) are treated as the set of training objects. Finally, the coefficient of classifier evaluation is calculated as the arithmetic mean over all experiments. The number m is determined depending on the specific data and should be selected in such a way that the test parts do not have too few objects; in practice, m is an integer ranging from 5 to 15 (see, e.g., [11]). In this paper, all decision tables used in experiments have more than 1000 objects. That is why, in order to determine the classifier quality parameters, the train-and-test method is always applied. Moreover, each experiment is repeated 10 times for ten random partitions into two separate tables (training and test). Hence, the result of each experiment is the arithmetic mean obtained from the results of its repetitions; additionally, the standard deviation of the obtained result is given. The evaluation protocol is sketched below.
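
The measures defined in this section, together with the repeated train-and-test protocol, can be sketched as follows; run_once is an assumed, externally supplied function that draws a random training/test split, trains a classifier and returns the confusion-matrix counts.

import statistics

def measures(tn, fp, fn, tp, n_test, n_pos, n_neg):
    # n_test, n_pos, n_neg: numbers of all, positive and negative test objects.
    classified = tn + fp + fn + tp
    acc = (tn + tp) / classified if classified else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    cov, pcov, ncov = classified / n_test, (fn + tp) / n_pos, (tn + fp) / n_neg
    return {"ACC": acc, "COV": cov, "real ACC": acc * cov,
            "TPR": tpr, "PCOV": pcov, "real TPR": tpr * pcov,
            "TNR": tnr, "NCOV": ncov, "real TNR": tnr * ncov}

def repeated_train_and_test(data, run_once, repetitions=10):
    # run_once(data) returns the tuple (tn, fp, fn, tp, n_test, n_pos, n_neg)
    # for one random split; report mean and standard deviation per measure.
    runs = [measures(*run_once(data)) for _ in range(repetitions)]
    return {k: (statistics.mean(r[k] for r in runs),
                statistics.stdev(r[k] for r in runs))
            for k in runs[0]}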

2.10 Problem of Low Coverage

If a given tested object matches the predecessor of a certain basic decision rule (that is, the values of the condition attributes of this object are the same as the values of the corresponding descriptors from the rule predecessor), then this rule may be used to classify this object: the object is classified to the decision class occurring in the rule successor. In this case we also say that the tested object is recognized by the decision rule. However, if a given tested object is recognized by different decision rules which classify it to more than one decision class, then negotiation methods between rules are applied (see Section 2.6 and Section 2.8). In practice, it may happen that a given tested object does not match the predecessor of any of the available decision rules. We then say that this object is not recognized by the given classifier based on decision rules and, consequently, that it cannot be classified by this classifier. This is an unfavorable situation, for we often expect
from classifiers the ability to classify all or almost all tested objects. If there are many unclassified objects, then we say that the given classifier has too low an extension, which is expressed numerically by a low value of the coverage parameter (see Section 2.9). A number of approaches which make it possible to avoid low coverage of classifiers based on decision rules have been described in the literature, for example:
1. The application of classifiers based on the set of all rules with a minimum number of descriptors (see Section 2.4), which usually have a high extension (see, e.g., [196, 198]).
2. The application of rule classifiers constructed on the basis of covering algorithms and a mechanism of partial matching of objects to rules (see, e.g., [10, 213, 214, 216, 217, 222, 223, 266]).
3. The application of classifiers based on decision rules which underwent the process of rule generalization, owing to which the classifier extension usually increases (see Section 2.5).
4. The application of classifiers based on lazy learning, which does not require preliminary computation of decision rules, since the decision rules needed for object classification are discovered directly in a given decision table during the classification of the tested object (see, e.g., [197, 198, 267]).
All the methods mentioned above have their advantages and disadvantages. The first method has an exponential time complexity, which results from the complexity of the algorithm computing all reducts (see Section 2.4). The second method is very fast, for it is based on rules computed with the help of the covering method; however, in this method approximate rules (obtained as a result of partially matching objects to the rules) are often applied to classify objects, and therefore the quality of classification on the basis of such rules may be unsatisfactory. The third method uses the operation of rule generalization, owing to which the extension of the obtained rules increases; however, it does not lead to as high an extension as the first, second and fourth methods, and, apart from that, the operation of rule generalization is quite time consuming. The fourth method, although it does not require preliminary computation of decision rules, has a pessimistic computational time complexity of classifying each tested object of order O(n² · m), where n is the number of objects in the training table and m is the number of condition attributes; hence, for bigger decision tables this method cannot be applied effectively. One more possibility remains: to build classifiers on the basis of rules computed with the covering method without using partial matching of tested objects to the rules. Obviously, classifiers based on such rules may have a low coverage; however, they usually have a high quality of classification. This is extremely crucial in many applications (for example, medical and financial ones) where it is required that the decisions generated by classifiers be always or almost always correct. In such applications it is sometimes better for the classifier to say "I do not know" rather than make a wrong decision. That is why in this paper we use classifiers based on rules computed with the covering method (without partial matching of objects to the rules), accepting a low coverage of such
classifiers in cases when classifiers based on the set of all rules with a minimum number of descriptors cannot be applied (because the analyzed decision tables are too large). A sketch of such a cautious classifier is given below.
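
The preference for abstention over guessing can be made explicit in code. The sketch below reuses the illustrative rule representation from Section 2.6 and answers None ("I do not know") both for unrecognized objects and for unresolved ties.

def cautious_classify(rules, obj):
    votes = {}
    for descriptors, decision, weight in rules:
        if all(obj.get(a) == v for a, v in descriptors.items()):
            votes[decision] = votes.get(decision, 0.0) + weight
    if not votes:
        return None  # low coverage case: the object is not recognized
    best = max(votes.values())
    winners = [d for d, w in votes.items() if w == best]
    return winners[0] if len(winners) == 1 else None  # tie: abstain

The coverage of such a classifier is then simply the fraction of tested objects for which the answer is not None.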

3 Methods of Constructing Stratifying Classifiers

The algorithm of concept approximation presented in Section 2.8 consists in classifying the tested objects to the lower approximation of the concept, the lower approximation of the complement of the concept, or its boundary. Many methods enabling an increase of the extension of constructed classifiers have been proposed in rough set theory (see Section 2.8). Discretization of attribute values (see Section 2.2), methods of calculating and modifying decision rules (see Sections 2.3, 2.4, 2.5), and the partial matching method (see Section 2.10) are examples of such methods. As a result of applying these methods, classifiers are constructed which are able to classify almost every tested object to the concept or its complement. At first glance this state of affairs seems encouraging, for approximation methods can be extended to tested objects from beyond a given decision table, which is necessary in inductive learning (see Section 2.8). Unfortunately, such a process of generalizing concept approximation encounters difficulties in classifying new tested objects (unknown during classifier learning). Namely, after expanding the set of objects U of a given information system with new objects, the equivalence classes of these objects are often disjoint with U. This means that if such objects match the description of a given concept C constructed on the basis of the set U, the match is often incidental. Indeed, the process of generalization of decision rules may go too far (e.g., decision rules become too short) because these new objects were absent when the concept description was created, and it may happen that the properties (attributes) used to describe the concept were chosen in a wrong way. So, if a certain tested object from outside the decision table is classified, it may turn out that, in the light of the knowledge gathered in the given decision table, this object should be classified neither to the concept nor to its complement but to the concept boundary, which expresses our uncertainty about the classification of this object. Meanwhile, most of the currently constructed classifiers classify the object to the concept or its complement. Therefore, a need arises to use the knowledge from a given table to determine a coefficient of certainty that the tested object belongs to the approximated concept. In other words, we would like to determine, for the tested object, how certain it is that this object belongs to the concept, and it would be best to express this certainty coefficient by a number, e.g., from the interval [0, 1]. In the literature such a numerical coefficient is expressed using different kinds of rough membership functions (see Section 2.8). If a method of determining such a coefficient is given, it may be assumed that the coefficient values are discretized, which leads to a sequence of linearly ordered concept layers. The first layer in this sequence represents objects which without any doubt do not belong to the concept (the lower approximation of the concept complement). The next layers in the sequence represent objects belonging to the
concept more and more certainly (the boundary layers of the concept). The last layer in this sequence represents objects certainly belonging to the concept, that is, the ones belonging to the lower approximation of the concept (see Fig. 4).

Fig. 4. Layers of a given concept C: the lower approximation of C, the layers of the boundary region of C, and the lower approximation of U − C

Let us add that concept layers of this type may be defined both on the basis of the knowledge gathered in data tables and using additional domain knowledge provided by experts.

3.1 Stratifying Classifier

In order to examine the membership of tested objects in individual concept layers, classifiers are needed that can approximate all layers of a given concept at the same time. Such classifiers are called in this paper stratifying classifiers.

Definition 1 (A stratifying classifier). Let A = (U, A, d) be a decision table whose objects are positive and negative examples of a concept C (described by a binary attribute d).
1. A partition of the set U is a family {U1, ..., Uk} of non-empty subsets of the set U (where k > 1) such that the following two conditions are satisfied:
(a) U = U1 ∪ ... ∪ Uk,
(b) Ui ∩ Uj = ∅ for all i ≠ j.
2. A partition of the set U into a family UC1, ..., UCk is called the partition of U into layers in relation to the concept C when the following three conditions are satisfied:
(a) the set UC1 includes objects which, according to an expert, certainly do not belong to the concept C (so they belong to the lower approximation of its complement),
(b) for every two sets UCi, UCj (where i < j), the set UCi includes objects which, according to an expert, belong to the concept C with a degree of certainty lower than the degree of certainty of membership of the objects of UCj in C,
(c) the set UCk includes objects which, according to an expert, certainly belong to the concept C, viz., to its lower approximation.
3. Each algorithm which assigns (classifies) tested objects into one of the layers belonging to a partition of the set U in relation to the concept C is called a stratifying classifier of the concept C.
4. In practice, instead of using the layer markings UC1, ..., UCk, elements of a set E = {e1, ..., ek} are used to label the layers, and the stratifying classifier constructed for the concept C which classifies each tested object into one of the layers labeled with labels from the set E is denoted by μ_C^E.
5. If the stratifying classifier μ_C^E classifies a tested object u into the layer labeled by e ∈ E, then this fact is denoted by the equality μ_C^E(u) = e.

An expert may divide the set of objects U into layers in the two following ways:
1. by an arbitrary assignment of weight labels to all training objects (see Section 3.2),
2. by providing heuristics which may be applied in the construction of a stratifying classifier (see Section 3.3).
Stratifying classifiers can be very useful when we need to estimate realistically the certainty of membership of a tested object in a concept, without determining whether the object belongs to the concept or not. Apart from that, stratifying classifiers may be used to construct so-called production rules (see Section 5.3). In this paper, two general ways of constructing stratifying classifiers are presented. The first one is the expert approach, consisting in the definition by an expert of an additional attribute in the data which describes the membership of objects in particular layers of the concept; next, a classifier differentiating the layers as decision classes is built (see Section 3.2). The second approach is called the automatic approach and is based on designing algorithms which are extensions of classifiers, enabling the classification of objects into layers of a concept on the basis of certain premises and experimental observations (see Section 3.3).

3.2 Stratifying Classifiers Based on the Expert Approach

In the construction of stratifying classifiers using expert knowledge, it is assumed that for all training objects not only the binary classification of training objects to the concept or outside the concept is known, but we also know the assignment of all training objects to the specific concept layers. In this approach additional knowledge needs to be gained from domain knowledge. Owing to that, a classical classifier may be built (e.g., one based on a set of rules with a minimal number of descriptors) which directly classifies objects to the different concept layers. This classifier is built on the basis of a decision attribute which has as many values as there are concept layers, each of these values being a label of one of the layers.

3.3 Stratifying Classifiers Based on the Automatic Approach

In the construction of stratifying classifiers using the automatic approach, the assignment of training objects to specific concept layers is unknown; we only know the binary classification of training objects to the concept or its complement. The performance of a stratifying classifier is, in this case,
connected with a certain heuristic which supports the discernibility of objects belonging to a lesser or greater degree to the concept, that is, objects belonging to different layers of this concept. Such a heuristic determines the way an object is classified to the different layers and is thus called a stratifying heuristic. Many different types of concept-stratifying heuristics may be proposed; these may be, e.g., heuristics based on the difference of the weights of decision rules classifying tested objects to the concept and its complement, or heuristics using the k-NN algorithm of k nearest neighbors (compare with [78, 200, 268]). In this paper, however, we are concerned with a new type of stratifying heuristic using the operation of decision rule shortening (see Section 2.5). The starting point of the presented heuristic is the following observation. Let us assume that for a certain consistent decision table A, whose decision is a binary attribute with values 1 (objects belonging to the concept C) and 2 (objects belonging to the complement of the concept C, denoted by C′), a set of decision rules RUL(A) was calculated. The set RUL(A) is the sum of two separate subsets of rules: RUL1(A) (classifying objects to C) and RUL2(A) (classifying objects to C′). Now, let us shorten the decision rules from RUL1(A) to the consistency coefficient 0.9, placing the shortened decision rules in the set RUL1(A, 0.9), and let RUL′(A) = RUL1(A, 0.9) ∪ RUL2(A). In this way, we have increased the extension of the input set of decision rules RUL(A) in relation to the concept C, viz., as a result of shortening the rules, the chance increases that a given tested object is recognized by the rules classifying to the concept C. In other words, the classifier based on the rule set RUL′(A) classifies objects to the concept C more often. Now, if a certain tested object u, not belonging to the table A, is classified to C′ by the classifier based on the rule set RUL′(A), then the chance that the object u actually belongs to C′ is much bigger than in the case of using the set of rules RUL(A). The reason is that it is harder for a classifier based on the set of rules RUL′(A) to classify objects to C′, for the rules classifying objects to C are shortened in it and, owing to that, they recognize objects more often. If, however, an object u is nevertheless classified to C′, then some of its crucial properties identified by the rules classifying it to C′ must determine this decision. If the shortening of the decision rules is greater (to a lower consistency coefficient), then the change in the rule set extension is even bigger. Summing up the above discussion, we conclude that rule shortening makes it possible to change the extensions of decision rule sets in relation to chosen concepts (decision classes), and owing to that one can obtain a certain type of approximation based on the degree of certainty of membership of tested objects in the concept under consideration, where the different layers of the concept are modeled by applying different rule shortening coefficients. In the construction of algorithms producing stratifying classifiers based on the shortening of decision rules, there occurs the problem of selecting the consistency coefficient thresholds to which decision rules are shortened. In other words, what we mean here is the range and the step with which the threshold must be selected in order to obtain sets of rules enabling an effective
description of the actual layers of the approximated concept. On the basis of previous experimental experience (see, e.g., [196, 198]), in this paper we establish that the shortening thresholds of the decision rule consistency coefficient are selected from the range 0.5 to 1.0. The lower threshold limit (that is, 0.5) results from the experimental observation that if we shorten rules classifying objects to a certain concept C below the limit 0.5 (without simultaneously shortening the rules classifying objects to C′), then, although their extension increases dramatically (they classify objects to the concept C very often), their certainty falls to an absolutely unsatisfactory level. The upper limit of the threshold (that is, 1.0) simply means leaving only exact rules in the set of rules and rejecting the approximate rules which could have occurred for a given decision table. As for the change step of the chosen threshold of the rule consistency coefficient, we set it at 0.1. This change step is dictated by the fact that it enables a thorough search of thresholds from 0.5 to 1.0 while, at the same time, the number of rule shortening operations is not too high, which is essential for keeping the time needed to conduct computer experiments within acceptable bounds.

We now present an algorithm of stratifying classifier construction based on rule shortening (see Algorithm 3.1).

Algorithm 3.1. Stratifying classifier construction
Input: decision table A = (U, A, d) and concept C ⊆ U
Output: classifier list L representing a stratifying classifier
1  begin
2    Calculate decision rules for table A, denoted by RUL(A) = RUL1(A) ∪ RUL2(A)
3    Create empty classifier list L
4    for a := 0.5 to 0.9 with step 0.1 do
5      Shorten rules RUL1(A) to the consistency coefficient a and place the shortened decision rules in RUL1(A, a)
6      RUL′ := RUL1(A, a) ∪ RUL2(A)
7      Add RUL′ to the end of the list L
8    end
9    Add RUL(A) to the end of the list L
10   for a := 0.9 to 0.5 with step 0.1 do
11     Shorten rules RUL2(A) to the consistency coefficient a and place the shortened decision rules in RUL2(A, a)
12     RUL′ := RUL1(A) ∪ RUL2(A, a)
13     Add RUL′ to the end of the list L
14   end
15   return L
16 end

Let us notice that after the above algorithm completes, there are eleven decision rule sets on the list L. The first classifier on this list contains the
most shortened rules classifying to C. That is why, if it classifies an object to C′, the degree of certainty that this object belongs to C′ is the highest. The last classifier on the list L, in turn, contains the most shortened rules classifying to C′; that is why the classification of an object to the concept C by this classifier gives the highest degree of certainty that the object really belongs to C. The time complexity of Algorithm 3.1 depends on the time complexity of the chosen algorithm for computing decision rules and on the algorithm of approximate rule synthesis (see Section 2.5).

On the basis of the classifier constructed according to Algorithm 3.1, the tested object is classified to a specific layer with the help of the successive classifiers, starting from the last one and going to the first one. If the object is classified by the i-th classifier to C, then we learn about the membership of the tested object in the (i + 1)-th layer of C; however, if the object is not classified to C by any classifier, we learn about the membership of the tested object in the first layer (layer number 1), that is, in the complement of the concept C. We present a detailed algorithm for the classification of an object using the stratifying classifier (see Algorithm 3.2).

Algorithm 3.2. Classification using the stratifying classifier
Input:
1. classifier list L representing a stratifying classifier,
2. set of labels of layers E = {e1, ..., e_{size(L)+1}},
3. tested object u
Output: the label of the layer to which the object u is classified
1 begin
2   for i := size(L) down to 1 do
3     Classify u using the classifier L[i]
4     if u is classified by L[i] to the concept C then
5       return e_{i+1}
6     end
7   end
8   return e_1
9 end

Let us notice that if the size of the list L is equal to 11 (as generated by Algorithm 3.1), the above classifier classifies objects into 12 concept layers, where layer number 12 is the layer of objects with the highest degree of certainty of membership in the concept and layer number 1 is the layer with the lowest degree of certainty of membership in this concept. Both algorithms are sketched in Python below.
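
Algorithms 3.1 and 3.2 translate into Python directly. In the sketch below, rule sets are represented as Python sets; compute_rules(A), shorten(rules, a) (Section 2.5) and classifies_to_C(rules, u) are assumed to be supplied externally.

def build_stratifying_classifier(A, compute_rules, shorten):
    rul1, rul2 = compute_rules(A)        # rules for C and for its complement
    L = []
    for a in (0.5, 0.6, 0.7, 0.8, 0.9):  # most shortened C-rules first
        L.append(shorten(rul1, a) | rul2)
    L.append(rul1 | rul2)                # the unshortened rule set
    for a in (0.9, 0.8, 0.7, 0.6, 0.5):  # most shortened C'-rules last
        L.append(rul1 | shorten(rul2, a))
    return L                             # eleven rule sets in total

def stratify(L, u, classifies_to_C):
    # Algorithm 3.2 with 0-based indexing: scan from the last classifier to
    # the first; classification to C by classifier i yields layer i + 2.
    for i in range(len(L) - 1, -1, -1):
        if classifies_to_C(L[i], u):
            return i + 2
    return 1                             # layer 1: the complement of C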

4 General Methodology of Complex Concept Approximation

Many real-life problems may be modeled with the help of so-called complex dynamical systems (see, e.g., [92, 93, 94, 95, 96, 97]) or, putting it in another
way, autonomous multiagent systems (see, e.g., [98, 101]) or swarm systems (see, e.g., [104]). These are sets consisting of complex objects characterized by a constant change of the parameters of their components over time, numerous relationships among the objects, the possibility of cooperation or competition among the objects, and the ability of the objects to perform more or less complicated actions. Examples of systems of this kind are: road traffic, a patient observed during treatment, and a team of robots performing some task. The description of the dynamics of such a system is often impossible using purely classical analytical methods, and the description itself contains many vague concepts. For instance, in order to monitor complex dynamical systems effectively, complex spatio-temporal concepts concerning dynamic properties of the complex objects occurring in these systems are used very often. These concepts are expressed in natural language on a much higher level of abstraction than the so-called sensor data, to which approximation of concepts has mostly been applied so far. Examples of such concepts are: safe car driving, safe overtaking, a patient's behavior when faced with a life threat, and ineffective behavior of a robot team. Much attention has been devoted to spatio-temporal exploration methods in the literature (see, e.g., [63, 64]). Current experience increasingly indicates that approximation of such concepts requires the support of knowledge of the domain to which the approximated concepts apply, i.e., domain knowledge. This usually means knowledge about the concepts occurring in a given domain and the various relations among these concepts. Such knowledge significantly exceeds the knowledge gathered in data sets; it is often represented in natural language, and it is usually obtained in a dialogue with an expert in the given domain (see, e.g., [41, 42, 43, 44, 45, 46, 52, 269]). One of the methods of representing this knowledge is recording it in the form of a so-called concept ontology. A concept ontology is usually understood as a finite set of concepts creating a hierarchy, together with relationships among these concepts which connect concepts from different hierarchical levels (see the next section). In this section, we present a general methodology of approximating complex spatio-temporal concepts on the basis of experimental data and domain knowledge represented mainly by a concept ontology.

4.1 Ontology as a Representation of Domain Knowledge

The word ontology was originally used by philosophers to describe a branch of metaphysics concerned with the nature and relations of being (see, e.g., [270]). However, the definition of ontology itself has been a matter of dispute for a long time, and the controversies concern mainly the thematic scope to be embraced by this branch. Discussions on the subject of the definition of ontology appear in the works of Gottfried Leibniz, Immanuel Kant, Bernard Bolzano, Franz Brentano, and Kazimierz Twardowski (see, e.g., [271]). Most of them treat ontology as a field of science concerning the types and structures of objects, properties, events, processes, relations, and reality domains (see, e.g., [106]). Therefore, ontology is neither a science concerning the functioning of the world nor the ways a human being perceives it. It poses the questions: How do we classify everything? What
classes of beings are inevitable for describing and concluding on the subject of ongoing processes?, What classes of being enable to conclude about the truth?, What classes of being enable to conclude about the future? (see, e.g., [106, 270]). Ontology in Informatics. The term ontology appeared in the information technology context at the end of the sixties of the last century as a speciﬁc way of knowledge formalization, mainly in the context of database development and artiﬁcial intelligence (see, e.g., [53, 272]). The growth in popularity of database systems caused avalanche increase of their capacity. The data size, multitude of tools used both for storing and introducing, or transferring data caused that databases became diﬃcult in managing and communication with the outside world. Database schemes are determined to high extent not only by the requirements on an application or database theory but also by cultural conditions, knowledge, and the vocabulary used by designers. As the result, the same class of objects may possess diﬀerent sets of attributes in various schemes termed differently. These attribute sets are identical terms but often describe completely diﬀerent things. A solution to this problem are supposed to be ontologies which can be treated as tools for establishing standards of database scheme creation. The second pillar of ontology development is artiﬁcial intelligence (AI), mainly because of the view according to which making conclusions requires knowledge resources concerning the outside world, and ontology is a way of formalizing and representing such knowledge (see, e.g., [7, 273]). It is worth noticing that, in the recent years, one of the main applications of ontologies has been their use for an intelligent search of information on the Internet (see, e.g., [53] and [54] for more details). Deﬁnition of Ontology. Philosophically as well as in information technology, there is a lack of agreement if it comes to the deﬁnition of ontology. Let us now consider three deﬁnitions of ontology, well-known from literature. Guarino states (see [53]) that in the most prevalent use of this term, an ontology refers to an engineering artifact, constituted by a speciﬁc vocabulary used to describe a certain reality (or some part of reality), plus a set of explicit assumptions regarding the intended meaning of vocabulary words. In this approach, an ontology describes a hierarchy of concepts related by relationships, whereas in more sophisticated cases, suitable axioms are added to express other relationships among concepts and to constrain the interpretation of those concepts. Another well-known deﬁnition of ontology has been proposed by Gruber (see [105]). He deﬁnes an ontology as an explicit speciﬁcation of a conceptualization. He explains that for AI systems, what exists is that which can be represented. When the knowledge of a domain is represented in a declarative formalism, the set of objects that can be represented is called the universe of discourse. This set of objects and the describable relationships among them are reﬂected in the representational vocabulary with which a knowledge-based program represents knowledge. Thus, according to Gruber, in the the context of AI, we can describe the ontology of a knowledge-based program by deﬁning a set of representational terms. In such an ontology, deﬁnitions associate the names of


entities in the universe of discourse (e.g., classes, relations, functions, or other objects) with human-readable text describing what the names mean, and formal axioms that constrain the interpretation and the well-formed use of these terms. Finally, we present the view of ontology recommended by the World Wide Web Consortium (W3C) (see [107]). W3C explains that an ontology defines the terms used to describe and represent an area of knowledge. Ontologies are used by people, databases, and applications that need to share domain information (a domain is simply a specific subject area or area of knowledge such as medicine, tool manufacturing, real estate, automobile repair, financial management, etc.). Ontologies include computer-usable definitions of basic concepts in the domain and the relationships among them. They encode knowledge in a domain and also knowledge that spans domains. In this way, they make that knowledge reusable. Structure of Ontology. Concept ontologies share many structural similarities, regardless of the language in which they are expressed. Most ontologies describe individuals (objects, instances), concepts (classes), attributes (properties), and relations (see, e.g., [53, 54, 105, 107]). Individuals (objects, instances) are the basic, “ground level” components of an ontology. They may include concrete objects such as people, animals, tables, automobiles, and planets, as well as abstract individuals such as numbers and words. Concepts (classes) are abstract groups, sets, or collections of objects. They may contain individuals or other concepts. Some examples of concepts are vehicle (the class of all vehicles), patient (the class of all patients), influenza (the class of all patients suffering from influenza), player (the class of players), and team (the class of all players from some team). Objects belonging to concepts in an ontology can be described by assigning attributes to them. Each attribute has at least a name and a value, and is used to store information that is specific to the object the attribute is attached to. For example, an object from the concept participant (see the ontology in Fig. 5)2 has attributes such as first name, last name, address, and affiliation. If we did not define attributes for concepts, we would have either a taxonomy (if concept relationships are described) or a controlled vocabulary. These are useful, but are not considered true ontologies. There are the following three types of relations between concepts in an ontology: a subsumption relation (written as the is-a relation), a meronymy relation (written as the part-of relation), and domain-specific relations. The first type is the subsumption relation (written as is-a). If a class A subsumes a class B, then any member of the class B is-a member of the class A. For example, the class participant subsumes the class author. This means that anything that is a member of the class author is also a member of the class participant (see the ontology in Fig. 5). Where A subsumes B, A is called the superclass, whereas B is the subclass. The subsumption relation is very similar to the notion of inheritance, well known from object-oriented programming

2 This example has been inspired by Jarrar (see [54]).

[Fig. 5. The graph of a simple ontology — concepts: person, participant, author, reviewer, organizer, paper, organizing committee, program committee.]

(see, e.g., [274, 275]). Such a relation can be used to create a hierarchy of concepts, typically with a maximally general concept like person at the top and more specific concepts like author or reviewer at the bottom. The hierarchy of concepts is usually visualized by a graph of the ontology (see Fig. 5), where any subsumption relation is represented by a thin solid line with an arrow in the direction from the superclass to the subclass. Another common type of relation is the meronymy relation (written as part-of), which represents how objects combine to form composite objects. For example, in the ontology from Fig. 5, we would say that any reviewer is-part-of the program committee. Any meronymy relation is represented graphically by a broken line with an arrow in the direction from the part to the composite object (see Fig. 5). From the technical point of view, this type of relation between ontology terms is represented with the help of attributes of objects belonging to concepts. It is done in such a way that the value of an attribute of an object u, which is to be a part of some object u′ belonging to a different concept, informs about u′. Apart from the standard is-a and part-of relations, ontologies often include additional types of relations that further refine the semantics modeled by the ontologies. These relations are often domain-specific and are used to answer particular types of questions. For example, in the domain of conferences, we might define a written-by relation between the concepts paper and author which tells us who is the author of a paper. In the domain of conferences, we may also define a writes relation between the concepts author and paper which tells us which papers have been written by each author.
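To make the three relation types concrete, the following minimal Python sketch stores an ontology fragment as labeled edges and computes all superclasses of a concept via the is-a relation; the concept names follow Fig. 5, but the edge encoding and the function are our illustrative assumptions, not part of any ontology tool.

# A minimal sketch of a Fig. 5-style ontology fragment; the encoding is illustrative.
# Each edge is (source concept, relation label, target concept).
EDGES = [
    ("author", "is-a", "participant"),
    ("participant", "is-a", "person"),
    ("reviewer", "is-a", "person"),
    ("reviewer", "part-of", "program committee"),
    ("paper", "written-by", "author"),   # a domain-specific relation
]

def superclasses(concept, edges):
    # Collect all concepts reachable from `concept` via is-a edges.
    result, frontier = set(), [concept]
    while frontier:
        c = frontier.pop()
        for (src, rel, dst) in edges:
            if src == c and rel == "is-a" and dst not in result:
                result.add(dst)
                frontier.append(dst)
    return result

print(superclasses("author", EDGES))  # {'participant', 'person'}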


Any domain-specific relation is represented graphically by a thick solid line with an arrow. From the technical point of view, this type of relation between ontology concepts is also represented with the help of attributes of objects belonging to the concepts. In this paper, we use many ontologies, constructed on the basis of domain knowledge concerning the analyzed data sets, to approximate complex concepts. In these ontologies all the types of relations mentioned above occur. However, the relations of individual types do not occur in these ontologies simultaneously; in each of them only one type of relation occurs. The reason for this is that individual relation types serve to approximate different types of complex concepts. For example, relations of the type is-a occur in the ontology from Fig. 6, which is an example of an ontology used to approximate spatial concepts (see Section 5). Ontologies showing dependencies between temporal concepts for structured objects and temporal concepts for the constituent parts of these objects (used to approximate temporal concepts for structured objects) are examples of ontologies in which relations of the type part-of occur (see Section 6). On the other hand, domain-specific relations occur in numerous examples of behavior graphs presented in Section 6 and are used to approximate behavioral patterns. The planning graphs presented in Section 7 are also examples of ontologies in which domain-specific relations occur. Incidentally, planning graphs are, in a way, ontologies even more complex than those mentioned above, because two types of concepts occur in them simultaneously: concepts representing states of complex objects and concepts representing actions performed on complex objects. Obviously, there are many ways of linking the ontologies mentioned above, provided they concern the same domain. For example, an ontology describing the behavior graph of a group of vehicles may be linked with ontologies describing dependencies between temporal concepts for such groups of vehicles and temporal concepts describing the behavior of individual vehicles or changes of relationships among these vehicles. Then, in such an ontology, relations of two types would occur simultaneously, that is, domain-specific and part-of relations. Although these ways of linking different ontologies are not essential for the complex concept approximation methods presented in this paper, they cause a significant increase in the complexity of the ontologies examined. General Recommendations Concerning Building of an Ontology. Currently, there are many papers which describe the experience of various designer groups obtained in the process of ontology construction (see, e.g., [276]). Although they do not yet constitute a formal framework enabling the creation of an integral methodology, general recommendations on how to create an ontology may be formulated on their basis. Each project connected with ontology creation has the following phases:
– Motivation for creating an ontology.
– Definition of the ontology range.
– Ontology building.
• Building of a lexicon.
• Identification of concepts.


• Building of the concept structure.
• Modeling relations in the ontology.
– Evaluation of the ontology obtained.
– Ontology implementation.
Motivation for creating an ontology is an initial process resulting from a need, arising inside a certain organization, to change the existing ontology or to create a new one. At this stage, clarity of the aim for which the ontology is built is extremely crucial for the whole further process. It is the moment when potential sources of knowledge needed for the ontology construction should be defined. These sources may be divided into two groups: those requiring human engagement (e.g., interviews, discussions) and those in which a human does not appear as a knowledge source (e.g., documents, dictionaries and publications from the modeled domain, intranet and Internet, and other ontologies). By the ontology range we understand the part of the real world which should be included in the model under creation in the form of concepts and relations among them. One of the easier, and at the same time very effective, ways to determine the ontology range accurately is using so-called “competency questions” (see, e.g., [277]). The starting point for this method is defining a list of questions to which the database built on the basis of the ontology under construction should give an answer. Having the range defined, the process of ontology building should be started. The first step in ontology building is defining a list of expressions, phrases, and terms crucial for a given domain and a specific context of application. From this list, a lexicon should be composed, that is, a dictionary containing the terms used by the ontology as well as their definitions. The lexicon is a starting point for the most difficult stage of ontology building, that is, the construction of the concepts (classes) of the ontology and the relations among these concepts. It should be remembered that it is not possible to perform these two activities one after the other; they have to be performed in parallel. We should bear in mind that each relation is also a concept. Thus, finding the answer to the question What should constitute a concept and what should constitute a relation? is not easy and depends on the target application and, often, on the designer’s experience. When it comes to building a hierarchy of classes, three approaches to building such a hierarchy are given in the paper [278]:
1. Top-down. We start with a concept superior to all concepts included in the knowledge base and we come to the next levels of inferior concepts by applying atomization.
2. Bottom-up. We start with the most inferior concepts contained in the knowledge base and we come to the concepts on higher levels of the hierarchy by applying generalization.
3. Middle-out. We start with the concepts which are the most crucial in terms of the project and we perform atomization or generalization as needed.
In order to evaluate the obtained ontology, it should be checked whether the ontology possesses the following qualities ([277]):


– Consistency. The ontology is internally consistent, that is, contradictory conclusions cannot be drawn from it.
– Completeness. The ontology is complete if all expected elements (concepts, relations, etc.) are included in the model.
– Conciseness. All information gathered in the ontology is concise and accurate.
– The possibility of answering the “competency questions” posed previously.
Summing up, ontology building is a laborious process requiring a huge amount of knowledge concerning the modeling process itself, the tools used, and the domain being modeled. Ontology Applications. Practical ontology applications involve so-called general ontologies, which have a rather general character and may be applied in building knowledge bases from different domains, and domain ontologies, that is, ontologies describing knowledge about a specific domain or a specific fragment of the real world. Many such ontologies have been worked out and they are often available on the Internet, e.g., Dublin Core (see [279]), GFO (General Formal Ontology [280]), OpenCyc/ResearchCyc (see [281]), SUMO (Suggested Upper Merged Ontology [282]), WordNet (see [283]), DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering [284]), and others. Generally, ontologies are applied when the semantics of the gathered data is crucial. It turns out that such a situation occurs quite often, particularly when intelligent methods of data analysis are supposed to act effectively. That is why ontologies are more and more useful in information technology projects. Some examples of applications of ontologies are e-commerce, bioinformatics, geographical information systems, regulatory and legal information systems, digital libraries, e-learning, agent technology, database design and integration, software engineering, natural language processing, information access and retrieval, the Semantic Web, Web services, and medicine (see, e.g., [53] and [54] for more details). Computer Systems for Creating and Using Ontologies. There is a series of formal languages for representing ontologies, such as the Web Ontology Language (OWL [107]), Resource Description Framework (RDF [285]), Ontology Inference Layer (OIL [286]), DARPA Agent Markup Language (DAML [287]), CycL (see [288]), etc. However, the most dynamically developed one is OWL, which came into existence as an improvement of the DAML, OIL, and RDF languages. There are also many computer systems for creating and using ontologies, e.g., Cyc (see [288]), OpenCyc (see [289]), Protege (see [290]), OntoStudio (previously OntoEdit [291]), Ontolingua (see [292]), Chimaera (see [293]), OilEd (see [294]), and others. Within these systems, the ontology is usually created using convenient graphical tools which make it possible to enter all the elements of the ontology as well as to further edit and visualize them. Ontological systems very often possess mechanisms for reasoning on the basis of the constructed ontology. These mechanisms work in such a way that, after creating an ontology, the system may be asked quite complex questions. They concern


the existence of an instance of a concept which satisfies certain logical conditions, defined using concepts, attributes, and relations occurring in the ontology. For instance, in the ontology in Fig. 5, we could pose the following questions:
– Who is the author of a given paper?
– Which papers have been reviewed by a given reviewer?
– Which persons belong to the program committee?
From the technical point of view, information searching based on an ontology is performed with the help of queries formed in a formal language used to represent the ontology or in a special extension of such a language. For instance, the language RDQL (RDF Data Query Language [295]) is a query language, similar to SQL, extending the RDF language. Usually, ontological systems also enable forming queries using a graphical interface (see, e.g., [290, 291]).
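The mechanism behind such queries, matching patterns against stored relation instances, can be sketched in a few lines of Python; the triple representation and the example facts below are invented for illustration and are not RDQL itself.

# A sketch of triple-pattern matching, the mechanism behind RDQL-style queries.
# The facts are invented instances, not part of the Fig. 5 ontology itself.
FACTS = [
    ("paper-17", "written-by", "Smith"),
    ("paper-17", "written-by", "Jones"),
    ("paper-42", "written-by", "Smith"),
]

def query(pattern, facts):
    # Match a triple pattern against the facts; None acts as a variable.
    s, p, o = pattern
    return [f for f in facts
            if (s is None or f[0] == s)
            and (p is None or f[1] == p)
            and (o is None or f[2] == o)]

# Who is the author of paper-17?
print([o for (_, _, o) in query(("paper-17", "written-by", None), FACTS)])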

4.2 Motivations for Approximation of Concepts and Relations from Ontology

In current systems operating on the basis of an ontology, it is assumed that we possess complete information about concepts, that is, for each concept all objects belonging to this concept are known. Under this assumption, in order to examine the membership of an object in a concept, it is enough to check whether this object occurs as an instance of this concept or not. Meanwhile, in practical applications we often possess only incomplete information about concepts, that is, for each concept, certain sets of objects constituting examples and counterexamples, respectively, are given. This necessitates approximating concepts with the help of classifiers. For instance, using the ontology in Fig. 6, which concerns safe vehicle driving on a road, it cannot be assumed that all concept instances of this ontology are available. For example, for the concept safe driving, it cannot be assumed that information about all possible cars driving safely is available. That is why for such a concept a classifier is constructed which is expected to be able to classify examples of vehicles into those which belong and those which do not belong to the concept. Apart from that, the relations between concepts defined in current ontology-based systems are usually precise (exact, crisp). For example, for the relation is-a in the ontology from Fig. 5, if the relation between the concepts author and participant is to be precise (exact, crisp), then each author of a paper at a conference is a participant of this conference. In practice, however, it does not always have to be that way. It is possible that some authors of papers are not conference participants, particularly in the case of articles having many coauthors. So, a relation between concepts can be imprecise (inexact, vague). Besides, in classical ontology-based systems, when we possess complete information about concepts, the problem of vagueness of the above relation may be solved by adding to the ontology an additional concept representing those authors who are not conference participants and binding this new concept with the concept person by the is-a relation. However, in practical applications, when the available information about concepts is incomplete, we are not even able to check whether the relations under consideration are precise (exact, crisp). That is why relations among concepts also require approximation.
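To illustrate what approximating a concept from incomplete information amounts to, the following Python sketch labels new objects by their resemblance to given examples and counterexamples of a concept; the attribute values and the nearest-example rule are purely illustrative assumptions, not the classifier construction used later in this paper.

# A sketch of approximating a concept from examples and counterexamples.
# Attribute values and the 1-nearest-example rule are illustrative only.
examples        = [(30, 80), (25, 60)]    # (speed, distance): safe driving
counterexamples = [(90, 5), (70, 10)]     # unsafe driving

def classify(u):
    # Label a new object by its nearest labelled example (Euclidean distance).
    labelled = [(e, True) for e in examples] + [(c, False) for c in counterexamples]
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(labelled, key=lambda ex: dist(ex[0], u))[1]

print(classify((28, 70)))  # True: resembles the safe examples
print(classify((85, 7)))   # False: resembles the counterexamples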

[Fig. 6. An ontology for safe driving — concepts: safe driving; safe overtaking; forcing the right of way; safe distance from the front vehicle; safe distance from the opposite vehicle during overtaking; possibility of going back to the right lane; possibility of safe stopping before the crossroad.]

In the approximation of concepts occurring in an ontology, the following problem often appears. In practical applications, usually only so-called sensor data are available (that is, data obtained by measurement using sensors, and thus obtained at a low level of abstraction). For example, by observing a situation on a road, such data as speed, acceleration, location, and the current driving lane may be obtained. Meanwhile, some concepts occurring in an ontology are so complex that they are separated by a considerable semantic distance from the sensor data, i.e., they are defined and interpreted on very different levels of abstraction. Hence, approximation of such concepts using sensor data does not lead to classifiers of satisfactory quality (see, e.g., [42, 44, 45, 46, 48]). For instance, in the ontology from Fig. 6, such a complex concept is without a doubt the concept safe driving, because it is not possible to determine directly whether a given vehicle drives safely on the basis of simple sensor data only. If, however, apart from complex concepts there are simple concepts in the ontology, that is, concepts which may be approximated using sensor data, and they are directly or indirectly linked by relations to the complex concepts, then there appears a need to use the knowledge about the concepts and the relations among them to approximate the complex concepts more effectively. For example, in order to determine whether a given vehicle drives safely, other concepts from the ontology in Fig. 6, linked by relations to the concept safe driving, may be used. One such concept is the possibility of safe stopping before the crossroad. The aim of this paper is to present a set of methods for approximating complex spatio-temporal concepts and relations among them assuming that the


[Fig. 7. The ontology for safe driving revisited — the concepts of Fig. 6 arranged hierarchically, with safe driving at the top and block arrows leading from sensor data to the lowest-level concepts.]

information about concepts and relations is given in the form of an ontology. To meet these needs, by an ontology we understand a finite set of concepts creating a hierarchy, together with relations among these concepts which link concepts from different levels of the hierarchy. At the top of this hierarchy there is always the most complex concept, whose approximation we are interested in with a view to practical applications. For example, the ontology from Fig. 6 may be presented hierarchically as in Fig. 7. At the same time, we assume that the ontology specification contains incomplete information about the concepts and relations occurring in the ontology; in particular, for each concept, sets of objects constituting examples and counterexamples of this concept are given. Additionally, for the concepts from the lowest hierarchical level (the sensor level) it is assumed that sensor attributes are also available which enable approximating these concepts on the basis of the given examples and counterexamples. This fact is marked in Fig. 7 by block arrows.
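The layered scheme of Fig. 7 can be sketched as follows; the sensor attributes, thresholds, and formulas below are invented stand-ins for classifiers induced from data, and serve only to show how the outputs of lower-level concepts become the inputs of the top-level concept.

# A sketch of the hierarchy in Fig. 7; attributes and thresholds are invented.
def safe_distance_front(obj):          # sensor-level concept
    return obj["dist_front"] > 2.0 * obj["speed"] / 50.0

def possibility_of_safe_stopping(obj): # sensor-level concept
    return obj["dist_crossroad"] > obj["speed"] ** 2 / (2 * obj["braking"])

def safe_driving(obj):
    # Top-level concept approximated from lower-level concept outputs,
    # not directly from raw sensor data.
    return safe_distance_front(obj) and possibility_of_safe_stopping(obj)

vehicle = {"speed": 20.0, "dist_front": 15.0, "dist_crossroad": 60.0, "braking": 6.0}
print(safe_driving(vehicle))  # True for this invented vehicle state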

4.3 Unstructured, Structured, and Complex Objects

Every concept mentioned in this paper is understood as a subset of a certain set called the universe. Elements of the universe are called objects, and they are interpreted as states, incidents, vehicles, processes, patients, illnesses, and sets or sequences of the entities mentioned previously. If such objects come from the real world, then their perception takes place by detecting their structure. Discovery


of relevant object structure for particular tasks is a complex problem strongly related to perception, which is usually understood as the process of acquiring, interpreting, selecting, and organizing sensory information (see, e.g., [45, 86, 145, 146, 147, 148, 149, 150, 151, 152, 296, 297, 298, 299]). Much interdisciplinary research has been conducted in this scope in the overlapping areas of such fields as cognitive science, psychology, neuroscience, and pattern recognition (see, e.g., [26, 27, 35, 300, 301, 302, 303]). The structure of objects is used to define compound patterns over objects with simple or structured structure. The construction of such compound patterns may be hierarchical. We search for patterns relevant for the approximation of some complex concepts. Notice that, together with the granularity of patterns, one should consider the computational complexity of satisfiability testing for such patterns. The structure of the perceived objects may be more or less complex, because objects may differ in complexity. This concerns both the degree of spatial and of spatio-temporal complexity. When speaking about spatial complexity, we mean not only the fact that objects differ in features such as location, size, shape, color, and weight, but also that objects may consist of parts related to each other by dependencies (e.g., one may examine objects which are groups of vehicles in traffic). The spatio-temporal complexity, in turn, results from the fact that the perception of objects may be extended over time (e.g., one may examine objects which are single vehicles observed at a single time point as well as objects which are also single vehicles but observed over a certain period of time). Both of these aspects of object complexity may accumulate, which additionally increases the diversity of the objects appearing (e.g., objects which are vehicle groups observed over a certain period of time are more complex than both objects which are vehicle groups observed at a single time point and objects which are single vehicles observed over a certain period of time). In practice, however, the perception of objects always takes place at an established level of detail. This means that, depending on the needs, during the perception of objects only those details concerning their structure are taken into account which are necessary to conduct effective reasoning about the perceived objects. For example, if we want to identify vehicles driven dangerously on the road, then we are not interested in the internal construction of each vehicle but rather in the behavior of each vehicle as a certain whole. Hence, in this paper, we examine objects of two types. The first type are unstructured objects, that is, objects which may be treated as indivisible wholes. We deal with this type of objects when we analyze patients, bank clients, or vehicles using their parameters observed at a single time point. The second type of objects occurring in practical applications are structured objects, which cannot be treated as indivisible wholes and are often registered over some period. Examples of this type of objects are a group of vehicles driving on a highway, a set of illnesses occurring in a patient, or a robot team performing a task.


In terms of spatiality, structured objects often consist of disjoint parts which are objects of uniform structure connected by dependencies. In general, however, the construction of structured objects is hierarchical, that is, their parts may themselves be structured objects. Additionally, the great spatial complexity of structured objects means that conducting effective reasoning about these objects usually requires observing them over a certain period of time. Thus, the hierarchy of the structure of such objects may concern not only their spatial but also their spatio-temporal structure. For example, to observe simple behaviors of a single vehicle (e.g., a speed increase, a slight turn towards the left lane) it is sufficient to observe the vehicle over a short period of time, whereas to recognize more complex behaviors of a single vehicle (e.g., acceleration, changing lanes from the right one to the left one), the vehicle should be observed for a longer period of time; at the same time, a repeated observation of the above-mentioned simple behaviors may be extremely helpful here (e.g., if over a certain period the vehicle increased speed repeatedly, it means that this vehicle is probably accelerating). Finally, observing the behavior of a vehicle group requires an even longer period of time. This is because the behavior of a vehicle group is usually the aggregation or consequence of the behaviors of the vehicles which belong to the group (e.g., observation of an overtaking maneuver of one vehicle by another requires following specific behaviors of both the overtaking and the overtaken vehicle for a certain period of time). Obviously, each structured object may usually be treated as an unstructured object. If we treat an object as an unstructured object at a given moment, it means that its internal structure does not interest us from the point of view of the decision problems considered. On the other hand, it is extremely difficult to find real unstructured objects, that is, objects without parts. In the real world, almost every object has some kind of internal structure and consists of certain spatial, temporal, or spatio-temporal parts. In particular, objects which are examples and counterexamples of complex concepts (both spatial and spatio-temporal), being more or less semantically distant from sensor data, have a complex structure. Therefore, one can say that they are complex objects. That is why the division of complex objects into unstructured and structured ones is of a symbolic character only and depends on the interpretation of these objects. If we are interested in their internal structure, then we treat them as structured objects; otherwise we treat them as unstructured ones.

4.4 Representation of Complex Object Collections

If complex objects are gathered into a collection, then in order to represent the available information about these objects one may use information systems. Below, we present an example of such an information system whose objects are vehicles and whose attributes describe the parameters of a vehicle recorded at a given time point. Example 1. Let us consider an information system A = (U, A) such that A = {x, y, l, v, t, id}. Each object of this system represents the condition of a considered


vehicle at one time moment. The attributes x and y provide the current location of the vehicle, and the attributes l and v provide the current traffic lane on which the vehicle is and the current vehicle speed, respectively. The attribute t represents time as the number of seconds which have passed since the first observation of the vehicle (Vt is a subset of the set of positive integers). The attribute id provides identifiers of vehicles. The second, extremely crucial, example of an information system used in this paper is an information system whose objects represent patient conditions at different time points. Example 2. Let us consider an information system A = (U, A) such that U = {u1, ..., un} and A = {a1, ..., am, at, aid}. Each object of this system represents the medical parameters of a certain patient during one day of his/her hospitalization. Attributes a1, ..., am describe the medical parameters of the patient (examination results, diagnoses, treatments, medications, etc.), whereas the attribute at represents time as the number of days which have passed since the first observation of the patient (Vat is a subset of the set of positive integers). Finally, the attribute aid provides identifiers of patients. As in the two examples above, the attributes of complex objects may be based on sensor data. However, in the general case the properties of complex objects may be defined in languages which are defined specifically for a given purpose (see Section 4.7).
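The information system of Example 1 can be sketched as a plain table with one row per observed vehicle state; the concrete values below are invented.

# A sketch of the information system from Example 1; the rows are invented.
# Attributes: x, y (location), l (lane), v (speed), t (time), id (vehicle).
A = ["x", "y", "l", "v", "t", "id"]
U = [
    {"x": 10.0, "y": 5.0, "l": "right", "v": 50.0, "t": 0, "id": "car-1"},
    {"x": 24.0, "y": 5.1, "l": "right", "v": 52.0, "t": 1, "id": "car-1"},
    {"x": 40.0, "y": 8.9, "l": "left",  "v": 80.0, "t": 0, "id": "car-2"},
]

# All recorded states of one vehicle, ordered by time:
car1 = sorted((u for u in U if u["id"] == "car-1"), key=lambda u: u["t"])
print([u["v"] for u in car1])  # the speed trajectory of car-1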

4.5 Relational Structures

As we have written before, structured objects consist of parts which are structured objects of lesser complexity (a hierarchical structure) or unstructured objects connected by dependencies. Additionally, the great spatial complexity of structured objects means that conducting effective reasoning about these objects usually requires observing them for a certain period of time. Hence, there is a need to follow the spatio-temporal dependencies between the parts of complex objects. Therefore, an effective description of the structure of objects requires not only providing the spatial properties of the individual parts of these objects, but also describing the spatio-temporal relations between these parts. For this reason, in order to describe the structure of complex objects and the relations between complex objects, in this paper we will use relational structures (see, e.g., [5, 89]). In order to define a relational structure using the language and semantics of first-order logic, we assume that a set of relation symbols REL = {Ri : i ∈ I} and a set of function symbols FUN = {fj : j ∈ J} are given, where I, J are some finite sets (see, e.g., [89]). To every functional or relational symbol there is assigned a natural number called the arity of the symbol. Functional and relational symbols of arity 0 are called constants. The set of constants is denoted by CONST. Symbols of arity 1 are called unary and those of arity 2 binary. In the case of binary relational or functional symbols we usually use traditional infix notation; for instance, we write x ≤ y rather than ≤(x, y). The set of functional


and relational symbols together with their arities is called the signature. The interpretation of a functional symbol fi (a relational symbol Ri) over a set A is a function (a relation) defined over the set A and denoted by fi^A (Ri^A). The number of arguments of a function fi^A (a relation Ri^A) is equal to the arity of fi (Ri). Now, we can define a relational structure of a given signature (see, e.g., [5, 89]). Definition 2 (A relational structure of a given signature). Let Σ = REL ∪ FUN be a signature, where REL = {Ri : i ∈ I} is a set of relation symbols and FUN = {fj : j ∈ J} is a set of function symbols, with I, J some finite sets.
1. A relational structure of signature Σ is a triple (D, R, F), where
– D is a non-empty finite set called the domain of the relational structure,
– R = {R1^D, ..., Rk^D} is a finite (possibly empty) family of relations defined over D such that Ri^D corresponds to the symbol Ri ∈ REL and Ri^D ⊆ D^ni, where 0 < ni ≤ card(D) and ni is the arity of Ri, for i = 1, ..., k,
– F = {f1^D, ..., fl^D} is a finite (possibly empty) family of functions such that fj^D corresponds to the symbol fj ∈ FUN and fj^D : D^mj −→ D, where 0 ≤ mj ≤ card(D) and mj is the arity of fj, for j = 1, ..., l.
2. If for some f ∈ F we have f : D^0 −→ D, then we call such a function a constant and we identify it with the element of the set D corresponding to f.
3. If (D, R, F) is a relational structure and F is empty, then such a relational structure is called a pure relational structure and is denoted by (D, R).
A classical example of a relational structure is the set of real numbers with the operations of addition and multiplication and the ordering relation. A typical example of a pure relational structure is a directed graph whose domain is the set of graph nodes and whose family of relations consists of the one relation described by the set of graph edges. The example below illustrates how relational structures may be used to describe the spatial structure of a complex object. Example 3. Let us examine the complex object which is perceived as the image in Fig. 8. In this image one may notice a group of six cars: A, B, C, D, E, F. In order to define the spatial structure of this car group, the most crucial thing is defining the location of the cars relative to each other and the diversity of the driving directions of the individual cars. That is why the spatial structure of such a group may be described with the help of a relational structure (S, R), where:
– S = {A, B, C, D, E, F},
– R = {R1, R2, R3, R4}, where:
• ∀(X, Y) ∈ S × S : (X, Y) ∈ R1 iff X is driving directly before Y,
• ∀(X, Y) ∈ S × S : (X, Y) ∈ R2 iff X is driving directly behind Y,
• ∀(X, Y) ∈ S × S : (X, Y) ∈ R3 iff X is coming from the opposite direction in comparison with Y,
• ∀(X, Y) ∈ S × S : (X, Y) ∈ R4 iff X is driving in the same direction as Y.

[Fig. 8. An example of a spatial complex object — six cars A, B, C, D, E, F on a road.]

[Fig. 9. An example of a spatio-temporal complex object — three consecutive frames F1, F2, F3 showing cars A, B, C, D; E, F, G, H; and I, J, K, L, respectively.]

For instance, it is easy to see that (B, A), (C, B), (D, C), (F, E) ∈ R1, (A, B), (B, C), (C, D), (E, F) ∈ R2, (E, C), (E, B), (F, A) ∈ R3, and (A, C), (B, D), (E, F) ∈ R4. Complex objects may also have a spatio-temporal structure. The example below shows this type of structured object. Example 4. Let us examine the complex object which is represented with the help of three images F1, F2, and F3 recorded at three consecutive time points (see Fig. 9). In image F1 one may notice cars A, B, C, and D, whereas in image


F2 we see cars E, F, G, and H. Finally, in image F3 we see cars I, J, K, and L (see Fig. 9). It is easy to notice that the pictures F1, F2, and F3 may be treated as three frames chosen from a certain film made, e.g., from an unmanned helicopter conducting road observation, where each consecutive frame is distant in time from the previous one by about one second. Therefore, in all these pictures we see the same four cars: the first car is perceived as car A, E, or J, the second car is perceived as car B, F, or I, the third car is perceived as car C, G, or L, and the fourth car is perceived as car D, H, or K. The spatial structure of the complex object ST = {A, B, C, D, E, F, G, H, I, J, K, L} may be described with the help of a relational structure similar to the one in Example 3. However, the object ST has a spatio-temporal structure which should be reflected in the relational structure describing the complex object ST. That is why, to the relation family R from Example 3 we add a relation Rt determined in the following way: ∀(X, Y) ∈ ST × ST : (X, Y) ∈ Rt iff X represents the same vehicle as Y and X was recorded earlier than Y. For instance, it is easy to see that (A, E), (H, K) ∈ Rt, but (G, C), (I, F) ∉ Rt and (C, H), (F, K) ∉ Rt. Moreover, we modify the definition of the remaining relations from the family R:
– ∀(X, Y) ∈ ST × ST : (X, Y) ∈ R1 iff X, Y were noticed in the same frame and X is driving directly before Y,
– ∀(X, Y) ∈ ST × ST : (X, Y) ∈ R2 iff X, Y were noticed in the same frame and X is driving directly behind Y,
– ∀(X, Y) ∈ ST × ST : (X, Y) ∈ R3 iff X, Y were noticed in the same frame and X is coming from the opposite direction in comparison with Y,
– ∀(X, Y) ∈ ST × ST : (X, Y) ∈ R4 iff X, Y were noticed in the same frame and X is driving in the same direction as Y.

If some set of complex objects is perceived as an unstructured object (its parts are not distinguished) and these objects belong to the object set of a certain information system, then the structure of such a set of complex objects is described by a relational structure that we call a trivial relational structure. Definition 3. Let A = (U, A) be an information system. For any set of objects U′ ⊆ U we define a relational structure (Dom, R, F) such that Dom = {U′} and R and F are empty families. Such a relational structure is called a trivial relational structure. Trivial relational structures are used to approximate spatial concepts (see Section 5). In each collection of complex objects there may occur relations between the objects belonging to this collection. That is why each collection of complex objects may be treated as a complex object whose parts are the objects belonging to the collection. Hence, the structure of a collection of complex objects may be described using a relational structure whose domain elements are the objects which belong to this collection (see Section 4.7).
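As a concrete illustration of Example 3, the pure relational structure (S, R) can be written down directly in Python; the tuples below merely restate the instances listed in the example.

# A sketch of the pure relational structure (S, R) from Example 3.
S = {"A", "B", "C", "D", "E", "F"}
R1 = {("B", "A"), ("C", "B"), ("D", "C"), ("F", "E")}   # driving directly before
R2 = {(y, x) for (x, y) in R1}                          # driving directly behind (inverse)
R3 = {("E", "C"), ("E", "B"), ("F", "A")}               # opposite direction
R4 = {("A", "C"), ("B", "D"), ("E", "F")}               # same direction

structure = (S, {"R1": R1, "R2": R2, "R3": R3, "R4": R4})
print(("C", "B") in R1, ("B", "C") in R2)  # True True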

4.6 Languages and Property Systems

In this paper, we use many special languages to define features of complex objects. Any language L is understood as a set of formulas over a given finite alphabet and is constructed in the following way. 1. First, we define an alphabet of L, some atomic formulas, and their semantics by means of a satisfiability relation |=L. The satisfiability relation is a binary relation in X × L, where X denotes a universe of elements (objects). We write x |=L α to denote the fact that |=L holds for the pair (x, α) consisting of the object x and the formula α. 2. Next, we extend, in the standard way, the satisfiability relation |=L to Boolean combinations of atomic formulas, i.e., to the least set of formulas including the atomic formulas and closed with respect to the classical propositional connectives: disjunction (∨), conjunction (∧), and negation (¬), using the following rules: (a) x |=L (α ∨ β) iff x |=L α or x |=L β, (b) x |=L (α ∧ β) iff x |=L α and x |=L β, (c) x |=L ¬(α) iff non(x |=L α), where α, β are formulas, x is an object, and the symbol |=L denotes the satisfiability relation of the defined language. 3. Finally, for any formula α ∈ L, the set |α|L = {x ∈ X : x |=L α} can be constructed, which is called the meaning (semantics) of α in L. Hence, in the sequel, when specifying languages and their semantics we will only define atomic formulas and their semantics, assuming that the extension to Boolean combinations is the standard one. Moreover, in definitions of the alphabets over which languages are constructed we often omit listing parentheses, assuming that the relevant parentheses are always included in the alphabets. Besides, in modeling complex objects we often use structures called property systems. Definition 4 (A property system). A property system is any triple P = (X, L, |=), where X is a set of objects, L is a language over a given finite alphabet, and |= ⊆ X × L is a satisfiability relation. We also use the following notation: 1. We write, if necessary, XP, LP, |=P instead of X, L, and |=, respectively. 2. |α|P = {x ∈ X : x |=P α} is the meaning (semantics) of α in P. 3. By aα for α ∈ LP we denote a function (attribute) from XP into {0, 1} defined by aα(x) = 1 iff x |=P α, for x ∈ XP. 4. Any property system P with a finite set of objects and a finite set of formulas defines an information system AP = (XP, A), where A = {aα : α ∈ LP}. It is worth mentioning that the definition of any information system A = (U, A) constructed in hierarchical modeling should start from the definition of the universe of objects of such an information system. For this purpose, we select


a language in which a set U∗ of complex objects is defined, where U ⊆ U∗. For specifying the universe of objects of A, we construct some property system Q over the universe U∗ of already constructed objects. The language LQ consists of formulas which are used for specifying properties of the already constructed objects from U∗. To define the universe of objects of A, we select a formula α from LQ. Such a formula is called the type of the constructed information system A. Now, we assume that an object x belongs to the universe of A iff x satisfies (in Q) the formula α, i.e., x |=Q α, where x ∈ U∗. Observe that the universe of objects of A can be an extension of the set U, because U is usually only a sample of the possible objects of A. Notice that the type α selected for a constructed information system defines a binary attribute aα for this system. Certainly, this attribute can be used to define the universe of the information system A (see Section 4.7 for more details). Notice also that the property system Q is constructed using property systems and information systems used in modeling the lower level of the concept hierarchy.
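A property system and the Boolean extension of its satisfiability relation can be sketched as follows; representing formulas as Python predicates and the sample objects are our illustrative assumptions.

# A sketch of a property system P = (X, L, |=); formulas are predicates here.
X = [{"v": 30}, {"v": 70}, {"v": 90}]

def atomic(attr, low):                       # an invented atomic formula
    return lambda x: x[attr] > low

def Or(a, b):  return lambda x: a(x) or b(x)      # x |= (a ∨ b)
def And(a, b): return lambda x: a(x) and b(x)     # x |= (a ∧ b)
def Not(a):    return lambda x: not a(x)          # x |= ¬a

def meaning(alpha, X):
    # |alpha| = the set of objects of X satisfying alpha.
    return [x for x in X if alpha(x)]

fast = atomic("v", 60)
print(meaning(And(fast, Not(atomic("v", 80))), X))  # [{'v': 70}]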

4.7 Basic Languages of Defining Features of Complex Objects

As we have written before, the perception of each complex object coming from the real world takes place by detecting its structure (see Section 4.3), and the features of a given complex object may be determined only by establishing the features of this structure. The structures of complex objects which are the result of the perception of complex objects may be modeled with the help of relational structures (see Section 4.5). Therefore, by the features of complex objects represented by relational structures we understand the features of these structures. Each collection of complex objects K may be represented using an information system A = (U, A), where the object set U is equal to the collection K and the attributes from the set A describe the properties of the complex objects from the collection K, and more precisely, the properties of the relational structures representing the individual objects of this collection. In the simplest case, the attributes from the set A may be sensor attributes, that is, they represent the readings of sensors recorded for the objects from the set U (see Example 1 and Example 2 in Section 4.4). However, in the case of structured objects, whose properties usually cannot be described with the help of sensor attributes, the attributes from the set A may be defined with the use of the properties of these objects’ parts, the relations between the parts, and information about the hierarchy of parts expressed, e.g., with the help of a concept ontology (see Section 4.10). In practice, apart from the properties of complex objects described above and represented using the attributes from the set A, other properties of complex objects are also possible which describe the properties of these objects on a slightly higher level of abstraction than the attributes from the set A. These properties are usually defined by experts on the basis of domain knowledge and are often represented with the help of concepts, that is, attributes which have only two values. For the table in Example 1, e.g., “safe driving” could be such a concept.


By adding such an attribute, usually called a decision attribute (or decision) and denoted by d, to the information system, we obtain a decision table (U, A, d). However, effective approximation of a decision attribute d using attributes from the set A usually requires defining new attributes, which are often binary attributes representing concepts. Such concepts may be defined in an established language on the basis of the attributes available in the set A. In this paper, such a language is called a language for defining features of complex objects. In the simplest case such a language may be the language of mathematical formulas in which formulas enabling the calculation of specific properties of a complex object are formed. For example, if the complex object is a certain subset of the set of rational numbers with the usual addition and multiplication and the order relation, then the attributes of such a complex object may be: the minimal value, the maximal value, or the arithmetic average over this set. However, in many cases, in order to define attributes of complex objects, special languages should be defined. In this paper, to define a specific language describing complex object properties, Tarski’s approach is used, which requires defining the language’s alphabet, the set of language formulas, and the semantics of the formulas (see, e.g., [304] and Section 4.6). For example, in order to define concepts describing new properties of objects from a given information system, a well-known language called the generalized descriptor language may be used (see, e.g., [16, 165]). Definition 5 (A generalized descriptor language). Let A = (U, A) be an information system. A generalized descriptor language of the information system A (denoted by GDL(A) or GDL-language, when A is fixed) is defined in the following way:
• the set ALGDL(A) = A ∪ ⋃a∈A Va ∪ {¬, ∨, ∧} is an alphabet of the language GDL(A),
• expressions of the form (a ∈ V), where a ∈ A and V ⊆ Va, are atomic formulas of the language GDL(A).
Now, we determine the semantics of the language GDL(A). The formulas of the language GDL(A) may be treated as descriptions of objects occurring in the system A. Definition 6. Let A = (U, A) be an information system. The satisfiability of an atomic formula φ = (a ∈ V) ∈ GDL(A) by an object u ∈ U from the table A (denoted by u |=GDL(A) φ) is defined in the following way: u |=GDL(A) (a ∈ V) iff a(u) ∈ V. We still need to answer the question of how the atomic formulas (expressions of the form a ∈ V) belonging to the set of formulas of the above language are defined. In the case of symbolic attributes, in practical applications the formulas of the form a ∈ V are usually defined using the relations “=” or “≠” (e.g., a = va or a ≠ va for some symbolic attribute a such that va ∈ Va). However, if the attribute a is


a numeric one, then the correct atomic formulas may be a < va, a ≤ va, a > va, or a ≥ va. Atomic formulas may also be defined using intervals, for example: a ∈ [v1, v2], a ∈ (v1, v2], a ∈ [v1, v2), or a ∈ (v1, v2), where v1, v2 ∈ Va. We present a few examples of formulas of the language GDL(A), where A = (U, A), A = {a1, a2, a3}, v1 ∈ Va1, v2 ∈ Va2, and v3, v4 ∈ Va3:
– (a1 = v1) ∧ (a2 = v2) ∧ (a3 ∈ [v3, v4)),
– (a1 = v1) ∨ (a2 = v2),
– ((a1 = v1) ∨ (a2 = v2)) ∧ (a3 > v3),
– ¬((a1 = v1) ∧ (a3 ≤ v3)) ∨ ((a2 = v2) ∧ (a3 ∈ (v3, v4])).
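The satisfiability relation of Definition 6 is straightforward to mechanize; in the Python sketch below, the attribute names and value sets are invented, and a compound GDL formula is checked against the objects of a small information system.

# A sketch of checking GDL(A) descriptors (a ∈ V) against objects.
U = [{"a1": "x", "a2": 3}, {"a1": "y", "a2": 7}]

def sat(u, a, V):
    # u |= (a ∈ V) iff a(u) ∈ V (Definition 6).
    return u[a] in V

# (a1 = x) ∧ (a2 ∈ [1, 5]):
phi = lambda u: sat(u, "a1", {"x"}) and 1 <= u["a2"] <= 5
print([u for u in U if phi(u)])  # only the first object satisfies phi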

Another example of a language defining complex object properties is a neighborhood language. In order to define the neighborhood language, a dissimilarity function of pairs of objects of an information system is needed. Definition 7. Let A = (U, A) be an information system.
1. We call a function DISMA : U × U −→ [0, 1] a dissimilarity function of pairs of objects in the information system A if the following conditions are satisfied: (a) for any pair (u1, u2) ∈ U × U : DISMA(u1, u2) = 0 ⇔ ∀ a ∈ A : a(u1) = a(u2), (b) for any pair (u1, u2) ∈ U × U : DISMA(u1, u2) = DISMA(u2, u1), (c) for any u1, u2, u3 ∈ U : DISMA(u1, u3) ≤ DISMA(u1, u2) + DISMA(u2, u3).
2. For any u1, u2, u3, u4 ∈ U, if DISMA(u1, u2) < DISMA(u3, u4), then we say that the objects from the pair (u3, u4) are more different than the objects from the pair (u1, u2), relative to DISMA.
3. If u1, u2 ∈ U satisfy DISMA(u1, u2) = 0, then we say that the objects from the pair (u1, u2) are not different, relative to DISMA, i.e., they are indiscernible relative to DISMA.
4. If u1, u2 ∈ U satisfy DISMA(u1, u2) = 1, then we say that the objects from the pair (u1, u2) are completely different, relative to DISMA.
Let us notice that the above dissimilarity function is not a metric (distance) but a pseudometric. The reason is that the first metric condition is not satisfied, which in the case of the DISMA function would state that the distance between a pair of objects is equal to 0 if and only if they are the same object. This condition is not satisfied because of the possible existence of non-one-element abstraction classes of the indiscernibility relation IND(A), that is, because of the possibility of repetition of objects in the set U. We present an example of a dissimilarity function of pairs of objects of an information system.


Example 5. Let A = (U, A) be an information system, where A = {a1, ..., am} is a set of binary attributes. We define a dissimilarity function of pairs of objects in the following way:

∀(u1, u2) ∈ U × U : DISMA(u1, u2) = card({a ∈ A : a(u1) ≠ a(u2)}) / card(A).
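For binary attributes this is the normalized Hamming distance; the short Python sketch below computes it and also performs the single comparison DISMA(u0, u) ≤ ε on which the neighborhood formulas of Definition 9 below are based (the objects are invented).

# A sketch of the dissimilarity function of Example 5 (normalized Hamming).
A = ["a1", "a2", "a3", "a4"]

def dism(u1, u2):
    # DISM(u1, u2): fraction of attributes on which u1 and u2 differ.
    return sum(1 for a in A if u1[a] != u2[a]) / len(A)

u1 = {"a1": 0, "a2": 1, "a3": 1, "a4": 0}
u2 = {"a1": 0, "a2": 0, "a3": 1, "a4": 1}
print(dism(u1, u2))          # 0.5: the objects differ on two of four attributes
# Neighborhood check of Definition 9 below: u2 |= (u1, eps) iff dism(u1, u2) <= eps.
print(dism(u1, u2) <= 0.5)   # True
print(dism(u1, u2) <= 0.25)  # False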

Let us notice that the dissimilarity function defined above is based on the widely known measure of dissimilarity introduced by Hamming, which, for two sequences of the same length, expresses the number of positions at which these two sequences differ. Now, we can define the neighborhood language. Definition 8 (A neighborhood language). Let A = (U, A) be an information system. A neighborhood language for the information system A (denoted by NL(A) or NL-language, when A is fixed) is defined in the following way:
• the set ALNL(A) = U ∪ (0, 1] ∪ {¬, ∨, ∧} is an alphabet of the language NL(A),
• expressions of the form (u, ε), where u ∈ U and ε ∈ (0, 1], called neighborhoods of objects, are atomic formulas of the language NL(A).
Now, we determine the semantics of the language NL(A). The formulas of the language NL(A) may be treated as descriptions of objects occurring in the system A. Definition 9. Let A = (U, A) be an information system and DISMA a dissimilarity function of pairs of objects from the system A. The satisfiability of an atomic formula φ = (u0, ε) ∈ NL(A) by an object u ∈ U from the table A relative to the dissimilarity function DISMA (denoted by u |=NL(A) φ) is defined in the following way: u |=NL(A) (u0, ε) ⇔ DISMA(u0, u) ≤ ε. Each formula of the languages GDL or NL describes a certain set of objects which satisfy this formula (see Fig. 10). According to Definitions 5 and 8, the set of such objects is included in the set of objects U. However, it is worth noticing that these formulas may also be satisfied by objects from outside the set U, that is, objects belonging to an extension of the set U (if we assume that attribute values for such objects can be obtained) (see Fig. 10). An explanation is needed when it comes to the issue of defining a dissimilarity function of pairs of objects in an information system. For information systems, many such functions may be defined applying various approaches. A review of such approaches may be found, e.g., in [162, 163, 164, 165, 166, 167, 168, 169, 170, 171]. However, the approaches known from the literature usually do not take into account the full specification of a specific information system. That is why, in the general case, the dissimilarity function of pairs of objects should be defined by experts individually for each information system on the basis of domain knowledge. Such a definition may be given in the form of an arithmetical expression (see Example 5). Very often, however, experts in a given domain are not able to present such an expression


[Fig. 10. The illustration of the meaning of a given formula — the meaning of the formula φ is shown inside U, the set of objects from the system A, which is in turn contained in U∗, an extension of the set U.]

and content themselves with presenting a set of example values of this function, that is, a set of pairs of objects labeled with the value of the dissimilarity function for these objects. In this last case, defining the dissimilarity function requires approximation with the help of classifiers. The classifier approximating the dissimilarity function is called a dissimilarity classifier of pairs of objects for an information system. Definition 10. Let A = (U, A) be an information system, where A = {a1, ..., am}, and let DISMA be a given dissimilarity function of pairs of objects from the system A.
1. A dissimilarity function table for the system A relative to the dissimilarity function DISMA is a decision table AD = (UD, AD, d), where:
– UD ⊆ U × U,
– AD = {b1, ..., bm, bm+1, ..., b2m}, where the attributes from AD are defined in the following way: for any u = (u1, u2) ∈ UD and bi ∈ AD, bi(u) = ai(u1) if i ≤ m, and bi(u) = ai−m(u2) otherwise,
– ∀u = (u1, u2) ∈ UD : d(u) = DISMA(u1, u2).
2. If AD = (UD, AD, d) is the dissimilarity function table for the system A, then any classifier for the table AD is called a dissimilarity classifier for the system A. Such a classifier is denoted by μDISMA.
Let us notice that the dissimilarity table of the information system A does not contain all possible pairs of objects of the system A, but only a certain chosen


subset of the set of these pairs. This limitation is necessary, for the number of pairs in the product U × U may be so large that the expert is not able to give the values of the decision attribute d for all of them. That is why the dissimilarity table usually contains only pairs chosen by the expert which represent typical cases of determining the dissimilarity function and which may be generalized with the help of a classifier. The dissimilarity classifier may serve to determine the value of the dissimilarity function for pairs of objects from the information system. According to Definition 10, such pairs come from the set U × U, that is, they are pairs of objects from a given information system A. However, it should be stressed that the dissimilarity classifier may also determine the values of the dissimilarity function for pairs of objects which do not belong to the system A, that is, pairs which belong to an extension of A. Hence, dissimilarity classifiers may be treated as a way to define concepts (new two-argument relations). The described approach to measuring dissimilarity is applied in this paper to the measurement of dissimilarity between objects in information systems (see Section 6.7 and Section 6.19), between states in planning graphs (see Section 7.9), and between plans (see Section 7.20).
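The construction of the table AD from Definition 10 can be sketched directly in Python; the expert-labeled pairs below are invented.

# A sketch of building the dissimilarity function table A_D of Definition 10.
A = ["a1", "a2"]                     # attributes of the underlying system (m = 2)
pairs = [                            # expert-chosen pairs with DISM labels
    (({"a1": 0, "a2": 1}, {"a1": 0, "a2": 1}), 0.0),
    (({"a1": 0, "a2": 1}, {"a1": 1, "a2": 0}), 1.0),
]

rows = []
for ((u1, u2), d) in pairs:
    row = {}
    for i, a in enumerate(A):                    # b_i(u) = a_i(u1), i <= m
        row["b%d" % (i + 1)] = u1[a]
    for i, a in enumerate(A):                    # b_{m+i}(u) = a_i(u2)
        row["b%d" % (i + 1 + len(A))] = u2[a]
    row["d"] = d                                 # decision: DISM_A(u1, u2)
    rows.append(row)

print(rows[1])  # {'b1': 0, 'b2': 1, 'b3': 1, 'b4': 0, 'd': 1.0}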

4.8 Types of Complex Objects

In a given complex dynamical system there may occur many different complex objects. The collection of all such objects may be represented with the help of an information system, where the set of this system’s objects corresponds to the objects of this collection and the attributes of this system describe the properties of the complex objects from the collection, and more precisely, the properties of the relational structures representing the individual objects of this collection. Such a system we call in this paper a total information system (TIS) for a given complex dynamical system. The attributes of the system TIS may be sensor attributes, or they may be defined in an established language which helps to express the properties of complex objects (see Section 4.7). To the attribute set of the system TIS one may add a binary decision attribute representing a concept describing an additional property of complex objects. The decision attribute may then be approximated with the help of the attributes available in the system TIS (see Section 4.7). However, in practice the concepts which are examined are defined only on the set of complex objects of a certain type occurring in a given complex dynamical system. In the example concerning traffic (see Example 1), such a concept may concern only cars (e.g., safe overtaking of one car by another), whereas in the example concerning patient treatment (see Example 2), the examined concepts may concern the treatment of infants only, and not other people such as children, adults, or the elderly, whose treatment differs from the treatment of infants. Therefore, we need a mechanism which enables an appropriate selection of complex objects, and more precisely, of the relational structures which represent them and in which we are interested at the moment. In other words, we need a method which enables selecting objects of a certain type from the system TIS.


In this paper, we propose a method of adding binary attributes to the TIS to define the types of complex objects, or more precisely, the types of the relational structures representing those objects. The value YES of such an attribute in a given row means that the row represents a complex object of the examined type, whereas the value NO means that the row represents a complex object which is not of the examined type. The attributes defining types may be defined with the help of attributes from the system TIS in the language GDL or in any other language in which the attributes of the system TIS were defined. The example below shows how attributes defining the types of complex objects may be defined.

Example 6. Let us assume that in the children's ward of a certain hospital an information system A = (U, A) was used to represent information about patients' treatment, such that U = {u1, ..., un} and A = {a1, ..., am, aage, at, aid}. Each object of this system represents the medical parameters of a certain child on one day of his/her hospitalization. Attributes a1, ..., am describe the medical parameters of the patient (examination results, diagnoses, treatments, medications, etc.), while the attribute aage represents the age of the patient (the number of days of life), the attribute at represents the value of a time unit (a number of days) which has elapsed since the first observation of the patient, and the attribute aid provides the identifiers of patients. If the system A is treated as the total information system for a complex dynamical system understood as the set of all patients, then the "infant" type of patient (a child not older than 28 days), labeled Tinf, may be defined with the help of the formula (aage ≤ 28). A slightly more difficult situation appears in the case of the information system from Example 1, when we want to define the passenger car type of object. A verbal description of the formula defining such a type may be as follows: the object is perceived as a rectangle whose length is two to five times greater than its width, and the movement of the object takes place in the direction parallel to the longer side of the rectangle. It is easy to see that in order to define such a formula the information system from Example 1 would have to be complemented with sensor attributes giving the coordinates of the characteristic points of the object, which determine its size, shape, and direction of movement.

If we define in the system TIS an additional attribute determining the type of object, then we can select an information subsystem in which all objects have the same value of this attribute. Using a subsystem selected in such a way, one may analyze concepts concerning the established type of objects. Obviously, during the approximation of these concepts the attribute determining the type, according to which the object selection was previously performed, is useless, because its value is the same for all selected objects. Therefore, the attributes defining the type of object are not used to approximate concepts, but only for an initial selection of objects for the needs of concept approximation. In a given complex dynamical system very different complex objects may be observed. The diversity of objects may express itself both through the degree of spatial complexity and through spatio-temporal complexity (see Section 4.3). Therefore, in the general case it should be assumed that in order to describe the properties of all complex objects occurring in a given dynamical system, many languages must be used. For instance, to describe the properties of a single vehicle at a single time point, information obtained directly from the sensors is usually used (e.g., speed, location); to describe the properties of a vehicle observed for a certain period of time (a time window), a language may be used which makes it possible to define the so-called temporal patterns observed in time windows (see Section 6.6); whereas to describe the properties of groups of vehicles, a language may be used which makes it possible to define temporal patterns observed in sequences of time windows (see Section 6.17). Moreover, it usually happens that not every one of these languages is appropriate for expressing the properties of all complex objects occurring in a given complex dynamical system. For example, applying the language of temporal patterns to determine the properties of a vehicle at a single time point is not feasible, because this language requires information about the vehicle collected over a whole time window, not at a single time point. Therefore, the approach to recognizing types of complex objects described above must be complemented. Namely, the attributes defining types of complex objects, apart from the values YES and NO mentioned before, may also take the value UNKNOWN. This value means that for a given complex object it is not possible to compute the value of the attribute correctly. Summarizing, if we examine complex objects from a certain complex dynamical system and claim that a given complex object u is a complex object of type T, then this means that in the total information system constructed for this system there exists an attribute aT which takes the value YES for the object u. One may also say that a given complex object u is not a complex object of type T, which means that the attribute aT corresponding to the type T takes the value NO for the object u. The attribute aT may also take the value UNKNOWN for the object u, which in practice also means that the object u is not of type T. A given complex object may be an object of many types, because there may exist many type-identifying attributes in the TIS which take the value YES for this object. For example, in the information system from Example 6 the type of object Tr may be defined, described in words as a patient recently admitted to hospital (that is, admitted not earlier than three days ago), with the help of the formula (at ≤ 3). Then an infant admitted to hospital for treatment two days ago is a patient of both type Tinf and type Tr. Finally, let us notice that the above approach to determining types of objects may be applied not only to complex objects which were observed at the moment of defining the formula determining the type, but also to those complex objects which appear later, that is, which belong to the extension of the system TIS. This follows from the properties of the formulas of the language GDL which define the types of objects in the discussed approach.
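The following minimal sketch in Python shows how a type-defining attribute with the values YES, NO, and UNKNOWN can be computed for the type Tinf from Example 6; the function name and the data layout are assumptions of this illustration, not part of the original text.

    # A minimal sketch of the type-defining attribute a_T for T_inf
    # (the "infant" type, defined by the formula a_age <= 28).
    def infant_type_attribute(row):
        """row: dict of attribute values for one object of the TIS.
        Returns YES if the object is of type T_inf, NO if it is not, and
        UNKNOWN when the defining formula cannot be evaluated at all."""
        age = row.get("a_age")
        if age is None:            # the attribute is not computable here
            return "UNKNOWN"
        return "YES" if age <= 28 else "NO"

    rows = [{"a_age": 10}, {"a_age": 400}, {}]
    print([infant_type_attribute(r) for r in rows])   # ['YES', 'NO', 'UNKNOWN']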

4.9 Patterns

If an attribute of a complex object collection is a binary attribute (it describes a certain concept), then the formula which makes it possible to determine its values is usually called a pattern for the concept. Below, we present a pattern definition assuming that there is given a language L defining features of complex objects of a determined type, defined using Tarski's approach (see, e.g., [304]).

Definition 11 (A pattern). Let S be a collection of complex objects of a fixed type T. We assume that C ⊆ S is a concept and L is a language of formulas defining (under a given interpretation of L defined by a satisfiability relation) features of complex objects from the collection S (i.e., subsets of S defined by formulas under the given interpretation).

1. A formula α ∈ L is called a pattern for the concept C expressed in the language L if there exists s ∈ S such that s ∈ C and s |=L α (s satisfies α in the language L).
2. If s |=L α then we say that s matches the pattern α or s supports the pattern α. Otherwise s does not match (does not support) the pattern α.
3. A pattern α ∈ L is called exact relative to the concept C when for any s ∈ S, if s |=L α then s ∈ C. Otherwise, the pattern α is called inexact.
4. The number support(α) = card(|α|L) is called the support of the pattern α.
5. The confidence of the pattern α relative to the concept C is denoted by confidenceC(α) and defined in the following way:

confidenceC(α) = card({s ∈ C : s |=L α}) / support(α).
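A minimal sketch of items 4 and 5 of Definition 11 in Python, with the satisfiability relation |=L represented, for illustration only, by an ordinary predicate over objects:

    # support(alpha) = card(|alpha|_L): the number of objects satisfying alpha.
    def support(collection, pattern):
        return sum(1 for s in collection if pattern(s))

    # confidence_C(alpha) = card({s in C : s |=_L alpha}) / support(alpha).
    def confidence(collection, concept, pattern):
        matching = support(collection, pattern)
        in_concept = sum(1 for s in collection if s in concept and pattern(s))
        return in_concept / matching if matching else 0.0

    S = [1, 2, 3, 4, 5, 6]
    C = {2, 4, 6}                    # the concept, a subset of S
    alpha = lambda s: s > 3          # an inexact pattern (5 matches but 5 not in C)
    print(support(S, alpha), confidence(S, C, alpha))   # 3, 2/3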

Thus patterns are a simple but convenient way of defining complex object properties, and they may be applied to the construction of information systems representing complex object collections. Despite the fact that, according to Definition 11, patterns are supposed to describe the properties of complex objects belonging to a given complex object collection S, they may also describe the properties of complex objects from outside the collection S. However, these always have to be complex objects of the same type as the objects gathered in the collection S. Patterns may be defined by experts on the basis of domain knowledge. In such a case the expert must define the needed formula in a chosen language, which makes it possible to test objects for their membership in the pattern. In the general case, patterns may also be approximated with the help of classifiers. In this case, the expert is required to give only examples of objects belonging to the pattern and counterexamples of objects not belonging to it. Then, however, attributes which may be used to approximate the pattern are needed. Sometimes in an information system representing a complex object collection one of the attributes is distinguished. For example, it may represent a concept distinguished by the expert which requires approximation using the rest of the attributes. Such an information system is then called a decision table (see Section 2.1).


The decision table constructed for a complex object collection may be useful in the construction of a classifier which ensures the approximation of the distinguished decision attribute. The approximation may be performed with the help of classical classifiers (see Section 2) or stratifying classifiers (see Section 3). As we wrote before, the language formulas serving to define complex object properties may be satisfied by complex objects from outside a given collection of complex objects. Thus, any complex object of the same type as the complex objects from the given collection may be classified using the above mentioned classifier.

4.10 Approximation of Concepts from Ontology

The method of using ontology for the approximation of concepts presented in this section consists in approximating concepts from a higher level of an ontology using concepts from its lower levels. For the concepts from the lowest hierarchical level of the ontology (the sensor level), which do not depend on the remaining concepts, it is assumed that the so-called sensor attributes are also available, which make it possible to approximate these concepts on the basis of supplied positive and negative examples of objects. Below, we present an example of concept approximation using sensor attributes in a certain ontology.

Example 7. Let us consider the ontology from Fig. 11. Each vehicle satisfying an established condition expressed in natural language belongs to some concept of this ontology. For example, to the concept Safe overtaking belong vehicles which overtake safely, while to the concept Possibility of safe stopping before the crossroads belong vehicles whose speed is low enough that they may safely stop before the crossroads. The concepts of the lowest ontology level, that is, Safe distance from the opposite vehicle during overtaking, Possibility of going back to the right lane, Possibility of safe stopping before the crossroads, Safe distance from the front vehicle, and Forcing the right of way, are sensor concepts, that is, they may be approximated directly using sensor data. For instance, the concept Possibility of safe stopping before the crossroads may be approximated using such sensor attributes as vehicle speed, vehicle acceleration, distance to the crossroads, visibility, and road humidity. On the higher levels of the ontology, however, sensor attributes cannot be used directly to approximate concepts, because the semantical distance of the approximated concepts from the sensor attributes is too large and they are defined on different levels of abstraction. For example, if we wish to approximate the higher-level concept of safe driving while on the sensor level we have at our disposal only attributes giving simple parameters of vehicle driving (that is, location, speed, acceleration, etc.), then it is hard to expect that these parameters make the approximation of such a complex concept as safe driving possible. That is why in this paper we propose a method of approximating a concept from a higher level of an ontology only with the help of concepts from the ontology level that is lower by one, which are closer to the concept under approximation than the sensor data.

Fig. 11. An ontology as a hierarchy of concepts for approximation (the hierarchy contains the concepts Safe driving, Safe overtaking, Forcing the right of way, Safe distance from the front vehicle, Safe distance from the opposite vehicle during overtaking, Possibility of going back to the right lane, and Possibility of safe stopping before the crossroads, built over sensor data)

The proposed approach is based on the assumption that a concept from the higher ontology level is "not too far" semantically from the concepts lying on the lower ontology level. "Not too far" means that it can be expected that the concept from the higher ontology level can be approximated using concepts from the lower level for which classifiers have already been built. The proposed method of approximating concepts of the higher ontology level is based on constructing a decision table for the concept from the higher ontology level, whose objects represent positive and negative examples of the concept approximated on this level; at the same time a stratifying classifier is constructed for this table. In this paper, such a table is called a concept approximation table of the higher ontology level concept. One of the main problems related to the construction of the concept approximation table is providing positive and negative examples of the approximated concept on the basis of data sets. It might seem that the objects which are the positive and negative examples of the lower ontology level concepts may be used directly (without any changes) for concept approximation on the higher ontology level. If this were possible, any ontology concept could be approximated using the positive and negative examples available in the data sets. However, in the general case, because of semantical differences between the concepts and examples on different levels of the ontology, objects of the lower level cannot be directly used to approximate concepts of the higher ontology level. For example, if on a higher level of a concept hierarchy we have a concept concerning a group of vehicles, and on a lower one concepts concerning single vehicles, then usually the properties of single vehicles (defined in order to approximate concepts of the lower ontology levels) are not sufficient to describe the properties of a whole group of vehicles. Difficulties with approximating concepts on the higher ontology level with the help of object properties from the lower ontology level also appear when the higher ontology level contains concepts concerning a different (e.g., longer) period of time than the concepts on the lower ontology level. For example, on the higher level we may examine a concept concerning a time window (a certain time period), while on the lower level there are concepts concerning a certain instant, i.e., a time point (see Section 6). That is why in this paper we propose a method of constructing the objects of the approximation table of a concept from the higher ontology level (that is, the positive and negative examples of this concept) by arranging sets of objects which are positive and negative examples of the lower ontology level concepts. These sets must be constructed in such a way that the properties of the sets, considered together with the relationships between their elements, can be used for the approximation of the higher ontology level concept. However, it should be stressed here that the complex objects mentioned above (being positive and negative examples of concepts from the higher and lower ontology levels) are only representations of real-life objects. In other words, we assume that the relational structures express the result of the perception of real-life objects (see Section 4.5 and Fig. 12). Therefore, by the features of complex objects represented by relational structures we mean the features of these structures. Such features are defined using attributes from the information systems of the higher and lower ontology levels.

Fig. 12. Real-life complex objects and representations of their structures

Fig. 13. The general scheme for construction of the concept approximation table. Read bottom-up, the scheme leads from an information system A = (U, A) containing positive and negative examples of the lower ontology level concepts (label L1), through relational structures defined for the attributes from the set A (L2), a global relational structure S = (U, R) (L3), and an ERS-language for extracting relational structures (L4), to a new set of objects represented by extracted relational structures (L5); an FRS-language together with a constraint relation selects the features of these structures and the structures acceptable by the constraints (L6), yielding an RS-information system (L7); clusters of relational structures (L8) are then extracted using an ECRS-language (L9), their features are selected with an FCRS-language and a constraint relation (L10), which yields a CRS-information system (L11); finally, the decision attribute is added (L12) and the concept approximation table for the concept C from the higher ontology level is obtained (L13). Each step is supported by the relevant domain knowledge.
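Before walking through the scheme of Fig. 13, the following minimal sketch in Python illustrates one recurring step under simplifying assumptions: membership information supplied by classifiers already built for the lower ontology level concepts is appended to the attribute set of the information system as new attributes. All names and thresholds below are invented for illustration and are not the original implementation.

    # A minimal sketch: outputs of lower-level classifiers become new attributes.
    def extend_with_concept_attributes(rows, lower_level_classifiers):
        """rows: list of dicts, one per object, with the sensor attributes.
        lower_level_classifiers: dict mapping a concept name to a function
        returning the (degree of) membership of an object in that concept."""
        extended = []
        for row in rows:
            new_row = dict(row)
            for concept, classify in lower_level_classifiers.items():
                new_row[f"member_{concept}"] = classify(row)
            extended.append(new_row)
        return extended

    # Toy sensor concepts loosely inspired by Example 7 (thresholds invented):
    classifiers = {
        "safe_stop": lambda r: r["speed"] < 40 and r["dist_to_crossroads"] > 50,
        "safe_distance": lambda r: r["dist_to_front_vehicle"] > 30,
    }
    rows = [{"speed": 30, "dist_to_crossroads": 80, "dist_to_front_vehicle": 20}]
    print(extend_with_concept_attributes(rows, classifiers))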


In Fig. 13, we illustrate the general scheme for construction of the concept approximation table for a given concept C which depends, in some ontology, on concepts from the lower level (relative to the concept C). In the further part of this subsection, this scheme is explained in detail. As we have written before, in this paper we assume that for the concepts of the lower ontology level a collection of objects which are positive and negative examples of these concepts is available. Let us also assume that they are objects of a certain information system A = (U, A), where the attributes from the set A represent all available properties of these objects (see label L1 in Fig. 13). It should be stressed here that information about the membership degrees of objects from the set U in the concepts from the lower ontology level may serve to define new attributes which are appended to the set A. However, providing such information for an arbitrarily chosen object (also for an object which will appear in the future) requires a previous approximation of the concepts of the lower level with the help of classical or stratifying classifiers. At this point, we assume that for the concepts of the lower ontology level such classifiers have already been constructed, while our aim is to approximate the concept of the higher ontology level. Incidentally, in the simplest case, the concepts of the lower ontology level may be approximated with the help of sensor attributes (see Example 7). Apart from the attributes defined on the basis of the membership of objects in the concepts or in the layers of the concepts, there may be other attributes in the set A. For example, there may be an attribute identifying the recording time of the values of the remaining attributes from the set A for a given object from the set U, or an attribute unambiguously identifying individual objects or groups of objects from the set U. Objects being positive and negative examples of the lower ontology level concepts can very often be used to define new objects represented by relational structures, by using the available information about these objects. Relations defined in such structures may also be used to filter (extract) sets of objects or, in a more general case, sets of relational structures or their clusters, as new objects for a higher level concept. Relations among objects may be defined on the basis of the attributes from the information system A, with the use of relational structures defined on the value sets of the attributes from the set A (see label L2 in Fig. 13). For example, the value set Vat of the attribute at from Example 2 is a subset of the set of integer numbers. Therefore, it is the domain of a relational structure (Vat, {Rat}), where the relation Rat is defined in the following way: ∀(t1, t2) ∈ Vat × Vat : t1 Rat t2 ⇔ t1 ≤ t2. The relation Rat may be, in a natural way, generalized to the relation Rt ⊆ U × U in the following way: ∀(u1, u2) ∈ U × U : u1 Rt u2 ⇔ at(u1) Rat at(u2). Let us notice that the relation Rt orders in time the objects of the information system from Example 2. Moreover, it is also worthwhile mentioning that the relation Rt is defined for any pair of objects (u1, u2) ∈ U∗ × U∗, where U ⊆ U∗ is an extension of the set of objects (if we assume that the attribute values for such objects can be obtained) (see Fig. 10). Analogously, a relation ordering objects in time on the basis of the attribute t from the information system from Example 1 may be obtained. Obviously, relations defined on the basis of the attributes of an information system A are not always related to ordering objects in time. The example below illustrates how structural relations may be defined on the basis of the distance between objects.

Example 8. Let us consider an information system A = (U, A) whose object set U = {u1, ..., un} is a finite set of vehicles going from a town T1 to a town T2, and whose attribute set A contains two attributes d and v. The attribute d represents the distance of a given vehicle from the town T2, while the attribute v represents the speed of a given vehicle. The value sets of these attributes are subsets of the set of real numbers. Moreover, the set Vd is the domain of a relational structure (Vd, {Rdε}), where the relation Rdε is defined in the following way: ∀(v1, v2) ∈ Vd × Vd : v1 Rdε v2 ⇔ |v1 − v2| ≤ ε, where ε is a fixed real number greater than 0. The relation Rdε may be, in a natural way, generalized to the relation Rε ⊆ U × U in the following way: ∀(u1, u2) ∈ U × U : u1 Rε u2 ⇔ d(u1) Rdε d(u2). As we see, a pair of vehicles belongs to the relation Rε when the vehicles are distant from each other by no more than ε. Therefore, we call the relation Rε the nearness relation of vehicles and the parameter ε the nearness parameter of vehicles. The relation Rε may be defined for different values of ε. That is why in the general case the number of nearness relations is infinite. However, if it is assumed that the parameter ε takes values from a finite set (e.g., ε = 1, 2, ..., 100), then the number of nearness relations is finite. If Rε is a nearness relation defined in the set U × U (where ε > 0), then the set of vehicles U is the domain of the pure relational structure S = (U, {Rε}). Exemplary concepts characterizing the properties of individual vehicles may be high (average, low) speed of the vehicle or high (average, low) distance from the town T2. These concepts are defined by an expert and may be approximated on the basis of the sensor attributes d and v. However, more complex concepts may be defined which cannot be approximated with the help of these attributes. An example of such a concept is vehicle driving in a traffic jam. A traffic jam is created by a number of vehicles blocking one another until they can scarcely move (see, e.g., [305]). It is easy to notice that on the basis of observing a vehicle's membership in the above mentioned sensor concepts (concerning a single vehicle), and even of observing the values of the sensor attributes for a given vehicle, it is not possible to recognize whether the vehicle is driving in a traffic jam or not. It is necessary to examine the neighborhood of a given vehicle, and more precisely to check whether there are other vehicles right behind and before the examined one. Therefore, to approximate the concept vehicle driving in a traffic jam we need a certain kind of grouping of vehicles, which may be performed with the help of the above mentioned relation Rε (see Example 9). Let us add that in recognizing a vehicle's membership in the concept vehicle driving in a traffic jam it also matters that the speed of the examined vehicle and the speeds of the vehicles in its neighborhood are available. However, to simplify the examples, in this subsection we assume that in recognizing a vehicle's membership in the concept vehicle driving in a traffic jam it is sufficient to check for the presence of other vehicles in the neighborhood of the given vehicle, and considering the speeds of these vehicles is not necessary.

Thus, for a given information system A = (U, A) representing positive and negative examples of the lower ontology level concepts, a pure relational structure S = (U, R) may be defined (see label L3 in Fig. 13). Next, using the relations from the family R, a special language may be defined in which patterns are expressed which describe sets of objects (new concepts) for the needs of the approximation of the higher ontology level concepts (see label L4 in Fig. 13). The extracted sets of objects of the lower level are usually themselves nontrivial relational structures, for the relations determined on the whole set of objects of the lower ontology level are in a natural way defined on the extracted sets. Time windows (see Section 6.4) or sequences of time windows (see Section 6.15) may be relational structures of this kind. In modeling, we use pure relational structures (without functions) over the sets of objects extracted from the initial relational structures whose domains are sets of objects of the lower ontology level. The reason is that these structures are defined by extending relational structures defined on information about objects of the lower ontology level, and even if functions are defined in the latter structures, then after the extension we obtain relations over objects rather than functions.

Example 9. Let us consider the information system A = (U, A) from Example 8. Let Rε be the nearness relation defined in the set U × U for a fixed ε > 0. Then the vehicle set U is the domain of the relational structure S = (U, {Rε}), and the relation Rε may be used to extract relational structures from the structure S. In order to do this we define the family of subsets F(S) of the set U in the following way: F(S) = {Nε(u1), ..., Nε(un)}, where Nε(ui) = {u ∈ U : ui Rε u}, for i = 1, ..., n. Let us notice that each set from the family F(S) is connected with one of the vehicles from the set U. Therefore, each of the sets from the family F(S) should be interpreted as the set of vehicles which are distant from the established vehicle u by no more than the established nearness parameter ε. In other words, each such set is the set of vehicles which are in the neighborhood of a given vehicle, with the established radius of the neighborhood area. For instance, if ε = 20 meters, then the vehicles u3, u4, u5, u6, and u7 belong to the neighborhood of the vehicle u5 (see Fig. 14). Finally, let us notice that each set N ∈ F(S) is the domain of a relational structure (N, {Rε}). Thus, we obtain a family of relational structures extracted from the structure S.
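A minimal sketch of Example 9 in Python (the names and the distance values are illustrative): the nearness relation Rε is lifted from the attribute d, and the neighborhoods Nε(u) forming the family F(S) are extracted.

    # u1 R_eps u2  <=>  |d(u1) - d(u2)| <= eps
    def near(u1, u2, d, eps):
        return abs(d[u1] - d[u2]) <= eps

    # F(S) = {N_eps(u) : u in U}, with N_eps(u) = {v in U : u R_eps v}
    def neighborhoods(universe, d, eps):
        return {u: {v for v in universe if near(u, v, d, eps)} for u in universe}

    # Vehicles with their distances (in meters) from the town T2:
    d = {"u3": 100, "u4": 90, "u5": 80, "u6": 70, "u7": 65, "u8": 200}
    F = neighborhoods(d.keys(), d, eps=20)
    print(sorted(F["u5"]))   # u3..u7 are within 20 m of u5; u8 is not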


Fig. 14. A vehicle and its neighborhood (vehicles u1, ..., u8 on the road; the neighborhood N(u5) covers the vehicles within 20 m of the vehicle u5)

The language in which, using the relational structures, we define formulas expressing the extracted relational structures is called a language for extracting relational structures (ERS-language). The formulas of an ERS-language determine a type of relational structures, i.e., the relational structures which can appear in the constructed information system. These new relational structures represent the structure of more compound objects composed out of less compound ones. We call them extracted relational structures (see label L5 in Fig. 13). In this paper, we use the three following ERS-languages:

1. the language assigned to extract trivial relational structures such as presented in Definition 3; this method of relational structure extraction is used in the case of construction of the concept approximation table using stratifying classifiers (see Section 5.2),
2. the ETW-language, assigned to extract relational structures which are time windows (see Section 6.4),
3. the ESTW-language, assigned to extract relational structures which are sequences of time windows (see Section 6.15).

However, the above mentioned process of extracting relational structures is carried out in order to approximate a concept of the higher ontology level with the help of lower ontology level concepts. Therefore, to extract relational structures it is necessary to use information about the membership of objects of the lower level in the concepts from this level. Such information may be made available for any tested object thanks to the application of the previously created classifiers for the lower ontology level concepts (see Section 6.4 and Section 6.15). For relational structures extracted using an ERS-language, features (properties, attributes) may be defined using a specially constructed language, which we call a language for defining features of relational structures (FRS-language) (see label L6 in Fig. 13). The FRS-language leads to an information system whose objects are the extracted relational structures and whose attributes are the features of these structures. Such a system will be called an information system of extracted relational structures (RS-information system) (see label L7 in Fig. 13). However, from the point of view of domain knowledge, not all objects (relational structures) extracted using an ERS-language are appropriate for the approximation of a given concept of the higher level of the ontology. For instance, if we approximate the concept of safe overtaking, it is reasonable to use objects representing examples of vehicles that are in the process of an overtaking maneuver, for objects representing vehicles which are not overtaking do not help at all to distinguish the pairs of vehicles which take part in a safe overtaking from the pairs of vehicles which overtake unsafely. For the above reason, that is, to eliminate objects which are unreal or unreasonable, the so-called constraints are defined, which are formulas defined on the basis of the object features used to create the attributes of the RS-information system. The constraints determine which objects may be used in order to obtain a concept example for the higher level and which cannot be used (see label L6 in Fig. 13). In this paper constraints are represented by a constraint relation and are defined as formulas of the language GDL (see Definition 5) on the basis of the attributes appearing in the RS-information system. The example below illustrates how RS-information systems may be defined.

Example 10. Let us consider the information system A = (U, A), the relational structure S = (U, {Rε}), and the family F(S) extracted from the relational structure S (see Example 9). We construct an information system F = (F(S), A′) such that A′ = {af, ab}, where for any object Nε(u) ∈ F(S) the value af(Nε(u)) is the number of vehicles in the neighborhood Nε(u) going in the right lane before the vehicle u, and ab(Nε(u)) is the number of vehicles in the neighborhood Nε(u) going in the right lane behind the vehicle u. Let us notice that the attributes of the set A′ were chosen in such a way that the objects of the information system F are relevant for approximating the concept vehicle driving in a traffic jam. For example, if ε = 20 meters and for the neighborhood of the vehicle u4 the values of af and ab both equal 2, then the vehicle u4 is driving in a traffic jam (see Fig. 15). Whereas, if both values equal 0, then the vehicle is not driving in a traffic jam (see vehicle u7 in Fig. 15). For the system F we define the following formula: φ = ((af > 0) ∨ (ab > 0)) ∈ GDL(F). It is easy to notice that the formula φ is not satisfied only by the neighborhoods of vehicles which are definitely not driving in a traffic jam. Therefore, in terms of the classification of neighborhoods to the concept driving in a traffic jam, these neighborhoods may be called trivial. Hence, the formula φ may be treated as a constraint formula which is used to eliminate the above mentioned trivial neighborhoods from F. After such reduction we obtain an RS-information system F′ = (U′, A′), where U′ = {u ∈ F(S) : u |=GDL(F) φ}.
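A minimal sketch of Example 10 in Python (the way road positions encode "before" and "behind" is an assumption of this illustration, as is the restriction to one lane): the attributes af and ab are computed for each neighborhood and the constraint formula φ eliminates the trivial neighborhoods.

    # Build the system F = (F(S), {a_f, a_b}) over vehicle neighborhoods.
    def rs_information_system(neighborhoods, position):
        """neighborhoods: dict u -> set of vehicles in N_eps(u) (one lane assumed).
        position: dict giving each vehicle's coordinate along the road.
        a_f(u) / a_b(u) count the vehicles in N_eps(u) before / behind u."""
        system = {}
        for u, nbhd in neighborhoods.items():
            af = sum(1 for v in nbhd if position[v] > position[u])
            ab = sum(1 for v in nbhd if position[v] < position[u])
            system[u] = {"a_f": af, "a_b": ab}
        return system

    # Keep only objects satisfying phi = (a_f > 0) or (a_b > 0).
    def apply_constraint(system):
        return {u: row for u, row in system.items()
                if row["a_f"] > 0 or row["a_b"] > 0}

    position = {"u2": 110, "u3": 100, "u4": 90, "u5": 80, "u6": 70, "u7": 300}
    nbhds = {"u4": {"u2", "u3", "u4", "u5", "u6"}, "u7": {"u7"}}
    print(apply_constraint(rs_information_system(nbhds, position)))
    # u4 gets a_f = 2 and a_b = 2 and is kept; the trivial u7 is eliminated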

Fig. 15. Two vehicle neighborhoods (the neighborhood N(u4) contains vehicles before and behind the vehicle u4, spaced 10 m apart, whereas the neighborhood N(u7) contains no other vehicles)

Let us notice that the definition of attributes of extracted relational structures leads to a granulation of relational structures. For example, we obtain granules of relational structures defined by the indiscernibility relation determined by the new attributes. A question arises of how to construct languages defining features of relational structures, particularly when it comes to the approximation of spatio-temporal concepts, that is, those whose recognition requires following the changes of complex objects over time. One of the more developed languages of this type is a temporal logic language. In the literature many systems of temporal logics have been defined which offer many useful mechanisms (see, e.g., [183, 184, 185]). Therefore, in this paper, we use temporal logics to define our own languages describing features of relational structures. Especially interesting for us are the elements appearing in definitions of temporal logics of linear time (e.g., Linear Temporal Logic) and of branching time (e.g., Branching Temporal Logic). Temporal logic of linear time assumes that time has a linear nature, that is, one without branches. In other words, it describes only one world in which any two events are sequentially ordered. In linear time logics the following four temporal operators are introduced: □, ♦, ○, and U. Generally speaking, these operators enable us to determine the satisfiability of temporal formulas over a certain time period. The operator □ (often also marked as G) expresses the satisfiability of a formula at all instants (states) of the time period under observation. The operator ♦ (often marked as F) expresses the satisfiability of a formula at least at one instant (state) of the time period under observation. The operator ○ (often marked as X) expresses the satisfiability of a formula at the instant (state) right after the instant of reference. Finally, the operator U (often marked as U) expresses the satisfiability of a formula until another formula is satisfied. Therefore, linear time temporal logics may be used to express object properties which aggregate the behavior of complex objects observed over a certain period of linear time, e.g., features of time windows or features of temporal paths in behavioral graphs (see Section 6.6 and Section 6.17). Temporal logic of branching time, however, assumes that time has a branching nature, that is, at a given instant it may branch into parallel worlds representing possible various future states. In branching time logics two additional path operators, A and E, are introduced. They enable us to determine the satisfiability of temporal formulas for various variants of the future. The first operator means that the temporal formula before which the operator occurs is satisfied for all variants of the future. The second means that the formula is satisfied for a certain variant of the future. The path operators combined with the three temporal operators G, F, and X give six possible combinations: AG, AF, AX, EG, EF, and EX. These combinations give opportunities to describe multi-variant behaviors extended over time. Therefore, temporal logics of branching time may be used to express such complex object properties as aggregate multi-variant behaviors of objects changing over time (e.g., features of clusters of time windows or features of clusters of temporal paths in behavioral graphs) (see Section 6.8 and Section 6.19).

We assume that in extracted relational structures the flow of time has a linear character. Therefore, languages using elements of temporal logics with linear time are applied to define their features. In this paper, we use the three following languages defining features of extracted relational structures:

1. the language assigned to define features of trivial relational structures such as in Definition 3; this method of defining features of relational structures is applied together with the extraction of trivial relational structures (see Definition 3) and is based on using the features of objects taken from the information system as the features of the relational structures after extraction (the objects in a given information system and the elements of the domains of the relational structures extracted from this system are the same) (see Section 5.2),
2. the language FTW, using elements of a temporal logic language and assigned to define the properties of relational structures which are time windows (see Section 6.6),
3. the language FTP, also using elements of a temporal logic language and assigned to define the properties of relational structures which are paths in behavioral graphs (see Section 6.17).

However, the objects of RS-information systems are often not suitable for using their properties to approximate concepts of the higher ontology level. This happens because the number of these objects is too large and their descriptions are too detailed. Hence, if they were applied to approximate the concept from the higher ontology level, the coverage of the constructed classifier would be too low, that is, the classifier could classify too few tested objects. Apart from that, a problem of computational complexity would appear: due to the large number of objects of such an information system, the number of objects in the concept approximation table for structured objects (see the further part of this subsection) would be too large to construct a classifier effectively. That is why a clustering of such objects is applied, leading to a family of object clusters (see label L8 in Fig. 13). The example below illustrates in a very simple way how clusters of relational structures may be defined.

Example 11. Let F′ = (U′, A′) be the RS-information system from Example 10. We are going to define clusters of the vehicles' neighborhoods. For this purpose we propose a relation Rσ ⊆ U′ × U′ defined in the following way: ∀(u1, u2) ∈ U′ × U′ : u1 Rσ u2 ⇔ |af(u1) − af(u2)| ≤ σ ∧ |ab(u1) − ab(u2)| ≤ σ, where σ is a fixed integer number greater than 0. As we see, to the relation Rσ belong those pairs of vehicle neighborhoods which differ only slightly (by no more than σ) in terms of the values of the attributes af and ab. Therefore, the relation Rσ is called the nearness relation of vehicle neighborhoods and the parameter σ is called the nearness parameter of vehicle neighborhoods. The relation Rσ may be defined for different values of σ. That is why in the general case the number of such nearness relations is infinite. However, if it is assumed that the parameter σ takes values from a finite set (e.g., σ = 1, 2, ..., 10), then the number of nearness relations is finite. Let Rσ be the nearness relation of neighborhoods determined for an established σ > 0. Then the set of vehicle neighborhoods U′ is the domain of a pure relational structure S′ = (U′, {Rσ}). The relational structure S′ is the starting point for extracting clusters of vehicle neighborhoods. In order to do this we define the family of subsets F(S′) of the set U′ in the following way: F(S′) = {Nσ(u1), ..., Nσ(un)}, where Nσ(ui) = {u ∈ U′ : ui Rσ u}, for i = 1, ..., n. Let us notice that each of the sets from the family F(S′) is connected with one vehicle neighborhood from the set U′. For any u ∈ U′ the set Nσ(u) will also be denoted by ū, for short. Moreover, these sets are interpreted as clusters of neighborhoods which are distant from the central neighborhood of the cluster by no more than the established nearness parameter. In other words, each such family is a cluster of vehicle neighborhoods which are close to a given neighborhood, with the established nearness parameter. For instance, if ε = 20 meters and σ = 1, then the neighborhoods Nε(u3), Nε(u5), and obviously the neighborhood Nε(u4), belong to the neighborhood cluster Nσ(u4) (see Fig. 16), whereas the neighborhood Nε(u7) does not belong to this cluster. Finally, let us notice that each set X ∈ F(S′) is the domain of a relational structure (X, {Rσ}). Hence, we obtain a family of relational structures extracted from the structure S′.

Fig. 16. Four vehicle neighborhoods (the neighborhoods N(u3), N(u4), and N(u5) are close to one another and form the cluster around N(u4), whereas the neighborhood N(u7) lies apart)

Grouping of the objects of an RS-information system may be performed using a language for extracting clusters of relational structures chosen by an expert (ECRS-language). The formulas of an ECRS-language express families of clusters of relational structures from the input RS-information system (see label L9 in Fig. 13). Such formulas can be treated as types of clusters of relational structures which will create objects in a new information system. In an ECRS-language we may define a family of patterns corresponding to a family of expected clusters. In this paper, the two following ECRS-languages are used:


1. the language ECTW, assigned to define clusters of relational structures which are families of time windows (see Section 6.8),
2. the language ECTP, assigned to define clusters of relational structures which are families of paths in behavioral graphs of complex objects (see Section 6.19).

For clusters of relational structures extracted in such a way, features may be defined using a specially constructed language, which we call a language for defining features of clusters of relational structures (FCRS-language) (see label L10 in Fig. 13). A formula of this language is satisfied (or unsatisfied) on a given cluster of relational structures if and only if it is satisfied on all relational structures from this cluster. The FCRS-language leads to an information system whose objects are the extracted clusters of relational structures and whose attributes are the features of these clusters (see label L11 in Fig. 13). Such an information system we call an information system of clusters of relational structures (CRS-information system). Similarly to the case of the relational structures extracted using an ERS-language, not all objects (clusters of relational structures) extracted using an ECRS-language are appropriate for the approximation of a given concept of the higher level of the ontology. Therefore in this case we also define constraints, which are formulas defined on the basis of the object features used to create the attributes of the CRS-information system. Such constraints determine which objects may be used in order to obtain a concept example for the higher level and which cannot be used. The example below illustrates how CRS-information systems may be defined.

Example 12. Let F(S′) be the family extracted from the relational structure S′ (see Example 11). One can construct an information system G = (F(S′), A″), where A″ = {a′f, a′b} and for any ū ∈ F(S′) the values of the attributes a′f and a′b are computed as the arithmetical averages of the values of the attributes af and ab over the neighborhoods belonging to the cluster represented by ū. The attributes of the set A″ were chosen in such a way that the objects of the system G are appropriate for the approximation of the concept vehicle driving in a traffic jam. For example, if ε = 20 meters, σ = 1, and the values a′f(ū) and a′b(ū) are close to 2, then the neighborhoods from the cluster represented by the object ū contain vehicles which definitely drive in a traffic jam. Whereas, if a′f(ū) and a′b(ū) are close to 0, then the neighborhoods from the cluster represented by the object ū contain vehicles which definitely do not drive in a traffic jam. For the system G we define the following formula: Φ = ((a′f > 0.5) ∨ (a′b > 0.5)) ∈ GDL(G). It is easy to notice that the formula Φ is not satisfied only by those clusters which contain neighborhoods of vehicles definitely not driving in a traffic jam. Therefore, in terms of the classification of clusters to the concept driving in a traffic jam, these clusters may be called trivial. Hence, the formula Φ may be treated as a constraint formula which is used to eliminate the above mentioned trivial clusters from G. After such reduction we obtain a CRS-information system G′ = (U″, A″), where U″ = {ū ∈ F(S′) : ū |=GDL(G) Φ}.
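A minimal sketch of Examples 11 and 12 in Python (names and toy values are illustrative): the clusters Nσ(u) are formed with the nearness relation Rσ, the cluster attributes are computed as averages, and the constraint formula Φ eliminates the trivial clusters.

    # N_sigma(u) = {v : |a_f(u)-a_f(v)| <= sigma and |a_b(u)-a_b(v)| <= sigma}
    def clusters(system, sigma):
        return {u: [v for v, rv in system.items()
                    if abs(system[u]["a_f"] - rv["a_f"]) <= sigma
                    and abs(system[u]["a_b"] - rv["a_b"]) <= sigma]
                for u in system}

    # Average a_f and a_b over each cluster; keep clusters satisfying Phi.
    def crs_information_system(system, sigma):
        result = {}
        for u, members in clusters(system, sigma).items():
            af = sum(system[v]["a_f"] for v in members) / len(members)
            ab = sum(system[v]["a_b"] for v in members) / len(members)
            if af > 0.5 or ab > 0.5:       # the constraint formula Phi
                result[u] = {"a_f": af, "a_b": ab}
        return result

    system = {"n3": {"a_f": 2, "a_b": 1}, "n4": {"a_f": 2, "a_b": 2},
              "n5": {"a_f": 1, "a_b": 2}, "n7": {"a_f": 0, "a_b": 0}}
    print(crs_information_system(system, sigma=1))   # the trivial cluster of n7 is dropped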

Unlike in single relational structures, in clusters of relational structures the flow of time has a branching character, because in the various elements of a given cluster we observe various variants of the dynamically changing reality. Therefore, to define the properties of clusters of relational structures we use elements of the language of temporal logics of branching time. In this paper, we use the two following languages defining cluster properties:

1. the language FCTW, using elements of a temporal logic language and assigned to define features of clusters which are families of time windows (see Section 6.8),
2. the language FCTP, also using elements of a temporal logic language and assigned to define features of clusters which are families of temporal paths in behavioral graphs, that is, subgraphs of behavioral graphs (see Section 6.19).

Finally, we assume that to each object acceptable by the constraints an expert adds a decision value determining whether the given object belongs to the approximated concept of the higher level or not (see label L12 in Fig. 13). After adding the decision attribute we obtain the concept approximation table for the concept from the higher ontology level (see label L13 in Fig. 13).

The notion of the concept approximation table for a concept from the higher ontology level, introduced above for unstructured complex objects, may be generalized to the case of concept approximation for structured objects (that is, objects consisting of parts). Let us assume that the concept is defined for structured objects of a type T which consist of parts being complex objects of types T1, ..., Tk. In Fig. 17 we illustrate the general scheme for the construction of the concept approximation table for such structured objects. In order to construct a table for approximating a concept defined for structured objects of the type T, CRS-information systems are constructed for all types of the parts of the structured objects, that is, for the types T1, ..., Tk (see labels L3-1, ..., L3-k in Fig. 17). Next, these systems are joined in order to obtain a table for approximating the concept of the higher ontology level determined for structured objects. The objects of this table are obtained by arranging (linking) all possible objects of the linked information systems (see label L4 in Fig. 17). From the mathematical point of view such an arrangement is a Cartesian product of the sets of objects of the linked information systems. However, from the point of view of domain knowledge not all object links belonging to such a Cartesian product are possible and reasonable (see [78, 84, 186, 187]). For instance, if we approximate the concept of overtaking, it is reasonable to arrange objects into such pairs of vehicles as drive close to each other. For the above reason, constraints are defined, which are formulas defined on the basis of the properties of the arranged objects. The constraints determine which objects may be arranged in order to obtain a concept example for the higher level and which cannot be arranged. Additionally, we assume that to each object arrangement acceptable by the constraints an expert adds a decision value determining whether the given arrangement belongs to the approximated concept of the higher level or not (see label L4 in Fig. 17). A table constructed in such a way is to serve the approximation of a concept determined on a set of structured objects (see label L5 in Fig. 17).

Fig. 17. The general scheme for construction of the concept approximation table for structured objects: starting from the information system A = (U, A) with positive and negative examples of the lower ontology level concepts (label L1) and relational structures for the attributes from the set A defined by introducing a family of relations in U (L2), relational structures S1, ..., Sk are built whose domains are the sets of parts of the types T1, ..., Tk, together with a structure Sk+1 whose domain is a Cartesian product of the parts of the types T1, ..., Tk (L2-1, ..., L2-k, L2-c); from these structures, CRS-information systems of clusters of relational structures are extracted (L3-1, ..., L3-k, L3-c); the objects (clusters) of all CRS-information systems are then linked, the cluster arrangements acceptable by the constraint relation are selected, and the decision attribute is added (L4), which yields the concept approximation table for the concept from the higher ontology level defined for structured objects of the type T (L5). Each step is supported by the relevant domain knowledge.

However, it frequently happens that in order to describe a structured object, apart from describing all the parts of this object, the relations between the parts of this object should also be described. Therefore, in constructing the concept approximation table for a structured object, an additional CRS-information system is constructed whose attributes describe the whole structured object in terms of the relations between its parts (see label L3-c in Fig. 17). In the approximation of a concept concerning structured objects, this system is arranged together with the other CRS-information systems constructed for the individual parts of the structured objects (see label L4 in Fig. 17). Similarly to the case of the concept approximation table for unstructured objects, the constraint relation is usually defined as a formula in the language GDL (see Definition 5) on the basis of the attributes appearing in the obtained table. However, the constraint relation may also be approximated using classifiers. In such a case, providing examples of objects belonging and not belonging to the constraint relation is required (see, e.g., [78]).

The construction of a specific approximation table of a higher ontology level concept requires defining all the elements appearing in Figs. 13 and 17. A fundamental problem connected with the construction of such a table is, therefore, the choice of the four appropriate languages used during its construction. The first language serves the purpose of defining patterns in a set of examples of the lower ontology level concepts, which enable the extraction of relational structures. The second one enables defining the features of these structures. The third one enables defining clusters of relational structures, and finally the fourth one the properties of these clusters. All these languages must be defined in such a way as to make the properties of the created clusters of relational structures useful on the higher ontology level for the approximation of the concept occurring there. Moreover, in the case when the approximated concept concerns structured objects, each of the parts of objects of this type may require another four of the languages mentioned above.
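Returning to the linking step of Fig. 17, the following minimal sketch in Python illustrates linking CRS-information systems for the parts of a structured object and filtering the arrangements with a constraint relation; the attribute names and the closeness constraint below are invented for illustration and are not the original implementation.

    from itertools import product

    # Objects of the new table arrange one cluster per part type T_1, ..., T_k;
    # a constraint relation rejects unreal or unreasonable arrangements.
    def link_systems(part_systems, constraint):
        """part_systems: list of CRS-information systems, one per part type
        (each a list of attribute dicts).
        constraint: predicate over a tuple of clusters, one from each system.
        Returns the admissible arrangements (rows before the expert adds d)."""
        return [combo for combo in product(*part_systems) if constraint(combo)]

    # Two vehicles as parts of an 'overtaking' structured object; the constraint
    # keeps only pairs of clusters whose vehicles drive close to each other:
    overtaking_pairs = link_systems(
        [[{"id": "c1", "pos": 100}, {"id": "c2", "pos": 500}],
         [{"id": "c3", "pos": 110}, {"id": "c4", "pos": 900}]],
        constraint=lambda combo: abs(combo[0]["pos"] - combo[1]["pos"]) < 50,
    )
    print(overtaking_pairs)   # only the (c1, c3) arrangement survives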

Fig. 18. Three cases of complex concepts approximation in ontology: Case 1 shows a spatial concept C of the higher ontology level (defined for complex objects) approximated using spatial concepts C1, ..., Cl of the lower ontology level defined for the same type of complex objects; Case 2 shows a spatio-temporal concept C of the higher ontology level (defined for complex objects) approximated using spatial concepts C1, ..., Cl of the lower ontology level defined for the same type of complex objects; Case 3 shows a spatio-temporal concept C of the higher ontology level (defined for structured complex objects) approximated using spatio-temporal concepts C1, ..., Cl of the lower ontology level defined for the parts of the structured complex objects.


However, the definition of these languages depends on the semantical difference between the concepts from the two ontology levels. In this paper, we examine the following three situations, in which the above four languages are defined in completely different ways (see Fig. 18).

1. The approximated concept C of the higher ontology level is a spatial concept (it does not require observing changes of objects over time) and it is defined on the set of the same objects as the lower ontology level concepts (see Case 1 in Fig. 18). On the lower level we have a family of concepts {C1, ..., Cl} which are also spatial concepts. Apart from that, the concepts {C1, ..., Cl} are defined for unstructured objects without following their changes over time. That is why these concepts are defined on the basis of the observation of an object's state at a single time point or over a time period established identically for all concepts. For example, the concept C and the concepts C1, ..., Cl may concern the situation of the same vehicle, where the concept C may be the concept Safe overtaking, while to the family of concepts C1, ..., Cl may belong such concepts as Safe distance from the opposite vehicle during overtaking, Possibility of going back to the right lane, and Possibility of safe stopping before the crossroads. The methods of approximation of the concept C for this case are described in Section 5.
2. The concept C under approximation is a spatio-temporal one (it requires observing object changes over time) and it is defined on the set of the same objects as the lower ontology level concepts (see Case 2 in Fig. 18). On the lower level we have a family of concepts {C1, ..., Cl} which are spatial concepts. The concept C concerns an object property defined over a longer time period than the concepts from the family {C1, ..., Cl}. This case concerns a situation when, following an unstructured object in order to capture its behavior described by the concept C, we have to observe it longer than is required to capture the behaviors described by the concepts from the family {C1, ..., Cl}. For example, the concepts C1, ..., Cl may concern simple behaviors of a vehicle, such as acceleration, deceleration, or moving towards the left lane, while the concept C may be a more complex concept, such as accelerating in the right lane. Let us notice that determining whether a vehicle accelerates in the right lane requires its observation for some time, over a period which is called a time window. Howe