Free Essay

Submitted By ajmal

Words 42490

Pages 170

Words 42490

Pages 170

Web Information Systems Engineering and Internet Technologies

Book Series

Series Editor: Yanchun Zhang, Victoria University, Australia

Editorial Board: Robin Chen, AT&T Umeshwar Dayal, HP Arun Iyengar, IBM Keith Jeffery, Rutherford Appleton Lab Xiaohua Jia, City University of Hong Kong Yahiko Kambayashi† Kyoto University Masaru Kitsuregawa, Tokyo University Qing Li, City University of Hong Kong Philip Yu, IBM Hongjun Lu, HKUST John Mylopoulos, University of Toronto Erich Neuhold, IPSI Tamer Ozsu, Waterloo University Maria Orlowska, DSTC Gultekin Ozsoyoglu, Case Western Reserve University Michael Papazoglou, Tilburg University Marek Rusinkiewicz, Telcordia Technology Stefano Spaccapietra, EPFL Vijay Varadharajan, Macquarie University Marianne Winslett, University of Illinois at Urbana-Champaign Xiaofang Zhou, University of Queensland

Other Books in the Series: Semistructured Database Design by Tok Wang Ling, Mong Li Lee,

Gillian Dobbie ISBN 0-378-23567-1 Web Content Delivery edited by Xueyan Tang, Jianliang Xu and Samuel T. Chanson ISBN 978-0-387-24356-6 Web Information Extraction and Integration by Marek Kowalkiewicz, Maria E. Orlowska, Tomasz Kaczmarek and Witold Abramowicz ISBN 978-0-387-72769-1 FORTHCOMING

The Web Resource Space Model

Hai Zhuge

Chinese Academy of Sciences

Hai Zhuge Key Lab of Intelligent Information Processing Institute of Computing Technology Chinese Academy of Sciences P.O. Box 2704-28 No. 6 Science South Road Zhong Guan Cun, Beijing, China 100080 zhuge@ict.ac.cn

Library of Congress Control Number: 2007935313

ISBN-13: 978-0-387-72771-4

Printed on acid-free paper.

e-ISBN-13: 978-0-387-72772-1

2008 Springer Science+Business Media, LLC All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. 987654321 springer.com

Contents

Contents ...................................................................................................... v Preface .......................................................................................................xi Chapter 1 Resource Space Model Methodology ..................................... 1 1.1 Origin of the Resource Space Model................................................ 1 1.2 Basis of the Resource Space Model.................................................. 7 1.2.1 Definitions and Characteristics.................................................. 7 1.2.2 Resource Space Definition Language...................................... 12 1.2.3 Resource Space Manipulation Operations............................... 13 1.2.4 Resource Space Modification.................................................. 15 1.2.5 View Definition ....................................................................... 16 1.2.6 Query Language ...................................................................... 17 1.2.7 Visualized Resource Locating ................................................. 18 1.3 Application Scenarios of the Resource Space Model..................... 20 1.3.1 Management of Web Pages ..................................................... 20 1.3.2 Managing Multi-layer Tables .................................................. 21 1.3.3 Management of Photos ............................................................ 23 1.3.4 Geographical Resource Space ................................................. 25 1.3.5 Multi-dimensional ACM Computing Classification System... 26 1.3.6 Management of Bio-information ............................................. 27 1.3.7 Media Content Space............................................................... 28 1.3.8 Automatically Add New Resources to the Resource Space .... 28 1.4 Design Method ............................................................................... 31 1.4.1 Resource Analysis ................................................................... 31 1.4.2 Top-down Resource Partition.................................................. 32 1.4.3 From Low Dimension to High Dimension .............................. 33 1.4.4 Abstraction and Analogy in Designing Resource Space ......... 37 1.5 Use Resource Space to Manage Relational Tables......................... 39 1.6 The Semantic Link Network........................................................... 41 1.7 Comparison between RSM and RDBM ......................................... 45 1.8 Questions and Answers .................................................................. 48 1.9 Summary......................................................................................... 50

vi

Contents

Chapter 2 A Semantic Overlay Integrating Normalization with Autonomy ................................................................................................. 53 2.1 The Basic Idea ................................................................................ 53 2.2 Integrating Resource Space Model with Semantic Link Network.. 56 2.3 Relationship between RSM and SLN ............................................. 60 2.3.1 Transformation from Semantic Link Network to Resource Space Model ..................................................................................... 60 2.3.2 Transformation from Resource Space to Semantic Link Network and Correlations................................................................. 63 2.3.3 Topological Properties............................................................. 65 2.4 Union View of Resource Space and Semantic Link Network........ 68 2.4.1 The Framework ....................................................................... 68 2.4.2 The Core Component of Union View: Resource Class Hierarchy .......................................................................................... 71 2.4.3 Operations on Resource Class Hierarchy ................................ 74 2.4.4 Operations on the Union View of Resource Space and Semantic Link Network.................................................................... 77 2.5 Discussion and Summary ............................................................... 79 Chapter 3 Expressiveness of Query Languages for Resource Space Model ........................................................................................................ 83 3.1 The Problem ................................................................................... 83 3.2 Completeness of Query Languages on Resource Spaces ............... 84 3.2.1 Basic Idea ................................................................................ 84 3.2.2 Definition of Completeness of Query Operations ................... 85 3.3 Complete set of Operations ............................................................ 86 3.3.1 Design of Query Operations .................................................... 87 3.3.2 Verification of Completeness of Operations ........................... 89 3.4 Expressiveness of Query Languages .............................................. 92 3.4.1 Comparison between Expressiveness...................................... 92 3.4.2 Some Characteristics of Expressiveness.................................. 94 3.5 Comparison and Analysis ............................................................... 94 3.6 Summary......................................................................................... 97 Chapter 4 Algebra and Calculus of the Resource Space Model .......... 99 4.1 Basic Idea ....................................................................................... 99 4.2 Resource Space Algebra ............................................................... 100 4.2.1 Definitions of Operations ...................................................... 100 4.2.2 Relationships among Operations ........................................... 105 4.3 Resource Space Calculus.............................................................. 107 4.3.1 Definition............................................................................... 107 4.3.2 From Resource Space Algebra to Resource Space Calculus. 110

Contents

vii

4.3.3 From Resource Space Calculus to Resource Space Algebra. 111 4.3.4 Transformation from Relational Model to Resource Space .. 115 4.4 Summary....................................................................................... 116 Chapter 5 Searching Complexity of Resource Space Model.............. 117 5.1 Basic Concepts and Formulas....................................................... 117 5.1.1 On Computation Complexity................................................. 117 5.1.2 Searching Complexity and Formulas .................................... 120 5.2 Basic Assumptions ....................................................................... 121 5.3 Distribution of Coordinates on Axes ............................................ 123 5.3.1 Best Distribution of Coordinates ........................................... 123 5.3.2 The Worst Distribution of Coordinates ................................. 125 5.4 The Changing of Space Dimension .............................................. 129 5.4.1 Relationship between Dimension and Searching Complexity. 129 5.4.2 Value of Critical Dimension.................................................. 131 5.5 Summary....................................................................................... 132 Chapter 6 Resource Space Model Storage........................................... 135 6.1 Current Approaches to Storing Resource Space........................... 135 6.2 Problem Definition ....................................................................... 137 6.3 System Architecture ..................................................................... 138 6.4 RSM Storage Mechanism ............................................................. 140 6.5 RSM Schema Tree........................................................................ 141 6.6 C-tree ............................................................................................ 148 6.6.1 Resource Operations.............................................................. 148 6.6.2 Minimum Bounding Rectangle ............................................. 152 6.6.3 On INSERT_POLICY........................................................... 154 6.6.4 On SPLIT_POLICY .............................................................. 155 6.6.5 Disk management .................................................................. 156 6.7 Summary....................................................................................... 157 Chapter 7 Structured Peer-to-Peer Resource Space .......................... 159 7.1 Basic Idea ..................................................................................... 159 7.1.1 The Problem .......................................................................... 159 7.1.2 A Brief Introduction to CAN................................................. 160 7.1.3 Basic Approach ..................................................................... 161 7.2 The System Design ....................................................................... 162 7.2.1 The Basis ............................................................................... 162 7.2.2 Node State ............................................................................. 163 7.2.3 Routing .................................................................................. 164 7.2.4 Node Join............................................................................... 167 7.2.5 Node Departure ..................................................................... 170

viii

Contents

7.3 Improvement................................................................................. 171 7.3.1 Routing Performance............................................................. 172 7.3.2 Node Failure Recovery.......................................................... 174 7.3.3 Coordinates in Tree Structure................................................ 175 7.4 Summary....................................................................................... 178 Chapter 8 Unstructured Peer-to-Peer Resource Space...................... 179 8.1 Unstructured Peer-to-Peer ............................................................ 179 8.2 Incorporating Resource Space with Unstructured Peer-to-Peer ... 180 8.2.1 Peer-to-Peer in e-Science....................................................... 180 8.2.2 Integrating Resource Space with Gossip ............................... 182 8.3 The Construction Mechanism....................................................... 184 8.3.1 Resource Index Issuing Process ............................................ 186 8.3.2 Peer Join Process ................................................................... 186 8.3.3 Peer Departure Process.......................................................... 188 8.3.4 Query Processing Process...................................................... 189 8.4 Performance Analysis................................................................... 190 8.4.1 Reliability .............................................................................. 190 8.4.2 Hop Count Expectation ......................................................... 191 8.5 Experimental Evaluation .............................................................. 192 8.5.1 Experiments in Random Networks........................................ 192 8.5.2 Experiments in Random Power-law Networks...................... 193 8.6 Architecture of a RSM-based Gossip Network ............................ 200 8.7 Summary....................................................................................... 201 Chapter 9 Probabilistic Resource Space Model.................................. 203 9.1 Basic Concepts ............................................................................. 203 9.2 Normal Forms of Probabilistic Resource Space ........................... 206 9.2.1 The First Normal Form and Second Normal Form ............... 206 9.2.2 The Third Normal Form ........................................................ 207 9.3 Operations of Probabilistic Resource Space................................. 209 9.3.1 Point Query............................................................................ 209 9.3.2 Resource Query ..................................................................... 210 9.3.3 Resource Modification .......................................................... 211 9.3.4 Operations on Probabilistic Resource Space ......................... 213 9.4 Integrity Constraints under Probability ........................................ 214 9.4.1 Key in Probabilistic Resource Space Model ......................... 214 9.4.2 Integrity Constraints in Probabilistic Resource Space Model 216 9.5 Relevant Works ............................................................................ 219 9.6 Summary....................................................................................... 220 References............................................................................................... 221

Contents

ix

Index........................................................................................................ 231

Preface

Birds of a feather flock together. Web resources of a category work closely for efficiency. Classification is a method of efficiently managing various resources. It is also a basic method for human beings to know the real world and synthesize experience. A Web Resource Space Model (in simple RSM) is a semantic data model for specifying, storing, managing and locating Web resources by appropriately classifying the contents of resources. It enables users or applications to operate resources by the SQL-like query language. A Web resource space is a multi-dimensional classification space where dimensions are discrete. Its intrinsic characteristics are worth studying as it is not an ordinary distance space. A resource space can be normalized to increase the correctness of resource management by setting constraints on dimensions. Aiming at a Web semantic data model with characteristics of normalization and autonomy, this book develops the RSM systematically in methodology, model and theory. It concerns the following contents: 1. The general methodology of the RSM, which includes the origin, fundamental concepts, characteristics and the development method. It helps understand the RSM and design resource spaces for applications. The relationship between the Resource Space Model and the Semantic Link Network. The integration of the two models forms a richer semantic data model to support advanced distributed applications. The completeness and necessity theory for query operations of the Resource Space Model. The algebra and calculus theory for query operations of the Resource Space Model. The query capability and expressive power of the Resource Space Model are studied from two perspectives: resource space algebra and resource space calculus.

2.

3. 4.

xii

Preface

5.

6.

7.

8.

The complexity of searching the resource space. We intend to unveil the relationship between the searching efficiency and the number of dimensions as well as the relationship between the searching efficiency and the distribution of coordinates. The physical storage mechanism of the resource space. Its multidimensional and discrete characteristic is different from the relational database index (one dimensional) and previous multidimensional index (sequential numerical dimensions). The P2P-based decentralized resource space. It is an approach to enabling the Resource Space Model to synergy normalization with autonomy. A structured P2P resource space solution and an unstructured P2P resource space solution are studied. The probabilistic Resource Space Model, which enables users or applications to store and retrieve resources uncertainly. It is a more general Resource Space Model.

I would like to take this opportunity to thank my students Erlin Yao, Yunpeng Xing, Xiang Li, Chao He and Liang Feng who make important contribution to this work. Thanks also go to all team members of the China Knowledge Grid Research Group for their help and cooperation. Research work was financially supported by the National Basic Research Program of China (2003CB317000), the EU 6th Framework Project GREDIA (IST-FP6-034363), and the International Cooperation Program of the Ministry of Science and Technology of China (2006DFA11970). The Web resource space is a part of the resource space we live in. The RSM is a promising model for effective management of versatile resources. Integrating the RSM with the Semantic Link Network, the database models and the Web ontology mechanisms could form a powerful semantic platform for the future interconnection environment.

Hai Zhuge August 9, 2007

Chapter 1 Resource Space Model Methodology

Classification is not only an approach to efficiently managing resources but also a basic method for human to know the real world. The Web Resource Space Model realizes birds of a feather flock together in the digital interconnection environment. It specifies and manages various Web resources by normalizing classification on contents of resources.

1.1 Origin of the Resource Space Model

Files are expanding in our daily-use PCs or laptops due to easier download from websites and email attachments. Management of these accumulating files is troublesome if we arbitrarily save them. For example, saving files in the desktop of Windows seems convenient, but this will cause inefficient retrieval of files and slow down the speed of machine in the long run. Appropriately naming folders to contain various files is a way to efficient file management and retrieval. Inappropriately naming will cause trouble in retrieval of files due to synonym or poor meaning. Goods in supermarkets are arranged by classification. Goods are placed and updated according to the commonsense of the classification shared between customers and sellers. Chain supermarkets often arrange goods in a uniform style and place good categories in the same order so that customers can quickly reach their targets. This order is an experience for customers to raise the efficiency in selecting goods by going directly to the interested category or region. We can also find that some closely relevant goods are arranged in neighborhood. This neighbor information also helps customers to select goods. These strategies bring efficiency and convenience for sellers to manage goods and for customers to select goods. Biologists classify organisms into categories by judging degrees of apparent similarity and difference. On discovering an unknown organism, scientists begin their classification by looking for anatomical features that have the same function as those found on other species, and then determine

2

Chapter 1 Resource Space Model Methodology

whether or not the similarities are due to an independent evolution or to descent from a common ancestor. If the latter is the case, then the two species could be classified into the same category. Children start to learn concepts by classifying real-world objects into categories and by generalizing and specializing categories via instances. So classification is not only an approach to efficiently managing resources but also a basic method for human to know the real world. A (Web) Resource Space Model (in simple RSM) is a semantic data model for specifying, storing, managing and locating contents of (Web) resources by appropriate classification on the contents of resources. The notion of the Web resource space was initiated in 2002 as a multidimensional knowledge space (Zhuge, 2002). Its basic model was proposed in 2004 (Zhuge, 2004a-d). A resource space is a multi-dimensional classification space where dimensions (axes) are discrete so it is different from ordinary distance space. Its intrinsic characteristics are worth studying. A resource space can be normalized to ensure the correctness of managing resources by setting constraints on axes. The Resource Space Model methodology is a study of the basic method for designing resource spaces and helping the development of its applications. Data model and algorithm are the core of software systems. File system is the first milestone towards effective computing resource management. It is a method for storing, managing and retrieving resources in form of files by establishing mapping between a directory indexing structure and a storage device. The directory indexing structure can keep track of the files and path syntax required to access them. It defines the way files are named as well as the maximum size of a file or volume. A file system is a component of most operating systems. IT professionals have used various indexing techniques to raise the efficiency of managing data in files. The file system can be regarded as a 1-dimensional resource space. Database is another milestone towards effective resource management. Most databases use file systems as the underlying mapping mechanism between the higher level indexes and the storage devices. Database theories and systems have influenced the world for forty years (Bachman, 1969). Especially, the Relational Database Model and systems have achieved a

1.1 Origin of the Resource Space Model

3

great success (Codd, 1970). The Relational Database Model uses relational tables to describe basic relations between data types. Its normal form theory ensures the correctness of data operations. Both the file system and the database system are invented at the age of mainframe and centralized computing. The World Wide Web is a huge decentralized file system. There is no central control for adding, updating and removing Web pages. The management of the Web resources challenges traditional data models. Object-oriented databases and object-relational databases extend the application scope of the relational databases by borrowing the advantages of the object-oriented methodologies and programming languages like inheritance and encapsulation to model the real-world and they enable complex objects to be effectively managed (Kim, 1990; Rumbaugh et al., 1991; Mok, 2002). But, their limitations emerge in Web-based applications, which require resources to be managed in an open, decentralized, platform-irrelevant and content-based way. In data warehousing and OLAP area, the multi-dimensional data model was used. It is suitable for storing large-scale historical data. To support decision-making based on large data sets, data mining techniques are needed to discover the rules in large data sets (Han and Kambr, 2000). But, the read-only model is limited in ability to meet the needs of resource management in the Internet environment where resources are complex and have to be frequently operated. Compared with the relational data model, it is weak in theory on data management. Fig. 1.1 depicts a file system on disk. The file system realizes the mapping from the directories and files of various type onto the disk space by establishing the index on the linear disk space. Users or up-level applications can operate files according to their names and path regardless of their physical storage. Fig. 1.2 shows the keyword index on the World Wide Web. The Web is actually a decentralized file system, which enables users to browse Web pages by their URLs. Search engines establish indeces on Web pages distributed on the Web by collecting and classifying Web pages according to keywords and then recommending Web pages sharing the same set of keywords according to the page rank (S.Brin and L.Page, www7.scu.edu.au,1998).

4

Chapter 1 Resource Space Model Methodology

Fig.1.1. The EXT2/3 file system.

1.1 Origin of the Resource Space Model

5

6

Chapter 1 Resource Space Model Methodology

Comparison of the file system on disk and the file system on the World Wide Web implicates the evolution of file systems. With the development of Web applications, effective management of the contents of various resources on the Web becomes a challenge. The new generation data model should be a semantic-rich model that is able to manage content rather than file name. But to precisely describe the content of an individual resource is difficult and a formal description is not easy for sharing among people of different communities. The Resource Space Model can reflect the content of resources by classification semantics. People often use the classification method to manage various contents in daily life. For example, researchers are organized into research groups according to the research topics they are working on. Publishers classify their products (books, journals and conference proceedings) into different categories according to disciplines and contents. A set of resources can be classified by different classification methods as shown in the left hand of Fig. 1.3. Multiple classification methods constitute a multi-dimensional semantic space over a set of resources as shown in the right hand of Fig. 1.3, where each axis represents a kind of classification method. Y X

Y Z

X Z Fig. 1.3. Multiple classification methods over a set of resources constitute a multi-dimensional classification space.

1.2 Basis of the Resource Space Model

7

The Resource Space Model is such a semantic space that manages information, knowledge and service resources. Information resources refer to various types of electronic files that can be transmitted through the Internet, and can be read or perceived directly or indirectly. Knowledge resources include metadata, relation and strucuture as well as the abstract concepts, axioms, rules and methods that can be represented in a certain machine-understandable form. Knowledge can be generated from understanding the information resources or generalizing human experience. Service resources refer to the reusable capability processes for performing tasks, solving problems, or processing information or knowledge resources. A resource space can be presented in: 1. 2. 3. conceptual or logical aspect the definition of an n-dimensional resource space; user view aspect a subspace of the whole resource space displayed in user-understandable form; representation aspect the cross-platform understandable definition based on standard description languages like XML, RDF and OWL; and, storage aspect the physical storage of the resource space including the storage of the space structure, relevant index and resource entities.

4.

The characteristics of the Resource Space Model require a specific method to help design resource spaces. Before presenting the design method, we first introduce the basic notion and components of the Resource Space Model.

1.2 Basis of the Resource Space Model

1.2.1 Definitions and Characteristics The semantic basis of the Resource Space Model is name space, basic data type, set and partition. Concepts are labeled in the name space as consensus of a community, while their semantics are determined by classification and use-case in four worlds the real world, the document world, the machine world and the mental world (Zhuge et al, 2006). Some basic concepts in the name space

8

Chapter 1 Resource Space Model Methodology

do not need to be explained. They can be used to define other concepts. A set of concepts can represent a certain semantics. Classifications on concepts form concept hierarchies. The basic semantic elements of the Resource Space Model are resource, resource space, axis and coordinate. A resource in the machine world has name, type and content. Concepts are basic elements of the composing content. For example, a Web page has a URL, and its content can be represented by a set of concepts. A Resource Space is a multi-dimensional classification space. It consists of a name and a set of axes, denoted as RS(X1, X2, …, Xn). Each axis Xi represents a classification method. Xi is partitioned by a set of coordinates denoted as Xi . A point in the space, determined by one coordinate at every axis, represents a set of resources of the same category. An axis can be regarded as a 1-dimensional resource space. A coordinate C represents a set of resources, denoted as R(C). Resources represented by axis Xi are the union of all the resources represented by its coordinate: R(Xi) R(Ci1) R(Ci2) … R(Cin). The semantics of a coordinate is represented by name, basic datatype, a set of concepts, or a coordinate tree (low-level coordinates finely classify their common ancestor). The semantics of a coordinate is regulated by the semantics of its axis. A coordinate C is called independent from another coordinate C’ if there is no intersection between R(C) and R(C’). Using existing taxonomy and commonsense as the classification method is a way to establish concensus between designers and users. The resource space, axis, coordinate, and point are sets in nature. A coordinate regulates a set of points. An axis regulates a set of coordinates. An axis name represents higher classification level than its coordinates. A resource space regulates a set of axes and the refined classification relationship. A resource is determined by locating the point it belongs to and by selecting from the resource set according to its name and content description. Two axes can be regarded as equivalent if their names are the same in semantics and the names of all the corresponding coordinates are the same in semantics.

1.2 Basis of the Resource Space Model

9

The following are two operations on axis: 1. If two axes X1 and X2 have the same axis name but have different coordinates, then they can be merged into one: X X1 X2 . In this case, the name of X represents X1 and X2. An axis X can be split into two axes X’ and X by dividing the coordinate set of X into two: the coordinate set of X’ and the coordinate set of X , such that X X’ X . If the two axes need to be merged for the future use, the names of X’ and X should be the same in semantics.

2.

The semantics of axis and coordinate can be formally defined or informally defined. For example, the semantics of a coordnate can be defined by a set of concepts, which regulate the semantics of the resources it may contain. The above definitions enable a resource space to represent any form of resources. If we use a set of domain concepts KC to describe a coordinate C, and the resources contained by C share common concept set KR, then we could find a mapping between KC and KR such that corresponding concepts are the same or share a common ancestor. This mapping is useful in automatically classifying resources. Let X (C1, C2, …, Cn) be an axis and Ci' be a coordinate at another axis X’, we say that X finely classifies Ci' (denoted as Ci'/X ) if and only if: 1. (R(Ck) R(Ci')) (R(Cp) R(Ci')) = NULL (k p, and k, p [1, n]); and, 2. R(C1) R(Ci') R(C2) R(Ci') … R(Cn) R(Ci') R(Ci') hold. As the result of fine classification, R(C’) is partitioned into n categories: R(Ci'/X) {R(C1) R(Ci'), R(C2) R(Ci'), …, R(Cn) R(Ci')}. For two axes X (C1, C2, …, Cn) and X’ (C1', C2', …, Cm'), we say that X finely classifies X’ (denoted as X’/X) if and only if X finely classifies C1' , C2' ,…, and Cm' respectively. Two axes X and X’ are called orthogonal with each other (denoted as X X’) if X finely classifies X’ and vice versa, i.e., both X’/X and X/X’ hold. Establishing orthogonal relationship between relevant classifications deepens people’s understanding on the real world. The following three normal forms are for designing a good resource space.

10

Chapter 1 Resource Space Model Methodology

1. A first-normal-form resource space is a resource space and there does not exist name duplication (semantic overlap) between coordinates at any axis. 2. A second-normal-form resource space is a first-normal-form, and for any axis, any two coordinates are independent from each other. 3. A third-normal-form resource space is a second-normal-form and any two axes of it are orthogonal with each other. The following are four operations of the resource space: 1. Join operation. If two resource spaces RS1 and RS2 specify the same type of resources and they have n (n 1) common axes, then they can be joined into one resource space RS such that RS, RS1 and RS2 share these n common axes and |RS|=|RS1| + |RS2| n, where |RS| represents the number of dimensions of the RS. RS is called the join of RS1 and RS2, denoted as RS1 RS2 RS. Disjoin operation. A resource space RS can be disjoined into two resource spaces RS1 and RS2 (denoted as RS RS1 RS2) that specify the same type of resources as that of RS such that they have n (1 n min(|RS1|, |RS2|)) common axes and |RS| n different axes, and |RS|=|RS1| + |RS2| n. Merge operation. If two resource spaces RS1 and RS2 specify the same type of resources and satisfy: (1) |RS1|=|RS2|=n; and, (2) they have n 1 common axes, and there exist two different axes X1 and X2 satisfying the merge condition, then the two spaces can be merged into one RS by retaining the n 1 common axes in the new space and including a new axis X=X1 X2. RS is called the merge of RS1 and RS2, denoted as RS1 RS2 RS, and |RS|= n. The second condition can be extended as follows: they have n k common axes (1 k B, where is a semantic factor and p is the probability of . Metcalfe’s Law states that "The power of the network increases exponentially by the number of computers connected to it. Therefore, every computer added to the network both uses it as a resource while adding resources in a spiral of increasing value and choice."

1.7 Comparison between RSM and RDBM

45

What is the effect of the Semantic Link Network? The importance of a Semantic Link Network depends on the number of people defining and maintaining semantic links rather than the definition of semantic links itself. The evolution of the Semantic Link Network will form a kind of semantic effect that helps promote decentralized applications. In daily life, people often ask neighbors/friends when they have questions so neighbors/friends are likely to hold relevant contents. For scientific papers, citation relations are dense between papers of the same area. This semantic relevancy leads to a semantic community phenomenon in pursuing efficiency. The semantic communities can help promote the efficiency of searching relevant concepts by focusing on a specific semantic community. Semantic locality requests to store relevant contents in semantically close places. The storage mechanism of the Resource Space Model concerns the semantic locality to ensure search efficiency. It is also a criterion to implement a P2P semantic overlay to support semantic-rich decentralized applications. The Semantic Link Network is a general semantic space. The Resource Space Model focuses only on one type of relation the classification relation, so it can be regarded as a special case of the Semantic Link Network. The Resource Space Model and the Semantic Link Network follow different rules. The theories on the Resource Space Model are not suitable for the Semantic Link Network. However, a resource space can be transformed into a Semantic Link Network, and a special Semantic Link Network can also be transformed into a resource space. Moreover, they can be integrated to provide views of different layers: the Resource Space Model placed over the Semantic Link Network provides with a classification view while the Semantic Link Network provides with a scalable semantic link network view.

1.7 Comparison between RSM and RDBM

Recognizing the attributes of an object and knowing the classification of objects are two basic approaches to understanding the real world. The two approaches support each other in recognizing the real world.

46

Chapter 1 Resource Space Model Methodology

An object has attributes of many aspects like physical and chemical characteristics. Objects sharing the same set of attributes can be classified into the same category. Different aspects can form different classification methods. Existing classifications can be refined with the development of knowledge on classifications and attributes. The Relational Database Model is based on the attributes of objects. Just like previous data models such as the network model and objectoriented model, both the Resource Space Model and the Relational Database Model support application systems. An ideal data model should be simple, capable and close to human thinking. The Resource Space Model and the relational database can be further compared as follows. 1. The Resource Space Model focuses on the classification on objects (resources in the digital world). It allows designers and users to observe resources as a whole and then classifies them top-down by commonsense for high-level classification and domain specific knowledge for low-level classification. The relational database model focuses on attributes of objects (entities). In representation, the Resource Space Model is a uniform coordinate system, while the data model of the RDBMS is a relational table. Usually, an application needs to select (search) from many tables so operations on multiple tables are inevitable. The cost of operations on multiple table needs to be reduced. A basic request of the classic relational database model is the atomicity of data. Attributes are defined by datatypes. The Resource Space Model does not request this atomicity in nature. A coordinate in the Resource Space Model can be defined by semantic description including basic datatype, keyword set, or a coordinate tree representing fine classification at different levels. So resources managed by the Resource Space Model can be any form of resources, while the classical relational data model only manages atomic data. The normalization approaches of the Resource Space Model and the Relational Database Model are different in nature. The relational database model normalizes the functional dependence relation, while the Resource Space Model normalizes the classification relation. The Resource Space Model enables a uniform and universal resource view when operating resources. It is suitable for class operations since to retrieve a class is equivlent to locate a point in a resource space. To retrieve a class of data, the relational database system needs to check

2.

3.

1.7 Comparison between RSM and RDBM

47

all of the records unless it is indexed. The RDBMS essentially supports the view of one or more tables. 4. The Relational Data Model requests that application developers are the same as the database designer or that the applicatoin developers are very familiar with the database design because they need to know the table schemas for coding. The Resource Space Model also request the application develpers know the structure of the resource space. In applications, users should be familar with one-dimensional classification of the resources in the application domain. This is the basis of understanding the multi-dimensional resource space. In an organization, high-rank users are interested in high-level classifications, lowrank users are responsible for low-level classifications. The resource space actually provides a type of domain knowledge, which supports various applicaion systems. The basic semantics of the Relational Database Model relies on the Data Definition Language (DDL). The DDL is used to create and destroy databases and database objects. These commands will primarily be used by database administrators during the setup and terminate phases of a database project. The basic semantics of the Resource Space Model is the commonsense on classification at high-level and the domain knowledge on classification at low level. Domain ontology helps explain low-level semantics. For normalization, the Relational Database Model introudces artificial attributes like identity to differentiate one object from the others. Otherwise, the key does not exist in natural attributes in many cases. The Resource Space Model does not have this limitation. Relational table raises its search efficiency by establishing one- dimensional index on attributes. The resource space uses one multidimensional index on the whole space. This multi-dimensional nature requires special storage mechanism different from relational database. As a descrete multi-dimensional index, the resource space has advantages in search efficiency. Complex resources are properties if they are correctly stored. The existing resources bring users’ classification viewpoint so they can be used to enrich the semantics of axes, coordinates and even points. In relational table, data does not contain such rich semantics as complex resources.

5.

6.

7.

8.

Above differences determine that the Resource Space Model concerns the contents (semantics) of resources and the content-based classification so it supports content-based operation. The relational database model con-

48

Chapter 1 Resource Space Model Methodology

cerns the attributes of the objects being managed so it supports attributebased operation. Differences exist between the design method for the relational databases and the design method for the Resource Space Model. The design method for the Resource Space Model does not have the conceptual model so experience plays an important role when designing an appropriate resource space. The Resource Space Model’s conceptual model is actually the same as its data model. Its hierarchical resource organization approach is in line with the top-down resource partition and the “from general to special” thinking characteristic. The classification-based normalization requires the Resource Space Model to have a special design method and tools, which are different from the Relational Database Model. The design of a resource space concerns the resource dictionary, independency checking of coordinates, and orthogonal checking of axes. The design of the relational database concerns the data dictionary and the balance between the normal forms and the retrieval efficiency with respect to the application requirement. A basic semantic overlay should be able to describe basic semantic relations and can further find potential semantic relations. Just like relational database only focuses on very basic relations such as attribute-value relation and functional dependence relation, the Resource Space Model focuses on the classification relations. It is an interesting issue to find a way to integrate different types of semantics. The Resource Space Model, UML, OWL, and database can be mapped from one into another and integrated with each other to enhance and support each other.

1.8 Questions and Answers

Question 1. We have the relational database model and many commercial database systems. Do we still need Resource Space Model? Answer 1. Different models have different application scopes. The relational database model is useful in many applications, especially in pure data management. The Resource Space Model aims at managing contents of various resources rather than pure data. Question 2. What are the distinguished characteristics of the Resource Space Model compared with existing data models. Answer 2. The Resource Space Model is based on content classification reflecting human classification commonsense and thinking on recognizing

1.8 Questions and Answers

49

real-world objects. Coordinates at each axis are discrete and any coordinate can be a tree strucuture. Using the multi-dimensional index, the Resource Space Model supports efficient searching. Question 3. Resource Space Model designers may not clearly know the classification of resources. For example, some cross-area book could belong to two categories. Answer 3. The Resource Space Model is good at managing the content that can be clearly classified. Designers can design resource spaces of different normal forms according to different applications. The following approaches can be used to deal with this issue: 1. add undetermined coordinates to specify those un-determined resources; 2. add a cross-class coordinate to appropriate axis; 3. introduce fuzzy theory to establish a fuzzy Resource Space Model as introduced in (Zhuge, 2004c); and, 4. introduce probability into Resource Space Model to reflect the probability world (see chapter 9). Question 4. Can Resource Space Model change structure during use? Answer 4. The original Resource Space Model is designed for specific applications just like databases, so it is inappropriate to change structure after design. On the other hand, the normal form would be damaged if we change the structure of the Resource Space Model. There are two ways to resolve this issue. One is that we can design a more stable resource space since diverse resource spaces can be designed for an application. The second is that we can design a Resource Space Model system that can adapt to change. Question 5. Can we use existing database systems to realize a resource space system? Answer 5. Existing data structure and indexing techniques can be used to implement a resource space system. The XML file can be used as the intermediate of storing resources. However, a special approach that makes use of the characteristics of the Resource Space Model is needed. Chapter 6 presents an approach to the storage of resource space. Question 6. Are resources stored in the resource space? Answer 6. A resource space includes three parts: the structure of the space including axes and their coordinates, the specification of the content (including identity, path and semantic description) of resources, and the entity resources. The strategy of storing these parts depends on the resource

50

Chapter 1 Resource Space Model Methodology

space system. Based on the Resource Space Model, different types of resource space systems can be developed. Question 7. What is the relationship between the Resource Space Model and the Knowledge Grid? Answer 7. The original idea of the Knowledge Grid is for effective knowledge sharing (Zhuge, 2002). Semantics is the basis for knowledge sharing. The Resource Space Model is suitable for managing knowledge resources by knowledge classification, for example, (concept, relation, axiom, rules, method and theory) can be one axis, and discipline can be another axis of a knowledge space. A decentralized Resource Space Model can be the underlying infrastructure of the Knowledge Grid. Question 8. What is the relationship between the Resource Space Model and P2P networks? Answer 8. P2P is a scalable decentralized resource management mechanism. A resource space can be either centralized or decentralized. A onedimensional resource space can be implemented as a structured P2P network by partitioning a set of resources and enabling a peer to manage a class of resources. Semantic locality requires resources of the same class to be stored in the same place or close places and managed by one peer. It is an interesting issue to realize efficient routing in a multi-dimensional P2P resource space. Chapter 8 and Chapter 9 present two solutions to construct P2P resource spaces.

1.9 Summary

Like Yin-Yang in ancient Chinese philosophy, normalization and autonomy are two aspects of an ideal organization model. The Resource Space Model represents the normalization. The Semantic Link Network represents the autonomy. Integration of the Resource Space Model and the Semantic Link Network can form a semantic overlay with the characteristics of normalization and autonomy. A Resource Space Model can be distributed onto a network to meet the needs of distributed applications. The method and technology of the distributed databases are good references in developing the distributed Resource Space Model. The Resource Space Model can be deployed onto a peer-to-peer network in a certain manner for decentralized applications. The peer-to-peer

1.9 Summary

51

Resource Space Model is a way to realize the synergy of normalization and autonomy. The storage of the resource space is to map the discrete multidimensional resource space into a multi-dimensional index on a linear storage space by using a specific index structure. The peer-to-peer computing can be regarded as a decentralized storage mechanism that distributes a linear disk space onto a network. This book arranges its content as follows: Chapter 1 presents the general methodology of the Resource Space Model. Readers can know its general idea, concept, characteristic and method of the Resource Space Model. Readers could understand and start to design resource spaces for applications after reading this chapter. Chapter 2 investigates the relationship between the Resource Space Model and the Semantic Link Network, and proposes an approach to integrate the two models to construct a richer semantic overlay synergying the normalization and autonomy for managing resources in the future interconnection environment. Chapter 3 studies the expressiveness of query languages for Resource Space Model and introduces the completeness and necessity theory for query operations on resource spaces. Chapter 4 introduces the algebra and calculus theory for the Resource Space Model. They are important parts of the theory of the Resource Space Model and the basis of its query language. Chapter 5 analyzes the complexity of searching in the resource space. It unveils the relationship between the searching efficiency and the number of dimensions as well as the relationship between the searching efficiency and the distribution of coordinates. Chapter 6 introduces an approach to physically store the resource space and its resources. The storage approach should reflect the characteristics of the resource space and efficiently support flexible query. The next two chapters are on a decentralized Resource Space Model. Chapter 7 presents an approach to deploy the Resource Space Model onto peer-to-peer network to obtain the scalability. Chapter 8 presents an approach to make use of classification semantics to improve the efficiency of unstructured peer-to-peer network.

52

Chapter 1 Resource Space Model Methodology

Chapter 9 constructs the probabilistic Resource Space Model to deal with uncertainty in applications by introducing the probability into the Resource Space Model. The motivation is similar to the fuzzy Resource Space Model (Zhuge, 2004c). The probabilistic Resource Space Model can be regarded as a more general Resource Space Model.

Chapter 2 A Semantic Overlay Integrating Normalization with Autonomy

Classifying objects into categories at different granularity levels, establishing semantic links between known objects, and discovering semantic clues between known and unknown objects are essential for understanding. Incorporating the Resource Space Model and the Semantic Link Network can form a semantic overlay synergy normalization and autonomy for managing resources in the future interconnection environment.

2.1 The Basic Idea

The Semantic Web is to improve the World Wide Web by extending HTML Web pages to descriptions with machine-understandable semantics, better enabling computers and people to work in cooperation (Berners-Lee et al., 2001; Heflin and Hendler, 2001). The success of the HTML guides researchers to develop more powerful markup languages like the Resource Description Framework (RDF, www.w3c.org/RDF), which is often used to integrate various applications by using the eXtensible Markup Language (XML) for syntax and URIs for naming (www.w3c.org/XML). Grid computing research influences the development of the Semantic Web from the computing ideal, infrastructure and model aspects (Foster, 2000, www-fp.mcs.anl.gov/~foster/). Artificial intelligence research influences the development of the Semantic Web towards a wisdom Web (Hoschek et al, 2000; Zhong et al, 2002). Knowledge representation and logics are the basis of traditional artificial intelligence. To introduce rules and description logics into the Semantic Web is a natural idea to enable the Semantic Web to support intelligent applications. An appropriate representation of semantics is also the key to solve the service discovery and matching issues in the service-oriented computing.

54

Chapter 2 A Semantic Overlay Integrating Normalization with Autonomy

To find a good representation approach is only one aspect of realizing the Semantic Web. Cross-domain understanding needs domain ontology. However, how to efficiently organize, use and maintain the semantic content is a major challenge to realize an efficient Semantic Web. Relational Data Base Model is a semantic model to effectively organize and manage attribute-value semantics (Chen, 1976; Codd, 1970; Codd, 1971b). With the normal form theories, it ensures the efficiency and accuracy of operations on data (Mok, 2002; Ullman, 1988; William, 1983). But the function dependence relationship between attributes of objects and the relationship between attributes and values are limited in semantic ability to organize and manage versatile resources in a large-scale dynamic interconnection environment. Object-oriented databases and object-relational databases extend the application scope of the relational databases by borrowing the advantages of the object-oriented methodologies and programming languages like inheritance and encapsulation to enable complex objects to be normally managed (Abiteboul et al, 1995; Kim, 1990; Rumbaugh et al., 1991; Ullman, 1988). Several graph-based semantic models built on the traditional hierarchical and network data models have been proposed (Gyssens et al, 1994; Levene and Loizou, 1995; Levene and Poulovassilis, 1990; Poulovassilis and Levene, 1994). But, these data models are incapable of effectively managing heterogeneous, distributed and ocean resources in a large, open and dynamic network environment. Semantic rich data models are required for achieving the efficiency and efficacy of resource management in the future interconnection environment. Human recognizes the objective world by classifying objects, establishing links between known objects, and discovering clues between known and unknown objects. Orthogonal classification is a way to normalize the classification of resources. For example, people are used to finding books in library by classifications on topics, publishers, or alphabets, which are orthogonal with each other. If they are integrated into one classification model as shown in Fig. 2.1, then people can locate a book by given three coordinates of corresponding axes in the three-dimensional classification space. Locating resources will be more efficient than only use one type of classification or separately use these classifications. Further, semantic links like citation relationships (e.g., “The Knowledge Grid” cites “Weaving the Web” as shown in Fig. 2.1) and common-author relationships can be established between books the entries in nodes. The incorporation of the classification semantics with the link semantics can form a richer semantic image (overlay) for locating resources.

2.1 The Basic Idea

55

Publisher

World Scientific

The Knowledge Grid Cite

Weaving the Web

Collins

T W Alphabet

Grid

Web

Topic

Fig.2.1. An example of integrating orthogonal classification semantics with link semantics. The Resource Space Model RSM is a semantic data model for uniformly, normally and effectively specifying and managing resources by normalizing classification semantics (Zhuge, 2004a). Its theoretical basis is the normal forms and integrity constraints on resources (Zhuge et al., 2005c). The characteristic of the Resource Space Model is the normalization on classification semantics. Semantic Link Network SLN is a semantic model that links resource descriptions by semantic links (Zhuge, 2003). The SLN model supports semantic representation, reasoning, execution, referential search, and normalization. An SLN can represent any semantic relationships between resources. The characteristic of the SLN is its autonomy: any node (resource) can link to any semantic relevant node.

56

Chapter 2 A Semantic Overlay Integrating Normalization with Autonomy

Are there any intrinsic rules between RSM and SLN? To know these rules can better develop the theory of the semantic overlay for the future interconnection environment.

2.2 Integrating Resource Space Model with Semantic Link Network

A resource space is an n-dimensional space RS(X1, X2, … , Xn), where Xi = {Ci1, Ci2, …, Cim} is an axis defined by a set of coordinates. A point p(C1,j1, C2,j2, …, Cn,jn) is determined by the coordinate values at every axes. It can uniquely determine a resource set, where each element is called a resource entry. Point and resource entry are two fundamental operation units of the Resource Space Model. Fig.2.2 is an example of a 3-dimentional resource space Spec-ApartGen(Specialization, Apartment, Gender) specifying the student information of a college. Three axes are Specialization = {math, chemistry, physics}, Apartment = {1#, 2#, 3#} and Gender = {male, female}. Each point denotes a class of students, for example, the point (math, 1#, male) represents all the male students who belong to the department of math and live in apartment no.1 in this college. And each resource entry in this point corresponds to a student in the college. Class hierarchy can be defined top-down from any coordinate at an axis. Take Fig.2.2 for example, the coordinate chemistry on axis Specialization is classified into g1, g2 and g3 in terms of grade, and then they can be further classified according to class. The label of each node is determined by a full path from the root, so the leaf node chemistry.g1.c1 can be distinguished from chemistry.g2.c1.

2.2 Integrating Resource Space Model with Semantic Link Network

57

Fig.2.2. A 3-dimentional Resource Space Spec-Apart-Gen. A Semantic Link Network consists of a semantic node set SemanticNodes, a semantic link set SemanticLinks and a reasoning rule set SLNRules, denoted as . Any semantic link in the SemanticLinks is a binary relation between two semantic nodes in the SemanticNodes. For any three semantic nodes A, B and C in the semantic node set if there exist two semantic links A B and B C in the semantic link set, and there exists a semantic link rule X Y, Y Z X Z (denoted as = in simple) in the SLNRules, then A C can be derived out and added to SemanticLinks (Zhuge, 2003). Two semantic link networks can be merged into one by common nodes or by adding semantic links between nodes of different networks. Normal forms of the Semantic Link Network are to guarantee the correctness of its semantics and operations. For example, X is-part-of Y and X isn’t-part-of Y may exist in the same Semantic Link Network because different users may operate on the same Semantic Link Network. The following normal forms of Semantic Link Network are to resolve the redundancy and inconsistency issues.

58

Chapter 2 A Semantic Overlay Integrating Normalization with Autonomy

1.

If there does not exist semantic-equivalent nodes in a given Semantic Link Network, then we say that the Semantic Link Network is the first normal form SLN (1NF-SLN). If a Semantic Link Network is 1NF and there does not exist inconsistent semantic links and duplicate semantic links between the same pair of nodes, then we say that the Semantic Link Network is the second normal form Semantic Link Network (2NF-SLN). If a Semantic Link Network is a 2NF and there does not exist isolated nodes (accessible from each other), then we say that the Semantic Link Network is the third normal form Semantic Link Network (3NFSLN).

2.

3.

An ideal Semantic Link Network is a semantic “map” of distributed versatile resources to enable people to autonomously publish, manage and browse resources according to semantic links in the map. Each resource defined in the Resource Space Model can have a mapping image in the Semantic Link Network. The purpose is different from the knowledge portals that only provide knowledge services (Mack et al., 2001). Users can make and use the orthogonal classification semantics and the link semantics according to their cognition on the real world. The orthogonal classification semantics can help users focus their operation destination. The Semantic Link Network reflects the semantic clues between versatile resources. The integration of the Resource Space Model and the Semantic Link Network combines the classification semantics and relational semantics. The normalization theories of Resource Space Model and Semantic Link Network support a kind of single semantic point of accessing relevant semantic content, i.e., single semantic image (Zhuge, 2004d). Knowledge portals are difficult to realize this function. Abstract knowledge like traditional semantic network and rule base can be derived from the Semantic Link Network by generalization, can be organized according to the orthogonal semantics in the up-level space, and thus enable the future interconnection environment to implement intelligent services. A Global Semantic Overlay Grid can be built by integrating the Resource Space Model and Semantic Link Network as shown in Fig.2.3.

2.2 Integrating Resource Space Model with Semantic Link Network

59

Fig.2.3. Building a Semantic Overlay Grid by integrating the Resource Space Model and the Semantic Link Network. A local Semantic Overlay Grid has four layers: the entity layer, the local Semantic Link Network, the local resource space layer, and the management mechanism of the resource space and the Semantic Link Network. The integration of the Resource Space Model and the Semantic Link Network lays the foundation of the Local Semantic Overlay Grid. A normalized Local Semantic Overlay Grid requires that both the Resource Space Model and the Semantic Link Network satisfy the normal forms. The Global Overlay Semantic Grid consists of many Local Semantic Overlay Grids connected by the semantic links. A Normalized Global Semantic Overlay Grid requires that all Local Semantic Overlay Grids are normalized and that the global Semantic Link Network is normalized. The Semantic Overlay Grid integrates the advantages of the normalization (the normalization of both Resource Space Model and Semantic Link Network) and self-organization (local Semantic Overlay Grids interconnect with each other autonomously). The normalization reflects the optimization ideal of the Grid.

60

Chapter 2 A Semantic Overlay Integrating Normalization with Autonomy

2.3 Relationship between RSM and SLN

2.3.1 Transformation from Semantic Link Network to Resource Space Model Definition 2.1. For a Semantic Link Network SLN (N, SL), (1) if for some nodes n1, n2 N (SLN), there exists a directed path p(n1, n2) from n1 to n2, then we call n1 can reach n2; (2) if for any pair of nodes n1, n2 N (SLN), n1 can reach n2 or n2 can reach n1, then we call SLN (N, SL) is unilaterally connected; (3) if for any n1, n2 N (SLN), n1 can reach n2 and n2 can reach n1, then SLN (N, SL) is called strongly connected; and (4) let SLN be the underlying undirected graph of SL (N, SL), then SLN (N, SL) is called weakly connected if SLN is connected; (5) if at least one semantic link (or can be derived from existing links) points from n1 to n2, we say that n2 is semantically reachable from n1, and that n1 and n2 are accessible each other by browsing (regardless of direction). The weak connectedness is about the reachability of browsing. If we n2, n 2 n3 n1 n3 (‘ ’ means by implication), then have n1 we say that or the -link is transitive, and that there exists a semantic chain from n1 to n3. Then we can see that the unilateral and strong connectness is about the reachability of the semantic links by induction. Definition 2.2. A sub-graph SLN of SLN (N, SL) is called a (weak, unilateral, strong) connected component of SLN (N, SL) if (1) SLN is (weakly, unilaterally, strongly) connected; and (2) if SLN is (weakly, unilaterally, strongly) connected too, and SLN SLN , then SLN SLN , i.e., SLN is a maximal subset which is (weakly, unilaterally, strongly) connected. A semantic component can be viewed as a point in the high-level resource space, then we can get the corresponding high-level resource space from a semantic link network as follows: For a Semantic Link Network SLN (N, SL), let SLN ={C1, C2, , Cm}, where Ci represents a strongly connected component of SLN and Ci Cj = . Then N (SLN) = N(Ci), where N(Ci) denotes the nodes of Ci. A strongly connected component of SLN stands for a certain semantics that is independent from the semantics of other components. So a Semantic Link Network can be transformed to a Resource Space Model by mapping

2.3 Relationship between RSM and SLN

61

the strongly connected components into the points in the space. Fig.5 depicts such a transformation. A transformation from a Semantic Link Network to a resource space exists because: There exists an n-dimensional semantic space RS(X1, X2, , Xn) to represent the semantics of a Semantic Link Network. The simplest case is a vector (C1, …, Cn). For each strongly connected semantic component Ci of the SLN, we can find projections on each axis x1, x2, , xn, representing fine semantics on those axes. Each axis takes the projection of C1, …, Cn on it as coordinates. Ci(x1, x2, , xn) corresponds to a point in RS.

The rest points in RS can be assigned as null points. Given a domain ontology, above steps can determine a resource space RS(X1, X2, , Xn) by mapping axis and its coordinates onto concepts in the ontology hierarchy.

Fig.2.4. Transformation from a Semantic Link Network to a resource space. For a Semantic Link Network SLN ={C1, C2, C3, C4} shown in Fig.2.4, Ci is the strong component, 1 i 4. Then, we can get the resource space RS(X1, X2)= {p1, p2, p3, p4} (other two unfilled points represent null points). We can see that when the high-level semantics are got in the resource

62

Chapter 2 A Semantic Overlay Integrating Normalization with Autonomy

space, the low-level semantics such as p1 p1 in the Semantic Link Network are lost.

p4, p3

p4, p2

p3, and p2

Corollary 2.1. A 1NF Semantic Link Network (1NF-SLN) can be transformed to a 1NF Resource Space Model (1NF-RSM). Proof. If a Semantic Link Network is 1NF, then there do not exist semantic-equivalent nodes in it. From the definition of the strongly connected component, there should not exist the same strongly connected component in the Semantic Link Network. Suppose RS is the high-level resource space, then there should not exist the same sets of points in it. So, there does not exist name duplication between coordinates at any axis in the resource space. Then the resource space is also the 1NF. Corollary 2.2. A 2NF Semantic Link Network (2NF-SLN) can be transformed to a 2NF Resource Space Model (2NF-RSM). Proof. If a Semantic Link Network is 2NF, then there do not exist inconsistent and duplicate semantic links between the same pair of nodes, this guarantees that the strongly connected components of Semantic Link Network are correct in semantics. Suppose {C1, C2, , Cm} is the set of strongly connected components of the Semantic Link Network SLN and RS is the high-level resource space, then RS={ p1, p2, , pm}, where pi corresponds to Ci, 1 i m. If RS is not the second-normal-form, then there exist some points pi and pj are not independent from each other, i.e., R(Ci) R(Cj ) , which means Ci Cj . This is not consistent with that Ci and Cj are different strongly connected components in SLN, so the RS is also a 2NF-RSM. Corollary 2.3. A 3NF Semantic Link Network (3NF-SLN) can be transformed to a 3NF Resource Space Model (3NF-RSM). Proof. If a Semantic Link Network SLN is a 3NF, then there does not exist isolated nodes (accessible from each other), therefore all the strongly connected components of SLN are accessible from each other. Suppose RS is the high-level resource space, we can get that any of its point are reachable from others, then every axis Xi can represent all the resources in RS, which is equivalent to that any two axes of RS are orthogonal with each other (Zhuge et al., 2005c), so RS is also a 3NF. The above three corollaries show that the normal forms of the Resource Space Model and the Semantic Link Network have common properties in solving the redundancy and inconsistency.

2.3 Relationship between RSM and SLN

63

2.3.2 Transformation from Resource Space to Semantic Link Network and Correlations There are multiple ways to transform a resource space to a Semantic Link Network, but semantics may lose during transformation. So it is important to ensure the semantic equivalence of the transformation. The following is a semantically equivalent transformation. Fig.2.5 depicts the transformation process.

Fig.2.5. An example of semantically equivalent transformation. Take nodes in the resource space as nodes in the Semantic Link Network; Define the semantic links between n1(x1, …, xn) and n2(y1, …, yn) as:

SL( xi , yi )

impi ,

if

xi , yi

X i or xi yi ; otherwise.

Where Xi is the axis of the resource space RS, and impi means that node n1(x1, …, xn) implies n2(y1, …, yn) on the semantics of axis Xi. Then we can get SL (SLN) from SL (SLN)= n1,n2 N SL(n1, n2). Because the semantic links are directed, SL(xi, yi) SL(yi, xi) in most cases. If xi = yi, then SL(xi, yi) =SL(yi, xi)= Impi. On the contrary, if SL(xi, yi) = Impi and SL(yi, xi)= Impi, then xi = yi. It is clear that the link Impi is transitive. For any node n(x1, x2, , xn) in the resulting SLN (N, SL), we have R(xi)= {R(yi)|SL(xi , yi) = Impi}. Then, from the steps of construction, it is clear that the resulting SLN (N, SL) is semantically equivalent to the resource space RS (P, E).

64

Chapter 2 A Semantic Overlay Integrating Normalization with Autonomy

In a resource space RS, an axis with hierarchical coordinates can be transformed into an axis with flat coordinates if only the leaf nodes of each hierarchy are considered. So here focuses on the flat cases. In this case, the semantic link network SLN corresponds to resource space RS regularly. For any two points p1(x1, x2, , xn) and p2(y1, y2, , yn) in RS, we have xi = yi or R(xi) R(yi)= , 1 i n, i.e., both “xi implies yi” and “yi implies xi” hold or neither hold. Correspondingly, for any two nodes n1(x1, x2, , xn) and n2(y1, y2, , yn) in the SLN, SL(xi, yi) =SL(yi, xi)= Impi or SL(xi, yi) =SL(yi, xi)= , i.e., there are no semantic links between xi and yi. Given a resource space RS, there is a semantically equivalent Semantic Link Network SLN, so we can compare the normal forms of RS and the corresponding SLN to study the correlations between the normal forms of RSM and SLN. Corollary 2.4. A 1NF-RSM can be transformed to a 1NF-SLN. Proof. If RS is 1NF, then there does not exist name duplication between coordinates at any axis, so there cannot be the same points in RS. Suppose SLN is the corresponding semantic link network of RS, from N (SLN) =P (RS), we can get that there does not exist semantic-equivalent nodes in SLN, so SLN is also the 1NF. Corollary 2.5. A 2NF-RSM can be transformed to a 2NF-SLN. Proof. If RS is a 2NF, then for any axis, any two coordinates are independent from each other. Suppose SLN is the corresponding semantic link network of RS, then there do not exist semantic-equivalent nodes in SLN. And for any pair of nodes n1(x1, x2, , xn) and n2(y1, y2, , yn) in the SLN, the semantic links between them fall into two cases: SL(xi, yi) =SL(yi, xi)= Impi or SL(xi, yi) =SL(yi, xi)= , so there do not exist inconsistent semantic links and duplicate semantic links between n1 and n2, i.e., the SLN is the 2NF. Corollary 2.6. A 3NF-RSM can be transformed to a 3NF-SLN. Proof. If RS is a 3NF, then any two axes of it are orthogonal with each other, in other words, every axis Xi can represent all the resources in RS, formally, R(X1)=R(X2)= =R(Xn) (Zhuge et al., 2005c), this guarantees that any points in the form of { p(x1, x2, , xn)| xi Xi} are meaningful. Suppose SLN is the corresponding semantic link network of RS, from N (SLN)=P (RS), we have: any nodes in SLN in the form of { n(x1, x2, , xn)| xi Xi} are also meaningful. Then, for any two nodes n1(x1, x2, , xn) and n2(y1, y2, , yn) in the SLN, if there exists some i, 1 i n, such that xi=yi, then SL(xi, yi) =SL(yi, xi)= Impi, then n1 and n2 can be accessed from each

2.3 Relationship between RSM and SLN

65

other. Else if xi yi, for 1 i n hold, let n3= n3(x1, x2, , xn-1, yn), from x1 y1 and xn yn, we can get n3 n1 and n3 n2. Then there exist semantic links n1 Imp1 n3, and n3 Impn n2, so n1 and n2 can also be accessed from each other by the chain n1 Imp1 n3 Impn n2. So any two nodes in the SLN are accessible from each other, i.e., the SLN is the third-normal-form. The above three corollaries further confirm that the normal forms of the Resource Space Model and the Semantic Link Network have the common properties in solving the redundancy and inconsistency. 2.3.3 Topological Properties Just as the quotient resource space defined in (Zhuge et al., 2005c), the quotient Semantic Link Network is the abstract of the original Semantic Link Network, and it reflects the higher level semantics. Here we will prove that the quotient Semantic Link Network keeps the three normal forms of the original one, which shows that the quotient Semantic Link Network is a “good” Semantic Link Network. Definition 2.3. For a Semantic Link Network SLN (N, SL), if there exists an equivalent relation R on the set of nodes N (SLN), then the quotient Semantic Link Network of SLN (N, SL) is defined as SLN (N , SL ), where N (SLN )= N (SLN) R={C1, C2, , Cm}, where Ci is an equivalent class in N (SLN) under the relation R; and, SL ( Ci, Cj)= { SL(ni, nj) | ni, nj N (SLN) and ni Ci, nj Cj }, 1 i, j m. Theorem 2.1. Let SLN (N, SL) be a Semantic Link Network, and SLN (N , SL ) is the quotient Semantic Link Network constructed as above, then SLN (N , SL ) keeps the three normal forms of SLN (N, SL). Proof. (1) If SLN is 1NF, then there do not exist semantic-equivalent nodes in it. For the quotient network SLN (N , SL ), N (SLN )= N (SLN) R={C1, C2, , Cm}. According to the definition of the equivalence relation R, Ci Cj= , 1 i, j m. So there do not exist semantic-equivalent nodes in SLN (N , SL ), which means that SLN (N , SL ) is also 1NF. (2) If SLN is 2NF, then there do not exist inconsistent semantic links and duplicate semantic links between the same pair of nodes in SLN. According to the construction process of the quotient Semantic Link Network, SL (Ci, Cj)= {SL(ni, nj) | ni, nj N (SLN) and ni Ci, nj Cj }, we can get that SL (SLN ) SL (SLN), so there do not exist inconsistent semantic links and duplicate semantic links, in SLN (N , SL ) either, which means that SLN

66

Chapter 2 A Semantic Overlay Integrating Normalization with Autonomy

(N , SL ) is also 2NF. (3) If SLN is 3NF, then there do not exist isolated nodes (accessible from each other), that is for any two nodes ni, nj N (SLN), there exists a path ni n1 nt nj from ni to nj. Then in the case of SLN , for any two nodes Ci, Cj N (SLN ), then there exists ni, nj N (SLN) and ni Ci, nj Cj. Then we can consider the series S=Ci C1 C t C j, where Ck is the corresponding equivalence class of nk, which means that nk Ck, 1 k t. Then in the series S, (a) if any two neighboring nodes Ck, Cp in it satisfying Ck Cp, then S is a path from Ci to Cj ; (b) if there exists two neighboring nodes Ck=Cp in S, then we get a new series S in which Ck Cp is Ck. Repeating this process finite times, we can get a new series S with the property that any two neighboring nodes in it are not equal, so S is a path, and obviously that it is from Ci to Cj. From (a) and (b), we can get that any two nodes in N (SLN ) are accessible from each other, so SLN (N , SL ) is 3NF. We have proposed a method to transform a given Semantic Link Network to a resource space. In fact, the transformation process is to construct a quotient Semantic Link Network first, then to get the corresponding resource space from the quotient Semantic Link Network. According to the construction process and the definition of the quotient semantic link network, we only need to show that the “strongly connected component” is an equivalent relation among the nodes of the Semantic Link Network. Corollary 2.7. Let SLN (N, SL) be a Semantic Link Network, and R be the relation that “in the same strongly connected component” on N (SLN), then R is an equivalent relation on N (SLN). Because the quotient Semantic Link Network is the abstraction of the original Semantic Link Network, the construction process shows again that the resource space is the abstraction of the Semantic Link Network, reflects the higher-level semantics. Given a resource space RS, we have constructed a Semantic Link Network SLN (N, SL) that is semantically equivalent to it, if we view SLN (N, SL) as the underlying undirected graph, then we can study the topological properties of it. Here we always assume that RS is 2NF and its coordinates are flat. Let the weight on each edge Impi be 1, 1 i n, then we can define the distance and diameter of SLN as (Bollobás, 1998). Definition 2.4. For any two nodes n1 and n2 in SLN (N, SL), the distance d(n1, n2) between n1 and n2 is the length of the shortest path between them.

2.3 Relationship between RSM and SLN

67

Definition 2.5. The diameter of the graph SLN (N, SL) is D(SLN)= Max{ d(n1, n2)| n1, n2 N (SLN)}. For any two points p1(x1, x2, , xn) and p2(y1, y2, n i 1

, yn) in the resource

1

space RS(X1, X2, …, Xn), the distance d ( p1 , p2 ) (

d i2 ( xi , yi )) 2 , where di

is the distance on axis Xi, 1 i n (Zhuge et al., 2005c). We can see that the distance d on RS is the Euclidian sum of di, and from Definition 2.4, the distance on SLN is equivalent to d=Min{di |1 i n}. As the existence of the semantic links, the distance between two nodes in SLN is much smaller than in the RS. From this perspective, the SLN is more suitable for the search and location of resources. The following theorem supports this view. Theorem 2.2. Let RS be a 3NF resource space, and SLN (N, SL) is the corresponding Semantic Link Network, then SLN is an Euler graph, and the diameter of it is 2. Proof. Since RS is 3NF, from Corollary 2.3, the SLN is 3NF too, then any two nodes in it are accessible from each other, i.e., SLN is a connected graph. For any node n in SLN, the edges Impi and Impi are present or absent simultaneously, so the degree of n is even. Because the degree of every node in SLN is even, it is an Euler graph. Suppose the distance d(n1, n2) between two nodes n1(x1, x2, , xn) and n2(y1, y2, , yn) is the biggest in SLN, then D(SLN)= d(n1, n2). Then xi yi, for 1 i n. Else if xi=yi for some i, then there exists edge Impi between n1 and n2 such that d(n1, n2)=1. Let n3= n3(z1, z2, , zn), satisfying xi zi, for 1 i n, then there dose not exist any semantic link between n1 and n3, so d(n1, n3)>1= d(n1, n2), this is inconsistent with the assumption that d(n1, n2) is the biggest. So xi yi, for 1 i n holds and therefore d(n1, n2)> 1. Let n4= n4(x1, x2, , xn-1, yn), from x1 y1 and xn yn, we can get n4 n1 and n4 n2. Then there exist edges n1 Imp1 n4, and n4 Impn n2, so n1 and n2 are linked by the path n1 Imp1 n3 Impn n2, from the definition of d(n1, n2), we can get that d(n1, n2) 2. And from d(n1, n2)> 1, we have d(n1, n2)=2. So D(SLN)= d(n1, n2)=2. Theorem 2.2 implies that the distance between any two nodes in the 3NF (corresponding to a resource space) is either 1 or 2. And from Theorem 2.2 we can get the following corollary: Corollary 2.8. Let RS be a 3NF resource space, and SLN (N, SL) is the corresponding Semantic Link Network, then for any two nodes n1(x1, x2,

68

Chapter 2 A Semantic Overlay Integrating Normalization with Autonomy

, xn) and n2(y1, y2, , yn) in SLN, d(n1, n2)=1 “there exists some i, 1 i n, such that xi=yi”. And d(n1, n2)=2 xi yi, for 1 i n. Then we can see that the mean distance between two nodes in SLN is within 1 and 2, much smaller than in the resource space RS, which is related to the dimension n and the average number of coordinates on all axes, so we can say that the Semantic Link Network is more suitable for search and location of resources than Resource Space Model. But there is one more important factor that makes the distance in the Semantic Link Network much smaller, which is that there are too much semantic links in the corresponding Semantic Link Network constructed.

2.4 Union View of Resource Space and Semantic Link Network

Integrating the Resource Space Model’s classification semantics with the Semantic Link Network’s link semantics supports richer semantic modeling and applications. To facilitate the cooperation between the Resource Space Model and the Semantic Link Network, we propose the Union View of the Resource Space Model and the Semantic Link Network by introducing a structure called Resource Class Hierarchy, which can be derived from the Resource Space Model. The Union View has the following three advantages: 1. 2. 3. Providing an efficient mapping mechanism from the Resource Space Model to the Semantic Link Network; Facilitating quick and easy modeling of Semantic Link Network; and, Enhancing the interoperability between Semantic Link Networks.

2.4.1 The Framework Fig.2.6 describes the cooperation between the Resource Space Model and the Semantic Link Network for a simple teaching system. A node in this graph represents a resource or a set of resources. The dotted circles represent resource class hierarchies, each of which corresponds to a resource space. The rectangles in dotted circles denote the resource classes corresponding to a class of resources in the Resource Space Model. Both the circle and the triangle represent generic classes defined in the original Semantic Link Network. And the rounded rectangles denote the printable classes, the system-defined classes similar to the basic types in programming languages (Hull and King, 1987; Ullman, 1988). Using the printable

2.4 Union View of Resource Space and Semantic Link Network

69

classes, the atomic value of attributes can be precisely represented. The only String class has three duplications for clarity. The edges and edge labels in this graph are used to represent the relationships between nodes. Note that the edges without labels in the dotted circles represent the inclusion relationships between resource classes. Fig.2.6 has two resource class hierarchies: Human-Resource and ScoreCourse derived from the corresponding resource spaces Human-Resource and Score-Course in the RSM. The resource space Human-Resource is used to store the information about all teachers and students. And the resource space Score-Course is used to manage the information about the test scores of the students. Definition 2.6. The Union View of the Resource Space Model and the Semantic Link Network is a triple S = (VE, RE, RCH), where VE is a finite set of nodes in the Union View which could include resources, generic classes, printable classes and resource classes derived from the Resource Space Model; RE is a finite set of triple , where re represents the relationship between nodes v1 and v2 coming from VE; and, RCH is a finite set of resource class hierarchies each of which corresponds to a resource space in the meaning of the Resource Space Model.

Fig.2.6. Example of incorporating Semantic Link Network with the resource space. The Union View has the following three advantages:

70

Chapter 2 A Semantic Overlay Integrating Normalization with Autonomy

(1) The Union View provides an efficient mapping mechanism from Resource Space Model to Semantic Link Network. The Semantic Link Network is a general-purpose semantic-rich model for the representation of resources and can provide a semantic overlay for many data models in the style of map. Since each resource class hierarchy corresponds to a resource space, by introducing the resource class hierarchies into the original Semantic Link Network, the Union View provides an efficient and effective mapping from the Resource Space Model to the Semantic Link Network. The efficiency and effectiveness of the mapping depend on the following two points: a) The Union View of Resource Space Model and Semantic Link Network increases the granularity of the mapping from Resource Space Model to Semantic Link Network. In the original Semantic Link Network, each resource defined in the Resource Space Model has a mapping image in the Semantic Link Network. However in the Union View, a point, an axis and even a resource space in the Resource Space Model can be mapped to a node in the Semantic Link Network. Thus, many operations can be applied to a batch of resources. b) The Union View of Resource Space Model and Semantic Link Network provides not only resource mappings but also operation mappings from the Resource Space Model to the Semantic Link Network. (2) The Union View of Resource Space Model and Semantic Link Network makes use of the Resource Space Model to make the Semantic Link Network modeling easier. Many traditional methods such as entity-relation model can be used to help modeling with the Semantic Link Network (Chen, 1976). Since quick and easy modeling is one of the salient features of the Resource Space Model, the resource class hierarchies in the Union View make the Semantic Link Network modeling easier without conflicting with other approaches. (3) The Union View of the Resource Space Model and the Semantic Link Network can enhance the interoperability between Semantic Link Networks. Semantic Link Network operations are mainly based on the structure information. The Union View has introduced some new RSM-based operations, which emphasize on not only the structure but also the semantics of Semantic Link Networks. Thus, the Union View of Resource Space Model and Semantic Link Network can enhance the interoperability between Semantic Link Networks.

2.4 Union View of Resource Space and Semantic Link Network

71

2.4.2 The Core Component of Union View: Resource Class Hierarchy We use to represent a resource space, an axis, a point or a coordinate of the RSM and R( ) represents the resources that can contain. Let RS(X1, X2, … , Xn) be a resource space. Its axes Xi = {Ci1, Ci2, …, Cim} can be extended to Xi* = {Ci1, Ci2, …, Cim, i}, 1 i n. A point taking the coordinate value i on the axis Xi means that the axis Xi has no restriction on that point. For example, if a student belongs to the point ( Specialization, 3#, male) in the resource space Spec-Apart-Gen, then he/she may major in math, chemistry, physics or any other specialization. Meanwhile, we introduce a constant i at each axis Xi that is equivalent to the union of Ci1, Ci2, …, Cim in semantics, i.e. R( i) = R(Ci1) R(Ci2) … R(Cim). Thus, we say that R( i) holds. i is derived from Ci1, Ci2, …, Cim. So R( i) Definition 2.7. Let RS(X1, X2, … , Xn) be a resource space and Xi*, 1 i n be its extended axes. The set of resources represented by ( 1, …, i-1, i, i+1,…, n) is defined as the axis resource class of axis Xi, denoted as aci. Axis resource class and each relation in the Cartesian product X1* X2* … Xn* are generally called resource classes. Particularly, the relation ( 1, 2, …, n) is called base resource class, denoted as rootRS. In fact, the resource class (C1,j1, C2,j2, …, Cn,jn) is the set of resources which simultaneously satisfies all the conditions C1,j1, C2,j2, …, and Cn,jn on corresponding axes in the resource space. And the base resource class ( 1, 2, …, n) actually represents the whole resource space. Take the resource space Spec-Apart-Gen as an example, the base resource class ( Specialization, Apartment, Gender) denotes all the students concerned by this resource space. Regardless the gender, the resource class (physics, 3#, Gender) represents all the students who are major in physics and live in 3# apartment. For any two resource classes c = (C1,j1, C2,j2, …, Cn,jn) and c = (C1,j1 , C2,j2 , …, Cn,jn ) in a resource space, if R(C1,j1) R(C1,j1 ), R(C2,j2) R(C2,j2 ), …, R(Cn,jn) R(Cn,jn ) hold respectively, then the resource class c is called the subclass of the resource class c and the resource class c is called the superclass of the resource class c. And this inclusion relationship is denoted as c c c . The inclusion relationship existing between c and c is a particular case of the subtype relationship defined in (Zhuge, 2003). Since the inclusion relationship exists only between resource classes in the resource class hierarchy, for the sake of representation simplicity we also name the inclu-

72

Chapter 2 A Semantic Overlay Integrating Normalization with Autonomy

sion relationship as the internal relationship and any other relationships as external relationships. Theorem 2.3. Let RS(X1, X2, … , Xn) be a resource space and Xi* (1 i n) be its extended axes. Any axis resource class aci of axis Xi is one of the subclasses of the base resource class rootRS of RS. And the axis resource class aci is one of the super-classes of any of the resource classes (x1, …, xi-1, xi, xi+1…, xn), where xj belongs to Xj* (1 j i 1 or i+1 j n) and xi belongs to Xi. Proof. (1) For the axis resource class aci = ( 1, …, i-1, i, i+1,…, n) and the base resource class rootRS = ( 1, 2,…, n), we have R( 1) R( 1), …, R( i-1) R( i-1), R( i) R( i), R( i+1) R( i+1), …R( n) R( n) hold respectively. So the axis resource class aci is one of the subclasses of the base resource class rootRS. (2) For any resource class c = (x1, …, xi-1, xi, xi+1…, xn) in RS where xj belongs to Xj* (1 j i 1 or i+1 j n) and xi belongs to Xi. It is obvious that c is one of the subclasses of the resource class c = ( 1, …, i-1, xi, i+1…, n). According to the definition of i, we have R(xi) R( i). Thus c is one of the subclass of aci. So the axis resource class aci is one of the superclasses of the resource class (x1, …, xi-1, xi, xi+1…, xn). Theorem 2.3 indicates that the introduction of axis resource classes will not break the inclusion relationships in the resource class hierarchy of a certain resource space. The resource class set of the resource class hierarchy is defined as follows: Definition 2.8. Let RS(X1, X2, … , Xn) be a resource space and Xi*(1 i n) be its extended axes. The set CS = X1* X2* … Xn* {( 1, …, i-1, i, i+1, …, n) | 1 i n} is called the resource class set of RS. In the resource class hierarchy of a given resource space, only the inclusion relationships between resource classes are concerned. The inclusion relationship is reflexive, asymmetric and transitive, and it supports multi-inheritance. Thus, the resource class hierarchy of a given resource space can be viewed as a directed graph consisting of its resource class set and these inclusion relationships. In a directed graph, (v1, v2, …, vn) will be used to denote one path from v1 to vn. The resource class hierarchy of a given resource space can be defined as follows: Definition 2.9. Let RS(X1, X2, … , Xn) be a resource space. The resource class hierarchy of RS is defined as the directed graph RSG(CS, E), where 1. CS is the resource class set of RS; 2. For any resource classes c1, c2 CS, if E, then c2 cc1 holds;

2.4 Union View of Resource Space and Semantic Link Network

73

3.

For any resource classes c1, c2 CS, if c2 at least one path (c1, cj1, cj2, …, cjm, c2).

c

c1 holds, then there exists

The second condition in definition 2.9 is used to guarantee that RSG(CS, E) can only include the valid inclusion relationships in RS. And the third condition implies that all inclusion relationships in RS should be implicitly included in RSG(CS, E). Fig.2.7 is the illustration of a flat resource space and a part of its resource class hierarchy.

Fig.2.7. A flat resource space and a part of its resource class hierarchy.

74

Chapter 2 A Semantic Overlay Integrating Normalization with Autonomy

2.4.3 Operations on Resource Class Hierarchy The Resource Space Model’s Join, Disjoin, Merge and Split operations can also be applied to the Resource Class Hierarchy. The following theorem defines the Join operation on resource class hierarchies of resource spaces: Theorem 2.4. Let RS1(X1, …, Xt, Xt+1, …, Xm) and RS2(Xt+1, …, Xm, Xm+1, …, Xn) be two resource spaces that can be joined together to form the resource space RS(X1, …, Xt, Xt+1, …, Xm, Xm+1, …, Xn). Assume that RSG1(CS1, E1) and RSG2(CS2, E2) are the resource class hierarchies of RS1 and RS2 respectively. We construct the directed graph RSG(CS, E) such that: (1) CS = X1* X2* … Xn* {( 1, …, i-1, i, i+1, …, n) | 1 i n}; (2) For any two classes c = (C1, …, Ct, Ct+1, …, Cm, Cm+1, …, Cn) and c = (C1 , …, Ct , Ct+1 , …, Cm , Cm+1, …, Cn ) in CS, E holds if and only if both E1 and E2 hold. Then, the directed graph RSG(CS, E) is the resource class hierarchy of resource space RS. Proof. According to the definition 2.8, it is obvious that CS is the resource class set of resource space RS. We will prove that E satisfies the last two conditions mentioned in definition 2.9. (1) For any two resource classes c = (C1, …, Ct, Ct+1, …, Cm, Cm+1, …, Cn) and c = (C1 , …, Ct , Ct+1 , …, Cm , Cm+1 , …, Cn ) in CS, if E holds, then E1 and < (Ct+1, …, Cm, Cm+1, …, Cn), (Ct+1 , …, Cm , Cm+1 , …, Cn )> E2 hold. Since RSG1(CS1, E1) and RSG2(CS2, E2) are the resource class hierarchies of RS1 and RS2 respectively, R(C1) R(C1 ), R(C2) R(C2 ), …, R(Cn) R(Cn ) hold. So c c c holds. (2) For any two resource classes c = (C1, …, Ct, Ct+1, …, Cm, Cm+1, …, Cn) and c = (C1 , …, Ct , Ct+1 , …, Cm , Cm+1 , …, Cn ) in CS, if c c c holds, we can prove that there exists one path from c to c in RSG(CS, E). Clearly, both (C1, …, Ct, Ct+1, …, Cm) c (C1 , …, Ct , Ct+1 , …, Cm ) and (Ct+1, …, Cm, Cm+1, …, Cn) c (Ct+1 , …, Cm , Cm+1 , …, Cn ) hold. Let xi (t+1 i m) be a coordinate from the axis Xi (t+1 i m). And this coordinate satisfies: R(Ci ) R(xi) R(Ci) and there does not exist a coordinate xi such that both R(xi) R(xi ) R(Ci) and xi xi hold. Thus, both E1 and E2 hold. So E holds. By analogy, through limited steps we can construct one path from c to c in RSG(CS, E). According to 1) and 2), we can conclude that the directed graph RSG(CS, E) is the resource class hierarchy of resource space RS. According to theorem 2.4, the resource class hierarchy of resource space RS deriving from the Join operation of resource spaces RS1 and RS2 can be easily constructed. The Disjoin operation on the resource class hierarchy of a given resource space complies with the following theorem. Theorem 2.5. Let RS1(X1, …, Xt, Xt+1, …, Xm) and RS2(Xt+1, …, Xm, Xm+1, …, Xn) be two resource spaces which derive from the Disjoin operation on the resource space RS(X1, …, Xt, Xt+1, …, Xm, Xm+1, …, Xn). Assume that RSG(CS, E) is the resource class hierarchy of RS. We construct the directed graph RSG1(CS1, E1) such that: (1) CS1 = X1* X2* … Xm* {( 1, …, i-1, i, i+1, …, m) | 1 i m}; (2) For any two resource classes c = (C1, …, Ct, Ct+1, …, Cm) and c = (C1 , …, Ct , Ct+1 , …, Cm ) in CS1, E1 holds if and only if there exists at least a pair of resource classes (C1, …, Ct, Ct+1, …, Cm, Cm+1, …, Cn) and (C1 , …, Ct , Ct+1 , …, Cm , Cm+1 , …, Cn ) in RSG(CS, E) such that E holds. Then, the directed graph RSG1(CS1, E1) is the resource class hierarchy of resource space RS1. Proof. We prove that RSG1(CS1, E1) satisfies the three conditions as the resource class hierarchy of resource space RS1 as follows: (1) According to the definition of the resource class set of a given resource space, it is obvious that CS1 is the resource class set of resource space RS1. (2) For any two resource classes c = (C1, …, Ct, Ct+1, …, Cm) and c = (C1 , …, Ct , Ct+1 , …, Cm ) in CS1, if E1 holds, then there exists one pair of resource classes (C1, …, Ct, Ct+1, …, Cm, Cm+1, …, Cn) and (C1 , …, Ct , Ct+1 , …, Cm , Cm+1 , …, Cn ) in RSG(CS, E) such that E holds. Thus, R(C1) R(C1 ), R(C2) R(C2 ), …, R(Cm ) R(Cm ) hold. It is obvious that c c c holds. (3) Suppose that two resource classes c1 = (C1, …, Ct, Ct+1, …, Cm) and c1 = (C1 , …, Ct , Ct+1 , …, Cm ) in CS1 satisfy c1 c c1 . There exist two resource classes c and c in CS such that they have common coordinates on X1, X2, …, and Xm with c1 and c1 respectively and c c c holds. So there

76

Chapter 2 A Semantic Overlay Integrating Normalization with Autonomy

must exist one path (c, v1, v2, …, vp, c ) connecting c with c in RSG(CS, E). For any vi (1 i p), there must exist a class vi in CS1 such that vi has common coordinates on X1, X2, …, Xm with vi. Thus, there must exist the path (c1, v1 , v2 , …, vp , c1 ) connecting c1 with c1 in RSG1(CS1, E1). According to 1), 2) and 3), we have: the directed graph RSG1(CS1, E1) is the resource class hierarchy of resource space RS1. Theorem 2.5 provides a method to obtain the resource class hierarchy of a resource space deriving from the Disjoin operation on another resource space. The following theorem defines the Merge operation on resource class hierarchies of resource spaces: Theorem 2.6. Let RS1(X1, X2, …, Xm, X ) and RS2(X1, X2, …, Xm, X ) be two resource spaces that can be merged into a resource space RS(X1, X2, …, Xm, X) by merging X and X into X. Assume that RSG1(CS1, E1) and RSG2(CS2, E2) are the resource class hierarchies of RS1 and RS2 respectively and that ac, ac and ac are the axis resource classes corresponding to axes X, X and X respectively. We construct the directed graph RSG(CS, E) that satisfies: 1. CS = CS1 CS2 {ac} – {ac , ac }; and, 2. E = E1 E2 { | E1 or E2} { | E1 or E2} – { | E1} – { | E2}. If RS is in 2NF, then the directed graph RSG(CS, E) is the resource class hierarchy of resource space RS. Proof. Let CS* be the resource class set of RS. CS (1) Assume that the resource class c = (C1,i1, C2, i2, …, Cm,im, C) holds. If c equals to either ac or the base resource class of RS, then c CS* holds. Otherwise, either c CS1 or c CS2 holds. Thus the coordinate C is on either X or X . So we conclude that the coordinate also belongs to X. It is obvious that c CS*. So CS CS* holds. For any resource class c = (C1,i1, C2, i2, …, Cm,im, C) CS*, If c is equal to either ac or the base resource class of RS, then c CS holds. Otherwise, the coordinate C must be on X. Thus the coordinate C is on either X or X . So either c CS1 or c CS2 holds. It is obvious that c CS. So CS* CS holds. Thus, we have CS = CS* holds. So CS is the resource class set of RS. (2) It is easy to prove that for any E, c c c holds. For any two resource classes c and c in RS that satisfy c c c , we will prove that there

2.4 Union View of Resource Space and Semantic Link Network

77

exists one path from c to c in RSG(CS, E) (here excludes the case where c is equal to c ): (a) Firstly, we suppose that c CS1 holds. If c is equal to ac, then we can conclude that c c ac holds. So there must exist one path (ac , v1, v2, …, vp, c) from ac to c in RSG1(CS1, E1). Thus, the path (ac, v1, v2, …, vp, c) must exist in RSG(CS, E). If c is not equal to ac, c CS1 must hold. Otherwise, the RS arising from the merge of RS1 and RS2 cannot be in 2NF. So there must exist a path (c , v1, v2, …, vp, c) from c to c in RSG1(CS1, E1). There must exist a path from c to c in RSG(CS, E) which is obtained by replacing ac with ac in (c , v1, v2, …, vp, c). (b) Similarly, we can draw the same conclusion when c CS1 holds. According to 1) and 2), we can conclude that the directed graph RSG(CS, E) is the resource class hierarchy of resource space RS. The Split operation on the resource class hierarchy of a given resource space complies with the following theorem. Theorem 2.7. Let RS1(X1, X2, …, Xm, X ) and RS2(X1, X2, …, Xm, X ) be two resource spaces deriving from the Split operation on the resource space RS(X1, X2, …, Xm, X) in 2NF. Assume that RSG(CS, E) is the resource class hierarchy of RS and that ac, ac and ac are the axis resource classes of axes X, X and X respectively. We construct the directed graph RSG1(CS1, E1) satisfying: 1. CS1 = X1* X2* … Xm* X * {( 1, …, i-1, i, i+1, …, m, ) | 1 i m} ac ; and, 2. E1 = { | c CS1 c CS1 E} { | c CS1 E} { | c CS1 E}. Then, the directed graph RSG1(CS1, E1) is the resource class hierarchy of resource space RS1. 2.4.4 Operations on the Union View of Resource Space and Semantic Link Network The Union View of resource space and Semantic Link Network not only maps resource spaces into resource class hierarchies but also inherits all operations of the RSM. The informal description of the four RSM-based operations on the Union View is as follows: Join: Assume that RSG1(CS1, E1) and RSG2(CS2, E2) are two resource class hierarchies which correspond to the resource spaces RS1(X1, …, Xt, Xt+1, …, Xm) and RS2(Xt+1, …, Xm, Xm+1, …, Xn) respectively. And RSG(CS, E) is derived from the Join operation on RSG1(CS1, E1) and RSG2(CS2, E2) with axes Xt+1, …, Xm as described in theorem 2.4. For any

78

Chapter 2 A Semantic Overlay Integrating Normalization with Autonomy

resource class c1 = (C1, …, Ct, Ct+1, …, Cm) in RSG1(CS1, E1), there must exist the corresponding resource class c = (C1, …, Ct, Ct+1, …, Cm, m+1, …, n) in RSG(CS, E) according to the definition of resource class hierarchy. Then, any external relationships starting from or ending at c1 should be transferred to c. Disjoin: Assume that RSG(CS, E) is a resource class hierarchy corresponding to the resource space RS(X1, …, Xt, Xt+1, …, Xm, Xm+1, …, Xn). RSG1(CS1, E1) and RSG2(CS2, E2) are derived from the Disjoin operation on RSG(CS, E) with axes Xt+1, …, Xm as described in theorem 2.5. For any resource class c = (C1, …, Ct, Ct+1, …, Cm, m+1, …, n) in RSG(CS, E), there must exist corresponding resource class c1 = (C1, …, Ct, Ct+1, …, Cm) in RSG1(CS, E) according to the definition of resource class hierarchy. Then, any external relationships starting from or ending at c should be transferred to c1. Any resource class c = (C1, …, Ct, Ct+1, …, Cm, Cm+1, …, Cn) in RSG(CS, E) satisfies: there at least exits one i (m+1 i n) such that Ci i, if there exists any external relationship starting from or ending at c , then insert c and all external relationships starting from or ending at c into the destination SLN and insert the subtype relationship between c and the resource class (C1, …, Ct, Ct+1, …, Cm, m+1, …, n). Merge: Assume that RSG1(CS1, E1) and RSG2(CS2, E2) are two resource class hierarchies corresponding to the resource spaces RS1(X1, X2, …, Xm, X ) and RS2(X1, X2, …, Xm, X ) respectively. And RSG(CS, E) is derived from the Merge operation on RSG1(CS1, E1) and RSG2(CS2, E2) by merging X and X into X as described in theorem 2.6. Let ac, ac and ac be the axis resource classes corresponding to axes X, X and X respectively. For any resource class c1 = (C1, C2,…, Cm, C ) in RSG1(CS1, E1), if c1 is not equal to ac , then we can find the resource class c = (C1, C2,…, Cm, C ) in RSG(CS, E). Any external relationships starting from or ending at c1 should be transferred to c. If c1 is equal to ac , then any external relationships starting from or ending at c1 should be transferred to each of the immediate successors of ac in RSG(CS, E). Split: Assume that RSG(CS, E) is one resource class hierarchy corresponding to the resource space RS1(X1, X2, …, Xm, X). RSG1(CS1, E1) and RSG2(CS2, E2) are derived from the Split operation on RSG(CS, E) by splitting X into X and X as described in theorem 2.7. Let ac, ac and ac be the axis resource classes corresponding to axes X, X and X respectively. For any resource class c = (C1, C2,…, Cm, C) in RSG(CS, E), if c is not equal to ac, then we can find the resource class c = (C1, C2,…, Cm, C) in RSG1(CS1, E1) or RSG2(CS2, E2). Any external relationships starting

2.5 Discussion and Summary

79

from or ending at c should be transferred to c . If c is equal to ac, then any external relationships starting from or ending at c should be transferred to both ac and ac . The following theorems indicate that the RSM-based operations keep some SLN normal forms. Theorem 2.8. Both the Join operation and the Merge operation of the Resource Space Model keep 1NF-SLN, 2NF-SLN and 3NF-SLN of the original Semantic Link Network. Neither the disjoin operation nor the split operation increase semanticequivalent nodes and links, so they keep 1NF and 2NF. But they may break accessibility, so they do not keep 3NF. Theorem 2.9. Both disjoin operation and split operation of the Resourcce Space Model keep 1NF-SLN and 2NF-SLN of the original Semantic Link Networks.

2.5 Discussion and Summary

A data cube is a type of multidimensional matrix that lets users explore and analyze a collection of data from many perspectives (Agrawal, 1996; Gray, 1996). Data is usually organized in cube form. Cooperation with other tools in statistics or analysis, the data cube can support establishing trends and analyzing performance (Graefe, 1993). The following are three major differences between data cube and the Resource Space Model: 1. The data cube is used to establish trends and analyze performance, while the Resource Space Model is to specify and manage resources by orthogonal classification semantics; The data in a data cube has already been processed and aggregated into cube form, thus data cube is a read-only data model, while the Resource Space Model can deal with dynamic data; and, A lot of calculations for view in data cube will be completed before hand, while the operations in the Resource Space Model are real-time.

2.

3.

A data warehouse is a specifically structured copy of transaction data for query and analysis (Inmon, 2002). The data in a data warehouse is usually organized by certain multidimensional data model. Cooperation with OLAP and data mining, data warehouse can provide decision support for enterprise management.

80

Chapter 2 A Semantic Overlay Integrating Normalization with Autonomy

The main differences between data warehouse and the Resource Space Model are: 1. A data warehouse is mainly used to support decision-making, so it attaches more importance to historical, summarized and consolidated data (Kimball, 1996). In contrast, the Resource Space Model focuses mainly on online resource operation. Data warehouses are usually not updatable. Since data warehouses are usually used to store historical data for query and analysis, data in data warehouses are read-only. While, the Resource Space Model, targeted at Internet applications, can be frequently operated. The application scope of the Resource Space Model is broader than that of data warehouse. The application scope of a data warehouse is usually confined within a certain enterprise, while the Resource Space Model can uniformly specify and manage versatile resources on the whole network.

2.

3.

The comparison in table 2.1 shows that the integration of the Resource Space Model and the Semantic Link Network reserves the advantages of both models. Table 2.1 A brief comparison.

Features RSM SLN RSM+ SLN

Semantics Functional Classifica- Link Classification Classification seman- semantics semantics + tion dependtics Link semantics + Aggrega- ence tion Yes Yes Yes No Yes

Data Cube

RDBM

Normal Form Integrity Reasoning ability Search

Yes No

Yes Strong

Yes Strong

No No

Yes middle

Space locating

Relation search

Locating + Relation search

Space locating

Key-based search

2.5 Discussion and Summary

81

The Resource Space Model and the Semantic Link Network reflect classification semantics and link semantics respectively. They can be transformed to each other under certain conditions. The integration of the two models can form a single semantic image to support semantic-based resource location. The operations on the union view can keep the normal forms. Further, the integration of the two models can construct a semantic overlay integrating the normalization and autonomy.

Chapter 3 Expressiveness of Query Languages for Resource Space Model

A great variety of query languages can be designed for operating resource spaces. But how many operations are complete or necessary? How to define “complete” or “necessary” formally? This chapter answers these questions by investigating the theoretical basis for determining how complete a selection capability is provided in a proposed resource sublanguage independently of any host language. The results are very useful to the design and analysis of the operating languages of the Resource Space Model.

3.1 The Problem

A number of operations on resource spaces such as Join, Disjoin, Merge and Split have been defined (Zhuge, 2004a). The principles in the design of a Resource Operation Language (ROL) of the Resource Space Model have been proposed (Zhuge, 2004d). The Resource Operation Language provides a uniform interface for programmers or application systems to operate the Resource Space Model through programs. The completeness of query operations on resource spaces is discussed in (Zhuge and Yao, 2006), where a set of complete operations and a set of necessary operations are also defined. A relational algebra and a relational calculus are defined in the relational data model. The relational algebra is a collection of operations on relations, and a query language could be directly based on it. There are eight operations defined in the relational algebra, they are extended Cartesian product, traditional set operations (union, intersection and difference), projection, join, division and selection (Codd, 1970). The relational calculus is an applied predicate calculus which may also be used in the formulation of queries on any database consisting of a finite collection of relations in a simple normal form. A data sublanguage (called ALPHA), founded directly on the relational calculus, has been informally described in (Codd, 1971b). The equivalence of relational algebra and relational calculus is

84

Chapter 3 Expressiveness of Query Languages for Resource Space Model

proved in (Codd, 1971b; Ullman, 1982). An algebra or calculus is relationally complete if, given any finite collection of relations R1, R2, …, RN in simple normal form, the expressions of the algebra or calculus permit definition of any relation definable from R1, R2, …, RN by alpha expressions (Codd, 1972). A relational database language SQL (Structured Query Language) based on the relational algebra and calculus is proposed by Boyce and Chamberlin (ANSI SQL, 1986; Boyce et al., 1975; Chamberlin and Boyce, 1974; Chamberlin et al., 1976). A great variety of languages could be designed for querying and updating resource spaces. This chapter investigates a theoretical basis which may be used to determine how complete a selection capability is provided in a proposed resource sublanguage independently of any host language in which the sublanguage may be embedded. We especially concern: when are the defined operations sufficient for use and how many operations are necessary?

3.2 Completeness of Query Languages on Resource Spaces

Completeness is a rather mathematical concept. In mathematics, it is commonly the closeness of a set under operations on the set. For example, the rational number field is complete under the operations addition, subtraction, multiplication and division. And we can also define the completeness of operations, for example on the rational number field, operations addition, subtraction, multiplication and division are complete but operations addition and subtraction are incomplete. It is because operations addition and subtraction can only get a subset of the rational number field, but not the whole field. If we regard the queries of the Resource Space Model as operations on resource spaces, then the completeness of query languages on resource spaces can be viewed as the completeness of operations as above. First we give an example in the traditional set theory. 3.2.1 Basic Idea Given two sets A and B, if we only consider the set operations between them, then how many operations are sufficient for use? Experience tells us that three operations—union, intersection and difference are complete. But what is the reason? Can we define other operations in addition to the three operations?

3.2 Completeness of Query Languages on Resource Spaces

85

Fig.3.1. An example of set operations. As shown in Fig.3.1, in general, A and B are divided into three parts according to the distribution of their elements (here we do not consider the simpler cases where some of the three parts are empty). Part 1 consists of elements which are in A but not in B. Part 2 consists of elements which are both in A and B. Part 3 consists of elements which are in B but not in A. If the empty set is also considered as a required result, then there are in all 23=8 required sets, which are: { , Part 1, Part 2, Part 3, Part 1 and Part 2, Part 1 and Part 3, Part 2 and Part 3, Part 1 and Part 2 and Part 3}, which actually are: { , A B, A B, B A, A, A B, B, A B}, where is always called symmetric difference (Robert, 1979). We can see that the operations set { , , , } are sufficient, because from A and B, these operations can get all the required results. Among these four operations, only and are necessary, because and can be represented by them, we can see it from: A B= A (A B), and A B= (A B) (B A). This inspires us to explore the theoretical basis for the design and analysis of the resource space’s query languages. An operation set is called sufficient when it can get all the required results, and an operation set is called necessary only when it is the smallest sufficient operation set. 3.2.2 Definition of Completeness of Query Operations First, we need to make clear the definition of query operations on resource spaces. Because the operands are resource spaces and the results of operations are also resource spaces, we can view the query operations as the mappings on resource spaces. Suppose S is the discussed domain, an op-

86

Chapter 3 Expressiveness of Query Languages for Resource Space Model

eration op on S is a mapping op: S S S, op (s1, , sn)= s, where s and s1, , sn belong to S. When n=1, op is called unary operation like Disjoin and Split; when n=2, op is called binary operation like Join and Merge (Zhuge, 2004b). Here we only consider the unary and binary operations. Then, how to define the completeness of operations on resource spaces? In applications, the number of resource spaces considered is finite, so for any given finite resource spaces, if an operation set can get all the possible query results on them, then the operation set is complete. Then we need to know what all the possible query results are of any given finite resource spaces. For query in resource spaces, we only consider the information in single spaces or the correlations between many spaces. For a single resource space, the smallest unit is a coordinate of one point. So for any given finite collection of resource spaces RS1, RS2, …, RSN in simple normal form, we , xd)| can get that all the possible query results are in the form of {RS(x1, xk RSi(Xj), 1 i N, d 1 and 1 k d}, i.e., all the combinations of the coordinates of the resource spaces. So we can give the following definition: Definition 3.1 An operation set OP on resource spaces is called complete, if: for any given finite collection of resource spaces RS1, RS2, …, RSN in simple normal form, OP can get all the resource spaces in the form of: {RS(x1, , xd)| xk RSi (Xj), 1 i N, d 1 and 1 k d}. As discussed above, in a complete operation set, some operations can be represented by other operations, so they are not necessary, then the following definition can be given: Definition 3.2 An operation set OP on resource spaces is called necessary, if it is complete and there does not exist a real subset of it which is also complete. Since the definition of the completeness of operations is given, then we need to discuss the design and verification of a complete set of operations.

3.3 Complete set of Operations

In addition to existing operations, we first define several new operations. Then the completeness of the defined set of operations is verified, and we want to answer the question like: how many new operations can we define in addition to existing operations?

3.3 Complete set of Operations

87

3.3.1 Design of Query Operations The operations Join, Disjoin, Merge and Split, have been defined in (Zhuge, 2004a) as follows: Operation 3.1 Join Let |RS| be the number of the dimensions of RS. If two resource spaces RS1 and RS2 store the same type of resources and have n (n 1) common axes, then they can be joined together as one resource space RS such that RS1 and RS2 share these n common axes and |RS|=|RS1| + |RS2| n. RS is called the join of RS1 and RS2, denoted as RS1 RS2 RS. According to the above definition, all the resources in the result resource space RS come from RS1 and RS2 and they can be classified by more axes. Join operation provides an approach to refining classification of resources. Operation 3.2 Disjoin A resource space RS can be disjoined into two resource spaces RS1 and RS2 that store the same type of resources as that of RS such that they have n (1 n min(|RS1|, |RS2|)) common axes and |RS| n different axes, and |RS| = |RS1| + |RS2| n (denoted as RS RS1 RS2). The Disjoin operation can clarify the classification of resources by separating a resource space with large number of axes into two smaller ones. Both Join and Disjoin operations keep 1NF, 2NF and 3NF of the Resource Space Model. Operation 3.3 Merge If two resource spaces RS1 and RS2 store the same type of resources and satisfy: (1) | RS1|=| RS2|=n; and (2) they have n 1 common axes, and there exist two different axes X' and X” satisfying the merge condition, then they can be merged into one RS by retaining the n 1 common axes and adding a new axis X*=X' X”. RS is called the merge of RS1 and RS2, denoted as RS1 RS2 RS, and |RS|= n. Operation 3.4 Split A resource space RS can be split into two resource spaces RS1 and RS2 that store the same type of resources as RS and have |RS| 1 common axes by splitting an axis X into two: X’ and X , such that X=X’ X . This split operation is denoted as RS RS1 RS2. By the split operation, the unconcerned coordinates on a certain axis can be filtered out and only the interested coordinates are preserved. It is obvious that we can also define the set operations of the Resource Space Model, including operations like Union, Difference and Intersection. These operations are not the same with the traditional set operations. They should also satisfy more conditions. Suppose two resource spaces RS1 and RS2 have the same number of dimensions, and the corresponding axes are

88

Chapter 3 Expressiveness of Query Languages for Resource Space Model

in the same domain ontology. Then, we can define the operations Union, Difference and Intersection as follows: , The union of two resource spaces RS1(X11, Operation 3.5 Union , X2n) is: RS1 RS2={ (x1, , xn)| (x1, , xn) RS1 or X1n) and RS2(X21, , xn) RS2}. The result is also a resource space with n axes, which (x1, consists of the points in RS1 or RS2. Operation 3.6 Difference The difference of two resource spaces , X1n) and RS2(X21, , X2n) is: RS1 RS2={ (x1, , xn)| (x1, , RS1(X11, , xn) RS2}. The result is also a resource space with n xn) RS1 and (x1, axes, which consists of the points in RS1 but not in RS2. Operation 3.7 Intersection The intersection of two resource spaces RS1(X11, , X1n) and RS2(X21, , X2n) is: RS1 RS2={ (x1, , xn)| (x1, , xn) RS1 and (x1, , xn) RS2}. The result is also a resource space with n axes, which consists of the points both in RS1 and RS2. In addition to these operations, we can also define two operations Extended Cartesian Product and Selection as follows: Operation 3.8 Extended Cartesian Product The extended Cartesian Product of two resource spaces RS1(X11, , X1n) and RS2(X21, , X2m) is a resource space with n + m axes. The first n axes are the axes of RS1 and the following m axes are axes of RS2. If RS1 has k1 points and RS2 has k2 points, then the Extended Cartesian Product of RS1 and RS2 has k1 k2 points, we denote it as RS1 RS2={(x11, , x1n, x21, , x2m) | (x11, , x1n) RS1 and (x21, , x2m) RS2 }. Operation 3.9 Selection The selection is also called Restriction. It is for selecting the points satisfying the given conditions in the resource space RS, denoted as F(RS)={t | t RS and F(t)=’true’}, where F represents the selecting conditions, it is a logic expression, it has binary value ‘true’ or ‘false’. The logic expression F is composed of the logic operators , , connecting every arithmetic expression. In fact the operation ‘selection’ is to select the points that make the logic expression F be true from the resource space RS. It is clear that different users will also design new operations on different purposes and we cannot list all of them. Then, we will discuss the verification of the completeness of operations and prove that the nine operations defined above are a complete operation set for query on resource spaces.

3.3 Complete set of Operations

89

3.3.2 Verification of Completeness of Operations Before we verify the completeness of operations, two definitions about operations should be given first. Given two operations op1 and op2 on resource spaces, suppose op1 is unary and op2 is binary, we use op1 (RS1) to represent the result of RS1 under operation op1, RS1op2RS2 to represent the result of RS1 and RS2 under operation op2. Definition 3.3 Two unary operations op1 and op2 on resource spaces are called equivalent to each other if for any resource space RS, op1 (RS)= op2 (RS). Two binary operations op1 and op2 on resource spaces are called equivalent to each other if for any resource spaces RS1 and RS2, RS1 op1 RS2= RS1 op2 RS2. For example, if we define a binary operation ‘*’ as: *: RS1* RS2= RS1 (RS1 RS2), then ‘*’ is equivalent to the operation ‘ ’, denoted as *= . If two operations are equivalent to each other, they are the same from the perspective of mapping, so they are the same operation. We define the equivalence of operations is to show that the operations elaborately designed by users may be the same with existing operations in essence. As we can see, the operation ‘*’ is composed of operation ‘ ’, then we can say that operation ‘*’ can be represented by operation ‘ ’. Then we have the following definition. Definition 3.4 Suppose OP is an operation set on resource spaces, an unary (or binary) operation op is called can be represented by OP, if op (RS) (or RS1 op RS2) can be represented as an expression of OP. For example, we have RS1 RS2= RS1 (RS1 RS2), so operation ‘ ’ can be represented by operation ‘ ’. Equivalence and representation are the two basic relations between operations discussed in this paper. Because we cannot give all possible operations, the completeness of operations can only be proven in theory and cannot be proven by giving examples. But the incompleteness of operations can be proven by giving examples. So we give an example to show that operations set {Union, Difference, Intersection} is not complete. For example, suppose the resource spaces considered are {RS1(X1, X2), RS2(X1, Y2)}, then it is clear that space RS3(X1, X2, Y2) is in the required results. But from {RS1(X1, X2), RS2(X1, Y2)}, the operations {Union, Difference, Intersection} cannot get the space RS3(X1, X2, Y2). It is because the precondition of these traditional set operations is that the operated spaces have the same dimensions, and the result spaces also have the same dimension. So from two 2-

90

Chapter 3 Expressiveness of Query Languages for Resource Space Model

dimensional spaces we cannot get a 3-dimensional space. Then, it is clear that operations set {Union, Difference, Intersection} is not complete. Then we will prove that the nine operations defined above are complete for query on resource spaces. Theorem 3.1. The nine operations Union, Difference, Intersection, Extended Cartesian Product, Selection, Join, Disjoin, Merge and Split are complete for query in resource spaces. Proof. According to definition 3.1, an operation set OP on resource spaces is complete if: for any given finite collection of resource spaces RS1, RS2, …, RSN in simple normal form, and OP can get all the resource spaces in the form of: {RS(x1, ,xd)| xk RSi (Xj), 1 i N, d 1 and 1 k d }. So, we only need to show that the nine operations can get the result spaces as above. We use mathematical induction to prove this. When N=1, for a single ,xd), we can use the unary operation Selection to resource space RS1(x1, get the spaces in the form of {RS(x1, , xn)| (x1, , xn) RS1}, then use the unary operation Disjoin to select any axes we need, then we can get the , xd)| xk RS1 (Xj) , 1 j n and spaces in the following form {RS(x1, 1 k d }, which means that when N=1 the conclusion holds. Suppose when N=m the conclusion holds, i.e., the nine operations can get all the re, xd)| xk RS1 (Xj) , 1 j n and 1 k sult spaces in the form of {RS(x1, d }, next we will show that when N=m+1 the conclusion also holds. Now we divide these m+1 resource spaces into two parts: RS1, RS2, …, RSm and RSm+1, for the preceding m spaces, from the assumption we can get that the , nine operations can get the result spaces in the following form: {RS(x1, xd)| xk RSi (Xj), 1 i m, d 1 and 1 k d }; for the resource space RSm+1, as the above using operations Selection and Disjoin we can get: {RS(x1, , xd)| xk RSm+1 (Xj) , 1 j n and 1 k d }. For these two spaces, using the set operations Union, Difference, Intersection and operation Extended , Cartesian Product we can get the correlations between them: {RS(x1, xd)| xk RSi (Xj), 1 i m+1, d 1 and 1 k d }. So when N=m+1 the conclusion also holds. Then we can get for any given finite collection of resource spaces RS1, RS2, …, RSN in simple normal form, the nine operations ,xd)| xk RSi(Xj), can get all the resource spaces in the form of: {RS(x1, 1 i N, d 1 and 1 k d}. So, the nine operations Union, Difference, Intersection, Extended Cartesian Product, Selection, Join, Disjoin, Merge and Split are complete for query in resource spaces.

3.3 Complete set of Operations

91

From the above proof process we can see that, in fact the operation set {Union, Difference, Extended Cartesian Product, Selection, Disjoin} is already complete and the other four operations can all be represented by it. Then are these five operations necessary? The answer is yes. We only need to show that any of these five operations cannot be represented by the other operations. For this negative conclusion, we only need to give a counter-example. Then we will give an example to show that operation Extended Cartesian Product cannot be represented by operations {Union, Difference, Selection, Disjoin}. Suppose RS is a 2-dimensional resource space, then RS RS is a 4-dimensional resource space. But the operations {Union, Difference, Selection, Disjoin} can only maintain or decrease the dimensionality of resource spaces, so from the 2-dimensional resource space RS, the four operations cannot get a resource space whose dimensionality is larger than 2. So space RS RS cannot be got from RS under operations {Union, Difference, Selection, Disjoin}, which means that operation Extended Cartesian Product cannot be represented by operations {Union, Difference, Selection, Disjoin}. Then we can get the following corollary. Corollary 3.1. The five operations Union, Difference, Extended Cartesian Product, Selection and Disjoin are complete and necessary for query in resource spaces. In theory, the definition of a complete and necessary operation set is enough, it is indeed from the perspective of expressiveness of query languages. But in applications, some new operations which can be represented by existing operations will also be defined, for the convenience of expression or operation. For example, from the Join operation, we can naturally introduce another useful operation: Division. And we can define another operation Projection from the operation Disjoin. Then how many new operations we can define in addition to the existing operations? The answer is infinite. The following example will show this. Example 3.1 There exist infinite different operations. According to definition 3, we only need to show that there exist infinite operations which are not equivalent to each other. We define a sequence of operations { 1, 2, 3, …} as follows: rs2) (rs1 rs2), 1: rs1 1 rs2 = (rs1 2 rs2 = (rs1 1 rs2) 1 (rs1 1 rs2), 2: rs1 …… i+1: rs1 i+1 rs2 = (rs1 i rs2) 1 (rs1 i rs2), ……

92

Chapter 3 Expressiveness of Query Languages for Resource Space Model

From the definition, we have: rs1 2 rs2 = (rs1 1 rs2) 1 (rs1 1 rs2) = ((rs1 rs2) (rs1 rs2)) 1 ((rs1 rs2) (rs1 rs2)) = (((rs1 rs2) (rs1 rs2)) ((rs1 rs2) (rs1 rs2))) (((rs1 rs2) (rs1 rs2)) ((rs1 rs2) (rs1 rs2))) = ((rs1 rs2) (rs1 rs2)) ((rs1 rs2) (rs1 rs2)) = (rs1 1 rs2) (rs1 1 rs2). So we can see that for any i 2.

2= 1 1.

And, we conjecture that

i=

i-1

i-1

rs1 i rs2 = (rs1 i-1 rs2) 1 (rs1 i-1 rs2) = ((rs1 i-1 rs2) (rs1 i-1 rs2)) ((rs1 = (rs1 i-1 rs2) (rs1 i-1 rs2).

i-1

rs2)

(rs1

i-1

rs2))

So we have i= i-1 i-1 for any i 2. Then, it is clear that operations { 1, 2, 3, …} are not equivalent to each other, so they are infinite and different operations. This example tells that finding a “self-contained” operation set regardless of its applications is impractical. After the study of the completeness of operations, we will discuss the expressiveness of query languages, and the emphasis is on the comparison between expressiveness of different query languages and the characteristics of expressiveness.

3.4 Expressiveness of Query Languages

The expressiveness of query languages is an abstract concept, it is difficult to be defined or depicted accurately. Here we have not defined the expressiveness of operations directly, we just study the comparison between expressiveness and some characteristics of it. For example, we want to answer questions like: given two operation languages, whose expressiveness is stronger? Or for two given operations P and Q, whose expressiveness is stronger? 3.4.1 Comparison between Expressiveness Given two operations op1 and op2 on two resource spaces RS1 and RS2, suppose op1 is unary and op2 is binary. We use op1 (RS1) to represent the

3.4 Expressiveness of Query Languages

93

result of RS1 under operation op1, and use RS1 op2RS2 to represent the result of RS1 and RS2 under operation op2. Then, all the resource spaces we can get from the set {RS1, RS2} under the set { op1, op2} can be listed as follows: {op1 (RS1), op1(RS2), RS1op2RS2, op1 (op1(RS1), op1 (op1(RS2), op1(RS1 op2 RS2), }. Then, a definition can be given as follows: Definition 3.5 Given a set of resource spaces RSS and a set of operations OP on the resource spaces, we define RSSOP as all the resource spaces that can be got from RSS under a sequence of operations in OP. In general cases, when RSS and OP are both finite, RSSOP is also finite. For example, for RSS={RS1, RS2}, if OPs={ , }, then RSSOPs ={RS1, RS2, RS1 RS2, RS1 RS2}; if OPt={ }, then RSS OPt={RS1, RS2, RS1 RS2}. This example is very simple, but in many cases, it is not easy to compute RSSOP given RSS and OP. For example, for RSS ={RS1, RS2} and OP={ }, one may think that RSSOP ={RS1 RS2, RS2 RS1}. But in fact, RS1 RS1= , RS1 =RS1, RS2 =RS2 and RS1 (RS1 RS2)= RS1 RS2, now the generated

spaces are { , RS1 , RS2, RS1 RS2, RS2 RS1, RS1 RS2}. Are there more spaces that can be generated? The answer is no, because the results of these six spaces under the operation ‘ ’ are also included by themselves, for example, RS1 (RS1 RS2)=RS1 RS2. So in fact now RSSOP ={ , RS1, RS2, RS1 RS2, RS2 RS1, RS1 RS2}. Intuitively, given any set of resource spaces RSS, if operation set OPs can get more results than OPt, i.e., RSS Ops RSS OPt, then we can say that the expressiveness of OPs is stronger than OPt. So a definition can be given as follows: Definition 3.6 Given two operation sets OPs and OPt on resource spaces, the expressiveness of OPs is stronger (weaker) than OPt, denoted by OPs>OPt (OPsOPt nor OPsOPt and OPsOPt. Characteristic 3.4 If OPs>OPt and OPr>OPs, then OPr>OPt. Characteristic 3.5 Given an operation set OPs, if OP is equivalent to or can be represented by some operations in OPs, then OPs>OP. Characteristic 3.6 If OPs>OPt, then (OPs OPt)=OPs. Characteristic 3.6 tells that if newly defined operations can be represented by existing operations, then the expressiveness of operations does not increase in essence. }, we

RS2, RS2 RS1 RS1, RS1 RS2, RS2, RS1

3.5 Comparison and Analysis

Section 1.7 has made a comparion between the Resource Space Model and the Relational Data Model. Several differences influence the design of a query language. Relational data tend to have a regular structure, which allows the descriptive meta-data to be stored in a separate catalog. Resources in resource space, in contrast, are usually heterogeneous. Resource spaces often contain many levels of nested elements, whereas relational data are “flat.” Relational data are usually “dense” (every column has a value). Re-

3.5 Comparison and Analysis

95

source spaces, in contrast, are often “sparse”. So, the relational query languages are not suitable for querying resource spaces. The Resource Space Model needs a resource-using mechanism to provide not only an operational browser for end users but also a programming environment for high-level developers (Zhuge, 2004d). It provides a set of specific programming language — Resource Operation Language ROL, and also provides a running platform for this language to be used for querying and operating resource spaces. The ROL could be embedded in other high-level programming languages supporting the call from other languages. In contrast, the SQL provides a programming language for relational database application systems to operate relational tables. The ROL could be a programming environment. It adopts the SQL’s SELECT-FROM-WHERE grammar. The ROL can execute the operations similar to the classic relational databases such as nesting query, aggregation and ordering the results. The ROL also borrows from the XML query language such characteristics as the management of the structured and semi-structured data, the support of abstract data types, the support of document selection and local path expression. The comparisons of the ROL with the XQL (XML Query Language) and the SQL are given in the following table.

96

Chapter 3 Expressiveness of Query Languages for Resource Space Model

Table 3.1 Comparisons of ROL with XQL and SQL Feature Items Abstract data types SQL-like Specific data model Document selection Partial specified path expression Nested queries Update language Loop statement Branch statement Set operations Aggregates Join operation Open operation Support View Create new elements Query structure Query content Ordering results ROL Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Partially Yes Yes Yes Yes Yes Yes Yes XQL No No No Yes Yes No No No No Yes Partially No No No No No Yes No SQL No Yes Yes No No Yes Yes Partially Partially Yes Yes Yes No Yes Yes No Yes Yes

3.6 Summary

97

3.6 Summary

This chapter investigates the completeness of resource space query languages. The completeness of operations is introduced to answer when the defined operations are sufficient. Several new operations are defined, and the verification of the completeness of the operations is discussed. We conclude that: 1. {Union, Difference, Intersection, Extended Cartesian Product, Selection, Join, Disjoin, Merge and Split} is a set of complete operations for query on resource spaces; and, {Union, Difference, Extended Cartesian Product, Selection and Disjoin} is a set of complete and necessary operations for query on resource spaces.

2.

It is possible to define infinite new operations regardless of practical requirements, while the definition of a self-contained set of operations is impractical. The proposed framework can be used to compare the expressiveness of different resource sublanguages. The theoretical results are very useful in the design and analysis of resource space sublanguages, and they could be also applied to the operation theory of other data models like relational data model and XML. Any general purpose query language of the Resource Space Model must have this completeness at least.

Chapter 4 Algebra and Calculus of the Resource Space Model

To know the query capability and make use of the potential expressive power of the Resource Space Model are essential issues. The query capability and expressive power of the Resource Space Model can be studied from two perspectives: resource space algebra and resource space calculus.

4.1 Basic Idea

The resource space algebra consists of a set of resource spaces and a set of operations of the Resource Space Model. Results of operations on resource spaces are also resource spaces. Users can use a series of operations in the resource space algebra to obtain the desired resources. The query completeness of the operations in the proposed resource space algebra has been discussed in chapter 3. To lay the foundation of the query language for the Resource Space a resource space calModel, we propose a non-procedural query style culus. The resource space calculus is a type of applied predicate calculus and a foundation for the declarative query language. By the alpha expressions defined in the resource space calculus, users can easily and clearly specify the desired resources. The relational algebra and the relational calculus were proposed to depict the query capability of the relational model (Codd, 1971a; Ullman, 1982). A reduction algorithm is also proposed to transform a relation expression into a semantically equivalent expression of the relational algebra (Codd, 1972), though this reduction algorithm has a little defect (Date, 1989; Date 1992). The extended relational algebra and relational calculus with aggregate functions were discussed (Klug, 1982). Structured Query Language (SQL) has the features of both the relational algebra and the relational cal-

100

Chapter 4 Algebra and Calculus of the Resource Space Model

culus (Boyce et al. 1975; Chamberlin and Boyce, 1976; Chamberlin et al., 1976). A query algebra on Object-Oriented Database synthesizing relational query concepts with object-oriented databases as well as corresponding query languages were proposed (Alashqur et al., 1989; Shaw and Zdonik, 1990; Shipnan, 1981; Zaniolo, 1983). A conceptual model and its algebra and calculus have been introduced for OLAP-based applications (Gyssens and Lakshmanan, 1997).

4.2 Resource Space Algebra

Query operations on a given resource space may result in only some desirable points, but these resultant points cannot be operated further by the proposed resource space operations. To guarantee the query result to be a resource space, the following mechanism is employed. Any operation will result in a certain resource space and all undesired points in this resultant resource space will be maintained and marked as null points instead of being filtered out and all desired points marked as non-null points. In the following discussion, R(p) denotes the resources that point p can contain and R (p) denotes the resources that point p actually contains. In a resource space RS, for any null point p we have R(p)= , R (p)= and p RS. For axis Xi and point p in RS, p[Xi] is used to denote the projection of p on the axis Xi. Two resource spaces RS1(X1, X2 … Xn) and RS2(Y1, Y2 … Yn) are unioncompatible if RS1 and RS2 have the same schema, i.e. X1=Y1, …, Xn=Yn. Let p1 and p2 be two points in two union-compatible resource spaces RS1(X1, X2 … Xn) and RS2(X1, X2 … Xn) respectively. If p1[Xi]=p2[Xi] for 1 i n holds, then we say that p1 has the same coordinates with p2, denoted as p1 =p p2.

4.2.1 Definitions of Operations All operations in this algebra fall into two categories: the traditional set operations (union, intersection, difference and Cartesian product) and the

4.2 Resource Space Algebra

101

RSM-specific operations (join, disjoin, merge, split, division, selection, and projection). The set operations of the Resource Space Model can be defined as follows: Union. For two union-compatible resource spaces RS1 and RS2, the union of RS1 and RS2, denoted as RS1 RS2, has the same resource space schema as RS1 and RS2, and RS1 RS2 = {p | (p1 RS1 p2 RS2) p=pp1 p=pp2}. This means that p is non-null in RS if and only if either p1 is non-null in RS1 or p2 is non-null in RS2. For any point p in RS1 RS2 and its counterparts p1 in RS1 and p2 in RS2, we have R (p) = R (p1) R (p2). Fig. 4.1 is an example of union operation.

Fig. 4.1. An example of Union operation. Intersection. For two union-compatible resource spaces RS1 and RS2, the intersection of RS1 and RS2, denoted as RS1 RS2, has the same resource space schema as RS1 and RS2. And RS1 RS2={p | p1 RS1 p2 RS2 p =p p1 p =p p2}. This means that p is non-null in RS if and only if both p1 is non-null in RS1 and p2 is non-null in RS2. For any point p in RS1 RS2 and its counterparts p1 in RS1 and p2 in RS2, we have R (p) = R (p1) R (p2). Fig. 4.2 is an example of the intersection operation.

102

Chapter 4 Algebra and Calculus of the Resource Space Model

RS1

RS2

RS1

RS2

Fig. 4.2. An example of the Intersection operation. Difference. For two union-compatible resource spaces RS1 and RS2, the difference of RS1 and RS2, denoted as RS1 RS2, has the same resource space schema as RS1 and RS2. For any point p in RS1 RS2, let p1 and p2 be two points having the same coordinates as p in RS1 and in RS2 respectively. Then R (p)=R (p1) R (p2). p is non-null if and only if either (1) p1 is nonnull and p2 is null, or (2) R (p) is non-null. Otherwise, p is null. Cartesian product. Let RS1(X1, X2 … Xn) and RS2 (Y1, Y2 …Ym) be two resource spaces that store the same type of resources. The Cartesian product of RS1 and RS2 is defined as RS1 RS2 = RS(X1, X2 … Xn, Y1, Y2 …Ym). For any point p(x1, x2 … xn, y1, y2 … ym) in RS, there exist p1(x1, x2 … xn) and p2(y1, y2 … ym) in RS1 and RS2 respectively. Then p is non-null if and only if both p1 and p2 are non-null. If n=m, Xi=Yi (1 i n) and xj=yj (1 j n), then R (p)=R (p1) R (p2). Otherwise, R (p)=R (p1) R (p2). The RSM-specific operations Join, Disjoin, Merge and Split have been defined in (Zhuge, 2004a). The following definitions are given from the view of resources. Join. Let |RS| be the number of the dimensions of the RS. If two resource spaces RS1(X1, …, Xm, Y1, …, Yn) and RS2(Y1, …, Yn, Z1, …,Zk) store the same type of resources and have n common axes, then they can be joined together as one resource space RS(X1, …, Xm, Y1, …, Yn, Z1, …,Zk) such that RS1 and RS2 share these n common axes and |RS|=|RS1| + |RS2| n. For any point p(x1, …, xm, y1, …, yn, z1, …,zk) in RS, p is non-null if and only if both point p1(x1, …, xm, y1, …, yn) in RS1 and point p2(y1, …, yn, z1, …,zk) in RS2 are non-null. If RS1 and RS2 are union-compatible, then

4.2 Resource Space Algebra

103

R (p)=R (p1) R (p2). Otherwise, R (p) = R (p1) R (p2). RS is called the join of RS1 and RS2, denoted as RS1 RS2 RS. According to the above definition, all the resources in the new resource space RS come from RS1 and RS2 and can be classified by X1, …, Xm, Y1, …, Yn, Z1, …, Zk. The Join operation provides an efficient method to refine classification of resources. Disjoin. A resource space RS(X1, …, Xm, Y1, …, Yn, Z1, …, Zk) can be disjoined into two resource spaces RS1(X1, …, Xm, Y1, …, Yn) and RS2(Y1, …, Yn, Z1, …, Zk) that store the same type of resources as that of RS such that they have n (1 n min(|RS1|, |RS2|)) common axes and |RS| n different axes, and |RS|=|RS1| + |RS2| n. For any point p1 in RS1, there exists a set P of points in RS, each element of which has the same projections on X1, …, Xm, Y1, …, Yn as p1. Then p1 is non-null if and only if there exists at least one non-null point in P. R (p1)= R (p) for any p P. RS1 and RS2 are called disjoin of RS, denoted as RS RS1 RS2. Different from the Join operation, the disjoin operation can clarify the classification of resources by separating large number of axes into two overlapped parts. Based on the disjoin operation, we can naturally introduce another useful operation: projection. The projection of the RSM has almost the same definition as disjoin except that projection results in only one resource space which includes all the desirable axes. For resource space RS(X1, …, Xm, Xm+1, …, Xn), X1, …, Xm(RS) will be used to denote the projection of the resource space RS on axes X1, …, and Xm. From the definition of disjoin, it is clear that the projection provides an algebraic counterpart to the existential quantifier. Merge. If two resource spaces RS1(X1, …, Xn-1, X') and RS2(X1, …, Xn-1, X”) store the same type of resources and satisfy: 1) |RS1|=|RS2|=n; and, 2) they have n 1 common axes, and there exist two different axes X' and X” satisfying the merge condition, then they can be merged into one RS by retaining the n 1 common axes and adding a new axis X*=X' X”. RS is called the merge of RS1 and RS2, denoted as RS1 RS2 RS, and |RS|= n. For any point p in RS, there exists a set P of points in RS1 and RS2 each element of which has the same projections on all axes as p. Then p is nonnull if and only if there exists at least one non-null point in P. The formal

104

Chapter 4 Algebra and Calculus of the Resource Space Model

definition is RS1 RS2 = {p(X1, …, Xn-1, X*) | p RS1 RS2.X”}.

p RS2

X*=RS1.X'

It is obvious that the Union operation is the special case of the merge operation where all axes are common. The above definition can be easily extended to a more general situation where resource spaces RS1 and RS2 have n-m common axes and m different axes satisfying the merge condition. Split. A resource space RS can be split into two resource spaces RS1 and RS2 that store the same type of resources as RS and have |RS| 1 common axes by splitting an axis X into two: X’ and X , such that X=X’ X . This split operation is denoted as RS RS1 RS2. For any point p1 in RS1 there exists a point p in RS, which has the same projections on all axes as p1. Then p1 is non-null if and only if p is non-null. The formal definition is RS1(X1, …, Xn-1, X[C]) = {p(X1, …, Xn-1, X*) | p RS X*=X[C]}, where X[C] represents the axis X containing only the coordinates in coordinate set C. By using the split operation, the unconcerned coordinates on a certain axis can be filtered out and only the interesting coordinates are preserved. Selection. For a resource space RS, the Selection operation is used to select the desirable points according to the given restriction. It is denoted as F(p)}, where F is a logical expression. Any point in F(RS)={p | p RS RS making F true will be marked with non-null and other points will be marked with null. F has the following four forms:

1.

2. 3. 4.

pm[Xi] Y, where Y may be pn[Xj] or just a noun and noun phrase in domain ontology and represents =, , . It is a type of restrictions on the projections on axes of points. pm[Xi] Y, where Y is just a set of nouns or noun phrases in domain ontology and represents or . R (pm[Xi]) R (pn[Xj]), where represents =, , , , , and . It is a type of restrictions on the set of resources that points contain. fc(pi) Y, where fc is the function to calculate the cardinality of the given point and represents =, , . It is a type of restrictions on the quantity of resources that points contain.

Division. Let RS1(X1, …, Xm, Y1, …, Yt) and RS2(Y1, …, Yt, Z1, …, Zn) be two resource spaces. Dividing RS1 by RS2 (denoted as RS1[ Y1, …, Yt]RS2)

4.2 Resource Space Algebra

105

is a resource space X1, …, Xm(RS), and for any point p in X1, …, Xm(RS), p is non-null if and only if for any non-null point p in RS2 there exists a nonnull point p’ in RS1 such that p[X1, …, Xm]=p’[X1, …, Xm] and p [Y1, …, Yt]=p’[Y1, …, Yt]. The division operation provides an algebraic counterpart of the Universal Quantifier. Its role is similar to the division operation in relational algebra. 4.2.2 Relationships among Operations Given a resource space RS, the following situation often occurs: there usually exist some resource-entries (for example r) and points (for example p) in RS such that r R(p) but r R (p). In Fig. 4.3(a), there exists a resource-entry id1 to appear in both point p1 and point p2 (3NF is not satisfied in this case). So we can conclude that r R(a1) from r R (p1) and r R(b2) from r R (p2). So r R(p3) but r R (p3). An issue of this situation is that a user could not obtain the resource-entry r when users only query the point p3. To solve this issue, we suppose that all resource spaces satisfy the following restriction. For any resource space RS(X1, …, Xn), any point p in RS and any resource r, r R (p) holds if and only if there exist n points p1, p2, …, and pn in RS such that both r R (pi) and p[Xi]=pi[Xi] hold for any 1 i n.

(a) Fig. 4.3. An example of full resource space.

(b)

106

Chapter 4 Algebra and Calculus of the Resource Space Model

Theorem 4.1. For any resource space RS(X1, …, Xn) satisfying the above restriction, RS(X1, …, Xn)=( X1(RS) X2(RS) … Xn(RS)) RS holds. Proof. It is clear that the resource space RS and the resource space ( X1(RS) X2(RS) … Xn(RS)) RS have the same schema. For any point p in RS and p’ in ( X1(RS) X2(RS) … Xn(RS)) RS such that p =p p’, we will prove that p is non-null if and only if p’ is non-null, and that R p)=R p’) holds. (1) If p is non-null, then there exist n non-null points p1, p2, …, pn in X1(RS), X2(RS), …, Xn(RS) respectively such that p[Xi] = pi[Xi] hold for any 1 i n. So there exists a non-null point p in X1(RS) Xn(RS) such that p =p p . Since both p and p are non-null, X2(RS) … p’ is non-null point. It is obvious that if p is a null point then p’ is also a null point. So p is non-null if and only if p’ is non-null. (2) Since RS satisfies the given restriction, for any resource r if r R (p) holds then there exists at least one axis Xi (1 i n) in RS such that r R (p[Xi]). So for resource space Xi(RS), r R (p[Xi]). Since p[Xi]=p’[Xi] holds, r R (p[Xi]) holds. Thus r R (p’) holds. On the other hand, if r R (p) holds, we can easily conclude that r R (p’) holds. So R p)=R p’) holds. Thus RS(X1, …, Xn)=( X1(RS) X2(RS) … Xn(RS)) RS holds. According to theorem 4.1, for any resource space RS(X1, …, Xn), the above restriction can be easily satisfied by setting RS as ( X1(RS) X2(RS) … Xn(RS)) RS. Theorem 4.2. Let A=X1, …, Xm, B=Y1, …, Yt and C=Z1, …, Zn. For any resource space RS1(A, B) and resource space RS2(B, C), RS1(A, B)[ B]RS2(B, C) = A(RS1) ( A(RS1) A( A(RS1) B(RS2) A(RS1) B(RS2) RS1)) holds. Proof. (1) Let p and p’ be two points in resource spaces RS1(A, B)[ B]RS2(B, C) and A(RS1) ( A(RS1) A( A(RS1) B(RS2) A(RS1) B(RS2) RS1)) respectively. We will prove that if p =p p’ holds, then p is non-null if and only if p’ is non-null. Suppose that p is a null point in the resource space RS1(A, B)[ B]RS2(B, C). Then there exits a null point p1 in RS1(A, B) and a non-null point p2 in RS2(B, C) such that p1[A]=p[A] and p1[B]=p2[B] hold. Let p3 be the point in A(RS1) having the same coordinates as p. If p3 is null, then it is obvious that the point p’ is a null point. The following is the case that p3 is a non-null point in A(RS1). Let p4 be the point in A(RS1) B(RS2) having the same coordinates as p1. Since p3 in A(RS1) and p2 in B(RS2) are non-null points, p4 is a non-null point. Since point p1 is a null point in RS1, the point p1’ in A(RS1) B(RS2) RS1

4.3 Resource Space Calculus

107

satisfying p1 =p p1’ is also a null point. Let p5 be the point in ( A(RS1) A( A(RS1) B(RS2) A(RS1) B(RS2) RS1)) having the same coordinates as p. So p5 is also a non-null point. Since both p3 and p5 are non-null point, p’ is a null point. (2) In the same way, we can prove that p’ is also a non-null point when p is a non-null point. Theorem 4.3. Merge, Difference, Cartesian product, Projection and Selection are sufficient and necessary. And other operations can be represented as follows: 1. 2. 3. 4. 5. Union is the special case of Merge; RS1 RS2 = RS1 (RS1 RS2); RS1(X1, …, Xm, Y1, …, Yt) RS2(Y1, …, Yt, Z1, …, Zn) = X1, …, Xm, Y1, …, RS2); Yt, Z1, …, Zn(RS1 RS(X1, …, Xn-1, X[C]) = p[X] C(RS), where C represents a set of coordinates; and, RS1(A, B)[ B]RS2(B, C) = A(RS1) ( A(RS1) A( A(RS1) B(RS2) A(RS1) B(RS2) RS1)), where A=X1, …, Xm, B=Y1, …, Yt and C=Z1, …, Zn.

Theorem 4.4. In the five basic operations (Merge, Difference, Cartesian product, Projection and Selection), Difference, Cartesian product, Projection and Selection keep 1NF (the first normal form), 2NF (the second normal form) and 3NF (the third normal form) of the Resource Space Model. Merge keeps 1NF and 2NF, but it does not keep 3NF. As a special Merge operation, Union keeps 1NF, 2NF and 3NF.

4.3 Resource Space Calculus

Having the resource space algebra, we now introduce an applied predicate calculus which can be used to construct declarative queries on any resource space system consisting of a finite set of resource spaces.

4.3.1 Definition The resource space calculus consists of several classes of objects. They include variables, terms, formulas and alpha expressions.

108

Chapter 4 Algebra and Calculus of the Resource Space Model

The set V of variables is the countable sets {p, p1, p2, p3 ...}, where each pi stands for a point variable. A point variable p is a free variable if p does not occur within the scope of any quantifier ( or ). Otherwise p is a bound variable. The set T of terms is composed of the following six parts. 1. 2. 3. 4. 5. 6. Any noun or noun phrase in ontology qualified to name coordinates is in T. Any axis of a certain resource space, the split of an axis or the merge of two axes belong to T. For any point variable pi and its any axis Xj, pi[Xj] is a term. For every point variable pi, the set R (pi[Xj]) is a term. For every point variable pi, fc(pi[Xj]) is a term. Integers not less than 0 belong to T. The set RF of range formulas is defined as follows. 1. Let RSi be a resource space and point variable p V, then RSi(p) belongs to RF (The monadic predicate RSi(p) is used to state that the point variable p has the range of resource space RSi.) If point variable p is the only point variable in a range formula, then this range formula is called a range formula over p. Let , be two range formulas over p. Then belongs to RF. Let , RF. For any point variable p in and , if all resource spaces specifying the range of p in and are union-compatible, then both the disjunction and the conjunction are in RF. The set F of formulas includes the following six types of formulas. 1. 2. Any range formula in RF is in F. Coordinate formula has one of the two forms (a) pm[Xi] Y, where Y may be pn[Xj] or just a noun and noun phrase in T and represents any of the relations =, , ; (b) pm[Xi] Y, where Y is a set of nouns and noun phrases in T and represents or . Let be any of the relations =, , , , , and . Set formula has the form of R (pm[Xi]) R (pn[Xj]). Let be any of the relations =, , . Cardinality formula has the form of fc(pi) Y, where Y may be fc(pj) or just an integer not less than 0. If , F, then the negation , the disjunction and the conjunction are in F. And,

2.

3.

3. 4.

5.

4.3 Resource Space Calculus

109

6.

Let be a range formula over p. Then the quantification ( ( ) are in F.

) and

It is obvious that each qualifier ( or ) in F must be associated with a range formula over a point variable. The expanded forms of ( ) and ( ) are as follows. ( ( ) = p( ) = p( ) )

As with the relational calculus, the range of each point variable in a formula should be definitely specified (Codd, 1972). For any formula in F, is a well-formed formula (WFF in simple) over p if has the form of U1 Un V, where 1. 2. U1 through Un are range formulas over n point variables varying from one another; V belongs to F and satisfies: a) The range of every free variable except p in V has been specified by a certain Ui; b) No rang formula occurs in V. Then this WFF over p is denoted as (p). Let (p) be a WFF formula over p and Xi (1 i n) be a group of axes. The alpha expression can be defined as follows: 1. p(X1, X2, ..., Xn): (p) is an alpha expression; 2. If both p(X1, X2, ..., Xn): and p(X1, X2, ..., Xn): sions, then the following are alpha expressions. a) p(X1, X2, ..., Xn): b) p(X1, X2, ..., Xn): c) p(X1, X2, ..., Xn): are alpha expres-

p(X1, X2, ..., Xn) is called the target point and the logical expression following the colon is called the qualification. The semantics of the alpha expression p(X1, X2, ..., Xn): (p) is to construct a resource space RS consisting of axes X1, X2, ..., and Xn where for any point p if p satisfies (p) then p is non-null, otherwise p is null. The set AE is defined as the set of all alpha expressions, each of which can be used to represent a query in a certain resource space system.

110

Chapter 4 Algebra and Calculus of the Resource Space Model

4.3.2 From Resource Space Algebra to Resource Space Calculus Each operation in the resource space algebra can be represented by a piece of alpha expression in the resource space calculus. Thus, we can draw the conclusion that the resource space calculus has at least as powerful query capability as the resource space algebra. The following is the alpha expressions corresponding to each operation in the resource space algebra. Union. RS1 RS2 {p(X1, X2, ..., Xn): RS1(p) RS2(p)}. This alpha expression means that each point in the new resource space of RS1 RS2 is the union of the corresponding point in RS1 and RS2. {p(X1, X2, ..., Xn): RS1(p) RS2(p)}. This alpha Intersection. RS1 RS2 expression means that each point in the new resource space of RS1 RS2 is the intersection of the corresponding point in RS1 and RS2. {p(X1, X2, ..., Xn): RS1(p) RS2(p)}. This alpha Difference. RS1 RS2 expression means that each point in the new resource space of RS1 RS2 is the difference operation on the corresponding point in RS1 and RS2. {p(X1, X2, ..., Xn, Y1, ... Ym): Cartesian product. RS1 RS2 ( RS1(p1))( RS2(p2))(p[X1]=p1[X1] ... p[Xn]=p1[Xn] p[Y1]=p2[Y1] ... p[Ym]=p2[Ym])}. This alpha expression means that the axes of the new resource space of RS1 RS2 are the concatenation of the axes of RS1 and RS2. And each point in RS1 RS2 is the concatenation of the corresponding points in RS1 and RS2. {p(X1, …, Xm, Y1, …, Yn, Z1, …,Zk): ( RS1(p1)) ( RS2(p2)) Join. RS1 RS2 (p[X1]=p1[X1] ... p[Xm]=p1[Xm] p[Y1]=p1[Y1]=p2[Y1] ... p[Yn]=p1[Y1]=p2[Yn] p[Z1]=p2[Z1] ... p[Zk]=p2[Zk])}. This alpha expression means that the axes of the new resource space of RS1 RS2 are the union of the axes of RS1 and RS2. And each point in RS1 RS2 is the concatenation of the corresponding points in RS1 and RS2 without duplicate axes. {p(X1, ..., Xt): ( RS(p1))(p[X1]=p1[X1] ... Disjoin. RS(X1, ..., Xt) p[Xt]=p1[Xt])}. This alpha expression means that the axes X1, ..., and Xt of the new resource space is a subset of the axes of RS. For each point p in the new resource space, p is non-null if and only if there exists one point p1 in RS such that p and p1 have the same projection on X1, ..., and Xt.

4.3 Resource Space Calculus

111

Merge. RS(X1, ..., Xn, X Y) {p(X1, ..., Xn, X Y): ( RS1(p1))(p[X1]=p1[X1] ... p[Xn]=p1[Xn] p[X Y]=p1[X]) ( RS2(p2))(p[X1]=p2[X1] ... p[Xn]=p2[Xn] p[X Y]=p2[Y])}. This alpha expression means that the first n axes of the new resource space are the same as RS1 and RS2 and the (n+1)th axis is the merge of the (n+1)th axes of RS1 and RS2. Each point in the new resource space is the union of the corresponding point in RS1 and RS2. {p(X1, ..., Xn, X[C]): ( RS(p1))(p[X1]=p1[X1] Split. RS(X1, ..., Xn, X[C]) ... p[Xn]=p1[Xn] p[X[C]]=p1[X])}, herein C represents a set of coordinates. This alpha expression means that the first n axes of the new resource space are the same as RS and the coordinate set C of the (n+1)th axis is a subset of that of the (n+1)th axes of RS. Any point in the new resource space comes from RS. {p(X1, X2, ..., Xn): RS(p) (F F) Selection. F(RS(X1, X2, ..., Xn)) F}. This alpha expression means that each point in the new resource space should make F true. {p(A): ( RS2(p2)) ( RS1(p1)) Division. RS1(A, B)[ B]RS2(B, C) (p[A]=p1[A] p2[B]=p1[B])}, herein A=X1, …, Xm, B=Y1, …, Yt and C=Z1, …, Zn. This alpha expression means that the axes of the new resource space of RS1(A, B)[ ]RS2(B, C) are axis list A. For any point p’ in the new resource space, p’ is non-null if and only if for any non-null point p in RS2 there exists a non-null point p in RS1 such that p’[A]=p[A] and p [B]=p[B]. 4.3.3 From Resource Space Calculus to Resource Space Algebra Just as the relational calculus, the resource space calculus can be used to represent what is a query for, but it does not provide the process to answer the query. This section proposes an algorithm for converting any alpha expression in the resource space calculus to a corresponding series of operations in the resource space algebra. For a given alpha expression p(X1, X2, ..., Xn): , where p(X1, X2, ..., Xn) is the target point and is the qualification, we use the following algorithm based on Codd's reduction algorithm to transform resource space calculus to resource space algebra (Codd, 1972).

112

Chapter 4 Algebra and Calculus of the Resource Space Model

(1) Suppose that X1, X2, ..., Xn come from RS1, RS2, …, RSn respectively, all of which are resource spaces in . For RSi (1 i n), we use RSi* to denote the union of all the resource spaces union-compatible with RSi in . Create a new resource space RS which is the projection of the resource space RS1* RS2* … RSn* on axes X1, X2, ..., and Xn. The qualification is transformed into ’=RS(p) . (2) Since the qualification ’ is the logical concatenation of a series of WFF over p using , and , the qualification can be easily transformed into the disjunctive normal form. In this transformation process, any formula preceded by qualifiers ( or ) is viewed as a whole and any logical operators in this type of formulas are not concerned. Each conjunctive clause in this disjunctive normal form is a WFF over p. For each conjunctive clause (WFF over p) U1 Up V do from step (3) to step (8). Up V, for each Uj (1 j p) and the range formula Ui (3) In U1 (p+1 i q) associated with each qualifier ( or ), apply the following rewriting rules in the given order. (a) Substitute each rang formula RSk(pk) over pk by the resource space RSk. (b) Substitute and by and – respectively. (c) Substitute by . Then we can use a series of set-related operations (union, intersection and difference) to change Uj (1 j q) into Sj such that Sj only contains a new resource space. For example, if Uj = RS1(p) RS2(p), then Sj = RS1 RS2. (4) If there exists a Sj (1 j p) such that Sj = , then make each point in RS null and jump to step (9). For each existence qualifier , if its corresponding range formula Sj is equal to , then replace the formula within the scope of this existence qualifier with Boolean constant F. For each universal qualifier , if its corresponding range formula Sj is equal to , then replace the formula within the scope of this universal qualifier with Boolean constant T. The replacement is based on the following two facts: ) = p(F ) = F. (a) If Sj = , then ( Sj) = p(p Sj (b) If Sj = , then ( Sj) = p(p Sj ) = p(T ) = T. (5) Transform V to its prenex disjunctive normal form V'. Assume that V' has q quantifiers ( or ) and let each quantifier in V' be denoted as Qj (1 j q) from left to right and the range formula associated with Qj be Sp+j. For example, V = (RS1(p1))(p1[Xi] ’2000’) (RS2(p2)) (p2[Xj] ’male’) can be transformed into V’ = (RS1(p1) (RS2(p2))

4.3 Resource Space Calculus

113

(6)

(7) (8)

(9)

(p1[Xi] ’2000’ p2[Xj] ’male’). Each of the conjunctive clauses of V’ will be denoted as Pi (1 i n). For each Pi (1 i n) do step (6). Construct a WFF over p, S1 Sp Q1+p(S1+p(p1+p) Qq(Sq(pq))(Pi), for each Pi (1 i n). Without the loss of generality, we suppose that S1 is the range formula of p. Let Q1+p(S1+p(p1+p) Qq(Sq(pq))(Pi) be denoted as Vi. For each U1 Up Vi (1 i n), do the following steps. (6.1) Form the defining equation RSi = S1 ... Sp+q. (6.2) Pi consists of the conjunction of a series of coordinate formulas, set formulas and cardinality formulas. For resource space RSi, execute the selection operation on the restrictions Pi: Pi}. The result of the selection operation F(RSi)={p | p RSi is denoted as RSi’. Merge all the resource spaces RSi' (1 i n) and the result is denoted as RS'. For any Qj (1 j q) and the resource space Sj+p in its corresponding range formula, we use RS'.X* and Sj+p.X* to denote the sets of all axes of RS’ and Sj+p respectively. For j from q to 1, do the following iteration: if Qj= RS' = RS'.X* Sj+p.X*RS' RS' = RS'[ Sj+p.X*]Sj+p if Qj= After q step operations on RS', the eventual result is denoted as RS''. Then do the following transformation: RS'' = S1 RS(RS'') Resource space RS'' is exactly the result resource space that one of the conjunctive clauses in step (2) desires. Merge all the resource spaces that all conjunctive clauses in step (2) produce through step (3) to step (8). The result resource space will be exactly the result resource space the original alpha expression desires.

Theorem 4.5. For any alpha expression p(X1, X2, ..., Xn): , after the above transformation any point satisfying will appear in the result resource space and none of the points making false will be in the result resource space. Proof. For alpha expression p(X1, X2, ..., Xn): , we first construct RS as the target resource space according to X1, X2, ..., and Xn. It is obvious that the desired resource space is a subset of RS. That is any point p making true appears in RS. So transform to ’=RS(p) and then the original alpha expression will be transformed into p(X1, X2, ..., Xn): ’. After step

114

Chapter 4 Algebra and Calculus of the Resource Space Model

(1) and step (2) in the transformation, ’ will be transformed into its disjunctive normal form involving m pieces of conjunctive clauses (WFFs over p). If p satisfies , we will prove that after the transformation p still exists in the result resource space. Since p makes ’ true, there at least exists one conjunctive clause of ’ that p satisfies. Let the conjunctive clause be U1 Up V. After step (3), (4) and (5) of the transformation, this conjunctive clause will become S1 Sp Q1(S1+p(p1+p) Qq(Sq+p(pq+p))(P1 P2 … Pn), where Qj (1 j q) represents qualifies ( or ) and Q1(S1+p(p1+p) Qq(Sq+p(pq+p))(P1 P2 … Pn) is the prenex disjunctive normal form of V. After step (6) and (7), we can get the resource space RS' Up V. In step corresponding to the above conjunctive clause U1 (8), we suppose that after the iteration from Qq to Qj+1, the resource space is transformed into RS’tmp and the point p is still in RS’tmp. Now, we will prove that after Qj the point p still exists in the result resource space. a) We firstly suppose that Qj is an existence qualifier . Since p is still in RS’tmp, there at least exist one point p’ in Sp+j such that there will exist a point p” in RS’tmp the projection of which on RS.X* is p and the projection on Sp+j.X* is p’. So point p”[RS'.X* Sj+p.X*] also appears in the resource space RS'.X* Sj+p.X*RS’tmp. b) Then we suppose that Qj is a universal qualifier . Since Qj is a universal qualifier, for any point p’ in Sp+j there will exist a point p” in RS’tmp such that the projection of p” on RS.X* is p and the projection of p on Sp+j.X* is p’. So point p”[RS'.X* Sj+p.X*] also appears in the resource space RS’tmp[ Sj+p.X*]Sj+p. From a) and b), we can conclude that after step (8) of the transformation there at least exists one point p” in RS” such that the projection of p” on RS.X* is p. So p will appear in the final result resource space RS. Similarly, we can prove that any point making false will not be in the result resource space. The Resource Space Calculus proposed above has the following difference from the relational calculus (Codd, 1972). 1. The resource space calculus has different operational objectives with the relational calculus. The operational objectives in the resource space calculus includes resource space, axis, coordinate, point and resource-entry which represent different classification granularity of resources, while the relational calculus takes table, tuple, attribute and atomic value as the basic operational units.

4.3 Resource Space Calculus

115

2.

3.

In the relational calculus, there exists only one type of value-based comparison formulas called join terms. On the other hand, there exist three types of comparison formulas coordinate formula, cardinality formula and set formula. Herein, coordinate formula and cardinality formula is value-based comparison formulas and set formula is setbased comparison formula. Richer semantics can be expressed by the resource space calculus. The resource space calculus is used to express totally different semantics from the relational calculus. The classification semantics among resources can be efficiently expressed by using the Resource Space Model. As a basic element, a point in the resource space calculus is used to represent a class of resources. In the relational calculus, a tuple is used to represent just a single resource (a record).

4.3.4 Transformation from Relational Model to Resource Space The Resource Space Model was proposed as a parallelism of the RDBM. In fact, a relational database system excluding null information can be represented by a set of resource spaces. Thus, the proposed resource space algebra and resource space calculus have at least as powerful expressiveness as the relational model. The following is the transforming process from a relational table to a resource space. 1. For each table T(A1, A2, …, An) in a given relational database system, create a resource space RS(X1, X2, …, Xn) by naming resource space after the table name, establishing a one-to-one relationship between the axes of the resource space and the attributes of the table (e.g., Ai corresponds to Xi), and naming each axis of this resource space after the corresponding attribute name. For each tuple t(x1, x2, …, xn) in the table, insert xi (1 i n) as a coordinate into the axis Xi if no coordinate duplication exists. And then insert a resource-entry into the point p(x1, x2, …, xn) to represent the tuple t(x1, x2, …, xn). If a tuple t(x1, x2, …, xn) is deleted from the table, just delete the resource-entry residing in the point p(x1, x2, …, xn) and leave this point and corresponding coordinates alone. For each table T(A1, A2, …, An), let A1… Am (1 m n) be the key of T. Then, the axes X1… Xm will be set as the key of RS. Thus, there do not exist any two different points p(x1… xm, xm+1…xn) and p’(x1… xm, x’m+1…x’n) such that both p and p’ contain resource-entries simultaneously. The functional dependency in the relational database has been represented by the classification relationship in the RSM.

2.

116

Chapter 4 Algebra and Calculus of the Resource Space Model

Since the one-to-one mapping between tables and resource spaces as well as between operations on relational table and operations on the resource space, the basic operations (Union, Difference, Cartesian Product, Selection and Project) in relational database can be easily simulated by Resource Space operations: Merge, Difference, Join, Selection and Projection (Zhuge, 2004a). Thus, any information represented by the relational model can be easily managed by the Resource Space Model. Theorem 4.6. For any relational database system without null information, the generated resource spaces satisfy the first/second/third normal forms of the Resource Space Model. Proof. Let T(A1, A2, …, An) be a relational table and RS(X1, X2, …, Xn) be the generated resource space. According to the creation process of RS, it is clear that RS satisfies the first and second normal forms. In the following, we prove that RS conforms to the third normal form. All resources (tuples) managed by RS come from the table T. For any axis Xi and any tuple t in T, there exists a coordinate on Xi such that this coordinate is the projection of t on Xi. So t R(Xi) holds. On the other hand, R(Xi) only contains the resources derived from the tuples in T. Thus, for any two axes Xi and Xj, R(Xi)=R(Xj) holds. We can conclude that Xi Xj holds. So RS satisfies the third normal form.

4.4 Summary

The resource space algebra enables users or applications to directly and easily obtain the desired resources from the source resource spaces. The resource space calculus is a type of applied predicate calculus and provides a declarative style for describing the desired resources. The equivalence of the resource space algebra and the resource space calculus is discussed. We have shown that the Resource Space Model has at least as powerful expressive capability as the relational model by transforming relations into resource spaces. The algebra and calculus of the Resource Space Model are important part of the theory of the Resource Space Model. They are also the basis of the query language of the Resource Space Model.

Chapter 5 Searching Complexity of Resource Space Model

Given a resource space, it is important for us to know the relationship between the searching efficiency and its dimensions as well as the relationship between the searching efficiency and the coordinates at every axis. Is the dimension of a resource space the higher the better, the lower the better, or, other cases? Is the distribution of coordinates at every axis the evener the better? The answers help design, analyze and understand searching resource space.

5.1 Basic Concepts and Formulas

When a user gives a query Q(X1=q1, , Xn=qn), what is the complexity of searching the point Q(X1=q1, , Xn=qn) in resource space RS(X1, , Xn)? The basic approach is to check the axis and coordinate for qi (i=1, …, n) one by one. Here focuses on the intrinsic complexity of searching in resource space based on comparison between names rather than any specific searching algorithms. 5.1.1 On Computation Complexity The computation complexity studies intrinsic difficulty on the cost of computing resources (e.g., time and space) to solve problem. To study the computation complexity, first we need a computation model to illuminate which operations or steps are permissive and their cost. Turing machine and random access machine are common computing models. With common computing models, we can study the upper bound and lower bound of complexity of problems or seek the optimal algorithm (Aho et al., 1983). The computation complexity of a problem is the function of the scale of the problem, so firstly we need to define the scale of a problem. For matrix

118

Chapter 5 Searching Complexity of Resource Space Model

operation, the order of matrix can be defined as the scale of problem. If the number of operations or steps needed to solve a problem is the exponential function of the scale of problem, then the problem is regarded as having the complexity of exponential time. If the number of operations needed is the polynomial function of the scale of problem, then the problem is regarded as having the complexity of polynomial time. The problems with polynomial time algorithm generally can be solved easily, and the algorithms with polynomial time complexity are regarded as good algorithms. In the theory of computation complexity, the class of problems with complexity of polynomial time is denoted as P. There are many problems for which the best algorithms known have complexity of exponential time. Such problems exist in such areas as combinatorial, graph theory and operation research, and we do not know whether there exist polynomial time algorithms for them. There is a big class of these problems in practice whose computation complexities are equivalent. If we can solve one of them in polynomial time then we can solve all of them in polynomial time. The class is called the NP-complete problem class. For some problems, the upper bound of their complexity is the cost of the best algorithm known so far, but the lower bound can only be built by theoretical proof. To get the lower bound of complexity of a problem, we need to prove that there does not exist any algorithm whose complexity is smaller than the lower bound. It is obvious that building lower bound is much harder than building upper bound. To find the upper bound of the complexity of a given problem, we only need to study the complexity of one specific algorithm. But if we want to get the lower bound of complexity of the same problem, we must study all the algorithms which can solve this problem (this is usually impossible). From the perspective of computation complexity, the main task of the design and analysis of algorithm is to build the upper bound (Knuth, 1997b). Suppose n is the scale of problem, the following are some common problems and the upper bounds of complexity: 1. 2. In the worst case, the comparison-based sorting of n different elements needs O(nlgn) comparisons (Knuth, 1997c). The multiplication of matrix of order n needs O(n2.41) times multiplication operations (Robinson, 2005; Strassen, 1969; Cohn and Umans, 2003). The decision of whether a number of n digits is prime needs time O(nclglglgn) (Knuth, 1997b; Mairson, 1977).

3.

The greatest lower bound and the least upper bound depict the intrinsic complexity of problem and the best solution known so far. In fact, the

5.1 Basic Concepts and Formulas

119

greatest lower bound is the best lower bound known theoretically and the least upper bound is the best solution known in the real world, i.e., the best existing algorithm. A problem’s intrinsic complexity will not change with the newly discovered greatest lower bound or least upper bound. If the upper bound got from an algorithm is equal to the known greatest lower bound, then this upper bound (or lower bound) is exactly the intrinsic complexity of the problem. In this case, the algorithm is called optimal in this sense. In the problem of sorting based on the comparison between names (suppose all of the names are different), suppose S(n) (n is the scale of problem) is the number of comparisons must do at least in the worst case. By building a binary decision tree for this problem, we get a lower bound:

S ( n)

lg n!

n lg n

n ln 2

lg n 2

O(1) .

Analyzing the algorithm Binary Insertion (suppose the number of comparisons in the worst case is B(n)), we get an upper bound: n B ( n) k 1

lg k

n lg n

2 lg n

1.

Combining these upper and lower bounds for S(n) can reach:

lim n S ( n) n lg n

1.

So any algorithm (including Binary Insertion) with complexity of nlgn is asymptotic optimal. But it is desirable to obtain more precise information, and from table 5.1 we can see that: Table. 5.1 Comparison of lower bound and upper bound n= 1 0 0 2 1 1 3 3 3 4 5 5 5 7 8

lg n!

B(n)=

120

Chapter 5 Searching Complexity of Resource Space Model

When n=5, the lower bound given by lg n! is 7, but the upper bound given by B(5) is 8, then S(5) may be 7 or 8. The common algorithms we know which are asymptotic optimal, including Heap Sort, Merge Sort all give the upper bound 8 or 9. Then, how many comparisons are necessary for sorting five elements in the worst case? The answer is 7, but it is not easy to find a method which only needs 7 comparisons. This method is now called Merge Insertion due to the advantages of both Merge and Insertion. It was first proposed in (Demuth, 1956), and then generalized in (Ford and Johnson, 1959). 5.1.2 Searching Complexity and Formulas For the type of problems whose inputs are non-deterministic (there are many cases of inputs), such as sorting and searching, generally we consider two types of complexity, the worst-case complexity and the average complexity, denoted as W(n) and A(n) respectively, supposing the scale of problem is n. Suppose the probability of the occurring of every element in the set is the same and the probability of a successful searching is 1, then we have the following conclusions (Baase and Gelder, 2000; Levitin, 2003): Lemma 5.1 Searching an element K in a sorted set with n elements, any algorithm (based on the comparison between names) must do at least lg(n 1) comparisons in the worst case. Lemma 5.2 Searching an element K in a sorted set with n elements, any algorithm (based on the comparison between names) must do at least lg n 1 comparisons on average. Lemma 5.3 Searching an element K in an unsorted set with n elements, any algorithm (based on the comparison between names) must do at least n comparisons in the worst case. Lemma 5.4 Searching an element K in an unsorted set with n elements, any algorithm (based on the comparison between names) must do at least

n 1 comparisons on average. 2

The above lower bounds are all optimal. For example, the Binary Search algorithm can get the lower bounds in lemma 5.1 and 5.2, the Sequential Searching algorithm can get the lower bounds in lemma 5.3 and 5.4.

5.2 Basic Assumptions

121

Here lgn is the base-2 logarithm, logn is the base-10 logarithm, lnn is the natural logarithm, and log n is the base-a logarithm. x is the floor of a x, x is the ceiling of x, and we have: x 1 cially, for natural number n, we have: lg n

x

x x lg(n 1)

x 1 . Spelg n 1 and

lg(n 1)

lg n

1 (Graham et al., 1989).

For xi 0, 1 i n, there is a sequence of inequalities (Agarwal, 2000):

n 1 x1 1 x2 L 1 xn

n

x1 x2 L xn

x1

x2 L xn n

x12

2 2 x2 L xn . n

For a function f(x) defined on the real domain, if f (x)>0 or f (x)0), and for any >0, we have:

ln ln n

ln n, ln n

(n c ) , n c

(a n ) .

5.2 Basic Assumptions

For an n-dimensional resource space RS(X1, X2, …, Xn), we assume that the n axes are stored in the order (X1, X2, …, Xn), but they are not sorted in alphabet. For every axis Xi=(Ci1, Ci2, …, Cij), the coordinates on the axis are also stored in the order (Ci1, Ci2, …, Cij) and are not sorted in alphabet. Here the comparison between coordinate and the query name is regarded as the basic operation. Suppose N is the number of all of the points in the resource space RS(X1, X2, …, Xn), |Xi| is the number of coordinates on axis Xi, 1 i n, then

122 n i 1

Chapter 5 Searching Complexity of Resource Space Model

| X i | N . We fix N and then study the relationship between the

searching complexity and the dimension n as well as the relationship between searching complexity and the distribution of coordinates on every axis. Assume that points are as basic search units, axes are unsorted set, and the order of axes is unknown, then we have the following theorem.

Theorem 5.1. Given a resource space RS(X1, X2, …, Xn), where every Xi and its coordinates are unsorted in alphabet, 1 i n. Suppose the number of coordinates on axis Xi is |Xi|, then any comparison-based algorithm to find the answer to Q(X1=q1, , Xn=qn) in RS(X1, X2, …, Xn) must do at least

n(n 1) 2

n

| X i | times of comparison in the worst case. i 1

Proof. To get a point Q(X1=q1, , Xn=qn) in the space RS(X1, X2, …, Xn), search needs to check every qi on axis Xj, 1 i n. Suppose Ti is the number of comparisons needed to find qi, then, the times of comparison needed to n find point Q(X1=q1,

, Xn=qn) is T i 1

Ti . To find q1, we need to deter-

mine the axis Xi it belongs to first, then search q1 on Xi. Since there are in all n axes which are not sorted in alphabet, so it takes at least n times of comparisons to find a specific axis Xi in the worst case. There are |Xj| coordinates not sorted in alphabet on axis Xi, so searching for q1 among them needs at least |Xj| times of comparison in the worst case. So finding q1 needs at least n+|Xj| times of comparison in the worst case. We can use the same procedure to find q2, the only difference is we only need to search the remaining n 1 axes. Then, finding q2 needs at least n 1+|Xj| times of comparisons in the worst case. So in the worst case, finding point Q(X1=q1, , n n(n 1) n Xn=qn) needs at least n (n 1) L 1 | X i | times | Xi | 2 i 1 i 1 of comparison. Theorem 5.1 shows that the searching complexity is related to the dimension and the distribution of coordinates on axes. Then, what is the relationship between the searching complexity and the changing of the dimension? And what is the relationship between the searching complexity and the distribution of coordinates on every axis? The following two parts answer these questions.

5.3 Distribution of Coordinates on Axes

123

Corollary 5.1. To find the answer to Q(X1=q1,

, Xk=qk) in RS(X1, X2, …, k n(n 1) Xn) must do at least 2 comparison in the worst case (k n).

(n k )(n k 1) 2

| X i | times of i 1

5.3 Distribution of Coordinates on Axes

Because the number of points N in resource space is fixed, the distribution of coordinates on axes indicates the possible case of the number of coordinates on every axis when N is given. For example, if the number of points n N is 8 and the dimension of space is 3, according to i 1

| X i | N , then all

the possible distributions of coordinates on axes are: (1, 2, 4), (1, 1, 8) and (2, 2, 2). We can see that the distribution (1, 1, 8) is the most uneven, almost all of coordinates locate on only one axis. The distribution (2, 2, 2) is the most even, the number of coordinates on every axis is equal. Then from the perspective of searching complexity, which case of distribution is better? The following will answer this question.

5.3.1 Best Distribution of Coordinates

First we give the following theorem:

Theorem 5.2. Given a class of resource spaces RS(X1, X2, …, Xn), where every Xi and its coordinates are unsorted in alphabet, 1 i n. Let N be the number of points in the space and |Xi| be the number of coordinates on axis Xi, n be fixed dimension and every |Xi| be variable. Assume that the number of comparisons for any comparison-based algorithm for searching Q(X1=q1, , Xn=qn) in the resource space must do at least in the worst case

is W(n). Then, Min W(n)=

n(n 1) 2

1

nN n , and only when |X1| = |X2| = …=

|Xn|, this minimum can be reached.

Proof. Given any class of resource spaces RS(X1, X2, …, Xn), according to

theorem 5.1, W(n)=

n(n 1) 2

n

| X i | . Because the space dimension n is i 1

124

Chapter 5 Searching Complexity of Resource Space Model n fixed, we only need to find the minimum of i 1 n i 1

| X i | under the constraint:

| X i | N . According to the Mean Inequalities [R. P. Agarwal, DifferEquations and Inequalities, 2nd edition, CRC, 2000]:

ence n x1 x2 L xn

x1

x2 n

L xn

, xi 0, 1 i n, (and the equality holds n n

only when x1= x2= …= xn), so we have: i 1 n n

xi | X i |)

n( i 1 1 n

xi ) . Replace every nN , which can con1 n

1 n

xi with |Xi|, we can get: i 1

| X i | n( i 1 1 n

n(n 1) clude that Min W(n)= 2

|Xn|, the minimum can be reached.

nN , and only when |X1| = |X2| = … =

Theorem 5.2 shows that if the space dimension keep fixed, then when the distribution of coordinates on every axis is the most even, the searching complexity in the worst case is the least, i.e., in the above example the distribution (2, 2, 2) is the best. It is worth to note that for some N and space

n(n 1) dimension n, the lower bound 2

nN can never be reached. For

1 n

example, given N=8 and n=4, the best distribution (2, 2, 2) cannot be reached. And for N=12 and space dimension n=3, no matter what the distribution of coordinates is, the minimum in the theorem cannot be reached. This means that only when the given N and n satisfy some conditions, the theorem would hold. The following corollary gives the conditions that N and n should satisfy.

Corollary 5.2. Given a class of resource spaces RS(X1, X2, …, Xn), where every Xi and its coordinates are unsorted in alphabet, 1 i n. Let N be the number of points in the space and |Xi| be the number of coordinates on axis Xi, n be fixed dimension and every |Xi| be variable. Assume that the number of comparisons for any comparison-based algorithm to find a point Q(X1=q1, , Xn=qn) in the resource space must do at least in the worst case

5.3 Distribution of Coordinates on Axes

1

125

is W(n). Then Min W(n)=

n(n 1) nN n , and this minimum can be 2 N reached if and only if n lgN and log n is an integer.

N N lgN and log n is an integer. Let log n =S, accord-

Proof. (1) Suppose n

N ing to n lgN, we get log n 2, i.e., S 2. Now we only need to let |X1| = |X2| = … = |Xn|=S, then the lower bound in theorem 5.2 can be reached.

(2) Suppose the lower bound can be reached, then according to theorem n 5.2, |X1| = |X2| = … = |Xn|, let the value be S, from i 1

| X i | N we get

2,

N n

N N=Sn, then log n =S is an integer. Because N >1, S cannot be 1, then S

N

2 , we can get n

n

lgN. So now both n

lgN and log

is an integer

hold.

5.3.2 The Worst Distribution of Coordinates

First we give the following theorem:

Theorem 5.3. Given a class of resource spaces RS(X1, X2, …, Xn), where every Xi and its coordinates are unsorted in alphabet, 1 i n. Suppose N is the number of all the points in space and |Xi| is the number of coordinates on axis Xi, the space dimension n is fixed and every |Xi| is variable. Let the times of comparisons for any comparison-based algorithm to find an answer to query Q(X1=q1, , Xn=qn) in the resource space must do at least is

W(n) in the worst case. Then Max W(n)= N some |Xi|=N, this maximum can be reached.

n2 2

3n 1 , and only when 2

Proof. Given any RS(X1, X2, …, Xn) in the class of resource spaces, accord-

ing to theorem 5.1, W(n)=

n(n 1) 2

n

| X i | . Because the space dimeni 1 n

sion n is fixed, we only need to find the maximum of i 1 n

| X i | under the

constraint: i 1

| X i | N . Theorem 5.2 has shown that in the case of the

distribution of coordinates on every axis is the most even (|X1| = |X2| = … =

126

Chapter 5 Searching Complexity of Resource Space Model

|Xn|), the minimum can be reached, so we can guess that when the distribution of coordinates on every axis is the most uneven, we can get the maxin n

mum of i 1

| X i | . As

i 1

| X i | N , we guess the most uneven case is: for n some i, |Xi|=N and |Xj|=1 for any j i. In this case, i 1 n

| X i | =N + n 1, so

we only need to show that for any distribution of |Xi|, 1 i n, we have:

| X i | N + n 1. In the following we will prove this, first we have: i 1 n | Xi | i 1 n

|Xj |

|X j | 1

| Xk |

|X k | 1

(5.1)

As i 1

| X i | N , then |Xi| cannot all be 1. On the right side of (5.1) we | X j | n 1. As every |Xj| is 1, so

|X j| 1

have:

| Xk |

|X k | 1

N . Then we only | X k | . We use

|X k | 1

need to show that

| Xk |

|X k | 1

N , i.e.,

| Xk |

|X k | 1

mathematical induction to prove this. We only need to show that for any natural number m, |Xk|>1, 1 k m, m m

| Xk | k 1 k 1

| X k | holds. When m=1, it is obvious. p p

Suppose m=p, k 1 p 1

| Xk | k 1 p | X k | holds. When m=p+1, we have: p p

| Xk | | X p 1 | k 1 k 1

| Xk | k 1

| X k | (| X p 1 | 1) k 1

| Xk |.

5.3 Distribution of Coordinates on Axes p 127

Because k 1 p

| Xk | | Xk | k 1 p 2 and | X p 1 | 2 , we have: 2(| X p 1 | 1) | X p 1 | , which concludes: p p p 1

(| X p 1 | 1) p 1

| Xk | k 1 k 1

| X k | (| X p 1 | 1) k 1

| Xk | k 1

| Xk | | X p 1 | k 1

| X k |.

When m=p+1, the conclusion also holds. According to the proof process, we can see that the conditions of the equality holds are: m=1, or m=2 and |X1| = |X2|=2. The first condition corresponds to the case that some |Xi|=N | X j | = n 2 < n 1, now and |Xj|=1 for any j i. And when m=2,

|X j| 1 n

| X i | < N + n 1. So the condition that some |Xi|=N and |Xj|=1 for any i 1 n

j i is the only case satisfying i 1

| X i | =N + n 1. n Combining the above results, we have: i 1

| Xi |

N + n 1, so Max

W(n)=

n(n 1) 2

N

n 1=N

n2 2

3n 1 , and only when some 2

|Xi|=N (in this case |Xj|=1, j i), this maximum can be reached. Theorem 5.3 shows that if the space dimension is fixed, when the distribution of coordinates on every axis is the most uneven, the searching complexity in the worst case is the highest. Then in the above example, distribution (1, 1, 8) is the worst distribution. It is worth to note that for any given N and space dimension n, the upper bound N

n2 2

3n 1 can 2

always be reached. According to theorem 5.3, we only need to let some |Xi|=N and all the other |Xj|=1. For example, for N=12 and space dimension n=3, let |X1|=12 and |X2|=|X3|=1, then we can get the worst distribution (12, 1, 1). We have studied the searching complexity in the worst case when the space dimension is fixed, got the best case (theorem 5.2, the most even distribution) and the worst case (theorem 5.3, the most uneven distribution),

128

Chapter 5 Searching Complexity of Resource Space Model

these are two extreme cases. Intuitively, in the case that the distribution of coordinates is the most uneven, all information is on one axis, so the searching is more difficult. This shows that to design a resource space, we should keep the distribution of coordinates on every axis as even as possible (in the sense of search efficiency). Then, we can guess whether the following conclusion holds: the distribution of coordinates more uneven, the searching complexity the higher? How to evaluate the unevenness? First, we can think of the variation in the probability and statistics. According to the above proof process, we can guess whether the following conclusion holds: given natural numbers N and n, if two sequences of natural numbers M=(m1, m2, …, mn) and Y=(y1, n n

y2, …, yn) satisfying i 1 n n

mi

i 1

yi

N and Var(M) < Var(Y), then whether

we have i 1

mi < i 1

yi , i.e., E(M) < E(Y)?

Intuitively, the above conclusion should hold, in fact when n=2, we have the following corollary:

Corollary 5.3. Given natural number N, if two sequences of natural numbers M=(m1, m2, …, mn) and Y=(y1, y2, …, yn) satisfying

2 i 1 2

mi

i 1

yi

N and Var(M) < Var(Y), then E(M) < E(Y) holds

Proof. According to the definition of expectation and variation, we have

E(M) (m1+m2)/2, Var(M)=E((M E(M))2)=E(M2) (E(M))2. Then, Var(M) < Var(Y) E(M2) (E(M))2 < E(Y2) (E(Y))2.

Replace E(M) with (m1+m2)/2, we can get

1 2 (m1 2

2 m2 )

1 (m1 4

1 m2 ) 2 < ( y12 2

2 m2

2 y2 )

1 ( y1 4

2 y2

y2 ) 2 .

Which can be simplified is (m12

2m1m2 ) < ( y12

2 y1 y 2 ) .

According to m1m2=y1y2=N, adding 4m1m2 to the left side of the above inequality, adding 4y1y2 to the right side, we can get:

(m12

2 m2

2m1m2 ) < ( y12

2 y2

2 y1 y 2 ) . i.e.,

5.4 The Changing of Space Dimension

129

(m1

m2 ) 2 < ( y1

y2 ) 2

(m1

m2 ) < ( y1

y2 )

E(M) < E(Y) holds.

Above corollary shows that when n=2, the conclusion holds. Then whether the conclusion still holds when n 3? We have the following counterexample: given N=1944 and n=3, sequences M=(6, 18, 18) and n n

Y=(9, 9, 24) satisfying i 1

mi

i 1

yi

N , and Var(M)=32, Var(Y)=50,

also satisfying Var(M) < Var(Y) But E(M) = E(Y)=14 does not satisfy E(M) < E(Y). This shows that the conclusion does not hold when n=3. According to this counterexample, we can construct counterexamples for any given n (n>3), so our guess does not hold when n 3.

5.4 The Changing of Space Dimension

We have studied the relationship between the searching complexity and the distribution of coordinates on every axis when the dimension of resource space is fixed. When changing dimension, what is the relationship between the searching complexity and the change of dimension? In the perspective of searching complexity, the dimension of resource space is the higher the better, or the lower the better? This section answers these questions.

5.4.1 Relationship between Dimension and Searching Complexity Theorem 5.4. Given a class of resource spaces RS(X1, X2, …, Xn), where every Xi and its coordinates are unsorted in alphabet, 1 i n. Suppose N is the number of points in space and |Xi| is the number of coordinates on axis Xi, and the space dimension n and every |Xi| are all variable. Let f(n) be the Min W(n) in theorem 5.2, then there exists a unique critical dimension n0 (1< n0 < lgN ), such that if nn0, f(n) increases according to the increase of n. Proof. When the space dimension n is fixed, according to theorem 5.2,

f(n)=Min W(n)=

n(n 1) 2

1

nN n . Then we study the properties of f(n) ac-

130

Chapter 5 Searching Complexity of Resource Space Model

cording to the changing of dimension n. Let f(x)=

x( x 1) 2

1

xN x , x is real

and 1 x lgN, then we only need to study the properties of f(x), then discuss its values on integers. The differential of f(x) is:

f ' ( x)

x

N (1

1 x

ln N ) x

1 . 2

Then f (1)=1.5+N(1 lnN)0. It 2

is obvious that f (x) is continuous, so according to the median theorem of continuous functions [G. Strang, Calculus, Wellesley-Cambridge, 1991], there exits a zero point of f (x) between 1 and lgN. Then we can get the differential of f (x) again:

ln N f ' ' ( x) 1 N x2

1 x

ln N ln 2 N x . ) 1 N N ln N 2 (1 x3 x x 1

1 x

1

It is obvious that f (x)>0, then f(x) is a concave function and f (x) is strictly increasing, so the zero point of f (x) between 1 and lgN is unique. Let the zero point be Z, since f(x) is concave, we can get that f(Z) is the minimum of f(x) between 1 and lgN, and when xZ, f(x) is strictly increasing. If Z is an integer, let n0=Z. If Z is not an integer and f ( Z ) f ( Z ) , let n0= Z , else let n0= Z . Then we have that n0 is the unique minimum point of f(n), and if nn0, f(n) increases with the increasing of n. Theorem 5.4 shows that from the perspective of searching complexity, the space dimension is not the higher the better, and is not the lower the better either. There is a unique critical dimension, at first the searching complexity in the worst case decreases according to the increasing of space dimension, when space dimension reaches the critical dimension, the searching complexity in the worst case increases according to the increasing of space dimension, i.e., the searching complexity in the worst case is the least when the space dimension is the critical dimension. In the case of the most uneven distribution of coordinates on every axis, we have the following corollary:

Corollary 5.4. Given a class of resource spaces RS(X1, X2, …, Xn), where every Xi and its coordinates are unsorted in alphabet, 1 i n. Suppose N is

5.4 The Changing of Space Dimension

131

the number of points in space and |Xi| is the number of coordinates at axis Xi, and the space dimension n and every |Xi| are all variable. Let F(n) be the Max W(n) in theorem 5.3, then F(n) increases according to the increasing of n.

5.4.2 Value of Critical Dimension

Theorem 5.4 only gives the existence of critical dimension, but what is the value of critical dimension? The following theorem will answer this question.

Theorem 5.5. Given a class of resource spaces RS(X1, X2, …, Xn), where every Xi and its coordinates are unsorted in alphabet, 1 i n. Suppose N is the number of points in the space and |Xi| is the number of coordinates on axis Xi, and the space dimension n and every |Xi| are all variable. Let n0 be the critical dimension in theorem 5.4, then for any >0, there exists N0, such that if N >N0, then we have ln1 N n0 ln N . Proof. According to the proof of theorem 5.4, we only need to estimate the value of Z. As Z is the unique zero point of f (x) between 1 and lgN, we have:

1

f ' (Z )

Z

N Z (1

ln N ) Z

1 2

0.

This is a transcendental equation, it is impossible to solve the value of Z accurately, so we will estimate the value of Z in the next. We will prove that for any >0, if N is big enough, then we have ln1 N Z ln N . Because f (x) is strictly increasing, we only need to show that when N is big enough, f ' (ln1 N ) 0 and f ' (ln N ) 0 . It is clear that f (lnN) =0.5 + lnN>0. When

1 , f ' (ln1 N )

0 is obvious. When

1 2

1

1,

f ' (ln1 N )

N

0 is equivalent to ln1 N

N

1 2 and lim ln N

N

N ln

1

N

(ln N 1) , and

because lim ln1 N1 , then

, there exists N1>0, if N >

ln1 N 1 2

and

ln N

2 ,

which

means

that

ln1 N

2 ln1 N and (ln N 1)

1 ln N . Now we have: 2

132

Chapter 5 Searching Complexity of Resource Space Model

ln1 N

1 2

1

2 ln1 N < N ln

1

N

1 ln N 2

1

1

1

N ln

1

N

(ln N 1) .

So we only need to let 2 ln1 N < N ln

1

N

1 ln N , which is equivalent to 2

4 ln

1 2

N

N

ln1

N

, get the natural logarithm for both sides, we have:

ln 4 (1 2 ) ln ln N

Because lim ln

N N

ln N ln1 N

ln N .

lim (ln N

ln N ln ln N ln ln N )

N

lim ( ln ln N

N

ln ln ln N )

, we have . So

, then lim (ln N

(1 2 ) ln ln N )

there exists N2, if N > N2, then ln N

(1 2 ) ln ln N

1

ln 4 , i.e.,

ln 4 (1 2 ) ln ln N

N > N0, we have

ln N holds. If we let N0=Max{N1, N2}, then if

2 ln

1

N

N

ln1

N

(1

ln N ) ln1 N

0 ,

i.e.,

f ' (ln1 N )

0.

Combining above results, we can get that for any >0, there exists N0 such that if N >N0, then ln1 N n0 ln N . Theorem 5.5 shows that when N is big enough, the critical dimension n0 can approach lnN to any extent. The minimum of f(n) is about:

1

f (ln N )

ln 2 N

ln N

N ln N

ln 2 N

e ln N

O(ln 2 N ) .

5.5 Summary

This chapter discusses the complexity of searching a point in the resource space based on comparisons. We have studied the relationship between the searching complexity and the distribution of coordinates on every axis and reach that: from the perspective of searching complexity, the distribution of coordinates on every axis is the evener the better. We also discuss the relationship between the searching complexity and changing the dimension of

5.5 Summary

133

resource space, and conclude that the space dimension is neither the higher the better, nor the lower the better. There is a unique critical dimension which is optimal from the perspective of searching, and the value of the critical dimension is about lnN (N is the number of points in the resource space). These results are very helpful to design and analyze resource spaces.

Chapter 6 Resource Space Model Storage

The characteristics of the Resource Space Model require a special storage mechanism for efficient resource storage and retrieval. A novel multidimensional indexing structure is proposed to realize semantic-based resource re-organization and efficient retrieval.

6.1 Current Approaches to Storing Resource Space

Relational tables, XML files and spatial indexing structures could be used to store resource space, but it is hard to realize semantic integrity and storage efficiency. The following are two ways of storing resource space in a single relational table: 1. Let each axis correspond to an attribute of the table, and each coordinate of the corresponding axis corresponds to the attribute value. In this way, it is hard to represent hierarchical coordinates and to support efficient multi-attribute search. Let each coordinate correspond to an attribute of the table. The attribute value is of boolean type. This will result in low utilization ratio of storage space due to the magnitude of attribute number as well as the loss of hierarchical semantics.

2.

Fig. 6.1 depicts the transformation from a two-dimensional resource space into a table with attributes X and Y, and the transformation from a resource space into a table with attributes C1, C11, C12, C2, C3, C31, C32 and C4, by the above two kinds of representation respectively. It is feasible to represent all the resources in a single XML file. Each axis or coordinate in resource space corresponds to a tag of XML file. Hierarchical relationships between tags reflect the same semantics as hierarchical coordinates. The value of a tag is the list of all the resources belonging to the classification of that tag. Fig. 6.2 is an XML file representing the resource space described in Fig. 6.1. Tag is a sphere node whose

136

Chapter 6 Resource Space Model Storage

value is indicated by the retangle node {r1, r2}, which means both resource r1 and r2 are in the classification of C31. This storage manner is isimilar to inverted list. Each resource has the same number of copies in XML tree as the dimensionality of resource space. Such a redundancy requires an additional cost of integrity maintenace.

Fig.6.1. Two ways of storing resource space by relational table.

Fig. 6.2. An example of using XML file to store resource space.

Most spatial indexing structures are dedicated to such a space that each dimension has a linear ordering of its coordinates (Gaede and Gnther, 1998). Coordinates in resource space model, however, represent conceptual classification along their axes. They are discrete and usually have hierarchical semantic relationships rather than linear order. Datacube in

6.2 Problem Definition

137

OLAP resembles RSM, but it is mainly for online data analysis and statistics.

6.2 Problem Definition

In this chapter, we devise a specific multidimensional access method named C-tree for resource space storage. It organizes resources by classification semantics and stores semantic-close resources in adjoining place of storage space. Moreover, it preserves hierarchy semantics between concepts. We state the problem formally as follows: How to make the underlying index structures represent hierarchy semantics between concepts so as to implement effective and efficient resource insertion, deletion, exact query and range query on resource space. Hierarchy semantics is prevailing in concept classification. It reflects two important relationships between concepts. One is concept combination. For example, a car is composed of wheel, engine and fuel. The other is concept refinement. For example, book is one kind of publication. In fact, we recognize objects in the real world by such hierarchy semantics in many situations. If it is preserved in the process of index creation, we can utilize it to organize resources much better. The underlying index structures should satisfy the following two goals. the preservation of hierarchical semantics be1. The semantic Goal tween concepts. Hierarchy semantics between concepts should be kept in the underlying structure. Concepts and their hierarchy relationships are designed by system managers according to classification semantics. The design process of RSM already refines resource organization to a certain extent. The normal forms of RSM guarantee the quality of resource classification at the logical level. Meanwhile, it matches people’s thinking way of resource organization. 2. The operation Goal efficient resource operations including insertion, deletion, exact query and range query. The reason of preserving the hierarchical semantics of resource space is that the underlying indexing structure should be guided by such semantics in the process of resource insertion, deletion and query.

138

Chapter 6 Resource Space Model Storage

6.3 System Architecture

An overview of system architecture is depicted in Fig. 6.3. It includes four major components: RSM Schema Definition Module, Resource Operation Input Module, RSM Schema Tree Module, and Physical Storage Space Module. RSM Schema Definition Module is responsible for the input of RSM schema in the format of RS(X0(C00, C01, ..., C0p), X1(C10, C11, ..., C1q), ..., Xn-1(Cn-1 0, Cn-1 1, ..., Cn-1 r)). We use C(C0, ..., Ct) to represent the hierarchy relationships between parent concept C and its child concepts C0 to Ct. Fig. 6.4 depicts an example of RS(X(C1(C11, C12, C13), C2(C21, C22)), Y(C3(C31, C32), C4(C41, C42)), where C2(C21, C22) means concept C2 is the super-concept of C21 and C22. Point P contains resource r1 and r2 both of which belong to the classification “X=C21, Y=C41”.

Fig. 6.3. System architecture.

Resource Operation Input Module is responsible for the input of resource operation from users. Five major kinds of resource operations are

6.3 System Architecture

139

considered here: insertion, deletion, modification, exact query and range query. They are expressed by users who only know RSM schema but not the underlying storage format. In this sense, the resource query “X=C1, Y=C2” is equivalent to the query “Y=C2, X=C1”.

Fig. 6.4. A point and its resources in a resource space. RSM Schema Tree Module encodes the input RSM schema into bit strings which preserve all hierarchy semantics between concepts. In this way, a one-to--one mapping is set up between RSM concepts and bit strings, and stored in a single disk file. If the file is small, load it into internal memory before doing resource operations. Otherwise, build up an index in the head of the file which will be loaded into internal memory instead of the whole file. The index should well support the search for bit string given concept as well as the search for concept given bit string. RSM Schema Tree Module also sends related RSM schema information to Physical Storage Space Module to help the initialization of the underlying indexing structure C-tree. It transforms the resource operations input from Resource Operation Input Module into the format of the underlying storage. Physical Storage Space Module is responsible for the creation and maintenance of C-tree. C-tree is stored in a single disk file. Each node corresponds to a page in the file. In default, the first page stores the root node of C-tree. Leaf node keeps a certain number of classification points in RSM space. Each classification point corresponds to another page which keeps resource locations like file path and URI (Uniform Resource Identifier). Ctree extends R-tree to index the underlying multidimensional bit string

140

Chapter 6 Resource Space Model Storage

space where there is no linear order but hierarchy relationships between coordinates.

6.4 RSM Storage Mechanism

To achieve the semantic goal, we devise a RSM schema tree to encode all hierarchy semantics of a given RSM in a single binary tree. It provides three basic functions as follows: 1. void createStorageSpace(RS rs) it accepts a RSM schema as input and creates an underlying physical storage space, a multidimensional bit string space. 2. BitString normalize(Concept c) it accepts a concept as input, and returns its path in the schema tree as a bit string. RSM schema tree plays the role of mapping RSM and resource operations at the logical level into those at the physical level. 3. BitString normalize(Axis a) it accepts an axis as input, and returns its path in the schema tree as a bit string. To achieve the operation goal, we devise C-tree to index the underlying multidimensional bit string space. It provides the following four basic functions: it finds out all the re1. ResourceSet exactQuery(Classification c) sources belonging to the input conceptual classification. Each conceptual classification is a point in the multidimensional bit string space. 2. Boolean insert(Resource r) it inserts the given resource in C-tree. The input resource contains its conceptual classification as well as location. 3. Boolean delete(Resource r) it deletes the given resource from C-tree. 4. ResourceSet rangeQuery(ClassificationRange range) it returns all the resources whose conceptual classifications are inside the given conceptual classification range. C-tree puts nearby classification points together in external memory so as to achieve an amortized cost of O(logN + T) where N is the number of resources and T is the number of retrieved resources.

6.5 RSM Schema Tree

141

6.5 RSM Schema Tree

Each axis of RSM corresponds to a concept tree as depicted in Fig.6.6. For an RSM of dimensionality d, its schema consists of d concept trees. By forest-to-tree transformation, we can construct a binary tree of the given RSM. By labeling each left edge with bit 0 and each right edge with bit 1, we encode all axes and concepts into bit strings by combining all the bits in their root-to-node paths. Fig. 6.5 demonstrates the generation of an RSM schema tree from the resource space in Fig. 6.4.

Theorem 6.1. Let s1 be the bit string of axis X and s2 be the bit string of concept C. C is a concept in X if and only if s10 is the prefix of s2. Proof. The proof consists of the following two parts:

(=>) In the construction process of RSM schema tree, if C is a concept in X, C’s corresponding node in RSM schema tree is in the left subtree of X’s corresponding node. Therefore, s10 is the prefix of s2. () Since C2 is a sibling concept of C1, there are only sibling concepts between them. In terms of their bit strings, s1=s2(1+) or s2=s1(1+). () Suppose C2 is the parent of C1. If C1 is the first child (or 0th) of C2, then s1 equals s20 which can be easily inferred from the construction process of RSM schema tree. If C1 is the ith child of C2, then s1 equals s20(1+) where the number of 1 is i. Therefore, s1 equals s20(1*). () If C2 is the parent concept of C1, then s1 equals s20 or s20(1+) by Theorem 6.2. Therefore, s20 is the prefix of s1 in this case. Otherwise, one child C3 of C2 is the ancestor concept of C1. Let s3 be the bit string of C3. From the construction process of RSM schema tree, we know that s20 is the prefix of s3 and that s3 is the prefix of s1. Therefore, s20 is the prefix of s1 . ( 2, let C’ be the child of C and the ancestor of the concept Ci. Ci is the first concept among C0, C1, C2, ..., and Ck-1 in the pre-order traverse of the RSM schema tree. By the same reasoning, we can conclude s=lcp(s0, s1,..., sk-1).cutTail(01*).

Fig.6.7. C is the nearest common ancestor of C0 and C1. C’ is the child of C as well as the ancestor of C0. Theorem 6.7. Suppose concept C1 and C2 are in the same axis X. s1 is the bit string of C1. s2 is the bit string of C2. C is their nearest common ancestor in the concept tree of X. s is the bit string of C. Then dist(C1, C2) = zeroCount(s1’) + zeroCount(s2’), where s=lcp(s1, s2), s1=ss1’, s2=ss2’, and zeroCount(str) is the number of 0 in the bit string str. Proof. Since bit 0 represents one time of concept refinement, the number of 0 in s1’ is equal to the shortest path between C1 and C, and the number

6.5 RSM Schema Tree

145

of 0 in s2’ is equal to the shortest path between C2 and C. Since the shortest path from C1 to C2 is the concatenation of the shortest path from C1 to C and the shortest path from C to C2, dist(C1, C2) = zeroCount(s1’) + zeroCount(s2’). By Theorem 6.7, we can calculate the semantic distance between any two concepts given their bit strings. Theorem 6.5 shows that the semantic distance function is a metric function. According to Theorem 6.1 to 6.4, all hierarchy semantics can be determined just according to concepts’ bit strings, which include ancestordescendant relationship, parent-child relationship, sibling relationship, and concept-in-axis relationship. Therefore, it is enough to only store the bit strings of axes and concepts rather than the schema tree. File file_schema depicted in Fig. 6.8 is on this purpose. Using RSM schema tree, all hierarchy semantics between concepts are encoded into bit strings. By certain rules of computation on given concepts’ bit strings, their semantics can be exposed. The RSM schema tree plays the role of interface between the above logical resource space and the underlying physical storage space which is a multidimensional bit string space.

Fig.6.8. The One-One mapping between concepts and bit strings is stored in File file_schema.

One remaining problem in RSM schema tree is that a bit string may be extremely long even up to a linear order of the number of axis concepts. It

146

Chapter 6 Resource Space Model Storage

is mainly caused by the magnitude of the number of sibling nodes. As known to us, the number of concept refinements is rather small in applications, say, less than 32 levels, which is confined by people’s recognition ability. Hence the depth of concept hierarchy tree is limited in applications. However, the number of a concept’s children can be quite large. For example, there are 193 countries in the world. In RSM schema tree, it requires 192 consecutive “1” bits appended to the bit string of concept world to represent the last country. We propose a compressed encoding method to set an upper bound for the length of concepts’ bit strings. It works as follows. Given a bit string, retrieve the first 7 bits. If it contains at least one 0, pack it with a byte by setting the first bit as 1. Otherwise, read more bits until 0 appears or the number of 1 bit adds up to 120. In either case, pack the number of counted 1 bits with a byte by setting the first bit as 0. Proceed with the above process until the residual bit number is no more than seven. Some packing bits are necessary when the residual bit number is less than seven.

Algorithm compressCode(s) // in Java language ByteList bl = ; // byte sequence byte count = -1; while(s.length > 7) if(count == -1) byte tmp = cut 7 bits from s head; if(tmp == (01111111)2) count = 7; else tmp = tmp.setFirstBit(1) bl.append(tmp); else cut all successive 1 at most 120 from the head of s; count += the number of 1 cut off; bl.append(count); count = -1; count = s.length; bl.append(pack(s)); // pack s with all 0 at the end bl.append(count);

6.5 RSM Schema Tree

147

Take the bit string 1101011 1111111 1111111 1110111 0101 in format of 7-bit segments as an example. The first 7-bit segment is packed in byte 11101011. Since the next 7-bit segment comprises seven 1, more bits are consumed until 0 appears. The counted number of 1 bit is 17 whose binary value is 00010001 which is treated as the second packing byte. The next 7-bit segment is 0111010. It is packed in byte 10111010. Now the residual bits are a single 1. We append successive 000000 to align its length with seven. Then we set the first bit of its packing byte as 1. So we get the third packing byte 11000000. At last, we append one more byte which records the length of the residual bits at the previous step. Its first bit is set as 0. Therefore, the final byte sequence is 11101011 00010001 10111010 11000000 00000001. More 1 bits does the original bit string have, more efficiency is our compressed encoding approach. Algorithm compressCode describes the above approach in detail. The following theorem gives an estimation of the byte number after compression.

Theorem 6.8. Assume the depth of concept hierarchy is at most D and the maximal number of any concept’s children is at most 127c, where c is a constant. Then, the number of bytes after compression is at most D(1+c). Proof. One time of concept refinement incurs one 0 bit, so one byte needs to preserve this information during bit string compression. Since the maximal number of child of any concept is 127c at most, c bytes are enough to represent all 1 bits incurred by sibling concepts of the same parent concept. Therefore, 1+c bytes are enough to go down one level in the axis’s concept tree. Because the concept hierarchy depth is D at most, the byte number after compression is D(1+c) at most.

It is easy to decode a compressed bit string to the original. Each time decode one byte. Examine the first bit of the byte. If it is 1, the left seven bits belongs to the original bit string. If it is 0, recover a number of 1 bits which is equal to the value of the byte. The above process is carried out until only two bytes are left. The value of the second one is the number of packing bits in the first byte. So it is also convenient to recover the original information. In fact, we do not need to fully recover all the information since most of the time we just compare or do partial calculation upon their bit strings.

148

Chapter 6 Resource Space Model Storage

6.6 C-tree

Utilizing RSM schema tree, resource space is converted to multidimensional bit string space where each coordinate is a bit string. Hierarchy semantics exist between bit string coordinates, but there is no total ordering between them. Current multidimensional access methods can do little in this setting. From their perspective, the underlying storage device is abstracted as a linear array, which supports fast sequential access but time-consuming random access. In contrast, there is no total ordering among points in a multidimensional space. Accordingly, current multidimensional access methods concentrate on devising efficient ways of putting nearby points into adjoining pages. In multidimensional bit string space, the proximity of points is more complex than conventional multidimensional space. The shortcoming is that bit string coordinates do not have a linear ordering. Fortunately, they have a metric semantic distance as defined in Definition 6.1. Therefore, semantic distance between concepts (bit strings) can help construct more effective and efficient indexing structures. We propose C-tree to get this work done, where C means Concept and Classification. It inherits the basic ideas of classic R-tree and its variants (Guttman, 1984). Moreover, the hierarchy semantics encoded in bit strings is extremely useful for resource insertion and query. It has not only spatialclose but also semantic-close resources stored in adjoining storage space.

6.6.1 Resource Operations

R-tree can be seen as the multidimensional version of B-tree. Points nearby in the space are grouped together in the same leaf node. Hence each leaf node corresponds to a Minimum Bounding Rectangle (MBR) of the points inside it. Leaf nodes with nearby MBRs are grouped together again to form the next upper level. This procedure continues until only one node is remaining which is made as the root node. Fig. 6.9 depicts an example of R-tree.

6.6 C-tree

149

Fig. 6.9. An example of R-tree.

Nowadays, R-tree has already become a design rationale of spatial index structures, far more than just a specific indexing tree. It has three basic components: MBR format, INSERT_POLICY, and SPLIT_POLICY. MBR format is usually a hyper-rectangle or hyper-sphere in conventional multidimensional access methods. It should have certain transitivity property, say, spatial containment relationship. The two policies follow the good standards like the minimization of blank space, overlap area and area margin. Usually, distinct definitions of these four components lead to different spatial index trees, such as R+-tree, R*-tree, etc. Procedures for resource operations are generally as follows.

// find point p inside node n. Algorithm exactQuery(p, n) if(n is leaf) foreach(point c inside n) if(p = = c) output c; else foreach(branch ni of n) if(p is inside ni’s MBR) exactQuery (p, ni);

150

Chapter 6 Resource Space Model Storage

// find points inside given range and node n. Algorithm rangeQuery(range, n) if(n is leaf) foreach(point c inside n) if(c is inside range) output c; else foreach(branch ni of n) if(range intersects with ni’s MBR) rangeQuery(range, ni);

// insert point p Algorithm insert(p) node n = root; while(n is non-leaf) ni is the best branch of n by INSERT_POLICY; n = ni; add p into n; if(n overflows) split(n); else if(n needs adjust)

// delete point p Algorithm delete(p) // n is the leaf node containing point p node n = find(p, ROOT); if(n not exist) return; delete p from n; if(n underflows) condense(n); else if(n needs adjust) adjust(n);

6.6 C-tree

151

// node n need to adjust its MBR as a result // of child change Algorithm adjust(n) do if(n is root) adjust ROOT; return; node n’ = n.pp; // n’ is the parent of n. adjust entry n in n’; n = n’; while(n needs adjust);

// node n need to split as a result of overflow Algorithm split(n) do split n to new nodes p and q by SPLIT_POLICY; if(n is root) generate a new ROOT with entry p and q; return; node n’ = the parent of n; remove entry n from n’ and add p and q into n’; n = n’;

152

Chapter 6 Resource Space Model Storage

// node n need to condense as a result of underflow Algorithm condense(n) Queue Q = ; // FIFO queue while(n is not root && n underflows) // add into queue’s tail Q.addTail(all the entries of n); node n’ = the parent of n; remove entry n from n’; n = n’; adjust(n); // get head node in Q and reinsert it while(n = Q.getHead()) reinsert(n);

// reinsert n at its original level Algorithm reinsert(n) node n’ = root; // n.lev is the level of node n. // Leaf node’s level is 0. while(n’.lev > n.lev + 1) ni is the best branch INSERT_POLICY; n’ = ni; add n into n’; if(n’ overflows)

of

n’

by

6.6.2 Minimum Bounding Rectangle

In multidimensional bit string space, MBR is not easy to perceive and visualize. The impediment lies on no linear ordering between bit string coordinates. So no two coordinates can be found enough to set the range of several given coordinates. If we simply list them all, too many coordinates must be stored in the upper-level nodes of the index tree. It seems inevita-

6.6 C-tree

153

ble to predefine an order between concepts. However, such an order should preserve hierarchy semantics between concepts as much as possible. Meanwhile, resource operations should follow the hierarchy semantics. In our implementation, we define the order as the pre-order traverse of RSM schema tree. Fig. 6.10 depicts an example. [Cs,Ce] is the range of the concepts in shape of dark nodes. [Cs,Ce] covers all dark nodes and shadow nodes. Therefore, an MBR is in format of ([s0,e0], [s1,e1], ..., [sn-1,en-1] ) where si is the bit string of Cis, ei is the bit string of Cie, [Cis, Cie] is the concept range on the ith dimension, and n is the dimensionality of the multidimensional bit string space.

Fig. 6.10. Concept range [s, e] in the concept tree. Theorem 6.9. The containment relationship between C-tree’s MBRs satisfies the transitivity property. Proof. MBR’s projection in each dimension satisfies containment transitivity property, so MBR satisfies containment transitivity property. Definition 6.2. Bit string s is the proper prefix of s’ if and only if s is the prefix of s’ but not equal to s’. Theorem 6.10. Let s be the bit string of concept Cs, e be the bit string of concept Ce, and t be the bit string of concept C. C is in range of [Cs, Ce] in the concept tree if and only if s t e where is the alphabetical order between bit strings assuming 0 is in front of 1 in the alphabetic.

154

Chapter 6 Resource Space Model Storage

Proof. Since the preorder traverse of concept tree is the same as the preorder traverse of transformed binary tree, C is in range of [Cs, Ce] if and only if t is in range of [s, e], that is, s t e.

MBR may need update after point insertion. Assume the point is p(p0, p1,..., pn-1) and the MBR is mbr([s0, e0], [s1, e1],..., [sn-1, e n-1]). If pi is contained by the range of [si, ei], [si, ei] remains no change. If pi precedes si, then [si, ei] is changed to [pi, ei]. If ei precedes pi, then [si, ei] is changed to [si, pi]. One characteristic of multidimensional bit string space is that coordinates can be inserted or deleted. It is useful in real applications, since concept hierarchy semantics evolutes with time. New concept refinement represents a deeper understanding of the application’s semantics.

6.6.3 On INSERT_POLICY

Insert policy decides which MBR among several sibling MBRs is the best to incorporate the given point. In conventional multidimensional space, good means least area enlargement, least overlap area enlargement, least perimeter, etc. The basic idea is making MBR more compact so that blank space is as small as possible. In this way, less space needs accessed in the query process. In multidimensional bit string space, however, such three good measurements are hard to compute since we do not know exactly how many coordinates are inside a given concept range. What we are able to know is the following two: 1. 2. whether the containment relationship satisfies between concept and concept range, and the semantic distance between concept and the start/end concepts of the concept range.

We will show how to define the semantic distance between a concept and a concept range, which contributes to more compact MBRs as well as better grouping according to semantic clustering.

Definition 6.3. Given a point p(p0, p1,..., pn-1) and an MBR mbr([s0, e0], [s1, e1],..., [sn-1, en-1]), the semantic distance between p and mbr is measured by distPM(p, mbr) = i=0..n-1 min{dist(pi, si), dist(pi, ei)}.

6.6 C-tree

155

The smaller is distPM, the better is the insertion of p into mbr. Two cases needs to be taken into account when deciding the best sibling MBR to insert point p into. When p is inside one or more MBRs, resolve tie by selecting the MBR which has the smallest semantic distance to p according to Definition 6.3. This aims at promoting the compactness of MBR. When p is outside each MBR, three measurements are considered in decreasing priority. 1. least overlap. Let overlapNum(mbr, i) be the number of sibling MBRs whose projections on ith dimension intersect with MBR mbr’s projection on ith dimension. Let mbr’ be the MBR after inserting point p into mbr. Then overlapNum(mbr’, i)-overlapNum(mbr, i) measures the increased projection overlap number on ith dimension for inserting point p into mbr. least overlap is the MBR with the smallest i=0..n-1 (overlapNum(mbr’, i)-overlapNum(mbr, i) ). least semantic distance. Select the MBR which has the smallest semantic distance to p as defined in Definition 6.3. least perimeter enlargement. Suppose mbr’ ([s0’, e0’], [s1’, e1’],..., [sn1’, en-1’]) is the resulting MBR of inserting point p(p0, p1,..., pn-1) into MBR mbr ([s0, e0], [s1, e1],..., [sn-1, en-1]). The perimeter enlargement is [si, ei] is equal to dist (si’, i=0..n-1 ([si’, ei’] [si, ei] ), where [si’, ei’] si)+dist (ei’, ei).

2. 3.

6.6.4 On SPLIT_POLICY

When a node in the tree index overflows, it needs split since the allocated storage space for it is limited. The goodness standards for node split are much the same as that of insert policy. Blank space should be minimized to improve query efficiency. The difficulty of achieving this goal in C-tree is the same as that of insertion. Our approach consists of two steps. At the first step, pick up two children of current node as seeds. They should be the farthest pair of child nodes. Suppose mbr1=([s10, e10], [s11, e11],..., [s1 n-1, e1 n-1] ) and mbr2=([s20, e20], [s21, e21],..., [s2 n-1, e2 n-1] ). Their pre-order distance distPre(mbr1, mbr2) is defined to be i=0..n-1 f([s1i, e1i], [s2i, e2i]) where f([s1i, e1i], [s2i, e2i])

156

Chapter 6 Resource Space Model Storage

s2i [ s1i , e1i ] s1i [ s2i , e2i ] min{dist ( s1i , e2i ), dist ( s2i , e1i )} otherwise

. distPre(mbr1, mbr2) computes the distance in the pre-order traverse of RSM shema tree. It measures the farness of a pair of child nodes by the sum of the semantic distances of their projections on each dimension. It measures an overall semantic closeness from the perspective of multidimensional classification. The second step is assigning each left child nodes of current node to the nearer one of the two seeds. For example, mbr’ will be assigned to mbr1 if distPre(mbr1, mbr’) is smaller than distPre(mbr2, mbr’). This process will group together similar classification zones in the space.

6.6.5 Disk management

0

Fig. 6.11 depicts an example of C-tree. Three classification points p1, p2 and p3 are grouped together in node A. The other two are grouped in node B. A and B are the children of the root node.

Y C31 C3 C32 C41 C4 C42 C1 C11 C12 C13 C21 C2 C22 X p2 p1 p3

Fig. 6.11. An example of C-tree. Three points p1(C12,C31) {fp1, fp2, fp3}, p2(C11,C3){fp6} and p3(C13,C32){fp4, fp5} are grouped together in rectangle A.

6.7 Summary

157

Fig. 6.12 describes the corresponding external memory storage using a single file file_ctree. It is divided into pages each of which corresponds to a certain size of consecutive storage space. The size of the page would better match disk’s block size to enable efficient page read and write. The first page is allocated for root node of C-tree in default. Each of the other tree nodes is also stored in one page which is randomly allocated. The pointer to the child node in C-tree is transformed to page shift of the child node’s page in file_ctree. Each point inside a leaf node represents some kind of conceptual classification. It is accompanied by the shift number of the page storing its resources, for example, a list of file paths if the resources are local files, or URI (Uniform Resource Identifier) in the network setting. page no. 0 shiftr

file_ctree

(mbrA,shiftA) fp4 fp5 (mbrB,shiftB)

shiftq shiftA shiftB

fp6 (mbrp,shiftp) (mbrq,shiftq)

shiftp

fp1 fp2 fp3

Fig. 6.12. file_ctree stores C-tree in external memory.

6.7 Summary

The Resource Space Model uses hierarchical classification semantics to reflect semantics in the real world, document world, machine world and

158

Chapter 6 Resource Space Model Storage

mental world. Relational tables are not effective in supporting hierarchical classification semantics. XML files bring into resource redundancy. Moreover, their performance in multi-attribute search is not good. Multidimensional access methods depend on linear order between coordinates of axis, while there is no linear order on coordinates in Resource Space Model. The proposed RSM storage mechanism transforms a resource space into a multidimensional bit string space by encoding coordinates into bit strings, and then uses C-tree to index the multidimensional bit string space. Hierarchy semantics is embodied in the bit strings and used by C-tree in resource insertion and deletion to group semantic-close resources in disk. C-tree is not only a novel multidimensional indexing structure but also a semantic-based resource re-organization mechanism for efficient search.

Chapter 7 Structured Peer-to-Peer Resource Space

The Resource Space Model represents normalization while Peer-to-Peer systems represent autonomy. Integrating resource space with the structured Peer-to-Peer network can construct a structured Peer-to-Peer resource space to realize the synergy between normalization and autonomy.

7.1 Basic Idea

7.1.1 The Problem

Peer-to-Peer (P2P) networks can be largely classified into two types: unstructured and structured. In unstructured P2P networks like Freenet (Clarke et al., 2000), Gnutella (http://www.gnutella.com) and Napster (http://www.napster.com), each peer manages its own data, and there is no particular assumption about the assignment of data onto peers. In structured P2P networks, data items (or indexes of data items) are assigned onto peers according to some rules. One popular type of structured P2P networks is the Distributed Hash Table (DHT) based networks like CAN (Ratnasamy et al., 2001), Pastry (Rowstron and Druschel, 2001), Chord (Stoica et al., 2001) and Tapestry (Zhao et al., 2001). They mainly aim at finding efficient ways to locate resources. The Resource Space Model uses normalized classification semantics to uniformly specify and manage various resources. Integrating the resource space with P2P networks offers a chance for P2P networks to manage complex resources by content. It is also a chance for the Resource Space Model to support decentralized applications by cooperating with P2P networks. Previous works on

160

Chapter 7 Structured Peer-to-Peer Resource Space

Resource Space Model mainly focused on the model itself and the centralized storage mechanism. If we want to use the Resource Space Model to manage a Web community or an office network, the following issues are critical: 1. 2. How to implement the Resource Space Model in a decentralized environment? How to find a decentralized data structure to represent the Resource Space Model and appropriate algorithms to efficiently implement its operations?

This chapter presents an approach to construct a resource space overlay to form a structured P2P Resource Space Model. The basic idea is similar to the space partition idea of CAN (Ratnasamy et al. 2001).

7.1.2 A Brief Introduction to CAN

Like other DHT-based P2P networks, CAN provides applications with an interface that maps key of a resource into the peer storing this resource. It organizes an n-dimensional Cartesian space. This Cartesian space is partitioned into zones, with one or more peers serving as owner(s) of the zone. An example of 2-dimensional CAN is shown in Fig. 7.1. Each key in the system is mapped into a point in the space using the distributed hash table. The peer that owns the zone containing the point owns the corresponding key, and is responsible for returning the resources it holds.

(0.5-0.75, 0.5-1.0)

1.0

C (0.0-0.5, 0.5-1.0) D E (0.75-1.0, 0.5-1.0)

A (0.0-0.5, 0.0-0.5)

B (0.5-1.0, 0.0-0.5)

0.0 0.0 1.0

Fig. 7.1. A 2-dimensional Cartesian space of CAN with 5 peers.

7.1 Basic Idea

161

The issue of routing from a source peer to a destination peer in CAN turns into the issue of routing from one zone to another in the Cartesian space. This routing will follow the path through the Cartesian space from the source point to the destination point. A peer joins the CAN by picking a random point in the Cartesian space, routing to the zone that contains the point, splitting the zone into two, and occupying one by itself. A peer departs from the CAN by asking one of its neighbors to take over its zone.

7.1.3 Basic Approach

The following is the basic approach to deploy a resource space on structured P2P networks (structured P2P RSM). It divides the topological space of the resource space into many independent zones. Each node in the P2P network takes charge of one zone. Each node maintains the information of its neighbor nodes for routing. The approach naturally reserves the topological space view of the resource space and supports the basic operations of the Resource Space Model. The following are challenges of partitioning the resource space: 1. The resource space is not a Cartesian Space like CAN’s partition space. It is a discrete classification space. Moreover, its coordinates could be in tree structure. Therefore, the partition of resource space needs some preprocessings and constraints. There is no guarantee that the indices (i.e. the coordinates of resources) of resources are evenly distributed in the resource space, since resources are likely to center around the hot points (a point in resource space represents a topic). Load balancing becomes a major issue of the system.

2.

162

Chapter 7 Structured Peer-to-Peer Resource Space

7.2 The System Design

7.2.1 The Basis

The resource space is a special n-dimensional topological space. For efficient routing, our approach assumes that the coordinates at every axis are ordered. The order specified by the designer takes the highest priority. If the designer does not specify the order, the coordinates are ordered according to some simple semantics such as the lexicographical order, numerical order, and time order. This is reasonable once the resource space has been designed and used to stably support applications. After ordering, the distance between two coordinates is defined as the number of coordinates between them. The structured P2P RSM only concerns resource locating operation the most basic operation of the Resource Space Model. All the nodes share the same resource space schema in the structured P2P RSM. The entire resource space is dynamically partitioned among all the nodes in the system such that every node owns its individual and distinct zone in the overall resource space. Each zone owns a continuous range of coordinates along each dimension. Fig. 7.2 shows a 2-dimensional resource space partitioned by 6 nodes.

7.2 The System Design

163

1.0 Node 1 Node 2 Node 3

Node 4 0.0 A

Node 5

Node 6

Z

Fig. 7.2. Example of 2-dimensional space with 6 nodes.

The resources in resource space are stored as follows: Each resource in the resource space can be represented by a point whose coordinate will be taken as the index of this resource. Then, the (key, value) pair is stored at the node that owns the zone within which the corresponding point lies. Here, key is the coordinate values of the point in the resource space, and value is either the resource itself or the pointer to the resource, like IP address of the node possessing this resource. To retrieve a resource, the requesting node must route the query message in the structured P2P RSM overlay to the target node storing the resource. Effective routing is therefore a crucial aspect of the design of the structured P2P RSM. Nodes in the structured P2P RSM self-organize into an overlay that represents the resource space. No super node is needed in this overlay. Each node maintains a small piece of information such as IP addresses of its neighbors and coordinate information of the corresponding zones. This information will serve as the routing table that enables routing between two arbitrary nodes in the structured P2P RSM overlay.

7.2.2 Node State

Each node maintains a routing table that holds the IP addresses and the virtual coordinate zones of its neighbors in the resource space. In an n-

164

Chapter 7 Structured Peer-to-Peer Resource Space

dimensional resource space, two nodes are neighbors if their coordinate zones overlap at n 1 dimensions and adjoin at one dimension. In Fig. 7.2, node 3 is a neighbor of node 1 because its coordinate zone overlaps with the node 1’s at the X-axis and adjoins at the Y-axis. Node 6 is not a neighbor of node 1 because their coordinate zones do not adjoin at both the X and Y axes. The zone of each node is an n-dimensional rectangle and can be denoted by R.

R = (R0, R1, …, Rn-1)

(7.1)

Here, n is the number of dimensions and Rk is a closed interval [a, b] describing the scope of the zone along dimension k. The size of the routing table is very small, i.e. O(n) for a n-dimensional resource space. No special effort is needed to well design the structure of the routing table. The speed of searching the routing table is fast.

7.2.3 Routing

The local neighbor states are sufficient to route between two arbitrary points in the space. All structured P2P RSM messages include destination coordinates. The node uses a greedy method to forward the message to the destination. The routing procedure is executed whenever a message destined for coordinate D is arrived at a node P. It is shown in pseudo code as follows. Notations: 1. 2. 3. 4. 5. 6. Z: the zone occupied by P. Ni: one neighbor of node P, suppose P has m neighbors, 0 i

Premium Essay

...About Apples History The history of apples stretches back to the days of Adam and Eve, when it is believed to have been the “forbidden fruit” described in the Bible. Despite this long standing history, apples did not always grow naturally in New England. While the first apples are thought to have grown on the lower slopes of Tian Shan, a mountain range separating Kazakhstan and Krygystan, they also grew wild in Central and Southwest Asia, China, Italy, Switzerland, Spain and Greece. Through conquest and exploration, apples were spread when Romans conquered England and when Spaniards brought them to Mexico and South America. It wasn’t until the mid 1600’s that the Pilgrims cultivated them in Massachusetts. It is believed that John Endecott, an early governor, was the first to bring an apple tree to North America, and the first orchard was planted on Beacon Hill by a clergyman named William Blaxton. It is Blaxton who is credited for growing the first named apple, the Yellow Sweeting. Once apples were established in New England, they played an active role in everyday life. As a fruit which was easily stored through the winter, as well as being very beneficial to settlers’ health, apples were a main staple in early settlers’ diets. Despite the fact that apples were not initially from North America, and have been growing disease-free for centuries in their native habitats, the early settlers found that the long, hot summers and cold winters of New England grew apples......

Words: 1318 - Pages: 6

Free Essay

...Apple From Wikipedia, the free encyclopedia Jump to: navigation, search This article is about the fruit. For the technology company, see Apple Inc.. For the apple genus, see Malus. For other uses, see Apple (disambiguation). "Apple tree" redirects here. For other uses, see Apple tree (disambiguation). Apple A typical apple Scientific classification Kingdom: Plantae (unranked): Angiosperms (unranked): Eudicots (unranked): Rosids Order: Rosales Family: Rosaceae Subfamily: Maloideae or Spiraeoideae[1] Tribe: Maleae Genus: Malus Species: M. domestica Binomial name Malus domestica Borkh., 1803 Synonyms Malus communis Desf. Malus pumila auct.[2] Pyrus malus L.[3] The apple is the pomaceous fruit of the apple tree, species Malus domestica in the rose family (Rosaceae). It is one of the most widely cultivated tree fruits, and the most widely known of the many members of genus Malus that are used by humans. Apples grow on small, deciduous trees. The tree originated in Western Asia, where its wild ancestor, Malus sieversii, is still found today. Apples have been grown for thousands of years in Asia and Europe, and were brought to North America by European colonists. Apples have been present in the mythology and religions of many cultures, including Norse, Greek......

Words: 379 - Pages: 2

Free Essay

...The apple is from a tree that grows in the ground. It is red and it can be used to make apple sauce, apple juice, or apple butter. Many people enjoy eating apple sauce because you can add cinnamon for flavoring. Apple juice is also a popular beverage because of its sweet taste. Sometimes people use apple butter on different types of bread. The apple is from a tree that grows in the ground. It is red and it can be used to make apple sauce, apple juice, or apple butter. Many people enjoy eating apple sauce because you can add cinnamon for flavoring. Apple juice is also a popular beverage because of its sweet taste. Sometimes people use apple butter on different types of bread. The apple is from a tree that grows in the ground. It is red and it can be used to make apple sauce, apple juice, or apple butter. Many people enjoy eating apple sauce because you can add cinnamon for flavoring. Apple juice is also a popular beverage because of its sweet taste. Sometimes people use apple butter on different types of bread. The apple is from a tree that grows in the ground. It is red and it can be used to make apple sauce, apple juice, or apple butter. Many people enjoy eating apple sauce because you can add cinnamon for flavoring. Apple juice is also a popular beverage because of its sweet taste. Sometimes people use apple butter on different types of bread. The apple is from a tree that grows in the ground. It is red and it can be used to make apple sauce, apple juice, or apple......

Words: 379 - Pages: 2

Free Essay

...negocio, la innovación, alrededor de esta variable giró la nueva filosofía de Apple Inc.. Con la introducción de productos nuevos e innovadores, como el reproductor de música iPod, que se ha convertido en uno de los pilares del éxito actual de Apple. Jobs ha mantenido esta estrategia con el lanzamiento de nuevos productos, estilizados y de fácil uso que ayudaron a que incrementara su share, generando nuevos mercados. Jobs ha sabido identificar claramente la tendencia del mercado hacia la sincronización entre los teléfonos móviles y PC, así como el mercado de la música digital, con una visión estratégica enfocada hacia estos cambios. Los objetivos se establecieron alrededor del éxito financiero y su medición mediante las unidades vendidas. Jobs desarrolló una estrategia de entrar en estos mercados por productos diferenciados, estilizados y fácil de usar, impulsado por una investigación de nuevos productos y la evaluación de las tendencias del mercado y su reacción con estas innovaciones. Dado lo anterior considero que Steve Jobs se desempeñó de una manera sorprendente ya que supo redirigir a su compañía y enfocarla a un futuro muy claro para ellos con su visión. Convirtió a Apple en un rotundo caso de éxito después de que hasta el 2007 presentaba pérdidas en sus finanzas. 2. What are the chief elements of Apple’s strategy? How well do the pieces fit together? Is the strategy evolving? Apple mantiene productos innovadores en el mercado, al darse cuenta de que......

Words: 1976 - Pages: 8

Free Essay

...Resumen Ejecutivo Apple Computer ha pasado por varios CEO y ha sido exitoso, pero también estuvo a punto de la bancarrota. Se han destacado por innovar con sus productos creando una nueva necesidad para sus consumidores. Muchos acreditan a Steve Jobs por el éxito de Apple y se cuestionaron cómo sería el desempeño de Tim Cook. La empresa se destacó por crear productos complementarios los unos con los otros, como el caso del iPod y iTunes. Esta ha sido una de las muchas estrategias utilizadas por la empresa para hacer frente a los distintos mercados en los que compiten. También realizaron alianzas estratégicas que les permitieron expandir sus mercados, aunque también han participado en litigios por patentes. Entre las categorías de productos que manufacturan estan: computadoras portátiles y de escritorio, celulares, tablets, software, etc. A pesar de que era una empresa con liderazgo total, los consumidores han ido cambiando muchas veces el producto por Samsung, ya que hace falta un poco de reinventar, dar nuevas ideas y nueva tecnología que impresione al consumidor, productos nuevos salen pero los cambios que se han hecho son poco perceptibles por el consumidor. En cuanto al análisis interno de la empresa, la cadena de valor de Apple le permite tener una ventaja absoluta en términos de innovación y ......

Words: 2241 - Pages: 9

Free Essay

...evaluar y racionar antes las fuerzas ajenas a la empresa que puedan afectar las operaciones. La compañía Apple inc., junto a sus afiliados, los diseñadores, fabricación y mercadeos de las computadoras personales, las comunicaciones móviles y dispositivos de medios portátiles y reproductores de música digital, así como la venta de programas, servicios, soluciones a redes y contenido digitales y aplicaciones en todo el mundo. La compañía Apple inc., vende sus productos en todo el mundo a través de sus tiendas en los centros comerciales y tiendas a través del internet, tiene ventas a mayoristas, tiene distribuidores y revendedores de valor añadido. Además sus macs, iphone, ipad y productos compatibles con ipod, incluyendo las aplicaciones para estos. También tienen impresoras, dispositivos de almacenamientos, altavoces, auriculares y otros tipos de accesorios, a través de sus tiendas tanto físicas como online y los contenidos digitales y aplicaciones a través de itunes store. La compañía vende sus productos a las empresas de consumo, pequeñas y medianas empresas, escuelas, gobiernos y los mercados creativos. A partir de septiembre 25 2010, había 317 tiendas incluyendo 233 en Estados Unidos y 84 tiendas intermediarias. Se le conoce como Apple computer inc., y fue fundada en el 1976 y su central se encuentra en cupertino, california. Tenemos que destacar que Apple es una compañía fenomenal en sus productos, tanto las portátiles como los desktop, sus teléfonos, tablas, y...

Words: 470 - Pages: 2

Premium Essay

...NANYANG BUSINESS SCHOOL AB311 STRATEGIC MANAGEMENT GROUP STRATEGIC REPORT ON APPLE INC. SEMINAR GROUP 2 TEAM GENIE Instructor: A/P LAI SI TSUI-AUCH Word Count: 5,999 Done by: CHAN ZHE YING GOH CHUWEN LEE KOK CHONG TEO KOK MIN JOHN 1 Table of Contents I. EXECUTIVE SUMMARY ............................................................................................................... 3 II. MAIN REPORT............................................................................................................................... 5 1. Introduction of Apple Inc. ........................................................................................................... 5 1.1 1.2 2. 2.1 History................................................................................................................................. 5 Current Business Strategy ................................................................................................... 5 SWOT Analysis ........................................................................................................................ 10 Promising Opportunities ....................................................................................................... 10 The Shift from the PC to Mobile Era ............................................................................ 10 Emerging Markets ......................................................................................................... 11 Consumer Digital......

Words: 8330 - Pages: 34

Premium Essay

...Marketing Opportunities for Apple Name: Institutional Affiliation: Date: Table of Contents Introduction 3 History of apple 3 Market Presence and Revenue Standings 4 Market research 4 Secondary market research 5 Apple brand review 5 Market segmentation 5 Research analysis of consumer needs and wants 7 Summary on the client's wants and needs 9 Research analysis on apple products Preferences 9 Summary 10 Conclusion and recommendations 10 Reference: 11 Introduction History of apple Apple lnc was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne. The main idea of establishing the Apple lnc at the time was to sell Apple 1, which is a personal computer kit. Steve Jobs during the establishment was one of the majority shareholders with approximately 45% of the total shares, Steve Wozniak also had share as Steve Jobs of 45%. Wayne owned the remaining 10% ownership (bott.org, 2014). During the formation of the Apple, Inc. Company, both Jobs and Wozniak were young entrepreneurs with no asset to their names. They were therefore not afraid of taking any risk. On the other hand, Wayne was a little bit older and had his own personal assets. Due to his fear of undergoing a huge risk, he sold his company ownership stake to Steve and Wozniak for 800$. The valuation of Wayne’s ownership compared to today’s company’s market value, it would be exceeding 3...

Words: 2457 - Pages: 10

Premium Essay

...Purpose The purpose of this report is to present a relevant Discussion Forum and Blog to Apple Computer, Inc. Apple Computer, Inc is the one of main manufacturer of a line of personal computers under the Apple Macintosh brand name, peripherals, and computer software. Two interest groups that focus on services of Apple Company are introduced in this report. The first one is a Discussion Forum named ¡°AppleInsider-Forum¡±. This is a web page concerning all the products of Apple Company, such as iPod, ITunes and Mac computer and let people discuss about these product. The visitors of this discussion forum usually are current and prospective users of Apple¡¯s products. This forum offers people a place to exchange their opinions and experiences in using Apple¡¯s products. The other one is a Blog named ¡°The cult of Mac Blog¡±. It is a news and opinion about Apple and the Mac community. This Blog is powered by Leander Kahney who posts news and threads about Apple on this Blog and viewer may follow their comments. APPLEINSIDER-FORUM Description AppleInsider launched in 1997 and quickly grew to become one of the Internet's premier sources of information for all things about Apple. This forum¡¯s nine different sections cover every aspect of Apple¡¯s products, from hardware to software, from purchasing advice to tech support. Everyday many fans of iPod mp3 player or Macintosh computer gather in this forum to......

Words: 1068 - Pages: 5

Premium Essay

...The multi billion-dollar corporation, Apple Inc., designs and manufactures some of today’s highest technological gizmos and gadgets. Among their best known products are the Apple and Macintosh computers, iPods, iTunes, iPhones and iPads. Apple is one of the most powerful and influential high tech companies in the world. The success of Apple Inc. stems from the innovation and visions of co-founder and entrepreneur, Steve Jobs, the excellence of the stylish, user-friendly products, and the ability to create innovative products that consumer’s desire. The development of Apple Inc. came during the unstable economic times of the 1970’s. Best friends and college dropouts, Steve Jobs and Stephen Wozniak pooled their electronic and business skills to market what was to become the first personal computer. Stephen Wozniak had designed a small computer, the Apple 1, for the enjoyment of some friends at a Homebrew Computer Club meeting. The Apple 1 developed in Steve Jobs’ bedroom and garage, while he envisioned the commercial potential of a personal computer that could help families with personal finances and small businesses with day to day tasks. Vision, drive and creativity allowed this entrepreneur to take the risk to create a business. The challenge of building that business and the desire to control his destiny required passion and perseverance along with innovation. Apple’s first personal computer, the Apple 1, took six months to design and 40 hours to build with an initial......

Words: 1680 - Pages: 7

Premium Essay

...Apple Computer, Inc.: Maintaining the Music Business while Introducing iPhone and Apple TV Leave a reply Topic: Apple Computer, Inc.: Maintaining the Music Business while Introducing iPhone and Apple TV Subject: Business Details: 1. Strategic challenges facing Apple Computer. 2. Dimensions along which company success can be measured. 3. Critical external and internal environmental factors that have strategic implications for Apple\’s future. 4. Dow Apple\’s strategy stands up against industry rivalry. 5. Recommendations you would make to enhance the effectiveness of the company\’s strategy or to change its strategic approach for better results. Abstract: Apple computers were started some 35 years ago by Steve Jobs and Steve Wozniak in the garage of Steve’s home. It has achieve tremendous growth and is currently one of the largest companies in the US marketing electronic technological produces such as the iPad and many other such items that are used extensively by consumers. The company is dedicated to providing its customers the best know-how and understanding through its original hardware, software, and computer related devices along with the best possible services. The major tactical challenge that Apple computer is facing is that the company’s competitors try to surpass its accomplishments and that they are bringing into the market comparable products that are much cheaper than the products marketed by Apple Inc. Introduction Apple Computer was......

Words: 1402 - Pages: 6

Free Essay

...Running head: ETHICS BEHIND APPLE AND FOXCONN RELATIONSHIP 1 Ethics Behind Apple and Foxconn Relationship Maryana Didovych The College of Westchester ETHICS BEHIND APPLE AND FOXCONN RELATIONSHIP 2 Abstract This paper examines Apple, Inc.’s relationship with one of its biggest suppliers, Foxconn Technology Group. Recent growth in suicide incidents at Foxconn factories again caught media’s attention. Whether Apple’s decision to stay in business with Foxconn despite these incidents is ethical or not is examined using Traditional 5-Question approach. Contradictory evidence is also examined. Based on the result of 5-Question approach and reviewed evidence it can be concluded that Apple’s decision may indeed be unethical. Recently published evidence suggests Apple and Foxconn are addressing several issues, but close monitoring of the improvement process is required to ensure success. ETHICS BEHIND APPLE AND FOXCONN RELATIONSHIP 3 Ethics Behind Apple and Foxconn Relationship One of the biggest suppliers and manufacturers of Apple Inc’s (Apple) products recently has been involved in scandals concerning working conditions of its factory workers. This company is called Foxconn Technology Group (Foxconn). It operates in more than 40 research and development centers as well as manufacturing facilities in Asia, Russia, Europe and the Americas. According to Pratap, Radhakrishnan and Dutta (2012), Foxconn is “the world’s biggest contract electronics manufacturer,......

Words: 3108 - Pages: 13

Premium Essay

...The Apple Company is launching a new fall campaign titled, “The Big Apple!” to promote the sales of its new laptop computer, the Mac Book Pro. Featuring the campaign in New York City, Apple stores will be selling the new MacBook Pro laptops with three new cover designs. Known for its simplicity in computer design, Apple expects to ‘wow’ audiences with a departure from the usual look and a venture into new creative territory. Aimed at (but not limited to) the creative personalities that make up fast-paced and glamorous New York City, the campaign is expected to be a big success. The campaign will consist of a city-wide contest where applicants can design a cover that involves some of the elements that represent New York City and then send their idea in to Apple. The judges will then pick one of the designs to be featured alongside the two other covers which will be created by two different icons in New York city which Apple will have personally picked. One will be an up-and-coming designer, the other an artist. By doing this, Apple creates a connection between the people in the city, the culture, and their own brand. The campaign will run from September through till December, during which time the contest will be held, the designs finalized, and the new Mac Books will be available for purchase. Target Audience/Market: In terms of the target audience that Apple is looking to focus on with their campaign, there are a few demographics that the company would like to......

Words: 1957 - Pages: 8

Premium Essay

...Hase BUSS 508 October 21, 2014 The Apple Corporation has become one of the largest corporations in the world. There are a lot of companies that would like to be mentioned in the same breath as Apple. Many companies want to emulate their success. In this paper I will examine Apple current position and reputation, regarding ethical and social responsibility. According to Crane and Matten (2013) “One of the basic tenets of the Corporate Social Responsibility (CSR) movement in business has been it being voluntary and meeting social expectations above and beyond the law.” The Apple Corporation has been publishing its CSR report on its website since 2007. On Apples website it states “Workers everywhere should have the right to safe and ethical working conditions. They should also have access to educational opportunities to improve their lives. Through a continual cycle of inspections, improvement plans, and verification, we work with our suppliers to make sure they comply with our Code of Conduct and live up to these ideals”. Living up to the previous statement concerning apples commitment to ethical and social responsibility has not been an easy one. My position on whether Apple has met their responsibilities would be no because with their brand being the world’s best global brand, they should be held to a higher standard. When you are the leader in your field other corporations are looking at you to ensure all the rules are being followed. Apples 2013 Supplier Responsibility......

Words: 2168 - Pages: 9

Premium Essay

...Apple in the digital age from the iPod to the iPad Apple Inc. The Case Study 2000 - 2010 Foreward John Ashcroft Welcome to this Apple case study. I have always been something of a computer geek. My first computer was a Commodore Pet in 1978. It had 8k of RAM and a cassette player for storage. Programmed effectively, a two dimensional pencil sketch of a rocket would take off and zoom off screen. Beyond that and a few simple games, I don’t recall it did much at all. My first experience of Apple was the Apple II in the early 1980’s. The combination of Apple and a Visicalc spreadsheet, greatly enhanced financial and business plan modelling. Business models were more easily produced and what-if simulations were available at the click of a button. It was a great step up from the pencil and calculator. Seven years ago, I abandoned Microsoft and converted entirely to Apple. Apple Macs, MacBooks, MacBook Air, iPods, iTouch, the iPhone and the iPad, I had to try them all and never looked back This is the case study of Apple in the digital age. The great era of the iPod, the discovery of the digital hub and Apple’s move into the mainstream consumer market with the iPod, the iPhone and the iPad. It has many great examples for enthusiasts of marketing, leadership, organization, financial analysis and strategic management. The story begins almost ten years ago. In 2001, Apple sales fell by a third and the company reported an operating loss of $350 million some 6% of sales. The......

Words: 5086 - Pages: 21