Asdddddddddddddddddddddddddddddddd

118

IJCSNS International Journal of Computer Science and Network Security, VOL.10 No.6, June 2010

Indexing and Querying Semistructured Data Views of Relational Database
B.M. Monjurul Alom, Frans Henskens and Michael Hannaford
School of Electrical Engineering. & Computer Science, University of Newcastle, AUSTRALIA that contains a XML data repository at its core [22]. It is difficult to achieve high query performance using XML data repositories, since queries are answered by traversing many individual element-to-element links requiring multiple index lookups [23] . In the case of XML data, queries are even more complex because they may contain regular path expressions [24]. Thus additional flexibility is needed in order to traverse data whose structure is irregular or partially unknown to the user. Another option for managing semistructured data is to store and query it with a relational database [22]. In the database community many researchers argue that the relational (and objectrelational) model, due to its maturity and widespread usage, is still the best option [25]. XML query processing is much more complicated than traditional query processing methods because of the structure of XML [1]. A path expression specifies patterns of selection predicates on multiple elements related by a tree structure named Query Tree Pattern (QTP). Consequently, In order to process an XML query, all occurrences of its related QTP should be distinguished in the XML document. This is an expensive task when huge XML documents are attended. The well known query processing method termed as Structural Join is described in [2]. In Structural Join, query is decomposed into some binary join operations. Thus, a huge volume of intermediate results are produced in this method. The Holistic twig join approach [3] do not decompose the query into its binary Parent-Child (P-C) or Ancestor-Descendant (A-D) relationships but they need to process all of the query nodes in the document. The query processing method termed TJFast [12] which only process elements which belong to the leaves of QTP instead of processing all the nodes in the XML document. But this method use a structure named Finite State Transducer (FST) for decoding the code of nodes into the same name of the path traversed from the root for each node, so FST waste a lot of time. The contribution of this paper is querying semistructured data using bitmap to represent path-value relationship and compress the bitmap to save space. The presented BIQS supports the Structural Join query, Path query, and Tree structure query which are processed by joining path results. The BIQS technique also supports the

Summary
The most promising and dominant data format for data processing and representing on the Internet is the Semistructured data form termed XML. XML data has no fixed schema; it evolved and is self describing which results in management difficulties compared to, for example relational data. XML queries differ from relational queries in that the former are expressed as path expressions. The efficient handling of structural relationships has become a key factor in XML query processing. It is therefore a major challenge for the database community to design query processing techniques and storage methods that can manage semistructured data efficiently. The main contribution of this paper is querying semistructured data using bitmap to represent path-value relationship and compress the bitmap to save space. The presented bitmap indexing and querying scheme termed BIQS data that stores the element path, token of the word, attribute and document number in a dynamically created matrix structure. We use word, attribute and path dictionaries for the construction of a Bitmap structure. This paper describes an algorithm to query semistructured data in a more time efficient way than is provided by other relational and semistructured query processing techniques. The presented BIQS structure provides storage and query performance improvement due to the compression of semistructured data.

Key words:
Structural Join, XQuery, XPath, Bitmap, TwigStack, MySQL.

1. Introduction
Query processing is an essential part of any type of databases as well as Semistructured (XML) databases [1]. Semistructured data does have some structure, but this structure is not as rigid, regular, or complete as the structure required by traditional database management systems [20]. The use of XML as a semistructured data format is becoming more prevalent, specially when performing tasks such as the simple integration of data from multiple sources [21]. The growth of XML repositories on the Web has led to much research on storing and indexing for efficient querying of XML data. One option for managing semistructured, as well as XML, data is to build a specialized data manager

Manuscript received June 5, 2010 Manuscript revised June 20, 2010

IJCSNS International Journal of Computer Science and Network Security, VOL.10 No.6, June 2010

119

type of query where only a portion of the path name is mentioned in the query. The paper presents the comparison of query execution time of BIQS to the other XML query processing time (Structural Join and TwigStack) and relational query (Oracle, MySQL) processing time. Experimental results show that the proposed technique queries the semistructured data in a time efficient way than is provided by some of the other existing XML and relational query processing techniques. The paper presents the “time and space” complexity of the proposed BIQS algorithm. The paper also addresses the issue of relational (Semistructured data) querying using compressed bitmap structure, the word, path, and attribute dictionaries. The bitmap structure provides the facility to store huge information of words and paths into each cell of a single block to compress the data. This compression leads the data retrieval can be performed efficiently with low latency. To understand the functionality of the proposed technique, the algorithm shows the storing of sixteen words and paths information into each memory cell of a single block by a decimal value for the data compression. But the compression is possible for the presented structure upto the information of thirty two words and paths into each memory cell of a single block. No lose of any XML information is always maintained for the proposed techniques. The remainder of this paper is organized as follows: Related work in section 2, a framework of the proposed method is described in section 3. The algorithm for Bitmap Structure is presented in section 4. Searching and querying documents is described in 5. Section 6 experimental results. The paper concludes with a discussion and final remarks in section 7.

2. Related Work
Many query processing techniques such as Holistic Twig Join methods have been proposed in [6, 8, 13, 18] to process a twig query efficiently; however, they still suffer from large number of redundant function calls. A new approach termed TwigStack+ is proposed in [19] to solve this problem, which based on holistic twig join algorithm that greatly improve query processing performance. The TwigStack+ technique is used to reduce the query processing cost, simply because it can check whether other elements can be processed together with current one. The proposed technique also used to check the usefulness of an element from both forward and backward directions. Different XML query processing techniques are also elaborated in [4, 7, 9, 11, 15]. TSGeneric+[6] made improvements on TwigStack by using XR-Tree to skip some useless

elements which have Solution Extensions but cannot participate in any path solution. TwigStackList [8] handles the sub-optimal problem by attaching an element list to each query node to cache some elements, TJFast [12] improved the query processing performance by scanning elements of leaf nodes in the query to reduce the I/O cost. Although the existing methods [6] can guarantee the optimality of CPU time and I/O when only AD edges involved in the twig pattern, they all suffer from large number of redundant function ( getNext(root) calls. A query processing and update processing method termed EXEL (Efficient XML Encoding and labeling) is presented in [10]. SIGOPT (schema information graph) to optimize XML query processing is described in [17]. The presented technique explores the opportunities for schema information to affect the query evaluation performance. Multi-level operator combination in XML query processing is described in [16] which elaborates the importance to consider the operations at each level. Specifically, the technique considers the influence of projections and set operations on pattern-based selections and containment joins. Database management systems support indexing to provide better query performance. Indexing provides a flexible, uniform and efficient mechanism to access data [22]. There are some path indexes like Strong DataGuide[26], Fabric Index, ToXin[27], APEX[28], Index1[24] , A(k) Index, and Fix[29] which are indexing the path of document’s nodes to facilitate access to nodes required in XML query processing methods. These path indexes are other kinds of query processing methods which are against the Structural Join[2], Twig Join[3] and TJFast[12] methods. Most of the indexing schemes can only be applied to some limited query processing stages or limited class of queries. To overcome these limitations an indexing scheme called ToXin [27] has been developed. ToXin fully exploits the overall path structure of the database in all query processing stages, consisting of Path index and value index. A three dimensional bitmap indexing scheme named Bitcube [30] considers a more complex frequency table that represents a set of documents together with both a set of element paths and a set of words for each path. A new system for indexing and storing XML data based on a numbering scheme for elements is proposed in [11]. Query capabilities are provided through Structural Join and Twig queries, which are the core components of standard XML query language, e.g. XPath[31] and XQuery[32]. Techniques also exist for querying XML data such as Lorel[21], XML-QL[33] , XQL[34], UnQL[35], XML-GL[34], XSL[34], Quilt[25]; however these query languages are complicated to use and

120

IJCSNS International Journal of Computer Science and Network Security, VOL.10 No.6, June 2010

have some limitations. A comprehensive effort has been done on storing and querying XML data using relational databases are described in [4, 7, 9, 15, 36-40] [23, 36-48] also a comprehensive effort has been done on XML data compression are presented in [22].

3. Framework of the Proposed Technique
3.1 Overall Architecture of the System
To understand the functionality of the proposed technique the overall architecture is presented in Fig. 1. Data processing engine is used to generate a word dictionary, a path dictionary, and an attribute dictionary, which together become the basis of a bitmap matrix to store XML document information. The path elements are calculated from root to nested sub element among all the XML documents. The attribute dictionary records all the attributes (not distinct attributes) including the content of each attribute and the corresponding document number. The word dictionary records a token number for each distinct word. The path dictionary stores all the distinct path elements including their path number. Multi block compressed structure stores all the raw information in compressed form. Token and Path (TP) structure are used to represent the token and path number. Secondary indexing is used for searching the token and path number from the Token and Path structure to reduce the search time. The compressed structure with dictionary and TP (Token Path structure) are maintained on main memory. Input query through query manager is applied to the compressed structure to obtain the output query. The developed structure is not always same, if the set of documents are considered but in different order. Different order of the documents provides the differentiation of the structure that does not mean the structure loses some XML information. The structure always maintains the exact information of the original XML database whether the set of documents are considered in different order or same order. For any order of the documents, the data are stored in a multi block compressed structure, which leads the searching efficiency. Also for the use of dynamic matrix structure the efficiency of the updating is not degraded.

path numbers for the words. All the tokens exist (are bounded) within their corresponding path numbers in the first row of the BIQS structure. . We use a negative (-) sign before all path numbers to differentiate them from token. The first column of the matrix stores the document number. The entries of the matrix use a bit value (1/0) to represent the existence or not of the word and element path within the document number. To represent a new path from an XML document, this approach initially generates a new column in the matrix structure. The first row (entry) of the column stores the path number (from path dictionary) and a value 1 is inserted to the next row of the created column. The value 1 denotes a path’s existence in the document. The tokens (from word dictionary) of all the words within the selected path number are stored similarly by creating new columns in the matrix structure. A value 1 is inserted into the next entry of each of the created columns for the token. Each row of the matrix structure records all the information of each distinct XML document. The system similarly completes the matrix creation for all XML documents. BIQS does not create new columns within an existing path, for the same word even for different documents. The technique always creates new columns for the same word but different path number, whatever the document number. We consider the XML documents given in Fig. 2, Fig. 3, Fig. 4, and Fig. 5 for use in demonstrating our proposed bitmap construction.

3.3 Construction of Dictionary and BIQS with Example
The word dictionary, path dictionary and attribute dictionary (comprising Tables I, II and III) have been created from XML documents shown in Fig. 2, Fig. 3, Fig. 4 and Fig. 5. The attribute dictionary, given in Table III, shows that an attribute named key has four different values in different documents such as 2 and 4. In path dictionary, nasa.datasets.dataset.title and dblp.msthesis.title represent two different path number. The system creates a new column in the matrix structure (given in Table IV) to record the path name “nasa.datasets.dataset” from document1 (given in Fig. 2), and the path number (-1) is assigned to the first row of the created column, and a value 1 is assigned to the next row of the created column to indicate existence of the document1. There are no words within this path number except some attributes. Therefore no token is updated within this path number. Similarly, for the path number 2, a new column is created in the structure. For all the words within this path number, a new separate column is created and the value 1 is assigned to the next row of the created column indicating their existence to the corresponding

3.2 Construction of Bitmap Structure
BIQS dynamically creates a two dimensional matrix structure which represents the existence of all the words and element paths in the corresponding documents. The first row of the matrix structure records all the token numbers for the corresponding words and the associated

IJCSNS International Journal of Computer Science and Network Security, VOL.10 No.6, June 2010

121

document. Thus the token 1 for the word ProperMotion is recorded within the path number 2. The value 1 is assigned to the next row of the created column indicate their existence to the corresponding document. In Table IV, 1 is the token within path number 2. The created matrix structure after extracting all the words and paths from document1 to document4, is given in Table V.

3.4 Methodology Structure

to

Compress

the

Bitmap

The system separates the BIQS structure into two structures to compress the XML data. The first row is one structure named Token and Path structure used to represent the token and path number. This row is indexed serially starting from 0. Later, these indexes are used for searching the token and path number from the Token and Path structure. Another structure named Compressed (BIQS) Bitmap Index Structure consists of all other remaining rows of the matrix (except the first row). In this structure each row is divided into blocks. In each block, the information of 16 (words and paths) bit cells is compressed. As each row represents the information of each XML document, there may be a different number of blocks for each document and each block consists of different values for different documents. The compression is also possible using 32 bit cells. The token and path structure is presented in Table VI. The compressed bitmap structure is presented in Table IX. The value of each 16 bit cells is recorded in decimal form. If there is not enough of the data to form a block with 16 bit cells, we fill these bits with zero. The compressed BIQS structure is given in Table VIII; the first column of the structure represents the document number and the other three columns represent the blocks. The value of each block is generated from the BIQS structure given in Table V. The values of the Block0 are 65472, 57, 0 and 39. These values represent the compressed information of data for different XML documents. This compression does not lose any information. We use the compressed BIQS structure for

searching the data. Realistically, we are not converting the binary values (from Table V) into decimal values (into Table VIII) rather than we store the information for 16 words and paths into a single cell of a block. The system separates the BIQS structure into two structures to compress the XML data. The first row is one structure named Token and Path structure used to represent the token and path number. This row is indexed serially starting from 0. Later, these indexes are used for searching the token and path number from the Token and Path structure. Another structure named Compressed (BIQS) Bitmap Index Structure consists of all other remaining rows of the matrix (except the first row). In this structure each row is divided into blocks. In each block, the information of 16 (words and paths) bit cells is compressed. As each row represents the information of each XML document, there may be a different number of blocks for each document and each block consists of different values for different documents. The compression is also possible using 32 bit cells. The token and path structure is presented in Table VI. The compressed bitmap structure is presented in Table IX. The value of each 16 bit cells is recorded in decimal form. If there is not enough of the data to form a block with 16 bit cells, we fill these bits with zero. The compressed BIQS structure is given in Table VIII; the first column of the structure represents the document number and the other three columns represent the blocks. The value of each block is generated from the BIQS structure given in Table V. The values of the Block0 are 65472, 57, 0 and 39. These values represent the compressed information of data for different XML documents. This compression does not lose any information. We use the compressed BIQS structure for searching the data. Realistically, we are not converting the binary values (from Table V) into decimal values (into Table VIII) rather than we store the information for 16 words and paths into a single cell of a block.

122

IJCSNS International Journal of Computer Science and Network Security, VOL.10 No.6, June 2010

Fig. 1 Architecture of the Query Processing Methodology.

Brown

Asdddddddddddddddddddddddddddddddd

Similar Documents

Popular Essays