A hybrid model to improve relevance in document retrieval.
Subject:
Information services (Usage)
Information services (Analysis)
Information services industry (Usage)
Information services industry (Analysis)
Document processing (Analysis)
Document processing (Usage)
Authors:
Siddiqui, Tanveer J.
Tiwary, Umashanker
Pub Date:
03/01/2006
Publication:
Name: Journal of Digital Information Management Publisher: Digital Information Research Foundation Audience: Academic Format: Magazine/Journal Subject: Computers Copyright: COPYRIGHT 2006 Digital Information Research Foundation ISSN: 0972-7272
Issue:
Date: March, 2006 Source Volume: 4 Source Issue: 1
Topic:
Computer Subject: Document processing system; Information services industry
Product:
Product Code: 7399200 Info Services ex Database NAICS Code: 514199 All Other Information Services SIC Code: 7389 Business services, not elsewhere classified
Geographic:
Geographic Scope: India; South Korea Geographic Code: 9INDI India; 9SOUT South Korea

Accession Number:
186470796
Full Text:
ABSTRACT. In the information retrieval community, much work focuses on increasing efficiency by capturing statistical features. The other dominant approach improves relevance by capturing semantic and contextual information, which is invariably inefficient. The two approaches are generally assumed to be diametrically opposed. In this paper we combine them by proposing a hybrid information retrieval model that works in two stages. The first stage is a statistical model and the second stage is based on semantics. We first downsize the document collection for a given query using the vector model and then use a conceptual graph (CG) based representation to rank the documents. Our main objective is to investigate the use of conceptual graphs as a precision tool in the second stage. The use of CGs brings semantics into the ranking process, resulting in improved relevance. Three experiments have been conducted to demonstrate the feasibility and usefulness of our model. A test run was made on the CACM-3204 collection, where we observed a 34.8% increase in precision for a subset of CACM queries. The second experiment was performed on a test collection specifically designed to test the strength of our model in situations where the same terms are used in different contexts. Improved relevance was observed in this case as well. Applying the approach to results retrieved from LYCOS showed significant improvement. The proposed model is efficient, scalable and domain independent.

Categories and Subject Descriptors

H.3.1 [Content Analysis and Indexing]; H.3.3 [Information Search and Retrieval]; I.7 [Document and Text Processing]

General Terms

Conceptual Graphs, Semantic retrieval

Keywords: High precision information retrieval model, Intelligent retrieval, CG-based retrieval model, Two-stage retrieval model.

1. Introduction

Traditionally, information retrieval (IR) has been concerned with retrieving relevant information from a collection of documents based on a set of terms provided by the user. Users enter a query and the information retrieval system responds by returning a list of documents that seem relevant to the request. In a very large document collection, thousands of documents might contain the query terms, so it is important that information retrieval systems rank the retrieved documents according to relevance. Users are not interested in a huge amount of information, but in precise, accurate and relevant information. Relevance, however, cannot be judged simply on the basis of term occurrence. Most existing retrieval systems still rely on standard retrieval models (e.g. Boolean, standard vector and probabilistic) that treat both documents and queries as sets of unrelated terms. These statistical models have the advantage of being simple, scalable and computationally feasible, but they do not offer an accurate and complete representation: they ignore semantic and contextual information in the retrieval process (Watters 1989). It is difficult to identify useful documents simply on the basis of the words used by the author of the document, partly because of polysemy and synonymy in natural language. Polysemy designates the phenomenon of a lexeme with multiple meanings; this ambiguity makes it difficult for a computer to automatically determine the conceptual content of documents. Synonymy creates a problem when a document is indexed with one term and the user's query contains a different term with the same meaning. Another important problem with a purely term-based representation is that contextual information is lost when keywords are extracted from the text, and this context cannot be recovered. We attempt to handle these problems through the inclusion of a conceptual graph based representation.
Our concern in this paper is to use semantics, through conceptual graphs (CGs), to improve relevance in document retrieval. Martin and Eklund (2000) argued in favor of general knowledge representation languages for indexing web documents and suggested the use of concise and easily comprehensible CGs. They argued that the CG representation has advantages over metadata languages based on the Extensible Markup Language (XML). Rama and Srinivasan (1993) underlined the utility of conceptual roles for information retrieval, and their work provides strong evidence in support of using conceptual roles in retrieval. Conceptual graphs are very closely related to natural language and hence can be used for representing text. Such a representation holds the promise of extracting more information from documents by explicitly capturing logical relationships between terms, unlike word-statistical approaches that merely count nouns and noun phrases. This suggests the use of conceptual graphs in information retrieval, database interfaces and natural language processing, and with such a representation we expect to improve precision in information retrieval. For example, a statistical model will fail to distinguish between "library school" and "school library". A CG-based model keeps the contextual information between document and/or query terms intact, and hence is able to differentiate documents in such cases. Consider two documents containing fragments like:

1. ... genetic algorithm for information retrieval

2. ... genetic algorithm, neural network,.... Information retrieval ...

and a query "genetic algorithm for information retrieval".

Both documents contain the terms "genetic", "algorithm", "information" and "retrieval". If we represent these two document fragments as sets of terms, a traditional vector space model will return both as equally relevant, though the document containing fragment 1 is clearly more relevant to the query. It is the local context that helps us to distinguish the two fragments in this case. A model that considers the relationships existing between the words in the two documents will find fragment 1 more relevant than fragment 2. We have attempted to make the distinction possible in such cases through the use of conceptual graphs. Variants of the CG model have been used in information retrieval before. Marega and Pazienza (1994) emphasized the contextual role of words in the CoDHir system, which uses conceptual graphs, and concluded that this improves retrieval precision over traditional IR technologies. Their work consists of identifying the contextual roles of words and extending the vector model to consider compound descriptors (contextual role-word). DR-LINK (Liddy and Myaeng, 1993, 1994) uses conceptual graphs to extract and use semantic relations for information retrieval. CGKAT (Martin 1997) and WebKB (Martin and Eklund 2000) use CGs to index document elements (chapters, paragraphs, etc.), with a retrieval mechanism based on projection. RELIEF (Ounis, 1998) uses CGs for indexing images and uses properties of relations, such as transitivity, for querying.

Researchers have also proposed statistical approaches that include contextual information in the retrieval process, through multi-term phrases and proximity search algorithms applied on top of the first stage. Multi-word phrase matching is simpler, and existing methods for single-term matching can be applied to multi-word phrases. However, it fails to capture variations in syntactic structure unless phrases have been normalized; for example, "extraction of roots" might be transformed into "root extraction". Further, the use of multi-word phrases has not yielded significant improvement (Mittendorf et al, 2000). Proximity search may also fail. Consider the following two sentences:

Although it is clear from the context that the first sentence is more relevant to the query, this distinction is lost when the sentences are treated as bags of words. A proximity search algorithm will not help in this case. It is the relationships existing among words that make this distinction clear, not just their relative positions. Keyword matching methods ignore relations that are expressed in the query (Khoo, 1997). Because a CG aims to capture the relationships between concepts, it holds the promise of improving ranking in such situations. Further, the model has the potential to handle ambiguities implicitly.

Earlier attempts to include semantics were based on latent semantic indexing (Deerwester et al. 1990, Foltz 1990) and natural language processing (Strzalkowski 1995, Smeaton, 1995) techniques. The use of Latent Semantic Indexing (LSI) in information retrieval is based on the assumption that there is some underlying "hidden" semantic structure in the pattern of word usage across documents. LSI attempts to identify this hidden semantic structure through statistical techniques and then uses it for representing and retrieving information. However, it is computationally costly and requires re-computation if many new terms are added. Natural language processing has also been used in information retrieval for resolving anaphoric references, discourse-level processing, proper name identification, etc. The direct application of natural language processing suffers from a lack of robustness and efficiency (Strzalkowski, 1995). Hence it is often used together with other existing systems for indexing, retrieval, query expansion and query modification through relevance feedback.

Dick (1992) pointed out that conceptual retrieval is a precision tool, not an all-purpose device. Agreeing with Dick, we propose a two-stage information retrieval model to handle the problems associated with purely statistical models. Retrieval in the first stage is performed using the vector space model. The second stage ranks the documents retrieved in the first stage based on their CG representation. Because conceptual graphs support relevance judgments based on semantics, the ranking corresponds more closely to the users' intent as expressed through queries. In a traditional information retrieval system, relevance simply refers to the degree to which the query terms are present or absent in a document. Our model instead builds a notion of relevance based on an understanding of the semantics of terms: it ranks documents on the basis of their relational and conceptual similarity to the query. This form of relevance corresponds more closely to the users' mental model, as can be verified through improved observed precision or improved acceptance by users.

For CG-based ranking, the relationships existing among concepts in a sentence are captured, and the corresponding conceptual graphs are stored for each document during the preprocessing stage, alongside its vector representation. The model thus takes advantage of the simplicity of traditional retrieval models and the versatility of semantic approaches. Unlike LSI, the CG-based model has the advantage of being scalable. The vector space model is used first to quickly retrieve a set of potentially relevant documents, and relevance judgments are then made based on conceptual graphs. A quantified measure proposed by Montes-y-Gomez et al (2000) for matching conceptual graphs is used in this work instead of graph derivation; this measure combines conceptual and relational similarity. A relatively small number of relations is captured, and some simple heuristics (rules) are used during matching to allow an exact match between semantically similar but structurally dissimilar CG fragments. All these factors help keep the model computationally simple and add generality to it.

The rest of the paper is organized as follows. Section 2 introduces conceptual graphs. Section 3 discusses the retrieval model, including the vector space and CG-based representations used in the first and second stages of retrieval and the CG similarity measure. Section 4 presents the experimental investigations. Finally, section 5 presents the conclusions.

2. Conceptual Graphs

Conceptual graphs are a highly expressive form of logic and were originally designed for representing natural language semantics. They evolved out of conceptual structure theory as set down by Sowa (1984). A conceptual graph is a network of concept nodes and relation nodes. Concept nodes represent entities, attributes, states and events, and relation nodes show how the concepts are interconnected.

In the graph, boxes represent concept nodes and circles represent conceptual relations. Consider the following sentence:

"A cup is on the table" (1)

The conceptual graph of the above sentence is:

[ILLUSTRATION OMITTED]

Concept nodes have two fields, a type field and a referent field. In the box, the type field is shown on the left and the referent field on the right, separated by a colon. Concepts that do not identify a specific individual are called generic concepts, and their referent part is omitted. The existential quantifier ($\exists$) is assumed to apply to concepts with a blank referent field. For individual concepts the referent field is a specific entity such as a name.

A CG can be represented in three different forms. There is a graphic notation called the display form (DF), a more compact notation called the linear form (LF) and a concrete syntax called the conceptual graph interchange form (CGIF) which has a simplified syntax and a restricted character set designed for compact storage.

The LF representation of sentence (1) is:

[Cup] → (On) → [Table]

In CGIF, sentence (1) can be represented as:

[Cup: *x] [Table: *y] (On ?x ?y)

The symbols *x and *y are called defining labels. The matching symbols ?x and ?y are bound labels that indicate references to the same instance of a cup x or a table y. CGIF also permits concepts to be nested inside relation nodes, which helps to reduce the number of coreference labels:

(On [Cup] [Table])

The above representations can be translated to a statement of the following form in typed predicate calculus:

$(\exists x : \mathrm{Cup})(\exists y : \mathrm{Table})\ \mathrm{on}(x, y)$
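The notations above can be illustrated with a small data structure. The following is a minimal sketch, not the authors' implementation: the class names `Concept` and `Relation` and the rendering helpers are hypothetical, and only binary relations are handled. ASCII `->` stands in for the arrow of the linear form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    ctype: str          # type field, e.g. "Cup"
    referent: str = ""  # empty referent means a generic concept

    def label(self) -> str:
        # Generic concepts omit the referent; individual ones show "Type: referent".
        return self.ctype if not self.referent else f"{self.ctype}: {self.referent}"


@dataclass(frozen=True)
class Relation:
    name: str
    source: Concept
    target: Concept


def linear_form(rel: Relation) -> str:
    """Render a binary relation in the linear form (LF)."""
    return f"[{rel.source.label()}] -> ({rel.name}) -> [{rel.target.label()}]"


def cgif_nested(rel: Relation) -> str:
    """Render a binary relation in CGIF with concepts nested in the relation."""
    return f"({rel.name} [{rel.source.label()}] [{rel.target.label()}])"


# "A cup is on the table": two generic concepts linked by the relation On.
on = Relation("On", Concept("Cup"), Concept("Table"))
print(linear_form(on))
print(cgif_nested(on))
```

Keeping the graph as plain relation records makes it straightforward to serialise a sentence in whichever of the three forms a tool expects.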

Formally, a CG is defined over a support S. A support is a 3-tuple $S = (T_C, T_R, I)$, where $T_C$ and $T_R$ are two partially ordered finite sets of concept types and relation types respectively, and $I$ is a countable set of individual markers. $T_C$, $T_R$ and $I$ are pairwise disjoint. Each support also possesses the generic marker *, which does not belong to $I$.

The set $I \cup \{*\}$ is partially ordered, with * being the greatest element.

A truly conceptual representation of text is difficult to achieve automatically, so we propose a simplified form of the conceptual graph model that can be extracted easily from text without requiring much domain knowledge for the IR task. This model relies on a small set of basic relations ($T_R$), shown in figure 1, that can be identified from syntactic patterns. The terms appearing in a document are all treated as individual markers of the same type, "concept", which is the only type in the set of concept types. The support is thus ({"concept"}, $T_R$, $T$), where $T$ is the set of terms used to represent documents. The focus is mainly on capturing semantics through the relationships between terms, in line with the cognitive way of understanding. Most research on CGs has focused on graph theory and graph algorithms (Mugnier, 1995). However, it seems unreasonable to use non-deterministic graph derivation algorithms for IR applications, so the simplified conceptual graph matching function proposed by Gomez and others (Gomez et al. 2000; Gomez et al. 2001) is used in this work.

[FIGURE 1 OMITTED]

3. Retrieval Model

Figure 2 shows our retrieval model. Two models are used in conjunction: (i) the vector space model and (ii) the conceptual graph model.

[FIGURE 2 OMITTED]

Information retrieval systems retrieve documents for a query and present a ranked list of retrieved documents to users. This ranking is based on the system's understanding of the users' relevance model. The vector model is quite efficient at retrieving documents that satisfy users' needs. However, it ranks documents on the basis of statistical data. Because of the inherent ambiguity and vagueness of natural language text, the ranking offered by the vector model may not correspond to the users' relevance model. Including semantics in ranking bridges the gap between the system's and the users' relevance models. Hence we use a two-stage retrieval model.
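The two-stage idea can be sketched as a small pipeline. This is only an illustrative outline under stated assumptions: the function name `two_stage_retrieve`, the cut-off parameter `k` and the two pluggable scoring callables are hypothetical, and the actual scoring functions are the ones described in sections 3.1 and 3.2.

```python
from typing import Callable, Iterable, List

ScoreFn = Callable[[str, str], float]  # (query, doc_id) -> score


def two_stage_retrieve(query: str,
                       doc_ids: Iterable[str],
                       vector_score: ScoreFn,
                       cg_score: ScoreFn,
                       k: int = 100) -> List[str]:
    """Stage 1: keep the top-k documents by statistical score.
    Stage 2: re-rank only those candidates by CG-based score."""
    scored = [(vector_score(query, d), d) for d in doc_ids]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Documents with a zero first-stage score share no weighted term with the query.
    candidates = [d for score, d in scored if score > 0][:k]
    return sorted(candidates, key=lambda d: cg_score(query, d), reverse=True)


# Toy scores standing in for the real models (made-up numbers).
toy_vector = {"d1": 0.9, "d2": 0.8, "d3": 0.0}
toy_cg = {"d1": 0.2, "d2": 0.7, "d3": 1.0}
ranking = two_stage_retrieve("q", toy_vector,
                             lambda q, d: toy_vector[d],
                             lambda q, d: toy_cg[d])
print(ranking)
```

The expensive semantic comparison is applied only to the downsized candidate set, which is what keeps the second stage affordable.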

3.1 Vector Space model

Our base model is the vector space model. It is a mathematical model (Dominich, 2001) that forms the basis of many existing search engines.

3.1.1 Document Representation

In the vector model both the query and the document are represented as vectors. The elements of these vectors are weights assigned to the terms occurring in the documents, denoting the relative importance of each term. A document vector can be written as:

$d_j = (w_{1j}, w_{2j}, w_{3j}, \ldots, w_{ij}, \ldots, w_{mj})$

where $d_j$ is the vector representing document j. If there are n documents in the collection and a total of m index terms have been identified to represent them, then the document collection as a whole is represented by an m x n matrix, called the term-document matrix. The (i,j)th entry of this matrix is the weight of term i in document j. The steps involved in preparing the term-document matrix automatically are:

(i) Find individual words and their frequencies in each document

(ii) Use stop list to remove common words.

(iii) Reduce words to their stems. We have used Porter's stemmer for this purpose.

(iv) Assign weights to each term and prepare term-document matrix.
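Steps (i) to (iv) can be sketched as follows. This is a simplified illustration, not the authors' code: the stop list is a tiny made-up sample, `naive_stem` is a crude suffix stripper standing in for Porter's stemmer, and raw term frequencies are used as the weights in step (iv).

```python
import re
from collections import Counter

# Hypothetical miniature stop list; a real one would be much longer.
STOP_WORDS = {"the", "of", "a", "an", "for", "and", "in", "is", "on", "to"}


def naive_stem(word: str) -> str:
    """Crude stand-in for Porter's stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(text: str) -> Counter:
    # (i) find words, (ii) drop stop words, (iii) stem, then count.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(w) for w in words if w not in STOP_WORDS)


def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] is the weight of term i in doc j."""
    tfs = [term_frequencies(d) for d in docs]
    terms = sorted(set().union(*tfs))
    # (iv) raw frequency is used as the weight in this sketch.
    matrix = [[tf[t] for tf in tfs] for t in terms]
    return terms, matrix


docs = ["Genetic algorithms for information retrieval", "Retrieval of documents"]
terms, matrix = term_document_matrix(docs)
print(terms)
print(matrix)
```

The resulting matrix has one row per index term and one column per document, matching the m x n layout described above.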

3.1.2 Term weighting

In the information retrieval field, different term weights have been proposed over the years to represent importance of a term in the retrieval process. Most of the weighting schemes consider the following three factors to assign weights:

* Within-document frequency or term frequency

* Collection frequency or inverse document frequency

* Document length

According to Salton (Salton et al., 1996), a high-performance weighting scheme should assign more weight to a term that occurs frequently in a document and rarely outside it, as such terms are more useful for discriminating among documents. The tf-idf weighting scheme fulfils this criterion. The weight assigned to a term under this scheme is the product of tf (term frequency) and idf (inverse document frequency). The first component considers the term frequency within a document and the second captures the distribution of the term across documents. The inverse document frequency for term $t_i$ is calculated as:

$idf_i = \log(n / n_i)$

where n is the number of documents in the collection and $n_i$ is the number of documents containing term $t_i$. Many variations of the tf-idf measure have been reported (Salton and Buckley, 1988). A number of more recent weighting schemes have been proposed, notably BM25, which reportedly has the best performance in the TREC environment (Robertson et al. 1994, Robertson et al. 1995). Robertson and Walker motivated the best match (BM) algorithms by the probabilistic model and by some simple approximations to the 2-Poisson model (Robertson and Walker, 1994), but indicated that their result was guided as much by experimentation as by theory (Hiemstra and de Vries, 2000). For the CACM-3204 collection we have compared the performance of the BM25 and tf-idf weighting functions. The following approximation of the BM25 term weighting function has been used in this comparison.

Document term weights :

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Query term weights: $w_{ik} = tf_{ik}$

Query-document similarity: $\sum_i w_{ij}\, w_{ik}$

where $tf_{ij}$ = frequency of term $t_i$ in document $d_j$

n = number of documents in the collection

$n_i$ = number of documents in the collection containing the term $t_i$

dl = document length, and

avdl = average document length in the collection

The two coefficients k1 and b were empirically set to 2 and 0.1.
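Since the document-weight expression is not reproduced in this text, the sketch below uses the common textbook form of BM25 as an assumption; it may differ in detail from the approximation the authors used. It takes the quantities defined above (tf, n, $n_i$, dl, avdl) and the stated coefficients k1 = 2 and b = 0.1. The function names are hypothetical.

```python
import math


def bm25_weight(tf, n, n_i, dl, avdl, k1=2.0, b=0.1):
    """Textbook BM25 weight of one term in one document (assumed form,
    not necessarily the exact approximation used in the paper)."""
    if tf == 0:
        return 0.0
    idf = math.log((n - n_i + 0.5) / (n_i + 0.5))
    length_norm = (1 - b) + b * dl / avdl  # document length normalisation
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


def bm25_score(query_tf, doc_tf, df, n, dl, avdl):
    """Query-document similarity: sum over query terms of
    (query term frequency) x (document term weight)."""
    return sum(q_tf * bm25_weight(doc_tf.get(t, 0), n, df[t], dl, avdl)
               for t, q_tf in query_tf.items() if t in df)


# Toy numbers: 10 documents, the term occurs in 2 of them,
# and the document has exactly average length.
w = bm25_weight(tf=1, n=10, n_i=2, dl=50, avdl=50)
print(round(w, 4))
```

With a small b such as 0.1, document length has little influence on the weight, so the scheme behaves close to a saturating tf multiplied by idf.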

The form of the tf-idf weighting scheme used in the first stage is doc = "atn" and query = "ntc" (Savoy and Picard, 2001). In the rest of the discussion we call this the "modified tf-idf" scheme. It was selected after evaluating performance on the CACM collection with seven different combinations of the "atn", "ltc", "ntc", "nnn" and "atc" weighting schemes. We observed the maximum precision (36.1%) with doc = "atn" and query = "ntc". This scheme uses the following expressions to compute term weights:

Document-term weight ("atn"): $w_{ij} = \left(0.5 + 0.5 \times \frac{tf_{ij}}{\max tf_j}\right) \times \log(n / n_i)$

Query-term weight ("ntc"): $w_{ik} = \frac{w'_{ik}}{\sqrt{\sum_i (w'_{ik})^2}}$

where $w'_{ik} = tf_{ik} \times \log(n / n_i)$

Query-document similarity: $\sum_i w_{ij} \times w_{ik}$
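A minimal sketch of this weighting scheme follows. It assumes, following the usual SMART convention, that the "c" in "ntc" denotes cosine normalisation of the query vector; the function names and the toy frequencies are hypothetical, and every term is assumed to have a known document frequency.

```python
import math


def atn_weights(tf, df, n):
    """Document weights: augmented tf (0.5 + 0.5 * tf / max tf) times idf."""
    if not tf:
        return {}
    max_tf = max(tf.values())
    return {t: (0.5 + 0.5 * f / max_tf) * math.log(n / df[t])
            for t, f in tf.items()}


def ntc_weights(tf, df, n):
    """Query weights: tf * idf, cosine-normalised (assumed meaning of 'c')."""
    raw = {t: f * math.log(n / df[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}


def similarity(doc_w, query_w):
    """Inner product over the terms shared by document and query."""
    return sum(w * query_w[t] for t, w in doc_w.items() if t in query_w)


# Toy collection statistics: 4 documents, document frequencies per term.
n, df = 4, {"genetic": 1, "retrieval": 2}
doc_w = atn_weights({"genetic": 2, "retrieval": 1}, df, n)
query_w = ntc_weights({"genetic": 1}, df, n)
sim = similarity(doc_w, query_w)
print(round(sim, 4))
```

The augmented term frequency keeps every occurring term's tf component between 0.5 and 1, which limits the advantage of long documents with many repetitions.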

3.2 Conceptual Graph Model

A conceptual graph representation of each document is prepared for use in the second stage of retrieval.

3.2.1 Document Representation

The CG representation of a document is obtained by identifying relationships among the concepts occurring in each sentence. To construct the conceptual graph, syntactic patterns in the sentence are identified. The relations captured include relations between the constituent nouns of complex nominals; relations between a verb and the constituents surrounding it, mainly the AGNT (agent or subject) and PTNT (patient or object) relationships; and prepositional relations. A set of terms that can be substituted for each other is maintained and used in the search process.

The CGs are stored in the form of triplets as:

(rel c1 c2)

where 'rel' is the relation and 'c1' and 'c2' are the concepts participating in this relation.

Example: The tagged representation of the fragment "iterative procedure for solution of equation" will be:

iterative {JJ} procedure {NN} for {IN} solution {NN} of {IN} equation {NN}.

Its CG representation will be stored as:

(attr procedure iterative) (pmod procedure solution) (pmod solution equation)
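The mapping from tagged text to triplets in this example can be sketched with two pattern rules. This is a deliberately simplified assumption about how such rules might look, not the authors' handlers: an adjective followed by a noun yields an `attr` triplet, and a noun, preposition, noun sequence yields a `pmod` triplet. The function names are hypothetical.

```python
import re


def parse_tagged(text):
    """Split 'word {TAG}' sequences into (word, tag) pairs."""
    return re.findall(r"(\S+)\s*\{(\w+)\}", text)


def extract_triplets(tokens):
    """Apply two toy rules: JJ NN -> attr, NN IN NN -> pmod."""
    triplets = []
    for i, (word, tag) in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        nxt2 = tokens[i + 2] if i + 2 < len(tokens) else None
        if tag == "JJ" and nxt and nxt[1] == "NN":
            triplets.append(("attr", nxt[0], word))
        if tag == "NN" and nxt and nxt[1] == "IN" and nxt2 and nxt2[1] == "NN":
            triplets.append(("pmod", word, nxt2[0]))
    return triplets


tagged = ("iterative {JJ} procedure {NN} for {IN} "
          "solution {NN} of {IN} equation {NN}.")
for rel, c1, c2 in extract_triplets(parse_tagged(tagged)):
    print(f"({rel} {c1} {c2})")
```

Storing each relation as a flat triplet keeps the graph easy to index and compare, at the cost of ignoring anything the fixed patterns do not cover.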

Unlike Liddy and Myaeng (1994), we do not maintain an alternate representation of nominalized verbs for documents. This, together with the small number of relations we identify, keeps our model computationally simple and makes it practical to use on top of the first stage. To capture variations in grammatical structure we use simple heuristics during matching. These heuristics allow an exact match between structurally different but semantically similar relations. For instance, if the root form of a query concept involved in a 'mod' relation is a verb and it matches a concept appearing in a 'ptnt' relation in a sentence CG, this is considered an exact match if the second participating concept in both relations is the same. This heuristic gives an exact match for 'task {NN} scheduling {NN}' and 'schedule {VB} task {NN}'. The CGs for these two fragments are:

[schedule] → (type) → [task] and [schedule] → (ptnt) → [task]

The steps involved in obtaining the conceptual graph representation are as follows:

1. Tag the documents in the collection. We have used the TnT (1) tagger for this purpose.

2. Process the tagged representation of each document to make certain substitutions and modifications, such as removing modal words, wh-determiners, wh-pronouns and a few patterns like 'such that', 'so that' and 'as well as' (replaced with 'and').

3. Pass the tagged and processed document to a sentence extractor. Each extracted sentence of the tagged text is then passed through four modules, each devoted to identifying certain types of relationships between concepts. These modules are:

* Preposition and adverb handler: extracts prepositional and adverbial relations.

* Noun handler: extracts relations between noun sequences and cardinals.

* Adjective handler: extracts adjectival relations.

* Verb handler: extracts relations between a verb and its subject and object.

Similar steps are followed for query CG construction, except that alternate representations are prepared for the query using the set of replaceable terms. For example, if "article", "literature", "paper", "information" etc. are in the set of replaceable terms for "document", then additional CGs are added for each of these while processing the query "Role of computers in retrieval of scientific documents".
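The expansion of a query CG with replaceable terms might look like the sketch below. The triplets shown for the example query are illustrative guesses rather than the output of the authors' system, and the function name and the `REPLACEABLE` table are hypothetical.

```python
# Hypothetical table of terms that may substitute for one another.
REPLACEABLE = {"document": ["article", "literature", "paper", "information"]}


def expand_query_triplets(triplets, replaceable):
    """Add one alternate triplet for every replaceable concept occurrence."""
    expanded = list(triplets)
    for rel, c1, c2 in triplets:
        for alt in replaceable.get(c1, []):
            expanded.append((rel, alt, c2))
        for alt in replaceable.get(c2, []):
            expanded.append((rel, c1, alt))
    return expanded


# Assumed (not authoritative) triplets for part of the query
# "Role of computers in retrieval of scientific documents".
query_cg = [("pmod", "retrieval", "document"),
            ("attr", "document", "scientific")]
expanded = expand_query_triplets(query_cg, REPLACEABLE)
print(len(expanded))
```

Expanding only the query, and not every document, keeps the stored document graphs small while still letting synonymous wording match.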

[FIGURE 3 OMITTED]

Example: The conceptual graph representation of the sample document shown in figure 3 is explained below.

The tagged text is passed through the four modules discussed above. The linear form of the CG produced by module 1 is:

The CGs produced by module 3 are:

Module 4 adds the following CG:

[analysis] ← (ptnt) ← [begin] → (agnt) → [society]

3.2.2 Conceptual Graph Similarity Measure

In the second stage of the retrieval process, the query CG is compared with the conceptual graphs of those sentences that contain query concepts. We have not used graph derivation for matching query and sentence CGs; instead, the graphs are compared using the similarity measure proposed by Gomez et al (2000). Given two texts represented by the conceptual graphs $G_1$ and $G_2$ and their intersection graph $G_c$, the similarity $s$ between them is a combination of their conceptual similarity $s_c$ and relational similarity $s_r$:

$s = s_c \times (a + b \times s_r)$ (i)

The intersection graph $G_c$ consists of the following elements:

* All concept nodes that appear in both the conceptual graphs G1 and G2.

* All relation nodes that appear in both the initial conceptual graphs and relate the same concept nodes.

The conceptual similarity $s_c$ and relational similarity $s_r$ are obtained using the following expressions:

$s_c = \frac{2 \times n(G_c)}{n(G_1) + n(G_2)}$, $s_r = \frac{2 \times m(G_c)}{m_{G_c}(G_1) + m_{G_c}(G_2)}$

The values of the coefficients a and b depend on the structure of the document and are computed as:

$a = \frac{2 \times n(G_c)}{2 \times n(G_c) + m_{G_c}(G_1) + m_{G_c}(G_2)}$

and $b = 1 - a$

where $n(G)$ is the number of concept nodes in the graph $G$, $m(G_c)$ is the number of arcs in the graph $G_c$, and $m_{G_c}(G)$ is the number of arcs in the immediate neighborhood of the graph $G_c$ in the graph $G$.
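The measure can be sketched over graphs stored as (rel, c1, c2) triplets. Two simplifying assumptions are made here and are not taken from the source: each triplet is counted as one arc, and the immediate neighborhood of the intersection graph in a graph G is taken to be the triplets of G that touch at least one shared concept. The function names are hypothetical.

```python
def concepts(graph):
    """Set of concept nodes appearing in a set of (rel, c1, c2) triplets."""
    return {c for _, c1, c2 in graph for c in (c1, c2)}


def cg_similarity(g1, g2):
    """s = s_c * (a + b * s_r) for two triplet graphs (sketch)."""
    g1, g2 = set(g1), set(g2)
    c1, c2 = concepts(g1), concepts(g2)
    shared = c1 & c2                      # concept nodes of the intersection graph
    if not shared:
        return 0.0
    n_gc = len(shared)
    m_gc = len(g1 & g2)                   # relations linking the same concepts in both
    # Assumed reading of "immediate neighborhood": arcs touching a shared concept.
    m1 = sum(1 for _, x, y in g1 if x in shared or y in shared)
    m2 = sum(1 for _, x, y in g2 if x in shared or y in shared)
    s_c = 2 * n_gc / (len(c1) + len(c2))
    s_r = 2 * m_gc / (m1 + m2) if (m1 + m2) else 0.0
    a = 2 * n_gc / (2 * n_gc + m1 + m2)
    b = 1 - a
    return s_c * (a + b * s_r)


query = [("pmod", "algorithm", "retrieval")]
sentence = [("attr", "algorithm", "genetic")]
print(cg_similarity(query, sentence))
print(cg_similarity(query, query))
```

Because a depends on how many relations surround the shared concepts, a match on concepts alone still earns partial credit, while matching relations raises the score toward 1.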

The similarity values obtained using equation (i) are combined into a single score for a document as:

$S = \alpha \times \frac{\sum_i s_i}{c} + \beta \times \max_i(s_i)$

where c is the number of CGs in a document having one or more concepts matching the query CG.

The first term in this expression ensures that a document having a number of identical repetitions of query concepts will score more than one having a single identical and many different occurrences. The second term ensures that a document having a single exact match and many partially matching fragments has a fair chance of being ranked better. A high value of $\alpha$ improves the ranking of a document having many repeated occurrences of partially matched fragments. A high value of $\beta$ improves the ranking of a document having a single matching CG fragment. We have given equal weight to the average and maximum similarity values ($\alpha = \beta = 0.5$). These values have been set empirically.
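The combination of sentence-level similarities into one document score can be sketched as below. The function name is hypothetical, and as an assumption a sentence CG is counted towards c whenever its similarity is positive, which is equivalent to sharing at least one concept with the query only if the similarity measure is positive in exactly those cases.

```python
def document_score(similarities, alpha=0.5, beta=0.5):
    """Weighted sum of the average and the maximum sentence similarity."""
    matching = [s for s in similarities if s > 0]  # CGs sharing a concept with the query
    if not matching:
        return 0.0
    average = sum(matching) / len(matching)
    return alpha * average + beta * max(matching)


# One exact match, one partial match and one unrelated sentence.
score = document_score([1.0, 0.5, 0.0])
print(score)
```

Raising `alpha` favours documents whose matching sentences are consistently similar to the query, while raising `beta` favours documents with one very good match.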

[FIGURE 4 OMITTED]

4. Experimental Design

To test the effectiveness of our model and to compare its improvement over other models, we performed three experiments. A test run was made on the CACM-3204 collection. To test the capability of our retrieval model to capture context information and improve ranking, we performed a second experiment on a document collection specifically designed to contain documents in which similar terms are used in different contexts. To investigate a possible application of our approach on top of existing retrieval systems, we performed a third experiment using the top 10 results returned by the LYCOS search engine. The retrieval algorithm shown in fig. 4 explains the approach followed in these experiments.

4.1 Experiment 1

The document collection used in the first experiment is CACM-3204. This collection has 3204 documents and 64 queries with known relevance judgments. A list of 429 stop words is also provided. The objective of this experiment was to (i) compare the performance of the weighting function we use with one of the more recent weighting schemes, BM25, and (ii) assess the performance of our model on an existing document collection with known relevance judgments.

4.1.1 The Experiment

To compare the two weighting functions, the two BM25 coefficients were fixed: following Robertson and Jones (1997) we set k1 to 2, and we set b empirically to 0.1. Figure 5 shows the recall-precision curves for the CACM collection for the two weighting schemes. The 11-point average precision (averaged over 52 queries of the collection) has been used.
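The 11-point average precision for a single query can be sketched as follows, using the standard interpolation rule (precision at a recall level is the best precision observed at that recall or higher). In this sketch recall levels that are never reached contribute zero, which may differ from the extrapolation the authors apply. The function names and the toy ranking are hypothetical.

```python
def eleven_point_precision(ranking, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    hits, points = 0, []  # points holds (recall, precision) at each relevant hit
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    # Best precision at this recall level or beyond; 0.0 if the level is never reached.
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]


def eleven_point_average(ranking, relevant):
    return sum(eleven_point_precision(ranking, relevant)) / 11


# Toy ranking: relevant documents d1 and d4 appear at ranks 1 and 4.
avg = eleven_point_average(["d1", "d2", "d3", "d4"], {"d1", "d4"})
print(round(avg, 4))
```

Averaging this value over all queries with relevance judgments gives the kind of collection-level figure used to compare weighting schemes.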

To test the effectiveness of our two-stage retrieval model we compared the precision after the first and second stages of retrieval. We first retrieved documents using the vector model in stage 1 and then used the CG-based representation to re-rank the documents. We considered documents up to the highest recall point. The queries considered are queries 12, 19, 23, 30 and 63 of the CACM-3204 collection. Query 23 was selected because it had the lowest precision with the "modified tf-idf" model used in the first stage. The observed precision for query 30 was quite close to the average precision. The performance of the remaining three queries was better than average. The ranking of relevant documents after the first and second stages for query 12 ("portable operating system") and query 23 ("Distributed computing structures and algorithms") is shown in Table 1. The total number of relevant documents for queries 12 and 23 is 5 and 4 respectively. Table 2 lists the precision after the first and second stages and the percentage improvement for the queries considered in this paper. The average recall-precision graph for the subset of four queries is shown in figure 6. Query 23 has been excluded from it, as this single query might contribute disproportionately to the average performance.

[FIGURE 5 OMITTED]

4.1.2 Results and Discussions

As shown in figure 5, the average performance of the "modified tf-idf" scheme was better than that of the BM25 scheme. The observed mean average precision (averaged over 52 queries) for the "modified tf-idf" and BM25 schemes was 35.9% and 30.5% respectively. We have therefore used the "modified tf-idf" scheme in our experiments.

Table 1 shows that the ranking of the relevant documents for queries 12 and 23 of the CACM-3204 collection improved after the second stage. This improvement comes from the consideration of semantics through the conceptual graph in the second stage. For query 12, the vector model gives high rank to documents containing frequent occurrences of terms like 'operations' and 'operator', or including a discussion like 'portable random generator'. Similarly, for query 23 ("Distributed computing structures and algorithms") the vector model gives high ranks to documents pertaining to "probability distribution", "tree structure" etc. The CG-based model improves ranking by eliminating these documents, resulting in improved precision.

[FIGURE 6 OMITTED]

Figure 6 shows the 11-point standard recall-precision curve for query 19 and query 63 of the CACM-3204 collection after the first and second stages of retrieval. Extrapolated precision is used for all recall points that exceed the maximum observed recall value. As shown in Table 2, precision increased after the second stage for all five queries. The maximum increase in precision, 979.6%, was observed for query 23. During the first stage of retrieval the first relevant document for this query was obtained after retrieving 29 documents. With CG-based ranking this figure was reduced to 1. The minimum improvement, 3.6%, was observed for query 12. For query 19 the precision after the first and second stages was 48.6% and 59.5%, a 22.3% increase. The mean average precision for four queries after the first and second stages of retrieval was 46.4% and 62.6%, an improvement of 34.8%. This is a substantial improvement.
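An 11-point recall-precision curve of the kind described above can be sketched as follows. This is a minimal illustration that assumes the common interpolation rule (precision at recall level r is the highest precision observed at any recall of at least r) and assigns 0.0 to recall levels that are never reached; the paper's own treatment of recall points beyond the maximum observed recall may differ. The function name is hypothetical, and the example ranks are the second-stage ranks in the last four rows of Table 1.

```python
def eleven_point_precision(relevant_ranks, total_relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0.

    relevant_ranks: 1-based ranks at which relevant documents were retrieved.
    total_relevant: number of relevant documents for the query.
    """
    ranks = sorted(relevant_ranks)
    # (recall, precision) observed each time a relevant document is retrieved
    points = [((i + 1) / total_relevant, (i + 1) / rank)
              for i, rank in enumerate(ranks)]
    curve = []
    for level in range(11):
        r = level / 10
        # Interpolation: best precision at any recall >= r (0.0 if none observed)
        candidates = [p for rec, p in points if rec >= r - 1e-9]
        curve.append(max(candidates, default=0.0))
    return curve

# Four relevant documents retrieved at ranks 1, 2, 33 and 43
curve = eleven_point_precision([1, 2, 33, 43], total_relevant=4)
```

Averaging such curves over several queries gives an average recall-precision graph; a single query with extreme values can shift that average noticeably.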

4.2 Experiment 2

4.2.1 Document collection (CGDOC)

In the second experiment we used our own document collection, specifically designed to contain identical terms used in different areas, to further investigate the usefulness of our approach. This is a small collection consisting of abstracts of 70 documents and 10 queries. We do not, however, impose our own relevance judgments; instead we list the titles of the documents returned for the queries considered in the paper so that the improvements can be understood. The objective was to test the retrieval effectiveness of our model in an environment where differentiating among documents solely by statistical means may be difficult. The average length of a document in the collection is 71.9 words, excluding stop words. The smallest and largest documents in the collection contain 12 and 136 words respectively. Fig. 8 lists the titles of a few of the documents and sample queries from this collection.

4.2.2 The experiment

When a query is made, a query vector and a query CG are prepared for use in the first and second stages of the retrieval process respectively. In the first stage, a ranked list of documents is returned using the modified vector model and passed to the second stage. The last observed recall point is used as the cut-off.
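The first stage can be illustrated with a small vector-model ranker. This is only a sketch: it assumes plain tf-idf weights and cosine similarity rather than the "modified tf-idf" scheme used in the experiments, and it uses a fixed cut-off `k` in place of the recall-based cut-off. All function names and the toy collection are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a tf-idf vector (dict term -> weight) for each tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def first_stage(query, docs, k):
    """Return indices of the top-k documents by cosine similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q, vectors[i]), reverse=True)
    return ranked[:k]

# Toy collection (hypothetical, already tokenized and stop-word filtered)
docs = [
    ["genetic", "algorithm", "information", "retrieval"],
    ["genetic", "algorithm", "graph", "partitioning"],
    ["inverted", "file", "information", "retrieval"],
    ["image", "segmentation", "morphological", "operations"],
]
shortlist = first_stage(["genetic", "algorithm", "information", "retrieval"], docs, k=3)
```

Only the documents in `shortlist` would be passed on to the more expensive CG comparison, which is what keeps the second stage affordable.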

[FIGURE 7 OMITTED]

We compare the query CG with the sentence CGs of each document returned by the first stage. As a document may contain many sentences with query terms, a combination of the average and maximum similarity values is used for ranking. To better explain the effect of our CG-based approach, we compare the rankings obtained after the first and second stages of retrieval instead of imposing our own relevance judgments. The titles of the documents under consideration for query 1 are listed in fig. 8.

Query #1. Genetic algorithm for information retrieval

Query #2. Fuzzy information retrieval

Query #3. Information retrieval using conceptual graph

Doc #2 An Efficient Information Retrieval Method in WWW Using Genetic Algorithms

Doc #9 An extended inverted file approach for information retrieval

Doc #11 Genetic algorithm based redundancy resolution of robot manipulators

Doc #13 Genetic algorithms: A Survey

Doc #14 Genetic Algorithm and Graph Partitioning

Doc #21 An Image Retrieval Method Based on a Genetic Algorithm

Doc #22 Evolutionary Reinforcement of User Models in an Adaptive Search Engine

Doc #24 Genetic algorithm approach to image segmentation using morphological operations

Doc #25 Multiprocessor Document Allocation: A Genetic Algorithm Approach

Doc #27 Probabilistic and genetic algorithm for document retrieval

Doc #42 Intelligent Agents and its Applications in Information Retrieval

[FIGURE 8 OMITTED]
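The second-stage scoring described in this section, in which the query CG is compared with every sentence CG of a document and the average and maximum similarities are combined, might be sketched as follows. Several parts are assumptions made for illustration only: graphs are held as sets of (concept, relation, concept) triples, the similarity is a simple mean of concept overlap and relation overlap (not the graph matching function used in the paper), the combination is a weighted sum with an assumed weight `alpha`, and the relation name `for` is hypothetical.

```python
def dice(a, b):
    """Dice coefficient between two sets (0.0 when both are empty)."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def concepts(graph):
    """All concept nodes in a graph given as (concept, relation, concept) triples."""
    return {c for head, _, tail in graph for c in (head, tail)}

def cg_similarity(query_graph, sentence_graph):
    """Toy similarity: mean of concept overlap and relation (triple) overlap."""
    conceptual = dice(concepts(query_graph), concepts(sentence_graph))
    relational = dice(set(query_graph), set(sentence_graph))
    return (conceptual + relational) / 2

def document_score(query_graph, sentence_graphs, alpha=0.5):
    """Combine maximum and average sentence similarity; alpha is an assumed weight."""
    sims = [cg_similarity(query_graph, g) for g in sentence_graphs]
    if not sims:
        return 0.0
    return alpha * max(sims) + (1 - alpha) * sum(sims) / len(sims)

query = {("algorithm", "attr", "genetic"), ("algorithm", "for", "retrieval"),
         ("retrieval", "attr", "information")}
# Document A: "genetic" and "algorithm" occur together and relate to retrieval
doc_a = [{("algorithm", "attr", "genetic"), ("algorithm", "for", "retrieval")}]
# Document B: "algorithm" is frequent, but mostly in a different context
doc_b = [{("algorithm", "attr", "genetic")},
         {("algorithm", "attr", "annealing")},
         {("algorithm", "attr", "annealing"), ("annealing", "attr", "simulated")}]
score_a = document_score(query, doc_a)
score_b = document_score(query, doc_b)
```

In this toy case document B receives the lower score because its off-context sentences pull the average similarity down, even though the term "algorithm" occurs more often in it.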

4.2.3 Results and Discussions

The effect of applying CGs is that relevant documents ranked low by the vector model are shifted up in the ranking. The ranking of the top 10 documents after the first and second stages of retrieval for the query "genetic algorithm for information retrieval" is shown in fig. 9. It can easily be verified from the titles that the documents ranked 3 and 5 by the first stage (doc #13 and doc #42) are less useful for the query under consideration than doc #22 and doc #27, which are ranked 7 and 9 respectively. Application of CGs improves the ranking by including doc #22 and doc #27 in the top five documents, resulting in improved precision. The vector model fails to identify the similarity between "document retrieval" and "information retrieval" and gives a low rank to doc #27. Our CG-based ranking is able to detect this similarity and improves the ranking of this document. Doc #14 is ranked 6 after stage 1 but is excluded from the list of top 10 documents after stage 2. Its stage 1 rank is due to the relatively high frequency of the terms "genetic" and "algorithm", which do not always appear together. The occurrence of "genetic" and "algorithm" in close context results in a high CG similarity value compared to the case where these terms appear in a different context, such as "simulated annealing algorithm". Such occurrences of "algorithm" contribute to a low average similarity value. As the second stage of ranking combines the average and maximum CG similarity values, this document is ranked low. In contrast, doc #21 has more relation matches (as it discusses the use of a genetic algorithm in retrieval), resulting in a high CG similarity value, even though the frequency of these terms is not as high as in doc #14. Similar arguments hold for doc #24 and doc #25. Fig. 10 shows the ranking after the first and second stages of retrieval for queries 2 and 3 listed in fig. 8. The titles of the documents being considered can be found in Appendix II. Similar improvements in ranking have been achieved for other queries as well.

4.3 Experiment 3

To investigate a possible application of our approach in a web search environment, we conducted a small experiment. We took the first ten results returned by the LYCOS search engine for query 1 of our collection (see Appendix I), constructed conceptual graphs for the sentence fragments returned by the search engine that involve query terms, and compared them with the query graph. LYCOS was used because it is based on simple keyword search. Unlike other search engines, it does not consider the position of keywords in the document, the number of inbound hyperlinks etc. in the ranking process. This corresponds more closely to the retrieval model used in the first stage of retrieval. The result of this comparison is then used for ranking. The result is shown in figure 11. We achieved significant improvement in this case also, even though we used only a small piece of context information (highlighted in Appendix I), as we do not have access to the actual document representation.
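The re-ranking step of this experiment can be illustrated with the CG scores listed in Appendix I. The sketch below only sorts the ten results by those scores; the variable names are illustrative, and breaking ties by the original LYCOS order is an assumption, although it is consistent with the ranks listed in the appendix.

```python
# CG similarity score of each of the ten LYCOS results, keyed by original rank
# (values as listed in Appendix I)
cg_scores = {1: 0.639, 2: 0.531, 3: 0.477, 4: 0.853, 5: 0.522,
             6: 0.752, 7: 0.560, 8: 0.560, 9: 0.460, 10: 0.494}

# Sort by descending score; equal scores keep the original order (assumed tie-break)
reranked = sorted(cg_scores, key=lambda rank: (-cg_scores[rank], rank))

# Map each original LYCOS rank to its rank after CG-based re-ranking
new_rank = {orig: pos for pos, orig in enumerate(reranked, start=1)}
```

Here the fourth LYCOS result moves to the top of the list and the second result drops to sixth place.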

5. Conclusions

We presented a hybrid two-stage information retrieval model. The model first retrieves a set of documents potentially relevant to the query using a modified vector model and then ranks them based on a conceptual understanding of terms. We observed an increase of 34.8% in precision for a subset of CACM queries. The experiment performed on our own document collection also showed significant improvement in the ranking of retrieved documents. This is because of the semantic considerations introduced in the second stage of retrieval through conceptual graphs. A specific query gives more useful relationships among query terms, which helps in improving the ranking.

The syntactic model proposed by us helps in quickly shortlisting the relevant documents from a large document set without hampering efficiency. Application of the CG-based model in the second stage brings conceptual understanding to relevance judgment. Relevance based on semantics corresponds more closely to the user's mental model, resulting in improved user acceptance.

In order to make our CG model more efficient we have used a new scoring function to get the combined document score. The form of graph matching function used by us keeps the computational cost low. The CG representation used by us is easily scalable. Further we have also proposed a tabular representation of CGs to reduce matching time.

[FIGURE 10 OMITTED]

[FIGURE 11 OMITTED]

The proposed model takes advantage of both the efficiency of the syntactic approach and the accuracy of the semantic approach. The improvement observed when the approach is applied to the results returned by LYCOS demonstrates the potential of our model. This suggests that the CG-based model can be used as a precision tool with existing search and retrieval systems.

Appendix I

First ten results returned by LYCOS for the query "Genetic algorithm for information retrieval"

1. Machine Learning for Information Retrieval: Neural Networks. Symbolic ...

... Results of genetic algorithms testing Machine Learning for Information Retrieval : Neural ... algorithms in information retrieval. In [48 ... presented a genetic algorithms based approach for...

More results from: ai.bpa.arizona.edu/papers/mlir93/mlir93.html February 3, 2004--163 KB *3 (0.639)

2. The Art Site on the World Wide Web: McLaughlin

... electronic palettes, generated from genetic algorithms, or captured by digital camera ... fractal animations, genetic algorithm animations, morphs, ray trace ... retrieving files from the site)....

cwis.usc.edu/dept/annenberg/artfinal.html March 11, 2004--90 KB 6 (0.531)

3. text mining and web-based information retrieval reference

... Web Mining, Information Retrieval and ... Fast Look-up Algorithm for Structural ... Web Mining: Information and Pattern ... adaptive genetic algorithm /programming ... statistics to information ...

filebox.vt.edu/users/wfan/text_mining.html February 2, 2004--23 KB 9 (0.477)

4. Effective Information Retrieval Using Genetic Algorithms Based ...

... Effective Information Retrieval Using Genetic Algorithms Based Matching ... relevant information from these ... improving retrieval performance ... and recall) for retrieval ... the area of ...

www.computer.org/proceedings/hicss/0493/04932/0493201 ... August 22, 2002--10 KB 1 (0.853)

5. Machine Learning/Genetic Algorithm group--Dipartimento di ...

... representation for ML, such as exploitation ... to also include Genetic Algorithms and Neural Networks ... Mobile Agents for automatic information retrieval , network management ... Learning Agents ...

www.di.unito.it/~mluser/ February 3, 2004--8 KB 7 (0.522)

6. Personal Information Intake Filtering

... Recent work on genetic algorithms for information retrieval [Gordon] focused ... program uses genetic algorithms for evolving symmetrical ... B Cousins. "Information Retrieval from Hypertext ...

www.baclace.net/Resources/ifilter1.html July 27, 2001--42 KB 2 (0.752)

7. Free C/C++ Sources for Numerical Computation

... Description : many genetic algorithm optimisation libraries ... Description : Objects for doing genetic algorithm optimization Name ... analytic statistics for the TREC IR trials Comments :...

cliodhna.cop.uop.edu/~hetrick/c-sources.html July 24, 2002--97 KB 4 (0.560)

8. Home Page for Haym Hirsh

... Directors: Institute for the Study of ... machine learning, information retrieval , data mining ... engineering design, genetic algorithms, knowledge representation ... Publications Information ...

www.cs.rutgers.edu/~hirsh/ March 8, 2004--18 KB 5 (0.560)

9. Artificial Life

... Related Topics Genetic Algorithms Agent technologies ... computation such as genetic algorithms (GAs), evolutionary ... stands for "Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for ...

www.insead.fr/CALT/Encyclopedia/ComputerSciences/Al/a ... January 12, 2004--56 KB 10 (0.460)

10. Information and Communication R&D Center

Information and Communication ... Fuzzy Theory, Genetic Algorithm, Artificial Life ... JAPANESE) Information Retrieval (IR) (JAPANESE ... Service Other Information Abstract ... TOWN GUIDE for ...

www.ricoh.co.jp/rdc/ic/index_e.html September 30, 1999--3 KB 8 (0.494)

Appendix II

Doc #1 An Efficient Storage and Retrieval System for Conceptual Graphs

Doc #4 The RELIEF Retrieval System (Image retrieval system based on Conceptual graphs)

Doc #8 Compiling Conceptual Graphs

Doc #9 An extended inverted file approach for information retrieval

Doc #16 PRIME-GC A medical information retrieval prototype on the Web

Doc #19 Measuring Effectiveness in Fuzzy Information Retrieval

Doc #28 Application of fuzzy set theory to extend Boolean Information retrieval

Doc #29 Fuzzy functional dependency and its application to approximate data querying

Doc #31 Improving the Performance of Existing Information Retrieval Systems Using a Software Agent

Doc #34 Flexible Comparison of Conceptual Graphs

Doc #35 CG-DESIRE: Formal Specification Using Conceptual Graphs

Doc #36 Comparison of Conceptual graphs

Doc #37 Fuzzy conceptual graphs for matching images of natural scenes

Doc #39 Text mining at detail level using conceptual graphs

Doc #43 Knowledge Representation Using Fuzzy Petri Nets--Revisited

Doc #49 fuzzy integral as a basis for the interpretation of flexible queries involving monotonic aggregates

Doc #50 Fuzzy Content-Based Retrieval In Image Databases

Doc #56 An information retrieval model based on vector space method by supervised learning

* Rank (Score) obtained through CG

Acknowledgement

We would like to thank Thorsten Brants for providing an opportunity to use Trigrams'n'Tags (TnT) part of speech tagger.

Received 12 May 2005; Revised and accepted 27 Sep. 2005


(1) Use of TnT was possible through an evaluation license agreement. Tagset used is Penn TreeBank.

Tanveer J. Siddiqui

Department of Electronics & Communication,

University of Allahabad, Allahabad, India

tjs@jkinstitute.org

Umashanker Tiwary

Indian Institute of Information Technology, Allahabad, India

ust@iiita.ac.in

Tanveer J. Siddiqui received her M.Sc. degree in Computer Science from the University of Allahabad. She submitted her thesis in December 2005. Her thesis work was in the area of information retrieval. She served as a scientist in a DRDO project from April 1996 to December 1998. She has been a counselor at Indira Gandhi National Open University, India and guest faculty at the Indian Institute of Information Technology, Allahabad. She has been a faculty member at the University of Allahabad since January 2000. Her research interests include information retrieval, text and data mining, document summarization and intelligent agent technology.

Uma Shanker Tiwary received his B.Tech. and Ph.D. degrees in Electronics Engineering from Banaras Hindu University in 1983 and 1992. He has 21 years of experience in research and teaching. He served as lecturer at M.M.M. Engg. College, Gorakhpur, as lecturer and reader at the University of Allahabad and as a visiting scientist at IIIT, Kanpur. Currently he is associate professor at IIIT, Allahabad and visiting IT Professor, GIST, South Korea. He sponsored a number of research projects and organized many international conferences. His research interests include Medical Image Processing, Image Processing, Computer Vision, Soft Computing & Fuzzy Logic, and Language and Speech Technology.
The Allahabad bank is situated near river Ganga.
   and I stayed in a hotel situated near the bank of river Ganga.
   and a query "Bank near Ganga"


The tagged representation of the document is
%% document I
%% abstract

the            DT
american       JJ
chemical       JJ
society        NN
has            VBZ
begun          VBN
an             DT
analysis       NN
of             IN
the            DT
role           NN
of             IN
the            DT
computer       NN
in             IN
the            DT
reproduction   NN
,              ,
distribution   NN
,              ,
and            CC
retrieval      NN
of             IN
scientific     JJ
information    NN


[Analysis] -> (PMOD_OF) -> [role: #] -> (PMOD_OF) -> [computer: #] -
   -> (PMOD_IN) -> [reproduction] -> (PMOD_OF) -> [information: #]
   -> (PMOD_IN) -> [distribution] -> (PMOD_OF) -> [information: #]
   -> (PMOD_IN) -> [retrieval] -> (PMOD_OF) -> [information: #]


[society: #] -
      -> (attr) -> [american]
      -> (attr) -> [chemical]
   [information] -> (attr) -> [scientific]
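One simple way to hold the graph above in a program is as a list of (concept, relation, concept) triples. The sketch below is only an illustration and is not necessarily the tabular representation mentioned in the conclusions; attaching the three PMOD_IN branches to [computer] follows the linear notation as printed and is an assumption, and the variable names are hypothetical.

```python
from collections import defaultdict

# Each arc of the conceptual graph as (source concept, relation, target concept)
triples = [
    ("analysis", "PMOD_OF", "role"),
    ("role", "PMOD_OF", "computer"),
    ("computer", "PMOD_IN", "reproduction"),
    ("computer", "PMOD_IN", "distribution"),
    ("computer", "PMOD_IN", "retrieval"),
    ("reproduction", "PMOD_OF", "information"),
    ("distribution", "PMOD_OF", "information"),
    ("retrieval", "PMOD_OF", "information"),
    ("society", "attr", "american"),
    ("society", "attr", "chemical"),
    ("information", "attr", "scientific"),
]

# Index arcs by relation so that matching only compares arcs of the same type
by_relation = defaultdict(list)
for source, relation, target in triples:
    by_relation[relation].append((source, target))
```

Grouping arcs by relation type means a query arc is compared only with document arcs carrying the same relation, which limits the number of comparisons during matching.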


Table 1. Ranking of relevant documents for CACM query 12 and 23

Query #     Relevant          Rank               Rank
            Document #        First Stage        Second Stage

12          1523                391                 60
12          2080                 34                 30
12          2246                  4                  3
12          2629                 26                 25
12          3127                  1                  1
23          2578                 64                 43
23          2849                 29                  2
23          3137                 36                 33
23          3148                 47                  1

Table 2. Percentage increase in precision

Query #        Precision (%)      Precision (%)      Increase in
                  Stage 1            Stage 2        Precision (%)

12                  42.7               44.3               3.7
19                  48.6               59.5              22.4
23                   5.4               57.8             970.4
30                  35.1               51.5              46.7
63                  44.8               74.5              66.3