[0001] 1. Field of Invention
[0002] The present invention relates generally to the field of determining communities of hyperlinked documents. More specifically, the present invention is related to determining communities of hyperlinked documents based on the relationships of the links between the documents and the structure of the documents.
[0003] 2. Discussion of Prior Art
[0004] As hyperlinked environments grow in size and complexity, it becomes increasingly difficult to locate documents relevant to a given query. One such environment which is growing at a phenomenal rate is the world wide web (WWW). As millions of on-line participants continually create hyperlinked content, there are no capabilities to impose a global structure and consequently the capability to efficiently find the most relevant documents for a broad-topic search through traditional search methods, e.g. text based queries, becomes a much more difficult challenge to overcome. For example, a user searching for information about Harvard University on the WWW utilizing a text search would receive over 80,000 pages from the search. The number of returned pages is an unmanageable number for the user and determining which ones are the most relevant would consume a considerable amount of the user's time. What the user requires is a way to locate the most central, or authoritative, pages on the topic “Harvard.”
[0005] An algorithm for locating authoritative documents within a hyperlinked environment has been proposed by Jon Kleinberg in a recent paper, incorporated herein by reference, “Authoritative Sources in a Hyperlinked Environment,” Proc. ACM-SIAM Symposium on Discrete Algorithms, May 1997 (also appears as IBM Research Report RJ 10076, May 1997 and is additionally available at http://www.cs.cornell.edu/home/kleinber/ on the world wide web). Kleinberg's algorithm is based on two premises. First, the implicit annotation provided by human creators of hyperlinks contains sufficient information to obtain a notion of authority. Secondly, sufficiently broad topics contain communities of hyperlinked pages. These communities comprise two sets of inter-related pages. One set comprises authorities (i.e. highly referenced) on the topic. The second set comprises pages which “point” to many of the authorities. This second set is referred to as hubs because the elements of the set represent strong central points to confer authority on the relevant pages. The two sets of pages exhibit a mutually reinforcing relationship, that is, a good hub points to many authorities while good authorities are pointed to by many hubs. This notion of hubs and authorities is utilized to determine the pages which are the most relevant on a broad topic by using an iterative algorithm to break the apparent circularity of hubs and authorities.
[0006] Increasingly, web pages are being viewed with devices other than regular desktops and standard browsers. Cell phones, palm-top computers with limited screen space and speech-based devices are a few of the alternative devices becoming prevalent. In addition, there are moves to ensure web page content is available for users with limited abilities (blind, dyslexic, illiterate, etc.). The World Wide Web Consortium Accessibility Initiative provides the documents “Web Content Accessibility Guidelines 1.0” and “Techniques for Web Content Accessibility Guidelines 1.0,” both of which are incorporated herein by reference, which describe how to format pages in structured forms so that clients on the alternative devices can process the pages. The current recommendation and notes, respectively, are available from the W
[0007] Therefore, there is a need to return the most authoritative pages which provide the most use, i.e., poorly formed pages need to be penalized because the pages may not be able to be displayed (visual, auditory, tactile, etc.) in a manner appropriate for the limited abilities of the browser or the user.
[0008] A method of determining the documents of a hyperlinked environment which are authorities on a given topic which most closely meet guidelines related to document structure is presented. A base set of documents which is relatively small, containing documents relevant to a given topic, and containing many of the strongest authorities on the topic is obtained. Each document within the set is evaluated and given a structure score which reflects how well-formed the document is. Each document within the set also has corresponding hub and authority weights which are updated and maintained to determine the strongest authorities. The initial hub and authority weights of each document are set to the corresponding structure score of the document. An iterative algorithm is then utilized to determine the strongest authorities. For each round of the algorithm, the authority weights of a document are updated by summing the hub weights of each document pointing to the document, while the hub weights of a document are updated by summing the authority weights of each document which is pointed to by the document whose hub weight is being determined. After a series of iterations, the documents having the highest authority weights are identified as the strongest authorities on the query topic.
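The update rules in the preceding paragraph can be written compactly as follows. The symbols are illustrative assumptions, not notation taken from the specification: s(p) is the structure score of document p, a_k(p) and h_k(p) are its authority and hub weights after round k, and q → p means that document q links to document p. The paragraph does not say whether the hub update uses the authority weights of the current or the previous round; the form below uses the current round, which is consistent with the example charts given later.

```latex
\begin{align*}
a_0(p) &= h_0(p) = s(p) \\
a_k(p) &= \sum_{q \,:\, q \to p} h_{k-1}(q) \\
h_k(p) &= \sum_{q \,:\, p \to q} a_k(q)
\end{align*}
```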
[0009] In a further embodiment, the base set of documents is obtained by obtaining a root set of documents and determining the base set from the root set. A root set is first obtained by taking a given number of the highest ranked documents returned from a text-based searching and ranking system. The base set is generated from the root set by including documents which are linked to documents within the root set.
[0010] In a further embodiment, the number of documents included within the base set is limited so as to maintain a relatively small base set. All documents outside of the root set which are pointed to by documents within the root set are included. However, only a limited number of documents outside of the root set which point to documents within the root set are included.
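The expansion from root set to base set described above might be sketched as follows. The function name, the dictionary-based link maps, and the choice to apply the limit per root document and to keep the alphabetically first pointing documents are all assumptions made for illustration; the text states only that the number of pointing documents is limited.

```python
from typing import Dict, Iterable, Set


def build_base_set(root: Iterable[str],
                   out_links: Dict[str, Set[str]],
                   in_links: Dict[str, Set[str]],
                   max_in_per_page: int) -> Set[str]:
    """Expand a root set into a base set (illustrative sketch only).

    out_links[p] holds the documents p points to; in_links[p] holds the
    documents pointing to p. Both maps are hypothetical inputs.
    """
    root_set = set(root)
    base = set(root_set)
    for page in root_set:
        # Include every document that a root document points to.
        base |= out_links.get(page, set())
        # Include only a limited number of outside documents pointing to it.
        # Which ones are kept is an assumption: here, the first in sorted order.
        pointing = sorted(in_links.get(page, set()) - root_set)
        base.update(pointing[:max_in_per_page])
    return base


# Hypothetical link data for a two-document root set.
example_out = {"r1": {"a", "b"}, "r2": {"b"}}
example_in = {"r1": {"x", "y", "z"}, "r2": {"r1", "w"}}
example_base = build_base_set(["r1", "r2"], example_out, example_in,
                              max_in_per_page=2)
```

With the hypothetical data above, document "z" is dropped because "r1" already has two pointing documents in the base set.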
[0011] In a further embodiment, the structure score is determined by evaluating each document within the set according to a set of parameters. For each parameter, the document is assigned a parameter score. These parameter scores are then weighted and summed to obtain the document's structure score.
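A minimal sketch of the weighted sum described above is given below. The parameter names, the per-parameter scores, and the weights are invented for illustration; the specification does not fix any particular values.

```python
from typing import Mapping


def structure_score(param_scores: Mapping[str, float],
                    param_weights: Mapping[str, float]) -> float:
    """Weighted sum of the per-parameter scores of one document."""
    return sum(param_weights[name] * score
               for name, score in param_scores.items())


# Hypothetical weights and scores (not taken from the specification).
example_weights = {"well_formed_xml": 0.5, "script_ratio": 0.2, "alt_text": 0.3}
example_page = {"well_formed_xml": 1.0, "script_ratio": 0.5, "alt_text": 0.0}
example_score = structure_score(example_page, example_weights)
```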
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
[0018] While this invention is illustrated and described in a preferred embodiment, many variations of the method may be implemented still within the spirit of the present invention. There is depicted in the drawings, and will herein be described in detail, a preferred embodiment of the invention, with the understanding that the present disclosure is to be considered as an exemplification of the principles of the invention and the associated functional specifications of the materials for its construction and is not intended to limit the invention to the embodiment illustrated. Those skilled in the art will envision many other possible variations within the scope of the present invention.
[0019] A brief digression into the preferred algorithm in which structure evaluation and penalization is implemented helps to explain the motivation for and advantages of structure based determination of authoritative pages and additionally provides a framework for implementing the present invention. Specifically, a more detailed description of Kleinberg's algorithm will be given.
[0020] In order to implement the algorithm to determine authoritative pages based upon the link structure of the hyperlinked media, a subgraph of the WWW on which the algorithm will operate must be determined. Ideally, this subgraph should be focused to pages which have the following properties:
[0021] relatively small
[0022] many relevant pages
[0023] contains most, or many, of the strongest authoritative pages
[0024]
[0025] A set which does satisfy the third criterion can, however, be generated from root set
[0026] As briefly described above, Kleinberg utilizes the concept of authorities and hubs to determine the strongest authorities within base set
[0027] In order to determine the strongest authorities utilizing the hub and authority model, an iterative algorithm is used to break the circularity between hubs and authorities.
[0028] As shown by Kleinberg, the above procedure converges as the iterations increase arbitrarily. Therefore, by choosing a sufficiently high number of iterations N, the weights of the c largest coordinates of each vector of weights, i.e. the c highest hub and authority weights in the entire base set
[0029] For clarity,
[0030] Unlike the present invention, the algorithm of Kleinberg provides equal initial weighting to all pages and does not weight the authority and hub weights during the iterations. Kleinberg therefore does not favor or disfavor any page and is limited to determining the strongest authorities regardless of any other criteria concerning the page. However, as described above, particular advantages are obtained by penalizing pages which are not well formed. Briefly, in order to penalize pages, accessibility scores (structure scores) are determined for each page based upon how well formed the page is, and these scores are utilized as the initial hub and authority scores. These scores are then utilized to weight the authority and hub weights during the iterations performed to update the authority and hub scores. Therefore the algorithm is biased to favor not only the strongest authorities, but the strongest authorities which are the most well-formed.
[0031] Generally, to compute the structure scores, a set of parameters P is determined which will contribute to a decision of how well formed the pages are. Some exemplary parameters include the following:
[0032] Does the page form a well-formed XML document? If not, what is the tree-distance of the page from being a well formed XML document so that an XML parser can recover meaningfully from the poorly formed page?
[0033] What is the percentage of scripts in the page and what are they used for?
[0034] Are there meaningful ALT tags for items such as link structures, images, and video?
[0035] Of course, all of the instances in the current guidelines provided by the World Wide Web Consortium can be utilized as parameters, as well as future instances added, in addition to other parameters which may become an issue based upon the type of device, particular ability of a pre-processing system to process an HTML document for display (visual, auditory, tactile, etc.), or particular limited ability of a user of any such system.
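Some of the exemplary parameters above can be measured with standard-library tools, as in the rough sketch below. The function names and the specific measures are assumptions: well-formedness is reduced to a yes/no check (the tree-distance measure is not attempted), scripts are measured as a share of characters, and ALT coverage is checked for images only.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed_xml(text: str) -> bool:
    """Return True if the page parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class _PageStats(HTMLParser):
    """Counts images, images with non-empty alt text, and script characters."""

    def __init__(self):
        super().__init__()
        self.images = 0
        self.images_with_alt = 0
        self.script_chars = 0
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        elif tag == "img":
            self.images += 1
            alt = dict(attrs).get("alt")
            if alt and alt.strip():
                self.images_with_alt += 1

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_chars += len(data)


def measure_page(html: str) -> dict:
    """Return raw measurements that parameter scores could be derived from."""
    stats = _PageStats()
    stats.feed(html)
    stats.close()
    return {
        "well_formed_xml": is_well_formed_xml(html),
        # Share of the page's characters that sit inside <script> elements.
        "script_fraction": stats.script_chars / len(html) if html else 0.0,
        # Share of images carrying non-empty alt text (1.0 if no images).
        "alt_coverage": (stats.images_with_alt / stats.images
                         if stats.images else 1.0),
    }


# Hypothetical sample pages.
sample_good = '<html><body><img src="a.png" alt="Logo"/><img src="b.png"/></body></html>'
sample_bad = '<html><body><p>unclosed</body></html>'
sample_script = '<p>hi</p><script>abcd</script>'
```

How these raw measurements are mapped to parameter scores is left open here, as it is in the text.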
[0036] For each parameter of P, p
[0037]
[0038] It should be noted that, while the algorithm has been described as initially setting the hub and authority weights equal to the structure score and weighting the hub and authority weights by the structure score during each iteration, one of skill in the art would understand that initializing the weights to one and multiplying the initial weights by the structure score is equivalent to setting the initial weights to the structure score. Therefore, an equivalent algorithm is able to be constructed in which the initial hub and authority weights are set to one and the hub and authority weights are weighted by the structure score prior to performing the update.
[0039] The following charts show the normalized authority and hub scores at each iteration, with and without structure weighting, for the simple

WITHOUT STRUCTURE WEIGHTING

BEFORE NORMALIZATION:
Node   a0     h0     a1     h1     a2     h2
600    1.00   1.00   0.00   1.22   0.00   1.22
602    1.00   1.00   0.00   1.22   0.00   1.22
604    1.00   1.00   0.45   0.00   0.71   0.00
606    1.00   1.00   0.89   0.00   1.41   0.00
608    1.00   1.00   0.45   0.00   0.71   0.00

NORMALIZED:
Node   a0     h0     a1     h1     a2     h2
600    0.45   0.45   0.00   0.71   0.00   0.71
602    0.45   0.45   0.00   0.71   0.00   0.71
604    0.45   0.45   0.41   0.00   0.41   0.00
606    0.45   0.45   0.82   0.00   0.82   0.00
608    0.45   0.45   0.41   0.00   0.41   0.00
[0040]
WITH STRUCTURE WEIGHTING

BEFORE NORMALIZATION:
Node   a0     h0     a1     h1     a2     h2     a3     h3
600    1.00   1.00   0.00   0.67   0.00   0.67   0.00   0.67
602    0.25   0.25   0.00   0.28   0.00   0.29   0.00   0.29
604    1.00   1.00   0.36   0.00   0.55   0.00   0.55   0.00
606    0.60   0.60   0.38   0.00   0.61   0.00   0.61   0.00
608    0.60   0.60   0.02   0.00   0.06   0.00   0.06   0.00

NORMALIZED:
Node   a0     h0     a1     h1     a2     h2     a3     h3
600    0.60   0.60   0.00   0.92   0.00   0.92   0.00   0.92
602    0.15   0.15   0.00   0.38   0.00   0.40   0.00   0.40
604    0.60   0.60   0.68   0.00   0.67   0.00   0.67   0.00
606    0.36   0.36   0.73   0.00   0.74   0.00   0.74   0.00
608    0.36   0.36   0.04   0.00   0.07   0.00   0.07   0.00
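The normalized columns of both charts can be reproduced with the short sketch below, under two assumptions that are inferred from the chart values rather than stated in the surviving text. First, the link structure is taken to be 600→604, 600→606, 602→606 and 602→608. Second, the weighting rule is taken to be that each term in an update sum is multiplied by the structure score of the document contributing it. The structure scores 1.00, 0.25, 1.00, 0.60 and 0.60 are read from the a0/h0 columns before normalization. Only the normalized values are compared, since the overall scale of the weights cancels under normalization.

```python
import math

# Assumed link structure (inferred from the chart values, not from the text).
EDGES = [(600, 604), (600, 606), (602, 606), (602, 608)]
NODES = [600, 602, 604, 606, 608]


def normalize(vec):
    """Scale a weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {n: v / norm for n, v in vec.items()}


def iterate(scores, rounds):
    """Run `rounds` update rounds; return normalized (authority, hub) weights.

    `scores` are structure scores; a score of 1.0 for every node gives the
    unweighted behaviour. The weighting rule is an inferred assumption.
    """
    auth = normalize(scores)  # initial weights are the structure scores
    hub = dict(auth)
    for _ in range(rounds):
        new_auth = {n: 0.0 for n in NODES}
        for src, dst in EDGES:
            # Authority of dst: weighted hub weights of documents pointing to it.
            new_auth[dst] += scores[src] * hub[src]
        auth = normalize(new_auth)
        new_hub = {n: 0.0 for n in NODES}
        for src, dst in EDGES:
            # Hub of src: weighted authority weights of documents it points to.
            new_hub[src] += scores[dst] * auth[dst]
        hub = normalize(new_hub)
    return auth, hub


plain_scores = {n: 1.0 for n in NODES}
structure_scores = {600: 1.0, 602: 0.25, 604: 1.0, 606: 0.6, 608: 0.6}

auth_plain, hub_plain = iterate(plain_scores, 2)        # columns a2, h2
auth_struct, hub_struct = iterate(structure_scores, 3)  # columns a3, h3
```

Under these assumptions, node 606 has the highest authority weight in both runs, while structure weighting raises node 604 well above node 608 even though each is pointed to by a single hub.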
[0041] The system illustrated without the structure weights is equivalent to the system of Kleinberg. By looking at the final authority scores, it can be seen that without the structure weights nodes
[0042] The following is an example calculation for calculating the authority score at each iteration for node
[0043] Without Structure Weighting:
[0044] With Structure Weighting:
[0045] The above enhancements to hyperlinked document search systems and their described functional elements are implemented in various computing environments. For example, the present invention may be implemented on a conventional IBM PC or equivalent, multi-nodal system (e.g. LAN) or networking system (e.g. Internet, WWW, and wireless web). All programming and data related thereto are stored in computer memory, static or dynamic, and may be retrieved by the user in any of: conventional computer storage, display (i.e. CRT) and/or hardcopy (i.e. printed) formats.
[0046] A system and method has been shown in the above embodiments for the effective implementation of a method to determine the documents of a hyperlinked environment which are authorities on a given topic which most closely meet guidelines related to document structure. While various preferred embodiments have been shown and described, it will be understood that there is no intent to limit the invention by such disclosure, but rather, it is intended to cover all modifications and alternate constructions falling within the spirit and scope of the invention, as defined in the appended claims. For example, the present invention should not be limited by software/program, computing environment, specific computing hardware and hyperlinked environment. It is further envisioned the system can additionally be utilized in conjunction with textual based analysis systems along with other variants of the algorithm to perform classification, clustering, targeted crawling and identification of micro-communities.