Title:
System And Method For Generating A Search Ranking Score For A Web Page
Kind Code:
A1


Abstract:
A system and method for generating a search ranking score for a web page. The system comprises a training data processor effective to receive training data including at least a first page, a first label, a second page and a second label. The system further comprises a feature extraction processor connected to the training data processor, the feature extraction processor is effective to receive the first page, identify first features in the first page and calculate first values relating to the first features; the feature extraction processor is further effective to receive the second page and identify second features and calculate second values relating to the second features. A machine learning processor is connected to the feature extraction processor. The machine learning processor is effective to receive the first features, the first values, the first label, the second features, the second values, and the second label and generate a ranking function for a search engine based on the first features, the first values, the first label, the second features, the second values, and the second label. A receiving processor is connected to the machine learning processor. The receiving processor is effective to receive a web page. A ranking processor is connected to the receiving processor. The ranking processor is effective to apply the ranking function to the web page to generate a score.



Inventors:
Kulkami, Parashuram (New York, NY, US)
Application Number:
12/367634
Publication Date:
03/04/2010
Filing Date:
02/09/2009
Primary Class:
Other Classes:
707/E17.107, 707/E17.109
International Classes:
G06F17/30



Primary Examiner:
JAMI, HARES
Attorney, Agent or Firm:
DILWORTH & BARRESE, LLP (1000 WOODBURY ROAD, SUITE 405, WOODBURY, NY, 11797, US)
Claims:
What is claimed is:

1. A method for generating a search ranking score for a web page, the method comprising: receiving training data at a training data processor, the training data including at least a first page, a first label, a second page and a second label; receiving the first page at a feature extraction processor; identifying first features in the first page at the feature extraction processor; calculating first values relating to the first features at the feature extraction processor; receiving the second page at the feature extraction processor; identifying second features in the second page at the feature extraction processor; calculating second values relating to the second features at the feature extraction processor; receiving the first features, the first values, the first label, the second features, the second values, and the second label at a machine learning server; generating, at the machine learning server, a ranking function for a search engine based on the first features, the first values, the first label, the second features, the second values, and the second label; receiving a web page at a receiving processor; receiving a keyword at the receiving processor; and applying the ranking function to the web page and keyword to generate a score.

2. The method as recited in claim 1, wherein the score relates to internal links, external links, in-links, or out-links.

3. The method as recited in claim 1, wherein the score relates to a quality of the web page.

4. The method as recited in claim 1, wherein the score relates to a crawlability of the web page.

5. The method as recited in claim 1, wherein the applying the ranking function to the web page includes analyzing features of the web page.

6. The method as recited in claim 5, further comprising: generating a maximum score for the web page based on the ranking function; and generating a recommendation for a change to at least one feature of the web page based on the maximum score.

7. The method as recited in claim 6, further comprising generating a prediction of a new score for the web page based on the recommendation.

8. The method as recited in claim 7, further comprising generating an approximation of a change in traffic to the web page based on the recommendation.

9. The method as recited in claim 1, further comprising: receiving a web site including a plurality of web pages; applying the ranking function to each of the plurality of web pages to generate a score for each of the plurality of web pages; calculating a popularity of each of the plurality of web pages; applying a weight to each of the plurality of web pages based on the respective popularity; and generating a score for the web site based on the score and weight for each of the plurality of web pages.

10. The method as recited in claim 9, wherein the popularity is calculated using at least one of the GOOGLE page rank algorithm, the ALEXA rank, depth of the web page from a home page and number of visits to the web page.

11. A system for generating a score for a web page, the system comprising a training data processor, the training data processor effective to receive training data, the training data including at least a first page, a first label, a second page and a second label; a feature extraction processor connected to the training data processor, the feature extraction processor effective to receive the first page, identify first features in the first page and calculate first values relating to the first features; the feature extraction processor further effective to receive the second page and identify second features and calculate second values relating to the second features; a machine learning processor connected to the feature extraction processor, the machine learning processor effective to receive the first features, the first values, the first label, the second features, the second values, and the second label and generate a ranking function for a search engine based on the first features, the first values, the first label, the second features, the second values, and the second label; a receiving processor connected to the machine learning processor, the receiving processor effective to receive a web page and a keyword; and a ranking processor connected to the receiving processor, the ranking processor effective to apply the ranking function to the web page and keyword to generate a score.

12. The system as recited in claim 11, wherein the score relates to internal links, external links, in-links, or out-links.

13. The system as recited in claim 11, wherein the score relates to a quality of the web page.

14. The system as recited in claim 11, wherein the score relates to a crawlability of the web page.

15. The system as recited in claim 11, wherein the ranking processor applies the ranking function to the web page by analyzing features of the web page.

16. The system as recited in claim 15, wherein the ranking processor is further effective to: generate a maximum score for the web page based on the ranking function; and generate a recommendation for a change to at least one feature of the web page based on the maximum score.

17. The system as recited in claim 16, wherein the ranking processor is further effective to generate a prediction of a new score for the web page based on the recommendation.

18. The system as recited in claim 17, wherein the ranking processor is further effective to generate an approximation of a change in traffic to the web page based on the recommendation.

19. The system as recited in claim 11, wherein: the receiving processor is effective to receive a web site including a plurality of web pages; and the ranking processor is effective to: apply the ranking function to each of the plurality of web pages to generate a score for each of the plurality of web pages; calculate a popularity of each of the plurality of web pages; apply a weight to each of the plurality of web pages based on the respective popularity; and generate a score for the web site based on the score and weight for each of the plurality of web pages.

20. The system as recited in claim 19, wherein the popularity is calculated using at least one of the GOOGLE page rank algorithm, the ALEXA rank, depth of the web page from a home page and number of visits to the web page.

21. The system as recited in claim 11, wherein the training data processor, the feature extraction processor, the machine learning processor, the receiving processor and the ranking processor are distinct.

Description:

This application claims priority to U.S. Patent application Ser. No. 61/093,586 entitled “Techniques for Automated Search Rank Function, Approximation, Rank Improvement Recommendations and Predictions”, filed Sep. 2, 2008, the entirety of which is hereby incorporated by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This disclosure relates to a system and method for generating a score for a web page reflecting how a search engine would rank the web page for an arbitrary set of keywords using a ranking algorithm of the search engine.

2. Description of the Related Art

Referring to FIG. 1, the World Wide Web (“WWW”) is a distributed database including literally billions of pages accessible through the Internet. Searching and indexing these pages to produce useful results in response to user queries is constantly a challenge. A search engine is typically used to search the WWW.

A typical prior art search engine 20 is shown in FIG. 1. Pages from the Internet or other source 22 are accessed through the use of a crawler 24. Crawler 24 aggregates pages from source 22 to ensure that these pages are searchable. Many algorithms exist for crawlers and in most cases these crawlers follow links in known hypertext documents to obtain other documents. The pages retrieved by crawler 24 are stored in a database 36. Thereafter, these pages are indexed by an indexer 26. Indexer 26 builds a searchable index of the pages in a database 34. For example, each web page may be broken down into words and respective locations of each word on the page. The pages are then indexed by the words and their respective locations.
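By way of illustration only, the word-and-location indexing described above may be sketched as follows. The function name build_positional_index, the sample pages, and the tokenization rule (lowercasing and splitting on non-alphanumeric characters) are assumptions made for this sketch and are not taken from any particular search engine.

```python
import re
from collections import defaultdict


def build_positional_index(pages):
    """Map each word to {page_id: [positions]} for a dict of page_id -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        # Assumed tokenization: lowercase, then keep runs of letters and digits.
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word][page_id].append(position)
    return index


# Hypothetical sample pages.
pages = {
    "page_a": "Buy a telephone today",
    "page_b": "Telephone repair and telephone parts",
}
index = build_positional_index(pages)
telephone_postings = dict(index["telephone"])
print(telephone_postings)  # {'page_a': [2], 'page_b': [0, 3]}
```

A query for a word may then be answered by looking up that word in the index rather than scanning every stored page.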

In use, a user 32 sends a search query to a dispatcher 30. Dispatcher 30 compiles a list of search nodes in cluster 28 to execute the query and forwards the query to those selected search nodes. The search nodes in search node cluster 28 search respective parts of the index 34 and return search results along with a document identifier to dispatcher 30. Dispatcher 30 merges the received results to produce a final result set displayed to user 32 sorted by ranking scores based on a ranking function.

The ranking function is a function of the query itself and the type of page produced. Factors that are used for relevance include hundreds of features extracted, collected or identified for each page including: a static relevance score for the page such as link cardinality and page quality, superior parts of the page such as titles, metadata and page headers, authority of the page such as external references and the “level” of the references, the GOOGLE page rank algorithm, and page statistics such as query term frequency in the page, words on a page, global term frequency, term distances within the page, etc.

The use of search engines has become one of the most popular online activities with billions of searches being performed by users every month. Search engines are also a starting point for consumers for shopping and various day to day purchases and activities. With billions of dollars being spent by consumers online, it has become ever more important for web sites to organize and optimize their web pages in an effort to be more visible and accessible to users of a search engine.

As discussed above, for each web page, hundreds of features are extracted and a ranking function is applied to those features to produce a ranking score. A merchant with a web page would like his page to be ranked higher in a result set based on relevant search keywords compared with web pages of his competitor for the same keywords. For example, for a merchant selling telephones, that merchant would like his web page to acquire a higher ranking score, and appear higher in a result set produced by a search engine, based on the keyword query “telephone” than the ranking scores of web sites of his competitors for the same keyword. There are some prior art solutions available to guess the ranking algorithm used by a search engine and to provide recommendations about improvements that can be made to web pages so that the ranking score for a web page relating to particular keywords may improve. However, most of these systems use manual, human judgment and historical knowledge about search engines. Humans must be trained to perform this analysis. These judgments are mostly guesses or are arrived at by trial and error. Consequently, most prior art solutions are inaccurate, time consuming, and require expensive human capital. Moreover, these solutions are available only for specific search engines and are not immune to changes in search or ranking algorithms used by known search engines nor do they have the ability to adapt to new search engines.

SUMMARY OF THE INVENTION

One embodiment of the invention is a method for generating a search ranking score for a web page. The method comprises receiving training data at a training data processor, the training data including at least a first page, a first label, a second page and a second label. The method further comprises receiving the first page at a feature extraction processor; identifying first features in the first page at the feature extraction processor; and calculating first values relating to the first features at the feature extraction processor. The method further comprises receiving the second page at the feature extraction processor; identifying second features in the second page at the feature extraction processor; and calculating second values relating to the second features at the feature extraction processor. The method further comprises receiving the first features, the first values, the first label, the second features, the second values, and the second label at a machine learning server; generating, at the machine learning server, a ranking function for a search engine based on the first features, the first values, the first label, the second features, the second values, and the second label; receiving a web page at a receiving processor; receiving a keyword at the receiving processor; and applying the ranking function to the web page and keyword to generate a score.

Another embodiment of the invention is a system for generating a score for a web page. The system comprises a training data processor, the training data processor effective to receive training data, the training data including at least a first page, a first label, a second page and a second label. The system further comprises a feature extraction processor connected to the training data processor, the feature extraction processor effective to receive the first page, identify first features in the first page and calculate first values relating to the first features; the feature extraction processor further effective to receive the second page and identify second features and calculate second values relating to the second features. The system further comprises a machine learning processor connected to the feature extraction processor, the machine learning processor effective to receive the first features, the first values, the first label, the second features, the second values, and the second label and generate a ranking function for a search engine based on the first features, the first values, the first label, the second features, the second values, and the second label. The system further comprises a receiving processor connected to the machine learning processor, the receiving processor effective to receive a web page and a keyword; and a ranking processor connected to the receiving processor, the ranking processor effective to apply the ranking function to the web page and keyword to generate a score.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings constitute a part of the specification and include exemplary embodiments of the present invention and illustrate various objects and features thereof.

FIG. 1 is a system drawing of a search engine in accordance with the prior art.

FIG. 2 is a system drawing of a system for generating an approximation of a ranking algorithm in accordance with an embodiment of the invention.

FIG. 3 is a schematic drawing of a database structure in accordance with an embodiment of the invention.

FIG. 4 is a flow chart of a process which could be used in accordance with an embodiment of the invention.

FIG. 5 is a flow chart of a process which could be used in accordance with an embodiment of the invention.

FIG. 6 is a flow chart of a process which could be used in accordance with an embodiment of the invention.

FIGS. 7A and 7B are system drawings of a system for generating a score for a web page in accordance with an embodiment of the invention.

FIGS. 8A and 8B are system drawings of a system for generating a score for a web site in accordance with an embodiment of the invention.

FIGS. 9A and 9B are system drawings of a system for generating a recommendation for a web site in accordance with an embodiment of the invention.

FIG. 10 is a flow chart of a process which could be used in accordance with an embodiment of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT(S)

Various embodiments of the invention are described hereinafter with reference to the figures. Elements of like structures or function are represented with like reference numerals throughout the figures. The figures are only intended to facilitate the description of the invention and are not intended as a limitation on the scope of the invention. In addition, an aspect described in conjunction with a particular embodiment of the invention is not necessarily limited to that embodiment and can be practiced in conjunction with any other embodiments of the invention.

When applying a ranking function, search engines receive as input: 1) at least one keyword and 2) a plurality of web pages in a result set produced based on the keyword(s). With those inputs, the search engine produces as an output a ranking score for each web page. The inventors recognized this phenomenon and produced a system and algorithm to reverse engineer the function performed by search engines to produce that output. Stated another way, search engines perform the following ranking function to generate a ranking score for each page in a result set:


ranking score=F(input)

where the input is the search query in the form of keyword(s) and extracted features of the pages in the result set.

In order to approximate the ranking function, training data may be sent to a machine learning system. Generating such training data is perhaps the most difficult and labor intensive part of any machine learning system. As discussed above, prior art techniques for generating training data include the use of teams of humans subjectively viewing selected portions of available data such as keywords and result sets. Even if collection of data may be automated, in the prior art, labeling of the data is performed manually. Such labeling techniques are often inaccurate as they are subject to human judgment of a complex system such as a search engine. A human being typically cannot judge by intuition whether he has collected all kinds of different search results to ensure that the training data is diverse and it is generally not possible to manually track or generate a diverse set of data. A diverse training set is desired for a machine learning algorithm to work well. Moreover, human labeling is not accurate because it is generally not possible to judge a label value by intuition.

Referring to FIG. 2, there is shown a system 80 in accordance with an embodiment of the invention. System 80 includes a training data generator server 60. Training data generator server or processor 60 sends keywords 62 over a network 64 (such as the Internet) to a search engine server 66. Keywords 62 could be virtually any set of keywords that, when input to a search engine, yield web pages in a result set. It is desirable to generate a number of different sets of keywords. Many techniques could be used to generate such sets. For example, keyword tools provided by search engines, such as the MSN Keyword tool or the GOOGLE ADWORDs tool, could be used; third party tools which monitor and collect keywords based on popularity, usage and other metrics may be used; or statistical analysis may be used to determine important keywords from web pages and web logs. For example, by collecting the frequency distribution of keywords from web pages and web logs, it may be possible to identify important keywords from pages. Keywords 62 are sent by search engine server 66 to a search engine index 68.
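By way of illustration only, the frequency-distribution approach to identifying important keywords may be sketched as follows. The function name candidate_keywords, the stop-word list, and the sample documents are hypothetical and are introduced solely for this sketch; the disclosure does not prescribe a particular statistical analysis.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list used only for illustration.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "on", "is"}


def candidate_keywords(documents, top_n=3):
    """Return the top_n most frequent non-stop words across the documents."""
    counts = Counter()
    for text in documents:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical text taken from web pages or web logs.
documents = [
    "Cordless telephone for the home",
    "Telephone accessories and telephone cases",
    "Home office telephone",
]
top_keywords = candidate_keywords(documents, top_n=2)
print(top_keywords)  # ['telephone', 'home']
```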

Search engine index 68 outputs web pages 70 that are responsive to a search query including keywords 62. Search engine server 66 receives web pages 70 and orders or ranks web pages 70 based on an unknown ranking algorithm to produce ranked web pages 76.

Ranked web pages 76 are sent over network 64 and fed to training data generator server 60. Training data generator server 60 stores ranked web pages 76 and labels 82 for those pages in a training data storage 84. A label 82 is associated with each ranked web page 76 corresponding to the rank of the ranked web page 76 based on keyword 62. The inventors have determined that a linear distribution of the ranking scores is a good representation of those scores. Consequently, if L ranked web pages 76 are considered, the highest ranked web page is given a label L, the second highest is given a label L-1, etc.
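By way of illustration only, the linear labeling described above may be sketched as follows. The function name assign_labels and the sample page identifiers are hypothetical; the sketch assumes the ranked pages are supplied in the order returned by the search engine, highest ranked first.

```python
def assign_labels(ranked_pages):
    """Pair each page with a label: L for the top page, L-1 for the next, and so on."""
    total = len(ranked_pages)
    return [(total - position, page) for position, page in enumerate(ranked_pages)]


# Hypothetical result set for one keyword, highest ranked first.
ranked_pages = ["page_a.html", "page_b.html", "page_c.html"]
labeled = assign_labels(ranked_pages)
print(labeled)  # [(3, 'page_a.html'), (2, 'page_b.html'), (1, 'page_c.html')]
```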

A feature extraction engine 72 receives web pages 76 and labels 82 from training data storage 84, and extracts values for defined features in ranked web pages 76. Search engines generate a ranking score and rank pages based on values of certain features. Those features include, for example, features used in the GOOGLE page rank algorithm (such as the links pointing to the page and links in the page pointing somewhere else), the size of the web page, the number of matches between the web page and keywords 62, etc. The features may be derived from the content and structure of HTML documents. For example the kinds of features extracted may relate to: keyword frequency in Title tag of HTML documents, keyword frequency in metatags of HTML documents, keyword frequency in the body of HTML documents, keyword frequency in anchor text of in-links to a HTML document, number of back links of a HTML document, or distribution of back links of HTML documents.

If n features are extracted from ranked web pages 76, then each page P may be represented as a page vector 78 $P = \{f_1^P, f_2^P, \ldots, f_n^P\}$ where $f_n^P$ is the nth feature of page P.

For example, if 100 ranked web pages 76 are considered, and 6 features are extracted, feature extraction server 72 will produce 100 page vectors 78, labeled 100 to 1, where page 100 would have the form $P(100) = \{f_1^{100}, f_2^{100}, f_3^{100}, f_4^{100}, f_5^{100}, f_6^{100}\}$ where f1, f2, f3, f4, f5 and f6 would have values corresponding to those respective features. For example, if f1 corresponds to the number of keyword matches, and if page 100 has 10 words that match keywords 62, f1 may have a value of 10.

Referring to FIG. 3, there is shown an example of a training data structure 110 which may be stored in training data storage 84. As shown, for a keyword 112 (“telephone” is shown) training data structure 110 may include a label column 114 and a web page column 118. Label column 114 includes labels 116 for ranked web pages 76 (FIG. 2). The web pages themselves may be stored in web page column 118. The contents of training data structure 110 may be used as training data in a machine learning server 74 (discussed below).

Referring to FIG. 4, there is shown a flow chart of a process which could be used to implement a process in accordance with an embodiment of the invention. The process could be used with, for example, system 80 described with respect to FIG. 2. As shown at step S2, at least one input or keyword is sent to a search engine or any other system implementing a process. At step S4, the search engine queries a search engine index using the keyword to produce a result set including web pages, or the process uses the keywords as input to produce an output. At step S6, the search engine ranks the web pages, or the process ranks the outputs. At step S8, the search engine or process forwards the inputs or keywords and ranked web pages or output to a training data server or processor. At step S10, the training data server or processor assigns a label to each page or output based on the rank. At step S12, the labels and pages or outputs are used as training data.

Referring to FIG. 5, there is shown a flow chart of a process which could be used in accordance with an embodiment of the invention. The process shown in FIG. 5 could be used with, for example, system 80 described above with reference to FIG. 2. As shown, at step S22, a web page is received by a feature extraction server. At step S24, features in the page are identified. At step S26, values are calculated for each feature. At step S28, a page vector is generated including the values for each feature.
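By way of illustration only, steps S22 through S28 may be sketched as follows using the HTML parser of the Python standard library. The four features chosen here (keyword matches in the title, keyword matches outside the title, number of anchor tags having an href attribute, and page size in characters) are a small assumed subset of the kinds of features discussed above, and the names FeatureParser, count_matches and page_vector are hypothetical.

```python
import re
from html.parser import HTMLParser


class FeatureParser(HTMLParser):
    """Collects title text, all other text, and a count of <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_text = []
        self.other_text = []
        self.link_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a" and any(name == "href" for name, _ in attrs):
            self.link_count += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        (self.title_text if self.in_title else self.other_text).append(data)


def count_matches(text, keyword):
    """Count whole-word, case-insensitive occurrences of keyword in text."""
    pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))


def page_vector(html, keyword):
    """Steps S24 to S28: identify features, calculate values, build the vector."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()
    return [
        count_matches(" ".join(parser.title_text), keyword),  # f1: title matches
        count_matches(" ".join(parser.other_text), keyword),  # f2: non-title matches
        parser.link_count,                                    # f3: links in the page
        len(html),                                            # f4: page size
    ]


# Hypothetical page received at step S22.
html = (
    "<html><head><title>Telephone shop</title></head><body>"
    "<p>Buy a telephone. Every telephone ships free.</p>"
    "<a href='/contact'>Contact</a></body></html>"
)
vector = page_vector(html, "telephone")
print(vector[:3])  # [1, 2, 1]
```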

Referring again to FIG. 2, feature extraction server 72 stores page vectors 78 in a feature storage 86. A machine learning server or processor 74 may receive page vectors 78 from feature storage 86 and labels 82 from training data storage 84, and use page vectors 78 and labels 82 as training data. Machine learning server 74 applies a machine learning algorithm to page vectors 78 and labels 82 to produce an approximated ranking function 88.

Machine learning server 74 may use any known machine learning techniques on training data to produce an approximated ranking function 88 for a search engine of interest. Although a specific algorithm is disclosed to produce training data, any training data may be used. More specifically, as described above, machine learning server 74 receives page vectors 78 stored in feature storage 86 and labels 82 stored in training data storage 84. Machine learning server 74 then applies a known machine learning algorithm using, for example, boosted decision trees, support vector machines, Bayesian learners, any other algorithm, or a combination of these techniques, to calculate an approximated ranking function 88 which is an approximation of the ranking algorithm used by search engine server 66 to score and rank web pages 70. Approximated ranking function 88 for a page P will have the form $R(P) = v_1 f_1 + v_2 f_2 + \ldots + v_n f_n$, where $v_n$ is a weight assigned to the nth feature $f_n$ of a page of interest. For example, v1 could have a value of 0.5 and v2 could have a value of 0.25 indicating that in the approximated ranking function 88, feature f1 is given twice the amount of weight as feature f2. Machine learning server 74 may continually process page vectors 78 and labels 82 using many different machine learning algorithms to calculate approximated ranking function 88. For example, machine learning server 74 may calculate a set of three different possible approximated ranking functions, and in one particular function a combination of values v results in ranking scores R(P) for pages matching labels 82 of ranked web pages 76 80% of the time. That particular function may be assigned as the approximated ranking function 88.
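By way of illustration only, the selection of one approximated ranking function from a set of candidates may be sketched as follows. The disclosure does not specify how a match between ranking scores and labels is measured; this sketch assumes a pairwise measure, namely the fraction of page pairs whose score order agrees with their label order. The function names, the candidate weights and the sample page vectors are hypothetical, and the candidates are assumed to have been produced beforehand by any learning algorithm.

```python
from itertools import combinations


def score(weights, vector):
    """Linear ranking function R(P): the sum of weight times feature value."""
    return sum(v * f for v, f in zip(weights, vector))


def pairwise_agreement(weights, vectors, labels):
    """Fraction of page pairs ordered the same way by score and by label.

    Assumes at least two pages and distinct labels.
    """
    pairs = list(combinations(range(len(vectors)), 2))
    agreeing = 0
    for i, j in pairs:
        label_diff = labels[i] - labels[j]
        score_diff = score(weights, vectors[i]) - score(weights, vectors[j])
        if label_diff * score_diff > 0:
            agreeing += 1
    return agreeing / len(pairs)


def pick_ranking_function(candidates, vectors, labels):
    """Return the candidate weight list with the highest pairwise agreement."""
    return max(candidates, key=lambda w: pairwise_agreement(w, vectors, labels))


# Hypothetical training data: three page vectors with labels 3, 2, 1.
vectors = [[10, 2], [6, 4], [3, 1]]
labels = [3, 2, 1]
# Hypothetical candidate weight sets.
candidates = [[0.5, 0.25], [0.0, 1.0], [-1.0, 0.0]]

best = pick_ranking_function(candidates, vectors, labels)
print(best)  # [0.5, 0.25]
```

In this sample the first candidate orders all three pairs consistently with the labels, the second orders two of three consistently, and the third orders none consistently, so the first candidate is selected.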

Approximated ranking function 88 need not be the same as the actual ranking algorithm used by search engine server 66 to be of significant value. For example, the GOOGLE search engine may assign different weights to different features than those assigned by machine learning server 74. In fact, machine learning server 74 may identify new patterns and/or useful features not currently appreciated by known search engines. However, as the inputs to machine learning server 74 (e.g. page vectors 78 and labels 82) are the outputs of a search engine of interest, approximated ranking function 88 yields a valuable result. Ranking function 88 may be used to evaluate how well a particular web page would rank in a result set based on a keyword in a search engine of interest.

Referring to FIG. 6, there is shown a flow chart of a process which could be used in accordance with an embodiment of the invention. The process shown in FIG. 6 could be used with, for example, system 80 described above with reference to FIG. 2. As shown, at step S40, for each web page or output, a machine learning server or processor receives a page vector including features and values. At step S42, the machine learning server receives labels for each of the pages or outputs indicating how well each page corresponds to a particular keyword or input. At step S44, a query is made as to whether there are more web pages or outputs to process. If the answer to query S44 is yes, control branches back to step S40. If the answer to query S44 is no, control branches to step S46. At step S46, the machine learning server uses a machine learning algorithm to generate an approximated ranking function based on the features, values and labels.

Armed with the above information, a user may be able to present system 80 with his or her web site, a search engine of interest and pertinent keywords and receive a score indicating how well the web site is designed for that search engine. For example, as shown in FIG. 7A, system 80 may receive a web page 124 and an indication of a desired search engine 126 and pertinent keywords 127 from a user 122, such as through a communication 128, at a receiving server 144. System 80, connected to receiving server/processor 144, can determine an approximated ranking function 88 for search engine 126 using techniques described above. As shown in FIG. 7B, receiving server 144 may then apply web page 124 and pertinent keywords 127 to approximated ranking function 88 to produce a score 130 for web page 124 for those pertinent keywords 127. Score 130 may then be forwarded to user 122. Score 130 gives user 122 a measure of how well web page 124 is ranking in search engine algorithms for the pertinent keywords 127 and, in particular, how well web page 124 would rank in search engine 126.

For example, referring also to FIG. 2, web page 124 has particular values for features f (e.g. 10 links pointing to the web page, or 5 words match pertinent keyword 127). Those values are calculated by feature extraction server 72, applied to approximated ranking function 88, multiplied by weights v and the products summed together to produce score 130. Stated another way, score 130 for web page 124 is $R(P(124)) = v_1 f_1^{124} + v_2 f_2^{124} + \ldots + v_n f_n^{124}$ where $v_n f_n^{124}$ is the product of the nth weight and the value of the nth feature in web page 124. Score 130 is a good approximation of the actual rank score that would be given by search engine of interest 126 for pertinent keyword 127 and also provides a numerical measure of how well web page 124 would fare in various other search engines.
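By way of illustration only, the calculation of score 130 as a sum of weight and feature-value products may be sketched as follows, together with the per-feature products that make up the total. The feature names, the weights 0.5 and 0.25, and the function name ranking_score are assumptions made for this sketch; the feature values (10 in-links, 5 keyword matches) follow the example above.

```python
def ranking_score(weights, feature_values):
    """Return (total score, per-feature products) for one web page.

    Both arguments are dicts keyed by feature name; every feature that has a
    value must also have a weight.
    """
    missing = set(feature_values) - set(weights)
    if missing:
        raise ValueError(f"no weight for features: {sorted(missing)}")
    contributions = {
        name: weights[name] * value for name, value in feature_values.items()
    }
    return sum(contributions.values()), contributions


# Hypothetical weights v from an approximated ranking function.
weights = {"in_links": 0.5, "keyword_matches": 0.25}
# Feature values for the page: 10 links pointing to it, 5 keyword matches.
feature_values = {"in_links": 10, "keyword_matches": 5}

total, contributions = ranking_score(weights, feature_values)
print(total)          # 6.25
print(contributions)  # {'in_links': 5.0, 'keyword_matches': 1.25}
```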

Referring to FIG. 8A, a web site 134 may include a plurality of web pages 124. A site score may be generated for the entire web site 134 by combining the scores for the web pages 124 across the site 134 in a weighted average. For example, a Site Score SS may be calculated as $SS = (w_1 s_1 + w_2 s_2 + \ldots + w_n s_n)/(w_1 + w_2 + \ldots + w_n)$ where $w_n$ is the weight of the nth page of the web site 134, and $s_n$ is the score of the nth page using score 130 discussed above. A site score server 136 in communication with system 80 may be used to perform these calculations. Weights w are assigned based on the popularity of each page in the web site. Site Score SS thus represents a numerical measure of how well an entire web site is faring in search engines.
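By way of illustration only, the Site Score calculation may be sketched as follows. The function name site_score and the sample scores and weights are hypothetical.

```python
def site_score(page_scores, page_weights):
    """Weighted average of page scores: sum(w * s) divided by sum(w)."""
    if len(page_scores) != len(page_weights):
        raise ValueError("each page needs exactly one score and one weight")
    total_weight = sum(page_weights)
    if total_weight == 0:
        raise ValueError("at least one page must have a non-zero weight")
    weighted_sum = sum(w * s for w, s in zip(page_weights, page_scores))
    return weighted_sum / total_weight


# Hypothetical scores s and weights w for a three-page web site.
page_scores = [80, 50, 20]
page_weights = [1.0, 0.5, 0.5]
ss = site_score(page_scores, page_weights)
print(ss)  # 57.5
```

Because the sum is divided by the total weight, a page with a larger weight pulls the Site Score toward its own score more strongly than a page with a smaller weight.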

The popularity used in determining weight w may be calculated by looking at a function of certain metrics provided by search engines such as the GOOGLE page rank algorithm, ALEXA rank, number of visits to the particular page, depth of the URL address of the page from a home page, etc. Popularity P may be calculated as $P = (P_1 + P_2 + \ldots + P_n)/n$ where $P_n$ is the nth factor used in calculating web page popularity and n is the number of popularity factors used. The popularity may be calculated based on an average score of these metrics normalized to a range of between 1 and 100. For example, if the popularity produces a score of 6 out of 10, that score may be normalized to 60 out of 100.
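By way of illustration only, the popularity calculation may be sketched as follows. The sketch assumes that each factor is first rescaled to a 100-point scale and that the rescaled factors are then averaged; the disclosure leaves the order of normalization and averaging open. The function name popularity and the sample factor values are hypothetical.

```python
def popularity(factors):
    """Average of popularity factors after rescaling each to a 100-point scale.

    factors is a list of (value, scale_maximum) pairs, for example (6, 10)
    for a metric reported as 6 out of 10.
    """
    if not factors:
        raise ValueError("at least one popularity factor is required")
    normalized = [100.0 * value / scale_max for value, scale_max in factors]
    return sum(normalized) / len(normalized)


# Hypothetical factors: one metric at 6 out of 10, another at 80 out of 100.
page_popularity = popularity([(6, 10), (80, 100)])
print(page_popularity)  # 70.0
```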

Once the popularity of each web page is known, the relative weights w may be calculated as a number between 0 and 1 using, for example, the following formula:


Wmax=max(P1, P2 . . . PN) and


wk=Pk/Wmax

where “max” is the maximum value, wk is the weight assigned for the kth web page of a web site and Pk is the popularity of the kth web page.
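The two formulas above, together with the Site Score SS, could be combined as in the following sketch. The function names and the sample popularity and score values are assumptions chosen for illustration only.

```python
from typing import List, Sequence


def relative_weights(popularities: Sequence[float]) -> List[float]:
    """Compute wk = Pk / Wmax, where Wmax = max(P1, P2, ..., PN)."""
    w_max = max(popularities)
    if w_max <= 0:
        raise ValueError("at least one page must have positive popularity")
    return [p / w_max for p in popularities]


def site_score(page_scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average SS = (w1*s1 + ... + wn*sn) / (w1 + ... + wn)."""
    if len(page_scores) != len(weights):
        raise ValueError("one weight is required per page score")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, page_scores)) / total_weight


# Hypothetical two-page site: popularities 100 and 50, page scores 500 and 200.
example_weights = relative_weights([100, 50])
example_site_score = site_score([500, 200], example_weights)
```

In this example the most popular page receives weight 1 and the other 0.5, so the site score is (500 + 100) / 1.5 = 400, pulled toward the score of the more popular page.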

Score 130 may be broken down by categories for user 122. The categories can be based on features f. For example, the following scores can be provided to user 122 corresponding to how well the respective features are used in web page 124:

Link score—a score based on weights applied to internal links, external links, in-links, and out-links in web pages 124 from approximated ranking function 88.

Relevance score—a score based on the relevance of web pages 124 with respect to certain keywords such as pertinent keyword 127.

Quality score—a score indicating a general quality of web page 124 in terms of how well organized the pages are, how content is formatted, whether certain rules are being followed, etc. using weights from approximated ranking function 88.

Crawlability score—a score indicating how easily accessible and indexable web page 124 is using weights from approximated ranking function 88. For example, crawlers reach and download web pages by following links. Certain situations may make a web page more difficult to crawl such as: broken links, no links, JavaScript generated links (which may not be seen by crawlers), web page errors, a server being down, an incorrect robots.txt file, etc. A crawler may not be able to access these web pages and therefore these web pages do not show up in a search index. Also, even if a crawler can access a particular page, the page content and format may be designed in a way that prevents a search engine indexer from seeing and indexing the content of the page, such as in cases where the page is Flash based, or where there is too much JavaScript generated content, HTML frames, CSS formatting, etc.

Scores may also be generated for any of the features extracted by feature extraction server 72.

Referring to FIGS. 9A and 9B, knowing approximated ranking function 88 and features f for web page 124, system 80, in conjunction with a prediction/recommendation server 138, can calculate a new score 130a for web page 124 with different values of features f. If the new score 130a is higher in value than an original score 130, the different values of features f can be used as a recommendation 140 to user 122 to improve search visibility and quality. Prediction/recommendation server 138 may generate recommendation 140 by calculating optimal values for features f of web page 124 to achieve the highest possible score 130. Such a highest possible score is determined by:


Max S130={R(v′f1, v′f2, . . . v′fn)}

where Max S is the highest possible score and could be determined by, for example, using the score of the highest ranked page 76 (FIG. 2), R is the approximated ranking function 88 calculated above, and v′fn is the calculated optimal value for the nth feature knowing Max S and R. The values of all features are modified together (some may increase the total score, others may decrease it, etc.) to calculate multiple scores for each combination or variation of features. The combination resulting in the highest score S is chosen. Since the values can be real numbers resulting in infinite possibilities, the acceptable values may be constrained to the values seen in practice. These constraints are derived from the training data such as from ranked web pages 76. For example, the values of a feature across 1 million pages in the training data may range from 0 to 20. These bounds are chosen as thresholds.
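One simple way to picture the search over combinations of feature values is an exhaustive scan of a bounded grid, sketched below. This is an assumption-laden illustration rather than the method of the embodiment: the function name best_feature_values, the integer candidate grids, the sample weights, and the use of the identity for R are all hypothetical, and the candidate ranges stand in for thresholds derived from the training data.

```python
from itertools import product
from typing import Callable, Iterable, Sequence, Tuple


def best_feature_values(
    weights: Sequence[float],
    candidates: Sequence[Iterable[float]],
    R: Callable[[float], float] = lambda x: x,  # assumption: identity stands in for R
) -> Tuple[Tuple[float, ...], float]:
    """Score every combination of candidate values and keep the best one."""
    best_values: Tuple[float, ...] = ()
    best_score = float("-inf")
    for combo in product(*candidates):
        score = R(sum(v * f for v, f in zip(weights, combo)))
        if score > best_score:
            best_values, best_score = combo, score
    return best_values, best_score


# Hypothetical bounds: feature 1 ranges over 0-20, feature 2 over 0-5.
# Feature 1 raises the score and feature 2 lowers it.
optimal_values, max_score = best_feature_values(
    [2.0, -1.0], [range(0, 21), range(0, 6)]
)
```

With these made-up weights the scan settles on the upper bound of the first feature and the lower bound of the second. The number of combinations grows multiplicatively with each added feature, so a practical system would likely need a coarser grid or a smarter search.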

Recommendation 140 for each feature f may then be determined as:


r(fn)=v′(fn)˜v(fn)

where r(fn) is a recommendation for feature fn, and v′(fn)˜v(fn) is the difference between optimal and original feature values for feature fn. Score 130 may be re-calculated for each optimal value of feature fn and a score difference, between a score 130 using original feature values and a score 130a using optimal feature values, may be used as a measure of score gain. Recommendations 140 may then be sorted based on new score 130a.
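The per-feature recommendation and score gain could be sketched as follows. The sketch assumes each feature is changed one at a time while the others keep their original values, and that R is the identity; the function name recommendations, the dictionary keys, and the sample numbers are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def recommendations(
    weights: Sequence[float],
    original: Sequence[float],
    optimal: Sequence[float],
    R: Callable[[float], float] = lambda x: x,  # assumption: identity stands in for R
) -> List[Dict[str, float]]:
    """Build one recommendation per feature, sorted by the new score."""

    def score(values: Sequence[float]) -> float:
        return R(sum(w * f for w, f in zip(weights, values)))

    base_score = score(original)
    recs = []
    for n, (old, new) in enumerate(zip(original, optimal)):
        changed = list(original)
        changed[n] = new  # swap in the optimal value for feature n only
        new_score = score(changed)
        recs.append({
            "feature": n,
            "change": new - old,             # difference between optimal and original value
            "new_score": new_score,          # score 130a
            "gain": new_score - base_score,  # score 130a minus score 130
        })
    return sorted(recs, key=lambda r: r["new_score"], reverse=True)


# Hypothetical page with two features and made-up weights.
example_recs = recommendations([2.0, 3.0], [10, 5], [12, 9])
```

In this example the original score is 35. Raising the second feature from 5 to 9 lifts the score to 47, while raising the first from 10 to 12 lifts it only to 39, so the second feature is listed first.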

With recommendation 140, system 80, in conjunction with prediction and recommendation server 138, can also generate a prediction of a gain in search rank for new scores 130a. For example, assume that web page 124 is the 5th highest ranked web page for pertinent keyword 127 because of a score 130 having a value 500, and the 4th highest ranked web page has a score 130 with a value of 550. If a recommendation 140 for a particular feature fn causes the score 130 of web page 124 to be greater than 550, system 80 can predict that web page 124 will now be the 4th highest ranked web page for search engine 126 for pertinent keyword 127. Predictions may also change based on a search engine of interest as changes to features f will affect the ranking score of search engines differently.
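The rank prediction in this example amounts to counting how many competing pages would still score at least as high as the new score. A minimal sketch, assuming the scores of the competing pages for the same keyword are already known and that a page must strictly exceed a competitor's score to pass it (the function name and the scores other than 500 and 550 are made up):

```python
from typing import Sequence


def predicted_rank(new_score: float, competitor_scores: Sequence[float]) -> int:
    """Return the 1-based rank the page would hold with new_score."""
    pages_ahead = sum(1 for s in competitor_scores if s >= new_score)
    return 1 + pages_ahead


# Hypothetical competitors; the 4th highest ranked page has a score of 550.
competitors = [900, 800, 700, 550, 450]
rank_before = predicted_rank(500, competitors)  # original score 130
rank_after = predicted_rank(560, competitors)   # new score 130a above 550
```

With the original score of 500 four pages remain ahead, giving the 5th position; once the score exceeds 550 only three remain ahead, giving the 4th.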

Referring to FIG. 10, there is shown a flow chart of a process which could be used in accordance with an embodiment of the invention. The process shown in FIG. 10 could be used with, for example, system 80 described above with reference to FIG. 2. As shown, at step S50, a receiving server receives a web page, an indication of a search engine of interest and a pertinent keyword from a user. At step S52, the receiving server forwards the indication of the search engine to a system effective to determine an approximated ranking function for the search engine. At step S54, the approximated ranking function is generated. At step S56, the web page and pertinent keyword are applied to the approximated ranking function to generate a score for the web page. At step S58, a recommendation may be generated for the web page to optimize the score. At step S60, the score and the recommendation may be forwarded to the user.

The change in rank of a web page is directly proportional to a possible traffic gain by web page 124. Click through rates on search engines are calculated for websites by analyzing data from web analytic tools or by analyzing historical data collected. By analyzing the change of click traffic, system 80 can estimate the average rate of increase of traffic for various rank positions on the search results. As traffic on a web site is directly proportional to sales and revenue, the percentage increase in revenue can be predicted by calculating a predicted percentage increase in traffic.

An accuracy A of system 80 may be determined by looking at the scores 130 generated by system 80 for a particular number k of web pages. The accuracy A may be determined using the formula:


A=TM/k

where TM is the total number of pages in ranked web pages 76 which match in rank to pages sorted by score 130 generated by system 80.
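The accuracy formula could be sketched as a position-by-position comparison between the actual ranking and the ranking implied by the generated scores. The sketch assumes a match means the same page occupies the same position in both lists, and that every ranked page has a score; the function name and the page identifiers are hypothetical.

```python
from typing import Dict, Sequence


def accuracy(actual_ranking: Sequence[str], scores: Dict[str, float]) -> float:
    """Compute A = TM / k for k ranked pages."""
    k = len(actual_ranking)
    if k == 0:
        raise ValueError("at least one ranked page is required")
    # Order the pages from highest to lowest generated score.
    predicted_ranking = sorted(actual_ranking, key=lambda p: scores[p], reverse=True)
    # TM: pages whose position agrees in both orderings.
    matches = sum(1 for a, p in zip(actual_ranking, predicted_ranking) if a == p)
    return matches / k


# Hypothetical pages listed in their actual search-engine order.
example_accuracy = accuracy(
    ["a", "b", "c", "d"],
    {"a": 90.0, "b": 70.0, "c": 80.0, "d": 10.0},
)
```

Here the scores order the pages a, c, b, d, so the first and last positions match the actual ranking and the accuracy is 2/4.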

As can be discerned, a system in accordance with the above description can make recommendations on various data points in order of importance, and make rank predictions for each recommended change.

Clearly, although different servers are shown for various elements such as the training data server, the feature extraction server, the receiving server, and the prediction/recommendation server, all servers could be combined in a single processor, housing or location.

A system in accordance with that described above can be used to collect training data on any search engine. Moreover, the system can adapt automatically to changes in ranking functions of existing search engines and produce new training data accordingly. Prior art systems are significantly limited in that subjective, expensive human capital is used to analyze only samples of available data. A system in accordance with the invention could analyze one page or thousands of pages easily and efficiently.

As can be discerned, the system and process described above are more accurate than human labeling because, in part, results of an unknown process such as search engine ranking are used. As the system is automated, it is possible to easily collect large amounts of training data without manual intervention. The use of automated labeling further eliminates the need for manual intervention. Learning algorithms produced in accordance with the invention are change resistant. This is because training data is based on search results. If any search engine changes its ranking algorithm, the results will change and the training data will change. Prior art systems based on intuition and prior knowledge of humans cannot adapt as easily. The system works with known and yet-to-be-developed search engines and can easily be applied to specific sites such as, for example, TRAVELOCITY.COM.

The invention has been described with reference to an embodiment that illustrates the principles of the invention and is not meant to limit the scope of the invention. Modifications and alterations may occur to others upon reading and understanding the preceding detailed description. It is intended that the scope of the invention be construed as including all modifications and alterations that may occur to others upon reading and understanding the preceding detailed description insofar as they come within the scope of the following claims or equivalents thereof. Various changes may be made without departing from the spirit and scope of the invention.