The present invention relates to the process of developing and maintaining the content of Internet search engine databases.
An internet (including, but not limited to, the Internet, intranets, extranets and similar networks) is a network of computers, with each computer being identified by a unique address. The addresses are logically subdivided into domains or domain names (e.g., ibm.com, pbs.org, and oranda.net), which allow a user to reference the various addresses. A web (including, but not limited to, the World Wide Web (WWW)) is a group of these computers accessible to each other via common communication protocols, or languages, including but not limited to Hypertext Transfer Protocol (HTTP). Resources on the computers in each domain are identified with unique addresses called Uniform Resource Locator (URL) addresses (e.g., http://www.ibm.com/products/laptops.htm). A web site is any destination on a web. It can be an entire individual domain, multiple domains, or even a single URL.
Resources can be of many types. Resources with a “.htm” or “.html” URL suffix are text files, or pages, formatted in a specific manner called Hypertext Markup Language (HTML). HTML is a collection of tags used to mark blocks of text and assign meaning to them. A specialized computer application called a browser can decode the HTML files and display the information contained within. A hyperlink is a navigable reference in any resource to another resource on the internet.
An internet Search Engine is a web application consisting of
Agents are programs that can travel over the internet and access remote resources. The internet search engine uses agent programs called Spiders, Robots, or Worms, among other names, to inspect the text of resources on web sites. Navigable references to other web resources contained in a resource are called hyperlinks. The agents can follow these hyperlinks to other resources. The process of following hyperlinks to other resources, which are then indexed, and following the hyperlinks contained within the new resource, is called spidering.
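The spidering loop described above can be pictured with a small sketch. It assumes an in-memory dictionary standing in for the web, so no network access takes place; the `PAGES` mapping, its URLs, and the names `LinkExtractor` and `spider` are hypothetical and are not part of the invention.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags in one HTML resource."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for resources that would be fetched over HTTP.
PAGES = {
    "/index.htm": '<a href="/a.htm">A</a> <a href="/b.htm">B</a>',
    "/a.htm": '<a href="/b.htm">B</a> <a href="/index.htm">Home</a>',
    "/b.htm": "<p>No links here.</p>",
}


def spider(start, pages):
    """Breadth-first walk: index a resource, then follow its hyperlinks."""
    indexed = []
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url not in pages:
            continue  # dangling hyperlink: nothing to index
        indexed.append(url)
        extractor = LinkExtractor()
        extractor.feed(pages[url])
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


visited = spider("/index.htm", PAGES)
```

The `seen` set keeps the agent from revisiting a resource when pages link back to each other, which is the usual reason a spider terminates on a cyclic link graph.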
The main purpose of an internet search engine is to provide users the ability to query the database of internet content to find content that is relevant to them. A user can visit the search engine web site with a browser and enter a query into a form (or page), including but not limited to an HTML form, provided for the task. The query may be in several different forms, but the most common are words, phrases, or questions. The query data is sent to the search engine through a standard interface, including but not limited to the Common Gateway Interface (CGI). The CGI is a means of passing data between a client (a computer requesting data or processing) and a program or script on a server (a computer providing data or processing). The combination of form and script is hereinafter referred to as a script application. The search engine will inspect its database for the URLs of resources most likely to relate to the submitted query. The list of URL results is returned to the user, with the format of the returned list varying from engine to engine. Usually it will consist of ten or more hyperlinks per search engine page, where each hyperlink is described and ranked for relevance by the search engine by means of various information such as the title, summary, language, and age of the resource. The returned hyperlinks are typically sorted by relevance, with the highest rated resources near the top of the list.
The World Wide Web consists of thousands of domains and millions of pages of information. The indexing and cataloging of content on an Internet search engine takes large amounts of processing power and time to perform. With millions of resources on the web, and some of the content on those resources changing rapidly (by the day, or even minute), a single search engine cannot possibly maintain a perfect database of all Internet content. Spiders and other agents are continually indexing and re-indexing WWW content, but a single World Wide Web site may be visited by an agent once, then not be visited again for months as the queue of sites the search engine must index grows. A site owner can speed up the process by manually requesting that resources on a site be re-indexed, but this process can become unwieldy for large web sites and, in fact, guarantees nothing.
Many current internet search engines support two methods of controlling the resource files that are added to their database. These are the robots.txt file, which is a site-wide, search engine specific control mechanism, and the ROBOTS META HTML tag which is resource file specific, but not search engine specific. Most internet search engines respect both methods, and will not index a file if robots.txt, ROBOTS META tag, or both informs the internet search engine to not index a resource. The use of robots.txt, the ROBOTS META tag and other methods of index control is advocated for the purposes of the present invention.
Commonly, when an internet search engine agent visits a web site for indexing, it first checks the existence of robots.txt at the top level of the site. If the search agent finds robots.txt, it analyses the contents of the file for records such as:
The above example would instruct all agents not to index any file in the directories named /cgi-bin/SRC or /stats. Each search engine agent has its own agent name. For example, AltaVista (currently the largest Internet search engine) has an agent called Scooter. To allow only AltaVista access to directory /avstuff, the following robots.txt file would be used:
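As a hedged illustration of the two record sets just described, the sketch below feeds assumed robots.txt contents to Python's standard `urllib.robotparser`. The file contents are reconstructed from the prose and are assumptions, as are the directory written as `/avstuff`, the helper name `allowed`, and the agent names `AnyBot` and `OtherBot`.

```python
from urllib.robotparser import RobotFileParser

# Assumed record set matching the first description: every agent is
# kept out of /cgi-bin/SRC and /stats.
ALL_AGENTS = """\
User-agent: *
Disallow: /cgi-bin/SRC
Disallow: /stats
"""

# Assumed record set matching the second description: only the agent
# named Scooter may enter the directory (written here as /avstuff).
SCOOTER_ONLY = """\
User-agent: Scooter
Disallow:

User-agent: *
Disallow: /avstuff
"""


def allowed(robots_txt, agent, path):
    """Return True when the given agent may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


stats_blocked = not allowed(ALL_AGENTS, "AnyBot", "/stats/hits.htm")
scooter_admitted = allowed(SCOOTER_ONLY, "Scooter", "/avstuff/page.htm")
```

An empty `Disallow:` line places no restriction on the named agent, which is how the second file admits Scooter while the catch-all record excludes every other agent.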
The ROBOTS META tag is found in the file itself. When the internet search engine agent indexes the file, it will look for a HTML tag like one of the following:
INDEX and NOINDEX indicate to all agents whether or not the file should be indexed by that agent. FOLLOW and NOFOLLOW indicate to all agents whether or not they should spider hyperlinks in this document.
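A minimal sketch of how an agent might read these directives, assuming the commonly used tag form `<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">` and assuming that a page without the tag may be both indexed and followed. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Reads <meta name="robots" content="..."> directives from a page."""

    def __init__(self):
        super().__init__()
        self.index = True    # assumed default when no directive is present
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        content = attributes.get("content", "")
        directives = {item.strip().upper() for item in content.split(",")}
        if "NOINDEX" in directives:
            self.index = False
        if "NOFOLLOW" in directives:
            self.follow = False


def robots_directives(html):
    """Return (index, follow) for one HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


page = '<html><head><meta name="ROBOTS" content="NOINDEX, FOLLOW"></head></html>'
index_allowed, follow_allowed = robots_directives(page)
```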
For current internet search engines, the present invention process uses the CGI program(s) provided by the search engine in order to add, modify and remove files from the search engine index. However, the process can generally only remove a file from the search engine index if the file no longer exists or if the site owner (under the direction of the process) has configured the site, through the use of robots.txt, the ROBOTS META tag or other methods of index control, so that the search engine will remove the file from its index.
The duration of time between the first time a site is indexed and the next time that information is updated has led to several key problems:
The present invention provides a mechanism for search engine and web site managers to maintain as perfect a registration of web site content as is possible. By augmenting or replacing existing agents and manual registration methods with specialized tools on the local web site (and, when feasible, at the search engine), the current problems with search engine registration and integrity can be eliminated.
The present invention defeats the key problems with automated agents and manual registration and replaces them with an exception based, distributed processing system. Instead of making the search engine do all the work necessary to index a site, the web site owner is now responsible for that operation. By distributing the work, the search engine is improved in these ways:
The process is begun by distributing a set of search engine update software tools to the web site owner. These tools can be implemented in one of three ways. The first way is to implement the tools on the web server of the site owner. The software can run automatically, having direct access to all resources on the web site. The second way is to install the software tools on a surrogate server. This surrogate is a computer with proper permissions and access to the resources of the web site and automatically accesses those resources over the network. The third way is through the use of client-side tools. The software will run on each client's computer, check the client's web server via internet protocols, and relay the information on the web server to the search engine.
The software could be written in a variety of different programming languages and would be applicable for as many client and server computers as needed.
Upon initial execution, the software builds a database of the resources on the web site. The resources catalogued can be specified by the user, or automatically through spidering functions of the software. The database consists of one record per resource indexed on the site. Each record contains fields including:
Upon each subsequent execution the software tools inspect the current state of the web site against the content of the database. When altered, removed, or additional content is found, the software tools make the appropriate changes to the database and then notify the search engine of those changes (see FIG. 1, Box 206 a, 207 b-c). Changes to the database are made as follows:
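A minimal sketch of this comparison step, assuming each database record stores a content digest of the resource. The digest field, the dictionary layout, and the names `digest` and `detect_changes` are assumptions made for illustration; the sample URLs are hypothetical.

```python
import hashlib


def digest(content):
    """Fingerprint of a resource body (an assumed record field)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def detect_changes(database, site):
    """Compare the stored records with the current site contents.

    database: dict mapping URL -> digest recorded on the previous run
    site:     dict mapping URL -> current resource text
    Returns (added, altered, removed) and brings the database up to date.
    """
    added = sorted(url for url in site if url not in database)
    removed = sorted(url for url in database if url not in site)
    altered = sorted(
        url for url in site
        if url in database and database[url] != digest(site[url])
    )
    for url in removed:
        del database[url]
    for url in added + altered:
        database[url] = digest(site[url])
    return added, altered, removed


# Hypothetical first run, then a second run after the site has changed.
db = {}
first = detect_changes(db, {"/a.htm": "one", "/b.htm": "two"})
second = detect_changes(db, {"/a.htm": "one", "/b.htm": "TWO", "/c.htm": "three"})
```

Only the three returned lists would need to be sent to the search engine, which is the exception-based behaviour the text describes.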
Through application of the present invention, the following improvements are made in search engine administration:
The main aspect of the present invention is to provide a method to index locally at a web site all changes to that site's resource content database that have occurred since the last search engine indexing.
Another aspect of the present invention is to actively transmit said changes to an internet search engine.
Another aspect of the present invention is to automatically transmit batches of updates (a list of content that has changed since the last search engine index), in a predetermined manner.
Other objects of this invention will appear from the following description and appended claims, reference being had to the accompanying drawings forming a part of this specification wherein like reference characters designate corresponding parts in the several views.
FIG. 1 is a flowchart of the steps to select which search engines will receive updates and which files shall be updated on those search engines.
FIG. 2 is a diagram of the decision tree for determining the state of a specific resource on a particular search engine database, and the action needed to update the internet search engine as enabled in FIG. 1.
FIG. 3 is a diagram of the Internet search engine update process of updating the files as in FIG. 1 and resources defined by FIG. 2.
Before explaining the disclosed embodiment of the present invention in detail, it is to be understood that the invention is not limited in its application to the details of the particular arrangement shown, since the invention is capable of other embodiments. Also, the terminology used herein is for the purpose of description and not of limitation.
The present invention can be used on new Internet search engine systems, or existing systems can be adapted for use by existing search engines having the following characteristics:
In addition, if a search engine allows search results to be constrained to one particular site, that completes the functionality requirements of the present invention.
The technical effort required to apply the present invention to existing Internet search engines is similar to that required to apply the invention to a new search engine. The most complex instance would be to apply the invention to a range of search engines, some of which have been designed with the invention in mind, some of which have not. The aforementioned instance will be assumed here.
As implemented, the invention is a server-side process, running either on a surrogate server or the actual server upon which the web site is stored. The process is coded as a program in the Perl programming language, although other languages such as C++ or Java could be used. The process is invoked regularly by the operating system of the computer on which the program resides or manually by a web site manager.
As such, there are three main areas of the preferred embodiment that need to be understood. They are:
Installation of the software tools places a number of CGI scripts, database tables, and HTML forms on the server. Each element performs a specific function relevant to the process and is outlined below. Initially, there is a database Table of Search Engines, containing an entry for each Internet search engine. The table below illustrates the format of a typical search engine record.
| Field | Type | Default | Description |
| Name | String | None | The name of the search engine |
| Enabled | Boolean | True | Whether the search engine is to be informed of changes to content |
| Table of Files | Table | None | Database table of files indexed on this site and for which changes must be tracked |
| Register by default | Boolean | True | Whether to register a resource on this search engine in the absence of explicit information provided by the site manager |
| Max registrations | Integer | None | The maximum number of registrations allowed per day by this search engine |
| Limit to site | Boolean | None | Whether the search engine allows searches to be restricted to one web site only |
| Lists index date | Boolean | None | Whether the search engine will report the date a resource was last indexed |
| Lists index time | Boolean | None | Whether the search engine will report the time a resource was last indexed |
| Index time | Integer | None | Typical delay between registration time and indexing of a site by the search engine |
| Supports file lookup | Boolean | None | Whether the search engine will allow a particular file to be searched for |
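For readers who prefer code to tables, the search engine record could be modelled roughly as follows. This is a sketch only: the attribute names are transliterations of the field names, `None` stands for the "None" default, the unit of the index delay is an assumption, and the engine names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SearchEngineRecord:
    name: str
    enabled: bool = True
    table_of_files: dict = field(default_factory=dict)
    register_by_default: bool = True
    max_registrations: Optional[int] = None      # per day
    limit_to_site: Optional[bool] = None
    lists_index_date: Optional[bool] = None
    lists_index_time: Optional[bool] = None
    index_time_days: Optional[int] = None        # unit is an assumption
    supports_file_lookup: Optional[bool] = None


def engines_to_update(engines):
    """Only enabled engines are informed of changes to content."""
    return [engine.name for engine in engines if engine.enabled]


engines = [
    SearchEngineRecord("EngineA"),
    SearchEngineRecord("EngineB", enabled=False),
]
active = engines_to_update(engines)
```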
The user is provided with an HTML form and CGI script, hereinafter referred to as a CGI program, in order to configure the Enabled and Table of Files fields (see FIG. 1, Box 100 - 101 ). The information the user inputs is submitted over the Common Gateway Interface (FIG. 1, Box 102 ) and the referenced CGI script updates the database tables as instructed (FIG. 1, Box 103 - 105 ). The user can thus enable (i.e., select) and disable a particular search engine using this interface. A search engine that is disabled in the database is simply skipped during an update.
The Table of Files is a field in the Table of Search Engines database. It is initially configured by the user through a CGI program (FIG. 1, Box 200 ) to list the files the user wishes to be registered with this search engine. This table contains a record for each resource. Each record contains the following fields:
| Field | Type | Default | Description |
| Name | String | None | The URL of the resource |
| To Be Registered | Boolean | False | Whether the resource needs to be registered with this search engine |
| To Be Unregistered | Boolean | False | Whether the resource needs to be unregistered (removed) from this search engine |
| Date and time last registered | Date and Time | None | Date and time the file was last registered with the search engine |
| Register | Enum (True, False, By default) | By default | Whether the site manager wants the file to be registered on this search engine. The ‘By default’ value indicates to follow the value of the ‘Register by default’ field of the search engine record of the database |
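The three-valued Register field is the only part of this record with non-obvious behaviour, so a short sketch may help. The type and function names below are hypothetical, and the URL is a placeholder; the resolution rule itself follows the description of the ‘By default’ value.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Register(Enum):
    TRUE = "True"
    FALSE = "False"
    BY_DEFAULT = "By default"


@dataclass
class FileRecord:
    name: str                                   # URL of the resource
    to_be_registered: bool = False
    to_be_unregistered: bool = False
    last_registered: Optional[datetime] = None
    register: Register = Register.BY_DEFAULT


def wants_registration(record, register_by_default):
    """Resolve the Register field against the engine-level default."""
    if record.register is Register.BY_DEFAULT:
        return register_by_default
    return record.register is Register.TRUE


page = FileRecord("http://www.example.com/index.htm")
follows_default = wants_registration(page, register_by_default=True)
```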
The Table of Files is a list of the above records. The list is built by first obtaining the set of resources the user wishes to maintain and register with a search engine (FIG. 1, Box 201 ). The user enters the files they wish to monitor into a CGI program and submits the form (FIG. 1, Box 203 a-c, Box 204 a-c). The form allows the user to choose from many methods of building the Table of Files. These methods include, but are not limited to:
The list of pages built by the above process forms the Name fields of the Table of Files records for each search engine. This process can be performed globally (on all search engines in the table of search engines), on a group of search engines or on an individual search engine, as indicated by the user (FIG. 1, Box 206 a, 207 b, 207 c).
Submitting the above form also invokes a CGI script to set the Enabled and ‘Register by default’ fields of the appropriate search engine record according to the preferences of the user. Additionally, a page is provided where the title, URL and Meta Description of each page would be substituted in the appropriate place in the table for each search engine.
Submitting this additional information invokes a CGI script to set the Register field of the Table of Files field for the appropriate search engine record, according to preferences of the user.
IIV. The Process by Which the Database is Constructed and Updated
The process now looks up each file and determines whether the file is registered, current, out of date, or deleted with respect to its registration on the search engine.
There are eight possible states for the file to be in with respect to its registration. In order for the process to be deterministic, all random spidering activity by the search engine is ignored in determining the state of the file. The state is determined purely by the current registration and the data the process has stored in the database of activities performed by previous invocations of itself.
FIG. 2 illustrates the decision process to determine the state of a resource on the search engine (Box 1 ) and the action that must be taken. A resource can be in the following states:
| Deleted (2a) | The resource no longer exists on the web site. If the resource exists in the search engine database, an error is signaled. |
| Awaiting indexing (2b) | The resource is not in state 2a. The resource should shortly be indexed by the search engine and should not be registered now. |
| Out of date (2c) | The resource is not in state 2a or 2b. The resource is not due to be indexed by the search engine, but has been modified since it was last indexed by the search engine. |
| Well registered (2d) | The resource is not in state 2a, 2b, or 2c. The resource has not been modified since last indexed and its listing on the search engine is correct. |
| Wrongly registered (2e) | The resource is not in state 2a, 2b, 2c, or 2d. The resource is listed on the search engine, but the web site manager does not want it to be. |
| Wrongly unregistered (2f) | The resource is not in state 2a, 2b, 2c, 2d, or 2e. The web site manager wishes the resource to be registered by the search engine, but the resource is not registered by the search engine or due to be indexed by the search engine. |
| Correctly unregistered (2g) | The resource is not in state 2a, 2b, 2c, 2d, 2e, or 2f. The resource is not registered, not due to be indexed, and the user does not wish it to be. |
| Will be indexed in error (2h) | The resource is not in state 2a, 2b, 2c, 2d, 2e, 2f, or 2g. The resource is not listed by the search engine and the site manager does not wish it to be. However, the file will shortly be indexed by the search engine and the site configuration currently would not prevent this. |
The following are the actions to be taken in each state (see FIG. 2 ):
| Deleted (3a) | The resource no longer exists on the web site. The process attempts to remove the resource entry from the search engine database with a CGI program provided by the engine for this purpose (4a). |
| Awaiting indexing (3b) | No action is taken. |
| Out of date (3c) | The resource has been modified since it was last indexed by the search engine. The process attempts to register the resource for re-indexing with a CGI program provided by the engine for this purpose. |
| Well registered (3d) | No action is taken. |
| Wrongly registered (3e) | The process attempts to remove the resource entry from the search engine index using a CGI program provided by the search engine for this purpose. |
| Wrongly unregistered (3f) | The process attempts to add the resource to the search engine index using a CGI program provided by the search engine for this purpose. |
| Correctly unregistered (3g) | No action is taken. |
| Will be indexed in error (3h) | The web site manager is warned through the process reporting mechanism (e-mail, a web page, or other method) that the manager does not want the resource to be indexed, but the search engine will shortly index it and there are no safeguards in place to prevent this. The site manager can take appropriate steps to avoid registration (4b), or registration will take place (4c). |

    For each enabled search engine in DatabaseLookup(table of search engines)
        list of files = search engine table of files
        If search engine.limit to site
            search engine files = SearchEngineLookup(all files reported by search engine for this site)
            list of files = list of files + search engine files
        End if
        For each file in list of files
            last index date time = GetIndexDateTime(file, search engine)
            If FileExists(file, list of files)
                If search engine.table of files.file.toberegistered
                    RegisterFile(file, search engine)
                    Next For [each file in list of files]
                End if
                last modification date time = GetLastModificationDateTime(file)
                will be indexed = WillBeIndexed(file, search engine, last index date time)
                should be registered = ShouldBeRegistered(file, search engine)
                If last index date time != not found
                    If should be registered
                        If last modification date time > last index date time
                            If will be indexed
                                AddReport(“awaiting indexing”, file)
                            Else
                                AddReport(“out of date”, file)
                                RegisterFile(file, search engine)
                            End if
                        Else
                            AddReport(“well registered”, file)
                        End if
                    Else [File is registered but should not be]
                        AddReport(“wrongly registered”, file)
                        UnRegisterFile(file, search engine)
                    End if
                Else [File is not registered]
                    If should be registered
                        AddReport(“wrongly unregistered”, file)
                        RegisterFile(file, search engine)
                    Else
                        If will be indexed
                            AddReport(“will be indexed in error”, file)
                        Else
                            AddReport(“correctly unregistered”, file)
                        End if
                    End if
                End if
            Else [File does not exist]
                AddReport(“deleted”, file)
                If last index date time != not found
                    UnRegisterFile(file, search engine)
                End if
            End if [File exists]
        End For
    End For
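The branch structure of the listing above can be condensed into one pure function, shown here as a sketch rather than as the patented process. It uses the state names from the table of states, takes the results of the lookups as plain booleans, and leaves out the retry of files already flagged for registration. The function name and the action labels `"register"`, `"unregister"`, and `"warn"` are hypothetical.

```python
def classify(exists, indexed, should_register, modified_since_index, will_be_indexed):
    """Return (state, action) for one resource on one search engine.

    exists:               the resource is still present on the web site
    indexed:              the search engine reported an index date for it
    should_register:      the site manager wants it listed
    modified_since_index: last modification is later than the index date
    will_be_indexed:      a pending registration is expected to be processed
    """
    if not exists:
        return "deleted", ("unregister" if indexed else None)
    if indexed:
        if not should_register:
            return "wrongly registered", "unregister"
        if not modified_since_index:
            return "well registered", None
        if will_be_indexed:
            return "awaiting indexing", None
        return "out of date", "register"
    if should_register:
        return "wrongly unregistered", "register"
    if will_be_indexed:
        return "will be indexed in error", "warn"
    return "correctly unregistered", None


state, action = classify(
    exists=True,
    indexed=True,
    should_register=True,
    modified_since_index=True,
    will_be_indexed=False,
)
```

Keeping the classification free of side effects makes each of the eight states easy to check in isolation before any request is sent to a search engine.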
There are three ways the process may update a search engine:
In practice, these three activities are usually performed by the same CGI program on current search engines. This CGI program is the ‘register file’ program and is run manually by the user or automatically (FIG. 3, Box 100 ). An HTML form is provided for the purpose of adding a resource to the search engine index. On submitting the form, a CGI script is invoked. The most common mode of action for this script is as follows:

    On RegisterFile(file, search engine)
        Check that the file is appropriate for the search engine
        If file is appropriate or IsRegistered(file, search engine)
            If file is not appropriate
                AddReport(“inappropriate file registered”, file)
            End if
            If !(file in DatabaseLookup(search engine, table of files))
                AddFileToDatabase(search engine, file)
            End if
            If SearchEngineRegistrationsOK(file, search engine)
                SearchEngineRegisterFile(file)
                If file registered OK
                    search engine.table of files.file.date last registered = today's date
                    search engine.table of files.file.time last registered = now
                    AddReport(“file registered”, file)
                    search engine.table of files.file.toberegistered = false
                Else
                    AddReport(“registration failed”, file)
                    search engine.table of files.file.toberegistered = true
                End if
            Else
                AddReport(“registration delayed”, file)
                search engine.table of files.file.toberegistered = true
            End if
        Else
            AddReport(“registration failed - inappropriate file”, file)
        End if
    End RegisterFile

    On UnRegisterFile(file, search engine)
        SearchEngineUnRegisterFile(file)
        If file unregistered OK
            AddReport(“file unregistered”, file)
            search engine.table of files.file.tobeunregistered = false
        Else
            AddReport(“unregistration failed”, file)
            search engine.table of files.file.tobeunregistered = true
        End if
    End UnRegisterFile
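The listing does not define `SearchEngineRegistrationsOK`. One plausible reading, offered here purely as an assumption, is that it enforces the ‘Max registrations’ field, the maximum number of registrations allowed per day. The sketch below shows that reading; `RegistrationQuota` and `register_all` are hypothetical names.

```python
from datetime import date


class RegistrationQuota:
    """Tracks registrations per day against an engine's daily maximum."""

    def __init__(self, max_per_day):
        self.max_per_day = max_per_day   # None means no limit
        self.day = None
        self.count = 0

    def try_consume(self, today):
        """Count one registration if today's allowance is not used up."""
        if today != self.day:            # a new day resets the counter
            self.day, self.count = today, 0
        if self.max_per_day is not None and self.count >= self.max_per_day:
            return False
        self.count += 1
        return True


def register_all(files, quota, today):
    """Register what the quota allows; the rest stay flagged for later."""
    registered, delayed = [], []
    for name in files:
        if quota.try_consume(today):
            registered.append(name)
        else:
            delayed.append(name)         # would keep toberegistered = true
    return registered, delayed


quota = RegistrationQuota(max_per_day=2)
day_one = register_all(["/a.htm", "/b.htm", "/c.htm"], quota, date(2000, 1, 1))
day_two = register_all(day_one[1], quota, date(2000, 1, 2))
```

Files in the delayed list correspond to the “registration delayed” report: they keep their to-be-registered flag and are picked up on a later run.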
The present invention would:

    On DatabaseLookup(table of search engines)
        return table of search engines
    End DatabaseLookup(table of search engines)

    On DatabaseLookup(search engine, table of files)
        return table of files(search engine)
    End DatabaseLookup(search engine, table of files)

    On AddFileToDatabase(search engine, file)
        table of files(search engine) += file
    End AddFileToDatabase(search engine, file)

    On SearchEngineLookup(all files reported by search engine for site)
        list of files = ( )
        page number = 1
        site links = SearchEngineGetPage(search engine, site, page number)
        While number of site links > 0
            list of files += site links
            increment page number
            site links = SearchEngineGetPage(search engine, site, page number)
        End while
        return list of files
    End SearchEngineLookup(all files reported by search engine for site)

    On FileExists(file, list of files)
        If file is local
            Perform stat of file
            return stat.exists
        Else
            Perform HTTP head request of file
            If head request indicates that file exists
                return file exists
            Else
                return file not exists
            End if
        End if
    End FileExists(file, list of files)

    On GetLastModificationDateTime(file)
        If file is local
            Perform stat of file
            return stat.LastModificationDate
        Else
            Perform HTTP head request of file
            return response.LastModifiedDate
        End if
    End GetLastModificationDateTime(file)

    On GetIndexDateTime(file, search engine)
        If search engine.lists index date
            If search engine supports file lookup
                If !LookupFile(search engine, file)
                    last index date time = not found
                Else
                    last index date time = lookup.date
                    If search engine.lists index time
                        last index date time += lookup.time
                    End if
                End if
            Else
                last index date time = not found
                For each phrase in file
                    While GetNextSearchEnginePage(search engine, phrase)
                        If search engine page lists file
                            last index date time = searchpage.file.date
                            If search engine.lists index time
                                last index date time += searchpage.file.time
                            End if
                            Exit For [each phrase in file]
                        End if
                    End While
                End For
            End if
            If last index date time != not found
                Translate last index date time to server time
            End if
            return last index date time
        Else
            If file.date and time last registered is set
                return file.date and time last registered + search engine.index time
            End if
            return not found
        End if
    End GetIndexDateTime(file, search engine)

    On WillBeIndexed(file, search engine, last index date time)
        If file.date and time last registered is set
            If last index date time > file.date and time last registered
                return false
            End if
            predicted index date time = file.date and time last registered + search engine.index time
            return (predicted index date time > today now)
        Else
            return false
        End if
    End WillBeIndexed(file, search engine, last index date time)

    On ShouldBeRegistered(file, search engine)
        If search engine supports ROBOTS tag
            If file contains ROBOTS tag
                return !(ROBOTS tag contains NOINDEX)
            End if
        End if
        If search engine supports robots.txt file
            If site has robots.txt file
                return !(file excluded by robots.txt)
            End if
        End if
        return search engine.register by default
    End ShouldBeRegistered(file, search engine)

    On AddReport(descriptive text, file)
        set report = report + file + descriptive text
    End AddReport(descriptive text, file)
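Two of these helpers carry the time and precedence logic that the state decision depends on, so they are restated below as a hedged Python sketch. The argument names are hypothetical; `None` stands in for "not set", "not found", or "not applicable", and the seven-day delay in the usage lines is an invented value.

```python
from datetime import datetime, timedelta


def will_be_indexed(last_registered, last_indexed, index_delay, now):
    """True while a pending registration has not yet been processed."""
    if last_registered is None:
        return False
    if last_indexed is not None and last_indexed > last_registered:
        return False                     # the registration was already honoured
    return last_registered + index_delay > now


def should_be_registered(meta_noindex, robots_excluded, register_by_default):
    """ROBOTS tag wins, then robots.txt, then the engine default.

    meta_noindex and robots_excluded are None when the page has no ROBOTS
    tag or the site has no robots.txt (or the engine ignores them).
    """
    if meta_noindex is not None:
        return not meta_noindex
    if robots_excluded is not None:
        return not robots_excluded
    return register_by_default


now = datetime(2000, 1, 10)
delay = timedelta(days=7)                # invented typical index delay
pending = will_be_indexed(datetime(2000, 1, 5), None, delay, now)
expired = will_be_indexed(datetime(2000, 1, 1), None, delay, now)
```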
| Field | Type | Format | Description |
| Proxy | String | None | The location of the proxy for the file |
Whenever the process registers a resource with the search engine, it could deliver the proxy to the search engine in place of the resource itself. The format of the proxy file could be plain text, or HTML to allow current indexing techniques to continue to work. The format of the proxy file could also be any other markup language, for instance XML. The principle remains the same: a text file is used in place of any other file or set of files. This method will allow, for example, Java, embedded objects, graphics, frames, and other file formats to be indexed.
Spamming is a potential problem when using proxy files. The idea of the proxy file is that the search engine uses it to create an index, but the search engine user links to the real file in response to a search query. Clearly, if the contents of the proxy file and the real file do not match, the user will not get what they are expecting. For example, a rogue site owner may set up the proxy file to catch a lot of queries about sex (the most searched for term on the Internet), when in fact their page is trying to persuade you to join their online gambling syndicate.
Spamming will only occur when there is a breakdown of trust between the site owner and search engine owner. The site owners could sign an online contract to guarantee that they will not spam. By signing the contract, they are provided with the embodiment of the process in order to register and maintain their registration with the search engine. If, through spamming, the contract is broken, the search engine can discontinue listing pages temporarily or permanently for the web site in question. It may also be able to take legal action. There are also programmable and scalable methods of defeating spamming—they are irrelevant to this discussion.
It is important to emphasize that web site owners do not have to use the tools provided for their sites to be registered. The search engine can still spider sites whose owners do not use the tools provided, in the same way as conventional search engines spider sites. For sites that are deemed appropriate, the search engine can even set up a surrogate server to implement the present invention on behalf of a non-participating site owner. The present invention is not limited in its application to the details of the particular arrangement shown, since the invention is capable of other embodiments. Also, the terminology used herein is for the purpose of description and not of limitation.