[0001] 1. Field of the Invention
[0002] The present invention relates to a method and system for simplifying Web contents, and more particularly to a technique for simplifying the contents on the fly, even in the case of Web pages which do not have history information or whose URL (Uniform Resource Locator) changes day by day.
[0003] 2. Background of the Invention
[0004] In recent years, use of the Internet has become widespread owing to progress in network technologies, improved functionality of information apparatus, and falling costs. Because corporations and individuals alike can transmit detailed information at low cost and without regard to national borders, Web pages as a source of information are increasing explosively day by day. Furthermore, vast amounts of information are updated under the control of the administrators of Web pages. In this context, the Internet and the Web pages that use it are becoming an important information-gathering medium that replaces or complements conventional broadcasting and mass media.
[0005] Meanwhile, the roles of Web pages are diversifying. For example, beyond mere information transmission, business transactions (electronic commerce) and collaboration are now carried out via Web pages. To implement these diversified functions, Web pages are provided with greater convenience. Also, so that the intended information can be accessed more rapidly, functions that improve user operability, for example a search screen, are incorporated in Web pages. Examples include a link list used in common across the site, an image map, and a form. These are included in every page and provide functions that are very convenient for general users.
[0006] However, these general Web pages are designed on the premise of a desktop computer screen. That is, their layouts are designed for the size of a desktop computer screen. Hence, with a device with a small screen (hereinafter referred to as a small screen device), such as a PDA (Personal Digital Assistant) or a cellular phone, or with software that reads a Web page aloud (hereinafter referred to as a voice browser), there is a problem that the user cannot reach the necessary information quickly. Namely, in general Web pages a form and an image map are laid out at the top of the page, so a small screen device must display these forms and similar elements over many screens before the necessary information is reached. Likewise, a voice browser reads the necessary information aloud only after these forms and similar elements have been read aloud. A small screen device generally does not need the visual multi-functionality of a desktop computer, and a voice browser does not need visual functions for improving operability. Rather, these visual functions are an obstacle for small screen devices and voice browsers.
[0007] Therefore, simplification techniques that omit a part of a Web page have been attempted, for example, the “Dharma Transcoding” technique described in “Annotation Based Web Content Transcoding” by Masahiro Hori et al. (http://www9.org/w9cdrom/169/169.html) and the “DiffWeb” (difference) technique described at the web site “http://www.diffweb.com/”.
[0008] The “Dharma transcoding” technique divides an existing Web page into several pages while keeping a layout similar to the original, creating pages that are easily displayed on a small screen device. This technique needs external annotation information that describes in detail the structure of the pages and the significance of each part.
[0009] The “DiffWeb” technique calculates and presents the difference between a Web page that was registered and saved in advance and the current Web page. According to this technique, a list of pages can be registered per user and the differences of these pages can be calculated. With this difference technique, all of the processing, such as page registration, storage, and the difference operation, is performed at the direction of users. Similar difference techniques include “HTML Diff” described in “The C3 Project at Stanford” (http://www-db.stanford.edu/c3/c3.html), and “Mindit”, described at the web site “http://mindit.netmind.com/mindit.shtml”.
[0010] However, the “Dharma transcoding” technique needs the annotation information, as described above. Providing the annotation information requires human intervention, for example by a volunteer, so it is difficult to automate the technique completely.
[0011] With the “DiffWeb” technique, page registration, storage, and the difference operation are processed at the direction of a user, as described above. Thus, the difference operation cannot be performed as on-the-fly processing. Also, in the case of a pull-down menu, there is a risk that the character strings forming its contents are deleted, so that the form does not work properly after simplification.
[0012] Moreover, according to the prior techniques, the simplification is implemented by calculating a difference against a comparative page which has been saved in advance. Therefore, the following problems exist.
[0013] First, if the comparative page has not been saved in advance, the simplification cannot be performed. That is, only a page for which a comparative page has been recorded can become a target for simplification, so a page that appears for the first time cannot be simplified.
[0014] Secondly, even if the comparative page has been saved, a page whose URL changes day by day cannot be simplified. For example, an article page of the Asahi Shinbun (www.asahi.com) includes the date in the URL, as in “http://www.asahi.com/0530/news/business30010.html”. In this case, there is no past page that has the same URL, and therefore the simplification cannot be performed.
[0015] Thirdly, even necessary information might be deleted. For example, important information such as the title of a link list or a form might be deleted. Conversely, unnecessary subtle changes in character strings might be retained.
[0016] It is therefore a feature of the present invention to provide a technique for the simplification of Web pages in order to access necessary information rapidly, when displaying or outputting Web pages using a small screen device or a voice browser.
[0017] It is another feature of the invention to provide a technique for performing the simplification of Web pages even if there is no past page of the same URL.
[0018] It is a further feature of the invention to provide a technique for performing simplification of Web pages on the fly.
[0019] It is a still further feature of the invention to provide a technique for simplifying unnecessary information with high precision, without losing important information upon simplification of Web pages.
[0020] Specifically, a feature of the present invention comprises the method steps of acquiring a target page subject to simplification, acquiring adjoining pages that adjoin the target page, and performing a difference operation to delete objects that are common among the target page and the adjoining pages from the target page to generate a simplified page.
[0021] Various other objects, features, and attendant advantages of the present invention will become more fully appreciated as the same becomes better understood when considered in conjunction with the accompanying drawings, in which like reference characters designate the same or similar parts throughout the several views.
[0044] The present invention provides a technique that uses a difference operation to transform a Web page, on the fly, into a page that is easy to read with a voice browser or a small screen device. Here, “the difference operation” means, for example, an operation that calculates the differences between two HTML documents. The difference operation used to simplify a Web page according to the present invention uses not only a past page having the same URL but also adjoining pages as comparative pages against which the target page subject to simplification is compared. Through this comparison, only the information that has been updated is retrieved, and the template information common to the pages of the site is removed. In this way, only the main contents of the target page are retrieved.
[0045] According to the present invention, adjoining pages for comparison are automatically acquired, so the simplification of Web pages can be performed on the fly even if there is no stored information on the past page. As a result, necessary information can be accessed quickly when using a small screen device or a voice browser.
[0046] The adjoining pages comprise pages of URLs whose directory or parent directory is common with the URL of the target page or the URL of the links included in the target page; a top page of each directory under the root directory of the target page; a past page of the target page; pages of the links included in the past page; or past pages of the adjoining pages.
[0047] It is possible to prioritize the URLs of the adjoining pages after acquiring the adjoining pages, wherein the priority is determined based on the edit distance between the URL of the target page and the URLs of the adjoining pages, or on the relevance among URLs based on the number of co-occurrences or the number of cross-references between the target page and the adjoining pages.
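As an illustration only, the edit-distance prioritization can be sketched in Python as follows. The function names `edit_distance` and `prioritize_by_edit_distance` are hypothetical, and the use of a plain character-level Levenshtein distance with unit costs is an assumption; the specification does not state which edit distance or cost model is used.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1 each."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def prioritize_by_edit_distance(target_url: str, candidates: List[str]) -> List[str]:
    """Return candidate URLs ordered so that those closest to the target come first."""
    return sorted(candidates, key=lambda url: edit_distance(target_url, url))
```

Under this sketch, a URL that differs from the target URL by a single character in the file name is ranked ahead of the top page of the site.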
[0048] In said difference operation, DP matching can be used to determine whether the objects are common or not, and also the significance of the objects included in the target page is calculated. If the significance exceeds a predetermined threshold, the objects are not deleted even if they are common with the objects of the adjoining pages. On the contrary, objects with low significance are deleted.
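The following Python sketch illustrates one possible reading of this step. It assumes that a page has already been flattened into a list of node strings, that DP matching is realized as a longest-common-subsequence computation, and that the significance is supplied as a callable; the names `common_indices` and `difference` are hypothetical and do not come from the specification.

```python
from typing import Callable, List, Set


def common_indices(target: List[str], other: List[str]) -> Set[int]:
    """Indices of target nodes that lie on a longest common subsequence with other."""
    n, m = len(target), len(other)
    # lcs[i][j] holds the LCS length of target[i:] and other[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if target[i] == other[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk the table from the start to recover which target nodes were matched.
    matched: Set[int] = set()
    i = j = 0
    while i < n and j < m:
        if target[i] == other[j]:
            matched.add(i)
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched


def difference(target: List[str], comparative: List[str],
               significance: Callable[[str], float], threshold: float) -> List[str]:
    """Drop nodes common to both pages unless their significance exceeds the threshold."""
    common = common_indices(target, comparative)
    return [node for k, node in enumerate(target)
            if k not in common or significance(node) > threshold]
```

For example, with a target of `["Menu", "Headline", "Story text", "Footer"]` and a comparative page of `["Menu", "Old story", "Footer"]`, the common nodes "Menu" and "Footer" are removed unless the significance function rates them above the threshold.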
[0049] The significance is represented by the sum of weighted feature values. The feature values comprise the character size of the objects, a numerical value assigned to fonts and other character attributes, a numerical value identifying whether the objects are a banner, a displacement value of the objects from the center of the screen, the number of keywords included in the objects, a numerical value assigned to the information indicating whether the objects are added or updated, the ratio of updated characters of the objects, a numerical value assigned to the information indicating whether the objects consist of one character, a numerical value assigned to the tag class of the objects, etc.
[0050] Further, post-processing can be performed after the difference operation. The post-processing includes restoration of a list title, restoration of information at the top or side of a table, movement of a form to the rear of the page, or reference to annotation information.
[0051] It is also possible to receive a request from a user terminal, perform each of said steps in response to the request, select the simplified page which has the least amount of information among the simplified pages, and send the selected simplified page to the user terminal. The user terminal can be a computer system in which a voice browser operates or an information terminal which has a display with a small screen. Alternatively, the user terminal or a computer system connected to the user terminal may provide a voice recognition function and a voice synthesis function, wherein a request is input by voice and the simplified page is output by voice.
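The selection of the simplified page with the least amount of information can be sketched as below. This is a minimal Python sketch under stated assumptions: each simplified page is represented as a list of node strings, and the "amount of information" is measured as the total character count, which is only one plausible measure. The names `information_amount` and `select_minimum` are hypothetical.

```python
from typing import List, Sequence


def information_amount(page: List[str]) -> int:
    """Assumed measure: total number of characters over all remaining nodes."""
    return sum(len(text) for text in page)


def select_minimum(simplified_pages: Sequence[List[str]]) -> List[str]:
    """Return the simplified page with the least information.

    Raises ValueError if no simplified page is supplied.
    """
    return min(simplified_pages, key=information_amount)
```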
[0052] Now the present invention will be described with reference to the accompanying drawings.
[0053] However, the present invention can be implemented in various forms and is not limited to the embodiments described herein. Note that the same elements are referred to by the same reference numbers throughout the drawings.
[0054] In the following embodiments, the present invention will be described as a method and system; however, it can also be implemented as a medium in which a program for use with a computer is recorded. Therefore, the present invention can take the form of hardware, the form of software, or a combination thereof. The medium recording the program may be any computer-readable medium, for example, a hard disk, a CD-ROM, an optical storage device, a magnetic storage device, etc.
[0055] Also, in the following embodiments, a typical computer system can be used. It includes a CPU, a main memory (RAM), a nonvolatile storage (ROM), etc., all of which are interconnected by a bus. Further, a co-processor, an image accelerator, a cache memory, and an input-output (I/O) controller may be connected to the bus. In addition, an external storage, a data input device, a display device, a communication controller, etc., are connected via an interface. Needless to say, other hardware resources with which a computer system is typically equipped may also be provided. A representative external storage is a hard disk drive; however, a magneto-optical disk, an optical storage, and a semiconductor storage such as a flash memory are also included. A data input device includes an input device such as a keyboard and a pointing device such as a mouse. A data input device further includes an image reader such as a scanner, and also a voice input device. A display device includes a CRT, an LCD, a plasma display device, etc. A computer system includes a variety of computers such as a personal computer, a workstation, a mainframe computer, etc.
[0056] The first embodiment of the present invention:
[0057]
[0058] The user terminal
[0059] The proxy server
[0060] The Web server
[0061] The Web server
[0062] According to the embodiment of the present invention, a user specifies an address of the proxy server
[0063] The function of the processing of the present system is as follows. The user terminal
[0064]
[0065] Adjoining URL listing module:
[0066]
[0067] The adjoining URL listing module
[0068] First, in response to the request from the proxy server
[0069] Which list of these modules
[0070] Now an example of URL selection will be described below.
[0071] (1) Directory common URL selection:
[0072] URL of the target page:
[0073] http://www.asahi.com/0606/news/national06015.html
[0074] URLs Listed (a part):
[0075] http://www.asahi.com/a0606/news/national06012.html
[0076] http://www.asahi.com/0606/news/national06013.htm
[0077] http://www.asahi.com/0606/news/national06014.html
[0078] (2) Parent directory common URL selection:
[0079] URL of the target page:
[0080] http://www.cnn.com/2000/US/06/05/sea.based.defence/index.html
[0081] URLs selected (a part):
[0082] http://www.cnn.com/2000/US/06/05/dday.remenbrance/index.html
[0083] http://www.cnn.com/2000/US/06/05/helicopter.escape.03/index.html
[0084] http://www.cnn.com/2000/US/06/05/curbing.terrorism.02/index.html
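The two selections illustrated above can be sketched in Python as follows. This is a sketch under assumptions: the candidate URLs are taken to be supplied by the caller (for example, the links found in the target page), "directory" is taken to mean the URL path with its last segment removed, and the function names are hypothetical rather than taken from the specification.

```python
import posixpath
from typing import Iterable, List
from urllib.parse import urlsplit


def directory_of(url: str) -> str:
    """Host plus the directory part of the path, e.g. 'www.asahi.com/0606/news'."""
    parts = urlsplit(url)
    return parts.netloc + posixpath.dirname(parts.path)


def parent_directory_of(url: str) -> str:
    """Host plus the parent of the directory that contains the resource."""
    parts = urlsplit(url)
    return parts.netloc + posixpath.dirname(posixpath.dirname(parts.path))


def select_directory_common(target: str, candidates: Iterable[str]) -> List[str]:
    """Candidates located in the same directory as the target page."""
    wanted = directory_of(target)
    return [u for u in candidates if u != target and directory_of(u) == wanted]


def select_parent_directory_common(target: str, candidates: Iterable[str]) -> List[str]:
    """Candidates whose parent directory is the same as that of the target page."""
    wanted = parent_directory_of(target)
    return [u for u in candidates if u != target and parent_directory_of(u) == wanted]
```

With the CNN target page of example (2), `select_parent_directory_common` keeps the other article pages under `/2000/US/06/05/` because each of them sits in its own sub-directory of that common parent.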
[0085] Directory listing module:
[0086] The directory listing module
[0087] URL of the Target Page:
[0088] http://www.cnn.com/2000/US/06/05/helicopter.escape.03/index.html
[0089] URLs selected (a part):
[0090] http://www.cnn.com/2000/US/06/05/
[0091] http://www.cnn.com/2000/US/06/
[0092] http://www.cnn.com/2000/US/
[0093] http://www.cnn.com/2000/
[0094] http://www.cnn.com/
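The listing shown in this example, which walks from the parent of the target page's directory up to the root, can be reproduced with the following Python sketch. The behavior is inferred from the example URLs alone, and the function name `list_ancestor_directories` is hypothetical.

```python
import posixpath
from typing import List
from urllib.parse import urlsplit, urlunsplit


def list_ancestor_directories(url: str) -> List[str]:
    """List directory URLs from the parent of the page's directory up to the root."""
    parts = urlsplit(url)
    directory = posixpath.dirname(parts.path)  # directory that contains the page
    result: List[str] = []
    while True:
        directory = posixpath.dirname(directory)  # move one level up
        path = directory if directory.endswith("/") else directory + "/"
        result.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
        if directory in ("/", ""):
            break  # the root directory has been emitted
    return result
```

For a page that already sits directly under the root, this sketch returns only the root URL.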
[0095] URL Cache Module:
[0096]
[0097] Using the URL cache module
[0098] URL Priority Operation Module:
[0099] The URL priority operation module
[0100]
[0101] Next, the operation of the URL priority operation module
[0102] Next, the similarity is calculated between the target page
[0103] Next, the relevance of URLs is calculated using the URL relevance calculation module
[0104] Finally, the sort module
[0105] Before-update target page/adjoining page acquisition module
[0106]
[0107] The search module
[0108] The search key for the cache database
[0109] When the target page
[0110] Fetch Module:
[0111]
[0112] Difference Operation Module:
[0113] The difference operation module
[0114]
[0115] In this way, there are generated DOM trees of pages corresponding to the lists that are selected by each of the adjoining URL listing module
[0116] To prevent important nodes from being deleted in the course of the difference process of the embodiment, the significance of nodes is calculated in advance in the significance calculation module
[0117] Hereinafter, a technique using DP matching will be described in accordance with
[0118] Next, the DP matching
[0119] Significance Calculation Module:
[0120] In order to prevent important nodes (e.g., a character string indicating the title) from being deleted, the weighting is performed in advance for each of text node and image node of the target page. The common node deletion module
[0121] A method for calculating the significance of nodes will be described below. Other methods for calculating the significance could evidently also be applied. The technique shown here determines the significance as the weighted sum of several feature values. The significance S of each node is calculated by the following formula.
$$S = \sum_{i} W_i P_i$$
[0122] Where Pi is each feature value and Wi is the weighting for each feature value.
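The weighted sum of feature values can be written directly in Python, as in the sketch below. The feature names and weight values used in the usage comment are purely illustrative assumptions; the specification does not give concrete weights.

```python
from typing import Mapping


def significance(features: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum S of the feature values Pi with weights Wi.

    A feature without a configured weight contributes nothing (assumed weight 0).
    """
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


# Hypothetical usage: a node rendered 4 points above the default font size
# that contains 2 keywords, with illustrative weights 0.5 and 1.5.
example = significance({"char_size": 4.0, "keywords": 2.0},
                       {"char_size": 0.5, "keywords": 1.5})
```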
[0123] The following is an example of the feature values.
[0124] <Character size>
[0125] The feature value Pi is taken to be the difference between the character size when rendered and the default font size.
[0126] This is based on the empirical rule that the larger the character size, the higher the significance. Character attributes are also taken into account in the feature value Pi: a value is added to Pi depending on each attribute. For example, when a font style such as bold or italic, a color such as red, or an underline or double underline is specified, the significance is presumed to be high, so a value is added to Pi according to the attribute.
[0127] <Removal of banner by template>
[0128] An image link that is highly likely to be a banner has its significance lowered. A banner template uses as criteria the image size, a character string in the link destination (/doubleclick/, /ads/, etc.), and an immediately following link string (Click Here, etc.). The distance from the template can be used as a feature value Pi.
[0129] <Node position>
[0130] The weighting is performed in accordance with a position where a node is displayed when rendering. As is shown in
[0131] <Increase of significance by keyword detection>
[0132] The significance of a node that includes a keyword can be increased based on an analysis of the keywords of the target page. The feature value is the number of keywords included in the node, wherein the keywords include the important keywords that the system holds and the keywords determined by analyzing the page.
[0133] <Added nodes and updated nodes>
[0134] In order to increase the significance of added nodes (which are not included in a comparative page), the feature value
[0135] <Ratio of updated string of updated nodes>
[0136] In the case of updated nodes rather than added nodes, the ratio of the number of updated characters to the number of characters in the node can be set as the feature value, wherein Wi is a positive value.
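One way to obtain such a ratio is sketched below in Python using the standard `difflib` module. The choice of `SequenceMatcher` to decide which characters count as updated, and the function name `updated_ratio`, are assumptions made for illustration; the specification does not say how updated characters are identified.

```python
from difflib import SequenceMatcher


def updated_ratio(old: str, new: str) -> float:
    """Fraction of the characters in the new node text that were not matched in the old text."""
    if not new:
        return 0.0  # an empty node has no updated characters
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    # get_matching_blocks() ends with a zero-size sentinel, which adds nothing to the sum.
    unchanged = sum(block.size for block in matcher.get_matching_blocks())
    return (len(new) - unchanged) / len(new)
```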
[0137] <Decrease significance when one character>
[0138] In order to decrease the significance of the node with only one character, the feature value
[0139] <Tag class>
[0140] For some nodes, the significance can be determined directly from the tag class. Such a tag is assigned a feature value; the default is 0. For example, a positive value could be assigned to a form node in order to preserve it.
[0141] Common node deletion module:
[0142] The common node deletion module
[0143] Cleanup module:
[0144] The cleanup module
[0145] Minimum difference selection module:
[0146] The minimum difference selection module
[0147] According to the system and simplification method of the embodiment of the present invention, even if the past page does not exist, the comparative page can be acquired, so that the simplification of the target page can be performed. Furthermore, various adjoining pages (comparative pages) are acquired exhaustively, which enables more appropriate and higher-precision simplification. Moreover, since the significance of nodes is checked during the difference processing, necessary information is less likely to be deleted. Also, the cleanup module
[0148] Now an example will be shown where the system and simplification method of the embodiment of the present invention are applied to an actual Web page.
[0149]
[0150] Applying the system and simplification method of the embodiment of the present invention showed that the number of characters, the number of links, and the number of elements in a page are reduced to about half. Table 1 shows the result of applying the system and method of the present invention to several pages, including CNN, Asahi Shinbun, and SUNTIMES. Though there is some dispersion, the information is roughly reduced to 40%-60% of the original pages.
TABLE 1
            Number of Characters        Number of Links             Number of Elements
Site        original   transcoded       original   transcoded       original   transcoded
CNN         4,294      2,557 (60%)      167        75 (45%)         228        116 (51%)
Suntimes    3,446      2,770 (80%)      59         17 (29%)         93         41 (44%)
Asahi       1,880      1,020 (54%)      40         4 (10%)          65         13 (20%)
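The percentages in Table 1 are the transcoded counts expressed as a share of the original counts. The short Python sketch below recomputes them from the figures in the table; the dictionary layout and the function name `remaining_percent` are illustrative assumptions.

```python
# (original, transcoded) counts copied from Table 1
TABLE_1 = {
    "CNN": {"characters": (4294, 2557), "links": (167, 75), "elements": (228, 116)},
    "Suntimes": {"characters": (3446, 2770), "links": (59, 17), "elements": (93, 41)},
    "Asahi": {"characters": (1880, 1020), "links": (40, 4), "elements": (65, 13)},
}


def remaining_percent(original: int, transcoded: int) -> int:
    """Share of the original count that remains after transcoding, in whole percent."""
    return round(100 * transcoded / original)


# Recompute every percentage column of the table.
PERCENTAGES = {
    site: {kind: remaining_percent(*counts) for kind, counts in row.items()}
    for site, row in TABLE_1.items()
}
```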
[0151] Furthermore, Table 2 compares the amount of information that lies between the beginning of each search page and the display of the search results. The comparison shows that this information is greatly reduced, so that a voice browser, for example, can reach the search results swiftly.
TABLE 2
Page        Original                  Transcoded
Yahoo       14 links, 1 image map     0 links
Lycos       15 links, 1 form          7 links
Infoseek    16 links and 1 form       2 links
[0152] The second embodiment of the present invention:
[0153]
[0154] As is shown in
[0155] Further, it is possible to restore a part of information employing heuristics for the difference DOM tree
[0156]
[0157] The list title restoration module
[0158] 1) At least one item remains in the list.
[0159] 2) The string immediately preceding the list is a header, bold text, or enlarged text, and is within 50 characters.
[0160] In this case, the immediately preceding string is determined to be a title and is restored as is shown in (c) in
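The two conditions for restoring a list title can be expressed as a small predicate, as in the following Python sketch. The `PrecedingString` record and the function name `should_restore_title` are hypothetical, and the sketch assumes that the formatting of the preceding string has already been determined elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrecedingString:
    """The string immediately preceding a list, with its assumed formatting flags."""
    text: str
    is_header: bool = False
    is_bold: bool = False
    is_enlarged: bool = False


def should_restore_title(remaining_items: int,
                         preceding: Optional[PrecedingString],
                         max_length: int = 50) -> bool:
    """Apply conditions 1) and 2): an item remains, and a short emphasized string precedes the list."""
    if remaining_items < 1 or preceding is None:
        return False
    emphasized = preceding.is_header or preceding.is_bold or preceding.is_enlarged
    return emphasized and len(preceding.text) <= max_length
```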
[0161] Likewise, the table top/side restoration module
[0162] The form movement module
[0163] Unlike the first embodiment, according to this embodiment, the difference page (DOM) is transformed to HTML by the DOM-HTML translation module
[0164] The third embodiment of the present invention:
[0165]
[0166] As mentioned above in the section of the prior art, there has been proposed and developed a technique which obtains a screen output for a small screen device on the basis of detailed annotation information. The system of this embodiment can obtain an output with higher precision in combination with such annotation information. Here is shown an example where the annotation information is used in the post-processing. As is shown in
[0167] A volunteer
[0168]
[0169] Assume that visual blocks on a page are specified by the annotation. First, the difference portion marking module
[0170] The fourth embodiment of the present invention:
[0171]
[0172] In this embodiment, a user terminal is a voice XML browser
[0173] The voice XML browser
[0174] A voice recognition browsing server
[0175] The button operated voice browsing server
[0176]
[0177]
[0178] The keyword analysis
[0179] According to the system and method of the embodiments of the present invention, Web pages can be perused by voice output using a voice input or a simple key operation input (button input). When a visually handicapped person accesses Web contents, the present invention provides an effective means of achieving barrier-free access. Because the contents are simplified, reading aloud by voice response proceeds smoothly. In addition, for a user who is unaccustomed to computer operation, the present invention provides a technique for easily accessing Web contents.
[0180] While the present invention has been particularly described with respect to the embodiments thereof, the present invention is not limited to these embodiments, and various modifications and alternatives may be made without departing from the spirit and scope of the present invention.
[0181] As mentioned above, according to the present invention, there is provided a technique for simplification of Web pages to access necessary information quickly when displaying or outputting a Web page using a small screen device or voice browser. Besides, the simplification of Web pages is performed even if there is no past page of the same URL. Moreover, the simplification of Web pages is performed on the fly. Furthermore, there is provided a technique for simplifying unnecessary information with high precision, without losing important information upon simplification of Web pages.
[0182] It is to be understood that the provided illustrative examples are by no means exhaustive of the many possible uses for the present invention.
[0183] From the foregoing description, one skilled in the art can easily ascertain the essential characteristics of this invention and, without departing from the spirit and scope thereof, can make various changes and modifications of the invention to adapt it to various usages and conditions.
[0184] It is to be understood that the present invention is not limited to the embodiments described above, but encompasses any and all embodiments within the scope of the following claims.