[0001] This application claims the benefit of U.S. Provisional Application No. 60/099,641, filed Sep. 9, 1998.
[0002] The present invention relates to computer based natural language processing systems and more particularly to computer based systems and methods of processing natural language text to identify Subject, Action, Object triplets and relationships between such triplets, storing this data and processing this data to semantically analyze, select, summarize, store, and display candidate documents containing specific content or subject matter.
[0003] Computer based document search processors are known to perform key word searches for publications on the Internet and World Wide Web. Today, information owners and service providers are adapting their databases to individual tastes and requirements. For example, Boston based Agents, Inc. offers over the Web personalized newsletters for music fans such that classical music lovers are blocked from receiving Rap music advertisements and vice-versa. KD, Inc. of Hong Kong has developed a system that takes into consideration words similar by sense while searching the Web. Today, the user can download 10,000 papers from the Web by typing the word “Screen”. The search system designed by KD, Inc. asks the user whether he/she is seeking papers related to Computer Screen, TV Screen or Window Screen. In this case, the number of unrelated papers will be drastically reduced.
[0004] Software based search processors are able to remember requests of a single user and to conduct personalized non-stop searches on the Web. So, when a user wakes up in the morning, he/she finds references and abstracts of several new Web papers related to his/her area of interest. In 1997, practically all fundamental technical publications, journals, magazines, as well as patents of all industrial countries became available on the Web, i.e., available in electronic format.
[0005] Although key word searching the Web affords the user great value, it also has created and will continue to create substantial problems adversely affecting this value. Specifically, because of the enormous amount of information available on the Web, key word search processors produce too much downloaded information, the vast majority of which is irrelevant or immaterial to the information the user wants. Many users simply give up in frustration when presented with several hundred articles in response to what the user considered a request for only those few articles related to a specific request.
[0006] This problem is also experienced in the technical fields of science and engineering, particularly since a growing number of libraries, government patent offices, universities, government research centers, and others are adding vast amounts of technical and scientific information for Web access. Engineers, scientists, and doctors are overwhelmed with too many articles, papers, patents, and general information on the topic of interest to them. In addition, the user presently has only two choices when examining a downloaded article to determine its relevance to the user's project. He/she can read the author's abstract and/or scan various sections of the full article to determine whether or not to save or print out that specific document. Since the author's abstract is not comprehensive, it often omits the reference to the specific subject matter of interest to the user or treats this subject matter in a non-comprehensive manner. Thus, scanning the abstract and scanning the full article may have little value and require an inordinate amount of user time.
[0007] Various attempts purport to increase the recall and precision of the selection, such as U.S. Pat. Nos. 5,774,833 and 5,794,050, incorporated herein by reference. However, these methods simply rely on key word or phrase searching with various techniques of selection based on variations of the key words, or purported understanding of textual phrases. These prior methods may improve recall but tend to require too much physical and mental effort and time to determine why the document was selected and which part is pertinent. This results from the entire document or abstract being presented without summary or concept generation.
[0008] A computer based software system and method according to the principles of the present invention solves the foregoing problems and has the ability to perform a non-stop search of all databases on the Web or other network for key words and to semantically process candidate documents for specific knowledge concepts, such as technological functions or specific physical effects, so that only the very few prioritized or a single document meeting the search criteria is presented or identified to the user.
[0009] Further, the computer based software system in accordance with the principles of the present invention captures these highly relevant documents and creates a compressed, short summary of the precise technical physical aspects designated by the search criteria.
[0010] Another aspect of the present invention includes using the semantic analysis results of the selected documents to create new ideas of knowledge concepts. The system does this by analyzing the subject, action, and object triplets mentioned in the documents, identifying cause and effect triplet relationships, and re-organizing these triplet representations into new and/or different profiles of such elements. As further described below, some of these reorganized sets of relationships among these elements may comprise new concepts never before thought of by anyone.
[0011] According to an aspect of the present invention, the method and apparatus begins with the user entering natural language text related to the task, concept, or subject matter for which the user desires to acquire publications or documents. The system analyzes this request text and automatically tags each word with a code that indicates the type of word it is. Once all words in the request are tagged, the system performs a semantic analysis that, in one example, includes determining and storing the verb groups within the first sentence of the request, then determining and storing the noun groups within that sentence of the request. This process is repeated for all sentences in the request.
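The tagging and grouping steps described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the lexicon, the tag names (DET, NOUN, AUX, VERB, PREP, UNK), and the functions `tag_words`, `collect_groups`, and `analyze_request` are assumptions introduced only for illustration, and a practical system would rely on full dictionaries and grammar rules.

```python
# Toy lexicon mapping words to word-type codes (illustrative assumption).
LEXICON = {
    "the": "DET", "a": "DET",
    "laser": "NOUN", "beam": "NOUN", "metal": "NOUN", "surface": "NOUN",
    "heats": "VERB", "heated": "VERB", "is": "AUX",
    "by": "PREP",
}


def tag_words(sentence):
    """Tag each word of one sentence with a code for its word type."""
    words = sentence.lower().rstrip(".").split()
    return [(word, LEXICON.get(word, "UNK")) for word in words]


def collect_groups(tagged, member_tags):
    """Collect maximal runs of consecutive words whose tag is in member_tags."""
    groups, current = [], []
    for word, tag in tagged:
        if tag in member_tags:
            current.append(word)
        elif current:
            groups.append(" ".join(current))
            current = []
    if current:
        groups.append(" ".join(current))
    return groups


def analyze_request(text):
    """For each sentence, store the verb groups first, then the noun groups."""
    results = []
    for sentence in (part.strip() for part in text.split(".")):
        if not sentence:
            continue
        tagged = tag_words(sentence)
        results.append({
            "verb_groups": collect_groups(tagged, {"AUX", "VERB"}),
            "noun_groups": collect_groups(tagged, {"DET", "NOUN"}),
        })
    return results
```

Under these assumptions, the request "The metal surface is heated by the laser beam." yields the verb group "is heated" and the noun groups "the metal surface" and "the laser beam".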
[0012] Next, the system parses each request sentence with a hierarchical algorithm into a coded framework (tree) which is substantially indicative of the sense of the sentence. The system includes databases of various types to aid in generating the coded framework, such as grammar rules, parsing rules, dictionary synonyms, and the like. Once the parsed sentence codes are stored, the system identifies Subject-Action-Object (SAO) extractions within each sentence and stores them. A sentence can have one, two, or a plurality of SAO extractions as seen in the detailed description below. Each extraction is normalized into an SAO structure by processing extractions according to certain rules described below. Accordingly, the result of the semantic analysis routine performed on the request text is a series of SAO structures (triplets) indicative of the content of the request. These request SAO structures are applied to (1) a comparative module for comparing the SAO structures of candidate documents as described below and (2) a search request and key word generator that identifies key words and key combinations of words, and synonyms thereof, for searching the Web, internet, intranet, and/or local databases for candidate documents. Any suitable search engine, e.g. Alta Vista™, can be used to identify, select, and download candidate documents based on the generated key words.
[0013] It should be understood that, as mentioned above, key word searching produces an over-abundance of candidate documents. However, according to the principles of the present invention, the system performs substantially the same semantic analysis on each candidate document as performed on the user input search request. That is, the system generates one or more SAO structures for each sentence of each candidate document and forwards them to the comparative Unit, where the request SAO structures are compared to the candidate document SAO structures. Those few candidate documents having SAO structures that substantially match the request SAO structure profile are placed into a retrieved document Unit where they are ranked in order of relevance. The system then summarizes the essence of each retrieved document by synthesizing those SAO structures of the document that match the request SAO structures and stores this summary for user display or printout. Users can later read the summary and decide to display, print out, or delete the entire retrieved document and its SAO structures.
[0014] As stated above, the SAO structures for each sentence of each retrieved document are stored in the system according to the present invention. According to the knowledge creativity aspect of the present invention, the system analyzes all these stored structures, identifies where common or equivalent subjects and objects exist, and reorganizes, generates, and synthesizes new SAO structures or new strings (relationships) of SAO structures for the user's consideration. Some of these new structures or strings may be unique and comprise new solutions to problems related to the user's requested subject matter. For example, if two structures S1-A1-O1 and S2-A2-O2 are stored, and the present system recognizes that S2 is equivalent to, or a synonym for, or has some other stored relation to O1, then it will generate and store for the user's access a summary of S1-A1-S2-A2-O2. Or, if the system stores an association between S1 and A2, it can generate S1-A1/A2-O1 to suggest improvement of O1 toward desired results.
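By way of illustration only, the key word generator applied to a request SAO structure might be sketched as follows. The `SAO` record, the `SYNONYMS` table, and the functions `variants` and `generate_queries` are hypothetical names, and the synonym entries are invented for the example; the sketch shows one plausible way to form key combinations of words and their synonyms for submission to a search engine.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SAO:
    """One normalized Subject-Action-Object triplet."""
    subject: str
    action: str
    object: str


# Hypothetical synonym dictionary; the entries are illustrative only.
SYNONYMS = {
    "heat": ["warm"],
    "laser": ["laser beam"],
}


def variants(term):
    """Return the term followed by any stored synonyms."""
    return [term] + SYNONYMS.get(term, [])


def generate_queries(sao):
    """Build every key word combination of subject, action, and object variants."""
    combos = product(variants(sao.subject), variants(sao.action), variants(sao.object))
    return [" ".join(combo) for combo in combos]
```

Under these assumptions, `generate_queries(SAO("laser", "heat", "metal"))` produces four query strings, one for each pairing of the subject and action variants with the single object term.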
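The comparison and ranking described above can be sketched under simplifying assumptions. Here "substantially match" is reduced to exact equality of normalized triplets, relevance is the fraction of request SAO structures found in a document, and the `threshold` value is arbitrary; the function names `match_score` and `retrieve` are hypothetical, and none of these choices is stated in the disclosure.

```python
def match_score(request_saos, document_saos):
    """Return (fraction of request triplets found in the document, matched triplets)."""
    if not request_saos:
        return 0.0, []
    document_set = set(document_saos)
    matched = [sao for sao in request_saos if sao in document_set]
    return len(matched) / len(request_saos), matched


def retrieve(request_saos, candidates, threshold=0.5):
    """Keep candidates meeting the threshold, ranked by descending relevance.

    candidates maps a document name to its list of (subject, action, object)
    tuples. Each result carries the matched triplets, which serve as the raw
    material for that document's summary.
    """
    retrieved = []
    for name, document_saos in candidates.items():
        score, matched = match_score(request_saos, document_saos)
        if score >= threshold:
            retrieved.append((name, score, matched))
    retrieved.sort(key=lambda item: item[1], reverse=True)
    return retrieved
```

A document whose triplets include every request triplet scores 1.0 and is ranked first, while a document sharing none of them is dropped before ranking.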
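The S1-A1-S2-A2-O2 example above can be sketched as follows. The `EQUIVALENTS` table and the functions `equivalent` and `synthesize_chains` are assumptions made for illustration, and the sample equivalence ("steam" and "water vapor") is invented; a practical system would draw such relations from its stored synonym and relationship databases.

```python
# Hypothetical stored equivalences between subjects and objects.
EQUIVALENTS = {frozenset({"steam", "water vapor"})}


def equivalent(first, second):
    """True when two terms are identical or recorded as equivalent."""
    return first == second or frozenset({first, second}) in EQUIVALENTS


def synthesize_chains(saos):
    """Join S1-A1-O1 with S2-A2-O2 into S1-A1-S2-A2-O2 when O1 is equivalent to S2."""
    chains = []
    for first in saos:
        for second in saos:
            if first == second:
                continue
            s1, a1, o1 = first
            s2, a2, o2 = second
            if equivalent(o1, s2):
                chains.append((s1, a1, s2, a2, o2))
    return chains
```

For instance, given the stored triplets ("boiler", "produce", "steam") and ("water vapor", "drive", "turbine"), the sketch emits the single chain boiler-produce-water vapor-drive-turbine, which a synthesizer could then render as a natural language sentence.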
[0015] Other and further advantages and benefits shall become apparent with the following detailed description when taken in view of the appended drawings, in which:
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
[0027]
[0028]
[0029] One exemplary embodiment of a semantic processing system according to the principles of the present invention includes:
[0030] A CPU
[0031] With reference to FIGS.
[0032] Unit
[0033] Unit
[0034] Database Units
[0035] Unit
[0036] SAO processor
[0037] Unit
[0038] Filtering Unit
[0039] Reference
[0040] SAO synthesizer Unit
[0041] If S was not detected by Unit
[0042] If SAO structures received by Unit
[0043] The salient steps to the method according to the principles of the present invention are shown in
[0044] Then the method normalizes (modifies) these words, changing each action to its infinitive form. Thus, “is isolated”
[0045] The request SAO structure key words/phrases are stored and sent to a standard search engine to search for candidate documents in local databases, LANs and/or the Web. Alta Vista™, Yahoo™, or other typical search engines could be used. The engine, using the request SAO structure key words/phrases identifies candidate documents and stores them (full text) for system
[0046] Next System
[0047] Filtered relevant SAO structures of relevant document(s) are analyzed to identify relationships among the subjects, actions, and objects of all relevant structures. Then the SAO structures are processed to reorganize them into new SAO structures for storage and synthesis into new natural language sentence(s). Some of the new sentences will probably express or summarize new ideas, concepts, and thoughts for users to consider. The new sentences are stored for user display or print-out.
[0048] For example, if
[0049] S
[0050] S
[0051] S
[0052] and S
[0053] Accordingly, the method and apparatus according to the present invention automatically provide the user with a set of new ideas directly relating to the user's requested area of interest, some of which are probably new and suggest possible new solutions to the user's problems under consideration, and/or with the specific documents, and summaries of pertinent parts of specific documents, related directly to the user's request.
[0054] Although mention has been made herein of application of the present system and method to the engineering, scientific, and medical fields, the application thereof is not limited thereto. The present invention has utility for history, philosophy, theology, poetry, the arts, or any field where written language is used.
[0055] It will be understood that various enhancements and changes can be made to the example embodiments herein disclosed without departing from the spirit and scope of the present invention.