Title:
SYSTEM, METHOD, AND COMPUTER PROGRAM PRODUCT FOR DATA MINING AND AUTOMATICALLY GENERATING HYPOTHESES FROM DATA REPOSITORIES
Kind Code:
A1


Abstract:
Various embodiments of the present invention provide systems, methods, and computer programs for generating a hypothesis. Specifically, some method embodiments include steps for accessing a system for extracting relationships and determining a relationship rule defining a relationship among a plurality of phrases and a plurality of concepts stored in the system for extracting relationships. Such embodiments further provide steps for parsing a plurality of documents in a data repository according to the relationship rule and generating a hypothesis comprising a previously unknown combination of phrases and concepts being at least partially determined from the parsed plurality of documents. Various embodiments also provide a step for presenting the hypothesis to a user so as to indicate the previously unknown combination.



Inventors:
Raghavan, Vijay V. (Lafayette, LA, US)
Xie, Ying (Kennesaw, GA, US)
Prestigiacomo, Anthony (Baton Rouge, LA, US)
Application Number:
12/210253
Publication Date:
03/26/2009
Filing Date:
09/15/2008
Primary Class:
International Classes:
G06N5/02



Primary Examiner:
FERNANDEZ RIVAS, OMAR F
Attorney, Agent or Firm:
ALSTON & BIRD LLP (BANK OF AMERICA PLAZA, 101 SOUTH TRYON STREET, SUITE 4000, CHARLOTTE, NC, 28280-4000, US)
Claims:
What is claimed is:

1. A method for generating a hypothesis, the method comprising: accessing a system for extracting relationships, the system for extracting relationships comprising a plurality of phrases and a plurality of concepts; determining a relationship rule defining a relationship among at least a portion of the plurality of phrases and at least a portion of the plurality of concepts; parsing a plurality of documents in a data repository according to the relationship rule, the plurality of documents each comprising at least a portion of one of the plurality of phrases and the plurality of concepts; generating a hypothesis comprising a previously unknown combination, the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts, the previously unknown combination being at least partially determined from the parsed plurality of documents; and presenting the hypothesis so as to indicate the previously unknown combination.

2. A method according to claim 1, wherein the relationship rule is selected from the group consisting of: an assignment of at least one of the plurality of phrases to at least one of the plurality of concepts; an assignment of at least one of the plurality of phrases to a relationship identifier, the relationship identifier linking a first one of the plurality of concepts to a second one of the plurality of concepts; an assignment of at least one of the plurality of concepts to a semantic category; an arrangement of at least a portion of the plurality of concepts in a hierarchical relationship, wherein a first one of the portion of concepts comprises a child concept and a second one of the portion of concepts comprises a parent concept; and combinations thereof.

3. A method according to claim 1, wherein at least a portion of the plurality of documents comprises at least one of a first concept, a second concept, and a third concept, and wherein the parsing step further comprises: detecting a first relationship between the first and second concepts; detecting a second relationship between the second and third concepts; detecting a third relationship between the first and third concepts; and determining a potential chain relationship among the first, second, and third concepts at least partially from the detected first, second, and third relationships; and wherein the generating step further comprises generating a chain hypothesis comprising the previously unknown combination of the first, second, and third concepts.

4. A method according to claim 1, wherein at least a portion of the plurality of documents comprises at least one of a first concept, a second concept, and a plurality of linking concepts, and wherein the parsing step further comprises: detecting a first relationship between the first concept and a first portion of the plurality of linking concepts; detecting a second relationship between the second concept and a second portion of the plurality of linking concepts; and determining a potential substitution relationship between the first concept and the second concept at least partially from the detected first and second relationships and a number of overlapping concepts present in both the first portion and the second portion of the plurality of linking concepts; and wherein the generating step further comprises generating a substitution hypothesis comprising the previously unknown combination of at least one of the first and second concepts with a portion of the plurality of linking concepts not present in the number of overlapping concepts.

5. A method according to claim 4, wherein the parsing step further comprises determining a strength of the potential substitution relationship between the first and second concepts based at least in part on the number of overlapping concepts present in both the first portion and the second portion of the plurality of linking concepts.

6. A method according to claim 1, wherein at least a portion of the plurality of documents comprises at least one of a first concept, a second concept, and a third concept, and wherein the parsing step further comprises: detecting a first relationship between the first concept and the second concept; detecting a second relationship between the second concept and the third concept; and determining a potential pairwise relationship between the first concept and the third concept at least partially from the detected first and second relationships; and wherein the generating step further comprises generating a pairwise hypothesis comprising the previously unknown combination of the first and third concepts.

7. A method according to claim 6, wherein the parsing step further comprises assessing a strength of the potential relationship between the first and third concepts at least partially from a known secondary relationship between the first and third concepts.

8. A method according to claim 7, wherein the known secondary relationship comprises a common semantic category including both the first and third concepts.

9. A method according to claim 8, wherein the relationship rule comprises the common semantic category.

10. A method according to claim 1, further comprising: identifying a portion of the plurality of documents in the data repository associated with a user; creating a user profile based at least in part on the identified documents, the user profile being indicative of a user information need; and modifying the hypothesis in response to the user profile such that the modified hypothesis at least partially corresponds to the user information need.

11. A method according to claim 10, wherein the user profile comprises at least one semantic category and wherein the method further comprises filtering the presented hypothesis such that the previously unknown combination includes only at least one phrase and at least one concept corresponding substantially to the at least one semantic category.

12. A method according to claim 1, wherein presenting the hypothesis comprises presenting a display to a user comprising a visual representation of the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts.

13. A method according to claim 12, wherein the visual representation comprises an interactive icon configured to be selectable by the user, the interactive icon being further configured to modify the display when selected by the user.

14. A method according to claim 1, wherein the system for extracting relationships is selected from the group consisting of: a vocabulary database corresponding to a selected subject area; a predetermined lexicon; a semantic network; a metathesaurus; and combinations thereof.

15. A method according to claim 1, wherein the data repository is selected from the group consisting of: a biomedical literature database; a medical records database; a chemical literature database; a computer science literature database; a physics literature database; a legal literature database; a psychology literature database; a social science literature database; a news periodical database; a business journal database; and combinations thereof.

16. A method according to claim 1, further comprising storing the determined relationship rule for later or repeated use in the subsequent parsing step.

17. A method according to claim 1, further comprising verifying the hypothesis using at least one independent resource.

18. A computer program product for generating a hypothesis based on a plurality of documents in a data repository in a manner that reduces the burden on the data repository, said computer program product comprising a computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising: a first set of computer instructions for accessing a system for extracting relationships, the system for extracting relationships comprising a plurality of phrases and a plurality of concepts; a second set of computer instructions for determining a relationship rule defining a relationship among at least a portion of the plurality of phrases and at least a portion of the plurality of concepts; a third set of computer instructions for parsing the plurality of documents in the data repository according to the relationship rule, the plurality of documents each comprising at least a portion of one of the plurality of phrases and the plurality of concepts; a fourth set of computer instructions for generating a hypothesis comprising a previously unknown combination, the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts, the previously unknown combination being at least partially determined from the parsed plurality of documents; and a fifth set of computer instructions for presenting the hypothesis so as to indicate the previously unknown combination.

19. A computer program product according to claim 18, wherein the relationship rule is selected from the group consisting of: an assignment of at least one of the plurality of phrases to at least one of the plurality of concepts; an assignment of at least one of the plurality of phrases to a relationship identifier, the relationship identifier linking a first one of the plurality of concepts to a second one of the plurality of concepts; an assignment of at least one of the plurality of concepts to a semantic category; an arrangement of at least a portion of the plurality of concepts in a hierarchical relationship, wherein a first one of the portion of concepts comprises a child concept and a second one of the portion of concepts comprises a parent concept; and combinations thereof.

20. A computer program product according to claim 18, wherein at least a portion of the plurality of documents comprises at least one of a first concept, a second concept, and a third concept, and wherein the third set of computer instructions for parsing further comprises: a sixth set of computer instructions for detecting a first relationship between the first and second concepts; a seventh set of computer instructions for detecting a second relationship between the second and third concepts; an eighth set of computer instructions for detecting a third relationship between the first and third concepts; and a ninth set of computer instructions for determining a potential chain relationship among the first, second, and third concepts at least partially from the detected first, second, and third relationships; and wherein the fourth set of computer instructions for generating further comprises a tenth set of computer instructions for generating a chain hypothesis comprising the previously unknown combination of the first, second, and third concepts.

21. A computer program product according to claim 18, wherein at least a portion of the plurality of documents comprises at least one of a first concept, a second concept, and a plurality of linking concepts, and wherein the third set of computer instructions for parsing further comprises: an eleventh set of computer instructions for detecting a first relationship between the first concept and a first portion of the plurality of linking concepts; a twelfth set of computer instructions for detecting a second relationship between the second concept and a second portion of the plurality of linking concepts; and a thirteenth set of computer instructions for determining a potential substitution relationship between the first concept and the second concept at least partially from the detected first and second relationships and a number of overlapping concepts present in both the first portion and the second portion of the plurality of linking concepts; and wherein the fourth set of computer instructions for generating further comprises a fourteenth set of computer instructions for generating a substitution hypothesis comprising the previously unknown combination of at least one of the first and second concepts with a portion of the plurality of linking concepts not present in the number of overlapping concepts.

22. A computer program product according to claim 21, wherein the third set of computer instructions for parsing further comprises a fifteenth set of computer instructions for determining a strength of the potential substitution relationship between the first and second concepts based at least in part on the number of overlapping concepts present in both the first portion and the second portion of the plurality of linking concepts.

23. A computer program product according to claim 18, wherein at least a portion of the plurality of documents comprises at least one of a first concept, a second concept, and a third concept, and wherein the third set of computer instructions for parsing further comprises: a sixteenth set of computer instructions for detecting a first relationship between the first concept and the second concept; a seventeenth set of computer instructions for detecting a second relationship between the second concept and the third concept; and an eighteenth set of computer instructions for determining a potential pairwise relationship between the first concept and the third concept at least partially from the detected first and second relationships; and wherein the fourth set of computer instructions for generating further comprises a nineteenth set of computer instructions for generating a pairwise hypothesis comprising the previously unknown combination of the first and third concepts.

24. A computer program product according to claim 23, wherein the third set of computer instructions for parsing further comprises a twentieth set of computer instructions for assessing a strength of the potential relationship between the first and third concepts at least partially from a known secondary relationship between the first and third concepts.

25. A computer program product according to claim 24, wherein the known secondary relationship comprises a common semantic category including both the first and third concepts.

26. A computer program product according to claim 25, wherein the relationship rule comprises the common semantic category.

27. A computer program product according to claim 18, further comprising: a twenty-first set of computer instructions for identifying a portion of the plurality of documents in the data repository associated with a user; a twenty-second set of computer instructions for creating a user profile based at least in part on the identified documents, the user profile being indicative of a user information need; and a twenty-third set of computer instructions for modifying the hypothesis in response to the user profile such that the modified hypothesis at least partially corresponds to the user information need.

28. A computer program product according to claim 27, wherein the user profile comprises at least one semantic category, the computer program product further comprising a twenty-fourth set of computer instructions for filtering the presented hypothesis such that the previously unknown combination includes only at least one phrase and at least one concept corresponding substantially to the at least one semantic category.

29. A computer program product according to claim 18, wherein the fifth set of computer instructions for presenting the hypothesis comprises a twenty-fifth set of computer instructions for presenting a display to a user comprising a visual representation of the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts.

30. A computer program product according to claim 29, wherein the visual representation comprises an interactive icon configured to be selectable by the user, the interactive icon being further configured to modify the display when selected by the user.

31. A computer program product according to claim 18, wherein the system for extracting relationships is selected from the group consisting of: a vocabulary database corresponding to a selected subject area; a predetermined lexicon; a semantic network; a semantic database; a metathesaurus; and combinations thereof.

32. A computer program product according to claim 18, wherein the data repository is selected from the group consisting of: a biomedical literature database; a medical records database; a chemical literature database; a computer science literature database; a physics literature database; a legal literature database; a psychology literature database; a social science literature database; a news periodical database; a business journal database; and combinations thereof.

33. A computer program product according to claim 18, further comprising a twenty-sixth set of computer instructions for storing the determined relationship rule for later or repeated use in the subsequent parsing step.

34. A computer program product according to claim 18, further comprising a twenty-seventh set of computer instructions for verifying the hypothesis using at least one independent resource.

35. A system for mining information from a data repository comprising a plurality of documents to produce a hypothesis, the system comprising: a system for extracting relationships comprising a plurality of phrases and a plurality of concepts; a host computing element in communication with said system for extracting relationships for accessing said system for extracting relationships; wherein said host computing element determines a relationship rule defining a relationship among at least a portion of the plurality of phrases and at least a portion of the plurality of concepts; wherein said host computing element parses the plurality of documents in a data repository according to the relationship rule, the plurality of documents each comprising at least a portion of one of the plurality of phrases and the plurality of concepts; and wherein said host computing element generates the hypothesis comprising a previously unknown combination, the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts, the previously unknown combination being at least partially determined from the parsed plurality of documents; and a user interface in communication with said host computing element, said user interface configured for presenting the hypothesis so as to indicate the previously unknown combination.

36. A system according to claim 35, wherein said host computing element determines a relationship rule selected from the group consisting of: an assignment of at least one of the plurality of phrases to at least one of the plurality of concepts; an assignment of at least one of the plurality of phrases to a relationship identifier, the relationship identifier linking a first one of the plurality of concepts to a second one of the plurality of concepts; an assignment of at least one of the plurality of concepts to a semantic category; an arrangement of at least a portion of the plurality of concepts in a hierarchical relationship, wherein a first one of the portion of concepts comprises a child concept and a second one of the portion of concepts comprises a parent concept; and combinations thereof.

37. A system according to claim 35, wherein said host computing element identifies a portion of the plurality of documents in the data repository associated with a user; wherein said host computing element creates a user profile based at least in part on the identified documents, the user profile being indicative of a user information need; and wherein said host computing element modifies the hypothesis in response to the user profile such that the modified hypothesis at least partially corresponds to the user information need.

38. A system according to claim 37, wherein the user profile comprises at least one semantic category and wherein said host computing element filters the presented hypothesis such that the previously unknown combination includes only at least one phrase and at least one concept corresponding substantially to the at least one semantic category.

39. A system according to claim 35, wherein said user interface presents the hypothesis as a display to a user comprising a visual representation of the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts.

40. A system according to claim 39, wherein said user interface presents the visual representation comprising an interactive icon configured to be selectable by the user, the interactive icon being further configured to modify the display when selected by the user.

41. A system according to claim 35, wherein said system for extracting relationships is selected from the group consisting of: a vocabulary database corresponding to a selected subject area; a predetermined lexicon; a semantic network; a semantic database; a metathesaurus; and combinations thereof.

42. A system according to claim 35, wherein said host computing element is in communication with a data repository selected from the group consisting of: a biomedical literature database; a medical records database; a chemical literature database; a computer science literature database; a physics literature database; a legal literature database; a psychology literature database; a social science literature database; a news periodical database; a business journal database; and combinations thereof.

43. A system according to claim 35, further comprising a memory device in communication with said host computing element, said memory device configured for storing the determined relationship rule for later or repeated use in the subsequent parsing step.

44. A system according to claim 35, further comprising an independent resource in communication with said host computing device, said independent resource configured for verifying the generated hypothesis.

Description:

CROSS-REFERENCE

This application is a continuation of co-pending International Application No. PCT/US2007/063983, filed Mar. 14, 2007, the contents of which are incorporated by reference in their entirety, and which claims priority to U.S. Patent Application Ser. No. 60/782,935, filed Mar. 15, 2006.

FIELD OF THE INVENTION

Various embodiments of the present invention relate generally to the field of query generation, information retrieval, and data mining with respect to data repositories (such as literature and/or record databases, for example).

BACKGROUND

The vast volume of scientific literature provides a goldmine for the extraction of useful knowledge and information in support of practical decision-making as well as academic research. However, many of the currently available search engines that query various data repositories offer very limited searching, indexing, and categorizing functionality and therefore fall short of fully exploring and utilizing such data resources. As an example, the Medical Literature Analysis and Retrieval System Online (“MEDLINE”), the premier bibliographic database of the U.S. National Library of Medicine (NLM), contains approximately 13 million journal articles in the life sciences, with citation information and references, concentrating on biomedicine. Each year, the exponentially increasing amount of biomedical literature in the MEDLINE database poses tremendous challenges to the ultimate users of such databases, typically scientific researchers. Currently, a small number of academic papers have proposed and discussed the idea of generating hypotheses from biomedical literature in databases like MEDLINE in a systematic way, so as to facilitate biomedical researchers' discoveries and possibly even suggest potential research directions. However, existing work in this area has focused on generating only one type of hypothesis, namely “a potential pairwise relation”, which does not fully represent most patterns and rules embedded in the document corpus.

Furthermore, existing querying and/or discovery processes as discussed in these papers are usually conducted in a “retrieval mode”, which necessarily implies that users must know what knowledge and information they need so that they can provide at least one concept of their search interest to initiate the discovery process. In many cases, however, users may not know how to express their knowledge and information needs, or may not even realize and/or appreciate an existing information need. For instance, a given biomedical researcher may never be independently motivated to research a relation between a certain gene and a certain disease that, as a matter of fact, may be predicted from existing relationships within several recent publications. In addition, different types of users often have different knowledge and information needs based on their respective backgrounds and/or profiles, even if they issue the same query to the same database. For example, for a query of “Diabetes” to MEDLINE, a biomedical researcher may want to acquire some potential research directions for this disease, a medical practitioner may wish to keep current on state-of-the-art diagnostic progress, and a patient may want to ensure that the treatment plan prescribed by her physician is reasonable in light of current treatment options. In summary, each user brings different levels of expertise and different interests to a given query of a given database. Currently available query systems do not address this issue.

In light of the above, a need exists for an improved method, system and computer program product for automatically generating different types of hypotheses from data repositories. There is a further need for automatic analysis of a user's scope of interest and effective delivery of hypotheses, information and knowledge that match the user's interests and information needs.

BRIEF SUMMARY

The needs outlined above are met by the present invention which, in various embodiments, provides systems and methods that overcome many of the technical problems discussed above, as well as other technical problems, with regard to the generation and display of potential hypotheses based on written works selected from a database. Specifically, in one embodiment, the invention provides a method and computer program product for generating a hypothesis. In some embodiments, the method and/or computer program product may comprise accessing a system for extracting relationships, wherein the system for extracting relationships comprises a plurality of phrases and a plurality of concepts. In various embodiments, the system for extracting relationships may include, but is not limited to: a vocabulary database corresponding to a selected subject area; a predetermined lexicon; a semantic network; a semantic database; a metathesaurus; and combinations of such systems.

The method and/or computer program product also comprises determining a relationship rule defining a relationship among at least a portion of the plurality of phrases and at least a portion of the plurality of concepts. In some embodiments, the determined relationship rule may include, but is not limited to: an assignment of at least one of the plurality of phrases to at least one of the plurality of concepts; an assignment of at least one of the plurality of phrases to a relationship identifier, the relationship identifier linking a first one of the plurality of concepts to a second one of the plurality of concepts; an assignment of at least one of the plurality of concepts to a semantic category; an arrangement of at least a portion of the plurality of concepts in a hierarchical relationship, wherein a first one of the portion of concepts comprises a child concept and a second one of the portion of concepts comprises a parent concept. Some embodiments may further comprise a step for storing the determined relationship rule for later or repeated use in a subsequent parsing step as described further herein.
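By way of illustration only, the four kinds of relationship rule listed above could be held in a simple in-memory structure. The following is a minimal sketch, assuming Python 3.9 or later; the class name RelationshipRules, its field names, and the sample phrases and concepts are hypothetical and are not taken from any particular system for extracting relationships.

```python
from dataclasses import dataclass, field


@dataclass
class RelationshipRules:
    """Hypothetical container for the four rule types described above."""
    # phrase -> concept (assignment of a phrase to a concept)
    phrase_to_concept: dict[str, str] = field(default_factory=dict)
    # phrase -> relationship identifier linking two concepts
    phrase_to_relation: dict[str, str] = field(default_factory=dict)
    # concept -> semantic category
    concept_to_category: dict[str, str] = field(default_factory=dict)
    # child concept -> parent concept (hierarchical arrangement)
    child_to_parent: dict[str, str] = field(default_factory=dict)

    def ancestors(self, concept: str) -> list[str]:
        """Walk the hierarchy upward, nearest parent first."""
        chain: list[str] = []
        while concept in self.child_to_parent:
            concept = self.child_to_parent[concept]
            if concept in chain:  # guard against a cyclic hierarchy
                break
            chain.append(concept)
        return chain


# Illustrative (assumed) rule content.
rules = RelationshipRules(
    phrase_to_concept={"heart attack": "myocardial infarction"},
    phrase_to_relation={"is treated with": "TREATS"},
    concept_to_category={"myocardial infarction": "disease"},
    child_to_parent={
        "type 2 diabetes": "diabetes mellitus",
        "diabetes mellitus": "endocrine disease",
    },
)
```

Keeping the rules in one object of this kind is one way such rules might be stored for later or repeated use by a parsing step.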

The method and/or computer program product may also comprise parsing a plurality of documents in a data repository according to the relationship rule, wherein the plurality of documents each comprise at least a portion of one of the plurality of phrases and the plurality of concepts. In various embodiments, the data repository may include, but is not limited to: a biomedical literature database; a medical records database; a chemical literature database; a computer science literature database; a physics literature database; a legal literature database; a psychology literature database; a social science literature database; a news periodical database; a business journal database; and combinations of such data repositories.
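As a rough illustration of the parsing step, the sketch below maps phrases found in each document to concepts and records which concepts appear together. It assumes, purely for illustration, that a relationship is detected whenever two concepts occur in the same document and that phrases are matched by case-insensitive substring search; the function names and sample data are hypothetical.

```python
from itertools import combinations


def extract_concepts(text, phrase_to_concept):
    """Return the set of concepts whose phrases occur in the text."""
    lowered = text.lower()
    return {
        concept
        for phrase, concept in phrase_to_concept.items()
        if phrase.lower() in lowered
    }


def parse_documents(documents, phrase_to_concept):
    """Return per-document concept sets and the set of co-occurring pairs."""
    per_document = [extract_concepts(text, phrase_to_concept) for text in documents]
    relationships = set()
    for concepts in per_document:
        # Assumption: co-occurrence in one document counts as a relationship.
        relationships.update(combinations(sorted(concepts), 2))
    return per_document, relationships


# Illustrative (assumed) phrase-to-concept rule and documents.
sample_rule = {
    "heart attack": "C_MI",
    "myocardial infarction": "C_MI",
    "aspirin": "C_ASPIRIN",
}
sample_docs = [
    "Aspirin after a heart attack.",
    "Myocardial infarction outcomes.",
]
sample_concepts, sample_relationships = parse_documents(sample_docs, sample_rule)
```

A production parser would more likely tokenize the text and consult relationship identifiers rather than rely on substring matching.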

The method and/or computer program product embodiments may also comprise steps for generating a hypothesis comprising a previously unknown combination, wherein the previously unknown combination includes one of at least one of the plurality of phrases and at least one of the plurality of concepts. The previously unknown combination may be at least partially determined from the parsed plurality of documents.

In some embodiments, at least a portion of the plurality of documents may comprise at least one of a first concept, a second concept, and a third concept. According to some such embodiments, the parsing step described herein may further comprise: detecting a first relationship between the first and second concepts; detecting a second relationship between the second and third concepts; detecting a third relationship between the first and third concepts; and determining a potential chain relationship among the first, second, and third concepts at least partially from the detected first, second, and third relationships. Furthermore, according to some such embodiments, the step for generating the hypothesis may also comprise generating a chain hypothesis comprising the previously unknown combination of the first, second, and third concepts.
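The chain case can be sketched as follows. This is a minimal illustration, assuming that each parsed document has already been reduced to a set of concepts, that co-occurrence in a document stands in for a detected relationship, and that a combination is "previously unknown" when no single document contains all three concepts. The function name chain_hypotheses and the sample data are hypothetical.

```python
from itertools import combinations


def chain_hypotheses(doc_concepts):
    """Return triples whose three pairwise relationships are each detected,
    but which never appear together in any one document."""
    related = set()
    for concepts in doc_concepts:
        related.update(frozenset(pair) for pair in combinations(sorted(concepts), 2))

    all_concepts = sorted(set().union(*doc_concepts))
    hypotheses = []
    for triple in combinations(all_concepts, 3):
        pairs_detected = all(
            frozenset(pair) in related for pair in combinations(triple, 2)
        )
        seen_together = any(set(triple) <= concepts for concepts in doc_concepts)
        if pairs_detected and not seen_together:
            hypotheses.append(triple)
    return hypotheses


# Illustrative (assumed) concept sets, one per parsed document.
sample_doc_concepts = [{"A", "B"}, {"B", "C"}, {"A", "C"}, {"A", "D"}]
sample_chains = chain_hypotheses(sample_doc_concepts)
```

Enumerating every triple is cubic in the number of concepts, so a practical system would likely restrict the candidates, for example to concepts within a user's semantic categories.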

In some additional embodiments, at least a portion of the plurality of documents may comprise at least one of a first concept, a second concept, and a plurality of linking concepts. According to some such embodiments, the parsing step may further comprise: detecting a first relationship between the first concept and a first portion of the plurality of linking concepts; detecting a second relationship between the second concept and a second portion of the plurality of linking concepts; and determining a potential substitution relationship between the first concept and the second concept at least partially from the detected first and second relationships and a number of overlapping concepts present in both the first portion and the second portion of the plurality of linking concepts. Furthermore, in some such embodiments, the step for generating the hypothesis may further comprise generating a substitution hypothesis comprising the previously unknown combination of at least one of the first and second concepts with a portion of the plurality of linking concepts not present in the number of overlapping concepts. Furthermore, in some such embodiments, the parsing step may further comprise determining a strength of the potential substitution relationship between the first and second concepts based at least in part on the number of concepts present in both the first portion and the second portion of the plurality of linking concepts.
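The substitution case can be illustrated with set operations. In the sketch below, the strength is simply the count of overlapping linking concepts, and a hypothetical min_overlap threshold decides whether a substitution relationship is proposed at all; neither the threshold nor the function name is specified by the description above.

```python
def substitution_hypotheses(links, first, second, min_overlap=2):
    """Return (strength, hypotheses) for a candidate pair of concepts.

    links maps each concept to the set of linking concepts it is related to.
    Each hypothesis pairs one concept with a linking concept that so far
    is related only to the other concept.
    """
    first_links = links.get(first, set())
    second_links = links.get(second, set())
    overlap = first_links & second_links
    strength = len(overlap)  # assumed strength measure: raw overlap count
    if strength < min_overlap:
        return strength, []
    hypotheses = [(first, c) for c in sorted(second_links - overlap)]
    hypotheses += [(second, c) for c in sorted(first_links - overlap)]
    return strength, hypotheses


# Illustrative (assumed) relationships detected during parsing.
sample_links = {
    "drug_x": {"p1", "p2", "p3"},
    "drug_y": {"p2", "p3", "p4"},
}
sample_strength, sample_substitutions = substitution_hypotheses(
    sample_links, "drug_x", "drug_y"
)
```

Here drug_x and drug_y share two linking concepts, so each is proposed in combination with the linking concept currently related only to the other.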

Furthermore, in some other method and/or computer program embodiments, at least a portion of the plurality of documents comprises at least one of a first concept, a second concept, and a third concept. In some such embodiments, the parsing step may further comprise: detecting a first relationship between the first concept and the second concept; detecting a second relationship between the second concept and the third concept; and determining a potential pairwise relationship between the first concept and the third concept at least partially from the detected first and second relationships. In some such embodiments, the step for generating the hypothesis may further comprise generating a pairwise hypothesis comprising the previously unknown combination of the first and third concepts. Furthermore, in some such embodiments, the parsing step may further comprise assessing a strength of the potential relationship between the first and third concepts at least partially from a known secondary relationship between the first and third concepts. In various embodiments, the known secondary relationship may comprise a common semantic category including both the first and third concepts. Furthermore, in some such embodiments, the relationship rule generated in the determining step may comprise the common semantic category used to assess the strength of the potential pairwise relationship between the first and third concepts.
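The pairwise inference described above may be illustrated by the following minimal sketch. It assumes that the detected relationships are available as a set of unordered concept pairs and that a concept-to-category mapping exists; the function name, the scoring scheme, and the sample concept labels are hypothetical and are not taken from the disclosure.

```python
from itertools import combinations


def pairwise_hypotheses(related, semantic_type):
    """Propose (first, third) pairs that share a linking second concept.

    related: set of frozenset({a, b}) pairs detected in the parsed documents
             (each pair is assumed to hold two distinct concepts).
    semantic_type: dict mapping concept -> semantic category (assumed input).
    Returns a dict mapping each proposed pair to a crude strength score.
    """
    # Build an adjacency map from the detected relationships.
    neighbors = {}
    for pair in related:
        a, b = tuple(pair)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    proposals = {}
    for second, linked in neighbors.items():
        for first, third in combinations(sorted(linked), 2):
            if frozenset((first, third)) in related:
                continue  # the combination is already known
            # Each distinct linking concept adds one point (an assumption).
            proposals[(first, third)] = proposals.get((first, third), 0) + 1

    # A known secondary relationship (a common semantic category) adds weight.
    for first, third in proposals:
        category = semantic_type.get(first)
        if category is not None and category == semantic_type.get(third):
            proposals[(first, third)] += 1
    return proposals
```

With relationships A-B and B-C and with A and C sharing a category, the sketch proposes the pair (A, C); if A-C were already a detected relationship, nothing would be proposed.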

Various method and/or computer program products may also comprise presenting the hypothesis so as to indicate the previously unknown combination. In some such embodiments, the step for presenting the hypothesis may comprise presenting a display to a user comprising a visual representation of the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts. Furthermore, according to some embodiments, the visual representation presented in the display comprises an interactive icon configured to be selectable by the user. According to such embodiments, the interactive icon may be further configured to modify the display when selected by the user.

Various method and/or computer program product embodiments of the present invention may also comprise various steps for optimizing the generated hypothesis to meet the information needs of a particular user. For example, some embodiments may comprise steps for identifying a portion of the plurality of documents in the data repository associated with the user, and creating a user profile based at least in part on the identified documents. The created user profile may be indicative of a user information need. Some such embodiments may further comprise a step for modifying the hypothesis in response to the user profile such that the modified hypothesis at least partially corresponds to the user information need. In some such embodiments, the created user profile may comprise at least one semantic category and the method and/or computer program product may further comprise a step for filtering the presented hypothesis such that the previously unknown combination includes only at least one phrase and at least one concept corresponding substantially to the at least one semantic category present in the user profile.
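One possible way to realize the profile-based filtering described above is sketched below. The sketch assumes that each document associated with the user has already been reduced to a list of concepts and that a concept-to-category mapping is available; the function names and the `top_n` parameter are hypothetical.

```python
from collections import Counter


def build_profile(user_docs, semantic_type, top_n=2):
    """Collect the semantic categories that dominate a user's documents.

    user_docs: list of concept lists, one inner list per document.
    semantic_type: dict mapping concept -> semantic category.
    """
    counts = Counter(
        semantic_type[c] for doc in user_docs for c in doc if c in semantic_type
    )
    return {category for category, _ in counts.most_common(top_n)}


def filter_hypotheses(hypotheses, profile, semantic_type):
    """Keep only hypotheses whose concepts all fall in the profile's categories."""
    return [
        h for h in hypotheses
        if all(semantic_type.get(c) in profile for c in h)
    ]
```

In this sketch the profile is simply a set of category names, so a hypothesis is retained only when every concept in it maps to a category found in the user's own documents.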

Various embodiments of the present invention may also provide systems for mining information from a data repository comprising a plurality of documents to produce a hypothesis. The data repository may include, but is not limited to: a biomedical literature database; a medical records database; a chemical literature database; a computer science literature database; a physics literature database; a legal literature database; a psychology literature database; a social science literature database; a news periodical database; a business journal database; and a combination of such databases.

The system comprises a system for extracting relationships comprising a plurality of phrases and a plurality of concepts. The system for extracting relationships may include, but is not limited to: a vocabulary database corresponding to a selected subject area; a predetermined lexicon; a semantic network; a metathesaurus; and combinations of such systems for extracting relationships. Various system embodiments further comprise a host computing element in communication with the system for extracting relationships for accessing the system. The host computing element is configured for determining a relationship rule defining a relationship among at least a portion of the plurality of phrases and at least a portion of the plurality of concepts. The host computing element may be configured for determining a relationship rule that includes, but is not limited to: an assignment of at least one of the plurality of phrases to at least one of the plurality of concepts; an assignment of at least one of the plurality of phrases to a relationship identifier, wherein the relationship identifier links a first one of the plurality of concepts to a second one of the plurality of concepts; an assignment of at least one of the plurality of concepts to a semantic category; an arrangement of at least a portion of the plurality of concepts in a hierarchical relationship, wherein a first one of the portion of concepts comprises a child concept and a second one of the portion of concepts comprises a parent concept; and/or a combination of such relationship rules. Some system embodiments may further comprise a memory device in communication with the host computing element, wherein the memory device is configured for storing the determined relationship rule for later or repeated use in a subsequent parsing step.

Furthermore, the host computing element may also be configured for parsing the plurality of documents in a literature database according to the relationship rule, wherein the plurality of documents each comprises at least a portion of one of the plurality of phrases and the plurality of concepts. Furthermore, the host computing element is configured for generating the hypothesis comprising a previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts. The previously unknown combination generated by the host computing element may be at least partially determined from the parsed plurality of documents. Furthermore, some system embodiments may also comprise a user interface in communication with the host computing element, wherein the user interface is configured for presenting the hypothesis so as to indicate the previously unknown combination. In some system embodiments, the user interface may present the hypothesis as a display to a user comprising a visual representation of the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts. In some such system embodiments, the user interface may present the visual representation as an interactive icon configured to be selectable by the user. The interactive icon may be further configured to modify the display when selected by the user.

In some system embodiments, the host computing element may also be configured for customizing and/or optimizing the presented hypothesis for a particular user. For example, in some embodiments, the host computing element may identify a portion of the plurality of documents in the data repository associated with a user and thereby create a user profile based at least in part on the identified documents. The user profile created by the host computing element in such embodiments may be indicative of a user information need. The created user profile may also comprise at least one semantic category and the host computing element may therefore filter the presented hypothesis such that the previously unknown combination includes only at least one phrase and at least one concept corresponding substantially to the at least one semantic category. Furthermore, in some system embodiments, the host computing element may modify the hypothesis in response to the user profile such that the modified hypothesis at least partially corresponds to the user information need.

Thus the systems, methods, and computer program products for generating and displaying potential hypotheses based on written works selected from a database, as described in the embodiments of the present invention, provide many advantages that may include, but are not limited to: providing a conceptual research system configured for mining raw materials from the large amounts of literature in a given data repository to generate potential hypotheses for future directed research; providing a research system and method capable of uncovering previously unknown and/or unappreciated combinations of concepts and/or phrases in a data repository; providing a conceptual research system capable of defining a user profile that is indicative of a particular user's information needs and modifying a proposed conceptual research hypothesis based at least in part on the defined user profile; and providing a conceptual research system that is configurable for mining usable data (and generating proposed hypotheses) in a variety of different types of data repositories.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

In the description below, reference is made to the accompanying drawings, which are not necessarily drawn to scale, and wherein:

FIG. 1 provides a non-limiting schematic overview of the structure and components of a system and method for automatically generating different types of hypotheses from biomedical literature according to one embodiment of the present invention;

FIGS. 2A-D illustrate non-limiting data structures and relationship rules resulting from a generating step, according to one embodiment of the present invention;

FIGS. 3A-B illustrate non-limiting data structures and previously unknown combinations resulting from a document parsing step, according to one embodiment of the present invention;

FIGS. 4A-C illustrate non-limiting schematics of algorithms for generating three different types of hypotheses, according to one embodiment of the present invention;

FIG. 5 illustrates a process of providing personalized discovery results to match different users' information and knowledge needs that may be implemented according to one embodiment of the present invention;

FIGS. 6A-C illustrate non-limiting schematics of a display comprising a visual representation of the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts, according to one embodiment of the present invention;

FIGS. 7A-B illustrate a non-limiting schematic of a host computing element and system useable for implementing various embodiments of the present invention; and

FIG. 8 illustrates a non-limiting schematic of a hypothesis-generating system, according to one embodiment of the present invention, including a hypothesis verification module in communication therewith.

DETAILED DESCRIPTION OF THE INVENTION

The present inventions will now be described with reference to the accompanying drawings, in which some, but not all embodiments of the inventions are shown. Indeed, these inventions may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like numbers refer to like elements throughout.

As shown in FIGS. 1-5 and 6A-C, various embodiments of the present invention provide an improved system, method, and computer program product for automatically and/or systematically generating different types of hypotheses from data repositories 20. Specifically, the embodiments as presented in the non-limiting figures are configured for generating three types of hypotheses: (1) potential pair-wise relations (see FIG. 1, element 136, for example); (2) potential chain relations (see FIG. 1, element 132, for example); and (3) potential substitution relations (see FIG. 1, element 134, for example). Importantly, the hypotheses generated by the various embodiments described herein include previously unknown combinations of concepts and/or phrases that may be at least partially determined from a parsed plurality of documents within a data repository 20.

Many of the exemplary embodiments described herein relate generally to the generation of hypotheses related to biomedical literature and/or research such that the various embodiments described herein may be capable of achieving the technical effect of producing proposed hypotheses that may lead to breakthroughs in the application of certain combinations of drugs to certain diseases or disease states. It should be understood, however, that the various embodiments described herein may be used to parse and/or mine other types of data repositories 20 for potentially groundbreaking research topics. For example, the various embodiments herein may be configured for parsing and/or analyzing documents found in data repositories 20 that may include, but are not limited to: biomedical literature databases; medical records databases; chemical literature databases; computer science literature databases; physics literature databases; legal literature databases; psychology literature databases; social science literature databases; news periodical databases; business journal databases; and combinations of such databases. The term “document” as used herein may include, but is not limited to: published journal articles; text strings (such as, for example, a physician's comments in a medical record entry); file records (such as a particular medical record); resumes and/or curriculum vitae; a thesis; a numerical string of data; a patent document (including, for example, issued patents, patent applications, and publicly-available patent prosecution documents); online journal articles; internet web pages; material safety data sheets; pharmaceutical and/or chemical data sheets; advertisements; reported court case and/or administrative proceedings; news articles; letters; and combinations of such materials.

It should be further understood that the generated hypotheses may be implied by patterns embedded in the document corpus of such data repositories such that appropriate relationship rules (as described further herein) may be determined and subsequently applied to the data repository in a substantially automatic “mining mode” to generate hypotheses that may be completely beyond the expectation of a system user.

In accordance with another embodiment, the present invention analyzes various semantic relations among the concepts involved in the identified hypotheses and provides visualization of these relations in an intuitive way. Particular documents in support of each of these relations may be identified to the system users for their further research. In addition, specific search results can be customized for particular researchers based on their specified or potential interests. In operation, a given researcher's interests are identified by automatically analyzing any prior publications or papers related to this researcher. Furthermore, in some embodiments, search results are verified using an independent resource.

As shown in FIG. 1, various embodiments of the present invention may provide a method for generating a hypothesis (such as, for example, potential chain hypotheses 132, potential substitution hypotheses 134, and/or potential pairwise hypotheses 136). The method may comprise, for example, step 110 for accessing a system for extracting relationships 10 comprising a plurality of phrases and a plurality of concepts and determining a relationship rule (see elements 111, 112, 113, 114, for example) defining a relationship among at least a portion of the plurality of phrases and at least a portion of the plurality of concepts. The system for extracting relationships 10 may include, but is not limited to: a vocabulary database corresponding to a selected subject area; a predetermined lexicon; a semantic network; a metathesaurus; and/or combinations of such systems. For example, in some embodiments, wherein the method is used to parse biomedical literature in search of potential research hypotheses, the system for extracting relationships 10 may comprise the Unified Medical Language System (UMLS), which may further comprise component databases including, but not limited to: the Metathesaurus®; the Semantic Network; and/or the SPECIALIST lexicon. It should be understood that the Metathesaurus® may comprise a large vocabulary database containing information about biomedical and health-related concepts, their various names, and the various relationships among them. The Semantic Network provides a substantially consistent categorization (i.e. a “Semantic Type”) of all concepts represented in the UMLS Metathesaurus® and defines a set of relationships that may hold between the various semantic types. Furthermore, the SPECIALIST lexicon is a general English lexicon comprising a variety of biomedical terminology. The system for extracting relationships 10 may also, in some embodiments, define a hierarchical structure among the various phrases and/or concepts included therein. 
For example, in embodiments, wherein the system for extracting relationships 10 comprises the UMLS, the system for extracting relationships 10 may further comprise MeSH, which provides a controlled vocabulary thesaurus configured for arranging descriptors (terms and/or phrases, for example) in a hierarchical structure. For example, MeSH may comprise descriptors organized in a plurality of categories that may include, but are not limited to: (A) anatomic terms; (B) organisms; (C) diseases; (D) drugs and/or chemicals; and combinations of such descriptors. Each category of descriptors in MeSH may be further subdivided into a variety of subcategories.

Referring to FIG. 1, the relationship rule generated in step 110 may include, but is not limited to: an assignment of at least one of the plurality of phrases to at least one of the plurality of concepts (see element 112 and FIG. 2A); an assignment of at least one of the plurality of phrases to a relationship identifier (see element 114 and FIG. 2B, wherein the relationship identifier may link a first one of the plurality of concepts to a second one of the plurality of concepts); an assignment of at least one of the plurality of concepts to a semantic category (see element 111 and FIG. 2C, for example); an arrangement of at least a portion of the plurality of concepts in a hierarchical relationship (see element 113 and FIG. 2D for example), wherein a first one of the portion of concepts comprises a child concept and a second one of the portion of concepts comprises a parent concept; and/or combinations of such relationship rules.
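The four kinds of relationship rules listed above may be pictured as four lookup tables. The following is a minimal sketch under that assumption; the class name, the field names, and the definition of a "specific" concept (one that is not the parent of any other concept) are hypothetical choices made for illustration and are not prescribed by the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class RelationshipRules:
    # Phrase -> concept identifier (compare element 112).
    phrase_to_concept: dict = field(default_factory=dict)
    # Phrase -> relationship identifier linking two concepts (compare element 114).
    phrase_to_relation: dict = field(default_factory=dict)
    # Concept -> semantic category (compare element 111).
    concept_to_type: dict = field(default_factory=dict)
    # Child concept -> parent concept (compare element 113).
    child_to_parent: dict = field(default_factory=dict)

    def is_specific(self, concept):
        """Assumption: a concept is specific if nothing lists it as a parent."""
        return concept not in self.child_to_parent.values()

    def ancestors(self, concept):
        """Walk up the hierarchy, guarding against accidental cycles."""
        chain, seen = [], set()
        while concept in self.child_to_parent and concept not in seen:
            seen.add(concept)
            concept = self.child_to_parent[concept]
            chain.append(concept)
        return chain
```

Keeping each rule as a separate mapping mirrors the modular, table-like presentation of the rules, so that any one table may be stored or replaced without touching the others.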

Referring to FIGS. 2A-2D, one or more of the relationship rules generated in step 110 may, in some embodiments, be presented and/or stored in a tabular data structure. The tables shown in FIGS. 2A-2D may be presented to a user in some embodiments, to supplement the presentation of the hypothesis (see step 140 and FIGS. 6A-6C, for example) so as to indicate the relationship rule or rules underlying the previously unknown combination and corresponding hypothesis. In other embodiments, the tabulated relationship rules shown, for example, in FIGS. 2A-2D may serve as modular data structures that may be stored in a memory device (see elements 722, 724 and 728 of FIG. 7B) for later or repeated use in a subsequent parsing step 120. Thus, in some such embodiments, the stored relationship rules (111, 112, 113, 114) may be maintained as a pre-computed set of relationship rules that may be used to efficiently parse (step 120) a plurality of documents stored in a particular data repository 20 to which the relationship rules may most likely apply in order to generate (step 130) and present (step 140) potential hypotheses to a user comprising previously unknown combinations of phrases and concepts. For example, various methods of the present invention may provide “conceptual research” services that may provide access to the stored relationship rules 111, 112, 113, 114 (see also, FIGS. 2A-2D, for example) to a user or client that may be in communication with a host computer 700 (see FIG. 7A, for example) configured for performing the various method and/or computer program steps outlined in FIG. 1. Such users may thus define and access stored relationship rules that apply to various systems for extracting relationships 10 and/or data repositories 20 that may be pertinent to their own research areas of interest.
For example, a biomedical researcher may “subscribe” to a relationship rule service that may provide efficient document parsing 120 and hypothesis generation 130 services for a biomedical data repository (such as, for example, MEDLINE). Similarly, an attorney or law professor might subscribe to a relationship rule database (stored for example, in various memory devices 722, 724 and 728 of a host computing element 700) that pertains more particularly to a data repository 20 comprising case reporters and/or legal journals such that the system and/or method embodiments of the present invention may more efficiently parse 120 the documents stored therein to generate 130 a proposed legal hypothesis gleaned from the semantic relationships outlined in the stored relationship rules 111, 112, 113, 114 and as specifically applied to the appropriate data repository 20.

As shown in FIG. 1, various method embodiments may further comprise step 120 for parsing a plurality of documents in a data repository 20 according to one or more of the generated relationship rules (111, 112, 113, and 114). The plurality of documents in the data repository 20 may each comprise at least a portion of one of the plurality of phrases and the plurality of concepts. As used herein, the term “parsing” refers generally to the breaking down of various documents into component key phrases and/or concepts that may be comparable to the phrases and/or concepts present in a compatible system for extracting relationships 10. For example, a system for extracting relationships 10 such as the UMLS may be used as the raw material for generating various relationship rules (111, 112, 113, and 114) that may be used to parse a UMLS-compatible data repository (such as MEDLINE or another compatible database of biomedical literature).

The parsing step 120 may comprise performing various quantitative and/or qualitative operations on the component key phrases and/or concepts. As shown generally in FIG. 1, the parsing step 120 may result in products that may include, but are not limited to: concept-concept relationships 122 (generated, for example, when a threshold number of documents within the data repository 20 tie two or more concepts together); and concept-document relationships 124 (which may map concepts and/or phrases of interest to those documents in which they appear at a frequency that exceeds a selected frequency). It should be understood that threshold numbers and/or frequencies of concepts as described herein may be selected by a user (using, for example, a host computing element 700 as described further herein with respect to FIGS. 7A and 7B). In other embodiments, threshold numbers and/or frequencies of concepts as described herein may be pre-computed and/or pre-assigned and stored in a memory device 722, 724, 728 associated with a host computing element 700 (as described further herein with respect to FIGS. 7A and 7B).
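The two products of the parsing step described above may be sketched as follows. This is an illustrative simplification only: it assumes plain substring matching of phrases against document text, treats a threshold as met when a count is greater than or equal to it, and uses hypothetical names (`parse_documents`, `pair_threshold`, `freq_threshold`) that do not appear in the disclosure.

```python
from collections import Counter, defaultdict
from itertools import combinations


def parse_documents(documents, phrase_to_concept, pair_threshold=2, freq_threshold=1):
    """Derive concept-concept and concept-document relationships.

    documents: dict mapping doc_id -> text.
    phrase_to_concept: dict mapping phrase -> concept identifier.
    Returns (concept_concept, concept_document).
    """
    pair_counts = Counter()
    concept_document = defaultdict(set)
    for doc_id, text in documents.items():
        lowered = text.lower()
        freq = Counter()
        for phrase, concept in phrase_to_concept.items():
            # Naive matching: count raw substring occurrences of the phrase.
            freq[concept] += lowered.count(phrase.lower())
        present = sorted(c for c, n in freq.items() if n >= freq_threshold)
        for concept in present:
            concept_document[concept].add(doc_id)
        # Every pair of concepts found together in this document is tallied once.
        pair_counts.update(combinations(present, 2))
    concept_concept = {pair for pair, n in pair_counts.items() if n >= pair_threshold}
    return concept_concept, dict(concept_document)
```

A production parser would need tokenization and phrase normalization rather than substring counts; the sketch is intended only to show where the user-selected or pre-assigned thresholds enter the computation.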

Various method embodiments may further comprise step 130 for generating a hypothesis (that may include, but is not limited to: potential chain hypotheses 132, potential substitution hypotheses 134, and/or potential pairwise hypotheses 136). The hypotheses generated in step 130 may comprise a previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts and may thus be used as the basis for “conceptual research” wherein a researcher is presented with a potential hypothesis that suggests and/or identifies a research topic or direction that has not been addressed in previous research (as documented by the documents in the data repository 20). As described herein, the previously unknown combination (embodied in the generated hypothesis) may be at least partially determined from the parsed (see step 120, for example) plurality of documents present in the data repository 20.

In some embodiments, a “chain” relationship may be established in the parsing step 120 among three or more previously unrelated phrases and/or concepts. For example, at least a portion of the plurality of documents present in the data repository 20 may comprise at least one of a first concept, a second concept, and a third concept. In some such embodiments, the parsing step 120 (utilizing one or more previously-identified and/or stored relationship rules (see elements 111, 112, 113, 114, for example)) may comprise: (1) detecting a first relationship between the first and second concepts; (2) detecting a second relationship between the second and third concepts; (3) detecting a third relationship between the first and third concepts; and (4) determining a potential chain relationship 132 among the first, second, and third concepts at least partially from the detected first, second, and third relationships. In such embodiments, the generating step 130 may further comprise generating a chain hypothesis 132 comprising the previously unknown combination of the first, second, and third concepts in a “chain” combination.

For example, such a “chain” relationship may be established among three medical concepts (such as three therapeutic compounds (A, B, C) belonging to the same general class of drugs (as indicated, for example, by a relationship rule comprising an assignment of at least one of the plurality of concepts to a semantic category outlining the class of drug (see relationship rule 111, in FIG. 1, for example))). The parsing step 120 may indicate that therapeutic compounds A and B have a strong pairwise relationship (see element 122 tying the concepts A and B together). The parsing step 120 may further indicate that therapeutic compounds B and C, and likewise A and C, have strong pairwise relationships (see element 122 indicating an assessment of the strength of a pairwise relationship between two particular concepts). The parsing step 120 may further indicate that no particular document studies all three therapeutic compounds (A, B and C) together (as indicated, for example, by element 124, which comprises an evaluation of the relationships between one or more concepts and each document in the data repository 20). In such an example, step 130 may comprise generating a potential chain hypothesis 132 reporting the previously unknown combination of the therapeutic compounds A, B and C. The hypothesis may, in some embodiments, tie the chain hypothesis to a particular semantic type (see element 111, for example) corresponding to a particular disease state that may respond most favorably to treatment with the combination of therapeutic compounds A, B and C as indicated by the documents in the data repository 20.

FIG. 4A shows a detailed depiction of various exemplary subroutines that may be used to accomplish the hypothesis generating step 130 for a potential chain hypothesis 132 comprising concepts A, B and C. For example, given a set of interesting semantic types (IS) in the system for extracting relationships 10, the generating step 130 may comprise advancing through each concept A in a particular set of concept-concept relationships 122 and determining: (1) if concept A is a specific concept (according to a concept-hierarchical relationship rule 113, for example); and (2) if concept A's semantic type (according to a concept-semantic type relationship rule 111, for example) belongs in the input set of interesting semantic types IS. If the answer to inquiry (2) is positive, the generating step 130 may further comprise retrieving concept A's related concepts (by querying relationship rule 122, for example), and denoting these results as A-relates. Then, for each concept (B) in the group of A-relates, the generating step 130 may comprise determining: (1) if concept B's semantic type (by consulting relationship rule 111, for example) is the same as concept A's semantic type; and (2) if B is a specific concept (according to a concept-hierarchical relationship rule 113, for example). If both (1) and (2) are positive, then the generating step 130 may further comprise retrieving concept B's related concepts (according to relationship rule 122, for example) and denoting these results as B-relates. In order to complete the “chain” of concepts A, B and C, the generating step 130 may further comprise, for each concept C among the B-relates, determining if concept C's semantic type (according to relationship rule 111, for example) is the same as concept A's. 
If so, then the generating step 130 may further comprise determining if concept C is a specific concept (according to a concept-hierarchical relationship rule 113, for example) and consulting the concept-document relationships 124 in the pertinent data repository 20 (uncovered, for example, in the parsing step 120) to retrieve the number of documents in the data repository 20 that contain some mention of concepts A, B and C in combination. If the retrieved number of documents is less than some selected threshold level (that may be pre-defined as a threshold for a “previously unknown” and/or “previously unappreciated” chain combination), then the generating step 130 may provide an output comprising a proposed chain hypothesis 132 that may be presented to a user, for example, in step 140.
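The subroutines described above with respect to FIG. 4A may be approximated by the nested loop below. The sketch is a simplified reading of the text rather than a reproduction of the figure: it assumes that the concept-concept relationships, semantic types, specific-concept flags, and per-document concept sets are already available as in-memory mappings, and the function and parameter names (including `threshold`) are hypothetical.

```python
def chain_hypotheses(related, concept_type, specific, doc_concepts,
                     interesting_types, threshold=1):
    """Propose chains {A, B, C} that rarely or never co-occur in one document.

    related: dict concept -> set of related concepts (compare element 122).
    concept_type: dict concept -> semantic type (compare rule 111).
    specific: set of concepts treated as specific (compare rule 113).
    doc_concepts: dict doc_id -> set of concepts (compare element 124).
    interesting_types: the input set of interesting semantic types (IS).
    threshold: a chain is proposed when fewer documents than this mention
               A, B and C together.
    """
    results = set()
    for a in related:
        type_a = concept_type.get(a)
        if a not in specific or type_a not in interesting_types:
            continue
        for b in related[a]:  # the "A-relates"
            if b == a or b not in specific or concept_type.get(b) != type_a:
                continue
            for c in related.get(b, ()):  # the "B-relates"
                if c in (a, b) or c not in specific:
                    continue
                if concept_type.get(c) != type_a:
                    continue
                together = sum(
                    1 for concepts in doc_concepts.values()
                    if {a, b, c} <= concepts
                )
                if together < threshold:
                    results.add(frozenset((a, b, c)))
    return results
```

Storing each chain as a frozenset collapses the orderings A-B-C, C-B-A, and so on into one proposal; an implementation that needs to preserve the direction of the chain would keep tuples instead.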

According to some embodiments, a “substitution” relationship may be established in the parsing step 120 among two or more previously unrelated phrases and/or concepts. For example, at least a portion of the plurality of documents present in the data repository 20 may comprise at least one of a first concept, a second concept, and a plurality of linking concepts. In some such embodiments, the parsing step 120 (utilizing one or more previously-identified relationship rules (see elements 111, 112, 113, 114, for example)) may comprise: (1) detecting a first relationship between the first concept and a first portion of the plurality of linking concepts; (2) detecting a second relationship between the second concept and a second portion of the plurality of linking concepts; and (3) determining a potential substitution relationship 134 between the first concept and the second concept at least partially from the detected first and second relationships and a number of overlapping concepts present in both the first portion and the second portion of the plurality of linking concepts. In such embodiments, the generating step 130 may further comprise generating a substitution hypothesis 134 comprising the previously unknown combination of at least one of the first and second concepts with a portion of the plurality of linking concepts not present in the number of overlapping concepts. In some such method embodiments, the parsing step 120 may further comprise determining a strength of the potential relationship between the first and second concepts in the proposed substitution hypothesis 134 based at least in part on the number of concepts present in both the first portion and the second portion of the plurality of linking concepts.

For example, such a substitution relationship may be established among: (1) a pair of medical concepts (such as two therapeutic compounds (A and B)); (2) a list of component compounds present in both therapeutic compounds A and B (X1, X2, X3, . . . , Xm); and (3) a disease or condition (Y) that is reported as responding positively to treatment with therapeutic compound A. The parsing step 120 may first comprise applying a relationship rule comprising an assignment of therapeutic compounds A and B to a semantic category outlining the common class of drug (see relationship rule 111, in FIG. 1, for example). The parsing step 120 may further indicate that therapeutic compound A has a strong relationship with the list of component compounds (X1, X2, X3, . . . , Xm) (see, for example, element 112 (tying phrases indicative of the component compounds (X1, X2, X3, . . . , Xm) to a concept ID indicative of compound A)). The parsing step 120 may further indicate that therapeutic compound B also has a strong relationship with the list of component compounds (X1, X2, X3, . . . , Xm) (see, for example, element 112 (tying phrases indicative of the component compounds (X1, X2, X3, . . . , Xm) to a concept ID indicative of compound B)). The parsing step 120 may further indicate that the data repository 20 generally indicates that therapeutic compound A correlates strongly to disease or condition Y (as indicated, for example, by element 122, which comprises an evaluation of the relationships between one or more concepts present in the data repository 20). In such an example, step 130 may comprise generating a potential substitution hypothesis 134 reporting the previously unknown combination of the therapeutic compound B with the disease condition Y (i.e. therapeutic compound B may be reported as a potential substitute for therapeutic compound A in the treatment of disease or condition Y).
The hypothesis may, in some embodiments, tie the chain hypothesis to a particular semantic type (see element 111, for example) corresponding to a particular disease state that may respond most favorably to treatment with the combination of therapeutic compounds A, B and C as indicated by the documents in the data repository 20. A “strength” or potential probative value of the potential substitution hypothesis 134 may be evaluated quantitatively in some embodiments by measuring the value of “m” (i.e. the number of phrases in the list of phrases (X) tying the therapeutic compounds A and B together).

FIG. 4B shows a detailed depiction of various exemplary subroutines that may be used to accomplish the hypothesis generating step 130 for a potential substitution hypothesis 134. Given a set of interesting semantic types (IS) in the system for extracting relationships 10 and a predefined set of interesting relations (IR), the generating step 130 may comprise advancing through each of the concept-concept relationships defined, for example, between concept A and other concepts via the relationship rule 122. The generating step 130 may further comprise (1) determining if concept A is a specific concept (according to a concept-hierarchical relationship rule 113, for example); and (2) determining the semantic type of concept A (via the concept-semantic type relationship rule 111, for example). If the determined semantic type of concept A falls within the given IS, then the generating step 130 may further comprise retrieving various concepts related to concept A (by consulting the concept-concept relationship rule 122, for example) and denoting these concepts as A-relates. For each concept B in the set of A-relates, the generating step 130 may further comprise determining if B is a specific concept (according to a concept-hierarchical relationship rule 113, for example) and retrieving concepts related to concept B (by consulting the concept-concept relationship rule 122, for example) and denoting these concepts as B-relates. Furthermore, for each concept C in the set of B-relates, the generating step 130 may further comprise determining if C's semantic type is the same as that for A (by consulting the concept-semantic type relationship rule 111, for example) and analyzing the concept-concept relationship rule 122 to obtain a set of concepts related to both concept A and concept C. 
The generating step 130 may further comprise determining if the number of obtained concepts is greater than some selected threshold; and analyzing the concept-concept relationship rule 122 to obtain a list of concepts that are related to concept A but that are unrelated to concept C and denoting the obtained concepts as group A_not_C. Furthermore, the generating step 130 may further comprise determining, for each concept X in A_not_C, if the relationship between A and X is present in the defined set of interesting relations (IR) and reporting a potential substitution hypothesis 134 comprising concept C as a potential substitution for concept A in relation to concept X.
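
The subroutines described for FIG. 4B can be sketched in code. The following is only an illustrative sketch, not the implementation of any embodiment: it assumes the concept-concept relationship 122 is available as a nested dictionary of relation labels, the concept-semantic type rule 111 as a dictionary, and the "specific concept" test of rule 113 as a set of concept names. The function name, the `SubstitutionHypothesis` record, the exclusion of C itself from A_not_C, and all example concepts are hypothetical.

```python
from typing import Dict, List, NamedTuple, Set


class SubstitutionHypothesis(NamedTuple):
    original: str    # concept A
    substitute: str  # concept C, proposed in place of A
    target: str      # concept X, related to A but not yet to C
    overlap: int     # number of concepts related to both A and C


def generate_substitution_hypotheses(
    relations: Dict[str, Dict[str, str]],  # stand-in for relationship 122
    semantic_type: Dict[str, str],         # stand-in for rule 111
    specific: Set[str],                    # stand-in for rule 113
    interesting_types: Set[str],           # IS
    interesting_relations: Set[str],       # IR
    threshold: int,
) -> List[SubstitutionHypothesis]:
    found = set()
    for a, a_links in relations.items():
        if a not in specific or semantic_type.get(a) not in interesting_types:
            continue
        a_relates = set(a_links)
        for b in a_relates:
            if b not in specific:
                continue
            for c in relations.get(b, {}):
                # Assumption: A is never proposed as its own substitute.
                if c == a or semantic_type.get(c) != semantic_type[a]:
                    continue
                c_relates = set(relations.get(c, {}))
                overlap = len(a_relates & c_relates)
                if overlap <= threshold:
                    continue
                # Concepts related to A but not to C (group A_not_C).
                a_not_c = a_relates - c_relates - {c}
                for x in a_not_c:
                    if a_links[x] in interesting_relations:
                        found.add(SubstitutionHypothesis(a, c, x, overlap))
    return sorted(found)


# Hypothetical example mirroring compounds A and B sharing components X1-X3.
components = {"x1": "contains", "x2": "contains", "x3": "contains"}
relations = {
    "drug_A": {**components, "disease_Y": "treats"},
    "drug_B": dict(components),
    "x1": {"drug_A": "component_of", "drug_B": "component_of"},
    "x2": {"drug_A": "component_of", "drug_B": "component_of"},
    "x3": {"drug_A": "component_of", "drug_B": "component_of"},
    "disease_Y": {"drug_A": "treated_by"},
}
semantic_type = {
    "drug_A": "Drug", "drug_B": "Drug", "disease_Y": "Disease",
    "x1": "Chemical", "x2": "Chemical", "x3": "Chemical",
}
hypotheses = generate_substitution_hypotheses(
    relations, semantic_type, set(relations), {"Drug"}, {"treats"}, threshold=2
)
```

In this sketch the `overlap` field plays the role of the value "m" discussed above, so a caller could rank the reported hypotheses by it.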

According to some other embodiments, a “pairwise” relationship may be established by the parsing step 120 among two or more previously unrelated phrases and/or concepts. For example, at least a portion of the plurality of documents present in the data repository 20 may comprise at least one of a first concept, a second concept, and a third concept. In some such embodiments, the parsing step 120 (utilizing one or more previously-identified relationship rules (see elements 111, 112, 113, 114, for example)) may comprise: (1) detecting a first relationship between the first concept and the second concept; (2) detecting a second relationship between the second concept and the third concept; and (3) determining a potential pairwise relationship between the first concept and the third concept at least partially from the detected first and second relationships. In such embodiments, the generating step 130 may further comprise generating a pairwise hypothesis 136 comprising the previously unknown combination of the first and third concepts.

According to some such “pairwise” hypothesis embodiments, the parsing step 120 may further comprise assessing a strength of the potential relationship between the first and third concepts at least partially from a known secondary relationship between the first and third concepts. For example, in some embodiments, the known secondary relationship may comprise a common semantic category including both the first and third concepts (as indicated, for example, by a concept-semantic type relationship rule 111 (and/or another relationship rule type) that may be a product of the determining step 110 (as shown generally in FIG. 1)).

For example, a pairwise hypothesis 136 may be generated in step 130 by first determining that a strong relationship exists between concept A (i.e. a first therapeutic compound A) and concept X (a disease or condition X). This may be accomplished, for example, using the concept-concept relationship output 122 of an initial parsing step 120. In order to complete the generation of a potential pairwise hypothesis 136, step 130 may further comprise determining the existence of a strong relationship (via the concept-concept relationship output 122, for example) between the disease or condition X and the therapeutic compound B. According to some such embodiments, the generating step 130 may further comprise detecting an interesting secondary relationship between concepts A and B (i.e. detecting if therapeutic compounds A and B are in the same or similar semantic category (see element 111, FIG. 1, for example)). If the generating step 130 indicates that no current literature in the data repository 20 addresses a potential pairwise relationship between concepts A and B (such as the potential for treatment of particular conditions using some combination of therapeutic compounds A and B, for example), step 130 may further comprise reporting the pair A and B as a potential pairwise hypothesis 136.

FIG. 4C shows a detailed depiction of various exemplary subroutines that may be used to accomplish the hypothesis generating step 130 for a potential pairwise hypothesis 136. Given a set of interesting semantic type pairs (ISP) and/or a set of interesting relations (IR) in the system for extracting relationships 10, the generating step 130 may comprise analyzing each concept A in relation to the concept-concept relationship rule 122 and determining if concept A is a specific concept (according to a concept-hierarchical relationship rule 113, for example). The generating step 130 may further comprise: (1) consulting the concept-semantic type relationship rule 111 to determine if concept A's semantic type is present in at least one of the pairs making up the interesting semantic type pairs (ISP); and (2) retrieving concepts related to concept A (according to the concept-concept relationship rule 122, for example) and designating the retrieved concepts as A-relates. Furthermore, for each concept B in the A-relates, the generating step 130 may further comprise determining if B is a specific concept (according to a concept-hierarchical relationship rule 113, for example); and retrieving concept B's related concepts (according to the concept-concept relationship rule 122, for example) and designating the retrieved concepts as B-relates. Furthermore, as shown in FIG. 4C, for each concept C in the B-relates category, the generating step 130 may further comprise determining if concept C is a specific concept (according to a concept-hierarchical relationship rule 113, for example); and determining if concept C's semantic type (according to the concept-semantic type relationship rule 111, for example) paired with concept A's semantic type is present in the selected ISP. 
In such embodiments, the generating step 130 further comprises consulting the concept-document relationship rule 124 to obtain a number of documents within the data repository 20 that each contains both concept A and concept C. The generating step 130 may further comprise determining if the number of documents is less than some threshold and, if so, reporting the resulting potential pairwise hypothesis 136 comprising, for example, the previously unknown and/or unappreciated combination of concept A and concept C.
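
The FIG. 4C subroutines can likewise be sketched in code. This is a minimal sketch under stated assumptions rather than the implementation of any embodiment: the concept-concept relationship 122 is modeled as a dictionary of sets, the concept-document relationship 124 as a dictionary from concept to document identifiers, and each interesting semantic type pair (ISP) as an unordered `frozenset`. The set of interesting relations (IR) is not used here, all names are hypothetical, and the merging of (A, C) and (C, A) into a single reported pair is a design choice of the sketch.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, NamedTuple, Set, Tuple


class PairwiseHypothesis(NamedTuple):
    first: str
    second: str
    links: Tuple[str, ...]  # intermediate concepts B connecting the pair
    shared_documents: int   # documents that mention both concepts


def generate_pairwise_hypotheses(
    relations: Dict[str, Set[str]],           # stand-in for relationship 122
    semantic_type: Dict[str, str],            # stand-in for rule 111
    specific: Set[str],                       # stand-in for rule 113
    concept_docs: Dict[str, Set[str]],        # stand-in for relationship 124
    interesting_pairs: Set[FrozenSet[str]],   # ISP, as unordered pairs
    max_shared_docs: int,
) -> List[PairwiseHypothesis]:
    def shared(x: str, y: str) -> int:
        return len(concept_docs.get(x, set()) & concept_docs.get(y, set()))

    types_in_pairs = set().union(*interesting_pairs)
    found: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
    for a, a_relates in relations.items():
        if a not in specific or semantic_type.get(a) not in types_in_pairs:
            continue
        for b in a_relates:
            if b not in specific:
                continue
            for c in relations.get(b, set()):
                if c == a or c not in specific:
                    continue
                pair_type = frozenset((semantic_type[a], semantic_type.get(c)))
                if pair_type not in interesting_pairs:
                    continue
                # Report only pairs that rarely (or never) co-occur.
                if shared(a, c) < max_shared_docs:
                    found[tuple(sorted((a, c)))].add(b)
    return [
        PairwiseHypothesis(x, y, tuple(sorted(links)), shared(x, y))
        for (x, y), links in sorted(found.items())
    ]


# Hypothetical example: compounds A and B are both tied to condition X
# but never appear together in the same document.
relations = {
    "compound_A": {"condition_X"},
    "condition_X": {"compound_A", "compound_B"},
    "compound_B": {"condition_X"},
}
semantic_type = {"compound_A": "Drug", "compound_B": "Drug", "condition_X": "Disease"}
concept_docs = {
    "compound_A": {"d1", "d2"},
    "condition_X": {"d1", "d2", "d3", "d4"},
    "compound_B": {"d3", "d4"},
}
pairs = generate_pairwise_hypotheses(
    relations, semantic_type, set(relations), concept_docs,
    {frozenset({"Drug"})}, max_shared_docs=1,
)
```

A `frozenset` containing a single type, as in the example, stands for a pair whose two members share the same semantic type.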

Referring again to FIG. 1, the method embodiments may further comprise step 140 for presenting the hypothesis as a visual output 145 (see, for example, the display outputs shown generally in FIGS. 6A-6C) so as to indicate the previously unknown combination to a user and/or presenting the previously unknown combination to a downstream process (such as, for example, a user analysis 150 and subsequent filtering step 160 as shown in FIG. 5). Various embodiments of the present invention may therefore provide the important technical effect of presenting complex hypotheses 132, 134, 136 produced by the generating step 130 in a simplified, yet multi-layered interactive display (such as those displays shown, for example, in FIGS. 6A-6C).

As shown generally in FIGS. 7A and 7B, various system embodiments of the present invention may comprise a user interface 704 (such as a monitor or other display device) in communication with a host computing element 700 and configured for presenting the hypothesis so as to indicate the previously unknown combination to a user. In some embodiments, the user interface 704 may present one or more of the hypotheses 132, 134, 136 as a display to a user comprising a primary visual representation 610 of the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts (as shown, for example, in the display shown generally in FIGS. 6A-6C). In some embodiments, the user interface 704 may also present the primary visual representation 610 comprising an interactive icon 615 configured to be selectable by the user. The interactive icon 615 may be further configured to modify the display when selected by the user (i.e. selection of the interactive icon 615 in the primary visual representation may elicit the display of a secondary visual representation 620 that may indicate, for example, a plurality of relationships linking Phrase 2 and Phrase 3 (as shown, for example, in FIG. 6A)).

Referring to FIGS. 6A-6C, the step 140 for presenting the hypothesis as a visual output 145 may comprise presenting a display to a user comprising a primary visual representation 610 of the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts. The primary visual representation 610 may comprise, in some embodiments, an interactive icon 615 (such as a linking element) configured to be selectable by the user (i.e. via a mouse click for example). The interactive icon 615 may be further configured to modify the display (by calling up a secondary visual representation 620, for example) when selected by the user. The secondary visual representation 620 may also comprise an interactive icon 625 that, when selected by a user, may call up a tertiary visual representation 630 comprising further detail regarding the previously unknown combination of phrases and/or concepts that make up the potential hypothesis.

FIG. 6A shows primary 610, secondary 620, and tertiary 630 visual representations of a previously unknown combination comprising a potential chain hypothesis combining Phrases 1-3 via three separate interactive linking icons 615. As shown in FIG. 6A, if a user selects the interactive icon 615 disposed between Phrase 2 and Phrase 3, step 140 may further comprise displaying a secondary visual representation 620 comprising the various relationships (defined, for example, by one or more relationship rules 111, 112, 113, 114) underlying the combination of Phrases 2 and 3 in the previously unknown chain hypothesis comprising Phrases 1-3. As described above, the secondary visual representation 620 may also comprise an interactive icon 625 corresponding to at least one of the linking relationships. The interactive icon 625, when selected by a user, may be configured to call up the tertiary visual representation 630 comprising, for example, a list of documents from the data repository 20 that were parsed (see step 120, for example) to generate the potential chain hypothesis 132 shown in the primary visual representation 610. Thus, by navigating the various interactive visual representations 610, 620, 630 produced in accordance with step 140, a user may deconstruct and/or uncover the various logical steps used by the various system, method, and computer program product embodiments of the present invention to generate the hypothesis. This interactive display feature may be especially useful for allowing a user to assess the quality and/or basis for a particular proposed hypothesis.

Similarly, FIGS. 6B and 6C show a progression of primary 610, secondary 620, and tertiary 630 visual representations for presenting potential substitution hypotheses 134 and potential pairwise hypotheses 136, respectively. Each of the primary and secondary visual representations 610, 620 also comprises interactive icons 615, 625 allowing a user to uncover the various relationship rules 111, 112, 113, 114 and, ultimately, the very documentary evidence gleaned from one or more data repositories 20 that serves as the basis for the potential hypothesis presented according to various embodiments of the present invention.

As shown in FIGS. 1 and 5, various method embodiments may further comprise step 150 for identifying a portion of the plurality of documents in the data repository 20 associated with a user and creating a user profile 155 based at least in part on the identified documents. The user profile 155 may be indicative of a user information need. Some such embodiments may further comprise step 160 for modifying the hypothesis in response to the user profile 155 such that the modified hypothesis at least partially corresponds to the user information need. Step 160 for modifying the hypothesis may, in some embodiments, utilize one or more of the various steps 110, 120, 130 described herein for determining relationship rules 111, 112, 113, 114, parsing documents from a data repository 20, and/or generating hypotheses 132, 134, 136 in order to perform a user analysis to identify a portion of the plurality of documents in the data repository 20 associated with a user.

For example, and referring generally to FIG. 5, step 150 may comprise subroutines that may include, but are not limited to: step 151 for searching a user's online publications (which may, in some embodiments, be stored in the data repository 20); step 152 for parsing the searched publications uncovered in step 151 and indexing various phrases therein by a concept identifier (using, for example, a phrase-concept relationship rule 112 that may be generated in step 110); step 153 for mapping concepts in the searched publications uncovered in step 151 to a particular semantic type (using, for example, a concept-semantic type relationship rule 111 that may be generated in step 110); and step 154 for ranking the top k semantic types so as to produce the user profile 155 (wherein k may comprise an adjustable parameter).
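
Steps 152 through 154 may be illustrated with a short sketch. It is offered only as one possible reading: phrases are matched by naive case-insensitive substring counting (an assumption, since the phrase matching technique is not specified here), the phrase-concept rule 112 and concept-semantic type rule 111 are modeled as plain dictionaries, and the function name, concept identifiers, and example publications are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_user_profile(
    publications: Iterable[str],
    phrase_to_concept: Dict[str, str],         # stand-in for rule 112
    concept_to_semantic_type: Dict[str, str],  # stand-in for rule 111
    k: int = 3,
) -> List[str]:
    """Return the k semantic types mentioned most often in the publications."""
    counts: Counter = Counter()
    for text in publications:
        lowered = text.lower()
        for phrase, concept_id in phrase_to_concept.items():
            # Step 152: index phrase occurrences by concept identifier.
            occurrences = lowered.count(phrase.lower())
            if not occurrences:
                continue
            # Step 153: map the concept to its semantic type.
            semantic_type = concept_to_semantic_type.get(concept_id)
            if semantic_type is not None:
                counts[semantic_type] += occurrences
    # Step 154: rank and keep the top k semantic types.
    return [semantic_type for semantic_type, _ in counts.most_common(k)]


# Hypothetical rules and publications (identifiers are illustrative only).
phrase_to_concept = {
    "aspirin": "C001",
    "acetylsalicylic acid": "C001",
    "migraine": "C002",
    "headache": "C003",
}
concept_to_semantic_type = {
    "C001": "Pharmacologic Substance",
    "C002": "Disease or Syndrome",
    "C003": "Sign or Symptom",
}
publications = [
    "Aspirin and acetylsalicylic acid dosing in migraine.",
    "Aspirin response in migraine with headache.",
]
profile = build_user_profile(
    publications, phrase_to_concept, concept_to_semantic_type, k=2
)
```

Raising or lowering `k` broadens or narrows the resulting profile, which corresponds to the adjustable parameter k of step 154.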

In some embodiments (as shown, for example in FIG. 5), the user profile 155 may comprise at least one semantic category (as defined, for example, by a concept-semantic category relationship rule 111 that may serve as an input to the user analysis step 150 for identifying a portion of the plurality of documents in the data repository 20 associated with a user). According to such embodiments, various methods may further comprise step 160 for filtering one or more of the presented hypotheses 132, 134, 136 such that the previously unknown combination includes only at least one phrase and at least one concept corresponding substantially to the at least one semantic category highlighted in the user profile 155.
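
A filtering step of this kind might be sketched as follows. The sketch assumes each hypothesis is represented simply as a tuple of concept names and that the user profile is a list of semantic categories; because the degree of correspondence required by step 160 is left open, the hypothetical `require_all` flag exposes both a permissive reading (any concept matches) and a strict reading (every concept matches). All names are illustrative.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def filter_hypotheses(
    hypotheses: Iterable[Tuple[str, ...]],
    concept_to_semantic_type: Dict[str, str],  # stand-in for rule 111
    profile: Sequence[str],                    # semantic categories of profile 155
    require_all: bool = False,
) -> List[Tuple[str, ...]]:
    """Keep hypotheses whose concepts fall in the profile's semantic categories."""
    wanted = set(profile)
    match = all if require_all else any
    return [
        hypothesis
        for hypothesis in hypotheses
        if match(concept_to_semantic_type.get(c) in wanted for c in hypothesis)
    ]


# Hypothetical hypotheses and semantic categories.
candidate_hypotheses = [
    ("drug_A", "drug_B"),
    ("gene_G", "protein_P"),
    ("drug_A", "gene_G"),
]
semantic_categories = {
    "drug_A": "Pharmacologic Substance",
    "drug_B": "Pharmacologic Substance",
    "gene_G": "Gene",
    "protein_P": "Protein",
}
user_profile = ["Pharmacologic Substance"]

permissive = filter_hypotheses(candidate_hypotheses, semantic_categories, user_profile)
strict = filter_hypotheses(
    candidate_hypotheses, semantic_categories, user_profile, require_all=True
)
```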

Thus, various system, method, and/or computer program products of the present invention may tailor the “conceptual research” results returned, for example, as part of the presented hypotheses 132, 134, 136 to meet a user information need that may be ascertained by analyzing a user's publications and/or previous search patterns. As described further herein, various system embodiments of the present invention may comprise a host computing element 700 including one or more memory devices 722, 724, 728 configured for storing a user profile 155 such that each user (identified, for example, by a unique user ID and/or password) may log on to a host computing element 700 so as to utilize the conceptual research capabilities of the various embodiments described more fully herein.

Some method embodiments may further comprise a step for verifying one or more generated hypotheses 132, 134, 136 using at least one independent resource. For example, in some method embodiments, the generated hypotheses 132, 134, 136 may be verified using an independent resource 800 as shown generally in FIG. 8. For example, an independent resource 800 (such as a “Verifying Support System” or other verification module for example) may be in communication with a host computing device 700 that may be responsible for performing the parsing 120 and/or generating steps 130 described herein. The independent resource 800 may also be in communication with the data repository 20 so as to be capable of evaluating one or more documents contained therein, using tools and/or subroutines that may include, but are not limited to: a data analysis engine 802, a natural language processing (NLP) engine 804, a data “mining” engine 806, and/or an existing search engine 808. These tools may comprise “off-the-shelf” search engines or other publicly available search tools that may be used by the independent resource 800 to verify the hypothesis 132, 134, 136 (for example) generated by the host computing element 700 (using, for example, the various modules and/or steps shown in FIG. 1). “Verification” of the hypothesis may include, but is not limited to: assessing a potential “breakthrough” value of the hypothesis; assessing the probability of the hypothesis providing a viable research direction and/or research focus; and ensuring that the hypothesis meets the potential information needs of a particular audience (such as, for example, those information needs embodied in a user profile 155 as shown in FIG. 5).

Some embodiments of the present invention further provide a system for mining information from a data repository 20 comprising a plurality of documents to produce a hypothesis (see elements 132, 134, 136 of FIG. 1, for example). Referring generally to FIG. 1, the system may comprise a system for extracting relationships 10 comprising a plurality of phrases and a plurality of concepts. The system for extracting relationships 10 may include, but is not limited to: a vocabulary database corresponding to a selected subject area; a predetermined lexicon; a semantic network; a metathesaurus; and combinations of such databases.

Various system embodiments may further comprise a host computing element 700 (see FIG. 7, for example) in communication with the system for extracting relationships 10 for accessing the system for extracting relationships 10. According to such embodiments, the host computing element 700 may determine a relationship rule 111, 112, 113, 114 defining a relationship among at least a portion of the plurality of phrases and at least a portion of the plurality of concepts. As described herein with respect to the various method and/or computer program product embodiments of the present invention, the various relationship rules 111, 112, 113, 114 determined by the host computing element 700 may include, but are not limited to: an assignment of at least one of the plurality of phrases to at least one of the plurality of concepts (see element 112 and FIG. 2A); an assignment of at least one of the plurality of phrases to a relationship identifier (see element 114 and FIG. 2B, wherein the relationship identifier may link a first one of the plurality of concepts to a second one of the plurality of concepts); an assignment of at least one of the plurality of concepts to a semantic category (see element 111 and FIG. 2C, for example); an arrangement of at least a portion of the plurality of concepts in a hierarchical relationship (see element 113 and FIG. 2D for example), wherein a first one of the portion of concepts comprises a child concept and a second one of the portion of concepts comprises a parent concept; and/or combinations of such relationship rules.
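
The four kinds of relationship rules listed above could, for illustration, be held in a simple in-memory structure. The sketch below is an assumption about representation only: each rule type becomes a dictionary, the hierarchy of rule 113 is stored as child-to-parent links with a single parent per concept, and the class name, method name, and example entries are hypothetical rather than drawn from any actual vocabulary database.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RelationshipRules:
    # Element 112: phrase -> concept identifier.
    phrase_to_concept: Dict[str, str] = field(default_factory=dict)
    # Element 114: phrase -> relationship identifier linking two concepts.
    phrase_to_relation: Dict[str, str] = field(default_factory=dict)
    # Element 111: concept -> semantic category.
    concept_to_semantic_type: Dict[str, str] = field(default_factory=dict)
    # Element 113: child concept -> parent concept.
    child_to_parent: Dict[str, str] = field(default_factory=dict)

    def ancestors(self, concept: str) -> List[str]:
        """Walk the hierarchy upward from a concept, nearest parent first."""
        chain: List[str] = []
        seen = {concept}
        while concept in self.child_to_parent:
            concept = self.child_to_parent[concept]
            if concept in seen:  # guard against accidental cycles
                break
            seen.add(concept)
            chain.append(concept)
        return chain


# Hypothetical entries for a small biomedical vocabulary.
rules = RelationshipRules(
    phrase_to_concept={
        "heart attack": "myocardial_infarction",
        "myocardial infarction": "myocardial_infarction",
    },
    phrase_to_relation={"treats": "TREATS", "is used to treat": "TREATS"},
    concept_to_semantic_type={"myocardial_infarction": "Disease or Syndrome"},
    child_to_parent={
        "myocardial_infarction": "heart_disease",
        "heart_disease": "cardiovascular_disease",
    },
)
lineage = rules.ancestors("myocardial_infarction")
```

Two phrases mapping to the same concept identifier, as in the example, show how synonymous wording in different documents can be resolved to a single concept before parsing.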

In some embodiments, as shown generally in FIGS. 7A-7B, the system and/or host computing element 700 thereof may comprise one or more memory devices 722, 724, 728 in communication with and/or integrated with the host computing element 700. According to some such embodiments, the memory device or devices 722, 724, 728 may be configured for storing one or more of the determined relationship rules 111, 112, 113, 114 for later or repeated use in a subsequent parsing step (see step 120, FIG. 1, for example). Thus, the memory devices 722, 724, 728 may allow the host computing element 700 to serve as a “warehouse” of relationships comprising conceptual links determined in part from the system for extracting relationships 10 corresponding to a particular data repository 20. For example, in biomedical “conceptual research” embodiments of the present invention, the memory devices 722, 724, 728 may be configured for storing relationship rules 111, 112, 113, 114 determined from an analysis of the Unified Medical Language System (UMLS) (or a similar system for extracting relationships 10) for use in later and/or repeated parsing (see step 120) of a biomedical data repository 20 (such as MEDLINE, for example). Importantly, the memory devices 722, 724, 728 may allow for the conservation of computing power and/or time by “pre-computing” and storing commonly-used and/or re-used relationship rules 111, 112, 113, 114 that may be necessary for the parsing 120 and hypothesis generation 130 steps (as described in further detail herein) that may be later performed by the host computing element 700.

Referring again to FIG. 1, the host computing element 700 may also be in communication (via a wired and/or wireless network connection, for example) with a data repository 20. System embodiments of the present invention may be configured for use with various types of data repositories 20 in a variety of subject areas. For example, the host computing element 700 may be in communication with a data repository 20 that may include, but is not limited to: a biomedical literature database (such as MEDLINE, for example); a medical records database; a chemical literature database; a computer science literature database; a physics literature database; a legal literature database; a psychology literature database; a social science literature database; a news periodical database; a business journal database; and combinations of such databases. In such embodiments, the host computing element 700 may be further configured for parsing (see step 120) the plurality of documents in the data repository 20 according to one or more of the determined relationship rules 111, 112, 113, 114 described herein. Each of the plurality of documents stored in the data repository 20 may comprise at least a portion of one of the plurality of phrases and the plurality of concepts such that the relationship rules 111, 112, 113, and 114 may be directly compatible with the terms, phrases, and/or concepts present in the documents.

Furthermore, in some system embodiments, the host computing element 700 may be further configured for performing step 130 for generating one or more potential hypotheses 132, 134, 136 comprising a previously unknown combination of at least one of the plurality of phrases and at least one of the plurality of concepts. As described herein, the previously unknown combination may be at least partially determined from the parsed plurality of documents present in the data repository 20.

As shown generally in FIGS. 7A and 7B, various system embodiments may further comprise a user interface 704 (such as a display device, for example) in communication with the host computing element 700. The user interface 704 may be configured for presenting the hypothesis (as a visual indication or display, for example, as shown generally in FIGS. 6A-6C) so as to indicate the previously unknown combination of at least one of the plurality of phrases and at least one of the plurality of concepts. In some system embodiments, the user interface 704 may be configured for presenting one or more of the generated hypotheses 132, 134, 136 as a display (see FIGS. 6A-6C, for example) to a user comprising a primary visual representation 610 of the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts. Referring generally to FIG. 6A, in some such embodiments, the user interface 704 may be further configured for presenting a primary visual representation 610 comprising an interactive icon 615 configured to be selectable by the user (i.e. via a mouse click or other user input received via one or more input devices 706, 708 in communication with and/or integrated with the host computing element 700). As described herein, the interactive icon 615, 625 may be further configured to modify the display when selected by the user. For example, selection of the interactive icon 615 in the primary visual representation 610 of the presented hypothesis may call up and/or modify the display to present a secondary visual representation 620 comprising the underlying relationship rules and/or linking relations used to generate (in step 130, for example) one or more of the presented hypotheses 132, 134, 136.
As described herein, the secondary visual representation 620 may further comprise a second interactive icon 625 that, when selected by a user, may call up a tertiary visual representation 630 presenting, for example, at least a portion of the plurality of parsed documents from the data repository 20 that may have been used to generate one or more of the presented hypotheses 132, 134, 136. Thus, the host computing element 700 and the associated user interface 704 elements may allow a user to fully examine the presented hypotheses 132, 134, 136 and the logic and/or relationships underlying the proposed hypotheses. This transparency may allow a user to become comfortable with the various system embodiments of the present invention and to more readily rely on the various “conceptual research” capabilities afforded thereby.

Referring to FIG. 5, in some system embodiments, the host computing element 700 may be further configured for tailoring and/or modifying one or more of the presented hypotheses 132, 134, 136 to meet the particular information needs of an identified user. In some such embodiments, the host computing element 700 may perform step 150 for identifying a portion (see element 30, for example, indicating the user's available online publications) of the plurality of documents in the data repository 20 associated with a user. The host computing element 700 may be further configured for creating a user profile 155 based at least in part on the identified documents. As described herein, the user profile 155 may be indicative of a user information need (such as, for example, a particular user's field of study and/or area of expertise). As shown in FIGS. 1 and 5, the host computing element 700 may be configured for performing step 160 for filtering the results returned from the hypothesis generating step 130 (i.e. one or more potential hypotheses 132, 134, 136). As shown in FIG. 5, the host computing element 700 may be configured for receiving inputs comprising the potential hypotheses 132, 134, 136 and the user profile 155 and modifying one or more of the potential hypotheses 132, 134, 136 in response to the user profile 155 such that the modified hypothesis at least partially corresponds to the user information need. In some system embodiments, the user profile 155 generated by the host computing element 700 may comprise at least one semantic category (as defined, for example, by a concept-semantic category relationship rule 111 that may serve as an input to the user analysis step 150 for identifying a portion of the plurality of documents in the data repository 20 associated with a user). 
According to such embodiments, the host computing element 700 may be further configured for performing step 160 for filtering one or more of the presented hypotheses 132, 134, 136 such that the previously unknown combination includes only at least one phrase and at least one concept corresponding substantially to the at least one semantic category highlighted in the user profile 155.

As shown in FIG. 8, some system embodiments may further comprise an independent resource 800 (such as a “Verifying Support System,” for example) in communication with the host computing device 700. The independent resource 800 may also be in communication with the data repository 20 so as to be capable of evaluating one or more documents contained therein, using tools that may include, but are not limited to: a data analysis engine 802, a natural language processing (NLP) engine 804, a data “mining” engine 806, and/or an existing search engine 808. These tools may comprise “off-the-shelf” search engines or other publicly available search tools that may be used by the independent resource 800 to verify the hypothesis 132, 134, 136 (for example) generated by the host computing element 700 (using, for example, the various modules and/or steps shown in FIG. 1). “Verification” of the hypothesis may include, but is not limited to: assessing a potential “breakthrough” value of the hypothesis; assessing the probability of the hypothesis providing a viable research direction and/or research focus; and ensuring that the hypothesis meets the potential information needs of a particular audience (such as, for example, those information needs embodied in a user profile 155 as shown in FIG. 5).

FIGS. 7A-7B illustrate an exemplary host computing element 700 useable for implementing some embodiments of the present invention. In particular, FIG. 7A illustrates an example of a host computing element 700 configured as a computer device in which some embodiments may be utilized. As illustrated in FIG. 7A, the host computing element 700 may comprise a system unit 702, output devices such as display device 704 and printer 710, and input devices such as keyboard 708, and mouse 706. The host computing element 700 receives data for processing by the manipulation of input devices 708 and 706 or directly from fixed or removable media storage devices such as CD disk 712 and/or via network connection interfaces (not illustrated). The host computing element 700 then processes data and presents resulting output data via output devices such as display device 704, printer 710, fixed or removable media storage devices like disk 712 or network connection interfaces. It should be appreciated that the computer device used for implementing the preferred embodiment can be any sort of computer system (e.g., personal computer (laptop/desktop), network computer, server computer, or any other type of computer).

Referring now to FIG. 7B, there is depicted a high-level block diagram of the components of a host computing element 700 such as that illustrated by FIG. 7A. System unit 702 includes a processing device such as processor 720 in communication with a main memory device 722 (which may include various types of cache, random access memory (RAM), or other high-speed dynamic storage devices) via a local or system bus 714 or other communication means for communicating data between such devices. The main memory device 722 may be capable of storing data as well as instructions to be executed by processor 720 and may be used to store temporary variables or other intermediate information during execution of instructions by processor 720. The host computing element 700 may also comprise a read only memory (ROM) and/or other static storage devices 724 coupled to local bus 714 for storing static information (such as one or more relationship rules 111, 112, 113, 114, for example) and instructions for processor 720. The system unit 702 of the host computing element 700 also features an expansion bus 716 providing communication between various devices and devices attached to the system bus 714 via the bus bridge 718. A data storage device 728, such as a magnetic disk 712 or optical disk such as a CD-ROM and its corresponding drive, may be coupled to the host computing element 700 for storing data and instructions via expansion bus 716. The host computing element 700 may also, in some embodiments, be coupled via expansion bus 716 to a user interface 704 (or other display device), such as a cathode ray tube (CRT) or a liquid crystal display (LCD), for displaying data to a computer user such as the generated hypotheses 132, 134, 136 and associated visual representations thereof (see FIGS. 6A-6C, as described further herein).
In some embodiments, the system may further comprise an alphanumeric input device 708, including alphanumeric and other keys, is coupled to bus 716 for communicating information and/or command selections to processor 720. Other types of user input devices may also be provided as a component of and/or in communication with the host computing element 700. Such input devices may include cursor control device 706, such as a conventional mouse, trackball, or cursor direction keys for communicating direction information and command selection to processor 720 and for controlling cursor movement the user interface 704. Such cursor control devices 706 may be especially useful in allowing a user to select an interactive icon 615, 625 presented in one or more of the primary 610 and secondary 620 visual representations of the hypotheses 132, 134, 136 (as depicted, for example, in FIGS. 6A-6C.

A communication device 726 may also be coupled to and/or in communication with the expansion bus 716 for accessing remote computers or servers via the Internet or other network. Such remote computers and/or servers may house, for example, one or more systems for extracting relationships 10 and/or data repositories 20. The communication device 726 may include, but is not limited to: a modem; a network interface card; and/or other interface devices, such as those used for interfacing with Ethernet, Token-ring, or other types of networks. In this manner, the host computing element 700 may be coupled to and/or in communication with a number of servers via a network infrastructure. The communication device 726 may also enable one or more users to selectively access the host computing element 700 so as to take advantage of the relationship rules 111, 112, 113, 114 and/or hypotheses 132, 134, 136 that may be generated according to the various embodiments of the present invention.

In addition to providing systems and methods, the present invention also provides computer program products for performing the operations described above. The computer program products have a computer readable storage medium having computer readable program code embodied in the medium. With reference to FIG. 7B, the computer readable storage medium may be part of the memory device 722, and may embody the computer readable program code for performing the above discussed operations.

In this regard, FIGS. 1, 4A-4C, and 5 are block diagram, flowchart and control flow illustrations of methods, systems and program products according to exemplary embodiments of the invention. It will be understood that each block or step of the block diagram, flowchart and control flow illustrations, and combinations of blocks in the block diagram, flowchart and control flow illustrations, can be implemented by computer program instructions. These computer program instructions may be loaded onto a computer or other programmable apparatus to produce a machine, such that the instructions which execute on the computer or other programmable apparatus are capable of implementing the functions specified in the block diagram, flowchart or control flow block(s) or step(s). These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instructions which implement the function specified in the block diagram, flowchart or control flow block(s) or step(s). The computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the block diagram, flowchart or control flow block(s) or step(s).

Accordingly, blocks or steps of the block diagram, flowchart or control flow illustrations support combinations of steps for performing the specified functions, and program instructions for performing the specified functions. It will also be understood that each block or step of the block diagram, flowchart or control flow illustrations, and combinations of blocks or steps in the block diagram, flowchart or control flow illustrations, can be implemented by special purpose hardware-based computer systems which perform the specified functions or steps, or combinations of special purpose hardware and computer instructions.
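
By way of illustration only, the software case described above, in which each block or step is implemented by computer program instructions executed in sequence, may be sketched as follows. This is a minimal sketch under stated assumptions and not the implementation of any embodiment: the names `RelationshipRule`, `determine_rule`, `parse_documents`, `generate_hypotheses`, `present`, and `run_blocks` are hypothetical, the relationship rule is reduced to simple co-occurrence of a phrase and a concept within one document, and a "previously unknown combination" is approximated as a phrase and concept pair that never co-occurs in the parsed documents.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RelationshipRule:
    """Hypothetical stand-in for a relationship rule (e.g., 111-114)."""
    phrases: frozenset
    concepts: frozenset


def determine_rule(phrases, concepts):
    # Block: determine a rule relating the given phrases and concepts.
    return RelationshipRule(frozenset(phrases), frozenset(concepts))


def parse_documents(rule, documents):
    # Block: parse documents according to the rule. Assumption: a phrase
    # and a concept are "related" when both appear in the same document.
    seen = set()
    for text in documents:
        lowered = text.lower()
        found_phrases = [p for p in rule.phrases if p in lowered]
        found_concepts = [c for c in rule.concepts if c in lowered]
        seen.update(product(found_phrases, found_concepts))
    return seen


def generate_hypotheses(rule, seen):
    # Block: keep combinations the rule allows but no document contains.
    # This is a placeholder criterion chosen only for the sketch.
    candidates = set(product(rule.phrases, rule.concepts))
    return sorted(candidates - seen)


def present(hypotheses):
    # Block: format each hypothesis for display to a user.
    return [f"{phrase} <-> {concept}" for phrase, concept in hypotheses]


def run_blocks(phrases, concepts, documents):
    # Control flow: execute the blocks as a series of operational steps.
    rule = determine_rule(phrases, concepts)
    seen = parse_documents(rule, documents)
    return present(generate_hypotheses(rule, seen))


if __name__ == "__main__":
    docs = [
        "Fish oil reduces blood viscosity.",
        "Magnesium deficiency is linked to migraine.",
    ]
    for line in run_blocks(["fish oil", "magnesium"],
                           ["migraine", "blood viscosity"], docs):
        print(line)
```

In this sketch each function corresponds to one block of a flowchart, and `run_blocks` plays the role of the control flow connecting them; any block could equally be replaced by special purpose hardware performing the same function.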

Many modifications and other embodiments of the invention will come to mind to one skilled in the art to which this invention pertains having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the invention is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended exemplary inventive concepts. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.