[0001] This application claims the benefit of U.S. Provisional Application No. 60/177,223, filed on Jan. 21, 2000.
[0002] 1. Field of the Invention
[0003] This invention relates generally to devices, software, computer systems, and methods used to analyze gene expression data and more particularly to devices, software, computer systems, and methods used to analyze the large volumes of gene expression data generated in gene expression profiling experiments.
[0004] 2. Description of the Related Art
[0005] Data analysis of large and/or complex sets of biological data is usually performed in two steps.
[0006] 1) Statistical analysis of the raw data, treating the experimental errors, taking into account experimental constraints, and trying to filter and/or extract the relevant data points.
[0007] 2) Attempting the interpretation of the identified subsets of data with respect to the general biological knowledge.
[0008] Methods available so far give partial solutions to either of these two steps but fail to support the complete process.
[0009] Data stemming from expression profiling experiments has been very hard to analyze. To date, the analysis has mainly been done in such a way that two states are compared, a state A and a state E. Thus, only two data sets are compared as illustrated in Table I.
TABLE I Two data sets
Status name:   A      E
Phenotype:
Influence:     Drug   No Drug
[0010] The visualization of this data is done as graphs, an example of which is shown in
[0011] In real life, however, multiple data sets should be analyzed, stemming from, e.g., different time points or a series of experiments. Furthermore, the representation shown in
[0012] In some special cases, comparison of only two experiments is sufficient. However, analysis of multiple data sets is far more desirable, as it reflects the general experimental situation. In theory, such an analysis can be performed as pairwise comparisons of each pair of data sets. In practice, however, this is far from efficient, as the sought-for information is distributed over many representations. Furthermore, the number of such representations grows in proportion to the square of the number of experiments and quickly outgrows the size that can be handled.
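By way of a non-limiting illustration, the growth in the number of pairwise representations can be sketched as follows. For n data sets there are n(n-1)/2 unordered pairs. The function name below is hypothetical and is not part of the disclosed software.

```python
from itertools import combinations

def pairwise_comparisons(experiments):
    """Return every unordered pair of experiments that would need its own plot."""
    return list(combinations(experiments, 2))

for n in (2, 5, 20, 100):
    pairs = pairwise_comparisons(range(n))
    # n * (n - 1) / 2 pairs, i.e. roughly proportional to n squared
    print(n, len(pairs))
```

Even a modest series of 20 experiments already requires 190 separate two-experiment comparisons.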
TABLE II Multiple data sets Status name: Drug Drug no Drug no Drug Phenotype: Influence Responder responder non responder non responder
[0013] The only approach so far that has been shown to be successful in displaying the results of more than two experiments uses tree construction based on similarity in expression regulation, together with tree drawing algorithms as commonly used in sequence comparison of gene families (Eisen et al P. Brown Science 1998). This method is based on exhaustive pairwise comparisons of individual data points and can be so time consuming that its use is limited and application to very large data sets becomes impossible.
[0014] The advantage of being able to display many similarity relationships is limited, as the presentation of high numbers of similarity relationships in tree representations exceeds the capacity of human comprehension. In addition, visualization of such extensive tree structures faces technical difficulties due to the requirements of very high resolution devices.
[0015] A fundamental problem of displaying many similarity relationships in a tree format is the limitation of the underlying tree algorithm, which forces the data into an artificial tree structure. In reality, however, the depicted tree structure cannot represent the true relationships and can create artificial similarities or spurious branching patterns. Such misleading artifacts may result in wrong conclusions including, for example, the problem of missing the influence and regulation of important genes in the analysis, even though the required measurements are available.
[0016] Therefore, it would be advantageous to provide for a method that can extract inherent structure from complex and/or large biological data sets, for example from large scale gene expression analysis or protein 2D-Gels.
[0017] The purpose of the invention is to provide for a method that enables defining relationships between data points (e.g. genes), whereby this method is not limited by the size of the data set, the potentially misleading effect of background noise is reduced, relationships are not distorted, and a comprehensible graphical presentation is possible.
[0018] The disclosed method solves the problem of visualization, analysis and interpretation of complex, multi-dimensional data. Such data may consist of data points from expression profiling analysis, 2D gel electrophoresis or SNP analysis. Here, multiple data sets exist and only the integration of all the sets into a two dimensional representation permits an analysis that allows the extraction of the information with respect to what events best explain the status of the cell, for example.
[0085] FIGS.
[0086]
[0087] In expression profiling experiments on arrays, known DNA fragments are spotted on a solid support as arrays, hybridized with RNA (i.e. cDNA) mixtures obtained from samples (e.g. tissue samples), and analyzed with respect to the differences in signal strength that reflect the abundance of the various RNA molecules and thus the expression of each gene. Such an analysis may be performed on a chip, which would then manifest one form of the frequently discussed ‘DNA Chip’, or on other matrices (e.g. nylon filters). This technology enables researchers to generate massive data volumes on many individual genes that potentially contain information on networks of co-acting, interacting or co-regulated gene sets. The extraction of this information is by no means trivial since 1) the source data signals usually contain a high level of noise, reflecting the problems with experimental reproducibility of such experiments, and 2) the data volumes generated are usually beyond those that can be efficiently handled and analyzed with standard bioinformatics approaches or managed by human comprehension.
[0088] FIGS.
[0089]
[0090] The second embodiment compares the number of genes, cells, viruses, sequences, or substances (the known variables in the experiment) to the number of samples. If the number of samples is larger, a sample similarity matrix is formed from the data. If the number of known variables is larger, a variable similarity matrix is formed from the data. Thereafter, a singular value decomposition (SVD) of the matrix formed takes place. The sample and known variable coordinates are determined based on the eigenvectors of the matrix formed. These coordinates are then utilized in the visualization section of the software.
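By way of illustration only, the steps of the second embodiment could be sketched in Python with NumPy as shown below. Several points are assumptions made for this sketch and are not specified in the disclosure: the data matrix has one row per known variable and one column per sample, "similarity" is taken to be the matrix of inner products, the data is assumed to be already scaled, and the function name `embed` is hypothetical.

```python
import numpy as np

def embed(data, n_dims=2):
    """Return (variable_coords, sample_coords) for a variables x samples matrix.

    Assumption: the similarity matrix is the matrix of inner products.
    """
    data = np.asarray(data, dtype=float)
    n_vars, n_samples = data.shape
    if n_samples > n_vars:
        # More samples than variables: form the sample similarity matrix.
        sim = data.T @ data
        vecs, vals, _ = np.linalg.svd(sim)
        basis = vecs[:, :n_dims]              # leading eigenvectors
        sample_coords = basis * np.sqrt(vals[:n_dims])
        var_coords = data @ basis             # project variables onto them
    else:
        # Otherwise: form the variable similarity matrix.
        sim = data @ data.T
        vecs, vals, _ = np.linalg.svd(sim)
        basis = vecs[:, :n_dims]
        var_coords = basis * np.sqrt(vals[:n_dims])
        sample_coords = data.T @ basis        # project samples onto them
    return var_coords, sample_coords
```

Because the similarity matrix is symmetric and positive semi-definite, its singular vectors coincide with its eigenvectors, so either branch yields the same two sets of coordinates up to the sign of each axis. The two coordinate sets can then be drawn in one two-dimensional plot.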
[0091]
[0092]
[0093]
[0094] The exemplary embodiment utilizes as a server a PC running LINUX or an SGI running IRIX, with 128 MB RAM. Additional software utilized in the optional embodiment described: SRS and CORBA server; SRS objects; and ORACLE 8.X. The preferred client is a networked personal computer. One of ordinary skill in the art of computer systems will recognize that the invention could be operated on other computer systems running alternative software.
[0095] The user interacts with the software in a very standard way. Menus are opened with a click and hold mouse action. Moving the mouse through the menu will highlight individual menu items, and releasing the mouse button will select the highlighted item. It is also possible to click on the first menu item to highlight it and then use the arrow keys to scroll through the list, thus moving the highlighting bar.
[0096] Multiple adjacent menu or list items may be selected by highlighting the first item and then holding down the Shift key while highlighting the last item; this action will highlight all the items between the first and the last item highlighted. Multiple non-adjacent menu or list items may be selected by highlighting the first item and then holding down the Ctrl key while highlighting all the others.
[0097] The left mouse button is used for selecting genes in analysis views and highlighting items in all lists and menus. Right-clicking on items, e.g. projects and genes in the gene list, will open a context specific command menu in which you can make selections. The right mouse button is also used for zooming into the analysis views: click and drag the right mouse button around the area you wish to zoom into.
[0098] Command menus list short-cut keys for performing actions using the keyboard instead of the mouse. Exemplary short-cuts are shown in Table III:
TABLE III Short-cuts
Action                                                     Key-stroke
Select highlighted items in gene list window.              Ctrl + Enter
Reset selection in gene list window.                       Ctrl + R
Show profiles of genes highlighted in gene list window.    Ctrl + P
Close application.                                         Ctrl + Q
[0099] From the command menu bar, select Help>Table of Contents. You will see the table of contents for the on-line help.
[0100] The software provides tools useful in a variety of settings, from numerical gene expression data to biological interpretation. The software employs a variety of statistical algorithms, interactive viewers, links to bioinformatics systems and the capacity to manage large volumes of data enabling the identification of a selection of candidate genes meeting specified criteria.
[0101] The software incorporates a variety of statistical tools. They have been implemented and optimised for performance in the software system. These algorithms include variance analysis, variants of principal component analysis, cluster tree analysis and correlation analysis.
[0102] Expression data and analysis results are vividly displayed with interactive viewers allowing diverse aspects of the data to be highlighted. Properties can be plotted and color-coded to display multiple levels of information simultaneously. Views cross-communicate; selections made in one view will remain highlighted when another view is opened.
[0103] Once a set of regulated genes has been identified, a way to explore and investigate these genes in depth is needed. Gene classification, patent situation, functional similarities and other aspects of the gene set can be queried via public as well as proprietary in-house databases. One commercial product that can be used to perform this function is SRS sold by Lion Biosciences.
[0104] Information on biological pathways is essential to interpret co-regulated gene clusters. The software provides easy interaction with several pathway databases via, for example, the SRS technology platform.
[0105] The software system capabilities are enhanced by its ability to interface with a sequence analysis system. One example of such a system is the bioSCOUT system, also sold by Lion Biosciences. These systems enable the elucidation of detailed information about the genes, or subsets of genes, based on deduced and calculated properties, and may provide summarized feature reports on each gene. If additional information is required, a suite of bioinformatics applications that can enable further investigations may also be available in these sequence analysis systems.
[0106] The software is designed to handle large data sets. Data formats which are compatible include GATC database format, tab delimited ASCII format data files and output from BioDiscovery's ImaGene® software.
[0107] Raw data is stored as an “experiment”. Comparable experiments can be grouped into an “experiment group” within a “project” which can contain user annotations and a complete list of the included experiments. Users work within projects containing experiment groups and gene lists.
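By way of illustration only, the containment hierarchy described above could be modeled as follows. The class and field names are hypothetical and chosen for this sketch; the disclosure does not specify how the software stores these objects.

```python
from dataclasses import dataclass, field

@dataclass
class Experiment:
    name: str
    intensities: dict  # gene name -> measured intensity

@dataclass
class ExperimentGroup:
    name: str
    experiments: list = field(default_factory=list)  # insertion order is kept

@dataclass
class Project:
    name: str
    experiment_groups: list = field(default_factory=list)
    gene_lists: dict = field(default_factory=dict)    # list name -> gene names
    annotations: dict = field(default_factory=dict)   # user name -> comment

# Illustrative values only; the intensity below is made up.
project = Project("TutorialAK")
group = ExperimentGroup("time_course")
group.experiments.append(Experiment("t0", {"AA001916": 120.0}))
project.experiment_groups.append(group)
```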
[0108] Start the software program. The log in window will open. An exemplary login window is shown in
[0109] The whole application may be maintained in one interface window which can be resized, minimized and maximized. There are standard command menus and a tool bar with short-cut buttons. The interface may be subdivided into three windows. An exemplary interface is provided in
[0110] At the top of the interface are the command menus: File, Edit, Analysis, Genes, Administration, Windows and Help. The contents of these menus will be explained as the process of using software is described. To select an item from a menu, click on the menu name to open it, use the mouse arrow to highlight the selection and then click again to select it.
[0111] Command menus are also available by right-clicking over an object to manipulate. For example, by right-clicking on a project folder, a menu opens with options including creating a new experiment group within that folder. Click in the menu to select an item.
[0112] Menus are context sensitive, so listed items in the menus aren't always available.
[0113] The tool bar provides shortcut buttons for all the analysis filters; other applications, for example, “SRS” and “bioSCOUT”; “Save”; and “Clean”. Exemplary tool bar buttons are listed in Table IV.
[0114] The Project List Window is on the left side of the interface. It displays the Project List in which projects, sub-projects, experiment groups, experiments and gene lists will be displayed in a hierarchical tree of folders.
[0115] The largest section of the default interface is the Analysis Window which contains the analysis views and the SRS interface. The internal frames cannot be moved out of the analysis window.
[0116] The Gene List Window is on the bottom of the interface, below the Analysis Window. It lists the genes you have selected in the analysis views.
[0117] You can move about in the windows, resize the software interface and internal windows and zoom into analysis views as you wish. Table V illustrates various ways of adjusting the interface.
[0118] The Project List is a hierarchical listing of all the projects and data owned by the users of the software. You will create a new project and within the project a new experiment group. The project will provide an environment to work and store results in. The experiment group will house all the experiments that you want to compare.
[0119] In the left-hand window (the Project List), highlight the word “Projects” and choose Edit>New>Project from the menu bar (or right-click on the selection and choose New>Project from the menu). This process is illustrated in
[0120] A dialog window will open. Enter the name for the new project, for example, “TutorialAK” as shown in
[0121] Highlight the project in the Project List and choose Edit>New>Experiment Group from the Edit command menu (or right-click on the project and choose New>Experiment Group). A dialog box will open. Exemplary dialog boxes are shown in
[0122] Highlight the experiment group, for example, “Fibro
[0123] A second dialog box will open listing the available experiments. An example of this dialog is illustrated in
[0124] The data utilized in this exemplary analysis are time points taken from a synchronized population of fibroblast cells. This exemplary analysis identifies cyclins that are markedly up regulated during the time course. The above exemplary analysis is utilized for explanation only; one of ordinary skill in the art would understand, based on this example, how to use the invention in the analysis of other cells and related genes.
[0125] One analysis tool provided by the software is a Variance Histogram. To use the Variance Histogram, select an experiment group in the Project List. Click on the “Variance” button in the tool bar. The “Choose Experiments to Plot” dialog box will open. Click “OK”, including all the experiments. The Variance Histogram will appear. An example of a Variance Histogram is illustrated in
[0126] Notice that the genes represented in the selected bars are now listed in the Gene List at the bottom of the interface and that there may be a symbol, having a color corresponding to the color of the analysis window, in the right-hand column (the “Selected in” column) of the Gene List table. These genes represent the first selection of potentially up regulated genes. Next, start a Distance Plot analysis, which clusters the data on their principal components, to further refine the selection.
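By way of illustration only, the Variance Histogram selection described above could be sketched as follows. The function name, the use of population variance, and the equal-width bins are assumptions made for this sketch; the disclosure does not specify them.

```python
from statistics import pvariance

def variance_histogram(expression, n_bins=10):
    """Group genes into equal-width bins by variance across experiments.

    expression: dict mapping gene name -> list of values, one per experiment.
    Returns a list of bins (lowest variance first), each a list of gene names.
    """
    variances = {gene: pvariance(vals) for gene, vals in expression.items()}
    lo, hi = min(variances.values()), max(variances.values())
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all variances match
    bins = [[] for _ in range(n_bins)]
    for gene, var in variances.items():
        idx = min(int((var - lo) / width), n_bins - 1)
        bins[idx].append(gene)
    return bins

# Made-up profiles: selecting the right-most bars picks the most variable genes.
profiles = {"flat": [1, 1, 1, 1], "mild": [1, 2, 1, 2], "strong": [1, 10, 1, 10]}
bars = variance_histogram(profiles, n_bins=4)
selected = bars[-1]
```

Selecting the bars at the high-variance end of the histogram corresponds to the first selection of potentially regulated genes.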
[0127] Click on the “Distance” button in the tool bar. A parameters dialog box will appear. Click “OK” to use the default scaling parameter, “adjust shift (avg=0)”. This centers the data on the (0,0) coordinate.
[0128] A second dialog box will open displaying two columns of histograms; one for the x-axis and one for the y-axis. Two default histograms are already selected: the first in the x-axis column and the second in the y-axis column. Click “OK” to accept the default choices.
[0129]
[0130] A second dialog box will open displaying two columns of histograms; one for the x-axis and one for the y-axis. These dialog boxes are shown in
[0131] The plot will open, notice that the cluster shown in
[0132] The selected genes are now also listed in the Gene List shown in
[0133] Move to the top of the Gene List shown in
[0134] The highlighted genes will now be selected in the Gene List. In some embodiments these genes are identified with a red diamond in the “Selected in” column. In the frames of the Variance Histogram and Distance Plot windows, there is a “Reset Selection” button. Click on this button in both analysis views to deselect the genes selected in them. The genes selected in the Gene List will also remain selected in all analysis views. In the preferred embodiment the selected genes are displayed in a color, for example, red so that the user can easily identify the genes of interest in any analysis window.
[0135] The selected genes have a high level of variance in expression and are not conforming to an average expression profile. To find the genes that are cyclins, you can use SRS or a similar program/system. The details of the SRS system are disclosed in PCT/EP99/10383 incorporated herein by reference. This feature of the software is illustrated using SRS, however, one of ordinary skill in the art could implement the program with similar systems.
[0136] Click on the “SRS” button in the tool bar. The SRS interface window opens. This window is illustrated in
[0137] A number of databases may be listed. Hold down the Ctrl key and click on the desired databases to highlight them, for example, “EMBL”, “Swissprot” and “GENBANK”. A blank text field and pull-down menu, listing searchable database entry fields, appears below the Q
[0138] A list of genes from the experiments that match the SRS query will be listed in the SRS window. These genes may be automatically selected in all the analysis views and listed in the Gene List as shown in
[0139] Highlight this gene in the Gene List and click on the “Profile” button in the tool bar. The Expression profile of the gene and its description will appear. The expression profile of the gene and its description for AA001916 are shown in
[0140] The cluster tree and red/green plot displays genes grouped by expression pattern. Click on the “Cluster” button in the tool bar. Select “Mean of Experiments” as the reference state (this is the default selection). Click “OK”. Enlarge the Cluster Tree Window to the full height of the Analysis Window by moving the mouse over the bottom frame of the Cluster Tree Window. Then click and drag the bottom edge of the frame down and release the mouse button at the bottom of the Analysis Window. Zoom out of the Cluster Tree view by clicking on the vertical scale tab and dragging it toward the bottom of the Cluster Tree Window, stopping just before a scroll bar appears. An exemplary scale tab is illustrated in
[0141] The clusters can now be seen more clearly. An example of a red/green plot for fibroblast data showing Clusters A, B, C, I, J, and K is shown in
[0142] The Project List is a hierarchical set of project folders. Projects allow software users, working in a multi-user environment, to separate and organize their work. Projects can contain sub-projects, experiment groups and gene lists and can be organized in a hierarchical manner. Users can assign permissions to their projects, determining which software users can access them. Projects can be individually owned or worked in by a group of users.
[0143] Experiment data saved in the server is accessed by the users through their projects. Within a project, an experiment group folder is created to hold the experiments. The user can then choose which data from the database to import into their experiment group for analysis. Data in the Project List, i.e. experiment groups and gene lists, can be exported to the local machine as ASCII files. An example of a Project Folders display is shown in
[0144] Projects are folders containing sub-projects as well as experiment groups, gene lists, and user annotations. Data analysis is done within projects. However, before working within a new project, an experiment group should be created in that project and experiments imported into the group. In order to prevent data analysis problems, some embodiments of the invention may require the creation of an experiment group.
[0145] Access to your projects may be controlled by granting read only, read/write, etc. permissions to other users. Access control enables a company to protect the results of experiments in an effort to protect valuable trade secrets.
[0146] In the Project List, highlight the folder in which the new project is to be created, either the top-level “Projects” folder or one of your own project folders. Choose Edit>New>Project from the menu bar or right-click on the selection and choose New>Project from the menu. A dialog window will open. Enter a name for the new project in the text field. Click “OK” and the new project will appear in the project list. The project is automatically saved in the database.
[0147] Highlight the folder in the Project List. Choose Edit>Delete from the Edit command menu. A “Please confirm” dialog box will open: Click “Yes” to delete the folder. Click “No” to cancel the Delete command.
[0148]
[0149] When a user performs analyses, the user first selects an experiment group in the project list, then selects filters to analyze the data. Experiment groups are created and stored inside a project folder; they cannot exist independently in the project list. Their folders may be identified with the image of a flask.
[0150] Highlight a project in the Project List and choose Edit>New>Experiment Group from the Edit command menu or right-click on the project and choose New>Experiment Group. A dialog box will open. Enter a name for the experiment group in the text field and click “OK”. An exemplary dialog box is shown in FIG.
[0151] An experiment group must contain at least two experiments to be complete. Experiments are sets of intensity data from one reading of one chip, micro array or membrane. Typically, all the experiments done to test a particular hypothesis would be grouped into one experiment group for analysis.
[0152] Intensity data is imported by users. In some embodiments a user may require administrator permissions to import data. Upon import into the software database, experiments may be put into classes. An example of a class selection dialog box is shown in
[0153] The order in which experiments are added to the experiment group is important. If they are entered in a different order, some of the analysis views may be less meaningful.
[0154] Highlight the experiment group in the Project List and choose Edit>Add/Remove from the Edit command menu or right-click on the experiment group and choose Add/Remove. A dialog box will open. Select the class of experiments you wish to choose from. A second dialog box will open listing the experiments in the chosen class. Highlight the experiments in the appropriate order. A user may use the Ctrl and Shift keys to highlight multiple experiments. Click on the “Add” button to move them into the “Add” column. Alternatively, the user can double-click on them and they will automatically shift to the “Add” column. Click “OK” to import the chosen experiments into your experiment group. A sample dialog box is shown in
[0155] A section of experiments may be grouped into a sub group to facilitate analysis on only those experiments. Simply create a new experiment group in the parent experiment group and add to the new experiment group only a subsection of the experiments in the parent group.
[0156] Once the experiment group is analyzed, there will probably be a selected group of genes for further study. Unwanted data can be screened out and analysis performed on just these genes by first creating a gene list. A sample gene list is shown in
[0157] Gene lists are stored in your project folder. Gene lists are displayed in the Project List as lists with an overlying chromosome. The genes inside the gene list are depicted as a DNA helix. Select genes of interest in the analysis views. Highlight the project folder in which you want to save the new gene list. Choose Genes>Save Selection As Gene List from the Genes command menu. A dialog box will open asking you to confirm that you wish to save the list of selected genes as a new gene list. Click on the “Yes” button. A second dialog box will open asking you to enter a name for the new gene list. Type in the name and click “OK”. The new gene list will be saved in the highlighted project folder.
[0158] Over time a user will probably have many projects, experiment groups, experiments and gene lists in the Project List. To remember details of each, e.g. experimental conditions or interesting observations on the behavior of particular genes, it is helpful to annotate them with comments about them and an analysis of them.
[0159] Annotations are written and read in the annotation editor. A sample editor is illustrated in
[0160] In the text box, write in the comment. Clicking on “OK” will save the comment. Clicking on “Cancel” will close the annotation editor without saving the comment or changes. The “last updated” date and time for the annotation is taken from the client computer when the user clicks “OK”.
[0161] The next time the user opens the annotation editor for this Project List item, the user will open the tab and be able to read and edit the previous comments. Other users who open the annotation editor for the same Project List item will get their own tab for adding their annotations. Users can read other users' annotations, but can only edit their own.
[0162] Click on a Project List item to highlight it. Choose Edit>Annotate from the Edit command menu, or right-click on the item and choose Annotate; the annotation editor window opens. Erase the text “edit your comment here” and type your annotation in the text field. Click “OK” to save your annotation and close the editor window.
[0163] Select the Project List item whose annotations a user desires to view. Choose Edit>Annotate from the Edit command menu, or right-click on the item and choose Annotate; the annotation editor window opens. Click on the tab for the annotation to read. A user can edit only their own annotation, simply by typing in the text field. Click “OK” to close the Annotation Editor and save any changes.
[0164] A user can set the permissions for their projects and experiment groups, controlling who has read, write and execute access to them. Permissions can be given to individual users and/or to user groups. Permissions are modified in the permissions dialog window, an example of which is shown
[0165] To set the permissions for your project: Highlight the project or experiment group in the Project List. Choose Edit>Permissions from the Edit command menu. The permissions dialog window will appear. Click in the “Read”, “Write” or “Execute” column next to a user's name; this places an “X” in that box and grants that permission. Click “OK” to save the new permissions and close the dialog box.
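By way of illustration only, a table of read, write and execute grants for users and user groups could be sketched as follows. The class and method names are hypothetical, and the rule that a user inherits the grants of their groups is an assumption made for this sketch.

```python
PERMISSIONS = ("read", "write", "execute")

class PermissionTable:
    """Per-project table of grants, one row per user or user group."""

    def __init__(self):
        self._grants = {}  # user or group name -> set of granted permissions

    def grant(self, principal, permission):
        """Equivalent to placing an "X" in one box of the dialog."""
        if permission not in PERMISSIONS:
            raise ValueError(f"unknown permission: {permission}")
        self._grants.setdefault(principal, set()).add(permission)

    def allowed(self, user, permission, groups=()):
        """True if the user, or any group the user belongs to, has the grant."""
        principals = (user, *groups)
        return any(permission in self._grants.get(p, set()) for p in principals)

table = PermissionTable()
table.grant("alice", "read")
table.grant("lab_group", "write")
```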
[0166] Using the export option, you can export your experiment groups and gene lists to your local machine as RDB format files (tab delimited text). You have the options of simply exporting the gene names or you can include their descriptions and intensity values as well. When exporting gene lists you must select the experiment group from which the intensities will be read. Exported files can be opened with Microsoft Excel™, Word™ or any simple text program.
[0167] Highlight your experiment group in the Project List window. Select File>Export from the command menu. A dialog box for selecting the destination of the exported file will open. Choose the location in which you want to store the experiment group, then click “Export”.
[0168] Highlight your gene list in the Project List window. Select File>Export from the command menu. A dialog box for selecting the experiment group to associate the genes with will open. Select the experiment group to get the intensities from. Click “OK”. A dialog box for selecting the destination of the exported file will open. Choose the location in which you want to store the gene list, then click “Export”.
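By way of illustration only, the gene list export described above could be sketched with the Python standard library as follows. The column names and function name are hypothetical, and the sketch writes plain tab-delimited text with a single header row; any further header conventions of the RDB format are not reproduced here.

```python
import csv
import io

def export_gene_list(genes, descriptions=None, intensities=None, experiments=()):
    """Return tab-delimited text for a gene list.

    descriptions: optional dict gene -> description.
    intensities: optional dict gene -> {experiment name -> value}, read from
    the experiment group chosen at export time.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    header = ["gene"]
    if descriptions is not None:
        header.append("description")
    if intensities is not None:
        header.extend(experiments)
    writer.writerow(header)
    for gene in genes:
        row = [gene]
        if descriptions is not None:
            row.append(descriptions.get(gene, ""))
        if intensities is not None:
            row.extend(intensities[gene][exp] for exp in experiments)
        writer.writerow(row)
    return out.getvalue()
```

Passing only the gene names corresponds to the names-only export; supplying descriptions and intensities adds the optional columns.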
[0169] The software provides many algorithms for analyzing data. A user can plot two experiments against each other in the difference plot. Genes can be clustered by their principal components with the distance plot, or with the cluster tree. You can create histograms displaying the variance of expression levels across an experiment group, gene classifications and genes that adhere to a preconceived profile. With the gene profile, a user can visualize the expression pattern of a single gene across many experiments. The user can select genes that look interesting in any plot or histogram, and they will be automatically selected in all open analysis views for easy comparison of analyses (
[0170] The different analysis filters will extract different information from the experiments. Using multiple filters in combination with the cross-window selection capabilities allows the user to quickly gain valuable insight into the experimental data. Experiments are analyzed in the context of an experiment group. The experiment group, for example, may be a series of time points or comparable experiments from a wild type and a mutant. The experiment group is typically highlighted in the Project List window before selecting an analysis filter.
[0171] Each analysis opens in its own window within the analysis window. As discussed above, the individual analysis windows can be resized, minimized and maximized. When there are many analysis windows open, they will overlap making it difficult to see them all. The “Windows” command menu will list all of the analysis windows open; highlighting a window will cause it to move to the front.
[0172] When starting a new analysis filter, the user may be asked to select parameters. The data scaling procedure is the only parameter which is used by all the analysis filters (except the gene profile), so it will be explained first.
[0173] Scaling of the data allows the user to adjust the units of the plot and histogram axes or the position of the data in the plots. When you start a new analysis, a dialog window will pop open asking you to select the scaling procedure. A sample of this dialog box is illustrated in
[0174] 1) no scaling: the data values are used as is;
2) logarithmic: this plots the exponent of the value instead of the actual value, e.g. 10 is plotted as 1, 100 as 2 and 1000 as 3, creating a plot with a much smaller scale;
3) adjust scales (sd=1): this creates a plot in which the standard deviation is equal to one, no matter the shape of the curve; thus, curves created from data having very different ranges of values can be compared;
4) adjust shift (avg=0): this centers the data in the plot;
5) Correlation Histogram: the Correlation Histogram has a specific set of procedures to normalize the target values that you enter;
6) no scaling: values are used as is;
7) logarithmic range: interprets your values as log values; and
8) unit range: sets the highest value you enter equal to 1 and the lowest value equal to −1; all intermediate values are adjusted accordingly.
If the data includes negative values, do not choose the logarithmic procedure.
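By way of illustration only, the scaling procedures listed above could be sketched as follows. The function names are hypothetical, and the use of base-10 logarithms and the population standard deviation are assumptions made for this sketch.

```python
import math
from statistics import mean, pstdev

def logarithmic(values):
    """Plot the exponent instead of the value: 10 -> 1, 100 -> 2, 1000 -> 3."""
    if any(v <= 0 for v in values):
        raise ValueError("logarithmic scaling needs strictly positive values")
    return [math.log10(v) for v in values]

def adjust_scales(values):
    """Rescale so that the standard deviation equals one (sd=1)."""
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("all values are identical; cannot scale to sd=1")
    return [v / sd for v in values]

def adjust_shift(values):
    """Shift so that the average equals zero (avg=0)."""
    avg = mean(values)
    return [v - avg for v in values]

def unit_range(values):
    """Map the highest value to 1 and the lowest to -1, linearly in between."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; range is undefined")
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]
```

The check in `logarithmic` reflects the caution above: the logarithm is undefined for zero or negative values, so the procedure is rejected for such data rather than silently producing invalid points.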
[0175] The difference plot lets you plot one experiment against another. Genes whose expression varies between the two experiments will fall farther from the 45 degree diagonal, where expression in both experiments is the same (x=y), than those whose expression levels are similar in the two experiments.
[0176] Within a project, highlight the experiment group to analyze. Click the Difference button in the tool bar or choose Analysis>Difference Plot from the Analysis command menu. A dialog box will open asking you to choose a scaling procedure. A sample dialog box is illustrated in
[0177] Only 2 experiments out of the experiment group can be compared in the difference plot. To compare all the experiments, use the distance plot. Sample difference plots are illustrated in
[0178] The difference plot displays the genes in the experiment group as dots. The position of the dot is determined by the expression levels measured in the two experiments selected in the dialog box (
[0179] The diagonal line on the plot can be used to distinguish genes by their degree of difference in expression between the two experiments. From this line you can create a cone, which excludes genes whose x-fold over- or under-expression is less than a cutoff. To create this cone, position the mouse over the diagonal line to get the plus-sign cursor (+). Click and drag this cursor over the plot, away from the diagonal. This will open an information bubble displaying the cutoff as a multiple of expression, e.g. “3×Expression”. Genes whose expression difference exceeds this cutoff fall outside the cone that is drawn when the left mouse button is released. If genes were selected (highlighted) prior to drawing the cone, releasing the left mouse button will deselect genes falling inside the cone.
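The cone cutoff can be illustrated with a short sketch. It assumes unscaled, strictly positive expression values, so that the fold difference is simply the ratio of the larger value to the smaller one. The function names are hypothetical and do not come from the software.

```python
def fold_difference(x, y):
    """Return how many times larger the higher expression level is than the lower."""
    if x <= 0 or y <= 0:
        raise ValueError("fold difference is only defined here for positive values")
    return max(x / y, y / x)

def split_by_cone(genes, cutoff):
    """Split genes into those inside the cone (below the cutoff) and those outside.

    genes maps a gene name to its (x, y) expression levels in the two experiments.
    """
    inside, outside = [], []
    for name, (x, y) in genes.items():
        if fold_difference(x, y) < cutoff:
            inside.append(name)   # close to the diagonal, similar expression
        else:
            outside.append(name)  # at least cutoff-fold over- or under-expressed
    return inside, outside
```

With a cutoff of 3, a gene measured at (10, 11) would lie inside the cone, while genes at (10, 40) or (30, 9) would lie outside it.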
[0180] The name of a gene may be revealed by holding the mouse over the plot; the gene name will appear in a text bubble. To make the selection text bubble (“3×Expression”) disappear, click on it. To reset the cone, simply position the mouse over the diagonal line to get the plus-sign cursor and click with the left mouse button.
[0181] To zoom into the analysis view, click and drag with the right mouse button around the area you would like to zoom into. Click the “Reset Zoom” button to reset the view.
[0182] In the Difference Plot and the Distance Plot the numbers along the axes refer to the relative expression levels of the plotted genes. A gene plotted at (4, 20) in the Distance Plot is expressed at a relative level of 4 in the experiment plotted along the x-axis and at a relative level of 20 in the experiment plotted along the y-axis. Units are arbitrary.
[0183]
[0184] The distance plot is a variation of principal component analysis (PCA). It plots the genes in the selected experiments in such a way that the distance between genes on the plot is directly proportional to the difference in expression levels of those genes. To create the plot, you must select a scaling procedure and two axes which represent the degree of variation in your data (the principal components).
[0185] When you create a new distance plot, a dialog box will open so that you can select a scaling procedure (an example is illustrated in
[0186] The position of a gene in this plot gives the relative degree of similarity between its expression and the expression of all the other genes and between the gene's profile and the chosen axis patterns. The closer a gene lies to another gene, the more similar their expression profiles, and the closer the gene is to an axis, the more its profile resembles that of the chosen axis pattern. Outliers show non-average expression profiles and variance coverage.
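As a rough illustration of how such a plot could be computed, the sketch below projects each gene's expression profile onto the two leading principal components. It is a plain PCA by power iteration, written under stated assumptions: the source describes the distance plot only as a variation of PCA, so the details here (centering, covariance, choice of the two leading components, all function names) are hypothetical. At least two genes and two experiments are required.

```python
import math

def center(rows):
    """Subtract each experiment's mean so the genes are centered on the origin."""
    count, width = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / count for j in range(width)]
    return [[row[j] - means[j] for j in range(width)] for row in rows]

def covariance(rows):
    """Covariance matrix between experiments, computed from centered rows."""
    count, width = len(rows), len(rows[0])
    return [[sum(row[i] * row[j] for row in rows) / (count - 1)
             for j in range(width)] for i in range(width)]

def leading_component(matrix, iterations=500):
    """Approximate the dominant eigenvector and eigenvalue by power iteration."""
    width = len(matrix)
    start = [1.0 + 0.1 * i for i in range(width)]  # deterministic starting vector
    length = math.sqrt(sum(x * x for x in start))
    vector = [x / length for x in start]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(width))
                   for i in range(width)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break  # no remaining variance in this direction
        vector = [x / norm for x in product]
    eigenvalue = sum(vector[i] * matrix[i][j] * vector[j]
                     for i in range(width) for j in range(width))
    return vector, eigenvalue

def distance_plot_coordinates(profiles):
    """Map each gene name to (x, y) coordinates on the two leading components."""
    names = list(profiles)
    rows = center([profiles[name] for name in names])
    matrix = covariance(rows)
    width = len(matrix)
    axes = []
    for _ in range(2):
        vector, eigenvalue = leading_component(matrix)
        axes.append(vector)
        # Deflate: remove the component just found before searching for the next.
        matrix = [[matrix[i][j] - eigenvalue * vector[i] * vector[j]
                   for j in range(width)] for i in range(width)]
    return {name: tuple(sum(a * b for a, b in zip(row, axis)) for axis in axes)
            for name, row in zip(names, rows)}
```

In such a projection, genes with near-average profiles land close together, and a gene with an unusual profile lands far from that group.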
[0187] Within a project, highlight the experiment group to analyze. Click on the “Distance” button in the tool bar or select Analysis>Distance Plot from the Analysis command menu. A dialog box will open; choose a scaling procedure and click “OK”. A sample scaling dialog box is illustrated in
[0188] Most of the genes in the above distance plot are very close together forming a tight cluster; this is typical. The genes in the cluster all have similar levels of expression across the course of the experiments in the group and hence similar expression profiles. They represent the average. A sample distance plot is shown in
[0189] The genes which fall at a distance from this cluster, mostly to the right, display very different expression levels from the average gene. If the gene falls at a coordinate which is high on both the x- and y-axes, that gene also has an expression profile which is very different from the average. You can reveal the name of a gene by holding the mouse over the plot; the gene name will appear in a text bubble. To zoom into the analysis view, click and drag with the right mouse button around the area you would like to zoom into. Click the “Reset Zoom” button to reset the view.
[0190] If we select the gene plotted at (12,26) in
[0191] The variance histogram depicts the standard deviation of expression levels across a series of experiments vs. the number of genes that display such a level of variance in expression level. The genes having little variation in expression levels over the series of experiments will be found together at the left end of the histogram. The genes that do show inconsistency in expression levels across the series of experiments will be found to the right of the histogram. To determine the exact coordinates of the top of a histogram bar, in any histogram analysis, mouse over the bar and the coordinates will be displayed.
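The construction of such a histogram can be sketched as follows. The binning scheme and the use of the population standard deviation are assumptions made for illustration, and the function name is hypothetical.

```python
from statistics import pstdev

def variance_histogram(profiles, bin_width):
    """Group genes into histogram bars by the standard deviation of their expression.

    profiles maps a gene name to its expression levels across the experiments.
    Returns a dict mapping a bar index to the list of gene names in that bar;
    bar k covers standard deviations from k * bin_width up to (k + 1) * bin_width.
    """
    bars = {}
    for name, levels in profiles.items():
        sd = pstdev(levels)              # assumption: population standard deviation
        index = int(sd // bin_width)     # low variation -> left end of the histogram
        bars.setdefault(index, []).append(name)
    return bars
```

The height of each bar is the length of its gene list, so genes with steady expression, whether low or high, accumulate in the leftmost bar.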
[0192] Select the experiment group you wish to analyze by highlighting it in the project list. Alternatively, the user may click on the analysis in the tool bar or select Analysis>Variance from the Analysis command menu. A dialog box will open. Click in the exclude column to shade the box for each experiment you do not wish to be used to calculate the histogram. When satisfied with the selection of experiments to include in the histogram, click “OK”.
[0193] Typical variance histograms are illustrated in
[0194] In this example, most of the genes are found toward the left, at the low end of the histogram, meaning that they have steady expression levels over all the experiments in the group. Their expression levels can be low or high as long as they remain constant.
[0195] The tail of the histogram, toward the high end, displays the genes which show, from left to right, medium to high degrees of fluctuation in expression level over the course of the experiments. These are the genes that are up or down regulated at some point in the experiment.
[0196] The histogram typically does not display any information about the type of variation displayed by the genes. In other words, the user cannot see from this analysis whether the genes which show some variance are up or down regulated. However, the user can get this information by: first, clicking on the histogram bars of interest, so that the selected genes are listed in the gene list window; and second, creating a gene profile of these gene(s).
[0197] The correlation histogram allows a user to enter a pre-conceived set of values defining a search vector, for example a gene expression profile, and plot the genes in the experiment group according to how their expression behavior correlates to your target values.
[0198] There are two ways correlation can be sought: one, by comparing the shape of the profile (vector) given by your target values or two, by comparing the absolute values. You must specify which comparison method to use in the “Select Parameters” dialog box. Samples of this dialog box are illustrated in
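The difference between the two comparison methods can be sketched with two common measures. This is an assumption for illustration: the source does not name the measures, so Pearson correlation stands in for comparing the shape of the profile and Euclidean distance stands in for comparing absolute values.

```python
import math

def pearson(a, b):
    """Shape comparison: correlation between two profiles, ignoring scale and offset."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return cov / math.sqrt(var_a * var_b)

def euclidean(a, b):
    """Absolute-value comparison: straight-line distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

A gene whose profile is ten times the target values has the same shape as the target, so its correlation is close to 1, yet its absolute values are far from the target, so its distance is large.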
[0199] Select the experiment group you wish to analyze by highlighting it in the project list. Click on the “Correlation” button in the tool bar or select Analysis>Correlation Histogram from the Analysis command menu. A dialog window will open. Select a scaling procedure and a comparison method from the menus. Click “OK”. An example of a dialog box where a user may select both the scaling procedure and comparison is illustrated in
[0200] Alternatively, if you have a Gene Profile open, you can select that profile and import the values directly from it. Hit the “Enter” key after editing the last value in the list to enter that change, then click “OK”.
[0201] The correlation histogram illustrated in
[0202] In the example shown in
[0203] Looking at the histogram, most of the genes fall on the left side of the histogram, indicating that their expression profiles do not match the target value input. There are several short bars creating the right-hand tail of the curve. These genes have expression profiles which are increasingly similar to the target values entered in the dialog box and there is a single bar, far to the right, which is apparently very similar to the target value. Selecting this bar, by clicking on it reveals that it represents a single gene. Creating a gene profile permits a comparison of the displayed gene's profile to the target values entered.
[0204] This analysis filter allows the user to classify experiments and look for genes that can be classified the same way. A sample Classify Experiments dialog box is illustrated in
[0205] The histogram is created by plotting the degree of correlation between the gene expression profiles and the experiment classes versus the number of genes showing such a degree of correlation. Genes that are highly expressed throughout the experiments in the positive class and expressed at low levels in the negative class are positively correlated to your classes. If they are expressed at low levels in the positive class of experiments and highly expressed in the negative class, they are negatively correlated. If the genes do not show a consistent expression level within a class, or the expression levels are the same in both classes, then there is no correlation between gene expression and the classes and these genes cannot be classified. These relationships are illustrated in Table VI below.
TABLE VI Comparison of expression levels in classified experiments.
Experiment Class    Experiment    gene 1 expression level    gene 2 expression level    gene 3 expression level
Positive            exp. 1        high                       low                        low
Positive            exp. 2        high                       low                        high
Negative            exp. 3        low                        high                       low
Negative            exp. 4        low                        high                       high
Negative            exp. 5        low                        high                       low
Gene expression shows correlation to exp. class?
                                  yes, positive correlation - identical tail of histogram
                                                             yes, negative correlation - opposite (or inverse) tail of histogram
                                                                                        none - center of histogram
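The relationships in Table VI can be reproduced with a small sketch. It assumes that the positive class is encoded as +1 and the negative class as -1, that "high" and "low" are encoded as 1 and 0 for illustration, and that the degree of correlation is a Pearson correlation; none of these details is stated in the source.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat profile cannot be classified
    return cov / math.sqrt(var_a * var_b)

# Experiment classes from Table VI: exp. 1-2 positive, exp. 3-5 negative.
CLASSES = [1, 1, -1, -1, -1]

# Expression levels from Table VI, with high = 1 and low = 0 (illustrative encoding).
GENES = {
    "gene 1": [1, 1, 0, 0, 0],
    "gene 2": [0, 0, 1, 1, 1],
    "gene 3": [0, 1, 0, 1, 0],
}

def classification_scores(genes, classes):
    """Return the correlation of every gene profile with the class vector."""
    return {name: pearson(levels, classes) for name, levels in genes.items()}
```

Under these assumptions gene 1 scores near +1, gene 2 near -1, and gene 3 close to the center, which matches the three histogram regions named in the table.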
[0206] Select the experiment group you wish to analyze by highlighting it in the project list. Click on the “Classification” button in the tool bar or select Analysis>Classification Histogram from the Analysis command menu. A dialog window for classifying the experiments will open. A sample of this dialog box is shown in
[0207] The classification histogram shown in
[0208] The genes lying on the far left of the histogram shown in
[0209] The genes falling in the middle of the histogram shown in
[0210] The genes falling to the far right of the histogram shown in
[0211] The cluster tree analysis hierarchically clusters genes by similarity in their expression profiles, creating a tree view of all the genes in the experiment group and their relationships to each other. Next to the tree view is a colored bar for each gene showing its relative expression level in each experiment. You must select the reference state from which the up/down regulation will be measured. You can select a particular experiment or you can select to use the mean value of all experiments as the reference state.
[0212] Within a project, highlight the experiment group you wish to analyze. Click on the “Tree-plot” button in the tool bar or select Analysis>Tree View from the Analysis command menu. A dialog box opens listing the experiments in the experiment group. Choose an experiment or choose “Mean of Experiments” as the reference state by clicking in the menu. A sample dialog box that may be employed to choose the reference state is shown in
[0213] The Tree View can be adjusted by sliding the scale tabs and sliding the scroll bars. Scaling back to see a greater area may help the user see the clusters and determine which part of the tree looks interesting. Zooming in on an area will allow the user to see details of the tree and the genes represented in those leaves.
[0214] Branch lengths in the tree diagram reflect the degree of similarity between two expression profiles. Shorter branches between genes indicate that the genes have more similar expression profiles. Generally, genes having similar functions will be clustered together, as shown by experiments done by Michael B. Eisen, et al. (1998).
[0215] Each row of the red/green plot represents a gene and each column represents an experiment. The color of the rectangle represents the expression level of that gene in that experiment, where down regulation is green and up regulation is red. Up/down regulation is relative to the reference state selected.
[0216] Genes can be selected by clicking on them in the red/green plot. Entire nodes of the tree can be selected by clicking on the desired node. All the genes selected are displayed in the selected gene list and highlighted across all views.
[0217] The method used to create the cluster tree is described by: Michael B. Eisen et al. (1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868.
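For orientation only, hierarchical clustering of expression profiles can be sketched as below. The sketch uses average linkage with a distance of one minus the Pearson correlation. These choices and all function names are assumptions made for illustration, and the cited publication should be consulted for the actual method.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation between two expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def cluster_tree(profiles):
    """Build a nested-tuple tree by repeatedly merging the two closest clusters."""
    # Each cluster is a pair: (tree so far, list of member gene names).
    clusters = [(name, [name]) for name in profiles]

    def distance(first, second):
        # Average linkage: mean pairwise distance between the members.
        pairs = [(a, b) for a in first[1] for b in second[1]]
        total = sum(1 - pearson(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

def leaves(tree):
    """List the gene names under a node of the tree."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])
```

Selecting a node of such a tree corresponds to calling `leaves` on it, which yields all the genes below that node.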
[0218] The gene profile displays a histogram of a single gene's expression levels over the series of experiments in the experiment group. Below the histogram is the gene description. One example of a gene profile is shown in
[0219] Highlight at least one gene in the gene list in the bottom window. Right-click on the highlighted gene(s) and choose “Show Profile(s)” (Ctrl+P), or click on the “Profile” button in the tool bar. The Profile(s) will open in the main analysis window.
[0220] The SRS or similar interface allows the user to make text based queries of available in-house or other databases to find annotations about the genes that the user is interested in. The power of SRS lies in its unique ability to follow links between databases and essentially treat the different databases as one seamless repository.
[0221] The SRS software interface provides two modes of querying, simple and detail. The simple querying mode lets you submit a preconfigured query with the least amount of work on your part. The detail query mode, on the other hand, lets you configure your own queries to control the stringency and the complexity of your searches. Detail mode also allows you to perform linking operations.
[0222] The bioSCOUT function allows you to pull up complete feature reports summarizing the function and characteristics of the gene product.
[0223] The SRS interface is opened by clicking on the SRS button on the tool bar or choosing Analysis>Query SRS from the Analysis command menu. The interface opens within the software analysis window. Alternatively, other database search programs could be opened in a similar fashion.
[0224] The tool bar has four option buttons: “Stop”, “Detail Mode” (“Simple Mode” when you are in the Detail Mode), “Submit” and “Deselect”. Table VII illustrates these buttons and their actions.
[0225] There are three windows in the SRS interface: the query window on the left, the results window on the top-right, and the entry window on the bottom-right.
[0226] The user constructs the queries in the query window. Preconfigured queries or databases (depending upon which query mode you are using) may be listed here and text fields for entering query terms will also be in this window.
[0227] The list of results found by your query is displayed in the results window. Gene names, ID numbers and descriptions will be listed.
[0228] The complete database entry for a selected result hit is displayed in the entry window.
[0229] When in the simple querying mode, the query window will look different than when in the detail querying mode. The query window in simple mode displays a list of predefined queries. Below this list of queries, at the bottom of the query window, is a labeled text field. The label displays the database field, or type of database information, that will be searched in this query. The text field will display a predetermined query term appropriate for this query or will be blank. If it is blank, the user will enter their own query term or terms before submitting the query. A sample of the SRS interface in simple query mode is illustrated in
[0230] When in the detail query mode, the query window will have a tab listing the databases available to search. Below the tab will be a pull-down menu and a text field. The menu displays the list of available database fields for querying, e.g. Keywords and Metabolite. The contents of this menu depend upon the database(s) selected. If multiple databases are selected, only the fields available in all the selected databases will be available for querying. The text field is blank for typing in your query term or terms. Next to the text field are two buttons (+) and (−). The plus button opens an additional menu and text field for searching multiple database fields in one query. The minus button closes a text field and menu if you decide it is not needed. A sample SRS interface window in detail mode is illustrated in
[0231] Above the Q
[0232] Querying in simple mode is a user friendly way of making SRS queries for anyone without previous experience using SRS. In this mode, the user can choose from a list of preconfigured queries, e.g. “Query on pathway”, which will automatically run a set of database searches and linking operations to retrieve the specified information.
[0233] Once a simple query has been performed, the user can switch to the detail mode and the process used to carry out the query will be available to view. That is, each query and linking operation will be there as if the user had constructed the query in detail mode!
[0234] To perform a query in the simple mode, the user selects the query from the list and then, depending upon the particular query, either enters a query term in the provided text field or immediately clicks on the “Submit” button.
[0235] Select a query in the left-hand window by clicking on it. Depending upon the query the user may or may not need to enter in a query term in the text field at the bottom of the query window. If there is no text in the text field already, enter a query term as discussed below.
[0236] The database field which will be searched with your query term is listed next to the query term text field. This information helps the user enter an appropriate term. Type your query term(s) in the text field. Multiple query terms entered in a single text field are typically separated with Boolean operators.
[0237] Click on the “Submit” button in the tool bar to launch the query.
[0238] In the detail mode the user can configure their queries to obtain the specific information needed from the databases selected by the user. The query can consist of a straight-forward search using a single query term to search a single database, or it can consist of a complex series of searches of multiple databases, multiple query terms combined with Boolean operators and linking operations.
[0239] The process of performing a query in the detail mode starts with selecting databases. Next, the user selects database fields to search, enters the query terms to search these fields and then submits the query. SRS then produces a list of results. The user can stop at this point or refine the query by adding additional searches. The user may also choose to receive information from a database that wasn't queried, based on the results of the first query, by linking the first query to another query or database.
[0240] Select a database(s) by clicking or selecting a database in the Q
[0241] After entering the detail query mode, by clicking on the “Detail Mode” button in the SRS interface tool bar, a tab labeled Q
[0242] Databases are grouped by type; to open or close a database group, click on the toggle switch for the group folder. To select a database, simply click on it in the list. Use the Ctrl key (Cmd on Mac) to select multiple databases. When selecting multiple databases, typically, the user may only be able to query fields that are present in all the selected databases.
[0243] After selecting the databases to search in the current query, a user may move to the bottom section of the query window. In some embodiments this section has a pull-down menu listing the database fields available for searching and a text field for entering a query term or query terms. Select the database field to search and type in an appropriate query term. Multiple query terms entered in a single text field can be combined with the Boolean operators “AND”, “OR”, “BUT NOT” by using the symbols shown in Table VIII.
TABLE VIII Boolean Operators
Symbol    Operator
&         AND
|         OR
!         BUT NOT
[0244] A second database field menu and text field may be opened by clicking on the (+) button next to the first text field, and so on. When searching multiple fields in a single query, the field queries are combined with the “AND” operator by default. Hence, the results of such a query all meet the criteria specified for each of the database fields searched. If the user chooses the “OR” operator, the hits only have to meet one of the field criteria to be included in the results list. The “BUT NOT” operator returns a list of hits which meet the criteria of the first field search and do not meet the criteria defined in the second text field.
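The effect of the three operators on result lists can be sketched with ordinary set operations. This illustrates the semantics described above and is not SRS code; the function name and the entry identifiers in the usage note are hypothetical.

```python
def combine_hits(first_hits, second_hits, operator="&"):
    """Combine the hits of two field searches using a Table VIII symbol.

    "&" (AND) keeps hits that meet both criteria, "|" (OR) keeps hits that
    meet at least one, and "!" (BUT NOT) keeps hits that meet the first
    criterion and not the second. AND is the default, as in the text above.
    """
    first, second = set(first_hits), set(second_hits)
    if operator == "&":
        return first & second
    if operator == "|":
        return first | second
    if operator == "!":
        return first - second
    raise ValueError("unknown operator: " + repr(operator))
```

For example, combining the hits {"P1", "P2", "P3"} with {"P2", "P3", "P4"} gives {"P2", "P3"} under AND, all four entries under OR, and {"P1"} under BUT NOT.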
[0245] A link is any reference in a database entry to another database entry in the same or another database. These links can be hyperlinks or text references. For example: EMBL database entry A's function was predicted by sequence similarity to Swissprot database entries B and C. In this case, a link exists between database entry A and B and between database entries A and C. It is very likely that there is a link between the entries B and C as well.
[0246] Linking is the process of following links to find entries in one database which are related to entries in another. Links can be followed directly, from entry A (from the above example) to entry B, or they can be followed indirectly, from entry A in Swissprot through entry B in EMBL to entry D in a third database.
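Following links directly and indirectly can be pictured as a traversal of a graph whose nodes are database entries. The sketch below is a generic breadth-first search under stated assumptions: the data structure, the function name, and the database name "ThirdDB" are hypothetical, and this is not how SRS itself is implemented.

```python
from collections import deque

def linked_entries(links, start, target_database):
    """Return entries of target_database reachable from start by following links.

    links maps an entry, written as a (database, identifier) pair, to the list
    of entries it references. Direct and indirect links are both followed.
    """
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        entry = queue.popleft()
        for neighbor in links.get(entry, []):
            if neighbor in seen:
                continue  # each entry is visited once, even if links form a cycle
            seen.add(neighbor)
            if neighbor[0] == target_database:
                found.append(neighbor)
            queue.append(neighbor)
    return found
```

Given links from an EMBL entry A to Swissprot entries B and C, and from B to an entry D in a third database, a search from A for the third database would reach D indirectly through B.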
[0247] Click on the (+) button above Q
[0248] To link the current query (Q
[0249] To link Q