[0001] This application is based on, and claims the benefit of, U.S. Provisional Application No. 60/248,259, filed Nov. 14, 2000, entitled “Testing for Differentially-Expressed Genes by Maximum Likelihood Analysis of Microarray Data,” and U.S. Provisional Application No. 60/266,388, filed Feb. 2, 2001, entitled “Methods for Determining the True Signal of an Analyte,” both of which are incorporated herein by reference.
[0003] The invention relates generally to quantitative expression analysis, and more particularly, to methods for identifying significant differences in gene expression.
[0004] Although all cells in the human body contain the same genetic material, the same genes are not active in all of those cells. Alterations in gene expression patterns or in a DNA sequence can have profound effects on biological functions. These variations in gene expression are at the core of altered physiologic and pathologic processes. In the past, determinations of differential gene expression only focused on a few genes at a time. DNA microarrays, devices that consist of thousands of immobilized DNA sequences present on a miniaturized surface, have revolutionized the study of gene expression and are now a staple of biological inquiry into gene expression and genetic variations. Arrays are used to analyze a sample for genotyping or for patterns of gene expression. Using the microarray, it is possible to observe the expression level changes in tens of thousands of genes over multiple conditions, all in a single experiment. Depending on the conditions assayed, differentially-expressed genes may be implicated in cancer, aging, or a metabolic pathway of interest.
[0005] Generally, microarrays are prepared by binding DNA sequences to a surface such as a nylon membrane or glass slide at precisely defined locations on a grid. Using an alternate method, some arrays are produced using laser lithographic processes and are referred to as biochips or gene chips. For genotyping analysis, the sample is genomic DNA. For expression analysis, the sample is cDNA, DNA copies of mRNA. The DNA samples are tagged with a radioactive or fluorescent label and applied to the array. Single stranded DNA will bind to a complementary strand of DNA. At positions on the array where the immobilized DNA recognizes a complementary DNA in the sample, binding or hybridization occurs. The labeled sample DNA marks the exact positions on the array where binding occurs, allowing automatic detection. The output consists of a list of hybridization events, indicating the presence or the relative abundance of specific DNA sequences that are present in the sample. DNA array technology provides a method for rapid genotyping, facilitating the diagnosis of diseases for which a gene mutation has been identified as well as for diseases for which known gene expression biomarkers of a pathologic state, or signature genes, exist.
[0006] A crucial step in the analysis of expression data is determining which genes are expressed differently between two cell populations. Usually, a gene is said to be “differentially-expressed” if its ratio of expression level in one population to expression level in a second population exceeds a certain threshold. This threshold is set based on the observation that in control experiments where the two cell populations are identical, few if any genes have expression ratios exceeding the threshold. However, it is common knowledge that this approach is imprecise, because the uncertainty in the expression ratio is greater for genes that are expressed at low levels than for those that are highly expressed. More sensitive methods have been employed in a few cases, but development of a general, formal statistical test for identifying differentially-expressed genes has remained an open problem.
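The weakness of a fixed ratio threshold can be illustrated with a small simulation. The sketch below is only an illustration: it assumes an observed signal of the form mean + mean × (multiplicative noise) + (additive noise) with normally distributed noise, and the noise levels, expression levels and two-fold threshold are hypothetical values chosen for the example, not values from this application.

```python
import random

def control_exceedance(mu, sigma_eps, sigma_delta, threshold=2.0, n=20000, seed=0):
    """Fraction of control ratios x/y that pass a fold-change threshold.

    x and y are simulated with the SAME true level mu, so every
    exceedance counted here is a false positive.
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n):
        x = mu + mu * rng.gauss(0, sigma_eps) + rng.gauss(0, sigma_delta)
        y = mu + mu * rng.gauss(0, sigma_eps) + rng.gauss(0, sigma_delta)
        # Non-positive readings cannot form a meaningful ratio; count them too.
        if x <= 0 or y <= 0 or x / y > threshold or y / x > threshold:
            exceed += 1
    return exceed / n

# Hypothetical noise levels: 10% multiplicative error, additive SD of 20 units.
low = control_exceedance(mu=50, sigma_eps=0.1, sigma_delta=20)
high = control_exceedance(mu=5000, sigma_eps=0.1, sigma_delta=20)
print(f"false positives at low expression:  {low:.3f}")
print(f"false positives at high expression: {high:.3f}")
```

Under these assumptions the same threshold passes many more identical-population genes at the low expression level than at the high one, which is the imprecision the paragraph above describes.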
[0007] Thus, there exists a need for a mathematical model of the variability observed over repeated observations of intensities for biomolecules represented on an array. The present invention satisfies this need and provides related advantages as well.
[0008] The invention relates to a method of determining a true signal of an analyte, comprising (a) measuring an observed signal x for one or more analytes, and (b) determining a mean signal (μ) and a system parameter (β) for said analyte that produce enhanced values for a probability likelihood of said observed signal, said observed signal being related to said mean signal by an additive error (δ) and a multiplicative error (ε), wherein said system parameter specifies properties of said additive error (δ) and said multiplicative error (ε).
[0012] The invention provides a method of determining relative amounts of an analyte between samples. The invention also provides a method of determining the true signal of an analyte. The method of the invention accounts for multiplicative and additive errors influencing the observed signals for an analyte and estimates system parameters based on the observed signals using maximum likelihood estimation. By presenting an error model and associated significance test, the methods of the invention provide a substantial improvement over current thresholding schemes. One advantage of the error model is that the system parameters inherently specify the properties of both the additive and multiplicative error terms. The method of the invention further provides for the performance of a generalized likelihood ratio test for each analyte to determine whether the amounts are relatively different.
[0013] In one embodiment, the method of the invention provides a refined test for comparison of differentially expressed genes that does not rely on gene expression ratios, but directly compares a series of repeated measurements of two observed intensities for each gene. In this regard, the method of the invention utilizes an error model and an associated significance test to determine whether the observed amounts of genes are significantly different between the two or more conditions being compared.
[0014] As used herein, the term “analyte” refers to a molecule whose presence is measured. An analyte molecule can be essentially any molecule for which a detectable probe or assay exists or can be produced by one skilled in the art. For example, an analyte can be a macromolecule such as a nucleic acid, polypeptide or carbohydrate, or a small organic compound. Measurement can be quantitative or qualitative. An analyte can be part of a sample that contains other components or can be the sole or major component of the sample. Therefore, an analyte can be a component of a whole cell or tissue, a cell or tissue extract, a fractionated lysate thereof or a substantially purified molecule. Moreover, an analyte can incorporate a second molecule, for example, a detectable moiety such as a dye, radiolabel, heavy atom label, or other mass label, a fluorochrome, a ferromagnetic substance, a luminescent tag or a detectable binding agent such as biotin. The analyte can be attached in solution or solid-phase, including, for example, to a solid surface such as a chip, microarray or bead.
[0015] As used herein, the term “sample” refers to the substance containing the analyte. It can be heterogeneous or homogeneous. Examples of heterogeneous samples include tissues, cells, lysates and fractionated portions thereof. Homogeneous samples include, for example, isolated populations of polypeptides, nucleic acids or carbohydrates. A sample can also be a purified analyte, free from like or non-like molecules. All of such substances are included within the meaning of the term so long as the substance contains the analyte. In addition to containing the analyte, a sample further can contain one or more additional components such as a buffer, detectable moiety, nucleic acids, polypeptides, carbohydrates or any other substance or molecule.
[0016] As used herein, the term “signal” is intended to mean a detectable, physical quantity or impulse by which information on the presence of an analyte can be determined. Therefore, a signal is the read-out or measurable component of detection. A signal includes, for example, fluorescence, luminescence, colorimetric, density, image, sound, voltage, current, magnetic field and mass. Therefore, the term “observed signal,” as used herein is intended to mean the actual quantity detected of the measured analyte in a particular detection system. An observed signal can include subtraction of non-specific noise. An observed signal can also include, for example, treatment of the measured quantity by routine data analysis and statistical procedures which allow meaningful comparison and analysis of the observed values. Such procedures include, for example, normalization for direct comparison of values having different scales, and filtering for removal of aberrant or artifactual values. A “mean signal” as used herein, refers to the true or inherent quantity of the measured analyte. A mean signal therefore corresponds to the detectable quantity of the analyte independent of variation in the assay or detection system.
[0017] As used herein, the term “sample pairs” refers to two samples containing analytes to be compared. The analytes to be compared within the two samples can be different, or they can be substantially the same species of analyte but subjected to distinct conditions or obtained from distinct sources. Therefore, the term “mean signal pairs,” as used herein, refers to the two true signals, one per analyte, associated with a sample pair. Similarly, when more than two analytes are being compared, the terms “sample sets” and “mean signal sets” are intended by analogy to reference the multiple samples containing the analytes and the corresponding multiple true signals, respectively.
[0018] As used herein, the term “system parameter” refers to the properties of the noise of the system, such as non-analyte, non-specific background signals. Therefore, the system parameter, designated β, is a measure of the error of the system and corresponds to undesirable or interfering signals that distort the true signal.
[0019] As used herein, the term “significantly unequal” refers to two analytes that have a meaningful difference in signal. Therefore, significantly unequal signals refers to two or more signals whose difference is caused by something other than chance, including variation or error in the system.
[0020] The invention provides a method of determining a true signal of an analyte. The method consists of measuring an observed signal x for one or more analytes and determining a mean signal (μ) and a system parameter (β) for the analyte that produce enhanced values for the probability likelihood of the observed signal, which is related to the mean signal by an additive error (δ) and a multiplicative error (ε), where the system parameter specifies the properties of the additive error (δ) and of the multiplicative error (ε).
[0021] The invention further provides a method of determining relative amounts of an analyte between samples. The method consists of measuring observed signals x and y for an analyte within two or more sample pairs, determining a mean signal pair per analyte (μ) and a system parameter (β) for each sample pair, that produce enhanced values for the probability likelihood of the observed signals, which are related to the mean signal by an additive error (δ) and a multiplicative error (ε), where the system parameter specifies the properties of the additive error (δ) and the multiplicative error (ε).
[0022] The methods of the invention permit determination of the mean signal, which is the true amount of an analyte, by taking into account both multiplicative and additive error contributions to each observed signal. The methods of the invention further allow accurate determination of relative amounts of an analyte between samples. A maximum-likelihood approach is used to fit the model to observed signals of the analyte. The method of the invention can be used to monitor error introduced by intrinsic or extrinsic factors, to monitor total amount of error over time as well as to isolate or identify particular samples that have a higher error than normally observed. Therefore, the methods of the invention can be used to detect error introduced during any step in the analyte preparation and measurement. Additionally, the methods of the invention can be used, for example, to detect total error of the system or to separate and dissect biological or other intrinsic sample error from assay and procedure error. Thus, the methods of the invention allow quantitative analysis of the mean or true amount of an analyte at any given end point in a procedure as well as allow dissection of the system or procedure to quantitatively determine either or both intrinsic or extrinsic error introduced at any given step of the procedure.
[0023] Likelihood methods use statistical data and probability models to provide optimal use of statistical information. Because likelihood methods provide a specific description of the pattern of variation in data, these methods can be used for estimation and hypothesis testing, which is a formal process of using data to make statistically meaningful decisions such as whether relative amounts of analyte are significantly different between samples. Therefore, the methods of the invention determine, by formal estimation procedures, the mean signal of an analyte or a comparison of mean signals to provide the relative levels of the corresponding analyte. The comparison of mean signals can be for the same analyte subjected to two or more different conditions, different analytes under the same conditions or any combination thereof.
[0024] For comparison of two signals, the maximum-likelihood approach provided by the invention has several advantages over currently accepted ratio-based significance tests. In the ratio-based method, the expression ratio for the two signals to be compared is computed and compared to a control or reference ratio. For example, where the relative level of an analyte is to be compared under two different conditions, the ratio r
[0025] The methods of the invention are generally applicable to measure any analyte that serves as a sample or is contained in a sample so as to allow for detection of the presence of the analyte. As will be described in further detail below, detection of the analyte signal can be by any means as long as the observed signal allows for determination of a mean.
[0026] Once a signal indicating the presence of an analyte has been observed, the methods of the invention can be used to determine the true or mean signal of the analyte. The true signal of an analyte is independent of experimental variation or error introduced prior to or during detection of the observed signal. Removal of such error in a signal allows for more accurate quantitation of an analyte and reproducibility of measurements. Therefore, the true or mean signal of an analyte is a measurement of the true or actual level of that analyte. Moreover, through determination of the true signal, the methods of the invention can measure the reproducibility of steps in a process such as, for example, manipulations prior to the determination of the observed signal.
[0027] The methods of the invention are applicable to the measurement of analytes and determination of true signals in both biological and non-biological settings. For example, in a biological setting, experimental error can be classified into at least two categories. Biological error is one such category and consists, for example, of intrinsic error introduced by the biological components. In this regard, regulation at both the gene expression and protein activity levels can be substantially altered due to apparent negligible experimental differences in the treatment of a biological sample. A specific example is where gene expression changes due to the use of different batches of the same media during the course of an experiment. Such biological error produces measurable differences in the level of an analyte such as an expressed gene.
[0028] Another category is the extrinsic error introduced through experimental manipulation. For example, differences in sample preparation, analyte or probe labeling efficiency, hybridization or binding conditions, synthesis of probes, batches of solid-phase substrate and detection efficiency introduce variations in the determination of a measured analyte, even though all components and processes can be controlled so as to result in apparent negligible differences. Nevertheless, measurable differences in observed analyte signal occur due to the introduction of such error.
[0029] Similarly, for non-biological settings the methods of the invention are applicable for determination of true signals from measured analytes in essentially any process or steps thereof for which a quantitative determination or comparison of a measurable component is desired.
[0030] The exemplary forms of error described above, as well as other forms of error, all affect the perceived amount of a measured analyte through the introduction of fluctuations in the observed signal. Assessing the true signal of the analyte, independent of such fluctuations, allows direct comparison of analyte levels. Moreover, because the true signal of an analyte measurement can be determined, the methods of the invention provide a means for a direct or standardized comparison of analyte measurements both within an experimental system and between different systems. Given the teachings and guidance provided herein, essentially any analysis format known in the art can be used for such subsequent comparison of analytes once the true or mean signals are obtained. Therefore, the methods of the invention can be used to accurately and reproducibly determine the true signal of essentially any measurable analyte, and can also serve as the initial step in, for example, a comparative analysis of the same analyte under different conditions, the same analyte under repetitive conditions or different analytes under the same conditions.
[0031] As will be described further below, it is understood that the methods of the invention are equally applicable to both large and small sets of analyte samples and sets of measurements. Determination of the true signal for an individual sample is performed in the same manner as for many samples, including hundreds or thousands of samples. Similarly, the comparison of true signals for determination of relative amounts of an analyte between samples is performed for two samples as it is for comparison of many sample pairs or higher order sets of multiple comparisons. Therefore, given the teachings and guidance provided herein, the number of true signals that can be simultaneously determined, or of sets of samples that can be simultaneously compared for relative amounts of true signal, is limited only by the available computational power.
[0032] The methods of the invention for determining the true signal of an analyte can be applied to a variety of situations. For example, repeated measurements of the observed signal such as intensity x for one or more analytes can be obtained and subsequently used in the method of the invention to characterize the error and determine the significance value for each observed signal. For example, repeated observations of the signal associated with a single analyte such as, for example, the observed intensity of a single gene in a microarray, can be utilized in the methods of the invention to monitor, for example, the variation introduced by two or more distinct conditions, the total error introduced over a given time or sporadic error introduced by any means including variation caused at any step in the protocol.
[0033] The method of the invention provides a description of the relationship between an observed signal and a mean signal. The relationship specifies that the observed signal can be described as containing both an additive error term and a multiplicative error term. The error terms are a measure of variation in the observed signal. Parameters of the additive error term and the multiplicative error term set forth the characteristics or features of the error terms. These parameters are derived from statistical relationships well known in the art. Therefore, the error terms, and the parameters defining them, specify the noise of the analyzed system. Knowing the components and relationship of the noise with reference to the mean or true signal allows determination of the true signal given an empirically measured signal.
[0034] The inclusion of both an additive error term and a multiplicative error term in the described relationship permits distinction of the true signal from the noise at a wide range of observed signals. For example, with a high observed signal, or high observed signal relative to the noise, the system noise can be primarily described by the multiplicative error term. Therefore, the true signal can be accurately distinguished from the noise by employing only a multiplicative error term in the method of the invention. In contrast, where the observed signal is low, or low relative to the noise, the influence of the additive error in describing the noise becomes substantially more prominent. Maintaining this error term in the described relationship at low observed signals enhances the accuracy in distinguishing the true signal from the noise. Similarly, at intermediate observed signal ranges, both the additive and multiplicative error terms substantially influence the description of the noise and inclusion of both will yield enhanced results in distinguishing the true signal from the noise using the described relationship in the method of the invention. Therefore, including both the additive and multiplicative error terms in the description of the relationship between the observed signal and the true signal results in more accurate and predictable performance of the method of the invention at all ranges of observed signal.
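The shift from additive to multiplicative dominance can be made concrete with a short calculation. This sketch assumes the observed signal has the form x = μ + μ·ε + δ with independent, zero-mean errors, so that its variance is μ²σ_ε² + σ_δ²; the function name and the numerical noise levels are hypothetical and chosen only for illustration.

```python
def variance_shares(mu, sigma_eps, sigma_delta):
    """Split the variance of x = mu + mu*eps + delta into its two sources.

    Assumes eps and delta are independent with standard deviations
    sigma_eps and sigma_delta. Returns (multiplicative, additive) shares.
    """
    mult_var = (mu * sigma_eps) ** 2   # grows with the square of the true signal
    add_var = sigma_delta ** 2         # constant across signal levels
    total = mult_var + add_var
    return mult_var / total, add_var / total

# Hypothetical noise levels: 10% multiplicative error, additive SD of 20 units.
for mu in (10, 200, 10000):
    mult, add = variance_shares(mu, sigma_eps=0.1, sigma_delta=20.0)
    print(f"mu={mu:>6}: multiplicative {mult:.1%}, additive {add:.1%}")
```

With these illustrative values the additive term accounts for nearly all of the variance at the lowest signal, the two terms contribute equally at the intermediate signal, and the multiplicative term dominates at the highest signal.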
[0035] However, utilization of both the additive and multiplicative error terms in the methods of the invention is not always necessary. As described above, if the user knows or can determine that the observed signal is high relative to the limits of detection or relative to the noise, then determination of the true signal can be accurately made by inclusion of only the multiplicative error term. In such circumstances, the additive variation will be small or negligible compared to the observed signal, and the described relationship can be applied with one or more of the error term parameters, such as the standard deviation of the additive error term, set to zero. Similarly, where the signal is low but the variation is also known, or can be determined, to be small, the additive error term also can be omitted without substantial effect on determination of the true signal. Determination of the true signal also can be accurately made by inclusion of only the additive error term. For example, applying only the additive error term in the described relationship can be useful for measuring the error in the variation of the background of a system. Given the teachings and guidance described herein, those skilled in the art will know, or can determine, whether determination of a true signal can be made, or is desirable, utilizing both the additive and multiplicative terms in the described relationship employed in the method of the invention.
[0036] For each analyte, the method of the invention provides a relationship between the observed signal and the mean signal which can be described as follows:
[0037] where each measurement j equals 1 through M and each analyte i equals 1 through N, and where x
[0038] For determining the true signal of an analyte, where the observed signal x
[0039] Modifications can be incorporated into the general description of the relationship between the observed signal and the mean signal set forth above and below which do not alter the relationship of the additive or multiplicative error terms with respect to the true signal or their properties in specifying the structure of the noise. Such modifications are exemplified with reference to the description specifying the relationship between the observed signal and the true signal set forth above, but are similarly applicable to the description specifying the relationship between observed and true signals for comparison of two or more signals. The modifications can include, for example, inclusion of functions, augmentations or addition of terms, simplification or removal of terms and transformation of variables. Depending on the origin of the signal data or the desired use, one or more of such modifications can be employed to generate alternative forms of the described relationship appropriate for application to a wide variety of data sets. These modifications as well as others are well known to those skilled in the art and are applicable in the method of the invention.
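One way to write a relationship of the kind described, with a mean signal perturbed by one multiplicative and one additive error, is sketched below. The specific functional form, the normality of the errors and the symbols σ_ε and σ_δ are assumptions made for illustration; they are consistent with the verbal description but are not quoted from it.

```latex
x_{ij} = \mu_i + \mu_i\,\varepsilon_{ij} + \delta_{ij},
\qquad i = 1,\dots,N, \quad j = 1,\dots,M,
\\[4pt]
\varepsilon_{ij} \sim \mathcal{N}\!\left(0,\ \sigma_\varepsilon^{2}\right),
\qquad
\delta_{ij} \sim \mathcal{N}\!\left(0,\ \sigma_\delta^{2}\right),
\qquad
\beta = (\sigma_\varepsilon,\ \sigma_\delta),
\\[4pt]
\operatorname{Var}(x_{ij}) = \mu_i^{2}\,\sigma_\varepsilon^{2} + \sigma_\delta^{2}
\quad \text{(for independent } \varepsilon_{ij} \text{ and } \delta_{ij}\text{)}.
```

Under this form, setting σ_δ to zero leaves a purely multiplicative model and setting σ_ε to zero leaves a purely additive one, which corresponds to the removal of terms discussed in this section.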
[0040] For example, the description specifying the relationship between the observed and true signal can be modified by inclusion of a function such as f(σ
[0041] The description specifying the relationship between the observed and true signal also can be modified by augmentation. For example, terms can be added which include constants, second order or even higher order terms which do not alter the relationship of the additive or multiplicative error terms with respect to the true signal or their properties in specifying the structure of the noise. A specific example of the addition of a constant is x
[0042] Simplification or removal of terms has been described above, such as when there is a negligible amount of error. Removal of the corresponding error term can increase the accuracy of determining the remaining parameters and therefore the accuracy of determining the true signal. A specific example of a simplification modification where the additive error has been removed is x
[0043] Transformation of variables is yet another modification which can be performed that does not alter the relationship of the additive or multiplicative error terms with respect to the true signal or their properties in specifying the structure of the noise. For example, because some signal measurements can be distributed over a large range of values, including many orders of magnitude, it can be useful to transform the raw signal measurements into logarithms. For this transformation, the variables x
[0044] The methods of the invention employ the above error model to determine, by formal estimation, the mean signal of an analyte from a set of measurements of an observed signal by using a maximum likelihood approach. To estimate the mean signal, the observed signal should be measured at least twice (j=2), obtaining two separate values and allowing for a more accurate computation of the system parameter and mean signal. However, a larger number of analyte measurements, where j is greater than 2, results in further refinements of true signal determination. For example, as shown in Example I, increasing the number of measurements from two to four per analyte results in beneficial enhancements in true signal determination. Therefore, a particular analyte can be measured a few or many times, including, for example, about 2, 3, 4, 5, 10, 20, 50, 100 or more times. Although as few as two measurements are sufficient to accurately determine the true signal of an analyte, the actual number of measurements will vary depending on the need and confidence requirement of the user. For example, the confidence in true signal determination can be increased in analyte samples exhibiting inherently greater variation by compensating for the greater experimental error through increasing the number of sample measurements. Sample measurements can be derived, for example, from independent samples, replicates of the same sample that are independently measured, repeated measurements of the same sample or any combination thereof.
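A minimal sketch of maximum likelihood estimation of a mean signal from repeated measurements follows. It assumes normally distributed, independent additive and multiplicative errors, treats the system parameter (here the two standard deviations `sigma_eps` and `sigma_delta`) as already known, and uses a golden-section search as a stand-in for whatever optimizer is actually employed. All names and numbers are hypothetical.

```python
import math

def neg_log_lik(mu, xs, sigma_eps, sigma_delta):
    """Negative log-likelihood of measurements xs for a candidate mean mu,
    assuming x ~ Normal(mu, mu^2 * sigma_eps^2 + sigma_delta^2)."""
    var = (mu * sigma_eps) ** 2 + sigma_delta ** 2
    return sum(0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)
               for x in xs)

def golden_section_min(f, lo, hi, tol=1e-6):
    """Locate a local minimum of f on [lo, hi] by golden-section search."""
    r = (math.sqrt(5) - 1) / 2
    c, d = hi - r * (hi - lo), lo + r * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:                      # minimum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - r * (hi - lo)
            fc = f(c)
        else:                            # minimum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + r * (hi - lo)
            fd = f(d)
    return (lo + hi) / 2

def estimate_mean_signal(xs, sigma_eps, sigma_delta):
    """Maximum likelihood estimate of mu; the bracket assumes positive data."""
    lo, hi = 0.5 * min(xs), 2.0 * max(xs)
    return golden_section_min(
        lambda m: neg_log_lik(m, xs, sigma_eps, sigma_delta), lo, hi)

# Hypothetical replicate intensities for one analyte.
two = [480.0, 530.0]
four = [480.0, 530.0, 510.0, 495.0]
mu2 = estimate_mean_signal(two, sigma_eps=0.1, sigma_delta=20.0)
mu4 = estimate_mean_signal(four, sigma_eps=0.1, sigma_delta=20.0)
print(f"estimate from 2 measurements: {mu2:.1f}")
print(f"estimate from 4 measurements: {mu4:.1f}")
```

Because the variance depends on the mean in this model, the estimate generally differs slightly from the plain arithmetic average of the measurements.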
[0045] Once the signal has been measured for one or more analytes, the observed signals can be subjected to a variety of statistical methods well known in the art to prepare the raw data for maximum likelihood analysis. Such methods include, for example, standardization and filtering techniques. Briefly, non-specific background can be subtracted to produce, for example, the observed signal x′. Moreover, depending on the need, the data measurements can be, for example, normalized to have comparable medians, and extreme signals within a set of multiple measurements that are artifactually outside the signal range of their partners can be removed. Such modified values for the observed signal are similarly applicable in the methods of the invention for determining the true signal of an analyte. Therefore, the error model of the invention additionally accounts for the influence of multiplicative and additive errors on the observed signals and provides a relationship between an observed signal x′, and the corresponding mean or true signal.
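The preparation steps named above (background subtraction, normalization to comparable medians, removal of extreme values) might look like the following sketch. The helper names, the scaling rule and the three-fold outlier cutoff are hypothetical choices for illustration; the text does not prescribe particular rules.

```python
from statistics import median

def subtract_background(signals, background):
    """Subtract a per-spot non-specific background to obtain x'."""
    return [s - b for s, b in zip(signals, background)]

def normalize_to_common_median(replicates):
    """Rescale each replicate so that all replicates share one median.

    The common target is the average of the individual medians
    (an illustrative choice); medians are assumed to be positive.
    """
    medians = [median(r) for r in replicates]
    target = sum(medians) / len(medians)
    return [[v * target / m for v in r] for r, m in zip(replicates, medians)]

def drop_outliers(values, max_fold=3.0):
    """Remove values more than max_fold away from the median of the set.

    The fold cutoff is a hypothetical filtering rule.
    """
    center = median(values)
    return [v for v in values if center / max_fold <= v <= center * max_fold]

# Two hypothetical scans of the same four spots, with per-spot backgrounds.
scan_a = subtract_background([120.0, 560.0, 1040.0, 90.0], [20.0, 60.0, 40.0, 10.0])
scan_b = subtract_background([230.0, 1060.0, 2040.0, 170.0], [30.0, 60.0, 40.0, 10.0])
norm_a, norm_b = normalize_to_common_median([scan_a, scan_b])

# Repeated measurements of one spot, one of which is artifactually high.
kept = drop_outliers([510.0, 495.0, 505.0, 4000.0])
print(norm_a, norm_b, kept)
```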
[0046] As will be described further below in the context of comparing relative differences between two or more true signals, once obtained for any particular set of analyte measurements, the observed signal x or x′ is analyzed by, for example, maximum likelihood probability for determination of its mean signal. In addition to a maximum likelihood approach, other approaches are known in the art to determine, by formal estimation, the mean signal from a set of observed measurements, including, for example, Quasi-Maximum Likelihood and Generalized Method of Moments.
[0047] In addition to determining the true signal of an analyte, the methods of the invention also can be utilized to determine relative amounts of an analyte between samples. Briefly, following the methods described above for determination of a true signal for an individual analyte, for comparison of relative amounts of two or more analytes, observed signals are measured for each analyte and the corresponding true signals determined by probability likelihood analysis. The resultant true signals are then formally assessed by, for example, a difference indicator to determine relative levels. In this embodiment, for example, the methods of the invention identify true signals that are significantly unequal, thus representing different amounts of analytes between the compared samples.
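One possible difference indicator is a generalized likelihood ratio test, sketched below under simplifying assumptions that are not taken from the text: errors are normal, the x and y errors are uncorrelated, the noise standard deviations are known and shared by both samples, and the statistic is referred to a chi-square distribution with one degree of freedom. The function names and data are hypothetical.

```python
import math

def nll(mu, values, sigma_eps, sigma_delta):
    """Negative log-likelihood, assuming v ~ Normal(mu, mu^2*s_eps^2 + s_delta^2)."""
    var = (mu * sigma_eps) ** 2 + sigma_delta ** 2
    return sum(0.5 * math.log(2 * math.pi * var) + (v - mu) ** 2 / (2 * var)
               for v in values)

def minimize_1d(f, lo, hi, tol=1e-6):
    """Golden-section search; returns the minimized value of f on [lo, hi]."""
    r = (math.sqrt(5) - 1) / 2
    c, d = hi - r * (hi - lo), lo + r * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:
            hi, d, fd = d, c, fc
            c = hi - r * (hi - lo)
            fc = f(c)
        else:
            lo, c, fc = c, d, fd
            d = lo + r * (hi - lo)
            fd = f(d)
    return f((lo + hi) / 2)

def likelihood_ratio_test(xs, ys, sigma_eps, sigma_delta):
    """Test H0: mu_x == mu_y against H1: mu_x != mu_y for one analyte."""
    both = xs + ys
    lo, hi = 0.5 * min(both), 2.0 * max(both)   # bracket assumes positive data
    # Alternative hypothesis: each sample gets its own mean.
    free = (minimize_1d(lambda m: nll(m, xs, sigma_eps, sigma_delta), lo, hi)
            + minimize_1d(lambda m: nll(m, ys, sigma_eps, sigma_delta), lo, hi))
    # Null hypothesis: one mean shared by both samples.
    tied = minimize_1d(lambda m: nll(m, both, sigma_eps, sigma_delta), lo, hi)
    lam = max(0.0, 2.0 * (tied - free))
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lam / 2.0))
    return lam, p_value

xs = [500.0, 520.0, 480.0, 510.0]
ys_control = [495.0, 525.0, 485.0, 505.0]        # same level as xs
ys_changed = [1500.0, 1450.0, 1550.0, 1520.0]    # roughly three-fold higher

lam_c, p_c = likelihood_ratio_test(xs, ys_control, sigma_eps=0.1, sigma_delta=20.0)
lam_d, p_d = likelihood_ratio_test(xs, ys_changed, sigma_eps=0.1, sigma_delta=20.0)
print(f"control: lambda={lam_c:.3f}, p={p_c:.3f}")
print(f"changed: lambda={lam_d:.1f}, p={p_d:.2e}")
```

Unlike a ratio threshold, the statistic here depends on the spread expected at the observed signal level, so the same fold change can be significant at high signal and not significant near background.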
[0048] The methods of the invention allow relative comparison of true signals between two analytes or pairs as well as between multiple analytes or sets. As described previously, the analytes to be compared can be different, or they can be substantially the same species of analyte but subjected to distinct conditions or obtained from distinct sources. Briefly, samples harboring analytes to be compared are referred to herein as sample pairs or sets. True signals resulting from each observed analyte signal for a particular comparison are similarly referred to as mean signal pairs or mean signal sets. Similarly, the true signals being compared for substantially the same analyte species derived from different conditions or sources are referred to herein as mean signal pairs per analyte and mean signal sets per analyte.
[0049] By reference to comparison of two analytes, for the determination of relative amounts of an analyte between samples the observed signal and mean signal within a sample pair can be described by the following relationship:
[0050] where each measurement j equals 1 through M and each analyte i equals 1 through N; where x
[0051] The above described relationship between observed and mean signals for two analytes substantially parallels that described previously for an individual analyte. Therefore, this error model similarly provides the advantage of allowing multiplicative and additive errors to be independent of one another. Similarly, the above described error model can be applied by analogy to the determination of true signals for multiple analytes, including three or more analytes. For example, similar mean signal, multiplicative and additive error terms for analyte z can be described in a third equation. Higher order comparisons and error models can additionally be described using the teachings and guidance provided herein.
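A paired relationship that parallels the single-analyte description can be sketched as follows. The functional form, the bivariate normal errors, and the correlation parameters ρ_ε and ρ_δ are assumptions made for illustration rather than a quotation of the relationship referred to above.

```latex
x_{ij} = \mu_{x_i} + \mu_{x_i}\,\varepsilon_{x_{ij}} + \delta_{x_{ij}},
\qquad
y_{ij} = \mu_{y_i} + \mu_{y_i}\,\varepsilon_{y_{ij}} + \delta_{y_{ij}},
\qquad i = 1,\dots,N, \quad j = 1,\dots,M,
\\[4pt]
\begin{pmatrix} \varepsilon_{x_{ij}} \\ \varepsilon_{y_{ij}} \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix}
\sigma_{\varepsilon_x}^{2} & \rho_\varepsilon\,\sigma_{\varepsilon_x}\sigma_{\varepsilon_y} \\
\rho_\varepsilon\,\sigma_{\varepsilon_x}\sigma_{\varepsilon_y} & \sigma_{\varepsilon_y}^{2}
\end{pmatrix}\right),
\qquad
\begin{pmatrix} \delta_{x_{ij}} \\ \delta_{y_{ij}} \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix}
\sigma_{\delta_x}^{2} & \rho_\delta\,\sigma_{\delta_x}\sigma_{\delta_y} \\
\rho_\delta\,\sigma_{\delta_x}\sigma_{\delta_y} & \sigma_{\delta_y}^{2}
\end{pmatrix}\right),
\\[4pt]
\beta = \left(\sigma_{\varepsilon_x},\ \sigma_{\varepsilon_y},\ \rho_\varepsilon,\
\sigma_{\delta_x},\ \sigma_{\delta_y},\ \rho_\delta\right).
```

In this sketch a third sample z would add a third equation of the same shape and enlarge each covariance matrix accordingly.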
[0052] For determining the true signal of an analyte pair, where, for example, the observed signals xij and y are described by a bivariate distribution with the parameters μ
[0053] To determine, by formal estimation, the mean signal pairs of a sample pair, the observed signals x and y should be measured at least twice as described previously. Once the signals have been measured for analytes within one or more sample pairs, the raw data can be prepared for maximum likelihood analysis to produce, for example, two signals x′ and y′. For analysis of more than two analytes within a sample pair, standardization and filtering methods can similarly be used to produce, for example, signals z′ and the like for sample sets. These methods and others well known in the art for processing raw data into useful statistical form are particularly appropriate when analyzing multiple observed signals of sample pairs and sets in order to provide meaningful comparisons by, for example, normalization of divergent scales for the initially measured signals. Such modified values for the observed signals are similarly applicable in the methods of the invention for determining mean signal pairs, mean signal pairs of an analyte and mean signal sets. Therefore, the error model of the invention additionally accounts for the influence of multiplicative and additive errors on the observed signals and provides a relationship between observed signals x′, y′, z′ and higher numbers of like comparisons, and the corresponding true signals.
[0054] For any of the error models described above, once an observed signal, or observed signals within a sample pair or sample set, are obtained, the mean signal (μ) and the system parameter (β) can be determined or selected by, for example, a non-linear optimization algorithm. Such statistical optimization procedures are well known in the art and can be applied to, for example, individual observed analyte signals, observed signals for a single sample pair and observed signals for two or more sample pairs or sets, including, for example, hundreds, thousands or ten thousand or more signals. The number of optimizations that can be performed is coextensive with the number of analyte signals or higher order sets that can be measured and the computing power available in the art.
[0055] Similarly, and in addition to non-linear optimization algorithms, any general optimization procedure for non-linear equations can be used to determine or select the mean signal pair (μ) and a system parameter (β) for each sample pair including, for example, Gradient Descent, Newton-Raphson and Simulated Annealing. For example, the Gradient Descent method is based upon selecting, at each iterative step, the direction in multidimensional space for which the objective function initially changes at the fastest rate, and subsequently choosing an appropriate distance to move in this direction at that iterative step. The Newton-Raphson method is based on a linear approximation to the first-order conditions, which may be numerically estimated, that set to zero the partial derivatives of the objective function with respect to the parameters being estimated. The Simulated Annealing method is based upon making random changes, which become smaller throughout the iterations, in the parameters being estimated and subsequently deciding probabilistically whether or not to keep these changes, thereby seeking an optimum while maintaining the ability to escape from a suboptimal local optimum in order to seek a better solution.
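As an illustration only, the three optimization procedures named above can be sketched on a one-dimensional problem. The objective function, step sizes, iteration counts and cooling schedule below are hypothetical choices made for this sketch and are not taken from the specification; in practice the objective would be the negative log-likelihood of the error model.

```python
import math
import random

def objective(t):
    # Hypothetical smooth objective with its minimum at t = 2.
    return (t - 2.0) ** 2 + 1.0

def gradient(t, h=1e-6):
    # Central finite-difference estimate of the first derivative.
    return (objective(t + h) - objective(t - h)) / (2 * h)

def gradient_descent(t, step=0.1, iters=200):
    # Move against the gradient (direction of fastest change) by a fixed step.
    for _ in range(iters):
        t -= step * gradient(t)
    return t

def newton_raphson(t, iters=20, h=1e-4):
    # Solve the first-order condition gradient(t) = 0 by repeated linearization,
    # using a numerically estimated second derivative.
    for _ in range(iters):
        curv = (objective(t + h) - 2 * objective(t) + objective(t - h)) / h ** 2
        if curv == 0:
            break
        t -= gradient(t) / curv
    return t

def simulated_annealing(t, iters=5000, seed=0):
    # Random changes shrink as the temperature falls; a worse candidate is
    # kept with probability exp(-increase / temperature).
    rng = random.Random(seed)
    best = t
    for k in range(iters):
        temp = 1.0 - k / iters + 1e-9
        cand = t + rng.gauss(0, temp)
        delta = objective(cand) - objective(t)
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            t = cand
        if objective(t) < objective(best):
            best = t
    return best

print(gradient_descent(10.0), newton_raphson(10.0), simulated_annealing(10.0))
```

On this convex toy objective all three routines approach the same minimizer; the ability of Simulated Annealing to leave a local optimum matters only for objectives with several optima.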
[0056] Further, the methods of the invention also allow the mean signal and system parameter to be provided based on previously determined or estimated values rather than calculated de novo. For example, in routine or familiar procedures, the user can have prior knowledge of beneficial or optimal estimates that can be used to calculate enhanced values for the probability likelihood or which provide more efficient convergence to a maximum probability likelihood. Therefore, the mean signal pair, including the mean signal pair per analyte, for example, (μ) and a system parameter (β) for each sample pair can be determined or provided and then subsequently compared. As will be described further below, comparison of mean signals, mean signal pairs and higher order sets can be performed, for example, by identification of significantly unequal mean signals using well known methods in the art such as statistical difference indicators.
[0057] In one embodiment, the mean signal and system parameters are estimated using maximum likelihood estimation. The maximum likelihood function provides, for example, a framework for the formal estimation process, while recognizing the structure of the random noise in the system. By modeling patterns of randomness, the maximum likelihood estimation process can better separate and estimate the signal. The method of the invention provides likelihood functions using estimates for the true parameters by utilizing standard optimization procedures as described herein. One advantage of the methods of the invention is that, if desired, the error terms can be independent of one another. Moreover, each mean signal within a mean signal pair or set also can be independent with respect to each other. These characteristics allow for the independent optimization of the system parameter and mean signal. Therefore, the efficiency of optimization can be significantly increased for a large number of analytes, for example, through the optimization of the system parameter and mean signals in subsets.
[0058] Briefly, observed values are measured and, subsequently, the system parameter (β) can be selected to enhance the probability likelihood given the observed signal. Similarly, for each analyte, mean signal pairs can be selected to enhance the probability likelihood given the system parameter (β). The mean signal pair and system parameter can be determined at the same time, or alternatively, the mean signal can be determined prior to the system parameter and then subsequently used to determine the system parameter. Conversely, the system parameter can be determined prior to the mean signal and then subsequently used to determine the mean signal. As described further in Example I, this procedure can be reiterated one or more times until the mean signal pair per analyte (μ) and a system parameter (β) converge. With each selection of values and reiteration of the optimization procedure, the calculated mean is enhanced in the direction of the true signal for that analyte, pair or set. In addition to maximum likelihood estimation, probability likelihood values for system parameters and mean signal can be estimated using other modeling techniques known in the art including, for example, Quasi-Maximum Likelihood and Generalized Method of Moments.
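The alternating procedure described above can be sketched as follows. This is a simplified, single-channel illustration and not the bivariate model of the invention: it assumes each observation is x = μ(1 + ε) + δ with independent normal errors, so that the variance is μ²σ_ε² + σ_δ², and it takes β = (σ_ε, σ_δ). The function names, starting values, search ranges, simulated data and the use of a golden-section search are all assumptions made for this sketch.

```python
import math
import random

def nll_gene(xs, mu, s_eps, s_del):
    # Negative log-likelihood (constants dropped) of one analyte's replicates.
    var = (mu * s_eps) ** 2 + s_del ** 2
    return sum(0.5 * math.log(var) + (x - mu) ** 2 / (2 * var) for x in xs)

def total_nll(data, mus, s_eps, s_del):
    return sum(nll_gene(xs, mu, s_eps, s_del) for xs, mu in zip(data, mus))

def golden_min(f, lo, hi, iters=60):
    # Golden-section search for a local minimum of f on [lo, hi].
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    for _ in range(iters):
        if f(c) < f(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return (a + b) / 2

def improve(f, current, lo, hi):
    # Keep the candidate only if it lowers the objective.
    cand = golden_min(f, lo, hi)
    return cand if f(cand) < f(current) else current

def fit(data, rounds=20):
    mus = [sum(xs) / len(xs) for xs in data]  # start at the sample means
    s_eps, s_del = 0.5, 1.0                   # hypothetical starting values
    d_hi = max(max(xs) - min(xs) for xs in data) + 1.0
    history = [total_nll(data, mus, s_eps, s_del)]
    for _ in range(rounds):
        # Select the system parameter given the current mean signals.
        s_eps = improve(lambda s: total_nll(data, mus, s, s_del), s_eps, 1e-3, 2.0)
        s_del = improve(lambda s: total_nll(data, mus, s_eps, s), s_del, 1e-3, d_hi)
        # Select each mean signal given the current system parameter.
        for i, xs in enumerate(data):
            pad = max(xs) - min(xs) + 1e-6
            mus[i] = improve(lambda m: nll_gene(xs, m, s_eps, s_del),
                             mus[i], min(xs) - pad, max(xs) + pad)
        history.append(total_nll(data, mus, s_eps, s_del))
    return mus, s_eps, s_del, history

def simulate(true_mus, s_eps, s_del, reps, seed=1):
    # Synthetic replicates drawn from the assumed error model.
    rng = random.Random(seed)
    return [[mu * (1 + rng.gauss(0, s_eps)) + rng.gauss(0, s_del)
             for _ in range(reps)] for mu in true_mus]

true_mus = [10 * 1.2 ** k for k in range(30)]
data = simulate(true_mus, s_eps=0.1, s_del=2.0, reps=6)
mus, s_eps_hat, s_del_hat, history = fit(data)
print(round(s_eps_hat, 3), round(s_del_hat, 3), round(history[-1], 2))
```

Because each analyte's mean enters only its own likelihood term once β is fixed, the inner loop optimizes the means one at a time, which is the kind of subset optimization that independence of the parameters permits. A fixed number of rounds stands in here for a convergence check.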
[0059] For comparison of the relative levels of two or more true signals, after the system parameter and mean signal have been determined, the methods of the invention provide for identification of mean signal pairs that are significantly unequal, representing different amounts of analytes between the compared samples. The error models and methods of the invention take into account the observation that x and y variances and x-y correlation increase with increasing values of x and y. Based on these empirical observations, the methods of the invention utilize a likelihood ratio test to identify analytes whose true signals μ
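A likelihood ratio test of this kind can be sketched under simplifying assumptions. The sketch treats the two channels as independent (the x-y correlation noted above is ignored), takes the system parameter as already known, uses the single-channel variance μ²σ_ε² + σ_δ², finds each maximizing mean by a grid search over the range of the data, and refers the statistic to a chi-square distribution with one degree of freedom, which is an asymptotic approximation. All names and numerical values are hypothetical.

```python
import math

def nll(samples, mu, s_eps, s_del):
    # Negative log-likelihood (constants dropped) with intensity-dependent variance.
    var = (mu * s_eps) ** 2 + s_del ** 2
    return sum(0.5 * math.log(var) + (v - mu) ** 2 / (2 * var) for v in samples)

def min_nll(samples, s_eps, s_del, points=2001):
    # Grid search for the mean over the observed range (an assumption of this sketch).
    lo, hi = min(samples), max(samples)
    grid = (lo + (hi - lo) * k / (points - 1) for k in range(points))
    return min(nll(samples, mu, s_eps, s_del) for mu in grid)

def likelihood_ratio_test(xs, ys, s_eps, s_del):
    # Alternative: separate means for x and y. Null: one common mean.
    alt = min_nll(xs, s_eps, s_del) + min_nll(ys, s_eps, s_del)
    null = min_nll(list(xs) + list(ys), s_eps, s_del)
    lam = max(0.0, 2.0 * (null - alt))
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(lam / 2.0))
    return lam, p_value

lam_same, p_same = likelihood_ratio_test([100, 110, 95, 105], [100, 110, 95, 105], 0.1, 5.0)
lam_diff, p_diff = likelihood_ratio_test([100, 110, 95, 105], [400, 380, 420, 410], 0.1, 5.0)
print(p_same, p_diff)
```

A small p-value marks the pair of mean signals as significantly unequal; identical replicate sets give a statistic near zero.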
[0060] The methods of the invention can be utilized for determining the true signal of an analyte or for comparing the relative levels of two or more true analyte signals in a variety of different formats and modified procedures. For example, observed signals for one or more analytes, sample pairs or sample sets can be measured independently, such as in series, or simultaneously, such as in parallel. Moreover, different observed signals can be measured, for example, from independent samples, the same sample or from independent samples that have been pooled to reduce the total number of samples which are to be manipulated. The number of different observed signals which can be measured from a single sample or pooled sample will depend, for example, on the number of unique detection labels which can be employed to uniquely measure each different analyte within the sample. Corresponding mean signals, mean pairs or mean sets can similarly be determined from the observed signals in series or parallel, for example. Additionally, the measurements of observed signals and determination of mean signals can be multiplexed with ongoing measurements and determinations proceeding simultaneously in series or parallel, such as in an automated system, for example.
[0061] Various modifications can be made to the procedure described above for determining or comparing true signals which enhance the description of the noise and, therefore, further increase the accuracy of distinguishing the true signal from the noise. For example, variation of a reference signal can be captured or incorporated into the analysis. In this specific example, two or more observed signals to be compared are first independently compared to a reference signal to determine, for example, the system parameters or mean signal pairs for each test-reference comparison. A probability likelihood can then be generated from the product of the terms for each initial test-reference comparison, to describe, for example, β
[0062] Therefore, the invention provides a method of determining relative amounts of an analyte between samples. The method consists of: (a) obtaining a reference signal; (b) obtaining observed signals x and y for an analyte within two or more sample pairs; (c) determining system parameters (β
[0063] The invention also provides a method of determining relative amounts of large numbers of analytes between samples. The method consists of: (a) obtaining observed signals x and y for a plurality of immobilized analytes within two or more sample pairs; (b) determining a mean signal pair per analyte (μ) and a system parameter (β) for each sample pair that provides a maximum probability likelihood of occurrence given the observed signals, the observed signals being related to the mean signal by an additive error (δ) and a multiplicative error (ε), where the system parameter specifies the properties of the additive error and the multiplicative error, and (c) identifying one or more mean signal pairs per analyte that are significantly unequal. The method is applicable, for example, to nucleic acid and polypeptide analytes using immobilized array formats.
[0064] The methods of the invention are applicable for determination or comparison of true signals in a wide variety of systems. Various detection methods for numerous analytes are well known to those skilled in the art. All that is needed to practice the methods of the invention are measurable quantities of an analyte in a data form that can be calculated as a mean.
[0065] In biological systems, for example, detection of a nucleic acid analyte can be by any of a variety of detection methods well known to those skilled in the art. Such methods include, for example, gels, blots, capillaries and microarray formats. In addition to nucleic acid microarrays or chips, the methods of the invention further can be applied to determine the true signal of a polypeptide spotted on a chip. Glass chips or other substrates spotted either with chemicals to bind polypeptides or with known antibodies can be constructed, and the bound polypeptide analyte can be detected, for example, by a mass spectrometer. Moreover, detection of a polypeptide analyte also can be by any other of a variety of detection methods well known in the art, including, for example, gels, blots, capillary and FACS formats. In addition, analytes other than nucleic acids and polypeptides can be detected by methods known in the art such as spectroscopy and laser-assisted techniques. The detection method and, consequently, the visualization technique that yields the observed signal will depend on a variety of factors such as the nature, amount, stability and purity of the analyte.
[0066] Microarray hybridization and fluorescent detection is one well known method for analysis of large numbers of nucleic acid analytes. Currently, arrays with more than 250,000 different oligonucleotide probes or 10,000 different cDNAs per square centimeter can be produced in significant numbers. Although it is possible to synthesize or deposit DNA fragments of unknown sequence, generally, microarray-based formats utilize specific sequences attached to a solid substrate such as glass, plastic, silicon, gold, a gel or membrane, beads, or beads at the ends of fibre-optic bundles. Such formats allow for parallel hybridization and simultaneous detection of a large number of indexed, surface-bound nucleic acid probes.
[0067] Nucleic acid arrays are generally produced by either robotic deposition of nucleic acids such as PCR products, plasmids or oligonucleotides, onto a glass slide or in situ synthesis using, for example, photolithography of oligonucleotides. After hybridization of labelled samples to the spotted or synthesized probes, the arrays are scanned and a quantitative fluorescence image along with the known identity of the probes is used to detect the presence of a particular molecule above thresholds based on background and noise levels.
[0068] Various methods for preparing labelled material for measurements of gene expression microarrays are well known in the art. For example, the RNA can be labelled directly, using a psoralen-biotin derivative or by ligation to an RNA molecule carrying biotin; labelled nucleotides can be incorporated in cDNA during or after reverse transcription of polyadenylated RNA; or cDNA can be generated that carries a T7 promoter at its 5′ end. In the last case, the double-stranded cDNA serves as template for an in vitro transcription reaction in which labelled nucleotides are incorporated into cRNA. Commonly used labels include the fluorophores fluorescein, Cy3 or Cy5, or nonfluorescent biotin, which is subsequently labelled by staining with a fluorescent streptavidin conjugate. Generally, cDNA from two different conditions is labelled with two different fluorescent dyes such as Cy3 and Cy5, and the two samples are co-hybridized to an array. After washing, the array is scanned at two different wavelengths to detect the relative transcript abundance for each condition.
[0069] Another quantitation method which is useful for determining expression levels of polypeptide analytes is the isotope-coded affinity tag (ICAT) method (Gygi et al.,
[0070] Additionally, measurement of an analyte signal can be by a variety of other methods well known in the art, including, for example, light emission, radioisotopes, and color development. Briefly, detection can involve methods such as radioactive labeling of the analyte using metabolic labeling in an appropriate cell or in vitro labeling by RNA transcription or by coupled in vitro transcription-translation with appropriate radioactive amino acids. Additionally, covalent modification with a radioactive or fluorescent substrate using an appropriate enzyme or chemical modification can be employed. Moreover, an analyte can be covalently modified by incorporating a chemical moiety capable of being detected. For example, green fluorescent protein, Cy3, Cy5 and other fluorophores can be covalently attached to a polypeptide analyte. Similarly, biotin can be covalently attached to a polypeptide analyte and subsequently detected by streptavidin using detection methods known in the art. Other methods also can involve fusion of an appropriate detection molecule to the analyte. For example, the analyte can be fused to luciferase and detected by light emission or can be fused to lacZ and detected by appropriate colorimetric detection.
[0071] The methods of the invention have utility for a variety of applications. Although a standard microarray compares only two populations, a greater number can be cross-compared by hybridizing labeled probe, such as cDNA prepared from each cell population of interest, to that of a common reference population. The methods of the invention can thus be used to determine genes differentially-expressed between any two populations, even if they have not been directly involved together in a single hybridization experiment.
[0072] The error model of the invention as described above does not distinguish between repeated samples drawn from multiple spots on a single array and repeated samples drawn from multiple hybridizations to different arrays. Because multiple spots within an array show less variability and more dye-to-dye correlation than do multiple spots observed over several arrays, the error model of the invention can be extended to distinguish between these two types of sampling, resulting in a more sensitive or accurate likelihood ratio test. Systems which involve more than one level of sampling are well known in the art and can be addressed by utilizing a nested design model as described by Dunn and Clark,
[0073] The methods of the invention further can be utilized to place a confidence interval on the true signal difference between two analytes. In this embodiment, rather than testing the hypothesis that μ
[0074] In another embodiment, the methods of the invention can be utilized to quantify, compare, and ultimately reduce the error introduced by each stage of an array process. Therefore, the methods of the invention can be used for quality control in a large variety of processes and settings. For example, as shown in Example II, system parameters and mean signals can be compared for replicate spots on one array versus a single spot observed over multiple array hybridizations (see also Table 2). It is understood that this embodiment of the method of the invention can be expanded to quantify several different levels of variation, such as variation due to cell culture, RNA preparation, labeling, or hybridization. Moreover, it can be expanded to other biological assay systems as well as non-biological systems. Thus, the method of the invention can be utilized to identify sources of variation that contribute to the overall error of the system.
[0075] The methods of the invention can be extended to a wide range of biological data involving comparisons between multiple measurements and can be advantageously utilized to determine differential gene expression based on studies with fluorescent or radioactive-labeled cDNA hybridized to gene clones spotted on membranes. Furthermore, the methods of the invention are applicable to large scale genotyping of human polymorphisms, where normal DNA is cut into small fragments, labeled, transferred onto a microchip and subsequently hybridized with labeled samples of normal and polymorphic DNA. Because the observed quantities of polypeptide expression per gene are analogous to fluorescent signals observed in a microarray experiment and are correlated, the methods of the invention can be practiced with technologies for comparing levels of polypeptide expression between two cell populations, for example (Gygi et al.,
[0076] For example, the method of the invention can be applied to proteomics where increased sensitivity of sequencing methods and mass spectrometry allow for determination of polypeptide expression profiles. The methods of the invention can be advantageously used to determine relative amounts of polypeptide based on, for example, virtual 2-D profiles obtained by linking of isoelectric focusing gels with mass spectrometry.
[0077] It is understood that the observed signal depends on the method of detection. For example, in the case of a microarray, the amount of hybridization can be quantified by, for example, optical imaging or laser scanning to observe the emitted light intensity. The observed signal also can be obtained by other visualization techniques based on the nature of the analyte as well as the assay and include, for example, chemiluminescence and fluorescence imaging systems, and mass spectrometry. These and other methods are well known in the art and can be employed for the detection of an observed signal in the methods of the invention.
[0078] This example describes development of a maximum-likelihood test for the variability observed over repeated observations of intensities for genes represented on a DNA microarray.
[0079] Preprocessing of Microarray Data
[0080] The amount of hybridization to each spot is quantified by scanning the array with a laser and observing the intensity of light emitted. Observations are made separately for the two dyes, such that two intensities x and y are observed for each spot on the microarray. This process does not behave deterministically in practice, such that multiple spots corresponding to each gene i hybridized under identical conditions will result in a distribution of intensities x
[0081] Spot intensities were extracted from a scanned image, then background-subtracted and normalized as follows: microarray images are processed with Dapple, a software tool developed for array spot finding and quantitation described by Buhler et al., Bioinformatics 2000, which can be found at the URL: cs.washington.edu/homes/jbuhler/research/array, which is incorporated herein by reference. The Dapple software locates each spot and reports a separate median foreground intensity for each dye inside the spot area. The Dapple software also provides a local background intensity estimate for each spot and dye. The Dapple intensity estimates were subsequently smoothed by spatial filtering using a 7 spot by 7 spot median filter as described by Lim J. S.
[0082] In practice, x′ and y′ have different scales and thus are not directly comparable. This situation can occur if the total amount of labeled cDNA is greater for one dye than it is for the other, if one dye incorporates more efficiently, or if the scanner has different sensitivities to the two dyes. Therefore, the intensities are normalized to have identical medians A within each array hybridization:
[0083] where x̃′ denotes the median intensity of x′ over all spots on a single microarray. If multiple array hybridizations are performed, normalization occurs independently for each and the resulting combined data set consists of data pairs (x
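Median normalization of the two dye channels can be sketched as follows. The normalizing equation is not legible in this text, so the sketch assumes the simplest form consistent with the description: each channel of a single array is divided by its own median, so that both channels end with a median of 1. The function name and the example intensities are hypothetical.

```python
from statistics import median

def median_normalize(x_prime, y_prime):
    # Scale each channel by its own median so both share a median of 1.
    mx, my = median(x_prime), median(y_prime)
    if mx == 0 or my == 0:
        raise ValueError("channel median is zero; cannot normalize")
    return [v / mx for v in x_prime], [v / my for v in y_prime]

# One array: background-subtracted intensities for the two dyes.
xn, yn = median_normalize([2.0, 4.0, 8.0], [10.0, 30.0, 50.0])
print(xn, yn)
```

When several array hybridizations are combined, the function would be applied to each array separately before pooling the normalized pairs.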
[0084] Formulation of the Error Model
[0085] An error model summarizing the influence of multiplicative and additive errors on x and y has been formulated. In this regard, it has been consistently observed that larger intensity measurements have a proportionately larger error over repeated samples.
[0086] The data shown in
[0087] As shown in
[0088] Based on the observations described above, the background-subtracted, median-normalized intensities observed for each gene are related to their true (or mean) intensities by the following model:
[0089] where (μ
[0090] The model depends on six gene-independent parameters β=(σ