This application is a continuation of pending U.S. patent application Ser. No. 10/242,799, filed on Sep. 13, 2002, which claims the benefit of U.S. Provisional Patent Application Nos. 60/322,285, filed on Sep. 14, 2001; 60/322,359, filed on Sep. 14, 2001; 60/322,506, filed on Sep. 14, 2001; 60/324,524, filed on Sep. 26, 2001; 60/354,242, filed on Feb. 6, 2002; 60/371,494, filed on Apr. 11, 2002; 60/384,096, filed on May 31, 2002; and 60/397,784, filed on Jul. 24, 2002. The contents of the above Applications are all incorporated herein by reference.
The present invention relates to systems and methods useful for annotating biomolecular sequences. More particularly, the present invention relates to computational approaches, which enable systemic characterization of biomolecular sequences and identification of differentially expressed biomolecular sequences such as sequences associated with a pathology.
In the postgenomic era, data analysis rather than data collection presents the biggest challenge to biologists. Efforts to ascribe biological meaning to genomic data, whether by identification of function, structure or expression pattern, are lagging behind sequencing efforts [Boguski M S (1999) Science 286:453-455].
It is well recognized that elucidation of spatial and temporal patterns of gene expression in healthy and diseased states may contribute immensely to further understanding of disease mechanisms.
Therefore, any observational method that can rapidly, accurately and economically observe and measure the pattern of expression of selected individual genes or of whole genomes is of great value to scientists.
In recent years, a variety of techniques have been developed to analyze differential gene expression. However, current observation and measurement methods are inaccurate, time consuming, labor intensive or expensive, oftentimes requiring complex molecular and biochemical analysis of numerous gene sequences.
For example, observation methods for individual mRNA or cDNA molecules such as Northern blot analysis, RNase protection, or selective hybridization to arrayed cDNA libraries [see Sambrook et al. (1989) Molecular cloning, A laboratory manual, Cold Spring Harbor press, NY] depend on specific hybridization of a single oligonucleotide probe complementary to the known sequence of an individual molecule. Since a single human cell is estimated to express 10,000-30,000 genes [Liang et al. (1992) Science 257:967-971], single probe methods to identify all sequences in a complex sample are ineffective and laborious.
Other approaches for high throughput analysis of differential gene expression are summarized infra.
EST sequencing—The basic idea is to create cDNA libraries from tissues of interest, pick clones randomly from these libraries and then perform a single sequencing reaction from a large number of clones. Each sequencing reaction generates 300 base pairs or so of sequence that represents a unique sequence tag for a particular transcript. An EST sequencing project is technically simple to execute since it requires only a cDNA library, automated DNA sequencing capabilities and standard bioinformatics protocols.
To generate meaningful amounts of data, however, high throughput template preparation, sequencing and analysis protocols must be applied. As such, the number of new genes identified as well as the statistical significance of the data is proportional to the number of clones sequenced as well as the complexity of the tissue being analyzed [Adams et al. (1995) Nature 377:3-173; Hillier et al. (1996) Genome Res. 6:807-828].
Subtractive cloning—Subtractive cloning offers an inexpensive and flexible alternative to EST sequencing and cDNA array hybridization. In this approach, double-stranded cDNA is created from the two cell or tissue populations of interest, linkers are ligated to the ends of the cDNA fragments and the cDNA pools are then amplified by PCR. The cDNA pool from which unique clones are desired is designated the “tester”, and the cDNA pool that is used to subtract away shared sequences is designated the “driver”. Following initial PCR amplification, the linkers are removed from both cDNA pools and unique linkers are ligated to the tester sample. The tester is then hybridized to a vast excess of driver DNA and sequences that are unique to the tester cDNA pool are amplified by PCR.
The primary limitation of subtractive methods is that they are not always comprehensive. The cDNAs identified are typically those that differ significantly in expression level between cell populations, and subtle quantitative differences are often missed. In addition, each experiment is a pairwise comparison, and since subtractions are based on a series of sensitive biochemical reactions, it is difficult to directly compare a series of RNA samples.
Differential display—Differential display is another PCR-based differential cloning method [Liang and Pardee (1992) Science 257:967-70; Welsh et al. (1992) Nucleic Acids Res. 20:4965-70]. In classical differential display, reverse transcription is primed with either oligo-dT or an arbitrary primer. Thereafter an arbitrary primer is used in conjunction with the reverse transcription primer to amplify cDNA fragments and the cDNA fragments are separated on a polyacrylamide gel. Differences in gene expression are visualized by the presence or absence of bands on the gel and quantitative differences in gene expression are identified by differences in the intensity of bands. Adaptation of differential display methods for fluorescent DNA sequencing machines has enhanced the ability to quantify differences in gene expression [Kato (1995) Nucleic Acids Res. 18:3685-90].
A limitation of the classical differential display approach is that false positive results are often generated during PCR or in the process of cloning the differentially expressed PCR products. Although a variety of methods have been developed to discriminate true from false positives, these typically rely on the availability of relatively large amounts of RNA.
Serial analysis of gene expression (SAGE)—this DNA sequence based method is essentially an accelerated version of EST sequencing [Velculescu et al. (1995) Science 270:484-8]. In this method a digestible unique sequence tag of 13 or more bases is generated for each transcript in the cell or tissue of interest, thereby generating a SAGE library.
Sequencing each SAGE library creates transcript profiles. Since each sequencing reaction yields information for twenty or more genes, it is possible to generate data points for tens of thousands of transcripts in modest sequencing efforts. The relative abundance of each gene is determined by counting or clustering sequence tags. The advantages of SAGE over many other methods include the high throughput that can be achieved and the ability to accumulate and compare SAGE tag data from a variety of samples. However, the technical difficulties concerning the generation of good SAGE libraries and data analysis are significant.
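The tag-counting step described for SAGE can be sketched as follows. This is a minimal illustration only: the function name and the tag strings are hypothetical, and the sketch assumes that tags have already been extracted from the sequencing reads.

```python
from collections import Counter

def tag_abundance(tags):
    """Return each tag's share of all tags observed in one library."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

# Hypothetical tags, invented for illustration only.
library = ["TAG_ONE", "TAG_ONE", "TAG_TWO", "TAG_ONE"]
abundance = tag_abundance(library)
```

Comparing the resulting fractions between two libraries gives a rough view of which tags differ in relative abundance; no statistical test is implied by this sketch.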
Altogether, it is clear from the above that laboratory bench approaches are ineffective, time consuming, expensive and oftentimes inaccurate in handling and processing the vast amount of genomic information which is now available.
It is appreciated that much of the analysis can be effected by developing computational algorithms that can be applied to mining data from existing databases, thereby retrieving and integrating valuable biological information.
To date, there are more than a hundred major biomolecule databases and application servers on the Internet and new sites are being introduced at an ever-increasing rate [Ashburner and Goodman (1997) Curr. Opin. Genet. Dev. 7:750-756; Karp (1998) Trends Biochem. Sci. 23:114-116].
However, these databases are organized in extremely heterogeneous formats. These reflect the inherent complexity of biological data, ranging from plain-text nucleic acid and protein sequences, through the three dimensional structures of therapeutic drugs and macromolecules and high resolution images of cells and tissues, to microarray-chip outputs. Moreover data structures are constantly evolving to reflect new research and technology development.
The heterogeneous and dynamic nature of these biological databases presents major obstacles in mining data relevant to specific biological queries. Clearly, simple retrieval of data is not sufficient for data mining; efficient data retrieval requires flexible data manipulation and sophisticated data integration. Specifically, it requires the use of complex queries across multiple heterogeneous data sources; data warehousing by merging data derived from multiple public sources and local (i.e., private) sources; and multiple data-analysis procedures that require feeding subsets of data derived from different sources into various application programs for gene finding, protein-structure prediction, functional domain or motif identification, phylogenetic tree construction, graphic presentation and so forth.
Current biological data retrieval systems are not fully up to the demand of smooth and flexible data integration [Etzold et al. (1996) Methods Enzymol. 266:114-128; Schuler et al. (1996) Methods Enzymol. 266:141-162; Chung and Wong (1999) Trends Biotech. 17:351-355].
There is thus a widely recognized need for, and it would be highly advantageous to have, systems and methods which can be used for efficient retrieval and processing of data from biological databases thereby enabling annotation of previously un-annotated biomolecular sequences.
According to one aspect of the present invention there is provided a method of annotating biomolecular sequences according to a hierarchy of interest, the method comprising: (a) computationally constructing a dendrogram having multiple nodes, the dendrogram representing the hierarchy of interest, wherein each node of the multiple nodes of the dendrogram is annotated by at least one keyword; (b) computationally assigning each biomolecular sequence of the biomolecular sequences to a specific node of the multiple nodes of the dendrogram to thereby generate assigned biomolecular sequences; and (c) computationally classifying each of the assigned biomolecular sequences to nodes hierarchically higher than the specific node, thereby annotating biomolecular sequences according to the hierarchy of interest.
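Steps (a) to (c) above can be sketched as follows, under the assumption that classifying a sequence to hierarchically higher nodes means propagating it from its assigned node up to the root. The class and function names, the sequence identifier and the sibling node are hypothetical; the remaining keywords are taken from the tissue hierarchy example of FIG. 4.

```python
class Node:
    """One dendrogram node, annotated by a keyword."""
    def __init__(self, keyword, parent=None):
        self.keyword = keyword
        self.parent = parent
        self.sequences = set()

def build_dendrogram(edges):
    """Build nodes from (child, parent) keyword pairs; parents must be listed first."""
    nodes = {}
    for child, parent in edges:
        nodes[child] = Node(child, nodes[parent] if parent else None)
    return nodes

def annotate(nodes, seq_id, keyword):
    """Assign seq_id to the named node, then to every node above it."""
    node = nodes[keyword]
    while node is not None:
        node.sequences.add(seq_id)
        node = node.parent

nodes = build_dendrogram([
    ("genitourinary system", None),
    ("genital system", "genitourinary system"),
    ("women genital system", "genital system"),
    ("cervix", "women genital system"),
    ("hypothetical sibling", "genital system"),  # invented node for contrast
])
annotate(nodes, "seq_1", "cervix")  # "seq_1" is a made-up identifier
```

After the call, "seq_1" is listed at the cervix node and at each of its ancestors, but not at the sibling node.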
According to another aspect of the present invention there is provided a method of identifying differentially expressed biomolecular sequences, the method comprising: (a) computationally constructing a dendrogram having multiple nodes, the dendrogram representing the hierarchy of interest, wherein each node of the multiple nodes of the dendrogram is annotated by at least one keyword; (b) computationally assigning each biomolecular sequence of the biomolecular sequences to a specific node of the multiple nodes of the dendrogram to thereby generate assigned biomolecular sequences; (c) computationally classifying each of the assigned biomolecular sequences to nodes hierarchically higher than the specific node, to thereby generate annotated biomolecular sequences; and (d) identifying annotated biomolecular sequences assigned to a portion of the multiple nodes, thereby identifying differentially expressed biomolecular sequences.
According to yet another aspect of the present invention there is provided a computer readable storage medium comprising a database stored in a retrievable manner, the database including files each containing data of a specific node of a dendrogram, the data including biomolecular sequence information and biomolecular sequence annotations, wherein the biomolecular sequence annotations are selected from the group consisting of contig description, tissue specific expression, pathological specific expression, functional features, parameters for ontological annotation assignment, cellular localization, database sequence source and functional alterations.
According to still another aspect of the present invention there is provided a system for generating a database of annotated biomolecular sequences, the system comprising a processing unit, the processing unit executing a software application configured for: (a) constructing a dendrogram having multiple nodes, the dendrogram representing a hierarchy of interest, wherein each node of the multiple nodes of the dendrogram is annotated by at least one keyword; (b) assigning each biomolecular sequence of the biomolecular sequences to a specific node of the multiple nodes of the dendrogram to thereby generate assigned biomolecular sequences; (c) classifying each of the assigned biomolecular sequences to nodes hierarchically higher than the specific node, to thereby generate annotated biomolecular sequences; and (d) storing sequence annotations and sequence information of the annotated biomolecular sequences, thereby generating the database of annotated biomolecular sequences.
According to further features in preferred embodiments of the invention described below, the biomolecular sequences are selected from the group consisting of polypeptide sequences and polynucleotide sequences.
According to still further features in the described preferred embodiments the polynucleotides are selected from the group consisting of genomic sequences, expressed sequence tags, contigs, complementary DNA (cDNA) sequences, pre-messenger RNA (pre-mRNA) sequences, and mRNA sequences.
According to still further features in the described preferred embodiments the biomolecular sequences are selected from the group consisting of annotated biomolecular sequences, unannotated biomolecular sequences and partially annotated biomolecular sequences.
According to still further features in the described preferred embodiments the method further comprising homology clustering of the biomolecular sequences prior to step (b).
According to still further features in the described preferred embodiments the dendrogram is selected from the group consisting of a graph, a list, a map and a matrix.
According to still further features in the described preferred embodiments the hierarchy of interest is selected from the group consisting of a tissue expression hierarchy, a developmental expression hierarchy, a pathological expression hierarchy, a cellular expression hierarchy, an intracellular expression hierarchy, a taxonomical hierarchy and a functional hierarchy.
According to still further features in the described preferred embodiments each node of the multiple nodes is a parental node in an additional hierarchy of interest.
According to still further features in the described preferred embodiments the method further comprising classifying the biomolecular sequences of the parental node according to the additional hierarchy of interest.
According to still further features in the described preferred embodiments the system further comprising classifying the biomolecular sequences of the parental node according to the additional hierarchy of interest.
According to still further features in the described preferred embodiments each of the biomolecular sequences is a member of a sequence contig.
According to still further features in the described preferred embodiments the method further comprising the step of confirming annotations of the assigned biomolecular sequence in-vivo and/or in-vitro prior to or following step (c).
According to still further features in the described preferred embodiments the system further comprising the step of confirming annotations of the assigned biomolecular sequence in-vivo and/or in-vitro prior to or following step (c).
According to an additional aspect of the present invention there is provided a method of identifying sequence features unique to differentially expressed mRNA splice variants, the method comprising: (a) computationally identifying unique sequence features in each splice variant of an alternatively spliced expressed sequences; and (b) identifying differentially expressed splice variants of the alternatively spliced expressed sequences, thereby identifying sequence features unique to differentially expressed mRNA splice variants.
According to yet an additional aspect of the present invention there is provided a computer readable storage medium comprising data stored in a retrievable manner, the data including sequence information of sequence features unique to differentially expressed mRNA splice variants as set forth in files:
“Transcripts_nucleotide_seqs_part1”,
“Transcripts_nucleotide_seqs_part2”
“Transcripts_nucleotide_seqs_part3.new” and/or
“Protein.seqs”
provided in CD-ROMs 1 and/or 2 enclosed herewith, and sequence annotations as set forth in annotation categories “#TAA_CD” and “#TAA_TIS”, in the file “Summary_table.new” of CD-ROM3 enclosed herewith.
According to still an additional aspect of the present invention there is provided a system for generating a database of sequence features unique to differentially expressed mRNA splice variants, the system comprising a processing unit, the processing unit executing a software application configured for: (a) identifying unique sequence features in each splice variant of an alternatively spliced expressed sequences; (b) identifying differentially expressed splice variants of the alternatively spliced expressed sequences, thereby identifying sequence features unique to differentially expressed mRNA splice variants; and (c) storing the sequence features unique to the differentially expressed mRNA splice variants, thereby generating the database of sequence features unique to differentially expressed mRNA splice variants.
According to still further features in the described preferred embodiments step (b) is effected by qualifying annotations associated with the alternatively spliced expressed sequences.
According to still further features in the described preferred embodiments the method further comprising scoring the annotations associated with the alternatively spliced expressed sequences according to: (i) prevalence of the alternatively spliced expressed sequences in normal tissues; (ii) prevalence of the alternatively spliced expressed sequences in pathological tissues; (iii) prevalence of the alternatively spliced expressed sequence in total tissues; and (iv) number of tissues and/or tissue types expressing the alternatively spliced expressed sequences.
According to still further features in the described preferred embodiments the system further comprising scoring the annotations associated with the alternatively spliced expressed sequences according to: (i) prevalence of the alternatively spliced expressed sequences in normal tissues; (ii) prevalence of the alternatively spliced expressed sequences in pathological tissues; (iii) prevalence of the alternatively spliced expressed sequence in total tissues; and (iv) number of tissues and/or tissue types expressing the alternatively spliced expressed sequences.
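The four scoring inputs (i) to (iv) can be tabulated as in the following sketch. The record layout, the function name and the counts are assumptions made for illustration, and the sketch deliberately stops short of combining the four quantities into a single score, since no formula is given in this passage.

```python
def prevalence_features(observations):
    """Summarise (tissue, state, count) records for one expressed sequence.

    state is "normal" or "pathological"; count is the number of supporting
    expressed sequences seen in that tissue and state.
    """
    normal = sum(c for _, s, c in observations if s == "normal")
    pathological = sum(c for _, s, c in observations if s == "pathological")
    tissues = {t for t, _, c in observations if c > 0}
    return {
        "normal": normal,                # (i)
        "pathological": pathological,    # (ii)
        "total": normal + pathological,  # (iii)
        "tissue_count": len(tissues),    # (iv)
    }

# Hypothetical counts, invented for illustration only.
features = prevalence_features([
    ("colon", "pathological", 7),
    ("colon", "normal", 1),
    ("lung", "normal", 0),
])
```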
According to still further features in the described preferred embodiments step (b) is effected by identifying the unique sequence feature.
According to still further features in the described preferred embodiments the unique sequence feature is selected from the group consisting of a donor-acceptor concatenation, an alternative exon, an exon and a retained intron.
According to still further features in the described preferred embodiments identifying unique sequence features in each splice variant of an alternatively spliced expressed sequence is effected by expressed sequence alignment.
According to a further aspect of the present invention there is provided a kit useful for detecting differentially expressed polynucleotide sequences, the kit comprising at least one oligonucleotide being designed and configured to be specifically hybridizable with a polynucleotide sequence selected from the group consisting of sequence files:
“Transcripts_nucleotide_seqs_part1”
“Transcripts_nucleotide_seqs_part2” and
“Transcripts_nucleotide_seqs_part3.new”
provided in CD-ROMs 1 and/or 2 enclosed herewith, under moderate to stringent hybridization conditions.
According to still further features in the described preferred embodiments the at least one oligonucleotide is labeled.
According to still further features in the described preferred embodiments the at least one oligonucleotide is attached to a solid substrate.
According to still further features in the described preferred embodiments the solid substrate is configured as a microarray and whereas the at least one oligonucleotide includes a plurality of oligonucleotides each being capable of hybridizing with a specific polynucleotide sequence of the polynucleotide sequences set forth in the files:
“Transcripts_nucleotide_seqs_part1”
“Transcripts_nucleotide_seqs_part2” and/or
“Transcripts_nucleotide_seqs_part3.new”
provided in CD-ROMs 1 and/or 2 enclosed herewith.
According to still further features in the described preferred embodiments each of the plurality of oligonucleotides is attached to the microarray in a regio-specific manner.
According to still further features in the described preferred embodiments the at least one oligonucleotide is designed and configured for DNA hybridization.
According to still further features in the described preferred embodiments the at least one oligonucleotide is designed and configured for RNA hybridization.
According to yet a further aspect of the present invention there is provided a method of annotating biomolecular sequences, the method comprising: (a) computationally clustering the biomolecular sequences according to a progressive homology range, to thereby generate a plurality of clusters each being of a predetermined homology of the homology range; and (b) assigning at least one ontology to each cluster of the plurality of clusters, the at least one ontology being: (i) derived from an annotation preassociated with at least one biomolecular sequence of each cluster; and/or (ii) generated from analysis of the at least one biomolecular sequence of each cluster thereby annotating biomolecular sequences.
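The clustering and ontology assignment of steps (a) and (b)(i) can be sketched as follows. This is only one possible reading: it assumes that pairwise percent identities are already available, uses single-linkage grouping as a stand-in for whatever clustering is actually applied, and treats every name, threshold and annotation term as hypothetical.

```python
def clusters_at(threshold, ids, identity):
    """Group sequences whose pairwise identity is at least `threshold` percent."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), pct in identity.items():
        if pct >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    return {frozenset(g) for g in groups.values()}

def progressive_clusters(thresholds, ids, identity):
    """Cluster once per homology level, from the strictest level downward."""
    return {t: clusters_at(t, ids, identity) for t in sorted(thresholds, reverse=True)}

def inherit_ontologies(cluster, known):
    """Option (i): collect annotations already associated with any cluster member."""
    return set().union(*(known.get(s, set()) for s in cluster))

ids = ["s1", "s2", "s3"]
identity = {("s1", "s2"): 98.0, ("s2", "s3"): 60.0, ("s1", "s3"): 40.0}
levels = progressive_clusters([95, 50], ids, identity)
terms = inherit_ontologies(frozenset({"s1", "s2"}), {"s1": {"hypothetical term"}})
```

At the 95% level only s1 and s2 group together; at the 50% level all three sequences fall into one cluster, so an annotation known for s1 reaches s3 only at the looser level.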
According to still a further aspect of the present invention there is provided a system for generating a database of annotated biomolecular sequences, the system comprising a processing unit, the processing unit executing a software application configured for: (a) clustering the biomolecular sequences according to a progressive homology range, to thereby generate a plurality of clusters each being of a predetermined homology of the homology range; and (b) assigning at least one ontology to each cluster of the plurality of clusters, the at least one ontology being: (i) derived from an annotation preassociated with at least one biomolecular sequence of each cluster; and/or (ii) generated from analysis of the at least one biomolecular sequence of each cluster, to thereby annotate the biomolecular sequences; and (c) storing sequence annotations and sequence information of the annotated biomolecular sequences, thereby generating the database of annotated biomolecular sequences.
According to still a further aspect of the present invention there is provided a computer readable storage medium comprising a database stored in a retrievable manner, the database including sequence information as set forth in files:
“Transcripts_nucleotide_seqs_part1”
“Transcripts_nucleotide_seqs_part2”
“Transcripts_nucleotide_seqs_part3.new” and/or
“Protein.seqs”
provided in CD-ROMs 1 and/or 2 enclosed herewith, and sequence ontological annotations in #GO_P, #GO_F, #GO_C annotation categories in file “Summary_table.new” of CD-ROM3 enclosed herewith.
According to still further features in the described preferred embodiments the biomolecular sequences are selected from the group consisting of polynucleotide sequences and polypeptide sequences.
According to still further features in the described preferred embodiments the homology range is between 99%-35%.
According to still further features in the described preferred embodiments the analysis of the at least one biomolecular sequence includes literature text mining.
According to still further features in the described preferred embodiments the analysis of the at least one biomolecular sequence includes cellular localization prediction.
According to still further features in the described preferred embodiments the analysis of the at least one biomolecular sequence includes homology analysis.
According to still further features in the described preferred embodiments the at least one ontology is selected from the group consisting of molecular biology, microbiology, developmental biology, immunology, virology, biochemistry, physiology, pharmacology, medicine, bioinformatics, cell biology, endocrinology, structural biology, mathematics, chemistry, medicine, plant sciences, neurology, genetics, zoology, ecology, genomics, cheminformatics, computer sciences, statistics, physics and artificial intelligence.
According to still further features in the described preferred embodiments the ontology includes a subontology.
According to still further features in the described preferred embodiments the method further comprising scoring the at least one ontology assigned to a cluster of the plurality of clusters according to: (i) a degree of homology characterizing the cluster; and (ii) relevance of annotation to information obtained from literature text mining.
According to still further features in the described preferred embodiments the system further comprising scoring the at least one ontology assigned to a cluster of the plurality of clusters according to: (i) a degree of homology characterizing the cluster; and (ii) relevance of annotation to information obtained from literature text mining.
According to still further features in the described preferred embodiments the method further comprising generating a sequence profile to each cluster of the plurality of clusters following step (b).
According to still further features in the described preferred embodiments the system further comprising generating a sequence profile to each cluster of the plurality of clusters following step (b).
According to still a further aspect of the present invention there is provided a computer readable storage medium, comprising a database stored in a retrievable manner, the database including biomolecular sequence information as set forth in files:
“Transcripts_nucleotide_seqs_part1”
“Transcripts_nucleotide_seqs_part2”
“Transcripts_nucleotide_seqs_part3.new” and/or
“Protein.seqs”
provided in CD-ROMs 1 and/or 2 enclosed herewith, and biomolecular sequence annotations as set forth in file “Summary_table.new” of CD-ROM 3 enclosed herewith.
According to still a further aspect of the present invention there is provided a method of diagnosing colon cancer in a subject, the method comprising identifying in the subject the presence or absence of a biomolecular sequence selected from the group consisting of SEQ ID NOs: 4, 39, 24-28, 35-38, 12 and 29-31 wherein presence of the biomolecular sequence indicates colon cancer in the subject.
According to still a further aspect of the present invention there is provided a method of diagnosing lung cancer in a subject, the method comprising identifying in the subject the presence or absence of a biomolecular sequence selected from the group consisting of SEQ ID NOs: 15, 18, 21 and 32 wherein presence of the biomolecular sequence indicates lung cancer in the subject.
According to still a further aspect of the present invention there is provided a method of diagnosing Ewing sarcoma in a subject, the method comprising identifying in the subject the presence or absence of a biomolecular sequence as set forth in SEQ ID NO: 7, wherein presence of the biomolecular sequence indicates Ewing sarcoma in the subject.
According to still a further aspect of the present invention there is provided a computer readable storage medium comprising data stored in a retrievable manner, the data including sequence information of differentially expressed biomolecular sequences as set forth in files:
“Transcripts_nucleotide_seqs_part1”
“Transcripts_nucleotide_seqs_part2”
“Transcripts_nucleotide_seqs_part3.new” and
“Protein.seqs”
provided in CD-ROMs 1 and/or 2 enclosed herewith, and sequence annotations as set forth in annotation categories “SA” and “RA”, in the file “Summary_table.new” of CD-ROM3 enclosed herewith.
According to still a further aspect of the present invention there is provided a computer readable storage medium comprising data stored in a retrievable manner, the data including sequence information of biomolecular sequences exhibiting gain of function or loss of function as set forth in files:
“Transcripts_nucleotide_seqs_part1”
“Transcripts_nucleotide_seqs_part2”
“Transcripts_nucleotide_seqs_part3.new” and
“Protein.seqs”
provided in CD-ROMs 1 and/or 2 enclosed herewith, and sequence annotations as set forth in annotation category “DN”, in the file “Summary_table.new” of CD-ROM3 enclosed herewith.
According to still further features in the described preferred embodiments the database further includes information pertaining to generation of the data and potential uses of the data.
According to still further features in the described preferred embodiments the medium is selected from the group consisting of a magnetic storage medium, an optical storage medium and an optico-magnetic storage medium.
According to still further features in the described preferred embodiments the database further includes information pertaining to gain and/or loss of function of the differentially expressed mRNA splice variants or polypeptides encoded thereby.
The present invention successfully addresses the shortcomings of the presently known configurations by providing methods and systems useful for systematically annotating biomolecular sequences.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, suitable methods and materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and not intended to be limiting.
The invention is herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only, and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention may be embodied in practice.
In the drawings:
FIG. 1 a illustrates a system designed and configured for generating a database of annotated biomolecular sequences according to the teachings of the present invention.
FIG. 1 b illustrates a remote configuration of the system described in FIG. 1 a.
FIG. 2 illustrates a gastrointestinal tissue hierarchy dendrogram generated according to the teachings of the present invention.
FIG. 3 is a scheme illustrating multiple alignment of alternatively spliced expressed sequences with a genomic sequence including three exons (A, B and C) and two introns. Two alternative splicing events are described. The first, from the donor site, involves an AB junction between the donor and the proximal acceptor and an AC junction between the donor and the distal acceptor. The second, from the acceptor site, involves an AC junction between the distal donor and the acceptor and a BC junction between the proximal donor and the acceptor.
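The junctions named for FIG. 3 can be derived mechanically once each splice variant is written as an ordered list of exons. The following is a minimal sketch under that assumption; the variant names and function names are hypothetical, and real input would come from aligning expressed sequences to the genomic sequence.

```python
from collections import Counter

def junctions(exons):
    """Return the exon-exon junctions of one variant, e.g. A,B,C gives {"AB", "BC"}."""
    return {a + b for a, b in zip(exons, exons[1:])}

def unique_junctions(variants):
    """Map each variant to the junctions that no other variant contains."""
    seen = Counter(j for exons in variants.values() for j in junctions(exons))
    return {
        name: {j for j in junctions(exons) if seen[j] == 1}
        for name, exons in variants.items()
    }

# Hypothetical variants built from the three exons of FIG. 3.
variants = {
    "variant_1": ["A", "B", "C"],  # exon B included
    "variant_2": ["A", "C"],       # exon B skipped
}
features = unique_junctions(variants)
```

Here the AC junction marks the variant that skips exon B, while AB and BC mark the variant that includes it.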
FIG. 4 is a tissue hierarchy dendogram generated according to the teachings of the present invention. The higher annotation levels are marked with a single number, i.e., 1-16. The lower annotation levels are marked within the relevant category as one-four numbers after the point (e.g. 4. genitourinary system; 4.2 genital system; 4.2.1 women genital system; 4.2.1.1 cervix).
FIG. 5 is a graph illustrating a correlation between LOD scores of textual information analysis and accuracy of ontological annotation prediction. Results are based on self-validation studies. Only predictions made with LOD scores above 2 were evaluated and used for GO annotation process.
FIGS. 6 a - c are histograms showing the distribution of proteins (closed squares) and contigs (opened squares) from Ensembl version 1.0.0 in the major nodes of three GO categories—cellular component (FIG. 6 a ), molecular function (FIG. 6 b ), and biological process (FIG. 6 c ).
FIG. 7 illustrates results from RT-PCR analysis of the expression pattern of the AA535072 (SEQ ID NO: 39) colorectal cancer-specific transcript. The following cell and tissue samples were tested: B-colon carcinoma cell line SW480 (ATCC-228); C-colon carcinoma cell line SW620 (ATCC-227); D-colon carcinoma cell line colo-205 (ATCC-222). Colon normal tissue indicates a pool of 10 different samples (Biochain, cat no A406029). The adenocarcinoma sample represents a pool of spleen, lung, stomach and kidney adenocarcinomas, obtained from patients. Each of the tissues (i.e., colon carcinoma samples Duke's A-D; and normal muscle, pancreas, breast, liver, testis, lung, heart, ovary, thymus, spleen, kidney, placenta, stomach, brain) was obtained from 3-6 patients and pooled.
FIG. 8 illustrates results from RT-PCR analysis of the expression pattern of the AA513157 (SEQ ID NO: 7) Ewing sarcoma specific transcript. The (+) or (−) symbols, indicate presence or absence of reverse transcriptase in the reaction mixture. A molecular weight standard is indicated by M. Tissue samples (i.e., Ewing sarcoma samples, spleen adenocarcinoma, brain, prostate and thymus) were obtained from patients. The Ln-CAP human prostatic adenocarcinoma cell line was obtained from the ATCC (Manassas, Va.).
FIG. 9 is an autoradiogram of a northern blot analysis depicting tissue distribution and expression levels of AA513157 (SEQ ID NO: 7) Ewing sarcoma specific transcript. Arrows indicate the molecular weight of 28S and 18S ribosomal RNA subunits. The indicated tissue samples were obtained from patients and SK-ES-1-Ewing sarcoma cell-line was obtained from the ATCC (CRL-1427).
FIG. 10 illustrates results from semi-quantitative RT-PCR analysis of the expression pattern of the AA469088 (SEQ ID NO: 40) colorectal specific transcript. Colon normal was obtained from Biochain, cat no: A406029. The adenocarcinoma sample represents a pool of spleen, lung, stomach and kidney adenocarcinomas, obtained from patients. Each of the other tissues (i.e., colon carcinoma samples Duke's A-D; and normal thymus, spleen, kidney, placenta, stomach, brain) was obtained from 3-6 patients and pooled.
FIG. 11 is a histogram depicting Real-Time RT-PCR quantification of copy number of a lung specific transcript (SEQ ID NO: 15). Amplification products obtained from the following tissues were quantified: normal salivary gland from total RNA (Clontech, cat no: 64110-1); lung normal from pooled adult total RNA (BioChain, cat no: A409363); lung tumor squamous cell carcinoma (Clontech, cat no: 64013-1); lung tumor squamous cell carcinoma (BioChain, cat no: A409017); pooled lung tumor squamous cell carcinoma (BioChain, cat no: A411075); moderately differentiated squamous cell carcinoma (BioChain, cat no: A409091); well differentiated squamous cell carcinoma (BioChain, cat no: A408175); pooled adenocarcinoma (BioChain, cat no: A411076); moderately differentiated alveolus cell carcinoma (BioChain, cat no: A409089); and non-small cell lung carcinoma cell line H1299. The following normal and tumor samples were obtained from patients: normal lung (internal number-CG-207N), lung carcinoma (internal number-CG-72), squamous cell carcinoma (internal number-CG-196), squamous cell carcinoma (internal number-CG-207), lung adenocarcinoma (internal number-CG-120), lung adenocarcinoma (internal number-CG-160). Copy number was normalized to the levels of expression of the housekeeping genes Proteasome 26S subunit (dark columns) and GAPDH (bright columns).
FIG. 12 is a histogram depicting Real-Time RT-PCR quantification of copy number of the lung specific transcript (SEQ ID NO: 32). Amplification products obtained from the following tissues and cell-lines were quantified: lung normal from pooled adult total RNA (BioChain, cat no: A409363); lung tumor squamous cell carcinoma (Clontech, cat no: 64013-1); lung tumor squamous cell carcinoma (BioChain, cat no: A409017); pooled lung tumor squamous cell carcinoma (BioChain, cat no: A411075); moderately differentiated squamous cell carcinoma (BioChain, cat no: A409091); well differentiated squamous cell carcinoma (BioChain, cat no: A408175); pooled adenocarcinoma (BioChain, cat no: A411076); moderately differentiated alveolus cell carcinoma (BioChain, cat no: A409089); and non-small cell lung carcinoma cell line H1299. The following normal and tumor samples were obtained from patients: normal lung (internal number-CG-207N), lung carcinoma (internal number-CG-72), squamous cell carcinoma (internal number-CG-196), squamous cell carcinoma (internal number-CG-207), lung adenocarcinoma (internal number-CG-120), lung adenocarcinoma (internal number-CG-160). Copy number was normalized to the levels of expression of the housekeeping genes Proteasome 26S subunit (dark columns) and GAPDH (bright columns).
FIG. 13 is a histogram depicting Real-Time RT-PCR quantification of copy number of the lung specific transcript (SEQ ID NO: 18). Amplification products obtained from the following tissues and cell-lines were quantified: lung normal from pooled adult total RNA (BioChain, cat no: A409363); lung tumor squamous cell carcinoma (Clontech, cat no: 64013-1); lung tumor squamous cell carcinoma (BioChain, cat no: A409017); pooled lung tumor squamous cell carcinoma (BioChain, cat no: A411075); moderately differentiated squamous cell carcinoma (BioChain, cat no: A409091); well differentiated squamous cell carcinoma (BioChain, cat no: A408175); pooled adenocarcinoma (BioChain, cat no: A411076); moderately differentiated alveolus cell carcinoma (BioChain, cat no: A409089); and non-small cell lung carcinoma cell line H1299. The following normal and tumor samples were obtained from patients: normal lung (internal number-CG-207N), lung carcinoma (internal number-CG-72), squamous cell carcinoma (internal number-CG-196), squamous cell carcinoma (internal number-CG-207), lung adenocarcinoma (internal number-CG-120), lung adenocarcinoma (internal number-CG-160). Copy number was normalized to the levels of expression of the housekeeping genes Proteasome 26S subunit (dark columns) and GAPDH (bright columns).
FIG. 14 is a histogram depicting Real-Time RT-PCR quantification of copy number of a lung specific transcript (SEQ ID NO: 21). Amplification products obtained from the following tissues and cell-lines were quantified. Samples 1-6 are commercial normal lung samples (BioChain, CDP-061010; A503205, A503384, A503385, A503204, A503206, A409363). Sample 7 is lung well differentiated adenocarcinoma (BioChain, CDP-064004A; A504117). Sample 8 is lung moderately differentiated adenocarcinoma (BioChain, CDP-064004A; A504119). Sample 9 is lung moderately to poorly differentiated adenocarcinoma (BioChain, CDP-064004A; A504116). Sample 10 is lung well differentiated adenocarcinoma (BioChain, CDP-064004A; A504118). Samples 11-16 are lung adenocarcinoma samples obtained from patients. Sample 17 is lung moderately differentiated squamous cell carcinoma (BioChain, CDP-064004B; A503187). Sample 18 is lung squamous cell carcinoma (BioChain, CDP-064004B; A503386). Samples 20-21 are lung moderately differentiated squamous cell carcinoma (BioChain, CDP-064004B; A503387, A503383). Sample 22 is lung squamous cell carcinoma pooled (BioChain, CDP-064004B; A411075). Samples 23-26 and sample 31 are lung squamous cell carcinoma obtained from patients. Sample 27 is lung squamous cell carcinoma (Clontech, 64013-1). Sample 28 is lung squamous cell carcinoma (BioChain, A409017). Sample 29 is lung moderately differentiated squamous cell carcinoma (BioChain, CDP-064004B; A409091). Sample 30 is lung well differentiated squamous cell carcinoma (BioChain, CDP-064004B; A408175). Samples 32-35 are lung small cell carcinoma (BioChain, CDP-064004D; A504115, A501390, A501389, A501391). Samples 36-37 are lung large cell carcinoma (BioChain, CDP-064004C; A504113, A504114). Sample 38 is lung moderately differentiated alveolus cell carcinoma (BioChain, A409089). Sample 39 is lung carcinoma obtained from a patient. Sample 40 is lung H1299 non-small cell carcinoma cell line. Sample 41 is a normal salivary gland sample (Clontech, 64110-1). Copy number was normalized to the levels of expression of the housekeeping genes Proteasome 26S subunit (dark columns) and GAPDH (bright columns).
The present invention is of methods and systems, which can be used for annotating biomolecular sequences. Specifically, the present invention can be used to identify and annotate differentially expressed biomolecular sequences, such as differentially expressed alternatively spliced sequences.
The principles and operation of the present invention may be better understood with reference to the drawings and accompanying descriptions.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments or of being practiced or carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein is for the purpose of description and should not be regarded as limiting.
Terminology
As used herein, the term “oligonucleotide” refers to a single stranded or double stranded oligomer or polymer of ribonucleic acid (RNA) or deoxyribonucleic acid (DNA) or mimetics thereof. This term includes oligonucleotides composed of naturally-occurring bases, sugars and covalent internucleoside linkages (e.g., backbone) as well as oligonucleotides having non-naturally-occurring portions which function similarly. Such modified or substituted oligonucleotides are often preferred over native forms because of desirable properties such as, for example, enhanced cellular uptake, enhanced affinity for nucleic acid target and increased stability in the presence of nucleases.
The phrase “complementary DNA” (cDNA) refers to the double stranded or single stranded DNA molecule, which is synthesized from a messenger RNA template.
The term “contig” refers to a series of overlapping sequences with sufficient identity to create a longer contiguous sequence. A plurality of contigs may form a cluster. Clusters are generally formed based upon a specified degree of homology and overlap (e.g., a stringency). The different contigs in a cluster do not typically represent the entire sequence of the gene, rather the gene may comprise one or more unknown intervening sequences between the defined contigs.
The term “cluster” refers to a nucleic acid sequence cluster or a protein sequence cluster. The former refers to a group of nucleic acid sequences which share a requisite level of homology and/or other similar traits according to a given clustering criterion; and the latter refers to a group of protein sequences which share a requisite level of homology and/or other similar traits according to a given clustering criterion.
A process and/or method to group nucleic acid or protein sequences as such is referred to as clustering, which is typically performed by a clustering (i.e., alignment) application program implementing a cluster algorithm.
As used herein the phrase “biomolecular sequences” refers to amino acid sequences (i.e., peptides, polypeptides) and nucleic acid sequences, which include but are not limited to genomic sequences, expressed sequence tags, contigs, complementary DNA (cDNA) sequences, pre-messenger RNA (mRNA) sequences, and mRNA sequences.
With the presentation of the human genome working draft, data analysis rather than data collection presents the biggest challenge to biologists. Efforts to ascribe biological meaning to genomic data include the development of advanced wet laboratory techniques as well as computerized algorithms. While the former are limited by inaccuracy, time consumption, labor intensiveness and cost, the latter are still unfeasible due to the poor organization of available sequence databases as well as the composite nature of biological data.
As is further described hereinbelow, the present inventors have developed a computer-based approach for the functional, spatial and temporal analysis of biological data. The present methodology generates comprehensive databases which greatly facilitate the use of available genetic information in both research and commercial applications.
As is further described hereinunder, the present invention encompasses several novel approaches for annotating biomolecular sequences.
“Annotating” refers to the act of discovering and/or assigning an annotation (i.e., critical or explanatory notes or comment) to a biomolecular sequence of the present invention.
The term “annotation” refers to a functional or structural description of a sequence, which may include identifying attributes such as locus name, keywords, Medline references, cloning data, information of coding region, regulatory regions, catalytic regions, name of encoded protein, subcellular localization of the encoded protein, protein hydrophobicity, protein function, mechanism of protein function, information on metabolic pathways, regulatory pathways, protein-protein interactions and tissue expression profile.
An ontology refers to the body of knowledge in a specific knowledge domain or discipline such as molecular biology, microbiology, immunology, virology, plant sciences, pharmaceutical chemistry, medicine, neurology, endocrinology, genetics, ecology, genomics, proteomics, cheminformatics, pharmacogenomics, bioinformatics, computer sciences, statistics, mathematics, chemistry, physics and artificial intelligence.
An ontology includes domain-specific concepts—referred to herein as sub-ontologies. A sub-ontology may be classified into smaller and narrower categories.
The ontological annotation approach of the present invention is effected as follows.
First, biomolecular sequences are computationally clustered according to a progressive homology range, thereby generating a plurality of clusters each being of a predetermined homology of the homology range.
Progressive homology according to this aspect of the present invention is used to identify meaningful homologies among biomolecular sequences and thereby assign new ontological annotations to sequences, which share requisite levels of homologies. Essentially, a biomolecular sequence is assigned to a specific cluster if it displays a predetermined homology to at least one member of the cluster (i.e., single linkage). As used herein “progressive homology range” refers to a range of homology thresholds, which progress via predetermined increments from a low homology level (e.g. 35%) to a high homology level (e.g. 99%). Further description of a progressive homology range is provided in the Examples section which follows.
Following generation of clusters, one or more ontologies are assigned to each cluster. Ontologies are derived from an annotation preassociated with at least one biomolecular sequence of each cluster; and/or generated by analyzing (e.g., text-mining) at least one biomolecular sequence of each cluster thereby annotating biomolecular sequences.
Any annotational information identified and/or generated according to the teachings of the present invention can be stored in a database which can be generated by a suitable computing platform.
Thus, the method according to this aspect of the present invention provides a novel approach for annotating biomolecular sequences even on a scale of a genome, a transcriptome (i.e., the repertoire of all messenger RNA molecules transcribed from a genome) or a proteome (i.e., the repertoire of all proteins translated from messenger RNA molecules). This enables transcriptome-wide comparative analyses (e.g., analyzing chromosomal distribution of human genes) and cross-transcriptome comparative studies (e.g., comparing expressed data across species) both of which may involve various subontologies such as molecular function, biological process and cellular localization.
Biomolecular sequences which can be used as working material for the annotating process according to this aspect of the present invention can be obtained from a biomolecular sequence database. Such a database can include protein sequences and/or nucleic acid sequences derived from libraries of expressed messenger RNA [i.e., expressed sequence tags (EST)], cDNA clones, contigs, pre-mRNA, which are prepared from specific tissues or cell-lines or from whole organisms.
This database can be a pre-existing publicly available database [e.g., GenBank database maintained by the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine, the TIGR database maintained by The Institute for Genomic Research, Blocks database maintained by the Fred Hutchinson Cancer Research Center, Swiss-Prot site maintained by the University of Geneva and GenPept maintained by NCBI and including public protein-sequence database which contains all the protein databases from GenBank] or a private database (e.g., the LifeSeq™ and PathoSeq™ databases available from Incyte Pharmaceuticals, Inc. of Palo Alto, Calif.). Optionally, biomolecular sequences of the present invention can be assembled from a number of pre-existing databases as described in Example 5 of the Examples section.
Alternatively, the database can be generated from sequence libraries including, but not limited to, cDNA libraries, EST libraries, mRNA libraries and the like.
Construction and sequencing of a cDNA library is one approach for generating a database of expressed mRNA sequences. cDNA library construction is typically effected by tissue or cell sample preparation, RNA isolation, cDNA sequence construction and sequencing.
It will be appreciated that such cDNA libraries can be constructed from RNA isolated from whole organisms, tissues, tissue sections, or cell populations. Libraries can also be constructed from a tissue reflecting a particular pathological or physiological state.
Once raw sequence data is obtained, biomolecular sequences are computationally clustered according to a progressive homology range using one or more clustering algorithms. To obtain progressive clusters, the biomolecular sequences are clustered through single linkage. Namely, a biomolecular sequence belongs to a cluster if this sequence shares a sequence homology above a certain threshold with at least one member of the cluster. The threshold increments from a high homology level to a low homology level with a predetermined resolution. Preferably the homology range is selected from 99% to 35%.
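The single-linkage clustering over a stepped threshold described above can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration only: `percent_identity` is a toy, ungapped stand-in for a real homology score produced by an alignment program, and the 16-point step is an arbitrary choice of resolution between 99% and 35%.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Toy homology score (assumption): percent of matching positions,
    measured over the longer sequence, with no gaps allowed."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / longest


def single_linkage(seqs: dict, threshold: float) -> list:
    """Cluster sequence ids so that every member scores at or above
    `threshold` against at least one other member of its cluster."""
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if percent_identity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=lambda c: sorted(c))


def progressive_clusters(seqs: dict, high: int = 99, low: int = 35, step: int = 16) -> dict:
    """Return {threshold: clusters}, stepping from the high threshold down to the low one."""
    return {t: single_linkage(seqs, t) for t in range(high, low - 1, -step)}


if __name__ == "__main__":
    example = {
        "s1": "ACGTACGTAC",
        "s2": "ACGTACGTAA",
        "s3": "ACGTTTTTTT",
        "s4": "GGGGGGGGGG",
    }
    for threshold, clusters in progressive_clusters(example).items():
        print(threshold, clusters)
```

Because single linkage only requires one sufficiently similar partner, clusters can only merge, never split, as the threshold is lowered; this is what makes the resulting clusters progressive.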
Computational clustering can be effected using any commercially available alignment software including the local homology algorithm of Smith & Waterman, Adv. Appl. Math. 2:482 (1981), using the homology alignment algorithm of Needleman & Wunsch, J. Mol. Biol. 48:443 (1970), using the search for similarity method of Pearson & Lipman, Proc. Nat'l. Acad. Sci. USA 85:2444 (1988), or using computerized implementations of algorithms GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package Release 7.0, Genetics Computer Group, 575 Science Dr., Madison, Wis.
Another example of an algorithm which is suitable for sequence alignment is the BLAST algorithm, which is described in Altschul et al., J. Mol. Biol. 215:403-410 (1990). Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/).
Since the present invention requires processing of large amounts of data, sequence alignment is preferably effected using assembly software.
A number of commonly used computer software fragment read assemblers capable of forming clusters of expressed sequences, and aligning members of the cluster (individually or as an assembled contig) with other sequences (e.g., genomic database) are now available. These packages include but are not limited to, The TIGR Assembler [Sutton G. et al. (1995) Genome Science and Technology 1:9-19], GAP [Bonfield J K. et al. (1995) Nucleic Acids Res. 23:4992-4999], CAP2 [Huang X. et al. (1996) Genomics 33:21-31], the Genome Construction Manager [Laurence C B. et al. (1994) Genomics 23:192-201], Bio Image Sequence Assembly Manager, SeqMan [Swindell S R. and Plasterer J N. (1997) Methods Mol. Biol. 70:75-89], and LEADS and GenCarta (Compugen Ltd. Israel).
It will be appreciated that since applying sequence homology analysis on large number of sequences is computationally intensive, local alignment (i.e., the alignment of portions of protein sequences) is preferably effected prior to global alignment (alignment of protein sequences along their entire length), as described in Example 6 of the Examples section.
Once progressive clusters are formed, one or more ontological annotations (i.e., assigning an ontology) are assigned to each cluster.
Systematic and standardized ontological nomenclature is preferably used. Such nomenclature (i.e., keywords) can be obtained from several sources. For example, ontological annotations derived from three main ontologies: molecular function, biological process and cellular component are available from the Gene Ontology Consortium (www.geneontology.org).
Alternatively a list of homogenized ontological nomenclature can be obtained from AcroMed—a computer generated database of biomedical acronyms and the associated long forms extracted from the recent Medline abstracts (http://www.expasy.org/tools/).
Optionally, various conversion tables which link Enzyme Commission number, InterPro protein motifs and SwissProt keywords to gene ontology nodes are also available from www.geneontology.org and can be used with the present method.
Ontologies, sub-ontologies, and their ontological relations (i.e., an inherent relation, in which the sub-ontology “IS THE” ontology, or a composite relation, in which the ontology “HAS” the sub-ontology) can be organized into various computer data structures such as a tree, a map, a graph, a stack or a list. These may also be presented in various data formats such as text, table, html, or extensible markup language (XML).
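As one possible illustration of such a data structure, the following Python sketch stores terms in a graph whose edges carry either of the two relation types. The class name, the relation labels `is_a` and `has`, and the example terms are all hypothetical choices made for this sketch, not part of the described method.

```python
from collections import defaultdict


class OntologyGraph:
    """Directed graph of ontology terms (illustrative sketch).

    Each edge links a sub-ontology to its parent and is labelled either
    "is_a" (inherent relation) or "has" (composite relation, meaning the
    parent HAS the sub-ontology)."""

    RELATIONS = ("is_a", "has")

    def __init__(self):
        # term -> set of (relation, parent) pairs
        self.parents = defaultdict(set)

    def add_relation(self, child: str, relation: str, parent: str) -> None:
        if relation not in self.RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.parents[child].add((relation, parent))

    def ancestors(self, term: str) -> set:
        """All broader terms reachable from `term` through any relation."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for _, parent in self.parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


if __name__ == "__main__":
    graph = OntologyGraph()
    graph.add_relation("kinase", "is_a", "enzyme")
    graph.add_relation("enzyme", "is_a", "molecular function")
    graph.add_relation("nucleolus", "has", "nucleus")  # the nucleus HAS a nucleolus
    print(graph.ancestors("kinase"))
```

A graph is used rather than a strict tree because a term may have more than one parent; a tree is the special case in which every term has exactly one.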
Ontologies and/or subontologies assigned to a specific biomolecular sequence can be derived from an annotation, which is preassociated with at least one biomolecular sequence in a cluster generated as described hereinabove.
For example, biomolecular sequences obtained from an annotated database are typically preassociated with an annotation. An “annotated database” refers to a database of biomolecular sequences, which are at least partially characterized with respect to functional or structural aspects of the sequence. Examples of annotated databases include but are not limited to: GenBank (www.ncbi.nlm.nih.gov/GenBank/), Swiss-Prot (www.expasy.ch/sprot/sprot-top.html), GDB (www.gdb.org/), PIR (www.mips.biochem.mpg.de/proj/prostseqdb/), YDB (www.mips.biochem.mpg.de/proj/yeast/), MIPS (www.mips.biochem.mpg.de/proj/human), HGI (www.tigr.org/tdb/hgi/), Celera Assembled Human Genome (www.celera.com/products/human_ann.cfm) and LifeSeq Gold (http://lifeseqgold.incyte.com). Additional specialized annotated databases include annotative information on metabolic (http://www.genome.ad.jp/kegg/metabolism.html) and regulatory pathways (http://www.genome.ad.jp/kegg/regulation.html), and protein-protein interactions (http://dip.doe-mbi.ucla.edu/), etc.
Alternatively, ontologies can be generated from an analysis of at least one biomolecular sequence in each of the clusters of the present invention.
Preferably, analysis of the biomolecular sequence is effected by literature text mining. Since manual review of related-literature may be a daunting task, computational extraction of text information is preferably effected.
Thus, the method of the present invention can also process literature and other textual information and utilize processed textual data for generating additional ontological annotations. For example, text information contained in the sequence-related publications and definition lines in sequence records of sequence databases can be extracted and processed. Ontological annotations derived from processed text data are then assigned to the sequences in the corresponding clusters.
Ontological annotations can also be extracted from sequence associated Medical subject heading (MeSH) terms which are assigned to published papers.
Additional information on text mining is provided in Example 7 of the Examples section and is disclosed in “Mining Text Using Keyword Distributions,” Ronen Feldman, Ido Dagan, and Haym Hirsh, Proceedings of the 1995 Workshop on Knowledge Discovery in Databases; “Finding Associations in Collections of Text,” Ronen Feldman and Haym Hirsh, Machine Learning and Data Mining: Methods and Applications, edited by R. S. Michalski, I. Bratko, and M. Kubat, John Wiley & Sons, Ltd., 1997; and “Technology Text Mining, Turning Information Into Knowledge: A White Paper from IBM,” edited by Daniel Tkach, Feb. 17, 1998, each of which is fully incorporated herein by reference.
It will be appreciated that text mining may be performed, in this and other embodiments of the present invention, for the text terms extracted from the definitions of gene or protein sequence records, retrievable from databases such as GenBank and Swiss-Prot and title line, abstract of scientific papers, retrievable from Medline database (e.g., http://www.ncbi.nlm.nih.gov/PubMed/).
Computer-dedicated software for biological text analysis is available from http://www.expasy.org/tools/. Examples include, but are not limited to, MedMiner—A software system which extracts and organizes relevant sentences in the literature based on a gene, gene-gene or gene-drug query; Protein Annotator's Assistant—A software system which assists protein annotators in the task of assigning functions to newly sequenced proteins; and XplorMed—A software system which explores a set of abstracts derived from a bibliographic search in MEDLINE.
Alternatively, assignment of ontological annotations may be effected by analyzing molecular, cellular and/or functional traits of the biomolecular sequences. Prediction of cellular localization may be done using any dedicated computer software. For example, prediction of cellular localization can be done using the ProLoc (Einat Hazkani-Covo, Erez Levanon, Galit Rotman, Dan Graur and Amit Novik, a manuscript submitted for publication) computational platform. This software is capable of predicting the cellular localization of polypeptide sequences based on inherent features, including specific localization signatures, protein domains, amino acid composition, pI and protein length. Other examples of cellular localization prediction software include PSORT (prediction of protein sorting signals and localization sites) and TargetP (prediction of subcellular location), both available from http://www.expasy.org/tools/.
Prediction of functional annotations may be effected by motif analysis of the biomolecular sequences of the present invention. Thus for example, by implementing any motif analysis software, which is based on protein homology (see for example, http://motif.genome.ad.jp/ and http://www.accelrys.com/products/grailpro/index.html) it is possible to predict functional motifs of DNA sequences including repeats, promoter sequences and CpG islands and of encoded proteins such as zinc finger and leucine zipper.
Due to the progressive nature of the clusters of the present invention, ontology assignment starts at the highest level of homology. Any biomolecular sequence in the cluster which shares an identical level of homology with an ontologically annotated protein in the cluster is assigned the same ontological annotation. This procedure progresses from the highest level of homology to a lower threshold level with a predetermined increment resolution. Newly discovered homologies enable assignment of existing ontological annotations to biomolecular sequences sharing homologous sequences and being previously unannotated or partially annotated (see Examples 5-9 of the Examples section).
Once assignment of an annotation is effected, annotated clusters are disassembled resulting in annotation of each biomolecular sequence of the cluster.
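The top-down assignment and subsequent disassembly of clusters can be sketched as follows. This is a simplified illustration under stated assumptions: the input layout (`clusters_by_threshold`, `known`) and the function name are hypothetical, annotations are transferred only from sequences that were annotated to begin with, and each newly annotated sequence keeps the highest threshold at which it first shared a cluster with an annotated sequence.

```python
def propagate_annotations(clusters_by_threshold: dict, known: dict) -> dict:
    """Assign annotations to unannotated sequences, highest homology first.

    clusters_by_threshold: {threshold: [set of sequence ids, ...]}
    known:                 {sequence id: set of ontological terms}
    Returns {sequence id: (terms, threshold)} for newly annotated sequences,
    i.e. the per-sequence result after clusters are disassembled.
    """
    inferred = {}
    for threshold in sorted(clusters_by_threshold, reverse=True):
        for cluster in clusters_by_threshold[threshold]:
            # Collect terms from originally annotated members only.
            terms = set()
            for seq_id in cluster:
                terms |= known.get(seq_id, set())
            if not terms:
                continue
            for seq_id in cluster:
                # Keep the first (highest-threshold) assignment.
                if seq_id not in known and seq_id not in inferred:
                    inferred[seq_id] = (set(terms), threshold)
    return inferred


if __name__ == "__main__":
    clusters = {
        90: [{"p1", "q1"}, {"q2"}, {"q3"}],
        50: [{"p1", "q1", "q2"}, {"q3"}],
    }
    known = {"p1": {"kinase activity"}}
    print(propagate_annotations(clusters, known))
```

Recording the threshold alongside each inferred annotation keeps the homology level available for later scoring of possible false annotations.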
Such annotated biomolecular sequences are then tested for false annotation. This is effected using the following scoring parameters:
(i) A degree of homology characterizing the progressive cluster—accuracy of the annotation directly correlates with the homology level used for the annotation process (see Examples 7-9 of the Examples section).
(ii) Relevance of annotation to information obtained from literature text mining—each assigned ontological annotation which results from literature text mining or functional or cellular prediction is assessed using scoring parameters such as LOD score (For further details see Example 7 of the Examples section).
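The scoring formula itself is given in the Examples section rather than here, so the following Python sketch only illustrates the general idea of a log-odds (LOD) score. It assumes, purely for illustration, that the score is the base-10 logarithm of the observed co-occurrence frequency of a text term and an annotation divided by the frequency expected if the two were independent; the function and parameter names are hypothetical.

```python
import math


def lod_score(n_both: int, n_term: int, n_annotation: int, n_total: int) -> float:
    """Illustrative log-odds score (assumed definition, not the patented one).

    n_both:       records containing both the text term and the annotation
    n_term:       records containing the text term
    n_annotation: records carrying the annotation
    n_total:      all records examined
    """
    if min(n_both, n_term, n_annotation, n_total) <= 0:
        raise ValueError("all counts must be positive")
    observed = n_both / n_total
    expected = (n_term / n_total) * (n_annotation / n_total)
    return math.log10(observed / expected)


if __name__ == "__main__":
    score = lod_score(n_both=50, n_term=100, n_annotation=100, n_total=100_000)
    print(f"LOD = {score:.2f}")
```

Under this assumed definition a score of 0 means the term and the annotation co-occur no more often than chance, and a score above 2 means they co-occur more than 100 times as often as expected.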
The present invention also enables the use of the homologies identified according to the teachings of the present invention to annotate more sensitively and rapidly a query sequence. Essentially this involves building a sequence profile for each annotated cluster. A profile enables scoring of a biomolecular sequence according to functional domains along a sequence and generally makes searches more sensitive. Essentially, clustered sequences are also tested for relevance to the cluster based upon shared functional domains and other characteristic sequence features.
Ontologically annotated biomolecular sequences are stored in a database for further use. Additional information on generation and contents of such databases is provided hereinunder.
Such a database can be used to query functional domains and sequences comprising thereof. Alternatively, the database can be used to query a sequence, and retrieve the compatible annotations.
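The two query directions described above can be illustrated with a small in-memory index. This is a sketch only, assuming a hypothetical `AnnotationStore` class and made-up identifiers; an actual database would be persistent and far larger.

```python
from collections import defaultdict


class AnnotationStore:
    """In-memory, two-way index between sequences and annotations (sketch)."""

    def __init__(self):
        self._by_sequence = defaultdict(set)
        self._by_annotation = defaultdict(set)

    def add(self, seq_id: str, annotation: str) -> None:
        # Maintain both directions so either query is a single lookup.
        self._by_sequence[seq_id].add(annotation)
        self._by_annotation[annotation].add(seq_id)

    def annotations_for(self, seq_id: str) -> set:
        """Query by sequence: return its annotations."""
        return set(self._by_sequence.get(seq_id, ()))

    def sequences_with(self, annotation: str) -> set:
        """Query by annotation: return the sequences carrying it."""
        return set(self._by_annotation.get(annotation, ()))


if __name__ == "__main__":
    store = AnnotationStore()
    store.add("seq_001", "zinc finger")
    store.add("seq_001", "nucleus")
    store.add("seq_002", "zinc finger")
    print(store.annotations_for("seq_001"))
    print(store.sequences_with("zinc finger"))
```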
Although the present methodology can be effected using prior art systems modified for such purposes, due to the large amounts of data processed and the vast amounts of processing needed, the present methodology is preferably effected using a dedicated computational system.
Thus, according to another aspect of the present invention and as illustrated in FIGS. 1 a - b , there is provided a system for generating a database of annotated biomolecular sequences.
System 10 includes a processing unit 12 , which executes a software application designed and configured for annotating biomolecular sequences, as described hereinabove. System 10 further serves for storing biomolecular sequence information and annotations in a retrievable/searchable database 18 . Database 18 further includes information pertaining to database generation.
System 10 may also include a user interface 14 (e.g., a keyboard and/or a mouse, monitor) for inputting database or database related information, and for providing database information to a user.
System 10 of the present invention may be any computing platform known in the art including but not limited to a personal computer, a work station, a mainframe and the like.
Preferably, database 18 is stored on a computer readable medium such as a magnetic, optico-magnetic or optical disk.
System 10 of the present invention may be used by a user to query the stored database of annotations and sequence information to retrieve biomolecular sequences stored therein according to inputted annotations or to retrieve annotations according to a biomolecular sequence query.
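The two query directions described above can be sketched as a pair of in-memory indexes. This is only an illustrative sketch: the class name `AnnotationStore`, its methods, and the sample identifiers are hypothetical and are not part of the described system, which may store and index database 18 in any suitable manner.

```python
from collections import defaultdict


class AnnotationStore:
    """Toy in-memory stand-in for an annotation database (hypothetical API)."""

    def __init__(self):
        self._by_sequence = defaultdict(set)    # sequence id -> annotations
        self._by_annotation = defaultdict(set)  # annotation -> sequence ids

    def add(self, seq_id, annotation):
        """Record one annotation for one sequence in both indexes."""
        self._by_sequence[seq_id].add(annotation)
        self._by_annotation[annotation].add(seq_id)

    def annotations_for(self, seq_id):
        """Retrieve annotations according to a sequence query."""
        return sorted(self._by_sequence.get(seq_id, ()))

    def sequences_with(self, *annotations):
        """Retrieve sequences carrying all of the inputted annotations."""
        hits = [self._by_annotation.get(a, set()) for a in annotations]
        return sorted(set.intersection(*hits)) if hits else []


# Hypothetical sample data.
store = AnnotationStore()
store.add("seq_A", "kinase domain")
store.add("seq_A", "membrane")
store.add("seq_B", "kinase domain")
```

Keeping both indexes in step at insertion time makes either query direction a simple lookup, at the cost of storing each association twice.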
It will be appreciated that the connection between user interface 14 and processing unit 12 is bi-directional. Likewise, processing unit 12 and database 18 also share a two-way communication channel, wherein processing unit 12 may also take input from database 18 in performing annotations and iterative annotations. Further, user interface 14 is linked directly to database 18, such that a user may dispatch queries to database 18 and retrieve information stored therein. As such, user interface 14 allows a user to compile queries, send instructions, view querying results and perform specific analyses on the results as needed.
In performing ontological annotations, processing unit 12 may take input from one or more application modules 16. Application module 16 performs a specific operation and produces a relevant annotative input for processing unit 12. For example, application module 16 may perform cellular localization analysis on a biomolecular sequence query, thereby determining the cellular localization of the encoded protein. Such a functional annotation is then input to and used by processing unit 12. Examples of application software for cellular localization prediction are provided hereinabove.
System 10 of the present invention may also be connected to one or more external databases 20. External database 20 is linked to processing unit 12 in a bi-directional manner, similar to the connection between database 18 and processing unit 12. External database 20 may include any background information and/or sequence information that pertains to the biomolecular sequence query. External database 20 may be a proprietary database or a publicly available database which is accessible through a public network such as the Internet. External database 20 may feed relevant information to processing unit 12 as it effects iterative ontological annotation. External database 20 may also receive and store ontological annotations generated by processing unit 12. In this case external database 20 may interact with other components of system 10 like database 18.
It will be appreciated that the databases and application modules of system 10 can be directly connected with processing unit 12 and/or user interface 14 as is illustrated in FIG. 1a, or such a connection can be achieved via a network 22, as is illustrated in FIG. 1b.
Network 22 may be a private network (e.g., a local area network), a secured network, or a public network (such as the Internet), or a combination of public and private and/or secured networks.
Thus, the present invention provides a well characterized approach for the systemic annotation of biomolecular sequences. The use of text information analysis, an annotation scoring system and a robust sequence clustering procedure enables, for the first time, the creation of the best possible annotations and assignment thereof to a vast number of biomolecular sequences sharing homologous sequences. The availability of ontological annotations for a significant number of biomolecular sequences from different species can provide a comprehensive account of sequence, structural and functional information pertaining to the biomolecular sequences of interest.
“Hierarchical annotation” refers to any ontology and subontology, which can be hierarchically ordered. Examples include but are not limited to a tissue expression hierarchy, a developmental expression hierarchy, a pathological expression hierarchy, a cellular expression hierarchy, an intracellular expression hierarchy, a taxonomical hierarchy, a functional hierarchy and so forth.
According to another aspect of the present invention there is provided a method of annotating biomolecular sequences according to a hierarchy of interest. The method is effected as follows.
First, a dendrogram representing the hierarchy of interest is computationally constructed. As used herein a “dendrogram” refers to a branching diagram containing multiple nodes and representing a hierarchy of categories based on degree of similarity or number of shared characteristics.
Each of the multiple nodes of the dendrogram is annotated by at least one keyword describing the node and enabling literature and database text mining, as is further described hereinunder. A list of keywords can be obtained from the GO Consortium (www.geneontology.org); measures are taken to include as many keywords as possible, including keywords which might be out of date. For example, for tissue annotation (see FIG. 4), a hierarchy was built using all tissue/library sources available in GenBank, while considering the following parameters: ignoring GenBank synonyms, building anatomical hierarchies, and enabling flexible distinction between tissue types (normal versus pathology) and tissue classification levels (organs, systems, cell types, etc.).
It will be appreciated that the dendrogram of the present invention can be illustrated as a graph, a list, a map or a matrix or any other graphic or textual organization, which can describe a dendrogram. An example of a dendrogram illustrating the gastrointestinal tissue hierarchy is provided in FIG. 2.
In a second step, each of the biomolecular sequences is assigned to at least one specific node of the dendrogram.
The biomolecular sequences according to this aspect of the present invention can be annotated biomolecular sequences, unannotated biomolecular sequences or partially annotated biomolecular sequences.
Annotated biomolecular sequences can be retrieved from pre-existing annotated databases as described hereinabove.
For example, in GenBank, relevant annotational information is provided in the definition and keyword fields. In this case, classification of the annotated biomolecular sequences to the dendrogram nodes is directly effected. A search for suitable annotated biomolecular sequences is performed using a set of keywords which are designed to classify the biomolecular sequences to the hierarchy (i.e., the same keywords that populate the dendrogram).
In cases where the biomolecular sequences are unannotated or partially annotated, extraction of additional annotational information is effected prior to classification to dendrogram nodes. This can be effected by sequence alignment, as described hereinabove. Alternatively, annotational information can be predicted from structural studies. Where needed, nucleic acid sequences can be transformed to amino acid sequences to thereby enable more accurate annotational prediction.
Finally, each of the assigned biomolecular sequences is recursively classified to nodes hierarchically higher than the specific nodes, such that the root node of the dendrogram encompasses the full biomolecular sequence set, which can be classified according to a certain hierarchy, while the offspring of any node represent a partitioning of the parent set.
For example, a biomolecular sequence found to be specifically expressed in “rhabdomyosarcoma” will also be classified to a higher hierarchy level, which is “sarcoma”, then to “mesenchymal cell tumors” and finally to the highest hierarchy level, “Tumor”. In another example, a sequence found to be differentially expressed in endometrium cells will also be classified to a higher hierarchy level, which is “uterus”, then to “women genital system”, to “genital system” and finally to the highest hierarchy level, “genitourinary system”. Retrieval can be performed according to each one of the requested levels.
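The recursive classification to higher nodes can be sketched as a walk along child-to-parent links. This is a minimal sketch under stated assumptions: the `PARENT` mapping covers only a fragment of a tumor hierarchy modelled on the first example above, the placement of liposarcoma under sarcoma is assumed for illustration, and the function names and sequence identifiers are hypothetical.

```python
from collections import defaultdict

# Child -> parent links for a small, illustrative fragment of a hierarchy.
PARENT = {
    "rhabdomyosarcoma": "sarcoma",
    "liposarcoma": "sarcoma",          # assumed placement, for illustration
    "sarcoma": "mesenchymal cell tumors",
    "mesenchymal cell tumors": "tumor",
}


def path_to_root(node, parent=PARENT):
    """Return the node followed by each of its ancestors up to the root."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def classify(assignments, parent=PARENT):
    """Map every node to the set of sequences assigned to it or below it.

    assignments: iterable of (sequence id, specific node) pairs.
    """
    members = defaultdict(set)
    for seq_id, node in assignments:
        for level in path_to_root(node, parent):
            members[level].add(seq_id)
    return members


members = classify([
    ("s1", "rhabdomyosarcoma"),
    ("s2", "liposarcoma"),
    ("s3", "mesenchymal cell tumors"),
])
```

With this arrangement the root node holds the full classified set, and the sets held by the children of any node are subsets of the set held by that node, so retrieval at any requested level is a single lookup.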
Since annotation of publicly available databases is at times unreliable, newly annotated biomolecular sequences are confirmed using computational or laboratory approaches as is further described hereinbelow.
It will be appreciated that once temporal or spatial annotations of sequences are established using the teachings of the present invention, it is possible to identify those sequences, which are differentially expressed (i.e., exhibit spatial or temporal pattern of expression in diverse cells or tissues). Such sequences are assigned to only a portion of the nodes, which constitute the hierarchical dendrogram.
Changes in gene expression are important determinants of normal cellular physiology, including cell cycle regulation, differentiation and development, and they directly contribute to abnormal cellular physiology, including developmental anomalies, aberrant programs of differentiation and cancer. Accordingly, the identification, cloning and characterization of differentially expressed genes can provide relevant and important insights into the molecular determinants of processes such as growth, development, aging, differentiation and cancer. Additionally, identification of such genes can be useful in development of new drugs and diagnostic methods for treating or preventing the occurrence of such diseases.
Newly annotated sequences identified according to the present invention are tested under physiological conditions (i.e., temperature, pH, ionic strength, viscosity, and like biochemical parameters which are compatible with a viable organism, and/or which typically exist intracellularly in a viable cultured yeast cell or mammalian cell). This can be effected using various laboratory approaches such as, for example, FISH analysis, PCR, RT-PCR, Southern blotting, Northern blotting, electrophoresis and the like (see Examples 13-20 of the Examples section) or more elaborate approaches which are detailed in the Background section.
It will be appreciated that true involvement of differentially expressed genes in a biological process is better confirmed using an appropriate cell or animal model, as further described hereinunder.
Although the present methodology can be effected using prior art systems modified for such purposes, due to the large amounts of data processed and the vast amounts of processing needed, the present methodology is preferably effected using a dedicated computational system.
Such a system is described hereinabove. The system includes a processing unit which executes a software application designed and configured for hierarchically annotating biomolecular sequences as described hereinabove. The system further serves for storing biomolecular sequence information and annotations in a retrievable/searchable database.
The hierarchical annotation approach enables assignment of an appropriate annotation level even in cases where expression is not restricted to a specific tissue type or cell type. For example, different expressed sequences of a single contig which are annotated as being expressed in several different tissue types of a single specific organ or a specific system are also annotated by the present invention to a higher hierarchy level, thus denoting association with the specific organ or system. In such cases using keywords alone would not efficiently identify differentially expressed sequences. Thus, for example, a sequence found to be expressed in sarcoma, Ewing sarcoma tumors, PNET, rhabdomyosarcoma, liposarcoma and mesenchymal cell tumors cannot be assigned to specific sarcomas, but can still be annotated as mesenchymal cell tumor specific. Using this hierarchical annotation approach in combination with advanced sequence clustering and assembly algorithms, capable of predicting alternative splicing, may facilitate a simple and rapid identification of gene expression patterns.
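One way to picture the selection of an appropriate annotation level is as a search for the lowest node that covers all of the observed expression annotations. The sketch below assumes a hypothetical fragment of a tumor hierarchy; in particular, the placement of PNET directly under mesenchymal cell tumors is an assumption made only so that the example above can be reproduced, and the function names are illustrative.

```python
# Child -> parent links; an illustrative fragment, not the actual hierarchy.
PARENT = {
    "Ewing sarcoma": "sarcoma",
    "rhabdomyosarcoma": "sarcoma",
    "liposarcoma": "sarcoma",
    "sarcoma": "mesenchymal cell tumors",
    "PNET": "mesenchymal cell tumors",   # assumed placement
    "mesenchymal cell tumors": "tumor",
}


def path_to_root(node, parent=PARENT):
    """Return the node followed by each of its ancestors up to the root."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def lowest_common_node(nodes, parent=PARENT):
    """Return the deepest node that is on the root path of every input node."""
    paths = [path_to_root(n, parent) for n in nodes]
    common = set(paths[0]).intersection(*paths[1:])
    # Paths run from specific to general, so the first shared node is the lowest.
    for node in paths[0]:
        if node in common:
            return node
    return None


observed = ["sarcoma", "Ewing sarcoma", "PNET", "rhabdomyosarcoma",
            "liposarcoma", "mesenchymal cell tumors"]
level = lowest_common_node(observed)
```

Under these assumptions the observed set resolves to the mesenchymal cell tumors node, whereas a set limited to rhabdomyosarcoma and liposarcoma would resolve one level lower, to sarcoma.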
Although numerous methods have been developed to identify differentially expressed genes, none of these has addressed splice variants, which occur in over 50% of human genes. Given the common sequence features of splice variants, it is very difficult to identify splice variants whose expression is differential using prior art methodologies. Therefore, assigning unique sequence features to differentially expressed splice variants may have an important impact on the understanding of disease development, and such features may serve as valuable markers for various pathologies.
Thus, according to another aspect of the present invention there is provided a method of identifying sequence features unique to differentially expressed mRNA splice variants. The method is effected as follows.
First, unique sequence features are computationally identified in identified splice variants of alternatively spliced expressed sequences.
As used herein the phrase “splice variants” refers to naturally occurring nucleic acid sequences and proteins encoded therefrom which are products of alternative splicing. Alternative splicing refers to intron inclusion, exon exclusion, or any addition or deletion of terminal sequences, which results in sequence dissimilarities between the splice variant sequence and the wild-type sequence.
Although most alternatively spliced variants result from alternative exon usage, some result from the retention of introns not spliced-out in the intermediate stage of RNA transcript processing.
As used herein the phrase “unique sequence features” refers to donor/acceptor concatenations (i.e., exon-exon junctions), intron sequences, alternative exon sequences and alternative polyadenylation sequences.
Once a unique sequence feature is identified, the expression pattern of the splice variant is determined. If the splice variant is differentially expressed then the unique feature thereof is annotated accordingly.
Alternatively spliced expressed sequences of this aspect of the present invention can be retrieved from numerous publicly available databases. Examples include but are not limited to ASDB—an alternative splicing database generated using GenBank and Swiss-Prot annotations (http://cbcg.nersc.gov/asdb); AsMamDB—a database of alternative splices in human, mouse and rat (http://166.111.30.65/ASMAMDB.html); Alternative splicing database—a database of alternative splices from literature (http://cgsigm.cshl.org/new_alt_exon_db2/); Yeast intron database—a database of introns in yeast (http://www.cse.ucsc.edu/research/compbio/yeast_introns.html); The Intronerator—alternative splicing in C. elegans based on analysis of EST data (http://www.cse.ucsc.edu/~kent/intronerator); ISIS—Intron Sequence Information System including a section of human alternative splices (http://isis.bit.uq.edu.au/); TAP—Transcript Assembly Program result of alternative splicing (http://stl.wustl.edu/~zkan/TAP/); and HASDB—a database of alternative splices detected in human EST data.
Additionally, alternative splicing sequence data utilized by this aspect of the present invention can be obtained by any of the following bioinformatical approaches.
(i) Genomically aligned ESTs—the method identifies ESTs which come from the same gene and looks for differences between them that are consistent with alternative splicing, such as a large insertion or deletion in one EST. Each candidate splice variant can be further assessed by aligning the ESTs with the respective genomic sequence. This reveals candidate exons (i.e., matches to the genomic sequence) separated by candidate splices (i.e., large gaps in the EST-genomic alignment). Since intronic sequences at splice junctions (i.e., donor/acceptor concatenations) are highly conserved (essentially 99.24% of introns have a GT-AG at their 5′ and 3′ ends, respectively), sequence data can be used to verify candidate splices [Burset et al. (2000) Nucleic Acids Res. 28:4364-75]; LEADS module [Shoshan et al., Proceeding of SPIE (eds. M. L. Bittner, Y. Chen, A. N. Dorsel, E. D. Dougherty) Vol. 4266, pp. 86-95 (2001); R. Sorek, G. Ast, D. Graur, Genome Res. In press; Compugen Ltd. U.S. patent application Ser. No. 09/133,987].
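The verification of candidate splices against the GT-AG rule can be sketched as follows. The sketch assumes that an EST-to-genome alignment is already available as a list of aligned blocks in 0-based, half-open genomic coordinates on the sense strand; the function names, the `min_gap` threshold and the toy sequences are hypothetical.

```python
def candidate_introns(exon_blocks, min_gap=10):
    """Return large gaps between consecutive aligned blocks as (start, end).

    exon_blocks: (start, end) genomic coordinates, 0-based and half-open.
    min_gap: smallest gap treated as a candidate splice (assumed threshold).
    """
    blocks = sorted(exon_blocks)
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:])
            if next_start - prev_end >= min_gap]


def is_canonical(genome, intron):
    """True if the candidate intron starts with GT and ends with AG."""
    start, end = intron
    seq = genome[start:end].upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


# Toy genomic sequences: two 6-base exons around a 12-base gap.
genome_ok = "ATGGCC" + "GTAAGTTTTCAG" + "GATTAA"
genome_bad = "ATGGCC" + "CTAAGTTTTCAC" + "GATTAA"
blocks = [(0, 6), (18, 24)]
introns = candidate_introns(blocks)
```

A gap that fails the check is not necessarily an artifact, since a small fraction of introns use other boundaries, so in practice such a test would more likely lower confidence in a candidate splice than discard it outright.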
(ii) Identification based on intron information—The method creates a database of individual intron sequences annotated in GenBank and utilizes such sequences to search for EST sequences which include the intronic sequences [Croft et al. (2000) Nat. Genet. 24:340-1].
(iii) EST alignment to expressed sequences—the method looks for insertions and deletions in ESTs relative to a set of known mRNAs. Such a method enables uncovering of alternatively spliced variants without having to align ESTs with genomic sequence [Brett et al. (2000) FEBS Lett. 474:83-86].
It will be appreciated that in order to avoid false positive identification of novel splice isoforms, a set of filters is applied. For example, sequences are filtered to exclude ESTs having sequence deviations, such as chimerism, random variation in a given EST sequence, or potential vector contamination at the ends of an EST.
Filtering can be effected by aligning ESTs with corresponding genomic sequences. Chimeric ESTs can be easily excluded by requiring that each EST aligns completely to a single genomic locus. Genomic location found by homology search and alignment can often be checked against radiation hybrid mapping data [Muneer et al. (2002) Genomics 79:344-8]. Furthermore, since the genomic regions which align with an EST sequence correspond to exon sequences and alignment gaps correspond to introns, the putative splice sites at exon/intron boundaries can be confirmed. Because splice donor and acceptor sites primarily reside within the intron sequence, this methodology can provide validation which is independent of the EST evidence. Reverse transcriptase artifacts or other cDNA synthesis errors may also be filtered out using this approach. Improper inclusion of genomic sequence in ESTs can also be excluded by requiring pairs of mutually exclusive splices in different ESTs.
Additionally, it will be appreciated that observing a given splice variant in one EST but not in a second EST may be insufficient, as the latter can be an un-spliced EST rather than a biologically significant intron inclusion. Therefore, measures are taken to focus on mutually exclusive splice variants, i.e., two different splice variants observed in different ESTs which overlap in a genomic sequence. A more stringent filtering may be applied by requiring two splice variants to share one splice site but differ in another.
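The more stringent filter, which requires two splice variants to share one splice site but differ in the other, can be sketched as a pairwise comparison of observed splices. The representation of a splice as a (donor, acceptor) pair of genomic coordinates, the function name and the sample coordinates are assumptions made for illustration.

```python
from itertools import combinations


def alternative_pairs(splices):
    """Return pairs of distinct splices that share exactly one splice site.

    splices: iterable of (donor, acceptor) genomic coordinates observed in ESTs.
    """
    unique = sorted(set(splices))
    pairs = []
    for a, b in combinations(unique, 2):
        same_donor = a[0] == b[0]
        same_acceptor = a[1] == b[1]
        if same_donor != same_acceptor:  # exactly one of the two sites is shared
            pairs.append((a, b))
    return pairs


# Hypothetical splices collected from several ESTs of one locus.
observed = [(100, 200), (100, 300), (150, 300), (400, 500), (100, 200)]
pairs = alternative_pairs(observed)
```

Here the splice from 400 to 500 is supported by no alternative and is left out, while the remaining three splices form two pairs, one sharing a donor and one sharing an acceptor.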
Once splice variants are identified, identification of unique sequence features therewithin can be effected computationally by identifying insertions, deletions and donor-acceptor concatenations in ESTs relative to mRNA and preferably genomic sequences.
As mentioned hereinabove, once alternatively spliced sequences (having unique sequence features) are identified, determination of their expression patterns is effected in order to assign an annotation to the unique sequence feature thereof.
Expression pattern identification may be effected by qualifying annotations which are preassociated with the alternatively spliced expressed sequences, as described hereinabove. This can be accomplished by scoring the annotations. For example scoring pathological expression annotations can be effected according to: (i) prevalence of the alternatively spliced expressed sequences in normal tissues; (ii) prevalence of the alternatively spliced expressed sequences in pathological tissues; (iii) prevalence of the alternatively spliced expressed sequence in total tissues; and (iv) number of tissues and/or tissue types expressing the alternatively spliced expressed sequences.
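The four quantities listed above can be tallied from per-library EST counts as in the following sketch. The text does not specify how the quantities are combined into a score, so the sketch stops at the tallies; the use of raw EST counts as a measure of prevalence, the record layout and the function name are assumptions.

```python
def expression_summary(counts):
    """Tally the quantities used for scoring pathological expression.

    counts: list of (tissue, state, n_ests) records, where state is
    'normal' or 'pathological' and n_ests is a raw EST count (assumed measure).
    """
    normal = sum(n for _, state, n in counts if state == "normal")
    pathological = sum(n for _, state, n in counts if state == "pathological")
    tissues = {tissue for tissue, _, n in counts if n > 0}
    return {
        "normal": normal,                 # (i) prevalence in normal tissues
        "pathological": pathological,     # (ii) prevalence in pathological tissues
        "total": normal + pathological,   # (iii) prevalence in total tissues
        "n_tissues": len(tissues),        # (iv) number of expressing tissues
    }


# Hypothetical counts for one alternatively spliced sequence.
summary = expression_summary([
    ("colon", "normal", 2),
    ("colon", "pathological", 8),
    ("lung", "pathological", 5),
    ("brain", "normal", 0),
])
```

In a real data set the counts would probably need to be normalized by library size before being compared across tissues.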
Alternatively, identifying the expression pattern of the alternatively spliced expressed sequences of the present invention, is accomplished by identifying the unique sequence feature thereof. This can be effected by any hybridization-based technique known in the art, such as northern blot, dot blot, RNase protection assay, RT-PCR and the like.
To this end, oligonucleotide probes which are substantially homologous to nucleic acid sequences that flank and/or extend across the unique sequence features of the alternatively spliced expressed sequences of the present invention are generated.
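For a unique feature that is an exon-exon junction, a probe sequence extending across the junction can be sketched by joining the end of the upstream exon to the start of the downstream exon. The function names, the symmetric `flank` parameter and the toy exon sequences are hypothetical, and practical probe design would also weigh melting temperature and specificity, which are not addressed here.

```python
def junction_probe(upstream_exon, downstream_exon, flank=10):
    """Return a sense-strand sequence spanning an exon-exon junction.

    flank: number of bases taken from each side of the junction (assumed).
    """
    if len(upstream_exon) < flank or len(downstream_exon) < flank:
        raise ValueError("exon shorter than the requested flank")
    return upstream_exon[-flank:] + downstream_exon[:flank]


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Toy exons; the antisense oligonucleotide would hybridize to the transcript.
sense = junction_probe("AAAACCAT", "GGTCTTTT", flank=4)
antisense = reverse_complement(sense)
```

Because half of the probe derives from each exon, full-length hybridization is expected only with transcripts in which the two exons are actually joined.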
Preferably, oligonucleotides which are capable of hybridizing under stringent, moderate or mild conditions, as used in any polynucleotide hybridization assay are utilized. Further description of hybridization conditions is provided hereinunder.
Oligonucleotides generated by the teachings of the present invention may be used in any modification of nucleic acid hybridization based techniques, which are further detailed hereinunder. General features of oligonucleotide synthesis and modifications are also provided hereinunder.
Aside from being useful in identifying specific splice variants, oligonucleotides generated according to the teachings of the present invention may also be widely used as diagnostic, prognostic and therapeutic agents in a variety of disorders which are associated with specific splice variants.
Regulation of splicing is involved in 15% of genetic diseases [Krawczak et al. (1992) Hum. Genet. 90:41-54] and may contribute, for example, to cancer through mis-splicing of exon 18 in BRCA1, which is caused by a polymorphism in an exonic enhancer [Liu et al. (2001) Nature Genet. 27:55-58].
Thus, oligonucleotides generated according to the teachings of the present invention can be included in diagnostic kits. For example, oligonucleotide sets pertaining to a specific disease associated with differential expression of an alternatively spliced transcript can be packaged in one or more containers with appropriate buffers and preservatives, along with suitable instructions for use, and used for diagnosis or for directing therapeutic treatment. Additional information on such diagnostic kits is provided hereinunder.
It will be appreciated that an ability to identify alternatively spliced sequences, also facilitates identification of the various products of alternative splicing.
Recent studies indicate that most alternative splicing events result in an altered protein product [International human genome sequencing consortium (2001) Nature 409:860-921; Modrek et al. (2001) Nucleic Acids Res. 29:2850-2859]. The majority of these changes appear to have a functional relevance (i.e., up-regulating or down-regulating activity), such as the replacement of the amino or carboxyl terminus, or in-frame addition and removal of a functional domain. For example, alternative splicing can lead to the use of a different site for translation initiation (i.e., alternative initiation), a different translation termination site due to a frameshift (i.e., truncation or extension), or the addition or removal of a stop codon in the alternative coding sequence (i.e., alternative termination). Additionally, alternative splicing can change an internal sequence region due to an in-frame insertion or deletion. One example of the latter is the new FC receptor β-like protein, whose C-terminal transmembrane domain and cytoplasmic tail, which is important for signal transduction in this class of receptors, is replaced with a new transmembrane domain and tail by alternative polyadenylation. Another example is the truncated Growth Hormone Receptor which lacks most of its intracellular domain and has been shown to heterodimerize with the full-length receptor, thus causing inhibition of signaling by Growth Hormone [Ross, R. J. M., Growth hormone & IGF Research, 9:42-46, (1999)].
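The distinction drawn above between in-frame changes and frameshifts follows from the triplet reading frame and can be sketched as a check on the length of the inserted or deleted segment. The function name is hypothetical, and the sketch assumes that the change lies entirely within the coding sequence.

```python
def classify_indel(length_difference):
    """Classify a coding-sequence length change given in nucleotides.

    length_difference: variant length minus wild-type length
    (positive for an insertion, negative for a deletion).
    """
    if length_difference == 0:
        return "no length change"
    # A multiple of three preserves the downstream reading frame.
    return "in-frame" if length_difference % 3 == 0 else "frameshift"


# A 6-base insertion keeps the frame; a 4-base insertion shifts it.
examples = {n: classify_indel(n) for n in (6, -3, 4, -1)}
```

An in-frame change adds or removes whole codons, whereas a frameshift alters every downstream codon and typically leads to a different termination site, i.e., a truncation or extension.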
Thus, assigning a unique sequence feature to a functionally altered splice variant enables identification of such variants. As used herein the phrase “functionally altered splice variants” refers to alternatively spliced expressed sequences whose protein products exhibit gain of function, loss of function or modification of the original function.
As used herein the phrase “gain of function” refers to any alternative splicing product, which exhibits increased functionality as compared to the wild type gene product.
As used herein the phrase “loss of function” refers to any alternative splicing product, which exhibits reduced function as compared to the wild type gene product including any reduction in function, total absence of function or dominant negative function.
As used herein the phrase “dominant negative” refers to the dominant effect of a splice variant on the activity of wild type mRNA. For example, a protein product of an altered splice variant may bind a wild type target protein without enzymatically activating it (e.g., receptor dimers), thus blocking and preventing the active enzymes from binding and activating the target protein.
As used herein the phrase “functional domain” refers to a region of a polypeptide, which displays a particular function. This function may give rise to a biological, chemical, or physiological consequence which may be reversible or irreversible and which may include protein-protein interactions (e.g., binding interactions) involving the functional domain, a change in the conformation or a transformation into a different chemical state of the functional domain or of molecules acted upon by the functional domain, the transduction of an intracellular or intercellular signal, the regulation of gene or protein expression, the regulation of cell growth or death, or the activation or inhibition of an immune response.
Identification of putative functionally altered splice variants, according to this aspect of the present invention, can be effected by identifying sequence deviations from functional domains of wild-type gene products.
Identification of functional domains can be effected by comparing a wild-type gene product with a series of profiles prepared by alignment of well characterized proteins from a number of different species. This generates a consensus profile, which can then be matched with the query sequence. Examples of programs suitable for such identification include, but are not limited to, InterPro Scan—Integrated search in PROSITE, Pfam, PRINTS and other family and domain databases; ScanProsite—Scans a sequence against PROSITE or a pattern against SWISS-PROT and TrEMBL; MotifScan—Scans a sequence against protein profile databases (including PROSITE); Frame-ProfileScan—Scans a short DNA sequence against protein profile databases (including PROSITE); Pfam HMM search—Scans a sequence against the Pfam protein families database; FingerPRINTScan—Scans a protein sequence against the PRINTS Protein Fingerprint Database; FPAT—Regular expression searches in protein databases; PRATT—Interactively generates conserved patterns from a series of unaligned proteins; PPSEARCH—Scans a sequence against PROSITE (allows a graphical output), at EBI; PROSITE scan—Scans a sequence against PROSITE (allows mismatches), at PBIL; PATTINPROT—Scans a protein sequence or a protein database for one or several pattern(s), at PBIL; SMART—Simple Modular Architecture Research Tool, at EMBL; TEIRESIAS—Generates patterns from a collection of unaligned protein or DNA sequences, at IBM; all available from http://www.expasy.org/tools/.
It will be appreciated that functionally altered splice variants may also include a sequence alteration at a post-translational modification consensus site, such as, for example, a tyrosine sulfation site, a glycosylation site, etc. Examples of post-translational modification prediction software include but are not limited to: SignalP—Prediction of signal peptide cleavage sites; ChloroP—Prediction of chloroplast transit peptides; MITOPROT—Prediction of mitochondrial targeting sequences; Predotar—Prediction of mitochondrial and plastid targeting sequences; NetOGlyc—Prediction of type O-glycosylation sites in mammalian proteins; DictyOGlyc—Prediction of GlcNAc O-glycosylation sites in Dictyostelium; YinOYang—O-beta-GlcNAc attachment sites in eukaryotic protein sequences; big-PI Predictor—GPI Modification Site Prediction; DGPI—Prediction of GPI-anchor and cleavage sites (Mirror site); NetPhos—Prediction of Serine, Threonine and Tyrosine phosphorylation sites in eukaryotic proteins; NetPicoRNA—Prediction of protease cleavage sites in picornaviral proteins; NMT—Prediction of N-terminal N-myristoylation; Sulfinator—Prediction of tyrosine sulfation sites; all available from http://www.expasy.org/tools/.
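A sequence alteration at a consensus site can be illustrated with the N-glycosylation consensus, written in PROSITE syntax as N-{P}-[ST]-{P}. The sketch below uses a plain regular expression rather than any of the prediction programs listed above, so a match indicates only that the consensus is present, not that the site is modified; the function name and the toy protein sequences are hypothetical.

```python
import re

# N-{P}-[ST]-{P}: Asn, any residue except Pro, Ser or Thr, any residue except Pro.
# The lookahead allows overlapping sites to be reported.
N_GLYC = re.compile(r"(?=N[^P][ST][^P])")


def n_glyc_sites(protein):
    """Return 0-based start positions of N-glycosylation consensus matches."""
    return [m.start() for m in N_GLYC.finditer(protein.upper())]


# Toy sequences: the variant replaces the serine of the consensus.
wild_type = "MKNGSAQNPTL"
variant = "MKNGAAQNPTL"
wt_sites = n_glyc_sites(wild_type)
variant_sites = n_glyc_sites(variant)
```

In this toy case the wild-type sequence carries one consensus match that is absent from the variant, which would mark the variant as a candidate for altered glycosylation pending experimental verification.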
Once putative functionally altered splice variants are identified, they are validated by experimental verification and functional studies, using methodologies well known in the art.
The Examples section which follows illustrates identification and annotation of splice variants. Identified and annotated sequences are contained within the enclosed CD-ROMs 1-3. Some of these sequences represent (i.e., are transcribed from) entirely new splice variants, while others represent new splice variants of known sequences. In any case, the sequences contained in the enclosed CD-ROM are novel in that they include previously undisclosed sequence regions in the context of a known gene or an entirely new sequence in the context of an unknown gene.
The nucleic acids of the invention can be “isolated” or “purified.” In the event the nucleic acid is genomic DNA, it is considered “isolated” when it does not include coding sequence(s) of a gene or genes immediately adjacent thereto in the naturally occurring genome of an organism; although some or all of the 5′ or 3′ non-coding sequence of an adjacent gene can be included. For example, an isolated nucleic acid (DNA or RNA) can include some or all of the 5′ or 3′ non-coding sequence that flanks the coding sequence (e.g., the DNA sequence that is transcribed into, or the RNA sequence that gives rise to, the promoter or an enhancer in the mRNA). For example, an isolated nucleic acid can contain less than about 5 kb (e.g., less than about 4 kb, 3 kb, 2 kb, 1 kb, 0.5 kb, or 0.1 kb) of the 5′ and/or 3′ sequence that naturally flanks the nucleic acid molecule in a cell in which the nucleic acid naturally occurs. In the event the nucleic acid is RNA or mRNA, it is “isolated” or “purified” from a natural source (e.g., a tissue) or a cell culture when it is substantially free of the cellular components with which it naturally associates in the cell and, if the cell was cultured, the cellular components and medium in which the cell was cultured (e.g., when the RNA or mRNA is in a form that contains less than about 20%, 10%, 5%, 1%, or less, of other cellular components or culture medium). When chemically synthesized, a nucleic acid (DNA or RNA) is “isolated” or “purified” when it is substantially free of the chemical precursors or other chemicals used in its synthesis (e.g., when the nucleic acid is in a form that contains less than about 20%, 10%, 5%, 1%, or less, of the chemical precursors or other chemicals).
Variants, fragments, and other mutant nucleic acids are also envisaged by the present invention. As noted above, where a given biomolecular sequence represents a new gene (rather than a new splice variant of a known gene), the nucleic acids of the invention include the corresponding genomic DNA and RNA. Accordingly, where a given SEQ ID represents a new gene, variations or mutations can occur not only in that nucleic acid sequence, but in the coding regions, the non-coding regions, or both, of the genomic DNA or RNA from which it was made.
The nucleic acids of the invention can be double-stranded or single-stranded and can, therefore, either be a sense strand, an antisense strand, or a portion (i.e., a fragment) of either the sense or the antisense strand. The nucleic acids of the invention can be synthesized using standard nucleotides or nucleotide analogs or derivatives (e.g., inosine, phosphorothioate, or acridine substituted nucleotides), which can alter the nucleic acid's ability to pair with complementary sequences or to resist nucleases. Indeed, the stability or solubility of a nucleic acid can be altered (e.g., improved) by modifying the nucleic acid's base moiety, sugar moiety, or phosphate backbone. For example, the nucleic acids of the invention can be modified as taught by Toulmé [Nature Biotech. 19:17, (2001)] or Faria et al. [Nature Biotech. 19:40-44, (2001)], and the deoxyribose phosphate backbone of nucleic acids can be modified to generate peptide nucleic acids [PNAs; see Hyrup et al., (1996) Bioorganic & Medicinal Chemistry 4:5-23].
PNAs are nucleic acid “mimics”; the molecule's natural backbone is replaced by a pseudopeptide backbone and only the four nucleotide bases are retained. This allows specific hybridization to DNA and RNA under conditions of low ionic strength. PNAs can be synthesized using standard solid phase peptide synthesis protocols as described, for example by Hyrup et al. (supra) and Perry-O'Keefe et al. [Proc. Natl. Acad. Sci. USA (1996) 93:14670-675]. PNAs of the nucleic acids described herein can be used in therapeutic and diagnostic applications.
Moreover, the nucleic acids of the invention include not only protein-encoding nucleic acids per se (e.g., coding sequences produced by the polymerase chain reaction (PCR) or following treatment of DNA with an endonuclease), but also, for example, recombinant DNA that is: (a) incorporated into a vector (e.g., an autonomously replicating plasmid or virus), (b) incorporated into the genomic DNA of a prokaryote or eukaryote, or (c) part of a hybrid gene that encodes an additional polypeptide sequence (i.e., a sequence that is heterologous to the nucleic acid sequences of the present invention or fragments, other mutants, or variants thereof).
This aspect of the present invention includes naturally occurring sequences of the nucleic acid sequences described above, allelic variants (same locus; functional or non-functional), homologs (different locus), and orthologs (different organism) as well as degenerate variants of those sequences and fragments thereof. The degeneracy of the genetic code is well known, and one of ordinary skill in the art will be able to make nucleotide sequences that differ from the nucleic acid sequences of the present invention but nevertheless encode the same proteins as those encoded by the nucleic acid sequences of the present invention. The variant sequences (e.g., degenerate variants) can be used in the same manner as naturally occurring sequences. For example, the variant DNA sequences of the invention can be incorporated into a vector, into the genomic DNA of a prokaryote or eukaryote, or made part of a hybrid gene. Moreover, variants (or, where appropriate, the proteins they encode) can be used in the diagnostic assays and therapeutic regimes described below.
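By way of illustration only, the notion of a degenerate variant can be sketched in a few lines of Python. The codon table below is deliberately partial (it holds only the standard-code codons used in the example), and the function names and the two example sequences are hypothetical; none of them is taken from the sequences of the invention.

```python
# Partial standard genetic code: only the codons needed for this example.
CODON_TABLE = {
    "ATG": "M",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "CTG": "L", "TTA": "L",
    "AAA": "K", "AAG": "K",
    "TAA": "*", "TGA": "*",
}


def translate(dna: str) -> str:
    """Translate a coding-strand DNA sequence codon by codon."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def is_degenerate_variant(seq_a: str, seq_b: str) -> bool:
    """True if the sequences differ in nucleotides but encode the same protein."""
    return seq_a.upper() != seq_b.upper() and translate(seq_a) == translate(seq_b)


# Hypothetical example sequences (not from the disclosed sequence files).
original = "ATGGCTCTGAAATAA"  # M A L K stop
variant = "ATGGCGTTAAAGTGA"   # M A L K stop, using synonymous codons
```

Here `is_degenerate_variant(original, variant)` evaluates to true because the two sequences differ at the nucleotide level yet translate to the same amino acid sequence.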
The sequence of nucleic acids of the invention can also be varied to maximize expression in a particular expression system. For example, as few as one and as many as about 20% of the codons in a given sequence can be altered to optimize expression in bacterial cells (e.g., E. coli), yeast, human, insect, or other cell types (e.g., CHO cells).
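As a rough sketch of the "as few as one and as many as about 20%" range, the following Python fragment counts how many codons differ between an original coding sequence and an altered one. The function names, the 0.20 cutoff used as a default, and the two example sequences are assumptions made for illustration; the fragment does not itself choose preferred codons for any host.

```python
def codon_change_fraction(original: str, altered: str) -> float:
    """Fraction of codon positions that differ between two coding sequences."""
    if len(original) != len(altered) or len(original) % 3 != 0:
        raise ValueError("sequences must have equal length, a multiple of 3")
    positions = range(0, len(original), 3)
    changed = sum(original[i:i + 3] != altered[i:i + 3] for i in positions)
    return changed / (len(original) // 3)


def within_stated_range(original: str, altered: str, max_fraction: float = 0.20) -> bool:
    """True if at least one codon, and no more than max_fraction of codons, changed."""
    fraction = codon_change_fraction(original, altered)
    return 0 < fraction <= max_fraction


# Hypothetical 10-codon sequences; codons 2 and 7 carry synonymous changes.
before = "ATGGCTCTGAAAGCTCTGAAAGCTCTGTAA"
after = "ATGGCGCTGAAAGCTCTGAAGGCTCTGTAA"
```

With these inputs two of ten codons differ, so `codon_change_fraction(before, after)` is 0.2 and `within_stated_range(before, after)` is true.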
The nucleic acids of the invention can also be shorter or longer than those disclosed on CD-ROMs 1 and 2. Where the nucleic acids of the invention encode proteins, the protein-encoding sequences can differ from those represented by specific sequences of file “Protein.seqs” in CD-ROM 2. For example, the encoded proteins can be shorter or longer than those encoded by one of the nucleic acid sequences of the present invention. Nucleotides can be deleted from, or added to, either or both ends of the nucleic acid sequences of the present invention or the novel portions of the sequences that represent new splice variants. Alternatively, the nucleic acids can encode proteins in which one or more amino acid residues have been added to, or deleted from, one or more sequence positions within the nucleic acid sequences.
The nucleic acid fragments can be short (e.g., 15-30 nucleotides). For example, in cases where peptides are to be expressed therefrom such polynucleotides need only contain a sufficient number of nucleotides to encode novel antigenic epitopes. In cases where nucleic acid fragments serve as DNA or RNA probes or PCR primers, fragments are selected of a length sufficient for specific binding to one of the sequences representing a novel gene or a unique portion of a novel splice variant.
Nucleic acids used as probes or primers are often referred to as oligonucleotides, and they can hybridize with a sense or antisense strand of DNA or RNA. Nucleic acids that hybridize to a sense strand (i.e., a nucleic acid sequence that encodes protein, e.g., the coding strand of a double-stranded cDNA molecule) or to an mRNA sequence are referred to as antisense oligonucleotides. Antisense oligonucleotides can be used to specifically inhibit transcription of any of the nucleic acid sequences of the present invention.
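The complementarity relationship described above can be sketched as follows, assuming a DNA oligonucleotide directed against an mRNA target. The function names and the nine-nucleotide target are hypothetical, and the sketch addresses base pairing only, not delivery, stability, or target accessibility.

```python
# Watson-Crick partner, as a DNA base, for each RNA base of the target.
_RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def antisense_dna(mrna: str) -> str:
    """Return the DNA oligonucleotide (5' to 3') complementary to an mRNA (5' to 3')."""
    mrna = mrna.upper()
    # Complement each base, then reverse so the result also reads 5' to 3'.
    return "".join(_RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna))


def is_fully_complementary(oligo: str, mrna: str) -> bool:
    """True if the oligonucleotide pairs with every base of the mRNA target."""
    return oligo.upper() == antisense_dna(mrna)


# Hypothetical target region of an mRNA.
target = "AUGGCUCUG"
oligo = antisense_dna(target)  # "CAGAGCCAT"
```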
Design of antisense molecules must be effected while considering two aspects important to the antisense approach. The first aspect is delivery of the oligonucleotide into the cytoplasm of the appropriate cells, while the second aspect is design of an oligonucleotide which specifically binds the designated mRNA within cells in a way which inhibits translation thereof.
The prior art teaches a number of delivery strategies which can be used to efficiently deliver oligonucleotides into a wide variety of cell types [see, for example, Luft (1998) J Mol Med 76 (2): 75-6; Kronenwett et al. (1998) Blood 91 (3): 852-62; Rajur et al. (1997) Bioconjug Chem 8 (6): 935-40; Lavigne et al. (1997) Biochem Biophys Res Commun 237 (3): 566-71 and Aoki et al. (1997) Biochem Biophys Res Commun 231 (3): 540-5].
In addition, algorithms for identifying those sequences with the highest predicted binding affinity for their target mRNA based on a thermodynamic cycle that accounts for the energetics of structural alterations in both the target mRNA and the oligonucleotide are also available [see, for example, Walton et al. (1999) Biotechnol Bioeng 65 (1): 1-9].
Such algorithms have been successfully used to implement an antisense approach in cells. For example, the algorithm developed by Walton et al. enabled scientists to successfully design antisense oligonucleotides for rabbit beta-globin (RBG) and mouse tumor necrosis factor-alpha (TNF alpha) transcripts. The same research group has more recently reported that the antisense activity of rationally selected oligonucleotides against three model target mRNAs (human lactate dehydrogenase A and B and rat gp130) in cell culture as evaluated by a kinetic PCR technique proved effective in almost all cases, including tests against three different targets in two cell types with phosphodiester and phosphorothioate oligonucleotide chemistries.
In addition, several approaches for designing and predicting efficiency of specific oligonucleotides using an in vitro system were also published (Matveeva et al. (1998) Nature Biotechnology 16, 1374-1375).
Several clinical trials have demonstrated safety, feasibility and activity of antisense oligonucleotides. For example, antisense oligonucleotides suitable for the treatment of cancer have been successfully used (Holmund et al. (1999) Curr Opin Mol Ther 1 (3):372-85), while treatment of hematological malignancies via antisense oligonucleotides targeting c-myb gene, p53 and Bcl-2 had entered clinical trials and had been shown to be tolerated by patients [Gerwitz (1999) Curr Opin Mol Ther 1 (3):297-306].
More recently, antisense-mediated suppression of human heparanase gene expression has been reported to inhibit pleural dissemination of human cancer cells in a mouse model [Uno et al. (2001) Cancer Res 61 (21):7855-60].
Thus, the current consensus is that recent developments in the field of antisense technology which, as described above, have led to the generation of highly accurate antisense design algorithms and a wide variety of oligonucleotide delivery systems, enable an ordinarily skilled artisan to design and implement antisense approaches suitable for downregulating expression of known sequences without having to resort to undue trial and error experimentation.
Antisense oligonucleotides can also be α-anomeric nucleic acids, which form specific double-stranded hybrids with complementary RNA in which, contrary to the usual β-units, the strands run parallel to each other [Gaultier et al., Nucleic Acids Res. 15:6625-6641, (1987)]. Alternatively, antisense nucleic acids can comprise a 2′-O-methylribonucleotide [Inoue et al., Nucleic Acids Res. 15:6131-6148, (1987)] or a chimeric RNA-DNA analogue [Inoue et al., FEBS Lett. 215:327-330, (1987)].
The nucleic acid sequences described above can also include ribozyme catalytic sequences. Such a ribozyme will have specificity for a protein encoded by the novel nucleic acids described herein (by virtue of having one or more sequences that are complementary to the cDNAs that represent novel genes or the novel portions (i.e., the portions not found in related splice variants) of the sequences that represent new splice variants). These ribozymes can include a catalytic sequence encoding a protein that cleaves mRNA [see U.S. Pat. No. 5,093,246 or Haselhoff and Gerlach, Nature 334:585-591, (1988)]. For example, a derivative of a Tetrahymena L-19 IVS RNA can be constructed in which the nucleotide sequence of the active site is complementary to the nucleotide sequence to be cleaved in an mRNA of the invention (e.g., one of the nucleic acid sequences of the present invention; see U.S. Pat. Nos. 4,987,071 and 5,116,742). Alternatively, the mRNA sequences of the present invention can be used to select a catalytic RNA having a specific ribonuclease activity from a pool of RNA molecules [see, e.g., Bartel and Szostak, Science 261:1411-1418, (1993); see also Krol et al., BioTechniques 6:958-976, (1988)].
Fragments having as few as 9-10 nucleotides (e.g., 12-14, 15-17, 18-20, 21-23, or 24-27 nucleotides) can be useful as probes or expression templates and are within the scope of the present invention. Indeed, fragments that contain about 15-20 nucleotides can be used in Southern blotting, Northern blotting, dot or slot blotting, PCR amplification methods (where naturally occurring or mutant nucleic acids are amplified), colony hybridization methods, in situ hybridization, and the like.
The present invention also encompasses pairs of oligonucleotides (these can be used, for example, to amplify the new genes, or portions thereof, or the novel portions of the splice variant in, for example, potentially diseased tissue) and groups of oligonucleotides (e.g., groups that exhibit a certain degree of homology (e.g., nucleic acids that are 90% identical to one another) or that share one or more functional attributes).
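A minimal sketch of the identity criterion mentioned above (e.g., nucleic acids that are 90% identical to one another) is given below. It assumes the two oligonucleotides are of equal length and already aligned position by position, without gaps; the function name and the two 20-mers are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two equal-length sequences match."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100 * matches / len(seq_a)


# Hypothetical 20-mers that differ at the last two positions.
oligo_1 = "ACGTACGTACGTACGTACGT"
oligo_2 = "ACGTACGTACGTACGTACAA"
```

For these two sequences 18 of 20 positions match, so `percent_identity(oligo_1, oligo_2)` is 90.0. Sequences of unequal length would first require an alignment step, which this sketch does not attempt.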
When used, for example, as probes, the nucleic acids of the invention can be labeled with a radioactive isotope (e.g., using polynucleotide kinase to add 32P-labeled ATP to the oligonucleotide used as the probe) or an enzyme. Other labels, such as chemiluminescent, fluorescent, or colorimetric labels, can be used.
As noted above, the invention features nucleic acids that are complementary to those represented by the nucleic acid sequences of the present invention or novel portions thereof (i.e., novel fragments) and as such are capable of hybridizing therewith. In many cases, nucleic acids that are used as probes or primers are absolutely or completely complementary to all, or a portion of, the target sequence. However, this is not always necessary. The sequence of a useful probe or primer can differ from that of a target sequence so long as it hybridizes with the target under the stringency conditions described herein (or the conditions routinely used to amplify sequences by PCR) to form a stable duplex.
Hybridization of a nucleic acid probe to sequences in a library or other sample of nucleic acids is typically performed under moderate to high stringency conditions. Nucleic acid duplex or hybrid stability is expressed as the melting temperature (Tm), which is the temperature at which a probe dissociates from a target DNA and, therefore, helps define the required stringency conditions. To identify sequences that are related or substantially identical to that of a probe, it is useful to first establish the lowest temperature at which only homologous hybridization occurs with a particular concentration of salt (e.g., SSC or SSPE). (The terms “identity” or “identical” as used herein are equated with the terms “homology” or “homologous”). Then, assuming a 1% mismatch requires a 1° C. decrease in the Tm, the temperature of the wash (e.g., the final wash) following the hybridization reaction is reduced accordingly. For example, if sequences having at least 95% identity with the probe are sought, the final wash temperature is decreased by 5° C. In practice, the change in Tm can be between 0.5° C. and 1.5° C. per 1% mismatch.
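The wash-temperature arithmetic described above can be sketched as follows. The function and parameter names are hypothetical, and the 68° C. starting temperature is an assumed value chosen only for the example; in practice the starting temperature is the empirically established temperature at which only homologous hybridization occurs at the chosen salt concentration.

```python
def final_wash_temperature(
    homologous_temp_c: float,
    min_identity_pct: float,
    degrees_per_pct_mismatch: float = 1.0,
) -> float:
    """Lower the wash temperature in proportion to the tolerated mismatch.

    homologous_temp_c: temperature at which only homologous hybridization occurs.
    min_identity_pct: lowest percent identity with the probe that is sought.
    degrees_per_pct_mismatch: Tm decrease per 1% mismatch (1.0 assumed by default;
        the text gives 0.5 to 1.5 as the range observed in practice).
    """
    if not 0 <= min_identity_pct <= 100:
        raise ValueError("identity must be between 0 and 100 percent")
    mismatch_pct = 100 - min_identity_pct
    return homologous_temp_c - mismatch_pct * degrees_per_pct_mismatch


# Assumed starting temperature of 68 degrees C; sequences of at least 95% identity sought.
wash_nominal = final_wash_temperature(68.0, 95.0)    # 5 degrees C lower
wash_low = final_wash_temperature(68.0, 95.0, 0.5)   # smallest practical correction
wash_high = final_wash_temperature(68.0, 95.0, 1.5)  # largest practical correction
```

Under these assumptions the nominal final wash is at 63° C., and the practical range of 0.5° C. to 1.5° C. per 1% mismatch brackets it between 65.5° C. and 60.5° C.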
The hybridization conditions described here can be employed when the nucleic acids of the invention are used in, for example, diagnostic assays, or when one wishes to identify, for example, the homologous genes that fall within the scope of the invention (as stated elsewhere, the invention encompasses allelic variants, homologues and orthologues of the sequences that represent new genes). Homologous genes will hybridize with the sequences that represent new genes under a stringency condition described herein.
A hybridization reaction is carried out at “high stringency” if hybridization (between the probe and a potential target sequence) is carried out at 68° C. in (a) 5×SSC/5× Denhardt's solution/1.0% SDS, (b) 0.5 M NaHPO4 (pH 7.2)/1 mM EDTA/7% SDS, or (c) 50% formamide/0.25 M NaHPO4 (pH 7.2)/0.25 M NaCl/1 mM EDTA/7% SDS, and washing is carried out with (a) 0.2×SSC/0.1% SDS at room temperature or at 42° C., (b) 0.1×SSC/0.1% SDS at 68° C., or (c) 40 mM NaHPO4 (pH 7.2)/1 mM EDTA and either 1% or 5% SDS at 50° C.
“Moderately stringent” conditions constitute the hybridization con