Title:
System and method for identification of MicroRNA precursor sequences and corresponding mature MicroRNA sequences from genomic sequences
Kind Code:
A1


Abstract:
A method for determining microRNA precursors and their corresponding mature microRNAs from genomic sequences is provided. For example, in one aspect of the invention, a method for determining whether a nucleotide sequence contains a microRNA precursor comprises the following steps. Patterns are generated by processing a collection of already known microRNA precursor sequences. One or more attributes are assigned to the generated patterns. Only the patterns whose attributes satisfy certain criteria are subselected, and then the subselected patterns are used to analyze the nucleotide sequence. In another aspect of the invention, a method for identifying a mature microRNA sequence in a microRNA precursor sequence comprises the following steps. One or more patterns are generated by processing a collection of known mature microRNA sequences. The one or more patterns are filtered, and then used to locate instances of the one or more filtered patterns in one or more candidate precursor sequences.



Inventors:
Huynh, Tien (Yorktown Heights, NY, US)
Miranda, Kevin Charles (McDowall, AU)
Rigoutsos, Isidore (Astoria, NY, US)
Application Number:
11/351951
Publication Date:
11/23/2006
Filing Date:
02/10/2006
Assignee:
International Business Machines Corporation (Armonk, NY, US)
Primary Class:
Other Classes:
702/20, 435/6.16
International Classes:
C12Q1/68; G06F19/00



Primary Examiner:
HARWARD, SOREN T
Attorney, Agent or Firm:
Ryan, Mason & Lewis, LLP (2425 Post Road Suite 204, Southport, CT, 06890, US)
Claims:
What is claimed is:

1. A method for determining whether a nucleotide sequence contains a microRNA precursor, the method comprising the steps of: generating one or more patterns by processing a collection of known microRNA precursor sequences; assigning one or more attributes to the one or more generated patterns; subselecting one or more patterns whose one or more attributes satisfy at least one criterion; and using the one or more subselected patterns to analyze the nucleotide sequence, such that a determination is made whether the nucleotide sequence contains a microRNA precursor.

2. The method of claim 1, wherein the nucleotide sequence is from an intergenic region.

3. The method of claim 1, wherein the nucleotide sequence is from an intronic region.

4. The method of claim 1, wherein the nucleotide sequence is from an amino acid coding region.

5. The method of claim 1, wherein the step of generating one or more patterns comprises using a pattern discovery algorithm.

6. The method of claim 5, wherein the pattern discovery algorithm is the Teiresias pattern discovery algorithm.

7. The method of claim 1, wherein the step of assigning one or more attributes is carried out independently of and prior to the step of using the one or more subselected patterns to analyze a nucleotide sequence.

8. The method of claim 1, wherein the one or more attributes are quantitative.

9. The method of claim 8, wherein at least one of the one or more attributes represents statistical significance.

10. The method of claim 8, wherein at least one of the one or more attributes represents a length of the pattern.

11. The method of claim 8, wherein at least one of the one or more attributes represents a number of positions in the one or more patterns which are not occupied by wild cards.

12. The method of claim 8, wherein a threshold value for each attribute is selected.

13. The method of claim 12, wherein one or more patterns are discarded if the value of the one or more attributes of the pattern is below the selected threshold for the one or more attributes.

14. The method of claim 13, wherein the steps of selecting a threshold value and discarding one or more patterns are repeated for all used attributes.

15. The method of claim 1, wherein a set of counters is created for the nucleotide sequence.

16. The method of claim 15, wherein the counters in the set of counters equal the number of nucleotides in the nucleotide sequence.

17. The method of claim 1, wherein all patterns are examined.

18. The method of claim 17, wherein each pattern with an instance in the nucleotide sequence contributes to the counters at corresponding positions of the nucleotide sequence.

19. The method of claim 18, wherein only consecutive positions in the nucleotide sequences whose corresponding counter values exceed a threshold are considered.

20. The method of claim 19, wherein one or more groups of consecutive positions are considered only if they satisfy a minimum length criterion.

21. The method of claim 20, wherein a secondary structure of each consecutive group of positions is estimated using an RNA secondary structure prediction method.

22. The method of claim 21, wherein the prediction method is one included with software known as the Vienna Package.

23. The method of claim 21, wherein the prediction method is a method called ‘mfold’.

24. The method of claim 21, wherein the predicted structure is assigned one or more attributes.

25. The method of claim 24, wherein at least one of the one or more attributes is folding energy of a formed complex.

26. The method of claim 24, wherein a threshold value for the one or more attributes is selected.

27. The method of claim 24, wherein a complex is discarded if the value of the one or more attributes is below the selected threshold for the one or more attributes.

28. The method of claim 27, wherein the steps of selecting a threshold value and discarding a complex are repeated for all used attributes.

29. The method of claim 28, wherein the nucleotide sequence is reported as a microRNA precursor if the predicted structure that corresponds to the nucleotide sequence has not been discarded.

30. A system for determining whether a nucleotide sequence contains a microRNA precursor, comprising: a memory that stores computer-readable code; and a processor operatively coupled to the memory, the processor configured to implement the computer-readable code, the computer-readable code configured to: generate one or more patterns by processing a collection of known microRNA precursor sequences; assign one or more attributes to the one or more generated patterns; subselect the one or more patterns whose one or more attributes satisfy at least one criterion; and use the one or more subselected patterns to analyze the nucleotide sequence, such that a determination is made whether a nucleotide sequence contains a microRNA precursor.

31. An article of manufacture for determining whether a nucleotide sequence contains a microRNA precursor, comprising: a computer-readable medium having computer-readable code embodied thereon, the computer-readable code comprising: a step to generate one or more patterns by processing a collection of known microRNA precursor sequences; a step to assign one or more attributes to the one or more generated patterns; a step to subselect the one or more patterns whose one or more attributes satisfy at least one criterion; and a step to use the one or more subselected patterns to analyze the nucleotide sequence, such that a determination is made whether a nucleotide sequence contains a microRNA precursor.

32. A method for identifying a mature microRNA sequence in a microRNA precursor sequence, comprising the steps of: generating one or more patterns by processing a collection of known mature microRNA sequences; filtering the one or more patterns; and locating instances of the one or more filtered patterns in one or more candidate precursor sequences.

33. A system for identifying a mature microRNA sequence in a microRNA precursor sequence, comprising: a memory that stores computer-readable code; and a processor operatively coupled to the memory, the processor configured to implement the computer-readable code, the computer-readable code configured to: generate one or more patterns by processing a collection of known mature microRNA sequences; filter the one or more patterns; and locate instances of the one or more filtered patterns in one or more candidate precursor sequences.

34. An article of manufacture for identifying a mature microRNA sequence in a microRNA precursor sequence, comprising: a computer-readable medium having computer-readable code embodied thereon, the computer-readable code comprising: a step to generate one or more patterns by processing a collection of known mature microRNA sequences; a step to filter the one or more patterns; and a step to locate instances of the one or more filtered patterns in one or more candidate precursor sequences.

35. A method for determining whether a nucleotide sequence contains a mature microRNA, the method comprising the steps of: generating one or more patterns by processing a collection of known mature microRNA sequences; assigning one or more attributes to the one or more generated patterns; subselecting one or more patterns whose one or more attributes satisfy at least one criterion; and using the one or more subselected patterns to analyze the nucleotide sequence, such that a determination is made whether the nucleotide sequence contains a mature microRNA.

36. The method of claim 35, wherein the nucleotide sequence is from an intergenic region.

37. The method of claim 35, wherein the nucleotide sequence is from an intronic region.

38. The method of claim 35, wherein the nucleotide sequence is from an amino acid coding region.

39. The method of claim 35, wherein the step of generating one or more patterns comprises using a pattern discovery algorithm.

40. The method of claim 39, wherein the pattern discovery algorithm is the Teiresias pattern discovery algorithm.

41. The method of claim 35, wherein the step of assigning one or more attributes is carried out independently of and prior to the step of using the one or more subselected patterns to analyze a nucleotide sequence.

42. The method of claim 35, wherein the one or more attributes are quantitative.

43. The method of claim 42, wherein at least one of the one or more attributes represents statistical significance.

44. The method of claim 42, wherein at least one of the one or more attributes represents a length of the pattern.

45. The method of claim 42, wherein at least one of the one or more attributes represents a number of positions in the one or more patterns which are not occupied by wild cards.

46. The method of claim 42, wherein a threshold value for each attribute is selected.

47. The method of claim 46, wherein one or more patterns are discarded if the value of the one or more attributes of the pattern is below the selected threshold for the one or more attributes.

48. The method of claim 47, wherein the steps of selecting a threshold value and discarding one or more patterns are repeated for all used attributes.

49. The method of claim 35, wherein a set of counters is created for the nucleotide sequence.

50. The method of claim 49, wherein the counters in the set of counters equal the number of nucleotides in the nucleotide sequence.

51. The method of claim 35, wherein all patterns are examined.

52. The method of claim 51, wherein each pattern with an instance in the nucleotide sequence contributes to the counters at the corresponding positions of the nucleotide sequence.

53. The method of claim 52, wherein only consecutive positions in the nucleotide sequences whose corresponding counter values exceed a threshold are considered.

54. The method of claim 53, wherein one or more groups of consecutive positions are considered only if they satisfy a minimum length criterion.

55. The method of claim 42, wherein a threshold value for the one or more attributes is selected.

56. The method of claim 55, wherein a group of consecutive positions is discarded if the value of the one or more attributes is below the selected threshold for the one or more attributes.

57. The method of claim 56, wherein the steps of selecting a threshold value and discarding a group of consecutive positions are repeated for all used attributes.

58. The method of claim 57, wherein the group of consecutive positions that has not been discarded is reported as a mature microRNA.

Description:

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 60/652,499, filed Feb. 11, 2005, the disclosure of which is incorporated by reference herein.

This application is related to U.S. patent application entitled “System and Method for Identification of MicroRNA Target Sites and Corresponding Targeting MicroRNA Sequences,” Attorney Docket Number YOR920060077US1, filed concurrently herewith, the disclosure of which is incorporated by reference herein. Also, this application is related to U.S. patent application entitled “Ribonucleic Acid Interference Molecules,” Attorney Docket Number YOR920040675US2, filed concurrently herewith, the disclosure of which is incorporated by reference herein.

FIELD OF THE INVENTION

The present invention relates to genes and, more particularly, to ribonucleic acid interference molecules and their role in gene expression.

BACKGROUND OF THE INVENTION

The ability of an organism to regulate the expression of its genes is of central importance to life. A breakdown in this homeostasis leads to disease states, such as cancer, where a cell multiplies uncontrollably, to the detriment of the organism. The general mechanisms utilized by organisms to maintain this gene expression homeostasis are the focus of intense scientific study.

It recently has been discovered that some cells are able to down-regulate their gene expression through certain ribonucleic acid (RNA) molecules. Namely, RNA molecules can act as potent gene expression regulators either by inducing messenger-RNA (mRNA) degradation or by inhibiting translation. This activity is summarily referred to as post-transcriptional gene silencing, or PTGS for short. An alternative name by which it is also known is RNA interference, or RNAi. PTGS/RNAi has been found to function as a mediator of resistance to endogenous and exogenous pathogenic nucleic acids and also as a regulator of the expression of genes inside cells.

The term ‘gene expression,’ as used herein, refers generally to the transcription of messenger-RNA (mRNA) from a gene, and, e.g., its subsequent translation into a functional protein. One class of RNA molecules involved in gene expression regulation comprises microRNAs, which are endogenously encoded and regulate gene expression by either disrupting the translation processes or by degrading mRNA transcripts, e.g., inducing post-transcriptional repression of one or more target sequences.

The RNAi/post-transcriptional gene silencing mechanism allows an organism to employ short RNA sequences to either degrade or disrupt translation of complementary mRNA transcripts. Early studies suggested only a limited role for RNAi, that of a defense mechanism against foreign born pathogens. However, the subsequent discovery of many endogenously-encoded microRNAs pointed towards the possibility of this being a control mechanism that is more general in nature. Recent evidence has led the community to hypothesize that a wider spectrum of biological processes is affected by RNAi, thus extending the range of this presumed control layer.

To date, there have been relatively few attempts to devise new methods for finding novel microRNA precursors and their associated mature microRNAs. This is likely connected to a belief that is held by the research community at large according to which all of the relevant mature microRNAs and their precursors for the most important model organisms have already been identified using biochemical methods. The existing methods can be categorized into two basic approaches.

In the first approach, the methods begin by predicting the RNA secondary structure of candidate sequences using any of the available prediction programs (e.g., “RNAfold” or “mfold”). The methods then focus on only those sequences that are predicted to fold into the familiar hairpin-like structure of microRNA precursors, subselecting those that satisfy additional sequence or other properties (Lai E C, Tomancak P, Williams R W, Rubin G M. (2003) Computational identification of Drosophila microRNA genes. Genome Biol 4(7): R42; Lim L P, Glasner M E, Yekta S, Burge C B, Bartel D P (2003b) Vertebrate microRNA genes. Science 299: 1540; Lim L P, Lau N C, Weinstein E G, Abdelhakim A, Yekta S, Rhoades M W, Burge C B, Bartel D P (2003a) The microRNAs of Caenorhabditis elegans. Genes and Development 17: 991-1008; I. Bentwich et al., “Identification of hundreds of conserved and nonconserved human microRNAs,” Nature Genetics, published online Jun. 19, 2005. DOI: 10.1038/ng1590).

The second type of approach uses the observation that the two arms of the hairpin of a precursor exhibit a much higher degree of sequence conservation than the regions outside the precursor and also the region in the loop of the precursor. This observation was combined with additional, known properties of microRNAs and led to the successful discovery of many novel mature microRNA and microRNA precursors (Berezikov, E., Guryev, V., van de Belt, J., Wienholds, E., Plasterk, R. H. A., Cuppen, E. (2005) Phylogenetic shadowing and computational identification of human microRNA genes. Cell 120: 21-24).

The inventive approach that we present in the discussion below represents a departure from the above two schools of thought. Even though the inventive approach exploits sequence conservation to discover microRNA precursors, the inventive approach does so locally, i.e. the approach seeks to leverage the existence of locally conserved sequence fragments that are shared by known precursors that could potentially be distant from a phylogenetic standpoint.

A better understanding of the mechanism of the RNA interference process would benefit the fight against disease, drug design and host defense mechanisms.

SUMMARY OF THE INVENTION

A method for identifying microRNA precursor sequences and corresponding mature microRNA sequences from genomic sequences is provided. For example, in one aspect of the invention, a method for determining whether a nucleotide sequence contains a microRNA precursor comprises the following steps. One or more patterns are generated by processing a collection of known microRNA precursor sequences. One or more attributes are assigned to the one or more generated patterns. Only the one or more patterns whose one or more attributes satisfy at least one criterion are subselected, and then the one or more subselected patterns are used to analyze the nucleotide sequence.

In another aspect of the invention, a method for identifying a mature microRNA sequence in a microRNA precursor sequence comprises the following steps. One or more patterns are generated by processing a collection of known mature microRNA sequences. The one or more patterns are filtered, and then used to locate instances of the one or more filtered patterns in one or more candidate precursor sequences.

A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a flow diagram illustrating a method for identifying a microRNA precursor sequence, according to one embodiment of the invention;

FIG. 1B is a flow diagram illustrating a method for identifying a mature microRNA sequence in a microRNA precursor sequence, according to one embodiment of the invention;

FIG. 2A is a graph illustrating a genomic sequence hit with a microRNA-precursor-pattern-set, the graph further illustrating the number of pattern hits with instances in a particular genomic neighborhood as a function of position;

FIG. 2B is a graph illustrating detail of the region shown in FIG. 2A;

FIG. 2C is a graph illustrating detail of the region shown in FIG. 2B;

FIG. 2D is an illustration of the predicted secondary structure of cel-mir-273 as determined by RNAfold;

FIG. 3A is a graph illustrating the distribution of pattern-hit-scores for all C. elegans microRNAs within RFAM (solid line) versus generic hairpins (dashed line);

FIG. 3B is a graph illustrating the distribution of predicted folding energies for all C. elegans microRNAs (solid line) and generic hairpins (dashed line);

FIG. 3C is an X-Y scatter plot illustrating patterns hits versus folding energy for C. elegans microRNAs (light-grey-colored dots) and generic hairpins (dark-grey-colored dots);

FIG. 4 is a table summarizing the microRNA-precursor predictions for the genomes of C. elegans, D. melanogaster, M. musculus and H. sapiens; and

FIG. 5 is a block diagram illustrating a system for determining whether a nucleotide sequence contains a microRNA precursor, in accordance with one embodiment of the invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The teachings of the present invention relate to ribonucleic acid (RNA) molecules and their role in gene expression regulation. As mentioned above, a novel and robust pattern-based approach for the discovery of microRNA precursors and their corresponding mature microRNAs from genomic sequence is provided. Advantageously, the inventive approach obviates the need of cross-species sequence conservation, and is thus readily applicable to any genomic sequence independent of whether it has orthologues in other species. The capabilities of the inventive approach are demonstrated herein by first showing that the inventive approach correctly identifies many of the currently known microRNA precursors and mature microRNAs. We describe an implemented prototype system and use the system to analyze computationally the C. elegans, D. melanogaster, M. musculus and H. sapiens genomes. By way of example, such sequences are described in detail in Application No. 60/652,499, the disclosure of which is incorporated by reference herein. Also, such sequences are described in detail in the above-mentioned related U.S. patent application (YOR920040675US2), the disclosure of which is incorporated herein.

We estimate that the number of endogenously-encoded microRNA precursors is substantially higher than currently hypothesized. The inventive approach readily extends to the discovery of microRNA target sites directly from genomic sequences. A method for identifying microRNA target sites is described in detail in the above-mentioned related U.S. patent application (YOR920060077US1), the disclosure of which is incorporated herein.

FIG. 1A is a flow diagram illustrating a method for identifying a microRNA precursor sequence, according to one embodiment of the invention. Underlying the inventive approach is a pattern-based methodology which discovers variable-length sequence fragments (‘patterns’) that recur in an input database at least a user-specified, minimum number of times. The number of discovered patterns, the exact locations of each instance of a discovered pattern, the actual extent of each pattern, and the number of instances that a pattern has in the input database are, of course, not known ahead of time. Computationally, the pattern discovery problem is a much ‘harder’ problem than database searching, a task with which most biologists are familiar and which has been in mainstream use for more than 20 years. Indeed, pattern discovery is an NP-hard problem whereas database searching can be solved in polynomial time.

We will first describe step 110, the generation of patterns. The generation of patterns (step 110) is comprised of steps 112 and 114, as shown in FIG. 1A.

Step 112 is the step of processing known microRNA precursors to discover intra- and inter-species patterns of conserved sequence.

The recurrent instances of conserved sequence segments can be represented with the help of regular expressions each with a differing degree of descriptive power. The expressions used in this disclosure are composed of literals (solid characters from the alphabet of permitted symbols), wildcards (each denoted by ‘.’ and representing any character), and sets of equivalent literals (each set being a small number of symbols, any one of which can occupy the corresponding position). The distance between two consecutive occupied positions is assumed to be unchanged across all instances of the pattern (i.e., ‘rigid patterns’). The pattern [LIV].[LIV].D.ND[NH].P is an example from the domain of amino acid sequences and describes the calcium binding motif of cadherin proteins. The motif in question comprises exactly one of the amino acids {leucine, isoleucine, valine}, followed by any amino acid, followed again by exactly one of the amino acids {leucine, isoleucine, valine}, followed by any amino acid, followed by the negatively charged aspartate, etc. Typically, the presence of a statistically significant pattern in an unannotated amino acid sequence is taken as a sufficient condition to suggest the presence of the feature captured by the pattern.
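As an illustration of the notation just described, the following minimal Python sketch locates instances of a rigid pattern in a sequence. It relies on the observation that literals, ‘.’ wildcards and bracketed sets are already valid regular-expression syntax. The function name `find_instances` and the short peptide used in the usage note are hypothetical and are not taken from the disclosure.

```python
import re

def find_instances(pattern, sequence):
    """Return (start, matched_text) for every instance of a rigid pattern.

    The pattern uses literals, '.' wildcards and bracketed sets such as
    [LIV]. Wrapping it in a lookahead lets overlapping instances be
    reported, since the regex engine then advances one position at a time.
    """
    scanner = re.compile(f"(?=({pattern}))")
    return [(m.start(), m.group(1)) for m in scanner.finditer(sequence)]

# Made-up peptide containing one instance of the cadherin motif.
hits = find_instances("[LIV].[LIV].D.ND[NH].P", "GGLAVEDANDNAPGG")
```

Here `hits` holds a single instance that starts at offset 2 (zero-based) and spans the 11 positions of the pattern.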

In the context of the invention described herein, the symbol set that is used comprises the four nucleotides {A,C,G,T} found in a deoxyribonucleic acid (DNA) sequence. The input set which we processed in order to discover patterns is Release 3.0 of the RFAM database, from January 2004 (Griffiths-Jones, S. et al. Rfam: an RNA family database. Nucleic Acids Res., 31 439-441 (2003)). The use of a more-than-18-month-old release of the database as our training set was intentional. We wanted to gauge how well our method would perform if presented only with the knowledge that was available in the literature in January 2004. The analysis has since been repeated using subsequent releases of the RFAM database.

Unlike previously published computational methods for microRNA precursor prediction, the present invention makes use of the sequence information from all the microRNAs which are contained in the RFAM release, independent of the organism in which they originate. The release in question contains microRNAs from the human, mouse, rat, worm, fly and several plant genomes. The simultaneous processing of microRNA sequences from distinct organisms permits the discovery of conserved sequences both within and across species and makes the method suitable for the analysis of more than one organism. Release 3.0 of RFAM (January 2004), which was used as our input, contained 719 microRNA precursor sequences.

We used a scheme based on BLASTN (Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J. Basic local alignment search tool. J Mol Biol. 215 403-410 (1990)) to remove duplicate and near-duplicate entries from the initial collection. The final set comprised 530 microRNA precursor sequences. In this cleaned-up set, no two sequences agreed on more than 90% of their positions. We next describe in detail the BLASTN-based cleanup scheme.

We assume that we are given N sequences of variable length and a user-defined threshold X for the permitted, maximum remaining pair-wise sequence similarity. The sequence-based clustering scheme that we employed is shown below. Upon termination, the set CLEAN contains sequences no pair of which agrees on more than X % of the positions in the shorter of the two sequences. For our analysis, we set X=90%.

    • sort the N sequences in order of decreasing length; let Si denote the i-th sequence of the sorted set (i=1, . . . , N)
    • CLEAN ← {S1}
    • for i=2 through N do
      • use Si as query to run BLAST against the current contents of CLEAN
      • if the top BLAST hit T agrees with Si at more than X % of Si's positions
      • then
        • make Si a member of the cluster represented by T;
        • discard Si;
      • else
        • CLEAN ← CLEAN ∪ {Si};
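The greedy clustering scheme above can be sketched in Python as follows. This is only an illustration under a loudly stated assumption: the BLAST query is replaced by a hypothetical stand-in, `ungapped_identity`, which slides the shorter sequence along the longer one without gaps and reports the best fraction of matching positions. A real run would call BLASTN instead, and the names `remove_near_duplicates`, `clean` and `clusters` are likewise illustrative.

```python
def ungapped_identity(query, target):
    """Hypothetical stand-in for a BLASTN comparison (not BLAST itself).

    Returns the best fraction of positions of the shorter sequence that
    match the longer one over all ungapped offsets.
    """
    if len(query) > len(target):
        query, target = target, query
    if not query:
        return 0.0
    best = 0
    for offset in range(len(target) - len(query) + 1):
        matches = sum(q == t for q, t in zip(query, target[offset:]))
        best = max(best, matches)
    return best / len(query)

def remove_near_duplicates(sequences, x=0.90):
    """Greedy clustering: keep a sequence only if no kept sequence
    agrees with it at more than a fraction x of its positions."""
    ordered = sorted(sequences, key=len, reverse=True)  # decreasing length
    clean = []      # representatives (the set CLEAN)
    clusters = {}   # representative -> discarded members
    for s in ordered:
        # "Top hit": the representative most similar to s, if any.
        top = max(clean, key=lambda t: ungapped_identity(s, t), default=None)
        if top is not None and ungapped_identity(s, top) > x:
            clusters[top].append(s)   # s joins T's cluster and is discarded
        else:
            clean.append(s)           # s becomes a new representative
            clusters[s] = []
    return clean, clusters
```

Because the comparison is a strict "more than X %", a sequence that agrees with a representative at exactly 90% of its positions is retained when x=0.90.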

This non-redundant input was then processed using the Teiresias algorithm (Rigoutsos, I. and Floratos, A. Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 14 55-67 (1998)) in order to discover intra- and inter-species patterns of sequence conservation. The combinatorial nature of the algorithm and the guaranteed discovery of all patterns contained in the processed input makes Teiresias a good choice for addressing this task. The nature of the patterns that can be discovered is controlled by three parameters: L, the minimum number of symbols participating in a pattern; W, the maximum permitted span of any L consecutive (not contiguous) symbols in a pattern; and K, the minimum number of instances required of a pattern before it can be reported. We also enforced a statistical significance requirement. The significance of each pattern was estimated with the help of a second-order Markov chain which was built from actual genomic data. Application of the significance filter reduced the number of patterns that were used in the subsequent phases of the algorithm. Details on the Teiresias algorithm and its properties, the three parameters L/W/K, and how to estimate log-probabilities are given below.

The Teiresias algorithm requires that the three parameters L, W and K be set. The three parameters that control the discovery process were set to L=7, W=10 and K=2. 120,789,247 variable length patterns were discovered in the processed input set. Patterns with log-probability >−34.0 were removed resulting in a final set of 192,240 statistically-significant, microRNA precursor specific patterns. We next describe in detail how these parameters control the number and character of the discovered patterns.

The parameter L controls the minimum possible size of the discovered patterns. The parameter W satisfies the inequality W≧L and controls the ‘degree of conservation’ across the various instances of the reported patterns. Setting W to smaller (respectively larger) values permits fewer (respectively more) mismatches across the instances of each of the discovered patterns. Finally, the parameter K controls the minimum number of instances that a pattern must have before it can be reported.

For a given choice of L, W and K, Teiresias guarantees that it will report all patterns that have K or more appearances in the processed input and are such that any L consecutive (but not necessarily contiguous) positions span at most W positions. It is important to stress that even though no pattern can have fewer than L literals, the patterns' maximum length is unconstrained and limited only by the size of the database.
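The density constraint imposed by L and W can be made concrete with a short Python check. The sketch below is not part of the Teiresias implementation; it only tests whether an already-written pattern respects the stated rule that every L consecutive occupied (non-wildcard) positions span at most W positions. The helper names `occupied_positions` and `satisfies_lw` are hypothetical.

```python
import re

# One token is either a bracketed set such as [AT] or a single character.
TOKEN = re.compile(r"\[[^\]]+\]|.")

def occupied_positions(pattern):
    """Indices of the positions that are not wildcards ('.')."""
    return [i for i, tok in enumerate(TOKEN.findall(pattern)) if tok != "."]

def satisfies_lw(pattern, L, W):
    """True if the pattern has at least L occupied positions and every
    run of L consecutive occupied positions spans at most W positions."""
    occ = occupied_positions(pattern)
    if len(occ) < L:
        return False
    return all(occ[i + L - 1] - occ[i] + 1 <= W
               for i in range(len(occ) - L + 1))

dense = satisfies_lw("ACGT.AC.GTA", 7, 10)     # every 7 literals span 9 positions
sparse = satisfies_lw("A.C.G.T.A.C.G", 7, 10)  # 7 literals span 13 positions
```

With L=7 and W=10, the first pattern passes and the second fails; a window that passes has at least 7 of its 10 positions occupied, i.e. 70%.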

Setting L to small values permits the identification of shorter conserved motifs that may be present in the processed input. As mentioned above, even if L is set to small values, patterns that are longer than L will be discovered and reported. Generally speaking, in order for a short motif to be considered statistically significant it will need to have a large number of copies in the processed input. Setting L to large values will generally permit the identification of statistically significant motifs even if these motifs repeat only a small number of times. This increase in specificity will happen at the expense of a potentially significant decrease in sensitivity.

For the work described herein, we selected L=7. This choice is dictated by the desire to capture potential commonalities among the seed regions of diverse microRNAs; setting L to a value that is smaller than the 6 nucleotides typically associated with the seed regions gives us added flexibility. We also set W=10, a choice that is dictated by the desire to capture sequence commonalities where the local conservation is at least 70%. In other words, any reported pattern will have more than ⅔ of its positions occupied by literals. Finally, we set K=2. This is a natural consequence of the fact that we generate conserved sequence motifs through an unsupervised pattern discovery scheme. The value of 2 is the smallest possible one (a pattern or motif, by definition, must appear at least two times in the processed input) and guarantees that all patterns will be discovered.

Step 114 is the step of statistically filtering the patterns that were generated in step 112. The filtering is done by estimating the log-probability of each pattern with the help of a Markov chain. We next describe in detail how to use Markov chains to estimate the log-probabilities of patterns. The computation is carried out in the same manner for all of the patterns.

Real genomic data was used to estimate the frequency of trinucleotides that could span as many as 23 positions—there are at most 20 wild cards between the first and last nucleotide of the triplet. In other words, we computed the frequencies of all trinucleotides of the form:

AAA
AA.A
AA..A
...
AA....................A
A.AA
A.A.A
A.A..A
...
T....................TT
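The trinucleotide counting described above can be sketched as follows. This is an illustrative sketch only: it assumes the alphabet A, C, G, T, splits the wild cards into a gap before and a gap after the middle nucleotide with at most 20 wild cards in total, and uses hypothetical function names. The real computation was run on genomic data; the brute-force loop here is meant for short inputs.

```python
from collections import Counter

ALPHABET = "ACGT"
MAX_WILDCARDS = 20  # at most 20 wild cards between first and last nucleotide

def number_of_gap_layouts(max_wildcards=MAX_WILDCARDS):
    """Number of (gap1, gap2) layouts with gap1 + gap2 <= max_wildcards."""
    return sum(max_wildcards - g1 + 1 for g1 in range(max_wildcards + 1))

def gapped_trinucleotide_counts(sequence, max_wildcards=MAX_WILDCARDS):
    """Count every gapped trinucleotide, keyed by strings such as 'AA..A'."""
    counts = Counter()
    n = len(sequence)
    for i in range(n):
        for g1 in range(max_wildcards + 1):
            j = i + g1 + 1
            if j >= n:
                break
            for g2 in range(max_wildcards - g1 + 1):
                k = j + g2 + 1
                if k >= n:
                    break
                a, b, c = sequence[i], sequence[j], sequence[k]
                # Skip ambiguous symbols such as N.
                if a in ALPHABET and b in ALPHABET and c in ALPHABET:
                    counts[a + "." * g1 + b + "." * g2 + c] += 1
    return counts
```

Under this reading there are 231 gap layouts, each with 64 possible nucleotide triples.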

With these counts at hand, we used Bayes' theorem to estimate the probability that a given pattern could be generated from a random database. Let us use the pattern A..[AT].C..T...G to describe the approach. Writing Pr(X|Y) for the probability of X given Y, and conditioning each new literal only on the two literals that precede it, observe that we can write:

  • Pr(A..[AT].C..T...G)=
  • Pr(C..T...G|A..[AT].C..T)*Pr(A..[AT].C..T)≈
  • Pr(C..T...G|C..T)*Pr(A..[AT].C..T)=
  • Pr(C..T...G|C..T)*Pr([AT].C..T|A..[AT].C)*Pr(A..[AT].C)≈
  • Pr(C..T...G|C..T)*Pr([AT].C..T|[AT].C)*Pr(A..[AT].C)

Taking the leading triplet conditionally on its first two literals as well, i.e., using Pr(A..[AT].C|A..[AT]), the estimate is computed from the counts as:

  • #(C..T...G)/(#(C..T...A)+#(C..T...C)+#(C..T...G)+#(C..T...T))*
  • #([AT].C..T)/(#([AT].C..A)+#([AT].C..C)+#([AT].C..G)+#([AT].C..T))*
  • #(A..[AT].C)/(#(A..[AT].A)+#(A..[AT].C)+#(A..[AT].G)+#(A..[AT].T))

Note that all of the counts #(.) are available directly from the Markov chain and thus can be substituted into the last expression. This in turn allows us to estimate Pr(A..[AT].C..T...G) as well as log(Pr(A..[AT].C..T...G)).
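The product of count ratios shown above can be evaluated with a few lines of code. The sketch below is an assumption-laden illustration rather than the implementation used in this work: the count table is assumed to be a dictionary keyed by gapped trinucleotides such as 'C..T...G', a bracketed position is handled by summing the counts of its alternatives, the natural logarithm is used because the text does not state a base, and all function names are hypothetical.

```python
import math
import re
from itertools import product

def tokenize(pattern):
    """Split a pattern such as 'A..[AT].C' into one token per position."""
    return re.findall(r"\[[ACGT]+\]|[ACGT]|\.", pattern)

def count_template(counts, a, gap1, b, gap2, c):
    """Sum the counts of all trinucleotides matching the template, where
    a, b and c are strings of allowed letters (e.g. 'AT' for [AT])."""
    return sum(
        counts.get(x + "." * gap1 + y + "." * gap2 + z, 0)
        for x, y, z in product(a, b, c)
    )

def log_probability(pattern, counts):
    """Sum of log count ratios over successive literal triplets."""
    tokens = tokenize(pattern)
    lits = [(i, t.strip("[]")) for i, t in enumerate(tokens) if t != "."]
    logp = 0.0
    for (i, a), (j, b), (k, c) in zip(lits, lits[1:], lits[2:]):
        g1, g2 = j - i - 1, k - j - 1
        num = count_template(counts, a, g1, b, g2, c)
        if num == 0:
            return float("-inf")  # triplet never observed
        den = count_template(counts, a, g1, b, g2, "ACGT")
        logp += math.log(num / den)
    return logp
```

With this layout, the pattern A..[AT].C..T...G contributes three ratios, one for each of the triplets A..[AT].C, [AT].C..T and C..T...G.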

We next describe step 120, the identification of candidate regions. Step 120 is comprised of step 122 and step 124, as shown in FIG. 1A.

Step 122 is the step of locating instances of patterns in the genomic sequences of interest. We use the 192,240 microRNA precursor patterns to locate instances in genomic sequences of interest. Typically, these sequences correspond to the intergenic and intronic regions of the genome at hand.

We first removed all low-complexity regions from the genomic sequences to be processed using the publicly available NSEG program (Wootton, J. C. and S. Federhen. Statistics of local complexity in amino acid sequences and sequence databases. Computers and Chemistry. 1993; 17:149-163) with default parameter settings. In the filtered sequences, we then sought instances of the patterns from the microRNA-precursor-pattern-set.
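Locating pattern instances in a sequence can be illustrated with ordinary regular expressions. This is a sketch under the assumption that patterns use only letters, "." wild cards and bracketed alternatives, in which case the pattern text is already a valid regular expression; the function name is hypothetical and the approach is not claimed to be the one used in this work.

```python
import re

def find_instances(pattern, sequence):
    """Return (start, span) for every instance of the pattern, including
    overlapping ones. Positions are 0-based."""
    # A lookahead lets successive matches overlap.
    regex = re.compile("(?=(" + pattern + "))")
    return [(m.start(), len(m.group(1))) for m in regex.finditer(sequence)]
```

Each returned pair gives where an instance begins and how many positions it covers, which is the information needed to credit support to the spanned locations.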

Step 124 is the step of identifying regions in the genomic sequences of minimum length and supported by a minimum number of pattern hits. An instance of the microRNA precursor pattern generates a “pattern hit” which covers as many nucleotides as the span of the corresponding pattern; this is repeated for all patterns. Each pattern contributes a support of +1 to all of the genomic sequence locations spanned by its instance. Clearly, a given nucleotide position may be hit by more than one pattern. We make use of precisely this observation to associate genomic regions which receive multiple pattern hits with putative microRNA precursors. Conversely, regions which do not correspond to microRNA precursors are expected to receive a much smaller number of hits, if any, which of course permits us to differentiate between background and microRNA precursors.

Segments of contiguous sequence locations that received more than 60 patterns and spanned at least 60 positions were excised together with a 30-nucleotide-long flanking sequence at each end.
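The accumulation of support and the excision of candidate segments can be sketched as follows. The sketch assumes that the requirement of more than 60 patterns applies to every position of a segment, that hits are given as 0-based (start, span) pairs, and that intervals are half-open; the function names are hypothetical.

```python
def support_profile(length, hits):
    """Per-position support: each hit adds +1 to every position it spans."""
    diff = [0] * (length + 1)
    for start, span in hits:
        diff[start] += 1
        diff[min(start + span, length)] -= 1
    profile, running = [], 0
    for d in diff[:length]:
        running += d
        profile.append(running)
    return profile

def candidate_segments(profile, min_support=60, min_length=60, flank=30):
    """Runs of positions with support > min_support that are at least
    min_length long, widened by `flank` positions on each side."""
    segments, start = [], None
    for pos, value in enumerate(list(profile) + [0]):  # sentinel closes a trailing run
        if value > min_support and start is None:
            start = pos
        elif value <= min_support and start is not None:
            if pos - start >= min_length:
                segments.append((max(0, start - flank), min(len(profile), pos + flank)))
            start = None
    return segments
```

The flanks are clipped at the ends of the sequence so that an excised segment never extends past the available sequence.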

We next describe step 130, the step of subselecting among candidate regions and reporting the subselected regions. Step 130 is comprised of step 132, step 134, step 136 and step 138, as shown in FIG. 1A.

Step 132 is the step of predicting the RNA secondary structure of the candidate sequences. With the help of the Vienna package software (Hofacker, I. L. et al. Fast Folding and Comparison of RNA Secondary Structures. Monatsh. Chem. 125, 167-188 (1994)), we predicted the RNA secondary structure of each excised sequence. Instead of the Vienna package, we could have used the ‘mfold’ algorithm to predict each sequence's RNA secondary structure (Mathews, D. H., Sabina, J., Zuker, M. and Turner, D. H. Expanded Sequence Dependence of Thermodynamic Parameters Improves Prediction of RNA Secondary Structure. J. Mol. Biol. 288, 911-940 (1999)).

Step 134 is the step of filtering candidate sequences based on the energy of the structure. Only those sequences whose predicted Gibbs free energy was ≦−18 Kcal/mol were kept and reported.

Step 136 is the step of further filtering candidate sequences based on number of bulges.

Step 138 is the step of reporting candidate sequences as microRNA precursors.

Lastly, as shown in step 139 of FIG. 1A, the results (e.g., predictions) of the above processes can be optionally evaluated through experiments.

FIG. 1B is a flow diagram illustrating a method for identifying a mature microRNA sequence in a microRNA precursor sequence, according to one embodiment of the invention. In each of the candidate microRNA precursors that were identified in step 130, we sought to determine the location of the corresponding mature microRNA. To this end, we used the same method as described above, only this time we generated patterns from the set of known microRNA sequences.

We next describe step 140, the step of generating patterns. Step 140 is comprised of step 142 and step 144, as shown in FIG. 1B.

Step 142 is the step of processing known microRNAs to discover intra- and inter-species patterns of conserved sequence. Similar to step 112, we downloaded 644 mature microRNAs from RFAM, Release 3.0 (January, 2004). Subsequent implementations of the method described herein have used more recent versions of the RFAM database.

Step 144 is the step of filtering discovered patterns, keeping only statistically significant patterns. As in step 114, we used a scheme based on BLASTN to remove duplicate and near-duplicate entries from the initial collection. The final set comprised 354 sequences of mature microRNAs such that no two remaining sequences agreed on more than 90% of their positions.

The three parameters that control the discovery process were set to L=4, W=12 and K=2. 120,789,247 variable length patterns were discovered in the processed input set, typically spanning fewer than 22 positions. Patterns with log-probability >−32.0 were removed resulting in a final set of 233,554 statistically-significant, mature-microRNA patterns.
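The direction of the log-probability cut-off is easy to get backwards, so a small sketch may help: patterns with a log-probability above the threshold are discarded, and the less probable (more significant) ones are kept. The dictionary layout and function name are hypothetical.

```python
def keep_significant(log_probs, threshold=-32.0):
    """Keep patterns whose estimated log-probability is at or below the
    threshold; patterns with log-probability > threshold are removed."""
    return {p: lp for p, lp in log_probs.items() if lp <= threshold}
```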

We next describe step 150, the step of identifying mature regions. Step 150 is comprised of step 152, step 154 and step 156, as shown in FIG. 1B.

Step 152 is the step of locating instances of patterns in the candidate precursor sequences. We sought instances of the 233,554 mature microRNA patterns, derived from the processed mature microRNA sequences, in the sequences of the microRNA precursors that were identified above. Similar methods as described above in step 122 are incorporated herein.

Step 154 is the step of identifying regions in the candidate precursor sequences of a minimum length and supported by a minimum number of pattern hits. As before, a pattern's instance contributes a vote of “+1” to all the locations of the candidate precursor that the instance spans. All regions that did not overlap with the putative loop of the precursor and comprised contiguous blocks of locations that were hit by ≧60 patterns and were at least 18 nucleotides long were reported as the mature microRNAs corresponding to this precursor. Similar methods as described above in step 124 are incorporated herein.
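The reporting rule for mature regions can be sketched in the same spirit. The sketch assumes that a per-position support profile for the precursor is already available, that the putative loop is given as a half-open (start, end) interval obtained from the predicted structure, and that the support requirement applies to every position of a block; the function name is hypothetical.

```python
def mature_regions(profile, loop, min_support=60, min_length=18):
    """Contiguous blocks with support >= min_support that are at least
    min_length long and do not overlap the loop interval."""
    loop_start, loop_end = loop
    regions, start = [], None
    for pos, value in enumerate(list(profile) + [0]):  # sentinel closes a trailing block
        if value >= min_support and start is None:
            start = pos
        elif value < min_support and start is not None:
            overlaps_loop = start < loop_end and loop_start < pos
            if pos - start >= min_length and not overlaps_loop:
                regions.append((start, pos))
            start = None
    return regions
```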

Step 156 is the step of reporting regions as mature microRNAs.

Lastly, as shown in step 159 of FIG. 1B, the results (e.g., predictions) of the above processes can be optionally evaluated through experiments.

We next illustrate the above-described stages (‘discovery of a microRNA precursor’/‘discovery of a mature microRNA’) with the help of the C. elegans genome. In particular, we use the genomic region in the vicinity of the known microRNA precursor cel-miR-273.

FIGS. 2A-D illustrate how, for the genomic sequence under consideration, the microRNA-precursor-patterns accumulate in the region of the precursor whereas the microRNA-precursor-patterns are absent in the other areas. For the shown example sequence, approximately 500 patterns end up contributing to genomic location 14,946,975. In fact, the contiguous genomic locations that receive support from the microRNA-precursor-patterns correspond to the known span of cel-miR-273, which is indicated by the light-grey rectangle in FIG. 2B. The region that received the substantial non-zero precursor support was examined for instances of the mature-microRNA-pattern-set. In FIG. 2C, we show how well the inventive approach localized the mature microRNA section within the cel-miR-273 precursor. The actual span of the known mature microRNA is indicated by the light-grey background.

FIG. 3A is a graph illustrating the distribution of pattern-hit-scores for all C. elegans microRNAs within RFAM (solid line) versus generic hairpins (dashed line).

FIG. 3B is a graph illustrating the distribution of predicted folding energies for all C. elegans microRNAs (solid line) and generic hairpins (dashed line).

FIG. 3C is an X-Y scatter plot illustrating pattern hits versus folding energy for C. elegans microRNAs (light-grey-colored dots) and generic hairpins (dark-grey-colored dots).

We used the 192,240 members of the microRNA-precursor-pattern-set to determine how well they covered those of the training sequences which originated in C. elegans. Almost all of the known C. elegans precursors contained ≧100 instances of the precursor patterns. The solid-line curve in FIG. 3A shows the probability density function for the number of precursors which contained a given number of pattern instances in them.

We next generated randomly what we refer to as a generic hairpin set. This hairpin set was designed so as to comprise sequences whose geometric features were characteristic of all known microRNA precursors, namely, a hairpin-shaped secondary structure and lengths in the interval [60,120] nucleotides. First, we randomly selected numerous regions with lengths uniformly distributed between 60 and 120 nucleotides. There was no restriction as to where in the C. elegans genome these regions were located.

Then, we inspected the predicted RNA secondary structure of these regions and kept only those which formed hairpins and did not include any low-complexity regions. Starting with an initial set of 120,000 randomly selected regions (=10,000×2 strands×6 chromosomes), and discarding as described above, we were left with a total of 20,560 generic hairpins. These hairpins are used to sample the “background” distribution of hairpins and to estimate its properties.
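The random selection of regions can be sketched as follows. The chromosome names and lengths passed in are placeholders supplied by the caller, the function name is hypothetical, and the subsequent folding and low-complexity checks are not shown.

```python
import random

def sample_regions(chromosome_lengths, per_strand=10_000,
                   min_len=60, max_len=120, seed=0):
    """Draw `per_strand` regions per strand per chromosome, with lengths
    uniform on [min_len, max_len] and unrestricted start positions.
    Returns (chromosome, strand, start, end) with half-open coordinates."""
    rng = random.Random(seed)
    regions = []
    for chrom, chrom_len in chromosome_lengths.items():
        for strand in "+-":
            for _ in range(per_strand):
                length = rng.randint(min_len, max_len)  # both ends inclusive
                start = rng.randrange(chrom_len - length + 1)
                regions.append((chrom, strand, start, start + length))
    return regions
```

With six chromosomes and the default of 10,000 draws per strand, the function returns 120,000 regions, matching the size of the initial set described above.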

We examined these generic hairpins for instances of the microRNA precursor patterns. The dashed-line curve in FIG. 3A shows the probability density function for the percentage of the generic hairpins that contained a certain number of pattern instances. Setting the support threshold to 60 pattern-instances captures 104 of the 114 known C. elegans microRNAs or 91%. On the other hand, less than 1% of the members of the generic hairpin set exceed this threshold. This is an important result that demonstrates that the microRNA precursor patterns capture sequence properties which are specific to microRNA precursors and can effectively distinguish them from randomly selected regions that simply happen to fold into “stem-loop-stem” structures.

In addition to the distribution of pattern instances, we also examined the distribution of the Gibbs free energy values that are computed from the generic hairpin set (dashed-line curve) and the known C. elegans precursors (solid-line curve) and show the results in FIG. 3B. Setting the energy threshold to −25 Kcal/mol captures 107 of the 114 known C. elegans microRNA precursors or 94%, but only 7% of the sequences in the generic hairpin set pass this threshold.

Finally, we examined how well a combination of the “energy” and the “pattern-instances” filters separates the known microRNA precursors (light-grey colored dots) from the generic hairpin set (dark-grey colored dots). The results are presented in FIG. 3C. As can be seen in FIG. 3C, there is very little correlation between these two criteria and their combined application provides a simple yet powerful discriminator. The combined threshold of ≧60 pattern instances and a predicted Gibbs energy ≦−25 Kcal/mol allows us to identify 97 of the 114 known C. elegans precursors whereas less than 1% of the generic hairpins pass this double threshold. This translates into an estimated sensitivity of 85% for our precursor prediction method and an estimated false-positive ratio that is ≦1%.
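The combined filter and the two rates derived from it can be sketched as follows. Each candidate is assumed to be a (pattern_instances, gibbs_energy) pair; the function names are hypothetical, and any data passed in would come from the pattern-matching and folding steps.

```python
def passes(candidate, min_hits=60, max_energy=-25.0):
    """True if a candidate meets both the pattern-instance threshold and
    the energy threshold (energy in Kcal/mol, more negative is more stable)."""
    hits, energy = candidate
    return hits >= min_hits and energy <= max_energy

def fraction_passing(candidates, **thresholds):
    """Fraction of candidates that pass the combined filter."""
    candidates = list(candidates)
    return sum(passes(c, **thresholds) for c in candidates) / len(candidates)
```

Applied to the known precursors, the fraction is the estimated sensitivity; applied to the generic hairpin set, it is the estimated false-positive ratio.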

We repeated the above generic-hairpin analysis for the remaining three genomes of our collection. The remaining three genomes were D. melanogaster, M. musculus and H. sapiens. By way of example, such sequences are described in detail in Application No. 60/652,499, the disclosure of which is incorporated by reference herein. Also, such sequences are described in detail in the above-mentioned related U.S. patent application (YOR920040675US2), the disclosure of which is incorporated herein. The estimated false-positive ratios remained very low, and similar in magnitude to the case of C. elegans above. In particular, the estimates we generated for the false-positive ratio when predicting microRNA precursors in the other three genomes ranged from ≦1% (for hairpins with Gibbs energies of −25 Kcal/mol or less) to ≦2% (for hairpins with Gibbs energies of −18 Kcal/mol or less). Given that the four genomes span a very wide evolutionary spectrum, it is reasonable to assume that these values are characteristic of our method and independent of the identity of the genome that is used.

FIG. 4 is a table summarizing the microRNA-precursor predictions for the genomes of C. elegans, D. melanogaster, M. musculus and H. sapiens.

We have analyzed the intergenic and intronic regions of four complete genomes, as illustrated in FIG. 4. Results are reported for two values for the Gibbs energy threshold, namely −18 Kcal/mol and −25 Kcal/mol.

As can be seen from FIG. 4, the method correctly identifies a very large percentage of the known microRNA precursors in these four genomes, for the used thresholds. Additionally, we also predict many novel microRNA precursors. Their numbers are significantly higher than what has previously been discussed in the literature. In light of the very low error rate estimates of our method, we believe that a substantial number of our microRNA precursor predictions are likely correct.

FIG. 5 is a block diagram of a system 500 for determining whether a nucleotide sequence contains a microRNA precursor in accordance with one embodiment of the present invention. System 500 comprises a computer system 510 that interacts with media 550. Computer system 510 comprises a processor 520, a network interface 525, a memory 530, a media interface 535 and an optional display 540. Network interface 525 allows computer system 510 to connect to a network, while media interface 535 allows computer system 510 to interact with media 550, such as a Digital Versatile Disk (DVD) or a hard drive.

As is known in the art, the methods and apparatus discussed herein may be distributed as an article of manufacture that itself comprises a computer-readable medium having computer-readable code means embodied thereon. The computer-readable program code means is operable, in conjunction with a computer system such as computer system 510, to carry out all or some of the steps to perform the methods or create the apparatuses discussed herein. The computer-readable code is configured to generate patterns by processing a collection of already known mature microRNA sequences; assign one or more attributes to the generated patterns; subselect only the patterns whose attributes satisfy certain criteria; generate the reverse complement of the subselected patterns; and use the reverse complement of the subselected patterns to analyze the nucleotide sequence. The computer-readable medium may be a recordable medium (e.g., floppy disks, hard drive, optical disks such as a DVD, or memory cards) or may be a transmission medium (e.g., a network comprising fiber-optics, the world-wide web, cables, or a wireless channel using time-division multiple access, code-division multiple access, or other radio-frequency channel). Any medium known or developed that can store information suitable for use with a computer system may be used. The computer-readable code means is any mechanism for allowing a computer to read instructions and data, such as magnetic variations on a magnetic medium or height variations on the surface of a compact disk.

Memory 530 configures the processor 520 to implement the methods, steps, and functions disclosed herein. The memory 530 could be distributed or local and the processor 520 could be distributed or singular. The memory 530 could be implemented as an electrical, magnetic or optical memory, or any combination of these or other types of storage devices. Moreover, the term “memory” should be construed broadly enough to encompass any information able to be read from or written to an address in the addressable space accessed by processor 520. With this definition, information on a network, accessible through network interface 525, is still within memory 530 because the processor 520 can retrieve the information from the network. It should be noted that each distributed processor that makes up processor 520 generally contains its own addressable memory space. It should also be noted that some or all of computer system 510 can be incorporated into an application-specific or general-use integrated circuit.

Optional video display 540 is any type of video display suitable for interacting with a human user of system 500. Generally, video display 540 is a computer monitor or other similar video display.

It is to be appreciated that, in an alternative embodiment, the invention may be implemented in a network-based implementation, such as, for example, the Internet. The network could alternatively be a private network and/or local network. It is to be understood that the server may include more than one computer system. That is, one or more of the elements of FIG. 5 may reside on and be executed by their own computer system, e.g., with its own processor and memory. In an alternative configuration, the methodologies of the invention may be performed on a personal computer and output data transmitted directly to a receiving module, such as another personal computer, via a network without any server intervention. The output data can also be transferred without a network. For example, the output data can be transferred by simply downloading the data onto, e.g., a floppy disk, and uploading the data on a receiving module.

Presented herein is a novel and robust pattern-based methodology for the identification of microRNA precursors and their corresponding mature microRNAs directly from genomic sequence. With the help of patterns derived by processing the sequences of known microRNA precursors, our method identifies genomic regions where numerous instances of these patterns aggregate and subselects among them following energy based filtering.

The following are examples of advantages that characterize the inventive approach provided herein: a) the inventive approach obviates the need to enforce a cross-species conservation filtering before reporting results, thus allowing the discovery of microRNA precursors that may not be shared even by closely related species; b) the inventive approach can be applied to the analysis of any genome that potentially harbors endogenous microRNAs without the need to be retrained each time.

Although illustrative embodiments of the present invention have been described herein, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention.