Title:
Space efficient polymer sets
Document Type and Number:
Kind Code:
A1

Abstract:
The disclosure features a collection that comprises a plurality of polymers, typically nucleic acid molecules in a compact form. The molecules include all possible sequences or at least a certain percentage of all possible sequences, of a particular length.
Inventors:
Bulyk, Martha L. (Weston, MA, US)
Philippakis, Anthony A. (Cambridge, MA, US)
Estep, Preston Wayne (Weston, MA, US)
Application Number:
11/112349
Publication Date:
12/22/2005
Filing Date:
04/22/2005
Primary Class:
International Classes:
(IPC1-7): C07H021/04; C12Q001/68
Attorney, Agent or Firm:
FISH & RICHARDSON PC (P.O. BOX 1022, MINNEAPOLIS, MN, 55440-1022, US)
Claims:
1. A collection of nucleic acid polymers, the collection comprising a plurality of polymers in which each polymer has a unique sequence, wherein (i) the polymers of the plurality collectively provide at least 80% of all possible sites of length k, k being greater than 6, the sites being composed of a set comprised of four canonical nucleotides; (ii) each nucleic acid polymer of the plurality has a length greater than 1.2 k nucleotides, and (iii) the number of unique polymers in the plurality is less than 0.5 times 80% of the number of all possible sites of length k.

2. The collection of claim 1 wherein the polymers of the plurality collectively provide at least 98% of all possible sites of length k, and the number of unique polymers in the plurality is less than 0.5 times 98% of the number of all possible sites of length k.

3. The collection of claim 2 wherein the polymers of the plurality collectively provide all possible sites of length k, and the number of unique polymers in the plurality is less than 0.5 times the number of all possible sites of length k.

4. The collection of claim 1 wherein k is between 7 and 10.

5. The collection of claim 1 wherein each of the represented sites of length k is represented at least twice, each time in a different context.

6. The collection of claim 1 wherein each polymer of the plurality is physically associated with a substrate.

7. The collection of claim 6 wherein the substrate is a bead, and each polymer is physically associated with a different bead.

8. The collection of claim 6 wherein the substrate is a planar array, and each polymer is physically associated with a different address of the same array.

9. The collection of claim 1 wherein each polymer of the plurality is in solution.

10. The collection of claim 9 wherein the collection is divided into a plurality of pools, each pool comprising a subset of polymers of the plurality.

11. The collection of claim 1 wherein each polymer is double-stranded.

12. The collection of claim 1 wherein each polymer is DNA.

13. The collection of claim 1 wherein the compaction ratio is less than 0.2.

14. The collection of claim 1 wherein the set of canonical nucleotides consists of adenine, thymidine, guanine and cytosine.

15. The collection of claim 1 wherein the set of canonical nucleotides consists of adenine, uracil, guanine and cytosine.

16. The collection of claim 1 wherein at least two of the sites are composed of a set of nucleotides that comprises one non-naturally occurring nucleotide in addition to the four canonical nucleotides.

17. A nucleic acid array comprising a plurality of addresses, each address of the plurality comprising a nucleic acid molecule wherein (i) the array comprises nucleic acid molecules, the molecules collectively providing at least 80% of all possible sites of length k, k being greater than 6, (ii) each nucleic acid molecule is associated with an address of the plurality and has a length greater than 1.5 k nucleotides, and (iii) the array has fewer addresses than 0.5 times 80% the number of all possible sites of length k.

18. The array of claim 17 wherein the array comprises a planar substrate.

19. The array of claim 17 wherein the nucleic acid molecules are double-stranded DNAs.

20. A method of evaluating interaction specificity of a test compound, the method comprising: contacting the test compound to a plurality of nucleic acid molecules, wherein (i) the polymers of the plurality collectively provide at least 80% of all possible sites of length k, k being greater than 6, the sites being composed of a set of four canonical nucleotides; (ii) each nucleic acid polymer of the plurality has a length greater than 1.2 k nucleotides, and (iii) the number of unique polymers in the plurality is less than 0.5 times 80% of the number of all possible sites of length k; and evaluating interactions between the test compound and one or a subset of nucleic acid molecules of the plurality.

21. The method of claim 20 wherein the nucleic acid molecules are immobilized on an array and the test compound is in solution prior to the contacting, and the step of evaluating comprises detecting the test compound on the array.

22. The method of claim 20 wherein the nucleic acid molecules are immobilized on an array and the test compound is in solution prior to the contacting, and the step of evaluating comprises detecting an alteration in one or more of the nucleic acid molecules on the array.

23. The method of claim 22 wherein the test compound comprises a nucleic acid-modifying enzyme.

24. The method of claim 20 wherein the nucleic acid molecules are in solution and the test compound is immobilized on a substrate, and the step of evaluating comprises washing the substrate and identifying nucleic acid molecules that are separated by the washing.

25. The method of claim 24 further comprising amplifying nucleic acid molecules that are bound to the substrate.

26. The method of claim 20 wherein the test compound comprises a protein or nucleic acid.

27. The method of claim 20 wherein the test compound has enzymatic activity.

28. The method of claim 19 wherein the test compound is a methylase, a nuclease, or a polymerase.

29. The method of claim 20 wherein the test compound has a molecular weight less than 2000 Daltons.

30. The method of claim 20 further comprising formulating the test compound as a pharmaceutical composition.

31. The method of claim 30 further comprising administering the test compound to a subject.

32. A method of designing a collection of nucleic acids, the method comprising: enumerating, in a string, all permutations of four identifiers in words of a preselected length k, wherein each permutation occurs a limited number of times in the string; segmenting the string into segments of length less than 100; and synthesizing oligonucleotides according to the sequence of the identifiers in each of the segments.

33. The method of claim 32 wherein the step of enumerating comprises using a linear shift register.

34. The method of claim 32 wherein each word of the preselected length occurs exactly once in the string.

35. The method of claim 32 wherein k is between 6 and 12.

36. The method of claim 32 wherein the string is segmented into segments of less than a length of 50.

37. The method of claim 32 wherein the segments include overlapping ends.

38. The method of claim 32 further comprising disposing each of the oligonucleotides on a planar substrate.

Description:

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to provisional applications Ser. No. 60/564,864, filed on Apr. 23, 2004, and Ser. No. 60/587,066, filed Jul. 12, 2004, the contents of both of which are incorporated by reference.

BACKGROUND

Nucleic acid arrays have a variety of uses. One application enables evaluating the binding specificity of nucleic acid binding proteins.

SUMMARY

In one aspect, the disclosure features a collection that comprises a plurality of nucleic acid molecules. The molecules include all possible sequences or at least a certain percentage (e.g., at least 10, 25, 50, 60, 70, 80, 90, 95, 98, 99, 99.9%) of all possible sequences, of a particular length, k. These sequences are termed “k-mers.” k can be, for example, greater or equal to 5, 6, 7, 8, 9, 10, or 12 or, e.g., between 6-20, 6-18, 8-15, or 8-12. Each molecule of the plurality has length greater than k, e.g., a length of k+n nucleotides. For example, n is at least 1, 2, 3, 5, or 10. Typically, the nucleic acid molecules of the plurality are substantially longer than k, and can be at least 1.5×k nucleotides long or at least an integral multiple of k, e.g., at least 2×k nucleotides long. Further examples of multiples include at least 2.2, 2.5, 2.7, 3.0, 3.2, 3.5, 4, 4.5, 5, 6, 7, and ranges in between. In one embodiment, at least some molecules of the plurality (and typically all molecules of the plurality) include at least two, three, four, or five different k-mers.

In one embodiment, the number of unique nucleic acid molecules of the plurality is fewer than the number of all possible k-mer sequences or, if the collection includes only a certain percentage of such sites, fewer than the number of such sequences. The compaction ratio (defined as the number of unique molecules divided by the number of represented k-mers) can be, for example, less than 0.5, 0.2, 0.1, 0.06, 0.05, 0.01, or 0.002. Thus, in an embodiment in which all possible sites of length k are represented, the collection includes fewer molecules than the number of all such possible sequences. This compact design is achieved by including more than one k-mer per molecule, while maintaining similar k-mers at different molecules of the collection.
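For illustration only, the compaction ratio defined above can be computed for a toy collection as sketched below. The function names and the two example sequences are hypothetical, only one strand is counted, and the sketch is not taken from the disclosure.

```python
def kmers_in(seq, k):
    """Return the set of distinct sites of length k found in seq (one strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def represented_kmers(molecules, k):
    """Union of the k-mers carried by every unique molecule in the collection."""
    found = set()
    for molecule in set(molecules):
        found |= kmers_in(molecule, k)
    return found


def compaction_ratio(molecules, k):
    """Number of unique molecules divided by the number of represented k-mers."""
    found = represented_kmers(molecules, k)
    if not found:
        raise ValueError("no k-mers represented: every molecule is shorter than k")
    return len(set(molecules)) / len(found)


def coverage(molecules, k, alphabet_size=4):
    """Fraction of all possible k-mers that the collection represents."""
    return len(represented_kmers(molecules, k)) / alphabet_size ** k


# Hypothetical toy collection: two 4-mers, each carrying three distinct 2-mers.
toy = ["AACG", "GTTA"]
ratio = compaction_ratio(toy, 2)   # 2 molecules / 6 represented 2-mers
fraction = coverage(toy, 2)        # 6 of the 16 possible 2-mers
```

A ratio below 1 indicates that at least one molecule carries more than one k-mer; the thresholds listed above (0.5, 0.2, and so forth) are upper bounds on this quantity.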

The nucleic acid molecules of the collection can be in solution, can be labelled, can be present in separate containers, in pools, or can be immobilized, e.g., on one or more beads or on an array. The nucleic acid molecules can each also include an invariant region, e.g., a primer binding site and/or a spacer sequence.

In one embodiment, the nucleic acid molecules of the plurality are all less than 150 nucleotides in length, e.g., less than 120, 100, 80, 75, 70, 65 or 60. For example, the nucleic acid molecules of the plurality can be between 30-75 or 50-70 nucleotides in length. In one embodiment, at least 1, 2, 5 or 10% of the molecules of the plurality include artificial sequences or sequences not present in a yeast intergenic region.

In another aspect, the disclosure features a collection that comprises a plurality of non-homogeneous polymer molecules. The molecules include all possible sequences or at least a certain percentage (e.g., at least 10, 25, 50, 60, 70, 80, 90, 95, 98, 99, 99.9%) of all possible sequences, of a particular length, k. These sequences are termed “k-mers.” k can be, for example, greater or equal to 5, 6, 8, 10, or 12 or, e.g., between 6-20, 6-18, 8-15, or 8-12. Each molecule of the plurality may have the same length, or the lengths may vary. For example, most of the molecules of the collection can have a length of at least k+n subunits. n can be at least 1, 2, 3, 5, or 10. Typically, most of the molecules of the collection are substantially longer than k, e.g., at least 2×k subunits long. In one embodiment, at least some molecules of the plurality (and typically all molecules of the plurality) include at least two, three, four, or five different k-mers.

In one embodiment, the polymer includes a nucleic acid backbone, a peptide backbone, or a sugar backbone. For example, the polymer can be a polypeptide (e.g., a peptide of between 3-30 amino acids) or a larger polypeptide (e.g., greater than 30 amino acids). In embodiments in which the polymer is a nucleic acid, the nucleic acid can be RNA, DNA, or a combination thereof. It can be double-stranded, partially double-stranded, or single-stranded. For example, partially double-stranded and single-stranded molecules may include tertiary structures, e.g., hairpins, bulges and so forth. In a preferred embodiment, at least the variant region of molecules, e.g., the region that includes the different k-mers, in the collection is double-stranded.

In one embodiment, the number of unique nucleic acid molecules of the plurality is fewer than the number of all possible sites or, if the collection includes only a certain percentage of such k-mer sequences, fewer than the number of such sequences. The compaction ratio (defined as the number of unique molecules divided by the number of represented k-mers) can be, for example, less than 0.5, 0.2, 0.1, 0.05, 0.01, or 0.002. Thus, in an embodiment in which all possible sites of length k are represented, the collection includes fewer molecules than the number of all such possible sequences. This compact design can be achieved by including more than one different k-mer per molecule, and by locating similar k-mers (e.g., k-mers that differ by only one or two nucleotides) in different molecules of the collection.

The molecules of the collection can be in solution, can be labeled, can be present in separate containers, in pools, or can be immobilized, e.g., on one or more beads or on an array, in cells, or attached to or contained in viruses. The collection can be used to evaluate an interaction between a molecule of interest and members of the collection.

In another aspect, the disclosure features a nucleic acid array that includes all possible sites or at least a certain percentage (e.g., at least 10, 25, 50, 60, 70, 80, 90, 95, 98, 99, 99.9%) of all possible sites, of a particular length, k. These sites are termed “k-mers.” k can be, for example, greater or equal to 5, 6, 8, 10, or 12 or, e.g., between 6-20, 6-18, 8-15, or 8-12. The array includes a plurality of addresses. Most addresses of the plurality include at least one nucleic acid molecule of length greater than k, and the molecules at each address of the plurality may have the same length, or the lengths may vary. In certain embodiments, most of the nucleic acid molecules of the collection are at least k+n nucleotides long. For example, n is at least 1, 2, 3, 5, or 10. Typically, most of the nucleic acid molecules of the collection are substantially longer than k, and in preferred embodiments most of the nucleic acid molecules of the collection are at least 2×k nucleotides long.

In one embodiment, the number of unique nucleic acid molecules physically associated with the array is fewer than the number of all possible sites or, if the array includes only a certain percentage of such sites, fewer than the number of such sites. The compaction ratio (defined as the number of unique molecules divided by the number of represented k-mers) can be, for example, less than 0.5, 0.2, 0.1, 0.05, 0.01, or 0.002. Thus, in a typical embodiment, even though all possible sites of length k are represented, the array can have fewer unique addresses than the number of all such possible sites. This compact design is achieved by including more than one k-mer per address, while maintaining similar k-mers at different addresses.

One method for producing a collection of nucleic acids described herein includes representing the collection as a string (or a small number of strings). The Hamming distance function or any other distance function can be used to define sequence similarity among k-mers. Related sequences that are seen to be located in a common region of theoretical sequence space (e.g., a “Hamming ball”) are arranged in the string, and, hence, in the array, so that they are discernable (e.g., distributed to different addresses) from each other and recoverable. This design enables fine discrimination for evaluating specificity. Other exemplary distance functions include: a Euclidean distance, z-score distance, Bhattacharyya distance, Mahalanobis distance, Matusita distance, divergence metric, Chernoff distance, angular metric, Earth Mover's distance, Hausdorff distance, City Block (Manhattan) distance, Chebychev distance, Minkowski distance, or Canberra distance.
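As a minimal sketch of the Hamming distance and of a Hamming ball around a k-mer, the following may be helpful. The function names, the default alphabet, and the example center are assumptions made for illustration and are not part of the disclosure.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_ball(center, radius, alphabet="ACGT"):
    """All words within `radius` mismatches of `center`, including the center."""
    ball = {center}
    for r in range(1, radius + 1):
        # Choose which r positions to mutate ...
        for positions in combinations(range(len(center)), r):
            # ... and, at each, every letter other than the original one.
            options = [[c for c in alphabet if c != center[p]] for p in positions]
            for substitutions in product(*options):
                word = list(center)
                for p, s in zip(positions, substitutions):
                    word[p] = s
                ball.add("".join(word))
    return ball


# Hypothetical example: the 4-mer ACGT and everything one mismatch away.
ball = hamming_ball("ACGT", 1)
```

For a word of length k over four nucleotides, a ball of radius 1 holds 1 + 3k words, so the example ball above has 13 members.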

One method for using a collection of nucleic acid molecules described herein is to evaluate a molecule of interest (e.g., a protein, a drug, a nucleic acid aptamer), a sample, or a functional event (e.g., as a bio-sensor). In one embodiment, a collection of nucleic acid molecules in an array is used to evaluate the specificity of a nucleic acid binding compound, e.g., a nucleic acid binding protein, e.g., a DNA binding protein, such as a transcription factor or an enzyme (e.g., a methylase or restriction endonuclease, etc). In some cases, a protein or nucleic acid can be modified. For example, protein modifications include phosphorylation, ubiquitination, acetylation, and methylation. These modifications are typically site specific. The protein or nucleic acid can also be present in a complex, e.g., with other macromolecules. For example, a protein complex that includes at least two or three polypeptide chains can be used. All the polypeptide chains can differ or, e.g., as in a complex that includes a homodimer, some chains can be the same.

Interactions between a compound of interest and addresses of the array are evaluated, and the interaction data is processed to identify one or more sites with which the compound interacts. For example, if each possible k-mer is present at a plurality of addresses, it is possible to deconvolve the interaction data to identify one or more k-mers that interact with the compound.
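One simple way such a deconvolution might be approached is sketched below, under the assumption that each address yields a single numeric signal and that a k-mer is scored by the median signal of the addresses that contain it. The scoring rule, the function name, and the toy signals are hypothetical; the disclosure does not prescribe a particular procedure.

```python
from collections import defaultdict
from statistics import median


def score_kmers(address_signals, k):
    """Score each k-mer by the median signal of the addresses containing it.

    address_signals maps the sequence at an address to its measured signal.
    """
    signals_by_kmer = defaultdict(list)
    for seq, signal in address_signals.items():
        # Count a k-mer once per address, even if it recurs in the sequence.
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
            signals_by_kmer[kmer].append(signal)
    return {kmer: median(values) for kmer, values in signals_by_kmer.items()}


# Hypothetical signals: ACGT sits on two bright addresses, while each of its
# neighbours (CGTA, TACG) also appears on a dim address.
signals = {"ACGTA": 9.0, "TACGT": 8.0, "CGTAG": 1.0, "TTACG": 1.0}
scores = score_kmers(signals, 4)
best = max(scores, key=scores.get)
```

Because the k-mers that share a bright address with ACGT also occur on dim addresses, their median scores fall, which is why placing each k-mer in more than one context helps isolate the interacting site.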

In one embodiment, the collection of nucleic acid molecules is used to identify a nucleic acid of interest, e.g., an aptamer with binding or enzymatic activity, e.g., with respect to a target protein, e.g., an enzyme or cell-surface protein. For example, the protein can be associated with a disease or disorder. A nucleic acid molecule that binds to and/or inhibits the target protein can be used, e.g., to detect or modulate the disease or disorder.

In another embodiment, the collection of nucleic acid molecules is used to characterize a sample (e.g. a complex sample, e.g. a sample obtained from a subject, e.g., a patient). The pattern of interaction can be used to identify or characterize the sample.

Accordingly, in one embodiment, the collection of nucleic acid molecules can be used to characterize the binding preferences of nucleic acid binding molecules. Since all possible DNA sequence variants can be represented on DNA microarrays in a space- and cost-efficient manner, a reduced number of individual DNA sequences and individual DNA spots can be synthesized. This reduction also can have implications on the density and/or number of addresses on synthesized microarrays.

The nucleic acid molecules are longer (e.g., significantly longer) than k, so that the addition of each base to the length of an oligonucleotide (“oligo”) adds another k-mer to that oligonucleotide. For example, considering 10-mers (e.g., DNA sites 10 nucleotides long), a fifty base long oligo would contain 41 distinct 10-mers, and a sixty base oligo would contain 51 distinct 10-mers. Compared to the use of a single k-mer per oligo, this approach can greatly reduce the number of molecules required to characterize the binding, enzymatic, or other physical or functional interactions.
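The arithmetic in the preceding example follows from the fact that an oligo of length L has L − k + 1 windows of length k. A brief sketch is given below; the function names are hypothetical, and the lower bound assumes that every window on every oligo holds a different k-mer and that only one strand is counted.

```python
def kmers_per_oligo(length, k):
    """Number of k-long windows in an oligo of the given length."""
    return max(length - k + 1, 0)


def min_oligos(k, length, alphabet_size=4):
    """Lower bound on the oligos needed to carry every k-mer at least once."""
    per_oligo = kmers_per_oligo(length, k)
    if per_oligo == 0:
        raise ValueError("oligo length must be at least k")
    total = alphabet_size ** k
    return -(-total // per_oligo)  # ceiling division


windows_50 = kmers_per_oligo(50, 10)  # 41, as stated above
windows_60 = kmers_per_oligo(60, 10)  # 51, as stated above
bound_60 = min_oligos(10, 60)         # at least 20561 sixty-base oligos for all 4**10 10-mers
```

By comparison, one 10-mer per oligo would require 4^10 = 1,048,576 oligos.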

In one embodiment, the nucleic acid includes k-mers, where k is greater than 5, 6, 7, or 8, e.g., a biologically relevant size, and every k-mer or at least the certain percentage of such k-mers, is represented in at least two different nucleic acid molecules.

In one embodiment, in which the nucleic acids of the collection are immobilized on a surface, those two different nucleic acid molecules for a particular k-mer would be located at distinguishable addresses. Each k-mer in the collection can be represented at least once, twice, or more (e.g., 3, 4, or 5) times with a variety of flanking bases (e.g., in different nucleic acid molecules). The flanking bases could be either a particular sequence (e.g., non-degenerate) or be degenerate (e.g., N which is a mixture of A, C, G, and T, or R which is a mixture of G and A). The form of this collection of k-mers could be variable. For example, in one embodiment, the collection is provided as an array, e.g., an array of double-stranded oligonucleotides. The oligonucleotides can be, for example, in the range of 10 to 100 bases.

In one embodiment, each k-mer of the collection can be flanked by each of the four nucleotides on both sides, resulting in sixteen variants. For example, considering ACGT, it would be most informative to have AACGTA, AACGTC, AACGTG, AACGTT, CACGTA, CACGTC, CACGTG, CACGTT, GACGTA, GACGTC, GACGTG, GACGTT, TACGTA, TACGTC, TACGTG, and TACGTT represented on a DNA microarray.
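The sixteen flanked variants listed above can be enumerated mechanically, for example as in the following sketch. The function name is hypothetical.

```python
from itertools import product


def flank_variants(core, alphabet="ACGT"):
    """Every sequence formed by adding one nucleotide to each side of core."""
    return [left + core + right for left, right in product(alphabet, repeat=2)]


variants = flank_variants("ACGT")
# variants starts AACGTA, AACGTC, AACGTG, AACGTT, CACGTA, ... and ends TACGTT
```

With two flanking positions on each side the same construction would yield 4^4 = 256 variants, which is one reason flanking positions may instead be synthesized as degenerate bases.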

If not every k-mer is present in a collection, it is possible, for example, that the omitted k-mers have particular qualities, e.g., homo-polymeric sequences or sequences that include only two different types of nucleotides, and so forth. In certain embodiments, most k-mers are represented within the collection more than once. In preferred embodiments the sequence of a given k-mer is represented within a given molecule shared with other k-mers, and the sequence of this given k-mer is present on at least a second molecule wherein few or none of the other k-mers present on the given first molecule are present on the second molecule. While wishing not to be bound by theory, it is understood that multiple representations of a given k-mer can allow a more precise assignment of which k-mer of a molecule is being bound or otherwise subject to an interaction, and restricting, reducing, or eliminating, occurrences of pairs of k-mers to single occurrences allows more precise determinations of which k-mer within a molecule is being bound or subject to an interaction.
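As one hypothetical omission rule of the kind mentioned above, the sketch below drops homo-polymeric k-mers and k-mers built from only two types of nucleotide, and keeps the rest. The rule and the function names are assumptions made for illustration.

```python
from itertools import product


def is_low_complexity(kmer):
    """True for homopolymers and for words using only two nucleotide types."""
    return len(set(kmer)) <= 2


def retained_kmers(k, alphabet="ACGT"):
    """All k-mers over the alphabet that survive the omission rule."""
    words = ("".join(letters) for letters in product(alphabet, repeat=k))
    return [w for w in words if not is_low_complexity(w)]


kept = retained_kmers(3)  # 24 of the 64 possible 3-mers use three distinct letters
```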

In certain embodiments, to allow the binding site specificity of a DNA binding molecule to be determined completely and readily from a collection of DNA sequences, it is possible to design such DNA sequences with one or more of the following exemplary features: a) for example, requiring all possible DNA binding sites within a given binding space (termed a ‘Hamming ball’, which refers to a particular k-mer and the k-mers that are within an arbitrary number of mismatches from it) to be located on separate molecules in the collection; b) for example, requiring multiple copies of a given k-mer, with each copy flanked by unique flanking sequence so as to take into account potential junction effects; c) for example, requiring that a given k-mer, if it is located at one end of one molecule, preferably also be located either centrally within or at the other end of another molecule (e.g., to account for possible steric effects, surface attachment, synthesis, and/or the requirement for flanking DNA sequence); d) designing molecules of the collection such that k-mers within a given Hamming ball might be found on either strand (forward or reverse complement), if double-stranded molecules are used.
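Features a) and d) can be checked computationally for a candidate design. The sketch below reports molecules that carry two or more members of a given Hamming ball, counting sites on either strand; the function names, the example ball, and the example molecules are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, and T."""
    return seq.translate(_COMPLEMENT)[::-1]


def sites_on_either_strand(molecule, k):
    """k-mers readable from the forward strand or from its reverse complement."""
    forward = {molecule[i:i + k] for i in range(len(molecule) - k + 1)}
    return forward | {reverse_complement(site) for site in forward}


def crowded_molecules(molecules, ball, k):
    """Molecules carrying two or more members of the ball (contrary to feature a)."""
    members = set(ball)
    return [m for m in molecules
            if len(sites_on_either_strand(m, k) & members) >= 2]


# Hypothetical ball members and candidate molecules.
ball = {"ACGT", "ACGA", "CCGT"}
design = ["TTACGTT", "GGACGAG", "ACGTACGA"]
violations = crowded_molecules(design, ball, 4)  # only the third molecule
```

A design satisfying feature a) for this ball would return an empty list.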

There are a number of possible ways to design such a set of DNA sequences. In one embodiment, we have calculated what we believe to be a maximally compact representation of all possible binding sites. For this approach, we addressed the requirements for generating one string that contains every k-mer exactly once, when considering words in a stacked fashion. We formalized the notions of 1) ‘discernability’, e.g., ensuring that for every ‘Hamming ball’, for each of its words there exists at least one spot on which that word occurs without another word from that Hamming ball, and 2) ‘recoverability’ of Hamming balls. These two concepts (discernability and recoverability) jointly ensure that one can figure out which particular k-mer on a bound DNA sequence is actually bound. We ran computational simulations that indicated that most words in most Hamming balls are discernable, and that a Hamming ball is generally recoverable. This approach enables extracting any possible DNA binding site from such collections of molecules. If the collection is implemented as an array, the array can distinguish between different DNA binding sites, including similar sites.
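A string that contains every k-mer exactly once in overlapping ("stacked") windows is commonly known as a de Bruijn sequence. The sketch below builds one with the standard recursive Lyndon-word construction, which is substituted here purely for illustration and is not asserted to be the procedure used by the inventors (the claims mention a linear shift register). The string is then cut into segments whose ends overlap by k − 1 letters so that no window is lost at a cut. All function names and the segment length are hypothetical.

```python
from collections import Counter


def de_bruijn(alphabet, k):
    """Cyclic string in which every word of length k occurs exactly once."""
    n = len(alphabet)
    a = [0] * (n * k)
    out = []

    def db(t, p):
        if t > k:
            if k % p == 0:
                out.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in out)


def linearize(cyclic, k):
    """Append the first k - 1 letters so the wrap-around words appear linearly."""
    return cyclic + cyclic[:k - 1]


def segment(string, k, seg_len):
    """Cut into segments overlapping by k - 1 letters; each window lands in one segment."""
    if seg_len < k:
        raise ValueError("segment length must be at least k")
    step = seg_len - k + 1
    return [string[i:i + seg_len] for i in range(0, len(string) - k + 1, step)]


# Toy parameters: all 64 3-mers carried on eight 10-base segments.
K, SEG_LEN = 3, 10
full = linearize(de_bruijn("ACGT", K), K)
oligos = segment(full, K, SEG_LEN)
counts = Counter(o[i:i + K] for o in oligos for i in range(len(o) - K + 1))
```

In this toy case each of the 64 possible 3-mers occurs exactly once across the eight segments. A design of this kind places each k-mer in a single context only; representing each k-mer more than once, as discussed above, would require additional strings or segments.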

A collection of nucleic acid molecules described herein has numerous uses. Among them is the formation of an array of oligonucleotides for studying the binding properties of a compound, e.g., a nucleic acid binding protein. Exemplary nucleic acid binding proteins include natural and designed nucleic acid binding proteins (e.g., zinc finger proteins). Such an array of DNA oligonucleotides can be contacted by a labeled DNA binding protein; and by analyzing which oligonucleotides are bound, the sequence of the preferred binding site and the relative strengths of binding to related sites can be determined. More than one protein can be bound to such an array simultaneously. Proteins that compete or cooperate, or binding site variants of the same protein, can be bound simultaneously to analyze the binding site differences. These methods frequently use fluorescently labeled protein and fluorescent microarray scanners. Another use for these arrays is the binding of one or more proteins and the interrogation of each oligonucleotide spot with a laser for the purpose of identifying proteins using mass spectrometry. This embodiment enables identification and relative quantification of each protein bound to a given oligonucleotide, e.g., without labeling the protein.

Such a design of nucleic acid sequences is not limited to double-stranded DNA. It is also not limited to DNA microarrays, as such a set of nucleic acid sequences can also be used in solution.

In one embodiment, it is possible to use a highly parallel in vitro microarray technology for high-throughput characterization of the sequence specificities of DNA-protein interactions. We shall refer to this approach as protein binding microarray technology, or simply “PBM.” PBM technology allows one to determine the binding site specificities of known, designed, or predicted transcription factors (TFs) in a useful time frame, for example, in a single day. The method can include providing a purified or at least partially purified TF.

Such PBM experiments may be particularly useful when ChIP-chip experiments for particular TFs do not result in enough enrichment of bound fragments in the immunoprecipitated sample to permit identification of the DNA sites bound in vivo. The PBM data may also provide valuable data for those TFs for which it is not known under what culture conditions they are expressed or at what timepoints they are expressed. Moreover, there are hundreds of predicted DNA binding proteins that could be screened for sequence-specific binding using the rapid, high-throughput PBM experiments.

The advantages include the provision of numerous DNA sequence variants in a space- and cost-efficient manner. Only a minimal number of individual DNA sequences and individual DNA spots need to be synthesized. This also has implications, for example, on the required density and number of the spots on synthesized microarrays.

In another aspect, the disclosure features a collection of nucleic acids that includes all or a certain fraction (e.g., at least 70, 80, 90, 95, 98, 99% of all) intergenic sequences from a chromosome of an organism, or from a genome of an organism. Related collections can include all or a certain fraction (e.g., at least 70, 80, 90, 95, 98, 99% of all) sequences that are within a certain distance of identifiable RNA start sites, TATA boxes, Hogness boxes, Pribnow boxes, or Shine-Dalgarno sequences. For example, the sequences can be within 100, 200, 500, 800, 1000, or 5000 nucleotides of such sites. The nucleic acids of the collections can be of any length, e.g., an amplifiable length, or a length ranging from 60 basepairs to 1500 basepairs or 30 bp to 2400 bp, or 30 bp to 200 bp. The nucleic acids can be immobilized, e.g., on one or more arrays. In some embodiments, the collection also includes nucleic acids that are intragenic, e.g., coding sequences, introns, or untranslated regulatory sequences.

In an exemplary method, the nucleic acids of the collection are contacted to an agent, e.g., a protein, small molecule, etc, and interactions between the agent and one or more of the nucleic acids are evaluated. For example, the protein can be a transcription factor or a nucleic acid binding fragment thereof.

All references, patents, and patent applications are hereby incorporated by reference in their entireties. The following references are among those so incorporated: US 2003-0215856, US 2003-0108880, U.S. Ser. Nos. 60/227,900, 60/564,864, 60/587,066 (inclusive of all Appendices and Figures) and PCT/US01/26435.

DETAILED DESCRIPTION

A collection of polymers which include all or a certain percentage of all k-mers has a variety of uses. For example, the polymers can be nucleic acids or polypeptides (e.g., short peptides of length less than 50, 40, 30, 20, or 10 amino acids). The collection can be used to characterize interactions between an agent or a pool that includes multiple agents and members of the collection. The results of such characterization can indicate which k-mers interact with the agent or the pool. Although such collections of polymers have numerous applications, the following are some exemplary ones.

In one embodiment, the DNA binding specificity of a test compound, e.g., a test macromolecule such as a protein or a fragment thereof, can be characterized. For example, the DNA binding domain of a transcription factor can be contacted to members of the collection. The DNA binding domain can be labeled and the members of the collection can be disposed on an array. After contacting the domain to the array and washing the array, the array can be imaged to identify which members of the collection are bound by the domain. The method can be adapted, e.g., to identify one or more functional fragments that interact with DNA.

Results obtained with the candidate fragments can be compared to that of a larger fragment or the entire protein. Thus, information is accrued about the relative contributions of different regions of the protein to binding specificity and affinity. A similar method can be used for DNA binding domains from other types of nuclear proteins, e.g., centromeric proteins, telomeric proteins, and so forth. Protein complexes, including homo- and hetero-oligomers, can be analyzed. Likewise, RNA binding proteins or RNA binding domains thereof can be characterized using an embodiment in which the members of the collection are RNA.

The collection of polymers can be used to design a nucleic acid binding protein with a desired DNA binding specificity. A known nucleic acid binding domain or scaffold (e.g., zinc finger domains) can be randomized or specifically mutated, e.g., at DNA contacting positions. DNA contacting positions can be identified by inspection of three-dimensional structures, e.g., obtained by X-ray crystallography or NMR. These positions can be randomized, e.g., to all possible amino acids, all nineteen non-cysteine amino acids, hydrophilic amino acids, or a combination of hydrophilic and aliphatic amino acids. Mutated domains are contacted to the collection of polymers, and variants which have a desired binding specificity can be used for engineering the nucleic acid binding protein. For example, variants that interact with a site present in a target gene can be selected and used as the DNA binding domain of an artificial transcription factor that regulates that target gene. The artificial transcription factor can also include a transcriptional regulatory domain, e.g., an activation or repression domain. DNA binding domains can be constructed by linking different modules, each with a desired binding specificity, in order to produce a chimeric protein that recognizes a site. For example, one can identify or design a first polypeptide that interacts with a first k-mer and a second polypeptide that interacts with a second k-mer. Then the first and second polypeptide can be linked (e.g., by a covalent or non-covalent linkage, e.g., by making a chimeric polypeptide that includes both the first and second polypeptide) to provide a protein that recognizes a site that includes the first and second k-mers. The first and second k-mers can be directly adjacent or gapped, e.g., by at least 1, 2, 3, 4, or 5 nucleotides, e.g., by one or more turns of the DNA helix.

Similar analyses can be used to characterize non-proteinaceous agents. For example, members of a chemical library or designed chemicals (e.g., polyamides and intercalators) can be evaluated, e.g., for binding specificity and affinity.

The collection of polymers can be used in implementations that do not feature an array. For example, it is possible to prepare the collection so that each polymer includes one or more invariant termini that serve as primer binding sites. The members of the collection are contained in solution, e.g., in a single container. A protein of interest can be contacted to the solution and immobilized on a solid support. After the contacting, the support can be washed to remove unbound members of the collection. The bound members can be characterized, e.g., by PCR amplification with primers that recognize the invariant termini. Amplification products can be sequenced to determine the identity of the members that bound to the protein of interest.
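
The final identification step of such a solution-phase implementation can be sketched in Python. This is a minimal illustration only: the terminus sequences `LEFT` and `RIGHT`, the example reads, and the function names are hypothetical and are not taken from the disclosure.

```python
from collections import Counter

# Hypothetical invariant termini (primer binding sites); illustrative only.
LEFT = "GCTAGC"
RIGHT = "GGATCC"

def variable_region(read, left=LEFT, right=RIGHT):
    """Return the sequence between the invariant termini, or None if absent."""
    flanked = read.startswith(left) and read.endswith(right)
    if flanked and len(read) > len(left) + len(right):
        return read[len(left):len(read) - len(right)]
    return None

def count_bound_members(reads):
    """Tally how often each collection member appears among sequenced amplicons."""
    counts = Counter()
    for read in reads:
        region = variable_region(read)
        if region is not None:
            counts[region] += 1
    return counts
```

Members with the highest counts would be candidates for further characterization; reads lacking both termini are ignored rather than guessed at.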

In addition to binding specificity, it is possible to characterize other types of interactions, e.g., enzymatic interactions. For example, an enzyme that specifically interacts with DNA can be evaluated using the collection of polymers. In one implementation, a site-specific nuclease is contacted to the collection. Members of the collection that are cleaved by the nuclease are identified. For example, if the members are present on an array and are end-labelled, the cleaved members are identified by release of the label from the respective address of the array. In other types of assays, the DNA or other nucleic acid can be modified, e.g., with a labeled molecule. Other types of enzymatic reactions that can be evaluated include methylation, acetylation (e.g., of histone bound polymers), polymerization, deamination, and so forth.

Similar applications are also available where the polymers include peptide nucleic acids (PNAs), peptides (e.g., polypeptides or short peptides), or artificial polymers (e.g., peptoids). See, e.g., Simon et al. (1992) Proc. Natl. Acad. Sci. USA 89:9367-71 and Horwell (1995) Trends Biotechnol. 13:132-4. For example, a collection of such polymers can be used to evaluate kinase specificity, e.g., by identifying which members of the collection are phosphorylated by a particular kinase. The collection can also be modified so that, rather than include all or a certain percentage of all k-mers, it includes all or a certain percentage of all k-mers with a certain property, e.g., all k-mers that include at least one serine, or all k-mers that include at least one tyrosine, or all k-mers that include at least one residue that can be phosphorylated (e.g., histidine, serine, threonine, or tyrosine). Of course, collections of polymers other than nucleic acids can also be used to evaluate binding interactions and other types of interactions.
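
As an illustration of restricting a collection to k-mers with a certain property, the following Python sketch enumerates peptide k-mers that contain at least one phosphorylatable residue and gives the corresponding closed-form count. The function names are hypothetical, and the sketch assumes the twenty standard amino acids with histidine, serine, threonine, and tyrosine treated as the phosphorylatable set.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues (assumed alphabet)
PHOSPHO_ACCEPTORS = set("HSTY")          # histidine, serine, threonine, tyrosine

def kmers_with_residue(k, required=PHOSPHO_ACCEPTORS, alphabet=AMINO_ACIDS):
    """Yield every peptide k-mer containing at least one required residue."""
    for letters in product(alphabet, repeat=k):
        if required.intersection(letters):
            yield "".join(letters)

def count_kmers_with_residue(k, n_required=4, n_alphabet=20):
    """Closed-form count: all k-mers minus those lacking every required residue."""
    return n_alphabet ** k - (n_alphabet - n_required) ** k
```

The closed form is useful for sizing a collection before enumerating it, since explicit enumeration grows as 20 to the power k.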

The collection of polymers can be used to identify molecules with any desired activity, e.g., a binding or other functional activity.

Polymer Arrays

A collection of polymers described herein can be immobilized on an array. An array is a substrate that includes a plurality of addresses. Each address can include a homogenous population of immobilized nucleic acids, e.g., nucleic acids of predetermined sequence. The density of addresses can be at least 10, 50, 200, 500, 10³, 10⁴, 10⁵, or 10⁶ addresses per cm², and/or no more than 10, 50, 100, 200, 500, 10³, 10⁴, 10⁵, or 10⁶ addresses per cm². Addresses in addition to addresses of the plurality can be deposited on the array. The addresses can be distributed on the substrate in one dimension, e.g., a linear array; in two dimensions, e.g., a planar array; or in three dimensions, e.g., a three dimensional array (e.g., layers of a gel matrix). The term “microarray” is used interchangeably with the term “array.” The term “array” also refers to any coatings or surfaces on a substrate such that a molecule that is said to be “on” or physically associated with an array may be physically associated with such a coating or surface.
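
The relationship between the size of a collection and the array area it occupies can be illustrated with a short calculation. The Python sketch below is illustrative only: the function names and any example values of k, polymer length, and density are assumptions, and the bound counts sites on one strand only. It relies on the fact that a polymer of length L contains at most L − k + 1 sites of length k.

```python
import math

def min_polymers(k, polymer_length, alphabet_size=4):
    """Lower bound on polymers needed so every k-mer can appear at least once."""
    sites_per_polymer = polymer_length - k + 1
    if sites_per_polymer < 1:
        raise ValueError("polymer_length must be at least k")
    return math.ceil(alphabet_size ** k / sites_per_polymer)

def array_area_cm2(n_addresses, density_per_cm2):
    """Area occupied when each polymer is given its own address."""
    return n_addresses / density_per_cm2
```

For example, with k = 8 and 35-nucleotide polymers the bound is 2,341 polymers, compared with 65,536 addresses if each site were presented on its own polymer.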

In one embodiment, the substrate is an insoluble or solid substrate. Potentially useful insoluble substrates include: mass spectroscopy plates (e.g., for MALDI), glass (e.g., functionalized glass, a glass slide, porous silicate glass, a single crystal silicon, quartz, UV-transparent quartz glass), plastics and polymers (e.g., polystyrene, polypropylene, polyvinylidene difluoride, poly-tetrafluoroethylene, polycarbonate, PDMS, acrylic), metal coated substrates (e.g., gold), silicon substrates, latex, membranes (e.g., nitrocellulose, nylon). The insoluble substrate can also be pliable. The substrate can be opaque, translucent, or transparent. In some embodiments, the array is merely fashioned from a multiwell plate, e.g., a 96- or 384-well microtitre plate.

A variety of methods can be used to prepare an array. In some embodiments, polymers are synthesized in situ, e.g., on the array. In other embodiments, the polymers are synthesized and then disposed on the array. The polymers are synthesized according to one of the sequence design methods described herein.

1. Light-Directed Methods

Where a single solid support is employed, the oligonucleotides can be formed using a variety of techniques known to those skilled in the art of polymer synthesis on solid supports. For example, light-directed methods are described in U.S. Pat. Nos. 5,143,854; 5,510,270; and 5,527,681. These methods involve activating predefined regions of a solid support and then contacting the support with a preselected monomer solution. These regions can be activated with a light source, typically shone through a mask (much in the manner of photolithography techniques used in integrated circuit fabrication). Other regions of the support remain inactive because illumination is blocked by the mask and they remain chemically protected. Thus, a light pattern defines which regions of the support react with a given monomer. By repeatedly activating different sets of predefined regions and contacting different monomer solutions with the support, a diverse array of polymers is produced on the support. Other steps, such as washing unreacted monomer solution from the support, can be used as necessary. Other applicable methods include mechanical techniques such as those described in PCT No. 92/10183 and U.S. Pat. No. 5,384,261. Still further techniques include bead based techniques such as those described in PCT US/93/04145 and pin based methods such as those described in U.S. Pat. No. 5,288,514.

The surface of a solid support, optionally modified with spacers having photolabile protecting groups such as NVOC and MeNPOC, is illuminated through a photolithographic mask, yielding reactive groups (typically hydroxyl groups) in the illuminated regions. A 3′-O-phosphoramidite activated deoxynucleoside (protected at the 5′-hydroxyl with a photolabile protecting group) is then presented to the surface and chemical coupling occurs at sites that were exposed to light. Following capping and oxidation, the support is rinsed and the surface illuminated through a second mask, to expose additional hydroxyl groups for coupling. A second 5′-protected, 3′-O-phosphoramidite activated deoxynucleoside is presented to the surface. The selective photodeprotection and coupling cycles are repeated until the desired set of oligonucleotides is produced. Alternatively, an oligomer of from, for example, 4 to 30 nucleotides can be added to each of the preselected regions rather than synthesize each member one nucleotide monomer at a time.
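
The repeated photodeprotection and coupling cycles can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions: one mask is used per nucleotide per position, sequences are written in the order of synthesis, and the function names and address labels are hypothetical. It does not model capping, oxidation, or coupling efficiency.

```python
def mask_schedule(sequences, bases="ACGT"):
    """Return a list of (position, base, addresses) illumination steps.

    sequences maps an array address to the oligonucleotide to be built there,
    written in the order of synthesis. Each mask exposes only the addresses
    that need the given base at the given position.
    """
    steps = []
    longest = max(len(s) for s in sequences.values())
    for position in range(longest):
        for base in bases:
            exposed = sorted(addr for addr, seq in sequences.items()
                             if position < len(seq) and seq[position] == base)
            if exposed:  # skip masks that would expose nothing
                steps.append((position, base, exposed))
    return steps

def synthesize(steps):
    """Replay the schedule and return the sequence built at each address."""
    built = {}
    for _position, base, exposed in steps:
        for addr in exposed:
            built[addr] = built.get(addr, "") + base
    return built
```

Replaying the schedule with `synthesize` reproduces the target sequences, which is a simple check that every address was exposed exactly once per position. At most four masks are needed per position regardless of the number of addresses.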

2. Flow Channel or Spotting Methods

Additional methods applicable to array synthesis on a single support are described in U.S. Pat. No. 5,384,261. In the methods disclosed in these applications, reagents are delivered to the support by either (1) flowing within a channel defined on predefined regions or (2) “spotting” on predefined regions. Other approaches, as well as combinations of spotting and flowing, may be employed as well. In each instance, certain activated regions of the support are mechanically separated from other regions when the monomer solutions are delivered to the various reaction sites.

A typical “flow channel” method can generally be described as follows: Diverse polymer sequences are synthesized at selected regions of a solid support by forming flow channels on a surface of the support through which appropriate reagents flow or in which appropriate reagents are placed. For example, assume a monomer “A” is to be bound to the support in a first group of selected regions. If necessary, all or part of the surface of the support in all or a part of the selected regions is activated for binding by, for example, flowing appropriate reagents through all or some of the channels, or by washing the entire support with appropriate reagents. After placement of a channel block on the surface of the support, a reagent having the monomer A flows through or is placed in all or some of the channel(s). The channels provide fluid contact to the first selected regions, thereby binding the monomer A to the support directly or indirectly (via a spacer) in the first selected regions.

Thereafter, a monomer B is coupled to second selected regions, some of which may be included among the first selected regions. The second selected regions will be in fluid contact with a second flow channel(s) through translation, rotation, or replacement of the channel block on the surface of the support; through opening or closing a selected valve; or through deposition of a layer of chemical or photoresist. If necessary, a step is performed for activating at least the second regions. Thereafter, the monomer B is flowed through or placed in the second flow channel(s), binding monomer B at the second selected locations. In this particular example, the resulting sequences bound to the support at this stage of processing will be, for example, A, B, and AB. The process is repeated to form a vast array of sequences of desired length at known locations on the support.

After the support is activated, monomer A can be flowed through some of the channels, monomer B can be flowed through other channels, a monomer C can be flowed through still other channels, etc. In this manner, many or all of the reaction regions are reacted with a monomer before the channel block must be moved or the support must be washed and/or reactivated. By making use of many or all of the available reaction regions simultaneously, the number of washing and activation steps can be minimized. One of skill in the art will recognize that there are alternative methods of forming channels or otherwise protecting a portion of the surface of the support. For example, a protective coating such as a hydrophilic or hydrophobic coating (depending upon the nature of the solvent) is utilized over portions of the support to be protected, sometimes in combination with materials that facilitate wetting by the reactant solution in other regions. In this manner, the flowing solutions are further prevented from passing outside of their designated flow paths.

The “spotting” methods of preparing compounds and arrays can be implemented in much the same manner. A first monomer, A, can be delivered to and coupled with a first group of reaction regions which have been appropriately activated. Thereafter, a second monomer, B, can be delivered to and reacted with a second group of activated reaction regions. Unlike the flow channel embodiments described above, reactants are delivered in relatively small quantities by directly depositing them in selected regions. In some steps, the entire support surface can be sprayed or otherwise coated with a solution, if it is more efficient to do so. Precisely measured aliquots of monomer solutions may be deposited dropwise by a dispenser that moves from region to region. Typical dispensers include a micropipette to deliver the monomer solution to the support and a robotic system to control the position of the micropipette with respect to the support, or an ink-jet printer. In other embodiments, the dispenser includes a series of tubes, a manifold, an array of pipettes, or the like so that various reagents can be delivered to the reaction regions simultaneously.

3. Pin-Based Methods

Another method which is useful for the preparation of the immobilized arrays involves “pin-based synthesis.” This method, which is described in detail in U.S. Pat. No. 5,288,514 utilizes a support having a plurality of pins or other extensions. The pins are each inserted simultaneously into individual reagent containers in a tray. An array of 96 pins is commonly utilized with a 96-container tray, such as a 96-well microtitre dish. Each tray is filled with a particular reagent for coupling in a particular chemical reaction on an individual pin. Accordingly, the trays can often contain different reagents. Since the chemical reactions have been optimized such that each of the reactions can be performed under a relatively similar set of reaction conditions, it becomes possible to conduct multiple chemical coupling steps simultaneously. The method can use a support with a spacer, S, having active sites. In the particular case of oligonucleotides, for example, the spacer may be selected from a wide variety of molecules which can be used in organic environments associated with synthesis as well as aqueous environments associated with binding studies such as may be conducted between the nucleic acid members of the array and other molecules. These molecules include, but are not limited to, proteins (or fragments thereof), lipids, carbohydrates, proteoglycans and nucleic acid molecules. Examples of suitable spacers are polyethyleneglycols, dicarboxylic acids, polyamines and alkylenes, substituted with, for example, methoxy and ethoxy groups. Additionally, the spacers will have an active site on the distal end. The active sites are optionally protected initially by protecting groups. Among a wide variety of protecting groups which are useful are FMOC, BOC, t-butyl esters, t-butyl ethers, and the like.

Various exemplary protecting groups are described in, for example, Atherton et al., 1989, Solid Phase Peptide Synthesis, IRL Press. In some embodiments, the spacer may provide for a cleavable function by way of, for example, exposure to acid or base.

b. Arrays on Multiple Supports

Yet another method which is useful for synthesis of compounds and arrays includes “bead based synthesis.” A general approach for bead based synthesis is described in PCT/US93/04145 (filed Apr. 28, 1993).

c. Protein and Peptide Arrays

US 2002-0192673 describes exemplary methods for making a protein array which include protein translation. U.S. Pat. No. 5,143,854 describes methods that include synthetic amino acid coupling.

Nucleic Acid Binding Proteins

A collection of nucleic acids, as described herein, can be used to evaluate the binding properties of nucleic acid binding proteins (NABPs).

A variety of protein structures are known to bind nucleic acids with high affinity and high specificity. These structures are used in a large number of different proteins, including proteins that specifically control nucleic acid function. For reviews of structural motifs which recognize double stranded DNA, see, e.g., Pabo and Sauer (1992) Annu. Rev. Biochem. 61:1053-95; Patikoglou and Burley (1997) Annu. Rev. Biophys. Biomol. Struct. 26:289-325; Nelson (1995) Curr Opin Genet Dev. 5:180-9. A few non-limiting examples of nucleic acid binding domains include:

Zinc fingers. Zinc fingers are small polypeptide domains of approximately 30 amino acid residues in which there are four amino acids, either cysteine or histidine, appropriately spaced such that they can coordinate a zinc ion (for reviews, see, e.g., Klug and Rhodes, (1987) Trends Biochem. Sci. 12:464-469; Evans and Hollenberg, (1988) Cell 52:1-3; Payre and Vincent, (1988) FEBS Lett. 234:245-250; Miller et al., (1985) EMBO J. 4:1609-1614; Berg, (1988) Proc. Natl. Acad. Sci. U.S.A. 85:99-102; Rosenfeld and Margalit, (1993) J. Biomol. Struct. Dyn. 11:557-570). Hence, zinc finger domains can be categorized according to the identity of the residues that coordinate the zinc ion, e.g., as the Cys₂-His₂ class, the Cys₂-Cys₂ class, the Cys₂-CysHis class, and so forth. The zinc coordinating residues of Cys₂-His₂ zinc fingers have a typical spacing, which is described in Wolfe et al., (1999) Annu. Rev. Biophys. Biomol. Struct. 3:183-212. Typically, the intervening amino acids fold to form an anti-parallel β-sheet that packs against an α-helix, although the anti-parallel β-sheets can be short, non-ideal, or non-existent. The fold positions the zinc-coordinating side chains so they are in a tetrahedral conformation appropriate for coordinating the zinc ion. The base contacting residues are at the N-terminus of the finger and in the preceding loop region. A zinc finger DNA-binding protein normally consists of a tandem array of three or more zinc finger domains.

The zinc finger domain (or “ZFD”) is one of the most common eukaryotic DNA-binding motifs, found in species from yeast to higher plants and to humans. By one estimate, there are at least several thousand zinc finger domains in the human genome alone, possibly at least 4,500. Zinc finger domains can be isolated from zinc finger proteins. Non-limiting examples of zinc finger proteins include CF2-II, Kruppel, WT1, basonuclin, BCL-6/LAZ-3, erythroid Kruppel-like transcription factor, transcription factors Sp1, Sp2, Sp3, and Sp4, transcriptional repressor YY1, EGR1/Krox24, EGR2/Krox20, EGR3/Pilot, EGR4/AT133, Evi-1, GLI1, GLI2, GLI3, HIV-EP1/ZNF40, HIV-EP2, KRI, ZfX, ZfY, and ZNF7.

Computational methods described below can be used to identify all zinc finger domains encoded in a sequenced genome or in a nucleic acid database. Any such zinc finger domain can be utilized. In addition, artificial zinc finger domains have been designed, e.g., using computational methods (e.g., Dahiyat and Mayo, (1997) Science 278:82-7).

Homeodomains. Homeodomains are simple eukaryotic domains that consist of an N-terminal arm that contacts the DNA minor groove, followed by three α-helices that contact the major groove (for a review, see, e.g., Laughon, (1991) Biochemistry 30:11357-67). The third α-helix is positioned in the major groove and contains critical DNA-contacting side chains. Homeodomains have a characteristic highly-conserved motif present at the turn leading into the third α-helix. The motif includes an invariant tryptophan that packs into the hydrophobic core of the domain. Homeodomains are commonly found in transcription factors that determine cell identity and provide positional information during organismal development. Such classical homeodomains can be found in the genome in clusters such that the order of the homeodomains in the cluster approximately corresponds to their expression pattern along a body axis. Homeodomains can be identified by alignment with a homeodomain, e.g., Hox-1, or by alignment with a homeodomain profile or a homeodomain hidden Markov Model (HMM; see below), e.g., PF00046 of the Pfam database or “HOX” of the SMART database (see online resources referenced in Letunic et al. (2004) Nucleic Acids Res. 32 (Database issue):D142-4), or by the Prosite motif PDOC00027 as mentioned above.

Helix-turn-helix proteins. This DNA binding motif is common among many prokaryotic transcription factors. There are many subfamilies, e.g., the LacI family, the AraC family, to name but a few. The two helices in the name refer to a first α-helix that packs against and positions a second α-helix in the major groove of DNA. These domains can be identified by alignment with a HMM, e.g., HTH_ARAC, HTH_ARSR, HTH_ASNC, HTH_CRP, HTH_DEOR, HTH_DTXR, HTH_GNTR, HTH_ICLR, HTH_LACI, HTH_LUXR, HTH_MARR, HTH_MERR, and HTH_XRE profiles available in the SMART database.

Helix-loop-helix proteins. This DNA binding domain is commonly found among homo- and hetero-dimeric transcription factors, e.g., MyoD, fos, jun, E11, and myogenin. The domain consists of a dimer, each monomer contributing two α-helices and an intervening loop. The domain can be identified by alignment with a HMM, e.g., the “HLH” profile available in the SMART database. Although helix-loop-helix proteins are typically dimeric, monomeric versions can be constructed by engineering a polypeptide linker between the two subunits such that a single open reading frame encodes both subunits and the linker.

Some nucleic acid binding domains can bind to both DNA and RNA. For example, certain zinc finger domains and homeodomains can interact with RNA as well as DNA. In addition, a number of RNA binding domains, both natural and artificial, are known. For example, the HIV tat protein includes an RNA binding domain that is arginine rich. See generally, Tian et al. (2003) Prog Nucleic Acid Res Mol Biol. 74:123-58; Das et al. (2003) Biopolymers. 70(1):80-5; and Doyle et al. (2002) J Struct Biol. 140(1-3):147-53.

Identification of Protein Domains

A variety of methods can be used to identify structural domains, e.g., nucleic acid binding domains.

Computational Methods. The amino acid sequence of a DNA binding domain isolated by a method described herein can be compared to a database of known sequences, e.g., an annotated database of protein sequences or an annotated database which includes entries for nucleic acid binding domains. In another implementation, databases of uncharacterized sequences, e.g., unannotated genomic, EST or full-length cDNA sequence; of characterized sequences, e.g., SwissProt or PDB; and of domains, e.g., Pfam, ProDom (Servant et al. (2002) Brief Bioinform. 3(3):246-51), and SMART (Simple Modular Architecture Research Tool, see above) can provide a source of nucleic acid binding domain sequences. Nucleic acid sequence databases can be translated in all six reading frames for the purpose of comparison to a query amino acid sequence. Nucleic acid sequences that are flagged as encoding candidate nucleic acid binding domains can be amplified from an appropriate nucleic acid source, e.g., genomic DNA or cellular RNA. Such nucleic acid sequences can be cloned into an expression vector. The procedures for computer-based domain identification can be interfaced with an oligonucleotide synthesizer and robotic systems to produce nucleic acids encoding the domains in a high-throughput platform. Cloned nucleic acids encoding the candidate domains can also be stored in a host expression vector and shuttled easily into an expression vector, e.g., into a translational fusion vector with Zif268 fingers 1 and 2, either by restriction enzyme mediated subcloning or by site-specific, recombinase mediated subcloning (see U.S. Pat. No. 5,888,732). The high-throughput platform can be used to generate multiple microtitre plates containing nucleic acids encoding different candidate nucleic acid binding domains.
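
Translation of a nucleic acid sequence in all six reading frames can be sketched in Python as follows. The sketch assumes the standard genetic code and unambiguous DNA letters (A, C, G, T); the function names and frame labels are illustrative, and codons containing other characters are reported as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}

def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate from the first base; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to translated sequences."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six translations can then be compared to a query amino acid sequence or profile using the same search applied to protein databases.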

Detailed methods for the identification of domains from a starting sequence or a profile are well known. See, for example, Prosite (Hofmann et al., (1999) Nucleic Acids Res. 27:215-219), FASTA, BLAST (Altschul et al., (1990) J. Mol Biol. 215:403-10.), etc. A simple string search can be done to find amino acid sequences with identity to a query sequence or a query profile, e.g., using Perl to scan text files. Sequences so identified can be about 30%, 40%, 50%, 60%, 70%, 80%, 90%, or greater identical to an initial input sequence.
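
A simple string search of this kind can be sketched as an ungapped sliding-window scan. The example below is written in Python rather than Perl, uses hypothetical function names, and computes identity without gaps or substitution scoring, so it is far less sensitive than FASTA or BLAST.

```python
def percent_identity(query, candidate):
    """Percent of aligned positions that match; sequences must be equal length."""
    if len(query) != len(candidate) or not query:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(query, candidate))
    return 100.0 * matches / len(query)

def scan(query, text, min_identity=30.0):
    """Slide the query along text; return (offset, window, identity) hits."""
    hits = []
    width = len(query)
    for offset in range(len(text) - width + 1):
        window = text[offset:offset + width]
        identity = percent_identity(query, window)
        if identity >= min_identity:
            hits.append((offset, window, identity))
    return hits
```

The `min_identity` argument corresponds to the identity thresholds listed above (30%, 40%, and so on).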

Domains similar to a query domain can be identified from a public database, e.g., using the XBLAST programs (version 2.0) of Altschul et al., (1990) J. Mol. Biol. 215:403-10. For example, BLAST protein searches can be performed with the XBLAST parameters as follows: score=50, wordlength=3. Gaps can be introduced into the query or searched sequence as described in Altschul et al., (1997) Nucleic Acids Res. 25(17):3389-3402. Default parameters for XBLAST and Gapped BLAST programs are available from on-line resources of the National Center of Biotechnology Information, National Institutes of Health, Bethesda Md.

The Prosite profiles PS00028 and PS50157 can be used to identify zinc finger domains. In a SWISSPROT release of 80,000 protein sequences, these profiles detected 3189 and 2316 zinc finger domains, respectively. Profiles can be constructed from a multiple sequence alignment of related proteins by a variety of different techniques. Gribskov and co-workers (Gribskov et al., (1990) Meth. Enzymol. 183:146-159) utilized a symbol comparison table to convert a multiple sequence alignment supplied with residue frequency distributions into weights for each position. See, for example, the PROSITE database and the work of Luethy et al., (1994) Protein Sci. 3:139-1465.
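
One simple way to turn a multiple sequence alignment into position-specific weights is a log-odds score against a uniform background. The Python sketch below illustrates that generic scheme only; it is not the symbol comparison table method of Gribskov and co-workers, it assumes an ungapped alignment over the twenty standard amino acids, and the function names and pseudocount value are hypothetical.

```python
import math
from collections import Counter

def build_profile(alignment, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=1.0):
    """Return per-column log-odds weights against a uniform background."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(alphabet)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({
            residue: math.log2(((counts[residue] + pseudocount) / total) / background)
            for residue in alphabet
        })
    return profile

def score(profile, segment):
    """Sum the position-specific weights for an ungapped segment."""
    return sum(column[residue] for column, residue in zip(profile, segment))
```

Segments that resemble the aligned sequences score higher than unrelated segments, and the score can be compared to a threshold to flag candidate domains.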

Hidden Markov Models (HMM's) representing a DNA binding domain of interest can be generated or obtained from a database of such models, e.g., the Pfam database, release 2.1. A database can be searched, e.g., using the default parameters, with the HMM in order to find additional domains (see, e.g., the Sanger Center, Cambridge UK for default parameters). Alternatively, the user can optimize the parameters. A threshold score can be selected to filter the database of sequences such that sequences that score above the threshold are displayed as candidate domains. A description of the Pfam database can be found in Sonnhammer et al., (1997) Proteins 28(3):405-420, and a detailed description of HMMs can be found, for example, in Gribskov et al., (1990) Meth. Enzymol. 183:146-159; Gribskov et al., (1987) Proc. Natl. Acad. Sci. USA 84:4355-4358; Krogh et al., (1994) J. Mol. Biol. 235:1501-1531; and Stultz et al., (1993) Protein Sci. 2:305-314.

The SMART database of HMM's (Simple Modular Architecture Research Tool, available from online resources of EMBL, Heidelberg, Germany; Schultz et al., (1998) Proc. Natl. Acad. Sci. USA 95:5857 and Schultz et al., (2000) Nucl. Acids Res 28:231) provides a catalog of zinc finger domains (ZnF_C2H2, ZnF_C2C2; ZnF_C2HC; ZnF_C3H1; ZnF_C4; ZNF_CHCC; ZnF_GATA; and ZnF_NFX) identified by profiling with the hidden Markov models of the HMMer2 search program (Durbin et al., (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press).

Hybridization-based Methods. A collection of nucleic acids encoding various forms of a DNA binding domain can be analyzed to profile sequences encoding conserved amino- and carboxy-terminal boundary sequences. Degenerate oligonucleotides can be designed to hybridize to sequences encoding such conserved boundary sequences. Moreover, the efficacy of such degenerate oligonucleotides can be estimated by comparing their composition to the frequency of possible annealing sites in known genomic sequences. Multiple rounds of design can be used to optimize the degenerate oligonucleotides.
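
Estimating how often a degenerate oligonucleotide could anneal in a known sequence can be sketched in Python using the IUPAC nucleotide codes. The sketch counts exact matches on a single strand only and ignores mismatch tolerance and hybridization conditions; the function names are hypothetical.

```python
import re

# IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def degeneracy(oligo):
    """Number of distinct sequences represented by a degenerate oligonucleotide."""
    total = 1
    for symbol in oligo.upper():
        total *= len(IUPAC[symbol])
    return total

def to_regex(oligo):
    """Compile a pattern that matches any sequence the oligonucleotide encodes."""
    return re.compile("".join(f"[{IUPAC[s]}]" for s in oligo.upper()))

def count_sites(oligo, genome):
    """Count possibly overlapping exact-match sites on the given strand."""
    pattern = re.compile(f"(?=({to_regex(oligo).pattern}))")
    return sum(1 for _ in pattern.finditer(genome.upper()))
```

Comparing `degeneracy` with `count_sites` across successive designs gives a rough measure of the trade-off between coverage of the conserved boundary sequences and the number of unintended annealing sites.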

A library of nucleic acid domains can be constructed by isolation of nucleic acid sequences encoding domains from genomic DNA or cDNA of eukaryotic organisms such as humans. Multiple methods are available for doing this. For example, a computer search of available amino acid sequences can be used to identify the domains, as described above. A nucleic acid encoding each domain can be isolated and inserted into a vector appropriate for the expression in cells, e.g., a vector containing a promoter, an activation domain, and a selectable marker. In another example, degenerate oligonucleotides that hybridize to a conserved motif are used to amplify, e.g., by PCR, a large number of related domains containing the motif. Moreover, screening a collection limited to domains of interest, unlike screening a library of unselected genomic or cDNA sequences, significantly decreases library complexity and reduces the likelihood of missing a desirable sequence due to the inherent difficulty of completely screening large libraries. Domains in such libraries can be characterized, e.g., using the methods described herein.

Engineering Nucleic Acid Binding Specificity

It is possible to use the collection of polymers described herein to characterize, screen and select modified proteins. For example, known nucleic acid binding proteins can be mutated, e.g., by randomizing one or more positions or by making site specific mutations (e.g., a substitution, insertion, and/or deletion). Such modified proteins can then be contacted to a collection of polymers, either individually or in groups.

Any suitable method known in the art can be used to modify, design and construct nucleic acids encoding NABPs, e.g., phage display, random mutagenesis, combinatorial libraries, computer/rational design, affinity selection, PCR, cloning from cDNA or genomic libraries, synthetic construction and the like. Examples of site specific mutations include modifying a DNA contacting residue, e.g., to alanine, or to a different hydrophilic amino acid. For example, some modifications change the size of the side chain. Examples of randomizations include completely random representation of amino acids (e.g., all amino acids of a set can occur with equal probability, or a probability that is a function of their codon usage and the set of allowed codons), or partially random, e.g., by biasing nucleotides or codons, e.g., to favor certain amino acids (e.g., wildtype amino acids) relative to others. Randomization can occur at one or more positions. For example, at least 25, 50, or 75% of the DNA contacting positions can be randomized, or all such positions can be randomized.
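
The dependence of amino acid probabilities on the set of allowed codons can be made concrete with a short Python sketch. The "NNK" scheme (N = any base, K = G or T) is used here only as a familiar example and is an assumption, not a scheme named in this disclosure; the function names are likewise hypothetical. The sketch assumes the standard genetic code and equal probability for every allowed codon.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}

# Degenerate symbols supported by this sketch.
SYMBOLS = {"N": "ACGT", "K": "GT", "S": "CG",
           "A": "A", "C": "C", "G": "G", "T": "T"}

def residue_probabilities(scheme="NNK"):
    """Probability of each residue when all allowed codons are equally likely."""
    codons = ["".join(c) for c in product(*(SYMBOLS[s] for s in scheme))]
    counts = Counter(CODON_TABLE[c] for c in codons)
    return {aa: Fraction(n, len(codons)) for aa, n in counts.items()}

def library_size(n_positions, n_residues=20):
    """Number of protein variants when n_positions are fully randomized."""
    return n_residues ** n_positions
```

Under this example scheme, residues encoded by several allowed codons (such as leucine) are over-represented relative to single-codon residues (such as tryptophan), which is one reason a library's collective binding profile may not reflect uniform sampling of variants.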

Randomization can be used to produce a library, e.g., a phage display library, from which useful variants can be identified. The library can be screened, e.g., using a desired target that is immobilized. The library as a whole or a subset of the library can be contacted to the collection of polymers, e.g., to provide information about the library's collective ability to interact with different k-mers.

One example of a nucleic acid binding protein that can be altered to produce an artificial transcription factor is a protein that includes one or more zinc finger domains. Such domains are typically arranged as an array of at least three fingers. The nucleic acid binding region of such proteins can be prepared by selection in vitro (e.g., using phage display) or in vivo, or by design based on a recognition code (see, e.g., WO 00/42219 and U.S. Pat. No. 6,511,808). See, e.g., Rebar et al. (1996) Methods Enzymol 267:129; Greisman and Pabo (1997) Science 275:657; Isalan et al. (2001) Nat. Biotechnol 19:656; and Wu et al. (1995) Proc. Nat. Acad. Sci. USA 92:344 for, among other things, methods for creating libraries of varied zinc finger domains. See also, e.g., U.S. Pat. Nos. 5,789,538; 6,453,242; U.S. 2003-0165997; Wu et al., PNAS 92:344-348 (1995); Jamieson et al., Biochemistry 33:5689-5695 (1994); Rebar & Pabo, Science 263:671-673 (1994); Choo & Klug, PNAS 91:11163-11167 (1994); Choo & Klug, PNAS 91:11168-11172 (1994); Desjarlais & Berg, PNAS 90:2256-2260 (1993); Desjarlais & Berg, PNAS 89:7345-7349 (1992); Pomerantz et al., Science 267:93-96 (1995); Pomerantz et al., PNAS 92:9752-9756 (1995); Liu et al., PNAS 94:5525-5530 (1997); Greisman & Pabo, Science 275:657-661 (1997); and Desjarlais & Berg, PNAS 91:11099-11103 (1994).

Non-Proteinaceous Nucleic Acid Binding Compounds

A collection of polymers described herein can also be used to evaluate, design, and test non-proteinaceous nucleic acid binding compounds and other agents which interact with nucleic acids. The polymers can be used, e.g., to evaluate the degree of specificity of an agent for particular sequences. Specificity may vary across a continuum from completely non-specific to broad to highly sequence specific. Again, interactions other than binding interactions can also be evaluated.

One class of compounds which can specifically interact with nucleic acid is the class of polyamides, which includes pyrrole-imidazole polyamides. See, e.g., generally, Dervan et al. (2003) “Recognition of the DNA minor groove by pyrrole-imidazole polyamides.” Curr Opin Struct Biol. 13(3):284-99. See also, e.g., U.S. Ser. No. 08/607,078, PCT/US97/03332, U.S. Ser. Nos. 08/837,524, 08/853,525, PCT/US97/12733, U.S. Ser. No. 08/853,522, PCT/US97/12722, PCT/US98/06997, PCT/US98/02444, PCT/US98/02684, PCT/US98/01006, PCT/US98/03829, and PCT/US98/0714. As described in the foregoing references, polyamides comprise polymers of amino acids covalently linked by amide bonds. Preferably, the amino acids used to form these polymers include N-methylpyrrole (Py) and N-methylimidazole (Im). Polyamides containing pyrrole (Py) and imidazole (Im) amino acids are synthetic ligands that have an affinity and specificity for DNA comparable to naturally occurring DNA binding proteins (Trauger, J. W., Baird, E. E. & Dervan, P. B. (1996), Nature 382, 559-561; Swalley, S. E., Baird, E. E. & Dervan, P. B. (1997), J. Am. Chem. Soc. 119, 6953-6961; Turner, J. M., Baird, E. E. & Dervan, P. B. (1997), J. Am. Chem. Soc. 119, 7636-7644; Trauger, J. W., Baird, E. E. & Dervan, P. B. (1998), Angewandte Chemie-International Edition 37, 1421-1423; and Dervan, P. B. & Burli, R. W. (1999), Current Opinion in Chemical Biology 3, 688-693). See, e.g., U.S. Pat. No. 6,559,125. Polyamides can be conformationally constrained and also derivatized, e.g., to enable conjugation to another moiety, e.g., a protein.

Other types of non-proteinaceous agents include other nucleic acids, e.g., nucleic acids that can form a triple helix that interacts with a duplex. For example, U.S. Pat. No. 6,432,638 discloses homopyrimidine polydeoxyribonucleotide probes with at least one detectable marker, chemotherapeutic agent, or DNA-cleaving moiety attached to at least one predetermined position. See also U.S. Pat. No. 6,403,302 to Dervan et al. The probes are said to be capable of binding the corresponding homopyrimidine-homopurine tracts within large double-stranded nucleic acids by triple-helix formation at a predetermined site, and can be used for gene therapy.

Characterizing Nucleic Acid Binding

After evaluating interaction of an agent with a collection of nucleic acids (e.g., DNAs or RNAs), one can further characterize interaction of the agent with one or more particular members, e.g., individually, or with a particular k-mer. For example, further characterization can be used to determine qualitative or quantitative parameters that describe the interaction between the agent and a member of the collection, e.g., binding parameters, such as binding kinetics and dissociation constants. Further characterization can also provide more sequence information, e.g., by determining the degree of specificity of the agent for a particular identified k-mer or other sequence. Examples of such methods include: Electrophoretic Mobility Shift Assay (EMSA), footprinting, surface plasmon resonance, and methylation interference.

Electrophoretic Mobility Shift Assay (EMSA). Electrophoretic mobility shift assays can be used to characterize interactions between proteins and nucleic acids. For example, an EMSA can be performed as described previously (see, e.g., Durand, D., et al., Mol. Cell. Biol. 8:1715-1724 (1988), and Jones et al., Cell 42:5593 (1985)). In one implementation, binding reactions (15 μl final volume) contain 10 mM Tris-HCl (pH 7.5), 80 mM sodium chloride, 1 mM dithiothreitol, 1 mM EDTA, 5% glycerol, 1.5-2 μg of poly(dI·dC), 5 to 10 μg of the protein agent, and 20,000 cpm (0.1 to 0.5 ng) of 32P-end-labeled probe (e.g., a double-stranded oligonucleotide probe). After incubation for 45-60 min on ice, the protein-probe complexes are resolved on nondenaturing 5% polyacrylamide gels run in 1× Tris-borate-EDTA (TBE) buffer (Ausubel, F. M., et al., Green Publishing Associates and Wiley-Interscience, New York (1987)). Oligonucleotide probes can be labeled using the Klenow fragment. For cold oligonucleotide competition assays, a 1,000-fold molar excess of unlabeled probe (identical to, related to, or unrelated to the probe) can be added to the binding reaction mixture 15 min into the incubation, and the mixture can be further incubated for 30 min at 4° C. prior to gel loading.

DNase I Footprinting Assay. DNase I footprinting can be used to assay for DNA sequences that are protected from DNase I digestion by an agent. Related methods are available for analyzing RNA. See, e.g., Siegel (1988) Proc Natl Acad Sci USA. 85(6):1801-5. A negative control and a positive control can be run on either side of the agent of interest to produce adequate points of reference.

In one implementation, DNase I footprinting is performed using a derivation of the procedures described by Durand et al., Mol. Cell. Biol. 8:1715 (1988) and Jones et al., Cell 42:5593 (1985). Binding reactions are carried out under the conditions described above for EMSA but scaled up to 50 μl. After binding, using 50 μg nuclear extracts, 50 μl of a 10 mM MgCl2/5 mM CaCl2 solution is added, and 2 μl of an appropriate DNase I (Worthington, Freehold, N.J.) dilution is added and incubated for 1 minute on ice. DNase I digestion is stopped by adding 90 μl of stop buffer (20 mM EDTA, 1% SDS, 0.2 M NaCl). After addition of 20 μg yeast tRNA as carrier, the samples are extracted two times with an equal volume of phenol/chloroform (1:1) and precipitated after adjusting the solution to 0.3 M sodium acetate and 70% ethanol. DNA samples are then resuspended in 4 μl of an 80% formamide loading dye containing 1× TBE, bromophenol blue and xylene cyanol, heated to 90° C. for 2 minutes, and loaded on 6% polyacrylamide-urea sequencing gels.

Methylation Interference. Methylation interference can be assayed according to the protocol of Baldwin (Ausubel, F. M., et al., Green Publishing Associates and Wiley-Interscience, New York (1987)). First, a preparative EMSA (10-fold scale up of the reaction described above) is performed. Then, the nucleic acid probe (typically DNA) is eluted from the excised bands representing EMSA complexes by electroelution in a Bio-Rad apparatus. Following piperidine cleavage, the DNA ladders are analyzed on standard 10% polyacrylamide-urea sequencing gels (Ausubel, F. M., et al., Green Publishing Associates and Wiley-Interscience, New York (1987)).

UV Crosslinking Analysis. Preparative EMSA can be performed as described for methylation interference. Before autoradiography, the gel is exposed to UV light in a STRATALINER™ (Stratagene, La Jolla, Calif.), e.g., as described previously (Kelsumi, H. M., et al., Mol. Cell. Biol. 13:6690-6701 (1993)). Bands are excised and heated to 70° C. in Laemmli sample buffer. Gel slices are loaded into the wells of a sodium dodecyl sulfate 10% polyacrylamide electrophoresis (SDS-10% PAGE) gel run in glycine-SDS buffer (Ausubel, F. M., et al., Green Publishing Associates and Wiley-Interscience, New York (1987)).

Interaction Profiling

In one aspect, the disclosure features a method of providing an interaction site profile. The method includes providing an array of polymers, e.g., using a collection of polymers described herein, contacting a compound with the array, and identifying polymers with which the compound interacts, thus providing a “raw” interaction site profile. The array includes a plurality of polymers, wherein the plurality includes all or a certain percentage of all k-mers. Typically, each polymer is positionally distinguishable from the other polymers.

The term “raw” interaction site profile refers to the profile which indicates interaction between the compound and each polymer of the plurality. In one embodiment, the interaction of the compound with the probe results in a covalent modification of the probe, e.g., a covalent bond of the probe can be broken or formed. In a preferred embodiment, the interaction of the compound with the capture probe is a binding interaction wherein neither the compound nor the probe has a covalent bond broken or formed.

In one embodiment, the raw interaction site profile is a list of objects, each object representing one of the polymers of the array, and having an associated value, preferably a numerical value. The list can contain two, three, four, five, six, seven, eight, nine, ten, 15, 20, 50, 100, 1000 or more objects. In a preferred embodiment, each polymer on the array is represented by an object. In this embodiment, the list includes as many objects as there are polymers, e.g., addresses on the array. In another embodiment, the list includes the polymers which interact with the compound. Thus, the list can contain only those polymers for which an interaction was detected, or only those polymers for which an interaction met a predetermined condition. Such a list has fewer objects as members than the number of unique polymers.
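
The list structure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a prescribed implementation: the `ProfileEntry` class, the `filter_profile` helper, the example sequences, the signal values, and the threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProfileEntry:
    """One object of a raw interaction site profile (hypothetical layout)."""
    address: int     # position of the polymer on the array
    sequence: str    # sequence of the polymer at that address
    value: float     # associated numerical value, e.g., a signal intensity


def filter_profile(profile: List[ProfileEntry], threshold: float) -> List[ProfileEntry]:
    """Keep only entries whose value meets a predetermined condition.

    The condition used here (value >= threshold) is an assumption chosen
    for illustration; any other predicate could be substituted.
    """
    return [entry for entry in profile if entry.value >= threshold]


# Illustrative values only; these are not measured data.
raw_profile = [
    ProfileEntry(address=0, sequence="ACGTACGTAC", value=120.0),
    ProfileEntry(address=1, sequence="TTGACCATGA", value=15.0),
    ProfileEntry(address=2, sequence="GGCATCGTTA", value=640.0),
]

# A reduced list with fewer objects than there are polymers on the array.
hits = filter_profile(raw_profile, threshold=100.0)
```

Here `raw_profile` corresponds to the embodiment in which every polymer on the array is represented by an object, and `hits` to the embodiment in which the list contains only polymers whose interaction met a predetermined condition.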

The raw profile can be processed, e.g., to determine which k-mer or k-mers the compound interacts with. The results of the processing can be provided in the form of a processed profile.

The processed profile represents the k-mers with which the compound interacts. In one embodiment, the processed profile is a list of objects, each object representing one of the k-mers in the collection on the array, and having an associated value, preferably a numerical value. The list can contain two, three, four, five, six, seven, eight, nine, ten, 15, 20, 50, 100, 1000 or more objects. In a preferred embodiment, each k-mer in the collection is represented by an object. In this embodiment, the list includes as many objects as there are k-mers, e.g., as represented on the array. In another embodiment, the list includes the k-mers which interact with the compound. Thus, the list can contain only those k-mers for which an interaction was detected, or only those k-mers for which an interaction met a predetermined condition.
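
One way to go from a raw profile to a processed profile can be sketched as follows. The disclosure does not prescribe a particular processing method, so the choices here are assumptions made for illustration: each k-mer is assigned the median of the values of all polymers that contain it, reverse complements are not merged, and the function names `kmers`, `process_profile`, and `coverage` are hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Tuple


def kmers(sequence: str, k: int) -> List[str]:
    """Return every contiguous site of length k in the sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def process_profile(raw: List[Tuple[str, float]], k: int) -> Dict[str, float]:
    """Map per-polymer values onto k-mers (assumed rule: median per k-mer)."""
    values_by_kmer: Dict[str, List[float]] = defaultdict(list)
    for sequence, value in raw:
        # set() so a k-mer repeated within one polymer is counted once
        for site in set(kmers(sequence, k)):
            values_by_kmer[site].append(value)
    return {site: median(values) for site, values in values_by_kmer.items()}


def coverage(processed: Dict[str, float], k: int, alphabet_size: int = 4) -> float:
    """Fraction of all possible k-mers that appear in the processed profile."""
    return len(processed) / alphabet_size ** k


# Illustrative (sequence, value) pairs only; these are not measured data.
raw = [("ACGT", 10.0), ("CGTA", 30.0), ("TTTT", 2.0)]
processed = process_profile(raw, k=3)
```

With these toy inputs, the 3-mer `CGT` occurs in two polymers and receives the median of their two values, while the `coverage` helper reports what fraction of the 4^3 possible 3-mers is represented.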

In a preferred embodiment, the raw profile and/or the processed profile is stored in computer memory, such as random access memory or flash memory, or on computer readable media, such as magnetic (e.g., a diskette, removable hard drive, or internal hard drive) or optical media (e.g., a compact disk (CD), DVD, or holographic media). A profile stored in this manner can be on a personal computer, server, e.g., a network server, or mainframe, and can be accessed from another device across a network, e.g., an intranet or internet. In another embodiment, the raw or processed profile is printed onto a medium such as a plastic, a paper or a label, e.g., as a bar code or variation thereof.

The value associated with each object of an interaction site profile can be obtained from a quantitative observation, or a qualitative observation, preferably a quantitative observation. In one embodiment, the associated value is a function of the amount of interaction between the compound and a polymer. For example, the amount of interaction can be the amount of binding, the amount of polymer modification, or affinity. In a preferred embodiment, the associated value is a function of the amount of binding between the compound and the polymer. The value can be a function of the amount of a quantitative observation such as a fluorescent signal, a radioactive signal, or a phosphorescent signal of a contacted polymer. The value can be provided by an instrument, e.g., a CCD camera. In one embodiment, the value is a function of the surface plasmon resonance at the site of a contacted polymer. In a preferred embodiment, the associated values are adjusted for a background signal. In another embodiment, the associated value is a function of moles of bound compound. In yet another embodiment, the associated value is an affinity, relative affinity, apparent affinity, association constant, dissociation constant, logarithm of an affinity, or free energy for binding, of the compound for the particular polymer. In a preferred embodiment, the associated values in the list differ. In other words, the list contains more than one object, and, e.g., the associated values of the objects in the list are not all the same. The values provide a range. The values can be distributed in the range. In some embodiments, the values can approximate a Poisson distribution. The list can contain objects whose associated values are zero, or null. The list can contain objects whose associated values are positive or negative. In one embodiment, the list does not contain any objects whose associated values are zero or null.
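
Two of the value transformations mentioned above, background adjustment and conversion of a dissociation constant to a free energy of binding, can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the background is modeled as a single constant subtracted from every signal, and the free energy uses the standard relation ΔG° = RT ln(Kd) with Kd expressed relative to a 1 M standard state.

```python
import math
from typing import Dict

GAS_CONSTANT = 8.314462618  # J/(mol*K)


def subtract_background(signals: Dict[str, float], background: float) -> Dict[str, float]:
    """Adjust each value for a constant background signal (assumed model).

    Results are not clipped at zero, so adjusted values may be negative.
    """
    return {key: value - background for key, value in signals.items()}


def binding_free_energy(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Standard free energy of binding in kJ/mol from a dissociation constant in M."""
    if kd_molar <= 0:
        raise ValueError("dissociation constant must be positive")
    return GAS_CONSTANT * temperature_k * math.log(kd_molar) / 1000.0


# Illustrative values only; these are not measured data.
adjusted = subtract_background({"polymer_a": 120.0, "polymer_b": 15.0}, background=20.0)
delta_g = binding_free_energy(1e-9)  # a 1 nM dissociation constant
```

Leaving negative adjusted values in place is consistent with the statement that the list can contain objects whose associated values are positive or negative; a different background model (for example, a local per-address estimate) could be substituted.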

In a preferred embodiment, interaction site profiles are provided for a compound at varying concentrations of the compound, e.g., an interaction site profile is provided for a compound at a first concentration, at a second concentration, etc. In another preferred embodiment, interaction site profiles are provided for a compound for interaction with varying concentrations of polymers. For example, an array can have more than one unit, the compositions of the units being identical, but the first unit having the polymers at a first concentration, and the second unit having the polymers at a second concentration, etc. In yet another preferred embodiment, interaction site profiles are provided for a compound for various intervals after contacting the compound to the array. For example, a first profile can be provided after a first interval of time has elapsed after contacting, and a second profile can be provided after a second interval, etc.

An interaction site profile (e.g., a raw and/or processed profile) can be generated for any compound, e.g., a protein, a peptoid, a PNA, or a chemically modified protein. In a much preferred embodiment the compound is a protein. The polypeptide can be a nucleic acid binding protein. In one embodiment, the protein is an RNA contacting protein such as a splicing factor, a ribosomal protein, a viral protein, an RNA modification enzyme, a translation factor, and the like. In a preferred embodiment, the protein is a DNA contacting protein such as a transcription factor, a replication factor, a telomere binding protein, a centromere binding protein, a restriction modification enzyme, a DNA methylase, DNA repair protein, a single-stranded DNA binding protein, a recombination protein and the like. In one embodiment, the protein is a transcription factor. The transcription factor can bind a double stranded DNA sequence with an affinity of 10 mM, 1 mM, 100 nM or less, preferably 10 nM or less, 1 nM or less, and even more preferably 100 pM or less. The transcription factor can be selected from the group consisting of homeodomains, helix-turn-helix motif proteins, beta-sheets, leucine zippers, steroid receptors, and zinc finger proteins. In one embodiment, the protein is modified or combined with natural and exotic chemical ligands. In yet another embodiment, the compound for which the interaction site profile is generated includes more than one protein, e.g., a complex of proteins.

In a preferred embodiment, the protein is covalently attached to a bacteriophage, e.g., a T7 phage, a lambdoid phage, or a filamentous phage. Preferably, the protein is covalently attached to a filamentous phage such as fd or M13. The protein can be covalently fused to a coat protein by constructing a fusion gene with the gene encoding the polypeptide and the viral coat protein gene, e.g., filamentous phage gene VIII or gene III. In another preferred embodiment, the polypeptide is covalently attached to green fluorescent protein (GFP), or a variant thereof (such as enhanced GFP, CFP, BFP, and the like). The protein can be covalently attached by constructing a fusion gene. In yet another embodiment, the protein is linked with an unrelated sequence, e.g., a fusion protein, purification handle, or epitope tag. Useful examples of such unrelated sequences include maltose binding protein, glutathione-S-transferase, chitin binding protein, thioredoxin, hexa-histidine (or 6-His), the “FLAG tag”, the myc epitope, and the hemagglutinin epitope.

In one embodiment, the protein contains a detectable label. The detectable label can be a radiolabel. Preferably, the detectable label is a fluorescent label, e.g., malachite green, Oregon green, Texas Red, Congo Red, Cy3, SYBRGREEN™ I, or R-phycoerythrin. In another embodiment, the protein is contacted with an antibody. The antibody can contact the protein directly or can contact a covalently attached tag, e.g., a moiety mentioned above.

In a preferred embodiment, the protein is a variant of a natural counterpart. The variant can have at least one amino acid difference from the natural counterpart. Preferably the differing amino acid is located within 50 Ångstroms, 20 Ångstroms, or 10 Ångstroms or less of the bound nucleic acid in a structural model, e.g., a model built from X-ray diffraction data or NMR restraint data, or a homology model.

Quality Control of Protein Production

A collection of polymers described herein can be used to evaluate protein production. For example, the interactions of a purified protein or partially purified protein and molecules of the collection can be evaluated and compared to a reference, e.g., a previous production run or previous sample of the protein. For example, when producing a pharmaceutical compound, e.g., a protein therapeutic, the compound is typically purified from a cruder mixture, e.g., a cell lysate or media that contains a cell secretion. Without needing to know the contents of any particular sample, the interactions between a sample and the molecules of the collection can provide information about its contents and the degree of purity of the desired product, e.g., the protein therapeutic. Impurities may interact with different molecules in the collection than the desired product does, and can thereby be detected. Accordingly, when doing numerous production runs to produce a desired product, a sample of the final product or samples from earlier purification steps can be evaluated by contacting them to molecules in the collection.
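
The comparison of a sample against a reference can be sketched as follows. This is a minimal illustration, not the disclosure's method: the use of a Pearson correlation, the per-polymer tolerance check, the function names, and the numbers are all assumptions chosen for the example.

```python
import math
from typing import Dict, List


def pearson(sample: Dict[str, float], reference: Dict[str, float]) -> float:
    """Pearson correlation of two profiles over the polymers they share."""
    shared = sorted(set(sample) & set(reference))
    if len(shared) < 2:
        raise ValueError("need at least two shared polymers")
    xs = [sample[key] for key in shared]
    ys = [reference[key] for key in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("a profile with constant values has no defined correlation")
    return cov / (norm_x * norm_y)


def flag_deviations(sample: Dict[str, float], reference: Dict[str, float],
                    tolerance: float) -> List[str]:
    """List polymers whose value differs from the reference by more than tolerance."""
    shared = set(sample) & set(reference)
    return sorted(key for key in shared if abs(sample[key] - reference[key]) > tolerance)


# Illustrative values only; these are not measured data.
reference_run = {"p1": 100.0, "p2": 10.0, "p3": 50.0}
new_run = {"p1": 98.0, "p2": 60.0, "p3": 52.0}
suspect = flag_deviations(new_run, reference_run, tolerance=20.0)
similarity = pearson(new_run, reference_run)
```

In this toy case, polymer `p2` shows signal in the new run that is absent from the reference run, which is the kind of pattern an impurity interacting with a different molecule of the collection might produce.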

Receptor Ligands

A collection of peptides that includes all or a certain percentage of all possible k-mers can be used to identify a ligand for a receptor. In one exemplary implementation, the peptides in the collection are located at different addresses of an array. A soluble form of the receptor is produced, e.g., by expressing the extracellular domain, a fragment thereof or a version of the protein lacking the transmembrane domain. The soluble receptor (e.g., labelled receptor) is contacted to the array and locations where the receptor interacts with a peptide are detected. A peptide that includes only the relevant k-mer can be synthesized and further characterized, e.g., by contacting the peptide to a cell that expresses the receptor and evaluating a biological function of the cell.

Delivering Nucleic Acid Binding Proteins

As described herein, a collection of polymers that includes all or a certain percentage of all k-mers can be used to design, select or identify a nucleic acid binding protein, e.g., an engineered nucleic acid binding protein. The protein may have therapeutic or diagnostic uses, e.g., for detecting, preventing, or ameliorating diseases or disorders in a subject or in cells of a subject.

Conventional viral and non-viral based gene transfer methods can be used to introduce nucleic acids encoding engineered NABPs into mammalian cells or target tissues. Such methods can be used to administer nucleic acids encoding NABPs to cells in vitro. Preferably, the nucleic acids encoding NABPs are administered for in vivo or ex vivo gene therapy uses. Non-viral vector delivery systems include DNA plasmids, naked nucleic acid, and nucleic acid complexed with a delivery vehicle such as a liposome. Viral vector delivery systems include DNA and RNA viruses, which have either episomal or integrated genomes after delivery to the cell. For a review of gene therapy procedures, see Anderson, Science 256:808-813 (1992); Nabel & Felgner, TIBTECH 11:211-217 (1993); Mitani & Caskey, TIBTECH 11:162-166 (1993); Dillon, TIBTECH 11:167-175 (1993); Miller, Nature 357:455-460 (1992); Van Brunt, Biotechnology 6(10):1149-1154 (1988); Vigne, Restorative Neurology and Neuroscience 8:35-36 (1995); Kremer & Perricaudet, British Medical Bulletin 51(1):31-44 (1995); Haddada et al., in Current Topics in Microbiology and Immunology, Doerfler and Bohm (eds) (1995); and Yu et al., Gene Therapy 1:13-26 (1994).

Methods of non-viral delivery of nucleic acids encoding engineered NABPs include lipofection, microinjection, biolistics, virosomes, liposomes, immunoliposomes, polycation or lipid:nucleic acid conjugates, naked DNA, artificial virions, and agent-enhanced uptake of DNA. Lipofection is described in, e.g., U.S. Pat. Nos. 5,049,386, 4,946,787, and 4,897,355, and lipofection reagents are sold commercially (e.g., TRANSFECTAM™ and LIPOFECTIN™). Cationic and neutral lipids that are suitable for efficient receptor-recognition lipofection of polynucleotides include those of Felgner, WO 91/17424, WO 91/16024. Delivery can be to cells (ex vivo administration) or target tissues (in vivo administration).

The preparation of lipid:nucleic acid complexes, including targeted liposomes such as immunolipid complexes, is well known to one of skill in the art (see, e.g., Crystal, Science 270:404-410 (1995); Blaese et al., Cancer Gene Ther. 2:291-297 (1995); Behr et al., Bioconjugate Chem. 5:382-389 (1994); Remy et al., Bioconjugate Chem. 5:647-654 (1994); Gao et al., Gene Therapy 2:710-722 (1995); Ahmad et al., Cancer Res. 52:4817-4820 (1992); U.S. Pat. Nos. 4,186,183, 4,217,344, 4,235,871, 4,261,975, 4,485,054, 4,501,728, 4,774,085, 4,837,028, and 4,946,787).

The use of RNA or DNA viral based systems for the delivery of nucleic acids encoding engineered NABP can exploit the highly evolved processes for targeting a virus to specific cells in the body and trafficking the viral payload to the nucleus. Viral vectors can be administered directly to patients (in vivo) or they can be used to treat cells in vitro and the modified cells are administered to patients (ex vivo). Conventional viral based systems for the delivery of NABPs could include retroviral, lentivirus, adenoviral, adeno-associated and herpes simplex virus vectors for gene transfer. Viral vectors are currently the most efficient and versatile method of gene transfer in target cells and tissues. Integration in the host genome is possible with the retrovirus, lentivirus, and adeno-associated virus gene transfer methods, often resulting in long term expression of the inserted transgene. Additionally, high transduction efficiencies have been observed in many different cell types and target tissues.

The tropism of a retrovirus can be altered by incorporating foreign envelope proteins, expanding the potential target population of target cells. Lentiviral vectors are retroviral vectors that are able to transduce or infect non-dividing cells and typically produce high viral titers. Selection of a retroviral gene transfer system would therefore depend on the target tissue. Retroviral vectors are comprised of cis-acting long terminal repeats (LTRs) with packaging capacity for up to 6-10 kb of foreign sequence. The minimum cis-acting LTRs are sufficient for replication and packaging of the vectors, which are then used to integrate the therapeutic gene into the target cell to provide permanent transgene expression. Widely used retroviral vectors include those based upon murine leukemia virus (MuLV), gibbon ape leukemia virus (GaLV), simian immunodeficiency virus (SIV), human immunodeficiency virus (HIV), and combinations thereof (see, e.g., Buchscher et al., J. Virol. 66:2731-2739 (1992); Johann et al., J. Virol. 66:1635-1640 (1992); Sommerfelt et al., Virol. 176:58-59 (1990); Wilson et al., J. Virol. 63:2374-2378 (1989); Miller et al., J. Virol. 65:2220-2224 (1991); PCT/US94/05700).

In applications where transient expression of the NABP is preferred, adenoviral based systems are typically used. Adenoviral based vectors are capable of very high transduction efficiency in many cell types and do not require cell division. With such vectors, high titer and levels of expression have been obtained. This vector can be produced in large quantities in a relatively simple system. Adeno-associated virus (“AAV”) vectors are also used to transduce cells with target nucleic acids, e.g., in the in vitro production of nucleic acids and peptides, and for in vivo and ex vivo gene therapy procedures (see, e.g., West et al., Virology 160:38-47 (1987); U.S. Pat. No. 4,797,368; WO 93/24641; Kotin, Human Gene Therapy 5:793-801 (1994); Muzyczka, J. Clin. Invest. 94:1351 (1994)). Construction of recombinant AAV vectors is described in a number of publications, including U.S. Pat. No. 5,173,414; Tratschin et al., Mol. Cell. Biol. 5:3251-3260 (1985); Tratschin et al., Mol. Cell. Biol. 4:2072-2081 (1984); Hermonat & Muzyczka, PNAS 81:6466-6470 (1984); and Samulski et al., J. Virol. 63:3822-3828 (1989).

In particular, at least six viral vector approaches are currently available for gene transfer in clinical trials, with retroviral vectors by far the most frequently used system. All of these viral vectors utilize approaches that involve complementation of defective vectors by genes inserted into helper cell lines to generate the transducing agent.

pLASN and MFG-S are examples of retroviral vectors that have been used in clinical trials (Dunbar et al., Blood 85:3048-305 (1995); Kohn et al., Nat. Med. 1:1017-102 (1995); Malech et al., PNAS 94:22 12133-12138 (1997)). PA317/pLASN was the first therapeutic vector used in a gene therapy trial (Blaese et al., Science 270:475-480 (1995)). Transduction efficiencies of 50% or greater have been observed for MFG-S packaged vectors (Ellem et al., Immunol Immunother. 44(1):10-20 (1997); Dranoff et al., Hum. Gene Ther. 1:111 -2 (1997)).

Recombinant adeno-associated virus vectors (rAAV) are a promising alternative gene delivery system based on the defective and nonpathogenic parvovirus adeno-associated type 2 virus. All vectors are derived from a plasmid that retains only the AAV 145 bp inverted terminal repeats flanking the transgene expression cassette. Efficient gene transfer and stable transgene delivery due to integration into the genomes of the transduced cell are key features of this vector system (Wagner et al., Lancet 351:9117 1702-3 (1998); Kearns et al., Gene Ther. 9:748-55 (1996)).

Replication-deficient recombinant adenoviral vectors (Ad) are predominantly used for colon cancer gene therapy, because they can be produced at high titer and they readily infect a number of different cell types. Most adenovirus vectors are engineered such that a transgene replaces the Ad E1a, E1b, and E3 genes; subsequently the replication-defective vector is propagated in human 293 cells that supply the deleted gene function in trans. Ad vectors can transduce multiple types of tissues in vivo, including nondividing, differentiated cells such as those found in the liver, kidney and muscle system tissues. Conventional Ad vectors have a large carrying capacity. An example of the use of an Ad vector in a clinical trial involved polynucleotide therapy for antitumor immunization with intramuscular injection (Sterman et al., Hum. Gene Ther. 7:1083-9 (1998)). Additional examples of the use of adenovirus vectors for gene transfer in clinical trials include Rosenecker et al., Infection 24:1 5-10 (1996); Sterman et al., Hum. Gene Ther. 9:7 1083-1089 (1998); Welsh et al., Hum. Gene Ther. 2:205-18 (1995); Alvarez et al., Hum. Gene Ther. 5:597-613 (1997); Topf et al., Gene Ther. 5:507-513 (1998); Sterman et al., Hum. Gene Ther. 7:1083-1089 (1998).

Packaging cells are used to form virus particles that are capable of infecting a host cell. Such cells include 293 cells, which package adenovirus, and ψ2 cells or PA317 cells, which package retrovirus. Viral vectors used in gene therapy are usually generated by a producer cell line that packages a nucleic acid vector into a viral particle. The vectors typically contain the minimal viral sequences required for packaging and subsequent integration into a host, other viral sequences being replaced by an expression cassette for the protein to be expressed. The missing viral functions are supplied in trans by the packaging cell line. For example, AAV vectors used in gene therapy typically only possess ITR sequences from the AAV genome which are required for packaging and integration into the host genome. Viral DNA is packaged in a cell line, which contains a helper plasmid encoding the other AAV genes, namely rep and cap, but lacking ITR sequences. The cell line is also infected with adenovirus as a helper. The helper virus promotes replication of the AAV vector and expression of AAV genes from the helper plasmid. The helper plasmid is not packaged in significant amounts due to a lack of ITR sequences. Contamination with adenovirus can be reduced by, e.g., heat treatment to which adenovirus is more sensitive than AAV.

In many gene therapy applications, it is desirable that the gene therapy vector be delivered with a high degree of specificity to a particular tissue type. A viral vector is typically modified to have specificity for a given cell type by expressing a ligand as a fusion protein with a viral coat protein on the virus's outer surface. The ligand is chosen to have affinity for a receptor known to be present on the cell type of interest. For example, Han et al., PNAS 92:9747-9751 (1995), reported that Moloney murine leukemia virus can be modified to express human heregulin fused to gp70, and the recombinant virus infects certain human breast cancer cells expressing human epidermal growth factor receptor. This principle can be extended to other pairings of a virus that expresses a ligand fusion protein with a target cell that expresses a receptor. For example, filamentous phage can be engineered to display antibody fragments (e.g., Fab or Fv) having specific binding affinity for virtually any chosen cellular receptor. Although the above description applies primarily to viral vectors, the same principles can be applied to nonviral vectors. Such vectors can be engineered to contain specific uptake sequences thought to favor uptake by specific target cells.

Gene therapy vectors can be delivered in vivo by administration to an individual patient, typically by systemic administration (e.g., intravenous, intraperitoneal, intramuscular, subdermal, or intracranial infusion) or topical application, as described below. Alternatively, vectors can be delivered to cells ex vivo, such as cells explanted from an individual patient (e.g., lymphocytes, bone marrow aspirates, tissue biopsy) or universal donor hematopoietic stem cells, followed by reimplantation of the cells into a patient, usually after selection for cells which have incorporated the vector.

Ex vivo cell transfection for diagnostics, research, or for gene therapy (e.g., via re-infusion of the transfected cells into the host organism) is well known to those of skill in the art. In a preferred embodiment, cells are isolated from the subject organism, transfected with a NABP nucleic acid (gene or cDNA), and re-infused back into the subject organism (e.g., patient). Various cell types suitable for ex vivo transfection are well known to those of skill in the art (see, e.g., Freshney et al., Culture of Animal Cells, A Manual of Basic Technique (3rd ed. 1994) and the references cited therein for a discussion of how to isolate and culture cells from patients).

In one embodiment, stem cells are used in ex vivo procedures for cell transfection and gene therapy. The advantage to using stem cells is that they can be differentiated into other cell types in vitro, or can be introduced into a mammal (such as the donor of the cells) where they will engraft in the bone marrow. Methods for differentiating CD34+ cells in vitro into clinically important immune cell types using cytokines such as GM-CSF, IFN-γ, and TNF-α are known (see Inaba et al., J. Exp. Med. 176:1693-1702 (1992)).

Stem cells are isolated for transduction and differentiation using known methods. For example, stem cells are isolated from bone marrow cells by panning the bone marrow cells with antibodies which bind unwanted cells, such as CD4+ and CD8+ (T cells), CD45+ (panB cells), GR-1 (granulocytes), and Iad (differentiated antigen-presenting cells) (see Inaba et al., J. Exp. Med. 176:1693-1702 (1992)).

Vectors (e.g., retroviruses, adenoviruses, liposomes, etc.) containing therapeutic NABP nucleic acids can be also administered directly to the organism for transduction of cells in vivo. Alternatively, naked DNA can be administered. Administration is by any of the routes normally used for introducing a molecule into ultimate contact with blood or tissue cells. Suitable methods of administering such nucleic acids are available and well known to those of skill in the art, and, although more than one route can be used to administer a particular composition, a particular route can often provide a more immediate and more effective reaction than another route.

Pharmaceutically acceptable carriers are determined in part by the particular composition being administered, as well as by the particular method used to administer the composition. There is a wide variety of suitable formulations for pharmaceutical compositions, as described herein (see also, e.g., Remington's Pharmaceutical Sciences, 17th ed., 1989).

Delivery Vehicles for NABPs

An important factor in the administration of polypeptide compounds, such as the NABPs, is ensuring that the polypeptide has the ability to traverse the plasma membrane of a cell, or the membrane of an intra-cellular compartment such as the nucleus. Cellular membranes are composed of lipid-protein bilayers that are freely permeable to small, nonionic lipophilic compounds and are inherently impermeable to polar compounds, macromolecules, and therapeutic or diagnostic agents. However, proteins and other compounds such as liposomes have been described, which have the ability to translocate polypeptides such as NABPs across a cell membrane.

For example, “membrane translocation polypeptides” have amphiphilic or hydrophobic amino acid subsequences that have the ability to act as membrane-translocating carriers. In one embodiment, homeodomain proteins have the ability to translocate across cell membranes. The shortest internalizable peptide of a homeodomain protein, Antennapedia, was found to be the third helix of the protein, from amino acid position 43 to 58 (see, e.g., Prochiantz, Current Opinion in Neurobiology 6:629-634 (1996)). Another subsequence, the h (hydrophobic) domain of signal peptides, was found to have similar cell membrane translocation characteristics (see, e.g., Lin et al., J. Biol. Chem. 270:14255-14258 (1995)).

Examples of peptide sequences which can be linked to a NABP, for facilitating uptake of NABP into cells, include, but are not limited to: an 11 amino acid peptide of the tat protein of HIV; a 20 residue peptide sequence which corresponds to amino acids 84-103 of the p16 protein (see Fahraeus et al., Current Biology 6:84 (1996)); the third helix of the 60-amino acid long homeodomain of Antennapedia (Derossi et al., J. Biol. Chem. 269:10444 (1994)); the h region of a signal peptide such as the Kaposi fibroblast growth factor (K-FGF) h region (Lin et al., supra); or the VP22 translocation domain from HSV (Elliott & O'Hare, Cell 88:223-233 (1997)). Other suitable chemical moieties that provide enhanced cellular uptake may also be chemically linked to NABPs.

Toxin molecules also have the ability to transport polypeptides across cell membranes. Often, such molecules are composed of at least two parts (called “binary toxins”): a translocation or binding domain or polypeptide and a separate toxin domain or polypeptide. Typically, the translocation domain or polypeptide binds to a cellular receptor, and then the toxin is transported into the cell. Several bacterial toxins, including Clostridium perfringens iota toxin, diphtheria toxin (DT), Pseudomonas exotoxin A (PE), pertussis toxin (PT), Bacillus anthracis toxin, and pertussis adenylate cyclase (CYA), have been used in attempts to deliver peptides to the cell cytosol as internal or amino-terminal fusions (Arora et al., J. Biol. Chem. 268:3334-3341 (1993); Perelle et al., Infect. Immun. 61:5147-5156 (1993); Stenmark et al., J. Cell Biol. 113:1025-1032 (1991); Donnelly et al., PNAS 90:3530-3534 (1993); Carbonetti et al., Abstr. Annu. Meet. Am. Soc. Microbiol. 95:295 (1995); Sebo et al., Infect. Immun. 63:3851-3857 (1995); Klimpel et al., PNAS U.S.A. 89:10277-10281 (1992); and Novak et al., J. Biol. Chem. 267:17186-17193 (1992)).

Such subsequences can be used to translocate NABPs across a cell membrane. NABPs can be conveniently fused to or derivatized with such sequences. Typically, the translocation sequence is provided as part of a fusion protein. Optionally, a linker can be used to link the NABP and the translocation sequence. Any suitable linker can be used, e.g., a peptide linker.

The NABP can also be introduced into an animal cell, preferably a mammalian cell, via liposomes and liposome derivatives such as immunoliposomes. The term “liposome” refers to vesicles comprised of one or more concentrically ordered lipid bilayers, which encapsulate an aqueous phase. The aqueous phase typically contains the compound to be delivered to the cell, e.g., a NABP.

The liposome fuses with the plasma membrane, thereby releasing the drug into the cytosol. Alternatively, the liposome is phagocytosed or taken up by the cell in a transport vesicle. Once in the endosome or phagosome, the liposome either degrades or fuses with the membrane of the transport vesicle and releases its contents.

In current methods of drug delivery via liposomes, the liposome ultimately becomes permeable and releases the encapsulated compound (in this case, a NABP) at the target tissue or cell. For systemic or tissue specific delivery, this can be accomplished, for example, in a passive manner wherein the liposome bilayer degrades over time through the action of various agents in the body. Alternatively, active drug release involves using an agent to induce a permeability change in the liposome vesicle. Liposome membranes can be constructed so that they become destabilized when the environment becomes acidic near the liposome membrane (see, e.g., PNAS 84:7851 (1987); Biochemistry 28:908 (1989)). When liposomes are endocytosed by a target cell, for example, they become destabilized and release their contents. This destabilization is termed fusogenesis. Dioleoylphosphatidylethanolamine (DOPE) is the basis of many “fusogenic” systems.

Such liposomes typically comprise a NABP and a lipid component, e.g., a neutral and/or cationic lipid, optionally including a receptor-recognition molecule such as an antibody that binds to a predetermined cell surface receptor or ligand (e.g., an antigen). A variety of methods are available for preparing liposomes as described in, e.g., Szoka et al., Ann. Rev. Biophys. Bioeng. 9:467 (1980); U.S. Pat. Nos. 4,186,183, 4,217,344, 4,235,871, 4,261,975, 4,485,054, 4,501,728, 4,774,085, 4,837,028, and 4,946,787; PCT Publication No. WO 91/17424; Deamer & Bangham, Biochim. Biophys. Acta 443:629-634 (1976); Fraley et al., PNAS 76:3348-3352 (1979); Hope et al., Biochim. Biophys. Acta 812:55-65 (1985); Mayer et al., Biochim. Biophys. Acta 858:161-168 (1986); Williams et al., PNAS 85:242-246 (1988); Liposomes (Ostro (ed.), 1983, Chapter 1); Hope et al., Chem. Phys. Lip. 40:89 (1986); Gregoriadis, Liposome Technology (1984); and Lasic, Liposomes: from Physics to Applications (1993). Suitable methods include, for example, sonication, extrusion, high pressure/homogenization, microfluidization, detergent dialysis, calcium-induced fusion of small liposome vesicles and ether-fusion methods, all of which are well known in the art.

In certain embodiments, it is desirable to target the liposomes using targeting moieties that are specific to a particular cell type, tissue, and the like. Targeting of liposomes using a variety of targeting moieties (e.g., ligands, receptors, and monoclonal antibodies) has been previously described (see, e.g., U.S. Pat. Nos. 4,957,773 and 4,603,044).

EXAMPLES

Example 1

General Discussion of One Implementation

We have developed a new, highly parallel in vitro microarray technology for high-throughput characterization of the sequence specificities of DNA-protein interactions. We shall refer to this approach as protein binding microarray (PBM) technology. PBM technology allows interactions between agents and nucleic acids to be evaluated, for example, by measuring the direct or indirect binding of agents, such as epitope-tagged transcription factors, to nucleic acids on DNA microarrays spotted with double-stranded DNAs containing potential DNA binding sites. For instance, a DNA binding protein of interest is expressed with an epitope tag. The tag facilitates protein purification and detection. The epitope-tagged DNA binding protein (usually at least partially purified) is applied to a double-stranded DNA microarray. The microarray is then washed gently to remove any nonspecifically bound protein. The protein-bound microarray is then labeled with a primary antibody specific for the epitope tag expressed as a fusion with the DNA binding protein.

PBM technology enables evaluating the binding site specificities of a nucleic acid binding protein in a single day, starting from the purified protein. Using compact microarrays, we can determine the relative binding affinities for all possible 9-mers using a single nucleic acid array. PBM technology is highly scalable, as many assays may be performed in a single day. Moreover, it is a universal system, since the same nucleic acid array can be used to evaluate proteins from any species for binding to sites in any genome. One person could, for instance, perform triplicate PBM assays with multiple proteins in one day. DNA arrays can be printed, e.g., in production quantities. Identical arrays can be used for triplicate array experiments for each protein of interest.
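By way of illustration only, the arithmetic behind a compact array covering all possible 9-mers can be sketched as follows. The polymer length of 60 nucleotides is an assumed value chosen for the example and is not taken from the description; the function names are hypothetical, and the sketch does not merge reverse complements or describe how the sequences would actually be designed.

```python
import math

def num_sites(k: int, alphabet_size: int = 4) -> int:
    """Number of distinct sites (sequences) of length k over the alphabet."""
    return alphabet_size ** k

def sites_per_polymer(length: int, k: int) -> int:
    """Number of overlapping windows of length k in one polymer."""
    return max(length - k + 1, 0)

def min_polymers(k: int, length: int) -> int:
    """Lower bound on the number of polymers needed to present every site,
    reached only if every window on every polymer is a distinct site."""
    windows = sites_per_polymer(length, k)
    if windows == 0:
        raise ValueError("polymer length must be at least k")
    return math.ceil(num_sites(k) / windows)

k = 9
polymer_length = 60  # assumed value, for illustration only

print(num_sites(k))                          # 262144 possible 9-mers
print(sites_per_polymer(polymer_length, k))  # 52 overlapping 9-mers per polymer
print(min_polymers(k, polymer_length))       # 5042 polymers at best
```

Under these assumptions, one site per polymer would require 262,144 polymers, whereas overlapping sites on 60-nucleotide polymers could in principle require as few as 5,042.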

We have successfully purified FLAG-tagged Rpn4 fusion protein from E. coli using anti-FLAG M2 affinity gel (Sigma). We used this purified Rpn4 fusion protein in EMSAs successfully, indicating that the dual-tagged Rpn4 fusion protein binds DNA sequence-specifically. We have also used this FLAG-tagged Rpn4 in PBM experiments to evaluate strategies such as crosslinking the protein-bound microarrays before labeling with a fluorophore conjugate. For labeling protein-bound PBMs, we evaluated a number of strategies, including: (1) labeling with the M2 anti-FLAG primary antibody (Sigma), followed by R-phycoerythrin-conjugated secondary antibody (Sigma); (2) labeling with FITC- or Cy3-conjugated anti-FLAG antibody (Sigma); and (3) labeling with Alexa488-conjugated M2 anti-FLAG primary antibody (Sigma). Detection can employ a standard microarray scanner (GSI Lumonics SCANARRAY™) equipped with the appropriate lasers and filter sets. We have seen that higher signal intensity is indicative of higher DNA-protein binding affinity. Thus, PBM technology is successful in identifying sequence-specific TF binding. We also found that labeling with a fluorophore-conjugated primary antibody results in high quality data with a broad dynamic range of signal intensities.

We have used GST-tagged yeast TFs in PBM experiments using these microarrays. The washed, protein-bound microarrays are labeled with a primary anti-GST antibody conjugated with the fluorophore Alexa488. The washed arrays are then scanned using a GSI Lumonics SCANARRAY™ microarray scanner. The microarray images are quantified using GENEPIX™ microarray image quantitation software (Axon Instruments, Inc.). The sequences corresponding to the spots with a Bonferroni-corrected p-value ≤ 0.001 are run through a motif finding program to identify the TF's binding site motif. We use an integrated motif finder that combines results from MEME™, MDSCAN™, BIOPROSPECTOR™, and ALIGNACE™. We identified both ungapped (e.g., Rap1 and Mig1) and gapped (e.g., Abf1) DNA binding site motifs. Experimental negative controls, using either GST alone or a GST-fusion to a protein that is not DNA binding, do not result in ‘bound’ spots on the microarray. Computational negative controls, in which randomly selected yeast intergenic regions are searched with motif finding programs, do not result in motifs with significant group specific scores. That is, any low scoring motifs that do arise from these random sets are not specific to the input set of sequences, whereas the motifs arising from the spots ‘bound’ in our PBMs are highly specific to the set of ‘bound’ spots.
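The spot-selection step can be sketched as follows, purely as an illustration. The sketch assumes that a raw p-value has already been computed for each spot (how that p-value is derived from the signal intensities is not specified here); the function name, spot identifiers, and p-values are hypothetical.

```python
from typing import Dict, List

def bonferroni_bound_spots(p_values: Dict[str, float],
                           alpha: float = 0.001) -> List[str]:
    """Return the IDs of spots whose Bonferroni-corrected p-value is <= alpha.

    The corrected p-value is min(1, p * m), where m is the number of
    spots tested on the array.
    """
    m = len(p_values)
    bound = []
    for spot_id, p in p_values.items():
        corrected = min(1.0, p * m)
        if corrected <= alpha:
            bound.append(spot_id)
    return bound

# Hypothetical raw p-values for four spots (illustrative values only).
spots = {"spot_A": 1e-7, "spot_B": 5e-4, "spot_C": 0.2, "spot_D": 2e-4}
print(bonferroni_bound_spots(spots))  # ['spot_A', 'spot_D']
```

The sequences of the spots returned by such a filter would then be passed to the motif finding programs named above.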

Survey of TF DNA Binding Domain Types:

The table below exemplifies some of the major classes of eukaryotic DNA binding proteins, with the approximate number of proteins as well as the number of domains of each type in the human, fly, and yeast genomes.

Domain type                              | Human      | Fly       | Yeast
C2H2 zinc finger                         | 564 (4500) | 234 (771) | 34 (56)
Homeobox domain                          | 160 (178)  | 100 (103) | 6
Helix-loop-helix DNA binding domain      | 60 (61)    | 44        | 4
Basic leucine zipper (bZIP)              | ~55 a      | 27 b      | ~20 d
Nuclear hormone receptor                 | 47         | 17        | 0
Fork head domain                         | 35 (36)    | 20 (21)   | 4
Myb-like DNA-binding domain              | 32 (43)    | 18 (24)   | 15 (20)
Ets family                               | ~22 e      | ~8 c      | ~0 d
C2CH zinc finger                         | 17 (22)    | 6 (8)     | 3 (5)
Paired domain                            | ~9 e       | ~17 c     | ~0 d
GATA zinc finger                         | 11 (17)    | 5 (6)     | 9
Zn(2)-Cys(6) binuclear cluster (fungal)  | 0 e        | 0 c       | ~11 d
MADS domain                              | ~4 e       | ~2 c      | ~1 d
HSF family                               | ~5 e       | ~1 c      | ~1 d

Table Legend. Major Classes of Eukaryotic DNA Binding Proteins. The number of proteins containing the specified Pfam domains as well as the total number of domains (in parentheses) are shown in each column. Unless otherwise indicated, Celera data (Venter et al., Science, 2001) were used; Celera data were used instead of the Public Consortium data because the Celera data provided values for a greater number of different types of DNA binding domains.

a indicates data from Newman and Keating, Science, 2003.

b indicates data from Fassler et al., Genome Res., 2002.

c indicates data from a polypeptide search of FlyBase.

d indicates a search of SGD.