Title:
Full-length human cDNAs encoding potentially secreted proteins
Kind Code:
A1
Abstract:
The invention concerns GENSET polynucleotides and polypeptides. Such GENSET products may be used as reagents in forensic analyses, as chromosome markers, as tissue/cell/organelle-specific markers, and in the production of expression vectors. In addition, they may be used in screening and diagnosis assays for abnormal GENSET expression and/or biological activity and for screening compounds that may be used in the treatment of GENSET-related disorders.

Inventors:
Dumas Milne Edwards, Jean-Baptiste (Paris, FR)
Bougueleret, Lydie (Petit-Lancy, CH)
Jobert, Severin (Praha, CZ)
Application Number:
11/197712
Publication Date:
06/15/2006
Filing Date:
08/04/2005
Assignee:
Serono Genetics Institute S.A. (Evry, FR)
Primary Class:
Other Classes:
530/388.260, 435/6, 435/320.100, 435/69.100, 435/183, 435/325, 536/23.200, 800/8
International Classes:
A01K67/027; C12Q1/68; C07H21/04; C12P21/06; C12N9/00; C07K14/47; C07K16/40
Attorney, Agent or Firm:
SALIWANCHIK LLOYD & SALIWANCHIK; A PROFESSIONAL ASSOCIATION (PO BOX 142950, GAINESVILLE, FL, 32614-2950, US)
Claims:
We claim:

1. A composition of matter comprising: a) an isolated polynucleotide, said polynucleotide comprising a nucleic acid sequence encoding: i) a polypeptide comprising an amino acid sequence having any one of the sequences shown as SEQ ID NOs: 242-482 or any one of the sequences of polypeptides encoded by the clone inserts of the deposited clone pool; or ii) a biologically active fragment of said polypeptide; b) a nucleic acid sequence that has at least about 100 contiguous nucleotides of any one of the sequences shown as SEQ ID NOs: 1-241 or any one of the sequences of the clone inserts of the deposited clone pool; c) a polynucleotide or nucleic acid sequence as set forth in a) or b) operably linked to a promoter; d) an expression vector comprising a polynucleotide or nucleic acid sequence as set forth in a) or b) or c); e) a recombinant host cell comprising a polynucleotide or nucleic acid sequence as set forth in a) or b) or c) or d); f) a non-human transgenic animal comprising: i) a polynucleotide or nucleic acid sequence as set forth in a) or b) or c) or d) or ii) the host cell as set forth in e); g) an isolated polypeptide or biologically active fragment thereof, said polypeptide comprising an amino acid sequence having any one of the sequences shown as SEQ ID NOs: 242-482 or any one of the sequences of polypeptides encoded by the clone inserts of the deposited clone pool; or h) an isolated antibody that specifically binds to a polypeptide or biologically active fragment thereof, said polypeptide comprising an amino acid sequence having any one of the sequences shown as SEQ ID NOs: 242-482 or any one of the sequences of polypeptides encoded by the clone inserts of the deposited clone pool.

2. The composition of matter according to claim 1, wherein said polynucleotide or nucleic acid sequence encodes a polypeptide that comprises a signal peptide.

3. The composition of matter according to claim 1, wherein said polynucleotide encodes a polypeptide that is a mature protein.

4. The composition of matter according to claim 1, wherein said polypeptide comprises a signal peptide.

5. The composition of matter according to claim 1, wherein said polypeptide is a mature protein.

6. A method of using a composition of matter according to claim 1 for the production of a polypeptide; the binding of an antibody; determining if a polypeptide is present in a biological sample or expressed in a mammal; determining if a polypeptide is overexpressed or underexpressed in a mammal; or identifying candidate modulators of a polypeptide.

7. The method according to claim 6, wherein said method of making a polypeptide comprises a) providing a population of host cells or a population of cells comprising: i) an isolated polynucleotide, said polynucleotide comprising a nucleic acid sequence encoding: A) a polypeptide comprising an amino acid sequence having any one of the sequences shown as SEQ ID NOs: 242-482 or any one of the sequences of polypeptides encoded by the clone inserts of the deposited clone pool; or B) a biologically active fragment of said polypeptide; ii) a nucleic acid sequence that has at least about 100 contiguous nucleotides of any one of the sequences shown as SEQ ID NOs: 1-241 or any one of the sequences of the clone inserts of the deposited clone pool; iii) a polynucleotide or nucleic acid sequence as set forth in i) or ii) operably linked to a promoter; or iv) an expression vector comprising a polynucleotide or nucleic acid as set forth in i) or ii) or iii); and b) culturing said population of host cells or said population of cells under conditions conducive to the production of said polypeptide within said population of host cells or population of cells.

8. The method of claim 7, further comprising purifying said polypeptide from said population of host cells or said population of cells.

9. The method according to claim 6, wherein said method of binding an antibody comprises the step of: contacting an antibody that specifically binds to a polypeptide or biologically active fragment thereof, said polypeptide comprising an amino acid sequence having any one of the sequences shown as SEQ ID NOs: 242-482 or any one of the sequences of polypeptides encoded by the clone inserts of the deposited clone pools with said polypeptide under conditions that allow binding of said antibody to said polypeptide.

10. The method according to claim 6, wherein said method of determining whether a gene is expressed within a mammal comprises the steps of: a) providing a biological sample from said mammal; b) contacting said biological sample with either of: i) a polynucleotide that hybridizes under stringent conditions to a polynucleotide of claim 1; or ii) a polypeptide that specifically binds to a polypeptide as set forth in claim 1; and c) detecting the presence or absence of hybridization between said polynucleotide and an RNA species within said sample, or the presence or absence of binding of said polypeptide to a protein within said sample; wherein a detection of said hybridization or of said binding indicates that said gene is expressed within said mammal.

11. The method of claim 10, wherein said polynucleotide is a primer, and wherein said hybridization is detected by detecting the presence of an amplification product comprising the sequence of said primer.

12. The method of claim 10, wherein said polypeptide is an antibody.

13. The method according to claim 6, wherein said method of determining whether a mammal has an elevated or reduced level of gene expression comprises the steps of: a) providing a biological sample from said mammal; and b) comparing the amount of a polypeptide or biologically active fragment thereof comprising an amino acid sequence having any one of the sequences shown as SEQ ID NOs: 242-482 or any one of the sequences of polypeptides encoded by the clone inserts of the deposited clone pool, or of an RNA species encoding said polypeptide, within said biological sample with a level detected in or expected from a control sample; wherein an increased amount of said polypeptide or said RNA species within said biological sample compared to said level detected in or expected from said control sample indicates that said mammal has an elevated level of said gene expression, and wherein a decreased amount of said polypeptide or said RNA species within said biological sample compared to said level detected in or expected from said control sample indicates that said mammal has a reduced level of said gene expression.

14. A method of identifying a candidate modulator of a polypeptide, said method comprising: a) contacting an isolated polypeptide or biologically active fragment thereof, said polypeptide comprising an amino acid sequence having any one of the sequences shown as SEQ ID NOs: 242-482 or any one of the sequences of polypeptides encoded by the clone inserts of the deposited clone pool with a test compound; and b) determining whether said compound specifically binds to said polypeptide; wherein a detection that said compound specifically binds to said polypeptide indicates that said compound is a candidate modulator of said polypeptide.

Description:

RELATED APPLICATION

The present application is a continuation-in-part application of U.S. Ser. No. 09/731,872, filed Dec. 7, 2000, which claims priority, under 35 USC § 119(e), to the US Provisional Patent Applications Ser. Nos. 60/169,629 and 60/187,470 filed Dec. 8, 1999, and Mar. 6, 2000, respectively, the disclosures of which are incorporated herein by reference in their entireties.

FIELD OF THE INVENTION

The present invention is directed to polynucleotides encoding GENSET polypeptides, fragments thereof, and the regulatory regions located in the 5′- and 3′-ends of the GENSET genes. The invention also concerns polypeptides encoded by the GENSET polynucleotides and fragments thereof. The present invention also relates to recombinant vectors, which include the polynucleotides of the present invention, particularly recombinant vectors comprising a GENSET regulatory region or a sequence encoding a GENSET polypeptide, and to host cells containing the polynucleotides of the invention, as well as to methods of making such vectors and host cells. The present invention further relates to the use of these recombinant vectors and host cells in the production of the polypeptides of the invention. The invention further relates to antibodies that specifically bind to the polypeptides of the invention and to methods for producing such antibodies and fragments thereof. The invention also provides for methods of detecting the presence of the polynucleotides and polypeptides of the present invention in a sample, methods of diagnosis and screening of abnormal GENSET gene expression and/or biological activity, methods of screening compounds for their ability to modulate the activity or expression of GENSET genes and uses of such compounds.

BACKGROUND OF THE INVENTION

The estimated 50,000-100,000 genes scattered along the human chromosomes offer tremendous promise for the understanding, diagnosis, and treatment of human diseases. In addition, probes capable of specifically hybridizing to loci distributed throughout the human genome find applications in the construction of high resolution chromosome maps and in the identification of individuals.

Currently, two different approaches are being pursued for identifying and characterizing the genes distributed along the human genome. In one approach, large fragments of genomic DNA are isolated, cloned, and sequenced. Potential open reading frames in these genomic sequences are identified using bio-informatics software. However, this approach entails sequencing large stretches of human DNA which do not encode proteins in order to find the protein encoding sequences scattered throughout the genome. In addition to requiring extensive sequencing, the bio-informatics software may mischaracterize the genomic sequences obtained, i.e., labeling non-coding DNA as coding DNA and vice versa.
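As a rough illustration of the open reading frame scan mentioned above, the following Python sketch searches the three forward reading frames of a DNA string for spans that begin with ATG and end at the first in-frame stop codon. It is a simplified teaching example, not the software used in any genome project; the function name `find_orfs` and the `min_codons` threshold are assumptions introduced here. A scan of this kind has no knowledge of introns, splice sites, or promoters, which is one reason genomic predictions may label non-coding DNA as coding and vice versa.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) index pairs of ATG...stop spans.

    Only the three forward reading frames are scanned. `end` is exclusive
    and includes the stop codon. `min_codons` (a hypothetical threshold)
    counts the start and stop codons as part of the span.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # resume the search after the stop codon
    return sorted(orfs)


if __name__ == "__main__":
    example = "CCATGAAATTTTAGGG"
    for start, end in find_orfs(example):
        print(start, end, example[start:end])
```

Real gene-finding tools add statistical models of codon usage and splice signals on top of this basic idea.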

An alternative approach takes a more direct route to identifying and characterizing human genes. In this approach, complementary DNAs (cDNAs) are synthesized from isolated messenger RNAs (mRNAs) which encode human proteins. Using this approach, sequencing is only performed on DNA which is derived from protein coding fragments of the genome. In the past, these cDNAs, often short EST sequences, were obtained from oligo-dT primed cDNA libraries. Accordingly, they mainly corresponded to the 3′ untranslated region of the mRNA. In part, the prevalence of EST sequences derived from the 3′ end of the mRNA is a result of the fact that typical techniques for obtaining cDNAs are not well suited for isolating cDNA sequences derived from the 5′ ends of mRNAs (Adams et al., Nature 377:3-174, 1996; Hillier et al., Genome Res. 6:807-828, 1996). In addition, in those reported instances where longer cDNA sequences have been obtained, the reported sequences typically correspond to coding sequences and do not include the full 5′ untranslated region (5′UTR) of the mRNA from which the cDNA is derived. Indeed, 5′UTRs have been shown to affect either the stability or translation of mRNAs. Thus, regulation of gene expression may be achieved through the use of alternative 5′UTRs as shown, for instance, for the translation of the tissue inhibitor of metalloprotease mRNA in mitogenically activated cells (Waterhouse et al., J Biol Chem. 265:5585-9, 1990). Furthermore, modification of 5′UTR through mutation, insertion or translocation events may even be implicated in pathogenesis. For instance, the fragile X syndrome, the most common cause of inherited mental retardation, is partly due to an insertion of multiple CGG trinucleotides in the 5′UTR of the fragile X mRNA resulting in the inhibition of protein synthesis via ribosome stalling (Feng et al., Science 268:731-4, 1995).
An aberrant mutation in regions of the 5′UTR known to inhibit translation of the proto-oncogene c-myc was shown to result in upregulation of c-myc protein levels in cells derived from patients with multiple myelomas (Willis et al., Curr Top Microbiol Immunol 224:269-76, 1997). In addition, the use of oligo-dT primed cDNA libraries does not allow the isolation of complete 5′UTRs since such incomplete sequences obtained by this process may not include the first exon of the mRNA, particularly in situations where the first exon is short. Furthermore, they may not include some exons, often short ones, which are located upstream of splicing sites. Thus, there is a need to obtain sequences derived from the 5′ ends of mRNAs.

Moreover, despite the great amount of EST data that large-scale sequencing projects have yielded (Adams et al., Nature 377:3-174, 1996; Hillier et al., Genome Res. 6:807-828, 1996), information concerning the biological function of the mRNAs corresponding to such obtained cDNAs has proved to be limited. Indeed, whereas the knowledge of the complete coding sequence is absolutely necessary to investigate the biological function of mRNAs, ESTs yield only partial coding sequences. So far, large-scale full-length cDNA cloning has been achieved only with limited success because of the poor efficiency of methods for constructing full-length cDNA libraries. Indeed, such methods either require a large amount of mRNA (Edery et al., 1995), thus resulting in non-representative full-length libraries when only small amounts of tissue are available, or require PCR amplification (Maruyama et al., 1994; CLONTECHniques, 1996) to obtain a reasonable number of clones, thus yielding strongly biased cDNA libraries where rare and long cDNAs are lost. Thus, there is a need to obtain full-length cDNAs, i.e. cDNAs containing the full coding sequence of their corresponding mRNAs. The present application presents a number of cDNAs, called GENSET polynucleotides, isolated from full-length cDNA libraries obtained from the methods described in PCT publication WO 00/37491.

While many sequences derived from human chromosomes have practical applications, approaches based on the identification and characterization of those chromosomal sequences which encode a protein product are particularly relevant to diagnostic and therapeutic uses. Of the 50,000-100,000 protein coding genes, those genes encoding proteins which are secreted from the cell in which they are synthesized, as well as the secreted proteins themselves, are particularly valuable as potential therapeutic agents. Such proteins are often involved in cell to cell communication and may be responsible for producing a clinically relevant response in their target cells. In fact, several secretory proteins, including tissue plasminogen activator, G-CSF, GM-CSF, erythropoietin, human growth hormone, insulin, interferon-α, interferon-β, interferon-γ, and interleukin-2, are currently in clinical use. These proteins are used to treat a wide range of conditions, including acute myocardial infarction, acute ischemic stroke, anemia, diabetes, growth hormone deficiency, hepatitis, kidney carcinoma, chemotherapy induced neutropenia and multiple sclerosis. For these reasons, cDNAs encoding secreted proteins or fragments thereof represent a particularly valuable source of therapeutic agents. Thus, there is a need for the identification and characterization of secreted proteins and the nucleic acids encoding them.

In addition to being therapeutically useful themselves, secretory proteins include short peptides, called signal peptides, at their amino termini which direct their secretion. These signal peptides are encoded by the signal sequences located at the 5′ ends of the coding sequences of genes encoding secreted proteins. Because these signal peptides will direct the extracellular secretion of any protein to which they are operably linked, the signal sequences may be exploited to direct the efficient secretion of any protein by operably linking the signal sequences to a gene encoding the protein for which secretion is desired. In addition, fragments of the signal peptides, called membrane-translocating sequences, may also be used to direct the intracellular import of a peptide or protein of interest. This may prove beneficial in gene therapy strategies in which it is desired to deliver a particular gene product to cells other than the cells in which it is produced. Signal sequences encoding signal peptides also find application in simplifying protein purification techniques. In such applications, the extracellular secretion of the desired protein greatly facilitates purification by reducing the number of undesired proteins from which the desired protein must be selected. Thus, there exists a need to identify and characterize the 5′ fragments of the genes for secretory proteins which encode signal peptides.

Sequences coding for human proteins may also find application as therapeutics or diagnostics. In particular, such sequences may be used to determine whether an individual is likely to express a detectable phenotype, such as a disease, as a consequence of a mutation in the coding sequence for a protein. In instances where the individual is at risk of suffering from a disease or other undesirable phenotype as a result of a mutation in such a coding sequence, the undesirable phenotype may be corrected by introducing a normal coding sequence using gene therapy. Alternatively, if the undesirable phenotype results from overexpression of the protein encoded by the coding sequence, expression of the protein may be reduced using antisense or triple helix based strategies.

The GENSET human polypeptides encoded by the coding sequences may also be used as therapeutics by administering them directly to an individual having a condition, such as a disease, resulting from a mutation in the sequence encoding the polypeptide. In such an instance, the condition can be cured or ameliorated by administering the polypeptide to the individual.

In addition, the human polypeptides or fragments thereof may be used to generate antibodies useful in determining the tissue type or species of origin of a biological sample. The antibodies may also be used to determine the subcellular localization of the human polypeptides or the cellular localization of polypeptides which have been fused to the human polypeptides. In addition, the antibodies may also be used in immunoaffinity chromatography techniques to isolate, purify, or enrich the human polypeptide or a target polypeptide which has been fused to the human polypeptide.

Public information on the number of human genes for which the promoters and upstream regulatory regions have been identified and characterized is quite limited. In part, this may be due to the difficulty of isolating such regulatory sequences. Upstream regulatory sequences such as transcription factor binding sites are typically too short to be utilized as probes for isolating promoters from human genomic libraries. Recently, some approaches have been developed to isolate human promoters. One of them consists of making a CpG island library (Cross et al., Nature Genetics 6:236-244, 1994). The second consists of isolating human genomic DNA sequences containing SpeI binding sites by the use of SpeI binding protein (Mortlock et al., Genome Res. 6:327-335, 1996). Both of these approaches have their limits due to a lack of specificity and of comprehensiveness. Thus, there exists a need to identify and systematically characterize the 5′ fragments of the genes.

cDNAs including the 5′ ends of their corresponding mRNA may be used to efficiently identify and isolate 5′UTRs and upstream regulatory regions which control the location, developmental stage, rate, and quantity of protein synthesis, as well as the stability of the mRNA (Theil et al., BioFactors 4:87-93, 1993). Once identified and characterized, these regulatory regions may be utilized in gene therapy or protein purification schemes to obtain the desired amount and locations of protein synthesis or to inhibit, reduce, or prevent the synthesis of undesirable gene products.

In addition, cDNAs containing the 5′ ends of protein genes may include sequences useful as probes for chromosome mapping and the identification of individuals. Thus, there is a need to identify and characterize the sequences upstream of the 5′ coding sequences of genes encoding proteins.

SUMMARY OF THE INVENTION

The present invention provides compositions containing a purified or isolated polynucleotide comprising, consisting of, or consisting essentially of a nucleotide sequence selected from the group consisting of: (a) the sequences of SEQ ID Nos: 1-241; (b) the sequences of clone inserts of the deposited clone pool; (c) the full coding sequences of SEQ ID Nos: 1-241; (d) the full coding sequences of the clone inserts of the deposited clone pool; (e) the sequences encoding one of the polypeptides of SEQ ID Nos: 242-482; (f) the sequences encoding one of the polypeptides encoded by the clone inserts of the deposited clone pool; (g) the genomic sequences coding for GENSET polypeptides; (h) the 5′ transcriptional regulatory regions of GENSET genes; (i) the 3′ transcriptional regulatory regions of GENSET genes; (j) the polynucleotides comprising the nucleotide sequence of any combination of (g)-(i); (k) the variant polynucleotides of any of the polynucleotides of (a)-(j); (l) the polynucleotides comprising a nucleotide sequence of (a)-(k), wherein the polynucleotide is single stranded, double stranded, or a portion is single stranded and a portion is double stranded; (m) the polynucleotides comprising a nucleotide sequence complementary to any of the single stranded polynucleotides of (l). The invention further provides for fragments of the nucleic acid molecules of (a)-(m) described above.

The present invention also provides biologically active forms, variants, fragments and derivatives of the present proteins, where “biologically active” indicates that the form, variant, fragment, or derivative, has any detectable activity in any in vitro assay known in the art or described herein, or has any detectable function in vivo. In preferred embodiments, a determination of whether a particular polypeptide is biologically active will be made based on any of the specific assays or functional characteristics provided below for each of the proteins of this invention.

Therefore, one embodiment of the present invention is a composition containing a purified or isolated nucleic acid comprising a sequence selected from the group consisting of sequences of SEQ ID NOs: 1-241 and sequences of clone inserts of the deposited clone pool, sequences complementary thereto, allelic variants thereof, and degenerate variants thereof. In one aspect of this embodiment, the nucleic acid is recombinant.

Another embodiment of the present invention is a composition containing a purified or isolated nucleic acid comprising at least 8 consecutive nucleotides of a sequence selected from the group consisting of sequences of SEQ ID NOs: 1-241 and sequences of clone inserts of the deposited clone pool, sequences complementary thereto, allelic variants thereof, and degenerate variants thereof. In one aspect of this embodiment, the nucleic acid comprises at least 10, 12, 15, 18, 20, 25, 28, 30, 35, 40, 50, 75, 100, 150, 200, 300, 400, 500, 800, 1000, 1500, or 2000 consecutive nucleotides of said selected sequence, sequences complementary thereto, allelic variants thereof, and degenerate variants thereof. The nucleic acid may be a recombinant nucleic acid.
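The "at least 8 consecutive nucleotides" criterion above can be pictured as a simple substring test. The Python sketch below checks whether a candidate sequence contains a run of `n` consecutive nucleotides found in a reference sequence or in its complementary strand. The function names and the example sequences are hypothetical and are not taken from the sequence listing; allelic and degenerate variants are not handled.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def shares_contiguous_span(candidate, reference, n=8):
    """True if `candidate` contains at least `n` consecutive nucleotides
    that occur in `reference` or in its reverse complement."""
    candidate = candidate.upper()
    targets = (reference.upper(), reverse_complement(reference))
    for i in range(len(candidate) - n + 1):
        kmer = candidate[i:i + n]
        if any(kmer in target for target in targets):
            return True
    return False


if __name__ == "__main__":
    # Illustrative sequences only.
    reference = "ATGGCCATTGTAATGGGCCGC"
    print(shares_contiguous_span("TTTTGGCCATTGTTTT", reference))  # shares GGCCATTG
    print(shares_contiguous_span("AAAAAAAAAA", reference))
```

Raising `n` to any of the longer span lengths listed above tightens the test in the obvious way.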

Another embodiment of the present invention is a composition comprising a vertebrate purified or isolated nucleic acid of at least 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, 75, 100, 200, 300, 500, 1000 or 2000 nucleotides in length which hybridizes under stringent conditions to any polynucleotide of the invention, preferably a sequence selected from the group consisting of sequences of SEQ ID NOs: 1-241 and sequences of clone inserts of the deposited clone pool, sequences complementary thereto. In one aspect of this embodiment, the nucleic acid is recombinant.

Another embodiment of the present invention is a composition containing a purified or isolated nucleic acid comprising the full coding sequences of a sequence selected from the group consisting of sequences of SEQ ID NOs: 1-241 and sequences of clone inserts of the deposited clone pool, or an allelic variant thereof. In one aspect of this embodiment, the nucleic acid is recombinant.

A further embodiment of the present invention is a composition containing a purified or isolated nucleic acid comprising a contiguous span of a sequence selected from the group consisting of sequences of SEQ ID NOs: 1-31 and 33-143 and sequences of clone inserts encoding secreted proteins in the deposited clone pool, or an allelic variant thereof, wherein said contiguous span encodes a mature protein. In one aspect of this embodiment, the nucleic acid is recombinant. In another aspect of this embodiment, the nucleic acid is an expression vector wherein said contiguous span which encodes a mature protein is operably linked to a promoter.

Yet another embodiment of the present invention is a composition containing a purified or isolated nucleic acid comprising a contiguous span of a sequence selected from the group consisting of sequences of SEQ ID NOs: 1-31 and 33-143 and sequences of clone inserts encoding secreted proteins in the deposited clone pool, or an allelic variant thereof, wherein said contiguous span encodes a signal peptide. In one aspect of this embodiment, the nucleic acid is recombinant. In another aspect of this embodiment, the nucleic acid is a fusion vector wherein said contiguous span which encodes a signal peptide is operably linked to a second nucleic acid encoding a heterologous polypeptide.

Another embodiment of the present invention is a composition containing a purified or isolated nucleic acid encoding a polypeptide comprising a sequence selected from the group consisting of sequences of SEQ ID NOs: 1-241 and sequences of clone inserts of the deposited clone pool, or allelic variant thereof. In one aspect of this embodiment, the nucleic acid is recombinant.

Another embodiment of the present invention is a composition containing a purified or isolated nucleic acid encoding a polypeptide comprising the sequence of a mature protein included in a sequence selected from the group consisting of sequences of SEQ ID NOs: 1-31 and 33-143 and sequences of clone inserts encoding secreted proteins in the deposited clone pool, or allelic variant thereof. In one aspect of this embodiment, the nucleic acid is recombinant.

Another embodiment of the present invention is a composition containing a purified or isolated nucleic acid encoding a polypeptide comprising the sequence of a signal peptide included in a sequence selected from the group consisting of sequences of SEQ ID NOs: 1-31 and 33-143 and sequences of clone inserts encoding secreted proteins in the deposited clone pool, or allelic variant thereof. In another aspect it is present in a vector of the invention.

Further embodiments of the invention include compositions containing purified or isolated polynucleotides that comprise a nucleotide sequence at least 70% identical, more preferably at least 75% identical, and still more preferably at least 80%, 85%, 90%, 95%, 96%, 97%, 98%, or 99% identical to any of the polynucleotides of the present invention. Methods of determining identity include those well known in the art and described herein. Such analyses can be performed using a full length polynucleotide sequence or using a subsequence of any length. For example, any two sequences can be compared over a region, in either polynucleotide or in both polynucleotides, of any 10, 25, 50, 100, 250, 500, 1000, 2000 or more contiguous nucleotides. In addition, any two sequences can be identified as homologous even when they share sequence homology over a limited region of either polynucleotide, for example over a region of at least about 10, 25, 50, 100, 250, 500, 1000, or more contiguous nucleotides.
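To make the percent-identity language above concrete, the following Python sketch computes identity between two sequences that are assumed to be already aligned, of equal length, and free of gaps, and then reports the best identity over any window of a chosen size. This is a deliberately simplified stand-in for the alignment-based methods known in the art; the function names and the example sequences are assumptions made for illustration.

```python
def percent_identity(a, b):
    """Percent of positions that match between two pre-aligned,
    equal-length, ungapped sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def best_window_identity(a, b, window):
    """Highest percent identity over any region of `window` contiguous
    positions, using the same pre-aligned, ungapped assumption."""
    if len(a) != len(b) or not 0 < window <= len(a):
        raise ValueError("window must fit within two equal-length sequences")
    return max(
        percent_identity(a[i:i + window], b[i:i + window])
        for i in range(len(a) - window + 1)
    )


if __name__ == "__main__":
    # Illustrative sequences differing at a single position.
    seq1 = "ACGTACGTAC"
    seq2 = "ACGTTCGTAC"
    print(percent_identity(seq1, seq2))         # identity over the full length
    print(best_window_identity(seq1, seq2, 4))  # identity over the best 4-nt region
```

The example shows why the compared region matters: two sequences that differ over their full length can still be fully identical over a shorter region.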

The invention further provides compositions containing a purified or isolated polypeptide comprising, consisting of, or consisting essentially of an amino acid sequence selected from the group consisting of: (a) the polypeptides of SEQ ID Nos: 242-482; (b) the polypeptides encoded by the clone inserts of the deposited clone pool; (c) the epitope-bearing fragments of the polypeptides of SEQ ID Nos: 242-482; (d) the epitope-bearing fragments of the polypeptides encoded by the clone inserts contained in the deposited clone pool; (e) the domains of the polypeptides of SEQ ID Nos: 242-482; (f) the domains of the polypeptides encoded by the clone inserts contained in the deposited clone pool; and (g) the allelic variant polypeptides of any of the polypeptides of (a)-(f). The invention further provides for fragments of the polypeptides of (a)-(g) above, such as those having biological activity or comprising biologically functional domain(s).

Yet another embodiment of the present invention is a composition containing a purified or isolated protein comprising a sequence selected from the group consisting of sequences of SEQ ID NOs: 242-482 and sequences of polypeptides encoded by clone inserts of the deposited clone pool, or allelic variant thereof.

Another embodiment of the present invention is a composition containing a purified or isolated polypeptide comprising at least 5, 6 or 8 consecutive amino acids of a sequence selected from the group consisting of sequences of SEQ ID NOs: 242-482 and sequences of polypeptides encoded by clone inserts of the deposited clone pool, or allelic variant thereof. In one aspect of this embodiment, the purified or isolated polypeptide comprises at least 10, 12, 15, 20, 25, 30, 35, 40, 50, 60, 75, 100, 150, 200, 250, 300, 350, 400, 450 or 500 consecutive amino acids of said selected sequence or allelic variant thereof.

Another embodiment of the present invention is a composition containing an isolated or purified polypeptide comprising a signal peptide of a sequence selected from the group consisting of sequences of SEQ ID NOs: 242-272 and 274-384 and sequences of polypeptides encoded by clone inserts of the deposited clone pool, or allelic variant thereof.

Yet another embodiment of the present invention is a composition containing an isolated or purified polypeptide comprising a mature protein of a sequence selected from the group consisting of sequences of SEQ ID NOs: 242-272 and 274-384 and sequences of polypeptides encoded by clone inserts of the deposited clone pool, or allelic variant thereof.

Further embodiments of the present invention are compositions containing a polypeptide having an amino acid sequence with at least 70% similarity, and more preferably at least 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, or 99% similarity to a polypeptide of the present invention, as well as polypeptides having an amino acid sequence at least 70% identical, more preferably at least 75% identical, and still more preferably at least 80%, 85%, 90%, 95%, 96%, 97%, 98%, or 99% identical to a polypeptide of the present invention. Such analyses can be performed using a full length polypeptide sequence or using a subsequence of any length. For example, any two sequences can be compared over a region, in either protein or in both proteins, of any 10, 25, 50, 100, 250, 500, 1000, 2000 or more contiguous amino acids. In addition, any two sequences can be identified as homologous even when they share sequence homology over a limited region of either protein, for example over a region of at least about 10, 25, 50, 100, 250, 500, 1000, or more contiguous amino acids. Further included in the invention are compositions comprising a purified or isolated nucleic acid molecule encoding such polypeptides. Methods for determining identity include those well known in the art and described herein.

The present invention also relates to compositions comprising recombinant vectors, which include the purified or isolated polynucleotides of the present invention, and to host cells recombinant for the polynucleotides of the present invention, as well as to methods of making such vectors and host cells. The present invention further relates to the use of these recombinant vectors and recombinant host cells in the production of GENSET polypeptides.

Consequently, another embodiment of the invention is a vector comprising any polynucleotide of the invention. In a preferred embodiment, the vector is an expression vector comprising a nucleic acid sequence encoding a polypeptide selected from the group consisting of sequences of SEQ ID NOs: 242-482 and sequences of polypeptides encoded by the clone inserts of the deposited clone pool, or allelic variant thereof, wherein said nucleic acid sequence is operably linked to a promoter. In another preferred embodiment, the vector is a secretion vector comprising a nucleic acid sequence encoding a signal peptide selected from the group consisting of signal peptides of sequences of SEQ ID NOs: 242-272 and 274-384 and sequences of secreted polypeptides encoded by the clone inserts of the deposited clone pool, or allelic variant thereof, wherein said nucleic acid sequence is operably linked to a heterologous protein such that said signal peptide will direct the secretion of said heterologous protein.

A further embodiment of the present invention is a method of making a protein comprising a sequence selected from the group consisting of sequences of SEQ ID NOs: 242-482 and sequences of polypeptides encoded by clone inserts of the deposited clone pool, comprising the steps of:

a) obtaining a cDNA comprising a sequence selected from the group consisting of sequences of SEQ ID NOs: 1-241 and sequences of clone inserts of the deposited clone pool;

b) inserting said cDNA in an expression vector such that said cDNA is operably linked to a promoter; and

c) introducing said expression vector into a host cell whereby the host cell produces the protein encoded by said cDNA.

In one aspect of this embodiment, the method further comprises the step of isolating said protein.

Another embodiment of the present invention is a protein obtainable by the method described in the preceding paragraph.

Another embodiment of the present invention is a method of making a protein comprising the amino acid sequence of the mature protein contained in a sequence selected from the group consisting of sequences of SEQ ID NOs: 242-272 and 274-384 and sequences of polypeptides encoded by clone inserts of the deposited clone pool, comprising the steps of:

a) obtaining a cDNA comprising a sequence selected from the group consisting of sequences of SEQ ID NOs: 1-31 and 33-143 and sequences of clone inserts of the deposited clone pool, wherein said cDNA encodes a mature protein;

b) inserting said cDNA in an expression vector such that said cDNA is operably linked to a promoter; and

c) introducing said expression vector into a host cell whereby the host cell produces the mature protein encoded by said cDNA.

In one aspect of this embodiment, the method further comprises the step of isolating said protein.

Another embodiment of the present invention is a mature protein obtainable by the method described in the preceding paragraph.

Another embodiment of the present invention is a composition containing a host cell containing the purified or isolated nucleic acids comprising a sequence selected from the group consisting of sequences of SEQ ID NOs: 1-241 and sequences of clone inserts of the deposited clone pool or a sequence complementary thereto described herein.

Another embodiment of the present invention is a composition containing a host cell containing the purified or isolated nucleic acids comprising the full coding sequences of a sequence selected from the group consisting of sequences of SEQ ID NOs: 1-241 and sequences of clone inserts of the deposited clone pool.

Another embodiment of the present invention is a composition containing a host cell containing the purified or isolated nucleic acids comprising a contiguous span of a sequence selected from the group consisting of sequences of SEQ ID NOs: 1-31 and 33-143 and sequences of clone inserts of the deposited clone pool, wherein said contiguous span codes for a mature protein.

Another embodiment of the present invention is a composition containing a host cell containing the purified or isolated nucleic acids comprising a contiguous span of a sequence selected from the group consisting of sequences of SEQ ID NOs: 1-31 and 33-143 and sequences of clone inserts of the deposited clone pool, wherein said contiguous span codes for a signal peptide.

The invention further relates to other methods of making the polypeptides of the present invention.

The present invention further relates to transgenic plants or animals, wherein said transgenic plant or animal is transgenic for a polynucleotide of the present invention and expresses a polypeptide of the present invention.

The invention further relates to compositions comprising antibodies that specifically bind to the GENSET polypeptides of the present invention and fragments thereof as well as to methods for producing such antibodies and fragments thereof.

Therefore, another embodiment of the present invention is a composition containing a purified or isolated antibody capable of specifically binding to a protein comprising a sequence selected from the group consisting of sequences of SEQ ID NOs: 242-482 and sequences of polypeptides encoded by clone inserts of the deposited clone pool. In one aspect of this embodiment, the antibody is capable of binding to a polypeptide comprising at least 6 consecutive amino acids, at least 8 consecutive amino acids, or at least 10 consecutive amino acids of said selected sequence.

The invention also provides kits and methods of detecting GENSET gene expression and/or biological activity in a biological sample. One such method involves assaying for the expression of a GENSET polynucleotide in a biological sample using polymerase chain reaction (PCR) to amplify and detect GENSET polynucleotides or Southern and Northern blot hybridization to detect GENSET genomic DNA, cDNA or mRNA. Alternatively, a method of detecting GENSET gene expression in a test sample can be accomplished using a compound which binds to a GENSET polypeptide of the present invention or a portion of a GENSET polypeptide.

The present invention also relates to diagnostic methods of identifying individuals or non-human animals having elevated or reduced levels of GENSET products, which individuals are likely to benefit from therapies to suppress or enhance GENSET gene expression, respectively, and to methods of identifying individuals or non-human animals at increased risk for developing, or presently having, certain diseases/disorders associated with abnormal GENSET gene expression or biological activity.

The present invention also relates to kits and methods of screening compounds for their ability to modulate (e.g. increase or inhibit) the activity or expression of GENSET genes, including compounds that interact with GENSET gene regulatory sequences and compounds that interact directly or indirectly with GENSET polypeptides. Uses of such compounds are also within the scope of the present invention.

The present invention also relates to pharmaceutical or physiologically acceptable compositions comprising, as an active agent, the polypeptides, polynucleotides or antibodies of the present invention.

The present invention also relates to computer systems containing cDNA codes and polypeptide codes of sequences of the invention and to computer-related methods of comparing sequences and identifying homology or features using GENSET sequences of the invention.

In another aspect, the present invention provides an isolated polynucleotide, said polynucleotide comprising a nucleic acid sequence encoding i) a polypeptide comprising an amino acid sequence having at least about 80% identity to any one of the sequences shown as SEQ ID NOs:242-482 or any one of the sequences of polypeptides encoded by the clone inserts of the deposited clone pool; or ii) a biologically active fragment of said polypeptide.

In one embodiment, the polypeptide comprises any one of the sequences shown as SEQ ID NOs:242-482 or any one of the sequences of the polypeptides encoded by the clone inserts of the deposited clone pool. In another embodiment, the polypeptide comprises a signal peptide. In another embodiment, the polypeptide is a mature protein. In another embodiment, the nucleic acid sequence has at least about 80% identity over at least about 100 contiguous nucleotides to any one of the sequences shown as SEQ ID NOs:1-241 or any one of the sequences of the clone inserts of the deposited clone pool. In another embodiment, the polynucleotide hybridizes under stringent conditions to a polynucleotide comprising any one of the sequences shown as SEQ ID NOs:1-241 or any one of the sequences of the clone inserts of the deposited clone pool. In another embodiment, the nucleic acid sequence comprises any one of the sequences shown as SEQ ID NOs:1-241 or any one of the sequences of the clone inserts of the deposited clone pool. In another embodiment, the polynucleotide is operably linked to a promoter.

In another aspect, the present invention provides an expression vector comprising the polynucleotide operably linked to a promoter. In another aspect, the present invention provides a host cell recombinant for the polynucleotide. In another aspect, the present invention provides a non-human transgenic animal comprising the host cell.

In another aspect, the present invention provides a method of making a GENSET polypeptide, the method comprising a) providing a population of host cells comprising a herein-described polynucleotide and b) culturing the population of host cells under conditions conducive to the production of the polypeptide within said host cells.

In one embodiment, the method further comprises purifying the polypeptide from the population of host cells.

In another aspect, the present invention provides a method of making a GENSET polypeptide, the method comprising a) providing a population of cells comprising a herein-described polynucleotide; b) culturing the population of cells under conditions conducive to the production of the polypeptide within the cells; and c) purifying the polypeptide from the population of cells.

In another aspect, the present invention provides an isolated polynucleotide, the polynucleotide comprising a nucleic acid sequence having at least about 80% identity over at least about 100 contiguous nucleotides to any one of the sequences shown as SEQ ID NOs:1-241 or any one of the sequences of the clone inserts of the deposited clone pool.

In one embodiment, the polynucleotide hybridizes under stringent conditions to a polynucleotide comprising any one of the sequences shown as SEQ ID NOs:1-241 or any one of the sequences of the clone inserts of the deposited clone pool. In another embodiment, the polynucleotide comprises any one of the sequences shown as SEQ ID NOs:1-241 or any one of the sequences of the clone inserts of the deposited clone pool.

In another aspect, the present invention provides a biologically active polypeptide encoded by any of the herein-described polynucleotides.

In another aspect, the present invention provides an isolated polypeptide or biologically active fragment thereof, the polypeptide comprising an amino acid sequence having at least about 80% sequence identity to any one of the sequences shown as SEQ ID NOs:242-482 or any one of the sequences of polypeptides encoded by the clone inserts of the deposited clone pool.

In one embodiment, the polypeptide is selectively recognized by an antibody raised against an antigenic polypeptide, or an antigenic fragment thereof, said antigenic polypeptide comprising any one of the sequences shown as SEQ ID NOs:242-482 or any one of the sequences of polypeptides encoded by the clone inserts of the deposited clone pool. In another embodiment, the polypeptide comprises any one of the sequences shown as SEQ ID NOs:242-482 or any one of the sequences of polypeptides encoded by the clone inserts of the deposited clone pool. In another embodiment, the polypeptide comprises a signal peptide. In another embodiment, the polypeptide is a mature protein.

In another aspect, the present invention provides an antibody that specifically binds to any of the herein-described polypeptides.

In another aspect, the present invention provides a method of determining whether a GENSET gene is expressed within a mammal, the method comprising the steps of: a) providing a biological sample from said mammal; b) contacting said biological sample with either of: i) a polynucleotide that hybridizes under stringent conditions to the polynucleotide of claim 1; or ii) a polypeptide that specifically binds to the polypeptide of claim 19; and c) detecting the presence or absence of hybridization between the polynucleotide and an RNA species within the sample, or the presence or absence of binding of the polypeptide to a protein within the sample; wherein a detection of the hybridization or of the binding indicates that the GENSET gene is expressed within the mammal.

In one embodiment, the polynucleotide is a primer, and the hybridization is detected by detecting the presence of an amplification product comprising the sequence of the primer. In another embodiment, the polypeptide is an antibody.

In another aspect, the present invention provides a method of determining whether a mammal has an elevated or reduced level of GENSET gene expression, the method comprising the steps of: a) providing a biological sample from the mammal; and b) comparing the amount of any of the herein-described polypeptides, or of an RNA species encoding the polypeptide, within the biological sample with a level detected in or expected from a control sample; wherein an increased amount of the polypeptide or the RNA species within the biological sample compared to the level detected in or expected from the control sample indicates that the mammal has an elevated level of the GENSET gene expression, and wherein a decreased amount of the polypeptide or the RNA species within the biological sample compared to the level detected in or expected from the control sample indicates that the mammal has a reduced level of the GENSET gene expression.
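The comparison against a control level can be sketched in a few lines. This is a minimal illustration, not the assay itself: the function name, the "unchanged" category, and the optional tolerance band around the control level are assumptions introduced here and do not appear in the specification.

```python
def classify_expression(sample_level: float, control_level: float,
                        tolerance: float = 0.0) -> str:
    """Classify a measured polypeptide or RNA amount relative to a control.

    `tolerance` is a hypothetical fractional band around the control level
    inside which the sample is reported as unchanged (0.1 means +/-10%).
    """
    if control_level <= 0:
        raise ValueError("control level must be positive")
    ratio = sample_level / control_level
    if ratio > 1.0 + tolerance:
        return "elevated"
    if ratio < 1.0 - tolerance:
        return "reduced"
    return "unchanged"


# Hypothetical measurements in arbitrary units.
result = classify_expression(sample_level=15.0, control_level=10.0)
```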

In another aspect, the present invention provides a method of identifying a candidate modulator of a GENSET polypeptide, the method comprising: a) contacting any of the herein-described polypeptides with a test compound; and b) determining whether the compound specifically binds to the polypeptide; wherein a detection that the compound specifically binds to the polypeptide indicates that the compound is a candidate modulator of the GENSET polypeptide.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a map of the expression vector pPT.

FIG. 2 is a block diagram of an exemplary computer system.

FIG. 3 is a flow diagram illustrating one embodiment of a process 200 for comparing a new nucleotide or protein sequence with a database of sequences in order to determine the identity levels between the new sequence and the sequences in the database.

FIG. 4 is a flow diagram illustrating one embodiment of a process 250 in a computer for determining whether two sequences are homologous.

FIG. 5 is a flow diagram illustrating one embodiment of an identifier process 300 for detecting the presence of a feature in a sequence.

BRIEF DESCRIPTION OF TABLES

Table I provides the applicant's internal designation number assigned to each sequence identification number and indicates whether the sequence is a nucleic acid sequence or a polypeptide sequence, and in which vector the cDNA was cloned.

Table II provides structural features for each cDNA of SEQ ID Nos: 1-241 i.e., the locations of the full coding sequences, the signal peptides, the mature polypeptides, the polyA signal and the polyA site.

Table III lists variants for cDNAs of the present invention.

Table IV provides the positions of fragments which are preferably excluded from the present invention.

Tables Va and Vb provide the positions of fragments which are preferably excluded from or included in the present invention. Table IV, Table Va, and Table Vb provide for the inclusion and exclusion of polynucleotides independently from each other, in addition to those described elsewhere in the specification, and are therefore not meant as a limiting description.

Table VI lists known biological structural and functional domains for the polypeptides of the present invention.

Table VII lists antigenic peaks of predicted antigenic epitopes for polypeptides of the present invention.

Table VIII lists the putative chromosomal location of the polynucleotides of the present invention.

Table IX lists Genset's cDNA libraries of tissues and cell types examined that express the polynucleotides of the present invention.

Table X relates to the bias in spatial distribution of the polynucleotide sequences of the present invention.

Table XI lists predicted subcellular localization for cDNAs of the present invention.

Table XII gives the correspondence between the polynucleotides of the US priority applications, namely the US Provisional Patent Applications Ser. Nos. 60/169,629 and 60/187, (column entitled “Seq Id No in priority applications”) and the polynucleotides of the present application (column entitled “Seq Id No in present application”).

BRIEF DESCRIPTION OF SEQUENCE LISTING

SEQ ID Nos: 1-31 and 33-143 are the nucleotide sequences of cDNAs encoding a potentially secreted protein. The locations of the ORFs and sequences encoding signal peptides are listed in the accompanying Sequence Listing. In addition, the von Heijne score of the signal peptide computed as described below is listed as the “score” in the accompanying Sequence Listing. The sequence of the signal peptide is listed as “seq” in the accompanying Sequence Listing. The “/” in the signal peptide sequence indicates the location where proteolytic cleavage of the signal peptide occurs to generate a mature protein. When appropriate, the locations of the first and last nucleotides of the coding sequences and, where present, the locations of the first and last nucleotides of the polyA and the locations of the first and last nucleotides of the polyA sites are indicated.
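For illustration, an annotated “seq” entry can be split at the “/” marker to recover the residues on either side of the cleavage site. The function name and the example string below are hypothetical and are not taken from the Sequence Listing.

```python
from typing import Tuple


def split_at_cleavage_site(annotated_seq: str) -> Tuple[str, str]:
    """Split an annotated signal peptide entry at its single '/' marker.

    Returns (residues before the cleavage site, residues after it). The
    part before the marker belongs to the signal peptide; the part after
    it is the start of the mature protein.
    """
    if annotated_seq.count("/") != 1:
        raise ValueError("expected exactly one '/' cleavage marker")
    signal_part, mature_part = annotated_seq.split("/")
    return signal_part, mature_part


# Hypothetical entry: cleavage occurs between the final A and the Q.
signal_part, mature_start = split_at_cleavage_site("MKALLLLSLAGA/QE")
```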

SEQ ID Nos: 32 and 144-241 are the nucleotide sequences of cDNAs in which no sequence encoding a signal peptide has been identified to date. However, it remains possible that subsequent analysis will identify a sequence encoding a signal peptide in these nucleic acids. The locations of the ORFs are listed in the accompanying Sequence Listing. When appropriate, the locations of the first and last nucleotides of the coding sequences and, where present, the locations of the first and last nucleotides of the polyA and the locations of the first and last nucleotides of the polyA sites are indicated.

SEQ ID Nos: 242-272 and 274-384 are the amino acid sequences of polypeptides which contain a signal peptide. These polypeptides are encoded by the cDNAs of SEQ ID Nos: 1-31 and 33-143 respectively. The location of the signal peptide is listed in the accompanying Sequence Listing.

SEQ ID Nos: 273 and 385-482 are the amino acid sequences of polypeptides in which no signal peptide has been identified to date. However, it remains possible that subsequent analysis will identify a signal peptide in these polypeptides. These polypeptides are encoded by the nucleic acids of SEQ ID Nos: 32 and 144-241 respectively.
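The pairings stated above (1-31 with 242-272, 32 with 273, 33-143 with 274-384, and 144-241 with 385-482) are all consistent with a constant offset of 241 between a cDNA and its encoded polypeptide, provided each "respectively" pairing is read in strict numerical order. The following sketch encodes that reading; the helper names are hypothetical, and Table I remains the authoritative correspondence.

```python
CDNA_COUNT = 241  # cDNAs are SEQ ID Nos 1-241; polypeptides are 242-482

# cDNAs in which a signal-peptide-encoding sequence has been identified.
SIGNAL_PEPTIDE_CDNA_IDS = set(range(1, 32)) | set(range(33, 144))


def polypeptide_seq_id(cdna_seq_id: int) -> int:
    """Polypeptide SEQ ID No paired with a cDNA SEQ ID No (assumed offset of 241)."""
    if not 1 <= cdna_seq_id <= CDNA_COUNT:
        raise ValueError("cDNA SEQ ID No must be between 1 and 241")
    return cdna_seq_id + CDNA_COUNT


def has_identified_signal_peptide(cdna_seq_id: int) -> bool:
    """True when the cDNA belongs to the 1-31 / 33-143 group."""
    return cdna_seq_id in SIGNAL_PEPTIDE_CDNA_IDS
```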

In accordance with the regulations relating to Sequence Listings, the following codes have been used in the Sequence Listing to describe nucleotide sequences. The code “r” in the sequences indicates that the nucleotide may be a guanine or an adenine. The code “y” in the sequences indicates that the nucleotide may be a thymine or a cytosine. The code “m” in the sequences indicates that the nucleotide may be an adenine or a cytosine. The code “k” in the sequences indicates that the nucleotide may be a guanine or a thymine. The code “s” in the sequences indicates that the nucleotide may be a guanine or a cytosine. The code “w” in the sequences indicates that the nucleotide may be an adenine or a thymine. In addition, all instances of the symbol “n” in the nucleic acid sequences mean that the nucleotide can be adenine, guanine, cytosine or thymine.
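The ambiguity codes listed above can be captured in a small lookup table. The sketch below is illustrative only: the function names are hypothetical, and only the seven codes named in this paragraph are included.

```python
from itertools import product
from typing import List, Set

# Ambiguity codes exactly as described in the paragraph above.
AMBIGUITY_CODES = {
    "r": "ga",    # guanine or adenine
    "y": "tc",    # thymine or cytosine
    "m": "ac",    # adenine or cytosine
    "k": "gt",    # guanine or thymine
    "s": "gc",    # guanine or cytosine
    "w": "at",    # adenine or thymine
    "n": "agct",  # any of the four bases
}


def possible_bases(symbol: str) -> Set[str]:
    """Set of concrete bases that a single sequence symbol may represent."""
    s = symbol.lower()
    if len(s) == 1 and s in "acgt":
        return {s}
    return set(AMBIGUITY_CODES[s])


def expand(sequence: str) -> List[str]:
    """All concrete sequences consistent with an ambiguous sequence."""
    choices = [sorted(possible_bases(ch)) for ch in sequence]
    return ["".join(combo) for combo in product(*choices)]
```

Because the number of expansions multiplies with every ambiguous position, full expansion is only practical for short stretches such as a single codon.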

In some instances, the polypeptide sequences in the Sequence Listing contain the symbol “Xaa.” These “Xaa” symbols indicate either (1) a residue which cannot be identified because of nucleotide sequence ambiguity or (2) a stop codon in the determined sequence where applicants believe one should not exist (if the sequence were determined more accurately). In some instances, several possible identities of the unknown amino acids may be suggested by the genetic code.

In the case of secreted proteins, it should be noted that, in accordance with the regulations governing Sequence Listings, in the appended Sequence Listing, the encoded protein (i.e. the protein containing the signal peptide and the mature protein or part thereof) extends from an amino acid residue having a negative number through a positively numbered amino acid residue. Thus, the first amino acid of the mature protein resulting from cleavage of the signal peptide is designated as amino acid number 1, and the first amino acid of the signal peptide is designated with the appropriate negative number. However, in the present application, positions on amino acid sequences are always given on the full length polypeptide, the first amino acid of the signal peptide being designated as amino acid number 1.
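The two numbering schemes described above can be converted into one another once the length of the signal peptide is known. The sketch below assumes that the Sequence Listing numbering has no position 0, so that the last residue of the signal peptide is -1 and the first residue of the mature protein is 1; the function names are hypothetical.

```python
def full_length_to_listing(position: int, signal_length: int) -> int:
    """Convert a full-length position (signal peptide starts at 1) to
    Sequence Listing numbering (mature protein starts at 1)."""
    if position < 1 or signal_length < 0:
        raise ValueError("position must be >= 1 and signal_length >= 0")
    if position <= signal_length:
        # Inside the signal peptide: negative numbers, ending at -1.
        return position - signal_length - 1
    return position - signal_length


def listing_to_full_length(listing_position: int, signal_length: int) -> int:
    """Inverse conversion; position 0 does not exist in the listing scheme."""
    if listing_position == 0:
        raise ValueError("Sequence Listing numbering has no position 0")
    if listing_position < 0:
        position = listing_position + signal_length + 1
        if position < 1:
            raise ValueError("position lies before the start of the signal peptide")
        return position
    return listing_position + signal_length
```

For a hypothetical 20-residue signal peptide, full-length position 1 corresponds to -20 in the Sequence Listing, position 20 to -1, and position 21 to 1.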

DETAILED DESCRIPTION

Definitions

Before describing the invention in greater detail, the following definitions are set forth to illustrate and define the meaning and scope of the terms used to describe the invention herein.

The term “GENSET gene”, when used herein, encompasses genomic, mRNA and cDNA sequences encoding the GENSET protein, including the 5′ and 3′ untranslated regions of said sequences.

As used herein, a “secreted” protein is one which, when expressed in a suitable host cell, is transported across or through a membrane, including transport as a result of signal peptides in its amino acid sequence. “Secreted” proteins include without limitation proteins secreted wholly (e.g. soluble proteins), or partially (e.g. receptors) from the cell in which they are expressed. “Secreted” proteins also include without limitation proteins which are transported across the membrane of the endoplasmic reticulum. As used herein, a “mature protein” is the polypeptide fragment generated after the cleavage of the signal peptide.

The term “full coding sequence” or open reading frame (ORF) of a GENSET gene, when used herein, refers to the complete coding sequence of said gene. In the case of a secreted protein, the full coding sequence comprises the coding sequence for the signal peptide and the coding sequence for the mature polypeptide. Accordingly, the term “full-length polypeptide” refers to the complete polypeptide encoded by said GENSET gene and in the case of a secreted protein it comprises both the signal peptide and the mature polypeptide. The positions of the full length polypeptides and, in the case of secreted proteins, of signal peptides and mature polypeptides are given in the appended sequence listing.

The term “GENSET biological activity” is intended for polypeptides exhibiting an activity similar, but not necessarily identical, to an activity of the GENSET polypeptide of the invention. The GENSET biological activity of a given polypeptide may be assessed using a suitable biological assay well known to those skilled in the art such as the one(s) described herein. In contrast, the term “biological activity” refers to any activity that a polypeptide of the invention may have.

The term “corresponding mRNA” refers to the mRNA which was the template for the cDNA synthesis which produced a cDNA of the present invention.

The term “corresponding genomic DNA” refers to the genomic DNA which encodes mRNA which includes the sequence of one of the strands of the cDNA in which thymidine residues in the sequence of the cDNA are replaced by uracil residues in the mRNA.

The term “deposited clone pool” is used herein to refer to the pool of clones entitled GENSET.071PRF deposited in ATCC with the accession number PTA-1218 on Jan. 21, 2000.

The term “heterologous”, when used herein, is intended to designate any polynucleotide or polypeptide other than the GENSET polynucleotide or polypeptide respectively.

The term “isolated” requires that the material be removed from its original environment (e.g., the natural environment if it is naturally occurring). For example, a naturally-occurring polynucleotide or polypeptide present in a living animal is not isolated, but the same polynucleotide or DNA or polypeptide, separated from some or all of the coexisting materials in the natural system, is isolated. Such a polynucleotide could be part of a vector and/or such a polynucleotide or polypeptide could be part of a composition, and still be isolated in that the vector or composition is not part of its natural environment. Specifically excluded from the definition of “isolated” are: naturally-occurring chromosomes (such as chromosome spreads), artificial chromosome libraries, genomic libraries, and cDNA libraries that exist either as an in vitro nucleic acid preparation or as a transfected/transformed host cell preparation, wherein the host cells are either an in vitro heterogeneous preparation or plated as a heterogeneous population of single colonies. Also specifically excluded are the above libraries wherein a specified polynucleotide makes up less than 5% of the number of nucleic acid inserts in the vector molecules. Further specifically excluded are whole cell genomic DNA or whole cell RNA preparations (including said whole cell preparations which are mechanically sheared or enzymatically digested). 
Further specifically excluded are the above whole cell preparations as either an in vitro preparation or as a heterogeneous mixture separated by electrophoresis (including blot transfers of the same) wherein the polynucleotide of the invention has not further been separated from the heterologous polynucleotides in the electrophoresis medium (e.g., further separating by excising a single band from a heterogeneous band population in an agarose gel or nylon blot).

The term “purified” does not require absolute purity; rather, it is intended as a relative definition. Purification of starting material or natural material to at least one order of magnitude, preferably two or three orders, and more preferably four or five orders of magnitude is expressly contemplated. As an example, purification from 0.1% concentration to 10% concentration is two orders of magnitude. To illustrate, individual cDNA clones isolated from a cDNA library have been conventionally purified to electrophoretic homogeneity. The sequences obtained from these clones could not be obtained directly either from the library or from total human DNA. The cDNA clones are not naturally occurring as such, but rather are obtained via manipulation of a partially purified naturally occurring substance (messenger RNA). The conversion of mRNA into a cDNA library involves the creation of a synthetic substance (cDNA) and pure individual cDNA clones can be isolated from the synthetic library by clonal selection. Thus, creating a cDNA library from messenger RNA and subsequently isolating individual clones from that library results in an approximately 10⁴-10⁶ fold purification of the native message.
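The order-of-magnitude arithmetic used in this definition is the base-10 logarithm of the ratio of final to starting concentration. A minimal sketch, with a hypothetical function name:

```python
import math


def orders_of_magnitude(start_concentration: float, end_concentration: float) -> float:
    """Orders of magnitude of purification between two concentrations.

    Both values must be expressed in the same unit (for example percent),
    since only their ratio matters.
    """
    if start_concentration <= 0 or end_concentration <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(end_concentration / start_concentration)


# The example from the text: 0.1% to 10% is a 100-fold enrichment,
# that is, about two orders of magnitude.
example = orders_of_magnitude(0.1, 10.0)
```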

The term “purified” is further used herein to describe a polypeptide or polynucleotide of the invention which has been separated from other compounds including, but not limited to, polypeptides or polynucleotides, carbohydrates, lipids, etc. The term “purified” may be used to specify the separation of monomeric polypeptides of the invention from oligomeric forms such as homo- or hetero-dimers, trimers, etc. The term “purified” may also be used to specify the separation of covalently closed polynucleotides from linear polynucleotides. A polynucleotide is substantially pure when at least about 50%, preferably 60 to 75% of a sample exhibits a single polynucleotide sequence and conformation (linear versus covalently closed). A substantially pure polypeptide or polynucleotide typically comprises about 50%, preferably 60 to 90% weight/weight of a polypeptide or polynucleotide sample, respectively, more usually about 95%, and preferably is over about 99% pure. Polypeptide and polynucleotide purity, or homogeneity, is indicated by a number of means well known in the art, such as agarose or polyacrylamide gel electrophoresis of a sample, followed by visualizing a single band upon staining the gel. For certain purposes higher resolution can be provided by using HPLC or other means well known in the art. As an alternative embodiment, purification of the polypeptides and polynucleotides of the present invention may be expressed as “at least” a percent purity relative to heterologous polypeptides and polynucleotides (DNA, RNA or both). As a preferred embodiment, the polypeptides and polynucleotides of the present invention are at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 96%, 97%, 98%, 99%, or 100% pure relative to heterologous polypeptides and polynucleotides, respectively. 
As a further preferred embodiment the polypeptides and polynucleotides have a purity ranging from any number, to the thousandth position, between 90% and 100% (e.g., a polypeptide or polynucleotide at least 99.995% pure) relative to either heterologous polypeptides or polynucleotides, respectively, or as a weight/weight ratio relative to all compounds and molecules other than those existing in the carrier. Each number representing a percent purity, to the thousandth position, may be claimed as individual species of purity.

As used interchangeably herein, the terms “nucleic acid molecule(s)”, “oligonucleotide(s)”, and “polynucleotide(s)” include RNA or DNA (either single or double stranded, coding, complementary or antisense), or RNA/DNA hybrid sequences of more than one nucleotide in either single chain or duplex form (although each of the above species may be particularly specified). The term “nucleotide” is used herein as an adjective to describe molecules comprising RNA, DNA, or RNA/DNA hybrid sequences of any length in single-stranded or duplex form. More precisely, the expression “nucleotide sequence” encompasses the nucleic material itself and is thus not restricted to the sequence information (i.e. the succession of letters chosen among the four base letters) that biochemically characterizes a specific DNA or RNA molecule. The term “nucleotide” is also used herein as a noun to refer to individual nucleotides or varieties of nucleotides, meaning a molecule, or individual unit in a larger nucleic acid molecule, comprising a purine or pyrimidine, a ribose or deoxyribose sugar moiety, and a phosphate group, or phosphodiester linkage in the case of nucleotides within an oligonucleotide or polynucleotide. The term “nucleotide” is also used herein to encompass “modified nucleotides” which comprise at least one modification such as (a) an alternative linking group, (b) an analogous form of purine, (c) an analogous form of pyrimidine, or (d) an analogous sugar. For examples of analogous linking groups, purines, pyrimidines, and sugars see for example PCT publication No. WO 95/04064, which disclosure is hereby incorporated by reference in its entirety. 
Preferred modifications of the present invention include, but are not limited to, 5-fluorouracil, 5-bromouracil, 5-chlorouracil, 5-iodouracil, hypoxanthine, xanthine, 4-acetylcytosine, 5-(carboxyhydroxylmethyl) uracil, 5-carboxymethylaminomethyl-2-thiouridine, 5-carboxymethylaminomethyluracil, dihydrouracil, beta-D-galactosylqueosine, inosine, N6-isopentenyladenine, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-adenine, 7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil, beta-D-mannosylqueosine, 5′-methoxycarboxymethyluracil, 5-methoxyuracil, 2-methylthio-N-6-isopentenyladenine, uracil-5-oxyacetic acid (v), wybutoxosine, pseudouracil, queosine, 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil, 4-thiouracil, 5-methyluracil, uracil-5-oxyacetic acid methylester, uracil-5-oxyacetic acid, 5-methyl-2-thiouracil, 3-(3-amino-3-N-2-carboxypropyl) uracil, and 2,6-diaminopurine. The polynucleotide sequences of the invention may be prepared by any known method, including synthetic, recombinant, ex vivo generation, or a combination thereof, as well as utilizing any purification methods known in the art. Methylenemethylimino linked oligonucleosides, as well as mixed backbone compounds, may be prepared as described in U.S. Pat. Nos. 5,378,825; 5,386,023; 5,489,677; 5,602,240; and 5,610,289, which disclosures are hereby incorporated by reference in their entireties. Formacetal and thioformacetal linked oligonucleosides may be prepared as described in U.S. Pat. Nos. 5,264,562 and 5,264,564, which disclosures are hereby incorporated by reference in their entireties. Ethylene oxide linked oligonucleosides may be prepared as described in U.S. Pat. No. 5,223,618, which disclosure is hereby incorporated by reference in its entirety. Phosphinate oligonucleotides may be prepared as described in U.S. Pat. No. 5,508,270, which disclosure is hereby incorporated by reference in its entirety. Alkyl phosphonate oligonucleotides may be prepared as described in U.S. Pat. No. 4,469,863, which disclosure is hereby incorporated by reference in its entirety. 3′-Deoxy-3′-methylene phosphonate oligonucleotides may be prepared as described in U.S. Pat. No. 5,610,289 or 5,625,050, which disclosures are hereby incorporated by reference in their entireties. Phosphoramidite oligonucleotides may be prepared as described in U.S. Pat. No. 5,256,775 or U.S. Pat. No. 5,366,878, which disclosures are hereby incorporated by reference in their entireties. Alkylphosphonothioate oligonucleotides may be prepared as described in published PCT applications WO 94/17093 and WO 94/02499, which disclosures are hereby incorporated by reference in their entireties. 3′-Deoxy-3′-amino phosphoramidate oligonucleotides may be prepared as described in U.S. Pat. No. 5,476,925, which disclosure is hereby incorporated by reference in its entirety. Phosphotriester oligonucleotides may be prepared as described in U.S. Pat. No. 5,023,243, which disclosure is hereby incorporated by reference in its entirety. Borano phosphate oligonucleotides may be prepared as described in U.S. Pat. Nos. 5,130,302 and 5,177,198, which disclosures are hereby incorporated by reference in their entireties.

The term “upstream” is used herein to refer to a location which is toward the 5′ end of the polynucleotide from a specific reference point.

The terms “base paired” and “Watson & Crick base paired” are used interchangeably herein to refer to nucleotides which can be hydrogen bonded to one another by virtue of their sequence identities in a manner like that found in double-helical DNA, with thymine or uracil residues linked to adenine residues by two hydrogen bonds and cytosine and guanine residues linked by three hydrogen bonds (See Stryer, 1995, which disclosure is hereby incorporated by reference in its entirety).

The terms “complementary” or “complement thereof” are used herein to refer to the sequences of polynucleotides which are capable of forming Watson & Crick base pairing with another specified polynucleotide throughout the entirety of the complementary region. For the purpose of the present invention, a first polynucleotide is deemed to be complementary to a second polynucleotide when each base in the first polynucleotide is paired with its complementary base. Complementary bases are, generally, A and T (or A and U), or C and G. “Complement” is used herein as a synonym for “complementary polynucleotide”, “complementary nucleic acid” and “complementary nucleotide sequence”. These terms are applied to pairs of polynucleotides based solely upon their sequences and not any particular set of conditions under which the two polynucleotides would actually bind. Unless otherwise stated, all complementary polynucleotides are fully complementary on the whole length of the considered polynucleotide.
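
As an illustration of the sequence-only definition above, the following minimal Python sketch derives a complement and tests whether two polynucleotides are fully complementary over their whole length. The function names are hypothetical and chosen for this example only; both strands are assumed to be written 5′ to 3′, and only the standard bases A, C, G, T and U are handled.

```python
# Watson & Crick pairs as stated above: A with T (or U), C with G.
# Helper names are illustrative only and do not come from the specification.
PAIRS = {"A": "T", "T": "A", "U": "A", "C": "G", "G": "C"}


def complement_of(seq: str, rna: bool = False) -> str:
    """Return the complementary strand written 5'->3' (the reverse complement)."""
    pairs = dict(PAIRS)
    if rna:
        pairs["A"] = "U"  # emit U instead of T when an RNA complement is wanted
    return "".join(pairs[base] for base in reversed(seq.upper()))


def is_fully_complementary(first: str, second: str) -> bool:
    """True when every base of `first` pairs with `second` over the whole length."""
    first = first.upper().replace("U", "T")
    second = second.upper().replace("U", "T")
    return len(first) == len(second) and complement_of(first) == second
```

Because the test depends only on the two sequences, it says nothing about hybridization conditions, which matches the definition given above.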

The terms “polypeptide” and “protein”, used interchangeably herein, refer to a polymer of amino acids without regard to the length of the polymer; thus, peptides, oligopeptides, and proteins are included within the definition of polypeptide. This term also does not specify or exclude chemical or post-expression modifications of the polypeptides of the invention, although chemical or post-expression modifications of these polypeptides may be included or excluded as specific embodiments. Therefore, for example, modifications to polypeptides that include the covalent attachment of glycosyl groups, acetyl groups, phosphate groups, lipid groups and the like are expressly encompassed by the term polypeptide. Further, polypeptides with these modifications may be specified as individual species to be included or excluded from the present invention. The natural or other chemical modifications, such as those listed in examples above, can occur anywhere in a polypeptide, including the peptide backbone, the amino acid side-chains and the amino or carboxyl termini. It will be appreciated that the same type of modification may be present in the same or varying degrees at several sites in a given polypeptide. Also, a given polypeptide may contain many types of modifications. Polypeptides may be branched, for example, as a result of ubiquitination, and they may be cyclic, with or without branching.
Modifications include acetylation, acylation, ADP-ribosylation, amidation, covalent attachment of flavin, covalent attachment of a heme moiety, covalent attachment of a nucleotide or nucleotide derivative, covalent attachment of a lipid or lipid derivative, covalent attachment of phosphatidylinositol, cross-linking, cyclization, disulfide bond formation, demethylation, formation of covalent cross-links, formation of cysteine, formation of pyroglutamate, formylation, gamma-carboxylation, glycosylation, GPI anchor formation, hydroxylation, iodination, methylation, myristoylation, oxidation, pegylation, proteolytic processing, phosphorylation, prenylation, racemization, selenoylation, sulfation, transfer-RNA mediated addition of amino acids to proteins such as arginylation, and ubiquitination. (See, for instance, Creighton (1993); Seifter et al., (1990); Rattan et al., (1992)). Also included within the definition are polypeptides which contain one or more analogs of an amino acid (including, for example, non-naturally occurring amino acids, amino acids which only occur naturally in an unrelated biological system, modified amino acids from mammalian systems, etc.), polypeptides with substituted linkages, as well as other modifications known in the art, both naturally occurring and non-naturally occurring.

As used herein, the terms “recombinant polynucleotide” and “polynucleotide construct” are used interchangeably to refer to linear or circular, purified or isolated polynucleotides that have been artificially designed and which comprise at least two nucleotide sequences that are not found as contiguous nucleotide sequences in their initial natural environment. In particular, these terms mean that the polynucleotide or cDNA is adjacent to “backbone” nucleic acid to which it is not adjacent in its natural environment. Additionally, to be “enriched” the cDNAs will represent 5% or more of the number of nucleic acid inserts in a population of nucleic acid backbone molecules. Backbone molecules according to the present invention include nucleic acids such as expression vectors, self-replicating nucleic acids, viruses, integrating nucleic acids, and other vectors or nucleic acids used to maintain or manipulate a nucleic acid insert of interest. Preferably, the enriched cDNAs represent 15% or more of the number of nucleic acid inserts in the population of recombinant backbone molecules. More preferably, the enriched cDNAs represent 50% or more of the number of nucleic acid inserts in the population of recombinant backbone molecules. In a highly preferred embodiment, the enriched cDNAs represent 90% or more (including any number between 90 and 100%, to the thousandth position, e.g., 99.5%) of the number of nucleic acid inserts in the population of recombinant backbone molecules.

The term “recombinant polypeptide” is used herein to refer to polypeptides that have been artificially designed and which comprise at least two polypeptide sequences that are not found as contiguous polypeptide sequences in their initial natural environment, or to refer to polypeptides which have been expressed from a recombinant polynucleotide.

As used herein, the term “operably linked” refers to a linkage of polynucleotide elements in a functional relationship. A sequence which is “operably linked” to a regulatory sequence such as a promoter means that said regulatory element is in the correct location and orientation in relation to the nucleic acid to control RNA polymerase initiation and expression of the nucleic acid of interest. For instance, a promoter or enhancer is operably linked to a coding sequence if it affects the transcription of the coding sequence.

As used herein, the term “non-human animal” refers to any non-human animal, including insects, birds, rodents and more usually mammals. Preferred non-human animals include: primates; farm animals such as swine, goats, sheep, donkeys, cattle, horses, chickens, rabbits; and rodents, preferably rats or mice. As used herein, the term “animal” is used to refer to any species in the animal kingdom, preferably vertebrates, including birds and fish, and more preferably a mammal. Both the terms “animal” and “mammal” expressly embrace human subjects unless preceded with the term “non-human”.

The term “domain” refers to an amino acid fragment with specific biological properties. This term encompasses all known structural and linear biological motifs. Examples of such motifs include but are not limited to leucine zippers, helix-turn-helix motifs, glycosylation sites, ubiquitination sites, alpha helices, and beta sheets, signal peptides which direct the secretion of proteins, sites for post-translational modification, enzymatic active sites, substrate binding sites, and enzymatic cleavage sites.

Although they have distinct meanings, the terms “comprising”, “consisting of” and “consisting essentially of” may be interchanged for one another throughout the instant application. The term “having” has the same meaning as “comprising” and may be replaced with either the term “consisting of” or “consisting essentially of”.

An “amplification product” refers to a product of any amplification reaction, e.g. PCR, RT-PCR, LCR, etc.

A “modulator” of a protein or other compound refers to any agent that has a functional effect on the protein, including physical binding to the protein, alterations of the quantity or quality of expression of the protein, altering any measurable or detectable activity, property, or behavior of the protein, or in any way interacts with the protein or compound.

“A test compound” can be any molecule that is evaluated for its ability to modulate a protein or other compound.

Unless otherwise specified in the application, nucleotides and amino acids of polynucleotides and polypeptides respectively of the present invention are contiguous and not interrupted by heterologous sequences.

Identity Between Nucleic Acids or Polypeptides

The terms “percentage of sequence identity” and “percentage homology” are used interchangeably herein to refer to comparisons among polynucleotides and polypeptides, and are determined by comparing two optimally aligned sequences over a comparison window, wherein the portion of the polynucleotide or polypeptide sequence in the comparison window may comprise additions or deletions (i.e., gaps) as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. The percentage is calculated by determining the number of positions at which the identical nucleic acid base or amino acid residue occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison and multiplying the result by 100 to yield the percentage of sequence identity. Homology is evaluated using any of the variety of sequence comparison algorithms and programs known in the art. Such algorithms and programs include, but are by no means limited to, TBLASTN, BLASTP, FASTA, TFASTA, CLUSTALW, FASTDB (Pearson and Lipman, 1988; Altschul et al., 1990; Thompson et al., 1994; Higgins et al., 1996; Altschul et al., 1990; Altschul et al., 1993; Brutlag et al., 1990), the disclosures of which are incorporated by reference in their entireties.
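
The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sequences are supplied already aligned and of equal length, gaps are written as “-”, and the whole alignment is taken as the comparison window. The function name is hypothetical, and the alignment step itself (BLAST, FASTA, FASTDB, etc.) is not reproduced.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Matched positions / total positions in the window, multiplied by 100.

    Assumes both inputs are rows of one alignment (equal length, '-' for gaps).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("comparison window is empty")
    a, b = aligned_a.upper(), aligned_b.upper()
    # A position counts as matched only when both rows carry the same residue;
    # a gap never counts as a match.
    matched = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matched / len(a)
```

For instance, aligning `ACGT-A` against `ACCTGA` gives four matched positions in a six-position window, i.e. about 66.7% identity.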

In a particularly preferred embodiment, protein and nucleic acid sequence homologies are evaluated using the Basic Local Alignment Search Tool (“BLAST”) which is well known in the art (see, e.g., Karlin and Altschul, 1990; Altschul et al., 1990, 1993, 1997), the disclosures of which are incorporated by reference in their entireties. In particular, five specific BLAST programs are used to perform the following tasks:

(1) BLASTP and BLAST3 compare an amino acid query sequence against a protein sequence database;

(2) BLASTN compares a nucleotide query sequence against a nucleotide sequence database;

(3) BLASTX compares the six-frame conceptual translation products of a query nucleotide sequence (both strands) against a protein sequence database;

(4) TBLASTN compares a query protein sequence against a nucleotide sequence database translated in all six reading frames (both strands); and

(5) TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

The BLAST programs identify homologous sequences by identifying similar segments, which are referred to herein as “high-scoring segment pairs,” between a query amino or nucleic acid sequence and a test sequence which is preferably obtained from a protein or nucleic acid sequence database. High-scoring segment pairs are preferably identified (i.e., aligned) by means of a scoring matrix, many of which are known in the art. Preferably, the scoring matrix used is the BLOSUM62 matrix (Gonnet et al., 1992; Henikoff and Henikoff, 1993), the disclosures of which are incorporated by reference in their entireties. Less preferably, the PAM or PAM250 matrices may also be used (see, e.g., Schwartz and Dayhoff, eds., 1978), the disclosure of which is incorporated by reference in its entirety. The BLAST programs evaluate the statistical significance of all high-scoring segment pairs identified, and preferably select those segments which satisfy a user-specified threshold of significance, such as a user-specified percent homology. Preferably, the statistical significance of a high-scoring segment pair is evaluated using the statistical significance formula of Karlin (see, e.g., Karlin and Altschul, 1990), the disclosure of which is incorporated by reference in its entirety. The BLAST programs may be used with the default parameters or with modified parameters provided by the user.

Another preferred method for determining the best overall match between a query nucleotide sequence (a sequence of the present invention) and a subject sequence, also referred to as a global sequence alignment, uses the FASTDB computer program based on the algorithm of Brutlag et al. (1990), the disclosure of which is incorporated by reference in its entirety. In a sequence alignment the query and subject sequences are both DNA sequences. An RNA sequence can be compared by first converting U's to T's. The result of said global sequence alignment is in percent identity. Preferred parameters used in a FASTDB alignment of DNA sequences to calculate percent identity are: Matrix=Unitary, k-tuple=4, Mismatch Penalty=1, Joining Penalty=30, Randomization Group Length=0, Cutoff Score=1, Gap Penalty=5, Gap Size Penalty=0.05, Window Size=500 or the length of the subject nucleotide sequence, whichever is shorter. If the subject sequence is shorter than the query sequence because of 5′ or 3′ deletions, not because of internal deletions, a manual correction must be made to the results. This is because the FASTDB program does not account for 5′ and 3′ truncations of the subject sequence when calculating percent identity. For subject sequences truncated at the 5′ or 3′ ends, relative to the query sequence, the percent identity is corrected by calculating the number of bases of the query sequence that are 5′ and 3′ of the subject sequence, which are not matched/aligned, as a percent of the total bases of the query sequence. Whether a nucleotide is matched/aligned is determined by results of the FASTDB sequence alignment. This percentage is then subtracted from the percent identity, calculated by the above FASTDB program using the specified parameters, to arrive at a final percent identity score. This corrected score is what is used for the purposes of the present invention.
Only nucleotides outside the 5′ and 3′ nucleotides of the subject sequence, as displayed by the FASTDB alignment, which are not matched/aligned with the query sequence, are calculated for the purposes of manually adjusting the percent identity score. For example, a 90 nucleotide subject sequence is aligned to a 100 nucleotide query sequence to determine percent identity. The deletions occur at the 5′ end of the subject sequence and therefore, the FASTDB alignment does not show a match/alignment of the first 10 nucleotides at the 5′ end. The 10 unpaired nucleotides represent 10% of the sequence (number of nucleotides at the 5′ and 3′ ends not matched/total number of nucleotides in the query sequence) so 10% is subtracted from the percent identity score calculated by the FASTDB program. If the remaining 90 nucleotides were perfectly matched the final percent identity would be 90%. In another example, a 90 nucleotide subject sequence is compared with a 100 nucleotide query sequence. This time the deletions are internal deletions so that there are no nucleotides on the 5′ or 3′ of the subject sequence which are not matched/aligned with the query. In this case the percent identity calculated by FASTDB is not manually corrected. Once again, only nucleotides 5′ and 3′ of the subject sequence which are not matched/aligned with the query sequence are manually corrected. No other manual corrections are made for the purposes of the present invention.
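
The manual correction described above is simple arithmetic and can be sketched as follows. The function and parameter names are hypothetical; the FASTDB percent identity and the counts of unmatched terminal query positions are assumed to be read from an alignment produced elsewhere.

```python
def corrected_percent_identity(
    fastdb_identity: float,
    query_length: int,
    unmatched_5prime: int = 0,
    unmatched_3prime: int = 0,
) -> float:
    """Subtract the terminal-truncation penalty from a FASTDB percent identity.

    Only query positions lying outside the 5' and 3' ends of the subject are
    counted; internal deletions are not corrected.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    overhang = unmatched_5prime + unmatched_3prime
    penalty = 100.0 * overhang / query_length  # percent of the query left unaligned
    return fastdb_identity - penalty


# First worked example above: 90-nt subject, 100-nt query, 10 nt missing at the 5' end.
five_prime_truncation = corrected_percent_identity(100.0, 100, unmatched_5prime=10)
```

With these inputs the corrected score is 90%. When the deletions are internal, both unmatched counts are zero and the FASTDB score is returned unchanged.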

Another preferred method for determining the best overall match between a query amino acid sequence (a sequence of the present invention) and a subject sequence, also referred to as a global sequence alignment, uses the FASTDB computer program based on the algorithm of Brutlag et al. (1990). In a sequence alignment the query and subject sequences are both amino acid sequences. The result of said global sequence alignment is in percent identity. Preferred parameters used in a FASTDB amino acid alignment are: Matrix=PAM 0, k-tuple=2, Mismatch Penalty=1, Joining Penalty=20, Randomization Group Length=0, Cutoff Score=1, Window Size=sequence length, Gap Penalty=5, Gap Size Penalty=0.05, Window Size=500 or the length of the subject amino acid sequence, whichever is shorter. If the subject sequence is shorter than the query sequence due to N- or C-terminal deletions, not because of internal deletions, the results, in percent identity, must be manually corrected. This is because the FASTDB program does not account for N- and C-terminal truncations of the subject sequence when calculating global percent identity. For subject sequences truncated at the N- and C-termini, relative to the query sequence, the percent identity is corrected by calculating the number of residues of the query sequence that are N- and C-terminal of the subject sequence, which are not matched/aligned with a corresponding subject residue, as a percent of the total residues of the query sequence. Whether a residue is matched/aligned is determined by results of the FASTDB sequence alignment. This percentage is then subtracted from the percent identity, calculated by the above FASTDB program using the specified parameters, to arrive at a final percent identity score. This final percent identity score is what is used for the purposes of the present invention.
Only residues to the N- and C-termini of the subject sequence, which are not matched/aligned with the query sequence, are considered for the purposes of manually adjusting the percent identity score. That is, only query amino acid residues outside the farthest N- and C-terminal residues of the subject sequence are considered. For example, a 90 amino acid residue subject sequence is aligned with a 100-residue query sequence to determine percent identity. The deletion occurs at the N-terminus of the subject sequence and therefore, the FASTDB alignment does not match/align the first 10 residues at the N-terminus. The 10 unpaired residues represent 10% of the sequence (number of residues at the N- and C-termini not matched/total number of residues in the query sequence) so 10% is subtracted from the percent identity score calculated by the FASTDB program. If the remaining 90 residues were perfectly matched the final percent identity would be 90%. In another example, a 90-residue subject sequence is compared with a 100-residue query sequence. This time the deletions are internal so there are no residues at the N- or C-termini of the subject sequence which are not matched/aligned with the query. In this case the percent identity calculated by FASTDB is not manually corrected. Once again, only residue positions outside the N- and C-terminal ends of the subject sequence, as displayed in the FASTDB alignment, which are not matched/aligned with the query sequence are manually corrected. No other manual corrections are made for the purposes of the present invention.

The term “percentage of sequence similarity” refers to comparisons between polypeptide sequences and is determined by comparing two optimally aligned sequences over a comparison window, wherein the portion of the polypeptide sequence in the comparison window may comprise additions or deletions (i.e., gaps) as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. The percentage is calculated by determining the number of positions at which an identical or equivalent amino acid residue occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison and multiplying the result by 100 to yield the percentage of sequence similarity. Similarity is evaluated using any of the variety of sequence comparison algorithms and programs known in the art, including those described above in this section. Equivalent amino acid residues are defined herein in the “Mutated polypeptides” section.

Polynucleotides of the Invention

The present invention concerns GENSET genomic and cDNA sequences. The present invention encompasses GENSET genes, polynucleotides comprising GENSET genomic and cDNA sequences, as well as fragments and variants thereof. These polynucleotides may be purified, isolated, or recombinant.

Also encompassed by the present invention are allelic variants, orthologs, splice variants, and/or species homologues of the GENSET genes. Procedures known in the art can be used to obtain full-length genes and cDNAs, allelic variants, splice variants, full-length coding portions, orthologs, and/or species homologues of genes and cDNAs corresponding to a nucleotide sequence selected from the group consisting of sequences of SEQ ID Nos: 1-241 and sequences of clone inserts of the deposited clone pool, using information from the sequences disclosed herein or the clone pool deposited with the ATCC. For example, allelic variants, orthologs and/or species homologues may be isolated and identified by making suitable probes or primers from the sequences provided herein and screening a suitable nucleic acid source for allelic variants and/or the desired homologue using any technique known to those skilled in the art, including those described in the section entitled “To find similar sequences”.

In a specific embodiment, the polynucleotides of the invention are at least 15, 30, 50, 100, 125, 500, or 1000 continuous nucleotides. In another embodiment, the polynucleotides are less than or equal to 300 kb, 200 kb, 100 kb, 50 kb, 10 kb, 7.5 kb, 5 kb, 2.5 kb, 2 kb, 1.5 kb, or 1 kb in length. In a further embodiment, polynucleotides of the invention comprise a portion of the coding sequences, as disclosed herein, but do not comprise all or a portion of any intron. In another embodiment, the polynucleotides comprising coding sequences do not contain coding sequences of a genomic flanking gene (i.e., 5′ or 3′ to the gene of interest in the genome). In other embodiments, the polynucleotides of the invention do not contain the coding sequence of more than 1000, 500, 250, 100, 75, 50, 25, 20, 15, 10, 5, 4, 3, 2, or 1 naturally occurring genomic flanking gene(s).

Deposited Clone Pool of the Invention

Expression of GENSET genes has been shown to lead to the production of at least one mRNA species per GENSET gene, the cDNA sequence of which is set forth in the appended sequence listing as SEQ ID Nos: 1-241. The cDNAs (SEQ ID Nos: 1-241) corresponding to these GENSET mRNA species were cloned in the vector pBluescriptII SK (Stratagene) or one of its derivatives, called pPT (see FIG. 1). Cells containing the cloned cDNAs of the present invention are maintained in permanent deposit by the inventors at Genset, S.A., 24 Rue Royale, 75008 Paris, France. Table I provides the applicant's internal designation number (column entitled “Internal designation”) assigned to each sequence identification number of SEQ ID Nos: 1-482 (column entitled “Seq Id No”) and indicates whether the sequence is a nucleic acid sequence or a polypeptide sequence (column entitled “Type”), and in which vector the cDNA was cloned (column entitled “Vector”).

Each cDNA can be removed from the Bluescript vector in which it was inserted by performing a NotI/PstI double digestion to produce the appropriate fragment for each clone, provided the cDNA sequence of interest does not contain either of these restriction sites within its sequence. The preferable sites for cDNA removal for those clones inserted into pPT are MunI and HindIII, the sites used for cloning, provided the cDNA sequence of interest does not contain either of these restriction sites within its sequence. Alternatively, other restriction enzymes of the multicloning site of the vector may be used to recover the desired insert as indicated by the manufacturer or in FIG. 1.
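
Whether a given cDNA can be excised with the enzyme pairs named above can be checked in silico before digestion. The sketch below is illustrative only: the function names are hypothetical, the recognition sequences are the commonly tabulated ones for these four enzymes (all palindromic, so scanning one strand is sufficient), and methylation sensitivity and star activity are ignored.

```python
# Commonly tabulated recognition sequences for the enzymes named above.
RECOGNITION_SITES = {
    "NotI": "GCGGCCGC",
    "PstI": "CTGCAG",
    "MunI": "CAATTG",
    "HindIII": "AAGCTT",
}


def internal_sites(cdna: str, enzymes) -> dict:
    """Map each enzyme to the 0-based positions of its site inside the cDNA."""
    seq = cdna.upper()
    hits = {}
    for enzyme in enzymes:
        site = RECOGNITION_SITES[enzyme]
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start)
            start = seq.find(site, start + 1)  # allow overlapping occurrences
        if positions:
            hits[enzyme] = positions
    return hits


def can_excise(cdna: str, enzymes) -> bool:
    """True when none of the chosen enzymes cuts within the insert itself."""
    return not internal_sites(cdna, enzymes)
```

A call such as `can_excise(insert, ["NotI", "PstI"])` would be used for Bluescript clones and `can_excise(insert, ["MunI", "HindIII"])` for pPT clones; if either returns false, another enzyme of the multicloning site would have to be chosen.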

Pools of cells containing the cDNAs of the invention, from which the cells containing a particular polynucleotide are obtainable, were also deposited with the American Type Culture Collection (ATCC), 10801 University Boulevard, Manassas, Va. 20110-2209, United States. Each cDNA clone has been transfected into separate bacterial cells (E. coli) for these composite deposits. In particular, cells containing the sequences of SEQ ID Nos: 1-241 were deposited on Jan. 21, 2000 in the pool having ATCC Accession No. PTA-1218 and designated GENSET.071PRF.

Bacterial cells containing a particular clone can be obtained from the composite deposit as follows:

An oligonucleotide probe or probes should be designed to the sequence that is known for that particular clone. This sequence can be derived from the sequences provided herein, or from a combination of those sequences. The design of the oligonucleotide probe should preferably follow these parameters:

(a) It should be designed to an area of the sequence which has the fewest ambiguous bases (“N's”), if any;

(b) Preferably, the probe is designed to have a Tm of approximately 80 degrees Celsius (assuming 2 degrees for each A or T and 4 degrees for each G or C). However, probes having melting temperatures between 40 degrees Celsius and 80 degrees Celsius may also be used provided that specificity is not lost.
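
The Tm estimate in parameter (b) (2 degrees per A or T, 4 degrees per G or C, often called the Wallace rule) can be computed as in the sketch below. The function names are hypothetical; the 40 to 80 degree window is the one stated above, and probes containing ambiguous bases are rejected in line with parameter (a).

```python
def estimated_tm(probe: str) -> int:
    """Estimate Tm in degrees Celsius: 2 per A or T, 4 per G or C."""
    seq = probe.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("probe contains ambiguous or non-DNA characters")
    return 2 * at + 4 * gc


def tm_in_usable_range(probe: str, low: int = 40, high: int = 80) -> bool:
    """True when the estimated Tm falls within the range stated above."""
    return low <= estimated_tm(probe) <= high
```

As an example, the 17-mer GTAAAACGACGGCCAGT has 8 A/T and 9 G/C bases, giving an estimated Tm of 52 degrees Celsius, which is inside the usable range but below the preferred value of about 80.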

The oligonucleotide should preferably be labeled with gamma-[32P]ATP (specific activity 6000 Ci/mmole) and T4 polynucleotide kinase using commonly employed techniques for labeling oligonucleotides. Other labeling techniques can also be used. Unincorporated label should preferably be removed by gel filtration chromatography or other established methods. The amount of radioactivity incorporated into the probe should be quantified by measurement in a scintillation counter. Preferably, specific activity of the resulting probe should be approximately 4×10^6 dpm/pmole.

The bacterial culture containing the pool of full-length clones should preferably be thawed and 100 µl of the stock used to inoculate a sterile culture flask containing 25 ml of sterile L-broth containing ampicillin at 100 µg/ml. The culture should preferably be grown to saturation at 37 degrees Celsius, and the saturated culture should preferably be diluted in fresh L-broth. Aliquots of these dilutions should preferably be plated to determine the dilution and volume which will yield approximately 5000 distinct and well-separated colonies on solid bacteriological media containing L-broth containing ampicillin at 100 µg/ml and agar at 1.5% in a 150 mm petri dish when grown overnight at 37 degrees Celsius. Other known methods of obtaining distinct, well-separated colonies can also be employed.

Standard colony hybridization procedures should then be used to transfer the colonies to nitrocellulose filters and lyse, denature and bake them.

The filter is then preferably incubated at 65 degrees Celsius for 1 hour with gentle agitation in 6×SSC (20× stock is 175.3 g NaCl/liter, 88.2 g Na citrate/liter, adjusted to pH 7.0 with NaOH) containing 0.5% SDS, 100 pg/ml of yeast RNA, and 10 mM EDTA (approximately 10 ml per 150 mm filter). Preferably, the probe is then added to the hybridization mix at a concentration greater than or equal to 1×10^6 dpm/ml. The filter is then preferably incubated at 65 degrees Celsius with gentle agitation overnight. The filter is then preferably washed in 500 ml of 2×SSC/0.1% SDS at room temperature with gentle shaking for 15 minutes. A third wash with 0.1×SSC/0.5% SDS at 65 degrees Celsius for 30 minutes to 1 hour is optional. The filter is then preferably dried and subjected to autoradiography for sufficient time to visualize the positives on the X-ray film. Other known hybridization methods can also be employed.
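
The SSC working solutions named above are all dilutions of the 20× stock, so the stock volume needed for each follows from a simple ratio. The sketch below is a convenience calculation only; the function name is hypothetical, and the SDS, yeast RNA and EDTA components are not accounted for.

```python
def ssc_stock_volume_ml(target_x: float, final_volume_ml: float, stock_x: float = 20.0) -> float:
    """Volume of SSC stock (ml) to dilute to `final_volume_ml` at `target_x` strength."""
    if not 0 < target_x <= stock_x:
        raise ValueError("target strength must be positive and no greater than the stock")
    if final_volume_ml <= 0:
        raise ValueError("final volume must be positive")
    return final_volume_ml * target_x / stock_x


# Volumes of 20x stock for the solutions described above.
prehybridization = ssc_stock_volume_ml(6, 10)    # 10 ml of 6x SSC per 150 mm filter
room_temp_wash = ssc_stock_volume_ml(2, 500)     # 500 ml of 2x SSC wash
```

Under these assumptions, 3 ml of 20× stock is needed for 10 ml of 6×SSC and 50 ml for 500 ml of 2×SSC, with the balance made up by water and the other listed components.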

The positive colonies are picked, grown in culture, and plasmid DNA isolated using standard procedures. The clones can then be verified by restriction analysis, hybridization analysis, or DNA sequencing. The plasmid DNA obtained using these procedures may then be manipulated using standard cloning techniques familiar to those skilled in the art.

Alternatively, to recover cDNA inserts from the pool of bacteria, a PCR can be performed on plasmid DNA isolated using standard procedures and primers designed at both ends of the cDNA insertion, including primers designed in the multicloning site of the vector. For example, a PCR reaction may be conducted using universal primers designed by the plasmid provider or using primers which are specific to the cDNA of interest. In the case of Bluescript SK(−), a PCR reaction may be conducted using a primer having the sequence GGAAACAGCTATGACCA and a primer having the sequence GTAAAACGACGGCCAGT. This will produce a DNA fragment including a piece of the multiple cloning site and the cDNA insert. If a specific cDNA of interest is to be recovered, primers may be designed in order to be specific for the 5′ end and the 3′ end of this cDNA using sequence information available from the appended sequence listing. The PCR product which corresponds to the cDNA of interest can then be manipulated using standard cloning techniques familiar to those skilled in the art.
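
The expected product of such a PCR can be predicted in silico from a plasmid sequence. The sketch below is a simplified illustration: the function names are hypothetical, the template is treated as a linear single strand, primers must match exactly, and only the first binding site of each primer is considered. The two primer sequences are those given above for Bluescript SK(−).

```python
PRIMER_1 = "GGAAACAGCTATGACCA"
PRIMER_2 = "GTAAAACGACGGCCAGT"

_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def predicted_amplicon(template: str, primer_a: str = PRIMER_1, primer_b: str = PRIMER_2):
    """Return the predicted product (primers included), or None if no product.

    One primer must match the given strand and the other must match the
    opposite strand downstream of it; both orientations are tried.
    """
    strand = template.upper()
    for forward, reverse in ((primer_a, primer_b), (primer_b, primer_a)):
        start = strand.find(forward)
        if start == -1:
            continue
        end = strand.find(reverse_complement(reverse), start + len(forward))
        if end != -1:
            return strand[start:end + len(reverse)]
    return None
```

The returned sequence includes both primers, the flanking piece of the multiple cloning site and the cDNA insert, which corresponds to the DNA fragment described above.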

Therefore, an object of the invention is an isolated, purified, or recombinant polynucleotide comprising a nucleotide sequence selected from the group consisting of cDNA inserts of the deposited clone pool. Moreover, preferred polynucleotides of the invention include purified, isolated, or recombinant GENSET cDNAs consisting of, consisting essentially of, or comprising a nucleotide sequence selected from the group consisting of cDNA inserts of the deposited clone pool.

The polynucleotides of SEQ ID NOs: 1-241 may be interchanged with the corresponding polynucleotides encoded by the human cDNA of the clone inserts of the deposited clone pool. The polypeptides of SEQ ID NOs: 242-482 may be interchanged with the corresponding polypeptides encoded by the human cDNA of the clone inserts of the deposited clone pool. The correspondence between the polynucleotides of SEQ ID NOs: 1-241, the polypeptides of SEQ ID NOs: 242-482 and the clone inserts of the deposited clone pool is given in Table I.

cDNA sequences of the invention

Another object of the invention is a purified, isolated, or recombinant polynucleotide comprising a nucleotide sequence selected from the group consisting of sequences of SEQ ID Nos: 1-241, complementary sequences thereto, and fragments thereof. Moreover, preferred polynucleotides of the invention include purified, isolated, or recombinant GENSET cDNAs consisting of, consisting essentially of, or comprising a sequence selected from the group consisting of SEQ ID Nos: 1-241.

The GENSET polynucleotide sequences of SEQ ID Nos: 1-241 were then searched for open reading frames able to encode polypeptides. The GENSET ORFs were also searched to identify potential signal sequence motifs using slight modifications of the procedures disclosed in Von Heijne, Nucleic Acids Res. 14:4683-4690, 1986, as described in PCT publication WO 00/37491, the entire disclosures of which are incorporated herein by reference. The GENSET cDNAs of SEQ ID Nos: 1-31 and 33-143, encoding polypeptides of SEQ ID Nos: 242-272 and 274-384, were thus found to contain such signal sequences.
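An open reading frame search of the kind referred to above may be sketched as follows. This is a minimal illustration under stated assumptions (forward strand only, ATG start codon, standard stop codons, longest ORF retained) and is not the procedure actually used. The signal sequence search is not reproduced here.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG...stop ORF, or None.

    Coordinates are 1-based and inclusive, and the stop codon is included.
    Only the three forward reading frames of the given strand are scanned.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon of a candidate ORF
            elif start is not None and codon in STOPS:
                length = i + 3 - start
                if best is None or length > best[1] - best[0] + 1:
                    best = (start + 1, i + 3)
                start = None  # resume scanning for the next ORF in this frame
    return best


print(longest_orf("CCATGGCTCTGTGGTAACC"))
```

The sequence in the usage line is a hypothetical toy example.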

Structural parameters of each of the cDNAs of the present invention are described in Table II. Namely, Table II provides, for each cDNA of SEQ ID Nos: 1-241 referred to by its sequence identification number (column entitled “Seq Id No”), the locations of the first and last nucleotides of the coding sequences (listed under the heading “Full Coding Sequence”), and, if applicable, the locations of the signal sequence and the sequence encoding the mature polypeptide in the case of secreted proteins (SEQ ID Nos: 1-31 and 33-143) listed under the headings “Signal Sequence” and “Coding Sequence for the mature Protein” respectively, the locations of the first and last nucleotides of the polyA signals (listed under the heading “Poly A Signal”) and the locations of the first and last nucleotides of the polyA sites (listed under the heading “Poly A Site”).

Accordingly, the full coding sequence (CDS) or open reading frame (ORF) of each cDNA of the invention refers to the nucleotide sequence beginning with the first nucleotide of the start codon and ending with the last nucleotide of the stop codon (see column entitled “Full coding sequence” of Table II for sequences of Seq Id Nos: 1-241). Similarly, the signal sequence of each cDNA of the invention refers to the nucleotide sequence beginning with the first nucleotide of the start codon and ending with the last nucleotide of the codon encoding the signal peptide (see column entitled “Signal sequence” of Table II for sequences of Seq Id Nos: 1-31 and 33-143) and the coding sequence for the mature polypeptide of each cDNA of the invention refers to the nucleotide sequence beginning with the first nucleotide of the first codon encoding the mature polypeptide and ending with the last nucleotide of the stop codon (see column entitled “Coding sequence for mature protein” of Table II for sequences of Seq Id Nos: 1-31 and 33-143). Similarly, the 5′ untranslated region (or 5′UTR) of each cDNA of the invention refers to the nucleotide sequence starting at nucleotide 1 and ending at the nucleotide immediately 5′ to the first nucleotide of the start codon. The 3′ untranslated region (or 3′UTR) of each cDNA of the invention refers to the nucleotide sequence starting at the nucleotide immediately 3′ to the last nucleotide of the stop codon and ending at the last nucleotide of the cDNA.
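The region definitions above translate directly into coordinate arithmetic. The following sketch assumes 1-based, inclusive positions of the kind given in Table II. The function name, the dictionary keys and the toy cDNA are hypothetical.

```python
def split_regions(cdna, cds_start, cds_end, signal_end=None):
    """Slice a cDNA into the regions defined above.

    All coordinates are 1-based and inclusive. cds_end is the last
    nucleotide of the stop codon; signal_end, if given, is the last
    nucleotide of the last codon encoding the signal peptide.
    """
    if not (1 <= cds_start <= cds_end <= len(cdna)):
        raise ValueError("coding sequence coordinates fall outside the cDNA")
    regions = {
        "5'UTR": cdna[:cds_start - 1],        # nucleotide 1 up to the start codon
        "CDS": cdna[cds_start - 1:cds_end],   # start codon through stop codon
        "3'UTR": cdna[cds_end:],              # after the stop codon to the end
    }
    if signal_end is not None:
        regions["signal"] = cdna[cds_start - 1:signal_end]
        regions["mature"] = cdna[signal_end:cds_end]
    return regions


# Hypothetical 23-nucleotide cDNA: 2 nt 5'UTR, 18 nt CDS, 3 nt 3'UTR.
toy_cdna = "GG" + "ATGAAACTG" + "GCTTGGTAA" + "CCC"
toy_regions = split_regions(toy_cdna, cds_start=3, cds_end=20, signal_end=11)
print(toy_regions)
```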

Untranslated Regions

In addition, the invention concerns a purified, isolated, and recombinant nucleic acid comprising a nucleotide sequence selected from the group consisting of the 5′UTRs of sequences of SEQ ID Nos: 1-241 and sequences of clone inserts of the deposited clone pool, sequences complementary thereto, and allelic variants thereof. The invention also concerns a purified, isolated, and recombinant nucleic acid comprising a nucleotide sequence selected from the group consisting of the 3′UTRs of sequences of SEQ ID Nos: 1-241 and sequences of clone inserts of the deposited clone pool, sequences complementary thereto, and allelic variants thereof.

These polynucleotides may be used to detect the presence of GENSET mRNA species in a biological sample using either hybridization or RT-PCR techniques well known to those skilled in the art.

In addition, these polynucleotides may be used as regulatory molecules able to affect the processing and maturation of the polynucleotide including them (either a GENSET polynucleotide or a heterologous polynucleotide), preferably the localization, stability and/or translation of said polynucleotide including them (for a review on UTRs see Decker and Parker, 1995, Derrigo et al., 2000). In particular, 3′UTRs may be used in order to control the stability of heterologous mRNAs in recombinant vectors using any methods known to those skilled in the art including Makrides (1999), U.S. Pat. Nos. 5,925,56; 5,807,7 and 5,756,264, the disclosures of which are hereby incorporated by reference in their entireties.

Coding Sequences

Another object of the invention is an isolated, purified or recombinant polynucleotide comprising the full coding sequence of a sequence selected from the group consisting of sequences of SEQ ID Nos: 1-241, clone inserts of the deposited clone pool, and variants thereof.

A further object of the invention is an isolated, purified or recombinant polynucleotide encoding a polypeptide comprising a sequence selected from the group consisting of sequences of SEQ ID Nos: 242-482 and allelic variants thereof. Another object of the invention is an isolated, purified or recombinant polynucleotide encoding a polypeptide comprising a sequence selected from the group consisting of polypeptides encoded by cDNA inserts of the deposited clone pool and allelic variants thereof.

In a preferred embodiment, the invention encompasses an isolated, purified or recombinant polynucleotide encoding a polypeptide comprising a sequence selected from the group consisting of the mature proteins of SEQ ID Nos: 242-272 and 274-384. In another preferred embodiment, the invention encompasses an isolated, purified or recombinant polynucleotide encoding a polypeptide comprising a sequence selected from the group consisting of the signal peptides of SEQ ID Nos: 242-272 and 274-384.

It will be appreciated that should the extent of the full coding sequence differ from that indicated in the appended sequence listing as a result of a sequencing error, reverse transcription or amplification error, mRNA splicing, post-translational modification of the encoded protein, enzymatic cleavage of the encoded protein, or other biological factors, one skilled in the art would be readily able to identify the extent of the full coding sequences in the sequences of SEQ ID Nos: 1-241. Accordingly, the scope of any claims herein relating to nucleic acids containing the full coding sequence of one of SEQ ID Nos: 1-241 is not to be construed as excluding any readily identifiable variations from or equivalents to the full coding sequences described in the appended sequence listing. Similarly, should the extent of the polypeptides differ from those indicated in the appended sequence listing as a result of any of the preceding factors, the scope of claims relating to polypeptides comprising the amino acid sequence of the polypeptides of SEQ ID Nos: 242-482 is not to be construed as excluding any readily identifiable variations from or equivalents to the sequences described in the appended sequence listing.

It will be appreciated that should the extent of the coding sequence of the mature protein differ from that indicated in the appended sequence listing as a result of a sequencing error, reverse transcription or amplification error, mRNA splicing, post-translational modification of the encoded protein, enzymatic cleavage of the encoded protein, or other biological factors, one skilled in the art would be readily able to identify the extent of the coding sequences for the mature protein in the sequences of SEQ ID Nos: 1-31 and 33-143. Accordingly, the scope of any claims herein relating to nucleic acids containing the coding sequence for the mature proteins of one of SEQ ID Nos: 1-31 and 33-143 is not to be construed as excluding any readily identifiable variations from or equivalents to the coding sequences described in the appended sequence listing. Similarly, should the extent of the mature polypeptides differ from those indicated in the appended sequence listing as a result of any of the preceding factors, the scope of claims relating to mature polypeptides comprising the amino acid sequence of the polypeptides of SEQ ID Nos: 242-272 and 274-384 is not to be construed as excluding any readily identifiable variations from or equivalents to the sequences described in the appended sequence listing.

It will be appreciated that should the extent of the coding sequence of the signal peptide differ from that indicated in the appended sequence listing as a result of a sequencing error, reverse transcription or amplification error, mRNA splicing, post-translational modification of the encoded protein, enzymatic cleavage of the encoded protein, or other biological factors, one skilled in the art would be readily able to identify the extent of the coding sequences for the signal peptide in the sequences of SEQ ID Nos: 1-31 and 33-143. Accordingly, the scope of any claims herein relating to nucleic acids containing the signal sequence of one of SEQ ID Nos: 1-31 and 33-143 is not to be construed as excluding any readily identifiable variations from or equivalents to the coding sequences described in the appended sequence listing. Similarly, should the extent of the signal peptides differ from those indicated in the appended sequence listing as a result of any of the preceding factors, the scope of claims relating to signal peptides comprising the amino acid sequence of the polypeptides of SEQ ID Nos: 242-272 and 274-384 is not to be construed as excluding any readily identifiable variations from or equivalents to the sequences described in the appended sequence listing.

The above disclosed polynucleotides that contain the coding sequence (for the full-length protein or for the mature protein) of the GENSET genes may be expressed in a desired host cell or a desired host organism, when the polynucleotide is placed under the control of suitable expression signals. The expression signals may be either the expression signals contained in the regulatory regions in the GENSET genes of the invention or, in contrast, exogenous regulatory nucleic acid sequences. Such a polynucleotide, when placed under the control of suitable expression signals, may also be inserted in a vector for its expression and/or amplification.

Further included in the present invention are polynucleotides encoding the polypeptides of the present invention that are fused in frame to the coding sequences for additional heterologous amino acid sequences. Of special interest are polynucleotides comprising GENSET signal sequences fused to a heterologous polypeptide as described in the section entitled “Secretion vectors”. Also included in the present invention are nucleic acids encoding polypeptides of the present invention together with additional, non-coding sequences, including for example, but not limited to, non-coding 5′ and 3′ sequences, vector sequence, and sequences used for purification, probing, or priming. For example, heterologous sequences include transcribed, untranslated sequences that may play a role in transcription and mRNA processing, for example, ribosome binding and stability of mRNA. The heterologous sequences may alternatively comprise additional coding sequences that provide additional functionalities. Thus, a nucleotide sequence encoding a polypeptide may be fused to a tag sequence, such as a sequence encoding a peptide that facilitates purification of the fused polypeptide. In certain preferred embodiments of this aspect of the invention, the tag amino acid sequence is a hexa-histidine peptide, such as the tag provided in a pQE vector (QIAGEN), among others, many of which are commercially available. For instance, hexa-histidine provides for convenient purification of the fusion protein (See Gentz et al., 1989), the disclosure of which is incorporated by reference in its entirety. The “HA” tag is another peptide useful for purification which corresponds to an epitope derived from the influenza hemagglutinin protein (See Wilson et al., 1984), the disclosure of which is incorporated by reference in its entirety. As discussed below, other such fusion proteins include the GENSET protein fused to Fc at the N- or C-terminus.
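At the sequence level, an in-frame C-terminal fusion of the kind described above may be illustrated as follows. This sketch assumes that the tag is appended directly in place of the native stop codon and uses CAC as the histidine codon. In practice the tag is often supplied by the vector, so the function below is a hypothetical illustration rather than a description of any particular vector.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
HIS6 = "CAC" * 6  # six histidine codons (CAC encodes histidine)


def add_c_terminal_tag(cds, tag=HIS6, stop="TAA"):
    """Replace the stop codon of a coding sequence with tag + stop.

    The reading frame is preserved because both the trimmed coding
    sequence and the tag have lengths that are multiples of three.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0 or cds[-3:] not in STOP_CODONS:
        raise ValueError("expected a full coding sequence ending in a stop codon")
    if len(tag) % 3 != 0:
        raise ValueError("tag length must be a multiple of three to stay in frame")
    return cds[:-3] + tag + stop


# Hypothetical three-codon coding sequence (Met, Ala, stop).
tagged = add_c_terminal_tag("ATGGCTTAA")
print(tagged)
```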

Suitable recombinant vectors that contain a polynucleotide such as described herein are disclosed elsewhere in the specification. Expression vectors encoding GENSET polypeptides or fragments thereof are described in the section entitled “Preparation of the polypeptides”.

Regulatory Sequences of the Invention

As mentioned, the genomic sequence of GENSET genes contains regulatory sequences in the non-coding 5′-flanking region and possibly in the non-coding 3′-flanking region that border the GENSET coding regions containing the exons of these genes.

Polynucleotides derived from GENSET 5′ and 3′ regulatory regions are useful in order to detect the presence of at least a copy of a genomic nucleotide sequence of the GENSET gene or a fragment thereof in a test sample.

Preferred Regulatory Sequences

Polynucleotides carrying the regulatory elements located at the 5′ end and at the 3′ end of GENSET coding regions may be advantageously used to control the transcriptional and translational activity of a heterologous polynucleotide of interest.

Thus, the present invention also concerns a purified or isolated nucleic acid comprising a polynucleotide which is selected from the group consisting of the 5′ and 3′ GENSET regulatory regions, sequences complementary thereto, regulatory active fragments and variants thereof. The invention also pertains to a purified or isolated nucleic acid comprising a polynucleotide having at least 95% nucleotide identity with a polynucleotide selected from the group consisting of GENSET 5′ and 3′ regulatory regions, advantageously 99% nucleotide identity, preferably 99.5% nucleotide identity and most preferably 99.8% nucleotide identity with a polynucleotide selected from the group consisting of GENSET 5′ and 3′ regulatory regions, sequences complementary thereto, variants and regulatory active fragments thereof.

Another object of the invention consists of purified, isolated or recombinant nucleic acids comprising a polynucleotide that hybridizes, under the stringent hybridization conditions defined herein, with a polynucleotide selected from the group consisting of the nucleotide sequences of GENSET 5′- and 3′ regulatory regions, sequences complementary thereto, variants and regulatory active fragments thereof.

Preferred fragments of 5′ regulatory regions have a length of about 1500 or 1000 nucleotides, preferably of about 500 nucleotides, more preferably about 400 nucleotides, even more preferably 300 nucleotides and most preferably about 200 nucleotides.

Preferred fragments of 3′ regulatory regions are at least 20, 50, 100, 150, 200, 300 or 400 bases in length.

“Providing” with respect to, e.g. a biological sample, population of cells, etc. indicates that the sample, population of cells, etc. is somehow used in a method or procedure. Significantly, “providing” a biological sample or population of cells does not require that the sample or cells are specifically isolated or obtained for the purposes of the invention, but can instead refer, for example, to the use of a biological sample obtained by another individual, for another purpose.

“Regulatory active” polynucleotide derivatives of the 5′ regulatory region are polynucleotides comprising or alternatively consisting of a fragment of said polynucleotide which is functional as a regulatory region for expressing a recombinant polypeptide or a recombinant polynucleotide in a recombinant cell host. It could act either as an enhancer or as a repressor. For the purpose of the invention, a nucleic acid or polynucleotide is “functional” as a regulatory region for expressing a recombinant polypeptide or a recombinant polynucleotide if said regulatory polynucleotide contains nucleotide sequences which contain transcriptional and translational regulatory information, and such sequences are “operably linked” to nucleotide sequences which encode the desired polypeptide or the desired polynucleotide.

The regulatory polynucleotides of the invention may be prepared from the nucleotide sequence of GENSET genomic or cDNA sequence, for example, by cleavage using suitable restriction enzymes, or by PCR. The regulatory polynucleotides may also be prepared by digestion of a GENSET gene containing genomic clone by an exonuclease enzyme, such as Bal31 (Wabiko et al., 1986), the disclosure of which is incorporated by reference in its entirety. These regulatory polynucleotides can also be prepared by nucleic acid chemical synthesis, as described elsewhere in the specification.

The regulatory polynucleotides according to the invention may be part of a recombinant expression vector that may be used to express a full coding sequence in a desired host cell or host organism. The recombinant expression vectors according to the invention are described elsewhere in the specification.

Preferred 5′-regulatory polynucleotides of the invention include 5′-UTRs of GENSET cDNAs, or regulatory active fragments or variants thereof. More preferred 5′-regulatory polynucleotides of the invention include sequences selected from the group consisting of 5′-UTRs of sequences of SEQ ID Nos: 1-241, 5′-UTRs of clone inserts of the deposited clone pool, regulatory active fragments and variants thereof.

Preferred 3′-regulatory polynucleotides of the invention include 3′-UTRs of GENSET cDNAs, or regulatory active fragments or variants thereof. More preferred 3′-regulatory polynucleotides of the invention include sequences selected from the group consisting of 3′-UTRs of sequences of SEQ ID Nos: 1-241, 3′-UTRs of clone inserts of the deposited clone pool, regulatory active fragments and variants thereof.

A further object of the invention consists of a purified or isolated nucleic acid comprising:

a) a polynucleotide comprising a 5′ regulatory nucleotide sequence selected from the group consisting of:

(i) a nucleotide sequence comprising a polynucleotide of a GENSET 5′ regulatory region or a complementary sequence thereto;

(ii) a nucleotide sequence comprising a polynucleotide having at least 95% of nucleotide identity with the nucleotide sequence of a GENSET 5′ regulatory region or a complementary sequence thereto;

(iii) a nucleotide sequence comprising a polynucleotide that hybridizes under stringent hybridization conditions with the nucleotide sequence of a GENSET 5′ regulatory region or a complementary sequence thereto; and

(iv) a regulatory active fragment or variant of the polynucleotides in (i), (ii) and (iii);

b) a nucleic acid molecule encoding a desired polypeptide or a nucleic acid molecule of interest, said nucleic acid molecule is operably linked to the polynucleotide defined in (a); and

c) optionally, a polynucleotide comprising a 3′-regulatory polynucleotide, preferably a 3′-regulatory polynucleotide of a GENSET gene.

In a specific embodiment, the nucleic acid defined above includes the 5′-UTR of a GENSET cDNA, or a regulatory active fragment or variant thereof.

In a second specific embodiment, the nucleic acid defined above includes the 3′-UTR of a GENSET cDNA, or a regulatory active fragment or variant thereof.

The regulatory polynucleotide of the 5′ regulatory region, or its regulatory active fragments or variants, is operably linked at the 5′-end of the nucleic acid molecule encoding the desired polypeptide or nucleic acid molecule of interest.

The regulatory polynucleotide of the 3′ regulatory region, or its regulatory active fragments or variants, is advantageously operably linked at the 3′-end of the nucleic acid molecule encoding the desired polypeptide or nucleic acid molecule of interest.

The desired polypeptide encoded by the above-described nucleic acid may be of various natures or origins, encompassing proteins of prokaryotic, viral or eukaryotic origin. The polypeptides that may be expressed under the control of a GENSET regulatory region include bacterial, fungal or viral antigens. Also encompassed are eukaryotic proteins such as intracellular proteins, such as “house keeping” proteins, membrane-bound proteins, such as mitochondrial membrane-bound proteins and cell surface receptors, and secreted proteins such as endogenous mediators such as cytokines. The desired polypeptide may be a heterologous polypeptide or a GENSET protein, especially a protein with an amino acid sequence selected from the group consisting of sequences of SEQ ID Nos: 242-482, fragments and variants thereof.

The desired nucleic acids encoded by the above-described polynucleotide, usually an RNA molecule, may be complementary to a desired coding polynucleotide, for example to a GENSET coding sequence, and thus useful as an antisense polynucleotide. Such a polynucleotide may be included in a recombinant expression vector in order to express the desired polypeptide or the desired nucleic acid in a host cell or in a host organism. Suitable recombinant vectors that contain a polynucleotide such as described herein are disclosed elsewhere in the specification. When a polynucleotide sequence has been recombinantly introduced into a host cell, the cell is said to be “recombinant” for the polynucleotide.

Polynucleotide Variants

The invention also relates to variants of the polynucleotides described herein and fragments thereof. “Variants” of polynucleotides, as the term is used herein, are polynucleotides that differ from a reference polynucleotide. Generally, differences are limited so that the nucleotide sequences of the reference and the variant are closely similar overall and, in many regions, identical. The present invention encompasses both allelic variants and degenerate variants.

Examples of variant sequences of polynucleotides of the invention are given in the appended sequence listing. Table III lists the sequence identification numbers of all similar sequences of the sequence listing, namely variants. All cDNAs referred to by their sequence identification number on a given line of the table are thought to be variants of the same GENSET gene.

Allelic Variant

A variant of a polynucleotide may be a naturally occurring variant such as a naturally occurring allelic variant, or it may be a variant that is not known to occur naturally. By an “allelic variant” is intended one of several alternate forms of a gene occupying a given locus on a chromosome of an organism (see Lewin, 1990), the disclosure of which is incorporated by reference in its entirety. Diploid organisms may be homozygous or heterozygous for an allelic form. Non-naturally occurring variants of the polynucleotide may be made by art-known mutagenesis techniques, including those applied to polynucleotides, cells or organisms.

Degenerate Variant

In addition to the isolated polynucleotides of the present invention, and fragments thereof, the invention further includes polynucleotides which comprise a sequence substantially different from those described above but which, due to the degeneracy of the genetic code, still encode a GENSET polypeptide of the present invention. These polynucleotide variants are referred to as “degenerate variants” throughout the instant application. That is, all possible polynucleotide sequences that encode the GENSET polypeptides of the present invention are contemplated. This includes the genetic code and species-specific codon preferences known in the art. Thus, it would be routine for one skilled in the art to generate the degenerate variants described above, for instance, to optimize codon expression for a particular host (e.g., change codons in the human mRNA to those preferred by other mammalian or bacterial host cells).
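The notion of degenerate variants can be made concrete with a short sketch based on the standard genetic code. It translates a coding sequence and counts how many distinct nucleotide sequences encode a given polypeptide. The function names and the toy sequences are hypothetical, and host-specific codon preferences are not modeled.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
SYNONYM_COUNT = Counter(CODON_TABLE.values())  # codons per amino acid or stop


def translate(cds):
    """Translate a DNA coding sequence; '*' marks a stop codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore any incomplete trailing codon
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def count_degenerate_variants(protein):
    """Number of distinct coding sequences encoding the given residues."""
    total = 1
    for residue in protein:
        if residue not in SYNONYM_COUNT:
            raise ValueError("unknown residue: " + residue)
        total *= SYNONYM_COUNT[residue]
    return total


# Two hypothetical degenerate variants encoding the same Met-Lys peptide.
print(translate("ATGAAATAA"), translate("ATGAAGTGA"))
print(count_degenerate_variants("MK*"))
```

Even for short polypeptides the count grows quickly, which is why such variants are described generically rather than listed individually.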

Nucleotide changes present in a variant polynucleotide may be silent, which means that they do not alter the amino acids encoded by the polynucleotide. However, nucleotide changes may also result in amino acid substitutions, additions, deletions, fusions and truncations in the polypeptide encoded by the reference sequence. The substitutions, deletions or additions may involve one or more nucleotides. The variants may be altered in coding or non-coding regions or both. Alterations in the coding regions may produce conservative or non-conservative amino acid substitutions, deletions or additions. In the context of the present invention, preferred embodiments are those in which the polynucleotide variants encode polypeptides which retain substantially the same biological properties or activities as the GENSET protein. More preferred polynucleotide variants are those containing conservative substitutions.

Similar Polynucleotides

Another embodiment of the present invention is a purified, isolated or recombinant polynucleotide which is at least 90%, 95%, 96%, 97%, 98% or 99% identical to a polynucleotide selected from the group consisting of sequences of SEQ ID Nos: 1-241 and clone inserts of the deposited clone pool. The above polynucleotides are included regardless of whether they encode a polypeptide having a GENSET biological activity. This is because even where a particular nucleic acid molecule does not encode a polypeptide having activity, one of skill in the art would still know how to use the nucleic acid molecule, for instance, as a hybridization probe or primer. Uses of the nucleic acid molecules of the present invention that do not encode a polypeptide having GENSET activity include, inter alia, isolating a GENSET gene or allelic variants thereof from a DNA library, and detecting GENSET mRNA expression in biological samples suspected of containing GENSET mRNA or DNA by Northern Blot or PCR analysis.

The present invention is further directed to polynucleotides having sequences with at least 50%, 60%, 70%, 80%, 90%, 95%, 96%, 97%, 98% or 99% identity to a polynucleotide selected from the group consisting of sequences of SEQ ID Nos: 1-241 and clone inserts of the deposited clone pool, where said polynucleotides do, in fact, encode a polypeptide having a GENSET biological activity. Of course, due to the degeneracy of the genetic code, one of ordinary skill in the art will immediately recognize that a large number of the polynucleotides at least 50%, 60%, 70%, 80%, 90%, 95%, 96%, 97%, 98%, or 99% identical to a polynucleotide selected from the group consisting of sequences of SEQ ID Nos: 1-241 and clone inserts of the deposited clone pool will encode a polypeptide having biological activity. In fact, since degenerate variants of these nucleotide sequences all encode the same polypeptide, this will be clear to the skilled artisan even without performing the above described comparison assay. It will be further recognized in the art that, for such nucleic acid molecules that are not degenerate variants, a reasonable number will also encode a polypeptide having biological activity. This is because the skilled artisan is fully aware of amino acid substitutions that are either less likely or not likely to significantly affect protein function (e.g., replacing one aliphatic amino acid with a second aliphatic amino acid), as further described below. By a polynucleotide having a nucleotide sequence at least, for example, 95% “identical” to a reference nucleotide sequence of the present invention, it is intended that the nucleotide sequence of the polynucleotide is identical to the reference sequence except that the polynucleotide sequence may include up to five point mutations per each 100 nucleotides of the reference nucleotide sequence encoding the GENSET polypeptide. In other words, to obtain a polynucleotide having a nucleotide sequence at least 95% identical to a reference nucleotide sequence, up to 5% of the nucleotides in the reference sequence may be deleted, inserted, or substituted with another nucleotide. The query sequence may be an entire sequence selected from the group consisting of sequences of SEQ ID Nos: 1-241 and sequences of clone inserts of the deposited clone pool, or the ORF (open reading frame) of a polynucleotide sequence selected from said group, or any fragment specified as described herein.
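The arithmetic behind this definition of percent identity may be sketched as follows. The sketch assumes whole-number identity thresholds and, for the comparison function, two sequences of equal length compared position by position. Insertions and deletions would require an alignment step that is not shown. The function names are hypothetical.

```python
def max_point_mutations(reference_length, identity_percent):
    """Largest number of point mutations compatible with the threshold.

    For example, 95% identity allows up to five changes per 100
    nucleotides of the reference sequence.
    """
    return reference_length * (100 - identity_percent) // 100


def percent_identity(query, reference):
    """Percent of identical positions between two equal-length sequences."""
    if not reference or len(query) != len(reference):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(q == r for q, r in zip(query, reference))
    return 100.0 * matches / len(reference)


print(max_point_mutations(100, 95))
print(percent_identity("ACGTACGTAC", "ACGTACGTAT"))
```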

Hybridizing Polynucleotides

In another aspect, the invention provides an isolated or purified nucleic acid molecule comprising a polynucleotide which hybridizes under stringent hybridization conditions to any polynucleotide of the present invention using any methods known to those skilled in the art including those disclosed herein and in particular in the “To find similar sequences” section. Also contemplated are nucleic acid molecules that hybridize to the polynucleotides of the present invention at lower stringency hybridization conditions, preferably at moderate or low stringency conditions as defined herein. Such hybridizing polynucleotides may be of at least 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, 75, 100, 200, 300, 500, 1000 or 2000 nucleotides in length.

Of particular interest are the polynucleotides hybridizing to any polynucleotide of the invention and encoding GENSET polypeptides, particularly GENSET polypeptides exhibiting a GENSET biological activity.

Of course, a polynucleotide which hybridizes only to polyA+ sequences (such as any 3′ terminal polyA+ tract of a cDNA shown in the sequence listing), or to a 5′ complementary stretch of T (or U) residues, would not be included in the definition of “polynucleotide,” since such a polynucleotide would hybridize to any nucleic acid molecule containing a poly (A) stretch or the complement thereof (e.g., practically any double-stranded cDNA clone generated using oligo dT as a primer).

Complementary Polynucleotides

The invention further provides isolated nucleic acid molecules having a nucleotide sequence fully complementary to any polynucleotide of the invention. The present invention encompasses a purified, isolated or recombinant polynucleotide having a nucleotide sequence complementary to a sequence selected from the group consisting of sequences of SEQ ID Nos: 1-241, sequences of clone inserts of the deposited clone pool and fragments thereof. Such isolated molecules, particularly DNA molecules, are useful as probes for gene mapping and for identifying GENSET mRNA in a biological sample, for instance, by PCR or Northern blot analysis.

Polynucleotide Fragments

The present invention is further directed to polynucleotides encoding portions or fragments of the nucleotide sequences described herein. Uses for the polynucleotide fragments of the present invention include probes, primers, molecular weight markers and for expressing the polypeptide fragments of the present invention. Fragments include portions of polynucleotides selected from the group consisting of a) the sequences of SEQ ID Nos: 1-241, b) the genomic GENSET sequences, c) the polynucleotides encoding a polypeptide selected from the group consisting of the sequences of SEQ ID Nos: 242-482, d) the sequences of clone inserts of the deposited clone pool, and e) the polynucleotides encoding the polypeptides encoded by the clone inserts of the deposited clone pool. Particularly included in the present invention is a purified or isolated polynucleotide comprising at least 8 consecutive bases of a polynucleotide of the present invention. In one aspect of this embodiment, the polynucleotide comprises at least 10, 12, 15, 18, 20, 25, 28, 30, 35, 40, 50, 75, 100, 150, 200, 300, 400, 500, 800, 1000, 1500, or 2000 consecutive nucleotides of a polynucleotide of the present invention.

In addition to the above preferred polynucleotide sizes, further preferred sub-genuses of polynucleotides comprise at least 8 nucleotides, wherein “at least 8” is defined as any integer between 8 and the integer representing the 3′ most nucleotide position as set forth in the sequence listing or elsewhere herein. Further included as preferred polynucleotides of the present invention are polynucleotide fragments at least 8 nucleotides in length, as described above, that are further specified in terms of their 5′ and 3′ position. The 5′ and 3′ positions are represented by the position numbers set forth in the appended sequence listing. For allelic, degenerate and other variants, position 1 is defined as the 5′ most nucleotide of the ORF, i.e., the nucleotide “A” of the start codon with the remaining nucleotides numbered consecutively. Therefore, every combination of a 5′ and 3′ nucleotide position that a polynucleotide fragment of the present invention, at least 8 contiguous nucleotides in length, could occupy on a polynucleotide of the invention is included in the invention as an individual species. The polynucleotide fragments specified by 5′ and 3′ positions can be immediately envisaged and are therefore not individually listed solely for the purpose of not unnecessarily lengthening the specifications.

It is noted that the above species of polynucleotide fragments of the present invention may alternatively be described by the formula “a to b”; where “a” equals the 5′ most nucleotide position and “b” equals the 3′ most nucleotide position of the polynucleotide; and further where “a” equals an integer between 1 and the number of nucleotides of the polynucleotide sequence of the present invention minus 8, and where “b” equals an integer between 9 and the number of nucleotides of the polynucleotide sequence of the present invention; and where “a” is an integer smaller than “b” by at least 8.
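The “a to b” formula can be illustrated with a minimal sketch, which is not part of the sequence listing. It reads the stated constraints literally (1 ≤ a ≤ N − 8, 9 ≤ b ≤ N, and b − a ≥ 8, with 1-based inclusive positions). The function names `count_fragments` and `fragments` are hypothetical and chosen for illustration only.

```python
def count_fragments(n: int) -> int:
    """Number of (a, b) pairs with 1 <= a <= n-8, 9 <= b <= n and b - a >= 8."""
    if n < 9:
        return 0
    # For each a, b ranges over a+8 .. n, which gives n - a - 7 choices.
    return (n - 8) * (n - 7) // 2


def fragments(sequence: str):
    """Yield (a, b, subsequence) for every allowed pair, 1-based and inclusive."""
    n = len(sequence)
    for a in range(1, n - 8 + 1):
        for b in range(a + 8, n + 1):
            # Convert 1-based inclusive positions to a Python slice.
            yield a, b, sequence[a - 1:b]


if __name__ == "__main__":
    for a, b, sub in fragments("ACGTACGTAC"):
        print(a, b, sub)
```

Under this literal reading, a sequence of N nucleotides yields (N − 8)(N − 7)/2 distinct species, which is why the fragments are described by formula rather than listed individually.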

The present invention also provides for the exclusion of any species of polynucleotide fragments of the present invention specified by 5′ and 3′ positions or sub-genuses of polynucleotides specified by size in nucleotides as described above. Any number of fragments specified by 5′ and 3′ positions or by size in nucleotides, as described above, may be excluded. Specifically excluded from the invention are the fragments described in Table IV. For these cDNAs referred to by their sequence identification numbers, Table IV gives the positions of excluded fragments within these sequences, namely fragments having substantial homology to polyadenylation tails and to repeated sequences including Alu, L1, THE and MER repeats, SSTR sequences or satellite, micro-satellite, and telomeric repeats. Each fragment is represented by a-b where a and b are the start and end positions respectively of a given excluded fragment. Excluded fragments are separated from each other by a comma. As used herein the term “polynucleotide described in Table IV” refers to all polynucleotide fragments defined in Table IV in this manner.
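The “a-b” notation with comma separators lends itself to simple parsing. The following sketch assumes a table cell is available as a plain string such as "1-25, 340-412" (a made-up value, not taken from Table IV); `parse_fragments` and `overlaps_excluded` are hypothetical helper names.

```python
def parse_fragments(cell: str) -> list:
    """Parse 'a-b, c-d' into [(a, b), (c, d)] using 1-based inclusive positions."""
    result = []
    for item in cell.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma or an empty cell
        start, end = item.split("-")
        a, b = int(start), int(end)
        if a > b:
            raise ValueError(f"start position exceeds end position in {item!r}")
        result.append((a, b))
    return result


def overlaps_excluded(a: int, b: int, excluded: list) -> bool:
    """True if the span a..b shares at least one position with an excluded span."""
    return any(a <= e_end and e_start <= b for e_start, e_end in excluded)


if __name__ == "__main__":
    excluded = parse_fragments("1-25, 340-412")  # illustrative values only
    print(excluded, overlaps_excluded(20, 30, excluded))
```

A candidate fragment specified by its 5′ and 3′ positions can then be screened against the excluded spans before use.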

Preferred included and excluded polynucleotide fragments of the invention are also described in Tables Va and Table Vb. For these cDNAs referred to by their sequence identification numbers, Tables Va and Table Vb give the positions of preferred fragments within these sequences (columns entitled “Preferentially included fragments”) as well as the positions of preferentially excluded fragments (columns entitled “Preferentially excluded fragments”). Each fragment is represented by a-b where a and b are the start and end positions respectively of a given preferred fragment. Fragments are separated from each other by a comma. As used herein the term “excluded polynucleotide described in Tables Va and Vb” refers to all polynucleotide fragments preferentially excluded as described in Tables Va and Vb. As used herein the term “preferred polynucleotide described in Tables Va and Vb” refers to all preferentially included polynucleotide fragments listed in Tables Va and Table Vb in this manner.

Therefore, the present invention encompasses isolated, purified, or recombinant polynucleotides which consist of, consist essentially of, or comprise a contiguous span of at least 8, 10, 12, 15, 18, 20, 25, 28, 30, 35, 40, 50, 75, 100, 150, 200, 300, 400, 500, 1000 or 2000 nucleotides of a sequence selected from the group consisting of the sequences of SEQ ID Nos: 1-241 and sequences fully complementary thereto, to the extent that a contiguous span of these lengths is consistent with the lengths of said selected sequence, wherein said contiguous span comprises at least 1, 2, 3, 5, 10, 15, 18, 20, 25, 28, 30, 35, 40, 50, 75, 100, 150, 200, 300, 400 or 500 nucleotides of a preferred polynucleotide described in Tables Va and Vb, or a sequence complementary thereto. The present invention also encompasses isolated, purified, or recombinant polynucleotides comprising, consisting essentially of, or consisting of a contiguous span of at least 8, 10, 12, 15, 18, 20, 25, 28, 30, 35, 40, 50, 75, 100, 150, 200, 300, 400, 500, 1000 or 2000 nucleotides of a polynucleotide selected from the group consisting of the sequences of SEQ ID Nos: 1-241 and sequences fully complementary thereto, wherein said contiguous span comprises a preferred polynucleotide described in Tables Va and Vb, or a sequence complementary thereto, to the extent that a contiguous span of these lengths is consistent with the length of the selected sequence. The present invention also encompasses isolated, purified, or recombinant nucleic acids which comprise, consist of or consist essentially of a contiguous span of a polynucleotide selected from the group consisting of the sequences of SEQ ID Nos: 1-241 and sequences fully complementary thereto, wherein said contiguous span comprises preferred polynucleotide described in Tables Va and Vb, or a sequence complementary thereto.

Other preferred fragments of the invention are polynucleotides comprising polynucleotides encoding domains of polypeptides. Such fragments may be used to obtain other polynucleotides encoding polypeptides having similar domains using hybridization or RT-PCR techniques. Alternatively, these fragments may be used to express a polypeptide domain which may present a specific biological property. Preferred domains for the GENSET polypeptides of the invention are described in Table VI. Thus, another object of the invention is an isolated, purified or recombinant polynucleotide encoding a polypeptide consisting of, consisting essentially of, or comprising a contiguous span of at least 5, 6, 8, 10, 12, 15, 20, 25, 30, 35, 40, 50, 60, 75, 100, 150, 200, 250, 300, 350, 400, 450 or 500 consecutive amino acids of a sequence selected from the group consisting of the sequences of SEQ ID Nos: 242-482, to the extent that a contiguous span of these lengths is consistent with the lengths of said selected sequence, where said contiguous span comprises at least 1, 2, 3, 5, or 10 of the amino acid positions of a domain of said selected sequence. The present invention also encompasses isolated, purified or recombinant polynucleotides encoding a polypeptide comprising a contiguous span of at least 5, 6, 8, 10, 12, 15, 20, 25, 30, 35, 40, 50, 60, 75, 100, 150, 200, 250, 300, 350, 400, 450 or 500 consecutive amino acids of a sequence selected from the group consisting of sequences of SEQ ID Nos: 242-482, to the extent that a contiguous span of these lengths is consistent with the lengths of said selected sequence, where said contiguous span is a domain of said selected sequence. The present invention also encompasses isolated, purified or recombinant polynucleotides encoding a polypeptide comprising a domain of a sequence selected from the group consisting of the sequences of SEQ ID Nos: 242-482.

The present invention further encompasses any combination of the polynucleotide fragments listed in this section.

Oligonucleotide Primers and Probes

The present invention also encompasses fragments of GENSET polynucleotides for use as primers and probes. Polynucleotides derived from the GENSET genomic and cDNA sequences are useful in order to detect the presence of at least a copy of a GENSET polynucleotide or fragment, complement, or variant thereof in a test sample.

Structural Definition

Any polynucleotide of the invention may be used as a primer or probe. Particularly preferred probes and primers of the invention include isolated, purified, or recombinant polynucleotides comprising a contiguous span of at least 12, 15, 18, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 150, 200, 500, 1000, 1500 or 2000 nucleotides of a sequence selected from the group consisting of the GENSET genomic sequences, the cDNA sequences and the sequences fully complementary thereto. Another object of the invention is a purified, isolated, or recombinant polynucleotide comprising the nucleotide sequence of a sequence selected from the group consisting of the sequences of SEQ ID Nos: 1-241, sequences of clone inserts of the deposited clone pool, sequences fully complementary thereto, allelic variants thereof, and fragments thereof. Moreover, preferred probes and primers of the invention include purified, isolated, or recombinant GENSET cDNAs consisting of, consisting essentially of, or comprising the sequences of SEQ ID Nos: 1-241 and sequences of clone inserts of the deposited clone pool. Particularly preferred probes and primers of the invention include isolated, purified, or recombinant polynucleotides comprising a contiguous span of at least 12, 15, 18, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 150, 200, 500, 1000, 1500 or 2000 nucleotides of a sequence selected from the group consisting of the sequences of SEQ ID Nos: 1-241 and the sequences fully complementary thereto.

Design of Primers and Probes

A probe or a primer according to the invention is between 8 and 1000 nucleotides in length, or is specified to be at least 12, 15, 18, 20, 25, 35, 40, 50, 60, 70, 80, 100, 250, 500, 1000, 1500 or 2000 nucleotides in length. More particularly, the length of these probes and primers can range from 8, 10, 15, 20, or 30 to 100 nucleotides, preferably from 10 to 50, more preferably from 15 to 30 nucleotides. Shorter probes and primers tend to lack specificity for a target nucleic acid sequence and generally require cooler temperatures to form sufficiently stable hybrid complexes with the template. Longer probes and primers are expensive to produce and can sometimes self-hybridize to form hairpin structures. The appropriate length for primers and probes under a particular set of assay conditions may be empirically determined by one of skill in the art. The formation of stable hybrids depends on the melting temperature (Tm) of the DNA. The Tm depends on the length of the primer or probe, the ionic strength of the solution and the G+C content. The higher the G+C content of the primer or probe, the higher the melting temperature, because G:C pairs are held by three H bonds whereas A:T pairs have only two. The GC content in the probes of the invention usually ranges between 10 and 75%, preferably between 35 and 60%, and more preferably between 40 and 55%.
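The dependence of Tm on length and G+C content can be illustrated with a rough calculation. The sketch below uses the commonly cited Wallace rule of thumb for short oligonucleotides (about 2 °C per A or T and 4 °C per G or C); this rule is an assumption introduced here for illustration, is not described in this specification, and ignores ionic strength. The function names are hypothetical.

```python
def gc_content(oligo: str) -> float:
    """Percentage of G and C residues in the oligonucleotide."""
    seq = oligo.upper()
    if not seq:
        raise ValueError("empty oligonucleotide")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(oligo: str) -> int:
    """Approximate Tm in degrees C by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    seq = oligo.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def in_preferred_range(oligo: str, low: float = 40.0, high: float = 55.0) -> bool:
    """Check the 'more preferably between 40 and 55%' GC window from the text."""
    return low <= gc_content(oligo) <= high


if __name__ == "__main__":
    primer = "ATGCATGCAT"  # made-up sequence for illustration
    print(gc_content(primer), wallace_tm(primer), in_preferred_range(primer))
```

Such an estimate is only a starting point; as noted above, the appropriate length and conditions may be determined empirically.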

For amplification purposes, pairs of primers with approximately the same Tm are preferable. Primers may be designed using the OSP software (Hillier and Green, 1991), the disclosure of which is incorporated by reference in its entirety, based on GC content and melting temperatures of oligonucleotides, or using PC-Rare (http://bioinformatics.weizmann.ac.il/software/PC-Rare/doc/manuel.html) based on the octamer frequency disparity method (Griffais et al., 1991), the disclosure of which is incorporated by reference in its entirety. DNA amplification techniques are well known to those skilled in the art. Amplification techniques that can be used in the context of the present invention include, but are not limited to, the ligase chain reaction (LCR) described in EP-A-320 308, WO 9320227 and EP-A-439 182, the polymerase chain reaction (PCR, RT-PCR) and techniques such as the nucleic acid sequence based amplification (NASBA) described in Guatelli et al. (1990) and in Compton (1991), Q-beta amplification as described in European Patent Application No 4544610, strand displacement amplification as described in Walker et al. (1996) and EP A 684 315, and target mediated amplification as described in PCT Publication WO 9322461, the disclosures of which are incorporated by reference in their entireties.

LCR and Gap LCR are exponential amplification techniques; both depend on DNA ligase to join adjacent primers annealed to a DNA molecule. In Ligase Chain Reaction (LCR), probe pairs are used which include two primary (first and second) and two secondary (third and fourth) probes, all of which are employed in molar excess to target. The first probe hybridizes to a first segment of the target strand and the second probe hybridizes to a second segment of the target strand, the first and second segments being contiguous so that the primary probes abut one another in 5′ phosphate-3′ hydroxyl relationship, and so that a ligase can covalently fuse or ligate the two probes into a fused product. In addition, a third (secondary) probe can hybridize to a portion of the first probe and a fourth (secondary) probe can hybridize to a portion of the second probe in a similar abutting fashion. Of course, if the target is initially double stranded, the secondary probes also will hybridize to the target complement in the first instance. Once the ligated strand of primary probes is separated from the target strand, it will hybridize with the third and fourth probes, which can be ligated to form a complementary, secondary ligated product. It is important to realize that the ligated products are functionally equivalent to either the target or its complement. By repeated cycles of hybridization and ligation, amplification of the target sequence is achieved. A method for multiplex LCR has also been described (WO 9320227), the disclosure of which is incorporated by reference in its entirety. Gap LCR (GLCR) is a version of LCR where the probes are not adjacent but are separated by 2 to 3 bases.
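The geometric requirement for the primary probes (abutting for LCR, separated by 2 to 3 bases for Gap LCR) can be checked with a minimal sketch. It assumes exact complementarity, a single-stranded target written 5′ to 3′, and a unique binding site for each probe; `binding_site` and `probe_gap` are hypothetical names, and no ligation chemistry is modeled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def binding_site(target: str, probe: str):
    """Return (start, end) of the target segment the probe hybridizes to, or None.

    Coordinates are 0-based and half-open; only the first exact match is used.
    """
    site = reverse_complement(probe)
    start = target.upper().find(site)
    if start < 0:
        return None
    return start, start + len(site)


def probe_gap(target: str, probe_1: str, probe_2: str):
    """Number of target bases left between the two hybridized probes.

    Returns 0 for abutting probes (LCR), 2 or 3 for a Gap LCR design,
    and None if either probe does not bind or the sites overlap.
    """
    s1, s2 = binding_site(target, probe_1), binding_site(target, probe_2)
    if s1 is None or s2 is None:
        return None
    upstream, downstream = sorted([s1, s2])
    gap = downstream[0] - upstream[1]
    return gap if gap >= 0 else None
```

A design returning 0 corresponds to the abutting 5′ phosphate-3′ hydroxyl arrangement described above, while a return value of 2 or 3 corresponds to the Gap LCR arrangement.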

For amplification of mRNAs, it is within the scope of the present invention to reverse transcribe mRNA into cDNA followed by polymerase chain reaction (RT-PCR); or, to use a single enzyme for both steps as described in U.S. Pat. No. 5,322,770 or, to use Asymmetric Gap LCR (RT-AGLCR) as described by Marshall et al. (1994), the disclosures of which are incorporated by reference in their entireties. AGLCR is a modification of GLCR that allows the amplification of RNA.

PCR technology is the preferred amplification technique used in the present invention. A variety of PCR techniques are familiar to those skilled in the art. For a review of PCR technology, see White (1997), Erlich (1992) and the publication entitled “PCR Methods and Applications” (1991, Cold Spring Harbor Laboratory Press), the disclosures of which are incorporated by reference in their entireties. In each of these PCR procedures, PCR primers on either side of the nucleic acid sequences to be amplified are added to a suitably prepared nucleic acid sample along with dNTPs and a thermostable polymerase such as Taq polymerase, Pfu polymerase, Tth polymerase or Vent polymerase. The nucleic acid in the sample is denatured and the PCR primers are specifically hybridized to complementary nucleic acid sequences in the sample. The hybridized primers are extended. Thereafter, another cycle of denaturation, hybridization, and extension is initiated. The cycles are repeated multiple times to produce an amplified fragment containing the nucleic acid sequence between the primer sites. PCR has further been described in several patents including U.S. Pat. Nos. 4,683,195; 4,683,202; and 4,965,188, the disclosures of which are incorporated herein by reference in their entireties.
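The outcome of the cycle described above, namely an amplified fragment spanning the two primer sites, can be sketched in silico. The example assumes exact primer matches on a single template strand and treats cycle efficiency as an idealized parameter; `amplicon` and `theoretical_copies` are hypothetical names, and the sequences in the usage lines are made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def amplicon(template: str, forward_primer: str, reverse_primer: str):
    """Return the top-strand product bounded by the two primers, or None.

    The forward primer must match the top strand; the reverse primer must
    match the bottom strand downstream of the forward primer site.
    """
    top = template.upper()
    fwd = forward_primer.upper()
    rev_site = reverse_complement(reverse_primer)  # as seen on the top strand
    start = top.find(fwd)
    if start < 0:
        return None
    end_site = top.find(rev_site, start + len(fwd))
    if end_site < 0:
        return None
    return top[start:end_site + len(rev_site)]


def theoretical_copies(initial: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized copy number after the given cycles: initial * (1 + efficiency)^cycles."""
    return initial * (1 + efficiency) ** cycles


if __name__ == "__main__":
    template = "GGATGGCTAGCTTACGGATCCATT"
    print(amplicon(template, "ATGGCT", reverse_complement("GATCCA")))
    print(theoretical_copies(1, 30))
```

Real primers tolerate some mismatches and real reactions plateau, so the sketch only conveys the geometry of the primer sites and the exponential character of the amplification.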

Preparation of Primers and Probes

The primers and probes can be prepared by any suitable method, including, for example, cloning and restriction of appropriate sequences and direct chemical synthesis by a method such as the phosphotriester method of Narang et al. (1979), the phosphodiester method of Brown et al. (1979), the diethylphosphoramidite method of Beaucage et al. (1981) and the solid support method described in EP 0 707 592, which disclosures are hereby incorporated by reference in their entireties.

Detection probes are generally nucleic acid sequences or uncharged nucleic acid analogs such as, for example, peptide nucleic acids, which are disclosed in International Patent Application WO 92/20702, and morpholino analogs, which are described in U.S. Pat. Nos. 5,185,444; 5,034,506 and 5,142,047, which disclosures are hereby incorporated by reference in their entireties. The probe may have to be rendered “non-extendable” in that additional dNTPs cannot be added to the probe. In and of themselves analogs usually are non-extendable, and nucleic acid probes can be rendered non-extendable by modifying the 3′ end of the probe such that the hydroxyl group is no longer capable of participating in elongation. For example, the 3′ end of the probe can be functionalized with the capture or detection label to thereby consume or otherwise block the hydroxyl group. Alternatively, the 3′ hydroxyl group simply can be cleaved, replaced or modified. U.S. patent application Ser. No. 07/049,061 filed Apr. 19, 1993, which disclosure is hereby incorporated by reference in its entirety, describes modifications which can be used to render a probe non-extendable.

Labeling of Probes

Any of the polynucleotides of the present invention can be labeled, if desired, by incorporating any label known in the art to be detectable by spectroscopic, photochemical, biochemical, immunochemical, or chemical means. For example, useful labels include radioactive substances (including 32P, 35S, 3H, 125I), fluorescent dyes (including 5-bromodeoxyuridine, fluorescein, acetylaminofluorene, digoxigenin) or biotin. Preferably, polynucleotides are labeled at their 3′ and 5′ ends. Examples of non-radioactive labeling of nucleic acid fragments are described in the French patent No. FR-7810975 or by Urdea et al. (1988) or Sanchez-Pescador et al. (1988), which disclosures are hereby incorporated by reference in their entireties. In addition, the probes according to the present invention may have structural characteristics such that they allow signal amplification, such structural characteristics being, for example, branched DNA probes such as those described by Urdea et al. in 1991 or in the European patent No. EP 0 225 807 (Chiron), which disclosures are hereby incorporated by reference in their entireties.

The detectable probe may be single stranded or double stranded and may be made using techniques known in the art, including in vitro transcription, nick translation, or kinase reactions. A nucleic acid sample containing a sequence capable of hybridizing to the labeled probe is contacted with the labeled probe. If the nucleic acid in the sample is double stranded, it may be denatured prior to contacting the probe. In some applications, the nucleic acid sample may be immobilized on a surface such as a nitrocellulose or nylon membrane. The nucleic acid sample may comprise nucleic acids obtained from a variety of sources, including genomic DNA, cDNA libraries, RNA, or tissue samples.

Procedures used to detect the presence of nucleic acids capable of hybridizing to the detectable probe include well known techniques such as Southern blotting, Northern blotting, dot blotting, colony hybridization, and plaque hybridization. In some applications, the nucleic acid capable of hybridizing to the labeled probe may be cloned into vectors such as expression vectors, sequencing vectors, or in vitro transcription vectors to facilitate the characterization and expression of the hybridizing nucleic acids in the sample. For example, such techniques may be used to isolate and clone sequences in a genomic library or cDNA library which are capable of hybridizing to the detectable probe as described herein.

Immobilization of Probes

A label can also be used to capture the primer, so as to facilitate the immobilization of either the primer or a primer extension product, such as amplified DNA, on a solid support. A capture label is attached to the primers or probes and can be a specific binding member which forms a binding pair with the solid phase reagent's specific binding member (e.g. biotin and streptavidin). Therefore, depending upon the type of label carried by a polynucleotide or a probe, it may be employed to capture or to detect the target DNA. Further, it will be understood that the polynucleotides, primers or probes provided herein may, themselves, serve as the capture label. For example, in the case where a solid phase reagent's binding member is a nucleic acid sequence, it may be selected such that it binds a complementary portion of a primer or probe to thereby immobilize the primer or probe to the solid phase. In cases where a polynucleotide probe itself serves as the binding member, those skilled in the art will recognize that the probe will contain a sequence or “tail” that is not complementary to the target. In the case where a polynucleotide primer itself serves as the capture label, at least a portion of the primer will be free to hybridize with a nucleic acid on a solid phase. DNA labeling techniques are well known to the skilled technician.

The probes of the present invention are useful for a number of purposes. They can be notably used in Southern hybridization to genomic DNA. The probes can also be used to detect PCR amplification products. They may also be used to detect mismatches in the GENSET gene or mRNA using other techniques.

Any of the polynucleotides, primers and probes of the present invention can be conveniently immobilized on a solid support. The solid support is not critical and can be selected by one skilled in the art. Thus, latex particles, microparticles, magnetic beads, non-magnetic beads (including polystyrene beads), membranes (including nitrocellulose strips), plastic tubes, walls of microtiter wells, glass or silicon chips, sheep (or other suitable animal's) red blood cells and duracytes are all suitable examples. Suitable methods for immobilizing nucleic acids on solid phases include ionic, hydrophobic, covalent interactions and the like. A solid support, as used herein, refers to any material which is insoluble, or can be made insoluble by a subsequent reaction. The solid support can be chosen for its intrinsic ability to attract and immobilize the capture reagent. Alternatively, the solid phase can retain an additional receptor which has the ability to attract and immobilize the capture reagent. The additional receptor can include a charged substance that is oppositely charged with respect to the capture reagent itself or to a charged substance conjugated to the capture reagent. As yet another alternative, the receptor molecule can be any specific binding member which is immobilized upon (attached to) the solid support and which has the ability to immobilize the capture reagent through a specific binding reaction. The receptor molecule enables the indirect binding of the capture reagent to a solid support material before the performance of the assay or during the performance of the assay. The solid phase thus can be a plastic, derivatized plastic, magnetic or non-magnetic metal, glass or silicon surface of a test tube, microtiter well, sheet, bead, microparticle, chip, sheep (or other suitable animal's) red blood cells, duracytes® and other configurations known to those of ordinary skill in the art. 
The polynucleotides of the invention can be attached to or immobilized on a solid support individually or in groups of at least 2, 5, 8, 10, 12, 15, 20, or 25 distinct polynucleotides of the invention to a single solid support. In addition, polynucleotides other than those of the invention may be attached to the same solid support as one or more polynucleotides of the invention.

Oligonucleotide Array

A substrate comprising a plurality of oligonucleotide primers or probes of the invention may be used either for detecting or amplifying targeted sequences in GENSET genes, may also be used for detecting mutations in the coding or in the non-coding sequences of GENSET genes, and may also be used to determine GENSET gene expression in different contexts such as in different tissues, at different stages of a process (embryo development, disease treatment), and in patients versus healthy individuals as described elsewhere in the application.

As used herein, the term “array” means a one dimensional, two dimensional, or multidimensional arrangement of nucleic acids of sufficient length to permit specific detection of gene expression. For example, the array may contain a plurality of nucleic acids derived from genes whose expression levels are to be assessed. The array may include a GENSET genomic DNA, a GENSET cDNA, sequences complementary thereto or fragments thereof. Preferably, the fragments are at least 12, 15, 18, 20, 25, 30, 35, 40 or 50 nucleotides in length. More preferably, the fragments are at least 100 nucleotides in length. Even more preferably, the fragments are more than 100 nucleotides in length. In some embodiments the fragments may be more than 500 nucleotides in length.

Any polynucleotide provided herein may be attached in overlapping areas or at random locations on the solid support. Alternatively the polynucleotides of the invention may be attached in an ordered array wherein each polynucleotide is attached to a distinct region of the solid support which does not overlap with the attachment site of any other polynucleotide. Preferably, such an ordered array of polynucleotides is designed to be “addressable” where the distinct locations are recorded and can be accessed as part of an assay procedure. Addressable polynucleotide arrays typically comprise a plurality of different oligonucleotide probes that are coupled to a surface of a substrate in different known locations. Knowledge of the precise location of each polynucleotide makes these “addressable” arrays particularly useful in hybridization assays. Any addressable array technology known in the art can be employed with the polynucleotides of the invention. One particular embodiment of these polynucleotide arrays is known as the Genechips™, and has been generally described in U.S. Pat. No. 5,143,854; PCT publications WO 90/15070 and 92/10092, which disclosures are hereby incorporated by reference in their entireties. These arrays may generally be produced using mechanical synthesis methods or light directed synthesis methods which incorporate a combination of photolithographic methods and solid phase oligonucleotide synthesis (Fodor et al., 1991), which disclosure is hereby incorporated by reference in its entirety. The immobilization of arrays of oligonucleotides on solid supports has been rendered possible by the development of a technology generally identified as “Very Large Scale Immobilized Polymer Synthesis” (VLSIPS™) in which, typically, probes are immobilized in a high density array on a solid surface of a chip. Examples of VLSIPS™ technologies are provided in U.S. Pat. Nos. 5,143,854; and 5,412,087 and in PCT Publications WO 90/15070, WO 92/10092 and WO 95/11995, which disclosures are hereby incorporated by reference in their entireties, which describe methods for forming oligonucleotide arrays through techniques such as light-directed synthesis techniques. In designing strategies aimed at providing arrays of nucleotides immobilized on solid supports, further presentation strategies were developed to order and display the oligonucleotide arrays on the chips in an attempt to maximize hybridization patterns and sequence information. Examples of such presentation strategies are disclosed in PCT Publications WO 94/12305, WO 94/11530, WO 97/29212 and WO 97/31256, the disclosures of which are incorporated herein by reference in their entireties.
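The notion of an “addressable” ordered array, in which each polynucleotide occupies a distinct, recorded location, can be modeled as a simple lookup structure. This is a conceptual sketch only and does not represent any of the cited array technologies; the class name `AddressableArray` and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AddressableArray:
    """Grid of non-overlapping spots, each holding at most one probe."""
    rows: int
    cols: int
    _spots: dict = field(default_factory=dict)

    def attach(self, row: int, col: int, probe_id: str, sequence: str) -> None:
        """Record a probe at a distinct location; refuse overlapping attachment."""
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            raise IndexError("location is outside the support")
        if (row, col) in self._spots:
            raise ValueError("location already holds a polynucleotide")
        self._spots[(row, col)] = (probe_id, sequence)

    def probe_at(self, row: int, col: int):
        """Return (probe_id, sequence) recorded at a location, or None."""
        return self._spots.get((row, col))

    def locate(self, probe_id: str) -> list:
        """Return every recorded location of the given probe."""
        return [addr for addr, (pid, _) in self._spots.items() if pid == probe_id]
```

Because each location is recorded, a hybridization signal observed at a given address can be traced back to the probe attached there.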

Consequently, the invention concerns an array of nucleic acid molecules comprising at least one polynucleotide of the invention, particularly a probe or primer as described herein. Preferably, the invention concerns an array of nucleic acid comprising at least two polynucleotides of the invention, particularly probes or primers as described herein. Preferably, the invention concerns an array of nucleic acid comprising at least five polynucleotides of the invention, particularly probes or primers as described herein.

A preferred embodiment of the present invention is an array of polynucleotides of at least 12, 15, 18, 20, 25, 30, 35, 40, 50, 100, 500, 1000, 1500 or 2000 nucleotides in length which includes at least 1, 2, 5, 10, 15, 20, 35, 50, 100, 150 or 200 sequences selected from the group consisting of the sequences of SEQ ID Nos: 1-241 and sequences of clone inserts of the deposited clone pool, sequences fully complementary thereto, and fragments thereof.

Methods of Making the Polynucleotides of the Invention

The present invention also comprises methods of making the polynucleotides of the invention, including the polynucleotides of SEQ ID Nos: 1-241, genomic DNA obtainable therefrom, or fragment thereof. These methods comprise sequentially linking together nucleotides to produce the nucleic acids having the preceding sequences. Polynucleotides of the invention may be synthesized either enzymatically using techniques well known to those skilled in the art including amplification or hybridization-based methods as described herein, or chemically.

A variety of chemical methods of synthesizing nucleic acids are known to those skilled in the art. In many of these methods, synthesis is conducted on a solid support. These include the 3′ phosphoramidite methods in which the 3′ terminal base of the desired oligonucleotide is immobilized on an insoluble carrier. The nucleotide base to be added is blocked at the 5′ hydroxyl and activated at the 3′ hydroxyl so as to cause coupling with the immobilized nucleotide base. Deblocking of the new immobilized nucleotide compound and repetition of the cycle will produce the desired polynucleotide. Alternatively, polynucleotides may be prepared as described in U.S. Pat. No. 5,049,656, which disclosure is hereby incorporated by reference in its entirety. In some embodiments, several polynucleotides prepared as described above are ligated together to generate longer polynucleotides having a desired sequence.
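Because the 3′ terminal base is immobilized first and each incoming nucleotide couples through its activated 3′ position, the chain grows in the 3′ to 5′ direction. The following sketch shows only the resulting order of additions and models no chemistry; `coupling_order` and `simulate_synthesis` are hypothetical names.

```python
def coupling_order(oligo_5_to_3: str):
    """Return (immobilized 3' terminal base, bases in the order they are coupled)."""
    seq = oligo_5_to_3.upper()
    if not seq:
        raise ValueError("empty oligonucleotide")
    # The last base (3' end) sits on the support; the rest are added 3' -> 5'.
    return seq[-1], list(seq[:-1][::-1])


def simulate_synthesis(oligo_5_to_3: str) -> str:
    """Rebuild the oligonucleotide by repeating the coupling cycle."""
    immobilized, additions = coupling_order(oligo_5_to_3)
    chain = immobilized
    for base in additions:
        # Each cycle extends the 5' end of the support-bound chain.
        chain = base + chain
    return chain


if __name__ == "__main__":
    print(coupling_order("ACGT"))
    print(simulate_synthesis("ACGT"))
```

For a desired sequence written 5′ to 3′, the additions therefore occur in the reverse of the written order.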

Polypeptides of the Invention

The term “GENSET polypeptides” is used herein to embrace all of the proteins and polypeptides of the present invention. The present invention encompasses GENSET polypeptides, including recombinant, isolated or purified GENSET polypeptides consisting of, consisting essentially of, or comprising a sequence selected from the group consisting of SEQ ID Nos: 242-482, the polypeptides encoded by human cDNAs contained in the deposited clones, the mature proteins included in SEQ ID Nos: 242-272 and 274-384, mature proteins encoded by the clone inserts of the deposited clone pool, and variants thereof. Other objects of the invention are polypeptides encoded by the polynucleotides of the invention as well as fusion polypeptides comprising such polypeptide.

Polypeptide Variants

The present invention further provides for GENSET polypeptides encoded by allelic and splice variants, orthologs, and/or species homologues. Procedures known in the art can be used to obtain allelic variants, splice variants, orthologs, and/or species homologues of polynucleotides encoding polypeptides of the group consisting of SEQ ID Nos: 242-482, mature proteins included in SEQ ID Nos: 242-272 and 274-384, and polypeptides, either full-length or mature, encoded by the clone inserts of the deposited clone pool, using information from the sequences disclosed herein or the clones deposited with the ATCC.

The polypeptides of the present invention also include polypeptides having an amino acid sequence at least 50% identical, more preferably at least 60% identical, and still more preferably 70%, 80%, 90%, 95%, 96%, 97%, 98% or 99% identical to a polypeptide selected from the group consisting of the sequences of SEQ ID Nos: 242-482, mature proteins included in sequences of SEQ ID Nos: 242-272 and 274-384, and full-length or mature polypeptides encoded by the clone inserts of the deposited clone pool. By a polypeptide having an amino acid sequence at least, for example, 95% “identical” to a query amino acid sequence of the present invention, it is intended that the amino acid sequence of the subject polypeptide is identical to the query sequence except that the subject polypeptide sequence may include up to five amino acid alterations per each 100 amino acids of the query amino acid sequence. In other words, to obtain a polypeptide having an amino acid sequence at least 95% identical to a query amino acid sequence, up to 5% (5 of 100) of the amino acid residues in the subject sequence may be inserted, deleted (indels), or substituted with another amino acid.
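The arithmetic behind this percent-identity definition can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method taken from the disclosure: the function names are hypothetical, the two sequences are assumed to be already aligned with `-` marking gaps, and every mismatched or gapped alignment column is counted as one alteration.

```python
def max_alterations(query_length: int, percent_identity: int) -> int:
    """Largest number of alterations a subject may carry and still be at
    least `percent_identity` percent identical to a query of this length."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent_identity must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding at the threshold.
    return query_length * (100 - percent_identity) // 100


def count_alterations(aligned_query: str, aligned_subject: str) -> int:
    """Count alignment columns that differ (substitution, insertion or
    deletion). Assumption: one differing column equals one alteration."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for q, s in zip(aligned_query, aligned_subject) if q != s)


def meets_identity(aligned_query: str, aligned_subject: str,
                   percent_identity: int) -> bool:
    """True if the subject stays within the alteration allowance."""
    # The allowance is computed per residue of the query, so gap
    # characters in the query row are not counted toward its length.
    query_length = sum(1 for q in aligned_query if q != "-")
    allowed = max_alterations(query_length, percent_identity)
    return count_alterations(aligned_query, aligned_subject) <= allowed
```

Under these assumptions a 100-residue query tolerates at most five alterations at the 95% level, which matches the "5 of 100" wording above. Real alignment tools may score gaps and terminal overhangs differently.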

Further polypeptides of the present invention include polypeptides which have at least 90% similarity, more preferably at least 95% similarity, and still more preferably at least 96%, 97%, 98% or 99% similarity to those described above. By a polypeptide having an amino acid sequence at least, for example, 95% “similar” to a query amino acid sequence of the present invention, it is intended that the amino acid sequence of the subject polypeptide is similar (i.e., contains identical or equivalent amino acid residues) to the query sequence except that the subject polypeptide sequence may include up to five amino acid alterations per each 100 amino acids of the query amino acid sequence. In other words, to obtain a polypeptide having an amino acid sequence at least 95% similar to a query amino acid sequence, up to 5% (5 of 100) of the amino acid residues in the subject sequence may be inserted, deleted (indels), or substituted with another non-equivalent amino acid.

These alterations of the reference sequence may occur at the amino or carboxy terminal positions of the reference amino acid sequence or anywhere between those terminal positions, interspersed either individually among residues in the reference sequence or in one or more contiguous groups within the reference sequence. The query sequence may be an entire amino acid sequence selected from the group consisting of sequences of SEQ ID Nos: 242-482 and those encoded by the clone inserts of the deposited clone pool or any fragment specified as described herein.

The variant polypeptides described herein are included in the present invention regardless of whether they have their normal biological activity. This is because even where a particular polypeptide molecule does not have biological activity, one of skill in the art would still know how to use the polypeptide, for instance, as a vaccine or to generate antibodies. Other uses of the polypeptides of the present invention that do not have GENSET biological activity include, inter alia, as epitope tags, in epitope mapping, and as molecular weight markers on SDS-PAGE gels or on molecular sieve gel filtration columns using methods known to those of skill in the art. As described below, the polypeptides of the present invention can also be used to raise polyclonal and monoclonal antibodies, which are useful in assays for detecting GENSET protein expression or as agonists and antagonists capable of enhancing or inhibiting GENSET protein function. Further, such polypeptides can be used in the yeast two-hybrid system to “capture” GENSET protein binding proteins, which are also candidate agonists and antagonists according to the present invention (See, e.g., Fields et al. 1989), which disclosure is hereby incorporated by reference in its entirety.

Preparation of the Polypeptides of the Invention

The polypeptides of the present invention can be prepared in any suitable manner. Such polypeptides include isolated naturally occurring polypeptides, recombinantly produced polypeptides, synthetically produced polypeptides, or polypeptides produced by a combination of these methods. The polypeptides of the present invention are preferably provided in an isolated form, and may be partially or preferably substantially purified.

Consequently, the present invention also comprises methods of making the polypeptides of the invention, particularly polypeptides encoded by the cDNAs of SEQ ID Nos: 1-241, mature proteins encoded by fragments of SEQ ID Nos: 1-31 and 33-143, full-length and mature polypeptides encoded by the clone inserts of the deposited clone pool, genomic DNA obtainable therefrom, or fragments thereof and methods of making the polypeptides of SEQ ID Nos: 242-482, mature polypeptides included in SEQ ID Nos: 242-272 and 274-384, or fragments thereof. The methods comprise sequentially linking together amino acids to produce the polypeptides having the preceding sequences. In some embodiments, the polypeptides made by these methods are 150 amino acids or less in length. In other embodiments, the polypeptides made by these methods are 120 amino acids or less in length.

Isolation

From Natural Sources

The GENSET proteins of the invention may be isolated from natural sources, including bodily fluids, tissues and cells, whether directly isolated or cultured cells, of humans or non-human animals. Methods for extracting and purifying natural proteins are known in the art, and include the use of detergents or chaotropic agents to disrupt particles followed by differential extraction and separation of the polypeptides by ion exchange chromatography, affinity chromatography, sedimentation according to density, and gel electrophoresis. See, for example, “Methods in Enzymology, Academic Press, 1993” for a variety of methods for purifying proteins, which disclosure is hereby incorporated by reference in its entirety. Polypeptides of the invention also can be purified from natural sources using antibodies directed against the polypeptides of the invention, such as those described herein, in methods which are well known in the art of protein purification.

From Recombinant Sources

Preferably, the GENSET polypeptides of the invention are recombinantly produced using routine expression methods known in the art. The polynucleotide encoding the desired polypeptide is operably linked to a promoter in an expression vector suitable for any convenient host. Both eukaryotic and prokaryotic host systems are used in forming recombinant polypeptides. The polypeptide is then isolated from lysed cells or from the culture medium and purified to the extent needed for its intended use.

Any GENSET polynucleotide, including those described in SEQ ID Nos: 1-241, those of clone inserts of the deposited clone pool, and allelic variants thereof may be used to express GENSET polypeptides. The nucleic acid encoding the GENSET polypeptide to be expressed is operably linked to a promoter in an expression vector using conventional cloning technology. The GENSET insert in the expression vector may comprise the full coding sequence for the GENSET protein or a portion thereof, especially the sequence for a mature polypeptide. For example, the GENSET derived insert may encode a polypeptide comprising at least 6, 8, 10, 12, 15, 20, 25, 30, 35, 40, 50, 60, 75, 100, 150 or 200 consecutive amino acids of a GENSET protein selected from the group consisting of sequences of SEQ ID Nos: 242-482 and polypeptides encoded by the clone inserts of the deposited clone pool.
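To make the notion of "consecutive amino acids" concrete, the following minimal sketch enumerates every fragment of a given length from a protein sequence. The function name and the short example sequence are hypothetical and are not taken from the sequence listing.

```python
from typing import List


def consecutive_fragments(protein: str, length: int) -> List[str]:
    """Return every run of `length` consecutive residues in `protein`,
    ordered from the N-terminus to the C-terminus."""
    if length < 1:
        raise ValueError("length must be at least 1")
    # A protein of L residues contains L - length + 1 such fragments;
    # the range is empty when the protein is shorter than `length`.
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]
```

For an 8-residue placeholder sequence and a fragment length of 6, this yields three overlapping fragments. An insert encoding any one of them would satisfy the "at least 6 consecutive amino acids" wording with respect to that sequence.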

Consequently, a further embodiment of the present invention is a method of making a polypeptide comprising a protein selected from the group consisting of sequences of SEQ ID Nos: 242-482 and polypeptides encoded by the clone inserts of the deposited clone pool, said method comprising the steps of

a) obtaining a cDNA comprising a sequence selected from the group consisting of i) the sequences SEQ ID Nos: 1-241, ii) the sequences of clone inserts of the deposited clone pool, iii) sequences encoding one of the polypeptides of SEQ ID Nos: 242-482, and iv) sequences of polynucleotides encoding a polypeptide which is encoded by one of the clone inserts of the deposited clone pool;

b) inserting said cDNA in an expression vector such that the cDNA is operably linked to a promoter; and

c) introducing said expression vector into a host cell whereby said host cell produces said polypeptide.

In one aspect of this embodiment, the method further comprises the step of isolating the polypeptide. Another embodiment of the present invention is a polypeptide obtainable by the method described in the preceding paragraph.

The expression vector is any of the mammalian, yeast, insect or bacterial expression systems known in the art. Commercially available vectors and expression systems are available from a variety of suppliers including Genetics Institute (Cambridge, Mass.), Stratagene (La Jolla, Calif.), Promega (Madison, Wis.), and Invitrogen (San Diego, Calif.). If desired, to enhance expression and facilitate proper protein folding, the codon context and codon pairing of the sequence are optimized for the particular expression organism in which the expression vector is introduced, as explained in U.S. Pat. No. 5,082,767, which disclosure is hereby incorporated by reference in its entirety.

In one embodiment, the entire coding sequence of a GENSET cDNA and the 3′UTR through the poly A signal of the cDNA is operably linked to a promoter in the expression vector. Alternatively, if the nucleic acid encoding a portion of the GENSET protein lacks a methionine to serve as the initiation site, an initiating methionine can be introduced next to the first codon of the nucleic acid using conventional techniques. Similarly, if the insert from the GENSET cDNA lacks a poly A signal, this sequence can be added to the construct by, for example, splicing out the poly A signal from pSG5 (Stratagene) using BglI and SalI restriction endonuclease enzymes and incorporating it into the mammalian expression vector pXT1 (Stratagene). pXT1 contains the LTRs and a portion of the gag gene from Moloney Murine Leukemia Virus. The position of the LTRs in the construct allows efficient stable transfection. The vector includes the Herpes Simplex Thymidine Kinase promoter and the selectable neomycin gene. The nucleic acid encoding the GENSET protein or a portion thereof is obtained by PCR from a vector containing a GENSET cDNA selected from the group consisting of the sequences of SEQ ID Nos: 1-241 and the clone inserts of the deposited clone pool using oligonucleotide primers complementary to the GENSET cDNA or portion thereof and containing restriction endonuclease sequences for PstI incorporated into the 5′ primer and BglII at the 5′ end of the corresponding cDNA 3′ primer, taking care to ensure that the sequence encoding the GENSET protein or a portion thereof is positioned properly with respect to the poly A signal. The purified fragment obtained from the resulting PCR reaction is digested with PstI, blunt ended with an exonuclease, digested with BglII, purified and ligated to pXT1, now containing a poly A signal and digested with BglII.
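The primer layout described above (a PstI site in the 5′ primer, a BglII site at the 5′ end of the 3′ primer, and an initiating ATG where the fragment lacks one) can be sketched as follows. This is an illustrative sketch only. The function names, the 18-nucleotide annealing length and the two-base `GC` 5′ clamp are assumptions that do not come from the disclosure, and the reading-frame and poly A positioning checks that the text calls for are not modeled.

```python
PST_I_SITE = "CTGCAG"   # PstI recognition sequence
BGL_II_SITE = "AGATCT"  # BglII recognition sequence

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def design_primers(coding_seq: str, anneal_len: int = 18, clamp: str = "GC"):
    """Build a (5' primer, 3' primer) pair for a coding fragment.

    `anneal_len` and `clamp` are illustrative assumptions, not values
    taken from the disclosure.
    """
    seq = coding_seq.upper()
    if len(seq) < anneal_len:
        raise ValueError("coding sequence shorter than annealing length")
    # Introduce an initiating methionine codon only if the fragment lacks one.
    start = "" if seq.startswith("ATG") else "ATG"
    forward = clamp + PST_I_SITE + start + seq[:anneal_len]
    # The 3' primer anneals to the end of the fragment, so its template-
    # matching part is the reverse complement of the last bases.
    reverse = clamp + BGL_II_SITE + reverse_complement(seq[-anneal_len:])
    return forward, reverse
```

Both recognition sequences are palindromic, so each site reads the same on either strand of the amplified product. In practice the clamp length and annealing length would be chosen to suit the enzymes and the template.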

Alternatively, cDNAs encoding secreted proteins may be cloned into pED6dpc2 (DiscoverEase, Genetics Institute, Cambridge, Mass.). The resulting pED6dpc2 constructs may be transfected into a suitable host cell, such as COS 1 cells. Methotrexate resistant cells are selected and expanded. Preferably, the secreted protein expressed from the cDNA is released into the culture medium thereby facilitating purification.

In another embodiment, it is often advantageous to add to the recombinant polynucleotide additional nucleotide sequence which codes for secretory or leader sequences, pro-sequences, sequences which aid in purification, such as multiple histidine residues, or an additional sequence for stability during recombinant production.

As a control, the expression vector lacking a cDNA insert is introduced into host cells or organisms.

Transfection of a GENSET expressing vector into mouse NIH 3T3 cells is but one embodiment of introducing polynucleotides into host cells. Introduction of a polynucleotide encoding a polypeptide into a host cell can be effected by calcium phosphate transfection, DEAE-dextran mediated transfection, cationic lipid-mediated transfection, electroporation, transduction, infection, or other methods. Such methods are described in many standard laboratory manuals, such as Davis et al. (1986), which disclosure is hereby incorporated by reference in its entirety. It is specifically contemplated that the polypeptides of the present invention may in fact be expressed by a host cell lacking a recombinant vector.

Recombinant cell extracts, or proteins from the culture medium if the expressed polypeptide is secreted, are then prepared and proteins separated by gel electrophoresis. If desired, the proteins may be ammonium sulfate precipitated or separated based on size or charge prior to electrophoresis. The proteins present are detected using techniques such as Coomassie or silver staining or using antibodies against the protein encoded by the GENSET cDNA of interest. Coomassie and silver staining techniques are familiar to those skilled in the art.

Proteins from the host cells or organisms containing an expression vector which contains the GENSET cDNA or a fragment thereof are compared to those from the control cells or organism. The presence of a band from the cells containing the expression vector which is absent in control cells indicates that the GENSET cDNA is expressed. Generally, the band corresponding to the protein encoded by the GENSET cDNA will have a mobility near that expected based on the number of amino acids in the open reading frame of the cDNA. However, the band may have a mobility different than that expected as a result of modifications such as glycosylation, ubiquitination, or enzymatic cleavage.
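As a rough guide to the band position "expected based on the number of amino acids in the open reading frame", the sketch below converts an ORF length into an approximate molecular mass. The figure of about 110 Da per residue is a common rule of thumb and an assumption here, not a value stated in the disclosure. The function names are hypothetical.

```python
AVERAGE_RESIDUE_MASS_DA = 110  # rule-of-thumb average; an assumption


def orf_residue_count(orf_nucleotides: int, includes_stop: bool = True) -> int:
    """Number of amino acids encoded by an ORF of the given length."""
    if orf_nucleotides <= 0 or orf_nucleotides % 3 != 0:
        raise ValueError("ORF length must be a positive multiple of 3")
    codons = orf_nucleotides // 3
    # The stop codon, when counted in the ORF length, encodes no residue.
    return codons - 1 if includes_stop else codons


def expected_mass_kda(residues: int) -> float:
    """Approximate unmodified polypeptide mass in kilodaltons."""
    return residues * AVERAGE_RESIDUE_MASS_DA / 1000
```

A 903-nucleotide ORF including its stop codon encodes 300 residues, or roughly 33 kDa by this estimate. As the paragraph above notes, glycosylation, ubiquitination or cleavage (including removal of a signal peptide from a secreted protein) can shift the observed mobility away from this figure.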

Alternatively, the GENSET polypeptide to be expressed may also be a product of transgenic animals, i.e., as a component of the milk of transgenic cows, goats, pigs or sheep which are characterized by somatic or germ cells containing a nucleotide sequence encoding the protein of interest.

A polypeptide of this invention can be recovered and purified from recombinant cell cultures by well-known methods including differential extraction, ammonium sulfate or ethanol precipitation, acid extraction, anion or cation exchange chromatography, phosphocellulose chromatography, hydrophobic interaction chromatography, affinity chromatography, hydroxylapatite chromatography and lectin chromatography. See, for example, “Methods in Enzymology”, supra for a variety of methods for purifying proteins. Most preferably, high performance liquid chromatography (“HPLC”) is employed for purification. A recombinantly produced version of a GENSET polypeptide can be substantially purified using techniques described herein or otherwise known in the art, such as, for example, by the one-step method described in Smith and Johnson (1988), which disclosure is hereby incorporated by reference in its entirety. Polypeptides of the invention also can be purified from recombinant sources using antibodies directed against the polypeptides of the invention, such as those described herein, in methods which are well known in the art of protein purification.

Preferably, the recombinantly expressed GENSET polypeptide is purified using standard immunochromatography techniques such as the one described in the section entitled “Immunoaffinity Chromatography”. In such procedures, a solution containing the protein of interest, such as the culture medium or a cell extract, is applied to a column having antibodies against the protein attached to the chromatography matrix. The recombinant protein is allowed to bind the immunochromatography column. Thereafter, the column is washed to remove non-specifically bound proteins. The specifically bound protein is then released from the column and recovered using standard techniques.

If antibody production is not possible, the GENSET cDNA sequence or fragment thereof may be incorporated into expression vectors designed for use in purification schemes employing chimeric polypeptides. In such strategies the coding sequence of the GENSET cDNA or fragment thereof is inserted in frame with the gene encoding the other half of the chimera. The other half of the chimera may be beta-globin or a nickel binding polypeptide encoding sequence. A chromatography matrix having antibody to beta-globin or nickel attached thereto is then used to purify the chimeric protein. Protease cleavage sites may be engineered between the beta-globin gene or the nickel binding polypeptide and the GENSET cDNA or fragment thereof. Thus, the two polypeptides of the chimera may be separated from one another by protease digestion.

One useful expression vector for generating beta-globin chimerics is pSG5 (Stratagene), which encodes rabbit beta-globin. Intron II of the rabbit beta-globin gene facilitates splicing of the expressed transcript, and the polyadenylation signal incorporated into the construct increases the level of expression. These techniques as described are well known to those skilled in the art of molecular biology. Standard methods are published in methods texts such as Davis et al., (1986) and many of the methods are available from Stratagene, Life Technologies, Inc., or Promega. Polypeptide may additionally be produced from the construct using in vitro translation systems such as the In vitro Express™ Translation Kit (Stratagene).

Depending upon the host employed in a recombinant production procedure, the polypeptides of the present invention may be glycosylated or may be non-glycosylated. In addition, polypeptides of the invention may also include an initial modified methionine residue, in some cases as a result of host-mediated processes. Thus, it is well known in the art that the N-terminal methionine encoded by the translation initiation codon generally is removed with high efficiency from any protein after translation in all eukaryotic cells. While the N-terminal methionine on most proteins also is efficiently removed in most prokaryotes, for some proteins, this prokaryotic removal process is inefficient, depending on the nature of the amino acid to which the N-terminal methionine is covalently linked.

From Chemical Synthesis

In addition, polypeptides of the invention, especially short protein fragments, can be chemically synthesized using techniques known in the art (See, e.g., Creighton, 1983; and Hunkapiller et al., 1984), which disclosures are hereby incorporated by reference in their entireties. For example, a polypeptide corresponding to a fragment of a polypeptide sequence of the invention can be synthesized by use of a peptide synthesizer. A variety of methods of making polypeptides are known to those skilled in the art, including methods in which the carboxyl terminal amino acid is bound to polyvinyl benzene or another suitable resin. The amino acid to be added possesses blocking groups on its amino moiety and any side chain reactive groups so that only its carboxyl moiety can react. The carboxyl group is activated with carbodiimide or another activating agent and allowed to couple to the immobilized amino acid. After removal of the blocking group, the cycle is repeated to generate a polypeptide having the desired sequence. Alternatively, the methods described in U.S. Pat. No. 5,049,656, which disclosure is hereby incorporated by reference in its entirety, may be used.
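The stepwise cycle described above builds the chain from the carboxyl terminus toward the amino terminus. The following minimal sketch models only that ordering. The function names are hypothetical, and the chemistry (blocking, activation, deblocking) is represented by comments rather than simulated.

```python
from typing import List


def coupling_order(target: str) -> List[str]:
    """Residues in the order they enter the synthesis: the resin-bound
    C-terminal residue first, the N-terminal residue last."""
    return list(reversed(target))


def simulate_synthesis(target: str) -> str:
    """Assemble `target` (written N- to C-terminus) one cycle at a time."""
    if not target:
        raise ValueError("target sequence must not be empty")
    chain = target[-1]  # carboxyl terminal residue bound to the resin
    for residue in reversed(target[:-1]):
        # Incoming residue: amino group blocked, carboxyl group activated,
        # so it couples to the free amino group of the immobilized chain.
        chain = residue + chain
        # Removing the blocking group exposes the new N-terminus for the
        # next cycle.
    return chain
```

For a placeholder pentapeptide written `MKTAY`, tyrosine (Y) is attached to the resin first and methionine (M) is coupled last, and the assembled chain reads the same as the target.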

Furthermore, if desired, nonclassical amino acids or chemical amino acid analogs can be introduced as a substitution or addition into the polypeptide sequence. Non-classical amino acids include, but are not limited to, the D-isomers of the common amino acids, 2,4-diaminobutyric acid, α-amino isobutyric acid, 4-aminobutyric acid, Abu, 2-amino butyric acid, γ-Abu, ε-Ahx, 6-amino hexanoic acid, Aib, 2-amino isobutyric acid, 3-amino propionic acid, ornithine, norleucine, norvaline, hydroxyproline, sarcosine, citrulline, homocitrulline, cysteic acid, t-butylglycine, t-butylalanine, phenylglycine, cyclohexylalanine, β-alanine, fluoroamino acids, designer amino acids such as β-methyl amino acids, Cα-methyl amino acids, Nα-methyl amino acids, and amino acid analogs in general. Furthermore, the amino acid can be D (dextrorotatory) or L (levorotatory).

Modifications

The invention encompasses polypeptides which are differentially modified during or after translation, e.g., by glycosylation, acetylation, phosphorylation, amidation, derivatization by known protecting/blocking groups, proteolytic cleavage, linkage to an antibody molecule or other cellular ligand, etc. Any of numerous chemical modifications may be carried out by known techniques, including, but not limited to, specific chemical cleavage by cyanogen bromide, trypsin, chymotrypsin, papain, V8 protease, NaBH4; acetylation, formylation, oxidation, reduction; metabolic synthesis in the presence of tunicamycin; etc.

Additional post-translational modifications encompassed by the invention include, for example, N-linked or O-linked carbohydrate chains, processing of N-terminal or C-terminal ends, attachment of chemical moieties to the amino acid backbone, chemical modifications of N-linked or O-linked carbohydrate chains, and addition or deletion of an N-terminal methionine residue as a result of prokaryotic host cell expression. The polypeptides may also be modified with a detectable label, such as an enzymatic, fluorescent, isotopic or affinity label to allow for detection and isolation of the protein.

Also provided by the invention are chemically modified derivatives of the polypeptides of the invention which may provide additional advantages such as increased solubility, stability and circulating time of the polypeptide, or decreased immunogenicity. See U.S. Pat. No. 4,179,337, which disclosure is hereby incorporated by reference in its entirety. The chemical moieties for derivatization may be selected from water soluble polymers such as polyethylene glycol, ethylene glycol/propylene glycol copolymers, carboxymethylcellulose, dextran, polyvinyl alcohol and the like. The polypeptides may be modified at random positions within the molecule, or at predetermined positions within the molecule and may include one, two, three or more attached chemical moieties.

The polymer may be of any molecular weight, and may be branched or unbranched. For polyethylene glycol, the preferred molecular weight is between about 1 kDa and about 100 kDa (the term “about” indicating that in preparations of polyethylene glycol, some molecules will weigh more, some less, than the stated molecular weight) for ease in handling and manufacturing. Other sizes may be used, depending on the desired therapeutic profile (e.g., the duration of sustained release desired, the effects, if any on biological activity, the ease in handling, the degree or lack of antigenicity and other known effects of the polyethylene glycol to a therapeutic protein or analog).

The polyethylene glycol molecules (or other chemical moieties) should be attached to the protein with consideration of effects on functional or antigenic domains of the protein. There are a number of attachment methods available to those skilled in the art, e.g., EP 0 401 384 (coupling PEG to G-CSF), and Malik et al. (1992) (reporting pegylation of GM-CSF using tresyl chloride), which disclosures are hereby incorporated by reference in their entireties. For example, polyethylene glycol may be covalently bound through amino acid residues via a reactive group, such as a free amino or carboxyl group. Reactive groups are those to which an activated polyethylene glycol molecule may be bound. The amino acid residues having a free amino group may include lysine residues and the N-terminal amino acid residues; those having a free carboxyl group may include aspartic acid residues, glutamic acid residues and the C-terminal amino acid residue. Sulfhydryl groups may also be used as a reactive group for attaching the polyethylene glycol molecules. Preferred for therapeutic purposes is attachment at an amino group, such as attachment at the N-terminus or lysine group.

One may specifically desire proteins chemically modified at the N-terminus. Using polyethylene glycol as an illustration of the present composition, one may select from a variety of polyethylene glycol molecules (by molecular weight, branching, etc.), the proportion of polyethylene glycol molecules to protein (polypeptide) molecules in the reaction mix, the type of pegylation reaction to be performed, and the method of obtaining the selected N-terminally pegylated protein. The method of obtaining the N-terminally pegylated preparation (i.e., separating this moiety from other monopegylated moieties if necessary) may be by purification of the N-terminally pegylated material from a population of pegylated protein molecules. Selective chemical modification of proteins at the N-terminus may be accomplished by reductive alkylation, which exploits differential reactivity of different types of primary amino groups (lysine versus the N-terminal) available for derivatization in a particular protein. Under the appropriate reaction conditions, substantially selective derivatization of the protein at the N-terminus with a carbonyl group containing polymer is achieved.

Multimerization

The polypeptides of the invention may be monomers or multimers (i.e., dimers, trimers, tetramers and higher multimers). Accordingly, the present invention relates to monomers and multimers of the polypeptides of the invention, their preparation, and compositions containing them. In specific embodiments, the polypeptides of the invention are monomers, dimers, trimers or tetramers. In additional embodiments, the multimers of the invention are at least dimers, at least trimers, or at least tetramers.

Multimers encompassed by the invention may be homomers or heteromers. As used herein, the term “homomer” refers to a multimer containing only polypeptides corresponding to the amino acid sequences of SEQ ID Nos: 242-482 or encoded by the clone inserts of the deposited clone pool (including fragments, variants, splice variants, and fusion proteins corresponding to these polypeptides as described herein). These homomers may contain polypeptides having identical or different amino acid sequences. In a specific embodiment, a homomer of the invention is a multimer containing only polypeptides having an identical amino acid sequence. In another specific embodiment, a homomer of the invention is a multimer containing polypeptides having different amino acid sequences. In specific embodiments, the multimer of the invention is a homodimer (e.g., containing polypeptides having identical or different amino acid sequences) or a homotrimer (e.g., containing polypeptides having identical and/or different amino acid sequences). In additional embodiments, the homomeric multimer of the invention is at least a homodimer, at least a homotrimer, or at least a homotetramer.

As used herein, the term “heteromer” refers to a multimer containing one or more heterologous polypeptides (i.e., polypeptides of different proteins) in addition to the polypeptides of the invention. In a specific embodiment, the multimer of the invention is a heterodimer, a heterotrimer, or a heterotetramer. In additional embodiments, the heteromeric multimer of the invention is at least a heterodimer, at least a heterotrimer, or at least a heterotetramer.

Multimers of the invention may be the result of hydrophobic, hydrophilic, ionic and/or covalent associations and/or may be indirectly linked, by for example, liposome formation. Thus, in one embodiment, multimers of the invention, such as, for example, homodimers or homotrimers, are formed when polypeptides of the invention contact one another in solution. In another embodiment, heteromultimers of the invention, such as, for example, heterotrimers or heterotetramers, are formed when polypeptides of the invention contact antibodies to the polypeptides of the invention (including antibodies to the heterologous polypeptide sequence in a fusion protein of the invention) in solution. In other embodiments, multimers of the invention are formed by covalent associations with and/or between the polypeptides of the invention. Such covalent associations may involve one or more amino acid residues contained in the polypeptide sequence (e.g., that recited in the sequence listing, or contained in the polypeptide encoded by a deposited clone). In one instance, the covalent associations are cross-linking between cysteine residues located within the polypeptide sequences, which interact in the native (i.e., naturally occurring) polypeptide. In another instance, the covalent associations are the consequence of chemical or recombinant manipulation. Alternatively, such covalent associations may involve one or more amino acid residues contained in the heterologous polypeptide sequence in a fusion protein of the invention.

In one example, covalent associations are between the heterologous sequence contained in a fusion protein of the invention (see, e.g., U.S. Pat. No. 5,478,925, which disclosure is hereby incorporated by reference in its entirety). In a specific example, the covalent associations are between the heterologous sequence contained in an Fc fusion protein of the invention (as described herein). In another specific example, covalent associations of fusion proteins of the invention are between heterologous polypeptide sequence from another protein that is capable of forming covalently associated multimers, such as, for example, osteoprotegerin (see, e.g., International Publication No: WO 98/49305, the contents of which are herein incorporated by reference in their entirety). In another embodiment, two or more polypeptides of the invention are joined through peptide linkers. Examples include those peptide linkers described in U.S. Pat. No. 5,073,627 (hereby incorporated by reference). Proteins comprising multiple polypeptides of the invention separated by peptide linkers may be produced using conventional recombinant DNA technology.

Another method for preparing multimer polypeptides of the invention involves use of polypeptides of the invention fused to a leucine zipper or isoleucine zipper polypeptide sequence. Leucine zipper and isoleucine zipper domains are polypeptides that promote multimerization of the proteins in which they are found. Leucine zippers were originally identified in several DNA-binding proteins, and have since been found in a variety of different proteins (Landschulz et al., 1988). Among the known leucine zippers are naturally occurring peptides and derivatives thereof that dimerize or trimerize. Examples of leucine zipper domains suitable for producing soluble multimeric proteins of the invention are those described in PCT application WO 94/10308, hereby incorporated by reference. Recombinant fusion proteins comprising a polypeptide of the invention fused to a polypeptide sequence that dimerizes or trimerizes in solution are expressed in suitable host cells, and the resulting soluble multimeric fusion protein is recovered from the culture supernatant using techniques known in the art.

Trimeric polypeptides of the invention may offer the advantage of enhanced biological activity. Preferred leucine zipper moieties and isoleucine moieties are those that preferentially form trimers. One example is a leucine zipper derived from lung surfactant protein D (SPD), as described in Hoppe et al. (1994) and in U.S. patent application Ser. No. 08/446,922, which disclosure is hereby incorporated by reference in its entirety. Other peptides derived from naturally occurring trimeric proteins may be employed in preparing trimeric polypeptides of the invention. In another example, proteins of the invention are associated by interactions between Flag® polypeptide sequence contained in fusion proteins of the invention containing Flag® polypeptide sequence. In a further embodiment, proteins of the invention are associated by interactions between heterologous polypeptide sequence contained in Flag® fusion proteins of the invention and anti-Flag® antibody.

The multimers of the invention may be generated using chemical techniques known in the art. For example, polypeptides desired to be contained in the multimers of the invention may be chemically cross-linked using linker molecules and linker molecule length optimization techniques known in the art (see, e.g., U.S. Pat. No. 5,478,925, which is herein incorporated by reference in its entirety). Additionally, multimers of the invention may be generated using techniques known in the art to form one or more inter-molecule cross-links between the cysteine residues located within the sequence of the polypeptides desired to be contained in the multimer (see, e.g., U.S. Pat. No. 5,478,925, which is herein incorporated by reference in its entirety). Further, polypeptides of the invention may be routinely modified by the addition of cysteine or biotin to the C-terminus or N-terminus of the polypeptide and techniques known in the art may be applied to generate multimers containing one or more of these modified polypeptides (see, e.g., U.S. Pat. No. 5,478,925, which is herein incorporated by reference in its entirety). Additionally, techniques known in the art may be applied to generate liposomes containing the polypeptide components desired to be contained in the multimer of the invention (see, e.g., U.S. Pat. No. 5,478,925, which is herein incorporated by reference in its entirety).

Alternatively, multimers of the invention may be generated using genetic engineering techniques known in the art. In one embodiment, polypeptides contained in multimers of the invention are produced recombinantly using fusion protein technology described herein or otherwise known in the art (see, e.g., U.S. Pat. No. 5,478,925, which is herein incorporated by reference in its entirety). In a specific embodiment, polynucleotides coding for a homodimer of the invention are generated by ligating a polynucleotide sequence encoding a polypeptide of the invention to a sequence encoding a linker polypeptide and then further to a synthetic polynucleotide encoding the translated product of the polypeptide in the reverse orientation from the original C-terminus to the N-terminus (lacking the leader sequence) (see, e.g., U.S. Pat. No. 5,478,925, which is herein incorporated by reference in its entirety). In another embodiment, recombinant techniques described herein or otherwise known in the art are applied to generate recombinant polypeptides of the invention which contain a transmembrane domain (or hydrophobic or signal peptide) and which can be incorporated by membrane reconstitution techniques into liposomes (see, e.g., U.S. Pat. No. 5,478,925, which is herein incorporated by reference in its entirety).

Mutated Polypeptides

To improve or alter the characteristics of GENSET polypeptides of the present invention, protein engineering may be employed. Recombinant DNA technology known to those skilled in the art can be used to create novel mutant proteins or muteins including single or multiple amino acid substitutions, deletions, additions, or fusion proteins. Such modified polypeptides can show, e.g., increased/decreased biological activity or increased/decreased stability. In addition, they may be purified in higher yields and show better solubility than the corresponding natural polypeptide, at least under certain purification and storage conditions. Further, the polypeptides of the present invention may be produced as multimers including dimers, trimers and tetramers. Multimerization may be facilitated by linkers or recombinantly through heterologous polypeptides such as Fc regions.

N- and C-Terminal Deletions

It is known in the art that one or more amino acids may be deleted from the N-terminus or C-terminus without substantial loss of biological function. For instance, Ron et al. (1993) reported modified KGF proteins that had heparin binding activity even if 3, 8, or 27 N-terminal amino acid residues were missing. Accordingly, the present invention provides polypeptides having one or more residues deleted from the amino terminus of the polypeptides of SEQ ID Nos: 242-482 or that encoded by the clone inserts of the deposited clone pool. Similarly, many examples of biologically functional C-terminal deletion mutants are known. For instance, Interferon gamma shows up to ten times higher activities by deleting 8-10 amino acid residues from the C-terminus of the protein (See, e.g., Dobeli, et al. 1988), which disclosure is hereby incorporated by reference in its entirety. Accordingly, the present invention provides polypeptides having one or more residues deleted from the carboxy terminus of the polypeptides of SEQ ID Nos: 242-482 or encoded by the clone inserts of the deposited clone pool. The invention also provides polypeptides having one or more amino acids deleted from both the amino and the carboxyl termini as described below.

Other Mutations

Other mutants in addition to N- and C-terminal deletion forms of the protein discussed above are included in the present invention. It also will be recognized by one of ordinary skill in the art that some amino acid sequences of the GENSET polypeptides of the present invention can be varied without significant effect on the structure or function of the protein. If such differences in sequence are contemplated, it should be remembered that there will be critical areas on the protein which determine activity. Thus, the invention further includes variations of the GENSET polypeptides which show substantial GENSET polypeptide activity. Such mutants include deletions, insertions, inversions, repeats, and substitutions selected according to general rules known in the art so as to have little effect on activity. For example, guidance concerning how to make phenotypically silent amino acid substitutions is provided.

There are two main approaches for studying the tolerance of an amino acid sequence to change (See, Bowie et al. 1994), which disclosure is hereby incorporated by reference in its entirety. The first method relies on the process of evolution, in which mutations are either accepted or rejected by natural selection.

The second approach uses genetic engineering to introduce amino acid changes at specific positions of a cloned gene and selections or screens to identify sequences that maintain functionality. These studies have revealed that proteins are surprisingly tolerant of amino acid substitutions. The studies indicate which amino acid changes are likely to be permissive at a certain position of the protein. For example, most buried amino acid residues require nonpolar side chains, whereas few features of surface side chains are generally conserved. Other such phenotypically silent substitutions are described by Bowie et al. (supra) and the references cited therein.

Typically seen as conservative substitutions are the replacements, one for another, among the aliphatic amino acids Ala, Val, Leu and Ile; interchange of the hydroxyl residues Ser and Thr; exchange of the acidic residues Asp and Glu; substitution between the amide residues Asn and Gln; exchange of the basic residues Lys and Arg; and replacements among the aromatic residues Phe and Tyr. Thus, the fragment, derivative, analog, or homologue of the polypeptide of the present invention may be, for example: (i) one in which one or more of the amino acid residues are substituted with a conserved or non-conserved amino acid residue (preferably a conserved amino acid residue) and such substituted amino acid residue may or may not be one encoded by the genetic code; or (ii) one in which one or more of the amino acid residues includes a substituent group; or (iii) one in which the GENSET polypeptide is fused with another compound, such as a compound to increase the half-life of the polypeptide (for example, polyethylene glycol); or (iv) one in which the additional amino acids are fused to the above form of the polypeptide, such as an IgG Fc fusion region peptide or leader or secretory sequence or a sequence which is employed for purification of the above form of the polypeptide or a pro-protein sequence. Such fragments, derivatives and analogs are deemed to be within the scope of those skilled in the art from the teachings herein.

Thus, the GENSET polypeptides of the present invention may include one or more amino acid substitutions, deletions, or additions, either from natural mutations or human manipulation. As indicated, changes are preferably of a minor nature, such as conservative amino acid substitutions that do not significantly affect the folding or activity of the protein. The following groups of amino acids generally represent equivalent changes: (1) Ala, Pro, Gly, Glu, Asp, Gln, Asn, Ser, Thr; (2) Cys, Ser, Tyr, Thr; (3) Val, Ile, Leu, Met, Ala, Phe; (4) Lys, Arg, His; (5) Phe, Tyr, Trp, His.
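By way of illustration only, the five groups listed above can be written as sets and a given substitution checked for membership in a shared group. The following Python sketch is not part of the disclosed methods; the names `EQUIVALENCE_GROUPS`, `is_equivalent_change` and `count_changes`, and the use of one-letter residue codes, are assumptions made for the example.

```python
from typing import Tuple

# Illustrative sketch: the five groups listed above, in one-letter codes.
# All names here are hypothetical and chosen for the example only.
EQUIVALENCE_GROUPS = [
    set("APGEDQNST"),  # (1) Ala, Pro, Gly, Glu, Asp, Gln, Asn, Ser, Thr
    set("CSYT"),       # (2) Cys, Ser, Tyr, Thr
    set("VILMAF"),     # (3) Val, Ile, Leu, Met, Ala, Phe
    set("KRH"),        # (4) Lys, Arg, His
    set("FYWH"),       # (5) Phe, Tyr, Trp, His
]


def is_equivalent_change(original: str, replacement: str) -> bool:
    """Return True if both residues appear together in at least one group."""
    return any(original in g and replacement in g for g in EQUIVALENCE_GROUPS)


def count_changes(reference: str, variant: str) -> Tuple[int, int]:
    """Count (equivalent, other) substitutions between two equal-length sequences."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have the same length")
    equivalent = other = 0
    for ref, var in zip(reference, variant):
        if ref == var:
            continue  # unchanged position
        if is_equivalent_change(ref, var):
            equivalent += 1
        else:
            other += 1
    return equivalent, other
```

Such a count could, for example, be compared against an upper limit on the number of conservative substitutions; insertions and deletions are outside the scope of this sketch, which assumes aligned sequences of equal length.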

A specific embodiment of a modified GENSET peptide molecule of interest according to the present invention includes, but is not limited to, a peptide molecule which is resistant to proteolysis, namely a peptide in which the —CONH— peptide bond is modified and replaced by a (CH2NH) reduced bond, a (NHCO) retro-inverso bond, a (CH2-O) methylene-oxy bond, a (CH2-S) thiomethylene bond, a (CH2CH2) carba bond, a (CO—CH2) ketomethylene bond, a (CHOH—CH2) hydroxyethylene bond, a (N—N) bond, an E-alkene bond or also a —CH═CH— bond. The invention also encompasses a human GENSET polypeptide or a fragment or a variant thereof in which at least one peptide bond has been modified as described above.

Amino acids in the GENSET proteins of the present invention that are essential for function can be identified by methods known in the art, such as site-directed mutagenesis or alanine-scanning mutagenesis (See, e.g., Cunningham et al. 1989), which disclosure is hereby incorporated by reference in its entirety. The latter procedure introduces single alanine mutations at every residue in the molecule. The resulting mutant molecules are then tested for biological activity using assays appropriate for measuring the function of the particular protein. Of special interest are substitutions of charged amino acids with other charged or neutral amino acids which may produce proteins with highly desirable improved characteristics, such as less aggregation. Aggregation may not only reduce activity but also be problematic when preparing pharmaceutical formulations, because aggregates can be immunogenic (See, e.g., Pinckard et al., 1967; Robbins, et al., 1987; and Cleland, et al., 1993).
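The set of single-alanine mutants used in alanine-scanning mutagenesis can be enumerated computationally before any mutants are constructed. The following Python sketch is illustrative only; the function name `alanine_scan` is hypothetical, and skipping residues that are already alanine is an assumption of the example rather than a requirement stated above.

```python
from typing import Iterator, Tuple


def alanine_scan(sequence: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (position, original_residue, mutant_sequence) for each single Ala mutant.

    Positions are 1-based. Residues that are already alanine are skipped,
    an assumption made here because replacing Ala with Ala changes nothing.
    """
    for index, residue in enumerate(sequence):
        if residue == "A":
            continue
        mutant = sequence[:index] + "A" + sequence[index + 1:]
        yield index + 1, residue, mutant
```

Each mutant sequence produced in this way would then be expressed and tested in an assay appropriate to the protein, as described above.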

A further embodiment of the invention relates to a polypeptide which comprises the amino acid sequence of a GENSET polypeptide having an amino acid sequence which contains at least one conservative amino acid substitution, but not more than 50 conservative amino acid substitutions, not more than 40 conservative amino acid substitutions, not more than 30 conservative amino acid substitutions, or not more than 20 conservative amino acid substitutions. Also provided are polypeptides which comprise the amino acid sequence of a GENSET polypeptide, having at least one, but not more than 10, 9, 8, 7, 6, 5, 4, 3, 2 or 1 conservative amino acid substitutions.

Polypeptide Fragments

Structural Definition

The present invention is further directed to fragments of the amino acid sequences described herein such as the polypeptides of SEQ ID Nos: 242-482, mature polypeptides included in SEQ ID Nos: 242-272 and 274-384, or full-length or mature polypeptides encoded by the clone inserts of the deposited clone pool. More specifically, the present invention embodies purified, isolated, and recombinant polypeptides comprising at least 6, preferably at least 8 to 10, more preferably 12, 15, 20, 25, 30, 35, 40, 50, 60, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 350, 400, 450 or 500 consecutive amino acids of a polypeptide selected from the group consisting of the sequences of SEQ ID Nos: 242-482, mature polypeptides included in SEQ ID Nos: 242-272 and 274-384, and full-length or mature polypeptides encoded by the clone inserts of the deposited clone pool, and other polypeptides of the present invention.

In addition to the above polypeptide fragments, further preferred sub-genuses of polypeptides comprise at least 6 amino acids, wherein “at least 6” is defined as any integer between 6 and the integer representing the C-terminal amino acid of the polypeptide of the present invention including the polypeptide sequences of the sequence listing below. Further included are species of polypeptide fragments at least 6 amino acids in length, as described above, that are further specified in terms of their N-terminal and C-terminal positions. However, included in the present invention as individual species are all polypeptide fragments, at least 6 amino acids in length, as described above, which may be particularly specified by an N-terminal and C-terminal position. That is, every combination of an N-terminal and C-terminal position that a fragment at least 6 contiguous amino acid residues in length could occupy, on any given amino acid sequence of the sequence listing or of the present invention, is included in the present invention.

The present invention also provides for the exclusion of any fragment species specified by N-terminal and C-terminal positions or of any fragment sub-genus specified by size in amino acid residues as described above. Any number of fragments specified by N-terminal and C-terminal positions or by size in amino acid residues as described above may be excluded as individual species.

The above polypeptide fragments of the present invention can be immediately envisaged using the above description and are therefore not individually listed solely for the purpose of not unnecessarily lengthening the specification. Moreover, the above fragments need not have a GENSET biological activity, although polypeptides having these activities are preferred embodiments of the invention, since they would be useful, for example, in immunoassays, in epitope mapping, epitope tagging, as vaccines, and as molecular weight markers. The above fragments may also be used to generate antibodies to a particular portion of the polypeptide. These antibodies can then be used in immunoassays well known in the art to distinguish between human and non-human cells and tissues or to determine whether cells or tissues in a biological sample are or are not of the same type which express the polypeptides of the present invention.

It is noted that the above species of polypeptide fragments of the present invention may alternatively be described by the formula “a to b”; where “a” equals the N-terminal most amino acid position and “b” equals the C-terminal most amino acid position of the polypeptide; and further where “a” equals an integer between 1 and the number of amino acids of the polypeptide sequence of the present invention minus 6, and where “b” equals an integer between 7 and the number of amino acids of the polypeptide sequence of the present invention; and where “a” is an integer smaller than “b” by at least 6.
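For illustration, the “a to b” formula can be enumerated programmatically. The following Python sketch applies the constraints exactly as worded in the preceding paragraph (“a” from 1 to the sequence length minus 6, “b” from 7 to the sequence length, and “b” exceeding “a” by at least 6); the function names `fragment_positions` and `fragment` are hypothetical and are not part of the specification.

```python
from typing import Iterator, Tuple


def fragment_positions(length: int) -> Iterator[Tuple[int, int]]:
    """Yield every (a, b) pair allowed by the formula as worded above.

    a runs from 1 to length - 6, b runs from 7 to length, and b - a >= 6.
    Positions are 1-based and inclusive.
    """
    for a in range(1, length - 6 + 1):
        for b in range(max(7, a + 6), length + 1):
            yield a, b


def fragment(sequence: str, a: int, b: int) -> str:
    """Return the residues from position a to position b, 1-based and inclusive."""
    return sequence[a - 1:b]
```

Under these constraints a sequence of n residues (n of at least 7) yields (n - 6)(n - 5)/2 position pairs, which shows why the individual species are not listed one by one.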

The present invention also provides for the exclusion of any species of polypeptide fragments of the present invention specified by N-terminal and C-terminal positions or sub-genuses of polypeptides specified by size in amino acids as described above. Any number of fragments specified by N-terminal and C-terminal positions or by size in amino acids, as described above, may be excluded. Specifically excluded from the invention are the polypeptide fragments encoded by the preferentially excluded polynucleotide fragments described in Table IV, and in Tables Va and Vb. Table IV and Tables Va and Vb provide for the exclusion of polypeptides, independently from each other, in addition to those described elsewhere in the specification, and are therefore not meant as a limiting description.

Functional Definition

Preferred polypeptide fragments of the invention are isolated, purified or recombinant polypeptides comprising, consisting of, or consisting essentially of signal peptides, preferably signal peptides selected from the group consisting of SEQ ID Nos: 242-272 and 274-384, signal peptides encoded by sequences of SEQ ID Nos: 1-31 and 33-143 and those encoded by the clone inserts of the deposited clone pool. Such polypeptide fragments are useful to design secretion vectors as described elsewhere in the application.

Other preferred polypeptide fragments of the invention are isolated, purified or recombinant polypeptides comprising, consisting of, or consisting essentially of mature proteins, preferably mature proteins selected from the group consisting of SEQ ID Nos: 242-272 and 274-384, mature proteins encoded by sequences of SEQ ID Nos: 1-31 and 33-143 and those encoded by the clone inserts of the deposited clone pool.

Domains

Preferred polypeptide fragments of the invention are domains of polypeptides of the invention. Such domains may comprise linear or structural motifs and signatures including, but not limited to, leucine zippers, helix-turn-helix motifs, post-translational modification sites such as glycosylation sites, ubiquitination sites, alpha helices, and beta sheets, signal sequences encoding signal peptides which direct the secretion of the encoded proteins, sequences implicated in transcription regulation such as homeoboxes, acidic stretches, enzymatic active sites, substrate binding sites, and enzymatic cleavage sites. Such domains may present a particular biological activity such as DNA or RNA-binding, secretion of proteins, transcription regulation, enzymatic activity, substrate binding activity, etc.

A domain has a size generally comprised between 3 and 2000 amino acids. In a preferred embodiment, domains comprise a number of amino acids that is any integer between 6 and 500. Domains may be synthesized using any methods known to those skilled in the art, including those disclosed herein, particularly in the section entitled “Preparation of the polypeptides of the invention”. Methods for determining the amino acids which make up a domain with a particular biological activity include mutagenesis studies and assays to determine the biological activity to be tested.

Alternatively, the polypeptides of the invention may be scanned for motifs, domains and/or signatures in databases using any computer method known to those skilled in the art. Searchable databases include Prosite (Hofmann et al., 1999; Bucher and Bairoch 1994), Pfam (Sonnhammer et al., 1997; Henikoff et al., 2000; Bateman et al., 2000), Blocks (Henikoff et al., 2000), Prints (Attwood et al., 1996), Prodom (Sonnhammer and Kahn, 1994; Corpet et al. 2000), Sbase (Pongor et al., 1993; Murvai et al., 2000), Smart (Schultz et al., 1998), Dali/FSSP (Holm and Sander, 1996, 1997 and 1999), HSSP (Sander and Schneider 1991), CATH (Orengo et al., 1997; Pearl et al., 2000), SCOP (Murzin et al., 1995; Lo Conte et al., 2000), COG (Tatusov et al., 1997 and 2000), specific family databases and derivatives thereof (Nevill-Manning et al., 1998; Yona et al., 1999; Attwood et al., 2000), each of which disclosures are hereby incorporated by reference in their entireties. For a review on available databases, see issue 1 of volume 28 of Nucleic Acids Research (2000), which disclosure is hereby incorporated by reference in its entirety.

The polypeptides of SEQ ID NOs: 242-482 were screened for the presence of known structural or functional motifs or for the presence of signatures, small amino acid sequences that are well conserved amongst the members of a protein family. The search was conducted on the Pfam 5.5 database using HMMER-2.1.1 (for info see Sonnhammer and Durbin, http://www.sanger.ac.uk/Pfam/), on a Blocks Plus database containing Blocks version 12.0, Prints version 26.0, Pfam version 5.3, Prodom version 99.1, and Domo version 2.0 using emotif (for info see Nevill-Manning et al., PNAS, 95, 5865-5871, (1998), http://motif.stanford.edu/EMOTIF) and on the Prosite 16.0 database using bla (Tatusov, R. L. & Koonin, E. V. CABIOS 10, No. 4) and pfscan (http://www.isrec.isb-sib.ch/cgi-bin/man.cgi?section=1&topic=pfscan). Some of these predicted domains are described in Table VI. For these polypeptides referred to by their sequence identification numbers (column entitled “Seq Id No”), Table VI gives the designation of the domain (column entitled “Designation of domain”) according to the database of domains indicated in the column entitled “Database” and the positions of preferred fragments within these sequences (column entitled “Positions of domains”). Each fragment is represented by a-b where a and b are the start and end positions respectively of a given preferred fragment on the full-length polypeptide. Preferred fragments are separated from each other by a comma. As used herein, the term “domain described in Table VI” refers to all the domains listed in Table VI for a given GENSET protein referred to by its sequence identification number in the first column. It should be noted that in Table VI, the first methionine encountered is designated as amino acid number 1, i.e., the leader sequence is not numbered negatively. In the appended sequence listing, the first amino acid of the mature protein resulting from cleavage of the signal peptide is designated as amino acid number 1 and the first amino acid of the signal peptide is designated with the appropriate negative number, in accordance with the regulations governing sequence listings.
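The two numbering conventions described above can be reconciled with a short routine. The following Python sketch is illustrative only: it parses a comma-separated list of “a-b” positions of the kind described for Table VI and converts a position counted from the first methionine into sequence-listing numbering. The function names, the example strings in the checks, and the assumptions that the signal peptide occupies positions 1 through `signal_length` and that no position is numbered zero are all hypothetical and are not taken from the tables themselves.

```python
from typing import List, Tuple


def parse_positions(field: str) -> List[Tuple[int, int]]:
    """Parse a string such as '12-45,60-98' into [(12, 45), (60, 98)]."""
    spans = []
    for item in field.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        start, end = item.split("-")
        spans.append((int(start), int(end)))
    return spans


def to_listing_number(position: int, signal_length: int) -> int:
    """Convert numbering from the first methionine (Met = 1) to listing numbering.

    Assumes the signal peptide spans positions 1..signal_length, that the first
    residue of the mature protein is numbered 1, and that zero is not used, so
    the last residue of the signal peptide is numbered -1.
    """
    if position < 1:
        raise ValueError("position must be a positive integer")
    if position <= signal_length:
        return position - signal_length - 1
    return position - signal_length
```

With a hypothetical 20-residue signal peptide, position 1 in the table numbering corresponds to -20 in the listing numbering and position 21 corresponds to 1.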

Consequently, preferred polypeptide fragments of the invention are domains of the polypeptides of SEQ ID Nos: 242-482. Therefore, the present invention encompasses isolated, purified, or recombinant polypeptides which consist of, consist essentially of, or comprise a contiguous span of at least 6, preferably at least 8 to 10, more preferably 12, 15, 20, 25, 30, 35, 40, 50, 60, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 350, 400, 450 or 500 amino acids of a sequence selected from the group consisting of the sequences of SEQ ID Nos: 242-482, to the extent that a contiguous span of these lengths is consistent with the lengths of said selected sequence, where said contiguous span comprises at least 1, 2, 3, 5, or 10 amino acid positions of a domain described in Table VI of said selected sequence. The present invention also encompasses isolated, purified, or recombinant polypeptides comprising, consisting essentially of, or consisting of a contiguous span of at least 6, preferably at least 8 to 10, more preferably 12, 15, 20, 25, 30, 35, 40, 50, 60, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 350, 400, 450 or 500 amino acids of a sequence selected from the group consisting of the sequences of SEQ ID Nos: 242-482, to the extent that a contiguous span of these lengths is consistent with the lengths of said selected sequence, where said contiguous span is a domain described in Table VI of said selected sequence. The present invention also encompasses isolated, purified, or recombinant polypeptides which comprise, consist of or consist essentially of a domain described in Table VI of a sequence selected from the group consisting of the sequences of SEQ ID Nos: 242-482.

Polypeptides of the present invention that are not specifically described in this table are not thereby considered as not belonging to a domain. This is because they may simply not be recognized as such by the particular algorithms used or not be included in the particular database searched. In fact, all fragments of the polypeptides of the present invention, at least 6 amino acid residues in length, are included in the present invention as being a domain. Amino acid residues comprising other domains may be determined by looking in databases other than the ones currently cited to establish Table VI. The domains of the present invention preferably comprise 6 to 200 amino acids (i.e. any integer between 6 and 200, inclusive) of a polypeptide of the present invention. Also included in the present invention are domain fragments between the integers of 6 and the full length GENSET sequence of the sequence listing. All combinations of sequences between the integers of 6 and the full-length sequence of a GENSET polypeptide are included. The domain fragments may be specified by either the number of contiguous amino acid residues (as a sub-genus) or by specific N-terminal and C-terminal positions (as species) as described above for the polypeptide fragments of the present invention. Any number of domain fragments of the present invention may also be excluded in the same manner.

Epitopes and Antibody Fusions:

A preferred embodiment of the present invention is directed to epitope-bearing polypeptides and epitope-bearing polypeptide fragments. These epitopes may be “antigenic epitopes” or both an “antigenic epitope” and an “immunogenic epitope”. An “immunogenic epitope” is defined as a part of a protein that elicits an antibody response in vivo when the polypeptide is the immunogen. On the other hand, a region of polypeptide to which an antibody binds is defined as an “antigenic determinant” or “antigenic epitope.” The number of immunogenic epitopes of a protein generally is less than the number of antigenic epitopes (See, e.g., Geysen, et al., 1984), which disclosure is hereby incorporated by reference in its entirety. It is particularly noted that although a particular epitope may not be immunogenic, it is nonetheless useful since antibodies can be made to both immunogenic and antigenic epitopes.

An epitope can comprise as few as 3 amino acids in a spatial conformation, which is unique to the epitope. Generally an epitope consists of at least 6 such amino acids, and more often at least 8-10 such amino acids. In a preferred embodiment, antigenic epitopes comprise a number of amino acids that is any integer between 3 and 50. Fragments which function as epitopes may be produced by any conventional means (See, e.g., Houghten, 1985), also further described in U.S. Pat. No. 4,631,21, which disclosures are hereby incorporated by reference in their entireties. Methods for determining the amino acids which make up an epitope include x-ray crystallography, 2-dimensional nuclear magnetic resonance, and epitope mapping, e.g., the Pepscan method described by Geysen et al. (1984); PCT Publication No. WO 84/03564; and PCT Publication No. WO 84/03506, which disclosures are hereby incorporated by reference in their entireties. Another example is the algorithm of Jameson and Wolf, (1988) (said reference incorporated by reference in its entirety). The Jameson-Wolf antigenic analysis, for example, may be performed using the computer program PROTEAN, using default parameters (Version 4.0 Windows, DNASTAR, Inc., 1228 South Park Street Madison, Wis.).

Antigenic epitopes predicted by the Jameson-Wolf algorithm for the polypeptides of SEQ ID Nos: 242-482 are presented in Table VII. For each GENSET polypeptide referred to by its sequence identification number in the column entitled “Seq Id No”, a list of antigenic epitopes is given in the column entitled “Epitopes”, each epitope being separated by a comma. Each fragment is represented by a-b where a and b are the start and end positions respectively of a given preferred fragment. It should be noted that in Table VII, the first methionine encountered is designated as amino acid number 1, i.e., the leader sequence is not numbered negatively. In the appended sequence listing, the first amino acid of the mature protein resulting from cleavage of the signal peptide is designated as amino acid number 1 and the first amino acid of the signal peptide is designated with the appropriate negative number, in accordance with the regulations governing sequence listings. As used herein, the term “epitope described in Table VII” refers to all preferred polypeptide fragments described in the second column of Table VII for a GENSET polypeptide referred to by its sequence identification number in the first column. It is pointed out that the immunogenic epitopes listed in Table VII describe only amino acid residues comprising epitopes predicted to have the highest degree of immunogenicity by a particular algorithm. Polypeptides of the present invention that are not specifically described as immunogenic are not considered non-antigenic. This is because they may still be antigenic in vivo but merely not recognized as such by the particular algorithm used. Alternatively, the polypeptides are most likely antigenic in vitro using methods such as phage display. Thus, listed in Table VII are the amino acid residues comprising only preferred epitopes, not a complete list. In fact, all fragments of the polypeptides of the present invention, at least 6 amino acid residues in length, are included in the present invention as being useful as antigenic epitopes. Amino acid residues comprising other immunogenic epitopes may be determined by algorithms similar to the Jameson-Wolf analysis or by in vivo testing for an antigenic response using the methods described herein or those known in the art.

Therefore, the present invention encompasses isolated, purified, or recombinant polypeptides which consist of, consist essentially of, or comprise a contiguous span of at least 6, preferably at least 8 to 10, more preferably 12, 15, 20, 25, 30, 35, 40, 50, 60, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 350, 400, 450 or 500 amino acids of a sequence selected from the group consisting of the sequences of SEQ ID Nos: 242-482, to the extent that a contiguous span of these lengths is consistent with the lengths of said selected sequence, where said contiguous span comprises at least 1, 2, 3, 5, or 10 amino acid positions of an epitope described in Table VII of said selected sequence. The present invention also encompasses isolated, purified, or recombinant polypeptides comprising, consisting essentially of, or consisting of a contiguous span of at least 6, preferably at least 8 to 10, more preferably 12, 15, 20, 25, 30, 35, 40, 50, 60, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 350, 400, 450 or 500 amino acids of a sequence selected from the group consisting of the sequences of SEQ ID Nos: 242-482, to the extent that a contiguous span of these lengths is consistent with the lengths of said selected sequence, where said contiguous span is an epitope described in Table VII of said selected sequence. The present invention also encompasses isolated, purified, or recombinant polypeptides which comprise, consist of or consist essentially of an epitope described in Table VII of a sequence selected from the group consisting of the sequences of SEQ ID Nos: 242-482.
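Whether a given contiguous span comprises a stated number of amino acid positions of an epitope can be checked by intersecting two position ranges. The following Python sketch is illustrative only; the function names `shared_positions` and `span_comprises` are hypothetical, positions are assumed to be 1-based and inclusive in the same numbering for both ranges, and the coordinates used in the checks are invented for the example.

```python
from typing import Tuple

Range = Tuple[int, int]  # (start, end), 1-based and inclusive


def shared_positions(span: Range, epitope: Range) -> int:
    """Return the number of residue positions common to the span and the epitope."""
    start = max(span[0], epitope[0])
    end = min(span[1], epitope[1])
    return max(0, end - start + 1)


def span_comprises(span: Range, epitope: Range, minimum: int = 1) -> bool:
    """Return True if the span contains at least `minimum` positions of the epitope."""
    return shared_positions(span, epitope) >= minimum
```

The same comparison applies to any other feature expressed as a start and end position on the same sequence.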

The epitope-bearing fragments of the present invention preferably comprise 6 to 50 amino acids (i.e. any integer between 6 and 50, inclusive) of a polypeptide of the present invention. Also included in the present invention are antigenic fragments between the integers of 6 and the full length GENSET sequence of the sequence listing. All combinations of sequences between the integers of 6 and the full-length sequence of a GENSET polypeptide are included. The epitope-bearing fragments may be specified by either the number of contiguous amino acid residues (as a sub-genus) or by specific N-terminal and C-terminal positions (as species) as described above for the polypeptide fragments of the present invention. Any number of epitope-bearing fragments of the present invention may also be excluded in the same manner.

Antigenic epitopes are useful, for example, to raise antibodies, including monoclonal antibodies that specifically bind the epitope (See, Wilson et al., 1984; and Sutcliffe, et al., 1983), which disclosures are hereby incorporated by reference in their entireties. The antibodies are then used in various techniques such as diagnostic and tissue/cell identification techniques, as described herein, and in purification methods such as immunoaffinity chromatography.

An antibody or other compound that specifically binds to a polypeptide or polynucleotide of the invention is also said to “selectively recognize” the polypeptide or polynucleotide.

Similarly, immunogenic epitopes can be used to induce antibodies according to methods well known in the art (See, Sutcliffe et al., supra; Wilson et al., supra; Chow et al., (1985); and Bittle, et al., (1985), which disclosures are hereby incorporated by reference in their entireties). A preferred immunogenic epitope includes the natural GENSET protein. The immunogenic epitopes may be presented together with a carrier protein, such as an albumin, to an animal system (such as rabbit or mouse) or, if they are long enough (at least about 25 amino acids), without a carrier. However, immunogenic epitopes comprising as few as 8 to 10 amino acids have been shown to be sufficient to raise antibodies capable of binding to, at the very least, linear epitopes in a denatured polypeptide (e.g., in Western blotting).

Epitope-bearing polypeptides of the present invention are used to induce antibodies according to methods well known in the art including, but not limited to, in vivo immunization, in vitro immunization, and phage display methods (See, e.g., Sutcliffe, et al., supra; Wilson, et al., supra, and Bittle, et al., supra). If in vivo immunization is used, animals may be immunized with free peptide; however, anti-peptide antibody titer may be boosted by coupling of the peptide to a macromolecular carrier, such as keyhole limpet hemocyanin (KLH) or tetanus toxoid. For instance, peptides containing cysteine residues may be coupled to a carrier using a linker such as m-maleimidobenzoyl-N-hydroxysuccinimide ester (MBS), while other peptides may be coupled to carriers using a more general linking agent such as glutaraldehyde. Animals such as rabbits, rats and mice are immunized with either free or carrier-coupled peptides, for instance, by intraperitoneal and/or intradermal injection of emulsions containing about 100 μg of peptide or carrier protein and Freund's adjuvant. Several booster injections may be needed, for instance, at intervals of about two weeks, to provide a useful titer of anti-peptide antibody, which can be detected, for example, by ELISA assay using free peptide adsorbed to a solid surface. The titer of anti-peptide antibodies in serum from an immunized animal may be increased by selection of anti-peptide antibodies, for instance, by adsorption to the peptide on a solid support and elution of the selected antibodies according to methods well known in the art.

As one of skill in the art will appreciate, and discussed above, the polypeptides of the present invention comprising an immunogenic or antigenic epitope can be fused to heterologous polypeptide sequences. For example, the polypeptides of the present invention may be fused with the constant domain of immunoglobulins (IgA, IgE, IgG, IgM), or portions thereof (CH1, CH2, CH3, any combination thereof including both entire domains and portions thereof) resulting in chimeric polypeptides. These fusion proteins facilitate purification, and show an increased half-life in vivo. This has been shown, e.g., for chimeric proteins consisting of the first two domains of the human CD4-polypeptide and various domains of the constant regions of the heavy or light chains of mammalian immunoglobulins (See, e.g., EPA 0,394,827; and Traunecker et al., 1988), which disclosures are hereby incorporated by reference in their entireties. Fusion proteins that have a disulfide-linked dimeric structure due to the IgG portion can also be more efficient in binding and neutralizing other molecules than monomeric polypeptides or fragments thereof alone (See, e.g., Fountoulakis et al., 1995), which disclosure is hereby incorporated by reference in its entirety. Nucleic acids encoding the above epitopes can also be recombined with a gene of interest as an epitope tag to aid in detection and purification of the expressed polypeptide.

Additional fusion proteins of the invention may be generated through the techniques of gene-shuffling, motif-shuffling, exon-shuffling, or codon-shuffling (collectively referred to as “DNA shuffling”). DNA shuffling may be employed to modulate the activities of polypeptides of the present invention, thereby effectively generating agonists and antagonists of the polypeptides. See, for example, U.S. Pat. Nos. 5,605,793; 5,811,238; 5,834,252; 5,837,458; and Patten, et al., (1997); Harayama, (1998); Hansson, et al (1999); and Lorenzo and Blasco, (1998). (Each of these documents is hereby incorporated by reference). In one embodiment, one or more components, motifs, sections, parts, domains, fragments, etc., of coding polynucleotides of the invention, or the polypeptides encoded thereby, may be recombined with one or more components, motifs, sections, parts, domains, fragments, etc. of one or more heterologous molecules.

The present invention further encompasses any combination of the polypeptide fragments listed in this section.

Antibodies:

Definitions

The present invention further relates to antibodies and T-cell antigen receptors (TCR), which specifically bind the polypeptides, and more specifically, the epitopes of the polypeptides of the present invention. The antibodies of the present invention include IgG (including IgG1, IgG2, IgG3, and IgG4), IgA (including IgA1 and IgA2), IgD, IgE, IgM, and IgY. The term “antibody” (Ab) refers to a polypeptide or group of polypeptides comprising at least one binding domain, where a binding domain is formed from the folding of variable domains of an antibody molecule to form three-dimensional binding spaces with an internal surface shape and charge distribution complementary to the features of an antigenic determinant of an antigen, which allows an immunological reaction with the antigen. As used herein, the term “antibody” is meant to include whole antibodies, including single-chain whole antibodies, and antigen binding fragments thereof. In a preferred embodiment, the antibodies are human antigen-binding antibody fragments of the present invention, which include, but are not limited to, Fab, Fab′, F(ab)2 and F(ab′)2, Fd, single-chain Fvs (scFv), single-chain antibodies, disulfide-linked Fvs (sdFv) and fragments comprising either a VL or VH domain. The antibodies may be from any animal origin including birds and mammals. Preferably, the antibodies are human, murine, rabbit, goat, guinea pig, camel, horse, or chicken.

Antigen-binding antibody fragments, including single-chain antibodies, may comprise the variable region(s) alone or in combination with all or part of the following: hinge region, CH1, CH2, and CH3 domains. Also included in the invention are any combinations of variable region(s) and hinge region, CH1, CH2, and CH3 domains. The present invention further includes chimeric, humanized, and human monoclonal and polyclonal antibodies, which specifically bind the polypeptides of the present invention. The present invention further includes antibodies that are anti-idiotypic to the antibodies of the present invention.

The antibodies of the present invention may be monospecific, bispecific, and trispecific or have greater multispecificity. Multispecific antibodies may be specific for different epitopes of a polypeptide of the present invention or may be specific for both a polypeptide of the present invention as well as for heterologous compositions, such as a heterologous polypeptide or solid support material. See, e.g., WO 93/17715; WO 92/08802; WO 91/00360; WO 92/05793; Tutt, et al. (1991); U.S. Pat. Nos. 5,573,920, 4,474,893, 5,601,819, 4,714,681, 4,925,648; Kostelny et al. (1992), which disclosures are hereby incorporated by reference in their entireties.

Antibodies of the present invention may be described or specified in terms of the epitope(s) or epitope-bearing portion(s) of a polypeptide of the present invention, which are recognized or specifically bound by the antibody. The antibodies may specifically bind a complete protein encoded by a nucleic acid of the present invention, or a fragment thereof, particularly, in the case of secreted proteins the mature protein or the signal peptide. Therefore, the epitope(s) or epitope bearing polypeptide portion(s) may be specified as described herein, e.g., by N-terminal and C-terminal positions, by size in contiguous amino acid residues, or otherwise described herein (including the sequence listing). Antibodies which specifically bind any epitope or polypeptide of the present invention may also be excluded as individual species. Therefore, the present invention includes antibodies that specifically bind specified polypeptides of the present invention, and allows for the exclusion of the same.

Thus, another embodiment of the present invention is a purified or isolated antibody capable of specifically binding to a polypeptide comprising a sequence selected from the group consisting of the sequences of SEQ ID Nos: 242-482 and the sequences of the clone inserts of the deposited clone pool. In one aspect of this embodiment, the antibody is capable of binding to an epitope-containing polypeptide comprising at least 6 consecutive amino acids, preferably at least 8 to 10 consecutive amino acids, more preferably at least 12, 15, 20, 25, 30, 40, 50, or 100 consecutive amino acids of a sequence selected from the group consisting of SEQ ID Nos: 242-482 and sequences of the clone inserts of the deposited clone pool.

Antibodies of the present invention may also be described or specified in terms of their cross-reactivity. Antibodies that do not specifically bind any other analog, ortholog, or homologue of the polypeptides of the present invention are included. Antibodies that do not bind polypeptides with less than 95%, less than 90%, less than 85%, less than 80%, less than 75%, less than 70%, less than 65%, less than 60%, less than 55%, and less than 50% identity (as calculated using methods known in the art and described herein, e.g., using FASTDB and the parameters set forth herein) to a polypeptide of the present invention are also included in the present invention. Further included in the present invention are antibodies which bind only polypeptides encoded by polynucleotides which hybridize to a polynucleotide of the present invention under stringent hybridization conditions (as described herein). Antibodies of the present invention may also be described or specified in terms of their binding affinity. Preferred binding affinities include those with a dissociation constant or Kd less than 5×10⁻⁶ M, 10⁻⁶ M, 5×10⁻⁷ M, 10⁻⁷ M, 5×10⁻⁸ M, 10⁻⁸ M, 5×10⁻⁹ M, 10⁻⁹ M, 5×10⁻¹⁰ M, 10⁻¹⁰ M, 5×10⁻¹¹ M, 10⁻¹¹ M, 5×10⁻¹² M, 10⁻¹² M, 5×10⁻¹³ M, 10⁻¹³ M, 5×10⁻¹⁴ M, 10⁻¹⁴ M, 5×10⁻¹⁵ M, and 10⁻¹⁵ M.

The invention also concerns a purified or isolated antibody capable of specifically binding to a mutated GENSET protein or to a fragment or variant thereof comprising an epitope of the mutated GENSET protein.

Preparation of Antibodies

The antibodies of the present invention may be prepared by any suitable method known in the art. Some of these methods are described in more detail in the example entitled “Preparation of Antibody Compositions to”. For example, a polypeptide of the present invention or an antigenic fragment thereof can be administered to an animal in order to induce the production of sera containing “polyclonal antibodies”. As used herein, the term “monoclonal antibody” is not limited to antibodies produced through hybridoma technology but it rather refers to an antibody that is derived from a single clone, including eukaryotic, prokaryotic, or phage clone, and not the method by which it is produced. Monoclonal antibodies can be prepared using a wide variety of techniques known in the art including the use of hybridoma, recombinant, and phage display technology.

Hybridoma techniques include those known in the art (See, e.g., Harlow et al. 1988; Hammerling, et al, 1981). (Said references incorporated by reference in their entireties). Fab and F(ab′)2 fragments may be produced, for example, from hybridoma-produced antibodies by proteolytic cleavage, using enzymes such as papain (to produce Fab fragments) or pepsin (to produce F(ab′)2 fragments).

Alternatively, antibodies of the present invention can be produced through the application of recombinant DNA technology or through synthetic chemistry using methods known in the art. For example, the antibodies of the present invention can be prepared using various phage display methods known in the art. In phage display methods, functional antibody domains are displayed on the surface of a phage particle, which carries polynucleotide sequences encoding them. Phage with a desired binding property are selected from a repertoire or combinatorial antibody library (e.g. human or murine) by selecting directly with antigen, typically antigen bound or captured to a solid surface or bead. Phage used in these methods are typically filamentous phage including fd and M13 with Fab, Fv or disulfide stabilized Fv antibody domains recombinantly fused to either the phage gene III or gene VIII protein. Examples of phage display methods that can be used to make the antibodies of the present invention include those disclosed in Brinkman et al. (1995); Ames, et al. (1995); Kettleborough, et al. (1994); Persic, et al. (1997); Burton et al. (1994); PCT/GB91/01134; WO 90/02809; WO 91/10737; WO 92/01047; WO 92/18619; WO 93/11236; WO 95/15982; WO 95/20401; and U.S. Pat. Nos. 5,698,426, 5,223,409, 5,403,484, 5,580,717, 5,427,908, 5,750,753, 5,821,047, 5,571,698, 5,427,908, 5,516,637, 5,780,225, 5,658,727 and 5,733,743 (said references incorporated by reference in their entireties).

As described in the above references, after phage selection, the antibody coding regions from the phage can be isolated and used to generate whole antibodies, including human antibodies, or any other desired antigen binding fragment, and expressed in any desired host including mammalian cells, insect cells, plant cells, yeast, and bacteria. For example, techniques to recombinantly produce Fab, Fab′, F(ab)2 and F(ab′)2 fragments can also be employed using methods known in the art such as those disclosed in WO 92/22324; Mullinax et al. (1992); Sawai et al. (1995); and Better et al. (1988) (said references incorporated by reference in their entireties).

Examples of techniques which can be used to produce single-chain Fvs and antibodies include those described in U.S. Pat. Nos. 4,946,778 and 5,258,498; Huston et al. (1991); Shu et al. (1993); and Skerra et al. (1988), which disclosures are hereby incorporated by reference in their entireties. For some uses, including in vivo use of antibodies in humans and in vitro detection assays, it may be preferable to use chimeric, humanized, or human antibodies. Methods for producing chimeric antibodies are known in the art. See e.g., Morrison, (1985); Oi et al., (1986); Gillies et al. (1989); and U.S. Pat. No. 5,807,715, which disclosures are hereby incorporated by reference in their entireties. Antibodies can be humanized using a variety of techniques including CDR-grafting (EP 0 239 400; WO 91/09967; U.S. Pat. Nos. 5,530,101; and 5,585,089), veneering or resurfacing, (EP 0 592 106; EP 0 519 596; Padlan, 1991; Studnicka et al., 1994; Roguska et al., 1994), and chain shuffling (U.S. Pat. No. 5,565,332), which disclosures are hereby incorporated by reference in their entireties. Human antibodies can be made by a variety of methods known in the art including phage display methods described above. See also, U.S. Pat. Nos. 4,444,887, 4,716,111, 5,545,806, and 5,814,318; WO 98/46645; WO 98/50433; WO 98/24893; WO 96/34096; WO 96/33735; and WO 91/10741 (said references incorporated by reference in their entireties).

Further included in the present invention are antibodies recombinantly fused or chemically conjugated (including both covalent and non-covalent conjugations) to a polypeptide of the present invention. The antibodies may be specific for antigens other than polypeptides of the present invention. For example, antibodies of the present invention may be recombinantly fused or conjugated to molecules useful as labels in detection assays and effector molecules such as heterologous polypeptides, drugs, or toxins. See, e.g., WO 92/08495; WO 91/14438; WO 89/12624; U.S. Pat. No. 5,314,995; and EP 0 396 387, which disclosures are hereby incorporated by reference in their entireties. Fused antibodies may also be used to target the polypeptides of the present invention to particular cell types, either in vitro or in vivo, by fusing or conjugating the polypeptides of the present invention to antibodies specific for particular cell surface receptors. Antibodies fused or conjugated to the polypeptides of the present invention may also be used in in vitro immunoassays and purification methods using methods known in the art (See e.g., Harbor et al. supra; WO 93/21232; EP 0 439 095; Naramura, M. et al. 1994; U.S. Pat. No. 5,474,981; Gillies et al., 1992; Fell et al., 1991) (said references incorporated by reference in their entireties).

The present invention further includes compositions comprising the polypeptides of the present invention fused or conjugated to antibody domains other than the variable regions. For example, the polypeptides of the present invention may be fused or conjugated to an antibody Fc region, or portion thereof. The antibody portion fused to a polypeptide of the present invention may comprise the hinge region, CH1 domain, CH2 domain, and CH3 domain or any combination of whole domains or portions thereof. The polypeptides of the present invention may be fused or conjugated to the above antibody portions to increase the in vivo half-life of the polypeptides or for use in immunoassays using methods known in the art. The polypeptides may also be fused or conjugated to the above antibody portions to form multimers. For example, Fc portions fused to the polypeptides of the present invention can form dimers through disulfide bonding between the Fc portions. Higher multimeric forms can be made by fusing the polypeptides to portions of IgA and IgM. Methods for fusing or conjugating the polypeptides of the present invention to antibody portions are known in the art. See e.g., U.S. Pat. Nos. 5,336,603, 5,622,929, 5,359,046, 5,349,053, 5,447,851, 5,112,946; EP 0 307 434, EP 0 367 166; WO 96/04388, WO 91/06570; Ashkenazi et al. (1991); Zheng et al. (1995); and Vil et al. (1992) (said references incorporated by reference in their entireties).

Non-human animals or mammals, whether wild-type or transgenic, which express a different species of GENSET than the one to which antibody binding is desired, and animals which do not express GENSET (i.e. a GENSET knock out animal as described herein) are particularly useful for preparing antibodies. GENSET knock out animals will recognize all or most of the exposed regions of a GENSET protein as foreign antigens, and therefore produce antibodies with a wider array of GENSET epitopes. Moreover, smaller polypeptides with only 10 to 30 amino acids may be useful in obtaining specific binding to any one of the GENSET proteins. In addition, the humoral immune system of animals which produce a species of GENSET that resembles the antigenic sequence will preferentially recognize the differences between the animal's native GENSET species and the antigen sequence, and produce antibodies to these unique sites in the antigen sequence. Such a technique will be particularly useful in obtaining antibodies that specifically bind to any one of the GENSET proteins.

The antibodies of the invention may be labeled by any one of the radioactive, fluorescent or enzymatic labels known in the art.

Uses of Polynucleotides

Uses of Polynucleotides as Reagents

The polynucleotides of the present invention, particularly those described in the “Oligonucleotide primers and probes” section, may be used as reagents in isolation procedures, diagnostic assays, and forensic procedures. For example, sequences from the GENSET polynucleotides of the invention may be detectably labeled and used as probes to isolate other sequences capable of hybridizing to them. In addition, sequences from the GENSET polynucleotides of the invention may be used to design PCR primers to be used in isolation, diagnostic, or forensic procedures.

In Forensic Analyses

PCR primers may be used in forensic analyses, such as the DNA fingerprinting techniques described below. Such analyses may utilize detectable probes or primers based on the sequences of the polynucleotides of the invention. Consequently, the present invention encompasses methods of identification of an individual using the polynucleotides of the invention in forensic analyses, wherein said method includes the steps of:

a) obtaining a biological sample containing nucleic acid material from an individual;

b) obtaining an identification pattern for this individual using the polynucleotides of the invention, particularly using GENSET primers and probes;

c) comparing said identification pattern with a reference identification pattern; and

d) determining whether said identification pattern is identical to said reference identification pattern.

In one embodiment of this method, the identification pattern consists of sequences of amplicons obtained using GENSET primers as explained in the sections entitled “Forensic Matching by DNA Sequencing” and “Positive Identification by DNA Sequencing”.

In another embodiment, the identification pattern consists of unique band or dot patterns obtained using any method described in the sections entitled “Southern Blot Forensic Identification”, “Dot Blot Identification Procedure” and “Alternative “Fingerprint” Identification Technique”.

Forensic Matching by DNA Sequencing

In one exemplary method, DNA samples are isolated from forensic specimens of, for example, hair, semen, blood or skin cells by conventional methods. A panel of PCR primers designed from different polynucleotides of the invention using any technique known to those skilled in the art including those described herein, is then utilized to amplify DNA of approximately 100-200 bases in length from the forensic specimen. Corresponding sequences are obtained from a test subject. Each of these identification DNAs is then sequenced using standard techniques, and a simple database comparison determines the differences, if any, between the sequences from the subject and those from the sample. Statistically significant differences between the suspect's DNA sequences and those from the sample conclusively prove a lack of identity. This lack of identity can be proven, for example, with only one sequence. Identity, on the other hand, should be demonstrated with a large number of sequences, all matching. Preferably, a minimum of 50 statistically identical sequences of 100 bases in length are used to prove identity between the suspect and the sample.
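The database comparison described above can be sketched as follows. This is a simplified illustration, not the method of the invention: it assumes error-free sequence reads, so any difference at a shared locus is treated as a significant difference, and the function name `compare_panels`, the locus names, and the return labels are hypothetical. The default of 50 matching sequences follows the preferred minimum stated above.

```python
def compare_panels(sample_seqs, subject_seqs, min_matches=50):
    """Compare amplicon sequences locus by locus.

    Both arguments map a locus name to its sequence. A single differing
    locus is enough to exclude identity; identity is only reported when
    at least min_matches shared loci all match.
    """
    shared = sorted(set(sample_seqs) & set(subject_seqs))
    mismatches = [locus for locus in shared
                  if sample_seqs[locus].upper() != subject_seqs[locus].upper()]
    if mismatches:
        return "excluded", mismatches
    if len(shared) >= min_matches:
        return "identical", []
    return "inconclusive", []


# Hypothetical panel of 50 loci, each 100 bases long.
sample = {f"locus{i}": "ACGT" * 25 for i in range(50)}
subject = dict(sample)
print(compare_panels(sample, subject))
```

The asymmetry in the function mirrors the text: exclusion needs only one differing sequence, whereas identity requires the whole panel to agree.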

Positive Identification by DNA Sequencing

The “Forensic Matching by DNA Sequencing” technique described herein may also be used on a larger scale to provide a unique fingerprint-type identification of any individual. In this technique, primers are prepared from a large number of polynucleotides of the invention. Preferably, 20 to 50 different primers are used. These primers are used to obtain a corresponding number of PCR-generated DNA segments from the individual in question. Each of these DNA segments is sequenced. The database of sequences generated through this procedure uniquely identifies the individual from whom the sequences were obtained. The same panel of primers may then be used at any later time to absolutely correlate tissue or other biological specimen with that individual.

Southern Blot Forensic Identification

The “Positive Identification by DNA Sequencing” procedure described herein is repeated to obtain a panel of at least 10 amplified sequences from an individual and a specimen. Preferably, the panel contains at least 50 amplified sequences. More preferably, the panel contains 100 amplified sequences. In some embodiments, the panel contains 200 amplified sequences. This PCR-generated DNA is then digested with one or a combination of, preferably, four base specific restriction enzymes. Such enzymes are commercially available and known to those of skill in the art. After digestion, the resultant gene fragments are size separated in multiple duplicate wells on an agarose gel and transferred to nitrocellulose using Southern blotting techniques well known to those with skill in the art. For a review of Southern blotting see Davis et al. (1986), which disclosure is hereby incorporated by reference in its entirety.

A panel of probes based on the sequences of the polynucleotides of the invention, or fragments thereof of at least 10 bases, are radioactively or colorimetrically labeled using methods known in the art, such as nick translation or end labeling, and hybridized to the Southern blot using techniques known in the art. Preferably, the probe comprises at least 12, 15, or 17 consecutive nucleotides from the polynucleotide of the invention. More preferably, the probe comprises at least 20-30 consecutive nucleotides from the polynucleotide of the invention. In some embodiments, the probe comprises more than 30 nucleotides from the polynucleotide of the invention. In other embodiments, the probe comprises at least 40, at least 50, at least 75, at least 100, at least 150, or at least 200 consecutive nucleotides from the polynucleotide of the invention.

Preferably, at least 5 to 10 of these labeled probes are used, and more preferably at least about 20 or 30 are used to provide a unique pattern. The resultant bands appearing from the hybridization of a large sample of polynucleotide of the invention will be a unique identifier. Since the restriction enzyme cleavage will be different for every individual, the band pattern on the Southern blot will also be unique. Increasing the number of cDNA probes will provide a statistically higher level of confidence in the identification since there will be an increased number of sets of bands used for identification.

Dot Blot Identification Procedure

Another technique for identifying individuals using the polynucleotide sequences disclosed herein utilizes a dot blot hybridization technique.

Genomic DNA is isolated from nuclei of the subject to be identified. Oligonucleotide probes of approximately 30 bp in length are synthesized that correspond to at least 10, preferably 50, sequences from the polynucleotide of the invention. The probes are hybridized to the genomic DNA under conditions known to those in the art. The oligonucleotides are end-labeled with ³²P using polynucleotide kinase (Pharmacia). Dot blots are created by spotting the genomic DNA onto nitrocellulose or the like using a vacuum dot blot manifold (BioRad, Richmond Calif.). The genomic DNA is baked or UV-linked to the nitrocellulose filter, and the filter is prehybridized and hybridized with labeled probe using techniques known in the art (Davis et al. 1986). The ³²P-labeled DNA fragments are sequentially hybridized under successively more stringent conditions to detect minimal differences between the 30 bp sequence and the DNA. Tetramethylammonium chloride is useful for identifying clones containing small numbers of nucleotide mismatches (Wood et al., 1985). A unique pattern of dots distinguishes one individual from another individual.

Alternative “Fingerprint” Identification Technique

In a representative alternative fingerprinting procedure, the probes are derived from cDNAs. Preferably, a plurality of probes having sequences from different genes are used as follows. Polynucleotides containing at least 10 consecutive bases from these sequences can be used as probes. Preferably, the probe comprises at least 12, 15, or 17 consecutive nucleotides from the polynucleotide of the invention. More preferably, the probe comprises at least 20-30 consecutive nucleotides from the polynucleotide of the invention. In some embodiments, the probe comprises more than 30 nucleotides from the polynucleotide of the invention. In other embodiments, the probe comprises at least 40, at least 50, at least 75, at least 100, at least 150, or at least 200 consecutive nucleotides from the polynucleotide of the invention.

Oligonucleotides, generally 20-mers, are prepared from a large number, e.g. 50, 100, or 200, of polynucleotides of the invention using commercially available oligonucleotide services such as Genset, Paris, France. Cell samples from the test subject are processed for DNA using techniques well known to those with skill in the art. The nucleic acid is digested with restriction enzymes such as EcoRI and XbaI. Following digestion, samples are applied to wells for electrophoresis. The procedure, as known in the art, may be modified to accommodate polyacrylamide electrophoresis; however, in this example, samples containing 5 μg of DNA are loaded into wells and separated on 0.8% agarose gels. The gels are transferred onto nitrocellulose using standard Southern blotting techniques.
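To illustrate why digestion with enzymes such as EcoRI and XbaI yields an individual-specific fragment pattern, the following sketch computes the fragment lengths of a linear sequence. It is a simplified model: only the top strand is considered (EcoRI cuts G^AATTC and XbaI cuts T^CTAGA), overhangs are ignored, and the function name `digest` and the 30-base toy sequence are hypothetical.

```python
# Recognition sites and top-strand cut offsets for the two enzymes named above.
ENZYMES = {"EcoRI": ("GAATTC", 1), "XbaI": ("TCTAGA", 1)}


def digest(dna, enzymes=("EcoRI", "XbaI")):
    """Return fragment lengths, in order, for a linear top-strand sequence."""
    dna = dna.upper()
    cuts = set()
    for name in enzymes:
        site, offset = ENZYMES[name]
        pos = dna.find(site)
        while pos != -1:
            cuts.add(pos + offset)
            pos = dna.find(site, pos + 1)
    edges = [0] + sorted(cuts) + [len(dna)]
    return [right - left for left, right in zip(edges, edges[1:])]


# Hypothetical sequence with one EcoRI site and one XbaI site.
toy = "AAAA" + "GAATTC" + "CCCCCC" + "TCTAGA" + "GGGGGGGG"
print(digest(toy))  # [5, 12, 13]
```

A sequence variant that creates or destroys a recognition site changes the list of fragment lengths, which is what appears as a shifted band after electrophoresis and Southern blotting.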

10 ng of each of the oligonucleotides are pooled and end-labeled with ³²P. The nitrocellulose is prehybridized with blocking solution and hybridized with the labeled probes. Following hybridization and washing, the nitrocellulose filter is exposed to X-Omat AR X-ray film. The resulting hybridization pattern will be unique for each individual.

It is additionally contemplated within this example that the number of probe sequences used can be varied for additional accuracy or clarity.

To Find Corresponding Genomic DNA Sequences

The GENSET cDNAs of the invention may also be used to clone sequences located upstream of the cDNAs of the invention on the corresponding genomic DNA. Such upstream sequences may be capable of regulating gene expression, including promoter sequences, enhancer sequences, and other upstream sequences which influence transcription or translation levels. Once identified and cloned, these upstream regulatory sequences may be used in expression vectors designed to direct the expression of an inserted gene in a desired spatial, temporal, developmental, or quantitative fashion.

Use of cDNAs or Fragments Thereof to Clone Upstream Sequences from Genomic DNA

Sequences derived from polynucleotides of the invention may be used to isolate the promoters of the corresponding genes using chromosome walking techniques. In one chromosome walking technique, which utilizes the GenomeWalker™ kit available from Clontech, five complete genomic DNA samples are each digested with a different restriction enzyme which has a 6 base recognition site and leaves a blunt end. Following digestion, oligonucleotide adapters are ligated to each end of the resulting genomic DNA fragments.

For each of the five genomic DNA libraries, a first PCR reaction is performed according to the manufacturer's instructions (which are incorporated herein by reference) using an outer adaptor primer provided in the kit and an outer gene specific primer. The gene specific primer should be selected to be specific for the polynucleotide of the invention of interest and should have a melting temperature, length, and location in the polynucleotide of the invention which is consistent with its use in PCR reactions. Each first PCR reaction contains 5 ng of genomic DNA, 5 μl of 10× Tth reaction buffer, 0.2 mM of each dNTP, 0.2 μM each of outer adaptor primer and outer gene specific primer, 1.1 mM of Mg(OAc)2, and 1 μl of the Tth polymerase 50× mix in a total volume of 50 μl. The reaction cycle for the first PCR reaction is as follows: 1 min at 94°C; 7 cycles of 2 sec at 94°C and 3 min at 72°C; 32 cycles of 2 sec at 94°C and 3 min at 67°C; and 5 min at 67°C.

The product of the first PCR reaction is diluted and used as a template for a second PCR reaction according to the manufacturer's instructions using a pair of nested primers which are located internally on the amplicon resulting from the first PCR reaction. For example, 5 μl of the reaction product of the first PCR reaction mixture may be diluted 180 times. Reactions are made in a 50 μl volume having a composition identical to that of the first PCR reaction except the nested primers are used. The first nested primer is specific for the adaptor, and is provided with the GenomeWalker™ kit. The second nested primer is specific for the particular polynucleotide of the invention for which the promoter is to be cloned and should have a melting temperature, length, and location in the polynucleotide of the invention which is consistent with its use in PCR reactions. The reaction parameters of the second PCR reaction are as follows: 1 min at 94°C; 6 cycles of 2 sec at 94°C and 3 min at 72°C; 25 cycles of 2 sec at 94°C and 3 min at 67°C; and 5 min at 67°C.
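The cycling parameters of the two PCR reactions, and the dilution between them, can be written down as data and checked arithmetically. The sketch below is illustrative only: the names `FIRST_PCR`, `SECOND_PCR`, `hold_time_seconds` and `diluent_volume` are hypothetical, ramp times between temperatures are ignored, and "diluted 180 times" is assumed to mean a 180-fold dilution (final volume equal to 180 times the sample volume).

```python
# Each entry is (number of cycles, [(seconds, degrees Celsius), ...]).
FIRST_PCR = [
    (1, [(60, 94)]),
    (7, [(2, 94), (180, 72)]),
    (32, [(2, 94), (180, 67)]),
    (1, [(300, 67)]),
]
SECOND_PCR = [
    (1, [(60, 94)]),
    (6, [(2, 94), (180, 72)]),
    (25, [(2, 94), (180, 67)]),
    (1, [(300, 67)]),
]


def hold_time_seconds(program):
    """Total programmed hold time, excluding temperature ramps."""
    return sum(cycles * sum(sec for sec, _ in steps)
               for cycles, steps in program)


def diluent_volume(sample_ul, fold):
    """Volume of diluent to add for a fold-dilution of sample_ul."""
    return sample_ul * (fold - 1)


print(hold_time_seconds(FIRST_PCR) / 60)   # minutes, first reaction
print(hold_time_seconds(SECOND_PCR) / 60)  # minutes, second reaction
print(diluent_volume(5, 180))              # microliters of diluent for 5 ul
```

Under these assumptions the first reaction holds for 7458 seconds (about 124 minutes), the second for 6002 seconds (about 100 minutes), and 5 μl of first-round product is brought to 900 μl by adding 895 μl of diluent.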

The product of the second PCR reaction is purified, cloned, and sequenced using standard techniques. Alternatively, two or more human genomic DNA libraries can be constructed by using two or more restriction enzymes. The digested genomic DNA is cloned into vectors which can be converted into single stranded, circular, or linear DNA. A biotinylated oligonucleotide comprising at least 15 nucleotides from the polynucleotide of the invention sequence is hybridized to the single stranded DNA. Hybrids between the biotinylated oligonucleotide and the single stranded DNA containing the polynucleotide of the invention sequence are isolated as described herein. Thereafter, the single stranded DNA containing the polynucleotide of the invention sequence is released from the beads and converted into double stranded DNA using a primer specific for the polynucleotide of the invention sequence or a primer corresponding to a sequence included in the cloning vector. The resulting double stranded DNA is transformed into bacteria. DNAs containing the GENSET polynucleotide sequences are identified by colony PCR or colony hybridization.

Identification of Promoters in Cloned Upstream Sequences

Once the upstream genomic sequences have been cloned and sequenced as described above, prospective promoters and transcription start sites within the upstream sequences may be identified by comparing the sequences upstream of the polynucleotides of the invention with databases containing known transcription start sites, transcription factor binding sites, or promoter sequences.

In addition, promoters in the upstream sequences may be identified using promoter reporter vectors as follows. The expression of the reporter gene will be detected when placed under the control of regulatory active polynucleotide fragments or variants of the GENSET promoter region located upstream of the first exon of the GENSET gene. Suitable promoter reporter vectors, into which the GENSET promoter sequences may be cloned, include pSEAP-Basic, pSEAP-Enhancer, pβgal-Basic, pβgal-Enhancer, or pEGFP-1 Promoter Reporter vectors available from Clontech, or pGL2-basic or pGL3-basic promoterless luciferase reporter gene vector from Promega. Briefly, each of these promoter reporter vectors includes multiple cloning sites positioned upstream of a reporter gene encoding a readily assayable protein such as secreted alkaline phosphatase, luciferase, beta-galactosidase, or green fluorescent protein. The sequences upstream of the GENSET coding region are inserted into the cloning sites upstream of the reporter gene in both orientations and introduced into an appropriate host cell. The level of reporter protein is assayed and compared to the level obtained from a vector which lacks an insert in the cloning site. The presence of an elevated expression level in the vector containing the insert with respect to the control vector indicates the presence of a promoter in the insert. If necessary, the upstream sequences can be cloned into vectors which contain an enhancer for increasing transcription levels from weak promoter sequences. A significant level of expression above that observed with the vector lacking an insert indicates that a promoter sequence is present in the inserted upstream sequence.
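The comparison between a vector containing the insert and the vector lacking an insert may, for example, be expressed as a fold change. The following Python sketch is illustrative only; the function names and the threshold min_fold are hypothetical, and the text above does not specify what level of expression is to be regarded as significant.

```python
def fold_over_empty_vector(insert_signal, empty_vector_signal):
    """Reporter signal of the insert-containing vector relative to the insert-less control."""
    if empty_vector_signal <= 0:
        raise ValueError("control signal must be positive")
    return insert_signal / empty_vector_signal


def orientations_with_activity(signals, empty_vector_signal, min_fold=3.0):
    """signals maps an orientation label to its reporter signal.

    min_fold is an arbitrary illustrative cutoff, not a value taken from the text.
    """
    return [
        orientation
        for orientation, signal in signals.items()
        if fold_over_empty_vector(signal, empty_vector_signal) >= min_fold
    ]


if __name__ == "__main__":
    # Made-up readings for an insert cloned in both orientations.
    readings = {"forward": 1200.0, "reverse": 110.0}
    print(orientations_with_activity(readings, empty_vector_signal=100.0))
```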

Promoter sequences within the upstream genomic DNA may be further defined by site directed mutagenesis, linker scanning analysis, or other techniques familiar to those skilled in the art. For example, the boundaries of promoters may be further investigated by constructing nested 5′ and/or 3′ deletions in the upstream DNA using conventional techniques such as Exonuclease III or appropriate restriction endonuclease digestion. The resulting deletion fragments can be inserted into the promoter reporter vector to determine whether the deletion has increased, reduced or eliminated promoter activity, such as described, for example, by Coles et al. (1998), the disclosure of which is incorporated herein by reference in its entirety. In this way, the boundaries of the promoters may be defined. If desired, potential individual regulatory sites within the promoter may be identified using site directed mutagenesis or linker scanning to obliterate potential transcription factor binding sites within the promoter individually or in combination. The effects of these mutations on transcription levels may be determined by inserting the mutations into cloning sites in promoter reporter vectors. This type of assay is well known to those skilled in the art and is described in WO 97/17359, U.S. Pat. No. 5,374,544; EP 582 796; U.S. Pat. No. 5,698,389; U.S. Pat. No. 5,643,746; U.S. Pat. No. 5,502,176; and U.S. Pat. No. 5,266,488; the disclosures of which are incorporated by reference herein in their entirety.
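As a simplified illustration of how a nested 5′ deletion series may be read out, the following Python sketch brackets the 5′ boundary between the shortest construct that retains activity and the first construct that loses it. The function name, the coordinate convention (positions relative to the transcription start site) and the activity threshold are hypothetical, and the sketch assumes that activity is simply lost once an essential element is deleted, which need not hold where a deletion increases activity.

```python
def five_prime_boundary(constructs, threshold):
    """Bracket the 5' promoter boundary from a nested 5' deletion series.

    constructs: iterable of (start, activity), where start is the most upstream
    position retained in the construct (e.g. -1000 relative to the start site).
    Returns (last_active_start, first_inactive_start); the second value is None
    when every construct is active.
    """
    last_active = None
    for start, activity in sorted(constructs):  # longest construct first
        if activity >= threshold:
            last_active = start
        else:
            return (last_active, start)
    return (last_active, None)


if __name__ == "__main__":
    # Made-up reporter activities for four deletion constructs.
    series = [(-50, 0.5), (-1000, 10.0), (-200, 9.0), (-500, 9.5)]
    print(five_prime_boundary(series, threshold=2.0))
```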

The strength and the specificity of the promoter of each GENSET gene can be assessed through the expression levels of a detectable polynucleotide operably linked to the GENSET promoter in different types of cells and tissues. The detectable polynucleotide may be either a polynucleotide that specifically hybridizes with a predefined oligonucleotide probe, or a polynucleotide encoding a detectable protein, including a GENSET polypeptide or a fragment or a variant thereof. This type of assay is well known to those skilled in the art and is described in U.S. Pat. No. 5,502,176; and U.S. Pat. No. 5,266,488; the disclosures of which are incorporated by reference herein in their entirety. Some of the methods are discussed in more detail elsewhere in the application.

The promoters and other regulatory sequences located upstream of the polynucleotides of the invention may be used to design expression vectors capable of directing the expression of an inserted gene in a desired spatial, temporal, developmental, or quantitative manner. A promoter capable of directing the desired spatial, temporal, developmental, and quantitative patterns may be selected using the results of the expression analysis described herein. For example, if a promoter which confers a high level of expression in muscle is desired, the promoter sequence upstream of a polynucleotide of the invention derived from an mRNA which is expressed at a high level in muscle may be used in the expression vector. Such vectors are described in more detail elsewhere in the application.

Preferably, the desired promoter is placed near multiple restriction sites to facilitate the cloning of the desired insert downstream of the promoter, such that the promoter is able to drive expression of the inserted gene. The promoter may be inserted in conventional nucleic acid backbones designed for extrachromosomal replication, integration into the host chromosomes or transient expression. Suitable backbones for the present expression vectors include retroviral backbones, backbones from eukaryotic episomes such as SV40 or Bovine Papilloma Virus, backbones from bacterial episomes, or artificial chromosomes.

Preferably, the expression vectors also include a polyA signal downstream of the multiple restriction sites for directing the polyadenylation of mRNA transcribed from the gene inserted into the expression vector.

To Find Similar Sequences

Polynucleotides of the invention may be used to isolate and/or purify nucleic acids similar thereto using any methods well known to those skilled in the art including the techniques based on hybridization or on amplification described in this section. These methods may be used to obtain the genomic DNAs which encode the mRNAs from which the GENSET cDNAs are derived, mRNAs corresponding to GENSET cDNAs, or nucleic acids which are homologous to GENSET cDNAs or fragments thereof, such as variants, species homologues or orthologs. Thus, a plurality of cDNAs similar to GENSET polynucleotides may be provided as cDNA libraries for subsequent evaluation of the encoded proteins or use in diagnostic assays as described herein. cDNAs prepared by any method described herein may be subsequently engineered to obtain nucleic acids which include desired fragments of the cDNA using conventional techniques such as subcloning, PCR, or in vitro oligonucleotide synthesis. For example, nucleic acids which include only the coding sequences may be obtained using techniques known to those skilled in the art. Similarly, nucleic acids containing any other desired fragment of the coding sequences for the encoded protein may be obtained.

Indeed, cDNAs of the present invention or fragments thereof may be used to isolate nucleic acids similar to cDNAs from a cDNA library or a genomic DNA library. Such cDNA libraries or genomic DNA libraries may be obtained from a commercial source or made using techniques familiar to those skilled in the art such as those described in PCT publication WO 00/37491, which disclosure is hereby incorporated by reference in its entirety. Examples of methods for obtaining nucleic acids similar to GENSET polynucleotides are described below.

Hybridization-Based Methods

Techniques for identifying cDNA clones in a cDNA library which hybridize to a given probe sequence are disclosed in Sambrook et al., (1989) and in Hames and Higgins (1985), the disclosures of which are incorporated herein by reference in their entireties. The same techniques may be used to isolate genomic DNAs.

Briefly, cDNA or genomic DNA clones which hybridize to the detectable probe are identified and isolated for further manipulation as follows. Any polynucleotide fragment of the invention may be used as a probe, in particular those defined in the “Oligonucleotide primers and probes” section. A probe comprising at least 10 consecutive nucleotides from a GENSET cDNA or fragment thereof is labeled with a detectable label such as a radioisotope or a fluorescent molecule. Preferably, the probe comprises at least 12, 15, or 17 consecutive nucleotides from the cDNA or fragment thereof. More preferably, the probe comprises 20 to 30 consecutive nucleotides from the cDNA or fragment thereof. In some embodiments, the probe comprises more than 30 nucleotides from the cDNA or fragment thereof.

Techniques for labeling the probe are well known and include phosphorylation with polynucleotide kinase, nick translation, in vitro transcription, and non radioactive techniques. The cDNAs or genomic DNAs in the library are transferred to a nitrocellulose or nylon filter and denatured. After blocking of non specific sites, the filter is incubated with the labeled probe for an amount of time sufficient to allow binding of the probe to cDNAs or genomic DNAs containing a sequence capable of hybridizing thereto.

By varying the stringency of the hybridization conditions used to identify cDNAs or genomic DNAs which hybridize to the detectable probe, cDNAs or genomic DNAs having different levels of identity to the probe can be identified and isolated as described below.

Stringent Conditions

“Stringent hybridization conditions” are defined as conditions in which only nucleic acids having a high level of identity to the probe are able to hybridize to said probe. These conditions may be calculated as follows:

For probes between 14 and 70 nucleotides in length the melting temperature (Tm) is calculated using the formula: Tm = 81.5 + 16.6(log [Na+]) + 0.41(fraction G+C) − (600/N), where N is the length of the probe.

If the hybridization is carried out in a solution containing formamide, the melting temperature may be calculated using the equation: Tm = 81.5 + 16.6(log [Na+]) + 0.41(fraction G+C) − 0.63(% formamide) − (600/N), where N is the length of the probe.
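The two formulas above may be evaluated as in the following Python sketch, which is illustrative only. The function name and parameters are hypothetical. The logarithm is assumed to be base 10 and the sodium concentration is assumed to be molar. The G+C term is taken here as a percentage (0 to 100), as in the commonly cited form of this formula, although the text above words it as a fraction; this is an assumption and not a statement of the source.

```python
import math


def melting_temperature(na_molar, gc_percent, length, formamide_percent=0.0):
    """Tm in deg C: 81.5 + 16.6*log10[Na+] + 0.41*(%G+C) - 0.63*(%formamide) - 600/N.

    Assumptions (not stated in the source): base-10 logarithm, molar [Na+],
    and G+C content given as a percentage between 0 and 100.
    """
    if na_molar <= 0:
        raise ValueError("sodium concentration must be positive")
    if length <= 0:
        raise ValueError("probe length must be positive")
    return (
        81.5
        + 16.6 * math.log10(na_molar)
        + 0.41 * gc_percent
        - 0.63 * formamide_percent
        - 600.0 / length
    )


if __name__ == "__main__":
    # Hypothetical 20-nucleotide probe, 50% G+C, 1 M sodium.
    print(round(melting_temperature(1.0, 50.0, 20), 1))
    print(round(melting_temperature(1.0, 50.0, 20, formamide_percent=50.0), 1))
```

The first formula is stated above for probes between 14 and 70 nucleotides in length; the sketch does not enforce that range.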

Prehybridization may be carried out in 6×SSC, 5× Denhardt's reagent, 0.5% SDS, 100 μg denatured fragmented salmon sperm DNA or 6×SSC, 5× Denhardt's reagent, 0.5% SDS, 100 μg denatured fragmented salmon sperm DNA, 50% formamide. The formulas for SSC and Denhardt's solutions are listed in Sambrook et al., 1986.

Hybridization is conducted by adding the detectable probe to the prehybridization solutions listed above. Where the probe comprises double stranded DNA, it is denatured before addition to the hybridization solution. The filter is contacted with the hybridization solution for a sufficient period of time to allow the probe to hybridize to nucleic acids containing sequences complementary thereto or homologous thereto. For probes over 200 nucleotides in length, the hybridization may be carried out at 15-25° C. below the Tm. For shorter probes, such as oligonucleotide probes, the hybridization may be conducted at 15-25° C. below the Tm. Preferably, for hybridizations in 6×SSC, the hybridization is conducted at approximately 68° C. Preferably, for hybridizations in 50% formamide containing solutions, the hybridization is conducted at approximately 42° C.

Following hybridization, the filter is washed in 2×SSC, 0.1% SDS at room temperature for 15 minutes. The filter is then washed with 0.1×SSC, 0.5% SDS at room temperature for 30 minutes to 1 hour. Thereafter, the filter is washed at the hybridization temperature in 0.1×SSC, 0.5% SDS. A final wash is conducted in 0.1×SSC at room temperature.

Nucleic acids which have hybridized to the probe are identified by autoradiography or other conventional techniques.

Low and Moderate Conditions

Changes in the stringency of hybridization and signal detection are primarily accomplished through the manipulation of formamide concentration (lower percentages of formamide result in lowered stringency), salt conditions, or temperature. The above procedure may thus be modified to identify nucleic acids having decreasing levels of identity to the probe sequence. For example, the hybridization temperature may be decreased in increments of 5° C. from 68° C. to 42° C. in a hybridization buffer having a sodium concentration of approximately 1M. Following hybridization, the filter may be washed with 2×SSC, 0.5% SDS at the temperature of hybridization. These conditions are considered to be “moderate” conditions above 50° C. and “low” conditions below 50° C. Alternatively, the hybridization may be carried out in buffers, such as 6×SSC, containing formamide at a temperature of 42° C. In this case, the concentration of formamide in the hybridization buffer may be reduced in 5% increments from 50% to 0% to identify clones having decreasing levels of identity to the probe. Following hybridization, the filter may be washed with 6×SSC, 0.5% SDS at 50° C. These conditions are considered to be “moderate” conditions above 25% formamide and “low” conditions below 25% formamide. cDNAs or genomic DNAs which have hybridized to the probe are identified by autoradiography or other conventional techniques.
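The temperature and formamide series described above, together with the “moderate” and “low” designations, may be enumerated as in the following illustrative Python sketch. The function names are hypothetical. The text does not say how a condition of exactly 50° C. or exactly 25% formamide is designated, so the sketch reports such a value as "unspecified" rather than choosing a side.

```python
def temperature_series(start=68, stop=42, step=5):
    """Hybridization temperatures stepped down from start in 5 deg C increments."""
    return list(range(start, stop - 1, -step))


def formamide_series(start=50, stop=0, step=5):
    """Formamide percentages stepped down from start in 5% increments."""
    return list(range(start, stop - 1, -step))


def classify_by_temperature(temp_c):
    if temp_c > 50:
        return "moderate"
    if temp_c < 50:
        return "low"
    return "unspecified"  # exactly 50 deg C is not assigned by the text


def classify_by_formamide(percent):
    if percent > 25:
        return "moderate"
    if percent < 25:
        return "low"
    return "unspecified"  # exactly 25% is not assigned by the text


if __name__ == "__main__":
    for t in temperature_series():
        print(t, classify_by_temperature(t))
    for p in formamide_series():
        print(p, classify_by_formamide(p))
```

With 5° C. steps starting at 68° C., the last temperature reached at or above 42° C. is 43° C.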

Note that variations in the above conditions may be accomplished through the inclusion and/or substitution of alternate blocking reagents used to suppress background in hybridization experiments. Typical blocking reagents include Denhardt's reagent, BLOTTO, heparin, denatured salmon sperm DNA, and commercially available proprietary formulations. The inclusion of specific blocking reagents may require modification of the hybridization conditions described above, due to problems with compatibility.

Consequently, the present invention encompasses methods of isolating nucleic acids similar to the polynucleotides of the invention, comprising the steps of:

a) contacting a collection of cDNA or genomic DNA molecules with a detectable probe comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40 or 50 consecutive nucleotides of a sequence selected from the group consisting of the sequences of SEQ ID Nos: 1-241, the sequences of clone inserts of the deposited clone pool and sequences complementary thereto under stringent, moderate or low conditions which permit said probe to hybridize to at least a cDNA or genomic DNA molecule in said collection;

b) identifying said cDNA or genomic DNA molecule which hybridizes to said detectable probe; and

c) isolating said cDNA or genomic DNA molecule which hybridized to said probe.
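Steps a) to c) may be mimicked in silico as a rough aid to probe design. The following Python sketch is illustrative only: it uses exact matching of the probe or its reverse complement as a crude stand-in for hybridization, it does not model stringency, and all names and sequences in it are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def screen_collection(collection, probe, min_length=12):
    """Return names of sequences that contain the probe or its reverse complement.

    collection maps a clone name to its sequence. Exact matching is only a
    proxy; real hybridization tolerates mismatches depending on the conditions.
    """
    probe = probe.upper()
    if len(probe) < min_length:
        raise ValueError("probe is shorter than the minimum length")
    rc = reverse_complement(probe)
    return [
        name
        for name, seq in collection.items()
        if probe in seq.upper() or rc in seq.upper()
    ]


if __name__ == "__main__":
    # Made-up clone sequences for demonstration.
    clones = {
        "clone_a": "GGGATGCCATGCAAGTCCTT",
        "clone_b": reverse_complement("CCATGCCATGCAAGAA"),
        "clone_c": "TTTTTTTTTTTTTTTT",
    }
    print(screen_collection(clones, "ATGCCATGCAAG"))
```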

PCR-Based Methods

In addition to the above described methods, other protocols are available to obtain homologous cDNAs using GENSET cDNA of the present invention or fragment thereof as outlined in the following paragraphs.

cDNAs may be prepared by obtaining mRNA from the tissue, cell, or organism of interest using mRNA preparation procedures utilizing polyA selection procedures or other techniques known to those skilled in the art. A first primer capable of hybridizing to the polyA tail of the mRNA is hybridized to the mRNA and a reverse transcription reaction is performed to generate a first cDNA strand.

The term “capable of hybridizing to the polyA tail of said mRNA” refers to and embraces all primers containing stretches of thymidine residues, so-called oligo(dT) primers, that hybridize to the 3′ end of eukaryotic poly(A)+ mRNAs to prime the synthesis of a first cDNA strand. Techniques for generating said oligo(dT) primers and hybridizing them to mRNA to subsequently prime the reverse transcription of said hybridized mRNA to generate a first cDNA strand are well known to those skilled in the art and are described in Current Protocols in Molecular Biology, John Wiley and Sons, Inc. 1997 and Sambrook et al., 1989. Preferably, said oligo(dT) primers are present in a large excess in order to allow the hybridization of all mRNA 3′ ends to at least one oligo(dT) molecule. The priming and reverse transcription steps are preferably performed between 37° C. and 55° C. depending on the type of reverse transcriptase used. Preferred oligo(dT) primers for priming reverse transcription of mRNAs are oligonucleotides containing a stretch of thymidine residues of sufficient length to hybridize specifically to the polyA tail of mRNAs, preferably of 12 to 18 thymidine residues in length. More preferably, such oligo(dT) primers comprise an additional sequence upstream of the poly(dT) stretch in order to allow the addition of a given sequence to the 5′ end of all first cDNA strands which may then be used to facilitate subsequent manipulation of the cDNA. Preferably, this added sequence is 8 to 60 residues in length. For instance, the addition of a restriction site at the 5′ end of cDNAs facilitates subcloning of the obtained cDNA. Alternatively, such an added 5′ end may also be used to design PCR primers to specifically amplify cDNA clones of interest.
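An oligo(dT) primer carrying an added 5′ sequence within the preferred ranges given above may be assembled as in the following illustrative Python sketch. The function name and the example adapter are hypothetical; the adapter shown merely happens to contain a NotI recognition site (GCGGCCGC) as an example of a restriction site.

```python
def build_anchored_oligo_dt(adapter, t_count=18):
    """Return the primer 5'-adapter-(T)n-3'.

    Enforces the preferred ranges from the text: 12 to 18 thymidine residues
    and an added 5' sequence of 8 to 60 residues.
    """
    adapter = adapter.upper()
    if not 12 <= t_count <= 18:
        raise ValueError("thymidine stretch should be 12 to 18 residues")
    if not 8 <= len(adapter) <= 60:
        raise ValueError("added 5' sequence should be 8 to 60 residues")
    if set(adapter) - set("ACGT"):
        raise ValueError("adapter must contain only A, C, G, T")
    return adapter + "T" * t_count


if __name__ == "__main__":
    # Hypothetical adapter containing a NotI site (GCGGCCGC).
    print(build_anchored_oligo_dt("ATAAGAATGCGGCCGC", t_count=15))
```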

The first cDNA strand is then hybridized to a second primer. Any polynucleotide fragment of the invention may be used, and in particular those described in the “Oligonucleotide primers and probes” section. This second primer contains at least 10 consecutive nucleotides of a polynucleotide of the invention. Preferably, the primer comprises at least 10, 12, 15, 17, 18, 20, 23, 25, or 28 consecutive nucleotides of a polynucleotide of the invention. In some embodiments, the primer comprises more than 30 nucleotides of a polynucleotide of the invention. If it is desired to obtain cDNAs containing the full protein coding sequence, including the authentic translation initiation site, the second primer used contains sequences located upstream of the translation initiation site. The second primer is extended to generate a second cDNA strand complementary to the first cDNA strand. Alternatively, RT-PCR may be performed as described above using primers from both ends of the cDNA to be obtained.

The double stranded cDNAs made using the methods described above are isolated and cloned. The cDNAs may be cloned into vectors such as plasmids or viral vectors capable of replicating in an appropriate host cell. For example, the host cell may be a bacterial, mammalian, avian, or insect cell.

Techniques for isolating mRNA, reverse transcribing a primer hybridized to mRNA to generate a first cDNA strand, extending a primer to make a second cDNA strand complementary to the first cDNA strand, isolating the double stranded cDNA and cloning the double stranded cDNA are well known to those skilled in the art and are described in Current Protocols in Molecular Biology, John Wiley & Sons, Inc. 1997 and Sambrook et al., 1989.

Consequently, the present invention encompasses methods of making cDNAs. In a first embodiment, the method of making a cDNA comprises the steps of

a) contacting a collection of mRNA molecules from human cells with a primer comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, or 50 consecutive nucleotides of a sequence selected from the group consisting of the sequences complementary to SEQ ID Nos: 1-241 and sequences complementary to a clone insert of the deposited clone pool;

b) hybridizing said primer to an mRNA in said collection;

c) reverse transcribing said hybridized primer to make a first cDNA strand from said mRNA;

d) making a second cDNA strand complementary to said first cDNA strand; and

e) isolating the resulting cDNA comprising said first cDNA strand and said second cDNA strand.

Another embodiment of the present invention is a purified cDNA obtainable by the method of the preceding paragraph. In one aspect of this embodiment, the cDNA encodes at least a portion of a human polypeptide.

In a second embodiment, the method of making a cDNA comprises the steps of

a) contacting a collection of mRNA molecules from human cells with a first primer capable of hybridizing to the polyA tail of said mRNA;

b) hybridizing said first primer to said polyA tail;

c) reverse transcribing said mRNA to make a first cDNA strand;

d) making a second cDNA strand complementary to said first cDNA strand using at least one primer comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, or 50 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID Nos: 1-241 and sequences of clone inserts of the deposited clone pool; and

e) isolating the resulting cDNA comprising said first cDNA strand and said second cDNA strand.

In another aspect of this method the second cDNA strand is made by

a) contacting said first cDNA strand with a second primer comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, or 50 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID Nos: 1-241 and sequences of clone inserts of the deposited clone pool, and a third primer which sequence is fully included within the sequence of said first primer;

b) performing a first polymerase chain reaction with said second and third primers to generate a first PCR product;

c) contacting said first PCR product with a fourth primer, comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, or 50 consecutive nucleotides of said sequence selected from the group consisting of SEQ ID Nos: 1-241 and sequences of clone inserts of the deposited clone pool, and a fifth primer, which sequence is fully included within the sequence of said third primer, wherein said fourth and fifth primers hybridize to sequences within said first PCR product; and

d) performing a second polymerase chain reaction, thereby generating a second PCR product.
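The inclusion relationships recited in steps a) to d), namely that the third primer is fully included within the first primer and the fifth primer within the third, may be checked as in the following illustrative Python sketch. The function names and primer sequences are hypothetical.

```python
def is_fully_included(inner, outer):
    """True when the inner primer sequence occurs entirely within the outer one."""
    return inner.upper() in outer.upper()


def check_nested_adapter_primers(first, third, fifth):
    """Check that third lies within first and fifth lies within third."""
    return is_fully_included(third, first) and is_fully_included(fifth, third)


if __name__ == "__main__":
    # Hypothetical primers: an anchored oligo(dT) first primer and two primers
    # derived from its added 5' sequence.
    first = "ATAAGAATGCGGCCGCTAGC" + "T" * 15
    third = "ATAAGAATGCGGCCGCTAGC"
    fifth = "GAATGCGGCCGCTAGC"
    print(check_nested_adapter_primers(first, third, fifth))
```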

Alternatively, the second cDNA strand may be made by contacting said first cDNA strand with a second primer comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, or 50 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID Nos: 1-241 and sequences of clone inserts of the deposited clone pool, and a third primer which sequence is fully included within the sequence of said first primer and performing a polymerase chain reaction with said second and third primers to generate said second cDNA strand.

Alternatively, the second cDNA strand may be made by

a) contacting said first cDNA strand with a second primer comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, or 50 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID Nos: 1-241 and sequences of clone inserts of the deposited clone pool;

b) hybridizing said second primer to said first strand cDNA; and

c) extending said hybridized second primer to generate said second cDNA strand.

Another embodiment of the present invention is a purified cDNA obtainable by a method of making a cDNA of the invention. In one aspect of this embodiment, said cDNA encodes at least a portion of a human polypeptide.

Other Protocols

Alternatively, other procedures may be used for obtaining homologous cDNAs. In one approach, cDNAs are prepared from mRNA and cloned into double stranded phagemids as follows. The cDNA library in the double stranded phagemids is then rendered single stranded by treatment with an endonuclease, such as the Gene II product of the phage F1 and an exonuclease (Chang et al., 1993, which disclosure is hereby incorporated by reference in its entirety). A biotinylated oligonucleotide comprising the sequence of a fragment of a known GENSET cDNA, genomic DNA or fragment thereof is hybridized to the single stranded phagemids. Preferably, the fragment comprises at least 10, 12, 15, 17, 18, 20, 23, 25, or 28 consecutive nucleotides of a sequence selected from the group consisting of the sequences of SEQ ID Nos: 1-241 and sequences of clone inserts of the deposited clone pool.

Hybrids between the biotinylated oligonucleotide and phagemids are isolated by incubating the hybrids with streptavidin coated paramagnetic beads and retrieving the beads with a magnet (Fry et al., 1992, which disclosure is hereby incorporated by reference in its entirety). Thereafter, the resulting phagemids are released from the beads and converted into double stranded DNA using a primer specific for the GENSET cDNA or fragment used to design the biotinylated oligonucleotide. Alternatively, protocols such as the Gene Trapper kit (Gibco BRL), which disclosure is hereby incorporated by reference in its entirety, may be used. The resulting double stranded DNA is transformed into bacteria. cDNAs homologous to the GENSET cDNA or fragment thereof are identified by colony PCR or colony hybridization.

As a Chromosome Marker

Chromosomal localizations of the cDNAs of the present invention were determined using information from public and proprietary databases. Table VIII lists the putative chromosomal location of the polynucleotides of the present invention. Column one lists the sequence identification number with the corresponding chromosomal location listed in column two. Thus, the present invention also relates to methods and compositions using the chromosomal location of the polynucleotides of the invention to construct a human high resolution map or to identify a given chromosome in a sample using any techniques known to those skilled in the art including those disclosed below.

GENSET polynucleotides may also be mapped to their chromosomal locations using any methods or techniques known to those skilled in the art including radiation hybrid (RH) mapping, PCR-based mapping and Fluorescence in situ hybridization (FISH) mapping described below.

Radiation Hybrid Mapping

Radiation hybrid (RH) mapping is a somatic cell genetic approach that can be used for high resolution mapping of the human genome. In this approach, cell lines containing one or more human chromosomes are lethally irradiated, breaking each chromosome into fragments whose size depends on the radiation dose. These fragments are rescued by fusion with cultured rodent cells, yielding subclones containing different fragments of the human genome. This technique is described by Benham et al. (1989) and Cox et al., (1990), which disclosures are hereby incorporated by reference in their entireties. The random and independent nature of the subclones permits efficient mapping of any human genome marker. Human DNA isolated from a panel of 80-100 cell lines provides a mapping reagent for ordering GENSET cDNAs or genomic DNAs. In this approach, the frequency of breakage between markers is used to measure distance, allowing construction of fine resolution maps as has been done using conventional ESTs (Schuler et al., 1996), which disclosure is hereby incorporated by reference in its entirety.
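The use of breakage frequency as a measure of distance may be illustrated by the following Python sketch. The text above gives no formula; the sketch assumes a commonly cited two-point estimator in which the number of hybrids retaining only one of two markers is divided by T(RA + RB − 2·RA·RB), where T is the number of hybrids and RA and RB are the retention fractions of the two markers, and it assumes the mapping function D = −ln(1 − θ) expressed in centiRays. The function names and retention patterns are hypothetical.

```python
import math


def breakage_frequency(pattern_a, pattern_b):
    """Two-point breakage frequency estimate (theta) between markers A and B.

    pattern_a and pattern_b hold one boolean per hybrid cell line
    (True = marker retained). The estimator used here is an assumption,
    not a formula given in the text.
    """
    if len(pattern_a) != len(pattern_b) or not pattern_a:
        raise ValueError("patterns must be non-empty and of equal length")
    total = len(pattern_a)
    discordant = sum(1 for a, b in zip(pattern_a, pattern_b) if a != b)
    r_a = sum(pattern_a) / total
    r_b = sum(pattern_b) / total
    denominator = total * (r_a + r_b - 2 * r_a * r_b)
    if denominator == 0:
        raise ValueError("retention patterns are uninformative")
    return min(discordant / denominator, 1.0)


def distance_centirays(theta):
    """Map distance in centiRays, assuming D = -ln(1 - theta)."""
    if theta >= 1.0:
        return math.inf
    return -100.0 * math.log(1.0 - theta)


if __name__ == "__main__":
    # Made-up retention patterns across ten hybrids.
    a = [True, True, False, False, True, False, True, False, True, False]
    b = [True, True, False, False, True, False, True, False, False, True]
    theta = breakage_frequency(a, b)
    print(theta, distance_centirays(theta))
```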

RH mapping has been used to generate a high-resolution whole genome radiation hybrid map of human chromosome 17q22-q25.3 across the genes for growth hormone (GH) and thymidine kinase (TK) (Foster et al., 1996), the region surrounding the Gorlin syndrome gene (Obermayr et al., 1996), 60 loci covering the entire short arm of chromosome 12 (Raeymaekers et al., 1995), the region of human chromosome 22 containing the neurofibromatosis type 2 locus (Frazer et al., 1992) and 13 loci on the long arm of chromosome 5 (Warrington et al., 1991), which disclosures are hereby incorporated by reference in their entireties.

Mapping of cDNAs to Human Chromosomes Using PCR Techniques

GENSET cDNAs and genomic DNAs may be assigned to human chromosomes using PCR based methodologies. In such approaches, oligonucleotide primer pairs are designed from the cDNA sequence to minimize the chance of amplifying through an intron. Preferably, the oligonucleotide primers are 18-23 bp in length and are designed for PCR amplification. The creation of PCR primers from known sequences is well known to those with skill in the art. For a review of PCR technology see Erlich (1992), which disclosure is hereby incorporated by reference in its entirety.

The primers are used in polymerase chain reactions (PCR) to amplify templates from total human genomic DNA. PCR conditions are as follows: 60 ng of genomic DNA is used as a template for PCR with 80 ng of each oligonucleotide primer, 0.6 unit of Taq polymerase, and 1 μCi of a 32P-labeled deoxycytidine triphosphate. The PCR is performed in a microplate thermocycler (Techne) under the following conditions: 30 cycles of 94 degree Celsius, 1.4 min; 55 degree Celsius, 2 min; and 72 degree Celsius, 2 min; with a final extension at 72 degree Celsius for 10 min. The amplified products are analyzed on a 6% polyacrylamide sequencing gel and visualized by autoradiography. If the length of the resulting PCR product is identical to the distance between the ends of the primer sequences in the cDNA from which the primers are derived, then the PCR reaction is repeated with DNA templates from two panels of human-rodent somatic cell hybrids, BIOS PCRable DNA (BIOS Corporation) and NIGMS Human-Rodent Somatic Cell Hybrid Mapping Panel Number 1 (NIGMS, Camden, N.J.).
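The expected product length, that is, the distance between the ends of the primer sequences in the cDNA, may be computed as in the following illustrative Python sketch. The function names and the example sequences are hypothetical, and the sketch assumes that both primers match the cDNA exactly.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def expected_amplicon_length(cdna, forward_primer, reverse_primer):
    """Length from the 5' end of the forward primer to the 5' end of the reverse primer."""
    cdna = cdna.upper()
    start = cdna.find(forward_primer.upper())
    if start < 0:
        raise ValueError("forward primer not found in cDNA")
    reverse_site = reverse_complement(reverse_primer)
    site_start = cdna.find(reverse_site, start)
    if site_start < 0:
        raise ValueError("reverse primer site not found downstream of forward primer")
    return site_start + len(reverse_site) - start


def matches_cdna_length(observed_length, expected_length):
    """True when the genomic product has the length predicted from the cDNA."""
    return observed_length == expected_length


if __name__ == "__main__":
    # Made-up cDNA and primers for demonstration.
    cdna = "ATGGCCAAGTTCGACCTGAAGGAGCTGTAA"
    print(expected_amplicon_length(cdna, "ATGGCCAAG", "TTACAGCTC"))
```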

PCR is used to screen a series of somatic cell hybrid cell lines containing defined sets of human chromosomes for the presence of a given cDNA or genomic DNA. DNA is isolated from the somatic hybrids and used as starting templates for PCR reactions using the primer pairs from the GENSET cDNAs or genomic DNAs. Only those somatic cell hybrids with chromosomes containing the human gene corresponding to the GENSET cDNA or genomic DNA will yield an amplified fragment. The GENSET cDNAs or genomic DNAs are assigned to a chromosome by analysis of the segregation pattern of PCR products from the somatic hybrid DNA templates. The single human chromosome present in all cell hybrids that give rise to an amplified fragment is the chromosome containing that GENSET cDNA or genomic DNA. For a review of techniques and analysis of results from somatic cell gene mapping experiments, see Ledbetter et al., (1990), which disclosure is hereby incorporated by reference in its entirety.
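The segregation analysis described above may be sketched in Python as follows, for illustration only. The function name, the panel layout and the hybrid names are hypothetical. The sketch keeps the chromosomes present in every hybrid that yields an amplified fragment and removes any chromosome also present in a hybrid that yields none.

```python
def assign_chromosome(panel, positives):
    """Return candidate chromosomes consistent with the PCR results.

    panel maps a hybrid name to the set of human chromosomes it contains;
    positives is the set of hybrid names that gave an amplified fragment.
    """
    positive_sets = [set(chroms) for name, chroms in panel.items() if name in positives]
    if not positive_sets:
        return set()
    candidates = set.intersection(*positive_sets)
    for name, chroms in panel.items():
        if name not in positives:
            candidates -= set(chroms)  # a negative hybrid rules out its chromosomes
    return candidates


if __name__ == "__main__":
    # Made-up panel of four hybrids.
    panel = {
        "hybrid_1": {"1", "7", "X"},
        "hybrid_2": {"7", "12"},
        "hybrid_3": {"1", "12"},
        "hybrid_4": {"7", "X", "21"},
    }
    print(assign_chromosome(panel, {"hybrid_1", "hybrid_2", "hybrid_4"}))
```

A result containing exactly one chromosome corresponds to an unambiguous assignment; an empty or larger set suggests that the panel or the PCR results need to be revisited.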

Mapping of cDNAs to Chromosomes Using Fluorescence in Situ Hybridization

Fluorescence in situ hybridization allows the GENSET cDNA or genomic DNA to be mapped to a particular location on a given chromosome. The chromosomes to be used for fluorescence in situ hybridization techniques may be obtained from a variety of sources including cell cultures, tissues, or whole blood.

In a preferred embodiment, chromosomal localization of a GENSET cDNA or genomic DNA is obtained by FISH as described by Cherif et al. (1990), which disclosure is hereby incorporated by reference in its entirety. Metaphase chromosomes are prepared from phytohemagglutinin (PHA)-stimulated blood cells of donors. PHA-stimulated lymphocytes from healthy males are cultured for 72 h in RPMI-1640 medium. For synchronization, methotrexate (10 μM) is added for 17 h, followed by addition of 5-bromodeoxyuridine (5-BudR, 0.1 mM) for 6 h. Colcemid (1 μg/ml) is added for the last 15 min before harvesting the cells. Cells are collected, washed in RPMI, incubated with a hypotonic solution of KCl (75 mM) at 37 degree Celsius for 15 min and fixed in three changes of methanol:acetic acid (3:1). The cell suspension is dropped onto a glass slide and air dried. The GENSET cDNA or genomic DNA is labeled with biotin-16 dUTP by nick translation according to the manufacturer's instructions (Bethesda Research Laboratories, Bethesda, Md.), purified using a Sephadex G-50 column (Pharmacia, Uppsala, Sweden) and precipitated. Just prior to hybridization, the DNA pellet is dissolved in hybridization buffer (50% formamide, 2×SSC, 10% dextran sulfate, 1 mg/ml sonicated salmon sperm DNA, pH 7) and the probe is denatured at 70 degree Celsius for 5-10 min.

Slides kept at −20 degree Celsius are treated for 1 h at 37 degree Celsius with RNase A (100 ug/ml), rinsed three times in 2×SSC and dehydrated in an ethanol series. Chromosome preparations are denatured in 70% formamide, 2×SSC for 2 min at 70 degree Celsius, then dehydrated at 4 degree Celsius. The slides are treated with proteinase K (10 ug/100 ml in 20 mM Tris-HCl, 2 mM CaCl 2 ) at 37 degree Celsius for 8 min and dehydrated. The hybridization mixture containing the probe is placed on the slide, covered with a coverslip, sealed with rubber cement and incubated overnight in a humid chamber at 37 degree Celsius. After hybridization and post-hybridization washes, the biotinylated probe is detected by avidin-FITC and amplified with additional layers of biotinylated goat anti-avidin and avidin-FITC. For chromosomal localization, fluorescent R-bands are obtained as previously described (Cherif et al., 1990). The slides are observed under a LEICA fluorescence microscope (DMRXA). Chromosomes are counterstained with propidium iodide and the fluorescent signal of the probe appears as two symmetrical yellow-green spots on both chromatids of the fluorescent R-band chromosome (red). Thus, a particular GENSET cDNA or genomic DNA may be localized to a particular cytogenetic R-band on a given chromosome.

Use of cDNAs to Construct or Expand Chromosome Maps

Once the GENSET cDNAs or genomic DNAs have been assigned to particular chromosomes using any technique known to those skilled in the art, particularly those described herein, they may be utilized to construct a high resolution map of the chromosomes on which they are located or to identify the chromosomes in a sample.

Chromosome mapping involves assigning a given unique sequence to a particular chromosome as described above. Once the unique sequence has been mapped to a given chromosome, it is ordered relative to other unique sequences located on the same chromosome. One approach to chromosome mapping utilizes a series of yeast artificial chromosomes (YACs) bearing several thousand long inserts derived from the chromosomes of the organism from which the GENSET cDNAs or genomic DNAs are obtained. This approach is described in Nagaraja et al. (1997), which disclosure is hereby incorporated by reference in its entirety. Briefly, in this approach each chromosome is broken into overlapping pieces which are inserted into the YAC vector. The YAC inserts are screened using PCR or other methods to determine whether they include the GENSET cDNA or genomic DNA whose position is to be determined. Once an insert has been found which includes the GENSET cDNA or genomic DNA, the insert can be analyzed by PCR or other methods to determine whether the insert also contains other sequences known to be on the chromosome or in the region from which the GENSET cDNA or genomic DNA was derived. This process can be repeated for each insert in the YAC library to determine the location of each of the GENSET cDNAs or genomic DNAs relative to one another and to other known chromosomal markers. In this way, a high resolution map of the distribution of numerous unique markers along each of the organism's chromosomes may be obtained.
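As a minimal sketch of the YAC screening logic described above, the following example tabulates which known markers are found on the same inserts as a given cDNA. The clone names, marker names and PCR results are hypothetical. The ranking rests on the common working assumption that markers sharing more clones with the target tend to lie closer to it; it is not a substitute for a full contig assembly.

```python
# Hypothetical screening data: YAC clone -> markers detected in its insert by PCR.
YAC_CONTENT = {
    "yac_01": {"D1", "D2"},
    "yac_02": {"D2", "cDNA_x", "D3"},
    "yac_03": {"cDNA_x", "D3"},
    "yac_04": {"D4", "D5"},
}


def neighbouring_markers(yac_content, target):
    """Count, for each other marker, how many YACs carry it together with target."""
    counts = {}
    for markers in yac_content.values():
        if target in markers:
            for marker in markers - {target}:
                counts[marker] = counts.get(marker, 0) + 1
    # Sort by descending number of shared clones, then by name for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(neighbouring_markers(YAC_CONTENT, "cDNA_x"))
```

In this illustrative data set, marker D3 shares two clones with the cDNA and D2 shares one, which would place the cDNA nearer to D3 under the stated assumption.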

Identification of Genes Associated with Hereditary Diseases or Drug Response

This example illustrates an approach useful for the association of GENSET cDNAs or genomic DNAs with particular phenotypic characteristics. In this example, a particular GENSET cDNA or genomic DNA is used as a test probe to associate that GENSET cDNA or genomic DNA with a particular phenotypic characteristic.

GENSET cDNAs or genomic DNAs are mapped to a particular location on a human chromosome using techniques such as those described herein or other techniques known in the art. A search of Mendelian Inheritance in Man (V. McKusick, Mendelian Inheritance in Man, available on line through Johns Hopkins University Welch Medical Library) reveals the region of the human chromosome which contains the GENSET cDNA or genomic DNA to be a very gene rich region containing several known genes and several diseases or phenotypes for which genes have not been identified. The gene corresponding to this GENSET cDNA or genomic DNA thus becomes an immediate candidate for each of these genetic diseases.

Cells from patients with these diseases or phenotypes are isolated and expanded in culture. PCR primers from the GENSET cDNA or genomic DNA are used to screen genomic DNA, mRNA or cDNA obtained from the patients. GENSET cDNAs or genomic DNAs that are not amplified in the patients can be positively associated with a particular disease by further analysis. Alternatively, the PCR analysis may yield fragments of different lengths when the samples are derived from an individual having the phenotype associated with the disease than when the sample is derived from a healthy individual, indicating that the gene containing the cDNA may be responsible for the genetic disease.
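The two outcomes described above (no amplification in the patient, or a fragment of different length) can be expressed as a simple comparison. This is a minimal sketch only: the function name, the size tolerance and the example fragment sizes are hypothetical, and any real assay would need controls and a tolerance suited to the gel or instrument used.

```python
def compare_pcr_result(patient_bp, control_bp, tolerance_bp=5):
    """Classify a patient's PCR product relative to a healthy control.

    Sizes are fragment lengths in base pairs; None means no product was observed.
    tolerance_bp is an assumed sizing error, not a value taken from the text.
    """
    if control_bp is None:
        return "uninformative: control did not amplify"
    if patient_bp is None:
        return "no amplification in patient"
    if abs(patient_bp - control_bp) > tolerance_bp:
        return "fragment length differs"
    return "no difference detected"


# Illustrative sizes only.
print(compare_pcr_result(None, 420))
print(compare_pcr_result(380, 420))
print(compare_pcr_result(421, 420))
```

Either of the first two outcomes would flag the corresponding gene for the further analysis mentioned above; neither outcome alone establishes an association with the disease.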

Uses of Polynucleotides in Recombinant Vectors

The present invention also relates to recombinant vectors, which include the isolated polynucleotides of the present invention, or fragments thereof and to host cells recombinant for a polynucleotide of the invention, such as the above vectors, as well as to methods of making such vectors and host cells and for using them for production of GENSET polypeptides by recombinant techniques.

Recombinant Vectors

The term “vector” is used herein to designate either a circular or a linear DNA or RNA molecule, which is either double-stranded or single-stranded, and which comprises at least one polynucleotide of interest that is sought to be transferred in a cell host or in a unicellular or multicellular host organism. The present invention encompasses a family of recombinant vectors that comprise a regulatory polynucleotide and/or a coding polynucleotide derived from either the GENSET genomic sequence or the cDNA sequence. Generally, a recombinant vector of the invention may comprise any of the polynucleotides described herein, including regulatory sequences, coding sequences and polynucleotide constructs, as well as any GENSET primer or probe as defined herein.

In a first preferred embodiment, a recombinant vector of the invention is used to amplify the inserted polynucleotide derived from a GENSET genomic sequence or a GENSET cDNA, for example any cDNA selected from the group consisting of sequences of SEQ ID Nos: 1-241, sequences of clone inserts of the deposited clone pool, variants and fragments thereof in a suitable cell host, this polynucleotide being amplified every time that the recombinant vector replicates.

A second preferred embodiment of the recombinant vectors according to the invention comprises expression vectors comprising either a regulatory polynucleotide or a coding nucleic acid of the invention, or both. Within certain embodiments, expression vectors are employed to express a GENSET polypeptide which can then be purified and, for example, be used in ligand screening assays or as an immunogen in order to raise specific antibodies directed against the GENSET protein. In other embodiments, the expression vectors are used for constructing transgenic animals and also for gene therapy. Expression requires that appropriate signals are provided in the vectors, said signals including various regulatory elements, such as enhancers/promoters from both viral and mammalian sources that drive expression of the genes of interest in host cells. Dominant drug selection markers for establishing permanent, stable cell clones expressing the products are generally included in the expression vectors of the invention, as are elements that link expression of the drug selection markers to expression of the polypeptide.

More particularly, the present invention relates to expression vectors which include nucleic acids encoding a GENSET protein, preferably a GENSET protein with an amino acid sequence selected from the group consisting of sequences of SEQ ID Nos: 242-482, mature polypeptides included in sequences of SEQ ID Nos: 242-272 and 274-384, and sequences of full-length or mature polypeptides encoded by the clone inserts of the deposited clone pool, as well as variants and fragments thereof. The polynucleotides of the present invention may be used to express an encoded protein in a host organism to produce a beneficial effect. In such procedures, the encoded protein may be transiently expressed in the host organism or stably expressed in the host organism. The encoded protein may have any of the activities described herein. The encoded protein may be a protein which the host organism lacks or, alternatively, the encoded protein may augment the existing levels of the protein in the host organism.

Some of the elements which can be found in the vectors of the present invention are described in further detail in the following sections.

General Features of the Expression Vectors of the Invention

A recombinant vector according to the invention comprises, but is not limited to, a YAC (Yeast Artificial Chromosome), a BAC (Bacterial Artificial Chromosome), a phage, a phagemid, a cosmid, a plasmid or even a linear DNA molecule which may comprise a chromosomal, non-chromosomal, semi-synthetic and synthetic DNA. Such a recombinant vector can comprise a transcriptional unit comprising an assembly of:

(1) a genetic element or elements having a regulatory role in gene expression, for example promoters or enhancers. Enhancers are cis-acting elements of DNA, usually from about 10 to 300 bp in length that act on the promoter to increase the transcription.

(2) a structural or coding sequence which is transcribed into mRNA and eventually translated into a polypeptide, said structural or coding sequence being operably linked to the regulatory elements described in (1); and

(3) appropriate transcription initiation and termination sequences. Structural units intended for use in yeast or eukaryotic expression systems preferably include a leader sequence enabling extracellular secretion of translated protein by a host cell. Alternatively, when a recombinant protein is expressed without a leader or transport sequence, it may include an N-terminal residue. This residue may or may not be subsequently cleaved from the expressed recombinant protein to provide a final product.

Generally, recombinant expression vectors will include origins of replication, selectable markers permitting transformation of the host cell, and a promoter derived from a highly expressed gene to direct transcription of a downstream structural sequence. The heterologous structural sequence is assembled in appropriate phase with translation initiation and termination sequences, and preferably a leader sequence capable of directing secretion of the translated protein into the periplasmic space or the extracellular medium. In a specific embodiment wherein the vector is adapted for transfecting and expressing desired sequences in mammalian host cells, preferred vectors will comprise an origin of replication in the desired host, a suitable promoter and enhancer, and also any necessary ribosome binding sites, polyadenylation signal, splice donor and acceptor sites, transcriptional termination sequences, and 5′-flanking non-transcribed sequences. DNA sequences derived from the SV40 viral genome, for example SV40 origin, early promoter, enhancer, splice and polyadenylation signals may be used to provide the required non-transcribed genetic elements.

The in vivo expression of a GENSET polypeptide of the present invention may be useful in order to correct a genetic defect related to the expression of the native gene in a host organism or to the production of a biologically inactive GENSET protein. Consequently, the present invention also comprises recombinant expression vectors mainly designed for the in vivo production of a GENSET polypeptide of the present invention by the introduction of the appropriate genetic material in the organism or the patient to be treated. This genetic material may be introduced in vitro in a cell that has been previously extracted from the organism, the modified cell being subsequently reintroduced in the said organism, or directly in vivo into the appropriate tissue.

Regulatory Elements

The suitable promoter regions used in the expression vectors according to the present invention are chosen taking into account the cell host in which the heterologous gene has to be expressed. The particular promoter employed to control the expression of a nucleic acid sequence of interest is not believed to be important, so long as it is capable of directing the expression of the nucleic acid in the targeted cell. Thus, where a human cell is targeted, it is preferable to position the nucleic acid coding region adjacent to and under the control of a promoter that is capable of being expressed in a human cell, such as, for example, a human or a viral promoter.

A suitable promoter may be heterologous with respect to the nucleic acid for which it controls the expression or alternatively can be endogenous to the native polynucleotide containing the coding sequence to be expressed. Additionally, the promoter is generally heterologous with respect to the recombinant vector sequences within which the construct promoter/coding sequence has been inserted.

Promoter regions can be selected from any desired gene using, for example, CAT (chloramphenicol acetyltransferase) vectors and more preferably pKK232-8 and pCM7 vectors.

Preferred bacterial promoters are the LacI, LacZ, the T3 or T7 bacteriophage RNA polymerase promoters, the gpt, lambda PR, PL and trp promoters (EP 0036776), the polyhedrin promoter, or the p10 protein promoter from baculovirus (Kit Novagen), (Smith et al., 1983; O'Reilly et al., 1992), which disclosures are hereby incorporated by reference in their entireties, or also the trc promoter.

Eukaryotic promoters include CMV immediate early, HSV thymidine kinase, early and late SV40, LTRs from retrovirus, and mouse metallothionein-I. Selection of a convenient vector and promoter is well within the level of ordinary skill in the art of genetic engineering. For example, one may refer to the book of Sambrook et al., (1989) or also to the procedures described by Fuller et al., (1996), which disclosures are hereby incorporated by reference in their entireties.

Other Regulatory Elements

Where a cDNA insert is employed, one will typically desire to include a polyadenylation signal to effect proper polyadenylation of the gene transcript. The nature of the polyadenylation signal is not believed to be crucial to the successful practice of the invention, and any such sequence may be employed such as human growth hormone and SV40 polyadenylation signals. Also contemplated as an element of the expression cassette is a terminator. These elements can serve to enhance message levels and to minimize read through from the cassette into other sequences.

Selectable Markers

Selectable markers confer an identifiable change to the cell permitting easy identification of cells containing the expression construct. The selectable marker genes for selection of transformed host cells are preferably dihydrofolate reductase or neomycin resistance for eukaryotic cell culture, TRP1 for S. cerevisiae or tetracycline, rifampicin or ampicillin resistance in E. coli, or levan saccharase for mycobacteria, this latter marker being a negative selection marker.

Preferred Vectors

Bacterial Vectors

As a representative but non-limiting example, useful expression vectors for bacterial use can comprise a selectable marker and a bacterial origin of replication derived from commercially available plasmids comprising genetic elements of pBR322 (ATCC 37017). Such commercial vectors include, for example, pKK223-3 (Pharmacia, Uppsala, Sweden), and pGEM1 (Promega Biotec, Madison, Wis., USA).

Large numbers of other suitable vectors are known to those of skill in the art, and commercially available, such as the following bacterial vectors: pQE70, pQE60, pQE-9 (Qiagen), pbs, pD10, phagescript, psiX174, pbluescript SK, pbsks, pNH8A, pNH16A, pNH18A, pNH46A (Stratagene); ptrc99a, pKK223-3, pKK233-3, pDR540, pRIT5 (Pharmacia); pWLNEO, pSV2CAT, pOG44, pXT1, pSG (Stratagene); pSVK3, pBPV, pMSG, pSVL (Pharmacia); pQE-30 (QIAexpress).

Bacteriophage Vectors

The P1 bacteriophage vector may contain large inserts ranging from about 80 to about 100 kb. The construction of P1 bacteriophage vectors such as p158 or p158/neo8 is notably described by Sternberg (1992, 1994), which disclosure is hereby incorporated by reference in its entirety. Recombinant P1 clones comprising GENSET nucleotide sequences may be designed for inserting large polynucleotides of more than 40 kb (See Linton et al., 1993), which disclosure is hereby incorporated by reference in its entirety. To generate P1 DNA for transgenic experiments, a preferred protocol is the protocol described by McCormick et al. (1994), which disclosure is hereby incorporated by reference in its entirety. Briefly, E. coli (preferably strain NS3529) harboring the P1 plasmid are grown overnight in a suitable broth medium containing 25 μg/ml of kanamycin. The P1 DNA is prepared from the E. coli by alkaline lysis using the Qiagen Plasmid Maxi kit (Qiagen, Chatsworth, Calif., USA), according to the manufacturer's instructions. The P1 DNA is purified from the bacterial lysate on two Qiagen-tip 500 columns, using the washing and elution buffers contained in the kit. A phenol/chloroform extraction is then performed before precipitating the DNA with 70% ethanol. After solubilizing the DNA in TE (10 mM Tris-HCl, pH 7.4, 1 mM EDTA), the concentration of the DNA is assessed by spectrophotometry.
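The spectrophotometric assessment mentioned above is commonly made from the absorbance at 260 nm. The following minimal sketch assumes the widely used convention that an A260 of 1.0 in a 1 cm cuvette corresponds to about 50 μg/ml of double-stranded DNA. The wavelength, the conversion factor, the function name and the example readings are assumptions for illustration and are not specified in the protocol itself.

```python
def dsdna_concentration_ug_per_ml(a260, dilution_factor=1.0, path_length_cm=1.0):
    """Estimate double-stranded DNA concentration from absorbance at 260 nm.

    Assumes about 50 ug/ml of double-stranded DNA per A260 unit at a 1 cm path.
    """
    if a260 < 0 or dilution_factor <= 0 or path_length_cm <= 0:
        raise ValueError("absorbance must be >= 0; dilution and path length must be > 0")
    return a260 / path_length_cm * 50.0 * dilution_factor


# Illustrative reading: A260 of 0.25 measured on a 1:20 dilution of the TE stock.
concentration = dsdna_concentration_ug_per_ml(0.25, dilution_factor=20)
print(f"{concentration:.1f} ug/ml")
```

The dilution factor is included because concentrated preparations are usually diluted to bring the reading into the linear range of the instrument.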

When the goal is to express a P1 clone comprising GENSET nucleotide sequences in a transgenic animal, typically in transgenic mice, it is desirable to remove vector sequences from the P1 DNA fragment, for example by cleaving the P1 DNA at rare-cutting sites within the P1 polylinker (SfiI, NotI or SalI). The P1 insert is then purified from vector sequences on a pulsed-field agarose gel, using methods similar to those originally reported for the isolation of DNA from YACs (See e.g., Schedl et al., 1993a; Peterson et al., 1993), which disclosures are hereby incorporated by reference in their entireties. At this stage, the resulting purified insert DNA can be concentrated, if necessary, on a Millipore Ultrafree-MC Filter Unit (Millipore, Bedford, Mass., USA; 30,000 molecular weight limit) and then dialyzed against microinjection buffer (10 mM Tris-HCl, pH 7.4; 250 μM EDTA) containing 100 mM NaCl, 30 μM spermine, 70 μM spermidine on a microdialysis membrane (type VS, 0.025 μm from Millipore). The intactness of the purified P1 DNA insert is assessed by electrophoresis on 1% agarose (Sea Kem GTG; FMC Bio-products) pulsed-field gel and staining with ethidium bromide.

Viral Vectors

In one specific embodiment, the vector is derived from an adenovirus. Preferred adenovirus vectors according to the invention are those described by Feldman and Steg (1996), or Ohno et al., (1994), which disclosures are hereby incorporated by reference in their entireties. Another preferred recombinant adenovirus according to this specific embodiment of the present invention is the human adenovirus type 2 or 5 (Ad 2 or Ad 5) or an adenovirus of animal origin (French patent application No. FR-93.05954), which disclosure is hereby incorporated by reference in its entirety.

Retrovirus vectors and adeno-associated virus vectors are generally understood to be the recombinant gene delivery systems of choice for the transfer of exogenous polynucleotides in vivo, particularly to mammals, including humans. These vectors provide efficient delivery of genes into cells, and the transferred nucleic acids are stably integrated into the chromosomal DNA of the host. Particularly preferred retroviruses for the preparation or construction of retroviral in vitro or in vivo gene delivery vehicles of the present invention include retroviruses selected from the group consisting of Mink-Cell Focus Inducing Virus, Murine Sarcoma Virus, Reticuloendotheliosis virus and Rous Sarcoma virus. Particularly preferred Murine Leukemia Viruses include the 4070A and the 1504A viruses, Abelson (ATCC No VR-999), Friend (ATCC No VR-245), Gross (ATCC No VR-590), Rauscher (ATCC No VR-998) and Moloney Murine Leukemia Virus (ATCC No VR-190; PCT Application No WO 94/24298). Particularly preferred Rous Sarcoma Viruses include Bryan high titer (ATCC Nos VR-334, VR-657, VR-726, VR-659 and VR-728). Other preferred retroviral vectors are those described in Roth et al. (1996), PCT Application No WO 93/25234, PCT Application No WO 94/06920, Roux et al., (1989), Julan et al., (1992), and Neda et al., (1991), which disclosures are hereby incorporated by reference in their entireties.

Yet another viral vector system that is contemplated by the invention comprises the adeno-associated virus (AAV). The adeno-associated virus is a naturally occurring defective virus that requires another virus, such as an adenovirus or a herpes virus, as a helper virus for efficient replication and a productive life cycle (Muzyczka et al., 1992), which disclosure is hereby incorporated by reference in its entirety. It is also one of the few viruses that may integrate its DNA into non-dividing cells, and exhibits a high frequency of stable integration (Flotte et al. 1992; Samulski et al., 1989; McLaughlin et al., 1989), which disclosures are hereby incorporated by reference in their entireties. One advantageous feature of AAV derives from its reduced efficacy for transducing primary cells relative to transformed cells.

BAC Vectors

The bacterial artificial chromosome (BAC) cloning system (Shizuya et al., 1992), which disclosure is hereby incorporated by reference in its entirety, has been developed to stably maintain large fragments of genomic DNA (100-300 kb) in E. coli. A preferred BAC vector comprises a pBeloBAC11 vector that has been described by Kim et al. (1996), which disclosure is hereby incorporated by reference in its entirety. BAC libraries are prepared with this vector using size-selected genomic DNA that has been partially digested using enzymes that permit ligation into either the BamHI or HindIII sites in the vector. Flanking these cloning sites are T7 and SP6 RNA polymerase transcription initiation sites that can be used to generate end probes by either RNA transcription or PCR methods. After the construction of a BAC library in E. coli, BAC DNA is purified from the host cell as a supercoiled circle. Converting these circular molecules into a linear form precedes both size determination and introduction of the BACs into recipient cells. The cloning site is flanked by two NotI sites, permitting cloned segments to be excised from the vector by NotI digestion. Alternatively, the DNA insert contained in the pBeloBAC11 vector may be linearized by treatment of the BAC vector with the commercially available enzyme lambda terminase that leads to the cleavage at the unique cosN site, but this cleavage method results in a full length BAC clone containing both the insert DNA and the BAC sequences.

Baculovirus:

Another specific suitable host vector system is the pVL1392/1393 baculovirus transfer vector (Pharmingen) that is used to transfect the SF9 cell line (ATCC No. CRL 1711) which is derived from Spodoptera frugiperda. Other suitable vectors for the expression of the GENSET polypeptide of the present invention in a baculovirus expression system include those described by Chai et al., (1993), Vlasak et al., (1983), and Lenhard et al., (1996), which disclosures are hereby incorporated by reference in their entireties.

Delivery of the Recombinant Vectors:

To effect expression of the polynucleotides and polynucleotide constructs of the invention, these constructs must be delivered into a cell. This delivery may be accomplished in vitro, as in laboratory procedures for transforming cell lines, or in vivo or ex vivo, as in the treatment of certain disease states. One mechanism is viral infection where the expression construct is encapsulated in an infectious viral particle.

Several non-viral methods for the transfer of polynucleotides into cultured mammalian cells are also contemplated by the present invention, and include, without being limited to, calcium phosphate precipitation (Graham et al., 1973; Chen et al., 1987); DEAE-dextran (Gopal, 1985); electroporation (Tur-Kaspa et al., 1986; Potter et al., 1984); direct microinjection (Harland et al., 1985); DNA-loaded liposomes (Nicolau et al., 1982; Fraley et al., 1979); and receptor-mediated transfection (Wu and Wu, 1987, 1988), which disclosures are hereby incorporated by reference in their entireties. Some of these techniques may be successfully adapted for in vivo or ex vivo use.

Once the expression polynucleotide has been delivered into the cell, it may be stably integrated into the genome of the recipient cell. This integration may be in the cognate location and orientation via homologous recombination (gene replacement) or it may be integrated in a random, non-specific location (gene augmentation). In yet further embodiments, the nucleic acid may be stably maintained in the cell as a separate, episomal segment of DNA. Such nucleic acid segments or “episomes” encode sequences sufficient to permit maintenance and replication independent of or in synchronization with the host cell cycle.

One specific embodiment for a method for delivering a protein or peptide to the interior of a cell of a vertebrate in vivo comprises the step of introducing a preparation comprising a physiologically acceptable carrier and a naked polynucleotide operatively coding for the polypeptide of interest into the interstitial space of a tissue comprising the cell, whereby the naked polynucleotide is taken up into the interior of the cell and has a physiological effect. This is particularly applicable for transfer in vitro but it may be applied to in vivo as well.

Compositions for use in vitro and in vivo comprising a “naked” polynucleotide are described in PCT application No. WO 90/11092 (Vical Inc.) and also in PCT application No. WO 95/11307 (Institut Pasteur, INSERM, Université d'Ottawa) as well as in the articles of Tascon et al. (1996) and of Huygen et al., (1996), which disclosures are hereby incorporated by reference in their entireties.

In still another embodiment of the invention, the transfer of a naked polynucleotide of the invention, including a polynucleotide construct of the invention, into cells may be carried out by particle bombardment (biolistic), said particles being DNA-coated microprojectiles accelerated to a high velocity allowing them to pierce cell membranes and enter cells without killing them, such as described by Klein et al., (1987), which disclosure is hereby incorporated by reference in its entirety.

In a further embodiment, the polynucleotide of the invention may be entrapped in a liposome (Ghosh and Bachhawat, 1991; Wong et al., 1980; Nicolau et al., 1987), which disclosures are hereby incorporated by reference in their entireties.

In a specific embodiment, the invention provides a composition for the in vivo production of the GENSET protein or polypeptide described herein. It comprises a naked polynucleotide operatively coding for this polypeptide, in solution in a physiologically acceptable carrier, and suitable for introduction into a tissue to cause cells of the tissue to express the said protein or polypeptide.

The amount of vector to be injected into the desired host organism varies according to the site of injection. As an indicative dose, between 0.1 and 100 μg of the vector is injected into an animal body, preferably a mammal body, for example a mouse body.

In another embodiment of the vector according to the invention, it may be introduced in vitro in a host cell, preferably in a host cell previously harvested from the animal to be treated and more preferably a somatic cell such as a muscle cell. In a subsequent step, the cell that has been transformed with the vector coding for the desired GENSET polypeptide or the desired fragment thereof is reintroduced into the animal body in order to deliver the recombinant protein within the body either locally or systemically.

Secretion Vectors

Some of the GENSET cDNAs or genomic DNAs of the invention may also be used to construct secretion vectors capable of directing the secretion of the proteins encoded by genes inserted in the vectors. Such secretion vectors may facilitate the purification or enrichment of the proteins encoded by genes inserted therein by reducing the number of background proteins from which the desired protein must be purified or enriched. Exemplary secretion vectors are described below.

The secretion vectors of the present invention include a promoter capable of directing gene expression in the host cell, tissue, or organism of interest. Such promoters include the Rous Sarcoma Virus promoter, the SV40 promoter, the human cytomegalovirus promoter, and other promoters familiar to those skilled in the art.

A signal sequence from a polynucleotide of the invention, preferably a signal sequence selected from the group of signal sequences of SEQ ID Nos: 1-31 and 33-143 and signal sequences of clone inserts of the deposited clone pool, is operably linked to the promoter such that the mRNA transcribed from the promoter will direct the translation of the signal peptide. The host cell, tissue, or organism may be any cell, tissue, or organism which recognizes the signal peptide encoded by the signal sequence in the GENSET cDNA or genomic DNA. Suitable hosts include mammalian cells, tissues or organisms, avian cells, tissues, or organisms, insect cells, tissues or organisms, or yeast.

In addition, the secretion vector contains cloning sites for inserting genes encoding the proteins which are to be secreted. The cloning sites facilitate the cloning of the insert gene in frame with the signal sequence such that a fusion protein in which the signal peptide is fused to the protein encoded by the inserted gene is expressed from the mRNA transcribed from the promoter. The signal peptide directs the extracellular secretion of the fusion protein.
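The requirement that the inserted gene be cloned in frame with the signal sequence can be checked computationally before a construct is made. The following is a minimal sketch under stated assumptions: the sequences are short hypothetical examples (not sequences of the invention), the linker stands for whatever cloning-site nucleotides lie between the signal sequence and the insert, and only the standard stop codons TAA, TAG and TGA are considered.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_in_frame_fusion(signal_seq, linker, insert_orf):
    """Check that signal + linker + insert reads as one open reading frame."""
    upstream = (signal_seq + linker).upper()
    orf = insert_orf.upper()
    if not upstream.startswith("ATG"):
        return False  # the signal sequence must supply the start codon
    if len(upstream) % 3 != 0:
        return False  # the insert would be shifted out of frame
    if len(orf) % 3 != 0:
        return False  # the insert itself is not a whole number of codons
    fused = upstream + orf
    codons = [fused[i:i + 3] for i in range(0, len(fused), 3)]
    # No stop codon may occur before the final codon of the fusion.
    return all(codon not in STOP_CODONS for codon in codons[:-1])


# Hypothetical sequences for illustration only.
signal = "ATGAAGCTGCTG"   # 12 nt, begins with ATG
insert = "GCTGAAGGTTAA"   # 12 nt, ends with a stop codon
print(is_in_frame_fusion(signal, "GGATCC", insert))    # 6-nt linker keeps the frame
print(is_in_frame_fusion(signal, "GGATCCA", insert))   # 7-nt linker shifts the frame
```

A check of this kind only confirms the reading frame; it does not predict whether the signal peptide will be recognized or cleaved in a given host.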

The secretion vector may be DNA or RNA and may integrate into the chromosome of the host, be stably maintained as an extrachromosomal replicon in the host, be an artificial chromosome, or be transiently present in the host. Preferably, the secretion vector is maintained in multiple copies in each host cell. As used herein, multiple copies means at least 2, 5, 10, 20, 25, 50 or more than 50 copies per cell. In some embodiments, the multiple copies are maintained extrachromosomally. In other embodiments, the multiple copies result from amplification of a chromosomal sequence.

Many nucleic acid backbones suitable for use as secretion vectors are known to those skilled in the art, including retroviral vectors, SV40 vectors, Bovine Papilloma Virus vectors, yeast integrating plasmids, yeast episomal plasmids, yeast artificial chromosomes, human artificial chromosomes, P element vectors, baculovirus vectors, or bacterial plasmids capable of being transiently introduced into the host.

The secretion vector may also contain a polyA signal such that the polyA signal is located downstream of the gene inserted into the secretion vector.

After the gene encoding the protein for which secretion is desired is inserted into the secretion vector, the secretion vector is introduced into the host cell, tissue, or organism using calcium phosphate precipitation, DEAE-Dextran, electroporation, liposome-mediated transfection, viral particles or as naked DNA. The protein encoded by the inserted gene is then purified or enriched from the supernatant using conventional techniques such as ammonium sulfate precipitation, immunoprecipitation, immunochromatography, size exclusion chromatography, ion exchange chromatography, and HPLC. Alternatively, the secreted protein may be in a sufficiently enriched or pure state in the supernatant or growth media of the host to permit it to be used for its intended purpose without further enrichment.

The signal sequences may also be inserted into vectors designed for gene therapy. In such vectors, the signal sequence is operably linked to a promoter such that mRNA transcribed from the promoter encodes the signal peptide. A cloning site is located downstream of the signal sequence such that a gene encoding a protein whose secretion is desired may readily be inserted into the vector and fused to the signal sequence. The vector is introduced into an appropriate host cell. The protein expressed from the promoter is secreted extracellularly, thereby producing a therapeutic effect.

Cell Hosts

Another object of the invention comprises a host cell that has been transformed or transfected with one of the polynucleotides described herein, and in particular a polynucleotide either comprising a GENSET regulatory polynucleotide or the polynucleotide coding for a GENSET polypeptide. Also included are host cells that are transformed (prokaryotic cells) or that are transfected (eukaryotic cells) with a recombinant vector such as one of those described above. However, the cell hosts of the present invention can comprise any of the polynucleotides of the present invention. In a preferred embodiment, host cells contain a polynucleotide sequence comprising a sequence selected from the group consisting of sequences of SEQ ID Nos: 1-241, sequences of clone inserts of the deposited clone pool, variants and fragments thereof. Preferred host cells used as recipients for the expression vectors of the invention are the following:

a) Prokaryotic host cells: Escherichia coli strains (e.g., DH5-α strain), Bacillus subtilis, Salmonella typhimurium, and strains from species like Pseudomonas, Streptomyces and Staphylococcus.

b) Eukaryotic host cells: HeLa cells (ATCC No. CCL2; No. CCL2.1; No. CCL2.2), CV-1 cells (ATCC No. CCL70), COS cells (ATCC No. CRL1650; No. CRL1651), Sf-9 cells (ATCC No. CRL1711), C127 cells (ATCC No. CRL-1804), 3T3 (ATCC No. CRL-6361), CHO (ATCC No. CCL-61), human kidney 293 (ATCC No. 45504; No. CRL-1573) and BHK (ECACC No. 84100501; No. 84111301).

c) Other mammalian host cells.

The present invention also encompasses primary, secondary, and immortalized homologously recombinant host cells of vertebrate origin, preferably mammalian origin and particularly human origin, that have been engineered to: a) insert exogenous (heterologous) polynucleotides into the endogenous chromosomal DNA of a targeted gene, b) delete endogenous chromosomal DNA, and/or c) replace endogenous chromosomal DNA with exogenous polynucleotides. Insertions, deletions, and/or replacements of polynucleotide sequences may be to the coding sequences of the targeted gene and/or to regulatory regions, such as promoter and enhancer sequences, operably associated with the targeted gene.

In addition to encompassing host cells containing the vector constructs discussed herein, the invention also encompasses primary, secondary, and immortalized host cells of vertebrate origin, particularly mammalian origin, that have been engineered to delete or replace endogenous genetic material (e.g., coding sequence), and/or to include genetic material (e.g., heterologous polynucleotide sequences) that is operably associated with the polynucleotides of the invention, and which activates, alters, and/or amplifies endogenous polynucleotides. For example, techniques known in the art may be used to operably associate heterologous control regions (e.g., promoter and/or enhancer) and endogenous polynucleotide sequences via homologous recombination, see, e.g., U.S. Pat. No. 5,641,670, issued Jun. 24, 1997; International Publication No. WO 96/29411, published Sep. 26, 1996; International Publication No. WO 94/12650, published Aug. 4, 1994; Koller et al., (1989); and Zijlstra et al. (1989) (The disclosures of each of which are incorporated by reference in their entireties).

The present invention further relates to a method of making a homologously recombinant host cell in vitro or in vivo, wherein the expression of a targeted gene not normally expressed in the cell is altered. Preferably the alteration causes expression of the targeted gene under normal growth conditions or under conditions suitable for producing the polypeptide encoded by the targeted gene. The method comprises the steps of: (a) transfecting the cell in vitro or in vivo with a polynucleotide construct, said polynucleotide construct comprising: (i) a targeting sequence; (ii) a regulatory sequence and/or a coding sequence; and (iii) an unpaired splice donor site, if necessary, thereby producing a transfected cell; and (b) maintaining the transfected cell in vitro or in vivo under conditions appropriate for homologous recombination.

The present invention further relates to a method of altering the expression of a targeted gene in a cell in vitro or in vivo wherein the gene is not normally expressed in the cell, comprising the steps of: (a) transfecting the cell in vitro or in vivo with a polynucleotide construct, said polynucleotide construct comprising: (i) a targeting sequence; (ii) a regulatory sequence and/or a coding sequence; and (iii) an unpaired splice donor site, if necessary, thereby producing a transfected cell; and (b) maintaining the transfected cell in vitro or in vivo under conditions appropriate for homologous recombination, thereby producing a homologously recombinant cell; and (c) maintaining the homologously recombinant cell in vitro or in vivo under conditions appropriate for expression of the gene.

The present invention further relates to a method of making a polypeptide of the present invention by altering the expression of a targeted endogenous gene in a cell in vitro or in vivo wherein the gene is not normally expressed in the cell, comprising the steps of: (a) transfecting the cell in vitro with a polynucleotide construct, said polynucleotide construct comprising: (i) a targeting sequence; (ii) a regulatory sequence and/or a coding sequence; and (iii) an unpaired splice donor site, if necessary, thereby producing a transfected cell; (b) maintaining the transfected cell in vitro or in vivo under conditions appropriate for homologous recombination, thereby producing a homologously recombinant cell; and (c) maintaining the homologously recombinant cell in vitro or in vivo under conditions appropriate for expression of the gene, thereby making the polypeptide.

The present invention further relates to a polynucleotide construct which alters the expression of a targeted gene in a cell type in which the gene is not normally expressed. This occurs when the polynucleotide construct is inserted into the chromosomal DNA of the target cell, wherein said polynucleotide construct comprises: a) a targeting sequence; b) a regulatory sequence and/or coding sequence; and c) an unpaired splice-donor site, if necessary. Further included are a polynucleotide construct, as described above, wherein said polynucleotide construct further comprises a polynucleotide which encodes a polypeptide and is in-frame with the targeted endogenous gene after homologous recombination with chromosomal DNA.

The compositions may be produced, and methods performed, by techniques known in the art, such as those described in U.S. Pat. Nos. 6,054,288; 6,048,729; 6,048,724; 6,048,524; 5,994,127; 5,968,502; 5,965,125; 5,869,239; 5,817,789; 5,783,385; 5,733,761; 5,641,670; 5,580,734; International Publication Nos: WO96/29411, WO 94/12650; and scientific articles described by Koller et al., (1994). (The disclosures of each of which are incorporated by reference in their entireties).

The GENSET gene expression in mammalian cells, preferably human cells, may be rendered defective, or alternatively may be altered by replacing the endogenous GENSET gene in the genome of an animal cell by a GENSET polynucleotide according to the invention. These genetic alterations may be generated by homologous recombination using previously described specific polynucleotide constructs.

Mammalian zygotes, such as murine zygotes, may be used as cell hosts. For example, murine zygotes may undergo microinjection with a purified DNA molecule of interest, for example a purified DNA molecule that has previously been adjusted to a concentration ranging from 1 ng/ml (for BAC inserts) to 3 ng/μl (for P1 bacteriophage inserts) in 10 mM Tris-HCl, pH 7.4, 250 μM EDTA containing 100 mM NaCl, 30 μM spermine, and 70 μM spermidine. When the DNA to be microinjected has a large size, polyamines and high salt concentrations can be used in order to avoid mechanical breakage of this DNA, as described by Schedl et al. (1993b), which disclosure is hereby incorporated by reference in its entirety.

Any one of the polynucleotides of the invention, including the polynucleotide constructs described herein, may be introduced into an embryonic stem (ES) cell line, preferably a mouse ES cell line. ES cell lines are derived from pluripotent, uncommitted cells of the inner cell mass of pre-implantation blastocysts. Preferred ES cell lines are the following: ES-E14TG2a (ATCC No. CRL-1821), ES-D3 (ATCC No. CRL1934 and No. CRL-11632), YS001 (ATCC No. CRL-11776), 36.5 (ATCC No. CRL-11116). ES cells are maintained in an uncommitted state by culture in the presence of growth-inhibited feeder cells which provide the appropriate signals to preserve this embryonic phenotype and serve as a matrix for ES cell adherence. Preferred feeder cells are primary embryonic fibroblasts that are established from tissue of day 13-day 14 embryos of virtually any mouse strain, that are maintained in culture, such as described by Abbondanzo et al. (1993) and are growth-inhibited by irradiation, such as described by Robertson (1987), or by the presence of an inhibitory concentration of LIF, such as described by Pease and Williams (1990), which disclosures are hereby incorporated by reference in their entireties.

The constructs in the host cells can be used in a conventional manner to produce the gene product encoded by the recombinant sequence.

Following transformation of a suitable host and growth of the host to an appropriate cell density, the selected promoter is induced by appropriate means, such as temperature shift or chemical induction, and cells are cultivated for an additional period. Cells are typically harvested by centrifugation, disrupted by physical or chemical means, and the resulting crude extract retained for further purification. Microbial cells employed in the expression of proteins can be disrupted by any convenient method, including freeze-thaw cycling, sonication, mechanical disruption, or use of cell lysing agents. Such methods are well known by the skilled artisan.

Transgenic Animals

The terms “transgenic animals” or “host animals” are used herein to designate animals that have their genome genetically and artificially manipulated so as to include one of the nucleic acids according to the invention. Preferred animals are non-human mammals and include those belonging to a genus selected from Mus (e.g. mice), Rattus (e.g. rats) and Oryctolagus (e.g. rabbits) which have their genome artificially and genetically altered by the insertion of a nucleic acid according to the invention. In one embodiment, the invention encompasses non-human host mammals and animals comprising a recombinant vector of the invention or a GENSET gene disrupted by homologous recombination with a knock out vector.

The transgenic animals of the invention all include within a plurality of their cells a cloned recombinant or synthetic DNA sequence, more specifically one of the purified or isolated nucleic acids comprising a GENSET coding sequence, a GENSET regulatory polynucleotide, a polynucleotide construct, or a DNA sequence encoding an antisense polynucleotide such as described in the present specification.

Generally, a transgenic animal according to the present invention comprises any of the polynucleotides, the recombinant vectors and the cell hosts described in the present invention. In a first preferred embodiment, these transgenic animals may be good experimental models in order to study the diverse pathologies related to the dysregulation of the expression of a given GENSET gene, in particular the transgenic animals containing within their genome one or several copies of an inserted polynucleotide encoding a native GENSET protein, or alternatively a mutant GENSET protein.

In a second preferred embodiment, these transgenic animals may express a desired polypeptide of interest under the control of the regulatory polynucleotides of the GENSET gene, leading to high yields in the synthesis of this protein of interest, and eventually to tissue specific expression of the protein of interest.

In a third preferred embodiment, these transgenic animals may express a desired polypeptide of interest fused to a GENSET signal peptide sequence, leading to the secretion of the fusion (chimeric) polypeptide.

The design of the transgenic animals of the invention may be made according to conventional techniques well known to one skilled in the art. For more details regarding the production of transgenic animals, and specifically transgenic mice, reference may be made to U.S. Pat. No. 4,873,191, issued Oct. 10, 1989; U.S. Pat. No. 5,464,764 issued Nov. 7, 1995; and U.S. Pat. No. 5,789,215, issued Aug. 4, 1998; these documents being herein incorporated by reference to disclose methods of producing transgenic mice.

Transgenic animals of the present invention are produced by the application of procedures which result in an animal with a genome that has incorporated exogenous genetic material. The procedure involves obtaining the genetic material which encodes either a GENSET coding sequence, a GENSET regulatory polynucleotide or a DNA sequence encoding a GENSET antisense polynucleotide, or a portion thereof, such as described in the present specification. A recombinant polynucleotide of the invention is inserted into an embryonic stem (ES) cell line. The insertion is preferably made using electroporation, such as described by Thomas et al. (1987), which disclosure is hereby incorporated by reference in its entirety. The cells subjected to electroporation are screened (e.g. by selection via selectable markers, by PCR or by Southern blot analysis) to find positive cells which have integrated the exogenous recombinant polynucleotide into their genome, preferably via a homologous recombination event. An illustrative positive-negative selection procedure that may be used according to the invention is described by Mansour et al. (1988), which disclosure is hereby incorporated by reference in its entirety.

The positive cells are then isolated, cloned and injected into 3.5-day-old blastocysts from mice, such as described by Bradley (1987), which disclosure is hereby incorporated by reference in its entirety. The blastocysts are then inserted into a female host animal and allowed to grow to term. Alternatively, the positive ES cells are brought into contact with 2.5-day-old embryos at the 8-16 cell stage (morulae) such as described by Wood et al. (1993), or by Nagy et al. (1993), which disclosures are hereby incorporated by reference in their entireties, the ES cells being internalized to colonize extensively the blastocyst including the cells which will give rise to the germ line.

The offspring of the female host are tested to determine which animals are transgenic, i.e., include the inserted exogenous DNA sequence, and which ones are wild type.

Thus, the present invention also concerns a transgenic animal containing a nucleic acid, a recombinant expression vector or a recombinant host cell according to the invention.

Recombinant Cell Lines Derived from the Transgenic Animals of the Invention:

A further object of the invention comprises recombinant host cells obtained from a transgenic animal described herein. In one embodiment the invention encompasses cells derived from non-human host mammals and animals comprising a recombinant vector of the invention or a GENSET gene disrupted by homologous recombination with a knock out vector.

Recombinant cell lines may be established in vitro from cells obtained from any tissue of a transgenic animal according to the invention, for example by transfection of primary cell cultures with vectors expressing onc-genes such as SV40 large T antigen, as described by Chou (1989), and Shay et al. (1991), which disclosures are hereby incorporated by reference in their entireties.

Uses of Polypeptides of the Invention

Proteins Containing Multimerization Domains

The invention relates to compositions and methods using proteins of the invention containing a multimerization domain such as a leucine zipper or a helix loop helix domain.

Proteins of the invention containing a leucine zipper domain, such as the ones described in this section and those containing a leucine zipper domain as shown in Table VI, are herein referred to as LZP. LZP, or parts thereof, preferably fragments comprising a leucine zipper domain, or derivatives thereof, may be used to mediate multimerization of proteins of interest.

The leucine zipper consists of a periodic repetition of leucine residues at every seventh position, covering a distance spanning eight helical turns. The segments containing these periodic arrays of leucine residues appear to exist in an alpha-helical conformation, and the leucine side chains extending from one alpha-helix interact with those from a similar alpha helix of a second polypeptide, facilitating dimerization. The structure formed by cooperation of these two regions is a coiled coil (O'Shea E. K., Rutkowski R., Kim P. S. Science 243:538-542, 1989).

Leucine zippers contribute to targeting of various proteins (e.g., glucose transporters, Asano, et al., J. Biol. Chem., 267, 19636-19641 (1992)) and permit dimerization of various cytoplasmic hormone receptors and enzymes (Forman, et al., Mol Endocrinol, 3, 1610-1626 (1989)). Leucine zippers are also a common feature of protein transcription factors, where they permit homo- or heterodimerization resulting in tight binding to DNA strands (for reviews, see Abel, et al., Nature 341, 24-25 (1989); Jones, et al., Cell 61, 9-11 (1990); Lamb, et al., Trends in Biochemical Sciences 16, 417-422 (1991)).
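The periodicity described above can be sketched as a simple sequence scan. The example below assumes a motif of the form L-x(6)-L-x(6)-L-x(6)-L (four leucines spaced seven residues apart, 22 residues in total), which has the same form as the widely used PROSITE leucine zipper pattern and is consistent with the 22-residue spans reported for the proteins in this section. Whether this exact pattern was used to annotate those proteins is an assumption, and the function name is hypothetical.

```python
import re

# Assumed motif: L-x(6)-L-x(6)-L-x(6)-L, 22 residues long.
# The lookahead lets overlapping candidate motifs be reported as well.
LEUCINE_ZIPPER = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")


def find_leucine_zippers(protein: str) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) positions of candidate motifs."""
    seq = protein.upper()
    return [(m.start() + 1, m.start() + 22) for m in LEUCINE_ZIPPER.finditer(seq)]


# Toy sequence (not a sequence of the invention).
toy = "MK" + "LAAAAAA" * 3 + "L" + "GG"
print(find_leucine_zippers(toy))
```

A pattern match of this kind is only a candidate; as the text notes, actual multimerization depends on the segment adopting an alpha-helical conformation and is established experimentally.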

Leucine zippers have been shown to be useful tools in several areas of biotechnology, especially in protein engineering, where their ability to mediate homo-dimerization or hetero-dimerization has found several applications. For example, Bosslet et al. have described the use of a pair of leucine zippers for in vitro diagnosis, in particular for the immunochemical detection and determination of an analyte in a biological liquid (U.S. Pat. No. 5,643,731). Tso et al. have used leucine zippers for producing bispecific antibody heterodimers (U.S. Pat. No. 5,932,448). Methods of preparing soluble oligomeric proteins using leucine zippers have been described by Conrad et al. (U.S. Pat. No. 5,965,712), Ciardelli et al. (U.S. Pat. No. 5,837,816), and Spriggs et al. (WO9410308). Leucine zipper forming sequences have been used by Pelletier et al. in protein fragment complementation assays to detect biomolecular interactions (WO9834120). Because of their usefulness in biotechnology, it is thus of considerable interest to isolate new leucine zipper domains.

The multimerization activity of proteins containing leucine zipper domains may be assayed using any of the assays known to those skilled in the art, including circular dichroism spectrum and thermal melting analyses as described in U.S. Pat. No. 5,942,433. Alternatively, the leucine zipper motif in LZP could be used by those skilled in the art as a “bait protein” in a well-established yeast two-hybrid system to identify its interacting protein partners in vivo from a cDNA library derived from different tissues or cell types of a given organism. Alternatively, LZP or part thereof could be used by those skilled in the art in mammalian cell transfection experiments. When LZP is fused to a suitable peptide tag such as a [His]6 tag in a protein expression vector and introduced into cultured cells, the expressed fusion protein can be immunoprecipitated with its potential interacting proteins by using an anti-tag peptide antibody. This method could be chosen either to identify the associated partner or to confirm the results obtained by other methods such as those just mentioned.

In a preferred embodiment, the invention relates to compositions and methods of using LZP or part thereof for preparing soluble multimeric proteins, which consist of multimers of fusion proteins containing a leucine zipper fused to a protein of interest, using any technique known to those skilled in the art including those described in international patent WO9410308, which disclosure is hereby incorporated by reference in its entirety. In another preferred embodiment, LZP or derivative thereof is used to produce bispecific antibody heterodimers as described in U.S. Pat. No. 5,932,448, which disclosure is hereby incorporated by reference in its entirety. Briefly, leucine zippers capable of forming heterodimers are respectively linked to epitope binding components with different specificities. Bispecific antibodies are formed by pairwise association of the leucine zippers, forming a heterodimer which links two distinct epitope binding components. In still another preferred embodiment, LZP or part thereof or derivative thereof is used for detection and determination of an analyte in a biological liquid as described in U.S. Pat. No. 5,643,731, which disclosure is hereby incorporated by reference in its entirety. Briefly, a first leucine zipper is immobilized on a solid support and the second leucine zipper is coupled to a specific binding partner for an analyte in a biological fluid. The two peptides are then brought into contact, thereby immobilizing the binding partner on the solid phase. The biological sample is then contacted with the immobilized binding partner and the amount of analyte in the sample bound to the binding partner is determined.

In still another preferred embodiment, the LZP or part thereof may be used to synthesize novel nucleic acid binding proteins which are able to multimerize with proteins of interest, for example to inhibit and/or control cellular growth using any genetic engineering technique known to those skilled in the art including the ones described in U.S. Pat. No. 5,942,433, which disclosure is hereby incorporated by reference in its entirety.

In another embodiment, the invention relates to compositions and methods using the LZP or part thereof or derivative thereof in protein fragment complementation assays to detect biomolecular interactions in vivo and in vitro as described in international patent WO9834120, which disclosure is hereby incorporated by reference in its entirety. Such assays may be used to study the equilibrium and kinetic aspects of molecular interactions including protein-protein, protein-nucleic acid, protein-carbohydrate and protein-small molecule interactions, and for screening cDNA libraries for binding to a target protein with unknown proteins or libraries of small organic molecules for biological activity.

Still another object of the present invention relates to the use of the LZP or part thereof for identifying new leucine zipper domains using any techniques for detecting protein-protein interaction known to those skilled in the art. Among the traditional methods which may be employed are co-immunoprecipitation, crosslinking and co-purification through gradients or chromatographic columns of cell lysates. Once isolated as a protein interacting with the LZP, such an intracellular protein can be identified (e.g. its amino acid sequence determined) and can, in turn, be used, in conjunction with standard techniques, to identify other proteins with which it interacts. The amino acid sequence thus obtained may be used as a guide for the generation of oligonucleotide mixtures that can be used to screen for gene sequences encoding such intracellular proteins. Screening may be accomplished, for example, by standard hybridization or PCR techniques. Techniques for the generation of oligonucleotide mixtures and the screening are well-known. (See, e.g., Ausubel et al., eds., Current Protocols in Molecular Biology, J. Wiley and Sons (New York, N.Y. 1993) and PCR Protocols: A Guide to Methods and Applications, 1990, Innis, M. et al., eds. Academic Press, Inc., New York).

Alternatively, methods may be employed which result in the simultaneous identification of genes which encode the intracellular proteins that can dimerize with the LZP or part thereof using any technique known to those skilled in the art. These methods include, for example, probing cDNA expression libraries, in a manner similar to the well known technique of antibody probing of λgt11 libraries, using as a probe a labeled version of the LZP or part thereof, or fusion protein, e.g., the LZP or part thereof fused to a marker (e.g., an enzyme, fluor, luminescent protein, or dye), or an Ig-Fc domain (for technical details on screening of cDNA expression libraries, see Ausubel et al, supra). Alternatively, another method for the detection of protein interaction in vivo, the two-hybrid system, may be used.

Protein of SEQ ID NO:261 (Internal Designation 116-054-3-0-E6-CS)

The 233-amino-acid protein of SEQ ID NO: 261 encoded by the cDNA of SEQ ID NO: 20 displays two leucine zipper sites at positions 142-163 and 170-191.

It is believed that the protein of SEQ ID NO: 261 is able to dimerize either with itself (homo-dimerisation) or with a heterologous protein (hetero-dimerisation) of interest, through the mediation of its leucine zipper domain. Preferred polypeptides of the invention are polypeptides comprising fragments of SEQ ID NO: 261 from position 142-163 and 170-191, and fragments having any of the biological activities described herein.

Protein of SEQ ID NO:263 (Internal Designation 116-055-2-0-F7-CS)

The protein of SEQ ID NO: 263 encoded by the cDNA of SEQ ID NO: 22 displays a leucine zipper pattern situated near its NH2 terminal part (position 15 to 36).

It is believed that the protein of SEQ ID NO: 263 is able to dimerize either with itself (homo-dimerisation) or with a heterologous protein (hetero-dimerisation) of interest, through the mediation of its leucine zipper domain. Preferred polypeptides of the invention are polypeptides comprising fragments of SEQ ID NO: 263 from position 15 to 36, and fragments having any of the biological activities described herein.

Protein of SEQ ID NO:245 (Internal Designation 105-026-1-0-A5-CS)

The protein of SEQ ID NO:245 encoded by the cDNA of SEQ ID NO:4 displays a leucine zipper pattern situated near its COOH terminal part (position 371 to 392).

It is believed that the protein of SEQ ID NO: 245 is able to dimerize either with itself (homo-dimerisation) or with a heterologous protein (hetero-dimerisation) of interest, through the mediation of its leucine zipper domain. Preferred polypeptides of the invention are polypeptides comprising fragments of SEQ ID NO: 245 from position 371 to 392, and fragments having any of the biological activities described herein.

Protein of SEQ ID NO: 257 (Internal Designation 106-043-4-0-H3-CS)

The 265-amino-acid-long protein of SEQ ID NO: 257 encoded by the cDNA of SEQ ID NO: 16 exhibits homology to the Homo sapiens hypothetical protein (Genbank accession number AJ278482). These two proteins are probably the result of alternative splicing.

The protein of SEQ ID NO: 257 displays a leucine zipper pattern situated from position 155 to 176. Thus, it is believed that the protein of SEQ ID NO: 257 is able to dimerize either with itself (homo-dimerisation) or with a heterologous protein (hetero-dimerisation) of interest, through the mediation of its leucine zipper domain. Preferred polypeptides of the invention are polypeptides comprising leucine zipper domain fragments and fragments having any of the biological activities described herein.

Protein of SEQ ID NO: 314 (Internal Designation 188-41-1-0-B8-CS.cor)

A growing number of proteins have been shown to undergo post-translational modification by fatty acids that are covalently linked to cysteine residues through a thioester bond. Fatty acid modifications contribute to intracellular protein localization by facilitating membrane binding and also by strengthening protein-protein interactions. Cycles of palmitoylation and depalmitoylation have been described for a number of intracellular proteins, but the relevant enzymes that catalyze these processes have yet to be fully characterized and the full significance of these cycles remains to be elucidated.

Palmitoyl-protein thioesterase-1 (PPT1) is a lysosomal hydrolase that removes long-chain fatty acyl groups from modified cysteine residues in proteins. Mutations in PPT1 have been found to cause the infantile form of neuronal ceroid lipofuscinosis (INCL).

Soyombo and Hofmann (J. Biol. Chem. 272: 27456-27463 [1997]) identified cDNAs encoding PPT2. The deduced PPT2 protein contains 302 amino acids, including a 27-amino acid leader peptide, a sequence motif characteristic of many thioesterases and lipases, and 5 potential N-linked glycosylation sites. PPT2 shares 18% amino acid identity with PPT1. Soyombo and Hofmann tentatively localized the human PPT2 gene to 6p21.3. Northern blot analysis detected a predominant 2.0-kb PPT2 transcript in the human tissues examined, with the highest expression in skeletal muscle; variable amounts of 2.8- and 7.0-kb transcripts were also observed.

Cell fractionation studies indicate that PPT2 is present in the lysosomal fraction. Immunoblot analysis of recombinant PPT2 expressed in mammalian cells showed 6 PPT2 proteins ranging in size from 31 to 42 kDa. Treatment that removes asparagine-linked oligosaccharides resulted in a single major protein of 31 kDa and a minor protein of 33 kDa.

Recombinant PPT2, like PPT1, possesses thioesterase activity and localizes to the lysosome. Since PPT2 could not substitute for PPT1 in correcting the metabolic defect in INCL cells and was unable to remove palmitate groups from palmitoylated proteins, it appears that PPT2 possesses a different substrate specificity than PPT1. Another study, however, was able to show, after expression of the recombinant protein in a baculovirus system and using cell lysate as substrate, that the protein had S-thioesterase activity with a preference for the acyl groups of palmitic and myristic acid.

The subject invention provides the protein/polypeptide of SEQ ID NO:314, encoded by the cDNA of SEQ ID NO:73. The invention also provides biologically active fragments of SEQ ID NO:314. In one embodiment, the polypeptides of SEQ ID NO:314 are interchanged with the corresponding polypeptide encoded by the human cDNA of clone 188-41-1-0-B8-CS. “Biologically active fragments” are defined as those peptide or polypeptide fragments having at least one of the biological functions of the full length protein (e.g., removal of long-chain fatty acyl groups from modified cysteine residues in proteins). Compositions of the protein/polypeptide of SEQ ID NO:314, or biologically active fragments thereof, are also provided by the subject invention. These compositions may be made according to methods well known in the art.

The invention also provides variants of the protein of SEQ ID NO:314. These variants have at least about 80%, more preferably at least about 90%, and most preferably at least about 95% amino acid sequence identity to the amino acid sequence encoded by SEQ ID NO:73. Variants according to the subject invention also have at least one functional or structural characteristic of the protein of SEQ ID NO:314. The invention also provides biologically active fragments of the variant proteins. Compositions of variants, or biologically active fragments thereof, are also provided by the subject invention. These compositions may be made according to methods well known in the art. Unless otherwise indicated, the methods disclosed herein can be practiced utilizing the protein encoded by SEQ ID NO:73, biologically active fragments of SEQ ID NO:314, variants of SEQ ID NO:314, and biologically active fragments of the variants.
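The identity thresholds above can be illustrated with a short sketch. It assumes the reference and variant sequences have already been aligned by some external tool and are supplied as equal-length strings with "-" for gaps, and it assumes one common convention (identical residues divided by the total number of alignment columns). The specification does not state which convention or alignment method applies, so both function names and the convention are illustrative only.

```python
def percent_identity(aligned_ref: str, aligned_var: str) -> float:
    """Percent identity over all columns of a pre-computed pairwise alignment."""
    if len(aligned_ref) != len(aligned_var):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_ref:
        raise ValueError("alignment is empty")
    # A column counts as identical only if both residues match and neither is a gap.
    matches = sum(1 for a, b in zip(aligned_ref, aligned_var) if a == b and a != "-")
    return 100.0 * matches / len(aligned_ref)


def identity_tier(pct: float) -> str:
    """Map a percent identity onto the 80% / 90% / 95% thresholds named in the text."""
    for threshold in (95, 90, 80):
        if pct >= threshold:
            return f"at least {threshold}%"
    return "below 80%"


# Toy alignment (not sequences of the invention): one mismatch in ten columns.
pct = percent_identity("ACDEFGHIKL", "ACDEFGHIKV")
print(pct, identity_tier(pct))
```

Other conventions (for example, dividing by the length of the shorter sequence or excluding gapped columns) give different values for the same alignment, so the convention should be fixed before comparing against a threshold.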

Because of the redundancy of the genetic code, a variety of different DNA sequences can encode the amino acid sequence of SEQ ID NO:314. In a preferred embodiment, SEQ ID NO:314 is encoded by clone 188-41-1-0-B8-CS or the cDNA of SEQ ID NO:73. It is well within the skill of a person trained in the art to create these alternative DNA sequences which encode proteins having the same, or essentially the same, amino acid sequence. These variant DNA sequences are, thus, within the scope of the subject invention. As used herein, reference to “essentially the same” sequence refers to sequences that have amino acid substitutions, deletions, additions, or insertions that do not materially affect biological activity. Fragments retaining one or more characteristic biological activity of the protein encoded by clone 188-41-1-0-B8-CS are also included in this definition.
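As a rough illustration of this redundancy, the number of distinct DNA sequences that encode a given peptide is the product of the number of synonymous codons for each residue. The sketch below assumes the standard genetic code and ignores the stop codon; the function name and the toy peptides are hypothetical.

```python
from math import prod

# Number of sense codons per amino acid in the standard genetic code
# (61 sense codons in total; the three stop codons are not included).
CODONS_PER_RESIDUE = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "C": 2, "D": 2, "E": 2, "F": 2, "H": 2,
    "K": 2, "N": 2, "Q": 2, "Y": 2,
    "M": 1, "W": 1,
}


def count_coding_sequences(peptide: str) -> int:
    """Count the DNA sequences that encode `peptide` under the standard code.

    Raises KeyError for any character that is not one of the 20 standard
    one-letter amino acid codes.
    """
    return prod(CODONS_PER_RESIDUE[aa] for aa in peptide.upper())
```

For a three-residue peptide such as Met-Lys-Leu the count is 1 x 2 x 6 = 12; the count grows multiplicatively with length, which is why a full-length protein has an extremely large number of possible coding sequences.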

In one aspect of the subject invention, SEQ ID NO:314, and variants thereof, can be used to generate polyclonal or monoclonal antibodies. Both biologically active and immunogenic fragments of SEQ ID NO:314, or variant proteins, can be used to produce antibodies. Polyclonal and/or monoclonal antibodies can be made according to methods well known to the skilled artisan. Antibodies produced in accordance with the subject invention can be used in a variety of detection assays known to those skilled in the art.

SEQ ID NO:314 can be used as a marker for identification of lysosome dysfunction in individuals. In this aspect of the subject invention, antibodies specific for SEQ ID NO:314, or fragments thereof, are used in routine immunoassays to screen for the presence or absence of SEQ ID NO:314, or fragments thereof, in samples containing lysosomal contents. The presence or absence of the protein of SEQ ID NO:314 can be used to provide an indication of lysosomal function and is, thus, useful for diagnostic/prognostic identification of lysosomal dysfunction.

The subject invention also provides materials and methods for the screening of individual samples for the presence or absence of nucleic acids encoding the protein of SEQ ID NO:314, or variants thereof. In one embodiment, nucleic acids are provided for hybridization assays, known to those skilled in the art, of mRNA or cDNA. The hybridization assays are performed upon nucleic acid samples obtained, or derived from, an individual with suspected lysosomal dysfunction. The hybridization assays screen for the presence or absence of nucleic acids encoding SEQ ID NO:314, or variants thereof. The presence or absence of such nucleic acids can be used as a predictive/prognostic indicator of disease state or lysosome function.

Nucleic acids of the invention can also be used in gene replacement or gene therapy protocols. In this aspect of the subject invention, nucleic acids encoding SEQ ID NO:314, or biologically active fragments thereof, can be introduced into cells, and the cells implanted into an individual with a lysosomal disorder. In one embodiment, genetically engineered macrophages can be used for the treatment regimen (see, for example, Eto and Ohashi [2000] J. Inherit. Metabol. Dis. 23:293-298). Alternatively, autologous cells may be obtained from an individual, transformed with nucleic acid ex vivo, expanded ex vivo, and reintroduced into the individual. Such methods are well known to the skilled artisan.

Protein of SEQ ID NO:280 (Internal Designation 160-75-4-0-A9-CS):

The protein of SEQ ID NO:280, encoded by the cDNA of SEQ ID NO:39 and expressed in the fetal brain, is a chromosome 12 paralog of C7orf2, a human protein described as a transmembrane receptor located on chromosome 7 (Heus, H. C., A. Hing, et al. (1999) Genomics 57(3): 342-51). In addition, this protein is an ortholog of the murine gene LMBR1L, found to be involved in polydactyly in mice (Clark, R. M., P. C. Marker, et al. (2000) Genomics 67(1): 19-27). A high level of homology was also found with a gene identified in Fugu rubripes (AF056116), as well as with C. elegans R05D3.2 (Gellner, K. and S. Brenner (1999) Genome Res 9(3): 251-8).

The 362-amino-acid-long protein of SEQ ID NO:280, encoded by the cDNA of SEQ ID NO:39, is a splice variant of Z64989, located on chromosome 12. The chromosome 12 gene has 6 known variants described in entries AK001356 and AK001651 in GenBank and entries A26354, A26375, X27360 and Z64989 in geneseqn. The closest sequence is Z64989, at both the nucleotide and the protein level. Z64989 is split into 17 exons, of which the protein of the invention contains the last 14. The transcription start site of the cDNA of SEQ ID NO:39 lies within the third intron of Z64989, and the protein of the invention starts at position 128 of Z64989. In addition, 2 potential leucine zippers are present in the protein of the invention (positions 136-157 and 272-293).
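The specification does not say how the potential leucine zippers were predicted. A minimal sketch of one common approach, assuming the widely used heptad pattern L-x(6)-L-x(6)-L-x(6)-L (a leucine at every seventh position, four times), is given below. The pattern choice, function name and toy sequence are assumptions for illustration only.

```python
import re

# Heptad-repeat pattern L-x(6)-L-x(6)-L-x(6)-L, 22 residues in total.
# The lookahead lets overlapping candidates be reported as well.
LEUCINE_ZIPPER = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")


def find_leucine_zippers(protein: str) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) spans of candidate zippers."""
    spans = []
    for match in LEUCINE_ZIPPER.finditer(protein.upper()):
        start = match.start() + 1                    # convert to 1-based
        end = match.start() + len(match.group(1))    # inclusive end
        spans.append((start, end))
    return spans


# Toy sequence (not SEQ ID NO:280): one heptad-spaced run of four leucines.
toy = "AAAAA" + "L" + "A" * 6 + "L" + "A" * 6 + "L" + "A" * 6 + "L" + "AAA"
print(find_leucine_zippers(toy))
```

Each reported span is 22 residues long, the same length as the two spans recited above (136-157 and 272-293). A pattern match of this kind is only a candidate; it does not by itself show that the region forms a coiled coil.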

Preaxial polydactyly is a congenital hand malformation that includes duplicated thumbs, various forms of triphalangeal thumbs, and duplications of the index finger. Clark et al. (supra) demonstrated the correspondence between the spatial and temporal changes in Lmbr1 expression and the embryonic onset of the polydactyly mutant phenotype, suggesting that a downregulation of Lmbr1 results in polydactyly. It is likely that the Lmbr1 gene is involved in the patterning of limbs during mammalian development, for example by receiving and transducing a locally secreted ligand in the developing limb.

It is believed that the protein of SEQ ID NO:280 is a paralog of human C7orf2, and is thus a membrane bound protein implicated in the patterning of the mammalian body plan during early development. For example, the protein of the invention may be involved in organizing limb development, as well as in the development of the fetal brain. As such, the activity of the present protein likely influences various cellular processes, including gene expression, cellular growth and proliferation, as well as cellular differentiation. In addition, leucine zippers within the present protein render the protein capable of undergoing specific protein-protein interactions with other leucine-zipper containing proteins, including with itself (i.e. homodimerization). Preferred polypeptides of the invention are fragments of SEQ ID NO:280 having any of the biological activities described herein.

In one embodiment of the present invention, the present protein can be used to identify cells of the fetal brain. For example, the protein of the invention or part thereof may be used to synthesize specific antibodies using any technique known to those skilled in the art. Such tissue-specific antibodies may then be used to identify tissues of unknown origin, such as in forensic samples, differentiated tumor tissue that has metastasized to foreign bodily sites, etc., or to differentiate different tissue types in a tissue cross-section using immunochemistry. In addition, labeled reagents that can specifically bind to the protein of the invention can be used to visualize cell membranes and the components of the secretory pathway in cells, e.g. the ER and Golgi.

In another embodiment of the present invention, the present protein can be used to diagnose developmental abnormalities, or the potential for such abnormalities, e.g. in a fetus or in adults (i.e. to determine if they are carriers of a mutant copy of the gene). Individuals found to carry one or two mutant copies of the present gene would be candidates for, e.g. gene therapy or other strategies to correct or compensate for the gene deficiency, or for strategies to ensure that their children would not be carriers of the mutated gene. The characterization of mutations in genes encoding the present protein would also be of great value in understanding the nature of polydactyly and other developmental disorders, thereby facilitating the development of other strategies for treating and preventing these disorders.

In another embodiment, the present protein is used to modulate gene expression, cell growth and proliferation, and/or cell differentiation in cells in vitro or in vivo. For example, any of these behaviors can be increased or inhibited in cells grown in vitro, e.g. for protein production or for ex vivo therapeutic strategies. In addition, any disease associated with an increase or decrease in any of these cellular behaviors in vivo can be treated or prevented by enhancing or inhibiting the expression or activity of the protein of the invention in cells in vivo.

Proteins of SEQ ID NOs: 309 and 304 (Internal Designations 188-11-1-0-B3-CS and 187-34-0-0-l12-CS):

The proteins of SEQ ID NOs: 309 and 304 are encoded by the cDNAs of SEQ ID NOs: 68 and 63. Accordingly, it will be appreciated that all characteristics and uses of the polypeptides of SEQ ID NOs: 309 and 304 described throughout the present application also pertain to the polypeptides encoded by human cDNA of clones 188-11-1-0-B3-CS and 187-34-0-0-l12-CS. In addition, it will be appreciated that all characteristics and uses of the nucleic acids of SEQ ID NOs: 68 and 63 described throughout the present application also pertain to the nucleic acids of the human cDNAs of clones 188-11-1-0-B3-CS and 187-34-0-0-l12-CS.

The protein of SEQ ID NO: 309 (encoded by the clone having internal designation number 188-11-1-0-B3-CS) and the polymorphic variant thereof of SEQ ID NO: 304 (encoded by the clone having internal designation number 187-34-0-0-l12-CS and which differs from the polypeptide encoded by the clone having internal designation number 188-11-1-0-B3-CS at a single amino acid), are highly homologous to the first 279 amino acids of the LGI1 (Leucine-rich gene—Glioma Inactivated) protein. Clones 188-11-1-0-B3-CS and 187-34-0-0-l12-CS appear to be splicing and polymorphic variants of LGI1. The LGI1 protein is 557 amino acids in length (see Somerville et al., (2000) Mammalian Genome 11, 622-627; Chernova, et al. (1998) Oncogene 17, 2873-2881, the disclosures of which are incorporated herein by reference in their entireties). Clone 188-11-1-0-B3-CS aligns with the first 279 amino acids of LGI1, followed by the addition of 12 amino acids (VLREIHRFTNMS) to the C-terminal end which do not appear to be homologous to LGI1. Like LGI1, clone 188-11-1-0-B3-CS and the polymorphic variant 187-34-0-0-l12-CS contain the LRR domain and are highly expressed in brain tissue.

LGI1 belongs to a large family of leucine-rich repeat (LRR) proteins. It is believed that the LRR domains act as a region of protein-protein interaction. This has been substantiated as the family of known LRR proteins has grown. Leucine-rich repeats have been identified as essential components in glycoprotein hormone receptors, proteoglycans and the Trk proteins by expression of mutants and artificial chimaeras in tissue culture and by biochemical analysis of the properties of these constructs. Many transmembrane LRR proteins are known or suspected to encode truncated forms (N and L 6 , and slit for example) with functional significance. The proteoglycan Decorin, a secreted protein, binds TGF-β, a growth factor which stimulates decorin expression. Since decorin inhibits growth of cultured cells, it may form part of a negative feedback loop to regulate cell growth. This is similar to the proposed function of the LGI1 receptor protein.

Analysis of brain gliomas has revealed that LGI1 expression is either abolished or greatly reduced in high-grade tumors compared with more benign ones, indicating a role as a tumor suppressor gene (Cowell et al. 2000; Cowell et al. 1998, the disclosure of which is incorporated herein by reference in its entirety). Most glioblastoma multiforme (GBM) brain tumors contain only one genomic copy of LGI1, and this one is almost invariably not expressed. How the gene is inactivated is not clear, although one possibility is that chromosome or gene rearrangements, which occur in 20-25% of tumors, cause inactivation as a result of a positional effect. Recently it was determined that the LGI1 gene is located on 10q24, is disrupted by translocation in the T98G GBM cell line, and is also rearranged in over 26% of primary brain tumors. Alternatively, LGI1 may be part of a highly regulated pathway where inactivation of other key members or highly specific transcription factors results in either inactivation of all genes in the pathway or a failure to initiate transcription.

Since functional inactivation of LGI1 occurs during the transition of low-grade to high-grade brain tumors, knockout or transgenic mice in which the expression of the protein of SEQ ID NO:309 or 304 has been reduced, eliminated or altered may be used as a disease model. In particular, mice that overexpress LGI1 may be used as a tumorigenesis model.

Mice are particularly useful as models for assessing the consequences of altering the level or activity of the proteins of SEQ ID NO:309 or 304 or to identify agents useful in treating tumorigenesis, since human and mouse LGI1 are highly conserved, showing 91% identity at the nucleotide level and 97% similarity at the amino acid level, with most of the amino acid substitutions being conservative. The mouse lgi1 gene is 4.2 kb in length, while the human LGI1 is 2.2 kb in length. This difference in size between the human and mouse gene is a result of the inclusion of a 2 kb sequence in the 3′ untranslated region in the mouse gene. Whether the additional sequence affects gene expression is not clear. Further analysis of the genomic sequence reveals that the number of exon/intron boundaries is also similar in humans and mice. The high degree of LGI1 conservation between mice and humans implies that this gene has experienced a strong selection pressure. It is intriguing to speculate that any major deviations in the primary protein sequence may result in a loss of function of this gene product. Total or partial loss of the LGI1 gene function could, therefore, be lethal, which in turn implies that LGI1 plays an important role in normal brain development as well as in tumor formation.

SEQ ID NOs:309 and 304 also have high homology with Slit, a secreted Drosophila protein which plays a role in the development of axon pathways in the central nervous system. The Slit protein is necessary for the normal development of the midline of the CNS, particularly the midline glial cells, and for the concomitant formation of the commissural axon pathway. The process is dependent on the level of Slit protein expression. It appears that the Slit protein is excreted by the midline glial cells, where it is synthesized, and is eventually associated with the surface axons that traverse them. Contact of cells with supernatant expressing the product of this gene increases the permeability of THP-1 monocyte cells to calcium. Thus, it is likely that Slit is involved in a signal transduction pathway that is initiated when Slit protein binds a receptor on the surface of the monocyte cell.

In view of the above, it is believed that the proteins of SEQ ID NOs:309 and 304 are involved in a signal transduction pathway mediated through a receptor that modulates the differentiation and/or proliferation of cells.

Northern blot analysis detects LGI1 transcripts only in brain, neural tissue, and skeletal muscle but not in heart, kidney, lung, placenta, liver, or pancreas. Northern blot analysis of RNA derived from several different regions of human brain revealed a widespread expression of LGI1 although with different intensities. The highest abundance was found in cerebral cortex, hippocampus, and putamen. The lowest expression was detected in corpus callosum. The levels of expression were intermediate in the other brain regions. Accordingly, the proteins of SEQ ID NOs:309 or 304 or fragments thereof, as well as polynucleotides encoding the proteins of SEQ ID NOs:309 or 304, may be used to determine whether a tissue sample is derived from brain (and in particular cerebral cortex, hippocampus, or putamen), neural tissue, or skeletal muscle, or to distinguish whether a tissue sample is derived from brain or another tissue, such as heart, kidney, lung, placenta, liver, or pancreas.

Accordingly, the present invention includes the use of the protein of SEQ ID NOs: 309 or 304, fragments comprising at least 5, 8, 10, 12, 15, 20, 25, 30, 35, 40, 50, 60, 75, 100, 150, or 200 consecutive amino acids thereof, or fragments having a desired biological activity to treat or ameliorate a condition, such as those listed above, in an individual. In such embodiments, the protein of SEQ ID NO:309 or 304, or a fragment thereof, is administered to an individual in whom it is desired to increase or decrease any of the activities of the protein of SEQ ID NO:309 or 304, including tumor suppression, modulation of neural development or involvement in brain tumors, glioblastoma multiforme, brain injuries, neurodegenerative disease states and behavioral disorders such as Alzheimer's disease, Parkinson's disease, epilepsy, multiple sclerosis, Huntington's disease, schizophrenia, obsessive compulsive disorders, and in the processes of nerve regeneration in spinal cord injury, stroke, facial nerve damage, diabetes-caused nerve damage, and retinal regeneration.
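The notion of a fragment comprising a given number of consecutive amino acids can be made concrete with a short sliding-window sketch. The function name is hypothetical, and the 12-residue stretch VLREIHRFTNMS quoted earlier is used only as a convenient toy input, not as a full-length sequence.

```python
def consecutive_fragments(sequence: str, length: int) -> list[str]:
    """List every contiguous fragment of exactly `length` residues.

    A sequence of n residues has n - length + 1 such fragments; the list is
    empty when `length` exceeds n.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    return [
        sequence[i:i + length]
        for i in range(len(sequence) - length + 1)
    ]


# Toy input: the 12-residue C-terminal stretch quoted in the text.
fragments = consecutive_fragments("VLREIHRFTNMS", 5)
print(len(fragments), fragments[0], fragments[-1])
```

Because every longer fragment contains one of these windows, enumerating windows of the minimum length is enough to test whether a candidate polypeptide comprises "at least" that many consecutive residues of the reference.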

The protein of SEQ ID NO:309 or 304 or a fragment thereof may be administered directly to the individual or, alternatively, a nucleic acid encoding the protein of SEQ ID NO:309 or 304 or a fragment thereof may be administered to the individual. Alternatively, an agent which increases the activity of the protein of SEQ ID NO:309 or 304 may be administered to the individual. Such agents may be identified by contacting the protein of SEQ ID NO:309 or 304 or a cell or preparation containing the protein of SEQ ID NO:309 or 304 with a test agent and assaying whether the test agent increases the activity of the protein. For example, the test agent may be a chemical compound or a polypeptide or peptide.

Alternatively, the activity of the protein of SEQ ID NO:309 or 304 may be decreased by administering an agent which interferes with such activity to an individual. Agents which interfere with the activity of the protein of SEQ ID NO:309 or 304 may be identified by contacting the protein or a cell or preparation containing the protein with a test agent and assaying whether the test agent decreases the activity of the protein. For example, the agent may be a chemical compound, a polypeptide or peptide, an antibody, or a nucleic acid such as an antisense nucleic acid or a triple helix-forming nucleic acid.

In one embodiment, the invention relates to methods and compositions using the protein of the invention or part thereof as a marker protein to selectively identify tissues, preferably brain, or to distinguish between two or more possible sources of a tissue sample on the basis of the level of the protein of SEQ ID NO:309 or 304 in the sample. For example, the protein of SEQ ID NO:309 or 304 or fragments thereof may be used to generate antibodies using any techniques known to those skilled in the art, including those described herein. Such tissue-specific antibodies may then be used to identify tissues of unknown origin, for example, forensic samples, differentiated tumor tissue that has metastasized to foreign bodily sites, or to differentiate different tissue types in a tissue cross-section using immunochemistry. In such methods a tissue sample is contacted with the antibody, which may be detectably labeled, under conditions which facilitate antibody binding. The level of antibody binding to the test sample is measured and compared to the level of binding to control cells from brain or tissues other than brain to determine whether the test sample is from brain. Alternatively, the level of the protein of SEQ ID NO:309 or 304 in a test sample may be measured by determining the level of RNA encoding the protein of SEQ ID NO:309 or 304 in the test sample. RNA levels may be measured using nucleic acid arrays or using techniques such as in situ hybridization, Northern blots, dot blots or other techniques familiar to those skilled in the art. If desired, an amplification reaction, such as a PCR reaction, may be performed on the nucleic acid sample prior to analysis. The level of RNA in the test sample is compared to RNA levels in control cells from brain or tissues other than brain to determine whether the test sample is from brain.

For a number of disorders listed above, particularly of the nervous system, expression of the genes encoding the polypeptide of SEQ ID NO:309 or 304 at significantly higher or lower levels may be routinely detected in certain tissues or cell types (e.g., cancerous and wounded tissues) or bodily fluids (e.g., serum, plasma, synovial fluid, and spinal fluid) or another tissue or cell sample taken from an individual having such a disorder, relative to the standard gene expression level, i.e., the expression level in healthy tissue or bodily fluid from an individual not having the disorder.

In another embodiment, antibodies to the protein of SEQ ID NO:309 or 304 or part thereof may be used for detection, enrichment, or purification of cells expressing the protein of SEQ ID NO:309 or 304, including using methods known to those skilled in the art. For example, an antibody against the protein of SEQ ID NO:309 or 304 or a fragment thereof may be fixed to a solid support, such as a chromatography matrix. A preparation containing cells expressing the protein of SEQ ID NO:309 or 304 is placed in contact with the antibody under conditions which facilitate binding to the antibody. The support is washed and then the cells are released from the support by contacting the support with agents which cause the cells to dissociate from the antibody.

In another embodiment of the present invention, the protein of SEQ ID NO:309 or 304 or a fragment thereof may be used to diagnose disorders associated with altered expression of the protein of SEQ ID NO:309 or 304. In some embodiments, the protein of SEQ ID NO:309 or 304 or fragments thereof may be used to diagnose cancer. In such techniques, the level of the protein of SEQ ID NO:309 or 304 in an ill individual is measured using techniques such as those described herein and compared to the level in normal individuals. For example, a decreased level of the protein of SEQ ID NO:309 or 304 relative to normal individuals suggests that the ill individual may suffer from cancer or be predisposed to getting cancer in the future.

Another embodiment of the present invention is a polypeptide comprising a structural or functional domain of the protein of SEQ ID NO:309 or 304. Such structural or functional domains of the protein of SEQ ID NO:309 or 304 include a leucine rich repeat C-terminal domain located between amino acid positions 173 and 222, a leucine rich repeat located between amino acid positions 92 and 115, a leucine rich repeat located between amino acid positions 116 and 139, a leucine rich repeat located between amino acid positions 140 and 163, a leucine rich repeat located between amino acid positions 164 and 185, a membrane spanning segment located between amino acid positions 15 and 35, and a signal peptide comprising the sequence FLCLLSALLLTEG/KK.

Accordingly, the protein of SEQ ID NO:309 or 304 or fragments thereof, or polynucleotides encoding these proteins or fragments, may be used in in vitro diagnostic assays for malignant brain tumors, such as glioblastoma multiforme. These proteins or nucleic acids may also be used in the attenuation/prevention and/or treatment of brain tumors and/or brain injuries, of neurodegenerative disease states and behavioral disorders such as Alzheimer's disease, Parkinson's disease, epilepsy, multiple sclerosis, Huntington's disease, schizophrenia, obsessive compulsive disorders, and in the processes of nerve regeneration in spinal cord injury, stroke, facial nerve damage, diabetes-caused nerve damage, and retinal regeneration.

In addition, the protein, as well as antibodies directed against the protein and relevant small molecules, may be used as tumor markers and/or immunotherapy targets for the above disease states. For example, antibodies directed against amino acids VLREIHRFTNMS of both clones may aid in the differential detection of the secreted and receptor forms of this protein, since the proteins of SEQ ID NOs:309 and 304 have homology to the secreted forms of LGI1. In addition, the proteins of SEQ ID NOs:309 and 304 or fragments thereof may be used to identify binding partners as described herein.

DNA-Binding Proteins

The invention relates to compositions and methods using proteins of the invention containing a DNA-binding domain, herein referred to as DBP, such as the ones described in this section and those containing a DNA binding domain as shown on Table VI, or parts thereof, preferably fragments comprising a DNA binding domain, or derivatives thereof.

Transcriptional regulation is primarily achieved by the sequence-specific binding of proteins to DNA and RNA. Of the known protein motifs involved in the sequence specific recognition of DNA, the zinc finger protein is unique in its modular nature. Zinc finger domains are found in numerous zinc binding proteins which are involved in protein-nucleic acid interactions. They are independently folded zinc-containing mini-domains which are used in a modular repeating fashion to achieve sequence-specific recognition of DNA (Klug 1993 Gene 135, 83-92). Such zinc binding proteins are commonly involved in the regulation of gene expression, and usually serve as transcription factors (see U.S. Pat. Nos. 5,866,325; 6,013,453 and 5,861,495).

To date, zinc finger proteins have been identified which contain between 2 and 37 modules. More than two hundred proteins, many of them transcription factors, have been shown to possess zinc finger domains. Zinc fingers connect transcription factors to their target genes mainly by binding to specific sequences of DNA. Zinc finger modules are found in a wide variety of transcription regulatory proteins in eukaryotic organisms. A zinc finger domain is generally composed of 25 to 30 amino acid residues which form one or more tetrahedral ion binding sites. The binding sites contain four ligands consisting of the sidechains of cysteine, histidine and occasionally aspartate or glutamate. The binding of zinc allows the relatively short stretches of polypeptide to fold into defined structural units which are well-suited to participate in macromolecular interactions (Berg, J. M. et al. (1996) Science 271:1081-1085). The zinc finger domain was first recognized in the transcription factor TFIIIA from Xenopus oocytes (Miller, et al., EMBO, 4:1609-1614, 1985; Brown, et al., FEBS Lett., 186:271-274, (1985)).

Zinc binding domains which contain a C3HC4 sequence motif are known as RING domains (Lovering, R. et al. (1993) Proc. Natl. Acad. Sci. USA 90:2112-2116). The RING domain consists of eight metal binding residues, and the sequences that bind the two metal ions overlap (Barlow, P. N. et al. (1994) J. Mol. Biol. 237:201-211). Functions of RING finger proteins are mediated through DNA binding and include the regulation of gene expression, DNA recombination, and DNA repair (see Borden and Freemont, Curr Opin Struct Biol 6:395-401 (1996) and U.S. Pat. No. 5,861,495).
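The eight metal binding residues of a RING domain (three cysteines, one histidine, four cysteines) can be searched for with a simple pattern. The sketch below assumes the commonly cited consensus spacing C-X2-C-X(9-39)-C-X(1-3)-H-X(2-3)-C-X2-C-X(4-48)-C-X2-C; that spacing is not taken from this specification, and the function name and toy sequence are hypothetical.

```python
import re
from typing import Optional

# Assumed RING consensus: C-X2-C-X(9-39)-C-X(1-3)-H-X(2-3)-C-X2-C-X(4-48)-C-X2-C
RING_C3HC4 = re.compile(
    r"C.{2}C.{9,39}C.{1,3}H.{2,3}C.{2}C.{4,48}C.{2}C"
)


def find_ring_domain(protein: str) -> Optional[tuple[int, int]]:
    """Return the 1-based inclusive span of the first candidate, or None."""
    match = RING_C3HC4.search(protein.upper())
    if match is None:
        return None
    return (match.start() + 1, match.end())


# Toy sequence (not from the sequence listing) with the eight ligands
# placed at consensus-compatible spacings.
toy_ring = (
    "MA" + "C" + "AA" + "C" + "A" * 10 + "C" + "AA" + "H" + "AA"
    + "C" + "AA" + "C" + "A" * 10 + "C" + "AA" + "C" + "AA"
)
print(find_ring_domain(toy_ring))
```

A hit only indicates that the ligand spacing is compatible with a RING finger; profile-based domain databases are generally more sensitive and more specific than a single regular expression.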

Both the RING finger and the LIM domain mediate protein-protein interactions and are involved in transcriptional control, either by directly affecting transcription or recruiting co-activators or co-repressors. LIM domains also contribute to various signalling pathways. They may interact with protein kinases and anchor gene products to large protein complexes or to cellular compartments.

PHD fingers are C4HC3 zinc fingers spanning approximately 50-80 residues and distinct from RING fingers or LIM domains. They are thought to be mostly DNA or RNA binding domains but may also be involved in protein-protein interactions (for a review see Aasland et al, Trends Biochem Sci 20:56-59 (1995)). The PHD finger domain, belonging to the zinc finger domain family, is found in many regulatory proteins which are frequently associated with chromatin-mediated transcriptional regulation.

The nucleic acid binding activity of DBP or part thereof may be assayed using any of the assays known to those skilled in the art including those described in U.S. Pat. No. 6,013,453.

The invention relates to compositions and methods using DBPs or part thereof, especially fragments comprising a DNA-binding domain, to stimulate gene transcription.

One of the remarkable features of activation domains of transcriptional factors in general is that “fusing” them to heterologous protein domains seldom affects their ability to activate transcription when recruited to a wide variety of promoters. The high degree of functional independence exhibited by these activation domains makes them valuable tools in various biological assays for analyzing gene expression and protein-protein or protein-RNA or protein-small molecule drug interactions. Several strategies to improve the potency of activation domains and thereby the expression of genes under their control have been reported. These approaches generally involve increasing the number of copies of activation domains fused to the DNA binding domain or generating activators containing synergizing combinations of activation domains.

Therefore, in an additional embodiment, this invention provides compositions and methods containing new transcription factors comprising DBP or part thereof, preferably fragments containing DNA-binding domains. Such transcription factors may be designed to regulate the expression of target genes of interest. Aspects of the invention are applicable to systems involving either covalent or non-covalent linking of the transcription activation domain to a DNA binding domain. In practice, cells can be engineered by the introduction of recombinant nucleic acids encoding the fusion proteins containing at least two mutually heterologous domains, one of them being the DNA-binding domain of the invention, and in some cases additional nucleic acid constructs, to render them capable of ligand-dependent regulation of transcription of a target gene. Administration of the ligand to the cells then regulates (positively, or in some cases, negatively) target gene transcription (all laboratory methods related to this embodiment are completely described in U.S. Pat. No. 6,015,709, which disclosure is hereby incorporated by reference in its entirety). Illustrative (non-limiting) examples of heterologous domains which can be included along with a DNA-binding domain in various fusion proteins of this invention include other transcription regulatory domains (i.e., transcription activation domains such as a p65, VP16 or AP domain; transcription potentiating or synergizing domains; or transcription repression domains such as an ssn-6/TUP-1 domain or Kruppel family suppressor domain); a DNA binding domain such as a GAL4, lex A or a composite DNA binding domain such as a composite zinc finger domain or a ZFHD1 domain; or a ligand-binding domain comprising or derived from (a) an immunophilin, cyclophilin or FRB domain; (b) an antibiotic binding domain such as tetR; or (c) a hormone receptor such as a progesterone receptor or ecdysone receptor.

A wide variety of ligand binding domains may be used in this invention, although ligand binding domains which bind to a cell permeant ligand are preferred. It is also preferred that the ligand have a molecular weight under about 5 kD, more preferably below 2.5 kD and optimally below about 1500 D. Non-proteinaceous ligands are also preferred. Examples of ligand binding domain/ligand pairs that may be used in the practice of this invention include, but are not limited to: FKBP:FK1012, FKBP:synthetic divalent FKBP ligands (see WO 96/0609 and WO 97/31898), FRB:rapamycin/FKBP (see e.g., WO 96/41865 and Rivera et al, “A humanized system for pharmacologic control of gene expression”, Nature Medicine 2(9):1028-1032 (1997)), cyclophilin:cyclosporin (see e.g. WO 94/18317), DHFR:methotrexate (see e.g. Licitra et al, 1996, Proc. Natl. Acad. Sci. U.S.A. 93:12817-12821), TetR:tetracycline or doxycycline or other analogs or mimics thereof (Gossen and Bujard, 1992, Proc. Natl. Acad. Sci. U.S.A. 89:5547; Gossen et al, 1995, Science 268:1766-1769; Kistner et al, 1996, Proc. Natl. Acad. Sci. U.S.A. 93:10933-10938), a progesterone receptor:RU486 (Wang et al, 1994, Proc. Natl. Acad. Sci. U.S.A. 91:8180-8184), ecdysone receptor:ecdysone or muristerone A or other analogs or mimics thereof (No et al, 1996, Proc. Natl. Acad. Sci. U.S.A. 93:3346-3351) and DNA gyrase:coumermycin (see e.g. Farrar et al, 1996, Nature 383:178-181). In many applications it is preferable to use a DNA binding domain which is heterologous to the cells to be engineered. In the case of composite DNA binding domains, component peptide portions which are endogenous to the cells or organism to be engineered are generally preferred.

In another aspect of this embodiment, polynucleotides encoding DNA-binding domains as well as any other functional fragments of DBP may be introduced into polynucleotides encoding fusion proteins for a variety of regulated gene expression systems, including both allostery-based systems such as those regulated by tetracycline, RU486 or ecdysone, or analogs or mimics thereof, and dimerization-based systems such as those regulated by divalent compounds like FK1012, FKCsA, rapamycin, AP1510 or coumermycin, or analogs or mimics thereof, all as described below (See also, Clackson, Controlling mammalian gene expression with small molecules, Current Opinion in Chem. Biol. 1:210-218 (1997)). The fusion proteins may comprise any combination of relevant components, including bundling domains, DNA binding domains, transcription activation (or repression) domains and ligand binding domains. Other heterologous domains may also be included.

Another embodiment of this invention relates to expression systems, preferably vectors and vector-containing cells, using DBP or part thereof, especially the DNA-binding domain. In this regard, recombinant nucleic acids are provided which encode fusion proteins containing the transcription activation domain of the invention and at least one additional domain that is heterologous thereto, where the peptide sequence of said activation domain is itself optionally modified relative to the naturally occurring sequence from which it was derived to increase or decrease its potency as a transcriptional activator relative to the counterpart comprising the native peptide sequence. Each of the recombinant nucleic acids of this invention may further comprise an expression control sequence operably linked to the coding sequence and may be provided within a DNA vector, e.g., for use in transducing prokaryotic or eukaryotic cells. Some of the recombinant nucleic acids of a given composition as described above, including any optional recombinant nucleic acids, may be present within a single vector or may be apportioned between two or more vectors. The recombinant nucleic acids may be provided as inserts within one or more recombinant viruses which may be used, for example, to transduce cells in vitro or cells present within an organism, including a human or non-human mammalian subject. It should be appreciated that non-viral approaches (naked DNA, liposomes or other lipid compositions, etc.) may be used to deliver recombinant nucleic acids of this invention to cells in a recipient organism. The resultant engineered cells and their progeny containing one or more of these recombinant nucleic acids or nucleic acid compositions of this invention may be used in a variety of important applications, including human gene therapy, analogous veterinary applications, the creation of cellular or animal models (including transgenic applications) and assay applications.
Such cells are useful, for example, in methods involving the addition of a ligand, preferably a cell permeant ligand, to the cells (or administration of the ligand to an organism containing the cells) to regulate expression of a target gene.

The invention also relates to methods and compositions using DBP or part thereof to bind to nucleic acids, preferably DNA, alone or in combination with other substances. For example, DBP or part thereof is added to a sample containing nucleic acid in conditions allowing binding, and allowed to bind to nucleic acids. In a preferred embodiment, DBP or part thereof may be used to purify nucleic acids such as restriction fragments. In another preferred embodiment, DBP or part thereof may be used to visualize nucleic acids when the polypeptide is linked to an appropriate fusion partner, or is detected by probing with an antibody. Alternatively, DBP or part thereof may be bound to a chromatographic support, either alone or in combination with other DNA binding proteins, using techniques well known in the art, to form an affinity chromatography column. A sample containing the nucleic acids to be purified is run through the column. Immobilizing DBP or part thereof on a support is particularly advantageous for those embodiments in which the method is to be practiced on a commercial scale. This immobilization facilitates the removal of the protein from the batch of product and subsequent reuse of the protein. Immobilization of DBP or part thereof can be accomplished, for example, by inserting a cellulose-binding domain in the protein. One of skill in the art will understand that other methods of immobilization could also be used and are described in the available literature.

In another embodiment, the present invention relates to compositions and methods using DBP or part thereof, especially the DNA-binding domain, to alter the expression of genes of interest in target cells, using any techniques known to those skilled in the art including those described in U.S. Pat. Nos. 5,861,495; 5,866,325 and 6,013,453. Such genes of interest may be disease-related genes, such as oncogenes, or exogenous genes from pathogens such as bacteria or viruses.

In still another embodiment, DBP or part thereof may be used to diagnose, treat and/or prevent disorders linked to dysregulation of gene transcription such as cancer and other disorders relating to abnormal cellular differentiation, proliferation, or degeneration, including hyperaldosteronism, hypocortisolism (Addison's disease), hyperthyroidism (Graves' disease), hypothyroidism, colorectal polyps, gastritis, gastric and duodenal ulcers, ulcerative colitis, and Crohn's disease.

Protein of SEQ ID NO: 388 (Internal Designation 109-002-4-0-C6-CS)

The protein of SEQ ID NO: 388 encoded by the cDNA of SEQ ID NO: 147 is a 375-amino-acid protein containing a zinc finger domain, namely a PHD-finger domain from positions 329 to 339.

The PHD finger was originally identified by comparison of the maize homeodomain (HD) protein ZMHOX1a (Bellmann R. and Werr W. EMBO J. 11: 3367-3374 (1992)) to its Arabidopsis relative HAT3.1 and named plant homeodomain (PHD) finger due to its association with the DNA-binding HD in both genes. This motif often occurs in various regulatory genes, such as members of the trithorax (TRX-G) or polycomb (PC-G) groups (Aasland R. et al. Trends Biochem. Sci. 20: 56-59 (1995)) and leukaemia-associated proteins (LAP finger) (Saha V. et al. Proc. Natl. Acad. Sci. USA 92: 9737-9741 (1995)). The established function of TRX-G and PC-G genes in chromatin modulation in Drosophila led to the suggestion that the PHD finger is involved in chromatin-mediated transcriptional control. Recent data provide evidence that PHD finger proteins are associated with chromatin remodelling complexes (Bochar D. A. et al. Proc. Natl. Acad. Sci. USA 97: 1038-1043 (2000)) or contribute to histone acetylation (Loewith R. et al. Mol. Cell. Biol. 20: 3807-3816 (2000)). Based on the position of the unique His residue, the cysteine scaffold of the PHD finger (Cys4-His-Cys3) is clearly distinct from RING fingers (Cys3-His-Cys4) and LIM domains (Cys2-His-Cys5) and from DRIL domains, where two RING finger motifs are closely linked. In contrast to the accumulating knowledge about LIM domains, functional data concerning the PHD finger remain rare (see rev. Halbach T. et al. Nucleic Acids Research 28: 3542-3550 (2000)).
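
The distinction drawn above between the PHD, RING and LIM scaffolds rests on where the single His residue falls among the eight metal ligands. A minimal sketch of that rule follows; the function name, the table name and the idea of passing the ordered ligands as an eight-letter string are illustrative assumptions, and real domain assignment would also depend on spacing and on the surrounding sequence.

```python
# Scaffolds named in the text, keyed by
# (number of Cys before the His, number of Cys after the His).
SCAFFOLDS = {
    (4, 3): "PHD finger (Cys4-His-Cys3)",
    (3, 4): "RING finger (Cys3-His-Cys4)",
    (2, 5): "LIM domain (Cys2-His-Cys5)",
}


def classify_scaffold(ligands: str) -> str:
    """Classify an ordered string of eight metal-ligand residues,
    e.g. "CCCCHCCC", by the position of its unique His.
    Hypothetical helper for illustration only."""
    ligands = ligands.upper()
    has_other_residues = bool(set(ligands) - {"C", "H"})
    if len(ligands) != 8 or ligands.count("H") != 1 or has_other_residues:
        return "unclassified"
    his_index = ligands.index("H")
    cys_before, cys_after = his_index, 7 - his_index
    return SCAFFOLDS.get((cys_before, cys_after), "unclassified")


if __name__ == "__main__":
    for scaffold in ("CCCCHCCC", "CCCHCCCC", "CCHCCCCC", "CCCCCCCC"):
        print(scaffold, "->", classify_scaffold(scaffold))
```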

GYMNOS, a recently described member of the SWI2/SNF2 protein family in plants (22), also contains a PHD finger and takes part in the control of development. The second PHD finger motif of Drosophila dMI-2 protein (a reference for animal counterparts) shares high sequence conservation with known plant PHD fingers. Due to the similarity to the Drosophila MI-2 gene, GYMNOS has been implicated in chromatin modulation. While the PHD finger is an isolated motif in GYMNOS, the characteristic Cys4-His-Cys3 scaffold in PHDf-HD plant genes is embedded in a large region. This region shares 60% identical residues between seven genes of different plant species and is more highly conserved than the HD (40%). This conservation suggests that the PHD finger is part of a larger functional unit. When combined with a leucine zipper in the surrounding conserved 180 amino acid region in the PHDf-HD proteins, PHD finger activity is masked and silenced. The leucine zipper upstream of the PHD finger mediates interactions with helix 4 of plant 14-3-3 proteins, thus identifying PHDf-HD proteins as potential targets of 14-3-3 signalling pathways. The 14-3-3 family of multifunctional proteins is highly conserved between animals, plants and yeast. Due to the dimeric nature of 14-3-3 proteins and their capacity to form homo- and heterodimers, members of the 14-3-3 protein family function as scaffolds promoting association of protein complexes. 14-3-3 proteins are involved in various signalling pathways that include, for example, Raf, BAD, Bcr/Bcr-Abl, KSR (kinase suppressor of Ras), PKC, PI-3 kinase and cdc25C phosphatase. Others enter the nucleus and are associated with DNA-binding complexes. Recent data even indicate contacts to TBP, TFIIB and the human TBP-associated factor hTAF(II)32 (for rev. see Halbach T., supra).

Recently, the PHD finger has been shown to activate transcription in yeast, plant and animal cells. Transcriptional activation in animal cells (with the zebrafish embryo as a test system), tested for different PHD fingers, seems to be a general feature of the PHD finger motif in eukaryotic cells.

It remains to be elucidated whether the PHD finger directly interacts with a component of the transcription initiation complex or if its positive effect on transcription is mediated via auxiliary protein interactions. Both assumptions, however, involve PHD finger-mediated protein-protein interactions. Surrounding sequences may interfere sterically with access to the PHD finger, and its exposure could possibly depend on binding of a protein partner.

PHD finger-containing proteins appear to be involved in human diseases. Studies on the AIRE gene from humans (Nagamine K. et al. Nat. Genet. 17: 393-398 (1997), Scott H. S. et al. Mol. Endocrinol. 12: 1112-1119 (1998)) have shed more light on the importance of this motif, since all clinically significant mutations in the AIRE gene coincide with alteration in two PHD fingers, resulting in the rare autoimmune polyendocrinopathy-candidiasis-ectodermal dystrophy (APECED). The presence of PHD fingers in genes up-regulated in leukaemia, associated with the autoimmune disease APECED or participating in euchromatin to heterochromatin modulation, like the TRX-G or PC-G genes, indicates that this motif may be involved in a variety of important cellular events including developmental disorders, tumors and immune diseases. For example, the role of chromatin structure remodelling in cancer metastasis and tissue carcinogenesis is well documented (Zhang Y. et al. Cell 16: 279-289 (1998); Klugbauer S. and Rabes H. M. Oncogene 29: 4388-4393 (1999)).

It is believed that the protein of SEQ ID NO: 388 or part thereof is a zinc binding protein, preferably able to bind nucleic acids, more preferably a transcription factor. Preferred polypeptides of the invention are polypeptides comprising the amino acids of SEQ ID NO: 388 from positions 329 to 339. Other preferred polypeptides of the invention are fragments of SEQ ID NO: 388 having any of the biological activities described herein.

In one embodiment of the invention, the protein of the invention, or part thereof, or derivative thereof, may be used to diagnose, in a subject, developmental disorders and/or cell proliferative disorders linked to dysregulation of gene expression mediated by the PHD-finger domain of the protein of the invention. Such disorders include, but are not limited to, renal tubular acidosis, anemia, Cushing's syndrome, achondroplastic dwarfism, epilepsy, gonadal dysgenesis, hereditary neuropathies such as Charcot-Marie-Tooth disease and neurofibromatosis, hypothyroidism, hydrocephalus, seizure disorders such as Sydenham's chorea and cerebral palsy, spina bifida, and congenital glaucoma, cataract, sensorineural hearing loss, benign tumors, and cancers such as adenocarcinoma; leukemia; melanoma; lymphoma; sarcoma; and cancers of the bladder, colon, liver, brain, small intestine, large intestine, breast, ovary, kidney, lung, and prostate. Diagnosis may be performed using nucleic acids or antibodies able to detect the expression of the protein of the invention using any technique known to those skilled in the art including Northern blotting, RT-PCR, immunoblotting methods, immunohistochemistry, and enzyme-linked immunosorbent assay (ELISA) as described herein. Quantities of the protein of the invention expressed in subject samples (control and disease) from biopsied tissues or body fluids or cell extracts taken from patients are compared with the standard values. Deviation between standard and subject values establishes the parameters for diagnosing disease.
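
The comparison of subject values with standard values can be sketched as follows. This is only one possible way to express "deviation": the use of a z-score, the function names, the default threshold of 2.0 and the numbers in the usage example are all assumptions introduced for illustration and are not taken from the specification.

```python
from statistics import mean, stdev


def deviation_from_standard(subject_value: float, standard_values: list) -> float:
    """Return how many sample standard deviations the subject value lies
    from the mean of the standard (control) values. The z-score is an
    assumed measure of deviation, chosen for illustration."""
    mu = mean(standard_values)
    sigma = stdev(standard_values)  # requires at least two standard values
    if sigma == 0:
        raise ValueError("standard values show no variation")
    return (subject_value - mu) / sigma


def deviates(subject_value: float, standard_values: list, threshold: float = 2.0) -> bool:
    """Flag a subject value whose absolute deviation exceeds the threshold.
    The threshold is a hypothetical parameter, not a validated cutoff."""
    return abs(deviation_from_standard(subject_value, standard_values)) > threshold


if __name__ == "__main__":
    # Made-up expression quantities in arbitrary units.
    standard = [10.0, 12.0, 11.0, 9.0, 13.0]
    print(deviates(11.0, standard))
    print(deviates(20.0, standard))
```

In practice the standard values, the measure of deviation and the decision threshold would each have to be established experimentally for the assay used.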

In another embodiment, antagonists or inhibitors of the protein of the invention or part thereof may be administered to patients to treat and/or prevent the above referred disorders. Antagonists or inhibitors of transcriptional activators may indeed be used to suppress transcriptional activation in tumor cells. Such antagonists and/or inhibitors may be antibodies specific for the protein of the invention that can be used directly as an antagonist, or indirectly as a targeting or delivery mechanism for bringing a pharmaceutical agent to cells or tissue which express the protein of the invention. Neutralizing antibodies (i.e., those which inhibit protein-protein interactions) are especially preferred for therapeutic use. Other methods to inhibit the expression of the protein of the invention include antisense and triple helix strategies as described herein. Other antagonists or inhibitors of the protein of the invention may be produced using methods which are generally known in the art, including the screening of libraries of pharmaceutical agents to identify those which specifically bind the protein of the invention. The protein of the invention, or part thereof, preferably its functional or immunogenic fragments, or oligopeptides related thereto, can be used for screening libraries of compounds in any of a variety of drug screening techniques. The fragment employed in such screening may be free in solution, affixed to a solid support, borne on a cell surface, or located intracellularly. The formation of binding complexes between the protein of the invention, or part thereof, or derivative thereof, and the agent being tested may be measured. Another technique for drug screening which may be used provides for high throughput screening of compounds having suitable binding affinity to the protein of the invention as described in published PCT application WO84/03564.

Protein of SEQ ID NO: 394 (Internal Designation 157-17-2-0-C1-CS)

The protein of SEQ ID NO:394 encoded by the extended cDNA SEQ ID NO:153 contains a myc-type, helix-loop-helix dimerization domain (Prosite PS00038) from amino acid position 13 to 28 and has no adjacent basic domain. Using the Schiffer-Edmundson helical wheel diagram (Schiffer et al. (1967) Biophys. J. 7:121-135), a hypothetical amphipathic alpha helix is predicted between position 53 and position 68. Three hydrophobic amino acids, Val55, Phe59 and Ile63, are aligned on the same side of the helix to present a hydrophobic interaction surface, and three hydrophilic residues (Tyr53, Gln62 and Ser64) are presented on the other side of the helix. There is no proline residue within the stretch to disrupt the continuity of the alpha helix. Thus, these structural features in the protein of the invention indicate that this protein could be a novel member of the nonbasic “helix-loop-helix” subfamily (HLH) of transcription regulators.
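
The alignment of Val55, Phe59 and Ile63 on one face of the predicted helix can be checked with a small helical wheel calculation. The sketch below assumes the usual idealized alpha helix geometry of 100 degrees of rotation per residue (3.6 residues per turn) and measures angles relative to position 53, the start of the predicted helix; the function names are hypothetical and no sequence data are required.

```python
def wheel_angle(position: int, start: int = 53, step_deg: float = 100.0) -> float:
    """Angle (degrees, 0-360) of a residue on a helical wheel, measured
    from the residue at `start`, assuming 100 degrees per residue."""
    return ((position - start) * step_deg) % 360.0


def arc_spanned(positions, start: int = 53) -> float:
    """Smallest arc of the wheel (degrees) that contains all positions."""
    angles = sorted(wheel_angle(p, start) for p in positions)
    # Gaps between neighbouring residues around the circle, including
    # the wrap-around gap from the last angle back to the first.
    gaps = [b - a for a, b in zip(angles, angles[1:])]
    gaps.append(360.0 - angles[-1] + angles[0])
    return 360.0 - max(gaps)


if __name__ == "__main__":
    hydrophobic = [55, 59, 63]  # Val55, Phe59, Ile63 from the text
    for pos in hydrophobic:
        print(pos, wheel_angle(pos))
    print("arc spanned:", arc_spanned(hydrophobic))
```

Under these assumptions the three hydrophobic residues fall within an 80 degree arc, that is, on a single face of the helix, which is the property the helical wheel diagram is used to reveal.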

The helix-loop-helix (HLH) family of transcriptional regulators is involved in the control of different cellular differentiation phenomena such as neurogenesis, haematopoiesis, myogenesis and angiogenesis. The HLH proteins are found in all eukaryotic organisms ranging from the yeast Saccharomyces cerevisiae to human (Reviewed by Massari M E and Murre C. (2000) Molecular and Cellular Biology, 20 (2):429-440). The HLH proteins bind DNA as dimers, and different members of the HLH family bind either as homodimers or as heterodimers with other members of the family. The presence in a cell of a large repertoire of distinct complexes that can bind to a particular DNA sequence element suggests that competition for DNA binding may play a regulatory role.

Members of the helix-loop-helix (HLH) family of transcriptional regulation proteins share a common structural element, i.e. a stretch of 40-50 amino acids containing two short amphipathic alpha-helices separated by a linker region (the loop) of varying length (Murre C et al. (1989) Cell 56:777-783). This element was initially identified as a region of homology among c-myc, the muscle determination gene MyoD (Davis R L et al. (1987) Cell 51:987-1000) and the Drosophila achaete-scute complex (AS-C) involved in neural determination (Villares R. and Cabrera C V (1987) Cell 50:415-424). The HLH proteins form both homodimers and heterodimers by means of interaction between the hydrophobic residues on the corresponding faces of the two helices to give a parallel four-helix bundle structure (Adrian R et al. (1993) Nature, 363:38-45; Ellenberger T et al. (1994) Genes Dev. 8:970-980). The alpha helical regions are usually 15-16 amino acids long with hydrophobic residues at every third and fourth position, and each helix contains several conserved residues (Murre C et al. (1989) Cell, 56:777-783; Benezra R. et al. (1990) Cell, 61:49-59).

The HLH protein family is subdivided into two major groups: the so-called “bHLH” and “non basic HLH” subfamilies. Proteins of the bHLH family contain a conserved highly basic region immediately N-terminal to the first helix (known as bHLH structure), and mutagenesis experiments on MyoD protein confirm that this region is responsible for sequence-specific binding to the “E-box”, a consensus DNA motif for bHLH proteins (Davis R L. et al. (1990) Cell, 60: 733-746). A dimeric bHLH protein (either homodimeric or heterodimeric, but in which both subunits contain a basic region) is able to bind to DNA. In general, the bHLH proteins fall into two categories: class A consists of proteins that are ubiquitously expressed, including mammalian E12/E47 and fly da, whereas class B consists of proteins that are expressed in a more tissue-specific manner, including mammalian MyoD and fly AS-C. In most cases, the tissue-specific bHLH proteins preferentially heterodimerize with ubiquitous partners.

The non basic HLH subfamily contains proteins that lack a basic region and are unable to bind to DNA, but that can form homo- or heterodimers through their HLH motif. Indeed, heterodimeric complexes between non basic HLH and bHLH proteins fail to bind to DNA and negatively modulate bHLH protein-mediated transcription activation. This phenomenon was first demonstrated in a MyoD/Id regulation model (Benezra R. et al. (1990) Cell, 61:49-59). The MyoD gene product is able to activate previously silent muscle-specific genes when introduced into a large variety of differentiated cell types. MyoD proteins form either homodimers or heterodimers with other bHLH proteins such as E12 or E47, and bind to the E-box consensus motif to activate myogenesis. The Id gene, conserved from batrachians to mammals (Wilson R et al. (1995) Mech. Dev. 49:211-222; Sawai S et al. (1997) Mech. Dev. 65:175-185; Norton J D et al. (1998) Trends in Cell Biology 8:58-65), lacks a basic region adjacent to its HLH motif but is able to specifically dimerize with either MyoD, E12 or E47 and has been shown to subsequently attenuate the heterodimer's ability to bind DNA. Additionally, overexpression of Id inhibits MyoD-dependent gene activation in in vivo transfection experiments. Id proteins may function either to repress directly the activity of tissue-restricted bHLH proteins by rendering them non-functional or, more likely, to sequester the ubiquitous bHLH proteins and prevent them from forming active heterodimers with the tissue-restricted bHLH proteins (Review by Norton J D et al. (1998) Trends in Cell Biology 8:58-65).
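
The dominant-negative logic described above can be reduced to a simple rule: an HLH dimer binds DNA only if both subunits carry a basic region. The sketch below encodes that rule alone; the class and function names are hypothetical, and the model deliberately ignores dimerization preferences, E-box sequence specificity and expression levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HLHProtein:
    """Minimal description of an HLH protein for this illustration."""
    name: str
    has_basic_region: bool  # True for bHLH, False for non basic HLH


def dimer_binds_dna(a: HLHProtein, b: HLHProtein) -> bool:
    """A dimer (homo- or hetero-) is DNA-binding competent only when
    both subunits contain a basic region, as stated in the text."""
    return a.has_basic_region and b.has_basic_region


if __name__ == "__main__":
    myod = HLHProtein("MyoD", has_basic_region=True)
    e47 = HLHProtein("E47", has_basic_region=True)
    id_protein = HLHProtein("Id", has_basic_region=False)

    print("MyoD/E47 binds DNA:", dimer_binds_dna(myod, e47))
    print("MyoD/Id binds DNA:", dimer_binds_dna(myod, id_protein))
    print("E47/Id binds DNA:", dimer_binds_dna(e47, id_protein))
```

Under this rule, any dimer that includes a non basic HLH subunit such as Id is non-functional for DNA binding, which is how such a protein can titrate bHLH partners away from productive complexes.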

The possibility that the Id protein behaves as a dominant-negative regulator to repress MyoD protein activity through the formation of nonfunctional heterodimeric complexes is considerably strengthened by the following findings in Drosophila. In Drosophila, the development of the peripheral nervous system is positively regulated by the two structurally related bHLH proteins, AS-C and daughterless (da), since loss of either activity results in loss of sensory organ development. The extramacrochaetae (Emc) product belonging to the non basic HLH subfamily was shown to antagonize the activity of AS-C and da through the formation of nonfunctional heterodimers with the bHLH proteins (Hillary M et al. (1990) Cell, 61:27-38; Garrell J et al. (1990) Cell 61:39-48).

Human Id genes including human Id1, Id2, Id3 and Id4 have been identified and localized (Review by Norton J D et al. (1998) Trends in Cell Biology 8:58-65). The bHLH proteins and Id proteins are thought to be involved in the regulation of apoptosis. Differentiation and development of T- and B-lymphocytes in the immune system are positively regulated by the combination of ubiquitous E proteins and lymphocyte-restricted bHLH proteins. Disruption in gene expression from either class results in severe perturbation of T- and B-lymphocyte development (Bain G et al. (1997) Mol Cell Biol 17:4782-4791; Zhuang et al. (1996) Mol Cell Biol 16:2898-2905). Cell-arrested T thymocytes undergo massive apoptosis when the Id1 gene is overexpressed (Kim D (1999) Mol Cell Biol 19(12):8240-53). Overexpression of the Id1 gene product also results in apoptosis in neonatal and adult cardiac myocytes in culture (Tanaka K et al. (1998) J Biol Chem 273(40) 25922-25928).

Id1 and Id3 proteins are also required to support angiogenesis. Quiescent adult endothelial cells express minimal levels of the Id proteins, whereas Id expression is upregulated in angiogenic endothelial cells. Partial loss of these proteins in Id1 +/− Id3 −/− double knockout mice impairs angiogenesis, resulting in resistance to tumour growth (Lyden D et al. (1999) Nature 401:670-677). In addition, significantly elevated mRNA and protein levels of Id1, Id2 and Id3 have been found in patients with pancreatic cancer (Maruyama H et al. Am J Pathol (1999) 155(3):815-822). A correlation between Id1 gene upregulation and the aggressive phenotype of human breast cancer cells has also been reported (Lin C Q et al. (2000) Cancer Res 60(5):1332-40).

Thus, identification and cloning of members of the HLH family, and especially of the non basic HLH subfamily, is necessary to enrich our knowledge of the biological importance of the HLH transcription factor network and, furthermore, to provide insights and tools for disorders linked to dysregulation of HLH-mediated transcription.

It is believed that the protein of SEQ ID NO: 394 or part thereof plays a role in the regulation of transcription activation, probably as a member of the HLH family, preferably of the non basic HLH subfamily. More particularly, the protein of the invention is thought to be able to antagonize the activity of members of the bHLH family through the formation of heterodimers. Preferred polypeptides of the invention are polypeptides comprising the amino acids of SEQ ID NO:394 from positions 13 to 28, from positions 53 to 68, and from positions 13 to 68. Other preferred polypeptides of the invention are fragments of SEQ ID NO: 394 having any of the biological activities described herein.
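
The position ranges recited for the preferred polypeptides are 1-based and inclusive, so a fragment "from positions 13 to 28" is 16 residues long. A minimal sketch of that convention follows; the helper name is hypothetical, and the sequence used is a synthetic placeholder built for illustration only, not SEQ ID NO: 394.

```python
def fragment(sequence: str, start: int, end: int) -> str:
    """Return residues start..end of a protein sequence, using 1-based,
    inclusive positions as in the specification."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("positions must satisfy 1 <= start <= end <= length")
    return sequence[start - 1:end]


# Synthetic placeholder of 80 residues; it is NOT the sequence of
# SEQ ID NO: 394 and carries no biological meaning.
PLACEHOLDER = "ACDEFGHIKLMNPQRSTVWY" * 4

if __name__ == "__main__":
    for first, last in ((13, 28), (53, 68), (13, 68)):
        piece = fragment(PLACEHOLDER, first, last)
        print(f"positions {first}-{last}: {len(piece)} residues")
```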

The dimerization ability of the protein of the invention or part thereof which is characteristic of the HLH family may be assayed using any of the assays known to those skilled in the art. For example, interacting protein partners, especially members of the bHLH subfamily, may be identified using screening of cDNA expression libraries as described for the identification of some HLH transcription factors such as E12 and E47 (Murre C et al. (1989) Cell 56:777-783), Max (a Myc binding factor) (Elizabeth M et al. (1991) Science 251:1217) as well as Id (Benezra R. et al. (1990) Cell 61:49-59). Alternatively, the helix-loop-helix motif in the protein of the invention could be used by those skilled in the art as a “bait protein” in a well established yeast two-hybrid system to identify its interacting protein partners in vivo from a cDNA library derived from different tissues or cell types of a given organism. Alternatively, the protein of the invention or part thereof could be used by those skilled in the art in mammalian cell transfection experiments. When fused to a suitable peptide tag such as a [His]6 tag in a protein expression vector and introduced into cultured cells, this expressed fusion protein can be immunoprecipitated with its potential interacting proteins by using an anti-tag peptide antibody. This method could be chosen either to identify the associated partner or to confirm the results obtained by other methods such as those just mentioned.

An object of the invention relates to compositions and methods using the protein of the invention or part thereof to dysregulate gene transcription, preferably transcription mediated by HLH regulators either in vitro or in vivo, through overexpression of the protein of the invention using any means known to those skilled in the art.

The protein of the invention or part thereof could be used to induce apoptosis of specific cell types under either physiological or pathological conditions. In a preferred embodiment, the apoptosis-active polypeptide is added to an in vitro culture of mammalian cells in an amount effective to induce apoptosis. In another preferred embodiment, the apoptosis-active polypeptide is expressed under the control of a promoter which may be activated under precise conditions. In particular, such conditional expression of an apoptosis-active polypeptide upon demand may be very useful to get rid of cells that have become unwanted, for example in applications where such cells have been used for cellular therapy and are no longer needed. Another example of application is the case of expression under the control of a promoter that becomes active after infection by a given microorganism, thus resulting in the death of the infected cells only. Furthermore, the protein of the invention or part thereof may be useful in the diagnosis, the treatment and/or the prevention of disorders in which apoptosis is beneficial, including but not limited to disorders linked to abnormal cellular proliferation such as those described below.

In another embodiment, the protein of the invention or part thereof can be used to diagnose, treat and/or prevent disorders linked to overexpression of HLH proteins, such as cancer and other disorders relating to abnormal cellular differentiation, proliferation, or degeneration, including hyperaldosteronism, hypocortisolism (Addison's disease), hyperthyroidism (Graves' disease), hypothyroidism, colorectal polyps, gastritis, gastric and duodenal ulcers, ulcerative colitis, and Crohn's disease, and neurodegenerative disorders such as Parkinson's or Alzheimer's disease, using any methods and/or techniques described herein. In addition, the protein of the invention or part thereof may be used to evaluate disease progression and the efficiency of clinical treatment. The protein of the invention or part thereof could also be used as a molecular target for anti-angiogenesis drug design. Inhibition of protein expression could be achieved by many means known to those skilled in the art including those described in the present application. For example, an antisense nucleotide or triple helix strategy could be developed to block synthesis of the protein. Alternatively, the expressed protein of the invention might be neutralized by using a specific monoclonal antibody using techniques known to those skilled in the art including those described in Peverali F A et al (1994) EMBO J. 13:4291-4301; Barone M V et al. (1994) Proc. Natl. Acad. Sci. USA 91:4985-4988; and Haza E T et al. (1994) J. Biol. Chem. 269:2139-2145.

Protein of SEQ ID NO: 466 (Internal Designation 184-4-2-0-D3-CS)

The protein of SEQ ID NO: 466, overexpressed in liver and encoded by the cDNA of SEQ ID NO: 225, displays a zinc finger motif of RING type (C3HC4) (Pfam signature from positions 41 to 81, Prosite signature from positions 56 to 65) and a B-box zinc finger motif (Pfam signature from positions 110 to 153). In addition, the protein of the invention is predicted to have a nuclear localization.

It is believed that the protein of SEQ ID NO: 466 or part thereof is a zinc binding protein, preferably able to bind nucleic acids, more preferably a transcription factor. Preferred polypeptides of the invention are polypeptides comprising the amino acids of SEQ ID NO:466 from positions 41 to 81 (RING zinc finger domain), and from 110 to 153 (B-box domain). Other preferred polypeptides of the invention are fragments of SEQ ID NO: 466 having any of the biological activities described herein.

Protein of SEQ ID NO: 267 (Internal Designation 116-111-1-0-H9-CS)

The protein of SEQ ID NO:267 encoded by the extended cDNA SEQ ID NO:26 exhibits an Emotif zinc finger domain, C2H2 type, from positions 185 to 202, and is thought to be localized in the nucleus.

It is believed that the protein of SEQ ID NO:267 or part thereof is a zinc binding protein, preferably able to bind nucleic acids, more preferably a transcription factor. Preferred polypeptides of the invention are polypeptides comprising the amino acids of SEQ ID NO:267 from positions 185 to 202. Other preferred polypeptides of the invention are fragments of SEQ ID NO:267 having any of the biological activities described herein.

Protein of SEQ ID NO: 277 (Internal Designation 160-103-1-0-F11-CS)

The protein of SEQ ID NO:277 encoded by the extended cDNA SEQ ID NO:36 exhibits a Pfam DHHC zinc finger domain from positions 140 to 204.

It is believed that the protein of SEQ ID NO: 277 or part thereof is a zinc binding protein, preferably able to bind nucleic acids, more preferably a transcription factor. Preferred polypeptides of the invention are polypeptides comprising the residues of SEQ ID NO:277 from positions 140 to 204. Other preferred polypeptides of the invention are fragments of SEQ ID NO:277 having any of the biological activities described herein.

Protein of SEQ ID NO: 272 (Internal Designation 145-25-3-0-B4-CS)

The protein of SEQ ID NO:272 encoded by the extended cDNA SEQ ID NO:31 shows homology with numerous zinc binding proteins. In addition, the protein of the invention exhibits the Pfam RING zinc finger signature from positions 87 to 129. The protein of SEQ ID NO:272 has a variant, i.e. the protein of SEQ ID NO:273, which is encoded by the extended cDNA SEQ ID NO:32 and is thought to have the same function and utilities.

It is believed that the protein of SEQ ID NO:272 or part thereof is a zinc binding protein, preferably able to bind nucleic acids or proteins, more preferably a transcription factor. Preferred polypeptides of the invention are polypeptides comprising the amino acids of SEQ ID NO:272 from positions 87 to 129. Other preferred polypeptides of the invention are fragments of SEQ ID NO:272 having any of the biological activities described herein.

Hydrolases and Inhibitors

The invention relates to compositions and methods using proteins of the invention having a hydrolytic activity, herein referred to as HYP, such as the ones described in this section and those containing a hydrolytic domain as shown in Table VI, or parts thereof, preferably fragments comprising a hydrolytic domain, or derivatives thereof.

The invention relates to methods and compositions using HYP or a fragment thereof to hydrolyze one or several substrates, alone or in combination with other substances. For example, the protein of the invention or part thereof is added to a sample containing the substrate(s) in conditions allowing hydrolysis, and allowed to catalyze the hydrolysis of the substrate(s). Hydrolyzed substrates are then detected using standard methods known to those skilled in the art. The protein of the invention or part thereof can also be added to samples as a “cocktail” with other hydrolytic enzymes, such as other peptidases, for example to decontaminate surgical instruments using methods described in U.S. Pat. No. 5,489,531. The advantage of using a cocktail of hydrolytic enzymes is that one is able to hydrolyze a wide range of substrates without necessarily knowing the specificity of each enzyme. Using a cocktail of hydrolytic enzymes also protects a sample from a wide range of future unknown contaminants from a vast number of sources. Alternatively, HYP or part thereof may be bound to a chromatographic support, either alone or in combination with other hydrolytic enzymes, using techniques well known to those skilled in the art, to form an affinity column to remove the substrate. Immobilization facilitates removal of the enzyme from the batch of product and subsequent reuse of the enzyme.

Immobilization of the enzyme or part thereof can be accomplished, for example, by adding a cellulose-binding domain to the protein through the modification of the DNA sequence coding for the protein or part thereof. One of skill in the art will understand that other methods of immobilization could also be used and are described in the available literature. Alternatively, the same methods may be used to identify new substrates.

In another embodiment, HYP or part thereof may be used to identify or quantify the amount of a given substrate in a biological sample. In a preferred embodiment, HYP or part thereof is catalytically inactivated, i.e. capable of binding but not hydrolyzing a given substrate, using any of the methods known to those skilled in the art, including those which produce a mutant enzyme, a recombinant enzyme, or a chemically inactivated enzyme. The catalytically inactive protein of the invention is then incubated with an aliquot of a biological sample under conditions suitable for binding of the inactive enzyme to the substrate. Then, the bound enzyme is detected to assess the presence or amount of the substrate in the biological sample. In another preferred embodiment, HYP or part thereof is used in assays and diagnostic kits for the identification and quantification of substrates in a biological sample. These assays can be based, for example, on standard enzyme-linked immunosorbent assays (ELISA) or any other technique known to those skilled in the art. In addition, HYP or part thereof may be used to identify, e.g. using screens based on standard assays such as those described above, inhibitors of the enzyme for mechanistic and clinical applications. Such inhibitors may then be used to identify or quantify HYP in a sample, and to diagnose, treat or prevent any of the disorders where the protein's activity is undesirable and/or deleterious.

Protein of SEQ ID NO:400 (Internal Designation 160-54-1-0-F7-CS)

The protein of SEQ ID NO:400, encoded by the cDNA of SEQ ID NO:159, exhibits two putative transmembrane domains encompassing amino acids 50-70 and 127-147 as predicted by the software TopPred II (Claros and von Heijne, CABIOS applic. Notes, 10: 685-686 (1994)). It also displays the Prosite carboxypeptidase zinc-binding region signature PS00133 at positions 117-127. It is predicted by the psort software (see Nakai K and Horton P, Trends Biochem Sci. 1999 January;24(1):34-6) to localize to the nucleus with a high probability (73.9%). Finally, it is specifically expressed in fetal brain and shows no homology to previously known proteins.

Carboxypeptidase enzymes hydrolyze the terminal amino acid of a protein or peptide. A novel family of carboxypeptidases, localized in the nucleus and with a carboxypeptidase-dependent transcriptional activity, has emerged only recently. Its first member, AEBP1, was previously identified as a 3T3 preadipocyte factor implicated in the repression of aP2 gene expression. AEBP1 stands for “AE-1 Binding Protein,” where AE-1 is a regulatory element of the adipose P2 gene (aP2), a gene involved in triglyceride metabolism and activated in adipocytes. Its own expression is abolished during adipocyte differentiation (He G P et al., Nature 378:92-96(1995)). AEBP1 was subsequently shown to play a similar role in the differentiation of osteoblastic cell lines (Ohno I et al., Biochem Biophys Res Commun. 1996 Nov. 12;228(2):411-4) and vascular smooth muscle cells (Layne M D et al., J. Biol. Chem. 273:15654-15660(1998)). It was proposed that AEBP1 acts as a negative transcription factor by cleaving proteins involved in transcription, a new feature in transcription regulation. Recent evidence further suggests that its transcriptional activity is itself attenuated by binding to G-protein subunits (Park J G et al., EMBO J. 1999 Jul. 15;18(14):4004-12) and stimulated by DNA binding (Muise A M and Ro H S, Biochem J. 1999 Oct. 15;343 Pt 2:341-5).

It is believed that the protein of SEQ ID NO:400 plays a role in cell signaling, nuclear transcriptional activity and in the differentiation of several cell types, especially those found in the developing brain (including but not limited to neurons). Preferred polypeptides of the invention are polypeptides having any of the biological activities described herein.

One embodiment of the present invention relates to compositions and methods using the protein of the invention or part thereof as a marker for specific cell compartments (especially the nucleus) and/or tissue types (especially fetal brain). For example, the protein of the invention or part thereof may be used to generate specific antibodies which would in turn allow the visualization of nuclear structures by methods well-known to those of skill in the art. In a similar fashion, antibodies raised against the protein of the invention may be used to identify particular developmental stages (fetal for instance) and/or given tissue types (brain for instance), as the protein of the invention is specifically expressed in brain tissues at a fetal stage. Antibodies and antiserum can also be used to inhibit undesirable carboxypeptidase activities in in vitro experiments and cell cultures, as well as in biological samples and in vivo. Alternatively, quantitative analysis or detection of the protein of the invention, or of nucleic acids encoding the protein, can be carried out by any other technique known to those skilled in the art.

In another embodiment, the protein of the invention may be used to target heterologous compounds (polypeptides or polynucleotides) to the developing brain and/or the cell nucleus. For instance, a chimeric protein composed of the protein of the invention recombinantly or chemically fused to a protein or polynucleotide of therapeutic interest would allow the delivery of the therapeutic protein/polynucleotide specifically to the above-mentioned cellular/tissue targets (nucleus, fetal brain).

In another embodiment, the present invention relates to methods and compositions using the protein of the invention or a fragment thereof to hydrolyze one or several substrates, alone or in combination with other substances. The ability of the present protein to hydrolyze any particular substrate can easily be determined by carrying out a hydrolysis reaction using standard assay techniques such as the ones described by Slusher et al. (Slusher et al., Prostate, 2000, 44(1): 55-60) or any other technique well known to those skilled in the art. Potential substrates are any substance containing a peptide bond, more specifically a C-terminal peptide bond. Such substances include, but are not limited to, polypeptides, folic acid and its analogues (e.g. methotrexate). For example, the protein of the invention or part thereof is added to a sample containing the substrate(s) in conditions allowing hydrolysis, and allowed to catalyze the hydrolysis of the substrate(s). Hydrolyzed substrates are then detected using standard methods known to those skilled in the art.

In a preferred embodiment, the protein of the invention or part thereof may be used to modulate cellular transcriptional activity, thereby modulating cellular differentiation. Specifically, because nuclear carboxypeptidases play a role in inhibiting transcription associated with differentiation, an increase in the activity or expression of the protein can be used to inhibit differentiation. The ability to inhibit differentiation has a number of uses, for example during the cultivation of undifferentiated pluripotent cells, to maintain the cultured cells in an undifferentiated state until the need for a given cell type arises (in cases of grafts for instance). The level of the protein activity or expression can be increased in any of a number of ways, including by introducing a polynucleotide encoding the protein into cells, by administering the protein itself to cells, or by administering to cells a compound that increases protein activity or expression. Alternatively, the protein of the invention can be inhibited, thereby enhancing cellular differentiation. The ability to promote differentiation has many uses, including in the treatment or prevention of cancer, as cancer cells are often in a relatively undifferentiated state, and cellular differentiation is typically accompanied by growth arrest.

In another embodiment, the protein of the invention or part thereof may be used to diagnose, treat and/or prevent disorders where the presence of substrates, for example excess proteins or peptides, is undesirable or deleterious. Such disorders include but are not limited to, cancer, neurodegenerative disorders such as Parkinson's and Alzheimer's diseases, and diabetes. In another embodiment, the protein of the invention or part thereof may be used to identify or quantify the amount of a given substrate (e.g. a peptide, folic acid, or methotrexate) in a biological sample. In a preferred embodiment, the protein of the invention or part thereof is used in assays and diagnostic kits for the identification and quantification of substrates in a biological sample.

In a most preferred embodiment, the protein of the invention or part thereof can be used in cancer chemotherapies in rescue therapy following toxic high-dose methotrexate regimes. Many carboxypeptidases can cleave the C-terminal glutamate moiety from folic acid and its analogues, such as methotrexate. The key role of reduced folates as coenzymes in many biological pathways, including those leading to DNA synthesis via the pyrimidines and purines, has made folic acid a target molecule for chemotherapy. Tumor cells grow rapidly and have a high rate of nucleic acid synthesis. Depletion of folic acid has cytotoxic effects, primarily in replicating tissues, and can inhibit growth of tumors with high folic acid requirements. Many carboxypeptidases can directly deplete folate by hydrolytic removal of its glutamate moiety. In cancer chemotherapy, methotrexate (4-amino-N10-methyl-pteroyl-glutamate) is commonly used to deplete the pool of reduced folates by inhibiting dihydrofolate reductase (DHFR), which catalyses the reduction of folates into the biologically active tetrahydrofolate form, essential in the biosynthesis of all folate coenzymes. Thus, the protein of the invention or part thereof could be used in rescue therapy following toxic high-dose regimes such as described by Widemann et al. (Widemann B. et al., Proc. Am. Assoc. Cancer Res., 1995, 36, p 232) and Chabner et al. (Chabner B. et al., Nature, 1972, 239, p 395-397), which disclosures are hereby incorporated by reference in their entirety. The basis of this strategy is that hydrolysis of methotrexate produces 4-amino-N10-methyl-pteroate, which is about 100 times less active as an inhibitor of DHFR.

In another preferred embodiment, the protein of the invention or part thereof can be used in an enzyme/prodrug strategy to treat a number of pathologies, especially those treated with drugs associated with severe side effects, including, but not limited to, autoimmune diseases and chronic inflammatory diseases such as rheumatoid arthritis, and cancer chemotherapy. These side effects can be mainly explained by the fact that the in vivo selectivity of the drugs used is too low (for example, the inadequate selectivity between tumor and normal cells of most anticancer drugs is well known and their toxicity to normal tissues is dose limiting). In the first phase of one example of such a protocol, a conjugate of the protein of the invention or part thereof and an antibody to a tissue specific antigen (for example, tumor specific antigens in the case of cancer chemotherapy) is administered. After a delay to allow residual enzyme conjugate to be cleared from the blood, a relatively non-toxic compound is administered to the patient. This non-toxic compound is a substrate of the protein of the invention, and is converted by the protein into a substantially more toxic compound. Thus, because of the previous, targeted administration of the protein of the invention, when the non-toxic compound is administered, the toxic compound is only produced in the vicinity of the cells targeted by the conjugate. This two-phase approach has been termed antibody-directed enzyme-prodrug therapy (ADEPT); it is reviewed by Melton et al. (Melton R. et al., J. Natl. Cancer Inst., 1996, 88, p 153-165). Alternatively, the first phase can be replaced by a gene therapy approach resulting in the de novo synthesis of the protein of the invention or part thereof by cells from the targeted tissue; this has been termed gene-dependent enzyme/prodrug therapy (GDEPT). Another advantage of these two approaches (ADEPT and GDEPT) is that a single enzyme molecule is capable of activating many prodrug molecules.

Protein of SEQ ID NO: 242 (Internal Designation 119-003-4-0-C2-CS)

The protein of SEQ ID NO:242, encoded by the cDNA of SEQ ID NO:1, is homologous to proteins of the M20 metallopeptidase family (EC 3.4.17.X). The protein of the invention is over-expressed in the spinal cord and the brain.

The members of the M20 metallopeptidase family are all peptidases (i.e. enzymes able to hydrolyze peptide bonds). Furthermore, they are all exopeptidases, which means that they can hydrolyze the terminal amino acid of a protein or peptide. Members of the M20 peptidase family are glutamate carboxypeptidases, which are capable of releasing the C-terminal glutamate residue, by hydrolysis, from a wide range of N-acyl groups, including peptidyl, aminoacyl, benzoyl, benzyloxycarbonyl, folyl, and pteroyl groups, and physiologically are involved in the catabolism of proteins. M20 carboxypeptidases are either monomeric or homodimeric (i.e. two identical proteins assembled to form the enzyme). In order to be active, metallopeptidases must be associated with a metallic cofactor (either zinc or cobalt depending on the enzyme). The most studied carboxypeptidase of the M20 family is carboxypeptidase G2 (CPG2) (EC 3.4.17.11), a bacterial enzyme from Pseudomonas sp. (strain RS-16). CPG2 is a dimeric zinc carboxypeptidase that cleaves the C-terminal glutamate moiety from a number of molecules.

The protein of SEQ ID NO:242 includes the pfam signature for M20 peptidase (positions 107 to 451). The protein of SEQ ID NO:242 also includes a number of amino acids that are conserved throughout the M20 peptidase family, especially those that interact with the metal cofactor. Preferred polypeptides of the invention are polypeptides of SEQ ID NO:242 that include the highly conserved amino acids 133, 135, 149, 163, 200, 201 and/or 262, which are present in over 80% of the members of the M20 peptidase family, and/or amino acids 139, 157, 162, 16, 367 and/or 377, which are present in over 60% of the members of the M20 peptidase family. Of particular interest are amino acids 133, 166, 201 and 262, which by homology are probably involved in the interaction with the metal cofactors. Thus it is believed that the protein of SEQ ID NO:242 or part thereof is a peptidase, preferably a carboxypeptidase, more preferably a metallocarboxypeptidase of the M20 family. Other preferred polypeptides of the invention are any fragments of SEQ ID NO:242 having any of the biological activities described herein.
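As an illustration of how the residue positions listed above might be checked against a candidate fragment, the following minimal sketch reads out the residues found at a set of 1-based positions and tests whether a fragment spans them. The position sets are copied from the paragraph above; the function names and the toy sequence used in the usage note are hypothetical and are not part of the disclosure.

```python
# Residue positions quoted in the text (1-based numbering along SEQ ID NO:242).
CONSERVED_OVER_80_PERCENT = (133, 135, 149, 163, 200, 201, 262)
PUTATIVE_METAL_LIGANDS = (133, 166, 201, 262)
PFAM_M20_DOMAIN = (107, 451)  # start and end of the pfam M20 signature


def residues_at(sequence, positions):
    """Return {position: residue} for 1-based positions in a protein sequence."""
    found = {}
    for pos in positions:
        if not 1 <= pos <= len(sequence):
            raise ValueError(f"position {pos} is outside the sequence")
        found[pos] = sequence[pos - 1]  # convert 1-based to 0-based index
    return found


def fragment_covers(start, end, positions):
    """True if the fragment start..end (inclusive, 1-based) spans every position."""
    return all(start <= pos <= end for pos in positions)
```

For example, `fragment_covers(*PFAM_M20_DOMAIN, PUTATIVE_METAL_LIGANDS)` evaluates to true, whereas a hypothetical fragment starting at position 140 would miss residue 133. `residues_at("ACDEFG", [1, 3])` on a toy sequence returns the residues A and D.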

Carboxypeptidase activity on specific substrates can easily be determined by carrying out the hydrolysis using standard assay techniques such as the ones described by Slusher et al. (Slusher et al., Prostate, 2000, 44(1): 55-60) or any other technique well known to those skilled in the art. Potential substrates are any substance containing a peptide bond, more specifically a C-terminal peptide bond, and even more specifically a C-terminal glutamate. Such substances include but are not limited to peptides, folic acid and its analogues (e.g. methotrexate).

In an embodiment the protein of the invention or part thereof could be used to develop assay tools to identify brain and spinal cord tissue since the protein of the invention is overexpressed in these tissues.

In still another embodiment, the protein of the invention or part thereof may be used to diagnose, treat and/or prevent disorders where the presence of substrates, for example excess proteins, is undesirable or deleterious. Such disorders include but are not limited to cancer, neurodegenerative disorders such as Parkinson's and Alzheimer's diseases, and diabetes. In a most preferred embodiment, the protein of the invention or part thereof can be used in cancer chemotherapies in rescue therapy following toxic high-dose methotrexate regimes. Enzymes of the M20 peptidase family can cleave the C-terminal glutamate moiety from folic acid and its analogues, such as methotrexate. The key role of reduced folates as coenzymes in many biological pathways, including those leading to DNA synthesis via the pyrimidines and purines, has made folic acid a target molecule for chemotherapy. Tumor cells grow rapidly and have a high rate of nucleic acid synthesis. Depletion of folic acid has cytotoxic effects, primarily in replicating tissues, and can inhibit growth of tumors with high folic acid requirements. Enzymes of the M20 peptidase family can directly deplete folate by hydrolytic removal of its glutamate moiety. In cancer chemotherapy, methotrexate (4-amino-N10-methyl-pteroyl-glutamate) is commonly used to deplete the pool of reduced folates by inhibiting dihydrofolate reductase (DHFR), which catalyses the reduction of folates into the biologically active tetrahydrofolate form, essential in the biosynthesis of all folate coenzymes. Thus the protein of the invention or part thereof could be used in rescue therapy following toxic high-dose regimes such as described by Widemann et al. (Widemann B. et al., Proc. Am. Assoc. Cancer Res., 1995, 36, p 232) and Chabner et al. (Chabner B. et al., Nature, 1972, 239, p 395-397), which disclosures are hereby incorporated by reference in their entirety. The basis of this strategy is that hydrolysis of methotrexate produces 4-amino-N10-methyl-pteroate, which is about 100 times less active as an inhibitor of DHFR.

In another preferred embodiment, the protein of the invention or part thereof can be used in an enzyme/prodrug strategy to treat a number of pathologies, especially those treated with drugs associated with severe side effects, including, but not limited to, autoimmune diseases and chronic inflammatory diseases such as rheumatoid arthritis, and cancer chemotherapy. These side effects can be mainly explained by the fact that the in vivo selectivity of the drugs used is too low (for example, the inadequate selectivity between tumor and normal cells of most anticancer drugs is well known and their toxicity to normal tissues is dose limiting). In the first phase of one example of such a protocol, a conjugate of the protein of the invention or part thereof and an antibody to a tissue specific antigen (for example, tumor specific antigens in the case of cancer chemotherapy) is administered. After a delay to allow residual enzyme conjugate to be cleared from the blood, a relatively non-toxic compound is administered to the patient. This non-toxic compound is a substrate of the protein of the invention, and is converted by the protein into a substantially more toxic compound. Thus, because of the previous, targeted administration of the protein of the invention, when the non-toxic compound is administered, the toxic compound is only produced in the vicinity of the cells targeted by the conjugate. This two-phase approach has been termed antibody-directed enzyme-prodrug therapy (ADEPT); it is reviewed by Melton et al. (Melton R. et al., J. Natl. Cancer Inst., 1996, 88, p 153-165). Alternatively, the first phase can be replaced by a gene therapy approach resulting in the de novo synthesis of the protein of the invention or part thereof by cells from the targeted tissue; this has been termed gene-dependent enzyme/prodrug therapy (GDEPT). Another advantage of these two approaches (ADEPT and GDEPT) is that a single enzyme molecule is capable of activating many prodrug molecules.

Protein of SEQ ID NO: 401 (Internal Designation 160-88-3-0-A8-CS.corr)

The protein of SEQ ID NO:401 encoded by the cDNA SEQ ID NO:160 is a splice variant of the hypothetical human palmitoyl-protein thioesterase-2 (PPT2) (E.C. 3.1.2.22) (Genbank accession number AF020543), which is well conserved among eukaryotes (C. elegans and rodents) and exhibits homology with the palmitoyl protein thioesterase-1 (PPT1) (Genbank accession number L42809). The product of the cDNA SEQ ID NO:160 is shorter than the human PPT2 (280 versus 308 amino acids, respectively), with a gap located between positions 174 and 203 of the PPT2 protein. The protein of SEQ ID NO:401 has a variant, the protein of SEQ ID NO:402 encoded by the cDNA of SEQ ID NO:161, thought to have the same functions and utilities.
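The lengths and gap coordinates given above can be cross-checked with simple arithmetic. The short sketch below assumes that "between positions 174 and 203" means the two flanking residues are retained and only the residues strictly between them are absent from the variant; that reading is an assumption, and the constant and function names are illustrative only.

```python
# Figures quoted in the text.
PPT2_LENGTH = 308          # amino acids in human PPT2
VARIANT_LENGTH = 280       # amino acids in the product of SEQ ID NO:160
GAP_FLANKS = (174, 203)    # PPT2 positions on either side of the gap


def residues_strictly_between(left, right):
    """Count positions lying strictly between two flanking positions."""
    return right - left - 1


missing_by_length = PPT2_LENGTH - VARIANT_LENGTH
missing_by_coordinates = residues_strictly_between(*GAP_FLANKS)
print(missing_by_length, missing_by_coordinates)
```

Under this reading both values come to 28 residues, so the stated lengths and the stated gap coordinates agree with one another.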

PPT1 (E.C. 3.1.2.22) is a well-described protein, widely conserved among the murine, rat, bovine and human species (Swissprot accession number P50897). It is a lysosomal enzyme that functions in the removal of fatty acids from modified cysteine residues in proteins undergoing degradation (Hofmann S. L. et al., Neuropediatrics, 28: 27-30 (1997)). For example, PPT1 catalyses the deacylation of H-ras and the alpha subunits of heterotrimeric G proteins in vitro (Camp L. A., J. Biol. Chem., 268: 22566-22574 (1993) and 269: 23212-23219 (1994)). Deacylation by PPT1 may be a prerequisite for complete digestion of the modified polypeptides. In fact, there is evidence that palmitoylation leads to increased protection against proteolytic digestion. Both the salivary mucus glycoprotein (Slomiany B. L., Biochem. Biophys. Res. Commun., 151: 1046-1053 (1988)) and chemically acylated bee venom phospholipase A2 (Diaz, R. E., Biochim. Biophys. Acta, 830: 52-58 (1985)) are more resistant to treatment with proteinases than their deacylated forms. Mutations in the PPT1 enzyme were shown to underlie the hereditary neurodegenerative disorder infantile neuronal ceroid lipofuscinosis (Vesa et al., Nature, 376: 584-587 (1995)).

Recently, Soyombo and Hofmann (J. Biol. Chem., 272: 27456-27463 (1997)) described a second lysosomal thioesterase, PPT2, that shares 20% identity with PPT1. The PPT2 enzyme presumably also plays a role in lysosomal thioester catabolism but has a substrate specificity distinct from that of PPT1. While little is known about the substrate specificity of PPT2, the enzyme is highly active against palmitoylated model substrates such as palmitoyl CoA. PPT2 did not hydrolyse the acyl-cysteine bond of the protein substrates routinely used to assay PPT1, such as H-Ras and albumin. This finding suggests that although both enzymes possess intrinsic palmitoyl thioesterase activity, the “leaving group” recognized by the enzymes may differ. One possibility is that PPT2 recognizes palmitoylated protein substrates but that these substrates differ from those recognized by PPT1. A second possibility is that PPT2 recognizes a novel lipid thioester substrate that is not derived from acylated proteins. Aguado et al. (Biochem J., 341: 679-689 (1999)) demonstrated that PPT2 is an acyl thioesterase. However, they could not distinguish between esterase (thioesterase) and lipase activity. PPT2 shows very high S-thioesterase activity towards the acyl chains C14:0 > C16:0, moderate activity towards the acyl chains C14:1 > C20:4 ≈ C16:1 ≈ C18:0 ≈ C12:0 > C18:2 ≈ C18:3 > C22:1 ≈ C18:1 ≈ C20:0, low activity towards the acyl chains C10:0 and C22:0, and no activity towards the acyl chains C24:0, C8:0, C6:0, C4:0 and C2:0. PPT2 has a broader range of action than PPT1, although both have a preference for long acyl chains (more than 12 or 14 carbons) over shorter acyl chains (less than 12 carbons). Aguado et al. (supra) also presented a detailed characterization of the PPT2 gene product.
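The acyl-chain preferences reported above for PPT2 can be restated as a simple lookup table. The sketch below is only a transcription of the four qualitative tiers given in the text, using the usual carbons:double-bonds shorthand for chain labels; the dictionary and function names are hypothetical, and no quantitative activity values are implied.

```python
# Qualitative PPT2 S-thioesterase activity tiers, as listed in the text.
PPT2_ACTIVITY_TIERS = {
    "very high": ["C14:0", "C16:0"],
    "moderate": ["C14:1", "C20:4", "C16:1", "C18:0", "C12:0",
                 "C18:2", "C18:3", "C22:1", "C18:1", "C20:0"],
    "low": ["C10:0", "C22:0"],
    "none": ["C24:0", "C8:0", "C6:0", "C4:0", "C2:0"],
}


def activity_tier(chain):
    """Return the reported tier for an acyl chain label, or None if not listed."""
    for tier, chains in PPT2_ACTIVITY_TIERS.items():
        if chain in chains:
            return tier
    return None


def chain_length(chain):
    """Number of carbons in a label such as 'C18:2'."""
    return int(chain[1:].split(":")[0])


# Shortest chain among those with very high or moderate reported activity.
shortest_active = min(
    chain_length(c)
    for tier in ("very high", "moderate")
    for c in PPT2_ACTIVITY_TIERS[tier]
)
```

Here `activity_tier("C16:0")` gives "very high" and `activity_tier("C24:0")` gives "none", while `shortest_active` is 12, in line with the stated preference for chains of 12 or more carbons. The C24:0 entry shows that chain length alone does not determine the tier.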
The putative 302-residue PPT2 and the protein of the invention contain a hydrophobic leader peptide at the N-terminus (signal peptide with a cleavage site predicted at position 34 of the protein of the invention), suggesting that they are secretory glycoproteins. Both proteins exhibit two motifs located at the N-terminus from positions 108 to 121. One motif is common to triglyceride lipases (from positions 110 to 121) and the other one to eukaryotic thiol (Cys) proteases (from positions 108 to 121). Triglyceride lipases are lipolytic enzymes that hydrolyse the ester bond of triglycerides. The most conserved region in all these proteins is centered on a serine residue located in a conserved Gly-Xaa-Ser-Xaa-Gly motif. The PPT2 protein and the protein of the invention contain a cysteine residue (position 115) instead of the first glycine residue in the motif, but other lipases with one mismatch in the consensus have been described (Blow D., Nature, 343: 694-695 (1990)). In the same region as the lipase motif, PPT2 and the protein of the invention contain a motif common to the active site of eukaryotic thiol (Cys) proteases, but with a leucine residue (position 113) instead of the glycine at position 5 of the pattern. In addition, the amino acid sequence of the putative PPT2 shows, at the C-terminus, from positions 171 to 186, a motif common to the growth factor and cytokine receptor family, which is not present in the protein of the invention.
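A scan for the Gly-Xaa-Ser-Xaa-Gly lipase motif that tolerates one mismatch, as discussed above, could be sketched as follows. This is a minimal illustration under stated assumptions, not the pattern-matching method used to annotate the sequences: the central serine is treated as mandatory, mismatches are counted only at the two glycine positions, and the function name and test peptides are hypothetical.

```python
def find_gxsxg(sequence, max_mismatches=1):
    """Find windows resembling G-x-S-x-G in a protein sequence.

    Returns a list of (start, window, mismatches) tuples, where start is the
    1-based position of the first residue of the five-residue window.
    Assumption: the central serine must be present; only the two glycine
    positions may deviate from the consensus.
    """
    hits = []
    for i in range(len(sequence) - 4):
        window = sequence[i:i + 5]
        if window[2] != "S":
            continue  # central serine is required in this sketch
        mismatches = (window[0] != "G") + (window[4] != "G")
        if mismatches <= max_mismatches:
            hits.append((i + 1, window, mismatches))
    return hits
```

On the toy peptide `"AACHSLGAA"`, `find_gxsxg` reports the window `CHSLG` at position 3 with one mismatch, which mirrors the cysteine-for-glycine substitution described above; with `max_mismatches=0` the same peptide yields no hit.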

Aguado et al. (supra) have found that PPT2 is expressed in cells of the immune system as an approximately 42 kDa protein in cell extracts and supernatants and is transcribed as at least five different transcripts. The PPT2 gene is located in the class III region of the human MHC, which contains several genes encoding proteins with potential roles in the immune system and in inflammation. In addition, Aguado et al. (supra) showed that very large amounts of PPT2 are secreted. However, this is not inconsistent with an intracellular activity, because the secreted protein could be internalized into the cell through a receptor and act on a target located in an intracellular organelle. This mechanism has been described for the secreted PPT1, which can be internalized into the cell by the mannose-6-phosphate receptor to act in the lysosome (Verkruyse and Hofmann, J. Biol. Chem., 271: 15831-15836 (1996)), and Soyombo and Hofmann (J. Biol. Chem., 272: 27456-27463 (1997)) reported that PPT2 binds to the mannose-6-phosphate receptor.

Palmitoylation refers to the posttranslational modification of proteins in which the most common fatty acids of the cell (i.e. palmitic, stearic and oleic acids) are attached to the side chain of cysteine residues via high-energy thioester linkages (Bizzozero, O. A. et al., Neurochem. Res., 19: 923-933 (1994); Casey P. J., Science, 268: 221-225 (1995)). At present, a large number of proteins of diverse origin, structure and function are known to be modified with these fatty acids, which attach them to the inner surface of the plasma membrane, where they can function optimally (Casey P. J., Science, 268: 221-225 (1995)). Being anchored to membranes is necessary for the diverse cellular functions of these modified proteins, including signal transduction, vesicle transport and maintenance of the cytoarchitecture. Almost every tissue and subcellular organelle contains a characteristic set of palmitoylated proteins.

The protein of the invention is overexpressed in brain. In recent years a considerable number of functionally relevant nervous system proteins, including ion channels, neurotransmitter receptors, signal transduction components and cell-adhesion molecules, have been found to be palmitoylated. Although the nervous system is not an exception to this rule, both the number of modified proteins in this tissue and the dynamic nature of protein palmitoylation suggest that this modification is critical for regulating important biological processes and that the addition or removal of the fatty acid serves to regulate the activity of these proteins rather than to define their function.

It is believed that the protein of SEQ ID NO:401 or part thereof is a hydrolase, preferably acting on ester bonds, more preferably a thiol ester hydrolase, even more preferably an acyl-thioesterase which, as such, plays a role in fatty acid metabolism, in cellular vesicle transport and maintenance of the cytoarchitecture, in cellular proteolysis, endocytosis, signal transduction, lysosomal storage, cell proliferation and differentiation, and immune and inflammatory responses. The enzyme's substrates are compounds preferably containing an ester bond, more preferably a thiol ester bond, even more preferably an acyl thioester bond. Preferred polypeptides of the invention are polypeptides comprising the amino acids of SEQ ID NO:401 from positions 108 to 121, and 110 to 121. Other preferred polypeptides of the invention are fragments of SEQ ID NO:401 having any of the biological activities described herein. The hydrolytic activity of the protein of the invention or part thereof may be assayed using any of the assays known to those skilled in the art, including those described in Smith et al., Biochem. J., 212: 155 (1983), Spencer et al., J. Biol. Chem., 253: 5922 (1978) and Aguado et al. (supra) or in U.S. Pat. No. 5,445,942.

In another preferred embodiment, the protein of the invention or part thereof may be used to diagnose, treat and/or prevent disorders where the presence of substrates is undesirable or deleterious. Such disorders include but are not limited to infantile neuronal ceroid lipofuscinosis and lysosomal diseases. For diagnostic purposes, the expression of the protein of the invention could be investigated using any of the Northern blotting, RT-PCR or immunoblotting methods described herein and compared to the expression in control individuals. For prevention and/or treatment purposes, the expression of protein of the invention may be enhanced using any of the gene therapy methods described herein or known to those skilled in the art.

In addition, the protein of the invention or part thereof may be used to identify inhibitors for mechanistic and clinical applications. Such inhibitors may then be used to identify or quantify the protein of the invention in a sample, and to diagnose, treat or prevent any of the disorders where the protein's hydrolytic activity is undesirable and/or deleterious, including but not limited to lysosomal diseases, neurodegenerative disorders such as infantile neuronal ceroid lipofuscinosis, Parkinson's and Alzheimer's diseases, and inflammatory and immune disorders including allergies and leukemia.

Further objects of the present invention are compositions and methods of targeting heterologous compounds, either polypeptides or polynucleotides, to lysosomes by recombinantly or chemically fusing a fragment of the protein of the invention to a heterologous polypeptide or polynucleotide. Preferred fragments are any fragments of the protein of the invention, or part thereof, that may contain targeting signals for lysosomes such as those described in Vitale et al., Mol. Cell. Biol., 20: 7342-52 (2000), Blagoveshchenskaya et al., J. Biol. Chem., 273: 2729-37 (1998) and Kornfeld, FASEB J., 1: 462-8 (1987). Such heterologous compounds may be used to modulate lysosomal activity. For example, they may be used to induce and/or prevent lysosomal protein degradation. Moreover, antibodies binding to the protein of the invention or part thereof may be used for detection of the lysosomes using any techniques known to those skilled in the art.

In still another embodiment, the invention relates to methods and compositions using the protein of the invention or part thereof as a marker protein to selectively identify tissues, preferably brain tissues. For example, the protein of the invention or part thereof may be used to synthesize specific antibodies using any techniques known to those skilled in the art including those described herein. Such tissue-specific antibodies may then be used to identify tissues of unknown origin, for example, forensic samples, differentiated tumor tissue that has metastasized to foreign bodily sites, or to differentiate different tissue types in a tissue cross-section using immunochemistry.

Another embodiment of the present invention relates to methods and compositions using the protein of the invention or part thereof to modify plant lipid composition using any assay known to those skilled in the art including those described by the U.S. Pat. Nos. 5,955,650, 5,945,585 and 5,807,893. Indeed, plant lipids have a variety of nutritional uses and many recent research efforts have examined the role that saturated and unsaturated fatty acids play in reducing the risk of coronary heart disease. In the past, it was believed that mono-unsaturates, in contrast to saturates and poly-unsaturates, had no effect on serum cholesterol and coronary heart disease risk. Several recent human clinical studies suggest that diets high in mono-unsaturated fat and low in saturated fat may reduce the “bad” (low-density lipoprotein) cholesterol while maintaining the “good” (high-density lipoprotein) cholesterol (Mattson et al., Journal of Lipid Research, 26: 194-202 (1985)).

In still another embodiment, the protein of the invention or part thereof may be used in enzyme replacement therapy, due to the ability of cells to take up exogenously supplied protein and target it to lysosomes (Neufeld E. F., Annu. Rev. Biochem. 60: 257-280 (1991), Brady R. O. et al, J. Inher. Metab. Dis. 17: 510-519 (1994)), or in bone-marrow transplantation (Hoogerbrugge P. M. et al, Lancet, 345: 1398-1402 (1995)), as bone-marrow-derived microglial cells are believed to penetrate the blood-brain barrier and may theoretically be able to provide sufficient enzyme to correct the metabolic defect in neurons (Krivit W., Cell transplant., 4: 385-392 (1995)). The protein of the invention or part thereof may also be used in genetic engineering of transplanted cells (Salvetti A. et al, Br. Med. J. 51: 106-122 (1995)) or neural progenitor cell engraftment (Snyder E. Y., Nature, 374: 367-370 (1995)) using any technique known to those skilled in the art.

Protein of SEQ ID NO: 254 (Internal Designation 106-006-1-0-E3-CS)

Angiogenin is a member of the pancreatic RNase superfamily of proteins. Its mechanism of action is postulated to involve multiple interactions with other proteins through specific regions on the molecular surface of angiogenin. Potential partners of angiogenin include heparin, plasminogen, elastase, angiostatin, actin, and a 170 kDa receptor on the surface of endothelial cells [Strydom, D. J. (1998) Cell. Mol. Life Sci. 54, 811-824].

Angiogenin is required for the process of angiogenesis. Tumor growth requires angiogenesis, and several anti-angiogenic agents have been produced and are currently in the clinical trial stage. It has also been shown that recurrent gastric cancer patients had a much higher serum concentration of angiogenin than primary gastric cancer patients [Shimoyama, S. and Kaminishi, M. (2000) J. Cancer Res. Clin. Oncol. 126, 468-474]. Therefore, angiogenin can be used as a diagnostic marker for the evaluation of cancer aggressiveness or as an early marker for recurrence over a follow-up period.

Angiogenin is a potent inducer of angiogenesis [Fett, J. W.; Strydom, D. J.; Lobb, R. R.; Alderman, E. M.; Bethune, J. L.; Riordan, J. F.; and Vallee, B. L. (1985) Biochemistry 24, 5480-5486]. Angiogenesis is a complex process of blood vessel formation comprising several separate but interconnected steps at the cellular and biochemical level including: (i) activation of endothelial cells by the action of an angiogenic stimulus, (ii) adhesion and invasion of activated endothelial cells into the surrounding tissues and migration toward the source of the angiogenic stimulus, and (iii) proliferation and differentiation of endothelial cells to form a new microvasculature [Folkman, J. and Shing, Y. (1992) J. Biol. Chem. 267, 10931-10934; Moscatelli, D. and Rifkin, D. B. (1988) Biochim. Biophys. Acta 948, 67-85].

Angiogenin has been demonstrated to induce most of the individual events in the process of angiogenesis including binding to endothelial cells [Badet, J.; Soncin, F.; Guitton, J. D.; Lamare, O.; Cartwright, T.; and Barritault, D. (1989) Proc. Natl. Acad. Sci U.S.A. 86, 8427-8431], stimulating second messengers [Bicknell, R. and Vallee, B. L. (1988) Proc. Natl. Acad. Sci. U.S.A. 85, 5961-5965], mediating cell adhesion [Soncin, F. (1992) Proc. Natl. Acad. Sci. U.S.A. 89, 2232-2236], activating cell-associated proteases [Hu, G. F. and Riordan, J. F. (1993) Biochem. Biophys. Res. Commun. 197, 682-687], inducing cell invasion [Hu, G-F.; Riordan, J. F.; and Vallee, B. L. (1994) Proc. Natl. Acad. Sci. U.S.A. 91, 12096-12100], inducing proliferation of endothelial cells [Hu, G-F.; Riordan, J. F.; and Vallee, B. L. (1997) Proc. Natl. Acad. Sci. U.S.A. 94, 2204-2209] and organizing the formation of tubular structures from the cultured endothelial cells [Jimi, S-I.; Ito, K-I.; Kohno, K.; Ono, M.; Kuwano, M.; Itagaki, Y.; and Isikawa, H. (1985) Biochem. Biophys. Res. Commun. 211, 476-483]. Angiogenin has also been shown to undergo nuclear translocation in endothelial cells via receptor-mediated endocytosis [Moroianu, J. and Riordan, J. F. (1994) Proc. Natl. Acad. Sci. U.S.A. 91, 1677-1681] and nuclear localization sequence-assisted nuclear import [Moroianu, J. and Riordan, J. F. (1994) Biochem. Biophys. Res. Commun. 203, 1765-1772].

While angiogenesis is a tightly-controlled process under usual physiological conditions, abnormal angiogenesis can have devastating consequences in pathological conditions such as arthritis, diabetic retinopathy and tumor growth. It is now well-established that the growth of virtually all solid tumors is angiogenesis dependent [Folkman, J. (1989) J. Natl. Cancer Inst. 82, 4-6]. Angiogenesis is also a prerequisite for the development of metastasis, since it provides the means whereby tumor cells disseminate from the original primary tumor and establish at distant sites [Mahadevan, V. and Hart, I. R. (1990) Rev. Oncol. 3, 97-103; Blood, C. H. and Zetter B. R. (1990) Biochim. Biophys. Acta 1032, 89-118]. Therefore, interference with the process of tumor-induced angiogenesis can be an effective therapy for both primary and metastatic cancers.

Although originally isolated from medium conditioned by human colon cancer cells (Fett et al. (1985), supra), and subsequently shown to be produced by several other histological types of human tumors [Rybak, S. M.; Fett, J. W.; Yao, Q-Z.; and Vallee, B. L. (1987) Biochem. Biophys. Res. Commun. 146, 1240-1248; Olson, K. A.; Fett, J. W.; French, T. C.; Key, M. E.; and Vallee, B. L. (1995) Proc. Natl. Acad. Sci. U.S.A. 92, 442-446], angiogenin is also a constituent of human plasma and normally circulates at a concentration of 250-360 ng/ml [Shimoyama, S.; Gansauge, F.; Gansauge, S.; Negri, G.; Oohara, T.; and Beger, H. G. (1996) Cancer Res. 56, 2703-2706; Blaser, J.; Triebl, S.; Kopp, C.; and Tschesche, H. (1993) Eur. J. Clin. Chem. Clin. Biochem. 31, 513-516].

Several inhibitors of the functions of angiogenin have been developed. These include: (i) monoclonal antibodies (mAbs) [Fett, J. W.; Olson, K. A.; and Rybak, S. M. (1994) Biochemistry 33, 5421-5427], (ii) an angiogenin-binding protein [Hu, G-F.; Chang, S-I.; Riordan, J. F.; and Vallee, B. L. (1991) Proc. Natl. Acad. Sci. U.S.A. 88, 2227-2231; Hu, G-F.; Strydom, D. J.; Fett, J. W.; Riordan, J. F.; and Vallee, B. L. (1993) Proc. Natl. Acad. Sci. U.S.A. 90, 1217-1221; Moroianu, J.; Fett, J. W.; Riordan, J. F.; and Vallee, B. L. (1993) Proc. Natl. Acad. Sci. U.S.A. 90, 3815-3819], (iii) the placental ribonuclease inhibitor (PRI) [Shapiro, R. and Vallee, B. L. (1987) Proc. Natl. Acad. Sci. U.S.A. 84, 2238-2241], (iv) peptides synthesized based on the C-terminal sequence of angiogenin [Rybak, S. M.; Auld, D. S.; St. Clair, D. K.; Yao, Q-Z.; and Fett, J. W. (1989) Biochem. Biophys. Res. Commun. 162, 535-543], and (v) inhibitory site-directed mutagenesis of angiogenin [Shapiro, R. and Vallee, B. L. (1989) Biochemistry 28, 7401-7408].

The subject invention provides for an angiogenin variant protein/polypeptide of SEQ ID NO:254. The invention also provides biologically active fragments of SEQ ID NO:254 comprising the Thr amino acid residue of position 52 of SEQ ID NO:254. In one embodiment, the polypeptides of SEQ ID NO: 254 are interchanged with the corresponding polypeptides encoded by the human cDNA of clone 106-006-1-0-E3-CS. “Biologically active fragments” are defined as those peptide or polypeptide fragments having at least one of the biological functions of the full length protein (e.g., stimulation of angiogenesis). Compositions of the protein/polypeptide of SEQ ID NO:254, or biologically active fragments thereof, are also provided by the subject invention. These compositions may be made according to methods well known in the art. The polypeptides of the present invention can be used in methods described herein and known in the art for other forms of angiogenin.

The invention also provides variants of the protein of SEQ ID NO: 254. These variants have at least about 80%, more preferably at least about 90%, and most preferably at least about 95% amino acid sequence identity to the amino acid sequence encoded by SEQ ID NO: 254. Variants according to the subject invention also have at least one functional or structural characteristic of the protein of SEQ ID NO:254. The invention also provides biologically active fragments of the variant proteins. Compositions of variants, or biologically active fragments thereof, are also provided by the subject invention. These compositions may be made according to methods well known in the art. Unless otherwise indicated, the methods disclosed herein can be practiced utilizing the protein encoded by SEQ ID NO: 254, biologically active fragments of SEQ ID NO:254, variants of SEQ ID NO:254, and biologically active fragments of the variants.
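
By way of illustration only, the identity thresholds recited above can be checked with a short calculation. The following Python sketch assumes that the two sequences have already been aligned (gaps written as "-") and that percent identity is computed over the aligned, ungapped columns; other definitions of percent identity exist and may give different values. The function names and the example sequences are hypothetical and are not taken from SEQ ID NO:254.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap.

    Assumption: the inputs are pre-aligned strings of equal length, with '-'
    marking gaps. This is one common convention, not the only one.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return 100.0 * matches / compared


def identity_tier(pct: float) -> str:
    """Map a percent identity onto the 80 / 90 / 95 thresholds named in the text.

    The text says "at least about"; the hard cutoffs used here are a
    simplification for illustration.
    """
    if pct >= 95.0:
        return "at least 95%"
    if pct >= 90.0:
        return "at least 90%"
    if pct >= 80.0:
        return "at least 80%"
    return "below 80%"


if __name__ == "__main__":
    # Hypothetical 10-residue peptides differing at one position.
    reference = "ACDEFGHIKL"
    variant = "ACDEFGHIKV"
    pct = percent_identity(reference, variant)
    print(pct, identity_tier(pct))
```

In practice the alignment itself would be produced by an alignment program, and the choice of denominator (alignment length, shorter sequence, or reference length) should be stated alongside any reported identity value.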

Because of the redundancy of the genetic code, a variety of different DNA sequences can encode the amino acid sequence of SEQ ID NO:254. In a preferred embodiment, SEQ ID NO: 254 is encoded by clone 106-006-1-0-E3-CS or SEQ ID NO: 13. It is well within the skill of a person trained in the art to create these alternative DNA sequences which encode proteins having the same, or essentially the same, amino acid sequence. These variant DNA sequences are, thus, within the scope of the subject invention. As used herein, reference to “essentially the same” sequence refers to sequences that have amino acid substitutions, deletions, additions, or insertions that do not materially affect biological activity. Fragments retaining one or more characteristic biological activities of the protein encoded by clone 106-006-1-0-E3-CS are also included in this definition. The nucleotides “C” and “G” of positions 338 and 339 of SEQ ID NO:13 represent polymorphisms and can be used in association studies and other methods in the art for using markers.

“Recombinant nucleotide variants” are alternate polynucleotides which encode a particular protein. They can be synthesized, for example, by making use of the “redundancy” in the genetic code. Various codon substitutions, such as the silent changes which produce specific restriction sites or codon usage-specific mutations, can be introduced to optimize cloning into a plasmid or viral vector or expression in a particular prokaryotic or eukaryotic host system, respectively.
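
The redundancy referred to above can be made concrete with a short calculation. The following Python sketch, offered as an illustration only, builds the standard genetic code, counts how many distinct DNA sequences encode a given peptide, and confirms that a silent codon change leaves the translation unchanged. The peptides and DNA strings are hypothetical examples and are not taken from SEQ ID NO:13 or SEQ ID NO:254; the function names are likewise illustrative.

```python
from itertools import product
from math import prod

# Standard genetic code, written in the conventional T, C, A, G ordering
# (first base, then second base, then third base varying fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group synonymous codons by the amino acid (or stop, '*') they encode.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def count_encodings(peptide: str) -> int:
    """Number of distinct DNA sequences that encode the peptide (no stop codon)."""
    return prod(len(SYNONYMS[aa]) for aa in peptide.upper())


def translate(dna: str) -> str:
    """Translate a DNA coding sequence, read in frame from its first base."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


if __name__ == "__main__":
    print(count_encodings("MTW"))   # Met and Trp have one codon each, Thr has four
    print(translate("ATGACCTGG"))   # one encoding of Met-Thr-Trp
    print(translate("ATGACATGG"))   # silent ACC -> ACA change, same peptide
```

The count grows multiplicatively with peptide length, which is why a full-length protein can be encoded by a very large number of alternative DNA sequences.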

In one aspect of the subject invention, SEQ ID NO:254, and variants thereof, can be used to generate polyclonal or monoclonal antibodies. Both biologically active and immunogenic fragments of SEQ ID NO:254, or variant proteins, can be used to produce antibodies. Polyclonal and/or monoclonal antibodies can be made according to methods well known to the skilled artisan. Antibodies produced in accordance with the subject invention can be used in a variety of detection assays known to those skilled in the art. The antibodies may be used to agonize or antagonize the biological activity of the protein of SEQ ID NO:254.

SEQ ID NO:254 can be used as a marker for individuals at risk for the development or recurrence of tumors. As indicated supra, angiogenin is found at certain levels in normal individuals, normally at concentrations of 250-360 ng/ml. Thus, quantitative immunoassays can be used for the detection of abnormal levels (increased) of SEQ ID NO:254, thereby identifying those individuals at risk for the development of tumors. Alternatively, the subject invention provides antibodies specific for SEQ ID NO:254, or fragments thereof, which are used in routine immunoassays to screen for the presence or absence of SEQ ID NO:254, or fragments thereof.

Alternatively, the nucleic acids which encode SEQ ID NO: 254, or fragments thereof, may be used in hybridization assays to detect and/or quantitate the expression of SEQ ID NO: 254. Such hybridization assays are well known to the skilled artisan and can be practiced on a variety of samples, including, but not limited to, tumor cells, biopsied tissues, or normal tissue.

Molecules (see Strydom, D. J., (1998) Cell. Mol. Life Sci. 54, 811-824) that functionally inhibit the action of angiogenin can be used to treat patients with tumors. Because angiogenin is required for the vascularization of tumors, molecules which inhibit the biological activity of angiogenin can be used to reduce tumor vascularization and control tumor growth. Thus, another aspect of the invention provides molecules which inhibit, or reduce, the biological activity of SEQ ID NO: 254. One embodiment provides neutralizing antibodies to inhibit the biological activity of SEQ ID NO: 254. These neutralizing antibodies may be chimeric or humanized, according to methods well known in the art, to minimize the immunogenicity of the molecules when used in patients. Neutralizing antibodies may be used in conjunction with other known therapeutic modalities for the treatment of tumors.

Another embodiment of the invention utilizes the concept that expression of specific genes can be suppressed by oligonucleotides having a nucleotide sequence complementary to the mRNA transcript of the target gene. This suppression occurs by selectively impeding translation and has been termed an “antisense” methodology. In addition, “antigene” or “triplex” methodologies may also suppress expression of genes by using an oligonucleotide which is complementary to a selected site of double stranded DNA, thereby forming a triple-stranded complex to selectively inhibit transcription of the gene. Both “antisense” and “antigene” methodologies can be used to inhibit or reduce the expression of the gene of SEQ ID NO: 254, and thereby provide therapeutic benefit to the patient being treated. Methods of treating individuals using antigene and antisense methodologies are well known to those skilled in the art (see, for example, “Antisense Therapeutics” Agrawal, S. (ed), Humana Press, 1996; Crooke, S. T., and Bennett, C. F. (1996) Annu. Rev. Pharmacol. Toxicol. 36, 107-129; “Prospects for the Therapeutic Use of Antigene Oligonucleotides”, Maher, L. J. (1996) Cancer Investigation 14(1), 66-82 each hereby incorporated by reference in its entirety).
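
As a simple illustration of the complementarity on which the antisense approach relies, the following Python sketch derives a DNA oligonucleotide complementary to a chosen window of an mRNA. The mRNA string, the window coordinates and the function name are hypothetical and are not taken from the transcript of SEQ ID NO:13; the sketch ignores oligonucleotide chemistry, secondary structure and specificity, all of which would matter in practice.

```python
# Watson-Crick partners, giving a DNA oligo for either an RNA or a DNA target.
_COMPLEMENT = {"A": "T", "U": "A", "T": "A", "G": "C", "C": "G"}


def antisense_oligo(mrna: str, start: int, length: int) -> str:
    """Return the DNA oligo (written 5'->3') complementary to an mRNA window.

    start is 1-based; the window covers positions start .. start + length - 1.
    """
    mrna = mrna.upper()
    if start < 1 or length < 1 or start - 1 + length > len(mrna):
        raise ValueError("window falls outside the mRNA sequence")
    target = mrna[start - 1:start - 1 + length]
    # Reverse the target, then complement each base, so that the result
    # reads 5'->3' and pairs antiparallel with the mRNA.
    return "".join(_COMPLEMENT[base] for base in reversed(target))


if __name__ == "__main__":
    toy_mrna = "GGCAUGGCUACC"  # hypothetical transcript containing an AUG codon
    # Window of 6 nucleotides beginning at the AUG (positions 4 to 9).
    print(antisense_oligo(toy_mrna, start=4, length=6))
```

Because the oligonucleotide pairs antiparallel to its target, the sequence is the reverse complement of the mRNA window rather than the simple complement.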

As additional examples, U.S. Pat. No. 5,098,890 is directed to antisense oligonucleotides complementary to the c-myb oncogene and antisense oligonucleotide therapies for certain cancerous conditions. U.S. Pat. No. 5,135,917 provides antisense oligonucleotides that inhibit human interleukin-1 receptor expression. U.S. Pat. No. 5,087,617 provides methods for treating cancer patients with antisense oligonucleotides. U.S. Pat. No. 5,166,195 provides oligonucleotide inhibitors of HIV. U.S. Pat. No. 5,004,810 provides oligomers capable of hybridizing to herpes simplex virus Vmw65 mRNA and inhibiting replication. U.S. Pat. No. 5,194,428 provides antisense oligonucleotides having antiviral activity against influenza virus. U.S. Pat. No. 4,806,463 provides antisense oligonucleotides and methods using them to inhibit HTLV-III replication. U.S. Pat. No. 5,286,717 is directed to mixed linkage oligonucleotide phosphorothioates complementary to an oncogene. U.S. Pat. No. 5,276,019 and U.S. Pat. No. 5,264,423 are directed to phosphorothioate oligonucleotide analogs used to prevent replication of foreign nucleic acids in cells. Each of these patents is hereby incorporated by reference in its entirety.

The subject invention also provides modified/derivatized nucleic acids encoding SEQ ID NO: 254. These include those modifications which increase the stability and/or affinity of these compounds for targets. Phosphorothioate analogs of oligodeoxynucleotides (ODNs), in which nonbridging phosphoryl oxygens in the backbone of DNA are substituted with sulfur ([S]ODNs) are substantially more stable than their native phosphodiester counterparts. Other derivatives, such as those alkylated on sugar oxygen groups, show enhanced target affinity. [S]ODNs possess good biological activity, pharmacology, pharmacokinetics and safety in vivo (Agrawal (1996), supra). Successful inhibition of specific gene function has been achieved by targeting various sites on specific mRNA sequences that include the AUG translational initiation codon, 5′-transcriptional start site, 3′-termination codon and sequences in both the 5′ and 3′-untranslated regions. These derivatized nucleic acids can be used in any of the aforementioned methodologies.

Protein of SEQ ID NO: 387 (Internal Designation 105-073-2-0-A7-CS)

The protein of SEQ ID NO:387 encoded by the cDNA of SEQ ID NO:146 is expressed in liver, ovary, and prostate and overexpressed in salivary glands. The protein of SEQ ID NO:387 belongs to the abhydrolase family, and is characterized by the alpha/beta hydrolase fold (Protein Eng 1992;5:197-211, which disclosure is hereby incorporated by reference in its entirety), which is common to a number of hydrolytic enzymes of widely differing phylogenetic origin and catalytic function.

The core of each enzyme is an alpha/beta-sheet (rather than a barrel), containing 8 strands connected by helices. The enzymes are believed to have diverged from a common ancestor, preserving the arrangement of the catalytic residues. All have a catalytic triad, the elements of which are borne on loops, which are the best conserved structural features of the fold.

Epoxide hydrolases are a family of enzymes which hydrolyze a variety of exogenous and endogenous epoxides to their corresponding diols. Epoxide hydrolases add water to epoxides, forming the corresponding diols. On the basis of sequence similarity, it has been proposed that the mammalian soluble epoxide hydrolase contains 2 evolutionarily distinct domains: the N-terminal domain is similar to bacterial haloacid dehalogenase, while the C-terminal domain is similar to soluble plant epoxide hydrolase, microsomal epoxide hydrolase, and bacterial haloalkane dehalogenase (DNA Cell Biol. 14:61-71 (1995), which disclosure is hereby incorporated by reference in its entirety). Human epoxide hydrolase catalyzes the addition of water to epoxides to form the corresponding dihydrodiol. The enzymatic hydration is essentially irreversible and produces mainly metabolites of lower reactivity that can be conjugated and excreted. The reaction of epoxide hydrolase is therefore generally regarded as detoxifying. Commonly, the action of epoxide hydrolase is followed by excretion of the diols. However, reactivation of certain diols by a second epoxidation may occur. Epoxide hydrolase also inactivates the epoxides arising in the metabolism of endogenous compounds. Lipophilic xenobiotics tend to accumulate in tissues, and they must be transformed to water-soluble compounds to enable their excretion. In this transformation process reactive intermediates are produced. If biotransformation fails to detoxify these reactive intermediates, they may react covalently with critical targets like the genetic material, or start harmful reaction chains like lipid peroxidation. Therefore, epoxide hydrolases are thought to play a role in carcinogenicity and mutagenicity phenomena (Exp Pathol 1990;39(34):195-6.).
In addition, the interaction between epoxide hydrolase activity and alcohol-metabolizing enzymes suggests that epoxide hydrolase activity may be associated with the susceptibility to alcoholic liver disease and hepatocellular carcinoma (Toxicol. Lett. 10;115 (1):17-22 (2000), which disclosure is hereby incorporated by reference in its entirety). Compounds containing the epoxide functionality have become common environmental contaminants because of their wide use as pesticides, sterilants, and industrial precursors. Such compounds also occur as products, by-products, or intermediates in normal metabolism and as the result of spontaneous oxidation of membrane lipids (see, e.g., Brash, et al., Proc. Natl. Acad. Sci., 85:3382-3386 (1988), and Sevanian, A., et al., Molecular Basis of Environmental Toxicology (Bhatnager, R. S., ed.) pp. 213-228, Ann Arbor Science, Michigan (1980)). As three-membered cyclic ethers, epoxides are often very reactive and have been found to be cytotoxic, mutagenic and carcinogenic (see, e.g., Sugiyama, S., et al., Life Sci. 40:225-231 (1987)). Cleavage of the ether bond in the presence of electrophiles often results in adduct formation. As a result, epoxides have been implicated as the proximate toxin or mutagen for a large number of xenobiotics. Reactions of detoxification using epoxide hydrolases typically decrease the hydrophobicity of a compound, resulting in a more polar and thereby excretable substance.

The protein of SEQ ID NO:387 or part thereof is EPOXYLASE-1, an epoxide hydrolase. Preferred polypeptides of the invention are polypeptides comprising the amino acids of SEQ ID NO: 387 from positions 2 to 132, 52 to 137, 29 to 120, 12 to 137, 19 to 136, 151 to 209, 141 to 209, 30 to 108, and 35 to 108. Other preferred polypeptides of the invention are fragments of SEQ ID NO: 387 having any of the biological activities described herein. The hydrolytic activity of the protein of the invention or part thereof may be assayed using any of the assays known to those skilled in the art including those described in Cancer Res. 40(7):2552-6 (1980); Exp Pathol 39(34):195-6 (1990), which disclosures are hereby incorporated by reference in their entireties.
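
Position ranges such as those recited above are conventionally read as 1-based and inclusive, which differs from the 0-based, end-exclusive indexing used by many programming languages. The following Python sketch, offered as an illustration only, shows the conversion. The sequence used is a hypothetical placeholder and is not SEQ ID NO:387, and the function name is illustrative.

```python
def fragment(sequence: str, first: int, last: int) -> str:
    """Return residues first..last of sequence, 1-based and inclusive."""
    if not 1 <= first <= last <= len(sequence):
        raise ValueError("positions fall outside the sequence")
    # 1-based inclusive [first, last] maps to the Python slice [first - 1 : last].
    return sequence[first - 1:last]


if __name__ == "__main__":
    toy_protein = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical 22-residue sequence
    print(fragment(toy_protein, 2, 5))        # residues 2 to 5
    print(len(fragment(toy_protein, 3, 10)))  # a range a..b spans b - a + 1 residues
```

Under this convention a fragment from positions 2 to 132 contains 131 residues, and a fragment from positions 151 to 209 contains 59 residues.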

The invention also relates to methods and compositions using an Epoxylase polypeptide of the invention or part thereof to diagnose, prevent and/or treat several disorders linked to overexpression of the protein of the invention including alcoholic liver disease, hepatocellular carcinoma, ovarian and prostate cancers.

In addition, the protein of the invention or part thereof may be used to identify inhibitors for mechanistic and clinical applications. Such inhibitors may then be used to identify or quantify the protein of the invention in a sample, and to diagnose, treat or prevent any of the disorders where the protein's hydrolytic activity is undesirable and/or deleterious such as disorders characterized by tissue degradation including but not limited to amyloidosis, colitis, lysosomal diseases, arthritis, muscular dystrophy, inflammation, tumor invasion, glomerulonephritis, parasite-borne infections, Alzheimer's disease, periodontal disease, and cancer metastasis.

In another embodiment, the invention relates to methods and compositions using the protein of the invention or part thereof as a marker protein to selectively identify tissues, preferably ovarian, liver or prostate, more preferably salivary glands. For example, the protein of the invention or part thereof may be used to synthesize specific antibodies using any techniques known to those skilled in the art. Such tissue-specific antibodies may then be used to identify tissues of unknown origin, for example, forensic samples, differentiated tumor tissue that has metastasized to foreign bodily sites, or to differentiate different tissue types in a tissue cross-section using immunochemistry.

Protein of SEQ ID NO: 398 (Internal Designation 160-31-3-0-E4-CS)

The protein of SEQ ID NO: 398, encoded by the cDNA of SEQ ID NO: 157, is overexpressed in fetal brain and shows homology with diverse hydrolases. The protein of the invention also displays a motif characteristic of isochorismatase proteins from positions 17 to 147. In addition, the protein of the invention is an alternatively spliced form of an unnamed human protein.

The protein of SEQ ID NO: 398 is ethelase, an ether hydrolase. Preferred polypeptides of the invention are polypeptides comprising the amino acids of SEQ ID NO: 398 from positions 17 to 147. Other preferred polypeptides of the invention are fragments of SEQ ID NO: 398 having any of the biological activities described herein. The hydrolytic activity of the protein of the invention or part thereof may be assayed using any of the assays known to those skilled in the art including those described in U.S. Pat. Nos. 5,445,942, 5,445,956, 6,017,746 and 5,871,616 and in Rusnak et al., Biochemistry 29, 1425-1435 (1990).

In another embodiment, the invention relates to methods and compositions using the protein of the invention or part thereof as a marker protein to selectively identify tissues, preferably fetal brain. For example, the protein of the invention or part thereof may be used to synthesize specific antibodies using any techniques known to those skilled in the art including those described herein. Such tissue-specific antibodies may then be used to identify tissues of unknown origin, for example, forensic samples, differentiated tumor tissue that has metastasized to foreign bodily sites, or to differentiate different tissue types in a tissue cross-section using immunochemistry.

Proteins of SEQ ID NOs: 260 and 265 (Internal Designation 116-004-3-0-A6-CS and 116-091-1-0-D9-CS Respectively)

The protein of SEQ ID NO:260, encoded by the cDNA of SEQ ID NO: 19 and overexpressed in liver and testis, is an isoform of the protein of SEQ ID NO:265, encoded by the cDNA of SEQ ID NO:24 and overexpressed in liver. Both proteins show homology to murine EPCS26 (Hemberger M. et al., Dev. Biol. 222, 158-169 (2000)) with Genbank accession number AF250838. The proteins of SEQ ID NOs:260 and 265 contain a signal peptide (cleavage site at position 18) that could allow the export of the protein to the extracellular domain, the export to a cellular membrane or to define a particular subcellular localization. The cDNA encoding EPCS26 has been shown to be differentially expressed during the process of trophoblast invasion.

Implantation and placentation are key processes in mammalian embryonic development. They physically connect the embryo to its mother and are critical for sufficient nutrient and gas exchange. The extraembryonic cell lineage is the first to differentiate in the developing conceptus, reflecting the importance of this cell lineage for the establishment of fetal-maternal connections. During murine development, the outer layer of the blastocyst, the mural trophectoderm, begins to differentiate into primary trophoblast giant cells on day 5 of gestation (e5). These cells invade the uterine epithelium and penetrate deeply into the stroma. At the same time, the polar trophectoderm cells continue to proliferate and form the ectoplacental cone. On e7, the outer cells of the ectoplacental cone begin to differentiate into secondary trophoblast giant cells. The invasion of uterine stroma by these cells is critical for successful placentation (Cross et al., Science 266, 1508-1518 (1994)).

Trophoblast invasion triggers secretion of proteinases that degrade extracellular matrix molecules. Mouse trophoblasts have been shown to synthesize and secrete serine proteases, matrix metalloproteinases and cysteine proteinases. Invasion of the trophoblast is a highly controlled process. The decidua restricts invasion by secreting proteinase inhibitors. Proteinases and proteinase inhibitors have antagonistic functions in implantation and placentation which may be mirrored by the reciprocity of their expression patterns (Alexander et al., Development 122, 1723-1736 (1996)).

During tumor invasion and metastasis, the degradation of the basement membranes is often accomplished by the proteinases implicated in implantation and normal trophoblast invasion (Strickland and Richards Cell 71, 355-357 (1992), Wilson et al., Proc. Natl. Acad. Sci. USA 94, 1402-1407 (1997)). Uncontrolled trophoblast invasion, as in choriocarcinomas, results in one of the most metastatic tumors known (Strickland and Richards Cell 71, 355-357 (1992)).

A deficient function of the protein of the invention could result in uncontrolled trophoblast invasion which, as in choriocarcinomas, results in one of the most metastatic tumors known (Strickland and Richards Cell 71, 355-357 (1992)).

The proteins of SEQ ID NOs:260 and 265, BLASTYLASE-1 and BLASTYLASE-2, play a role in proteolysis, more specifically during embryogenesis, more specifically during trophoblast invasion. The proteins of the invention or part thereof may act as secreted proteinases that degrade extracellular matrix molecules or, on the contrary, as proteinase inhibitors. Preferred polypeptides of the invention are polypeptides comprising the amino acids of SEQ ID NO:260 from positions 7 to 122 and the amino acids of SEQ ID NO:265 from positions 7 to 81. Other preferred polypeptides of the invention are fragments of SEQ ID NO: 260 and 265 having any of the biological activities described herein. The proteolytic activity of the proteins of the invention or part thereof may be assayed using any of the assays known to those skilled in the art including those described in U.S. Pat. Nos. 6,069,229 and 5,861,267. The protease inhibitor activity of the proteins of the invention or part thereof may be assayed using any of the assays known to those skilled in the art and using methods for determining inhibition constants well known to those skilled in the art (see Fersht, ENZYME STRUCTURE AND MECHANISM, 2nd ed., W.H. Freeman and Co., New York, (1985)).

In addition, the proteins of the invention or part thereof may be used to diagnose, treat or prevent any of the disorders characterized by undesirable and/or deleterious hydrolytic activity such as disorders characterized by tissue degradation including but not limited to amyloidosis, colitis, lysosomal diseases, arthritis, muscular dystrophy, inflammation, tumor invasion, glomerulonephritis, parasite-borne infections, Alzheimer's disease, periodontal disease, cancer metastasis, and choriocarcinoma. For diagnostic purposes, the expression of the proteins of the invention could be investigated using any of the Northern blotting, RT-PCR or immunoblotting methods described herein and compared to the expression in control individuals. Alternatively, inhibitors of the proteins' activity may be developed and used to inhibit and/or reduce that activity using any methods known to those skilled in the art. Overexpression of the proteins of the invention or part thereof may be achieved using any of the gene therapy methods described herein.

In another embodiment, the invention relates to methods and compositions using the proteins of the invention or part thereof as marker proteins to selectively identify tissues, preferably liver and testis for the protein of SEQ ID NO: 260, and preferably liver for the protein of SEQ ID NO: 265. For example, the proteins of the invention or part thereof may be used to produce specific antibodies using any techniques known to those skilled in the art, including those described herein. Such tissue-specific antibodies may then be used to identify tissues of unknown origin, for example, forensic samples or differentiated tumor tissue that has metastasized to foreign bodily sites, or to differentiate different tissue types in a tissue cross-section using immunochemistry. Antibodies that specifically bind the Blastylase 1 or 2 proteins can be used to inhibit a Blastylase activity, preferably embryonic implantation or trophoblast invasion. Also preferred are methods of inhibiting neoplastic or tumor cell invasion, either in vitro or in vivo, by blocking Blastylase activity. Preferably these methods comprise the step of contacting an antibody specific for Blastylase 1 or 2 with a Blastylase 1 or 2 polypeptide under conditions that permit antigen-antibody binding.

Protein of SEQ ID NO: 265 (Internal Designation 116-088-4-0-A9-CS)

The protein of SEQ ID NO: 265, encoded by the cDNA of SEQ ID NO: 24, is overexpressed in testis and liver. This protein of the invention is homologous to the GdX protein, also named UBL4 (Toniolo et al., Proc Natl Acad Sci USA 1988;85:851-5), found in both human (GENPEPT accession number L44140) and mouse (GENPEPT accession number J04761). In addition, the 174-amino-acid-long protein of SEQ ID NO: 265, which is similar in size to ubiquitin-like proteins, displays a Pfam consensus domain from position 1 to 82 that is the hallmark of ubiquitin family proteins.

Ubiquitin is a protein of 76 amino acid residues that is found in all eukaryotic cells and is extremely well conserved from protozoa to vertebrates (Jentsch et al., Trends Cell Biol 2000; 10:335-42). It plays a key role in a variety of cellular processes, such as ATP-dependent selective degradation of cellular proteins, maintenance of chromatin structure, regulation of gene expression, stress response, ribosome biogenesis, cell-cycle progression, signal transduction, transcription and antigen presentation (Wilkinson et al. Annu Rev Nutr 1995; 15:161-89). The first ubiquitin is covalently ligated to target proteins through an isopeptide linkage between the C-terminal glycine residue of ubiquitin and the ε-amino group of an internal lysine residue of the substrate. To generate an efficient proteasomal targeting signal, additional ubiquitin molecules are linked to the first one by isopeptide bonds and form branched poly-ubiquitin complexes (Thrower et al., EMBO J. 2000; 19: 94-102). Covalent binding of ubiquitin to proteins marks them for subsequent degradation by a multicomponent enzymatic complex known as the 26S proteasome (Hershko et al., Annu Rev Biochem 1992; 61:761-807).

The genes encoding ubiquitin-like proteins fall into two separate classes (Hershko et al., Annu Rev Biochem 1992;61:761-807). Proteins of the first class are frequently designated ubiquitin-like modifiers, or UBLs. They produce polyubiquitin molecules consisting of exact head-to-tail repeats of ubiquitin, with a variable number of repeats. These linear polymers of ubiquitin are linked covalently through peptide bonds between the C-terminal glycine residue and N-terminal lysine residue of contiguous ubiquitin molecules. Proteins of the second class are usually termed ubiquitin-domain proteins, or UDPs. These proteins bear a single N-terminal domain that is related to ubiquitin, fused to a C-terminal ribosomal domain consisting of 52 or 76-80 amino acid residues (Finley et al., Nature 1989;338:394-401). These proteins are not conjugated to other proteins and function as a heterogeneous group of proteins. To date, this family includes RAD23, DSK2, PLIC-1, PLIC-2/Chap1, XDRP1, BAG-1, BAT3/Chap2, Scythe, Parkin, UIP28, UBP6, Elongin B, and GdX. In addition, the protein of the invention of SEQ ID NO: 265 clearly belongs to the UDP family, as it displays a single N-terminal ubiquitin consensus domain, which is the hallmark of this protein family subset.

UDPs participate in the regulation of proteolysis through multiple mechanisms, such as interaction with the catalytically active 26S proteasome for RAD23 (Schauber et al., Nature 1998; 391:715-8), hPLIC-1 and hPLIC-2 (Kleijnen et al., Mol Cell 2000; 6:409-19), and BAG-1 (Luders et al., J Biol Chem 2000; 275:4613-7), removal of ubiquitin from conjugates for UBP6 (Wyndham et al., Protein Sci 1997; 8:1268-75), and negative regulation of multi-ubiquitin chain assembly for RAD23 (Ortolan et al., Nature Cell Biol 2000; 2:601-8). In addition, an increasing body of evidence indicates that some UDPs participate in other cellular functions such as protein folding (Luders et al., J Biol Chem 2000; 275:4613-7), apoptosis (Kaye et al., FEBS Lett 2000; 467:348-55), and nucleotide-excision repair (de Laat et al., Genes Dev 1999; 13:768-785). UDP family proteins have been shown to be directly associated with the pathogenesis of several diseases, including xeroderma pigmentosum for RAD23 (Masutani et al., EMBO J 1994; 13:1831-43) and Parkinson's disease for parkin (Kitada et al., Nature 1998; 392:605-8). In addition, involvement of ubiquitin-like proteins or abnormal accumulation of ubiquitinated proteins has been found in multiple human disorders. Most of them, but not all, involve the central nervous system, such as Alzheimer's disease (van Leeuwen et al., Science 1998; 279:242-7), diffuse Lewy body disease (Iseki et al., J Neurol Sci 1997; 146:53-7), Huntington disease (Scherzinger et al., Cell 1997; 90:549-58), and amyotrophic lateral sclerosis (Leigh et al., Brain 1991; 114:775-88). In most of these disorders, ubiquitinated proteins accumulate within cells and form aggregates termed inclusion bodies that have a characteristic appearance on histological examination. In addition, abnormal accumulation of ubiquitinated proteins has been found in von Hippel-Lindau disease (Kamura et al., Proc Natl Acad Sci USA. 2000; 97:10430-5) and in the liver of alcoholic hepatitis patients (Ohta et al., Lab Invest. 1988; 59:848-56). Components of hepatocytes are released into the circulation in alcoholic hepatitis (Sorbi et al., Am J Gastroenterol 1999; 94:1018-22).

It is believed that the protein of SEQ ID NO: 265 or part thereof plays a role in the regulation of proteolysis, preferably as a ubiquitin-like protein, more preferably as a ubiquitin-domain protein. In addition, the protein of the invention may play a role in protein folding, apoptosis and nucleotide-excision repair. Preferred polypeptides of the invention are polypeptides comprising the amino acids of SEQ ID NO:265 from positions 1 to 82. Other preferred polypeptides of the invention are fragments of SEQ ID NO:265 having any of the biological activities described herein.
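
As a purely illustrative aid, a fragment defined by residue positions, such as positions 1 to 82, could be extracted from a polypeptide sequence held as a one-letter string as sketched below. The function name is hypothetical, and the placeholder string stands in for a real sequence; it is not the sequence of SEQ ID NO:265. Positions are treated as 1-based and inclusive, as is conventional in sequence listings.

```python
def fragment(sequence, start, end):
    """Return residues start..end of a sequence (1-based, inclusive)."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("positions must satisfy 1 <= start <= end <= len(sequence)")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]


# Placeholder 174-residue string, NOT the actual sequence of SEQ ID NO:265.
placeholder = "A" * 174
domain = fragment(placeholder, 1, 82)
print(len(domain))
```

The explicit bounds check is intended to catch off-by-one errors when positions quoted in the text are converted to string indices.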

In an embodiment, the invention relates to compositions and methods using the protein of the invention or part thereof to remove, identify or inhibit contaminating proteases in a sample. Compositions comprising the polypeptides of the present invention may be added to biological samples as a “cocktail” with other protease inhibitors to prevent degradation of protein samples. The advantage of using a cocktail of protease inhibitors is that one is able to inhibit a wide range of proteases without knowing th