Contents
- The background and the aim of the project
- Ethnic dimension
- Lack of proper standard genomes. Novelty
- Genomic signatures of hereditary diseases
- Technical aspects and difficulties
- Complete genomes available
- References
The background and the aim of the project
Any two complete individual genome sequences, ~3×10⁹ bases each, are largely identical, except for a few million point changes and other differences (2, 23). The total number of known genetic variants is estimated at tens of millions (34, 35, 45). There is a whole spectrum of types of differences (41, 45), including single nucleotide polymorphisms (SNPs), deletions, insertions, inversions (17), translocations, and variations in copy numbers of genes (4) and repeats (26). These differences reflect the ethnicity and ancestry of the individuals, pathological patterns, and individual “healthy” phenotypic traits.
The main idea of the project is to derive, from existing and forthcoming individual genome sequences, a consensus: the invariable part, free of individual changes, to serve as a standard for molecular medical and other studies. Many of the individual variations are associated with inherited pathological conditions, from mild predisposition to serious life-threatening pathologies. Most of these, fortunately, are recessive, but the chance that both parents carry the same variation makes them all highly undesirable. This connection of the changes with genetic diseases and predispositions suggests a natural term for the genome sequence consensus, the Healthy Genome, since such a genome would contain none of those variations, whether innocent or pathological. Master Genome, Consensus Genome, Genome Standard, Universal Reference Genome and Pan-Genome are possible synonyms of the Healthy Genome.
Thus, the aim of the project is to construct the consensus Healthy Human Genome with the primary purpose of having a balanced standard for medical genetic studies.
“A reference genome sequence is clearly needed for research. Without a point of reference and common coordinate, or naming system, research and clinical assay results cannot be reported in ways that allow for inter-lab comparisons and independent validation of research results.
… A basic coordinate system needs to be developed that can accommodate any indel and rearrangements” (34).
Ethnic dimension
Ethnic variations of sequence polymorphisms are well documented (13,14,16-20,24,27,46). A whole new discipline, pharmacogenomics, is developing, based on the different susceptibilities of various ethnicities to drugs, which are also reflected in the differential occurrence of disease-associated SNPs (19,20). Many genetic abnormalities are known to be typical of specific ethnic or geographic groups (6,39,40). To name a few: Tay-Sachs syndrome, characteristic of the Ashkenazi Jewish population (7); cystic fibrosis among Caucasians, especially Danes (8); diabetes among Puerto Ricans (1); and hypoglycemia in the Faroe Islands (38). The specific ethnic SNPs spread from the geographic location of the respective ethnic group (21), but their higher local occurrence persists. “There are demonstrated differences in people’s genetic makeup that predisposes some groups to different diseases based solely on their ethnicity” (31). Thus, there is every reason to expect that the healthy consensus genome sequences derived for specific ethnic groups would contain sequence features, of ethnic pathology (and normality), that differ from the general Healthy Genome consensus (2). In other words, in order to study one or another ethnically linked genetic abnormality, one has to have separate genome standards for the respective ethnic groups, such as a Jewish Healthy Genome or a Danish Healthy Genome. These, of course, will be very close to the general standard, though perhaps carrying quite a few ethnically specific differences (2, 32, 46). This implies that for reliable detection and characterization of the differences one has to have an unbiased general consensus for Homo sapiens, in which many genome sequences of, desirably, all major distinct ethnic groups are equally represented.
These would be, first of all, Han (China), Bengalis (Bangladesh), Germans, Russians, Italians, Yamato (Japan), Punjabi (Pakistan), French, English, Mestizos (Mexico),… Yoruba (Nigeria) etc., in descending order by population size (e.g., as listed in (3)). A truly representative, unbiased consensus genome can be derived only if the sequences are analyzed at every consecutive stage of the construction of the consensus. For example, genomes of European ancestry, or the East Asian Han and Yamato genomes, may share distinct common features, so additional care would be needed to avoid biases in the general consensus.
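The requirement that all groups be equally represented can be illustrated with a minimal sketch. It assumes pre-aligned, equal-length toy sequences and a two-stage majority vote (first within each group, then across groups); the group labels, sequences and function names are hypothetical and serve only to show how an oversampled group can bias a pooled consensus.

```python
from collections import Counter

def majority_base(bases):
    """Most common base in a column; ties are broken alphabetically."""
    counts = Counter(bases)
    top = max(counts.values())
    return min(b for b, c in counts.items() if c == top)

def column_consensus(seqs):
    """Column-wise majority consensus of equal-length, pre-aligned sequences."""
    return "".join(majority_base(col) for col in zip(*seqs))

def balanced_consensus(groups):
    """Two-stage consensus: each group casts one vote, whatever its sample size."""
    group_consensi = [column_consensus(seqs) for seqs in groups.values()]
    return column_consensus(group_consensi)

# Hypothetical toy data, made up for illustration only.
groups = {
    "group_A": ["ACGT", "ACGT", "ACGT", "ACGT", "ACGT"],  # oversampled group
    "group_B": ["ACTT"],
    "group_C": ["ACTT"],
}

# Pooling all genomes lets the oversampled group dominate the third column.
pooled = column_consensus([s for seqs in groups.values() for s in seqs])
balanced = balanced_consensus(groups)
print(pooled)    # ACGT
print(balanced)  # ACTT
```

In this toy case the pooled vote follows the oversampled group, while the balanced vote follows the base carried by two of the three groups. A real construction would also have to decide how to handle ties and closely related groups.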
The construction of the Healthy Genomes for various ethnic groups is well justified. It is only fair that every ethnicity be treated with equal medical attention, taking into account all the differences, on the basis of the very latest achievements in genome studies.
Lack of proper standard genomes. Novelty.
Today, on the order of 500-1000 genomes of various healthy and sick individuals are available, at various stages of completion. About 70 of them may so far qualify as complete, fully assembled and mapped genomes (42, and listed below) suitable for the derivation of the standard. Notably, the individual sperm genomes (25) are not suitable for the task, since each of them has undergone multiple natural recombinations, and their fertility (and normalcy) status is uncertain. Each of the 70 or so genomes can, of course, be used as a (temporary) reference for analyzing differences between genomes (2), especially those associated with genetic diseases. However, differences between individual genomes do not fully reflect those between the genomes and the (non-existing) standard. Currently, apart from a few individual personal genomes (e.g., of C. Venter and of J. Watson), the most frequently used standard is the NCBI human reference genome (11), which is derived from DNA samples of a small number of anonymous donors (12). Comparisons of individual Chinese, Korean and Yoruba genomes with the NCBI standard reveal hundreds of thousands of differences common to these three individuals (2). It appears that these common sites represent the (non-existing) standard, while the NCBI reference is the outlier.
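The argument about common differences can be sketched on toy data. The example below assumes short, pre-aligned, equal-length sequences; the reference and the three genomes are invented strings, and the function names are hypothetical. Sites where every genome carries the same non-reference base are the candidates for "the reference is the outlier".

```python
def differences(genome, reference):
    """Map 0-based position -> genome base, for sites that differ from the reference."""
    return {i: g for i, (g, r) in enumerate(zip(genome, reference)) if g != r}

def shared_differences(genomes, reference):
    """Sites where all genomes carry the same non-reference base."""
    diffs = [differences(g, reference) for g in genomes]
    common_sites = set(diffs[0]).intersection(*diffs[1:])
    return {
        i: diffs[0][i]
        for i in sorted(common_sites)
        if len({d[i] for d in diffs}) == 1  # same alternative base everywhere
    }

# Invented toy sequences, for illustration only.
reference = "ACGTACGT"
genomes = ["ACGTTCGA", "ACCTTCGT", "ACGTTCGT"]

shared = shared_differences(genomes, reference)
print(shared)  # {4: 'T'}
```

Here only position 4 differs from the reference in all three genomes, and in the same way, so at that site the toy reference, not the individuals, looks atypical.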
Potentially, the sequences from the 1000 Genomes Project (10, 45) could be used for derivation of the standard. This collection, however, is not intended to be ethnically balanced; rather, it is geared to overlap as much as possible with available databases of SNPs. For example, its sample list consists of genomes of only 4 races (Whites, Blacks, Amerindians and East Asians). It does not even include representatives of the second largest ethnic group in the world, Bengalis (3). Moreover, since most of the efforts of the genome sequencing projects are geared to as complete a collection as possible of SNPs and other structural variants, this task, too, suffers from the lack of a good standard: “The current reference sequence, being based on a limited number of samples, neither adequately represents the full range of human diversity, nor is complete” (34). Also, from a recent account of the 1000 Genomes team: “the interpretation of rare variants in individuals with a particular disease should be within the context of the local (either geographic or ancestry-based) genetic background,” (45), alluding to ethnic (geographic) genome standards. Very much in line with our proposal, one could imagine a sort of logo, a consensus genome: “If the Human Genomes sequence was portrayed in this way, we might replace our arbitrary type-specimen with more natural, biologically accurate” (44). “Unfortunately developing robust standards is not the highest priority for the National Institutes of Health (NIH)” (49). The NIH-based Genome Reference Consortium (48) has a different target: “single tiling path is insufficient to represent a genome in regions with complex allelic diversity. The GRC is now working to create assemblies that better represent this diversity and provide more robust substrates for genome analysis”. Our plan is to develop the proper standard(s) first and only then to include the diversity in the standards.
Thus, no systematic effort to construct a balanced genome standard of Homo sapiens has been attempted so far. The novelty of the project is in its very target – the Healthy Genomes standards.
Genomic signatures of hereditary diseases
There are over 9000 known genetic disorders and diseases.
Although rather detailed haplotypes for many of these diseases are known today, the full genomic sequence characterization of the diseases is not available, and the realization is growing that such full characterization is vitally needed (29). “That there is currently no comprehensive, accurate, and openly accessible database of human disease-causing mutations «is the single greatest failure of modern human genetics,» Massachusetts General Hospital’s Daniel MacArthur says” (as cited in 29). It is also clear that any such standard can be developed only after the Healthy Genomes become available.
The genomic disease standards can be derived in a way similar to the ethnic genome standards. The consensus of many individual genomes of carriers of the disease has to be compared with the Healthy Genome. The differences revealed may then serve as the genomic signature of the disease, for further studies and analyses. The personalized complete lists of the abnormalities of individual patients, compared with the standards, will serve as a guide for preventive treatments and genetic consultations.
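A minimal sketch of such a signature derivation is given below. It assumes that each carrier genome has already been compared with the Healthy Genome and reduced to a set of variants written as (chromosome, position, reference base, alternative base); the 75% threshold, the variant tuples and the function name are hypothetical choices for illustration, not part of the proposal.

```python
from collections import Counter

def disease_signature(carrier_variants, min_fraction=0.75):
    """Variants (relative to the standard) present in at least
    min_fraction of the carrier genomes, in sorted order."""
    counts = Counter(v for variants in carrier_variants for v in set(variants))
    needed = min_fraction * len(carrier_variants)
    return sorted(v for v, c in counts.items() if c >= needed)

# Invented variant sets of four hypothetical carriers.
carriers = [
    {("chr1", 100, "A", "G"), ("chr2", 50, "C", "T")},
    {("chr1", 100, "A", "G"), ("chr3", 7, "G", "A")},
    {("chr1", 100, "A", "G"), ("chr2", 50, "C", "T")},
    {("chr1", 100, "A", "G"), ("chr2", 50, "C", "T"), ("chr5", 9, "T", "C")},
]

signature = disease_signature(carriers)
print(signature)
# [('chr1', 100, 'A', 'G'), ('chr2', 50, 'C', 'T')]
```

Variants seen in only one carrier are treated here as individual changes and left out of the signature; the appropriate threshold for real data is an open question.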
1. Genomic signatures of hereditary diseases.
The pathological genomic structural differences specific to Jewish populations (such as Tay-Sachs, diabetes and others) will be a major focus. As the project progresses and the diseases and anomalies specific to, or most problematic for, Latvians are evaluated, the respective disease genomic standards will be addressed as well.
2. Ethnical and medical characterization of individual genomes.
This will become possible after the general Healthy Genome standard, the Ethnic Healthy Genomes, and the signatures of pathologies are derived. Potentially, these products of the project may become a major set of Genome Standards in high demand. Every nation and every ethnic group would need these standards for the highest efficiency of medical care and personalized medicine.
The genomic standards are also needed in studies of human genomic history and evolution (e.g., 36,37), forensic analyses and legacy cases. One can also envisage the analysis and prospective use of genomic sequence signatures of specific talents and professional inclinations. The quick progress in whole-genome studies and applications makes future developments rather hard to predict. It is clear, however, that all these developments will require a whole spectrum of genome sequence standards, starting with the Healthy Genome of Homo sapiens.
Stage 1
For derivation of the first version of the Healthy Genome, 10-20 suitable complete genome sequences will be selected from the list of currently available individual genomes (below), in keeping with the rule “largest ethnic groups first”. Further additions would include more ethnicities, in the order in which new suitable sequences appear in publicly available sources.
The first and subsequent versions of the Healthy Genome will be offered to various users as a commercial product, at a price to be established, together with a package of programs and instructions for use. The package will include the procedure for comparing any genome of interest with the standard, as well as for listing and categorizing the differences.
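The comparison and categorization step might look like the following sketch. It assumes that the genome of interest has already been aligned to the standard as two gapped strings of equal length, with "-" marking a gap and no column gapped in both; the category names, the event tuple layout and the toy sequences are hypothetical.

```python
def categorize_differences(standard, genome):
    """List (kind, position_on_standard, standard_bases, genome_bases) events
    from a gapped pairwise alignment. Positions are 0-based on the ungapped
    standard; runs of gap columns are merged into one insertion or deletion."""
    if len(standard) != len(genome):
        raise ValueError("aligned strings must have equal length")
    events = []
    pos = 0  # coordinate on the ungapped standard
    i, n = 0, len(standard)
    while i < n:
        if standard[i] == "-":            # extra bases in the genome
            j = i
            while j < n and standard[j] == "-":
                j += 1
            events.append(("insertion", pos, "", genome[i:j]))
            i = j
        elif genome[i] == "-":            # bases missing from the genome
            j = i
            while j < n and genome[j] == "-":
                j += 1
            events.append(("deletion", pos, standard[i:j], ""))
            pos += j - i
            i = j
        else:
            if standard[i] != genome[i]:  # single-base substitution
                events.append(("SNP", pos, standard[i], genome[i]))
            pos += 1
            i += 1
    return events

# Invented toy alignment.
events = categorize_differences("ACGT-ACGTA", "ACCTGAC--A")
for event in events:
    print(event)
# ('SNP', 2, 'G', 'C')
# ('insertion', 4, '', 'G')
# ('deletion', 6, 'GT', '')
```

Inversions, translocations and copy number changes are not covered by this sketch; they would need alignment methods beyond a simple gapped string pair.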
Technical aspects and difficulties
Derivation of the standard genomes and signatures is a task as formidable as it is important. Technically, this is a multiple alignment problem. The task is to align selected subsets of the sequenced genomes to each other and to derive the reference genomes, or standard hereditary disease signatures, all of them products of the multiple alignments. Major difficulties would be the incompleteness of some of the genomes, sequencing and mapping errors, multiple locations of the same or almost identical genes, numerous tandem repeats with variable copy numbers, sequence inversions, variable copy numbers of genes (4), and possibly other hurdles that may appear in forthcoming studies of genome structure. The construction of the reference genome has never been attempted so far, and one may expect many surprises.
A package of computer programs has to be developed to build the standard genome from any number of individual genomes, to compare any given genome with the standard, to derive complete lists of differences (signatures) for the individual genomes, to cluster the genomes with similar signatures, and to develop specific full signatures for various genetic diseases. Some programs with similar functions are publicly available and routinely used in the genomic community (2,9).
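Once a multiple alignment exists, the consensus itself can be read off column by column. The sketch below assumes the alignment is given as equal-length strings with "-" for gaps; the strict-majority threshold, the use of "N" for undecided columns, and the toy sequences are assumptions made for illustration.

```python
from collections import Counter

def msa_consensus(aligned, min_fraction=0.5):
    """Column-wise consensus of a multiple alignment.
    A column contributes its most common symbol only if that symbol's share
    exceeds min_fraction; otherwise 'N' is emitted. Columns whose consensus
    symbol is a gap are dropped from the output."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("all aligned sequences must have equal length")
    out = []
    for column in zip(*aligned):
        symbol, count = Counter(column).most_common(1)[0]
        if count / len(column) <= min_fraction:
            out.append("N")       # no clear majority at this site
        elif symbol != "-":
            out.append(symbol)    # majority base
        # a majority gap means most genomes lack this column: skip it
    return "".join(out)

# Invented toy alignment of four sequences.
aligned = ["ACG-TA", "ACGGTA", "ATG-TC", "ACG-TC"]
consensus = msa_consensus(aligned)
print(consensus)  # ACGTN
```

In this toy case the single-genome insertion in the fourth column is excluded, and the evenly split last column is left undecided. Producing the alignment itself, with repeats, inversions and copy number variation, is the hard part and is not addressed here.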
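The clustering of genomes with similar signatures could, for instance, be approached as in the sketch below. It assumes that each signature is a set of variant identifiers, measures similarity with the Jaccard index, and joins genomes by single linkage above a threshold; the threshold of 0.5, the genome names and the variant identifiers are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster_by_signature(signatures, threshold=0.5):
    """Single-linkage clustering: genomes whose signatures have
    Jaccard similarity >= threshold end up in the same cluster."""
    names = sorted(signatures)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(signatures[a], signatures[b]) >= threshold:
                parent[find(b)] = find(a)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())

# Invented signatures: integers stand in for variant identifiers.
signatures = {
    "g1": {1, 2, 3},
    "g2": {2, 3, 4},
    "g3": {10, 11},
    "g4": {10, 11, 12},
}
clusters = cluster_by_signature(signatures)
print(clusters)  # [['g1', 'g2'], ['g3', 'g4']]
```

Other similarity measures and clustering schemes would fit the same interface; the choice here is only the simplest one to state.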
Complete genomes available (refs as indicated and in Google)
- HuRef – C. Venter
- J. Watson
- G. Church
- NA18507 – Yoruba
- Desmond Tutu Bantu (24)
- !Gubi Khoisan (24)
- Seong-Jin Kim SJK – Korean (2)
- AK1 — Korean
- YH – Chinese
- Gordon Moore
- Stephen Quake works on sperm genomes
- Marjolein Kriek
- Hermann Hauser
- 14 others sequenced by Complete Genomics,
- Unknown number sequenced by Knome,
- 7 genomes sequenced at high depth by the 1000 Genomes Project (28):
  - Han Chinese South (CHS)
  - African Caribbean in Barbados (ACB)
  - Puerto Rican in Puerto Rico (PUR)
  - Peruvian in Lima, Peru (PEL)
  - Punjabi in Lahore, Pakistan (PJL)
  - Sri Lankan Tamil in the UK (STU)
  - Indian Telugu in the UK (ITU)
- 10 genomes from various labs (23) (10 more – from sick individuals)
- 20 Korean genomes (27)
- Steve Jobs (43)
- First Irish Genome
- Sitting Bull (Chief)
- First Russian Genome (NCBI)
- Glenn Close
- Five Southern African Genomes (with Tutu)
(to be continued)
References
1. National Diabetes Information Clearinghouse. National Diabetes Statistics, 2011. USA
2. Ahn S.-M. et al. (2009). The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group. Genome Res. 19(9): 1622–1629
3. CIA, The World Factbook.
4. Li J. et al. (2009). Whole Genome Distribution and Ethnic Differentiation of Copy Number Variation in Caucasian and Asian Populations. PLoS ONE 4(11): e7958
5. Macrae F., 12 July 2012. Gene test could soon see if future lovers are compatible. http://www.dailymail.co.uk/health/article-2172870/Gene-test-soon-future-lovers-compatible.html?ito=feeds-newsxml
6. Milunsky A. Ethnicity and Genes. http://www.babyzone.com/pregnancy/fetal_development/genetics_gender/article/ethnicity-genes-disease
7. Sachs, Bernard (1887), «On arrested cerebral development with special reference to cortical pathology», Journal of Nervous Mental Disease 14 (9): 541–554
8. Wennberg C, Kucinskas V (1994). «Low frequency of the delta F508 mutation in Finno-Ugrian and Baltic populations». Hum. Hered. 44 (3): 169–71.
9. Axelrod N. et al. (2009). The HuRef Browser: a web resource for individual human genomics. Nucleic Acids Res. 37(Database issue): D1018–D1024.
10. G Spencer, International Consortium Announces the 1000 Genomes Project, EMBARGOED (2008) http://www.1000genomes.org/files/1000Genomes-NewsRelease.pdf
11. Pruitt KD, Tatusova T, Maglott DR (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35: D61–65
12. Dewey FE et al. (2011) Phased Whole-Genome Genetic Risk in a Family Quartet Using a Major Allele Reference Sequence. PLoS Genet 7(9): e1002280
13. Mori M., et al. Journal of Human Genetics (2005) 50, 264–266. Ethnic differences in allele frequency of autoimmune-disease-associated SNPs
14. Silverberg MS, et al. European Journal of Human Genetics (2007) 15, 328–335. Refined genomic localization and ethnic differences observed for the IBD5 association with Crohn’s disease
15. Human Genome Project Opens the Door to Ethnically Specific Bioweapons. Project Censored, Apr. 30, 2010 (web source)
16. Ghodke Y. et al. Profiling single nucleotide polymorphisms (SNPs) across intracellular folate metabolic pathway in healthy Indians. Indian J Med Res 133, March 2011, pp 274-279
17. Karyn Meltz Steinberg et al., Structural diversity and African origin of the 17q21.31 inversion polymorphism Nature Genetics. Published online: 01 July 2012
18. Richard S Spielman et al. Common genetic variants account for differences in gene expression among ethnic groups Nature Genetics 39, 226-231, 2007
19. Kimchi-Sarfaty C. et al., Ethnicity-related polymorphisms and haplotypes in the human ABCB1 gene. Pharmacogenomics. 2007 Jan;8(1):29-39.
20. Peter H. O’Donnell and M. Eileen Dolan. Cancer Pharmacoethnicity: Ethnic Differences in Susceptibility to the Effects of Chemotherapy. Clin Cancer Res. 2009 August 1; 15(15): 4806–4814.
21. Templeton A. www.faculty.biol.ttu.edu/strauss/Phylogenetics/Readings/Templeton1998
22. Hunt, S. (2008) Pharmacogenetics, personalized medicine, and race. Nature Education 1(1)
23. Pelak K, Shianna KV, Ge D, Maia JM, Zhu M, et al. (2010) The Characterization of Twenty Sequenced Human Genomes. PLoS Genet 6(9): e1001111
24. Stephan C. Schuster et al., Complete Khoisan and Bantu genomes from southern Africa. Nature 463, 943-947; 2010
25. Jianbin Wang et al. Genome-wide Single-Cell Analysis of Recombination Activity and De Novo Mutation Rates in Human Sperm. Cell, Volume 150, Issue 2, 402-412, 20 July 2012
26. Trifonov, E. N., The tuning function of the tandemly repeating sequences: molecular device for fast adaptation. In: Evolutionary Theory and Processes: Modern Horizons, Wasser, S. P. (Ed.), Kluwer Academic Publishers, pp 115-138 (2004)
27. http://www.bio-itworld.com/2011/09/13/korean-genome-project-finds-korea-SNPs.html
28. http://www.1000genomes.org/about , 2012
29. http://www.genomeweb.com/mdx/quest-clarity?page=show (T. Vance, A quest for clarity, July/August 2012)
31. http://www.examiner.com/article/ethnicity-specific-reference-genome-improves-everyone-s-health; Dewey FE et al. (2011) Phased Whole-Genome Genetic Risk in a Family Quartet Using a Major Allele Reference Sequence. PLoS Genet 7(9): e1002280
32. http://ethnicgenome.wordpress.com/
33. http://www.statgen.nus.edu.sg/~SGVP/
34. Rosenfeld JA, Mason CE, Smith TM (2012) Limitations of the Human Reference Genome for Personalized Genomics. PLoS ONE 7(7): e40294
35. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, et al. (2001) dbSNP: the NCBI database of genetic variation. Nucleic acids research 29: 308–311
36. Rocca RA, Magoon G, Reynolds DF, Krahn T, Tilroe VO, et al. (2012) Discovery of Western European R1b1a2 Y Chromosome Variants in 1000 Genomes Project Data: An Online Community Approach. PLoS ONE 7(7): e41634
37. Nature 485, Special issue: Peopling the planet. (03 May 2012)
38. Jef Akst, Island disease, The Scientist, August 2012.
39. Yakobson E et al. A single Mediterranean, possibly Jewish, origin for the Val59Gly CDKN2A mutation in four melanomaprone families. EUROPEAN JOURNAL OF HUMAN GENETICS 11, 288-296, 2003
40. Leachman SA,…Yakobson E, et al. Selection criteria for genetic assessment of patients with familial melanoma. JOURNAL OF THE AMERICAN ACADEMY OF DERMATOLOGY 61, 677-684, 2009
41. S. Levy,… C. Venter, PLoS Biol 5(10): e254. The Diploid Genome Sequence of an Individual Human
43. Lohr, Steve (2011-10-20). «New Book Details Jobs’s Fight Against Cancer». The New York Times.
44. Weiss KM http://the-scientist.com/2012/08/17/opinion-what-is-the-human-genome/
45. The 1000 Genomes Project Consortium (2012). An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65
46. Diseases that occur frequently in the Jewish community: Tay-Sachs disease, Gaucher disease, Canavan disease, Familial Dysautonomia, Niemann-Pick Type A and B diseases, and cystic fibrosis (The Center for Jewish Genetic Diseases at The Mount Sinai Medical Center in New York)
47. Nathaniel Comfort. The Science of Human Perfection: How Genes Became the Heart of American Medicine
48. http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/
49. T. Smith, S. Porter, Genomic inequality, The Scientist, Dec. 1, 2012
Molecular Genetics
Weizmann Institute of Science
Rehovot, ISRAEL