Data availability

The T2T-CHM13v2.0 (T2T-CHM13+Y) assembly, the reference analysis set and a complete list of resources (including gene annotation, repeat annotation, epigenetic profiles, variant-calling results from 1KGP and SGDP, and the gnomAD, ClinVar, GWAS and dbSNP datasets) are available for download at https://github.com/marbl/CHM13. The assembly is also available from NCBI and EBI under GenBank accession GCA_009914755.4. Annotation and associated resources can also be browsed as ‘hs1’ in the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_3671779_hs1), in the Ensembl Genome Browser (https://projects.ensembl.org/hprc/; assembly name T2T-CHM13v2.0) and in the NCBI data-hub (https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_009914755.1/). Potential assembly issues are listed and tracked at https://github.com/marbl/CHM13-issues. 1KGP and SGDP short-read alignments and variant calls are available within AnVIL at https://anvil.terra.bio/#workspaces/anvil-datastorage/AnVIL_T2T_CHRY. Original data from the Gerton lab underlying this manuscript can be accessed from the Stowers Original Data Repository at http://www.stowers.org/research/publications/libpb-2358. Sequencing data used in this study are listed in Supplementary Table 1.
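The statement above quotes two assembly accessions, GCA_009914755.4 (GenBank) and GCF_009914755.1 (RefSeq, in the NCBI data-hub URL). As a minimal sketch for readers who script their downloads, the following Python snippet splits such an accession into its parts and checks whether two accessions carry the same nine-digit identifier. The regular expression, the function names and the `AssemblyAccession` type are illustrative assumptions, not part of any tool used in this study, and a shared identifier is only a hint that two records are paired, not a guarantee.

```python
import re
from typing import NamedTuple

# Accessions quoted in the Data availability statement.
GENBANK_ACCESSION = "GCA_009914755.4"
REFSEQ_ACCESSION = "GCF_009914755.1"

# Assumed pattern for assembly accessions: GCA_ (GenBank) or GCF_ (RefSeq),
# nine digits, a dot, then a version number.
_ACCESSION_RE = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


class AssemblyAccession(NamedTuple):
    """Hypothetical container for the parts of an assembly accession."""
    source: str      # "GenBank" or "RefSeq"
    identifier: str  # nine-digit identifier, kept as text to preserve zeros
    version: int


def parse_accession(accession: str) -> AssemblyAccession:
    """Split an assembly accession into source, identifier and version."""
    match = _ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, identifier, version = match.groups()
    source = "GenBank" if prefix == "GCA" else "RefSeq"
    return AssemblyAccession(source, identifier, int(version))


def share_identifier(a: str, b: str) -> bool:
    """Return True when two accessions carry the same nine-digit identifier."""
    return parse_accession(a).identifier == parse_accession(b).identifier


if __name__ == "__main__":
    print(parse_accession(GENBANK_ACCESSION))
    print(share_identifier(GENBANK_ACCESSION, REFSEQ_ACCESSION))
```

Keeping the identifier as a string rather than an integer preserves the leading zeros, so the parsed parts can be joined back into the original accession.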

Code availability

Custom code developed for data analysis and visualization is available at https://github.com/arangrhie/T2T-HG002Y, https://github.com/snurk/sg_sandbox and https://github.com/schatzlab/t2t-chm13-chry, and is deposited with Zenodo (ref. 159). The software and parameters used are described in further detail in the Supplementary Methods.

References

  1. Skaletsky, H. et al. The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature 423, 825–837 (2003).

  2. Miga, K. H. et al. Centromere reference models for human chromosomes X and Y satellite arrays. Genome Res. 24, 697–707 (2014).

  3. Vollger, M. R. et al. Segmental duplications and their variation in a complete human genome. Science 376, eabj6965 (2022).

  4. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).

  5. Schneider, V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).

  6. Gustafson, M. L. & Donahoe, P. K. Male sex determination: current concepts of male sexual differentiation. Annu. Rev. Med. 45, 505–524 (1994).

  7. Vogt, P. H. et al. Human Y chromosome azoospermia factors (AZF) mapped to different subregions in Yq11. Hum. Mol. Genet. 5, 933–943 (1996).

  8. Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature 585, 79–84 (2020).

  9. Logsdon, G. A. et al. The structure, function and evolution of a complete human chromosome 8. Nature 593, 101–107 (2021).

  10. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).

  11. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).

  12. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).

  13. Rautiainen, M. & Marschall, T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 21, 253 (2020).

  14. Formenti, G. et al. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat. Methods 19, 696–704 (2022).

  15. Kirsche, M. et al. Jasmine and Iris: population-scale structural variant comparison and analysis. Nat. Methods 20, 408–417 (2023).

  16. Jain, C., Rhie, A., Hansen, N. F., Koren, S. & Phillippy, A. M. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat. Methods 19, 705–710 (2022).

  17. Mc Cartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat. Methods 19, 687–695 (2022).

  18. Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).

  19. Wang, T. et al. The Human Pangenome Project: a global resource to map genomic diversity. Nature 604, 437–446 (2022).

  20. Jarvis, E. D. et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature 611, 519–531 (2022).

  21. Shumate, A. et al. Assembly and annotation of an Ashkenazi human reference genome. Genome Biol. 21, 129 (2020).

  22. Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).

  23. Landrum, M. J. et al. ClinVar: improvements to accessing data. Nucleic Acids Res. 48, D835–D844 (2020).

  24. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001–D1006 (2014).

  25. Smigielski, E. M., Sirotkin, K., Ward, M. & Sherry, S. T. dbSNP: a database of single nucleotide polymorphisms. Nucleic Acids Res. 28, 352–355 (2000).

  26. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

  27. Byrska-Bishop, M. et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell 185, 3426–3440 (2022).

  28. Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).

  29. Dunham, I. et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

  30. Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).

  31. Sanders, A. D. et al. Single-cell analysis of structural variations and complex rearrangements with tri-channel processing. Nat. Biotechnol. 38, 343–354 (2020).

  32. Hallast, P. et al. Assembly of 43 human Y chromosomes reveals extensive complexity and variation. Nature https://doi.org/10.1038/s41586-023-06425-6 (2023).

  33. Hammer, M. F. et al. Extended Y chromosome haplotypes resolve multiple and unique lineages of the Jewish priesthood. Hum. Genet. 126, 707 (2009).

  34. Poznik, G. D. et al. Punctuated bursts in human male demography inferred from 1,244 worldwide Y-chromosome sequences. Nat. Genet. 48, 593–599 (2016).

  35. Vegesna, R., Tomaszkiewicz, M., Medvedev, P. & Makova, K. D. Dosage regulation, and variation in gene expression and copy number of human Y chromosome ampliconic genes. PLoS Genet. 15, e1008369 (2019).

  36. NCBI RefSeq v110 Browser. Homo sapiens isolate NA24385 chromosome Y, alternate assembly T2T-CHM13v2.0. https://tinyurl.com/bdfudexn (2022).

  37. Hoyt, S. J. et al. From telomere to telomere: the transcriptional and epigenetic state of human repeat elements. Science 376, eabk3112 (2022).

  38. Warburton, P. E. et al. Analysis of the largest tandemly repeated DNA families in the human genome. BMC Genomics 9, 533 (2008).

  39. Halabian, R. & Makałowski, W. A map of 3′ DNA transduction variants mediated by non-LTR retroelements on 3202 human genomes. Biology 11, 1032 (2022).

  40. Weissensteiner, M. H. et al. Accurate sequencing of DNA motifs able to form alternative (non-B) structures. Genome Res. 33, 907–922 (2023).

  41. Tyler-Smith, C., Taylor, L. & Müller, U. Structure of a hypervariable tandemly repeated DNA sequence on the short arm of the human Y chromosome. J. Mol. Biol. 203, 837–848 (1988).

  42. Xue, Y. & Tyler-Smith, C. An exceptional gene: evolution of the TSPY gene family in humans and other great apes. Genes 2, 36–47 (2011).

  43. Saxena, R. et al. Four DAZ genes in two clusters found in the AZFc region of the human Y chromosome. Genomics 67, 256–267 (2000).

  44. Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science 376, eabl4178 (2022).

  45. Jain, M. et al. Linear assembly of a human centromere on the Y chromosome. Nat. Biotechnol. 36, 321–323 (2018).

  46. Gershman, A. et al. Epigenetic patterns in a complete human genome. Science 376, eabj5089 (2022).

  47. Kasinathan, S. & Henikoff, S. Non-B-form DNA is enriched at centromeres. Mol. Biol. Evol. 35, 949–962 (2018).

  48. Nailwal, M. & Chauhan, J. B. Azoospermia factor C subregion of the Y chromosome. J. Hum. Reprod. Sci. 10, 256 (2017).

  49. Kuroda-Kawaguchi, T. et al. The AZFc region of the Y chromosome features massive palindromes and uniform recurrent deletions in infertile men. Nat. Genet. 29, 279–286 (2001).

  50. Repping, S. et al. A family of human Y chromosomes has dispersed throughout northern Eurasia despite a 1.8-Mb deletion in the azoospermia factor c region. Genomics 83, 1046–1052 (2004).

  51. Porubsky, D. et al. Recurrent inversion polymorphisms in humans associate with genetic instability and genomic disorders. Cell 185, 1986–2005 (2022).

  52. Teitz, L. S., Pyntikova, T., Skaletsky, H. & Page, D. C. Selection has countered high mutability to preserve the ancestral copy number of Y chromosome amplicons in diverse human lineages. Am. J. Hum. Genet. 103, 261–275 (2018).

  53. Jobling, M. A. Copy number variation on the human Y chromosome. Cytogenet. Genome Res. 123, 253–262 (2008).

  54. Navarro-Costa, P., Plancha, C. E. & Gonçalves, J. Genetic dissection of the AZF regions of the human Y chromosome: thriller or filler for male (in)fertility? Biomed Res. Int. 2010, e936569 (2010).


  55. Evans, H. J., Gosden, J. R., Mitchell, A. R. & Buckland, R. A. Location of human satellite DNAs on the Y chromosome. Nature 251, 346–347 (1974).

  56. Schmid, M., Guttenbach, M., Nanda, I., Studer, R. & Epplen, J. T. Organization of DYZ2 repetitive DNA on the human Y chromosome. Genomics 6, 212–218 (1990).

  57. Manz, E., Alkan, M., Bühler, E. & Schmidtke, J. Arrangement of DYZ1 and DYZ2 repeats on the human Y-chromosome: a case with presence of DYZ1 and absence of DYZ2. Mol. Cell. Probes 6, 257–259 (1992).

  58. Altemose, N. A classical revival: human satellite DNAs enter the genomics era. Semin. Cell Dev. Biol. 128, 2–14 (2022).

  59. Gripenberg, U. Size variation and orientation of the human Y chromosome. Chromosoma 15, 618–629 (1964).

  60. Mathias, N., Bayés, M. & Tyler-Smith, C. Highly informative compound haplotypes for the human Y chromosome. Hum. Mol. Genet. 3, 115–123 (1994).

  61. Altemose, N., Miga, K. H., Maggioni, M. & Willard, H. F. Genomic characterization of large heterochromatic gaps in the human genome assembly. PLoS Comput. Biol. 10, e1003628 (2014).

  62. Cooke, H. Repeated sequence specific to human males. Nature 262, 182–186 (1976).

  63. Frommer, M., Prosser, J. & Vincent, P. C. Human satellite I sequences include a male specific 2.47 kb tandemly repeated unit containing one Alu family member per repeat. Nucleic Acids Res. 12, 2887–2900 (1984).

  64. Babcock, M., Yatsenko, S., Stankiewicz, P., Lupski, J. R. & Morrow, B. E. AT-rich repeats associated with chromosome 22q11.2 rearrangement disorders shape human genome architecture on Yq12. Genome Res. 17, 451–460 (2007).

  65. Webster, T. H. et al. Identifying, understanding, and correcting technical artifacts on the sex chromosomes in next-generation sequencing data. GigaScience 8, giz074 (2019).

  66. Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Science 376, eabl3533 (2022).

  67. Bekritsky, M. A., Colombo, C. & Eberle, M. A. Identifying genomic regions with high quality single nucleotide variant calling. Illumina https://www.illumina.com/content/illumina-marketing/amr/en_US/science/genomics-research/articles/identifying-genomic-regions-with-high-quality-single-nucleotide-.html (2023).

  68. Breitwieser, F. P., Pertea, M., Zimin, A. V. & Salzberg, S. L. Human contamination in bacterial genomes has created thousands of spurious proteins. Genome Res. 29, 954–960 (2019).

  69. Steinegger, M. & Salzberg, S. L. Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank. Genome Biol. 21, 115 (2020).

  70. Chrisman, B. et al. The human “contaminome”: bacterial, viral, and computational contamination in whole genome sequences from 1000 families. Sci. Rep. 12, 9863 (2022).

  71. Kent, W. J. et al. The Human Genome Browser at UCSC. Genome Res. 12, 996–1006 (2002).

  72. Rautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01662-6 (2023).

  73. Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).

  74. Jiang, Z., Hubley, R., Smit, A. & Eichler, E. E. DupMasker: a tool for annotating primate segmental duplications. Genome Res. 18, 1362–1368 (2008).

  75. Vollger, M. R., Kerpedjiev, P., Phillippy, A. M. & Eichler, E. E. StainedGlass: interactive visualization of massive tandem repeat structures with identity heatmaps. Bioinformatics 38, 2049–2051 (2022).

  76. Skene, P. J. & Henikoff, S. An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites. eLife 6, e21856 (2017).

  77. Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).

  78. Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).

  79. Koren, S. et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat. Biotechnol. 36, 1174–1182 (2018).

  80. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).

  81. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).

  82. Shafin, K. et al. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nat. Methods 18, 1322–1332 (2021).

  83. Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single molecule sequencing. Nat. Methods 15, 461–468 (2018).

  84. Jiang, T. et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol. 21, 189 (2020).

  85. Bzikadze, A. V., Mikheenko, A. & Pevzner, P. A. Fast and accurate mapping of long reads to complete genome assemblies with VerityMap. Genome Res. 32, 2107–2118 (2022).

  86. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

  87. Porubsky, D. et al. breakpointR: an R/Bioconductor package to localize strand state changes in Strand-seq data. Bioinformatics 36, 1260–1261 (2020).

  88. PacBio Revio WGS Dataset. Homo sapiens – GIAB trio HG002-4. https://downloads.pacbcloud.com/public/revio/2022Q4/ (2022).

  89. Poznik, D. yhaplo | Identifying Y-chromosome haplogroups. GitHub https://github.com/23andMe/yhaplo (2022).

  90. Tseng, B. et al. Y-SNP Haplogroup Hierarchy Finder: a web tool for Y-SNP haplogroup assignment. J. Hum. Genet. 67, 487–493 (2022).

  91. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  92. Li, H. Identifying centromeric satellites with dna-brnn. Bioinformatics 35, 4408–4410 (2019).

  93. Harris, R. S. Improved Pairwise Alignment of Genomic DNA (Pennsylvania State Univ., 2007).

  94. Morgulis, A., Gertz, E. M., Schäffer, A. A. & Agarwala, R. WindowMasker: window-based masker for sequenced genomes. Bioinformatics 22, 134–141 (2006).

  95. Chin, C.-S. et al. Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes. Nat. Methods https://doi.org/10.1038/s41592-023-01914-y (2023).

  96. Frankish, A. et al. GENCODE 2021. Nucleic Acids Res. 49, D916–D923 (2021).

  97. Armstrong, J. et al. Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature 587, 246–251 (2020).

  98. Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).

  99. Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).

  100. Fiddes, I. T. et al. Comparative Annotation Toolkit (CAT)—simultaneous clade and personal genome annotation. Genome Res. 28, 1029–1038 (2018).

  101. Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics 37, 1639–1643 (2021).

  102. Dale, R. K., Pedersen, B. S. & Quinlan, A. R. Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics 27, 3423–3424 (2011).

  103. Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021).

  104. Pruitt, K. D. et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 42, D756–D763 (2014).

  105. Kapustin, Y., Souvorov, A., Tatusova, T. & Lipman, D. Splign: algorithms for computing spliced alignments with identification of paralogs. Biol. Direct 3, 20 (2008).

  106. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).

  107. Slater, G. S. C. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005).

  108. Zook, J. M. et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol. 32, 246–251 (2014).

  109. Numanagić, I. et al. Fast characterization of segmental duplications in genome assemblies. Bioinformatics 34, i706–i714 (2018).

  110. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).

  111. Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0 2013–2015. http://www.repeatmasker.org (2015).

  112. Storer, J., Hubley, R., Rosen, J., Wheeler, T. J. & Smit, A. F. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob. DNA 12, 2 (2021).

  113. Olson, D. & Wheeler, T. ULTRA: a model based tool to detect tandem repeats. ACM BCB 2018, 37–46 (2018).

  114. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

  115. Storer, J. M., Hubley, R., Rosen, J. & Smit, A. F. A. Curation guidelines for de novo generated transposable element families. Curr. Protoc. 1, e154 (2021).

  116. Kent, W. J. BLAT—the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).

  117. Szak, S. T. et al. Molecular archeology of L1 insertions in the human genome. Genome Biol. 3, research0052.1 (2002).

  118. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

  119. Cer, R. Z. et al. Searching for non-B DNA-forming motifs using nBMST (non-B DNA motif search tool). Curr. Protoc. Hum. Genet. 73, 18.7.1–18.7.22 (2012).


  120. Zou, X. et al. Short inverted repeats contribute to localized mutability in human somatic cells. Nucleic Acids Res. 45, 11213–11221 (2017).

  121. Svetec Miklenić, M. et al. Size-dependent antirecombinogenic effect of short spacers on palindrome recombinogenicity. DNA Repair 90, 102848 (2020).

  122. Sahakyan, A. B. et al. Machine learning model for sequence-driven DNA G-quadruplex formation. Sci. Rep. 7, 14535 (2017).

  123. Hao, Z. et al. RIdeogram: drawing SVG graphics to visualize and map genome-wide data on the idiograms. PeerJ Comput. Sci. 6, e251 (2020).

  124. Dotmatics. GraphPad Prism v.9.1.0 for Windows. https://www.graphpad.com (16 March 2021).

  125. Vollger, M. R. SafFire. GitHub https://github.com/mrvollger/SafFire (2022).

  126. Pendleton, A. L. et al. Comparison of village dog and wolf genomes highlights the role of the neural crest in dog domestication. BMC Biol. 16, 64 (2018).

  127. Hach, F. et al. mrsFAST: a cache-oblivious algorithm for short-read mapping. Nat. Methods 7, 576–577 (2010).

  128. Escalona, M. et al. Whole-genome sequence and assembly of the Javan gibbon (Hylobates moloch). J. Hered. 114, 35–43 (2023).

  129. Cortez, D. et al. Origins and functional evolution of Y chromosomes across mammals. Nature 508, 488–493 (2014).

  130. Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).

  131. Dotmatics. Geneious v2019.2.3. https://www.geneious.com/ (2019).

  132. Rambaut, A. et al. FigTree v1.4.4. http://tree.bio.ed.ac.uk/software/figtree/ (2018).

  133. Tyler-Smith, C. & Brown, W. R. A. Structure of the major block of alphoid satellite DNA on the human Y chromosome. J. Mol. Biol. 195, 457–470 (1987).

  134. Shepelev, V. A. et al. Annotation of suprachromosomal families reveals uncommon types of alpha satellite organization in pericentromeric regions of hg38 human genome assembly. Genomics Data 5, 139–146 (2015).

  135. Lee, I. et al. Simultaneous profiling of chromatin accessibility and methylation on human cell lines with nanopore sequencing. Nat. Methods 17, 1191–1199 (2020).

  136. Krumsiek, J., Arnold, R. & Rattei, T. Gepard: a rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics 23, 1026–1028 (2007).

  137. Rice, P., Longden, I. & Bleasby, A. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000).

  138. Sun, C. et al. Deletion of azoospermia factor a (AZFa) region of human Y chromosome caused by recombination between HERV15 proviruses. Hum. Mol. Genet. 9, 2291–2296 (2000).

  139. Lassmann, T. Kalign 3: multiple sequence alignment of large datasets. Bioinformatics 36, 1928–1929 (2020).

  140. Wheeler, T. J. & Eddy, S. R. nhmmer: DNA homology search with profile HMMs. Bioinformatics 29, 2487–2489 (2013).

  141. Stephens, Z. D. et al. Simulating next-generation sequencing datasets from empirical mutation and sequencing models. PLoS ONE 11, e0167047 (2016).

  142. Bushnell, B. BBMap: a fast, accurate, splice-aware aligner. OSTI.gov https://www.osti.gov/biblio/1241166 (2017).

  143. Aken, B. L. et al. Ensembl 2017. Nucleic Acids Res. 45, D635–D642 (2017).

  144. Poznik, G. D. et al. Sequencing Y chromosomes resolves discrepancy in time to common ancestor of males versus females. Science 341, 562–565 (2013).

  145. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

  146. Schatz, M. C. et al. Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space. Cell Genomics 2, 100085 (2022).

  147. Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).

  148. Talenti, A. & Prendergast, J. nf-LO: a scalable, containerized workflow for genome-to-genome lift over. Genome Biol. Evol. 13, evab183 (2021).

  149. Guarracino, A., Mwaniki, N., Marco-Sola, S. & Garrison, E. wfmash: whole-chromosome pairwise alignment using the hierarchical wavefront algorithm. GitHub https://github.com/ekg/wfmash (2021).

  150. Sherry, S. T., Ward, M. & Sirotkin, K. dbSNP—database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res. 9, 677–679 (1999).

  151. Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).

  152. Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).

  153. Van der Auwera, G. A. & O’Connor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (O’Reilly Media, 2020).

  154. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

  155. Ramírez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160–W165 (2016).

  156. Zhao, H. et al. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics 30, 1006–1007 (2014).

  157. Marçais, G. et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944 (2018).

  158. Ondov, B. D., Bergman, N. H. & Phillippy, A. M. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics 12, 385 (2011).

  159. Rhie, A. Repositories for the analysis of T2T-Y and T2T-CHM13v2.0. Zenodo https://doi.org/10.5281/zenodo.8136598 (2023).

  160. Falconer, E. et al. DNA template strand sequencing of single-cells maps genomic rearrangements at high resolution. Nat. Methods 9, 1107–1112 (2012).

Acknowledgements

We thank P. Hallast, M. C. Loftus, M. K. Konkel, P. Ebert, T. Marschall and C. Lee for coordination and discussions, J.C.-I. Lee for sharing the GRCh38-Y coordinates used in Y-Finder and members of the Telomere-to-Telomere consortium and HPRC for constructive feedback. This work utilized the computational resources of the National Institutes of Health (NIH) HPC Biowulf cluster (https://hpc.nih.gov). Computational resources were partially provided by the e-INFRA CZ project (no. 90140), supported by the Ministry of Education, Youth and Sports of the Czech Republic and Computational Biology Core, Institute for Systems Genomics, University of Connecticut. Certain commercial equipment, instruments and materials are identified to specify adequately experimental conditions or reported results. Such identification does not imply recommendation or endorsement by the NIST, nor does it imply that the equipment, instruments or materials identified are necessarily the best available for the purpose. We thank the Intramural Research Program of NHGRI, NIH no. HG200398 (A.R., S.N., S.K., M.R., A.M.M., B.P.W. and A.M.P.); NIH no. GM123312 (S.J.H., P.G.S.G., G.A.H. and R.J.O.); NIH no. GM130691 (P.M., M.H.W. and K.D.M.); HHMI Hanna Gray Fellowship (N.A.); NIH no. CA266339 (J.G. and T.P.); NIH no. GM147352 (G.A.L.); NIH nos. HG002939 and HG010136 (R.M.H. and J.M.S.); NIH no. HG009190 (P.W.H., A. Gershman and W.T.); NIH nos. HG010263, HG006620 and CA253481 and NSF no. DBI-1627442 (M.C.S.); NIH no. GM136684 (K.D.M.); NIH nos. HG011274 and HG010548 (K.H.M.); NIH nos. HG010961 and HG010040 (H.L.); NIH no. HG007234 (M.D.); NIH no. HG011758 (F.J.S.); NIH no. DA047638 (E.G.); NIH no. GM124827 (M.A.W.); NIH no. GM133747 (R.C.M.); NIH no. CA240199 (R.J.O.); NIH nos. HG002385, HG010169 and HG010971 (E.E.E.); Stowers Institute for Medical Research (J.L.G. and T.P.); National Center for Biotechnology Information of the National Library of Medicine, NIH (F.T.-N. and T.D.M.); intramural funding at NIST (J.M.Z.); NIST no. 70NANB20H206 (M.J.); and NIH nos. HG010972 and WT222155/Z/20/Z and the European Molecular Biology Laboratory (J.A., P.F., C.G.G., L.H., T.H., S.E.H., F.J.M. and L.S.). RNA generation was supported by NIST no. 70NANB21H101 and NIH no. 1S10OD028587; the Ministry of Science and Higher Education of the Russian Federation, St. Petersburg State University, no. PURE 73023672 (I.A.A.); the Computation, Bioinformatics, and Statistics Predoctoral Training Program awarded to Penn State by the NIH (A.C.W.); and Achievement Rewards for College Scientists Foundation, The Graduate College at Arizona State University (A.M.T.O.). E.E.E. is an investigator for HHMI.

Author information

Author notes

  1. Sergey Nurk

    Present address: Oxford Nanopore Technologies Inc., Oxford, UK

  2. Ivan A. Alexandrov

    Present address: Department of Anatomy and Anthropology and Department of Human Molecular Genetics and Biochemistry, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv-Yafo, Israel

  3. These authors contributed equally: Arang Rhie, Sergey Nurk, Monika Cechova, Savannah J. Hoyt, Dylan J. Taylor

Authors and Affiliations

  1. Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA

    Arang Rhie, Sergey Nurk, Sergey Koren, Mikko Rautiainen, Nancy F. Hansen, Ann M. Mc Cartney, Brian P. Walenz & Adam M. Phillippy

  2. Faculty of Informatics, Masaryk University, Brno, Czech Republic

    Monika Cechova

  3. Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA

    Monika Cechova, Julian K. Lucas, Brandy M. McNulty, Hugh E. Olsen & Karen H. Miga

  4. Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA

    Savannah J. Hoyt, Patrick G. S. Grady, Gabrielle A. Hartley & Rachel J. O’Neill

  5. Department of Biology, Johns Hopkins University, Baltimore, MD, USA

    Dylan J. Taylor, Rajiv C. McCoy, Michael E. G. Sauria & Michael C. Schatz

  6. Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA

    Nicolas Altemose

  7. Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA

    Paul W. Hook, Ariel Gershman, Jakob Heinz, Alaina Shumate & Winston Timp

  8. Federal Research Center of Biotechnology of the Russian Academy of Sciences, Moscow, Russia

    Ivan A. Alexandrov

  9. Center for Algorithmic Biotechnology, Saint Petersburg State University, St Petersburg, Russia

    Ivan A. Alexandrov & Alla Mikheenko

  10. European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK

    Jamie Allen, Paul Flicek, Carlos Garcia Giron, Leanne Haggerty, Thibaut Hourlier, Sarah E. Hunt, Fergal J. Martin & Likhitha Surapaneni

  11. UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA

    Mobin Asri, Mark Diekhans, Marina Haukness, Julian K. Lucas, Brandy M. McNulty, Hugh E. Olsen & Karen H. Miga

  12. Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, CA, USA

    Andrey V. Bzikadze

  13. Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA

    Nae-Chyun Chen, Samantha Zarate & Michael C. Schatz

  14. GeneDX Holdings Corp, Stamford, CT, USA

    Chen-Shan Chin

  15. Foundation of Biological Data Science, Belmont, CA, USA

    Chen-Shan Chin

  16. Department of Genetics, University of Cambridge, Cambridge, UK

    Paul Flicek

  17. The Rockefeller University, New York, NY, USA

    Giulio Formenti

  18. DNAnexus, Inc., Mountain View, CA, USA

    Arkarachai Fungtammasan

  19. Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA

    Erik Garrison & Andrea Guarracino

  20. Stowers Institute for Medical Research, Kansas City, MO, USA

    Jennifer L. Gerton & Tamara Potapova

  21. University of Kansas Medical Center, Kansas City, MO, USA

    Jennifer L. Gerton

  22. Genomics Research Centre, Human Technopole, Milan, Italy

    Andrea Guarracino

  23. Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany

    Reza Halabian & Wojciech Makalowski

  24. Cancer Genetics and Comparative Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA

    Nancy F. Hansen

  25. Department of Biology, Pennsylvania State University, University Park, PA, USA

    Robert Harris, Marta Tomaszkiewicz, Allison C. Watwood, Matthias H. Weissensteiner & Kateryna D. Makova

  26. Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA

    William T. Harvey, Alexandra P. Lewis, Glennis A. Logsdon, Katherine M. Munson, David Porubsky, Mitchell R. Vollger & Evan E. Eichler

  27. Institute for Systems Biology, Seattle, WA, USA

    Robert M. Hubley & Jessica M. Storer

  28. XDBio Program, Johns Hopkins University, Baltimore, MD, USA

    Stephen Hwang

  29. Department of Bioengineering, Department of Physics, Northeastern University, Boston, MA, USA

    Miten Jain

  30. Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA

    Rupesh K. Kesharwani, Luis F. Paulin, Fritz J. Sedlazeck & Yiming Zhu

  31. Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA

    Heng Li

  32. Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA

    Heng Li

  33. Genome Technology Access Center at the McDonnell Genome Institute, Washington University, St. Louis, MO, USA

    Christopher Markovic

  34. Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA

    Jennifer McDaniel, Nathan D. Olson & Justin M. Zook

  35. Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA

    Paul Medvedev

  36. Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA, USA

    Paul Medvedev

  37. Center for Computational Biology and Bioinformatics, Pennsylvania State University, University Park, PA, USA

    Paul Medvedev

  38. UCL Queen Square Institute of Neurology, UCL, London, UK

    Alla Mikheenko

  39. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA

    Terence D. Murphy & Françoise Thibaud-Nissen

  40. Masters Program in National Research University Higher School of Economics, Moscow, Russia

    Fedor Ryabov

  41. Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD, USA

    Steven L. Salzberg

  42. Department of Computer Science, Rice University, Houston, TX, USA

    Fritz J. Sedlazeck

  43. Google Inc., Mountain View, CA, USA

    Kishwar Shafin

  44. Institute of Molecular Genetics, Moscow, Russia

    Valery A. Shepelev

  45. Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA

    Angela M. Taravella Oill & Melissa A. Wilson

  46. Department of Biomedical Engineering, Pennsylvania State University, State College, PA, USA

    Marta Tomaszkiewicz

  47. Pacific Biosciences, Menlo Park, CA, USA

    Aaron M. Wenger

  48. Investigator, Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA

    Evan E. Eichler

  49. Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA

    Rachel J. O’Neill

  50. Department of Genetics and Genome Sciences, UConn Health, Farmington, CT, USA

    Rachel J. O’Neill

Contributions

V.A.S. is retired from the Institute of Molecular Genetics. Assembly was carried out by S.N., S.K. and M.R. Validation was performed by A.R., S.K., M.A., A.V.B., G.F., A.F., A.M.M., J.M., A.M., L.F.P., D.P., F.J.S., K.S., P.M., J.M.Z. and K.D.M. ChrY haplogroups were determined by A.R. and A.C.W. Alignment was done by C.-S.C., M.D., R. Harris, M.R.V. and K.D.M. Satellite annotation was performed by N.A., I.A.A., G.A.L., F.R., V.A.S. and K.H.M. N.A., J.G. and T.P. carried out FISH. Repeat annotation was done by S.J.H., P.G.S.G., G.A.H., R.M.H., J.M.S. and R.J.O. Retro-elements were dealt with by R. Halabian and W.M. Non-B DNA was dealt with by M.H.W. and K.D.M. Gene annotation was undertaken by A.R., M.D., P.F., C.G.G., L.H., M.H., J.H., T.H., F.J.M., T.D.M., S.L.S., A.S. and F.T.-N. A.R., R. Harris, W.T.H., P.M., M.T. and K.D.M. dealt with ampliconic genes. Structural annotation was performed by A.R., M.C., H.L., P.M. and K.D.M. Epigenetic analysis was performed by A.R., P.W.H., A. Gershman, W.T. and A.M.W. Mappability was performed by A.M.T.O., M.A.W. and J.M.Z. Variants and liftover were carried out by A.R., D.J.T., S.K., J.A., N.-C.C., M.D., E.G., A. Guarracino, N.F.H., W.T.H., S.E.H., S.H., R.C.M., N.D.O., M.E.G.S., L.S., M.R.V., S.Z., J.M.Z., E.E.E. and A.M.P. A.R., S.L.S., B.P.W. and A.M.P. dealt with contamination. Data generation was carried out by M.J., R.K.K., A.P.L., J.K.L., C.M., B.M.M., K.M.M., H.E.O., F.J.S. and Y.Z. Data management was undertaken by A.R., M.D., M.J. and J.K.L. Computational resources were sourced by R.J.O., M.C.S. and A.M.P. A.R., S.N., M.C., S.J.H., D.J.T., N.A., I.A.A., N.-C.C., E.G., J.G., P.G.S.G., A. Guarracino, R. Halabian, W.M., J.M., T.P., F.R., S.L.S., J.M.S., A.M.T.O., A.C.W., M.A.W., S.Z., J.M.Z., E.E.E., R.J.O., M.C.S., K.H.M., K.D.M. and A.M.P. wrote the manuscript draft. A.R. and A.M.P. edited the manuscript, with the assistance of all authors. J.M.Z., E.E.E., R.J.O., M.C.S., K.H.M., K.D.M. and A.M.P. supervised the research. Conceptualization was the responsibility of A.R., S.N., M.C., E.E.E., K.H.M., K.D.M. and A.M.P.

Corresponding author

Correspondence to Adam M. Phillippy.

Ethics declarations

Competing interests

S.N. is now an employee of ONT. S.K. has received travel funding for speaking at events hosted by ONT. A.F. is an employee of DNAnexus. C.-S.C. is an employee of GeneDX Holdings Corp. N.-C.C. is an employee of Exai Bio. L.F.P. receives research support from Genentech. F.J.S. receives research support from Pacific Biosciences, ONT, Illumina and Genentech. K.S. is an employee of Google LLC and owns Alphabet stock as part of the standard compensation package. W.T. has two patents (nos. 8,748,091 and 8,394,584) licensed to ONT. E.E.E. is a scientific advisory board member of Variant Bio, Inc. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature thanks John Lovell, Mikkel Heide Schierup and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Assembling the X and Y chromosomes of HG002.

a. Chromosome X and Y components of the assembly string graph built from HiFi reads, detected based on node sequence alignments to T2T-CHM13 and GRCh38 references. Each node is colored according to the excess of paternal-specific (blue) and maternal-specific (red) k-mers, obtained from parental Illumina reads, indicating whether they exclusively belong to chromosome Y or X, respectively. The most complicated tangles are localized within the heterochromatic satellite region on the Y q-arm. The X and Y subgraphs are connected in PAR1 and PAR2. Graph discontinuities are due to a lack of HiFi sequence coverage in these regions caused by contextual sequencing bias, with 9 out of 11 observed breaks falling within PAR1 on either chromosome (5 out of 5 for chromosome Y). Note that for visualization purposes the length of shorter nodes is artificially increased, making the extent of the tangles appear larger than it is. b. The effects of manual pruning and semi-automated ONT read integration are illustrated from top to bottom. Top, zoomed-in view of a tangle encoding the P1–P3 palindromic region in Y (approx. 22.86–27.08 Mb, see Fig. 4). Middle, corresponding subgraph following the manual pruning and recompaction. Nodes excluded from the curated “single-copy” list for automated ONT-based repeat resolution are shown in yellow. Three hairpin structures are highlighted, which form almost-perfect inverted tandem repeats encompassing the entire P3 and two P2 (red) palindromes. Node outlines in the palindromes are colored according to the palindromic arms as in Fig. 4. Bottom, corresponding subgraph following the repeat resolution using ONT read-to-graph alignments. Remaining ambiguities were resolved by evaluating ONT read alignments to all candidate reconstructions of the corresponding sub-regions. c. PAR1 subgraph labeled with HiFi read coverage on each node. Gaps (green edges) and uneven node coverage estimates indicate biases in HiFi sequencing across the region. Figure 1 shows an enrichment of SINE repeats and non-B DNA motifs in PAR1 that may underlie the sequencing gaps in this region.

Extended Data Fig. 2 Validation and polishing of the T2T-Y.

a. Evaluation and polishing workflow performed on T2T-CHM13v1.1 autosomes + HG002 XY assemblies. b. Venn diagram of the k-mers from the parents and child. On the left, hap-mers18 represent haplotype-specific k-mers inherited by the child. The darker outlined circle inside the child k-mers represents single-copy k-mers (k-mers occurring once in the assembly and single-copy in the child’s genome). Right figure shows an example of the paternal-specific, “single-copy” and “marker” k-mers. The marker set includes both multi-copy and single-copy k-mers specific to the paternal haplotype that were inherited by the child. Unlike polishing the nearly haploid CHM13 assembly17, both single-copy k-mers and marker k-mers were used for the marker-assisted alignments to HG002 XY. This helped align more reads within repetitive regions to the correct chromosome for evaluation during polishing. Right panel shows counts of the k-mers and coverage of HiFi and ONT reads using the marker-assisted Winnowmap2 alignment, in addition to alignments from VerityMap, which uses locally unique k-mers for anchoring the reads. c. Aggregated Strand-seq coverage profile across all 65 libraries on GRCh38-Y (top) and T2T-Y (bottom). Each bar represents read counts in every 20 kb bin supporting the reference in forward direction (light green) or reverse direction (dark green). Multiple spikes in reverse direction (black asterisks) in GRCh38-Y indicate inversion polymorphisms relative to HG002, likely due to differences between the haplogroups. Such spikes in coverage are not observed on T2T-X and T2T-Y, which confirms the structural and directional accuracy of the HG002 assemblies. A 3 kb inversion of the unique sequence between the P5 palindromic arms was identified as erroneous in T2T-Y (red asterisk), but was confirmed to be polymorphic in the population and left uncorrected in this version of the assembly.
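
The hap-mer idea in panel b can be illustrated with plain set operations. The following is a minimal sketch, not the pipeline used in the study: the function names `kmers` and `hapmers` are hypothetical, k-mers are taken from toy strings rather than counted from reads, and reverse-complement (canonical) k-mers and count thresholds are ignored.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hapmers(child: set[str], paternal: set[str], maternal: set[str]):
    """Split child k-mers into paternal- and maternal-specific sets.

    A k-mer is parent-specific if it occurs in one parent only, and it is
    kept only if the child also carries it (that is, it was inherited).
    """
    paternal_specific = (paternal - maternal) & child
    maternal_specific = (maternal - paternal) & child
    return paternal_specific, maternal_specific


# Toy example with k = 3 (illustrative sequences only).
pat = kmers("ACGTT", 3)
mat = kmers("ACGAA", 3)
kid = kmers("CGTTCGA", 3)
pat_hap, mat_hap = hapmers(kid, pat, mat)
```

Here "ACG" is shared by both parents and therefore excluded, while "GAA" is maternal-specific but absent from the child and therefore not a hap-mer.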

Extended Data Fig. 3 Large structural differences between T2T-Y and previous GRCh Y assemblies.

a,b. Ampliconic genes and X-degenerate sequences revealed from alignments between GRCh38-Y (Y-axis) and T2T-Y (X-axis). a. Dotplot generated using LastZ93 after softmasking with WindowMasker94. b. Identity was computed from matches and mismatches over positions with alignments, excluding gaps. c. Structural differences revealed using PRG-TK95 against GRCh38-Y and GRCh37-Y in the euchromatic region of the Y chromosome.
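
The gap-excluded identity described for panel b can be written as matches / (matches + mismatches). The sketch below assumes the alignment is available as two equal-length strings with "-" marking gaps; the function name `gap_excluded_identity` and this input format are illustrative assumptions, not the format produced by the alignment tools named above.

```python
def gap_excluded_identity(aligned_a: str, aligned_b: str) -> float:
    """Identity = matches / (matches + mismatches), skipping gap columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    mismatches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gap columns do not count towards the denominator
        if a == b:
            matches += 1
        else:
            mismatches += 1
    aligned = matches + mismatches
    return matches / aligned if aligned else 0.0


# Four matches, one mismatch and one gap column give 4 / 5.
identity = gap_excluded_identity("ACGT-A", "ACCTTA")
```

Because gaps are dropped from the denominator, a region with a large insertion but otherwise identical flanks still scores close to 1.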

Extended Data Fig. 4 Repeat discovery and annotation of T2T-Y.

a. Assembly completion allowed for a full assessment of repeats and resulted in the identification of previously unknown satellite arrays (predominantly in the PAR1) and subunit repeats that fall within one of three composite repeat units (TSPY, RBMY, DAZ). b. Ideogram of TE density (per 100 kb bin). This is an extension of Fig. 1 with non-SINEs expanded into separate TE classes (SVA, LTR, LINE, DNA/RC). Density scale ranges from low (white, zero) to high (black, relative to total density) and sequence classes are denoted by color. c. Summary (in terms of base coverage per region) across all five TE classes and two specific families: Alu/SINE and L1/LINE. The satellites in (b) were kept separate as two categories: Cen/Sat as the left satellite block including alpha satellites and DYZ19, while all other categories were combined per sequence class.

Extended Data Fig. 5 Non-B DNA motifs along the T2T-Y.

HSat3 on the Yq and satellite sequences around the centromere are more enriched with A-phased repeats, direct repeats and STRs, while HSat1B is more enriched with inverted repeats and mirror repeats. Enrichment of non-B DNA sequences was also observed in the PAR region. Notably, the TSPY gene array is enriched for G4 and Z-DNA motifs, as shown in Extended Data Fig. 6b.

Extended Data Fig. 6 Phylogenetic tree analysis of the ampliconic TSPY gene family and pattern of non-B DNA structure.

a. Phylogenetic tree analysis using protein-coding TSPYs from a Sumatran Orangutan (Pongo abelii) and a Silvery gibbon (Hylobates moloch) as outgroups confirmed TSPY2 (distal to the array) and TSPY copies within the array originated from the same branch, distinguished from the rest of the TSPY pseudogenes. Rectangular inset shows a cartoon representation of the simplified tree. Numbers next to the triangles indicate the number of TSPY genes in the same branch. b. G4 and Z-DNA structures predicted for a typical TSPY copy inside the TSPY array. All TSPY copies in the array have the same signature, with one G4 peak present ~500 bases upstream of the TSPY (arrow). Higher Quadron score122 (Q-score) indicates a more stable G4 structure, with scores over 19 considered stable (dotted line).

Extended Data Fig. 7 Recurrent inversions identified with Strand-seq.

a. Five out of 15 individuals have the inverted variant as present in HG002 at the P3 palindrome (white arrow). Although inversions across P1–P2 (yellow and red arrows) are difficult to confirm with Strand-seq because of the high sequence similarity between the palindromic arms, different orientations are observable in these samples. b. Strand states for 65 Strand-seq libraries of HG002. Depending on the mappings of directional Strand-seq reads (+ reads: ‘Crick’, C, – reads: ‘Watson’, W), the reference sequence was assigned one of three states: WC, WW, and CC. WC, roughly equal mixture of plus and minus reads; WW, all reads mapped in minus orientation; CC, all reads mapped in plus orientation. Changes in strand state along a single chromosome are normally caused by double-strand breaks (DSBs) that occur at random during DNA replication160; we refer to them as sister-chromatid exchanges (SCEs, yellow thunderbolts). Recurrent change in strand state over the same region in multiple Strand-seq cells indicates misassembly. Similarly, collapsed or incomplete assembly of a certain genomic region will result in a recurrent strand state change as observed for GRCh38-Y (black arrowheads). In contrast, T2T-Y shows strand state changes randomly distributed along each Strand-seq library with no evidence of misassembly or collapse. c. Strand-seq profile of selected libraries over T2T-Y summarized in bins (bin size: 500 kb, step size: 50 kb). Teal, Crick read counts; orange, Watson read counts. As ChrY is haploid, reads are expected to map only in Watson or Crick orientation. Light gray rectangles highlight regions where SCEs were detected in the heterochromatic Yq12 despite a lower coverage of Strand-seq reads. Modified breakpointR parameters were used (windowsize = 500000, minReads = 20) in order to refine detected SCEs presented in panels b and c.
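
The three strand states in panel b can be illustrated by classifying a region from its Crick and Watson read counts. This is a rough sketch only: the function name `strand_state` and the 0.8 purity cutoff are assumptions chosen for illustration and are not the rule implemented by breakpointR.

```python
from typing import Optional


def strand_state(crick_reads: int, watson_reads: int,
                 purity: float = 0.8) -> Optional[str]:
    """Assign WW, CC or WC from directional read counts.

    purity is a hypothetical cutoff: a region is called WW (or CC) when at
    least this fraction of reads is Watson (or Crick); anything in between
    is treated as the mixed WC state.
    """
    total = crick_reads + watson_reads
    if total == 0:
        return None  # no reads, no call
    watson_fraction = watson_reads / total
    if watson_fraction >= purity:
        return "WW"
    if watson_fraction <= 1 - purity:
        return "CC"
    return "WC"


states = [strand_state(2, 48), strand_state(50, 1), strand_state(25, 27)]
```

On a haploid chromosome such as Y, a cell is expected to yield WW or CC calls only, so a WC call or a recurrent switch at the same position across cells is the signal worth inspecting.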

Extended Data Fig. 8 Satellite annotation and recent expansion events in the Yq heterochromatin.

a. A plot showing the top repeat periodicities detected by NTRprism44 in 50 kb blocks tiled across T2T-Y, with centromeric satellite annotations overlaid on the X axis. Large arrays are labeled with their historic nomenclature1, HSat subfamilies61, and predominant repeat periodicities. b. An exact 2000-mer match dotplot of the Yq region (a dot is plotted when an identical 2000 base sequence is found at positions X and Y). The lower triangle has DYZ1/DYZ2 annotations overlaid as yellow and blue bars, respectively. Circled patterns in the upper triangle correspond to recent iterative duplication events, which are illustrated below the X axis. c. A reconstruction of a possible sequence of recent iterative duplications that could explain the observed dotplot patterns. d. A 2000-mer dotplot comparison of two ~800 kb HSat1B sub-arrays that were part of a recent large duplication event, along with self-self comparisons of the same arrays, revealing sites of more recent and smaller-scale deletions and expansions (annotated in yellow and red, with a possible sequence of events illustrated by the schematic on the right).
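
An exact k-mer match dotplot of the kind shown in panels b and d can be sketched by indexing every k-mer of one sequence and looking up every k-mer of the other. The code below is an illustrative assumption about how such a plot could be computed, not the authors' implementation; the function name `exact_kmer_dotplot` is hypothetical, and the figure uses k = 2000 whereas the toy call uses k = 4.

```python
from collections import defaultdict


def exact_kmer_dotplot(seq_x: str, seq_y: str, k: int) -> list[tuple[int, int]]:
    """Return (x, y) start positions where seq_x and seq_y share an identical k-mer."""
    # Index every k-mer start position in seq_y.
    index = defaultdict(list)
    for j in range(len(seq_y) - k + 1):
        index[seq_y[j:j + k]].append(j)
    # Emit one dot per exact match of a seq_x k-mer against the index.
    dots = []
    for i in range(len(seq_x) - k + 1):
        for j in index.get(seq_x[i:i + k], ()):
            dots.append((i, j))
    return dots


# Self-comparison of a short tandem duplication.
seq = "ACGTACGT"
dots = exact_kmer_dotplot(seq, seq, 4)
```

In a self-comparison the main diagonal is always present; the off-diagonal dots, here (0, 4) and (4, 0), mark the duplicated unit.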

Extended Data Fig. 9 Genomic similarity in PARs and XTR and improved MAPQ of the PARs through informed sex chromosome complement reference.

a. Dotplots from LASTZ alignments of the CHM13-X, HG002-X, and HG002-Y (T2T-Y) over 96% sequence identity. Dashed gray lines represent the start and end of the approximate PARs or XTR boundaries. Disconnected diagonal lines indicate the presence of genomic diversity between each paired region. More genomic differences are observed in the PAR1 between the HG002-Y and CHM13-X. b,c. Average mapping quality (MAPQ) across GRCh38-X from simulated reads of an XX (b) and XY (c) sample. Top, a default version of GRCh38 (with two copies of identical PARs on XY). Middle, a version of GRCh38 informed on the sex chromosome complement (SCC) of the sample (entire Y hard-masked for the XX sample vs. only PARs on the Y hard-masked for the XY sample). Bottom, the difference in average MAPQ between the SCC and default approaches. MAPQ was averaged in 50 kb windows, sliding 10 kb across the chromosome. A positive value means the MAPQ score is higher with the SCC reference alignment than with the default alignment.
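
The windowed averaging in panels b and c (50 kb windows sliding by 10 kb) can be sketched as follows. The input format, a list of (position, MAPQ) pairs, and the function name `sliding_mean_mapq` are assumptions made for illustration; windows at the chromosome end are simply truncated here, which may differ from the original analysis.

```python
from bisect import bisect_left


def sliding_mean_mapq(reads, chrom_length, window=50_000, step=10_000):
    """Average MAPQ of reads starting in each half-open window [start, end)."""
    reads = sorted(reads)
    positions = [pos for pos, _ in reads]
    # Prefix sums give the MAPQ total of any window in O(1).
    prefix = [0]
    for _, mapq in reads:
        prefix.append(prefix[-1] + mapq)
    result = []
    start = 0
    while start < chrom_length:
        end = min(start + window, chrom_length)
        lo = bisect_left(positions, start)
        hi = bisect_left(positions, end)
        count = hi - lo
        mean = (prefix[hi] - prefix[lo]) / count if count else None
        result.append((start, end, mean))
        start += step
    return result


toy_reads = [(5_000, 60), (15_000, 0), (55_000, 30)]
windows = sliding_mean_mapq(toy_reads, chrom_length=60_000)
```

The bottom track of each panel would then be the per-window difference between two such profiles, one from the SCC-informed reference and one from the default reference.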

Extended Data Fig. 10 Number of variants called from 1KGP and SGDP individuals.

a. More variants are called on the X-PARs when using the sex chromosome complement reference approach (calling variants in diploid mode on PARs) than the non-masked approach (calling variants in haploid mode on PARs). The 1KGP results for GRCh38-Y are from Aganezov et al.66, which was performed on CHM13v1.0+GRCh38-Y. b. Number of variants called from each 1KGP XY sample on chromosome GRCh38-Y and T2T-Y. c. Number of variants called in the syntenic region between the two Ys. A large number of additional variants are called on each sample, attributed to the newly added, non-syntenic sequences on T2T-Y. Within the syntenic regions, a reduction in the number of variants is observed for each population except for samples from R1 haplogroups as shown in Fig. 6c. d. Aggregated total number of variants for the 279 SGDP samples per chromosome. e. SGDP genome-wide counts of variants per-sample (n = 279) demonstrate increased variation in African samples regardless of reference. Each bar in the box plot represents the 1st, 2nd (median), and 3rd quartile of the number of variants in each population. Whiskers are bound to the 1.5 × interquartile range. Data outside of the whisker ranges are shown as dots. For the SGDP samples, variants were called using T2T-CHM13+Y or GRCh38 as the reference. All variants shown in this figure were filtered for “high quality (PASS)”.
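
The box plot convention described for panel e (quartiles, whiskers bounded by 1.5 × interquartile range, remaining points drawn as dots) can be made concrete with a short sketch. The function name `box_plot_stats`, the toy counts and the choice of the "inclusive" quartile method are assumptions; plotting libraries differ slightly in how they interpolate quartiles.

```python
import statistics


def box_plot_stats(values):
    """Quartiles, whisker ends and outliers for a Tukey-style box plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers reach the most extreme data points within the fences.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Points beyond the fences are the dots in the plot.
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Toy per-sample variant counts (illustrative values only).
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

With these toy values the interquartile range is 4.5, so the upper fence sits at 14.5 and the single value of 100 is reported as an outlier rather than stretching the whisker.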

Extended Data Fig. 11 Human contaminants in bacterial reference genomes.

a. Number of distinct RefSeq accessions in every 10 kb window containing 64-mers of GRCh38-Y (top), T2T-Y (middle), and in T2T-Y only (bottom). Here, RefSeq sequences with more than 20 64-mers or matching over 10% of the Y chromosome are included. b. Length distribution of the sequences from (a) in log scale. The majority of the shorter (<1 kb) sequences contain 64-mers found in HSat1B or HSat3. c. Number of bacterial RefSeq entries by strain identified to contain sequences of T2T-Y and not GRCh38-Y, visualized with Krona158.

About this article

Cite this article

Rhie, A., Nurk, S., Cechova, M. et al. The complete sequence of a human Y chromosome. Nature (2023). https://doi.org/10.1038/s41586-023-06457-y

  • DOI: https://doi.org/10.1038/s41586-023-06457-y