1st draft of a human 'pangenome' published, adding millions of 'building blocks' to the human reference genome

An illustration of the globe with ribbons of bright color wrapped around it, representing the newly drafted human pangenome. A new human reference "pangenome" includes DNA data from 47 people.

(Image credit: Darryl Leja, NHGRI)

Scientists have published the first human "pangenome" — a full genetic sequence that incorporates genomes from not just one individual, but 47.

These 47 individuals hail from around the globe and thus vastly increase the diversity of the genomes represented in the sequence, compared to the previous full human genome sequence that scientists use as their reference for study. The first human genome sequence was released with some gaps in 2003 and only made "gapless" in 2022. If that first human genome is a simple linear string of genetic code, the new pangenome is a series of branching paths.

Depiction of the old human reference genome, mostly based on one person's DNA, alongside the new pangenome, based on 47 people's DNA. The newly drafted human pangenome is a collection of different genomes against which to compare an individual genome sequence. Like a map of a subway system, the pangenome graph has many possible routes for a sequence to take, represented by the different colors.

The detouring paths at the top of the image represent single nucleotide variants (SNVs), which are single-letter differences. The yellow path that loops around itself and repeats the same nucleotides represents a duplication variant. The pink path that loops counterclockwise and follows the nucleotide sequence backwards represents an inversion variant. At the bottom, the green and dark blue paths skip the C nucleotide in their routes and represent a deletion variant. The light blue path, which has extra nucleotides in its route, represents an insertion variant.
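To make the subway-map idea concrete, here is a minimal sketch of a variation graph in Python. It is an illustration only, not the data structure or software used in the study: the node numbers, the made-up sequences and the `spell` helper are all assumptions chosen to mirror the variant types in the figure. Each genome is a route through shared segments, and each variant type is a different kind of detour.

```python
# Toy variation graph: nodes hold short DNA segments, and each genome is a
# path (an ordered walk) through the nodes. All sequences here are invented.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

nodes = {
    1: "GAT",
    2: "T",    # reference letter at a single-letter site
    3: "G",    # alternative letter at the same site (SNV)
    4: "ACA",
    5: "C",    # segment that the deletion path skips
    6: "TTG",
    7: "AA",   # extra segment visited only by the insertion path
}


def reverse_complement(seq):
    """Read a segment backwards on the opposite DNA strand."""
    return seq.translate(COMPLEMENT)[::-1]


def spell(path):
    """Spell the sequence of a path given as (node_id, orientation) steps."""
    parts = []
    for node_id, orientation in path:
        seq = nodes[node_id]
        parts.append(seq if orientation == "+" else reverse_complement(seq))
    return "".join(parts)


paths = {
    "reference":   [(1, "+"), (2, "+"), (4, "+"), (5, "+"), (6, "+")],
    "snv":         [(1, "+"), (3, "+"), (4, "+"), (5, "+"), (6, "+")],
    "duplication": [(1, "+"), (2, "+"), (4, "+"), (4, "+"), (5, "+"), (6, "+")],
    "inversion":   [(1, "+"), (2, "+"), (4, "-"), (5, "+"), (6, "+")],
    "deletion":    [(1, "+"), (2, "+"), (4, "+"), (6, "+")],
    "insertion":   [(1, "+"), (2, "+"), (4, "+"), (5, "+"), (7, "+"), (6, "+")],
}

if __name__ == "__main__":
    for name, path in paths.items():
        print(f"{name:12s} {spell(path)}")
```

In this sketch the duplication path visits node 4 twice, the inversion path walks it in the reverse orientation, and the deletion path leaves out node 5, much as the colored routes do in the figure.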

Geneticists use the reference genome as a guide when sequencing pieces of people's genetic codes, Arya Massarat, a doctoral student in Melissa Gymrek's lab who co-authored an editorial about the new research with Gymrek in the journal Nature, told Live Science. They match the newly decoded DNA snippets to the reference to figure out how they fit within the genome as a whole. They also use the reference genome as a standard to pinpoint genetic variations (different versions of genes that diverge from the reference) that might be linked with health conditions.

The first pangenome draft now doubles the number of large genome variants, known as structural variants, that scientists can detect, bringing the total to 18,000. These are places in the genome where large chunks have been deleted, inserted or rearranged. The new draft also adds 119 million new base pairs, meaning the paired "letters" that make up the DNA sequence, and 1,115 new gene duplication mutations to the previous version of the human genome.

For example, the Lipoprotein A gene is known to be one of the biggest risk factors for coronary heart disease in African Americans, but the specific genetic changes involved are complex and poorly understood, study co-author Evan Eichler, a genomics researcher at the University of Washington in Seattle, told reporters. With the pangenome, researchers can now more thoroughly compare variation in the gene between people with and without heart disease, and this could help clarify individuals' risk of heart disease based on which variants of the gene they carry.

The new study also used advanced sequencing technology called "long-read sequencing," as opposed to the short-read sequencing that came before. Short-read sequencing is what happens when you send your DNA to a company like 23andMe, Eichler said. Researchers read out small segments of DNA and then stitch them together into a whole. This kind of sequencing can capture a decent amount of genetic variation, but the DNA fragments may overlap poorly. Long-read sequencing, on the other hand, captures big segments of DNA all at once.
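The stitching step can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method used in the study or by any sequencing company: the `overlap` and `stitch` functions, the `min_len` threshold and the example reads are all hypothetical. The idea is to repeatedly join the two fragments whose ends overlap the most, and to stop when the remaining fragments share too little sequence to be joined.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the shared stretch is shorter than `min_len` letters.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def stitch(reads, min_len=3):
    """Greedily merge reads by their largest end-to-start overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # poor overlap: the remaining fragments cannot be joined
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads


if __name__ == "__main__":
    print(stitch(["GATTAC", "TACACT", "ACTTGA"]))
    print(stitch(["AAAA", "CCCC"]))
```

In the first call the three short reads chain together into one sequence. In the second, the reads share no letters at their ends, so they stay as separate pieces, which is the gap a single long read would avoid by spanning the whole stretch.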

The researchers are working to recruit new participants to continue to fill in diversity gaps in the pangenome, study co-author Eimear Kenny, a professor of medicine and genetics at the Institute for Genomic Health at Icahn School of Medicine at Mount Sinai in New York City, told reporters. Because genetic information is sensitive and because different rules govern data-sharing and privacy in different countries, this is delicate work. Issues include privacy, informed consent, and the possibility of discrimination based on genetic information, Kenny said.

Already, researchers are uncovering new genetic processes with the draft pangenome. In two papers published in Nature alongside the work, researchers looked at highly repetitive segments of the genome. These segments have traditionally been difficult to study, biochemist Brian McStay of the National University of Ireland Galway told Live Science, because sequencing them via short-read technology makes it hard to understand how they fit together. The long-read technology allows for long chunks of these repetitive sequences to be read at once.

The studies found that in one type of repetitive sequence, known as segmental duplications, there is a larger than expected amount of variation, potentially a mechanism for the long-term evolution of new functions for genes. In another type of repetitive sequence that is responsible for building the cellular machines that create new proteins, though, the genome stays remarkably stable. The pangenome allowed researchers to discover a potential mechanism for how these key segments of DNA stay consistent over time.

Stephanie Pappas is a contributing writer for Live Science, covering topics ranging from geoscience to archaeology to the human brain and behavior. She was previously a senior writer for Live Science but is now a freelancer based in Denver, Colorado, and regularly contributes to Scientific American and The Monitor, the monthly magazine of the American Psychological Association. Stephanie received a bachelor's degree in psychology from the University of South Carolina and a graduate certificate in science communication from the University of California, Santa Cruz.
