Accurately mapping genetic variation between people is crucial to uncovering the causes of rare diseases and the increased susceptibility to a range of conditions within population groups. But until last summer, a surprisingly large proportion of the human genome remained uncharted. On March 31, 2022, the researchers behind the Telomere-to-Telomere Consortium published the first complete, gapless sequence of a human genome, replacing the reference genome that was first drafted in the original Human Genome Project in 2000 and was last updated in 2013. In a series of papers published in Science, consortium researchers filled in the missing 8 percent of the genome (200 million base pairs out of a total of 3 billion-plus pairs, including a reported 2,099 newly uncovered genes) thanks to long-read sequencing technologies developed by two companies, PacBio and Oxford Nanopore. (An NIH infographic summarizes the work.)
Short vs. long reads
The standard next-generation genome sequencing used today, including the vast majority of research and clinical sequencing at UAB, is done on machines that work via short reads. The preparation process chops up the DNA in a sample into strands roughly 150 base pairs long, or less, and the sequencing machine reads the base pairs found on each strand. Software then assembles these tiny chunks into a complete picture. This works fine for much of the genome. But for regions with long stretches of repeated bases — say, the sequence GAGAGA repeated a few thousand times — or small insertions or deletions, it is difficult or impossible to determine the proper order from short reads alone. Hence the missing 200 million base pairs that the Telomere-to-Telomere Consortium filled in.
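The repeat problem described above can be illustrated with a toy sketch. It assumes error-free reads and complete coverage, and it uses two made-up sequences (the flanks and the copy numbers are hypothetical) that differ only in how many times the unit GA is repeated. Real assemblers also use read depth and paired-end information, so this is a simplification rather than a model of any actual pipeline.

```python
def read_set(genome, read_len):
    """Every distinct read of length read_len (error-free, full coverage)."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}

# Hypothetical unique flanking sequence around a simple GA repeat.
LEFT, RIGHT = "ACGTTGCA", "TTCAGGCT"
genome_a = LEFT + "GA" * 10 + RIGHT   # repeat unit present 10 times
genome_b = LEFT + "GA" * 14 + RIGHT   # same flanks, 14 copies

short_len = 6    # shorter than the repeat tract
long_len = 32    # long enough to span the shorter tract plus part of both flanks

# Short reads: both genomes produce exactly the same set of reads,
# so the copy number cannot be recovered from the read sequences alone.
print(read_set(genome_a, short_len) == read_set(genome_b, short_len))  # True

# Long reads: some reads anchor in both flanks, so the two genomes differ.
print(read_set(genome_a, long_len) == read_set(genome_b, long_len))    # False
```

The point of the sketch is only that a read must reach unique sequence on both sides of a repeat before the repeat's length can be pinned down.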
With long-read sequencing, read lengths can average 10,000 to 30,000 base pairs, with a maximum length over 1 megabase (1 million base pairs) on an Oxford Nanopore device, says Zechen Chong, Ph.D., assistant professor in the Department of Genetics at the Heersink School of Medicine. Chong’s lab is focused on developing new algorithms for the analysis of long-read sequencing data, and he is collaborating with several UAB researchers using long-read sequencing.
Interested in long-read sequencing?
UAB researchers interested in learning more about long-read sequencing or accessing HudsonAlpha’s sequencing capabilities should contact the CCTS Research Commons.
Advantages and disadvantages of long reads
Long reads can generate more accurate assemblies than short-read technologies, especially when there is no reference genome to check against (de novo assembly) or in repetitive sections of the genome and regions with complex genetic rearrangements, Chong says. He points to research from the Human Genome Structural Variation Consortium that found a sevenfold increase in structural variations — the greatest source of diversity between human genomes — when using long reads compared to traditional short sequencing reads. Those investigators found more than 800,000 insertion or deletion variants smaller than 50 base pairs in each genome sequenced, and nearly 32,000 structural variants 50 base pairs or larger. “The incomplete identification of structural variants from whole-genome sequencing data limits studies of human genetic diversity and disease association,” the authors noted in Nature.
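The 50-base-pair cutoff quoted above can be written down as a small classifier. This is a minimal sketch that assumes variants are given as reference and alternate allele strings (a VCF-like convention that the article does not specify). The function name and labels are hypothetical, and the sketch only handles insertions and deletions, not duplications or other rearrangements.

```python
SV_MIN_BP = 50  # size cutoff quoted in the text

def classify_variant(ref_allele, alt_allele):
    """Label a variant by the length difference between its two alleles."""
    size = abs(len(alt_allele) - len(ref_allele))
    if size == 0:
        return "substitution"
    kind = "insertion" if len(alt_allele) > len(ref_allele) else "deletion"
    if size >= SV_MIN_BP:
        return "structural " + kind
    return "small " + kind

print(classify_variant("A", "ATTT"))            # small insertion
print(classify_variant("A" + "G" * 60, "A"))    # structural deletion
```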
The downside of long-read sequencing is higher error rates (that is, the device misidentifies bases more often as it reads) and a lack of effective tools for accurately evaluating assembly results, Chong says. This is why his lab’s Inspector software for evaluating long-read de novo genome assemblies, described in a November 2021 article in the journal Genome Biology, is generating attention in the field and was celebrated as one of Heersink’s Featured Discoveries in February. “Inspector largely reduces assembly errors and therefore improves the assembly quality,” Chong said. (More on Inspector below.)
“Zechen’s work is an important step forward,” said Robert Kimberly, M.D., senior associate dean for Clinical and Translational Research in the UAB School of Medicine and director of the UAB Center for Clinical and Translational Science.
“More real estate on the chromosome”
Kimberly’s lab, which has a major focus on lupus research, is working with Chong and with HudsonAlpha Institute for Biotechnology in Huntsville, Alabama, to use long-read sequencing to study structural variations in patients with lupus. This is the same category of work that has been done using genome-wide association studies for more than a decade, Kimberly points out. “The difference is those studies are focused — necessarily, because of the nature of the technology — on changes in individual bases of nucleic acids,” he said. “Long-read sequencing gives you a much better understanding of structural variations — insertions, deletions and duplications of genetic material on a given chromosome. The larger read length gives you more real estate on the chromosome. Structural variation in relationship to disease phenotypes is a major area for discovery.”
Another area where long-read sequencing shines is the “robust construction of haplotypes, the gene variants that are on the same strand of DNA together,” Kimberly said. We all have two copies of each chromosome, one inherited from our mothers and the other from our fathers. “With short reads, it is hard to determine what variant goes with what variant on the same strand, because both strands are sequenced in a mix at the same time,” Kimberly said.
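A toy sketch can make the haplotype point concrete. It assumes two made-up chromosome copies that differ at two sites 25 bases apart, error-free reads and complete coverage. All names, positions and bases here are hypothetical. A read can only link the two variants if it covers both sites.

```python
def make_haplotype(length, alleles):
    """Build a toy chromosome copy: all T except at the given variant sites."""
    seq = ["T"] * length
    for pos, base in alleles.items():
        seq[pos] = base
    return "".join(seq)

def linked_alleles(haplotypes, site_a, site_b, read_len):
    """Allele pairs observed together on a single read covering both sites."""
    pairs = set()
    for hap in haplotypes:
        for start in range(len(hap) - read_len + 1):
            if start <= site_a and site_b < start + read_len:
                pairs.add((hap[site_a], hap[site_b]))
    return pairs

SITE_A, SITE_B = 5, 30
maternal = make_haplotype(40, {SITE_A: "A", SITE_B: "G"})
paternal = make_haplotype(40, {SITE_A: "C", SITE_B: "A"})

# Short reads never cover both sites, so no pairing information is recovered.
print(sorted(linked_alleles([maternal, paternal], SITE_A, SITE_B, 10)))  # []

# Long reads span both sites and show which alleles travel together.
print(sorted(linked_alleles([maternal, paternal], SITE_A, SITE_B, 30)))
# [('A', 'G'), ('C', 'A')]
```

With the short reads, the A/C alleles at the first site and the G/A alleles at the second are all observed, but nothing in the data says which combination sits on which copy.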
In a paper published last year, researchers from HudsonAlpha and Martina Bebin, M.D., professor in the UAB Department of Neurology, applied long-read sequencing to six patients with neurodevelopmental disorders that could not be explained by short-read genome sequencing. (They also sequenced the patients’ parents.) In two of the six cases, the researchers found variants of clinical and research interest, which “support[s] the hypothesis that long-read genome sequencing can substantially improve rare disease genetic discovery rates,” the authors wrote.
Getting started with long reads
The two major players in long-read sequencing — PacBio and Oxford Nanopore — use different techniques with accompanying pluses and minuses. Roughly speaking, PacBio’s method produces shorter but more accurate reads than Oxford Nanopore’s; the costs of equipment and supplies are higher with PacBio as well. Today, PacBio’s circular consensus sequencing, or CCS, is the “gold standard” and the technology used at HudsonAlpha, a major long-read sequencing center, Kimberly said. “I think that long-read sequencing with PacBio is a technology on the cusp of wide adoption for many applications,” Kimberly said. “Long-read sequencing is not necessary for all sequencing applications, so there will always be a role for short reads; but I think we’ll see the cutting edge move increasingly to PacBio CCS.”
There is also a place for the cheaper, longer but less accurate reads possible with Oxford Nanopore technology, Kimberly and Chong say. (A “starter pack” including the company’s MinION sequencing device starts at $1,000.) Kimberly is using a MinION device in his lab. Brittany Lasseigne, Ph.D., an associate professor in the Department of Cell, Developmental and Integrative Biology, recently acquired an Oxford Nanopore GridION and is using it for several projects, including the role of transcriptional diversity in disease and single-cell applications, she says.
Chong’s lab is working on bacterial genome de novo assembly using MinION with Li Xiao, Ph.D. (director of the UAB Diagnostic Mycoplasma Lab), Min Gao, Ph.D. (assistant professor in the Department of Medicine and associate scientist in the Informatics Institute), Kevin Dybvig, Ph.D. (research professor in the Department of Pediatrics), Prescott Atkinson, M.D., Ph.D. (director of the Division of Pediatric Allergy and Immunology) and Ken Waites, M.D. (professor in the Department of Laboratory Medicine).
Analyzing long reads with Inspector
Generating long reads is one thing. Analyzing them is another problem entirely, and one where Chong’s Inspector tool shines. “A major challenge for reference-based analysis is distinguishing true variations from assembly errors,” Chong explained to the Heersink communications team. “Inspector is the first tool to facilitate the discovery of long-read assembly errors, including both small- and large-scale errors. Accurate assembly results are the basis for variant discovery, genome annotation and subsequent functional discoveries.”
The software “improves whole-genome assembly by identifying and correcting assembly errors” and is not affected by genetic variants, Chong said. In evaluations reported in the Genome Biology paper, Inspector outperformed two other long-read assembly evaluators on a simulated genome task. Chong and his team have uploaded the source code for Inspector to GitHub to allow open access, and they have “addressed dozens of questions regarding usage of Inspector from users through GitHub and email,” Chong said.
Maggi Chen, a graduate student in Chong’s lab, was the first author of the paper. The interdisciplinary team for the project also included Yixin Zhang, a master’s student in the Department of Computer Science, and Associate Professor Amy Wang, M.D., and Assistant Professor Min Gao, Ph.D., both of whom are faculty in the Department of Medicine and UAB Informatics Institute. The work was supported by grants from the National Institute of General Medical Sciences, the BioData Translational Science grant from the National Heart, Lung and Blood Institute, and the Center for Clinical and Translational Science grant from the National Center for Advancing Translational Sciences.