Each person has about 4 million sequence differences in their genome relative to the reference human genome. These differences are known as variants. A central goal in precision medicine is understanding which of these variants contribute to disease in a particular patient. Much of the human genome annotation effort is therefore devoted to developing resources that help interpret how much each variant contributes to different observable phenotypes, that is, to determining variant impact.
Recently, Yale School of Medicine led a large NIH-sponsored study where multiple institutions and international collaborators came together to address this challenge. This study generated a large, organized dataset from four individual donors using high-quality genome sequencing to identify all the variants and many different assays to determine their effect on molecular phenotypes in 25 different tissues. Known as EN-TEx, the resource is an important step toward the future of personalized care. The team published its findings in Cell on March 30.
“Our work helps provide a better annotation of the genome and a better understanding of variant impact,” says Mark Gerstein, PhD, Albert Williams Professor of Biomedical Informatics and member of the new Yale Section of Biomedical Informatics & Data Science. He is also affiliated with molecular biophysics & biochemistry, computer science, and statistics & data science at Yale. “An average person’s personal genome has variants in 4 million places. We’re trying to figure out which of these lead to meaningful differences.”
“This work represents the type of innovative large-scale data mining and teamwork that Yale is well-poised to create, coordinate, or participate in,” says Lucila Ohno-Machado, MD, MBA, PhD, Waldemar von Zedtwitz Professor of Medicine and of Biomedical Informatics & Data Science, and chair of the new section. “As our new academic unit grows, we expect to see more and more of this type of exemplary biomedical data science work originate from here.”
In their latest project, the team used long-read sequencing technologies to determine the diploid genomes of four donors with high accuracy. Everyone has a diploid genome: we carry two copies of each of the 22 non-sex chromosomes, plus a pair of sex chromosomes, with one set inherited from our mother and one from our father. “Now, for each position on the genome, we can look for the differences between mom and dad in many different functional assays in a perfectly balanced way, allowing us to accurately ascertain variant effect in many tissues,” says Gerstein.
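To make the idea of comparing the maternal and paternal copies concrete, here is a minimal sketch in Python. It assumes that, at a site where the two copies differ, an assay yields a count of reads from each copy, and it asks whether the split departs from the 50/50 balance expected when the variant has no effect. The exact binomial test, the function names, and the read counts are illustrative assumptions; the article does not say which statistical test the team used.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than
    the observed count k, under a null success probability p.
    """
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small relative tolerance so floating-point ties are counted.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)


def allelic_imbalance(maternal_reads: int, paternal_reads: int):
    """Return (maternal fraction, p-value) for one heterozygous site."""
    n = maternal_reads + paternal_reads
    fraction = maternal_reads / n if n else float("nan")
    return fraction, binomial_two_sided_p(maternal_reads, n)


if __name__ == "__main__":
    # Hypothetical read counts (maternal, paternal), for illustration only.
    sites = {"site_A": (52, 48), "site_B": (85, 15)}
    for name, (mat, pat) in sites.items():
        fraction, p_value = allelic_imbalance(mat, pat)
        print(f"{name}: maternal fraction={fraction:.2f}, p={p_value:.3g}")
```

In this toy example, a site with a near-even split gives a large p-value, while a site where one parental copy dominates gives a very small one, flagging the variant as a candidate for having a functional effect. The sketch is meant for modest read depths; a real analysis would also need to correct for testing many sites and for extra variability in read counts.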
The team developed a variety of statistical and deep learning approaches to leverage the dataset for practical applications. In particular, they built statistical models that identify subsets of regulatory regions in the human genome that are highly associated with disease variants. They also found many new linkages between variants and changes in nearby gene expression, connecting impactful but uncharacterized variants to genes with known function. These linkages considerably expand previously determined catalogs, especially in many hard-to-assay tissues.
More fundamentally, the team developed a deep learning model that was able to predict whether a variant would disrupt a binding site for a regulatory factor—a protein that binds to specific sequences in the genome to turn nearby genes on or off. Interestingly, they found that to accurately predict this, they needed to look beyond just the binding site itself and consider a large genomic region around the site. The key to whether a binding site would be impacted was the presence of nearby binding sequences for other regulatory factors. “Think of regulatory factors as the legs of the Lunar Module,” says Gerstein. “If it has four legs and one leg doesn’t work, the three other legs can anchor the defective leg.” Similarly, the anchoring of other regulatory factors might stabilize the disrupted binding site and make it less sensitive to variants.
One limitation of the resource is that only four people of European descent are profiled. The team would like to eventually enlarge their study to encompass hundreds of individuals with more diverse ancestries.
Overall, these advances will allow researchers and clinicians to better interpret potential disease-causing variants in an individual, connecting them to regulatory sites, nearby genes, and their tissue of action. “We’ve provided a consistent, beautiful data set and annotation resource for making these interpretations,” says Gerstein.
The global team was assembled by the National Human Genome Research Institute (NHGRI) within NIH, as part of NHGRI's ENCODE consortium, which aims to functionally annotate the genome. The team included collaborators from Baylor College of Medicine; the Broad Institute of MIT and Harvard; California Institute of Technology; the Centre for Genomic Regulation; Cold Spring Harbor Laboratory; the Dana-Farber Cancer Institute; the European Bioinformatics Institute; HudsonAlpha Institute for Biotechnology; Johns Hopkins University; New York Institute of Technology; Stanford University; University of California, Irvine; University of California, San Diego; University of Hong Kong; University of Massachusetts Medical School; University of Toronto (Canada); and University of Washington, Seattle.