fine structure of gene assignment

January 1, 1962

The Fine Structure of the Gene

The question “What is a gene?” has bothered geneticists for fifty years. Recent work with a small bacterial virus has shown how to split the gene and make detailed maps of its internal structure.

By Seymour Benzer

Unit 6: Gene expression and regulation

About this unit.

DNA helps make us who we are, but how exactly does it work? In this unit, we'll examine the nitty gritty of replication, transcription, and translation, and learn how seemingly small mutations can have a big impact on our lives.

DNA and RNA structure

Introduction to nucleic acids and nucleotides
DNA
Molecular structure of DNA
Molecular structure of RNA
Nucleic acids
Prokaryote structure

Replication

Antiparallel structure of DNA strands
Leading and lagging strands in DNA replication
Speed and precision of DNA replication
Semi-conservative replication
Molecular mechanism of DNA replication
DNA structure and replication review

Transcription and RNA processing

Transcription and mRNA processing
Post-transcriptional regulation
Eukaryotic gene transcription: Going from DNA to mRNA
Overview of transcription
Eukaryotic pre-mRNA processing

Translation

Translation (mRNA to protein)
Overview of translation
Retroviruses
Differences in translation between prokaryotes and eukaryotes
DNA replication and RNA transcription and translation
Intro to gene expression (central dogma)
The genetic code

Regulation of gene expression and cell specialization

DNA and chromatin regulation
Regulation of transcription
Cellular specialization (differentiation)
Non-coding RNA (ncRNA)
Operons and gene regulation in bacteria
Overview: Gene regulation in bacteria
Lac operon
The lac operon
Trp operon
The trp operon
Overview: Eukaryotic gene regulation
Transcription factors

Mutations

An introduction to genetic mutations
Mutagens and carcinogens
The effects of mutations
Impact of mutations on translation into amino acids
Mutation as a source of variation
Aneuploidy & chromosomal rearrangements
Genetic variation in prokaryotes
Evolution of viruses

Biotechnology

Introduction to genetic engineering
Intro to biotechnology
DNA cloning and recombinant DNA
Overview: DNA cloning
Polymerase chain reaction (PCR)
Gel electrophoresis
DNA sequencing
Applications of DNA technologies

Fine-Scale Genetic Structure in Finland

Sini Kerminen, Aki S Havulinna, Garrett Hellenthal, Alicia R Martin, Antti-Pekka Sarin, Markus Perola, Aarno Palotie, Veikko Salomaa, Mark J Daly, Samuli Ripatti, Matti Pirinen, Fine-Scale Genetic Structure in Finland, G3 Genes|Genomes|Genetics, Volume 7, Issue 10, 1 October 2017, Pages 3459–3468, https://doi.org/10.1534/g3.117.300217

Coupling dense genotype data with new computational methods offers unprecedented opportunities for individual-level ancestry estimation once geographically precisely defined reference data sets become available. We study such a reference data set for Finland, containing 2376 individuals from the FINRISK Study survey of 1997 whose two parents were born close to each other. This sampling strategy focuses on the population structure present in Finland before the 1950s. By using the recent haplotype-based methods ChromoPainter (CP) and FineSTRUCTURE (FS), we reveal a highly geographically clustered genetic structure in Finland and report its connections to the settlement history as well as to the current dialectal regions of the Finnish language. The main genetic division within Finland shows striking concordance with the 1323 borderline of the treaty of Nöteborg. In general, we detect genetic substructure throughout the country, which reflects stronger regional genetic differences in Finland than in, for example, the UK, which in a similar analysis was dominated by a single unstructured population. We expect that similar population genetic reference data sets will become available for many more populations in the near future, with important applications, for example, in forensic genetics and in genetic association studies. With this in mind, we report those extensions of the CP + FS approach that we found most useful in our analyses of the Finnish data.

Methods for estimating fine-scale genetic structure are becoming increasingly important for genetics research. First, an optimal design of rare variant association studies requires knowledge of detailed genetic structure because rare variants are often population specific and geographically clustered (The 1000 Genomes Project Consortium et al. 2015). Second, as the well-established methods to control for genetic ancestry in common variant association studies do not necessarily work well for rare variants ( Mathieson and McVean 2012 ), we need new approaches to appropriately adjust the ongoing sequencing studies for fine-scale population structure. Third, fine-scale genetic structure can refine relationships between closely related populations and reveal recent history, including population movements over the last centuries ( Genome of the Netherlands Consortium 2014 ; Karakachoff et al. 2015 ; Leslie et al. 2015 ; Athanasiadis et al. 2016 ). Novel methods can even provide useful estimates of an individual’s recent past within countries considered to be genetically homogeneous, such as the UK ( Leslie et al. 2015 ). We expect that this opportunity will have an important role in the near future in engaging the general public to participate in large biobank collections or community efforts for genetics research, such as DNA.Land or Genes for Good ( Check Hayden 2015 ). Finally, an accurate estimate of biogeographic ancestry of a DNA sample is important in forensic genetics ( Kayser and de Knijff 2011 ).

Recently, the estimation of fine-scale genetic structure has improved due to increased sample sizes and advancements in statistical modeling ( Novembre and Peter 2016 ). In particular, utilization of haplotype information captures more detailed genetic ancestry than standard methods based on independent variants ( Gattepaille and Jakobsson 2011 ; Lawson et al. 2012 ; Duforet-Frebourg et al. 2015 ). A promising approach to exploit haplotype information combines software packages ChromoPainter (CP) and FineSTRUCTURE (FS) ( Lawson et al. 2012 ). CP summarizes the genetic similarity of the samples in a coancestry matrix that, for each individual, contains estimates of the proportion of his/her genome that is the closest with each of the other individuals in the sample. FS then clusters the individuals into populations via a nonparametric Bayesian model based on the coancestry matrix from CP. Leslie et al. (2015) recently applied CP + FS to the Peoples of the British Isles project data ( Winney et al. 2012 ) and reported striking concordance between genetic clusters and geography. For example, they genetically differentiated the neighboring counties of Cornwall and Devon in southwest England. We expect that, in the near future, the landmark work of Leslie et al. will motivate fine-scale analyses in many other populations, as well as new applications of CP + FS to individual-level fine-scale ancestry estimation within countries and regions that have so far been considered genetically too homogeneous for such analyses. Extending the interpretability of the output from CP + FS and evaluating how robust CP + FS is to parameters such as the sample size and sampling density of individuals is therefore timely.

In this work, we apply CP + FS to a sample of Finnish individuals whose two parents were born within 80 km of each other. Finland, with its relatively small founder population and strong genetic isolation (Norio 2003b; Salmela 2012), has become one of the most widely utilized populations in genetic studies of diseases and traits (Peltonen et al. 1999; Sabatti et al. 2009; Lim et al. 2014). Our goal was to characterize the fine-scale genetic population structure within Finland before the migrations that have occurred from the 1950s onward, both to serve as a reference data set for ongoing and future genetic association studies and to reveal relationships between genetics, known historical events, and the dialectal groups of Finland.

First, we refine our knowledge about the relatively strong genetic difference between western (W) and eastern (E) parts of the country (Lappalainen et al. 2006; Jakkula et al. 2008; Salmela et al. 2008; Neuvonen et al. 2015). Previous genetic analyses have studied this difference by collecting individuals from the opposite sides of the country and observing that their genetic differentiation is large compared to differentiation between some European countries, such as the UK and Germany (Salmela et al. 2008). By utilizing autosomal haplotype information from individuals that uniformly cover the main part of Finland, we locate an explicit genetic borderline between W and E Finland, and we introduce a Gaussian mixture model (GMM) to assess its uncertainty. We find strong similarities between the genetic borderline and both the treaty of Nöteborg from 1323 and the settlement history of Finland (Figure 1).

Figure 1. Locations of 1042 samples and the 12 Finnish provinces (1996 definition). Each sample is placed at the mean of its parents’ coordinates. LAP: Lapland, NOS: Northern Ostrobothnia, OST: Ostrobothnia, CNF: Central Finland, NSA: Northern Savonia, SSA: Southern Savonia, NKA: Northern Karelia, SKA: Southern Karelia, TAV: Tavastia, SWF: Southwestern Finland, SOF: Southern Finland. Kainuu is a subregion of NOS. The dashed line divides Finland into an early-settlement area (south and west of the line) and a late-settlement area (north and east of the line) (Jutikkala 1933). The cities of Helsinki, Turku, and Oulu are marked with black diamonds.

Second, we catalog the Finnish population structure at a finer scale, identifying nearly 20 geographically clustered populations that overlap minimally with each other in general and cover approximately similar surface areas of the country. We find striking concordance between many of the genetic populations and the dialectal regions of the Finnish language. We validate the robustness of the fine-scale structure to characteristics of the data such as the sample size and sampling density of the individuals and study the relationships of these populations by comparing two approaches for building a hierarchical tree.

In Results, we report the fine-scale analysis of the Finnish population structure and assess its robustness. Connections to the earlier work on the Finnish population structure are given in Discussion. We have made our results available through a website (see Data availability).

Materials and Methods

The FINRISK Study is a representative, cross-sectional survey of the Finnish working-age population (age range 25–74) that, since 1972, has collected a random sample of 6000–8000 individuals every 5 years to study risk factors of chronic diseases in Finland. Our data were from the FINRISK Study survey of 1997 (Vartiainen 1998) and included genotype data of 4191 individuals born between 1922 and 1972 and their parents’ birth municipalities. The study protocol of the FINRISK Study 1997 was approved by the Ethics Committee of the National Public Health Institute (decision number 38/96). All participants gave written informed consent. To obtain a geographically precisely defined sample, we took forward only those 2376 individuals whose two parents were born within 80 km of each other and who passed the quality control criteria defined below. The distance between parents was calculated as the great-circle distance between the coordinates of the city centers of the parents’ birth municipalities. The coordinates of each individual were calculated as the average of the parents’ coordinates. As the youngest individuals in our sample were born in 1972, it follows that almost all parents of our samples were born before 1950. Hence, our data reflect the population structure of Finland before the internal migration events that have taken place since around 1950.
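The parental-distance filter and the placement of each individual can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the haversine formula on a sphere of radius 6371 km, and the function names and the example coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km; inputs are degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def place_individual(mother, father, max_km=80.0):
    """Return the mean of the parents' (lat, lon) coordinates,
    or None when the birthplaces are more than max_km apart."""
    if great_circle_km(*mother, *father) > max_km:
        return None
    return ((mother[0] + father[0]) / 2, (mother[1] + father[1]) / 2)


# Hypothetical example: two birthplaces roughly 20 km apart.
location = place_individual((62.89, 27.68), (63.08, 27.66))
```

Averaging latitude and longitude directly is a reasonable approximation only because the two points are at most 80 km apart.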

The genotyping was performed with the Illumina HumanCoreExome-12 BeadChip at the Wellcome Trust Sanger Institute, Hinxton, United Kingdom. Genotyping success was first checked at the Sanger Institute, after which we performed additional quality control steps by excluding SNPs with minor allele frequency (MAF) below 5%, Hardy–Weinberg equilibrium (HWE) P-value below 10⁻⁶, or call rate below 99.9%. This resulted in 238,438 SNPs. For the CP analysis with rare variants, we ignored the MAF filter and included all SNPs with minor allele count above 1, resulting in 303,221 SNPs. All MAFs, HWE values, and call rates were calculated using PLINK version 1.07 (Purcell 2007).
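The three SNP filters can be illustrated with a small sketch. It assumes genotypes coded as 0/1/2 copies of one allele with `None` for a missing call, and it uses a one-degree-of-freedom chi-square test for HWE as a stand-in; PLINK's own HWE test may differ, so the P-values here are only indicative. All names are hypothetical.

```python
import math


def snp_stats(genotypes):
    """Return (maf, hwe_p, call_rate) for one SNP.

    genotypes: list of 0/1/2 allele counts, None for a missing call.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    call_rate = n / len(genotypes)
    if n == 0:
        return 0.0, 1.0, call_rate
    n_ab, n_bb = called.count(1), called.count(2)
    n_aa = n - n_ab - n_bb
    p = (n_ab + 2 * n_bb) / (2 * n)  # frequency of the counted allele
    maf = min(p, 1 - p)
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2))
    return maf, hwe_p, call_rate


def passes_qc(genotypes, min_maf=0.05, min_hwe_p=1e-6, min_call_rate=0.999):
    """Apply the thresholds quoted in the text to one SNP."""
    maf, hwe_p, call_rate = snp_stats(genotypes)
    return maf >= min_maf and hwe_p >= min_hwe_p and call_rate >= min_call_rate
```

Dropping the `min_maf` condition and requiring a minor allele count above 1 instead would correspond to the rare-variant data set described above.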

Sample quality control

We excluded the individuals that stood out from the other samples with average heterozygosity |F| > 0.025 or variant missingness rate >0.003. We also excluded individuals on two genotyping plates with poor quality. We calculated the relatedness for each pair of individuals using both PLINK 1.07 ( Purcell 2007 ) and GCTA 1.24.4 ( Yang et al. 2011 ) and excluded one individual from each pair for which either one of the relatedness values exceeded 0.05.
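The relatedness filter can be sketched as a single pass over the pairwise estimates. Which member of a related pair is dropped is not stated in the text; removing the second member is an assumption made here for illustration, and the function name is hypothetical.

```python
def prune_related(ids, pairs, threshold=0.05):
    """Drop one individual from every pair whose relatedness exceeds the
    threshold in either estimate.

    pairs: iterable of (id_a, id_b, plink_relatedness, gcta_relatedness).
    """
    removed = set()
    for id_a, id_b, r_plink, r_gcta in pairs:
        if id_a in removed or id_b in removed:
            continue  # this pair is already broken up
        if max(r_plink, r_gcta) > threshold:
            removed.add(id_b)  # assumption: the second member is dropped
    return [i for i in ids if i not in removed]
```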

Uniform sample selection

As the genotyping of the individuals from the FINRISK Study survey of 1997 upweighted Eastern Finland in its sampling, about half of the full data set of 2376 individuals were located in the provinces of Northern Karelia and Northern Savonia (Table 1). To study how the uneven sampling density or variation in the total sample size affected the FS results, we constructed three more uniformly distributed subsets of the data. We first placed a 25 km grid on a map of Finland and sampled at most one, two, or five randomly chosen individuals from each square. This sampling resulted in data sets that consisted of 328, 580, or 1042 individuals, respectively. We considered the data set with 1042 individuals as our main data set.
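The grid-based thinning can be sketched as below. The sketch assumes that sample locations have already been projected to planar coordinates in kilometres; the projection, the grid origin, the random seed, and the function name are all assumptions.

```python
import random
from collections import defaultdict


def grid_subsample(points, cell_km=25.0, max_per_cell=5, seed=0):
    """Keep at most max_per_cell randomly chosen individuals per grid square.

    points: dict mapping individual id -> (x_km, y_km) planar coordinates.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for ind, (x, y) in points.items():
        cells[(int(x // cell_km), int(y // cell_km))].append(ind)
    chosen = []
    for cell in sorted(cells):
        members = sorted(cells[cell])  # sorted for reproducibility
        k = min(max_per_cell, len(members))
        chosen.extend(rng.sample(members, k))
    return chosen
```

Calling the function with `max_per_cell` set to 1, 2, and 5 mirrors the three subsets described in the text.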

Table 1. Sample sizes

Province	Full data set	Main data set
Lapland (LAP)	38	38
Northern Ostrobothnia (NOS)	522	263
Kainuu	140	57
Northern Savonia (NSA)	592	139
Northern Karelia (NKA)	587	139
Central Finland (CNF)	45	45
Southern Savonia (SSA)	90	69
Southern Karelia (SKA)	49	47
Ostrobothnia (OST)	85	84
Tavastia (TAV)	75	71
Southwestern Finland (SWF)	226	109
Southern Finland (SOF)	67	38
Åland (ÅLA)	0	0
Total	2376	1042

Kainuu samples are included in NOS samples.

Includes samples outside the southeastern border.

ChromoPainter and FineSTRUCTURE analyses

The genotype data (after QC) were phased jointly for all individuals with SHAPEIT2 ( Delaneau et al. 2013 ) using default options and the effective population size 11,418 (European average). A recombination map was obtained for the genome build 37 ( http://www.shapeit.fr/files/genetic_map_b37.tar.gz , downloaded 25.6.2014).

Population structure analyses were performed similarly for all four data sets using ChromoPainter 0.0.4 and FineSTRUCTURE 0.0.4 (FS) programs ( Lawson et al. 2012 ). Phased genotype files were converted into CP format and global switch and emission rates were estimated using CP’s expectation-maximization algorithm (10 iterations) on chromosomes 1, 9, 15, and 22 using averages over 24 individuals. We also verified these estimates using an almost 10-fold larger sample of 238 individuals, and the estimates did not notably change. [Recently, Leslie et al. (2015) reported that a 10-fold difference in the switch rate does not have a big impact on the results.] CP was then run using the estimated global parameters and the HapMap build 37 recombination map converted into the CP format.

Population assignment was performed with FS, which reads in CP’s chunkcounts output and assigns individuals into genetically (relatively) homogeneous groups using a nonparametric Bayesian mixture model implemented through a Markov chain Monte Carlo (MCMC) algorithm [more details in Lawson et al. (2012) and Leslie et al. (2015)]. FS was run with the default options of 1,000,000 burn-in iterations and 1,000,000 MCMC iterations, from which every 10,000th iteration was recorded. The FS-tree was built using 1,000,000 tree comparisons and 100,000 additional hill-climbing moves. We ran all four data sets without a predefined number of populations, and we also ran the data set with 1042 individuals by specifically asking for two populations (Figure 3).

After the FS analysis, we performed an additional step to improve the population assignment by maximizing the overall posterior probability. This was done as in Leslie et al. (2015) .

Estimating population uncertainty

Principal component analysis (PCA)

To compare the chromosome painting method with standard methods that use only unlinked markers, we performed PCA with SmartPCA of the EIGENSOFT package (Patterson et al. 2006). We ran SmartPCA on 61,598 SNPs that were pruned to have r² < 0.2 within 1 cM windows and excluded the long-range linkage disequilibrium (LD) regions according to Price et al. (2008). We also performed a PCA on CP’s coancestry matrix as in Lawson et al. (2012), that is, by adding the column sums to the diagonal, by subtracting the column means from the elements, and by making the matrix symmetric by multiplying it with its transpose.

The comparison (Figure 2C) was performed by calculating, for each group, the average squared distance from the group mean, i.e., the empirical variance on the plane defined by the first and second principal components. This variance was then scaled by the variance of a similarly sized random sample of individuals from the same two-dimensional principal component (PC) plot, which made it possible to compare the two PCA plots. We performed 100,000 random samplings for each group and show their distribution using violin plots in Figure 2C. The groups were defined by the Finnish provinces shown in Figure 1.
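A sketch of this dispersion comparison is given below. It assumes that "scaled" means the ratio of the group's variance to the variance of each random sample of the same size; that reading, the function names, and the default number of draws are assumptions rather than details confirmed by the text.

```python
import random


def plane_variance(points):
    """Average squared distance of 2-D points from their centroid."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    return sum((p[0] - mean_x) ** 2 + (p[1] - mean_y) ** 2 for p in points) / n


def scaled_dispersion(all_points, group_idx, n_draws=1000, seed=0):
    """Ratios of the group's variance to the variance of equally sized
    random samples drawn from the same PC plot.

    all_points: list of (pc1, pc2) tuples for every individual.
    group_idx: indices of the individuals in the group (e.g. one province).
    """
    rng = random.Random(seed)
    group_var = plane_variance([all_points[i] for i in group_idx])
    ratios = []
    for _ in range(n_draws):
        draw = rng.sample(range(len(all_points)), len(group_idx))
        ratios.append(group_var / plane_variance([all_points[i] for i in draw]))
    return ratios
```

Ratios well below 1 indicate a group that sits more tightly together on the plot than a random set of the same size, which is the quantity a violin plot of these ratios would summarize.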

Figure 2. (A and B) The first and second principal components of genetic structure given by ChromoPainter (A) and SmartPCA (B), with individuals colored according to the provinces of Figure 1. (C) For each province, the violin plots show how dispersed, as measured by the sample variance, the individuals from that province are in A and B compared to a random set of similar size (Materials and Methods).

To study the effect of including rare variants in CP analyses, we did a similar comparison between our main data set and the data set that also included the rare variants (Supplemental Material, Figure S1 in File S1 ).

The border of the treaty of Nöteborg

The exact border of the treaty of Nöteborg is well defined only from Vyborg castle through Äyräpää and Jääski to the southeastern parts of present-day Finland (Katajala 2012). Several studies have speculated on how the border continues toward Western Finland, ending somewhere between Kalajoki and Pattijoki (Juhola 2011). Thus, we decided to draw our approximate border from Jääski (61.04° N, 28.92° E) to Pyhäjoki (64.46° N, 24.26° E), which is halfway between Kalajoki and Pattijoki and which has also been suggested as a possible western end point of the border (Vilkuna 1960; Julku 1987).

Total variation distance (TVD) and TVD-tree

Pairwise F_ST

Pairwise F_ST measures genetic differentiation by comparing allele frequencies between two populations. We estimated pairwise F_ST values between the W and E populations of the main division (Figure 3A) and between the 17 fine-scale populations (Figure 4A) using the 61,598 SNPs from the PCA analysis and Hudson’s F_ST estimator (Bhatia et al. 2013) implemented in the EIGENSOFT package (Patterson et al. 2006).
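For orientation, Hudson's estimator as formulated by Bhatia et al. (2013) can be written in a few lines. The sketch below combines SNPs as a ratio of sums and treats `n1` and `n2` as the number of sampled chromosomes per population; both choices are assumptions here, and EIGENSOFT's implementation may differ in detail.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson's F_ST estimate across SNPs (ratio of sums).

    p1, p2: per-SNP sample allele frequencies in the two populations.
    n1, n2: sample sizes, taken here as numbers of sampled chromosomes.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        numerator += ((a - b) ** 2
                      - a * (1 - a) / (n1 - 1)
                      - b * (1 - b) / (n2 - 1))
        denominator += a * (1 - b) + b * (1 - a)
    return numerator / denominator
```

The two correction terms remove the sampling variance of the allele frequencies, so two samples drawn from the same population give an estimate near zero (possibly slightly negative).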

Figure 3. (A) FineSTRUCTURE results with two populations that we labeled west (W) and east (E). (B) Results from A refined by marking with yellow circles the individuals whose assignment is uncertain (<80% assignment probability to both populations). Also shown are the approximate 1323 borderline of the treaty of Nöteborg, the early- vs. late-settlement border from Figure 1, and the regions of E and W dialects of the Finnish language, including partly Swedish-speaking coastal regions.

Figure 4. (A) Fine-scale population structure with 17 populations, (B) their relationships according to TVD-tree, and (C) their overlap with the seven main dialectal regions of the Finnish language, with eight Savonian subdialects marked with different shades of blue. Numbers in parentheses in B show into how many subpopulations these 17 populations split in the complete tree of 52 populations (Figure S3 in File S1).

The names of the Finnish provinces

We have used geographically motivated, simple English names for the 12 Finnish provinces (1996 definition). The official Finnish names of these provinces are LAP: Lapin lääni, NOS: Oulun lääni, OST: Vaasan lääni, CNF: Keski-Suomen lääni, NSA: Kuopion lääni, SSA: Mikkelin lääni, NKA: Pohjois-Karjalan lääni, SKA: Kymen lääni, TAV: Hämeen lääni, SWF: Turun ja Porin lääni, SOF: Uudenmaan lääni, ÅLA: Ahvenanmaan lääni.

Scripts for estimating population assignment probabilities and TVD are available at http://www.helsinki.fi/~sinikerm/; PLINK, http://pngu.mgh.harvard.edu/~purcell/plink/; GCTA, http://cnsgenomics.com/software/gcta/; SHAPEIT2, https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html; ChromoPainter and FineSTRUCTURE, http://paintmychromosomes.com/; EIGENSOFT, https://www.hsph.harvard.edu/alkes-price/software/.

Data availability

The genotype data used in this study are available through the National Institute for Health and Welfare Biobank https://thl-biobank.elixir-finland.org/ . Results for fine-scale structure at different levels are available at https://www.fimm.fi/en/research/projects/finnpopgen . File S1 includes Figures S1–S10 and File S2 includes Tables S1 and S2.

Results

We characterized the genetic population structure in Finland using data from the FINRISK Study survey of 1997 (Vartiainen 1998). We first identified a set of 2376 individuals whose two parents were born within 80 km of each other and that did not contain close relatives (Materials and Methods). This sample covered 10 of the 12 provinces of Finland well (1996 definition of provinces), with the exceptions of Lapland (only a few individuals) and Åland (no individuals at all). There were also large differences in the sampling density across the other provinces, and therefore our main analysis used a subset of 1042 individuals with a more uniform spatial distribution (Figure 1 and Table 1).

All our samples were genotyped using Illumina HumanCoreExome-12 BeadChip. Our main analysis used 238,438 directly genotyped SNPs with MAF > 5% that passed the quality control metrics ( Materials and Methods ).

Chromosome painting

Generating a haplotype-based coancestry matrix using CP requires considerably more computational resources than calculating the empirical correlation matrix across a set of independent variants. Therefore, we started by evaluating whether the higher computational cost of CP is compensated by CP capturing more information than the standard relatedness matrix based on independent SNPs. Figure 2 shows the first two PCs of both approaches [panel A for CP and panel B for SmartPCA (Patterson et al. 2006), which uses the empirical correlation matrix] and quantifies how dispersed (panel C; Materials and Methods) the individuals from the 11 provinces are in these two PC plots compared to a random set of individuals. CP clearly clusters the individuals from five provinces (LAP, NOS, NKA, NSA, SSA) more tightly than SmartPCA, whereas the opposite is true only for the province of TAV. In the remaining five provinces, we see little difference between the two methods. These results illustrate CP’s overall tendency to cluster individuals who live geographically closer more tightly together than SmartPCA does, especially in the northern and eastern parts of the country, which we expect to be the most genetically isolated due to their later permanent inhabitation by a relatively small set of individuals starting from the 1500s (Figure 1 and Discussion).

We also assessed whether an addition of 64,783 low-frequency and rare variants (MAF < 5%) available on the genotyping chip affected the CP results but did not observe any noticeable difference compared to the common variant analysis (Figure S1 in File S1 ). This indicates that in our data, the high-quality common variants sufficiently capture the haplotype structure compared to all available variants.

These results motivated us to then run FS on the CP output of the common variant analysis to reveal fine-scale population structure in Finland.

Division between Western and Eastern Finland

To establish the high-level genetic structure in Finland, we applied FS to the output of CP by allowing exactly two populations. As expected, the main genetic division was between the W and E parts of the country (Figure 3A). The pairwise F_ST (Patterson et al. 2006) between these two populations was 0.002 (SE = 2 × 10⁻⁵). The clustering model of FS reported almost no uncertainty for this binary population assignment (Figure S2 in File S1). To reveal more detailed differences between individuals in the proportions of genome related to the two populations, we used a Gaussian mixture model (GMM) to assess how certain the assignment of each individual to the W or E population was, based on the CP coancestry matrix and FS output (Materials and Methods). In Figure 3B we have marked those individuals who did not belong to either the W or the E population with over 80% probability. These individuals highlight a genetic border between W and E from the southeastern corner of Finland to the coast of Central Ostrobothnia, leaving Southwestern Lapland also closer to the W population.
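The 80% rule can be illustrated with a two-component mixture. The sketch assumes a one-dimensional summary per individual (for example, a score derived from the coancestry matrix) and already fitted component parameters; the actual GMM inputs and fitting procedure are not given in this text, so every name and number below is hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def assignment_probs(x, comp_w, comp_e):
    """Posterior probabilities of the W and E components for score x.

    comp_w, comp_e: (weight, mean, sd) of each fitted mixture component.
    """
    like_w = comp_w[0] * normal_pdf(x, comp_w[1], comp_w[2])
    like_e = comp_e[0] * normal_pdf(x, comp_e[1], comp_e[2])
    total = like_w + like_e
    return like_w / total, like_e / total


def is_uncertain(x, comp_w, comp_e, cutoff=0.8):
    """True when neither population reaches the cutoff probability."""
    return max(assignment_probs(x, comp_w, comp_e)) < cutoff


# Hypothetical fitted components: (weight, mean, sd).
west = (0.5, 0.7, 0.1)
east = (0.5, 0.3, 0.1)
```

With these illustrative components, a score midway between the two means is flagged as uncertain, while a score at either mean is not.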

Next, we compared this genetic border to historical records and dialectal patterns, both of which show features of W and E differentiation. The earliest and most densely inhabited regions were concentrated in Southern Finland and the coastal regions up to the Bothnian Bay, dividing Finland into the southwestern early-settlement region (ESR) and the northeastern late-settlement region (LSR), which became permanently inhabited from the 1500s (Figure 3B) (Jutikkala 1933). While in general ESR is covered by the W population and LSR is covered by the E population, we point out two exceptions: the ESR provinces of SSA and SKA (Figure 1) are mainly covered by the E population, whereas the LSR to the west of CNF is covered by the W population. We discuss these observations together with Southwestern Lapland’s close relation to the W population in Discussion.

The first official border within modern day Finland was ratified in the treaty of Nöteborg in 1323 (fin. Pähkinäsaaren rauha), and it joined the southwestern part of Finland to the Kingdom of Sweden and Eastern Finland to Novgorod (a historical state located in modern day Russia). In Figure 3B we have approximated the border by a line between Jääski near the southeastern border of Finland and Pyhäjoki (see Materials and Methods ) on the coast of Ostrobothnia ( Katajala 2012 ). The genetic division between the W and E populations follows this medieval border line strikingly accurately, leaving more individuals with uncertain assignment on the southern side of the border ( Figure 3B ).

The primary dialectal division of the Finnish language, into E and W dialects (http://www.kotus.fi/kielitieto/murteet/suomen_murteet, 2015; Itkonen 1989), shows an overall concordance with the genetic division, with an exception on the northern side of the 1323 border near Oulu, where the W dialects overlap with the E population (Figure 3B).

Fine structure

When FS was run without a preassigned number of populations, it divided our sample of 1042 individuals into 52 populations (Figure S3 in File S1). As an example of fine-scale genetic structure in Finland, Figure 4A shows 17 populations from the default hierarchical tree of FS on the map of Finland. We chose this level of the tree because it already reveals detailed population structure without introducing very small populations (i.e., <25 individuals) and because we have verified its robustness to the sample size included in the analysis by a comparison with another subset of the data (Figure S4 in File S1). Figure 4A shows that overall the populations are geographically clustered, overlap little, and are distributed evenly across Finland. The only exception from tight clustering is P6, which exhibits diffuse clustering along the E–W borderline, as identified in Figure 3, and includes individuals around the large southern cities of Helsinki and Turku as well as around the northern city of Oulu, 540 km north of Helsinki (see Figure 1 for cities on the map and Discussion for more information about this population). The pairwise F_ST values corresponding to these 17 populations (Table S1 and Table S2 in File S2) show that overall P6 has relatively small F_ST values with all other populations, indicating approximately equal relatedness to both E and W populations.

To visualize the hierarchical structure of these populations, we compared two agglomerative clustering algorithms to build a hierarchical tree for the populations. FS provides an algorithm (here: FS-tree) that at each level of the tree building merges the two populations resulting in the highest posterior probability among all possible merges. Lawson et al. (2012) reported that although FS-tree has performed well in practice, it might depend significantly on the sample sizes of the populations. We therefore compared FS-tree to another tree-building algorithm based on TVD between populations (TVD-tree, Materials and Methods ) that does not depend on the sizes of the populations. In Figure 4B we show TVD-tree for these 17 populations because, in our data, TVD-tree produced more consistent results across different sample sizes than FS-tree (see Sample size and sample density ). Figure 4B shows that after the E–W split, the next split in the east is between Kainuu and Southeastern Finland, and in the west is between Northern and Southwestern Finland. When we follow the more detailed tree to its 52 leaves (Figure S3 in File S1 ), these four regions (Kainuu, SE Finland, N Finland, and SW Finland) split into 11 (178 individuals), 18 (427), 8 (123), and 15 (314) populations, respectively. Hence, we observe fine-scale population structure across the whole of Finland, which is in contrast to an FS analysis of the UK of the late 1800s ( Leslie et al. 2015 ) where a large unstructured population covered a major part of the country. However, we note that in TVD-tree, the southwest corner of Finland, which has been permanently inhabited the longest ( Jutikkala 1933 ), is the last region to split into smaller parts both in Figure 4B and in the TVD-tree of all 52 populations (Figure S3 in File S1 ).
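The agglomerative idea behind a TVD-based tree can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: each population is summarized by an average ancestry vector that sums to 1, TVD is taken as half the L1 distance between two such vectors, and a merged population is represented by the plain average of the two merged vectors. The function names and the toy vectors are hypothetical.

```python
from itertools import combinations


def tvd(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def tvd_tree(profiles):
    """Greedy agglomerative merging by smallest pairwise TVD.

    profiles: dict mapping population name -> average ancestry vector.
    Returns the merges in order as (name_a, name_b, distance).
    """
    current = dict(profiles)
    merges = []
    while len(current) > 1:
        # Pick the pair of current populations with the smallest TVD.
        a, b = min(
            combinations(sorted(current), 2),
            key=lambda pair: tvd(current[pair[0]], current[pair[1]]),
        )
        distance = tvd(current[a], current[b])
        # Assumption: the merged profile is the unweighted mean of the two,
        # so population sizes play no role in later merges.
        merged = [(x + y) / 2 for x, y in zip(current[a], current[b])]
        del current[a], current[b]
        current[f"({a},{b})"] = merged
        merges.append((a, b, distance))
    return merges


# Toy ancestry vectors (invented for illustration only).
toy = {
    "P1": [0.70, 0.20, 0.10],
    "P2": [0.65, 0.25, 0.10],
    "P3": [0.10, 0.30, 0.60],
}
merges = tvd_tree(toy)
```

Because the merge criterion is a distance between averaged vectors, a population of 30 individuals and one of 300 are treated alike, which is the property the text contrasts with FS-tree.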

In addition to the primary E–W dialectal division ( Figure 3B ), the Finnish dialects are further divided into seven main dialects and their subdialects. Figure 4C overlays the main dialects with the genetic populations and shows that, on many occasions, the genetic populations closely follow the dialectal borders. In Western Finland the regions of Southwestern, Tavastian, Southern Ostrobothnian, Mid and Northern Ostrobothnian, and Far-Northern dialects show primarily one or two populations located exclusively at each region. For example, P5 is strictly located at the Southern Ostrobothnian dialectal region and P2 at the Southwestern dialectal region. Only near the city of Oulu do we see a mixture of individuals from several populations whose primary location is outside this dialectal region. In Eastern Finland, the Savonian dialectal region covers several genetic populations but even there the concordance between genetics and dialects can be detected when compared to subdialectal regions ( Figure 4C ). Figure 4C also reveals an interesting detail about the Savonian dialect spoken in Ostrobothnia in Western Finland. Indeed, we observe a genetic population (P10) that clusters in this region, but genetically this population is closer to other populations in Western Finland than to the populations in Eastern Finland. The southeastern dialectal region lacks a unique genetic population of its own since P6, which covers this region, is also widely spread out geographically to the Savonian dialectal region.

Sample size and sample density

To study how sample size and sampling density affect the FS results, we compared our main data set of 1042 individuals to two additional subsets of our data with 328 and 580 individuals as well as to the full data set of 2376 individuals. Individuals in the three subsets were geographically evenly distributed ( Materials and Methods ) while the full data set contained considerably more individuals in the regions of NSA and NKA ( Figure 1 and Table 1 ). We analyzed all data sets with the same pipeline and observed that the total number of populations detected by FS increased approximately linearly with the number of individuals (14, 22, 52, and 170 populations for 328, 580, 1042, and 2376 individuals, respectively). We next investigated how this correlation between the number of populations and the sample size affected the main properties of the population structure identified from each data set.
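As a quick check of the reported counts, the sketch below fits an ordinary least-squares line to the four (sample size, number of populations) pairs quoted above. The fit is an illustration only and is not an analysis reported in the study; with four points it is purely descriptive.

```python
# Values quoted in the text: individuals per data set and populations found by FS.
sizes = [328, 580, 1042, 2376]
n_pops = [14, 22, 52, 170]


def linear_fit(x, y):
    """Return (slope, intercept, Pearson r) of an ordinary least-squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


slope, intercept, r = linear_fit(sizes, n_pops)
# Populations per individual in each data set, for a size-by-size comparison.
per_individual = [p / s for p, s in zip(n_pops, sizes)]
```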

The visual comparison of the data sets at different levels showed that we can recognize core populations that exist in every data set. In Figure 5, these populations cluster near Oulu (cyan), into LAP (purple), Kainuu (magenta), NSA and NKA (blue), SSA and SKA (yellow), Central Ostrobothnia (black), OST (dark green), TAV (sky blue), and Southern Finland (red). (Results of sample size of 328 are shown in Figure S5 in File S1.) In the more uniformly distributed data sets (Figure 5, A and B and Figure S5 in File S1), these are the nine populations first to split in FS-tree, but with the data set of 2376 individuals (Figure 5C) we recognize all nine populations only when 15 populations are observed. Figure 5C shows that the additional six populations of the data set with 2376 samples are all located in Eastern Finland, where the sampling was densest. This observation suggests, unsurprisingly, that FS identifies more populations where the sample is denser, owing to increased statistical power for population detection in those regions, but also that these populations can occur relatively close to the root of FS-tree compared to subsets of data with a more uniform sampling strategy. This raises concerns about whether the splitting order given by FS-tree primarily reflects the genetic differences, or whether it is also significantly affected by varying sample sizes between different populations. While a bifurcating tree is only an approximation to the complex relationships between the populations, we can at least test how stable the structure of the tree is across varying sample sizes and sampling densities using either FS-tree or TVD-tree.

Figure 5. FS results with varying sample size and sample density. (A) Data set of 580 individuals at FS-tree level 9, (B) data set of 1042 individuals at FS-tree level 9, and (C) data set of 2376 individuals at FS-tree level 15.

When we compared the FS-trees and TVD-trees (Figure S5–S8 in File S1 ) we observed that the first split divides all the data sets into E and W but in the two largest data sets (1042 and 2376 individuals), the division made by FS-tree assigned essentially all individuals above the treaty of Nöteborg to the E population, which did not match with the main genetic E–W split identified earlier by explicitly modeling two populations ( Figure 3 ). In contrast, in all four TVD-trees as well as in the two remaining FS-trees (328, 580) the E–W split is consistent with Figure 3 . We emphasize the TVD-tree rather than an FS-tree in Figure 4B because TVD-trees were more consistent than FS-trees both across the sample sizes and in comparison with the explicit analysis of the main genetic split within Finland.

Discussion

We anticipate that fine-scale genetic structure estimation will become an essential part of future rare variant association studies and individual-level ancestry estimation across the globe. We assessed performance and robustness of the haplotype-based methods ChromoPainter (CP) and FineSTRUCTURE (FS) in revealing fine-scale genetic structure in Finland. First, we defined geographically the main genetic division between Eastern and Western Finland using precisely located samples based on parents’ birthplaces. Second, we characterized the fine-scale genetic population structure present before the 1950s. Our results serve as a population genetic reference for future design and interpretation of genetic association studies and individual-level ancestry estimation in Finland. We validated CP + FS results by comparing them to a standard PCA, incorporating a more sensitive uncertainty measure, comparing different ways of building the hierarchical structure among the populations, and studying the effect of sample size and sampling density on our results. In general, we found CP + FS to be a useful and robust method for fine-scale population structure analysis with a few caveats that will be important to take into account in future applications in other populations. We note that our analyses do not explicitly model migrations and admixture events and future studies with complementary approaches and complementary data from the neighboring countries of Finland are required to study these topics.

Finland has, for the last 10,000 yr, been a border region between W populations of Scandinavia, southern populations of the Baltics, and E populations of European Russia as summarized by Salmela (2012) . These long-term influences may have contributed to the main genetic division within Finland separating W and E parts of the country ( Figure 3A ). Another contributing factor to this primary split is likely to be the relatively small population size and isolated nature of many parts of the late-settlement areas concentrated in Eastern and Northern Finland ( Figure 1 ) that, according to historical records, were only sparsely inhabited, if at all, until the middle of the 16th century when people, mainly from Southern Savonia (SSA), gradually extended their practice of agriculture and more stable habitation to these areas ( Jutikkala 1933 ).

Consistent with Southern Savonian settlers inhabiting Eastern and Northern Finland, we indeed observe that SSA, and its neighboring province SKA, are the only areas of the early-settlement region that are primarily covered by the E population that extends from SSA to the LSR ( Figure 1 and Figure 3B ). On the other hand, the southwest corner of the LSR is genetically part of the W population rather than the E population. A possible historical explanation for this is that these areas were old hunting grounds of Tavastians and therefore may have attracted Tavastian settlers in addition to Savonians (p. 99, Jutikkala 1933 ). Also, later contacts between this region and the neighboring coastal areas of the early-settlement region of Ostrobothnia may well have contributed to the major influences from the W population that we observe in this region today.

Our analysis established that the 1323 borderline of the treaty of Nöteborg is a very accurate description of the main genetic split within Finland ( Figure 3B ). This may support a role for the 16th century Swedish authorities in guiding the Savonian settlers to inhabit land particularly to the east of the 1323 border (p. 98, Jutikkala 1933 ). However, it seems unlikely that the 1323 border itself would have been a physical cause of the genetic population structure as the border was of a more administrative nature and did not restrict the movements of common people ( Katajala 2012 ; Korpela 2002 ).

An interesting detail of the split between the W and E populations is a group of individuals in Torne Valley (fin. Tornionlaakso) in Southwestern Lapland who are assigned to the W population even though geographically they are separated from the rest of the W individuals by E individuals near Oulu ( Figure 3A ). While it is possible that the early-settlement region on the west coast had also extended to Torne Valley (p. 67–68, Jutikkala 1933 ), establishing a genetic connection all the way to Southern Finland, it is also possible that the close ties between the Finnish side of Torne Valley and Sweden across the Torne River have resulted in a genetic admixture whose Swedish component clustered these individuals with Western Finland regardless of how their Finnish component would have clustered them.

Two previous studies about population structure in Finland using genome-wide autosomal variants have either not attempted to define populations based on genetic data ( Jakkula et al. 2008 ) or have done so by applying the STRUCTURE algorithm ( Pritchard et al. 2000 ) to an independent set of 6369 variants ( Salmela et al. 2008 ), which did not reveal the fine-scale genetic structure within Finland. Our haplotype-based population assignment using a data set that evenly covers a major part of Finland therefore provides unprecedented information on the fine-scale genetic structure of the Finnish population. In general, the fine-scale structure that we detected is highly geographically clustered with little overlap between the populations ( Figure 4 and Figure 5 and Figures S4–S8 in File S1 ). There are no large differences in the area covered or number of individuals included in each population (with the exception of P6 in Figure 4 ). This is likely a consequence of a relatively small population size and isolation by distance throughout the country. It is instructive to contrast this pattern to an FS analysis in the UK that focused on the population structure of the late 1800s ( Leslie et al. 2015 ). In this analysis, a single population covered central and southern England and included almost half of all the individuals, even at the finest level where the sample was already split into 53 populations ( Leslie et al. 2015 ). The strong genetic clustering within Finland is in line with the pocketed distribution of multiple severe diseases of the Finnish Disease Heritage ( Peltonen et al. 1999 ; Norio 2003a ) and suggests that Finland also holds great promise for future studies on less severe and more prevalent diseases for which genetic variants with large effects have been more difficult to identify.

In our analysis, the exceptionally dispersed P6 dominating the Southern Karelia region also spread out widely across Savonia, near the large southern cities of Helsinki and Turku as well as along the genetic East–West borderline of Figure 3 all the way to Oulu and even further north (Figure 4A). As a hint that this dispersal might be due to recent events, we noticed that individuals of P6 were clearly younger than the rest of our samples (median birth years 1957 and 1950, respectively; Mann–Whitney P-value 10⁻⁵). When we further split P6 into its three subpopulations (Figure S9 in File S1), we noticed that one of these remained stable across the birth years of our samples (orange population in Figure S9 in File S1) while the other two spread out from Karelia to Savonia (green population) and from the city of Vyborg (fin. Viipuri) throughout the country (yellow population) within this timeframe (1922–1972). During and after the Second World War (1939–1945), ∼400,000 Finns from the larger Karelia region southeast of the current borders of Finland were relocated throughout the other parts of Finland. Hence, the youngest third of our sample born after 1957 could, in principle, have such relocated Karelians among their grandparents. However, Figure S9 in File S1 also shows that the dispersal of P6 has started already among individuals born in 1941–1957, and therefore we cannot entirely explain the dispersal by relocations during and after the war (unless some parental birthplaces of these individuals were wrongly reported).

In order to reveal details of the Finnish population structure, we extended the CP + FS output in two ways. First, we devised a GMM to estimate the uncertainty of population assignment because FS does not report any uncertainty estimate by default. With the GMM we can stratify individuals based on the proportion of their genome related to each source population, information that is not available in the FS output. In our analysis, we clustered individuals across Finland based on their genetic makeup with respect to canonical northeastern (Northern Karelia) and southwestern (Southwest Finland) reference samples (Figure 3B). Quantitatively, our uncertainty estimates were clearly larger than those we teased out from the raw MCMC output of FS, although we identified qualitative similarities between the two (compare Figure 3B with Figure S2 in File S1). We note that the two approaches do not estimate the same quantity: our GMM estimates the mixture proportions of an individual originating from each predefined source population, while FS carries out an unsupervised clustering without admixture modeling.

Second, we introduced TVD-tree to complement the default FS-tree for building a hierarchical tree structure for the populations. Traversing an FS-tree from its leaves toward its root, small populations quickly merge to larger ones since small changes to the population assignment cause a relatively small decrease in the model probability. Instead, TVD-tree uses a difference in average ancestries between two populations and hence is less dependent on the sample sizes of the populations. To combine the complementary properties of FS-tree and TVD-tree, we used FS-tree to choose the level of populations ( e.g. , 17 in Figure 4 and 9 or 15 in Figure 5 ) but then described the relationships between those populations using TVD-tree. In this way, we avoid including very small populations in the tree, which could be prone to a large sampling variation, while at the same time we make use of TVD-tree, which empirically gave more consistent results across data sets as described in Results .

While a discrete population assignment remains an approximation to the underlying genetic relationships between individuals, it provides valuable information for individual-level ancestry estimation and design and analysis of rare variant association studies. We anticipate that our experiences and tools reported here in the context of Finland will be useful for fine-structure analyses of other populations.

We thank the participants of the FINRISK cohort and its funders: the National Institute for Health and Welfare, the Academy of Finland (139635 to V.S.), and the Finnish Foundation for Cardiovascular Research. This work was financially supported by the Academy of Finland (257654, 288509, and 294050 to M.P.; 251217 and 255847 to S.R.) and by the Research Funds of the University of Helsinki to M.P. S.R. was further supported by the Academy of Finland Center of Excellence for Complex Disease Genetics, EU FP7 projects ENGAGE (201413), BioSHaRE (261433), the Finnish Foundation for Cardiovascular Research, Biocentrum Helsinki, and the Sigrid Jusélius Foundation.

Author contributions: S.K., S.R. and M.Pi. designed the study. S.K. conducted the analyses. A.S.H., A-P.S., M.Pe. and V.S. provided materials. S.K., G.H. and M.Pi. provided computational methods. S.K., A.R.M., A.P., M.J.D., S.R. and M.Pi. interpreted the results. S.K. and M.Pi. wrote the manuscript with help from A.R.M. All authors reviewed the manuscript. G.H. is a founder and director of GENSCI and consultant to LivingDNA.

Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.117.300217/-/DC1 .

Communicating editor: S. Tishkoff

Literature Cited

Athanasiadis G, Cheng J Y, Vilhjalmsson B J, Jorgensen F G, Als T D et al., 2016 Nationwide genomic study in Denmark reveals remarkable population homogeneity. Genetics 204: 711–722.

Bhatia G, Patterson N, Sankararaman S, Price A, 2013 Estimating and interpreting FST: the impact of rare variants. Genome Res. 23: 1514.

Check Hayden E, 2015 Scientists hope to attract millions to ‘DNA.Land’. Nature News. Available at: http://www.nature.com.edgesuite.net/news/scientists-hope-to-attract-millions-to-dna-land-1.18514.

Delaneau O, Zagury J, Marchini J, 2013 Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods 10: 5–6.

Duforet-Frebourg N, Gattepaille L M, Blum M G B, Jakobsson M, 2015 HaploPOP: a software that improves population assignment by combining markers into haplotypes. BMC Bioinformatics 16: 242.

Gattepaille L M, Jakobsson M, 2011 Combining markers into haplotypes can improve population structure inference. Genetics 190: 159–174.

Genome of the Netherlands Consortium, 2014 Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat. Genet. 46: 818–825.

Itkonen T, 1989 Nurmijärven murrekirja. Suomalaisen Kirjallisuuden Seura, Helsinki.

Jakkula E, Rehnström K, Varilo T, Pietiläinen O P H, Paunio T et al., 2008 The genome-wide patterns of variation expose significant substructure in a founder population. Am. J. Hum. Genet. 83: 787–794.

Juhola S, 2011 Pähkinäsaaren rauhan raja arkeologian ja raja-alueelle jääneen paikannimistön valossa. Ennen ja Nyt. Available at: http://www.ennenjanyt.net.

Julku K, 1987 Suomen itärajan synty. Pohjois-Suomen Historiallinen Yhdistys, Rovaniemi.

Jutikkala E, 1933 Asutuksen leviäminen Suomessa 1600-luvun alkuun mennessä, pp. 51–103 in Suomen Kulttuurihistoria, Vol. I, edited by Jutikkala E, Suolahti G. Gummerus, Helsinki.

Karakachoff M, Duforet-Frebourg N, Simonet F, Le Scouarnec S, Pellen N et al., 2015 Fine-scale human genetic structure in Western France. Eur. J. Hum. Genet. 23: 831–836.

Katajala K, 2012 Drawing borders or dividing lands?: the peace treaty of 1323 between Sweden and Novgorod in a European context. Scand. J. Hist. 37: 23–48.

Kayser M, de Knijff P, 2011 Improving human forensics through advances in genetics, genomics and molecular biology. Nat. Rev. Genet. 12: 179–192.

Korpela J, 2002 Finland’s eastern border after the treaty of Nöteborg: an ecclesiastical, political or cultural border? J. Balt. Stud. 33: 384–397.

Lappalainen T, Koivumäki S, Salmela E, Huoponen K, Sistonen P et al., 2006 Regional differences among the Finns: a Y-chromosomal perspective. Gene 376: 207–215.

Lawson D J, Hellenthal G, Myers S, Falush D, 2012 Inference of Population Structure using Dense Haplotype Data. PLoS Genet. 8: e1002453.

Leslie S, Winney B, Hellenthal G, Davison D, Boumertit A et al., 2015 The fine-scale genetic structure of the British population. Nature 519: 309–314.

Lim E T, Würtz P, Havulinna A S, Palta P, Tukiainen T et al., 2014 Distribution and medical impact of loss-of-function variants in the Finnish founder population. PLoS Genet. 10: 1–12.

Mathieson I, McVean G, 2012 Differential confounding of rare and common variants in spatially structured populations. Nat. Genet. 44: 243–246.

Neuvonen A M, Putkonen M, Översti S, Sundell T, Onkamo P et al., 2015 Vestiges of an ancient border in the contemporary genetic diversity of North-Eastern Europe. PLoS One 10: e0130331.

Norio R, 2003a The Finnish disease heritage III: the individual diseases. Hum. Genet. 112: 470–526.

Norio R, 2003b Finnish disease heritage II: population prehistory and genetic roots of Finns. Hum. Genet. 112: 457–469.

Novembre J, Peter B M, 2016 Recent advances in the study of fine-scale population structure in humans. Curr. Opin. Genet. Dev. 41: 98–105.

Patterson N, Price A L, Reich D, 2006 Population structure and eigenanalysis. PLoS Genet. 2: 2074–2093.

Peltonen L, Jalanko A, Varilo T, 1999 Molecular genetics of the Finnish disease heritage. Hum. Mol. Genet. 8: 1913–1923.

Price A L, Weale M E, Patterson N, Myers S R, Need A C et al., 2008 Long-range LD can confound genome scans in admixed populations. Am. J. Hum. Genet. 83: 132–135, author reply 135–139.

Pritchard J K, Stephens M, Donnelly P, 2000 Inference of population structure using multilocus genotype data. Genetics 155: 945–959.

Purcell S, 2007 PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81: 559–575.

Sabatti C, Service S K, Hartikainen A, Pouta A, Ripatti S et al., 2009 Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat. Genet. 41: 35–46.

Salmela E, 2012 Genetic Structure in Finland and Sweden: Aspects of Population History and Gene Mapping. University of Helsinki, Helsinki.

Salmela E, Lappalainen T, Fransson I, Andersen P M, Dahlman-Wright K et al., 2008 Genome-wide analysis of single nucleotide polymorphisms uncovers population structure in Northern Europe. PLoS One 3: e3519.

The 1000 Genomes Project Consortium, Auton A, Brooks L D, Durbin R M, Garrison E P et al., 2015 A global reference for human genetic variation. Nature 526: 68–74.

Vartiainen E, 1998 Finriski 1997. Kansanterveyslaitos, Helsinki.

Vilkuna K, 1960 Pähkinäsaaren rauhan raja kansantieteellisessä katsannossa. Historiallinen aikakauskirja 58: 407–432.

Winney B, Boumertit A, Day T, Davison D, Echeta C et al., 2012 People of the British Isles: preliminary analysis of genotypes and surnames in a UK-control population. Eur. J. Hum. Genet. 20: 203–210.

Yang J, Lee S H, Goddard M E, Visscher P M, 2011 GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88: 76–82.

Open access
Published: 07 September 2023

ChromGene: gene-based modeling of epigenomic data

Artur Jaroszewicz & Jason Ernst (ORCID: orcid.org/0000-0003-4026-7853)

Genome Biology volume 24, Article number: 203 (2023)

Various computational approaches have been developed to annotate epigenomes on a per-position basis by modeling combinatorial and spatial patterns within epigenomic data. However, such annotations are less suitable for gene-based analyses. We present ChromGene, a method based on a mixture of learned hidden Markov models, to annotate genes based on multiple epigenomic maps across the gene body and flanks. We provide ChromGene assignments for over 100 cell and tissue types. We characterize the mixture components in terms of gene expression, constraint, and other gene annotations. The ChromGene method and annotations will provide a useful resource for gene-based epigenomic analyses.

Genome-wide maps of epigenomic marks, such as histone modifications from ChIP-seq experiments and chromatin accessibility from DNase-seq or ATAC-seq experiments, provide valuable information for annotating the genome in a cell type-specific manner [ 1 , 2 , 3 , 4 , 5 , 6 , 7 ]. Notably, approaches have been developed for annotating the genome into “chromatin states” based on the combinatorial and spatial patterns of epigenomic marks inferred de novo from the data. These different chromatin states can correspond to different classes of genomic elements, including enhancers, promoters, and repressive regions [ 8 , 9 , 10 ]. Annotations from these methods have been used for a diverse range of applications, including understanding gene regulation and genetic variants associated with disease [ 2 , 11 , 12 ].

Typically, chromatin states annotate the genome on a per-position basis. However, for some applications with epigenomic data, it is desirable to conduct gene-based analyses, as is common with transcriptomic data [ 13 , 14 , 15 ], but taking full advantage of epigenomic data to generate gene-based annotations is less straightforward than for per-position annotations. The challenge with gene-based annotations is that the combination of epigenomic marks will vary along a gene in a position-dependent manner. Furthermore, protein-coding genes differ vastly in length, ranging from a few hundred base pairs to over 2 Mb (median 30 kb) [ 16 ]. One strategy for gene-based annotations is to focus on the chromatin state at the transcription start site (TSS) [ 8 , 17 ]. While such an approach is largely independent of varying gene lengths, it ignores potentially important information throughout the gene body. A simple alternative strategy would be based on averaging each mark’s signal across the entire gene [ 18 ]. However, such an approach loses information about which marks co-occur along the gene, and it could be heavily confounded by gene length.

Another strategy has been to partition genes into regions and cluster the genes based on mark signal in those regions. For example, one study partitioned genes into five regions: the 500-bp region upstream of the TSS, the 500-bp region downstream of the TSS, and the remaining gene body split into equal thirds [19]. The study then averaged the per-mark signal within each region and clustered the resulting per-gene profiles with the k-medoids algorithm. However, such an approach is dependent on the specific choices of the partition and loses information due to averaging within each partition.
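The five-region partition described above can be sketched as follows. This is a hedged illustration rather than the cited study's code: it assumes a plus-strand gene, half-open integer coordinates, and a gene body longer than 500 bp, and it uses integer division, so the three body segments are equal only up to rounding. The function name `partition_gene` is hypothetical.

```python
def partition_gene(tss, tes, flank=500):
    """Split a plus-strand gene into five half-open (start, end) regions.

    Regions: `flank` bp upstream of the TSS, `flank` bp downstream of the TSS,
    and the remaining gene body cut into three near-equal parts.
    """
    if tes - tss <= flank:
        raise ValueError("gene body must be longer than the downstream window")
    body_start = tss + flank
    body_len = tes - body_start
    # Four cut points delimit three segments; integer division keeps them integral.
    cuts = [body_start + (body_len * i) // 3 for i in range(4)]
    regions = [(tss - flank, tss), (tss, body_start)]
    regions += [(cuts[i], cuts[i + 1]) for i in range(3)]
    return regions


def region_means(signal, regions):
    """Average a per-base signal (a list indexed by coordinate) within each region."""
    return [sum(signal[s:e]) / (e - s) for s, e in regions]


regions = partition_gene(tss=1000, tes=4000)
```

Averaging inside each region is what discards the within-region pattern of marks, which is the information loss the text points out.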

Other work has solved related, but different, problems. EPIGENE used hidden Markov models (HMMs) to identify transcription units [ 20 ], but is not designed for clustering them. Hierarchical or multi-tiered HMMs have been proposed to capture broader domains, but are not designed to provide a single annotation per gene in a cell or tissue type [ 21 , 22 , 23 ]. (Henceforth, we will use the term “cell type” instead of “cell or tissue type” for ease of presentation.) EpiAlign was proposed to align chromatin states between two regions, such as genes, and identify corresponding regions [ 24 ], but is also not designed for de novo gene-based annotations.

There is thus a need for a principled, model-based method that can be used to generate de novo annotations of known genes based on the combinatorial and spatial information in data of multiple epigenomic marks. To address this, we introduce ChromGene. ChromGene uses a mixture of HMMs to model the combinatorial and spatial information of epigenomics maps throughout a gene body and flanking regions. Furthermore, ChromGene can learn a common model across multiple cell types and use it to generate per-gene annotations for each. ChromGene is distinguished from other methods in that its focus is assigning a single annotation for each gene, which can span an arbitrary length, as opposed to an annotation for each position in the genome.

Here, we apply ChromGene to ChIP-seq data of histone marks and DNase-seq data from over 100 cell types and produce per-gene annotations for each one. We describe these annotations with respect to their mark emissions and relate them to gene expression data and other external data, including genes with high probability of loss-of-function (LoF) intolerance (pLI) [ 25 ]. We show that ChromGene annotations have better agreement with gene expression and stronger Gene Ontology (GO) and cancer gene set enrichments than other methods. We expect the ChromGene annotations we have produced will be a resource for gene-based epigenomic analyses, and that the methodological approach will be useful for applications to other epigenomic data.

Overview of ChromGene method

ChromGene models the set of epigenomic data across genes with a mixture of HMMs [ 9 ] (“ Methods ,” Fig. 1 ). The set of epigenomic data for each gene, along with a flanking region at each end (default 2 kb), is binarized at fixed-width bins (default 200 bp), indicating observations of each epigenomic mark, before being input to ChromGene. The data for a given gene in a cell type is modeled as being generated by one of M HMMs, each with S “hidden states” (or simply “states”) (black boxes, Fig. 1 a). There are no constraints on allowed transitions between states of one HMM, but transitions between states of different HMMs are not allowed (Fig. 1 b). There is also an initial state distribution, or the probability of seeing a state at the first position of the gene flanking region, over states of the HMMs. We note that otherwise, ChromGene does not directly model gene position information. The prior probability that a gene belongs to a specific mixture component, that is, an individual HMM, corresponds to the sum of initial probabilities of the states of that component. The emission distribution is modeled with a product of independent Bernoulli random variables following ChromHMM [ 9 ] (Fig. 1 c). For given values of M and S , the parameters of the model are learned from the data using an expectation–maximization approach aimed to maximize the likelihood of the model parameters given the data (Fig. 1 d). Once a model is trained, ChromGene computes the posterior probability of each of M mixture components generating each gene’s data in a given cell type. Finally, each gene is then assigned to the component for which it has the greatest posterior probability (Fig. 1 e). We note that while ChromGene also assigns individual bins to hidden states based on the per-state posterior probability assignments (Fig. 1 e), they are not our focus here, as there are effective model-based methods for position-level annotations.
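The assignment step described above can be illustrated with a minimal sketch. It assumes a simplified model without the dummy state introduced later, and the toy parameters, the dictionary layout, and the function names are hypothetical rather than taken from the ChromGene software. Each component's joint probability with the observed bins is computed with the scaled forward algorithm, using initial probabilities that sum to the component's prior, and the gene is assigned to the component with the greatest posterior.

```python
import math

def emission_prob(obs, params):
    """Probability of one binarized bin under independent Bernoulli marks."""
    p = 1.0
    for present, q in zip(obs, params):
        p *= q if present else 1.0 - q
    return p

def component_log_joint(bins, init, trans, emis):
    """log P(bins, component) via the scaled forward algorithm.

    init  : initial probabilities of this component's states; they sum to the
            component's prior probability, not to 1
    trans : row-stochastic transition matrix within the component
    emis  : per-state Bernoulli parameters, one per mark
    """
    n_states = len(init)
    alpha = [init[s] * emission_prob(bins[0], emis[s]) for s in range(n_states)]
    log_joint = 0.0
    for obs in bins[1:]:
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_joint += math.log(scale)
        alpha = [
            sum(alpha[r] / scale * trans[r][s] for r in range(n_states))
            * emission_prob(obs, emis[s])
            for s in range(n_states)
        ]
    total = sum(alpha)
    return log_joint + math.log(total) if total > 0.0 else float("-inf")

def component_posteriors(bins, model):
    """Posterior probability of each mixture component given one gene's bins."""
    logs = [component_log_joint(bins, c["init"], c["trans"], c["emis"]) for c in model]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    norm = sum(weights)
    return [w / norm for w in weights]

# Hypothetical toy model: 2 components, 2 states each, 2 marks.
toy_model = [
    {"init": [0.3, 0.2], "trans": [[0.9, 0.1], [0.2, 0.8]],
     "emis": [[0.9, 0.8], [0.6, 0.1]]},      # mark-rich component
    {"init": [0.25, 0.25], "trans": [[0.7, 0.3], [0.3, 0.7]],
     "emis": [[0.05, 0.05], [0.1, 0.02]]},   # mark-poor component
]
toy_gene = [(1, 1), (1, 0), (1, 1), (0, 0)]  # one tuple of marks per 200 bp bin
posteriors = component_posteriors(toy_gene, toy_model)
assigned = max(range(len(posteriors)), key=posteriors.__getitem__)
```

Working with log joint probabilities and subtracting the maximum before exponentiating keeps the posterior computation stable for long genes, where raw likelihoods would underflow.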

Overview of ChromGene. a ChromGene Model. Each gene is assumed to be generated by one of M = 12 mixture components, each of which is defined by S = 3 hidden states. Transitions within mixture components are allowed, but transitions between mixture components are only allowed by transitioning through the “dummy state” (purple), which only occurs between genes. b Transition matrix. The states within a component have learned probabilities of transition between them (colored and gray “self-transition” cells), but transitions between states in different components are disallowed (white). All states within components are allowed to transition to the dummy state (purple, right column), and the dummy state can transition to any state within any component (purple, bottom row). c Emission matrix. Each state has a separate emission probability for each input mark (colored and gray cells), corresponding to Bernoulli random variable parameters, and the probability of a set of observed marks is modeled using a product of those Bernoulli random variables. States within mixture components are enforced to never emit the “dummy mark” (white, right column), while the dummy state (bottom row) is enforced to never emit input marks (white) and always emit the dummy mark (black). d Data matrix. Input data across all genes and flanking regions is concatenated, with a single observation of “dummy position” between genes. Input data may be emitted within gene body or flanking regions (gray), but not at dummy positions (white cells). The dummy mark (right column) is only emitted at dummy positions (black cells). e IGV browser track [ 26 ] view using 12 input marks, “ChromHMM” annotations [ 27 ], “GENCODE Gene” track, mixture components (“ChromGene”), and components’ hidden states (“Hidden States”; red: state 1, yellow: state 2, green: state 3). Importantly, components’ hidden states are not comparable across different components and are not used in any analysis in this study

To model data from multiple cell types, ChromGene can be applied analogously to the “concatenated” approach of ChromHMM [ 2 , 28 ], which gives cell type-specific assignments based on a common model. To do this, ChromGene treats the same gene in different cell types as if it were different genes in the same cell type when learning the model. Using this common model, a gene is independently assigned to a mixture component for each cell type.

To develop ChromGene, we first noted that a mixture of HMMs can be equivalently expressed using a single, specially constructed HMM. In this single HMM, there is a “dummy state” that is associated with the emission of a “dummy mark,” which is only present before or after genes and their included flanking regions (Fig. 1 a, “ Methods ”). All states within mixture components are initialized to allow transitions to and from this dummy state and between states of the same mixture component, but transitions between different mixture components are disallowed by enforcing those transition probabilities to be 0. Given the equivalence to a single HMM, we trained the ChromGene model by using ChromHMM after making enhancements to ChromHMM to better handle some technical numerical stability issues in this setting (“ Methods ”). We note that this strategy of reducing a mixture of HMMs to a single per-position HMM is general, and it could in principle be used to also provide other region-based annotations or to extend other software designed for per-position annotations, which may differ from ChromHMM in modeling assumptions.
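The structure of this single HMM can be sketched as a mask of allowed transitions. The sketch below is an illustration under stated assumptions, not the ChromHMM implementation: the function names are hypothetical, the dummy state is placed at the last index, a dummy-to-dummy transition is assumed to be disallowed because only one dummy position separates genes, and the uniform initialization is chosen only for simplicity.

```python
def allowed_transitions(n_components, n_states):
    """Boolean mask of allowed transitions; the last index is the dummy state."""
    n = n_components * n_states + 1
    dummy = n - 1
    mask = [[False] * n for _ in range(n)]
    for i in range(dummy):
        for j in range(dummy):
            # States i and j may connect only if they share a mixture component.
            mask[i][j] = (i // n_states == j // n_states)
        mask[i][dummy] = True   # any component state may end the gene
        mask[dummy][i] = True   # the dummy state may start any component
    return mask

def uniform_init(mask):
    """Row-stochastic matrix that is uniform over allowed transitions (assumed initialization)."""
    rows = []
    for row in mask:
        n_allowed = sum(row)
        rows.append([1.0 / n_allowed if ok else 0.0 for ok in row])
    return rows

mask = allowed_transitions(12, 3)   # M = 12 components, S = 3 states
trans = uniform_init(mask)
```

Because expectation-maximization updates for an HMM leave a transition probability at 0 once it starts at 0, initializing disallowed entries to 0 is enough to keep the block structure throughout training.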

ChromGene generates distinct gene-level chromatin annotations

We applied ChromGene to imputed data for ten histone modifications (H3K9me3, H3K36me3, H4K20me1, H3K79me2, H3K4me1, H3K27ac, H3K9ac, H3K4me3, H3K4me2, and H3K27me3), histone variant H2A.Z, and DNase-seq data from 127 cell types, binarized at 200 bp resolution [ 1 , 27 ] across 19,919 protein-coding genes with 2 kb flanking regions [ 16 ] using the “concatenated” approach. We focused our analysis on a model with M = 12 mixture components and S = 3 states per component to balance model expressivity with having meaningful distinctions between individual components and states within a component (“ Methods ,” Additional File 1 : Fig. S1).

Based on the emission parameters of the model (Fig. 2 a), along with the relationship of the components to external data not used in model learning (discussed below), we gave each component a candidate annotation (Fig. 3 , Additional File 2 : Table S1), which we will use to refer to these components henceforth. Eight of the annotations (“strong_trans_enh,” “strong_trans,” “trans_enh,” “trans_cons,” “trans_K36me3,” “trans_K79me2,” “weak_trans_enh,” and “znf”) (“trans”: “transcribed”, “enh”: “enhancer”, “cons”: “constrained”, “K36me3”: “H3K36me3”, “K79me2”: “H3K79me2”) had at least one state with H3K36me3 or H3K79me2, both transcription-associated histone modifications, present at > 31% of positions. All of these annotations had at least one state associated with high frequency of promoter-associated H3K4me3 or H3K4me2 (> 73%) and limited detection of the repressive mark H3K27me3 (< 15%). For four of these annotations (“strong_trans_enh”, “strong_trans”, “trans_enh”, and “trans_cons”), all three of their corresponding states had a high frequency (> 28%) of at least one mark. In contrast, the annotations “trans_K36me3”, “trans_K79me2”, “weak_trans_enh”, and “znf” all had one state with low frequency of all marks (< 7%). Annotation “znf” is notable in that H3K36me3 co-occurs with the repressive mark H3K9me3. This annotation shows 11-fold enrichment for zinc finger named (ZNF) genes on average. This is consistent with previous findings that ZNF genes enrich for per-position chromatin states associated with this combination of marks [ 8 , 27 ], but in contrast to previous work, we also provide a direct model-based annotation of all genes associated with an epigenomic pattern highly enriched in ZNF genes.
Three of the annotations (“strong_trans_enh”, “trans_enh”, and “weak_trans_enh”) had a state in which the H3K4me1 mark had the greatest frequency (> 78% in all cases) and limited H3K4me3 (< 5% of positions), consistent with previously described putative enhancers [ 8 , 27 , 29 ], suggesting these ChromGene annotations contain intragenic enhancers. Annotations “trans_K36me3” and “trans_K79me2” both had a state with a high frequency of active marks typically found at promoters or enhancers, a state with a low frequency of all marks, and a state dominated by H3K36me3 and H3K79me2, respectively.

The emission parameters and assignment of genes to ChromGene annotations. a Heatmap of emission parameters with blue corresponding to a higher probability and white a lower. The ChromGene annotations are labeled on the right and the states within each annotation are labeled on the left. Annotations are ordered from top to bottom by decreasing expression, and states within each annotation are ordered by decreasing enrichment at the gene TSS. Marks are ordered from left to right as previously done [ 27 ]. Transition probabilities are pictured (Additional File 1 : Fig. S2); per-state emission probabilities and enrichments, along with transition probabilities, are also reported (Additional File 2 : Table S1). b Graphical representation of the ChromGene assignment matrix. Columns correspond to cell types, which are ordered as previously done [ 1 ], and their tissue group is indicated by the top colorbar (upper right legend). Rows correspond to 2000 subsampled genes (approximately 10% of all genes). Rows were ordered by hierarchical clustering (“ Methods ”). Each cell is colored by ChromGene annotation for the corresponding cell type and gene (lower right legend)

Brief description of and statistics on each ChromGene annotation. Rows correspond to ChromGene annotations. The columns are as follows: color used for ChromGene annotation; “Mnemonic”—abbreviated name used for annotation; “Description” of each ChromGene annotation based on mark emissions, expression, length, pLI, and other enrichments. Subsequent columns describe summary statistics of ChromGene annotations: “Overall Percentage”, “Median Expression (RPKM)”, and “Median Length (kb)” of genes assigned to annotation; “Percentage of High-pLI Genes (pLI ≥ 0.9)”—percentage of genes assigned to annotation with ≥ 0.9 pLI (probability of Loss of function Intolerance); “Cell Type Specificity”—metric of variability of annotation across cell types; “Housekeeping Gene Enrichment,” “Constitutively Unexpressed Gene Enrichment,” “Constitutive Expressed Gene Enrichment,” “Olfactory Gene Enrichment,” “ZNF Gene Enrichment”—fold enrichment of gene category within annotation compared to “Overall Percentage”

The four remaining annotations (“poised”, “bivalent”, “low”, “quiescent”) lacked any state with the transcription-associated marks, H3K36me3 or H3K79me2, at a high frequency (< 12% for all states). Annotations “poised” and “bivalent” had states with high frequencies of other marks. Notably, the “bivalent” annotation had a state associated with the high presence of the repressive mark H3K27me3 in combination with H3K4me1/2/3, along with another state associated with H3K27me3 alone. Annotation “low” only had one state associated with moderate levels of epigenomic marks, none of which were associated with transcription. No state of the “quiescent” annotation had any detected modifications (all emissions < 1%).

The assignments for all 19,919 genes across 127 cell types are provided (Additional File 3 ) [ 30 , 31 ]. To visualize the assignments, we sampled 2000 genes and clustered them based on assignment to ChromGene annotations across all cell types (“ Methods ,” Fig. 2 b).

ChromGene relationship with gene expression levels

We first investigated how a gene's assignment to a ChromGene annotation relates to its expression level. We compared ChromGene annotations to matched gene expression data for 56 cell types [ 1 ] (“ Methods ”). We separated cell type-gene combinations by their ChromGene annotation and analyzed the distribution of gene expression values (RPKM) for each annotation. We observed that genes assigned to different annotations had varying levels of expression (Fig. 4 ). The two annotations with the highest expression (“strong_trans_enh” and “strong_trans”) both had states with a high frequency of H3K36me3 and H3K79me2. The four annotations with the lowest expression (“poised,” “bivalent,” “low,” and “quiescent”) each had a median RPKM of less than 1. Notably, the “quiescent” and “bivalent” annotations had largely overlapping distributions of gene expression levels, even though the former lacks any state with a substantial frequency of epigenomic marks while the latter is associated with epigenomic marks in multiple states. This highlights how ChromGene provides additional information about genes not captured by gene expression.

Gene expression distribution of ChromGene annotations. The gene expression distribution for each ChromGene annotation across 56 cell types. Gene expression values are in RPKM after adding a pseudocount of 0.1, and then log 10 transforming for visualization

For each ChromGene annotation, we also determined how consistent average gene expression levels were across cell types. Specifically, for each annotation and cell type, we calculated the median log 10 (RPKM + 0.1) expression and analyzed the medians as a function of annotation (Additional File 1 : Fig. S3). We then calculated the mean within-annotation variance across cell types (0.026) and the total variance of the computed medians across cell types and annotations (0.783). We found that the median expression variances within annotations were significantly smaller than across annotations ( p -value < 10 −4 , permutation test), suggesting that ChromGene annotations are informative of expression regardless of cell type.
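The variance comparison above can be sketched as follows. This is a minimal illustration with hypothetical median values; the use of population variance and the particular permutation scheme (shuffling the medians across annotations and cell types) are assumptions, since the exact procedure is described only in the Methods.

```python
import random
from statistics import mean, pvariance

def within_and_total_variance(medians):
    """medians maps annotation -> list of per-cell-type median log10(RPKM + 0.1)."""
    within = mean(pvariance(values) for values in medians.values())
    total = pvariance([x for values in medians.values() for x in values])
    return within, total

def permutation_p_value(medians, n_perm=2000, seed=0):
    """Fraction of label shuffles whose mean within-group variance is as small as observed."""
    rng = random.Random(seed)
    sizes = [len(values) for values in medians.values()]
    pooled = [x for values in medians.values() for x in values]
    observed = mean(pvariance(values) for values in medians.values())
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        start, variances = 0, []
        for size in sizes:
            variances.append(pvariance(pooled[start:start + size]))
            start += size
        if mean(variances) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical medians for three annotations in four cell types.
toy_medians = {
    "strong_trans": [1.5, 1.6, 1.4, 1.55],
    "low": [-0.5, -0.6, -0.4, -0.55],
    "quiescent": [-1.0, -0.95, -1.0, -0.9],
}
within, total = within_and_total_variance(toy_medians)
p_value = permutation_p_value(toy_medians)
```

A small within-annotation variance relative to the total variance indicates that the annotation label, rather than the cell type, accounts for most of the spread in median expression.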

We also quantified how predictive ChromGene assignments are of gene expression and compared it to several baselines. To do this, we first separated genes into two groups: “expressed” genes with RPKM ≥ 1 and “unexpressed” genes with RPKM < 1. We held out genes on one chromosome at a time as a test set and used genes on all other chromosomes as a training set. We calculated the median expression for each ChromGene annotation for genes in the training set and used this as a predictor for the expression of genes in the held-out chromosome, and we found that the ChromGene assignment was a strong predictor of whether genes were expressed or unexpressed (across 56 cell types: mean AUROC = 0.893, standard deviation = 0.021). We also calculated the mean squared error (MSE) between the predicted and observed log 10 (RPKM + 0.1) expression values across all cell types, and we found that predicted expression values were close to the true expression (across 56 cell types: mean MSE = 0.418, standard deviation = 0.050; Pearson r = 0.763). We repeated the evaluations for the following three baseline models (“ Methods ”).

TSS model

The TSS model clusters genes based only on epigenomic mark information at the TSS. This model only considers one position per gene, and thus does not capture additional information throughout the gene.

Gene average model

The gene average model clusters genes based on the average epigenomic mark binarized values throughout the gene, which is equivalent to using ChromGene with only one state per HMM. This model has two key disadvantages: First, it does not take advantage of spatial information; instead, it treats heterogeneous spatial data as if it were homogeneous, eliminating potentially useful information. Second, it is more likely to be biased by gene length because epigenomic marks preferentially associate with specific genic regions that scale differently with gene length (see below).

Collapsed model

The collapsed model clusters genes using a single state per mixture component, as in the gene average model, but here the single state is found by first training a normal multi-state ChromGene model, then “collapsing” each multi-state HMM into a single-state HMM by taking the weighted average of the states in that component. This is meant to show that the differences between ChromGene and a single-state model are likely due to incorporating spatial information of epigenomic marks instead of different instantiations of the models.

We implemented each baseline method to have an identical number of clusters as our ChromGene model. ChromGene assignments were significantly more predictive of gene expression than the three baseline methods (Table S2 , AUROC = 0.893 vs 0.818, 0.889, and 0.888; MSE = 0.418 vs 0.685, 0.429, and 0.438; Pearson r = 0.763 vs 0.601, 0.757, and 0.753; p < 10 −4 for all comparisons, binomial test, for ChromGene vs TSS, gene average, and collapsed models, respectively, “ Methods ”).
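The held-out-chromosome evaluation applied to ChromGene and the baselines can be sketched for a single cell type as follows. The gene records, the fallback for annotations absent from a training set, and the function names are hypothetical, and the pairwise AUROC is written for clarity rather than speed.

```python
import math
from collections import defaultdict
from statistics import median

def log_expr(rpkm):
    """log10(RPKM + 0.1), the transform used for expression values."""
    return math.log10(rpkm + 0.1)

def leave_one_chromosome_out(genes):
    """genes: list of (chromosome, annotation, rpkm). Returns predictions and observed RPKM."""
    predicted, observed = [], []
    for held_out in sorted({chrom for chrom, _, _ in genes}):
        by_annotation = defaultdict(list)
        for chrom, annotation, rpkm in genes:
            if chrom != held_out:
                by_annotation[annotation].append(log_expr(rpkm))
        medians = {a: median(v) for a, v in by_annotation.items()}
        # Assumed fallback: overall training median if an annotation is unseen.
        fallback = median(x for v in by_annotation.values() for x in v)
        for chrom, annotation, rpkm in genes:
            if chrom == held_out:
                predicted.append(medians.get(annotation, fallback))
                observed.append(rpkm)
    return predicted, observed

def auroc(scores, labels):
    """Probability that a positive outranks a negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mse(predicted, observed):
    return sum((p - log_expr(o)) ** 2 for p, o in zip(predicted, observed)) / len(predicted)

# Hypothetical toy data for one cell type.
toy_genes = [
    ("chr1", "strong_trans", 50.0), ("chr1", "quiescent", 0.0), ("chr1", "low", 0.3),
    ("chr2", "strong_trans", 30.0), ("chr2", "quiescent", 0.2), ("chr2", "low", 1.5),
    ("chr3", "strong_trans", 80.0), ("chr3", "low", 0.5),
]
predicted, observed = leave_one_chromosome_out(toy_genes)
auc = auroc(predicted, [rpkm >= 1 for rpkm in observed])   # expressed: RPKM >= 1
error = mse(predicted, observed)
```

The same functions apply unchanged to any clustering of genes, which is what allows ChromGene and the three baselines to be compared on equal terms.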

ChromGene reduces association of gene length with clusters

We next analyzed the relationship between the ChromGene annotations and the lengths of genes assigned to them (Fig. 5 ). We found that the length distributions of most annotations largely overlapped and were concentrated between 10 and 100 kb, but there were a few exceptions. Notably, “low” and “trans_cons” had median gene lengths around 100 kb, while “quiescent” was mostly composed of genes shorter than 10 kb. This “quiescent” annotation was strongly enriched (9.2 fold) for olfactory genes, which have a median length of 1034 bp and are expressed at low levels in most cell types (mean expression = 0.04 RPKM, median expression = 0).

Lengths of genes by ChromGene annotation. Length distribution of different ChromGene annotations. Length axis is log-transformed for visualization

Because gene length distributions varied across ChromGene annotations, we wanted to ensure that ChromGene provided substantial information beyond gene length. We thus evaluated the amount of information shared between ChromGene annotations and gene length, and for comparison, conducted the same evaluations for annotations from the three baseline models. Specifically, for each model and each cell type, we calculated the mutual information between each gene’s assignment and its length. We expected that the baseline TSS model would share the least information with gene length, as it only incorporates data from the TSS and not the whole gene, while among the models that incorporate data from the whole gene, ChromGene would share the least information. Indeed, we found that across cell types, the TSS model had the lowest mutual information (mean of 0.10), and ChromGene had significantly lower mutual information (mean of 0.31) than the gene average and collapsed models (mean of 0.46 and 0.48, respectively, p < 10 −30 , binomial test, “ Methods ,” Additional File 1 : Fig. S4). This indicates that of the models that use information throughout the gene, ChromGene annotations are least likely to simply reflect gene length.
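A mutual information calculation of this kind can be sketched as follows. Gene length is continuous, so it must be discretized first; the equal-frequency binning, the number of bins, and the use of bits (log base 2) are assumptions for illustration, as these details are given only in the Methods.

```python
import math
from collections import Counter

def quantile_bins(values, n_bins):
    """Assign each value to one of n_bins equal-frequency bins (assumed discretization)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins

def mutual_information(xs, ys):
    """Mutual information, in bits, between two discrete label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x, count_y = Counter(xs), Counter(ys)
    return sum(
        c / n * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )

# Hypothetical example: annotation labels and gene lengths (bp) for four genes.
annotations = ["quiescent", "quiescent", "low", "low"]
lengths = [1000, 1200, 90000, 110000]
mi = mutual_information(annotations, quantile_bins(lengths, 2))
```

In this toy case the annotation determines the length bin exactly, so the mutual information equals the entropy of the annotation labels; values near 0 would indicate annotations that carry little information about gene length.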

Characterizing ChromGene assignments across cell types

We hypothesized that for a given gene, some ChromGene annotations were more likely to be consistently assigned across cell types, while other annotations were more likely to co-occur with other annotations. To test this, we conducted enrichment analyses for the co-occurrence of ChromGene assignments across cell types (excluding those that could be considered essentially biological replicates) (Fig. 6 , “ Methods ”). We found that certain pairs of ChromGene annotations are strongly depleted for a given gene across pairs of cell types. For example, genes assigned in one cell type to “trans_cons,” which is highly enriched with housekeeping genes, are strongly depleted for the “quiescent” annotation in other cell types (176 fold depletion compared to expected, corresponding to a −7.46 log 2 ratio), consistent with the “quiescent” annotation’s enrichment for olfactory genes. On the other hand, we also found certain pairs of annotations enriched for the same gene across cell types. For example, genes assigned to “poised” in some cell type were enriched for both “weak_trans_enh” and “low” annotations in others, reflecting this annotation’s association with genes that can be activated or repressed in other cell types [ 32 ].

ChromGene co-assignment matrix enrichment. The log 2 enrichment of combinations of ChromGene assignments over all pairs of non-replicate cell types. Enrichment corresponds to an increased likelihood of two cell types having the corresponding ChromGene assignments for a given gene, relative to randomly choosing based on ChromGene cluster sizes (“ Methods ”)
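The co-assignment enrichment can be sketched as an observed-over-expected ratio across pairs of cell types. The sketch assumes that every pair of columns passed in is a non-replicate pair and that the expected count is the product of the overall annotation frequencies; the data layout and function name are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def coassignment_log2_enrichment(assignments):
    """assignments maps gene -> list of annotations, one per cell type."""
    freq = Counter(a for row in assignments.values() for a in row)
    n_calls = sum(freq.values())
    observed = Counter()
    for row in assignments.values():
        # Ordered pairs of distinct cell types for the same gene.
        for i, j in permutations(range(len(row)), 2):
            observed[(row[i], row[j])] += 1
    n_pairs = sum(observed.values())
    result = {}
    for a in freq:
        for b in freq:
            expected = freq[a] / n_calls * freq[b] / n_calls * n_pairs
            count = observed[(a, b)]
            result[(a, b)] = math.log2(count / expected) if count else float("-inf")
    return result

# Hypothetical toy data: two genes, three cell types.
toy_assignments = {"g1": ["A", "A", "A"], "g2": ["B", "B", "B"]}
enrichment = coassignment_log2_enrichment(toy_assignments)
```

On this scale, the 176-fold depletion reported above corresponds to log2(1/176), which is approximately −7.46.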

As assignments of genes to different annotations across cell types could be expected based on technical variability, we also estimated a confusion matrix for pairs of “replicate” cell types (Additional File 1 : Fig. S5a, “ Methods ”). We found 81.1% concordance in assignments between replicate cell types, as compared to 10.2% expected by the number of genes assigned to each annotation. We next followed a similar process to calculate a contingency table for non-replicates—the probability of a given gene being assigned to one annotation for one cell type given it was assigned to another annotation in another cell type (Additional File 1 : Fig. S5b). We found that between pairs of randomly chosen non-replicate cell types, ChromGene assignments were 57.1% concordant, which is substantially lower than between replicates (81.1%) and substantially higher than random assignment (10.2%).
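The concordance figures above compare an observed agreement rate with a chance baseline. A minimal sketch follows, assuming that the chance baseline is the probability that two independent draws from the overall annotation frequencies agree (the sum of squared frequencies); the function names and toy values are hypothetical.

```python
def chance_concordance(counts):
    """Probability that two independent draws from the annotation frequencies agree."""
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())

def observed_concordance(calls_a, calls_b):
    """Fraction of genes given the same annotation in two cell types."""
    matches = sum(x == y for x, y in zip(calls_a, calls_b))
    return matches / len(calls_a)

# Hypothetical toy values.
baseline = chance_concordance({"A": 50, "B": 50})
agreement = observed_concordance(["A", "B", "A", "A"], ["A", "B", "B", "A"])
```

With 12 annotations of unequal size, this baseline is dominated by the largest annotations, which is why it can exceed the 1/12 expected under equal sizes.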

We further defined the “cell type specificity” of each ChromGene annotation by dividing the diagonal of the contingency table (across non-replicate cell types) by the confusion matrix diagonal (the probability biological replicates would be assigned to the same annotation), then subtracting this value from 1 to obtain a “cell type specificity score” so that higher values correspond to more cell type-specific annotations. We found that the annotations that varied the least across cell types were “quiescent,” “strong_trans,” “znf,” “bivalent,” and “trans_K36me3” with cell type specificity scores from 0.16 to 0.24 (Fig. 3 , Additional File 2 : Table S1). The most cell type-specific annotations were “weak_trans_enh,” “low,” and “trans_enh” with scores of 0.49, 0.43, and 0.41 respectively. “Weak_trans_enh” having the highest cell type specificity score is consistent with this annotation being primarily associated with enhancers (Additional File 1 : Fig. S6), which are known to be highly cell type-specific [ 2 , 33 ]. The “low” annotation also had high cell type specificity despite having similarly low expression to the “quiescent” annotation, and genes assigned to it were often assigned to the “poised” annotation in other cell types (Fig. 6 ).
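The specificity score defined above is a one-line calculation once the two diagonals are available. In this sketch the probabilities are hypothetical illustrative values, not results from the study.

```python
def cell_type_specificity(nonreplicate_diag, replicate_diag):
    """1 - P(same annotation | non-replicates) / P(same annotation | replicates)."""
    return {
        annotation: 1.0 - nonreplicate_diag[annotation] / replicate_diag[annotation]
        for annotation in replicate_diag
    }

# Hypothetical diagonals of the contingency table and the confusion matrix.
scores = cell_type_specificity(
    nonreplicate_diag={"stable": 0.72, "variable": 0.40},
    replicate_diag={"stable": 0.90, "variable": 0.80},
)
```

Dividing by the replicate diagonal normalizes for technical variability, so an annotation that is hard to reproduce even between replicates is not mistaken for a cell type-specific one.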

We next evaluated the extent to which changes in a gene’s ChromGene assignment across pairs of cell types were reflected (on average) in a change in expression of that gene (Additional File 1 : Fig. S7a). We found that for most pairs of ChromGene annotations, changes in expression were largely consistent with their overall individual expression patterns (Fig. 4 , Additional File 1 : Fig. S7c,e). For example, if a gene was assigned in cell type i to a low-expression annotation, such as “quiescent,” and a high-expression annotation in cell type j , such as “strong_trans_enh,” then the expression of that gene in cell type j would typically be substantially higher than in i (“ Methods ”). Interestingly, we found a few pairs of ChromGene annotations for which expression did not vary as much as expected based on individual patterns (Additional File 1 : Fig. S7c,e). Together, these results show that while changes in ChromGene assignment across cell types are typically reflected in expression changes for their associated annotations, some genes do not follow this pattern.

ChromGene annotations are differentiated by gene set enrichments

We next analyzed the enrichment of ChromGene annotations with respect to various gene sets (Additional File 2 : Table S1). These gene sets included ZNF-named genes, constitutively unexpressed genes, housekeeping genes, olfactory genes, “biological processes” Gene Ontology (GO) terms, and cancer-related gene sets [ 1 , 16 , 34 , 35 , 36 , 37 , 38 ].

ZNF genes had the highest enrichment for the “znf” annotation (11.0 fold, median p < 10 −200 , hypergeometric test, “ Methods ”). Constitutively unexpressed genes (RPKM < 1 in all 56 cell types with matched expression available) [ 1 ] were most enriched in the “quiescent” annotation (5.8 fold, p < 10 −300 ). In contrast, a set of previously defined housekeeping genes, based on broad and constant expression levels [ 39 ], was most enriched in the “strong_trans” annotation (3.0 fold, p < 10 −300 ), which had a 131 fold depletion for constitutively unexpressed genes. For olfactory genes [ 34 ], we observed the strongest enrichment in the “quiescent” annotation (9.2 fold enrichment, p < 10 −200 ), which contained 75.1% of all olfactory genes.

For the GO term enrichments [ 35 ], we calculated an adjusted enrichment p -value for each GO term gene set for each cell type and ChromGene annotation (hypergeometric test, Bonferroni corrected for the number of combinations of 12 ChromGene annotations and 6036 GO terms). For each ChromGene annotation, we identified GO terms that were enriched in the majority of cell types (adjusted p < 0.01) (“ Methods ,” Additional File 1 : Fig. S8a, Additional File 2 : Table S1). The ChromGene annotations “strong_trans_enh” and “strong_trans,” which had the most highly expressed genes on average, had the greatest number of such GO terms, constituting 26 and 44% of the significant enrichments, respectively (Fig. 7 c). The most significant terms for “strong_trans_enh” and “strong_trans” included core regulatory and metabolic processes terms such as “SRP-dependent cotranslational protein targeting to membrane” and “cytoplasmic translation” for “strong_trans_enh” and “mRNA splicing, via spliceosome” and “mRNA processing” for “strong_trans.” The ChromGene annotation “bivalent” had several neuron- and development-related enriched GO terms, including “neuron differentiation” and “anterior/posterior pattern specification.” In contrast, the “quiescent” annotation, also associated with low expression, showed significant enrichments for terms such as “sensory perception of smell,” consistent with its enrichment for olfactory genes. The different GO enrichments for annotations with similar expression levels suggest ChromGene annotations provide information beyond expression.
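The enrichment test described above can be sketched with exact integer arithmetic from the standard library. The function names and the toy counts are hypothetical; the number of tests shown follows the correction stated above (12 annotations times 6036 GO terms).

```python
from math import comb

def hypergeom_enrichment(n_genes, n_in_set, n_in_annotation, n_overlap):
    """Fold enrichment and upper-tail hypergeometric p-value, P(overlap >= n_overlap)."""
    denominator = comb(n_genes, n_in_annotation)
    upper = min(n_in_set, n_in_annotation)
    p_value = sum(
        comb(n_in_set, i) * comb(n_genes - n_in_set, n_in_annotation - i)
        for i in range(n_overlap, upper + 1)
    ) / denominator
    fold = (n_overlap / n_in_annotation) / (n_in_set / n_genes)
    return fold, p_value

def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)

# Hypothetical toy counts: 10 genes, a 5-gene set, a 5-gene annotation, 5 shared genes.
fold, p_value = hypergeom_enrichment(10, 5, 5, 5)
adjusted = bonferroni(p_value, 12 * 6036)
```

For genome-scale inputs the binomial coefficients are very large integers, which Python handles exactly; only the final division converts to floating point, where extremely small p-values may underflow to 0.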

GO term and cancer gene set enrichment. a Heatmap showing enrichment −log 10 ( p -values) of top GO terms (rows) for six gene groups (columns): all “inactive genes” (< 1.0 RPKM), and subsets of those genes assigned to each of five ChromGene annotations associated with lower expression. Asterisks denote significant enrichments after multiple testing correction (“ Methods ”). Square brackets denote the number of genes in each GO term or gene group. The first row corresponds to the most enriched GO term based on expression only. The second through sixth rows correspond to the most enriched GO term for each of the subsets corresponding to the ChromGene annotations listed in the second through sixth columns, respectively. “Sensory perception of smell” is listed twice since it was most enriched based on both expression only and the subset with the “quiescent” ChromGene annotation. b As in a , but for highly expressed genes (> 100 RPKM) and using the three ChromGene annotations with the highest overall expression, “strong_trans_enh,” “strong_trans,” and “trans_enh.” c Count of how often each ChromGene annotation yielded significant enrichment p -values (adjusted p < 0.01, Bonferroni corrected for the number of combinations of 127 cell types, 12 ChromGene annotations, and gene sets) across 6036 “biological process” GO term gene sets. d As in c , but for 967 cancer gene sets

To more directly investigate whether ChromGene annotations provide information beyond expression in the context of GO term enrichments, we compared GO term enrichments for gene sets defined based on only gene expression to those defined by both their gene expression and ChromGene assignment. Specifically, we first computed GO term enrichment for unexpressed or minimally expressed genes (“inactive genes”, RPKM < 1.0) in “Brain Germinal Matrix” using the same procedure as above (“ Methods ”), then found the most enriched term. Then, for each of the ChromGene annotations associated with lower average expression (“znf,” “poised,” “bivalent,” “low,” and “quiescent”), we took the subset of the inactive genes that also had the ChromGene annotation. For each subset, we repeated the GO term enrichment analysis and found the most enriched GO term. We took the most enriched GO term for each of these sets, then analyzed the enrichment p -values in all the other sets considered in this analysis (Fig. 7 a). We found that the GO term most enriched in inactive genes, “sensory perception of smell (GO:0007608),” was substantially more enriched when subsetting to “quiescent” genes despite having fewer genes in the set (2.5 fold vs 9.1 fold, adjusted p < 10 −24 vs p < 10 −52 , Bonferroni corrected, “ Methods ”). This subset of “quiescent” genes was also enriched in this GO term when using the set of all inactive genes as the background (3.1 fold, adjusted p < 10 −24 ). Additionally, we found that for each of the other ChromGene annotation subsets, its most enriched GO term had a more significant p -value than for all inactive genes. For example, the “bivalent” subset was significantly enriched in “anterior/posterior pattern specification (GO:0009952)” (5.4 fold, adjusted p < 10 −14 ), consistent with the role of bivalent genes in development [ 40 ], but the group of all inactive genes was only marginally enriched (1.8 fold, adjusted p = 0.016).

We repeated the analysis with highly expressed genes (RPKM > 100) using the three ChromGene annotations with the highest average expression (“strong_trans_enh,” “strong_trans,” and “trans_enh”), and again found increased significance of notable GO terms when using ChromGene annotations to subset these expressed genes. For example, in the “Fetal Brain Female” cell type, we found the ChromGene annotation “trans_enh” was significantly enriched for “neural tube development (GO:0021915)” (61.5 fold, adjusted p < 0.01), but was not significantly enriched using expression only (13.2 fold, adjusted p = 1.0) (Fig. 7 b). When using a background of highly expressed genes instead of all genes, we observed a marginally significant enrichment for this GO term in “trans_enh” (4.7 fold enrichment, p < 0.002).

To determine whether ChromGene annotations provide additional information beyond gene expression regardless of cell type, we compared the number of GO terms enriched when splitting genes by ChromGene assignments to splitting genes randomly. We found that for both the unexpressed (< 1 RPKM) and highly expressed (> 100 RPKM) genes, for all 56 cell types tested, splitting genes by ChromGene assignments yielded more enriched GO terms (mean of 51.6 vs 1.8 for RPKM < 1, mean of 25.1 vs 5.2 for RPKM > 100, p < 10 −16 , binomial test) (Additional File 1 : Fig. S8b,c, “ Methods ”). As previous studies have assumed that increased GO significance corresponds to greater biological significance [ 41 , 42 ], these results would suggest increased biological relevance of using ChromGene annotations. However, we note that it is possible a method’s annotations could yield more significant GO enrichments while not necessarily being more biologically relevant, which would be difficult to determine without inherent ground truth. Together, these results suggest that ChromGene, used in tandem with expression information, can identify subsets of genes with more specific biological roles than expression alone. Further, they show that ChromGene can provide useful biological information even in the absence of significant expression.

Across all “biological process” GO terms and all ChromGene annotations, we saw an average of 133 enriched GO terms per cell type (adjusted p < 0.01). This was substantially more than for the three baseline methods (TSS, gene average, collapsed), where we saw an average of 53, 109, and 82 gene sets significantly enriched (adjusted p < 0.01), respectively. These results demonstrate that ChromGene annotations yield more “biological process” GO term enrichments than the baseline methods.

We next took 967 cancer-related gene sets, produced by the Cancer Cell Line Encyclopedia and implicated in various types of cancer [ 37 ], and, as before, performed a gene set enrichment analysis for each ChromGene annotation. Interestingly, we found that unlike for the “biological process” GO terms, significant enrichment for cancer gene sets was most likely to be found in lowly expressed annotations, particularly “poised,” “bivalent,” and “low,” which constituted the majority of enrichments, with 21, 60, and 16%, respectively (Fig. 7 d). The enrichment of the “poised” and “bivalent” annotations is consistent with previous observations that misregulation of poised and bivalent chromatin regions is associated with cancer [ 43 , 44 ]. In total, we found that ChromGene had an average of 151 cancer gene sets enriched per cell type. In comparison, the baseline methods (TSS, gene average, collapsed) had 71, 72, and 70 gene sets enriched (hypergeometric test, p < 0.01, Bonferroni corrected for the number of combinations of 127 cell types, 12 annotations, and 967 gene sets), respectively, substantially fewer than for ChromGene. These results further support that ChromGene annotations are more consistent with established cancer gene sets than the baseline approaches.
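As an illustration of the enrichment test named above, the following minimal sketch computes a hypergeometric tail probability and a Bonferroni-adjusted p-value using only the Python standard library. It is not the code used in the study; the function names and the toy counts are assumptions made for this example, while the number of tests follows the combination count stated in the text.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N genes, K of which are in the gene set."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total

def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)

# Number of tests stated in the text: cell types x annotations x gene sets.
N_TESTS = 127 * 12 * 967

# Toy counts (made up): 1000 background genes, 20 in the gene set,
# 50 genes carrying the annotation, 8 of which are in the gene set.
p_raw = hypergeom_sf(8, 1000, 20, 50)
p_adj = bonferroni(p_raw, N_TESTS)
```

With roughly 1.5 million tests, a raw p-value must fall below about 7 × 10⁻⁹ to pass an adjusted threshold of 0.01.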

ChromGene separates genes by pLI scores

We next explored the relationship between ChromGene annotations and gene constraint, focusing on genes with high pLI scores (≥ 0.9) [ 25 ]. A high pLI score indicates strong evidence that a gene is intolerant to loss of function, i.e., haploinsufficient.

The three ChromGene annotations most enriched for high pLI score genes were “trans_cons,” “strong_trans_enh,” and “strong_trans” (Fig. 8 ), with an average of 49.6, 35.8, and 33.3% of their assigned genes having high pLI scores across cell types, respectively, compared to 18.1% expected by chance. These three annotations were also among the four annotations associated with the highest expression.

Fig. 8 Expression and pLI scores per ChromGene annotation. Scatter plot showing gene expression (RPKM) on the x -axis and the proportion of high-pLI genes (pLI ≥ 0.9) on the y -axis. Each point corresponds to genes assigned to a specific ChromGene annotation in a single cell type. Points are colored by their ChromGene annotation. The dashed line corresponds to the overall proportion of genes with high pLI. Gray curves represent the proportion of high-pLI genes as a function of expression, one per cell type. The dotted black line is the proportion of high-pLI genes, averaged across cell types, as a function of expression (“ Methods ”). The figure shows that although there is a positive association between the two measures, ChromGene captures information beyond expression. For example, “trans_enh” and “trans_cons” have similar expression levels, but substantially different proportions of high-pLI genes.

Notably, “trans_cons” had a substantially larger percentage of high-pLI genes (49.6%) than “trans_enh” (14.7%), despite a positive correlation between mean ChromGene annotation pLI score and expression (Spearman r from 0.18 to 0.39, mean r = 0.27 across cell types, all p -values < \(10^{-100}\) , “ Methods ”), and “trans_cons” having slightly lower overall expression than “trans_enh” (median RPKM = 12.73 and 13.89, respectively). This difference is consistent with the substantial difference in gene length between the “trans_cons” and “trans_enh” annotations (Fig. 5 ), and a previously noted positive correlation between gene length and pLI [ 25 ]. We still saw a difference in percentage of high-pLI genes between “trans_cons” and “trans_enh” after correcting for gene length, although it did decrease, with 28.8 and 20.9%, respectively, of annotated genes having high pLI scores (“ Methods ,” Additional File 1 : Fig. S9, Additional File 2 : Table S1). Another difference between “trans_cons” and “trans_enh” is that genes annotated as “trans_cons” are more likely than “trans_enh” to be assigned to the “strong_trans” or “trans_K79me2” annotations in other cell types (Fig. 6 , Additional File 1 : Fig. S5b).

Among the lowly expressed ChromGene annotations, “low” and “quiescent” had the greatest difference in proportion of high-pLI genes (21.7 and 1.0%, respectively), although this difference was smaller after correcting for gene length (8.9 vs 2.9%, respectively) (“ Methods ”). Compared to the “quiescent” annotation, the “low” annotation contained longer genes and genes more likely to be assigned in other cell types to ChromGene annotations with a greater proportion of high-pLI genes. Similar patterns also held when considering mean pLI (Additional File 1 : Fig. S10). These results further show that ChromGene captures additional information beyond expression.

Here, we introduced ChromGene, a principled, model-based method that uses a mixture of HMMs to annotate genes based on maps of multiple epigenomic marks across genes. ChromGene’s focus on gene assignments is complementary to well-established approaches for generating per-position assignments [ 9 , 10 ].

We applied ChromGene to imputed data for 12 epigenomic marks to annotate genes in over 100 cell types. We showed that ChromGene annotations frequently reflect distinct gene expression levels. In cases where ChromGene annotations had similar gene expression levels, we found they differed on other important properties such as gene set enrichments, including for high pLI score genes, reflecting that ChromGene annotations capture information beyond gene expression levels. We also showed that ChromGene yielded better agreement with gene expression data and more significant enrichments for cancer gene sets and GO terms than baseline approaches.

As ChromGene directly assigns genes to annotations, it will likely be preferable for many gene-centric analyses relative to per-position chromatin state assignments. While ChromGene annotations do correspond to patterns previously identified by per-state annotations, ChromGene annotations are more expressive in that they can associate multiple per-position patterns with the same gene-based annotation. Relating per-position annotations directly to genes has often involved using post hoc approaches mapping these annotations to individual genes. However, such approaches are subject to loss of information and can be biased by gene length. ChromGene mitigates these issues by directly providing a model-based annotation of individual genes. We note that ChromGene annotations are less likely to have advantages over per-position annotations when the object of study is not inherently gene-centric, such as studying enrichments for GWAS-identified variants.

There are multiple possible additional types of applications and extensions of ChromGene that can be investigated in future work. Although here, we applied ChromGene to protein-coding genes, ChromGene is more general and can also be applied to other pre-defined sets of genomic intervals such as long non-coding RNAs and pseudogenes. In this study, we combined our training data in a “concatenated” mode, where chromosomes across different cell types are treated as separate, and a single model is trained that applies to each epigenome. However, the input data for ChromGene can also be “stacked,” where the emissions of marks across cell types are all observed simultaneously [ 28 , 45 ]. This can be used to annotate genes with patterns of variation learned across cell types. In this study, we applied ChromGene to a set of 12 imputed data sets to have a comparable set of input features across many cell types. However, ChromGene can be applied with other choices for the input data. In this study, we did not analyze ChromGene’s per-position annotation of genomic bins to individual mixture component hidden states, as we see the gene-level annotations as the main value for ChromGene relative to existing per-position annotations. However, investigating the use of ChromGene’s per-position annotations for specific applications could be a direction for future work.

Conclusions

ChromGene generates gene-level annotations, analogous to chromatin states at the position level, by modeling the combinatorial and spatial pattern of epigenomic marks using a mixture of HMMs. We used ChromGene to assign nearly 20,000 protein-coding genes for over 100 cell types to twelve gene-level chromatin annotations. We have shown that these annotations not only capture variation in gene expression, but also provide additional information. The assignment of genes to ChromGene annotations, along with the trained ChromGene model and software we used to generate them, are publicly available [ 30 , 31 ]. We expect that the ChromGene assignments we have generated will be a useful resource for gene-based analysis in and across many cell types and that the approach will be useful for gene-based annotations for data from additional marks, cell types, or species.

ChromGene model

For a single cell type, ChromGene uses a mixture of multivariate HMMs to model the combinatorial and spatial patterns within multiple epigenomic maps across a set of genes to derive a single assignment per gene. Each gene is assumed to be generated by one of M fully connected HMMs, each with S states. Each of these M HMMs is defined by three sets of parameters: (1) the initial probability for each state within the HMM, (2) the probability of transitioning from one state to another state in the same HMM, and (3) the probability of observing each of E binarized epigenomic mark emissions given a state in the HMM. Although not explicitly trained, each HMM’s prior probability of being chosen for a gene is equal to the sum of the initial probabilities of its states.

More formally, we denote mixture components as m , 1 ≤ m ≤ M , where M is the total number of components; we denote states as s , 1 ≤ s ≤ S , where S is the number of states per component. The total number of states is thus \(M\times S\) , and each hidden state is denoted \({h}_{m,s}\) . The initial probability, transition probabilities, and emission probabilities are defined as follows:

Initial probabilities

The initial probability for each state, denoted \({\tau }_{m,s}\) , corresponds to the probability that a randomly chosen gene starts in state \({h}_{m,s}\) . The sum of all the initial probabilities of states in a mixture component m corresponds to the initial probability of a gene starting in component m : \({\tau }_{m}={\sum }_{s=1}^{S}{\tau }_{m,s}\) , and thus to the prior probability of the gene being assigned to component m .

Transition probabilities

Each mixture component is modeled as a fully connected HMM. The transition parameters \({\alpha }_{ms,ms'}\) correspond to the probability of a state s within component m , \({h}_{m,s}\) , transitioning to a state \(s'\) within the same component m , \({h}_{m,s'}\) . Transitions from states within one component to states in another component are not allowed.

Emission probabilities

The probability of observing a specific combination of marks in a state s of component m , \({h}_{m,s}\) , is calculated based on the product of independent Bernoulli random variables, with emission parameters \({\beta }_{m,s,e}\) , over all marks following the approach of ChromHMM [ 9 ]. \({\beta }_{m,s,e}\) and \((1-{\beta }_{m,s,e})\) represent the probability of observing mark e as present or not present at a specific position in state \({h}_{m,s}\) , respectively.
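The product-of-Bernoullis emission can be written in a few lines. The sketch below is illustrative only; the function name and the example parameter values are assumptions, not part of ChromGene or ChromHMM.

```python
def emission_prob(obs, beta):
    """Probability of a binary mark vector `obs` in one state.

    obs  : list of 0/1 values, one per mark e
    beta : list of beta_{m,s,e}, the probability that mark e is present
    """
    p = 1.0
    for x, b in zip(obs, beta):
        # Present marks contribute beta, absent marks contribute (1 - beta).
        p *= b if x == 1 else (1.0 - b)
    return p

# Made-up parameters for a state with two marks.
example = emission_prob([1, 0], [0.9, 0.2])  # 0.9 * (1 - 0.2)
```

Because the marks are treated as independent given the state, the probabilities of all \(2^{E}\) possible mark combinations sum to 1 for each state.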

To apply ChromGene to multiple cell types, we apply the same modeling approach but treat data from multiple additional cell types as if they were additional genes in one cell type. This leads to a common model across cell types, but with cell type-specific gene assignments. This approach is analogous to the “concatenated” model learning approach of ChromHMM [ 2 , 28 ].

Model learning

Parameter initialization.

The initial parameters of states across all mixture components are initialized to sum to 1 using a Dirichlet distribution with \({\alpha }_{m,s}=1\) . The emission parameters are initialized for each state by first randomly assigning each gene to one of the M components. Then, each position in the genic region and flanks is assigned uniformly at random to one of the S states within that component. Finally, for each of the \(M\times S\) hidden states, we set the initial emission probability of each epigenomic mark to its average frequency of being observed present across all positions assigned to the state. For each component m and state s in the component, the transition parameters from the state to each state in its component are initialized to sum to 1 using a Dirichlet distribution with \({\alpha }_{s}=1\) .
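The Dirichlet initialization described above can be sketched with the standard library, using the fact that normalizing independent Gamma(1, 1) draws yields a sample from a Dirichlet distribution with all concentration parameters equal to 1. The variable names, the random seed, and the choice of M = 12 and S = 3 for the example are assumptions for illustration; this is not the initialization script distributed with ChromGene.

```python
import random

def dirichlet_ones(k, rng):
    """Sample a length-k probability vector from Dirichlet(1, ..., 1)."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

M, S = 12, 3
rng = random.Random(0)

# Initial probabilities: one draw over all M x S states, so they sum to 1 overall.
flat = dirichlet_ones(M * S, rng)
tau = [flat[m * S:(m + 1) * S] for m in range(M)]

# Transitions: one draw per state over the S states of its own component.
alpha = [[dirichlet_ones(S, rng) for _ in range(S)] for _ in range(M)]
```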

Expectation–Maximization procedure

After initializing parameters, ChromGene trains the parameters iteratively with the Baum-Welch algorithm, a special case of the Expectation–Maximization (EM) algorithm, until convergence. In the E step, ChromGene takes the current parameter values (emission, transition, and initial probabilities) and calculates the log likelihood of the data being observed given those values, \(\mathrm{log}(L(\theta ;X,H))=\mathrm{log}(p(X,H|\theta ))\) , where \(\theta =\{{\theta }_{\tau }, {\theta }_{\alpha }, {\theta }_{\beta }\}\) , the set of initial, transition, and emission parameters, respectively, X is the observed data, and H is the set of true, but hidden state assignments. In this step, ChromGene calculates the posterior probability of each position of each gene being generated by each of the \(M\times S\) hidden states.

In the M step, these posterior probabilities are used to re-estimate the parameters \(\theta\) so that the new parameter values maximize the expected log likelihood computed in the E step, i.e., \({\theta }^{t+1}={\text{argmax}}_{\theta }\,{\mathrm{E}}_{H|X,{\theta }^{t}}[\mathrm{log}(L(\theta ;X,H))]\) . In our application, we trained for 200 iterations, as is default in ChromHMM, and parallelized the E step computation across 8 cores using the option `-p 8`. For faster computation, we evaluated only a randomly selected subset of the data in each iteration.

Implementation on top of ChromHMM

To implement ChromGene on top of ChromHMM, we first noted that a mixture of HMMs can be defined in terms of a single HMM with certain transitions disallowed. Specifically, a set of M fully connected HMMs, each acting as a mixture component, can be represented with a single, larger HMM, and multiple genes within the same file can be handled with the addition of a “dummy state” and “dummy mark.” In this single HMM, each non-dummy state of the HMM can transition into and out of other states within its component and a single dummy state, but not directly to states of other HMM components (Fig. 1 a,b). From the dummy state, transitioning to any component is allowed, but only through the use of the dummy mark, which forces transitions into and out of the dummy state.

We structured the input data so that all genes in a chromosome and cell type (after adding 2-kb flanks on both ends and reversing genes on the negative strand) were concatenated head to tail, but with a single “dummy position” separating the genes, and a dummy position at the beginning and end of the data. In this dummy position, the emission for all the original input marks is set to 0, but we include an emission of 1 for a single, new mark designated the “dummy mark” (Fig. 1 c). This dummy mark has an emission of 0 within all extended genic regions, and 1 only for the dummy position (Fig. 1 d). Upon reaching the dummy position, the model forces transition to the dummy state. From this dummy state, transition to any of the mixture components is allowed.
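The layout of the concatenated input can be illustrated as follows. This is a minimal sketch of the data structure only, under the assumption that each gene is already represented as a list of binarized 200-bp bins; the function name is hypothetical, and the sketch does not reproduce the ChromHMM input file format.

```python
def concatenate_genes(genes, n_marks):
    """Concatenate genes head to tail, separated by dummy positions.

    genes   : list of genes; each gene is a list of bins; each bin is a
              list of n_marks 0/1 values (flanks already included)
    returns : list of rows with n_marks + 1 columns; the last column is
              the dummy mark
    """
    dummy_row = [0] * n_marks + [1]      # all real marks 0, dummy mark 1
    rows = [dummy_row[:]]                # dummy position at the start
    for gene in genes:
        for bin_obs in gene:
            rows.append(list(bin_obs) + [0])   # dummy mark is 0 inside genes
        rows.append(dummy_row[:])        # dummy position after each gene
    return rows

# Two toy genes with two marks each.
toy = concatenate_genes([[[1, 0], [1, 1]], [[0, 0], [0, 1], [1, 1]]], n_marks=2)
```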

This structure also allowed us to compactly represent all genes without having separate files for each gene and cell type. For a single cell type, we had 23 chromosome files (chromosomes 1–22 and X). For 127 cell types, this yielded 127 × 23 = 2921 total input data files. We note that in theory, we could apply an equivalent model without dummy marks and states using ChromHMM by creating a separate file for each combination of gene and cell type. However, such a process for our training scenario would produce approximately 2.5 million files, which can be over system limits for the number of files.

We generated initial parameter files for the initial parameters described above. For M mixture components, S states per component, and the dummy state, we had a total of ( M × S ) + 1 states. The initial probability for each state in each component was set to 0, and the initial probability of the dummy state was set to 1. The transition probabilities from any state in a component to each other state in the same component were allowed to be nonzero and were normalized so that they summed to 0.95. The remaining 0.05 was set as the probability of transitioning from the component’s states to the dummy state. The probabilities of transitioning from the dummy state to each of the other M × S states correspond to the initial parameters, which, as described above, were initialized using a Dirichlet distribution with \({\alpha }_{m,s}=1\) for all components so that they summed to 1. The transition probability from a state in a component to each state in different components was set to 0.
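The resulting block structure of the transition matrix can be sketched as below. The state ordering (component-major, dummy state last), the function name, and the toy values are assumptions for illustration, and within-component rows are assumed to include self-transitions; the sketch does not write a ChromHMM model file.

```python
def build_transitions(within, tau, to_dummy=0.05):
    """Build the (M*S + 1) x (M*S + 1) initial transition matrix.

    within : within[m][s][s2], rows summing to 1 inside component m
    tau    : tau[m][s], initial probabilities summing to 1 over all states
    """
    M, S = len(within), len(within[0])
    n = M * S + 1
    dummy = n - 1
    A = [[0.0] * n for _ in range(n)]
    for m in range(M):
        for s in range(S):
            row = A[m * S + s]
            for s2 in range(S):
                # Within-component mass is rescaled to 1 - to_dummy (0.95).
                row[m * S + s2] = (1.0 - to_dummy) * within[m][s][s2]
            row[dummy] = to_dummy
            # Leaving the dummy state plays the role of the initial probabilities.
            A[dummy][m * S + s] = tau[m][s]
    return A

within = [[[0.5, 0.5], [0.2, 0.8]], [[0.9, 0.1], [0.3, 0.7]]]
tau = [[0.1, 0.2], [0.3, 0.4]]
A = build_transitions(within, tau)
```

Entries linking states of different components stay at exactly 0, which is what allows flags that freeze 0-valued parameters to keep the components separate during training.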

Finally, the emission probabilities for the non-dummy marks in the non-dummy states were initialized as described above. For the dummy state, the emission probability of each epigenomic mark was set to 0, and the emission probability of the dummy mark was set to 1; in the non-dummy states, the emission probability of the dummy mark was set to 0. All input files were generated using the script `generate_chromgene_input_files.py` with default parameters [ 30 , 31 ].

We passed the data and the initialized parameter files into ChromHMM’s `LearnModel` function and used several optional flags. We used the `-e 0` and `-t 0` flags to enforce that a 0 or 1 emission or transition parameter in the initial parameter file remains at that value. We also used the `-scalebeta` flag to increase numerical stability in our setting (below, “Handling numerical stability issues”). We used the flag `-n 100` to sample 100 files for each iteration, and `-d -1` to allow the log likelihood to increase between iterations. The log likelihood can increase between iterations because different subsamples of the data are evaluated on different iterations. We trained for 200 iterations, as is default in ChromHMM. We used version 1.18 of ChromHMM.

After training the model, we used ChromHMM to generate posterior probability assignment files. We used these files to calculate, for each gene in each cell type, the posterior probability that the gene was generated by each of the M HMMs; finally, we took the HMM with the highest posterior probability to derive hard assignments. These assignments were generated using the script `chromgene_posteriors_to_components.py` with default parameters [ 30 , 31 ].

Handling numerical stability issues

Implementing a mixture of HMMs raised some numerical stability issues not encountered with standard, single HMMs in ChromHMM. In this setting, it is expected that for some genes, there will be specific individual HMMs that would have essentially zero probability of generating the observations for the gene. Previously, to deal with numerical stability issues, ChromHMM used a single approach to scaling, which was to rescale both the forward (alpha) variables of the HMMs and the backward (beta) variables at a specific position by the sum of the forward variables. This followed previous standard presentations of HMMs [ 46 , 47 ]. However, this scaling approach led to numerical overflow in our setting. Namely, the magnitudes of the backward variables at a specific position can be vastly greater than the sum of the forward variables, and this difference can continue to be amplified over long sequences, which in some cases can lead to numerical overflow. This likely happens specifically in this setting because the forward variables for a sequence can be dominated by one mixture component while the backward variables for a sequence can be dominated by another mixture component. To handle the overflow, we rescaled the backward variables based on the sum of the backward variables, which is valid as previously noted [ 48 ].

While rescaling the backward variables by the sum of the backward variables eliminated the numerical overflow, numerical underflow issues were then encountered. Specifically, situations were encountered in certain cases for individual genes where the product of each forward and backward variable for every state was 0, which would lead to division by 0 and no meaningful overall output. This also likely happens in certain situations in which one mixture component dominates the forward variables and another one the backward variables. To avoid the numerical underflow, if a backward variable for a state at a position fell below \(10^{-300}\) , we set its value to \(10^{-300}\) . We did the same for forward variables, except if the emission probability product for a state at the position was actually 0, in which case the forward variable was kept at 0. These changes were sufficient to lead to a numerically stable procedure.
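The scaling scheme described above can be sketched for a small, generic HMM. The code below is a simplified illustration written for this text, not ChromHMM's implementation: forward variables are rescaled by their own sum, backward variables by theirs, and both are floored at \(10^{-300}\), with forward variables left at 0 when the emission probability is exactly 0. The function name, argument layout, and toy parameters are assumptions.

```python
FLOOR = 1e-300

def posteriors(init, trans, emit):
    """Posterior state probabilities gamma[t][i] for one sequence.

    init[i], trans[i][j] : HMM parameters
    emit[t][i]           : emission probability of observation t in state i
    """
    T, n = len(emit), len(init)
    fwd = []
    for t in range(T):
        if t == 0:
            raw = [init[i] * emit[0][i] for i in range(n)]
        else:
            raw = [emit[t][j] * sum(fwd[-1][i] * trans[i][j] for i in range(n))
                   for j in range(n)]
        z = sum(raw)  # rescale forward variables by their own sum
        # Floor tiny values, but keep zeros caused by a zero emission product.
        fwd.append([v / z if emit[t][i] == 0.0 else max(v / z, FLOOR)
                    for i, v in enumerate(raw)])
    bwd = [None] * T
    bwd[T - 1] = [1.0 / n] * n
    for t in range(T - 2, -1, -1):
        raw = [sum(trans[i][j] * emit[t + 1][j] * bwd[t + 1][j] for j in range(n))
               for i in range(n)]
        z = sum(raw)  # rescale backward variables by their own sum
        bwd[t] = [max(v / z, FLOOR) for v in raw]
    gamma = []
    for t in range(T):
        prod = [fwd[t][i] * bwd[t][i] for i in range(n)]
        z = sum(prod)
        gamma.append([p / z for p in prod])
    return gamma

# Toy two-state HMM with three positions (made-up parameters).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
emit = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.7]]
gamma = posteriors(init, trans, emit)
```

Any per-position rescaling is valid here because the posterior at each position is renormalized over states, so constant factors shared by all states at that position cancel.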

This alternative procedure for handling numerical stability can now be accessed in ChromHMM by using the `-scalebeta` flag.

Hyperparameter selection

In ChromGene, there are two hyperparameters that must be set: M , the number of mixture components, and S , the number of states per component. For M , the number of mixture components, we considered values ranging from 8 to 20 in increments of two. For S , the number of states per component, we considered each value ranging from 2 to 5, while S = 1 corresponded to the gene average baseline.

The choice of M trades off the ability of the model to capture additional classes of genes while maintaining meaningful distinctions between each annotation. To assess this trade-off, we calculated ChromGene’s reproducibility, defined as the percentage of identical assignments in cell types that can be considered replicates (below, Calculation of confusion matrix and contingency table), as a function of M and S . We found that reproducibility is more sensitive to the choice of M than S (Additional File 1 : Fig. S1a). We found that there was a large drop in reproducibility between M = 12 mixture components and 14 or more. On the other hand, the drop in reproducibility from M = 8 and M = 10 to M = 12 was relatively minimal, suggesting 12 mixture components could provide a reasonable trade-off for the expressivity of the model without having any pair of mixture components that are largely redundant (Additional File 1 : Fig. S1b).

The choice of S trades off the expressivity of the state representation within a mixture component with the overall interpretability of each mixture component. A mixture component with too many states could have “redundant states,” where two or more states have very similar emission parameters. To investigate this trade-off, we calculated the mean Manhattan distance between the emission parameters of the two closest states within a mixture, for values of M = 8 to 20, and S = 2 to 5 (Additional File 1 : Fig. S1c). We note that S = 1 is excluded from this analysis because two states are needed to calculate a distance. Unlike with reproducibility, this metric was more sensitive to the choice of S than M . We found that the mean Manhattan distance was similarly low in models using S = 4 or S = 5 states per mixture, suggesting that these models have pairs of “redundant” states that are not well-differentiated by emission parameters (Additional File 1 : Fig. S1d). However, there is still an appreciable gap between S = 3 and both S = 4 and S = 5, supporting the choice of S = 3 to trade off model expressivity while maintaining a relatively interpretable model with limited redundancy between sub-states of the model.
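The redundancy metric can be sketched as follows: for each mixture component, take the Manhattan distance between the emission vectors of its two closest states, then average over components. The function names and toy emission values are assumptions for illustration.

```python
from itertools import combinations

def closest_state_distance(beta_m):
    """Smallest Manhattan distance between any two states of one component.

    beta_m : list of S emission vectors (one per state), S >= 2
    """
    return min(sum(abs(a - b) for a, b in zip(s1, s2))
               for s1, s2 in combinations(beta_m, 2))

def mean_closest_distance(beta):
    """Average of the closest-pair distance over all mixture components."""
    return sum(closest_state_distance(b) for b in beta) / len(beta)

# Toy model: two components, two marks (made-up emission parameters).
beta = [
    [[0.0, 0.0], [1.0, 1.0], [0.0, 0.5]],   # closest pair is 0.5 apart
    [[0.2, 0.2], [0.4, 0.6]],               # only pair is 0.6 apart
]
score = mean_closest_distance(beta)
```

A low value indicates that at least one pair of states within a typical component has nearly identical emission parameters.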

ChromGene annotation visualization

To visualize ChromGene assignments across cell types and genes, we first randomly subsampled 2000 of the 19,919 protein-coding genes. For each pair of these genes, i and j , we calculated a pairwise distance between them as \(1-\mathrm{mean}\left(\mathrm{I}\left({m}_{i,c}={m}_{j,c}\right)\right)\) , where \({m}_{g,c}\) corresponds to the ChromGene assignment for gene g in cell type c , I( x ) is the indicator function, and the mean is across the 127 cell types. Columns, which correspond to cell types, were ordered the same way as previously [ 1 ]. Rows were ordered using scipy’s hierarchical clustering function, using optimal leaf ordering and average (UPGMA) linkage.
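The pairwise distance defined above is one minus the fraction of cell types in which two genes share an assignment. A minimal sketch, with a hypothetical function name and toy labels:

```python
def assignment_distance(a, b):
    """1 - mean(I(a_c == b_c)) over cell types c.

    a, b : equal-length lists of annotation labels, one per cell type
    """
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 1.0 - matches / len(a)

# Toy example with four cell types: the genes agree in two of them.
d = assignment_distance(["low", "poised", "quiescent", "low"],
                        ["low", "poised", "bivalent", "trans_enh"])
```

The distance is 0 for genes with identical assignments in every cell type and 1 for genes that never share an assignment.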

Baseline model implementations

To generate a 12-component model based only on gene TSSs, we took epigenomic mark binarization data across the 127 cell types for the 19,919 genes and used the data at the 200-bp bin overlapping the TSS position. As with the standard ChromGene input data, we separated these individual TSS positions by a single dummy position, then trained a 12-component model with a single state per component ( M = 12, S = 1) on top of ChromHMM and followed the same procedures as above for generating assignments.

The gene average model was implemented using ChromGene using the standard procedures as above, except that for the M = 12 mixture components, it used a single hidden state per mixture component (i.e., S = 1 states per component).

The collapsed model was identical in structure to the gene average model ( M = 12 mixture components, S = 1 state per component), but was not trained directly. After training the full ChromGene model ( M = 12 components, S = 3 states per component), as is used in all results presented, we calculated the prior probability of a “collapsed” component by summing the initial probabilities of states within the corresponding component of the full model. If the initial probabilities for the S states within a component m are denoted \({\tau }_{m,s}\) , then the initial probability for the collapsed state \(s'\) of the corresponding component \(m'\) is \({\tau }_{s'}={\pi }_{m}=\sum_{s=1}^{S}{\tau }_{m,s}\) , where \({\pi }_{m}\) is the empirical prior probability of component m . The transition probability from the dummy state to a component is this prior probability. The transition probability of the collapsed state to itself is calculated by comparing the number of observed state assignments for the corresponding component in the full model to the number of genes assigned to that component. If the number of 200-bp positions assigned to some component is denoted \({x}_{m}\) , and the number of genes assigned to that component is denoted \({n}_{m}\) , then there are \(({x}_{m}-{n}_{m})\) transitions from the component to itself and \({n}_{m}\) transitions from the component to the dummy state, for a total of \({x}_{m}\) transitions. We set the transition probability of the collapsed state to itself as \(({x}_{m}-{n}_{m})/{x}_{m}\) and the transition probability of the collapsed state to the dummy state as \({n}_{m}/{x}_{m}\) .

The emission probability parameters of the collapsed model are set to the weighted mean of the probabilities of the full model, weighted by the state prior probabilities. If the emission probability for a state s in component m for mark e is denoted \({\beta }_{m,s,e}\) , the emission probability of the corresponding collapsed component \(m'\) with the single collapsed state \(s'\) is \({\beta }_{{m}{\prime},{s}{\prime},e}=\sum_{s=1}^{S}{\pi }_{m,s}{\beta }_{m,s,e}\) , where \({\pi }_{m,s}\) is the empirical prior probability of state s in component m , which is equal to the fraction of 200-bp bins assigned to state s across all cell types and genes. These parameters were all set in the ChromHMM model file and were used directly without training to generate assignments for all genes and cell types.
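The construction of one collapsed component can be summarized in a short sketch. The function name and toy numbers are assumptions, and the state weights are assumed to be normalized within the component so that the emission is a weighted mean; the sketch does not write a ChromHMM model file.

```python
def collapse_component(tau_m, beta_m, pi_m, x_m, n_m):
    """Parameters of the single-state collapsed version of component m.

    tau_m  : initial probabilities of the S states of the component
    beta_m : beta_m[s][e], emission parameters of the full model
    pi_m   : state weights within the component (assumed to sum to 1)
    x_m    : number of 200-bp positions assigned to the component
    n_m    : number of genes assigned to the component
    """
    prior = sum(tau_m)                    # dummy -> component probability
    self_p = (x_m - n_m) / x_m            # collapsed state -> itself
    to_dummy = n_m / x_m                  # collapsed state -> dummy state
    n_marks = len(beta_m[0])
    emission = [sum(w * state[e] for w, state in zip(pi_m, beta_m))
                for e in range(n_marks)]
    return prior, self_p, to_dummy, emission

# Toy component with three states and two marks (made-up values).
prior, self_p, to_dummy, emission = collapse_component(
    tau_m=[0.05, 0.03, 0.02],
    beta_m=[[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    pi_m=[0.5, 0.25, 0.25],
    x_m=1000,
    n_m=40,
)
```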

Assignment of genes to mixture components

After training, ChromGene assigns each position along a gene a posterior probability of being in each of the hidden states \(p({H}_{m,s}|X)\) , where \({H}_{m,s}\) is the assignment to mixture component m and hidden state s , and X is the observed data. From this, ChromGene calculates the posterior probability of a gene being generated by each mixture component: \(P(m|X)={\sum }_{s=1}^{S}p({H}_{m,s}|X)\) based on any position. We note that all positions within a gene have the same component posterior probabilities; for example, if the first position of a gene has a posterior probability of 0.4 of being generated by component 1, then all positions within the gene have a posterior probability of 0.4 for component 1. This is a consequence of transitions between components being disallowed. ChromGene assigns each gene to the component \(1\le {m}^{*}\le M\) with the highest total posterior probability of generating the gene, i.e., \({m}^{*}={\text{argmax}}_{1\le m\le M}P(m|X)\) . We used these hard assignments of genes to components for all presented results.
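The hard assignment can be sketched as a sum over states followed by an argmax. The function name, the component-major ordering of the state posteriors, and the toy values are assumptions for illustration; this is not the `chromgene_posteriors_to_components.py` script.

```python
def assign_gene(state_posteriors, M, S):
    """Return (best component index, component posteriors) for one gene.

    state_posteriors : list of M*S posterior probabilities at any one
                       position of the gene, ordered component by component
    """
    comp = [sum(state_posteriors[m * S:(m + 1) * S]) for m in range(M)]
    best = max(range(M), key=lambda m: comp[m])
    return best, comp

# Toy gene with M = 2 components and S = 3 states per component.
best, comp = assign_gene([0.1, 0.2, 0.1, 0.3, 0.2, 0.1], M=2, S=3)
```

Indices here are 0-based, so `best == 1` corresponds to the second component.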

Training data

To generate input data for ChromGene, we first defined our genes of interest. For this application, we used 19,919 protein-coding genes as defined by Ensembl v65 / GENCODE v10 for hg19, as the epigenomic data we used had matching gene-level expression estimates across 56 cell types based on these gene annotations. For each gene, we took the TSS and TES of the gene as previously defined [ 1 , 16 , 49 ]. Genes on the negative strand were reversed to align with genes on the positive strand so all genes had the same orientation in the model. We rounded the TSS upstream (in the 5′ direction) and the TES downstream (in the 3′ direction) to the next position divisible by 200 bp. We added an additional flank of 2 kb upstream of the TSS and downstream of the TES to capture additional spatial information around the TSS and TES. We then binned the entire region (gene and flanks) into 200-bp bins so that the boundaries of each bin were divisible by 200. Overlapping genes were considered separately with their overlapping regions repeated. We then extracted the epigenomic marks, as described next.
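The binning procedure can be sketched as below. The coordinate convention (0-based, half-open bins), the function name, and the example coordinates are assumptions; a position that is already a multiple of 200 is left unchanged, and chromosome boundaries are not handled.

```python
BIN = 200      # bin size in bp
FLANK = 2000   # flank added on each side in bp

def gene_bins(tss, tes, strand):
    """Return the list of (start, end) bins for one gene, in 5'-to-3' order."""
    lo, hi = min(tss, tes), max(tss, tes)
    # Round outward to multiples of 200, then extend by the 2-kb flanks.
    start = (lo // BIN) * BIN - FLANK
    end = -(-hi // BIN) * BIN + FLANK
    bins = [(b, b + BIN) for b in range(start, end, BIN)]
    if strand == "-":
        bins.reverse()   # negative-strand genes are reversed to share one orientation
    return bins

plus = gene_bins(10050, 10950, "+")
minus = gene_bins(10950, 10050, "-")
```

In this example the gene body rounds outward to 10,000–11,000, so with flanks the region spans 8,000–13,000 and yields 25 bins on either strand.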

To generate a single ChromGene model that would be comparable across a large number of cell types and marks, we used imputed data for a set of 12 marks (H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K9me3, H3K27ac, H3K27me3, H3K9ac, H3K79me2, H4K20me1, H2A.Z, and DNase) across 127 reference epigenomes [ 1 , 27 ], which for our purposes we treated and referred to as different cell types, except for calculation of the confusion matrix and contingency table, as described below. This imputed data was previously used to train a 25-state ChromHMM model [ 27 ]. The use of imputed data allowed us to annotate more cell types with the same set of marks compared to using directly observed data. We used the same binarization for the imputed data as previously generated [ 27 , 50 ].

Gene expression analysis

We downloaded gene expression data for 56 cell types from the Roadmap Epigenomics Consortium [ 1 , 51 ]. We took the provided RPKM values, added a pseudocount of 0.1, and took the \({\mathrm{log}}_{10}\) transform of the value.

To determine whether median gene expression variance within annotations was significantly smaller than across annotations, we performed a permutation test. Each median \({\mathrm{log}}_{10}(\mathrm{RPKM}+0.1)\) expression value for a cell type was randomly assigned to one of the ChromGene annotations, while maintaining 56 cell types per ChromGene annotation. We then calculated the mean within-annotation variance of the median expression values for each permutation and for the true ChromGene annotation. To compute a p -value, we calculated the fraction of 10,000 permutations in which the mean variance based on the permuted values was less than that of the true ChromGene annotations.
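The permutation test can be sketched with the standard library. The function names, the use of the sample variance, the random seed, and the toy data are assumptions for illustration; the study's own analysis code is not reproduced here.

```python
import random
from math import log10
from statistics import variance

def log_rpkm(rpkm):
    """log10 transform with the 0.1 pseudocount described in the text."""
    return log10(rpkm + 0.1)

def mean_within_variance(groups):
    """Mean over annotations of the variance of their median expression values."""
    return sum(variance(g) for g in groups) / len(groups)

def permutation_p(groups, n_perm=10000, seed=0):
    """Fraction of permutations whose mean within-group variance is below the observed one."""
    rng = random.Random(seed)
    size = len(groups[0])                 # equal group sizes are preserved
    observed = mean_within_variance(groups)
    pooled = [v for g in groups for v in g]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm = [pooled[i:i + size] for i in range(0, len(pooled), size)]
        if mean_within_variance(perm) < observed:
            hits += 1
    return hits / n_perm

# Toy data: three annotations, four cell types each (made-up values).
toy_groups = [[0.0, 0.1, 0.2, 0.1], [2.0, 2.1, 1.9, 2.0], [4.0, 4.1, 3.9, 4.0]]
p_perm = permutation_p(toy_groups, n_perm=1000)
```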

To assess whether ChromGene’s AUROCs for predicting expressed genes differed significantly from those of the baseline methods, we used a one-sided binomial test in which each observation is a cell type. We counted how often the AUROC for ChromGene was higher than that of the baseline method and, based on these counts, calculated a p-value under the null hypothesis that ChromGene is no more likely than the baseline method to produce the higher of the two AUROCs (success probability p = 0.5), with n = 56 cell types with expression data as observations.
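
The one-sided binomial test can be written directly from the binomial distribution. The sketch below uses only the standard library; the function name is hypothetical, and how ties between methods would be handled is not stated in the text and is not addressed here.

```python
from math import comb

def binomial_upper_tail(wins, n, p=0.5):
    """One-sided p-value P(X >= wins) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(wins, n + 1))
```

If ChromGene had the higher AUROC in all 56 cell types, the p-value under this null would be 0.5 to the power of 56.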

Calculation of mutual information of assignment given gene length

To calculate mutual information of gene assignments given gene lengths, for each gene, we first calculated its log10(length), which ranged from 2.06 to 6.73, where length is in bp. We binned the log10(length) values into intervals of width 0.05 to create a discretized distribution. Then, for each cell type, we calculated the mutual information of ChromGene assignment and the length bins. We repeated the procedure for each baseline method. To compare to the baseline methods, we used the binomial test described above, using mutual information instead of AUROC and n = 127 cell types.
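
As an illustration, the mutual information between two discrete labelings (here, annotation and length bin) can be computed from their joint counts. The sketch below reports the value in bits; the logarithm base and the helper names are assumptions, since the text does not specify them.

```python
from collections import Counter
from math import floor, log2, log10

def length_bin(length_bp, width=0.05):
    """Index of the log10(length) interval of the given width (hypothetical helper)."""
    return floor(log10(length_bp) / width)

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long label sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # Sum over observed (x, y) pairs of p(x, y) * log2(p(x, y) / (p(x) p(y))).
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in joint.items())
```

A call such as `mutual_information(assignments, [length_bin(l) for l in lengths])` would then be repeated per cell type and per annotation method.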

Calculation of confusion matrix and contingency table

To generate a confusion matrix (Additional File 1: Fig. S5a), we took our matrix of ChromGene assignments with entries \(m_{g,c}\), where each of \(g = 1,\dots,19919\) corresponds to a gene and each of \(c = 1,\dots,127\) corresponds to a cell type (Fig. 2, genes subsampled and ordered for visualization). For each gene \(g\), we calculated the conditional probability \(\mathrm{P}(m_{g,c} \mid m_{g,c'})\), where \(c \ne c'\) correspond to pairs of epigenomes that were originally annotated as the same cell type but of different individuals (“Rectal Mucosa Donor 29” and “Rectal Mucosa Donor 31”, “Foreskin Fibroblast Primary Cells skin01” and “Foreskin Fibroblast Primary Cells skin02”, “Foreskin Melanocyte Primary Cells skin01” and “Foreskin Melanocyte Primary Cells skin03”, “Foreskin Keratinocyte Primary Cells skin02” and “Foreskin Keratinocyte Primary Cells skin03”, “Skeletal Muscle Female” and “Skeletal Muscle Male”, “Primary hematopoietic stem cells G-CSF-mobilized Male” and “Primary hematopoietic stem cells G-CSF-mobilized Female”, “Fetal Brain Female” and “Fetal Brain Male”). In short, we sought to answer the question: “given an entry in the assignment matrix is mixture component \(i\), what is the probability that the same gene is assigned to component \(j\) in a replicate cell type?” We then calculated overall conditional probabilities by averaging over all genes \(g\). We represented these conditional probabilities so that the component conditioned on corresponds to a row and each row sums to 1.
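
A simplified sketch of this row-normalized matrix follows. It pools counts over all genes and counts each replicate pair in both directions; both choices, along with the function name and the 0-based component indices, are assumptions made for illustration rather than details confirmed by the text.

```python
def conditional_matrix(assign, pairs, n_components):
    """Row-normalized matrix of P(component j in one cell type | component i in the other).

    assign[g][c] is the 0-based component of gene g in cell type c;
    pairs is a list of (c, c2) column-index pairs to compare.
    """
    counts = [[0] * n_components for _ in range(n_components)]
    for row in assign:
        for c, c2 in pairs:
            # Count both directions so the result does not depend on pair order.
            counts[row[c]][row[c2]] += 1
            counts[row[c2]][row[c]] += 1
    matrix = []
    for r in counts:
        total = sum(r)
        matrix.append([v / total if total else 0.0 for v in r])
    return matrix
```

Passing replicate pairs yields a confusion-matrix-style table, while passing non-replicate pairs yields the analogous contingency-style table.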

To generate the gene contingency table (Additional File 1 : Fig. S5b), we performed the same process as for the confusion matrix, but instead of calculating probabilities for pairs of “replicate” cell types, we calculated them for non-replicate cell types.

Cell type specificity

We defined the “cell type specificity” of each ChromGene annotation by dividing the diagonal of the contingency table (across non-replicate cell types) by the confusion matrix diagonal, then subtracting this value from 1 to obtain a “cell type specificity score” so that higher values correspond to more cell type-specific annotations.
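
Expressed as code, the score is one minus the ratio of the two diagonals. The sketch below assumes both inputs are square, row-normalized matrices given as nested lists; the function name is hypothetical.

```python
def cell_type_specificity(contingency, confusion):
    """1 - (contingency diagonal / confusion diagonal), one score per annotation."""
    return [1 - contingency[i][i] / confusion[i][i] for i in range(len(confusion))]
```

An annotation that is as consistent across non-replicate cell types as across replicates scores 0, and lower cross-cell-type consistency pushes the score toward 1.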

Calculation of co-assignment matrix enrichment

To generate the co-assignment matrix enrichment (Fig. 6), we first generated an expected co-assignment matrix, where we assumed independence between assignments. For each ChromGene annotation, \(m\), we calculated an empirical prior probability of observing the annotation assignment for a random cell type and gene, \(\pi_m = \mathrm{P}(m) = \sum_{c=1}^{127} \sum_{g=1}^{19919} \mathrm{I}(m_{g,c} = m) / (127 \times 19919)\), where \(\mathrm{I}(m_{g,c} = m)\) denotes the indicator function applied to the ChromGene assignment for gene \(g\) in cell type \(c\) being \(m\). We took these prior probabilities and calculated an expected co-assignment matrix, where each entry \((i, j)\) was found by multiplying the prior probabilities of mixture components \(m_i\) and \(m_j\): \(\mathrm{P}(m_i) \times \mathrm{P}(m_j)\). We calculated the observed co-assignment matrix by counting the frequency with which a gene was assigned to component \(m_i\) in one cell type and \(m_j\) in the other, averaging over all genes and all pairs of cell types that were not considered replicates. We then normalized this matrix by the sum of its values to form a probability distribution. To calculate enrichments, we divided the observed co-assignment matrix (Additional File 1: Fig. S5b) by the expected co-assignment matrix. Finally, we took the log2 of these values to show enrichments and depletions.
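
The observed-over-expected calculation can be sketched as below. The function name and 0-based component indices are hypothetical, and returning negative infinity for pairs that are never observed is an assumption, as the text does not say how such entries were handled.

```python
from math import log2

def coassignment_log2_enrichment(assign, pairs, n_components):
    """log2(observed / expected) co-assignment, with expected = prior[i] * prior[j].

    assign[g][c] is the 0-based component of gene g in cell type c;
    pairs lists the (c, c2) cell-type index pairs to compare.
    """
    n_entries = sum(len(row) for row in assign)
    prior = [0.0] * n_components
    for row in assign:
        for m in row:
            prior[m] += 1 / n_entries          # empirical prior of each component
    observed = [[0.0] * n_components for _ in range(n_components)]
    for row in assign:
        for c, c2 in pairs:
            observed[row[c]][row[c2]] += 1
    total = sum(map(sum, observed))            # normalize to a probability distribution
    result = []
    for i in range(n_components):
        result.append([])
        for j in range(n_components):
            obs = observed[i][j] / total
            exp = prior[i] * prior[j]
            result[i].append(log2(obs / exp) if obs > 0 and exp > 0 else float("-inf"))
    return result
```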

Calculation of differences in gene expression as a function of ChromGene assignment

To calculate expression changes as a function of changes in ChromGene assignment, we first took all 19,919 genes across the 56 cell types with expression quantified. For each pair of distinct cell types i and j, we took each gene’s expression and assignment in cell types i and j, then divided the gene’s expression in cell type j by its expression in cell type i. We then placed this ratio in a bin indexed by the gene’s ChromGene assignments in cell types i and j (a total of 12 × 12 bins). Finally, we took the mean of the log2 expression ratios in each bin and plotted the result as a heatmap (Additional File 1: Fig. S7a). Additionally, we plotted the mean expressions of the genes in cell type j conditioned on the assignment in cell type i (Additional File 1: Fig. S7d), which are the denominators of the ratios used in Additional File 1: Fig. S7a.
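
The binning by assignment pair can be sketched as follows. The sketch assumes strictly positive expression values (for example, after adding a pseudocount, which the text does not specify for this step), and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import permutations
from math import log2

def mean_log2_ratio_by_assignment(expr, assign):
    """Mean log2(expr_j / expr_i) per (assignment in i, assignment in j) bin.

    expr[g][c] and assign[g][c] give the expression and annotation of gene g
    in cell type c; all ordered pairs of distinct cell types are used.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    n_cells = len(expr[0])
    for e_row, a_row in zip(expr, assign):
        for i, j in permutations(range(n_cells), 2):
            key = (a_row[i], a_row[j])
            sums[key] += log2(e_row[j] / e_row[i])
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```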

ChromGene assignments in hg38-based gene annotations

As the Roadmap Epigenomics data and gene annotations (v10) were based on the hg19 assembly, we used the same assembly and gene annotation for all analyses here [ 1 , 16 ]. However, we also generated ChromGene assignments by applying our previously trained ChromGene model with the lift over of a more recent hg38-based protein-coding gene annotation. Specifically, we used the version of the hg38-based GENCODE v41 annotation that had previously been lifted over to hg19 [ 16 , 52 ]. We found that when comparing common genes based on gene name, ChromGene assignments were 93.7% concordant. The 6.3% that were not concordant were due to differences in the annotated TSS or TES position; 46% of discordant genes (total of 2.9%) had a start or end position that was more than 10 kb away between the two annotations.

Olfactory and housekeeping gene annotations

We downloaded olfactory [ 34 ] and housekeeping gene annotations [ 39 ]. We matched the gene names to the GENCODE annotation [ 16 ] to label each gene as olfactory or not olfactory and as housekeeping or not housekeeping.

ChromHMM state enrichment

For each ChromGene mixture component, we took all genes and cell types assigned to the mixture component, and counted the total observed counts of each ChromHMM state in a previously described 25-state model [ 27 ], thus yielding a matrix of M ChromGene components by 25 ChromHMM states. We then added a pseudocount of 1 to these counts and normalized the rows (ChromGene components) to unit probability. We divided these probabilities by the genome-wide ChromHMM state assignment proportions across all cell types to generate an enrichment, and finally calculated the log2 of these enrichments (Additional File 1: Fig. S6).
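
A minimal sketch of this enrichment, assuming a nested list of raw counts and a list of genome-wide state proportions (the function and argument names are illustrative):

```python
from math import log2

def state_log2_enrichment(counts, background):
    """log2 enrichment of each ChromHMM state within each ChromGene component.

    counts[m][s] is the observed count of state s in component m;
    background[s] is the genome-wide proportion of state s.
    """
    result = []
    for row in counts:
        padded = [c + 1 for c in row]           # pseudocount of 1, as in the text
        total = sum(padded)                     # normalize each row to unit probability
        result.append([log2((c / total) / b) for c, b in zip(padded, background)])
    return result
```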

Gene set enrichments

To calculate the fold enrichments for ZNF-named genes, housekeeping genes, constitutively unexpressed genes (RPKM < 1 in all 56 cell types with matched expression available), constitutively expressed genes (RPKM > 1 in all 56 cell types with matched expression available), and olfactory genes for a mixture component, we divided the mean proportion of genes from the set assigned to the component across all cell types by the empirical prior probability of a gene being assigned to the component. To calculate a median p-value for a gene set and component across cell types, we took each cell type and calculated a p-value using a hypergeometric test (`scipy.stats.hypergeom`, with X = [number of genes in component and gene set], M = [total number of genes], n = [total number of genes in gene set], N = [number of genes in component]), and then took the median of those values.
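
For illustration, the upper-tail hypergeometric probability can be computed with the standard library alone, using the same X, M, n, N naming as above. This is a stand-in for the `scipy.stats.hypergeom` call, not the authors' code; in particular, the use of the upper tail P(overlap ≥ X), which would correspond to `scipy.stats.hypergeom.sf(X - 1, M, n, N)`, is an assumption.

```python
from math import comb
from statistics import median

def hypergeom_upper_tail(x, M, n, N):
    """P(overlap >= x) when N genes are drawn from M genes, n of which are in the set."""
    upper = min(n, N)
    numerator = sum(comb(n, k) * comb(M - n, N - k) for k in range(x, upper + 1))
    return numerator / comb(M, N)

def median_p_value(per_cell_type):
    """Median p-value over cell types; each item is an (x, M, n, N) tuple."""
    return median(hypergeom_upper_tail(*args) for args in per_cell_type)
```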

To calculate significant gene set enrichments for “biological process” GO terms [ 35 ] (Fig. 7), we first found the overlap of a gene set with each of the ChromGene annotations for a given cell type. Next, we calculated p-values using a hypergeometric test, as for the individual gene sets above. The heatmap in Fig. 7a,b shows unadjusted p-values, marked with an asterisk when significant after accounting for multiple testing with a Bonferroni correction. For the “expression only” column, we controlled only for the number of GO terms tested, while in the remaining columns, signifying expression and some ChromGene annotation, we controlled for both the number of GO terms and ChromGene annotations tested. We chose to test the minimal set of ChromGene annotations that contained at least 75% of tested genes, starting with the lowest expressed annotation for RPKM < 1 and highest for RPKM > 100. In Fig. 7c, the bar chart shows the number of gene sets significantly enriched after correcting for the number of GO terms, ChromGene annotations, and cell types (adjusted p < 0.01).
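
The Bonferroni correction amounts to multiplying each p-value by the number of tests and capping the result at 1. In the sketch below, `n_tests` would be the product of the quantities being controlled for (for example, the number of GO terms times the number of ChromGene annotations tested); the function name is hypothetical.

```python
def bonferroni_adjust(p_values, n_tests):
    """Bonferroni-adjusted p-values: p * n_tests, capped at 1."""
    return [min(1.0, p * n_tests) for p in p_values]
```

An enrichment would then be called significant when its adjusted p-value falls below the 0.01 threshold.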

We calculated GO term enrichments for each baseline method following the process above and repeated the process for cancer gene sets (Fig. 7 d) [ 37 ]. The gene sets for “biological process” GO terms and cancer gene sets were downloaded from the Enrichr database [ 38 ].

To generate the GO term enrichment median p-value heatmap (Additional File 1: Fig. S8a), we followed the procedure above, but without correcting p-values for the number of cell types. We removed rows where the median adjusted p-value was greater than our significance threshold of 0.01, and clustered rows using `seaborn.clustermap`, which by default uses a Euclidean metric and `average` linkage. Finally, we applied a −log10 transform to the p-values for visualization.

To compare the number of GO terms enriched when splitting unexpressed (or highly expressed) genes by ChromGene annotation to splitting randomly (Additional File 1 : Fig. S8b,c), we first took the set of all genes with < 1 RPKM (or > 100 RPKM). As above, we then took the minimal set of ChromGene annotations, ordered by expression, that contained at least 75% of the tested genes as the annotations tested, and for each, calculated gene set enrichments for all GO terms. We then took the set of all the genes tested based on expression alone, split them randomly into groups of the same size as the ChromGene annotations, and determined significant gene set enrichments as above. We repeated this process 100 times and took the mean number of significant enrichments. We repeated the process for each cell type and represented each one as a point in the scatter plot.

pLI score analysis

We obtained pLI scores for each gene from gnomAD [ 25 , 53 ] using the “exac_pLI” column. We then correlated the expression values of genes across 56 cell types [ 1 ] with their corresponding pLI scores using Spearman correlation.

To calculate the proportion of high-pLI genes (Fig. 8) and mean of pLI scores (Additional File 1: Fig. S10) as a function of expression for a given cell type, we took each gene’s expression, added a pseudocount of 0.1 RPKM to it, and took the log10 of the result (gray curves). We then looked at 30 equally sized bins of this transformed expression value, from −1 to 2, corresponding to 0 and ≈100 RPKM, respectively. For each bin, we took all genes falling into it, and calculated both the proportion of these genes with pLI ≥ 0.9 and the mean pLI of these genes for the two figures, respectively. We plotted the curves by linearly interpolating between bins. We then repeated the process for all cell types simultaneously (black dotted curve).
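
The per-bin proportion can be sketched as below. Dropping genes whose transformed expression exceeds the upper bound and returning `None` for empty bins are assumptions made for this illustration, as is the function name; the text does not describe how such cases were treated.

```python
from math import log10

def high_pli_fraction_by_expression(rpkm, pli, n_bins=30, lo=-1.0, hi=2.0, cutoff=0.9):
    """Fraction of genes with pLI >= cutoff in each bin of log10(RPKM + 0.1)."""
    width = (hi - lo) / n_bins
    high, total = [0] * n_bins, [0] * n_bins
    for r, score in zip(rpkm, pli):
        x = log10(r + 0.1)
        if x > hi:
            continue                            # assumed: genes above the range are skipped
        b = min(max(int((x - lo) / width), 0), n_bins - 1)
        total[b] += 1
        high[b] += score >= cutoff
    return [h / t if t else None for h, t in zip(high, total)]
```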

To show that the difference in percentage of high-pLI genes among ChromGene annotations (Fig. 8) is not simply due to differences in gene length, for each cell type, we first took all genes assigned to a ChromGene annotation, then split them into bins based on the log10 of their length, where the log10 length ranged from 3 (1 kb) to 6 (1 Mb), and each bin spanned 0.2 log10 length, for a total of 15 bins. We used the range of 1 kb–1 Mb as there are very few genes outside this range (394 genes with length < 1 kb and 63 genes with length > 1 Mb, which we discarded for this analysis). For each length bin and ChromGene annotation, we counted the number of genes in the bin with high pLI score (≥ 0.9) across all cell types. We then normalized the resulting values to yield the proportion of genes with high pLI score for each ChromGene annotation and length bin (Additional File 1: Fig. S9). To explicitly compare the differences in pLI scores across annotations controlling for gene length, we first binned all gene lengths by their log10 length as above, normalized so the distribution summed to 1, and used this as the reference gene length density. Then, we took the proportions of genes with high pLI scores per ChromGene annotation across the bins, as described above, and multiplied them by their respective reference gene length density. Finally, for each ChromGene annotation, we summed the resulting values to obtain a gene length-normalized proportion of high-pLI score genes (Additional File 2: Table S1).
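
The final length-normalization step is a weighted sum over length bins. In the sketch below, `prop_by_bin` holds one annotation's per-bin proportion of high-pLI genes and `ref_counts` holds the number of genes per bin over all genes; both names are hypothetical, and bins in which the annotation has no genes are assumed to be supplied as 0.

```python
def length_normalized_high_pli(prop_by_bin, ref_counts):
    """Sum over length bins of (proportion of high-pLI genes) x (reference length density)."""
    total = sum(ref_counts)                     # normalizes the reference counts to a density
    return sum(p * c / total for p, c in zip(prop_by_bin, ref_counts))
```

For instance, per-bin proportions of 0.2 and 0.6 combined with reference counts of 30 and 10 genes would give 0.2 × 0.75 + 0.6 × 0.25 = 0.3.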

Availability of data and materials

The ChromGene annotations generated in this study and the ChromGene software are available under the MIT license in the ErnstLab ChromGene GitHub repository [ 30 ] and Zenodo [ 31 ]. We used ChromHMM version 1.18 to train the ChromGene model. No other scripts or software were used. ChromGene assignments are also available in Additional file 3 . Data used to generate input for ChromGene is also publicly available [ 1 , 27 , 50 ].

Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–30.

Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011;473:43–9.

Barski A, Cuddapah S, Cui K, Roh T-Y, Schones DE, Wang Z, et al. High-resolution profiling of histone methylations in the human genome. Cell. 2007;129:823–37.

ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74.

Boyle AP, Song L, Lee B-K, London D, Keefe D, Birney E, et al. High-resolution genome-wide in vivo footprinting of diverse transcription factors in human cells. Genome Res. 2011;21:456–64.

Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods. 2013;10:1213–8.

Stunnenberg HG, International Human Epigenome Consortium, Hirst M. The international human epigenome consortium: a blueprint for scientific collaboration and discovery. Cell. 2016;167:1145–9.

Ernst J, Kellis M. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat Biotechnol. 2010;28:817–25.

Ernst J, Kellis M. ChromHMM: automating chromatin state discovery and characterization. Nat Methods. 2012;9:215–6.

Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, Noble WS. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat Methods. 2012;9:473–6.

Claussnitzer M, Dankel SN, Kim K-H, Quon G, Meuleman W, Haugen C, et al. FTO obesity variant circuitry and adipocyte browning in humans. N Engl J Med. 2015;373:895–907.

Libbrecht MW, Chan RCW, Hoffman MM. Segmentation and genome annotation algorithms for identifying chromatin state and other genomic patterns. PLOS Comput Biol. 2021;17:e1009423.

Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57–63.

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:621–8.

Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci. 1998;95:14863–8.

Frankish A, Diekhans M, Ferreira A-M, Johnson R, Jungreis I, Loveland J, et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 2019;47:D766–73.

Su D, Wang X, Campbell MR, Song L, Safi A, Crawford GE, et al. Interactions of chromatin context, binding site sequence content, and sequence evolution in stress-induced p53 occupancy and transactivation. PLOS Genet. 2015;11:e1004885.

Zhu W, Hu B, Becker C, Doğan ES, Berendzen KW, Weigel D, et al. Altered chromatin compaction and histone methylation drive non-additive gene expression in an interspecific Arabidopsis hybrid. Genome Biol. 2017;18:157.

Kharchenko PV, Alekseyenko AA, Schwartz YB, Minoda A, Riddle NC, Ernst J, et al. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature. 2011;471:480–5.

Sahu A, Li N, Dunkel I, Chung H-R. EPIGENE: genome-wide transcription unit annotation using a multivariate probabilistic model of histone modifications. Epigenetics Chromatin. 2020;13:20.

Marco E, Meuleman W, Huang J, Glass K, Pinello L, Wang J, et al. Multi-scale chromatin state annotation using a hierarchical hidden Markov model. Nat Commun. 2017;8:15011.

Jaschek R, Tanay A. Spatial clustering of multivariate genomic and epigenomic information. 2009. p. 170–83.

Larson JL, Huttenhower C, Quackenbush J, Yuan G-C. A tiered hidden Markov model characterizes multi-scale chromatin states. Genomics. 2013;102:1–7.

Ge X, Zhang H, Xie L, Li WV, Kwon SB, Li JJ. EpiAlign: an alignment-based bioinformatic tool for comparing chromatin state sequences. Nucleic Acids Res. 2019;47:e77.

Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–91.

Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative genomics viewer. Nat Biotechnol. 2011;29:24–6.

Ernst J, Kellis M. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues. Nat Biotechnol. 2015;33:364–76.

Ernst J, Kellis M. Chromatin-state discovery and genome annotation with ChromHMM. Nat Protoc. 2017;12:2478–92.

Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet. 2007;39:311–8.

Jaroszewicz A, Ernst J. ChromGene github site. https://github.com/ernstlab/ChromGene/ . Accessed 28 Mar 2023.

Jaroszewicz A, Ernst J. ChromGene: gene-based modeling of epigenomic data. Zenodo. https://doi.org/10.5281/zenodo.8303613 .

Lesch BJ, Page DC. Poised chromatin in the mammalian germ line. Dev Camb Engl. 2014;141:3619–26.

Heintzman ND, Hon GC, Hawkins RD, Kheradpour P, Stark A, Harp LF, et al. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature. 2009;459:108–12.

Barnes IHA, Ibarra-Soria X, Fitzgerald S, Gonzalez JM, Davidson C, Hardy MP, et al. Expert curation of the human and mouse olfactory receptor gene repertoires identifies conserved coding regions split across two exons. BMC Genomics. 2020;21:196.

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25:25–9.

Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016;44:W90–7.

Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483:603–7.

Chen EY, Tan CM, Kou Y, Duan Q, Wang Z, Meirelles GV, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 2013;14:128.

Eisenberg E, Levanon EY. Human housekeeping genes, revisited. Trends Genet. 2013;29:569–74.

Bernstein BE, Mikkelsen TS, Xie X, Kamal M, Huebert DJ, Cuff J, et al. A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell. 2006;125:315–26.

Botía JA, Vandrovcova J, Forabosco P, Guelfi S, D’Sa K, Hardy J, et al. An additional k-means clustering step improves the biological features of WGCNA gene co-expression networks. BMC Syst Biol. 2017;11:47.

Costa IG, Roepcke S, Hafemeister C, Schliep A. Inferring differentiation pathways from gene expression. Bioinformatics. 2008;24:i156–64.

Chaffer CL, Marjanovic ND, Lee T, Bell G, Kleer CG, Reinhardt F, et al. Poised chromatin at the ZEB1 promoter enables breast cancer cell plasticity and enhances tumorigenicity. Cell. 2013;154:61–74.

Bernhart SH, Kretzmer H, Holdt LM, Jühling F, Ammerpohl O, Bergmann AK, et al. Changes of bivalent chromatin coincide with increased expression of developmental genes in cancer. Sci Rep. 2016;6:37393.

Vu H, Ernst J. Universal annotation of the human genome through integration of over a thousand epigenomic datasets. Genome Biol. 2022;23:9.

Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 1989;77:257–86.

Durbin R, Eddy SR, Krogh A, Mitchison G. Biological sequence analysis: probabilistic models of proteins and nucleic acids. 1st ed. Cambridge: Cambridge University Press; 1998.

Murphy KP. Hidden semi-Markov models (HSMMs). 2002. https://www.cs.ubc.ca/~murphyk/Papers/segment.pdf . Accessed 28 Mar 2023.

Frankish A, Diekhans M, Ferreira A-M, Johnson R, Jungreis I, Loveland J, et al. GENCODE V41 Annotation. Nucleic Acids Research. https://egg2.wustl.edu/roadmap/data/byDataType/rna/expression/Ensembl_v65.Gencode_v10.ENSG.gene_info . Accessed 28 Mar 2023.

Roadmap Epigenomics Consortium. Roadmap Epigenomics Consortium ChromHMM Imputed Data. https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/binaryChmmInput/imputed12marks/binaryData/ . Accessed 28 Mar 2023.

Roadmap Epigenomics Consortium. Roadmap Epigenomics Consortium Gene Expression Data. https://egg2.wustl.edu/roadmap/data/byDataType/rna/expression/57epigenomes.RPKM.pc.gz . Accessed 28 Mar 2023.

Frankish A, Diekhans M, Ferreira A-M, Johnson R, Jungreis I, Loveland J, et al. GENCODE V41 Annotation hg19 to hg38 Liftover. Nucleic Acids Research. https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/GRCh37_mapping/gencode.v41lift37.basic.annotation.gtf.gz . Accessed 28 Mar 2023.

Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, et al. gnomAD Browser pLI Scores. https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz . Accessed 28 Mar 2023.

Munroe R. XKCD Colors. https://xkcd.com/color/rgb/ . Accessed 28 Mar 2023.

Acknowledgements

We thank Huiling Huang for conducting some preliminary analyses of the results of the method. We also thank Adriana Arneson and Petko Fiziev, and other past and current members of the Ernst Lab, for their feedback on this work.

Review history

The review history is available as Additional file 4 .

Peer review information

Wenjing She was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Funding

This work was supported by the U.S. National Institutes of Health [grants T32HG002536 (A.J.), R01ES024995, U01HG007912, DP1DA044371, UG3NS104095, U01HG012079 (J.E.)], the National Science Foundation [1254200, 2125664] (J.E.), an Alfred P. Sloan Fellowship (J.E.), Kure-IT award from Kure It cancer research, Rose Hills Innovator Award, and the UCLA Jonsson Comprehensive Cancer Center and Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research Ablon Scholars Program.

Author information

Authors and affiliations

Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, 90095, USA

Artur Jaroszewicz & Jason Ernst

Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, CA, 90095, USA

Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research, University of California, Los Angeles, Los Angeles, CA, 90095, USA

Jason Ernst

Computer Science Department, University of California, Los Angeles, Los Angeles, CA, 90095, USA

Computational Medicine Department, University of California, Los Angeles, Los Angeles, CA, 90095, USA

Jonsson Comprehensive Cancer Center, University of California, Los Angeles, Los Angeles, CA, 90095, USA

Molecular Biology Institute, University of California, Los Angeles, Los Angeles, CA, 90095, USA

Contributions

AJ and JE developed the method. AJ implemented the method and performed all analyses presented. AJ and JE wrote the manuscript. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Jason Ernst .

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Fig. S1. Model reproducibility and state similarity as a function of hyperparameters. Fig. S2. ChromGene state transitions. Fig. S3. Median expression for each ChromGene assignment, separated by cell type. Fig. S4. Mutual information of annotation method and gene length. Fig. S5. ChromGene confusion matrix and contingency table. Fig. S6. Log2 enrichments of ChromHMM states for each ChromGene assignment. Fig. S7. Comparison of gene expression as a function of ChromGene annotations across pairs of cell types. Fig. S8. Median GO term enrichment across all cell types. Fig. S9. Proportion of high-pLI genes per ChromGene annotation, conditioned on gene length. Fig. S10. Mean pLI vs mean expression. Table S2. Performance of ChromGene compared to baseline methods at predicting expression.

Additional file 2:

Table S1. Description of ChromGene annotation enrichments, state emissions and enrichments, and transition probabilities. Annotation enrichments tab – The columns on this tab after the annotation colors and numbers are as follows: Mnemonic: short identifying name of ChromGene annotation. Description: short description of ChromGene annotation. Overall percentage: percentage of gene-cell type combinations assigned to annotation. Median expression: median expression (RPKM) of genes assigned to annotation across 56 cell types with expression [ 1 ]. Median length (kb): median length (kb) of genes assigned to annotation across all cell types, not including flanking regions. Percentage of high-pLI genes (pLI ≥ 0.9): percentage of genes across cell types that have a pLI score ≥ 0.9. Percentage of high-pLI genes (pLI ≥ 0.9), conditioned on gene length: percentage of genes across cell types that have a pLI score ≥ 0.9 in a gene length matched distribution (“ Methods ”). Contingency table diagonal / confusion matrix diagonal: percentage consistency of ChromGene assignments across non-replicate cell types divided by percentage consistency of assignments across replicate cell types. Cell type specificity: 1 - (contingency table diagonal / confusion matrix diagonal), a metric of cell type specificity. # Housekeeping gene: gene annotated as housekeeping [ 39 ]. Housekeeping gene percentage: percentage of gene-cell type combinations annotated as a housekeeping gene. Housekeeping gene enrichment: fold enrichment of housekeeping genes compared to overall percentage. Housekeeping gene log 2 enrichment: log 2 fold enrichment of housekeeping genes. Housekeeping gene enrichment median enrichment p -value: median p -value of housekeeping gene enrichment across cell types. # Constitutively unexpressed gene: gene that has RPKM < 1 across 56 cell types with expression [ 1 ]. 
Constitutively unexpressed gene percentage: percentage of gene-cell type combinations annotated as constitutively unexpressed. Constitutively unexpressed gene enrichment: fold enrichment of constitutively unexpressed genes compared to overall percentage. Constitutively unexpressed gene log 2 enrichment: log 2 fold enrichment of constitutively unexpressed genes. Constitutively unexpressed gene median enrichment p -value: median p -value of constitutively unexpressed gene enrichment across cell types. # Constitutively expressed gene: gene that has RPKM> 1 across 56 cell types with expression [ 1 ]. Constitutively expressed gene percentage: percentage of gene-cell type combinations annotated as constitutively expressed. Constitutively expressed gene enrichment: fold enrichment of constitutively expressed genes compared to overall percentage. Constitutively expressed gene log 2 enrichment: log 2 fold enrichment of constitutively expressed genes. Constitutively expressed gene median enrichment p -value: median p -value of constitutively expressed gene enrichment across cell types. # Olfactory gene: gene annotated as olfactory [ 34 ]. Olfactory gene percentage: percentage of gene / cell type combinations annotated as olfactory. Olfactory gene enrichment: fold enrichment of olfactory genes compared to overall percentage. Olfactory gene log 2 enrichment: log 2 fold enrichment of olfactory genes. Olfactory gene median enrichment p -value: median p -value of olfactory gene enrichment across cell types. # ZNF gene: gene starts with "ZNF". ZNF gene percentage: percentage of gene / cell type combinations annotated as ZNF. ZNF gene enrichment: fold enrichment of ZNF genes compared to overall percentage. ZNF gene log 2 enrichment: log 2 fold enrichment of ZNF genes. ZNF gene median enrichment p -value: median p -value of ZNF gene enrichment across cell types. 
Cancer gene sets enriched (adj p < 0.01): the number of cancer gene sets enriched across all cell types for given annotation. Cancer gene sets enriched percentage: the percentage of cancer gene sets enriched across all cell types for given annotation. BP GO terms enriched (adj p < 0.01): the number of 'Biological Process' GO term gene sets enriched across all cell types for given annotation. BP GO terms enriched percentage: the percentage of 'Biological Process' GO term gene sets enriched across all cell types for given annotation. Color (hex): hex color for ChromGene annotation. Matplotlib color name: color used in matplotlib for ChromGene annotation [ 54 ]. State emissions and enrichments tab – The first column gives a color and number for each annotation. The second column gives the annotation mnemonic. The third column gives a number to each individual state of the mixture component. The next 12 columns give the emission probabilities for each epigenomic mark as indicated. The next two columns give the maximum and minimum emission probabilities represented as percentages. The next column gives the enrichment of the individual states for annotated TSS. Individual states within each mixture are ordered in decreasing value of this enrichment. The next column gives the initial probability of starting in the state overall. The last column gives the initial probability of the state given the component. Transition probabilities tab – This tab shows the transition probability, which indicates the probability, when in the state of the row, of transitioning to the state of the column. Probabilities are shown for individual states of the model, which are ordered and colored based on the component to which they belong, as indicated.

Additional file 3.

ChromGene assignments. An Excel spreadsheet containing the ChromGene assignments for the 127 cell types. These assignments are reported in four tabs, two based on the hg19 assembly and gene annotation (ENSEMBL v65/GENCODE v10) and two based on an hg38-based gene annotation (GENCODE v41) lifted over to hg19. For each assembly, we provide one tab using the Roadmap Epigenomics Consortium “Standardized Epigenome names” for designating the cell type, and another tab using the Epigenome IDs (EIDs) [ 1 ]. Each row after the header row corresponds to one gene. The first five columns from left to right are the chromosome of the gene, the left-most coordinate of the gene, the right-most coordinate of the gene, the gene symbol, and strand of the gene. The remaining 127 columns correspond to different cell types, and the entries to ChromGene assignments.

Additional file 4.

Review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Jaroszewicz, A., Ernst, J. ChromGene: gene-based modeling of epigenomic data. Genome Biol 24 , 203 (2023). https://doi.org/10.1186/s13059-023-03041-5


Received : 24 May 2022

Accepted : 21 August 2023

Published : 07 September 2023

DOI : https://doi.org/10.1186/s13059-023-03041-5


Machine learning
Hidden Markov models
Histone modifications
Epigenomics

Genome Biology

ISSN: 1474-760X


Species Assignment for Gene Normalization Through Exploring the Structure of Full Length Article

Conference paper
First Online: 15 February 2020

Ruoyao Ding, Huaxing Chen, Junxin Liu & Jian Kuang

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 11984))

Included in the following conference series:

International Symposium on Emerging Technologies for Education


Gene normalization is the process of automatically detecting gene names in the literature and linking them to database records. It is critical for improving the coverage of annotation in gene databases. Automatic association of a gene with a species, also known as species assignment, is an essential step of gene normalization. In this article, we propose a new species assignment method which explores the structure of full-length articles. Experimental results show that our method outperforms state-of-the-art systems on species assignment at the full-length article level. Thus, we believe our work can be used in the process of full-length article gene normalization.


Acknowledgements

The work was supported by Guangdong University of Foreign Studies (299-X5219112, 299-X5218168).

Author information

Authors and affiliations.

School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, China

Ruoyao Ding, Huaxing Chen, Junxin Liu & Jian Kuang


Corresponding author

Correspondence to Jian Kuang .

Editor information

Editors and affiliations.

University of Craiova, Craiova, Romania

Elvira Popescu

South China Normal University, Guangzhou, China

Tianyong Hao

National Taiwan Normal University, Taipei, Taiwan

Ting-Chia Hsu

Lingnan University, Hong Kong, Hong Kong

Sapienza University of Rome, Rome, Italy

Marco Temperini

Chinese Academy of Agricultural Sciences, Beijing, China


About this paper

Cite this paper.

Ding, R., Chen, H., Liu, J., Kuang, J. (2020). Species Assignment for Gene Normalization Through Exploring the Structure of Full Length Article. In: Popescu, E., Hao, T., Hsu, TC., Xie, H., Temperini, M., Chen, W. (eds) Emerging Technologies for Education. SETE 2019. Lecture Notes in Computer Science, vol 11984. Springer, Cham. https://doi.org/10.1007/978-3-030-38778-5_31


DOI : https://doi.org/10.1007/978-3-030-38778-5_31

Published : 15 February 2020

Publisher Name : Springer, Cham

Print ISBN : 978-3-030-38777-8

Online ISBN : 978-3-030-38778-5

eBook Packages : Computer Science Computer Science (R0)


BiologyDiscussion.com

Gene: Introduction, Concepts and Structure | Cell Biology


In this article we will discuss the gene: 1. Introduction to Gene 2. The Changing Concept of Gene 3. Fine Structure.

Introduction to Gene :

Mendel's (1865) experiments with the garden pea showed that certain hereditary “factors” determine the appearance of particular morphological traits.

These Mendelian “factors” were named “genes” by Johannsen (1909). Genes were shown to be arranged on chromosomes like beads on a string, which was the basis of the chromosome theory of heredity proposed in 1902-1903 by Sutton and Boveri.

Thus on the basis of these classical observations, a gene was considered in the early days as a single, small and indivisible hereditary unit that occurred at a definite point on the chromosome and was responsible for a specific phenotypic character. As the knowledge of gene increased day by day in the subsequent studies, the classical concept about gene was changed and modified accordingly.

The Changing Concept of Gene :

The discovery of phenomena such as crossing over, genetic recombination and gene mutation provided further information about the gene. Recombination, however, was believed to occur only between the beads, that is, between genes and never within them.

Hence the gene was not considered sub-divisible. Thus a gene is considered to control the inheritance of one character, to be indivisible by recombination and to be the smallest unit capable of mutation.

It was soon realised that a gene, in true sense, is not responsible for the expression of one trait by itself, although it may exercise the major control on its development.

That genes express themselves through synthesis of enzyme was demonstrated for the first time in 1941 by G. W. Beadle and E. L. Tatum due to their discovery of biochemical mutations in Neurospora. Based on their work, Beadle and Tatum proposed a concept called “one gene-one enzyme” hypothesis.

Thus it became evident that a gene controlled a biochemical reaction by directing the production of a single enzyme. But it was also realised that one gene produces a single polypeptide and not one enzyme as the latter may consist of more than one polypeptide. Thus, the gene may now be defined as a segment of DNA which contains the information for a single polypeptide (the functional unit).

Each functional unit consists of a series of nucleotides that specifies the sequence of amino acid residues of a polypeptide chain, such as the A and B chains of the enzyme tryptophan synthetase or the α and β chains of haemoglobin. It has been shown, however, that a change in as little as one nucleotide of a polypeptide-specifying gene is a mutation that can produce a variant of the wild type chain differing in one amino acid residue.

So the functional gene or unit is not the same as the mutational unit, but appears to consist of many mutable sites. The gene must also be considered from the standpoint of the sites at which recombination may occur.

The functional gene, therefore, appears to be composed of many mutational as well as re-combinational sub-units. The first evidence that the gene was sub-divisible by mutation and recombination came from studies of the X- linked lozenge locus of Drosophila melanogaster by C.P. Oliver in 1940.

Oliver demonstrated that crossing over occurred between two mutant alleles, lz^s and lz^g, of the sex-linked lozenge locus of D. melanogaster at a low frequency of 0.2%. This was the first evidence for intragenic recombination.

According to classical concept a gene is not sub-divisible in that crossing over does not occur within a gene; it always occurs between two separate genes. But Oliver’s studies first indicated that the gene was, in fact, more complex than a bead on a string.

They were first steps towards the present concept of the gene as a long sequence of nucleotide pairs that is capable of mutating and recombining at many different sites along its length.

(i) Cis-Trans Test :

The cis-trans test provides indirect experimental evidence that a gene is sub-divisible. The standard phenotype, i.e., the parental form without any mutation, is called wild type. The genes present in the wild type organism are generally designated by a ‘+’ sign for comparison with mutant genes.

Before discussing the cis-trans test, it is useful to understand the meaning of the cis and trans arrangements of genes. Cis arrangement means the condition in which a double heterozygote has received two linked mutations from one parent and their wild type alleles from the other parent, e.g., ab/ab × ++/++ produces heterozygotes ab/++ (Fig. 15.1).

Trans arrangement means the condition in which a double heterozygote has received one mutant and one wild type allele from each parent, e.g., a+/a+ × +b/+b produces a+/+b (Fig. 15.2).

In a cis-trans test the phenotypes produced in cis and trans heterozygotes for two mutant alleles are compared with each other. In a cis heterozygote, both mutant alleles are located in the same chromosome and their wild type alleles are present in the homologous chromosome, i.e., mutant alleles are linked in the coupling phase.

Thus it is expected to produce the wild type phenotype (unless the mutant alleles are dominant or co-dominant) irrespective of whether the two mutant, alleles are located in the same gene or in two different genes.

On the other hand, in trans heterozygotes the two mutant alleles are located on different homologous chromosomes, i.e., they are linked in the repulsion phase. Hence a trans heterozygote is expected to produce the mutant phenotype if the two alleles are located in the same gene. But if they are located in two different genes, the wild type phenotype would be produced.

Hence simply by comparing the phenotypes for any two mutant alleles it is possible to determine if they are located in the same gene or in two different genes.

They are located in the same gene if their cis heterozygotes produce the wild type phenotype, while their trans heterozygotes have the mutant phenotype. But if both their cis and trans-heterozygotes have the wild type phenotype they are considered to be located in two different genes.
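The decision rule of the cis-trans test can be written down as a small function. This is only a sketch of the reasoning given above; the function name and the string labels "wild" and "mutant" are hypothetical choices made for the example.

```python
def interpret_cis_trans(cis_phenotype, trans_phenotype):
    """Interpret a cis-trans test for two recessive mutant alleles.

    Both arguments are either "wild" or "mutant" (labels chosen for this sketch).
    """
    allowed = {"wild", "mutant"}
    if cis_phenotype not in allowed or trans_phenotype not in allowed:
        raise ValueError("phenotypes must be 'wild' or 'mutant'")
    if cis_phenotype != "wild":
        # The cis control failed, e.g. a dominant or co-dominant mutation,
        # so the comparison cannot be interpreted.
        return "uninformative"
    if trans_phenotype == "mutant":
        return "same gene"
    return "different genes"


print(interpret_cis_trans("wild", "mutant"))  # alleles in the same gene
print(interpret_cis_trans("wild", "wild"))    # alleles in different genes
```

The cis heterozygote serves as the control: only when it is wild type does the trans phenotype say anything about whether the two alleles share a gene.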

(ii) Complementation Test :

The production of wild type phenotype in a trans-heterozygote for two mutant alleles is termed as complementation and such a study is known as complementation test. The results obtained from complementation tests are highly precise and reliable and they permit an operational demarcation of gene.

Mutant alleles present in the same gene do not show complementation, while those located in different genes show complementation. Actually, this concept is generally true in prokaryotes but in eukaryotes several noteworthy exceptions are known.

The basis of complementation test (Fig. 15.3) may be simply described as follows. A gene produces its effect primarily by directing the production of an active enzyme or polypeptide. On the other hand, a mutant allele of this gene directs the production of an inactive form of the enzyme as a result of which it produces the mutant phenotype.

In the cis heterozygote, one of the two homologous chromosomes has the wild type allele(s) of the gene(s). This wild type allele will direct the synthesis of active enzyme—thereby producing the wild type phenotype.

In trans heterozygotes, if the mutant alleles are present in the same gene, the enzyme molecules produced by them will be inactive and capable of producing only the mutant phenotype. But if two mutant alleles are located in two different genes, one chromosome of trans heterozygote will have the wild type allele of the other gene.

Therefore, the trans heterozygote will have functional product of both the genes and the wild type phenotype will be produced by complementation. The complementation test has proven to be useful in delimiting genes.

In the following cases, however, the test does not provide evidence to delimit a gene:

a. Dominant or co-dominant mutation.

b. Genes in which mutations occur that show intragenic complementation.

c. Polar mutation, i.e., mutation that affects the expression of adjacent genes.

d. The gene in question does not produce a diffusible gene product, e.g., proteins.

Some other genes, such as the operator and promoter genes that generally occur in the operon, do not code for a polypeptide or an enzyme. Hence they can act only in the cis position and cannot show complementation. Such genes are therefore called ‘cis-acting genes’.
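Setting the exceptions listed above aside, the way pairwise complementation results delimit genes can be sketched as a grouping problem: mutants that fail to complement one another are placed in the same group. The mutant names, the `gene_of` lookup used to fabricate the example results, and the function name are all hypothetical.

```python
from itertools import combinations


def complementation_groups(mutants, complements):
    """Group mutants so that any two that fail to complement share a group.

    `complements` maps frozenset({m1, m2}) to True when the trans
    heterozygote (or mixed infection) shows the wild type phenotype.
    """
    groups = []
    for mutant in mutants:
        # Existing groups containing a mutant that fails to complement this one.
        matching = [g for g in groups
                    if any(not complements[frozenset((mutant, other))] for other in g)]
        merged = {mutant}
        for g in matching:
            merged |= g
            groups.remove(g)
        groups.append(merged)
    return groups


# Made-up example: two mutants in each of two genes.
mutants = ["a1", "a2", "b1", "b2"]
gene_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}  # used only to build the example
complements = {frozenset(pair): gene_of[pair[0]] != gene_of[pair[1]]
               for pair in combinations(mutants, 2)}

groups = complementation_groups(mutants, complements)
print(sorted(sorted(g) for g in groups))
```

Each resulting group corresponds to one gene in the operational sense of the complementation test; intragenic complementation or polar mutations would break this simple picture.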

Fine Structure of Gene :

We have already discussed that a gene can contain several sites, each capable of being independently involved in mutational and recombinational events. Therefore, a gene is neither a functional nor a recombinational unit but is a complex locus whose fine structure should be studied.

The most extensive study of the fine structure of a gene was undertaken by Seymour Benzer for a locus of the T4 bacteriophage, which infects E. coli. This locus is known as the rII locus.

T4 bacteriophage contains a linear DNA molecule about 200,000 base pairs long, which is packed within its head (Fig. 15.4). When T4 bacteriophage infects E. coli, the bacterial cell lyses in about 20-25 minutes, releasing 200-300 progeny phage particles.

When an inoculum of E. coli cells is plated in a petri dish containing semi-solid nutrient medium, it produces a uniform confluent growth, or lawn, on the surface of the medium after incubation at the appropriate temperature [Fig. 15.5(a)]. If isolated T4 bacteriophage particles are placed at different sites on the surface of the bacterial lawn, the phage infect the bacterial cells and all the E. coli cells in the immediate vicinity of each phage are destroyed.

This leads to the development of a clear area in the bacterial lawn. These clear areas are called plaques; they mark the areas of infection and lysis of bacterial cells by the phage, and their appearance is characteristic of the phage.

The plaques are surrounded by a fuzzy or turbid margin, called a halo, which is produced by a phenomenon called lysis inhibition [Fig. 15.5(b)]. Lysis inhibition is a delay in the lysis of T4-infected E. coli cells as a consequence of a subsequent infection by another T4 particle. The ability of T4 phage to cause lysis of the bacterial cell is controlled by gene(s) present in a specific locus called the ‘r’ locus (r = rapid lysis).

Mutants in the rII locus are easily recognised by their inability to multiply in E. coli strain K12(λ), which has the chromosome of phage λ integrated in its own chromosome. However, rII mutants grow rapidly in other strains of E. coli, such as strain B and strain K12 lacking the λ chromosome.

The wild type phage T4 rII+ makes small, fuzzy plaques on both B and K strains, whereas the rII mutants make large, sharp plaques on E. coli strain B and K strains (Fig. 15.7). These properties enabled Benzer to distinguish mutant from wild type phage with high efficiency. The rII mutants are conditional lethals, unable to grow in K12(λ); this property was exploited by Benzer for a fine genetic analysis of the rII locus.

Benzer isolated over 3,000 independent mutants of the rII locus and subjected them to the complementation test. Phage carrying an rII mutation can be easily identified by sterile toothpick transfers of phage from individual plaques growing on lawns of E. coli strain B (rII-permissive) to a lawn of E. coli strain K12(λ) (rII-restrictive) and to a lawn of E. coli strain B (Fig. 15.8). Each plaque to be tested (left side of Fig. 15.8) is stabbed with a sterile toothpick, which is subsequently touched to a marked area in a petri dish with a K12(λ) lawn (centre of Fig. 15.8) and then to an identically marked area in a dish with an E. coli B lawn (right side of Fig. 15.8).

Mutants that fail to grow (are lethal) on K 12 (λ) (left side of the centre plate) can be recovered from the plaques on the E. coli B plates (right side of the Fig. 15.8).

If plaques develop on the E. coli K12(λ) lawn after co-infection with two rII mutants, this indicates complementation between them, while an absence of plaques signifies a lack of complementation. (Mutants at the rI and rIII loci, as well as r+ phage, shown on the right side of the central plate, grow on both K12(λ) and B.) Benzer placed all rII mutants in two arbitrary groups named A and B.

All the rII mutations were found to be located in one of two genes, or cistrons. Benzer designated these two genes rIIA and rIIB (Fig. 15.9).

The rIIA region appears to consist of about 2,000 deoxyribonucleotide pairs. The A region is transcribed into a messenger RNA that is translated into an A polypeptide; the B region is similarly responsible for a B polypeptide. Both polypeptides are needed for lysis of K type E. coli cells. The wild type (r+) phage produces both A and B polypeptides. An A mutant produces normal B polypeptide but not A, and vice versa.

Hence infection by identical rIIA mutants alone or by identical rIIB mutants alone cannot cause lysis of the host cells, because none of the phages can produce both A and B polypeptides (Fig. 15.10).

On the other hand, infection by two different mutants (one an r II A mutant and the other an r II B mutant) on the same host cell does result in lysis (Fig. 15.11). It indicates that regions A and B are functionally different and show complementation.

Benzer observed that lysis occurred on infection by two phages, one the wild type (r+) and the other mutant in the A or B region, i.e., with the mutations in the cis position. But lysis did not occur when two mutations within the same region (A or B) were in the trans configuration.

Thus, it was clear that mutations in one functional region (A or B) are complementary only to mutations in the other region, and that complementation is detectable by the cis-trans test.

Each functional region is responsible for the production of a given polypeptide chain. Benzer named this functional unit the cistron; it conforms operationally more closely to what we commonly think of as the gene. The cistron, therefore, may be thought of as the gene at the functional level. There can be over a hundred points within a functional unit at which a mutation can take place and cause a detectable phenotypic effect.

This means that a cistron is over hundred nucleotide pairs in length and there is some evidence that some cistrons may be as long as 30,000 nucleotide pairs. Actually each cistron represents a part of a gene which is responsible for coding of only one polypeptide chain of an enzyme that has two or more different polypeptide chains in its complete enzymatic unit.

A cistron also includes initiating, terminating and any un-transcribed nucleotides.

(a) The Muton:

It is the smallest unit of DNA which, when altered, can give rise to a mutation. Study of the genetic code makes it clear that an alteration of a single nucleotide pair in DNA may result in a missense codon in the transcribed mRNA (e.g., AGC → AGA) or a nonsense codon (e.g., UGC → UGA). So a cistron may be expected to consist of many mutable units or mutons. The term muton was given by Benzer.
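The two codon changes quoted above can be checked against the standard genetic code. The sketch below builds the standard codon table from the conventional UCAG ordering and classifies a single-codon change; the function name and the category labels are choices made for this example.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def classify_codon_change(wild_codon, mutant_codon):
    """Classify an mRNA codon change as silent, missense, nonsense or stop-loss."""
    before = CODON_TABLE[wild_codon]
    after = CODON_TABLE[mutant_codon]
    if before == after:
        return "silent"
    if after == "*":
        return "nonsense"
    if before == "*":
        return "stop-loss"
    return "missense"


print(classify_codon_change("AGC", "AGA"))  # serine to arginine
print(classify_codon_change("UGC", "UGA"))  # cysteine to stop
```

Both examples differ from the wild type codon at a single nucleotide, which is why a single nucleotide pair is enough to act as a muton.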

(b) The Recon:

It is the smallest part of DNA which is interchangeable through crossing over and recombination. Extremely delicate studies of recombination in E. coli indicate that a recon consists of not more than two pairs of nucleotides, may be only one.

A recon may occur within a cistron. Thus a gene of classical concept is made up of a number of functional units—the cistrons— which consist of a number of recons and mutons (Fig. 15.12).

(i) Recombination Frequency:

The complementation test shows that all the r II mutants were located within A and B cistrons. In order to estimate the frequency of recombination between r II mutants, E. coli strains B cells are infected with a mixture of the two r II mutants.

If the crossing occurs between two chromosomes of mutant strain it yields one wild type and one double mutant type for each crossing over event (Fig. 15.13). Therefore, some of the progeny phage present in the lysate of the B strain (infected by a mixture of two r II mutant) would be of wild type.

The frequency of the wild type phage in the lysate is determined by plating lysate on the lawn of K 12 (λ) strain. Each wild type phage would produce a plaque on this lawn. This is a highly efficient selection system for wild type phage and as many as 10® progeny phage may be examined in a single petridish.

The number of plaques produced on K12(λ) represents the number of wild type phage particles in the lysate. An equal number of phage would carry the double mutant chromosome produced by recombination. Therefore, the number of recombinant phage in the lysate would be twice the number of plaques produced on K12(λ).
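The arithmetic just described can be sketched as follows. The doubling of the K12(λ) count comes from the text; the use of a plaque count on strain B as the measure of total progeny, the dilution-factor arguments and all the numbers are assumptions made for the example.

```python
import math


def recombination_frequency(plaques_on_k, dilution_factor_k,
                            plaques_on_b, dilution_factor_b):
    """Estimate recombination frequency between two rII mutants.

    plaques_on_k: wild type plaques counted on K12(lambda).
    plaques_on_b: plaques counted on strain B (assumed here to measure total progeny).
    dilution factors: fold dilution of the lysate before plating (e.g. 100 for 1:100).
    """
    wild_type_titer = plaques_on_k * dilution_factor_k
    total_titer = plaques_on_b * dilution_factor_b
    # Each crossover yields one wild type and one double mutant recombinant,
    # so the wild type count is doubled.
    return 2 * wild_type_titer / total_titer


# Made-up counts: 50 plaques on K12(lambda) at a 100-fold dilution,
# 200 plaques on B at a 1,000,000-fold dilution.
rf = recombination_frequency(50, 100, 200, 1_000_000)
print(f"{rf:.1e}")
```

Because only wild type recombinants grow on K12(λ), very rare recombination events between closely spaced sites can still be counted.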

The principle of deletion mapping is that if a particular point mutation lies within the region deleted in an rII deletion mutant, then mixed infection with this deletion mutant cannot give rise to wild type recombinants; but if the point mutation falls outside the deleted region, wild type recombinants can be produced.

The extent of the deleted segments can be analysed by crossing the deletion mutants to a set of previously mapped reference point mutations. Once a set of overlapping deletions has been mapped, their end-points divide the region covered by the longest deletion into a set of intervals A, B, C, D (Fig. 15.15).

When an unknown new mutant carrying a point mutation is isolated, the mutant can immediately be mapped to a defined interval by crossing the mutant with each of the overlapping deletion mutants. A mutant in interval D will not produce any wild type recombinant progeny in any of the four crosses. A mutation in interval C will recombine with deletion IV (Fig. 15.15) but not with the other three deletions, and so on.
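The mapping logic above can be sketched in a few lines. The layout of the four deletions is an assumption chosen to agree with the two cases described (a mutation in D recombines with none of the deletions; a mutation in C recombines only with deletion IV); the figure itself is not reproduced here, and the names are hypothetical.

```python
# Assumed nested layout: which intervals each deletion removes.
DELETIONS = {
    "I": {"A", "B", "C", "D"},
    "II": {"B", "C", "D"},
    "III": {"C", "D"},
    "IV": {"D"},
}


def map_to_interval(recombines_with):
    """Return the intervals consistent with the observed crosses.

    `recombines_with` maps each deletion name to True if wild type
    recombinants were recovered in the cross with that deletion.
    """
    intervals = sorted(set().union(*DELETIONS.values()))
    return [
        interval for interval in intervals
        # A point mutation yields wild type recombinants only with
        # deletions that do not remove its interval.
        if all((interval not in deleted) == recombines_with[name]
               for name, deleted in DELETIONS.items())
    ]


print(map_to_interval({"I": False, "II": False, "III": False, "IV": False}))
print(map_to_interval({"I": False, "II": False, "III": False, "IV": True}))
```

With n deletions whose end-points define the intervals, a new mutant is placed with n crosses instead of being crossed against every previously mapped mutant.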

In this manner, Benzer characterised the deleted segments of a large number of rII deletion mutants. This permitted him to divide the entire rII locus into 47 small segments (Fig. 15.16). A set of seven of these deletion mutants permitted him to divide the rII locus into 7 regions, A1 to A6 and B.

Each new r II mutant to be mapped was crossed pairwise with each of these seven deletion mutants and the presence of wild type (recombinants) phage particles counted in the progeny.

On the basis of these data a new rII mutant was localised in one of the seven segments (Table 15.1 and Fig. 15.17). Once an unknown rII mutant is placed in a segment, it is crossed to another set of deletion mutants, which allows its localisation in a smaller sub-division of that segment.

The final mapping of r II mutants is done on the basis of recombination data from two and point crosses among mutants located within the concerned subsection of the r II locus. Benzer et al identified more than 300 sites of mutation that were separable by recombination. The progeny of mutation at different sites is highly variable.

Electron Microscope Heteroduplex Mapping :

The presence of genetically well-defined deletion mutations at the rII locus can also be determined using a technique called heteroduplex mapping. A DNA heteroduplex is a double-stranded DNA molecule in which the two strands are not fully complementary.

One strand of a DNA double helix may carry one allele of the gene and the other strand a different allele, so that the two strands are not totally complementary. The non-complementary portions of the DNA then form a heteroduplex region, which may vary in size from one mismatched base pair to large segments of the molecule.

Heteroduplex mapping involves in vitro preparation of DNA hetero-duplexes and their analysis by electron microscope. The heteroduplex may be prepared by mixing the denatured single-stranded DNA segments of wild type and mutant type followed by DNA renaturation.

Heteroduplexes were prepared between DNA from T4 r+ phage and DNA from each of several genetically well characterised rII deletion mutants, and were then analysed by electron microscopy.

The results gave estimates of 1,800 ± 70 nucleotide pairs and 845 ± 50 nucleotide pairs for the sizes of the rIIA and rIIB genes, respectively. These results, combined with the extensive genetic data of Benzer et al., provide a fairly clear picture of the fine structure of the rII locus.

(c) Overlapping Genes :

The existence of overlapping genes provides interesting information for the study of the fine structure of the gene. It was generally accepted that the boundaries of neighbouring genes do not overlap.

The study of the nucleotide sequence of bacteriophage φX174 has clearly shown that, out of its total of 10 genes, two are located entirely within the coding sequences of two different genes. A third overlaps the sequences of three different genes. This surprising result has important genetic implications for the study of the fine structure of the gene.

(d) Fine Structure of Genes in Eukaryotes :

Complementation and recombination studies have been used to prepare fine structure maps of several eukaryotic genes. By this technique, genetic fine structure maps have now been constructed for many genes of Drosophila; maps have also been worked out for several other higher animals and higher plants.

One of the best examples of such a gene is the rosy (ry) eye locus of Drosophila which codes for the enzyme xanthine dehydrogenase. The different alleles of ry locus map at 10 different sites on the basis of recombination frequency (Fig. 15.18).

Many of the ry mutants do not show complementation (shown in the upper line of Fig. 15.18), while several others show complementation (shown in the lower line of the figure). A complementing allele may be located at the same site as, or at a site very close to, one where non-complementing alleles are located.

The complementation of ry alleles is a case of intragenic complementation. The recovery of wild type recombinants is very easy because rosy mutants are conditional lethals. As a result, wild type recombinants produced from ry2/ry3 heterozygotes can be easily isolated and counted by growing their progeny on a purine-supplemented medium; on this medium only wild type progeny survive.

Besides the rosy locus, fine structure maps have been prepared for many other eukaryotic genes, such as the white (w) eye, Notch (N) wing, lozenge (lz) eye and zeste (another eye colour locus near the white locus) loci of Drosophila, the waxy (wx) and other loci of maize, and some loci of yeast.

Analysis and mapping of eukaryotic gene have also got some limitations due to:

i. Examining enough progeny of a cross to detect rare intragenic recombination in eukaryotes is a laborious job.

ii. In many cases, determining how many genes are present at a locus has proven difficult in eukaryotes. This problem arises due to the presence of complex loci.

iii. Complementation tests have often yielded ambiguous results—due to the occurrence of intragenic complementation. Delimiting genes of eukaryotes by complementation test should be done, whenever possible, using amorphic or null mutation (mutation resulting in no gene product) to minimize the possibility of confounding effects of intragenic complementation.

In eukaryotes, however, some genes have interesting structural features which are not found in most prokaryotes.

Therefore, our view of the fine structure of any gene as discussed earlier may be partly ambiguous owing to the use of a specific recombination system. Further, the distances between genes on a genetic map may not correspond to the physical distances between them in the DNA molecule of which they are a part. There may also be gaps in a genetic map owing to the non-availability of mutants in that region.

At the molecular level the fine structure of a gene can be resolved by modern genetic mapping through determination of nucleotide sequence of the concerned DNA segment. Alternatively we can prepare a genetic map by breaking the DNA at specific sites with the help of restriction endonucleases which are specific in recognising very short DNA sequences and cutting the DNA at these specific sites.

These sites of breakage can be identified and mapped to give rise to a restriction map. This modern technique for genetic mapping at the molecular level has been applied in both prokaryotes and eukaryotes.
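As a small illustration of the idea, the sketch below locates the recognition sites of one restriction endonuclease in a DNA string and reports the fragment lengths that a complete digest of a linear molecule would give. EcoRI (recognition sequence GAATTC, cutting between G and A) is used as the example enzyme; the 30-bp sequence and the function names are made up for illustration and do not come from the text.

```python
def find_cut_positions(dna, site="GAATTC", cut_offset=1):
    """Return 0-based positions at which the top strand is cut.

    cut_offset is the number of bases of the recognition site that stay
    on the left-hand fragment (1 for EcoRI, which cuts G^AATTC).
    """
    dna = dna.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start + cut_offset)
        start = dna.find(site, start + 1)
    return positions


def fragment_lengths(dna, cuts):
    """Fragment lengths from a complete digest of a linear molecule."""
    bounds = [0] + sorted(cuts) + [len(dna)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# Hypothetical 30-bp linear sequence containing two EcoRI sites.
sequence = "ATGCGAATTCTTAGGCCATGAATTCGGTAA"
cuts = find_cut_positions(sequence)
fragments = fragment_lengths(sequence, cuts)
print(cuts)       # cut positions along the sequence
print(fragments)  # ordered fragment lengths, i.e. a one-enzyme restriction map
```

The ordered list of fragment lengths is the simplest form of a restriction map; real maps are usually built by combining digests with several enzymes.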

Open access
Published: 29 July 2024

Genetic factors associated with reasons for clinical trial stoppage

Olesya Razuvayevskaya, Irene Lopez, Ian Dunham & David Ochoa (ORCID: orcid.org/0000-0003-1857-278X)

Nature Genetics (2024)

Drug discovery
Medical genetics
Therapeutics

Many drug discovery projects are started but few progress fully through clinical trials to approval. Previous work has shown that human genetics support for the therapeutic hypothesis increases the chance of trial progression. Here, we applied natural language processing to classify the free-text reasons for 28,561 clinical trials that stopped before their endpoints were met. We then evaluated these classes in light of the underlying evidence for the therapeutic hypothesis and target properties. We found that trials are more likely to stop because of a lack of efficacy in the absence of strong genetic evidence from human populations or genetically modified animal models. Furthermore, certain trials are more likely to stop for safety reasons if the drug target gene is highly constrained in human populations and if the gene is broadly expressed across tissues. These results support the growing use of human genetics to evaluate targets for drug discovery programs.

The drug discovery endeavor is dominated by high attrition rates, and failure remains the most likely outcome throughout the pipeline 1 . A diverse set of factors can lead to failure, with lack of efficacy or unforeseen safety issues reportedly explaining 79% of setbacks in the clinic 2 . New approaches adopted across the industry have aimed to improve success rates by systematically assessing the available evidence throughout the research and clinical pipelines 3 , 4 . Support from human genetic evidence has been repeatedly associated with successful clinical trial progression 5 , 6 , 7 , 8 , ultimately supporting two-thirds of the drugs approved by the US Food and Drug Administration (FDA) in 2021 (ref. 9 ). Further understanding of the reasons for success or failure in clinical trials could assist in reducing future attrition.

Systematically assessing the reasons for success or failure in clinical trials can be hampered by many factors. Several surveys have demonstrated a bias towards reporting positive results, with 78.3% of trials in the literature reporting successful outcomes 10 , 11 , 12 . Successful clinical trials are published significantly faster than trials reporting negative results 13 , 14 . However, access to negative results is crucial, not only for revealing efficacy tendencies and safety liabilities 15 but also for retrospective review and benchmarking of predictive methods, including machine learning.

Since 2007, the FDA has required the submission of clinical trial results to ClinicalTrials.gov, a free-to-access global databank aimed at registering clinical research studies and their results 16 , 17 . For trials halted before their scheduled endpoint, ClinicalTrials.gov provides a freeform stopping reason: termination, suspension or withdrawal 18 . A team of researchers 19 previously classified the reasons for 3,125 stopped trials and found that only 10.8% of trials stopped because of a clear negative outcome. By contrast, the majority (54.5%) fell into a set of reasons characterized as neutral in relation to the therapeutic hypothesis, such as patient recruitment or other business or administrative reasons 19 .

Here, we extended that work by training a natural language processing (NLP) model to classify stopping reasons and used this model to classify 28,561 stopped trials. We integrated our classification with evidence associating the drug target and disease from the Open Targets Platform 20 , revealing that trials stopped for lack of efficacy or safety reasons were less supported by genetic evidence. Furthermore, oncology trials involving drugs for which the target gene is constrained in human populations were more likely to stop for safety reasons, whereas drugs with targets with tissue-selective expression were less likely to pose safety risks. These observations confirm and extend previous studies recognizing the value of genetic information and selective expression in target selection.

Interpretable classification of early stoppage reasons

To catalog the reasons behind the withdrawal, termination or suspension of clinical studies, we classified every free-text reason submitted to ClinicalTrials.gov using an NLP classifier. To build a training set for our model, we revisited the manual classification reported in a previous publication of 3,124 stopped trials based on the available submissions to ClinicalTrials.gov in May 2010 (ref. 19 ). The authors of that article classified every study with a maximum of three classes following an ontological structure (Supplementary Table 1 ). Each of the classes was also assigned a higher-level category representing the outcome implications for the clinical project. For example, 33.7% of the studies were classified as stopped owing to ‘insufficient enrollment’, a neutral outcome owing to its expected independence from the therapeutic hypothesis. When inspecting submitted reasons belonging to the same curated category, we observed a strong linguistic similarity, as revealed by clustering the cosine similarity of the sentence embeddings (Extended Data Fig. 1 ). Studies stopped because of reasons linked to lack of efficacy and studies stopped because of futility have a linguistic similarity of 0.98, with both classes manually classified as ‘negative’ outcomes. Based on this clustering, we redefined the classification by merging semantically similar classes represented by low numbers of annotated sentences. Moreover, we added 447 studies that were stopped as a result of the COVID-19 pandemic (Supplementary Table 2 ), resulting in a total of 3,571 studies manually classified into at least one of 17 stop reasons and explained by six different higher-level outcome categories.

By leveraging the consistent language used by the submitters, we fine-tuned the BERT model 21 for the task of clinical trial classification into stop reasons (Methods). Overall, the model showed strong predictive power in the cross-validated set (micro-averaged F score, F_micro = 0.91), performing strongly for the most frequent classes, such as ‘insufficient enrollment’ (F = 0.98) or ‘COVID-19’ (F = 1.00), but demonstrating decreased performance on linguistically complex reasons, such as trials stopped because of another study (F = 0.71) (Supplementary Table 3).

To further evaluate the model, we manually curated an additional set of 1,675 stop reasons from randomly selected studies that were not included in the training set. Overall, the performance against the unseen data was lower but comparable to that of the cross-validated model (F_micro ranging from 0.70 to 0.83 depending on the choice of the annotator) (Supplementary Table 4), demonstrating real-world performance and reduced risk of overfitting. Interestingly, the curators demonstrated a relatively low agreement for many classes in which the machine-learning model also showed relatively weak performance, such as studies stopped because of insufficient data or met endpoint (Methods and Extended Data Fig. 1).
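The micro-averaged F score reported above pools true positives, false positives and false negatives across all classes before computing precision and recall. The sketch below shows that calculation for a multi-label setting; the article does not spell out its exact evaluation code, so the function and the toy labels are illustrative assumptions.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1 for multi-label data given as lists of label sets."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # labels correctly predicted
        fp += len(pred - truth)   # labels predicted but not annotated
        fn += len(truth - pred)   # annotated labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations and predictions for four stopped trials.
truth = [{"insufficient_enrollment"}, {"covid19"},
         {"negative", "safety"}, {"another_study"}]
pred = [{"insufficient_enrollment"}, {"covid19"},
        {"negative"}, {"business"}]

score = micro_f1(truth, pred)
print(round(score, 3))
```

Because counts are pooled, frequent classes such as ‘insufficient enrollment’ dominate the micro-averaged score, which is why per-class F values are reported alongside it.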

Reasons reflect operational, clinical and biological constraints

Classification of the 28,561 stopped trials submitted to ClinicalTrials.gov before 27 November 2021 was performed using our NLP model fine-tuned on all the manually curated sentences (Supplementary Table 5). In total, 99% of the trials were classified with at least one of the 15 potential reasons and mapped to one of six different higher-level outcomes (Fig. 1). ‘Insufficient enrollment’ remained the most common reason to stop a trial (36.67%), with other reasons before the accrual of any study results also occurring in a large number of studies. A total of 977 trials (3.38%) were classified as stopped because of ‘safety or side effects’, and 2,197 studies (7.6%) were stopped because of ‘negative’ reasons, such as those questioning the efficacy or value (futility). The incidence of each stop reason reflects the purpose of each phase (Extended Data Fig. 2). Studies stopped because of ‘negative’ outcomes more often impacted phase II (odds ratio (OR) = 1.9, P = 2.4 × 10⁻³⁸) and phase III (OR = 2.6, P = 3.64 × 10⁻⁵⁵), whereas studies stopped as a result of ‘safety or side effects’ declined in relative incidence after phase I (OR = 2.4, P = 9.63 × 10⁻²³) (Supplementary Table 6). Trials stopped because of the relocation of the study or key staff occurred more than twice as often during early phase I, highlighting the importance of good clinical practices during the foundational stages. Of the studies that provided a stop reason, 48% were indicated for oncology. This large proportion is likely to be the combined result of the specific weight of oncology indications in the aggregated portfolio (27% of drug approvals in 2022) with the reported large incidence of clinical failures in oncology (32%) compared to other indications 22 , 23 .

Fig. 1. Predicted trial stop reasons are shown in rows with counts of trials per start year, clinical phase or therapeutic area shown by the color in each cell. The outcome groupings of the stopped reasons are shown using the color next to the stopped reason label. Note that trials start potentially many years before they are stopped.

Moreover, oncology studies stopped more frequently as a result of safety or side effects and were rarely stopped because of the COVID-19 pandemic (Extended Data Fig. 3 ). Alternatively, COVID-19 was the reported reason to stop respiratory studies at a higher rate than any other therapeutic area, possibly indicating increased operational difficulties.

Genetic support for stopped trials influences the outcome

To better understand the underlying reasons that might have caused the study to fail, we assessed the availability of different types of potentially causal genetic evidence for the intended pharmacological targets in the same indication (Extended Data Fig. 4). By using genetic evidence collated by the Open Targets Platform, we reproduced previous reports indicating that genetically supported studies are more likely to progress through the clinical pipeline (Fig. 2a) 5 , 6 . Interestingly, we also observed that stopped trials, among all the trials at any phase, are depleted in genetic support (OR = 0.73, P = 3.4 × 10⁻⁶⁹). A similar lack of genetic evidence was observed for the three types of stopped studies: withdrawn, terminated and suspended (Supplementary Table 7).

Fig. 2. a,b, Genetic evidence support for clinical trials either from human genetics studies (a) or the International Mouse Phenotyping Consortium mouse knockouts (KO) that phenocopy the human disease (b). The panels show the odds ratio (OR) of support for the target-disease hypothesis from genetics evidence for all clinical trials split by phase (top row), stopped clinical trials (center row) and stopped clinical trials split by higher-level stopping reason (bottom row). The significance of the association between genetic evidence and trial outcome was assessed using a two-tailed Fisher’s exact test, with a P value threshold of 0.05 without multiple testing correction. The panels show the OR and 95% confidence intervals (CIs) for the association between genetic evidence and each subclass of trial. Significant ORs of >1 indicate enrichment and <1 indicate depletion.

When stratifying the stopped studies by reason, trials halted because of negative outcomes, such as lack of efficacy or futility, displayed a significant decrease of genetic support for the intended pharmacological target in the same indication (OR = 0.61, P = 6 × 10⁻¹⁸) (Fig. 2a). The depletion of genetic evidence on negative outcomes remains consistent when stratifying the indications by oncology (OR = 0.53) or non-oncology studies (OR = 0.75) (Extended Data Fig. 6), as well as when splitting by different sources of genetic evidence, including genome-wide association studies processed by the Open Targets Genetics Portal 24 , gene burden tests based on sequencing of large population cohorts 25 , 26 , 27 , ClinVar 28 , ClinGen Gene Validity 29 , Genomics England PanelApp 30 , gene2phenotype 31 , Orphanet 32 and Uniprot 33 (Extended Data Fig. 5).

Other predicted reasons for stopping the trials, such as insufficient enrollment, problems with the study design or business or administrative reasons, also present a strong to moderate depletion of genetic evidence denoting potential reduced support for the therapeutic hypothesis (Fig. 2 ). We found that studies stopped as a result of coincidental factors such as the COVID-19 pandemic have no association with the availability of genetic support for the intended target in the primary indication.

The observed associations between clinical trial outcomes and the availability of genetic support remain consistent when considering genetic information in mouse models (Fig. 2b). Trials that were stopped because of negative factors present the weakest support among all predicted reasons (OR = 0.7, P = 4 × 10⁻¹¹) when genetic evidence is defined as the presence of a murine model in which the drug target homologous gene knockout causes a phenotype that mimics the indication, as reported by the International Mouse Phenotyping Consortium 34 .

Genetic factors associated with safety-associated stopped trials

Analysis of the classified stop reasons indicates that oncology trials are more likely to stop because of safety or side effects (OR = 2.14, P = 8.1 × 10⁻⁷⁹; Supplementary Table 7). Moreover, for all trials predicted to stop because of safety concerns, we found a significant enrichment in targets associated with driver events reported by COSMIC 35 , ClinVar 28 or IntOgen 36 (Extended Data Fig. 7). Examining the target properties (Fig. 3), we found that studies targeting genes that are highly constrained in natural populations (gnomAD pLOEUF 16th percentile) are 1.5 times more likely to stop as a result of safety concerns 37 . Furthermore, the risk of stopping because of safety declines as the genetic constraint of the target decreases. Similarly, we identified a 1.4-fold increased risk of stopping because of safety concerns when the targeted gene is classified as loss-of-function intolerant (pLI > 0.9). These findings are compatible with previous evidence indicating that constrained genes are associated with increased side effects 38 . We also identified functional genomic features that inform on increased safety risk. According to the Human Protein Atlas, a similar 1.3-fold increased risk is observed for genes expressed with low tissue specificity 39 . Instead, studies targeting tissue-enriched genes show a lower-than-expected (OR = 0.8, P = 1.8 × 10⁻⁴) likelihood of stopping because of safety. Finally, targets physically interacting with ten or more different partners according to the IntAct database (MI score > 0.42) present an increased risk of stopping as a result of safety concerns 40 . Further stratification of this analysis by indication denotes that these overall constraint signals impacting studies that are stopped because of safety are largely influenced by oncology trials.

Fig. 3. We evaluated the significance of the association between trials stopping because of safety or side effects and each variable (therapeutic area, relative genetic constraint as defined by gnomAD, tissue specificity as defined by the Human Protein Atlas and network connectivity with data from IntAct) with a two-tailed Fisher’s exact test, with a P value threshold of 0.05 without multiple testing correction. OR > 1 represents an increased risk of study stopping and OR < 1 represents protection against stopping. Error bars, 95% CI; LoF, loss of function. Detailed results are presented in Supplementary Table 7.

Genetic evidence is increasingly leveraged by the pharmaceutical industry to add support to the therapeutic hypothesis 3 , 4 , 41 , 42 . Adding to previous observations on the role of genetic factors in overall trial success 5 , 6 , we exploited under-used data from clinical trial records to better understand the opposite outcome: why clinical trials stop. Although the availability of genetic evidence might inform future success, failure remains the most common outcome of clinical studies, and, to our knowledge, no systematic evidence exists on the relevance of genetics to de-risk negative results.

Recent reports indicate that 79% of clinical studies fail because of a lack of efficacy or safety 2 . Our analysis indicates that within the 7.9% of studies that stop early because of withdrawal, termination or suspension, the proportion of trials that failed because of efficacy or safety is only 12.7%. Stopped studies are more likely to fail because of early coincidental factors that are not necessarily linked to biological plausibility; for example, the principal investigator relocates or there is insufficient enrollment in the trial. Notwithstanding the reduced relative risk of efficacy and safety as the main causes for stopping the trial, these studies provide a significant body of unsuccessful results that are probably explained by a weak therapeutic hypothesis. Continued expansion of the recording of negative results from clinical trials, including stoppages, will be valuable. To assist in this effort, we will continue to update the classification of stopped studies through the Open Targets Platform ( https://platform.opentargets.org ) 20 . Further investigation of the study outcomes for completed studies could expand our understanding of the reasons behind unsuccessful trials, particularly after accrual of the study results.

Our analysis exploits the classified stop reasons to understand the relative importance of the causes leading to failed studies. By using a case-control approach, we conclude that genetic support is not only predictive of clinical trial progression but also protective of early trial stoppage. We illustrate different ways in which genetic causality and genetic constraint can de-risk the target selection process. However, many stopped trials, even when stopped for efficacy and safety reasons, might be explained by factors beyond the intended pharmacological target. Off-target effects, pharmacokinetics, drug delivery or toxicology are other risks not considered in this study that might also explain a set of negative outcomes. Another limitation of our study is that the reasons submitted to ClinicalTrials.gov might only represent a fraction of all the reasons contributing to the decision to halt the study. For example, we found that studies that were classified as stopped because of patient recruitment manifest weaker genetic support, an observation that we did not anticipate owing to the lack of an obvious link between enrollment and biological plausibility. Hence, we reason that a fraction of the stopped trials might present an overall lack of confidence in the therapeutic hypothesis, independently of the reported reason.

This study showcases how reflecting on past failures can inform the relative importance of the risks associated with early target identification and prioritization. Although clinical trial success is a discrete outcome, failure needs to be understood as a breakdown of many possible causes. A proper set of positive and negative outcomes such as the ones introduced in this work represent the groundwork necessary to implement quantitative or semi-automatic models to objectively de-risk any future studies.

Inclusion and ethics

This study relied solely on aggregated genetic and clinical information available in public resources. It did not make use of individual-level data, and no specific ethics approval was required. Some of the data sources, including clinical study results, clinical curation of rare variants or genome-wide associations, might present biases towards European ancestries.

NLP classification of stopped clinical trials

To quantify the semantic structure of the reasons for clinical trial stop, we analyzed the classification of the stopped clinical trials developed in a previous publication 19 . We trained a long short-term memory network to create the representations for each stop reason and averaged the embeddings across all examples of a particular class. The class embeddings were then used to calculate the cosine similarities among classes and were visualized using agglomerative hierarchical clustering (Extended Data Fig. 1 ). The hierarchical representation illustrates the clusters that are semantically close to each other, along with the number of examples per class and the parent category. Similar classes with fewer sentences were grouped together to ensure representative categories based on clinical expertise and semantic similarity. For example, the classes ‘study moved’ and ‘key staff left’ are semantically clustered together and attend to similar underlying reasons. The list of categories defined in the previous publication and their redefined groupings can be found in Supplementary Table 8 .
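The class-level comparison described above can be sketched in a few lines: average the sentence embeddings of each class, then take the cosine similarity between class averages. The three-dimensional vectors and class names below are toy values chosen for illustration; the real embeddings came from the trained network, and the resulting similarity matrix would then be passed to a hierarchical clustering routine.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_embeddings(labelled_vectors):
    """Average the embeddings of all examples that share a class label."""
    grouped = defaultdict(list)
    for label, vector in labelled_vectors:
        grouped[label].append(vector)
    return {label: mean_vector(vectors) for label, vectors in grouped.items()}


# Toy embeddings (assumed values, not taken from the study).
examples = [
    ("study_moved", [1.0, 0.0, 0.0]),
    ("study_moved", [0.8, 0.2, 0.0]),
    ("key_staff_left", [0.9, 0.1, 0.1]),
    ("lack_of_efficacy", [0.0, 0.1, 1.0]),
]
centroids = class_embeddings(examples)
sim_close = cosine_similarity(centroids["study_moved"], centroids["key_staff_left"])
sim_far = cosine_similarity(centroids["study_moved"], centroids["lack_of_efficacy"])
print(round(sim_close, 3), round(sim_far, 3))
```

In this toy case the two operational classes end up much closer to each other than to the efficacy class, which is the kind of signal used to decide which sparse classes to merge.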

To validate the model on new data and expand the training set, we performed a human annotation experiment of 1,675 additional ClinicalTrials.gov studies that were not classified in the previous publication. We randomly assigned six sets of 250 unique stopped trials to each curator, including 25 overlapping trials, to estimate the inter-annotator agreement. Across 3 pairs of annotators, we estimated inter-annotator agreements of 0.8, 0.71 and 0.66 using the kappa statistic 43 .
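For readers unfamiliar with the kappa statistic, the sketch below computes Cohen's kappa for two annotators on the same items: observed agreement corrected for the agreement expected by chance. It assumes one label per item, which simplifies the actual task (annotators could assign up to three classes), and the labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who each gave one label per item."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability that both pick the same class independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for ten overlapping trials.
annotator_1 = ["enroll", "enroll", "enroll", "enroll", "safety",
               "safety", "negative", "negative", "business", "business"]
annotator_2 = ["enroll", "enroll", "enroll", "safety", "safety",
               "safety", "negative", "enroll", "business", "business"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 8 of 10 items, but because some agreement would occur by chance, kappa is lower than the raw 0.8 agreement rate.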

Stop reason classification model

We fine-tuned the BERT model for the task of predicting the stop reasons on the training set of 4,500 human-annotated stopped clinical trials 21 . We used a BERT uncased pre-trained model with a one-layer feed-forward classifier consisting of a ReLU layer between the input and output layers, in which the input and output layers represent linear layers. Fine-tuning was performed by using the HuggingFace transformer library 44 . The classifier uses 50 hidden units and the ReLU activation function.

We used the last hidden state at token ‘[CLS]’ to retrieve a representation of the whole explanation and fed it into the classifier. We then applied ‘sigmoid’ over the logits to retrieve the probabilities. The best accuracy on the validation set was achieved while training the model for seven epochs with a batch size of 32, a learning rate of 5 × 10⁻⁵ and the PyTorch implementation of the Adam optimizer with weight decay, in which the weight decay is set to the default value of 1 × 10⁻². The test set was created stratified to ensure that the relative class frequencies were considered in each fold of the test set. Given that the nature of the task does not assume that the categories are mutually exclusive and the original and new annotation tasks allowed human annotators to mark up to three categories, we treated the top three probabilities returned by the model that are above a pre-defined threshold as correct answers.
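The decision rule in the last sentence can be sketched as follows: apply a sigmoid to each logit independently, rank the classes, and keep at most three whose probability exceeds the threshold. The threshold value is not given in the text, so the 0.5 used here is an assumption, as are the class names and logits.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(logits, class_names, threshold=0.5, max_labels=3):
    """Return up to max_labels (name, probability) pairs above the threshold."""
    probabilities = [sigmoid(z) for z in logits]
    ranked = sorted(zip(class_names, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return [(name, p) for name, p in ranked[:max_labels] if p > threshold]


# Hypothetical logits for one free-text stop reason.
classes = ["insufficient_enrollment", "business_administrative",
           "negative", "safety_sideeffects"]
logits = [2.0, 0.3, -1.5, 0.8]

predicted = predict_labels(logits, classes)
print([name for name, _ in predicted])
```

Using independent sigmoids rather than a softmax is what allows several stop reasons to be assigned to the same trial.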

Clinical studies

We collated all clinical trials from ClinicalTrials.gov as of 27 November 2021 and classified the 28,561 stopped studies (withdrawn, suspended or terminated). Genetic traits and indications from clinical studies were harmonized using the Experimental Factor Ontology (EFO) 45 . When studies contained multiple indications, their similarity based on the EFO structure was evaluated. All indications were considered when indications were similar (for example, several oncology indications). When indications were dissimilar (for example, diabetes in malaria patients), the diseases were curated to annotate the appropriate indication for the study. Drugs reported as approved by the FDA were also considered to ensure the representation of medicines preceding the ClinicalTrials.gov resource. To map each drug or clinical candidate to its pharmacological targets, we leveraged the molecule mechanism of action from the ChEMBL database 46 . All possible annotations were used if a drug could be mapped to multiple targets. All drug targets were annotated against Ensembl gene IDs 47 when possible. To perform subsequent analyses, only drugs with a known mechanism of action were considered. The resulting dataset contains 594,375 clinical target-disease records, capturing 71,419 unique target-disease associations and 57,775 target-disease pairs in studies that stopped early 48 .

Target-disease genetic support

We integrated 13 sources available in the Open Targets Platform in April 2022 to extract a comprehensive list of genetically supported gene–disease associations. The genetic evidence was mapped to Ensembl gene identifiers and EFO identifiers as part of the Open Targets activities. To represent common disease genetics, we leveraged Open Targets Genetics 24 , a post-genome-wide association study analysis leveraging different functional genomics features. In this study, we used all gene assignments based on a locus-to-gene score above 0.05 (ref. 49 ). The other predominantly germline genetic sources included in this analysis are Gene Burden 25 , 26 , 27 , ClinVar 28 , Genomics England PanelApp 30 , Gene2Phenotype 31 , Clingen Gene–Disease Validity 29 , Uniprot 33 and Orphanet 32 . We included COSMIC Cancer Hallmarks 35 , IntOgen cancer drivers 36 and ClinVar somatic variants 50 as sources of somatic genetic evidence. As a source to capture the effects of genetic variation in animal models, we included the mouse–human phenotypic mappings reported by the International Mouse Phenotyping Consortium 34 . Genetic evidence was ontologically expanded using the EFO, resulting in 3,654,109 genetically supported gene–trait pairs. This dataset represents a redundant view of the evidence, with its only purpose being to maximize the overlap with the clinical information and minimize the issues related to the sparsity in the annotation.
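Two of the steps above, filtering gene assignments by locus-to-gene score and expanding evidence through the ontology, can be sketched as below. The sketch assumes that expansion means propagating each gene–disease pair to every ancestor of the disease term; the text does not state the direction, and the gene names, scores and tiny parent map are invented stand-ins for EFO.

```python
L2G_THRESHOLD = 0.05  # locus-to-gene score cut-off stated in the text


def ancestors(term, parents):
    """All ancestor terms reachable through the child -> parents map."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def expand_evidence(pairs, parents):
    """Add (gene, ancestor) pairs for every supported (gene, disease) pair."""
    expanded = set()
    for gene, disease in pairs:
        expanded.add((gene, disease))
        for ancestor in ancestors(disease, parents):
            expanded.add((gene, ancestor))
    return expanded


# Hypothetical scored assignments: (gene, trait, locus-to-gene score).
scored = [
    ("GENE_A", "type 2 diabetes", 0.62),
    ("GENE_B", "type 2 diabetes", 0.03),
    ("GENE_C", "asthma", 0.40),
]
supported = {(gene, trait) for gene, trait, score in scored if score > L2G_THRESHOLD}

# Toy ontology fragment (assumed structure).
parents = {
    "type 2 diabetes": ["diabetes mellitus"],
    "diabetes mellitus": ["metabolic disease"],
    "asthma": ["respiratory system disease"],
}
expanded = expand_evidence(supported, parents)
print(sorted(expanded))
```

Expansion of this kind deliberately produces redundant pairs, which matches the stated aim of maximizing overlap with the clinical annotations.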

Target annotations

To analyze the target factors that could influence studies stopped because of safety or side effects, we also included a set of target annotations that were independent of the study indication. Each gene was annotated with genetic constraint data from gnomAD, representing the functional impact of the presence of genetic variants, and split into six categories derived from gnomAD’s pLOEUF quantiles. We also analyzed the predicted loss-of-function intolerance, distinguishing genes as ‘LoF-intolerant’ when the pLI score is above 0.9 and as ‘LoF tolerant’ when the pLI score was below 0.1 (ref. 37 ). Moreover, each target was classified in a bin based on the number of unique interacting partners above an MI score threshold of 0.42 in the IntAct database 40 . This threshold corresponds to a physical interaction identified at least once in low-throughput studies or replicated in multiple high-throughput experiments. Additionally, target annotation for tissue specificity and distribution was retrieved from the baseline transcriptomic experiments in the Human Protein Atlas database 39 . The assessment was performed according to the categories defined by the Human Protein Atlas.
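The thresholds in this paragraph translate into simple annotation rules, sketched below. The pLI cut-offs (0.9 and 0.1) and the MI score threshold (0.42) are taken from the text; the "intermediate" label, the function names and the example interaction records are assumptions made for illustration.

```python
MI_THRESHOLD = 0.42  # IntAct MI score cut-off stated in the text


def lof_class(pli):
    """Classify a gene by its pLI score."""
    if pli > 0.9:
        return "LoF-intolerant"
    if pli < 0.1:
        return "LoF-tolerant"
    return "intermediate"  # label assumed here; not named in the text


def count_partners(interactions, target):
    """Number of unique partners interacting with target above the threshold."""
    partners = set()
    for a, b, mi_score in interactions:
        if mi_score <= MI_THRESHOLD or a == b:
            continue
        if a == target:
            partners.add(b)
        elif b == target:
            partners.add(a)
    return len(partners)


# Hypothetical interaction records: (interactor A, interactor B, MI score).
interactions = [
    ("T1", "P1", 0.55),
    ("T1", "P2", 0.30),  # below threshold, ignored
    ("P3", "T1", 0.60),
    ("T1", "P1", 0.90),  # duplicate partner, counted once
    ("T2", "P1", 0.70),
]

print(lof_class(0.97), lof_class(0.02), lof_class(0.5))
print(count_partners(interactions, "T1"))
```

A target would then be placed in a connectivity bin according to this count, for example the "ten or more partners" group discussed in the results.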

Statistics and reproducibility

No preliminary statistical analyses were conducted to determine sample sizes. The choice of clinical studies and genetic information followed an unbiased procedure. The significance of each case-control study was computed using a two-sided Fisher's exact test using all available samples. No multiple testing correction was applied to the resulting P values. All statistical tests were computed using SciPy (v.1.11.4) 50 . The code to replicate the analyses is publicly available (see Code availability).
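To make the case-control test concrete, the sketch below implements a two-sided Fisher's exact test and the sample odds ratio for a 2 × 2 table using only the standard library. The study itself used SciPy; this re-implementation, the Woolf (log-normal) confidence interval and the small example counts are illustrative assumptions, not the authors' code or data.

```python
import math


def hypergeom_pmf(k, row1, row2, col1):
    """Probability that the top-left cell equals k when all margins are fixed."""
    return math.comb(row1, k) * math.comb(row2, col1 - k) / math.comb(row1 + row2, col1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided P value: sum of all tables no more probable than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    p_observed = hypergeom_pmf(a, row1, row2, col1)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    p_value = 0.0
    for k in range(k_min, k_max + 1):
        p_k = hypergeom_pmf(k, row1, row2, col1)
        if p_k <= p_observed * (1 + 1e-7):  # tolerance for floating-point ties
            p_value += p_k
    return min(p_value, 1.0)


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Sample odds ratio with a Woolf 95% interval; all cells must be non-zero."""
    estimate = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return estimate, estimate * math.exp(-z * se), estimate * math.exp(z * se)


# Hypothetical table. Rows: trials stopped for negative reasons / other trials.
# Columns: with genetic support / without genetic support.
a, b, c, d = 1, 3, 3, 1
p_value = fisher_exact_two_sided(a, b, c, d)
odds_ratio, ci_low, ci_high = odds_ratio_with_ci(a, b, c, d)
print(round(odds_ratio, 3), round(p_value, 4))
```

An odds ratio below 1 in this layout corresponds to depletion of genetic support among the stopped trials. Exact enumeration is slow for very large counts, where a library routine is the practical choice.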

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The full training set is available for download at HuggingFace, including the curation from the previously published article 19 and the COVID-19 stopped studies 51 . The resulting model for download or interactive exploration can also be found in HuggingFace 52 . The dataset of the clinical trial stop reason predictions used in this study is available on GitHub 53 . The collection of clinical studies annotated with predicted stop reasons and genetic evidence can also be accessed on HuggingFace. Up-to-date predictions for newer clinical trial studies are updated quarterly in the Open Targets Platform.

Code availability

Code to reproduce the model and analysis is available on GitHub 53 .

DiMasi, J. A., Grabowski, H. G. & Hansen, R. W. Innovation in the pharmaceutical industry: new estimates of R&D costs. J. Health Econ. 47 , 20–33 (2016).

Article PubMed Google Scholar

Dowden, H. & Munro, J. Trends in clinical success rates and therapeutic focus. Nat. Rev. Drug Discov. 18 , 495–496 (2019).

Article CAS PubMed Google Scholar

Morgan, P. et al. Impact of a five-dimensional framework on R&D productivity at AstraZeneca. Nat. Rev. Drug Discov. 17 , 167–181 (2018).

Wu, S. S. et al. Reviving an R&D pipeline: a step change in the phase II success rate. Drug Discov. Today 26 , 308–314 (2021).

Nelson, M. R. et al. The support of human genetic evidence for approved drug indications. Nat. Genet. 47 , 856–860 (2015).

King, E. A., Davis, J. W. & Degner, J. F. Are drug targets with genetic support twice as likely to be approved? Revised estimates of the impact of genetic support for drug mechanisms on the probability of drug approval. PLoS Genet. 15 , e1008489 (2019).

Article PubMed PubMed Central Google Scholar

Minikel, E. V., Painter, J. L., Dong, C. C. & Nelson, M. R. Refining the impact of genetic evidence on clinical success. Nature 629 , 624–629 (2024).

Trajanoska, K. et al. From target discovery to clinical drug development with human genetics. Nature 620 , 737–745 (2023).

Ochoa, D. et al. Human genetics evidence supports two-thirds of the 2021 FDA-approved drugs. Nat. Rev. Drug Discov. 21 , 551 (2022).

Ioannidis, J. P. A. Why most published research findings are false. PLoS Med. 2 , e124 (2005).

Young, N. S., Ioannidis, J. P. A. & Al-Ubaydli, O. Why current publication practices may distort science. PLoS Med. 5 , e201 (2008).

Bourgeois, F. T., Murthy, S. & Mandl, K. D. Outcome reporting among drug trials registered in ClinicalTrials.gov. Ann. Intern. Med. 153 , 158–166 (2010).

Qunaj, L. et al. Delays in the publication of important clinical trial findings in oncology. JAMA Oncol. 4 , e180264 (2018).

Jones, C. W. et al. Delays in reporting and publishing trial results during pandemics: cross sectional analysis of 2009 H1N1, 2014 Ebola, and 2016 Zika clinical trials. BMC Med. Res. Methodol. 21 , 120 (2021).

Petsko, G. A. When failure should be the option. BMC Biol. 8 , 61 (2010).

Ross, J. S., Mulvey, G. K., Hines, E. M., Nissen, S. E. & Krumholz, H. M. Trial publication after registration in ClinicalTrials.gov: a cross-sectional analysis. PLoS Med. 6 , e1000144 (2009).

Califf, R. M. et al. Characteristics of clinical trials registered in ClinicalTrials.gov, 2007–2010. JAMA 307 , 1838–1847 (2012).

Al-Durra, M., Nolan, R. P., Seto, E., Cafazzo, J. A. & Eysenbach, G. Nonpublication rates and characteristics of registered randomized clinical trials in digital health: cross-sectional analysis. J. Med. Internet Res. 20 , e11924 (2018).

Pak, T. R., Rodriguez, M. & Roth, F. P. Why clinical trials are terminated. Preprint at https://doi.org/10.1101/021543 (2015).

Ochoa, D. et al. The next-generation Open Targets Platform: reimagined, redesigned, rebuilt. Nucleic Acids Res. 51 , D1353–D1359 (2023).

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2018).

Mullard, A. 2022 FDA approvals. Nat. Rev. Drug Discov. 22 , 83–88 (2023).

Harrison, R. K. Phase II and phase III failures: 2013–2015. Nat. Rev. Drug Discov. 15 , 817–818 (2016).

Ghoussaini, M. et al. Open targets genetics: systematic identification of trait-associated genes using large-scale genetics and functional genomics. Nucleic Acids Res. 49 , D1311–D1320 (2021).

Wang, Q. et al. Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature 597 , 527–532 (2021).

Backman, J. D. et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 599 , 628–634 (2021).

Karczewski, K. J. et al. Systematic single-variant and gene-based association testing of thousands of phenotypes in 394,841 UK Biobank exomes. Cell Genom. 2 , 100168 (2022).

Landrum, M. J. et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42 , D980–D985 (2014).

McGlaughon, J. L., Goldstein, J. L., Thaxton, C., Hemphill, S. E. & Berg, J. S. The progression of the ClinGen gene clinical validity classification over time. Hum. Mutat. 39 , 1494–1504 (2018).

Martin, A. R., Williams, E., Foulger, R. E. et al. PanelApp crowdsources expert knowledge to establish consensus diagnostic gene panels. Nat. Genet. 51 , 1560–1565 (2019).

Thormann, A. et al. Flexible and scalable diagnostic filtering of genomic variants using G2P with Ensembl VEP. Nat. Commun. 10 , 2373 (2019).

Rodwell, C. & Aymé, S. Rare disease policies to improve care for patients in Europe. Biochim. Biophys. Acta 1852 , 2329–2335 (2015).

UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47 , D506–D515 (2019).

Muñoz-Fuentes, V. et al. Correction to: the International Mouse Phenotyping Consortium (IMPC): a functional catalogue of the mammalian genome that informs conservation. Conserv. Genet. 20 , 135–136 (2019).

Bamford, S. et al. The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website. Br. J. Cancer 91 , 355–358 (2004).

Gundem, G. et al. IntOGen: integration and data mining of multidimensional oncogenomic data. Nat. Methods 7 , 92–93 (2010).

Chen, S. et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625 , 92–100 (2024).

Duffy, Á. et al. Tissue-specific genetic features inform prediction of drug side effects in clinical trials. Sci. Adv. 6 , eabb6242 (2020).

Uhlén, M. et al. Proteomics. Tissue-based map of the human proteome. Science 347 , 1260419 (2015).

Del Toro, N. et al. The IntAct database: efficient access to fine-grained molecular interaction data. Nucleic Acids Res. 50 , D648–D653 (2022).

Barrett, J. C., Dunham, I. & Birney, E. Using human genetics to make new medicines. Nat. Rev. Genet. 16 , 561–562 (2015).

Fernando, K. et al. Achieving end-to-end success in the clinic: Pfizer’s learnings on R&D productivity. Drug Discov. Today 27 , 697–704 (2022).

Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20 , 37–46 (1960).

Wolf, T. et al. HuggingFace’s transformers: state-of-the-art natural language processing. Preprint at https://doi.org/10.48550/arXiv.1910.03771 (2019).

Malone, J. et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 26 , 1112–1118 (2010).

Gaulton, A. et al. The ChEMBL database in 2017. Nucleic Acids Res. 45 , D945–D954 (2017).

Cunningham, F. et al. Ensembl 2022. Nucleic Acids Res. 50 , D988–D995 (2022).

Open Targets. clinical_evidence. Hugging Face https://doi.org/10.57967/HF/2611 (2024).

Mountjoy, E. et al. An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat. Genet. 53 , 1527–1533 (2021).

SciPy (The SciPy Community, 2023).

Open Targets. Clinical_trial_reason_to_stop. Hugging Face https://doi.org/10.57967/HF/2600 (2024).

Open Targets. clinical_trial_stop_reasons. Hugging Face https://doi.org/10.57967/HF/2599 (2024).

López, I., Ochoa, D. & Olesya, R. OpenTargets/StopReasons: stable release. Zenodo https://doi.org/10.5281/ZENODO.11966097 (2024).

Acknowledgements

We would like to thank T. R. Pak, M. D. Rodriguez and F. P. Roth from Harvard Medical School, the Dana-Farber Cancer Institute and the Donnelly Centre (University of Toronto) for providing the dataset of curated stop reasons that was used for training our model. We also thank the Open Targets team for manually curating the stop reasons for the additional set of 1,675 clinical studies, including A. Hercules, A. Gonzalez, K. Tsirigos and H. Cornu. Finally, we would like to thank S. Machlitt-Northen from GlaxoSmithKline for providing detailed feedback on the redefined categories for stopped trials. I.D.'s research was funded in part by a Wellcome Trust grant (grant number 206194). For the purpose of Open Access, the authors have applied a CC-BY public copyright license to any author-accepted manuscript version arising from this submission.

Open access funding provided by European Molecular Biology Laboratory (EMBL).

Author information

Authors and affiliations.

Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire, UK

Olesya Razuvayevskaya, Irene Lopez, Ian Dunham & David Ochoa

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, UK

Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, UK

Contributions

O.R., I.D. and D.O. designed the study. O.R. and I.L. trained the models. O.R., I.L. and D.O. conducted the analysis. O.R., I.L., I.D. and D.O. wrote the manuscript.

Corresponding author

Correspondence to David Ochoa.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature Genetics thanks Emily King, Matthew Nelson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Hierarchical clustering of stop reason similarity based on curation from Pak et al. and 447 additional stopped trials due to COVID-19.

Distances were estimated as the cosine similarity of the averaged embeddings (see Methods section).
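The caption's idea of comparing stop-reason classes through their averaged embeddings can be illustrated with a minimal sketch. The toy three-dimensional vectors, the class labels and the helper names (`mean_embedding`, `cosine_similarity`, `cosine_distance`) are assumptions made for illustration only; they are not the authors' data or code, and converting similarity to a distance as 1 minus similarity is one common convention rather than something the caption states.

```python
import math

def mean_embedding(vectors):
    """Average a list of equal-length embedding vectors component-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical sentence embeddings, grouped by stop-reason class.
embeddings_by_class = {
    "Safety": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "Efficacy": [[0.9, 0.1, 0.1], [0.7, 0.3, 0.1]],
    "COVID-19": [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9]],
}

# One averaged embedding (centroid) per class.
centroids = {name: mean_embedding(vecs) for name, vecs in embeddings_by_class.items()}

def cosine_distance(class_a, class_b):
    """Distance between two classes, taken here as 1 - cosine similarity (an assumption)."""
    return 1.0 - cosine_similarity(centroids[class_a], centroids[class_b])

for a in centroids:
    for b in centroids:
        if a < b:
            print(f"{a} vs {b}: distance = {cosine_distance(a, b):.3f}")
```

A pairwise distance matrix built this way could then be passed to any hierarchical clustering routine to produce a dendrogram like the one the figure describes.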

Extended Data Fig. 2 Association between the reasons to stop along each phase of the clinical development as reported by ClinicalTrials.gov.

Underpowered reasons for stoppage were excluded. We used a two-tailed Fisher’s exact test to assess the significance of the associations, with a p-value threshold of 0.05 (n = 10,214 independent stopped trials). Significant associations are highlighted in blue. Error bars represent 95% confidence intervals for the odds ratio. All results are provided in Supplementary Table 6.
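For readers unfamiliar with the test named in this caption, the following is a minimal sketch of a two-tailed Fisher's exact test on a 2×2 contingency table, written with the standard library only. The function name `fisher_exact_two_tailed` and the example counts are hypothetical; the article itself lists SciPy among its references, so this sketch is an illustration of the technique and not the authors' implementation.

```python
from math import comb

def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value). The p-value sums the hypergeometric
    probabilities of every table with the same margins that is no more
    probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # small tolerance for float ties
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)

# Hypothetical counts: rows = stopped for a given reason (yes / no),
# columns = trial in a given phase (yes / no).
odds_ratio, p_value = fisher_exact_two_tailed(8, 2, 1, 5)
print(f"OR = {odds_ratio:.1f}, p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```

With 10,214 trials the same logic applies; exact integer arithmetic in `comb` keeps the tail sum stable, although a library routine would be faster.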

Extended Data Fig. 3 Percentage of stopped trials predicted to be halted due to Safety or side effects or the COVID-19 pandemic as a fraction of all the trials by predominant therapeutic area.

Indications with multiple possible therapeutic areas were associated with the most severe area (for example Oncology). Overall incidence when considering all therapeutic areas was 5% for COVID-19 and 3.3% for Safety or side effects.

Extended Data Fig. 4 Representation of the data and analytical workflow defined to investigate the predictive value of genetics in all target/disease associations derived from clinical trials.

Except for the baseline expression data, all datasets were sourced from the Open Targets 22.04 release. Detailed methods can be found in the Methods section.

Extended Data Fig. 5 Association between the availability of genetic evidence and clinical trial outcomes by genetic data source.

X-axis displays the respective odds ratio and y-axis groups the studies by phase (red), stopped clinical trials (blue) and stopped clinical trials split by high-level stopping reason (green). Error bars represent 95% confidence intervals for the odds ratio.

Extended Data Fig. 6 Genetic support for stopped clinical trials in all indications, non-oncology and oncology.

Each row represents a stopping reason, with the effect size in the form of odds ratio (OR) and its 95% confidence interval represented by the dot and error bar. An odds ratio (OR) > 1 suggests that trials stopped for a given reason are more likely to have genetic support, while an OR < 1 indicates depletion. The number of trials (n) supporting each estimate is provided. Statistical significance was assessed using a two-tailed Fisher’s exact test, with a significance threshold of p < 0.05 without multiple testing correction.
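How an odds ratio and its 95% confidence interval relate to enrichment or depletion can be sketched as follows. The caption does not say how the intervals were computed, so the Woolf (log-scale Wald) interval used here is an assumption, as are the function names and the example counts. Reading enrichment off the interval is a simplification for illustration; the caption states that significance was assessed with Fisher's exact test.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio for the 2x2 table [[a, b], [c, d]] with a Woolf interval.

    a: trials stopped for the reason, with genetic support
    b: trials stopped for the reason, without genetic support
    c: other trials, with genetic support
    d: other trials, without genetic support
    All four counts must be positive. z = 1.96 gives an approximate 95% interval.
    """
    odds_ratio = (a * d) / (b * c)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * standard_error)
    upper = math.exp(math.log(odds_ratio) + z * standard_error)
    return odds_ratio, lower, upper

def describe(lower, upper):
    """Label an interval: entirely above 1, entirely below 1, or spanning 1."""
    if lower > 1:
        return "enriched for genetic support"
    if upper < 1:
        return "depleted of genetic support"
    return "interval includes 1"

# Hypothetical counts, chosen only to show the arithmetic.
odds_ratio, lower, upper = odds_ratio_with_ci(30, 70, 100, 800)
print(f"OR = {odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f}): {describe(lower, upper)}")
```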

Extended Data Fig. 7 Association between the availability of genetic evidence and clinical trial outcomes by somatic data source.

X-axis displays the respective odds ratio and y-axis groups the studies by phase (top row), stopped clinical trials (centre row) and stopped clinical trials split by high-level stopping reason (bottom row). Error bars represent 95% confidence intervals for the odds ratio.

Supplementary information

Reporting summary, peer review file, supplementary table.

Supplementary Tables 1–9.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Razuvayevskaya, O., Lopez, I., Dunham, I. et al. Genetic factors associated with reasons for clinical trial stoppage. Nat Genet (2024). https://doi.org/10.1038/s41588-024-01854-z

Received: 15 February 2023

Accepted: 02 July 2024

Published: 29 July 2024

DOI: https://doi.org/10.1038/s41588-024-01854-z

The fine structure of the gene

PMID: 13867419
DOI: 10.1038/scientificamerican0162-70

COMMENTS

Genes: Properties, Classification and Fine Structure
The fine structure of gene deals with mapping of individual gene locus. This is parallel to the mapping of chromosomes. In chromosome mapping, various genes are assigned on a chromosome, whereas in case of a gene several alleles are assigned to the same locus. The individual gene maps are prepared with the help of intragenic recombination.
Fine structure genetics
Fine structure genetics. Fine structure genetics encompasses a set of tools used to examine not just the mutations within an entire genome, but can be isolated to either specific pathways or regions of the genome. Ultimately, this more focused lens can lead to a more nuanced and interactive view of the function of a gene.
PDF LAC OPERON AND FINE STRUCTURE OF GENE
Benzer, in 1955, divided the gene into recon, muton and cistron, which are the units of recombination, mutation and function within a gene. Several units of this type exist in a gene. In other words, each gene consists of several units of function, mutation and recombination. The fine structure of gene deals with mapping of individual gene locus.
Modern Concept of Gene (With Diagram)
In this article we will discuss: 1. Introduction to Modern Concept of Gene 2. Concept of Gene - Classical Vs. Molecular 3. Subdivision 4. Fine Structure 5. Multi-Gene Families 6. Overlapping 7. Mobile Genetic Elements. Contents: Introduction to Modern Concept of Gene Concept of Gene - Classical Vs. Molecular Subdivision of Gene Fine […]
10.3: Prokaryotic Gene Regulation
Figure 1. In prokaryotes, structural genes of related function are often organized together on the genome and transcribed together under the control of a single promoter. The operon's regulatory region includes both the promoter and the operator. If a repressor binds to the operator, then the structural genes will not be transcribed.
Gene Assignment
As such, early systems models were models of biochemical kinetics. Knowledge of the genome and, in particular, the annotation of the transcriptome of model organisms including the human, has enabled the construction of genome-wide probes by in-silico (computationally-based) means, and precise gene assignment for genome-wide transcript analysis.
Gene Assignment
Gene Assignment. The complete gene assignment of all 10 segments of BTV was shown for the first time in 1988 by virtue of the in vitro transcription and translation of the mRNA from all 10 BTV genomic segments (Van Dijk and Huismans, 1988). ... Gene, Protein, Structure, and PubChem and provide the connection between sequence records and the ...
(PDF) Genes: Definition and Structure
University of Wisconsin, Madison, Wisconsin, USA. The word 'gene' has two meanings: (1) the determinant of an observable trait or characteristic of an organism, or (2) the DNA sequence that ...
PDF The Fine Structure of the Gene
...cture. Replication of a Virus. An extremely useful organism for this fine-structure mapping is the T4 bacteriophage, which infects the colon bacillus. T4 is one of a family of viruses that has been most fruitfully exploited by an entire school of molecular biologists founded by Max Delbrück of the Cali.
The Fine Structure of the Gene
More by Seymour Benzer. This article was originally published with the title "The Fine Structure of the Gene" in Scientific American Magazine Vol. 206 No. 1 (January 1962), p. 70. doi:10.1038 ...
BSc 2nd Year 3rd Semester Zoology//Fine Structure of Gene ...
👉 BSc 3rd Semester Zoology Unit 1 Process of Transcription: https://www.youtube.com/playlist?list=PLIMEmoNzKu2mUHPbKWAU89AoE1o4KyvW-
Gene expression and regulation
DNA and RNA structure. Replication. DNA can copy itself! Imagine unraveling a sock in order to turn it into two identical socks. ... Regulation of gene expression and cell specialization. Mutations.
Fine Structure of a Gene
Fine Structure of a Gene — DNA Sequencing. Author: Nadia Rosenthal, Ph.D. Author Info & Affiliations. Published March 2, 1995. N Engl J Med 1995;332: 589 - 591. DOI: 10.1056/NEJM199503023320908 ...
Genes: Concept, Definition, Size and Types (With Diagram)
Fine Structure of a Gene 6. Types of Genes and 7. Open Reading Frame. The genetic blueprint contained in the nucleotide sequence can determine the phenotype of an individual. The hereditary units, which are transmitted from one generation to the next generation are called genes. A gene is a fundamental biological unit like atom which is the ...
Structural genomics and its importance for gene function analysis
Whereas the goal of converting protein structure into function can be accomplished by traditional sequence motif-based approaches, recent studies have shown that assignment of a protein's ...
Fine-Scale Genetic Structure in Finland
To establish the high-level genetic structure in Finland, we applied FS to the output of CP by allowing exactly two populations. As expected, the main genetic division was between W and E parts of the country (Figure 3A). The pairwise F_ST (Patterson et al. 2006) between these two populations was 0.002 (SE = 2 × 10^−5).
ChromGene: gene-based modeling of epigenomic data
This structure also allowed us to compactly represent all genes without having separate files for each gene and cell type. For a single cell type, we had 23 chromosome files (chromosomes 1-22 and X). ... To calculate mutual information of gene assignments given gene lengths, for each gene, we first calculated its log10(length), ...
Genetic Fine Structure from Testcross Progeny Analysis
Abstract. For decades, geneticists have analyzed the fine structure of genes by taking advantage of the cell's own DNA recombination machinery. Genetic recombinational analysis has led to the construction of gene maps that consist of linear arrays of mutations separated from each other by measured genetic intervals along the length of the gene.
Species Assignment for Gene Normalization Through Exploring ...
Gene normalization is a process of automatically detecting gene names in the literature and linking them to database records. It is critical for improving the coverage of annotation in gene databases. Automatic association of a gene with a species, also known as species assignment, is an essential step of gene normalization.
Gene: Introduction, Concepts and Structure
In this article we will discuss Gene: 1. Introduction to Gene 2. The Changing Concept of Gene 3. Fine Structure. Introduction to Gene: Mendel's (1865) experiments with the garden pea plant showed that certain hereditary "factors" were concerned in determining the appearance of certain morphological traits. Such Mendelian "factors" were described as "gene" by
Genetic factors associated with reasons for clinical trial stoppage
The authors of that article classified every study with a maximum of three classes following an ontological structure ... we fine-tuned the BERT ... we used all gene assignments based on a locus ...
The fine structure of the gene
The fine structure of the gene. The fine structure of the gene. Sci Am. 1962 Jan;206:70-84. doi: 10.1038/scientificamerican0162-70.
Re‐evaluation of shellac (E 904) as a food additive and a new
An illustrative chemical structure for shellac was provided (Figure 2). FIGURE 2. ... An additional MS study was conducted to confirm the assignment of mono- and dichloro- compounds, ... For FC 07.2 Fine bakery wares, the restriction 'only as glazing agents only for small products of fine bakery wares coated with chocolate' was considered ...
Federal Register, Volume 89 Issue 147 (Wednesday, July 31, 2024)
[Federal Register Volume 89, Number 147 (Wednesday, July 31, 2024)] [Proposed Rules] [Pages 61596-62648] From the Federal Register Online via the Government Publishing Office [www.gpo.gov] [FR Doc No: 2024-14828] [[Page 61595]] Vol. 89 Wednesday, No. 147 July 31, 2024 Part II Book 2 of 2 Books Pages 61595-62652 Department of Health and Human Services ----- Centers for Medicare & Medicaid ...