Investigating the shared genetic architecture between multiple sclerosis and inflammatory bowel diseases - Nature Communications
Methods

Study samples

GWAS dataset for MS

GWAS summary results for MS were obtained from the International MS Genetics Consortium (IMSGC) meta-analysis of 15 datasets comprising 14,802 MS cases and 26,703 controls of European ancestry [9]. Each dataset was imputed using the 1000 Genomes European panel. SNPs with minor allele frequency (MAF) > 1% were utilised for meta-analysis using a fixed-effects model. As the MAF information was not available in the MS GWAS meta-analysis results, we annotated the MAF information based on the European population from the 1000 Genomes panel. Ambiguous SNPs (AT, TA, CG and GC) were excluded and a total of ~6.8 million SNPs were retained for analysis.

GWAS datasets for IBD, UC and CD

We obtained publicly available GWAS summary data for UC, CD and IBD, the latter case sample comprising those in both the UC and CD GWAS [12]. We note that UC and CD were the primary focus of our analyses, but we also included IBD, so as to compare the results of our genetic analyses to the epidemiological literature for overlap between MS and IBD, and because the GWAS for IBD has greater power than UC and CD alone. A total of 34,652 participants of European ancestry (12,882 cases and 21,770 controls) were included in the IBD GWAS, from which 27,432 Europeans (6968 cases and 20,464 controls) and 20,883 Europeans (5956 cases and 14,927 controls) were included in the UC and CD GWAS, respectively. Nearly 12 million SNPs (~9.5 million with MAF > 1%) were included in all three GWAS summary statistics, imputed using the 1000 Genomes Europeans as the reference. Genome-wide association analyses for each disease were conducted using PLINK [70], adjusted by principal components. More details about the cohorts and quality control (QC) process are given in Jostins et al. [11] and Liu et al. [12].

GTEx data

GTEx is a public data resource of gene expression in 53 non-diseased human primary tissues [31], including 50 solid tissues (e.g.
liver, stomach), including some organs (e.g. brain) with multiple subregions, whole blood and two cell line ‘tissues’ (e.g. Epstein–Barr virus–transformed lymphocytes). We used normalised (transcripts per million) GTEx V7 data [71] to assess tissue-specific gene expression. After excluding low-quality individuals (N = 2, defined as <100 genes with >1 read per million) and genes (N = 736, defined as <4 individuals with >1 read per million), we retained data on 53 tissues from a total of 751 individuals, with an average of 220 samples per tissue type. In addition, we downloaded the GTEx V7 expression quantitative trait locus (eQTL) summary data for downstream analysis.

scRNA-seq data

On the basis of evidence for tissue-level SNP heritability enrichment in the GTEx analyses, we obtained scRNA-seq unique molecular identifier (UMI) count matrices from healthy human lung (N = 57,020 cells) [72], spleen (N = 94,257 cells) [72] and peripheral blood (N = 68,579 cells) [73], and mouse small intestine [74] (N = 7216 cells). For the latter, we filtered out genes with mismatched gene symbols between mouse and human. In our analyses, we used the normalised and quality-controlled scRNA-seq data and cell clustering results reported in the primary articles [72, 73, 74], and no further QC was conducted. A total of 84 cell types across four tissues were utilised in our study (see Supplementary Data 1), with an average of 2703 cells per cell type.

Statistical analyses

LDSC

We used S-LDSC [24] (Python 2.7) with the baseline-LD model [25] to estimate single-trait SNP heritabilities (\(h^2_{\mathrm{SNP}}\), i.e. the proportion of the phenotypic variance in a trait that can be explained by common genetic variants tagged on SNP arrays) for MS, IBD, UC and CD. The baseline-LD model [25] is an extension of S-LDSC [24] that partitions the SNP heritability on the basis of continuous, as opposed to binary, annotation sets [25].
We used this approach to estimate \(h^2_{\mathrm{SNP}}\) rather than univariate LDSC, as the latter may underestimate \(h^2_{\mathrm{SNP}}\) due to the action of negative selection, which results in SNPs with low levels of LD having higher per-SNP heritability. We reformatted all GWAS summary statistics to the pre-computed LD scores of the 1000 Genomes European reference. SNPs were excluded if they did not intersect with the reference panel, or if they were located in the MHC region (chromosome 6: 28,477,797–33,448,354), had a MAF […] SNP heritability estimates were converted to the liability scale based on the observed sample prevalence and population prevalence, assuming the latter were 0.3%, 0.54%, 0.29% and 0.25% [75, 76] for MS, IBD, UC and CD, respectively.

We used bivariate LDSC [77] to estimate genetic correlations (\(r_g\), i.e. the proportion of genetic variance shared by two traits divided by the square root of the product of their SNP heritability estimates) between MS and each of IBD, UC and CD, as well as between UC and CD. We conducted bivariate LDSC without constraining the intercept, and \(r_g\) estimates were considered Bonferroni significant if the p-value was \(<1.25 \times 10^{-2}\) (i.e. p […] Fisher’s transformed Z-score from \(r_g\) using the formula Z-score [78] \(=\frac{Z_{r_{g1}}-Z_{r_{g2}}}{s}\), where \(r_{g1}\) and \(r_{g2}\) represent the two genetic correlations, Fisher’s transformed \(Z_{r_g}=\frac{1}{2}\ln\left(\frac{1+r_g}{1-r_g}\right)\) per correlation, \(s=\sqrt{\frac{1}{n_{r_{g1}}-3}+\frac{1}{n_{r_{g2}}-3}}\), and the effective sample size per correlation is estimated by \(n_{r_g}=\frac{1-r_g}{\mathrm{s.e.}_{r_g}^{2}}+2\). We then calculated the two-tailed p-value from the Z-score of a standard normal distribution. As a sensitivity analysis, we also performed LDSC with the single-trait heritability intercept constrained.
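The comparison of two genetic correlations described above can be sketched in a few lines of Python. This is only an illustration of the formulas as printed in the text, not the authors' code: the function names and the input values in the usage note are hypothetical, and the effective sample size uses the expression exactly as it appears above.

```python
import math


def fisher_z(r_g):
    """Fisher's transformation of a correlation: 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1.0 + r_g) / (1.0 - r_g))


def effective_n(r_g, se):
    """Effective sample size as printed in the text: (1 - r_g) / se^2 + 2."""
    return (1.0 - r_g) / se ** 2 + 2.0


def compare_rg(r_g1, se1, r_g2, se2):
    """Return (Z, two-tailed p) for the difference between two genetic correlations."""
    n1, n2 = effective_n(r_g1, se1), effective_n(r_g2, se2)
    if n1 <= 3 or n2 <= 3:
        raise ValueError("effective sample size must exceed 3")
    s = math.sqrt(1.0 / (n1 - 3.0) + 1.0 / (n2 - 3.0))
    z = (fisher_z(r_g1) - fisher_z(r_g2)) / s
    # Two-tailed p-value under a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```

For example, `compare_rg(0.5, 0.05, 0.2, 0.05)` (made-up estimates and standard errors) returns a positive Z-score, because the first correlation is the larger one, together with its two-tailed p-value.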
Compared to LDSC with an unconstrained intercept, constrained-intercept LDSC is designed to decrease the standard error of estimates under the assumption of no population stratification, thereby indirectly evaluating the influence of GWAS statistic inflation. However, constrained-intercept LDSC may also provide a biased or misleading estimate of heritability and genetic correlation if the intercept is constrained incorrectly.

Estimation of local genetic correlations using ρ-HESS

To investigate whether MS shared greater genetic overlap with UC than with CD within local independent genomic regions, we applied ρ-HESS [26] (Python 2.7) to evaluate the local genetic correlations (i.e. genetic correlation between traits due to their shared genetic variance at a defined genomic region) between MS and each of IBD, UC and CD. A total of 1699 default regions that were approximately LD independent, with an average size of nearly 1.5 Mb [79], were checked by ρ-HESS, including five regions in the MHC (i.e. chromosome 6: 28,017,819–28,917,608, 28,917,608–29,737,971, 30,798,168–31,571,218, 31,571,218–32,682,664 and 32,682,664–33,236,497). We performed ρ-HESS to estimate the local SNP heritability per trait and the genetic covariance between traits based on the 1000 Genomes Europeans reference of the hg19 genome build. Local genetic correlation estimates were then calculated from the local single-trait SNP heritability and local cross-trait genetic covariance estimates.

Multi-trait analysis of GWAS

To identify risk SNPs associated with joint phenotypes comprising MS and each of IBD, UC and CD, we implemented cross-trait meta-analysis of GWAS summary statistics using MTAG [27] (Python 2.7). We used MTAG, rather than standard inverse-variance weighted meta-analyses with trait-specific effect sizes, because this approach can accommodate potential sample overlap between GWAS.
We implemented MTAG options that assume equal SNP heritability for each trait and perfect genetic covariance between traits. The upper bound for the false discovery rate (‘maxFDR’) was calculated to examine the assumptions on the equal variance–covariance of shared SNP effect sizes underlying the traits. To investigate whether violations of the assumptions of equal SNP heritability for each trait and perfect genetic covariance between traits biased our MTAG results, we performed CPASSOC [28] for MS-IBD, MS-UC and MS-CD as a sensitivity analysis. CPASSOC assumes the presence of heterogeneous effects across traits and estimates the cross-trait statistic \(S_{\mathrm{Het}}\) and p-value through a sample size-weighted meta-analysis of GWAS summary data. We prioritised independent SNPs that were genome-wide significant in the cross-trait meta-analyses (e.g. MS-IBD) using both MTAG and CPASSOC, but not identified in the original single-trait GWAS (e.g. MS or IBD). These independent genome-wide significant SNPs were identified by LD clumping (\(r^2\) […] Haplotype Reference Consortium (HRC). We defined cross-trait SNPs of particular interest if they were independent (i.e. LD \(r^2\) […] SNPs in the respective single-trait GWAS (IMSGC GWAS discovery cohort [14,802 cases, 26,703 controls] [9]; IBD [12]; UC [12]; CD [12]) or the IMSGC GWAS meta-analysis of MS (discovery + replicate cohorts [47,429 cases, 68,374 controls]; N = 200 non-MHC genome-wide significant SNPs; for which we did not have access to the full summary statistics) [9] or that were in LD (LD \(r^2 \geq 0.05\)) with any of these previously reported genome-wide significant SNPs.

MR analyses

We used six MR methods to investigate putative causal relationships between MS and each of IBD, UC and CD: Generalised Summary-data-based Mendelian Randomisation (GSMR) [80], MR-Egger [81], inverse variance weighting (IVW) [82], weighted median [83], weighted mode [84] and CAUSE [29].
We utilised multiple MR methods with different assumptions on the extent and nature of horizontal pleiotropy, which refers to variants with effects on both outcome and exposure through a pathway other than a causal effect. Horizontal pleiotropy can be correlated, if variants affecting both the outcome and exposure do so via a shared heritable factor, or uncorrelated, if variants affect outcome and exposure traits via separate mechanisms. We considered relationships with consistent evidence for causality across all MR methods to be more reliable and noteworthy.

We used the R packages GSMR [80] and TwoSampleMR [85] to implement five MR methods (GSMR, IVW, MR-Egger, weighted median and weighted mode) with different assumptions about horizontal pleiotropy. Briefly, GSMR assumes no correlated pleiotropy but implements the HEIDI-outlier approach to identify and remove SNPs with evidence for significant uncorrelated pleiotropy. IVW assumes that if uncorrelated pleiotropy is present it has mean zero, and so only adds noise to the regression of meta-analysed SNP effects with multiplicative random effects [82]. MR-Egger further allows for the presence of directional (i.e. non-zero mean) uncorrelated pleiotropy and adds an intercept to the IVW regression to exclude confounding from such pleiotropy [81]. Two-sample MR methods capable of accounting for some correlated pleiotropy include the weighted median and the weighted mode. The weighted median takes the weighted median, rather than the weighted mean, of the SNP ratio estimates, and can identify true causality if ≤50% of the weights are from invalid SNPs [83]. The weighted mode classifies the SNPs into groups according to their estimated causal effects and assesses evidence for causality using only the largest set of SNPs, which essentially relaxes the assumptions of MR and can identify the true effect even if a majority of instruments are invalid SNPs [84].
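To make the differences between three of these estimators concrete, the following is a minimal NumPy sketch of IVW, MR-Egger and the weighted median, computed from per-SNP summary statistics. It is a simplified illustration and not the GSMR or TwoSampleMR implementation: it returns point estimates only, uses first-order inverse-variance weights, and the function and argument names (`bx`, `by`, `se_by`) are hypothetical.

```python
import numpy as np


def ivw(bx, by, se_by):
    """Inverse-variance weighted slope of by on bx, with no intercept."""
    w = 1.0 / se_by ** 2
    return np.sum(w * bx * by) / np.sum(w * bx ** 2)


def mr_egger(bx, by, se_by):
    """Weighted regression of by on bx with an intercept; returns (intercept, slope).

    A non-zero intercept suggests directional uncorrelated pleiotropy.
    """
    sign = np.sign(bx)              # orient every SNP so its exposure effect is positive
    bx, by = bx * sign, by * sign
    w = 1.0 / se_by ** 2
    X = np.column_stack([np.ones_like(bx), bx])
    xtwx = X.T @ (w[:, None] * X)
    xtwy = X.T @ (w * by)
    intercept, slope = np.linalg.solve(xtwx, xtwy)
    return intercept, slope


def weighted_median(bx, by, se_by):
    """Weighted median of the per-SNP ratio estimates by / bx."""
    ratio = by / bx
    w = (bx / se_by) ** 2           # first-order inverse variance of each ratio
    order = np.argsort(ratio)
    ratio, w = ratio[order], w[order]
    cum = (np.cumsum(w) - 0.5 * w) / np.sum(w)
    return float(np.interp(0.5, cum, ratio))
```

With instruments that all share one ratio, the three estimators agree. If a minority of the weight comes from an invalid SNP, the weighted median stays at the shared ratio while IVW is pulled towards the outlier.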
For these five MR methods, independent SNPs (LD clumping \(r^2\) […] HRC and UK10K) with evidence for genome-wide association (\(p \leq 5 \times 10^{-8}\)) with the ‘exposure’ trait were used as instrumental variables, and merged with the SNPs from the ‘outcome’ trait.

We also used a recently published Bayesian-based MR method called CAUSE that accounts for both correlated and uncorrelated pleiotropy [29]. Compared to the other two-sample MR methods, CAUSE further corrects correlated pleiotropy by evaluating the joint distribution of effect sizes from instrumental SNPs, assuming that the ‘true’ causal effect can influence all instrumental SNPs while correlated pleiotropy only influences a subset of instrumental SNPs. CAUSE improves the power of MR analysis by including a larger number of LD-pruned SNPs (LD \(r^2\) […] the MHC region, here we performed MR analyses with and without SNPs located within the MHC region, to further investigate the effects of MHC region SNPs on putative causal associations between MS and each of IBD, UC and CD. We applied a stricter LD threshold (\(r^2\) […] SNPs in the MHC region. We declared inferred causal relationships to be significant if they showed Bonferroni-corrected p […] Byrne et al. [86] (i.e. \(\mathrm{beta}_{xy[\mathrm{liability}]}=\frac{z_{K_x}K_y(1-K_y)}{z_{K_y}K_x(1-K_x)}\,\mathrm{beta}_{xy[\mathrm{logit}]}\), where \(K_x\) and \(K_y\) are the population prevalence of the exposure and outcome trait, respectively; and \(z_{K_x}\) and \(z_{K_y}\) are the height of the Gaussian distribution at the population prevalence threshold for the exposure and outcome trait, respectively), assuming the population prevalence for MS, IBD, UC and CD were 0.3%, 0.54%, 0.29% and 0.25% [75, 76], respectively. We then transformed the liability-scale effect size to an odds ratio.

Tissue and cell-type specific enrichment of SNP heritability
Selection of tissue type- and cell type-specific expressed genes: We selected genes that were specifically expressed in each tissue and cell type using the method described by Bryois et al. [87]. For GTEx, we followed Bryois et al. in excluding testis and tissues that were non-natural or collected in <100 donors. We then calculated the average gene expression for tissues in the same organ (e.g. colon-sigmoid and colon transverse), with the exception of brain tissues. These filtering criteria reduced the total number of analysed GTEx tissues from 53 to 37. Subsequently, for each tissue and cell type, we excluded non-protein coding genes, genes with duplicated names, genes located in the MHC region, and genes not expressed in any tissue or cell type. We then scaled gene expression to a total of 1 million UMIs per tissue or cell type, and calculated, for each gene, the proportion (ranging from 0 to 1) of total expression across all tissue/cell types that was specific to each tissue/cell type. The top 10% of most specific genes for each tissue and cell type were then selected for downstream analyses.

Stratified LD score regression: We first used S-LDSC [24] to investigate whether SNP heritability for MS, IBD, UC and CD was enriched in specific tissues. We then applied S-LDSC to scRNA-seq data to evaluate whether specific cell types in those tissues showed significant heritability enrichment. For each of 37 GTEx tissues and 84 cell types from healthy human lung (N = 28), spleen (N = 30) and peripheral blood (N = 11), and mouse small intestine (N = 15; we used mouse small intestine data as a ‘proxy’ because no large human small intestine dataset was publicly available), we defined a focal functional category by selecting SNPs located within 100 kb (hg19) of the set of 10% most specific genes and added this to the default baseline model (comprising 52 genomic annotations) and the set of all genes.
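The specificity calculation described above (scale each tissue or cell type to 1 million, take each gene's share of its total expression across tissues, keep the top 10% per tissue) can be sketched as follows. This is a minimal illustration that assumes the gene filters have already been applied and that expression has been averaged per tissue or cell type. The function name, arguments and return structure are hypothetical and are not taken from Bryois et al. or the authors' pipeline.

```python
import numpy as np


def top_specific_genes(expr, genes, tissues, top_frac=0.10):
    """Return (specificity matrix, kept gene names, {tissue: top specific genes}).

    expr: array of shape (n_genes, n_tissues) with mean expression per tissue.
    """
    expr = np.asarray(expr, dtype=float)
    # Scale every tissue / cell type (column) to a total of 1 million.
    scaled = expr / expr.sum(axis=0, keepdims=True) * 1e6
    # Drop genes that are not expressed in any tissue or cell type.
    total = scaled.sum(axis=1)
    keep = total > 0
    scaled, total = scaled[keep], total[keep]
    kept_genes = np.asarray(genes)[keep]
    # Specificity: share of a gene's total expression found in each tissue (0 to 1).
    spec = scaled / total[:, None]
    n_top = max(1, int(round(top_frac * spec.shape[0])))
    top = {}
    for j, tissue in enumerate(tissues):
        idx = np.argsort(-spec[:, j], kind="stable")[:n_top]
        top[tissue] = [str(g) for g in kept_genes[idx]]
    return spec, kept_genes, top
```

Each row of the returned specificity matrix sums to 1, so a gene expressed almost exclusively in one tissue scores close to 1 there and close to 0 elsewhere.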
We evaluated the significance of each SNP heritability enrichment estimate using the p-value of the regression coefficient Z-score, after adjusting for the baseline model and the set of all genes. Enrichment correlations among MS, UC and CD were calculated by correlating the regression coefficients for GTEx tissues and cell types (by tissues) independently. We adjusted for multiple testing by calculating the Benjamini-Hochberg FDR, accounting for tissues and cell types separately across the four diseases.

Additionally, we performed a series of conditional S-LDSC analyses to account for the possibility that gene sets overlap between related tissues and cell types, which may limit our interpretation of estimated tissue- and cell type-specific SNP heritability enrichments. In the case of tissue-level analyses, we conditioned on the baseline model, the set of all genes and the set of genes highly expressed in other FDR-significant tissues. For example, in conditional S-LDSC analyses for lung, we adjusted for the baseline model, the set of all genes and the set of genes highly expressed in small intestine–terminal ileum, spleen and whole blood. In the case of cell type-specific conditional S-LDSC analyses, we additionally conditioned on the set of genes highly expressed in the other FDR-significant cell types of the same tissue, for the focal disease. For example, in the case of B hypermutation cells (spleen) in MS, conditional S-LDSC adjusted for the baseline model, the set of all genes, the set of genes highly expressed in small intestine–terminal ileum, lung and whole blood, and the set of genes highly expressed in B/T doublet cells (spleen) and CD8+ cytotoxic lymphocytes (spleen). For all tissues and cell types, we also conducted gene-set enrichment analysis using MAGMA (Multi-marker Analysis of GenoMic Annotation; see details in the Supplementary Note) [88] with and without genes in the MHC region, as a sensitivity analysis for S-LDSC.
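As an illustration of the multiple-testing step, here is a minimal NumPy sketch of the Benjamini-Hochberg adjustment. Following the text, it would be applied once to the pooled tissue-level p-values and once to the pooled cell type-level p-values across the four diseases. The function name and the example p-values in the usage note are hypothetical.

```python
import numpy as np


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Raw adjustment: p_(i) * m / i for the i-th smallest p-value.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.clip(adjusted, 0.0, 1.0)
    out = np.empty(m)
    out[order] = adjusted
    return out
```

For the made-up input `[0.01, 0.04, 0.03, 0.005]` the adjusted values are `[0.02, 0.04, 0.04, 0.02]`.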
Summary-data-based Mendelian randomisation

We used SMR to identify putative functional genes underlying statistical associations for MS, IBD, UC and CD, as well as additional loci identified in cross-trait meta-analyses of MS-IBD, MS-UC and MS-CD, motivated by the question of whether common risk genes underlie MS and IBDs. SMR [23] performs a Mendelian randomisation-equivalent analysis that uses summary statistics from GWAS and eQTL studies to test for an association between gene expression (i.e. exposure) and a target phenotype (i.e. outcome), using genome-wide significant SNPs as instrumental variables. A significant SMR association could be explained by a causal effect (i.e. the causal variant influences disease risk via changes in gene expression), pleiotropy (i.e. the causal variant has pleiotropic effects on gene expression and disease risk) or linkage (i.e. different causal variants exist for gene expression and disease). SMR implements the HEIDI-outlier test to distinguish causality or pleiotropy from linkage, but there is currently no way to distinguish causality from pleiotropy.

We implemented SMR using cis-eQTL summary data for whole blood from eQTLGen, a meta-analysis of 14,115 samples [30], and from GTEx V7 [31] for other significant tissues identified by S-LDSC. We used a UK Biobank European reference, imputed using the combined HRC and UK10K panels, to evaluate LD, and focused only on expression probes with eQTL \(p \leq 5 \times 10^{-8}\). For MTAG- or CPASSOC-based cross-trait phenotypes (e.g. MS-IBD), SMR analyses were restricted to the genetic variants of particular interest defined above. Despite the complicated LD structure in the MHC region, probes located in the region were included in sensitivity analyses due to the importance of the MHC region in susceptibility to MS and IBDs, including for the purpose of comparing SMR associations for MS and IBDs in the MHC region with those in the remainder of the genome.
SMR associations due to causality or pleiotropy were declared significant if they surpassed Bonferroni-correction for the total number of eQTLs analysed (N = 94,624, p […] 0.05, minimum >10 SNPs).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.