Differentially expressed genes reflect disease-induced rather than disease-causing changes in the transcriptome - Nature Communications
Methods . Reverse transcriptome-wide Mendelian randomization (revTWMR) . RevTWMR is a multi-instrument MR approach designed to estimate the causal effect of the phenotypes (exposure) on gene expression (outcome). For each gene, using an inverse-variance weighted method for summary statistics 16 , we define the joint causal effect of the phenotypes on the outcome as $${\hat{{{{\alpha }}}}}=\left({{{\hat{{\beta} }}}}^{\prime}{{{{{ C}}}}}^{{{{{-}}}}{{{{1}}}}}{{\hat{{\beta} }}}\right)^{{{{{-}}}}{{{{1}}}}}\left({{\hat{{\beta} }}}^{\prime}{{{{{ C}}}}}^{{{{{-}}}}{{{{1}}}}}{{\hat{{\gamma} }}}\right)$$ (2) Here β is an n- vector that contains the standardized effect size of n independent SNPs on the phenotype, derived from GWAS. γ is a vector of length n that contains the standardized effect size, in trans- , of each SNP on the gene expression. C is the pair-wise LD matrix between the n SNPs. As instrumental variables, we used independent ( r 2 ?eQTLGen Consortium (31,684 whole blood samples). As we are using only strongly independent SNPs, we use the identity matrix to approximate C . The SNPs with larger effects on the outcome than on the exposure were removed, as these would indicate a violation of MR assumptions (likely reverse causality and/or confounding). The variance of α can be calculated approximately by the Delta method $${{{{{{{\mathrm{var}}}}}} }}\left(\hat{{{{{\alpha }}}}}\right)={\left(\frac{{{\partial }}\hat{{{{{\alpha }}}}}}{{{\partial }}{{{{\beta }}}}}\right)}^{{{{{2}}}}} {{{{{{{\mathrm{var}}}}}} }}\left(\hat{{{{{\beta }}}}}\right)+{\left(\frac{{{\partial }}\hat{{{{{\alpha }}}}}}{{{\partial }}{{{{\gamma }}}}}\right)}^{{{{{2}}}}} {{{{{{{\mathrm{var}}}}}} }}\left(\hat{{{{{\gamma }}}}}\right)+\left(\frac{{{\partial }}\hat{{{{{\alpha }}}}}}{{{\partial }}{{{{\beta }}}}}\right) \left(\frac{{{\partial }}\hat{{{{{\alpha }}}}}}{{{\partial }}{{{{\gamma }}}}}\right) {{{{{{{\mathrm{cov}}}}}} }}(\hat{{{{{\beta }}}}},\hat{{{{{\gamma }}}}})$$ (3) where cov( β , γ ) is 0 if β and γ are estimated from independent samples. We defined the causal effect Z-statistic for gene i as \({\hat{\alpha }}_{i}/{{{{{{\mathrm{SE}}}}}} }({\hat{\alpha }}_{i})\) , where \({{{{{{\mathrm{SE}}}}}} }\left({\hat{\alpha }}_{i}\right)=\sqrt{{{{{{{{\mathrm{var}}}}}} }\left(\hat{\alpha }\right)}_{i,i}}\) . We applied revTWMR across the human genome for a causal association between a set of 12 phenotypes and the expression levels of 19,942 genes using summary statistics from GWAS and eQTLs studies. The analysed traits include BMI, CD 17 , EDU 20 , FG 18 , HDL-cholesterol, height, LDL-cholesterol, RA 19 , SCZ 21 , TC, TG, and WHRadjBMI. While for CD, EDU, FG, RA, SCZ, and WHRadjBMI summary statistics (estimated univariate effect size and standard error) originate from the most recent meta-analysis and were downloaded from the publicly available NIH Genome-wide Repository of Associations Between SNPs and Phenotypes ( https://grasp.nhlbi.nih.gov/ ), for the other traits the GWAS were performed in UKBiobank and the summary statistics are from the Neale Lab ( http://www.nealelab.is/uk-biobank/ ) (Supplementary Data? 16 ). We only used SNPs on autosomal chromosomes and were available in the UK10K reference panel, which allowed estimating the LD among these SNPs and prune them. Strand ambiguous SNPs were removed. Heterogeneity test . The validity of all MR approaches, such as revTWMR, relies on three assumptions. The third assumption (no pleiotropy) is crucial as MR causal estimates will be biased if the genetic variants (IVs) have pleiotropic effects 43 . Hence, revTWMR assumes that all genetic variants used as instrumental variables affect the outcome only through gene expression and not through independent biological pathways. To test for the presence of pleiotropy, we used Cochran’s Q test 42 , 44 . In brief, we tested whether there is a significant difference between the revTWMR-effect of an instrument (i.e., αβ i ) and the estimated effect of that instrument on the gene expression ( γ i ). We defined $${{{{{ d}}}}}_{{{{{ i}}}}}={{{\hat{{{\gamma} }}}}}_{{{{{ i}}}}}-{{\hat{\alpha}}}{{{\hat{\beta}}}}_{{{{{ i}}}}}$$ (4) and its variance as $${{{{{{{\mathrm{var}}}}}}}}\left({{{{{ d}}}}}_{{{{{ i}}}}}\right)={{{{{{{\mathrm{var}}}}}}}}\left({{{\hat{\gamma}}}}_{{{{{ i}}}}}\right)+{\left({{{{{\beta }}}}}_{{{{{ i}}}}}\right)}^{{{{{2}}}}} {{{{{{{\mathrm{var}}}}}}}}\left({{\hat{\alpha }}}\right)+{{{{{{{\mathrm{var}}}}}}}}\left({{{\hat{\gamma} }}}_{{{{{ i}}}}}\right) {\left({{{{\alpha }}}}\right)}^{{{{{2}}}}}+{{{{{{{\mathrm{var}}}}}}}}\left({{{\hat{\beta}}}}_{{{{{ i}}}}}\right) {{{{{{{\mathrm{var}}}}}}}}\left({{\hat{\alpha} }}\right)$$ (5) Next, we tested the deviation of each SNP using the following test statistic $${{{{{ T}}}}}_{{{{{ i}}}}}=\frac{{{{{{ d}}}}}_{{{{{ i}}}}}^{{{{{2}}}}}}{{{{{{{{\mathrm{var}}}}}}}}({{{{{ d}}}}}_{{{{{ i}}}}})} \sim {{{{{\chi }}}}}_{{{{{1}}}}}^{{{{{2}}}}}$$ (6) In case where P ?Mendelian randomization (TWMR) . In order to test the presence of a feedback loop of association, we ran TWMR 15 for all the significant revTWMR genes. To make TWMR and revTWMR results comparable, we ran a univariable TWMR where for each gene we estimated its total effect on the phenotype. The associations between the instrumental variables and the exposure (gene expression) and the outcome (complex traits) are estimated from the same studies used for revTWMR. Gene-based test . To compare GWAS and revTWMR results, we performed gene-based test for association summary statistics using PASCAL 39 . PASCAL assesses the total contribution of all SNP within close physical proximity to a given gene by combining SNP association Z-statistics into gene-based P values while accounting for local LD structure. Replication cohorts . EGCUT . Study population . The Estonian Genome Center, University of Tartu (EGCUT) cohort denotes the Estonian Biobank sample of more than 200,000 individuals or about 20% of the Estonian adult population. All Biobank participants have been genotyped and linked to electronic health records (EHR) of the Health Insurance Fund, national registries, and major hospitals. The EHR linkage captures the participants’ medical history together with demographics, lifestyle information, and laboratory measurements; additional information is provided by self-completed questionnaires. Disease diagnoses are in the form of ICD-10 codes. RNA-seq data is available on 491 unrelated individuals. All Biobank participants have signed a broad informed consent to allow using their genetic and medical information for research purposes. Whole-blood-transcriptome analysis . The preparation of RNA-seq data has been described in detail elsewhere 58 . RNA-seq reads were trimmed of adapters together with low-quality leading and trailing bases using Trimmomatic (version 0.36) 59 . Additional quality control was performed with FastQC (version 0.11.2). The final set of reads were mapped to a human genome reference version GRCh37.p13 using STAR (version 2.4.2a) 60 . Sample mix-ups were tested and corrected for using MixupMapper 61 . Principal component analysis on RNA-seq read counts revealed a batch of outlying samples which was uncovered to be due to a technical problem in library preparation—affected samples were discarded. Data were normalized using the weighted trimmed mean of M values 62 and used as log2-transformed counts per million. To account for (hidden) batch effects, the sequencing batch date together with the first gene expression principal components were used in all subsequent analyses. InChianti . Study population . The InCHIANTI study is a population-based sample that includes 298 individuals of age <65 years and 1155 individuals of age ≥65 years. The study design and protocol have been described in detail previously 63 . The data collection started in September 1998 and was completed in March 2000. The INRCA Ethical Committee approved the entire study protocol. Whole-blood-transcriptome analysis . Peripheral blood specimens were collected from 712 individuals using the PAXgene tube technology to preserve levels of mRNA transcripts. RNA was extracted from peripheral blood samples using the PAXgene Blood mRNA kit (Qiagen, Crawley, UK) according to the manufacturer’s instructions. RNA was biotinylated and amplified using the Illumina? TotalPrep(tm) ?96 RNA Amplification Kit and directly hybed with Human HT-12_v3 Expression BeadChips that include 48,803 probes. Image data were collected on an Illumina iScan and analysed using Illumina GenomeStudio software. These experiments were performed as per the manufacturer’s instructions and as previously described 64 . Quality-control analysis of gene expression levels were previously described 65 . SHIP-Trend . Study population . The Study of Health in Pomerania (SHIP-Trend) is a longitudinal population-based cohort study in West Pomerania, a region in the northeast of Germany, assessing the prevalence and incidence of common population-relevant diseases and their risk factors. Baseline examinations for SHIP-Trend were carried out between 2008 and 2012, comprising 4420 participants aged between 20 and 81 years. Study design and sampling methods were previously described 66 . The medical ethics committee of the University of Greifswald approved the study protocol, and oral and written informed consents were obtained from each of the study participants. Whole-blood-transcriptome analysis . Blood sample collection, as well as RNA preparation, were described in detail elsewhere 67 . Briefly, whole-blood samples of a subset of SHIP-TREND were collected from the participants after overnight fasting (≥10?h) and stored in PAXgene Blood RNA Tubes (BD). Subsequently, RNA was prepared using the PAXgeneTM Blood miRNA Kit (QIAGEN, Hilden, Germany). The purity and concentration of RNA were determined using a NanoDrop ND-1000 UV-Vis Spectrophotometer (Thermo Scientific). To ensure a constantly high quality of the RNA preparations, all samples were analyzed using RNA 6000 Nano LabChips (Agilent Technologies, Germany) on a 2100 Bioanalyzer (Agilent Technologies, Germany) according to the manufacturer’s instructions. Samples exhibiting an RNA integrity number (RIN) less than seven were excluded from further analysis. The Illumina TotalPrep-96 RNA Amplification Kit (Ambion, Darmstadt, Germany) was used for reverse transcription of 500?ng RNA into double-stranded (ds) cDNA and subsequent synthesis of biotin-UTP-labeled antisense-cRNA using this cDNA as the template. Finally, in total 3000?ng of cRNA were hybridized with a single array on the Illumina Human HT-12 v3 BeadChips, followed by washing and detection steps in accordance with the Illumina protocol. Processing of the SHIP-Trend RNA samples was performed at the Helmholtz Zentrum München. BeadChips were scanned using the Illumina Bead Array Reader. The Illumina software GenomeStudio V 2010.1 was used to read the generated raw data, for imputation of missing values and sample quality control. Subsequently, raw gene expression data were exported to the statistical environment R, version 2.14.2 (R Development Core Team 2011). Data were normalized using quantile normalization and log2-transformation using the lumi 2.8.0 package from the Bioconductor open-source software ( http://www.bioconductor.org/ ). Finally, 991 samples were available for gene expression analysis. Technical covariates used in all statistical models included RNA amplification batch, RNA quality (RIN), and sample storage time. The SHIP-Trend expression dataset is available at GEO (Gene Expression Omnibus) public repository under the accession GSE 36382: 991 samples were available for analysis. Phenotype-gene expression correlation . To calculate the correlation between the phenotypes and the gene expression levels, we asked each cohort to run the following analysis. First, the inverse normal transformation was applied to phenotypes and gene expression. Next, transformed phenotypes were adjusted only for sex, age, and age 2 , while gene expression was also corrected for other known relevant covariates. Finally, Pearson’s correlation was calculated between the adjusted trait and the adjusted expression. Finally, correlations from single cohorts were combined using inverse-variance meta-analysis, where weights are proportional to the squared standard error of the correlation estimates, as implemented in METAL 68 . Observed and true correlation between gene expression and traits . The correlation between the effects estimated by revTWMR \(({\alpha }_{{{{{{{\mathrm{revTWMR}}}}}}}})\) and the observational correlation ( \({{{{{{\mathrm{corr}}}}}}}(E,T)\) ) measured in the individual data from EGCUT, InChianti, and SHIP-Trend was calculated using Pearson’s correlation. As such correlation does not consider the error of the estimations, for the significant correlations we used the linear errors-in-variables models to compute the potential true correlation using the following equation $${{{{{{corr}}}}}}_{{{{{{obs}}}}}}={{{{{{corr}}}}}}_{{{{{{true}}}}}} \sqrt{{{{1}}}-\frac{\mathop{\sum }\nolimits_{{{{j}}}={{{1}}}}^{{{{{N}}}}_{{{{{{Genes}}}}}}}{{{{{{SE}}}}}\left({{{{\alpha }}}}_{{{{{{revTWMR}}}}}}\right)}^{{{{2}}}}}{\mathop{\sum }\nolimits_{{{{j}}}={{{1}}}}^{{{{{N}}}}_{{{{{{Genes}}}}}}}{{{{{\alpha }}}}_{{{{{{revTWMR}}}}}}}^{{{{2}}}}}} \sqrt{{{{1}}}-\frac{\mathop{\sum }\nolimits_{{{{j}}}={{{1}}}}^{{{{{N}}}}_{{{{{{Genes}}}}}}}{{{{{{SE}}}}}\left({{{{{corr}}}}}({{{E}}},{{{T}}})\right)}^{{{{2}}}}}{\mathop{\sum }\nolimits_{{{{j}}}={{{1}}}}^{{{{{N}}}}_{{{{{{Genes}}}}}}}{{{{{{corr}}}}}({{{E}}},{{{T}}})}^{{{{2}}}}}}$$ (7) Proportion of observational correlation explained by bidirectional causal effects . Let E and T denote the gene expression and the trait, respectively. In addition, there may exist a confounding factor U causally impacting both of them. We can express E and T as: $${{{T}}}={{{{\alpha }}}}_{{{{{{TWMR}}}}}} {{{E}}}+{{{{q}}}}_{{{{T}}}} {{{U}}}+{{{{\varepsilon }}}}_{{{{T}}}}$$ (8) And $${{{E}}}={{{{\alpha }}}}_{{{{{{revTWMR}}}}}} {{{T}}}+{{{{q}}}}_{{{{E}}}} {{{U}}}+{{{{\varepsilon }}}}_{{{{E}}}}$$ (9) where \({\alpha }_{{{{{{{\mathrm{TWMR}}}}}}}}\) and \({\alpha }_{{{{{{{\mathrm{revTWMR}}}}}}}}\) are the causal effects of E on T and of T on E estimated by TWMR and revTWMR respectively; q T and q E are the causal effects of the confounders on T and E ; and \({\varepsilon }_{T} \sim N(0,{\sigma }_{T})\) and \({\varepsilon }_{E} \sim N(0,{\sigma }_{E})\) represent uncorrelated errors. More specifically, \({\varepsilon }_{T}\) , \({\varepsilon }_{E}\) , and U are all independent of each other, because all dependence between T and E are due to bidirectional causal effects and the confounder U , the residual noises are independent of each other and of the confounder. For simplicity, we assume that E , T , and U have zero mean and unit variance, so that the correlation between E and T can be expressed as $${{{{{{\mathrm{corr}}}}}}}\left(E,T\right)={{{{{{\mathrm{cov}}}}}}}\left(E,T\right)=E\left(E T\right)={\alpha }_{{{{{{{\mathrm{TWMR}}}}}}}}+{\alpha }_{{{{{{{\mathrm{revTWMR}}}}}}}}-{\alpha }_{{{{{{{\mathrm{TWMR}}}}}}}} {\alpha }_{{{{{{{\mathrm{revTWMR}}}}}}}} E\left(E T\right)+{q}_{T} {q}_{E}$$ (10) Equivalently, $${{{{{{\mathrm{corr}}}}}}}\left(E,T\right)=\frac{{\alpha }_{{{{{{{\mathrm{TWMR}}}}}}}}+{\alpha }_{{{{{{{\mathrm{revTWMR}}}}}}}}+{q}_{T} {q}_{E}}{1+{\alpha }_{{{{{{{\mathrm{TWMR}}}}}}}} {\alpha }_{{{{{{{\mathrm{revTWMR}}}}}}}}}$$ (11) As we know the correlation, the bidirectional causal effects estimated by TWMR and revTWMR, we can estimate the contribution of the confounders ( \({q}_{T} {q}_{E}\) ) to the observed correlation. Since the magnitude of \({\alpha }_{{{{{{{\mathrm{TWMR}}}}}}}} {\alpha }_{{{{{{{\mathrm{revTWMR}}}}}}}}\) is negligible, we replaced the denominator with 1. To avoid the recursive equations expressing the forward and reverse causal effects of E on T, we can substitute T into the equation for E and obtain $$E= \, {\alpha }_{{{{{{\mathrm{revTWMR}}}}}}} ({\alpha }_{{{{{{\mathrm{TWMR}}}}}}} E+{q}_{T} U+{\varepsilon }_{T})+{q}_{E} U+{\varepsilon }_{E}\\ E= \, {\alpha }_{{{{{{\mathrm{reTWMR}}}}}}} {\alpha }_{{{{{{\mathrm{TWMR}}}}}}} E+({\alpha }_{{{{{{\mathrm{revTWMR}}}}}}} {q}_{T}+{q}_{E}) U+{\alpha }_{{{{{{\mathrm{revTWMR}}}}}}} {\varepsilon }_{T}+{\varepsilon }_{E}\\ (1-{\alpha }_{{{{{{\mathrm{revTWMR}}}}}}} {\alpha }_{{{{{{\mathrm{TWMR}}}}}}}) E=\big({\alpha }_{{{{{{\mathrm{revTWMR}}}}}}} {q}_{T} +{q}_{E}\big) U+{\alpha }_{{{{{{\mathrm{revTWMR}}}}}}} {\varepsilon }_{T}+{\varepsilon }_{E}\\ {{{E}}}= \,\frac{({{{{\alpha }}}}_{{{{{{{{\mathrm{revTWMR}}}}}}}}} {{{{q}}}}_{{{{T}}}}+{{{{q}}}}_{{{{E}}}}) {{{U}}}+{{{{\alpha }}}}_{{{{{{{{\mathrm{revTWMR}}}}}}}}} {{{{\varepsilon }}}}_{{{{T}}}}+{{{{\varepsilon }}}}_{{{{E}}}}}{1-{{{{\alpha }}}}_{{{{{{{{\mathrm{revTWMR}}}}}}}}} {{{{\alpha}}}}_{{{{{{{{\mathrm{TWMR}}}}}}}}}}$$ (12) Similarly for T $$T= \, {\alpha }_{{{{{{\mathrm{TWMR}}}}}}} ({\alpha }_{{{{{{\mathrm{revTWMR}}}}}}} T+{q}_{E} U+{\varepsilon }_{E})+{q}_{T} U+{\varepsilon }_{T}\\ T= \,{\alpha }_{{{{{{\mathrm{TWMR}}}}}}} {\alpha }_{{{{{{\mathrm{revTWMR}}}}}}} T+ ({\alpha }_{{{{{{\mathrm{TWMR}}}}}}} {q}_{E}+{q}_{T}) U+{\alpha }_{{{{{{\mathrm{TWMR}}}}}}} {\varepsilon }_{E}+{\varepsilon }_{T}\\ \, (1-{\alpha }_{{{{{{\mathrm{TWMR}}}}}}} {\alpha }_{{{{{{\mathrm{revTWMR}}}}}}}) T= ({\alpha }_{{{{{{\mathrm{TWMR}}}}}}} {q}_{E}+{q}_{T}) U+{\alpha }_{{{{{{\mathrm{TWMR}}}}}}} {\varepsilon }_{E}+{\varepsilon }_{T}\\ T = \,\frac{({{{\alpha }}}_{{{{{{{{\mathrm{TWMR}}}}}}}}} {{{ q}}}_{{{E}}}+{{{q}}}_{{{T}}}) {{ U}}+{{{\alpha }}}_{{{{{{{{\mathrm{TWMR}}}}}}}}} {{{\varepsilon }}}_{{{E}}}+{{{\varepsilon }}}_{{{T}}}}{1-{{{\alpha }}}_{{{{{{{{\mathrm{TWMR}}}}}}}}} {{{\alpha }}}_{{{{{{{{\mathrm{revTWMR}}}}}}}}}}$$ (13) GWAS hits trans -eQTL mapping in GTEx . Genotypes and gene expression quantifications from the GTEx project v8 dataset 47 were obtained via dbGaP accession number phs000424.v8.p1. This includes genotypes of 838 subjects, 85.3% of European American origin, 12.3% African American, and 1.4% Asian American. The phased version of the genotype files was used and the genotypes for 1078 out of 1093 GWAS hits used as instrument variables in revTWMR were retrieved, matching for chromosome, position, and reference/alternative allele, after conversion to GRCh38 coordinates using the UCSC liftOver tool 69 . Gene expression quantification (TPM values) from RNA-seq experiments across 49 tissues (for which genotype data is also available for ≥70 individuals) processed and provided by the GTEx project v8 were also downloaded. These gene expression quantifications had been mapped to Gencode v26 70 gene annotations on GRCh38 and normalized by TMM between samples (as implemented in edgeR), and inverse normal transform across samples. Moreover, only genes passing an expression threshold of >0.1 TPM in ≥20% samples and ≥6 reads in ≥20% samples had been retained. The association between each of the 2177 GWAS hits genotyped in GTEx v8 and each gene expression (20,315 to 35,007 genes per tissue, all gene types) across 49 tissues of the GTEx v8 was computed using QTLtools v1.3.1 trans function 71 . This consists of more than 2 billion association tests performed. For this, the–nominal option for calculating nominal p values was used, as well as the–normal option, to enforce the gene expression phenotypes to match normal distributions N (0,1). To include all associations, no cis window filtering was applied. Moreover, covariates provided by GTEx v8 for each tissue were regressed out of each expression matrix to account for potential confounding factors, by using the–covariate option on QTLtools. These included 15 to 60 PEER factors (depending on tissue sample size) 72 , five genotype PCA PCs as well as information about the sequencing platform, PCR usage, and the sex of the samples provided by GTEx v8. Analysis of mouse data . We used blood triglyceride and liver gene expression in a panel of BXD mice that were fed a chow or high-fat diet (CD and HFD 48 ). The study involves 52 strains, with five mice per strain per condition. Male mice were switched to an HFD diet at 8 weeks of age, subjected to extensive cardiometabolic phenotyping, and finally sacrificed at 29 weeks of age after an overnight fast. The blood and liver collection were performed simultaneously during tissue collection. Microarray data and triglyceride measurements in the two diets are available for a subset of 34 strains. Phenotype data, including blood triglyceride measurement, are deposited in the Mouse Phenome Database ( https://phenome.jax.org/projects/Auwerx1 ) and the raw gene expression data in the Gene Expression Omnibus ( https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60149 ). To calculate diet-induced TG change, we subtracted the strain average on CD from that on HFD and converted the resulting difference into a z -score. We normalized the Affymetrix Mouse Gene 1.0 ST Array data using the Affymetrix Power Tools software version 1.20.5 with GC correction (GCCN) and space transformation (SST). We removed the lowest quartile of genes based on average expression in all samples. For each BXD strain, we calculated strain-level HFD-induced fold change as the difference in the expression on HFD minus that of CD and then converted these values to z -scores. We then performed Spearman’s correlation for each gene’s HFD-induced fold change and the TG diet-induced differences. We used the biomaRt R package version 2.42.1 73 to convert between mouse and human gene Ensembl IDs. Reporting Summary . Further information on research design is available in the? Nature Research Reporting Summary linked to this article. .
原始网站图片
增加监测目标对象/主题，请
登录 直接在原文中划词！