Data access

The Gene Expression Omnibus database

We identified and obtained samples from the Gene Expression Omnibus (GEO), the largest database of published DNAm array samples to date. GEO is a curated database with a controlled schema for organizing data and information from research projects. In general, samples are identifiable as unique GSM IDs under a unique GSE ID record, where a GSE ID approximates a discrete experiment or project, and the data platform(s) used to generate GSM data are denoted with distinct GPL IDs. Below, we identified all GSM IDs run using one of several DNA methylation (DNAm) platforms for which raw IDAT array image files were published as supplementary files.

Identifying samples and studies

We identified DNAm array samples published to GEO and available in the GEO DataSets (GDS) database as of March 2019. We queried GDS, which provides FTP access to GEO data, using the Entrez Direct (EDirect) command-line utilities. This allowed us to obtain available samples by DNAm array platform and to graph sample and study availability by year (Figures 1 and S1). We piped the esearch, efetch, and xtract utilities for this task, and the query can be rerun using the provided script edirect_query.py.
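
As an illustration only, the sketch below shows the kind of EDirect pipeline used, wrapped in an R system call. The exact query term (platform accession and supplementary-file filter) is an assumption here; the published edirect_query.py script is the authoritative implementation.

# Minimal sketch (assumed query term): find HM450K GSM records with IDAT
# supplements in GDS via an EDirect shell pipeline invoked from R.
edirect_cmd <- paste0(
  "esearch -db gds -query 'GPL13534[ACCN] AND gsm[ETYP] AND idat[suppFile]'",
  " | efetch -format docsum",
  " | xtract -pattern DocumentSummary -element Accession")
gsm_ids <- system(edirect_cmd, intern = TRUE)  # character vector of accessions (requires EDirect installed)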

Compiling sample data

Upon identifying samples with IDATs available for the HM450K platform, we programmatically managed data downloads using the Celery job queue manager, with NTP timestamps for file versioning. Downloads were organized as tasks around GSE IDs containing valid samples, where each task consisted of downloading the available GSM IDATs and the GSE SOFT file containing experiment and sample metadata. Importantly, we coded tasks to also execute for DNAm array platforms besides HM450K (GPL13534), including the HM27K (GPL8490) and EPIC (GPL21145) platforms.

We used additional scripts to scrape GSM-specific metadata from the GSE-specific SOFT files and stored the scraped data as GSM-specific JSON files. Certain GSE-specific data, such as methods descriptions and experiment titles, were also removed from the resulting JSON files to avoid learning study-specific metadata and to increase sample-specific learning in downstream metadata processing. As an example, the filtered JSON file for sample GSM937258 was:

[
{
  "!Sample_characteristics_ch1": "tissue site: Bone",
  "!Sample_characteristics_ch1.1": "tissue type: Tumor",
  "!Sample_characteristics_ch1.2": "sample id (short form for plotting): 24-6b",
  "!Sample_characteristics_ch1.3": "description: Xiphoid Met",
  "!Sample_source_name_ch1": "Prostate cancer metastasis",
  "!Sample_title": "Prostate cancer metastasis 16032"
}
]

Processing sample metadata

We performed extensive processing of sample metadata in order to clean and harmonize data and learn annotations from the SOFT files.

Learning metadata annotation labels

We initially reviewed metadata encoding patterns manually on a GSE-specific basis. Key tags recurring across samples, and sample-specific labels useful for research, were then coerced under corresponding variables or under a “miscellaneous” variable. Treating this as a preprocessing step, we then postprocessed the coerced metadata with logic and pattern matching using regular expressions.

Postprocessing resulted in uniformly coded and cleaned metadata labels. For instance, for the variables tissue and disease, labels are mainly lower-case and separated by semicolons, so term labels are readily discoverable and queryable, either approximately or exactly, with regular expression queries. We conducted additional annotation of sample storage practices (e.g. frozen or formalin-fixed paraffin-embedded, FFPE) by mining the “miscellaneous” term from preprocessing and reviewing study information for the largest available GSE records (Table S4).
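
As a hedged illustration of this kind of postprocessing (the label values and patterns below are hypothetical and not taken from the actual annotation tables), free-text tissue labels could be harmonized and queried with regular expressions as follows:

# Minimal sketch (hypothetical labels/patterns): harmonize free-text tissue
# annotations into lower-case labels, then query them with a regular expression.
raw_tissue <- c("Tissue: Whole Blood", "tissue_type=BLOOD; EDTA tube",
                "brain - prefrontal cortex")
harmonized <- tolower(raw_tissue)
harmonized <- gsub("^tissue[ _:type=]*", "", harmonized)  # strip tag prefixes
harmonized <- gsub("[^a-z0-9;]+", " ", harmonized)        # normalize delimiters
harmonized <- trimws(gsub(" +", " ", harmonized))
grepl("blood", harmonized)                                # approximate query for blood samples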

MetaSRA-pipeline

We used the MetaSRA-pipeline software to learn sample type predictions from the GSM-specific JSON files mined from SOFT files. This pipeline uses machine learning to map metadata to ENCODE ontology terms and to compute a probability for each of several possible sample types. Note that we used a fork of the pipeline, available here, which was modified to run to completion where certain ontology data were non-uniformly recorded.

We retained the highest-probability sample type under a sampletype variable in the postprocessed metadata, along with the returned probability. For identifiable cell lines with available corresponding records, we additionally appended background information from Cellosaurus. We show the distribution of best sample type prediction probabilities across samples and probability cutoffs in Figure S2. In general, we noted that lower probabilities corresponded with less frequent informative, sample-specific metadata. Importantly, MetaSRA-pipeline was optimized for use with RNA-sequencing samples in the Sequence Read Archive, and additional steps could be taken to better optimize it for the specific metadata encountered in GEO.

Model-based metadata predictions

We performed model-based predictions of sex, age, and cell type fractions using noob-normalized DNAm (Beta-values, see below) compiled from available GSM IDATs. Sex and cell type fractions were predicted using the getSex and estimateCellCounts functions from the minfi package, and age predictions were performed using the agep function from the wateRmelon package. The functions for predicting sex and cell type fractions perform light preprocessing that may cause predictions to vary with the number and type of samples included. When we tested prediction concordance by randomizing the assignment of samples to prediction runs, we found > 95% concordance across randomizations.
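
For reference, a minimal sketch of these prediction calls is shown below; the IDAT directory path is hypothetical, and the exact arguments used in our pipeline may differ.

# Minimal sketch, assuming a directory of IDAT pairs at "idats" (hypothetical path).
library(minfi)
library(wateRmelon)
rgset <- read.metharray.exp(base = "idats")      # RGChannelSet from an IDAT directory
gmset <- mapToGenome(preprocessNoob(rgset))      # noob-normalized, mapped to the genome
pred_sex <- getSex(gmset)$predictedSex           # per-sample predicted sex
cell_fracs <- estimateCellCounts(rgset)          # predicted cell type fractions
pred_age <- agep(getBeta(gmset))                 # predicted (epigenetic) age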

DNAm array data extraction and preprocessing

The HM450K Illumina BeadArray DNAm array platform utilizes two probe types (Type I and Type II) to probe over 480,000 CpG loci genome-wide. The platform also includes specially designed control probes and probes targeting high-frequency SNPs for certain types of genetic inference. Owing to the number of samples and the size of the data files involved, we opted to compile fairly large (~120 GB) tables of DNAm and related data out of memory. As a result, many of our data preparation, analysis, and summary steps were implemented in parallel across rows of these large data files.

DNAm signal and Beta-values

From the available IDATs, we used minfi to extract and store signal tables for each color channel (red and green), which we used to derive raw Beta-values. The Beta-value reflects the fraction of methylated DNA strands at a locus in a sample and is thus constrained between 0 and 1 (see Methods). We further performed within-sample normalization using the noob method in minfi to produce normalized Beta-values and to extract methylated and unmethylated signal tables.
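
A minimal sketch of this extraction using minfi is shown below; the object names are ours, and rgset is an RGChannelSet read from IDATs as above.

library(minfi)
red <- getRed(rgset); green <- getGreen(rgset)   # per-channel signal tables
mset_raw <- preprocessRaw(rgset)                 # raw methylated/unmethylated signal
beta_raw <- getBeta(mset_raw)                    # raw Beta = Meth / (Meth + Unmeth + offset)
mset_noob <- preprocessNoob(rgset)               # within-sample noob normalization
beta_noob <- getBeta(mset_noob)                  # noob-normalized Beta-values
meth_noob <- getMeth(mset_noob)                  # noob methylated signal table
unmeth_noob <- getUnmeth(mset_noob)              # noob unmethylated signal table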

Deriving approximate genome-wide and autosome-wide distance matrices

Again, owing to the amount of data involved, we needed to perform an intermediate dimensionality-reduction step prior to principal component analyses (PCAs). We used feature hashing, or the hashing trick, to preserve between-probe distances while sufficiently reducing data dimensionality so that PCA could be run in active memory (see below). We performed feature hashing on raw and normalized Beta-values for genome-wide and autosomal probes, respectively. Feature hashing was implemented as in the following Python script:

import mmh3  # MurmurHash3 bindings, used to hash feature indices

def feature_hash(arr, target_dim=1000):
    # Map a length-N probe vector to a length-target_dim vector by summing or
    # subtracting each value into a bucket chosen by hashing its feature index.
    low_d_rep = [0 for _ in range(target_dim)]
    for i, el in enumerate(arr):
        hashed = mmh3.hash(str(i))
        if hashed > 0:
            # positive hash: add the value to its bucket
            low_d_rep[hashed % target_dim] += el
        else:
            # negative hash: subtract the value (sign trick reduces collision bias)
            low_d_rep[hashed % target_dim] -= el
    return low_d_rep

Deriving control metric signals

Consulting published literature and recommendations, we wrote code to calculate 17 BeadArray control metrics from the IDAT-derived control probe signals (values in Table S2, descriptions in Table S3, see the script beadarray_cgctrlmetrics.R). We based this work largely on functions in the ewastools package, which also derives the BeadArray metrics, with a few minor differences where we felt our code was an improvement. These included adding a denominator offset of 1 where certain metrics would otherwise divide by zero, and including certain control probe types that were previously omitted. We note these changes affected a few of the metrics and may have slightly shifted how sample signal distributions relate to published metric thresholds.

We further derived log2 medians of the noob-normalized methylated and unmethylated signals. Examination of samples in the 2D plane formed by these two metrics has previously been used to identify likely low-quality or unreliable samples.
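
For reference, the sketch below shows the corresponding computations using unmodified ewastools functions plus the minfi objects from above; our beadarray_cgctrlmetrics.R script differs as described, and the file paths and object names here are hypothetical.

library(ewastools)
idat_basenames <- c("idats/GSM0000001_R01C01")   # hypothetical IDAT basenames (no _Grn/_Red suffix)
raw <- read_idats(idat_basenames)                # raw probe and control signals
ctrl <- control_metrics(raw)                     # the 17 BeadArray control metrics
failed <- sample_failures(ctrl)                  # TRUE where a sample fails a metric threshold
# log2 medians of noob-normalized methylated/unmethylated signal (minfi objects above)
lmed_meth <- log2(apply(getMeth(mset_noob), 2, median))
lmed_unmeth <- log2(apply(getUnmeth(mset_noob), 2, median))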

Preprocessing DNAm across 7 distinct tissues

We implemented additional filters and preprocessing for a subset of samples from one of 7 tissue types. We initially removed all samples likely derived from cancer tissues or cancer patients. We then reviewed available studies linked to the remaining GSE IDs, noting the disease studied and the included sample types in Table S6. We also removed samples showing study-specific low signal (< 5th quantile) for both the methylated and unmethylated log2 medians, for studies with > 4 samples. Next, we filtered likely replicates using the probabilistic model implemented by the call_genotypes function in ewastools, which utilizes the array probes mapping to high-frequency SNPs. Finally, we only retained tissues with > 99 remaining samples from at least 2 studies.
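
A sketch of the replicate check is shown below, mixing minfi and ewastools purely for illustration; our actual filtering code may differ.

library(minfi)
library(ewastools)
snp_beta <- getSnpBeta(rgset)                         # Beta-values for the SNP ("rs") probes
genotypes <- call_genotypes(snp_beta, learn = FALSE)  # probabilistic genotype calls
donors <- enumerate_sample_donors(genotypes)          # samples sharing a donor ID are likely replicates
keep <- !duplicated(donors)                           # retain one sample per inferred donor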

For our cross-study analyses, we reasoned that cross-study (GSE ID-specific) batch effects would have a greater confounding impact than within-study batch effects. After preliminary study ID adjustment with all samples across the 7 tissues, we observed frequent probe “polarity swapping” (e.g. changes from low methylation to high methylation, or vice versa, before and after adjustment), indicating failed adjustment due to frequent, high-magnitude GSE ID-specific variance contributions that were much greater than tissue-specific variance contributions. For this reason, we instead performed study ID adjustments within each tissue before further within-tissue preprocessing. No polarity swapping was observed using this approach.

After linear adjustment on GSE ID, we removed sex chromosome probes. Within each tissue, we then performed array-wide, probe-specific ANOVAs from multiple regressions adjusting on GSE ID, predicted sex, predicted age, and predicted cell type fractions. Probes with significant (adjusted p < 1e-2) and high-magnitude (> 10% of variance) variance contributions from any of these variables were removed. This step removed between 8% and 40% of autosomal probes across the 7 tissues.
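
The sketch below illustrates this per-probe ANOVA filter. The object and covariate names are hypothetical, and we assume here that the GSE ID adjustment was a linear correction along the lines of limma::removeBatchEffect; the original code may differ.

# beta_adj: probes x samples Beta-value matrix after adjustment on GSE ID
# pheno: per-sample covariates (gseid, predsex, predage, cellfrac1, cellfrac2 are hypothetical names)
term_stats <- lapply(seq_len(nrow(beta_adj)), function(i) {
  fit <- lm(beta_adj[i, ] ~ gseid + predsex + predage + cellfrac1 + cellfrac2, data = pheno)
  atab <- anova(fit)
  n <- nrow(atab) - 1                                        # drop the Residuals row
  data.frame(probe = rownames(beta_adj)[i],
             term = rownames(atab)[1:n],
             varfrac = atab[["Sum Sq"]][1:n] / sum(atab[["Sum Sq"]]),
             p = atab[["Pr(>F)"]][1:n])
})
term_stats <- do.call(rbind, term_stats)
term_stats$padj <- p.adjust(term_stats$p, method = "BH")
drop_probes <- unique(term_stats$probe[term_stats$padj < 1e-2 & term_stats$varfrac > 0.10])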

Statistical analyses and visualizations

In brief, we generated summary statistics using the scikit-learn Python package or base R functions, conducted statistical tests using base R, and visualized data using base R and the ggplot2 R package. All p-value adjustments were performed using the Benjamini-Hochberg method. Enrichment tests were performed using binomial tests, and correlations used Spearman's rank correlation test.

GEO year-wise sample and study availability

Using the data mined from GDS with the Entrez Direct utilities, we plotted year-wise cumulative abundances of GSMs and GSEs using the ggplot2 package.

Metadata concordance tests and analyses

We used base R functions to calculate fractional concordances for sex, and linear modeling, R-squared, and Spearman’s Rho to assess concordance for age. Figure S3 plots were generated using base R.
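
A minimal sketch of these calculations is shown below; the vector names are hypothetical and pair predicted with mined values per sample.

sex_concordance <- mean(pred_sex == mined_sex, na.rm = TRUE)   # fraction of samples agreeing
age_fit <- lm(pred_age ~ mined_age)                            # linear model of predicted on mined age
age_rsq <- summary(age_fit)$r.squared                          # R-squared
age_rho <- cor(pred_age, mined_age, method = "spearman",
               use = "pairwise.complete.obs")                  # Spearman's Rho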

Analyzing control metric signal

We assessed control metric signal (BeadArray metrics and methylated or unmethylated log2 medians) using base R and certain packages. We calculated metadata term enrichment for the tissue and disease annotations using binomial tests with multiple testing adjustment by the Benjamini-Hochberg method (adjusted p < 1e-3 significance). The significance of signal differences between storage types (frozen and FFPE) was calculated using t-tests.
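
For example, enrichment of a given term among sub-threshold samples and storage-type differences could be tested as sketched below; the counts, baseline rate, and vector names are hypothetical.

# binomial test: is a term over-represented among sub-threshold samples?
p_enr <- binom.test(x = 40, n = 200, p = 0.05)$p.value         # hypothetical counts and baseline rate
sig_terms <- p.adjust(term_pvalues, method = "BH") < 1e-3      # adjust across all tested terms
# t-test of a control metric between frozen and FFPE samples
tt <- t.test(metric_value ~ storage_type)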

Stacked barplots, confidence intervals, and violin plots in Figure 2 were generated using ggplot2. Study-wise sub-threshold frequency heatmaps were generated using ComplexHeatmap.

Analyzing PCA across samples and probes

We performed PCA using the feature-hashed probe datasets and the prcomp R function with default arguments (see above). Plots were generated using the factoextra package and base R.
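
A minimal sketch of this step is shown below; the feature-hashed table name and grouping variable are hypothetical.

library(factoextra)
pca <- prcomp(hashed_betas)                      # default arguments; rows = samples, cols = hashed features
fviz_screeplot(pca)                              # variance explained per principal component
fviz_pca_ind(pca, habillage = tissue_label)      # samples colored by group label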

Analyzing hypo- and hypervariable probes across 7 tissues

After preprocessing, we used several approaches for tissue DNAm variability analyses. To identify hypovariable probes across the 7 tissues (pan-tissue probes), we identified probes with the lowest-magnitude variances shared across the studied tissues (Table S7).

We used two approaches to identify tissue-specific hypervariable probes (Table S8). First, we identified the globally most hypervariable probes within each tissue. Second, we identified the most hypervariable probes within bins of probes, where each bin is defined by a mean Beta-value increment of 0.1 (10 bins total spanning 0 to 1). To select tissue-specific hypervariable probes, we first selected the top 1,000 most hypervariable probes from the first approach, then selected the top 1,000 most hypervariable probes from the second approach that were not selected by the first, resulting in 2,000 tissue-specific hypervariable probes per tissue. Code to select and analyze hypervariable probes is provided in the script get_txmvp_3methods.R.
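
One possible realization of this selection is sketched below; the object names are hypothetical, and the published get_txmvp_3methods.R script is the authoritative implementation, so details such as how probes are ranked within bins may differ.

# beta_tissue: preprocessed probes x samples Beta-value matrix for one tissue
pvar <- apply(beta_tissue, 1, var)                       # per-probe variance
pmean <- rowMeans(beta_tissue)                           # per-probe mean Beta-value
# approach 1: globally most hypervariable probes
top_global <- names(sort(pvar, decreasing = TRUE))[1:1000]
# approach 2: rank probes within mean-Beta bins of width 0.1, then interleave bins
bins <- cut(pmean, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
bin_rank <- ave(-pvar, bins, FUN = rank)                 # rank 1 = most variable in its bin
ranked_binned <- names(pvar)[order(bin_rank)]
top_binned <- setdiff(ranked_binned, top_global)[1:1000]
hypervariable <- c(top_global, top_binned)               # 2,000 probes for this tissue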