Cytosine Methylation Variant Calling with MinION Nanopore Sequencing

At a Glance

Metadata	Details
Publication Date	2016-05-17
Journal	eScholarship (California Digital Library)
Authors	Arthur C Rand
Analysis	Full AI Review Included

Technical Documentation: MPCVD Diamond Solutions for Advanced Biosensing Platforms

Analysis of “Cytosine Methylation Variant Calling with MinION Nanopore Sequencing” (Rand et al., 2016)

Executive Summary

This paper outlines a probabilistic methodology using the Oxford Nanopore MinION sequencer to accurately call cytosine methylation variants (C, 5mC, 5hmC). This high-precision biosensing application requires extremely stable electrochemical platforms, an area where 6CCVD’s materials excel.

Novel Methodology: Successfully utilized Nanopore sequencing data combined with a Hierarchical Dirichlet Process (HDP) Hidden Markov Model (HMM) to expand the traditional four-base nucleotide alphabet.
High Accuracy Demonstrated: Achieved classification accuracy up to 95% in a three-way comparison (C, 5mC, 5hmC) and 98% in a two-way comparison (C vs. 5mC) on synthetic DNA templates.
Sensing Principle: Classification relies on precisely measuring minute changes in ionic current resulting from the interaction of specific 6-mer sequences with the protein pore.
Modeling Advancement: Optimized HDP topologies (e.g., ‘Multiset’ HDP) were crucial for accurately modeling the probability density functions of the ionic current distributions.
Biological Relevance: Demonstrated successful and accurate mapping of 5-methylcytosine within E. coli genomic DNA, validating the approach for real-world genetic analysis.
Material Requirements: Applications requiring high-fidelity ionic current measurement and electrochemical stability, such as this Nanopore setup, are ideally served by Boron-Doped Diamond (BDD) sensing platforms.
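
As a minimal sketch of the sensing principle above, the snippet below classifies a single event's mean current under one normal emission distribution per cytosine status, in the spirit of the MLE baseline the paper compares against. The picoampere means and standard deviations are hypothetical placeholders, not values from the paper, and equal priors over the three statuses are an assumption.

```python
import math

# Hypothetical normal emission parameters (mean pA, std pA) for one 6-mer
# under each cytosine status; illustrative values only, not from the paper.
EMISSIONS = {"C": (52.0, 1.5), "5mC": (54.0, 1.5), "5hmC": (50.5, 1.5)}


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior(current_pa, emissions=EMISSIONS):
    """Posterior over methylation statuses, assuming equal priors."""
    likes = {k: normal_pdf(current_pa, m, s) for k, (m, s) in emissions.items()}
    total = sum(likes.values())
    return {k: v / total for k, v in likes.items()}


def classify(current_pa, emissions=EMISSIONS):
    """Return the status with the highest posterior probability."""
    post = posterior(current_pa, emissions)
    return max(post, key=post.get)
```

For example, `classify(54.2)` picks the status whose hypothetical mean lies closest to 54.2 pA. The paper's HDP models replace these single normals with richer mixture densities, which this sketch does not attempt to reproduce.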

Technical Specifications

The following data points highlight the precision achieved in methylation calling:

Parameter	Value	Unit	Context
Nucleotide Word Length	6	bases	Defines the k-mer sequence causing ionic current blockage
3-Way Classification Accuracy (Max)	95	%	C, 5mC, and 5hmC classification on synthetic DNA reads
2-Way Classification Accuracy (Max)	98	%	C vs. 5mC classification on synthetic DNA reads
HDP Mean Read Accuracy (C/mC/hmC)	74	%	Achieved using the optimized ‘Multiset’ HDP topology
HDP Median Site Accuracy (C/mC/hmC)	72	%	Measurement taken at the methylation site level
MinION Signal Output	μ, σ, t	N/A	Mean, standard deviation, and duration of each ionic current event (e)
Materials Tested	N/A	N/A	Synthetic oligonucleotides and E. coli genomic DNA
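
To make the 6-base word length concrete, here is a short sketch that splits a sequence into overlapping 6-mers, using the example sequence ATGCACTGAACA from the original poster. The function names are illustrative and are not taken from the authors' software.

```python
def kmers(sequence, k=6):
    """Return every overlapping k-mer in sequence, left to right."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmers_overlapping(sequence, pos, k=6):
    """Return the k-mers whose window covers the base at index pos (0-based)."""
    first = max(0, pos - k + 1)
    last = min(pos, len(sequence) - k)
    return [sequence[i:i + k] for i in range(first, last + 1)]


words = kmers("ATGCACTGAACA")
# The cytosine at index 5 sits inside six consecutive 6-mers.
covering = kmers_overlapping("ATGCACTGAACA", 5)
```

Because a base away from the sequence ends falls inside six consecutive windows, a single modified cytosine can shift the current of up to six events.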

Key Methodologies

The Nanopore sequencing and computational pipeline used to achieve high-accuracy variant calling involved several critical steps:

Preparation of DNA: Utilizing highly controlled synthetic DNA oligonucleotide templates and biologically relevant E. coli genomic DNA for training and validation datasets.
Electrochemical Sensing: Applying a voltage across a membrane containing a nanometer-sized protein pore, separating two ionic solution chambers.
Ionic Current Recording: Recording the precise level of ionic current blockage (e), which is dependent on the specific 6-mer nucleotide sequence passing through the pore.
Hidden Markov Model (HMM) Construction: Implementing an HMM architecture to model the transitions between states (match, insert-Y, insert-X), accommodating the expanded alphabet (C, mC, hmC).
Statistical Optimization via HDP: Employing Hierarchical Dirichlet Processes (HDP) to accurately model the probability distributions of the ionic current (e) for different methylation statuses, moving beyond simple Maximum Likelihood Estimate (MLE) models.
Performance Assessment: Calculating True Positive Rates (TPR) and False Positive Rates (FPR) using ROC analysis. Genomic reads aligned to known methylated test sites were used to assess the TPR, and reads from PCR-amplified DNA (which lacks methylation) were used to assess the FPR.
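
The performance assessment step can be sketched as follows, assuming each read yields a per-site probability of 5mC. Calls from genomic reads at known methylated sites serve as positives and calls from PCR-amplified reads as negatives. The scores and the threshold below are hypothetical and only illustrate the bookkeeping.

```python
def roc_points(genomic_scores, pcr_scores, thresholds):
    """Return (threshold, FPR, TPR) triples.

    genomic_scores: P(5mC) calls at sites known to be methylated (positives).
    pcr_scores: P(5mC) calls on PCR-amplified DNA (negatives).
    A site is called methylated when its score is >= the threshold.
    """
    points = []
    for t in thresholds:
        tpr = sum(s >= t for s in genomic_scores) / len(genomic_scores)
        fpr = sum(s >= t for s in pcr_scores) / len(pcr_scores)
        points.append((t, fpr, tpr))
    return points


# Hypothetical scores for illustration only.
genomic = [0.9, 0.8, 0.7, 0.4]
pcr = [0.6, 0.3, 0.2, 0.1]
points = roc_points(genomic, pcr, [0.5])
```

Sweeping the threshold from 0 to 1 traces out the full ROC curve. The paper combines marginal probabilities from template and complement reads before making calls, a step this sketch leaves out.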

6CCVD Solutions & Capabilities

Advanced electrochemical systems, like those underpinning Nanopore sequencing, require materials with unmatched stability, conductivity control, and resistance to harsh chemical environments—qualities inherent to MPCVD Diamond. 6CCVD is positioned to supply key components for the next generation of these high-resolution biosensors.

Applicable Materials for Nanopore Replication & Extension

The critical need for a stable, electrochemically inert yet conductive platform points directly to 6CCVD’s Boron-Doped Diamond (BDD) material.

Material	Grade and Application	Key Benefit for Sequencing
Boron-Doped Polycrystalline Diamond (BDD-PCD)	High-stability electrochemical electrode and platform material for biosensors.	Extreme electrochemical window, low background current, and unmatched corrosion resistance, crucial for precise ionic current measurement in aqueous solutions.
High Purity Single Crystal Diamond (SCD)	Precision thermal management component within the MinION ASIC or fluidic control section.	Highest thermal conductivity (up to 22 W/cm·K) for rapid, stable heat dissipation, ensuring consistent temperature control essential for pore stability.
Polycrystalline Diamond (PCD)	Robust, large-area substrate or protective coating for microfluidic channels.	High mechanical hardness and chemical inertness, ensuring device longevity and purity of biological samples.

Customization Potential

6CCVD’s end-to-end capabilities allow researchers to integrate diamond directly into their Nanopore platforms without complex external outsourcing.

Custom Dimensions and Formats: The paper implies a microfluidic device footprint. 6CCVD offers custom BDD and PCD plates/wafers up to 125mm diameter, enabling large-scale sensor array development.
Precision Thickness Control: We supply BDD films and SCD wafers tailored from 0.1µm to 500µm thick, allowing engineers to balance conductivity, thermal requirements, and integration complexity.
Tunable Conductivity: We can supply BDD with specific boron doping levels (heavy or light) to meet the exact resistivity (Ω·cm) necessary for optimized electrode performance in ionic current detection.
Integrated Metalization: Since Nanopore sequencing requires precise electrical contacts and applied voltages, 6CCVD offers internal, lithographically defined metalization services. We routinely deposit electrode materials such as Au, Pt, Pd, Ti, W, and Cu directly onto BDD wafers.
Ultra-Smooth Polishing: For critical wafer bonding or contact interfaces, 6CCVD guarantees surface roughness of Ra < 1nm for SCD and Ra < 5nm for inch-size PCD/BDD, ensuring reliable microfluidic sealing.

Engineering Support

This research validates the critical need for robust, high-precision electrochemical environments in advanced DNA sequencing. 6CCVD’s in-house PhD engineering team specializes in diamond electrochemical properties and can assist researchers in material selection, electrode design, and integration strategies for similar DNA sequencing and biosensor projects.

For custom specifications or material consultation, visit 6ccvd.com or contact our engineering team directly.

Original Abstract

Cytosine Methylation Variant Calling with MinION Nanopore Sequencing

Arthur C. Rand, Miten Jain, Jordan Eizenga, Audrey Musselman-Brown, Hugh E. Olsen, Mark Akeson and Benedict Paten
Department of Biomolecular Engineering, University of California, Santa Cruz

Abstract

Chemical modifications to DNA regulate cellular state and function. The Oxford Nanopore MinION is a portable single-molecule DNA sequencer that can sequence long fragments of genomic DNA. Here we show that the MinION can be used to detect and map three cytosine variants: cytosine, 5-methylcytosine, and 5-hydroxymethylcytosine. We present a probabilistic method that enables expansion of the nucleotide alphabet to include bases containing chemical modifications. Our results on synthetic DNA show that individual cytosine base modifications can be classified with accuracy up to 95% in a three-way comparison and 98% in a two-way comparison. We also demonstrate that 5-methylcytosine can be accurately mapped in E. coli genomic DNA.

Nanopore Sequencing

A nanometer-sized protein pore is embedded in a membrane. The membrane separates two chambers containing an ionic solution. A voltage is applied, and the ionic current through the pore is recorded. DNA is threaded through the pore, and partially blocks the ionic current. The level of the ionic current (e_j) is due to six-nucleotide words (x_i). Each event e_j is summarized by μ, σ, and t. Example: the sequence ATGCACTGAACA is read as the 6-mers ATGCAC, TGCACT, GCACTG, CACTGA, ACTGAA, CTGAAC.

Modeling Ionic Current with a hidden Markov model

A. Architecture of hidden Markov model used in this study. The match state ‘M’ (square) emits an event-6-mer pair (x_i, e_j) and proceeds along the reference, Insert-Y ‘Iy’ (diamond) emits a pair (-, e_j) but stays in place, and Insert-X ‘Ix’ (circle) proceeds along the reference (x_i, -) but does not emit a pair.

Two-level (B) and three-level (C) hierarchical Dirichlet process shown in graphical form. Circles represent random variables. The base distribution ‘H’ is a normal inverse-gamma distribution for both models. The Dirichlet processes ‘G_0’, ‘G_σn’, and ‘G_σni’ are parameterized by their parent distribution and shared concentration parameters ‘γ_B’, ‘γ_M’, and ‘γ_L’. The factors ‘θ_ji’ specify the parameters of the normal distribution mixture component that generates observation ‘x_ji’.

D. Variable-order HMM meta-structure over an example reference sequence. Each C in the reference represents a potentially methylated cytosine. The structure expands around the C* base to accommodate all possible methylation states. Each cell contains the three states shown in A, and transitions span between cells. The transitions are restricted so that methylation states are labeled consistently within a path. The match states are drawn with 4-mers for simplicity, but the model is implemented with 6-mers.

The HDP more realistically models ionic current distributions

Probability distributions for three representative 6-mers (AGCTAA, TTGCTG, GAACTT) by multiple methods. The first row shows the kernel density estimate (KDE). The middle row shows maximum likelihood estimated (MLE) normal distribution probability density functions. The bottom row shows probability density functions from the ‘Multiset’ hierarchical Dirichlet process (HDP). All data shown are from template reads.

Base modification calling accuracy results on synthetic oligonucleotides

A and B. The accuracy distribution by read (A) and by context (B) is shown for the MLE emission distributions and the ‘Multiset’ HDP model on synthetic oligonucleotides. The triangles represent the mean of the distribution. C. Confusion matrix (true label vs. predicted label) showing HMM-HDP three-way cytosine classification performance on template reads of synthetic oligonucleotides. D. Scatter plot shows the correlation between log-odds of correct classification and the mean pairwise Hellinger distance between the methylation statuses of the 6-mer distributions overlapping a cytosine.

Comparison of different HDP topologies

[Table: three-way and two-way accuracy by model. Columns: Model, Mean Accuracy (read), Median Accuracy (read), Mean Accuracy (site), Median Accuracy (site). Three-way rows: MLE, singlelevel, multiset, composition, middleNts, group. Two-way rows: singlelevel, multiset. Numeric values are not present in the extracted text.]

MLE is the maximum likelihood estimate of a normal distribution. ‘Two-level’ is an HDP model with no subgroupings of 6-mers. ‘Multiset’, ‘Composition’, ‘MiddleNucleotides’, and ‘GroupMultiset’ are three-level HDP models. Three-way classification was performed between cytosine, 5-methylcytosine, and 5-hydroxymethylcytosine. Two-way classifications were between cytosine and 5-methylcytosine.

Mapping 5-methylcytosine in E. coli genomic DNA

A. Data partitioning for HDP training on E. coli. 1,709 high-confidence methylated CCWGG sites (pins) were divided into training (unstarred) and test (starred). The HDP is trained on reads from PCR amplified DNA (orange lines) and events aligned to the training sites from genomic DNA reads (magenta lines). These combined data constitute the training dataset (dashed box). The trained model is then tested on genomic and PCR DNA reads aligned to the test sites from separate flow-cells.

B. ROC plot (true positive rate vs. false positive rate) shows HMM-HDP two-way classification performance on cytosines in test group (A, starred pins). Methylation calls are made by combining marginal probabilities from template and complement reads. Genomic reads were used to assess true positive rate, the PCR reads were used to assess the false positive rate.
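
The poster's accuracy figure relates classification performance to the mean pairwise Hellinger distance between the current distributions of the three methylation statuses. As a hedged illustration of that quantity, the sketch below uses the closed-form Hellinger distance between two normal distributions. Treating each status as a single normal is an assumption made here for simplicity; HDP mixture densities would need numerical integration instead. The parameter values in the example are hypothetical.

```python
import math
from itertools import combinations


def hellinger_normal(mu1, sd1, mu2, sd2):
    """Hellinger distance between N(mu1, sd1^2) and N(mu2, sd2^2)."""
    var_sum = sd1 ** 2 + sd2 ** 2
    # Bhattacharyya coefficient for two univariate normals.
    bc = math.sqrt(2.0 * sd1 * sd2 / var_sum) * math.exp(
        -((mu1 - mu2) ** 2) / (4.0 * var_sum)
    )
    return math.sqrt(max(0.0, 1.0 - bc))


def mean_pairwise_hellinger(params):
    """Average Hellinger distance over all pairs of (mean, std) tuples."""
    pairs = list(combinations(params, 2))
    return sum(hellinger_normal(*a, *b) for a, b in pairs) / len(pairs)


# Hypothetical (mean pA, std pA) for C, 5mC, and 5hmC at one 6-mer.
statuses = [(52.0, 1.5), (54.0, 1.5), (50.5, 1.5)]
distance = mean_pairwise_hellinger(statuses)
```

The distance is 0 for identical distributions and approaches 1 as they stop overlapping, so larger values indicate 6-mers whose methylation statuses are easier to tell apart.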
