Dismiss

DISMISS is an R script, which as an additional step in MeDIP-Seq data analysis workflow, enables the allocation of strands to methylated DNA regions. It does this by analyzing the proportions of first mate reads aligning to the methylated locus from the plus and minus strands.

Download as .zip Download as .tar.gz View on GitHub

dismiss

DISMISS can be run locally via the UNIX command-line or on our public galaxy install here.

Quick Reference

Dismiss is a collection of functions implemented in R and require the following to be present on the system:

To run dismiss, the following files should be in the current path or directory:

In order to simplify the process of strand separation, a utility script dismiss_macs2_extractor.R has been provided to perform strand separation on peak calling performed on MeDIP-Seq alignment data using MACS2. The script can be run from the command line, provided the above mentioned conditions are met.

Rscript dismiss_macs2_extractor.R MACS2_result.xls File.bam [p or s]

where the last argument is essential and should be set to either

The functions provided with the script dismiss_header.R can be used directly in one’s own R scripts, and strand separation performed on ranged data of GRanges object type. This ranged data would have been produced by a MeDIP-Seq analysis software of choice e.g. MEDIPS , MACS2 etc. (see section 2.3 for description of the function to call)

Outputs:

The script dismiss_macs2_extractor.R will produce four output files with the current time and date added as postfix to the names:

The three .gff3 format files contains methylated regions separated into double, minus and plus stranded parts. These files can easily be imported into genome browsers as tracks for analysis or converted to other suitable formats for further processing.

GFF3 file additional information:

The following additional information is present for each record entry in the gff3 file. Values 1 to 4 are the ones provided by MACS2 in the original results.

2.0 General Information

2.1 What is DISMISS

DISMISS, which as an additional step in MeDIP-Seq data analysis workflow, enables the allocation of strands to methylated DNA regions. It does this by analyzing analysing the proportions of of first mate read aligning to the methylated locus from the plus and minus strands.

2.2 Installation notes

DISMISS uses the base R package and Bioconductor libraries. R (version 3.1.2) can be installed by visiting http://cran.r-project.org/. Once R has been installed, then Bioconductor (version 3.0) can be installed directly (http://www.bioconductor.org/install/) through R by typing:

source("http://bioconductor.org/biocLite.R") biocLite()

The bioconductor packages GenomicAlignments, GenomicFeatures and rtracklayer can each be installed after these steps by typing e.g.

biocLite("GenomicAlignments")

2.3 How does it work

DISMISS can be used in two ways: 1) a more simplified script that processes results from MACS2 and outputs results as an R object and gff3 files - using dismiss_macs2_extractor.R; 2) a more involved method where the functions can be called directly in R.

2.3.1 Simple method.

This method involves a call to the script dimiss_macs2_extractor.R which performs the following steps:

  1. Process command line arguments by setting variables
    1. macs = name of the MACS2 results excel file.
    2. bam = name of the MeDIP-Seq alignment file in bam format
    3. paired = TRUE; if the command line argument was p or else s
  2. Source the main script with all functions dismiss_header.R
  3. Set the Genomic Ranges object oGRdismiss
    1. Call function f_oGRMACS2toGRanges with argument macs, to convert tab separated ranged data into a GRanges object oGRdismiss
  4. Separate the strands by calling the function f_oGRSeparateStrands with arguments oGRdismiss, bam and paired variables.
  5. Save the output as three gff3 files and an R object containing oGRdismiss object.

3.0 Methylation in the honey-bee genome

IGV data sets to accompany the DISMISS paper are available below.

Figure 1: Screen shot of IGV with available data sets.

Data sets available and links for honey bee genome version Amel_4.5:

Genome and GFF files are available here (735MB zipped download).

3.1 Processing MeDIP datasets with DISMISS via the command line

The whole experiment can be performed using Trimmomatic, bowtie2, samtools, MACS2, DISMISS, BEDTools and IGVTools as follows from the command-line and then viewed using IGV.

## Clean with Trimmomatic
java -Xms3G -Xmx3G -jar ./trimmomatic-0.30.jar PE -phred33 SRR850130_1.fastq SRR850130_2.fastq SRR850130_1.fastq.p SRR850130_1.fastq.unp SRR850130_2.fastq.p SRR850130_2.fastq.unp ILLUMINACLIP:./Adaptors.seq:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:15 MINLEN:36

## Paired Data, reverted nurse
bowtie2 --threads 8 --sensitive -I 0 -X 500 -x Amel_4.5_scaffolds.fa -1 SRR850132_1.fastq.p -2 SRR850132_2.fastq.p -S reverted.sam
samtools view -b -q 10 -o reverted.bam reverted.sam
samtools sort -@ 8 -T reverted.tmp -O bam reverted.bam > reverted.srt.bam
samtools rmdup reverted.srt.bam nurse-single.rmdup.srt.bam
samtools index reverted.rmdup.srt.bam

## Run MACS2, DISMISS on reverted nurse to make DISMISS stranded GFF files
macs2 callpeak -m 5 50 -t reverted.rmdup.srt.bam -f BAM -g 2.3e8 -q 0.01 -n reverted
Rscript uhkniazi-dismiss-cea5c48/dismiss_macs2_extractor.R reverted_peaks.xls revered.rmdup.srt.bam p

## Make files for IGV
bedtools genomecov -bg -ibam reverted.rmdup.srt.bam > reverted.bedgraph
igvtools toTDF reverted.bedgraph reverted.bedgraph.tdf Amel_4.5_scaffolds.fa

4.0 Involved method.

This is recommended for users who are more familiar with R, and would like to produce Genomic Ranges objects themselves. It is assumed that these Genomic Ranges objects are ranges over the genome where methylated signal is present. This method involves the use of the function f_oGRSeparateStrands. Each of the functions used here are described in more detail later in section 3.

5.0 Functions

5.1 f_bComparePoissonLiklihoodApply

f_bComparePoissonLiklihoodApply = function(ivVector, cutoff=0.2)

Arguments:

ivVector = integer vector of size 2 where each value represents the first mate read count from the plus and minus strands

cutoff = cutoff value for the likelihood ratio. Default value used is 0.2

Description:

The function takes two values in a vector form and checks if the the two values are equal under a poisson distribution.

Rets:

Returns true if the two values are equal.

Figure 1: The figure shows a likelihood ratio plots of two read counts 40 (plus strand) and 65 (minus strand). The black line shows the plausible values of rates if the poisson rate is 40 and the red line shows plausible values of 65. The dotted blue and green lines show slices through the likelihood function i.e. the cutoff values. At a cutoff of 0.2 the two rates 40 and 65 are marginally equal and at a cutoff above this, the two rates will not be considered equal. However at a cutoff of 0.1 (blue dotted line) the two rates can be considered equal under a poisson distribution.

Considering Figure 1, the cutoff value can be changed according to the nature of the data sets. In our tests, a cutoff value of 0.2 (default value) gave good matches with BS-Seq data. This value can be adjusted based on the sequencing depth - if the sequencing was done deep enough then there will be good separation between signal and noise channels and a lower threshold may give good separation, and where sequencing depth is not enough then a threshold of 0.3 or higher may be more appropriate. However a relationship between the threshold cutoff and sequencing depth is an open question, and has not been explored in detail.

5.2 f_mFastCountMateReadsOverRanges

f_mFastCountMateReadsOverRanges = function(gr.signal, bam.file, bPaired=T)

Arguments:

gr.signal = Genomic Ranges object of class GRanges, which would be a set of ranges with strand value *, over which counting is done

bam.file = alignment file in bam format, expects the bam index file with same name and .bai extension in the directory

bPaired = T - boolean value, set true if paired end or else set false for single end reads

Description:

It uses a GRanges object (typically calculated by a peak caller e.g. MACS) and using the alignment data, creates a matrix that contains first mate reads from plus and minus strands (mp, mm) and number of reads from both mates (bp, bm). For single end data both mp = bp and mm = bm.

Rets:

Returns a matrix with count data

mRets = matrix(NA, nrow=length(gr.signal), ncol=4, dimnames=list(NULL, c('mp', 'mm', 'bp', 'bm')))

5.3 f_oGRSeparateStrands

Main function that performs strand separation, and uses functions in sections 3.1 and 3.2 to perform these tasks. The details of the function are explained in section 2.3.2.