Appendix
This page includes list of inputs and parameter options for all scripts. This should not be used as workflow or tutorial, but is a resource when necessary for specific scripts.
combine_allc_pe.py
- Require imports: sys, multiprocessing, subprocess, os, gzip
- Expects allC files to be for a single chromosome
- Accepts gzipped allC files
Usage: python combine_allc_pe.py [-q] [-h] [-f] [-p=num_proc] [-o=out_id]
[-c=chrm_list | -cf=fasta_index] <allc_path> <sample_name> [sample_name]*
Required:
allc_path path to allC files
sample_name name of sample; used to find allC files
when "-f" flag set, file with sample names listed one per line
Optional:
-q quiet; do not print progress
-h print help message and exit
-f sample names are in file
-p=num_proc number of processors to use [default 1]
-o=out_id output file identifier [default "combined"]
-c=chrm_list comma-separated list of chrms to use [default for arabidopsis]
-cf=fasta_index fasta index file with chrms to use
unmethylate_allc_pe.py
- Required imports: sys, multiprocessing, subprocess, os, gzip
- Accepts gzipped allC files
- Note: specify allC file not allC path and sample names
- Output files are named as input file name with
unmethylated
added
Usage: python unmethylate_allc_pe.py [-q] [-h] [-f] [-p=num_proc] [-v=coverage]
<allc_file> [allc_file]*
Required:
allc_file allC file to unmethylated
when "-f" set, file with list of allC files
Optional:
-q quiet; do not print progress
-h print help message and exit
-f allC files names listed in the file
-p=num_proc number of processors to use [default 1]
-v=coverage coverage for each position [default as-is in input]
filter_allc_coverage_pe.py
- Required imports: sys, math, glob, multiprocessing, subprocess, os
- Other required files: bioFiles
- Expects allC files to have all chromosomes for one sample
Usage: python filter_allc_coverage_pe.py [-q] [-h] [-v=min_cov] <allc_path>
<sample_name> [sample_name]*
Required:
allc_path path to allC files
sampleN name of sample; used to find allC files
Optional:
-q quiet; do not print progress
-h print help and exit
-v=min_cov min coverage for positions to include [default 3]
-p=num_proc number of processors to use [default 1]
dmr_gen_counts_pe.py
- Required imports: sys, math, multiprocessing, subprocess, os
- Expects allC file to include all chromosomes
- When using coverage parameter, uses allC files output by filter_allc_coverage_pe.py.
- Accept gzipped allC files
- Includes option to create pickle files (python object files) of allC files (Saves processing time later)
- Order of samples is important!! Script only includes comparsions of adjacent samples.
Usage: python dmr_gen_counts_pe.py [-h] [-q] [-k] [-o=out_id]
[-m=meth_type] [-p=num_proc] [-v=min_cov] <dmr_file> <allc_path> <sample1>
<sample2> [sampleN]*
Required:
dmr_file tab-delimited file of DMR regions (1-indexed)
allc_path path to allC files
sampleN sample name to analyze; order of samples is important
Optional:
-h print help and exit
-q quiet; do not print progress
-k pickle allC files; creates/reads pickle version of allC files
saves subsequent computational time but is additional memory
-o=out_id identifier for output file [default "out"]
-p=num_proc number of processors [default 1]
-m=meth_type sequence context; must be one of "CG", "CHG", "CHH", and "C"
dmr_gen_ztesting.py
- Required imports: sys, math, os, pandas, numpy, scipy
- Other required files: bioFiles
- Input is the output of dmr_gen_counts_pe.py
Usage: python dmr_gen_ztesting.py [-h] [-q] [-wm] [-n=num_c_thresh]
[-m=meth_thresh] [-d=length_thresh] [-f=fdr] [-o=outID] <in_file>
Required:
in_file tab-delimited file of DMRs and read counts
Optional:
-h print help and exit
-q quiet; do not print progress
-wm methylation threshold is for raw methyl difference
not percent difference
-n=num_c_thresh min number of cytosines in region to be considered
for analysis [default 10]
-d=len_thresh min length of dmr in bp [default 40]
-m=meth_thresh min methylation change btwn generations to be
considered a switch [default 0.3]
-f=fdr FDR value for significant switches [default 0.05]
-o=out_id identifier for output files [default uses input file name]
dmr_file_to_bed.py
- Required imports: sys, os
- Uses switches output of dmr_gen_ztesting.py
Usage: python dmr_file_to_bed.py [-h] [-q] [-v=score_thresh] [-p=name_prefix]
[-o=outID] <in_file>
Required:
in_file input file of DMRs; switches output of dmr_gen_ztesting
Optional:
-h print help and exit
-q quiet; do not print progress
-v=score_thresh min score to include in output [default -1]
-p=name_prefix prefix for naming features [default None]
-o=outID identifer for output file
find_mpos_diff_pe.py
- Required imports: sys, multiprocessing, subprocess, os, gzip
- Expects allC files to be split by chromosomes
- Accepts gzipped allC files
Usage: python find_mpos_diff_pe.py [-h] [-q] [-v=min_cov] [-c=chrm_list]
[-o=out_id] [-p=num_proc] [-m=meth_types] <allc_path> <sample1_name>
<sample2_name>
Required:
allc_path path to allc files
sample_name names of samples to compare
Optional:
-h print this help screen and exit
-q quiet; does not print progress
-v=min_cov min coverage to include a position [default 3]
-o=out_id string for output file name [default "out"]
-c=chrm_list comma-separated list of chrms [default arabidopsis]
-p=num_proc num processors to use [default 1]
-m=meth_types comma-separated list of "CG", "CHG", and/or "CHH"
[default all]
filter_pos_gene_subset.py
- Required imports: sys, os, bisect
- Input should be the output of find_mpos_diff_pe.py to filter positions
Usage: python filter_pos_gene_subset.py [-h] [-q] [-cds] [-v] <pos_file>
<subset_file> <gff_file>
Required:
pos_file 1-based position file, tab-delimited columns: chrm, start, end
subset_file file with list of genes to subset, one gene per line
use "none" or "na" to use all genes
gff_file GFF formatted file with genes
Optional:
-h print help and exit
-q quiet; do not print progress
-cds use CDS annotation not gene
-v include coordinates opposite of what is specified
weighted_meth_by_pos_pe.py
- Required imports: sys, multiprocessing, subprocess, os, bisect, gzip
- Accepts gzipped allC files
Usage: python find_mpos_diff_pe.py [-h] [-q] [-v=min_cov] [-c=chrm_list]
[-o=out_id] [-p=num_proc] [-m=meth_types] <allc_path> <sample1_name>
<sample2_name>
Required:
allc_path path to allc files
sample_name names of samples to compare
Optional:
-h print this help screen and exit
-q quiet; does not print progress
-v=min_cov min coverage to include a position [default 3]
-o=out_id string for output file name [default "out"]
-c=chrm_list comma-separated list of chrms [default arabidopsis]
-p=num_proc num processors to use [default {:d}] 1
-m=meth_types comma-separated list of "CG", "CHG", and/or "CHH"
[default all]
epigenotyping_pe_v1.7.3.py
- Required python imports: sys, math, multiprocessing, subprocess, os, numpy, pandas, sklearn, functools
- Other required files: bth_util, decodingpath, transitions
- decodingpath.py includes code for the smoothing algorithms (forward-backward and Viterbi)
- transitions includes code to compute the transition matrix needed by decoding algorithms
- This script is run per chromosome.
Usage: python epigenotyping_pe_v1.7.3.py [-q] [-n-mpv] [-t-out] [-g=generation]
[-c=bin_thresh] [-d=decoding_type] [-p=num_proc] [-o=out_id] [-m=mother_
samples][-f=father_samples] [-b=bin_size] [-t=centromere] <input_file>
Requried:
input_file tab-delimited file of of weighted methylation by position for samples
Optional:
-q quiet; do not print progress
-h print help and exit
-n-mpv do not check for systematic mid-parent bias
-t-out write transition matrix to file
-g=generation generation of self-crossing; used to determine classification
probabilities; use 0 for uniform weight [default 2]
-d=decode_type decoding type to use (capitlization ignored) [default A]
Viterbi="v" or "viterbi"
Forward-Backward="forwardbackward", "f" or "fb"
Both="all" or "a"
Off="false", "none", or "n"
-o=out_id identifier for output file [default "out" or variation of
input file name]
-p=num_proc number of processors [default 1
-c=bin_thresh minimum number of features per bin to be classified
groups bins to reach this number [default 3
-m=mother_samples comma-separated sample name(s) of mother
[default mother]
-f=father_samples comma-separated sample name(s) of father
[default father]
-b=bin_size size of bins in bp [default 100kbp]
-t=centromere centromere coordinates as "start,end"; can include multipe
centromeres as "start1,end1,start2,end2..." [default None]
find_crossovers.py
- Required imports: sys, os, pandas
- This script is run per chromosome.
Usage: python find_crossovers.py [-c=predict_column] [-o=out_id] <input_file>
Required:
input_file tab-delimited file with samples epigenotype per bin
Optional:
-h print this help menu and exit
-q quiet;do not print progress
-o=out_id output identifier [default variation of
input file name
-c=predict_column label of column to use as final epigenotype
[default "vit.prediction"]
decode_pileup_pe.py
- Required imports: sys, multiprocessing, subprocess, os
- Assumes all pileup files have information for the same positions.
Usage: python decode_pileup_pe.py [-o=out_id] [-p=num_proc] <pileup_file> [pileup_file]*
Required:
pileup_file pileup file for a sample; output from samtools pileup
Optional:
-q quiet; do not print progress
-h print this help menu and exit
-o=out_id identifier for output file [default "out"]
-p=num_proc number of processors [default 1]
pileup_genotype_pe.py
- Required imports: sys, multiprocessing, subprocess, os
- Uses the output of decode_pileup_pe.py
Usage: python pileup_genotype_pe.py [-h] [-q] [-o=out_id] [-p=num_proc]
[-m=mother_label] [-f=father_label] [-v-min_cov] <decoded_pileup_file>
Required:
decode_pileup_file tab-delimited input file; output of decode_pileup_pe.py
Optional:
-h print help and exit
-q quiet; do not print progress
-o=out_id identifier for output file [default variation of input
file name]
-v=min_cov min number of reads needed to support genotype [default 1]
-p=num_proc number of processors [default 1]
-m=mother_label sample name of mother [default mother]
-f=father_label sample name of father [default father]