Appendix

This page lists the inputs and parameter options for all scripts. It is not intended as a workflow or tutorial, but as a reference for specific scripts when needed.

combine_allc_pe.py

  • Required imports: sys, multiprocessing, subprocess, os, gzip
  • Expects allC files to be for a single chromosome
  • Accepts gzipped allC files
Usage:  python combine_allc_pe.py [-q] [-h] [-f] [-p=num_proc] [-o=out_id] 
        [-c=chrm_list | -cf=fasta_index] <allc_path> <sample_name> [sample_name]*

Required:
allc_path       path to allC files
sample_name     name of sample; used to find allC files
                when "-f" flag set, file with sample names listed one per line

Optional:
-q              quiet; do not print progress
-h              print help message and exit
-f              sample names are in file
-p=num_proc     number of processors to use [default 1]
-o=out_id       output file identifier [default "combined"]
-c=chrm_list	comma-separated list of chrms to use [default for arabidopsis]
-cf=fasta_index	fasta index file with chrms to use
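
All scripts on this page take bare switches (such as -q) and valued options joined with "=" and no space (such as -p=4). The following is a minimal sketch of assembling such a command line from Python; build_command is a hypothetical helper, not part of the scripts, and the directory, sample, and chromosome names are placeholders.

```python
def build_command(script, flags=(), options=None, positionals=()):
    """Return an argv list in the '-flag' / '-name=value' style used on this page.

    build_command is a hypothetical helper for illustration only.
    """
    argv = ["python", script]
    # Boolean switches are passed bare, e.g. -q or -f
    argv += ["-" + flag for flag in flags]
    # Valued options are joined with '=' and no space, e.g. -p=4
    for name, value in (options or {}).items():
        argv.append("-{}={}".format(name, value))
    # Required arguments come last, in the order shown in the usage line
    argv += [str(p) for p in positionals]
    return argv


cmd = build_command(
    "combine_allc_pe.py",
    flags=["q"],
    options={"p": 4, "o": "combined", "c": "Chr1,Chr2"},
    positionals=["allc_dir", "sampleA", "sampleB"],  # placeholder names
)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) without shell quoting concerns.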

unmethylate_allc_pe.py

  • Required imports: sys, multiprocessing, subprocess, os, gzip
  • Accepts gzipped allC files
  • Note: specify allC files directly, not an allC path and sample names
  • Output files are named after the input file, with "unmethylated" added
Usage:  python unmethylate_allc_pe.py [-q] [-h] [-f] [-p=num_proc] [-v=coverage]
	    <allc_file> [allc_file]*

Required:
allc_file     allC file to unmethylate
              when "-f" set, file with list of allC files

Optional:
-q            quiet; do not print progress
-h            print help message and exit
-f            allC file names are listed in a file
-p=num_proc   number of processors to use [default 1]
-v=coverage   coverage for each position [default as-is in input]

filter_allc_coverage_pe.py

  • Required imports: sys, math, glob, multiprocessing, subprocess, os
  • Other required files: bioFiles
  • Expects allC files to have all chromosomes for one sample
Usage:  python filter_allc_coverage_pe.py [-q] [-h] [-v=min_cov] [-p=num_proc]
        <allc_path> <sample_name> [sample_name]*

Required:
allc_path    path to allC files
sample_name  name of sample; used to find allC files

Optional:
-q           quiet; do not print progress
-h           print help and exit
-v=min_cov   min coverage for positions to include [default 3]
-p=num_proc  number of processors to use [default 1]
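
The effect of -v=min_cov can be sketched as a per-line filter on the total read count. This is an illustration only, not the script's code. It assumes a tab-delimited allC layout of chromosome, position, strand, context, methylated reads, total reads, methylation call; check this against your own files.

```python
def filter_by_coverage(lines, min_cov=3):
    """Yield allC lines whose total read count is at least min_cov.

    Assumed column layout (an assumption, not confirmed by this page):
    chrom, pos, strand, context, methylated reads, total reads, call.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip header rows or malformed lines rather than failing
        if len(fields) < 6 or not fields[5].isdigit():
            continue
        if int(fields[5]) >= min_cov:
            yield line


example = [
    "Chr1\t101\t+\tCG\t4\t5\t1\n",
    "Chr1\t102\t-\tCHH\t0\t2\t0\n",
    "Chr1\t110\t+\tCHG\t1\t3\t0\n",
]
kept = list(filter_by_coverage(example, min_cov=3))
```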

dmr_gen_counts_pe.py

  • Required imports: sys, math, multiprocessing, subprocess, os
  • Expects allC file to include all chromosomes
  • When using the coverage parameter, uses allC files output by filter_allc_coverage_pe.py
  • Accepts gzipped allC files
  • Includes an option to create pickle files (Python object files) of allC files, which saves processing time later
  • Order of samples is important! The script only compares adjacent samples.
Usage:  python dmr_gen_counts_pe.py [-h] [-q] [-k] [-o=out_id]
        [-m=meth_type] [-p=num_proc] [-v=min_cov] <dmr_file> <allc_path> <sample1>
        <sample2> [sampleN]*

Required:
dmr_file       tab-delimited file of DMR regions (1-indexed)
allc_path      path to allC files
sampleN        sample name to analyze; order of samples is important

Optional:
-h             print help and exit
-q             quiet; do not print progress
-k             pickle allC files; creates/reads pickle version of allC files
               saves computation time on later runs but uses additional memory
-o=out_id      identifier for output file [default "out"]
-p=num_proc    number of processors [default 1]
-m=meth_type   sequence context; must be one of "CG", "CHG", "CHH", or "C"
-v=min_cov     min coverage; uses allC files output by filter_allc_coverage_pe.py

dmr_gen_ztesting.py

  • Required imports: sys, math, os, pandas, numpy, scipy
  • Other required files: bioFiles
  • Input is the output of dmr_gen_counts_pe.py
Usage:  python dmr_gen_ztesting.py [-h] [-q] [-wm] [-n=num_c_thresh]
        [-m=meth_thresh] [-d=len_thresh] [-f=fdr] [-o=out_id] <in_file>

Required:
in_file           tab-delimited file of DMRs and read counts

Optional:
-h                print help and exit
-q                quiet; do not print progress
-wm               methylation threshold applies to the raw methylation
                  difference, not the percent difference
-n=num_c_thresh   min number of cytosines in region to be considered
                  for analysis [default 10]
-d=len_thresh     min length of dmr in bp [default 40]
-m=meth_thresh    min methylation change between generations to be
                  considered a switch [default 0.3]
-f=fdr            FDR value for significant switches [default 0.05]
-o=out_id         identifier for output files [default uses input file name]
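
As a rough illustration of the kind of test the script name suggests, the sketch below runs a pooled two-proportion z-test on methylated and total read counts from two adjacent samples. Whether dmr_gen_ztesting.py uses exactly this statistic is an assumption; the function name and example counts are hypothetical, and Python 3 division is assumed.

```python
import math


def two_proportion_z(m1, t1, m2, t2):
    """Return (z, two-sided p) comparing methylated fractions m1/t1 and m2/t2.

    Hypothetical helper: a pooled two-proportion z-test, which may differ
    from the statistic the script actually computes.
    """
    p1, p2 = m1 / t1, m2 / t2
    pooled = (m1 + m2) / (t1 + t2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / t1 + 1 / t2))
    if se == 0:
        # Both samples fully methylated or fully unmethylated: no difference
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p = two_proportion_z(80, 100, 40, 100)  # made-up read counts
```

With many regions tested, the resulting p-values would still need a multiple-testing correction before applying the -f=fdr cutoff.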

dmr_file_to_bed.py

  • Required imports: sys, os
  • Uses switches output of dmr_gen_ztesting.py
Usage:  python dmr_file_to_bed.py [-h] [-q] [-v=score_thresh] [-p=name_prefix]
        [-o=outID] <in_file>

Required:
in_file          input file of DMRs; switches output of dmr_gen_ztesting

Optional:
-h               print help and exit
-q               quiet; do not print progress
-v=score_thresh  min score to include in output [default -1]
-p=name_prefix   prefix for naming features [default None]
-o=outID         identifier for output file

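
DMR regions elsewhere on this page are described as 1-indexed, while BED uses 0-based starts and exclusive ends. A minimal sketch of that coordinate conversion follows; it assumes the DMR end coordinate is inclusive, and the function and feature names are hypothetical rather than taken from the script.

```python
def dmr_to_bed(chrom, start, end, name=".", score=0):
    """Convert a 1-based, inclusive region to a tab-delimited BED line.

    Assumes the input end coordinate is inclusive (an assumption).
    """
    # BED starts are 0-based, so subtract 1. BED ends are exclusive, so a
    # 1-based inclusive end is numerically unchanged.
    return "\t".join([chrom, str(start - 1), str(end), name, str(score)])


line = dmr_to_bed("Chr1", 1001, 1100, name="dmr_1", score=42)
```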
find_mpos_diff_pe.py

  • Required imports: sys, multiprocessing, subprocess, os, gzip
  • Expects allC files to be split by chromosomes
  • Accepts gzipped allC files
Usage:  python find_mpos_diff_pe.py [-h] [-q] [-v=min_cov] [-c=chrm_list]
        [-o=out_id] [-p=num_proc] [-m=meth_types] <allc_path> <sample1_name>
        <sample2_name>

Required:
allc_path	path to allc files
sample_name	names of samples to compare

Optional:
-h              print this help screen and exit
-q              quiet; does not print progress
-v=min_cov      min coverage to include a position [default 3]
-o=out_id       string for output file name [default "out"]
-c=chrm_list    comma-separated list of chrms [default arabidopsis]
-p=num_proc     num processors to use [default 1]
-m=meth_types   comma-separated list of "CG", "CHG", and/or "CHH"
                [default all]

filter_pos_gene_subset.py

  • Required imports: sys, os, bisect
  • Input should be the output of find_mpos_diff_pe.py to filter positions
Usage:  python filter_pos_gene_subset.py [-h] [-q] [-cds] [-v] <pos_file>
        <subset_file> <gff_file>

Required:
pos_file     1-based position file, tab-delimited columns: chrm, start, end
subset_file  file with list of genes to subset, one gene per line
             use "none" or "na" to use all genes
gff_file     GFF formatted file with genes

Optional:
-h           print help and exit
-q           quiet; do not print progress
-cds         use CDS annotation not gene
-v           include coordinates opposite of what is specified

weighted_meth_by_pos_pe.py

  • Required imports: sys, multiprocessing, subprocess, os, bisect, gzip
  • Accepts gzipped allC files
Usage:  python weighted_meth_by_pos_pe.py [-h] [-q] [-v=min_cov] [-c=chrm_list]
        [-o=out_id] [-p=num_proc] [-m=meth_types] <allc_path> <sample1_name>
        <sample2_name>

Required:
allc_path       path to allc files
sample_name     names of samples to compare

Optional:
-h              print this help screen and exit
-q              quiet; does not print progress
-v=min_cov      min coverage to include a position [default 3]
-o=out_id       string for output file name [default "out"]
-c=chrm_list	comma-separated list of chrms [default arabidopsis]
-p=num_proc     num processors to use [default 1]
-m=meth_types   comma-separated list of "CG", "CHG", and/or "CHH"
                [default all]
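
For orientation, weighted methylation is commonly defined as methylated reads divided by total reads, summed over the cytosines considered. The sketch below applies that definition with a per-position coverage cutoff like -v=min_cov. Whether the script computes exactly this quantity is an assumption, and the function name and counts are hypothetical.

```python
def weighted_methylation(positions, min_cov=3):
    """positions: iterable of (methylated_reads, total_reads) per cytosine.

    Returns sum(methylated) / sum(total) over positions meeting min_cov,
    or NaN when no position qualifies. Assumed definition, not the script's.
    """
    meth = total = 0
    for m, t in positions:
        if t < min_cov:
            continue  # position has too little coverage to include
        meth += m
        total += t
    return meth / total if total else float("nan")


level = weighted_methylation([(4, 5), (0, 2), (1, 3)], min_cov=3)
```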

epigenotyping_pe_v1.7.3.py

  • Required python imports: sys, math, multiprocessing, subprocess, os, numpy, pandas, sklearn, functools
  • Other required files: bth_util, decodingpath, transitions
    • decodingpath.py includes code for the smoothing algorithms (forward-backward and Viterbi)
    • transitions includes code to compute the transition matrix needed by decoding algorithms
  • This script is run per chromosome.
Usage:  python epigenotyping_pe_v1.7.3.py [-q] [-h] [-n-mpv] [-t-out]
        [-g=generation] [-c=bin_thresh] [-d=decode_type] [-p=num_proc]
        [-o=out_id] [-m=mother_samples] [-f=father_samples] [-b=bin_size]
        [-t=centromere] <input_file>

Required:
input_file          tab-delimited file of weighted methylation by position for samples

Optional:
-q                  quiet; do not print progress
-h                  print help and exit
-n-mpv              do not check for systematic mid-parent bias
-t-out              write transition matrix to file
-g=generation       generation of self-crossing; used to determine classification
                    probabilities; use 0 for uniform weight [default 2]
-d=decode_type      decoding type to use (capitalization ignored) [default A]
                    Viterbi="v" or "viterbi"
                    Forward-Backward="forwardbackward", "f" or "fb"
                    Both="all" or "a"
                    Off="false", "none", or "n"
-o=out_id           identifier for output file [default "out" or variation of
                    input file name]
-p=num_proc         number of processors [default 1]
-c=bin_thresh       minimum number of features per bin to be classified
                    groups bins to reach this number [default 3]
-m=mother_samples   comma-separated sample name(s) of mother
                    [default mother]
-f=father_samples   comma-separated sample name(s) of father
                    [default father]
-b=bin_size         size of bins in bp [default 100kbp]
-t=centromere       centromere coordinates as "start,end"; can include multiple
                    centromeres as "start1,end1,start2,end2..." [default None]
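
To illustrate what Viterbi decoding does to per-bin classifications, here is a generic pure-Python sketch. It is not the code in decodingpath.py. The two-state setup, probabilities, and names are made up for the example, and all probabilities must be greater than zero because logarithms are taken.

```python
import math


def viterbi(obs_logprob, trans, init):
    """Return the most likely state path as a list of state indices.

    obs_logprob: per bin, a list of log P(observation | state)
    trans:       trans[i][j] = P(state j at next bin | state i at this bin)
    init:        init[i] = P(state i at the first bin)
    """
    n = len(init)
    score = [math.log(init[s]) + obs_logprob[0][s] for s in range(n)]
    back = []
    for t in range(1, len(obs_logprob)):
        new_score, ptr = [], []
        for j in range(n):
            # Best predecessor state for ending in state j at bin t
            best = max(range(n), key=lambda i: score[i] + math.log(trans[i][j]))
            ptr.append(best)
            new_score.append(score[best] + math.log(trans[best][j]) + obs_logprob[t][j])
        score = new_score
        back.append(ptr)
    # Trace back from the best final state
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def log_rows(rows):
    return [[math.log(p) for p in row] for row in rows]


trans = [[0.9, 0.1], [0.1, 0.9]]  # made-up: switching states is unlikely
init = [0.5, 0.5]
# The third bin weakly favors state 1, but a double switch costs more
noisy = log_rows([[0.8, 0.2], [0.8, 0.2], [0.4, 0.6], [0.8, 0.2], [0.8, 0.2]])
path = viterbi(noisy, trans, init)
```

In this example the isolated weak call in the third bin is smoothed away, which is the behavior that makes decoding useful after per-bin classification.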

find_crossovers.py

  • Required imports: sys, os, pandas
  • This script is run per chromosome.
Usage:  python find_crossovers.py [-h] [-q] [-c=predict_column] [-o=out_id] <input_file>

Required:
input_file           tab-delimited file with each sample's epigenotype per bin

Optional:
-h                   print this help menu and exit
-q                   quiet; do not print progress
-o=out_id            output identifier [default variation of
                     input file name]
-c=predict_column    label of column to use as final epigenotype
                     [default "vit.prediction"]
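
Conceptually, a crossover shows up as a change in the predicted epigenotype between consecutive bins of one sample. The sketch below shows that idea only and is not the script's implementation; the bin coordinates and the labels "mother", "MPV", and "father" are hypothetical.

```python
def find_label_changes(bins):
    """bins: list of (bin_start, label) ordered along one chromosome.

    Returns (previous_bin_start, bin_start, previous_label, label) for
    every pair of adjacent bins whose labels differ.
    """
    changes = []
    for (prev_start, prev_label), (start, label) in zip(bins, bins[1:]):
        if label != prev_label:
            changes.append((prev_start, start, prev_label, label))
    return changes


bins = [
    (0, "mother"),
    (100000, "mother"),
    (200000, "MPV"),
    (300000, "MPV"),
    (400000, "father"),
]
changes = find_label_changes(bins)
```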

decode_pileup_pe.py

  • Required imports: sys, multiprocessing, subprocess, os
  • Assumes all pileup files have information for the same positions.
Usage:  python decode_pileup_pe.py [-q] [-h] [-o=out_id] [-p=num_proc] <pileup_file> [pileup_file]*

Required:
pileup_file    pileup file for a sample; output from samtools pileup

Optional:
-q             quiet; do not print progress
-h             print this help menu and exit
-o=out_id      identifier for output file [default "out"]
-p=num_proc    number of processors [default 1]
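
For readers unfamiliar with the pileup format, the read-base column encodes matches as "." or ",", mismatches as letters, and read starts, read ends, and indels with extra markers. The following sketch counts base calls from that column for one position. It illustrates the format and is not the decoding logic of decode_pileup_pe.py; the function name and example string are hypothetical.

```python
def count_bases(ref_base, read_bases):
    """Count A/C/G/T calls in the read-base column of one pileup line."""
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    ref_base = ref_base.upper()
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":
            i += 2  # '^' is followed by one mapping-quality character
            continue
        if ch in "+-":
            # Indel: sign, length digits, then that many indel bases
            j = i + 1
            while j < len(read_bases) and read_bases[j].isdigit():
                j += 1
            i = j + int(read_bases[i + 1:j])
            continue
        if ch in ".,":
            # Match to the reference on the forward or reverse strand
            if ref_base in counts:
                counts[ref_base] += 1
        elif ch.upper() in counts:
            counts[ch.upper()] += 1
        # '$', '*', 'N', '<' and '>' carry no base call and are skipped
        i += 1
    return counts


counts = count_bases("c", ".,^I.T$a+2AG.-1c,")
```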

pileup_genotype_pe.py

  • Required imports: sys, multiprocessing, subprocess, os
  • Uses the output of decode_pileup_pe.py
Usage:  python pileup_genotype_pe.py [-h] [-q] [-o=out_id] [-p=num_proc]
        [-m=mother_label] [-f=father_label] [-v=min_cov] <decoded_pileup_file>

Required:
decoded_pileup_file tab-delimited input file; output of decode_pileup_pe.py

Optional:
-h                  print help and exit
-q                  quiet; do not print progress
-o=out_id           identifier for output file [default variation of input 
                    file name]
-v=min_cov          min number of reads needed to support genotype [default 1]
-p=num_proc         number of processors [default 1]
-m=mother_label     sample name of mother [default mother]
-f=father_label     sample name of father [default father]