Snakemake Overview
Main workflow
The scripts below are presented in the order that they are executed by the Tigerfish workflow via Snakemake. Here, all scripts and their function are documented to better understand the workflow, which files are generated at each snakemake step, and where config parameters are called.
generate_jf_count
Purpose: Generates a Jellyfish index file that is used for counting k-mers downstream.
Input: Genome reference FASTA file
Output: A genome query hash file containing counts of all k-mers in the form of a .jf file.
jellyfish count -s 3300M -m {params.mer_val} -o {output} {input.fasta_file}
config.yml parameters
mer_val
fasta_file
Snakemake parameters
{output}
generate_bt2_indices
Purpose: Generates genome Bowtie2 indices which is used for aligning probes to the entire genome of interest.
Input: Genome reference FASTA file
Output: Collection of Bowtie2 indices placed in a Bowtie2 directory of your choosing.
bowtie2-build --threads 4 {input} {BOWTIE2_DIR}/{ASSEMBLY}
config.yml parameters
{input}
Snakemake parameters
{BOWTIE2_DIR}/{ASSEMBLY}
generate_jf_idx
Purpose: To generate k-mer count index files using the derived Jellyfish hash table from the generate_jf_count step. Generates independent k-mer count index files for each scaffold.
Inputs: Genome reference FASTA file (FASTA_FILE). Output Jellyfish hash table generated from the generate_jf_count step (JF_INDEXFILE)
Outputs: An output Jellyfish k-mer count file containing all k-mers within a selected scaffold and it’s corresponding counts (JF_OUT). A file that is used to reference the base position of where the start of each k-mer in the output count file occurs (J_INDEX_OUT). A seperated FASTA file of each selected scaffold (SCAFFOLD_FA_OUT).
usage: generate_jf_idx.py [-h] -f FASTA_FILE -j JF_INDEXFILE -c CHR_NAME -f_o
SCAFFOLD_FA_OUT -j_o JF_OUT -i J_INDEX_OUT -m
MER_VAL
config.yml parameters
fasta_file
sample (CHR_NAME)
mer_val (MER_VAL)
Snakemake parameters
JF_INDEXFILE (generate_jf_count output)
SCAFFOLD_FA_OUT
JF_OUT
J_INDEX_OUT
split_bed
Purpose: Reads a BED file provided by the user containing coordinates of regions for probe design. If regions on different chromosomes exist, this script will generate independent files for different regions based on chromosome.
Inputs: A BED file provided in the config.yml if defined_coords: “TRUE”.
Outputs: A BED file split by chromosomes if different repeat regions are provided in the input file.
usage: split_bed.py [-h] -b BED_FILE -c CHROM_NAME -o BED_OUT
config.yml parameters
sample (CHROM_NAME)
bed_file (BED_FILE)
Snakemake parameters
BED_OUT
repeat_ID
Purpose: Reads a Jellyfish count file of a given scaffold, a chrom index file to account for base location, as well as the path to the chromosome FASTA to generate BED files of genomic regions that have been flagged as having elevated k-mer counts based on user parameters.
Input: Jellyfish count and index files derived from generate_jf_idx output.
Output: BED File of repeat region coordinates.
usage: repeat_ID.py [-h] -j JF_COUNT -i INDEX_FILE -chr CHR_NAME -st START
[-w WINDOW_LENGTH] [-t THRESHOLD] [-c COMPOSITION_SCORE]
-o_b BED_FILE -m MER_LENGTH
config.yml parameters
sample (CHR_NAME)
file_start (START)
window (WINDOW_LENGTH)
threshold (THRESHOLD)
composition (COMPOSITION_SCORE)
mer_val (MER_LENGTH)
Snakmake parameters
JF_COUNT (JF_OUT)
INDEX_FILE (JF_INDEXFILE)
BED_FILE
design_probes
Purpose: Designs oligo probes against identified repeat regions if repeat_ID: “TRUE”. If repeat coordinates provided, probes will be designed here against those regions.
Input: Provided bed_file or output from repeat_ID step.
Output: File containing probe scaffold, start, stop, melting temperature, probe sequence in a tab seperated file.
usage: design_probes.py [-h] -b BED_NAME -r_o REGION_OUT -p_o PROBES_OUT -g
GENOME_FASTA -c CHROM_NAME -l MIN_LEN -L MAX_LEN -t
MIN_TEMP -T MAX_TEMP
config.yml parameters
fasta_file (GENOME_FASTA)
sample (CHROM_NAME)
min_len (MIN_LEN)
max_len (MAX_LEN)
min_temp (MIN_TEMP)
max_temp (MAX_TEMP)
Snakemake parameters
BED_NAME (BED_FILE)
REGION_OUT
PROBES_OUT
kmer_filter
Purpose: Takes a probe file generated from design_probes and computes each probe’s aggregate on-target region k-mer count and k-mer counts that occur in the whole genome. Rank orders probes based on this on target binding proportion and aggregate on-target region k-mer count.
Input: Generated probe file, Jellyfish k-mer count file, and the FASTA file provided for all repeat regions.
Output: A probe file with oligos provided in ranked order based on user parameter preferences.
usage: kmer_filter.py [-h] -p PROBE_FILE -j JF_FILE -f FASTA [-m MERLENGTH] -o
OUT_PATH -c1 C1_VALUE -c2 C2_VALUE
config.yml parameters
c1_val (C1_value)
c2_val (C2_value)
mer_val (MERLENGTH)
Snakemake parameters
PROBE_FILE (PROBES_OUT)
JF_FILE (JF_COUNT)
OUT_PATH
probe_mer_filter
Purpose: Takes a probe file that undergoes rank sorting in kmer_filter to cull probes based on user parameters.
Input: Output probe file from kmer_filter step
Output: Provides truncated probe list that will undergo genome wide alignment to identify best candidate probes for each repeat region.
usage: probe_mer_filter.py [-h] -f FILE_PATH -o OUT_PATH -e ENRICH_SCORE -cn
COPY_NUM -m MER_CUTOFF -k MERLENGTH
config.yml parameters
enrich_score (ENRICH_SCORE)
copy_num (COPY_NUM)
mer_cutoff (MER_CUTOFF)
mer_val (MERLENGTH)
Snakemake parameters
FILE_PATH (PROBES_OUT)
OUT_PATH
generate_genome_bins
Purpose: Takes reference genome file and makes it into bins of a specified size using BEDtools.
Input: Genome chrom.sizes file provided as chrom_sizes_file.
Output: Two files. One file contains the chromosome and bin position in a tab seperated file for alignment which was made using the genome_windows parameter. The second file creates a threshold window to compute the size of the imaging repeat.
bedtools makewindows -g {input.sizes} -w {params.genome_windows} > {output} |
bedtools makewindows -g {input.sizes} -w {params.thresh_window} > {output} |
config.yml parameters
genome_windows {params.genome_windows}
thresh_window {params.thresh_window}
chrom_sizes_file {input.sizes}
Snakemake parameters
{output}
make_chrom_dir (checkpoint)
Purpose: Before alignment, to parallelize multiple repeat regions found within each scaffold, all repeats are split into independent files for parallel computing.
Input: Output filtered probes from probes_mer_filter step.
Output: A series of probe files split by each repeat region and grouped within a scaffold name’s directory.
usage: split_filter.py [-h] -f FILE_PATH -o OUT_PATH
config.yml parameters
None.
Snakemake parameters
PROBES_OUT (FILE_PATH)
Specified directory in Snakemake file (OUT_PATH)
alignment_filter
Purpose: Takes probes filtered from probe_mer_split after undergoing repeat region split in gather_repeat_regions. Aligns candidate probes to entire reference genome and takes pairwise derived sequences to compute predicted thermodynamic duplexing probability. This means Tigerfish uses this probabilities to aggregate which alignments match to the target repeat region vs elsewhere in the target genome. This is just to ensure that final candidate probes are able to bind to targets of interest.
Input: Filtered and rank sorted probe file.
Output: Select repeat specific probes based on user specified filtering parameters.
usage: alignment_filter.py [-h] -f PROBE_FILE -o OUT_FILE
[-r REGION_THRESHOLD] -b BOWTIE_INDEX -k
BT2_MAX_ALIGN -l SEED_LENGTH -t MODEL_TEMP -pb
MAX_PDUPS_BINDING -moT MIN_ON_TARGET -Mr
MAX_PROBE_RETURN -gb GENOMIC_BIN -th THRESH -rf REF_FLAG
config.yml parameters
target_sum (REGION_THRESHOLD)
bt2_alignments (BT2_MAX_ALIGN)
seed_length (SEED_LENGTH)
model_temp (MODEL_TEMP)
max_pdups_binding (MAX_PDUPS_BINDING)
min_on_target (MIN_ON_TARGET)
max_probe_return (MAX_PROBE_RETURN)
off_bin_thresh (THRESH)
ref_flag (REF_FLAG)
Snakemake parameters
PROBES_OUT (PROBE_FILE)
(OUT_FILE)
(BOWTIE_INDEX)
genome_windows (GENOMIC_BIN)
merge_alignment_filter
Purpose: Following alignment of all regions, all seperate repeat files are merged into an aggregate probe file.
Input: Aggregated output from alignment_filter step.
Output: A summary file of total candidates found within each repeat region.
usage: cat {input} > {output}
config.yml parameters
None
Snakemake parameters
None
split_rm_alignments
Purpose: Splits all repeat regions into remaining probe files to undergo downstream alignment.
Input: Merged files from merge_alignment_filter step.
Output: A directory for each scaffold containing independent repeat region files for all remaining candidate probes.
usage: split_rm_alignments.py -f {input} -o {output}
config.yml parameters
None
Snakemake parameters
None
align_probes
Purpose: Takes probes from split_rm_alignments and aligns them to generated genome-wide Bowtie2 indices created during previous run for probe generation in the main workflow. Note: it is important that the whole genome FASTA is provided as the fasta_file to ensure that a correct genome wide Bowtie2 index is made.
Input: Split probes from make_chrom_dir step and Bowtie2 index derived from main pipeline workflow.
Output: An alignment file containing the derived mapped alignments for each probe sequence corresponding to a target repeat region.
usage: generate_alignments.py [-h] -f FILE_PATH -o OUT_PATH -b BOWTIE_INDEX -k
BT2_MAX_ALIGN -l SEED_LENGTH -t MODEL_TEMP
config.yml parameters
bt2_alignments (BT2_MAX_ALIGN)
seed_length (SEED_LENGTH)
model_temp (MODEL_TEMP)
bowtie_index (BOWTIE_INDEX)
Snakemake parameters
FILE_PATH
OUT_PATH
derived_beds
Purpose: Takes the output of the alignment file to generate a BED file of all derived alignment locations.
Input: The output alignment file from align_probes.
Output: A BED file containing the coords of all mapped genome wide alignments.
usage: make_derived_beds.py [-h] -f FILE_PATH -o OUT_PATH
config.yml parameters
None
Snakemake parameters
FILE_PATH
OUT_PATH
get_region_bed
Purpose: Takes the subset probe file and generates a BED file from the repeat target coordinates.
Input: The split probe file generated from alignment_filter containing candidate probes.
Output: A BED file containing the repeat region coordinates.
usage: get_region_bed.py [-h] -i IN_FILE -o OUT_FILE
config.yml parameter
None
Snakemake parameters
IN_FILE
OUT_FILE
bedtools_intersect
Purpose: Performs a BEDtools intersect using the generated genomic bins file on the BED coordinates from derived alignments and the target repeat region. This is to be able to map alignments to target and non-target bins.
Input: The generated genome bin file, BED file from derived alignments, and BED file from target repeat region.
Output: An intersected BEDtools file containing the coordinates of each mapped BED coordinate to the corresponding genome bin window it falls in.
config.yml parameters
None
Snakemake parameters
input.derived_bed
input.genome_bin
input.repeat_bed
output.alignments_out
output.repeat_out
get_alignments
Purpose: For all alignments, predicted duplexing (pDups) values are computed to assess how likely a probe is to bind at a mapped genomic region. This is then used to compute aggregate on-target vs off-target based on the genomic windows computed.
Input: The genome bin file derived from the tresh_window parameter, derived alignment and repeat region mapped genomic overlaps from BEDtools.
Output: An annotated probe file summarizing all true on and off target alignments in the entire genome for all probe candidates that mapped to a particular repeat region, A populated file summarizing which bins are mapping to the repeat target bins vs other bins in the genome if above provided threshold value, and plotted maps based on where pileup binding can be found.
usage: get_alignments.py [-h] -c_t CHROM_TRACK -c_o CHROM_OVERLAPS -r_o
REPEAT_OVERLAP -p PAIRWISE_PDUPS -pl OUT_PLOT -t
THRESH -t_s THRESH_SUMM -c_s CHROM_SUMM
config.yml parameters
align_thresh (THRESH)
Snakemake parameters
input.genome_bin (CHROM_TRACK)
output.repeat_out (CHROM_OVERLAPS)
output.alignments_out (REPEAT_OVERLAP)
(PAIRWISE_PDUPS)
(OUT_PLOT)
(THRESH_SUMM)
(CHROM_SUMM)
generate_chromomap
Note: This step is only implemented if probe_cand_binding mode is activated.
Purpose: Implements an R library, chromoMap, to plot where target probes are anticipated to make FISH signal. These are especially helpful to validate binding sites based on morphology if validating probes via metaphase FISH assay.
Input: Generated repeat region probe BED coordinates.
Output: An image of a chromosome with an annotated color highlighting the repeat region location.
usage: Rscript --vanilla make_chromomap.R -c {input.chrom_sizes} -r {input.probe_bed} -o {output}
map_region_coords
Purpose: Computes thresholded window where imaging signal is predicted within each repeat region site.
Input: alignment_filter output, get_alignments output, align_probes output
Output: A final candidate probes file containing the imaging target region and a summary of probe binding information for each repeat region.
usage: collapse_repeat.py [-h] -f get_alignments.output.thresh_binding -a align_probes.output
-p alignment_filter.output -po {output.probe_file} -ro {output.repeat_summary}
config.yml parameters
None
Snakemake parameters
get_alignments.output.thresh_binding
align_probes.output
alignment_filter.output
output.probe_file
output.repeat_summary
merge_mapping
Purpose: Takes all repeat regions that have final candidate probes and merges them into scaffold files.
Input: Output of map_region_coords.
Output: A directory for each scaffold containing finalized candidate probes.
usage: cat {input} > {output}
config.yml parameters
None
Snakemake parameters
None
summary
Purpose: Following alignment of all regions, all seperate repeat files are merged into an aggregate probe file. From this probe file statistics are computed that summarizes the total probes per repreat region and their aggregate on and off-target binding.
Input: Aggregated output from merge_mapping step.
Output: A summary file of total candidates found within each repeat region.
usage: finish_summary.py [-h] -f PROBE_FILE -o OUT_FILE
config.yml parameters
None
Snakemake parameters
PROBES_OUT (PROBE_FILE)
OUT_FILE
Below is an example image of a DAG that is produced by Tigerfish following Probe Design Mode on the DXZ4 repeat:
config.yml parameters
If you have more questions about any scripts in particular from the main workflow or post process workflow, be sure to check out our GitHub page. Also check out our Tigerfish tutorial to see how these scripts come together to generate example data.