Difference between revisions of "Fastq mcf"

From BioUML platform
Latest revision as of 17:35, 25 March 2019

Introduction

fastq-mcf attempts to:

  • Detect & remove sequencing adapters and primers
  • Detect limited skewing at the ends of reads and clip
  • Detect poor quality at the ends of reads and clip
  • Detect Ns, and remove from ends
  • Remove reads with CASAVA 'Y' flag (purity filtering)
  • Discard sequences that are too short after all of the above
  • Keep multiple mate-reads in sync while doing all of the above

Usage

Usage: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...]
Detects levels of adapter presence, computes likelihoods and locations (start, end) of the adapters. Removes the adapter sequences from the fastq file(s).

Stats go to stderr unless -o is specified, in which case they go to stdout.

Specify -0 to turn off all default settings.

If you specify multiple 'paired-end' inputs, then a -o option is required for each, e.g. -o read1.clip.fq -o read2.clip.fq
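
As a rough illustration of the paired-end rule above, the following Python sketch assembles an argument list with one -o per input file. The helper name build_fastq_mcf_args and the file names are hypothetical; only the argument order from the usage line and the -o and -l options are taken from this page.

```python
from typing import List, Sequence


def build_fastq_mcf_args(adapters: str,
                         reads: Sequence[str],
                         outputs: Sequence[str],
                         options: Sequence[str] = ()) -> List[str]:
    """Build an argv list for fastq-mcf (hypothetical helper, not part of ea-utils)."""
    if not reads:
        raise ValueError("at least one reads file is required")
    # With several paired-end inputs, each input needs its own -o output.
    if len(reads) > 1 and len(outputs) != len(reads):
        raise ValueError("give one output file per input file")
    args = ["fastq-mcf", *options]
    for out in outputs:
        args += ["-o", out]
    # Positional arguments follow the usage line: adapters, reads, then mates.
    args += [adapters, *reads]
    return args


cmd = build_fastq_mcf_args(
    "adapters.fa",
    ["read1.fq", "read2.fq"],
    ["read1.clip.fq", "read2.clip.fq"],
    options=["-l", "19"],
)
print(" ".join(cmd))
```

Where the tool is installed, the resulting list could be handed to subprocess.run; the sketch itself only builds the arguments and does not run anything.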

Options

   -h       This help
   -o FIL   Output file (stats to stdout)
   -s N.N   Log scale for adapter minimum-length-match (2.2)
   -t N     % occurrence threshold before adapter clipping (0.25)
   -m N     Minimum clip length, overrides scaled auto (1)
   -p N     Maximum adapter difference percentage (10)
   -l N     Minimum remaining sequence length (19)
   -L N     Maximum remaining sequence length (none)
   -D N     Remove duplicate reads : Read_1 has an identical N bases (0)
   -k N     sKew percentage-less-than causing cycle removal (2)
   -x N     'N' (Bad read) percentage causing cycle removal (20)
   -q N     quality threshold causing base removal (10)
   -w N     window-size for quality trimming (1)
   -H       remove >95% homopolymer reads (no)
   -0       Set all default parameters to zero/do nothing
   -U|u     Force disable/enable Illumina PF filtering (auto)
   -P N     Phred-scale (auto)
   -R       Don't remove Ns from the fronts/ends of reads
   -n       Don't clip, just output what would be done
   -C N     Number of reads to use for subsampling (300k)
   -S       Save all discarded reads to '.skip' files
   -d       Output lots of random debugging stuff

Quality adjustment options

   --cycle-adjust    CYC,AMT     Adjust cycle CYC (negative = offset from end) by amount AMT
   --phred-adjust    SCORE,AMT   Adjust score SCORE by amount AMT
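
To make the two adjustments concrete, here is a minimal Python sketch of what they describe, operating on a list of integer quality scores. It is not the tool's implementation. It assumes that positive cycles are 1-based, that -1 refers to the last cycle, and that adjusted scores are floored at 0; none of these details is stated in the help text. The function names are hypothetical.

```python
from typing import List


def cycle_adjust(quals: List[int], cyc: int, amt: int) -> List[int]:
    """Return a copy of quals with the score at cycle cyc shifted by amt.

    Assumptions (not confirmed by the help text): positive cycles are
    1-based, -1 means the last cycle, and scores are floored at 0.
    """
    if cyc == 0 or abs(cyc) > len(quals):
        return list(quals)  # cycle falls outside this read; nothing to adjust
    index = cyc - 1 if cyc > 0 else len(quals) + cyc
    adjusted = list(quals)
    adjusted[index] = max(0, adjusted[index] + amt)
    return adjusted


def phred_adjust(quals: List[int], score: int, amt: int) -> List[int]:
    """Shift every score equal to `score` by amt (floored at 0, an assumption)."""
    return [max(0, q + amt) if q == score else q for q in quals]
```

For example, cycle_adjust(quals, -1, -10) would lower only the last cycle's score by 10 under these assumptions.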

Filtering options

   --[mate-]qual-mean  NUM       Minimum mean quality score
   --[mate-]qual-gt    NUM,THR   At least NUM quals > THR
   --[mate-]max-ns     NUM       Maximum N-calls in a read (can be a %)
   --[mate-]min-len    NUM       Minimum remaining length (same as -l)
   --hompolymer-pct    PCT       Homopolymer filter percent (95)

If the mate- prefix is used, the filter applies to the second non-barcode read only.
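
The read-level filters above can be sketched in Python as a single predicate over an already clipped and trimmed read. This is an illustration of the documented meanings, not the tool's code. The function name, the keyword names, and the split of --max-ns into a count (max_ns) and a percentage (max_ns_pct) are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def passes_filters(seq: str,
                   quals: List[int],
                   qual_mean: Optional[float] = None,
                   qual_gt: Optional[Tuple[int, int]] = None,
                   max_ns: Optional[int] = None,
                   max_ns_pct: Optional[float] = None,
                   min_len: Optional[int] = None) -> bool:
    """Return True if a trimmed read passes every filter that is set."""
    # --min-len: minimum remaining length
    if min_len is not None and len(seq) < min_len:
        return False
    # --qual-mean: minimum mean quality score
    if qual_mean is not None:
        if not quals or sum(quals) / len(quals) < qual_mean:
            return False
    # --qual-gt NUM,THR: at least NUM quality scores strictly above THR
    if qual_gt is not None:
        num, thr = qual_gt
        if sum(1 for q in quals if q > thr) < num:
            return False
    # --max-ns: maximum N-calls, as a count or as a percentage of the read
    n_count = seq.upper().count("N")
    if max_ns is not None and n_count > max_ns:
        return False
    if max_ns_pct is not None and seq and 100.0 * n_count / len(seq) > max_ns_pct:
        return False
    return True
```

A mate- variant would apply the same checks to the second non-barcode read instead of the first.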

Adapter files are 'fasta' formatted.

Specify n/a in place of the adapter file to turn off adapter clipping and just use the filters.

Increasing the scale makes recognition lengths longer; a scale of 100 will force full-length recognition of adapters.

Adapter sequences with _5p in their label will match 'end's, and sequences with _3p in their label will match 'start's; otherwise the 'end' is auto-determined.
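
A small Python sketch of how such an adapter file might be produced is shown below. The labels and sequences are placeholders invented for illustration and are not real adapters; only the FASTA layout and the _5p / _3p label suffixes come from the text above. The function name is hypothetical.

```python
from typing import Iterable, Tuple


def format_adapter_fasta(adapters: Iterable[Tuple[str, str]]) -> str:
    """Render (label, sequence) pairs as FASTA text, one record per adapter."""
    lines = []
    for label, seq in adapters:
        lines.append(f">{label}")   # header line carries the label
        lines.append(seq.upper())   # sequence on the following line
    return "\n".join(lines) + "\n"


# Placeholder labels and sequences, for illustration only.
example = format_adapter_fasta([
    ("ExampleAdapterA_5p", "ACGTACGTACGTACGT"),
    ("ExampleAdapterB_3p", "TGCATGCATGCATGCA"),
    ("ExampleAdapterC", "GATTACAGATTACA"),  # no suffix: end is auto-determined
])
print(example, end="")
```

Writing the returned text to a file such as adapters.fa would give something that can be passed as the first positional argument.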

Skew is when one cycle is poor and 'skewed' toward a particular base. If any nucleotide's share of a cycle is less than the skew percentage, the whole cycle is removed. Disable this for methyl-seq and similar protocols.

Set the skew (-k) or N percentage (-x) to 0 to turn the corresponding check off. This should be done for miRNA, amplicon, and other low-complexity situations.
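
The skew rule can be illustrated with a short Python sketch that computes per-cycle base percentages and flags any cycle in which a nucleotide falls below the threshold. This is a simplification and not the tool's algorithm: it checks every cycle and ignores N calls, whereas the introduction says the tool detects limited skewing at the ends of reads. The function name is hypothetical and reads are assumed to be uppercase.

```python
from collections import Counter
from typing import List, Sequence


def skewed_cycles(reads: Sequence[str], skew_pct: float = 2.0) -> List[int]:
    """Return 0-based cycle indexes where some nucleotide falls below skew_pct.

    Simplified: every cycle is checked and N calls are ignored.
    """
    if skew_pct <= 0 or not reads:
        return []  # a skew of 0 turns the check off
    flagged = []
    for cycle in range(max(len(r) for r in reads)):
        # Count the bases seen at this cycle across all reads long enough.
        counts = Counter(r[cycle] for r in reads if cycle < len(r))
        total = sum(counts[b] for b in "ACGT")
        if total == 0:
            continue
        if any(100.0 * counts[b] / total < skew_pct for b in "ACGT"):
            flagged.append(cycle)
    return flagged
```

With amplicon or miRNA data, many cycles are legitimately dominated by one base, so a check like this would flag them; that is the reason given above for setting -k to 0 in low-complexity situations.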

Duplicate read filtering is appropriate for assembly tasks, but never when the read length is less than the expected coverage. Be careful: -D 50 will use 4.5 GB of RAM on 100 million DNA reads. It works well for RNA assembly.
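
One way to picture the -D option is the Python sketch below, which drops a read pair when the first N bases of read 1 have already been seen. It is only a model of the documented behaviour ("Read_1 has an identical N bases"); the data structure the tool really uses is not described on this page, and the function name is hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple


def drop_duplicates(pairs: Iterable[Tuple[str, str]],
                    n: int) -> Iterator[Tuple[str, str]]:
    """Yield (read1, read2) pairs, skipping any pair whose read 1 repeats
    an earlier read 1 in its first n bases. n == 0 disables filtering."""
    seen: Set[str] = set()
    for read1, read2 in pairs:
        if n > 0:
            key = read1[:n]
            if key in seen:
                continue  # duplicate prefix: discard the whole pair
            seen.add(key)
        yield read1, read2
```

In a model like this the set holds one prefix per distinct read, so memory grows with both N and the number of reads, which is consistent with the RAM warning above.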

Quality filters are evaluated after clipping/trimming.

Links

fastq-mcf on GitHub: https://github.com/ExpressionAnalysis/ea-utils/blob/wiki/FastqMcf.md
