Assembly Quality & Improvement

QA Assessment, Scaffolding & Contamination Detection
Learn systematic approaches to evaluate, enhance, and validate your genome assemblies.
Why Assembly Quality Assessment?
Catching Problems Before They Compound
Assembly artifacts create serious downstream consequences. Incorrect joins between sequences lead to false biological conclusions, wasted experimental validation efforts, and potentially retracted publications. Quality assessment acts as your safety net, catching these issues early.
Critical issues QA detects:
Misassemblies: Regions incorrectly joined from distant genomic locations
Gaps: Missing sequences due to insufficient coverage or complexity
Errors: Incorrect base calls, small insertions, or deletions
Contamination: Foreign DNA sequences from other organisms
QUAST: Comprehensive Assembly Statistics
Your Assembly Report Card
QUAST (Quality Assessment Tool for Genome Assemblies) calculates over 30 assembly quality metrics and works with or without a reference genome. When a reference is available, QUAST aligns the contigs to it to detect structural problems such as relocations, inversions, and translocations.
The tool generates interactive HTML reports with visualizations, making it easy to identify issues and compare multiple assemblies side by side. QUAST is widely used for assembly quality reporting in genomics publications.
Basic Statistics: Contig count, N50, total length, GC content
Misassembly Detection: Identifies relocations, inversions, translocations
Gene Analysis: Validates gene structure consistency
Coverage Patterns: Analyzes uniformity and identifies gaps
Understanding QUAST Metrics
Contig Count
What it means: Total number of assembled sequences
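Interpretation: Fewer contigs generally indicate a more contiguous assembly
N50
What it means: The contig length at which contigs of that length or longer contain at least 50% of the total assembly length (a length-weighted midpoint, not the median contig length)
Interpretation: Higher values indicate a more contiguous assembly; N50 alone says nothing about correctness or completeness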
Total Length
What it means: Sum of all contig lengths
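Interpretation: Should be close to the expected genome size; a much larger total suggests contamination or redundant contigs, a much smaller one suggests missing or collapsed sequence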
GC Content
What it means: Percentage of G+C bases
Interpretation: Should match known organism profile
Genome Fraction
What it means: Percent of reference covered (with reference)
Interpretation: >95% indicates excellent coverage
Misassemblies
What it means: Detected structural errors (with reference)
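Interpretation: Relocations, inversions, and translocations that require investigation; some may reflect genuine differences from the reference rather than assembly errors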
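To make the reference-free metrics concrete, here is a minimal sketch that computes contig count, total length, N50, and GC content from a list of contig sequences. It is an illustration only, not QUAST's implementation; the function name `assembly_stats` and the toy contigs are assumptions made for this example, and ambiguous bases (N) are counted in the length, which a real tool may handle differently.

```python
def assembly_stats(contigs):
    """Return contig count, total length, N50, and GC percent for a list of sequences."""
    lengths = sorted((len(seq) for seq in contigs), reverse=True)
    total = sum(lengths)

    # N50: walk contigs from longest to shortest until at least half
    # of the total assembly length has been accumulated.
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in contigs)
    gc_percent = 100.0 * gc / total if total else 0.0
    return {
        "contigs": len(lengths),
        "total_length": total,
        "n50": n50,
        "gc_percent": gc_percent,
    }


# Hypothetical toy contigs with lengths 10, 6, and 4 (total 20).
toy = ["GGCCGGCCAT", "ATATGC", "ATGC"]
stats = assembly_stats(toy)
print(stats)
```

Note that the longest contig alone already covers half of the toy assembly, so N50 is 10 even though the median contig length is 6. This is why N50 should not be read as a median.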
Coverage Analysis: Detecting Problems
Reading the Coverage Landscape
Coverage depth measures how many reads align to each genomic position. This metric reveals assembly quality and potential problems that other statistics miss.
Coverage patterns reveal:
Normal (20-50x): Confident, well-supported bases
Low (<5x): Suspect regions requiring verification
Zero coverage: Major red flag indicating gaps or misassemblies
High (>2x average): Possible collapsed repeats or duplications
Sudden coverage drops often indicate misassemblies where sequences were incorrectly joined.
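The coverage categories above can be sketched as a small classifier over per-position depths. This is a rough illustration under stated assumptions: the function names, the `ratio=0.25` cutoff used to call a "sudden drop", and the toy depth values are all hypothetical, and real depth profiles (for example from a depth-reporting tool) are usually smoothed over windows rather than inspected base by base.

```python
def classify_depth(depths, low=5, high_factor=2.0):
    """Label each position as zero, low, normal, or high coverage."""
    mean = sum(depths) / len(depths) if depths else 0.0
    labels = []
    for d in depths:
        if d == 0:
            labels.append("zero")      # possible gap or misassembly
        elif d < low:
            labels.append("low")       # suspect region
        elif d > high_factor * mean:
            labels.append("high")      # possible collapsed repeat
        else:
            labels.append("normal")
    return labels


def sudden_drops(depths, ratio=0.25):
    """Return positions where depth falls below `ratio` times the previous depth."""
    return [
        i
        for i in range(1, len(depths))
        if depths[i - 1] > 0 and depths[i] < ratio * depths[i - 1]
    ]


# Hypothetical per-position depths along one contig.
depths = [30, 32, 31, 3, 0, 29, 90, 30]
labels = classify_depth(depths)
drops = sudden_drops(depths)
print(labels)
print(drops)
```

Positions flagged by `sudden_drops` are candidates for manual inspection, not proof of a misassembly.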
Read Mapping: Validation & Analysis
Mapping your original reads back to the assembly validates that the assembler faithfully reconstructed your data. This critical QC step calculates essential metrics that reveal assembly quality.
Reads Mapped: 90% (expected for good assemblies)
Properly Paired: 85% (minimum acceptable threshold)
Coverage Gaps: 0 (ideal for complete assemblies)
Poor mapping rates indicate the assembly doesn't accurately represent your sequencing data, suggesting problems with assembly parameters or data quality.
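As a sketch of how these three checks fit together, the snippet below turns raw read counts into percentages and compares them with the thresholds quoted above. The function name `mapping_report` and the example counts are hypothetical, and the sketch assumes every read is part of a pair, so the properly paired percentage is taken over all reads.

```python
def mapping_report(total_reads, mapped_reads, properly_paired, zero_depth_positions):
    """Compare read-mapping metrics against the thresholds given in the text."""
    mapped_pct = 100.0 * mapped_reads / total_reads
    paired_pct = 100.0 * properly_paired / total_reads
    return {
        "mapped_pct": mapped_pct,
        "paired_pct": paired_pct,
        "mapped_ok": mapped_pct >= 90.0,         # expected for good assemblies
        "paired_ok": paired_pct >= 85.0,         # minimum acceptable threshold
        "gaps_ok": zero_depth_positions == 0,    # ideal for complete assemblies
    }


# Hypothetical counts, e.g. as summarized from an alignment statistics report.
report = mapping_report(
    total_reads=1_000_000,
    mapped_reads=950_000,
    properly_paired=880_000,
    zero_depth_positions=0,
)
print(report)
```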
RagTag: Reference-Guided Scaffolding
Leveraging Reference Genomes
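RagTag improves draft assemblies by using a reference genome as a guide. It aligns your contigs to the reference, then orders and orients them to match the reference structure. This can substantially improve contiguity. Scaffolding does not alter the contig sequences themselves; misassembly correction is a separate, optional step. Because the layout follows the reference, genuine structural differences between your organism and the reference can be masked, so choose a closely related reference.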
RagTag's workflow:
Align contigs to reference genome
Determine correct contig order
Orient contigs (forward or reverse)
Join nearby contigs with estimated gap sizes
Optionally correct misassemblies
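The ordering, orienting, and joining steps above can be illustrated with a toy sketch. This is not RagTag's algorithm: it assumes the alignment step has already produced one placement per contig (name, reference start, strand), and the names `scaffold`, `placements`, and the fixed `gap_size` are hypothetical choices for this example.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def scaffold(contigs, placements, gap_size=100):
    """Join contigs into one scaffold.

    contigs: dict mapping contig name to sequence.
    placements: list of (name, reference_start, strand) tuples.
    """
    # Order contigs by where they align on the reference.
    ordered = sorted(placements, key=lambda p: p[1])
    pieces = []
    for name, _start, strand in ordered:
        seq = contigs[name]
        # Orient: flip contigs that align to the reverse strand.
        pieces.append(seq if strand == "+" else reverse_complement(seq))
    # Join: separate adjacent contigs with a run of Ns marking a gap.
    return ("N" * gap_size).join(pieces)


# Hypothetical contigs and placements.
contigs = {"c1": "AAAC", "c2": "GGTT"}
placements = [("c2", 500, "-"), ("c1", 10, "+")]
result = scaffold(contigs, placements, gap_size=5)
print(result)
```

Here `c1` is placed first because it aligns earlier on the reference, and `c2` is reverse-complemented before being appended after the gap.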
Contigs vs. Scaffolds
Contigs
Definition: Continuous sequences without any gaps, assembled purely from overlapping reads
Source: Direct output from assembly graph traversal
Characteristics: High confidence but often shorter due to fragmentation at repeat boundaries
Limitation: Broken wherever assembler cannot confidently resolve structure
Scaffolds
Definition: Connected contigs with estimated gaps (represented as Ns) between them
Source: Built using paired-end read information or reference genome guidance
Characteristics: More contiguous representation, longer sequences
Advantage: Maintains relative order and orientation while acknowledging uncertainty
Think of contigs as the solid foundation blocks and scaffolds as the complete structure with some estimated connecting pieces.
RagTag Outputs
ragtag.scaffold.fasta (older RagTag releases write ragtag.scaffolds.fasta)
The main output file containing your improved assembly. Contigs are now ordered, oriented, and connected based on the reference genome structure. This is the file you'll use for downstream annotation and analysis.
ragtag.scaffold.agp
Assembly Golden Path (AGP) format file specifying the exact structure of each scaffold. Documents which contigs were used, their orientations, and gap locations. Essential for tracking assembly provenance.
ragtag.scaffold.stats
Summary statistics comparing input contigs to output scaffolds. Shows number of contigs successfully placed vs. unplaced, plus N50 improvements achieved through scaffolding.
Expected improvements: Dramatically fewer scaffolds compared to input contigs, with substantially longer N50 values indicating increased contiguity.
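Since an AGP file is plain tab-separated text, its contents are easy to inspect programmatically. The sketch below summarizes placed contigs and gap bases per scaffold, assuming the standard AGP column layout (object, start, end, part number, component type, then component or gap fields). The scaffold and contig names in the example rows are made up for illustration and are not taken from any real RagTag run.

```python
def summarize_agp(agp_text):
    """Collect placed components and total gap bases for each scaffold."""
    summary = {}
    for line in agp_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        scaffold_name, component_type = fields[0], fields[4]
        entry = summary.setdefault(scaffold_name, {"components": [], "gap_bases": 0})
        if component_type in ("N", "U"):
            # Gap line: column 6 holds the gap length.
            entry["gap_bases"] += int(fields[5])
        else:
            # Sequence line: column 6 is the contig ID, column 9 its orientation.
            entry["components"].append((fields[5], fields[8]))
    return summary


# Hypothetical AGP rows describing one scaffold built from two contigs.
rows = [
    ["chr1_scaffold", "1", "4000", "1", "W", "contig_7", "1", "4000", "+"],
    ["chr1_scaffold", "4001", "4100", "2", "U", "100", "scaffold", "yes", "align_genus"],
    ["chr1_scaffold", "4101", "6600", "3", "W", "contig_2", "1", "2500", "-"],
]
agp_text = "\n".join("\t".join(row) for row in rows)
summary = summarize_agp(agp_text)
print(summary)
```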
BUSCO: Completeness Check
Benchmarking Universal Single-Copy Orthologs
BUSCO assesses assembly completeness by searching for genes that are expected to be present as single copies in nearly all members of a taxonomic group. These evolutionarily conserved genes serve as molecular markers of assembly quality.
The tool compares your assembly against databases of orthologous genes specific to your organism's lineage (e.g., bacteria, fungi, vertebrates). Missing genes suggest an incomplete assembly, while duplicated genes hint at redundant contigs, uncollapsed haplotypes, or contamination. Collapsed repeats tend to have the opposite effect, reducing copy number.
Score interpretation:
Complete (C): >90% = excellent assembly
Fragmented (F): <5% = acceptable
Missing (M): <5% = good coverage
Assembly Quality Interpretation
Excellent Assembly
N50 > 50kb
BUSCO > 90%
Ready for publication
Moderate Assembly
N50 10-50kb
BUSCO 80-90%
Usable but improvable
Poor Assembly
N50 < 10kb
BUSCO < 80%
Requires improvement
If your assembly falls into the "poor" category, consider these improvement strategies: increase sequencing coverage, use longer read technologies, try different assembly parameters, or employ hybrid assembly approaches combining multiple data types.
For moderate assemblies, reference-guided scaffolding with RagTag often provides substantial improvements without additional sequencing.
BUSCO Score Breakdown
Understanding BUSCO Categories
Complete (C): Gene found at full length; this is the sum of the single-copy and duplicated counts below
Single-copy (S): One copy, as expected; this is normal
Duplicated (D): Multiple copies detected; investigate this carefully
Fragmented (F): Partial gene sequence found, suggesting the gene was split across a gap or contig boundary
Missing (M): Expected gene completely absent from assembly
High duplication counts (D>5%) warrant investigation for contamination or assembly artifacts.
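The score thresholds discussed in this section can be checked automatically. The following sketch parses a one-line BUSCO score string and reports which guideline values are exceeded. The pattern assumes the common `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` notation; the exact layout may differ between BUSCO versions, and the example line and function names are hypothetical.

```python
import re

# Assumed layout of the BUSCO one-line score string.
BUSCO_LINE = re.compile(
    r"C:([\d.]+)%\[S:([\d.]+)%,D:([\d.]+)%\],F:([\d.]+)%,M:([\d.]+)%,n:(\d+)"
)


def parse_busco(summary_line):
    """Extract BUSCO percentages and the number of searched genes."""
    match = BUSCO_LINE.search(summary_line)
    if match is None:
        raise ValueError("no BUSCO score string found")
    c, s, d, f, m = (float(x) for x in match.groups()[:5])
    return {"C": c, "S": s, "D": d, "F": f, "M": m, "n": int(match.group(6))}


def busco_flags(scores):
    """List the guideline thresholds from the text that the scores fail."""
    flags = []
    if scores["C"] <= 90.0:
        flags.append("completeness at or below 90%")
    if scores["F"] >= 5.0:
        flags.append("fragmented at or above 5%")
    if scores["M"] >= 5.0:
        flags.append("missing at or above 5%")
    if scores["D"] > 5.0:
        flags.append("duplication above 5%")
    return flags


# Hypothetical score line.
scores = parse_busco("C:96.8%[S:90.3%,D:6.5%],F:1.6%,M:1.6%,n:124")
flags = busco_flags(scores)
print(scores)
print(flags)
```

In this made-up example the assembly is highly complete, but the duplication rate crosses the 5% guideline and would be worth a closer look.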
Assembly Quality Improvement Cycle
Assembly improvement is an iterative process. Each tool provides different insights that inform your next steps. QUAST reveals structural issues and contiguity problems. RagTag addresses fragmentation through reference-guided scaffolding. BUSCO validates biological completeness, and an elevated duplication rate can hint at contamination or redundancy that needs follow-up with dedicated tools.
Cycle through these tools, applying fixes and re-evaluating until your assembly meets publication standards or you've exhausted improvement options. Document each iteration's parameters and results for reproducibility.
Typical iteration count: Most assemblies require 2-3 improvement cycles to reach optimal quality.
Assembly Quality Verdict
Ready for Publication
N50 > 50kb
>90% reads mapped
BUSCO > 90%
Acceptable Quality
N50 20-50kb
80-90% reads mapped
BUSCO 80-90%
Needs Improvement
N50 < 20kb
<80% reads mapped
BUSCO < 80%
Use these thresholds as guidelines, not absolute rules. Context matters — some highly repetitive genomes may never achieve "excellent" metrics but remain scientifically valuable. Document limitations clearly in your methods.
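The verdict thresholds can be expressed as a small helper. One assumption needs stating loudly: the text does not say how to combine the three metrics, so this sketch reports the worst of the three tiers. The function names and example values are hypothetical, and the output is a guideline in the same spirit as the thresholds themselves.

```python
LABELS = ["Ready for Publication", "Acceptable Quality", "Needs Improvement"]


def tier(value, good, acceptable):
    """Return 0 (best), 1, or 2 (worst) for a metric where higher is better."""
    if value > good:
        return 0
    if value >= acceptable:
        return 1
    return 2


def verdict(n50_bp, mapped_pct, busco_complete_pct):
    """Combine the three metrics; ASSUMPTION: the weakest metric decides."""
    tiers = [
        tier(n50_bp, 50_000, 20_000),          # N50: >50kb, 20-50kb, <20kb
        tier(mapped_pct, 90.0, 80.0),          # reads mapped: >90%, 80-90%, <80%
        tier(busco_complete_pct, 90.0, 80.0),  # BUSCO: >90%, 80-90%, <80%
    ]
    return LABELS[max(tiers)]


# Hypothetical assemblies.
print(verdict(120_000, 96.0, 97.5))
print(verdict(120_000, 85.0, 97.5))
print(verdict(15_000, 96.0, 97.5))
```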
Congratulations!
You Now Have Essential NGS Analysis Skills
01. Quality Control Mastery: You can rigorously assess sequencing data quality and make informed trimming decisions
02. Assembly Expertise: You understand de novo assembly algorithms and can effectively use SPAdes
03. Quality Assessment: You can evaluate assembly quality using QUAST, coverage analysis, and BUSCO
04. Assembly Improvement: You know how to enhance assemblies through scaffolding and error detection
05. Validation Skills: You can validate assemblies and detect contamination systematically
Next steps: Apply these skills to your own research projects, continue practicing with diverse datasets, and stay current with evolving bioinformatics tools and best practices.