Benchmarking hybracter
on Dorado v0.5.0 Data
Background and Method
In December 2023, Ryan Wick updated his ongoing ONT-only accuracy analysis on his blog (you should read his blog if you are reading this and haven't yet!)
With the latest updates to Oxford Nanopore's basecaller Dorado, he found that the reads were were consistently Q20+ (median Q=20.5) and that when using Trycycler, there were significant improvements over earlier versions of Dorado thanks to new basecaller models - specifically 'dna_r10.4.1_e8.2_400bps_sup@v4.3.0'.
He found that — at least with Trycycler assemblies — we're pretty close to being able to generate perfect ONT-only assemblies - he found a total of 37 differences between the 'ground truth' and Trycycler in 9 isolates tested. (As an aside here, the closer we get to perfection, the more the ground-truths themselves come into question in my opinion - unless you use exhaustively generate the same assemblies using multiple sequencing modalities like this paper ).
I asked Ryan to run hybracter
on these 9 isolates with --no_medaka
(as he had previously found that Medaka polishing makes the assemblies worse with the newest ONT reads here).
As the legend he is, he agreed kindly.
He first tried to run hybracter
on the entire raw read sets, but he found they failed the Flye step of Hybracter as there were too many long reads (~500x depth).
Accordingly he decided to run hybracter
twice, with two different subsampling methods (both to 100x of the estimated genome size):
- Subsampling with Filtlong using '
--target_bases
. According to how Filtlong works, this will prefer both longer and higher-quality reads. - Subsampling randomly using Rasusa.
He then compared the results to the reference using the same methodology as he did for the Trycycler assemblies outlined in his blog.
The full hybracter
config parameters were:
contaminants: none
databases: null
dnaapler_custom_db: none
flyeModel: --nano-hq
input: hybracter.csv
log: hybracter/hybracter.log
logic: best
medakaModel: r1041_e82_400bps_sup_v4.2.0
min_length: 1000
min_quality: 9
no_medaka: true
output: hybracter
single: false
skip_qc: false
qc:
Results
I wanted to look at the results from 2 angles:
- The number of errors as calculated by Ryan.
- The number of extra contigs - from some benchmarking of Plassember, these usually indicate some impurities in the read sets of hybrid sequencing data. For long only, the same logic should apply - these would be areas of the chromosome that have low quality or other non-reference reads in the assemblies.
Errors
The results of the errors are as follows:
Genome | Trycycler errors | Hybracter (Filtlong) errors | Hybracter (Rasusa) errors |
---|---|---|---|
Campylobacter jejuni | 5 | 15 | 16 |
Campylobacter lari | 18 | 27 | 40 |
Escherichia coli | 1 | 2 | 1328 (see note below) |
Listeria ivanovii | 5 | 7 | 475 |
Listeria monocytogenes | 0 | 2 | 3 |
Listeria welshimeri | 1 | 2 | 1 |
Salmonella enterica | 3 | 15 | 18 |
Vibrio cholerae | 2 | 31535 (see note below) | 28205 (see note below) |
Vibrio parahaemolyticus | 2 | 12 | 7 |
For some context of the V. cholerae and E. coli results in Ryan's words
Importantly, the larger chromosome in V. cholerae and the larger plasmid in E. coli had some structural heterogeneity, so Hybracter's high error counts for those genomes aren't really a problem. For the Trycycler assemblies, I tried to use the majority variant, and Hybracter clearly settled on the other variant. I don't have an explanation for the glitch in the L. ivanovii assembly (with Rasusa) - I think that may just be a Flye mistake.
Overall, excluding Vibrio cholerae, the 8 Trycyler assemblies had total errors of 35, while Hybracter with Filtlong had 82.
Extra Contigs
Note: Ryan had pre-filtered the read sets to throw away all reads <10kp. Therefore, unsurprisingly, the 3 plasmids under 10kbp in E E. coli and 1 in S. enterica were not assembled so they are ignored.
Sample | Subsampling Method | Plasmids (Over 10kbp) in Reference | Number of Contigs | Plasmid Contigs | Extra Non-Plasmid Contigs |
---|---|---|---|---|---|
Campylobacter jejuni | Filtlong | 0 | 1 | 0 | 0 |
Campylobacter jejuni | Rasusa | 0 | 3 | 0 | 2 |
Campylobacter lari | Filtlong | 0 | 1 | 0 | 0 |
Campylobacter lari | Rasusa | 0 | 1 | 0 | 0 |
Escherichia coli | Filtlong | 2 | 3 | 2 | 0 |
Escherichia coli | Rasusa | 2 | 3 | 2 | 0 |
Listeria ivanovii | Filtlong | 0 | 1 | 0 | 0 |
Listeria ivanovii | Rasusa | 0 | 6 | 0 | 5 |
Listeria monocytogenes | Filtlong | 0 | 1 | 0 | 0 |
Listeria monocytogenes | Rasusa | 0 | 4 | 0 | 3 |
Listeria welshimeri | Filtlong | 0 | 1 | 0 | 0 |
Listeria welshimeri | Rasusa | 0 | 5 | 0 | 4 |
Salmonella enterica | Filtlong | 1 | 2 | 1 | 0 |
Salmonella enterica | Rasusa | 1 | 2 | 1 | 0 |
Vibrio cholerae | Filtlong | 0 | 2 | 0 | 0 |
Vibrio cholerae | Rasusa | 0 | 13 | 0 | 11 |
Vibrio parahaemolyticus | Filtlong | 0 | 16 | 0 | 14 |
Vibrio parahaemolyticus | Rasusa | 0 | 2 | 0 | 0 |
Overall, 5 Rasusa assemblies had extra non-plasmid contigs, while only 1 Filtlong one did (Vibrio parahaemolyticus).
Conclusions
- Bacterial genome assembly is hard! While I think (along with the other benchmarking) that Hybracter is the best way of currently doing automated long- and hybrid-read assemblies, it still doesn't always get it right. Structural variation and low-quality reads that sneak into to the final read sets can cause some grief.
- Nonetheless, most of the time, Hybracter is pretty good even on long-only read sets! 82 errors in the 8 genomes without structural variation is getting pretty close to perfect even with long reads only.
- Subsampling deeply sequenced (500x) read sets with Filtlong outperforms Rasusa - both in terms of errors and also it was far less likely to assemble extra non-plasmid contigs.
Based on these results (and after getting requests for subsampling functionality), the --subsample_depth
parameter has been added to Hybracter in v0.5.0. It defaults to 100. That value multiplied by the estimated chromosome size (-c
) will be passed as --target_bases
to Filtlong during Hybracter's QC phase.
As a final word of warning: if you are really interested in small plasmid recovery and/or small plasmid copy number accuracy, you may want to disable subsampling, because --target_bases
is not random - it prioritises the longest & highest quality reads (which may result in your small plasmid reads being dropped).
You can turn off subsampling by specifying a very large number for --subsample_depth
e.g. --subsample_depth 100000
.