Benchmarking `hybracter` on Dorado v0.5.0 Data

Background and Method

In December 2023, Ryan Wick updated his ongoing ONT-only accuracy analysis on his blog (you should read his blog if you are reading this and haven't yet!)

With the latest updates to Oxford Nanopore's basecaller Dorado, he found that the reads were were consistently Q20+ (median Q=20.5) and that when using Trycycler, there were significant improvements over earlier versions of Dorado thanks to new basecaller models - specifically 'dna_r10.4.1_e8.2_400bps_sup@v4.3.0'.

He found that — at least with Trycycler assemblies — we're pretty close to being able to generate perfect ONT-only assemblies - he found a total of 37 differences between the 'ground truth' and Trycycler in 9 isolates tested. (As an aside here, the closer we get to perfection, the more the ground-truths themselves come into question in my opinion - unless you use exhaustively generate the same assemblies using multiple sequencing modalities like this paper ).

I asked Ryan to run hybracter on these 9 isolates with --no_medaka (as he had previously found that Medaka polishing makes the assemblies worse with the newest ONT reads here).

As the legend he is, he agreed kindly.

He first tried to run hybracter on the entire raw read sets, but he found they failed the Flye step of Hybracter as there were too many long reads (~500x depth).

Accordingly he decided to run hybracter twice, with two different subsampling methods (both to 100x of the estimated genome size):

Subsampling with Filtlong using '--target_bases. According to how Filtlong works, this will prefer both longer and higher-quality reads.
Subsampling randomly using Rasusa.

He then compared the results to the reference using the same methodology as he did for the Trycycler assemblies outlined in his blog.

The full hybracter config parameters were:

  contaminants: none
  databases: null
  dnaapler_custom_db: none
  flyeModel: --nano-hq
  input: hybracter.csv
  log: hybracter/hybracter.log
  logic: best
  medakaModel: r1041_e82_400bps_sup_v4.2.0
  min_length: 1000
  min_quality: 9
  no_medaka: true
  output: hybracter
  single: false
  skip_qc: false
qc:

Results

I wanted to look at the results from 2 angles:

The number of errors as calculated by Ryan.
The number of extra contigs - from some benchmarking of Plassember, these usually indicate some impurities in the read sets of hybrid sequencing data. For long only, the same logic should apply - these would be areas of the chromosome that have low quality or other non-reference reads in the assemblies.

Errors

The results of the errors are as follows:

Genome	Trycycler errors	Hybracter (Filtlong) errors	Hybracter (Rasusa) errors
Campylobacter jejuni	5	15	16
Campylobacter lari	18	27	40
Escherichia coli	1	2	1328 (see note below)
Listeria ivanovii	5	7	475
Listeria monocytogenes	0	2	3
Listeria welshimeri	1	2	1
Salmonella enterica	3	15	18
Vibrio cholerae	2	31535 (see note below)	28205 (see note below)
Vibrio parahaemolyticus	2	12	7

For some context of the V. cholerae and E. coli results in Ryan's words

Importantly, the larger chromosome in V. cholerae and the larger plasmid in E. coli had some structural heterogeneity, so Hybracter's high error counts for those genomes aren't really a problem. For the Trycycler assemblies, I tried to use the majority variant, and Hybracter clearly settled on the other variant. I don't have an explanation for the glitch in the L. ivanovii assembly (with Rasusa) - I think that may just be a Flye mistake.

Overall, excluding Vibrio cholerae, the 8 Trycyler assemblies had total errors of 35, while Hybracter with Filtlong had 82.

Extra Contigs

Note: Ryan had pre-filtered the read sets to throw away all reads <10kp. Therefore, unsurprisingly, the 3 plasmids under 10kbp in E E. coli and 1 in S. enterica were not assembled so they are ignored.

Sample	Subsampling Method	Plasmids (Over 10kbp) in Reference	Number of Contigs	Plasmid Contigs	Extra Non-Plasmid Contigs
Campylobacter jejuni	Filtlong	0	1	0	0
Campylobacter jejuni	Rasusa	0	3	0	2
Campylobacter lari	Filtlong	0	1	0	0
Campylobacter lari	Rasusa	0	1	0	0
Escherichia coli	Filtlong	2	3	2	0
Escherichia coli	Rasusa	2	3	2	0
Listeria ivanovii	Filtlong	0	1	0	0
Listeria ivanovii	Rasusa	0	6	0	5
Listeria monocytogenes	Filtlong	0	1	0	0
Listeria monocytogenes	Rasusa	0	4	0	3
Listeria welshimeri	Filtlong	0	1	0	0
Listeria welshimeri	Rasusa	0	5	0	4
Salmonella enterica	Filtlong	1	2	1	0
Salmonella enterica	Rasusa	1	2	1	0
Vibrio cholerae	Filtlong	0	2	0	0
Vibrio cholerae	Rasusa	0	13	0	11
Vibrio parahaemolyticus	Filtlong	0	16	0	14
Vibrio parahaemolyticus	Rasusa	0	2	0	0

Overall, 5 Rasusa assemblies had extra non-plasmid contigs, while only 1 Filtlong one did (Vibrio parahaemolyticus).

Conclusions

Bacterial genome assembly is hard! While I think (along with the other benchmarking) that Hybracter is the best way of currently doing automated long- and hybrid-read assemblies, it still doesn't always get it right. Structural variation and low-quality reads that sneak into to the final read sets can cause some grief.
Nonetheless, most of the time, Hybracter is pretty good even on long-only read sets! 82 errors in the 8 genomes without structural variation is getting pretty close to perfect even with long reads only.
Subsampling deeply sequenced (500x) read sets with Filtlong outperforms Rasusa - both in terms of errors and also it was far less likely to assemble extra non-plasmid contigs.

Based on these results (and after getting requests for subsampling functionality), the --subsample_depth parameter has been added to Hybracter in v0.5.0. It defaults to 100. That value multiplied by the estimated chromosome size (-c) will be passed as --target_bases to Filtlong during Hybracter's QC phase.

As a final word of warning: if you are really interested in small plasmid recovery and/or small plasmid copy number accuracy, you may want to disable subsampling, because --target_bases is not random - it prioritises the longest & highest quality reads (which may result in your small plasmid reads being dropped).

You can turn off subsampling by specifying a very large number for --subsample_depth e.g. --subsample_depth 100000.