Extraction of 16S-ITS-23S sequences: All GenBank and RefSeq genomes of Archaea and Bacteria from GTDB v214 [1] were downloaded using genome_updater v0.6.2. A total of 402,695 assemblies were downloaded, excluding 4 assemblies with missing fna files and 77,032 assemblies with missing annotation files. These incomplete genomes were ignored, leaving 402,691 assemblies with sequences. The annotation file (GFF) of each assembly was parsed to identify the 16S and 23S rRNA genes. Identification was performed by applying regex patterns to the product or gene attributes of the rRNA annotations. Pairs of 16S and 23S genes were generated under the following conditions: (i) both genes must be on the same strand, (ii) the extracted region must be between 3000 and 7000 nucleotides in length, and (iii) the region must begin with the 16S gene and end with the 23S gene. Genomic regions meeting these criteria were extracted. A total of 358,166 16S-ITS-23S regions were found in 142,377 out of the 402,691 assemblies.
Preprocessing of the 16S-ITS-23S sequences: To remove redundant information from the 358,166 16S-ITS-23S regions, identical sequences with exactly the same taxonomy were dereplicated, resulting in a total of 199,690 unique sequences. To identify and remove potential eukaryotic contamination, the sequences were blasted against a S. cerevisiae 35S sequence. Sequences with an identity of at least 70% and a query coverage greater than 40% were removed (i.e. 283 sequences).
Suspicious sequences identification: Suspicious sequences are sequences that show notable dissimilarities to other sequences from their own species, while displaying significant similarity to sequences from taxonomically distant species (the threshold used is the family rank) (see Additional file 3: Figure S2).
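As an illustration of pairing conditions (i) to (iii) of the extraction step, the following minimal Python sketch parses rRNA features from GFF lines and returns the candidate regions. It is not the pipeline's actual code. The regex patterns, the function names (parse_rrna, pair_regions) and the requirement that the two genes do not overlap are assumptions made for the example, as is the reading that on the minus strand the 16S gene has the higher coordinates.

```python
import re
from typing import NamedTuple

# Hypothetical patterns: the source does not list the regexes it used.
RE_16S = re.compile(r"\b16S\b", re.IGNORECASE)
RE_23S = re.compile(r"\b23S\b", re.IGNORECASE)


class Gene(NamedTuple):
    contig: str
    start: int   # 1-based, inclusive (GFF convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"
    kind: str    # "16S" or "23S"


def parse_rrna(gff_lines):
    """Collect 16S and 23S rRNA features from GFF3 lines."""
    genes = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "rRNA":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        # Match against the product and gene attributes, as in the text.
        label = attrs.get("product", "") + " " + attrs.get("gene", "")
        if RE_16S.search(label):
            kind = "16S"
        elif RE_23S.search(label):
            kind = "23S"
        else:
            continue
        genes.append(Gene(cols[0], int(cols[3]), int(cols[4]), cols[6], kind))
    return genes


def pair_regions(genes, min_len=3000, max_len=7000):
    """Return (contig, start, end, strand) for each valid 16S/23S pair."""
    genes_16s = [g for g in genes if g.kind == "16S"]
    genes_23s = [g for g in genes if g.kind == "23S"]
    regions = []
    for g16 in genes_16s:
        for g23 in genes_23s:
            # Condition (i): same contig and same strand.
            if g16.contig != g23.contig or g16.strand != g23.strand:
                continue
            # Condition (iii): the region starts with 16S and ends with 23S
            # when read along the coding strand.
            if g16.strand == "+":
                ordered = g16.end < g23.start
                start, end = g16.start, g23.end
            else:
                ordered = g23.end < g16.start
                start, end = g23.start, g16.end
            # Condition (ii): region length between 3000 and 7000 nt.
            length = end - start + 1
            if ordered and min_len <= length <= max_len:
                regions.append((g16.contig, start, end, g16.strand))
    return regions
```

The sketch compares every 16S gene with every 23S gene on the same contig, so the length window is what excludes pairs taken from different operons. Whether the original pipeline additionally requires the two genes to be adjacent is not stated in the text.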
To identify suspicious sequences, an intra-species clustering and a global clustering were performed: (i) for every species with at least 4 16S-ITS-23S sequences, sequences were clustered at 95% identity, and singleton sequences that did not cluster with any other sequence from their species were identified; (ii) all sequences were clustered at 99% identity, and taxonomically heterogeneous clusters, containing sequences from at least two different families, were identified. We used VSEARCH v2.22.1 [2] with --iddef 3, which treats each gap as a single mismatch, regardless of the length of the gap. This definition is necessary because mismatches between the 16S and 23S rRNA genes of genomes of the same species are expected and should not be penalized. Thus, sequences are identified as suspicious if they are singletons within their species and are found in a heterogeneous cluster, meaning that they are far from their own species while being close to another species. A total of 258 sequences were identified as suspicious. To distinguish these sequences, the term "suspicious" was added to their taxonomy, enabling easy filtering and identification in subsequent analyses.
1. Parks DH, Chuvochina M, Rinke C, Mussig AJ, Chaumeil PA, Hugenholtz P. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 2022;50(D1):D785-D94.
2. Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: a versatile open source tool for metagenomics. PeerJ. 2016;4:e2584.
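As an illustration of the rule used to flag suspicious sequences, the following minimal Python sketch combines the results of the two clusterings. It is not the pipeline's actual code. It assumes that cluster assignments have already been parsed from the clustering output into plain dictionaries; the function name flag_suspicious and all parameter names are hypothetical.

```python
from collections import Counter, defaultdict


def flag_suspicious(species_of, family_of, intra_cluster, global_cluster,
                    min_seqs=4):
    """Return the set of sequence IDs flagged as suspicious.

    species_of / family_of: sequence ID -> species / family name.
    intra_cluster: sequence ID -> cluster ID from the per-species
        clustering at 95% identity (IDs are only unique within a species).
    global_cluster: sequence ID -> cluster ID from the global
        clustering at 99% identity.
    """
    # (i) Singletons within species that have at least min_seqs sequences.
    species_size = Counter(species_of.values())
    intra_size = Counter(
        (species_of[seq], cid) for seq, cid in intra_cluster.items()
    )
    singletons = {
        seq
        for seq, cid in intra_cluster.items()
        if species_size[species_of[seq]] >= min_seqs
        and intra_size[(species_of[seq], cid)] == 1
    }

    # (ii) Members of global clusters spanning at least two families.
    families_in = defaultdict(set)
    for seq, cid in global_cluster.items():
        families_in[cid].add(family_of[seq])
    heterogeneous = {
        seq for seq, cid in global_cluster.items()
        if len(families_in[cid]) >= 2
    }

    # Suspicious: far from its own species and close to another family.
    return singletons & heterogeneous
```

Taking the intersection of the two sets mirrors the text: a sequence must satisfy both criteria, so a singleton that sits in a taxonomically homogeneous global cluster is not flagged.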