Pathogen Identification from Clinical Metagenomics Data
Author
Devon Allies
Abstract
This analysis focuses on identifying a potential pathogen from clinical metagenomics data. The workflow includes quality control of sequencing reads, host depletion, de novo assembly, and classification using BLAST. The results highlight the presence of clinically relevant pathogens in the sample.
Aim:
The objective of this project was to develop and execute a bioinformatics workflow to identify potential pathogens from clinical metagenomics data. The workflow encompassed several key steps, including quality control of raw sequencing reads, host depletion to remove human DNA, de novo assembly of unmapped reads, and classification using BLAST. The ultimate goal was to determine the presence of clinically relevant pathogens in the sample. By performing host depletation and classification, I was able to demonstrate the ability to isolate the microbial signal from a Bronchoalveolar lavage (BAL) sample.
Data Acquisition:
The raw sequencing data for this project was accessed from the NCBI SRA (BioProject: PRJNA1327646), originating from a published study on COPD microbiomes (Chen et al., 2025). The authors of the study specified that the data deposited in the SRA had already undergone in silico human host DNA removal prior to submission.
To ensure data integrity and quantify any residual human reads, a Quality Control (QC) step was performed. Raw reads were aligned against the human reference genome (hg38) using BWA (Li and Durbin, 2009).
Quality Control of Samples:
Fastq Statistics:
The seqkit stats(Shen, Sipos and Zhao, 2024) output describes the raw sequencing data before alignment. The command is used to generate a tabular statistical summary of FASTA/Q files. By default, it automatically detects the sequence type of the files.
------------------------------
Processing sample: SRR35359600
-------------------------------
2627660 + 0 in total (QC-passed reads + QC-failed reads)
2493294 + 0 primary
0 + 0 secondary
134366 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1637803 + 0 mapped (62.33% : N/A)
1503437 + 0 primary mapped (60.30% : N/A)
2493294 + 0 paired in sequencing
1246647 + 0 read1
1246647 + 0 read2
1169744 + 0 properly paired (46.92% : N/A)
1496222 + 0 with itself and mate mapped
7215 + 0 singletons (0.29% : N/A)
241104 + 0 with mate mapped to a different chr
113350 + 0 with mate mapped to a different chr (mapQ>=5)
total: The total number of reads processed, separated into those passing Quality Control (QC) and those failing.
mapped: The percentage of reads that successfully aligned to the reference genome. This is critical for assessing the efficiency of the alignment process and the relevance of the reference genome.
paired in sequencing: Confirms all reads were generated from paired-end sequencing, which is important for downstream analyses that rely on paired-end data.
properly paired: Reads that aligned to the reference genome in the expected orientation and distance. A high percentage indicates good library preparation and alignment quality.
singletons: Reads that did not have a properly paired mate.
Host Depletion:
Host depletion is a critical step in viral metagenomics to enrich for viral sequences by removing host-derived reads. This process can be achieved through various methods, including computational approaches that filter out host sequences based on alignment to a reference genome or using specialized tools designed for host depletion.
Show code
# Set the paths for files and environmentenvDir="/home/devonallies/micromamba/envs/bioinfo/bin"dataDir="../../../../data"sample=$(basename"${dataDir}/sort/SRR35359600.sorted.bam" .sorted.bam)echo"------------------------------"echo"Processing sample: $sample"echo"-------------------------------"${envDir}/samtools collate -u-O${dataDir}/sort/SRR35359600.sorted.bam |\${envDir}/samtools fastq -f4-1"${dataDir}/unmapped/${sample}_R1.fq"-2"${dataDir}/unmapped/${sample}_R2.fq"-s"${dataDir}/unmapped/${sample}_singleton.fq"
Host depletion is a critical step in clinical metagenomics because clinical samples (like blood, tissue, or respirator fluid) are typically dominated by human DNA, which can constitute 90% - 99% of the sequencing data. By removing the host background, sensitivity is increased, it enables accurate De Novo Assembly, and improves downstream analysis. The singletons are placed in a singleton.fq file.
While a notable percentage indicated in Table 2, 62.33%, of the reads still aligned to the human genome, the remaining non-host reads are sufficient for downstream classification.
De Nova Assembly:
Megahit (Li et al., 2016) is a fast and memory-efficient de novo assembler for large and complex metagenomics data sets. It uses succinct de Bruijn graphs to represent the input data, which significantly reduces memory usage compared to traditional de Bruijn graph representations. Megahit is designed to handle large-scale metagenomic datasets, making it suitable for assembling genomes from environmental samples with high diversity and complexity.
Show code
# Set the paths for files and environmentenvDir="/home/devonallies/micromamba/envs/bioinfo/bin"dataDir="../../../../data"sample=$(basename"${dataDir}/unmapped/SRR35359600" _R1.fq)r1=${dataDir}/unmapped/${sample}_R1.fqr2=${dataDir}/unmapped/${sample}_R2.fqsingleton=${dataDir}/unmapped/${sample}_singleton.fqecho"------------------------------"echo"Processing sample: $sample"echo"-------------------------------"${envDir}/megahit-1$r1-2$r2-r$singleton-o"${dataDir}/megahit/${sample}"--out-prefix$sample-f
Table 4: De Nova Assembly uisng Megahits
------------------------------
Processing sample: SRR35359600
-------------------------------
2026-02-05 14:24:59 - MEGAHIT v1.2.9
2026-02-05 14:24:59 - Using megahit_core with POPCNT and BMI2 support
2026-02-05 14:24:59 - Convert reads to binary library
2026-02-05 14:25:00 - b'INFO sequence/io/sequence_lib.cpp : 75 - Lib 0 (/mnt/069ABB979ABB822B/learning/data/unmapped/SRR35359600_R1.fq,/mnt/069ABB979ABB822B/learning/data/unmapped/SRR35359600_R2.fq): pe, 982642 reads, 150 max length'
2026-02-05 14:25:00 - b'INFO sequence/io/sequence_lib.cpp : 75 - Lib 1 (/mnt/069ABB979ABB822B/learning/data/unmapped/SRR35359600_singleton.fq): se, 7215 reads, 150 max length'
2026-02-05 14:25:00 - b'INFO utils/utils.h : 152 - Real: 1.0382\tuser: 0.6381\tsys: 0.1387\tmaxrss: 85204'
2026-02-05 14:25:00 - k-max reset to: 141
2026-02-05 14:25:00 - Start assembly. Number of CPU threads 8
2026-02-05 14:25:00 - k list: 21,29,39,59,79,99,119,141
2026-02-05 14:25:00 - Memory used: 45253907251
2026-02-05 14:25:00 - Extract solid (k+1)-mers for k = 21
2026-02-05 14:25:09 - Build graph for k = 21
2026-02-05 14:25:13 - Assemble contigs from SdBG for k = 21
2026-02-05 14:25:22 - Local assembly for k = 21
2026-02-05 14:25:28 - Extract iterative edges from k = 21 to 29
2026-02-05 14:25:29 - Build graph for k = 29
2026-02-05 14:25:31 - Assemble contigs from SdBG for k = 29
2026-02-05 14:25:40 - Local assembly for k = 29
2026-02-05 14:25:46 - Extract iterative edges from k = 29 to 39
2026-02-05 14:25:47 - Build graph for k = 39
2026-02-05 14:25:49 - Assemble contigs from SdBG for k = 39
2026-02-05 14:25:59 - Local assembly for k = 39
2026-02-05 14:26:08 - Extract iterative edges from k = 39 to 59
2026-02-05 14:26:08 - Build graph for k = 59
2026-02-05 14:26:10 - Assemble contigs from SdBG for k = 59
2026-02-05 14:26:17 - Local assembly for k = 59
2026-02-05 14:26:24 - Extract iterative edges from k = 59 to 79
2026-02-05 14:26:25 - Build graph for k = 79
2026-02-05 14:26:26 - Assemble contigs from SdBG for k = 79
2026-02-05 14:26:30 - Local assembly for k = 79
2026-02-05 14:26:38 - Extract iterative edges from k = 79 to 99
2026-02-05 14:26:39 - Build graph for k = 99
2026-02-05 14:26:40 - Assemble contigs from SdBG for k = 99
2026-02-05 14:26:44 - Local assembly for k = 99
2026-02-05 14:26:51 - Extract iterative edges from k = 99 to 119
2026-02-05 14:26:52 - Build graph for k = 119
2026-02-05 14:26:53 - Assemble contigs from SdBG for k = 119
2026-02-05 14:26:56 - Local assembly for k = 119
2026-02-05 14:27:04 - Extract iterative edges from k = 119 to 141
2026-02-05 14:27:05 - Build graph for k = 141
2026-02-05 14:27:05 - Assemble contigs from SdBG for k = 141
2026-02-05 14:27:08 - Merging to output final contigs
2026-02-05 14:27:09 - 9116 contigs, total 7205375 bp, min 210 bp, max 16159 bp, avg 790 bp, N50 908 bp
2026-02-05 14:27:09 - ALL DONE. Time elapsed: 129.526783 seconds
Show code
# Set the paths for files and environmentenvDir="/home/devonallies/micromamba/envs/bioinfo/bin"dataDir="../../../../data"sample=$(basename"${dataDir}/unmapped/SRR35359600" _R1.fq)${envDir}/seqkit stats ${dataDir}/megahit/${sample}/${sample}.contigs.fa
Table 5: Statistics of Megahit Assembly
file format type num_seqs sum_len min_len avg_len max_len
../../../../data/megahit/SRR35359600/SRR35359600.contigs.fa FASTA DNA 9,116 7,205,375 210 790.4 16,159
Basic Local Alignment Search Tool:
Blast is the gold standard for comparing a query sequence (DNA, RNA, or Protein) against a massive database of known sequences to find regions of similarity
Show code
#|label: tbl-blast#|tbl-cap: "Output of blast result"envDir="/home/devonallies/micromamba/envs/bioinfo/bin"dataDir="../../../../data"sample=$(basename"${dataDir}/unmapped/SRR35359600" _R1.fq)${envDir}/makeblastdb-out${dataDir}/megahit/${sample}/${sample}-in${dataDir}/megahit/${sample}/${sample}.contigs.fa -dbtype nucl -parse_seqids${envDir}/blastdbcmd-db${dataDir}/megahit/${sample}/${sample}-entry all -outfmt"%l %a">"${dataDir}/megahit/${sample}/${sample}_blast.txt"cat${dataDir}/megahit/${sample}/${sample}_blast.txt |sort-rn|head
Building a new DB, current time: 02/08/2026 16:42:57
New DB name: /mnt/069ABB979ABB822B/learning/data/megahit/SRR35359600/SRR35359600
New DB title: ../../../../data/megahit/SRR35359600/SRR35359600.contigs.fa
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /mnt/069ABB979ABB822B/learning/data/megahit/SRR35359600/SRR35359600
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 9116 sequences in 0.215929 seconds.
16159 k141_9106
15399 k141_6088
15193 k141_7641
9747 k141_173
8971 k141_6816
8401 k141_4799
8326 k141_91
8324 k141_2364
7369 k141_8249
6914 k141_3132
The BLAST results indicate a statistically definitive match between the query sequence and the subject database. The alignment parameters - specifically the 99.409% percent identity, the 5757 basepair alignmnent length, and the E-value of 0.0 - provides evidence of high-confidence sequence homology.
Percent Identity: The high degree of nucleotide conservation suggests that the query and subject sequences are nearly identical. This level of similarity is characteristic of sequences derived from the same species or highly conserved orthologous rehions within closely related strains.
Alignment Length: The substantial length of the aligned region bolsters the validity of the match.
Expect Value: An E-value of 0.0 represents a probability that is indistinguishable from zero. It signifies that the likelyhood of observing such a high-scoring alignment by random chance within the search space is non-existent.
The verified non-host reads were assembled into contigs using megahit(Li et al., 2016) and subsequently classified via BLAST(Camacho et al., 2009). The analysis yielded a high-confidence identification based on the alignment of contig k141_9107
Running: /home/devonallies/micromamba/envs/bioinfo/bin/amrfinder -n ../../../../data/megahit/SRR35359600/SRR35359600.contigs.fa --database ../../../../data/databases/armfinder/amr_db/2026-01-21.1/ --plus
Software directory: /home/devonallies/micromamba/envs/bioinfo/bin/
Software version: 4.2.5
Database directory: /mnt/069ABB979ABB822B/learning/data/databases/armfinder/amr_db/2026-01-21.1
Database version: 2026-01-21.1
AMRFinder translated nucleotide search
- include -O ORGANISM, --organism ORGANISM option to add mutation searches and suppress common proteins
Running blastx
Making report
Protein id Contig id Start Stop Strand Element symbol Element name Scope Type Subtype Class Subclass Method Target length Reference sequence length % Coverage of reference % Identity to reference Alignment length Closest reference accession Closest reference name HMM accession HMM description
NA k141_2664 760 1617 + blaTEM-116 broad-spectrum class A beta-lactamase TEM-116 core AMR AMR BETA-LACTAM BETA-LACTAM ALLELEX 286 286 100.00 100.00 286 WP_000027050.1 broad-spectrum class A beta-lactamase TEM-116 NA NA
amrfinder took 47 seconds to complete
AMRFINDER identified the gene blaTEM-116, which is a broad-spectrum class A beta-lactamase. This enzyme breaks down beta-lactam antibiotics.
Conclusion:
The identification of Acinetobacter parvus, is a gram-negative aerobic bacterium notable for forming small colonies in cultures, with the metrics displayed in Table 6, indicates a strong result.
Reference:
Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K. and Madden, T.L., 2009. BLAST+: Architecture and applications. BMC Bioinformatics, 10(1), p.421. https://doi.org/10.1186/1471-2105-10-421.
Chen, G., Wiegand, C., Willett, A., Herr, C., Müller, R., Bals, R. and Kalinina, O.V., 2025. Comparative metagenomic analysis on COPD and health control samples reveals taxonomic and functional motifs. Frontiers in Microbiology, 16, p.1636322. https://doi.org/10.3389/fmicb.2025.1636322.
Li, D., Luo, R., Liu, C.-M., Leung, C.-M., Ting, H.-F., Sadakane, K., Yamashita, H. and Lam, T.-W., 2016. MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods, 102, pp.3–11. https://doi.org/10.1016/j.ymeth.2016.02.020.
Li, H. and Durbin, R., 2009. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics, 25(14), pp.1754–1760. https://doi.org/10.1093/bioinformatics/btp324.
Shen, W., Sipos, B. and Zhao, L., 2024. SeqKit2: A swiss army knife for sequence and alignment processing. iMeta, [online] 3(3), p.e191. https://doi.org/10.1002/imt2.191.