1 PacMAN pipeline report

Congratulations! You have finished running the PacMAN pipeline. This report summarises the main steps that were run on your sequences, together with statistics describing how each step went.

Specifically, this is the report for the project PacMAN and the pipeline run COI.


And the samples that were run in this analysis are:

1.1 Quality control

Quality control of the raw samples was done so that any issues in the final analysis can, where possible, be traced back to issues in the original data.

A visual representation of the results of the QC on the raw data can be found in an HTML report at:
results/PacMAN/samples/multiqc_COI.html
Generally, the sequences in the samples had the following composition:

| Sample | fastqc.percent duplicates | fastqc.percent gc | fastqc.avg sequence length | fastqc.median sequence length | fastqc.percent fails | fastqc.total sequences (millions) |
| --- | --- | --- | --- | --- | --- | --- |
| SUVA3_20231128_S3A \| qc \| fw | 74.2 | 38 | 300.0115 | 301 | 27.27273 | 0.07187 |
| SUVA3_20231128_S3A \| qc \| rv | 81.9 | 38 | 300.2132 | 301 | 54.54545 | 0.07187 |

1.2 Sequence trimming

As you can most likely see, there are several issues with the raw sequences. Therefore, the sequences were trimmed and cleaned up before inferring sequence variants and the taxonomic composition of the samples.

The first step of the process was to trim and quality-filter the samples using fastp (Shifu Chen 2023).

The results for each sample, in reads and as a percentage of the original reads, were:

| Sample | Read1BeforeFiltering | Read2BeforeFiltering | Read1AfterFiltering | Read2AfterFiltering |
| --- | --- | --- | --- | --- |
| SUVA3_20231128_S3A | 71870 | 71870 | 70668 | 70668 |
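As a sanity check, the retention rate implied by these counts can be computed directly; this is an illustrative sketch, not part of the pipeline:

```python
# Illustrative only: percentage of reads retained by fastp, computed from
# the before/after counts in the table above (sample SUVA3_20231128_S3A).
before = 71870  # Read1BeforeFiltering
after = 70668   # Read1AfterFiltering
retained_pct = round(after / before * 100, 2)
print(f"{retained_pct}% of reads passed filtering")  # 98.33% of reads passed filtering
```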

1.3 Cutadapt

The next step in the process ran cutadapt on all samples. All analyses were done in single-end mode: forward and reverse reads were analysed separately, so that reads that do not have a pair are not discarded at this stage.

The command(s) used were:

## [1] "This is cutadapt 3.4 with Python 3.9.20"                                                                                                                                                                                                                                                                 
## [2] "Command line parameters: -g GGWACWGGWTGAACWGTWTAYCCYCC -a TGRTTYTTYGGNCAYCCNGARGTNTA -o results/PacMAN/runs/COI/02-cutadapt/SUVA3_20231128_S3A/SUVA3_20231128_S3A_1P.fastq.gz --minimum-length 1 --rc --discard-untrimmed results/PacMAN/samples/SUVA3_20231128_S3A/fastp/SUVA3_20231128_S3A_1P.fastq.gz"
## [1] "This is cutadapt 3.4 with Python 3.9.20"                                                                                                                                                                                                                                                                 
## [2] "Command line parameters: -g TANACYTCNGGRTGNCCRAARAAYCA -a GGRGGRTAWACWGTTCAWCCWGTWCC -o results/PacMAN/runs/COI/02-cutadapt/SUVA3_20231128_S3A/SUVA3_20231128_S3A_2P.fastq.gz --minimum-length 1 --rc --discard-untrimmed results/PacMAN/samples/SUVA3_20231128_S3A/fastp/SUVA3_20231128_S3A_2P.fastq.gz"

1.3.1 Forward reads

| Sample | TotalReadsProcessed | ReadwithPrimer | ReverseComplemented | ReadsTooShort | ReadsWritten | BasePairsProcessed | BasePairsWritten |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SUVA3_20231128_S3A 1P | 70668 | 70613 | 36 | 0 | 70613 | 21197268 | 19346196 |

As a percentage of the processed reads and base pairs:

| Sample | ReadwithPrimer (%) | ReadsWritten (%) | BasePairsWritten (%) |
| --- | --- | --- | --- |
| SUVA3_20231128_S3A 1P | 99.92 | 99.92 | 91.27 |
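The percentage rows are derived from the count rows; as an illustrative sketch (not pipeline code), using the forward-read (1P) numbers above:

```python
# Illustrative only: deriving the percentage table from the count table
# for the forward reads (1P) of SUVA3_20231128_S3A.
total_reads = 70668        # TotalReadsProcessed
reads_with_primer = 70613  # ReadwithPrimer
bp_processed = 21197268    # BasePairsProcessed
bp_written = 19346196      # BasePairsWritten

print(round(reads_with_primer / total_reads * 100, 2))  # 99.92
print(round(bp_written / bp_processed * 100, 2))        # 91.27
```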

1.3.2 Reverse reads

| Sample | TotalReadsProcessed | ReadwithPrimer | ReverseComplemented | ReadsTooShort | ReadsWritten | BasePairsProcessed | BasePairsWritten |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SUVA3_20231128_S3A 2P | 70668 | 70603 | 2 | 0 | 70603 | 21195650 | 19341268 |

As a percentage of the processed reads and base pairs:

| Sample | ReadwithPrimer (%) | ReadsWritten (%) | BasePairsWritten (%) |
| --- | --- | --- | --- |
| SUVA3_20231128_S3A 2P | 99.91 | 99.91 | 91.25 |

1.3.3 Unpaired forward reads

The statistics for the unpaired forward reads were:

| Sample | TotalReadsProcessed | ReadwithPrimer | ReverseComplemented | ReadsTooShort | ReadsWritten | BasePairsProcessed | BasePairsWritten |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SUVA3_20231128_S3A 1U | 982 | 980 | 0 | 0 | 980 | 295125 | 269068 |

As a percentage of the processed reads and base pairs:

| Sample | ReadwithPrimer (%) | ReadsWritten (%) | BasePairsWritten (%) |
| --- | --- | --- | --- |
| SUVA3_20231128_S3A 1U | 99.8 | 99.8 | 91.17 |

1.3.4 Unpaired reverse reads

The statistics for the unpaired reverse reads were:

| Sample | TotalReadsProcessed | ReadwithPrimer | ReverseComplemented | ReadsTooShort | ReadsWritten | BasePairsProcessed | BasePairsWritten |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SUVA3_20231128_S3A 2U | 151 | 150 | 0 | 0 | 150 | 44450 | 40251 |

As a percentage of the processed reads and base pairs:

| Sample | ReadwithPrimer (%) | ReadsWritten (%) | BasePairsWritten (%) |
| --- | --- | --- | --- |
| SUVA3_20231128_S3A 2U | 99.34 | 99.34 | 90.55 |

Currently, the unpaired reads kept at this step are not analysed further. Keep this in mind if a large number of forward or reverse reads are unpaired.

1.4 ASV inference with dada2

Next, the reads without adapters and primers were analysed using the dada2 pipeline. This pipeline evaluates the error rates in the reads and, based on these, infers amplicon sequence variants (ASVs).

Note! At this step, only the reads that are still paired after sequence trimming are analysed. We plan to also include the analysis of unpaired reads in the future.

1.4.1 Filter and trim

The first step of the dada2 pipeline is to perform further filtering and trimming based on sequence quality.

The parameters used in this run were:

## filterAndTrim(filesForw, filtFs, filesRev, filtRs,
##   truncLen = c(200, 150),truncQ=2,
##   trimRight = 0, trimLeft = 0,
##   maxLen = Inf, minLen = 20,
##   maxN = 0, minQ = 2, maxEE = 2,
##   rm.phix = TRUE, orient.fwd = ,
##   matchIDs = FALSE, id.sep = \\s, id.field = ,
##   compress = TRUE, multithread = TRUE,
##   n = 100000, OMP = TRUE,
##   verbose = FALSE
## )
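The maxEE = 2 parameter deserves a note: dada2 discards reads whose expected number of errors, summed from the per-base Phred scores, exceeds this value. A minimal sketch of that calculation (illustrative, not dada2's code; the example read is hypothetical):

```python
# Illustrative sketch of dada2's maxEE filter: the expected number of
# errors in a read is the sum of per-base error probabilities implied by
# the Phred quality scores; reads with a sum above maxEE are discarded.
def expected_errors(quals):
    """quals: list of Phred quality scores for one read."""
    return sum(10 ** (-q / 10) for q in quals)

# A hypothetical 200 bp read: mostly Q38 with a lower-quality Q20 tail.
read_quals = [38] * 190 + [20] * 10
print(expected_errors(read_quals) <= 2.0)  # True: the read passes maxEE = 2
```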

The quality profile of all analysed paired reads before trimming (forward reads in the left panel, reverse reads in the right panel):

[Figure: aggregated quality profiles before trimming (forward reads left, reverse reads right)]

And after trimming it was:

[Figure: aggregated quality profiles after trimming (forward reads left, reverse reads right)]

In case unpaired reads were kept after trimming and primer removal, quality profiles for unpaired reads can also be found in:

## ../../results/PacMAN/runs/COI/06-report/dada2/aggregate_quality_profiles_filesForw_single.png
## ../../results/PacMAN/runs/COI/06-report/dada2/aggregate_quality_profiles_filesRev_single.png
## ../../results/PacMAN/runs/COI/06-report/dada2/aggregate_quality_profiles_filtered_unpaired_forward.png
## ../../results/PacMAN/runs/COI/06-report/dada2/aggregate_quality_profiles_fitlered_unpaired_reverse.png

1.4.2 ASV inference

The following step of the dada2 pipeline was to define the ASVs from these trimmed reads.

First, error rates were learned separately for the forward and reverse paired reads, as well as for any remaining unpaired reads that survived filtering, using the following command:

## learnErrors(filtFs, multithread = TRUE, nbases = 1e8, randomize = FALSE,MAX_CONSIST = 10, OMEGA_C = 0, verbose = FALSE)

The forward and reverse error profiles were the following (in absolute numbers and in percentage of analysed reads):

[Figure: forward and reverse error profiles]

Next the reads were dereplicated with the following command:

## derepFastq(filtFs, n = 1e6)

The error rates were then used to infer ASVs from the dereplicated sequences using the following command for both forward and reverse reads separately:

## dada(derepFs, err = errF, selfConsist = FALSE, pool = TRUE, priors = ,
##   multithread = TRUE)

In case reads were overlapping (merge included in the config file), sequence pairs were combined using:

## mergePairs(dadaFs, derepFs, dadaRs, derepRs,
##   minOverlap = 12, maxMismatch = 0, returnRejects = TRUE,
##   propagateCol = , justConcatenate = FALSE, trimOverhang = FALSE)

If returnRejects was selected, the sequence pairs that could not be merged were also kept in the final tables.

Otherwise, the dereplicated sequences were kept without merging.

Any unpaired forward or reverse sequences were also added to the sequence tables at this stage.

All single reverse sequences (unpaired either during trimming or here, or belonging to pairs that were not merged) were reverse complemented for the bowtie2 algorithm.
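The overlap-based merging that mergePairs performs can be sketched as follows, assuming minOverlap = 12 and maxMismatch = 0 as in the command above (a naive illustration with made-up reads, not dada2's implementation):

```python
# Naive illustration of overlap merging with minOverlap = 12 and
# maxMismatch = 0; not dada2's actual algorithm.
def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def merge_pair(fwd, rev, min_overlap=12):
    rev_rc = revcomp(rev)
    # Try the largest possible overlap first, down to the minimum.
    for ov in range(min(len(fwd), len(rev_rc)), min_overlap - 1, -1):
        if fwd[-ov:] == rev_rc[:ov]:  # zero mismatches allowed
            return fwd + rev_rc[ov:]
    return None  # rejected pair (kept unmerged only if returnRejects = TRUE)

fwd = "ACGTACGTACGTAAACCCGGG"
rev = revcomp("ACGTAAACCCGGGTTTGGGCC")  # read 2 starts from the other end
print(merge_pair(fwd, rev))  # ACGTACGTACGTAAACCCGGGTTTGGGCC
```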

And finally, chimeras were removed with:

## removeBimeraDenovo(seqtab, method = consensus, multithread = TRUE)

The number of reads analysed in each step, and remaining at the end of the pipeline, was:

| Sample | reads.in | reads.out.paired | reads.in.forward.single | reads.out.forward.single | reads.in.reverse.single | reads.out.reverse.single | denoisedF | denoisedR | merged | denoisedF_single | denoisedR_single | nonchim | ASVs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SUVA3_20231128_S3A | 70512 | 59686 | 101 | 71 | 91 | 72 | 59242 | 59315 | 58464 | 47 | 49 | 128207 | 296 |
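To put these counts in perspective, the survival of the paired reads through the main steps can be expressed as a percentage of the input (an illustrative sketch using the numbers above, not pipeline code):

```python
# Illustrative only: read survival through the dada2 steps, using the
# paired-read counts for SUVA3_20231128_S3A from the table above.
steps = {
    "reads.in": 70512,
    "reads.out.paired": 59686,
    "denoisedF": 59242,
    "merged": 58464,
}
start = steps["reads.in"]
for name, count in steps.items():
    print(f"{name}: {count} ({count / start * 100:.1f}% of input)")
```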

1.5 Assigning taxonomy

1.5.1 RDP

Next, taxonomy was assigned to the inferred ASVs using the RDP classifier.

The command used for the RDP classification was the following:

## rdp_classifier -Xmx16g classify -tdata/reference_databases/COI_terrimporter/rRNAClassifier.properties -o  results/PacMAN/runs/COI/04-taxonomy/rdp/COI_terrimporter_5.1.0_rdp.output -q results/PacMAN/runs/COI/03-dada2/rep-seqs.fna

Additionally, a vsearch local alignment was made against a reference database to find matches and their identities.

## vsearch --usearch_global results/PacMAN/runs/COI/03-dada2/rep-seqs.fna  --dbdata/reference_databases/COI_ncbi_1_50000_pcr_pga_taxon_derep_clean_sintax.fasta --id 0.97 --blast6out results/PacMAN/runs/COI/04-taxonomy/vsearch/COI_ncbi_1_5000_vsearch_blast.b6 --threads 4  --output_no_hits --query_cov 0.85 --maxaccepts 100 --maxrejects 100 --maxhits 10  --lcaout results/PacMAN/runs/COI/04-taxonomy/vsearch/COI_ncbi_1_5000_vsearch_lcaout.tab  --lca_cutoff 0.6 --notrunclabels 2> --lcaout results/PacMAN/runs/COI/06-report/vsearch/vsearch_COI_ncbi_1_5000.log

This command indicates that, from a maximum of 10 hits that all meet the criteria (identity and query coverage), the final result is a consensus taxonomy that over 60% of these hits agree on. This LCA workflow is meant to filter out hits where there are multiple conflicting high-identity matches.
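The consensus step can be sketched as follows; this is an illustrative reimplementation of the idea behind --lcaout with --lca_cutoff 0.6, not vsearch's code, and the taxa and hits are hypothetical:

```python
# Illustrative LCA consensus: at each rank, keep a taxon only if more than
# the cutoff fraction (0.6 here) of accepted hits agree; truncate the
# lineage at the first conflicting rank. Not vsearch's implementation.
from collections import Counter

def lca_consensus(lineages, cutoff=0.6):
    """lineages: one rank list per hit, e.g. [kingdom, phylum, class]."""
    consensus = []
    for rank_taxa in zip(*lineages):
        taxon, count = Counter(rank_taxa).most_common(1)[0]
        if count / len(rank_taxa) > cutoff:
            consensus.append(taxon)
        else:
            break  # conflict below this rank: stop here
    return consensus

# Four hypothetical high-identity hits that disagree at class level:
hits = [
    ["Animalia", "Mollusca", "Gastropoda"],
    ["Animalia", "Mollusca", "Gastropoda"],
    ["Animalia", "Mollusca", "Bivalvia"],
    ["Animalia", "Mollusca", "Bivalvia"],
]
print(lca_consensus(hits))  # ['Animalia', 'Mollusca']
```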

The vsearch log indicates how many matches were made with this search:

## Matching unique query sequences: 81 of 296 (27.36%) with an identity cutoff of 0.97 a query coverage of 0.85 and an lca cutoff of 0.6

Finally all results are stored in the output files after a filtering step.

## The rdp results were then filtered based off of the user defined probability cutoff: 0.6

The rdp results after filtering are stored as the taxonomic lineage, while the full results with probabilities, as well as the vsearch hits, are stored in the field “identificationRemarks”; both are included in the final Darwin Core files.

1.5.2 Blast against the full ncbi-nt

If chosen by the user (Blast: include: TRUE in the config file), the unknown sequences that did not have a match (result either empty or only “Eukaryota”) were further analysed with blastn.

## The unknown sequences defined in this way can be found in: /results/PacMAN/runs/COI/04-taxonomy/unknown_asvs.txt
## In total there were  123  asvs, from a total of  296

The used command was:

## Blast was not included in this analysis run

The results of the blast query were then filtered with a last common ancestor script using BASTA. BASTA is a command-line tool written in Python that was developed to enable basic taxonomic annotation based on a Last Common Ancestor algorithm similar to the one implemented in MEGAN. A number of parameters can be chosen for the BASTA LCA analysis; the following command was used:

## Blast was not included in this analysis run

A phyloseq object was built with this information, and can be found here:

./results/PacMAN/runs/COI/05-dwca/phyloseq_object.rds

The total information stored in the phyloseq object is:

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 296 taxa and 1 samples ]
## sample_data() Sample Data:       [ 1 samples by 21 sample variables ]
## tax_table()   Taxonomy Table:    [ 296 taxa by 13 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 296 reference sequences ]

Note also that the sequences of the ASVs are stored in the tax_table of the phyloseq object.

The total number of taxa assigned at each taxonomic level:

##   scientificName scientificNameID           domain     superkingdom 
##              296              264               NA              288 
##           phylum            class            order 
##              173              133              108

In relative abundance (see total number of reads in the dada2 output), the most abundant taxa at phylum level and at species level are shown in the following images.
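For reference, converting the ASV count table to the relative abundances used in such plots is a simple per-sample normalisation (a sketch with hypothetical counts, not the pipeline's plotting code):

```python
# Illustrative only: per-sample relative abundance from hypothetical counts.
counts = {"ASV_1": 500, "ASV_2": 300, "ASV_3": 200}
total = sum(counts.values())
relative = {asv: n / total for asv, n in counts.items()}
print(relative)  # {'ASV_1': 0.5, 'ASV_2': 0.3, 'ASV_3': 0.2}
```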

The relationship of the biodiversity between the samples is shown in the following ordination: the closer together the samples are depicted, the more similar the biodiversity measured in each of them. (This works only if more than two samples are analysed in a run.)

1.6 Darwin core archive tables

Finally, the results of this analysis were collected in a Darwin Core Occurrence table and DNA-derived data extension, to facilitate submitting the data to OBIS.

First, LSIDs for the assigned taxa were collected based on name matches in the WoRMS database. Occurrences that are not included in the final tables are mostly known taxa that are not marine. These occurrences are listed in the table 05-dwca/Taxa_not_in_worms.tsv. This table may also contain marine species that are not yet recognised by WoRMS, so it is recommended to review it manually.

All unknown sequences that did not have a taxonomic assignment are added to the occurrence table as ‘Incertae sedis’ (urn:lsid:marinespecies.org:taxname:12); in the future, more checks will be made to filter out sequences that were not from the target gene region or are likely not marine.

In addition, if there was a control sample, labelled in the original sample metadata with occurrenceStatus “absent”, any occurrences of the taxa found in this sample were removed from all samples.

The final tables can be found at:

  • results/PacMAN/runs/COI/05-dwca/Occurrence_table.tsv
  • results/PacMAN/runs/COI/05-dwca/DNA_extension_table.tsv

From this analysis run, the total number of occurrences was 296. Of these, 288 were assigned a known taxon, while the remaining 8 (2.7%) were unknown.

The sequences of these occurrences can nevertheless be added to the database, and it is possible that their taxonomy will be defined when reference databases develop.

The number of names not found in the marine records of WoRMS was 0.