Congratulations! You have finished running the PacMAN pipeline. In this report you will find information on all the main steps that were run on your sequences, and statistics describing how each step went.
Specifically, this is the report for the project: PacMAN
And the pipeline run: COI.
And the samples that were run in this analysis are:
Quality control of the raw samples was performed so that any issues in the final analysis can, where possible, be traced back to issues in the original data.
A visual representation of the raw-data QC results can be found in an HTML report at:
results/PacMAN/samples/multiqc_COI.html
Generally, the sequences in the samples had the following
composition:
| Sample | Read | Duplicates (%) | GC (%) | Avg. sequence length (bp) | Median sequence length (bp) | FastQC modules failed (%) | Total sequences (millions) |
|---|---|---|---|---|---|---|---|
| SUVA3_20231128_S3A | fw | 74.2 | 38 | 300.0115 | 301 | 27.27273 | 0.07187 |
| SUVA3_20231128_S3A | rv | 81.9 | 38 | 300.2132 | 301 | 54.54545 | 0.07187 |
As you can most likely see, there are several issues with the raw sequences. Therefore, the sequences were trimmed and cleaned-up prior to the analysis of sequence variants and the taxonomic composition of the samples.
The first step of the process was to trim and quality control the samples using fastp (Shifu Chen 2023).
The results for each sample, in numbers of reads before and after filtering:
| Sample | Read1BeforeFiltering | Read2BeforeFiltering | Read1AfterFiltering | Read2AfterFiltering |
|---|---|---|---|---|
| SUVA3_20231128_S3A | 71870 | 71870 | 70668 | 70668 |
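These counts can also be retrieved directly from fastp's per-sample JSON report. A minimal sketch follows; the file path is a placeholder, while the field names follow fastp's standard JSON layout:

```r
# Sketch: read per-sample filtering totals from a fastp JSON report.
# The path is hypothetical; summary$*_filtering$total_reads are standard fastp fields.
library(jsonlite)

fastp_report <- fromJSON("results/PacMAN/samples/SUVA3_20231128_S3A/fastp/fastp.json")
reads_before <- fastp_report$summary$before_filtering$total_reads  # R1 + R2 before filtering
reads_after  <- fastp_report$summary$after_filtering$total_reads   # R1 + R2 after filtering
round(100 * reads_after / reads_before, 2)                         # percent of reads retained
```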
The next step in the process ran cutadapt on all samples. All analyses were done in single-end mode: forward and reverse reads were analysed separately. This is done so that reads that do not have a pair are not discarded at this stage.
The command(s) used were:
## [1] "This is cutadapt 3.4 with Python 3.9.20"
## [2] "Command line parameters: -g GGWACWGGWTGAACWGTWTAYCCYCC -a TGRTTYTTYGGNCAYCCNGARGTNTA -o results/PacMAN/runs/COI/02-cutadapt/SUVA3_20231128_S3A/SUVA3_20231128_S3A_1P.fastq.gz --minimum-length 1 --rc --discard-untrimmed results/PacMAN/samples/SUVA3_20231128_S3A/fastp/SUVA3_20231128_S3A_1P.fastq.gz"
## [1] "This is cutadapt 3.4 with Python 3.9.20"
## [2] "Command line parameters: -g TANACYTCNGGRTGNCCRAARAAYCA -a GGRGGRTAWACWGTTCAWCCWGTWCC -o results/PacMAN/runs/COI/02-cutadapt/SUVA3_20231128_S3A/SUVA3_20231128_S3A_2P.fastq.gz --minimum-length 1 --rc --discard-untrimmed results/PacMAN/samples/SUVA3_20231128_S3A/fastp/SUVA3_20231128_S3A_2P.fastq.gz"
| Sample | TotalReadsProcessed | ReadsWithPrimer | ReverseComplemented | ReadsTooShort | ReadsWritten | BasePairsProcessed | BasePairsWritten |
|---|---|---|---|---|---|---|---|
| SUVA3_20231128_S3A 1P | 70668 | 70613 | 36 | 0 | 70613 | 21197268 | 19346196 |

| Sample | ReadsWithPrimer (%) | ReadsWritten (%) | BasePairsWritten (%) |
|---|---|---|---|
| SUVA3_20231128_S3A 1P | 99.92 | 99.92 | 91.27 |

| Sample | TotalReadsProcessed | ReadsWithPrimer | ReverseComplemented | ReadsTooShort | ReadsWritten | BasePairsProcessed | BasePairsWritten |
|---|---|---|---|---|---|---|---|
| SUVA3_20231128_S3A 2P | 70668 | 70603 | 2 | 0 | 70603 | 21195650 | 19341268 |

| Sample | ReadsWithPrimer (%) | ReadsWritten (%) | BasePairsWritten (%) |
|---|---|---|---|
| SUVA3_20231128_S3A 2P | 99.91 | 99.91 | 91.25 |
For forward unpaired reads, the results were:
| Sample | TotalReadsProcessed | ReadsWithPrimer | ReverseComplemented | ReadsTooShort | ReadsWritten | BasePairsProcessed | BasePairsWritten |
|---|---|---|---|---|---|---|---|
| SUVA3_20231128_S3A 1U | 982 | 980 | 0 | 0 | 980 | 295125 | 269068 |

| Sample | ReadsWithPrimer (%) | ReadsWritten (%) | BasePairsWritten (%) |
|---|---|---|---|
| SUVA3_20231128_S3A 1U | 99.8 | 99.8 | 91.17 |
For reverse unpaired reads, the results were:
| Sample | TotalReadsProcessed | ReadsWithPrimer | ReverseComplemented | ReadsTooShort | ReadsWritten | BasePairsProcessed | BasePairsWritten |
|---|---|---|---|---|---|---|---|
| SUVA3_20231128_S3A 2U | 151 | 150 | 0 | 0 | 150 | 44450 | 40251 |

| Sample | ReadsWithPrimer (%) | ReadsWritten (%) | BasePairsWritten (%) |
|---|---|---|---|
| SUVA3_20231128_S3A 2U | 99.34 | 99.34 | 90.55 |
Currently, the unpaired reads kept at this step are not analysed further. Keep this in mind if there is a large number of unpaired forward or reverse reads.
Next, the reads without adapters and primers were analysed with the dada2 pipeline. This pipeline evaluates the error rates in the reads and, based on the results, defines amplicon sequence variants (ASVs).
Note: at the moment this step only analyses the reads that are still paired after sequence trimming. We plan to include the analysis of unpaired reads in the future.
The first step of the dada2 pipeline is to perform more filtering and trimming based on sequence quality.
The parameters used in this run were:
## filterAndTrim(filesForw, filtFs, filesRev, filtRs,
##               truncLen = c(200, 150), truncQ = 2,
##               trimRight = 0, trimLeft = 0,
##               maxLen = Inf, minLen = 20,
##               maxN = 0, minQ = 2, maxEE = 2,
##               rm.phix = TRUE, orient.fwd = NULL,
##               matchIDs = FALSE, id.sep = "\\s", id.field = NULL,
##               compress = TRUE, multithread = TRUE,
##               n = 100000, OMP = TRUE,
##               verbose = FALSE
## )
The quality profile of all analysed paired reads before trimming (forward reads in the left panel, reverse reads in the right panel):
[Figure: aggregate quality profiles of paired reads before trimming (forward left, reverse right)]
And after trimming it was:
[Figure: aggregate quality profiles of paired reads after trimming (forward left, reverse right)]
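Aggregate quality-profile figures of this kind can be reproduced with dada2's `plotQualityProfile()`; a minimal sketch, in which the directory paths are placeholders:

```r
# Sketch: aggregate quality profiles before and after filtering (dada2).
# Directory paths are placeholders; aggregate = TRUE pools all listed files into one panel.
library(dada2)

raw_fwd  <- list.files("results/PacMAN/runs/COI/02-cutadapt", pattern = "_1P.fastq.gz$",
                       recursive = TRUE, full.names = TRUE)
filt_fwd <- list.files("results/PacMAN/runs/COI/03-dada2/filtered", pattern = "_1P.fastq.gz$",
                       recursive = TRUE, full.names = TRUE)

plotQualityProfile(raw_fwd,  aggregate = TRUE)  # before filterAndTrim
plotQualityProfile(filt_fwd, aggregate = TRUE)  # after filterAndTrim
```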
In case unpaired reads were kept after trimming and primer removal, quality profiles for unpaired reads can also be found in:
## ../../results/PacMAN/runs/COI/06-report/dada2/aggregate_quality_profiles_filesForw_single.png
## ../../results/PacMAN/runs/COI/06-report/dada2/aggregate_quality_profiles_filesRev_single.png
## ../../results/PacMAN/runs/COI/06-report/dada2/aggregate_quality_profiles_filtered_unpaired_forward.png
## ../../results/PacMAN/runs/COI/06-report/dada2/aggregate_quality_profiles_fitlered_unpaired_reverse.png
The following step of the dada2 pipeline was to define the ASVs from these trimmed reads.
First, error rates were learned separately for the forward and reverse paired reads, as well as for any remaining unpaired reads that survived filtering, with the following command:
## learnErrors(filtFs, multithread = TRUE, nbases = 1e8, randomize = FALSE, MAX_CONSIST = 10, OMEGA_C = 0, verbose = FALSE)
The forward and reverse error profiles were the following (in absolute numbers and in percentage of analysed reads):
[Figure: forward and reverse error profiles]
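Error-profile plots of this kind can be drawn from the learnErrors() output with dada2's `plotErrors()`, for example:

```r
# Sketch: visualise the learned error models (errF/errR from learnErrors above).
# Points show observed error rates per quality score, the black line the fitted model,
# and with nominalQ = TRUE the red line shows the rates expected from the nominal Q-score.
library(dada2)

plotErrors(errF, nominalQ = TRUE)  # forward reads
plotErrors(errR, nominalQ = TRUE)  # reverse reads
```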
Next the reads were dereplicated with the following command:
## derepFastq(filtFs, n = 1e6)
The error rates were then used to infer ASVs from the dereplicated sequences using the following command for both forward and reverse reads separately:
## dada(derepFs, err = errF, selfConsist = FALSE, pool = TRUE, priors = character(0),
##      multithread = TRUE)
If reads were overlapping (merge included in the config file), sequence pairs were combined using:
## mergePairs(dadaFs, derepFs, dadaRs, derepRs,
##            minOverlap = 12, maxMismatch = 0, returnRejects = TRUE,
##            propagateCol = character(0), justConcatenate = FALSE, trimOverhang = FALSE)
If returnRejects was selected, the rejected (unmerged) sequence pairs were also kept in the final tables.
Otherwise, the dereplicated sequences were kept without merging.
Any unpaired forward or reverse sequences were also added to the sequence tables at this stage.
All single reverse sequences (unpaired either during trimming or here, or if pairs were not merged) were reverse complemented for the bowtie2 algorithm.
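As a minimal sketch of that reverse-complementing step, using the Biostrings package (the sequence and its name below are placeholders):

```r
# Sketch: reverse-complement reverse-orientation sequences before pooling them
# with the forward ones (Biostrings). The sequence is a placeholder.
library(Biostrings)

rev_seqs <- DNAStringSet(c(asv_001 = "TGGTTTTTTGGTCATCCTGAAGTTTA"))
rev_comp <- reverseComplement(rev_seqs)
as.character(rev_comp)
```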
And finally, chimeras were removed with:
## removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE)
The number of reads analysed in each step, and remaining at the end of the pipeline, was:
| Sample | reads.in | reads.out.paired | reads.in.forward.single | reads.out.forward.single | reads.in.reverse.single | reads.out.reverse.single | denoisedF | denoisedR | merged | denoisedF_single | denoisedR_single | nonchim | ASVs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SUVA3_20231128_S3A | 70512 | 59686 | 101 | 71 | 91 | 72 | 59242 | 59315 | 58464 | 47 | 49 | 128207 | 296 |
Next, taxonomy was assigned to the inferred ASVs using the RDP classifier.
The command used for the RDP classification was:
## rdp_classifier -Xmx16g classify -t data/reference_databases/COI_terrimporter/rRNAClassifier.properties -o results/PacMAN/runs/COI/04-taxonomy/rdp/COI_terrimporter_5.1.0_rdp.output -q results/PacMAN/runs/COI/03-dada2/rep-seqs.fna
Additionally, a vsearch global alignment was made against a reference database to find matches and their identities.
## vsearch --usearch_global results/PacMAN/runs/COI/03-dada2/rep-seqs.fna --db data/reference_databases/COI_ncbi_1_50000_pcr_pga_taxon_derep_clean_sintax.fasta --id 0.97 --blast6out results/PacMAN/runs/COI/04-taxonomy/vsearch/COI_ncbi_1_5000_vsearch_blast.b6 --threads 4 --output_no_hits --query_cov 0.85 --maxaccepts 100 --maxrejects 100 --maxhits 10 --lcaout results/PacMAN/runs/COI/04-taxonomy/vsearch/COI_ncbi_1_5000_vsearch_lcaout.tab --lca_cutoff 0.6 --notrunclabels 2> results/PacMAN/runs/COI/06-report/vsearch/vsearch_COI_ncbi_1_5000.log
This command indicates that, from a maximum of 10 hits that all meet the criteria (identity and query coverage), the final result is a consensus taxonomy that over 60% of these hits agree on. This LCA workflow is meant to filter out cases where there are multiple conflicting high-identity hits.
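To illustrate the idea (this is a sketch only, not vsearch's internal implementation), a small majority-consensus LCA over a set of hit lineages keeps only the ranks on which at least the cutoff fraction of hits agree:

```r
# Illustration only: majority-consensus taxonomy across retained hits,
# keeping ranks supported by >= cutoff of the hits (placeholder lineages below).
lca_consensus <- function(lineages, cutoff = 0.6) {
  split_ranks <- strsplit(lineages, ";")
  n_ranks <- max(lengths(split_ranks))
  consensus <- character(0)
  for (i in seq_len(n_ranks)) {
    rank_i <- sapply(split_ranks, function(x) if (length(x) >= i) x[i] else NA)
    tab <- table(rank_i, useNA = "no")
    if (length(tab) == 0) break
    top <- tab[which.max(tab)]
    if (top / length(lineages) >= cutoff) consensus <- c(consensus, names(top)) else break
  }
  paste(consensus, collapse = ";")
}

hits <- c("Eukaryota;Chordata;Actinopteri;Perciformes",
          "Eukaryota;Chordata;Actinopteri;Perciformes",
          "Eukaryota;Chordata;Actinopteri;Gadiformes",
          "Eukaryota;Chordata;Actinopteri;Gadiformes")
lca_consensus(hits, cutoff = 0.6)  # "Eukaryota;Chordata;Actinopteri"
```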
The vsearch log indicates how many matches were made with this search:
## Matching unique query sequences: 81 of 296 (27.36%) with an identity cutoff of 0.97 a query coverage of 0.85 and an lca cutoff of 0.6
Finally, all results are stored in the output files after a filtering step.
## The rdp results were then filtered based off of the user defined probability cutoff: 0.6
The RDP results after filtering are stored as the taxonomic lineage, while the full results with probabilities, as well as the vsearch hits, are stored in the field “identificationRemarks” and are included in the final Darwin Core files.
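To illustrate this filtering (a sketch only, not the pipeline's exact code), ranks whose RDP bootstrap probability falls below the cutoff are blanked out, together with all deeper ranks:

```r
# Sketch: truncate an RDP lineage at the first rank below the probability cutoff.
filter_rdp <- function(tax_names, tax_probs, cutoff = 0.6) {
  keep <- tax_probs >= cutoff
  # once one rank fails, all deeper ranks are dropped as well
  if (any(!keep)) keep[seq(which(!keep)[1], length(keep))] <- FALSE
  tax_names[!keep] <- NA
  tax_names
}

filter_rdp(
  tax_names = c("Eukaryota", "Chordata", "Actinopteri", "Perciformes"),  # placeholder lineage
  tax_probs = c(1.00, 0.95, 0.58, 0.40),                                  # placeholder bootstraps
  cutoff = 0.6
)
# "Eukaryota" "Chordata" NA NA
```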
If chosen by the user (Blast: include: TRUE in the config file), the unknown sequences that did not have a match (results either “”, or only “Eukaryota”) were further analysed with blastn.
## The unknown sequences defined in this way can be found in: /results/PacMAN/runs/COI/04-taxonomy/unknown_asvs.txt
## In total there were 123 asvs, from a total of 296
The command used was:
## Blast was not included in this analysis run
The results of the blast query were then filtered with a last common ancestor (LCA) script using BASTA. BASTA is a command-line tool written in Python that was developed to enable basic taxonomy annotation based on a Last Common Ancestor algorithm similar to the one implemented in MEGAN. A number of parameters can be chosen for the BASTA LCA analysis. The following command was used:
## Blast was not included in this analysis run
A phyloseq object was built with this information, and can be found here:
./results/PacMAN/runs/COI/05-dwca/phyloseq_object.rds
The total information stored in the phyloseq object is:
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 296 taxa and 1 samples ]
## sample_data() Sample Data: [ 1 samples by 21 sample variables ]
## tax_table() Taxonomy Table: [ 296 taxa by 13 taxonomic ranks ]
## refseq() DNAStringSet: [ 296 reference sequences ]
Note also that the sequences of the ASVs are stored in the tax_table of the phyloseq object.
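The saved object can be loaded and inspected interactively in R, for example:

```r
# Load and inspect the phyloseq object produced by the pipeline.
library(phyloseq)

ps <- readRDS("results/PacMAN/runs/COI/05-dwca/phyloseq_object.rds")
ps                   # prints the summary shown above
ntaxa(ps)            # 296 ASVs
nsamples(ps)         # 1 sample
head(tax_table(ps))  # taxonomic assignments of the first ASVs
refseq(ps)           # ASV sequences as a DNAStringSet
```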
The total number of taxa assigned at each taxonomic level:
| scientificName | scientificNameID | domain | superkingdom | phylum | class | order |
|---|---|---|---|---|---|---|
| 296 | 264 | NA | 288 | 173 | 133 | 108 |
In relative abundance (see total number of reads in the dada2 output), the most abundant taxa at phylum level and at species level are shown in the following images.
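A sketch of how such a phylum-level relative-abundance plot can be produced from the phyloseq object loaded above (the rank name "phylum" is assumed to match the column naming of this tax_table):

```r
# Sketch: relative abundance of taxa agglomerated at phylum level (phyloseq).
library(phyloseq)

ps_rel  <- transform_sample_counts(ps, function(x) x / sum(x))  # counts -> relative abundance
ps_phyl <- tax_glom(ps_rel, taxrank = "phylum", NArm = FALSE)   # agglomerate ASVs by phylum
plot_bar(ps_phyl, fill = "phylum")
```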
The relationship of the biodiversity between the samples is shown in the following ordination. The closer together the samples are depicted, the more similar the biodiversity measured in each of the samples was. (Works only if more than 2 samples are analysed in a run.)
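With more than two samples, an ordination like this could be computed from the relative-abundance object in the previous sketch, for example a Bray-Curtis PCoA (the colouring variable below is a placeholder for one of the sample-data columns):

```r
# Sketch: Bray-Curtis PCoA of the samples (meaningful only with several samples).
ord <- ordinate(ps_rel, method = "PCoA", distance = "bray")
plot_ordination(ps_rel, ord, color = "locality")  # "locality" is a placeholder sample variable
```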
Finally, the results of this analysis were collected in a Darwin Core Occurrence table and DNA-derived data extension, to facilitate submitting the data to OBIS.
First, LSIDs for the assigned taxa were collected based on name matches in the WoRMS database. Occurrences that are not included in the final tables are mostly known taxa that are not marine. These occurrences are listed in the table 05-dwca/Taxa_not_in_worms.tsv. This table may also contain marine species that are not yet recognized by WoRMS, so manual review of this table is recommended.
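As an illustration of this kind of lookup (not necessarily the pipeline's exact mechanism), the worrms package can retrieve the WoRMS record, LSID, and marine flag for a taxon name:

```r
# Illustrative sketch: query WoRMS for a taxon name and inspect its LSID and marine flag.
# The species name is an example only, not taken from this run.
library(worrms)

rec <- wm_records_names(name = "Acartia clausi")[[1]]
rec$lsid      # WoRMS LSID (urn:lsid:marinespecies.org:taxname:...)
rec$isMarine  # flag used to judge whether the taxon is marine
```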
All unknown sequences that did not have a taxonomic assignment are added to the occurrence table as ‘Incertae sedis’ (urn:lsid:marinespecies.org:taxname:12); in the future, more checks will be made to filter out sequences that were not from the target gene region or that are likely not marine.
In addition, if there was a control sample labelled in the original sample data with occurrenceStatus “absent”, any occurrences of the taxa found in this sample were removed from all samples.
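A sketch of that kind of control-based filtering on the phyloseq object (the control sample name below is a placeholder):

```r
# Sketch: remove any ASV detected in a negative-control sample from the whole dataset.
library(phyloseq)

control_ps   <- prune_samples("control_sample", ps)  # "control_sample" is a placeholder name
control_asvs <- taxa_names(filter_taxa(control_ps, function(x) sum(x) > 0, prune = TRUE))
ps_clean     <- prune_taxa(!(taxa_names(ps) %in% control_asvs), ps)
```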
The final tables can be found at:
From this analysis run, the total number of occurrences was 296. Of these, 288 were known, and 8 (2.70 %) remained unknown.
The sequences of these occurrences can nevertheless be added to the database, and it is possible that their taxonomy will be defined when reference databases develop.
The number of names not found in the marine records of WoRMS was 0.