Pipeline1 – Insect Phylogeny

# PHYLOGENETIC PLACEMENT
# video tutorial here

# Dependencies:
# Linux
# NCBI E-Utilities (Sayers, 2009, https://www.ncbi.nlm.nih.gov/books/NBK25497/)
# Blastclust (https://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/)
# remove_fasta_entries.pl (https://github.com/dchesters/insectphylo_tools)
# Clustal Omega (Sievers and Higgins, 2014, Curr Protoc Bioinformatics, 48)
# Raxml 8 (Stamatakis, 2014, Bioinformatics, 30)

# First we need 'queries', these are, study subjects from some community samples which we have barcoded
# and would like to extract information from.
# We will use insect DNA barcodes from BEF China,
# the worlds largest tree diversity experiment (Bruelheide et al, 2014, Methods Ecol Evol, 5).
# The data are from caterpillar herbivores (Wang et al, 2019, Journal of Ecology, 107),
# and are available from the NCBI repository, which we can interface with at the command line using e-utilities.
esearch -db nuccore -query 'MN131188:MN132787[ACCN]' -mindate 2017 -maxdate 2022 | efetch -format fasta > BEF_Leps.fas

# There should be about 1600 sequences in Fasta format.
# Some have species names, some higher rank labels, and there is a lot of conspecific data.
# The description lines for NCBI data are rather unwieldy, we can simplify with a bash command:
sed "s/>\(\S*\)\s\(\S*\)\s\(\S*\).*/>\2_\3_\1/" BEF_Leps.fas > BEF_Leps.fas2

# Common processes to conduct on this type of community barcode data would be OTU clustering and taxonomic assignment.
# We should at minimum here conduct similarity clustering,
# since many phylogenetic implementations assume a range of sequence variation
# that does not include a large proportion of identical or near identical entries,
# and indeed will often crash when presented with such.
# Some might regard this method (hierarchical clustering according to pairwise sequence similarity) as simple,
# though the software tool is very robust and a good one to have in a set of options.
# Blastclust is part of the legacy Blast toolkit (although the legacy tools were superseded many years ago by Blast+,
# a predecessor to Blastclust was not part of these).
# You can still obtain these from the Blast legacy folder of the NCBI ftp site, above.
blastclust -i BEF_Leps.fas2 -o BEF_Leps.fas2.uniques -S 99.0 -p F -e F -a 4 -L .99

# The output of Blastclust is named BEF_Leps.fas2.uniques, it contains all the IDs included in the input fasta file,
# arranged as OTUs per line, and ordered upper to lower according to number of members in each OTU.
# We need to take a representative sequence for each OTU, the following Perl script will do this.
# We will use the -relabel option to make the OTU easier to find.
perl remove_fasta_entries.pl -fasta_in BEF_Leps.fas2 -fasta_out BEF_Leps.fas2.OTU.fas -rm_list BEF_Leps.fas2.uniques -regex 0 -relabel 1 -retain -blastclust_list 1

# We now need reference data, which is something to profile our plot OTU against.
# Download the insectphylo reference synthesis phylogeny and corresponding COI alignment,
# from https://insectphylo.org/lepidoptera_v2/.
# Or, from the command line:
wget https://insectphylo.org/wp-content/uploads/2023/12/Lepidoptera_v2.NEWICK.txt
wget https://insectphylo.org/wp-content/uploads/2023/12/Lepidoptera_v2.FASTA_.txt

# Next we need to align the queries to the references (known as 'profile alignment').
# COI is with rare exceptions length invariant. Nonetheless, errors often creep in when aligning thousands of any sequence.
# Clustal Omega is pretty robust and should manage this task without any major error.
# Other options for high throughput profile alignment include Pynast, aligner.pl.
clustalo-1.2.4-Ubuntu-x86_64 --profile1=Lepidoptera_v2.FASTA_.txt --in=BEF_Leps.fas2.OTU.fas -outfile=Refs_and_Queries.clo2 --seqtype=DNA --iter=2 --outfmt=fa --threads=1 --log=clustallogfile

# It is good practice to check alignments by eye. Many are fond of BioEdit for sequence viewing, though it is a Windows tool.
# For Linux you might try ugene.
# Roughly the first quarter of sequences in the file are the queries,
# these have accessions (note they are not species filtered),
# and the latter 3 quarters are the references, these are species filtered so accessions can be removed.
# Under default alignment parameters of Clustal Omega there may be some errors (gap inserted where shouldnt be),
# on the 5 prime end of the queries.
# Clustal Omega uses HMMs and doesnt have gap parameters which are common in other software. But 2 iterations instead of the default 1 and the result should be near perfect.
ugene -ui

# Next place the queries onto the reference phylogeny, for which we will use Raxml version 8.
# Input for this step is the COI alignment (default input format to Raxml 8 is usually reported as Phylip,
# though it will also read Fasta), and the fixed reference tree.
# There is more than a single way this can be done, here we will run a regular tree search fixing the reference topology.
# Alternative is an evolutionary placement algorithm although the implementations for this are less robust.
# The search is split into 2 parts otherwise Raxml often crashes when doing both of these.
# Running time will be about 3 hours, and require about 1 GB.
raxmlHPC-8.2.4 -F -g Lepidoptera_v2.NEWICK.txt -s Refs_and_Queries.clo2 -n Refs_and_Queries1 -m GTRCAT -c 24 -p 123 -o Micropterna_lateralis
raxmlHPC-8.2.4 -f e -t RAxML_result.Refs_and_Queries1 -s Refs_and_Queries.clo2 -n Refs_and_Queries2 -m GTRCAT -c 24 -p 123 -o Micropterna_lateralis

# Output file is RAxML_result.Refs_and_Queries2,
# check in a tree veiwer, I recommend archeopteryx. We have added the string 'OTU_' to the queries
# so these can be easily highlighted and observed.
java -cp ~/software/archeopteryx/forester_1034.jar org.forester.archaeopteryx.Archaeopteryx -c ~/software/archeopteryx/config_file RAxML_result.Refs_and_Queries2
# Calculation of phylogenetic diversity indices from placed plot data will be covered in a separate pipeline.