# PHYLOGENETIC PLACEMENT # video tutorial here # Dependencies: # Linux # NCBI E-Utilities (Sayers, 2009, https://www.ncbi.nlm.nih.gov/books/NBK25497/) # Blastclust (https://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/) # remove_fasta_entries.pl (https://github.com/dchesters/insectphylo_tools) # Clustal Omega (Sievers and Higgins, 2014, Curr Protoc Bioinformatics, 48) # Raxml 8 (Stamatakis, 2014, Bioinformatics, 30) # First we need 'queries', these are, study subjects from some community samples which we have barcoded # and would like to extract information from. # We will use insect DNA barcodes from BEF China, # the worlds largest tree diversity experiment (Bruelheide et al, 2014, Methods Ecol Evol, 5). # The data are from caterpillar herbivores (Wang et al, 2019, Journal of Ecology, 107), # and are available from the NCBI repository, which we can interface with at the command line using e-utilities. esearch -db nuccore -query 'MN131188:MN132787[ACCN]' -mindate 2017 -maxdate 2022 | efetch -format fasta > BEF_Leps.fas # There should be about 1600 sequences in Fasta format. # Some have species names, some higher rank labels, and there is a lot of conspecific data. # The description lines for NCBI data are rather unwieldy, we can simplify with a bash command: sed "s/>\(\S*\)\s\(\S*\)\s\(\S*\).*/>\2_\3_\1/" BEF_Leps.fas > BEF_Leps.fas2 # Common processes to conduct on this type of community barcode data would be OTU clustering and taxonomic assignment. # We should at minimum here conduct similarity clustering, # since many phylogenetic implementations assume a range of sequence variation # that does not include a large proportion of identical or near identical entries, # and indeed will often crash when presented with such. # Some might regard this method (hierarchical clustering according to pairwise sequence similarity) as simple, # though the software tool is very robust and a good one to have in a set of options. # Blastclust is part of the legacy Blast toolkit (although the legacy tools were superseded many years ago by Blast+, # a predecessor to Blastclust was not part of these). # You can still obtain these from the Blast legacy folder of the NCBI ftp site, above. blastclust -i BEF_Leps.fas2 -o BEF_Leps.fas2.uniques -S 99.0 -p F -e F -a 4 -L .99 # The output of Blastclust is named BEF_Leps.fas2.uniques, it contains all the IDs included in the input fasta file, # arranged as OTUs per line, and ordered upper to lower according to number of members in each OTU. # We need to take a representative sequence for each OTU, the following Perl script will do this. # We will use the -relabel option to make the OTU easier to find. perl remove_fasta_entries.pl -fasta_in BEF_Leps.fas2 -fasta_out BEF_Leps.fas2.OTU.fas -rm_list BEF_Leps.fas2.uniques -regex 0 -relabel 1 -retain -blastclust_list 1 # We now need reference data, which is something to profile our plot OTU against. # Download the insectphylo reference synthesis phylogeny and corresponding COI alignment, # from https://insectphylo.org/lepidoptera_v2/. # Or, from the command line: wget https://insectphylo.org/wp-content/uploads/2023/12/Lepidoptera_v2.NEWICK.txt wget https://insectphylo.org/wp-content/uploads/2023/12/Lepidoptera_v2.FASTA_.txt # Next we need to align the queries to the references (known as 'profile alignment'). # COI is with rare exceptions length invariant. Nonetheless, errors often creep in when aligning thousands of any sequence. # Clustal Omega is pretty robust and should manage this task without any major error. # Other options for high throughput profile alignment include Pynast, aligner.pl. clustalo-1.2.4-Ubuntu-x86_64 --profile1=Lepidoptera_v2.FASTA_.txt --in=BEF_Leps.fas2.OTU.fas -outfile=Refs_and_Queries.clo2 --seqtype=DNA --iter=2 --outfmt=fa --threads=1 --log=clustallogfile # It is good practice to check alignments by eye. Many are fond of BioEdit for sequence viewing, though it is a Windows tool. # For Linux you might try ugene. # Roughly the first quarter of sequences in the file are the queries, # these have accessions (note they are not species filtered), # and the latter 3 quarters are the references, these are species filtered so accessions can be removed. # Under default alignment parameters of Clustal Omega there may be some errors (gap inserted where shouldnt be), # on the 5 prime end of the queries. # Clustal Omega uses HMMs and doesnt have gap parameters which are common in other software. But 2 iterations instead of the default 1 and the result should be near perfect. ugene -ui # Next place the queries onto the reference phylogeny, for which we will use Raxml version 8. # Input for this step is the COI alignment (default input format to Raxml 8 is usually reported as Phylip, # though it will also read Fasta), and the fixed reference tree. # There is more than a single way this can be done, here we will run a regular tree search fixing the reference topology. # Alternative is an evolutionary placement algorithm although the implementations for this are less robust. # The search is split into 2 parts otherwise Raxml often crashes when doing both of these. # Running time will be about 3 hours, and require about 1 GB. raxmlHPC-8.2.4 -F -g Lepidoptera_v2.NEWICK.txt -s Refs_and_Queries.clo2 -n Refs_and_Queries1 -m GTRCAT -c 24 -p 123 -o Micropterna_lateralis raxmlHPC-8.2.4 -f e -t RAxML_result.Refs_and_Queries1 -s Refs_and_Queries.clo2 -n Refs_and_Queries2 -m GTRCAT -c 24 -p 123 -o Micropterna_lateralis # Output file is RAxML_result.Refs_and_Queries2, # check in a tree veiwer, I recommend archeopteryx. We have added the string 'OTU_' to the queries # so these can be easily highlighted and observed. java -cp ~/software/archeopteryx/forester_1034.jar org.forester.archaeopteryx.Archaeopteryx -c ~/software/archeopteryx/config_file RAxML_result.Refs_and_Queries2 # Calculation of phylogenetic diversity indices from placed plot data will be covered in a separate pipeline.