Tuesday, March 4, 2014

old school permutation test for phylogenetic signal in a Maximum Parsimony tree

Some old-school phylogenetics in R with Maximum Parsimony- still useful on occasion. The idea of MP is to find the shortest possible tree, given the data (or the consensus of equally short trees). The MP (most parsimonious) tree is considered the most likely. But how do you test whether an MP tree is statistically significant? It is usual to run bootstrap re-sampling of the data and see how many times particular branches appear, but what about testing for phylogenetic signal across the tree?

Below is one approach that tests whether the tree is any shorter than if the relationships were random. The test uses a (non-parametric) permutation, where character states at each locus are randomly permuted among the taxa. We can conclude the tree has strong phylogenetic signal if it is significantly shorter than trees generated from permuted data. The test is known as PTP or 'permutation tail probability'. You should check literature on pros and cons and remember that this method is relatively crude and will only really single out a pretty poor data set.


First generate a MP tree. The approach is to start with a NJ tree which is then optimised with MP.
Some dummy data:

Copy and paste the following into a text editor, save as plain text, then change the file extenstion to .fasta (I'm sure there is a way to paste directly into R in required format, but this works for an example)


>Species1
AAAAGAATAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AA-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTTCG-ATTTCTTTATC-AC-AATT-CTAAATTCGAATAC-AATTCGAATATTATTCT
ATTC-GATAATAAA-----TCTTAAATATTTTT-AGATAGTCAAATTTT------CTTTTTTATTTTTGATTTTGAATTC
AT--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATAT------TTTCTATCCTATATATTT-TTA-ATTCTATA
TTTCT-TATTTCTGTTTCGAATTCGATTGATTGGTTATAAGGAATCAGGAC-ATTCCTTGGGCTTTTCA-TTCCGGAAAA
GGGGGAAGTTAAGAAAAAAAA----------TAATCGACCCTTCAA-GTATCCCCAATTCCATGGGAAAAG-TGGCGGAA
AG-AGAAAGATATAT-ATGGGG-TATATATCAATCCATATTGAATTGCGGATACAGAAATGATAAAATCCAATTTGATTG
G-ATCAAATAAGGGTTTCTATAGGTAAGAAAGAAAAG-ACTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCAGTATAAATGAAAGAA---------GCAATCCTGAGCCAAATCCT-GTTTTCTCAAAACAAAGGT-----
TT-AAAAAACGAAAAAAAAAA-----GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCGT
TGGTAGAGGAACCTTTCCATGGAAACTTCAGAAAGGA---TGAAGGATAAACGTATCTATTGAATACTATATCA-----A
ATGATTAATGATGGTCCGAATCTGTATTTGTATTTTTTAATATAAAAAATGGAAGAATTGGTGTGAATTGATTCCACATT
GAAGAAAAAATCGAATATTCATTCAGCAAATGATTCACTCCAGAAGTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCCCCTTTTTCGTTATCGGTTCGAAATTCCTTTATCTTTCTGATTCTTTGACAAA------
CGT-ATTTGGGCGCAAA----GGACTTTCTCTTATCACATGGGATATAGAATACACATCCAAATTAAGC-AAGGAATCCC
TTTTTGAAGGATTCACAA-TCAATAGCATTAC-TCATACT----------GAAA--CTTATAAAGTCGTCTTTTT-GAAG
ATCCACGAAATTAGAGGACTTGGAGAAAACTTTGCAATTCCCCTTG-CCCCTTTAATTGACATAGACTCCAGTCATCTAA
TAAAATGAGGATGGGATGCTACATTGGGAATGGTCGGGATA-----GGGGGGCCCCCCG
>Species2
AAAAAA-TAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTAATAAATATTGGATATTAT-AG-AA-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-CTAAATTCGAATAC-AATTCGAATATTATTCT
ATTC-GATAGTCAA-----TCTTAACTATTTTT-AGATAGTCAAATTTTTTTT---------ATTTAAGATTTTGAATTC
AC--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATATCTATATTTTCTATCCTATATATTTCTTATATT-TCTA
ATTCTATATTTC--TTTCGAATTAGATTGATTGGTTATAACTAATCAG-AC-ATTCCTCAG-CTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAGAAA---------GAATCGACCCTTCAA-GTATCCCAAATTGCATGGGAAAAA-TGGTAGGA
AG-AGAAAGATATAT-ATGGGG-TATATATCAATCTATATTGAATTGCCGATACAGAAATGATAAAATCCAATTGGATTG
G-ATCAAATCCGGGTTTCTATAGGTAAGAAAGAAAAT-ACTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCAGTATAAATGAAAGAA---------GTAATCCTGAGCCAAATCCC-GTTTTCTCAAAACAAAGGTAAAGG
TTCAAAAAACGAAAAAAAAAA-----GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCCC
TGGTAGAGGCATCTTTGCATGGAAACTTCCGAAAGGA---TGAAGGATAAATGTATCTATTCAATACTATATCATT---A
ATGATTAATGATGGTCCGAATCTGTATTTGTATTTTTTAATATGAAAAATGGAAGAATTGGTGTGAATTGATTCCACATT
GAAGAAAAAGTCGAATATTCAGTCAGCAAATGATTCACTCCATA-GTCCAATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCCCTTTTTTCATTAGCGATTCCAAATTCCTTTATCTTTCTGATTCTTTGACAAA------
CGT-ATTTGGGCGTAAA----TGCCTTTCTCTTATCACATGTGATATAGAATACACATCCAAATTATGC-AAGGAATCCC
TTTTTGAATGATTCACAA-TCAAT-------------------------------------------------TT-GAAG
ATCCACGAAATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCCTTTAATTGACATAGACCCCAGTCATCTAA
TAAAATGAGGACGGGATACTACATTGGGAATGGTCGGGATA-----GGGGGGCCCC-CG
>Species3
AAAAAA-TAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AG-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-CTAAATTCGAATAC-AATTCGAATATTATTCT
ATTC-GATAGTCAA-----TCTTAAATATTTTT-AGATAGTCAA-TTTTTTTT--CTTTTT-------GATTTTGAATTC
AC--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATAT------TTTCTATCCTATATATTT---ATATT-TCTA
ATTCTATATTTC--TTTGGAATTCGATTGATTGGTTATAACTAATCAG-AC-ATTCCTCGG-CTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAAAAAAA-------GAATCGACCCTTCAA-GTACCCCCAATTGCATGGGAAAAA-TGGTAGGA
AG-AGAAAGATATATTATGGAG-TATATATCAATCTATATTGAATTACCGATACAGAAATGATAAAATTCAATTTGATTG
G-ATCAAATATGGGTTTCTATAAGTAAGAAAA----TGACTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCGGTATAAATGAAAGAA---------GTAATCCTGAGCCAAATCCC-GTTTTCTCAAAACAAAGGTAAAGG
TTCAAAAAACGAAAAAA---------GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCGT
TGGTAGAGGAATCTTTGCATGGAAACTTCCGAAAGGA---TAAAGG-------TATCTATTGAATACTATATCA-----A
ATGATTAATGATGGTCCGAATCTGTATTTGTATTTTTTAATATGAAAAATGGAAGAATTGGTGTTAATTGATTCCACATT
GAAGAAAAAATCGAA-----------CAAATGATTCACTCCATA-GTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGAGAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCCCTTTTTTCATTAGCGGTTCCAAATTTCTTTATCTTTCTGATTCTTTGACAAACA----
--T-ATTTGGGCATAAA----TGACTTTCTCTTATCACATGTGATATAGAATACACATCCAAATTACGC-AAGGAATCCC
TTTTTGAATGATTCACAA-TCAATAGCATTAC-TCATACT----------GAAAAACTTTGAAAGGCGCCTTTTT-GAAG
ATCCACGAAATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCCTTTAATTGACATAGACCCCAGTCATCTAA
TAAAATGAGGATGGGATGCTACATTTGGAATGGTCGGGATA-----CCCCCGCCCCCCG
>Species4
AAAAAA-TAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AG-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-CTAAATTTGAATAC-AACTTGAATATTATTCT
ATTC-GATAGTCAA-----TCTTAAATATTTTT-AGATAGTCCA-TTTTTTT---CTTTTT-------GATTTTGAATTC
AC--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATAT------TTTCTATCCTATATATTT----------CTA
ATTCTATATTTC--TTTGGAATTCAATTGATTGGTTATAACTAATCAG-AC-ATTCCTCGG-TTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAAAAAAAAAAAAAA---TCGACCCTTCAA-GTATCCCTAATTACATG-GAAAAA-TGGTATGA
AG-AGAAAGATATAT-ATGGGG-TATATATCAATCTATATTGAATTGCCGATACAGAAATGATAAAATCCAATTTGATTG
G-ATCAAATATGGGTTTCTATAAGTAAGAAAA----T-ATTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCGGTATAAATGAAAGAA---------GTAATCCTGAGCCAAATCCCCGTTTTCTCAAAACAAAGGTAAAGG
TTCAAAAAACGAAAAAAAA-------GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCGT
TGGTAGAGGAATCTTTGCATGGAAACTTCCGAAAGGA---TGAAGGATAAACGTATCTATTGAATACTATATCA-----A
ATGATTAATGATGGTCCGAATCTCTATTTGTATTTTTTAATATGAAAAATGGAAGAATTGGTGTGAATTGATTCCACATT
GAAGAAAAAATCGAA-----------CAAATGATTCACTCCATA-GTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCCTTTTTTTCATTAGCGGTTCCAAATTCCTTTATCTTTCGGATTCTTTGACAAA------
CGT-ATTTGGGCGTAAA----TGACTTTCTCTTATCACATGTGATATAGAATACACATCCAAATTACGC-AAGGAATCCC
TTTTTGAATGATTCACAA-TCAATAGCATTAC-TCATACT----------GAAAAACTTGGAA-GGCGCCTTTTT-GAAG
ATCCACGAAATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCCTTTAATTGACATAGAGCCCAGTCATCTAA
TAAAATGAGGATGGGATGCTACATTGGGAATGGTCGGGATA-----CCCCCCCCCCGCC
>Species5
AAAAAA-TAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AA-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-ATAAATTCGAATAC-AATTCAAATATTATTCT
ATTC-GATAGTCAA-----TCTTAAATATTATT-AGATAGTCAA-TTTTTTTT--CTTTTT-------GATTTTGAATTC
AC--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATAT------TTTCTATCCTATATATTT----------CTA
ATTCTATATTTC--TTTGGAATTCGATTGATTGGTTATAACTAATCAG-AC-ATTCCTCGG-CTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAAAAAA--------GAATCGACCCTTCAA-GTATCCCCAATTACATG-GAAAAA-TGGTATGA
AG-AGAAAGATATAT-ATGGGG-TATATATCAATCTATATTGAATTGCCGATACAGAAATGATAAAATCCAATTTGATTG
G-ATCAAATATGGGTTTCTATAAGTAAGAAAA----T-ACTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCGGTATAAATGAAAGAA---------GTAATCCTGAGCCAAATCCC-GTTTTCTCAAAACAAAGGTAAAGG
TTCAAAAAACGAAAAAAAAA------GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCGT
TGGTAGAGGAATCTTTGCATGGAAACTTCCGAAAGGA---TGAAGGATAACCGTATCTATTGAATACTATATCA-----A
ATGATTAATGATGGTCCGAATCTGTATTTGTATTTTTTAATATGAAAAATGGAAGAATTGGTGTGAATTGATTCCACATT
GAAGAAAAAATCGAA-----------CAAATGATTCACTCCATA-GTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCCCTTTTTTCATTAGCGGTTCCAAATTCCTTTATCTTTCGGATTCTTTGACAAA------
CGT-ATTTGGGCGTAAA----TGACTTTCTCTTATCACATGTGATATAGAATACACATCCAAATTACGC-AAGGAATCCC
TTTTTGAATGATTCACAA-TCAATAGCATTAC-TCATACT----------GAAAAACTTGGAA-GGCGCCTTTTT-GAAG
ATCCTCGAAATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCCTTTAATTGACATAGACCCCAGTCATCTAA
TAAAATGAGGGTGGGATGCTACATTGGGAATGGTCGGGATA-----CCCCCCCCCCGCC
>Species6
AAAAAA-TAGAATCTATAATTCCAAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AG-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-CTAAATTCGAATAC-AATTCGAATATTATTCT
ATTC-GATAATCAA-----TCTTAAATATTTTT-AGATAGTCAA-TTTTTTTT--CTTTTT-------GATTTTGAATTC
AC--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATAT------TTTCTATCCTATATATTT----------CTA
ATTCTATATTTC--TTTGGAATTCGATTGATTGGTTATAACTAATCAG-AC-ATTCCTCGG-CTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAAAAAAA-------TAATCGACCCTTCAA-CTATACCAAATTACATG-GAAAAA-TGGTATGA
AG-AGAAAGATATAT-ATGGGG-TATATATCAATCTATATTGAATTGCCGATACAGAAATGATAAAATCCAATTTGATTG
G-ATCAAATATGGGTTTCTATAAGTAAGAAAA----T-ACTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCGGTATAAATGAAAGAA---------GTAATCCTGAGCCAAATCCC-GTTTTCTCAAAACAAAGGTAAAGG
TTCAAAAAACGAAAAAAAAA------GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTCACTGCGT
TGGTAGAGTAATCTTTGCATGGAAACTTCCGAAAGGA---TGAAGGATAAACGTATCTATTGAATACTATACTATATCAA
ATGATTATTGATGGTCCGAATCTGTATTTGTATTTTTTACTATGAAAAATGGAAGAATTGGTGTGAATTGATTCCACATT
GAAGAAAAAATCGAA-----------CAAATGATTCACTCCATA-GTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCACTTTTT-CATTAGCGGTTCCAAATTCCTTTATCTTTCTGATTCTTTGACAAAGACAAA
CGT-ATTTGGGCGTAAA----TGACTATATCTTATCACATGTGATATAGAATACACATCCAAATTACGC-AAGGAATCCC
TTTTTGAATGATTCACAA-TCAATAGCATTAC-TCATACT----------GAAAAACTTGGAAAGGCGCCTTTTT-GAAG
ATCCACGAAATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCTTTTAATTGACATAGACCCCAGTCATCTAA
TAAAATGAGGATGGGATGCTACATTGGGAATGGTCGGGATA-----CCCCCCGCGGCCC
>Species7
AAAAAA-TAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AA-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-CTAAATTCGAATAC-AATTCGAATATTATTCT
ATTC-GATAGTCAA-----TCTTAAATATTTTT-AGATAGTCAA-TTTTTTTT--CTTTTT-------GATTTTGAATTC
AC--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATAC------TTTCTATCCTATATATTT----------CTA
ATTCTATATTTC--TTTTGAATTCGATTGATTGGTTATAACTAATCAG-AC-ATTCCTCGG-CTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAAAAA---------GAATCGACCCTTCAA-GTATCCCCAATTATATGATAAAAA-TTGTATGA
AG-AGAAAGATATAT-ATGGGG-TATATATCAATCTATATTGAATTGCCGATACAGAAATGATAAAATCCAATTTGATTG
G-ATCAAATATGGGTTTCTATAAGTAATAAAA----T-ACTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCGGTATAAATGAAAGAA---------GTAATCCTGAGCCAAATCCC-GTTTTCTCAAAACAAAGGTAAAGG
TTCAAAAAACGAAAAAAAAAA-----GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCGT
TGGTAGAGGAATCTTTGCATGGAAACTTCCGAAAGGA---TGAAGGATAAACGTATCTATTGAATACTATATCA-----A
ATGATTAATGATGGTCCGAATCTGTATTTGTATTTTTTAATATGAAAAATGGAAGAATTGGTGTGAATTGATTCCACATT
GAAGAAAAAATCGAA-----------CAAATGATTCACTCCATA-GTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATC----------CCCCTTTTT-CATTAGCGGTTCCAAATTCCTTTATCTTTCTGATTCTTTGACAAAGACAAA
CGT-ATTTGGGCGTAAA----TGACTTTCTCTTATCACATGTGA-ATAGAATACACATCCAAATTACGC-AAGGAATCCC
TTTTTGAGTGATTCACAA-TCAATAGCATTAC-TCATACT----------GAAAAACTTGGAAAGGCGCCTTTTT-GAAG
ATCCACGAGATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCCTTTAATTGACATAGACCCCAGTCATCTAA
TAAAATGAGCATGGGATGCTACATTGGGAATGGTCGGGATA-----CCCCCCCGGGCCG
>Species8
AAAAA--TAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AG-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-CTAAATTCGAATAC-AATTCGAATATTATTCT
ATTC-GATAGTCAA-----TCTTAAATATTTTT-AGATAGTCAA-TTTTTTTTT-CTTTTT-------GATTTTGAATTC
AC--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATAT------TTTCTATCCTATATATTT----------CTA
ATTCTATATTTC--TTTGGAATTCGATTGATTGGTTATAACTAATCAG-AC-ATTCCTCGG-CTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAAAAAA--------GAATCGACCCTTCAA-GTATCCCCAATTACATG-GAAAA--KGGTATGA
AR-AGAAAGATATAT-ATGGGG-TATATATCAATCGATATTGAATTGCCGATACAGAAATGATAAAATCCAATTTGATKG
G-ATCAAATATGGGTTTCTATAAGTAAGAAAA----T-ACTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCGGTATAAATGAAAGAA---------GTAATCCTGAGCCAAATCCC-GTTTTCTCAAAACAAAGGTAAAGG
TTCAAAAAACGAAAAAAAAA------GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCAT
TGGTAGAGGAATCTTTGCATGGAAACTTCCGAAAGGA---TGAAGGATAAACGTATCTATTAAATACTATATCA-----A
ATGATTAATGATGGTCCGAATCTGTATTTGTATTTTTTAATATGCAAAATGGAAGAATTAGTGTGAATTGATTCCACATT
GAAGAAAAAATTGAA-----------CAAATGATTCACTCCATA-GTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGCAATTTTTAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCCCTTTTTTCATTAGCGGTTCCAAATTCCTTTATCTTTCGGATTCTTTGACAAAC-----
-GT-ATTTGGGCGTAAA----TGACTTTCTCTTATCACATGTGATATAGAATACACATCCAAATTACGC-AAGGAATCCC
TTTTTGAATGATTCAAAA-TCAATAGCATTAC-TCAGACT----------GAAAAACTTTGAA-GGCGCCTTTTT-GAAG
ATCCACGAAATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCCTTTAATTGACATAGACCCCCGTAATCTAA
TAAAATTAGGATGGGCTGCTACATTGGGAATGGTCGGGATA-----CCCCCCCCCCGGC
>Species9
AAAAAA-TAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AA-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-CTAAATTCGAATAC-AATTCGAATATTATTCT
ATTC-GATAGTCAA-----TCTTAAATATTTTT-AGATAGTCAA-TTTTTTT---CTTTTT-------GATTTTGAATTC
AC--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATAC------TTTCTATCCTATATATTT----------CTA
ATTCTATATTTC--TTTTGAATTCGATTGATTGGTTATAACTAATCAG-AC-ATTCCTCGG-CTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAAAAA---------GAATCGACCCTTCAA-GTATCCCCAATTATATGATAAAAA-TTGTATGA
AG-AGAAAGATATAT-ATGGGG-TATATATCAATCTATATTGAATTGCCGATACAGAAATGATAAAATCCAATTTGATTG
G-ATCAAATATGGGTTTCTATAAGTAATAAAA----T-ACTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCGGTATAAATGAAAGAA---------GTAATCCTGAGCCAAATCCC-GTTTTCTCAAAACAAAGGTAAAGG
TTCAAAAAACGAAAAAAA-------GGGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCGT
TGGTAGAGGAATCTTTGCATGGAAACTTCCGAAAGGA---TGAAGGATAAACGTATCTATTGAATACTATATCA-----A
ATGATTAATGATAGTCCGAATCTGTATTTGTATTTTTTAATATGAAAAATGGAAGAATTGGTGTGAATTGATTCCACATT
GAAGAAAAAATCGAA-----------CAAATGATTCACTCCATA-GTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCCCTTTTT-CATTAGCGGTTCCAAATTCCTTTATCTTTCTGATTCTTTGACAAAGACAAA
CGT-ATTTGGGCGTAAA----TGACTTTCTCTTATCACATGTGA-ATAGAATACACATCCAAATTACGC-AAGGAATCCC
TTTTTGAGTGATTCACAA-TCAATAGCATTAC-TCATACT----------GAAAAACTTGGAAAGGCGCCTTTTT-GAAG
ATCCACGAGATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCCTTTAATTGACATAGACCCCAGTCATCTAA
TAAAATGAGCATGGGATGCTACATTGGGAATGGTCGGGATA-----CCCCCCCGGGCCG
>Species10
AAAAAA-TAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AA-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-CTAAATTCGAATAC-AATTCGAATATTATTCT
ATTC-GATAGTCAA-----TCTTAAATATTTTT-AGATAGTCAA-TT---------------------------------
-------------GAAATTCTT---TTTGTTACACTTCTATAT------TTTCTATCCTATATATTT----------CTA
ATTCTATATTTC--TTTGGAATTCGATTGATTGGTTATAACTAATCAG-ACAATTCCTCGG-CTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAAAAA------------TCGACCCTTCAA-GTATCCCCAATTACATGAGAAAAA-TTGTATGA
AG-AGAAAGATATAT-ATGGGGGTATATATCAATCTATATTGAATTGCCGATACAGAAATGATAAAATCCA-TTTGAATG
GGATCAAATATGGGGTTCTATAAGTAATAAAA----T-CCTTTTTTCGAG------------------------------
-----------------------------------GTAATCCTGAGCCAAATCCC-GTTTTTTCAAAACAAAGGTAAAGG
TTCAAAAAACAAAAAAAA--------GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCGT
TGGTAGAGGAATCCTTGCATGGAAACTTCCGAAAGGA---TGAAGGATAAACGTATCTATTGAATACTATATCA-----A
ATGATTAATGATGGTCCGAATCTGTATTTGTATTTTTTAATATGAAAAATGGAAGAATTGGTGTGAATTGATTCCACATT
GAAGAAAAAATCGAA-----------CAAATGATTCACTCCATA-GTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCACTTTTT-CATTAGCGGTTCCAAATTCCTTTATCTTTCTGATTCTTTGACAAAGACAAA
CGT-ATTTGGGCGTAAA----TGACTTTATCTTATCACATGTGATATAGAATACACATCCAAATTACGC-AAGGAATCCC
TTTTTGAATGATTCACAA-TCAATAGCATTAC-TCATACT----------GAAAAACTTGGAAAGGCGCCTTTTT-GAAG
ATCCACGAAATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCTTTTAATTGACATAGACCCCAGTCATCTAA
TAAAATGAGGATGGGATGCTACATTGGGAATGGTCGGGATA-----CCC-CCCCGGCCG




###
###R code

require(picante)
require(phangorn)

DNA <- read.dna("DNA.fasta", format="fasta")

DNA2 <- as.phyDat(DNA) #for optim.parsimony function

DIST_DNA <- dist.dna(DNA, model="TN93") #Nei's genetic distance
tree_DNA <- nj(DIST_DNA)

x <- as.vector(DIST_DNA); y <- as.vector(as.dist(cophenetic(tree_DNA)))
plot(x, y, xlab="pairwise genetic distances", ylab="pairwise distances on NJ tree", pch=20, cex=2); abline(lm(y~x), col="red", lwd=2); text(0.04, 0.11, round(cor(x,y)^2, 3)) #visualise how well raw distances are represented in the tree

Visual comparison of genetic distances versus distance on tree


parsim_DNA <- optim.parsimony(tree_DNA, DNA2, trace=3) #optimise the nj tree with MP

#optional
#parsim_DNA_root <- root(parsim_DNA, outgroup=1) #root tree if you like

#plot(parsim_DNA_root)


Tree to be tested


#BOOT1 <- boot.phylo(parsim_DNA_root, DNA, FUN = function(xx) nj(dist.dna(xx)), trees=TRUE, B=1000) #1000 bootstrap reps

#CONSENSUS <- consensus(BOOT1$trees, p = 0.5, check.labels = FALSE)


##############
#FUNCTION seqPerm:
#non-parametric permutation test for shorter tree length with MP optimisation versus character states randomly permuted among taxa
#

#the tree being testing is unrooted
#the DNA data should be in DNAbin format and the tree to be tested in phylo format

sample_func <- function(x
{
sample(x, length(x))
}


PFUNCTION <- function(x, y, z)
{
   if(x <= y) z <- z+1
   return(z)
}


seqPerm <- function(x, y, nperm) #x DNA data matrix, y tree to test, nperm required number of permutations
{

if(class(x)!="DNAbin")
{return(print("DNA sequence data must be in DNAbin format"))}

if(class(y)!="phylo")
{return(print("Tree to be tested must be class 'phylo'"))}

nGT <- 1

data2 <- as.phyDat(x)

PARS1 <- parsimony(y, data2)

for(i in 1:nperm)
{
DNA_perm <- as.matrix(apply(x, 2, sample_func))
row.names(DNA_perm) <- row.names(x
class(DNA_perm) <- "DNAbin"
DNA_perm2 <- as.phyDat(DNA_perm)

perm_tree <- nj(dist.dna(DNA_perm, model="TN93")) 

delete <- optim.parsimony(perm_tree, DNA_perm2, trace=0)


PARS2 <- parsimony(delete, DNA_perm2)

nGT <- PFUNCTION(PARS2, PARS1, nGT)
}


P <- nGT/(nperm+1)

cat('Prob (',nperm,'permutations) =',P,'\n','\n')
return(P.perm=P )
}



seqPerm(DNA, parsim_DNA, 1000)
#computation can be a little slow with many permutations


Example of a tree based on permuted data



No comments:

Post a Comment