Wednesday 9 December 2015

More thinking about the bootstrap - systematic bias.

The reasoning for resampling the columns of the data matrix, rather than the rows as you would in most bootstrap cases, is that the rows (the sequences) are related through the tree and so cannot be considered independent, whereas the columns (the sites) are assumed to be.
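To make the column resampling concrete, here is a minimal sketch in Python. The toy alignment and the function name `bootstrap_columns` are made up for illustration; real phylogenetics packages have their own implementations, and this is not taken from any of them.

```python
import random

def bootstrap_columns(alignment, rng):
    """Build one pseudo-replicate by drawing alignment columns with replacement.

    Every row (sequence) is kept; only the columns (sites) are resampled.
    """
    n_cols = len(alignment[0])
    # Draw as many column indices as there are columns, with replacement.
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    # Apply the same picks to every sequence so columns stay intact.
    return ["".join(seq[i] for i in picks) for seq in alignment]

# Hypothetical toy alignment: 4 sequences, 6 sites.
alignment = ["ACGTAC", "ACGTTC", "ACCTAC", "TCGTAC"]
rng = random.Random(1)
replicate = bootstrap_columns(alignment, rng)

for seq in replicate:
    print(seq)
```

Each replicate has the same sequences and the same length as the original alignment; some sites appear more than once and others not at all. A tree would then be built from each replicate and the splits compared.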

I am not sure I quite follow this logic. A split between two groups of sequences might be based on a pattern of changes across several columns rather than on a single column, so the tree topology depends on multiple columns that are definitely not independent. So doesn't the tree structure mean that both the rows AND the columns are not independent?

Anyway, more serious than this is that the bootstrap resamples the data to try to capture the real population, and so it assumes that the sequences that form the alignment, and that you are using to generate the tree, are a random sample. This is sadly almost NEVER true. We sample the low-hanging fruit. We sample organisms that we like, that we can culture, that are model systems. We do not sample randomly and without bias. The way sequences and genomes are collected introduces a systematic error into everything we do.

The bootstrap cannot deal with this. If there are parts of the population that are not sampled at all, then no number of bootstrap replicates can model them. This is a Black Swan problem. Your bootstrap values will only tell you that this result, from this data with this model and this method, is strongly supported as reproducible. They say nothing about whether it is a biological truth.
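A numeric analogy may help show why more replicates do not fix a biased sample. The sketch below is an assumption-laden toy, not a phylogenetic analysis: it invents a population made of two groups, draws the sample only from the "easy" group, and bootstraps the mean. All names and numbers are hypothetical.

```python
import random

rng = random.Random(0)

# Hypothetical population: one group we can reach, one we never sample.
sampled_group = [rng.uniform(0.0, 1.0) for _ in range(500)]
unsampled_group = [rng.uniform(9.0, 10.0) for _ in range(500)]
population = sampled_group + unsampled_group
true_mean = sum(population) / len(population)

# Biased collection: every observation comes from the reachable group.
sample = rng.sample(sampled_group, 50)

def bootstrap_means(data, n_reps, rng):
    """Resample the data with replacement n_reps times and return the means."""
    n = len(data)
    return [sum(rng.choices(data, k=n)) / n for _ in range(n_reps)]

means = sorted(bootstrap_means(sample, 1000, rng))
# Approximate 95% percentile interval from the 1000 bootstrap means.
low, high = means[25], means[974]

print(f"bootstrap interval: {low:.2f} to {high:.2f}")
print(f"true population mean: {true_mean:.2f}")
```

The interval is narrow, so the estimate looks well supported, yet it sits nowhere near the true mean because the resampling can only reuse values that were collected in the first place.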
