Thursday 31 December 2015

Does peer review work?

As an outsider to the viral phylogenetics community who just does some work for people in the community, I get a pretty hard time from referees. If you are not part of the in-crowd, they will always find a reason why you have done something wrong. Some of my favourite examples: the cladograms are wrong (then use the data I deposited to draw a phylogram if you want, which shows the same thing in more confusing detail). One referee insists you have to carry out a ModelTest before building the tree (that would be the editor who wrote ModelTest, and has managed to get it to 14,000 citations), whereas the next referee says ModelTest is a worthless circular argument and pointless to carry out. The funniest was probably "the figures are not clear to read" - well, they are vector-graphic PDFs, not bitmaps like all the ones you publish yourself, so you could zoom them if you were not being obstructive and obfuscating.

The list of reasons goes on and on. The real question for an editor or referee considering a rejection is whether the science is right, not whether it could be presented better. Wrong figures or writing are grounds for corrections, not for rejections. I have edited 120 papers, and my default as an editor is to believe that the scientist carried out the work with honesty and integrity, and that if I were to do the same analysis I would get the same results. I believe in 100% transparency. That means: no anonymous referees, possibly post-publication peer review (the F1000 model), and publishing all referees' comments, even for rejected papers.

All of this would be amusing rather than frustrating until I see that someone publishes the same idea a few months later and I lose any credit for it. For these reasons everything I ever do goes straight into the PeerJ repository, so that I can always say: that is a nice peer-reviewed publication you have there, but it is a shame I put mine in the repository before you even submitted yours, and isn't it strange that you were probably the anonymous reviewer who made sure my paper was rejected while you got your very similar one published. In my other interest as a lawyer, they have a name for this practice. They call it fraud.

The politics of H5N8

I used to think that protein crystallography was a field full of back-stabbing political bastards, but they have nothing on the depths of spite and pathetic stupidity present in the viral phylogenetics community. As Sayre's law states, academic politics is so bitter because the stakes are so low, and no stakes are lower than those for H5N8.

H5N8 is a subtype of influenza that nobody cared about or studied until a big outbreak in 2014. This outbreak was important for two reasons.


  1. It showed that all the papers arguing that wild birds cannot spread highly pathogenic flu virus were wrong (sorry, Gaidet et al.)
  2. It allowed the Guangdong Goose H5 hemagglutinin to spread to North America.
That is it. No more interest. It actually reassorted to H5N2 pretty quickly in the US, as H5N8 occurs infrequently because it is not a preferred packaging of the virus; H5N1 is much more common and H5N2 also seems to be a better alternative.


Saturday 19 December 2015

H5N8 in Taiwan - Poor methods and not the best peer review.

I was critical of the sharing of data from the Taiwanese outbreak, but I have a few more problems with the paper that reports the analysis of those data. The paper says in the methods section that:

Phylogenetic analysis, as described previously (Lee et al., 2014a and Lee et al., 2014b), was performed using these full genome sequences and closely related sequences from GenBank, GISAID and the publicly available government website (http://ai.gov.tw/index.php?id=720), which gave the sequences of the 16 H5 viruses isolated by the Council of Agriculture (COA), Taiwan during the recent outbreaks.
Now let's look at those two papers by Lee from 2014 that contain the methods. The first one is a letter and so does not even have a methods section; the methods are only mentioned in the figure legend:
Phylogenetic tree of hemagglutinin (HA) genes of influenza A(H5N8) viruses, South Korea, 2014. Triangles indicate viruses characterized in this study. Other viruses detected in South Korea are indicated in boldface. Subtypes are indicated in parentheses. A total of 72 HA gene sequences were ≥1,600 nt. Multiple sequence alignment was performed by using ClustalW (www.ebi.ac.kr/Tolls/clustalw2). The tree was constructed by using the neighbor-joining method with the Kimura 2-parameter model and MEGA version 5.2 (www.megasoftware.net/) with 1,000 bootstrap replicates. H5, hemagglutinin 5; Gs/Gd, Goose/Guangdong; LPAI, low pathogenic avian influenza; HPAI, highly pathogenic avian influenza. Scale bar indicates nucleotide substitutions per site.
This uses neighbor-joining tree construction in MEGA 5.2 - and MEGA 6.06 was already available at the time.
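The legend's recipe is simple enough to reproduce outside MEGA. Here is a minimal sketch of the Kimura 2-parameter distance underlying it (the toy sequences are made up, not real H5N8 data); MEGA then builds the NJ tree from the matrix of such pairwise distances.

```python
import math

def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned nucleotide sequences.

    P = proportion of transitions (A<->G, C<->T), Q = proportion of
    transversions; d = -0.5*ln(1 - 2P - Q) - 0.25*ln(1 - 2Q).
    The logs blow up for very divergent pairs, as the correction demands.
    """
    purines, pyrimidines = {"A", "G"}, {"C", "T"}
    transitions = transversions = sites = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        sites += 1
        if a == b:
            continue
        if (a in purines) == (b in purines):
            transitions += 1   # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1
    p, q = transitions / sites, transversions / sites
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)

# Toy aligned pair: two transitions over ten sites gives d ~ 0.255
print(k2p_distance("ACGTACGTAC", "ACGTACATAT"))
```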

The second paper does have a methods section which says:
Molecular clock analysis. For the HA and NA genes, the genetic distance from the common ancestral node of the lineage to each viral isolate was measured from the ML tree and plotted against the sample collection dates. Linear regression was used to indicate the rate of accumulation of mutations over time. A more detailed evolutionary time scale for each virus gene phylogeny, with confidence limits, was obtained using relaxed molecular clocks under uncorrelated lognormal (UCLD) and exponential (UCED) rate distributions, implemented in a Bayesian Markov chain Monte Carlo (BMCMC) statistical framework (27), using BEAST, version 1.8 (28). The SRD06 nucleotide substitution model (29) and Bayesian Skyride demographic model (30) were used. Multiple runs were performed for each data set, giving a total of 6 × 10⁷ states (with 1 × 10⁷ states discarded as burn-in) that were summarized to compute statistical estimates of the parameters. Convergence of the BMCMC analysis was assessed in Tracer, version 1.6 (A. Rambaut, M. Suchard, and A. J. Drummond, 2013 [http://tree.bio.ed.ac.uk/software/tracer/]).
So this analysis was carried out with BEAST in a Bayesian framework. Which of these totally different methods was used in the current paper? It has to be the BEAST analysis, judging by the way the trees appear. But this also raises questions, as they talk about the Bayesian Skyride model. I think they mean the Bayesian Skyline model and are confused by this paper. Anyway, you should not be using the Skyline unless you are interested in hypotheses about viral demographics and phylodynamics. What do they mean by multiple runs? This shows a naivety in using BEAST. So while they might get the right results, they could have got them faster and more easily using a simpler coalescent model.
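For what it is worth, the standard practice when you do run multiple chains is to discard the burn-in from each run, check convergence per chain, and only then pool the samples. A rough Python sketch of that workflow follows; the file names, the "likelihood" column, and the 1/6 burn-in fraction are assumptions mirroring the numbers quoted above, and the ESS estimator is the usual autocorrelation one, roughly what Tracer reports.

```python
import numpy as np

def read_beast_log(path, column):
    """Read one numeric column from a BEAST .log file: tab-separated,
    comment lines start with '#', first non-comment row is the header."""
    with open(path) as fh:
        rows = [line.rstrip("\n").split("\t")
                for line in fh if not line.startswith("#")]
    header, body = rows[0], rows[1:]
    idx = header.index(column)
    return np.array([float(row[idx]) for row in body])

def ess(x):
    """Effective sample size from the initial positive autocorrelations."""
    x = x - x.mean()
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (x.var() * n)
    tau = 1.0
    for rho in acf[1:]:
        if rho <= 0:        # truncate at the first non-positive lag
            break
        tau += 2.0 * rho
    return n / tau

# Hypothetical file names; drop the first 1/6 of each chain as burn-in,
# mirroring the 1e7 of 6e7 states quoted in the methods.
chains = []
for path in ["run1.log", "run2.log", "run3.log"]:
    samples = read_beast_log(path, "likelihood")
    chains.append(samples[len(samples) // 6:])

for i, chain in enumerate(chains, 1):
    print(f"run {i} ESS:", ess(chain))   # each run must converge on its own
print("pooled ESS:", ess(np.concatenate(chains)))  # only then combine
```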

What is of larger concern is that in both the Taiwanese outbreak paper and the second Lee paper the referees did not notice the error of citing two different methodologies, or the incorrect use of the Bayesian Skyline. So much for the peer review process improving science.

Friday 18 December 2015

Lessons from crystallography for viral data collection

I remember what a huge impact Denzo had on the processing of experimental data. Denzo allows you to model errors in the collection, such as slippage and X-ray damage. When we take samples of sequences from a population there are experimental errors that we need to consider. The sorts of questions that we need to ask are:


  1. Is there a difference between the population of the virus in the host and that in the amplified sample?
  2. If the virus is a mixture of subtypes do the experimental methods favour one subtype over another?
  3. Does the sequencing technique favour AT or GC rich regions?
  4. Can we distinguish between sequencing error and point mutations? (See the sketch after this list.)
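Question 4 is the one with the cleanest quantitative handle: given a per-base error rate and the coverage at a site, a simple binomial tail test says how surprising a pile-up of non-reference bases would be under error alone. A minimal sketch, where the 0.1% error rate is an assumed ballpark rather than a measured value:

```python
from scipy.stats import binom

def minor_variant_pvalue(alt_reads, coverage, error_rate=0.001):
    """Probability of seeing >= alt_reads non-reference bases at one site
    from sequencing error alone, modelled as Binomial(coverage, error_rate).

    error_rate is an assumed per-base error rate; calibrate it for
    your own platform and pipeline before trusting the answer."""
    return binom.sf(alt_reads - 1, coverage, error_rate)

# 12 alternate reads at 1000x coverage: far too many to be error,
# so this looks like a genuine low-frequency point mutation.
print(minor_variant_pvalue(12, 1000))   # ~1e-9
# 2 alternate reads at 1000x: entirely consistent with error.
print(minor_variant_pvalue(2, 1000))    # ~0.26
```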
Recent work on improving the collection and monitoring of wild-bird avian influenza has shown that birds can be infected with multiple subtypes. In these cases, how do we know which segments match with which other segments? How do they mix and produce mature virus?

Wednesday 16 December 2015

A new low in data sharing: The Taiwanese Outbreak of H5N8

I am interested in the spread of H5N8 avian influenza, and after seeing the spread of cases in Taiwan on the OIE website I had been eagerly waiting for sequence data to emerge so I could carry out some detailed evolutionary analysis. I waited and waited and waited, and today my RSS feed from Google Scholar announced a paper showing the relationship to the Korean outbreak. The authors had been hoarding the data in order to publish it themselves, rather than making it more widely available and risking being pipped to publication, since we live by the law of publish or perish.

Now that it is published, the data must surely be available for everyone, as it no longer matters for priority of publication. I checked the sources: GenBank, GISAID (the world's first password-access-only resource for sharing publicly funded biological sequence data) and a Taiwanese government database - http://ai.gov.tw/index.php?id=720. It is so easy to find a database that is only in Chinese script with no English translation. That is the perfect way to share data.

Anyway, we should all know more languages, they have a good reason to publish in their native language, and Google can translate it anyway. So there is a page with the sequences, and here is the link: http://ai.gov.tw/index.php?id=704. There is just one slight catch. These are image files. In order to extract the sequences you are going to have to retype them all from the images - 1,600 characters for each entry! That is not data sharing. That is not a public database. That is not good practice or good science. That is obstruction, pure and simple.

Japan made its data available almost immediately, as well as issuing local warnings to farmers about the risk of wild birds spreading the new H5N8 variant. The result was a relatively small number of cases in domestic birds. The lack of transparency from the Taiwanese laboratories, however, contributed to the deaths of 3.2 million birds, including nearly 60% of the domestic goose population. This was an avoidable disaster that has cost farmers millions, and its scale could have been reduced by better sharing of information.

Sunday 13 December 2015

Reproducibility in molecular dynamics

I once asked David Osguthorpe for some advice when I was a PhD student. I was using Discover to carry out molecular dynamics simulations of a small peptide. He told me that the version Biosym were selling was unreliable and that you got different answers every time. He also told me that my loop simulations were wrong and that I must have forgotten to cap the peptide ends to get the peptide to fold into any loop form.

A few ideas and lessons came from this:


  1. I ignored my peptide simulation results and so they were never published. At the time it was the largest and longest peptide simulation ANYONE had done (this was 1994).
  2. The structure of the peptide bound to an enzyme became available and it was in the conformation that I had observed in the calculations! This structure was later retracted as it was based on a poor enzyme crystal structure (I corrected the structure in the PDB).
  3. I now know to run some simulations with the random seed fixed to check the reproducibility of the simulations. This separates computer/coding variation from simulated variation - see the sketch after this list.
  4. I have shown that, despite the variability, the simulations are following some sort of physical reality in that they obey the Arrhenius equation. This holds for an ensemble, an average over multiple simulations.
  5. Now I run all simulations a number of times, and it worries me how irreproducible they are when the seeds are not fixed. This seriously undermines the reproducibility of the field and supports my proposed second doctoral supervisor's reason for not taking the position on. His comment was that it is garbage in and garbage out.
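To make item 3 concrete, here is a toy sketch - not Discover, just a one-dimensional Langevin walker in a double-well potential - of the fixed-seed check: an identical seed must reproduce the trajectory bit for bit, while different seeds give the run-to-run spread that only ensemble averaging tames.

```python
import numpy as np

def langevin_trajectory(seed, n_steps=5000, dt=1e-3, kT=1.0, gamma=1.0):
    """Overdamped Langevin dynamics in a double-well potential
    U(x) = (x^2 - 1)^2; a stand-in for a real MD engine, which draws
    its random thermostat forces from a seeded generator the same way."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    x[0] = -1.0                                      # start in the left well
    noise = np.sqrt(2 * kT * dt / gamma)
    for i in range(1, n_steps):
        force = -4 * x[i - 1] * (x[i - 1] ** 2 - 1)  # -dU/dx
        x[i] = x[i - 1] + force * dt / gamma + noise * rng.standard_normal()
    return x

# Fixed seed: bit-for-bit identical trajectories -> a coding/platform check.
a = langevin_trajectory(seed=42)
b = langevin_trajectory(seed=42)
print("fixed seed reproducible:", np.array_equal(a, b))

# Different seeds: each trajectory differs, but the ensemble average of an
# observable (here, fraction of time in the right-hand well) is stable.
fractions = [(langevin_trajectory(seed=s) > 0).mean() for s in range(20)]
print("per-run spread:", np.std(fractions), "ensemble mean:", np.mean(fractions))
```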


Wednesday 9 December 2015

More thinking about the bootstrap - systematic bias.

The reasoning for resampling the columns of the data matrix rather than the rows, as you would in most bootstrap settings, is that because the sequences are related by a tree the rows are not considered independent, but the columns are.
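Concretely, the Felsenstein bootstrap resamples alignment columns like this - a minimal numpy sketch with a made-up alignment; a real analysis would rebuild the tree from each pseudo-alignment and tally how often each split recurs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy alignment: rows = taxa (not independent, they share a tree),
# columns = sites (assumed independent, which is what gets resampled).
alignment = np.array([list("ACGTACGTAC"),
                      list("ACGTACGTAT"),
                      list("ACGAACGTAT"),
                      list("TCGAACGAAT")])

n_sites = alignment.shape[1]

def bootstrap_replicate(aln):
    """One bootstrap pseudo-alignment: sample site columns with
    replacement, keeping every taxon's row intact."""
    cols = rng.integers(0, n_sites, size=n_sites)
    return aln[:, cols]

# In a real analysis each replicate would be fed to the tree-building
# step and split frequencies tallied across (say) 1,000 replicates.
replicate = bootstrap_replicate(alignment)
print("".join(replicate[0]))
```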

I am not sure I quite follow this logic, because a split between two groups of sequences might be supported by a pattern of changes rather than a single column, and so the tree topology depends on multiple columns that are definitely not independent. So doesn't the tree structure mean that both the rows AND the columns are not independent?

More serious than this, though, is that the bootstrap resamples the data to try to capture the real population, and so it assumes that the sequences forming the alignment, and that you are using to generate the tree, are a random sample. This is sadly almost NEVER true. We sample the low-hanging fruit. We sample organisms that we like, that we can culture, that are model systems; we do not sample randomly and without bias. The collection of sequences and genomes introduces a systematic error into everything we do.

The bootstrap cannot deal with this. If there are aspects of the population that have not been sampled at all, then no number of bootstrap replicates can model them. This is a Black Swan problem. Your bootstrap values only tell you that this data, with this model, using this method, gives reproducible support. They say nothing about it being a biological truth.
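The point is easy to demonstrate with a toy, deliberately non-phylogenetic sketch: if one component of the population is never sampled, the bootstrap confidence interval is tight, reproducible, and wrong.

```python
import numpy as np

rng = np.random.default_rng(1)

# True population: a mixture of two groups, but our 'collection' only
# ever samples the first one (culturable, convenient, well-funded...).
true_population = np.concatenate([rng.normal(0.0, 1.0, 50_000),
                                  rng.normal(5.0, 1.0, 50_000)])
true_mean = true_population.mean()              # about 2.5

biased_sample = rng.normal(0.0, 1.0, 200)       # second group never seen

# Bootstrap the biased sample: resample with replacement 10,000 times.
boot_means = np.array([rng.choice(biased_sample, size=200).mean()
                       for _ in range(10_000)])
lo, hi = np.percentile(boot_means, [2.5, 97.5])

print(f"true mean {true_mean:.2f}, bootstrap 95% CI ({lo:.2f}, {hi:.2f})")
# The interval is tight and confident, and nowhere near the truth.
```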

Friday 4 December 2015

Reassortment in influenza

Why is there any sort of argument over the amount of reassortment in influenza? Put simply, for each subtype there are many lineages, and each lineage can appear in many subtypes, as shown in this image.