Wednesday 13 December 2017

Roger Stone

Read Jon Ronson - The Elephant in the Room

Alex Jones is paranoid.

Stone was introduced to Jones by Reeves - the grandson of the original Superman.
Stone had a business with Manafort and they also worked with Lee Atwater.

Stone worked for Savimbi in Angola
Bob Dole
Landmines
Marcos

Stone knew Roger Ailes
Anti-globalist/establishment

Except Stone is part of the establishment - he was Nixon's Counsel.

They have a cult mentality - they will kill the GOP
This is a coup - they do not want to win hearts.

Alex Jones hijacked the Young Turks
Stone and Jones brought up Clinton rape accusations to the debates.

Manages media - creates false flag confrontations

SuperHubs

This is a slightly puzzling book 

p22 the author is partially wrong. Hierarchies do work and Herbert Simon showed why, but this was not because of top-down control. They can be non-directed and arise spontaneously.

p25 gatekeepers to the rich and powerful. Is this a good idea?

p27 - why did the author write the book?

  • potentially undermines her credibility.
  • makes people wary in talking to her.
  • obvious that she is a Soros fan.
p32 "Money is mostly created by banks offering loans" regulated by central banks interest rates and asset purchases (Gold etc.) At the minute with QE $17 trillion has been pumped into the markets and created huge asset bubble such as BitCoin. The intention was for the money to be used for investment and to kickstart growth but this has failed. It has remained in the markets and the banks and not been distributed to the wider economy. This is going to result in a very serious and drastic need for realignment. 

Fundamentally commodities are more important than other markets because we cannot live without them. We depend on them for:
  • Shelter
  • Warmth 
  • Food
As Apple's share price rises, the return per share has fallen, because this is pure speculation and not investment.

p53 the power of the central banks is greater than that of the politicians. Brexit proves this wrong: you can get a populist vote in ignorance of how the central banks work, and this can create a suicidal economic policy.


Creating Research Objects

There is a serious problem with scientific fraud and with reproducibility. We need to think about ways in which we can check the integrity of a study.

http://www.researchobject.org 

This is also a way of encoding know how.

Metadata is too time consuming to create at the minute. It needs to be built into the planning and research process itself (GitHub?)

Want to create a knowledge exchange report
Open Research Data - Manyika
Rules for growth - Stodden

Data management plans are required by research councils

  • Integrity checking
  • Hashing (see the sketch below)
  • What are open file formats?
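
A minimal sketch of what the integrity checking and hashing bullets could look like in practice, assuming the data sits in a local directory: hash every file with SHA-256 and keep the digests in a manifest that travels with the research object. The directory name, manifest name and layout are my own illustrative choices, not anything from researchobject.org.

    # integrity_manifest.py - sketch: hash every file under a data directory
    # and write a manifest that can be re-checked later.
    # The "data" directory and "manifest.json" names are illustrative assumptions.
    import hashlib
    import json
    from pathlib import Path

    def sha256_of(path, chunk_size=1 << 20):
        """Return the SHA-256 hex digest of a file, read in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def build_manifest(data_dir="data", manifest="manifest.json"):
        """Hash every file under data_dir and write a JSON manifest."""
        records = {str(p): sha256_of(p)
                   for p in sorted(Path(data_dir).rglob("*")) if p.is_file()}
        Path(manifest).write_text(json.dumps(records, indent=2))
        return records

    def verify_manifest(manifest="manifest.json"):
        """Recompute digests and report any file that no longer matches."""
        records = json.loads(Path(manifest).read_text())
        return [path for path, digest in records.items() if sha256_of(path) != digest]

    if __name__ == "__main__":
        build_manifest()
        print("Changed or corrupted files:", verify_manifest())

Checking the manifest into version control (the GitHub idea above) would make the integrity check part of the research process rather than an afterthought.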

Blockchain for the trust layer
  • Politicians - regulations and policies
  • Qualifications
  • Medical Records
  • Passports
  • Forensics
  • Risk assessment and rating
Restoring trust is essential
(Byzantine Generals' Problem)

Money is a trust system representing work done
Reputation is also a trust system but this is only as strong as the weakest link.

Thinking about the bootstrap


  1. The bootstrap samples experimental units, but in phylogenetics you sample the VARIABLES - the sites (see the sketch after this list).
  2. How should we treat sites?
    1. Remove totally variant sites?
    2. Remove sites where a row is missing?
  3. You cannot say that the parametric and non-parametric bootstraps are the same thing. They are correlated but not directly comparable.
    1. Carry out FastTree with H5N8, then H5, then N8.
    2. Use the parametric and non-parametric bootstraps.
    3. Use the CONSEL measures as well.
  4. Having more bootstraps than 100 makes NO difference to the bootstrap values. They converge quickly empirically.
    1. This is far below the theoretical numbers needed, but Efron says that this is usual.
    2. This suggests that sites are linked and so there is less independent variability than it appears.
    3. Need to experiment with conserved sites.
    4. Need to experiment with the substitution models to look at sensitivity, and also gamma.
  5. There is a lack of independence between sites in the evolutionary models but this is IGNORED in the bootstrap calculations. You should bootstrap codons and not individual bases.
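
A minimal sketch of points 1 and 4, assuming the alignment is just a list of equal-length strings: the phylogenetic bootstrap resamples columns (sites) with replacement, and convergence can be checked empirically by comparing the value at 50, 100 and 200 replicates. The "support" statistic here is a toy stand-in (mutual nearest neighbours by Hamming distance), not a tree-building run, and resampling codon triples instead of single columns would only change the column-index step.

    # bootstrap_sites.py - sketch: resample alignment COLUMNS (sites) with
    # replacement and check how quickly a toy "support" value converges.
    # The alignment and the support statistic are illustrative assumptions.
    import random

    def resample_sites(alignment, rng):
        """Return a pseudo-alignment built by sampling columns with replacement."""
        length = len(alignment[0])
        cols = [rng.randrange(length) for _ in range(length)]
        return ["".join(seq[c] for c in cols) for seq in alignment]

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def mutual_nearest(alignment, i=0, j=1):
        """Toy grouping statistic: are sequences i and j each other's nearest neighbour?"""
        def nearest(k):
            others = [m for m in range(len(alignment)) if m != k]
            return min(others, key=lambda m: hamming(alignment[k], alignment[m]))
        return nearest(i) == j and nearest(j) == i

    def bootstrap_support(alignment, replicates, seed=1):
        rng = random.Random(seed)
        hits = sum(mutual_nearest(resample_sites(alignment, rng)) for _ in range(replicates))
        return hits / replicates

    if __name__ == "__main__":
        alignment = ["ACGTACGTACGTACGT",
                     "ACGTACGTACGAACGT",
                     "ACGTTCGTACGAACTT",
                     "TCGTTCGAACGAACTT"]
        for b in (50, 100, 200):
            print(b, bootstrap_support(alignment, b))

With toy data the value barely moves between 50 and 200 replicates, which is the empirical convergence that point 4 is making.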

Need to create synthetic data where the true tree is known (see the simulation sketch after this list). This can be used to test:
  1. Effects of sampling by censoring the data.
  2. Evaluate modeltest.
  3. Check trees from bad evolutionary models against the best models (probably the same!!!)
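
A minimal sketch of generating synthetic data on a known true tree, assuming a Jukes-Cantor model, a hard-coded toy tree of nested (child, branch length) tuples and a 300-base root sequence; a real study would use a proper simulator and the models under test, but this is enough to censor tips, feed modeltest and compare inferred trees against the truth.

    # simulate_jc69.py - sketch: evolve sequences down a KNOWN tree under the
    # Jukes-Cantor model so that inferred trees can be checked against the truth.
    # The tree, sequence length and branch lengths are illustrative assumptions.
    import math
    import random

    BASES = "ACGT"

    def jc69_step(base, branch_length, rng):
        """Return the base at the end of a branch under JC69."""
        p_same = 0.25 + 0.75 * math.exp(-4.0 * branch_length / 3.0)
        if rng.random() < p_same:
            return base
        return rng.choice([b for b in BASES if b != base])

    def evolve(sequence, branch_length, rng):
        return "".join(jc69_step(b, branch_length, rng) for b in sequence)

    def simulate(tree, root_sequence, rng):
        """Recursively evolve root_sequence down the tree; leaves are strings."""
        if isinstance(tree, str):                      # leaf: record its sequence
            return {tree: root_sequence}
        leaves = {}
        for child, branch_length in tree:              # internal node: (child, length) pairs
            child_seq = evolve(root_sequence, branch_length, rng)
            leaves.update(simulate(child, child_seq, rng))
        return leaves

    if __name__ == "__main__":
        rng = random.Random(42)
        root = "".join(rng.choice(BASES) for _ in range(300))
        # ((A:0.05,B:0.05):0.1,(C:0.05,D:0.05):0.1) in Newick terms
        true_tree = (((("A", 0.05), ("B", 0.05)), 0.1),
                     ((("C", 0.05), ("D", 0.05)), 0.1))
        data = simulate(true_tree, root, rng)
        for name, seq in data.items():
            print(">" + name)
            print(seq)

Censoring is then just deleting tips from the simulated data before tree building and seeing how far the inferred tree drifts from the known truth.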

The process of learning

Genetic: Very slow learning and wasteful because it depends on selection. This works between generations.

Taught: Fast learning that sums up what happens in a community.

Exploration: Novel learning by experience. This is learning through interaction.

Distances in psychology.

not transitive
not symmetrical

Tversky 1977 - features of similarity
AI cannot make human decisions until it gets beyond clustering distances.

Undoing Project p 107-114.

Belief in the law of small numbers - The Undoing Project p157-163

Pundits (illiterate "experts") p 168


Saturday 18 November 2017

The Virus Gene Papers

I think it unlikely that I will be submitting to Virus Gene again in a hurry. We had written a few papers that we knew would be unpopular and sent them to a meh-level journal where we expected to have an easier ride through peer review. The first hint that this wasn't going to be the case was the editor assigned, who happened to have collaborated with a group in direct competition in H9 phylogenetics, led by Cattoli, whom I had insulted previously.

Anyway, they are now in PeerJ and public so that nobody else runs off and starts using USEARCH in flu phylogenetics and claiming priority.

https://peerj.com/preprints/3166/
https://peerj.com/preprints/3396/

I just wanted to put the referees' comments for the first paper that was rejected here, because they are laughable and, in the context of the referees' comments on the second paper, they are probably wrong or at least not consistent. I have put my responses here because with a straight reject I get no chance to respond to the editor, who is not going to be on my Christmas card list.

Reviewer #1: General comments
The paper presents the method of classification of H9 lineages using clustering and compares the results with classification based on other methods.
The paper would gain if some practical aspects were added. The title suggests the method is fast, so an approximate time of analysis would be useful, especially that cluster analysis after each run is required and repeated clustering if necessary is suggested.


Specific comments:

Introduction
Explain HMM and SVM abbreviations.

Hidden Markov Models and Support Vector Machines

Materials and methods
Please add the information on the chain length in the BEAST analysis.

2 million

Results
Lines 37-43: "USEARCH identified 19 clusters …" - does it refer to H9 HA? It should be indicated in the text to avoid confusion with subtype identification described above.

19 H9 clusters

Lines 43-44: "The subtype originated in Wisconsin in 1966 and this clade continues to be in circulation" Do the Authors mean that H9N2 subtype was first detected in Wisconsin in 1966?

H9 was first detected in 1966 as part of H9N2.

Lines 37-39 (2nd page of results): The sentence "The phylogenetic trees…" is confusing, as only fig. 4 shows tree for clade 12 and it was not divided into subclades.

Easily changed

Lines 50-51 (2nd page of results): Were there 3 or 4 subclades of 14 clade identified?

Easily checked

Discussion
First sentence "The clustering of the influenza viral hemagglutinins using USEARCH proved that clustering can correctly identify the viral subtypes from the sequence data" - the subtype identification was partially correct, as it did not detect H15, and H7 was split into two clusters, so this statement should be revised. It would be interesting to mention with which subtype the H15 sequences were clustered.

I can show that H15 separates out at a slightly lower identity. H7 forms two adjacent groups, so it is correctly identified. The method gets 14 out of 15 subtypes, which is 93% accuracy; the method works. 93% is higher accuracy than is typical for clustering algorithms.

Lines 27-30 (2nd page of discussion): "…small sub-clades of four or less sequence were merged for phylogenetic analysis…" Please explain it in Results.

You cannot make a tree from fewer than 3 sequences.

Supplementary Figure 3: There are branches labeled with subclade number and some with individual sequence. Please explain it. It is also associated with the comment above.

That would be because labelling a cluster containing one sequence with a cluster name would be stupid. As these clusters were grouped for tree generation, it would be misleading to use the cluster number, but I can edit them to have both.

Table 3 - missing data in the 5th line

No, that data does not exist – it is unsupported data in the LABEL paper that is not public and cannot be verified. This was data given to Justin Bahl but not available to anyone else.

Reviewer #2: The automated detection and assignment of IAV genetic data to known lineages and the identification of sequences that don't "fit" existing descriptions is a challenge that requires creative solutions. The authors present a manuscript that proposes a solution to this question and tests it on an extensive H9 IAV dataset. 

Though I find the general question intriguing there are a number of issues. The two major items are: a) as a study on the evolutionary dynamics of H9 IAV, this is missing appropriate context, and the results are not adequately presented or discussed; and b) as a tool to identify known and unknown HA, it generates results that appear to be no different to BLASTn, it isn't packaged so that others may use it in a pipeline/web interface/package, and the generated "clusters" aren't linked to any known biological properties.  I elaborate on a few of these issues below.

1) This is not a novel algorithm: USEARCH has been in use for over 7 years and it has been previously used in IAV diagnostics. Consequently, I would expect the authors to present a novel implementation of the algorithm (e.g., a downloadable package/pipeline, or an interactive interface on the web) or a detailed analysis and discussion of the evolutionary history of the virus in question.  Unfortunately, the authors do not achieve either.

This reviewer is lying: you may search for IAV and USEARCH in Google and you will find NOTHING except the two papers I mentioned, both of which are more recent. It was first used by Webster in 2015, and for a different approach. It is mostly used for analysing metagenomics projects. It cannot be packaged because, as the paper shows, you have to make decisions about the clustering. It is not just automatic: you have to analyse the appropriate identity threshold and the resulting clustering.
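
For anyone who has not met it, this is the greedy centroid idea that UCLUST/USEARCH is built around, reduced to a toy sketch; the crude ungapped identity function and the 0.90 threshold are my own stand-ins (USEARCH itself is far more sophisticated and far faster), but it shows why the identity cut-off is an analytical decision rather than something you can package away.

    # greedy_cluster.py - toy sketch of centroid-based greedy clustering:
    # each sequence joins the first centroid it matches above an identity
    # threshold, otherwise it founds a new cluster. The identity measure and
    # the thresholds are illustrative assumptions, not USEARCH itself.
    def identity(a, b):
        """Crude identity: matching positions over the shorter length (no alignment)."""
        n = min(len(a), len(b))
        return sum(a[i] == b[i] for i in range(n)) / n

    def greedy_cluster(sequences, threshold=0.90):
        """Return a list of clusters; each cluster is (centroid_index, member_indices)."""
        clusters = []                               # [(centroid_index, [members])]
        for idx, seq in enumerate(sequences):
            for centroid_idx, members in clusters:
                if identity(seq, sequences[centroid_idx]) >= threshold:
                    members.append(idx)
                    break
            else:                                   # no centroid matched: new cluster
                clusters.append((idx, [idx]))
        return clusters

    if __name__ == "__main__":
        seqs = ["ACGTACGTAA", "ACGTACGTAT", "TTTTACGTAA", "TTTTACGAAA"]
        for threshold in (0.80, 0.90):
            print(threshold, greedy_cluster(seqs, threshold))

Running the example at 0.80 and 0.90 shows the cluster membership changing with the threshold, which is exactly the decision the analyst has to make.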

2) The introduction is not adequately structured - after reading, I was left confused as to why dividing the H9 subtype into different genetic clades was necessary, i.e. there is no justification provided for the study. The discussion of clades and lineages is particularly convoluted and given the presented information, it is not clear what the authors are trying to achieve (i.e., they move from identifying subtypes, to identifying clades, to lineages, to reassortment, and all possible combinations). Additionally, there are entire sections of the introduction that consist entirely of unsupported statements (lines 39-48 on alignments and tree inference: lines 52-60 on lineage evolution). This section needs to be revised to provide appropriate context and justification for the study.

The reviewer is obviously completely oblivious as to why you would want to carry out lineage analysis in influenza. As such they are not competent to review the paper. As the WHO actually has a working party to create these nomenclatures for H5, this argument is ridiculous.

3) There are many figures associated with BEAST analyses. The goals of these analyses are not introduced, and the trees are not presented or described in any meaningful detail. Further, and more concerning, the presented trees appear to be wrong, i.e. the tip dates are not in accordance with the temporal scale.

That would be because the editor had the number of figures reduced. The BEAST analysis is not particularly important other than to show the consistency of the clustering. If the reviewer had bothered to read, they would see that one of the trees does not use tip dates and is a maximum likelihood tree, so dates WILL NOT be consistent with the temporal scale if there is variation in mutation rate along one of the branches. This is actually an interesting point, as BEAST FAILS completely to generate a reasonable tree with tip dates for that cluster of data. It produces a network with cycles over a wide range of initial parameters.

4) One of the major results (lines 6-16 in the results) is that the USEARCH algorithm can identify the correct subtype of a sequence, most of the time. How does this compare to BLASTn? And, failing to classify a subtype (line 16) is problematic. The authors should consider what the goal of the analysis is, and then present it along with results from similar algorithms, e.g., with the same data, is BLASTn able to identify subtypes?

I am intrigued by how the reviewer thinks that BLASTn works. To do the same task I would need to identify prototypes of each cluster and then use BLASTn to find the rest of the cluster. I would then need to apply some sort of cut-off in order to identify when BLASTn was finding members of other clusters and not the current cluster. In short, this is nonsense. They perform different functions, as USEARCH identifies the clusters, not just related sequences. USEARCH produces the results in about 1 minute. Just setting up the BLAST searches would take 10 times longer than this, and analysing their results and doing the correct partitioning would take hundreds of times longer. The title of the USEARCH paper is actually "Search and clustering orders of magnitude FASTER THAN BLAST".

5) I do not understand the significance of USEARCH identifying 19 clusters (line 37); and these data are not linked in anyway to a larger more comprehensive description of the evolutionary dynamics of H9 IAV. The authors should refine their hypothesis, and discuss the results - specifically, if a cluster is identified, what does it mean? What is the significance of the previously unidentified clusters? How closely does this align with phylogenetic methods (and the discussed LABEL)?

Um, really, this is now getting to be a bad joke. The paper compares to LABEL, a method based on totally subjective cluster names created by influenza researchers. The entire discussion carries out exactly what this referee is suggesting in this paragraph. Do they need glasses? Are they suffering from a reading problem? Do they have a brain injury? USEARCH produces some of the clusters from LABEL faster, more efficiently and correctly. It is completely objective and based on mathematical criteria. There is no bias dependent on convenience sampling, because it uses all the data and not just the data a particular lab collects at a particular time. This is a MAJOR step forward in trying to sort out the mess that is influenza nomenclature, and it shows that most existing attempts are biased, partial and use rules that are not appropriate, such as the need for clades to be homogeneous in subtype, e.g. only H9N2 and not other H9-containing subtypes. The hypothesis is that existing nomenclatures are bad: arbitrary, subjective and not based on mathematical rigour. We have proved this in this paper and in two more analysing H7 and the internal influenza genes. All show exactly the same point: sound maths, rigorous systematic approaches and excellent biological agreement.

Minor comment:
1) Using my laptop, I aligned all non-redundant H9 HAs (n=5888) in ~2 minutes, and inferred a maximum likelihood phylogeny in ~6 minutes. The argument that phylogenetic methods are slow, particularly given modern tree inference algorithms and implementations on HPCs (e.g. Cipres: http://www.phylo.org) is not accurate. Additionally, alignment issues - particularly within subtypes is a trivial issue. 

Yippee for you, referee 2. Now put them into clusters. Just edit that tree with 5888 sequences and see how long it takes. Meanwhile USEARCH will have done it after 1 minute, and it will be mathematically correct and not depend on how you cut the trees. Alignments of large numbers of sequences are unreliable. Regardless of this referee stating that this is unsupported, it is actually supported by a very large literature, best summed up in the work of Robert Edgar, who wrote Muscle and who says DO NOT DO LARGE ALIGNMENTS WITHOUT USING MY USEARCH PROGRAM FIRST. But then it is unlikely that referee 2 actually RTFM for the alignment program. I am sure they ran it without a bootstrap, and it could not have used tip dates as only BEAST does this.

2) There are a number of methods, e.g, neighbor joining and UPGMA, that use agglomerative clustering methods.

Yes, there are; well done, referee 2, for being a genius and knowing that actually all of phylogenetics is related to clustering. This is the one and only correct statement that they make. All nomenclature and lineage methods depend on agglomerative methods, but this is a divisive clustering method, which is much less susceptible to convenience sampling. USEARCH is the fastest and best clustering method you can use, and it is divisive and not agglomerative.

My comment is that I have NEVER encountered a more partial, incompetent and ignorant referee than referee two. I think that they protest too much because they have too much invested in current methods such as LABEL, which this paper shows to be at best poor and at worst completely wrong.


The state of Influenza Phylogenetics

I really was not particularly interested in viral phylogenetics for most of my research career, but I started my research in the area when I joined the Institute of Animal Health. I have a background in synthetic organic chemistry and protein crystallography and so I am well aware of how vicious and petty academic politics can be (ask me someday about the MRC skiing anecdote, or the Ribosome Nobel Prize story, or maybe the Rubisco saga, or you can ask me about Nicolaou and Taxol).

But I can honestly say that I have never met a more political, corrupt and inept field than influenza phylogenetics. It is staggering how bad it is, and that might be what makes it interesting. I come to it from a statistics perspective, as I spent five years on my Damascene transformation in the statistics department at Oxford. I am very interested in bad science: people doing science badly in order to get grants and power but who really have no idea what they are doing. I was inspired by the work of John Ziman and the growth of the field of reproducibility (scientists like to suggest that this only applies to the social sciences, although psychology is acknowledging it has a problem too, and biology in general definitely has a reproducibility problem). In viral phylogenetics there is a lot of bad science.

First I want to set out the main problems I have:

  1. Most biologists in the field have no idea what they are talking about from a theoretical perspective. They don't get the maths, they do not understand the assumptions, and they ignore any results that do not fit with their expectations instead of asking themselves why they happened.
  2. Sampling is terrible. It is all convenience sampling, and this cannot avoid being a biased sample. Why are we focused on China when we collect almost nothing in Africa and the last swine flu pandemic came from Mexico?
  3. People hoard data and do not collaborate. There are 3 main databases which all have data in different formats with different annotations that you might or might not want. They are designed to make it difficult for researchers to use data from all three.
  4. Peer review is intensely political and there are cliques which cite each other's papers excessively and which block publication by other groups. There are some government laboratories that have > 50 cites on a scientific note paper that says ... well, not very much. Everyone is playing the citation game to keep their government laboratory funding.
  5. There is a lack of communication between the analysts and the laboratory scientists. For example, how many analysts know that the virus was often passaged through hens' eggs for amplification before sequencing? The result is that the sequences mutate to acquire chicken-specific variations, so the sequences in the original sample are not the same as those submitted to the database. This is exactly the same as the cowpox vaccine for smallpox - it is NOT cowpox, Jenner got lucky - and also the problem of cell culture, where the cell cultures are no longer the genotypes originally collected.

Why do I think these are problems?
  1. I told a referee that phylogenetics is just a clustering method based on a metric. They assured me that this is not always true: for example, parsimony and maximum likelihood are not distance-based methods, says the referee. Except that fundamentally they are. To calculate parsimony you need an alignment. To get a multiple alignment you build a progressive alignment based on a guide tree, which will often use UPGMA, a distance-based method. Distance is fundamental to progressive alignment. Parsimony itself depends on the smallest number of changes - a distance. The scoring models you use as evolutionary models for measuring changes and calculating likelihoods are metrics for finding distances. Probability is a metric in measure theory; it is a distance. You could use information theory, but the difference in information is again a metric and a distance. Whatever way you try and cut it, phylogenetics depends on a distance and a metric, and it groups based on those metrics. It is clustering, it does use metrics, and because it uses metrics it is not completely objective: it is subjective and metric dependent. People who do phylogenetics would do well to read Kaufman and Rousseeuw to understand why clustering needs care and why metrics are very interesting (I have a story about one of the authors as well which makes me reluctant to suggest reading his book, but it is foundational).
  2. The BOOTSTRAP - I don't know where to start. No biologist has bothered to do some simple experiments to check what it does and what it means with real data. For example, to know how many bootstraps you need, run it on your data with 50, 100, 150 and 200 replicates and see if it has converged - you get the same numbers each time. From my experience 100 is more than enough. All of these referees and experimentalists using 500 or 1,000 or 10,000 are wasting their time. In his book Efron says 100 is empirically often enough although, in theory, the number should be something like the square of the number of sequences. If they had read the book they might grasp the issues. That means going beyond his simplified paper saying what the bootstrap is, and definitely going beyond taking Felsenstein's word for it. There is some fantastic work on this by Susan Holmes, who worked with Efron. This is really great stuff but under-read and poorly understood. So much so that she has moved on to other things.
  3. The need to make everything quantitative. Biology is NOT always quantitative. If I see a clade in a tree and it is monophyletic to a geographical location, I believe that tree. I do not need to put a number on it. I could work out by a permutation test how likely it is to get a clade that is monophyletic to a location (see the sketch after this list), but given that the sampling is convenience sampling that is probably not going to be meaningful.
  4. Creating trees for distinct species should make sense. We know, and I mentioned it before, that influenza undergoes rapid change to obtain host-specific mutations when it is introduced to a new host, such as passaging it in chicken eggs. We know this experimentally for ferrets as well. Why would a host-specific tree be a bad idea? A referee and an editor thought so in my H9N2 work, until a paper taking the same approach was published in Nature; then what I had done was OK and they allowed my appeal, after stalling publication for 18 months across two journals with the same editor at both.
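
The permutation test in point 3 is easy to sketch, assuming a toy nested-tuple tree, a clade of interest and a location label for every tip (all made up here): shuffle the labels over the tips and count how often that clade comes out as a single location by chance.

    # monophyly_permutation.py - sketch: how surprising is it that one clade is
    # monophyletic for a geographical location? Shuffle the tip labels and see
    # how often the same clade comes out single-location by chance.
    # The tree, the clade and the labels are illustrative assumptions.
    import random

    def tips(node):
        """Collect tip names from a nested-tuple tree (leaves are strings)."""
        if isinstance(node, str):
            return [node]
        return [name for child in node for name in tips(child)]

    def clade_is_single_location(clade_tips, labels):
        return len({labels[t] for t in clade_tips}) == 1

    def permutation_p(tree, clade, labels, n_perm=10000, seed=1):
        rng = random.Random(seed)
        all_tips = tips(tree)
        clade_tips = tips(clade)
        observed = clade_is_single_location(clade_tips, labels)
        values = [labels[t] for t in all_tips]
        hits = 0
        for _ in range(n_perm):
            rng.shuffle(values)
            shuffled = dict(zip(all_tips, values))
            if clade_is_single_location(clade_tips, shuffled):
                hits += 1
        return observed, hits / n_perm

    if __name__ == "__main__":
        clade = ("t1", "t2", "t3")                    # the clade of interest
        tree = (clade, (("t4", "t5"), ("t6", "t7")))  # toy tree as nested tuples
        labels = {"t1": "China", "t2": "China", "t3": "China",
                  "t4": "Egypt", "t5": "Egypt", "t6": "Mexico", "t7": "Mexico"}
        print(permutation_p(tree, clade, labels))

With a convenience sample the null distribution itself is distorted, which is why I say the number is probably not going to be meaningful anyway.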

Sunday 1 October 2017

How would you hack an election?

If I wanted to win an election how would I do it?

In areas where there is no voter ID, what I need first is a list of past voting records. I need to know who doesn't vote. I need this because I am going to vote for them and I don't want any of them turning up by accident. People who are on the register but cannot vote because they are dead or incapacitated look like great candidates as well, BUT these are too easy to check, so actually I want to avoid them or someone might detect my fraud quite easily.

Once I have a list of non-voters who are on the roll and alive and who could vote but never do, what do I do next? I go to the polling station on their behalf and vote for them. Remember that they never vote and they are likely not to be recognised. I need to do this in small numbers at all of the polling stations. The more stations the better, to spread the load and hide what I am doing. I don't want to do more than 50-100 votes in each of the stations. This means that my voting team needs to be a maximum of 100 people. I do not want to have the same person voting at a station twice. These teams then move from station to station over the day. I can also do postal votes and absentee votes, and these help. This is a big logistical operation, but I don't care if I win the prize and control the government. You are talking thousands of operatives, but they will target only a few key areas which can be flipped. Only marginals matter.

This is certainly possible for swing states like Pennsylvania, Wisconsin etc. if I have Russian manpower available, but I would need a forward base and a long-term plan to get the people in place over time. Combine this with analytics, targeted media and a big fake news operation and you can win the election.

How would you recognise this in the results? Can it be detected? The only things that you would see are a higher than expected turnout, unexpected demographic shifts and a high turnout among previous non-voters. But all of this has plausible deniability. There is no way an audit can say that this fraud exists unless you corroborate all of the suspicious votes. This is the perfect undetectable hack. Oddly enough, Trump and Brexit both depended on previous non-voters.

Is it just me or does seasonal flu not make sense?

The NHS are warning that the UK is going to have a bad flu season this winter because there has been a bad flu season in the southern hemisphere this year.

But maybe they got a bad season this year because we got a bad season last year. Surely if a bad season alternates between the northern and southern hemispheres, one affecting the other, then once a bad season starts every season after that has to be bad.

What is more puzzling is how their bad season is going to become our bad season. If it is being spread by air travel, then why didn't we have a bad summer flu season? Why does it only kick off in our winter, after their winter and flu season is over? The idea was that we get more flu and colds in winter because we spend more time indoors passing it to one another. I agree in an agrarian society where behaviour changes with the seasons, but I spend as much time in my office in the summer as I do in the winter. That is apart from the summer vacation, but if I went to New Zealand wouldn't I bring back the flu?

A big factor in my job and exposure is schools. The new school term brings colds and flu. Now that we have much shorter summer holidays for schools in the UK, we should see longer flu seasons if this is part of the cause of flu being seasonal. Seasonal flu is a reality, but our models of why it is seasonal don't fit very well. We need better models of why it is seasonal and how it spreads between hemispheres.

Humidity and temperature have been shown to have effects, but I suspect that there are more factors to take into account, and how flu seasons spread between hemispheres is another one to consider.

How not to develop analytic talent

I wrote a review of the book Developing Analytic Talent by Vincent Granville and gave it a good bashing. But I could not do justice to the total incompetence. Vincent Granville PhD is a perfect example of a snake-oil salesman. He speaks about his papers, his experience and the investment he has attracted, and he talks about his books, but a quick Google of his name just turns up a website which he set up and which engages in some shady practices, including him writing articles pretending to be other authors, especially women, in order to make it appear more gender neutral.

The review could not capture the many gems within the book so here are some of his best bits of writing.
Compound metrics are to base metrics what molecules are to atoms. Just like as few as seven atoms (oxygen, hydrogen, helium, carbon, sodium, chlorine and sulfur) can produce trillions of trillions of molecules and chemical compounds (a challenge for analytical and computational chemists designing molecules to cure cancer), the same combinatorial explosion takes place as you move from base to compound metrics. 
p110

Very nice, but helium is a noble gas and does not form compounds on Earth, although it might do in special environments. His PhD is not in chemistry.

Apparently he also thinks that an app for pricing in amusement parks would be a good venture
Increase prices  and find optimum prices. (Obviously, it must be higher than current prices due to the extremely large and dense crowds visiting these parks, creating huge waiting lines and other hazards everywhere - from the few restaurants and bathrooms to the attractions).
p105

Alternatively they could build more bathrooms and more restaurants and make even more money from the large crowds, rather than reducing footfall as people go elsewhere. Who are richer, the owners of Walmart or the owners of Tiffany's?

This, however, is the best and is saved for page 174.

The number of variables is assumed to be high, and the independent variables are highly correlated.
What? Wait, let me check the definition of independent variables. That would be variables whose variation does NOT depend on the variation of another. That would mean not correlated. This is more than a slight howler; it is so elementary that you cannot believe a single thing the author says. He then goes on to do regression in Excel.

On page 189 he talks about the possibility of getting negative variances - this is impossible. On page 190 he talks about the variance being bounded by 1 as a maximum. This is nonsense: even with normalised data the variance equals 1, it is not less than 1 as he states.
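
For the record, and this is my note rather than anything in the book: the variance is the expectation of a squared deviation, so it can never be negative, and it is not bounded above by 1 either; standardised data has variance exactly 1 and any rescaling pushes it higher.

    \operatorname{Var}(X) = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right] \ge 0,
    \qquad
    \operatorname{Var}(aX) = a^2\operatorname{Var}(X)
    \;\Rightarrow\;
    \operatorname{Var}(2Z) = 4 \text{ when } \operatorname{Var}(Z) = 1.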

Wednesday 27 September 2017

How not to design a questionnaire. Lord Ashcroft's dangerous political polls.

Lord Ashcroft has had a big impact on election polling in the UK. He has even had favourable mentions from Nate Silver on his blog. But I have been taking a deeper look into his polls.

I have to admit to some political activism: I was previously a Liberal Democrat Councillor. I collected canvassing data and fed it into the party's own analysis program, which makes predictions about the state of your campaign. Most of the time it got it pretty close to right, and I won by the amount I expected. That was small, local-scale analysis.

Lord Ashcroft is well known as a Tory donor, and he has been running polls in both the UK and, more recently, the US. He conducted a lot of focus groups around the Trump election and also the Brexit vote. What concerns me is not the polls and the data, but the way the polls are carried out - most specifically, the questions that are used.

I was reading a story on CapX by Robert Colville about the demographic disaster awaiting the Conservative Party and he used some very particular phrasing.

If people think that multiculturalism, immigration, the internet, the green revolution and feminism are forces for good, they will vote Labour/Democrat. If they think they are bad, they will vote Tory/Republican.

Now who exactly thinks in those terms? Oh yes, there is a force for good and I am behind it 100%. People just never say that. Oh, that is a force for evil/bad, I am really against that. These phrases are straight from Lord Ashcroft's poll published in his book Hopes and Fears. Participants had to choose on a scale whether they thought that:

  • Feminism
  • The green movement
  • The internet
  • Multiculturalism
  • Immigration
have made America better or worse.

I use these questions for my second year statistics class as examples of non-scientific questions. These are in fact sound-bites and propaganda disguised as poll questions. They provide a frame of reference to lead participants towards where the person asking the questions wants to take them. The questions are based on invoking feelings and reactions and not actually obtaining rational responses.

Take for example this more detailed question.

Thinking about the following changes in America over recent years, do you think they have made America better or worse?
More lesbian and gay couples raising children
How can lesbian women and gay men raising children have ANY effect WHATSOEVER on whether a country is better or worse? What does it mean for a country to be better or worse? If it means economically less successful, then how do lesbian and gay couples raising children affect the economy? If it means the country has gotten worse because of a rise in crime, then again how is this caused by lesbian and gay couples raising children? The responses to this question will be pure framing based on current personal experience. If your standard of living has declined recently then you will respond that America has become worse, but the relationship to lesbian and gay couples is incidental to the answer. I could have asked the same question about America getting better or worse using "More Bush family members raising children" and I would get the same response.

This sort of question is nonsense. It is intended to bias and to put words into participants' mouths by limiting the spectrum of possible causes. The motivation behind these polls is generating easily digestible journalistic content. They can then summarise the poll by saying a majority of participants think that feminism or multiculturalism has made America or the UK worse. These are leading questions, and the poll is at best meaningless and at worst pure propaganda. Polls like this are designed to influence the opinions of participants, not to discover what their opinions actually are.

We need to think seriously about the impacts of polling carried out in this way, as for me it raises ethical concerns about how they are being carried out and how they are being used.

Thursday 20 July 2017

Bill Gates, VR and Influenza Vaccines

Bill Gates was shown around the NIH where they wanted to show him how VR helps to create better vaccines.

As I said in another blog post, I began my research career in structure-based drug design. Then I learned that your models are only as good as the data you feed in, and so I moved down the pipeline, first to crystallography and now to sequence data analysis. I know the limitations, but the NIH wants money and they are less likely to talk about them.

The limitation in influenza research is sampling. We simply do not collect the data properly. We have lots of data from China because that is where we think future outbreaks might come from, but the last swine flu pandemic originated in Mexico. There is not enough systematic global collection of data. This means that unexpected changes catch us unaware. Most of the time we do pick the right vaccine candidates, but sometimes we get it wrong. VR will not help this.

What will help is the IoT. That provides an opportunity for massive data collection. The cloud allows us to share data globally. If we can stop the national laboratories from hoarding data, this would also be a big step forward. The WHO also needs to be reformed to remove some of the political players who are a barrier to sharing. Scientists are bad politicians. My dad worked on a funding committee and they had to move meetings to secret locations because the scientists were always trying to lobby them and bully them into decisions. In influenza research there is a ruling clique that wishes to restrict research participation. When you mention citizen science or data sharing they have a fit.

So what should we be looking at today?

1) Why is there a widespread breeding failure in the Celtic seas? Is this related to the deaths of marine mammals? Both of these sets of species are possible influenza sources/sinks, and it might be a good idea to do some influenza viral screens to see if a new, more pathogenic strain has evolved. If it is an influenza it is probably H7 or H3, and the fact that it can affect mammals would be a concern.

2) H5N8 from Viet Nam is a newly emerging, highly pathogenic H5-containing lineage. We had an outbreak in Korea that spread to North America and also Europe, but it does not seem to be able to persist in either Europe or North America (although for North America that is disputed). However, the most recent European cases are not related to the Gochang and Buan Korean lineages, but to a Viet Nam sequence that again spread via Korea.



The key is vigilance, wider participation and thinking outside the box. The bird breeding failure might be nothing and unrelated, but let's do a quick check. Improved data collection and screening is where we will make the breakthroughs, not in the VR lab.

Wednesday 19 July 2017

My F1000 paper on reassortment in H5N8 - even open peer review has flaws

This is really irritating me, as this is version 3 of the same story. It is not even a particularly interesting story, except that it is if you think deeply about it.

What I want to show is that H5N8 in the US is a subtype that has been produced by multiple events where an H5 containing virus reassorts with an N8 containing virus. To do this I constructed a tree for ALL the H5 sequences in the database and ALL the N8 sequences in the database.

If I am wrong then the H5N8 sequences from the US would cluster in one group and not be spread across the tree in many distinct clades. They would at least be close neighbours. Am I wrong? Nope, they are spread all over both trees. In the H5 tree there are lots of neighbouring H5 sequences that have been sampled, but they are from other subtypes. In the N8 tree they are also widely spread, with sequences from many other subtypes in between.
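
A minimal sketch of that check, assuming a toy nested-tuple tree and made-up tip names: count the smallest number of monophyletic blocks (maximal clades containing only the focal tips) that the US H5N8 tips split into. One block is what a single origin without reassortment would look like; several scattered blocks is what I actually see.

    # reassortment_blocks.py - sketch: into how many separate clades does a set
    # of tips (e.g. the US H5N8 sequences) split on a tree? One block is what
    # you would expect without reassortment; many scattered blocks are not.
    # The toy tree and tip names are illustrative assumptions.
    def count_blocks(node, focal):
        """Return (all_tips_in_focal, number_of_maximal_focal_clades) for a nested-tuple tree."""
        if isinstance(node, str):                    # leaf
            return node in focal, 1 if node in focal else 0
        results = [count_blocks(child, focal) for child in node]
        if all(flag for flag, _ in results):         # whole clade is focal: one block
            return True, 1
        return False, sum(blocks for _, blocks in results)

    if __name__ == "__main__":
        tree = ((("us1", "h5n2_a"), ("us2", "h5n1_a")),
                (("h5n6_a", ("us3", "h5n2_b")), ("us4", "h5n1_b")))
        h5n8_us = {"us1", "us2", "us3", "us4"}
        print("Maximal H5N8 clades:", count_blocks(tree, h5n8_us)[1])   # 4, i.e. scattered

On a real tree the same recursion runs over whatever nested structure your tree parser produces; the point is only that the count is far greater than one in both the H5 and N8 trees.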

So here is the referees comment https://f1000research.com/articles/5-2463/v1#referee-response-18901

I am completely incredulous about how this is a possible argument for rejection.

If the referees are right in their arguments, then you CAN have clades in a tree that are polyphyletic for subtype without reassortment being the cause. I could simply get from H5N2 to H5N8 by spontaneous mutation of the N2 to the N8 form. I could make hundreds of base changes, insertions and deletions, and this would somehow be more likely than a simple reassortment event.

Now let me imagine this is not their argument and that they will admit that reassortment does exist in the clades, but that they are not convinced about the specific H5N8 reassortment events. They are suggesting that these patterns occur in both the N8 and H5 trees - two independent samples - in the same way by chance, and that the internal gene trees are needed to corroborate these events. I say in the paper very clearly that I cannot KNOW what the origins of the H5 and N8 are and I can only make suppositions about them, but I know that a reassortment event has definitely occurred. Otherwise, what are the chances that the H5 and N8 genes both undergo substantial mutational changes between samplings of H5N8, and that the neighbours include large numbers of sequences from many other subtypes? If this were true then H5N8 sampling must have been carried out appallingly, or in a very biased manner, while we detect all the other subtypes with ease.

To be honest I have the internal gene trees, as I just finished them for another paper on a different subject, and oddly enough they show exactly what I said in the paper. I was 100% right. These two referees are 100% WRONG. In fact their arguments are so illogical and unsupported by evidence that I was surprised that they were brave enough to put their names to them.

So let me ignore the sneering way the review is written, because everyone has a time in their life when they think they know it all, and recent graduates tend to fall into this trap more often than not. I know that I was the same 20 years ago when I started my career.

Let me consider how they use rhetoric to create a straw man to knock down, by suggesting that the paper is about discovering reassortment - it is not. The title is very clear: it says reassortment in H5N8, and only part of the H5N8 tree - the US part - is actually the main focus of the study. It is a paper about a specific example and the dangers of collecting data by subtype, as this gives an incomplete picture of sequence evolution in a segmented virus that can undergo reassortment.

This is the point. If the segments can reassort, then they can pass between multiple subtypes in their evolutionary pathway. This is not rocket science; it is just suggesting that it is better to consider this possibility in trees and sampling, and not just carry out phylogenetic analysis by subtype.

This leaves me with two options:
1) The referees are idiots.
2) The referees forgot to declare a conflict of interest, in that they have a skewed viewpoint because they need to protect their existing work. This paper starts to undermine the idea of monophyletic clades for subtype, which underpins the WHO nomenclature system for H5 that Justin Bahl helps to manage.

I do not think the referees are idiots, but I do think the second point is true and that there are good reasons why Dr Bahl should have declared a real and prejudicial interest and NOT taken up the review. Nobody is ever objective about seeing their work undermined. I also think this is a good reason for me to ban Dr Bahl and Joseph Hicks from EVER reviewing any of my papers, and for editors to consider any review that they provide with suspicion.

Saturday 29 April 2017

Let's go back to the beginning.

I need to tell a long story and so I need to go back to the beginning. Part of this story I have already told but not very well and so this is an attempt to put everything into context.

I did a degree in Chemistry and Law at the University of Exeter. When it came to choosing what to do next I wanted to stay in research. I had met some lawyers at the recruitment fairs and they had convinced me that I did not want to be a lawyer. My drive was wanting to change the world; their motivations were purely financial. I was not going to be Perry Mason and I was not going to write environmental protection legislation, and so I turned back to science, which had been my dream since my teens.

What caught my eye was protein molecular modelling. I had not done much biological chemistry or biochemistry as an undergraduate, but the beauty of the computer models captivated me. It was like the best video game I had ever seen (at that time they used the best graphics computers you could buy and they cost tens of thousands). I had applied for PhDs elsewhere in physical chemistry, including with P.W. Atkins (his response was that he didn't supervise students; then why was he in the graduate prospectus?). But nothing compared to those ribbon images of proteins.

I received the Norman Rydon scholarship from the Chemistry Department at Exeter. This allowed me to pick my supervisor, and the money would come with me. It was an incredible stroke of luck, and so I got to follow my dream and study molecular modelling of proteins. My PhD was in homology modelling of FBP aldolase and also used molecular dynamics to study the conformations of peptide inhibitors. Unfortunately, about the time I finished, Swiss-Model appeared and what had taken me 3 years to do could now be done in 10 minutes on a server ... That was the end of homology modelling research for me. What I also learned was that modellers depend on the quality of the data they are given. The FBP structure that I used had some limitations, so my models shared those limitations, and so I went back a step to become a protein crystallographer.

While I was doing my PhD and protein crystallography post-doc I was the general computational biologist or bioinformatician on call for the research group. This was the mid-1990s and so bioinformatics did not really exist as a subject. I did sequence alignment, BLAST searches and phylogenetic analysis for projects where we tried to understand the evolution of protein structure and function.

What I realised from building these alignments and trees was that if your data is very partial and contains mostly sequences from related species and only a few sequences from more distant species, then this will bias the tree towards your data and possibly away from a better representation of reality. The problem is how do you select sequences to include and exclude? An even bigger issue is the irregularity of the sampling across the "tree of life" (we did not call it that then either; we just called it across the kingdoms). We worked with the recently discovered Archaea, and they are dramatically different to the bacteria and the eukaryotes, and putting the Archaeal sequences into trees was difficult.

From alignments I also learned that making secondary structure predictions on all the sequences in the alignment is better than just making predictions on a single sequence. They should all have the same structure, and so this sequence-level variation should disappear in the predictions. This turned out to be a major discovery (made by someone else) and that ended my investigation of secondary structure prediction (OK, I have missed out neural nets etc., but I have serious doubts that they contribute anything more than the use of multiple alignments with GOR or even Chou and Fasman).

I continued as a post-doc in protein crystallography but also dabbled in bioinformatics until, in 1999, Exeter set up an MSc in Bioinformatics. As one of the local experts I helped to set up the course and I taught the sequence and structure modules. I was made a lecturer in 2000 and I remained at Exeter for five years until I got caught up in the departmental politics of the closing of the Chemistry Department (I was a lecturer in Biological Sciences and in Engineering and Computer Science at the time). Exeter made me redundant, but the atmosphere had soured for me there anyway, because they disapproved of me trying to have a work-life balance and putting my wife and newly born children first. They also did not like my involvement in politics. I was a city councillor in Exeter for five years.

One of my students asked whether I had seen the advert for a bioinformatics lecturer at Oxford. I hadn't, but the closing date hadn't passed and so I applied. My CV was okay. I was much better at teaching than research. Setting myself up as an independent researcher had also been made difficult by having to separate my research from my PhD and post-doctoral supervisor. Luckily I had the support of Dr Ron Yang, and we had worked together on some projects. He did the computing (most of it) and I gave the biological input (a bit at the end to make sure that it actually worked). This meant that I had my four publications, and that is what matters in UK academia.

I went to the interview at Oxford. I thought it went well. The head of the course was young, bright and very unexpected. Dr (now Prof.) Charlotte Deane would go on to be head of the Department of Statistics. What was unexpected was how relaxed and casual she was. She was not the serious, unapproachable Oxford don. They offered me the job on the same day and I started a few months later. We moved from Devon to Oxfordshire. I led the teaching in the modules Charlotte was not leading. In Oxford the Bioinformatics MSc was in the Department of Statistics, and so I became a Departmental Lecturer in Statistics. This also meant I taught statistics, Perl programming and the biology courses. I had gone from being a lecturer in Biological Sciences and Engineering and Computer Science to a lecturer in Statistics. I had degrees in none of these subjects, but I am 100% a computational biologist. This is the curse of being inter-disciplinary. I did start to think about systems biology at Exeter, and we had a meeting there where I met Kitano, whose work was a major influence on my thinking.

This is when I became an accidental statistician, and I am glad that it happened, because apart from the stunning molecular images I found that data is what fascinates me. I just can never get enough data. I think that if I had been introduced to statistics earlier on then I might have been a statistician from the start, but at school it is never taught well and people fear statistics. At Oxford I learnt to love it. When the lecturer who taught statistical data mining left, I took over his module. Now I was teaching masters-level statistics to people who had degrees in maths, some of them from Oxford. It was an amazing experience, although I have to admit to spending the entire summer reading the textbooks from cover to cover (thanks to Hastie, Tibshirani and Friedman, and also to Brian Ripley).

Charlotte went on to other things and I became acting head of the MSc, teaching more and more. My interests now were systems biology and trying to put together data from different experiments and perspectives. What really troubled me, and what still sits in the back of my mind, is entropy and how it works in living systems. I was a bookworm for systems biology.