The Physician Behind the PREDIMED Retraction

John M. Mandrola, MD; John B. Carlisle, MBChB


July 25, 2018

John M. Mandrola, MD: Hi, everyone. This is John Mandrola from Medscape Cardiology. And I'm pleased to be with Dr John Carlisle, an anesthesiologist in the United Kingdom who did the research involved with the recent PREDIMED retraction. Dr Carlisle, welcome.

John B. Carlisle, MBChB: Thank you, John. I'm very pleased to be here.

Mandrola: I'm excited to meet you. First, just tell us who you are.

Carlisle: I am a hospital doctor in the UK. I work in a smallish [National Health Service] hospital. I've been a specialist here in Devon, UK, for 17 years. I'm not an academic; I'm not employed by a university. My day-to-day job is as an anesthesiologist. I also staff a preoperative assessment clinic, meeting patients before surgery. I'm an intensivist as well.

Mandrola: How did you get interested in this project?

Carlisle: When I was a trainee, I was looking for things to bulk up my CV and a job came up with the Cochrane Collaboration. At the end of the 1990s, the Cochrane Anaesthesia Review Group, which is based in Copenhagen, Denmark, was just setting up and they wanted somebody to respond to comments. I held that job for a couple of years. And then they said, "Well, it's about time you did your own systematic review."

So, I did a systematic review about drugs to prevent postoperative nausea and vomiting. During the course of looking at papers for that systematic review, I came across quite a few from a Japanese anesthesiologist. And it turned out, about 10 years later, that he had made up most of his data. It is from that that I really developed an interest in this field.

Mandrola: You noticed the irregularities in the process because of doing the systematic review, is that right?

Carlisle: That's correct.

Mandrola: Do you have any background in statistics or computer science?

Carlisle: No, not really, except for the passing interest you have in order to pass exams, which many medical students will be familiar with. It's one of those last things you look at before the exam and then you forget about it. The other time I learned a bit about statistics was for doing systematic reviews, looking at how to analyze randomized controlled trials (RCTs) when you combine them.

Mandrola: I've read your paper and numerous descriptions of your methods, but can you make it simple? How did you do this?

Carlisle: Fortunately, the core of the method is familiar to all doctors, which is how we calculate the probability that two groups are different. When you're looking at an RCT that asked, "Did this drug work?" you're generally looking for a small P value. If the outcome was something continuous, like patient weights (maybe one group went on a diet, the other did not, and you're looking at weight loss), you do a t-test, which will be familiar to most doctors. With more than two groups, you use an [analysis of variance (ANOVA)]. The names of those tests are familiar even if you don't know the nuts and bolts of them.

My method was to apply those types of tests to the characteristics that are present before you do the trial. So, the heights, the weights, those things that are present in the population before we actually do the experiments.
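Carlisle's published analysis code is written in R; purely to illustrate the underlying idea, and not as his actual method, here is a minimal Python sketch that compares one baseline variable between two hypothetical trial arms. It uses a simulation-style permutation test rather than the textbook t-test, so it needs nothing beyond the standard library, and all of the data are made up:

```python
import random
from statistics import mean

def baseline_p_value(group_a, group_b, n_sim=20_000, seed=1):
    """Permutation test: how often does random re-allocation of the
    pooled values produce a difference in group means at least as
    large as the one actually observed?"""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_sim):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_sim

# Hypothetical baseline heights (cm) in the two arms of a trial
arm_a = [170, 168, 175, 172, 169, 174, 171, 173]
arm_b = [169, 171, 174, 170, 172, 168, 175, 176]
print(baseline_p_value(arm_a, arm_b))  # large: nothing remarkable here
```

A very small result here would mean the arms differ in that baseline characteristic by more than the chance process of allocation can plausibly explain.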

Mandrola: The baseline characteristics in an RCT, if it is truly randomized, should not be different.

Carlisle: It is true that if you have a big enough sample (you'd need hundreds of thousands), then the means will be pretty much exactly the same. Most studies never get that big. How much difference you see depends on chance: the chance differences in the people allocated to one group and the other. There is almost always some difference. When a study reports those means, it may report them imprecisely, so they may appear to be the same. A mean weight of 74.1 kg in both groups, for example, may turn out to be quite different if you increase the number of decimal places. Most of the time, there are some differences between the groups, and how much difference there is should be due to chance alone.
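A toy example, with entirely made-up numbers, shows how rounding in a published table can make genuinely different group means look identical:

```python
# Hypothetical underlying group mean weights, in kg
mean_a = 74.1449
mean_b = 74.0551

# Reported to one decimal place, as in a typical baseline table,
# both arms appear identical...
print(round(mean_a, 1), round(mean_b, 1))  # 74.1 74.1

# ...yet the underlying difference is nearly 0.09 kg
print(abs(mean_a - mean_b))
```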

Mandrola: How does your method detect irregularities in these baseline variables?

Carlisle: At its simplest, you do a t-test on the heights. A really small P value, indicating a really big difference, would suggest that maybe there is something wrong with that study. At the other end of the spectrum, the groups may be unusually similar, [and you can calculate a P value for that].

My method differs from the normal methods of t-tests and ANOVAs only to the extent that I used simulations to try and work out [the chance that two means were the same] because that is just as unlikely as two means that are very different.
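The "unusually similar" direction can be illustrated with the same simulation idea. This is a rough Python sketch with hypothetical data, not Carlisle's actual R code: it estimates how often chance allocation would produce group means at least as close as the ones reported.

```python
import random
from statistics import mean

def similarity_p_value(group_a, group_b, n_sim=20_000, seed=1):
    """How often does random re-allocation of the pooled values produce
    group means at least as CLOSE as the ones actually observed?
    A very small result flags groups that look too similar, i.e.,
    too good to be true."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    rng = random.Random(seed)
    at_least_as_close = 0
    for _ in range(n_sim):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff <= observed:
            at_least_as_close += 1
    return at_least_as_close / n_sim

# Two hypothetical arms reported with identical mean weights (kg)
arm_a = [73, 75, 74, 76, 72, 74, 73, 77]
arm_b = [74, 72, 75, 73, 76, 74, 77, 73]
print(similarity_p_value(arm_a, arm_b))  # small-ish: suspiciously close
```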

Mandrola: In the first table of [a published RCT], the investigators list the baseline characteristics. The P values for each baseline characteristic (eg, height, weight, waist circumference) are generally included. In your method, you look at a sum of the average of those P values?

Carlisle: That is correct. I generated a single P value for the trial as a whole, which means you've got to somehow combine those P values for those different characteristics. Sometimes the authors won't calculate the P values, as was the case in the PREDIMED study.[1] If they had, they may have spotted something wasn't quite right, but they didn't do that. Sometimes people will calculate P values incorrectly, so you'll see a P value next to that characteristic, but it may be wrong.

Some journals recommend that you don't calculate P values for baseline characteristics because any differences should be the result of the chance process of allocation rather than something important.
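Carlisle's paper describes its own procedure for combining the per-characteristic P values, including simulation. As a simpler, classical stand-in, not the paper's exact method, Fisher's method shows what turning several baseline P values into one trial-level P value looks like; note that it assumes the inputs are independent, an assumption real baseline variables often violate:

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method: under the null hypothesis, X = -2 * sum(ln p_i)
    follows a chi-square distribution with 2k degrees of freedom,
    where k is the number of independent P values combined."""
    k = len(p_values)
    x = -2.0 * sum(math.log(p) for p in p_values)
    # The chi-square survival function has a closed form for even
    # degrees of freedom 2k: P(X > x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Unremarkable per-characteristic P values combine into an unremarkable one
print(fisher_combined_p([0.5, 0.5, 0.5, 0.5]))

# Several individually borderline P values combine into a striking one
print(fisher_combined_p([0.04, 0.03, 0.05, 0.02]))
```

In this framing, a trial-level P value near either extreme is informative: very small suggests the baseline tables are unlikely under honest simple randomization, whether the groups are too different or too alike.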

Mandrola: What would you characterize as the weaknesses of your method?

Carlisle: The assumptions of the method [are the weakness]. If people are not aware of those assumptions, they will misinterpret the analysis. The analysis assumes that the sample population is allocated in a very simple way. It assumes that there is no block randomization, no stratification, and no minimization. There are some new methods of randomly allocating patients that make a study more efficient.

The minimization process, for instance, actually changes the probability of being allocated to one group or the other as the trial progresses. That means that my method would produce a slightly incorrect P value. The method also assumes that the calculated means are normally distributed. Things like age, height, and weight are distributed in a slightly non-normal way (eg, log-normal). However, the distributions of means are usually close to normal, so that is likely okay.

Mandrola: What about correlation of variables? For instance, tall people might have heavier weights.

Carlisle: That's right. So, even if the P values for the individual statistics are correct, when you combine them, the method assumes that they are independent of each other. And as you just said, tall people are generally going to be heavier, so any slight imbalance in heights will also be reflected in an imbalance in weights because those two things are connected. Whenever one uses this sort of method and you get the result, you've then got to pause and think, "Okay, hang on, were the assumptions that we made met?" And if there are reasons to think they're not, then you've got to be fairly cautious in how you approach what the next step might be.

Selecting Studies for Analysis

Mandrola: How did you decide which studies to look at?

Carlisle: Doing the original systematic reviews that I did with Cochrane, I had already analyzed studies by that Japanese author that I mentioned. He is top of the leaderboard on Retraction Watch, a website that your viewers may be interested in. That site posts a list of the top 30 or so authors who have had the most papers retracted. The purpose of the website is to track retractions of biomedical literature. There are four anesthetists in the top 20, which is a bit of a worry to anesthetists. Either we are lying more than other specialties, or we're lying the same amount but are really bad at it and get found out.

I had analyzed the studies from this Japanese researcher and then I wanted to look and see how many more of these types of studies might there be in the journal I work for, Anaesthesia, and other anesthetic journals in which he had published. I looked at six anesthetic journals and I ended up looking at 15 years' worth of RCTs. I analyzed any I came across; I was not interested in the particular topic of those RCTs.

Having done that, some of the people I talked to at conferences were a bit alarmed that maybe anesthesiologists were getting a bad name as liars. They suggested I look at some other journals. I chose two big-hitting journals, the New England Journal of Medicine and [the Journal of the American Medical Association]. I simply looked at 15 years' worth of RCTs, with the caveat that I didn't always include every single one I came across. There were a few animal studies that I purposely decided not to include, though a few animal studies did end up in my analysis.

Mandrola: You found almost 100 studies out of 5000 that had irregularities?

Carlisle: We have calculated a P value for each of those studies, and now have 5000 or so P values. What threshold you choose will determine how you categorize those studies into white "good" studies and black "we are worried about these" studies. What's true is that the distribution of P values didn't follow expectation in about 2% of those 5000 studies, approximately 100.

In those 100, there was a difference between expected and observed distributions of P values. It does not mean that the studies with normal P values were good ones nor does it necessarily mean that the papers with small P values are bad ones either.

Mandrola: When you found irregularities in trials with big names like PREDIMED, did you worry about naming names?

Carlisle: Yes. We did have very long discussions within the Anaesthesia editorial board. When I had done the systematic review for Cochrane, we had quite a few legal teams involved, and we did for this paper as well. I strongly felt that it was important that the method was published and the data that I analyzed were published so that people could spot as many errors in my paper as in the papers I was looking at. They cannot do that unless I publish what I was doing.

I wanted to be really open about it and it was important when I wrote my paper that I didn't accuse anybody of lying, research misconduct, or fraud.[2] Some people assumed that I might have done that in my paper, but if you read it, you'll see that I was very careful in what I said. I just said, "I think this might be the reason why this paper has got an unusual P value." When it came to the PREDIMED study, I stated that I didn't know why it had gotten a small P value. I did not think that correlations of variables accounted for it, and I don't think the stated randomization process could account for the unlikelihood of that particular study.

I mainly feel nervous about people feeling that I am accusing them of fraud. I worry that people might lose jobs or be falsely accused. It is a very sensitive subject, and I am aware of that. I don't want to wreck people's lives just for the sake of doing it. I think it is very important in both writing and reviewing papers that we are cautious about any assumptions made.

Replicating the Method

Mandrola: Can your method be easily replicated by journal editors or peer reviewers?

Carlisle: Yes. Most of the time, simple t-tests and ANOVAs can be applied and a reasonable P value calculated. The only exception, where you might need to run a simulation, is where the means are reported as being the same. I published the code that I used to run my analysis, which is freely available and designed to run in R, a free software environment.

There are a few journals already using it. I screen all the RCTs coming through my journal. I'm not quite sure exactly what the New England Journal [of Medicine] is doing, but I understand they are now analyzing those baseline characteristics, which they hadn't done before. I'm not exactly sure whether they are using my code or some of their own.

Mandrola: Has this work left you with any impressions about the state of the evidence, as it is?

Carlisle: I feel, in a way, reasonably confident and optimistic about the future because I think there is a much greater move for people to accept that we're all human, liable to make errors, and we have our frailties. Once journal editors and authors are more open to the possibility that they may have made an error and more prepared to allow people to look at their data and help, because that is really what we are trying to do, I think that evidence-based medicine could be improved. In the past, we have been hidden behind our own doors. Bringing out the data and the questions is a healthy thing to do. So I am fairly optimistic about it.

Mandrola: Has it changed your approach to being either an early adopter or a slow adopter? I am sort of a slow adopter. I get some criticism for that. I wonder how you feel and whether this has changed?

Carlisle: I don't know whether it's youth having flown or some other reason, but as you get older you have seen fads come and go. Often, a practice that was initially supported by very strong evidence may turn out to be not the best thing. John Ioannidis has written about this pendulum process where something is initially popular and then people are against it, eventually finding an equilibrium that is supported by the evidence. I have definitely moved towards being a slow adopter.

I think the analysis of PREDIMED and other studies might encourage other viewers today to think about not jumping on a bandwagon and instead taking a step back. I don't think your patients will suffer if they're not the first to receive a new drug, but they may suffer if you are an early adopter. Until there is evidence coming from different directions, all in agreement, I think we need to be cautious.

Are Systematic Reviews Better?

Mandrola: That brings up the question of systematic reviews and the quality of systematic reviews versus one or two [individual trials]. We are taught that a systematic review is a high level of evidence, but there seem to be more and more of these. Do you have any comments about the quality of these papers?

Carlisle: I think the quality of systematic reviews varies in a similar way to the quality of RCTs. Just as you can see some fairly rubbish RCTs and recognize them, people who are familiar with systematic reviews will see a bad systematic review and recognize it as being bad. Just because a paper is labeled a systematic review does not mean that what it says is correct or even a good reflection of the evidence as a whole. There are instances where you might have two author groups publishing a systematic review of the same papers and coming to different conclusions, which reveals that a systematic review is an observational study in itself.

A systematic review observes randomized controlled trials and their results, but it is not itself a randomized controlled trial, and it is perhaps open to more biases than a very well-conducted randomized controlled trial would be. There have been arguments that a single, large, multicenter study might be better than a systematic review that, by pooling different studies together, has just as many or even more participants. One of the arguments is that a systematic review will include potentially bad as well as good randomized controlled trials.

But, as the PREDIMED study illustrates, if you put all your eggs in the basket of one single, large randomized controlled trial, you are then vulnerable to any overt or covert problems within the study. Usually, if that is overt, you won't believe the results. But as the PREDIMED study showed, they had some problems they were unaware of, and I think some people who commented on the rebooted PREDIMED study[3] that was published after the retraction argued that there may be remaining problems with it that just haven't been discovered.

Mandrola: Do you have any more plans to use this technique on other studies in the future?

Carlisle: I have a lot of papers coming through for my journal, Anaesthesia. We have identified problems with a number [of papers] that have not been published, including clear instances of fraud. One of my jobs will be to write [an analysis of the] number of papers coming through. Whether we name authors has not yet been determined. As I noted, there are some problems with doing that. But I think it would be interesting for readers and other journal editors to be aware of the problems in papers that were not published.

Many papers are submitted to journals that will only publish 10% to 15% of the papers they see. If one is analyzing baseline data, you can start to pick up problems. We've asked a number of authors for their raw, individual patient data and we will identify problems that way. I think there are a number of interesting aspects to this and I'll certainly be looking at those in the future.

