Being a biotech investor is difficult and full of pitfalls. Not only is there clinical trial risk, but there are also regulatory and commercial risks, each of which is significant. One danger that isn't talked about enough is post-hoc analysis of subgroups within a patient population. Often, when clinical trials fail, management teams tout the results of a post-hoc analysis, using it as a way to save face and justify the trial, or worse, to raise money from investors to keep the company afloat and continue to pay themselves a comfortable salary.
In our opinion, post-hoc analysis is inherently risky, prone to spurious findings, and should be viewed with extreme caution. History is replete with examples of bad post-hoc analyses. Two examples in today's market come from Oncothyreon (NASDAQ:ONTY) and Celsion (NASDAQ:CLSN). We will point out why we believe the post-hoc analyses from Oncothyreon and Celsion are bad science and why we don't believe investors should put much stock in the results.
Dangers of post-hoc subgroup analyses:
While post-hoc analysis of subgroups within a clinical trial can occasionally be valuable and provide insights, it has serious limitations. A review of the literature on clinical trials and post-hoc analysis makes it clear that clinicians, investors, and regulators should all view post-hoc analysis of subgroups with extreme caution. As Dr. Peter M Rothwell, an expert in the field, put it, "Post-hoc observations should be treated with skepticism irrespective of their statistical significance."^{1}
We think that post-hoc analyses done for failed clinical trials are even worse. For those analyses, Dr. Kenneth F Schulz, another expert, states, "Be especially suspicious of investigators highlighting a subgroup treatment effect in a trial with no overall treatment effect. They are usually superfluous subgroup salvages of otherwise indeterminate (negative) trials."^{2} And yes, this is exactly what is happening with both Celsion and Oncothyreon. Another expert puts it more bluntly: "Subgroups kill people."^{3}
Why is this the case?
First and foremost, it's important to note that the hypothesis is created by the post-hoc analysis itself. It was not derived in advance from a scientific rationale, and it has not been tested by experiment. This is extremely important. It's similar to a sharpshooter "who fires at a barn and then paints a target around the bullet hole. A target shows how accurate the shot was only if it was in place before the shooting. In the same way, statistical tests applied to unusual looking results may give the false impression of a 'bull's eye.'"^{4}
The results of a subgroup analysis only become relevant if they can be replicated in a randomized clinical trial. Otherwise, one cannot be certain that the subgroup analysis has any meaning whatsoever.
Second, if a company does enough post-hoc analyses using different subgroups, it is nearly certain to find something statistically significant. This problem is called "multiplicity" in statistics, and results in inflated false positives.^{5} For example, even if a clinical trial showed no true treatment effect, if you split the study population into 20 mutually exclusive subgroups, the probability of at least one significant but false positive result at a p-value of 0.05 is 64%.^{6}
The chance of at least one false positive is the complement of the probability that all twenty tests correctly come back negative. Assuming the tests are independent, this is 100% - (95% ^ 20) = 100% - 36% = 64%.
In fact, if a study population is split into 60 subgroups, up to three statistically significant interaction tests (p<0.05) can be expected on the basis of chance alone!^{5}
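The arithmetic in the two paragraphs above can be reproduced in a few lines. The following is a minimal sketch that assumes every subgroup test is independent and that there is no true treatment effect; the function names are ours and purely illustrative.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across n independent tests."""
    # Chance that every test correctly comes back negative is (1 - alpha)^n,
    # so the chance of at least one false positive is the complement.
    return 1 - (1 - alpha) ** n_tests


def expected_false_positives(n_tests, alpha=0.05):
    """Average number of tests that cross the threshold by chance alone."""
    return n_tests * alpha


if __name__ == "__main__":
    print(f"20 subgroups: {family_wise_error_rate(20):.0%} chance of a false positive")
    print(f"60 subgroups: {expected_false_positives(60):.0f} false positives expected")
```

Under these assumptions the risk climbs quickly: even a handful of extra subgroup looks moves the false-positive risk well above the nominal 5%.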
An infamous example of inappropriate subgroup analysis comes from a 1988 Lancet paper that looked at the effect of taking aspirin after acute myocardial infarction. The Lancet editors wanted nearly 40 subgroup analyses. Knowing what this would lead to, the authors obliged on the condition that they could add a few of their own. While the overall study showed a small positive effect, the authors showed that participants born under Gemini or Libra fared slightly worse on aspirin (a 9% increase in deaths, not statistically significant); however, patients born under all other astrological signs reaped a huge beneficial effect of 28% with an incredible p-value of p<0.00001!^{7}
Despite the dangers of post-hoc subgroup analysis, it's all too common. Investigators and biotech companies are inherently biased and want to put their best foot forward. All too often, they engage in post-hoc analysis to find anything positive in the results. Dr. Olivier Naggara examined published research papers summarizing clinical trials and found "the prevalence of trial publications claiming at least 1 (statistically significant) subgroup effect has ranged from 25% to 60%."^{6} Data mining is commonplace!
What's worse is that the research papers don't reveal how they conducted the subgroup analyses and only provide the positive results. Best practice stresses that the way a post-hoc analysis is conducted is more important than the results it provides. Not revealing details about the methodology behind a study's post-hoc analysis provides an incomplete picture, and one cannot derive any conclusions from it.
As a result of bad post-hoc analyses, a number of organizations have tried to create guidelines for what constitutes good and bad post-hoc analyses. These reports can be found in our bibliography below. The general guidelines for good post-hoc analyses are:
1) Don't have too many subgroups
2) Pre-specify the subgroups before conducting the trial
3) Clearly articulate at the beginning of the trial why the subgroup is relevant, as well as the expected direction and magnitude of the subgroup effects
4) Use a different statistical model for determining statistical significance of subgroups
Sadly, these guidelines are rarely followed. Too often, subgroup analyses are fishing expeditions and data-dredging exercises.
Two recent post-hoc analyses by Oncothyreon and Celsion:
Late last year and earlier this year, both Oncothyreon and Celsion announced negative results from their respective Phase III clinical trials, only to come back from the ashes with "promising" subgroup analyses.
Celsion announced that patients who received RFA (radiofrequency ablation) along with ThermoDox for greater than 45 minutes had a meaningful clinical benefit in both PFS and OS. The Celsion release can be found here.
Oncothyreon showed that patients who received Stimuvax after concurrent CRT (chemoradiotherapy) did significantly better than those on placebo and achieved statistical significance (HR 0.78; 95% CI 0.64-0.95; p=0.016). The link can be found here.
However, an examination of each of the respective post-hoc analyses shows significant issues.
There are three main issues that we see with Oncothyreon's post-hoc analysis. While the Company did pre-specify the subgroups, it failed to follow rule #3 above. It was not clear at the beginning of the trial why Oncothyreon believed a difference should be seen between the concurrent and sequential CRT patients, nor did Oncothyreon postulate the expected direction and magnitude of the subgroup effects. It seems they just stumbled upon the result and seized on it as their last prayer for Stimuvax.
In addition, we don't know how many subgroup analyses Oncothyreon conducted (rule #1). This is extremely important because it determines how the statistical analysis should be adjusted. A common way to adjust for multiple subgroup analyses is to divide the significance threshold by the number of subgroups analyzed. "One way to ensure that the overall chances of a false-positive result are no greater than 5% (0.05) is for each test to use a criterion of 0.05/n, to assess statistical significance (the Bonferroni correction). For example, if 20 tests are conducted, each should use P=.0025 as the threshold for significance."^{6}
As it relates to Oncothyreon, we know that they pre-stratified the patients on a number of different criteria. If they used 10 different subgroups, then the adjusted significance threshold for the post-hoc analysis should've been 0.005. We know they stratified the patients on at least 4 criteria: geographical region; concurrent or sequential CRT; whether the patient responded or was stable after CRT; and IIIA or IIIB stage NSCLC. Four subgroups would make the p-value required to show statistical significance 0.0125 (0.05 divided by 4). The p-value they actually reported, 0.016, doesn't meet that threshold even for the 4 pre-specified groups in the trial. Thus, in addition to breaking rule #3 by not articulating the importance of the subgroup in advance, Oncothyreon also broke rules #1 and #4. Nor did they provide adequate transparency by releasing the results of their other subgroup analyses. This leads us to believe they put their best foot forward and are not being completely honest about their findings.
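The threshold check described above can be sketched as follows. This is only an illustration of the Bonferroni rule quoted earlier, applied to the reported p-value of 0.016; the function names are hypothetical, and the subgroup counts of 4 and 10 are the scenarios discussed in the text, not figures disclosed by the Company.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def passes_bonferroni(p_value, n_tests, alpha=0.05):
    """True if a single p-value clears the Bonferroni-adjusted threshold."""
    return p_value < bonferroni_threshold(n_tests, alpha)


if __name__ == "__main__":
    reported_p = 0.016  # p-value reported for the concurrent-CRT subgroup
    for n_subgroups in (1, 4, 10):  # assumed scenarios for the number of subgroups
        threshold = bonferroni_threshold(n_subgroups)
        verdict = "significant" if passes_bonferroni(reported_p, n_subgroups) else "not significant"
        print(f"{n_subgroups} subgroup(s): threshold {threshold:.4f}, p=0.016 is {verdict}")
```

The reported result only clears the bar if it is treated as the sole test performed, which is why the undisclosed number of subgroup analyses matters so much.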
If Oncothyreon's analysis is not scientifically sound, then we believe Celsion's analysis crosses the Rubicon into the ridiculous. First, Celsion did not use a pre-specified subgroup. After the HEAT Phase III trial concluded (and failed), they appear to have gone back into the data to find any subgroups that had positive findings. This was confirmed to us by our consultations with experts (click here for our report) and is the very definition of data mining.
In addition, the type of subgroup analysis Celsion conducted is especially misleading. To distinguish their subgroups, Celsion used the duration of RFA treatment. Any cutoff on time of treatment, like a cutoff on patient age, is arbitrary. Furthermore, time of treatment is a continuous marker. Celsion could have chosen a cutoff anywhere from 0 minutes to 90+ minutes, which gives them a practically unlimited number of chances to find something statistically significant. Essentially, they were all but assured of finding something positive because they gave themselves so many opportunities to look.
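To illustrate why scanning a continuous marker for the best cutoff inflates false positives, here is a small simulation sketch. It does not use Celsion's data. Every number in it is an assumption made for illustration: 300 simulated patients per trial, treatment durations spread uniformly over 0 to 90 minutes, a 40% event rate that is identical in both arms (so there is no true effect), a simple binary outcome instead of PFS or OS, and a two-proportion z-test.

```python
import math
import random


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test (normal approximation)."""
    if n_a == 0 or n_b == 0:
        return 1.0
    pooled = (events_a + events_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        return 1.0
    z = (events_a / n_a - events_b / n_b) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_null_trial(rng, cutoffs, n_patients=300):
    """Simulate one trial with no true effect; return one p-value per duration cutoff."""
    # Each patient: (on treatment arm?, treatment duration in minutes, had event?)
    patients = [(rng.random() < 0.5, rng.uniform(0, 90), rng.random() < 0.4)
                for _ in range(n_patients)]
    p_values = []
    for cutoff in cutoffs:
        treated = [event for arm, duration, event in patients if arm and duration >= cutoff]
        control = [event for arm, duration, event in patients if not arm and duration >= cutoff]
        p_values.append(two_proportion_p_value(sum(treated), len(treated),
                                               sum(control), len(control)))
    return p_values


def false_positive_rates(n_trials=400, seed=0, fixed_cutoff=45):
    """Return (rate with one pre-specified cutoff, rate when picking the best cutoff)."""
    rng = random.Random(seed)
    cutoffs = list(range(15, 76, 5))  # candidate cutoffs: 15, 20, ..., 75 minutes
    fixed_index = cutoffs.index(fixed_cutoff)
    fixed_hits = 0
    scanned_hits = 0
    for _ in range(n_trials):
        p_values = simulate_null_trial(rng, cutoffs)
        fixed_hits += p_values[fixed_index] < 0.05
        scanned_hits += min(p_values) < 0.05
    return fixed_hits / n_trials, scanned_hits / n_trials


if __name__ == "__main__":
    fixed_rate, scanned_rate = false_positive_rates()
    print(f"Pre-specified 45-minute cutoff: {fixed_rate:.1%} false positives")
    print(f"Best cutoff chosen after the fact: {scanned_rate:.1%} false positives")
```

Because the best cutoff is chosen after seeing the data, its false-positive rate can never be lower than that of a single pre-specified cutoff, and in a simulation like this it is typically noticeably higher.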
Worst of all, even though Celsion did not have a pre-specified subgroup and used a continuous marker, Celsion still failed to find statistical significance in their subgroup analysis. As noted earlier, if you run 60 subgroup tests, you'd expect about three statistically significant interaction tests (p<0.05) on the basis of chance alone. Celsion found none.
In essence, Celsion violated each and every one of the guidelines for conducting a post-hoc analysis:
Guideline | Celsion
Don't have too many subgroups | A nearly infinite number of possible subgroups
Pre-specify subgroups before conducting the trial | Chose a random subgroup after trial completion
Articulate at the beginning why the subgroup is relevant | Make up random explanations after finding the subgroup
Use different statistical model to determine significance | Fail to achieve statistical significance and proclaim victory nevertheless
Conclusion
We don't believe investors should be excited by either of the post-hoc subgroup analyses presented by Oncothyreon and Celsion. Both companies failed to follow proper guidelines for post-hoc analysis, and we believe they went on a fishing expedition to find anything positive to present to investors. In fact, after they presented their positive subgroup analyses, both companies raised millions of dollars from investors.
Even if the subgroup analyses are interesting, both companies would have to go back and conduct more clinical trials. They will be required to conduct another long and expensive Phase III trial, which will not conclude for years and will require several more rounds of funding, or they may even be required to go back further and start again with pre-clinical animal studies. Neither option is attractive, and both will result in significant dilution and a long wait for any conclusive results.
Looking at Celsion and Oncothyreon, we can only conclude that their lead candidates (ThermoDox and Stimuvax) are worthless. Even worse, the continued pursuit of these drugs is likely to destroy significant value at each of these companies. We believe that these companies are, at best, worth their cash value per share and that there is downside to this due to their ongoing cash burn, history of poor capital allocation, and likely need for additional equity capital. Our target price for Oncothyreon is its cash balance of $0.98 per share or 42% below the current stock price. Our target price for Celsion is $0.73 per share or 34% below the current stock price.
References
1) Rothwell PM. Treating Individuals 2 - Subgroup Analysis in Randomized Controlled Trials: Importance, Indications, and Interpretation. Lancet 2005; 365:176-86. Article link here.
2) Schulz KF. Epidemiology 5 - Multiplicity in Randomised Trials II: Subgroup and Interim Analyses. Lancet 2005; 365:1657-1661. Article link here.
3) van Gijn J. Extrapolation of Trials Data into Practice: Where Is the Limit. Cerebrovasc Dis 1995; 5:159-62. Article link here. (Article must be purchased.)
4) Fletcher J. Subgroup Analyses: How to Avoid Being Misled. BMJ 2007; 335:96-97. Article link here.
5) Wang R. Statistics in Medicine - Reporting of Subgroup Analyses in Clinical Trials. NEJM 2007; 357;21:2189-2194. Article link here.
6) Naggara O. The Problem of Subgroup Analyses: An Example from a Trial on Ruptured Intracranial Aneurysms. AJNR 2011; 32:633-636. Article link here.
7) ISIS-2 Collaborative Group. Randomised Trial of Intravenous Streptokinase, Oral aspirin, Both, or Neither Among 17,187 Cases of Suspected Acute Myocardial Infarction: ISIS-2. Lancet 1988;2:349-60. Article link here. (Article must be purchased.)
Disclosure: I am short CLSN, ONTY. I wrote this article myself, and it expresses my own opinions. I am not receiving compensation for it. I have no business relationship with any company whose stock is mentioned in this article.