Tivozanib, Aveo And FDA; Did A Good Drug Get Mugged?

Is there evidence that the staff at FDA's division of oncology products abandoned their ethics when preparing the briefing documentation and presentation material for the May 2012 Oncologic Drugs Advisory Committee meeting? The answer may have relevance for the SEC investigation of Aveo Pharmaceuticals (NASDAQ:AVEO) and for assorted other investigations.

Is there evidence that tivozanib outperformed sorafenib in the trial that was the topic of the May 2012 Oncologic Drugs Advisory Committee meeting? The answer may have relevance for the financial and judicial fate of Aveo and may also (hopefully) influence how FDA judges results from future kidney cancer trials.

1. Part I: Verifying FDA's Claims

FDA's staff makes several dubious claims in the briefing book [1] and in the presentation [2] for the May 2012 meeting of the Oncologic Drugs Advisory Committee (ODAC). Below are two claims of particular interest, followed by a list of claims I think are either unsupported by facts or grossly misleading.

1.1 Three Card Monte

There is a set of connected statements in the briefing book, in the presentation and in the transcript [5] of the morning session of the May 2012 advisory committee meeting.

Visual aids: Table 1 provides the necessary background information, Figure 1 is table 7 of FDA's briefing book, and Figure 2 is FDA's slide 17. I break the action down so that all can appreciate it.

In Table 1 the tivozanib trial is highlighted in yellow. If you sort the trials by the average percentage for the MSKCC Favorable group, tivozanib sits right in the middle. There is nothing remarkable in the tivozanib trial's percentages. All primary sources ([12], [13] and [14]) have the same numbers.

Figure 1 - Now you see it

Nothing remarkable in Figure 1 (FDA's table 7) either. It shows the same numbers as Table 1. Why, then, is there the following sentence, referring to table 7, on page 7 of the briefing book?

"Patients were evenly divided between performance status 0 (45% tivozanib, 54% sorafenib) and performance status 1 (55% tivozanib, 46% sorafenib). Note that a substantial number of patients had a MSKCC favorable prognosis."

I could not figure out what FDA meant by it. It got really perplexing with slide 17 in FDA's presentation - Figure 2.

Figure 2 - Now you don't

What is going on here? 26.9 % is now 54 %, and so on. A partial answer comes on page 73 of the meeting transcript.

Revealing the ace:

"We agree that the arms are well-balanced. I show this information just to emphasize that the vast majority of the patients in this trial had favorable prognostic characteristics, and this impacts comparison of the results of this trial to that of others performed in the past."

All this is so deftly executed that even the representatives for Aveo don't catch on. Is preventing comparison with other trials so important that one is willing to risk charges of unethical conduct? No, the intended damage goes beyond preventing comparison. Slide 17 dramatically lessens the value of the tivozanib trial's median survival times. How? MSKCC risk groups are among the most significant prognostic factors (predictors) of how long kidney cancer patients will survive. Without treatment, or under mildly effective treatment, the survival time ratios between MSKCC risk groups are roughly as follows:

Poor - 1, Intermediate - 2, Favorable - 4.

In other words: for every month that the MSKCC Poor group survives, the Favorable group is likely to survive 4 months. Every expert on kidney cancer is well aware of this relation.

FDA's slide 17 implies, by altering the tivozanib trial's MSKCC Favorable ratio from one of the lowest to the highest, that the long survival is due to a high ratio of long-living patients, not to the treatment.

1.2 Ninja Toxin?

Figure 3 - Ninja Toxin

FDA's slide 37 lists possible hypotheses to explain the survival results. One hypothesis on the list: tivozanib has greater delayed toxicity or toxicity not recognized. The same possibility comes up in FDA's commentary:

"… The very real possibility that death from toxicity contributed to a worse survival on the tivozanib …"

Sometimes simple tools bring clarity when more powerful ones confuse. Figure 3 presents digitized Kaplan-Meier survival estimates [14] and cumulative hazard functions for tivozanib and sorafenib over a period of 24 months. If the Kaplan-Meier curve estimates the probability of surviving until time T, then the hazard function estimates the instantaneous risk (rate) of dying right after T, given survival up to T. Basically they are related views of the same data. The common definition of the cumulative hazard function is

H(t) = -ln(S(t)), where S(t) is the corresponding survival function and ln is the natural logarithm

One can read cumulative hazard function as follows:

  1. An upward bending (steepening) cumulative hazard function means increasing risk.
  2. A downward bending (flattening) cumulative hazard function means decreasing risk.
  3. An approximately constant slope (an approximately straight line) means constant hazard or risk.
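To make these reading rules concrete, here is a minimal Python sketch that turns digitized Kaplan-Meier points into a cumulative hazard and checks the slope on each interval. The survival values are hypothetical (a curve that loses 10 % of the remaining patients every 6 months), not the digitized trial curves, and the function names and the tolerance in describe_trend are my own illustrative choices.

```python
import math


def cumulative_hazard(survival):
    """H(t) = -ln S(t) for each digitized survival probability (all must be > 0)."""
    return [-math.log(s) for s in survival]


def interval_slopes(times, hazard):
    """Average hazard rate on each interval: (H(t2) - H(t1)) / (t2 - t1)."""
    return [
        (hazard[i + 1] - hazard[i]) / (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    ]


def describe_trend(slopes, tol=1e-3):
    """Crude reading of the slope rules from consecutive slope changes."""
    diffs = [b - a for a, b in zip(slopes, slopes[1:])]
    if all(abs(d) <= tol for d in diffs):
        return "roughly constant hazard"
    if all(d >= -tol for d in diffs):
        return "increasing hazard"
    if all(d <= tol for d in diffs):
        return "decreasing hazard"
    return "mixed"


# Hypothetical digitized points: months and S(t), losing 10 % every 6 months.
times = [0, 6, 12, 18, 24]
survival = [0.9 ** k for k in range(5)]
H = cumulative_hazard(survival)
slopes = interval_slopes(times, H)
trend = describe_trend(slopes)
```

With these made-up points every interval has the same slope (about 0.0176 per month), which is a straight line and a constant hazard. A slope near zero on some interval would correspond to the kind of leveling off discussed for sorafenib.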

In Figure 3 the blue dotted line depicts the cumulative hazard function for tivozanib and it looks surprisingly straight to me. Where is the toxicity? Maybe FDA is talking about toxicity that is unobservable and kills without killing - sort of ninja style.

Maybe tivozanib after sorafenib is toxic. The red dotted line is the cumulative hazard function for sorafenib. The line is livelier than tivozanib's, but there are no signs of increasing risk. On the contrary, sorafenib's cumulative hazard line occasionally levels off, meaning that there are periods when the instantaneous risk of dying seems to be ZERO. The first leveling is around the time when the dominant MSKCC risk group in the sorafenib channel, Intermediate, hits its median progression free time of 7.4 months [15] and is in the process of crossing over to tivozanib; close to 20 % of patients in the sorafenib channel are already on tivozanib at this time [16]. The second leveling happens around 17 months, when the on-tivozanib percentage in the sorafenib channel approaches and exceeds 50 %. After that, sorafenib's trace keeps slowly bending downwards: a continuous, gentle reduction of the risk of dying. No signs of increase in toxicity anywhere.

1.3 No Consistency - Not At All

FDA's claim on page 10 of the briefing book:

"The 23% response rate in the sorafenib arm observed in this trial is not consistent with the observed response rates in the sorafenib arm in other randomized trials..."

FDA is implying consistency where there is none: the statement presumes that the other trials agree with one another. In fact, the response rates from all three trials are inconsistent with each other. The total lack of consistency is easy to prove using Fisher's exact test - Table 2.

The highest p-value from the Fisher exact tests for response rate equality (consistency) was p = 0.000022. Rejection at the 5 % significance level (95 % confidence) requires only p < 0.05, so the response rates are all inconsistent with one another and FDA can't make the consistency claim. FDA repeats this approach on pages 10 - 11, this time with dosing changes and with this absurd statement (look at the percentages):

"In the recent axitinib/sorafenib study, 80% of patients on the sorafenib arm required a dose interruption and 52% a dose reduction. However, in the sorafenib/placebo trial, 14% of patients on sorafenib required a dose interruption and 10% a dose reduction. Therefore, the degree of dose reduction/interruption in this trial is not consistent with other studies of sorafenib."

Fisher's exact tests for this claim are in Table 3. FDA is, again, claiming consistency where there is none. Note that dose reductions for tivozanib and axitinib are getting close to being equal (p-value: 0.028). Sorafenib is the ultimate outlier in this comparison.
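For readers who want to check Tables 2 and 3 themselves, the two-sided Fisher exact test on a 2x2 table (one table per pair of trials, responders vs. non-responders) can be sketched in plain Python. This is my own minimal implementation of the usual convention, which sums the probabilities of all tables with the same margins that are no more likely than the observed one; scipy.stats.fisher_exact does the same job if SciPy is available. The counts in the usage line are hypothetical, not the trials' actual numbers.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Rows are trials; columns are responders / non-responders.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # given fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    total = 0.0
    for x in range(lo, hi + 1):
        p = prob(x)
        # Small relative tolerance guards against floating-point ties with p_obs.
        if p <= p_obs * (1 + 1e-7):
            total += p
    return min(1.0, total)


# Hypothetical example: 60 of 260 responders in one trial vs. 40 of 451 in another.
p = fisher_exact_two_sided(60, 200, 40, 411)
```

A p-value below 0.05 for a pair of trials rejects the hypothesis that the two response rates are equal at the 5 % significance level.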

1.4 A Quick List of the Rest

I stop detailed discussion here and list a few more items with a short answer. Page refers to the briefing book, Slide to the presentation. I have also included references to sources in case somebody wants to dig deeper.

  1. Slide 30 plus transcript page 79: FDA is implying regional differences in adverse event counts. Did FDA prove regional differences in adverse event counts? - No. This slide is useless: adverse event counts seem to vary wildly even between predominantly western trials. See, for example, the figures for sorafenib in the original sorafenib trial [3] (in the clinical review(s) section) - any adverse event: 84.6 %, grade 3 - 4: 30.2 %. In the more recent axitinib trial [17] - any adverse event: 97.7 %, grade 3 - 4: 51.3 %.
  2. Slide 22 plus transcript page 75: "Due to significant right-censoring, the estimate for survival difference was not stable at the median." Is the Kaplan-Meier median unstable under right-censoring? - No. Estimates of the mean are unstable; estimates of the median are not [18], [19]. Could FDA's staff be blissfully ignorant of the difference? Not likely. Censoring of up to 60 % of a trial arm's population has very little effect on the KM median [18], and censoring in the tivozanib trial did not reach 60 % [2]. (In case somebody wonders how I can say that: (Population - Deaths)/Population is the upper limit for censoring.) [Following is added for clarification] But FDA may have meant (with poor wording) that the population at risk becomes very low somewhere after 28 months, and that makes the medians imprecise. That interpretation is correct.
  3. Slide 25 plus transcript page 77: "There has been a secular trend for improvement in survival over the last decade. Specifically, for sorafenib, one can see continuous improvement despite the fact that middle trial was performed with a patient population that was more heavily pretreated, including prior targeted therapy." This is another attempt by FDA to put down the tivozanib trial's medians. Of the three sorafenib trials, the first is the original sorafenib trial, with sorafenib as the study arm and, by protocol, no follow-up treatment. The other two have sorafenib as the control arm, with all sorts of follow-up treatments. Can you compare results from one trial's study arm with results from another trial's control arm? - No, but FDA does.
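The mechanics behind item 2 can be illustrated with a minimal Kaplan-Meier sketch. It assumes the usual convention that patients censored at a death time are still at risk at that time, reads the median as the first time the estimate drops to 0.5 or below, and includes the (Population - Deaths)/Population upper bound on censoring. The data and function names are hypothetical, for illustration only; the sketch shows how censored patients enter the estimate, not the stability results of [18], [19].

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate; events[i] is 1 for death, 0 for right-censored.

    Returns a list of (time, survival) steps at each distinct death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    steps = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = censored = 0
        # Group all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            else:
                censored += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            steps.append((t, survival))
        # Censored patients leave the risk set only after time t.
        at_risk -= deaths + censored
    return steps


def km_median(steps):
    """Smallest time at which the KM estimate drops to 0.5 or below, else None."""
    for t, s in steps:
        if s <= 0.5:
            return t
    return None


def max_censored_fraction(population, deaths):
    """Upper limit for censoring: (Population - Deaths) / Population."""
    return (population - deaths) / population


# Hypothetical arm of six patients (months, 1 = death, 0 = censored).
example_steps = kaplan_meier([1, 2, 3, 4, 5, 6], [1, 0, 1, 1, 0, 1])
median = km_median(example_steps)
```

Censored patients never cause a step down; they only shrink the risk set, which is why the median stays well defined as long as the curve actually crosses 0.5.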

1.5 Part I: Conclusions

Did FDA's staff dance around facts? Sure looks like it. Did FDA's staff breach the rules of professional ethics? Maybe it is possible to explain away all the other discrepancies I listed as mistakes, honest lapses etc., but that will not do for altering the MSKCC risk group numbers: premeditation is shown by the perplexing sentence in the briefing book.

2. Part II: Tivozanib - A Good Drug?

2.1 The Trouble with Temsirolimus and Why Overall Survival Is Misleading

The sentiment expressed by some advisors during the May 2, 2012 ODAC meeting: kidney cancer treatment already has, in temsirolimus, a drug with proven survival benefit, so one should not add to the complexities faced by practitioners by approving a drug without similar merit.

Temsirolimus has demonstrated survival benefit - proved in a trial specifically designed to achieve that result in overall survival. The benefits are only for a limited number of kidney cancer patients, for those who belong to MSKCC Poor prognostic group or have non-clear cell kidney cancer [21]. In the trial that resulted in its acceptance temsirolimus had this record against placebo, Table 4:

Yes, placebo (plus best standard care, naturally) handily beat out temsirolimus in the Intermediate group. Only heavy weighting of MSKCC groups towards Poor kept overall survival favorable to temsirolimus. Then came this piece of news (pardon the dramatics):

Special Clinical Sports Bulletin

ESMO, Vienna 2012 [22], [23]. The vaunted survival champion temsirolimus went down hard in a clinical bout vs. sorafenib. Official scoring: Median Overall Survival - temsirolimus 12.3 months, sorafenib 16.6 months and Hazard Ratio 1.31. The bout was MSKCC risk group fair with the Intermediate group being dominant in both arms.

A presentation at the same ESMO meeting gave the Kaplan-Meier survival breakdown by MSKCC groups for the AXIS trial (axitinib vs. sorafenib) [22]. Overall medians were 20.1 vs. 19.2 months in favor of axitinib, but sorafenib did outperform axitinib in the MSKCC Intermediate group, 23.9 vs. 18.8 months. The Intermediate group was obviously not the dominant group in this trial.

The above should make it clear that MSKCC risk group distribution is a significant factor in the duration of overall survival (I would call it the dominant factor). It should also be obvious that overall survival can be misleading: it can be manipulated by selecting an MSKCC risk group distribution that favors the study compound, and even relatively slight differences in distributions between study arms will have an effect. In light of these observations, FDA's preference for overall survival seems senseless at best, harmful at worst.

2.2 How Much Bias by MSKCC Risk Groups?

Is there a way to estimate the bias due to the imbalance in the distribution of MSKCC risk groups? Yes, there is. The following relationship between the overall survival function and the survival functions for the MSKCC groups will hold for any kidney cancer trial:

So(t) = wf*Sf(t) + wi*Si(t) + wp*Sp(t)

So is the overall survival function; Sf, Si and Sp are the survival functions for the underlying MSKCC risk groups; and wf, wi and wp are the weighting factors for each group (their fractions of the trial arm population, so wf + wi + wp = 1).
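The mixture formula, and the kind of routine that turns it into an "equivalent median", can be sketched as follows. This is not the routine used for Table 5: instead of digitized KM curves from [24], it assumes exponential group curves with hypothetical medians of 20, 10 and 5 months (matching the 4 : 2 : 1 Favorable : Intermediate : Poor ratio quoted earlier), combines them with hypothetical arm weights, and finds the median of the mixture by bisection.

```python
import math


def exponential_survival(median):
    """Exponential survival curve S(t) = exp(-rate * t) with the given median.

    The exponential shape is an assumption made for this sketch only.
    """
    rate = math.log(2) / median
    return lambda t: math.exp(-rate * t)


def mixture_survival(weights, curves):
    """So(t) = wf*Sf(t) + wi*Si(t) + wp*Sp(t); weights are population fractions."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return lambda t: sum(w * s(t) for w, s in zip(weights, curves))


def median_time(survival, upper=1000.0, tol=1e-9):
    """Bisection for the time where a decreasing survival curve crosses 0.5."""
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if survival(mid) > 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical baseline medians in months, ordered Favorable, Intermediate, Poor.
curves = [exponential_survival(m) for m in (20.0, 10.0, 5.0)]

# Hypothetical arm weights (fractions Favorable, Intermediate, Poor).
arm_a = median_time(mixture_survival((0.2, 0.5, 0.3), curves))
arm_b = median_time(mixture_survival((0.5, 0.4, 0.1), curves))
```

The mixture median is not the weighted average of the group medians; it has to be solved for, which is why a routine is needed at all. Shifting weight towards the Favorable group lengthens the median (arm_b > arm_a here) even though nothing about the treatment changes.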

One just needs to find suitable baseline survival curves for the MSKCC risk groups. The best option would be survival curves for MSKCC risk groups based on a large number of freshly diagnosed patients left untreated. This is so grisly that, for humanity's sake, I hope there is no such study. The second best option is survival curves for treatment naïve patients (not treated with targeted therapies like sorafenib or tivozanib) under minimally effective treatment: the original studies that introduced the MSKCC risk groups [24], [25]. The patients were treatment naïve, and only minimally treated thereafter, because of the time period the studies covered - the 1990s and before, when the best treatments available were radiation, cytokines and surgery. Both studies are retrospective - no censoring; relative accuracy (on the probability axis) should not decrease with increasing time. The original study [24] had more patients (and better quality images), so I digitized the KM curves from it and wrote a little routine to get equivalent median estimates based on this MSKCC Treatment Naïve Group. I checked the accuracy by comparing the median from the original study with the routine's estimate for it: published overall median 10 months, my estimate 10.0 months - accurate enough for my purposes. Table 5 uses the MSKCC risk group percentages from Table 1 to calculate 'Equivalent Medians by MSKCC Treatment Naïve Groups'.

The bias in Table 5 is estimated by the ratio between the calculated equivalent medians of the trial arms. If the number is over 1, the bias favors the study arm; if it is below 1, the bias favors the control arm. Temsirolimus has the highest bias favoring the study arm (1.08), and tivozanib has a meaningful bias against the study arm (0.94). What might be the effect on medians? Table 6 has the adjusted medians, i.e. the control arm median multiplied by the corresponding bias ratio from Table 5.
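The bias ratio and the Table 6 adjustment are simple arithmetic. A minimal sketch, with hypothetical equivalent medians and a hypothetical control median rather than the Table 5 and Table 6 values:

```python
def bias_ratio(equiv_median_study, equiv_median_control):
    """Ratio of equivalent medians: > 1 favors the study arm, < 1 the control arm."""
    if equiv_median_study <= 0 or equiv_median_control <= 0:
        raise ValueError("equivalent medians must be positive")
    return equiv_median_study / equiv_median_control


def adjusted_control_median(control_median, ratio):
    """Control arm median multiplied by the bias ratio (the Table 6 adjustment)."""
    return control_median * ratio


# Hypothetical numbers for illustration only (not Table 5 / Table 6 values).
ratio = bias_ratio(9.4, 10.0)  # study arm's risk mix predicts shorter survival
adjusted = adjusted_control_median(20.0, ratio)
```

Here the ratio comes out at about 0.94, so a hypothetical 20-month control median would be scaled down to about 18.8 months before comparing it with the study arm.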

It looks like the elimination of the bias alone can swing the medians from favoring sorafenib to favoring tivozanib. One can only wonder what might happen with hazard ratio.

Have I shown enough to make FDA agree? No - not by a mile! I don't have any means to prove how accurate my bias estimates and corrections are. I can't show what the hazard ratio would be (well, I can - and it would not be accurate). Is this a dead end, then? No! There is a way: reversal of thinking. Don't look at overall survival estimate, look at survival estimates for MSKCC Risk groups - no need for bias adjustments. Aveo has all the data it needs to do this comparison. My opinion is that they should do it and publish the results. They can't be worse than the overall survival curve. Open letter to Aveo follows:

2.3 Dear Aveo Pharmaceuticals

I hear you are in a tight spot. FDA roughed up tivozanib for use in kidney cancer treatment and now all sorts of institutions and lawyers are coming after you. I have an idea that just might get you off.

  1. First perform a simple check: change the stratification by ECOG score in the Cox model to stratification by MSKCC risk groups. Did the hazard ratios go down?
  2. You did this already for progression free survival (PFS), do it now for survival: break down Kaplan-Meier overall survival estimates by MSKCC risk groups. They should look better than the overall survival curve. If it's still too close to call (sorafenib is a strong performer in the Intermediate group) do not worry. You can handle even that.
  3. This is how you create a reliable estimate for the crossover effect. You had an inkling of how to do it in figure 6A of your overall survival presentation. Take the on-tivozanib group of patients in the sorafenib channel (N=156) and break it down by MSKCC risk groups. Compare these KM curves for the Favorable and Intermediate groups with the corresponding KM curves for sorafenib from step 2. Think about what you have there: it is a like-against-like comparison, one curve 100 % on tivozanib, the other about 50 % on tivozanib. (You need to calculate the actual percentage on tivozanib for the curves from step 2.) Do you see any differences? How many ways are there to explain the differences? Nephrectomy? - Nope, practically 100 % of the trial population had it. Prior cytokines? - Nope, 21 % of the sorafenib population had prior cytokines, and it would be a miracle if all of them were in one group only. Thanks to your 'bad trial design', only crossover remains.
  4. What is the MSKCC risk group distribution in the sorafenib channel of Figure 7A (the one having on-sorafenib only patients)? It should be a mix of Intermediate and Favorable groups. If one is clearly bigger, separate it out. You know what to do with it.
  5. If it is still too close to call, which I doubt, you should still proceed to step 6.
  6. Publish (soon, please) and recommend this approach for general use. It is valid and superior to FDA's preferred way of evaluating survival. In addition, the way it handles the crossover is superior to, for instance, the Rank Preserving Structural Failure Time Model. Maybe they get the hint.

Sincerely, K. Forlorn

2.4 Meanwhile in the Insurance City

Hi, you guys at health insurance companies. How does it feel to know that your estimates for lifetime costs of certain kidney cancer treatments are garbage unless the MSKCC risk group distribution in the trial you used to fit your Weibull distribution on matches the distribution seen in the general practice? Planning to do something about that?

Goodbye and thank you for reading.


