Productivity Metrics and Peer Review Scores, Continued

In a previous post, I described some initial results from an analysis of the relationships between a range of productivity metrics and peer review scores. The analysis revealed that these productivity metrics do correlate to some extent with peer review scores but that substantial variation occurs across the population of grants.

Here, I explore these relationships in more detail. To facilitate this analysis, I separated the awards into new (Type 1) and competing renewal (Type 2) grants. Some parameters for these two classes are shown in Table 1.

Table 1. Selected parameters for the population of Type 1 (new) and Type 2 (competing renewal) grants funded in Fiscal Year 2006: average numbers of publications, citations and highly cited publications (defined as those in the top 10% of all research publications by time-corrected citations).

For context, the Fiscal Year 2006 success rate was 26%, and the midpoint on the funding curve was near the 20th percentile.

To better visualize trends in the productivity metrics data in light of the large amounts of variability, I calculated running averages over sets of 100 grants separately for the Type 1 and Type 2 groups of grants, shown in Figures 1-3 below.
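
As an illustration only, a running average of this kind could be computed along the following lines. This is a minimal sketch, not the code used for the figures: it assumes overlapping windows that advance one grant at a time (the post does not say whether the sets of 100 overlap), and the function name `running_averages` and the `(percentile, metric)` input layout are hypothetical.

```python
from statistics import mean


def running_averages(grants, window=100):
    """Return one (mean percentile, mean metric) point per sliding window.

    grants: iterable of (percentile, metric) pairs for one grant type.
    Assumption: windows overlap and advance by one grant at a time.
    """
    # Order grants from the best (lowest) percentile to the worst.
    ordered = sorted(grants, key=lambda g: g[0])
    points = []
    for start in range(len(ordered) - window + 1):
        chunk = ordered[start:start + window]
        avg_percentile = mean(p for p, _ in chunk)
        avg_metric = mean(m for _, m in chunk)
        points.append((avg_percentile, avg_metric))
    return points
```

Running this separately on the Type 1 and Type 2 grants would give the two curves in each figure, with the average percentile of each set on the horizontal axis.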

Figure 1. Running averages for the number of publications over sets of 100 grants funded in Fiscal Year 2006 for Type 1 (new, solid line) and Type 2 (competing renewal, dotted line) grants as a function of the average percentile for that set of 100 grants.

Figure 2. Running averages for the number of citations over sets of 100 grants funded in Fiscal Year 2006 for Type 1 (new, solid line) and Type 2 (competing renewal, dotted line) grants as a function of the average percentile for that set of 100 grants.

Figure 3. Running averages for the number of highly cited publications over sets of 100 grants funded in Fiscal Year 2006 for Type 1 (new, solid line) and Type 2 (competing renewal, dotted line) grants as a function of the average percentile for that set of 100 grants.

These graphs show somewhat different behavior for Type 1 and Type 2 grants. For Type 1 grants, the curves are relatively flat, with a small decrease in each metric from the lowest (best) percentile scores that reaches a minimum near the 12th percentile and then increases somewhat. For Type 2 grants, the curves are steeper and somewhat more monotonic.

Note that the curves for the number of highly cited publications for Type 1 and Type 2 grants are nearly superimposable above the 7th percentile. If this metric truly reflects high scientific impact, then the observations that new grants are comparable to competing renewals and that the level of highly cited publications extends through the full range of percentile scores reinforce the need to continue to support new ideas and new investigators.

While these graphs shed light on some of the underlying trends in the productivity metrics and the large amount of variability that is observed, one should be appropriately cautious in interpreting these data given the imperfections in the metrics; the fact that the data reflect only a single year; and the many legitimate sources of variability, such as differences between fields and publishing styles.

Productivity Metrics and Peer Review Scores

A key question regarding the NIH peer review system relates to how well peer review scores predict subsequent scientific output. Answering this question is a challenge, of course, since meaningful scientific output is difficult to measure and evolves over time, in some cases over a long time. However, by linking application peer review scores to publications citing support from the funded grants, it is possible to perform some relevant analyses.

The analysis I discuss below reveals that peer review scores do predict trends in productivity in a manner that is statistically different from random ordering. That said, there is a substantial level of variation in productivity metrics among grants with similar peer review scores and, indeed, across the full distribution of funded grants.

I analyzed 789 R01 grants that NIGMS competitively funded during Fiscal Year 2006. This pool represents all funded R01 applications that received both a priority score and a percentile score during peer review. There were 357 new (Type 1) grants and 432 competing renewal (Type 2) grants, with a median direct cost of $195,000. The percentile scores for these applications ranged from 0.1 through 43.4, with 93% of the applications having scores lower than 20. Figure 1 shows the percentile score distribution.

Figure 1. Cumulative number of NIGMS R01 grants in Fiscal Year 2006 as a function of percentile score.

These grants were linked (primarily by citation in publications) to a total of 6,554 publications that appeared between October 2006 and September 2010 (Fiscal Years 2007-2010). Those publications had been cited 79,295 times as of April 2011. The median number of publications per grant was 7, with an interquartile range of 4-11. The median number of citations per grant was 73, with an interquartile range of 26-156.

The numbers of publications and citations represent the simplest available metrics of productivity. More refined metrics include the number of research (as opposed to review) publications, the number of citations that are not self-citations, the number of citations corrected for typical time dependence (since more recent publications have not had as much time to be cited as older publications), and the number of highly cited publications (which I defined as the top 10% of all publications in a given set). Of course, the metrics are not independent of one another. Table 1 shows these metrics and the correlation coefficients between them.

Table 1. Correlation coefficients between nine metrics of productivity.

How do these metrics relate to percentile scores? Figures 2-4 show three distributions.

Figure 2. Distribution of the number of publications as a function of percentile score. The inset shows a histogram of the number of grants as a function of the number of publications.

Figure 3. Distribution of the number of citations as a function of percentile score. The inset shows a histogram of the number of grants as a function of the number of citations.

Figure 4. Distribution of the number of highly cited publications as a function of percentile score. Highly cited publications are defined as those in the top 10% of all research publications in terms of the total number of citations corrected for the observed average time dependence of citations.
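
To make the definition of highly cited publications concrete, here is a rough Python sketch of one way such a count could be produced. The time correction is an assumption: each publication's citations are divided by the average citation count of publications of the same age, which may differ from the correction actually used. The function name `count_highly_cited` and the `(grant_id, age_in_months, citations)` layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def count_highly_cited(pubs, top_fraction=0.10):
    """Count top-fraction publications per grant.

    pubs: list of (grant_id, age_in_months, citations) tuples.
    Assumption: time correction = citations / mean citations at that age.
    """
    # Average citations observed for publications of each age.
    by_age = defaultdict(list)
    for _, age, cites in pubs:
        by_age[age].append(cites)
    expected = {age: mean(values) for age, values in by_age.items()}

    # Score each publication relative to the average for its age.
    scored = []
    for grant, age, cites in pubs:
        score = cites / expected[age] if expected[age] else 0.0
        scored.append((score, grant))
    scored.sort(key=lambda s: s[0], reverse=True)

    # Keep the top fraction and tally it by grant.
    cutoff = max(1, round(len(scored) * top_fraction))
    counts = defaultdict(int)
    for _, grant in scored[:cutoff]:
        counts[grant] += 1
    return dict(counts)
```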

As could be anticipated, there is substantial scatter across each distribution. However, as could also be anticipated, each of these metrics has a negative correlation coefficient with the percentile score, with higher productivity metrics corresponding to lower percentile scores, as shown in Table 2.
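
For readers who want to reproduce this kind of number on their own data, the sketch below computes a Pearson correlation coefficient between percentile scores and a productivity metric. It assumes Pearson's coefficient is the one intended, which the post does not state, and the function name `pearson` is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

A negative value means that grants with lower (better) percentile scores tend to have higher values of the metric.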

Table 2. Correlation coefficients between the grant percentile score and nine metrics of productivity.

Do these distributions reflect statistically significant relationships? This can be addressed through the use of a Lorenz curve to plot the cumulative fraction of a given metric as a function of the cumulative fraction of grants, ordered by their percentile scores. Figure 5 shows the Lorenz curve for citations.

Figure 5. Cumulative fraction of citations as a function of the cumulative fraction of grants, ordered by percentile score. The shaded area is related to the excess fraction of citations associated with more highly rated grants.
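
A curve of this kind can be built with a few lines of Python. The sketch below is illustrative only; the function name `lorenz_curve` and the `(percentile, citations)` input layout are assumptions, not taken from the original analysis.

```python
def lorenz_curve(grants):
    """Return (cumulative grant fraction, cumulative citation fraction) points.

    grants: list of (percentile, citations) pairs.
    Grants are ordered from the best (lowest) percentile to the worst.
    """
    ordered = sorted(grants, key=lambda g: g[0])
    total = sum(cites for _, cites in ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, (_, cites) in enumerate(ordered, start=1):
        running += cites
        points.append((i / n, running / total))
    return points
```

If citations were unrelated to percentile score, the points would lie close to the diagonal; a curve that bows above the diagonal means the better-scoring grants hold more than their proportional share of citations.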

The tendency of the Lorenz curve to reflect a non-uniform distribution can be measured by the Gini coefficient. This corresponds to twice the shaded area in Figure 5. For citations, the Gini coefficient has a value of 0.096. Based on simulations, this coefficient is 3.5 standard deviations above that for a random distribution of citations as a function of percentile score. Thus, the relationship between citations and the percentile score for the distribution is highly statistically significant, even if the grant-to-grant variation within a narrow range of percentile scores is quite substantial. Table 3 shows the Gini coefficients for all of the productivity metrics.
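
The calculation described here can be sketched as follows. This is a minimal illustration under stated assumptions: the "random distribution" is simulated by shuffling the metric values across grants, which is one plausible reading of the simulations mentioned above, and the names `gini` and `z_score_vs_random` are hypothetical.

```python
import random
from statistics import mean, pstdev


def gini(values):
    """Twice the area between the cumulative-share curve and the diagonal.

    values: metric per grant, already ordered by percentile (best first).
    """
    n, total = len(values), sum(values)
    area, cumulative = 0.0, 0.0
    for v in values:
        previous = cumulative
        cumulative += v / total
        area += (previous + cumulative) / 2 / n  # trapezoid under the curve
    return 2 * (area - 0.5)  # 0.5 is the area under the diagonal


def z_score_vs_random(values, trials=1000, seed=0):
    """Standard deviations by which the observed Gini exceeds shuffled data.

    Assumption: random ordering is simulated by shuffling the values.
    """
    rng = random.Random(seed)
    observed = gini(values)
    shuffled = list(values)
    simulated = []
    for _ in range(trials):
        rng.shuffle(shuffled)
        simulated.append(gini(shuffled))
    return (observed - mean(simulated)) / pstdev(simulated)
```

With this sign convention, a positive coefficient means the better-scoring grants account for an excess share of the metric, and the z-score expresses how unusual that excess would be under random ordering.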

Table 3. Gini coefficients for nine metrics of productivity. The number of standard deviations above the mean, as determined by simulations, is shown in parentheses below each coefficient.

Of these metrics, overall citations show the most statistically significant Gini coefficient, whereas highly cited publications show one of the least significant Gini coefficients. As shown in Figure 4, the distribution of highly cited publications is relatively even across the entire percentile score range.

NIH-Wide Correlations Between Overall Impact Scores and Criterion Scores

In a recent post, I presented correlations between the overall impact scores and the five individual criterion scores for sample sets of NIGMS applications. I also noted that the NIH Office of Extramural Research (OER) was performing similar analyses for applications across NIH.

OER’s Division of Information Services has now analyzed 32,608 applications (including research project grant, research center and SBIR/STTR applications) that were discussed and received overall impact scores during the October, January and May Council rounds in Fiscal Year 2010. Here are the results by institute and center:

Correlation coefficients between the overall impact score and the five criterion scores for 32,608 NIH applications from the Fiscal Year 2010 October, January and May Council rounds.

This analysis reveals the same trends in correlation coefficients observed in smaller data sets of NIGMS R01 grant applications. Furthermore, no significant differences were observed in the correlation coefficients among the 24 NIH institutes and centers with funding authority.

Scoring Analysis with Funding and Investigator Status

My previous post generated interest in seeing the results coded to identify new investigators and early stage investigators. Recall that new investigators are defined as individuals who have not previously competed successfully as program director/principal investigator for a substantial NIH independent research award. Early stage investigators are defined as new investigators who are within 10 years of completing the terminal research degree or medical residency (or the equivalent).

Below is a plot for 655 NIGMS R01 applications reviewed during the January 2010 Council round.

A plot of the overall impact score versus the percentile for 655 NIGMS R01 applications reviewed during the January 2010 Council round. Solid symbols show applications for which awards have been made and open symbols show applications for which awards have not been made. Red circles indicate early stage investigators, blue squares indicate new investigators who are not early stage investigators and black diamonds indicate established investigators.

This plot reveals that many of the awards made for applications with less favorable percentile scores go to early stage and new investigators. This is consistent with recent NIH policies.

The plot also partially reveals the distribution of applications from different classes of applicants. This distribution is more readily seen in the plot below.

A plot of the cumulative fraction of applications for four classes of applications within a pool of 655 NIGMS R01 applications reviewed during the January 2010 Council round. The classes are applications from early stage investigators (red squares), applications from new investigators (blue circles), new (Type 1) applications from established investigators (black diamonds) and competing renewal (Type 2) applications from established investigators (black triangles). N indicates the number in each class of applications within the pool.

This plot shows that competing renewal (Type 2) applications from established investigators represent the largest class in the pool and receive more favorable percentile scores than do applications from other classes of investigators. The plot also shows that applications from early stage investigators have a score distribution that is quite similar to that for established investigators submitting new applications. The curve for new investigators who are not early stage investigators is similar as well, although the new investigator curve is shifted somewhat toward less favorable percentile scores.

Scoring Analysis with Funding Status

In response to a previous post, a reader requested a plot showing impact score versus percentile for applications for which funding decisions have been made. Below is a plot for 655 NIGMS R01 applications reviewed during the January 2010 Council round.

A plot of the overall impact score versus the percentile for 655 NIGMS R01 applications reviewed during the January 2010 Council round. Green circles show applications for which awards have been made. Black squares show applications for which awards have not been made.

This plot confirms that the percentile representing the halfway point of the funding curve is slightly above the 20th percentile, as expected from previously posted data.

Notice that there is a small number of applications with percentile scores better than the 20th percentile for which awards have not been made. Most of these correspond to new (Type 1, not competing renewal) applications that are subject to the NIGMS Council’s funding decision guidelines for well-funded laboratories.

Scoring Analysis: 1-Year Comparison

I recently posted several analyses (on July 15, July 19 and July 21) of the relationships between the overall impact scores on R01 applications determined by study sections and the criterion scores assigned by individual reviewers. These analyses were based on a sample of NIGMS applications reviewed during the October 2009 Council round. This was the first batch of applications for which criterion scores were used.

NIGMS applications for the October 2010 Council round have now been reviewed. Here I present my initial analyses of this data set, which consists of 654 R01 applications that were discussed, scored and percentiled.

The first analysis, shown below, relates to the correlation coefficients between the overall impact score and the averaged individual criterion scores.

Correlation coefficients between the overall impact score and averaged individual criterion scores for 654 NIGMS R01 applications reviewed during the October 2010 Council round. The corresponding scores for a sample of 360 NIGMS R01 applications reviewed during the October 2009 Council round are shown in parentheses.

Overall, the trend in correlation coefficients is similar to that observed for the sample from 1 year ago, although the correlation coefficients for the current sample are slightly higher for four out of the five criterion scores.

Here are results from a principal component analysis:

Principal component analysis of overall impact score based on the five criterion scores for 654 NIGMS R01 applications reviewed during the October 2010 Council round. The corresponding scores for a sample of 360 NIGMS R01 applications reviewed during the October 2009 Council round are shown in parentheses.

There is remarkable agreement between the results of the principal component analysis for the October 2010 data set and those for the October 2009 data set. The first principal component accounts for 72% of the variance, with the largest contribution coming from approach, followed by innovation, significance, investigator and finally environment. This agreement between the data sets extends through all five principal components, although there is somewhat more variation for principal components 2 and 3 than for the others.
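
As a rough guide to what the first principal component represents, the sketch below estimates it from a table of criterion scores using power iteration on the covariance matrix. It is a simplified stand-in, not the analysis behind the tables: whether the original used the covariance or the correlation matrix is not stated, and the function name `first_principal_component` is hypothetical.

```python
from math import sqrt


def first_principal_component(rows, iterations=500):
    """Return (weights, fraction of variance) for the first component.

    rows: one list of criterion scores per application, all the same length.
    Assumption: PCA on the sample covariance matrix of the raw scores.
    """
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    # Sample covariance matrix of the criterion scores.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(k)]
           for a in range(k)]
    # Power iteration converges to the eigenvector with the largest
    # eigenvalue, provided the all-ones start is not orthogonal to it.
    v = [1.0] * k
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(k)) for a in range(k)
    )
    total_variance = sum(cov[a][a] for a in range(k))
    return v, eigenvalue / total_variance
```

The returned weights play the role of the per-criterion contributions, and the returned fraction corresponds to the share of variance attributed to the first component.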

Another important factor in making funding decisions is the percentile assigned to a given application. The percentile is a ranking that shows the relative position of each application’s score among all scores assigned by a study section at its last three meetings. Percentiles provide a way to compare applications reviewed by different study sections that may have different scoring behaviors. They also correct for “grade inflation” or “score creep” in the event that study sections assign better scores over time.
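
As a simplified illustration of how a percentile can be derived from a score, the sketch below ranks one score against a base of scores, such as those from a study section's last three meetings. The formula 100 × (rank − 0.5) / N and the averaging of tied ranks are assumptions for illustration, since the post does not spell out the calculation; the function name `percentile` is hypothetical.

```python
def percentile(score, base_scores):
    """Percentile of one score within a base of scores (lower is better).

    base_scores must include the score being ranked.
    Assumptions: percentile = 100 * (rank - 0.5) / N, with tied scores
    sharing their average rank.
    """
    better = sum(1 for s in base_scores if s < score)
    tied = sum(1 for s in base_scores if s == score)
    rank = better + (tied + 1) / 2  # average rank among tied scores
    return 100 * (rank - 0.5) / len(base_scores)
```

Because the result depends only on rank within the base, a study section that gives uniformly better scores than another still produces the same spread of percentiles, which is how percentiling compensates for differences in scoring behavior.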

Here is a plot of percentiles and overall impact scores:

A plot of the overall impact score versus the percentile for 654 NIGMS R01 applications reviewed during the October 2010 Council round.

This plot reveals that a substantial range of overall impact scores can be assigned to a given percentile score. This phenomenon is not new; a comparable level of variation among study sections was seen in the previous scoring system, as well.

The correlation coefficient between the percentile and overall impact score in this data set is 0.93. The correlation coefficients between the percentile and the averaged individual criterion scores are given below:

Correlation coefficients between the percentile and the averaged individual criterion scores for 654 NIGMS R01 applications reviewed during the October 2010 Council round.

As one would anticipate, these correlation coefficients are somewhat lower than those for the overall impact score since the percentile takes other factors into account.

Here are the results of a principal component analysis applied to the percentile data:

Principal component analysis of percentile data based on the five criterion scores for 654 NIGMS R01 applications reviewed during the October 2010 Council round.

The results of this analysis are very similar to those for the overall impact scores, with the first principal component accounting for 72% of the variance and similar weights for the individual averaged criterion scores.

Our posting of these scoring analyses has led the NIH Office of Extramural Activities and individual institutes to launch their own analyses. I will share their results as they become available.

The New Scoring System

At the recent meeting of the National Advisory General Medical Sciences Council, our Council members had their first opportunity to examine summary statements using the new peer review scoring system.

Many aspects of the new scoring system are unfamiliar, including the use of overall impact scores as integers from 10 (best) to 90 (worst). The new scoring system is summarized in a scoring system and procedure document, an earlier version of which was shared widely with reviewers.

As background, I compiled some data for approximately 300 NIGMS R01 applications reviewed under the new system.

This plot shows the distribution of overall impact scores along with the corresponding percentiles. Note the relative spread of percentile scores at a given impact score. This spread is due to the fact that percentiles are determined independently for each study section that considered 25 or more R01 applications. Otherwise, percentiles are determined across the overall pool of R01 applications reviewed by the Center for Scientific Review.

For comparison, here is a plot of a similar number of NIGMS R01 applications reviewed using the old scoring system.

Note the similar spread of percentiles at a given score due to study section-specific percentiling.

I would like to mention another major change as a result of the NIH Enhancing Peer Review effort. You must use restructured application forms and instructions, including a 12-page length limit for R01s, for applications due on or after January 25, 2010. For details, see the recent NIH Guide notice. We plan to post updates about these changes as key dates approach.