When Josh Sommer was an undergraduate at Duke, he was diagnosed with a rare cancer called a chordoma; fortunately, the skull base tumor was successfully removed surgically, though these malignancies have a nasty habit of growing back. To accelerate chordoma research, Sommer set up the Chordoma Foundation. (I’ve been exceptionally inspired by Sommer, and have written about him here and here; he was also selected as a “Forbes 30 Under 30” awardee in 2014, story here; an excellent 2014 profile in The Atlantic is here).
Sommer was motivated in part by the frustration he experienced as he spoke to different academic researchers who were studying his type of cancer; he discovered that scientists operated in a strikingly siloed fashion, viewing others in the field more as competitors than collaborators. While perhaps an obvious fact of life to most researchers, this observation represented a painful realization for a patient in desperate need of scientific progress.
Speaking to the first Sage Congress in 2010 (see his deeply moving talk here), Sommer invoked Intel founder Andy Grove, who had recently highlighted in JAMA the need to accelerate knowledge turns in medicine. If researchers would share knowledge faster, both Grove and Sommer argued, medical discovery would be accelerated.
Several recent studies, however, suggest that while open collaboration and immediate data sharing may be beneficial in many ways, they should not be viewed as an unalloyed good, and come with important tradeoffs (and even unwanted consequences) that innovation stakeholders must consider.
Sharing Is Caring
The benefits of rapid knowledge sharing were demonstrated in a now-classic analysis by Gulley and Lakhani of a MATLAB programming contest during which participants vied with each other to code the best-performing algorithm, and could view (and borrow from) any code that was submitted. The result: as competitors refined each other’s algorithms, the top performance achieved increased profoundly (see figure 2 of this paper).
These results have been replicated by researchers at Sage Bionetworks, a non-profit organization that seeks to encourage scientists in disparate institutions to collaborate as if they were in the same lab (disclosure: I was a founding advisor of Sage, and remain strongly supportive of its mission, though I do not have a formal relationship with the organization).
When Sage set up a programming contest to build a particular type of algorithm, Sage’s Chief Commons Officer John Wilbanks explained to me, they realized that none of the participants were sharing code, even though that was part of the rules. Perhaps not surprisingly, each contestant wanted to win, and didn’t want to divulge part of his or her secret sauce, a pattern of behavior that, as Sommer discovered, is unfortunately not unusual for most academic research.
However, Sage then came up with an innovation: they introduced small financial incentives, rewarding programmers who used someone else’s code to climb to the top of the leaderboard, as well as programmers whose code was used by someone else to attain the top position. With this reward structure in place, scientists began to share, and the performance of the algorithm lurched forward, improving by a factor of ten almost immediately.
As promising as this result was, the team at Sage were puzzled by a phenomenon they observed during the contest: once an initial promising approach to a solution was suggested, most new submissions tended to be slight variations on this established theme, minor tweaks that in aggregate resulted in significant improvements in the overall performance.
Avoiding Premature Optimization
But what if there were even more promising potential solutions possible – solutions that solvers tended not to come up with once a promising approach had been surfaced? To put it mathematically, what if the winner of the algorithm experiment had identified a local maximum, but not the global maximum? What if a better result was out there, a result that, paradoxically, might be harder to discover in the presence of rich, continuous collaboration?
Sage explored this question in a second programming contest, published earlier this year in Nature Methods. In the new competition, individual programmers would create algorithms which were used to evaluate test data and deliver a predicted result. Sage would compare this prediction to the known result, and return a score based on the performance; code was not shared between different teams.
The result: team scores tended to improve over time, as programmers optimized their algorithms in response to feedback.
The really interesting finding, though, was that at the end of the contest, Sage compared the performance of the best individual algorithms to that of an “ensemble” algorithm (essentially, looking at consensus results), and found that the ensemble outperformed the individual algorithms, suggesting that even a highly capable team might miss important insights that other teams, working independently, might capture. Some of these insights might have been missed entirely if code had been shared during the process and all the contestants had quickly focused on improving an initially promising algorithm.
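To make the ensemble idea concrete, here is a minimal sketch in Python. It is not Sage's actual method: the data, the team names, and the use of a simple majority vote as the "consensus" are all assumptions made for illustration. It shows how three independent predictors, each wrong on different cases, can be outvoted into a better combined answer.

```python
from collections import Counter

# Hypothetical toy data: the known outcomes for seven test cases.
truth = [1, 0, 1, 1, 0, 1, 0]

# Hypothetical predictions from three teams working independently.
# Each team is wrong on two cases, but on *different* cases.
predictions = {
    "team_a": [1, 0, 1, 1, 0, 0, 1],  # wrong on cases 5 and 6
    "team_b": [0, 0, 1, 1, 1, 1, 0],  # wrong on cases 0 and 4
    "team_c": [1, 1, 1, 0, 0, 1, 0],  # wrong on cases 1 and 3
}


def accuracy(pred, known):
    """Fraction of cases where the prediction matches the known outcome."""
    return sum(p == k for p, k in zip(pred, known)) / len(known)


def majority_vote(all_preds):
    """Assumed consensus rule: for each case, take the most common prediction."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*all_preds)]


ensemble = majority_vote(list(predictions.values()))

for name, pred in predictions.items():
    print(name, round(accuracy(pred, truth), 2))
print("ensemble", round(accuracy(ensemble, truth), 2))
```

In this contrived case every team scores 5 out of 7 while the vote gets all seven right, precisely because the teams' mistakes do not overlap. If all three teams had converged on the same approach and made the same errors, the vote would have added nothing.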
These results map nicely onto data recently reported by Boudreau and Lakhani, who examined the effect of sharing in an algorithm development contest. On the one hand, sharing resulted in rapid improvement in performance of the top algorithm, consistent with Lakhani’s previously published result, as well as Sage’s original observation.
However, Boudreau and Lakhani also discovered that when data sharing was permitted, there was “a greater tendency to coordination in the form of convergence rather than divergence” – i.e. most solvers jumped on a promising approach and tried to improve it. In contrast, when data sharing wasn’t permitted, a far larger number of unique approaches were proposed. Thus, suggest the authors, “in a ‘rugged’ landscape of possible solutions, we might be concerned that [data sharing] encourages path dependence and lock into a suboptimal solution approach.”
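The "rugged landscape" worry can be illustrated with a toy search problem. The sketch below is an illustration under stated assumptions, not a model from the paper: the landscape values, the starting points, and the greedy hill-climbing rule are all hypothetical. Solvers who all refine the same early approach stop at the nearest peak, while solvers who start from different places find a higher one.

```python
# Hypothetical rugged landscape: the score of each candidate solution,
# indexed by position. Peaks sit at positions 2, 7 and 12.
landscape = [1, 3, 5, 4, 2, 3, 6, 9, 7, 4, 2, 5, 6, 3]


def hill_climb(start, scores):
    """Greedy local search: step to the better neighbour until none improves."""
    pos = start
    while True:
        neighbours = [i for i in (pos - 1, pos + 1) if 0 <= i < len(scores)]
        best = max(neighbours, key=lambda i: scores[i])
        if scores[best] <= scores[pos]:
            return pos  # no neighbour is better: a local maximum
        pos = best


# Assumption: with sharing, every solver adopts the first promising
# approach (position 1) and refines it from there.
shared_starts = [1, 1, 1]
# Assumption: without sharing, solvers begin from unrelated ideas.
independent_starts = [1, 5, 11]

best_shared = max(landscape[hill_climb(s, landscape)] for s in shared_starts)
best_independent = max(
    landscape[hill_climb(s, landscape)] for s in independent_starts
)

print("best score with convergence:", best_shared)
print("best score with independent starts:", best_independent)
```

Here the converging solvers all settle on the peak scoring 5, while one of the independent solvers reaches the peak scoring 9. The sketch deliberately leaves out the other half of the tradeoff, which is that sharing helps a group climb its chosen hill much faster.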
The real challenge, it seems, for those who manage innovation, and for those like Josh Sommer (and really, all of us) who have a stake in the outcome, is figuring out how, on the one hand, to leverage the demonstrated power of data sharing to incrementally improve and rapidly refine solutions while, on the other hand, not losing sight of the benefits of independent research, work that might be slower, but, in reducing the impact of groupthink, could yield answers that are more original, profound, and impactful.