A New Global Health Architecture: Maximising Health Returns

There have been a number of opinion pieces, resets, and declarations on what is needed in a new global health architecture. These have been authored by diplomats, former Prime Ministers and Presidents, multilateral agency staff and former staff, philanthropies, and peak bodies. They have appeared in prestigious peer-reviewed journals and on corporate websites, and a consistent set of messages runs through them. I have attempted to synthesise those core messages and draw them together into an aggregate position that can help national governments action some of the ideas.

It starts with the contemporary global health landscape, which is increasingly defined by structural fiscal contraction, evolving geopolitical priorities, and the imperative to sustain health systems performance under conditions of constrained financing. In this context, legacy colonial models of development assistance—characterised by externally driven priorities, fragmented delivery channels, and open-ended commitments—are no longer fit for purpose. A transition toward more efficient, sovereign-aligned frameworks can deliver health at scale across well-segmented population groups.

At the center of this transition is a reassertion of national sovereignty. Low- and lower-middle-income countries have historically been analysed through the lens of deficit-based models—most notably income, poverty, and other World Bank-style development indicators. In a context of increasing financial constraint and projected stagnation or decline in economic growth, that approach invites systemic failure.

A better alternative is an analysis of countries through the lens of asset-based models and indicators. This shift allows for a reorientation from needs-based allocation towards strategic engagement, in which financially flexible partners align with nationally defined priorities to co-develop fully costed health pathways. Such pathways provide end-to-end cost visibility, improving efficiency and accountability and enabling the precise calibration of health investments against projected returns. The well-established link between health improvement and economic productivity can be operationalised as part of national investment cases.

This reorientation will motivate a shift away from traditional sovereign lending, with its associated conditionalities, as the dominant financing modality. While sovereign lenders have played a critical role in expanding access and supporting system development, their balance sheets are increasingly constrained. Capital markets offer emerging mechanisms to complement sovereign financing in targeted areas where risk can be appropriately structured and priced. Through structured asset-pooling within and across countries, these mechanisms can enhance risk absorption and expand resource mobilisation. Over time, they may progressively relieve pressure on sovereign financing, enabling health systems to access more diversified funding streams while reducing exposure to fiscal volatility. This, in turn, may lessen reliance on sovereign conditionalities and allow national governments greater implementation flexibility.

Operationalising this shift requires the development of a coherent investment architecture. One approach is the development of Population Equity Units (PEUs), which serve as the foundational analytical and financial entities within national systems. These units can be aggregated into stratified Demographic Asset Classes, reflecting variations in projected lifetime contribution, health system utilisation, and responsiveness to intervention. The introduction of such classifications enables a more granular understanding of where investments are likely to generate the greatest returns for national governments alongside the highest-value health gains.

To support decision-making across this architecture, a Health Returns Value Index (HRVI) can be employed. The Index would provide a standardised metric for comparing Population Equity Units based on anticipated health outcomes relative to cost. This facilitates outcome-weighted health investment prioritisation, ensuring that limited resources are allocated in a manner consistent with maximising aggregate health systems performance. Importantly, such an Index would allow for dynamic recalibration over time, as demographic, epidemiological, and economic conditions evolve.
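As a purely illustrative sketch, and not a specification of any existing tool, such an index might be operationalised as anticipated health gain per unit of projected cost, with units then ranked and funded in that order. All names, figures, and units below are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class PopulationEquityUnit:
        name: str
        expected_health_gain: float  # hypothetical: projected health gain (e.g. QALYs) over the planning horizon
        projected_cost: float        # hypothetical: fully costed pathway for this unit, in a common currency

        @property
        def hrvi(self) -> float:
            # Illustrative Health Returns Value Index: anticipated outcomes relative to cost.
            return self.expected_health_gain / self.projected_cost

    def prioritise(units, budget):
        # Outcome-weighted prioritisation: fund the highest-HRVI units until the budget is exhausted.
        funded = []
        for unit in sorted(units, key=lambda u: u.hrvi, reverse=True):
            if unit.projected_cost <= budget:
                funded.append(unit)
                budget -= unit.projected_cost
        return funded

    units = [
        PopulationEquityUnit("PEU-A", expected_health_gain=1200, projected_cost=400_000),
        PopulationEquityUnit("PEU-B", expected_health_gain=300, projected_cost=250_000),
        PopulationEquityUnit("PEU-C", expected_health_gain=900, projected_cost=500_000),
    ]
    for unit in prioritise(units, budget=700_000):
        print(unit.name, round(unit.hrvi, 4))

On this reading, the dynamic recalibration described above would simply amount to re-running the ranking as demographic, epidemiological, and economic inputs are updated.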

Within this framework, national health systems can be conceptualised as Health Equity Portfolios. These portfolios comprise a diversified set of Population Equity Units across multiple Demographic Asset Classes, each contributing differently to overall system yield. Standard portfolio management principles can then be applied, including allocation, rebalancing, and risk mitigation. High-performing segments—those demonstrating strong alignment between investment and realised outcomes—can be prioritised for sustained or increased capital allocation.

Conversely, Population Equity Units falling below defined marginal value thresholds may require structured reassessment. In such cases, mechanisms for managed transition, including consolidation or phased divestment, can be introduced to preserve portfolio efficiency. These processes should be governed by transparent criteria and embedded within broader national planning frameworks to ensure predictability and stability.

The potential integration of capital markets provides an opportunity to further enhance the flexibility of this model. Population Equity Units can be progressively bundled into tradable instruments, including outcome-linked bonds and equity participation vehicles. These instruments allow external sovereign and market investors to assume a share of the financial risk associated with health investments, while aligning returns directly with measurable outcomes. In doing so, they create a direct linkage between system performance and capital flows, reinforcing incentives for efficiency and innovation.

A complementary development is the introduction of rating systems for Demographic Asset Classes. Drawing on established methodologies from financial markets, population segments can be assigned standardised ratings based on projected return profiles and risk characteristics. AAA-rated population segments—those with high expected returns and low variability—can be prioritised for long-term investment, while sub-investment grade cohorts may be subject to targeted de-risking strategies, including controlled exposure limits, selective disengagement, or phased reallocation of resources. Rating migration over time provides an additional feedback mechanism, enabling continuous optimisation of the Health Equity Portfolio, including downgrade-triggered reallocation where required.
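A toy sketch of how such ratings might be assigned and monitored, assuming, purely for illustration, that a rating is a simple function of a segment's projected return and the variability of that projection; the thresholds and labels below are hypothetical rather than drawn from any established rating methodology.

    def rate_segment(expected_return, volatility):
        # Hypothetical rating rule: high expected return and low variability earns the top grade.
        if expected_return >= 1.5 and volatility <= 0.1:
            return "AAA"
        if expected_return >= 1.0 and volatility <= 0.3:
            return "BBB"
        return "sub-investment grade"

    # Rating migration as a feedback mechanism: re-rate each period and flag downgrades for review.
    previous_rating = rate_segment(expected_return=1.6, volatility=0.05)  # "AAA"
    current_rating = rate_segment(expected_return=1.2, volatility=0.25)   # "BBB"
    if current_rating != previous_rating:
        print(f"Rating migrated from {previous_rating} to {current_rating}: trigger reallocation review")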

One of the strategic advantages of this approach for national governments is the reconceptualisation of equity. Rather than being treated as a purely distributive principle, equity can be operationalised as a function of participation and alignment with system performance requirements. Under this model, Population Equity Units hold differentiated positions within the national portfolio, reflecting their contribution to and benefit from collective investment. This ensures that resource allocation remains responsive to both system performance and evolving demographic realities.

Institutionally, the framework aligns with a broader functional redefinition of global health actors. Multilateral organisations, including normative bodies, can focus on establishing standards, developing metrics such as the HRVI, and convening stakeholders across sectors. Implementation and operational decision-making are devolved to national and regional entities, consistent with the principle of subsidiarity. This division of labour reduces duplication and enhances system coherence.

Financing flows, in turn, become more targeted and time-bound. Development assistance becomes progressively redundant, reducing exposure to sovereign conditionalities, and financing is repositioned as catalytic capital, supporting transitions toward domestically anchored and market-enabled systems. Global public goods—such as surveillance, research and development, and epidemic preparedness—remain appropriate areas for sustained collective investment, given their transnational nature and positive externalities. Nonetheless, they would need to demonstrate measurable impact on the Population Equity Units, and a positive return on investment.

The proposed model is not without complexity. The introduction of new instruments, metrics, and governance arrangements requires careful design and sequencing. Data systems must be strengthened to support accurate classification, valuation, and monitoring of Population Equity Units, with the resulting data architecture constituting a high-value analytical asset class in its own right, potentially suitable for managed service provision or structured private participation. Regulatory frameworks must evolve to accommodate novel financing mechanisms while safeguarding system integrity. Capacity building at national and subnational levels is essential to ensure effective portfolio management.

The risks of inaction, however, are far greater. Persisting with fragmented, input-driven, and fiscally unsustainable models will undermine both efficiency and impact. By contrast, a transition toward a maximally efficient, return-oriented framework offers the potential to sustain and enhance health outcomes despite resource constraints.

The convergence of fiscal pressure, institutional reform, and financial innovation creates a significant opportunity to re-engineer the global health architecture around principles of equity, efficiency, alignment, and sustainability. Through the structuring of Population Equity Units, the deployment of the Health Returns Value Index, and the gradual mobilisation of capital markets, it is possible to construct Health Equity Portfolios that are resilient, adaptive, and performance-oriented. Such an approach ensures that, even under conditions of constrained financing, health systems can continue to deliver measurable value at scale for national governments.

Health system sustainability is preserved through disciplined alignment of investment with demonstrable population value.

Campbell and Stanley explained replication rates in 1963

Over 60 years ago, Donald Campbell and Julian Stanley published their classic, slim volume Experimental and Quasi-Experimental Designs for Research. One of their earliest observations concerns the trade-off between internal and external validity. Specifically, the more precisely one can establish a causal relationship, the less one can say about its generality. In recent work, I show that simultaneously maximising internal and external validity is not merely a practical limitation to be mitigated, but a structural impossibility. The relationship is analogous to the Heisenberg uncertainty principle, which shows that one cannot simultaneously know both the position and the momentum of a particle with arbitrary precision. In the context of the social and behavioural sciences, the more precisely one identifies a cause, the narrower the domain to which that knowledge applies.

I reviewed this problem in terms of the so-called “replication crisis”, the difficulty researchers have encountered in replicating published causal findings. Shortly after posting that paper, Nature published a series of articles on research credibility, including a large-scale investigation of replicability in the social and behavioural sciences. The empirical effort is extraordinary, involving hundreds of researchers and a substantial coordination infrastructure. The methods, results, and theoretical framing are all of considerable interest. However, the study has also generated headline figures that are readily misinterpreted—an outcome encouraged both by editorial framing and by the structure of the paper itself.

The central difficulty lies in two under-specified concepts that drive the research: whether a replication addresses the “same question”, and what constitutes the “claim” being tested. Whether a replication tests the “same question” is treated as a local, theory-laden judgement made by individual teams. Yet sameness is assumed to be constant at two levels simultaneously. First, the multiple replications of a single study should all be replicating the same thing, as if each attempt stood in an identical relationship to the original. Second, across all the original studies, sameness should stand in an identical relationship between a replication and its target, regardless of which study is being replicated. If “same” does not mean the same thing within and between replications, the target drifts meaninglessly.

At the same time, replications are of “claims”: scientific claims reduced to directional empirical statements, detached from the estimands, models, and analytic pipelines. That is, the claim is detached from the scientific meaning that gave it purchase in the original study. The same problem with “claims” arose in the team’s Nature paper on analytic robustness. Abstracting scientific claims into more generic “claims” produces a mismatch between design and inference. Heterogeneous interpretations of what is actually being tested are collapsed into standardised statistical comparisons. Apparent agreement or disagreement may therefore reflect shifts in underlying targets rather than genuine replication or failure.

A related issue is that the study attempts to straddle internal and external validity without resolving their tension. It presents itself as assessing whether findings replicate, but in practice examines how results behave under modest variation in context, measurement, and implementation—something closer to robustness or transportability than strict replication. The use of multiple, non-equivalent metrics of “success” in the Nature article reinforces this ambiguity. Replication rates vary substantially depending on the criterion, yet a single headline figure is foregrounded: “Half of social-science studies fail replication test in years-long project”. The result is a study that is informative about the behaviour of findings (and researchers) under perturbation, but is easily—and predictably—read as making stronger claims about the reliability or truth of scientific results than its design can support.

Underlying both issues is a deeper disagreement about what replication is for. The paper’s opening paragraph explicitly reflects this tension. One reference is the National Academies of Sciences (NAS) report, which defines replication in procedural and statistical terms: collect new data using similar methods and assess whether results are consistent, typically via effect sizes and uncertainty intervals. The other reference is a 2020 PLoS Biology article by Nosek and Errington (the two senior authors of this Nature paper), who argue that the NAS definition is not merely imprecise but conceptually mistaken. On the Nosek-Errington account, determining that a study is a replication is a theoretical commitment. Both confirming and disconfirming outcomes must be treated in advance as diagnostic of the original claim. The Nature paper adopts this language—replication teams were instructed to produce “good faith tests” of claims—but the article reports results entirely using metrics derived from the procedural-statistical tradition of the NAS. This is not a superficial inconsistency. The two frameworks imply different standards of success, different interpretations of failure, and different meanings for any aggregated replication rate. The headline figures that have circulated are products of the latter framework; whether they would survive translation into the former is not addressed.

It is here that Campbell and Stanley’s observation, and its formalisation, becomes decisive. The procedural-statistical approach implicitly treats internal validity as primary and assumes that external validity can be inferred from it. That is, if results are consistent, the finding travels. The structural trade-off shows that this assumption cannot hold. The very steps taken to secure internal validity constrain the scope of generalisation. A high replication rate under this framework may therefore be simultaneously informative and misleading. It indicates that a result can be reproduced under sufficiently similar conditions, while obscuring how narrow those conditions may be. The Nosek-Errington framework recognises the need for theoretical commitment, but without a principled account of causal structure it cannot resolve the tension either. What the Nature paper ultimately demonstrates—perhaps inadvertently—is that replicability is not a property of findings alone. It is a property of the relationship between a finding and the conditions under which it is tested. This underscores a Cartwrightian notion of relationships tied to particular material configurations, what Cartwright calls nomological machines. Until that relationship is made explicit, headline replication rates will continue to invite overconfident conclusions in both directions, and admonitions for better methods.


I did not have access to the published article, which is behind the Springer Nature paywall. Instead, I relied on the publicly available preprint.

Analytic robustness could be a real problem

A recent article in Nature on the robustness of research findings in the social and behavioural sciences found that only 34% of re-analyses of the data yielded the same result as the original report. This sounds horrible. It sounds as though two-thirds of the research that social and behavioural scientists are doing is low-quality work that certainly does not deserve to be published. One might reasonably ask whether “confabulist” rather than “scientist” would be a better job title.

Unfortunately, the edifice of “robust research” has been built on foundations of sand. The research shares many of the weaknesses of another article recently published in Science Advances, which I discuss here. Little can be concluded from the research that could actually inform scientific practice or permit any observation about the quality or robustness of the original articles. It does, however, say something of interest for sociologists of science about the diversity of views that researchers have about how to re-analyse data to address conceptual claims.

The procedure followed in the Nature article was described thus:

To explore the robustness of published claims, we selected a key claim from each of our 100 studies, in which the authors provided evidence for a (directional) effect. We presented each empirical claim to at least five analysts along with the original data and asked them to analyse the data to examine the claim, following their best judgement and report only their main result. The analysts were encouraged to analyse those studies where they saw the greatest relevance of their expertise.

The word “claim” here does a lot of work. One might reasonably argue that a scientific claim in a published article is a statement of finding in the context of the hypothesis, the model, the analytic process, and the results. But this is not what is meant here. That full scientific sense of a claim is closer to what the Centre for Open Science team use as a starting point for a separate article on “reproducible” research. In the context of this article a “claim” is some vaguer statement of finding. It is an isolated single claim, has a direction of effect, and critically, is “phrased on a conceptual and not statistical level”.

The conceptual claim is closer to a vernacular claim. It is closer to the kind of thing you might say at a dinner party or read in the popular science section of a magazine. Something like, “did you hear that single female students report lower desired salaries when they think their classmates can see their preferences?” (Claim 025).

Under this framework, one should be able to abstract a full scientific claim into a conceptual claim, and if the conceptual claim is robust, independent scientists analysing the same data and making equally sensible analytic choices will converge on it. The challenge is that this pool of independent and equally sensible scientists needs to agree, without consultation, on how the conceptual claim is to be translated into a scientific claim. Part of the science is deciding on the estimand for testing the claim, but the estimand is fixed by the analytic choices, not by the conceptual claim. If two scientists analyse the same dataset but target different estimands through their analytic choices, they are not converging on the same conceptual claim. Yet, against all logic, an analysis targeting a different estimand that nonetheless produces an estimate close to that of the original paper counts as supporting the robustness of the paper.

The framework, therefore, has a double incoherence. First, divergence of estimates (between the original analysis and re-analysis) is misread as fragility when it may simply reflect different estimands—different scientists sensibly translating the conceptual claim into different scientific claims. Second, and more damaging, convergence is misread as robustness when it may be entirely spurious—two analysts targeting different estimands who happen to produce similar point estimates are not confirming each other. They’re producing agreement by accident, across questions that aren’t the same question.

So the framework is wrong in both directions simultaneously. It penalises legitimate scientific pluralism and rewards numerical coincidence. A study could score as highly robust because several analysts happened to get similar numbers while asking entirely different questions. A study could score as fragile because several analysts made defensible but divergent estimand-constituting choices that led to genuinely different answers to genuinely different questions.

There is another, far more interesting reading of this paper, one that has neither a click-bait quality nor an opportunity to remonstrate. Where the authors have identified fragility (or a lack of robustness), another reader could legitimately and positively see vitality and methodological pluralism. The social and behavioural sciences work in the messy space of self-referential agents actively interacting with and changing the environments in which they live and do science. It is hardly surprising that epistemic pluralism is a consequence of this. The 34% figure is not a scandal. It is valuable (and under-appreciated) data about the nature of social reality.


I did not have access to the published article, which is behind the Springer Nature paywall. Instead, I relied on the publicly available preprint.

Authorised Speech and Token Restraint

At the Bafta film awards last Sunday, a man in the audience shouted the N-word while two Black actors were on stage. The BBC broadcast it. The fallout was considerable.

The man was John Davidson, a Tourette syndrome activist whose life story had inspired one of the nominated films. He has Tourette’s. He didn’t choose to shout it. He was, by his own account, distraught. His statement afterwards was careful and precise: “My tics have absolutely nothing to do with what I think, feel or believe. It’s an involuntary neurological misfire. My tics are not an intention, not a choice and not a reflection of my values.”

Most people accepted this. But it raises a question that is harder than it looks. The word came from his brain, through his vocal tract, in his voice. It was linguistically formed—not a grunt or a spasm but a semantically loaded utterance. Something in him produced it. If it wasn’t him, who was it? And if it wasn’t him, where exactly does he end?

The standard move here is to invoke volition. We hold people responsible for their words because we assume intent. Remove intent, and the moral framework dissolves. Davidson didn’t mean it; therefore, it wasn’t really his; therefore, he bears no responsibility. Case closed. But this doesn’t actually answer the philosophical question. It just sidesteps it. Because here is the thing about Davidson’s tics that deserves closer attention. They are not random. They are contextually coherent. Unpleasant, shocking, certainly disruptive, but coherent. At the ceremony, host Alan Cumming made a joke involving Paddington Bear and his own sexuality. Davidson’s tics responded with homophobic slurs and the word “paedophile”—triggered, he explained later, because Paddington is a children’s character. Something in his system was tracking the semantic content of what was being said. It identified what was transgressive in context. It reached for the worst available word. Then it fired. That is not noise. That is a process with its own logic, running in parallel with Davidson’s conscious attention, with access to his semantic knowledge, and occasionally—when the usual controls fail—with access to his voice.

There is a useful way to think about this borrowed from how AI large language models work. A language model operates in high-dimensional continuous space. Vast amounts of computation happen there—pattern recognition, semantic association, something that functions like reasoning. None of it is directly visible. What we see is the output: a sequence of tokens, one after another, a flat stream of words.

The projection from that internal computation to the token stream is lossy. Much of what happens in the model never surfaces as language. The token stream is not the computation. It is a particular kind of readout of the computation, filtered and serialised into the only form we can directly receive. Now consider what controls what gets into that stream. There is, in effect, a gate. Not everything the model computes becomes output. The gate is part of what shapes the model’s behaviour, its apparent character, what it will and won’t say. It is what makes people like one model and hate another.
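A toy sketch of the gate idea, offered purely as an analogy rather than as a description of how any actual model is implemented: the internal computation scores many candidate outputs, and a filter determines which of them ever reach the visible token stream. The candidates, scores, and blocklist below are invented for illustration.

    # Internal computation: many candidates are scored, most never surface as output.
    internal_scores = {
        "measured_reply": 0.61,
        "sharp_retort": 0.58,
        "taboo_word": 0.55,     # strongly activated internally, never meant to be published
        "polite_filler": 0.40,
    }

    SUPPRESSED = {"sharp_retort", "taboo_word"}  # the "gate": computed but not cleared for output

    def publish(scores):
        # Emit the highest-scoring candidate that passes the gate.
        allowed = {token: score for token, score in scores.items() if token not in SUPPRESSED}
        return max(allowed, key=allowed.get)

    print(publish(internal_scores))  # "measured_reply": the readout is narrower than the computation

Remove the gate and what surfaces is not noise but the next candidate in line, which is the shape of the failure described below.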

This is roughly what neuroscience suggests about the human case, though it arrived at the conclusion from a different direction. The self is the author and publisher. The self is not the computation, but the editing function. What goes out, not what gets thought.

Michael Gazzaniga‘s split-brain research in the 1960s showed that the left hemisphere acts as an “interpreter”—it observes behaviour generated by other systems and constructs a retrospective narrative of unified authorship. We don’t experience ourselves as unified because we are. We experience it because one subsystem is very good at telling that story after the fact. The verbal self—the “I” that speaks, explains, claims authorship—may be less the source of thought than its narrator. It sees the outputs of processes it didn’t run and reports them as its own decisions. On this view, what we call the “I” is substantially the gate—the function that governs what reaches speech from “computation”, what gets claimed, what gets published as the self’s output. Normally the gate and the computation are so tightly coupled that we can’t distinguish them. Tourette’s decouples them. The gate fails for certain kinds of output, and we see that the substrate was not unified to begin with.

Davidson’s distress is entirely coherent under this account. He is not distressed because he acted against his values. He is distressed because something that looked like him did. Something with access to his voice, his semantic knowledge, his body, but not under the control of the function he identifies as himself.

This reframes the question slightly. We tend to ask: what caused the tic? And the answer—some misfiring in the basal ganglia, a failure of inhibitory control—while true, is also incomplete. The more interesting question is: what normally prevents the tic? What is the gate, and what runs it?

In ordinary cognition there may be a great deal happening in the substrate that never gets tokenised into speech—not because it isn’t there, but because something governs what reaches the output. Much of the brain’s activity is never published. The phrase “he has no filter between his brain and his mouth”, which my wife often says of me, suggests that this control is imperfect. The verbal self is the name we give to whatever makes that editorial decision and claims authorship.

When the gate fails, we don’t see randomness. We see coherent sub-processes that were running all along, now briefly with access to the channel they’re normally denied. The tic is not an intrusion from outside. It is an internal process that has temporarily escaped editorial control.

Now consider what certain comedians do for a living.

Dave Chappelle, early Richard Pryor, in a different moral register Bernard Manning—the act is partly constructed around the comedian having deliberate access to the gate in a way the audience doesn’t. They say the thing the audience is computing but suppressing. The laugh is partly recognition, partly relief, partly the vicarious experience of the gate being lifted by someone else’s hand. The comedian is an authorised, licensed publisher of material the rest of us keep in the substrate. The skill—and it is a genuine skill—is knowing exactly how far to go, when to pull back, and how to ensure the frame holds. Controlled gate failure, performed for an audience that has consented to the performance.

Chappelle’s career is substantially about making this mechanism visible. His famous walkaway from a $50 million deal was articulated partly in these terms—he became uncertain whether the audience was laughing with the subversion or simply enjoying having the gate lifted on material they wanted to consume without guilt. Whether he was controlling the frame or the frame was controlling him. Whether his speech was authorised-as-subversion or merely authorised-as-release.

That is the knife edge. A Tourette’s tic and a Chappelle bit can produce the same word in the same room, but the intentional structure is entirely different. One is gate failure. The other is gate performance. Except Chappelle’s anxiety—the anxiety that ended his show—was that the performance might be providing cover for something closer to the former. That the laughter was coming from a place the comedy wasn’t actually reaching.

Manning is the case where the performance defence eventually collapsed. The gate, it turned out, was the man.

Dementia approaches from the opposite direction, and in some ways is the starkest case of all. In Tourette’s the gate fails selectively and intermittently. In dementia it degrades systematically as the substrate that runs it is physically destroyed. You can watch the editorial function diminish over months and years. What typically goes first is not memory in the crude sense but the social and executive apparatus—the machinery that governs what gets said, to whom, in what context. The person starts saying things they would previously have filtered: sexual remarks, racial language from fifty years ago, brutal assessments of people in the room. Families often find this the most distressing feature of the disease, more than the memory loss itself. The person seems to have become someone else—crueller, coarser, unrecognisable.

But the logic developed here suggests the opposite reading. They have not become someone else. The editor has gone, and what remains is substrate that was always there, now publishing without authorisation. The language from half a century ago was always in the network. The judgements about the people in the room may reflect something that was always computed but never passed the gate.

This is uncomfortable. It implies that a significant portion of what we think of as a person’s character—kindness, decency, tact, a person’s goodness in daily life—may be substantially gate rather than ground. Who we are is not what we compute but what we suppress. The consoling counter is that the gate is real. The suppression is itself a genuine expression of values, not mere performance. Davidson’s distress is evidence of that. The narrator who identifies with the gate is genuinely not the process that produced the tic. A publisher who refuses to print something ugly is making a real choice, even if the ugly thing exists somewhere in the system. But dementia strips that away and leaves the question uncomfortably open. How much of the person we loved was the computation, and how much was the editing?

Speech is authorised in two senses. It is permitted—cleared for publication by whatever runs the gate. And it is authored—it carries the signature of a self, it is owned, it counts as an expression of who we are. Normally these travel together so seamlessly that we treat them as one thing. Davidson’s tic, Chappelle’s comedy, and a person with late-stage dementia saying something unforgivable to their daughter—each in a different way pulls them apart.

The “I” is not the thinker. It is the publisher, the tokeniser of thought to speech. And the question of who we really are may depend, more than we would like, on what we choose not to print.