
Methodology

Here, we provide an overview of the amortised neural inferential methods supported by the package, which include neural Bayes estimators, neural posterior estimators, and neural ratio estimators. For further details on each of these methods and amortised neural inference more broadly, see the review paper by Zammit-Mangion et al. (2025) and the references therein.

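Notation: We denote model parameters of interest by $\boldsymbol{\theta} \equiv (\theta_1, \dots, \theta_d)' \in \Theta$, where $\Theta \subseteq \mathbb{R}^d$ is the parameter space. We denote data by $\boldsymbol{Z} \equiv (Z_1, \dots, Z_n)' \in \mathcal{Z}$, where $\mathcal{Z} \subseteq \mathbb{R}^n$ is the sample space. We denote neural-network parameters by $\boldsymbol{\gamma}$. For simplicity, we assume that all measures admit densities with respect to the Lebesgue measure. We use $\pi(\cdot)$ to denote the prior density function of the parameters. The input argument to a generic density function $p(\cdot)$ serves to specify both the random variable associated with the density and its evaluation point.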

Neural Bayes estimators

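The goal of parametric point estimation is to estimate $\boldsymbol{\theta}$ from data $\boldsymbol{Z}$ using an estimator, $\hat{\boldsymbol{\theta}} : \mathcal{Z} \to \Theta$. Estimators can be constructed intuitively within a decision-theoretic framework based on average-risk optimality. Specifically, consider a loss function $L : \Theta \times \Theta \to [0, \infty)$. Then the Bayes risk of the estimator $\hat{\boldsymbol{\theta}}(\cdot)$ is

$$\int_\Theta \int_{\mathcal{Z}} L(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(\boldsymbol{Z})) \, p(\boldsymbol{Z} \mid \boldsymbol{\theta}) \, \pi(\boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{Z} \, \mathrm{d}\boldsymbol{\theta}.$$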

Any minimiser of the Bayes risk is said to be a Bayes estimator with respect to $L(\cdot, \cdot)$ and $\pi(\cdot)$.

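Bayes estimators are functionals of the posterior distribution (e.g., the Bayes estimator under quadratic loss is the posterior mean), and are therefore often unavailable in closed form. A way forward is to assume a flexible parametric function for $\hat{\boldsymbol{\theta}}(\cdot)$, and to optimise the parameters within that function in order to approximate the Bayes estimator. Neural networks are ideal candidates, since they are universal function approximators, and because they are fast to evaluate. Let $\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}} : \mathcal{Z} \to \Theta$ denote a neural network parameterised by $\boldsymbol{\gamma}$. Then a Bayes estimator may be approximated by $\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}^*}(\cdot)$, where

$$\boldsymbol{\gamma}^* \equiv \underset{\boldsymbol{\gamma}}{\mathrm{arg\,min}} \; \frac{1}{K} \sum_{k=1}^{K} L(\boldsymbol{\theta}^{(k)}, \hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}}(\boldsymbol{Z}^{(k)})),$$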

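with $\boldsymbol{\theta}^{(k)} \sim \pi(\boldsymbol{\theta})$ and, independently for each $k$, $\boldsymbol{Z}^{(k)} \sim p(\boldsymbol{Z} \mid \boldsymbol{\theta}^{(k)})$. The process of obtaining $\boldsymbol{\gamma}^*$ is referred to as "training the network", and this can be performed efficiently using back-propagation and stochastic gradient descent. The trained neural network $\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}^*}(\cdot)$ approximately minimises the Bayes risk, and therefore it is called a neural Bayes estimator (Sainsbury-Dale et al., 2024).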
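The training objective above can be illustrated with a minimal sketch in standard-library Python. Everything in it is an assumption made for illustration: a toy conjugate model ($\theta \sim \mathrm{N}(0, 1)$, $Z_i \mid \theta \sim \mathrm{N}(\theta, 1)$ with $n$ replicates), a linear function of the sample mean standing in for the neural network, and exact least squares standing in for stochastic gradient descent. Under squared-error loss the Bayes estimator for this model is the posterior mean, $n \bar{Z} / (n + 1)$, so the fitted slope should be close to $n / (n + 1)$.

```python
import random

def simulate(K, n, rng):
    """Draw K parameter-data pairs: theta ~ N(0, 1), Z_i | theta ~ N(theta, 1)."""
    pairs = []
    for _ in range(K):
        theta = rng.gauss(0.0, 1.0)
        z = [rng.gauss(theta, 1.0) for _ in range(n)]
        pairs.append((theta, z))
    return pairs

def fit_linear_estimator(pairs):
    """Minimise the empirical Bayes risk under squared-error loss over
    estimators of the form a * mean(Z) + b (ordinary least squares)."""
    xs = [sum(z) / len(z) for _, z in pairs]
    ys = [theta for theta, _ in pairs]
    K = len(pairs)
    x_bar = sum(xs) / K
    y_bar = sum(ys) / K
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b

rng = random.Random(1)
n = 5
a, b = fit_linear_estimator(simulate(20000, n, rng))
print(a, b)  # a should be close to n / (n + 1), and b close to 0
```

The linear family is used here only because the posterior mean of this toy model happens to be linear in the sample mean; for models without such structure, a neural network supplies the required flexibility.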

Once trained, a neural Bayes estimator can be applied repeatedly to observed data (whose structure conforms with the chosen neural-network architecture) at a fraction of the computational cost of conventional inferential methods. It is therefore ideal to use a neural Bayes estimator in settings where inference needs to be made repeatedly; in this case, the initial training cost is said to be amortised over time.

Uncertainty quantification with neural Bayes estimators

Uncertainty quantification with neural Bayes estimators often proceeds through the bootstrap distribution (e.g., Lenzi et al., 2023; Richards et al., 2024; Sainsbury-Dale et al., 2024). Bootstrap-based approaches are particularly attractive when nonparametric bootstrap is possible (e.g., when the data are independent replicates), or when simulation from the fitted model is fast, in which case parametric bootstrap is also computationally efficient. However, these conditions are not always met and, although bootstrap-based approaches are often considered to be fairly accurate and preferable to methods based on asymptotic normality, there are situations where bootstrap procedures are not reliable (see, e.g., Canty et al., 2006, p. 6).
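As a rough illustration of the nonparametric bootstrap for independent replicates, the following sketch resamples the data with replacement, reapplies the estimator, and reads off percentile limits. The function `estimator` is a hypothetical stand-in for a trained neural Bayes estimator (here simply the sample mean), and `bootstrap_interval` is an illustrative helper, not part of the package.

```python
import random

def estimator(z):
    """Hypothetical stand-in for a trained neural Bayes estimator."""
    return sum(z) / len(z)

def bootstrap_interval(z, estimator, B=2000, level=0.95, seed=0):
    """Percentile interval from B nonparametric bootstrap resamples of z."""
    rng = random.Random(seed)
    n = len(z)
    estimates = sorted(
        estimator([z[rng.randrange(n)] for _ in range(n)]) for _ in range(B)
    )
    alpha = 1.0 - level
    lower = estimates[int((alpha / 2) * B)]
    upper = estimates[int((1 - alpha / 2) * B) - 1]
    return lower, upper

rng = random.Random(42)
z_obs = [rng.gauss(1.0, 1.0) for _ in range(50)]  # simulated "observed" replicates
lower, upper = bootstrap_interval(z_obs, estimator)
print(lower, estimator(z_obs), upper)
```

Because the estimator is cheap to evaluate once trained, the cost of the bootstrap is dominated by resampling rather than by estimation.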

Alternatively, by leveraging ideas from (Bayesian) quantile regression, one may construct a neural Bayes estimator that approximates a set of marginal posterior quantiles (Fisher et al., 2023; Sainsbury-Dale et al., 2025), which can then be used to construct credible intervals for each parameter. Inference then remains fully amortised since, once the estimators are trained, both point estimates and credible intervals can be obtained at virtually zero computational cost. Specifically, posterior quantiles can be targeted by training a neural Bayes estimator under the loss function

$$L(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}; \tau) \equiv \sum_{j=1}^{d} (\hat{\theta}_j - \theta_j) \{\mathbb{I}(\hat{\theta}_j > \theta_j) - \tau\}, \quad \tau \in (0, 1),$$

where $\mathbb{I}(\cdot)$ denotes the indicator function, since the Bayes estimator under this loss function is the vector of marginal posterior $\tau$-quantiles (Sainsbury-Dale et al., 2025, Sec. 2.2.4).
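The link between this loss and quantiles can be checked numerically. The sketch below assumes the standard quantile ("pinball") form $(\hat{\theta}_j - \theta_j)\{\mathbb{I}(\hat{\theta}_j > \theta_j) - \tau\}$ summed over parameters, uses standard normal draws as a stand-in for posterior samples, and searches a grid for the constant that minimises the average loss. That minimiser should sit near the empirical $\tau$-quantile of the draws.

```python
import random

def quantile_loss(theta, theta_hat, tau):
    """Sum over parameters of (theta_hat_j - theta_j) * (I(theta_hat_j > theta_j) - tau)."""
    return sum(
        (th_hat - th) * ((1.0 if th_hat > th else 0.0) - tau)
        for th, th_hat in zip(theta, theta_hat)
    )

rng = random.Random(0)
tau = 0.9
draws = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # stand-in for posterior draws

# Average loss of each candidate point estimate on a grid over [0, 3]
grid = [0.02 * i for i in range(151)]
avg_loss = [
    sum(quantile_loss([th], [c], tau) for th in draws) / len(draws) for c in grid
]
best = grid[avg_loss.index(min(avg_loss))]
empirical = sorted(draws)[int(tau * len(draws))]
print(best, empirical)  # the two values should be close to each other
```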

Neural posterior estimators

We now describe amortised approximate posterior inference through the minimisation of an expected Kullback–Leibler (KL) divergence. Throughout, we let $q(\boldsymbol{\theta}; \boldsymbol{\kappa})$ denote a parametric approximation to the posterior distribution $p(\boldsymbol{\theta} \mid \boldsymbol{Z})$, where the approximate-distribution parameters $\boldsymbol{\kappa}$ belong to a space $\mathcal{K}$.

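We first consider the non-amortised case, where the optimal parameters $\boldsymbol{\kappa}^*$ for a single data set $\boldsymbol{Z}$ are found by minimising the KL divergence between $p(\boldsymbol{\theta} \mid \boldsymbol{Z})$ and $q(\boldsymbol{\theta}; \boldsymbol{\kappa})$:

$$\boldsymbol{\kappa}^* \equiv \underset{\boldsymbol{\kappa}}{\mathrm{arg\,min}} \; \mathrm{KL}\{p(\boldsymbol{\theta} \mid \boldsymbol{Z}) \, \| \, q(\boldsymbol{\theta}; \boldsymbol{\kappa})\}.$$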

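The resulting approximate posterior $q(\boldsymbol{\theta}; \boldsymbol{\kappa}^*)$ targets the true posterior in the sense that the KL divergence is zero if and only if $q(\boldsymbol{\theta}; \boldsymbol{\kappa}^*) = p(\boldsymbol{\theta} \mid \boldsymbol{Z})$ for all $\boldsymbol{\theta} \in \Theta$. However, solving this optimisation problem is often computationally demanding even for a single data set $\boldsymbol{Z}$, and solving it for many different data sets can be computationally prohibitive. The optimisation problem can be amortised by treating the parameters $\boldsymbol{\kappa}$ as a function $\boldsymbol{\kappa} : \mathcal{Z} \to \mathcal{K}$, and then choosing the function $\boldsymbol{\kappa}^*(\cdot)$ that minimises an expected KL divergence:

$$\boldsymbol{\kappa}^*(\cdot) \equiv \underset{\boldsymbol{\kappa}(\cdot)}{\mathrm{arg\,min}} \; \mathbb{E}_{\boldsymbol{Z}}\big[\mathrm{KL}\{p(\boldsymbol{\theta} \mid \boldsymbol{Z}) \, \| \, q(\boldsymbol{\theta}; \boldsymbol{\kappa}(\boldsymbol{Z}))\}\big].$$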

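In practice, we approximate $\boldsymbol{\kappa}^*(\cdot)$ using a neural network, $\boldsymbol{\kappa}_{\boldsymbol{\gamma}} : \mathcal{Z} \to \mathcal{K}$, which is parameterised by $\boldsymbol{\gamma}$ and trained by minimising a Monte Carlo approximation of the expected KL divergence above:

$$\boldsymbol{\gamma}^* \equiv \underset{\boldsymbol{\gamma}}{\mathrm{arg\,min}} \; -\sum_{k=1}^{K} \log q(\boldsymbol{\theta}^{(k)}; \boldsymbol{\kappa}_{\boldsymbol{\gamma}}(\boldsymbol{Z}^{(k)})), \quad \boldsymbol{\theta}^{(k)} \sim \pi(\boldsymbol{\theta}), \quad \boldsymbol{Z}^{(k)} \sim p(\boldsymbol{Z} \mid \boldsymbol{\theta}^{(k)}).$$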

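Once trained, the neural network $\boldsymbol{\kappa}_{\boldsymbol{\gamma}^*}(\cdot)$ may be used to estimate the optimal approximate-distribution parameters $\boldsymbol{\kappa}^*$ given data $\boldsymbol{Z}$ at almost no computational cost. The neural network $\boldsymbol{\kappa}_{\boldsymbol{\gamma}^*}(\cdot)$, together with the corresponding approximate distribution $q(\cdot\,; \boldsymbol{\kappa})$, is collectively referred to as a neural posterior estimator.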
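The following sketch illustrates the Monte Carlo objective, that is, minus the sum of $\log q(\boldsymbol{\theta}^{(k)}; \boldsymbol{\kappa}(\boldsymbol{Z}^{(k)}))$ over simulated pairs, under illustrative assumptions only: a toy conjugate model ($\theta \sim \mathrm{N}(0, 1)$, $Z_i \mid \theta \sim \mathrm{N}(\theta, 1)$), a Gaussian approximate distribution whose mean is linear in the sample mean and whose log standard deviation is constant (a stand-in for the neural network), and a closed-form minimiser in place of gradient-based training. For this model the exact posterior is $\mathrm{N}(n\bar{Z}/(n+1), 1/(n+1))$, so the fitted slope and variance can be compared with $n/(n+1)$ and $1/(n+1)$.

```python
import math
import random

def simulate(K, n, rng):
    """Draw K pairs (theta, mean(Z)): theta ~ N(0, 1), Z_i | theta ~ N(theta, 1)."""
    pairs = []
    for _ in range(K):
        theta = rng.gauss(0.0, 1.0)
        z_bar = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        pairs.append((theta, z_bar))
    return pairs

def objective(gamma, pairs):
    """Minus the sum of log q(theta; kappa(Z)), with q Gaussian,
    mean a * mean(Z) + b, and log standard deviation log_sigma."""
    a, b, log_sigma = gamma
    sigma = math.exp(log_sigma)
    total = 0.0
    for theta, z_bar in pairs:
        mu = a * z_bar + b
        log_q = -0.5 * math.log(2 * math.pi) - log_sigma - 0.5 * ((theta - mu) / sigma) ** 2
        total -= log_q
    return total

def fit(pairs):
    """Closed-form minimiser of the objective: least squares for (a, b),
    then the mean squared residual for the variance."""
    K = len(pairs)
    t_bar = sum(t for t, _ in pairs) / K
    x_bar = sum(x for _, x in pairs) / K
    sxy = sum((x - x_bar) * (t - t_bar) for t, x in pairs)
    sxx = sum((x - x_bar) ** 2 for _, x in pairs)
    a = sxy / sxx
    b = t_bar - a * x_bar
    sigma2 = sum((t - a * x - b) ** 2 for t, x in pairs) / K
    return a, b, 0.5 * math.log(sigma2)

rng = random.Random(2)
n = 5
pairs = simulate(20000, n, rng)
gamma_star = fit(pairs)
a, b, log_sigma = gamma_star
print(a, math.exp(2 * log_sigma))  # compare with n / (n + 1) and 1 / (n + 1)
```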

There are numerous options for the approximate distribution $q(\cdot\,; \boldsymbol{\kappa})$. For instance, $q(\cdot\,; \boldsymbol{\kappa})$ can be modelled as a Gaussian distribution (e.g., Chan et al., 2018), where the parameters $\boldsymbol{\kappa} = (\boldsymbol{\mu}', \mathrm{vech}(\mathbf{L})')'$ consist of a $d$-dimensional mean parameter $\boldsymbol{\mu}$ and the $d(d+1)/2$ non-zero elements of the lower Cholesky factor $\mathbf{L}$ of a covariance matrix, and the half-vectorisation operator $\mathrm{vech}(\cdot)$ vectorises the lower triangle of its matrix argument. For further flexibility, one may consider trans-Gaussian distributions (e.g., Maceda et al., 2024) or Gaussian mixtures (e.g., Papamakarios & Murray, 2016), the latter of which is implemented in the package as GaussianMixture.
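To make the Gaussian parameterisation concrete, the sketch below unpacks a parameter vector into a mean and a lower-triangular Cholesky factor, and then forms the covariance matrix as the factor times its transpose. The helper names `unpack_kappa` and `covariance` are hypothetical, and the column-wise ordering of the half-vectorisation is an assumption. In practice the diagonal of the factor must also be kept positive (for example through a positive transformation of the network output); that step is omitted here and the package's own handling is not shown.

```python
def unpack_kappa(kappa, d):
    """Split kappa = (mu', vech(L)')' into the mean mu and the
    lower-triangular factor L, filling L column by column."""
    assert len(kappa) == d + d * (d + 1) // 2
    mu = list(kappa[:d])
    L = [[0.0] * d for _ in range(d)]
    idx = d
    for col in range(d):
        for row in range(col, d):
            L[row][col] = kappa[idx]
            idx += 1
    return mu, L

def covariance(L):
    """Covariance matrix Sigma = L L'."""
    d = len(L)
    return [
        [sum(L[i][k] * L[j][k] for k in range(d)) for j in range(d)]
        for i in range(d)
    ]

# d = 2: two mean elements followed by d(d+1)/2 = 3 Cholesky elements
kappa = [0.5, -1.0, 2.0, 0.3, 1.5]
mu, L = unpack_kappa(kappa, d=2)
Sigma = covariance(L)
print(mu, L, Sigma)
```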

Another widely adopted approach to modelling $q(\cdot\,; \boldsymbol{\kappa})$ is through the use of normalising flows (e.g., Ardizzone et al., 2019; Radev et al., 2022), excellent reviews for which are given by Kobyzev et al. (2020) and Papamakarios (2021). A particularly popular class of normalising flows is that of affine coupling flows (e.g., Dinh et al., 2016; Kingma & Dhariwal, 2018; Ardizzone et al., 2019), which are universal density approximators (Teshima et al., 2020) and are implemented in the package as NormalisingFlow.

Neural ratio estimators

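Finally, we describe amortised inference by approximation of the likelihood-to-evidence ratio,

$$r(\boldsymbol{Z}, \boldsymbol{\theta}) \equiv p(\boldsymbol{Z} \mid \boldsymbol{\theta}) / p(\boldsymbol{Z}),$$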

where $p(\boldsymbol{Z} \mid \boldsymbol{\theta})$ is the likelihood and $p(\boldsymbol{Z})$ is the marginal likelihood (also known as the model evidence).

The likelihood-to-evidence ratio is ubiquitous in statistical inference. For example, likelihood ratios of the form $p(\boldsymbol{Z} \mid \boldsymbol{\theta}_0) / p(\boldsymbol{Z} \mid \boldsymbol{\theta}_1) = r(\boldsymbol{Z}, \boldsymbol{\theta}_0) / r(\boldsymbol{Z}, \boldsymbol{\theta}_1)$ are central to hypothesis testing and model comparison, and naturally appear in the transition probabilities of most standard MCMC algorithms used for Bayesian inference. Further, since the likelihood-to-evidence ratio is a prior-free quantity, its approximation facilitates Bayesian inference in applications where one requires multiple fits of the model under different prior distributions.

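Unlike the methods discussed earlier, the likelihood-to-evidence ratio might not immediately seem like a quantity well-suited for approximation by neural networks, which are trained by minimising empirical risk functions. However, this ratio emerges naturally as a simple transformation of the optimal solution to a standard binary classification problem, derived through the minimisation of an average risk. Specifically, consider a binary classifier $c(\boldsymbol{Z}, \boldsymbol{\theta})$ that distinguishes dependent data-parameter pairs $(\boldsymbol{Z}', \boldsymbol{\theta}')' \sim p(\boldsymbol{Z}, \boldsymbol{\theta})$ with class labels $Y = 1$ from independent data-parameter pairs $(\tilde{\boldsymbol{Z}}', \tilde{\boldsymbol{\theta}}')' \sim p(\boldsymbol{Z}) p(\boldsymbol{\theta})$ with class labels $Y = 0$, and where the classes are balanced. Here, $p(\boldsymbol{\theta})$ denotes an arbitrary "proposal" distribution for $\boldsymbol{\theta}$ that does not, in general, coincide with the prior distribution (see below). Then, the Bayes classifier under binary cross-entropy loss is defined as

$$c^*(\cdot, \cdot) \equiv \underset{c(\cdot, \cdot)}{\mathrm{arg\,min}} \sum_{y \in \{0, 1\}} \Pr(Y = y) \int_\Theta \int_{\mathcal{Z}} L_{\mathrm{BCE}}\{y, c(\boldsymbol{Z}, \boldsymbol{\theta})\} \, p(\boldsymbol{Z}, \boldsymbol{\theta} \mid Y = y) \, \mathrm{d}\boldsymbol{Z} \, \mathrm{d}\boldsymbol{\theta},$$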

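where $L_{\mathrm{BCE}}(y, c) \equiv -y \log(c) - (1 - y) \log(1 - c)$. It can be shown (e.g., Hermans et al., 2020, App. B) that the Bayes classifier is given by

$$c^*(\boldsymbol{Z}, \boldsymbol{\theta}) = \frac{p(\boldsymbol{Z}, \boldsymbol{\theta})}{p(\boldsymbol{Z}, \boldsymbol{\theta}) + p(\boldsymbol{\theta}) p(\boldsymbol{Z})}, \quad \boldsymbol{Z} \in \mathcal{Z}, \; \boldsymbol{\theta} \in \Theta,$$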

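and, hence,

$$r(\boldsymbol{Z}, \boldsymbol{\theta}) = \frac{c^*(\boldsymbol{Z}, \boldsymbol{\theta})}{1 - c^*(\boldsymbol{Z}, \boldsymbol{\theta})}, \quad \boldsymbol{Z} \in \mathcal{Z}, \; \boldsymbol{\theta} \in \Theta.$$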

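This connection links the likelihood-to-evidence ratio to the average-risk-optimal solution of a standard binary classification problem, and consequently provides a foundation for approximating the ratio using neural networks. Specifically, let $c_{\boldsymbol{\gamma}} : \mathcal{Z} \times \Theta \to (0, 1)$ denote a neural network parameterised by $\boldsymbol{\gamma}$. Then the Bayes classifier may be approximated by $c_{\boldsymbol{\gamma}^*}(\cdot, \cdot)$, where

$$\boldsymbol{\gamma}^* \equiv \underset{\boldsymbol{\gamma}}{\mathrm{arg\,min}} \; -\sum_{k=1}^{K} \Big[ \log\{c_{\boldsymbol{\gamma}}(\boldsymbol{Z}^{(k)}, \boldsymbol{\theta}^{(k)})\} + \log\{1 - c_{\boldsymbol{\gamma}}(\boldsymbol{Z}^{(\sigma(k))}, \boldsymbol{\theta}^{(k)})\} \Big],$$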

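with each $\boldsymbol{\theta}^{(k)}$ sampled independently from a "proposal" distribution $p(\boldsymbol{\theta})$, $\boldsymbol{Z}^{(k)} \sim p(\boldsymbol{Z} \mid \boldsymbol{\theta}^{(k)})$, and $\sigma(\cdot)$ a random permutation of $\{1, \dots, K\}$. The proposal distribution $p(\boldsymbol{\theta})$ does not necessarily correspond to the prior distribution $\pi(\boldsymbol{\theta})$, which is specified in the downstream inference algorithm (see below). In theory, any $p(\boldsymbol{\theta})$ with support over $\Theta$ can be used. However, with finite training data, the choice of $p(\boldsymbol{\theta})$ is important, as it determines where the parameters are most densely sampled and, hence, where the neural network $c_{\boldsymbol{\gamma}^*}(\cdot, \cdot)$ best approximates the Bayes classifier. Further, since neural networks are only reliable within the support of their training samples, a $p(\boldsymbol{\theta})$ lacking full support over $\Theta$ essentially acts as a "soft prior".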
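The construction of dependent and permuted pairs, and the binary cross-entropy objective built from them, can be sketched as follows. The setting is an assumption chosen so that everything is available in closed form: a scalar toy model with proposal $\theta \sim \mathrm{N}(0, 1)$ and $Z \mid \theta \sim \mathrm{N}(\theta, 1)$, so that marginally $Z \sim \mathrm{N}(0, 2)$. The exact Bayes classifier for this toy model stands in for a trained network, and it is compared with an uninformative classifier that always returns 0.5; the former should attain the smaller objective value.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def bayes_classifier(z, theta):
    """Exact Bayes classifier for the toy model: joint / (joint + product of marginals)."""
    joint = normal_pdf(z, theta, 1.0) * normal_pdf(theta, 0.0, 1.0)
    product = normal_pdf(z, 0.0, 2.0) * normal_pdf(theta, 0.0, 1.0)
    return joint / (joint + product)

def objective(classifier, thetas, zs, perm):
    """Binary cross-entropy over dependent pairs (Z_k, theta_k), labelled 1,
    and permuted pairs (Z_perm[k], theta_k), labelled 0."""
    total = 0.0
    for k, theta in enumerate(thetas):
        total -= math.log(classifier(zs[k], theta))
        total -= math.log(1.0 - classifier(zs[perm[k]], theta))
    return total

rng = random.Random(0)
K = 5000
thetas = [rng.gauss(0.0, 1.0) for _ in range(K)]  # draws from the proposal
zs = [rng.gauss(t, 1.0) for t in thetas]          # data simulated given each theta
perm = list(range(K))
rng.shuffle(perm)  # random permutation; rare fixed points are ignored in this sketch

loss_bayes = objective(bayes_classifier, thetas, zs, perm)
loss_coin = objective(lambda z, theta: 0.5, thetas, zs, perm)
print(loss_bayes, loss_coin)
```

Permuting the data indices reuses the simulated data to form approximately independent pairs, so no additional simulation is needed for the second class.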

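Once the neural network is trained, $c_{\boldsymbol{\gamma}^*}(\boldsymbol{Z}, \boldsymbol{\theta}) \{1 - c_{\boldsymbol{\gamma}^*}(\boldsymbol{Z}, \boldsymbol{\theta})\}^{-1}$, $\boldsymbol{Z} \in \mathcal{Z}$, $\boldsymbol{\theta} \in \Theta$, may be used to quickly approximate the likelihood-to-evidence ratio, and therefore it is called a neural ratio estimator.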

Inference based on a neural ratio estimator may proceed in a frequentist setting via maximum likelihood and likelihood ratios (e.g., Walchessen et al., 2024), and in a Bayesian setting by facilitating the computation of transition probabilities in Hamiltonian Monte Carlo and other MCMC algorithms (e.g., Hermans et al., 2020). Further, an approximate posterior distribution can be obtained via the identity $p(\boldsymbol{\theta} \mid \boldsymbol{Z}) = \pi(\boldsymbol{\theta}) r(\boldsymbol{Z}, \boldsymbol{\theta})$, and sampled from using standard sampling techniques (e.g., Thomas et al., 2022).
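As a minimal sketch of posterior inference from a ratio estimator, the code below evaluates the product of the prior density and the estimated ratio on a grid of parameter values for one observed data point. The assumptions are a scalar toy model with $\theta \sim \mathrm{N}(0, 1)$ and $Z \mid \theta \sim \mathrm{N}(\theta, 1)$, a prior equal to the proposal, and the exact Bayes classifier standing in for a trained network. The exact posterior for this model is $\mathrm{N}(Z/2, 1/2)$, which gives a reference for the grid-based posterior mean.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def classifier(z, theta):
    """Stand-in for a trained classifier: the exact Bayes classifier of the toy model."""
    likelihood = normal_pdf(z, theta, 1.0)
    evidence = normal_pdf(z, 0.0, 2.0)
    return likelihood / (likelihood + evidence)

def ratio(z, theta):
    """Likelihood-to-evidence ratio recovered from the classifier as c / (1 - c)."""
    c = classifier(z, theta)
    return c / (1.0 - c)

def prior(theta):
    return normal_pdf(theta, 0.0, 1.0)

z_obs = 1.2
step = 0.01
grid = [-6.0 + step * i for i in range(1201)]

# Prior times ratio on the grid, followed by numerical normalisation
unnormalised = [prior(t) * ratio(z_obs, t) for t in grid]
mass = sum(unnormalised) * step
posterior = [u / mass for u in unnormalised]
post_mean = sum(t * p for t, p in zip(grid, posterior)) * step
print(mass, post_mean)  # mass should be near 1, post_mean near z_obs / 2
```

Grid evaluation is only practical in low dimensions; for higher-dimensional parameters the same product of prior and ratio would instead be passed to a sampler.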