Architectures

Modules

The following high-level modules are often used when constructing a neural-network architecture. In particular, the DeepSet is the building block for most classes of Estimators in the package.

NeuralEstimators.DeepSet (Type)
DeepSet(ψ, ϕ, a = mean; S = nothing)

The DeepSets representation (Zaheer et al., 2017),

\[θ̂(𝐙) = ϕ(𝐓(𝐙)),    𝐓(𝐙) = 𝐚(\{ψ(𝐙ᵢ) : i = 1, …, m\}),\]

where 𝐙 ≡ (𝐙₁', …, 𝐙ₘ')' are independent replicates from the statistical model, ψ and ϕ are neural networks, and a is a permutation-invariant aggregation function. Expert summary statistics can be incorporated as,

\[θ̂(𝐙) = ϕ((𝐓(𝐙)', 𝐒(𝐙)')'),\]

where S is a function that returns a vector of user-defined summary statistics. These user-defined summary statistics are provided either as a Function that returns a Vector, or as a vector of functions. In the case that ψ is set to nothing, only expert summary statistics will be used.

The aggregation function a can be any function that acts on an array and has a keyword argument dims that allows aggregation over a specific dimension of the array (e.g., sum, mean, maximum, minimum, logsumexp).

DeepSet objects act on data of type Vector{A}, where each element of the vector is associated with one data set (i.e., one set of independent replicates from the statistical model), and where the type A depends on the form of the data and the chosen architecture for ψ. As a rule of thumb, when A is an array, the replicates are stored in the final dimension. For example, with gridded spatial data and ψ a CNN, A should be a 4-dimensional array, with the replicates stored in the 4ᵗʰ dimension. Note that in Flux, the final dimension is usually the "batch" dimension, but batching with DeepSet objects is done at the data set level (i.e., sets of replicates are batched together).

Data stored as a vector of arrays are first concatenated along the replicates dimension before being passed into the summary network ψ. This means that ψ is applied to a single large array rather than many small arrays, which can substantially improve computational efficiency.

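For example, the following minimal sketch (with arbitrary grid and filter sizes chosen purely for illustration) pairs a small CNN summary network with two data sets of 16×16 gridded observations, storing the replicates in the 4ᵗʰ dimension:

using NeuralEstimators, Flux
ψ = Chain(Conv((3, 3), 1 => 16, relu), Flux.flatten, Dense(16 * 14 * 14, 32, relu))
ϕ = Chain(Dense(32, 64, relu), Dense(64, 2))
ds = DeepSet(ψ, ϕ)
Z = [rand32(16, 16, 1, m) for m ∈ (3, 4)]  # two data sets with 3 and 4 replicates
ds(Z)
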
Set-level information, $𝐱$, that is not a function of the data can be passed directly into the inference network ϕ in the following manner,

\[θ̂(𝐙) = ϕ((𝐓(𝐙)', 𝐱')'),    \]

or, in the case that expert summary statistics are also used,

\[θ̂(𝐙) = ϕ((𝐓(𝐙)', 𝐒(𝐙)', 𝐱')').  \]

This is done by calling the DeepSet object on a Tuple{Vector{A}, Vector{Vector}}, where the first element of the tuple contains a vector of data sets and the second element contains a vector of set-level information (i.e., one vector for each data set).

Examples

using NeuralEstimators, Flux

# Two dummy data sets containing 3 and 4 replicates
p = 5  # number of parameters in the statistical model
n = 10 # dimension of each replicate
Z = [rand32(n, m) for m ∈ (3, 4)]

# Construct the deepset object
S = samplesize
qₛ = 1   # dimension of expert summary statistic
qₜ = 16  # dimension of neural summary statistic
w = 32  # width of hidden layers
ψ = Chain(Dense(n, w, relu), Dense(w, qₜ, relu))
ϕ = Chain(Dense(qₜ + qₛ, w, relu), Dense(w, p))
ds = DeepSet(ψ, ϕ; S = S)

# Apply the deepset object to data
ds(Z)

# Data with set-level information
qₓ = 2 # dimension of set-level vector
ϕ = Chain(Dense(qₜ + qₛ + qₓ, w, relu), Dense(w, p))
ds = DeepSet(ψ, ϕ; S = S)
x = [rand32(qₓ) for _ ∈ eachindex(Z)]
ds((Z, x))
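
# The aggregation function can also be changed from the default mean; a brief
# sketch using the elementwise maximum over the replicates
ds = DeepSet(ψ, ϕ, maximum; S = S)
ds((Z, x))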
NeuralEstimators.GNNSummary (Type)
GNNSummary(propagation, readout)

A graph neural network (GNN) module designed to serve as the summary network ψ in the DeepSet representation when the data are graphical (e.g., irregularly observed spatial data).

The propagation module transforms graphical input data into a set of hidden-feature graphs. The readout module aggregates these feature graphs into a single hidden feature vector of fixed length (i.e., a vector of summary statistics). The summary network is then defined as the composition of the propagation and readout modules.

The data should be stored as a GNNGraph or Vector{GNNGraph}, where each graph is associated with a single parameter vector. The graphs may contain subgraphs corresponding to independent replicates.

Examples

using NeuralEstimators, Flux, GraphNeuralNetworks
using Flux: batch
using Statistics: mean

# Propagation module
d = 1      # dimension of response variable
nₕ = 32    # dimension of node feature vectors
propagation = GNNChain(GraphConv(d => nₕ), GraphConv(nₕ => nₕ))

# Readout module
readout = GlobalPool(mean)
nᵣ = nₕ   # dimension of readout vector

# Summary network
ψ = GNNSummary(propagation, readout)

# Inference network
p = 3     # number of parameters in the statistical model
w = 64    # width of hidden layer
ϕ = Chain(Dense(nᵣ, w, relu), Dense(w, p))

# Construct the estimator
θ̂ = DeepSet(ψ, ϕ)

# Apply the estimator to a single graph, a single graph with subgraphs
# (corresponding to independent replicates), and a vector of graphs
# (corresponding to multiple data sets each with independent replicates)
g₁ = rand_graph(11, 30, ndata=rand(d, 11))
g₂ = rand_graph(13, 40, ndata=rand(d, 13))
g₃ = batch([g₁, g₂])
θ̂(g₁)
θ̂(g₃)
θ̂([g₁, g₂, g₃])

User-defined summary statistics

    The following functions compute summary statistics that are often useful as user-defined summary statistics in DeepSet objects.

    NeuralEstimators.samplesize (Function)
    samplesize(Z::AbstractArray)

    Computes the sample size of a set of independent realisations Z.

    Note that this function is a wrapper around numberreplicates, but it returns the number of replicates as the eltype of Z rather than as an integer.

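    Examples

    # A minimal usage sketch; as discussed above, replicates are stored in the
    # final dimension of Z
    using NeuralEstimators
    z = rand(3, 5)  # 5 independent replicates of a 3-dimensional vector
    samplesize(z)   # returns 5 as the eltype of z (here, Float64)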
    NeuralEstimators.samplecorrelation (Function)
    samplecorrelation(Z::AbstractArray)

    Computes the sample correlation matrix, R̂, and returns the vectorised strict lower triangle of R̂.

    Examples

    using NeuralEstimators

    # 5 independent replicates of a 3-dimensional vector
    z = rand(3, 5)
    samplecorrelation(z)
    NeuralEstimators.NeighbourhoodVariogram (Type)
    NeighbourhoodVariogram(h_max, n_bins) 
    (l::NeighbourhoodVariogram)(g::GNNGraph)

    Computes the empirical variogram,

    \[\hat{\gamma}(h \pm \delta) = \frac{1}{2|N(h \pm \delta)|} \sum_{(i,j) \in N(h \pm \delta)} (Z_i - Z_j)^2\]

    where $N(h \pm \delta) \equiv \left\{(i,j) : \|\boldsymbol{s}_i - \boldsymbol{s}_j\| \in (h-\delta, h+\delta)\right\}$ is the set of pairs of locations separated by a distance within $(h-\delta, h+\delta)$, and $|\cdot|$ denotes set cardinality.

    The distance bins are constructed to have constant width $2\delta$, chosen based on the maximum distance h_max to be considered, and the specified number of bins n_bins.

    The input type is a GNNGraph, and the empirical variogram is computed based on the corresponding graph structure. Specifically, only locations that are considered neighbours will be used when computing the empirical variogram.

    Examples

    using NeuralEstimators, Distances, LinearAlgebra
      
    # Simulate Gaussian spatial data with exponential covariance function 
    θ = 0.1                                 # true range parameter 
    n = 250                                 # number of spatial locations 
    S = rand(n, 2)                          # spatial locations 
    D = pairwise(Euclidean(), S, dims = 1)  # distance matrix 
    Σ = exp.(-D ./ θ)                       # covariance matrix 
    L = cholesky(Symmetric(Σ)).L            # Cholesky factor 
    m = 5                                   # number of independent replicates 
    Z = L * randn(n, m)                     # simulated data 
    
    # Construct the spatial graph 
    r = 0.15                                # radius of neighbourhood set
    g = spatialgraph(S, Z, r = r)
    
    # Construct the variogram object with 10 bins
    nv = NeighbourhoodVariogram(r, 10) 
    
    # Compute the empirical variogram 
    nv(g)

    Layers

    In addition to the built-in layers provided by Flux, the following layers may be used when building a neural-network architecture.

    NeuralEstimators.DensePositive (Type)
    DensePositive(layer::Dense, g::Function)
    DensePositive(layer::Dense; g::Function = Flux.relu)

    Wrapper around the standard Dense layer that ensures positive weights (biases are left unconstrained).

    This layer can be useful for constructing (partially) monotonic neural networks (see, e.g., QuantileEstimatorContinuous).

    Examples

    using NeuralEstimators, Flux
    
    layer = DensePositive(Dense(5 => 2))
    x = rand32(5, 64)
    layer(x)
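
    # The function used to constrain the weights to be positive can be changed
    # via the keyword argument g; a sketch using softplus
    layer = DensePositive(Dense(5 => 2); g = Flux.softplus)
    layer(x)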
    NeuralEstimators.PowerDifference (Type)
    PowerDifference(a, b)

    Function $f(x, y) = |ax - (1-a)y|^b$ for trainable parameters a ∈ [0, 1] and b > 0.

    Examples

    using NeuralEstimators, Flux
    
    # Generate some data
    d = 5
    K = 10000
    X = randn32(d, K)
    Y = randn32(d, K)
    XY = (X, Y)
    a = 0.2f0
    b = 1.3f0
    Z = (abs.(a .* X - (1 .- a) .* Y)).^b
    
    # Initialise layer
    f = PowerDifference([0.5f0], [2.0f0])
    
    # Optimise the layer
    loader = Flux.DataLoader((XY, Z), batchsize=32, shuffle=false)
    optim = Flux.setup(Flux.Adam(0.01), f)
    for epoch in 1:100
        for (xy, z) in loader
            loss, grads = Flux.withgradient(f) do m
                Flux.mae(m(xy), z)
            end
            Flux.update!(optim, f, grads[1])
        end
    end
    
    # Estimates of a and b
    f.a
    f.b
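
    # Quick sanity check: mean absolute error between the fitted layer and the data
    Flux.mae(f(XY), Z)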
    NeuralEstimators.ResidualBlock (Type)
    ResidualBlock(filter, in => out; stride = 1)

    Basic residual block, consisting of two sequential convolutional layers and a skip (shortcut) connection that connects the input of the block directly to the output, facilitating the training of deep networks.

    Examples

    using NeuralEstimators
    z = rand(16, 16, 1, 1)
    b = ResidualBlock((3, 3), 1 => 32)
    b(z)
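
    # Inspect the output dimensions (the block maps 1 input channel to 32 output channels)
    size(b(z))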
    NeuralEstimators.SpatialGraphConv (Type)
    SpatialGraphConv(in => out, g=relu; args...)

    Implements a spatial graph convolution for isotropic processes,

    \[ \boldsymbol{h}^{(l)}_{j} = g\Big( \boldsymbol{\Gamma}_{\!1}^{(l)} \boldsymbol{h}^{(l-1)}_{j} + \boldsymbol{\Gamma}_{\!2}^{(l)} \bar{\boldsymbol{h}}^{(l)}_{j} + \boldsymbol{\gamma}^{(l)} \Big), \quad \bar{\boldsymbol{h}}^{(l)}_{j} = \sum_{j' \in \mathcal{N}(j)}\boldsymbol{w}^{(l)}(\|\boldsymbol{s}_{j'} - \boldsymbol{s}_j\|) \odot f^{(l)}(\boldsymbol{h}^{(l-1)}_{j}, \boldsymbol{h}^{(l-1)}_{j'}),\]

    where $\boldsymbol{h}^{(l)}_{j}$ is the hidden feature vector at location $\boldsymbol{s}_j$ at layer $l$, $g(\cdot)$ is a non-linear activation function applied elementwise, $\boldsymbol{\Gamma}_{\!1}^{(l)}$ and $\boldsymbol{\Gamma}_{\!2}^{(l)}$ are trainable parameter matrices, $\boldsymbol{\gamma}^{(l)}$ is a trainable bias vector, $\mathcal{N}(j)$ denotes the indices of neighbours of $\boldsymbol{s}_j$, $\boldsymbol{w}^{(l)}(\cdot)$ is a (learnable) spatial weighting function, $\odot$ denotes elementwise multiplication, and $f^{(l)}(\cdot, \cdot)$ is a (learnable) function.

    By default, the function $f^{(l)}(\cdot, \cdot)$ is modelled using a PowerDifference function. One may alternatively employ a nonlearnable function, for example, f = (hᵢ, hⱼ) -> (hᵢ - hⱼ).^2, specified through the keyword argument f.

    The spatial distances between locations must be stored as an edge feature, as facilitated by spatialgraph(). The input to $\boldsymbol{w}(\cdot)$ is a $1 \times n$ matrix (i.e., a row vector) of spatial distances. The output of $\boldsymbol{w}(\cdot)$ must be either a scalar; a vector of the same dimension as the feature vectors of the previous layer; or, if the feature vectors of the previous layer are scalars, a vector of arbitrary dimension. To promote identifiability, the weights are normalised to sum to one (row-wise) within each neighbourhood set. By default, $\boldsymbol{w}(\cdot)$ is taken to be a multilayer perceptron with a single hidden layer, although a custom choice for this function can be provided using the keyword argument w.

    Arguments

    • in: The dimension of input features.
    • out: The dimension of output features.
    • g = relu: Activation function.
    • bias = true: Add learnable bias?
    • init = glorot_uniform: Initialiser for $\boldsymbol{\Gamma}_{\!1}^{(l)}$, $\boldsymbol{\Gamma}_{\!2}^{(l)}$, and $\boldsymbol{\gamma}^{(l)}$.
    • f = nothing: The function $f^{(l)}(\cdot, \cdot)$ applied to pairs of hidden feature vectors; if nothing (default), a learnable PowerDifference function is used.
    • w = nothing: The spatial weighting function $\boldsymbol{w}^{(l)}(\cdot)$; if nothing (default), a multilayer perceptron with a single hidden layer is used.
    • w_width = 128: (Only applicable if w = nothing) The width of the hidden layer in the MLP used to model $\boldsymbol{w}(\cdot)$.
    • w_out = in: (Only applicable if w = nothing) The output dimension of $\boldsymbol{w}(\cdot)$.

    Examples

    using NeuralEstimators, Flux, GraphNeuralNetworks
    
    # Toy spatial data
    m = 5                  # number of independent replicates
    d = 2                  # spatial dimension
    n = 250                # number of spatial locations
    S = rand(n, d)         # spatial locations
    Z = rand(n, m)         # data
    g = spatialgraph(S, Z) # construct the graph
    
    # Construct and apply spatial graph convolution layer
    l = SpatialGraphConv(1 => 10)
    l(g)
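
    # As noted above, the function f may be supplied through the keyword
    # argument f; a sketch using a fixed (non-learnable) squared difference
    l = SpatialGraphConv(1 => 10, relu; f = (hᵢ, hⱼ) -> (hᵢ - hⱼ).^2)
    l(g)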

    Output layers

      In addition to the standard activation functions provided by Flux, the following layers can be used at the end of an architecture to act as "output activation functions" that ensure valid estimates for certain models. Note that, although we may conceptualise the following structs as "output activation functions", they should be treated as separate layers included in the final stage of a Flux Chain(). In particular, they cannot be used as the activation function of a Dense layer.

      NeuralEstimators.Compress (Type)
      Compress(a, b, k = 1)

      Layer that compresses its input to lie between a and b (elementwise), where each element of a is less than the corresponding element of b.

      The layer uses a logistic function,

      \[l(θ) = a + \frac{b - a}{1 + e^{-kθ}},\]

      where the arguments a and b together shift and scale the logistic function to the range (a, b), and the growth rate k controls the steepness of the curve.

      The logistic function given here contains an additional parameter, θ₀, which is the input value corresponding to the function's midpoint. In Compress, we fix θ₀ = 0, since the output of a randomly initialised neural network is typically around zero.

      Examples

      using NeuralEstimators, Flux
      
      a = [25, 0.5, -pi/2]
      b = [500, 2.5, 0]
      p = length(a)
      K = 100
      θ = randn(p, K)
      l = Compress(a, b)
      l(θ)
      
      n = 20
      θ̂ = Chain(Dense(n, p), l)
      Z = randn(n, K)
      θ̂(Z)
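
      # The growth rate k can also be specified; a sketch with an arbitrary value of k = 10
      l = Compress(a, b, 10)
      l(θ)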
      NeuralEstimators.CorrelationMatrix (Type)
      CorrelationMatrix(d)
      (object::CorrelationMatrix)(x::Matrix, cholesky::Bool = false)

      Transforms a real-valued vector 𝐯 into the parameters of an unconstrained d×d correlation matrix or, if cholesky = true, the lower Cholesky factor of an unconstrained d×d correlation matrix.

      The expected input is a Matrix with T(d-1) = (d-1)d÷2 rows, where T(d-1) is the (d-1)th triangular number (the number of free parameters in an unconstrained d×d correlation matrix), and the output is a Matrix of the same dimension. The columns of the input and output matrices correspond to independent parameter configurations (i.e., different correlation matrices).

      Internally, the layer constructs a valid Cholesky factor 𝐋 for a correlation matrix, and then extracts the strict lower triangle from the correlation matrix 𝐑 = 𝐋𝐋'. The lower triangle is extracted and vectorised in line with Julia's column-major ordering: for example, when modelling the correlation matrix

      \[\begin{bmatrix} 1 & R₁₂ & R₁₃ \\ R₂₁ & 1 & R₂₃\\ R₃₁ & R₃₂ & 1\\ \end{bmatrix},\]

      the rows of the matrix returned by a CorrelationMatrix layer are ordered as

      \[\begin{bmatrix} R₂₁ \\ R₃₁ \\ R₃₂ \\ \end{bmatrix},\]

      which means that the output can easily be transformed into the implied correlation matrices using vectotril and Symmetric.

      See also CovarianceMatrix.

      Examples

      using NeuralEstimators
      using LinearAlgebra
      using Flux
      
      d  = 4
      l  = CorrelationMatrix(d)
      p  = (d-1)*d÷2
      θ  = randn(p, 100)
      
      # Returns a matrix of parameters, which can be converted to correlation matrices
      R = l(θ)
      R = map(eachcol(R)) do r
      	R = Symmetric(cpu(vectotril(r, strict = true)), :L)
      	R[diagind(R)] .= 1
      	R
      end
      
      # Obtain the Cholesky factor directly
      L = l(θ, true)
      L = map(eachcol(L)) do x
      	# Only the strict lower diagonal elements are returned
      	L = LowerTriangular(cpu(vectotril(x, strict = true)))
      
      	# Diagonal elements are determined under the constraint diag(L*L') = 𝟏
      	L[diagind(L)] .= sqrt.(1 .- rowwisenorm(L).^2)
      	L
      end
      L[1] * L[1]'
      NeuralEstimators.CovarianceMatrix (Type)
      CovarianceMatrix(d)
      (object::CovarianceMatrix)(x::Matrix, cholesky::Bool = false)

      Transforms a real-valued vector 𝐯 into the parameters of an unconstrained d×d covariance matrix or, if cholesky = true, the lower Cholesky factor of an unconstrained d×d covariance matrix.

      The expected input is a Matrix with T(d) = d(d+1)÷2 rows, where T(d) is the dth triangular number (the number of free parameters in an unconstrained d×d covariance matrix), and the output is a Matrix of the same dimension. The columns of the input and output matrices correspond to independent parameter configurations (i.e., different covariance matrices).

      Internally, the layer constructs a valid Cholesky factor 𝐋 and then extracts the lower triangle from the positive-definite covariance matrix 𝚺 = 𝐋𝐋'. The lower triangle is extracted and vectorised in line with Julia's column-major ordering: for example, when modelling the covariance matrix

      \[\begin{bmatrix} Σ₁₁ & Σ₁₂ & Σ₁₃ \\ Σ₂₁ & Σ₂₂ & Σ₂₃ \\ Σ₃₁ & Σ₃₂ & Σ₃₃ \\ \end{bmatrix},\]

      the rows of the matrix returned by a CovarianceMatrix are ordered as

      \[\begin{bmatrix} Σ₁₁ \\ Σ₂₁ \\ Σ₃₁ \\ Σ₂₂ \\ Σ₃₂ \\ Σ₃₃ \\ \end{bmatrix},\]

      which means that the output can easily be transformed into the implied covariance matrices using vectotril and Symmetric.

      See also CorrelationMatrix.

      Examples

      using NeuralEstimators
      using Flux
      using LinearAlgebra
      
      d = 4
      l = CovarianceMatrix(d)
      p = d*(d+1)÷2
      θ = randn(p, 50)
      
      # Returns a matrix of parameters, which can be converted to covariance matrices
      Σ = l(θ)
      Σ = [Symmetric(cpu(vectotril(x)), :L) for x ∈ eachcol(Σ)]
      
      # Obtain the Cholesky factor directly
      L = l(θ, true)
      L = [LowerTriangular(cpu(vectotril(x))) for x ∈ eachcol(L)]
      L[1] * L[1]'
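
      # The two parameterisations agree: the covariance matrix implied by the
      # Cholesky factor matches the one returned directly
      Σ[1] ≈ L[1] * L[1]'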