petitRADTRANS.sbi.training#

Training and checkpoint orchestration for SBI posterior models.

This module holds the optimisation engine that fit() (and the flow-matching estimator) delegate to. It is deliberately backend-agnostic: the estimator supplies the model, a dataset reader, and a handful of callbacks (loss, validation loss, per-checkpoint diagnostics, and the selection/stability metrics), and SBITrainer runs the epoch loop, selects checkpoints, and persists resumable state.

Key pieces#

TrainingSchemeConfig – the trainer’s own (internal) optimisation configuration: schedule, batch size, multi-device flag, checkpointing, early stopping, and verbose-diagnostics knobs. Distinct from the public, grouped petitRADTRANS.sbi.config.TrainingConfig, which is translated into this at fit time.
SBITrainer – the reusable loop: AdamW with an optional warmup+cosine schedule and gradient clipping, JIT-compiled train/eval steps, optional data-parallel sharding across devices, pluggable validation diagnostics, checkpoint selection, and early stopping.
Checkpoint backends – CheckpointBackend (ABC) with two implementations, EquinoxCheckpointBackend (dill-pickled trees) and OrbaxCheckpointBackend. resolve_checkpoint_backend() picks one ('auto' prefers Orbax and falls back to Equinox); load_trainer_checkpoint_state() restores a single checkpoint outside the trainer and is used by the workflow’s checkpoint guardrails.

Checkpoint kinds#

Each epoch the trainer may write up to five checkpoint subdirectories, all carrying their epoch’s full validation-metric block so a later run can audit why a checkpoint was chosen:

best_selection – minimum of the selection metric. For a flow this metric is the validation loss for a stable checkpoint, the loss plus a small bounded penalty for a marginally-unstable-but-usable one, and a large sentinel for a pathological (collapsed / non-invertible) one. Minimising it therefore picks the best-loss usable checkpoint and never a pathological one. This is the deployed model by default.
best_loss – minimum raw validation loss (ignores stability). Early stopping monitors this metric, so training is not halted prematurely when a sharpening posterior’s inverse transiently trips the strict invertibility gate.
best_stable – minimum validation loss among epochs that pass the strict stability gate; the safe fallback when the selection model is only marginally invertible.
best – a copy of whichever of the above is currently the deployed model.
latest – the most recent epoch’s full state, used to resume training.
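
The relationship between best_selection, best_loss, and best_stable can be illustrated with a minimal sketch. The penalty cap, the sentinel value, and the `instability` score below are hypothetical placeholders; the module’s actual constants and diagnostics are not stated here.

```python
import math

# Hypothetical constants: the real penalty size and sentinel are not documented here.
MARGINAL_PENALTY_CAP = 0.1
PATHOLOGICAL_SENTINEL = 1.0e6


def selection_metric(validation_loss, is_stable, is_usable, instability=0.0):
    """Scalar to minimise when choosing best_selection (illustrative only)."""
    if not is_usable or not math.isfinite(validation_loss):
        # Collapsed / non-invertible epoch: large sentinel, so it is never selected.
        return PATHOLOGICAL_SENTINEL
    if is_stable:
        # Stable epoch: the metric is just the validation loss.
        return validation_loss
    # Marginally unstable but usable: add a small, bounded penalty.
    penalty = MARGINAL_PENALTY_CAP * min(max(instability, 0.0), 1.0)
    return validation_loss + penalty


epochs = [
    {"loss": 1.20, "stable": True, "usable": True, "instability": 0.0},
    {"loss": 1.05, "stable": False, "usable": True, "instability": 0.5},
    {"loss": 0.90, "stable": False, "usable": False, "instability": 1.0},
]
scores = [
    selection_metric(e["loss"], e["stable"], e["usable"], e["instability"])
    for e in epochs
]

# best_selection: minimum of the selection metric.
best_selection_epoch = min(range(len(scores)), key=scores.__getitem__)
# best_loss: minimum raw validation loss, ignoring stability.
best_loss_epoch = min(range(len(epochs)), key=lambda i: epochs[i]["loss"])
# best_stable: minimum validation loss among strictly stable epochs.
stable_epochs = [i for i, e in enumerate(epochs) if e["stable"]]
best_stable_epoch = min(stable_epochs, key=lambda i: epochs[i]["loss"])
```

In this toy history the three kinds point at three different epochs: the pathological epoch has the lowest raw loss, the marginal epoch wins the selection metric, and only the first epoch passes the strict gate.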

Data path & performance#

Reading scattered (shuffled) rows from HDF5 every epoch is slow, so each split is loaded once, normalised, and pre-stacked into dense in-RAM arrays (_prestack_observation_blocks() → PreStackedObservations); per-epoch shuffling is then just a NumPy index permutation. The train/eval steps are JIT-compiled once and reused. When data_parallel is set and multiple devices are present, batches are padded to a fixed size and sharded across a device mesh (GSPMD inserts the collectives). Apart from the genuinely scientific parts (the gradient step, LR schedule, checkpoint selection, and early-stopping decision), everything here is result-invariant infrastructure: it changes speed and memory, not the optimisation outcome.
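
A minimal sketch of the pre-stacked data path, assuming the split already sits in dense NumPy arrays. The helper name `iter_padded_batches`, the choice of padding rows, and the validity mask are illustrative assumptions; how the trainer actually pads and masks is not specified here.

```python
import numpy as np


def iter_padded_batches(features, targets, batch_size, rng=None):
    """Yield fixed-size (features, targets, mask) batches from pre-stacked arrays."""
    n_rows = features.shape[0]
    # Shuffling is only an index permutation; the dense arrays are never copied whole.
    order = rng.permutation(n_rows) if rng is not None else np.arange(n_rows)
    for start in range(0, n_rows, batch_size):
        index = order[start:start + batch_size]
        n_valid = index.shape[0]
        n_pad = batch_size - n_valid
        if n_pad:
            # Assumption: pad the short final batch by repeating its first row.
            index = np.concatenate([index, np.full(n_pad, index[0])])
        # The mask marks real rows so padded rows can be excluded from the loss.
        mask = np.arange(batch_size) < n_valid
        yield features[index], targets[index], mask


features = np.arange(10, dtype=np.float32).reshape(10, 1)
targets = features * 2.0
rng = np.random.default_rng(0)
batches = list(iter_padded_batches(features, targets, batch_size=4, rng=rng))
```

Keeping every batch the same shape is what lets a JIT-compiled step be traced once and reused, and it makes the leading axis divisible in a predictable way when batches are sharded across devices.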

Classes#

`EarlyStoppingConfig`	Early-stopping policy for SBI training.
`TrainingSchemeConfig`	Internal optimisation configuration consumed by `SBITrainer`.
`CheckpointBackend`	Persistence backend for trainer checkpoints.
`EquinoxCheckpointBackend`	Checkpoint backend backed by Equinox tree serialization.
`OrbaxCheckpointBackend`	Checkpoint backend that uses Orbax when available.
`SBITrainer`	Reusable optimisation loop for amortized SBI posteriors.

Functions#

`resolve_checkpoint_backend`(→ CheckpointBackend)	Resolve the requested checkpoint backend with graceful fallback.
`load_trainer_checkpoint_state`(→ tuple[Any, dict[str, ...)	Load one persisted trainer checkpoint state and its metadata.

Module Contents#

class petitRADTRANS.sbi.training.EarlyStoppingConfig#

Early-stopping policy for SBI training.

Attributes#

patience:: Number of consecutive non-improving epochs tolerated before stopping.
min_delta:: Minimum decrease in the monitored metric that counts as an improvement.

patience: int#

min_delta: float = 0.0#
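
How patience and min_delta interact can be sketched as follows. `EarlyStoppingSketch` and `stopping_epoch` are local stand-ins written for this example, not the library’s classes, and the exact comparison used by the trainer is an assumption.

```python
from dataclasses import dataclass


@dataclass
class EarlyStoppingSketch:
    """Stand-in mirroring the two documented fields."""
    patience: int
    min_delta: float = 0.0


def stopping_epoch(losses, policy):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    patience_counter = 0
    for epoch, loss in enumerate(losses):
        if loss < best - policy.min_delta:
            # Improvement of more than min_delta: record it and reset the counter.
            best = loss
            patience_counter = 0
        else:
            patience_counter += 1
        if patience_counter >= policy.patience:
            return epoch
    return None


losses = [1.00, 0.80, 0.79, 0.795, 0.78, 0.78]
stop_at = stopping_epoch(losses, EarlyStoppingSketch(patience=2, min_delta=0.05))
never = stopping_epoch(losses, EarlyStoppingSketch(patience=2))
```

With min_delta=0.05 the small gains after the second epoch do not count as improvements, so two non-improving epochs accumulate and the loop stops; with the default min_delta the same history runs to completion.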

class petitRADTRANS.sbi.training.TrainingSchemeConfig#

Internal optimisation configuration consumed by SBITrainer.

The public, grouped petitRADTRANS.sbi.config.TrainingConfig is translated into this dataclass at fit time; end users normally configure the former.

Attributes#

learning_rate:: Peak AdamW learning rate.
batch_size:: Number of samples per optimisation step.
num_epochs:: Maximum number of passes over the training split.
parameter_space:: Coordinate space the parameters are presented in ('unconstrained', 'cube', …); passed through to the loss via the batch metadata.
seed:: Base RNG seed; per-epoch shuffles and diagnostic sampling derive from it.
shuffle_train:: Whether to permute the training split each epoch.
early_stopping:: Optional EarlyStoppingConfig; None disables early stopping.
checkpoint_directory:: Directory for the trainer’s checkpoint subdirectories; None disables checkpointing.
checkpoint_backend:: Backend name ('auto' / 'equinox' / 'orbax').
resume_from_checkpoint:: Whether to resume from the latest checkpoint (restoring model, optimiser state, history, and best-checkpoint bookkeeping).
data_parallel:: Whether to shard batches across all available devices.
gradient_clip_norm:: Global gradient-norm clip value; None disables clipping.
weight_decay:: AdamW weight decay; 0 uses plain Adam.
use_cosine_schedule:: Enable a warmup + cosine-decay learning-rate schedule.
warmup_fraction:: Fraction of the schedule spent warming up (used when warmup_epochs is None).
warmup_epochs:: Absolute warmup length in epochs; overrides warmup_fraction when set.
min_learning_rate:: Final learning rate of the cosine decay.
lr_schedule_total_epochs:: Horizon (in epochs) over which the schedule decays; defaults to num_epochs.
verbose_diagnostics:: Whether to write the per-epoch diagnostic metrics JSON and plots.
diagnostics_output_directory:: Destination for verbose diagnostic artifacts; defaults to the checkpoint directory’s parent.
diagnostics_plot_interval:: Epoch interval between verbose diagnostic plots.
n_validation_diagnostic_batches:: Number of validation batches sampled per epoch for the (expensive) per-checkpoint diagnostics.

learning_rate: float = 0.001#

batch_size: int = 32#

num_epochs: int = 10#

parameter_space: str = 'unconstrained'#

seed: int = 0#

shuffle_train: bool = True#

early_stopping: EarlyStoppingConfig | None = None#

checkpoint_directory: str | None = None#

checkpoint_backend: str = 'auto'#

resume_from_checkpoint: bool = False#

data_parallel: bool = False#

gradient_clip_norm: float | None = 1.0#

weight_decay: float = 0.0#

use_cosine_schedule: bool = False#

warmup_fraction: float = 0.02#

warmup_epochs: float | None = None#

min_learning_rate: float = 1e-06#

lr_schedule_total_epochs: float | None = None#

verbose_diagnostics: bool = False#

diagnostics_output_directory: str | None = None#

diagnostics_plot_interval: int = 1#

n_validation_diagnostic_batches: int = 4#
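
The schedule fields above can be read together as in the following sketch. It assumes a linear warmup from zero and evaluates the rate per (fractional) epoch; the trainer itself presumably works in optimiser steps, and `learning_rate_at` is a hypothetical helper, not part of the module.

```python
import math


def learning_rate_at(epoch, *, peak=1e-3, floor=1e-6, total_epochs=10.0,
                     warmup_fraction=0.02, warmup_epochs=None):
    """Illustrative warmup + cosine-decay learning rate at a fractional epoch."""
    # warmup_epochs, when set, overrides warmup_fraction (as documented above).
    if warmup_epochs is not None:
        warmup = warmup_epochs
    else:
        warmup = warmup_fraction * total_epochs
    if warmup > 0 and epoch < warmup:
        # Assumption: linear ramp from 0 up to the peak learning rate.
        return peak * epoch / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    span = max(total_epochs - warmup, 1e-12)
    progress = min(max((epoch - warmup) / span, 0.0), 1.0)
    # Cosine decay from peak down to floor (min_learning_rate).
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


start = learning_rate_at(0.0, warmup_epochs=1.0)
at_peak = learning_rate_at(1.0, warmup_epochs=1.0)
halfway = learning_rate_at(5.5, warmup_epochs=1.0)
end = learning_rate_at(10.0)
```

Here `peak`, `floor`, and `total_epochs` play the roles of learning_rate, min_learning_rate, and lr_schedule_total_epochs respectively.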

class petitRADTRANS.sbi.training.CheckpointBackend#

Bases: abc.ABC

Persistence backend for trainer checkpoints.

name: str#

abstractmethod save(state: Any, output_directory: pathlib.Path, metadata: dict[str, Any]) → None#

Persist checkpoint state and metadata.

Parameters#

state:: Serializable optimizer/model state payload.
output_directory:: Directory that should receive the checkpoint files.
metadata:: Small JSON-serializable metadata dictionary written next to the checkpoint payload.

Returns#

None

abstractmethod restore(template_state: Any, output_directory: pathlib.Path) → tuple[Any, dict[str, Any]]#

Restore persisted checkpoint state into the supplied template.

Parameters#

template_state:: Structure used by some backends to describe the expected state layout during restoration.
output_directory:: Directory containing the serialized checkpoint payload.

Returns#

tuple[Any, dict[str, Any]]: Restored state object together with the deserialized checkpoint metadata dictionary.
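
As an illustration of the save/restore contract, the sketch below implements it against a local stand-in ABC with the same method signatures. `BackendInterface`, `PickleBackend`, and the file names `state.pkl` and `metadata.json` are assumptions made for this example; they are not the library’s classes or on-disk layout.

```python
import abc
import json
import pickle
import tempfile
from pathlib import Path
from typing import Any


class BackendInterface(abc.ABC):
    """Local stand-in with the documented save/restore signatures."""

    name: str

    @abc.abstractmethod
    def save(self, state: Any, output_directory: Path,
             metadata: dict[str, Any]) -> None: ...

    @abc.abstractmethod
    def restore(self, template_state: Any,
                output_directory: Path) -> tuple[Any, dict[str, Any]]: ...


class PickleBackend(BackendInterface):
    """Toy backend: pickled payload plus a JSON metadata file next to it."""

    name = "pickle-sketch"

    def save(self, state, output_directory, metadata):
        output_directory.mkdir(parents=True, exist_ok=True)
        (output_directory / "state.pkl").write_bytes(pickle.dumps(state))
        (output_directory / "metadata.json").write_text(json.dumps(metadata))

    def restore(self, template_state, output_directory):
        # This toy backend does not need the template; some backends do.
        state = pickle.loads((output_directory / "state.pkl").read_bytes())
        metadata = json.loads((output_directory / "metadata.json").read_text())
        return state, metadata


with tempfile.TemporaryDirectory() as tmp:
    backend = PickleBackend()
    target = Path(tmp) / "latest"
    backend.save({"epoch": 3, "weights": [0.1, 0.2]}, target,
                 {"backend": backend.name, "epoch": 3})
    restored_state, restored_metadata = backend.restore(None, target)
```

Recording the backend name in the metadata, as done here, is what allows a later restore to pick the matching backend.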

class petitRADTRANS.sbi.training.EquinoxCheckpointBackend#

Bases: CheckpointBackend

Checkpoint backend backed by Equinox tree serialization.

name = 'equinox'#

save(state: Any, output_directory: pathlib.Path, metadata: dict[str, Any]) → None#

Serialize checkpoint state with dill and save JSON metadata.

restore(template_state: Any, output_directory: pathlib.Path) → tuple[Any, dict[str, Any]]#

Load a dill-serialized checkpoint and its metadata.

class petitRADTRANS.sbi.training.OrbaxCheckpointBackend#

Bases: CheckpointBackend

Checkpoint backend that uses Orbax when available.

name = 'orbax'#

_ocp#

_checkpointer#

save(state: Any, output_directory: pathlib.Path, metadata: dict[str, Any]) → None#

Persist checkpoint state with Orbax and write JSON metadata.

restore(template_state: Any, output_directory: pathlib.Path) → tuple[Any, dict[str, Any]]#

Restore Orbax checkpoint state and metadata.

petitRADTRANS.sbi.training.resolve_checkpoint_backend(preferred: str = 'auto') → CheckpointBackend#

Resolve the requested checkpoint backend with graceful fallback.

Parameters#

preferred:: Backend name. Supported values are 'auto', 'equinox', and 'orbax'.

Returns#

CheckpointBackend: Concrete checkpoint backend implementation.

Notes#

'auto' prefers Orbax when installed and otherwise falls back to the dill-based Equinox backend.
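
The 'auto' rule can be sketched as below. `pick_backend_name` is a hypothetical helper that only returns a name; the real function returns a CheckpointBackend instance, and its behaviour when 'orbax' is requested explicitly but not installed is not covered by this sketch.

```python
import importlib.util


def pick_backend_name(preferred="auto", is_available=None):
    """Return 'orbax' or 'equinox' following the documented 'auto' preference."""
    if is_available is None:
        # Default probe: is the package importable in this environment?
        def is_available(module):
            return importlib.util.find_spec(module) is not None
    if preferred not in {"auto", "equinox", "orbax"}:
        raise ValueError(f"unknown checkpoint backend: {preferred!r}")
    if preferred == "auto":
        # Prefer Orbax when installed, otherwise fall back to Equinox.
        return "orbax" if is_available("orbax") else "equinox"
    return preferred


without_orbax = pick_backend_name("auto", is_available=lambda module: False)
with_orbax = pick_backend_name("auto", is_available=lambda module: True)
explicit = pick_backend_name("equinox")
```

The availability probe is injectable only so the example can be exercised without either package installed.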

petitRADTRANS.sbi.training.load_trainer_checkpoint_state(checkpoint_directory: str | pathlib.Path, checkpoint_kind: str = 'best') → tuple[Any, dict[str, Any]] | None#

Load one persisted trainer checkpoint state and its metadata.

Parameters#

checkpoint_directory:: Root directory containing trainer checkpoint subdirectories such as best_selection and best_loss.
checkpoint_kind:: Checkpoint subdirectory to restore.

Returns#

tuple[Any, dict[str, Any]] | None: Restored state payload and metadata, or None when the requested checkpoint does not exist.

class petitRADTRANS.sbi.training.SBITrainer(config: TrainingSchemeConfig)#

Reusable optimisation loop for amortized SBI posteriors.

Configured once with a TrainingSchemeConfig and reused via fit(), which is given the model, a dataset reader, and the estimator’s loss/diagnostic/selection callbacks. The trainer owns the AdamW optimiser and schedule, the JIT-compiled train/eval steps, optional multi-device sharding, the per-epoch validation + diagnostics pass, the five checkpoint kinds (see the module docstring), and early stopping. Checkpointing is enabled only when the config provides a checkpoint_directory.

config#

checkpoint_directory = None#

checkpoint_backend#

_checkpoint_backend_fallback_reason: str | None = None#

property _latest_checkpoint_directory: pathlib.Path | None#

Path of the latest (resume) checkpoint, or None if disabled.

property _best_checkpoint_directory: pathlib.Path | None#

Path of the best (deployed-model) checkpoint, or None if disabled.

_checkpoint_directory_for_kind(checkpoint_kind: str) → pathlib.Path | None#

Resolve the subdirectory for one checkpoint kind, or None when no checkpoint directory is configured.

_save_checkpoint(checkpoint_kind: str, state: dict[str, Any], metadata: dict[str, Any]) → None#

Persist one checkpoint kind, transparently downgrading Orbax→Equinox.

No-op when checkpointing is disabled. If the Orbax backend rejects a leaf type mid-run (_is_orbax_unsupported_type_error()), the trainer permanently switches to the Equinox backend, records the fallback reason in the metadata, and retries the save.
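
The downgrade-and-retry behaviour follows a general pattern that can be sketched without either backend. `UnsupportedLeafError`, `FallbackSaver`, and the `fallback_reason` metadata key are hypothetical names for this example; the trainer’s own error detection and metadata keys are not shown here.

```python
class UnsupportedLeafError(TypeError):
    """Hypothetical stand-in for a primary backend rejecting a leaf type."""


class FallbackSaver:
    """Try the active save callable; on rejection switch permanently and retry."""

    def __init__(self, primary, fallback):
        self.active = primary
        self.fallback = fallback
        self.fallback_reason = None

    def _annotate(self, metadata):
        # Once a fallback has happened, every checkpoint records why.
        if self.fallback_reason is None:
            return dict(metadata)
        return {**metadata, "fallback_reason": self.fallback_reason}

    def save(self, state, metadata):
        try:
            self.active(state, self._annotate(metadata))
        except UnsupportedLeafError as error:
            if self.active is self.fallback:
                raise  # already downgraded; nothing left to fall back to
            self.active = self.fallback
            self.fallback_reason = str(error)
            self.active(state, self._annotate(metadata))


written = []


def strict_save(state, metadata):
    # Toy primary backend: refuses callables as leaves.
    if any(callable(leaf) for leaf in state.values()):
        raise UnsupportedLeafError("callable leaf is not supported")
    written.append(("strict", metadata))


def lenient_save(state, metadata):
    written.append(("lenient", metadata))


saver = FallbackSaver(strict_save, lenient_save)
saver.save({"weights": [1.0], "activation": abs}, {"epoch": 1})
saver.save({"weights": [2.0]}, {"epoch": 2})
```

The second save goes straight to the lenient backend even though its state would have been accepted by the strict one, which is what "permanently switches" means in practice: one run never mixes backends after the first rejection.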

_restore_checkpoint(template_state: dict[str, Any], checkpoint_kind: str) → tuple[dict[str, Any], dict[str, Any]] | None#

Restore one checkpoint kind, or None if it is missing/disabled.

Honours the backend name recorded in the checkpoint’s metadata (so a checkpoint written by Equinox after an Orbax fallback restores correctly), and returns the (state, metadata) pair from that backend.

fit(model: Any, dataset: Any, loss_fn: Callable[[Any, Any], Any], eval_loss_fn: Callable[[Any, Any], Any] | None = None, eval_diagnostic_fn: Callable[[Any, Any], Mapping[str, float]] | None = None, selection_metric_fn: Callable[[float, Mapping[str, float] | None], float] | None = None, selection_metric_name: str | None = None, stability_metric_fn: Callable[[float, Mapping[str, float] | None], Mapping[str, float]] | None = None, stability_flag_key: str = 'checkpoint_is_stable', loss_conditioning_schedule: Callable[[int, int], float] | None = None) → tuple[Any, dict[str, list[float]], dict[str, list[float]], dict[str, Any]]#

Optimise one posterior model against a dataset reader.

Runs the full epoch loop: (optionally resume from latest) → per-epoch train pass → validation loss + diagnostics → selection/stability scoring → checkpoint writes → early-stopping check, returning the selected (“best”) model and the histories/metadata.

Parameters#

model:: Trainable Equinox-style model to optimise (e.g. the encoder+flow container).
dataset:: Reader exposing iter_batches(split=..., batch_size=..., ...) and a dataset.manifest.splits mapping with 'train' / 'validation' counts.
loss_fn:: Training loss. Called as (model, batch) by default, or (model, batch, cond) when loss_conditioning_schedule is given.
eval_loss_fn:: Validation loss; defaults to loss_fn. Always evaluated at conditioning 1.0 so checkpoint selection tracks the untempered objective.
eval_diagnostic_fn:: Optional (model, batch) -> {metric: value} evaluated on a few validation batches per epoch; results are aggregated (_aggregate_diagnostic_metrics()) into the epoch metrics (e.g. the flow invertibility / collapse diagnostics).
selection_metric_fn:: Optional (validation_loss, epoch_metrics) -> float producing the scalar minimised to pick best_selection. When None the raw validation loss is used.
selection_metric_name:: Name under which the selection metric is logged/stored; defaults to 'validation_selection_metric' (or 'train_loss' with no validation split).
stability_metric_fn:: Optional (validation_loss, epoch_metrics) -> {metric: value} adding stability flags (notably checkpoint_is_stable) to the epoch metrics; drives the best_stable checkpoint and the per-epoch stable flag.
stability_flag_key:: Key in the epoch metrics that carries the stability flag; a value >= 0.5 marks the epoch as stable.
loss_conditioning_schedule:: Optional (epoch, num_epochs) -> float feeding a per-epoch scalar to the training loss (e.g. likelihood-temperature beta for VI annealing); evaluation always uses 1.0.

Returns#

tuple[Any, dict[str, list[float]], dict[str, list[float]], dict[str, Any]]

(best_model, train_history, validation_metrics_history, metadata):

best_model – the deployed (best) model.
train_history – {'train_loss': [...]} per epoch.
validation_metrics_history – per-key lists for every validation metric (loss, selection metric, diagnostics, stability flags).
metadata – the best/selection/loss/stable epochs and metrics, the selected best_checkpoint_kind, early-stopping/resume flags, step counts, and the checkpoint backend used.

Notes#

With no validation split the trainer selects on training loss. The five checkpoint kinds and the selection-metric semantics are described in the module docstring. Training/validation data are pre-loaded and pre-stacked into RAM where possible (falling back to streaming from disk otherwise), and batches are sharded across devices when data_parallel is set.
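
The callback signatures accepted by fit() can be illustrated with small stand-alone functions. The diagnostic key `round_trip_error`, its tolerance, the linear ramp, and the 0-based epoch convention are all assumptions made for this sketch; only the call signatures and the >= 0.5 threshold come from the parameter descriptions above.

```python
import math

# Hypothetical diagnostic key and tolerance, for illustration only.
ROUND_TRIP_KEY = "round_trip_error"
ROUND_TRIP_TOLERANCE = 1e-3


def stability_metric_fn(validation_loss, epoch_metrics):
    """(validation_loss, epoch_metrics) -> {metric: value} with a stability flag."""
    error = (epoch_metrics or {}).get(ROUND_TRIP_KEY, math.inf)
    stable = math.isfinite(validation_loss) and error <= ROUND_TRIP_TOLERANCE
    return {"checkpoint_is_stable": 1.0 if stable else 0.0}


def loss_conditioning_schedule(epoch, num_epochs):
    """(epoch, num_epochs) -> float: linear ramp to 1.0 over the first half.

    Assumes a 0-based epoch index; the trainer's convention may differ.
    """
    ramp_epochs = max(num_epochs // 2, 1)
    return min(1.0, (epoch + 1) / ramp_epochs)


def is_epoch_stable(epoch_metrics, stability_flag_key="checkpoint_is_stable"):
    """Apply the documented threshold: a flag value >= 0.5 marks the epoch stable."""
    return epoch_metrics.get(stability_flag_key, 0.0) >= 0.5


good = stability_metric_fn(1.0, {"round_trip_error": 1e-4})
bad = stability_metric_fn(1.0, None)
ramp = [loss_conditioning_schedule(epoch, 10) for epoch in range(10)]
```

Functions like these would be passed as `stability_metric_fn=` and `loss_conditioning_schedule=`; when the latter is supplied, loss_fn must accept the conditioning value as a third argument, while evaluation still runs at 1.0.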

property _min_delta: float#

Minimum-improvement threshold for best-checkpoint comparisons (0 when early stopping is not configured).

_should_stop(patience_counter: int) → bool#

Whether training should stop: True once the non-improving-epoch count reaches the configured early-stopping patience.