3. Training, Logging, Checkpoints

AI4Plasma training utilities show up in multiple places:

  • ai4plasma.core.model.BaseModel: minimal base wrapper

  • ai4plasma.operator.*Model: operator-learning training wrappers

  • ai4plasma.piml.pinn.PINN: multi-term physics loss and callbacks

3.1. Training components

Most training loops share the same building blocks:

  • Model: network + physics or operator wrapper

  • Optimizer: Adam/AdamW commonly used for PINNs and operators

  • Scheduler: optional learning rate schedule for long runs

  • Data: supervised datasets or collocation points

  • Callbacks: visualization and diagnostics

3.2. TensorBoard

Many training loops accept tensorboard_logdir=... and will write event files.

Run TensorBoard from the repository root:

tensorboard --logdir app

Common log directories:

  • app/operator/deeponet/runs/...

  • app/piml/cs_pinn/runs/...

  • app/piml/rk_pinn/runs/...

3.2.1. What to log

Recommended scalars:

  • total loss and per-term loss

  • learning rate

  • validation error (if available)

  • physical diagnostics (e.g., conductance, radiation power, max/min temperature)

Recommended figures:

  • solution profiles and contours

  • residual maps or error heatmaps

  • training curves (loss, error)

3.3. Checkpointing

Typical options (model-specific):

  • checkpoint_dir: directory for saving periodic checkpoints

  • checkpoint_freq: save every N epochs

  • resume_from: resume training from a checkpoint file

A checkpoint usually contains:

  • model weights

  • optimizer state

  • current epoch

3.3.1. When to save

  • Save frequently in long runs (every 100-1000 epochs)

  • Save more frequently during hyperparameter exploration

  • Always save the best validation checkpoint when a validation metric is available

3.4. Reproducibility

Use set_seed(...) from ai4plasma.utils.common and record:

  • seed

  • device

  • precision

  • code version

Also consider recording:

  • git commit or tag

  • dataset version and preprocessing

  • training configuration (hyperparameters)

3.5. Practical tips for PINN training

  • Start with a conservative learning rate (e.g. 1e-4 to 1e-3).

  • Use a scheduler (MultiStepLR or ReduceLROnPlateau) for long runs.

  • Balance loss term weights (domain vs boundary) when residual magnitudes differ.

  • Log intermediate physical quantities (conductivity, radiation terms) to catch non-physical regimes early.

3.6. Example: basic training skeleton

from ai4plasma.utils.common import set_seed
from ai4plasma.core.network import FNN

set_seed(2023)

net = FNN([2, 64, 64, 64, 1])
# Create model wrapper, define equation terms, then train
# model.train(num_epochs=..., lr=...)

3.7. Troubleshooting

  • Loss does not decrease: reduce learning rate, increase collocation points, or check derivative computations.

  • Boundary drift: increase boundary term weight or sample more boundary points.

  • Exploding gradients: use gradient clipping or smooth loss functions, and check property interpolation ranges.

  • Slow training: reduce batch size, simplify network, or use mixed precision where appropriate.

3.8. Checkpoint resumption workflow

  1. Identify the latest checkpoint file in checkpoint_dir.

  2. Pass resume_from=... to the model training method.

  3. Verify logs continue from the expected epoch in TensorBoard.

3.9. Next steps