3. Training, Logging, Checkpoints
AI4Plasma training utilities show up in multiple places:
ai4plasma.core.model.BaseModel: minimal base wrapper
ai4plasma.operator.*Model: operator-learning training wrappers
ai4plasma.piml.pinn.PINN: multi-term physics loss and callbacks
3.1. Training components
Most training loops share the same building blocks:
Model: network + physics or operator wrapper
Optimizer: Adam/AdamW commonly used for PINNs and operators
Scheduler: optional learning rate schedule for long runs
Data: supervised datasets or collocation points
Callbacks: visualization and diagnostics
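To make these building blocks concrete, here is a minimal sketch in plain PyTorch. It does not use the AI4Plasma wrappers: the network, the toy target y = sin(pi x), and the record callback are all illustrative assumptions, chosen only to show how the pieces fit together in one loop.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)

# Model: a small fully connected network (stand-in for an AI4Plasma network).
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

# Optimizer and optional scheduler (learning rate drops by 10x after step 150).
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150], gamma=0.1)

# Data: supervised samples of a toy target, y = sin(pi * x).
x = torch.linspace(0.0, 1.0, 64).unsqueeze(1)
y = torch.sin(math.pi * x)

# Callback stand-in: collect (epoch, loss, lr) for later diagnostics.
history = []

def record(epoch, loss, lr):
    history.append((epoch, loss, lr))

for epoch in range(200):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
    record(epoch, loss.item(), optimizer.param_groups[0]["lr"])
```

In a real run, the callback would typically write to TensorBoard or produce figures instead of appending to a list.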
3.2. TensorBoard
Many training loops accept tensorboard_logdir=... and will write event files.
Run TensorBoard from the repository root:
tensorboard --logdir app
Common log directories:
app/operator/deeponet/runs/...
app/piml/cs_pinn/runs/...
app/piml/rk_pinn/runs/...
3.2.1. What to log
Recommended scalars:
total loss and per-term loss
learning rate
validation error (if available)
physical diagnostics (e.g., conductance, radiation power, max/min temperature)
Recommended figures:
solution profiles and contours
residual maps or error heatmaps
training curves (loss, error)
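As a rough illustration of the scalar recommendations above, the following sketch writes per-term losses and the learning rate with PyTorch's SummaryWriter. It assumes the tensorboard package is installed. The tag names, the temporary log directory, and the placeholder loss values are illustrative assumptions and are not taken from AI4Plasma.

```python
import math
import tempfile
from pathlib import Path

from torch.utils.tensorboard import SummaryWriter

# Hypothetical log directory; a real run would use a path under app/.../runs/.
logdir = Path(tempfile.mkdtemp()) / "runs" / "demo"
writer = SummaryWriter(log_dir=str(logdir))

for epoch in range(5):
    # Placeholder values standing in for real loss terms.
    loss_domain = math.exp(-epoch)
    loss_boundary = 0.1 * math.exp(-epoch)

    # Grouping tags under "loss/" keeps related curves together in the UI.
    writer.add_scalar("loss/total", loss_domain + loss_boundary, epoch)
    writer.add_scalar("loss/domain", loss_domain, epoch)
    writer.add_scalar("loss/boundary", loss_boundary, epoch)
    writer.add_scalar("lr", 1e-3, epoch)

writer.close()

# Event files are what `tensorboard --logdir ...` reads.
event_files = list(logdir.glob("events.out.tfevents.*"))
```

Figures can be logged in the same way with writer.add_figure, which accepts a matplotlib figure.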
3.3. Checkpointing
Typical options (model-specific):
checkpoint_dir: directory for saving periodic checkpoints
checkpoint_freq: save every N epochs
resume_from: resume training from a checkpoint file
A checkpoint usually contains:
model weights
optimizer state
current epoch
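A checkpoint with these three parts can be sketched in plain PyTorch as below. The dictionary keys and the file name are assumptions for illustration; the layout that AI4Plasma actually writes may differ, so inspect a real checkpoint before relying on specific keys.

```python
import tempfile
from pathlib import Path

import torch
from torch import nn

net = nn.Linear(2, 1)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

# Hypothetical file name and key names.
ckpt_path = Path(tempfile.mkdtemp()) / "checkpoint_epoch_100.pt"
torch.save(
    {
        "model_state_dict": net.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "epoch": 100,
    },
    ckpt_path,
)

# Restore into freshly constructed objects with the same architecture.
restored_net = nn.Linear(2, 1)
restored_opt = torch.optim.Adam(restored_net.parameters(), lr=1e-3)
state = torch.load(ckpt_path, map_location="cpu")
restored_net.load_state_dict(state["model_state_dict"])
restored_opt.load_state_dict(state["optimizer_state_dict"])
start_epoch = state["epoch"] + 1  # continue with the next epoch
```

Saving the optimizer state matters for Adam-style optimizers, because their moment estimates would otherwise restart from zero on resume.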
3.3.1. When to save
Save frequently in long runs (every 100-1000 epochs)
Save more frequently during hyperparameter exploration
Always save the best validation checkpoint when a validation metric is available
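The saving policy above can be expressed as a small decision helper. This is a sketch only: checkpoint_actions is a hypothetical function, not part of AI4Plasma, and the validation errors are made-up numbers.

```python
def checkpoint_actions(epoch, val_error, best_val_error, checkpoint_freq):
    """Decide which checkpoints to write at this epoch.

    Returns (save_periodic, save_best, new_best_val_error).
    val_error may be None when no validation metric was computed.
    """
    save_periodic = checkpoint_freq > 0 and epoch % checkpoint_freq == 0
    save_best = val_error is not None and val_error < best_val_error
    new_best = val_error if save_best else best_val_error
    return save_periodic, save_best, new_best


best = float("inf")
log = []
# Made-up validation errors, available only every 100 epochs.
val_errors = {100: 0.5, 200: 0.3, 300: 0.4}

for epoch in range(1, 301):
    periodic, is_best, best = checkpoint_actions(
        epoch, val_errors.get(epoch), best, checkpoint_freq=100
    )
    if periodic or is_best:
        log.append((epoch, periodic, is_best))
```

Keeping the best checkpoint in a separate file from the periodic ones prevents a later, worse epoch from overwriting it.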
3.4. Reproducibility
Use set_seed(...) from ai4plasma.utils.common and record:
seed
device
precision
code version
Also consider recording:
git commit or tag
dataset version and preprocessing
training configuration (hyperparameters)
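One way to capture these items is a small JSON record written next to the logs. The sketch below uses only the standard library; build_run_record and its field names are hypothetical, and the git lookup returns None when the code is not running inside a git repository.

```python
import json
import platform
import subprocess
import sys


def git_commit():
    """Return the current git commit hash, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def build_run_record(seed, device, precision, config):
    """Collect the settings needed to reproduce a training run."""
    return {
        "seed": seed,
        "device": device,
        "precision": precision,
        "git_commit": git_commit(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "config": config,
    }


record = build_run_record(2023, "cpu", "float32", {"lr": 1e-3, "num_epochs": 1000})
text = json.dumps(record, indent=2, sort_keys=True)
```

Writing text to a file in the run directory keeps the record together with the TensorBoard events and checkpoints it describes.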
3.5. Practical tips for PINN training
Start with a conservative learning rate (e.g. 1e-4 to 1e-3).
Use a scheduler (MultiStepLR or ReduceLROnPlateau) for long runs.
Balance loss term weights (domain vs boundary) when residual magnitudes differ.
Log intermediate physical quantities (conductivity, radiation terms) to catch non-physical regimes early.
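The loss-balancing tip can be illustrated with a toy problem in plain PyTorch: u''(x) = -pi^2 sin(pi x) on (0, 1) with u(0) = u(1) = 0. The problem, the network, and the weights w_domain and w_boundary are assumptions chosen for illustration; AI4Plasma's PINN class manages its loss terms through its own interface.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

# Collocation points inside the domain and the two boundary points.
x_domain = torch.rand(128, 1, requires_grad=True)
x_boundary = torch.tensor([[0.0], [1.0]])


def loss_terms():
    """Return (domain residual loss, boundary loss) as separate scalars."""
    u = net(x_domain)
    du = torch.autograd.grad(u, x_domain, torch.ones_like(u), create_graph=True)[0]
    d2u = torch.autograd.grad(du, x_domain, torch.ones_like(du), create_graph=True)[0]
    residual = d2u + math.pi**2 * torch.sin(math.pi * x_domain)
    return residual.pow(2).mean(), net(x_boundary).pow(2).mean()


# The raw residual is far larger than the boundary error here, so the
# boundary term is up-weighted. The values are illustrative, not tuned.
w_domain, w_boundary = 1.0, 100.0

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
totals = []
for epoch in range(200):
    optimizer.zero_grad()
    l_dom, l_bnd = loss_terms()
    total = w_domain * l_dom + w_boundary * l_bnd
    total.backward()
    optimizer.step()
    totals.append(total.item())
```

Logging l_dom and l_bnd separately, rather than only the total, is what makes it possible to see which term dominates and to adjust the weights.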
3.6. Example: basic training skeleton
from ai4plasma.utils.common import set_seed
from ai4plasma.core.network import FNN
set_seed(2023)
net = FNN([2, 64, 64, 64, 1])
# Create model wrapper, define equation terms, then train
# model.train(num_epochs=..., lr=...)
3.7. Troubleshooting
Loss does not decrease: reduce learning rate, increase collocation points, or check derivative computations.
Boundary drift: increase boundary term weight or sample more boundary points.
Exploding gradients: use gradient clipping or smooth loss functions, and check property interpolation ranges.
Slow training: reduce batch size, simplify network, or use mixed precision where appropriate.
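For the exploding-gradient case, gradient clipping in plain PyTorch looks roughly like the sketch below. The badly scaled inputs are contrived to produce a large gradient, and max_norm = 1.0 is an arbitrary illustrative threshold. Whether the AI4Plasma training methods expose a clipping option is not covered here.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Linear(4, 1)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

# Deliberately badly scaled inputs to provoke a very large gradient.
x = 1e3 * torch.randn(16, 4)
y = torch.randn(16, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(net(x), y)
loss.backward()

# clip_grad_norm_ rescales gradients in place and returns the norm
# measured before clipping, which is itself a useful scalar to log.
max_norm = 1.0
norm_before = torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in net.parameters()))

optimizer.step()
```

Clipping must happen after backward() and before optimizer.step(), otherwise the update still uses the unclipped gradients.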
3.8. Checkpoint resumption workflow
Identify the latest checkpoint file in checkpoint_dir.
Pass resume_from=... to the model training method.
Verify logs continue from the expected epoch in TensorBoard.
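The first step can be automated with a short helper. The sketch below assumes checkpoint file names end in epoch_<N>.pt, which is a hypothetical naming scheme; adjust the pattern to match the files your run actually produces. Epochs are compared as integers, because plain string sorting would rank epoch_900 after epoch_1000.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(checkpoint_dir, pattern=r"epoch_(\d+)\.pt$"):
    """Return (path, epoch) of the highest-epoch checkpoint, or (None, -1)."""
    best_epoch, best_path = -1, None
    for path in Path(checkpoint_dir).iterdir():
        match = re.search(pattern, path.name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path, best_epoch


# Demo with empty placeholder files in a temporary directory.
demo_dir = Path(tempfile.mkdtemp())
for n in (100, 900, 1000):
    (demo_dir / f"checkpoint_epoch_{n}.pt").touch()

path, epoch = latest_checkpoint(demo_dir)

# The result would then be passed on, for example:
# model.train(..., resume_from=str(path))
```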
3.9. Next steps
Operator training details: guides/operator.md
PINN workflows: guides/piml.md
Example scripts: examples/index.md