3. Training, Logging, Checkpoints

AI4Plasma's training utilities are spread across several modules:

ai4plasma.core.model.BaseModel: minimal base wrapper
ai4plasma.operator.*Model: operator-learning training wrappers
ai4plasma.piml.pinn.PINN: multi-term physics loss and callbacks

3.1. Training components 

Most training loops share the same building blocks:

Model: network + physics or operator wrapper
Optimizer: Adam or AdamW, both commonly used for PINNs and operator learning
Scheduler: optional learning rate schedule for long runs
Data: supervised datasets or collocation points
Callbacks: visualization and diagnostics
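
The building blocks above can be sketched in plain PyTorch. This is a generic illustration, not the AI4Plasma wrapper API: the network, the toy sin(x) dataset, and the record_loss callback are all assumptions made for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Model: a small fully connected network (stand-in for a wrapped network)
model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

# Optimizer and optional scheduler (halve the learning rate at epoch 100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100], gamma=0.5)

# Data: a toy supervised dataset, y = sin(x)
x = torch.linspace(-3.0, 3.0, 128).unsqueeze(1)
y = torch.sin(x)

history = []  # filled by the callback below


def record_loss(epoch, loss_value):
    """Hypothetical callback: store the loss for later plotting."""
    history.append((epoch, loss_value))


for epoch in range(200):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
    if epoch % 20 == 0:
        record_loss(epoch, loss.item())
```

In a PINN setting the supervised loss would be replaced by a sum of residual and boundary terms evaluated at collocation points; the loop structure stays the same.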

3.2. TensorBoard 

Many training loops accept a tensorboard_logdir=... argument and write TensorBoard event files to that directory.

Run TensorBoard from the repository root:

tensorboard --logdir app

Common log directories:

app/operator/deeponet/runs/...
app/piml/cs_pinn/runs/...
app/piml/rk_pinn/runs/...

3.2.1. What to log 

Recommended scalars:

total loss and per-term loss
learning rate
validation error (if available)
physical diagnostics (e.g., conductance, radiation power, max/min temperature)

Recommended figures:

solution profiles and contours
residual maps or error heatmaps
training curves (loss, error)
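
One way to keep scalar logging consistent is to flatten the quantities above into tagged values before writing them. The sketch below is an assumption about how this could be organized, not part of AI4Plasma: build_scalars, log_scalars, the tag names, and the RecordingWriter stand-in are hypothetical, and the total loss is taken as the plain sum of the terms. Any object with an add_scalar(tag, value, step) method works as the writer, including torch.utils.tensorboard.SummaryWriter.

```python
def build_scalars(loss_terms, lr, val_error=None):
    """Flatten training quantities into TensorBoard-style tags."""
    scalars = {"loss/total": sum(loss_terms.values()), "train/lr": lr}
    for name, value in loss_terms.items():
        scalars[f"loss/{name}"] = value
    if val_error is not None:
        scalars["val/error"] = val_error
    return scalars


def log_scalars(writer, scalars, step):
    """Write each scalar; `writer` needs an add_scalar(tag, value, step) method."""
    for tag, value in sorted(scalars.items()):
        writer.add_scalar(tag, value, step)


class RecordingWriter:
    """Stand-in writer for this sketch; a SummaryWriter can be used instead."""

    def __init__(self):
        self.records = []

    def add_scalar(self, tag, value, step):
        self.records.append((tag, value, step))


writer = RecordingWriter()
scalars = build_scalars({"domain": 0.25, "boundary": 0.5}, lr=1e-3, val_error=0.125)
log_scalars(writer, scalars, step=100)
```

Grouping tags under prefixes such as loss/ and val/ makes TensorBoard place related curves in the same section.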

3.3. Checkpointing 

Typical options (model-specific):

checkpoint_dir: directory for saving periodic checkpoints
checkpoint_freq: save every N epochs
resume_from: resume training from a checkpoint file

A checkpoint usually contains:

model weights
optimizer state
current epoch
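
A checkpoint with these three items can be written and restored with plain PyTorch as follows. This is a minimal sketch: the dictionary key names and the file name are illustrative assumptions, and the files written by AI4Plasma models may be laid out differently.

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(2, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
epoch = 500

# Bundle everything needed to continue training (key names are illustrative)
checkpoint = {
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "epoch": epoch,
}

path = os.path.join(tempfile.mkdtemp(), "epoch_0500.pt")
torch.save(checkpoint, path)

# Restore into freshly constructed objects of the same shape
restored_model = nn.Linear(2, 1)
restored_optimizer = torch.optim.Adam(restored_model.parameters(), lr=1e-3)
state = torch.load(path, map_location="cpu")
restored_model.load_state_dict(state["model_state_dict"])
restored_optimizer.load_state_dict(state["optimizer_state_dict"])
start_epoch = state["epoch"] + 1  # continue with the next epoch
```

Saving the optimizer state matters for Adam-type optimizers, whose running moment estimates would otherwise restart from zero on resume.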

3.3.1. When to save 

Save frequently in long runs (every 100-1000 epochs)
Save more frequently during hyperparameter exploration
Always save the best validation checkpoint when a validation metric is available
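
The two saving rules (periodic and best-validation) can be combined in one small decision function. The sketch below is hypothetical: checkpoint_actions is not an AI4Plasma function, the validation values are made up, and a lower metric is assumed to be better.

```python
def checkpoint_actions(epoch, val_metric, best_so_far, checkpoint_freq):
    """Return (save_periodic, save_best, new_best) for one epoch.

    Lower val_metric is treated as better; pass None when no validation
    metric is available for this epoch.
    """
    save_periodic = epoch > 0 and epoch % checkpoint_freq == 0
    save_best = val_metric is not None and (
        best_so_far is None or val_metric < best_so_far
    )
    new_best = val_metric if save_best else best_so_far
    return save_periodic, save_best, new_best


best = None
events = []
val_curve = {100: 0.9, 200: 0.4, 300: 0.6, 400: 0.5}  # made-up validation errors
for epoch in range(1, 401):
    periodic, is_best, best = checkpoint_actions(
        epoch, val_curve.get(epoch), best, checkpoint_freq=200
    )
    if periodic or is_best:
        events.append((epoch, periodic, is_best))
```

Keeping the best checkpoint in a separate file from the periodic ones avoids overwriting it when a later epoch validates worse.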

3.4. Reproducibility 

Use set_seed(...) from ai4plasma.utils.common and record:

seed
device
precision
code version

Also consider recording:

git commit or tag
dataset version and preprocessing
training configuration (hyperparameters)
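
The items above can be collected into a single record and stored next to the logs. The following sketch uses only the Python standard library; build_run_record, current_git_commit, and the field names are hypothetical helpers introduced for illustration, and the hyperparameter values are placeholders.

```python
import json
import platform
import subprocess
import sys


def current_git_commit():
    """Return the current commit hash, or None outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def build_run_record(seed, device, precision, hyperparameters):
    """Collect the items worth storing next to checkpoints and logs."""
    return {
        "seed": seed,
        "device": device,
        "precision": precision,
        "git_commit": current_git_commit(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "hyperparameters": hyperparameters,
    }


record = build_run_record(
    seed=2023, device="cpu", precision="float32",
    hyperparameters={"lr": 1e-3, "num_epochs": 1000},
)
run_record_json = json.dumps(record, indent=2, sort_keys=True)
```

Writing run_record_json to a file inside the run's log directory keeps the configuration together with the TensorBoard event files it describes.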

3.5. Practical tips for PINN training 

Start with a conservative learning rate (e.g., 1e-4 to 1e-3).
Use a scheduler (MultiStepLR or ReduceLROnPlateau) for long runs.
Balance loss term weights (domain vs boundary) when residual magnitudes differ.
Log intermediate physical quantities (conductivity, radiation terms) to catch non-physical regimes early.
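
The scheduler and loss-weighting tips can be illustrated with plain PyTorch. This is a sketch under stated assumptions: the weights, the tiny milestones, and the placeholder loss terms are chosen only to make the behavior visible in a few epochs, and they are not recommended values.

```python
import torch
from torch import nn

model = nn.Linear(2, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Drop the learning rate by 10x at epochs 2 and 4 (tiny milestones for display)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[2, 4], gamma=0.1)

# Illustrative weights: boundary term emphasized relative to the domain residual
w_domain, w_boundary = 1.0, 10.0
x_domain = torch.rand(16, 2)
x_boundary = torch.rand(4, 2)

lrs = []
for epoch in range(5):
    optimizer.zero_grad()
    loss_domain = model(x_domain).pow(2).mean()      # placeholder residual term
    loss_boundary = model(x_boundary).pow(2).mean()  # placeholder boundary term
    loss = w_domain * loss_domain + w_boundary * loss_boundary
    loss.backward()
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])  # rate used for this epoch
    scheduler.step()
```

Logging both loss_domain and loss_boundary separately, rather than only their weighted sum, shows whether the chosen weights actually bring the two terms to comparable magnitudes.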

3.6. Example: basic training skeleton 

from ai4plasma.utils.common import set_seed
from ai4plasma.core.network import FNN

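set_seed(2023)  # fix random seeds for reproducibility

# Layer sizes (assumed meaning): 2 inputs, three hidden layers of 64 units, 1 output
net = FNN([2, 64, 64, 64, 1])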
# Create model wrapper, define equation terms, then train
# model.train(num_epochs=..., lr=...)

3.7. Troubleshooting 

Loss does not decrease: reduce learning rate, increase collocation points, or check derivative computations.
Boundary drift: increase boundary term weight or sample more boundary points.
Exploding gradients: use gradient clipping or a smoother loss function, and check that inputs to interpolated properties stay within their interpolation ranges.
Slow training: reduce the batch size, simplify the network, or use mixed precision where appropriate.
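
As an illustration of the gradient clipping remedy, the sketch below uses torch.nn.utils.clip_grad_norm_ in a hand-written PyTorch step. The network, the exaggerated target scale, and max_norm=1.0 are assumptions for the example; whether AI4Plasma training methods expose a clipping option is not covered here.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 2)
target = 1e3 * torch.ones(32, 1)  # exaggerated scale to produce large gradients

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), target)
loss.backward()

max_norm = 1.0
# clip_grad_norm_ rescales gradients in place and returns the norm before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
# Recompute the global gradient norm to confirm the clipping took effect
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
optimizer.step()
```

Logging norm_before as a scalar is a cheap diagnostic: sustained large values point to a scaling problem that clipping alone only masks.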

3.8. Checkpoint resumption workflow 

Identify the latest checkpoint file in checkpoint_dir.
Pass resume_from=... to the model training method.
Verify logs continue from the expected epoch in TensorBoard.
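
The first step of this workflow can be automated with a small helper. The sketch assumes checkpoints are named epoch_<N>.pt, which is a hypothetical naming scheme; adjust the pattern to whatever the training method actually writes. latest_checkpoint is not an AI4Plasma function.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(checkpoint_dir, pattern=r"epoch_(\d+)\.pt$"):
    """Return the checkpoint path with the highest epoch number, or None."""
    best_epoch, best_path = -1, None
    for path in Path(checkpoint_dir).iterdir():
        match = re.search(pattern, path.name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path


# Demonstration with empty placeholder files
demo_dir = Path(tempfile.mkdtemp())
for name in ["epoch_100.pt", "epoch_900.pt", "epoch_1000.pt", "notes.txt"]:
    (demo_dir / name).touch()

resume_path = latest_checkpoint(demo_dir)
# resume_path would then be passed as resume_from=... to the training method
```

Comparing epoch numbers as integers avoids the pitfall of lexicographic sorting, which would rank epoch_900.pt after epoch_1000.pt.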

3.9. Next steps 

Operator training details: guides/operator.md
PINN workflows: guides/piml.md
Example scripts: examples/index.md