# Training, Logging, Checkpoints

AI4Plasma training utilities show up in multiple places:

- `ai4plasma.core.model.BaseModel`: minimal base wrapper
- `ai4plasma.operator.*Model`: operator-learning training wrappers
- `ai4plasma.piml.pinn.PINN`: multi-term physics loss and callbacks

```{contents}
:local:
:depth: 2
```

## Training components

Most training loops share the same building blocks:

- **Model**: network + physics or operator wrapper
- **Optimizer**: Adam/AdamW commonly used for PINNs and operators
- **Scheduler**: optional learning rate schedule for long runs
- **Data**: supervised datasets or collocation points
- **Callbacks**: visualization and diagnostics

## TensorBoard

Many training loops accept `tensorboard_logdir=...` and will write event files.

Run TensorBoard from the repository root:

```bash
tensorboard --logdir app
```

Common log directories:

- `app/operator/deeponet/runs/...`
- `app/piml/cs_pinn/runs/...`
- `app/piml/rk_pinn/runs/...`

### What to log

Recommended scalars:

- total loss and per-term loss
- learning rate
- validation error (if available)
- physical diagnostics (e.g., conductance, radiation power, max/min temperature)

Recommended figures:

- solution profiles and contours
- residual maps or error heatmaps
- training curves (loss, error)

## Checkpointing

Typical options (model-specific):

- `checkpoint_dir`: directory for saving periodic checkpoints
- `checkpoint_freq`: save every N epochs
- `resume_from`: resume training from a checkpoint file

A checkpoint usually contains:

- model weights
- optimizer state
- current epoch

### When to save

- Save frequently in long runs (every 100-1000 epochs)
- Save more frequently during hyperparameter exploration
- Always save the best validation checkpoint when a validation metric is available

## Reproducibility

Use `set_seed(...)` from `ai4plasma.utils.common` and record:

- seed
- device
- precision
- code version

Also consider recording:

- git commit or tag
- dataset version and preprocessing
- training configuration (hyperparameters)

## Practical tips for PINN training

- Start with a conservative learning rate (e.g. `1e-4` to `1e-3`).
- Use a scheduler (MultiStepLR or ReduceLROnPlateau) for long runs.
- Balance loss term weights (domain vs boundary) when residual magnitudes differ.
- Log intermediate physical quantities (conductivity, radiation terms) to catch non-physical regimes early.

## Example: basic training skeleton

```python
from ai4plasma.utils.common import set_seed
from ai4plasma.core.network import FNN

set_seed(2023)

net = FNN([2, 64, 64, 64, 1])
# Create model wrapper, define equation terms, then train
# model.train(num_epochs=..., lr=...)
```

## Troubleshooting

- **Loss does not decrease**: reduce learning rate, increase collocation points, or check derivative computations.
- **Boundary drift**: increase boundary term weight or sample more boundary points.
- **Exploding gradients**: use gradient clipping or smooth loss functions, and check property interpolation ranges.
- **Slow training**: reduce batch size, simplify network, or use mixed precision where appropriate.

## Checkpoint resumption workflow

1. Identify the latest checkpoint file in `checkpoint_dir`.
2. Pass `resume_from=...` to the model training method.
3. Verify logs continue from the expected epoch in TensorBoard.

## Next steps

- Operator training details: [guides/operator.md](operator.md)
- PINN workflows: [guides/piml.md](piml.md)
- Example scripts: [examples/index.md](../examples/index.md)