Picochat Architecture

Picochat is a small-language-model factory with one explicit contract: every stage writes artifacts that later stages can verify.

dataset pack
  -> corpus.txt + corpus_manifest.json
  -> tokenizer.json
  -> base checkpoint + base report
  -> SFT checkpoint + SFT report
  -> eval report + external benchmark reports
  -> summary.json + summary.md + release gate
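One way a later stage could verify an earlier stage's artifacts is with content hashes. The sketch below is a minimal illustration, assuming a manifest that records a SHA-256 digest per artifact. The field names (`artifacts`, `sha256`) and the helper names are hypothetical and are not Picochat's actual manifest schema.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large corpora do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(artifact_paths: list[Path], manifest_path: Path) -> None:
    """Producer side: record a digest for every artifact the stage wrote."""
    # HYPOTHETICAL schema: {"artifacts": {<file name>: {"sha256": <hex>}}}
    entries = {p.name: {"sha256": sha256_of(p)} for p in artifact_paths}
    manifest_path.write_text(json.dumps({"artifacts": entries}, indent=2))


def verify_manifest(manifest_path: Path) -> list[str]:
    """Consumer side: return names of artifacts that are missing or altered."""
    manifest = json.loads(manifest_path.read_text())
    failures = []
    for name, meta in manifest["artifacts"].items():
        path = manifest_path.parent / name
        if not path.is_file() or sha256_of(path) != meta["sha256"]:
            failures.append(name)
    return failures
```

A consuming stage would call `verify_manifest` before reading its inputs and refuse to start if the returned list is non-empty.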

Training Stack

The current 1B-class scale is h200-1b-ddp8.

| Component | Choice |
| --- | --- |
| Architecture | Decoder-only Transformer |
| Layers | 24 |
| Context | 2048 |
| Embedding | 2048 |
| Attention | GQA, 16 query heads, 4 KV heads |
| Position | RoPE |
| Norm | RMSNorm |
| MLP | SwiGLU |
| Residual | Parallel residual |
| Stabilizers | QK norm, scaled residual init |
| Embeddings | Tied token embeddings / LM head |
| Precision | BF16 |
| Runtime | CUDA, torch.compile, DDP |
| Memory | Per-block gradient checkpointing |

The scale targets roughly 1.12B parameters and 22.4B planned base-training tokens, about 20 tokens per parameter.
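The figures above can be sanity-checked with a back-of-the-envelope count. This is a rough sketch, not the project's own accounting. The layer count, width, and head counts come from the table, but the MLP hidden size and vocabulary size are not stated in this document, so the defaults below are assumptions picked for illustration. Norm and QK-norm parameters are ignored because they are negligible at this scale.

```python
def estimate_params(
    n_layers: int = 24,
    d_model: int = 2048,
    n_q_heads: int = 16,
    n_kv_heads: int = 4,
    mlp_hidden: int = 5461,   # ASSUMPTION: about 8/3 * d_model, not stated in the source
    vocab_size: int = 32768,  # ASSUMPTION: not stated in the source
) -> int:
    head_dim = d_model // n_q_heads   # 2048 / 16 = 128
    kv_dim = n_kv_heads * head_dim    # 4 * 128 = 512: K and V are narrower under GQA
    # Wq and Wo are d_model x d_model; Wk and Wv are d_model x kv_dim.
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim
    # SwiGLU uses three projections: gate, up, and down.
    mlp = 3 * d_model * mlp_hidden
    # Tied embeddings: the token embedding and LM head share one matrix.
    embed = vocab_size * d_model
    return n_layers * (attn + mlp) + embed


total = estimate_params()               # 1_124_024_320 under the assumptions above
tokens_per_param = 22.4e9 / 1.12e9      # 20.0
```

Under these assumed sizes the estimate lands near 1.12B, and the planned token budget works out to 20 tokens per parameter.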

Distributed Strategy

Picochat uses plain DDP, not FSDP or tensor parallelism, for the 1B path.

That is not the most memory-efficient setup possible, but it is simpler, easier to debug, and fits comfortably on 8xH100/H200 for this model size. The next scaling step is experimental FSDP for base pretraining; release run tiny remains DDP until SFT, checkpoint export, and post-run gates are validated under sharded parameter ownership.
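A rough estimate shows why replication is affordable at this size. The sketch assumes 16 bytes of persistent state per parameter (fp32 weights, fp32 gradients, and two fp32 Adam moments). That optimizer layout is an assumption, not something this document specifies, and activation memory is excluded.

```python
PARAMS = 1.12e9
BYTES_PER_PARAM = 16   # ASSUMPTION: fp32 weights + fp32 grads + two fp32 Adam moments
WORLD_SIZE = 8

# DDP: every rank holds a full replica of weights, gradients, and optimizer state.
ddp_gb_per_gpu = PARAMS * BYTES_PER_PARAM / 1e9

# Fully sharded (FSDP-style): the same state is split evenly across ranks.
sharded_gb_per_gpu = ddp_gb_per_gpu / WORLD_SIZE

print(f"DDP:     {ddp_gb_per_gpu:.2f} GB per GPU")
print(f"Sharded: {sharded_gb_per_gpu:.2f} GB per GPU")
```

Under these assumptions the replicated state is about 18 GB per GPU, which leaves most of an 80 GB H100 for activations. Sharding becomes more attractive once the model grows well past this size.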

DDP safeguards:

Experimental base-training FSDP is available through:

torchrun --standalone --nproc_per_node=8 -m picochat.cli train base \
  --ddp --distributed-strategy fsdp \
  --corpus data/corpus.txt \
  --tokenizer runs/tokenizer.json \
  --out-dir runs/base-fsdp-smoke \
  --device cuda \
  --precision bf16

Current FSDP guardrails:

Dataset Modes

Picochat supports three base data modes:

For long release profiles, sharded/packed mode must have document boundary tokens available. This prevents a run from validating on partial fragments of the same source document it trained on.
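The boundary-token requirement can be illustrated with a split that only ever assigns whole documents to one side. This is a minimal sketch, not Picochat's loader. The function name, the `val_every` parameter, and the round-robin assignment are hypothetical; a real pipeline might instead hash a document ID.

```python
def split_by_document(tokens, boundary_id, val_every=10):
    """Assign whole documents to train or validation, never splitting one across both."""
    train, val = [], []
    doc, doc_index = [], 0
    for tok in tokens:
        doc.append(tok)
        if tok == boundary_id:
            # A boundary token closes the current document; route it as a unit.
            target = val if doc_index % val_every == val_every - 1 else train
            target.extend(doc)
            doc, doc_index = [], doc_index + 1
    if doc:
        # Trailing tokens with no closing boundary stay in train.
        train.extend(doc)
    return train, val


# Three documents separated by boundary token 0; every second one goes to validation.
train, val = split_by_document([1, 2, 0, 3, 0, 4, 5, 0], boundary_id=0, val_every=2)
# train == [1, 2, 0, 4, 5, 0], val == [3, 0]
```

Without boundary tokens, a packed stream can only be cut at arbitrary offsets, which is how fragments of one document end up on both sides of the split.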

Checkpoint and Resume Safety

Checkpoints are written crash-safely:

  1. write into a unique temporary directory
  2. atomically swap with os.replace
  3. keep .previous as rollback protection
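The three steps above can be sketched as follows. This is an illustrative version under stated assumptions, not the project's implementation: the function name and the `files` argument are hypothetical. Because `os.replace` cannot overwrite a non-empty directory on POSIX, the sketch first renames the existing checkpoint to `.previous` and then renames the temporary directory into place.

```python
import os
import shutil
import tempfile
from pathlib import Path


def write_checkpoint(final_dir: Path, files: dict[str, bytes]) -> None:
    """Write files into a temp dir, then swap it into place, keeping a rollback copy."""
    final_dir = Path(final_dir)
    parent = final_dir.parent
    parent.mkdir(parents=True, exist_ok=True)
    previous = final_dir.with_name(final_dir.name + ".previous")

    # 1. Unique temporary directory on the same filesystem as the target.
    tmp_dir = Path(tempfile.mkdtemp(prefix=final_dir.name + ".tmp-", dir=parent))
    try:
        for name, payload in files.items():
            with open(tmp_dir / name, "wb") as fh:
                fh.write(payload)
                fh.flush()
                os.fsync(fh.fileno())  # make sure bytes reach disk before the swap

        # 3. Keep the old checkpoint as .previous for rollback.
        if final_dir.exists():
            if previous.exists():
                shutil.rmtree(previous)
            os.replace(final_dir, previous)

        # 2. Rename the finished directory into place.
        os.replace(tmp_dir, final_dir)
    except BaseException:
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
```

A crash before the final rename leaves either the old checkpoint or `.previous` intact, so a loader that falls back to `.previous` when the main directory is missing always finds a complete checkpoint.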

Resume safety includes:

The sanity suite includes resume-loss determinism checks so a short run can prove that resume is not silently changing the training stream.
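The idea behind a resume-loss determinism check can be shown with a toy loop. This is a hedged illustration, not the sanity suite itself: the "model" is a single scalar, the data stream is a seeded `random.Random`, and all names are hypothetical. The check compares the per-step losses of an uninterrupted run against a run that is checkpointed and resumed partway through.

```python
import random


def fresh_state(seed=0):
    """Initial training state: one scalar weight plus the data-stream RNG state."""
    return {"weight": 0.0, "rng_state": random.Random(seed).getstate()}


def train_steps(state, n_steps):
    """Run n_steps of toy SGD; return the new state and the per-step losses."""
    rng = random.Random()
    rng.setstate(state["rng_state"])   # restore the data stream position
    weight = state["weight"]
    losses = []
    for _ in range(n_steps):
        x = rng.random()               # next "batch" from the stream
        losses.append((weight - x) ** 2)
        weight -= 0.1 * 2 * (weight - x)
    return {"weight": weight, "rng_state": rng.getstate()}, losses


def resume_is_deterministic(total_steps=20, resume_at=7):
    """True if checkpoint + resume reproduces the uninterrupted loss curve exactly."""
    _, uninterrupted = train_steps(fresh_state(), total_steps)
    checkpoint, first = train_steps(fresh_state(), resume_at)
    _, second = train_steps(checkpoint, total_steps - resume_at)
    return uninterrupted == first + second
```

If the checkpoint omitted the stream state, the resumed run would replay earlier batches and the loss curves would diverge, which is the silent change such a check is meant to catch.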

Web Workbench

The workbench reads real run artifacts. It exposes:

The dashboard is an operator surface, not a marketing page. If a gate blocks, the UI should show what failed and what to fix.

Serving

Picochat includes a native PyTorch serving path for local integrations:

picochat serve --checkpoint runs/<run>/sft/checkpoint --tokenizer runs/<run>/tokenizer.json

The server loads the checkpoint once and exposes:

This is intentionally a local smoke-serving layer, not a high-throughput inference stack. Production adapters for vLLM, TGI, TensorRT-LLM, or llama.cpp are separate future work because Picochat uses a custom audited model implementation.