Picochat Pipeline Guide

Picochat is an honest small-language-model factory. The goal is not to pretend a tiny or pilot model is a large assistant. The goal is to make each step of language-model training visible enough to inspect, rerun, and explain.

The native scratch pipeline is:

dataset -> tokenizer -> base pretraining -> chat SFT -> optional DPO -> eval -> chat -> report

Each stage writes artifacts to a run folder so the next stage has visible inputs instead of hidden state.

If you are starting from an existing Hugging Face model instead of creating a Picochat-native base model, use the separate path:

picochat train hf-sft --model <hf-model-id> --input <chat.jsonl> --out-dir runs/<hf-sft-name>

That path writes Hugging Face model folders and intentionally skips Picochat’s native tokenizer/base-pretraining stages.

For multi-turn tool-calling data, put the full context in a messages array and end the row with the target assistant response. Picochat masks the previous system/user/assistant/tool turns and trains only the final assistant message.

1. Dataset

Purpose: turn local source files into one normalized training corpus.

Input examples:

Output artifacts:

What to inspect:

Important idea: a tiny model cannot learn what the dataset does not contain. If the corpus is tiny, repeated, noisy, or off-topic, the model will mostly memorize that data.

Useful command:

picochat data preview --dataset-pack examples/tiny_dataset_pack.json

Hugging Face import is an intake helper before this stage. It streams or loads rows from a dataset split, writes a local text file, and then hands control back to the normal Picochat corpus preview:

picochat data hf-import --dataset HuggingFaceFW/fineweb-edu --split train --text-column text --max-rows 1000 --out runs/fineweb-edu-sample/corpus.txt
picochat data preview --input runs/fineweb-edu-sample/corpus.txt

Once the corpus exists, generate an eval starter from the same text:

picochat data eval-starter --input runs/fineweb-edu-sample/corpus.txt --out runs/fineweb-edu-sample/eval_starter.jsonl --max-items 40

Important idea: generated eval rows are scaffolding. They are useful because they force the first benchmark to reference the loaded corpus, but they still need human review before the score means anything.

2. Tokenizer

Purpose: convert text into token IDs the model can read.

Picochat starts with a character tokenizer because it is easy to inspect. It also supports a byte tokenizer for UTF-8 coverage experiments and a small dependency-free BPE tokenizer for learned merge experiments. These are simpler than production tokenizers, but they keep the mechanics visible.

Input:

Output:

What to inspect:

Useful command:

picochat tok train --input runs/manual/corpus.txt --out runs/manual/tokenizer.json

To compare BPE against the educational baseline:

picochat tok train --input runs/manual/corpus.txt --out runs/manual/tokenizer-bpe.json --type bpe --vocab-size 512 --min-freq 2

3. Base Pretraining

Purpose: train a decoder-only transformer to predict the next token.

This is the actual language-model training stage. The model sees token windows from the corpus and learns next-token patterns.

Inputs:

Output artifacts:

What to inspect:

Important idea: decreasing train loss only means the model is fitting the training windows. Validation loss tells you whether that fit is carrying over to held-out windows. When run tiny has a corpus manifest with multiple documents, Picochat prefers document-level holdout so validation text comes from complete unseen sources. That is a stronger signal than splitting random windows from the same document.

Useful command:

picochat train base --corpus runs/manual/corpus.txt --tokenizer runs/manual/tokenizer.json --corpus-manifest runs/manual/corpus_manifest.json --split-mode document --out-dir runs/manual/base --context-size 128 --max-steps 300

For longer runs, add guardrails:

picochat train base --corpus runs/manual/corpus.txt --tokenizer runs/manual/tokenizer.json --corpus-manifest runs/manual/corpus_manifest.json --split-mode document --out-dir runs/manual/base --context-size 128 --max-steps 10000 --max-minutes 45 --early-stop-patience 3 --canary-count 3

4. Chat SFT

Purpose: teach the base model a chat response format using supervised examples.

SFT does not create new knowledge by itself. It teaches behavior found in the chat JSONL rows.

Inputs:

Output artifacts:

What to inspect:

Important idea: on very small chat files, SFT can quickly memorize exact answers. That is why Picochat reports a memorization-risk diagnostic instead of hiding the gap.

Useful command:

picochat train sft --input examples/tiny_chat.jsonl --tokenizer runs/manual/tokenizer.json --checkpoint runs/manual/base/best_checkpoint --out-dir runs/manual/sft --max-steps 600 --early-stop-patience 6

LoRA command for lightweight domain adapters:

picochat train sft --input examples/tiny_chat.jsonl --tokenizer runs/manual/tokenizer.json --checkpoint runs/manual/base/best_checkpoint --out-dir runs/manual/sft-lora --peft lora --lora-rank 8 --lora-alpha 16 --lora-targets attn_qkv,attn_proj --max-steps 600 --early-stop-patience 6

LoRA is for adapting an existing base checkpoint to a domain or behavior style. It does not replace base pretraining or make the base model know facts it never saw.

5. Optional DPO

Purpose: improve post-SFT preference behavior when you have curated chosen/rejected answers for the same prompt.

DPO is optional. Use it for alignment preferences such as safer refusals, clearer tone, shorter answers, or domain style. Do not use it as a substitute for base pretraining, SFT coverage, or held-out eval.

Inputs:

Output artifacts:

For smoke tests, Picochat can generate starter preference pairs from SFT rows:

picochat data preference-starter --input runs/manual/chat.jsonl --out data/preferences.jsonl

Those starter rows use synthetic rejected answers. They are useful for checking DPO mechanics, not for claiming alignment quality.

Useful command:

picochat train dpo --input data/preferences.jsonl --tokenizer runs/manual/tokenizer.json --checkpoint runs/manual/sft/checkpoint --out-dir runs/manual/dpo --max-steps 200 --learning-rate 0.000005 --beta 0.1 --early-stop-patience 4

End-to-end command:

picochat run tiny \
  --out-dir runs/manual \
  --dataset-pack runs/pack/dataset_pack.json \
  --dpo-input data/preferences.jsonl \
  --dpo-steps 200

When DPO is enabled in run tiny, SFT-fit, held-out eval, external evals, and release gates all score the post-DPO checkpoint.

6. Eval

Purpose: score generated replies with transparent rules.

Picochat evals are intentionally simple. Each item can define required phrases, any-of phrase groups, forbidden phrases, required entities, length bounds, a reference answer, corpus-support requirements, and whether the question is answerable.

Inputs:

Output artifacts:

What to inspect:

Important idea: this is not semantic truth evaluation. It is an inspectable measurement for a tiny model, especially for whether it makes unsupported answers when it should refuse or echoes the prompt instead of answering.

Useful command:

picochat eval chat --input examples/tiny_eval.jsonl --checkpoint runs/manual/sft/checkpoint --tokenizer runs/manual/tokenizer.json --out-dir runs/manual/eval
picochat eval chat --input examples/tiny_eval.jsonl --checkpoint runs/manual/sft/checkpoint --tokenizer runs/manual/tokenizer.json --out-dir runs/manual/eval --support-corpus runs/manual/corpus.txt

7. Chat And Generation

Purpose: sample text from a checkpoint and inspect behavior manually.

Inputs:

What to inspect:

Important idea: generation is a sample, not a proof. Use it alongside reports and evals.

Useful command:

picochat chat --checkpoint runs/manual/sft/checkpoint --tokenizer runs/manual/tokenizer.json

8. Report

Purpose: make the run explainable after it finishes.

Output artifacts:

What to inspect:

Important idea: reports are part of the experiment, not decoration. If a result cannot be traced back to the data, model settings, checkpoint, and eval rules, it is not useful yet.

How To Read A Tiny Run

  1. Start with corpus_report.md.
  2. Check tokenizer stats and token examples.
  3. Read base loss diagnostics.
  4. Check base memorization diagnostics and copied-span rates.
  5. Read SFT loss diagnostics.
  6. Check eval pass/fail details.
  7. Check the eval ladder: smoke failures mean wiring is broken, held-out failures mean weak recall/generalization, transfer failures mean brittle behavior, adversarial failures mean weak refusal, and memorization-probe failures mean the model may be copying.
  8. Compare generated samples with eval results.
  9. Only then increase data size, context length, steps, or model size.

The point of Picochat is controlled learning. Make one change, rerun, compare artifacts, and keep the explanation honest.