Picochat Pipeline Guide
Picochat is an honest small-language-model factory. The goal is not to pretend a tiny or pilot model is a large assistant. The goal is to make each step of language-model training visible enough to inspect, rerun, and explain.
The native scratch pipeline is:
dataset -> tokenizer -> base pretraining -> chat SFT -> optional DPO -> eval -> chat -> report
Each stage writes artifacts to a run folder so the next stage has visible inputs instead of hidden state.
If you are starting from an existing Hugging Face model instead of creating a Picochat-native base model, use the separate path:
picochat train hf-sft --model <hf-model-id> --input <chat.jsonl> --out-dir runs/<hf-sft-name>
That path writes Hugging Face model folders and intentionally skips Picochat’s native tokenizer/base-pretraining stages.
For multi-turn tool-calling data, put the full context in a messages array and
end the row with the target assistant response. Picochat masks the previous
system/user/assistant/tool turns and trains only the final assistant message.
1. Dataset
Purpose: turn local source files into one normalized training corpus.
Input examples:
examples/tiny_corpus.txt- a folder of
.txt,.md,.jsonl,.csv, or.pyfiles - a dataset pack such as
examples/tiny_dataset_pack.json - a small imported Hugging Face sample written to a local
corpus.txt
Output artifacts:
corpus.txtcorpus_manifest.jsoncorpus_report.md
What to inspect:
- how many files were included or skipped
- duplicate-document rate
- duplicate-line rate
- empty-line rate
- source quality scores
- whether files were filtered by
--min-score - document spans used for whole-document validation holdout
Important idea: a tiny model cannot learn what the dataset does not contain. If the corpus is tiny, repeated, noisy, or off-topic, the model will mostly memorize that data.
Useful command:
picochat data preview --dataset-pack examples/tiny_dataset_pack.json
Hugging Face import is an intake helper before this stage. It streams or loads rows from a dataset split, writes a local text file, and then hands control back to the normal Picochat corpus preview:
picochat data hf-import --dataset HuggingFaceFW/fineweb-edu --split train --text-column text --max-rows 1000 --out runs/fineweb-edu-sample/corpus.txt
picochat data preview --input runs/fineweb-edu-sample/corpus.txt
Once the corpus exists, generate an eval starter from the same text:
picochat data eval-starter --input runs/fineweb-edu-sample/corpus.txt --out runs/fineweb-edu-sample/eval_starter.jsonl --max-items 40
Important idea: generated eval rows are scaffolding. They are useful because they force the first benchmark to reference the loaded corpus, but they still need human review before the score means anything.
2. Tokenizer
Purpose: convert text into token IDs the model can read.
Picochat starts with a character tokenizer because it is easy to inspect. It also supports a byte tokenizer for UTF-8 coverage experiments and a small dependency-free BPE tokenizer for learned merge experiments. These are simpler than production tokenizers, but they keep the mechanics visible.
Input:
corpus.txt
Output:
tokenizer.json
What to inspect:
- vocab size
- special tokens
- how a string maps to IDs
- whether important characters are missing
Useful command:
picochat tok train --input runs/manual/corpus.txt --out runs/manual/tokenizer.json
To compare BPE against the educational baseline:
picochat tok train --input runs/manual/corpus.txt --out runs/manual/tokenizer-bpe.json --type bpe --vocab-size 512 --min-freq 2
3. Base Pretraining
Purpose: train a decoder-only transformer to predict the next token.
This is the actual language-model training stage. The model sees token windows from the corpus and learns next-token patterns.
Inputs:
corpus.txttokenizer.json
Output artifacts:
base/checkpoint/base/best_checkpoint/base/train_report.jsonbase/report.mdbase/sample.txtbase/canary_probe.txtwhen train-only canaries are enabled
What to inspect:
- train loss
- validation loss
- validation BPB, which compares tokenizers more fairly than plain loss
- estimated tokens seen and dataset passes
- stop reason: max steps, time budget, or early stop
- final train/validation gap
- best validation step
- recommended checkpoint step
- split mode: random token windows or held-out complete documents
- memorization diagnostics: train copy rate, held-out overlap, copied spans, canary hits
- generated base sample
Important idea: decreasing train loss only means the model is fitting the
training windows. Validation loss tells you whether that fit is carrying over
to held-out windows. When run tiny has a corpus manifest with multiple
documents, Picochat prefers document-level holdout so validation text comes
from complete unseen sources. That is a stronger signal than splitting random
windows from the same document.
Useful command:
picochat train base --corpus runs/manual/corpus.txt --tokenizer runs/manual/tokenizer.json --corpus-manifest runs/manual/corpus_manifest.json --split-mode document --out-dir runs/manual/base --context-size 128 --max-steps 300
For longer runs, add guardrails:
picochat train base --corpus runs/manual/corpus.txt --tokenizer runs/manual/tokenizer.json --corpus-manifest runs/manual/corpus_manifest.json --split-mode document --out-dir runs/manual/base --context-size 128 --max-steps 10000 --max-minutes 45 --early-stop-patience 3 --canary-count 3
4. Chat SFT
Purpose: teach the base model a chat response format using supervised examples.
SFT does not create new knowledge by itself. It teaches behavior found in the chat JSONL rows.
Inputs:
- chat JSONL with
userandassistantfields tokenizer.json- base checkpoint
Output artifacts:
sft/checkpoint/sft/checkpoint/adapter_model.ptandadapter_config.jsonwhen using LoRAsft/sft_report.jsonsft/report.mdsft/sample.txt
What to inspect:
- number of usable chat examples
- truncated examples
- supervised answer tokens
- SFT train/validation gap
- whether validation loss diverged while train loss fell
Important idea: on very small chat files, SFT can quickly memorize exact
answers. That is why Picochat reports a memorization-risk diagnostic instead
of hiding the gap.
Useful command:
picochat train sft --input examples/tiny_chat.jsonl --tokenizer runs/manual/tokenizer.json --checkpoint runs/manual/base/best_checkpoint --out-dir runs/manual/sft --max-steps 600 --early-stop-patience 6
LoRA command for lightweight domain adapters:
picochat train sft --input examples/tiny_chat.jsonl --tokenizer runs/manual/tokenizer.json --checkpoint runs/manual/base/best_checkpoint --out-dir runs/manual/sft-lora --peft lora --lora-rank 8 --lora-alpha 16 --lora-targets attn_qkv,attn_proj --max-steps 600 --early-stop-patience 6
LoRA is for adapting an existing base checkpoint to a domain or behavior style. It does not replace base pretraining or make the base model know facts it never saw.
5. Optional DPO
Purpose: improve post-SFT preference behavior when you have curated chosen/rejected answers for the same prompt.
DPO is optional. Use it for alignment preferences such as safer refusals, clearer tone, shorter answers, or domain style. Do not use it as a substitute for base pretraining, SFT coverage, or held-out eval.
Inputs:
- preference JSONL with
userorprompt,chosen, andrejected tokenizer.json- policy checkpoint, usually the SFT checkpoint
- optional frozen reference checkpoint, otherwise the policy checkpoint is used as the reference starting point
Output artifacts:
dpo/checkpoint/dpo/best_checkpoint/dpo/dpo_report.jsondpo/report.md
For smoke tests, Picochat can generate starter preference pairs from SFT rows:
picochat data preference-starter --input runs/manual/chat.jsonl --out data/preferences.jsonl
Those starter rows use synthetic rejected answers. They are useful for checking DPO mechanics, not for claiming alignment quality.
Useful command:
picochat train dpo --input data/preferences.jsonl --tokenizer runs/manual/tokenizer.json --checkpoint runs/manual/sft/checkpoint --out-dir runs/manual/dpo --max-steps 200 --learning-rate 0.000005 --beta 0.1 --early-stop-patience 4
End-to-end command:
picochat run tiny \
--out-dir runs/manual \
--dataset-pack runs/pack/dataset_pack.json \
--dpo-input data/preferences.jsonl \
--dpo-steps 200
When DPO is enabled in run tiny, SFT-fit, held-out eval, external evals, and
release gates all score the post-DPO checkpoint.
6. Eval
Purpose: score generated replies with transparent rules.
Picochat evals are intentionally simple. Each item can define required phrases, any-of phrase groups, forbidden phrases, required entities, length bounds, a reference answer, corpus-support requirements, and whether the question is answerable.
Inputs:
- eval JSONL
- SFT checkpoint
- tokenizer
- optional support corpus for corpus-overlap diagnostics
Output artifacts:
eval/eval_report.jsoneval/report.md
What to inspect:
- pass rate
- unsupported claim rate
- eval ladder breakdown
- prompt echo rate
- missing support rate
- missing entity rate
- length violation rate
- corpus support failure rate
- token-F1 and ROUGE-L-style reference overlap
- repetition diagnostics
- failure analysis and recommendations
- failure clusters and weak eval levels
- matched and missing phrases
- matched and missing entities
- forbidden phrases found in replies
Important idea: this is not semantic truth evaluation. It is an inspectable measurement for a tiny model, especially for whether it makes unsupported answers when it should refuse or echoes the prompt instead of answering.
Useful command:
picochat eval chat --input examples/tiny_eval.jsonl --checkpoint runs/manual/sft/checkpoint --tokenizer runs/manual/tokenizer.json --out-dir runs/manual/eval
picochat eval chat --input examples/tiny_eval.jsonl --checkpoint runs/manual/sft/checkpoint --tokenizer runs/manual/tokenizer.json --out-dir runs/manual/eval --support-corpus runs/manual/corpus.txt
7. Chat And Generation
Purpose: sample text from a checkpoint and inspect behavior manually.
Inputs:
- base or SFT checkpoint
- tokenizer
- prompt
What to inspect:
- exact prompt formatting
- temperature
- top-k
- top-p
- repetition penalty
- seed
- repeated or collapsed output
Important idea: generation is a sample, not a proof. Use it alongside reports and evals.
Useful command:
picochat chat --checkpoint runs/manual/sft/checkpoint --tokenizer runs/manual/tokenizer.json
8. Report
Purpose: make the run explainable after it finishes.
Output artifacts:
summary.jsonsummary.md- stage-level Markdown reports
What to inspect:
- run settings
- artifact paths
- final losses
- loss diagnostics
- eval summary
- generated samples
Important idea: reports are part of the experiment, not decoration. If a result cannot be traced back to the data, model settings, checkpoint, and eval rules, it is not useful yet.
How To Read A Tiny Run
- Start with
corpus_report.md. - Check tokenizer stats and token examples.
- Read base loss diagnostics.
- Check base memorization diagnostics and copied-span rates.
- Read SFT loss diagnostics.
- Check eval pass/fail details.
- Check the eval ladder: smoke failures mean wiring is broken, held-out failures mean weak recall/generalization, transfer failures mean brittle behavior, adversarial failures mean weak refusal, and memorization-probe failures mean the model may be copying.
- Compare generated samples with eval results.
- Only then increase data size, context length, steps, or model size.
The point of Picochat is controlled learning. Make one change, rerun, compare artifacts, and keep the explanation honest.