This Markov chain text generator offers five modes:
- Words mode: generates individual words character-by-character using adaptive n-grams (1-N tokens).
- Text mode: generates flowing paragraphs word-by-word with paragraph-aware boundaries.
- GPT mode: generates text using GPT tokenizer (BPE tokens) with token-level n-grams.
- Phonetic mode: converts text to IPA phonemes and generates phonetically plausible words. Use the Speak button to hear generated text aloud.
- MDL mode: learns a ~1100-token subword vocabulary directly from the source corpus using Minimum-Description-Length-style credit assignment, then generates with that learned vocabulary. Produces corpus-specific chunks that sit between Words (characters) and GPT (prebuilt BPE). Treats the corpus as a single stream — paragraph breaks are not preserved.
Words and Text modes handle punctuation intelligently. Paired punctuation (quotes, brackets) is removed to avoid pairing errors, while unpaired punctuation (periods, commas, dashes) is properly separated and formatted. GPT mode uses the tokenizer's built-in punctuation handling.
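The punctuation strategy can be sketched roughly as follows. This is a minimal illustration, not the actual implementation; the `PAIRED` set and `preprocess` function are hypothetical names chosen for the example.

```python
import re

# Paired marks are dropped entirely so generated text never has a
# mismatched quote or bracket (illustrative set, not the tool's exact list).
PAIRED = '"()[]{}'

def preprocess(text):
    """Remove paired punctuation; split unpaired marks into standalone tokens."""
    text = ''.join(ch for ch in text if ch not in PAIRED)
    # Periods, commas, dashes, etc. become their own tokens, so the
    # generator can later reattach them with correct spacing.
    return re.findall(r"[\w']+|[.,;:!?-]", text)

print(preprocess('He said, "Hello, world!"'))
# ['He', 'said', ',', 'Hello', ',', 'world', '!']
```

Separating unpaired marks into tokens lets the chain learn where sentence-ending punctuation tends to occur, while dropping paired marks sidesteps the bookkeeping needed to keep quotes balanced.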
Differences from vanilla Markov chains:
Instead of using a fixed n-gram size, this implementation records all prefix-suffix pairs of lengths 1 to N and, at each generation step, randomly selects a prefix length weighted by how often each matching prefix occurs.
This adaptive approach captures patterns at multiple scales better than a single fixed order.
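The adaptive scheme described above can be sketched like this. It is a simplified illustration under assumed details (function names `train` and `step` are invented for the example; the real code may weight or smooth differently).

```python
import random
from collections import defaultdict, Counter

def train(tokens, max_n=3):
    """Record every prefix of length 1..max_n and the token that follows it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - 1):
        for n in range(1, max_n + 1):
            if i - n + 1 < 0:
                break
            prefix = tuple(tokens[i - n + 1:i + 1])
            model[prefix][tokens[i + 1]] += 1
    return model

def step(model, history, max_n=3):
    """Pick a prefix length weighted by how often each matching prefix was seen,
    then sample the next token from that prefix's observed successors."""
    candidates, weights = [], []
    for n in range(1, min(max_n, len(history)) + 1):
        prefix = tuple(history[-n:])
        if prefix in model:
            candidates.append(prefix)
            weights.append(sum(model[prefix].values()))
    prefix = random.choices(candidates, weights=weights)[0]
    suffixes = model[prefix]
    return random.choices(list(suffixes), weights=suffixes.values())[0]

model = train(list("abab"), max_n=2)
print(step(model, list("a"), max_n=2))  # 'b' — the only successor of 'a' seen
```

Because frequent long prefixes carry more weight, generation tends to use high-order context where the corpus supports it and falls back to shorter contexts elsewhere.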