How airoboros Generates High‑Quality Synthetic Training Data

A Code‑Level Exploration of the Self‑Instruct Pipeline, Instructor Network, and Quality‑Control Heuristics
This article was crafted by first using gitingest to ingest and condense the core folders of the Airoboros GitHub repository. We then prompted ChatGPT to interpret and explain the most critical components of the source code, translating complex design patterns and mechanisms into a digestible narrative for data scientists and engineers.
Table of Contents
- Introduction: Why Synthetic Data?
- High‑Level Architecture
- The SelfInstructor Orchestrator
- Topic Generation and Management
- Embedding‑Based Novelty Filtering
- Instructor Taxonomy and Prompt Templates
- 6.1 Generic inline instructors
- 6.2 Complex multistage instructors
- Response Generation and Post‑Processing
- Automated Judging and Culling
- Role‑Playing and Character‑Centric Data
- Quality Barriers: Safety, Readability, Diversity
- Integration with FAISS, SentenceTransformers, and LoRA
- LMoE Routing for Function‑Calling Data
- Putting It All Together: An End‑to‑End Run
- Strengths, Limitations, and Future Work
- Conclusion
1 Introduction: Why Synthetic Data?
Modern large language models crave oceans of diverse, instruction‑like text pairs. Public data alone rarely provides the breadth, cleanliness, or legal certainty commercial projects require. airoboros tackles this gap head‑on by programmatically manufacturing vast quantities of high‑quality, highly varied instruction/response pairs—often called self‑instruct or synthetic data—using foundation models themselves as teachers.
While many projects stop at a single “generate instruction → generate answer” loop, airoboros layers topic control, FAISS‑based duplication checks, multi‑instructor specialization, and rigorous LLM‑powered grading to push synthetic quality well beyond naïve pipelines.
What follows is a deep, code‑referenced walk through every moving part of the project.
2 High‑Level Architecture
At the top lives airoboros/self_instruct.py. Its SelfInstructor class acts as an orchestration engine that
- loads a YAML config (models, counts, batch sizes, etc.),
- spins up embedding and FAISS resources,
- discovers or creates topic files,
- delegates work to dozens of instructor modules under airoboros/instructors/,
- monitors token usage and parallel asyncio tasks, and
- persists unique triples to an output JSONL corpus.
A typical run looks like:

```text
SelfInstructor.run()
├─ initialize_topics()
├─ initialize_index()
├─ for each instructor in config:
│    asyncio.create_task(run_instructor(category))
├─ await all tasks
├─ optionally run editor / stylized_response after base data exists
└─ log completion
```
Each instructor is a coroutine generator that yields dicts shaped like:

```python
{
    "instruction": "...",
    "response": "...",
    "category": "general",
    "system": "...",                  # optional
    "skip_prompt_formatting": ...,    # optional flag for RP formats
}
```
The orchestrator writes these to instructions.jsonl and simultaneously inserts every non‑RP instruction into a FAISS index, so that future prompts can be checked for semantic similarity with a fast nearest‑neighbour search.
3 The SelfInstructor Orchestrator in Depth
3.1 Configuration Intake
The constructor reads a YAML file whose keys define
- the model (OpenAI or VertexAI),
- per‑instructor count, batch_size, and api_params,
- global thresholds (minimum FAISS distance, max tokens, etc.),
- topic file paths and avoidance regexes.

Parsing is handled in load_config(), which also spins up a SentenceTransformer embedding model (default: thenlper/gte‑small) and builds an empty faiss.IndexFlatL2.
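To make these keys concrete, here is a minimal, hypothetical config sketch in the spirit of the description above; the exact key names and defaults should be checked against the repository's example configs.

```python
import yaml  # pip install pyyaml

# Hypothetical config illustrating the kinds of keys described above;
# the example configs shipped with the repository are the source of truth.
CONFIG_YAML = """
model: gpt-4
embedding_model: thenlper/gte-small
min_docsearch_score: 0.35          # minimum FAISS distance for novelty
topics_path: topics.txt
instructors:
  general:
    count: 1000
    batch_size: 5
    api_params:
      temperature: 0.9
  coding:
    count: 500
    batch_size: 5
"""

config = yaml.safe_load(CONFIG_YAML)
print(config["instructors"]["general"]["count"])  # -> 1000
```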
3.2 Async HTTP Clients
The orchestrator knows how to talk to two back‑ends:
- OpenAI chat completions (_post_openai), with full retry/back‑off logic and nuanced exception mapping (RateLimitError, ServerOverloadedError, etc.).
- VertexAI chat/generative models (_post_vertexai), with bearer‑token refresh using google.oauth2.service_account.

Both funnel through a generic generate_response() dispatcher that injects system/user messages and returns only the raw assistant text, unless the response trips any filter regexes (apologies, policy refusals, banned words).
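The retry behaviour boils down to a familiar exponential back‑off pattern. Below is a minimal sketch of that pattern, not the project's actual _post_openai/_post_vertexai code; request_fn stands in for whatever coroutine performs the HTTP call.

```python
import asyncio
import random

async def with_backoff(request_fn, max_retries: int = 5, base_delay: float = 1.0):
    """Retry an async request with exponential back-off plus jitter."""
    for attempt in range(max_retries):
        try:
            return await request_fn()
        except Exception as exc:  # in practice: rate-limit / overload errors
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.random()
            print(f"request failed ({exc!r}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
```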
3.3 Parallel Instructor Scheduling
run() schedules one asyncio task per configured instructor. Inside run_instructor(category) the engine:
- Logs a start timestamp.
- Streams items from the instructor’s generator.
- For each item, calls persist(), which:
  - writes the JSONL line,
  - adds the embedding to FAISS (unless category == "rp"), and
  - updates per‑category counts and progress bars.

Because each instructor internally batches N prompts per model call, the pipeline achieves high throughput with bounded token cost.
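The fan‑out itself is plain asyncio. The following toy sketch mirrors the shape of run()/run_instructor() described above, with hypothetical names and a sleep standing in for model calls; the real engine additionally tracks token usage, progress bars, and FAISS updates.

```python
import asyncio

async def run_instructor(category: str, count: int) -> None:
    for i in range(count):
        await asyncio.sleep(0.01)  # stands in for a batched model call
        item = {"instruction": f"{category} instruction #{i}", "response": "..."}
        # persist(item) would write a JSONL line and update FAISS here

async def run_all(instructor_counts: dict) -> None:
    tasks = [
        asyncio.create_task(run_instructor(category, count))
        for category, count in instructor_counts.items()
    ]
    await asyncio.gather(*tasks)

asyncio.run(run_all({"general": 3, "coding": 2}))
```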
4 Topic Generation and Management
Many instructors need topical diversity. If topics.txt is absent, initialize_topics() fabricates it by repeatedly asking the base model with a topic_prompt along the lines of:

“Generate obscure, interesting topics, avoiding any sensitive content. Return 8 numbered items.”

Each model response is parsed, de‑duplicated (case‑insensitively), and written to disk until the requested count (default 20) is hit.
These topics are later sampled per instructor to ensure downstream instructions don’t converge on a handful of subjects.
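Parsing the numbered replies and de‑duplicating them is straightforward. Here is a rough sketch of that step with a hypothetical helper, assuming the model returns lines like "1. Bioluminescent fungi":

```python
import re

def parse_topics(model_output: str, seen: set) -> list:
    """Extract topics from a numbered list and drop case-insensitive duplicates."""
    topics = []
    for line in model_output.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if not match:
            continue
        topic = match.group(1).strip()
        if topic.lower() not in seen:
            seen.add(topic.lower())
            topics.append(topic)
    return topics

print(parse_topics("1. Bioluminescent fungi\n2. Ottoman cartography\n3. bioluminescent fungi", set()))
# -> ['Bioluminescent fungi', 'Ottoman cartography']
```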
5 Embedding‑Based Novelty Filtering
A central design goal is maximal uniqueness. Every time an instructor proposes a candidate instruction, the orchestrator calls:
```python
await is_too_similar(text, min_score)
```

- The text is embedded (calculate_embeddings) using the same GTE model.
- FAISS returns the L2 distance to the nearest existing vector.
- If the distance is ≤ min_docsearch_score (default 0.35), the candidate is rejected.
Two immediate benefits:
- Eliminates accidental duplicates even across different instructor categories.
- Encourages the model to “think of another way” if it regurgitates something semantically close.
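A self‑contained sketch of this distance check, using sentence-transformers and a flat FAISS index with the defaults mentioned above (the real is_too_similar is an async method on the orchestrator and shares one index across all instructors):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-small")
index = faiss.IndexFlatL2(model.get_sentence_embedding_dimension())

def is_too_similar(text: str, min_score: float = 0.35) -> bool:
    """Reject a candidate if its nearest neighbour is within min_score (L2)."""
    vec = model.encode([text]).astype(np.float32)
    if index.ntotal == 0:
        return False
    distances, _ = index.search(vec, 1)
    return float(distances[0][0]) <= min_score

def add_to_index(text: str) -> None:
    index.add(model.encode([text]).astype(np.float32))

add_to_index("Explain how photosynthesis works.")
print(is_too_similar("Describe the process of photosynthesis."))  # likely True
```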
A larger de‑dup pass happens later in cull(), where similar responses across the entire file are grouped and the “best” instance is retained (via LLM judging, then longest length as tiebreaker).
6 Instructor Taxonomy and Prompt Templates
Under airoboros/instructors/ lie ~30 specialised generators. Each has a generate(instructor, **kwargs) coroutine that:
- Loads a jinja‑like template from prompts/.
- Fills placeholders ({batch_size}, {topics}, etc.).
- Calls await instructor.generate_response(...).
- Parses the model output into instruction/answer pairs.
- Optionally fires secondary calls to obtain answers (e.g. coding.py first asks for tasks, then separately asks for code).
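Condensed to its essentials, an instructor's generate() follows the shape below. This is a hypothetical skeleton of the pattern just described, not a copy of any specific module; instructor stands in for the SelfInstructor orchestrator, and the template is assumed to use the placeholders listed above.

```python
import random
import re

async def generate(instructor, template: str, topics: list, batch_size: int = 5):
    """Fill a template, call the model, parse numbered tasks, then fetch answers."""
    prompt = template.format(
        batch_size=batch_size,
        topics="\n".join(random.sample(topics, min(3, len(topics)))),
    )
    raw = await instructor.generate_response(prompt)
    if not raw:
        return
    for line in raw.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if not match:
            continue
        instruction = match.group(1).strip()
        response = await instructor.generate_response(instruction)
        if response:
            yield {"instruction": instruction, "response": response, "category": "general"}
```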
6.1 Inline Instructors
Many simple categories—joke, misconception, multiple_choice, riddle, trivia—share a helper inline_qa.generate(). This utility takes:
- start_key / end_key markers (QUESTION:, ANSWER:)
- A batch template with examples and constraints
- Extra template_kwargs for dynamic instructions (e.g. random option letters)

It returns clean pairs with minimal custom code.
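What such a helper has to do at parse time can be sketched as follows; this is a rough approximation of start/end‑key splitting, not the exact inline_qa implementation:

```python
import re

def parse_inline_pairs(raw: str, start_key: str = "QUESTION", end_key: str = "ANSWER"):
    """Split a batched model response into (question, answer) pairs via markers."""
    pattern = re.compile(
        rf"{start_key}:\s*(?P<q>.*?)\s*{end_key}:\s*(?P<a>.*?)(?=\n{start_key}:|\Z)",
        re.DOTALL,
    )
    return [(m.group("q").strip(), m.group("a").strip()) for m in pattern.finditer(raw)]

raw = "QUESTION: Why is the sky blue?\nANSWER: Rayleigh scattering.\nQUESTION: What is 2 + 2?\nANSWER: 4."
print(parse_inline_pairs(raw))
```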
6.2 Complex Multistage Instructors
- contextual.py builds BEGININPUT / BEGININSTRUCTION tasks with fake metadata, then synthesises answers in a second round using a dedicated contextual_response.txt template that enforces strict citation behaviour.
- rp.py spins up full role‑play chats with character cards drawn from character seeds, injects formatting rules (action delimiters, quoting), and runs dozens of conversational turns to create assistant messages grounded in prior context.
- detailed_writing.py creates 4000‑word narrative tasks, generates them in thirds to cope with context limits, merges, then rewrites for flow.
Each of these modules showcases advanced prompt programming: they seed the model with few‑shot examples, request multiple outputs in structured formats, and parse them with custom regexes.
7 Response Generation and Post‑Processing
Returning raw model output is rarely enough. Many instructors post‑process:
- coding.py strips code‑fenced markdown, ensuring “plain text only” if the instruction asked for PLAINFORMAT.
- rp.py cleans hallucinated character names, fixes misplaced action delimiters, and removes any REMINDER: disclaimers.
- stylized_response.py rescues “SKIP” markers so that jokes/lists aren’t needlessly role‑played when inappropriate.
These transformations help downstream finetuning by keeping target texts consistent and devoid of formatting artefacts that can confuse tokenizers.
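As one example of this cleanup, stripping markdown fences from a PLAINFORMAT answer might look like the sketch below (a hypothetical helper, not coding.py's exact logic):

```python
import re

def strip_code_fences(response: str) -> str:
    """Remove markdown code fences so a PLAINFORMAT answer stays plain text."""
    response = re.sub(r"^```[\w+-]*\s*\n", "", response, flags=re.MULTILINE)
    response = re.sub(r"\n?```\s*$", "", response, flags=re.MULTILINE)
    return response.strip()

print(strip_code_fences("```python\nprint('hello world')\n```"))
# -> print('hello world')
```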
8 Automated Judging and Culling
Quality control is two‑tier:
- Online filtering during generation (regex bans, FAISS distance).
- An offline cull invoked via the CLI (entrypoint.py cull-instructions).
The cull pass groups instructions by semantic similarity (again via embeddings), then for each cluster asks the model to grade answers using prompts/filter.txt:
“If the response obeys the instruction, contains no hallucinations, and scores at or above the threshold (100), output GOOD; otherwise output BAD.”
If multiple “good” candidates exist, the longest combined instruction + response wins. Everything else is purged.
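The selection within one similarity cluster then reduces to a simple rule, sketched here with a hypothetical grades mapping standing in for the LLM judging call:

```python
def pick_best(cluster: list, grades: dict):
    """Keep the longest GOOD row from a cluster of near-duplicate instructions.

    grades maps row index -> "GOOD"/"BAD" as produced by the judging prompt;
    if nothing is graded GOOD, the whole cluster is dropped.
    """
    good = [row for i, row in enumerate(cluster) if grades.get(i) == "GOOD"]
    if not good:
        return None
    return max(good, key=lambda r: len(r["instruction"]) + len(r["response"]))

cluster = [
    {"instruction": "Explain recursion.", "response": "A function that calls itself until a base case is reached."},
    {"instruction": "Explain recursion briefly.", "response": "A function calling itself."},
]
print(pick_best(cluster, {0: "GOOD", 1: "GOOD"}))  # keeps the longer pair
```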
9 Role‑Playing and Character‑Centric Data
Synthetic corpora often miss in‑character dialogue and long‑form chats. airoboros addresses this with a pipeline that:
- Generates character cards (character.py) from seed prompts.
- Stores them as JSON with description and stay_in_character guidance.
- Feeds cards to awareness.py, gtkm.py, stylized_response.py, and especially rp.py.
rp.py crafts multi‑speaker transcripts obeying strict formatting (actions, quotes, NEXT token). It also trains the model to sustain a persona across dozens of turns without obviously repeating itself—a crucial capability for chat‑style LLMs.
10 Quality Barriers: Safety, Readability, Diversity
Several guardrails are sprinkled throughout the code base:
- Flesch‑Kincaid hints: Many templates include a READABILITY_HINT (“score of 30 or lower – college level”) to push lexical richness.
- Topic avoidance: The config can list sensitive domains; regex exclusion in templates ensures those aren’t touched.
- Apology ban: Any response starting with “I’m sorry,” or “I can’t” is discarded.
- Rate limiting: Exponential back‑off prevents flooding APIs.
- Language override: A single language knob lets users localise the entire corpus (prompts + responses).
Together these produce data that is literate, topic‑diverse, and free of normative refusals that plague naïve self‑instruct sets.
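The apology ban, for instance, amounts to a handful of regexes applied to every candidate response. A hypothetical sketch (the project keeps its actual patterns configurable):

```python
import re

REFUSAL_PATTERNS = [
    re.compile(r"^\s*I'?m sorry", re.IGNORECASE),
    re.compile(r"^\s*I can'?t", re.IGNORECASE),
    re.compile(r"\bas an AI language model\b", re.IGNORECASE),
]

def passes_filters(response: str) -> bool:
    """Return False for responses that open with an apology or refusal."""
    return not any(pattern.search(response) for pattern in REFUSAL_PATTERNS)

print(passes_filters("I'm sorry, but I can't help with that."))      # False
print(passes_filters("Sure! Here's a haiku about mountain rivers."))  # True
```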
11 Integration with FAISS, SentenceTransformers, and LoRA
airoboros elegantly blends open‑source vector search (FAISS) with SentenceTransformer embeddings for instant novelty checks.
On the modelling side, the lmoe/ sub‑package hosts a lightweight Mixture‑of‑Experts API that grafts multiple LoRA adapters onto a single base model and routes requests either via a learned router or an agent prompting step. The Router chooses an expert by embedding the user instruction and comparing it to adapter descriptions.
This LMoE capability powers function‑calling data (category “agent”), where the synthetic prompts teach a model to emit YAML/JSON describing which tool to invoke.
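Stripped of adapter loading and the agent‑prompt fallback, the routing idea is nearest‑neighbour matching between the instruction embedding and per‑expert description embeddings. A minimal sketch with hypothetical expert descriptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-small")

EXPERTS = {
    "coding": "Writing, explaining, and debugging source code.",
    "creative": "Stories, role-play, and other creative writing.",
    "function": "Choosing a tool and emitting YAML/JSON function calls.",
}
names = list(EXPERTS)
description_vecs = model.encode(list(EXPERTS.values()), normalize_embeddings=True)

def route(instruction: str) -> str:
    """Pick the expert whose description is most similar to the instruction."""
    query = model.encode([instruction], normalize_embeddings=True)[0]
    return names[int(np.argmax(description_vecs @ query))]

print(route("Write a Python function that reverses a linked list."))  # likely "coding"
```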
12 Putting It All Together: An End‑to‑End Run
Imagine launching:

```bash
python -m airoboros.entrypoint generate-instructions --config-path=config.yaml
```
- Topic bootstrap produces topics.txt (if absent).
- FAISS index initialises with either a dummy doc or pre‑existing corpus.
- Instructors fire off concurrently; e.g. general will ask GPT‑4 to invent 5 instructions about random topics, parse them, then ask GPT‑4 again for each answer.
- Each accepted pair is appended to instructions.jsonl.
- Once all configured counts are satisfied, optional second‑round instructors (editor, stylized_response, gtkm) run, leveraging the freshly created data.
- A final cull command mercilessly weeds out bad or redundant rows, leaving a polished, balanced JSONL corpus—often hundreds of thousands of lines—that can be tokenized straight into LoRA or SFT training.
13 Strengths, Limitations, and Future Work
Strengths
- Composability – adding a new data type is trivial: drop a prompt template + write a generator.
- Online uniqueness guarantee – FAISS prevents wasted tokens.
- Automated self‑critique – the model grades its own output, closing the loop.
- Persona depth – RP pipelines teach models long‑range role consistency.
- Multi‑backend – OpenAI or VertexAI can be swapped via config.
Limitations
- Teacher‑student collapse – synthetic data inherits biases and errors of the base model.
- Embedding model scope – GTE‑small vectors might miss deep semantic duplicates.
- Token budget – enormous narrative tasks push model context limits; partial‑generation hacks (first third, etc.) mitigate but complicate training.
- LLM gatekeeping – heavy reliance on policy‑aligned APIs may silently refuse certain content, skewing dataset distribution.
Future Ideas
- Plug in newer embedding models (e.g. BGE‑Large) for finer similarity gates.
- Train a small reward model to replace the inline “GOOD/BAD” heuristic with continuous scores.
- Leverage retrieval‑augmented generation to ground synthetic facts in Wikipedia snapshots, boosting factuality.
- Add iterative self‑revise loops where the model critiques and rewrites its first draft.
14 Conclusion
airoboros exemplifies a second‑generation self‑instruct framework—moving from simple prompt/answer dumps to a multi‑layered, quality‑obsessed synthesis factory. By combining embedding‑based novelty, template‑driven instructor specialisation, automated LLM judging, and compulsory role‑play diversity, the project delivers a dataset that approaches the richness of expensive human curation at a fraction of the cost.
Whether you plug its JSONL straight into a LoRA finetune, distil it into retrieval chunks, or use it to bootstrap RLHF preference comparisons, airoboros offers a pragmatic blueprint for anyone needing lots of safe, diverse language data—today.