Synthetic data with
a proof attached.
Tell Misata the numbers your dataset has to hit: revenue from $50k to $200k, a 2% fraud rate, a December spike. A deterministic math engine builds relational data that lands on them exactly, then hands you the proof. No LLM in the loop.
The synthetic data engine
built like an instrument.
Most generators imitate data and hope it looks right. Misata works the other way around. You declare what must be true, and a constraint engine, an outcome solver, and a realism core make it true. Same input, same output, no model API key anywhere.
Your agent designs the schema.
Misata guarantees the math.
LLMs are excellent at deciding what a dataset should look like and terrible at generating ten thousand rows of it. Misata’s MCP server splits the work along that line. The agent sends a schema dict with tables, distributions, formulas, and outcome curves. It gets back generated files and an integrity proof it can show its user.
- Six MCP tools: generate from a story, from a schema dict, mimic a CSV, list domains, inspect capabilities.
- Every response carries
integrity.verifiedwith orphan counts per relationship. The agent never has to claim correctness it can’t check. - Generation burns zero tokens. The agent spends its context on design; the deterministic engine does the volume.
pip install "misata[mcp]"
{
"mcpServers": {
"misata": { "command": "misata-mcp" }
}
}
// agent → misata
generate_from_schema({ schema, rows, seed })
// misata → agent
{
"ok": true,
"files": ["customers.csv", "orders.csv"],
"integrity": {
"verified": true,
"relationships": [
{ "relationship":
"orders.customer_id → customers.customer_id",
"intact": true, "orphans": 0 }
]
}
}From prose to a full,
linked dataset.
No config files. No real data for warm-up. No API key. Describe the world, call generate, then assert the properties you declared. They hold.
import misata
tables = misata.generate(
"A fintech with 2,000 customers, checking and savings "
"accounts, and 20,000 transactions including a 2% fraud rate.",
seed=42,
)
customers = tables["customers"] # 2,000 rows
accounts = tables["accounts"] # ~4,000 rows
transactions = tables["transactions"] # ~20,000 rows
# Declared properties hold. Assert them.
assert (~transactions["account_id"]
.isin(accounts["account_id"])).sum() == 0 # 0 orphans
assert abs(transactions["is_fraud"].mean() - 0.02) < 0.005
# Same seed → same bytes, on any machine
pd.testing.assert_frame_equal(
customers, misata.generate(..., seed=42)["customers"])Generate the data
you can't legally use.
Open source, MIT, deterministic. The engine runs on your machine and nothing you generate ever leaves it. The math is public. So is the paper.
