v0.8.0.3 · open source · pip install misata

Synthetic data with
a proof attached.

Tell Misata the numbers your dataset has to hit: revenue from $50k to $200k, a 2% fraud rate, a December spike. A deterministic math engine builds relational data that lands on them exactly, then hands you the proof. No LLM in the loop.

0%
error on declared aggregates
0
orphan keys, verified per run
757
tests on the engine
1
arXiv paper on the math
misata · pythonrunning
How it works
Declare
Describe the world you need
Write a sentence, sketch a schema on the canvas, or let your AI agent send one over. Growth curves, fraud rates, seasonality: you set them as targets, and the engine treats them as law.
Generate
The math engine fills it in
Seeded, vectorized NumPy does the heavy lifting. Names match genders, appointments land on 15-minute grids, categories follow Zipf's law, and the distance from Chicago to San Diego is the real one. No API key. No model call.
Verify
Get a proof, not a promise
Every run ships an integrity report: foreign keys checked, roll-ups reconciled, declared curves matched. Run the GROUP BY yourself. The totals add up.
What Misata does

The synthetic data engine
built like an instrument.

Most generators imitate data and hope it looks right. Misata works the other way around. You declare what must be true, and a constraint engine, an outcome solver, and a realism core make it true. Same input, same output, no model API key anywhere.

Exact outcome curves
Draw revenue climbing from $50k to $200k and the generated rows sum to it exactly. Closed-form sampling, no rejection loops, formalized in our arXiv paper. The imitation synthesizers we benchmarked miss declared aggregates by 74 to 86 percent.
Relational data that reconciles
Parents come before children and every foreign key resolves. Roll-ups survive a real JOIN: a customer's total_spent equals the sum of their orders, a project's revenue equals the sum of its billed timesheets.
Realism that survives inspection
Names match genders across 11 cultures. Appointments land where a calendar would put them. Categories follow Zipf's law, city distances check out on a map, and reviews agree with their star ratings. Each classic tell of fake data has a mechanism built to kill it.
Unknown domains, composed honestly
A story outside the built-in domains gets a structural schema composed from its own entities. Misata wires up the tables and keys but never invents semantics, and it tells you exactly what it inferred.
Capsules: domain knowledge as a file
Mine vocabulary from your CSVs, or have an LLM write it once, into a single JSON file you can share. From then on generation is deterministic and free. Your domain's terms, no model in the loop.
Deterministic and auditable
Same seed, same dataset, byte for byte, on your laptop or in CI. The engine is MIT licensed, so the math your data came from is math you can actually read.
Built for AI agents

Your agent designs the schema.
Misata guarantees the math.

LLMs are excellent at deciding what a dataset should look like and terrible at generating ten thousand rows of it. Misata’s MCP server splits the work along that line. The agent sends a schema dict with tables, distributions, formulas, and outcome curves. It gets back generated files and an integrity proof it can show its user.

  • Six MCP tools: generate from a story, from a schema dict, mimic a CSV, list domains, inspect capabilities.
  • Every response carries integrity.verified with orphan counts per relationship. The agent never has to claim correctness it can’t check.
  • Generation burns zero tokens. The agent spends its context on design; the deterministic engine does the volume.
mcp config · claude / cursor / windsurf
pip install "misata[mcp]"

{
  "mcpServers": {
    "misata": { "command": "misata-mcp" }
  }
}

// agent → misata
generate_from_schema({ schema, rows, seed })

// misata → agent
{
  "ok": true,
  "files": ["customers.csv", "orders.csv"],
  "integrity": {
    "verified": true,
    "relationships": [
      { "relationship":
          "orders.customer_id → customers.customer_id",
        "intact": true, "orphans": 0 }
    ]
  }
}
Five lines

From prose to a full,
linked dataset.

No config files. No real data for warm-up. No API key. Describe the world, call generate, then assert the properties you declared. They hold.

Try it in the studio
quickstart.py
import misata

tables = misata.generate(
    "A fintech with 2,000 customers, checking and savings "
    "accounts, and 20,000 transactions including a 2% fraud rate.",
    seed=42,
)

customers    = tables["customers"]      # 2,000 rows
accounts     = tables["accounts"]       # ~4,000 rows
transactions = tables["transactions"]   # ~20,000 rows

# Declared properties hold. Assert them.
assert (~transactions["account_id"]
          .isin(accounts["account_id"])).sum() == 0   # 0 orphans
assert abs(transactions["is_fraud"].mean() - 0.02) < 0.005

# Same seed → same bytes, on any machine
pd.testing.assert_frame_equal(
    customers, misata.generate(..., seed=42)["customers"])

Generate the data
you can't legally use.

Open source, MIT, deterministic. The engine runs on your machine and nothing you generate ever leaves it. The math is public. So is the paper.