Fragment is a structured context layer for building products. It connects what your customers say, what your product does, and what your company knows into one structured source of truth that stays current, always. Your team, and the AI agents you build with, work from that context to figure out what to build, instead of digging through Gong calls, Slack threads, and stale docs.

Fragment is for product teams and builders who want to ship the right things faster and smarter. Product managers use it to discover customer needs, validate their ideas, and craft build-ready specs grounded in what their customers really want. Product leaders use it to back their roadmaps, ground their product strategy, and connect business goals to customer needs. Researchers use it to compress weeks of customer and user research into hours. Founders and builders use it to understand their customers, ship the right things, and iterate less, even without a research team behind them. Product marketers use Fragment to bring real customer language into their positioning and messaging. Your AI agents use it to prevent context drift and build with fewer iterations. Fragment is how product teams, builders, and their AI agents ship the right things faster instead of running a feature factory.

What can I do with Fragment?

Fragment does the product work that usually eats your week, grounded in what your customers really want and need. With Fragment you can:

- Turn a rough idea into specs your team and your agents can build from
- Run continuous discovery hands-free, with every Gong, Zoom, and Google Meet call analyzed for you
- Find hidden patterns in customer feedback with automated synthesis
- Keep a live picture of your customers, product, and company in one place
- Generate VoC reports, customer personas, Jobs To Be Done for your users, and more
- Prioritize with evidence instead of the loudest opinion in the room
- Answer any customer or product question grounded in your customer feedback, product reality, and company knowledge and goals

Does Fragment work with my AI agents (Claude, Cursor, Codex)?

Yes. Fragment connects to Claude Code, Cursor, and Codex through MCP, so your agents pull your customer, product, and company context in any of your workflows. No more agents guessing from a thin prompt. Fragment turns every agent into a long-time employee that actually knows your customers, product, and company.

How does Fragment take a rough idea to a spec?

Fragment takes an idea from hunch to build-ready spec. Start with a business goal like "I want to grow the number of seats per customer", a rough idea, or a question. Fragment pulls in the right context, checks it against what customers actually need, works through solutions with you, and drafts a spec grounded in real evidence and your product. Each spec carries the customer need, the business goal, your product reality, and the decisions behind it, so your team and your agents can start building without guessing.

What can Fragment connect to, and what can I upload?

Fragment works with the tools your conversations already live in, and takes almost anything else you upload. Native integrations: Gong, Zoom (with Fragment's AI notetaker), Google Drive, Fireflies, and Zapier. Beyond those, drop sources straight into a project: transcripts, analytics exports, survey results, PRDs, documents, spreadsheets, images, audio, and video.

Do I have to tag, train, or maintain anything?

No. There's nothing to tag, no taxonomy to maintain, no manual entry, and no model to train. Fragment analyzes every new call, message, and upload automatically, pulls out the feedback, and rolls it into your themes and company context. It gets sharper the more it sees. You keep working, and the picture stays current on its own.

How do I get started, and is there a free plan?

Yes, there's a free plan, and you can be up and running in minutes without connecting anything. Sign up, enter your company website, and Fragment builds your company context automatically so you can start working right away. Hook up Gong, Zoom, or Google Drive when you're ready, and every past conversation gets analyzed.

How can I find out more?

We love chatting with builders. Drop us a line at [info@fragment.fit](mailto:info@fragment.fit), join our [Slack community](https://join.slack.com/t/fragment-community/shared_invite/zt-3v3oh5oxw-eg_d0aPlW0BtqDWqmObVow), or [book a demo](https://calendly.com/igal-itskovich/your-fragment-demo).

Playbooks · March 26, 2026 · By Avidan Nadav

How to Build Your First Eval System in a Week

Everyone says you need evals. Almost nobody tells you how to start. Here's a concrete five-day path from zero to a working eval you can run on every release.

"You need evals" is the most repeated and least actionable advice in AI product right now. Everyone nods. Almost nobody tells you what to actually do on Monday morning.

So here's the Monday-morning version. By Friday you'll have a working eval system: a fixed set of real cases, a way to score them, and a single number that tells you whether your feature got better or worse since the last release. Not a perfect system. A real one — which beats a perfect one you never get around to building.

ℹ️ Info

An eval, minus the jargon, is three things: a set of real inputs, a way to judge each output, and a score you can track over time. That's it. Everything fancier is just an upgrade to one of those three parts.

The mental model: evals are TDD for AI

If you've ever written a test before writing the code, you already understand evals. Test-driven development gave software a superpower: a regression suite that screams the moment you break something that used to work. For twenty years, that safety net has been a big part of why teams could move fast without setting fire to production.

AI features tore a hole in that net, because "correct" is now fuzzy and probabilistic. Evals are how you patch it. Think of them as the regression tests for behavior you can't pin down with an assertEquals. The discipline that made TDD work — write the check first, run it on every change, treat a drop as a real failure — is the same discipline that makes evals work. The eval crowd online likes to put it bluntly: "vibes don't scale."

Day 1: Collect 20 real cases (your golden set)

Do not invent test cases. Go pull 20 real inputs your feature has seen or will see — actual customer emails, actual queries, actual documents. Real data fails in ways your imagination politely won't.

Bias the set on purpose: roughly half "normal" cases, half nasty ones, pulled straight from your failure-mode map if you've done one. An eval that's all happy-path inputs will reassure you right up until the moment it shouldn't.

Store it in the dumbest format that works. A spreadsheet, one row per case, is completely fine for week one. "Golden set" just means the trusted reference cases you grade against — borrowed from ML, no magic.
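If the spreadsheet is exported as CSV, loading it takes a few lines. This is a minimal sketch, not a prescribed format: the column names (`case_id`, `kind`, `input`, `expected`) and the two sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical golden set: one row per case, as exported from a spreadsheet.
# Column names and rows are illustrative assumptions, not a required schema.
GOLDEN_CSV = """case_id,kind,input,expected
c01,normal,"Please move my delivery to 2026-04-02.",2026-04-02
c02,nasty,"move it to next tues?? or the 4th, whichever",
"""

def load_golden_set(text):
    """Parse CSV text into a list of dicts, one per case."""
    return list(csv.DictReader(io.StringIO(text)))

cases = load_golden_set(GOLDEN_CSV)
kinds = [c["kind"] for c in cases]
print(len(cases), "cases;", kinds.count("nasty"), "nasty")
```

Leaving `expected` empty for a case is one way to flag that it needs a rubric rather than a single right answer.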

Day 2: Write the answer key and the rubric

For each case, write down what a good output looks like. Two flavors:

Cases with a right answer (did it extract the correct date? yes/no). Easy — write the expected answer.
Cases with no single right answer (is this draft reply good?). Write a short rubric instead: three to five checks like "accurate to policy," "right tone," "invents no facts." Pass = clears all of them.
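The "pass = clears all of them" rule is simple enough to write down. Here is a rough sketch, assuming each rubric check is graded as a plain pass/fail; the `RubricResult` type and `case_passes` function are hypothetical names, and the three checks are the examples from above.

```python
from dataclasses import dataclass

@dataclass
class RubricResult:
    check: str
    passed: bool

# Example rubric for a "draft reply" case, using the three checks named above.
RUBRIC = ["accurate to policy", "right tone", "invents no facts"]

def case_passes(results):
    """A case passes only if every rubric check was graded and passed."""
    graded = {r.check: r.passed for r in results}
    # A check that was never graded counts as a failure, not a pass.
    return all(graded.get(check, False) for check in RUBRIC)

grades = [
    RubricResult("accurate to policy", True),
    RubricResult("right tone", True),
    RubricResult("invents no facts", False),
]
print(case_passes(grades))  # False: one failed check fails the case
```

Treating an ungraded check as a failure keeps a half-filled row from quietly counting as a pass.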

⚠️ Warning

Write the rubric before you look at the model's outputs. Grade first and write the rubric after, and you'll unconsciously write a rubric the model already passes. That's not evaluation. That's drawing the target around the arrow after it lands.

Day 3: Pick how you grade

Three options, cheapest first. Use the cheapest one that fits each case.

Exact / programmatic. The answer is checkable in code — a date, a number, a valid JSON shape. Fastest, most reliable, use it everywhere you can.
Human. You read the output and score it against the rubric. Slow, but the gold standard for subjective cases, and totally fine at 20 cases.
LLM-as-judge. A second model grades the output against your rubric. It scales beautifully, but its grades are sometimes wrong, so spot-check them against your own on a handful of cases before you trust it with the rest.
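For the exact / programmatic option, the graders can be tiny. A sketch of two of them, assuming dates come back in ISO format (`YYYY-MM-DD`) and that "valid JSON shape" means a JSON object with a known set of keys; both function names are illustrative.

```python
import json
from datetime import date

def grade_date(output, expected):
    """Pass if the output parses as an ISO date equal to the expected one."""
    try:
        return date.fromisoformat(output.strip()) == date.fromisoformat(expected)
    except ValueError:
        # Anything that is not a parseable ISO date is a fail, not a crash.
        return False

def grade_json_shape(output, required_keys):
    """Pass if the output is a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(required_keys) <= set(parsed)

print(grade_date(" 2026-04-02\n", "2026-04-02"))
print(grade_json_shape('{"date": "2026-04-02"}', {"date", "amount"}))
```

Both graders return `False` on malformed output instead of raising, so one bad case does not stop the rest of the run.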

For week one: programmatic where possible, human for the rest. Add the LLM judge later, once you trust your rubric enough to trust a machine reading it.
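When you do add the judge, the spot-check can be a simple agreement count. A minimal sketch, assuming both you and the judge record a pass/fail per case ID; the grades below are invented for illustration, and how much agreement is "enough" is left to you.

```python
def agreement_rate(human, judge):
    """Fraction of shared cases where the judge's pass/fail matches the human's."""
    shared = sorted(set(human) & set(judge))
    if not shared:
        raise ValueError("no overlapping cases to compare")
    matches = sum(human[c] == judge[c] for c in shared)
    return matches / len(shared)

# Made-up grades for five spot-checked cases.
human_grades = {"c01": True, "c02": False, "c03": True, "c04": False, "c05": True}
judge_grades = {"c01": True, "c02": True, "c03": True, "c04": False, "c05": True}

rate = agreement_rate(human_grades, judge_grades)
disagreements = [c for c in human_grades if human_grades[c] != judge_grades[c]]
print(f"{rate:.0%} agreement; review: {disagreements}")
```

The disagreements are the useful part: each one is either a judge mistake or a rubric line that reads two ways.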

Day 4: Run the baseline

Run all 20 cases through the current version and score every one. Add it up. That number — say, 14 out of 20 — is your baseline.

This is the moment the whole thing pays off. For the first time, you have an objective statement about how good your feature is. Not "it feels solid." Fourteen out of twenty, with the six failures named and waiting.
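The arithmetic is trivial, but it is worth keeping the failing case IDs next to the number. A sketch with made-up results that reproduce the 14-out-of-20 example; the case IDs and the `baseline` helper are hypothetical.

```python
def baseline(results):
    """Return (passed, total, failing case ids) from a case_id -> pass/fail map."""
    failures = [case_id for case_id, ok in results.items() if not ok]
    return len(results) - len(failures), len(results), failures

# Invented results: 20 cases, six of them failing.
results = {f"c{i:02d}": True for i in range(1, 21)}
for failed in ("c03", "c07", "c11", "c12", "c16", "c19"):
    results[failed] = False

passed, total, failures = baseline(results)
print(f"{passed}/{total} passed; failing: {failures}")
```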

A score only means something against a bar, though. Deciding what number is good enough is its own small craft, and it's worth doing on purpose — I wrote up how to set your MVQ thresholds for exactly that.

Day 5: Wire it into your release

An eval you run once is a report. An eval you run every release is a system. That difference is the entire value.

Make the rule simple and boring: before any change to the feature ships, run the eval. Score drops? You've got a regression, and you treat it like a broken test — not a vibe, not a debate. Even if "running it" this week means manually re-scoring the spreadsheet over coffee, the discipline is the point. Automate it once it's earned the trust.
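Once you do automate it, the gate can be as small as a comparison that returns an exit code. A sketch under the assumption that "score drops" means the pass count fell below the stored baseline; `check_release` is a hypothetical name, and the numbers are illustrative.

```python
def check_release(baseline_passed, current_passed):
    """Return an exit code: 0 if the score held or improved, 1 on a drop."""
    if current_passed < baseline_passed:
        print(f"REGRESSION: {current_passed} < baseline {baseline_passed}")
        return 1
    print(f"OK: {current_passed} >= baseline {baseline_passed}")
    return 0

# Example: the baseline was 14, and this change scores 13.
exit_code = check_release(baseline_passed=14, current_passed=13)
```

A release script could hand `exit_code` to `sys.exit`, so a drop fails the build the same way a broken test would.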

The artifact you keep

A sheet with 20 real cases, an expected answer or rubric for each, a grading method, and a running column of scores by date. That's your eval system. It looks humble. It will also save you from shipping a regression in front of a customer, which is the single most expensive thing an AI team does.

Start with 20 cases this week. Grow it to 200 once it's earned its place. Most teams are still arguing about whether they need evals. You'll have one running by Friday.

Playbooks · Feb 24, 2026

How to Set Your Three MVQ Thresholds

MVQ only works if you turn it into numbers you commit to before launch. Here's how to set the floor, the bar, and the ceiling for an AI feature, with a worked example.

Playbooks · Feb 4, 2026

How to Map a Feature's Failure Modes in 30 Minutes

Most teams discover how an AI feature breaks after a customer does. This is a 30-minute ritual that surfaces the failures first. You walk out with a one-page failure signature.