From answers to action · RemotePass AI case

From answers

to proactive answers

An AI work layer for RemotePass. It takes the manual review queue and turns it into a workflow you can trust to act for you.

Jasper de Winter · Senior AI PM case

01 Context

132 things to review. One at a time.

RemotePass is becoming the one place to hire and pay anyone, anywhere. Today that also means a review queue that keeps growing: 132 items across contracts, payroll, expenses, time, documents and bills. You clear them one approval at a time, and the obvious problems sit in there with everything else.

132 items in the review queue

02 What I found

Two AI features today. Neither one acts for you.

Ask AI is useful, but it waits for you. You have to know what to ask, and it only tells you the answer.
The Payroll Agent is proactive, but only on payroll, and only to flag. It spots a problem, then leaves the fix to you.
Everything else (expenses, contracts, time, documents, bills) has no agent at all. Obvious anomalies slip straight through.

03 Where AI fits

Where AI creates the most value here.

Two questions separate today's AI from real help. One: is it reactive, or proactive? Two: does it only show you the information, or act on it? Place RemotePass on those two axes and the gap is obvious.

Axes: Shows it → Acts on it · Reactive → Proactive

The work layer

Ask AI today

Payroll Agent today

Proactive Ask AI: the pull surface

Agentic Review Center: the push surface · the bet

One layer, two surfaces

The Review Center (push) and a proactive Ask AI (pull) are the same move on two surfaces: more proactive, more able to act. They share one trust foundation, so they are two faces of a single work layer, not two projects.

Onboarding copilot

A different domain: building compliant contracts and multi-country onboarding. Real value, but a bigger build, more compliance risk, and it runs far less often. A later bet, parked off this map on purpose.

The top-right corner is where the value is, and the Review Center is the surface that gets us there fastest.

04 Prioritisation

Highest toil, highest stakes, easiest to prove.

Four questions, scored honestly, then a clear call.

Impact: How much busywork and money or compliance risk it takes off your plate.

Strategic fit: One review surface for every category, instead of yet another separate tool. That is the "one place" promise.

Risk if wrong: What it costs if the AI gets it wrong. Lower is better.

Can we prove it: Whether we can show it working, now, in this case.

Opportunity · Impact · Fit · Risk if wrong (lower is better) · Provable

Agentic Review Center: The biggest pile of toil and money at risk, it builds on the Payroll Agent we already ship, and we can demo it today.
Impact High · Fit High · Risk if wrong Med · Provable High

Proactive Ask AI: A strong vision and low risk, but harder to show. It is a surface, not a clear step-by-step flow.
Impact Med · Fit High · Risk if wrong Low · Provable Med

Onboarding copilot: High value, but a bigger build with more compliance exposure. Worth proving later.
Impact Med · Fit Med · Risk if wrong High · Provable Low

The verdict: the Review Center wins on impact, fit, contained risk and how provable it is. The medium risk if it errs is real, but it is exactly what makes it the most honest thing to demo. Ask AI is the better long-term vision; onboarding is the bigger build for later.
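To make the call reproducible, the table can be turned into a rough score. This is a sketch only: the 3/2/1 mapping, the equal weighting and the `score` helper are my assumptions for illustration, not part of the case. Risk is inverted because lower is better.

```python
# Illustrative scoring of the prioritisation table.
# Assumption: High=3, Med=2, Low=1, equal weights, risk inverted.
RATING = {"High": 3, "Med": 2, "Low": 1}

OPPORTUNITIES = {
    "Agentic Review Center": {"impact": "High", "fit": "High", "risk": "Med", "provable": "High"},
    "Proactive Ask AI": {"impact": "Med", "fit": "High", "risk": "Low", "provable": "Med"},
    "Onboarding copilot": {"impact": "Med", "fit": "Med", "risk": "High", "provable": "Low"},
}


def score(ratings: dict[str, str]) -> int:
    """Sum the four ratings, flipping risk so that low risk scores high."""
    positive = sum(RATING[ratings[key]] for key in ("impact", "fit", "provable"))
    inverted_risk = 4 - RATING[ratings["risk"]]
    return positive + inverted_risk


# Rank the three opportunities from highest to lowest score.
ranked = sorted(OPPORTUNITIES, key=lambda name: score(OPPORTUNITIES[name]), reverse=True)
for name in ranked:
    print(f"{name}: {score(OPPORTUNITIES[name])}")
```

Under these assumptions the ordering matches the verdict, but the gap between the first two is small, so the written reasoning carries more weight than the number.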

05 The insight

Don't add a chatbot. Climb the ladder.

The agent starts by watching and asking. It earns more autonomy step by step, and you decide how far it goes. You stay in the loop by default.

The prototype demonstrates step 2, recommend and verify, and lets you switch on step 3, configurable autonomy, for low-risk items only.

L0 · Reactive: Answers when you ask. This is Ask AI today.

L1 · Proactive: Comes to you with the issues it finds, across every category.

L2 · Recommend & verify (prototype): Proposes the fix with its evidence. When it is unsure, it asks the contractor.

L3 · Configurable autonomy: Handles low-risk items on its own, with undo. You set how far, and it learns what you trust.
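The ladder can be read as a set of capabilities that unlock rung by rung. A minimal sketch, assuming each rung keeps everything below it; the action names (`answer`, `flag_issue` and so on) are hypothetical labels, not RemotePass identifiers.

```python
from enum import IntEnum


class Level(IntEnum):
    REACTIVE = 0               # L0: answers when asked
    PROACTIVE = 1              # L1: surfaces issues unprompted
    RECOMMEND_VERIFY = 2       # L2: proposes a fix, asks when unsure
    CONFIGURABLE_AUTONOMY = 3  # L3: handles low-risk items, with undo


# Capabilities added at each rung (hypothetical names).
UNLOCKS = {
    Level.REACTIVE: {"answer"},
    Level.PROACTIVE: {"flag_issue"},
    Level.RECOMMEND_VERIFY: {"propose_fix", "ask_contractor"},
    Level.CONFIGURABLE_AUTONOMY: {"auto_handle_low_risk"},
}


def allowed_actions(level: Level) -> set[str]:
    """Assumption: each rung keeps everything below it and adds its own."""
    actions: set[str] = set()
    for rung in Level:
        if rung <= level:
            actions |= UNLOCKS[rung]
    return actions


print(sorted(allowed_actions(Level.RECOMMEND_VERIFY)))
```

The point of modelling it this way is that autonomy is a setting the reviewer raises, never a default the agent assumes.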

06 The solution

132 in the queue. 18 that actually need you.

One Review Agent reads all 132 items at once. Most are routine and check out fine, so it clears those from your attention and surfaces only the ones with a real signal. In this walkthrough that leaves 18 issues, sorted into three buckets.

132

Scanned

Every item in the queue, across all 8 categories.

114

Routine & clean

Unchanged salaries, in-policy spend, verified docs. Confirmed, not bothered with.

18

Need a human look

A real signal: a duplicate, a missing payment, an over-policy expense.

Auto-handled, with undo

Low-risk only, and only once you switch on Trust Levels. Always reversible.

Needs your decision

A recommendation with the evidence it is based on. You approve.

Needs clarification

Not enough to decide, so the agent drafts a question to the contractor.

3 + 9 + 6 = 18. Grant low-risk autonomy and routine items move from "your decision" into "auto-handled", so the queue shrinks to genuine judgment calls. (Numbers are illustrative for this walkthrough.)
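The bucket logic behind that sum can be sketched in a few lines. This is an illustration, not the prototype's code: the `Issue` fields, the two risk tiers and the reading of 3 + 9 + 6 as 3 low-risk items, 9 decisions and 6 clarifications are assumptions on my part.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Issue:
    item_id: str
    risk: str           # "low" or "high" (hypothetical tiers)
    has_evidence: bool  # enough evidence to recommend an action?


def bucket(issue: Issue, autonomy_on: bool) -> str:
    """Route one flagged issue into one of the three buckets."""
    if not issue.has_evidence:
        return "needs_clarification"
    if autonomy_on and issue.risk == "low":
        return "auto_handled"
    return "needs_decision"


# 18 illustrative issues. Assumed split: 3 low risk, 9 high risk, 6 unclear.
issues = (
    [Issue(f"low-{i}", "low", True) for i in range(3)]
    + [Issue(f"high-{i}", "high", True) for i in range(9)]
    + [Issue(f"unclear-{i}", "high", False) for i in range(6)]
)


def summarise(autonomy_on: bool) -> Counter:
    """Count how many issues land in each bucket."""
    return Counter(bucket(issue, autonomy_on) for issue in issues)


print(summarise(autonomy_on=False))
print(summarise(autonomy_on=True))
```

With autonomy off, every issue that has evidence waits for a decision. Switching it on moves only the low-risk ones, and the total stays at 18.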

07 The PRD

The Product Requirements Document, on one screen.

The spec the prototype is built from: the problem, the goals, the users, and the requirements ranked P0 to P2.

Problem

"Manage it all in one place" today means a review queue that keeps swelling: 132 items, one approval at a time, across 8 categories. Obvious anomalies (duplicates, missing payments, out-of-policy spend) slip through, reviewers burn hours, and money and compliance risk sit there unattended.

Goals

Cut the time it takes to clear the queue.
Catch real issues before they cost money, missing as few as possible.
Keep a human in control. Every action reversible.

Non-goals

It never auto-pays.
No training on customer data.
Help the reviewer, don't replace them.

Users & jobs to be done

Payroll / finance approver: "When I open the queue, clear what's safe and point me at what's risky, without me losing control."
Ops / compliance lead: "Show me the agent only ever acts within bounds I set, and let me audit every action."

Requirements

P0

Triage into 3 buckets · cited evidence and a recommended action · a drafted "ask the contractor" message · Trust Levels (risk tier by autonomy, never auto-pays) · undo and audit log.
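Two of the P0 items, Trust Levels and the audit log, fit together naturally. A minimal sketch, assuming three risk tiers and an append-only log; `TRUST_LEVELS`, `decide` and `undo` are hypothetical names, and a real undo would also reverse the underlying change.

```python
from dataclasses import dataclass, field

# Hypothetical policy: per risk tier, how far the reviewer lets the agent go.
TRUST_LEVELS = {"low": "auto_with_undo", "medium": "recommend", "high": "recommend"}


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, item_id: str, action: str, actor: str) -> None:
        # Append-only: nothing is ever edited or removed.
        self.entries.append({"item": item_id, "action": action, "actor": actor})


def decide(item_id: str, risk_tier: str, action: str, log: AuditLog) -> str:
    """Return how the agent handles a proposed action, and log it."""
    if action == "pay":
        outcome = "recommend"  # hard guardrail: it never auto-pays
    else:
        outcome = TRUST_LEVELS.get(risk_tier, "recommend")
    log.record(item_id, f"{outcome}:{action}", actor="agent")
    return outcome


def undo(item_id: str, log: AuditLog) -> None:
    """Record the reversal as its own entry rather than deleting history."""
    log.record(item_id, "undo", actor="reviewer")


log = AuditLog()
print(decide("exp-1", "low", "approve_expense", log))
print(decide("pay-1", "low", "pay", log))
undo("exp-1", log)
```

Unknown tiers fall back to "recommend", so a misconfigured policy fails towards the human.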

P1

A self-learning Reviewer Profile (shows where each rule came from, editable) · per-category overrides · scheduled proactive runs.

P2

A shared agent framework other teams build on · proactive Ask AI that pushes · onboarding and compliance copilot.

Success metrics

North-star: time to clear the queue. Outcome: money protected and errors prevented. Guardrails: precision and recall, override rate and false-positive rate. See §10.

Data & AI considerations

A mix of rules, retrieval and a frontier LLM (no fine-tune at v1). Personal data is stripped out before any model call, stays in the customer's account, and is never used to train. Evals gate every release. See §09.
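Stripping personal data before a model call could start as simply as the sketch below. It is illustrative only: the two patterns are simplified assumptions, and a production redactor would need far broader coverage (names, addresses, national IDs) and its own evals.

```python
import re

# Simplified, illustrative patterns. Not a complete personal-data detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IBAN = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


def redact(text: str) -> str:
    """Replace emails and IBAN-like strings with placeholders before a model call."""
    text = EMAIL.sub("[EMAIL]", text)
    return IBAN.sub("[IBAN]", text)


print(redact("Pay jane.doe@example.com via GB82WEST12345698765432"))
```

Placeholders keep the sentence readable for the model while the raw values stay in the customer's account.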

08 See it live · the centrepiece

Try it yourself.

This is the heart of the case: a working, on-brand prototype. Log in and run the agent on a real queue. Triage the 18 issues, open the duplicate and read its evidence, set Trust Levels and watch low-risk items clear themselves, and accept a rule into your Reviewer Profile. A short guide walks you through every step.

09 Why it's safe, and senior

Agentic, in a money and compliance domain, done responsibly.

Rules and LLM, each doing its job: deterministic rules catch the clear anomalies (duplicate, currency, out of range). The LLM explains, groups and reasons on the edge cases. It never owns arithmetic it shouldn't.
Privacy first: data stays in the customer's account, personal details are stripped out before anything reaches a model, and nothing is ever used to train.
Evals and guardrails: a precision and recall bar before anything is surfaced, a second pass on high-severity items, confidence routing, a full audit trail, full reversibility, and no auto-pay, ever.
Watched and gated: deep observability on every agent run, and it ships as an opt-in beta first, where errors are cheap, before autonomy widens.
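The "rules do the clear anomalies" half is the easy part to picture. A minimal sketch of a deterministic duplicate rule, assuming each item carries `payee`, `amount` (in minor units), `currency` and `date` fields, plus a 7-day window; all of these are hypothetical choices for illustration.

```python
from collections import defaultdict
from datetime import date


def find_duplicates(items, window_days=7):
    """Flag pairs with the same payee, amount and currency within window_days."""
    groups = defaultdict(list)
    for item in items:
        groups[(item["payee"], item["amount"], item["currency"])].append(item)

    flagged = []
    for group in groups.values():
        group.sort(key=lambda it: it["date"])
        # Compare each item with the next one in date order.
        for prev, curr in zip(group, group[1:]):
            if (curr["date"] - prev["date"]).days <= window_days:
                flagged.append((prev["id"], curr["id"]))
    return flagged


queue = [
    {"id": "a", "payee": "Acme", "amount": 120000, "currency": "EUR", "date": date(2024, 5, 1)},
    {"id": "b", "payee": "Acme", "amount": 120000, "currency": "EUR", "date": date(2024, 5, 3)},
    {"id": "c", "payee": "Acme", "amount": 120000, "currency": "USD", "date": date(2024, 5, 3)},
]
print(find_duplicates(queue))
```

No model is involved in the match itself. The LLM's job would start afterwards, explaining why the pair looks like a duplicate and grouping related cases.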

10 Success metrics

One north-star. One outcome. One trust bar.

North-star

Time to clear

How fast the whole review queue goes from full to just the genuine judgment calls. If the agent works, this drops.

Outcome

Money protected

Errors prevented and money saved: duplicates caught, missing payments surfaced, out-of-policy spend stopped.

Trust guardrail

Recall over precision

A missed real issue is the costliest error, so we tune for recall first, then watch the override and false-positive rates so trust holds.
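In practice, "recall over precision" usually means choosing the flagging threshold from labelled data. A sketch under assumptions: the agent emits a score per item, labels are booleans, and `pick_threshold` is a hypothetical helper that expects a non-empty list of scores.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every item scoring >= threshold is flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def pick_threshold(scores, labels, min_recall):
    """Highest candidate threshold whose recall still meets the bar."""
    for threshold in sorted(set(scores), reverse=True):
        _, recall = precision_recall(scores, labels, threshold)
        if recall >= min_recall:
            return threshold
    return min(scores)  # fallback: flag everything


scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [True, False, True, True, False]
chosen = pick_threshold(scores, labels, min_recall=1.0)
print(chosen, precision_recall(scores, labels, chosen))
```

Recall sets the floor and precision is whatever remains, which is why the override and false-positive rates have to be watched alongside it.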

11 How I'd validate this

What I don't yet know, and how I'd find out.

This case is built on desk research and one walkthrough. Before betting real roadmap on it, here's how I'd earn confidence, and who would do what.

Discovery

Interview 6 to 8 finance and payroll approvers on how they triage today, and where they lose time and trust.

Evidence

Mine Ask AI logs and Payroll Agent dismissals for the real mix of anomalies and how many false positives people tolerate. Size the bet with data, not a hunch.

Design-partner beta

Ship detect and recommend to 2 or 3 existing Payroll Agent customers. Measure acceptance and override before widening autonomy.

Product: the bet, the requirements and metrics, the beta partners.

Design: Trust Levels and Reviewer Profile UX, keeping control easy to read.

ML / Eng: the rules and LLM, the evals, guardrails and observability.

Data / Legal: anomaly labels, personal-data handling, audit and compliance.

12 Rollout & GTM

Ship where errors are cheap. Earn the right to more autonomy.

Phase 0 · Internal dogfood

Run it on our own queue and build the eval set.

Phase 1 · Closed beta

Existing Payroll Agent customers. Detect and recommend, a human applies.

Phase 2 · GA across categories

Then proactive runs, and the shared agent framework other teams build on.

How we set up evals

1 · Labelled golden set: Built from the dogfood queue. Every known anomaly type (duplicate, currency, out of range, missing payment) with the right answer, plus held-out edge cases, per category and severity.

2 · Score precision and recall: Measured per category and severity, with a stricter bar on high-severity. Every prompt or model change has to pass the offline regression before it ships.

3 · Shadow, then grow: New runs go in shadow first. Acceptance and override rates become live evals, and human-labelled disagreements feed back so the set keeps growing.
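The offline regression in step 2 could look roughly like this. It is a sketch, not the team's harness: the bars in `BARS`, the golden-set fields and the `release_gate` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical bars: (min_precision, min_recall), stricter on high severity.
BARS = {"high": (0.90, 0.99), "low": (0.80, 0.90)}


def release_gate(golden_set, predict):
    """Return the (category, severity) slices that miss their bar. Empty means ship."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for example in golden_set:
        key = (example["category"], example["severity"])
        predicted = predict(example)
        actual = example["is_anomaly"]
        if predicted and actual:
            counts[key]["tp"] += 1
        elif predicted:
            counts[key]["fp"] += 1
        elif actual:
            counts[key]["fn"] += 1

    failures = []
    for key, c in counts.items():
        precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 1.0
        recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 1.0
        min_precision, min_recall = BARS[key[1]]
        if precision < min_precision or recall < min_recall:
            failures.append(key)
    return failures


golden = [
    {"category": "expenses", "severity": "high", "is_anomaly": True, "flag": True},
    {"category": "expenses", "severity": "high", "is_anomaly": True, "flag": False},
    {"category": "payroll", "severity": "low", "is_anomaly": True, "flag": True},
]
# Stand-in for the agent: reads a precomputed flag from each example.
print(release_gate(golden, predict=lambda ex: ex["flag"]))
```

Scoring per slice matters because a strong aggregate can hide a weak high-severity category, which is exactly the miss the gate exists to stop.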

Who's involved: Design, ML and Eng, Data, Legal and Compliance, CS, GTM. How we sell it: the "AI Work Layer" capability, with privacy-first and human-in-the-loop as the story.

13 Vision

From answers to action.

One AI Work Layer. Ask AI (pull) and the Review Agent (push) are two faces of it, on one shared trust foundation. Every team plugs in, and every workflow climbs the ladder.
