The Contractor Skills Test Pack
A fill‑in‑the‑blanks pack to launch a paid, 90‑minute async skills test with calibrated AI judging, human sampling, anti‑cheat, and fair stipends—designed for Automation Builder and Content Ops roles.
Duplicate this pack into your workspace, replace the [BRACKETS] with your details, and ship a paid, 90‑minute async skills test this week. It includes two ready roles (Automation Builder and Content Ops), golden‑set items with answer keys, rubricized pairwise AI judging, confidence bands, a human‑sampling SOP, anti‑cheat, pay‑band tables, and a candidate‑facing one‑pager.
How to use:
- Fill the [ORG], [ROLE], and [CONTACT] fields across sections.
- Swap in your product context and examples.
- Run 3–5 pilot attempts (internal) to calibrate pass bands.
- Publish the candidate one‑pager and start inviting applicants.
- Review borderlines via the human‑sampling SOP and log decisions in the audit log.
Quick‑start: 90‑minute async hiring test blueprint
- Test name: [ROLE] 90‑minute async skills test
- Time cap: 90 minutes (hard stop)
- Submission window: [START_DATE]–[END_DATE]; late submissions auto‑fail unless [EXEMPTION_RULE]
- Delivery: [SUBMISSION_PORTAL_URL] (Google Drive link or portal upload)
- Payment: [CURRENCY][STIPEND_AMOUNT] within [PAYMENT_DAYS] days via [PAYMENT_METHOD]
- Grading: Pairwise LLM judge (permutation debiased) + human sampling on borderlines
- Pass bands: Pass ≥0.65 win‑rate; Borderline 0.55 to <0.65; Reject <0.55 (see Confidence Bands)
- Contacts: Ops owner [OWNER_NAME] — [OWNER_EMAIL]; Appeals [APPEAL_EMAIL]
- Privacy & fairness: No screen takeover; minimal proctoring; anonymized review; appeal route below
Role 1 template: Automation Builder test (webhook → transform → error handler)
Goal: Prove you can wire a simple inbound webhook → transform → error‑handled handoff in under 90 minutes using the stack you’ll use on the job.
Stack: [STACK_TOOL] (e.g., Make, Zapier, n8n) + [CODE_RUNTIME] (optional for transforms)
Scenario prompt (paste to candidates):
- You receive POST requests at /lead containing lead payloads from multiple sources. Normalize to a shared schema and forward to Slack + a Google Sheet. Log and retry on transient errors.
- Hidden edge cases (the grader checks these):
  - Currency strings like "€1,200.50" → decimal 1200.50 (strip symbols; support commas)
  - Missing email should route to an incomplete sheet and post a Slack warning
  - Non‑UTF‑8 characters must be sanitized (use the replacement character)
  - Idempotency: duplicate lead_id events must not duplicate rows
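For your own calibration (not shared with candidates), here is a minimal sketch of the normalization those edge cases imply, assuming Python as the [CODE_RUNTIME]; the symbol map and function names are illustrative, not required:

import re
from decimal import Decimal

# Illustrative symbol map only; extend to the currencies you actually receive.
SYMBOL_TO_CODE = {"$": "USD", "€": "EUR", "£": "GBP"}

def parse_amount(raw):
    """Turn strings like '€1,200.50' into (Decimal('1200.50'), 'EUR')."""
    currency = next((code for sym, code in SYMBOL_TO_CODE.items() if sym in raw), None)
    digits = re.sub(r"[^\d.,-]", "", raw)              # strip currency symbols and spaces
    return Decimal(digits.replace(",", "")), currency  # drop thousands separators

def sanitize(raw_bytes):
    """Decode a payload, replacing invalid byte sequences with U+FFFD."""
    return raw_bytes.decode("utf-8", errors="replace")

# parse_amount("€1,200.50") -> (Decimal('1200.50'), 'EUR')
# parse_amount("$1,050.00") -> (Decimal('1050.00'), 'USD')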
Golden‑set items (internal; not shared with candidates):
Item A (Happy path)
Input JSON:
{"lead_id":"a1","name":"K. Duarte","email":"k@ex.co","amount":"$1,050.00","source":"ads"}
Expected normalized:
{"lead_id":"a1","name":"K. Duarte","email":"k@ex.co","amount":1050.00,"currency":"USD","source":"ads"}
Checks: row in Sheet 'leads', Slack message in #leads, HTTP 200
Item B (Currency parse)
Input:
{"lead_id":"b2","name":"S. Rao","email":"s@ex.co","amount":"€1,200.50","source":"referral"}
Expected: amount 1200.50, currency EUR
Item C (Missing email)
Input:
{"lead_id":"c3","name":"A. Chen","amount":"440","source":"organic"}
Expected: row in 'incomplete', Slack warn, no row in 'leads'
Item D (Duplicate id)
First input: {"lead_id":"d4",...}
Second input: identical payload 30s later
Expected: only one row exists; second request logged as duplicate
Rubric & weights (Automation):
- Data correctness (normalization, currency parsing) — 40%
- Robustness (error handling, retries, idempotency) — 35%
- Observability (clear logs, alerts) — 15%
- Maintainability (naming, comments, foldering) — 10%
Submission artifacts required:
- [WORKFLOW_SHARE_LINK]
- [SCREENSHOT_LOGS_LINK]
- [NOTES_MD_LINK] (short rationale + known gaps)
Role 2 template: Content Ops test (brief → outline → draft)
Goal: Show you can turn a short brief into a clean outline and a tight draft that matches voice, facts, and constraints in under 90 minutes.
Scenario prompt (paste to candidates):
- You’re writing a 600–800 word blog post: "[TOPIC]: A 3‑step playbook for [AUDIENCE]" with a 140–160‑char meta description.
- Use the supplied brand voice notes. Include at least one number, one caution, and a 3‑step playbook. Avoid hype.
Brand voice excerpt (share to candidates):
- Tone: plainspoken, operator‑grade; short sentences; no guru talk
- Banned words: "revolutionary", "game‑changing"
- Style: use specific numbers; prefer checklists; minimal adverbs
Golden‑set brief (internal key):
Brief: "Cut client onboarding time by 50% in Make.com"
Outline (gold): H1, 3 steps (collect→provision→verify), 'gotchas' box, 1 mini‑case, CTA.
Draft (gold): 680–720 words; includes % baseline math; warns about OAuth token expiry; links to SLA template.
Meta: 150 chars; includes 'async' and 'Make.com'.
Factual anchors: Make scenario limit (e.g., 100 ops/min) mentioned once; retry/backoff basics.
Rubric & weights (Content Ops):
- Factual accuracy (claims grounded; no hallucinations) — 35%
- Structure (clear outline; scannable; headings do work) — 25%
- Voice adherence (no hype; concise; numbers) — 25%
- Brief compliance (length, must‑include elements) — 15%
Submission artifacts required:
- [OUTLINE_DOC_LINK]
- [DRAFT_DOC_LINK]
- [SOURCES_LINKS] (list URLs used)
Golden‑set structure and calibration method
Foldering (suggested):
- /golden-set/[ROLE]/items/*.json — one file per item
- /golden-set/[ROLE]/answers/*.md|.json — canonical answers/flows
- /golden-set/[ROLE]/rubric.json — criteria + weights
- /golden-set/versions.json — semver + change notes
Minimum composition:
- 6–10 items per role: 4 happy‑path, 2–3 edge cases, 1 failure‑handling
- Diversity: vary input lengths, formats, and traps (verbosity, position)
- Each item must include: prompt_to_candidate (if applicable), input, expected_behavior, acceptance_checks[], and rationale
Example item schema:
{
"id": "auto-item-b-currency",
"role": "automation",
"input": {"amount": "€1,200.50", ...},
"expected": {"amount": 1200.50, "currency": "EUR"},
"acceptance_checks": [
"sheet.row.amount == 1200.50",
"sheet.row.currency == 'EUR'"
],
"rationale": "Tests parse + locale handling"
}
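If you script the grading, acceptance_checks[] can be run mechanically. A minimal sketch, assuming each check is a "dotted.path == literal" string and that you capture the candidate's outputs into a nested dict (both are assumptions, not requirements of this pack):

import ast
from functools import reduce

def lookup(observed, dotted_path):
    """Walk a dotted path like 'sheet.row.amount' through nested dicts."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), observed)

def run_checks(item, observed):
    """Evaluate each 'path == literal' acceptance check against captured outputs."""
    results = {}
    for check in item["acceptance_checks"]:
        path, expected = [part.strip() for part in check.split("==")]
        results[check] = (lookup(observed, path) == ast.literal_eval(expected))
    return results

# Item B from the golden set, with hypothetical captured outputs:
observed = {"sheet": {"row": {"amount": 1200.50, "currency": "EUR"}}}
item = {"acceptance_checks": ["sheet.row.amount == 1200.50", "sheet.row.currency == 'EUR'"]}
print(run_checks(item, observed))   # both checks True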
Calibration run:
- Have 2–3 internal testers attempt the full test under the same 90‑min cap.
- Compute win‑rates vs gold; adjust rubric weights so intended "hire" profiles score ≥0.65 and obvious "no‑hires" fall <0.55.
- Freeze versions.json as v1.0.0 and record prompts/models in the audit log.
Pairwise AI‑judge prompt (with permutation debiasing)
Use pairwise, rubricized judging instead of raw 1–10s. Run A vs B and B vs A (permutation) to cut position effects.
Judge ensemble:
- [JUDGE_MODEL_1] primary; optional [JUDGE_MODEL_2], [JUDGE_MODEL_3] as tie‑breakers
- Majority vote across models; break ties by higher confidence
Prompt template (pairwise; per item):
System: You are a strict hiring assessor. Judge which candidate output better satisfies the rubric for the given task. Do not reward length or style beyond the rubric.
User:
TASK CONTEXT:
[ITEM_CONTEXT]
RUBRIC (weights in %):
1) [CRITERION_1] — [W1]
2) [CRITERION_2] — [W2]
3) [CRITERION_3] — [W3]
4) [CRITERION_4] — [W4]
CANDIDATE OUTPUT A:
[OUTPUT_A]
CANDIDATE OUTPUT B:
[OUTPUT_B]
INSTRUCTIONS:
- Make a single decision: A, B, or Tie.
- Justify briefly per criterion.
- Ignore superficial verbosity; reward correctness and rubric fit.
- Return JSON only in the schema below.
JSON SCHEMA:
{
"winner": "A|B|Tie",
"confidence": 0.0–1.0,
"per_criterion": [
{"criterion": "[CRITERION_1]", "why": "...", "edge_cases_considered": true},
{"criterion": "[CRITERION_2]", "why": "...", "edge_cases_considered": false}
],
"flags": ["verbosity_bias?","position_bias?","format_mismatch?"],
"notes": "one sentence"
}
Permutation debiasing:
- For each item, run the prompt twice: (A=candidate, B=gold) and (A=gold, B=candidate). Flag any observed position bias for later review.
- Score each run as win=1 if the candidate beats gold, 0 if it loses, 0.5 for a tie; if the two orderings disagree, treat the item as a tie (0.5).
Length control:
- Pre‑trim both outputs to [MAX_TOKENS_PER_OUTPUT] tokens or [MAX_CHARS] chars to avoid verbosity bias.
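Putting the pieces together, a minimal runner sketch; judge stands in for whatever client wraps [JUDGE_MODEL_1] and returns the JSON schema above, and MAX_CHARS maps to [MAX_CHARS]:

MAX_CHARS = 8000

def trim(text):
    """Length control: pre-trim both outputs to curb verbosity bias."""
    return text[:MAX_CHARS]

def item_win(judge, item_context, candidate, gold):
    """Run A/B and B/A, score the candidate against gold, tie on disagreement."""
    candidate, gold = trim(candidate), trim(gold)
    run_ab = judge(item_context, output_a=candidate, output_b=gold)   # A = candidate
    run_ba = judge(item_context, output_a=gold, output_b=candidate)   # A = gold

    def score(run, candidate_slot):
        if run["winner"] == "Tie":
            return 0.5
        return 1.0 if run["winner"] == candidate_slot else 0.0

    s_ab, s_ba = score(run_ab, "A"), score(run_ba, "B")
    return s_ab if s_ab == s_ba else 0.5   # disagreement counts as a tie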
Confidence‑band calculator spec (with Wilson CI)
Aggregator rules:
- For each item i, compute win_i ∈ {1, 0.5, 0} after the AB and BA runs.
- Candidate win‑rate: p̂ = (Σ win_i) / N_items.
- Also compute a 95% Wilson interval for p̂.
Pass/borderline/reject (defaults; tune after calibration):
- Pass if p̂ ≥ 0.65 OR Wilson lower bound ≥ 0.60.
- Borderline if 0.55 ≤ p̂ < 0.65 OR Wilson interval straddles 0.60.
- Reject if p̂ < 0.55 AND Wilson upper bound < 0.60.
Wilson interval (for N items, successes w = Σ win_i where ties count as 0.5):
z = 1.96
phat = w / N
A = phat + z^2/(2N)
B = z * sqrt((phat*(1-phat) + z^2/(4N)) / N)
C = 1 + z^2/N
lower = (A - B)/C
upper = (A + B)/C
Pseudocode:
if phat >= 0.65 or lower >= 0.60: PASS
elif 0.55 <= phat < 0.65 or (lower < 0.60 and upper >= 0.60): BORDERLINE → human review
else: REJECT
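The same logic as a runnable sketch (standard-library Python; thresholds mirror the defaults above and should be re-tuned after calibration):

import math

def wilson_bounds(w, n, z=1.96):
    """95% Wilson interval for win-rate w/n (ties already counted as 0.5 in w)."""
    phat = w / n
    a = phat + z**2 / (2 * n)
    b = z * math.sqrt((phat * (1 - phat) + z**2 / (4 * n)) / n)
    c = 1 + z**2 / n
    return (a - b) / c, (a + b) / c

def band(wins):
    """Map per-item wins to PASS / BORDERLINE / REJECT using the default bands."""
    n, w = len(wins), sum(wins)
    phat = w / n
    lower, upper = wilson_bounds(w, n)
    if phat >= 0.65 or lower >= 0.60:
        return "PASS"
    if 0.55 <= phat < 0.65 or (lower < 0.60 and upper >= 0.60):
        return "BORDERLINE"   # route to human review
    return "REJECT"

# 8 items: 5 wins, 2 ties, 1 loss -> p-hat 0.75 -> PASS
print(band([1, 1, 1, 1, 1, 0.5, 0.5, 0]))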
Logging:
- Store per‑item winners, confidence, and flags; keep JSON responses for audit.
Human‑sampling SOP (10–20%) + appeal flow
When to sample:
- Always sample BORDERLINE decisions.
- Always sample if any critical criterion failed (mark criteria with [CRITICAL=true] in rubric.json).
- Randomly sample 10–20% of clear PASS decisions weekly, stratified by source and region.
How to sample:
- Draw a stratified sample: [SAMPLE_RATE]% of PASS across (role, source, region); 100% of BORDERLINE.
- Assign to [REVIEWER_POOL] with a 24–48h SLA.
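If you script the queue, a minimal stratified-draw sketch; the field names (band, role, source, region) are assumptions about how you store decisions, so swap in your own:

import math
import random
from collections import defaultdict

def weekly_sample(decisions, sample_rate=0.15):
    """100% of BORDERLINE plus sample_rate of PASS per (role, source, region) stratum."""
    queue = [d for d in decisions if d["band"] == "BORDERLINE"]
    strata = defaultdict(list)
    for d in decisions:
        if d["band"] == "PASS":
            strata[(d["role"], d["source"], d["region"])].append(d)
    for group in strata.values():
        k = max(1, math.ceil(sample_rate * len(group)))   # never skip a stratum entirely
        queue.extend(random.sample(group, k))
    return queue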
Human review rubric:
- Use the same rubric; grade independently without seeing the AI decision.
- Allowed outcomes: Uphold, Upgrade, Downgrade. Add 1–2 sentence rationale.
Appeal flow (candidate‑facing):
- Email [APPEAL_EMAIL] within [APPEAL_WINDOW_DAYS] days with subject "Appeal: [ROLE] — [YOUR_NAME]".
- You’ll receive a human re‑review within [APPEAL_SLA_DAYS] days. We return rubric feedback either way.
Error taxonomy (label issues):
- Misparse/Incorrect (data correctness)
- Robustness gap (error handling/idempotency)
- Voice/style mismatch (content)
- Factual error (content)
- Other: [FREE_TEXT]
Anti‑cheat checklist (minimal‑overhead)
Use light‑touch integrity checks that don’t invade privacy or penalize neurodiversity.
Required:
- Randomized inputs per candidate (rotate golden-set minor variants monthly on [ROTATION_DAY])
- Time cap enforced by [PORTAL/FORM]; late lock at 90 minutes
- Honor statement checkbox (no external collaboration; cite sources)
- Version rotation: model prompts and golden‑set version pinned to each invite
Optional (choose one, not all):
- Tab‑switch logging only (no webcam)
- Plagiarism scan on text (flag, don’t auto‑fail)
Avoid:
- Always‑on screen/video recording for a 90‑min take‑home
- Location/IP geofencing beyond basic fraud checks
Review flags (auto):
- Unusual submission latency patterns
- Duplicate uncommon phrasing across multiple candidates
- Perfect match to public GitHub gists tied to this prompt
Regional pay‑bands + stipend calculator (edit this first)
How to set fair, simple stipends:
- Compute from your market midpoint rate × 1.5 hours. Round up to sensible numbers.
- Publish the amount up front. Pay within [PAYMENT_DAYS] days.
Formula:
stipend = midpoint_hourly_rate * 1.5
minimums: content ≥ [CURRENCY]45, automation ≥ [CURRENCY]60
Directional defaults (replace with your own data):
Content Ops hourly (midpoints):
- US/Canada: $40–50 → use $45
- W. Europe/UK: €35–45 → use €40
- E. Europe: $25–35 → use $30
- LatAm: $20–30 → use $25
- India/SEA: $15–25 → use $20
Automation Builder hourly (midpoints):
- US/Canada: $55–75 → use $65
- W. Europe/UK: €45–65 → use €55
- E. Europe: $35–50 → use $42
- LatAm: $28–45 → use $36
- India/SEA: $22–40 → use $30
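In code, the formula plus the role minimums is a few lines; the example reproduces the E. Europe automation row in the table below:

import math

def stipend(midpoint_hourly, role_minimum):
    """stipend = midpoint hourly rate x 1.5 hours, floored at the role minimum, rounded up."""
    return max(math.ceil(midpoint_hourly * 1.5), role_minimum)

# E. Europe, Automation Builder: $42/h midpoint, $60 minimum -> $63
print(stipend(42, 60))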
Example stipend table (edit):
| Region | Content 90‑min | Automation 90‑min |
|--------------|-----------------|-------------------|
| US/Canada | $68 | $98 |
| W. Europe/UK | €60 | €83 |
| E. Europe | $45 | $63 |
| LatAm | $38 | $54 |
| India/SEA | $30 | $45 |
Candidate one‑pager (Notion template)
Copy/paste this into Notion or your job portal. Fill in brackets before publishing.
Title: [ROLE] — Paid 90‑minute async skills test
What you’ll do:
- Complete a focused [ROLE] task under a 90‑minute cap
- Deliver [ARTIFACTS_LIST]
- You’ll be paid [CURRENCY][STIPEND_AMOUNT] within [PAYMENT_DAYS] days
How we grade:
- A calibrated rubric + AI judge compares your output to a gold standard
- We human‑review all borderlines and 10–20% of passes
Fairness & privacy:
- No webcam or screen takeover; minimal tab logging only
- You can appeal to [APPEAL_EMAIL] within [APPEAL_WINDOW_DAYS] days
Pass bands:
- Pass ≥0.65 win‑rate; Borderline 0.55 to <0.65; Reject <0.55
Scope guardrails:
- Don’t overbuild; ship the smallest thing that passes the rubric
- If you hit a blocker, note it in [NOTES_MD_LINK] and move on
Audit log & changelog (HELM‑style)
Log every change to datasets, prompts, models, and weights. This protects candidates and your team.
Fields (copy to a sheet or Notion DB):
- Date — [YYYY‑MM‑DD]
- Version — [MAJOR.MINOR.PATCH]
- Role — [automation|content]
- Changed — [dataset|rubric|prompt|model|weights|bands]
- Summary — [WHAT_CHANGED]
- Why — [WHY_CHANGED]
- Owners — [NAME1], [NAME2]
- Back‑compat note — [IMPACT]
- Links — [PR/COMMIT_URL], [DOC_URL]
Sample entry:
2026‑05‑08 | v1.1.0 | automation | dataset | Added EUR parse case; tightened idempotency check | Candidates missed locale parsing; false positives on dupes | S. Ramires | Back‑compat: none | PR#42, rubric v1.1 link
Implementation runbook (operator checklist)
Inputs (configure in your repo or sheet):
- [ORG_NAME], [ROLE]
- [STACK_TOOL], [CODE_RUNTIME]
- [JUDGE_MODEL_1], [JUDGE_MODEL_2] (optional), [MAX_TOKENS_PER_OUTPUT]
- [PASS_BAND]=0.65, [BORDERLINE_LOW]=0.55, [CONFIDENCE_FLOOR]=0.60
- [APPEAL_EMAIL], [PAYMENT_METHOD], [PAYMENT_DAYS]
- [ROTATION_DAY] (e.g., last Friday monthly)
Automation checklist:
- Duplicate this pack → private repo/folder
- Fill stipend table and publish candidate one‑pager
- Load golden‑set items into /golden-set/[ROLE]/items
- Wire submit portal + time cap
- Implement judge prompt and permutation runner
- Build Wilson CI sheet/calc and pass‑band logic
- Set up sampling queue + reviewer pool
- Enable minimal anti‑cheat flags
- Ship pilot to 3 internal testers → calibrate
- Go live + start audit log at v1.0.0
Rubric files (drop‑in JSON templates)
Use this reference format for your rubric.json files. Edit criterion names, weights, and criticality per role.
{
"role": "automation",
"criteria": [
{"name": "Data correctness", "weight": 0.40, "critical": true},
{"name": "Robustness", "weight": 0.35, "critical": true},
{"name": "Observability", "weight": 0.15, "critical": false},
{"name": "Maintainability", "weight": 0.10, "critical": false}
],
"bands": {"pass": 0.65, "borderline_low": 0.55, "confidence_floor": 0.60}
}
{
"role": "content",
"criteria": [
{"name": "Factual accuracy", "weight": 0.35, "critical": true},
{"name": "Structure", "weight": 0.25, "critical": false},
{"name": "Voice adherence", "weight": 0.25, "critical": false},
{"name": "Brief compliance", "weight": 0.15, "critical": false}
],
"bands": {"pass": 0.65, "borderline_low": 0.55, "confidence_floor": 0.60}
}
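Optional: a small sanity check before go-live catches weights that drift away from 1.0 after edits (the example path follows the suggested foldering; adjust to yours):

import json

def load_rubric(path):
    """Load rubric.json and check that criterion weights still sum to 1.0."""
    with open(path) as f:
        rubric = json.load(f)
    total = sum(c["weight"] for c in rubric["criteria"])
    assert abs(total - 1.0) < 1e-9, f"weights sum to {total}, expected 1.0"
    return rubric

# load_rubric("golden-set/automation/rubric.json")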