
The Contractor Skills Test Pack

A fill‑in‑the‑blanks pack to launch a paid, 90‑minute async skills test with calibrated AI judging, human sampling, anti‑cheat, and fair stipends—designed for Automation Builder and Content Ops roles.

Duplicate this pack into your workspace, replace the [BRACKETS] with your details, and ship a paid, 90‑minute async skills test this week. It includes two ready roles (Automation Builder and Content Ops), golden‑set items with answer keys, rubricized pairwise AI judging, confidence bands, a human‑sampling SOP, anti‑cheat, pay‑band tables, and a candidate‑facing one‑pager.

How to use:

  1. Fill the [ORG], [ROLE], and [CONTACT] fields across sections.
  2. Swap in your product context and examples.
  3. Run 3–5 pilot attempts (internal) to calibrate pass bands.
  4. Publish the candidate one‑pager and start inviting applicants.
  5. Review borderlines via the human‑sampling SOP and log decisions in the audit log.

Quick‑start: 90‑minute async hiring test blueprint

  • Test name: [ROLE] 90‑minute async skills test
  • Time cap: 90 minutes (hard stop)
  • Submission window: [START_DATE]–[END_DATE]; late submissions auto‑fail unless [EXEMPTION_RULE]
  • Delivery: [SUBMISSION_PORTAL_URL] (Google Drive link or portal upload)
  • Payment: [CURRENCY][STIPEND_AMOUNT] within [PAYMENT_DAYS] days via [PAYMENT_METHOD]
  • Grading: Pairwise LLM judge (permutation debiased) + human sampling on borderlines
  • Pass bands: Pass ≥0.65 win‑rate; Borderline ≥0.55 and <0.65; Reject <0.55 (see Confidence Bands)
  • Contacts: Ops owner [OWNER_NAME] — [OWNER_EMAIL]; Appeals [APPEAL_EMAIL]
  • Privacy & fairness: No screen takeover; minimal proctoring; anonymized review; appeal route below

Role 1 template: Automation Builder test (webhook → transform → error handler)

Goal: Prove you can wire a simple inbound webhook → transform → error‑handled handoff in under 90 minutes using the stack you’ll use on the job.

Stack: [STACK_TOOL] (e.g., Make, Zapier, n8n) + [CODE_RUNTIME] (optional for transforms)

Scenario prompt (paste to candidates):

  • You receive POST requests at /lead containing lead payloads from multiple sources. Normalize to a shared schema and forward to Slack + a Google Sheet. Log and retry on transient errors.
  • Hidden edge cases (the grader checks these):
    1. Currency strings like "€1,200.50" → decimal 1200.50 (strip symbols; support commas)
    2. Missing email should route to an incomplete sheet and post a Slack warning
    3. Non‑UTF‑8 characters must be sanitized (replacement char ✓)
    4. Idempotency: duplicate lead_id events must not duplicate rows
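
To make the hidden checks concrete, here is a minimal Python sketch of the transform step. Field names, routing labels, and the in‑memory dedupe store are illustrative assumptions, not the required implementation; candidates can solve this any way their stack allows.

import re

# Illustrative only: production dedupe needs durable storage, not process memory
SEEN_LEAD_IDS = set()
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}

def parse_amount(raw: str):
    """Strip symbols and thousands separators: "€1,200.50" -> (1200.5, "EUR")."""
    currency = next((code for sym, code in CURRENCY_SYMBOLS.items() if sym in raw), "USD")
    return float(re.sub(r"[^\d.]", "", raw)), currency

def normalize(lead: dict):
    """Route an inbound lead: returns ('duplicate' | 'incomplete' | 'leads', payload)."""
    if lead["lead_id"] in SEEN_LEAD_IDS:
        return "duplicate", None                    # idempotency: log, never re-insert
    SEEN_LEAD_IDS.add(lead["lead_id"])
    amount, currency = parse_amount(str(lead.get("amount", "0")))
    name = lead.get("name", "")
    if isinstance(name, bytes):                     # non-UTF-8 bytes -> U+FFFD
        name = name.decode("utf-8", errors="replace")
    payload = {"lead_id": lead["lead_id"], "name": name, "email": lead.get("email"),
               "amount": amount, "currency": currency, "source": lead.get("source")}
    # Missing email routes to the 'incomplete' sheet plus a Slack warning
    return ("leads" if payload["email"] else "incomplete"), payload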

Golden‑set items (internal; not shared with candidates):

Item A (Happy path)
Input JSON:
{"lead_id":"a1","name":"K. Duarte","email":"k@ex.co","amount":"$1,050.00","source":"ads"}
Expected normalized:
{"lead_id":"a1","name":"K. Duarte","email":"k@ex.co","amount":1050.00,"currency":"USD","source":"ads"}
Checks: row in Sheet 'leads', Slack message in #leads, HTTP 200

Item B (Currency parse)
Input:
{"lead_id":"b2","name":"S. Rao","email":"s@ex.co","amount":"€1,200.50","source":"referral"}
Expected: amount 1200.50, currency EUR

Item C (Missing email)
Input:
{"lead_id":"c3","name":"A. Chen","amount":"440","source":"organic"}
Expected: row in 'incomplete', Slack warn, no row in 'leads'

Item D (Duplicate id)
First input: {"lead_id":"d4",...}
Second input: identical payload 30s later
Expected: only one row exists; second request logged as duplicate

Rubric & weights (Automation):

  • Data correctness (normalization, currency parsing) — 40%
  • Robustness (error handling, retries, idempotency) — 35%
  • Observability (clear logs, alerts) — 15%
  • Maintainability (naming, comments, foldering) — 10%

Submission artifacts required:

  • [WORKFLOW_SHARE_LINK]
  • [SCREENSHOT_LOGS_LINK]
  • [NOTES_MD_LINK] (short rationale + known gaps)

Role 2 template: Content Ops test (brief → outline → draft)

Goal: Show you can turn a short brief into a clean outline and a tight draft that matches voice, facts, and constraints in under 90 minutes.

Scenario prompt (paste to candidates):

  • You’re writing a 600–800 word blog post: "[TOPIC]: A 3‑step playbook for [AUDIENCE]" with a 140–160‑char meta description.
  • Use the supplied brand voice notes. Include at least one number, one caution, and a 3‑step playbook. Avoid hype.

Brand voice excerpt (share to candidates):

  • Tone: plainspoken, operator‑grade; short sentences; no guru talk
  • Banned words: "revolutionary", "game‑changing"
  • Style: use specific numbers; prefer checklists; minimal adverbs

Golden‑set brief (internal key):

Brief: "Cut client onboarding time by 50% in Make.com"
Outline (gold): H1, 3 steps (collect→provision→verify), 'gotchas' box, 1 mini‑case, CTA.
Draft (gold): 680–720 words; includes % baseline math; warns about OAuth token expiry; links to SLA template.
Meta: 150 chars; includes 'async' and 'Make.com'.
Factual anchors: Make scenario limit (e.g., 100 ops/min) mentioned once; retry/backoff basics.

Rubric & weights (Content Ops):

  • Factual accuracy (claims grounded; no hallucinations) — 35%
  • Structure (clear outline; scannable; headings do work) — 25%
  • Voice adherence (no hype; concise; numbers) — 25%
  • Brief compliance (length, must‑include elements) — 15%

Submission artifacts required:

  • [OUTLINE_DOC_LINK]
  • [DRAFT_DOC_LINK]
  • [SOURCES_LINKS] (list URLs used)

Golden‑set structure and calibration method

Foldering (suggested):

  • /golden-set/[ROLE]/items/*.json — one file per item
  • /golden-set/[ROLE]/answers/*.md|.json — canonical answers/flows
  • /golden-set/[ROLE]/rubric.json — criteria + weights
  • /golden-set/versions.json — semver + change notes

Minimum composition:

  • 7–10 items per role: 4–5 happy‑path, 2–3 edge cases, 1–2 failure‑handling
  • Diversity: vary input lengths, formats, and traps (verbosity, position)
  • Each item must include: prompt_to_candidate (if applicable), input, expected_behavior, acceptance_checks[], and rationale

Example item schema:

{
  "id": "auto-item-b-currency",
  "role": "automation",
  "input": {"amount": "€1,200.50", ...},
  "expected": {"amount": 1200.50, "currency": "EUR"},
  "acceptance_checks": [
    "sheet.row.amount == 1200.50",
    "sheet.row.currency == 'EUR'"
  ],
  "rationale": "Tests parse + locale handling"
}
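
Acceptance checks in this pack are plain `path == literal` strings evaluated against a dotted‑path view of observed state. A minimal evaluator sketch, assuming that convention; the flattening helper is hypothetical, not part of any tool:

import ast

def flatten(obj, prefix=""):
    """Flatten nested dicts to dotted paths: {'sheet': {'row': {'amount': 1}}} -> {'sheet.row.amount': 1}."""
    flat = {}
    for key, val in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(val, dict):
            flat.update(flatten(val, path))
        else:
            flat[path] = val
    return flat

def run_checks(checks, observed):
    """Evaluate 'path == literal' checks; return the list of failing checks."""
    flat = flatten(observed)
    failures = []
    for check in checks:
        path, _, literal = check.partition("==")
        if flat.get(path.strip()) != ast.literal_eval(literal.strip()):
            failures.append(check)
    return failures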

Calibration run:

  • Have 2–3 internal testers attempt the full test under the same 90‑min cap.
  • Compute win‑rates vs gold; adjust rubric weights so intended "hire" profiles score ≥0.65 and obvious "no‑hires" fall <0.55.
  • Freeze versions.json as v1.0.0 and record prompts/models in the audit log.

Pairwise AI‑judge prompt (with permutation debiasing)

Use pairwise, rubricized judging instead of raw 1–10 scores. Run A vs B and B vs A (permutation) to cut position effects.

Judge ensemble:

  • [JUDGE_MODEL_1] primary; optional [JUDGE_MODEL_2], [JUDGE_MODEL_3] as tie‑breakers
  • Majority vote across models; break ties by higher confidence
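
A sketch of that vote‑then‑confidence tiebreak, assuming each model returns the parsed JSON verdict from the prompt template below:

from collections import Counter

def ensemble_verdict(judgements):
    """judgements: one {'winner': 'A'|'B'|'Tie', 'confidence': float} per model."""
    votes = Counter(j["winner"] for j in judgements).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:    # split vote -> highest confidence wins
        return max(judgements, key=lambda j: j["confidence"])["winner"]
    return votes[0][0]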

Prompt template (pairwise; per item):

System: You are a strict hiring assessor. Judge which candidate output better satisfies the rubric for the given task. Do not reward length or style beyond the rubric.

User:
TASK CONTEXT:
[ITEM_CONTEXT]

RUBRIC (weights in %):
1) [CRITERION_1] — [W1]
2) [CRITERION_2] — [W2]
3) [CRITERION_3] — [W3]
4) [CRITERION_4] — [W4]

CANDIDATE OUTPUT A:
[OUTPUT_A]

CANDIDATE OUTPUT B:
[OUTPUT_B]

INSTRUCTIONS:
- Make a single decision: A, B, or Tie.
- Justify briefly per criterion.
- Ignore superficial verbosity; reward correctness and rubric fit.
- Return JSON only in the schema below.

JSON SCHEMA:
{
  "winner": "A|B|Tie",
  "confidence": 0.0–1.0,
  "per_criterion": [
    {"criterion": "[CRITERION_1]", "why": "...", "edge_cases_considered": true},
    {"criterion": "[CRITERION_2]", "why": "...", "edge_cases_considered": false}
  ],
  "flags": ["verbosity_bias?","position_bias?","format_mismatch?"],
  "notes": "one sentence"
}

Permutation debiasing:

  • For each item, run the prompt twice: (A=candidate, B=gold) and (A=gold, B=candidate). If the verdict flips with the ordering, flag position bias and queue the item for human review.
  • Score as win=1 if candidate beats gold, 0 if loses, 0.5 if tie.
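
A sketch of the two‑ordering runner and scoring; judge_fn is an assumed wrapper around one call to the prompt above, returning the parsed winner:

def judge_item(candidate, gold, judge_fn):
    """Run the pairwise judge in both orders; return (win_score, position_bias_flag)."""
    first = judge_fn(candidate, gold)     # ordering 1: A = candidate, B = gold
    second = judge_fn(gold, candidate)    # ordering 2: A = gold, B = candidate
    cand_wins = (first == "A") + (second == "B")
    gold_wins = (first == "B") + (second == "A")
    if cand_wins == 2:
        return 1.0, False                 # candidate beat gold in both orders
    if gold_wins == 2:
        return 0.0, False
    # Split or tied verdicts score as a tie; a split suggests position bias
    return 0.5, (cand_wins == 1 and gold_wins == 1)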

Length control:

  • Pre‑trim both outputs to [MAX_TOKENS_PER_OUTPUT] tokens or [MAX_CHARS] chars to avoid verbosity bias.

Confidence‑band calculator spec (with Wilson CI)

Aggregator rules:

  • For each item i, compute win_i ∈ {1, 0.5, 0} after AB and BA runs.
  • Candidate win‑rate: p̂ = (Σ win_i) / N_items.
  • Also compute a 95% Wilson interval for p̂.

Pass/borderline/reject (defaults; tune after calibration):

  • Pass if p̂ ≥ 0.65 OR Wilson lower bound ≥ 0.60.
  • Borderline if 0.55 ≤ p̂ < 0.65 OR Wilson interval straddles 0.60.
  • Reject if p̂ < 0.55 AND Wilson upper bound < 0.60.

Wilson interval (for N items, successes w = Σ win_i where ties count as 0.5):

z = 1.96
phat = w / N
A = phat + z^2/(2N)
B = z * sqrt((phat*(1-phat) + z^2/(4N)) / N)
C = 1 + z^2/N
lower = (A - B)/C
upper = (A + B)/C

Pseudocode:

if phat >= 0.65 or lower >= 0.60: PASS
elif 0.55 <= phat < 0.65 or (lower < 0.60 and upper >= 0.60): BORDERLINE → human review
else: REJECT
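
The same logic as a runnable Python function; thresholds are the tunable defaults above:

from math import sqrt

def wilson_band(wins, n_items, z=1.96):
    """wins = Σ win_i with ties counted as 0.5; returns (band, phat, (lower, upper))."""
    phat = wins / n_items
    a = phat + z**2 / (2 * n_items)
    b = z * sqrt((phat * (1 - phat) + z**2 / (4 * n_items)) / n_items)
    c = 1 + z**2 / n_items
    lower, upper = (a - b) / c, (a + b) / c
    if phat >= 0.65 or lower >= 0.60:
        return "PASS", phat, (lower, upper)
    if 0.55 <= phat < 0.65 or (lower < 0.60 <= upper):
        return "BORDERLINE", phat, (lower, upper)
    return "REJECT", phat, (lower, upper)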

Logging:

  • Store per‑item winners, confidence, and flags; keep JSON responses for audit.

Human‑sampling SOP (10–20%) + appeal flow

When to sample:

  • Always sample BORDERLINE decisions.
  • Always sample if any critical criterion failed (mark criteria with [CRITICAL=true] in rubric.json).
  • Randomly sample 10–20% of clear PASS decisions weekly, stratified by source and region.

How to sample:

  • Draw a stratified sample: [SAMPLE_RATE]% of PASS across (role, source, region); 100% of BORDERLINE.
  • Assign to [REVIEWER_POOL] with a 24–48h SLA.
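
A sketch of the weekly PASS draw, assuming each decision record is a dict carrying role, source, and region keys:

import random
from collections import defaultdict

def stratified_sample(passes, rate=0.15, seed=None):
    """Sample `rate` of clear PASS decisions within each (role, source, region) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in passes:
        strata[(rec["role"], rec["source"], rec["region"])].append(rec)
    sample = []
    for bucket in strata.values():
        k = max(1, round(len(bucket) * rate))    # at least one per stratum
        sample.extend(rng.sample(bucket, min(k, len(bucket))))
    return sample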

Human review rubric:

  • Use the same rubric; grade independently without seeing the AI decision.
  • Allowed outcomes: Uphold, Upgrade, Downgrade. Add 1–2 sentence rationale.

Appeal flow (candidate‑facing):

  • Email [APPEAL_EMAIL] within [APPEAL_WINDOW_DAYS] days with subject "Appeal: [ROLE] — [YOUR_NAME]".
  • You’ll receive a human re‑review within [APPEAL_SLA_DAYS] days. We return rubric feedback either way.

Error taxonomy (label issues):

  • Misparse/Incorrect (data correctness)
  • Robustness gap (error handling/idempotency)
  • Voice/style mismatch (content)
  • Factual error (content)
  • Other: [FREE_TEXT]

Anti‑cheat checklist (minimal‑overhead)

Use light‑touch integrity checks that don’t invade privacy or penalize neurodiversity.

Required:

  • Randomized inputs per candidate (rotate minor golden‑set variants monthly on [ROTATION_DAY])
  • Time cap enforced by [PORTAL/FORM]; late lock at 90 minutes
  • Honor statement checkbox (no external collaboration; cite sources)
  • Version rotation: model prompts and golden‑set version pinned to each invite

Optional (choose one, not all):

  • Tab‑switch logging only (no webcam)
  • Plagiarism scan on text (flag, don’t auto‑fail)

Avoid:

  • Always‑on screen/video recording for a 90‑min take‑home
  • Location/IP geofencing beyond basic fraud checks

Review flags (auto):

  • Unusual submission latency patterns
  • Duplicate uncommon phrasing across multiple candidates
  • Perfect match to public GitHub gists tied to this prompt

Regional pay‑bands + stipend calculator (edit this first)

How to set fair, simple stipends:

  • Compute from your market midpoint rate × 1.5 hours. Round up to sensible numbers.
  • Publish the amount up front. Pay within [PAYMENT_DAYS] days.

Formula:

stipend = midpoint_hourly_rate * 1.5
minimums: content ≥ [CURRENCY]45, automation ≥ [CURRENCY]60
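
As a runnable version, rounding up to the next whole unit reproduces the example table further down; pass the floor for your role and currency:

from math import ceil

def stipend(midpoint_hourly, floor=0, hours=1.5):
    """Stipend = market midpoint rate x test length, rounded up, with a role minimum."""
    return max(ceil(midpoint_hourly * hours), floor)

# Examples from the table: stipend(65, floor=60) -> 98; stipend(45, floor=45) -> 68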

Directional defaults (replace with your own data):

Content Ops hourly (midpoints):

  • US/Canada: $40–50 → use $45
  • W. Europe/UK: €35–45 → use €40
  • E. Europe: $25–35 → use $30
  • LatAm: $20–30 → use $25
  • India/SEA: $15–25 → use $20

Automation Builder hourly (midpoints):

  • US/Canada: $55–75 → use $65
  • W. Europe/UK: €45–65 → use €55
  • E. Europe: $35–50 → use $42
  • LatAm: $28–45 → use $36
  • India/SEA: $22–40 → use $30

Example stipend table (edit):

| Region       | Content 90‑min | Automation 90‑min |
|--------------|-----------------|-------------------|
| US/Canada    | $68             | $98               |
| W. Europe/UK | €60             | €83               |
| E. Europe    | $45             | $63               |
| LatAm        | $38             | $54               |
| India/SEA    | $30             | $45               |

Candidate one‑pager (Notion template)

Copy/paste this into Notion or your job portal. Fill in brackets before publishing.

Title: [ROLE] — Paid 90‑minute async skills test

What you’ll do:

  • Complete a focused [ROLE] task under a 90‑minute cap
  • Deliver [ARTIFACTS_LIST]
  • You’ll be paid [CURRENCY][STIPEND_AMOUNT] within [PAYMENT_DAYS] days

How we grade:

  • A calibrated rubric + AI judge compares your output to a gold standard
  • We human‑review all borderlines and 10–20% of passes

Fairness & privacy:

  • No webcam or screen takeover; minimal tab logging only
  • You can appeal to [APPEAL_EMAIL] within [APPEAL_WINDOW_DAYS] days

Pass bands:

  • Pass ≥0.65 win‑rate; Borderline ≥0.55 and <0.65; Reject <0.55

Scope guardrails:

  • Don’t overbuild; ship the smallest thing that passes the rubric
  • If you hit a blocker, note it in [NOTES_MD_LINK] and move on

Audit log & changelog (HELM‑style)

Log every change to datasets, prompts, models, and weights. This protects candidates and your team.

Fields (copy to a sheet or Notion DB):

  • Date — [YYYY‑MM‑DD]
  • Version — [MAJOR.MINOR.PATCH]
  • Role — [automation|content]
  • Changed — [dataset|rubric|prompt|model|weights|bands]
  • Summary — [WHAT_CHANGED]
  • Why — [WHY_CHANGED]
  • Owners — [NAME1], [NAME2]
  • Back‑compat note — [IMPACT]
  • Links — [PR/COMMIT_URL], [DOC_URL]

Sample entry:

2026‑05‑08 | v1.1.0 | automation | dataset | Added EUR parse case; tightened idempotency check | Candidates missed locale parsing; false positives on dupes | S. Ramires | Back‑compat: none | PR#42, rubric v1.1 link

Implementation runbook (operator checklist)

Inputs (configure in your repo or sheet):

  • [ORG_NAME], [ROLE]
  • [STACK_TOOL], [CODE_RUNTIME]
  • [JUDGE_MODEL_1], [JUDGE_MODEL_2] (optional), [MAX_TOKENS_PER_OUTPUT]
  • [PASS_BAND]=0.65, [BORDERLINE_LOW]=0.55, [CONFIDENCE_FLOOR]=0.60
  • [APPEAL_EMAIL], [PAYMENT_METHOD], [PAYMENT_DAYS]
  • [ROTATION_DAY] (e.g., last Friday monthly)

Automation checklist:

  • Duplicate this pack → private repo/folder
  • Fill stipend table and publish candidate one‑pager
  • Load golden‑set items into /golden-set/[ROLE]/items
  • Wire submit portal + time cap
  • Implement judge prompt and permutation runner
  • Build Wilson CI sheet/calc and pass‑band logic
  • Set up sampling queue + reviewer pool
  • Enable minimal anti‑cheat flags
  • Ship pilot to 3 internal testers → calibrate
  • Go live + start audit log at v1.0.0

Rubric files (drop‑in JSON templates)

Use this reference format for your rubric.json files. Edit criterion names, weights, and criticality per role.

{
  "role": "automation",
  "criteria": [
    {"name": "Data correctness", "weight": 0.40, "critical": true},
    {"name": "Robustness", "weight": 0.35, "critical": true},
    {"name": "Observability", "weight": 0.15, "critical": false},
    {"name": "Maintainability", "weight": 0.10, "critical": false}
  ],
  "bands": {"pass": 0.65, "borderline_low": 0.55, "confidence_floor": 0.60}
}

{
  "role": "content",
  "criteria": [
    {"name": "Factual accuracy", "weight": 0.35, "critical": true},
    {"name": "Structure", "weight": 0.25, "critical": false},
    {"name": "Voice adherence", "weight": 0.25, "critical": false},
    {"name": "Brief compliance", "weight": 0.15, "critical": false}
  ],
  "bands": {"pass": 0.65, "borderline_low": 0.55, "confidence_floor": 0.60}
}
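
Before a rubric.json enters the judge pipeline, a small guard catches weight and band mistakes; a sketch, with an illustrative function name:

import json

def validate_rubric(path):
    """Check that weights sum to 1.0 and bands are ordered before use."""
    with open(path) as f:
        rubric = json.load(f)
    total = sum(c["weight"] for c in rubric["criteria"])
    assert abs(total - 1.0) < 1e-9, f"weights sum to {total}, expected 1.0"
    bands = rubric["bands"]
    assert bands["borderline_low"] < bands["confidence_floor"] < bands["pass"], "bands out of order"
    return rubric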