The Contractor Skills Test Pack
A fill‑in‑the‑blanks pack to launch a paid, 90‑minute async skills test with calibrated AI judging, human sampling, anti‑cheat, and fair stipends—designed for Automation Builder and Content Ops roles.
Duplicate this pack into your workspace, replace the [BRACKETS] with your details, and ship a paid, 90‑minute async skills test this week. It includes two ready roles (Automation Builder and Content Ops), golden‑set items with answer keys, rubricized pairwise AI judging, confidence bands, a human‑sampling SOP, anti‑cheat, pay‑band tables, and a candidate‑facing one‑pager.
How to use:
- Fill the [ORG], [ROLE], and [CONTACT] fields across sections.
- Swap in your product context and examples.
- Run 3–5 pilot attempts (internal) to calibrate pass bands.
- Publish the candidate one‑pager and start inviting applicants.
- Review borderlines via the human‑sampling SOP and log decisions in the audit log.
Quick‑start: 90‑minute async hiring test blueprint
- Test name: [ROLE] 90‑minute async skills test
- Time cap: 90 minutes (hard stop)
- Submission window: [START_DATE]–[END_DATE]; late submissions auto‑fail unless [EXEMPTION_RULE]
- Delivery: [SUBMISSION_PORTAL_URL] (Google Drive link or portal upload)
- Payment: [CURRENCY][STIPEND_AMOUNT] within [PAYMENT_DAYS] days via [PAYMENT_METHOD]
- Grading: Pairwise LLM judge (permutation debiased) + human sampling on borderlines
- Pass bands: Pass ≥0.65 win‑rate; Borderline 0.55 to <0.65; Reject <0.55 (see Confidence Bands)
- Contacts: Ops owner [OWNER_NAME] — [OWNER_EMAIL]; Appeals [APPEAL_EMAIL]
- Privacy & fairness: No screen takeover; minimal proctoring; anonymized review; appeal route below
Role 1 template: Automation Builder test (webhook → transform → error handler)
Goal: Prove you can wire a simple inbound webhook → transform → error‑handled handoff in under 90 minutes using the stack you’ll use on the job.
Stack: [STACK_TOOL] (e.g., Make, Zapier, n8n) + [CODE_RUNTIME] (optional for transforms)
Scenario prompt (paste to candidates):
- You receive POST requests at /lead containing lead payloads from multiple sources. Normalize to a shared schema and forward to Slack + a Google Sheet. Log and retry on transient errors.
- Hidden edge cases (the grader checks these):
  - Currency strings like "€1,200.50" → decimal 1200.50 (strip symbols; support commas)
  - Missing email should route to an incomplete sheet and post a Slack warning
  - Non‑UTF‑8 characters must be sanitized (use the replacement character)
  - Idempotency: duplicate lead_id events must not duplicate rows
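For your own calibration (not shared with candidates), here is a minimal sketch of the normalization those edge cases imply, assuming Python as the [CODE_RUNTIME]; the symbol map and function names are illustrative, not required:

import re
from decimal import Decimal

# Illustrative symbol map only; extend to the currencies you actually receive.
SYMBOL_TO_CODE = {"$": "USD", "€": "EUR", "£": "GBP"}

def parse_amount(raw):
    """Turn strings like '€1,200.50' into (Decimal('1200.50'), 'EUR')."""
    currency = next((code for sym, code in SYMBOL_TO_CODE.items() if sym in raw), None)
    digits = re.sub(r"[^\d.,-]", "", raw)              # strip currency symbols and spaces
    return Decimal(digits.replace(",", "")), currency  # drop thousands separators

def sanitize(raw_bytes):
    """Decode a payload, replacing invalid byte sequences with U+FFFD."""
    return raw_bytes.decode("utf-8", errors="replace")

# parse_amount("€1,200.50") -> (Decimal('1200.50'), 'EUR')
# parse_amount("$1,050.00") -> (Decimal('1050.00'), 'USD')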
Golden‑set items (internal; not shared with candidates):
Item A (Happy path)
Input JSON:
{"lead_id":"a1","name":"K. Duarte","email":"k@ex.co","amount":"$1,050.00","source":"ads"}
Expected normalized:
{"lead_id":"a1","name":"K. Duarte","email":"k@ex.co","amount":1050.00,"currency":"USD","source":"ads"}
Checks: row in Sheet 'leads', Slack message in #leads, HTTP 200
Item B (Currency parse)
Input:
{"lead_id":"b2","name":"S. Rao","email":"s@ex.co","amount":"€1,200.50","source":"referral"}
Expected: amount 1200.50, currency EUR
Item C (Missing email)
Input:
{"lead_id":"c3","name":"A. Chen","amount":"440","source":"organic"}
Expected: row in 'incomplete', Slack warn, no row in 'leads'
Item D (Duplicate id)
First input: {"lead_id":"d4",...}
Second input: identical payload 30s later
Expected: only one row exists; second request logged as duplicate
Rubric & weights (Automation):
- Data correctness (normalization, currency parsing) — 40%
- Robustness (error handling, retries, idempotency) — 35%
- Observability (clear logs, alerts) — 15%
- Maintainability (naming, comments, foldering) — 10%
Submission artifacts required:
- [WORKFLOW_SHARE_LINK]
- [SCREENSHOT_LOGS_LINK]
- [NOTES_MD_LINK] (short rationale + known gaps)
Role 2 template: Content Ops test (brief → outline → draft)
Goal: Show you can turn a short brief into a clean outline and a tight draft that matches voice, facts, and constraints in under 90 minutes.
Scenario prompt (paste to candidates):
- You’re writing a 600–800 word blog post: "[TOPIC]: A 3‑step playbook for [AUDIENCE]" with a 140–160‑char meta description.
- Use the supplied brand voice notes. Include at least one number, one caution, and a 3‑step playbook. Avoid hype.
Brand voice excerpt (share to candidates):
- Tone: plainspoken, operator‑grade; short sentences; no guru talk
- Banned words: "revolutionary", "game‑changing"
- Style: use specific numbers; prefer checklists; minimal adverbs
Golden‑set brief (internal key):
Brief: "Cut client onboarding time by 50% in Make.com"
Outline (gold): H1, 3 steps (collect→provision→verify), 'gotchas' box, 1 mini‑case, CTA.
Draft (gold): 680–720 words; includes % baseline math; warns about OAuth token expiry; links to SLA template.
Meta: 150 chars; includes 'async' and 'Make.com'.
Factual anchors: Make scenario limit (e.g., 100 ops/min) mentioned once; retry/backoff basics.
Rubric & weights (Content Ops):
- Factual accuracy (claims grounded; no hallucinations) — 35%
- Structure (clear outline; scannable; headings do work) — 25%
- Voice adherence (no hype; concise; numbers) — 25%
- Brief compliance (length, must‑include elements) — 15%
Submission artifacts required:
- [OUTLINE_DOC_LINK]
- [DRAFT_DOC_LINK]
- [SOURCES_LINKS] (list URLs used)
Golden‑set structure and calibration method
Foldering (suggested):
- /golden-set/[ROLE]/items/*.json — one file per item
- /golden-set/[ROLE]/answers/*.md|.json — canonical answers/flows
- /golden-set/[ROLE]/rubric.json — criteria + weights
- /golden-set/versions.json — semver + change notes
Minimum composition:
- 6–10 items per role: 4 happy‑path, 2–3 edge cases, 1 failure‑handling
- Diversity: vary input lengths, formats, and traps (verbosity, position)
- Each item must include: prompt_to_candidate (if applicable), input, expected_behavior, acceptance_checks[], and rationale
Example item schema:
{
"id": "auto-item-b-currency",
"role": "automation",
"input": {"amount": "€1,200.50", ...},
"expected": {"amount": 1200.50, "currency": "EUR"},
"acceptance_checks": [
"sheet.row.amount == 1200.50",
"sheet.row.currency == 'EUR'"
],
"rationale": "Tests parse + locale handling"
}
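If you script the grading, acceptance_checks[] can be run mechanically. A minimal sketch, assuming each check is a "dotted.path == literal" string and that you capture the candidate's outputs into a nested dict (both are assumptions, not requirements of this pack):

import ast
from functools import reduce

def lookup(observed, dotted_path):
    """Walk a dotted path like 'sheet.row.amount' through nested dicts."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), observed)

def run_checks(item, observed):
    """Evaluate each 'path == literal' acceptance check against captured outputs."""
    results = {}
    for check in item["acceptance_checks"]:
        path, expected = [part.strip() for part in check.split("==")]
        results[check] = (lookup(observed, path) == ast.literal_eval(expected))
    return results

# Item B from the golden set, with hypothetical captured outputs:
observed = {"sheet": {"row": {"amount": 1200.50, "currency": "EUR"}}}
item = {"acceptance_checks": ["sheet.row.amount == 1200.50", "sheet.row.currency == 'EUR'"]}
print(run_checks(item, observed))   # both checks True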
Calibration run:
- Have 2–3 internal testers attempt the full test under the same 90‑min cap.
- Compute win‑rates vs gold; adjust rubric weights so intended "hire" profiles score ≥0.65 and obvious "no‑hires" fall <0.55.
- Freeze versions.json as v1.0.0 and record prompts/models in the audit log.
Pairwise AI‑judge prompt (with permutation debiasing)
Use pairwise, rubricized judging instead of raw 1–10s. Run A vs B and B vs A (permutation) to cut position effects.
Judge ensemble:
- [JUDGE_MODEL_1] primary; optional [JUDGE_MODEL_2], [JUDGE_MODEL_3] as tie‑breakers
- Majority vote across models; break ties by higher confidence
Prompt template (pairwise; per item):
System: You are a strict hiring assessor. Judge which candidate output better satisfies the rubric for the given task. Do not reward length or style beyond the rubric.
User:
TASK CONTEXT:
[ITEM_CONTEXT]
RUBRIC (weights in %):
1) [CRITERION_1] — [W1]
2) [CRITERION_2] — [W2]
3) [CRITERION_3] — [W3]
4) [CRITERION_4] — [W4]
CANDIDATE OUTPUT A:
[OUTPUT_A]
CANDIDATE OUTPUT B:
[OUTPUT_B]
INSTRUCTIONS:
- Make a single decision: A, B, or Tie.
- Justify briefly per criterion.
- Ignore superficial verbosity; reward correctness and rubric fit.
- Return JSON only in the schema below.
JSON SCHEMA:
{
"winner": "A|B|Tie",
"confidence": 0.0–1.0,
"per_criterion": [
{"criterion": "[CRITERION_1]", "why": "...", "edge_cases_considered": true},
{"criterion": "[CRITERION_2]", "why": "...", "edge_cases_considered": false}
],
"flags": ["verbosity_bias?","position_bias?","format_mismatch?"],
"notes": "one sentence"
}
Permutation debiasing:
- For each item, run the prompt twice: (A=candidate, B=gold) and (A=gold, B=candidate). Flag any observed position bias for later review.
- Score each run as win=1 if the candidate beats gold, 0 if it loses, 0.5 for a tie; if the two orderings disagree, treat the item as a tie (0.5).
Length control:
- Pre‑trim both outputs to [MAX_TOKENS_PER_OUTPUT] tokens or [MAX_CHARS] chars to avoid verbosity bias.
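Putting the pieces together, a minimal runner sketch; judge stands in for whatever client wraps [JUDGE_MODEL_1] and returns the JSON schema above, and MAX_CHARS maps to [MAX_CHARS]:

MAX_CHARS = 8000

def trim(text):
    """Length control: pre-trim both outputs to curb verbosity bias."""
    return text[:MAX_CHARS]

def item_win(judge, item_context, candidate, gold):
    """Run A/B and B/A, score the candidate against gold, tie on disagreement."""
    candidate, gold = trim(candidate), trim(gold)
    run_ab = judge(item_context, output_a=candidate, output_b=gold)   # A = candidate
    run_ba = judge(item_context, output_a=gold, output_b=candidate)   # A = gold

    def score(run, candidate_slot):
        if run["winner"] == "Tie":
            return 0.5
        return 1.0 if run["winner"] == candidate_slot else 0.0

    s_ab, s_ba = score(run_ab, "A"), score(run_ba, "B")
    return s_ab if s_ab == s_ba else 0.5   # disagreement counts as a tie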
Confidence‑band calculator spec (with Wilson CI)
Aggregator rules:
- For each item i, compute win_i ∈ {1, 0.5, 0} after the AB and BA runs.
- Candidate win‑rate: p̂ = (Σ win_i) / N_items.
- Also compute a 95% Wilson interval for p̂.
Pass/borderline/reject (defaults; tune after calibration):
- Pass if p̂ ≥ 0.65 OR Wilson lower bound ≥ 0.60.
- Borderline if 0.55 ≤ p̂ < 0.65 OR Wilson interval straddles 0.60.
- Reject if p̂ < 0.55 AND Wilson upper bound < 0.60.
Wilson interval (for N items, successes w = Σ win_i where ties count as 0.5):
z = 1.96
phat = w / N
A = phat + z^2/(2N)
B = z * sqrt((phat*(1-phat) + z^2/(4N)) / N)
C = 1 + z^2/N
lower = (A - B)/C
upper = (A + B)/C
Pseudocode:
if phat >= 0.65 or lower >= 0.60: PASS
elif 0.55 <= phat < 0.65 or (lower < 0.60 and upper >= 0.60): BORDERLINE → human review
else: REJECT
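The same logic as a runnable sketch (standard-library Python; thresholds mirror the defaults above and should be re-tuned after calibration):

import math

def wilson_bounds(w, n, z=1.96):
    """95% Wilson interval for win-rate w/n (ties already counted as 0.5 in w)."""
    phat = w / n
    a = phat + z**2 / (2 * n)
    b = z * math.sqrt((phat * (1 - phat) + z**2 / (4 * n)) / n)
    c = 1 + z**2 / n
    return (a - b) / c, (a + b) / c

def band(wins):
    """Map per-item wins to PASS / BORDERLINE / REJECT using the default bands."""
    n, w = len(wins), sum(wins)
    phat = w / n
    lower, upper = wilson_bounds(w, n)
    if phat >= 0.65 or lower >= 0.60:
        return "PASS"
    if 0.55 <= phat < 0.65 or (lower < 0.60 and upper >= 0.60):
        return "BORDERLINE"   # route to human review
    return "REJECT"

# 8 items: 5 wins, 2 ties, 1 loss -> p-hat 0.75 -> PASS
print(band([1, 1, 1, 1, 1, 0.5, 0.5, 0]))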
Logging:
- Store per‑item winners, confidence, and flags; keep JSON responses for audit.
Human‑sampling SOP (10–20%) + appeal flow
When to sample:
- Always sample BORDERLINE decisions.
- Always sample if any critical criterion failed (mark criteria with [CRITICAL=true] in rubric.json).
- Randomly sample 10–20% of clear PASS decisions weekly, stratified by source and region.
How to sample:
- Draw a stratified sample: [SAMPLE_RATE]% of PASS across (role, source, region); 100% of BORDERLINE.
- Assign to [REVIEWER_POOL] with a 24–48h SLA.
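If you script the queue, a minimal stratified-draw sketch; the field names (band, role, source, region) are assumptions about how you store decisions, so swap in your own:

import math
import random
from collections import defaultdict

def weekly_sample(decisions, sample_rate=0.15):
    """100% of BORDERLINE plus sample_rate of PASS per (role, source, region) stratum."""
    queue = [d for d in decisions if d["band"] == "BORDERLINE"]
    strata = defaultdict(list)
    for d in decisions:
        if d["band"] == "PASS":
            strata[(d["role"], d["source"], d["region"])].append(d)
    for group in strata.values():
        k = max(1, math.ceil(sample_rate * len(group)))   # never skip a stratum entirely
        queue.extend(random.sample(group, k))
    return queue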
Human review rubric:
- Use the same rubric; grade independently without seeing the AI decision.
- Allowed outcomes: Uphold, Upgrade, Downgrade. Add 1–2 sentence rationale.
Appeal flow (candidate‑facing):
- Email [APPEAL_EMAIL] within [APPEAL_WINDOW_DAYS] days with subject "Appeal: [ROLE] — [YOUR_NAME]".
- You’ll receive a human re‑review within [APPEAL_SLA_DAYS] days. We return rubric feedback either way.
Error taxonomy (label issues):
- Misparse/Incorrect (data correctness)
- Robustness gap (error handling/idempotency)
- Voice/style mismatch (content)
- Factual error (content)
- Other: [FREE_TEXT]
Anti‑cheat checklist (minimal‑overhead)
Use light‑touch integrity checks that don’t invade privacy or penalize neurodiversity.
Required:
- Randomized inputs per candidate (rotate golden-set minor variants monthly on [ROTATION_DAY])
- Time cap enforced by [PORTAL/FORM]; late lock at 90 minutes
- Honor statement checkbox (no external collaboration; cite sources)
- Version rotation: model prompts and golden‑set version pinned to each invite
Optional (choose one, not all):
- Tab‑switch logging only (no webcam)
- Plagiarism scan on text (flag, don’t auto‑fail)
Avoid:
- Always‑on screen/video recording for a 90‑min take‑home
- Location/IP geofencing beyond basic fraud checks
Review flags (auto):
- Unusual submission latency patterns
- Duplicate uncommon phrasing across multiple candidates
- Perfect match to public GitHub gists tied to this prompt
Regional pay‑bands + stipend calculator (edit this first)
How to set fair, simple stipends:
- Compute from your market midpoint rate × 1.5 hours. Round up to sensible numbers.
- Publish the amount up front. Pay within [PAYMENT_DAYS] days.
Formula:
stipend = midpoint_hourly_rate * 1.5
minimums: content ≥ [CURRENCY]45, automation ≥ [CURRENCY]60
Directional defaults (replace with your own data):
Content Ops hourly (midpoints):
- US/Canada: $40–50 → use $45
- W. Europe/UK: €35–45 → use €40
- E. Europe: $25–35 → use $30
- LatAm: $20–30 → use $25
- India/SEA: $15–25 → use $20
Automation Builder hourly (midpoints):
- US/Canada: $55–75 → use $65
- W. Europe/UK: €45–65 → use €55
- E. Europe: $35–50 → use $42
- LatAm: $28–45 → use $36
- India/SEA: $22–40 → use $30
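In code, the formula plus the role minimums is a few lines; the example reproduces the E. Europe automation row in the table below:

import math

def stipend(midpoint_hourly, role_minimum):
    """stipend = midpoint hourly rate x 1.5 hours, floored at the role minimum, rounded up."""
    return max(math.ceil(midpoint_hourly * 1.5), role_minimum)

# E. Europe, Automation Builder: $42/h midpoint, $60 minimum -> $63
print(stipend(42, 60))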
Example stipend table (edit):
| Region | Content 90‑min | Automation 90‑min |
|--------------|-----------------|-------------------|
| US/Canada | $68 | $98 |
| W. Europe/UK | €60 | €83 |
| E. Europe | $45 | $63 |
| LatAm | $38 | $54 |
| India/SEA | $30 | $45 |
Candidate one‑pager (Notion template)
Copy/paste this into Notion or your job portal. Fill in brackets before publishing.
Title: [ROLE] — Paid 90‑minute async skills test
What you’ll do:
- Complete a focused [ROLE] task under a 90‑minute cap
- Deliver [ARTIFACTS_LIST]
- You’ll be paid [CURRENCY][STIPEND_AMOUNT] within [PAYMENT_DAYS] days
How we grade:
- A calibrated rubric + AI judge compares your output to a gold standard
- We human‑review all borderlines and 10–20% of passes
Fairness & privacy:
- No webcam or screen takeover; minimal tab logging only
- You can appeal to [APPEAL_EMAIL] within [APPEAL_WINDOW_DAYS] days
Pass bands:
- Pass ≥0.65 win‑rate; Borderline 0.55 to <0.65; Reject <0.55
Scope guardrails:
- Don’t overbuild; ship the smallest thing that passes the rubric
- If you hit a blocker, note it in [NOTES_MD_LINK] and move on
Audit log & changelog (HELM‑style)
Log every change to datasets, prompts, models, and weights. This protects candidates and your team.
Fields (copy to a sheet or Notion DB):
- Date — [YYYY‑MM‑DD]
- Version — [MAJOR.MINOR.PATCH]
- Role — [automation|content]
- Changed — [dataset|rubric|prompt|model|weights|bands]
- Summary — [WHAT_CHANGED]
- Why — [WHY_CHANGED]
- Owners — [NAME1], [NAME2]
- Back‑compat note — [IMPACT]
- Links — [PR/COMMIT_URL], [DOC_URL]
Sample entry:
2026‑05‑08 | v1.1.0 | automation | dataset | Added EUR parse case; tightened idempotency check | Candidates missed locale parsing; false positives on dupes | S. Ramires | Back‑compat: none | PR#42, rubric v1.1 link
Implementation runbook (operator checklist)
Inputs (configure in your repo or sheet):
- [ORG_NAME], [ROLE]
- [STACK_TOOL], [CODE_RUNTIME]
- [JUDGE_MODEL_1], [JUDGE_MODEL_2] (optional), [MAX_TOKENS_PER_OUTPUT]
- [PASS_BAND]=0.65, [BORDERLINE_LOW]=0.55, [CONFIDENCE_FLOOR]=0.60
- [APPEAL_EMAIL], [PAYMENT_METHOD], [PAYMENT_DAYS]
- [ROTATION_DAY] (e.g., last Friday monthly)
Automation checklist:
- Duplicate this pack → private repo/folder
- Fill stipend table and publish candidate one‑pager
- Load golden‑set items into /golden-set/[ROLE]/items
- Wire submit portal + time cap
- Implement judge prompt and permutation runner
- Build Wilson CI sheet/calc and pass‑band logic
- Set up sampling queue + reviewer pool
- Enable minimal anti‑cheat flags
- Ship pilot to 3 internal testers → calibrate
- Go live + start audit log at v1.0.0
Rubric files (drop‑in JSON templates)
Use this reference format for your rubric.json files. Edit criterion names, weights, and criticality per role.
{
"role": "automation",
"criteria": [
{"name": "Data correctness", "weight": 0.40, "critical": true},
{"name": "Robustness", "weight": 0.35, "critical": true},
{"name": "Observability", "weight": 0.15, "critical": false},
{"name": "Maintainability", "weight": 0.10, "critical": false}
],
"bands": {"pass": 0.65, "borderline_low": 0.55, "confidence_floor": 0.60}
}
{
"role": "content",
"criteria": [
{"name": "Factual accuracy", "weight": 0.35, "critical": true},
{"name": "Structure", "weight": 0.25, "critical": false},
{"name": "Voice adherence", "weight": 0.25, "critical": false},
{"name": "Brief compliance", "weight": 0.15, "critical": false}
],
"bands": {"pass": 0.65, "borderline_low": 0.55, "confidence_floor": 0.60}
}
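Optional: a small sanity check before go-live catches weights that drift away from 1.0 after edits (the example path follows the suggested foldering; adjust to yours):

import json

def load_rubric(path):
    """Load rubric.json and check that criterion weights still sum to 1.0."""
    with open(path) as f:
        rubric = json.load(f)
    total = sum(c["weight"] for c in rubric["criteria"])
    assert abs(total - 1.0) < 1e-9, f"weights sum to {total}, expected 1.0"
    return rubric

# load_rubric("golden-set/automation/rubric.json")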