AI can draft your reviews. It cannot rate your people.
Before reading this: None required. Prompt Engineering Fundamentals 2026 helps, but is not essential.
After reading this article, you'll be able to: - Draft review feedback from a de-identified evidence log, with a prompt that blocks invented examples - Screen AI-drafted feedback for personality comments, hedging and non-actionable language - Decide which employee information can go near an AI tool, and which cannot
Why this matters
Performance reviews are the documents employees remember and the process almost nobody trusts. A 2025 Deloitte study cited by Workday found 61 per cent of managers and 72 per cent of workers distrust their organisation's performance management process, alongside a SHRM estimate that managers spend roughly 210 hours a year on reviews.
AI is arriving in exactly this gap. SHRM's January 2024 survey of 2,366 HR professionals found about 1 in 4 organisations using AI to support HR, with performance management a minority use among adopters (25 per cent, against 64 per cent for talent acquisition). By SHRM's 2025 Talent Trends research, overall adoption had reached 43 per cent. The time saving is real: a Gartner December 2025 survey found managers using AI in performance management save an average of four hours.
The catch sits in one other Gartner number: as of October 2025, only 8 per cent of HR leaders believe their managers have the skills to use AI effectively. Tools are spreading faster than the judgement needed to run them, and reviews are where that gap costs most, because the output decides ratings, pay and careers.
The core concept: AI is the statistician, not the selector
A football club's statistician compiles the season: minutes, involvements, errors, form lines. The selector reads it, picks the side and fronts the press conference. Nobody confuses the two jobs, and no club lets the spreadsheet pick the team.
That is the working division for AI in performance reviews. The model is a tireless statistician: it can synthesise a year of evidence notes into a structured draft in seconds. The manager is the selector: the rating, the calibration position and the conversation are selection decisions, and they carry the selector's name.
Two findings make the division non-negotiable. First, evidence in or garbage out. When Textio had ChatGPT write performance feedback from thin prompts in January 2023, the output was fluent, generic and useless: no examples, nothing the employee could act on. It also guessed gender from job titles, presuming kindergarten teachers female and construction workers male. The tests ran on an early model, so treat them as a persistent risk class, not a verdict on current tools. Either way, a model with no evidence does not write your review, it writes a horoscope.
Second, drafting is not deciding. A manager who uses AI to turn reviewed evidence into clean wording, then checks every line, runs a low-risk drafting aid. A system that scores, ranks or recommends ratings is automated decision-making (ADM), a separate category with its own obligations. Most of the legal risk below attaches to the second.
Where the bias actually comes from
AI does not invent feedback bias. It inherits it from decades of human reviews, at scale.
The baseline is grim without any technology. Textio's 2022 analysis of performance feedback language found women receive 22 per cent more written personality feedback than men and are 11 times more likely to be described as "abrasive". Black women were four times more likely than white men to be labelled an "overachiever". The same report links actionable feedback to faster growth, higher pay and faster promotion; biased feedback compounds into biased careers.
A model trained on that writing reproduces it. In Textio's ChatGPT experiment, the model described men as "confident" three times as often as women, and when gender was specified it wrote roughly 15 per cent longer feedback for women, with the extra words typically adding criticism. The obvious fix fails too: asking ChatGPT to "remove bias" made the feedback less clear, not less biased. Self-debiasing is not a control. A human screen for personality comments, hedging and vague praise is.
Recency bias runs the other way: it is the one distortion where AI outperforms the unaided manager. Culture Amp lists it among the most common review distortions, mitigated by collecting evidence at multiple points through the cycle. A model is indifferent to when something happened: feed it twelve months of notes and it weights February the same as last week. That only works if the notes exist, so the workflow below starts before review season.
The workflow: evidence-first review preparation
Six steps. The first is the one most managers skip, and the one that makes the rest work.
1. Keep an evidence log through the cycle. One line per event: date, observable behaviour, outcome. "14 Mar: presented vendor analysis to steering committee, two cost-model errors, corrected same day." Ten minutes a fortnight. This is the anti-recency-bias control and the raw material for everything below.
2. De-identify before anything touches the tool. The OAIC's guidance on commercially available AI products recommends, as best practice, not entering personal information, particularly sensitive information, into publicly available generative AI tools. Replace names with placeholders ([EMPLOYEE], [PEER]) and strip anything identifying by inference. An approved enterprise tool changes the calculus, but check your AI policy first; Textio's practitioner guidance makes that step one.
3. Draft from evidence only. Use a prompt that forbids invention and anchors every claim to a logged note. The worked example below shows one.
4. Screen for the three bias markers. Personality comments ("abrasive", "bubbly", "not a culture fit"), hedging ("perhaps consider") and non-actionable language: praise or criticism with no example and no next step. These are Textio's screening targets.
5. Fact-check every line against the log. Textio's warning is that AI can "beautifully and confidently write garbage". APP 10 of the Privacy Act also requires reasonable steps to ensure accuracy of personal information, an obligation the OAIC links directly to hallucination and bias risks in generative AI.
6. Deliver the conversation yourself. No script read verbatim. The discussion is where judgement becomes visible: as Workday's design principle puts it, human-centric, never human-free.
Preparing calibration cases
Calibration sessions exist to align rating standards across managers: same criteria, same evidence test, every team. AI's legitimate role is assembling the case file. Run each person's evidence log through the same structure and prompt, so the panel compares like with like. Gaps in evidence surface before the session instead of during it. The model formats the case. It does not get a vote in the room.
The privacy and legal floor
Four things every Australian manager and HR practitioner needs on the record.
The employee records exemption is narrower than most people assume. It covers acts directly related to a current or former employment relationship and an employee record held by the employer. Disclosing employee data to a third party, including an external AI tool, can fall outside it, and a third party that collects employee records must comply with the APPs (OAIC guidance). "Exempt because it is HR data" is not a safe assumption once the data leaves your systems. Get advice for your configuration.
Transparency obligations arrive on 10 December 2026. New APP 1.7 requires privacy policies to disclose where a computer program uses personal information in decisions that could reasonably be expected to significantly affect an individual's rights or interests, with civil penalties available (Johnson Winter Slattery's analysis). A manager drafting wording with full human review is unlikely to cross that threshold; a tool that scores employees or substantially drives ratings may. Ask the question before any AI-suggested-ratings feature is switched on.
Fair Work liability does not transfer to the algorithm. Under the Fair Work Act 2009, the employer remains liable for unfair dismissal even if an algorithm made the decision, and general protections claims carry a reverse onus of proof. Kingston Reid's commentary notes an employer "would find it difficult to overcome the hurdle of a reverse onus of proof if relying exclusively on ADM technologies", recommending human involvement in significant employment decisions. No court or Fair Work Commission decision has yet tested an AI-made performance decision; the existing provisions apply regardless.
The governance anchor is voluntary, for now. Australia's Voluntary AI Safety Standard sets ten guardrails, including meaningful human oversight (Guardrail 5), transparency to affected people, and challenge and redress processes. It is voluntary, and the proposed mandatory guardrails for high-risk AI remain proposals, but it is the cleanest benchmark for defensible AI use in HR.
This section is general information and education only, not legal advice. Obtain advice from a qualified professional before relying on a privacy or employment law position for a specific scenario.*
Common mistakes
Pasting the year into a personal account, names and all. The fastest way to turn a drafting shortcut into a privacy incident.
Drafting from memory instead of a log. Memory is recency-weighted; you get fluent, generic feedback about the last six weeks.
Asking the model to de-bias its own work. Textio tested this and the feedback got less clear, not less biased.
Letting the draft set the rating. Decide the rating from the evidence before you generate a word; anchoring is just bias with better formatting.
Reading the AI script at the conversation. The employee notices.
Treating the saved four hours as the point. The point is reviews that are better evidenced and easier to defend in calibration.
A worked example
A manager is preparing a half-year review for a business analyst. The evidence log has 14 dated lines and has been de-identified. The prompt, into Claude (it works identically in ChatGPT):
You are helping a manager draft performance review feedback. Use only the evidence notes below. Do not invent examples, do not infer personality traits, and do not inflate or soften. Structure the draft as two strengths and one development area. Anchor every claim to a specific note and its date. If the evidence is too thin to support a claim, write "insufficient evidence" rather than the claim. Plain Australian English. No superlatives. Refer to the employee as [EMPLOYEE] throughout.
>
Evidence notes: [paste the de-identified log]
What comes back is a competent first draft anchored to real events. The human pass checks each claim against the log, deletes any hedge, confirms the development area carries a next step the employee can take, and decides the rating, which the prompt deliberately never asked for. About 25 minutes against the two hours the blank page usually takes.
The advanced layer: scaling it across an HR function
For an HR practitioner running calibration across dozens of managers, the leverage is standardisation. Issue a one-page evidence log template at the start of the cycle, mandate the same drafting prompt, and have the panel read structurally identical case files. Among organisations already using AI in performance management, SHRM found 57 per cent use it to help managers provide more comprehensive or actionable feedback and 46 per cent to facilitate goal setting. Both uses live or die on input quality, and the template is the lever that controls it at scale.
The second job is vendor diligence. AI-suggested ratings or automated performance flags sit in the ADM category. Run the APP 1.7 significant-effect question, the reverse-onus question and the Guardrail 5 oversight test before the feature is enabled, and document who can override it. Then train managers on this workflow before the licence rollout, not after.
What must stay human
The list is short and it does not flex:
- The rating
- The calibration position argued in the room
- Any pay, promotion or formal performance consequence
- The conversation itself
- Accountability for every word in the final document
AI compresses the drafting. It does not absorb the responsibility. Write the evidence down all year. Let the model assemble it. Make the call yourself.
Try this
Try this: Take a paragraph of feedback you wrote in your last review cycle and remove every name. Paste it into Claude or ChatGPT and ask: "Flag any personality-based comments, hedging language, or statements the employee could not act on. Explain each flag." Then rewrite the weakest sentence yourself, anchored to a specific example. Fifteen minutes, no new accounts.
Glossary
Evidence log. A running record of dated, observable work events kept through the review cycle.
Recency bias. Overweighting the most recent weeks when rating a whole cycle.
Calibration. A session where managers align rating standards by reviewing evidence across teams.
Automated decision-making (ADM). A computer program making or substantially driving a decision, rather than drafting words a human reviews.
Employee records exemption. A narrow Privacy Act exemption for employer-held employee records; it generally does not follow data disclosed to third-party tools.
Where to go next
- Prompt Engineering Fundamentals 2026
- Custom Projects vs Raw Chats
- AI in Hiring Needs Human Review
TheAICommand. Intelligence, At Your Command.





