Can I use AI to write my team's performance reviews?

You can use AI to draft feedback from a de-identified evidence log, then check every line. Drafting is a low-risk aid. Deciding is not. Scoring, ranking or recommending ratings is automated decision-making with its own obligations. The rating, calibration position and conversation stay human and carry your name.

What employee information can I safely paste into ChatGPT or Claude?

De-identify before anything touches the tool. The OAIC recommends not entering personal, particularly sensitive, information into publicly available generative AI. Replace names with placeholders like [EMPLOYEE], and strip anything identifying by inference. An approved enterprise tool changes the calculus, but check your AI policy first.

How do I stop AI-drafted feedback from being biased?

Do not ask the model to de-bias its own work; Textio tested this and the feedback got less clear, not less biased. Instead, run a human screen for the three markers: personality comments, hedging, and non-actionable language with no example and no next step. The screen is the control, not self-debiasing.

Does using AI in performance reviews break Australian privacy or employment law?

The employee records exemption is narrower than assumed and may not follow data sent to third-party tools. New APP 1.7 transparency obligations arrive on 10 December 2026. Under the Fair Work Act, the employer stays liable even if an algorithm decided. This is general information only, not legal advice; obtain advice for your scenario.

How should I prompt AI to draft review feedback without inventing examples?

Instruct it to use only the evidence notes, not invent examples, not infer personality traits, and not inflate or soften. Anchor every claim to a specific note and date, and write "insufficient evidence" where the log is thin. Then fact-check each line against the log and decide the rating yourself.

What is Evidence log?

A running record of dated, observable work events kept through the review cycle. The raw material for AI-assisted drafting and the main control against recency bias.

What is Recency bias?

The tendency to overweight the most recent weeks when rating a whole cycle. Mitigated by collecting evidence at multiple points through the period.

A structured session where managers align rating standards by reviewing evidence across teams, so the same performance earns the same rating regardless of who rates it.

What is Automated decision-making (ADM)?

A computer program making or substantially driving a decision, as opposed to drafting words a human reviews. The higher-obligation category in Australian privacy and employment law commentary.

What is Employee records exemption?

A narrow exemption in the Privacy Act 1988 (Cth) covering employer-held employee records. It generally does not follow data disclosed to third-party tools.

How can I practise the skill in "AI in Performance Reviews: Draft the Words, Keep the Judgement"?

Take a paragraph of feedback you wrote in your last review cycle and remove every name. Paste it into Claude or ChatGPT and ask it to flag personality comments, hedging and statements the employee could not act on. Rewrite the weakest sentence yourself, anchored to a specific example.

AI Performance Reviews: Keep the Judgement

AI can draft your reviews. It cannot rate your people. It can synthesise a year of de-identified evidence notes into a first draft of feedback, but the rating, the calibration position and the conversation stay human.

Before reading this: None required. Prompt Engineering Fundamentals 2026 helps, but is not essential.

After reading this article, you'll be able to: - Draft review feedback from a de-identified evidence log, with a prompt that blocks invented examples - Screen AI-drafted feedback for personality comments, hedging and non-actionable language - Decide which employee information can go near an AI tool, and which cannot

Why this matters

Performance reviews are the documents employees remember and the process almost nobody trusts. A 2025 Deloitte study cited by Workday found 61 per cent of managers and 72 per cent of workers distrust their organisation's performance management process, alongside a SHRM estimate that managers spend roughly 210 hours a year on reviews.

AI is arriving in exactly this gap. SHRM's January 2024 survey of 2,366 HR professionals found about 1 in 4 organisations using AI to support HR, with performance management a minority use among adopters (25 per cent, against 64 per cent for talent acquisition). By SHRM's 2025 Talent Trends research, overall adoption had reached 43 per cent. The time saving is real: a Gartner December 2025 survey found managers using AI in performance management save an average of four hours.

The catch sits in one other Gartner number: as of October 2025, only 8 per cent of HR leaders believe their managers have the skills to use AI effectively. Tools are spreading faster than the judgement needed to run them, and reviews are where that gap costs most, because the output decides ratings, pay and careers.

The core concept: AI is the statistician, not the selector

A football club's statistician compiles the season: minutes, involvements, errors, form lines. The selector reads it, picks the side and fronts the press conference. Nobody confuses the two jobs, and no club lets the spreadsheet pick the team.

That is the working division for AI in performance reviews. The model is a tireless statistician: it can synthesise a year of evidence notes into a structured draft in seconds. The manager is the selector: the rating, the calibration position and the conversation are selection decisions, and they carry the selector's name.

Two findings make the division non-negotiable. First, evidence in or garbage out. When Textio had ChatGPT write performance feedback from thin prompts in January 2023, the output was fluent, generic and useless: no examples, nothing the employee could act on. It also guessed gender from job titles, presuming kindergarten teachers female and construction workers male. The tests ran on an early model, so treat them as a persistent risk class, not a verdict on current tools. Either way, a model with no evidence does not write your review, it writes a horoscope.

Second, drafting is not deciding. A manager who uses AI to turn reviewed evidence into clean wording, then checks every line, runs a low-risk drafting aid. A system that scores, ranks or recommends ratings is automated decision-making (ADM), a separate category with its own obligations. Most of the legal risk below attaches to the second.

The dividing line between AI drafting and human judgement in performance reviews — The model drafts. The manager decides.

Where does the bias in AI-drafted feedback come from?

AI does not invent feedback bias. It inherits it from decades of human reviews, at scale. The same pattern surfaces in recruitment, where AI in hiring needs human review before another tool.

The baseline is grim without any technology. Textio's 2022 analysis of performance feedback language found women receive 22 per cent more written personality feedback than men and are 11 times more likely to be described as "abrasive". Black women were four times more likely than white men to be labelled an "overachiever". The same report links actionable feedback to faster growth, higher pay and faster promotion; biased feedback compounds into biased careers.

A model trained on that writing reproduces it. In Textio's ChatGPT experiment, the model described men as "confident" three times as often as women, and when gender was specified it wrote roughly 15 per cent longer feedback for women, with the extra words typically adding criticism. The obvious fix fails too: asking ChatGPT to "remove bias" made the feedback less clear, not less biased. Self-debiasing is not a control. A human screen for personality comments, hedging and vague praise is.

Recency bias runs the other way: it is the one distortion where AI outperforms the unaided manager. Culture Amp lists it among the most common review distortions, mitigated by collecting evidence at multiple points through the cycle. A model is indifferent to when something happened: feed it twelve months of notes and it weights February the same as last week. That only works if the notes exist, so the workflow below starts before review season.

The workflow: evidence-first review preparation

Six steps. The first is the one most managers skip, and the one that makes the rest work.

1. Keep an evidence log through the cycle. One line per event: date, observable behaviour, outcome. "14 Mar: presented vendor analysis to steering committee, two cost-model errors, corrected same day." Ten minutes a fortnight. This is the anti-recency-bias control and the raw material for everything below.

2. De-identify before anything touches the tool. The OAIC's guidance on commercially available AI products recommends, as best practice, not entering personal information, particularly sensitive information, into publicly available generative AI tools. Replace names with placeholders ([EMPLOYEE], [PEER]) and strip anything identifying by inference. An approved enterprise tool changes the calculus, but check your AI policy first; Textio's practitioner guidance makes that step one.

3. Draft from evidence only. Use a prompt that forbids invention and anchors every claim to a logged note, built on the same prompt engineering fundamentals that keep any AI output grounded. The worked example below shows one.

4. Screen for the three bias markers. Personality comments ("abrasive", "bubbly", "not a culture fit"), hedging ("perhaps consider") and non-actionable language: praise or criticism with no example and no next step. These are Textio's screening targets.

5. Fact-check every line against the log. Textio's warning is that AI can "beautifully and confidently write garbage". APP 10 of the Privacy Act also requires reasonable steps to ensure accuracy of personal information, an obligation the OAIC links directly to hallucination and bias risks in generative AI.

6. Deliver the conversation yourself. No script read verbatim. The discussion is where judgement becomes visible: as Workday's design principle puts it, human-centric, never human-free.

Six-step evidence-first workflow for AI-assisted performance reviews — Evidence in, judgement out: the six-step review workflow.

Preparing calibration cases

Calibration sessions exist to align rating standards across managers: same criteria, same evidence test, every team. AI's legitimate role is assembling the case file. Run each person's evidence log through the same structure and prompt, so the panel compares like with like. Gaps in evidence surface before the session instead of during it. The model formats the case. It does not get a vote in the room.

Is AI in performance reviews legal in Australia?

Four things every Australian manager and HR practitioner needs on the record.

The employee records exemption is narrower than most people assume. It covers acts directly related to a current or former employment relationship and an employee record held by the employer. Disclosing employee data to a third party, including an external AI tool, can fall outside it, and a third party that collects employee records must comply with the APPs (OAIC guidance). "Exempt because it is HR data" is not a safe assumption once the data leaves your systems. Get advice for your configuration.

Transparency obligations arrive on 10 December 2026. New APP 1.7 requires privacy policies to disclose where a computer program uses personal information in decisions that could reasonably be expected to significantly affect an individual's rights or interests, with civil penalties available (Johnson Winter Slattery's analysis). A manager drafting wording with full human review is unlikely to cross that threshold; a tool that scores employees or substantially drives ratings may. Ask the question before any AI-suggested-ratings feature is switched on, because automated decisions now belong in your privacy policy.

Fair Work liability does not transfer to the algorithm. Under the Fair Work Act 2009, the employer remains liable for unfair dismissal even if an algorithm made the decision, and general protections claims carry a reverse onus of proof. Kingston Reid's commentary notes an employer "would find it difficult to overcome the hurdle of a reverse onus of proof if relying exclusively on ADM technologies", recommending human involvement in significant employment decisions. No court or Fair Work Commission decision has yet tested an AI-made performance decision; the existing provisions apply regardless. The same discipline extends to formal steps, where AI can draft a performance warning but cannot fake the facts.

The governance anchor is voluntary, for now. Australia's Voluntary AI Safety Standard sets ten guardrails, including meaningful human oversight (Guardrail 5), transparency to affected people, and challenge and redress processes. It is voluntary, and the proposed mandatory guardrails for high-risk AI remain proposals, but it is the cleanest benchmark for defensible AI use in HR.

This section is general information and education only, not legal advice. Obtain advice from a qualified professional before relying on a privacy or employment law position for a specific scenario.*

Common mistakes

Pasting the year into a personal account, names and all. The fastest way to turn a drafting shortcut into a privacy incident.

Drafting from memory instead of a log. Memory is recency-weighted; you get fluent, generic feedback about the last six weeks.

Asking the model to de-bias its own work. Textio tested this and the feedback got less clear, not less biased.

Letting the draft set the rating. Decide the rating from the evidence before you generate a word; anchoring is just bias with better formatting.

Reading the AI script at the conversation. The employee notices.

Treating the saved four hours as the point. The point is reviews that are better evidenced and easier to defend in calibration.

A worked example

A manager is preparing a half-year review for a business analyst. The evidence log has 14 dated lines and has been de-identified. The prompt, into Claude (it works identically in ChatGPT):

You are helping a manager draft performance review feedback. Use only the evidence notes below. Do not invent examples, do not infer personality traits, and do not inflate or soften. Structure the draft as two strengths and one development area. Anchor every claim to a specific note and its date. If the evidence is too thin to support a claim, write "insufficient evidence" rather than the claim. Plain Australian English. No superlatives. Refer to the employee as [EMPLOYEE] throughout.

Evidence notes: [paste the de-identified log]

What comes back is a competent first draft anchored to real events. The human pass checks each claim against the log, deletes any hedge, confirms the development area carries a next step the employee can take, and decides the rating, which the prompt deliberately never asked for. About 25 minutes against the two hours the blank page usually takes.

The advanced layer: scaling it across an HR function

For an HR practitioner running calibration across dozens of managers, the leverage is standardisation. Issue a one-page evidence log template at the start of the cycle, mandate the same drafting prompt, and have the panel read structurally identical case files. Among organisations already using AI in performance management, SHRM found 57 per cent use it to help managers provide more comprehensive or actionable feedback and 46 per cent to facilitate goal setting. Both uses live or die on input quality, and the template is the lever that controls it at scale.

The second job is vendor diligence. AI-suggested ratings or automated performance flags sit in the ADM category, and hiring and performance tools carry positive-duty obligations HR must control directly. Run the APP 1.7 significant-effect question, the reverse-onus question and the Guardrail 5 oversight test before the feature is enabled, and document who can override it. Then train managers on this workflow before the licence rollout, not after.

What must stay human in an AI-assisted review?

The list is short and it does not flex:

The rating
The calibration position argued in the room
Any pay, promotion or formal performance consequence
The conversation itself
Accountability for every word in the final document

AI compresses the drafting. It does not absorb the responsibility. Write the evidence down all year. Let the model assemble it. Make the call yourself.

Try this

Try this: Take a paragraph of feedback you wrote in your last review cycle and remove every name. Paste it into Claude or ChatGPT and ask: "Flag any personality-based comments, hedging language, or statements the employee could not act on. Explain each flag." Then rewrite the weakest sentence yourself, anchored to a specific example. Fifteen minutes, no new accounts.

Glossary

Evidence log. A running record of dated, observable work events kept through the review cycle.

Recency bias. Overweighting the most recent weeks when rating a whole cycle.

Calibration. A session where managers align rating standards by reviewing evidence across teams.

Automated decision-making (ADM). A computer program making or substantially driving a decision, rather than drafting words a human reviews.

Employee records exemption. A narrow Privacy Act exemption for employer-held employee records; it generally does not follow data disclosed to third-party tools.

Where to go next

Prompt Engineering Fundamentals 2026
Custom Projects vs Raw Chats
AI in Hiring Needs Human Review

Bottom line

AI is the statistician, not the selector: it can compress a year of de-identified evidence into a clean first draft of feedback, but the rating, the calibration position and the conversation stay human and carry your name. Draft from evidence only, screen for the three bias markers, fact-check every line, and decide the rating yourself.

Do this Monday:

Start a one-line-per-event evidence log: date, observable behaviour, outcome.
De-identify before anything touches the tool, replacing names with placeholders like [EMPLOYEE].
Draft from evidence only, with a prompt that forbids invented examples.
Screen every draft for personality comments, hedging and non-actionable language.
Decide the rating from the evidence yourself, and deliver the conversation in person.

TheAICommand. Intelligence, At Your Command.

AI in Performance Reviews: Draft the Words, Keep the Judgement