Peer review is a conference ops problem. Learn how author feedback and reviewer rewards can scale evaluations—plus what event teams can copy now.
Major AI conferences are drowning in submissions. NeurIPS reportedly crossed 30,000 papers in a single year, and ICLR saw a 59.8% jump year-over-year. When volume climbs that fast, quality control doesn’t bend—it breaks.
If you work in conference planning or event operations, that might sound like an academic problem. It isn’t. The peer review system is basically a high-stakes, high-volume decision workflow: assign tasks, collect evaluations, manage bias, prevent fraud, and produce outcomes people trust. That’s the same operational shape as attendee matchmaking, agenda scheduling, exhibitor lead scoring, or speaker selection—just with more math and more reputational risk.
This post is part of our AI for Event Management: Conference Intelligence series, and it focuses on one of the most practical proposals I’ve seen for fixing overloaded evaluation pipelines: ICML 2025’s outstanding position paper arguing for author feedback and reviewer rewards. I’ll translate what it means for conference teams, what could go wrong, and how to apply the same ideas to event programs that need to scale without losing credibility.
Why peer review is an event ops problem (not just academia)
Peer review fails for the same reason many conference workflows fail: demand scales faster than capacity.
In research conferences, that looks like too many papers and too few qualified reviewers. In event management, it shows up as:
- Too many session proposals for a small program committee
- Sponsor packages that require “fair” lead allocation but lack transparent rules
- Too many attendee meetings to schedule manually
- Too much content moderation for too few staff
The shared risk is trust collapse. When submitters (authors, speakers, startups, sponsors) believe decisions are random or biased, they stop investing. In peer review, that means weaker science and more noise in the record. In events, it means fewer high-quality proposals, lower sponsor renewals, and more public complaints.
A line from the ICML discussion lands hard: peer review is a “gatekeeper of scientific knowledge.” In events, your selection and scheduling process is a gatekeeper of attention—and attention is the currency that drives ticket sales and leads.
ICML’s proposal in plain English: add a feedback loop and pay for quality
The paper by Jaeho Kim and coauthors proposes two changes:
- Author feedback on review quality (accountability)
- Rewards for reviewers who do a good job (motivation)
That combination matters. Most organizations attempt only one side:
- Add more rules (accountability) without incentives → people comply minimally.
- Add perks (incentives) without measurement → people game it.
Their stance is blunt and correct: a small fraction of low-effort or careless reviews can poison the entire system when the scale is huge.
The “two-way peer review” mechanism (two-phase release)
Their most operationally interesting idea is the two-phase release of reviews.
Instead of releasing the full review at once, the process splits into:
- Phase 1 (neutral sections): summary, strengths, and questions.
- Author feedback: authors rate whether the reviewer understood the work and behaved professionally.
- Phase 2 (judgment sections): weaknesses and numeric ratings released after feedback.
This is a simple but powerful design move: collect comprehension signals before you reveal final judgments.
In event terms, it’s like asking a committee member to first provide:
- A factual recap of a talk proposal
- What’s strong about it
- Clarifying questions
…and only then allowing them to assign acceptance scores.
That sequence is hard to fake. If someone didn’t read the proposal, they can’t produce a coherent recap and questions without exposing themselves.
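To make the gating concrete, here is a minimal sketch in Python of how a two-phase release could be modeled inside a submission or review tool. The field names and the dict-shaped feedback are assumptions for illustration, not the ICML implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Review:
    # Phase 1 (neutral sections): visible to the author immediately
    summary: str
    strengths: str
    questions: str
    # Phase 2 (judgment sections): withheld until author feedback is recorded
    weaknesses: str
    score: int

@dataclass
class ReviewThread:
    review: Review
    # Hypothetical feedback shape, e.g. {"understood_work": 4, "professional_tone": 5}
    author_feedback: Optional[dict] = None

    def visible_to_author(self) -> dict:
        """Release neutral sections first; reveal judgment only after feedback exists."""
        phase1 = {
            "summary": self.review.summary,
            "strengths": self.review.strengths,
            "questions": self.review.questions,
        }
        if self.author_feedback is None:
            return phase1
        return {**phase1, "weaknesses": self.review.weaknesses, "score": self.review.score}

# Usage
thread = ReviewThread(Review("Studies X with method Y...", "Clear framing",
                             "How was Y measured?", "Baselines are weak", 5))
print(list(thread.visible_to_author()))   # phase 1 sections only
thread.author_feedback = {"understood_work": 4, "professional_tone": 5}
print(list(thread.visible_to_author()))   # full review after feedback
```

The point is the ordering: the comprehension signal is captured before the judgment is revealed, which is what keeps the feedback honest.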
What conference organizers can borrow: “feedback loops” as conference intelligence
For event planners, the big transferable idea is this:
If you want scalable evaluation, you need measurable feedback loops that detect low-quality work early.
Here are three concrete ways to apply it across conference intelligence workflows.
1) Speaker and session selection: require “understanding before scoring”
If your program committee is overwhelmed, introduce a lightweight, two-step template:
- Step A (neutral): “What is this talk about?”, “Who is it for?”, “What evidence is provided?”
- Step B (evaluative): “Should we accept it?”, “How strong is the fit?”, “What’s the risk?”
Then add a submitter-facing feedback channel focused on review quality, not “did I get accepted.” For example, rejected speakers can rate whether feedback was specific, accurate, and respectful.
Use that data internally to:
- Identify reviewers who consistently misunderstand submissions
- Improve reviewer training and calibration
- Assign high-stakes submissions to the most reliable evaluators
This is the same logic as the ICML proposal: make reviewer accountability measurable at scale.
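Here is a sketch of what that internal tracking could look like, in Python with made-up reviewer names and 1-5 scales; the exact questions and thresholds are assumptions you would tune for your own committee:

```python
from collections import defaultdict
from statistics import mean

# Each record is one submitter's rating of one review:
# did the reviewer understand the proposal, and were the comments specific?
feedback = [
    {"reviewer": "alice", "comprehension": 5, "specificity": 4},
    {"reviewer": "alice", "comprehension": 4, "specificity": 5},
    {"reviewer": "bob",   "comprehension": 2, "specificity": 2},
    {"reviewer": "bob",   "comprehension": 1, "specificity": 3},
]

def reviewer_reliability(records):
    """Average comprehension and specificity per reviewer, to guide assignments and training."""
    by_reviewer = defaultdict(list)
    for r in records:
        by_reviewer[r["reviewer"]].append((r["comprehension"], r["specificity"]))
    return {
        name: {
            "avg_comprehension": round(mean(c for c, _ in scores), 2),
            "avg_specificity": round(mean(s for _, s in scores), 2),
            "reviews_rated": len(scores),
        }
        for name, scores in by_reviewer.items()
    }

print(reviewer_reliability(feedback))
# Low comprehension averages trigger calibration support;
# high scorers get the high-stakes submissions.
```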
2) Attendee matchmaking and hosted meetings: reward good “decision hygiene”
Matchmaking systems often fail because stakeholders don’t trust the matches. The fix isn’t only better models—it’s better governance.
Borrow the reviewer reward concept by tracking behaviors like:
- Response rate to meeting requests
- Timeliness of confirmations
- Post-meeting feedback completeness
- Quality ratings from counterparties
Then create visible recognition:
- “Top Connector” badges in the attendee profile
- Priority access to premium matchmaking features
- Early booking windows for high-quality participants
The goal is the same as peer review rewards: make helpful behavior valuable and visible.
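One way to operationalize this is a simple weighted score, as sketched below. The metric names and weights are assumptions for illustration, not an industry standard, and each input is assumed to be normalized to a 0-1 range:

```python
# "Decision hygiene" score for matchmaking participants.
HYGIENE_WEIGHTS = {
    "response_rate": 0.3,        # share of meeting requests answered
    "on_time_rate": 0.2,         # share of confirmations sent before the deadline
    "feedback_completion": 0.2,  # share of post-meeting feedback forms completed
    "counterparty_rating": 0.3,  # average rating from meeting partners, scaled to 0-1
}

def hygiene_score(metrics: dict) -> float:
    """Weighted sum of normalized (0-1) behavior metrics."""
    return round(sum(HYGIENE_WEIGHTS[k] * metrics[k] for k in HYGIENE_WEIGHTS), 3)

participant = {
    "response_rate": 0.9,
    "on_time_rate": 0.8,
    "feedback_completion": 0.7,
    "counterparty_rating": 0.85,
}

print(hygiene_score(participant))  # 0.825 -> e.g. award "Top Connector" at >= 0.8
```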
3) Sponsor lead quality: create two-way scoring (and stop pretending it’s one-way)
Many events treat sponsor satisfaction as purely a function of lead volume. That backfires because sponsors care about relevance and intent.
A two-way system looks like:
- Sponsors rate lead quality and follow-up readiness
- Attendees rate sponsor outreach quality and relevance
Over time you build a reputation system that reduces spammy booth behavior and improves attendee experience—while still delivering leads.
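A minimal sketch of the two-way scoring data, with hypothetical sponsor and attendee identifiers and 1-5 ratings on both sides:

```python
from statistics import mean

# Each interaction produces two ratings: the sponsor rates the lead,
# and the attendee rates the sponsor's outreach.
interactions = [
    {"sponsor": "acme", "attendee": "a1", "lead_quality": 4, "outreach_quality": 5},
    {"sponsor": "acme", "attendee": "a2", "lead_quality": 5, "outreach_quality": 2},
    {"sponsor": "zeta", "attendee": "a1", "lead_quality": 2, "outreach_quality": 1},
]

def sponsor_reputation(records):
    """Average attendee-side rating of each sponsor's outreach."""
    ratings = {}
    for r in records:
        ratings.setdefault(r["sponsor"], []).append(r["outreach_quality"])
    return {s: round(mean(v), 2) for s, v in ratings.items()}

def lead_quality_by_attendee(records):
    """Average sponsor-side rating of each attendee as a lead."""
    ratings = {}
    for r in records:
        ratings.setdefault(r["attendee"], []).append(r["lead_quality"])
    return {a: round(mean(v), 2) for a, v in ratings.items()}

print(sponsor_reputation(interactions))        # flags spammy outreach
print(lead_quality_by_attendee(interactions))  # flags low-intent leads
```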
This is conference intelligence in practice: your event becomes a learning system.
Reviewer rewards: why “badges” and “impact scores” work (and where they don’t)
Kim’s paper proposes two reward layers:
- Short-term rewards: visible digital badges (example: “Top 10% Reviewer”) shown on academic profiles.
- Long-term rewards: metrics like a “reviewer impact score,” described as an h-index-style measure based on citations of papers reviewed.
The exact metric might evolve, but the philosophy is right:
If your system depends on expert labor, you need to treat quality work as a first-class output.
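The paper frames the long-term metric as h-index-style, so as an illustration here is a standard h-index computation applied to hypothetical citation counts of papers someone has reviewed. This is my reading of the idea, not the authors' exact formula:

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that at least h values are >= h (standard h-index definition)."""
    h = 0
    for i, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Hypothetical: citation counts of papers this person reviewed (not authored).
papers_reviewed_citations = [120, 45, 30, 9, 4, 2, 0]
print(h_index(papers_reviewed_citations))  # 4 -> a "reviewer impact score" of 4 under this reading
```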
How to translate rewards into event ops
Event teams can implement analogous rewards without creating perverse incentives:
- Program committee recognition displayed on the event site, badges in internal tools, or special seating/invites
- Review quality scores used for future committee selection
- Career value for volunteers: documented service credits, reference letters, or formal community roles
One thing I strongly agree with: burying recognition on a hard-to-find webpage doesn’t change behavior. Visibility matters.
Gaming and bias: the failure modes you should plan for upfront
The proposal openly flags a real risk: gaming.
If reviewers are rewarded based on author feedback, reviewers might become overly positive to “win” better ratings. In events, the equivalent is curators accepting safe, popular talks to avoid complaints, or sponsors optimizing for vanity metrics.
Here’s how I’d design guardrails (and I’d do these from day one):
1) Separate “civility” from “agreement”
Ask submitters to rate things like:
- Did the reviewer accurately summarize the work?
- Were comments specific and actionable?
- Was the tone professional?
Avoid “Was the decision fair?” as a primary score. That measures disappointment, not quality.
2) Use anomaly detection, not just averages
In overloaded systems, you don’t need perfect measurement—you need to find outliers fast:
- Reviewers whose summaries don’t match submissions
- Reviewers whose scores have suspiciously low variance
- Reviewers consistently rated poorly across many submissions
This is where AI can help responsibly: flag patterns for human chairs to inspect.
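A sketch of what that flagging could look like, using only score variance and submitter ratings. The thresholds and data shape are assumptions, and the output is meant for chairs to inspect rather than to act on automatically:

```python
from statistics import mean, pstdev

# Per-reviewer data: numeric scores they assigned, and quality ratings
# they received from submitters (1-5).
reviewers = {
    "r1": {"scores": [6, 3, 7, 4, 8], "quality_ratings": [4, 5, 4]},
    "r2": {"scores": [5, 5, 5, 5, 5], "quality_ratings": [3, 2, 2]},  # suspicious
}

def flag_outliers(data, min_score_std=0.5, min_quality=3.0):
    """Return reviewers whose behavior deserves a human look, with reasons."""
    flags = {}
    for name, d in data.items():
        reasons = []
        if pstdev(d["scores"]) < min_score_std:
            reasons.append("score variance suspiciously low")
        if mean(d["quality_ratings"]) < min_quality:
            reasons.append("consistently rated poorly by submitters")
        if reasons:
            flags[name] = reasons
    return flags

print(flag_outliers(reviewers))  # only r2 is flagged, on both counts
```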
3) Add calibration rounds
Before full reviewing starts, run a small batch where multiple reviewers evaluate the same submissions. Use that to align standards and identify who needs support.
If you run large events, this is the operational equivalent of rehearsals: it saves you later.
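A minimal calibration sketch: measure each reviewer's average absolute deviation from the panel median on the shared batch. The names, scores, and any threshold you act on are assumptions for illustration:

```python
from statistics import median, mean

# Pilot batch: several reviewers score the same submissions (1-10 scale).
calibration_scores = {
    "talk_A": {"alice": 7, "bob": 6, "carol": 3},
    "talk_B": {"alice": 5, "bob": 5, "carol": 2},
    "talk_C": {"alice": 8, "bob": 7, "carol": 4},
}

def calibration_drift(batch):
    """Average absolute deviation of each reviewer from the panel median."""
    deviations = {}
    for scores in batch.values():
        panel_median = median(scores.values())
        for reviewer, score in scores.items():
            deviations.setdefault(reviewer, []).append(abs(score - panel_median))
    return {r: round(mean(d), 2) for r, d in deviations.items()}

print(calibration_drift(calibration_scores))
# carol's drift (about 3.0) suggests she needs a calibration conversation before the real round
```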
The uncomfortable future: when LLMs can review better than humans
The interview ends with a forward-looking dilemma: LLMs are already part of the peer review conversation, and they may eventually surpass human reviewers. If that happens, why would humans review at all?
For event management, the parallel question is already here:
- If AI can rank proposals, schedule sessions, and predict engagement better than committees, what’s the human role?
My take: humans don’t go away; their job changes. Humans become:
- The designers of the evaluation rules
- The auditors of edge cases
- The owners of accountability when outcomes harm trust
AI can help scale decision pipelines, but only if you build the same thing Kim is arguing for: feedback loops and incentives that keep quality high.
Practical next steps for event teams (lead-friendly, not salesy)
If you’re running a call for papers, speaker submissions, awards, startup challenges, or any high-volume selection process, you can test this approach in one cycle.
- Split your evaluation form into a neutral “understanding” section and a judgment section.
- Collect submitter feedback on review quality using comprehension-focused questions.
- Create visible recognition for evaluators who consistently produce high-quality reviews.
- Monitor gaming signals (suspicious positivity, low-effort summaries, outlier patterns).
- Use AI for triage and auditing, not for final authority—at least until your governance is mature.
If your goal is leads, this is also a clean story to tell sponsors and stakeholders: you’re not just “using AI,” you’re building a credible, scalable decision system that protects participant trust.
Peer review is a warning flare. Conferences that treat evaluation as an afterthought will watch quality degrade year after year. Conferences that instrument their workflows—feedback, rewards, audit trails—will scale.
What part of your event’s decision pipeline would improve fastest if you added one measurable feedback loop next month?