TLDR: The Digital Transformation Agency (DTA) in Australia is piloting an AI-human pairing model to streamline the review process for applications to its Digital Marketplace 2 (DM2) panel. A proof-of-concept showed that a large language model agreed with a human assessor's rating 84% of the time (versus 81% for two human assessors) and with a smaller average rating difference (0.76 versus 0.92). The initiative aims to improve consistency and efficiency in evaluating the approximately 20,000 applications received annually, while keeping final decisions with humans.
The Australian federal government’s Digital Transformation Agency (DTA) is advancing its efforts to integrate artificial intelligence into its procurement processes, specifically trialing AI to assist in reviewing applications for its Digital Marketplace 2 (DM2) panel. This initiative, reported by Eleanor Dickinson on iTnews on August 28, 2025, involves a proof-of-concept that utilizes a large language model (LLM) to work alongside human assessment officers in evaluating IT supplier case studies.
The DM2 panel, launched in October 2024, serves as a government-wide procurement framework for IT labour hire, professional, and consulting services, attracting a substantial volume of approximately 20,000 applications annually. Ben Bildstein, former DTA principal technology advisor, highlighted the scale and importance of this work during the AI Government Showcase in Canberra, noting the potential for AI to assist in rating these applications.
Initial considerations to fully automate application evaluations using AI were dismissed after reviewing government procurement policies, standards, and AI ethics guidelines. Bildstein explicitly stated, “Pretty simply, AI can’t do that. It can’t evaluate an application in a procurement context for you – that’s a human’s job.” Consequently, the DTA opted for an ‘AI-human pairing model’ for its pilot program, with the aim of going live later this year.
The proof-of-concept focused on assessing supplier case studies, which traditionally involve two human reviewers independently rating work on a scale of one to five. The DTA established a benchmark: two human caseworkers agree, within a one-point margin on that scale, 81 percent of the time, with that margin considered sufficient agreement. By comparison, the AI's rating fell within the same margin of a human assessor's rating 84 percent of the time.
Further analysis showed that the average rating difference between two human assessors was 0.92 points. Between a human assessor and the AI, that difference dropped to 0.76, meaning the AI's ratings were, on average, closer to a human's than human assessors' ratings were to each other. Bildstein remarked, “So, we’re getting a little bit more consistency with a human and an AI.” Where the AI and a human disagree (roughly 16 percent of cases), the AI’s assessment is discarded and two human reviewers conduct the evaluation, mirroring the DTA’s established practice. Correlation, which measures similarity in overall ranking, likewise showed the AI to be more consistent with human ratings than human reviewers were with each other.
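The two metrics the DTA compares (agreement within one point on the five-point scale, and average absolute rating difference) can be sketched as a small computation. This is a minimal illustration, not the DTA's actual evaluation code; the function names and rating data below are invented for the example.

```python
# Sketch of the two comparison metrics described above: agreement within a
# one-point tolerance, and mean absolute difference between paired ratings.
# All ratings here are invented for illustration (DM2 scale: 1 to 5).

def agreement_rate(a, b, tolerance=1):
    """Fraction of paired ratings that fall within `tolerance` points."""
    matches = sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)
    return matches / len(a)

def mean_abs_difference(a, b):
    """Average absolute difference between paired ratings."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Invented ratings for ten case studies.
human_ratings = [3, 4, 2, 5, 3, 4, 1, 3, 5, 2]
ai_ratings    = [3, 5, 2, 4, 1, 4, 2, 3, 5, 3]

print(f"agreement within 1 point: {agreement_rate(human_ratings, ai_ratings):.0%}")
print(f"mean absolute difference: {mean_abs_difference(human_ratings, ai_ratings):.2f}")
```

On the DTA's figures, two humans score 81% and 0.92 on these metrics, while a human paired with the AI scores 84% and 0.76.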
This pilot represents a strategic move by the DTA to leverage AI for efficiency and consistency in high-volume administrative tasks, while ensuring critical human judgment and ethical considerations remain central to the procurement process.