
Bridging the Gap: How Researchers See Large Language Models Reshaping Data Discovery

TLDR: This research paper explores how Large Language Models (LLMs) could transform data discovery, moving beyond traditional keyword matching to natural language queries. Through focus groups with 27 researchers, the study identifies LLMs’ potential benefits, such as streamlining the search process, improving user experience, and inspiring new research. However, it also uncovers significant barriers to acceptance, including concerns about data biases, unreliable results (like hallucinations), ethical issues (e.g., access inequality, privacy, environmental costs), and a general lack of trust. The paper concludes that transparency in LLM data, model workings, and response generation is crucial for overcoming these barriers and fostering researcher acceptance, emphasizing that LLMs should support, not replace, human judgment.

Data discovery, the process by which researchers find existing datasets for their work, has traditionally relied on keyword matching. This method often requires researchers to know the exact terms used by others, making the process challenging and potentially leading to missed relevant information. A recent study, “From keywords to semantics: Perceptions of large language models in data discovery,” explores how Large Language Models (LLMs) could change this landscape by allowing natural language queries and a deeper understanding of data semantics.
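The paper describes this shift conceptually rather than prescribing an implementation, but the difference is easy to illustrate. The minimal sketch below contrasts naive keyword matching with embedding-based semantic ranking over a toy catalogue of dataset descriptions; the catalogue entries, the example query, and the choice of the sentence-transformers library and model are illustrative assumptions, not drawn from the study.

```python
# Minimal sketch (not from the paper): keyword matching vs. semantic matching
# over a toy catalogue of dataset descriptions.
from sentence_transformers import SentenceTransformer, util  # assumed available

catalogue = {
    "lsoa_fuel_poverty_2022": "Sub-regional fuel poverty estimates for England, 2022",
    "epc_domestic_certificates": "Domestic Energy Performance Certificates by property",
    "household_income_model": "Modelled net household income at small-area level",
}

query = "Which areas struggle to afford heating their homes?"

# Keyword matching: only succeeds if the researcher's words appear in the description.
keyword_hits = [name for name, desc in catalogue.items()
                if any(word in desc.lower() for word in query.lower().split())]
print("keyword hits:", keyword_hits)  # likely empty or noisy

# Semantic matching: embed the query and the descriptions, then rank by similarity,
# so "afford heating" can surface "fuel poverty" without sharing any keywords.
model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = model.encode(query, convert_to_tensor=True)
desc_vecs = model.encode(list(catalogue.values()), convert_to_tensor=True)
scores = util.cos_sim(query_vec, desc_vecs)[0]
ranked = sorted(zip(catalogue.keys(), scores.tolist()), key=lambda pair: -pair[1])
print("semantic ranking:", ranked)
```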

The research, conducted by Maura E. Halstead, Mark A. Green, Caroline Jay, Richard Kingston, David Topping, and Alexander Singleton, aimed to understand how researchers perceive the use of LLMs for data discovery. Taking a human-centered artificial intelligence (HCAI) approach, the team ran focus groups with 27 participants, including doctoral students, academic researchers, data service staff, and government and third-sector researchers. Participants were asked to imagine using LLMs to find data on topics such as fuel poverty and to compare that experience with their current keyword-based methods.

The Promise of LLMs in Data Discovery

The study revealed that researchers see significant potential in LLMs to transform the data discovery process. One key area is enhancing the interactive process. Currently, researchers spend considerable time narrowing down research questions and selecting keywords through trial and error. LLMs could assist in identifying research questions and allow for more conversational, natural language queries, starting with the desired outcomes rather than specific keywords. This could also inspire new research ideas and help find under-utilized datasets.

LLMs also promise to improve the user experience. Participants highlighted that LLMs could significantly speed up data discovery by quickly analyzing data quality, assessing relevance to a research question, and even flagging data permission issues upfront. The ability to customize LLMs to search specific catalogs, and to make data more accessible to non-native English speakers and non-specialists, was also seen as a major benefit. However, researchers still want a search results page so they can make the final decision themselves, rather than having the LLM dictate the choice.
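As a rough illustration of the catalog-customization idea, the sketch below shows one way an LLM could be grounded in a specific catalogue: retrieve candidate records first, place them in the prompt, and ask the model to explain relevance while leaving the final choice to the researcher. The `search_catalogue` and `call_llm` functions are placeholders for an institutional catalogue index and whatever hosted or local model is chosen; none of these names come from the paper.

```python
# Illustrative sketch (not the paper's system): constraining an LLM to a
# specific data catalogue and returning candidates for the researcher to judge.

def search_catalogue(question: str, top_k: int = 5) -> list[dict]:
    """Placeholder: in practice this would query an institutional catalogue index."""
    return [
        {"id": "lsoa_fuel_poverty_2022",
         "title": "Sub-regional fuel poverty estimates, 2022",
         "licence": "Open Government Licence",
         "updated": "2023-04"},
    ][:top_k]

def call_llm(prompt: str) -> str:
    """Placeholder for whichever hosted or local model an institution chooses."""
    return "(model response would appear here)"

def build_prompt(question: str, candidates: list[dict]) -> str:
    """Assemble a prompt that restricts the model to the retrieved catalogue entries."""
    listing = "\n".join(
        f"- {c['id']}: {c['title']} ({c['licence']}, updated {c['updated']})"
        for c in candidates
    )
    return (
        "You are helping a researcher discover existing datasets.\n"
        f"Research question: {question}\n"
        "Only refer to the catalogue entries below; do not invent datasets.\n"
        f"{listing}\n"
        "For each entry, briefly explain its relevance and any access restrictions, "
        "then list all entries so the researcher can make the final choice."
    )

candidates = search_catalogue("fuel poverty at neighbourhood level")
print(call_llm(build_prompt("fuel poverty at neighbourhood level", candidates)))
```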

Significant Barriers to Acceptance

Despite the promising benefits, the study identified several fundamental concerns that could prevent researchers from fully embracing LLMs. A major barrier is the issue of biases. Participants worried that the quality of the underlying data used to train LLMs would directly impact the quality of outputs. Incomplete, missing, or poorly parsed older datasets could lead to bad information. There was also concern that LLMs might prioritize popular datasets, hindering the discovery of knowledge gaps and reinforcing existing biases in a feedback loop.

Unreliable results were another significant concern. Researchers noted that LLMs often present information with a convincing and authoritative tone, even when incorrect or when generating “hallucinated” (fake) datasets. This persuasive tone could mislead users into believing false information. Ethical considerations also emerged, such as the potential for free versus paid LLM versions to create an information divide for developing countries, security risks if sensitive queries are harvested, unintentional plagiarism due to unclear attribution, and the unknown environmental costs of running LLMs.

Ultimately, a general lack of trust in LLMs was evident. Participants expressed a need to “double-check” findings with other tools and noted that inconsistent responses from LLMs further eroded their confidence. This suggests that LLMs are currently seen more as validation tools than as primary search engines.

Transparency as the Key to Overcoming Barriers

The study found that most of these barriers could be overcome through transparency. Researchers want clear information about the LLM’s training data, including its sources and reliability. They also desire transparency about the model itself, such as its limitations, version numbers, release dates, and who is responsible for its development and oversight. Knowing the model’s confidence level in its answers and how it processes queries to find datasets would also build trust.

Response transparency is equally crucial. Participants frequently requested direct links to the actual datasets and their metadata. They also want LLMs to provide summaries of this information and allow them to explore citation metrics to understand how data has been used elsewhere. This level of transparency would enable researchers to validate findings and make informed decisions, ensuring that LLMs augment, rather than replace, human judgment.
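One way to picture this kind of response transparency is as a structured answer rather than free text, with each suggestion carrying a direct link, a metadata summary, a citation count, and provenance, plus the model version and its stated confidence at the top level. The schema, field names, and values below are illustrative assumptions only; the paper does not define such a format.

```python
# Hedged sketch of a transparent discovery response; all names and values are illustrative.
from dataclasses import dataclass, field

@dataclass
class DatasetSuggestion:
    title: str
    landing_page: str          # direct link to the dataset, not a paraphrase of it
    metadata_summary: str      # coverage, licence, update frequency, known gaps
    citation_count: int        # how often the dataset has been reused elsewhere
    source_catalogue: str      # where the record was actually found

@dataclass
class DiscoveryResponse:
    query: str
    model_version: str         # which model and release answered the query
    confidence: float          # the model's own stated confidence, 0 to 1
    suggestions: list[DatasetSuggestion] = field(default_factory=list)

example = DiscoveryResponse(
    query="small-area fuel poverty indicators for England",
    model_version="example-model-2024-06",
    confidence=0.72,
    suggestions=[
        DatasetSuggestion(
            title="Sub-regional fuel poverty estimates, 2022",
            landing_page="https://example.org/catalogue/fuel-poverty-2022",
            metadata_summary="LSOA level, England only, annual updates, OGL licence",
            citation_count=41,
            source_catalogue="example institutional data catalogue",
        )
    ],
)
print(example.suggestions[0].landing_page)
```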

In conclusion, while LLMs offer transformative potential for data discovery, their acceptance hinges on addressing concerns around bias, reliability, ethics, and trust. The research highlights that co-designing LLM-augmented tools with users, with a focus on transparency and accountability, is essential for creating systems that truly support and enhance the research process. You can read the full paper here: From keywords to semantics: Perceptions of large language models in data discovery.

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
