Small Language Models Show Promise in Formal Logic Reasoning for Ontology Engineering

TLDR: This research investigates the ability of Small Language Models (SLMs) to process and represent formal knowledge for logical reasoning, aiming to assist ontology engineering. The study compares natural language with various compact formal languages like CLIF, TFL+, and MINIFOL across different SLMs and training methods (SFT, Zero-Shot, Few-Shot). Findings indicate that compact formal representations, particularly CLIF, can achieve competitive performance with natural language in first-order logic reasoning tasks, especially with Supervised Fine-Tuning. While techniques like tokenizer re-training show promise for smaller models, the overall results suggest that formal languages can be a viable alternative to natural language for enhancing SLM reasoning capabilities in knowledge representation.

Language models (LMs) have made incredible strides in various natural language processing tasks, from generating text to answering complex questions. However, a persistent challenge for these models lies in their reasoning capabilities, particularly in fields like ontology engineering, the process of creating structured knowledge representations. This limitation is especially noticeable in tasks requiring explicit or implicit logical thinking.

A recent preliminary study by Hanna Abi Akl and her supervisors at Université Côte d’Azur delves into this very issue, focusing on Small Language Models (SLMs). The research aims to understand how incorporating formal methods can improve SLMs’ performance on reasoning tasks, with a long-term goal of using these models to kickstart the construction of ontologies. The core question guiding their work is: Is there a better formal representation for logical data than natural language?

Exploring Formal Languages for Logical Reasoning

To investigate this, the researchers developed a methodology that combines the Syllogistic Evaluation Framework (SEF) with the Common Logic Grammar Construction (CLGC) pipeline. The SEF classifies the logical reasoning problems in the FOLIO dataset by type: disjunctive, hypothetical, categorical, and complex syllogisms. The CLGC pipeline transforms logical data from its original First-Order Logic (FOL) form into several alternative formal languages: Common Logic Interchange Format (CLIF), Conceptual Graph Interchange Format (CGIF), Term Functor Logic (TFL), Term Functor Logic Plus (TFL+), and a custom language called Miniature First-Order Logic (MINIFOL).
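To make the idea of re-expressing FOL in a compact formal language concrete, here is a minimal sketch of rendering a formula in CLIF-style S-expressions. The node classes (`Pred`, `Not`, `BinOp`, `Quant`) and the function `to_clif` are hypothetical names chosen for illustration; this is not the CLGC pipeline from the study, only a small example of the kind of conversion it performs.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Hypothetical, minimal FOL syntax tree (an assumption for illustration only).
@dataclass(frozen=True)
class Pred:
    name: str
    args: Tuple[str, ...]

@dataclass(frozen=True)
class Not:
    body: "Formula"

@dataclass(frozen=True)
class BinOp:
    op: str  # one of "and", "or", "if", "iff"
    left: "Formula"
    right: "Formula"

@dataclass(frozen=True)
class Quant:
    kind: str  # "forall" or "exists"
    var: str
    body: "Formula"

Formula = Union[Pred, Not, BinOp, Quant]

def to_clif(f: Formula) -> str:
    """Render a formula as a CLIF-style S-expression."""
    if isinstance(f, Pred):
        return "(" + " ".join((f.name, *f.args)) + ")"
    if isinstance(f, Not):
        return f"(not {to_clif(f.body)})"
    if isinstance(f, BinOp):
        return f"({f.op} {to_clif(f.left)} {to_clif(f.right)})"
    if isinstance(f, Quant):
        return f"({f.kind} ({f.var}) {to_clif(f.body)})"
    raise TypeError(f"unsupported node: {f!r}")

# FOL: ∀x (Human(x) → Mortal(x))
rule = Quant("forall", "x", BinOp("if", Pred("Human", ("x",)), Pred("Mortal", ("x",))))
print(to_clif(rule))  # (forall (x) (if (Human x) (Mortal x)))
```

The prefix notation drops infix symbols and most punctuation, which is one plausible reason a compact format like CLIF is easy for a small model to consume.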

The study experimented with a range of SLMs, including Flan-T5-small, Flan-T5-base, Flan-T5-large, GPT-2, Phi-3.5-mini-instruct, and Gemma-2-2b-it. They tested these models across different learning methods: Supervised Fine-Tuning (SFT), Zero-Shot (ZS) Prompting, and Few-Shot (FS) Prompting. The objective was to determine the truth value of a logical conclusion (True, False, or Uncertain) based on a given set of premises, using the various formal language representations as input.
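The task setup can be sketched as a three-way classification prompt. The code below is an assumption about how such inputs might be assembled, not the prompt template used in the study: `build_prompt` and `parse_label` are hypothetical helpers, and passing no solved examples gives a zero-shot prompt while passing some gives a few-shot prompt.

```python
from typing import Optional, Sequence, Tuple

LABELS = ("True", "False", "Uncertain")

def build_prompt(premises: Sequence[str], conclusion: str,
                 shots: Sequence[Tuple[Sequence[str], str, str]] = ()) -> str:
    """Build a zero-shot prompt (no shots) or a few-shot prompt (with solved examples)."""
    def block(prem: Sequence[str], concl: str, answer: str = "") -> str:
        lines = ["Premises:"] + [f"- {p}" for p in prem]
        lines += [f"Conclusion: {concl}", f"Answer: {answer}".rstrip()]
        return "\n".join(lines)

    header = "Decide whether the conclusion is True, False, or Uncertain given the premises."
    parts = [header] + [block(p, c, a) for p, c, a in shots] + [block(premises, conclusion)]
    return "\n\n".join(parts)

def parse_label(output: str) -> Optional[str]:
    """Map raw model text to one of the three labels, or None if none is found."""
    words = output.strip().lower().replace(".", " ").split()
    for word in words:
        for label in LABELS:
            if word == label.lower():
                return label
    return None

prompt = build_prompt(
    ["(forall (x) (if (Human x) (Mortal x)))", "(Human socrates)"],
    "(Mortal socrates)",
)
print(prompt)
```

The premises and conclusion here are written in CLIF, but the same helper works for natural language or any of the other representations, since it treats them as opaque strings.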

Key Findings: The Power of Compact Formalisms

The results revealed several interesting insights. In the Supervised Fine-Tuning (SFT) setting, formalizing premises and conclusions in CLIF proved highly effective, often tying or even outperforming Natural Language (NL) in accuracy. This suggests that SLMs can reason well over formalisms that are more compact and structured than verbose natural language. Languages with more complex syntaxes, like CGIF, generally showed weaker performance, indicating that simpler formal structures might be easier for SLMs to process.

Interestingly, the smallest model tested, Flan-T5-small, sometimes achieved the best performance, even surpassing larger fine-tuned models. This hints that a larger model is not always better for this specific type of reasoning task. In Zero-Shot (ZS) prompting, compact languages like CLIF and TFL+ also demonstrated competitive results, and augmenting prompts with BNF grammar rules generally improved SLM performance.
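As a rough picture of what augmenting a prompt with BNF grammar rules could look like, the sketch below prepends a small grammar to a task prompt. The grammar `CLIF_BNF` is an illustrative fragment for a CLIF-like subset written for this example; it is neither the grammar from the Common Logic standard nor the one used in the study, and `with_grammar` is a hypothetical helper.

```python
# Illustrative BNF fragment for a CLIF-like subset (an assumption, not the paper's grammar).
CLIF_BNF = """\
<sentence>   ::= <atom>
               | "(" "not" <sentence> ")"
               | "(" <connective> <sentence> <sentence> ")"
               | "(" <quantifier> "(" <name> ")" <sentence> ")"
<connective> ::= "and" | "or" | "if" | "iff"
<quantifier> ::= "forall" | "exists"
<atom>       ::= "(" <name> <names> ")"
<names>      ::= <name> | <name> <names>
"""

def with_grammar(task_prompt: str, grammar: str = CLIF_BNF) -> str:
    """Prepend grammar rules so the model sees the syntax before the problem."""
    return f"Grammar of the input language:\n{grammar.strip()}\n\n{task_prompt}"

augmented = with_grammar("Premises:\n- (Human socrates)\nConclusion: (Mortal socrates)\nAnswer:")
print(augmented)
```

The grammar costs extra tokens in every prompt, so whether it helps depends on how much context the model can use effectively.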

The researchers also explored advanced techniques like In-Context Grammar Passing (ICGP) and Tokenizer Re-Training. ICGP, where the BNF grammar was provided as additional context during SFT, surprisingly hindered learning and degraded model performance. Tokenizer re-training, which adapts the model’s vocabulary to the specific grammar, showed promise for smaller models and very compact data representations (like Flan-T5-small with TFL+). However, this method did not scale well to larger models, potentially leading to overfitting and reduced generalization.
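The intuition behind adapting a vocabulary to a grammar can be shown with a toy example. The code below is not the tokenizer re-training procedure from the study and does not use any real tokenizer library; it is a greedy longest-match tokenizer, written for illustration, that compares a character-level vocabulary with one extended by a few grammar keywords (the keyword list is an assumption).

```python
from typing import Iterable, List

def tokenize(text: str, vocab: Iterable[str]) -> List[str]:
    """Greedy longest-match tokenization; unmatched characters become single-character tokens."""
    pieces = sorted(set(vocab), key=len, reverse=True)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Take the longest vocabulary entry that matches at position i, else one character.
        match = next((p for p in pieces if p and text.startswith(p, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens

formula = "(forall (x) (if (Human x) (Mortal x)))"

# Character-level stand-in for a vocabulary that was not adapted to the grammar.
base_vocab = set(formula)
# The same vocabulary plus grammar keywords, standing in for an adapted tokenizer.
adapted_vocab = base_vocab | {"forall", "exists", "if", "iff", "and", "not"}

base_tokens = tokenize(formula, base_vocab)
adapted_tokens = tokenize(formula, adapted_vocab)
print(len(base_tokens), len(adapted_tokens))
```

With grammar keywords available as single tokens, the same formula is encoded in fewer tokens, which is the kind of gain that matters most when the model and its context budget are small.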

An analysis using the Syllogistic Evaluation Framework (SEF) showed that models performed well on Disjunctive, Hypothetical, and Complex syllogisms. TFL+ even showed a slight edge over NL and CLIF for Complex syllogisms, possibly because its compact nature handles the ambiguity of multi-premise problems better. Categorical syllogisms, being under-represented in the dataset, yielded less conclusive results.
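A per-type breakdown of this kind amounts to grouping predictions by syllogism category before computing accuracy. The sketch below assumes each evaluated example is available as a (type, gold label, predicted label) triple; the function name `accuracy_by_type` and the sample records are invented for illustration and are not results from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def accuracy_by_type(records: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute accuracy per syllogism type from (type, gold_label, predicted_label) triples."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for kind, gold, pred in records:
        total[kind] += 1
        correct[kind] += int(gold == pred)
    return {kind: correct[kind] / total[kind] for kind in total}

# Invented toy records, not data from the study.
records = [
    ("disjunctive", "True", "True"),
    ("disjunctive", "False", "True"),
    ("categorical", "Uncertain", "Uncertain"),
]
print(accuracy_by_type(records))
```

A category with very few examples, like the single categorical record here, gives an accuracy that can swing widely with one prediction, which is why an under-represented type yields less conclusive results.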

Implications and Future Directions

The study concludes that while natural language remains a strong baseline, compact and formal representations like CLIF can effectively challenge it for first-order reasoning tasks in SLMs. This is a significant finding, especially since these results were achieved with small, frugal language models (under 3 billion parameters), making them accessible for practical applications. The research confirms that Supervised Fine-Tuning is currently the most stable and effective training method for these tasks, and that controlled formal languages generally scale well with model size.

Looking ahead, the PhD research will explore combining different input representations (e.g., NL + CLIF) to see if a blend of expressiveness and formal structure can further enhance reasoning. Another exciting direction involves injecting knowledge from high-level ontologies into SLMs using formal languages to facilitate ontology extension. More details are available in the full paper.
