AI’s Performance in Patent Law: A Reality Check on Language Models

TLDR: A study evaluated various large language models (LLMs) on the European Qualifying Examination (EQE) for patent attorneys. While OpenAI o1 showed the highest accuracy (0.82), no model achieved the 0.90 professional standard. Human experts found that LLMs struggled with legal reasoning, date calculations, and distinguishing key patent concepts like novelty and obviousness, possibly because word embeddings learned from general training data do not separate these terms well. The research concludes that despite advancements, current LLMs are not yet capable of performing at a human patent attorney’s level, highlighting the need for further development in logical consistency and domain-specific understanding.

Large Language Models (LLMs) are increasingly integrated into various fields, including law. However, their quantitative performance and the underlying reasons for it, especially in specialized domains like patent law, remain largely unexplored. A recent study delves into this by evaluating several LLMs on parts of the European Qualifying Examination (EQE), a rigorous test for aspiring European Patent Attorneys. The findings shed light on both the capabilities and significant limitations of current AI models in handling complex legal tasks.

The research, titled “Can Large Language Models Understand As Well As Apply Patent Regulations to Pass a Hands-On Patent Attorney Test?” by Bhakti Khera, Rezvan Alamian, Pascal A. Scherz, and Stephan M. Goetz, aimed to assess how well LLMs could perform on a practical patent attorney examination. The European Qualifying Examination is known for testing hands-on fitness in regulations and typical attorney work, requiring both practical skills and deep legal knowledge.

The study evaluated a range of open-source and proprietary LLMs, including variants from GPT-series, Anthropic, Deepseek, Llama-3, Google Gemma, and Mistral AI. These models varied in size, architecture, and training data, providing a comprehensive overview of current LLM capabilities in the legal domain. The examination involved True or False statements, often requiring detailed justifications based on patent law principles.

Key Performance Insights

The results showed that while some models performed better than others, none could meet the professional-level standard of 0.90 accuracy required to pass the examination. OpenAI o1 emerged as the top performer with an accuracy of 0.82 and an F1 score of 0.81. In contrast, models like AWS Llama 3.1 8B and a Python-deployed Llama 3.1 8B lagged significantly, with accuracies around 0.50 and 0.55, respectively, placing them within the range of mere guessing for the two-answer forced-choice design.
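To make the two reported metrics concrete, here is a minimal sketch of how accuracy and an F1 score can be computed for True/False answers. It assumes F1 is taken with "True" as the positive class; the article does not say which F1 variant the study used, and the function name and the five sample answers are hypothetical, not data from the study.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary True/False answers.

    Assumption: F1 treats True as the positive class. The study may have
    used a different averaging scheme (for example macro-averaged F1).
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # true positives
    fp = sum(1 for t, p in pairs if not t and p)    # false positives
    fn = sum(1 for t, p in pairs if t and not p)    # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical answer key and model answers, for illustration only.
answer_key = [True, True, False, False, True]
model_answers = [True, False, False, True, True]
acc, f1 = accuracy_and_f1(answer_key, model_answers)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

With only two possible answers, a model that guesses at random is expected to land near 0.50 accuracy, which is why scores of 0.50 to 0.55 are hard to distinguish from guessing.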

Human patent experts played a crucial role in evaluating the textual justifications provided by the models. Their assessments revealed a critical misalignment: experts valued clarity and legal rationale more than raw correctness. This highlighted that automated metrics alone do not fully capture the quality and relevance of legal justifications. The study also found that model outputs were sensitive to modest temperature changes and prompt wording, underscoring the continued necessity of expert oversight.

Experimental Findings

The researchers conducted several experiments to understand factors influencing LLM performance:

  • Temperature Variability: Adjusting the ‘temperature’ parameter, which controls the randomness of outputs, significantly impacted model behavior. A temperature of 0.3 was found to be optimal for stronger models like Llama 3.1 405B and Claude Sonnet 3.5, balancing accuracy and coherence while maintaining high prediction reliability. Intermediate temperatures, however, led to increased inconsistency.

  • Prompting Techniques: Simple prompts, such as “Please answer the question with justifications!”, notably improved performance, especially for smaller models. This suggests that explicit instructions can trigger a more structured chain of thought, enhancing legal analysis.

  • Context Length Management: Processing questions individually (as in Python deployments) allowed for longer context lengths, leading to more comprehensive answers compared to processing multiple questions simultaneously (as in AWS). Implementation details, such as how key-value (KV) pairs are cached in memory, also played a significant role.

  • Platform Differences: The same model (Llama 3.1 8B) showed performance variations when deployed on different platforms (AWS vs. local Python scripts). AWS-deployed models, with their optimized caching strategies, sometimes yielded different, and occasionally more accurate, results.

  • Multimodal Capabilities: Testing models on complete EQE pre-exam PDFs (which include text, tables, and figures) revealed that GPT-4o excelled at integrating text and graphics, while Claude 3 Opus often struggled with formatting coherence.
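The temperature effect mentioned in the first finding can be illustrated with the standard textbook formulation, in which logits are divided by the temperature before the softmax. This is a sketch under that assumption: the article does not describe how each provider applies temperature internally, and the two logits below are made-up values for a "True" and a "False" token.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the tokens "True" and "False".
logits = [2.0, 1.0]
for t in (0.3, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: P(True)={probs[0]:.3f}, P(False)={probs[1]:.3f}")
```

Lower temperatures concentrate probability on the highest-scoring token, so sampled answers vary less from run to run; higher temperatures flatten the distribution and make the less likely answer appear more often.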

The Challenge of Legal Nuance

A significant limitation identified by human experts was the models’ struggle with fundamental legal concepts, particularly distinguishing between “novelty” and “obviousness” or “inventive step” in patent law. While many models could define these terms, they often failed to apply them correctly in complex scenarios involving temporal information or combinations of prior art documents. This issue may stem from the models’ training data, which largely consists of general internet text where these terms might be used interchangeably or with less precision than required in legal contexts.

The study suggests that the word embeddings formed during pre-training on general internet data might create insufficient semantic distance between legally distinct terms like “novel” and “inventive.” This deep linguistic entanglement could be a major bottleneck, making it difficult to resolve these issues purely through fine-tuning or prompt engineering.
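One common way to quantify "semantic distance" between embeddings is cosine distance. The sketch below assumes that measure; the article does not state which one the study had in mind. The three-dimensional vectors are toy values invented for illustration and are not real model embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical, NOT taken from any model).
novel = [0.9, 0.1, 0.3]
inventive = [0.8, 0.2, 0.35]
unrelated = [-0.1, 0.9, -0.2]

# Cosine distance = 1 - cosine similarity; smaller means "closer".
print("novel vs inventive:", 1 - cosine_similarity(novel, inventive))
print("novel vs unrelated:", 1 - cosine_similarity(novel, unrelated))
```

If two legally distinct terms sit as close together as "novel" and "inventive" do in this toy example, a model has little signal in the embedding itself to keep them apart, which is the bottleneck the study hypothesizes.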

Conclusion

This research highlights that while LLMs show considerable promise in structured legal contexts, significant limitations persist in tasks requiring deeper interpretive reasoning, comprehensive legal analysis, and logical consistency. No model, even the largest and most advanced, has yet reached the level of human patent attorneys. The field still has a long way to go to develop a truly virtual patent attorney capable of handling the nuances and complexities of patent law. Future work should focus on improving logical consistency, robust multimodality, and adaptive prompting to close the gap to human-level patent proficiency. For more details, refer to the full research paper.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
