TLDR: This study analyzes 18 years of Canadian NSERC-funded research proposals (2005-2022) using three topic modeling techniques: LDA, STM, and BERTopic. It introduces a new algorithm, COFFEE, that adds covariate effect estimation to BERTopic, which has no native support for such analysis. BERTopic identified more granular, coherent, and emergent research themes than LDA and STM. The COFFEE-based covariate analysis confirms distinct provincial research specializations (e.g., Alberta in environmental science, Ontario and BC in AI) and reveals consistent gender-based thematic patterns: a stronger male association with AI and a significant positive association of female researchers with public health and vaccine communication. These findings give funding organizations an empirical basis for more equitable and impactful funding strategies.
Understanding the landscape of national scientific investment is crucial for optimizing research outcomes and fostering an inclusive environment. A recent study examines 18 years of research proposals funded by the Natural Sciences and Engineering Research Council of Canada (NSERC), from 2005 to 2022. The analysis aims to uncover evolving research trends and the demographic and geographical factors that shape them, particularly in light of commitments to equity, diversity, and inclusion.
The research compared three prominent topic modeling approaches: Latent Dirichlet Allocation (LDA), the Structural Topic Model (STM), and BERTopic. Topic modeling identifies hidden themes within large collections of text. The study's main methodological contribution is a new algorithm called COFFEE (Covariate Effect Estimation for BERTopic). It addresses a key limitation of BERTopic, which previously lacked a native function for analyzing how external factors, such as gender or location, relate to research topics. COFFEE estimates these covariate effects, so BERTopic results can be compared with STM, which models covariates directly.
Uncovering Research Themes
The study found that while all three models identified core scientific domains, BERTopic consistently outperformed the others by revealing more granular, coherent, and emergent themes. For instance, BERTopic was particularly adept at identifying the rapid expansion of artificial intelligence as a distinct and growing area of research. This suggests that BERTopic, which clusters documents using contextual embeddings, is better at capturing subtle semantic relationships and at surfacing niche or newly emerging research areas that more traditional models can overlook. LDA tended to produce broader, more generalized topics, while STM fell in between, identifying both broad topics and reasonably distinct sub-topics.
A quantitative evaluation of topic quality further supported BERTopic’s strength: it had the highest average coherence, meaning its topics were more interpretable and semantically consistent. LDA and STM showed slightly higher topic diversity, but BERTopic’s specialized topics provided deeper insight into specific research domains.
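To make the diversity measure concrete, here is a minimal sketch of one common definition of topic diversity: the fraction of unique words among the top words of all topics. The article does not say which formula the study used, so this definition, the function name topic_diversity, and the toy word lists are assumptions for illustration only.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    1.0 means no word is shared between topics; lower values mean
    the topics overlap more. This is one common definition, assumed
    here; the study may have used a different one.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("no topic words supplied")
    return len(set(top_words)) / len(top_words)


# Hypothetical top words, invented for illustration (not from the study).
topics_a = [
    ["model", "data", "system", "analysis"],
    ["model", "data", "network", "learning"],
]
topics_b = [
    ["vaccine", "communication", "public", "health"],
    ["neural", "network", "learning", "vision"],
]

diversity_a = topic_diversity(topics_a, top_k=4)  # two words repeat across topics
diversity_b = topic_diversity(topics_b, top_k=4)  # no word repeats
```

Diversity says nothing about whether each topic is interpretable on its own, which is why it is usually reported alongside coherence, as the study does.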
The Influence of Geography and Gender
The study also analyzes how geographical location and gender relate to research topics. The COFFEE algorithm, paired with BERTopic, allowed for a detailed examination of these covariate effects. The findings revealed distinct provincial research specializations across Canada.
For example, Alberta showed a significant positive association with research in “Environmental Science & Industrial Processes,” a finding corroborated by STM and aligning with the province’s strong energy sector. Ontario and British Columbia demonstrated strong positive effects in “Computer Science & Artificial Intelligence,” reflecting their roles as major technology and AI hubs. Manitoba was highlighted for its prominence in “Molecular Biology & Biotechnology.” Interestingly, BERTopic uniquely identified a significant research focus on “Materials Science & Applied Physics” in New Brunswick and “Environmental Science” in Newfoundland and Labrador, nuances that STM did not capture.
The analysis also confirmed consistent gender-based thematic patterns. Both BERTopic and STM indicated a stronger association of male researchers with “Computer Science & Artificial Intelligence,” in line with widely documented gender disparities in STEM fields. Only the COFFEE-powered BERTopic identified “Public Health & Vaccine Communication” as a field with a significant positive association with female researchers. This finding underscores the prominent role women play in public health professions and health communication. STM could not provide this insight because it did not identify the topic at all.
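The article does not describe how COFFEE works internally, so the following is not the COFFEE algorithm. It is a minimal sketch of the general idea of a covariate effect: given one topic label per proposal, compare how often a topic appears inside a group (a province or a gender) versus outside it, and use a bootstrap to gauge uncertainty. The function names, the labels, and the data are all hypothetical.

```python
import random


def prevalence_gap(topic_labels, group_labels, topic, group):
    """Share of `group` documents on `topic` minus the share among all others."""
    inside = [t == topic for t, g in zip(topic_labels, group_labels) if g == group]
    outside = [t == topic for t, g in zip(topic_labels, group_labels) if g != group]
    if not inside or not outside:
        raise ValueError("both groups need at least one document")
    return sum(inside) / len(inside) - sum(outside) / len(outside)


def bootstrap_interval(topic_labels, group_labels, topic, group,
                       n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the prevalence gap."""
    rng = random.Random(seed)
    n = len(topic_labels)
    gaps = []
    for _ in range(n_boot):
        # Resample documents with replacement, keeping labels paired.
        idx = [rng.randrange(n) for _ in range(n)]
        t = [topic_labels[i] for i in idx]
        g = [group_labels[i] for i in idx]
        try:
            gaps.append(prevalence_gap(t, g, topic, group))
        except ValueError:
            continue  # skip resamples in which one group is empty
    if not gaps:
        raise ValueError("no valid bootstrap resamples")
    gaps.sort()
    lower = gaps[int((alpha / 2) * len(gaps))]
    upper = gaps[int((1 - alpha / 2) * len(gaps)) - 1]
    return lower, upper


# Toy data, invented for illustration: 40 proposals per province.
topic_labels = ["ai"] * 30 + ["health"] * 10 + ["ai"] * 10 + ["health"] * 30
province_labels = ["ON"] * 40 + ["AB"] * 40

gap = prevalence_gap(topic_labels, province_labels, "ai", "ON")
lower, upper = bootstrap_interval(topic_labels, province_labels, "ai", "ON")
```

An interval that excludes zero suggests the group is associated with the topic in this toy data. A regression-based approach, as in STM, would additionally adjust for several covariates at once, which this simple comparison does not do.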
Implications for Science Policy
The insights from this study have significant implications for science policy and for funding organizations like NSERC. Because the COFFEE-powered BERTopic framework is more granular and sensitive, funding agencies can move beyond high-level summaries. This precision is vital for developing targeted, evidence-based strategies that support regional research ecosystems and promote the goals of Equity, Diversity, and Inclusion (EDI) across Canada’s scientific community. Identifying specific regional strengths and gender-based thematic patterns can help agencies formulate more equitable and impactful funding strategies.