TLDR: CoSyn, an open-source tool developed by researchers at the University of Pennsylvania and the Allen Institute for AI, is making GPT-4V-level vision AI more accessible. It achieves this by using AI to generate synthetic training data, enabling open-source models to interpret complex visual information like scientific charts and medical diagrams, and even outperform proprietary systems.
A groundbreaking open-source tool named CoSyn, short for Code-Guided Synthesis, is poised to revolutionize the accessibility of advanced vision AI, bringing capabilities on par with proprietary systems like OpenAI’s GPT-4V to a wider audience. Developed by a collaborative team from the University of Pennsylvania’s School of Engineering and Applied Science (Penn Engineering) and the Allen Institute for AI (Ai2), CoSyn addresses a critical challenge in AI development: the need for extensive and diverse training data for models to accurately interpret complex visual information.
Understanding intricate images such as financial charts, medical diagrams, and nutrition labels has traditionally been the preserve of closed-source systems like ChatGPT and Claude. CoSyn takes a different route: it leverages the coding and language skills of open-source AI models to create synthetic training data. An open-source language model writes code, such as plotting scripts or LaTeX, that renders scientific figures, charts, and tables; because each image’s content is fully specified in that code, the same model can also generate accurate questions and answers about it, effectively teaching other AI systems how to ‘see’ and comprehend these complex visuals.
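To make the idea concrete, here is a minimal sketch of one synthesis step, not the actual CoSyn pipeline: a small rendering script of the kind a language model might write, paired with question-answer records derived from the same underlying data. All names and values here are illustrative.

```python
# Illustrative sketch of code-guided synthesis, not the CoSyn pipeline.
# An LLM writes rendering code like this; since the chart's content
# lives in the code itself, the LLM can also emit grounded Q&A pairs.
import json

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Synthetic chart data (in a pipeline like CoSyn's, proposed by the LLM).
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 135, 150, 170]  # hypothetical revenue in $M

fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(quarters, revenue, color="steelblue")
ax.set_title("Quarterly Revenue (Hypothetical Corp)")
ax.set_ylabel("Revenue ($M)")
fig.savefig("chart_0001.png", dpi=150, bbox_inches="tight")

# Because the data is known exactly, the Q&A annotations are correct
# by construction; no human labeling is required.
record = {
    "image": "chart_0001.png",
    "qa_pairs": [
        {"question": "Which quarter had the highest revenue?",
         "answer": quarters[revenue.index(max(revenue))]},
        {"question": "What was the revenue in Q2?",
         "answer": f"${revenue[1]}M"},
    ],
}
with open("chart_0001.json", "w") as f:
    json.dump(record, f, indent=2)
```

Repeating this loop across many rendering tools, templates, and topics is what allows a large, diverse training set to be built without any human annotation.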
CoSyn’s efficacy is borne out by its results. The resulting dataset, CoSyn-400K, comprises over 400,000 synthetic images and 2.7 million sets of corresponding instructions, covering diverse categories including scientific charts, chemical structures, and user-interface screenshots. Models trained on this data have been shown to match or even surpass top proprietary systems like GPT-4V and Gemini 1.5 Flash across a suite of seven benchmark tests. A notable example involves the team’s new benchmark, NutritionQA: a model trained on only 7,000 synthetically generated nutrition labels yielded remarkable results on it.
Yue Yang, a co-first author and Research Scientist in Ai2’s PRIOR (Perceptual Reasoning and Interaction Research) group, highlighted the significance of this approach, stating, ‘This is like taking a student who’s great at writing and asking them to teach someone how to draw, just by describing what the drawing should look like. We’re essentially transferring the strengths of open-source AI from text to vision.’
The team has made the full CoSyn code and dataset publicly available, inviting the global research community to build on their work. This open-source release is expected to accelerate progress on AI systems that can reason about scientific documents, benefiting users from students to researchers. Looking ahead, Yang envisions synthetic data not only helping AI understand images but also enabling it to interact with them: the same approach could yield intelligent digital agents that click buttons and fill out forms on a user’s behalf, assisting with everyday tasks.
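For readers who want to explore the release, a hedged sketch of loading the data with the Hugging Face `datasets` library follows. The repository ID and subset name below are assumptions; consult the CoSyn project page for the exact identifiers.

```python
# Hedged sketch: streaming a few records from the released dataset.
# The repo ID "allenai/CoSyn-400K" and subset "chart" are assumptions;
# check the project page for the exact identifiers.
from itertools import islice

from datasets import load_dataset

ds = load_dataset("allenai/CoSyn-400K", name="chart",
                  split="train", streaming=True)

# Peek at a few records; each pairs a rendered image with
# instruction-tuning text (questions and answers).
for example in islice(ds, 3):
    print(sorted(example.keys()))
```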


