
Researchers have fine-tuned two large language models on a large database of molecules to support drug design, enabling the models to generate molecules with specified properties and to predict molecular properties.
Recent advancements in large language models (LLMs) have opened new frontiers in generative molecular drug design. A team of researchers from the University of California, Los Angeles, and the Institute of Chemical Physics in Yerevan, Armenia, has introduced two LLMs, Chemlactica and Chemma, fine-tuned on a novel corpus of 110 million molecules with computed properties, totaling 40 billion tokens. These models show strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples.
The researchers introduced a novel optimization algorithm that uses these language models to optimize molecules for arbitrary properties with limited access to a black-box oracle, that is, a scoring function that can only be queried and whose queries are expensive. Here’s how it works:
Combination of Techniques:
Performance:
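To make the idea of optimizing under limited oracle access concrete, here is a minimal Python sketch. It is not the researchers' algorithm: the function `optimize_with_budget`, the pool-based loop, the string mutator `toy_propose` (a stand-in for a language model that proposes molecules), and the scoring function `toy_oracle` are all hypothetical and chosen only for illustration. The strings are plain character sequences, not validated molecules.

```python
import random
from typing import Callable, List, Tuple

ALPHABET = "CNO"  # toy alphabet; these strings are NOT validated molecules


def toy_propose(pool: List[str], rng: random.Random) -> str:
    """Hypothetical stand-in for a language model: mutate one character."""
    parent = rng.choice(pool)
    i = rng.randrange(len(parent))
    return parent[:i] + rng.choice(ALPHABET) + parent[i + 1:]


def toy_oracle(candidate: str) -> float:
    """Hypothetical black-box score: fraction of 'N' characters."""
    return candidate.count("N") / len(candidate)


def optimize_with_budget(
    propose: Callable[[List[str], random.Random], str],
    oracle: Callable[[str], float],
    seeds: List[str],
    budget: int,
    pool_size: int = 5,
    rng_seed: int = 0,
) -> Tuple[List[Tuple[float, str]], int]:
    """Keep a pool of the best-scoring candidates, spending at most
    `budget` oracle calls. Returns (best candidates, calls used)."""
    if not seeds:
        raise ValueError("at least one seed candidate is required")
    rng = random.Random(rng_seed)
    scored = {}  # candidate -> score; doubles as a cache
    calls = 0

    # Score the seeds first, respecting the budget.
    for seed in seeds:
        if calls >= budget:
            break
        if seed not in scored:
            scored[seed] = oracle(seed)
            calls += 1

    attempts = 0
    max_attempts = budget * 50  # guard against endless duplicate proposals
    while calls < budget and attempts < max_attempts:
        attempts += 1
        # The current best candidates condition the next proposal.
        pool = sorted(scored, key=scored.get, reverse=True)[:pool_size]
        candidate = propose(pool, rng)
        if candidate in scored:
            continue  # duplicate: skip without spending an oracle call
        scored[candidate] = oracle(candidate)
        calls += 1

    best = sorted(((s, c) for c, s in scored.items()), reverse=True)
    return best[:pool_size], calls


if __name__ == "__main__":
    best, used = optimize_with_budget(
        toy_propose, toy_oracle, seeds=["CCCCCC"], budget=20
    )
    print(f"oracle calls used: {used}")
    for score, candidate in best:
        print(f"{score:.2f}  {candidate}")
```

The design point the sketch illustrates is that oracle calls, not proposals, are the scarce resource: duplicates are filtered through a cache before scoring, and the loop stops as soon as the budget is spent.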

For researchers and practitioners in drug design and cheminformatics, this work represents a significant step forward. Here are some key takeaways:
The researchers have publicly released the training corpus, the language models (Chemlactica and Chemma), and the optimization algorithm. This openness is crucial for reproducibility and further advancements in the field.
The integration of large language models into molecular optimization marks a notable milestone in drug design and cheminformatics. By combining NLP techniques with a new optimization algorithm, researchers can generate and refine molecules more efficiently. The public release of these resources is likely to spur further innovation and collaboration within the scientific community.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
30 July 2024