Prediction, expertise, and validation: An integrated approach to advancing SBDD
This year, we hosted the first Discngine Labs event in person at Churchill College in Cambridge, with additional participants joining online. Experts from the drug discovery field shared a wealth of insights into the evolving role of computational predictions, particularly AlphaFold and other AI-driven models, in supporting structure-based drug design (SBDD).
While these technological advancements provide remarkable opportunities, the consensus remains that their true potential lies in integration with experimental data and expert interpretation, which ultimately enables more informed decisions in target validation, ligand design, and project prioritization.
Darren Green’s talk at Discngine Labs Live: “Optimizing the Impact of Protein Structure Information on Next-Generation Drug Discovery”
The Discngine Labs Live event opened with a plenary talk by Darren Green, former cheminformatics director at GSK and now an independent consultant. He explored how predictive tools can enhance the everyday life of a drug discovery scientist. One important takeaway from his talk was the need to “optimize the expert”, a concept that set the stage for further discussions on how in silico tools can automate and organize workflows while still providing access to key structural details. I explore this concept later in this blog post.
Following Darren’s talk, our Discngine colleague Lorena Zara presented how 3decision, Discngine’s 3D protein structure repository, is at the forefront of integrating AI-derived insights with experimental data. She emphasized the importance of having the right structural data management tool in an ever-growing, data-rich environment and presented use cases demonstrating 3decision’s impact.
The event concluded with an insightful roundtable discussion, featuring:
David Brown (Vertex Pharmaceuticals)
Juan Carlos Mobarec (AstraZeneca)
James Davidson (Vernalis)
Benoit Baillif (Astex Pharmaceuticals)
The panel explored the transformative role of AI in drug discovery, highlighting its potential and the importance of ensuring model reliability. They discussed strategies to enhance predictive accuracy, address challenges related to data completeness, and integrate AI-driven insights with experimental validation. Their perspectives underscored the value of combining innovative AI models with established methodologies to drive more efficient and reliable structure-based drug design.
The Expanding Role of Protein Structure Predictions in Drug Discovery
Darren Green’s insights
Darren Green opened his talk by emphasizing that protein structure data is expanding rapidly, with a surge in high-quality structures driven in part by cryo-EM: there are now over 70,000 unique protein entries in the PDB. Meanwhile, AlphaFold has revolutionized structural prediction by generating over 200 million predicted structures. Yet, despite this wealth of data, only 10% of the human genome has been structurally solved.
Prediction models seem to fit every step of the way in drug discovery and development.
However, some key challenges remain:
Poor modeling of protein dynamics and flexibility
Difficulty in predicting multi-domain proteins and complexes
Training set bias, which limits the model’s ability to generalize
Overconfidence in prediction tools, often due to unreliable confidence metrics
As a result, these models still require careful validation before being applied in drug discovery and development. As Darren noted, citing Borkakoti and Thornton, most approved drugs were developed using experimental structures or homology models. This highlights the challenge of using AI-driven predictions for unexplored targets, which is also the area where such approaches are most valuable and needed.
Presentation slide from Darren Green referencing the value of structures for drug discovery and development and the need for precise and valid predicted structural models
Although the models involve uncertainty and require further work to make them truly effective, Darren Green shared his perspective on their invaluable role at nearly every stage of the drug discovery process.
Starting with target identification, predicted models can help identify novel structural domains, assess ligandability, and highlight selectivity risks early on. At the hit/lead identification stage, he noted that these models have accelerated virtual screening and improved hit triage, allowing researchers to assess lead quality faster. Prediction models tend to be less impactful in lead optimization than in earlier stages because they struggle to capture induced-fit effects and dynamic structural states, but they remain valuable tools. When combined with molecular dynamics simulations, they can help refine ligand poses within a binding site, possibly predicting how structural changes in the protein influence ligand orientation.
Surprisingly, as the speaker highlighted, prediction models aid pre-clinical testing by uncovering species-specific structural differences (e.g., rat vs. human targets) and unexpected off-target interactions. This, in turn, helps guide model organism selection and toxicology assessment. In clinical research, these models support polypharmacology, precision medicine, and drug repurposing by identifying alternative therapeutic targets when primary ones remain uncertain.
The growing need to optimize the expert
Despite the growing influence of AI in structural biology, Darren Green emphasized that the greatest challenge lies in enabling experts to effectively interpret and apply vast volumes of data while retaining the crucial structural details. With the overwhelming scale of available information, he highlighted how the focus should shift to building tools that help researchers manage, automate, and organize structural data, ensuring that no essential information is overlooked.
“While new technologies, such as machine learning-driven predictions, are incredibly exciting, the real challenge lies in working at scale without losing crucial detail. The key is to optimize the expert and provide tools that can automate and organize data but still allow access to detail.”
Darren summarized that accurate protein structure prediction holds immense promise for efficiency in drug discovery. However, its success ultimately depends on combining predictions with expert judgment and experimental validation to unlock their full potential.
Integrative approach with 3decision
Following these insights, Lorena Zara showcased how 3decision technology supports this experimental and predictive structural integration, including:
Assessment of AI model accuracy in predicting structural conformation and protein-protein interactions
Investigating druggable pockets in AI models of new targets lacking experimentally resolved structures
Building on Darren’s talk, several features were showcased that enhance expert workflows. Lorena demonstrated how the platform seamlessly integrates into this changing world of structural biology and computational chemistry, empowering scientists to manage increasing volumes of structural data while preserving the necessary level of detail and enabling efficient analysis.
Potentials and pitfalls of using computational predictions to drive structure-based design
With powerful machine learning tools like AlphaFold increasingly used across drug discovery pipelines, researchers are now exploring how best to integrate these models with experimental workflows to ensure their outputs are both useful and reliable. During the roundtable discussion, the panel highlighted the opportunities and limitations of these tools, focusing on model training, data quality, and areas for improvement.
Roundtable chair David Brown emphasized that while prediction tools are widely used, their accuracy is limited by the quality and completeness of the input data. He pointed out that although the crystallographic community collects extensive datasets, only a single refined structure is typically deposited in the PDB, which becomes the foundation for many predictive models. As a result, crucial contextual data, such as metadata related to construct design and mutations, is lost, limiting the reliability of AI-generated outputs. This highlights the need for better practices for annotating and sharing structural data, particularly when the data is used to train predictive models.
“We’re starting to combine information from multiple methods […] but some critical data is still missing. As a structural biology community, we don’t always capture the metadata that comes with the data we’re training models on.”
This sparked a broader discussion on data bias and overconfidence in prediction models. Since tools like AlphaFold are trained primarily on existing PDB structures, they tend to perform well on familiar folds but often struggle with multi-domain proteins, flexible targets, or entirely novel targets. Additionally, most current models are trained on positive datasets, that is, successful, well-folded structures. The absence of misfolded or unstable conformations in training data can make models appear more confident than they should be, producing outputs that may seem accurate but are biologically irrelevant.
Where experimental data still has the lead
Despite the growing influence of computational tools, experts repeatedly emphasized that experimental data remains indispensable. Both X-ray crystallography and cryo-EM, even when producing low-resolution or early-stage structures, continue to guide key decisions in drug design. Their role is especially critical in:
Identifying cryptic binding sites that prediction models struggle to reveal
Exploring protein flexibility, which is crucial in understanding induced fit and conformational changes during ligand binding
Assessing protein stability, particularly in engineered constructs or stabilized membrane proteins
In fragment-based drug design, for example, early-stage crystallography and expression studies remain essential. While computational models can guide initial screening, experimental data is still crucial for confirming hits, optimizing fragments, and understanding structure-activity relationships.
Discngine Labs Live roundtable session with David Brown (Vertex Pharmaceuticals), Juan Carlos Mobarec (AstraZeneca), Benoit Baillif (Astex Pharmaceuticals) and James Davidson (Vernalis), on “Potential and pitfalls of using computational predictions to drive structure-based drug design”
Model enhancement strategies and emerging solutions
The conversation then shifted from identifying model pitfalls to addressing limitations and exploring potential solutions.
Integrating Molecular Dynamics Simulations
The panel discussed using molecular dynamics (MD) simulations to bridge the gap between static structures and real-world molecular behavior. While prediction models excel at producing static, well-folded structures, MD simulations provide insight into alternative conformations, improving the reliability of models for flexible targets with multiple structural states. In lead optimization in particular, MD simulations are critical for refining models to reflect ligand-induced structural changes.
However, the panel also stressed that MD simulations must be applied carefully, as their effectiveness relies heavily on the quality of the input model and the accuracy of the force fields used.
Incorporating physical constraints into predictive models
Another promising direction discussed was the integration of physical constraints into predictive models. This would help improve the method’s generalizability to unexplored protein folds or rare conformations and ensure that predicted structures remain chemically and physically plausible. While still in its early stages, this strategy could offer a more robust framework for structure prediction, particularly for novel or challenging targets.
Complexity in modeling protein-protein interactions
To improve the models’ ability to predict multi-protein complexes, panelists highlighted the need to integrate AlphaFold predictions with complementary methods such as:
Co-folding tools to improve complex assembly modeling
Docking tools to refine interactions in multi-subunit structures
Dynamic simulations to explore the binding interfaces in flexible regions
This integrative strategy, combining multiple computational methods, is considered essential for generating biologically meaningful models of protein complexes and for accurately capturing their protein-protein interactions.
AI-powered docking
AI-driven docking tools like DiffDock were discussed as highly promising for the prediction of holo structures, but their results must be carefully inspected and validated by experts. While these models could accelerate ligand-protein interaction studies, significant challenges persist, particularly in their handling of chirality, stereochemistry, and tetrahedral centers. These limitations can introduce substantial errors, potentially causing significant setbacks for medicinal chemists.
Thus, the panel unanimously agreed that one solution would be to benchmark predicted ligand poses using metrics like RMSD and, ideally, to have experienced chemists review the results before they are used in active design decisions.
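As a minimal sketch of what such a benchmark could look like, the Python snippet below computes the RMSD between a predicted ligand pose and a reference pose and reports the fraction of poses under a cutoff. The function names, the toy coordinates, and the 2.0 Å cutoff are illustrative assumptions (2 Å is a commonly used docking success threshold, not one the panel specified). The sketch also assumes that both poses list the same atoms in the same order, share the same coordinate frame, and that ligand symmetry can be ignored.

```python
import math

# A pose is a list of (x, y, z) atom coordinates in Å.
Coord = tuple[float, float, float]


def pose_rmsd(predicted: list[Coord], reference: list[Coord]) -> float:
    """RMSD between two poses without superposition.

    Assumes identical atom ordering in both poses (hypothetical input format).
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))


def success_rate(rmsds: list[float], cutoff: float = 2.0) -> float:
    """Fraction of poses with RMSD at or below the cutoff (assumed 2.0 Å)."""
    if not rmsds:
        raise ValueError("no RMSD values given")
    return sum(r <= cutoff for r in rmsds) / len(rmsds)


# Toy example: the predicted pose is the reference shifted by 1 Å along z.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (1.5, 1.5, 1.0)]

rmsd = pose_rmsd(predicted, reference)
print(f"RMSD = {rmsd:.2f} Å")
```

A low RMSD only says that the pose is geometrically close to the reference. It does not detect an inverted stereocenter or an implausible tetrahedral geometry, which is why expert review remains part of the proposed workflow.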
Industry data sharing
An interesting and still controversial topic is the integration of negative data and structural data from the industry into training datasets. Incorporating failed predictions, misfolded structures, and even internal industry datasets could dramatically improve model reliability. However, the risk of data leakage and concerns around intellectual property protection continue to pose significant challenges.
One solution could be sharing derived model parameters rather than raw data, although this approach remains controversial within the current pharmaceutical landscape.
The panelists' insights emphasized the importance of carefully balancing innovation with established methods to navigate these challenges and ensure robust, reliable outcomes in structure-based drug design.
Conclusion
In summary, the key takeaway from the event was the need to optimize the expert: to help scientists make use of the surge of structural data and to correctly interpret and integrate predictive models with experimental approaches.
Key areas for improvement include:
Incorporating negative data into training datasets to improve prediction reliability.
Expanding the use of MD simulations to refine models and capture dynamic protein behavior.
Emphasizing experimental validation for flexible, multi-domain, or cryptic targets.
Developing secure frameworks for sharing proprietary data to improve prediction models while protecting commercial interests.
Investing in data management tools to organize and streamline workflows in such a data-heavy environment.
Networking at Discngine Labs Live, Churchill College, Cambridge, UK
The Discngine Labs Live event finished with an engaging networking session, which was a great opportunity to reflect on the event insights, exchange further thoughts, and make some great professional connections.
We plan to hold more similar events in the future! Subscribe to our newsletter and make sure you join our next Labs Live.