Machine learning in structural biology and chemoinformatics

  1. Jiménez Luna, José Salvador
Supervised by:
  1. Gianni De Fabritiis Supervisor

Defense university: Universitat Pompeu Fabra

Defense date: 24 October 2019

Examination committee:
  1. Toni Giorgino Chair
  2. Ferran Sanz Carreras Secretary
  3. Juan Fernández Recio Member

Type: Thesis

Teseo: 601773

Abstract

Deep learning approaches have become increasingly popular in recent years thanks to their state-of-the-art performance in fields such as computer vision and natural language understanding. The first goal of this thesis was to adapt such approaches, particularly those used in image recognition, to the domains of structural biology and chemoinformatics. We do so by developing a novel three-dimensional biomolecular representation that can be used in conjunction with 3D-convolutional neural networks for a variety of tasks. We test the applicability of such methods on several relevant problems in the early drug discovery pipeline, such as protein binding site prediction, protein-ligand binding affinity prediction, drug selectivity elucidation and molecular generative models. The second goal of this thesis was to facilitate the use and accessibility of such tools by implementing and deploying them in an easy-to-use web application.

Goals: The objectives of the thesis presented here were threefold. The first was to explore modern representations of biomolecular complexes for use in modern deep learning architectures, as in the case of voxelization and convolutional neural networks. The second was to apply such models in projects relevant to drug discovery pipelines, comparing their performance to existing approaches whenever possible. The last was to deploy such models in the PlayMolecule.org repository of applications so as to facilitate and promote their use among computational and medicinal chemists.

Conclusions:
  1. Volumetric representations of biomolecular complexes are a novel and flexible way of modeling shapes and solving different structural biology and chemoinformatics tasks.
  2. Pipelines featuring 3D-convolutional neural networks can outperform complex hand-crafted geometric algorithms for the detection of druggable binding pockets, given enough curated training data.
  3. Similar approaches that featurize the binding pocket as well as the pose of a compound have been shown to be state of the art in protein-ligand affinity prediction, compared to other scoring functions of diverse nature. However, their performance in ranking chemically close compounds in lead optimization is not consistent, and therefore requires further training on the congeneric series at hand. Such models show promise by outperforming simulation and docking-based approaches with very few examples.
  4. Multilabel neural networks are fast and efficient models for associating compounds with the pathways they intervene in, at an unprecedented scale. The equal treatment of ligands with negative and unknown activity towards a target remains an open issue.
  5. Generative models such as variational autoencoders and captioning networks can be used in conjunction with volumetric representations to generate novel compounds with desirable characteristics while retaining similarity to a seed molecule.
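
The voxelization named in the goals can be illustrated with a minimal sketch. It is not the representation developed in the thesis: the function name `voxelize`, the occupancy formula, the `radius` value and the atom coordinates are all assumptions chosen for illustration. The idea shown is only that atom positions are mapped onto a regular 3D grid, which is the kind of input a 3D-convolutional network consumes.

```python
import math


def voxelize(atoms, origin, size, resolution=1.0, radius=1.5):
    """Map atoms [(x, y, z), ...] onto a size^3 occupancy grid (nested lists).

    Assumed scheme (illustrative only): each voxel stores the maximum over
    all atoms of exp(-(d / radius)^2), where d is the distance between the
    atom and the voxel center.
    """
    grid = [[[0.0] * size for _ in range(size)] for _ in range(size)]
    for i in range(size):
        for j in range(size):
            for k in range(size):
                # Center of voxel (i, j, k) in the same units as the atoms
                center = (
                    origin[0] + (i + 0.5) * resolution,
                    origin[1] + (j + 0.5) * resolution,
                    origin[2] + (k + 0.5) * resolution,
                )
                for atom in atoms:
                    d = math.dist(center, atom)
                    occupancy = math.exp(-((d / radius) ** 2))
                    # Keep the strongest contribution from any atom
                    if occupancy > grid[i][j][k]:
                        grid[i][j][k] = occupancy
    return grid


# Two made-up atoms in a 4 x 4 x 4 box with unit-sized voxels
atoms = [(0.5, 0.5, 0.5), (2.5, 2.5, 2.5)]
grid = voxelize(atoms, origin=(0.0, 0.0, 0.0), size=4)
```

In a sketch like this, a voxel centered exactly on an atom gets occupancy 1.0 and the value decays smoothly with distance. A practical pipeline would likely compute one such grid per atom property and stack them as input channels, but that choice is also an assumption here.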