VISION

Our research group is focused on using computational and experimental approaches to understand and design protein functions. We have extensive experience in deep unsupervised learning and protein design, which we have applied to various projects (see below!).

Over the next five years, we will expand our focus to include the design of custom-tailored and new-to-nature protein functions. This will involve machine learning and other computational approaches to explore uncharted regions of the protein space and generate novel proteins with desired functions. We will complement this work with experimental characterization to validate and refine our computational models.

We are particularly interested in using our expertise to address significant challenges in the fields of healthcare and sustainability. This includes developing new drugs to treat diseases, designing enzymes for biotechnological applications, and creating proteins with novel functions that can help address environmental challenges.

We believe that protein design has the potential to change the world we live in, and that artificial intelligence is at the core of this revolution.

ONGOING PROJECTS

Controlled generation of artificial enzymes

The Transformer deep neural architecture, the core of many of the applications we interact with in our daily lives, such as Google Translate or ChatGPT, has an unmatched potential for protein design.

In this project, we trained a GPT-2-like architecture to design enzymes with specific functions. During training, each enzyme sequence was linked to its catalytic identifier (EC number, e.g., ‘2.7.1.2’), so the model learned to associate sequence features with each enzymatic function. The resulting model, ZymCTRL, generates enzyme sequences from a user-defined catalytic activity prompt.
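The idea of conditioning generation on a catalytic label can be illustrated with a toy sketch. The example below is not ZymCTRL: it replaces the Transformer with a simple next-residue count model, and the training pairs, the `<start>`/`<end>` tokens, and all function names are hypothetical choices made for illustration. It only shows the principle of prepending the EC number as a control tag and sampling a sequence from that prompt.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy data: (EC number, sequence) pairs invented for this sketch.
TRAINING_PAIRS = [
    ("2.7.1.2", "MKKLAVG"),
    ("2.7.1.2", "MKKIAVG"),
    ("3.1.1.1", "MSTPELW"),
    ("3.1.1.1", "MSTPDLW"),
]

START, END = "<start>", "<end>"  # assumed special tokens, not ZymCTRL's


def tokenize(ec_label, sequence):
    """Prepend the EC label as a control tag, then one token per residue."""
    return [ec_label, START, *sequence, END]


def train(pairs):
    """Count next-token frequencies given (control tag, previous token)."""
    counts = defaultdict(Counter)
    for ec_label, sequence in pairs:
        tokens = tokenize(ec_label, sequence)
        # Skip the tag itself; it conditions every transition instead.
        for prev, nxt in zip(tokens[1:], tokens[2:]):
            counts[(ec_label, prev)][nxt] += 1
    return counts


def generate(counts, ec_label, rng, max_len=50):
    """Sample a sequence residue by residue, conditioned on the EC prompt."""
    prev, residues = START, []
    for _ in range(max_len):
        options = counts.get((ec_label, prev))
        if not options:  # unseen label or dead end: stop early
            break
        tokens, weights = zip(*options.items())
        nxt = rng.choices(tokens, weights=weights)[0]
        if nxt == END:
            break
        residues.append(nxt)
        prev = nxt
    return "".join(residues)


counts = train(TRAINING_PAIRS)
rng = random.Random(0)
print(generate(counts, "3.1.1.1", rng))  # a sequence resembling the 3.1.1.1 examples
```

Changing the prompt to a different EC number steers sampling toward the sequence statistics seen for that label, which is the behaviour the control tag is meant to provide.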

Munsamy, M., Lindner, S., Lorenz, P., Ferruz, N. ZymCTRL: A conditional model for the controllable generation of artificial enzymes. In the MLSB workshop at the 36th NeurIPS conference (2022).

ProtGPT2: A generative model for protein design

Natural Language Processing methods have shown impressive capabilities in generating long, coherent text (think GPT-3 or ChatGPT). Inspired by this success, we trained ProtGPT2 on the entire protein space. ProtGPT2 has learned ‘the protein language’ and generates protein sequences in unexplored regions of the protein space. The generated proteins are ordered, globular, and feature non-idealized structures, and they express in wet-lab settings.

Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat Commun 13, 4348 (2022). https://doi.org/10.1038/s41467-022-32007-7

Conserved protein fragments across the protein space

Proteins have evolved via replication and recombination of subdomain-sized fragments that appear frequently across the protein structure space. Mimicking natural evolution, we can design new protein chimeras by identifying and combining these fragments.
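The recombination step can be sketched in a few lines. This is a deliberately simplified illustration, not the method used in the work cited below and not the Protlego API: it stands in for fragment identification with exact substring matching via Python's `difflib`, whereas fragments in real proteins are found by comparing proteins across structure space. The parent sequences, the `min_len` threshold, and the function names are all hypothetical.

```python
from difflib import SequenceMatcher


def shared_fragment(seq_a, seq_b, min_len=4):
    """Return the longest exact fragment shared by both parents, or None."""
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    match = matcher.find_longest_match(0, len(seq_a), 0, len(seq_b))
    return match if match.size >= min_len else None


def build_chimera(seq_a, seq_b, min_len=4):
    """Join parent A (up to and including the fragment) with the rest of parent B."""
    match = shared_fragment(seq_a, seq_b, min_len)
    if match is None:
        return None  # no fragment long enough to serve as a crossover point
    end_a = match.a + match.size
    end_b = match.b + match.size
    return seq_a[:end_a] + seq_b[end_b:]


# Hypothetical parent sequences that share the fragment "AKQRG".
parent_a = "MKTAYIAKQRGSLLVE"
parent_b = "PPWAKQRGDDEFH"
print(build_chimera(parent_a, parent_b))  # MKTAYIAKQRGDDEFH
```

Using the shared fragment as the crossover point keeps the junction in a region both parents already contain, which is the intuition behind combining fragments rather than cutting at arbitrary positions.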

Ferruz, N. et al. Identification and Analysis of Natural Building Blocks for Evolution-Guided Fragment-Based Protein Design. J. Mol. Biol. 432, 3898–3914 (2020).

Ferruz, N., Noske, J. & Höcker, B. Protlego: A Python package for the analysis and design of chimeric proteins. Bioinformatics 37, 3182–3189 (2021).