ci0c00593_si_001.pdf (6.38 MB)

De Novo Protein Design for Novel Folds Using Guided Conditional Wasserstein Generative Adversarial Networks

Download (6.38 MB)
journal contribution
posted on 30.09.2020, 18:50 by Mostafa Karimi, Shaowen Zhu, Yue Cao, Yang Shen
Although massive data is quickly accumulating on protein sequence and structure, there is a small and limited number of protein architectural types (or structural folds). This study is addressing the following question: how well could one reveal underlying sequence–structure relationships and design protein sequences for an arbitrary, potentially novel, structural fold? In response to the question, we have developed novel deep generative models, namely, semisupervised gcWGAN (guided, conditional, Wasserstein Generative Adversarial Networks). To overcome training difficulties and improve design qualities, we build our models on conditional Wasserstein GAN (WGAN) that uses Wasserstein distance in the loss function. Our major contributions include (1) constructing a low-dimensional and generalizable representation of the fold space for the conditional input, (2) developing an ultrafast sequence-to-fold predictor (or oracle) and incorporating its feedback into WGAN as a loss to guide model training, and (3) exploiting sequence data with and without paired structures to enable a semisupervised training strategy. Assessed by the oracle over 100 novel folds not in the training set, gcWGAN generates more successful designs and covers 3.5 times more target folds compared to a competing data-driven method (cVAE). Assessed by sequence- and structure-based predictors, gcWGAN designs are physically and biologically sound. Assessed by a structure predictor over representative novel folds, including one not even part of basis folds, gcWGAN designs have comparable or better fold accuracy yet much more sequence diversity and novelty than cVAE. The ultrafast data-driven model is further shown to boost the success of a principle-driven de novo method (RosettaDesign), through generating design seeds and tailoring design space. In conclusion, gcWGAN explores uncharted sequence space to design proteins by learning generalizable principles from current sequence–structure data. Data, source codes, and trained models are available at