The Plant Science Student Conference (PSSC) has been organised by students from the two Leibniz institutes, IPK and IPB, every year for the last 20 years. In this interview, Christina Wäsch (IPK) and Carolin Apel (IPB)…
Over 600 guests came to the IPB on July 4 for the Long Night of Sciences to learn lots of new things and put their knowledge to the test at our science quiz course. This year, our program was aimed equally at children and…
Our 10th Leibniz Plant Biochemistry Symposium on May 7 and 8 was a great success. This year's theme was new methods and research approaches in natural product chemistry. The excellent presentations on active substances and…
Illig, A.-M.; Siedhoff, N. E.; Davari, M. D.; Schwaneberg, U. Evolutionary probability and stacked regressions enable data-driven protein engineering with minimized experimental effort. J. Chem. Inf. Model. 64, 6350-6360 (2024). DOI: 10.1021/acs.jcim.4c00704
Protein engineering through directed evolution and (semi)rational approaches is routinely applied to optimize protein properties for a broad range of applications in industry and academia. The multitude of possible variants, combined with limited screening throughput, hampers efficient protein engineering. Data-driven strategies have emerged as a powerful tool to model the protein fitness landscape that can be explored in silico, significantly accelerating protein engineering campaigns. However, such methods require a certain amount of data, which often cannot be provided, to generate a reliable model of the fitness landscape. Here, we introduce MERGE, a method that combines direct coupling analysis (DCA) and machine learning (ML). MERGE enables data-driven protein engineering when only limited data are available for training, typically ranging from 50 to 500 labeled sequences. Our method demonstrates remarkable performance in predicting a protein’s fitness value and rank based on its sequence across diverse proteins and properties. Notably, MERGE outperforms state-of-the-art methods when only small data sets are available for modeling, requiring fewer computational resources, and proving particularly promising for protein engineers who have access to limited amounts of data.
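The abstract describes performance in terms of predicted fitness value and rank. As a rough illustration of how rank agreement between measured and predicted fitness might be scored on a small labeled set, the sketch below computes Spearman's rank correlation in plain Python. This is an assumption made for illustration only: the abstract does not name the metric, the function names and fitness values are hypothetical, and none of this is taken from the MERGE implementation.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(measured, predicted):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(measured), average_ranks(predicted))


if __name__ == "__main__":
    # Hypothetical fitness values for five variants (not data from the paper).
    measured = [0.2, 1.5, 0.9, 2.1, 0.4]
    predicted = [0.1, 1.2, 1.0, 1.9, 0.5]
    print(f"Spearman correlation: {spearman(measured, predicted):.3f}")
```

A rank-based score of this kind depends only on the ordering of the variants, so it rewards a model that picks out the best candidates for screening even when its predicted fitness values are on a different scale from the measurements.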