Abstract
It is of significant interest and importance to develop robust machine learning models to assist organic chemistry synthesis. Typically, task-specific machine learning models have been developed for distinct reaction prediction tasks. In this work, we develop a unified deep learning model, T5Chem, for a variety of chemical reaction prediction tasks by adapting the "Text-to-Text Transfer Transformer" (T5) framework from natural language processing (NLP). On the basis of self-supervised pretraining with PubChem molecules, the T5Chem model can achieve state-of-the-art performance on four distinct types of task-specific reaction prediction tasks using four different open-source data sets: reaction type classification on USPTO_TPL, forward reaction prediction on USPTO_MIT, single-step retrosynthesis on USPTO_50k, and reaction yield prediction on high-throughput C-N coupling reactions. We also introduce a new unified multitask reaction prediction data set, USPTO_500_MT, which can be used to train and test five different types of reaction tasks, including the above four as well as a new reagent suggestion task. Our results show that models trained with multiple tasks are more robust and can benefit from mutual learning on related tasks. Furthermore, we demonstrate the use of SHAP (SHapley Additive exPlanations) to explain T5Chem predictions at the functional group level, which provides a way to demystify sequence-based deep learning models in chemistry. T5Chem is accessible through https://yzhang.hpc.nyu.edu/T5Chem