Molecular Representations in Machine-Learning-Based Prediction of PK Parameters for Insulin Analogs

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

Molecular Representations in Machine-Learning-Based Prediction of PK Parameters for Insulin Analogs. / Einarson, Kasper A.; Bendtsen, Kristian M.; Li, Kang; Thomsen, Maria; Kristensen, Niels R.; Winther, Ole; Fulle, Simone; Clemmensen, Line; Refsgaard, Hanne H.F.

In: ACS Omega, Vol. 8, No. 26, 2023, p. 23566-23578.

Harvard

Einarson, KA, Bendtsen, KM, Li, K, Thomsen, M, Kristensen, NR, Winther, O, Fulle, S, Clemmensen, L & Refsgaard, HHF 2023, 'Molecular Representations in Machine-Learning-Based Prediction of PK Parameters for Insulin Analogs', ACS Omega, vol. 8, no. 26, pp. 23566-23578. https://doi.org/10.1021/acsomega.3c01218

APA

Einarson, K. A., Bendtsen, K. M., Li, K., Thomsen, M., Kristensen, N. R., Winther, O., Fulle, S., Clemmensen, L., & Refsgaard, H. H. F. (2023). Molecular Representations in Machine-Learning-Based Prediction of PK Parameters for Insulin Analogs. ACS Omega, 8(26), 23566-23578. https://doi.org/10.1021/acsomega.3c01218

Vancouver

Einarson KA, Bendtsen KM, Li K, Thomsen M, Kristensen NR, Winther O et al. Molecular Representations in Machine-Learning-Based Prediction of PK Parameters for Insulin Analogs. ACS Omega. 2023;8(26):23566-23578. https://doi.org/10.1021/acsomega.3c01218

Author

Einarson, Kasper A. ; Bendtsen, Kristian M. ; Li, Kang ; Thomsen, Maria ; Kristensen, Niels R. ; Winther, Ole ; Fulle, Simone ; Clemmensen, Line ; Refsgaard, Hanne H.F. / Molecular Representations in Machine-Learning-Based Prediction of PK Parameters for Insulin Analogs. In: ACS Omega. 2023 ; Vol. 8, No. 26. pp. 23566-23578.

Bibtex

@article{0cbb80f5fa554dbeb97b9e9854379e84,
title = "Molecular Representations in Machine-Learning-Based Prediction of PK Parameters for Insulin Analogs",
abstract = "Therapeutic peptides and proteins derived from either endogenous hormones, such as insulin, or de novo design via display technologies occupy a distinct pharmaceutical space in between small molecules and large proteins such as antibodies. Optimizing the pharmacokinetic (PK) profile of drug candidates is of high importance when it comes to prioritizing lead candidates, and machine-learning models can provide a relevant tool to accelerate the drug design process. Predicting PK parameters of proteins remains difficult due to the complex factors that influence PK properties; furthermore, the data sets are small compared to the variety of compounds in the protein space. This study describes a novel combination of molecular descriptors for proteins such as insulin analogs, where many contained chemical modifications, e.g., attached small molecules for protraction of the half-life. The underlying data set consisted of 640 structurally diverse insulin analogs, of which around half had attached small molecules. Other analogs were conjugated to peptides, amino acid extensions, or fragment crystallizable regions. The PK parameters clearance (CL), half-life (T1/2), and mean residence time (MRT) could be predicted by using classical machine-learning models such as Random Forest (RF) and Artificial Neural Networks (ANN) with root-mean-square errors of CL of 0.60 and 0.68 (log units) and average fold errors of 2.5 and 2.9 for RF and ANN, respectively. Both random and temporal data splittings were employed to evaluate ideal and prospective model performance with the best models, regardless of data splitting, achieving a minimum of 70% of predictions within a twofold error. The tested molecular representations include (1) global physicochemical descriptors combined with descriptors encoding the amino acid composition of the insulin analogs, (2) physicochemical descriptors of the attached small molecule, (3) protein language model (evolutionary scale modeling) embedding of the amino acid sequence of the molecules, and (4) a natural language processing inspired embedding (mol2vec) of the attached small molecule. Encoding the attached small molecule via (2) or (4) significantly improved the predictions, while the benefit of using the protein language model-based encoding (3) depended on the used machine-learning model. The most important molecular descriptors were identified as descriptors related to the molecular size of both the protein and protraction part using Shapley additive explanations values. Overall, the results show that combining representations of proteins and small molecules was key for PK predictions of insulin analogs.",
author = "Einarson, {Kasper A.} and Bendtsen, {Kristian M.} and Kang Li and Maria Thomsen and Kristensen, {Niels R.} and Ole Winther and Simone Fulle and Line Clemmensen and Refsgaard, {Hanne H.F.}",
year = "2023",
doi = "10.1021/acsomega.3c01218",
language = "English",
volume = "8",
pages = "23566--23578",
journal = "ACS Omega",
issn = "2470-1343",
publisher = "ACS Publications",
number = "26",
}

RIS

TY - JOUR

T1 - Molecular Representations in Machine-Learning-Based Prediction of PK Parameters for Insulin Analogs

AU - Einarson, Kasper A.

AU - Bendtsen, Kristian M.

AU - Li, Kang

AU - Thomsen, Maria

AU - Kristensen, Niels R.

AU - Winther, Ole

AU - Fulle, Simone

AU - Clemmensen, Line

AU - Refsgaard, Hanne H.F.

PY - 2023

Y1 - 2023

AB - Therapeutic peptides and proteins derived from either endogenous hormones, such as insulin, or de novo design via display technologies occupy a distinct pharmaceutical space in between small molecules and large proteins such as antibodies. Optimizing the pharmacokinetic (PK) profile of drug candidates is of high importance when it comes to prioritizing lead candidates, and machine-learning models can provide a relevant tool to accelerate the drug design process. Predicting PK parameters of proteins remains difficult due to the complex factors that influence PK properties; furthermore, the data sets are small compared to the variety of compounds in the protein space. This study describes a novel combination of molecular descriptors for proteins such as insulin analogs, where many contained chemical modifications, e.g., attached small molecules for protraction of the half-life. The underlying data set consisted of 640 structurally diverse insulin analogs, of which around half had attached small molecules. Other analogs were conjugated to peptides, amino acid extensions, or fragment crystallizable regions. The PK parameters clearance (CL), half-life (T1/2), and mean residence time (MRT) could be predicted by using classical machine-learning models such as Random Forest (RF) and Artificial Neural Networks (ANN) with root-mean-square errors of CL of 0.60 and 0.68 (log units) and average fold errors of 2.5 and 2.9 for RF and ANN, respectively. Both random and temporal data splittings were employed to evaluate ideal and prospective model performance with the best models, regardless of data splitting, achieving a minimum of 70% of predictions within a twofold error. The tested molecular representations include (1) global physicochemical descriptors combined with descriptors encoding the amino acid composition of the insulin analogs, (2) physicochemical descriptors of the attached small molecule, (3) protein language model (evolutionary scale modeling) embedding of the amino acid sequence of the molecules, and (4) a natural language processing inspired embedding (mol2vec) of the attached small molecule. Encoding the attached small molecule via (2) or (4) significantly improved the predictions, while the benefit of using the protein language model-based encoding (3) depended on the used machine-learning model. The most important molecular descriptors were identified as descriptors related to the molecular size of both the protein and protraction part using Shapley additive explanations values. Overall, the results show that combining representations of proteins and small molecules was key for PK predictions of insulin analogs.

U2 - 10.1021/acsomega.3c01218

DO - 10.1021/acsomega.3c01218

M3 - Journal article

C2 - 37426277

VL - 8

SP - 23566

EP - 23578

JO - ACS Omega

JF - ACS Omega

SN - 2470-1343

IS - 26

ER -

ID: 359243763