Environ Sci Technol. 2024 Sep 3;58(35):15638-15649.
doi: 10.1021/acs.est.4c00172. Epub 2024 May 2.

Two-Stage Machine Learning-Based Approach to Predict Points of Departure for Human Noncancer and Developmental/Reproductive Effects


Jacob Kvasnicka et al. Environ Sci Technol. 2024.

Abstract

Chemical points of departure (PODs) for critical health effects are crucial for evaluating and managing human health risks and impacts from exposure. However, PODs are unavailable for most chemicals in commerce due to a lack of in vivo toxicity data. We therefore developed a two-stage machine learning (ML) framework to predict human-equivalent PODs for oral exposure to organic chemicals based on chemical structure. Utilizing ML-based predictions for structural/physical/chemical/toxicological properties from OPERA 2.9 as features (Stage 1), ML models using random forest regression were trained with human-equivalent PODs derived from in vivo data sets for general noncancer effects (n = 1,791) and reproductive/developmental effects (n = 2,228), with robust cross-validation for feature selection and estimating generalization errors (Stage 2). These two-stage models accurately predicted PODs for both effect categories with cross-validation-based root-mean-squared errors less than an order of magnitude. We then applied one or both models to 34,046 chemicals expected to be in the environment, revealing several thousand chemicals of moderate concern and several hundred chemicals of high concern for health effects at estimated median population exposure levels. Further application can expand by orders of magnitude the coverage of organic chemicals that can be evaluated for their human health risks and impacts.
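The cross-validation step mentioned above can be illustrated with a minimal sketch of repeated k-fold evaluation, using the 30 repeats × 5 folds reported in the Figure 3 caption. This is not the authors' code: the function names, the synthetic log10 POD values, and the placeholder model (which simply predicts the training-set mean) are all assumptions made for illustration. In the actual framework, a random forest regressor trained on OPERA-derived features would take the place of the placeholder.

```python
import math
import random
import statistics

def repeated_kfold(n_samples, n_splits=5, n_repeats=30, seed=0):
    """Yield (train_idx, test_idx) for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # new random partition for each repeat
        for k in range(n_splits):
            test = idx[k::n_splits]
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            yield train, test

def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    return math.sqrt(statistics.fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))

def cross_validate(y, fit_predict, n_splits=5, n_repeats=30, seed=0):
    """Return one RMSE per fold and the mean out-of-sample prediction per sample."""
    scores = []
    out_of_sample = [[] for _ in y]
    for train, test in repeated_kfold(len(y), n_splits, n_repeats, seed):
        preds = fit_predict(train, test)
        scores.append(rmse([y[i] for i in test], preds))
        for i, p in zip(test, preds):
            out_of_sample[i].append(p)
    mean_pred = [statistics.fmean(p) for p in out_of_sample]
    return scores, mean_pred

# Synthetic log10 POD values (hypothetical data, for illustration only).
data_rng = random.Random(1)
y = [data_rng.gauss(1.0, 1.0) for _ in range(50)]

def mean_model(train, test):
    """Placeholder model: predict the training-set mean for every test sample."""
    mu = statistics.fmean([y[i] for i in train])
    return [mu] * len(test)

scores, mean_pred = cross_validate(y, mean_model)
print(len(scores))  # 150 fold-level scores (30 repeats x 5 folds)
print(statistics.median(scores))  # median RMSE in log10 units
```

Each sample lands in a test fold exactly once per repeat, so its averaged out-of-sample prediction is a mean over 30 values, and the spread of the 150 fold-level scores gives a distribution of the generalization error.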

Keywords: QSAR model; chemical risk assessment; high-throughput screening; life cycle impact assessment (LCIA); machine learning; toxicity prediction.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
Overview of the two-stage machine learning framework for predicting points of departure. (A) Conceptual framework. (B) Model development, evaluation, and application. The surrogate points of departure were obtained from Table S5 of Aurisano et al. Features were extracted from predictions by OPERA 2.9. Figures S1–S2 provide an overview of the model training and evaluation. Exposure estimates were obtained from SEEM3 by Ring et al. Application chemicals were expected to occur in the environment and lacked in vivo points of departure. Note: ML, machine learning; POD, point of departure; QSAR, quantitative structure–activity relationship; OPERA, OPEn structure–activity/property Relationship App; ToxValDB, Toxicity Value Database; RMSE, root-mean-squared error; MedAE, median absolute error; R², coefficient of determination; MAD, median absolute deviation; SEEM, Systematic Empirical Evaluation of Models.
Figure 2
Model fitting. In-sample performance is assessed through scatterplots and performance metrics comparing the fitted and observed values for each chemical. The fitted values are predictions from the cross-validated final models that were fitted on the full labeled data set. The figure is subdivided by target effect category and by whether feature selection was implemented. Note: RMSE, root-mean-squared error; MedAE, median absolute error; R², coefficient of determination; n, sample size.
Figure 3
Model evaluation. (A) Out-of-sample performance is assessed through scatterplots comparing the mean predicted values for each chemical when it is part of the “testing” data set across 30 cross-validation repeats (y-axis) against the corresponding surrogate values (x-axis). The dashed red line indicates perfect correspondence. (B) The distribution of performance metrics from 150 cross-validation scores (30 repeats × 5-fold), where each boxplot shows the median and interquartile range with whiskers representing the 95% confidence interval. The figure is subdivided by the performance metric, target effect category, and by whether feature selection was implemented. Note: RMSE, root-mean-squared error; MedAE, median absolute error; R², coefficient of determination; n, sample size. The scale for R² is reversed to be consistent with values to the “left” corresponding to better performance.
Figure 4
Cumulative counts of the application chemicals in relation to the predicted points of departure and margins of exposure. Data are shown for chemicals that were on the Merged NORMAN Suspect List (SusDat) and within the applicability domain of SEEM3 (n = 32,524), excluding any training chemicals. The margins of exposure correspond to an individual at the population median exposure. Uncertainty is represented in two ways: (1) exposure uncertainty, reflected by examining margins of exposure at different exposure percentiles; (2) point of departure (hazard) uncertainty, represented by a 90% prediction interval derived from the median RMSE based on cross-validation. Vertical spans highlight different risk categories, as described in the Methods. The x-axis is truncated at log10 MOE = 10. Note: POD, point of departure; MOE, margin of exposure.
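
The margin of exposure and prediction interval described in the Figure 4 caption can be sketched as follows. This is an illustrative reading, not the authors' implementation. It assumes that the MOE is the ratio of the POD to the exposure estimate, that both are in the same units (for example mg/kg-day), and that prediction errors on the log10 scale are approximately normal, so that the interval is the predicted log10 POD plus or minus a normal quantile times the RMSE. The function names and numeric inputs are hypothetical.

```python
import math
from statistics import NormalDist

def log10_moe(pod, exposure):
    """log10 margin of exposure, assuming MOE = POD / exposure in matching units."""
    return math.log10(pod) - math.log10(exposure)

def prediction_interval(log10_pod, rmse, coverage=0.90):
    """Symmetric interval on the log10 scale, assuming normally distributed errors."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2)  # about 1.645 for 90% coverage
    return log10_pod - z * rmse, log10_pod + z * rmse

# Hypothetical inputs: a predicted POD, a median exposure estimate, and an RMSE.
pod = 10.0        # assumed units: mg/kg-day
exposure = 1e-4   # assumed units: mg/kg-day
rmse = 0.7        # log10 units, made up for illustration

moe = log10_moe(pod, exposure)
low, high = prediction_interval(math.log10(pod), rmse)
# Propagate the POD interval to the MOE by subtracting log10 exposure.
moe_low = low - math.log10(exposure)
moe_high = high - math.log10(exposure)
print(moe, moe_low, moe_high)
```

Working on the log10 scale turns the ratio into a difference, which is why the hazard interval shifts directly onto the MOE axis once the exposure term is subtracted.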

