LIT-PCBA

LIT-PCBA is a curated dataset built from PubChem dose-response bioassays for unbiased benchmarking of virtual screening and machine learning methods in drug discovery.


Key Features:

  • Unbiased Data Compilation: The dataset comprises 15 target sets derived from 149 dose-response PubChem bioassays and contains 7,844 confirmed active and 407,381 confirmed inactive compounds.
  • Rigorous Data Curation: False positives and assay artifacts were removed, and active and inactive compounds were kept within similar molecular property ranges to minimize chemical biases.
  • Target Selection for Versatility: Target sets were selected based on the availability of at least one X-ray structure in complex with ligands of the same phenotype as PubChem active compounds, supporting ligand-based and structure-based virtual screening.
  • Validation through Virtual Screening Methods: Preliminary screenings using 2D fingerprint similarity, 3D shape similarity, and molecular docking were used to keep only target sets for which at least one method enriched the top 1% of ranked compounds in true actives by at least a factor of two.
  • Asymmetric Validation Embedding (AVE): AVE was applied to the training and validation ligand sets to further reduce bias between them.
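
The enrichment criterion mentioned above can be illustrated with a small calculation. The following is a minimal sketch, not the authors' code: it assumes the usual definition of the enrichment factor (hit rate among the top-ranked fraction divided by the hit rate in the whole library), and the function name, the toy scores, and the toy labels are all hypothetical.

```python
import math


def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor at a given top fraction of a ranked list.

    scores: higher means "more likely active" (assumed convention).
    labels: 1 for a confirmed active, 0 for a confirmed inactive.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("at least one active is required")
    # Size of the top-ranked subset (at least one compound).
    n_top = max(1, math.ceil(n_total * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    # (hits_top / n_top) / (n_actives / n_total), written as one division.
    return hits_top * n_total / (n_top * n_actives)


# Toy library: 1,000 compounds, 20 actives, 4 of them ranked in the top 10.
toy_scores = [1000 - i for i in range(1000)]
toy_labels = [1 if i < 4 or 500 <= i < 516 else 0 for i in range(1000)]
print(enrichment_factor(toy_scores, toy_labels))  # 20.0
```

Under this definition a random ranking gives a value close to 1, so the "factor of two" threshold corresponds to a result of 2 or more at the top 1%.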

Scientific Applications:

  • Benchmarking virtual screening methods: Provide realistic, unbiased test sets for comparing the performance of 2D, 3D, and docking-based virtual screening algorithms.
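  • Machine-learning model development and evaluation: Train and evaluate ML classifiers and regression models for compound activity prediction on curated actives and inactives drawn from similar molecular property ranges. The classes remain heavily imbalanced (7,844 actives versus 407,381 inactives).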
  • Comparative assessment of ligand- and structure-based approaches: Enable direct comparison of ligand-based (fingerprint/shape) and structure-based (docking) screening performance using targets with X-ray ligand complexes.
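
As an illustration of the ligand-based side of such a comparison, the sketch below ranks a toy library by 2D fingerprint similarity. It is an assumption-laden example rather than the protocol used for LIT-PCBA: fingerprints are represented as plain Python sets of on-bit indices (in practice they would come from a cheminformatics toolkit), each compound is scored by its maximum Tanimoto coefficient to any reference active, and all names and fingerprints are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_by_max_similarity(references, library):
    """Score each library compound by its best similarity to any reference.

    references: list of fingerprints of known actives (assumed input).
    library: dict mapping compound name to fingerprint.
    Returns (name, score) pairs sorted from most to least similar.
    """
    scored = [
        (name, max(tanimoto(fp, ref) for ref in references))
        for name, fp in library.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy fingerprints.
references = [{1, 2, 3, 4}]
library = {
    "cpd_a": {1, 2, 3, 5},
    "cpd_b": {7, 8, 9},
    "cpd_c": {1, 2, 3, 4},
}
ranking = rank_by_max_similarity(references, library)
for name, score in ranking:
    print(name, round(score, 2))
```

The resulting ranked list has the same shape as the output of a docking or shape-similarity run, so the same enrichment metric can be applied to each method.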

Methodology:

The dataset was derived from 149 dose-response PubChem bioassays. False positives and assay artifacts were removed, and actives and inactives were kept within similar molecular property ranges. Targets were retained only if at least one X-ray structure was available in complex with a ligand of the same phenotype as the PubChem actives. Preliminary virtual screenings with 2D fingerprint similarity, 3D shape similarity, and molecular docking were used to select the final 15 target sets, and Asymmetric Validation Embedding (AVE) was then applied to the training and validation ligand sets to reduce bias.
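
To give a feel for what the AVE step measures, here is a rough sketch of the bias score as it is commonly described for AVE: for each validation set, how often its members have a nearest neighbor within a distance threshold among training actives versus training inactives, averaged over thresholds. This only computes the score; the unbiasing step that searches for low-bias splits is not shown. The Jaccard distance on set-based fingerprints, the threshold grid, and all names are assumptions made for illustration.

```python
def jaccard_distance(fp_a, fp_b):
    """1 - Tanimoto for fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return 1.0 - (len(fp_a & fp_b) / union if union else 0.0)


def nn_coverage(queries, targets, thresholds):
    """Mean, over thresholds, of the fraction of queries whose nearest
    target lies within the threshold."""
    nearest = [min(jaccard_distance(q, t) for t in targets) for q in queries]
    fractions = [
        sum(d <= thr for d in nearest) / len(nearest) for thr in thresholds
    ]
    return sum(fractions) / len(fractions)


def ave_bias(val_act, val_inact, train_act, train_inact, thresholds=None):
    """Bias = (AA - AI) + (II - IA); values near 0 indicate little bias."""
    if thresholds is None:
        thresholds = [i / 10 for i in range(11)]  # assumed grid: 0.0 .. 1.0
    aa = nn_coverage(val_act, train_act, thresholds)
    ai = nn_coverage(val_act, train_inact, thresholds)
    ii = nn_coverage(val_inact, train_inact, thresholds)
    ia = nn_coverage(val_inact, train_act, thresholds)
    return (aa - ai) + (ii - ia)


# Hypothetical, strongly biased split: validation compounds duplicate
# training compounds of the same class.
biased = ave_bias(
    val_act=[{1, 2, 3}], val_inact=[{7, 8, 9}],
    train_act=[{1, 2, 3}], train_inact=[{7, 8, 9}],
)
print(biased > 1.5)  # True
```

A large positive value means a nearest-neighbor lookup against the training set would already separate validation actives from inactives, which is the kind of shortcut the unbiasing is meant to remove.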

Details

Tool Type: web application
Added: 1/18/2021
Last Updated: 2/17/2021

Publications

Tran-Nguyen V, Jacquemard C, Rognan D. LIT-PCBA: An Unbiased Data Set for Machine Learning and Virtual Screening. Journal of Chemical Information and Modeling. 2020;60(9):4263-4273. doi:10.1021/acs.jcim.0c00155. PMID:32282202.