Chemical and Biological Data Repositories for AI Drug Discovery

One contributor to AI’s surge in drug discovery is the massive amount of data being generated. However, limited data availability (e.g., scarcity of high-quality labeled data, class imbalance, unreported negative results, incomplete biological annotations) can impose a fundamental limitation on the development of ML and deep learning models. The space of druglike molecules is immense, biological targets are numerous and varied, chemical and biological properties are high-dimensional, and applications extend beyond drug discovery into healthcare and biological chemistry. As a result, it is not uncommon to commit to a research direction where there is a dearth of data.

We will cover contemporary, publicly accessible chemical and biological data repositories worth tracking as your own projects and interests take shape while working through this book. Most of the databases we will look at contain existing compounds, though there are also large databases (billions of entries) of virtual molecules that do not yet exist but could be synthesized. By no means do we intend to provide a comprehensive reference of benchmark datasets, as these are in flux and datasets for specific tasks are best found in the primary literature.

This is Appendix B of my book, Build AI Drug Discovery Pipelines. The appendices are freely available through Manning, but most readers never check them. I keep an updated version of this one here rather than only in print, because as a catalog of datasets it is the part of the book most likely to go out of date. Last updated June 2026.

Figure: The major categories of public data that feed an ML drug-discovery pipeline (bioactivity and binding, chemical structures, structural biology, virtual libraries, ML benchmarks, toxicity and safety, reactions, target discovery, and clinical data), organized as they appear in this catalog.

Bioactivity & Binding Affinity

ChEMBL & ChEBI

ChEMBL is an open-access data warehouse of manually curated, high-quality data derived primarily from the medicinal chemistry literature, with the FAIR data principles (findable, accessible, interoperable, reusable) at its core. ChEMBL covers more than 40 years of published research and is maintained by the European Bioinformatics Institute (EBI). As of ChEMBL 35 (released 2024), it contains ~1.6M assays, ~15.6K drug targets, ~2.5M compounds, and over 20M bioactivity measurements drawn from ~90K publications. ChEBI, also maintained by EBI, is a freely available dictionary of ~60K small-molecule chemical entities of biological interest (hence the initialism, ChEBI).

FAIR Data Principles

Adherence to FAIR principles enhances ChEMBL’s utility for ML applications in drug discovery. Data preprocessing—often the most time-consuming step in building ML models—becomes more standardized and reproducible. Feature engineering benefits from consistent representation of chemical structures and biological targets. Model validation gains reliability because test sets can be constructed with awareness of data provenance, avoiding leakage between training and test data. Importantly, FAIR principles facilitate model sharing and benchmarking across the research community, as models built on ChEMBL can be evaluated against common, well-understood datasets rather than proprietary or inconsistently processed data collections.

Each FAIR principle has a role:

  • Findable. Humans and machines can locate the data through persistent identifiers and rich metadata. ML practitioners can programmatically access specific subsets of data, such as all measurements for a particular kinase family, which enables automated pipeline construction for model training.
  • Accessible. ChEMBL data can be retrieved through standardized protocols like REST APIs, simplifying data collection workflows. Instead of scraping PDFs or navigating complex interfaces, researchers can write scripts that directly query and download precisely the bioactivity data needed for their models.
  • Interoperable. ChEMBL uses standardized vocabularies, ontologies, and data formats that allow seamless integration with other databases and tools. Compound structures are represented using standard SMILES and InChI notations, making them directly compatible with cheminformatics tools like RDKit for feature generation. Biological targets are linked to UniProt identifiers, enabling automatic integration with protein sequence and structural databases. This standardization reduces the data preprocessing burden and allows target annotations to be reliably merged across datasets.
  • Reusable. ChEMBL data comes with clear provenance information and detailed metadata about experimental conditions. This contextual information informs decisions about data filtering and quality control. For instance, knowing the exact assay type that generated a measurement allows practitioners to control for systematic biases by either removing certain assay types or explicitly modeling them as additional features.
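
The scripted access described under Findable and Accessible can be sketched with the standard library alone. This is a minimal sketch, not an official client: it assumes the public ChEMBL web-services base URL and JSON field names shown in the comments (worth checking against the current API documentation), and the helper names and the example target ID are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the public ChEMBL web services.
CHEMBL_API = "https://www.ebi.ac.uk/chembl/api/data"


def build_activity_url(target_chembl_id, standard_type="IC50", limit=100):
    """Build a query URL for bioactivities measured against one target."""
    params = {
        "target_chembl_id": target_chembl_id,
        "standard_type": standard_type,
        "limit": limit,
    }
    return f"{CHEMBL_API}/activity.json?{urlencode(params)}"


def parse_activities(payload):
    """Keep records that have both a structure and a standardized value.

    Field names (activities, canonical_smiles, standard_value, ...) are
    assumed from the ChEMBL JSON response format.
    """
    rows = []
    for record in payload.get("activities", []):
        smiles = record.get("canonical_smiles")
        value = record.get("standard_value")
        if smiles is None or value is None:
            continue  # skip incomplete records
        rows.append({
            "molecule": record.get("molecule_chembl_id"),
            "smiles": smiles,
            "value": float(value),
            "units": record.get("standard_units"),
        })
    return rows


def fetch_activities(target_chembl_id, **kwargs):
    """Download and parse one page of results (requires network access)."""
    with urlopen(build_activity_url(target_chembl_id, **kwargs)) as response:
        return parse_activities(json.load(response))
```

Calling fetch_activities("CHEMBL203", limit=50) would request one page of IC50 records for that target; a production pipeline would also follow pagination and cache responses per ChEMBL release.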

BindingDB

BindingDB is a public repository of measured binding affinities—reported as Ki, Kd, IC50, EC50, and thermodynamic quantities (ΔG°, ΔH°, −TΔS°)—between protein targets and drug-like molecules. BindingDB curates affinities from both medicinal chemistry literature and issued US patents, a patent-derived slice that ChEMBL does not systematically cover. As of 2025, BindingDB holds roughly 3.29M binding datapoints across ~1.35M molecules and ~9,454 targets, including ~58,852 protein–small-molecule crystal structures for which an affinity measurement is also recorded. Content is released with monthly updates and is accessible via web interface, REST API, and bulk download. Although BindingDB overlaps with ChEMBL (roughly 55% of ChEMBL’s data is also in BindingDB), the patent-derived and mass-spectrometry-measured affinities support models that need breadth beyond literature-curated activities.
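
Affinity constants and the thermodynamic quantities listed above are linked by the standard relation ΔG° = RT ln(Kd / 1 M). A minimal sketch of that conversion follows; the function name and the default temperature of 298.15 K are assumptions for illustration. The relation applies to dissociation constants (Kd, and Ki under the usual assumptions), not to IC50 values, which depend on assay conditions.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def delta_g_from_kd(kd_nanomolar, temperature_k=298.15):
    """Standard binding free energy (kcal/mol) from a Kd given in nM.

    Uses dG = RT * ln(Kd / 1 M), so tighter binders give more negative values.
    """
    if kd_nanomolar <= 0:
        raise ValueError("Kd must be positive")
    kd_molar = kd_nanomolar * 1e-9
    return R_KCAL * temperature_k * math.log(kd_molar)
```

Under these assumptions a 1 nM binder corresponds to roughly -12.3 kcal/mol, and each tenfold change in Kd shifts ΔG° by about 1.4 kcal/mol at room temperature.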

IUPHAR/BPS Guide to Pharmacology (GtoPdb)

The IUPHAR/BPS Guide to Pharmacology (GtoPdb) is an expert-curated database of drug targets, ligands, and their quantitative interactions, jointly maintained by the International Union of Basic and Clinical Pharmacology (IUPHAR) and the British Pharmacological Society. Its current release covers roughly 3,000 targets, 12,000 ligands, and 23,000 quantitative interaction records, with three releases per year. The World Health Organization draws on GtoPdb for its standardized drug nomenclature, and the database is widely treated as the reference for expert-curated target pharmacology. Content is available through a web interface, REST API, and bulk downloads. GtoPdb prioritizes carefully reviewed target family definitions and quantitative interaction data.

Additional bioactivity resources

Other resources for bioactivity and binding affinity:

  • PubChem BioAssay is the NCBI bioassay component of PubChem, aggregating standardized high-throughput screening data deposited by NIH Molecular Libraries, academic centers, and industrial partners across hundreds of millions of tested datapoints.
  • BindingMOAD contains ~41,000 high-quality protein–ligand complexes with manually curated binding data; largely static since 2020 but still widely used for benchmark construction.
  • PDBbind is a curated subset of PDB entries with experimentally measured binding affinities, organized into general, refined, and core sets, and long used as a docking benchmark.

Chemical Structure & Property Databases

PubChem & ToxNet

PubChem contains data on more than 110M compounds, 302M substances, and 304M bioactivities aggregated across 903 data sources. PubChem is federally sponsored and maintained by the National Library of Medicine (NLM) within the US National Institutes of Health (NIH). In addition to information on chemical structures, you can find related data on genes, proteins, pathways, and literature. PubChem has also demarcated a subset of its data that is related to COVID-19 and SARS-CoV-2. In 2019, content from ToxNet was also integrated into PubChem. The collection of ToxNet data within PubChem covers toxicology data, including but not limited to hazardous substances, genetic toxicology, and chemical carcinogenesis.

DrugBank & DrugCentral

DrugBank is a comprehensive database of structural (chemical, pharmacological) and biological target (sequence, structure, pathway) information covering approximately 15K drugs, including ~2,700 FDA-approved small molecules, ~1,500 FDA-approved biologics, 134 nutraceuticals, and ~6,700 experimental or discovery-phase drugs. The full DrugBank database requires a commercial license, which should be factored in early if you are building a pipeline you intend to publish or commercialize. DrugCentral, maintained by the University of New Mexico, is an openly licensed (CC-BY-SA) companion resource covering approximately 4,900 approved drugs with regulatory, pharmacology, and indication data. DrugCentral is commonly paired with DrugBank when a pipeline requires redistribution-friendly licensing; it overlaps substantially in coverage for marketed drugs while being more limited on experimental drugs.

Additional structure and property resources

Other resources covering chemical identity, physicochemical properties, and metabolite coverage:

  • ChemSpider contains information on ~115M chemical structures aggregated from 276 data sources by the Royal Society of Chemistry (RSC); structures in RSC publications are added automatically.
  • The Human Metabolome Database (HMDB) contains information on ~221K small-molecule metabolites found in the human body, plus ~8.6K linked protein sequences. HMDB is integrated with DrugBank, T3DB (toxins), SMPDB (pathway diagrams), and FooDB (food components).
  • The NIST Chemistry WebBook contains thermochemical data on compounds and reactions as well as IR, mass, and UV/Vis spectroscopic and gas chromatography data.
  • ChEBI is EMBL-EBI’s freely available dictionary of ~60K small-molecule chemical entities of biological interest; discussed together with ChEMBL in the preceding Bioactivity section because the two are co-maintained.
  • SureChEMBL is EMBL-EBI’s repository of approximately 28M chemical structures mined from worldwide patent filings, with daily updates and open licensing. It complements ChEMBL for patent-space analyses in Chapters 2, 3, 7, and 10.

Structural Biology

Cryo-EM, nuclear magnetic resonance (NMR), and X-ray crystallography are useful tools in structure-based drug design. They aid in determining the three-dimensional structure of biological macromolecules, such as proteins or nucleic acids, and of complexes, such as protein–ligand complexes. The resulting information serves as a gold-standard source on the atomic-level physical interactions between a drug and its protein target.

Protein Data Bank

The Protein Data Bank (PDB) is a repository of over 230,000 experimentally resolved 3D structures of large biological molecules (proteins, DNA, and RNA), including structures of proteins with bound small-molecule ligands. Note that PDB was one data source used in training AlphaFold. Other large experimentally derived repositories for organic and inorganic crystallographic information include the Cambridge Structural Database (CSD) and the Crystallography Open Database (COD).

AlphaFoldDB

AlphaFoldDB contains approximately 214M predicted protein 3D structures (AlphaFold DB v4) generated by AlphaFold2, effectively covering every sequence in UniProt. It is maintained jointly by DeepMind and EMBL-EBI and has fundamentally changed the scale at which structure-informed drug discovery can operate: from roughly 230K experimental structures to nearly three orders of magnitude more, spanning nearly every sequenced organism. Each predicted structure ships with per-residue pLDDT confidence scores and predicted aligned error (PAE) maps, which should be inspected before committing to any structure-based modeling; the sidebar following this section discusses the trade-offs in more detail. Chapter 12 and Appendix D reference AlphaFoldDB repeatedly as a complement to experimentally resolved structures.

Additional structural resources

Additional structural resources worth knowing about:

  • EMDB (Electron Microscopy Data Bank) hosts 48,000+ cryo-EM and cryo-ET maps (as of 2025); maintained at EMBL-EBI, it is the essential companion to PDB in the cryo-EM era.
  • BMRB (Biological Magnetic Resonance Bank) provides NMR data for proteins, nucleic acids, and metabolites.
  • UniProt is the reference protein sequence and functional-annotation database, linked to from essentially every structural and bioactivity resource covered in this appendix.
  • InterPro, Pfam, CATH, and SCOP are integrated protein domain, family, and structural classification hierarchies; InterPro now subsumes Pfam.
  • OPM (Orientations of Proteins in Membranes) provides spatial positioning of membrane proteins within lipid bilayers, relevant for membrane-target SBDD in Chapter 9.
  • ASD (Allosteric Database) curates allosteric proteins, sites, and modulators, useful alongside Chapter 9’s discussion of allostery.

Experimentally Resolved Structures Versus Predicted Structures

There is a critical distinction between experimentally resolved protein structures, such as those in the Protein Data Bank (PDB), and computationally predicted structures like those in AlphaFoldDB. Experimental structures, determined through X-ray crystallography, cryo-electron microscopy, or NMR spectroscopy, provide direct physical evidence of protein conformations, typically with resolution ranging from 1–3 Å for high-quality structures. These experimental structures capture specific states of proteins and include crucial information about water molecules, metal ions, and binding site flexibility.

In contrast, AlphaFoldDB structures are generated through model predictions without experimental validation for each specific protein. While AlphaFold2 achieves remarkable accuracy for overall protein folding (especially for well-conserved domains), its predictions may be less reliable for flexible loops, intrinsically disordered regions, and the side-chain and ligand-bound conformations that matter most for binding-site modeling. The model provides per-residue confidence scores (pLDDT) that should be evaluated when using these structures for drug design.

For structure-based drug discovery applications, experimentally resolved structures should be prioritized when available. AlphaFoldDB structures can serve as valuable alternatives when experimental data is lacking, but should be used with appropriate caution, particularly for novel target classes or when the predicted binding site shows low confidence scores. Ideally, computational predictions should be validated through orthogonal experimental approaches before committing significant resources to structure-based drug design campaigns.
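
The binding-site confidence check recommended above can be sketched without any external library, because AlphaFold-format PDB files store the per-residue pLDDT in the B-factor column. This is a minimal sketch under stated assumptions: a single-chain file, hypothetical function names, user-supplied pocket residue numbers, and a cutoff of 70 (the conventional lower bound of the "confident" band).

```python
def read_plddt(pdb_text):
    """Map residue number -> pLDDT, read from the B-factor column of CA atoms.

    Assumes a single-chain AlphaFold-style PDB file, where the B-factor
    field (columns 61-66) holds the per-residue pLDDT.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residue_number = int(line[22:26])
            scores[residue_number] = float(line[60:66])
    return scores


def pocket_confidence(scores, pocket_residues, threshold=70.0):
    """Return (mean pLDDT, fraction of pocket residues at or above threshold)."""
    values = [scores[r] for r in pocket_residues if r in scores]
    if not values:
        raise ValueError("no pocket residues found in structure")
    confident = sum(v >= threshold for v in values)
    return sum(values) / len(values), confident / len(values)
```

A low mean or a low confident fraction for the pocket residues is a signal to look for an experimental structure, or to treat docking results against that site with extra caution. The PAE map, which this sketch does not read, is the complementary check for relative domain placement.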

Ultra-Large Virtual Libraries

ZINC22

ZINC22 is the successor to ZINC15/ZINC20 and the current staple for large-scale 3D virtual screening. It contains approximately 54.9B tangible 3D-ready molecules and incorporates ultra-large make-on-demand catalogs such as Enamine REAL alongside in-stock and historical vendor compounds. Every molecule is either in stock, available on synthesis-on-demand terms with reasonable success rates, or sourced from curated vendor catalogs, so docking hits can realistically be purchased and assayed. Access is free; tranches are downloadable and an API and web front-end support per-tranche retrieval. ZINC22 powers most contemporary ultra-large docking campaigns, like the ones we discussed in Chapters 2, 7, 9, and 10, and has replaced earlier ZINC releases as the default library for 3D virtual screening.

Enamine REAL Space

Enamine REAL Space is the largest enumerated make-on-demand virtual library, maintained by the commercial supplier Enamine. As of 2025, REAL Space enumerates approximately 81.8B molecules built from a small set of validated reactions and a curated building-block inventory, with the smaller REAL Database subset at ~6.75B molecules; Enamine reports roughly 80% synthesis success on actual orders. The library is freely downloadable for academic use (directly from Enamine or through ZINC22), and individual compounds can be purchased on 2–6 week timelines. Landmark ultra-large-scale docking studies such as Lyu et al. (2019, Nature) and Sadybekov et al. (2022, Nature) rely on REAL as their compound pool, and the library is updated quarterly as new building blocks and reactions come online.

Additional virtual libraries

One further open enumerated library worth noting:

  • GDB (Generated Database) provides exhaustive enumeration of organic molecules up to a specified atom count; GDB-17 covers approximately 166.4B molecules with up to 17 heavy atoms. Useful for probing the boundaries of synthesizable chemical space in Chapters 7 and 10.

ML Benchmark Suites

MoleculeNet

MoleculeNet is a collection of 17 curated datasets spanning quantum mechanics, physical chemistry, biophysics, and physiology, totaling more than 700K compounds. MoleculeNet (alongside TDC) remains a heavily cited benchmark suite for molecular property prediction, and many papers report results on at least a subset of its tasks. Readers should be aware of documented train/test leakage in several sub-datasets (HIV and BACE are the best-known cases, but documented quality issues are becoming pervasive) where scaffold splits alone do not prevent near-duplicates from appearing on both sides of the split. The Notes on Usage section later in this appendix discusses split strategies in detail, and Polaris Hub (below) addresses many of these issues head-on.
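
A quick way to spot the near-duplicate leakage described above is to compare each test molecule against the training set by fingerprint similarity. The sketch below assumes fingerprints have already been computed elsewhere (for example as Morgan fingerprint on-bits) and are supplied as Python sets of integers; the function names and the 0.9 similarity cutoff are illustrative choices, not a published standard.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def find_leaks(train_fps, test_fps, threshold=0.9):
    """List (test_id, train_id, similarity) for test molecules with a near-duplicate in train.

    Both arguments map a molecule ID to a set of fingerprint on-bits.
    The pairwise loop is quadratic, so this suits spot checks on modest
    datasets rather than very large ones.
    """
    leaks = []
    for test_id, test_fp in test_fps.items():
        best_id, best_sim = None, 0.0
        for train_id, train_fp in train_fps.items():
            sim = tanimoto(test_fp, train_fp)
            if sim > best_sim:
                best_id, best_sim = train_id, sim
        if best_sim >= threshold:
            leaks.append((test_id, best_id, best_sim))
    return leaks
```

If a large share of the test set shows up in the returned list, reported metrics are likely measuring memorization more than generalization, and a stricter split is warranted.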

Therapeutics Data Commons (TDC / PyTDC)

The Therapeutics Data Commons (TDC) is an ML-ready benchmark platform maintained by an academic consortium. It exposes more than 66 datasets across 22 learning tasks (single-instance prediction, multi-instance prediction, and generation) covering target discovery, activity, efficacy, safety, and manufacturing. TDC-2 (released in 2024) extends coverage to multimodal single-cell data, clinical trial outcomes, and protein–peptide interactions. Installation is a single command, pip install PyTDC, and everything is released under the MIT license. In practice TDC has become the default benchmark suite for the models built throughout this book, though recent analyses have brought forward quality issues.

Polaris Hub

Polaris Hub is a next-generation benchmarking platform backed by the Polaris consortium (Valence Labs / Recursion, the Chan Zuckerberg Initiative, and several major pharmaceutical labs). It was created specifically to address the quality issues that accumulated in MoleculeNet and TDC over the years: duplicated records, leakage between splits, overly optimistic random splits on data where temporal or structural splits are more realistic, and insufficiently grounded industrial datasets. Polaris emphasizes time-based splits, matched molecular pair analysis, and datasets donated by industrial partners, and it is designed to be consumed through the polaris-lib Python package. At the time of writing, Polaris is moving quickly to become the default benchmark platform for industrial-strength evaluation, and readers should check it alongside (or instead of) MoleculeNet and TDC.
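
The idea behind a time-based split is simple enough to sketch directly. The code below is not the Polaris implementation; it is a minimal illustration that assumes each record is a dict with hypothetical `smiles` and `date` keys, trains on everything reported before a cutoff, and tests on later molecules that were not already seen.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Train on records dated before `cutoff`; test on later, unseen molecules.

    Each record is a dict with hypothetical keys `smiles` (a canonical
    structure string) and `date` (when the measurement was reported).
    """
    train = [r for r in records if r["date"] < cutoff]
    seen = {r["smiles"] for r in train}
    # Drop test molecules already present in training to avoid trivial leakage.
    test = [r for r in records if r["date"] >= cutoff and r["smiles"] not in seen]
    return train, test


records = [
    {"smiles": "CCO", "date": date(2019, 5, 1)},
    {"smiles": "CCN", "date": date(2021, 2, 1)},
    {"smiles": "CCO", "date": date(2022, 8, 1)},  # re-measured after the cutoff
]
train, test = temporal_split(records, cutoff=date(2021, 1, 1))
```

Here the re-measured molecule is excluded from the test set, which mimics the prospective setting in which a model is asked about compounds it has never seen.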

PoseBusters

PoseBusters is the de facto physical-validity benchmark for deep-learning protein–ligand docking and co-folding methods, produced by Buttenschoen and colleagues at Oxford. Rather than scoring poses on RMSD to a reference alone, PoseBusters checks whether predicted poses are chemically and geometrically sensible: correct stereochemistry, plausible bond lengths and angles, absence of severe clashes, and acceptable protein–ligand interaction geometry. The PoseBusters Benchmark Set consists of 308 curated protein–ligand complexes released from 2021 onward, specifically to avoid training-set contamination for models trained on older PDB snapshots. The package is MIT-licensed and installable as a Python library. Reporting PoseBusters results has become standard for DiffDock, AlphaFold 3, RoseTTAFold All-Atom, and similar methods, such as those covered in Chapters 9, 11, 12, and Appendix D.
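
To make the notion of a physical-validity check concrete, here is a deliberately simplified steric-clash test. It is not the PoseBusters implementation or API: the function name, the input format, and the 0.75 tolerance factor are assumptions for illustration, and only the van der Waals radii are standard (Bondi) values.

```python
import math

# Approximate van der Waals radii in angstroms (Bondi values).
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def find_clashes(ligand_atoms, protein_atoms, tolerance=0.75):
    """Return ligand/protein atom pairs closer than tolerance * (sum of vdW radii).

    Atoms are (element, (x, y, z)) tuples with coordinates in angstroms.
    Each result is (ligand_index, protein_index, distance).
    """
    clashes = []
    for i, (lig_el, lig_xyz) in enumerate(ligand_atoms):
        for j, (prot_el, prot_xyz) in enumerate(protein_atoms):
            limit = tolerance * (VDW_RADII[lig_el] + VDW_RADII[prot_el])
            distance = math.dist(lig_xyz, prot_xyz)
            if distance < limit:
                clashes.append((i, j, round(distance, 3)))
    return clashes
```

A pose with a good RMSD can still fail a check like this, which is why validity tests are reported alongside RMSD rather than instead of it.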

PLINDER

PLINDER (Protein–Ligand INteractions Dataset and Evaluation Resource; plinder.sh) is the largest annotated protein–ligand interaction benchmark, and it is the modern answer to the leakage problems that have dogged docking evaluation for a decade. Developed by a consortium that includes OpenFold, Isomorphic Labs, Genentech, and Stanford, PLINDER covers 449,383 protein–ligand complexes with rich annotations—pocket identity, ligand similarity, protein similarity, and assembly metadata—and, critically, ships with similarity-aware splits that control for leakage at the pocket, ligand, and protein levels simultaneously. Earlier benchmarks routinely allowed a docking model trained on pre-2021 PDB to encounter near-identical pockets or ligands at test time, inflating reported performance; PLINDER’s splits make that failure mode much harder. The full dataset and Python API are available under the MIT license through a Google Cloud Storage bulk download. PLINDER is rapidly becoming the reference benchmark for models like those covered in Chapters 9, 11, 12, and Appendix D.

CrossDocked2020

CrossDocked2020 is a training set of approximately 22.5M cross-docked protein–ligand poses across ~18,450 complexes, produced by the Koes Lab at the University of Pittsburgh. Each native PDB pose is supplemented with additional poses generated by cross-docking the ligand into related pockets, yielding a pose distribution that captures both correct and incorrect binding modes. This makes CrossDocked2020 the standard training set for structure-based generative models such as Pocket2Mol, TargetDiff, and DiffSBDD, as well as for learned scoring functions. The dataset is freely available.

Additional benchmark resources

Several additional benchmarks appear frequently in the ML drug-discovery literature:

  • OGB (Open Graph Benchmark) is the standard GNN evaluation harness; molecular subsets include ogbg-molhiv and ogbg-molpcba. Relevant for Chapters 8 and 11.
  • GuacaMol is BenevolentAI’s generative-model benchmark for distribution learning and goal-directed generation (Chapter 10).
  • MOSES (Molecular Sets) is Insilico Medicine’s complementary generative benchmark with standardized metrics and baselines (Chapter 10).
  • LIT-PCBA covers 15 targets curated from PubChem HTS campaigns; it was designed to avoid the ligand-bias issues of DUD-E and is the preferred modern alternative for virtual-screening benchmarks.
  • The Astex Diverse Set is 85 carefully curated protein–ligand complexes and a long-standing docking gold standard.
  • DUD-E provides 102 targets with actives and property-matched decoys. Use cautiously: documented ligand-bias lets models discriminate actives from decoys on crude physicochemical features alone; treat as a historical baseline rather than a rigorous evaluation.
  • CASP is the biennial blind assessment of protein structure prediction, and the venue in which AlphaFold2 was demonstrated. Relevant for Chapter 12 and Appendix D.
  • CAMEO provides continuous weekly evaluation of structure-prediction methods, complementing CASP’s biennial cadence.

Toxicity & Safety

Tox21 / ToxCast / EPA CompTox Chemicals Dashboard

The Tox21 and ToxCast programs, together with the EPA’s CompTox Chemicals Dashboard, constitute the dominant source of high-throughput in vitro toxicity data for ML. Tox21 is a joint effort of the US EPA, NIH (NCATS, NTP/NIEHS), and FDA that screened roughly 10K compounds against ~60 nuclear receptor and stress response assays; ToxCast extends this to ~9K compounds across more than 1,800 assays; and the CompTox Dashboard aggregates these together with hazard, exposure, and toxicokinetic data on approximately 1.2M chemicals. All data are CC0-licensed and accessible through a REST API and bulk download, with quarterly updates. These datasets provide a gold-standard training resource for ML models of toxicity endpoints, such as those trained in Chapter 5’s cytochrome P450 case study.

Additional toxicity and safety resources

Two complementary pharmacovigilance and side-effect resources:

  • SIDER annotates ~1,400 marketed drugs with ~5,800 side-effect terms extracted from product labels. Caveat: last comprehensive update 2015; pair with FAERS for current signals.
  • FAERS (FDA Adverse Event Reporting System) provides freely available quarterly data files of adverse-event reports.

Reaction & Synthesis

Open Reaction Database (ORD)

The Open Reaction Database (ORD) is an open, schema-standardized repository for organic reaction data, developed by a collaboration between Google Research, the Doyle Lab at Princeton, and the Coley Lab at MIT. It currently hosts more than 2M reactions contributed by academic and industrial groups, and it is the primary FAIR alternative to proprietary databases such as Reaxys and SciFinder for reaction-informed modeling. Reactions conform to a structured protobuf schema that captures reagents, conditions, outcomes, and provenance, and the full dataset is released under CC-BY-SA with a GitHub-hosted Python interface.

USPTO Reactions (Lowe dataset)

The USPTO Reactions dataset, compiled by Daniel Lowe during his PhD and hosted on Figshare under CC0, contains approximately 1.8M reactions extracted from US patent filings between 1976 and September 2016. Standardized subsets, notably USPTO-50K and USPTO-MIT, are the foundational training sets for essentially every open-source retrosynthesis and forward-prediction ML model, from Molecular Transformer to Chemformer and beyond. The dataset is static at its 2016 cutoff, which is its main limitation for newer reaction chemistry; the commercial Pistachio database (NextMove Software) is a direct continuation that many industrial labs license. Pistachio, Reaxys (Elsevier), and CAS SciFinder remain the dominant commercial reaction databases, but they are accessible only through institutional subscriptions and are not practical as open training resources.

Specialized Modalities

R-BIND

R-BIND (RNA-targeted BIoactive ligaNd Database) is the reference resource for small molecules that bind non-ribosomal RNA targets. Maintained by the Hargrove Lab at Duke, it provides manually curated bioactivity annotations along with a separate fragments subset, currently covering more than 160 small molecules with documented RNA-binding activity. The scale is small but the curation quality is high. Chapter 6’s case study on small-molecule binding to an RNA target illustrates how to use related data.

Additional specialized-modality resources

Resources for other emerging modalities covered in Part 2:

  • PROTAC-DB v3.0 (2024) catalogs 6,100+ PROTACs, 500+ warheads, 220+ E3-ligase ligands, and 2,600+ linkers. Useful for targeted-protein-degradation research.
  • COCONUT (Collection of Open Natural Products) aggregates 695,000+ natural products from 60+ sources under CC0.
  • LOTUS integrates ~750K structure–organism pairs with Wikidata under CC0, complementary to COCONUT.
  • RNAcentral aggregates ~36M non-coding RNA sequences across dozens of RNA-focused databases.
  • Rfam provides ~4,100 curated RNA families with covariance models.
  • NDB (Nucleic Acid Database) provides 3D structures of nucleic acids and nucleic acid–protein complexes, complementing PDB for RNA/DNA structural work in Chapter 6.

Target Discovery & Systems Biology

Open Targets Platform

Open Targets is the systematic target-identification and prioritization platform jointly developed by EMBL-EBI, the Wellcome Sanger Institute, and a consortium of pharmaceutical partners (GSK, Pfizer, BMS, Sanofi, MSD, Genentech, Biogen, Novartis, Takeda, Merck KGaA, Calico, and Roche). It integrates genetics, genomics, transcriptomics, drug, animal model, and literature evidence into a single target–disease association framework. Release 25.09 (September 2025) covers 63,226 targets, 28,327 diseases and phenotypes, and approximately 18M target–disease associations drawn from 22 data sources. Everything is CC0 and accessible through a web UI, GraphQL API, bulk downloads, and a Google BigQuery mirror, with quarterly releases.

Additional target-discovery and systems-biology resources

Further target-discovery and systems-biology resources, grouped roughly by data type:

  • DisGeNET covers ~1.13M gene–disease associations across ~21K genes and ~30K diseases. Caveat: moved to a freemium model in 2023; academic access remains available.
  • DepMap hosts Broad Institute CRISPR and RNAi loss-of-function screens across ~1,100 cancer cell lines with paired omics; CC-BY.
  • Connectivity Map / LINCS L1000 (CLUE) contains 1M+ perturbational gene-expression signatures from small-molecule and genetic perturbations; central to the drug-repurposing methods covered in Chapter 7.
  • Therapeutic Target Database (TTD) catalogs 4,000+ targets and 50,000+ drugs with therapeutic annotations.
  • STRING integrates protein–protein interactions across ~12K organisms, combining experimental, predicted, and text-mining evidence.
  • BioGRID provides manually curated genetic and protein interactions.
  • Reactome curates human (and other) biological pathways under CC-BY.
  • KEGG provides pathway, drug, and disease data. Free for academic use within rate limits; commercial use has required a license since 2022.
  • GEO (Gene Expression Omnibus) is the NCBI public repository of bulk and single-cell expression datasets.
  • GTEx provides tissue-specific expression across 54 human tissues from healthy donors.
  • TCGA / GDC Data Portal hosts The Cancer Genome Atlas and the broader Genomic Data Commons multi-omics cancer datasets.

Clinical & Pharmacovigilance

Clinical and pharmacovigilance resources close the loop between preclinical modeling and observed human outcomes, where downstream clinical context shapes what a useful target or useful model looks like.

  • ClinicalTrials.gov registers 490,000+ trials worldwide; API v2.0 and bulk download available. Primary source for trial-level clinical outcome data.
  • FAERS (FDA Adverse Event Reporting System) is also covered in the Toxicity & Safety section above; it is the main source of US post-marketing adverse-event signals.
  • PharmGKB is an NIH-funded pharmacogenomics knowledge base integrating gene–drug–phenotype evidence with CPIC clinical-guideline annotations; CC-BY-SA.
  • CPIC (Clinical Pharmacogenetics Implementation Consortium) publishes clinical guidelines for gene-guided prescribing, co-developed with PharmGKB.

Notes on Usage

As mentioned at the beginning of this appendix, our models are subject to inherent limitations in the data they are trained on. We’ll unearth data quality issues while working through applied examples in the book, but here we list several core dataset properties you should verify prior to building a model.

Source Categorization

Primary data sources consist of the original experimental data gathered and submitted by researchers. Secondary data sources are built on top of primary data with additional analyses and interpretations. Primary data may be specific to the original researcher’s use case, which secondary data might supplement and generalize with information from other primary or secondary data sources. Additional curation during creation of secondary data can improve content of the primary data to enable new insights and even approximate information that is otherwise infeasible to collect via experimentation. Constructing a secondary data source is also cheaper relative to the expenses needed to generate primary data.

However, secondary data can introduce noise if using supplemental data of questionable authenticity or relevance, or from a biased source. In practice, many databases are a mix of primary and secondary sources and disentangling the two is not straightforward. To determine the reliability of a data source, we can investigate its data provenance. Data provenance is the record of how the data was generated, transmitted, and stored, from its origin to its entry into the database.
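
One lightweight way to keep provenance attached to every measurement is a small record type carried alongside the value. The sketch below is purely illustrative: the class, its field names, and the assay-type labels are hypothetical and do not correspond to any particular database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    source_db: str    # database the record was pulled from
    release: str      # database release or download date
    reference: str    # primary citation, e.g. a DOI or patent number
    assay_type: str   # assay category as labeled by the source
    is_primary: bool  # True if deposited by the group that measured it


def filter_by_assay(records, allowed_assay_types):
    """Keep (value, Provenance) pairs whose assay type is in the allowed set."""
    return [(v, p) for v, p in records if p.assay_type in allowed_assay_types]
```

With provenance stored this way, decisions such as excluding an assay type or down-weighting secondary sources become explicit, reviewable filters instead of undocumented preprocessing.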

Data Exchange

Providers of data are, unsurprisingly, also consumers of data. Public databases are freely accessible, and their contents are often integrated into multiple other databases. For example, UniChem maintains a non-redundant database that cross-references identifiers and structures across 40 different databases, including ChEMBL, ZINC, PubChem, HMDB, ChEBI, and DrugBank. It is a good idea to know which other databases your source exchanges data with and how it evaluates the accuracy of the data it integrates. Specifically, data sources might vary in how they represent the same molecule or in the units of measurement they use, and an error in one data source is at risk of propagating across all linked data sources.

When integrating data from multiple sources, several preprocessing strategies can mitigate inconsistencies and prevent error propagation:

  • Standardize units of measurement (e.g., converting all IC50 values to nM)
  • Normalize molecular representations using canonical SMILES or InChI identifiers
  • Implement rigorous duplicate detection and resolution protocols
  • Establish clear rules for handling conflicting data points from different sources
  • Document data provenance to track the origin of each data point
  • Apply quality filters based on experimental confidence scores or statistical validation
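As a minimal sketch of the first few strategies in this list, the snippet below converts IC50 values to nM, groups records by a structure key and target, and resolves duplicates while keeping track of where each value came from. The record keys (`inchikey`, `target`, `value`, `unit`, `source`) and the choice of the median as the resolution rule are assumptions made for illustration, not a schema or policy used by any particular database.

```python
from collections import defaultdict
from statistics import median

# Multipliers that convert a concentration in the given unit to nanomolar.
TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "µM": 1e3, "mM": 1e6, "M": 1e9}


def to_nanomolar(value, unit):
    """Convert a concentration to nM; raise on units we do not recognize."""
    try:
        return value * TO_NM[unit]
    except KeyError:
        raise ValueError(f"unrecognized unit: {unit!r}") from None


def merge_records(records):
    """Group records by (structure key, target) and resolve duplicates.

    Each record is a dict with hypothetical keys: 'inchikey', 'target',
    'value', 'unit', and 'source'. Duplicates are resolved by taking the
    median in nM, and the contributing sources are kept for provenance.
    """
    grouped = defaultdict(list)
    for rec in records:
        key = (rec["inchikey"], rec["target"])
        grouped[key].append((to_nanomolar(rec["value"], rec["unit"]), rec["source"]))

    merged = {}
    for key, entries in grouped.items():
        values = [v for v, _ in entries]
        merged[key] = {
            "ic50_nm": median(values),
            "n_measurements": len(values),
            "sources": sorted({s for _, s in entries}),
        }
    return merged
```

Raising on an unrecognized unit, rather than silently passing the value through, is deliberate: a mislabeled unit is exactly the kind of error that otherwise propagates across linked sources.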

In the case where we want to use ChEMBL, we might check their documentation on how they extract, curate, and annotate drug data. We’d learn that information on the literature reference, assay, target, organism, and structural and property information of the compound is manually extracted from full text articles via a service. The data then undergoes automated and manual curation steps to standardize activity types, standardize units, compute additional properties, and flag and correct incorrect or duplicate data.

When evaluating chemical databases, we must also consider recency and update frequency. ChEMBL, for instance, releases major updates approximately twice yearly, while PubChem receives daily updates as new compounds are deposited. Some specialized databases might update only annually or less frequently. This update cadence directly impacts model quality, particularly for rapidly evolving research areas involving emerging targets. Using a database that hasn’t been updated in 18 months might mean missing crucial compounds that could serve as valuable training examples.

Additionally, data corrections and standardization improvements often accompany database updates. Earlier versions might contain structural errors, incorrect activity values, or inconsistent representations that get rectified in subsequent releases. When selecting a database, examine its version history documentation to understand both update frequency and what specific improvements each update contained. Consider documenting the version and timestamp of each dataset used in your model training to ensure reproducibility and to help evaluate performance differences between model iterations.
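One lightweight way to document the version and timestamp of a dataset is to store a small manifest next to the snapshot. The sketch below is an assumption about what such a record might contain (the field names and the `build_manifest` helper are hypothetical); it hashes the raw bytes so that a later download can be checked against the snapshot a model was actually trained on.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_manifest(data_bytes, source, release):
    """Return a small provenance record for a dataset snapshot."""
    return {
        "source": source,
        "release": release,
        "sha256": hashlib.sha256(data_bytes).hexdigest(),
        "n_bytes": len(data_bytes),
        "retrieved_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


# Toy stand-in for a downloaded file; in practice, read the file's bytes.
snapshot = b"compound_id,ic50_nm\nC1,500\n"
manifest = build_manifest(snapshot, source="ChEMBL", release="<release number>")
print(json.dumps(manifest, indent=2))
```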

Train/Test Splits

A random split assumes that training and test data are identically distributed, which is almost never true for drug-discovery data: chemical series cluster, assays drift over time, and literature coverage is biased toward well-studied targets. A scaffold split (Bemis–Murcko framework separation) removes obvious near-duplicates and is a useful minimum bar, but it does not prevent different scaffolds from sharing substituents, or near-identical analogs from landing on both sides of the split. Time-based splits (training on data published before some cutoff date and testing on everything after) more closely simulate prospective deployment and often reveal that headline performance numbers are dramatically inflated. Similarity-aware splits, which enforce separation by pocket, ligand, and protein similarity simultaneously, are considered best practice for docking and structure-based ML; PLINDER’s splits (described in the ML Benchmark Suites section above) implement this and should be preferred over the split strategies that ship with older benchmarks.
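The two simpler strategies can be sketched in a few lines. The code below assumes each record is a dict carrying a publication `year` and a precomputed grouping key such as a scaffold string (computing Bemis–Murcko scaffolds would require a cheminformatics toolkit and is left out). The key names and the largest-groups-first assignment rule are illustrative assumptions, not the procedure of any specific benchmark.

```python
from collections import defaultdict


def time_split(records, cutoff_year):
    """Train on records published before cutoff_year, test on the rest."""
    train = [r for r in records if r["year"] < cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year]
    return train, test


def group_split(records, group_key, test_fraction=0.2):
    """Assign whole groups (e.g., scaffolds) to train or test.

    Groups are visited largest first. A group goes to training if it still
    fits within the training budget, otherwise to test, so no group ever
    straddles the split.
    """
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r)

    train_budget = (1.0 - test_fraction) * len(records)
    train, test = [], []
    # Sort by size (descending), then by key for a deterministic result.
    for _key, members in sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        if len(train) + len(members) <= train_budget:
            train.extend(members)
        else:
            test.extend(members)
    return train, test
```

Because whole groups are moved at once, the realized test fraction only approximates `test_fraction`; it is worth reporting the actual sizes alongside any results.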

Benchmark Leakage and Contamination

Many published performance numbers on older benchmarks are misleading because the training data of modern models overlaps heavily with the test sets. For example, in structure-based docking, models trained on any pre-2021 snapshot of the PDB have effectively memorized large portions of the PDBbind and CASF test sets. PoseBusters mitigates this with a post-2021 benchmark set and PLINDER mitigates it with similarity-aware splits, but many published comparisons on older benchmarks do neither, and their reported improvements should be treated with caution. As another example, DeepChem’s MoleculeNet HIV and BACE datasets have documented near-duplicate pairs across scaffold splits, meaning that even scaffold-split accuracy can be overestimated. A third example is historical but still cited: DUD-E’s actives and decoys are distinguishable by simple physicochemical descriptors, so models that appear to discriminate binders from non-binders may in fact be discriminating drug-like from non-drug-like chemistry (LIT-PCBA is one recommended replacement benchmark). When reading a new modeling paper, a sensible first question is: what was the split strategy, and is the reported benchmark known to leak?
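A quick sanity check for this kind of leakage is to ask, for every test compound, how similar its nearest training neighbor is. The sketch below assumes fingerprints have already been computed elsewhere and are available as sets of on-bit indices keyed by compound ID; the function names and the 0.9 threshold are arbitrary illustrative choices, and the brute-force loop is only practical for small sets.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_near_duplicates(train_fps, test_fps, threshold=0.9):
    """Return test IDs whose nearest training neighbor meets the threshold.

    train_fps and test_fps map a compound ID to a set of on-bit indices.
    The result maps each flagged test ID to (nearest_train_id, similarity).
    """
    flagged = {}
    for test_id, test_fp in test_fps.items():
        best_id, best_sim = None, 0.0
        for train_id, train_fp in train_fps.items():
            sim = tanimoto(test_fp, train_fp)
            if sim > best_sim:
                best_id, best_sim = train_id, sim
        if best_sim >= threshold:
            flagged[test_id] = (best_id, best_sim)
    return flagged
```

A high flagged fraction does not prove a benchmark is invalid, but it is a signal to report performance separately on the flagged and unflagged test compounds.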

Activity Cliffs and Assay-Merge Noise

Aggregated bioactivity databases (ChEMBL, BindingDB, PubChem BioAssay) merge data from tens of thousands of distinct assays run across decades in different labs. Even for the same compound against the same target, reported IC50 values can disagree by more than an order of magnitude because of assay format, cell line, reference compound, substrate concentration, and plate-to-plate variability. In the worst case, the same molecule is labeled “active” in one assay and “inactive” in another. Compounding the problem, the drug-discovery literature is populated with activity cliffs, where a tiny chemical modification, such as adding or removing a methyl group or swapping a heteroatom, changes potency by 10–100×. Models trained on naively merged data learn neither the real SAR signal nor a calibrated noise floor, and they tend to fail where cliffs have the highest impact. Practical mitigations include filtering to a single assay type per target before modeling (at a cost in dataset size), modeling assay identity as an explicit categorical feature, reporting per-assay performance as well as global performance, and validating predictions on held-out series rather than held-out random samples.
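Before merging, it helps to measure how much the assays actually disagree. The sketch below converts IC50 values (assumed to be in nM already) to pIC50 and flags compound–target pairs whose values span more than one log unit, i.e., one order of magnitude. The tuple layout and the `flag_discordant` name are hypothetical, chosen only for this illustration.

```python
import math
from collections import defaultdict


def pic50(ic50_nm):
    """Convert an IC50 in nM to pIC50 (-log10 of the molar concentration)."""
    return 9.0 - math.log10(ic50_nm)


def flag_discordant(measurements, max_spread=1.0):
    """Find compound-target pairs whose pIC50 values disagree across assays.

    measurements: iterable of (compound_id, target_id, assay_id, ic50_nm).
    Returns {(compound_id, target_id): spread_in_log_units} for pairs whose
    max - min pIC50 exceeds max_spread (1.0 = one order of magnitude).
    """
    by_pair = defaultdict(list)
    for compound_id, target_id, _assay_id, ic50_nm in measurements:
        by_pair[(compound_id, target_id)].append(pic50(ic50_nm))

    flagged = {}
    for pair, values in by_pair.items():
        spread = max(values) - min(values)
        if len(values) > 1 and spread > max_spread:
            flagged[pair] = spread
    return flagged
```

Flagged pairs are candidates for manual review or for exclusion. What counts as an acceptable spread depends on the assay and the modeling goal, so the default here is a starting point rather than a recommendation.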

Commercial vs. Free Licensing

Several widely cited resources have moved to or reinforced commercial terms, even since this appendix was first drafted! If you are building a model or product you intend to publish, distribute, or commercialize, audit the license terms of every resource in your pipeline early.

Garbage In, Garbage Out

ML facilitates drug discovery, but not without shortcomings. Data of suspect quality yields a faulty model, i.e., garbage in, garbage out. A lack of quality data limits our ability to model and search the vast space of drug-like molecules, which pushes attention toward well-explored chemical structures with more data and less complex interactions. The more limited the diversity of chemical structures in our data set, the more restricted our model’s applicability domain. The applicability domain describes how far we can trust the model’s predictions as molecules become increasingly different in structure or composition from those in our training data set. Even the available data may only serve as an approximation of reality. Our ML models capture trends in the data; low-quality data results in low-quality models.

As an example, suppose we want to screen, or filter out, drug candidates that might cause drug-induced liver injury (DILI), which is the leading cause of drug withdrawal from the market [1]. Coincidentally, we happen to have a data set of molecules, each with “Yes” or “No” labels that indicate whether they are associated with DILI, that we can use to model and predict DILI on any molecule! We craft a proposal for our DILI model and send it to our employer’s pharmacovigilance department, which is responsible for drug safety with respect to clinical use.

Their response is lackluster and pinpoints an issue in our data set – the labels are conditional. Drug X might provoke DILI, but only in patients with a specific genotype or only in patients concomitantly taking drug Y or only when administered above a certain dose (even dihydrogen monoxide, also known as water, can be toxic at extreme doses). Once the adverse reaction occurs, we might observe differences in severity levels, frequency of occurrence across gender or age groups, and duration of the effect (i.e., short-term versus long-term). For this use case, a binary classifier is not adequate, and we need to reconsider our data selection and modeling strategy. Importantly, the further detached our data and model become from the full context under which the drug is taken, the less reliable improved model performance becomes as an indicator of improved safety and efficacy.

Continuing this example, how might our DILI data set have been created? Experiments that directly measure DILI in a living organism (i.e., in vivo experiments) are difficult, expensive, and low-throughput, and they may have more confounding factors than in vitro experiments, which measure microorganisms, cells, or molecules in a controlled environment outside their normal biological context, such as a test tube or petri dish. If we want to generate a DILI data set without in vivo experiments in an animal model or human, we could measure DILI via a proxy assay. One possible contributor to DILI is bile salt export pump (BSEP) inhibition, which we can measure with a BSEP inhibition assay on hepatocytes (the cell type that makes up most of the liver’s cell population).

Though proxy assays are a useful source of data, we ultimately care about the phenotypic endpoint (DILI in this case). The further our data set moves away from the in vivo context, the further our model’s utility moves away from clinical relevance. A silver lining is that these limitations can be interpreted as challenges to overcome, and the field is ripe for contributions and advancements from readers such as yourself. We will formally cover the limitations of ML, and ways to navigate them, throughout the second half of the book.

Note. Even high-quality data can produce poor results if the context of data collection is ignored or misunderstood! Always investigate the experimental conditions under which your training data was collected.

Where to Look Next

The resources catalogued here represent a snapshot. New databases appear, older ones change license terms, scales grow by orders of magnitude, and curation practices evolve. Community review articles (e.g., Nucleic Acids Research publishes an annual database issue each January) can help you keep current. Whatever resource you use, cite the specific version or release (the ChEMBL release number, the UniProt release date, the Open Targets quarterly tag) and archive the data snapshot you trained on. Models are only reproducible if the data they were trained on is.

I maintain this catalog beyond the book’s release. If a resource here has changed, or one is missing that belongs on the list, let me know.


References

[1] Babai, S., Auclert, L., and Le-Louët, H. (2018). Safety data and withdrawal of hepatotoxic drugs. Therapie. https://doi.org/10.1016/j.therap.2018.02.004


Adapted from Appendix B of Build AI Drug Discovery Pipelines. The book puts these datasets to work in applied case studies.
