BioassayR Database Downloads

Database Description

This database contains all small molecule bioactivity screens from NCBI PubChem BioAssay that include at least one real activity score (active or inactive) and have a specified protein target.

Three types of annotation details are provided for protein targets: UniProt identifiers, sequence level clustering (via kClust), and Pfam domains. These can be accessed using the "translateTargetId()" function in bioassayR with the category option "UniProt", "kClust", or "domains" respectively.

Domain data for target proteins is based on a HMMER 3.1b2 search using Pfam version 29.0 with the options "hmmscan -E 0.01 --domE 0.01 --cpu 8 --noali". Sequence level protein target clustering was done with kClust using the command line options "-s 0.52 -c 0.8 -e 1.0e-4 -M 16000MB".

If duplicate CIDs are contained within a single assay (such as multiple SIDs for the same compound), only one is kept, with no specific selection criteria. As of April 2016, raw numeric scores are taken from a data column other than the 0 to 100 PUBCHEM_ACTIVITY_SCORE if there is an available column matching the following regex (case insensitive): "inhibition|ic50|ki|gi50|ec50|ed50|lc50".
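The raw score column selection and duplicate CID handling described above can be illustrated with a minimal Python sketch. This is not the code used to build the database (see the source repository for that). The function names, the (cid, score) row layout, the use of substring search, and the "first match wins" and "first row kept" rules are all assumptions made for illustration; only the regex itself is taken from the description.

```python
import re

# Pattern quoted from the database description; matching is case insensitive.
SCORE_COLUMN_PATTERN = re.compile(
    r"inhibition|ic50|ki|gi50|ec50|ed50|lc50", re.IGNORECASE
)


def pick_score_column(column_names):
    """Return the first column name matching the pattern, or None.

    Assumptions: substring search and "first match wins". The actual
    build scripts may resolve multiple matches differently.
    """
    for name in column_names:
        # The 0 to 100 activity score column is never a candidate.
        if name == "PUBCHEM_ACTIVITY_SCORE":
            continue
        if SCORE_COLUMN_PATTERN.search(name):
            return name
    return None


def drop_duplicate_cids(rows):
    """Keep one row per CID within a single assay.

    `rows` is an iterable of (cid, score) tuples (hypothetical layout).
    This sketch keeps the first row seen for each CID; the database
    itself applies no specific selection criteria.
    """
    seen = set()
    kept = []
    for cid, score in rows:
        if cid not in seen:
            seen.add(cid)
            kept.append((cid, score))
    return kept
```

Under the substring assumption, the short alternative "ki" also matches unrelated column names that merely contain those letters (for example a name containing "kinase"), which is one reason to double check the score column chosen for any assay that matters to an analysis.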

An attempt was made to parse all assay data as accurately as possible; however, users should double-check the accuracy of any data critical to their experiments.

File Downloads

pubchem_protein_only.sqlite (7.8 GB, updated April 6th 2016, md5sum 294c1e7e27ffd8ed37411087e15474af)
Previous version using all 0 to 100 PUBCHEM_ACTIVITY_SCORE raw scores: pubchem_protein_only_01_11_2016.sqlite (7.5 GB, updated January 11th 2016, md5sum 92f4ae5ac38fc8b582002d9c95fa77f4)
https://github.com/TylerBackman/pubchem-bioassay-database (source code to build this database)
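
Because the database files are several gigabytes, it may be worth checking a download against the md5sum listed above before using it. The following Python sketch shows one way to do that, and to list the tables in the SQLite file without assuming anything about its schema. The function names and the chunk size are illustrative choices, and the example at the end assumes the file was saved to the working directory.

```python
import hashlib
import sqlite3


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that a multi-gigabyte download is never held in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_tables(path):
    """Return the table names stored in an SQLite file, sorted by name."""
    connection = sqlite3.connect(path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


# Example (commented out; assumes the file is in the working directory):
# expected = "294c1e7e27ffd8ed37411087e15474af"
# if md5_of_file("pubchem_protein_only.sqlite") != expected:
#     raise SystemExit("checksum mismatch: download again")
# print(list_tables("pubchem_protein_only.sqlite"))
```

Querying sqlite_master first is a deliberate choice here: it reveals the actual table names in the downloaded file, so that any later queries do not rely on guessed names.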