This database contains all small-molecule bioactivity screens from NCBI PubChem BioAssay that include at least one real activity score (active or inactive) and have a specified protein target.
Three types of annotation details are provided for protein targets: UniProt identifiers, sequence-level clusters (via kClust), and Pfam domains. These can be accessed using the "translateTargetId()" function in bioassayR with the category option "UniProt", "kClust", or "domains", respectively. Domain data for target proteins are based on a HMMER 3.1b2 search against Pfam version 29.0 with the options "hmmscan -E 0.01 --domE 0.01 --cpu 8 --noali". Sequence-level protein target clustering was done with kClust using the command line options "-s 0.52 -c 0.8 -e 1.0e-4 -M 16000MB". If a single assay contains duplicate CIDs (such as multiple SIDs for the same CID), only one is kept, chosen without any specific criteria. As of April 2016, raw numeric scores are also included from a data column other than the 0 to 100 PUBCHEM_ACTIVITY_SCORE when an available column matches the following regex (case insensitive): "inhibition|ic50|ki|gi50|ec50|ed50|lc50".
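To illustrate how the case-insensitive regex quoted above can select a raw score column, here is a minimal Python sketch. It is not the code used to build the database: the helper name find_score_column, the example column names, the substring-style matching, and the "first match wins" rule are all assumptions made for illustration.

```python
import re

# Regex quoted in the description, compiled case-insensitively.
SCORE_COLUMN_PATTERN = re.compile(
    r"inhibition|ic50|ki|gi50|ec50|ed50|lc50", re.IGNORECASE
)


def find_score_column(column_names):
    """Return the first column name containing a pattern match, or None.

    Hypothetical helper: substring matching and taking the first hit
    are assumptions, not documented behavior of the database build.
    """
    for name in column_names:
        if name == "PUBCHEM_ACTIVITY_SCORE":
            continue  # the 0 to 100 score is explicitly not the target
        if SCORE_COLUMN_PATTERN.search(name):
            return name
    return None


# Hypothetical column headers from an assay data table.
columns = ["PUBCHEM_ACTIVITY_SCORE", "IC50 (uM)", "Comment"]
print(find_score_column(columns))  # prints: IC50 (uM)
```

Under substring matching, the short alternative "ki" also matches unrelated headers such as "Kinase name", which is one reason to double check which column a given assay's raw scores came from.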
An attempt was made to parse all assay data as accurately as possible; however, users should double-check the accuracy of any data critical to their experiments.