Olivier's way of computing novelty


This is my lazy way of doing novelty calculations.

Step 0: source my environment

Everything I'm developing is part of a single Python package called bksltk (BKS-lab Toolkit). Just source my Python environment on gimel2 (or any other gimelX that has Python 3) and you'll have access to everything:

source /nfs/home/omailhot/pyenv_source.sh

Eventually the package will be better documented (and probably integrated into pydock).

Step 1: get all knowns from ChEMBL

Here we don't really care about properties, drug-likeness, etc. Go to your target on ChEMBL, grab the version with the highest number of compounds, and click on the number of associated compounds to view the list of them (no activity data). Then download the list as CSV. Unzip the file and rename it to whatever you want; we will call it chembl.csv here. To generate a .smi file from it, simply use:

from bksltk.utils import write_smiles_from_chembl

write_smiles_from_chembl('chembl.csv', 'chembl.smi')
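
For reference, a conversion of this kind can be sketched with the standard library alone. This is not the bksltk implementation. It assumes the ChEMBL export is semicolon-delimited with columns named `Smiles` and `ChEMBL ID`, so check the header of your own file first. The function name `csv_to_smi` is hypothetical.

```python
import csv


def csv_to_smi(csv_path, smi_path, smiles_col="Smiles", id_col="ChEMBL ID", delimiter=";"):
    """Write one 'SMILES identifier' line per row that has both fields.

    The column names and the delimiter are assumptions about the ChEMBL
    export, not something taken from bksltk.
    """
    n_written = 0
    with open(csv_path, newline="") as src, open(smi_path, "w") as dst:
        for row in csv.DictReader(src, delimiter=delimiter):
            smiles = (row.get(smiles_col) or "").strip()
            mol_id = (row.get(id_col) or "").strip()
            if not smiles or not mol_id:
                continue  # skip rows without a structure or an identifier
            dst.write(f"{smiles} {mol_id}\n")
            n_written += 1
    return n_written
```

The output has no header and puts the SMILES in the first column and the identifier in the second, which is the layout Step 2 expects.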

Step 2: compute novelty

You will also need your own list of molecules as a ".smi" file, which is just a plain text file with no header and whitespace-separated columns: the first column is the SMILES string of each compound and the second column is an identifier. Look at the 'chembl.smi' file if you are confused. Then, to get a dataframe of the maximum similarity of each of your molecules to any of your knowns, simply do:

from bksltk.utils import get_novelty

get_novelty('chembl.smi', 'your_mols.smi', 'novelty.txt')

This will write all the maximum Tanimoto coefficients (Tcs) to the 'novelty.txt' file. Alternatively, you can give a Tc threshold, and only molecules less similar to the knowns than the threshold will be written out:

get_novelty('chembl.smi', 'your_mols.smi', 'novelty.txt', threshold=0.35)
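
To make the quantity concrete: each molecule is scored by its highest Tanimoto coefficient (Tc) to any known, and the threshold keeps only the molecules scoring below it. The sketch below illustrates that logic on toy fingerprints represented as sets of on-bit indices. It is not how bksltk computes them (the fingerprint type it uses is not stated here), and the names `tanimoto`, `max_tc` and `filter_novel` are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def max_tc(query_fp, known_fps):
    """Highest Tc between one query fingerprint and any known fingerprint."""
    return max((tanimoto(query_fp, k) for k in known_fps), default=0.0)


def filter_novel(query_fps, known_fps, threshold=None):
    """Map query id -> max Tc; with a threshold, keep only ids strictly below it."""
    scores = {qid: max_tc(fp, known_fps) for qid, fp in query_fps.items()}
    if threshold is None:
        return scores
    # Strict '<' is an assumption based on "less similar than the threshold".
    return {qid: tc for qid, tc in scores.items() if tc < threshold}


# Toy data: two knowns and two query molecules.
knowns = [{1, 2, 3, 4}, {10, 11, 12}]
queries = {"mol_a": {1, 2, 3, 5}, "mol_b": {20, 21}}
print(filter_novel(queries, knowns))                  # all max Tcs
print(filter_novel(queries, knowns, threshold=0.35))  # only the novel ones
```

Here mol_a shares 3 of 5 distinct bits with the first known (Tc = 0.6), so it is dropped at a 0.35 threshold, while mol_b shares nothing with either known and is kept.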

Note:

Intermediate files with the '.fp' file extension are created along the way. If RDKit fails to read some of your input SMILES, delete all .fp files before rerunning.
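
Clearing the .fp files can also be done from Python. A minimal sketch, assuming the .fp files sit directly in the directory you ran from (where bksltk actually writes them is not stated here); `remove_fp_files` is a hypothetical helper, not part of bksltk.

```python
from pathlib import Path


def remove_fp_files(directory="."):
    """Delete every *.fp file directly inside `directory`; return the removed names."""
    removed = []
    for fp_file in sorted(Path(directory).glob("*.fp")):
        if fp_file.is_file():
            fp_file.unlink()
            removed.append(fp_file.name)
    return removed
```

Calling `remove_fp_files()` with no argument works on the current directory and leaves every other file alone.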

Step 3: ultimate laziness

Combine these in a single script:

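from bksltk.utils import write_smiles_from_chembl, get_novelty

# Tc cutoff: only molecules less similar than this to any known are written out
your_threshold = 0.35

# Step 1: convert the ChEMBL CSV download into a .smi file of knowns
write_smiles_from_chembl('chembl.csv', 'chembl.smi')
# Step 2: write the maximum Tc of each of your molecules to any known
get_novelty('chembl.smi', 'your_mols.smi', 'novelty.txt', threshold=your_threshold)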