Interactive ligands visualizer


I (Olivier) put together this interactive visualizer to make sure that I don't miss any chemotypes when choosing actives at the start of a retrospective campaign. Starting from a ChEMBL CSV file downloaded for a list of ligands, an image of each molecule is generated with RDKit and a text file of filtered SMILES is written. You then compute the ECFP fingerprints from that file on Gimel (see below), after which a generated script shows an interactive visualization of the chemical space spanned by the ligands (t-SNE), with each molecule displayed on mouse hover.

Chemspace vis example.gif


Step 1: install chemspace_vis package

The visualizer will need to be installed on your local machine, not the cluster.

Make sure you are using Python 3, and then simply:

pip install chemspace_vis

N.B. This only works on Mac and Linux, sorry Windows users (if you exist).

For those interested, the source code can be found on my GitHub: https://github.com/gregorpatof/chemspace_vis_package


Step 2: obtain ChEMBL CSV file (or use provided example)

Any ChEMBL CSV for a given activity type of a given target will do. If you want to run the visualizer directly from SMILES and fingerprints, see the bottom of the page.

You can also clone the example repository, which contains the CSV for mu-opioid ligands with measured Emax and an example script:

git clone https://github.com/gregorpatof/chemspace_vis_example

Just to make things too clear, here is how I obtained that CSV:


Step 3: extract SMILES and activity for given HAC and MW filters

This is done by the preprocess_part1() function in the example script, which makes a single call:

from chemspace_vis.preprocess import preprocess_chembl

chembl_csv = "mor_chembl_emax.csv"

activity_name = "Emax" # The text name of the activity (in this case, Emax)
preprocess_chembl(chembl_csv, activity_name, max_hac=35, max_mw=600, img_folder="mol_images")

As you can see, you can specify the maximum number of heavy atoms (max_hac) and maximum molecular weight (max_mw) for the ligands to keep.

This will generate two files: a .smi file with the SMILES of all the retained ligands, and a .df file that stores the activity value (Emax here) in dataframe format.

It also generates 2D images of all your molecules in the mol_images folder, each labeled with its identifier (the ChEMBL ID, or whatever ID appears in the .smi file) and its activity.
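To give a feel for what this kind of preprocessing involves, here is a minimal standard-library sketch of the molecular weight part of the filter. It is not the implementation of preprocess_chembl. The semicolon delimiter, the column names ("Molecule ChEMBL ID", "Molecular Weight", "Smiles", "Standard Type", "Standard Value"), the "SMILES ID" line layout, and the helper names filter_chembl_rows and to_smi_lines are all assumptions made for illustration, and the rows are invented toy data. The heavy atom count filter is left out because it needs a cheminformatics toolkit such as RDKit.

```python
import csv
import io


def filter_chembl_rows(csv_text, activity_name, max_mw=600.0, delimiter=";"):
    """Return (smiles, mol_id, value) tuples for rows that match the activity
    name and pass the molecular weight cutoff.

    Column names and delimiter are assumptions about the CSV export layout.
    """
    kept = []
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        if row.get("Standard Type") != activity_name:
            continue  # a different activity type
        try:
            smiles = row["Smiles"]
            mol_id = row["Molecule ChEMBL ID"]
            mw = float(row["Molecular Weight"])
            value = float(row["Standard Value"])
        except (KeyError, TypeError, ValueError):
            continue  # missing column or non-numeric field
        if smiles and mw <= max_mw:
            kept.append((smiles, mol_id, value))
    return kept


def to_smi_lines(rows):
    """Format kept rows as 'SMILES ID' lines (assumed .smi layout)."""
    return [f"{smiles} {mol_id}" for smiles, mol_id, _ in rows]


# Invented toy rows, not real ChEMBL records.
example_csv = (
    '"Molecule ChEMBL ID";"Molecular Weight";"Smiles";"Standard Type";"Standard Value"\n'
    '"TOY1";"46.07";"CCO";"Emax";"80"\n'
    '"TOY2";"78.11";"c1ccccc1";"Ki";"5"\n'
    '"TOY3";"650.00";"CCCCCCCC";"Emax";"90"\n'
    '"TOY4";"16.04";"C";"Emax";""\n'
)

rows = filter_chembl_rows(example_csv, "Emax", max_mw=600.0)
smi_lines = to_smi_lines(rows)
```

Only TOY1 survives here: TOY2 has a different activity type, TOY3 is above the weight cutoff, and TOY4 has no activity value.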


Step 4: compute the fingerprints on Gimel

Copy the .smi file to gimel, source the DOCK3.7 base, and then run this command (on gimel, not gimel2 or others):

python ~jklyu/zzz.github/ChemInfTools/utils/teb_chemaxon_cheminf_tools/generate_chemaxon_fingerprints.py mor_chembl_emax.smi mor_chembl_emax

This will generate a .fp file, in this case mor_chembl_emax.fp.

I personally like to use bash, so the way I source the DOCK3.7 base is:

export DOCKBASE=/nfs/soft/dock/versions/dock37/DOCK-3.7-trunk
source /nfs/soft/dock/versions/dock37/DOCK-3.7-trunk/env.sh


Step 5: tSNE and interactive visualization

Almost done! Copy the .fp file back to your machine, then run part 2 of the example script:

from chemspace_vis.preprocess import make_tsne_from_fingerprints
from chemspace_vis.visualizer import make_visualizer_script

fingerprints_file = "mor_chembl_emax.fp"
make_tsne_from_fingerprints(fingerprints_file)
make_visualizer_script("tsne_data.df", "mol_images", activity_filename="mor_chembl_emax_activity.df", use_log10=False)

The first call computes the t-SNE embedding from the fingerprints. It prints the percentage of the variance captured by the PCA that is applied first (anything over 90-95% is good).
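For readers unfamiliar with that percentage, the sketch below shows how such a figure is usually defined: the variance carried by the retained principal components divided by the total variance. The function name explained_variance_percent and the numbers are made up for illustration, and this is not taken from the package's code.

```python
def explained_variance_percent(component_variances, n_kept):
    """Percentage of total variance carried by the n_kept largest components."""
    total = sum(component_variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    largest_first = sorted(component_variances, reverse=True)
    return 100.0 * sum(largest_first[:n_kept]) / total


# Invented per-component variances, largest first.
variances = [6.0, 3.0, 0.5, 0.5]
coverage_two = explained_variance_percent(variances, 2)
coverage_all = explained_variance_percent(variances, 4)
```

With these toy numbers, keeping two components out of four covers 90% of the variance, which would sit at the lower edge of the range suggested above.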

If you get an error that looks like this:

ValueError: n_components=180 must be between 0 and min(n_samples, n_features)=91 with svd_solver='full'

It means that you have fewer than 180 molecules in your list. Simply change the number of requested PCA components to a number below your number of molecules:

make_tsne_from_fingerprints(fingerprints_file, n_pca_components=SOME_NUMBER_LESS_THAN_N)
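If you would rather not pick that number by hand, a small helper can derive it from the .smi file. This is a sketch under assumptions: the default of 180 is inferred from the error message above, the .smi file is assumed to hold one molecule per non-empty line, and count_molecules and pick_n_pca_components are hypothetical helpers, not part of chemspace_vis.

```python
def count_molecules(smi_lines):
    """Count non-empty lines, assuming one molecule per line in the .smi file."""
    return sum(1 for line in smi_lines if line.strip())


def pick_n_pca_components(n_molecules, default=180):
    """Keep the assumed default when possible, else stay below the molecule count."""
    if n_molecules < 2:
        raise ValueError("need at least two molecules")
    return min(default, n_molecules - 1)


# Toy input standing in for the lines of a .smi file.
toy_lines = ["CCO TOY1\n", "\n", "c1ccccc1 TOY2\n", "C TOY3\n"]
n_toy = count_molecules(toy_lines)
n_components_small = pick_n_pca_components(91)
n_components_large = pick_n_pca_components(500)
```

The result can then be passed as n_pca_components in the call shown above. Note that the error message also bounds the value by the number of fingerprint features, which this helper does not check.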

Then, the visualizer script will be generated. If you supply an activity filename, you will get coloring based on that property (here, Emax). The use_log10 flag can be useful if you have extreme values driving the coloring.
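To see why a log10 transform helps with extreme values, here is a toy illustration of scaling activities to a 0-1 color range with and without the transform. How the package actually maps values to colors is not described here, so color_scale and normalize are hypothetical stand-ins and the activity values are invented.

```python
import math


def normalize(values):
    """Min-max scale values to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def color_scale(values, use_log10=False):
    """Return 0-1 color positions, optionally after a log10 transform."""
    if use_log10:
        if any(v <= 0 for v in values):
            raise ValueError("log10 requires strictly positive values")
        values = [math.log10(v) for v in values]
    return normalize(values)


# Invented activities with one extreme value.
activities = [1.0, 10.0, 100.0, 10000.0]
linear = color_scale(activities)
logged = color_scale(activities, use_log10=True)
```

On the linear scale the first three molecules are squeezed into the bottom 1% of the color range by the single outlier, whereas after the log10 transform they are spread over the lower half.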


Step 6: run the visualizer

Simply run the generated visualizer script:

python visualizer_script.py

You can then zoom in on regions where ligands are close together, and return to the general view with the back arrow:

Mor zoom example.gif


Optional: run directly from SMILES and fingerprints

For this, you will bypass the ChEMBL processing step. Simply generate the images:

from chemspace_vis.preprocess import generate_images

generate_images("your_smiles.smi", "mol_images", activity_fn=None)

And then run the second step, without an activity file:

from chemspace_vis.preprocess import make_tsne_from_fingerprints
from chemspace_vis.visualizer import make_visualizer_script

fingerprints_file = "mor_chembl_emax.fp"
make_tsne_from_fingerprints(fingerprints_file)
make_visualizer_script("tsne_data.df", "mol_images", use_log10=False)