DISI - User contributions [en]

Small change analogs

2024-12-11T22:55:36Z

Omailhot:

'''Step 0: source my environment'''

Everything I'm developing is part of a single Python package called bksltk (BKS-lab Toolkit). Just source my Python environment on gimel2 (or any other gimelX machine that has Python 3) and you'll have access to all of it:

source /nfs/home/omailhot/pyenv_source.sh

Then you can use the get_all_analogs_smiles_set() function from the toolkit, which returns the SMILES of all small-change analogs of a single input SMILES string:

from bksltk.analogs import get_all_analogs_smiles_set

all_analogs_smiles = get_all_analogs_smiles_set('YOUR_SMILES')

If you have multiple SMILES, just call the function once for each one :-)
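For several parents, the per-SMILES calls can be wrapped in a small loop. This is only a sketch: the helper names collect_analogs and merge_analogs are hypothetical (they are not part of bksltk), and it assumes the analog function returns an iterable of SMILES strings, as the name get_all_analogs_smiles_set suggests. The analog function is passed in as an argument so the helpers themselves do not need the toolkit.

```python
from typing import Callable, Dict, Iterable, Set


def collect_analogs(parent_smiles: Iterable[str],
                    get_analogs: Callable[[str], Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each parent SMILES to the set of analog SMILES returned for it."""
    analogs_by_parent = {}
    for smiles in parent_smiles:
        smiles = smiles.strip()
        if not smiles:
            continue  # skip blank entries
        analogs_by_parent[smiles] = set(get_analogs(smiles))
    return analogs_by_parent


def merge_analogs(analogs_by_parent: Dict[str, Set[str]]) -> Set[str]:
    """Union of all analog sets, so an analog shared by two parents appears once."""
    merged = set()
    for analogs in analogs_by_parent.values():
        merged |= analogs
    return merged
```

With the environment sourced, you would call collect_analogs(['SMILES_1', 'SMILES_2'], get_all_analogs_smiles_set) and, if you want one pooled set, pass the result to merge_analogs.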

These are the small changes considered: H->CH3, H->OH, H->F, H->Cl, H->Br, aryl carbon to nitrogen, aromatic nitrogen to carbon.

To generate a single .csv file of all small change analogs (that you can send directly to Enamine for a quote) from a .smi file, do this:

from bksltk.analogs import get_csv_analogs

get_csv_analogs('your_input.smi', 'your_output.csv')

Combinatorial analogs

2024-12-03T21:02:53Z

Omailhot:

Here's an example of how to generate combinatorial analogs from the same parent. Make sure to source my environment first:

source /nfs/home/omailhot/pyenv_source.sh

Then, you will want to create a .png file of your parent with standard numbering on it. We'll use etomidate in this example:

from bksltk.analogs import write_numbered_parent_png, make_analogs_combinations

eto_smiles = 'CCOC(=O)C1=CN=CN1[C@H](C)C2=CC=CC=C2'
write_numbered_parent_png(eto_smiles, 'test_parent.png')

Look at the .png and figure out where your modifications lie. Then create a "modification dictionary" that will be used for the combinatorial generation of analogs. In this example, carbon 0 can get a hydroxyl, fluorine or methyl attached, carbons 6 and 14 can get methylated, carbon 15 can be replaced with an aromatic nitrogen or get hydroxylated, and carbon 16 can get hydroxylated. The n_combinations_list specifies how many modifications are combined in the analogs that are kept: here, analogs combining 2, 3 or 4 modifications will be enumerated. The output_filename is the base name used to write both a .png and a .csv file with the enumerated analogs.

modifications_dict = {0: ['O', 'F', 'C'],
                      6: ['C'],
                      14: ['C'],
                      15: ['N', 'O'],
                      16: ['O']}
eto_smiles = 'CCOC(=O)C1=CN=CN1[C@H](C)C2=CC=CC=C2'
n_combinations_list = [2, 3, 4]
output_filename = 'eto_combined_analogs'
make_analogs_combinations(modifications_dict, eto_smiles, n_combinations_list, output_filename)
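To get a feel for how many analogs a given modification dictionary implies, here is a rough counting sketch using only itertools. It is not how make_analogs_combinations works internally: it assumes each atom position receives at most one modification per analog, and it ignores any deduplication or filtering the toolkit may do, so the real number of analogs may differ. The function name enumerate_modification_sets is hypothetical.

```python
from itertools import combinations, product

modifications_dict = {0: ['O', 'F', 'C'],
                      6: ['C'],
                      14: ['C'],
                      15: ['N', 'O'],
                      16: ['O']}


def enumerate_modification_sets(modifications, n_combined):
    """Yield tuples of (atom_index, modification), one modification per chosen atom."""
    # Pick which atom positions are modified...
    for positions in combinations(sorted(modifications), n_combined):
        choices = [modifications[pos] for pos in positions]
        # ...then pick one allowed modification at each chosen position.
        for picked in product(*choices):
            yield tuple(zip(positions, picked))


counts = {n: sum(1 for _ in enumerate_modification_sets(modifications_dict, n))
          for n in [2, 3, 4]}
print(counts)  # {2: 24, 3: 34, 4: 23}
```

Under these assumptions the etomidate example gives 81 modification sets in total, which is a useful sanity check before sending anything for a quote.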

Olivier's way of computing novelty

2023-09-22T20:10:46Z

Omailhot:

This is my lazy way of doing novelty calculations.

'''Step 0: source my environment'''

Everything I'm developing is part of a single Python package called bksltk (BKS-lab Toolkit). Just source my Python environment on gimel2 (or any other gimelX machine that has Python 3) and you'll have access to all of it:

source /nfs/home/omailhot/pyenv_source.sh

Eventually the package will be better documented (and probably integrated into pydock).

'''Step 1: get all knowns from ChEMBL'''

Here we don't really care about properties, drug-likeness, etc. Just go to your target on ChEMBL, grab the version with the highest number of compounds, and click on the number of associated compounds to view the list (no activity data). Then download it as a CSV. Unzip the file and rename it to whatever you want; we will call it chembl.csv here. To generate a .smi file from it, simply use:

from bksltk.utils import write_smiles_from_chembl

write_smiles_from_chembl('chembl.csv', 'chembl.smi')

'''Step 2: compute novelty'''

You will also need your list of molecules as a ".smi" file, which is just a text file without any header, containing whitespace-separated columns: the first column is the SMILES string of each compound and the second column should be an identifier. Look at the 'chembl.smi' file if you are confused. Then, to get a dataframe of the maximum similarities to any of your knowns, simply do:

from bksltk.utils import get_novelty

get_novelty('chembl.smi', 'your_mols.smi', 'novelty.txt')

This will write the maximum Tc (Tanimoto coefficient) of each molecule to the 'novelty.txt' file. Alternatively, you can give a Tc threshold, and only molecules less similar than the threshold will be written out:

get_novelty('chembl.smi', 'your_mols.smi', 'novelty.txt', threshold=0.35)
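To make "maximum Tc to any known" concrete, here is a toy sketch with fingerprints represented as Python sets of on-bit indices. It is not the get_novelty implementation (which works from .smi files and fingerprint files); the function names and the strict "less than" comparison against the threshold are assumptions for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def max_tc_to_knowns(query_fps, known_fps):
    """For each query ID, the highest Tc to any known fingerprint."""
    return {mol_id: max((tanimoto(fp, known) for known in known_fps), default=0.0)
            for mol_id, fp in query_fps.items()}


def novel_only(max_tcs, threshold):
    """Keep only molecules whose maximum Tc is below the threshold."""
    return {mol_id: tc for mol_id, tc in max_tcs.items() if tc < threshold}


# Toy data: two knowns and two candidate molecules.
knowns = [{1, 2, 3, 4}, {2, 3, 5}]
queries = {'mol_A': {1, 2, 3, 4, 6}, 'mol_B': {2, 7, 8, 9}}

max_tcs = max_tc_to_knowns(queries, knowns)
print(novel_only(max_tcs, threshold=0.35))
```

Here mol_A shares four of five bits with the first known (Tc = 0.8) and is dropped, while mol_B never exceeds a Tc of about 0.17 and is kept as novel.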

'''Note:'''

There are intermediary files created with the '.fp' file extension. If RDKit fails to read some of your input SMILES, delete all .fp files before rerunning.
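If you would rather not delete the files by hand, a small helper like the following can do it. This is a sketch under the assumption that the .fp files sit directly in the directory you ran from; remove_fp_files is a hypothetical name, not a bksltk function, and it deletes every file ending in .fp in that directory, so point it at the right place.

```python
from pathlib import Path


def remove_fp_files(directory='.'):
    """Delete every *.fp file directly inside `directory` and return the removed names."""
    removed = []
    for fp_path in sorted(Path(directory).glob('*.fp')):
        if fp_path.is_file():
            fp_path.unlink()
            removed.append(fp_path.name)
    return removed
```

Calling remove_fp_files() with no argument cleans the current working directory; the returned list lets you check what was actually removed.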

'''Step 3: ultimate laziness'''

Combine these in a single script:

from bksltk.utils import write_smiles_from_chembl, get_novelty

your_threshold = 0.35  # same example threshold as in Step 2
write_smiles_from_chembl('chembl.csv', 'chembl.smi')
get_novelty('chembl.smi', 'your_mols.smi', 'novelty.txt', threshold=your_threshold)
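Before running the script on your own molecules, it can help to check that the .smi file really follows the layout described in Step 2 (no header, whitespace-separated, SMILES first, identifier second). The reader below is a minimal sketch of such a check; read_smi is a hypothetical helper, not part of bksltk, and it does not validate the SMILES themselves.

```python
def read_smi(path):
    """Return (smiles, identifier) pairs from a header-less, whitespace-separated .smi file."""
    records = []
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            if len(fields) < 2:
                raise ValueError(
                    f'{path}, line {line_number}: expected a SMILES and an identifier')
            # Any extra columns after the identifier are ignored.
            records.append((fields[0], fields[1]))
    return records
```

A line with a missing identifier raises a ValueError that names the offending line, which is usually quicker to fix than a failure further down the pipeline.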

Interactive ligands visualizer

2023-01-27T23:54:14Z

Omailhot:

I (Olivier) put together this interactive visualizer to make sure that I don't miss any chemotypes when coming up with actives at the start of a retrospective campaign. Starting from a downloaded ChEMBL CSV file for a list of ligands, an image of each molecule is generated with RDKit, along with a text file of filtered SMILES. You then need to compute the ECFP fingerprints on Gimel from that file (see below), after which a generated script will show an interactive visualization of the chemical space spanned by the ligands (tSNE), with each molecule shown on mouse hover.

[[File:Chemspace_vis_example.gif]]

'''Step 1: install chemspace_vis package'''

The visualizer will need to be installed on your local machine, not the cluster.

Make sure you are using Python 3, and then simply:

pip install chemspace_vis

N.B. This only works on Mac and Linux, sorry Windows users (if you exist).

For those interested, the source code can be found on my GitHub: https://github.com/gregorpatof/chemspace_vis_package

'''Step 2: obtain ChEMBL CSV file (or use provided example)'''

Any ChEMBL CSV for a given activity of a given target will do. If you want to run the visualizer directly from SMILES and fingerprints, see the bottom of the page.

You can also clone the example repository, which contains the CSV for mu-opioid ligands with measured Emax and an example script:

git clone https://github.com/gregorpatof/chemspace_vis_example

Just to make things extra clear, here is how I obtained that CSV:

<gallery>
chembl_mor1.png|Mu-opioid receptor on ChEMBL
chembl_mor2.png|1100 ligands with measured Emax
chembl_mor3.png|Generating the CSV
</gallery>

'''Step 3: extract SMILES and activity for given HAC and MW filters'''

This is accomplished by the preprocess_part1() function in the example script, which runs a single command:

from chemspace_vis.preprocess import preprocess_chembl

chembl_csv = "mor_chembl_emax.csv"

activity_name = "Emax" # The text name of the activity (in this case, Emax)
preprocess_chembl(chembl_csv, activity_name, max_hac=35, max_mw=600, img_folder="mol_images")

As you can see, you can specify the maximum number of heavy atoms (max_hac) and maximum molecular weight (max_mw) for the ligands to keep.

This will generate two files: a .smi file with the SMILES of all the kept ligands, and a .df file which keeps the activity value (Emax here) in dataframe format.

It also generates 2D images of all your molecules in the mol_images folder, each labeled with its ChEMBL ID (or whatever identifier is in the .smi file) and its activity.

'''Step 4: compute the fingerprints on Gimel'''

Copy the .smi file to gimel, source the DOCK3.7 base, and then run this command (on gimel itself, not gimel2 or the others):

python ~jklyu/zzz.github/ChemInfTools/utils/teb_chemaxon_cheminf_tools/generate_chemaxon_fingerprints.py mor_chembl_emax.smi mor_chembl_emax

This will generate a .fp file, in the present case mor_chembl_emax.fp.

I personally like to use bash, so the way I source the DOCK3.7 base is:

export DOCKBASE=/nfs/soft/dock/versions/dock37/DOCK-3.7-trunk
source /nfs/soft/dock/versions/dock37/DOCK-3.7-trunk/env.sh

'''Step 5: tSNE and interactive visualization'''

Almost done! Copy the .fp file back to your machine, then run part 2 of the example script:

from chemspace_vis.preprocess import make_tsne_from_fingerprints
from chemspace_vis.visualizer import make_visualizer_script

fingerprints_file = "mor_chembl_emax.fp"
make_tsne_from_fingerprints(fingerprints_file)
make_visualizer_script("tsne_data.df", "mol_images", activity_filename="mor_chembl_emax_activity.df", use_log10=False)

The first command computes the tSNE from the fingerprints. It prints the percentage of the variance covered by the PCA that is applied first (anything over 90-95% is good).

If you get an error that looks like this:

ValueError: n_components=180 must be between 0 and min(n_samples, n_features)=91 with svd_solver='full'

It means that you have fewer than 180 molecules in your list. Simply change the number of requested components to a number below your number of molecules:

make_tsne_from_fingerprints(fingerprints_file, n_pca_components=SOME_NUMBER_LESS_THAN_N)
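If you don't want to work out that number by hand, a sketch like the one below can pick it from the size of your ligand list. Both helper names are hypothetical (they are not part of chemspace_vis), the default of 180 is only inferred from the error message above, and counting lines in the .smi file assumes every molecule in it ended up in the .fp file.

```python
def count_molecules(smi_path):
    """Count non-blank lines in a header-less .smi file (one molecule per line)."""
    with open(smi_path) as handle:
        return sum(1 for line in handle if line.strip())


def pick_n_pca_components(n_molecules, default=180):
    """Return the default, or a value just below the molecule count for small sets."""
    if n_molecules < 2:
        raise ValueError('need at least 2 molecules to run PCA')
    return min(default, n_molecules - 1)
```

For the 91-molecule case in the error above, pick_n_pca_components(91) gives 90, which you would then pass as n_pca_components to make_tsne_from_fingerprints.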

Then, the visualizer script will be generated. If you supply an activity filename, you will get coloring based on that property (here, Emax). The use_log10 flag can be useful if you have extreme values driving the coloring.
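To see why a log10 transform can help, here is a toy comparison of how activity values spread over a 0-to-1 color range with and without it. This is only an illustration: I am assuming a simple min-max scaling, which may not be how chemspace_vis maps values to colors, and the log10 version requires strictly positive values.

```python
import math


def min_max_scale(values):
    """Scale values linearly to the [0, 1] range used to pick a color."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# One extreme value (10000) among otherwise modest activities.
activities = [1.0, 10.0, 100.0, 10000.0]

linear = min_max_scale(activities)
logged = min_max_scale([math.log10(value) for value in activities])

print([round(x, 3) for x in linear])
print([round(x, 3) for x in logged])
```

On the linear scale the first three molecules are squeezed into the bottom 1% of the color range, while after log10 they are spread out evenly, which is the situation where use_log10=True is worth trying.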

'''Step 6: run the visualizer'''

Simply run the generated visualizer script:

python visualizer_script.py

You can then zoom in on regions where ligands are close together, and go back to the general view with the back arrow:

[[File:mor_zoom_example.gif]]

'''Optional: run directly from SMILES and fingerprints'''

For this, you will bypass the ChEMBL processing step. Simply generate the images:

from chemspace_vis.preprocess import generate_images

generate_images("your_smiles.smi", "mol_images", activity_fn=None)

And then run the second step, without an activity file:

from chemspace_vis.preprocess import make_tsne_from_fingerprints
from chemspace_vis.visualizer import make_visualizer_script

fingerprints_file = "mor_chembl_emax.fp"
make_tsne_from_fingerprints(fingerprints_file)
make_visualizer_script("tsne_data.df", "mol_images", use_log10=False)

File:Mor zoom example.gif

2023-01-20T23:03:28Z

Omailhot:

File:Chembl mor3.png

2023-01-20T22:19:26Z

Omailhot:

File:Chembl mor2.png

2023-01-20T22:19:18Z

Omailhot:

File:Chembl mor1.png

2023-01-20T22:19:08Z

Omailhot:

File:Chemspace vis example.gif

2023-01-20T21:53:50Z

Omailhot:

File:Chemspace vis example1.png

2023-01-20T00:40:08Z

Omailhot:
