Extended Search of Analogs via Bioisosteric Replacements

From DISI

Rationale

Our standard pipeline for finding analogs (at least as I know it) consists of entering the SMILES of a molecule of interest into every flavor of SmallWorld and Arthor available in the lab. This procedure has two drawbacks:

  1. Limited diversity. SmallWorld and Arthor measure the distance between an analog and the parent compound as graph edit distance.[1] This metric, while useful and robust, differs somewhat from a chemist's idea of similarity. For example, the graph edit distance between benzene and cyclohexane is 6. That is quite far, and we normally do not consider such distant analogs. Yet as parts of a lead-like molecule, these two rings can sometimes replace each other without loss of the compound's biological activity.
  2. Time investment. Searching the databases manually takes quite some time, especially if you need to find analogs for many compounds.

I wanted to create an automated procedure for analog searching. The SmallWorld API is well suited to that, although it is sometimes unstable. To overcome the issue of limited diversity, I decided to use the bioisosteric replacement program currently being developed by Maksim Tsukanov.

How it works

The pipeline for the extended analog search works in two steps:

  1. Create bioisosteres of the original molecule (method created by Maksim Tsukanov, currently under development)
  2. Search for their closest analogs in SmallWorld (distance up to 2)

How to use

DISCLAIMER: The Bioisostere pipeline is under development, so it is not guaranteed to yield results. The SmallWorld API is sometimes unstable. Every unsuccessful request is retried 4 times, but it may still return no results in certain cases. An exhaustive search for analogs is therefore not guaranteed.
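If you wrap your own requests around the API, the retry behavior described above can be approximated with a small helper. This is only a sketch and not the code used in the actual scripts: the function name `retry`, the `RETRY_DELAY` variable, and the reading of "retried 4 times" as one initial attempt plus up to four more are all assumptions.

```shell
# retry MAX_RETRIES CMD [ARGS...]
# Runs CMD once, then up to MAX_RETRIES more times while it keeps failing.
# RETRY_DELAY (seconds between attempts, default 1) is a hypothetical knob.
retry() {
    max_retries=$1
    shift
    attempt=0
    while ! "$@"; do
        if [ "$attempt" -ge "$max_retries" ]; then
            echo "giving up after $((attempt + 1)) attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-1}"
    done
}
```

For example, `retry 4 sh my-request.sh` (with `my-request.sh` standing in for whatever command issues a single request) returns success as soon as one attempt succeeds and a non-zero status once all attempts have failed.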

The scripts are currently available on Gimel only. Once the bioisostere program is published, it will be possible to run the whole pipeline on any Linux/macOS machine.

All scripts are deposited in ~ak87/PROGRAM/ANALOGS

To look for analogs, do the following:

  1. Log on to Gimel
  2. ssh to any of the newer machines (Gimel5, epyc, n-1-XX...)
  3. Prepare a file with the SMILES and the name of each compound, separated by a tab. So far, I've tested the pipeline with one compound at a time. In theory, you should be able to enter as many compounds as you like, but their analogs will be mixed together in the output.
  4. sh ~ak87/PROGRAM/ANALOGS/analog-search.sh <input.smi>
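Preparing the input for step 3 can be sketched as follows. The compounds and the file and directory names (`compounds.smi`, `inputs/`) are placeholders chosen for illustration. Because analogs from several compounds get mixed together, the sketch also splits a multi-compound file into one input file per compound, so that each can be submitted on its own.

```shell
# Placeholder compounds; replace with your own SMILES and names.
# printf reuses the format string for each SMILES/name pair.
printf '%s\t%s\n' \
    'CC(=O)Oc1ccccc1C(=O)O' 'aspirin' \
    'CN1C=NC2=C1C(=O)N(C)C(=O)N2C' 'caffeine' > compounds.smi

# Check that every line has exactly two tab-separated fields.
awk -F'\t' 'NF != 2 { printf "line %d: expected 2 fields\n", NR; bad = 1 }
            END { exit bad }' compounds.smi

# Write one input file per compound, named after the compound.
# Assumes names contain no slashes or other characters unsafe in file names.
mkdir -p inputs
while IFS="$(printf '\t')" read -r smiles name; do
    printf '%s\t%s\n' "$smiles" "$name" > "inputs/$name.smi"
done < compounds.smi
```

Each file in `inputs/` can then be passed to analog-search.sh in place of `<input.smi>`, one run at a time.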

The run should take about 10-20 minutes, depending on the size of your molecule and its "popularity" in the commercial databases. I deliberately did not parallelize the database requests, to avoid overloading the API.

The list of analogs will be stored in final_analogs.smi. The format is SMILES ID Distance
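To skim the closest analogs first, the output can be filtered and sorted by its third column. This is a minimal sketch that assumes the columns are separated by whitespace (tabs or spaces) and that the file has no header line; the function name `closest_analogs` is made up for this example.

```shell
# closest_analogs FILE [MAX_DIST]
# Prints rows whose third column (distance) is <= MAX_DIST (default 2),
# sorted so that the nearest analogs come first.
closest_analogs() {
    file=$1
    max_dist=${2:-2}
    awk -v max="$max_dist" 'NF >= 3 && $3 + 0 <= max' "$file" | sort -k3,3n
}
```

Calling `closest_analogs final_analogs.smi 1` would then list only the analogs at distance 0 or 1.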

You can also run any of the stages separately. To perform a bulk SmallWorld search, run: sh ~ak87/PROGRAM/ANALOGS/bulk-analogs-bioisostere-sw1.sh <input.smi>. Currently, it only searches for analogs at a distance of up to 2, but you can copy the script and modify it as you like.

  1. I am probably using slightly wrong terminology here, so be it. You can learn more about it from NextMove's many marvelous presentations, like this one: https://www.nextmovesoftware.com/talks/Sayle_SmallWorld_Oxford_202003.pdf