Cassidy clustering

From DISI

These are notes to myself from Teague/Cassidy for the new cluster.

I've written some code to generate Daylight-like (Base-64) fingerprints from the output of Chemaxon's generatemd. The program is the wonderfully named "binstr2base64.py", although it can do the opposite conversion as well. It currently lives in /raid1/people/teague/Code/cluster/binstr2base64.py. It seems to work fine; however, there's a small bug when reading from standard input after a SIGKILL interrupt. This isn't really a big deal, as it's pretty hard to recreate under normal circumstances.

I haven't verified that the Base-64 encoding is compatible with the mappings in fast_tanimoto.c, but the decode and encode methods are consistent with each other, i.e. running generatemd's output through binstr2base64 twice (once in each direction) gives back the same fingerprint (sans delimiters).
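
The round-trip check described above can be sketched as follows. This is not the code in binstr2base64.py; it is a minimal illustration that assumes the standard Base64 alphabet from Python's base64 module (which may differ from the mapping fast_tanimoto.c expects) and assumes any character other than '0' or '1' in the bit string is a delimiter. The function names are hypothetical.

```python
import base64


def binstr_to_base64(bits: str) -> str:
    """Pack a string of '0'/'1' characters into bytes and Base64-encode them."""
    # Assumption: anything that is not '0' or '1' is a delimiter and is dropped.
    clean = "".join(ch for ch in bits if ch in "01")
    # Pad on the right with zeros up to a whole number of bytes.
    clean += "0" * (-len(clean) % 8)
    raw = bytes(int(clean[i:i + 8], 2) for i in range(0, len(clean), 8))
    return base64.b64encode(raw).decode("ascii")


def base64_to_binstr(text: str) -> str:
    """Decode Base64 text back into a delimiter-free string of '0'/'1'."""
    raw = base64.b64decode(text)
    return "".join(format(byte, "08b") for byte in raw)


if __name__ == "__main__":
    original = "01000001|01000010"          # two bytes with a made-up delimiter
    encoded = binstr_to_base64(original)
    decoded = base64_to_binstr(encoded)
    print(encoded)                          # QUI=
    print(decoded == original.replace("|", ""))
```

Note that the round trip only proves the two directions agree with each other. It says nothing about whether the alphabet matches what another program expects.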

How To (using eMolecules as a demo with 1024-bit ECFP4-like fingerprints):

cd /raid1/people/teague/Code/cluster/test
gzip -dc /raid9/db/zinc-may08/byvendor/emol/emol_p0.smi.gz > emol_p0.smi
/raid3/software/jchem/current/bin/generatemd c emol_p0.smi -k CF -f 1024 -2 | /raid1/people/teague/Code/cluster/binstr2base64.py | paste -d' ' emol_p0.smi - > emol_p0.smi.fp
rm emol_p0.smi

The resulting emol_p0.smi.fp will be a BIG file (and I've already generated it), so if you want a smaller test just to verify I/O and decoding, you can use test.smi, which contains the first 10 or so compounds. You'll have to generate the fingerprints for it yourself, however.
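
A quick I/O and decoding check on a small .smi.fp file might look like the sketch below. It assumes each line is the original SMILES record followed by a single space and the fingerprint (as produced by paste -d' '), and that the fingerprint is standard Base64 for 1024 bits. If binstr2base64.py uses a different alphabet, the decoding step would need to change. parse_fp_line is a hypothetical helper, not part of the existing code.

```python
import base64


def parse_fp_line(line: str, nbits: int = 1024):
    """Split one 'record fingerprint' line and decode the fingerprint.

    Returns the SMILES record and the number of bits set in the fingerprint.
    """
    # The fingerprint is the last space-separated field on the line.
    record, fp_text = line.rstrip("\n").rsplit(" ", 1)
    raw = base64.b64decode(fp_text)  # assumption: standard Base64 alphabet
    if len(raw) * 8 != nbits:
        raise ValueError(f"expected {nbits} bits, got {len(raw) * 8}")
    on_bits = sum(bin(byte).count("1") for byte in raw)
    return record, on_bits


if __name__ == "__main__":
    # Made-up example line: 8 bits set out of 1024.
    fake_fp = base64.b64encode(bytes([0xFF] + [0] * 127)).decode("ascii")
    print(parse_fp_line("CCO ethanol " + fake_fp))
```

Running this over every line of the small test file and checking that nothing raises is usually enough to confirm the pipeline wrote what you expected.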

A note for Cassidy: if you run into Python errors, run:

source /raid3/software/python/bin/python-env.csh (or python-env.sh if you're using bash)

I'll verify that the binary-to-Base-64 mappings are correct, but I'd recommend checking them yourself as well.