These are notes to myself from Teague/Cassidy for the new cluster.
I've written some code to generate Daylight-like (Base-64) fingerprints from the output of Chemaxon's generatemd. The program is wonderfully named "binstr2base64.py", although it can do the reverse conversion as well. It currently lives in /raid1/people/teague/Code/cluster/binstr2base64.py. It seems to work fine; however, there is a small bug when reading from standard input after an interrupt signal (probably SIGINT rather than SIGKILL, since SIGKILL cannot be caught). This is not really a big deal, as it's pretty hard to reproduce under normal circumstances.
I haven't verified that the Base-64 encoding is compatible with the mappings in fast_tanimoto.c, but the decode and encode methods are consistent with each other, i.e. running generatemd's output through binstr2base64 twice gives back the same fingerprint (sans delimiters).
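For reference, the round-trip check described above can be sketched roughly as follows. This is not the code in binstr2base64.py. It assumes generatemd emits each fingerprint as a string of '0' and '1' characters with some delimiter characters mixed in, and it uses the standard Base-64 alphabet from Python's base64 module, which may well differ from the Daylight-style mapping that fast_tanimoto.c expects. The function names are made up for illustration.

```python
import base64


def binstr_to_base64(binstr: str) -> str:
    """Pack a '0'/'1' string into bytes and Base-64 encode it.

    Assumption: anything that is not '0' or '1' is a delimiter and is dropped.
    Assumption: standard Base-64 alphabet, not necessarily Daylight's mapping.
    """
    bits = "".join(ch for ch in binstr if ch in "01")
    if not bits or len(bits) % 8 != 0:
        raise ValueError("expected a non-empty bit string whose length is a multiple of 8")
    raw = int(bits, 2).to_bytes(len(bits) // 8, "big")
    return base64.b64encode(raw).decode("ascii")


def base64_to_binstr(encoded: str) -> str:
    """Decode Base-64 text back into a '0'/'1' string without delimiters."""
    raw = base64.b64decode(encoded, validate=True)
    return "".join(f"{byte:08b}" for byte in raw)


def round_trips(binstr: str) -> bool:
    """True if binary -> Base-64 -> binary gives back the same bits (sans delimiters)."""
    bits = "".join(ch for ch in binstr if ch in "01")
    return base64_to_binstr(binstr_to_base64(binstr)) == bits
```

With this packing, a 1024-bit fingerprint is 128 bytes, which comes out as 172 Base-64 characters including padding.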
How To (using eMolecules as a demo with 1024-bit ECFP4-like fingerprints):
cd /raid1/people/teague/Code/cluster/test
gzip -dc /raid9/db/zinc-may08/byvendor/emol/emol_p0.smi.gz > emol_p0.smi
/raid3/software/jchem/current/bin/generatemd c emol_p0.smi -k CF -f 1024 -2 \
    | /raid1/people/teague/Code/cluster/binstr2base64.py \
    | paste -d' ' emol_p0.smi - > emol_p0.smi.fp
rm emol_p0.smi
This will be a BIG file (and I've already generated it), so if you want a smaller test simply to verify I/O and decoding you can use test.smi, which contains the first 10 or so compounds. You'll have to generate the fingerprints for it yourself, however.
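A quick way to sanity-check the I/O and decoding on a small file is something like the sketch below. It assumes the fingerprint is the last whitespace-separated field on each line (as produced by the paste step), that it is standard Base-64, and that it should decode to 1024 bits. If binstr2base64.py uses a different alphabet, the decoding step would need to be swapped out. The function names are hypothetical.

```python
import base64

EXPECTED_BITS = 1024  # assumption: matches the -f 1024 option passed to generatemd


def check_fp_line(line: str, expected_bits: int = EXPECTED_BITS) -> bool:
    """Return True if the last field of the line decodes to expected_bits bits."""
    fields = line.split()
    if len(fields) < 2:
        # Need at least a SMILES string plus a fingerprint.
        return False
    try:
        # Assumption: standard Base-64 alphabet with padding.
        raw = base64.b64decode(fields[-1], validate=True)
    except ValueError:
        # binascii.Error is a subclass of ValueError.
        return False
    return len(raw) * 8 == expected_bits


def count_bad_lines(lines, expected_bits: int = EXPECTED_BITS) -> int:
    """Count non-blank lines whose fingerprint field fails the check."""
    return sum(
        1
        for line in lines
        if line.strip() and not check_fp_line(line, expected_bits)
    )
```

For example, opening a (hypothetically named) test.smi.fp and passing the file object to count_bad_lines should give 0 if every line carries a well-formed 1024-bit fingerprint.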
A note for Cassidy: if you get Python errors, run:
source /raid3/software/python/bin/python-env.csh (or python-env.sh if you're using bash)
I'll verify that the binary-to-Base-64 mappings are correct, but I'd recommend checking them yourself as well.
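One possible way to cross-check the mappings is to compute Tanimoto coefficients independently in Python and compare them with what fast_tanimoto.c reports for the same pairs. The sketch below is only a reference calculation under the assumption that the fingerprints use the standard Base-64 alphabet. It knows nothing about how fast_tanimoto.c actually decodes its input, and the function names are invented for illustration.

```python
import base64


def popcount(value: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(value).count("1")


def tanimoto_b64(fp_a: str, fp_b: str) -> float:
    """Tanimoto coefficient of two Base-64 encoded bit fingerprints.

    Assumption: standard Base-64 alphabet, not necessarily Daylight's mapping.
    """
    a = int.from_bytes(base64.b64decode(fp_a, validate=True), "big")
    b = int.from_bytes(base64.b64decode(fp_b, validate=True), "big")
    union = popcount(a | b)
    if union == 0:
        # Convention chosen here for two all-zero fingerprints (an assumption).
        return 0.0
    return popcount(a & b) / union
```

If the values disagree with fast_tanimoto.c on a handful of pairs, the character-to-bits mapping is the first thing to suspect.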