Filter xREAL by Heavy Atom Count

Preface

I wrote this quick tool to grab molecules below HAC 19 (heavy atom count) from the xREAL database on Wynton. Since the database holds >1 trillion compounds, I wanted to avoid the RDKit molecule build process entirely if at all possible. Instead, the code strips every non-alphabetical character, every lowercase letter, and the letter 'H' from a SMILES string. This leaves a list of "atoms", which can be counted to give the HAC for a compound. If the HAC falls within the specified bounds, the entire smi line is written out.
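The counting trick described above can be sketched in a few lines of awk. This is a minimal illustration, not the contents of HAC_filter_xREAL.sh: the function name filter_by_hac is made up here, the bounds are assumed to be inclusive, and the SMILES is assumed to be the first whitespace-separated field of each line. One caveat follows directly from the method: dropping lowercase letters also drops aromatic atoms written as c, n, o, or s, so the sketch only gives correct counts for kekulized SMILES.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original script): reads smi lines on stdin
# and prints the ones whose heavy atom count is within [min, max] (inclusive).
# Usage: filter_by_hac MIN MAX < input.smi > output.smi
filter_by_hac() {
    LC_ALL=C awk -v min="$1" -v max="$2" '
    {
        s = $1                      # assume the SMILES is the first field
        # Keep only uppercase letters other than H. This removes digits,
        # bonds, brackets, charges, and the lowercase second letter of
        # symbols such as Cl or Br, leaving one letter per heavy atom.
        gsub(/[^A-GI-Z]/, "", s)
        n = length(s)               # heavy atom count
        if (n >= min && n <= max) print   # write the whole line unchanged
    }'
}
```

For example, `[NH3+]CC(=O)[O-]` reduces to `NCCOO` (HAC 5) and `ClCCBr` reduces to `CCCB` (HAC 4), with no molecule object ever built.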

I'm assuming that the location of xREAL remains the same on disk, and that we never ever add any more compounds to xREAL. These are solid assumptions, and absolutely never ever going to be incorrect.

Usage

1. Copy the script to your working directory, and make a log dir:

cp /wynton/group/bks/work/zdingman/GAT1/zwitterion_scripts/HAC_filter_xREAL.sh .
mkdir logs

2. Modify the variables defined in the script header: HACMIN, HACMAX, outpath. (When xREAL inevitably changes locations, modify inpath as well.)

3. Submit the job to run:

qsub HAC_filter_xREAL.sh
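For step 2, it can be worth sanity-checking the bounds before submitting, since a typo only shows up after the jobs have run. The snippet below is a hypothetical pre-submit check, not part of HAC_filter_xREAL.sh: only the names HACMIN and HACMAX come from the script, the values are placeholders, and check_hac_bounds is a made-up helper. Whether the script treats the bounds as inclusive is also an assumption.

```shell
#!/usr/bin/env bash
# Placeholder values; copy the real ones from the script header.
HACMIN=0
HACMAX=18   # assumed inclusive, i.e. "below HAC 19"

# Hypothetical helper: succeeds only if both arguments are non-negative
# integers and the first is not larger than the second.
check_hac_bounds() {
    case "$1$2" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-digit input
    esac
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" -le "$2" ]
}

if check_hac_bounds "$HACMIN" "$HACMAX"; then
    echo "bounds OK: HACMIN=$HACMIN HACMAX=$HACMAX"
else
    echo "bad bounds: HACMIN=$HACMIN HACMAX=$HACMAX" >&2
fi
```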

Run time depends on the Wynton queue. Anecdotally, everything seems to run to completion overnight with no issues. Each job seems to take roughly 40 minutes (aka frustratingly too long for the short CPU queue).