Running DOCK6 with ChemGrid Score and db2 files
Installation
Obtain DOCK 6 from the official GitHub repository. See also the DOCK 6 manual: Installation.
# on epyc-A40
cd dock6/install
./configure gnu
make all
Large-scale docking (LSD)
The most recent copy on Wynton is in /wynton/group/bks/work/ak87/UCSF/SCRIPTS/DOCK6/2025-08-Trents_fix/dock6/.
Converting DOCK 3 grids to DOCK 6
See the Scripts/Convert Grids section below. This has been tested on Gimel and should also work on Wynton. The preferred workflow starts from both dockfiles and working directories produced by DOCK 3 blastermaster. When that is not available, the script requires the DOCK 3 dockfiles only. DOCK 3 and DOCK 6 installations are needed.
sh convert_dock3_grids_dockfiles.sh <path_to_dock3_grids_dockfiles>
Note: Converted grids reproduce the same scores when rescoring DOCK 3 poses. For Mac1, however, top scores during large-scale docking were in the thousands, likely because some ligands extended beyond the grid box.
Running LSD
Running LSD with DOCK 6 is analogous to DOCK 3:
- Prepare grids in
dock6files. - Obtain SubDock from the SUBDOCK6 repository.
- Prepare lists of ligand
.tgzfiles in ansdidirectory. - Create
run_subdock6.shin the working directory and run it (see Scripts/Submit LSD below). The script targets Wynton but should work on clusters using SGE or SLURM. You can re-run it to retry unfinished jobs.
Processing LSD
Run the parsing script (see Scripts/Process LSD below). It writes a CSV of scores (similar to extract_all.sort.uniq.txt) and a second CSV with per-subdirectory runtimes.
python parse_dock_logs.py --root /path/to/search \
--ligand_csv /path/to/out/ligand_scores.csv \
--stat_csv /path/to/out/stat_summary.csv
To do: extract poses.
Rescoring DOCK 3 poses
See Scripts/Rescoring. Edit the script to specify:
- The
mol2.gzfile with poses to rescore. - A directory with
db2files (DOCK 6 reads solvation data from these). - DOCK 6 grids (
dock6files). - Parameter files from the DOCK 6 installation.
- The helper
lc_blazing_fast_separate_mol2_into_smaller_files.pyis used to split large MOL2 files for speed.
Scripts
Convert grids
#!/bin/bash
# convert_dock3_grids_dockfiles.sh
# Convert DOCK3 grids to DOCK6 dockfiles
# Author Trent Balius (converted to bash)
set -e
echo "sh convert_dock3_grids_dockfiles.sh <path_to_dock3_grids_dockfiles>"
workdir="$(pwd)"
cd "$workdir" || exit
# system="$(basename "$workdir")"
#filedir="/is2/projects/RAS-CompChem/static/work/DUDEZ/dudez_dockprep/${system}_dock6prep/blastermaster/dockfiles"
if [ -z "$1" ]; then
echo "Usage: sh convert_dock3_grids_dockfiles.sh <path_to_dock3_grids_dockfiles>"
exit 1
fi
filedir=$1
# Directory where the converted DOCK6 dockfiles will be stored
dock6files_dir="${workdir}/dock6files"
mkdir -p "$dock6files_dir"
cd "$dock6files_dir" || exit
cp "$filedir"/vdw.* .
#cp "$filedir/box" .
#cp "$filedir/qnifft.electrostatics.phi" .
cp "$filedir/trim.electrostatics.phi" .
cp "$filedir/matching_spheres.sph" .
cp "$filedir"/ligand.desolv.* .
cp "$filedir/INDOCK" .
cp "$filedir/rec.crg.pdb" rec.pdb
perl /nfs/soft/dock/versions/dock37/DOCK-3.7-trunk/proteins/makebox/makebox.smallokay.pl matching_spheres.sph rec.pdb box 10.0
# Creating standalone grids, so no links to the original files
# ln -sf vdw.bmp chem.bmp
# ln -sf vdw.vdw chem.vdw
# #ln -sf vdw.esp chem.esp
# ln -sf vdw.vdw chem.esp
# #ln -sf qnifft.electrostatics.phi rec+sph.phi
# #ln -sf trim.electrostatics.phi rec+sph.phi
# ln -sf trim.electrostatics.phi chem.phi
cp -v vdw.bmp chem.bmp
cp -v vdw.vdw chem.vdw
cp -v vdw.vdw chem.esp
cp -v trim.electrostatics.phi chem.phi
dsize=$(grep delphi_nsize INDOCK | awk '{print $2}')
if [ -z "$dsize" ]; then
echo "Error: delphi_nsize not found in INDOCK. Please check the INDOCK file."
exit 1
fi
cat << EOF > gconv.in
compute_grids yes
DOCK3_7_grids yes
QNIFFT_grid yes
grid_spacing 0.2
phi_grid_size $dsize
output_molecule no
contact_score no
contact_cutoff_distance 4.5
chemical_score yes
energy_score yes
energy_cutoff_distance 10
atom_model u
attractive_exponent 6
repulsive_exponent 12
distance_dielectric yes
dielectric_factor 4
bump_filter yes
bump_overlap 0.5
receptor_file rec.crg
box_file box
vdw_definition_file /nfs/home/ak87/PROGRAM/DOCK6-NEBIUS/dock6-gritukan_merged/parameters/vdw_AMBER_parm94.defn
chemical_definition_file /nfs/home/ak87/PROGRAM/DOCK6-NEBIUS/dock6-gritukan_merged/parameters/chemgrid/conv.defn
score_grid_prefix chem
EOF
/nfs/home/ak87/PROGRAM/DOCK6-NEBIUS/dock6-gritukan_merged/bin/grid-convert -i gconv.in > gconv.log
Submit LSD
#!/bin/bash
export EXPORT_DEST=$PWD/output
export DOCKFILES=$PWD/dock6files
export DOCKEXEC=/wynton/group/bks/work/ak87/UCSF/SCRIPTS/DOCK6/2025-08-Trents_fix/dock6/bin/dock6
export SHRTCACHE=/dev/shm
export LONGCACHE=/tmp
export SHRTCACHE_USE_ENV=
export USE_DB2_TGZ=true
export USE_DB2_TGZ_BATCH_SIZE=1
export USE_DB2=false
export USE_DB2_BATCH_SIZE=100
export USE_SLURM=False
export USE_SLURM_ARGS=--export=ALL
export USE_SGE=True
export USE_SGE_ARGS="-l s_rt=08:28:00 -l h_rt=08:30:00 -l mem_free=2G"
export USE_PARALLEL=false
export MAX_PARALLEL=-1
export QSUB_EXEC=qsub
export SBATCH_EXEC=sbatch
export PARALLEL_EXEC=parallel
export SUBMIT_WAIT_TIME=0
for i in sdi/* ; do
export k=$(basename $i .in)
echo k $k
export INPUT_SOURCE=$PWD/$i
export EXPORT_DEST=$PWD/output/$k
bash </path/to/subdock6.bash>
done
Process LSD
#!/usr/bin/env python3
"""
parse_dock_logs.py
Recursively parse DOCK 'dock.out*' files:
• Extract all LIGAND score components; keep the record with the lowest
Chemgrid_Score per molecule.
• Extract STAT summary for each file.
• Save two CSV files: ligand_scores and stat_summary.
Usage:
python parse_dock_logs.py \
--root /path/to/search \
--ligand_csv /path/to/out/ligand_scores.csv \
--stat_csv /path/to/out/stat_summary.csv
"""
import argparse
import re
from pathlib import Path
import pandas as pd
# ──────────────────────────────────────────────────────────────────────────────
# Regex patterns for parsing DOCK output files
LIGAND_START_RE = re.compile(r"^-+$") # Lines with only dashes
MOLECULE_RE = re.compile(r"^\s*Molecule:\s*(\S+)", re.IGNORECASE)
LIGAND_KEYVAL_RE = re.compile(
r"^\s*([A-Za-z0-9_]+):\s*([-+]?\d*\.?\d+)\s*$"
)
STAT_BLOCK_RE = re.compile(r"^(=+.*STAT|.*STAT.*)", re.IGNORECASE)
STAT_MOLS_RE = re.compile(r"^\s*(\d+)\s+Molecules Processed", re.IGNORECASE)
STAT_TIME_RE = re.compile(r"^Total elapsed time:\s+([\d.]+) seconds", re.IGNORECASE)
ELAPSED_LINE_RE = re.compile(r"Total elapsed time:", re.IGNORECASE)
# ──────────────────────────────────────────────────────────────────────────────
def file_finished(lines, tail=50) -> bool:
"""Return True if 'Total elapsed time' appears in the last `tail` lines."""
return any(ELAPSED_LINE_RE.search(l) for l in lines[-tail:])
# ──────────────────────────────────────────────────────────────────────────────
def parse_ligands(lines, subdir, filename):
"""Yield dictionaries with ligand scores."""
i = 0
record_count = 0
while i < len(lines):
# Look for lines with dashes (separator before ligand record)
if LIGAND_START_RE.match(lines[i]):
# Skip ahead to find Molecule: line
j = i + 1
while j < len(lines) and not MOLECULE_RE.search(lines[j]):
j += 1
if j >= len(lines):
i += 1
continue
# Found a molecule line, start parsing
molecule_match = MOLECULE_RE.search(lines[j])
if not molecule_match:
i += 1
continue
record = {
"subdir": subdir,
"file": filename,
"Molecule": molecule_match.group(1)
}
# Parse key-value pairs from the rest of the record
k = j + 1
while k < len(lines):
line = lines[k].strip()
# Stop if we hit another separator or definitive end markers
if (line.startswith("---") or
line.startswith("===") or
line.startswith("score >") or
line.startswith("Total elapsed time")):
break
# Try to match key-value pairs
kv_match = LIGAND_KEYVAL_RE.match(lines[k])
if kv_match:
key, val = kv_match.groups()
if val: # Only add if value is not empty
record[key] = float(val)
k += 1
# Only yield if we found meaningful data
if len(record) > 3: # Has molecule + metadata + at least one score
record_count += 1
yield record
i = k # Continue from where we left off
else:
i += 1
# ──────────────────────────────────────────────────────────────────────────────
def parse_stat(lines, subdir, filename):
"""Return dict with STAT info or None if block not found."""
for idx, line in enumerate(lines):
if STAT_BLOCK_RE.match(line):
mols = elapsed = None
for j in range(idx + 1, len(lines)):
if mols is None:
m = STAT_MOLS_RE.match(lines[j])
if m:
mols = int(m.group(1))
if elapsed is None:
t = STAT_TIME_RE.match(lines[j])
if t:
elapsed = float(t.group(1))
if lines[j].startswith("=") and j > idx + 1:
break # end of STAT block
return {
"subdir": subdir,
"file": filename,
"molecules_processed": mols,
"elapsed_seconds": elapsed,
}
return None
# ──────────────────────────────────────────────────────────────────────────────
def main():
ap = argparse.ArgumentParser(description="Parse DOCK log files.")
ap.add_argument("--root", type=Path, default=Path.cwd(),
help="Root directory to search (default: CWD)")
ap.add_argument("--ligand_csv", type=Path, required=True,
help="Output CSV for ligand scores")
ap.add_argument("--stat_csv", type=Path, required=True,
help="Output CSV for STAT summaries")
args = ap.parse_args()
ligand_records = []
stat_records = []
for path in args.root.rglob("dock.out*"):
print(f"Processing {path}")
if not path.is_file():
continue
lines = path.read_text(errors="ignore").splitlines()
if not file_finished(lines):
print(f" Skipping: file not finished")
continue # skip incomplete file
subdir = str(path.parent.relative_to(args.root))
filename = path.name
# LIGAND parsing
ligand_records_from_file = list(parse_ligands(lines, subdir, filename))
ligand_records.extend(ligand_records_from_file)
print(f" Found {len(ligand_records_from_file)} ligand records")
# STAT parsing
stat = parse_stat(lines, subdir, filename)
if stat:
stat_records.append(stat)
if not ligand_records:
print("No completed ligand data found.")
return
# Ligand DF & best scores
df_lig = pd.DataFrame(ligand_records)
# Check if Chemgrid_Score exists, if not use alternative column
score_column = "Chemgrid_Score"
if score_column not in df_lig.columns:
# Look for alternative score columns
possible_scores = [col for col in df_lig.columns if "score" in col.lower()]
if possible_scores:
score_column = possible_scores[0]
print(f"Using {score_column} instead of Chemgrid_Score")
else:
print("No score column found!")
return
best_idx = (
df_lig.groupby("Molecule")[score_column]
.idxmin()
.dropna()
)
df_best = df_lig.loc[best_idx].sort_values(score_column)
args.ligand_csv.parent.mkdir(parents=True, exist_ok=True)
df_best.to_csv(args.ligand_csv, index=False)
# STAT DF
if stat_records:
df_stat = pd.DataFrame(stat_records)
args.stat_csv.parent.mkdir(parents=True, exist_ok=True)
df_stat.to_csv(args.stat_csv, index=False)
print(f"Wrote {len(df_best)} molecules to {args.ligand_csv}")
if stat_records:
print(f"Wrote {len(df_stat)} STAT rows to {args.stat_csv}")
if __name__ == "__main__":
main()
Rescoring
#!/usr/bin/env bash
# ------------------------------------------------------------------
# Rescoring pipeline — Bash port of the original tcsh script
# ------------------------------------------------------------------
# IFS=$'\n\t' # safe word splitting
# ------------------------- CONFIGURATION --------------------------
system="AMPC" # set your system tag here
# system="EGFR" # uncomment to switch target
outdir="${system}_rescore_v3.8_v6"
src_gz="/home/ak87/Desktop/SMALL_PROJ/DOCK6-NEBIUS/2025-06-TEST/TEST-LINKIN/RUN-TRENTs-COMMENTS/DOCK3/1/output/test.mol2.gz"
db2_dir="/home/ak87/Desktop/SMALL_PROJ/DOCK6-NEBIUS/2025-06-TEST/AMPC-TESTED/"
dockfiles_link="/home/ak87/Desktop/SMALL_PROJ/DOCK6-NEBIUS/2025-06-TEST/TMP2/dock6files"
dock6_bin="/home/ak87/Programs/dock6-gritukan_merged/bin/dock6"
separate_py="/home/ak87/Programs/lc_blazing_fast_separate_mol2_into_smaller_files_AK.py"
vdw_defn="/home/ak87/Programs/dock6-gritukan_merged/parameters/vdw_AMBER_parm94.dock3_7.defn"
flex_defn="/home/ak87/Programs/dock6-gritukan_merged/parameters/flex.defn"
flex_tbl="/home/ak87/Programs/dock6-gritukan_merged/parameters/flex_drive.tbl"
# ------------------------------------------------------------------
mkdir -p "$outdir"
cd "$outdir"
# Unzip test.mol2 once
zcat "$src_gz" > test.mol2
# Link dockfiles directory (overwrite stale link if present)
ln -sfn "$dockfiles_link" dockfiles
# Fresh output accumulator
rm -f total_energy.txt
# Split the multi-mol2 file into chunks named split001.mol2 …
python "$separate_py" test.mol2 split 1
rm test.mol2
mol2files=$(ls *.mol2)
echo $mol2files
# ---------------------- MAIN PROCESSING LOOP ---------------------
for mol2file in $mol2files; do
# Grab ligand-source path (single value expected)
# file=$(grep -m1 "Ligand Source File:" "$mol2file" | awk '{print $5}')
ligand_id=$(grep "########## Name:" "$mol2file" | awk '{print $3}')
ligand_charge=$(grep "########## Ligand Charge:" "$mol2file" | awk '{print $4}')
echo "Processing ligand: $ligand_id (charge: $ligand_charge, source: $mol2file)"
found_file=false
for file in "$db2_dir"/*"$ligand_id"*.db2; do
echo "Checking file: $file"
this_charge=$(head -n 2 "$file" | tail -n 1 | awk '{print $2}')
echo $ligand_id $this_charge $ligand_charge
match=$(python3 -c "import sys; print("1") if float('$ligand_charge') == float('$this_charge') else sprint("2")")
if [[ "$match" -eq "1" ]]; then
echo "Found source file: $file (charge match: $ligand_charge)"
found_file=true
break
fi
done
if [[ $found_file == false ]]; then
echo "No matching source file found for ligand: $ligand_id (charge: $ligand_charge)"
continue
fi
name="${mol2file%.*}" # basename without .mol2
echo "Processing $mol2file (source ⇒ $file)"
# Build solvation table
cat "$file" | grep "^A" \
| awk '{printf"%f %f %f %f %f\n",0.0,$8,0.0,$9,$10}' \
> "${name}.solv"
printf '%s\n' "$name" > molname.txt # not used later but kept
# Append TRIPOS solvation section
{
printf '@<TRIPOS>SOLVATION\n'
printf 'LIG header\n'
cat "${name}.solv"
} >> "${name}.mol2"
# ------------------- Generate dock.in on-the-fly --------------
cat > dock.in <<EOF
conformer_search_type rigid
use_internal_energy no
ligand_atom_file ${name}.mol2
limit_max_ligands no
skip_molecule no
read_mol_solvation yes
calculate_rmsd no
use_database_filter no
orient_ligand no
bump_filter no
score_molecules yes
contact_score_primary no
grid_score_primary no
gist_score_primary no
multigrid_score_primary no
dock3.5_score_primary yes
dock3.5_vdw_score yes
dock3.5_grd_prefix dockfiles/chem52
dock3.5_electrostatic_score yes
dock3.5_ligand_desolvation_score volume
dock3.5_solvent_occlusion_file dockfiles/ligand.desolv.heavy
dock3.5_redistribute_positive_desolvation no
dock3.5_hydrogen_desolvation_grid yes
dock3.5_hydrogen_solvent_occlusion_file dockfiles/ligand.desolv.hydrogen
dock3.5_receptor_desolvation_score no
dock3.5_write_atomic_energy_contrib no
dock3.5_score_vdw_scale 1
dock3.5_score_es_scale 1
minimize_ligand no
atom_model all
vdw_defn_file ${vdw_defn}
flex_defn_file ${flex_defn}
flex_drive_file ${flex_tbl}
ligand_outfile_prefix ${name}_output
write_orientations no
num_scored_conformers 1
rank_ligands no
EOF
# --------------------- Run DOCK and harvest -------------------
"$dock6_bin" -i dock.in -o dock.out
# Pull chemgrid and energy terms into running table
echo "$(grep Chemgrid_Score "${name}_output_scored.mol2") \
$(grep 'Total Energy:' "${name}.mol2")" \
| awk '{print $3, $7}' >> total_energy.txt
done
# ------------------------------------------------------------------
echo "All done — summary energies written to total_energy.txt"(base)