Running DOCK6 with ChemGrid Score and db2 files

From DISI
Revision as of 19:35, 9 September 2025 by Iamkaant (talk | contribs)
Jump to navigation Jump to search

Installation

Obtain DOCK 6 from the official GitHub repository. See also the DOCK 6 manual: Installation.

# on epyc-A40
cd dock6/install
./configure gnu
make all

Large-scale docking (LSD)

The most recent copy on Wynton is in /wynton/group/bks/work/ak87/UCSF/SCRIPTS/DOCK6/2025-08-Trents_fix/dock6/.

Converting DOCK 3 grids to DOCK 6

See the Scripts/Convert_Grids section below. This has been tested on Gimel and should also work on Wynton. The preferred workflow starts from both dockfiles and working directories produced by DOCK 3 blastermaster. When that is not available, the script requires the DOCK 3 dockfiles only. DOCK 3 and DOCK 6 installations are needed.

sh convert_dock3_grids_dockfiles.sh <path_to_dock3_grids_dockfiles>

Note: Converted grids reproduce the same scores when rescoring DOCK 3 poses. For Mac1, however, top scores during large-scale docking were in the thousands, likely because some ligands extended beyond the grid box.

Running LSD

Running LSD with DOCK 6 is analogous to DOCK 3:

  • Prepare grids in dock6files.
  • Obtain SubDock from the SUBDOCK6 repository.
  • Prepare lists of ligand .tgz files in an sdi directory.
  • Create run_subdock6.sh in the working directory and run it (see Scripts/Submit_LSD below). The script targets Wynton but should work on clusters using SGE or SLURM. You can re-run it to retry unfinished jobs.

Processing LSD

Run the parsing script (see Scripts/Process_LSD below). It writes a CSV of scores (similar to extract_all.sort.uniq.txt) and a second CSV with per-subdirectory runtimes.

python parse_dock_logs.py --root /path/to/search \
  --ligand_csv /path/to/out/ligand_scores.csv \
  --stat_csv /path/to/out/stat_summary.csv

To do: extract poses.

Rescoring DOCK 3 poses

See Scripts/Rescoring. Edit the script to specify:

  • The mol2.gz file with poses to rescore.
  • A directory with db2 files (DOCK 6 reads solvation data from these).
  • DOCK 6 grids (dock6files).
  • Parameter files from the DOCK 6 installation.
  • The helper lc_blazing_fast_separate_mol2_into_smaller_files.py is used to split large MOL2 files for speed.

Scripts

Convert grids

#!/bin/bash
# convert_dock3_grids_dockfiles.sh
# Convert DOCK3 grids to DOCK6 dockfiles
# Author Trent Balius (converted to bash)

set -e
echo "sh convert_dock3_grids_dockfiles.sh <path_to_dock3_grids_dockfiles>"
workdir="$(pwd)"
cd "$workdir" || exit
# system="$(basename "$workdir")"

#filedir="/is2/projects/RAS-CompChem/static/work/DUDEZ/dudez_dockprep/${system}_dock6prep/blastermaster/dockfiles"
if [ -z "$1" ]; then
	echo "Usage: sh convert_dock3_grids_dockfiles.sh <path_to_dock3_grids_dockfiles>"
	exit 1
fi

filedir=$1
# Directory where the converted DOCK6 dockfiles will be stored
dock6files_dir="${workdir}/dock6files"

mkdir -p "$dock6files_dir"
cd "$dock6files_dir" || exit

cp "$filedir"/vdw.* .
#cp "$filedir/box" .
#cp "$filedir/qnifft.electrostatics.phi" .
cp "$filedir/trim.electrostatics.phi" .
cp "$filedir/matching_spheres.sph" .
cp "$filedir"/ligand.desolv.* .
cp "$filedir/INDOCK" .
cp "$filedir/rec.crg.pdb" rec.pdb

perl /nfs/soft/dock/versions/dock37/DOCK-3.7-trunk/proteins/makebox/makebox.smallokay.pl matching_spheres.sph rec.pdb box 10.0

# Creating standalone grids, so no links to the original files
# ln -sf vdw.bmp chem.bmp
# ln -sf vdw.vdw chem.vdw
# #ln -sf vdw.esp chem.esp
# ln -sf vdw.vdw chem.esp
# #ln -sf qnifft.electrostatics.phi rec+sph.phi
# #ln -sf trim.electrostatics.phi rec+sph.phi
# ln -sf trim.electrostatics.phi chem.phi

cp -v vdw.bmp chem.bmp
cp -v  vdw.vdw chem.vdw
cp -v  vdw.vdw chem.esp
cp -v trim.electrostatics.phi chem.phi

dsize=$(grep delphi_nsize INDOCK | awk '{print $2}')
if [ -z "$dsize" ]; then
	echo "Error: delphi_nsize not found in INDOCK. Please check the INDOCK file."
	exit 1
fi

cat << EOF > gconv.in
compute_grids                  yes
DOCK3_7_grids                  yes
QNIFFT_grid                    yes
grid_spacing                   0.2
phi_grid_size                  $dsize
output_molecule                no
contact_score                  no
contact_cutoff_distance        4.5
chemical_score                 yes
energy_score                   yes
energy_cutoff_distance         10
atom_model                     u
attractive_exponent            6
repulsive_exponent             12
distance_dielectric            yes
dielectric_factor              4
bump_filter                    yes
bump_overlap                   0.5
receptor_file                  rec.crg
box_file                       box
vdw_definition_file            /nfs/home/ak87/PROGRAM/DOCK6-NEBIUS/dock6-gritukan_merged/parameters/vdw_AMBER_parm94.defn
chemical_definition_file       /nfs/home/ak87/PROGRAM/DOCK6-NEBIUS/dock6-gritukan_merged/parameters/chemgrid/conv.defn
score_grid_prefix              chem
EOF

/nfs/home/ak87/PROGRAM/DOCK6-NEBIUS/dock6-gritukan_merged/bin/grid-convert -i gconv.in > gconv.log

Submit LSD

#!/bin/bash
export EXPORT_DEST=$PWD/output
export DOCKFILES=$PWD/dock6files
export DOCKEXEC=/wynton/group/bks/work/ak87/UCSF/SCRIPTS/DOCK6/2025-08-Trents_fix/dock6/bin/dock6
export SHRTCACHE=/dev/shm
export LONGCACHE=/tmp
export SHRTCACHE_USE_ENV=
export USE_DB2_TGZ=true
export USE_DB2_TGZ_BATCH_SIZE=1
export USE_DB2=false
export USE_DB2_BATCH_SIZE=100
export USE_SLURM=False
export USE_SLURM_ARGS=--export=ALL
export USE_SGE=True
export USE_SGE_ARGS="-l s_rt=08:28:00 -l h_rt=08:30:00 -l mem_free=2G"
export USE_PARALLEL=false
export MAX_PARALLEL=-1
export QSUB_EXEC=qsub
export SBATCH_EXEC=sbatch
export PARALLEL_EXEC=parallel
export SUBMIT_WAIT_TIME=0

for i in  sdi/* ; do
       	export k=$(basename $i .in)
       	echo k $k
       	export INPUT_SOURCE=$PWD/$i
       	export EXPORT_DEST=$PWD/output/$k
       	bash </path/to/subdock6.bash>
done

Process LSD

#!/usr/bin/env python3
"""
parse_dock_logs.py

Recursively parse DOCK 'dock.out*' files:
  • Extract all LIGAND score components; keep the record with the lowest
    Chemgrid_Score per molecule.
  • Extract STAT summary for each file.
  • Save two CSV files: ligand_scores and stat_summary.

Usage:
    python parse_dock_logs.py \
        --root /path/to/search \
        --ligand_csv /path/to/out/ligand_scores.csv \
        --stat_csv /path/to/out/stat_summary.csv
"""

import argparse
import re
from pathlib import Path
import pandas as pd

# ──────────────────────────────────────────────────────────────────────────────
# Regex patterns for parsing DOCK output files
LIGAND_START_RE = re.compile(r"^-+$")  # Lines with only dashes
MOLECULE_RE = re.compile(r"^\s*Molecule:\s*(\S+)", re.IGNORECASE)
LIGAND_KEYVAL_RE = re.compile(
    r"^\s*([A-Za-z0-9_]+):\s*([-+]?\d*\.?\d+)\s*$"
)
STAT_BLOCK_RE = re.compile(r"^(=+.*STAT|.*STAT.*)", re.IGNORECASE)
STAT_MOLS_RE = re.compile(r"^\s*(\d+)\s+Molecules Processed", re.IGNORECASE)
STAT_TIME_RE = re.compile(r"^Total elapsed time:\s+([\d.]+) seconds", re.IGNORECASE)
ELAPSED_LINE_RE = re.compile(r"Total elapsed time:", re.IGNORECASE)

# ──────────────────────────────────────────────────────────────────────────────
def file_finished(lines, tail=50) -> bool:
    """Return True if 'Total elapsed time' appears in the last `tail` lines."""
    return any(ELAPSED_LINE_RE.search(l) for l in lines[-tail:])

# ──────────────────────────────────────────────────────────────────────────────
def parse_ligands(lines, subdir, filename):
    """Yield dictionaries with ligand scores."""
    i = 0
    record_count = 0
    while i < len(lines):
        # Look for lines with dashes (separator before ligand record)
        if LIGAND_START_RE.match(lines[i]):
            # Skip ahead to find Molecule: line
            j = i + 1
            while j < len(lines) and not MOLECULE_RE.search(lines[j]):
                j += 1
            
            if j >= len(lines):
                i += 1
                continue
                
            # Found a molecule line, start parsing
            molecule_match = MOLECULE_RE.search(lines[j])
            if not molecule_match:
                i += 1
                continue
                
            record = {
                "subdir": subdir, 
                "file": filename,
                "Molecule": molecule_match.group(1)
            }
            
            # Parse key-value pairs from the rest of the record
            k = j + 1
            while k < len(lines):
                line = lines[k].strip()
                
                # Stop if we hit another separator or definitive end markers
                if (line.startswith("---") or 
                    line.startswith("===") or 
                    line.startswith("score >") or
                    line.startswith("Total elapsed time")):
                    break
                
                # Try to match key-value pairs
                kv_match = LIGAND_KEYVAL_RE.match(lines[k])
                if kv_match:
                    key, val = kv_match.groups()
                    if val:  # Only add if value is not empty
                        record[key] = float(val)
                
                k += 1
            
            # Only yield if we found meaningful data
            if len(record) > 3:  # Has molecule + metadata + at least one score
                record_count += 1
                yield record
            
            i = k  # Continue from where we left off
        else:
            i += 1

# ──────────────────────────────────────────────────────────────────────────────
def parse_stat(lines, subdir, filename):
    """Return dict with STAT info or None if block not found."""
    for idx, line in enumerate(lines):
        if STAT_BLOCK_RE.match(line):
            mols = elapsed = None
            for j in range(idx + 1, len(lines)):
                if mols is None:
                    m = STAT_MOLS_RE.match(lines[j])
                    if m:
                        mols = int(m.group(1))
                if elapsed is None:
                    t = STAT_TIME_RE.match(lines[j])
                    if t:
                        elapsed = float(t.group(1))
                if lines[j].startswith("=") and j > idx + 1:
                    break  # end of STAT block
            return {
                "subdir": subdir,
                "file": filename,
                "molecules_processed": mols,
                "elapsed_seconds": elapsed,
            }
    return None

# ──────────────────────────────────────────────────────────────────────────────
def main():
    ap = argparse.ArgumentParser(description="Parse DOCK log files.")
    ap.add_argument("--root", type=Path, default=Path.cwd(),
                    help="Root directory to search (default: CWD)")
    ap.add_argument("--ligand_csv", type=Path, required=True,
                    help="Output CSV for ligand scores")
    ap.add_argument("--stat_csv", type=Path, required=True,
                    help="Output CSV for STAT summaries")
    args = ap.parse_args()

    ligand_records = []
    stat_records = []

    for path in args.root.rglob("dock.out*"):
        print(f"Processing {path}")
        if not path.is_file():
            continue
        lines = path.read_text(errors="ignore").splitlines()
        if not file_finished(lines):
            print(f"  Skipping: file not finished")
            continue  # skip incomplete file
        
        subdir = str(path.parent.relative_to(args.root))
        filename = path.name

        # LIGAND parsing
        ligand_records_from_file = list(parse_ligands(lines, subdir, filename))
        ligand_records.extend(ligand_records_from_file)
        print(f"  Found {len(ligand_records_from_file)} ligand records")

        # STAT parsing
        stat = parse_stat(lines, subdir, filename)
        if stat:
            stat_records.append(stat)

    if not ligand_records:
        print("No completed ligand data found.")
        return

    # Ligand DF & best scores
    df_lig = pd.DataFrame(ligand_records)
    
    # Check if Chemgrid_Score exists, if not use alternative column
    score_column = "Chemgrid_Score"
    if score_column not in df_lig.columns:
        # Look for alternative score columns
        possible_scores = [col for col in df_lig.columns if "score" in col.lower()]
        if possible_scores:
            score_column = possible_scores[0]
            print(f"Using {score_column} instead of Chemgrid_Score")
        else:
            print("No score column found!")
            return
    
    best_idx = (
        df_lig.groupby("Molecule")[score_column]
        .idxmin()
        .dropna()
    )
    df_best = df_lig.loc[best_idx].sort_values(score_column)
    args.ligand_csv.parent.mkdir(parents=True, exist_ok=True)
    df_best.to_csv(args.ligand_csv, index=False)

    # STAT DF
    if stat_records:
        df_stat = pd.DataFrame(stat_records)
        args.stat_csv.parent.mkdir(parents=True, exist_ok=True)
        df_stat.to_csv(args.stat_csv, index=False)

    print(f"Wrote {len(df_best)} molecules to {args.ligand_csv}")
    if stat_records:
        print(f"Wrote {len(df_stat)} STAT rows to {args.stat_csv}")

if __name__ == "__main__":
    main()

Rescoring

#!/usr/bin/env bash
# ------------------------------------------------------------------
# Rescoring pipeline — Bash port of the original tcsh script
# ------------------------------------------------------------------
# IFS=$'\n\t'               # safe word splitting

# ------------------------- CONFIGURATION --------------------------
system="AMPC"             # set your system tag here
# system="EGFR"            # uncomment to switch target

outdir="${system}_rescore_v3.8_v6"
src_gz="/home/ak87/Desktop/SMALL_PROJ/DOCK6-NEBIUS/2025-06-TEST/TEST-LINKIN/RUN-TRENTs-COMMENTS/DOCK3/1/output/test.mol2.gz"

db2_dir="/home/ak87/Desktop/SMALL_PROJ/DOCK6-NEBIUS/2025-06-TEST/AMPC-TESTED/"
dockfiles_link="/home/ak87/Desktop/SMALL_PROJ/DOCK6-NEBIUS/2025-06-TEST/TMP2/dock6files"

dock6_bin="/home/ak87/Programs/dock6-gritukan_merged/bin/dock6"
separate_py="/home/ak87/Programs/lc_blazing_fast_separate_mol2_into_smaller_files_AK.py"

vdw_defn="/home/ak87/Programs/dock6-gritukan_merged/parameters/vdw_AMBER_parm94.dock3_7.defn"
flex_defn="/home/ak87/Programs/dock6-gritukan_merged/parameters/flex.defn"
flex_tbl="/home/ak87/Programs/dock6-gritukan_merged/parameters/flex_drive.tbl"
# ------------------------------------------------------------------

mkdir -p "$outdir"
cd "$outdir"

# Unzip test.mol2 once
zcat "$src_gz" > test.mol2

# Link dockfiles directory (overwrite stale link if present)
ln -sfn "$dockfiles_link" dockfiles

# Fresh output accumulator
rm -f total_energy.txt

# Split the multi-mol2 file into chunks named split001.mol2 …
python "$separate_py" test.mol2 split 1
rm test.mol2
mol2files=$(ls *.mol2)
echo $mol2files

# ---------------------- MAIN PROCESSING LOOP ---------------------
for mol2file in $mol2files; do
    # Grab ligand-source path (single value expected)
    # file=$(grep -m1 "Ligand Source File:" "$mol2file" | awk '{print $5}')
    ligand_id=$(grep "##########                 Name:" "$mol2file" | awk '{print $3}')
    ligand_charge=$(grep "##########        Ligand Charge:" "$mol2file" | awk '{print $4}')
    echo "Processing ligand: $ligand_id (charge: $ligand_charge, source: $mol2file)"
    found_file=false
    for file in "$db2_dir"/*"$ligand_id"*.db2; do
        echo "Checking file: $file"
        this_charge=$(head -n 2 "$file" | tail -n 1 | awk '{print $2}')
        echo $ligand_id $this_charge $ligand_charge
        match=$(python3 -c "import sys; print("1") if float('$ligand_charge') == float('$this_charge') else sprint("2")")
        if [[ "$match" -eq "1" ]]; then
            echo "Found source file: $file (charge match: $ligand_charge)"
            found_file=true
            break
        fi
    done
    if [[ $found_file == false ]]; then
        echo "No matching source file found for ligand: $ligand_id (charge: $ligand_charge)"
        continue
    fi
    name="${mol2file%.*}"          # basename without .mol2

    echo "Processing $mol2file  (source ⇒ $file)"

    # Build solvation table
    cat "$file" | grep "^A" \
        | awk '{printf"%f %f %f %f %f\n",0.0,$8,0.0,$9,$10}' \
        > "${name}.solv"

    printf '%s\n' "$name" > molname.txt   # not used later but kept

    # Append TRIPOS solvation section
    {
        printf '@<TRIPOS>SOLVATION\n'
        printf 'LIG header\n'
        cat "${name}.solv"
    } >> "${name}.mol2"

    # ------------------- Generate dock.in on-the-fly --------------
    cat > dock.in <<EOF
conformer_search_type                                        rigid
use_internal_energy                                          no
ligand_atom_file                                             ${name}.mol2
limit_max_ligands                                            no
skip_molecule                                                no
read_mol_solvation                                           yes
calculate_rmsd                                               no
use_database_filter                                          no
orient_ligand                                                no
bump_filter                                                  no
score_molecules                                              yes
contact_score_primary                                        no
grid_score_primary                                           no
gist_score_primary                                           no
multigrid_score_primary                                      no
dock3.5_score_primary                                        yes
dock3.5_vdw_score                                            yes
dock3.5_grd_prefix                                           dockfiles/chem52
dock3.5_electrostatic_score                                  yes
dock3.5_ligand_desolvation_score                             volume
dock3.5_solvent_occlusion_file                               dockfiles/ligand.desolv.heavy
dock3.5_redistribute_positive_desolvation                    no
dock3.5_hydrogen_desolvation_grid                            yes
dock3.5_hydrogen_solvent_occlusion_file                      dockfiles/ligand.desolv.hydrogen
dock3.5_receptor_desolvation_score                           no
dock3.5_write_atomic_energy_contrib                          no
dock3.5_score_vdw_scale                                      1
dock3.5_score_es_scale                                       1
minimize_ligand                                              no
atom_model                                                   all
vdw_defn_file                                                ${vdw_defn}
flex_defn_file                                               ${flex_defn}
flex_drive_file                                              ${flex_tbl}
ligand_outfile_prefix                                        ${name}_output
write_orientations                                           no
num_scored_conformers                                        1
rank_ligands                                                 no
EOF
    # --------------------- Run DOCK and harvest -------------------
    "$dock6_bin" -i dock.in -o dock.out

    # Pull chemgrid and energy terms into running table
    echo "$(grep Chemgrid_Score "${name}_output_scored.mol2") \
$(grep 'Total Energy:' "${name}.mol2")" \
        | awk '{print $3, $7}' >> total_energy.txt
done
# ------------------------------------------------------------------
echo "All done — summary energies written to total_energy.txt"(base)