Create ChemDraw files with Python

From DISI

Complete Manual: Displaying Chemical Structures in ChemDraw Format Using pycdxml

Overview

The pycdxml package enables platform-independent manipulation of ChemDraw files (CDX and CDXML formats) through Python. The CDXMLSlideGenerator class, found in the cdxml_slide_generator module, arranges molecules in a grid together with their associated properties and outputs a standard ChemDraw document that users can edit further.

Installation

Define a conda environment with the required dependencies by saving the following as environment.yml:

name: pycdxml
channels:
  - conda-forge
  - defaults
dependencies:
  - python>=3.8
  - rdkit>=2020.09.1
  - numpy
  - pyyaml
  - lxml
  - fonttools
  - matplotlib
  - pip

Install the environment and package:

conda env create -f environment.yml
conda activate pycdxml
python -m pip install -e /path/to/PyCDXML
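
Before running the examples, it can help to confirm that the environment resolved every dependency. The following is a minimal sketch using only the standard library. The helper name missing_modules is hypothetical and not part of pycdxml; the module names are the usual import names for the packages listed above.

```python
from importlib.util import find_spec

# Import names for the packages listed in environment.yml.
# Note that pyyaml is imported as "yaml" and fonttools as "fontTools".
REQUIRED_MODULES = ["rdkit", "numpy", "yaml", "lxml", "fontTools", "matplotlib", "pycdxml"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level modules from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

An empty result means the activated environment can import everything the workflow needs.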

Core Workflow for Structure Display

The typical workflow consists of four main steps:

  1. Read molecular structures from an SD file using RDKit
  2. Convert to CDXML format using the cdxml_converter module
  3. Extract properties from molecules to display alongside structures
  4. Generate the slide using CDXMLSlideGenerator with a customizable layout

Detailed Implementation

Step 1: Import Required Modules

import sys
from rdkit import Chem
from pycdxml import cdxml_slide_generator, cdxml_converter

Step 2: Load Molecular Structures

# Read from command line or specify path directly
input_file = sys.argv[1]  # or use explicit path
suppl = Chem.SDMolSupplier(input_file)
# SDMolSupplier yields None for records it cannot parse; skip those
molecules = [mol for mol in suppl if mol is not None]

Important considerations:

  • Works best with standard small molecules
  • Organometallics, polymers, and other complex structures may not convert cleanly
  • Each molecule is treated as a separate entity (similar to the MOL file concept)
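
RDKit's SDMolSupplier yields None for records it cannot parse, so it is worth separating those from usable molecules and reporting which records failed. A minimal sketch follows. The helper split_parsed is a hypothetical name, and a plain list stands in for a real supplier so the example runs without an SD file.

```python
def split_parsed(records):
    """Separate parsed entries from failed ones.

    `records` is any iterable in which failed entries are None,
    as is the case for rdkit.Chem.SDMolSupplier.
    Returns (parsed_entries, indices_of_failed_entries).
    """
    parsed, failed = [], []
    for index, record in enumerate(records):
        if record is None:
            failed.append(index)
        else:
            parsed.append(record)
    return parsed, failed


# Stand-in data; with RDKit this would be Chem.SDMolSupplier(input_file).
records = ["mol_a", None, "mol_b"]
molecules, failed = split_parsed(records)
print(f"{len(molecules)} parsed, failed record indices: {failed}")
```

Keeping the failed indices makes it easier to trace problem records back to the input file.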

Step 3: Convert Molecules to CDXML Format

cdxmls = []
for mol in molecules:
    cdxml = cdxml_converter.mol_to_document(mol).to_cdxml()
    cdxmls.append(cdxml)

The mol_to_document() function creates a ChemDraw document object, and that object's to_cdxml() method serializes it to a CDXML string.
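
Because CDXML is XML text, a quick well-formedness check can catch truncated or corrupted strings before they reach the slide generator. This is a sketch using only the standard library. The helper root_tag_if_well_formed is a hypothetical name, and the sample string is a heavily simplified stand-in rather than real converter output.

```python
import xml.etree.ElementTree as ET


def root_tag_if_well_formed(xml_text):
    """Return the root element's tag, or None if the text is not well-formed XML."""
    try:
        # Parsing bytes lets the parser honour any encoding named in an XML declaration.
        return ET.fromstring(xml_text.encode("utf-8")).tag
    except ET.ParseError:
        return None


# Hypothetical, simplified stand-in for a string returned by to_cdxml().
sample = "<CDXML><page><fragment/></page></CDXML>"
print(root_tag_if_well_formed(sample))
print(root_tag_if_well_formed("<CDXML><page>"))
```

This only checks XML syntax; it says nothing about whether ChemDraw will accept the document's content.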

Step 4: Extract and Configure Properties

Properties appear as text annotations below each structure. The TextProperty class defines how each property displays:

all_props = []
for mol in molecules:
    props = [
        cdxml_slide_generator.TextProperty('SOURCE_ID', 
                                          mol.GetProp("SOURCE_ID"), 
                                          color='#3f6eba'),
        cdxml_slide_generator.TextProperty('MG_ID', 
                                          mol.GetProp("MG_ID"))
    ]
    all_props.append(props)

TextProperty parameters:

  • name: Property label (displayed if show_name=True)
  • value: The actual property value from molecule data
  • color: Hex color code (e.g., '#3f6eba' for blue)
  • show_name: Boolean to display/hide the property name
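
Note that mol.GetProp() raises a KeyError when a record lacks the requested SD field, which would stop the loop above at the first incomplete record. Below is a minimal sketch of a fallback lookup that relies only on RDKit's HasProp and GetProp methods. The names prop_or_default and FakeMol are hypothetical and exist purely for illustration.

```python
def prop_or_default(mol, name, default="n/a"):
    """Return the SD field `name` as a string, or `default` if the field is absent."""
    return mol.GetProp(name) if mol.HasProp(name) else default


class FakeMol:
    """Stand-in exposing the two RDKit Mol methods used above, for demonstration only."""

    def __init__(self, props):
        self._props = props

    def HasProp(self, name):
        return name in self._props

    def GetProp(self, name):
        return self._props[name]


mol = FakeMol({"SOURCE_ID": "CPD-001"})
print(prop_or_default(mol, "SOURCE_ID"))
print(prop_or_default(mol, "MG_ID"))
```

With a real RDKit molecule, the returned string can be passed as the value argument of TextProperty.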

Step 5: Generate the Slide

sg = cdxml_slide_generator.CDXMLSlideGenerator(
    style="ACS 1996",           # ChemDraw style template
    number_of_properties=2,      # Properties per molecule
    columns=5,                   # Grid columns
    rows=10,                     # Grid rows
    slide_width=50,              # Page width (cm)
    slide_height=70              # Page height (cm)
)

slide = sg.generate_slide(cdxmls, all_props)

CDXMLSlideGenerator parameters:

  • style: ChemDraw style name (e.g., "ACS 1996") - determines bond length, font sizes, display preferences
  • number_of_properties: Must match the number of TextProperty objects per molecule
  • columns and rows: Grid layout (5×10 = 50 structures per page)
  • slide_width and slide_height: Document dimensions in centimeters
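
The grid parameters determine how many structures fit on a page and roughly how much room each one gets. The sketch below assumes the page is divided evenly into cells and ignores margins and the space taken by property text, which pycdxml may handle differently. The helper grid_summary is a hypothetical name.

```python
import math


def grid_summary(n_structures, columns, rows, slide_width, slide_height):
    """Return capacity per page, pages needed, and the nominal cell size in cm."""
    per_page = columns * rows
    return {
        "per_page": per_page,
        "pages": math.ceil(n_structures / per_page),
        "cell_width_cm": slide_width / columns,
        "cell_height_cm": slide_height / rows,
    }


summary = grid_summary(120, columns=5, rows=10, slide_width=50, slide_height=70)
print(summary)
```

For the settings used in Step 5, each cell is nominally 10 cm wide and 7 cm high, and 120 structures would need three pages.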

Step 6: Save Output File

output_file = sys.argv[2]  # or specify path
with open(output_file, "w", encoding='UTF-8') as xf:
    xf.write(slide)

Critical encoding note: Always pass encoding='UTF-8' when opening the output file. Without it, Python falls back to the platform's default encoding, which can corrupt non-ASCII characters in compound names or properties.
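
The effect of an explicit encoding can be checked with a short round trip. This sketch writes a hypothetical property string containing non-ASCII characters to a temporary file and reads it back; the file name and text are illustrative only.

```python
import tempfile
from pathlib import Path

# Hypothetical property text containing non-ASCII characters.
text = "IC50 = 5 µM at 25 °C (α-isomer)"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.cdxml"
    # Write and read with an explicit encoding instead of the platform default.
    path.write_text(text, encoding="utf-8")
    round_trip = path.read_text(encoding="utf-8")
    raw_size = path.stat().st_size

print(round_trip == text)
# Non-ASCII characters occupy more than one byte in UTF-8.
print(raw_size > len(text))
```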

Command-Line Usage

Create a script (e.g., generate_structures.py) and run:

python generate_structures.py input_structures.sdf output_slide.cdxml
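
Reading sys.argv directly fails with an IndexError when an argument is missing. A sketch of the same interface using argparse is shown here. The --columns and --rows options are hypothetical additions, with defaults that mirror the values used in this manual.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments for a slide-generation script."""
    parser = argparse.ArgumentParser(
        description="Generate a ChemDraw structure sheet from an SD file."
    )
    parser.add_argument("input_sdf", help="path to the input SD file")
    parser.add_argument("output_cdxml", help="path for the generated CDXML file")
    # Hypothetical options; defaults mirror the values used in this manual.
    parser.add_argument("--columns", type=int, default=5, help="grid columns")
    parser.add_argument("--rows", type=int, default=10, help="grid rows")
    return parser.parse_args(argv)


args = parse_args(["input_structures.sdf", "output_slide.cdxml", "--columns", "4"])
print(args.input_sdf, args.output_cdxml, args.columns, args.rows)
```

In a real script, calling parse_args() with no argument reads from the command line and prints a usage message when arguments are missing.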

Advanced Features

Style Application

Apply consistent ChemDraw styles across all structures. The styler module can convert existing CDXML files to a target style:

from pycdxml import cdxml_styler

styler = cdxml_styler.CDXMLStyler(style_source="/path/to/ACS 1996.cdxml")
styler.apply_style_to_file('input.cdxml', outpath='output.cdxml')

Style affects:

  • Bond lengths and widths
  • Atom label font sizes
  • Hydrogen display (implicit vs explicit)
  • Stereochemistry indicators
  • Overall visual presentation

Format Conversions

Convert between CDXML (XML text) and CDX (binary) formats:

# CDX to CDXML
doc = cdxml_converter.read_cdx('/path/to/structure.cdx')
cdxml_converter.write_cdxml_file(doc, '/path/to/structure.cdxml')

# CDXML to base64-encoded CDX
doc = cdxml_converter.read_cdxml('/path/to/structure.cdxml')
b64_cdx = cdxml_converter.to_b64_cdx(doc)
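
To obtain a binary file from a base64 string, the text has to be decoded before writing. The sketch below assumes standard base64 encoding and uses a placeholder payload instead of real converter output. The helper write_b64_as_binary is a hypothetical name.

```python
import base64
import tempfile
from pathlib import Path


def write_b64_as_binary(b64_text, path):
    """Decode a base64 string and write the resulting bytes to `path`."""
    data = base64.b64decode(b64_text)
    Path(path).write_bytes(data)
    return len(data)


# Placeholder payload standing in for the string returned by to_b64_cdx().
payload = b"not a real CDX payload"
placeholder = base64.b64encode(payload).decode("ascii")

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "structure.cdx"
    n_bytes = write_b64_as_binary(placeholder, out_path)
    restored = out_path.read_bytes()

print(n_bytes == len(payload), restored == payload)
```

Binary output must be written in bytes mode; opening the file in text mode with an encoding would alter the data.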

Property Annotations for SD File Export

Important feature: All visible properties are automatically annotated to the molecules in the CDXML file. If you open the generated CDXML in ChemDraw and save as an SD file, all displayed properties will be included in the SD file output.

Practical Tips

For Large Datasets:

  • Process in batches if dealing with thousands of structures
  • Calculate optimal rows/columns based on page size and readability
  • Consider multiple pages rather than overcrowding single page
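
Splitting a large set into pages comes down to slicing the structure list into chunks of columns × rows entries. A minimal sketch with a hypothetical helper named paginate; strings stand in for the CDXML documents.

```python
def paginate(items, per_page):
    """Split `items` into consecutive lists of at most `per_page` entries."""
    if per_page < 1:
        raise ValueError("per_page must be at least 1")
    return [items[i:i + per_page] for i in range(0, len(items), per_page)]


columns, rows = 5, 10
structures = [f"cdxml_{i}" for i in range(120)]  # stand-ins for CDXML strings
pages = paginate(structures, columns * rows)
print([len(page) for page in pages])
```

The property lists can be split with the same function so that each page's structures and properties stay aligned, and each chunk can then be passed to its own generate_slide() call and written to a separate file.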

Property Display:

  • Limit to 2-4 properties per structure for readability
  • Use color coding to highlight important values (e.g., red for high activity)
  • Include units in property names when showing numerical values
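
Color coding and unit labels can be prepared before the TextProperty objects are built. The sketch below is illustrative: the helper names, the threshold, and the hex colors are all hypothetical choices. Values are converted with float() because SD file properties are read as strings.

```python
def activity_color(value, threshold, high="#c00000", default="#000000"):
    """Return `high` when the numeric value reaches the threshold, else `default`."""
    return high if float(value) >= threshold else default


def labelled(name, unit):
    """Build a property name that carries its unit, e.g. 'Inhibition (%)'."""
    return f"{name} ({unit})"


print(labelled("Inhibition", "%"))
print(activity_color("72", threshold=50))
print(activity_color("12.5", threshold=50))
```

The returned color string can be passed as the color argument of TextProperty, and the labelled name as its name argument.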

Layout Optimization:

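  • The 5×10 grid (50 structures) used in this manual is laid out on a 50 × 70 cm page; for A3 (29.7 × 42.0 cm) or tabloid printouts, reduce slide_width and slide_height to the paper size and check that the structures remain legible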
  • For presentations, consider 4×3 or 5×4 grids for better visibility
  • Adjust slide_width and slide_height to match intended output format
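
Since slide_width and slide_height are given in centimeters, a small lookup table of paper sizes keeps the values consistent across scripts. A sketch with a hypothetical helper named slide_dimensions; the table lists standard paper dimensions in portrait orientation.

```python
# Common page sizes in centimeters (width, height), portrait orientation.
PAGE_SIZES_CM = {
    "A4": (21.0, 29.7),
    "A3": (29.7, 42.0),
    "tabloid": (27.94, 43.18),  # 11 x 17 inches
}


def slide_dimensions(page, landscape=False):
    """Return (slide_width, slide_height) in cm for a named page size."""
    width, height = PAGE_SIZES_CM[page]
    return (height, width) if landscape else (width, height)


width, height = slide_dimensions("A3", landscape=True)
print(width, height)
```

The two returned values map directly onto the slide_width and slide_height arguments of CDXMLSlideGenerator.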

Output and Compatibility

The generated CDXML file is a fully valid ChemDraw document that can be:

  • Opened and edited in ChemDraw Desktop
  • Converted to PDF, PNG, or other formats via ChemDraw
  • Modified by end users (chemists) to adjust layouts, add annotations, or change styles
  • Saved as SD files with all property annotations preserved

Known Limitations

  • Very old CDX files (~ChemDraw 7 era) may fail to parse correctly
  • Complex molecules (organometallics, polymers) may have conversion issues
  • Package focuses on "file-level" operations, not chemical validation
  • Style translation may not perfectly position additional drawing elements (brackets, arrows) relative to molecules

Complete Example Script

#!/usr/bin/env python
"""
Generate ChemDraw structure sheet from SD file
Usage: python generate_structures.py input.sdf output.cdxml
"""
import sys
from rdkit import Chem
from pycdxml import cdxml_slide_generator, cdxml_converter

# Read input and output paths; print the usage text if either is missing
if len(sys.argv) != 3:
    sys.exit(__doc__)
input_sdf = sys.argv[1]
output_cdxml = sys.argv[2]

# Load molecules
suppl = Chem.SDMolSupplier(input_sdf)
molecules = [mol for mol in suppl if mol is not None]

# Convert to CDXML
cdxmls = []
for mol in molecules:
    cdxml = cdxml_converter.mol_to_document(mol).to_cdxml()
    cdxmls.append(cdxml)

# Extract properties (GetProp raises KeyError if a record lacks the field)
all_props = []
for mol in molecules:
    props = [
        cdxml_slide_generator.TextProperty('ID', 
                                          mol.GetProp("SOURCE_ID"), 
                                          color='#3f6eba'),
        cdxml_slide_generator.TextProperty('Activity', 
                                          mol.GetProp("ACTIVITY"), 
                                          show_name=True)
    ]
    all_props.append(props)

# Generate slide
generator = cdxml_slide_generator.CDXMLSlideGenerator(
    style="ACS 1996",
    number_of_properties=2,
    columns=5,
    rows=10,
    slide_width=50,
    slide_height=70
)

slide = generator.generate_slide(cdxmls, all_props)

# Save output
with open(output_cdxml, "w", encoding='UTF-8') as f:
    f.write(slide)

print(f"Generated {output_cdxml} with {len(molecules)} structures")
