Create ChemDraw files with Python
Complete Manual: Displaying Chemical Structures in ChemDraw Format Using pycdxml
Overview
The pycdxml package enables platform-independent manipulation of ChemDraw files (CDX and CDXML formats) through Python. The CDXMLSlideGenerator module creates professional structure sheets by arranging molecules in grids with associated properties, outputting standard ChemDraw documents that can be further edited by users.
Installation
Create a conda environment with required dependencies:
name: pycdxml
channels:
  - conda-forge
  - defaults
dependencies:
  - python>=3.8
  - rdkit>=2020.09.1
  - numpy
  - pyyaml
  - lxml
  - fonttools
  - matplotlib
  - pip
Install the environment and package:
conda env create -f environment.yml
conda activate pycdxml
python -m pip install -e /path/to/PyCDXML
Core Workflow for Structure Display
The typical workflow consists of four main steps (the detailed implementation below lists the module imports and the file output as separate steps):
- Read molecular structures from an SD file using RDKit
- Convert to CDXML format using the cdxml_converter module
- Extract properties from molecules to display alongside structures
- Generate slide using CDXMLSlideGenerator with customizable layout
Detailed Implementation
Step 1: Import Required Modules
import sys
from rdkit import Chem
from pycdxml import cdxml_slide_generator, cdxml_converter
Step 2: Load Molecular Structures
# Read the input path from the command line (or assign an explicit path)
input_file = sys.argv[1]
suppl = Chem.SDMolSupplier(input_file)
# SDMolSupplier yields None for records RDKit cannot parse; skip those
molecules = [mol for mol in suppl if mol is not None]
Important considerations:
- Works best with standard small molecules
- May have issues with organometallics, polymers, or complex structures
- Each molecule is treated as a separate entity (similar to the MOL file concept)
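Because RDKit's SDMolSupplier yields None for records it cannot parse, it can be useful to compare the number of records in the file with the number of molecules actually loaded. The following is a minimal sketch, assuming the conventional `$$$$` terminator line after every record; `count_sd_records` and `report_skipped` are hypothetical helpers, not part of RDKit or pycdxml.

```python
def count_sd_records(sd_text: str) -> int:
    """Count records in SD file text by counting '$$$$' terminator lines."""
    return sum(1 for line in sd_text.splitlines() if line.strip() == "$$$$")


def report_skipped(sd_text: str, n_loaded: int) -> int:
    """Return how many records were not loaded (never negative)."""
    return max(count_sd_records(sd_text) - n_loaded, 0)


# Two (abbreviated) records, of which only one was loaded successfully.
sample = "mol1\n...\nM  END\n$$$$\nmol2\n...\nM  END\n$$$$\n"
print(report_skipped(sample, n_loaded=1))  # 1
```

If the last record in a file lacks its terminator line, this count will be one too low, so treat the result as a hint rather than an exact figure.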
Step 3: Convert Molecules to CDXML Format
cdxmls = []
for mol in molecules:
    cdxml = cdxml_converter.mol_to_document(mol).to_cdxml()
    cdxmls.append(cdxml)
The mol_to_document() function creates a ChemDraw document object, and that object's to_cdxml() method returns the document as a CDXML (XML) string.
Step 4: Extract and Configure Properties
Properties appear as text annotations below each structure. The TextProperty class defines how each property displays:
all_props = []
for mol in molecules:
    props = [
        cdxml_slide_generator.TextProperty('SOURCE_ID',
            mol.GetProp("SOURCE_ID"),
            color='#3f6eba'),
        cdxml_slide_generator.TextProperty('MG_ID',
            mol.GetProp("MG_ID"))
    ]
    all_props.append(props)
TextProperty parameters:
- name: Property label (displayed if show_name=True)
- value: The actual property value from molecule data
- color: Hex color code (e.g., '#3f6eba' for blue)
- show_name: Boolean to display/hide the property name
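RDKit's GetProp() raises a KeyError when a molecule lacks the requested property, which would abort the loop above on the first incomplete record. A minimal sketch of a fallback lookup follows; `prop_or_default` and the `FakeMol` stand-in are hypothetical names introduced for illustration, and only the HasProp()/GetProp() methods of RDKit molecules are assumed.

```python
def prop_or_default(mol, name: str, default: str = "n/a") -> str:
    """Return the SD property `name` as text, or `default` if it is absent.

    `mol` is expected to offer RDKit's HasProp()/GetProp() methods.
    """
    return mol.GetProp(name) if mol.HasProp(name) else default


class FakeMol:
    """Tiny stand-in for an RDKit Mol so the sketch runs without RDKit."""

    def __init__(self, props):
        self._props = dict(props)

    def HasProp(self, name):
        return name in self._props

    def GetProp(self, name):
        return self._props[name]


mol = FakeMol({"SOURCE_ID": "CMPD-001"})
print(prop_or_default(mol, "SOURCE_ID"))  # CMPD-001
print(prop_or_default(mol, "MG_ID"))      # n/a
```

The returned string can then be passed as the value argument of TextProperty in place of a direct mol.GetProp(...) call.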
Step 5: Generate the Slide
sg = cdxml_slide_generator.CDXMLSlideGenerator(
    style="ACS 1996",           # ChemDraw style template
    number_of_properties=2,     # Properties per molecule
    columns=5,                  # Grid columns
    rows=10,                    # Grid rows
    slide_width=50,             # Page width (cm)
    slide_height=70             # Page height (cm)
)
slide = sg.generate_slide(cdxmls, all_props)
CDXMLSlideGenerator parameters:
- style: ChemDraw style name (e.g., "ACS 1996") - determines bond length, font sizes, display preferences
- number_of_properties: Must match the number of TextProperty objects per molecule
- columns and rows: Grid layout (5×10 = 50 structures per page)
- slide_width and slide_height: Document dimensions in centimeters
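Since number_of_properties has to match the number of TextProperty objects per molecule, a quick consistency check before calling generate_slide() can catch mismatches early. This is only a sketch: `check_slide_inputs` is a hypothetical helper, and it assumes that the structure list and the property list are parallel (one property list per structure), as in the loops above.

```python
def check_slide_inputs(cdxmls, all_props, number_of_properties):
    """Raise ValueError if structures and property lists do not line up."""
    if len(cdxmls) != len(all_props):
        raise ValueError(
            f"{len(cdxmls)} structures but {len(all_props)} property lists"
        )
    for index, props in enumerate(all_props):
        if len(props) != number_of_properties:
            raise ValueError(
                f"molecule {index} has {len(props)} properties, "
                f"expected {number_of_properties}"
            )


# Plain strings stand in for CDXML documents and TextProperty objects here.
check_slide_inputs(["<cdxml 1>", "<cdxml 2>"], [["a", "b"], ["c", "d"]], 2)
```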
Step 6: Save Output File
output_file = sys.argv[2]  # or specify path
with open(output_file, "w", encoding='UTF-8') as xf:
    xf.write(slide)
Critical encoding note: Always use encoding='UTF-8' to ensure proper character handling, especially for special characters in compound names or properties.
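After writing the file, a cheap sanity check is to parse it back as XML and look at the root element. The sketch below uses only the Python standard library and assumes that the root element of a CDXML document is named `CDXML`; `root_tag` is a hypothetical helper, and the tiny document written here is a stand-in, not a real slide.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def root_tag(path: str) -> str:
    """Parse the file as XML and return the name of its root element."""
    return ET.parse(path).getroot().tag


# Demonstration with a minimal stand-in document instead of a real slide.
minimal = '<?xml version="1.0" encoding="UTF-8"?>\n<CDXML><page/></CDXML>\n'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "check.cdxml")
    with open(path, "w", encoding="UTF-8") as fh:
        fh.write(minimal)
    tag = root_tag(path)

print(tag)  # CDXML
```

A file that is not well-formed XML makes ET.parse() raise a ParseError, so the check also catches truncated output.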
Command-Line Usage
Create a script (e.g., generate_structures.py) and run:
python generate_structures.py input_structures.sdf output_slide.cdxml
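Reading sys.argv directly fails with an IndexError when an argument is missing. One possible alternative is argparse from the standard library, sketched below; the optional `--columns` and `--rows` flags are hypothetical additions for illustration and are not part of the script shown in this manual.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional paths plus optional grid settings."""
    parser = argparse.ArgumentParser(
        description="Generate a ChemDraw structure sheet from an SD file."
    )
    parser.add_argument("input_sdf", help="path to the input SD file")
    parser.add_argument("output_cdxml", help="path of the CDXML file to write")
    parser.add_argument("--columns", type=int, default=5, help="grid columns")
    parser.add_argument("--rows", type=int, default=10, help="grid rows")
    return parser.parse_args(argv)


# Passing a list makes the parser easy to try out without a real command line.
args = parse_args(["input_structures.sdf", "output_slide.cdxml", "--columns", "4"])
print(args.input_sdf, args.output_cdxml, args.columns, args.rows)
```

With this in place, running the script without arguments prints a usage message instead of a traceback.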
Advanced Features
Style Application
Apply consistent ChemDraw styles across all structures. The styler module can convert existing CDXML files to a target style:
from pycdxml import cdxml_styler
styler = cdxml_styler.CDXMLStyler(style_source="/path/to/ACS 1996.cdxml")
styler.apply_style_to_file('input.cdxml', outpath='output.cdxml')
Style affects:
- Bond lengths and widths
- Atom label font sizes
- Hydrogen display (implicit vs explicit)
- Stereochemistry indicators
- Overall visual presentation
Format Conversions
Convert between CDXML (XML text) and CDX (binary) formats:
# CDX to CDXML
doc = cdxml_converter.read_cdx('/path/to/structure.cdx')
cdxml_converter.write_cdxml_file(doc, '/path/to/structure.cdxml')
# CDXML to base64-encoded CDX
doc = cdxml_converter.read_cdxml('/path/to/structure.cdxml')
b64_cdx = cdxml_converter.to_b64_cdx(doc)
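When base64-encoded CDX is passed between systems (for example stored in a database column), it can help to verify that a payload really decodes to CDX data. The sketch below rests on two assumptions that should be checked against the CDX format documentation and the pycdxml source: that binary CDX files begin with the 8-byte signature `VjCD0100`, and that to_b64_cdx() returns exactly that binary content in base64 form. `looks_like_b64_cdx` is a hypothetical helper.

```python
import base64
import binascii

# Assumed file signature at the start of a binary CDX file.
CDX_SIGNATURE = b"VjCD0100"


def looks_like_b64_cdx(payload) -> bool:
    """Return True if `payload` is valid base64 that decodes to CDX-like bytes."""
    try:
        raw = base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError):
        return False
    return raw.startswith(CDX_SIGNATURE)


# Fabricated payload: the signature followed by zero bytes, not a real drawing.
fake = base64.b64encode(CDX_SIGNATURE + bytes(20)).decode("ascii")
print(looks_like_b64_cdx(fake))           # True
print(looks_like_b64_cdx("not base64!"))  # False
```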
Property Annotations for SD File Export
Important feature: All visible properties are automatically annotated to the molecules in the CDXML file. If you open the generated CDXML in ChemDraw and save as an SD file, all displayed properties will be included in the SD file output.
Practical Tips
For Large Datasets:
- Process in batches if dealing with thousands of structures
- Calculate optimal rows/columns based on page size and readability
- Consider multiple pages rather than overcrowding a single page
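Splitting a large dataset into pages can be sketched with plain list slicing. `paginate` is a hypothetical helper; the idea is to apply it with the same grid size to both the CDXML list and the property list, then generate and save one slide per page.

```python
def paginate(items, columns: int, rows: int):
    """Split items into consecutive pages of at most columns * rows entries."""
    per_page = columns * rows
    if per_page <= 0:
        raise ValueError("columns and rows must both be positive")
    return [items[i:i + per_page] for i in range(0, len(items), per_page)]


# 120 placeholder items on a 5 x 10 grid give two full pages and one partial.
pages = paginate(list(range(120)), columns=5, rows=10)
print([len(page) for page in pages])  # [50, 50, 20]
```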
Property Display:
- Limit to 2-4 properties per structure for readability
- Use color coding to highlight important values (e.g., red for high activity)
- Include units in property names when showing numerical values
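Color coding can be driven by a small function that maps a property value to a hex color, which is then passed as the color argument of TextProperty. The sketch below is illustrative only: `activity_color`, the threshold, and the assumption that larger numbers mean higher activity are all hypothetical and depend on the assay in question.

```python
def activity_color(value, threshold: float = 1.0,
                   high: str = "#cc0000", default: str = "#000000") -> str:
    """Return `high` (red) for numeric values at or above `threshold`.

    Values that cannot be read as a number get the default color.
    """
    try:
        number = float(value)
    except (TypeError, ValueError):
        return default
    return high if number >= threshold else default


print(activity_color("2.5"))   # #cc0000
print(activity_color("0.3"))   # #000000
print(activity_color("n/a"))   # #000000
```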
Layout Optimization:
- Standard 5×10 grid (50 structures) works well for A3/tabloid size printouts
- For presentations, consider 4×3 or 5×4 grids for better visibility
- Adjust slide_width and slide_height to match intended output format
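To judge whether a grid is readable on a given output format, it helps to estimate how much space each structure gets. The sketch below simply divides the page by the grid; it ignores margins and the room taken by property text, and it does not claim to reproduce how pycdxml lays out cells internally. `PAGE_SIZES_CM` and `cell_size_cm` are hypothetical names, and the 16:9 entry is an approximate conversion of a 13.333 x 7.5 inch slide.

```python
# Common page sizes as (width, height) in centimeters.
PAGE_SIZES_CM = {
    "A4 portrait": (21.0, 29.7),
    "A3 portrait": (29.7, 42.0),
    "16:9 slide": (33.867, 19.05),
}


def cell_size_cm(slide_width: float, slide_height: float,
                 columns: int, rows: int):
    """Return the (width, height) available per grid cell, without margins."""
    return slide_width / columns, slide_height / rows


# The 50 x 70 cm page with a 5 x 10 grid used in this manual.
print(cell_size_cm(50, 70, columns=5, rows=10))  # (10.0, 7.0)

# A presentation-style layout: 16:9 slide with a 4 x 3 grid.
width, height = PAGE_SIZES_CM["16:9 slide"]
print(cell_size_cm(width, height, columns=4, rows=3))
```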
Output and Compatibility
The generated CDXML file is a fully valid ChemDraw document that can be:
- Opened and edited in ChemDraw Desktop
- Converted to PDF, PNG, or other formats via ChemDraw
- Modified by end users (chemists) to adjust layouts, add annotations, or change styles
- Saved as SD files with all property annotations preserved
Known Limitations
- Very old CDX files (~ChemDraw 7 era) may fail to parse correctly
- Complex molecules (organometallics, polymers) may have conversion issues
- The package focuses on "file-level" operations, not chemical validation
- Style translation may not perfectly position additional drawing elements (brackets, arrows) relative to molecules
Complete Example Script
#!/usr/bin/env python
"""
Generate ChemDraw structure sheet from SD file
Usage: python script.py input.sdf output.cdxml
"""
import sys
from rdkit import Chem
from pycdxml import cdxml_slide_generator, cdxml_converter

# Read input and output paths from the command line
input_sdf = sys.argv[1]
output_cdxml = sys.argv[2]

# Load molecules, skipping records RDKit could not parse
suppl = Chem.SDMolSupplier(input_sdf)
molecules = [mol for mol in suppl if mol is not None]

# Convert to CDXML
cdxmls = []
for mol in molecules:
    cdxml = cdxml_converter.mol_to_document(mol).to_cdxml()
    cdxmls.append(cdxml)

# Extract properties
all_props = []
for mol in molecules:
    props = [
        cdxml_slide_generator.TextProperty('ID',
            mol.GetProp("SOURCE_ID"),
            color='#3f6eba'),
        cdxml_slide_generator.TextProperty('Activity',
            mol.GetProp("ACTIVITY"),
            show_name=True)
    ]
    all_props.append(props)

# Generate slide
generator = cdxml_slide_generator.CDXMLSlideGenerator(
    style="ACS 1996",
    number_of_properties=2,
    columns=5,
    rows=10,
    slide_width=50,
    slide_height=70
)
slide = generator.generate_slide(cdxmls, all_props)

# Save output
with open(output_cdxml, "w", encoding='UTF-8') as f:
    f.write(slide)
print(f"Generated {output_cdxml} with {len(molecules)} structures")
References
- pycdxml GitHub repository: https://github.com/kienerj/pycdxml
- Official ChemDraw CDX format specification: https://www.cambridgesoft.com/services/documentation/sdk/chemdraw/cdx/IntroCDX.htm
- RDKit documentation: https://www.rdkit.org/docs/