Introduction
HQS Molecules
is a tool that can be used in conjunction with other HQStage
modules by HQS Quantum Simulations GmbH. HQS Molecules
is a user-friendly Python package designed to streamline computational chemistry workflows. It offers functionalities for converting 2D chemical structures to 3D coordinates, accessing the PubChem database, interfacing with the xTB and CREST programs for energy calculations and conformer searches, and handling XYZ files. These functionalities enhance efficiency, accuracy, and ease of use, making it a valuable resource for chemists and researchers.
Associated Functionality
HQS Molecules
is providing the following functionalities to the users:
- Converting chemical structures from two-dimensional connectivity representations (SMILES strings or Molfiles) to three-dimensional coordinates. This is achieved using the RDKit and Open Babel packages.
- API access to the PubChem database, permitting name-to-structure and structure-to-name searches directly via Python scripts.
- Providing a Python interface for important functionalities of the semi-empirical xTB program, such as single-point energy calculations, geometry optimizations and frequency calculations.
- Performing conformer searches through a Python interface for the CREST program. New: HQS Molecules includes a procedure to refine conformer ensembles obtained via CREST by reassigning equivalent structures and completing the rotamer ensemble, combined with a screening for saddle points.
- Reading and writing XYZ files.
Functionality in this package makes extensive use of Pydantic to define classes used for input and output.
Why it is Useful
-
For quantum-chemical calculations, a 3D structure is essential, providing the spatial coordinates of all atoms. The larger the molecule, the more likely it is to have multiple possible conformations that are in thermodynamic equilibrium.
HQS Molecules
allows chemists and researchers in other fields to easily transition from simple 2D representations of molecules to more complex and informative 3D structures. This is crucial for understanding molecular geometry, predicting behavior and properties, and performing computational chemistry simulations. -
Direct access to the PubChem database enables users to quickly retrieve chemical information and structures by simply providing a chemical name or SMILES string. This streamlines the process of gathering data for research and development, saving time and reducing the potential for errors in manual searches.
-
Integrating xTB functionalities into Python scripts allows for automated and efficient computational chemistry workflows. Users can perform energy calculations, optimize molecular geometries, and conduct frequency analyses seamlessly within their Python environment, enhancing productivity and enabling more complex simulations.
-
Conformer searches are essential for identifying the most stable molecular conformations and understanding the conformational flexibility of molecules. By providing a Python interface for the CREST program with in-house developed post-processing,
HQS Molecules
simplifies the process of conformer searching, making it more accessible and easier to integrate into larger computational workflows. -
XYZ files are a common format for representing molecular structures. The ability to read and write these files ensures compatibility with a wide range of computational chemistry tools and facilitates the exchange of molecular data between different software packages.
-
Pydantic provides data validation and parsing using Python type annotations. By leveraging Pydantic, the package ensures that input and output data are correctly formatted and validated, reducing the likelihood of errors and improving the robustness of the software.
Summary
With HQS Molecules
, we utilize multiple existing packages to cover various configurations with a single tool, allowing users to work flexibly without having to choose between different programs. HQS Molecules
provides a unified input for all available packages and outputs data in a format that can be used for further processing steps in computational chemistry simulations.
Sample Use Cases
HQS Molecules
provides an easy-to-use Python interface to generate three-dimensional structures, optimize geometries, calculate molecular energies or vibration frequencies, and to sample molecular conformations. These functionalities are crucial building blocks in molecular simulation workflows for the chemical and pharmaceutical industries.
Example use cases:
- The input for quantum chemistry calculations, such as density functional theory (DFT), is rarely identical with the content of chemical in-house databases. With functionality as implemented in this package,
HQS
computed thermodynamic properties for a customer using DFT, starting from a text file containing thousands of molecule names. - Converting chemical structures from 2D to 3D representations, finding and comparing geometries, sampling conformers: computational chemistry workflows require many steps before any DFT calculation can even start.
HQS Molecules
combines the established expertise of open-source packages such as RDKit, Open Babel, xTB or CREST behind a Python interface, while improving critical aspects.HQS Molecules
can be used as part of your own workflow to provide the input for a quantum chemistry software, which simulates chemical reactions or calculates spectroscopic parameters. - Quantum algorithms for molecular electronic structure problems normally require a three-dimensional structure, for example in XYZ format, as an input. Obtaining such an XYZ file can
be a nuisance for specialists and non-specialists alike. With
HQS Molecules
, the name of a molecule is all you need to generate three-dimensional geometries for a vast number of catalogued compounds. - As many other packages, the open-source Active Space Finder by
HQS
requires a molecular geometry as part of its input.HQS Molecules
provides starting geometries for further calculations with PySCF and the Active Space Finder.
Further Materials
In addition to this manual, please check the example notebooks for a practical demonstrations of the utilities in this module.
Installation
For general information on installing the module and downloading the example notebooks, please refer to the HQStage documentation.
Please note that
HQS Molecules
is currently only supported on Linux.
Parts of this module require third-party software packages that are not available for installation via pip: Open Babel, xTB and CREST.
The RDKit developers recommend installation via the conda-forge channel. Please note that the package on PyPI, which is installed by the
pip
command, may be outdated.
-
Installation of
RDKit
: To make use of the latest version of the program, RDKit should be installed from the conda-forge channel. Installation via the PyPI server is also possible, but its versions do not seem to be kept up-to-date and compatibility with certain versions of other packages is not guaranteed. -
Installation of
Open Babel
:
For the user to benefit from the full scope of the 2D-to-3D structure conversion functionality, we recommend installingOpen Babel
. Open Babel can be installed via the package managers of various Linux distributions (e.g.,apt
on Debian 12) or alternatively from the conda-forge channel. Note: The 2D-to-3D structure conversion functionality uses the Open Babel package as a fallback option if the structure conversion of a molecule viaRDKit
fails. It is still possible to useHQS Molecules
withoutOpen Babel
. In that case, the functionality will only be able to useRDKit
for structure conversion. -
Installation of
xTB
: To make use of the interface to the xTB program for semiempirical calculations, thextb
binary must be provided by the user in thePATH
variable. We strongly recommend to download the binary release6.7.1
from GitHub, since Hessian calculations with the GFN-FF method are known to fail with version6.7.0
. Besides that, an installation fromconda-forge
is possible, in principle, but it is known to fail for specific calculations (like e.g. Hessian calculations for linear molecules). -
Installation of
CREST
: To make use of the interface to the CREST program for conformer search, thecrest
binary must be provided by the user in thePATH
variable. We recommend downloading the binary release3.0.2
from GitHub.
Molecular Name Lookup
HQS Molecules
provides API access to the PubChem database, permitting name-to-structure and structure-to-name searches directly via Python scripts.
IMPORTANT: If you use the PubChem interface, your data will be sent over the internet to the PubChem servers.
Name-to-Structure Search
Requests to PubChem are made via the PubChem
data class, which stores the data retrieved after completion of each request. In the following example, a request is made using the molecule name:
>>> from hqs_molecules import PubChem
>>> pc = PubChem.from_name("2-methylprop-1-ene")
>>> print(pc.name)
Isobutylene
>>> print(pc.cid)
8255
>>> print(pc.formula)
C4H8
>>> print(pc.smiles)
CC(=C)C
>>>
As shown above, the data retrieved is stored in the attributes
name
for the compound name,smiles
for the SMILES string,cid
for the PubChem compound ID, andformula
for the molecular formula (as aMolecularFormula
object).
Note: the name stored in the name
attribute may differ from the argument supplied for the request: the most entries for a compound on PubChem list multiple synonymous names, but the name
attribute contains the representative title name selected on PubChem.
Structure-to-Name Search
Likewise, a reverse search can be performed using a SMILES string:
>>> pc = PubChem.from_smiles("COC1=C(C=CC(=C1)C=O)O")
>>> print(pc.name)
Vanillin
>>>
Search by PubChem Compound Identifier
Finally, requests can be made using the PubChem compound identifier:
>>> pc = PubChem.from_cid(145742)
>>> print(pc.name)
Proline
>>> print(pc.smiles)
C1C[C@H](NC1)C(=O)O
>>>
An important feature of the PubChem
class is that it verifies the depositor of individual data items retrieved. Data is only made available if PubChem is listed as the data source.
As shown in the section on 2D to 3D structure conversion, a SMILES string can be used to generate a three-dimensional representation.
Structure Conversion
An important functionality of HQS Molecules
is the conversion of two-dimensional molecular structure representations into three-dimensional geometries, as shown here for the example of the glucose molecule.
Simple 2D to 3D Conversion
Input: SMILES strings and Molfiles are supported as input for the functions smiles_to_molecule
and molfile_to_molecule
, respectively.
Output: Both functions return a Molecule
object containing three-dimensional atomic coordinates. In addition, the returned object contains the overall molecular charge, which is always included in a Molfile or a SMILES string. The multiplicity field is set to None
, as it cannot be derived unambiguously from the input.
The structure conversion employs the respective functionality of RDKit
as the first choice, and uses Open Babel
as a backup in case that structure conversion with RDKit fails. After generating a three-dimensional structure, a bonding graph is determined using distance criteria and used to verify the generated structure against the input. If the composition and the bonding graphs do not match, then the structure is rejected.
If the input structure is stored in a molfile named "my_molecule.mol"
, then the conversion is performed by calling:
>>> from hqs_molecules import molfile_to_molecule
>>> mol = molfile_to_molecule("my_molecule.mol")
Likewise, it is possible to perform a conversion of a molecular structure with a SMILES string - for example, one obtained from PubChem:
>>> from hqs_molecules import PubChem, smiles_to_molecule
>>> pc = PubChem.from_name("propane")
>>> mol = smiles_to_molecule(pc.smiles)
An optional check can be carried out with either of the conversion functions by supplying a molecular formula as an argument. The conversion fails if the input does not match the provided formula. In that case, the formula needs to be represented as a MolecularFormula
object.
>>> from hqs_molecules import MolecularFormula, PubChem, smiles_to_molecule
>>> pc = PubChem.from_name("propane")
>>> # succeeds
>>> mol = smiles_to_molecule(pc.smiles, formula=pc.formula)
>>> # raises an exception
>>> mol = smiles_to_molecule(pc.smiles, formula=MolecularFormula.from_str("C3H7-"))
Utilities for RDKit
The HQS Molecules
module includes convenience functions to create RDKit Mol
objects from SMILES strings or Molfiles. These objects represent molecular information within the RDKit package.
Both the smiles_to_rdkit
and molfile_to_rdkit
functions accept an argument addHs
. By default, it is set to True
, causing explicit hydrogen atoms to be added in the generated object. Setting addHs = False
suppresses the addition of explicit hydrogens; only hydrogens that were already explicitly represented within a Molfile are retained.
>>> from hqs_molecules import smiles_to_rdkit
>>> # The object generated contains 11 atoms.
>>> rdkit_mol = smiles_to_rdkit("CCC")
>>> # The object generated contains 3 atoms.
>>> rdkit_mol = smiles_to_rdkit("CCC", addHs=False)
Expert Usage
The functionalities described in the remainder of this section are only intended for expert usage.
An RDKit Mol
object can be converted to a three-dimensional structure by passing it to the function rdkit_to_molecule
. It is a low-level function that calls RDKit without resorting to Open Babel as a backup. Nonetheless, it performs a consistency check for the generated structure. If the RDKit Mol
object was created without the addition of explicit hydrogens (addHs = False
), this conversion may fail due to a composition mismatch.
The low-level functions to perform structure conversion using only Open Babel are available via smiles_to_molecule_obabel
and molfile_to_molecule_obabel
. These functions require a SMILES string or a Molfile as their input, respectively. A separate consistency check of the generated structure is also performed here.
XYZ File Handling
HQS Molecules
provides the functionality for reading and writing XYZ files. XYZ files are a common format for representing molecular structures. The ability to read and write these files ensures compatibility with a wide range of computational chemistry tools and facilitates the exchange of molecular data between different software packages.
Reading and Writing XYZ Files
XYZ files are read using the function xyz.read
. It can be called either with a Path
object or a
string representing a file path; alternatively, it may be provided with a file-like object.
>>> from hqs_molecules import xyz
>>> # providing a file name
>>> molgeom = xyz.read("molecule.xyz")
>>> # providing a file-like object
>>> with open("molecule.xyz", "r") as f:
... molgeom = xyz.read(f)
...
XYZ files contain atomic positions and chemical element symbols, but no information on charge and spin multiplicity. Therefore, the function returns a MolecularGeometry
object with element symbols and atomic positions. It can be easily converted to a Molecule
object by providing information on the charge and, optionally, on the multiplicity:
>>> mol = molgeom.to_molecule(charge=0)
>>> mol = molgeom.to_molecule(charge=0, multiplicity=1)
XYZ files can be written to disk using the function xyz.write
by providing either a MolecularGeometry
or a Molecule
object (the latter is a subclass of the former). The function can either create a file at the specified path, or write to a file-like object.
Converting Strings with XYZ Information
In some cases, the content of an XYZ file may be present in a string, or may need to be converted to a string. Such cases can be handled generally using the aforementioned functions xyz.read
and xyz.write
operating on a io.StringIO
object (instead of a Path
object or a string). However, the HQS Molecules
module also provides convenience functions xyz.parse_string
and xyz.get_string
that read and format strings with XYZ content directly:
>>> # creates a string from a MolecularGeometry or Molecule object
>>> s = xyz.get_string(mol)
>>> # create a MolecularGeometry object from a string with XYZ content
>>> molgeom = xyz.parse_string(s)
Working with Comments in XYZ Files
The XYZ file format permits one single line with an arbitrary comment inside a molecular geometry definition. Some programs use this comment line to store specific information. When importing an XYZ file in HQS Molecules
the comment can be requested by setting the argument return_comment
to True
. This causes the function to return a tuple with the molecular geometry and the comment line.
>>> from hqs_molecules import xyz
>>> molgeom, comment = xyz.read("molecule.xyz", return_comment=True)
Likewise, a comment line can be exported to an XYZ file by providing it in the comment
argument.
>>> xyz.write("output.xyz", molgeom, comment="comment line ...")
Information on charge and spin multiplicity can be written into the comment line of an XYZ file using the function xyz.write_with_info
. Unlike the read
and write
functions, it requires a Molecule
object, as the MolecularGeometry
class lacks information on charge and multiplicity.
XYZ Trajectory Files
Trajectories may be stored in XYZ files simply by concatenating XYZ geometries. Such files can be read using the function xyz.read_geometries
. File paths (represented by a Path
object or a string) as well as file-like objects are supported as its input argument. By default, the function returns a list of molecular geometries. Setting return_comments
to True
causes it to return a list of tuples instead, each tuple containing a geometry and its associated comment.
>>> # Returns a list of geometries.
>>> geom_list = xyz.read_geometries("trajectory.xyz")
>>> # Return of tuples containing geometries and comments.
>>> tuple_list = xyz.read_geometries("trajectory.xyz", return_comments=True)
In order to write a trajectory file, the function xyz.write
is used (as for individual geometries), but with a list of geometries instead of an individual geometry.
>>> xyz.write("output.xyz", geom_list)
If individual comments need to be provided for each geometry, this is best done by working with a file stream, as shown in the example below:
>>> with open("output.xyz", "w") as f:
... for step, geometry in enumerate(geom_list):
... xyz.write(f, geometry, comment=f"step number {step}")
...
Semiempirical Calculations with xTB
This HQS Molecules
module contains a Python interface to perform calculations with the xTB and
CREST software packages.
The xTB (extended tight binding) program package is developed by the research group of Prof. Stefan Grimme at Universität Bonn. It contains implementations of the semi-empirical quantum-mechanical methods GFNn-xTB (n = 0–2) as well as of the GFN-FF force field ("GFN" stands for Geometries, Frequencies, and Non-covalent interactions, for which it has been parametrized). HQS Molecules
contains wrapper functions to perform single-point energy calculations, geometry optimizations, and frequency calculations.
In this section the user will find further details on the associated functions:
- Single-point energy calculations,
- Structure optimizations,
- Gradient calculations,
- Frequency calculations,
- Stable optimizations including Hessian calculations,
- Biased Hessian, and
- Conformer/Rotamer search.
Single-point energy calculations
The energy of any meaningful set of atomic coordinates with a specific charge and multiplicity
can be calculated using the single_point
function. Calculations are performed with the GFN2-xTB method by default, or alternatively with the GFN-FF method.
Input
The first mandatory argument of the function is a Molecule
object, as the charge and multiplicity need to be known to perform a calculation.
Note: the functions for reading an XYZ file return MolecularGeometry
objects, which contain neither charge nor spin information. A MolecularGeometry
objects can be easily converted to a Molecule
object by providing information on the charge and, optionally, on the multiplicity, as described in the XYZ File Handling chapter. On the other hand, the 2D-to-3D structure conversion functions return Molecule
objects with charge but without spin information.
Keyword arguments
It is possible to further refine the calculation with the following keyword arguments:
force_field
: Use GFN-FF force-field (defaults toFalse
, in which case the GFN2-xTB method is used).solvent
: By default, calculations are performed in vacuum. If the name of a solvent is provided, the implicit solvation model ALPB (analytical linearized Poisson-Boltzmann) will be used in the calculation. The available solvents are listed here.create_tmpdir
: Run the calculation in a temporary folder that is deleted with completion of the calculation (the default is set toTrue
).
Output
The single_point
function returns a float with the single-point energy in Hartree.
Example
An example is provided here:
from hqs_molecules import single_point
energy = single_point(molecule, solvent="Water", force_field=False)
Structure optimization
xTB can vary the position of atoms in the system to find the equilibrium structure with the lowest energy.
This functionality is embedded into HQS Molecules
in the optimization
function.
Extra arguments
The function works similarly to the single_point
function but it has two extra arguments:
-
opt_level
with the possible optimization levels as defined by xTB. In this wrapper, the levels are defined in theXTBOptLevel
class. By default, the very tight criterion is employed (XTBOptLevel.VTight
), whereas the xTB software itself usesXTBOptLevel.Normal
. The options are:- Crude
- Sloppy
- Loose
- Lax
- Normal
- Tight
- VTight
- Extreme
-
max_cycles
is the maximum number of optimization iterations. By default, it is set toNone
, which means thatmax_cycles
is automatically determined by the xTB software depending on the size of the molecule (check above link).
Output: Trajectory
Calls to optimization
function return a Trajectory
object containing the optimization trajectory (structures
) as a list of Molecule
objects and the energies (in Hartree) for each step (energies
) as a list of floats.
Most of the time, the quantities of interest are the optimized geometry and its energy. To access these, a Trajectory
object contains two types of attributes: Trajectory.last
and Trajectory.last_energy
contain the last geometry and its energy, respectively; Trajectory.lowest_energy
and Trajectory.lowest
contain the lowest energy and the corresponding geometry. Trajectory.lowest_step
returns the index of the energetically lowest geometry. In many cases, the last step may also be the one with the lowest energy.
The trajectory can be written to an XYZ file using the write
(or write_with_info
) functions of the xyz
module.
Example
The following box illustrates some general ideas behind executing a geometry optimization, starting from a Molecule
object.
from hqs_molecules import optimization, xyz
# perform optimization
opt_data = optimization(molecule, solvent="water")
# Write the trajectory into an XYZ file
xyz.write(file="trajectory.xyz", mol=opt_data.structures)
# Optimized structure: here it is assumed to be in the last step.
opt_molecule = opt_data.last # Molecule object
opt_energy = opt_data.last_energy # float
Gradient calculation
The nuclear gradient (first derivative of the energy with respect to the atomic coordinates) is calculated at each step of an optimization procedure. A necessary condition for an optimization to converge is typically to reduce the (norm of the) gradient to almost zero.
A single-point gradient can be calculated using the gradient
function with the same arguments as a single_point
procedure. It returns a nested list of floating point numbers, representing an N × 3 matrix (where N is the number of atoms in the molecule). Each value in the list corresponds to the derivative of an atomic position with respect to the x, y or z coordinates, represented in units of Hartree/Bohr.
Frequency calculations
The Hessian matrix (a matrix of second derivatives of the energy with respect to the atomic coordinates) is used to determine the nature of a stationary point, vibrational frequencies, molecular vibrational spectra, and thermochemical properties. In xTB, the Hessian matrix is calculated numerically.
Extra argument
HQS molecules
provides the frequencies
wrapper function to calculate vibrational frequencies and thermochemical corrections; note that it does not return the Hessian matrix itself. Besides the arguments already mentioned in the single_point
wrapper, this function also accepts a temperature
argument, which allows the user to define one or multiple temperatures for the calculation of thermochemical corrections. By default, the calculation will be performed at 298.15 K, but other temperature or a list of temperatures can also be specified.
Stationary point
If frequency calculations are meant to be performed for a stationary point (an optimized equilibrium structure or a saddle point), it is important to perform the calculations with the same method as the optimization itself; the potential energy surfaces of different methods have differing stationary points.
Frequency calculations are often used to ascertain if a stationary point is a minimum or a saddle point. A molecule of N atoms has 3N degrees of freedom: three translational degrees, three rotational degrees (two for linear molecules), and 3N − 6 vibrational degrees (or 3N − 5 for linear molecules). For a minimum, those frequencies need to be positive; for a transition state, exactly one of them needs to be imaginary.
Output
The return value is a MolecularFrequencies
object, which contains three important fields:
total_energy
: the total electronic energy of the system (in Hartree).frequencies
: a list with the harmonic frequencies (in cm−1). This list contains 3N elements, where 6 values should be zero (or 5 for linear molecules). By convention, imaginary frequencies are represented as negative numbers.thermochem
: a list of objects with temperature-dependent thermodynamic properties (BasicThermochemistry
). Each of theseBasicThermochemistry
objects contains thetemperature
(in K), the totalenthalpy
and the totalgibbs_energy
(both in Hartree), and theentropy
(in Hartree / K). Note that the electronic energy is included in the enthalpy and in the Gibbs energy; however, the solvation contribution currently is not.
Example
from hqs_molecules import frequencies
thermo_data = frequencies(opt_geom, temperature=[100, 125], solvent="water")
print("Single point energy:", thermo_data.total_energy)
print("Frequencies:", thermo_data.frequencies)
for res in thermo_data.thermochem:
print(
f"At {res.temperature} K: "
f"the enthalpy is {res.enthalpy}, "
f"the entropy is {res.entropy}, and "
f"the Gibbs energy is {res.gibbs_energy} (in Hartree)"
)
Stable optimization with Hessian calculations
With a simple geometry optimization, there is no guarantee of obtaining a minimum on the potential energy surface; a converged structure can also be a saddle point. Vibrational frequencies need to be calculated for a full characterization. Only if all of them are non-negative, a minimum has been found.
If imaginary frequencies are found, a displacement along the associated normal mode is a good starting point for a new optimization. In difficult cases, optimizations followed by frequency calculations may need to be repeated multiple times in order to find a genuine local minimum.
Such a composite calculation is provided by HQS Molecules
with the opt_with_hess
function. It repeats the Hessian and optimization calculations until no imaginary frequencies are found. If the maximum number of restarts (indicated by the max_restarts
keyword, 4 by default) is reached without converging to a minimum, an exception is raised.
Input
Since the function combines optimizations with frequencies, it requires the same arguments as the optimization
function, with an option to provide a temperature
argument (for the thermodynamic properties) and max_restarts
.
Output
The output of optimization with Hessian calculation is a tuple with a Trajectory
object (as for an optimization) and a
MolecularFrequencies
object (as for a frequency calculation). The trajectory concatenates the geometries and energies of all optimizations performed within the procedure. On the other hand, the MolecularFrequencies
object contains information for the converged geometry, determined by the overall lowest energy.
Example
A simple call to this procedure can be made as exemplified below:
from hqs_molecules import opt_with_hess, xyz
molecule = xyz.read("input_geometry.xyz").to_molecule(charge=0, multiplicity=1)
trajectory, frequency_data = opt_with_hess(molecule)
Biased Hessian
Hessian calculations for vibrational frequencies or thermochemical contributions are normally performed at stationary points of the potential energy surface. Therefore, the geometry optimization and the Hessian calculation need to be performed with the same method.
However, Spicher and Grimme proposed a method ("single-point Hessian") to calculate the Hessian for stationary points that were optimized with a different method. Therefore, the geometry can be optimized with a more expensive method, while xTB is used to obtain a quantity resembling the Hessian. In a nutshell, it employs the following steps:
- The input structure is re-optimized under application of a biasing potential, which constrains the atoms to remain close to their original positions.
- The Hessian is calculated at the new, slightly distorted geometry under the application of the biasing potential.
- Finally, the frequencies are re-scaled to reduce the impact of the biasing potential.
HQS Molecules
provides the function biased_hessian
to interface the described method in the xTB package. Its usage is analogous to the function frequencies
described above.
Conformer/Rotamer Search
Many molecules, especially in organic chemistry, have a large degree of conformational freedom. At relevant temperatures, they are not represented by a single minimum on the potential energy surface, but rather by an equilibrium between rapidly interconverting minimum structures. Thus, comparing experimental measurements with theoretical data often requires consideration not only of a single structure, but of an entire thermodynamically accessible conformational ensemble.
CREST (Conformer–Rotamer Ensemble Sampling Tool) is a program used to perform conformer searches
using the methods implemented in the xTB package.
The wrapper function find_conformers
provided by HQS Molecules
interfaces CREST (which uses xTB as its engine)
to search for low-energy conformers and rotamers of a given molecule.
Afterwards, a post-processing is performed to refine the conformer ensemble.
Conformers and Rotamers
Conformers are distinct minima on the potential energy surface that are separated by sufficiently low barriers, such that they can interconvert at a given temperature. Often, this occurs through torsional rotations around single bonds.
In this context, rotamers are defined as interconverting structures that can be represented as permutations of the atoms in a conformer [1]. A simple example of this structures related by the rotation of a methyl group (–CH3) in an organic molecule. Somewhat less intuitively, the two chair structures of cyclohexane are also regarded as rotamers in this context, as their structures can be overlayed.
Input
The find_conformers
wrapper function performs three successive steps:
- Conformer/rotamer ensemble search using CREST.
- Reassignment of the structural ensemble and completion of the rotamer sets.
- Calculation of vibrational frequencies and discarding structures corresponding to saddle points.
It employs the following arguments:
molecule
: mandatory specification of the input molecule as aMolecule
object.solvent
: optional specification of the solvent as a string. By default, the calculation is performed in vacuum.create_tmpdir
: set toTrue
by default, so that the calculation is performed in a temporary directory that is deleted upon finishing the calculation.force_field
: it is set toTrue
by default; set toFalse
to perform the conformer search with GFN2-xTB instead.freq_threshold
: the frequency threshold, in cm−1 below which conformers are considered saddle points and thus discarded.allow_mirror_rotamers
: it is set toFalse
by default, such that mirror images are considered to be different structures and not equivalent rotamers.similarity_tolerances
: the numerical tolerances for bond distances, bond angles, and dihedral angles for structures to be identified as equivalent.do_rotamer_completion
: it is set toTrue
by default, such that it is attempted to complete the set of rotamers using permutational logic.
The second to last argument is of type SimilarityTolerances
, which is a data class containing default values for the absolute and relative tolerances
for bond distances, bond angles, and dihedral angles to be considered equivalent in the structure-matching procedure.
Please note that bond and dihedral angles are compared via the absolute value of the complex exponential of the respective angle difference,
and the tolerances refer to those values.
It is important to emphasize that the function uses GFN-FF by default, as conformer searches are very expensive; in contrast, the wrapper functions for the xTB package use GFN2-xTB by default.
Output
The output is an instance of the ConformerEnsemble
class.
-
Important methods of the class provide access to flat lists of all conformer structures and energies (
get_conformer_structures
,get_conformer_energies
), -
as well as the number of rotamers per conformer (
get_degeneracies
). -
Detailed information on conformers with their individual rotamers is contained in the items of the
conformers
list. Each element of this list is of theConformer
type.- Apart from having the
structure
- and
energy
attributes, - it also has a list of
atom_mappings
dictionaries, which represent the atom permutations of the referencestructure
to the different rotamers.
- Apart from having the
Note that the outcome of calculations with CREST may differ somewhat, even if otherwise identical calculations are repeated. Moreover, the starting geometry affects the conformer search. However, the conformer refinement procedure vastly reduces this uncertainty.
Algorithm
Different workflows have been implemented for CREST over the years. It is worth noting that find_conformers
calls CREST with the static metadynamics simulations algorithm, instead of the computationally less demanding extensive metadynamic sampling utility that is the default when running CREST directly.
Note also that the grouping of conformer and rotamer structures as determined in the CREST calculation is ignored, and all the structures are regrouped by our own procedure as follows:
-
First, a plain list of structures is generated from all conformers and rotamers obtained from CREST.
-
Then, the structures are all compared among each other, grouping them by conformers purely based on their 3D structural parameters (bond lengths, bond angles, and dihedral angles).
-
If they are the same within a given numerical threshold, they are considered to be equivalent structures (rotamers), and the atomic mapping (or permutation) relative to the representative conformer is stored for each of the rotamers instead of all the geometries themselves.
-
Successively, the set of rotamer mappings is checked for completeness by multiplying all permutation matrices and verifying if the result is already present in the set of permutations. That way, a full set of rotamers is determined for each conformer as long as the initial set of structures is sufficiently representative.
-
As a last step of the procedure, structures corresponding to energetic saddle points are discarded.
Example
An example of a conformer/rotamer search, starting from a molecule name:
from hqs_molecules import PubChem, find_conformers, smiles_to_molecule
pc = PubChem.from_name("1-butene")
molecule = smiles_to_molecule(pc.smiles)
conf_data = find_conformers(molecule)
print(f"Found {len(conf_data)} distinct conformers.") # or conf_data.num_conformers
print(f"Conformer 0 has {conf_data.get_degeneracies()[0]} rotamers.")
Important Data Structures
This section provides a short description of important data structures used for input and output
in functions provided by HQS Molecules
. Many of these classes are implemented as Pydantic models. Pydantic provides data validation and parsing using Python type annotations. By leveraging Pydantic, the package ensures that input and output data are correctly formatted and validated, reducing the likelihood of errors and improving the robustness of the software.
The objects described in this section are:
MolecularGeometry
andMolecule
representing molecules in 3D,- Molecular formulas
MolecularFormula
, PubChem
dataclass for the data from PubChem,Trajectory
andMolecularFrequencies
representing output of quantum-mechanical calculations that cannot be stored in elementary data types.ConformerEnsemble
storing results of a conformer search: it is obtained by combining an initial search performed by CREST with a subsequent refinement of the conformer ensemble using techniques developed at HQS.
Representing Molecules in 3D
Representing Atomic Positions
The MolecularGeometry
class contains atomic positions (in Å) and chemical element symbols. Objects of this type are commonly generated in HQS Molecules
by reading an XYZ file. However, it lacks information on charge and spin multiplicity, which are typically needed for quantum-chemical calculations.
Important attributes of MolecularGeometry
objects are
natoms
(representing the number of atoms),symbols
(returning a list of chemical element symbols),- and
positions
(returning an N × 3 array of atomic positions).
Inspection of the class reveals further methods to update atomic positions and create copies of molecules, possibly with updated positions.
>>> from hqs_molecules import MolecularGeometry
>>> help(MolecularGeometry)
Internally, atoms are represented by a list of Atom
objects. These are defined as named tuples containing the element symbol and the position, one tuple per atom. Note that this feature permits the atoms
attribute to be used directly as input for PySCF calculations, as shown in the example below.
>>> from hqs_molecules import smiles_to_molecule
>>> from pyscf.gto import Mole
>>> hqs_mol = smiles_to_molecule("C=C")
>>> pyscf_mol = Mole(atom=hqs_mol.atoms)
Molecules with Charge and Spin
Molecule
is one of the most important classes in the HQS Molecules
package. It is implemented as a subclass of MolecularGeometry
, with the addition of charge
and multiplicity
fields. Objects of Molecule
type are commonly returned by functions performing 2D to 3D structure conversion. An additional attribute is nelectrons
, containing the number of electrons corresponding to the molecular composition and charge.
Molecular formulas (such as H2O or OH−) and molecular structure representations (such as SMILES strings or Molfiles) always contain the total molecular charge, explicitly or implicitly. Therefore, it is vital to preserve the total charge together with three-dimensional representations of molecular structures.
In addition to the charge, quantum-chemical calculations usually also require a specification of the spin multiplicity. Unlike the charge, it is not necessarily straightforward to infer from a molecular structure. Therefore, None
is permitted as a value for the field. Indeed, functions such as smiles_to_molecule
or molfile_to_molecule
never set the field to an integer value themselves.
Knowing the value of the spin multiplicity, the value can be set and validated for a Molecule
object by using the set_multiplicity
method.
>>> from hqs_molecules import smiles_to_molecule
>>> mol = smiles_to_molecule("CCO")
>>> print(mol.multiplicity)
None
>>> mol.set_multiplicity(1)
Molecule(atoms=[...], charge=0, multiplicity=1)
>>> print(mol.multiplicity)
1
Since the set_multiplicity
method returns the object itself in addition to modifying it, calls such as mol = smiles_to_molecule("CCO").set_multiplicity(1)
are possible.
Objects of type MolecularGeometry
can be converted to Molecule
instances using the to_molecule
method, with the charge being mandatory and the multiplicity optional.
Molecular Formulas
Within HQS Molecules
, molecular formulas are represented by MolecularFormula
objects containing the elemental composition and the total charge. For example, formulas from PubChem are converted into this format:
>>> from hqs_molecules import PubChem
>>> pc = PubChem.from_name("Bicarbonate")
>>> pc.formula
MolecularFormula(natoms={'C': 1, 'H': 1, 'O': 3}, charge=-1)
>>>
The class implements __str__
as a conversion of the formula to a string in Hill notation:
>>> f"{pc.formula}"
'CHO3-'
Users can easily create molecular formulas from a string input.
>>> from hqs_molecules import MolecularFormula
>>> formula = MolecularFormula.from_str("MnO4-")
>>> formula
MolecularFormula(natoms={'Mn': 1, 'O': 4}, charge=-1)
>>>
The from_str
constructor can handle some degree of complexity (for example, "CH3COOH"
is interpreted equivalently to "C2H4O2"
), but it cannot process arbitrarily complicated semi-structural formulas. Note that isomers cannot be distinguished, as they have identical elemental compositions.
Data from PubChem
Results from PubChem queries are stored within instances of the PubChem
class. Unlike most other classes described in this section, it is implemented as a dataclass
and not as a Pydantic model.
In practical use, instances of this class would normally be created using methods such as from_name
or from_smiles
. The retrieved data is stored in the fields of the class. A description can be found by executing:
>>> from hqs_molecules import PubChem
>>> help(PubChem)
Output of Quantum-Mechanical Calculations
Molecular Trajectories
Instances of the Trajectory
class, as returned by geometry optimizations with xTB, contain two fields:
- a list of
Molecule
objects that is labeledstructures
, - and the energies of each structure in a list labeled
energies
.
Convenience attributes are implemented for the following properties:
- Obtaining the number of structures through the
length
attribute. - Obtaining the last structure and its energy via the attributes
last
andlast_energy
, respectively. - Identifying the structure with the lowest energy and accessing the structure, its energy and its position in the trajectory with the attributes
lowest
,lowest_energy
andlowest_step
, respectively.
Vibrational Frequencies
A generic representation of computed vibrational frequencies and basic thermochemical properties is contained within the class MolecularFrequencies
. Instances contain
- the total electronic energy (in the
total_energy
field) - and a list of vibrational frequencies (in the field
frequencies
).
Note that the latter are represented as (real) floating-point numbers; by convention, imaginary frequencies are represented as negative numbers.
Additionally, the MolecularFrequencies
class defines
- a field
thermochem
with a list ofBasicThermochemistry
objects.
In addition to vibrational frequencies, programs such as xTB can calculate thermodynamic contributions via a rigid rotor and harmonic oscillator approximation. These contributions are temperature-dependent (while harmonic frequencies are not). Therefore, thermochemical corrections are stored in a list with one item per temperature value. Each BasicThermochemistry
object contains fields enthalpy
, entropy
, gibbs_energy
, and temperature
(representing the temperature used to evaluate the aforementioned properties). Since these quantities are interdependent, only enthalpy, entropy and temperature are stored explicitly, while the Gibbs energy is recomputed upon being accessed.
Note that the Hessian matrix itself is not represented in the MolecularFrequencies
class.
Conformer Search Results
Structures and energies of conformers determined via CREST are stored by HQS Molecules
in a class ConformerEnsemble
, which contains a list of Conformer
objects.
Note that the grouping of conformer and rotamer structures as determined in the CREST calculation is ignored, and all the structures are regrouped by our own procedure, as described in the section on conformer search.
Further information on the attributes of the respective classes can be accessed from within Python:
>>> from hqs_molecules import Conformer, ConformerEnsemble
>>> help(Conformer)
>>> help(ConformerEnsemble)
List of Available Solvents for xTB
By default, quantum-chemical calculations are performed in vacuum. If the name of a solvent is provided, the implicit solvation model will be used in the calculation. The xTB interface uses the ALPB
(analytical linearized Poisson-Boltzmann) method for implicit solvation.
The reference academic publication can be found here.
ALPB
has been parametrized for the following solvents:
- "Vacuum"
- "Acetone"
- "Acetonitrile"
- "Aniline"
- "Benzaldehyde"
- "Benzene"
- "Dioxane"
- "Ethylacetate"
- "Furane"
- "Hexadecane"
- "Nitromethane"
- "Octanol"
- "Octanol (wet)"
- "Phenol"
- "Dichlormethane"
- "Chloroform"
- "CS2"
- "DMSO"
- "Ether"
- "Water"
- "Methanol"
- "THF"
- "Toluene"
- "DMF"
- "Ethanol"
- "Hexane"
The solvent names as given in quotation marks can be provided in the solvent
argument of the xTB/CREST wrapper functions implemented in HQS Molecules
.
Background Information on Structure Conversion
This section provides further information on:
- Background,
- Distance geometry,
- Metric matrix method,
- Distance bounds matrix,
- RDKit Variants.
Background
The given list of references provides a summary of more technical aspects of the 3D structure conversion. More details can be found in the following sources:
- J. M. Blaney, J. S. Nixon, Distance Geometry in Molecular Modeling. Rev. Comput. Chem. 1994, 5, 299–335 (DOI:10.1002/9780470125823.ch6).
- S. Riniker, G. A. Landrum, Better Informed Distance Geometry: Using What We Know To Improve Conformation Generation. J. Chem. Inf. Model. 2015, 55, 2562–2574 (DOI:10.1021/acs.jcim.5b00654).
- S. Wang, J. Witek, G. A. Landrum, S. Riniker, Improving Conformer Generation for Small Rings and Macrocycles Based on Distance Geometry and Experimental Torsional-Angle Preferences. J. Chem. Inf. Model. 2020, 60, 2044–2058 (DOI:10.1021/acs.jcim.0c00025).
- RDKit documentation
Distance Geometry
The structure conversion routine of HQS Molecules
makes use of the open-source cheminformatics
software RDKit, whose source code is available on GitHub.
Given the SMILES string or Molfile, an RDKit molecule object is created and
hydrogen atoms are added explicitly if needed.
Assuming a valid SMILES string is provided in the variable smiles
,
the most basic steps are shown in the following code snipped:
from rdkit.Chem import AllChem
mol = AllChem.MolFromSmiles(smiles)
molH = AllChem.AddHs(mol)
status = AllChem.EmbedMolecule(molH)
As can be seen in the code above, in a first step a molecule object is created (MolFromSmiles
function),
which contains 2D information about the molecule, such as the connectivity.
Since most hydrogen atoms are implicit in skeletal formulas and not included in the 2D structure,
but are indispensable for a 3D representation of the molecule, they need to be added in a second step (AddHs
function).
However, the central function for generating 3D structures is the EmbedMolecule
,
which returns 0
if the structure was generated successfully, and -1
otherwise.
This function computes a "3D embedding" based on variants of a stochastic approach referred to as
distance geometry (DG).
DG differs from standard Monte Carlo or molecular dynamics
techniques in that it directly generates structures to satisfy the geometric constraints of the model
instead of searching (almost) the entire conformational space to find appropriate structures.
As such, it can rapidly find one or more possible solutions but shares with other random methods
the inability to guarantee finding all of them.
Metric Matrix Method
Distance geometry (DG) approach described above is based on the assumption that purely geometric constraints can describe all possible conformations of a molecule, for which upper and lower distance bounds for all pairs of atoms are determined and represented in a distance bounds matrix. However, the key element of the DG technique is the metric matrix, \(\mathbf{G}\), whose elements \(g_{ij}\) can be calculated from the vector product of the coordinates of atoms \(i\) and \(j\), \[ \mathbf{G} = \mathbf{XX}^\text{T} , \] where \(\mathbf{X}\) is a matrix containing the Cartesian coordinates of the atoms. \(\mathbf{G}\) is a square symmetric matrix, and as such it can be decomposed as \[ \mathbf{G} = \mathbf{VL}^2\mathbf{V}^\text{T} , \] where the elements of the diagonal matrix \(\mathbf{L}^2\) are the eigenvalues and the columns of \(\mathbf{V}\) are the eigenvectors of \(\mathbf{G}\). From the last two equations and the fact that \(\mathbf{L}\) has only diagonal entries it can be seen that \[ \mathbf{X} = \mathbf{VL} . \] This means that the Cartesian coordinates \(\mathbf{X}\) can be regenerated by multiplying the square root \(\mathbf{L}\) of the eigenvalues of \(\mathbf{G}\) with the eigenvectors \(\mathbf{V}\). It has thus been shown that it is possible to convert three-dimensional atomic coordinates to the metric matrix \(\mathbf{G}\) and then regenerate them from it.
A crucial aspect in DG is that \(\mathbf{G}\) can also be derived directly from the distance matrix, \(\mathbf{D}\), in which each component \(d_{ij}\) is the distance between atoms \(i\) and \(j\). To this end, \(\mathbf{D}\) is first converted to \(\mathbf{D}_o\), where each element \(d_{io}\) is the distance between atom \( i \) and the center of mass \(o\), \[ d_{io}^2 = \frac1N \sum_{j=1}^{N} d_{ij}^2 - \frac{1}{N^2} \sum_{j=2}^{N} \sum_{k=1}^{j-1} d_{jk}^2 , \] where \(N\) is the number of atoms. \(\mathbf{G}\) can then be calculated using the cosine rule, \( g_{ij} = ( d_{io}^2 + d_{jo}^2 - d_{ij}^2 ) / 2 \), and as seen before, the Cartesian coordinates \(\mathbf{X}\) can be obtained from \(\mathbf{G}\).
Distance Bounds Matrix
Therefore, in order to get conformations of a molecule, it is necessary to provide a distance matrix, \(\mathbf{D}\). Its components can be derived from the distance bounds matrix, which, as stated previously, contains upper and lower distance bounds for all pairs of atoms in a given molecule, determining the geometric constraints of the DG technique. But how can one define those bounds?
For example, imagine a square-shaped molecule with bond lengths of 1.0 Å. The upper and lower bounds for the neighboring atoms can thus be set to 1.0. Assuming that the atoms have van der Waals (vdW) radii of 0.25 Å, then all other lower bounds can be set to 0.5 (the default lower bound between any two atoms is the sum of of their vdW radii). However, the remaining upper bounds are not known, so a default of 100 is commonly chosen. This completes the entries of the distance bounds matrix, and a distance matrix \(\mathbf{D}\) can be generated from it by choosing distances randomly between the corresponding lower and upper bounds and entering them at appropriate places in \(\mathbf{D}\).
However, the upper bound of 100 is problematic since most distances between 0.5 and 100 Å are geometrically impossible given the other distances. Here, one can make use of the triangle inequality relationship, which states that for three points A, B, and C, distance AC can at most be AB + BC, meaning that the length of one side of a triangle must be less than or equal to the sum of the lengths of the other two sides. The inverse triangle inequality states that the minimum distance AC must be at least |AB − BC|.
While these relationships are valid for exact distances, very similar ones hold for upper and lower bounds. For example, applying the triangle inequality relationship to the three atoms A, B, and D of the square molecule indicates that the upper bound between D and B can be at most 2.0 Å, so the default value of 100 is lowered significantly. In general, the triangle inequality lowers some upper bounds, and the inverse triangle inequality raises some lower bounds.
RDKit Variants
The RDKit variants of the Distance Geometry (DG) approach are referred to as "experimental-torsion distance geometry" (ETDG), which includes experimental torsional-angle preferences, and "ETKDG," with additional "basic knowledge" (K) terms such as flat aromatic rings or linear triple bonds. ETKDG performs the following steps:
- Build a distance bounds matrix based on the topology, including 1–5 distances, but without van der Waals scaling.
- Triangle smooth this bounds matrix.
- If step 2 fails, repeat step 1, this time without 1–5 bounds but with vdW scaling, then repeat step 2.
- Pick a distance matrix at random using the bounds matrix.
- Compute initial coordinates from the distance matrix.
- Repeat steps 3 and 4 until the maximum number of iterations is reached or embedding is successful.
- Adjust the initial coordinates by minimizing a "distance violation" error function.
If the molecule consists of multiple fragments, they are embedded separately by default, which means that they are likely to occupy the same region of space.