{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Representation Methods in ChemML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To build a machine learning model, raw chemical data is first converted into a numerical representation. The representation contains spatial or topological information that defines a molecule. The resulting features may be either continuous (molecular descriptors) or discrete (molecular fingerprints)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from chemml.chem import Molecule\n", "from chemml.datasets import load_organic_density\n", "import numpy as np\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating `chemml.chem.Molecule` objects from molecule SMILES\n", "\n", "All feature representation methods available in ChemML require `chemml.chem.Molecule` objects as input." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Importing an existing dataset from ChemML\n", "molecules, target, dragon_subset = load_organic_density()\n", "mol_objs_list = []\n", "for smi in molecules['smiles']:\n", "    mol = Molecule(smi, 'smiles')\n", "    mol.hydrogens('add')\n", "    mol.to_xyz('MMFF', maxIters=10000, mmffVariant='MMFF94s')\n", "    mol_objs_list.append(mol)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Coulomb Matrix](https://doi.org/10.1103/PhysRevLett.108.058301)\n", "\n", "A simple molecular descriptor that mimics the electrostatic interaction between nuclei." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "featurizing molecules in batches of 31 ...\n", "\u001b[1m500/500\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m12s\u001b[0m 24ms/step \n", "Merging batch features ... 
[DONE]\n", " 0 1 2 3 4 5 \\\n", "0 388.023441 67.563795 388.023441 46.773304 71.039377 388.023441 \n", "1 73.516695 12.680660 73.516695 13.507005 15.308384 53.358707 \n", "2 388.023441 10.343694 73.516695 40.719814 5.612250 53.358707 \n", "3 388.023441 72.013552 388.023441 49.045255 31.222555 73.516695 \n", "4 388.023441 34.076060 388.023441 20.740383 20.314923 73.516695 \n", "\n", " 6 7 8 9 ... 1643 1644 1645 1646 \\\n", "0 43.471164 31.884828 23.619673 53.358707 ... 0.0 0.0 0.0 0.0 \n", "1 15.511761 10.387486 7.267909 53.358707 ... 0.0 0.0 0.0 0.0 \n", "2 22.032304 7.173553 20.331941 53.358707 ... 0.0 0.0 0.0 0.0 \n", "3 26.287638 24.264785 15.451307 73.516695 ... 0.0 0.0 0.0 0.0 \n", "4 21.673878 43.473700 12.535104 53.358707 ... 0.0 0.0 0.0 0.0 \n", "\n", " 1647 1648 1649 1650 1651 1652 \n", "0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[5 rows x 1653 columns]\n" ] } ], "source": [ "from chemml.chem import CoulombMatrix\n", "\n", "# The Coulomb matrix type can be sorted (SC), unsorted (UM), unsorted triangular (UT), eigenspectrum (E), or random (RC)\n", "CM = CoulombMatrix(cm_type='SC', n_jobs=-1)\n", "\n", "features = CM.represent(mol_objs_list)\n", "print(features[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Fingerprints from RDKit](https://www.rdkit.org/)\n", "\n", "Molecular fingerprints are a way of encoding the structure of a molecule. The most common type of fingerprint is a series of binary digits (bits) that represent the presence or absence of particular substructures in the molecule. Comparing fingerprints allows you to determine the similarity between two molecules, to find matches to a query substructure, etc." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0 1 2 3 4 5 6 7 8 9 ... 1014 \\\n", "0 0 0 0 0 0 0 1 0 0 0 ... 
0 \n", "1 0 0 0 0 0 0 0 0 0 0 ... 0 \n", "2 0 0 0 0 0 0 0 0 0 0 ... 0 \n", "3 0 0 0 0 0 0 1 0 0 0 ... 0 \n", "4 0 0 0 1 0 0 0 0 1 0 ... 0 \n", "\n", " 1015 1016 1017 1018 1019 1020 1021 1022 1023 \n", "0 0 0 0 0 1 0 0 0 0 \n", "1 0 0 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 0 0 \n", "4 0 0 0 0 1 0 1 0 0 \n", "\n", "[5 rows x 1024 columns]\n" ] } ], "source": [ "from chemml.chem import RDKitFingerprint\n", "\n", "# RDKit fingerprint types: 'morgan', 'hashed_topological_torsion' or 'htt', 'MACCS' or 'maccs', 'hashed_atom_pair' or 'hap'\n", "morgan_fp = RDKitFingerprint(fingerprint_type='morgan', vector='bit', n_bits=1024, radius=3)\n", "features = morgan_fp.represent(mol_objs_list)\n", "print(features[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Molecule tensors from `chemml.chem.Molecule` objects\n", "\n", "Molecule tensors can be used to create neural graph fingerprints using `chemml.models`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tensorising molecules in batches of 100 ...\n", "\u001b[1m500/500\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m7s\u001b[0m 13ms/step \n", "Merging batch tensors ... 
[DONE]\n" ] } ], "source": [ "from chemml.chem import tensorise_molecules\n", "atoms, bonds, edges = tensorise_molecules(molecules=mol_objs_list, max_degree=5, max_atoms=None, n_jobs=-1, batch_size=100, verbose=True)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Matrix for atom features (num_molecules, max_atoms, num_atom_features):\n", " (500, 57, 62)\n", "Matrix for connectivity between atoms (num_molecules, max_atoms, max_degree):\n", " (500, 57, 5)\n", "Matrix for bond features (num_molecules, max_atoms, max_degree, num_bond_features):\n", " (500, 57, 5, 6)\n" ] } ], "source": [ "print(\"Matrix for atom features (num_molecules, max_atoms, num_atom_features):\\n\", atoms.shape)\n", "print(\"Matrix for connectivity between atoms (num_molecules, max_atoms, max_degree):\\n\", edges.shape)\n", "print(\"Matrix for bond features (num_molecules, max_atoms, max_degree, num_bond_features):\\n\", bonds.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Descriptors from RDKit](https://www.rdkit.org/docs/source/rdkit.Chem.Descriptors.html)\n", "\n", "`RDKDesc` computes a comprehensive set of molecular descriptors with RDKit, covering topological, geometrical, electronic, and constitutional properties. The calculation is efficient for large datasets, and either specific descriptors or the full set can be selected when the `RDKDesc` object is created. The resulting features integrate with other RDKit functions and Python workflows." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Calculating RDKit descriptors: 100%|██████████| 500/500 [00:04<00:00, 105.85it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ " MaxAbsEStateIndex MaxEStateIndex MinAbsEStateIndex MinEStateIndex \\\n", "0 8.638741 8.638741 0.039513 -3.852793 \n", "1 8.193508 8.193508 0.286175 -0.599619 \n", "2 8.666998 8.666998 0.049406 -3.394321 \n", "3 8.032986 8.032986 0.013750 -2.742153 \n", "4 9.189644 9.189644 0.166504 -3.306171 \n", "\n", " qed SPS MolWt HeavyAtomMolWt ExactMolWt \\\n", "0 0.816913 70.705882 285.503 266.351 285.067963 \n", "1 0.735869 16.444444 240.222 232.158 240.064725 \n", "2 0.801905 32.818182 313.386 298.266 313.099731 \n", "3 0.749729 51.384615 218.299 208.219 218.007136 \n", "4 0.772983 38.095238 319.415 306.311 319.056152 \n", "\n", " NumValenceElectrons ... fr_sulfonamd fr_sulfone fr_term_acetylene \\\n", "0 94 ... 0 0 0 \n", "1 88 ... 0 0 0 \n", "2 112 ... 0 0 0 \n", "3 72 ... 0 0 0 \n", "4 108 ... 
0 0 0 \n", "\n", " fr_tetrazole fr_thiazole fr_thiocyan fr_thiophene fr_unbrch_alkane \\\n", "0 0 1 0 0 0 \n", "1 0 0 0 0 0 \n", "2 0 1 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 2 0 0 0 \n", "\n", " fr_urea SMILES \n", "0 0 c1nc(C2CSCCS2)sc1CC1CCCC1 \n", "1 0 Oc1nccnc1-c1coc(-c2cnccn2)c1 \n", "2 0 c1cc(-c2ccc(N3CNCN(c4cncs4)C3)nc2)co1 \n", "3 0 Oc1occc1C1(O)CSCCS1 \n", "4 0 c1nc(-c2cocc2C2NCNCN2c2cscn2)cs1 \n", "\n", "[5 rows x 218 columns]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "from chemml.chem import RDKDesc\n", "\n", "rdd = RDKDesc()\n", "features = rdd.represent(mol_objs_list)\n", "print(features[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Descriptors from Mordred](https://github.com/mordred-descriptor/mordred)\n", "\n", "Note: This representation requires Mordred to be installed (see the link above).\n", "\n", "Mordred molecular descriptors are an open-source alternative to the Dragon and RDKit descriptors. The library can generate more than 1800 descriptors, compared with more than 5200 for Dragon and about 200 for RDKit." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 500/500 [00:08<00:00, 58.16it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " nAcid nBase SpAbs_A SpMax_A SpDiam_A SpAD_A SpMAD_A LogEE_A \\\n", "0 0 0 22.998278 2.356990 4.586123 22.998278 1.352840 3.780704 \n", "1 0 0 24.114905 2.401132 4.700060 24.114905 1.339717 3.834684 \n", "2 0 1 30.133660 2.388743 4.753098 30.133660 1.369712 4.046753 \n", "3 0 0 16.550756 2.429396 4.799667 16.550756 1.273135 3.507942 \n", "4 0 2 28.597221 2.464180 4.847996 28.597221 1.361772 4.008031 \n", "\n", " SM1_A VE1_A ... TSRW10 MW AMW WPath WPol \\\n", "0 -5.218048e-15 3.771227 ... 64.739856 285.067963 7.918555 564 19 \n", "1 8.215650e-15 3.859715 ... 64.604946 240.064725 9.233259 624 25 \n", "2 -1.232348e-14 4.414919 ... 
71.264318 313.099731 8.462155 1142 30 \n", "3 6.328271e-15 3.324299 ... 58.294869 218.007136 9.478571 226 18 \n", "4 -1.554312e-15 4.162396 ... 71.560531 319.056152 9.384004 870 29 \n", "\n", " Zagreb1 Zagreb2 mZagreb1 mZagreb2 SMILES \n", "0 88.0 101.0 3.694444 3.777778 c1nc(C2CSCCS2)sc1CC1CCCC1 \n", "1 94.0 110.0 4.555556 4.000000 Oc1nccnc1-c1coc(-c2cnccn2)c1 \n", "2 118.0 139.0 4.666667 4.833333 c1cc(-c2ccc(N3CNCN(c4cncs4)C3)nc2)co1 \n", "3 68.0 80.0 4.284722 2.861111 Oc1occc1C1(O)CSCCS1 \n", "4 114.0 137.0 4.416667 4.638889 c1nc(-c2cocc2C2NCNCN2c2cscn2)cs1 \n", "\n", "[5 rows x 1373 columns]\n" ] } ], "source": [ "from chemml.chem import Mordred\n", "\n", "mord = Mordred()\n", "features = mord.represent(mol_objs_list, quiet=False)\n", "print(features[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Descriptors from PaDELPy](https://pypi.org/project/padelpy/)\n", "\n", "Note: This representation requires PaDELPy (see the link above) and a Java Runtime Environment (JRE 6+) to be installed.\n", "\n", "PaDEL-Descriptor is open-source software for calculating molecular descriptors and fingerprints. Its original release computed 797 descriptors (663 1D/2D, 134 3D) and 10 fingerprint types; the version used in this notebook returns 1875 descriptor columns plus the SMILES column. It is built on the Chemistry Development Kit with some custom implementations, offers both a GUI and a command-line interface, supports multiple file formats, and uses multithreading for efficient calculations." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " nAcid ALogP ALogp2 AMR \\\n", "0 0 1.4476999999999995 2.095835289999999 63.309999999999995 \n", "1 0 -0.6144000000000001 0.37748736000000005 4.1837 \n", "2 0 0.05900000000000016 0.0034810000000000192 31.210299999999997 \n", "3 0 0.16129999999999978 0.026017689999999927 38.772499999999994 \n", "4 0 0.3426999999999998 0.11744328999999985 38.349399999999996 \n", "\n", " apol naAromAtom nAromBond nAtom nHeavyAtom nH ... P2s E1s \\\n", "0 45.34906699999998 5 5 36 17 19 ... \n", "1 32.458344 17 19 26 18 8 ... 
\n", "2 45.60389499999999 16 17 37 22 15 ... \n", "3 28.953929999999986 5 5 23 13 10 ... \n", "4 43.650308999999986 15 16 34 21 13 ... \n", "\n", " E2s E3s Ts As Vs Ks Ds SMILES \n", "0 c1nc(C2CSCCS2)sc1CC1CCCC1 \n", "1 Oc1nccnc1-c1coc(-c2cnccn2)c1 \n", "2 c1cc(-c2ccc(N3CNCN(c4cncs4)C3)nc2)co1 \n", "3 Oc1occc1C1(O)CSCCS1 \n", "4 c1nc(-c2cocc2C2NCNCN2c2cscn2)cs1 \n", "\n", "[5 rows x 1876 columns]\n" ] } ], "source": [ "from chemml.chem import PadelDesc\n", "\n", "padel = PadelDesc()\n", "features = padel.represent(mol_objs_list[:10])\n", "print(features[:5])" ] } ], "metadata": { "kernelspec": { "display_name": "nitin_py312_env", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.13" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 2 }