Principal component analysis (PCA) is a useful tool in the design and planning of chemical libraries. PCA can be used to reveal differences in structural and physicochemical parameters between various classes of compounds by displaying them in a convenient graphical format. Herein, we demonstrate the use of PCA to gain insight into structural features that differentiate natural products, synthetic drugs, natural product-like libraries, and drug-like libraries, and show how the results can be used to guide library design.

1 Introduction

Principal component analysis (PCA) is a mathematical method for dimensionality reduction that allows multidimensional datasets to be visualized using two- or three-dimensional plots with minimal loss of information. When applied in the context of diversity-oriented synthesis, PCA is primarily used to visualize similarities and differences within collections of compounds based on structural and physicochemical parameters, and can be leveraged in library design. Molecular weight, stereocenters, rotatable bonds, hydrophobicity, and aqueous solubility are a few examples of parameters commonly included in such analyses. Herein, we selected 20 structural and physicochemical parameters for analysis based on previously identified correlations of these parameters with oral bioavailability, cell permeability, solubility, and binding selectivity, as well as their ability to distinguish synthetic drugs from natural products (vide infra).
Each compound in our analysis is represented as a 20-dimensional vector defined by the structural and physicochemical parameters. PCA rotates these vectors onto a new set of orthogonal axes called principal components, in which the variance retained from the original data is maximized on each successive principal component. As such, the three-dimensional plot we show in this example retains 75% of the variance from the full 20-dimensional dataset.

PCA can also be used to guide the design of chemical libraries. This is important in drug discovery because current drugs are limited in both structure and function. For example, current small-molecule drugs address only about 1% of the protein targets encoded in the human genome, and half of those target only four protein classes: rhodopsin-like G-protein-coupled receptors, nuclear receptors, and voltage- and ligand-gated ion channels.
In contrast, natural products are known to target a broader range of protein classes and have led to the majority of antibacterials (65%) and anticancer drugs (75%). Therefore, novel libraries of compounds that share the structural features of natural products are attractive for the discovery of lead compounds to evaluate new therapeutic targets.

Along these lines, many macrocycle and medium-ring-containing natural products have compelling biological activities. This key cyclic framework presents functional groups to biological targets in appropriate pharmacophoric conformations. Compared to their corresponding linear congeners, macrocycles can provide increased binding affinity, improved bioavailability, and, in some cases, enhanced cell permeability, which are desirable pharmacological properties in the development of new drugs. However, despite these attractive features, macrocycles and medium rings remain severely underexploited in current drug and probe discovery efforts, due to challenges associated with their synthesis. To address the underrepresentation of these compounds, we have sought to circumvent the inherent limitations of classical cyclization-based strategies for macrocycle and medium-ring synthesis by developing alternative ring-expansion approaches that are tolerant of a broad range of substitution patterns and functional groups.
We recently developed two such methods, both of which can be employed on gram scale, provide products bearing handles for further diversification, and are transferable to parallel synthesis platforms.

[Figure: Recently developed routes to natural product-like macrocycle and medium-ring libraries using ring expansion strategies.]

Herein, we describe the use of PCA to assess how libraries of compounds produced using these synthetic routes compare to natural products and what structural and physicochemical parameters distinguish them from synthetic drugs and drug-like libraries. The information harnessed from PCA can also direct downstream modifications of a scaffold to obtain molecules that are more characteristic of a targeted class, such as natural products. In this example, we show that compounds appearing in the proximity of the drug-like region of the PCA plot can be modified to have greater natural product-like properties by addressing several influential structural and physicochemical parameters.
The relative contributions of structural and physicochemical parameters to each principal component (PC) axis are obtained from the loading data and loading plots produced from PCA. In the analysis presented herein, the number of oxygen atoms, hydrogen bond donors, and hydrogen bond acceptors are among the most influential parameters for PC1. Stereochemical density (the number of stereocenters normalized to molecular weight) and the fraction of sp³-hybridized carbons are large contributors to PC2. We further demonstrate that these structural and physicochemical parameters can be addressed by chemical modifications of our library members to increase their natural product-like character. Subsequent analysis of these modified compounds in PCA demonstrates their increased penetration into natural product-like regions of the plot.
This work illustrates how insights gleaned from PCA can be used in the planning of chemical libraries to probe targeted areas of chemical space.

Create a new MS Excel file containing one column for compound names (Column A) and one column for SMILES codes (Column B). Do not include a header row.
Group the compounds by compound class (such as Drugs, Natural Products). Save the MS Excel file as a Text (tab delimited) (.txt) file that will be used in Subheading 3.1, step 4. Delete the compound names column and save an additional .txt file that contains only SMILES codes, which will be used later for batch processing (Subheading 3.1, step 7).

From Instant JChem, import the tab-delimited text file containing the compound names and SMILES codes by selecting File > Import File, and then click on “Next”. Click on the folder icon next to the “File to import” field, and then navigate to and select the .txt file containing compound names and SMILES codes.
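The two text files saved from MS Excel above could also be produced by a short script. The following is a minimal sketch, not part of the original protocol: the compound names and class prefixes are made up for illustration, and the file names in the comment are hypothetical.

```python
import csv
import io

# Illustrative records grouped by compound class. Names are made up; the
# SMILES strings are aspirin, caffeine, and menthol (no stereochemistry).
compounds = [
    ("Drug_aspirin", "CC(=O)Oc1ccccc1C(=O)O"),
    ("Drug_caffeine", "Cn1cnc2c1c(=O)n(C)c(=O)n2C"),
    ("NP_menthol", "CC(C)C1CCC(C)CC1O"),
]

def to_tab_delimited(rows):
    """Return tab-delimited text with no header row, one compound per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

# File 1: names and SMILES. File 2: SMILES only (for batch processing).
names_and_smiles = to_tab_delimited(compounds)
smiles_only = to_tab_delimited([(smiles,) for _, smiles in compounds])

# In practice each string would be written to disk, for example to
# hypothetical files "compounds.txt" and "smiles_only.txt".
print(names_and_smiles)
print(smiles_only)
```

Underscores are used in the names instead of spaces, since some software does not handle spaces and punctuation in compound names.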
Under “File Format” choose “Delineated text files (.csv,.tab,.txt)”, and then click on “Open”. After Instant JChem has finished scanning the file and indicated the number of fields found, click on “Next”. The “Field details” panel gives a summary of the fields to be imported from the text file.
The structure, molecular weight, molecular formula, and compound names of each entry are displayed by default. The “Monitor import” window will give a summary of the imported data. Once fully processed, click on “Finish”.
Export the table of physicochemical parameters calculated in Instant JChem to an MS Excel file (.xls) by selecting File > Export to File. In the “Specify details” window, click on the purple folder to the right of the “File” field. Name the file and define the file format as “Microsoft Office Excel Workbook (.xls)”. The following window gives the user an option to remove or rearrange columns in the exported file. The “Monitor progress” window summarizes the export process. When complete, click on “Finish”. Open the .xls file containing these physicochemical parameters in MS Excel; additional physicochemical parameters will be added later (Subheading 3.1, steps 8 and 9).
To calculate ALOGPs and ALOGpS from the website’s window, choose “Upload file” and select “Smiles—SMILES file—default” from the drop-down menu and click on “Proceed with file uploading”. Click on “Choose file” and select the.txt file consisting of SMILES codes only that was created in Subheading 3.1, step 2.
Click on “Upload file” and a new pop-up window should be displayed that states “Your file “yourfile.txt” was uploaded successfully.” Close the pop-up window displaying this message and a new window will open that is entitled “results.txt”. Copy the text from this results window and paste it into a new MS Excel file. In the MS Excel file containing the compound names and physicochemical parameters that was created in Subheading 3.1, verify that the compounds are grouped by compound class (such as Drugs, Natural Products), and insert a new row below each group. Each new row will represent the average compound for a given class. Accordingly, name each new row based on the category it will represent, for example “AVG Drug”. For these new rows, fill the cells associated with structural and physicochemical parameter values using the “AVERAGE” function to the left of the cell formula field and select the appropriate cells.
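The per-class “AVG” rows described above amount to a grouped mean. The following is a minimal sketch with made-up parameter values and only three of the parameters; the dictionary keys are illustrative assumptions, not column names from the original spreadsheet.

```python
from statistics import mean

# Made-up values for illustration: one row per compound, grouped by class.
rows = [
    {"name": "Drug_A", "cls": "Drug", "MW": 300.0, "HBD": 1, "O": 2},
    {"name": "Drug_B", "cls": "Drug", "MW": 400.0, "HBD": 3, "O": 4},
    {"name": "NP_A", "cls": "Natural Product", "MW": 500.0, "HBD": 4, "O": 8},
    {"name": "NP_B", "cls": "Natural Product", "MW": 700.0, "HBD": 6, "O": 10},
]
parameters = ["MW", "HBD", "O"]

def class_average_rows(rows, parameters):
    """Build one 'AVG <class>' row per class, mirroring Excel's AVERAGE."""
    classes = []
    for row in rows:
        if row["cls"] not in classes:  # keep classes in order of appearance
            classes.append(row["cls"])
    averages = []
    for cls in classes:
        members = [r for r in rows if r["cls"] == cls]
        avg = {"name": f"AVG {cls}", "cls": cls}
        for p in parameters:
            avg[p] = mean(r[p] for r in members)
        averages.append(avg)
    return averages

avg_rows = class_average_rows(rows, parameters)
for row in avg_rows:
    print(row)
```

Each average row can then be carried through the analysis like any other compound, so that the "average" member of each class appears as a point on the plots.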
Similarly, add two new rows below the last compound and calculate the mean (“AVERAGE”) of each physicochemical parameter for the entire dataset (all compounds), as well as the standard deviation of each parameter using the “STDEV” function. Name the MS Excel sheet/tab “Raw”.

This command gives a table showing the distribution of variance from the full dataset on each principal component. PCA generates as many principal components as there are parameters, but importantly, the majority of variance is represented in the first few components (see refs. 1, 2 for further discussion on PCA and variance). In this example, the first three principal components (PC1–PC3) retain 75% of the variance from the 20-dimensional dataset.
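The command that produces the variance table is not shown in this section. Assuming the output reports one standard deviation per principal component (as R's summary of a `prcomp` result does), the proportion and cumulative proportion of variance can be recomputed as in the sketch below. The standard deviations are hypothetical values for a five-component toy case, not those of the 20-parameter analysis.

```python
def variance_summary(pc_standard_deviations):
    """Return (proportions, cumulative proportions) of variance per PC."""
    variances = [sd ** 2 for sd in pc_standard_deviations]
    total = sum(variances)
    proportions = [v / total for v in variances]
    cumulative = []
    running = 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    return proportions, cumulative

# Hypothetical standard deviations, sorted from PC1 downward.
sdevs = [2.0, 1.5, 1.0, 0.5, 0.5]
proportions, cumulative = variance_summary(sdevs)
for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PC{i}: proportion={p:.3f} cumulative={c:.3f}")
```

Reading down the cumulative column shows how many components are needed to retain a chosen share of the variance.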
As such, PC1–PC3 can be used to construct a set of two-dimensional plots that will allow the visualization of the data in a more intuitive manner while still retaining the majority of the information from the full dataset. We will therefore focus our remaining analysis on PC1–PC3. Open the MS Excel file that contains the “Raw” and “Norm” sheets/tabs (Subheading 3.2, step 2), and create a new sheet/tab within that MS Excel file, naming it “PCA”.
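The step that creates the “Norm” sheet is not shown in this section. A common choice, assumed here rather than taken from the source, is to center each parameter on its dataset mean and divide by its sample standard deviation (the quantity Excel's STDEV returns). A minimal sketch with made-up molecular weights:

```python
from statistics import mean, stdev

def z_scores(values):
    """Center on the mean and scale by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mu) / sigma for v in values]

# Made-up molecular weights for five compounds.
raw_mw = [180.2, 194.2, 156.3, 853.9, 733.9]
norm_mw = z_scores(raw_mw)
print([round(z, 3) for z in norm_mw])
```

Scaling in this way keeps parameters with large numeric ranges, such as molecular weight, from dominating the principal components. A parameter that is identical for every compound has zero standard deviation and would need to be dropped before this step.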
To transfer the scores data obtained from “R” to MS Excel, first open the scores.txt file in a text editor. At the beginning of the document, add a column header such as “Compound” followed by a tab. Next, copy all of the text in the file and paste it into the MS Excel sheet/tab named “PCA”. Change the number format (Format Cells) of the PCA cells to three decimal places.
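The header fix and number formatting above can also be done before the data reach MS Excel. This sketch assumes the scores file is tab-delimited, unquoted, and has a header row that lacks a label for the compound-name column; the file content shown is a toy stand-in, not real output.

```python
def add_compound_header(scores_text, header="Compound"):
    """Prefix the first line with a label for the row-name column."""
    return header + "\t" + scores_text

def round_scores(scores_text, places=3):
    """Format every numeric field after the first column of each data row."""
    lines = scores_text.splitlines()
    out = [lines[0]]  # header row is kept unchanged
    for line in lines[1:]:
        name, *values = line.split("\t")
        formatted = [f"{float(v):.{places}f}" for v in values]
        out.append("\t".join([name] + formatted))
    return "\n".join(out)

# Toy stand-in for the scores file (hypothetical values).
scores = "PC1\tPC2\nDrug_A\t-1.23456\t0.5\nNP_A\t2.0\t-0.33333\n"
fixed = round_scores(add_compound_header(scores))
print(fixed)
```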
In MS Excel, plot PC1 vs. PC2 from the “PCA” sheet/tab by selecting the columns and clicking on the “Chart Wizard” icon in the Standard Toolbar (View > Toolbars > Standard). Under “Standard Types” choose “Scatter XY” and click on “Next”. Enter series information (e.g., Drugs, Natural Products) under the “Series” tab and fill the X- and Y-values data fields with the corresponding range for each series (for example, PC1 data for X-values and PC2 data for Y-values). When done entering the series information, click on “Next”. Enter a title for the plot and labels for the axes and then click on “Next”. Select “As object in PCA” and click on “Finish”.
Examine the PCA plots together with the loading plots to identify structural and physicochemical parameters that determine where a particular compound or a collection of compounds appears on the PCA plot. For example, many natural products appear to the left (negative PC1) of the library members. The corresponding loading plot indicates that HBA, tPSA, and O are all major components of PC1.
Recall that each PC axis is a linear combination of the structural and physicochemical parameters. The coefficients for each parameter used for a given PC axis were also obtained in Subheading 3.2, step 4. This table provides more quantitative information regarding the parameters that have the greatest impact on the location of a compound with respect to each PC axis. Consider the structural and physicochemical parameters that are the most important in differentiating a collection of compounds from the targeted region of chemical space.
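Because each PC axis is a linear combination of the parameters, a compound's position on that axis is the sum of its normalized parameter values weighted by the coefficients (loadings). The sketch below illustrates this with three of the parameters named in the text; the loadings and normalized values are hypothetical, and a real calculation would sum over all 20 parameters.

```python
def pc_score(normalized_values, loadings):
    """Score on one PC axis: sum of loading * normalized parameter value."""
    return sum(loadings[name] * normalized_values[name] for name in loadings)

# Hypothetical PC1 coefficients for three parameters.
pc1_loadings = {"O": -0.5, "HBA": -0.5, "tPSA": -0.4}

# Hypothetical normalized values for a parent olefin and its diol.
olefin = {"O": -0.2, "HBA": -0.1, "tPSA": -0.3}
diol = {"O": 0.6, "HBA": 0.7, "tPSA": 0.5}

print(round(pc_score(olefin, pc1_loadings), 3))
print(round(pc_score(diol, pc1_loadings), 3))
```

With these made-up numbers, raising O, HBA, and tPSA moves the compound toward negative PC1, which shows how a large coefficient identifies a parameter worth modifying.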
In this example, O, HBD, HBA, tPSA, nStereo, aqueous solubility, and Fsp³ are among the most influential in distinguishing our library compounds from natural products. The introduction of additional oxygen atoms and additional stereocenters to our library compounds would likely lead to more natural product-like compounds. A dihydroxylation of the olefins contained in our medium-ring library members would address all of the parameters mentioned above and should result in compounds that are shifted towards natural products in our PCA plots.
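The protocol judges such a shift visually from the plots. One simple way to put a number on it, offered here as an assumption rather than as part of the original method, is the Euclidean distance in PC1–PC3 space between a compound and the average natural product. All coordinates below are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two points in PC1-PC3 space."""
    return math.dist(a, b)

# Hypothetical PC1-PC3 scores (not values from the analysis in the text).
avg_natural_product = (-2.0, 1.0, 0.5)
parent_olefin = (1.5, -0.5, 0.0)
diol_derivative = (0.0, 0.5, 0.5)

before = distance(parent_olefin, avg_natural_product)
after = distance(diol_derivative, avg_natural_product)
print(f"before={before:.2f} after={after:.2f} closer={after < before}")
```

A smaller distance after modification indicates movement toward the natural product average, although it captures only the three retained components.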
Leveraging this information, we proceed with the dihydroxylation of multiple medium-ring compounds to produce a collection of PCA-directed derivatives of our initial medium-ring library. Include the collection of modified compounds in a new analysis to evaluate the effectiveness of the PCA-directed modifications in targeting the desired region of the plot. In this example, our diol products have structural and physicochemical parameters that are more consistent with natural products compared to their parent olefins, and as a result, the compounds are shifted towards natural products in all of the PCA plots. Repeat this process as necessary to provide a library with the desired structural and physicochemical properties.

We thank Tony D. Davis (MSKCC) for suggesting inclusion of the logD, van der Waals surface area, and relative polar surface area parameters, and for providing modifications of this protocol for Windows users.
Instant JChem was generously provided by ChemAxon. Financial support from the NIH (P41 GM076267 to D.S.T., P41 GM076267-03S1 to R.A.B., T32 CA062948-Gudas to T.A.W.), Starr Foundation, Tri-Institutional Stem Cell Initiative, Alfred P. Sloan Foundation (Research Fellowship to D.S.T.), Deutscher Akademischer Austauschdienst (DAAD, postdoctoral fellowship to F.K.), William H. Goodwin and Alice Goodwin and the Commonwealth Foundation for Cancer Research, and the MSKCC Experimental Therapeutics Center is gratefully acknowledged.

1. In this analysis, we used 40 top-selling drugs, 60 diverse natural products, 20 drug-like library members, 23 macrocycle natural products, 32 synthetic macrocycles, 20 medium-ring natural products, 38 synthetic medium rings, and 25 cyclohexadienone precursors to those medium rings. In the analysis described in Subheading 3.3, an additional eight synthetic medium-ring diols were included.

2. Much of the raw data used in this analysis is available from the Supplementary Information for refs.

3. To obtain the SMILES codes in ChemBioDraw, select the chemical structures and choose Edit > Copy As > SMILES. Paste the SMILES codes into an MS Word document. The compounds in the string are separated by a period (“.”), and can be converted to a table format in MS Excel by saving the MS Word document as a text file (.txt) and importing the data in MS Excel using Data > Get External Data. Select the text file, choose the “Delimited” option in Step 1 of the Text Import Wizard, and in Step 2 of the Wizard specify the delimiter as a period (“.”) in the “Other” field. The imported SMILES codes can be transposed (flipped from row to column format) by copying them, then selecting Edit > Paste Special, and clicking on the “Transpose” option.

4. Some software does not handle spaces and punctuation in compound names, but underscores can be used instead. For Windows 7 users, spaces are allowed.

5. Make sure that “IJC Project (with local database)” is highlighted in the “Projects” panel.

6. Make sure that “localdb as admin” is highlighted in the “Projects and schemas” panel.
7. Remove any undesired fields using the “.

Where file path is the entire file path beginning with the drive (usually C:). Users can obtain the file path by dragging and dropping the text file directly into R, which returns an error message but reports the file location including the drive.

17. To transfer the “R” output to MS Excel, copy the first section of the table without the column headers (for example, the PC1–PC8 data before the first section break) and paste it into an MS Word document. Change the font to Courier and the size to 5 pt such that the text in the document resembles a table. Save the file as a Text-only (.txt) file. From MS Excel, import the data using Data > Get External Data. Select the file, then choose the “Fixed width” option in Step 1 of the Text Import Wizard, click on “Next”, verify column divisions, and then click on “Finish”.

18. Several PC axes were inverted in this example to maintain resemblance to our previous PCA plots by adding the “ylim” and “xlim” axis limit options to the “biplot” command in “R”. This does not impact data interpretation because the signs of all PC axes are arbitrary.

19. If desired, loading plots can also be produced where the scores (compound names) are hidden. To do this, replace “gray” with “white” in the biplot command.

20. The plot must be saved before an additional plot command is entered.

21. For Windows users, the file path information in the write.table command will be different and needs to include the drive (such as C:).

22. If desired, change the appearance of the plot by right-clicking on the object you wish to modify, and then select the Format option.
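The period-splitting and transposing described in Note 3 can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' procedure: the input string stands in for what is copied from the structure editor, and the function name is made up.

```python
def split_smiles(copied_string):
    """Split a period-separated SMILES string into one entry per compound."""
    return [part for part in copied_string.strip().split(".") if part]

# Three structures copied together (ethanol, benzene, acetic acid).
copied = "CCO.c1ccccc1.CC(=O)O"
column = split_smiles(copied)
for smiles in column:
    print(smiles)
```

Because a period in SMILES marks disconnected components, any single entry that is itself a salt or mixture would also be split by this approach (and by the Text Import Wizard route), so such entries need to be checked by hand.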