Python API#

The Python code is organized in two packages, jp2rt.ml and jp2rt.java, which are described in the following sections.

Machine learning related functions#

The jp2rt.ml module contains functions related to machine learning aspects.

Data I/O#

This package relies on the tab separated values format, so two functions are provided to load molecular descriptors and retention times from files in that format. The following convention is assumed:

  • the files do not contain any header (but can have a different number of fields per row)

  • the retention time is the first field of the row,

  • the SMILES value (if needed) is the last non numeric field of the row (that is, a value for which the conversion with float raises ValueError),

  • the molecular descriptors (if present) are the last fields of the row and are preceded by a non numeric value (for example, the SMILES value).
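The convention above can be illustrated with a small parser. This is only a sketch of the rules as stated, not the package's implementation; the helper names `_is_numeric` and `split_row` are hypothetical.

```python
def _is_numeric(value):
    """Return True if float() accepts the value, as in the convention above."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def split_row(line):
    """Split one TSV row into (retention_time, smiles, descriptors).

    Hypothetical helper, written only to illustrate the convention.
    """
    fields = line.rstrip("\n").split("\t")
    # Positions of the non numeric fields; the last one is taken as the SMILES.
    text_positions = [i for i, value in enumerate(fields) if not _is_numeric(value)]
    if not text_positions:
        raise ValueError("no non numeric field (such as a SMILES) found in row")
    last_text = text_positions[-1]
    # The retention time, when present, is the first field.
    retention_time = float(fields[0]) if _is_numeric(fields[0]) else None
    smiles = fields[last_text]
    # The descriptors are the numeric fields that follow the SMILES.
    descriptors = [float(value) for value in fields[last_text + 1:]]
    return retention_time, smiles, descriptors
```

For instance, the row `12.5<TAB>CCO<TAB>1.0<TAB>2.5` is split into the retention time `12.5`, the SMILES `CCO`, and the descriptor values `[1.0, 2.5]`.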

The following function is needed both when estimating a model and when predicting retention times. Since molecular descriptor values are usually added using add_descriptors_via_tsv() (or its command line equivalent), the convention above is satisfied.

load_descriptors(path)

Loads molecular descriptors values from a file and returns them as a numpy array.

The input file must be in tab separated format, must not have a header, and the descriptors must be the last fields of every row, preceded by a non numeric field (for instance, a SMILES).

Parameters:

path (str) – The path of the tab separated file to read.

Returns:

the descriptor values.

Return type:

numpy.array

If, on the other hand, the retention times also need to be loaded, as when estimating a model, the following function is used.

load_retention_times(path)

Loads retention time values from a file and returns them as a numpy array.

The input file must be in tab separated format, must not have a header, and the retention time must be the first field of every row.

Parameters:

path (str) – The path of the tab separated file to read.

Returns:

the retention time values.

Return type:

numpy.array
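The two loaders can be combined as in the following sketch, which assumes that descriptors and retention times are stored in the same file (the convention above allows this, but it is an assumption here). The helper name `load_training_data` is hypothetical; the import is deferred so that the definition itself does not require jp2rt to be installed.

```python
def load_training_data(path):
    """Load descriptors (X) and retention times (y) from the same TSV file.

    Hypothetical helper: only load_descriptors() and load_retention_times()
    come from the documentation above.
    """
    # Deferred import: jp2rt is needed only when the function is called.
    from jp2rt.ml import load_descriptors, load_retention_times

    X = load_descriptors(path)
    y = load_retention_times(path)
    # Both arrays should have one entry per row of the file.
    if len(X) != len(y):
        raise ValueError(f"{len(X)} descriptor rows but {len(y)} retention times")
    return X, y
```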

Model I/O#

The following two functions can be used, respectively, to load and save a model.

By model in jp²rt we usually mean any instance of a Supervised learning model, or in a broader sense a class implementing the scikit-learn estimator interface (a condition that can be checked using the check_estimator() function).

Models are saved in a compressed archive (as defined in the PKZIP Application Note); the archive contains the model itself (as serialized by Joblib) and a manifest file that makes it possible to check that the model refers to the correct jp²rt version.
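The idea of an archive that holds a serialized model together with a version manifest can be sketched with the standard library alone. Everything specific in the sketch is an assumption made for illustration: the member names `manifest.json` and `model.pkl`, the JSON manifest layout, the version string, and the use of `pickle` in place of Joblib. It does not describe the actual jp²rt file format.

```python
import io
import json
import pickle
import zipfile

# Hypothetical version string, used only for this illustration.
FORMAT_VERSION = "0.0.0"


def save_archive(model, path, version=FORMAT_VERSION):
    """Write the model and a manifest to a ZIP archive; return the bytes written."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.json", json.dumps({"version": version}))
        archive.writestr("model.pkl", pickle.dumps(model))
    with open(path, "wb") as out:
        return out.write(buffer.getvalue())


def load_archive(path, version=FORMAT_VERSION):
    """Read the model back, refusing archives written for another version."""
    with zipfile.ZipFile(path) as archive:
        manifest = json.loads(archive.read("manifest.json"))
        found = manifest.get("version")
        if found != version:
            raise ValueError(f"archive version {found!r} does not match {version!r}")
        return pickle.loads(archive.read("model.pkl"))
```

Checking the manifest before deserializing means that an archive written for a different version is rejected without unpickling anything. As with any pickle based format, such an archive should only be loaded from a trusted source.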

The following function is usually required when predicting the retention times.

load_model(path)

Loads a model from a file.

Parameters:

path (str) – The path of the file to load the model from.

Returns:

The loaded model.

Return type:

sklearn.base.BaseEstimator
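A prediction step could then look like the following sketch. The wrapper name `predict_retention_times` is hypothetical, and the call to `predict()` assumes that the loaded model is a regressor following the scikit-learn estimator interface, as described above.

```python
def predict_retention_times(model_path, descriptors_path):
    """Predict retention times for the molecules listed in a TSV file.

    Hypothetical helper: only load_model() and load_descriptors() come
    from the documentation above.
    """
    # Deferred import: jp2rt is needed only when the function is called.
    from jp2rt.ml import load_descriptors, load_model

    model = load_model(model_path)
    X = load_descriptors(descriptors_path)
    # predict() is the scikit-learn method that returns one value per row of X.
    return model.predict(X)
```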

On the other hand, the following function is usually required at the end of the model estimation process.

save_model(model, path)

Saves a model to a file.

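Parameters:
  • model (sklearn.base.BaseEstimator) – The model to save.

  • path (str) – The path of the file to save the model to.

Returns: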

The number of bytes written to the file.

Return type:

int

Model estimation and evaluation#

Model validation is usually performed using cross validation and visual inspection.

As shown in Estimate the model, the following function assists in the validation, and hence the selection, of a model.

evaluate_model(model, X, y, n_splits=5, prob=0.95)

Evaluates a model using cross validation and visual inspection.

Parameters:
  • model (sklearn.base.BaseEstimator) – The model to evaluate.

  • X (numpy.array) – The input data.

  • y (numpy.array) – The target values.

  • n_splits (int, optional) – The number of splits to use for cross validation. Defaults to 5.

  • prob (float, optional) – The confidence level of the interval shown in the residuals distribution plot. Defaults to 0.95.

Returns:

A dictionary with the evaluation results.

Return type:

dict
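To make the role of n_splits concrete, the following sketch shows plain k-fold cross validation on a toy problem. It does not reproduce evaluate_model(): the function names are hypothetical, the folds are contiguous and unshuffled, and the "model" is a baseline that always predicts the mean of the training targets.

```python
from statistics import mean


def k_fold_indices(n_samples, n_splits=5):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds receive one extra sample.
    sizes = [
        n_samples // n_splits + (i < n_samples % n_splits)
        for i in range(n_splits)
    ]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def cross_validated_mae(y, n_splits=5):
    """Mean absolute error, per fold, of a baseline predicting the training mean."""
    scores = []
    for train, test in k_fold_indices(len(y), n_splits):
        prediction = mean(y[i] for i in train)
        scores.append(mean(abs(y[i] - prediction) for i in test))
    return scores
```

Each sample is used for testing exactly once, so the per-fold scores together give an error estimate based only on data the model was not fitted on.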

Performing model definition, evaluation and training can be a complex task. The following convenience function is provided to simplify the process.

simple_ensemble_model_estimate(regressor_name, X, y)

Trains a simple ensemble model using the given regressor and the input data.

Parameters:
  • regressor_name (str) – The name of the regressor to use.

  • X (numpy.array) – The input data.

  • y (numpy.array) – The target values.

Returns:

The trained model.

Return type:

sklearn.base.BaseEstimator

As the name suggests, the function considers only ensemble models; the valid values for the regressor_name parameter are listed by the following function.

list_ensemble_models()

Lists the available ensemble models names.
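Putting the functions of this module together, an estimation step might be sketched as follows. The wrapper name `estimate_and_save` is hypothetical, and the sketch assumes that list_ensemble_models() returns a collection of names that supports the `in` operator and that descriptors and retention times are stored in the same file.

```python
def estimate_and_save(regressor_name, tsv_path, model_path):
    """Train a simple ensemble model from a TSV file and save it.

    Hypothetical helper: the jp2rt.ml functions it calls are the ones
    documented above. Returns whatever save_model() returns, that is,
    the number of bytes written.
    """
    # Deferred import: jp2rt is needed only when the function is called.
    from jp2rt.ml import (
        list_ensemble_models,
        load_descriptors,
        load_retention_times,
        save_model,
        simple_ensemble_model_estimate,
    )

    # Fail early, with the list of valid names, on an unknown regressor.
    valid_names = list_ensemble_models()
    if regressor_name not in valid_names:
        raise ValueError(f"unknown regressor {regressor_name!r}; valid: {valid_names}")

    X = load_descriptors(tsv_path)
    y = load_retention_times(tsv_path)
    model = simple_ensemble_model_estimate(regressor_name, X, y)
    return save_model(model, model_path)
```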

Java bridge functions#

The jp2rt.java module contains functions that help bridge the Java implementation and the Python code. Java code execution in Python is made possible by SciJava, which in turn is built on [JPype](https://jpype.readthedocs.io/).

For bulk operations it is better to avoid a Python/Java conversion of data for each molecule; for this reason the following function works by reading and writing the data from tab separated values files.

add_descriptors_via_tsv(src, dst)

Adds molecular descriptors given the SMILES.

The input file must be in tab separated format, must not have a header, and the SMILES must be the last field of every row. All the fields of the input file are copied to the output file, followed by the molecular descriptor values.

Parameters:
  • src (str) – Path to the tab separated values file containing the SMILES.

  • dst (str) – Path to the tab separated values file to write the molecular descriptors.
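The relation between input and output rows can be illustrated in pure Python. The sketch below is not how the function is implemented (the real descriptors are computed by the Java code); `append_columns` and `toy_descriptors` are hypothetical names, and the toy values merely stand in for actual molecular descriptors.

```python
def toy_descriptors(smiles):
    """Stand-in for real descriptors: string length and count of 'C' characters."""
    return [float(len(smiles)), float(smiles.count("C"))]


def append_columns(src_lines, compute=toy_descriptors):
    """Yield each input row followed by the values computed from its last field."""
    for line in src_lines:
        fields = line.rstrip("\n").split("\t")
        # By convention, the SMILES is the last field of the input row.
        values = compute(fields[-1])
        # All input fields are kept, and the new values are appended after them.
        yield "\t".join(fields + [str(value) for value in values]) + "\n"
```

With the toy stand-in, the input row `1.5<TAB>CCO` becomes `1.5<TAB>CCO<TAB>3.0<TAB>2.0`, which again satisfies the convention expected by load_descriptors(): the appended values are the last fields and are preceded by the non numeric SMILES.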

The above function computes a subset of the molecular descriptors available in CDK (for a discussion on how such descriptors are selected, see the dedicated appendix section); the class and value names of the considered descriptors are returned by the following function.

descriptors()

Returns the list of computed descriptors.

The function returns a dictionary that has the descriptor class names as keys and the names of the values computed by each class as values.
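Such a dictionary can be flattened into one label per computed value, for instance to name the columns of a descriptors file. The sketch assumes that each dictionary value is a list of names; the helper `column_names` and the class names in the example are hypothetical and are not actual output of descriptors().

```python
def column_names(descriptor_map):
    """Flatten {class name: [value names]} into 'ShortClassName.value' labels."""
    names = []
    for class_name, value_names in descriptor_map.items():
        # Keep only the part after the last dot of a fully qualified class name.
        short_name = class_name.rsplit(".", 1)[-1]
        names.extend(f"{short_name}.{value}" for value in value_names)
    return names


# Illustrative input with made-up class and value names.
example = {
    "org.example.FooDescriptor": ["a", "b"],
    "org.example.BarDescriptor": ["c"],
}
labels = column_names(example)
```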

If one is interested in computing the set of descriptors for a single molecule, given its SMILES, the following function can be used.

compute_descriptors(smiles)

Computes the descriptor values for the given SMILES.

Parameters:

smiles (str) – The SMILES of the molecule whose descriptors should be computed.

Returns:

the descriptor values.

Return type:

list of float
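Combined with a saved model, this allows predicting the retention time of a single molecule, as in the following sketch. The wrapper name `predict_single` is hypothetical, and the sketch assumes that compute_descriptors() returns the values in the same order as the descriptor columns the model was trained on.

```python
def predict_single(model_path, smiles):
    """Predict the retention time of one molecule given its SMILES.

    Hypothetical helper: only load_model() and compute_descriptors()
    come from the documentation above.
    """
    # Deferred imports: jp2rt is needed only when the function is called.
    from jp2rt.java import compute_descriptors
    from jp2rt.ml import load_model

    model = load_model(model_path)
    values = compute_descriptors(smiles)
    # scikit-learn estimators expect a 2D input, one row per molecule.
    return model.predict([values])[0]
```

For many molecules, the file based add_descriptors_via_tsv() remains preferable, since it avoids a Python/Java conversion for each molecule.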

In case one needs to compute a single descriptor (whatever the descriptor is, even if it does not belong to the selected ones) the following function can be used.

compute_single_descriptor(name, smiles)

Computes the values of a given descriptor for the given SMILES.

Parameters:
  • name (str) – The name of the class of the descriptor to compute.

  • smiles (str) – The SMILES of the molecule whose descriptor should be computed.

Returns:

the descriptor values.

Return type:

list of float

More detailed information about the Java code can be found in the Java API documentation.