Python API
The Python code is organized in two packages, jp2rt.ml and jp2rt.java, which are described in the following sections.
Machine learning related functions
The jp2rt.ml module contains functions related to machine learning aspects.
Data I/O
This package relies on the tab separated values format; for this reason, two functions are provided to load molecular descriptors and retention times from files in that format. The following convention is assumed:
- the files do not contain any header (but rows can have different numbers of fields),
- the retention time is the first field of the row,
- the SMILES value (if needed) is the last non numeric field of the row (that is, a value for which the conversion with float raises ValueError),
- the molecular descriptors (if present) are the last fields of the row and are preceded by a non numeric value (for example, the SMILES value).
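The convention above can be made concrete with a small, standard-library-only sketch. The helpers `is_numeric` and `split_row` are hypothetical names introduced here for illustration; they are not part of jp2rt, and the library's own loaders may handle edge cases differently.

```python
def is_numeric(field):
    """Return True when float() accepts the field, as in the convention above."""
    try:
        float(field)
        return True
    except ValueError:
        return False


def split_row(line):
    """Split one tab separated row into (retention_time, smiles, descriptors).

    Hypothetical helper for illustration only; it is not part of jp2rt.
    """
    fields = line.rstrip("\n").split("\t")
    # Index of the last non numeric field (the SMILES, when present).
    last_text = max(
        (i for i, f in enumerate(fields) if not is_numeric(f)), default=None
    )
    # The retention time is the first field, when that field is numeric.
    retention_time = float(fields[0]) if is_numeric(fields[0]) else None
    if last_text is None:
        return retention_time, None, []
    smiles = fields[last_text]
    # Every field after the last non numeric one is a descriptor value.
    descriptors = [float(f) for f in fields[last_text + 1:]]
    return retention_time, smiles, descriptors
```

For instance, a row holding a retention time, a SMILES and two descriptor values is split into those three parts, while a row that ends with the SMILES yields an empty descriptor list.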
The following function is needed both in estimating a model and in predicting the retention times. Since molecular descriptor values are usually added using add_descriptors_via_tsv() (or its command line equivalent), the above-mentioned convention is satisfied.
- load_descriptors(path)
Loads molecular descriptor values from a file and returns them as a numpy array.
The input file must be in tab separated format, must not have a header, and the descriptors must be the last fields of every row, preceded on every row by a non numeric field (for instance, a SMILES).
- Parameters:
path (str) – The path of the tab separated file to read.
- Returns:
The descriptor values.
- Return type:
numpy.array
On the other hand, when the retention times need to be loaded, as when estimating a model, the following function is used.
- load_retention_times(path)
Loads retention time values from a file and returns them as a numpy array.
The input file must be in tab separated format, must not have a header, and the retention time must be the first field on every row.
- Parameters:
path (str) – The path of the tab separated file to read.
- Returns:
The retention time values.
- Return type:
numpy.array
Model I/O
The following two functions can be used, respectively, to load and save a model.
By model in jp²rt we usually mean any instance of a supervised learning model or, in a broader sense, of a class implementing the scikit-learn estimator interface (a condition that can be checked using the check_estimator() function).
Models are saved in a compressed archive (as defined in the PKZIP Application Note); the archive contains the model itself (as serialized by Joblib) and a manifest file that makes it possible to check that the saved model refers to the correct jp²rt version.
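To illustrate the idea of an archive holding a serialized model plus a version manifest, here is a minimal standard-library sketch. It is not jp2rt's actual file layout: the entry names, the JSON manifest format, and the functions `write_archive` and `read_archive` are assumptions made for this example, and `pickle` stands in for Joblib so that the block has no third-party dependencies.

```python
import json
import pickle
import zipfile

# Hypothetical entry names; the real archive layout is not documented here.
MODEL_ENTRY = "model.pkl"
MANIFEST_ENTRY = "manifest.json"


def write_archive(model, path, version):
    """Store a serialized model and a version manifest in a compressed ZIP."""
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(MANIFEST_ENTRY, json.dumps({"version": version}))
        # pickle is a stand-in for Joblib in this sketch.
        archive.writestr(MODEL_ENTRY, pickle.dumps(model))


def read_archive(path, expected_version):
    """Load the model only if the manifest matches the expected version."""
    with zipfile.ZipFile(path) as archive:
        manifest = json.loads(archive.read(MANIFEST_ENTRY))
        found = manifest.get("version")
        if found != expected_version:
            raise ValueError(
                f"archive written by version {found!r}, expected {expected_version!r}"
            )
        # Only unpickle archives that come from a trusted source.
        return pickle.loads(archive.read(MODEL_ENTRY))
```

Checking the manifest before deserializing means that an archive written by an incompatible version is rejected without loading the model at all.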
The following function is usually required when predicting the retention times.
- load_model(path)
Loads a model from a file.
- Parameters:
path (str) – The path of the file to load the model from.
- Returns:
The loaded model.
- Return type:
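As a usage sketch, prediction could combine load_model() with load_descriptors() roughly as follows. The wrapper `predict_retention_times` is a hypothetical helper, not part of jp2rt; it assumes the loaded model exposes the `predict()` method of the scikit-learn estimator interface and that the descriptors file follows the convention described in the Data I/O section.

```python
def predict_retention_times(model_path, descriptors_path):
    """Hypothetical helper: load a saved model and predict retention times."""
    # Imported lazily so that defining this helper does not require jp2rt.
    from jp2rt.ml import load_descriptors, load_model

    model = load_model(model_path)
    X = load_descriptors(descriptors_path)
    # predict() is part of the scikit-learn estimator interface for regressors.
    return model.predict(X)
```

Both arguments are file paths chosen by the caller; no particular file extension is assumed.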
On the other hand, the following function is usually required at the end of the model estimation process.
- save_model(model, path)
Saves a model to a file.
- Parameters:
model (sklearn.base.BaseEstimator) – The model to save.
path (str) – The path of the file to save the model to.
- Returns:
The number of bytes written to the file.
- Return type:
int
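A training workflow that ends with save_model() might look like the following sketch. The wrapper `estimate_and_save` is a hypothetical helper, not part of jp2rt; it uses simple_ensemble_model_estimate(), documented later on this page, and leaves the choice of `regressor_name` to the caller because the valid values are not listed here.

```python
def estimate_and_save(train_path, model_path, regressor_name):
    """Hypothetical helper: estimate a model from a TSV file and save it."""
    # Imported lazily so that defining this helper does not require jp2rt.
    from jp2rt.ml import (
        load_descriptors,
        load_retention_times,
        save_model,
        simple_ensemble_model_estimate,
    )

    # Both loaders read the same file: the retention time is the first
    # field and the descriptors are the last fields of every row.
    X = load_descriptors(train_path)
    y = load_retention_times(train_path)
    model = simple_ensemble_model_estimate(regressor_name, X, y)
    # save_model() returns the number of bytes written to the file.
    return save_model(model, model_path)
```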
Model estimation and evaluation
Model validation is usually performed using cross validation and visual inspection.
As shown in Estimate the model, the following function is used to assist in the validation, and hence the selection, process.
- evaluate_model(model, X, y, n_splits=5, prob=0.95)
Evaluates a model using cross validation and visual inspection.
- Parameters:
model (sklearn.base.BaseEstimator) – The model to evaluate.
X (numpy.array) – The input data.
y (numpy.array) – The target values.
n_splits (int, optional) – The number of splits to use for cross validation. Defaults to 5.
prob (float, optional) – The confidence interval to use for the residuals distribution plot. Defaults to 0.95.
- Returns:
A dictionary with the evaluation results.
- Return type:
dict
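For readers unfamiliar with the role of n_splits, the following dependency-free sketch shows plain k-fold cross validation. It is not how evaluate_model() is implemented: the fold layout (contiguous, unshuffled), the mean absolute error metric, and the mean-value baseline model are all assumptions chosen to keep the example short.

```python
from statistics import mean


def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


def cross_validated_mae(fit, predict, X, y, n_splits=5):
    """Return the mean absolute error measured on each held-out fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(y), n_splits):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        predictions = predict(model, [X[i] for i in test_idx])
        scores.append(mean(abs(p - y[i]) for p, i in zip(predictions, test_idx)))
    return scores


def fit_mean(X_train, y_train):
    """Baseline 'model': the mean of the training targets."""
    return mean(y_train)


def predict_mean(model, X_test):
    """Predict the stored mean for every test sample."""
    return [model] * len(X_test)
```

Each sample is used for testing exactly once, so the per-fold errors together give an estimate of how the model behaves on data it has not seen.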
Performing model definition, evaluation and training can be a complex task. The following convenience function is provided to simplify the process.
- simple_ensemble_model_estimate(regressor_name, X, y)
Trains a simple ensemble model using the given regressor and the input data.
- Parameters:
regressor_name (str) – The name of the regressor to use.
X (numpy.array) – The input data.
y (numpy.array) – The target values.
- Returns:
The trained model.
- Return type:
As the name suggests, the function considers only ensemble models; you can get a list of valid values for the regressor_name parameter using the following function.
Java bridge functions
The jp2rt.java module contains functions that help bridge the Java implementation with the Python code. Java code execution in Python is made possible by SciJava, which in turn is built on JPype (https://jpype.readthedocs.io/).
For bulk operations it is better to avoid a Python/Java conversion of data for each molecule; for this reason, the following function works by reading the data from, and writing it to, tab separated values files.
- add_descriptors_via_tsv(src, dst)
Adds molecular descriptors given the SMILES.
The input file must be in tab separated format, must not have a header, and the SMILES must be the last field of every row. All the fields of the input file are copied to the output file, followed by the molecular descriptor values.
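As a sketch of how an input file for this function could be prepared, the following example writes retention time and SMILES pairs in the expected layout and then calls add_descriptors_via_tsv(). The helpers `write_input_tsv` and `add_descriptors` are hypothetical names, not part of jp2rt, and the two-column layout is only one of the layouts the convention allows.

```python
import csv


def write_input_tsv(rows, path):
    """Write (retention_time, smiles) pairs: no header, SMILES as last field."""
    with open(path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for retention_time, smiles in rows:
            writer.writerow([retention_time, smiles])


def add_descriptors(rows, src, dst):
    """Hypothetical helper: write the input file, then add the descriptors."""
    # Imported lazily so that defining this helper does not require jp2rt
    # (or a Java runtime) to be available.
    from jp2rt.java import add_descriptors_via_tsv

    write_input_tsv(rows, src)
    add_descriptors_via_tsv(src, dst)
```

Because the retention time is written first and the SMILES last, the resulting output file can be read by both loaders of the jp2rt.ml module.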
The above function computes a subset of the molecular descriptors available in CDK (for a discussion on how such descriptors are selected, see the dedicated appendix section); the class and value names of the considered descriptors are returned by the following function.
- descriptors()
Returns the list of computed descriptors.
The function returns a dictionary having the descriptor class names as keys and the names of the computed values as values.
If one is interested in computing the set of descriptors for a single molecule, given its SMILES, the following function can be used.
In case one needs to compute a single descriptor (whatever the descriptor is, even if it does not belong to the selected ones) the following function can be used.
- compute_single_descriptor(name, smiles)
Computes the values of a given descriptor for the given SMILES.
More detailed information about the Java code can be found in the Java API documentation.