Predict the Retention Time

Now assume we have a list of compounds for which we know only the SMILES. Since the PlaSMA dataset is the only data we have, we'll take the fourth column of the first 100 lines of the plasma.tsv file:

$ cut -f 4 plasma.tsv | head -n 100 > smiles.tsv
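If a shell is not available, the same extraction can be sketched in plain Python. The helper below is hypothetical (extract_column is not part of jp2rt) and assumes the input is a plain tab-separated text file. A line with fewer fields than requested is written as an empty line, whereas cut would echo a line that contains no tab at all.

```python
from itertools import islice


def extract_column(src, dst, column=3, limit=100):
    """Copy one tab-separated column of the first `limit` lines of src to dst.

    `column` is zero-based, so 3 selects the fourth field, like `cut -f 4`.
    Returns the number of lines written.
    """
    written = 0
    with open(src, newline='') as fin, open(dst, 'w', newline='') as fout:
        # islice stops after `limit` lines, like `head -n`
        for line in islice(fin, limit):
            fields = line.rstrip('\r\n').split('\t')
            # Assumption: short lines yield an empty output line
            value = fields[column] if len(fields) > column else ''
            fout.write(value + '\n')
            written += 1
    return written
```

Calling extract_column('plasma.tsv', 'smiles.tsv') would then play the role of the shell pipeline above.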

So the steps we need to perform are:

  • compute the molecular descriptors for each SMILES,

  • load the model and the descriptors,

  • predict the retention time with the model.

Use the API

Molecular descriptors can be added not only from the command line, as shown in Compute the descriptors, but also with the add_descriptors_via_tsv() function.

from jp2rt import add_descriptors_via_tsv

add_descriptors_via_tsv('smiles.tsv', 'smiles+descriptors.tsv')
Computing   0% │                                       │   0/100 (0:00:00 / ?) 
Computing   7% │██▎                              │   7/100 (0:00:01 / 0:00:13) 
Computing  36% │███████████▉                     │  36/100 (0:00:02 / 0:00:03) 
Computing  84% │███████████████████████████▋     │  84/100 (0:00:03 / 0:00:00) 
Computing 100% │█████████████████████████████████│ 100/100 (0:00:03 / 0:00:00) 

We are now ready to load the computed descriptors and the model we estimated and saved earlier, and to use the model to predict the retention times:

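from jp2rt import load_model, load_descriptors

# Load the descriptor values computed in the previous step
X = load_descriptors('smiles+descriptors.tsv')
# Load the model estimated and saved earlier
model = load_model('extratrees')
# Predict one retention time per molecule
y = model.predict(X)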
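The predictions in y are only useful once they are matched back to the compounds. A minimal sketch of that step follows, under several assumptions: y holds one value per line of smiles.tsv, in the same order; the file has no header line; and no molecule was dropped while computing the descriptors. The helper write_predictions and the output file name are hypothetical and not part of jp2rt.

```python
def write_predictions(smiles_path, predictions, out_path):
    """Write `prediction<TAB>SMILES` lines, one per input molecule.

    Assumes predictions[i] belongs to line i of smiles_path.
    Returns the number of lines written.
    """
    with open(smiles_path) as fin:
        smiles = [line.rstrip('\n') for line in fin]
    # Refuse to pair values with the wrong molecules
    if len(smiles) != len(predictions):
        raise ValueError(
            f'{len(smiles)} SMILES but {len(predictions)} predictions'
        )
    with open(out_path, 'w') as fout:
        for rt, smi in zip(predictions, smiles):
            fout.write(f'{float(rt):.3f}\t{smi}\n')
    return len(smiles)
```

With the variables above, write_predictions('smiles.tsv', y, 'rt+smiles.tsv') would produce a two-column file; the length check guards against silently misaligned rows.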

The command line

The same steps can be performed from the command line. We have already seen in Compute the descriptors how to add the descriptors, so we just need to predict the retention time:

$ jp2rt predict-rt extratrees.jp2rt smiles+descriptors.tsv rt+smiles+descriptors.tsv
Read 100 molecules with 243 descriptor values each...
Predicted retention times written to /home/runner/work/jp2rt/jp2rt/docs/example/rt+smiles+descriptors.tsv...
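A quick sanity check after the command completes is to confirm that the output file has as many lines as the input. The sketch below assumes the command writes one line per input molecule and adds no header line, which the log above suggests but does not state; same_line_count is a hypothetical helper, not part of jp2rt.

```python
def same_line_count(first_path, second_path):
    """Return (equal, first_count, second_count) for two text files."""
    def count(path):
        with open(path) as f:
            return sum(1 for _ in f)

    first, second = count(first_path), count(second_path)
    return first == second, first, second
```

For example, same_line_count('smiles+descriptors.tsv', 'rt+smiles+descriptors.tsv') should report equal counts if the assumption holds.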