Get some example data
To provide some examples of running jp²rt, we’ll use a small curated
dataset, namely the PlaSMA dataset, which can be freely
obtained from that project.
You can get the data yourself (for instance, to be sure you have the latest
available version) using the approach suggested in this section, or
download the data used to generate this documentation, contained in the
example-data.zip file available as one of the “Assets” of the current
release.
The dataset is provided in NIST MS format (.msp), so we’ll write a msp2tsv
function to convert it to the tab-separated values format (.tsv) used to
compute the molecular descriptors. During the conversion, only entries in
positive ionization mode that have a retention time, an InChIKey, a name, and a SMILES are kept.
We can preserve as much information as we want from the original file, provided that:
the file is in tab separated values format,
the first field is the retention time,
the last field is the SMILES.
Since the computation of the molecular descriptors is performed in parallel, the output file is not guaranteed to have the same order as the input file. For this reason, it is a good idea to preserve at least the InChIKey, so that the original data can be matched with the computed molecular descriptors.
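The matching step mentioned above can be sketched as follows. This is only an illustration: the row layouts, the position of the InChIKey column, the placeholder keys, and the helper name `match_by_inchikey` are all assumptions, not part of jp²rt.

```python
def match_by_inchikey(original_rows, computed_rows, key_index=1):
    """Pair rows from two tables that share the same InChIKey.

    key_index is the (assumed) position of the InChIKey column in both tables.
    """
    # Index the original rows by InChIKey for fast lookups.
    by_key = {row[key_index]: row for row in original_rows}
    # Keep only computed rows whose InChIKey also appears in the original data.
    return [
        (by_key[row[key_index]], row)
        for row in computed_rows
        if row[key_index] in by_key
    ]

# Hypothetical rows: retention time, InChIKey, name, SMILES.
original = [
    ['1.5', 'KEY-A', 'compound A', 'CCO'],
    ['2.5', 'KEY-B', 'compound B', 'CCN'],
]
# The same records in a different order, with made-up descriptor columns.
computed = [
    ['2.5', 'KEY-B', '0.1', '0.2', 'CCN'],
    ['1.5', 'KEY-A', '0.3', '0.4', 'CCO'],
]
pairs = match_by_inchikey(original, computed)
```

If the same InChIKey occurs more than once in the original rows, the dictionary keeps only the last such row, so a real matching step may need a stricter key.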
# A function converting an MSP file to a TSV file (keeping only positive ion mode records that have all the required fields, SMILES included)
import csv

def msp2tsv(src, dst, extra_keys=('INCHIKEY', 'NAME')):
    # Output columns: retention time first, SMILES last
    keys = (
        'RETENTIONTIME',
        *extra_keys,
        'SMILES',
    )
    keys_set = frozenset(keys)
    csv_writer = csv.writer(dst, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    record = {}
    for line in src:
        line = line.strip()
        if not line:
            # A blank line ends a record: write it only if it has every required key
            # and is in positive ion mode. Note that a final record not followed by
            # a blank line is not written.
            if keys_set <= set(record.keys()) and record.get('IONMODE') == 'Positive':
                csv_writer.writerow(record[k] for k in keys)
            record = {}
        else:
            # Keep 'KEY: value' lines; lines without ': ' (such as peak lines) are skipped
            key, *value = line.split(': ', 1)
            if value:
                record[key] = value[0]
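To see how the `key, *value = ...split(': ', 1)` step behaves on different kinds of lines, here is a small stand-alone sketch. The helper name `parse_field` is hypothetical and the sample lines are made up for illustration; they are not taken from the PlaSMA dataset.

```python
def parse_field(line):
    """Return (key, value) for a 'KEY: value' line, or None otherwise."""
    key, *value = line.strip().split(': ', 1)
    return (key, value[0]) if value else None

samples = [
    'NAME: example compound',     # plain metadata field
    'COMMENT: note: with colon',  # only the first ': ' separates key and value
    '100.0\t999',                 # no ': ' separator, so the line is ignored
]
parsed = [parse_field(s) for s in samples]
```

Splitting at most once keeps values that themselves contain `': '` intact, and the starred assignment leaves `value` empty when there is no separator at all.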
We can use the above function to save a .tsv file while downloading the
original dataset directly from the PlaSMA download page:
PLASMA__DATASET_URL = 'http://plasma.riken.jp/menta.cgi/plasma/get_msp_all'
There is actually no need to save the .msp file on the local disk; the .tsv
file can be produced on the fly:
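from urllib.request import urlopen

# newline='' lets the csv writer control line endings itself, as the csv module documentation recommends
with urlopen(PLASMA__DATASET_URL) as msp_src, open('plasma.tsv', mode='w', newline='') as tsv_dst:
    src = msp_src.read().decode('utf-8').splitlines()
    msp2tsv(src, tsv_dst)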
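Since msp2tsv only iterates over its input, the response could also be decoded lazily instead of being read into memory in one go. The following is a minimal sketch, assuming the server sends UTF-8 text; `iter_text_lines` is a hypothetical helper, and `io.BytesIO` stands in for the HTTP response so that the example runs offline.

```python
import io

def iter_text_lines(binary_stream, encoding='utf-8'):
    """Yield decoded lines (without line endings) from a binary stream."""
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding)
    for line in text_stream:
        yield line.rstrip('\r\n')

# Made-up content standing in for the downloaded bytes.
fake_response = io.BytesIO(b'NAME: example\nSMILES: CCO\n\n')
lines = list(iter_text_lines(fake_response))
```

With a real download, this would be used as `msp2tsv(iter_text_lines(msp_src), tsv_dst)` inside the `with` block.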
We can check the number of lines in the resulting file:
$ wc -l plasma.tsv
799 plasma.tsv
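Beyond counting lines, the format requirements listed earlier (tab-separated fields, retention time first, SMILES last) can be checked programmatically. The sketch below is an illustration only: `check_tsv_rows` is a hypothetical helper, the sample rows are made up, a non-empty last field is used as a crude stand-in for "is a SMILES", and the retention time is assumed to be written as a plain number.

```python
import csv
import io

def check_tsv_rows(lines):
    """Return the 1-based numbers of rows that break the expected layout."""
    bad = []
    for number, row in enumerate(csv.reader(lines, delimiter='\t'), start=1):
        try:
            float(row[0])  # first field: retention time (assumed numeric)
            # last field: SMILES, here only checked for being present and non-empty
            ok = len(row) >= 2 and row[-1].strip() != ''
        except (IndexError, ValueError):
            ok = False
        if not ok:
            bad.append(number)
    return bad

# Made-up rows: the second one has a non-numeric retention time.
sample = io.StringIO(
    '1.5\tKEY-A\tcompound A\tCCO\n'
    'not-a-number\tKEY-B\tcompound B\tCCN\n'
)
bad_rows = check_tsv_rows(sample)
```

On the real file, the same check could be run with `with open('plasma.tsv', newline='') as f: print(check_tsv_rows(f))`.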