Get some example data


To illustrate how jp²rt runs, we’ll use a small curated dataset: the PlaSMA dataset, which is freely available from the PlaSMA project.

You can get the data yourself (for instance, to be sure you have the latest available version) using the approach suggested in this section, or download the data used to generate this documentation, contained in the example-data.zip file available as one of the “Assets” of the current release.

The dataset is provided in NIST MS format (.msp), so we’ll write an msp2tsv function to convert it to the tab separated values format (.tsv) used to compute the molecular descriptors. During the conversion, only records in positive ionization mode that carry all the requested fields (by default the retention time, the InChIKey, the name and the SMILES) are kept.
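To make the format concrete, the following minimal sketch parses a made-up MSP text into one dictionary per record. The sample text and the `parse_msp` helper are illustrative assumptions, not part of jp²rt or of the PlaSMA data; real records contain more fields and longer peak lists.

```python
# Hypothetical helper: split MSP text into one dict per record.
# Assumes records are separated by blank lines and metadata lines
# have the form "KEY: value" (peak lines carry no ': ' and are skipped).

# Invented sample, for illustration only
SAMPLE_MSP = """\
NAME: Example compound
RETENTIONTIME: 1.23
IONMODE: Positive
INCHIKEY: LFQSCWFLJHTTHZ-UHFFFAOYSA-N
SMILES: CCO
Num Peaks: 1
47.0\t100

NAME: Incomplete record
IONMODE: Negative
"""


def parse_msp(text):
  records, record = [], {}
  for line in text.splitlines():
    line = line.strip()
    if not line:
      # A blank line closes the current record
      if record:
        records.append(record)
      record = {}
      continue
    key, sep, value = line.partition(': ')
    if sep:
      record[key] = value
  if record:  # last record, when no trailing blank line is present
    records.append(record)
  return records


records = parse_msp(SAMPLE_MSP)
for record in records:
  print(record.get('NAME'), record.get('IONMODE'), record.get('SMILES'))
```

The second record of the sample lacks a SMILES and is in negative ion mode, so a conversion applying the filter described above would drop it.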

We can preserve as much information from the original file as we want, provided that:

the file is in tab separated values format,
the first field is the retention time,
the last field is the SMILES.
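The three requirements above can be checked with a few lines of code. The following is a minimal sketch: the `check_tsv` helper is hypothetical (not part of jp²rt) and assumes that the retention time is written as a plain decimal number. It only verifies that the last field is non-empty, not that it is a valid SMILES.

```python
import csv


def check_tsv(lines):
  """Return the (1-based) numbers of the rows that break the expected layout."""
  bad = []
  for n, row in enumerate(csv.reader(lines, delimiter='\t'), start=1):
    try:
      # First field: retention time (assumed to parse as a float)
      float(row[0])
      # Last field: SMILES (only checked for being present and non-empty)
      ok = len(row) >= 2 and bool(row[-1].strip())
    except (IndexError, ValueError):
      ok = False
    if not ok:
      bad.append(n)
  return bad


# Made-up rows, for illustration only: the second one has no valid retention time
sample = [
  '1.23\tLFQSCWFLJHTTHZ-UHFFFAOYSA-N\tExample compound\tCCO\n',
  'n/a\tSOME-KEY\tBroken row\tCCO\n',
]
print(check_tsv(sample))
```

Since `csv.reader` accepts any iterable of lines, the same function can be given an open file instead of a list.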

Since the molecular descriptors are computed in parallel, the output file is not guaranteed to preserve the order of the input file. For this reason, it is a good idea to keep at least the InChIKey, so that the original data can be matched with the computed molecular descriptors.
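As a sketch of such a matching, the snippet below indexes the original rows by InChIKey and looks up each computed row. The row layouts and the `match_by_inchikey` helper are assumptions made for illustration: the InChIKey is taken to be the second field in both sets of rows, and the keys and descriptor values are invented. Check the actual layout of your files before adapting it.

```python
def match_by_inchikey(original_rows, computed_rows, key_index=1):
  """Pair each computed row with the original row sharing its InChIKey."""
  # Index the original rows by InChIKey (if a key repeats, the last row wins)
  by_key = {row[key_index]: row for row in original_rows}
  pairs = []
  for row in computed_rows:
    original = by_key.get(row[key_index])
    if original is not None:
      pairs.append((original, row))
  return pairs


# Made-up rows: the computed ones are deliberately in a different order
original = [
  ['1.23', 'KEY-A', 'Compound A', 'CCO'],
  ['4.56', 'KEY-B', 'Compound B', 'CCN'],
]
computed = [
  ['4.56', 'KEY-B', '0.1', '0.2'],
  ['1.23', 'KEY-A', '0.3', '0.4'],
]
for original_row, computed_row in match_by_inchikey(original, computed):
  print(original_row[2], computed_row[2:])
```

If the same InChIKey can appear in several records, a dictionary keyed by InChIKey alone is not enough, and an additional field (for instance the name) would be needed to tell the records apart.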

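# A function converting an MSP file to a TSV file (keeping only positive ion
# mode records that contain all the requested keys, including the SMILES)

import csv


def msp2tsv(src, dst, extra_keys=('INCHIKEY', 'NAME')):
  # Column order: retention time first, SMILES last
  keys = (
    'RETENTIONTIME',
    *extra_keys,
    'SMILES',
  )
  keys_set = frozenset(keys)
  csv_writer = csv.writer(dst, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)

  def write_if_valid(record):
    # Write the record only if it is complete and in positive ion mode
    if keys_set <= set(record.keys()) and record.get('IONMODE') == 'Positive':
      csv_writer.writerow(record[k] for k in keys)

  record = {}
  for line in src:
    line = line.strip()
    if not line:
      # A blank line ends the current record
      write_if_valid(record)
      record = {}
    else:
      # Metadata lines look like "KEY: value"; other lines (e.g. peaks) are ignored
      key, *value = line.split(': ', 1)
      if value:
        record[key] = value[0]
  # Write the last record, in case the input does not end with a blank line
  write_if_valid(record)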

We can use the above function to save a .tsv file while downloading the original dataset directly from the PlaSMA download page:

PLASMA__DATASET_URL = 'http://plasma.riken.jp/menta.cgi/plasma/get_msp_all'

There is actually no need to save the .msp file on the local disk; the .tsv file can be produced on the fly:

from urllib.request import urlopen

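# newline='' is the setting recommended by the csv module for output files
with urlopen(PLASMA__DATASET_URL) as msp_src, open('plasma.tsv', mode='w', newline='') as tsv_dst:
  # Decode the whole response, then pass it line by line to the converter
  src = msp_src.read().decode('utf-8').splitlines()
  msp2tsv(src, tsv_dst)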
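The snippet above reads the whole response into memory before splitting it into lines. If that is a concern, one possible alternative is to wrap the binary stream in `io.TextIOWrapper`, which decodes it lazily and yields one line at a time. In the minimal sketch below, the `text_lines` helper is hypothetical and an in-memory `io.BytesIO` object stands in for the HTTP response, so that the example runs without network access.

```python
import io


def text_lines(binary_stream, encoding='utf-8'):
  """Lazily decode a binary stream, yielding one text line at a time."""
  yield from io.TextIOWrapper(binary_stream, encoding=encoding)


# io.BytesIO stands in for the object returned by urlopen() (made-up content)
fake_response = io.BytesIO(b'NAME: Example compound\nSMILES: CCO\n\n')
for line in text_lines(fake_response):
  print(repr(line))
```

Since msp2tsv only needs an iterable of lines, a call such as `msp2tsv(text_lines(msp_src), tsv_dst)` should then consume the download line by line.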

We can check the number of lines written to the .tsv file:

$ wc -l plasma.tsv

799 plasma.tsv