Menu Close

Drugbank Importer

drugbank-importer

drugbank-importer is a command-line tool meant to convert DrugBank‘s XML into several formats (currently
csv and any database supported by sqlalchemy 2.x).

It is implemented in python and freely inspired from zzploveyou’s
fork
or tal-baum
implementation
.

Usage

$ drugbank-import --help
Usage: drugbank-import [OPTIONS]

Options:
  -f, --file-path PATH  Path to the DrugBank XML dump
  -t, --target TEXT     Where to save the import, either a path to a directory
                        (for CSV) or a sqlalchemy ressource locator, e.g.
                        `sqlite://` for memory or `sqlite:///tmp/database.db`
                        for `/tmp/database.db`
  -l, --limit INTEGER   Limit the number of records to proceed
  --help                Show this message and exit.

For instance, the following command would import data from the first 10 drug
records contained in drugbank.xml, and save them in a sqlite file test.db:

drugbank-import -f drugbank.xml -l 10 -t sqlite:///test.db

The database schema is the following:

classDiagram
   class partners{
      partner_id: VARCHAR(6)
      partner_name: VARCHAR
      gene_name: VARCHAR
      uniprot_id: VARCHAR
      genbank_gene_id: VARCHAR
      genbank_protein_id: VARCHAR
      hgnc_id: VARCHAR
      organism: VARCHAR
      taxonomy_id: VARCHAR
   }
   class drugs{
      drugbank_id: VARCHAR(6)
      drugname: VARCHAR
      drug_type: VARCHAR
      ATC_codes: BOOLEAN
      approved: BOOLEAN
      experimental: BOOLEAN
      illicit: BOOLEAN
      investigational: BOOLEAN
      nutraceutical: BOOLEAN
      withdrawn: BOOLEAN
   }
   class carriers{
      id: INTEGER
      drugbank_id: VARCHAR(6)
      partner_id: VARCHAR(6)
      action: VARCHAR
   }
   partners <|-- carriers
   drugs <|-- carriers
   class targets{
      id: INTEGER
      drugbank_id: VARCHAR(6)
      partner_id: VARCHAR(6)
      action: VARCHAR
   }
   partners <|-- targets
   drugs <|-- targets
   class transporters{
      id: INTEGER
      drugbank_id: VARCHAR(6)
      partner_id: VARCHAR(6)
      action: VARCHAR
   }
   partners <|-- transporters
   drugs <|-- transporters
   class enzymes{
      id: INTEGER
      drugbank_id: VARCHAR(6)
      partner_id: VARCHAR(6)
      action: VARCHAR
   }
   partners <|-- enzymes
   drugs <|-- enzymes
   class descriptions{
      id: INTEGER
      drugbank_id: VARCHAR(6)
      drug_name: VARCHAR
      description: VARCHAR
      SMILES: VARCHAR
   }
   drugs <|-- descriptions

Motivation

Both the original
implementation
and its
fork
share the same
design:

  • the XML file is parsed into a tree
  • the tree is traversed, while information about drugs and their targets is
    extracted and accumulated
  • accumulated information is serialized into csv files

This design and subsequent control flow is simple to read and understand, which
is of paramount importance for this kind of data importers. However, it comes
with a few issues:

  • the DrugBank XML file is ~1.5 GB (as of 2022-12), and all associated
    data structures may hardly fit in memory
  • all the information extracted about drugs and targets have to fit in memory as
    well
  • the extraction of drug/target information and their serialization is highly
    coupled: extracting new fields or creating new “tables” can quickly become not
    trivial
  • only csv support is provided (although it can be later imported easily in a
    database, at the cost of an extra step).

drugbank-importer has been designed to solve these issues.

How it works

Being less script-ish, drugbank-importer is a bit harder to read and
understand at first sight and worth a short explanation about how it leverages
several powerful techniques to solve the aforementioned limits:

  • iterative parsing: DrugBank records are parsed into an XML tree one by one
  • lazy evaluation: generators are used both to yield DrugBank records, and to
    issue records for drugs, targets, etc.
  • the extraction and serialized steps are separated, to support multiple
    backends (currently csv and any database supported by sqlalchemy 2.x)

Instead the procedural steps mentioned above, drugbank-importer does the
following:

  • a generator issues a subset of lines corresponding to a drug record in XML
    format
  • this generator is consumed by another generator which creates the XML tree for
    this particular subset of lines, parses the information that need to be
    serialized, and yield it as Dict
  • this last generator is consumed by a serialization function that can be chosen
    at runtime (either to file or to arbitrary sqlalchemy locations)

The most notable difference with the former approach is that control is deferred
to the serializers, which receive business entities to serialize as they are
produced.

View Source Code
Posted in Development, Tech, web development