drugbank-importer
drugbank-importer is a command-line tool that converts DrugBank's XML dump
into several formats (currently csv and any database supported by
sqlalchemy 2.x).
It is implemented in Python and loosely inspired by zzploveyou's fork and
tal-baum's implementation.
Usage
$ drugbank-import --help
Usage: drugbank-import [OPTIONS]

Options:
  -f, --file-path PATH  Path to the DrugBank XML dump
  -t, --target TEXT     Where to save the import, either a path to a directory
                        (for CSV) or a sqlalchemy ressource locator, e.g.
                        `sqlite://` for memory or `sqlite:///tmp/database.db`
                        for `/tmp/database.db`
  -l, --limit INTEGER   Limit the number of records to proceed
  --help                Show this message and exit.
For instance, the following command would import data from the first 10 drug
records contained in drugbank.xml, and save them in a sqlite file test.db:
drugbank-import -f drugbank.xml -l 10 -t sqlite:///test.db
The database schema is the following:
classDiagram
class partners{
partner_id: VARCHAR(6)
partner_name: VARCHAR
gene_name: VARCHAR
uniprot_id: VARCHAR
genbank_gene_id: VARCHAR
genbank_protein_id: VARCHAR
hgnc_id: VARCHAR
organism: VARCHAR
taxonomy_id: VARCHAR
}
class drugs{
drugbank_id: VARCHAR(6)
drugname: VARCHAR
drug_type: VARCHAR
ATC_codes: BOOLEAN
approved: BOOLEAN
experimental: BOOLEAN
illicit: BOOLEAN
investigational: BOOLEAN
nutraceutical: BOOLEAN
withdrawn: BOOLEAN
}
class carriers{
id: INTEGER
drugbank_id: VARCHAR(6)
partner_id: VARCHAR(6)
action: VARCHAR
}
partners <|-- carriers
drugs <|-- carriers
class targets{
id: INTEGER
drugbank_id: VARCHAR(6)
partner_id: VARCHAR(6)
action: VARCHAR
}
partners <|-- targets
drugs <|-- targets
class transporters{
id: INTEGER
drugbank_id: VARCHAR(6)
partner_id: VARCHAR(6)
action: VARCHAR
}
partners <|-- transporters
drugs <|-- transporters
class enzymes{
id: INTEGER
drugbank_id: VARCHAR(6)
partner_id: VARCHAR(6)
action: VARCHAR
}
partners <|-- enzymes
drugs <|-- enzymes
class descriptions{
id: INTEGER
drugbank_id: VARCHAR(6)
drug_name: VARCHAR
description: VARCHAR
SMILES: VARCHAR
}
drugs <|-- descriptions
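Once an import has completed, the tables above can be queried like any other relational database. The following is a minimal sketch, assuming the import target was a SQLite file and that booleans are stored as 0/1; the function name `targets_of_approved_drugs` is hypothetical and not part of drugbank-importer.

```python
import sqlite3


def targets_of_approved_drugs(conn):
    """Return (drugname, partner_name, action) rows for approved drugs.

    Hypothetical helper: it only relies on the table and column names
    listed in the schema above.
    """
    query = """
        SELECT d.drugname, p.partner_name, t.action
        FROM targets AS t
        JOIN drugs AS d ON d.drugbank_id = t.drugbank_id
        JOIN partners AS p ON p.partner_id = t.partner_id
        WHERE d.approved = 1
        ORDER BY d.drugname, p.partner_name
    """
    return conn.execute(query).fetchall()


# Usage, assuming test.db was produced by the command shown earlier:
#   conn = sqlite3.connect("test.db")
#   for row in targets_of_approved_drugs(conn):
#       print(row)
```

The same join applies to the carriers, transporters and enzymes tables, since they share the drugbank_id / partner_id / action layout.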
Motivation
Both the original implementation and its fork share the same design:
- the XML file is parsed into a tree
- the tree is traversed, while information about drugs and their targets is
  extracted and accumulated
- accumulated information is serialized into csv files
This design and the resulting control flow are simple to read and understand,
which is of paramount importance for this kind of data importer. However, it
comes with a few issues:
- the DrugBank XML file is ~1.5 GB (as of 2022-12), and all associated
  data structures may not fit in memory
- all the information extracted about drugs and targets has to fit in memory as
  well
- the extraction of drug/target information and its serialization are tightly
  coupled: extracting new fields or creating new “tables” can quickly become
  non-trivial
- only csv support is provided (although the output can later be imported
  easily into a database, at the cost of an extra step).
drugbank-importer has been designed to solve these issues.
How it works
Being less script-like, drugbank-importer is a bit harder to read and
understand at first sight, so it is worth a short explanation of how it
leverages several techniques to overcome the aforementioned limitations:
- iterative parsing: DrugBank records are parsed into an XML tree one by one
- lazy evaluation: generators are used both to yield DrugBank records, and to
  issue records for drugs, targets, etc.
- the extraction and serialization steps are separated, to support multiple
  backends (currently csv and any database supported by sqlalchemy 2.x)
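To illustrate iterative parsing combined with lazy evaluation, here is a minimal sketch built on the standard library's `xml.etree.ElementTree.iterparse`. It is not drugbank-importer's own code: the function name `iter_drug_names` is hypothetical, and the tag names `drug` and `name` assume a simplified XML without namespaces, whereas the real dump declares one.

```python
import io
import xml.etree.ElementTree as ET


def iter_drug_names(source):
    """Lazily yield the <name> of each top-level <drug> element.

    Assumption: un-namespaced tags, for readability only.
    """
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # After the decrement, depth == 1 means elem is a direct child of
        # the root, so <drug> elements nested deeper are ignored.
        if depth == 1 and elem.tag == "drug":
            yield elem.findtext("name")
            # Drop the record's children and text once it has been consumed.
            elem.clear()


# Usage: nothing is parsed until the generator is iterated.
#   with open("drugbank.xml", "rb") as handle:
#       for name in iter_drug_names(handle):
#           print(name)
```

Because the function is a generator, a consumer that stops early (for example after a fixed number of records) never pays for parsing the rest of the file.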
Instead of the procedural steps mentioned above, drugbank-importer does the
following:
- a generator issues a subset of lines corresponding to a drug record in XML
  format
- this generator is consumed by another generator, which creates the XML tree
  for this particular subset of lines, parses the information that needs to be
  serialized, and yields it as a Dict
- this last generator is consumed by a serialization function that can be chosen
  at runtime (either to file or to arbitrary sqlalchemy locations)
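The three steps above can be sketched as a chain of generators ending in a serializer. This is a simplified illustration under stated assumptions, not the actual implementation: all function names are hypothetical, top-level records are assumed to start on an unindented `<drug ...>` line and end on an unindented `</drug>` line, and tags are assumed to be un-namespaced.

```python
import csv
import io
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Iterator, List, Optional, TextIO


def iter_record_lines(lines: Iterable[str]) -> Iterator[List[str]]:
    """Step 1: yield the lines of one top-level <drug> record at a time."""
    buffer: List[str] = []
    for line in lines:
        if line.startswith("<drug>") or line.startswith("<drug "):
            buffer = [line]  # a new record starts here
        elif buffer:
            buffer.append(line)
            if line.startswith("</drug>"):
                yield buffer
                buffer = []


def iter_drugs(records: Iterable[List[str]]) -> Iterator[Dict[str, Optional[str]]]:
    """Step 2: build a small tree per record and yield the fields as a dict."""
    for record in records:
        elem = ET.fromstring("".join(record))
        yield {
            "drugbank_id": elem.findtext("drugbank-id"),
            "drugname": elem.findtext("name"),
        }


def write_csv(rows: Iterable[Dict[str, Optional[str]]], out: TextIO) -> int:
    """Step 3: one possible serializer; it drives the whole pipeline."""
    writer = None
    count = 0
    for row in rows:
        if writer is None:
            # The header is taken from the keys of the first row.
            writer = csv.DictWriter(out, fieldnames=list(row))
            writer.writeheader()
        writer.writerow(row)
        count += 1
    return count


# Usage: only one record is held in memory at any time.
#   with open("drugbank.xml") as src, open("drugs.csv", "w", newline="") as dst:
#       write_csv(iter_drugs(iter_record_lines(src)), dst)
```

Swapping `write_csv` for a function that inserts each dict into a database would leave the first two steps untouched, which is the point of keeping extraction and serialization separate.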
The most notable difference with the former approach is that control is deferred
to the serializers, which receive business entities to serialize as they are
produced.