CanGraph.GraphifyHMDB package
This script, created as part of my Master’s Internship at IARC, imports nodes from the Human Metabolome Database (a high-quality database containing a list of metabolites and proteins associated with different diseases) into Neo4J format in an automated way, providing an export in GraphML format.
To run, it uses alive_progress to generate an interactive progress bar (which shows that the script is still running through its most time-consuming parts) and the neo4j Python driver. These requirements can be installed using: pip install -r requirements.txt
To run the script itself, use:
python3 main.py neo4jadress username databasepassword
where:
neo4jadress: the URL of the database, in neo4j:// or bolt:// format
username: the username for your Neo4J instance. Remember, the default is neo4j
databasepassword: the password for your database. Since the arguments are passed by Bash on to python3, you might need to escape special characters
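As a rough illustration of how these three positional arguments might be read, here is a minimal sketch using argparse. The parse_connection_args helper is hypothetical (it is not taken from main.py, which may read its arguments differently), and shlex.quote is only shown as one way to see how a password with special characters could be quoted for the shell.

```python
import argparse
import shlex


def parse_connection_args(argv):
    """Parse the three positional arguments described above.

    Hypothetical helper: the real main.py may handle its arguments differently.
    """
    parser = argparse.ArgumentParser(description="Import HMDB into Neo4J")
    parser.add_argument("neo4jadress", help="database URL, neo4j:// or bolt://")
    parser.add_argument("username", help="Neo4J user name")
    parser.add_argument("databasepassword", help="Neo4J password")
    return parser.parse_args(argv)


# Example values only; replace them with your own connection details
args = parse_connection_args(["bolt://localhost:7687", "neo4j", "pa$$word!"])
print(args.neo4jadress, args.username)

# Single-quote a password so that the shell passes it on unchanged
print(shlex.quote("pa$$word!"))
```

On the command line, the quoted form printed by the last line is what you would type in place of the raw password.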
Please note that there are two kinds of functions in the associated code: those that use Python f-strings, whose text cannot be copied directly into Neo4J (for instance, doubled curly braces have to be turned into single ones), and those that use normal multi-line strings, whose text can. This is because f-strings allow for variable substitution, while normal strings don't.
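The difference between the two kinds of strings can be seen in a small sketch. The query and the merge_metabolite_query name are made up for illustration and are not copied from build_database; only the brace-doubling behaviour of f-strings is the point here.

```python
def merge_metabolite_query(label):
    """Build a Cypher statement with an f-string (illustrative query only).

    Literal Cypher braces are doubled ({{ and }}) so that Python does not
    treat them as replacement fields; {label} is substituted as usual.
    """
    return f"MERGE (m:{label} {{ Accession: $accession }}) RETURN m"


# The same statement as a plain string: this text can be pasted into Neo4J as-is
PLAIN_QUERY = "MERGE (m:Metabolite { Accession: $accession }) RETURN m"

print(merge_metabolite_query("Metabolite"))
```

When copying a statement out of an f-string by hand, every {{ and }} has to be replaced by { and }, and every substituted variable by its value.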
An archived version of this repository that takes into account the gitignored files can be created using: git archive HEAD -o ${PWD##*/}.zip
Important Notices
Please ensure you have internet access, enough space on your hard drive (around 5 GB) and read-write access in ./xmlfolder. The files needed to build the database will be stored there.
There are two kinds of high-level nodes stored in this database: “Metabolites”, which are individual compounds present in the Human Metabolome; and “Proteins”, which are normally enzymes and are related to one or multiple metabolites. There are different types of metabolites, but they were all imported in the same way; their origin can be told apart through a field on the corresponding “Concentration” nodes. You could run a query such as: MATCH (n:Metabolite)-[r:MEASURED_AT]-(c:Concentration) RETURN DISTINCT c.Biospecimen
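The query above could also be run from Python. The following is only a sketch: list_biospecimens is a hypothetical helper, the column alias is added for convenience, and the session argument is assumed to behave like a neo4j.Session (its run() method returns records that can be indexed by column name).

```python
BIOSPECIMEN_QUERY = (
    "MATCH (n:Metabolite)-[r:MEASURED_AT]-(c:Concentration) "
    "RETURN DISTINCT c.Biospecimen AS biospecimen"
)


def list_biospecimens(session):
    """Return the distinct Biospecimen values found on Concentration nodes.

    `session` is assumed to behave like a neo4j.Session.
    """
    return [record["biospecimen"] for record in session.run(BIOSPECIMEN_QUERY)]
```

With a real database, the session would come from the neo4j driver, for example inside a "with driver.session() as session:" block.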
Some XML tags have intentionally not been processed: some seemed to hold too much information unrelated to our project, while others could be useful but seemed to only link to external DBs.
The package consists of the following modules:
CanGraph.GraphifyHMDB.build_database module
A Python module that provides the necessary functions to transition the HMDB database to graph format, either from scratch, importing all the nodes (as showcased in CanGraph.GraphifyHMDB.main), or on a case-by-case basis, to annotate existing metabolites (as showcased in CanGraph.main).
- add_biological_properties(filename)
Adds biological properties to existing “Metabolite” nodes based on XML files obtained from the HMDB website. In this case, only properties labeled as <predicted_properties> are added.
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the XML file that is being imported
- Returns
A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
- Return type
Note
Another option would have been to auto-add all the properties, and name them using RETURN “Predicted ” + apoc.text.capitalizeAll(replace(kind, “_”, ” “)), value; however, this way we can select and not duplicate / overwrite values.
Todo
It would be nice to be able to distinguish between experimental and predicted properties
- add_concentrations_abnormal(filename)
Creates “Concentration” nodes based on XML files obtained from the HMDB website. In this function, only metabolites that are labeled as “abnormal_concentration” are added.
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the XML file that is being imported
- Returns
A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
- Return type
Note
Here, an UNWIND clause is used instead of a FOREACH clause. This provides better performance, since, unlike FOREACH, UNWIND does not process rows with empty values
Warning
Using the CREATE clause forces the creation of a Concentration node, even when some values might be missing. This means some bogus nodes could be added, which MUST be accounted for at the end of the database creation process.
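One possible way to account for such nodes afterwards is sketched below. This is an assumption, not the project's actual clean-up step: here a "bogus" node is taken to be a Concentration node without any property, and remove_empty_concentrations is a hypothetical helper that expects an object with a neo4j-style run() method.

```python
CLEANUP_QUERY = """
MATCH (c:Concentration)
WHERE size(keys(c)) = 0
DETACH DELETE c
"""


def remove_empty_concentrations(tx):
    """Delete Concentration nodes that carry no properties at all.

    Assumption: a "bogus" node is one without any property; the real
    clean-up criterion used by the project may differ.
    """
    return tx.run(CLEANUP_QUERY)
```

DETACH DELETE also removes the relationships attached to the deleted nodes, so no dangling MEASURED_AT relations are left behind.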
- add_concentrations_normal(filename)
Creates “Concentration” nodes based on XML files obtained from the HMDB website. In this function, only metabolites that are labeled as “normal_concentration” are added.
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the XML file that is being imported
- Returns
A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
- Return type
Note
Here, an UNWIND clause is used instead of a FOREACH clause. This provides better performance, since, unlike FOREACH, UNWIND does not process rows with empty values
Warning
Using the CREATE clause forces the creation of a Concentration node, even when some values might be missing. This means some bogus nodes could be added, which MUST be accounted for at the end of the database creation process.
- add_diseases(filename)
Creates “Publication” nodes based on XML files obtained from the HMDB website.
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the XML file that is being imported
- Returns
A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
- Return type
Note
Here, an UNWIND clause is used instead of a FOREACH clause. This provides better performance, since, unlike FOREACH, UNWIND does not process rows with empty values (and, logically, there should be no Publication if there is no Disease)
Note
Publications are created with a (m)-[r:CITED_IN]->(p) relation with Metabolite nodes. If one wants to find the Publication nodes related to a given Metabolite/Disease relation, one can use:
MATCH p=()-[r:RELATED_WITH]->() WITH split(r.PubMed_ID, ",") as pubmed UNWIND pubmed as find_this MATCH (p:Publication) WHERE p.PubMed_ID = find_this RETURN p
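To make the logic of this query easier to follow, here is a pure-Python mirror of it. Everything in it is illustrative: publications_for_relation is a hypothetical function, and the dictionaries merely stand in for Publication nodes.

```python
def publications_for_relation(pubmed_field, publications):
    """Pure-Python mirror of the Cypher query above.

    pubmed_field: the comma-separated r.PubMed_ID string of one relation
    publications: dicts standing in for Publication nodes
    """
    found = []
    for find_this in pubmed_field.split(","):  # split(...) followed by UNWIND
        for publication in publications:  # MATCH (p:Publication)
            if publication["PubMed_ID"] == find_this:  # WHERE p.PubMed_ID = find_this
                found.append(publication)
    return found
```

As in the Cypher version, the IDs are compared exactly as they come out of the split, so a value with surrounding whitespace would not match.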
- add_experimental_properties(filename)
Adds properties to existing “Metabolite” nodes based on XML files obtained from the HMDB website. In this case, only properties labeled as <experimental_properties> are added.
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the XML file that is being imported
- Returns
A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
- Return type
Note
Another option would have been to auto-add all the properties, and name them using RETURN “Experimental ” + apoc.text.capitalizeAll(replace(kind, “_”, ” “)), value; however, this way we can select and not duplicate / overwrite values.
Todo
It would be nice to be able to distinguish between experimental and predicted properties
- add_gene_properties(filename)
Adds some properties to existing “Protein” nodes based on XML files obtained from the HMDB website. In this case, properties will mostly relate to the gene from which the protein originates.
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the XML file that is being imported
- Returns
A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
- Return type
Note
We are not creating “Gene” nodes (even though each protein comes from a given gene) because we believe not enough information is being given about them.
- add_general_references(filename, type_of)
Creates “Publication” nodes based on XML files obtained from the HMDB website.
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the XML file that is being imported
- Returns
A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
- Return type
Note
Since not all nodes present a “PubMed_ID” field (which would be ideal to uniquely identify Publications, as the “Text” field is far more prone to typos/errors), nodes will be created using the “Authors” field. This means some duplicates might exist, which should be accounted for.
Note
Unlike the rest, here we are not only matching metabolites, but ALSO proteins. This is intentional.
- add_go_classifications(filename)
Creates “Gene Ontology” nodes based on XML files obtained from the HMDB website. This relates each protein to some GO-Terms
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the XML file that is being imported
- Returns
A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
- Return type
- add_metabolite_associations(filename)
Adds associations contained in the “protein” file, between proteins and metabolites.
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the XML file that is being imported
- Returns
A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
- Return type
Note
Like the “add_metabolite_associations” function, this creates non-directional relationships (m)-[r:ASSOCIATED_WITH]-(p); this helps duplicates be detected.
Note
The “ON CREATE SET” clause for the “Name” param ensures no overwriting
- add_metabolite_references(filename)
Creates references for relations between Protein nodes and Metabolite nodes
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the XML file that is being imported
- Returns
A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
- Return type
Warning
Unfortunately, Neo4J makes it really, really, really difficult to work with XML, and so, this time, an r.PubMed_ID list with the references could not be created. Nonetheless, I considered it useful to add these references.
- add_metabolites(filename)
Creates “Metabolite” nodes based on XML files obtained from the HMDB website, adding some essential identifiers and external properties.
See also
This way of working has been taken from William Lyon’s Blog
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the XML file that is being imported
- Returns
A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
- Return type
- add_predicted_properties(filename)
Adds properties to existing “Metabolite” nodes based on XML files obtained from the HMDB website. In this case, only properties labeled as <predicted_properties> are added.
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the XML file that is being imported
- Returns
A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
- Return type
Note
Another option would have been to auto-add all the properties, and name them using RETURN “Predicted ” + apoc.text.capitalizeAll(replace(kind, “_”, ” “)), value; however, this way we can select and not duplicate / overwrite values.
Todo
It would be nice to be able to distinguish between experimental and predicted properties
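To see what names the auto-naming alternative mentioned in the note would have produced, here is a small Python imitation of that expression. It is a sketch under an assumption: apoc.text.capitalizeAll is taken to upper-case the first letter of each space-separated word and leave the rest untouched. The predicted_property_name function and the example key are hypothetical.

```python
def predicted_property_name(kind):
    """Imitate "Predicted " + apoc.text.capitalizeAll(replace(kind, "_", " ")).

    Assumption: capitalizeAll only upper-cases the first letter of each word.
    """
    words = kind.replace("_", " ").split(" ")
    return "Predicted " + " ".join(word[:1].upper() + word[1:] for word in words)


print(predicted_property_name("polar_surface_area"))
```

Selecting the properties by hand, as the functions here do, avoids creating one such auto-named property for every kind found in the XML.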
- add_protein_associations(filename)
Creates “Protein” nodes based on XML files obtained from the HMDB website.
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the XML file that is being imported
- Returns
A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
- Return type
Note
Unlike the “add_proteins” function, this creates Proteins based on info in the “Metabolite” files, not in the “Protein” files themselves. This could mean node duplication but, hopefully, the MERGE by Accession will mean that these duplicates are caught.
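A MERGE keyed on Accession could look roughly like the sketch below. The statement is illustrative and is not copied from build_database: merge_protein is a hypothetical helper, the Name property is only set on creation (so an existing value is not overwritten), and tx is assumed to offer a neo4j-style run() method that accepts query parameters as keyword arguments.

```python
MERGE_PROTEIN_QUERY = """
MERGE (p:Protein { Accession: $accession })
ON CREATE SET p.Name = $name
RETURN p
"""


def merge_protein(tx, accession, name):
    """Create the Protein node only if no node with this Accession exists.

    Illustrative only: the real import statements are more elaborate.
    """
    return tx.run(MERGE_PROTEIN_QUERY, accession=accession, name=name)
```

Because the node is matched on Accession alone, a protein first seen in a “Metabolite” file and later in a “Protein” file ends up as a single node.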
- add_protein_properties(filename)
Adds some properties to existing “Protein” nodes based on XML files obtained from the HMDB website. In this case, properties will mostly relate to the protein itself.
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the XML file that is being imported
- Returns
A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
- Return type
Note
The “signal_regions” and the “transmembrane_regions” properties were left out because, after a preliminary search, they were mostly empty
- add_proteins(filename)
Creates “Protein” nodes based on XML files obtained from the HMDB website.
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the XML file that is being imported
- Returns
A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
- Return type
Note
We are not creating “Gene” nodes (even though each protein comes from a given gene) because we believe not enough information is being given about them.
- add_taxonomy(filename)
Creates “Taxonomy” nodes based on XML files obtained from the HMDB website. These represent the “kind” of metabolite we are dealing with (Family, etc)
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the XML file that is being imported
- Returns
A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
- Return type
Note
It only creates relationships in the Kingdom -> Super Class -> Class -> Subclass direction, and from any node -> Metabolite. This means that, if any member of the Kingdom -> Super Class -> Class -> Subclass chain is absent, the line will be broken; hopefully, in that case, a later metabolite will come to the rescue and settle the relation!
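The way a missing level breaks the chain can be shown with a simplified Python model. This is not the import code itself: taxonomy_edges is a hypothetical function, the example values are made up, and the model assumes that relationships are only ever created between adjacent levels.

```python
LEVELS = ["Kingdom", "Super Class", "Class", "Subclass"]


def taxonomy_edges(taxonomy):
    """Return (parent, child) pairs between adjacent levels that are both present.

    `taxonomy` maps a level name to its value. No pair skips over an
    absent level, so a missing level leaves a gap in the chain.
    """
    edges = []
    for parent_level, child_level in zip(LEVELS, LEVELS[1:]):
        parent = taxonomy.get(parent_level)
        child = taxonomy.get(child_level)
        if parent and child:
            edges.append((parent, child))
    return edges


# Hypothetical metabolite whose "Class" level is missing
broken = {"Kingdom": "Organic compounds", "Super Class": "Lipids", "Subclass": "Fatty acids"}
print(taxonomy_edges(broken))
```

In this model, the "Subclass" value stays disconnected until another metabolite supplies the missing "Class" level.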
- build_from_metabolite_file(newfile, driver)
A function able to build a portion of the HMDB database in graph format, provided that one “Metabolite” XML is supplied to it. These are downloaded separately from the website (they are all the files other than `hmdb_proteins.zip`), and can be supplied either as the full file or as a split version of it, with just one item per file (which is recommended due to memory limitations)
- Parameters
newfile (str) – The path of the XML file to import
driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use
- Returns
This function modifies the Neo4J Database as desired, but does not produce any particular return.
- build_from_protein_file(newfile, driver)
A function able to build a portion of the HMDB database in graph format, provided that one “Protein” XML is supplied to it. It is downloaded separately from the website, as `hmdb_proteins.zip`, and can be supplied either as the full file or as a split version of it, with just one item per file (which is recommended due to memory limitations)
- Parameters
newfile (str) – The path of the XML file to import
driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use
- Returns
This function modifies the Neo4J Database as desired, but does not produce any particular return.
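Splitting a large XML into one item per file, as recommended for both build functions above, could be done along these lines. This is a generic sketch rather than the splitter used by the project: split_xml, its parameters and the output file naming are all hypothetical.

```python
import os
import xml.etree.ElementTree as ET


def split_xml(source_path, item_tag, out_dir):
    """Write each <item_tag> element of a large XML file to its own file.

    iterparse reads the document incrementally, so each item can be
    written out and cleared as soon as it has been read.
    Returns the list of files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for _event, elem in ET.iterparse(source_path, events=("end",)):
        # Drop a possible "{namespace}" prefix before comparing tag names
        if elem.tag.rsplit("}", 1)[-1] == item_tag:
            out_path = os.path.join(out_dir, f"{item_tag}_{len(written)}.xml")
            ET.ElementTree(elem).write(out_path, encoding="utf-8", xml_declaration=True)
            written.append(out_path)
            elem.clear()  # release the content of the element just written
    return written
```

Each resulting file could then be passed as newfile to one of the two build functions, one at a time.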
CanGraph.GraphifyHMDB.main module
A Python module that leverages the functions present in the build_database module to recreate the HMDB database in graph format using Neo4J, and then provides a GraphML export file.
Please note that, to work, the functions here presuppose you have internet access, which will be used to download HMDB’s XMLs under `./xmlfolder/` (please ensure you have read-write access there).
For more details on how to run this script, please consult the package’s README
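For readers wondering what a GraphML export step might look like, here is a hedged sketch. It assumes that the APOC plugin is installed and that file export is enabled in the Neo4J configuration; this page does not state how the module actually produces its export. The export_graphml helper and the default file name are hypothetical, and session is assumed to behave like a neo4j.Session.

```python
EXPORT_QUERY = "CALL apoc.export.graphml.all($file, {useTypes: true})"


def export_graphml(session, filename="graph.graphml"):
    """Ask APOC to export the whole database to a GraphML file.

    Assumptions: APOC is available and file export is enabled; the
    real module may produce its export in a different way.
    """
    return session.run(EXPORT_QUERY, file=filename)
```

The file name is passed as a query parameter rather than being pasted into the statement, which avoids any quoting problems.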