CanGraph.ExposomeExplorer package
CanGraph.ExposomeExplorer package#
This package, created as part of my Master’s Internship at IARC, transitions the exposome-explorer database (a high quality, hand-curated database containing associations of foods and chemical compounds with cancer) to Neo4J format in an automated way, providing an export in GraphML format.
To run, it uses alive_progress to generate an interactive progress bar (which shows that the script is still running during its most time-consuming parts) and the neo4j Python driver. These requirements can be installed using: pip install -r requirements.txt
To run the script itself, use:
python3 main.py neo4jadress databasename databasepassword csvfolder
where:
neo4jadress: is the URL of the database, in neo4j:// or bolt:// format
databasename: the name of the database in use. If using the free version, there will only be one database per project (neo4j being the default name); if using the pro version, you can specify an alternate name here
databasepassword: the password for the databasename database. Since the arguments are passed by Bash to python3, you might need to escape special characters
csvfolder: the folder where the CSV files for the Exposome Explorer database are stored. These CSVs have to be manually exported from the (confidential) database itself, and are NOT equivalent to those found on Exposome-Explorer’s downloads page
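The argument handling described above can be sketched as follows. This is a minimal illustration, not the actual contents of main.py: the `parse_args` and `build_command` helpers and the scheme check are assumptions made for the example. `shlex.quote` shows one way to escape a password containing special characters before it goes through the shell.

```python
import argparse
import shlex


def parse_args(argv):
    """Parse the four positional arguments listed above (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("neo4jadress", help="database URL, neo4j:// or bolt://")
    parser.add_argument("databasename", help="name of the database in use")
    parser.add_argument("databasepassword", help="password for that database")
    parser.add_argument("csvfolder", help="folder holding the exported CSVs")
    args = parser.parse_args(argv)
    # Assumption: reject anything that is not one of the two documented schemes.
    if not args.neo4jadress.startswith(("neo4j://", "bolt://")):
        parser.error("neo4jadress must start with neo4j:// or bolt://")
    return args


def build_command(address, name, password, folder):
    """Return a shell-safe command line; shlex.quote escapes special characters."""
    parts = ["python3", "main.py", address, name, password, folder]
    return " ".join(shlex.quote(part) for part in parts)


args = parse_args(["bolt://localhost:7687", "neo4j", "p@ss word!", "./csvs"])
command = build_command(args.neo4jadress, args.databasename,
                        args.databasepassword, args.csvfolder)
print(command)
```

Quoting each argument separately means a password with spaces or exclamation marks reaches the script as a single, unmodified value.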
An archived version of this repository that takes into account the gitignored files can be created using: git archive HEAD -o ${PWD##*/}.zip
The package consists of the following modules:
CanGraph.ExposomeExplorer.build_database module#
A Python module that provides the necessary functions to transition the Exposome Explorer database to graph format, either from scratch, importing all the nodes (as showcased in CanGraph.ExposomeExplorer.main), or on a case-by-case basis, to annotate existing metabolites (as showcased in CanGraph.main).
- add_cancer_associations(filename)[source]#
Imports the ‘cancer_associations’ database as a relation between a given Cancer and a Measurement
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- add_components(filename)[source]#
Adds “Metabolite” nodes from Exposome-Explorer’s components.csv. This is because these components are, in fact, metabolites, either from food or from human metabolism
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- add_correlations(filename)[source]#
Imports the ‘correlations’ database as a relation between two measurements: the intake_id, a food taken by the organism and registered using dietary questionnaires and the excretion_id, a chemical found in human biological samples, such that, when one takes one component, one will excrete the other. Data comes from epidemiological studies where dietary questionnaires are administered, and biomarkers are measured in specimens
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- add_measurements_stuff(filename)[source]#
A massive and slow-running function that creates ALL the relations between the ‘measurements’ table and all other related tables:
units: The units in which a given measurement is expressed
components: The component which is being measured
samples: The sample from which a measurement is taken
experimental_methods: The method used to take a measurement
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- add_metabolomic_associations(filename)[source]#
Imports the ‘metabolomic_associations’ database as a relation between two measurements: the intake_id, a food taken by the organism and registered using dietary questionnaires and the excretion_id, a chemical found in human biological samples, such that, when one takes one component, one will excrete the other. Data comes from Metabolomics studies seeking to identify putative dietary biomarkers.
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- add_microbial_metabolite_identifications(filename)[source]#
Imports the relations pertaining to the “microbial_metabolite_identifications” table. A component (i.e. a metabolite) can be identified as a Microbial Metabolite, which means it has an equivalent in the microbiome. This can have a given reference and a tissue (BioSpecimen) in which it occurs.
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- add_reproducibilities(filename)[source]#
Creates relations between the “reproducibilities” and the “measurements” table, using “initial_id”, an old identifier, for the linkage
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- add_samples(filename)[source]#
Imports the relations pertaining to the “samples” table. A sample will be taken from a given subject and a given tissue (that is, a specimen, which will be blood, urine, etc)
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- add_subjects(filename)[source]#
Imports the relations pertaining to the “subjects” table. Basically, a subject can appear in a given publication, and will be part of a cohort (i.e. a group of subjects)
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- annotate_auto_units(filename)[source]#
Shows the correlations between two units, converted using the rubygem ‘https://github.com/masa16/phys-units’, which standardizes units of measurement for our data
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- annotate_cancers(filename)[source]#
Adds “Cancer” nodes from Exposome-Explorer’s cancers.csv
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- annotate_cohorts(filename)[source]#
Adds “Cohort” nodes from Exposome-Explorer’s cohorts.csv
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- annotate_experimental_methods(filename)[source]#
Adds “ExperimentalMethod” nodes from Exposome-Explorer’s experimental_methods.csv
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- annotate_measurements(filename)[source]#
Adds “Measurement” nodes from Exposome-Explorer’s measurements.csv
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- annotate_microbial_metabolite_info(filename)[source]#
Adds “Metabolite” nodes from Exposome-Explorer’s microbial_metabolite_identifications.csv. These represent all metabolites that have been re-identified as present, for instance, in the microbiome.
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- annotate_publications(filename)[source]#
Adds “Publication” nodes from Exposome-Explorer’s publications.csv
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- annotate_reproducibilities(filename)[source]#
Adds “Reproducibility” nodes from Exposome-Explorer’s reproducibilities.csv. These represent the conditions under which a given study/measurement was carried out
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- annotate_samples(filename)[source]#
Adds “Sample” nodes from Exposome-Explorer’s samples.csv. From a Sample, one can take a series of measurements
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- annotate_specimens(filename)[source]#
Annotates “BioSpecimen” nodes from Exposome-Explorer’s specimens.csv whose ID is already present on the DB. A biospecimen is a type of tissue where a measurement can originate, such as urine, cerebrospinal fluid (CSF), etc
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- annotate_subjects(filename)[source]#
Annotates “Subject” nodes from Exposome-Explorer’s subjects.csv whose ID is already present on the DB
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- annotate_units(filename)[source]#
Adds “Unit” nodes from Exposome-Explorer’s units.csv. A unit can be converted into another (for example, for normalization)
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
- build_from_file(databasepath, Neo4JImportPath, driver, bar=None, do_all=False, keep_counts_and_displayeds=True, keep_cross_properties=False)[source]#
A function able to build a portion of the Exposome-Explorer database in graph format, provided that at least one “Component” (Metabolite) node is present in said database. It works by using that node as a starting point from which to search the rest of the Exposome-Explorer database, finding related nodes there.
- Parameters
databasepath (str) – The path to the database where all Exposome-Explorer CSVs are stored
Neo4JImportPath (str) – The path from which Neo4J is importing data
driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use
bar – The bar() object from alive_bar, in case we want the function to run with do_all=True
do_all (bool) – True if importing the whole database; False if just importing a part of it
keep_counts_and_displayeds (bool) – Whether to keep the properties ending with `_count` & `displayed_` that, although present in the original DB, might be considered not useful for us.
keep_cross_properties (bool) – Whether to keep the properties used to cross-reference in the original Neo4J database.
- Returns
This function modifies the Neo4J Database as desired, but does not produce any particular return.
Note
This won’t work if a “Component” (Metabolite) node is not already present; when building the database, either in full or in parts, you should import the respective Components first
Warning
Due to the script’s design, only nodes which have a connection to nodes previously present on the database will be imported. This is on purpose: unconnected nodes don’t mean much in a Graph DataBase
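The selection rule described in the Warning (only rows connected to something already in the graph get imported) can be pictured with a small in-memory sketch. This is a toy model under stated assumptions, not the function's implementation, which works against Neo4J and the CSV files; `connected_rows`, `links`, and the ID strings are all hypothetical.

```python
from collections import deque


def connected_rows(seed_ids, links):
    """Return every ID reachable from the seeds through the given links.

    `links` is a list of (id_a, id_b) pairs standing in for the references
    between tables; both names are hypothetical.
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    seen = set(seed_ids)
    queue = deque(seed_ids)
    while queue:
        current = queue.popleft()
        for other in neighbours.get(current, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


links = [
    ("component:1", "measurement:10"),
    ("measurement:10", "sample:100"),
    ("measurement:20", "sample:200"),  # never reached from component:1
]
imported = connected_rows(["component:1"], links)
print(sorted(imported))
```

Starting from a single Component, the second measurement and its sample are left out because nothing links them to the seed, which mirrors why Components have to be imported first.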
- import_csv(filename, label)[source]#
Imports a given CSV into Neo4J. This CSV must be present in Neo4J’s Import Path
- Parameters
tx (neo4j.Session) – The session under which the driver is running
filename (str) – The name of the CSV file that is being imported
label (str) – The label of the Neo4J nodes that will be imported, with the columns of the CSV being its properties.
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
Note
For this to work, you HAVE TO have APOC available on your Neo4J installation
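As a rough idea of what such an import could look like, the sketch below combines Cypher's `LOAD CSV WITH HEADERS` with APOC's `apoc.create.node` procedure, so that each CSV column becomes a node property. The exact query used by the package is not shown in this page, so the query text, the `build_import_query` and `import_csv_sketch` helpers, and the `RecordingTx` stand-in are assumptions for illustration only.

```python
import re


def build_import_query(filename):
    """Build a Cypher statement that creates one node per CSV row (assumed shape)."""
    # Only plain file names are accepted, since the name is interpolated into the query.
    if not re.fullmatch(r"[\w.-]+\.csv", filename):
        raise ValueError(f"unexpected CSV file name: {filename!r}")
    return (
        f"LOAD CSV WITH HEADERS FROM 'file:///{filename}' AS line\n"
        "CALL apoc.create.node([$label], line) YIELD node\n"
        "RETURN count(node) AS imported"
    )


def import_csv_sketch(tx, filename, label):
    """Run the query on any object exposing run(query, **parameters)."""
    return tx.run(build_import_query(filename), label=label)


class RecordingTx:
    """Stand-in for a Neo4J session or transaction, so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def run(self, query, **parameters):
        self.calls.append((query, parameters))
        return None


tx = RecordingTx()
import_csv_sketch(tx, "cancers.csv", "Cancer")
print(tx.calls[0][0])
```

The label is passed as a query parameter, while the file name is validated before being placed in the query text; with a real driver, `tx` would be the session or transaction object handed to the function.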
- remove_counts_and_displayeds(inputfile, outputfile)[source]#
Removes `_count` & `displayed_` text-strings from a given file, so that, when processing it with the other functions present in this document, they ignore the columns containing said text-strings, which represent properties considered not useful for our program. This is, of course, not the most elegant solution, but it works.
- Parameters
- Returns
The function does not have a return; instead, it transforms `inputfile` into `outputfile`
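The function above works on text strings. A different way to reach a similar end result, namely dropping the unwanted columns from a CSV before it is imported, is sketched below. This is an alternative technique and not the package's method; `drop_marked_columns` and the sample column names are hypothetical.

```python
import csv
import io

MARKERS = ("_count", "displayed_")


def drop_marked_columns(source, target):
    """Copy CSV rows from `source` to `target`, skipping marked columns."""
    reader = csv.reader(source)
    writer = csv.writer(target, lineterminator="\n")
    header = next(reader)
    # Keep only the columns whose header contains none of the markers.
    keep = [index for index, name in enumerate(header)
            if not any(marker in name for marker in MARKERS)]
    writer.writerow([header[index] for index in keep])
    for row in reader:
        writer.writerow([row[index] for index in keep])


source = io.StringIO(
    "id,name,displayed_name,publication_count\n"
    "1,Coffee,coffee,12\n"
)
target = io.StringIO()
drop_marked_columns(source, target)
print(target.getvalue())
```

Filtering by header keeps the remaining columns untouched, at the cost of rewriting the whole file.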
- remove_cross_properties()[source]#
Removes some properties that were added by the other functions present in this script, that are used to cross-reference the different tables in the Relational Database EE comes from, and that, in a Graph Database, are no longer necessary.
- Parameters
tx (neo4j.Session) – The session under which the driver is running
- Returns
A Neo4J connexion to the database that modifies it accordingly.
- Return type
CanGraph.ExposomeExplorer.main module#
A Python module that leverages the functions present in the build_database module to recreate the exposome-explorer database using a graph format and Neo4J, and then provides a GraphML export file.
Please note that, to work, the functions here presuppose that you have access to Exposome-Explorer’s internal CSVs, and that you have placed them in the folder provided as `sys.argv[4]`. These CSVs are confidential, and can only be accessed upon request to the International Agency for Research on Cancer.
For more details on how to run this script, please consult the package’s README
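How the GraphML export is produced is not detailed in this page. One possibility, given that APOC is already required, is APOC's `apoc.export.graphml.all` procedure, sketched below. The `export_graphml_sketch` helper, the default file name, and the `RecordingSession` stand-in are assumptions; on a real installation, APOC file export also has to be enabled in the server configuration.

```python
def export_graphml_sketch(session, filename="graph.graphml"):
    """Ask APOC to write the whole graph to a GraphML file (assumed approach)."""
    query = "CALL apoc.export.graphml.all($file, {useTypes: true})"
    return session.run(query, file=filename)


class RecordingSession:
    """Stand-in for a Neo4J session, so the sketch runs without a database."""

    def __init__(self):
        self.calls = []

    def run(self, query, **parameters):
        self.calls.append((query, parameters))
        return None


session = RecordingSession()
export_graphml_sketch(session, "exposome_explorer.graphml")
print(session.calls[0])
```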