Q&A for FAIR webinar on Knowledge Graphs by Ilaria Maresi


Questions Asked

Answers (contact ilaria@thehyve.nl for further info)


You mentioned creating superclasses of some nodes — is this done at query runtime, or beforehand, while creating the KG?

This is done when you create the semantic layer of the Knowledge Graph. When you make this semantic model you define classes and relationships between them, including subclassing. So, yes, before querying! 
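To make the idea concrete, here is a plain-Python sketch of why subclassing at modelling time pays off at query time (class and instance names are hypothetical; a real knowledge graph would express this with rdfs:subClassOf triples rather than Python dictionaries):

```python
# Sketch: rdfs:subClassOf-style relationships defined when the semantic
# model is built, then used at query time to also match subclasses.
SUBCLASS_OF = {            # child class -> parent class (hypothetical model)
    "SmallMolecule": "Drug",
    "Biologic": "Drug",
    "Drug": "ChemicalEntity",
}

INSTANCE_OF = {            # instance -> its asserted class
    "aspirin": "SmallMolecule",
    "adalimumab": "Biologic",
}

def classes_of(instance):
    """All classes an instance belongs to, following subClassOf upwards."""
    cls = INSTANCE_OF[instance]
    result = {cls}
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        result.add(cls)
    return result

def instances_of(target_class):
    """Query: every instance whose class (or any superclass) matches."""
    return sorted(i for i in INSTANCE_OF if target_class in classes_of(i))

print(instances_of("Drug"))  # both aspirin and adalimumab match
```

Because the subclass axioms live in the model, a query for the superclass "Drug" finds instances asserted only as "SmallMolecule" or "Biologic" — no extra work at query time.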


Which tools and technologies do you typically use in your end-to-end knowledge graph pipelines? For instance: technologies used for parsing/serializing RDF, triple store, query engine, reporting/visualization.

Creating the semantic model can be done using tools such as Protege or WebProtege (https://protege.stanford.edu), or, if you prefer, you can simply write the triples out in a text editor. To create the RDF triples from source data there's a handy Python library focused on RDF, RDFLib (https://rdflib.readthedocs.io/en/stable/). Once the triples are in a triple store, such as AllegroGraph, you can query the graph using the SPARQL endpoint. For visualisations I recommend WebVOWL (http://vowl.visualdataweb.org/webvowl.html) for small ontological models rooted in OWL; for viewing parts of your knowledge graph there is Gruff (https://allegrograph.com/products/gruff/), which works together with AllegroGraph to create visualisations based on SPARQL queries. For more information I recommend this blog post on useful tools for building knowledge graphs (https://blog.thehyve.nl/blog/tools-building-knowledge-graphs).
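As an illustration of the "create RDF triples from source data" step, here is a plain-Python sketch that serialises one source record as N-Triples lines (the namespace, field names, and values are made up for illustration; RDFLib provides the same idea through its Graph and Namespace APIs):

```python
# Sketch: turn one source record into RDF triples serialised as N-Triples.
# The namespace, predicates, and record fields are hypothetical.
EX = "http://example.org/"

def record_to_ntriples(record):
    subject = f"<{EX}compound/{record['id']}>"
    triples = [
        (subject, f"<{EX}hasName>", f'"{record["name"]}"'),
        (subject, f"<{EX}targets>", f"<{EX}protein/{record['target']}>"),
    ]
    # N-Triples: subject predicate object, terminated by " ."
    return [f"{s} {p} {o} ." for s, p, o in triples]

record = {"id": "C001", "name": "aspirin", "target": "PTGS1"}
for line in record_to_ntriples(record):
    print(line)
```

In a real pipeline you would let RDFLib handle URI construction, literal escaping, and serialisation format, but the mapping logic — source field to subject/predicate/object — looks much the same.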


What was the underlying graph database or triplestore used?

In this instance we used AllegroGraph, but there are many other triple stores out there! Please refer to https://blog.thehyve.nl/blog/tools-building-knowledge-graphs for alternatives.


What triple store do you recommend for good performance?

There are a lot of triple stores out there, and unfortunately I don't have enough experience with all of them to recommend one over another, but this is a great resource that sums up information on all the major triple stores: https://www.w3.org/wiki/LargeTripleStores. See also https://blog.thehyve.nl/blog/tools-building-knowledge-graphs. In terms of impact on a typical knowledge graph project, the choice of triple store is probably not a large factor unless you have a very large knowledge base.


What frameworks have you used for your ETLs? Have you used Talend?

See https://blog.thehyve.nl/blog/tools-building-knowledge-graphs for some ETL frameworks we have tried in the past, but typically we use Python code leveraging rdflib.


What are your thoughts about RDF versus property graph? Which one would you recommend for drug discovery data?

In terms of results, what matters most is defining an effective data model and ensuring the use cases are properly addressed. The technology choice is secondary to this, and much depends on the environment you are working with. That being said, RDF makes it easier to integrate public ontologies and build on the linked web, while property graphs are a bit easier to get started with. It also matters whether the project is stand-alone or needs to contribute to a larger effort.


Can you perhaps give an example of one drug that has been discovered thanks to semantic models or KGs, please? Can you elaborate on the role of manual data curation for building the models, please?

Published examples of how semantic technologies are contributing to drug discovery can be expected over the next five years, as the approach has only recently become mainstream in drug development operations. It helps to reuse any existing related conceptual or semantic models to save time; this requires collaboration between data engineers and subject matter experts.


What happens if other people misuse properties, e.g. schema:drug is used in a "creative" (= unintended) way? Is there a single way to represent the information?

This is made difficult by the flexibility of model building. Validation rules such as SHACL can help, and unintended misuse of a particular ontology can also be mitigated through query validation and consultation with subject matter experts. There is no single representation — there are many ways! How you model concepts can actually be very subjective, and it depends on the data you have and the use cases you are addressing. This subjectivity is part of what makes semantic modelling interesting, but it also makes it challenging to converge on one model. My advice is to use existing models and ontologies where possible, to avoid creating multiple models of similar spaces. There are also some standard practices you can apply while modelling (take a look at OWL: https://www.w3.org/TR/owl-guide/ and RDFS: https://www.w3.org/TR/rdf-schema/ to get started).
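In the same spirit as SHACL validation, here is a toy sketch of a shape-style range check — for each property, declare the class its values must belong to, then flag violating triples (property and class names are hypothetical; a real project would write SHACL shapes and use a validator such as pySHACL):

```python
# Toy sketch of shape-style validation: for each property, the class its
# values must belong to. Property and class names are hypothetical.
EXPECTED_RANGE = {"treats": "Disease", "hasIngredient": "ChemicalEntity"}

TYPES = {"aspirin": "Drug", "headache": "Disease", "water": "ChemicalEntity"}

def validate(triples):
    """Return the (subject, property, value) triples whose value has the wrong type."""
    violations = []
    for s, p, o in triples:
        expected = EXPECTED_RANGE.get(p)
        if expected and TYPES.get(o) != expected:
            violations.append((s, p, o))
    return violations

data = [
    ("aspirin", "treats", "headache"),   # ok: headache is a Disease
    ("aspirin", "treats", "water"),      # violation: water is not a Disease
]
print(validate(data))
```

Running checks like this as part of the ETL catches "creative" property use before it reaches the graph, rather than after a query returns something strange.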


Is there a place to see all the ontologies available for life sciences?

BioPortal (https://bioportal.bioontology.org) or OLS (https://www.ebi.ac.uk/ols/index) are both great resources for this.


How hard is this technically — what tools to use practically? I have 20 tables with data, how do I start?

The first step would be to determine the kinds of questions you want your graph to answer. From there, you can start building a semantic model to represent the concepts in your data and the relationships between them. Once your model is complete you can start instantiating it with data, thereby creating the Knowledge Graph. That's a very quick overview! If you want to get a more in-depth understanding of the tools available for building a Knowledge Graph I would suggest this blog post we published on data engineering tools for Knowledge Graphs: https://blog.thehyve.nl/blog/tools-building-knowledge-graphs. This Knowledge Graph Seminar session from Stanford could also be a good start: https://www.youtube.com/watch?time_continue=4920&v=bvwjG-3qAmY&feature=emb_title. Feel free to get in touch with us if you need any advice or help on creating a Knowledge Graph!
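The steps above — start from the questions, map your tables into the model, then query — can be sketched in a few lines of plain Python (table, column, and predicate names are invented for illustration; a real build would emit RDF via RDFLib and query with SPARQL):

```python
# Minimal end-to-end sketch: relational rows -> triples -> a pattern query.
# Table, column, and predicate names are hypothetical.
compounds = [
    {"id": "C001", "name": "aspirin", "target": "PTGS1"},
    {"id": "C002", "name": "ibuprofen", "target": "PTGS2"},
]

# Step 1: instantiate the (tiny) semantic model as subject-predicate-object triples.
graph = []
for row in compounds:
    graph.append((row["id"], "hasName", row["name"]))
    graph.append((row["id"], "targets", row["target"]))

# Step 2: answer a competency question: "which compounds target PTGS1?"
def match(graph, predicate, obj):
    return [s for s, p, o in graph if p == predicate and o == obj]

print(match(graph, "targets", "PTGS1"))
```

The point of the sketch: each of your 20 tables becomes a mapping from rows to triples, and the competency questions you wrote down first tell you which predicates the model actually needs.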


For your Clinical Trial project, using data from ClinicalTrials.gov, do you use natural language processing (NLP) to read the clinical trial outcomes and translate them into triples, to be incorporated into your semantic model and knowledge graph?

For the clinical trials we were mainly interested in their identifiers, which can be accessed via the ClinicalTrials.gov API. More information on that here: https://clinicaltrials.gov/api/gui.


For data updates to a source, you need to rebuild the graph. Any insight into how long that takes?

This depends on the tools you're using, your ETL process and the scale of your graph. Without knowing those it's hard to say!


How long did that take? ELNs often have PDFs or PowerPoint decks. How do you extract data from those formats? Is it better to create data marts from the GraphDB instead of asking a scientist to use SPARQL?

Building the models and KG took a number of months for a team of experts. The biggest challenge was understanding the landscape and the data well enough for correct semantic modelling, rather than the technicalities. Text mining can be used to extract data from these formats, if necessary. SPARQL queries can be difficult to formulate from scratch; this can be tackled through a library of common scientific questions with encoded SPARQL queries, as done by Open PHACTS. Cached data marts are another approach, especially for more complex queries, and can also improve performance.
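The "library of common scientific questions" idea can be sketched as parameterised SPARQL templates that scientists fill in by name, never writing SPARQL from scratch (the prefix, predicate, and query text here are hypothetical, not Open PHACTS' actual queries):

```python
# Sketch: a small library of named scientific questions, each backed by a
# parameterised SPARQL query. Prefix and predicate names are hypothetical.
from string import Template

QUERY_LIBRARY = {
    "compounds_targeting": Template(
        "PREFIX ex: <http://example.org/>\n"
        "SELECT ?compound WHERE { ?compound ex:targets ex:$protein . }"
    ),
}

def build_query(name, **params):
    """Look up a named question and fill in its parameters."""
    return QUERY_LIBRARY[name].substitute(**params)

print(build_query("compounds_targeting", protein="PTGS1"))
```

The resulting query string would then be sent to the triple store's SPARQL endpoint; the scientist only ever chooses a question name and supplies a parameter.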


Can you perform enrichment analysis if you haven't enforced a common standard?

Either over- or under-representation of a particular class in the KG can be analysed for patterns of enrichment.


Can we compare semantic modelling to classical data warehouse modelling? What are the differences between the two?

Semantic modelling encodes meaning and makes the linked data interoperable through the inclusion of metadata and identifiers, following the FAIR principles. Semantics and interoperability are much more difficult to implement in a classical data warehouse.


Can Ilaria comment on the scale of knowledge graphs — how big a dataset can they accommodate? What about query performance? Thanks.

The knowledge graph example contained ~14 million triples, with query times in the order of seconds. Performance will depend on the type of triple store and the query.


Are you using some tool to create triples from structured and unstructured data sources?

We made use of the open source RDFLib Python library (https://rdflib.readthedocs.io/en/stable/), which is specifically geared towards working with RDF. With RDFLib you can easily create triples from incoming data. I would recommend diving into the documentation to get some further insight!


Any recommendation for mapping IDs across different ID systems please?

If you have different identifiers for the same term you could link them to the concept using skos:altLabel (more information on that here https://www.w3.org/2012/09/odrl/semantic/draft/doco/skos_altLabel.html). Alternatively, you could follow a similar approach as Wikidata by creating different properties for all the IDs. For instance, non-small cell lung carcinoma has quite a few IDs in Wikidata, these are all linked to the concept using properties like 'MeSH code' and 'Disease Ontology ID'. Check out the identifiers section on the term's Wikidata page to get a better idea: https://www.wikidata.org/wiki/Q3658562.
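The Wikidata-style approach — one concept node, one property per external ID system — can be sketched as follows (the property names and ID values here are illustrative, not copied from Wikidata; check the Wikidata page linked above for the real identifiers):

```python
# Sketch of Wikidata-style ID mapping: one concept node, one triple per
# external identifier system. Property names and ID values are illustrative.
concept = "ex:NonSmallCellLungCarcinoma"

id_triples = [
    (concept, "ex:meshCode", "D002289"),
    (concept, "ex:diseaseOntologyID", "DOID:3908"),
]

def ids_for(concept_uri, triples):
    """Collect every external identifier attached to a concept."""
    return {p: o for s, p, o in triples if s == concept_uri}

print(ids_for(concept, id_triples))
```

Because each ID system gets its own property, a query can ask for "the MeSH code of this concept" directly, and adding a new ID system later is just one more triple per concept.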

Source: Recording and slides