HELM Challenges

The HELM team have identified discrete pieces of development work that are important to the project but do not currently figure in the project plans. We would like to issue these as challenges to the wider community, for interested developers to investigate. Outlines of the work areas are shown below. They are a guide to the work that could be done and show areas of interest; they are not a comprehensive set of requirements.

We would be very happy to work with developers in any way that is helpful to them. For example, the group could:

  • Draft a more specific requirements document in accordance with the scope the developer is happy to take on.
  • Arrange regular meetings to discuss technical questions and explain the concepts used in previous development work.

If you are interested, please get in touch via info@OpenHELM.org.


HELM Grammar 


HELM does not currently have a well-defined grammar. While the syntax is defined and a grammar is implied, the grammar should be specified in a grammar file format such as the .g4 format used by the open source ANTLR tool. This would allow parsers to be generated for other languages such as Python, C# and JavaScript. It would also give structure and documentation to the HELM syntax as it evolves over time.

 

The outcome should be to add HELM to the following grammar library: https://github.com/antlr/grammars-v4
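
Until a formal grammar exists, the structure implied by the notation can be illustrated with a small hand-written parser. The following is a minimal sketch in Python, not a substitute for a .g4 grammar. It assumes a simplified subset of the notation: only the first "$"-delimited section (the simple polymers) is read, polymers are separated by "|", monomers by ".", and multi-character monomer IDs are wrapped in square brackets. The function names are hypothetical, and the naive splitting on "$" and "|" would break on inline content that contains those characters.

```python
import re

# Matches one simple polymer such as "PEPTIDE1{A.[dF].C}".
# Assumption: only the PEPTIDE, RNA and CHEM polymer types are handled.
POLYMER_RE = re.compile(r"^(PEPTIDE|RNA|CHEM)(\d+)\{(.*)\}$")


def split_monomers(body):
    """Split a polymer body on '.' while keeping bracketed IDs intact."""
    monomers, current, depth = [], "", 0
    for ch in body:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "." and depth == 0:
            monomers.append(current)
            current = ""
        else:
            current += ch
    monomers.append(current)
    return monomers


def parse_simple_polymers(helm):
    """Return {polymer_id: [monomer, ...]} for the first '$' section."""
    first_section = helm.split("$", 1)[0]
    polymers = {}
    for chunk in first_section.split("|"):
        match = POLYMER_RE.match(chunk)
        if match is None:
            raise ValueError(f"not a simple polymer: {chunk!r}")
        polymer_type, index, body = match.groups()
        polymers[polymer_type + index] = split_monomers(body)
    return polymers
```

A generated parser would replace both functions with rules from the grammar file; the point of the sketch is only to show which tokens and nesting such a grammar would need to describe.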


Fragmentation


HELM represents biomolecules as polymeric structures, which are composed of monomeric building blocks. Typically, these are collected in a list or database which contains the full chemical graph of each building block along with additional data specifying its context in a biomolecule. Such a monomer dictionary is the very foundation of representing biomolecules at an atomic level.
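
As an illustration of what such a dictionary entry might hold, here is a minimal Python sketch. The class and field names are hypothetical and are not taken from the HELM toolkit; the chemical graph is assumed to be stored as a SMILES string, and the context is reduced to a polymer type, a natural analog and a set of attachment points.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Monomer:
    """One monomer dictionary entry (field names are illustrative)."""
    monomer_id: str                # symbol used in the notation, e.g. "G"
    polymer_type: str              # context, e.g. "PEPTIDE", "RNA" or "CHEM"
    smiles: str                    # chemical graph of the building block
    natural_analog: str = ""       # closest natural monomer, if any
    attachment_points: tuple = ()  # e.g. (("R1", "H"), ("R2", "OH"))


class MonomerDictionary:
    """Monomers keyed by (polymer_type, monomer_id)."""

    def __init__(self):
        self._entries = {}

    def add(self, monomer):
        key = (monomer.polymer_type, monomer.monomer_id)
        if key in self._entries:
            raise KeyError(f"duplicate monomer {key}")
        self._entries[key] = monomer

    def get(self, polymer_type, monomer_id):
        return self._entries[(polymer_type, monomer_id)]


# Example entry: glycine as a peptide monomer (attachment points assumed).
glycine = Monomer("G", "PEPTIDE", "NCC(=O)O", "G", (("R1", "H"), ("R2", "OH")))
```

Keying on the polymer type as well as the ID reflects the idea that the same symbol can mean different things in different contexts.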

A problem often encountered within the community when adopting HELM is the migration of legacy molecules, which are currently stored predominantly in chemical file formats. Given a large set of biomolecules, a consistent and normalized monomer dictionary needs to be found that fully represents the space of biomolecules in the set and satisfies common domain knowledge about the context of each identified monomer. In a next step, the generated monomer dictionary would be used to convert the biomolecules from a chemical file format into the HELM representation.

Some work has already been done on peptides by the EBI. Their code relied on Pipeline Pilot and is linked to their internal systems, but is available on request. The HELM toolkit also includes a limited fragmenter.

Main objectives of this project:

-          Collate current opinions on what qualifies a (partial) chemical structure as a monomer

-          Implement rules for fragmenting the chemical graph of a biomolecule into monomers

-          Attempt to detect the chemical context of all fragments and generate additional data for each monomer accordingly

-          Define and implement guidelines/best practices for normalization of a monomer dictionary

-          Produce a valid HELM notation from input biomolecules in the context of the generated monomer dictionary

-          Automate the entire process of establishing the initial monomer dictionary and file format conversion as much as possible
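
To make the fragmentation objective concrete, the sketch below cuts a toy molecular graph at amide bonds and returns the resulting fragments as connected components. It is a deliberately naive illustration in plain Python, under several assumptions: the molecule is given as a list of element symbols plus a list of (atom, atom, bond order) tuples, hydrogens are implicit, and a single rule (a C-N single bond whose carbon carries a C=O) stands in for the full set of fragmentation rules. That rule would also cut side-chain amides and ignores other linkages such as disulfides, so a real implementation would need context-aware rules and a cheminformatics toolkit.

```python
def carbonyl_carbons(atoms, bonds):
    """Indices of carbons that carry a double bond to oxygen."""
    found = set()
    for a, b, order in bonds:
        if order == 2 and {atoms[a], atoms[b]} == {"C", "O"}:
            found.add(a if atoms[a] == "C" else b)
    return found


def fragment(atoms, bonds):
    """Cut every amide bond and return the remaining connected components."""
    carbonyls = carbonyl_carbons(atoms, bonds)
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b, order in bonds:
        is_amide = (
            order == 1
            and {atoms[a], atoms[b]} == {"C", "N"}
            and (a in carbonyls or b in carbonyls)
        )
        if not is_amide:
            neighbours[a].add(b)
            neighbours[b].add(a)

    fragments, seen = [], set()
    for start in range(len(atoms)):
        if start in seen:
            continue
        # Depth-first walk over the bonds that were not cut.
        stack, component = [start], set()
        while stack:
            atom = stack.pop()
            if atom in component:
                continue
            component.add(atom)
            stack.extend(neighbours[atom] - component)
        seen |= component
        fragments.append(sorted(component))
    return fragments


# Toy input: glycylglycine, N-C-C(=O)-N-C-C(=O)-O, heavy atoms only.
atoms = ["N", "C", "C", "O", "N", "C", "C", "O", "O"]
bonds = [(0, 1, 1), (1, 2, 1), (2, 3, 2), (2, 4, 1),
         (4, 5, 1), (5, 6, 1), (6, 7, 2), (6, 8, 1)]
```

Each fragment would then be capped at the cut points, normalized and looked up in (or added to) the monomer dictionary.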

 

Canonicalization


Specific purposes, e.g. registration of biomolecules, require the ability to identify and filter for distinct biomolecules in order to create a database without redundancies. Typically, a single biomolecule can be expressed by more than one HELM representation. Based on the principles HELM applies to represent a biomolecule, the challenge of generating a canonical representation can be broken down into three main objectives:

-          Canonicalization of the HELM notation: In the context of a given monomer dictionary, a unique HELM notation string must be generated.

-          Canonicalization of the monomer dictionary: All monomers of a given monomer dictionary must be independent, i.e. the combination of two monomers must not itself be present in the dictionary as a third monomer within the same context.

-          Canonicalization of the chemical structures within the monomer dictionary: The chemical representation of all monomers in the monomer dictionary must be unique in their given context, i.e. alternative structural tautomers, representations of aromatic rings, ionic forms of certain functional groups and other features considered to yield chemically equivalent structures must not be kept as separate entries in the monomer dictionary.


There is some canonicalization functionality in the HELM toolkit, but it is limited in scope.
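
The first objective, a unique notation string, can be illustrated with a small Python sketch. It covers only one source of redundancy, the arbitrary order and numbering of simple polymers, and is not the toolkit's algorithm. The function name and input layout are assumptions: polymers are passed as a mapping from polymer ID to a monomer list, and connections as (source ID, source point, target ID, target point) tuples. Polymers with identical type and sequence are not disambiguated, so ties can still yield different strings.

```python
import re


def canonical_helm(polymers, connections):
    """Build one deterministic string from equivalent inputs.

    polymers:    {"PEPTIDE2": ["A", "C"], ...}
    connections: [("PEPTIDE2", "1:R3", "PEPTIDE1", "2:R3"), ...]
    """
    def polymer_type(polymer_id):
        return re.match(r"[A-Z]+", polymer_id).group()

    # Order polymers by type and monomer sequence, not by their original IDs.
    ordered = sorted(polymers.items(),
                     key=lambda item: (polymer_type(item[0]), item[1]))

    # Renumber IDs per polymer type in the new order.
    new_ids, counters = {}, {}
    for old_id, _ in ordered:
        kind = polymer_type(old_id)
        counters[kind] = counters.get(kind, 0) + 1
        new_ids[old_id] = kind + str(counters[kind])

    # Rewrite each connection with the new IDs and fix the order of its ends.
    rewritten = []
    for src, src_point, dst, dst_point in connections:
        ends = sorted([(new_ids[src], src_point), (new_ids[dst], dst_point)])
        (a_id, a_point), (b_id, b_point) = ends
        rewritten.append(a_id + "," + b_id + "," + a_point + "-" + b_point)

    polymer_part = "|".join(
        new_ids[old_id] + "{" + ".".join(monomers) + "}"
        for old_id, monomers in ordered
    )
    return polymer_part + "$" + "|".join(sorted(rewritten)) + "$$$"
```

Two inputs that describe the same pair of disulfide-linked chains with swapped IDs then produce the same string, which is the property a registration system needs for duplicate detection.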



Search


HELM adopters need to search through a repository of HELM strings: to check for uniqueness, to find similarities, to retrieve data or to answer scientific questions.
One of the limitations of the current toolkit is the lack of a search tool. Given a large set of HELM strings, it is not possible to find matches of a particular term using HELM notation or other descriptors.
It should be possible to search across HELM strings using a variety of input formats to retrieve information of interest.

A Cambridge University student completed a proof-of-concept search engine in summer 2014, and the code is available on GitHub. It included exact match and substructure searching of atom/bond structures, exact match and substring searching of sequence-level information, and combinations of these. There are limited user interface components.


Objectives/requirements:

  • The data source of the HELM search must be configurable, i.e. the data can reside in a local file or a database.
  • Exact match, substructure or motif search. The input criteria can combine any of the following using AND/OR:
    • Natural analog sequence (peptides, nucleotides)
    • Monomer ID
    • HELM string (including inline HELM notation)
    • SMILES / SMARTS / Chemical structure drawing tool
    • Specific attachment point on monomer.
            Example: CHEM1,*,1:R1 matches all attachments to R1 on the CHEM1 monomer.
  • Fuzzy searches
    • Canonicalized vs. not canonicalized HELM (permutation, combination)
    • Gap search (*)
    • Ambiguities (TBD)
  • Component-based, fully configurable web service for search
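
The sequence-level part of these requirements can be sketched in a few lines of Python. This is an illustrative sketch only, not the proof-of-concept code mentioned above. It assumes each repository entry has already been reduced to a list of monomer IDs, and it interprets "*" in a query as a wildcard for exactly one monomer, which is one possible reading of the gap search requirement. The function names and the in-memory repository are hypothetical.

```python
def matches_at(sequence, pattern, start):
    """True if pattern matches sequence at start; '*' matches any one monomer."""
    if start + len(pattern) > len(sequence):
        return False
    return all(p == "*" or p == sequence[start + i]
               for i, p in enumerate(pattern))


def find_motif(sequence, pattern):
    """Return every start index at which pattern occurs in sequence."""
    return [i for i in range(len(sequence) - len(pattern) + 1)
            if matches_at(sequence, pattern, i)]


def search(repository, pattern, exact=False):
    """Return the keys of repository entries whose sequence matches."""
    hits = []
    for key, sequence in repository.items():
        if exact:
            ok = len(sequence) == len(pattern) and matches_at(sequence, pattern, 0)
        else:
            ok = bool(find_motif(sequence, pattern))
        if ok:
            hits.append(key)
    return hits


# Hypothetical in-memory repository: key -> monomer IDs of one simple polymer.
repository = {
    "m1": ["A", "G", "C", "K"],
    "m2": ["A", "[dF]", "C"],
    "m3": ["G", "C"],
}
```

Substructure queries in SMILES or SMARTS, attachment point queries and AND/OR combinations would sit on top of a layer like this and need a cheminformatics toolkit; a configurable data source would replace the in-memory dictionary.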