The HELM team have identified discrete pieces of development work that are important to the project but do not currently figure in the project plans. We would like to issue these as challenges to the wider community, for interested developers to investigate. Outlines of the work areas are shown below. They are a guide to the work that could be done and show areas of interest; they are not a comprehensive set of requirements. The project would welcome any work performed in one (or more!) of these outlined areas, or even part of the work specified in an area.

We would be very happy to work with developers in any way that is helpful to them. For example the group could:

...

If you are interested, please get in touch via info@OpenHELM.org.


HELM Grammar 


HELM currently does not have a well-defined grammar. While the syntax is defined and a grammar is implied, the grammar should be specified in a grammar file, such as the .g4 format used by the open source ANTLR tool. This would allow parsers to be generated for other languages such as Python, C# and JavaScript. It would also give structure and documentation to the HELM syntax as it evolves over time.


The outcome should be to add HELM to the following grammar library: https://github.com/antlr/grammars-v4
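To make the benefit of an explicit grammar concrete, here is a minimal sketch of a hand-written Python parser for a deliberately tiny subset of the notation: a single simple polymer such as PEPTIDE1{A.[meA].C}. The production rules in the comments, and the names SimplePolymer and parse_simple_polymer, are illustrative assumptions and not the official HELM grammar. Branch monomers, connections, groupings and annotations are left out on purpose. A generated parser would replace exactly this kind of ad hoc code.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed subset grammar (illustrative only, NOT the HELM specification):
#   simple_polymer : POLYMER_ID '{' monomer ('.' monomer)* '}'
#   POLYMER_ID     : ('PEPTIDE' | 'RNA' | 'CHEM') positive integer
#   monomer        : single letter | '[' any characters except ']' ']'
_MONOMER = r"(?:\[[^\]]+\]|[A-Za-z])"
_POLYMER_RE = re.compile(r"^(PEPTIDE|RNA|CHEM)([1-9][0-9]*)\{(.+)\}$")
_SEQUENCE_RE = re.compile(rf"{_MONOMER}(?:\.{_MONOMER})*")
_MONOMER_RE = re.compile(_MONOMER)


@dataclass
class SimplePolymer:
    polymer_type: str
    index: int
    monomers: List[str]


def parse_simple_polymer(text: str) -> SimplePolymer:
    """Parse one simple polymer in the assumed subset, or raise ValueError."""
    match = _POLYMER_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a simple polymer in the assumed subset: {text!r}")
    polymer_type, index, body = match.groups()
    if _SEQUENCE_RE.fullmatch(body) is None:
        raise ValueError(f"malformed monomer sequence: {body!r}")
    # Drop the surrounding brackets of multi-character monomer IDs.
    monomers = [
        token[1:-1] if token.startswith("[") else token
        for token in _MONOMER_RE.findall(body)
    ]
    return SimplePolymer(polymer_type, int(index), monomers)


if __name__ == "__main__":
    print(parse_simple_polymer("PEPTIDE1{A.[meA].C}"))
```

The point of the exercise is that every rule in the comment block had to be guessed from examples. A published grammar file would make those rules explicit and testable.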


Fragmentation


HELM represents biomolecules as polymeric structures composed of monomeric building blocks. Typically, these are collected in a list or database that contains the full chemical graph of each building block alongside additional data specifying its context in a biomolecule. Such a monomer dictionary is the very foundation of representing biomolecules at an atomic level.

...

- Automate the entire process of establishing the initial monomer dictionary and file format conversion as much as possible

 

Canonicalization

 


Specific purposes, e.g. registration of biomolecules, require the ability to identify and filter for distinct biomolecules in order to create a database without redundancies. Typically, a single biomolecule can be expressed by more than one HELM representation. Based on the principles HELM applies to represent a biomolecule, the challenge of generating a canonical representation can be broken down into three main objectives:

...

- Canonicalization of the chemical structures within the monomer dictionary: The chemical representation of each monomer in the monomer dictionary must be unique in its given context, i.e. alternative structural tautomers, representations of aromatic rings, ionic forms of certain functional groups and other features considered to yield chemically equivalent structures must not be kept as separate entries in the monomer dictionary.

 


There is some canonicalization functionality in the HELM toolkit, but it is limited in scope.
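As a small illustration of why one biomolecule can have several HELM representations, consider the connection section: the same bond can be written from either end, and the connections can be listed in any order. The sketch below normalises just that section. It assumes connections of the form source,target,position:Rgroup-position:Rgroup separated by "|", and assumes that the direction of a connection carries no chemical meaning. The function names are hypothetical and are not part of the HELM toolkit. Polymer renumbering, cyclic polymers and monomer structure canonicalization are not addressed.

```python
from typing import List, Tuple

# One end of a connection: (polymer id, monomer position, R-group label).
End = Tuple[str, int, str]


def _parse_connection(conn: str) -> Tuple[End, End]:
    """Split 'P1,P2,8:R3-3:R3' into its two ends (assumed format)."""
    source, target, bond = conn.split(",")
    left, right = bond.split("-")
    pos_left, r_left = left.split(":")
    pos_right, r_right = right.split(":")
    return (source, int(pos_left), r_left), (target, int(pos_right), r_right)


def canonical_connections(section: str) -> str:
    """Return the connection section with a deterministic ordering.

    Each connection is written with its smaller end first, and the
    connections themselves are sorted. Polymer ids are compared as plain
    strings, which is deterministic but not numeric.
    """
    if not section:
        return ""
    pairs: List[Tuple[End, End]] = []
    for conn in section.split("|"):
        first, second = sorted(_parse_connection(conn))
        pairs.append((first, second))
    pairs.sort()
    return "|".join(
        f"{a[0]},{b[0]},{a[1]}:{a[2]}-{b[1]}:{b[2]}" for a, b in pairs
    )


if __name__ == "__main__":
    print(canonical_connections("PEPTIDE1,PEPTIDE1,8:R3-3:R3"))
```

Two strings that differ only in how their connections are written then compare equal after normalisation, which is the property a registration system needs for this part of the notation.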

 


 


HELM adopters need to search through a repository of HELM strings, whether to check for uniqueness, find similarities, retrieve data or answer scientific questions.
One of the limitations of the current toolkit is the lack of a search tool. Given a large set of HELM strings, it is not possible to find matches of a particular term using HELM notation or other descriptors.
It should be possible to search across HELM strings using a variety of formats to retrieve information of interest.

A Cambridge University student completed a proof-of-concept search engine in summer 2014, and the code is available on GitHub. It included exact match and substructure searching of atom/bond structures, exact match and substring searching of sequence-level information, and combinations of these. It has only limited user interface components.

 


Objectives/requirements:

  • The data source for the HELM search must be configurable, e.g. the data can reside in a local file or a database.
  • Support exact match, substructure and motif searches. The input criteria can combine any of the following using AND/OR:
    • Natural analog sequence (peptides, nucleotides)
    • Monomer ID
    • HELM string (including inline HELM notation)
    • SMILES / SMARTS / Chemical structure drawing tool
    • Specific attachment point on monomer.
            Example: CHEM1,*,1:R1 matches all attachments to R1 on the CHEM1 monomer.
  • Fuzzy searches
    • Canonicalized vs. non-canonicalized HELM (permutation, combination)
    • Gap search (*)
    • Ambiguities (TBD)
  • A component-based, fully configurable web service that exposes the search
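The sequence-level part of these requirements can be sketched in a few lines of Python. The example below searches any iterable of HELM strings (a list, an open file, or rows from a database cursor, which is one simple way to keep the data source configurable) for a monomer motif. It assumes that "*" in a motif stands for exactly one arbitrary monomer, which is only one possible reading of the gap search listed above. It also looks only at the first simple polymer of each string and handles peptide-style sequences only. All function names are hypothetical. Substructure search and canonicalization are out of scope here.

```python
import re
from typing import Iterable, Iterator, List

# A monomer is a single letter or a bracketed multi-character ID.
_MONOMER_RE = re.compile(r"\[[^\]]+\]|[A-Za-z]")


def monomers_of(helm: str) -> List[str]:
    """Monomer IDs of the first simple polymer (text between the first { and })."""
    body = helm[helm.index("{") + 1 : helm.index("}")]
    return [
        token[1:-1] if token.startswith("[") else token
        for token in _MONOMER_RE.findall(body)
    ]


def matches_motif(monomers: List[str], motif: List[str]) -> bool:
    """True if motif occurs as a contiguous run; '*' matches any one monomer."""
    width = len(motif)
    for start in range(len(monomers) - width + 1):
        window = monomers[start : start + width]
        if all(q == "*" or q == m for q, m in zip(motif, window)):
            return True
    return False


def search(source: Iterable[str], motif: List[str]) -> Iterator[str]:
    """Yield every HELM string from source whose sequence contains motif."""
    for helm in source:
        if matches_motif(monomers_of(helm), motif):
            yield helm


if __name__ == "__main__":
    data = [
        "PEPTIDE1{A.C.G.K}$$$$",
        "PEPTIDE1{A.[meA].G}$$$$",
        "PEPTIDE1{G.K}$$$$",
    ]
    for hit in search(data, ["A", "*", "G"]):
        print(hit)
```

Because search accepts any iterable and returns a generator, the same function could sit behind a web service endpoint or be combined with other filters using ordinary AND/OR logic in the caller.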

 

 

...