If you are interested, please get in touch via info@OpenHELM.org.
The outcome should be to add HELM to the following grammar library: https://github.com/antlr/grammars-v4
HELM represents biomolecules as polymeric structures composed of monomeric building blocks. Typically, these building blocks are collected in a list or database that contains the full chemical graph of each one, along with additional data specifying its context in a biomolecule. Such a monomer dictionary is the foundation for representing biomolecules at an atomic level.
- Automate the entire process of establishing the initial monomer dictionary and file format conversion as much as possible
Specific purposes, e.g. registration of biomolecules, require the ability to identify and filter for distinct biomolecules in order to create a database without redundancies. Typically, a single biomolecule can be expressed with more than one HELM representation. Based on the principles HELM applies to represent a biomolecule, the challenge of generating a canonical representation can be broken down into three main objectives:
- Canonicalization of the chemical structures within the monomer dictionary: The chemical representation of all monomers in the monomer dictionary must be unique in their given context, i.e. alternative structural tautomers, representations of aromatic rings, ionic forms of certain functional groups and other features considered to yield chemically equivalent structures must not be kept as separate entries in the monomer dictionary.
There is some canonicalization functionality in the HELM toolkit, but it is limited in scope.
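As an illustration of why one biomolecule can have several HELM representations, the following minimal sketch normalizes only the order and numbering of simple polymers. It is not the HELM toolkit's algorithm. The function name `canonical_key` is hypothetical, and the sketch assumes a HELM string with an empty connection section and no inline notation containing `|` or `$`.

```python
import re

# Matches one simple polymer such as PEPTIDE1{A.G.C}: type, numeric ID, body.
POLYMER_RE = re.compile(r"^([A-Z]+)\d+\{(.*)\}$")


def canonical_key(helm: str) -> str:
    """Return an order-independent form of a connection-free HELM string.

    Hypothetical helper for illustration only: it sorts the simple
    polymers and renumbers their IDs per polymer type.
    """
    polymers_section, _, rest = helm.partition("$")
    # The section after the first '$' lists connections. Renumbering
    # polymers would invalidate them, so this sketch refuses such input.
    if rest.split("$")[0]:
        raise ValueError("sketch handles only HELM strings without connections")

    parsed = []
    for polymer in polymers_section.split("|"):
        match = POLYMER_RE.match(polymer)
        if match is None:
            raise ValueError(f"unrecognized simple polymer: {polymer!r}")
        parsed.append((match.group(1), match.group(2)))

    # Sort by (polymer type, monomer sequence), then assign fresh IDs.
    parsed.sort()
    counters = {}
    renumbered = []
    for polymer_type, sequence in parsed:
        counters[polymer_type] = counters.get(polymer_type, 0) + 1
        renumbered.append(f"{polymer_type}{counters[polymer_type]}{{{sequence}}}")
    return "|".join(renumbered) + "$" + rest
```

With this sketch, `PEPTIDE2{C.D}|PEPTIDE1{A.G}$$$$` and `PEPTIDE1{A.G}|PEPTIDE2{C.D}$$$$` map to the same key. A real canonicalization would also have to rewrite connections and rely on a canonical monomer dictionary.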
HELM adopters have a need to search through a repository of HELM strings: either to check for uniqueness, to find similarities, for data retrieval or to answer scientific questions.
One of the limitations of the current toolkit is the lack of a search tool. Given a large set of HELM strings, it is not possible to find matches of a particular term using HELM notation or other descriptors.
It should be possible to search across HELM strings using a variety of formats to retrieve information of interest.
A Cambridge University student completed a proof-of-concept search engine in summer 2014, and the code is available on GitHub. It included exact-match and substructure searching of atom/bond structures, exact-match and substring searching of sequence-level information, and combinations of these. The user interface components are limited.
- The data source of the HELM search must be configurable, i.e. the data can reside in a local file or a database.
- Search by exact match, substructure or motif. The input criteria can combine any of the following using AND/OR:
- Natural analog sequence (peptides, nucleotides)
- Monomer ID
- HELM string (including inline HELM notation)
- SMILES / SMARTS / Chemical structure drawing tool
- Specific attachment point on monomer.
Example: CHEM1,*,1:R1 matches all attachments to R1 on the CHEM1 monomer.
- Fuzzy searches
- Canonicalized vs. non-canonicalized HELM (permutation, combination)
- Gap search (*)
- Ambiguities (TBD)
- Component-based, fully configurable web service for searching
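A few of the requirements above (exact match, substring match, gap search and AND/OR combination) can be sketched as composable predicates over plain HELM strings. This is a text-level illustration only, not the proof-of-concept engine or any toolkit API. All function names are hypothetical, and the gap semantics (`*` matches any run of characters within one simple polymer) is an assumption. The data source is any iterable of strings, so lines from a file or rows from a database can be passed in unchanged.

```python
import re
from typing import Callable, Iterable, List

# A predicate decides whether a single HELM string matches.
Predicate = Callable[[str], bool]


def exact(query: str) -> Predicate:
    """Match HELM strings identical to the query."""
    return lambda helm: helm == query


def contains(fragment: str) -> Predicate:
    """Match HELM strings containing the fragment as plain text."""
    return lambda helm: fragment in helm


def gap(pattern: str) -> Predicate:
    """Match a pattern in which '*' stands for an arbitrary gap.

    Assumption: the gap may not cross '|', '$' or '}', so it stays
    inside one simple polymer.
    """
    parts = [re.escape(part) for part in pattern.split("*")]
    regex = re.compile("[^|$}]*".join(parts))
    return lambda helm: regex.search(helm) is not None


def all_of(*predicates: Predicate) -> Predicate:
    """AND combination of predicates."""
    return lambda helm: all(p(helm) for p in predicates)


def any_of(*predicates: Predicate) -> Predicate:
    """OR combination of predicates."""
    return lambda helm: any(p(helm) for p in predicates)


def search(helms: Iterable[str], predicate: Predicate) -> List[str]:
    """Return the HELM strings that satisfy the predicate, in input order."""
    return [helm for helm in helms if predicate(helm)]
```

For example, `search(helms, all_of(contains("PEPTIDE"), gap("A.*.C")))` keeps peptide entries in which an A is followed, after a gap, by a C. Substructure, attachment-point and fuzzy searches need a parsed HELM model and a chemistry toolkit, which this sketch deliberately leaves out.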