On the problem of searching for molecular structure and/or proteins.

muladharma · Jan 16, 2021

Starting this thread to discuss the problem of searching for chemicals, molecules.

There are many standards for naming and representing molecules, especially as strings of characters for use by computers. There seems to be no single worldwide standard, so any application that lets you search could be using a combination of several representations.

One such application is:
OPSIN: Open Parser for Systematic IUPAC Nomenclature

The OPSIN app produces a CML (Chemical Markup Language) file, which is an XML description of the atoms and bonds.
A molecule can have more than one name, and the application deals well with this and other problems by detecting ambiguity.
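
To make the CML idea concrete, here is a minimal sketch of reading atoms and bonds out of such a file with Python's standard library. The fragment below is hand-written for methanol (heavy atoms only) and is an assumption for illustration, not actual OPSIN output; real files may carry more attributes.

```python
import xml.etree.ElementTree as ET

# Hand-written CML fragment for methanol (heavy atoms only).
# Assumption for illustration: this is NOT copied from OPSIN output.
CML = """<cml xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C"/>
      <atom id="a2" elementType="O"/>
    </atomArray>
    <bondArray>
      <bond id="b1" atomRefs2="a1 a2" order="S"/>
    </bondArray>
  </molecule>
</cml>"""

# CML elements live in this XML namespace.
NS = {"cml": "http://www.xml-cml.org/schema"}


def read_cml(text):
    """Return (atoms, bonds).

    atoms maps atom id -> element symbol;
    bonds is a list of (atom id, atom id, bond order) tuples.
    """
    root = ET.fromstring(text)
    atoms = {
        atom.get("id"): atom.get("elementType")
        for atom in root.iterfind(".//cml:atom", NS)
    }
    bonds = []
    for bond in root.iterfind(".//cml:bond", NS):
        first, second = bond.get("atomRefs2").split()
        bonds.append((first, second, bond.get("order")))
    return atoms, bonds


atoms, bonds = read_cml(CML)
print(atoms)
print(bonds)
```

Once the structure is a plain dictionary and list like this, it can be compared or searched without going back to the name.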

The result given might be a combination of operations on the n-grams of the input, which by some logic finds or constructs a match. That being said, without studying the code and/or documentation one cannot know about the completeness (whether every valid input produces an output) or the invertibility (whether an input can be recovered from an output) of the operation.
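
One reason going back from an output to an input is not straightforward: several names can map to the same structure, so the reverse lookup returns a set of names rather than one. A toy sketch follows, using a hand-made table. The table entries and the function names are assumptions for illustration and say nothing about how OPSIN works internally.

```python
from collections import defaultdict

# Hand-made name -> SMILES table (assumption for illustration, not OPSIN output).
# Comparing SMILES as plain strings only works because the same string was
# written for each synonym; real data would need canonical SMILES or InChI.
NAME_TO_SMILES = {
    "5-hydroxytryptamine": "NCCc1c[nH]c2ccc(O)cc12",
    "serotonin": "NCCc1c[nH]c2ccc(O)cc12",
    "1,2-dimethoxybenzene": "COc1ccccc1OC",
    "veratrole": "COc1ccccc1OC",
    "glycine": "NCC(=O)O",
}


def invert(name_to_structure):
    """Build the reverse lookup: structure -> list of all names that give it."""
    inverse = defaultdict(list)
    for name, structure in name_to_structure.items():
        inverse[structure].append(name)
    return dict(inverse)


def is_injective(mapping):
    """True only if no two names share a structure (a unique reverse lookup)."""
    return len(set(mapping.values())) == len(mapping)


structure_to_names = invert(NAME_TO_SMILES)
for structure, names in structure_to_names.items():
    print(structure, "<-", names)
print("unique reverse lookup:", is_injective(NAME_TO_SMILES))
```

Searching for names by structure then amounts to a lookup in the inverted table, with the caveat that it can only return names that were put in.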

It's not clear if this can be used to search for related compounds, but some ideas are: backtracking inputs, searching for names using structure and backtracking structures.

Examples:
Hydroxytryptamine puts the hydroxy group on the amine in the result, whereas 5-hydroxytryptamine resolves the ambiguity. It could be that ambiguous parts are resolved in order of construction priority.

For names with multiple possible configurations, for example dimethoxybenzene, the first (ortho) form is preferred.
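
To show why a name like dimethoxybenzene is ambiguous without locants, here is a small sketch that enumerates the positional isomers of a benzene ring carrying two identical substituents. It groups every pair of ring positions by how far apart they are around the ring. The function name is hypothetical and this is not how OPSIN chooses its default.

```python
from itertools import combinations

# Ring distance between the two substituents -> traditional prefix.
PREFIX = {1: "ortho", 2: "meta", 3: "para"}


def disubstituted_benzene_isomers():
    """Group all locant pairs on a six-membered ring by ring distance.

    Assumes both substituents are identical, so only their separation matters.
    """
    groups = {}
    for i, j in combinations(range(1, 7), 2):
        # Shortest way around the ring between positions i and j.
        distance = min(j - i, 6 - (j - i))
        groups.setdefault(PREFIX[distance], []).append((i, j))
    return groups


isomers = disubstituted_benzene_isomers()
for prefix, pairs in isomers.items():
    # The first pair in each group carries the lowest locants.
    print(prefix, "lowest locants:", pairs[0], "equivalent pairs:", len(pairs))
```

The fifteen possible pairs collapse to three distinct isomers (1,2-, 1,3- and 1,4-), so a parser given no locants has to pick one of them or report the ambiguity.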

For some inputs it cannot detect stereocenters, for example L-Glycine, but it gives a sign that it can search for them.

Neural networks might yield other insights, given the complex search space generated by combining natural language with structural geometry.

downwardsfromzero · Jan 17, 2021

For some inputs it cannot detect stereocenters, for example L-Glycine, but it gives a sign that it can search for them.

This specific example has the problem that glycine, alone among the proteinogenic amino acids, is not in fact chiral.
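
A quick way to see this: a carbon is a stereocentre only if it carries four different substituents, and the alpha carbon of glycine carries two hydrogens. The sketch below is a deliberate simplification in which substituents are compared as plain text labels; a real check would compare whole branches of the molecular graph.

```python
# Substituents on the alpha carbon, written as plain labels.
# Simplification: labels stand in for full substituent branches.
ALPHA_CARBON = {
    "glycine": ["H", "H", "NH2", "COOH"],
    "alanine": ["H", "CH3", "NH2", "COOH"],
    "serine": ["H", "CH2OH", "NH2", "COOH"],
}


def is_stereocentre(substituents):
    """True if the atom has four substituents and all four are different."""
    return len(substituents) == 4 and len(set(substituents)) == 4


for name, groups in ALPHA_CARBON.items():
    verdict = "chiral" if is_stereocentre(groups) else "not chiral"
    print(name, "alpha carbon:", verdict)
```

So an L- or D- prefix on glycine has nothing to attach to, which would explain why no stereocentre shows up in the result.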

The onus is, er, on us to name our molecules as accurately as possible while simultaneously being aware of the range of alternative trivial names and/or possible (common or otherwise) nomenclatural errors that may be encountered.

As you may have noticed, nomenclature is one of my specialised fields of focus. Please feel free to address me with questions directly.

Proteins are far more tricky beasts as their secondary structure is outside the scope of ordinary systems of nomenclature. And, of course, the larger the molecule, the more of a mouthful the systematic name becomes.

The OPSIN link is great, thanks!
