Free Academic Seminars And Projects Reports

Full Version: Chemical Structure Representation and Search Systems
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
[attachment=15133]
Chemical Structure Representation and Search Systems
New MDL File Formats
Since lecture on Oct 28, MDL have published details of a new file format XDfile
XML-based data format for transferring structure/reaction information with associated data
built around existing MDL connection table formats
can incorporate Chime strings (encrypted format used to render structures and reactions on a Web page)
can incorporate SMILES strings
New download site:
http://mdldownloads/public/ctfile/ctfile.jsp
Lecture 4: Topics to be Covered
full structure search
structure registration systems
graph isomorphism
algorithm complexity and NP-complete problems
substructure search
subgraph isomorphism
screening searches and fingerprints
substructure query formulation
SMARTS
commercial systems
Full Structure and Substructure Search
full structure search
query is is complete molecule
is this molecule in the database?
or tautomers, stereoisomers etc. of it,
substructure search
query is a pattern of atoms and bonds
does this pattern occur as a substructure of any of the molecules in my database?
superstructure search
query is a complete molecule
are any of the molecules in the database substructures of it?
N.B. Some Daylight Chemical Information Systems Inc. documentation uses substructure and superstructure search in the opposite sense to those given here
Full Structure Search
Many databases contain millions of structures, so search speed is important
Simplest approaches uses canonical representation for query and database structures (e.g. canonical SMILES)
could sort database SMILES into alphanumerical order
search sorted list for match with query
Hash table lookup can improve search speed
calculate hash-code ( idiot number in predefined range) from SMILES for each database structure
this is address (disk file or memory) at which full representation is stored
only SMILES which have same hash code need to be compared
Structure Registration Systems
Many chemical and pharmaceutical companies maintain compound registry systems
database of all compounds worked on internally
may included many compounds never published elsewhere (i.e. not in Chemical Abstracts, Beilstein)
links to company reports, biological screening data, stock number in compounds store etc.
links to electronic lab notebooks, LIMS (Lab. Info. Management System), ORACLE database etc.
Structure Registration Systems
new compounds need to be added regularly
used to be done by chemical information specialists
now frequently done directly by bench chemists
registration system must
check consistency of input data
e.g. compare molecular formula with structure
check that compound is really new
different ways of handling tautomers, salts, stereoisomers etc.
assign registry number
add supplementary data (melting point etc.)
make data immediately available for search
Structure Registration Systems
Public databases use same principles, adding compounds from published literature
Chemical Abstracts Registry file
links to document where data on molecule was published
Beilstein Registry file
lots of data may be stored with compound, from different data sources; existing records may need updating
Updates for searching may be made available at regular intervals (weekly, monthly, annually, etc.)
Graph Isomorphism
In graph theory terms, when two full structures match, their graphs are said to be isomorphic
each node N1 in G1 must be mapped to a node N2 in G2
neighbours of N1 must map to neighbours of N2
Graph isomorphism by brute force
for each node in G1
map it against an unmapped node in G2
check that neighbours of each node map appropriately in the two graphs
if each graph has n nodes there are n! ways of doing this
n (n-1) (n-2) (n-3) 3 2 1
this is a big number if n is anything non-trivial
9! = 362 880
10! = 3 628 800
Computational complexity
a measure of how long a computational algorithm will take to run, depending on the size of input
if you give it twice as much data will it take twice as long to run?
e.g. comparing a word sequentially against each member of a list of words of length n
take taken depends directly on length of list
algorithm is O(n) [ order-n ]
e.g. comparing each word in a list of length n with every other word of the same list
algorithm is O(n2) [ order-n-squared ]
Computational complexity
some algorithms may have complexity O(n3), O(n4), O(log n), O(n log n) etc.
these are all polynomial time algorithms
some algorithms have exponential complexity, e.g. O(2n)
this is much slower than polynomial
brute-force graph isomorphism is O(n!)
this is even slower than simple exponential
Computational complexity
for some problems you can find more efficient algorithms (lower order of complexity) to do the same thing
e.g. searching a sorted list
simple sequential search is O(n)
binary chop search is O(log n)
for some problems there are no known polynomial-time algorithms
NP-complete problems
a class of problems for which no polynomial-time algorithms are known
problems in this class are mathematically equivalent
if a polynomial time algorithm could be found for one of them, it would fit all of them
well-known example is travelling salesman problem (shortest path visiting each of several cities)
it is suspected (but not proven) that no polynomial-time algorithms can exist for this class of problems
NP-complete problems
graph isomorphism is probably NP-complete (not rigorously proven)
subgraph isomorphism is a generalisation of graph isomorphism
nodes in G1 (query structure) must be mapped to subset of nodes in G2 (database structure)
i.e. G1 is a subgraph G2
subgraph isomorphism has been proven to be NP-complete
substructure searching is inherently slow
Subgraph isomorphism
NP-completeness of problem means that worst-case match times are exponential in number of atoms involved
but average-case match times can be better than this
much effort has been expended on this problem over the past 40+ years
closely-related problems remain an active area of research
Speeding up subgraph isomorphism
use a faster computer
use tricks to avoid exploring potential solutions that are bound to fail
do most of the work in a pre-processing of the database structures, independently of the query
Speeding up subgraph isomorphism
chemical graphs have several characteristics that allow heuristics ( tricks ) to be used to speed up isomorphism identification
several different node and edge labels
low connectivity of each node
using hydrogen-suppressed graphs reduces size of problem (number of nodes)
these tricks would be of less use for general graphs
additional tricks and algorithms may be used in special cases (e.g. if graphs are trees)
Backtracking
modification of the brute-force approach
abandons partial solutions part-way through when it can be seen they are bound to fail
worst-case is still exponential in number of nodes, but doesn t arise very often
first map an arbitrary pair of nodes
then map neighbours of these nodes
if successful, map neighbours of each neighbour, etc.
if not, backtrack one step, and try a different mapping
Backtracking
algorithm will terminate
when all query nodes are mapped [MATCH]
when all alternative mappings for first query node have been tried, and have failed [NO MATCH]
extra tricks can be used for further improvement
only map nodes with same element type and charge, and compatible bonding patterns
start with unusual atom types, and nodes with lots of neighbours
Partitioning and Relaxation
often used as an adjunct to backtracking
start by partitioning the nodes into sets of possible correspondents
e.g. nitrogens can only match nitrogens
iteratively refine the partition on basis of other possible correspondences
e.g. if F6 is only possible correspondent for Q1 then F6 cannot be a correspondent for Q2
if the list of possible correspondents for a query node becomes empty, there is no isomorphism
Partitioning and Relaxation
can also reduce lists of possible correspondents by looking at neighbours
if F6 is to remain a valid correspondent of Q1, then the neighbours of F6 must be possible correspondents of the neighbours of Q1
as this check is repeated for each node, we are bringing in information from further away, but only ever looking at immediate neighbours
this technique is the same as Morgan s algorithm for node labelling in canonicalisation
it is called relaxation
backtracking can be used as a fallback when no further reductions can be made in the lists of possible correspondents
Subgraph isomorphism algorithms
Ray and Kirsch s algorithm (1957)
basic backtracking
Sussenguth s partitioning algorithm (1965)
relaxation technique called connectivity property , with backtracking as fall-back
Figueras s set reduction algorithm (1972)
Ullmann s algorithm (1976)
efficient relaxation and backtracking
von Scholley s relaxation algorithm (1984)
Screening
so far we ve considered matching one query substructure against one database full structure
each structure from the database needs to be compared against the query in turn
many will fail because they don t contain the query substructure
screening allows many of these to be eliminated before we get to this stage
uses structure fingerprints discussed in lecture 3
Fingerprints
the fragments present in a structure can be represented as a sequence of 0s and 1s
00010100010101000101010011110100
0 means fragment is not present in structure
1 means fragment is present in structure (perhaps multiple times)
each 0 or 1 can be represented as a single bit in the computer (a bitstring )
for chemical structures often called structure fingerprints
Screening
build a fingerprint for the query substructure
only those database structures that contain all the fragments in the query can possibly match the query
Query: 00000100010101000001010011010100
DB struct 1: 00010100010101000101010011110100 MATCH
DB struct 2: 00000000100101001001000011100000 NO MATCH
comparing fingerprint bitstrings is very fast (logical AND operation)
only those structures that pass the screening stage need to be considered as candidates for atom-by-atom isomorphism search
Screening
can be made faster by inverting the bit strings (actually, turning them on their side)
instead of a bitstring of fragments for each structure ..
store a bitstring of structures for each fragment
each bit represents a database structure
1=structure contains fragment; 0=structure does not
search by ANDing together the bitstrings for the fragments present in the query
this will list those structures that contain all the query s fragments
Screening Effectiveness
Ideally we want to eliminate as many structures as possible at the screening stage
99% screenout or more would be good
Fingerprint construction can help in this
frequency distributions of fragments in a large database are very skewed
a few fragments occur in almost all compounds
will therefore give little or no screenout
many fragments occur in very few compounds
need very long fingerprint (lots of fragments) to ensure that we will have some in the query
Fingerprint construction
best fragments are medium frequency ones
fragments also need to be independent of each other
dictionary used for CA Registry search was constructed on basis of analysis of fragment frequency distributions
Daylight fingerprints
each fragment is used to generate a hash code, which specifies the bit position to be set
actually most fragments set several bits
small fragments (more frequent) set fewer bits
larger fragments (less frequent) set more bits
several different fragments may set the same bit
in principle this can reduce screening effectiveness
it may allow a fingerprint to match when structure does not contain the same fragment as the query
in practice is not a serious problem
it will never cause a structure to be rejected when it is actually a match