SFTM is an algorithm for matching two trees. It has mostly been tested on web pages. To our knowledge, SFTM is the most efficient existing algorithm for matching nodes from two websites (cf. benchmark).
SFTM makes full use of the information contained in the nodes of the trees being matched. In the case of HTML, that means tags, attributes and their values.
If you want to understand SFTM, please read the associated scientific paper. If you use it for academic purposes, please don't forget to cite us.
For a solution in the Java ecosystem, you can also see the implementation in Kotlin, which is deployed as a package in the official Maven repository.
There are 4 projects:
- tree-matching-csharp: the algorithm as a C# library
- tree-matching-csharp.Test: the unit tests for the different parts of the algorithm
- tree-matching-csharp.Benchmark: (messy) runs different benchmarks around the SFTM algorithm (and competitors)
- tree-matching-csharp.Visualization: (messy) visualizes some of the results
The tree-matching-csharp project is where the algorithm lives, written as a C# library. Structure:
- DOM.cs: contains methods to transform a webpage string into the tree structure SFTM uses
- FtmCost.cs: exposes the method to compute the FTM cost of a given matching (see FTM paper). This is not directly part of the SFTM algorithm, but it makes it possible to compute a "confidence" metric on each match.
- InMemoryIndexer.cs: methods to create an index of a given set of documents that can be queried quickly
- ITreeMatcher: the interface that SFTM implements
- Metropolis.cs: implementation of the Metropolis algorithm applied to SFTM
- Neighbors.cs: contains utility methods to manipulate lists of a node's neighbors
- SftmTreeMatcher.cs: contains the core of the algorithm
- SimilarityPopagation.cs: contains methods used to propagate the similarity
- Utils.cs: general utilities
- Types.cs: contains definitions for a node and an edge
You can refer to the paper for a more theoretical explanation.
The SFTM algorithm itself takes two websites, each represented as a collection of Node (defined in Types.cs):
TreeMatcherResponse matching = await _matcher.MatchTrees(sourceNodes, targetNodes);
where sourceNodes and targetNodes are of type IEnumerable<Node>.
To match two websites directly, without converting them into IEnumerable<Node> yourself, you can use the DOM methods:
IEnumerable<Node> sourceNodes = await DOM.WebpageToTree(source);
IEnumerable<Node> targetNodes = await DOM.WebpageToTree(target);
A direct example of such usage can be found in:
tree-matching-csharp.Test/TreeMatchingTest.cs/TestTreeMatching
Serializing the results of the matching to interact with external applications can be challenging. Say you want to build an API on top of SFTM that matches two websites. You have two options:
- Return the matching as a list of tuples (id1, id2), where id1 and id2 are the ids of the nodes obtained by traversing both trees in the same order (e.g. post-order traversal).
- Require input websites to contain a signature attribute for each node (whose value uniquely identifies the node). Then return a list of tuples (signature_1, signature_2).

We found the first option to be quite impractical: different parsers often don't parse the nodes in exactly the same way, which makes order-based ids very fragile. A solution similar to the second option has been implemented in:
tree-matching-csharp.Visualization/Controllers/HomeController.cs/MatchWebsites
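
The signature-based option can be sketched roughly as follows. This is a minimal illustration, not the code in HomeController.cs: SketchNode, SignaturePair and MatchingSerializer are hypothetical names introduced here, SketchNode is a stand-in for the library's Node type (whose actual shape is defined in Types.cs and may differ), and the sketch assumes the matching is available as pairs of matched nodes.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical stand-in for a DOM node; the real Node type in Types.cs may differ.
public sealed record SketchNode(string Tag, IReadOnlyDictionary<string, string> Attributes);

// Hypothetical DTO: one matched pair, identified by the two signature values.
public sealed record SignaturePair(string Source, string Target);

public static class MatchingSerializer
{
    // Attribute assumed to uniquely identify each node in the input websites.
    public const string SignatureAttribute = "signature";

    // Converts matched node pairs into signature pairs.
    // Pairs where either node lacks a signature attribute are skipped.
    public static List<SignaturePair> ToSignaturePairs(
        IEnumerable<(SketchNode Source, SketchNode Target)> matches)
    {
        var pairs = new List<SignaturePair>();
        foreach (var (source, target) in matches)
        {
            if (source.Attributes.TryGetValue(SignatureAttribute, out var sourceSig) &&
                target.Attributes.TryGetValue(SignatureAttribute, out var targetSig))
            {
                pairs.Add(new SignaturePair(sourceSig, targetSig));
            }
        }
        return pairs;
    }

    // Serializes the signature pairs to JSON so an external application can consume them.
    public static string ToJson(
        IEnumerable<(SketchNode Source, SketchNode Target)> matches) =>
        JsonSerializer.Serialize(ToSignaturePairs(matches));
}
```

Because the caller chooses the signature values before sending the websites, the returned pairs stay meaningful even if the parser on the API side orders or splits nodes differently, which is the fragility that affects order-based ids.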