Description
I see several possible enhancements for the DomCrawler component. Some of them are just about cleaning its API, while others would open the door to optimizations which are not possible today.
Stop extending SplObjectStorage
The crawler uses an `SplObjectStorage` to store its nodes, and it does so by inheritance. This has several drawbacks:
- it makes the Crawler class inherit all the methods of `SplObjectStorage`, while most of them are totally meaningless for it (the only methods we really need are the `Countable` and `Iterator` implementations, and maybe `contains()`, even though I'm not sure it is actually used in the wild)
- we have several methods accessing nodes by their index in the Crawler, but `SplObjectStorage` does not allow such access (it considers the object itself to be the key for key-based access), so the only way to implement these methods is to iterate until we reach the position, making them O(n) rather than O(1) (see the sketch below)

This clutters the API of the class with no benefit, and with real drawbacks IMO.
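For reference, this is roughly what a position-based lookup (e.g. `getNode()`) has to do as long as the nodes live in an `SplObjectStorage`; a minimal sketch, not the actual implementation:

```php
// Sketch only: reaching a node by its position means walking the whole storage.
public function getNode($position)
{
    foreach ($this as $i => $node) {
        if ($i == $position) {
            return $node;
        }
    }

    // With an array of nodes, this whole loop would collapse to an O(1) lookup:
    // return isset($this->nodes[$position]) ? $this->nodes[$position] : null;
}
```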
My proposal here is to deprecate all methods inherited from `SplObjectStorage` (except `count()` and the `Iterator` methods), to warn people if they use one of them. The places where we use `SplObjectStorage` methods internally would call `parent::attach()` for instance, to avoid the deprecation warning (they are not deprecated for internal usage).
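A rough sketch of what the deprecation layer could look like (the message wording is illustrative, and the `addDomNode()` helper is hypothetical):

```php
class Crawler extends \SplObjectStorage
{
    // Overridden only to warn external callers; the behaviour is unchanged.
    public function attach($object, $data = null)
    {
        @trigger_error('Crawler::attach() is deprecated since 2.8 and will not be available in 3.0.', E_USER_DEPRECATED);

        parent::attach($object, $data);
    }

    // Internal code bypasses the deprecated override by calling parent::attach() directly.
    private function addDomNode(\DOMNode $node)
    {
        parent::attach($node);
    }
}
```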
In 3.0, we will drop the inheritance from `SplObjectStorage` and replace the storage of DOM nodes with an array, which will give us direct access to nodes by position.
Restrict the Crawler to a single DOM document
Currently, a single crawler can contain elements from multiple DOM documents. This forces us to recreate the `DOMXPath` object for each element in the crawler (and we do it again in the crawlers returned by filtering). Given that namespace registration takes as much time as the actual XPath query, this opens up a big optimization potential.
Having multiple documents in the same crawler can only happen in a few cases:
- if you call `addContent`/`addHtmlContent`/`addXmlContent` (or `add($string)`, which calls `addContent`) multiple times on the same instance (see the example below)
- if you call it on an instance which already contains nodes, generally one returned by filtering a previous crawler (note that the first case is actually a specific subcase of this one)
- if you manually build a Crawler containing DOM nodes from different documents (not something I have ever seen in the wild)
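For illustration, the first case looks like this; each `addHtmlContent()` call parses its content into its own `DOMDocument`, so the crawler ends up spanning two documents:

```php
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler();
$crawler->addHtmlContent('<html><body><p>first</p></body></html>');

// This second call adds nodes belonging to another DOMDocument,
// so the same crawler now holds nodes from two distinct documents.
$crawler->addHtmlContent('<html><body><p>second</p></body></html>');
```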
My proposal here is to deprecate the ability to register nodes from different documents in 2.8 and forbid it in 3.0.
Once this restriction is enforced in 3.0, we will be able to apply more optimizations (it might be possible to optimize the XPath building in the current codebase already, but it would require making the code much more complex).
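As an example of what the restriction would allow, here is a hypothetical sketch of caching the `DOMXPath` instance per crawler (the `$document` and `$namespaces` properties are assumptions made for this example):

```php
private $domxpath;

// Hypothetical: once all nodes belong to a single document, the DOMXPath object
// and its namespace registrations can be created once and reused for every query.
private function getDOMXPath()
{
    if (null !== $this->domxpath) {
        return $this->domxpath;
    }

    $domxpath = new \DOMXPath($this->document);
    foreach ($this->namespaces as $prefix => $namespace) {
        $domxpath->registerNamespace($prefix, $namespace);
    }

    return $this->domxpath = $domxpath;
}
```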
Note that loading HTML content multiple times in the Crawler already causes weird behavior for the handling of the `base` tag: a crawler can only have one base href, which ends up being the value from the last loaded document containing a `base` tag, resolved against the base href determined by the previous content. This shows that the current behavior does not really make sense.
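To illustrate (the exact resolution rules may differ slightly; the point is that the second document overwrites the base href computed from the first one):

```php
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler(null, 'http://example.com/index.html');
$crawler->addHtmlContent('<html><head><base href="first/"></head><body></body></html>');

// The base href derived from the first document is now overwritten, and the new
// value is resolved against the previous result rather than only the page URI.
$crawler->addHtmlContent('<html><head><base href="second/"></head><body></body></html>');
```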
What do you think about these proposals?