Description
I see several possible enhancements for the DomCrawler component. Some of them are just about cleaning its API, while others would open the door to optimizations which are not possible today.
Stop extending SplObjectStorage
The crawler uses an `SplObjectStorage` to store its nodes, and it does so by inheritance. This has several drawbacks:
- it makes the Crawler class inherit all the methods of `SplObjectStorage`, while most of them are totally meaningless for it (the only methods we really need are the `Countable` and `Iterator` implementations, and maybe `contains()`, even though I'm not sure it is actually used in the wild)
- we have several methods accessing nodes by their index in the Crawler, but `SplObjectStorage` does not allow such access (it considers the object itself to be the key for key-based access), so the only way to implement these methods is to iterate until we reach the position, making them O(n) rather than O(1) (see the sketch below)

This clutters the API of the class with no benefit, and with real drawbacks IMO.
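For reference, this is roughly what a position-based lookup (e.g. `getNode()`) has to do as long as the nodes live in an `SplObjectStorage`; a minimal sketch, not the actual implementation:

```php
// Sketch only: reaching a node by its position means walking the whole storage.
public function getNode($position)
{
    foreach ($this as $i => $node) {
        if ($i == $position) {
            return $node;
        }
    }

    // With an array of nodes, this whole loop would collapse to an O(1) lookup:
    // return isset($this->nodes[$position]) ? $this->nodes[$position] : null;
}
```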
My proposal here is to deprecate all methods inherited from `SplObjectStorage` (except `count()` and the `Iterator` methods), to warn people if they use one of them. The places where we use `SplObjectStorage` methods internally would call `parent::attach()` for instance, to avoid the deprecation warning (they are not deprecated for internal usage).
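A rough sketch of what the deprecation layer could look like (the message wording is illustrative, and the `addDomNode()` helper is hypothetical):

```php
class Crawler extends \SplObjectStorage
{
    // Overridden only to warn external callers; the behaviour is unchanged.
    public function attach($object, $data = null)
    {
        @trigger_error('Crawler::attach() is deprecated since 2.8 and will not be available in 3.0.', E_USER_DEPRECATED);

        parent::attach($object, $data);
    }

    // Internal code bypasses the deprecated override by calling parent::attach() directly.
    private function addDomNode(\DOMNode $node)
    {
        parent::attach($node);
    }
}
```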
In 3.0, we will drop the inheritance from `SplObjectStorage` and replace the storage of DOM nodes with an array, which will give us direct access to nodes by position.
Restrict the Crawler to a single DOM document
Currently, a single crawler can contain elements from multiple DOM documents. This forces us to recreate the `DOMXPath` object for each element in the crawler (and we do it again in the crawlers returned by filtering). Given that namespace registration takes as much time as the actual XPath query, this opens up a big optimization potential.
Having multiple documents in the same crawler can only happen in a few cases:
- if you call `addContent`/`addHtmlContent`/`addXmlContent` (or `add($string)`, which calls `addContent`) multiple times on the same instance (see the example below)
- if you call it on an instance which already contains nodes, generally one returned by filtering a previous crawler (note that the first case is actually a specific subcase of this one)
- if you manually build a Crawler containing DOM nodes from different documents (not something I have ever seen in the wild)
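For illustration, the first case looks like this; each `addHtmlContent()` call parses its content into its own `DOMDocument`, so the crawler ends up spanning two documents:

```php
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler();
$crawler->addHtmlContent('<html><body><p>first</p></body></html>');

// This second call adds nodes belonging to another DOMDocument,
// so the same crawler now holds nodes from two distinct documents.
$crawler->addHtmlContent('<html><body><p>second</p></body></html>');
```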
My proposal here is to deprecate the ability to register nodes from different documents in 2.8 and forbid it in 3.0.
Once this restriction is enforced in 3.0, we will be able to apply more optimizations (it might be possible to optimize the XPath building in the current codebase already, but it would require making the code much more complex).
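As an example of what the restriction would allow, here is a hypothetical sketch of caching the `DOMXPath` instance per crawler (the `$document` and `$namespaces` properties are assumptions made for this example):

```php
private $domxpath;

// Hypothetical: once all nodes belong to a single document, the DOMXPath object
// and its namespace registrations can be created once and reused for every query.
private function getDOMXPath()
{
    if (null !== $this->domxpath) {
        return $this->domxpath;
    }

    $domxpath = new \DOMXPath($this->document);
    foreach ($this->namespaces as $prefix => $namespace) {
        $domxpath->registerNamespace($prefix, $namespace);
    }

    return $this->domxpath = $domxpath;
}
```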
Note that loading HTML content multiple times in the Crawler already causes weird behavior for the handling of the `base` tag: a crawler can only have one base href, which ends up being the value from the last loaded document containing a `base` tag, resolved against the base href determined by the previous content. This shows that the current behavior does not really make sense.
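To illustrate (the exact resolution rules may differ slightly; the point is that the second document overwrites the base href computed from the first one):

```php
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler(null, 'http://example.com/index.html');
$crawler->addHtmlContent('<html><head><base href="first/"></head><body></body></html>');

// The base href derived from the first document is now overwritten, and the new
// value is resolved against the previous result rather than only the page URI.
$crawler->addHtmlContent('<html><head><base href="second/"></head><body></body></html>');
```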
What do you think about these proposals?