
API cleanup for the DomCrawler component #15849

Closed
@stof

Description

I see several possible enhancements for the DomCrawler component. Some of them are just about cleaning up its API, while others would open the door to optimizations that are not possible today.

Stop extending SplObjectStorage

The crawler uses an SplObjectStorage to store its nodes, and it does so by inheritance. This has several drawbacks:

  • it makes the Crawler class inherit all the methods of SplObjectStorage, while most of them are meaningless for it (the only ones we really need are the Countable and Iterator implementations, and maybe contains, though I'm not sure it is actually used in the wild).
  • we have several methods accessing nodes by their index in the crawler, but SplObjectStorage does not allow such access (it considers the object itself to be the key for key-based lookups). The only way to implement these methods is to iterate until we reach the requested position, making them O(n) rather than O(1) (the sketch below illustrates this).

This clutters the API of the class with no benefit, and with real drawbacks IMO.
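To make the second drawback concrete, here is a simplified sketch of what position-based access has to look like while SplObjectStorage is the backing store (the method body is illustrative, not a verbatim copy of the component):

```php
class Crawler extends \SplObjectStorage
{
    public function getNode($position)
    {
        // SplObjectStorage only supports object-based lookups, so reaching
        // the node at a given position means walking the storage: O(n).
        foreach ($this as $i => $node) {
            if ($i == $position) {
                return $node;
            }
        }

        return null;
    }
}
```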

My proposal here is to deprecate all methods inherited from SplObjectStorage (except count and the Iterator methods), so that people get a warning if they use one of them. The places where we use SplObjectStorage methods internally would call parent::attach() for instance, to avoid triggering the deprecation warning (the methods are not deprecated for internal usage).
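Roughly, the deprecation layer could look like this (the exact message and the set of overridden methods are illustrative):

```php
class Crawler extends \SplObjectStorage
{
    /**
     * @deprecated since 2.8, to be removed in 3.0.
     */
    public function attach($object, $data = null)
    {
        @trigger_error(sprintf('The %s() method is deprecated since version 2.8 and will be removed in 3.0.', __METHOD__), E_USER_DEPRECATED);

        parent::attach($object, $data);
    }

    // Internal code keeps calling parent::attach($node) directly,
    // so the component's own usage does not trigger the deprecation.
}
```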

In 3.0, we will drop the inheritance from SplObjectStorage and replace the storage of DOM nodes with an array, which will give us direct access to nodes by position.
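A rough sketch of the 3.0 shape this implies, assuming the crawler keeps Countable and gets its iterator through IteratorAggregate (names are illustrative):

```php
class Crawler implements \Countable, \IteratorAggregate
{
    /**
     * @var \DOMNode[]
     */
    private $nodes = array();

    public function getNode($position)
    {
        // Direct array access: O(1) instead of iterating the storage.
        return isset($this->nodes[$position]) ? $this->nodes[$position] : null;
    }

    public function count()
    {
        return count($this->nodes);
    }

    public function getIterator()
    {
        return new \ArrayIterator($this->nodes);
    }
}
```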

Restrict the Crawler to a single DOM document

Currently, a single crawler can contain elements from multiple DOM documents. This forces us to recreate the DOMXPath object for each element in the crawler (and we do it again in the crawlers returned by filtering). Given that namespace registration takes as much time as the actual XPath query, this opens up significant optimization potential.
Having multiple documents in the same crawler can only happen in a few cases:

  • if you call addContent/addHtmlContent/addXmlContent (or add($string), which calls addContent) multiple times on the same instance (see the snippet after this list)
  • if you call one of these methods on an instance which already contains nodes, typically a crawler returned by filtering a previous crawler (note that the first case is actually a specific subcase of this one)
  • if you build a Crawler and put DOM nodes from different documents into it manually (not something I have ever seen in the wild)
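For instance, the first case boils down to something like this (a plain usage example):

```php
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler();

// Each call parses its content into its own DOMDocument, so the crawler
// ends up holding nodes that belong to two different documents.
$crawler->addHtmlContent('<html><body><p>First document</p></body></html>');
$crawler->addHtmlContent('<html><body><p>Second document</p></body></html>');
```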

My proposal here is to deprecate the ability to register nodes from different documents in 2.8 and to forbid it in 3.0.
Once this restriction is enforced in 3.0, we will be able to apply more optimizations (it might be possible to apply some of the XPath-building optimizations in the current codebase, but it would make the code much more complex).
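To illustrate the kind of optimization this unlocks, here is a hypothetical sketch: once every node is guaranteed to belong to the same document, the DOMXPath object and its namespace registrations can be created once per crawler and reused for every query (property and method names are made up for the example):

```php
class Crawler
{
    /** @var \DOMXPath|null */
    private $domxpath;

    private function getDomXPath(\DOMDocument $document, array $namespaces)
    {
        if (null !== $this->domxpath) {
            return $this->domxpath;
        }

        $this->domxpath = new \DOMXPath($document);
        foreach ($namespaces as $prefix => $namespace) {
            // Registration costs roughly as much as the query itself,
            // so doing it once per crawler instead of once per node pays off.
            $this->domxpath->registerNamespace($prefix, $namespace);
        }

        return $this->domxpath;
    }
}
```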

Note that loading HTML content multiple times into the crawler already causes weird behavior in the handling of the base tag: we can only have one base href per crawler, which ends up being the value from the last loaded document containing a base tag, resolved against the base href determined by the previous content. This behavior simply does not make sense.

What do you think about these proposals?
