This repository was archived by the owner on Sep 7, 2023. It is now read-only.

Engine code: describe which XPath can fail, which must not. #1802

@dalf

When parsing results, engines usually parse the HTML with lxml. If an XPath query returns no result, that may be fine, or it may trigger an error later. In the latter case, it is difficult to know exactly what is going on without looking at the downloaded HTML.

This issue suggests:

  • to add new exception classes.
  • to add two optional parameters (eq and gte) to the eval_xpath function to check the result count.

If the engine code states which XPath queries must not fail, it may become easier to detect when an engine starts to break (?).

I'm not sure whether it would be useful, and/or a privacy problem, if searx collected statistics about broken XPath queries.

Class hierarchy

SearxException
	SearxParameterException
	SearxEngineException
		SearxEngineCaptchaException (instead of RuntimeWarning in google.py)
		SearxEngineXPathException
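
As a rough sketch, the hierarchy above could be written in Python as follows. The constructor of SearxEngineXPathException, its attributes and its message format are assumptions inferred from the calls in the proposed eval_xpath; the issue itself does not specify them.

```python
class SearxException(Exception):
    """Base class for searx errors."""


class SearxParameterException(SearxException):
    """Error related to a request parameter."""


class SearxEngineException(SearxException):
    """Error raised while an engine builds a request or parses a response."""


class SearxEngineCaptchaException(SearxEngineException):
    """The remote site answered with a CAPTCHA."""


class SearxEngineXPathException(SearxEngineException):
    """An XPath query returned an unexpected number of results."""

    def __init__(self, xpath, eq=None, gte=None):
        # Assumed signature: mirrors raise SearxEngineXPathException(xpath, eq=eq)
        if eq is not None:
            expected = "exactly {}".format(eq)
        elif gte is not None:
            expected = "at least {}".format(gte)
        else:
            expected = "a different number of"
        message = "XPath {!r}: expected {} result(s)".format(str(xpath), expected)
        super().__init__(message)
        # Keep the details so that callers can log or count failures.
        self.xpath = xpath
        self.eq = eq
        self.gte = gte
```

Keeping xpath, eq and gte as attributes would let a caller log or aggregate failures without parsing the message.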

eval_xpath

def eval_xpath(element, xpath_str, eq=None, gte=None):
    xpath = get_xpath(xpath_str)
    result = xpath(element)
    # new code: check result count now
    if eq is not None and len(result) != eq:
        raise SearxEngineXPathException(xpath, eq=eq)
    if gte is not None and len(result) < gte:
        raise SearxEngineXPathException(xpath, gte=gte)
    return result
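
To illustrate how the count check behaves, here is a minimal, self-contained sketch. It is not the searx implementation: to stay runnable with the standard library only, xml.etree.ElementTree.findall stands in for the compiled lxml XPath, and eval_xpath_checked, the simplified exception class and the sample markup are all made up for the example.

```python
import xml.etree.ElementTree as ET


class SearxEngineXPathException(Exception):
    """Simplified stand-in for the proposed exception class."""

    def __init__(self, xpath_str, eq=None, gte=None):
        super().__init__(
            "unexpected result count for {!r} (eq={}, gte={})".format(xpath_str, eq, gte)
        )
        self.xpath_str = xpath_str


def eval_xpath_checked(element, xpath_str, eq=None, gte=None):
    # Assumption: findall() replaces get_xpath(xpath_str)(element) from the proposal.
    result = element.findall(xpath_str)
    if eq is not None and len(result) != eq:
        raise SearxEngineXPathException(xpath_str, eq=eq)
    if gte is not None and len(result) < gte:
        raise SearxEngineXPathException(xpath_str, gte=gte)
    return result


# Hypothetical markup: the second item has no link.
dom = ET.fromstring(
    "<ul>"
    "<li class='b_algo'><h2><a href='https://example.org/1'>one</a></h2></li>"
    "<li class='b_algo'><h2>no link here</h2></li>"
    "</ul>"
)

links = []
failed = []
for item in eval_xpath_checked(dom, ".//li[@class='b_algo']", gte=1):
    try:
        # Exactly one link is required per result.
        link = eval_xpath_checked(item, ".//h2/a", eq=1)[0]
        links.append(link.get("href"))
    except SearxEngineXPathException as e:
        # The failing XPath is known here, without inspecting the HTML.
        failed.append(e.xpath_str)
```

The outer query uses gte=1 because an empty result list means the page layout changed, while the inner query uses eq=1 because each result is expected to carry exactly one link.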

usage examples

extract_url

https://github.com/asciimoo/searx/blob/master/searx/engines/xpath.py#L53

def extract_url(xpath_results, search_url):
    if xpath_results == []:
        raise Exception('Empty url resultset')	

--> Make the check before calling extract_url

bing engine

    ...
    for result in eval_xpath(dom, '//div[@class="sa_cc"]'):
        link = eval_xpath(result, './/h3/a', eq=1)[0]
    ...
    for result in eval_xpath(dom, '//li[@class="b_algo"]'):
        link = eval_xpath(result, './/h2/a', eq=1)[0]
    ...

google engine

	title = extract_text(eval_xpath(result, title_xpath, eq=1)[0])
	url = parse_url(extract_url(eval_xpath(result, url_xpath, eq=1), google_url), google_hostname)

The huge try/except block that currently ignores all parsing errors would then be able to show the failing XPath in the logs.
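
A hedged sketch of what that could look like: the per-result try/except catches only the XPath exception and logs it. The logger name, parse_results, parse_one and the two-argument exception constructor are hypothetical names chosen for this example, not existing searx code.

```python
import logging

logger = logging.getLogger("example_engine")  # hypothetical logger name


class SearxEngineXPathException(Exception):
    """Simplified stand-in for the proposed exception class."""

    def __init__(self, xpath_str, message):
        super().__init__("{}: {}".format(xpath_str, message))
        self.xpath_str = xpath_str


def parse_results(raw_results, parse_one):
    """Parse each raw result; skip and log the ones whose XPath check fails."""
    results = []
    for raw in raw_results:
        try:
            results.append(parse_one(raw))
        except SearxEngineXPathException as e:
            # The log line names the XPath that did not match.
            logger.warning("result skipped, XPath check failed: %s", e)
    return results


def parse_one(raw):
    # Hypothetical per-result parser: fails when the link is missing.
    if "link" not in raw:
        raise SearxEngineXPathException(".//h3/a", "expected exactly 1 result, got 0")
    return {"url": raw["link"]}


results = parse_results([{"link": "https://example.org"}, {}], parse_one)
```

Catching only SearxEngineXPathException, instead of every exception, keeps unrelated bugs visible while still letting the engine skip a malformed result.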

Another way, without try/except and without modifying the eval_xpath function:

	title_xpr = eval_xpath(result, title_xpath)
	url_xpr = eval_xpath(result, url_xpath)
	if len(title_xpr) > 0 and len(url_xpr) > 0:
		title = extract_text(title_xpr[0])
		url = parse_url(extract_url(url_xpr, google_url), google_hostname)
		...

The eq and gte parameters do not help much when parsing the number of results.

Using eq:

    try:
        results_num = int(eval_xpath(dom, '//div[@id="resultStats"]//text()', eq=1)[0]
                          .split()[1].replace(',', ''))
        results.append({'number_of_results': results_num})
    except Exception:
        pass

Without eq, with more checking:

    results_num_xpath = eval_xpath(dom, '//div[@id="resultStats"]//text()')
    if len(results_num_xpath) > 0:
        results_num_text = results_num_xpath[0]
        try:
            # split()[1] raises IndexError if the text has fewer than two words
            results_num_text_first = results_num_text.split()[1].replace(',', '')
            results_num = int(results_num_text_first)
            results.append({'number_of_results': results_num})
        except (IndexError, ValueError):
            pass
