Engine code: describe which XPath can fail, which must not. #1802
Description
When parsing results, engines usually parse the HTML with lxml. If an XPath request doesn't return at least one result, that may be fine, or it may trigger an error later. In the latter case, it is difficult to know exactly what is going on without looking at the downloaded HTML.
This issue suggests:
- to add new exception classes.
- to add two optional parameters to the eval_xpath function to check the result count.
It may help to detect when an engine starts to break if the engine code says which XPath requests must not fail (?).
I'm not sure whether it is useful and/or a privacy problem if searx collects statistics about broken XPaths.
Class hierarchy
SearxException
    SearxParameterException
    SearxEngineException
        SearxEngineCaptchaException (instead of RuntimeWarning in google.py)
        SearxEngineXPathException
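As a rough illustration, the hierarchy above could be declared as follows. This is only a sketch in Python 3: the class names come from the list above, but the docstrings, the constructor of SearxEngineXPathException, and its message format are assumptions, not existing searx code.

```python
class SearxException(Exception):
    """Base class of the proposed hierarchy."""


class SearxParameterException(SearxException):
    """Problem with a search parameter (docstring wording is an assumption)."""


class SearxEngineException(SearxException):
    """Base class for errors raised by engine code."""


class SearxEngineCaptchaException(SearxEngineException):
    """Would replace the RuntimeWarning raised in google.py."""


class SearxEngineXPathException(SearxEngineException):
    """An XPath returned an unexpected number of results."""

    def __init__(self, xpath_str, eq=None, gte=None):
        # Hypothetical constructor: keep the expectation so logs can show it.
        self.xpath_str = xpath_str
        self.eq = eq
        self.gte = gte
        expected = f"exactly {eq}" if eq is not None else f"at least {gte}"
        super().__init__(f"XPath {xpath_str!r}: expected {expected} result(s)")
```

With such a layout, generic error handling can catch SearxEngineException, while code that cares about a broken XPath can catch SearxEngineXPathException and read its xpath_str attribute.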
eval_xpath
def eval_xpath(element, xpath_str, eq=None, gte=None):
    xpath = get_xpath(xpath_str)
    result = xpath(element)
    # new code: check result count now
    if eq is not None and len(result) != eq:
        raise SearxEngineXPathException(xpath, eq=eq)
    if gte is not None and len(result) < gte:
        raise SearxEngineXPathException(xpath, gte=gte)
    return result

Usage examples
extract_url
https://github.com/asciimoo/searx/blob/master/searx/engines/xpath.py#L53
def extract_url(xpath_results, search_url):
    if xpath_results == []:
        # --> Make the check before calling extract_url
        raise Exception('Empty url resultset')
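Moving the check to the caller could look roughly like this. It is a minimal sketch on plain lists, with no lxml involved: require_results, EmptyXPathResult, and extract_url_stub are hypothetical names used only for illustration, and extract_url_stub is a heavily simplified stand-in for the real extract_url.

```python
class EmptyXPathResult(Exception):
    """Hypothetical stand-in for the proposed SearxEngineXPathException."""


def require_results(xpath_results, xpath_str, gte=1):
    """Return xpath_results unchanged, or raise if it has fewer than gte items."""
    if len(xpath_results) < gte:
        raise EmptyXPathResult(
            f"XPath {xpath_str!r}: expected at least {gte} result(s), "
            f"got {len(xpath_results)}"
        )
    return xpath_results


def extract_url_stub(xpath_results, search_url):
    """Simplified stand-in for extract_url, without its own emptiness check."""
    return str(xpath_results[0])


# The caller checks first, so the error names the XPath that came back empty.
url = extract_url_stub(
    require_results(['https://example.org/page'], './/a/@href'),
    'https://example.org',
)
```

The point of the design is that the error is raised where the XPath string is still known, instead of a generic 'Empty url resultset' raised one call deeper.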
bing engine
...
for result in eval_xpath(dom, '//div[@class="sa_cc"]'):
    link = eval_xpath(result, './/h3/a', eq=1)[0]
    ...
for result in eval_xpath(dom, '//li[@class="b_algo"]'):
    link = eval_xpath(result, './/h2/a', eq=1)[0]
    ...

google engine
title = extract_text(eval_xpath(result, title_xpath, eq=1)[0])
url = parse_url(extract_url(eval_xpath(result, url_xpath, eq=1), google_url), google_hostname)

The huge try/catch that ignores all the parsing errors would then be able to display the XPath in the logs.
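To make the logging idea concrete, here is a sketch of a per-result try/except that logs the failing XPath and skips the result. XPathCountError, parse_results, and the logger name are hypothetical; the real google engine code is not reproduced here.

```python
import logging

logger = logging.getLogger("searx.engines.example")  # hypothetical logger name


class XPathCountError(Exception):
    """Hypothetical stand-in for the proposed SearxEngineXPathException."""

    def __init__(self, xpath_str, expected, found):
        super().__init__(f"XPath {xpath_str!r}: expected {expected}, found {found}")
        self.xpath_str = xpath_str


def parse_results(raw_results, parse_one):
    """Parse each raw result; log and skip those whose XPath check failed."""
    parsed, failed_xpaths = [], []
    for raw in raw_results:
        try:
            parsed.append(parse_one(raw))
        except XPathCountError as e:
            # The message carries the XPath, so the log shows what broke.
            logger.warning("skipping result: %s", e)
            failed_xpaths.append(e.xpath_str)
    return parsed, failed_xpaths
```

Catching only the XPath exception, rather than every exception, keeps unrelated bugs visible while still tolerating a single malformed result.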
Another way, without try/catch and without modifying the eval_xpath function:
title_xpr = eval_xpath(result, title_xpath)
url_xpr = eval_xpath(result, url_xpath)
if len(title_xpr) > 0 and len(url_xpr) > 0:
    title = extract_text(title_xpr[0])
    url = parse_url(extract_url(url_xpr, google_url), google_hostname)
    ...

The eq and gte parameters can't help much for the result count.
Using eq:
try:
    results_num = int(eval_xpath(dom, '//div[@id="resultStats"]//text()', eq=1)[0]
                      .split()[1].replace(',', ''))
    results.append({'number_of_results': results_num})
except:
    pass

Without eq, with more checking:
results_num_xpath = eval_xpath(dom, '//div[@id="resultStats"]//text()')
if len(results_num_xpath) > 0:
    results_num_text = results_num_xpath[0]
    results_num_text_first = results_num_text.split()[1].replace(',', '')
    try:
        results_num = int(results_num_text_first)
        results.append({'number_of_results': results_num})
    except ValueError:
        pass
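The explicit checks above could also be gathered in a small helper. The sketch below is only an illustration: parse_result_count is a hypothetical name, and it assumes the text looks like 'About 1,234 results', with the number as the second token. It also covers the case where the text has fewer than two tokens.

```python
def parse_result_count(texts):
    """Return the result count as an int, or None if it cannot be parsed.

    texts is the list returned by the XPath. Assumption: the first item
    looks like 'About 1,234 results', with the number as second token.
    """
    if not texts:
        return None
    tokens = texts[0].split()
    if len(tokens) < 2:
        return None
    try:
        return int(tokens[1].replace(',', ''))
    except ValueError:
        return None
```

An engine could then append number_of_results only when the helper returns a value other than None.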