Crawling the Web
Stefan Stoicescu
[email protected]
TU Delft
Delft, Netherlands
ABSTRACT

This paper presents a large-scale vulnerability analysis of the top one million domains, comparing results from legacy Alexa and modern Cloudflare rankings. We develop specialized crawling methodologies to identify vulnerable software components, focusing on Apache, WordPress, Wix, CSP configurations, and XSS risks, while overcoming anti-bot protections like Cloudflare. Our analysis shows 3.5% of domains run detectable Apache servers, with 0.3% using vulnerable versions, while WordPress exhibits 31 and 8 confirmed plugin/theme vulnerabilities in the Cloudflare and Alexa datasets, respectively. CSP adoption remains inconsistent. Performance tests show that Go browserless crawlers outperform Python browser-based crawlers in speed, though Cloudflare still blocks some domains even when using browser automation. While anti-bot systems and false positives present ongoing challenges, our hybrid lightweight/browser-based approach proves large-scale vulnerability detection remains practical.

1 INTRODUCTION

The modern web is a dynamic, layered environment built on a complex mix of open-source frameworks, third-party services, and custom integrations. While these technologies enable rapid development and widespread access, they also introduce a persistent risk: outdated or vulnerable components left exposed to the public. For attackers, popular websites represent high-value targets, and a single overlooked update can serve as a gateway for widespread exploitation.

Understanding the real-world security posture of the web requires more than just studying theoretical vulnerabilities; it also necessitates an examination of practical threats. It demands direct observation of what is deployed on the internet. While internet-wide port scans [13, 14] have long been used to identify open services, relatively few studies have focused on crawling the application layer of the web at scale, especially with an emphasis on detecting software versions and misconfigurations tied to known vulnerabilities.

In this paper, we aim to assess the prevalence of outdated or vulnerable software components among the world's most popular websites by crawling the top one million domains. Our goal is to identify which known vulnerabilities are still exposed in the wild, and how automated detection frameworks can scale to meet this task efficiently.

Anti-bot protections have become a critical component of modern web security, as websites increasingly face automated attacks and web scraping attempts. These protections, often implemented by services like Cloudflare, aim to distinguish between legitimate human traffic and malicious bots by using techniques such as CAPTCHA challenges, JavaScript challenges, and IP reputation analysis. While these measures help mitigate the risks of data scraping, brute-force attacks, and bot-driven exploitation, they also pose significant challenges for web crawlers focused on vulnerability detection. Anti-bot systems can block or slow down crawlers, making it difficult to conduct large-scale scans efficiently. As a result, researchers must adapt their crawling strategies, using advanced tools like stealthy browsers or headless automation to bypass these defenses, all while balancing performance and accuracy. Understanding the intricacies of anti-bot protections is essential for improving the effectiveness of vulnerability detection across the web.

To address these goals, we focus on several key questions: How can we conduct a vulnerability-focused web crawl across one million domains in under 24 hours to identify critical weaknesses in widely used websites, and how do modern protections like Cloudflare impact the crawling process? How do various crawling frameworks perform when detecting vulnerabilities? How do different frameworks handle modern anti-bot protections, and which web vulnerabilities can be identified through crawling alone?

This paper makes the following contributions:

• It presents a methodology for conducting large-scale web crawls focused on identifying critical vulnerabilities in popular websites while addressing challenges posed by modern protections like Cloudflare.
• It evaluates the effectiveness of different crawling approaches for vulnerability detection, emphasizing scalability, efficiency, and handling websites with anti-bot measures.
• It identifies and analyzes common web vulnerabilities, such as those found in WordPress, Wix, and Apache, as well as
Delft, 2025 Matyas Kollert, Cristian Comendant, Adrian Josan, Matei Ivan Tudor, Cristian Perlog, Andrei Ionescu, and Stefan Stoicescu
granular error handling, and concurrent processing. These capabilities, along with its ease of use, were the reasons for choosing this particular framework for the majority of the crawls.

Crawlee is a modern JavaScript web scraping and browser automation library designed for building reliable crawlers. It provides built-in support for both HTTP requests and headless browser automation, offering automatic retry mechanisms, request queue management, and sophisticated anti-detection features. The framework supports multiple underlying engines, including Playwright, Puppeteer, and Cheerio. Despite Crawlee's advanced feature set and reputation for handling modern web applications, our implementation faced significant performance limitations. Even after stripping away all stealth configurations and reducing the framework to its most barebones setup, we could not achieve more than one request every two seconds. This throughput proved insufficient for our large-scale crawling objectives, where we needed to process one million domains within 24 hours. The performance bottleneck appeared to stem from Crawlee's extensive built-in safeguards and overhead, which, while beneficial for avoiding detection, severely limited scalability.

Playwright is a cross-browser automation toolkit that drives Chromium, Firefox, and WebKit with a unified API. Unlike Colly's basic HTTP requests, Playwright uses a real browser engine, executes JavaScript, and exposes granular hooks for request interception, screenshotting, and network-level manipulation. Additionally, it ships with automatic "smart waits" and retry logic that dramatically reduces the DOM-synchronization bugs common in Puppeteer scripts. Playwright is reserved for crawling activities that require realistic client-side interaction with websites, such as XSS vulnerability detection. While Playwright's respectable performance of 10 requests per second (rps) made it our initial choice for a second-pass crawler, Camoufox's more authentic default user fingerprint turned out to be essential for fooling Cloudflare-equipped websites.

Camoufox is a purpose-built anti-detect browser designed for Playwright: a minimalistic Firefox fork that injects realistic, rotating fingerprints at the protocol level, covering canvas, WebGL, WebRTC, font metrics, time zone, and dozens of other high-entropy surfaces. Unlike stealth plug-ins that patch navigator properties post-launch, Camoufox re-compiles the browser so that fingerprint mutations originate in native code, making it nearly indistinguishable from a human-operated session. However, stealth comes at a cost: Camoufox is much slower than Playwright, with a throughput of around 2 rps when using 16 threads. We therefore deployed Camoufox in a targeted role: domains that returned a 403 Cloudflare error were re-queued for a second, Camoufox-based crawl.

4.2 Vulnerability Categories

In terms of vulnerabilities, this research focused on HTTP servers, WordPress, CSP configuration, Cross-Site Scripting, and Wix. These were chosen for their popularity and the number of known exploits.

HTTP web servers are used by virtually all websites on the internet to serve HTTP requests. We analyzed market share data³ and found that nginx (33.8%) and Apache (26.0%) are the most popular options. Notably, nginx is often deployed as a load balancer or reverse proxy, routing requests to backend servers that may run other web server software.

We consulted The Exploit Database⁴ to identify vulnerabilities in these servers. Our research revealed:

• nginx has had no verified exploits since 2013, indicating relative safety.
• Apache had severe Remote Code Execution (RCE) vulnerabilities in 2021 (see Table 1).
• Apache Tomcat (a Java Servlet implementation) has had seven vulnerabilities since 2020, including a 2025 RCE exploit, though most remain unverified.

Given Apache and Tomcat's higher vulnerability profile, we focused on detecting them using these techniques:

(1) Server Header Analysis
The HTTP response's Server header often discloses the server name and version (Figure 1). Administrators frequently suppress this information to avoid exposing vulnerable versions.
(2) 404 Error Page Inspection
If the header is hidden, the crawler requests a fake URL to trigger a 404 response. Default installations reveal server versions in error pages (Figure 2, Figure 3).
(3) X-Powered-By Header Check
As a fallback, the X-Powered-By header may indicate a Servlet-based server (e.g., Tomcat), though this isn't definitive.

WordPress is one of the most widely used content management systems (CMS) on the internet, powering over 43% of all websites [2]. This massive adoption makes WordPress a prime target for attackers, as vulnerabilities in the platform, themes, or plugins can lead to widespread exploits. One of the reasons for these vulnerabilities is that many of WordPress's themes and plugins are developed by the community, often by independent developers. While this allows for rapid innovation and customization, it also introduces significant risks. Community-developed plugins and themes are not always subject to rigorous security audits or updates, leaving a large number of outdated or vulnerable components active on websites. This has resulted in many security breaches where attackers exploit these outdated plugins or themes [20, 21]. As a result, maintaining up-to-date software and properly vetting third-party components remains a critical challenge for WordPress site administrators.

To detect WordPress websites in a large-scale crawl, we identify key indicators such as WordPress-specific URLs and resources. We scan for the presence of common WordPress paths, including those for themes and plugins stored under /wp-content/themes/ and /wp-content/plugins/. This method involves analyzing page elements like the href, src, and img attributes, which often contain version information or specific paths that can reveal WordPress components. The crawler also extracts and logs information such as the theme name and the list of plugins found on each page. These indicators help us accurately identify WordPress websites and gather relevant information about their configurations and potential vulnerabilities.

³ https://w3techs.com/technologies/overview/web_server
⁴ https://www.exploit-db.com/
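The three server-detection steps above reduce to a small amount of matching logic. The sketch below is illustrative only: it is written in Python for brevity rather than the Go/Colly code used in the actual crawl, and the function and pattern names are our own, not the crawler's.

```python
import re

# matches e.g. "Apache/2.4.49 (Unix)" or "Apache Tomcat/9.0.31"
SERVER_RE = re.compile(r"Apache(?: Tomcat)?/([\d.]+)", re.IGNORECASE)

def _name(match):
    return "Apache Tomcat" if "tomcat" in match.group(0).lower() else "Apache"

def detect_server(headers, error_page_html=""):
    """Return (server, version); (None, None) when neither is detectable."""
    # (1) Server header analysis
    match = SERVER_RE.search(headers.get("Server", ""))
    if match:
        return _name(match), match.group(1)
    # (2) 404 error-page inspection: default error pages embed a signature line
    match = SERVER_RE.search(error_page_html)
    if match:
        return _name(match), match.group(1)
    # (3) X-Powered-By fallback: hints at a Servlet container, but gives no version
    if "servlet" in headers.get("X-Powered-By", "").lower():
        return "Apache Tomcat", None
    return None, None
```

The ordering matters: the header is authoritative when present, the 404 body only compensates for a suppressed header, and the X-Powered-By fallback yields a server family but no version to check against CVEs.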
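Likewise, the WordPress indicators described above amount to scanning asset URLs for /wp-content/ paths and comparing any extracted version string against the tracked vulnerability ceilings (e.g., LearnPress up to 4.2.7, XStore up to 9.3.5, per Section 4.2). A hypothetical Python sketch of that check, assuming versions are exposed via the common ?ver= query parameter; this is not the crawler's actual code:

```python
import re

# plugin/theme slug plus a trailing ?ver= version on the asset URL
PLUGIN_RE = re.compile(r"/wp-content/plugins/([\w-]+)/[^\s\"']*?[?&]ver=([\d.]+)")
THEME_RE = re.compile(r"/wp-content/themes/([\w-]+)/[^\s\"']*?[?&]ver=([\d.]+)")

# "vulnerable up to and including" ceilings taken from the CVEs this paper tracks
VULNERABLE_UP_TO = {"learnpress": "4.2.7", "xstore": "9.3.5"}

def parse_version(v):
    """'4.2.6' -> (4, 2, 6), so tuples compare component-wise."""
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def scan_page(html):
    """Return (component, version, is_vulnerable) triples found in the HTML."""
    findings = []
    for regex in (PLUGIN_RE, THEME_RE):
        for name, ver in regex.findall(html):
            ceiling = VULNERABLE_UP_TO.get(name.lower())
            vulnerable = ceiling is not None and parse_version(ver) <= parse_version(ceiling)
            findings.append((name, ver, vulnerable))
    return findings
```

Components whose version string is absent cannot be classified this way, which is why the results tables later report parenthesized counts that require manual verification.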
In this study, we focus on detecting specific vulnerabilities in popular themes and plugins, particularly those with known CVEs that are version-specific. As numerous vulnerabilities are discovered each year [20, 21], we selected a few representative ones, with the flexibility to easily extend this approach in the future. For example, we track vulnerabilities in the XStore theme [8] for versions up to 9.3.5, and the LearnPress plugin [4] for versions up to 4.2.7. We carefully monitor these vulnerabilities to ensure that our crawler can detect the affected versions. Additionally, we focus on several other high-severity vulnerabilities, such as privilege escalation in plugins like SureTriggers [11] and LiteSpeed Cache [7], as well as SQL injection flaws in plugins like Depicter [9].

Content Security Policy (CSP) is a critical web security mechanism that helps prevent Cross-Site Scripting (XSS) attacks by allowing website operators to specify which sources of content are considered trustworthy. When properly implemented, CSP acts as a powerful defense layer that can significantly reduce the impact of code injection vulnerabilities. However, CSP adoption remains inconsistent across the web, which is what we wish to explore. Our crawling methodology detects whether websites implement CSP and, for those without CSP, identifies alternative XSS prevention mechanisms. We check for CSP protection through both HTTP headers and meta tags in the HTML. When analyzing sites, we look for XSS risk indicators such as inline scripts, inline event handlers, eval() usage, and cross-origin script loading. For websites lacking CSP protection, we assess their security posture by comparing these risk factors against protective measures like modern frameworks, input validation, and output encoding. This evaluation allows us to categorize sites by risk level and pinpoint which popular websites are most vulnerable to XSS attacks due to insufficient protection mechanisms.

Cross-Site Scripting (XSS) vulnerabilities allow an attacker to run arbitrary JavaScript in a victim's browser. Such an attack becomes feasible whenever untrusted input travels from an HTTP request into a script-interpreted context without strict input sanitization. The JavaScript is executed in the context of the vulnerable user, thereby gaining the same ability to read cookies, tokens, and DOM data as the legitimate application. Many protections against XSS have been devised, such as input sanitization and runtime data-code separation, and both static and dynamic analysis techniques exist for detecting XSS vulnerabilities [15, 19]. Even with many protection and detection protocols, XSS was the top-ranked vulnerability in the Common Weakness Enumeration in 2024⁵, overtaking memory safety errors.

XSS is generally divided into three subclasses [15, 19]:

• DOM-Based XSS: A client-side-only flaw that arises when frontend code reads malicious values and uses them in DOM sinks where data is executed. Inputs with embedded scripts that are not properly sanitized thus lead to those scripts being executed.
• Reflected XSS: With reflected XSS, the attacker embeds a malicious script in a request parameter that the server reflects back in its HTTP response. Once the user is taken to the crafted URL, the script executes once in that session. The script is not stored, however, and thus runs only when the user visits the malicious URL.
• Stored XSS: Stored XSS poses the greatest threat of the three, as in this case the malicious script is stored on the server rather than being dropped after execution. This means that every time a query retrieves the stored JavaScript from the server, the malicious code is executed.

In this research, the focus is on DOM XSS. The crawler parses the HTML of the website, looking for HTML forms; if any form is found, it inputs a malicious element with embedded JavaScript that also contains a unique token generated by the crawler. The JavaScript to be executed is a simple console.log. The form is then submitted. The crawler then monitors the console stream, and the moment the unique token appears in the console, it is certain that the embedded JavaScript was executed inside the page.

However, there is an issue with such a crawler: it can accidentally cause a stored XSS attack. This can happen because the input submitted to the form may also be saved to a server. While it is unlikely that the input is both unsanitized in the DOM and also saved unsanitized to the server, it is still a possibility that must be taken into account. Thus, this crawler was not deployed at scale and must be used responsibly.

Wix is a leading cloud-based development platform, offering a closed-source, integrated environment for website creation. Unlike the open-source ecosystem of WordPress, Wix's architecture centralizes control over the core platform and its applications. This distinction shifts the focus of vulnerability detection from a wide range of third-party components to the specific, first-party applications offered by Wix itself. Our methodology for Wix, therefore, focuses on first identifying sites built on the platform and then enumerating the specific Wix applications they utilize.

To identify a website as being built with Wix, our crawler uses several methods to identify fingerprints. The primary and most reliable indicator is the presence of the X-Wix-Server-Artifact-Id HTTP response header, which definitively confirms the site is served by Wix infrastructure. In addition to this server-side check, we perform a comprehensive scan of the client-side source code.

⁵ https://cwe.mitre.org/top25/archive/2024/2024_cwe_top25.html
This involves searching for Wix-specific patterns and domains across several HTML elements:

• Scripts and Links: We analyze the src and href attributes of <script> and <link> tags for URLs containing Wix-owned domains, such as wix.com, wixstatic.com, and wixsite.com.
• HTML Attributes and Content: The crawler inspects the entire DOM for keywords like wix, wix-, and _wix within attributes such as class, id, data-wix, and data-hook.
• Meta Tags and Inline Scripts: We also parse <meta> tags and the content of inline scripts for explicit mentions of "Wix" or its services.

Once a site is identified as Wix-powered, the crawler proceeds to enumerate the active Wix applications, such as Wix Stores, Blog, Bookings, or Events. This is achieved by searching the page source for application-specific keywords (e.g., wix-bookings, wixStore). Our process also includes an attempt to extract version information for these applications by matching common versioning patterns (e.g., v=, version=, data-version) found in URLs or element attributes. While finding specific, publicly disclosed vulnerabilities in a closed-source platform like Wix is challenging, this enumeration gives a clear view of the technologies in use, and of current and possible future security gaps.

4.3 Cloudflare Protection

To detect Cloudflare protection, we monitor HTTP response statuses and the content of response bodies during the crawl. Specifically, when a 403 Forbidden status code is encountered, we check the response body for keywords associated with Cloudflare's security mechanisms. These keywords include "cloudflare" and "_cf_chl_opt", which indicate that Cloudflare's challenge page is being served. If such keywords are detected, the domain is flagged as protected by Cloudflare. This detection is crucial for identifying which websites require special handling during the crawl, as Cloudflare often serves a CAPTCHA or JavaScript challenge page that can block automated crawlers. By counting and tracking these occurrences, we can efficiently identify Cloudflare-protected sites and handle them accordingly, using alternative crawling strategies like browser-based crawlers for further exploration.

5 RESULTS

The crawlers were run on the Cloudflare top 1 million, 100 thousand, and 10 thousand, and the Alexa top 1 million, 100 thousand, and 10 thousand datasets. In this section, these are referred to as C1M, C100K, C10K, A1M, A100K, and A10K, respectively. First, the Go implementations of the crawlers were run on those datasets to collect results and to determine which web pages could not be accessed due to Cloudflare blocking the crawlers. Only about 31,000 domains from C1M and 9,600 domains from A1M were found to be protected by Cloudflare. The Python crawlers were then run on these protected websites, as they use the Camoufox framework in combination with a CAPTCHA submission function, which provides a higher success rate in accessing them than the Go crawlers. Notably, Camoufox also supports proper redirection to the target page after solving a CAPTCHA, whereas other Python frameworks often become stuck in an endless loop of repeated CAPTCHA submissions.

5.1 Performance

There is a large difference in performance between the Go and Python crawlers, attributable to a number of factors regarding both the scraping frameworks they use and the languages themselves. The Go crawlers were also run on one of TU Delft's supercomputers, Naxos. However, contrary to expectations, the crawlers slowed considerably there. There is no clear reason why this is the case, as the supercomputer is limited neither by resources nor by network bandwidth.

Go is a more efficient programming language than Python. It is compiled rather than interpreted, and uses static typing instead of dynamic typing, both of which reduce runtime overhead. Furthermore, Go uses "goroutines", which are extremely lightweight coroutines managed by the Go runtime itself. On the other hand, the Python crawlers must make use of the "multiprocessing" package to avoid the Global Interpreter Lock, which restricts execution to a single thread at a time. However, the "multiprocessing" package also adds memory overhead and requires constant serialization and deserialization of data.

The Go crawlers make use of the highly optimized Colly framework, which performs bare HTTP requests to domains rather than running a browser in the background. This already makes it much faster than scraping frameworks that do run a browser, even in headless mode, and leads to much lower resource consumption. The framework still has all the necessary capabilities of a scraping framework, such as setting a user agent, header manipulation, asynchronous and parallel scraping, and more.

The Python crawlers use the Camoufox framework. While Camoufox provides more stealth capabilities, this comes at a small performance cost: it is slightly slower than other Python frameworks such as Playwright. Camoufox also runs a browser in the background for each thread, leading to high memory consumption even when the browsers are run in "headless" mode, which limits performance.

As Table 2 shows, the Go implementations deliver a huge speed-up compared to Python. As mentioned above, the crawlers slow down considerably when run on the Naxos supercomputer. Although the Go crawlers have an average speed of ≈16 requests per second, running them on Naxos drops this number to just 7.

Crawler                        Requests Per Second
Go WordPress                   16.07
Go CSP                         15.41
Go Apache                      16
Python WordPress               1.98
Python CSP                     1.02
Python Apache                  1
Go crawlers on Naxos (avg.)    ≈7

Table 2: Crawler performance metrics
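The Wix fingerprinting order described in Section 4.2 (the definitive response header first, then progressively weaker client-side signals) can be sketched as follows. This is an illustrative Python reconstruction with our own names, not the deployed crawler, and it deliberately separates the bare keyword fallback, which is the signal most prone to false positives:

```python
import re

WIX_DOMAINS = ("wix.com", "wixstatic.com", "wixsite.com")
# bare "wix"/"_wix" keyword anywhere in the page source
WIX_KEYWORD_RE = re.compile(r"\b_?wix", re.IGNORECASE)

def classify_wix(headers, html):
    """Return how strongly a page fingerprints as Wix-hosted."""
    # definitive server-side signal
    if "X-Wix-Server-Artifact-Id" in headers:
        return "confirmed"
    # weaker client-side signal: assets loaded from Wix-owned domains
    if any(domain in html for domain in WIX_DOMAINS):
        return "likely"
    # weakest signal: a keyword match anywhere in the DOM (false-positive prone)
    if WIX_KEYWORD_RE.search(html):
        return "keyword-only"
    return "no"
```

A site whose marketing copy merely mentions "Wix" lands in the keyword-only tier, which is exactly the failure mode the Results section quantifies.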
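The Cloudflare heuristic from Section 4.3 (a 403 response whose body contains challenge-page markers, with matching domains re-queued for the Camoufox second pass) is equally compact. A hedged Python sketch; the queue handling and names are illustrative, not the production code:

```python
# markers of Cloudflare's challenge page, per Section 4.3
CF_MARKERS = ("cloudflare", "_cf_chl_opt")

def is_cloudflare_blocked(status, body):
    """True when a 403 body carries Cloudflare challenge-page keywords."""
    if status != 403:
        return False
    lowered = body.lower()
    return any(marker in lowered for marker in CF_MARKERS)

def triage(status, body, retry_queue, domain):
    """Flag protected domains for the browser-based second-pass crawl."""
    if is_cloudflare_blocked(status, body):
        retry_queue.append(domain)
```

Checking the body, not just the status code, matters: a plain 403 from an origin server should not be re-queued, since the browser-based pass cannot help there.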
Server                 C1M     C100K   C10K   A1M    A100K
Apache                 35048   1652    78     6860   343
Apache Tomcat          63      7       0      390    59
Vulnerable Versions    3232    127     27     354    15

Among 1 million Cloudflare-scanned domains, about 3.5% have detectable Apache servers, while 0.3% show vulnerable versions. This ratio declines with domain popularity, with roughly 1.6% detectable and 0.13% vulnerable in the top 100K Cloudflare domains. This suggests that more popular domains are less likely to expose detectable and vulnerable Apache versions, likely due to stronger security measures or better header masking.

Out of all the detected websites, only 2 domains were running Apache version 2.4.49 (one of the most recent versions known to be vulnerable to Remote Code Execution (RCE))⁶. However, for a successful RCE attack, the administrator must have explicitly disabled the default configuration directive Require all denied and enabled CGI scripts for specific paths. Given this specific and uncommon set of conditions, the likelihood of successful exploitation across all affected servers is relatively low. Nevertheless, if an RCE attack does succeed, it could allow an attacker to gain full control over a highly valuable asset, especially if the domain is among the most popular and heavily visited.

Additionally, we detected 17 domains running Apache versions older than 2.2.15, which are known to be vulnerable to the notorious Slowloris Denial of Service attack⁷. The list of CVEs against which these versions were checked is provided in Appendix B.

WordPress vulnerabilities were more frequent than expected. The most prevalent vulnerability was associated with the XStore theme [8], which was detected 31 times in the C1M dataset and 8 times in the A1M dataset. The most frequently found plugin vulnerability was tied to LearnPress [4], appearing 30 times in the C1M dataset and 6 times in the A1M dataset. Other notable vulnerabilities included those in the Depicter plugin and both LiteSpeed Cache vulnerabilities. In comparison, vulnerabilities were much less frequent in the C100K and A100K datasets, with no confirmed vulnerabilities found in the C10K and A10K datasets.

For a detailed view of the distribution of these vulnerabilities across different domains, refer to Tables 4 and 5 below. The numbers in parentheses indicate cases where the version of the theme/plugin was not publicly available. As a result, while the theme/plugin may be vulnerable, manual verification would be necessary to confirm this.

Our scan for Wix-based websites and applications yielded significant results, primarily highlighting the limitations and high error rate of a heuristic-based detection methodology.

⁶ https://www.exploit-db.com/exploits/50383
⁷ https://en.wikipedia.org/wiki/Slowloris_(cyber_attack)

Table 5: Vulnerable WordPress Plugins

Plugin                   C1M        C100K    C10K    A1M      A100K
depicter [9]             14 (110)   1 (2)    0 (2)   3 (27)   0 (2)
litespeed-cache [7]      18 (507)   2 (22)   0 (1)   5 (59)   0 (1)
litespeed-cache-2 [6]    15 (507)   0 (22)   0 (1)   2 (59)   0 (1)
learnpress [4]           30 (21)    1 (0)    0 (0)   6 (4)    0 (0)
suretriggers [11]        3 (11)     0 (0)    0 (0)   0 (2)    0 (0)
k-elements [10]          0 (0)      0 (0)    0 (0)   0 (0)    0 (0)
vikbooking [5]           1 (2)      0 (0)    0 (0)   0 (2)    0 (0)

Site Identification and False Positive Rate. Our crawler initially flagged 88 unique domains from the C1M dataset as potentially being built with Wix. To validate these findings, a manual inspection was performed on this set of 88 domains. The analysis revealed that only 2 of these domains (wix.com and its subdomain editorx.com) were definitively Wix platforms. The remaining 86 domains, representing an exceptionally high false positive rate of 97.7%, were incorrectly identified.

These false positives can be grouped into several categories:

• Major Technology and Media Companies: High-profile sites such as openai.com, cnn.com, google.com, figma.com, roblox.com, and mailchimp.com were consistently misidentified. These platforms operate on their own infrastructure.
• Third-Party Widget and Service Providers: A significant number of false positives came from service providers whose products integrate with Wix. Domains like elfsight.com, userway.org, tidio.com, and powr.io were flagged because their marketing pages or documentation contain the keyword "Wix", not because they are hosted on the platform.
• Coincidental Keyword Matches: On other large sites, the keyword "wix" appeared coincidentally within JavaScript variable names, CSS classes, or large JSON objects, leading to incorrect identification.

Plugin and Version Extraction Errors. The high false positive rate in site identification led to a complete failure in the subsequent enumeration of Wix applications and their versions. The crawler reported finding "Wix Stores" on bestbuy.ca and "Wix Events" on cnn.com, both of which are incorrect attributions caused by the presence of generic keywords. Furthermore, the version extraction process produced unusable, noisy data: instead of version numbers, the crawler extracted code fragments.

Our analysis of Content Security Policy (CSP) implementation reveals significant variations in adoption rates and security postures. CSP header implementation was highest among the most popular domains in the C10K dataset at 32.9%, while broader datasets showed lower adoption rates: 28.2% for A1M, 26.3% for
C100K, and 20.1% for C1M. Meta tag-based CSP implementation remained consistently low across all datasets, ranging from 1.0% to 1.7%, indicating that most sites prefer HTTP header-based CSP deployment when they implement it at all. The analysis of XSS risk indicators shows concerning patterns across all datasets. Sites averaged between 8.44 and 11.95 external scripts per page, with a substantial portion being cross-origin scripts that increase the attack surface. Inline scripts and event handlers were prevalent, averaging 8.61-11.95 and 3.79-6.12 per page, respectively. For sites without CSP protection, our risk assessment categorized 18.9% to 33.1% as high risk for XSS vulnerabilities, with the C1M dataset showing the highest proportion of vulnerable sites at 33.1%. Despite the lack of CSP adoption, many sites employed alternative protection mechanisms, with input validation being the most common (43.4% to 73.5%) and HttpOnly cookies providing session protection for 43.1% to 76.1% of analyzed domains.

Metric                          C1M      C100K    C10K     A1M
CSP Implementation
  CSP Header                    20.1%    26.3%    32.9%    28.2%
  Meta CSP                      1.0%     1.4%     1.7%     1.6%
XSS Risk Indicators (avg/URL)
  Inline Scripts                11.95    11.10    10.81    8.61
  Inline Handlers               5.53     6.12     4.17     3.79
  Inline Styles                 6.87     6.96     8.59     4.30
  eval() Usage                  0.63     0.62     0.66     0.45
  postMessage                   0.16     0.14     0.15     0.12
  JSONP Endpoints               0.31     0.25     0.25     0.21
Script Loading (avg/URL)
  External Scripts              11.33    11.00    11.20    8.44
  Cross-origin                  7.30     8.29     9.21     5.07
  Same-origin                   4.03     2.71     1.99     3.37
Protection Measures
  Modern Frameworks             32.5%    40.4%    51.4%    28.8%
  X-Content-Type                34.6%    40.8%    44.3%    48.4%
  Output Encoding               39.6%    40.6%    46.9%    31.5%
  Input Validation              73.5%    60.3%    43.4%    45.6%
  Sandboxed iframes             1.1%     1.3%     0.7%     1.1%
  HttpOnly Cookies              43.1%    48.4%    52.2%    76.1%
  Sensitive Forms               51.2%    43.2%    29.3%    38.2%
XSS Risk (Sites w/o CSP)
  High Risk                     33.1%    29.1%    24.9%    18.9%
  Medium Risk                   14.2%    14.1%    13.8%    9.4%
  Low Risk                      12.3%    11.5%    11.6%    12.4%
  Minimal Risk                  19.7%    18.0%    15.8%    30.2%

Table 6: Comprehensive CSP and XSS Security Analysis

XSS. The crawler was not deployed at scale, given the possible issues it may cause. As such, to test the crawler, three simple, vulnerable websites were created. All of them make use of dangerous DOM sinks, which end up executing unsanitized user inputs. The first website makes use of the "document.write" function, the second one uses the "innerHTML" property of an HTML element in the DOM, and the third one makes use of the "setAttribute" function of an HTML element. To detect websites as vulnerable, the crawler makes use of two different HTML elements as input:

• "><img src=x onerror="console.log('XSS:t')">
• x" onerror="console.log('XSS:t')"

Both elements have an error handler that calls the console.log.

However, the crawler must be used responsibly. The user requires authorization to utilize it, as it can possibly be harmful, and should only use it on websites where the "security.txt" file can be found, as it contains contact information for the owner of the website. An identifiable user-agent should also be used, as it allows the website owner to determine who performed the crawl. The embedded JavaScript code must be non-harmful; a simple "console.log" is the best choice. However, even if the JavaScript is non-harmful, storing it would pollute the web application. Lastly, if a vulnerability is detected, the user must perform a coordinated vulnerability disclosure.

5.3 Cloudflare-Protected Domains

For A1M, around 9,600 domains were flagged as protected by Cloudflare; in C1M it was over 31,000 domains. Using the second-pass crawler, we were able to get a valid response from around 26% of the protected websites in A1M, and 23% for C1M. This high failure rate can mostly be attributed to error code 403 (Forbidden), indicating that the domains were simply not meant to be accessed directly. Accessing most of these domains in a browser results in the same behavior, suggesting that the stealthiness of Camoufox is not the primary issue.

In terms of vulnerabilities, no vulnerable WordPress versions or themes were found in either crawl, suggesting that Cloudflare-protected domains are more frequently updated. However, the relatively low number of samples used for the second crawl hinders a more definitive conclusion. The protected domains also showed far fewer high-risk and medium-risk websites, as classified by the CSP crawler: 2.2% and 8.4%, respectively, for Alexa's protected domains, and 3.0% and 13.0%, respectively, for Cloudflare's protected domains. Furthermore, a higher percentage of domains implemented a CSP header compared to the unprotected domains: 34.2% for the Alexa domains and 31.3% for the Cloudflare domains. As for the Apache crawler, no vulnerable versions were detected in the Cloudflare-protected domains.

6 DISCUSSION

Our large-scale crawling analysis reveals significant security gaps in the modern web landscape, particularly regarding outdated software components and inadequate XSS protection mechanisms. The prevalence of vulnerable WordPress themes and plugins, especially the XStore theme and LearnPress plugin found across multiple datasets, demonstrates that many popular websites continue to run software with known CVEs. While Apache servers showed no instances of the critical 2021 RCE vulnerabilities in our crawl, this likely reflects either effective patching practices or the specific configuration requirements needed to exploit these vulnerabilities. The stark contrast in CSP adoption rates, which range from 20.1% in the C1M dataset to 32.9% in the C10K dataset, suggests that more popular websites are more likely to implement modern security headers, yet the majority of sites may remain vulnerable to XSS attacks. The same can be said about WordPress vulnerabilities as
function and both elements intentionally trigger an error. The both C100k and A100k contained close to no confirmed vulnerabili-
crawler was able to detect all three websites as vulnerable with the ties and the C10k and A10k datasets not containing even possible
two payloads. vulnerabilities.
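The crawler confirms a vulnerability by executing the page in a browser and watching for the console.log side effect of the injected payloads. A lightweight, stdlib-only approximation of the same idea — our illustration, not the crawler's actual implementation — is to check whether the raw payload is reflected unescaped in the response body:

```go
package main

import (
	"fmt"
	"strings"
)

// First test payload used by the crawler.
const payload = `"><img src=x onerror="console.log('XSS:t')">`

// reflectsUnescaped reports whether the raw payload appears in the page
// body without HTML escaping -- a cheap proxy for "input reaches a
// dangerous sink", with more false positives than executing the page.
func reflectsUnescaped(body string) bool {
	return strings.Contains(body, payload)
}

func main() {
	vulnerable := `<div>You searched for: "><img src=x onerror="console.log('XSS:t')"></div>`
	escaped := `<div>You searched for: &quot;&gt;&lt;img src=x ...&gt;</div>`
	fmt.Println(reflectsUnescaped(vulnerable)) // true: payload reflected verbatim
	fmt.Println(reflectsUnescaped(escaped))    // false: output encoding applied
}
```

A static reflection check cannot tell whether the payload actually executes, which is why the browser-based console.log confirmation remains the authoritative test.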
Delft, 2025 Matyas Kollert, Cristian Comendant, Adrian Josan, Matei Ivan Tudor, Cristian Perlog, Andrei Ionescu, and Stefan Stoicescu
The findings highlight important trade-offs in large-scale vulnerability detection. Our Go-based Colly crawler achieved an order of magnitude better performance than browser-based solutions, making it suitable for rapid, large-scale scanning of basic vulnerabilities. However, the high false-positive rate of our Wix detection (97.7%) underscores the limitations of heuristic-based identification methods when dealing with closed-source platforms. The significant presence of Cloudflare protection (31,000 domains in C1M vs. 9,600 in A1M) indicates that modern anti-bot measures are becoming increasingly prevalent, requiring hybrid approaches that combine lightweight HTTP crawlers with more sophisticated browser automation for comprehensive coverage. These findings suggest that while automated vulnerability detection at scale remains feasible, it requires careful framework selection and detection methodology refinement to balance speed, accuracy, and coverage.
7 CONCLUSION
While many third-party services and CDNs claim to secure the internet, the modern web remains a patchwork of technologies, many of which are dangerously out of date. By crawling both the A1M and the C1M datasets with lightweight HTTP requests as well as headless, stealthy browsers, we have demonstrated that vulnerability-focused web crawling can still be performed at scale, on ordinary, everyday hardware. Our Go/Colly crawler managed 16 requests per second, an entire order of magnitude faster than our Camoufox crawler, yet the combination of the two allowed us to get responses from most websites with correct domains within 24 hours.

Across the millions of domains in our datasets, we have detected numerous instances of well-known, high-severity exploits. The WordPress ecosystem exposes several privilege escalation bugs in popular themes, and the majority of websites lack CSP headers. Conversely, some of our results, such as those of the Wix crawler, show that simple keyword-based heuristics produce a large number of false positives.

Our work is not without its limits, however: a 66% failure rate, including 15% inactive domains and 6% incorrect TLS certificates, suggests that better domain handling is the logical next step. Future research should focus on reducing the number of false positives, targeting a wider range of vulnerabilities, or investigating AI-driven adaptive crawling strategies to avoid anti-bot detection. In addition, a more efficient multi-threading solution could enable large-scale crawls using browser-based frameworks.
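The bounded-concurrency structure such crawls rely on can be prototyped with a worker-limiting semaphore. The sketch below is stdlib-only and takes an injected fetch function rather than issuing real HTTP requests, so it illustrates the concurrency pattern, not the actual Colly or Camoufox crawlers:

```go
package main

import (
	"fmt"
	"sync"
)

// fetchAll visits domains with at most maxConc concurrent workers; fetch
// is injected so the sketch stays self-contained (a real crawler would
// perform an HTTP GET and parse the response here).
func fetchAll(domains []string, maxConc int, fetch func(string) string) []string {
	sem := make(chan struct{}, maxConc) // bounds in-flight requests
	results := make([]string, len(domains))
	var wg sync.WaitGroup
	for i, d := range domains {
		wg.Add(1)
		go func(i int, d string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			results[i] = fetch(d)    // each goroutine writes its own index
		}(i, d)
	}
	wg.Wait()
	return results
}

func main() {
	out := fetchAll([]string{"a.com", "b.com"}, 2, func(d string) string { return "ok:" + d })
	fmt.Println(out) // [ok:a.com ok:b.com]
}
```

Tuning maxConc trades throughput against the rate-limiting and anti-bot triggers discussed earlier.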
A APPENDIX
Figure 2: Default Apache 404 HTTP response page showing version (e-jp.cmcd1.com)
Figure 3: Default Apache Tomcat 404 HTTP response page showing version (changelogs.ubuntu.com)
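Figures 2 and 3 show the version strings that default error pages leak. Extracting them is a simple pattern match; a stdlib-only Go sketch, assuming the stock "Apache/x.y.z" footer format shown in the figures:

```go
package main

import (
	"fmt"
	"regexp"
)

// Matches version strings like "Apache/2.4.41" or "Apache Tomcat/9.0.16"
// as they appear in default error-page footers and Server headers.
var apacheVersionRe = regexp.MustCompile(`Apache(?: Tomcat)?/(\d+(?:\.\d+)+)`)

// extractApacheVersion returns the version number, or "" if none is found.
func extractApacheVersion(body string) string {
	m := apacheVersionRe.FindStringSubmatch(body)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	page := `<address>Apache/2.4.41 (Ubuntu) Server at example.com Port 80</address>`
	fmt.Println(extractApacheVersion(page)) // prints 2.4.41
}
```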
B APACHE-RELATED CVES
The following CVEs were considered during the vulnerability analysis of Apache web servers:
• https://nvd.nist.gov/vuln/detail/CVE-2021-42013
• https://nvd.nist.gov/vuln/detail/CVE-2021-41773
• https://nvd.nist.gov/vuln/detail/CVE-2019-10098
• https://nvd.nist.gov/vuln/detail/CVE-2019-10092
• https://nvd.nist.gov/vuln/detail/CVE-2016-8740
• https://nvd.nist.gov/vuln/detail/CVE-2008-0455
• https://nvd.nist.gov/vuln/detail/CVE-2007-6203
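Of the CVEs above, the two 2021 path-traversal/RCE issues can be flagged from the version string alone: CVE-2021-41773 affects httpd 2.4.49, and CVE-2021-42013 (the bypass of its incomplete fix) affects 2.4.49 and 2.4.50. The remaining CVEs depend on specific modules or configurations and cannot be confirmed from a version number. A minimal sketch of that version-to-CVE lookup:

```go
package main

import "fmt"

// Exact-version matches for the two 2021 path-traversal CVEs; the other
// CVEs listed above require module/configuration knowledge to confirm.
var apacheCVEs = map[string][]string{
	"2.4.49": {"CVE-2021-41773", "CVE-2021-42013"},
	"2.4.50": {"CVE-2021-42013"},
}

// cvesForVersion returns the CVEs matching an exact Apache httpd version.
func cvesForVersion(version string) []string {
	return apacheCVEs[version]
}

func main() {
	fmt.Println(cvesForVersion("2.4.49")) // [CVE-2021-41773 CVE-2021-42013]
	fmt.Println(cvesForVersion("2.4.57")) // [] (not in the lookup table)
}
```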