0% found this document useful (0 votes)

8 views9 pages

Crawling The Web Final

Uploaded by

suzuvany

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

8 views9 pages

Crawling The Web Final

Uploaded by

suzuvany

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 9

Crawling the Web

Matyas Kollert Cristian Comendant Adrian Josan

[email protected] [email protected] [email protected]
TU Delft TU Delft TU Delft
Delft, Netherlands Delft, Netherlands Delft, Netherlands

Matei Ivan Tudor Cristian Perlog Andrei Ionescu

[email protected] [email protected] [email protected]
TU Delft TU Delft TU Delft
Delft, Netherlands Delft, Netherlands Delft, Netherlands

Stefan Stoicescu
[email protected]
TU Delft
Delft, Netherlands
ABSTRACT websites by crawling the top one million domains. Our goal is to
This paper presents a large-scale vulnerability analysis of the top identify which known vulnerabilities are still exposed in the wild,
one million domains, comparing results from legacy Alexa and mod- and how automated detection frameworks can scale to meet this
ern Cloudflare rankings. We develop specialized crawling method- task efficiently.
ologies to identify vulnerable software components, focusing on Anti-bot protections have become a critical component of mod-
Apache, WordPress, Wix, CSP configurations, and XSS risks, while ern web security, as websites increasingly face automated attacks
overcoming anti-bot protections like Cloudflare. Our analysis shows and web scraping attempts. These protections, often implemented
3.5% of domains run detectable Apache servers, with 0.3% using by services like Cloudflare, aim to distinguish between legitimate
vulnerable versions, while WordPress exhibits 31 and 8 confirmed human traffic and malicious bots by using techniques such as
plugin/theme vulnerabilities in Cloudflare and Alexa datasets, re- CAPTCHA challenges, JavaScript challenges, and IP reputation
spectively. CSP adoption remains inconsistent. Performance tests analysis. While these measures help mitigate the risks of data scrap-
show that Go browserless crawlers outperform Python browser- ing, brute force attacks, and bot-driven exploitation, they also pose
based crawlers in speed, though Cloudflare still blocks some do- significant challenges for web crawlers focused on vulnerability
mains even when using browser automation. While anti-bot sys- detection. Anti-bot systems can block or slow down crawlers, mak-
tems and false positives present ongoing challenges, our hybrid ing it difficult to conduct large-scale scans efficiently. As a result,
lightweight/browser-based approach proves large-scale vulnerabil- researchers must adapt their crawling strategies, using advanced
ity detection remains practical. tools like stealthy browsers or headless automation to bypass these
defenses, all while balancing performance and accuracy. Under-
standing the intricacies of anti-bot protections is essential for im-
1 INTRODUCTION proving the effectiveness of vulnerability detection across the web.
The modern web is a dynamic, layered environment built on a To address these goals, we focus on several key questions: How
complex mix of open-source frameworks, third-party services, and can we conduct a vulnerability-focused web crawl across one mil-
custom integrations. While these technologies enable rapid devel- lion domains in under 24 hours to identify critical weaknesses in
opment and widespread access, they also introduce a persistent widely used websites, and how do modern protections like Cloud-
risk: outdated or vulnerable components left exposed to the public. flare impact the crawling process? How do various crawling frame-
For attackers, popular websites represent high-value targets—and works perform when detecting vulnerabilities? How do different
a single overlooked update can serve as a gateway for widespread frameworks handle modern anti-bot protections, and which web
exploitation. vulnerabilities can be identified through crawling alone?
Understanding the real-world security posture of the web re- This paper makes the following contributions:
quires more than just studying theoretical vulnerabilities; it also
necessitates an examination of practical threats. It demands direct • It presents a methodology for conducting large-scale web
observation of what is deployed on the internet. While internet- crawls focused on identifying critical vulnerabilities in pop-
wide port scans [13, 14] have long been used to identify open ser- ular websites while addressing challenges posed by modern
vices, relatively fewer studies have focused on crawling the appli- protections like Cloudflare.
cation layer of the web at scale-especially with an emphasis on • It evaluates the effectiveness of different crawling approaches
detecting software versions and misconfigurations tied to known for vulnerability detection, emphasizing scalability, efficiency,
vulnerabilities. and handling websites with anti-bot measures.
In this paper, we aim to assess the prevalence of outdated or • It identifies and analyzes common web vulnerabilities-such
vulnerable software components among the world’s most popular as those found in WordPress, Wix, and Apache, as well as
Delft, 2025 Matyas Kollert, Cristian Comendant, Adrian Josan, Matei Ivan Tudor, Cristian Perlog, Andrei Ionescu, and Stefan Stoicescu

security issues like Cross-Site Scripting (XSS) and Content 3 DATASETS

Security Policy (CSP) misconfigurations on some of the most To find the top one million domains to crawl, we searched the
visited websites. internet for the most reputable lists of popular domains. We initially
used one of the most well-known datasets: the top one million
2 BACKGROUND AND RELATED WORK domains list from the now-retired Alexa.com service1 . The domains
Large-scale web crawling has been important in understanding the were sorted from most to least popular. This was the dataset we used
security landscape of the web. Prior efforts, such as SHODAN and for our main crawls. However, after learning that it had not been
Censys, pioneered internet-wide scans by targeting network-layer updated since 2022, we decided to experiment with more up-to-date
vulnerabilities (e.g., open ports, SSL misconfigurations). These tools sources. We chose the "Domain Rankings" service from Cloudflare
lack granularity in analyzing the application layer, where most Radar2 , which is updated daily and provided by Cloudflare, a major
web-based exploits occur. internet services provider and a reputable source of internet-related
data. In this list, the domains were sorted in alphabetical order.
2.1 Evolution of Web Crawling for For both of these datasets, we also crawled subsets of the most
Vulnerabilities/Security popular 100 000 and 10 000 domains. This was done to quantify
how the vulnerability distribution changes with the popularity of
• Static Crawlers Earlier studies relied on lightweight crawlers
the websites.
(e.g., wget, Scrapy) to index pages and extract metadata.
While efficient, these tools struggled with JavaScript-heavy
4 METHODOLOGY
sites and failed to detect dynamically loaded vulnerabilities
such as DOM-based XSS [17]. To assess the security of the modern web at scale, our methodology
• Headless Browsers: The rise of frameworks like PhantomJS is grounded in the analysis of two distinct, large-scale datasets: the
and Selenium enabled interaction with modern web applica- now-deprecated Alexa Top 1 Million (A1M) list and its successor,
tions, simulating user behavior to uncover client-side flaws. the Cloudflare Radar Top 1 Million rankings (C1M) [18]. Using
Research by [16] demonstrated effectiveness in detecting these lists as our targets, we implemented a large-scale crawling
stored XSS and CSP bypasses, but at a high computational operation employing several web crawling frameworks to evaluate
cost. their performance, scalability, and resilience against anti-bot protec-
tions. For the analysis phase, we developed specific detection logic
2.2 Challenges in Modern Crawling to fingerprint key technologies (such as WordPress, and Apache)
and identify known vulnerabilities and misconfigurations within
The modern vulnerability landscape is characterized by a complex them. The following subsections provide a detailed breakdown of
ecosystem of discovery, disclosure, and remediation processes. The the crawling frameworks evaluated, the specific vulnerability cate-
Common Vulnerabilities and Exposures (CVE) system provides gories targeted, and our methods for handling defensive measures
standardized identification for security vulnerabilities, but the time like Cloudflare.
lag between vulnerability discovery and widespread patching cre-
ates windows of opportunity for attackers. Research has shown 4.1 Frameworks
that many organizations fail to apply security updates promptly,
To optimize our web crawling process, we evaluated several browser-
leaving known vulnerabilities exposed for extended periods.
based and browser-less crawling frameworks, ultimately selecting
four that aligned best with the needs of our research. These frame-
2.3 Cryptojacking Campaigns
works were assessed on criteria such as resource efficiency, through-
The study in [1] provides an analysis of cryptojacking campaigns at put, scalability, and compatibility with anti-detection mechanisms.
an internet scale, revealing how cybercriminals exploit third-party Each framework was chosen for its ability to handle different as-
software like WordPress to deploy browser-based miners across pects of web crawling, from high-throughput HTTP-based requests
thousands of domains. The work highlights the prevalence of cryp- to complex browser automation, ensuring we could efficiently cover
tojacking, particularly on adult content websites, and demonstrates the vast scope of domains we aimed to process. Below, we detail
the evolution of mining applications. The research shows the need the strengths and weaknesses of the selected frameworks and their
for large-scale crawling to detect coordinated threats, aligning with specific roles in our crawling strategy.
our scopes in identifying outdated and vulnerable web components Colly is a fast and elegant web scraping framework written
through systematic internet-wide scans. in Go, designed for building efficient web crawlers with minimal
resource overhead. For our scope of research, Colly proved to be
2.4 Key Advances in This Work the optimal choice for HTTP-based high-throughput crawling oper-
Our research builds on the foundations by: ations. Colly operates on a lightweight callback-driven architecture
• Systematically comparing crawling frameworks for vulnera- that performs bare HTTP requests without the overhead of run-
bility detection. ning a full browser instance. This makes it more resource-efficient
• Introducing a lightweight method to fingerprint Cloudflare- compared to browser-based crawling frameworks. The framework
protected sites. leverages Go’s advantages, such as static typing, proxy support,
• Quantifying the trade-offs between coverage (A1M vs. C1M 1 https://github.com/mozilla/cipherscan/tree/master/top1m

datasets) and precision. 2 https://radar.cloudflare.com/domains

Crawling the Web Delft, 2025

granular error handling, and concurrent processing. This, along balancer or reverse proxy, routing requests to backend servers
with the ease of use were the reasons for choosing this particular that may run other web server software.
framework for the majority of the crawls. We consulted The Exploit Database4 to identify vulnerabilities
Crawlee is a modern JavaScript web scraping and browser au- in these servers. Our research revealed:
tomation library designed for building reliable crawlers. It provides • nginx has had no verified exploits since 2013, indicating
built-in support for both HTTP requests and headless browser relative safety.
automation, offering automatic retry mechanisms, request queue • Apache had severe Remote Code Execution (RCE) vul-
management, and sophisticated anti-detection features. The frame- nerabilities in 2021 (see Table 1).
work supports multiple underlying engines including Playwright, • Apache Tomcat (a Java Servlet implementation) had seven
Puppeteer, and Cheerio. vulnerabilities since 2020, including a 2025 RCE exploit,
Despite Crawlee’s advanced feature set and reputation for han- though most remain unverified.
dling modern web applications, our implementation faced signifi-
cant performance limitations. Even after stripping away all stealth Given Apache and Tomcat’s higher vulnerability profile, we
configurations and reducing the framework to its most barebones focused on detecting them using these techniques:
setup, we could not achieve more than 1 request every 2 sec- (1) Server Header Analysis
onds. This throughput proved insufficient for our large-scale crawl- The HTTP response’s Server header often discloses the
ing objectives, where we needed to process one million domains server name and version (Figure 1). Administrators frequently
within 24 hours. The performance bottleneck appeared to stem suppress this information to avoid exposing vulnerable ver-
from Crawlee’s extensive built-in safeguards and overhead, which, sions.
while beneficial for avoiding detection, severely limited scalability. (2) 404 Error Page Inspection
Playwright is a cross-browser automation toolkit that drives If the header is hidden, the crawler requests a fake URL to
Chromium, Firefox, and WebKit with a unified API. Unlike Colly’s trigger a 404 response. Default installations reveal server
basic HTTP requests, Playwright uses a real browser engine, exe- versions in error pages (Figure 2, Figure 3).
cutes JavaScript, and exposes granular hooks for request intercep- (3) X-Powered-By Header Check
tion, screenshotting, and network-level manipulation. Additionally, As a fallback, the X-Powered-By header may indicate a Servlet-
it ships with automatic “smart waits” and retry logic that dramat- based server (e.g., Tomcat), though this isn’t definitive.
ically reduces DOM-synchronization bugs common in Puppeteer
WordPress is one of the most widely used content management
scripts. Playwright is reserved for crawling activities that require
systems (CMS) on the internet, powering over 43% of all websites
realistic client-side interaction with websites, such as XSS vulnera-
[2]. This massive adoption makes WordPress a prime target for
bility detection. While Playwright’s respectable performance of 10
attackers, as vulnerabilities in the platform, themes, or plugins can
requests per second (rps) made it our initial choice for a second-pass
lead to widespread exploits. One of the reasons for these vulner-
crawler, Camoufox’s more authentic default user fingerprint turned
abilities is the fact that many of WordPress’s themes and plugins
out to be essential for fooling Cloudflare-equipped websites.
are developed by the community, often by independent develop-
Camoufox is a purpose-built anti-detect browser built for Play-
ers. While this allows for rapid innovation and customization, it
wright: a minimalistic Firefox fork that injects realistic, rotating
also introduces significant risks. Community-developed plugins
fingerprints at the protocol level covering canvas, WebGL, WebRTC,
and themes are not always subject to rigorous security audits or
font metrics, time-zone, and dozens of high-entropy surfaces. Un-
updates, leading to a large number of outdated or vulnerable com-
like stealth plug-ins that patch navigator properties post-launch,
ponents remaining active on websites. This has resulted in many
Camoufox re-compiles the browser so that fingerprint mutations
security breaches where attackers exploit these outdated plugins
originate in native code, making it nearly indistinguishable from a
or themes [20, 21]. As a result, maintaining up-to-date software
human-operated session. However, stealth comes at a cost. Cam-
and properly vetting third-party components remains a critical
oufox is much slower than Playwright, with a throughput of around
challenge for WordPress site administrators.
2 rps when using 16 threads. We therefore deployed Camoufox in a
To detect WordPress websites in a large-scale crawl, we identify
targeted role: domains that returned a 403 Cloudflare error were
key indicators such as WordPress-specific URLs and resources. We
re-queued for a second Camoufox-based crawl.
scan for the presence of common WordPress paths, including those
for themes and plugins stored under /wp-content/themes/ and
4.2 Vulnerability Categories /wp-content/plugins/. This method involves analyzing page ele-
In terms of the vulnerabilities, this research focused on HTTP ments like the href, src, and img attributes, which often contain
Servers, WordPress, CSP configuration, Cross-Site Scripting, and version information or specific paths that can reveal WordPress
Wix. These were chosen for their popularity and amount of known components. The crawler also extracts and logs information such
exploits. as the theme name and the list of plugins found on each page.
HTTP web servers are tools used by virtually all websites on These indicators help us accurately identify WordPress websites
the internet to serve HTTP requests. We analyzed market share and gather relevant information about their configurations and
data3 and found that nginx (33.8%) and Apache (26.0%) are the potential vulnerabilities.
most popular options. Notably, nginx is often deployed as a load
3 https://w3techs.com/technologies/overview/web_server 4 https://www.exploit-db.com/
Delft, 2025 Matyas Kollert, Cristian Comendant, Adrian Josan, Matei Ivan Tudor, Cristian Perlog, Andrei Ionescu, and Stefan Stoicescu

Date Verified Title

2021-11-11 yes Apache HTTP Server 2.4.50 - Remote Code Execution (RCE) (3)
2021-10-25 no Apache HTTP Server 2.4.50 - Remote Code Execution (RCE) (2)
2021-10-13 yes Apache HTTP Server 2.4.50 - Path Traversal & Remote Code Execution (RCE)
2021-10-06 yes Apache HTTP Server 2.4.49 - Path Traversal & Remote Code Execution (RCE)
Table 1: Documented Apache Exploits

In this study, we focus on detecting specific vulnerabilities in pop- • DOM-Based XSS: A client-side only flaw that arises when
ular themes and plugins, particularly those with known CVEs that frontend code reads malicious values and uses them in DOM
are version-specific. As numerous vulnerabilities are discovered sinks where data is executed. Thus, inputs with embedded
each year [20, 21], we selected a few representative ones, with the scripts that are not properly sanitized lead to the scripts
flexibility to easily extend this approach in the future. For example, being executed.
we track vulnerabilities in the XStore theme [8] for versions up to • Reflected XSS: With reflected XSS the attacker embeds a ma-
9.3.5, and the LearnPress plugin [4] for versions up to 4.2.7. We licious script in a request parameter that the server reflects
carefully monitor these vulnerabilities to ensure that our crawler back in its HTTP response. Once the user is taken to the
can detect the affected versions. Additionally, we focus on several crafted URL, the script executes once in that session. The
other high-severity vulnerabilities, such as privilege escalation in script is not stored, however, and thus is only executed when
plugins like SureTriggers [11] and LiteSpeed Cache [7], as well the user is taken to the malicious URL.
as SQL injection flaws in plugins like Depicter [9]. • Stored XSS: Stored XSS poses the greatest threat out of
Content Security Policy (CSP) is a critical web security mech- all three, as in this case, the malicious script is stored on
anism that helps prevent Cross-Site Scripting (XSS) attacks by al- the server, rather than being dropped after execution. This
lowing website operators to specify which sources of content are means that every time a query is sent to the server that
considered trustworthy. When properly implemented, CSP acts as retrieves the stored JavaScript, the malicious code will be
a powerful defense layer that can significantly reduce the impact executed.
of code injection vulnerabilities. However, CSP adoption remains
inconsistent across the web, which is what we wish to explore. In this research, the focus is on DOM XSS. The crawler parses
Our crawling methodology detects whether websites implement the HTML of the website, looking for HTML forms and if any form
CSP and, for those without CSP, identifies alternative XSS preven- is found, it inputs a malicious element with embedded JavaScript
tion mechanisms. We check for CSP protection through both HTTP that also contains a unique token generated by the crawler. The
headers and meta tags in the HTML. When analyzing sites, we look JavaScript that is to be executed is a simple console.log. The form
for XSS risk indicators such as inline scripts, inline event handlers, is then submitted. Then, it monitors the console stream, and the
eval() usage, and cross-origin script loading. For websites lacking moment the unique token appears in the console, it is sure that the
CSP protection, we assess their security posture by comparing these embedded JavaScript was executed inside the page.
risk factors against protective measures like modern frameworks, However, there is an issue with such a crawler: it can accidentally
input validation, and output encoding. This evaluation allows us to cause a stored XSS attack. This can happen because the input that
categorize sites by risk level and pinpoint which popular websites is submitted to the form may also be saved to a server. While
are most vulnerable to XSS attacks due to insufficient protection it is unlikely that the input is unsanitized both in the DOM and
mechanisms. server, and that the input is also saved to the server, it still is a
Cross-Site Scripting (XSS) vulnerabilities allow an attacker possibility that must be taken into account. Thus, this crawler was
to run arbitrary JavaScript in a victim’s browser. Such an attack not deployed at scale and must be used responsibly.
becomes feasible whenever untrusted input travels from an HTTP Wix is a leading cloud-based development platform, offering a
request into a script-interpreted context without strict input saniti- closed-source, integrated environment for website creation. Unlike
zation. The JavaScript is executed in the context of the vulnerable the open-source ecosystem of WordPress, Wix’s architecture cen-
user, thereby gaining the same ability to read cookies, tokens, and tralizes control over the core platform and its applications. This
DOM data as the legitimate application. Many protections against distinction shifts the focus of vulnerability detection from a wide
XSS have been devised, such as input sanitization or runtime data- range of third-party components to the specific, first-party appli-
code separation, and static and dynamic analysis techniques were cations offered by Wix itself. Our methodology for Wix, therefore,
devised for the detection of XSS vulnerabilities. [15][19]. Even with focuses on first identifying sites built on the platform and then
many protection and detection protocols, XSS was the top-ranked enumerating the specific Wix applications they utilize.
vulnerability in the Common Weakness Enumeration in 20245 , over- To identify a website as being built with Wix, our crawler uses
taking memory safety errors. several methods to identify fingerprints. The primary and most reli-
XSS is generally distinguished in three subclasses[15][19]: able indicator is the presence of the X-Wix-Server-Artifact-Id
HTTP response header, which definitively confirms the site is
served by Wix infrastructure. In addition to this server-side check,
5 https://cwe.mitre.org/top25/archive/2024/2024_cwe_top25.html we perform a comprehensive scan of the client-side source code.
Crawling the Web Delft, 2025

This involves searching for Wix-specific patterns and domains CAPTCHA, whereas other Python frameworks often become stuck
across several HTML elements: in an endless loop of repeated CAPTCHA submissions.
• Scripts and Links: We analyze the src and href attributes
of <script> and <link> tags for URLs containing Wix- 5.1 Performance
owned domains, such as wix.com, wixstatic.com, and There is a large difference in performance between the Go and
wixsite.com. Python crawlers. This difference is attributed to a number of factors,
• HTML Attributes and Content: The crawler inspects the regarding the scraping frameworks they use and the languages
entire DOM for keywords like wix, wix-, and _wix within themselves. The Go crawlers were also run on one of TU Delft’s
attributes such as class, id, data-wix, and data-hook. supercomputers, Naxos. However, contrary to expectations, the
• Meta Tags and Inline Scripts: We also parse <meta> tags speed of the crawlers slowed considerably. There is no clear reason
and the content of inline scripts for explicit mentions of as to why this is the case, as the supercomputer is not limited either
“Wix” or its services. by resources or by network bandwidth.
Go is a more efficient programming language than Python. It is
Once a site is identified as a Wix-powered entity, the crawler compiled, rather than interpreted, and uses static typing, instead
proceeds to enumerate the active Wix applications, such as Wix of dynamic typing, both of which reduce runtime overhead. Fur-
Stores, Blog, Bookings, or Events. This is achieved by searching the thermore, Go uses "Goroutines", which are extremely lightweight
page source for application-specific keywords (e.g., wix-bookings, coroutines, managed by Go itself. On the other hand, the Python
wixStore). Our process also includes an attempt to extract version crawlers must make use of the "multiprocessing" package to avoid
information for these applications by matching common version- the Global Interpreter Lock, which restricts execution to a single
ing patterns (e.g., v=, version=, data-version) found in URLs or thread at a time. However, the "multiprocessing" package also adds
element attributes. While finding specific, publicly disclosed vul- memory overheads and requires constant serialization and deseri-
nerabilities in a closed-source platform like Wix is challenging, the alization of data.
above-mentioned enumeration gives a good view of the technolo- Go crawlers make use of the highly optimized Colly frame-
gies used, giving a clear view of current and possible future security work, which performs bare HTTP requests to domains, rather than
gaps. running a browser in the background that accesses them. This al-
ready makes it much faster than scraping frameworks that do run
4.3 Cloudflare Protection a browser, even in headless mode, and leads to much less resource
To detect Cloudflare protection, we monitor HTTP response sta- consumption. The framework still has all the necessary capabili-
tuses and the content of response bodies during the crawl. Specifi- ties of a scraping framework, such as setting a user-agent, header
cally, when a 403 Forbidden status code is encountered, we check the manipulation, asynchronous and parallel scraping, and more.
response body for keywords associated with Cloudflare’s security The Python crawlers use the Camoufox framework. While Camoufox
mechanisms. These keywords include "cloudflare" and "_cf_chl_opt," provides more stealth capabilities, this also comes at a small perfor-
which are indicative of Cloudflare’s challenge page being served. mance cost, being slightly slower than other Python frameworks,
If such keywords are detected, the domain is flagged as being pro- such as Playwright. Camoufox also runs a browser in the back-
tected by Cloudflare. This detection is crucial for identifying which ground for each thread that runs, leading to high memory con-
websites require special handling during the crawl, as Cloudflare sumption, even when the browsers are run in "headless" mode,
often serves a CAPTCHA challenge or JavaScript challenge page which limits performance.
that can block automated crawlers. By counting and tracking these As Table 2 shows, the Go implementations deliver a huge speed-
occurrences, we can efficiently identify Cloudflare-protected sites up compared to Python. As mentioned above, the crawlers slow
and handle them accordingly, using alternative crawling strategies down considerably when run on the Naxos supercomputer. Al-
like browser-based crawlers for further exploration. though Go crawlers have an average speed of ≈16 requests per
second, running them on Naxos drops this number to just 7.
5 RESULTS
The crawlers were run on the Cloudflare top 1 million, 100 thousand, Crawler Requests Per Second
10 thousand, and Alexa 1 million, 100 thousand, and 10 thousand
datasets. In this section, they are referred to as C1M, C100K, C10K, Go WordPress 16.07
A1M, A100k and A10k respectively. Firstly, the Go implementa- Go CSP 15.41
tions of the crawlers were run on those datasets to collect results Go Apache 16
and to determine which web pages could not be accessed due to Python WordPress 1.98
Cloudflare blocking the crawlers. Only about 31000 domains from Python CSP 1.02
C1M and 9600 domains from A1M were found to be protected by Python Apache 1
Cloudflare. The Python crawlers were then run on these protected Go crawlers on Naxos (avg.) ≈7
websites, as they use the Camoufox framework in combination with Table 2: Crawler-performance metrics
a CAPTCHA submission function, which provides a higher success
rate in accessing them than the Go crawlers. Notably, Camoufox
also supports proper redirection to the target page after solving a
Delft, 2025 Matyas Kollert, Cristian Comendant, Adrian Josan, Matei Ivan Tudor, Cristian Perlog, Andrei Ionescu, and Stefan Stoicescu

5.2 Vulnerabilities Found Table 4: Vulnerable WordPress Themes

Apache-related vulnerabilities were observed after performing
Theme C1M C100K A1M A100K
the crawl, with the results presented in Table 3.
xstore [8] 31 (54) 0 (1) 8 (9) 0 (0)
Table 3: Detected Apache & Tomcat servers in Alexa’s and zox-news [12] 1 (50) 0 (1) 0 (11) 0 (0)
listingpro [3] 0 (8) 0 (0) 0 (2) 0 (0)
Cloudflare’s lists

Theme C1M C100K C10K A1M A100K Table 5: Vulnerable WordPress Plugins
Apache 35048 1652 78 6860 343
Apache Tomcat 63 7 0 390 59 Plugin C1M C100K C10K A1M A100K
Vulnerable Versions 3232 127 27 354 15 depicter [9] 14 (110) 1 (2) 0 (2) 3 (27) 0 (2)
litespeed-cache [7] 18 (507) 2 (22) 0 (1) 5 (59) 0 (1)
Among 1 million Cloudflare-scanned domains, about 3.5% have litespeed-cache-2 [6] 15 (507) 0 (22) 0 (1) 2 (59) 0 (1)
detectable Apache servers, while 0.3% show vulnerable versions. learnpress [4] 30 (21) 1 (0) 0 (0) 6 (4) 0 (0)
This ratio declines with domain popularity, with roughly 1.6% de- suretriggers [11] 3 (11) 0 (0) 0 (0) 0 (2) 0 (0)
k-elements [10] 0 (0) 0 (0) 0 (0) 0 (0) 0 (0)
tectable and 0.13% vulnerable in the top 100k Cloudflare domains.
vikbooking [5] 1 (2) 0 (0) 0 (0) 0 (2) 0 (0)
This suggests that more popular domains are less likely to expose
detectable and vulnerable Apache versions, likely due to stronger
security measures or better header masking. Site Identification and False Positive Rate. Our crawler initially
Out of all the detected websites, only 2 domains were running flagged 88 unique domains from the C1M dataset as potentially
Apache version 2.4.49 (one of the most recent versions known to being built with Wix. To validate these findings, a manual inspec-
be vulnerable to Remote Code Execution (RCE))6 . However, for a tion was performed on this set of 88 domains. The analysis re-
successful RCE attack, the administrator must have explicitly dis- vealed that only 2 of these domains (wix.com and its subdomain
abled the default configuration directive Require all denied and editorx.com) were definitively Wix platforms. The remaining 86
enabled CGI scripts for specific paths. Given this specific and un- domains-representing an exceptionally high false positive rate of
common set of conditions, the likelihood of successful exploitation 97.7%-were incorrectly identified.
across all affected servers is relatively low. Nevertheless, if an RCE These false positives can be grouped into several categories:
attack does succeed, it could allow an attacker to gain full control
• Major Technology and Media Companies: High-profile
over a highly valuable asset, especially if the domain is among the
sites such as openai.com, cnn.com, google.com, figma.com,
most popular and heavily visited.
roblox.com, and mailchimp.com were consistently misiden-
Additionally, we detected 17 domains running Apache versions
tified. These platforms operate on their own built infrastruc-
older than 2.2.15, which are known to be vulnerable to the notorious
ture.
Slowloris Denial of Service attack7 .
• Third-Party Widget and Service Providers: A significant
The list of CVEs against which these versions were checked is
number of false positives came from service providers whose
provided in Appendix B.
products integrate with Wix. Domains like elfsight.com,
WordPress vulnerabilities were more frequent than expected.
userway.org, tidio.com, and powr.io were flagged be-
The most prevalent vulnerability was associated with the XStore
cause their marketing pages or documentation contain the
theme [8], which was detected 31 times in the C1M dataset and
keyword “Wix,” not because they are hosted on the platform.
8 times in the A1M dataset. The most frequently found plugin
• Coincidental Keyword Matches: On other large sites, the
vulnerability was tied to LearnPress [4], appearing 30 times in
keyword “wix” appeared coincidentally within JavaScript
the C1M dataset and 6 times in the A1M dataset. Other notable
variable names, CSS classes, or large JSON objects, leading
vulnerabilities included those in the Depicter plugin and both
to incorrect identification.
LiteSpeed Cache vulnerabilities.
In comparison, vulnerabilities were much less frequent in the Plugin and Version Extraction Errors. The high false positive rate
C100k and A100k datasets, with no confirmed vulnerabilities found in site identification led to a complete failure in the subsequent
in the C10k and A10k datasets. enumeration of Wix applications and their versions. The crawler
For a detailed view of the distribution of these vulnerabilities reported finding “Wix Stores” on bestbuy.ca and “Wix Events” on
across different domains, refer to Tables 4 and 5 below. The numbers cnn.com, both of which are incorrect attributions caused by the
in parentheses indicate cases where the version of the theme/plugin presence of generic keywords.
was not publicly available. As a result, while the theme/plugin may Furthermore, the version extraction process produced unusable,
be vulnerable, manual verification would be necessary to confirm noisy data. Instead of version numbers, the crawler extracted code
this. fragments.
Our scan for Wix-based websites and applications yielded signif- Our analysis of Content Security Policy (CSP) implemen-
icant results, primarily highlighting the limitations and high error tation reveals significant variations in adoption rates and secu-
rate of a heuristic-based detection methodology. rity postures. CSP header implementation was highest among the
6 https://www.exploit-db.com/exploits/50383 most popular domains in the C10K dataset at 32.9%, while broader
7 https://en.wikipedia.org/wiki/Slowloris_(cyber_attack) datasets showed lower adoption rates: 28.2% for A1M, 26.3% for
Crawling the Web Delft, 2025

C100K, and 20.1% for C1M. Meta tag-based CSP implementation However, the crawler must be used responsibly. The user requires
remained consistently low across all datasets, ranging from 1.0% authorization to utilize it, as it can possibly be harmful, and should
to 1.7%, indicating that most sites prefer HTTP header-based CSP only use it on websites where the "security.txt" file may be found,
deployment when they implement it at all. The analysis of XSS as it contains contact information for the owner of the website. An
risk indicators shows concerning patterns across all datasets. Sites identifiable user-agent should also be used, as it allows the web-
averaged between 8.44 and 11.95 external scripts per page, with site owner to determine who performed the crawl. The embedded
a substantial portion being cross-origin scripts that increase the JavaScript code must be non-harmful - a simple "console.log" is the
attack surface. Inline scripts and event handlers were prevalent, best choice. However, even if the JavaScript is non-harmful, storing
averaging 8.61-11.95 and 3.79-6.12 per page respectively. For sites it would pollute the web application. Lastly, if a vulnerability is
without CSP protection, our risk assessment categorized 18.9% to detected, then the user must perform a coordinated vulnerability
33.1% as high risk for XSS vulnerabilities, with the C1M dataset disclosure.
showing the highest proportion of vulnerable sites at 33.1%. Despite
the lack of CSP adoption, many sites employed alternative protec-
tion mechanisms, with input validation being the most common 5.3 Cloudflare-Protected Domains
(43.4% to 73.5%) and HttpOnly cookies providing session protection For A1M, around 9600 domains were flagged as protected by Cloud-
for 43.1% to 76.1% of analyzed domains. flare. In C1M it was over 31000 domains. Using the second-pass
crawler, we were able to get a valid response from around 26%
Metric C1M C100K C10K A1M of websites in A1M, and 23% for C1M. This high failure rate can
CSP Implementation mostly be attributed to error code 403 (forbidden), indicating that
CSP Header 20.1% 26.3% 32.9% 28.2%
Meta CSP 1.0% 1.4% 1.7% 1.6% the domains were not indeed to be accessed directly. Accessing
XSS Risk Indicators (avg/URL) most of these domains in a browser results in the same behavior,
Inline Scripts 11.95 11.10 10.81 8.61 suggesting that the stealthiness of Camoufox is not the primary
Inline Handlers 5.53 6.12 4.17 3.79
Inline Styles 6.87 6.96 8.59 4.30 issue.
eval() Usage 0.63 0.62 0.66 0.45 In terms of vulnerabilities, no vulnerable WordPress versions
postMessage 0.16 0.14 0.15 0.12 or themes are found for either crawl, suggesting that Cloudflare-
JSONP Endpoints 0.31 0.25 0.25 0.21
Script Loading (avg/URL) protected domains are more frequently updated. However, the rel-
External Scripts 11.33 11.00 11.20 8.44 atively low numbers of samples used for the second crawl hinder
Cross-origin 7.30 8.29 9.21 5.07
Same-origin 4.03 2.71 1.99 3.37
a more definitive conclusion. The protected domains also showed
Protection Measures much fewer high-risk and medium-risk websites, as classified by the
Modern Frameworks 32.5% 40.4% 51.4% 28.8% CSP crawler, 2.2%, respectively 8.4%, for Alexa’s protected domains,
X-Content-Type 34.6% 40.8% 44.3% 48.4%
Output Encoding 39.6% 40.6% 46.9% 31.5%
and 3.0%, respectively 13.0% for Cloudflare’s protected domains.
Input Validation 73.5% 60.3% 43.4% 45.6% Furthermore, a higher percentage of domains implemented a CSP
Sandboxed iframes 1.1% 1.3% 0.7% 1.1% header as opposed to the unprotected domains: 34.2% for the Alexa
HttpOnly Cookies 43.1% 48.4% 52.2% 76.1%
Sensitive Forms 51.2% 43.2% 29.3% 38.2% domains, and 31.3% for the Cloudflare domains. As for the Apache
XSS Risk (Sites w/o CSP) crawler, no vulnerable versions were detected in the Cloudflare-
High Risk 33.1% 29.1% 24.9% 18.9% protected domains.
Medium Risk 14.2% 14.1% 13.8% 9.4%
Low Risk 12.3% 11.5% 11.6% 12.4%
Minimal Risk 19.7% 18.0% 15.8% 30.2%
Table 6: Comprehensive CSP and XSS Security Analysis 6 DISCUSSION
Our large-scale crawling analysis reveals significant security gaps
in the modern web landscape, particularly regarding outdated soft-
ware components and inadequate XSS protection mechanisms. The
XSS The crawler was not deployed at scale, given the possible
prevalence of vulnerable WordPress themes and plugins, especially
issues it may cause. As such, to test the crawler, three simple, vul-
the XStore theme and LearnPress plugin found across multiple
nerable websites were created. All of them make use of dangerous
datasets, demonstrates that many popular websites continue to
DOM sinks, which end up executing unsanitized user inputs. The
run software with known CVEs. While Apache servers showed
first website makes use of the "document.write" function, the sec-
no instances of the critical 2021 RCE vulnerabilities in our crawl,
ond one uses the "innerHTML" property of an HTML element in
this likely reflects either effective patching practices or the specific
the DOM, and the third one makes use of the "setAttribute" function
configuration requirements needed to exploit these vulnerabilities.
of an HTML element. To detect websites as vulnerable, the crawler
The stark contrast in CSP adoption rates, which ranges from 20.1%
makes use of two different HTML elements as input:
in the C1M dataset to 32.9% in the C10K dataset, suggests that more
• "><img src=x onerror="console.log(’XSS:t’)"> popular websites are more likely to implement modern security
• x" onerror="console.log(’XSS:t’)" headers, yet the majority of sites may remain vulnerable to XSS
Both elements have an error handler that calls the console.log attacks. The same can be said about WordPress vulnerabilities as
function and both elements intentionally trigger an error. The both C100k and A100k contained close to no confirmed vulnerabili-
crawler was able to detect all three websites as vulnerable with the ties and the C10k and A10k datasets not containing even possible
two payloads. vulnerabilities.
Delft, 2025 Matyas Kollert, Cristian Comendant, Adrian Josan, Matei Ivan Tudor, Cristian Perlog, Andrei Ionescu, and Stefan Stoicescu

The findings highlight important trade-offs in large-scale vul- [6] CVE. 2024. WordPress LiteSpeed Cache plugin <= 6.3.0.1 - Unauthenticated
nerability detection. Our Go-based Colly crawler achieved an order Privilege Escalation vulnerability. https://www.cve.org/CVERecord?id=CVE-
2024-28000
of magnitude better performance compared to browser-based so- [7] CVE. 2024. WordPress LiteSpeed Cache plugin < 6.5.0.1 - Unauthenticated Ac-
lutions, making it suitable for rapid, large-scale scanning of basic count Takeover via Cookie Leak vulnerability. https://www.cve.org/CVERecord?
id=CVE-2024-44000
vulnerabilities. However, the high false positive rate in our Wix [8] CVE. 2024. WordPress XStore theme <= 9.3.5 - Unauthenticated SQL Injection
detection (97.7%) underscores the limitations of heuristic-based vulnerability. https://www.cve.org/CVERecord?id=CVE-2024-33559
identification methods when dealing with closed-source platforms. [9] CVE. 2025. Slider Popup Builder by Depicter <= 3.6.1 - Unauthenticated SQL
Injection via ’s’ Parameter. https://www.cve.org/CVERecord?id=CVE-2025-2011
The significant presence of Cloudflare protection (31,000 domains [10] CVE. 2025. WordPress K Elements plugin < 5.4.0 - Unauthenticated Account
in C1M vs. 9,600 in A1M) indicates that modern anti-bot measures Takeover vulnerability. https://www.cve.org/CVERecord?id=CVE-2024-56000
are becoming increasingly prevalent, requiring hybrid approaches [11] CVE. 2025. WordPress SureTriggers <= 1.0.82 - Privilege Escalation Vulnerability.
https://www.cve.org/CVERecord?id=CVE-2025-27007
that combine lightweight HTTP crawlers with more sophisticated [12] CVE. 2025. Zox News <= 3.17.0 - Missing Authorization to Authenticated (Sub-
browser automation for comprehensive coverage. These findings scriber+) Arbitrary Options Modification. https://www.cve.org/CVERecord?id=
CVE-2024-13643
suggest that while automated vulnerability detection at scale re- [13] Zakir Durumeric, Michael Bailey, and J. Alex Halderman. 2014. An Internet-Wide
mains feasible, it requires careful framework selection and detection View of Internet-Wide Scanning. In 23rd USENIX Security Symposium (USENIX
methodology refinement to balance speed, accuracy, and coverage. Security 14). USENIX Association, San Diego, CA, 65–78. https://www.usenix.
org/conference/usenixsecurity14/technical-sessions/presentation/durumeric
[14] Zakir Durumeric, Eric Wustrow, and J. Alex Halderman. 2013. ZMap: Fast
7 CONCLUSION Internet-wide Scanning and Its Security Applications. In 22nd USENIX Security
Symposium (USENIX Security 13). USENIX Association, Washington, D.C., 605–
While many third-party services and CDNs claim to secure the 620. https://www.usenix.org/conference/usenixsecurity13/technical-sessions/
internet, the modern web remains a patchwork of technologies, paper/durumeric
many of which are dangerously out-of-date. By crawling both the [15] Abdelhakim Hannousse, Salima Yahiouche, and Mohamed Cherif Nait-Hamoud.
2022. Twenty-two years since revealing cross-site scripting attacks: A systematic
A1M and the C1M datasets with both lightweight HTTP requests mapping and a comprehensive survey. https://arxiv.org/abs/2205.08425
as well as headless, stealthy browsers, we have demonstrated that [16] Sebastian Lekies, Ben Stock, and Martin Johns. 2013. 25 million flows later:
large-scale detection of DOM-based XSS. In Proceedings of the 2013 ACM SIGSAC
vulnerability-focused web crawling can still be performed at scale, Conference on Computer & Communications Security (Berlin, Germany) (CCS
from normal everyday hardware. Our Go/Colly crawler managed 16 ’13). Association for Computing Machinery, New York, NY, USA, 1193–1204.
r/s, an entire order of magnitude faster than our Camoufox crawler, https://doi.org/10.1145/2508859.2516703
[17] Nick Nikiforakis, Luca Invernizzi, Alexandros Kapravelos, Steven Van Acker,
yet the combination of the two allowed us to get responses from Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. 2012.
most websites that had correct domains within 24 hours. You are what you include: large-scale evaluation of remote javascript inclusions.
Across the millions of domains found in our datasets, we have In Proceedings of the 2012 ACM Conference on Computer and Communications
Security (Raleigh, North Carolina, USA) (CCS ’12). Association for Computing Ma-
detected numerous instances of well-known, high-severity exploits. chinery, New York, NY, USA, 736–747. https://doi.org/10.1145/2382196.2382274
The WordPress ecosystem exposes several privilege escalation bugs [18] Peter Dave Hello. 2024. Top 1 Million Domains. https://github.com/
PeterDaveHello/top-1m-domains. Accessed: YYYY-MM-DD.
in popular themes, and the majority of websites lack CSP headers. [19] Germán E. Rodríguez, Jenny G. Torres, Pamela Flores, and Diego E. Benavides.
Conversely, some of our results, such as the Wix crawler, show 2020. Cross-site scripting (XSS) attacks and mitigation: A survey. Computer
that simple keyword-based heuristics produce a large number of Networks 166 (2020), 106960. https://doi.org/10.1016/j.comnet.2019.106960
[20] WPScan.com. 2025. WordPress Plugin Vulnerabilities. Retrieved June 23, 2025
false-positive results. from https://wpscan.com/plugins/
Our work is not without its limits, however, as a 66% failure rate, [21] WPScan.com. 2025. WordPress Theme Vulnerabilities. Retrieved June 23, 2025
including 15% inactive domains and 6% incorrect TLS certificates, from https://wpscan.com/themes/

suggests that better domain handling is the logical next step. Fu-
ture research should focus on trying to reduce the number of false
positive results, targeting a wider range of vulnerabilities, or even
investigating AI-driven adaptable crawling strategies to avoid anti-
bot detection. In addition, creating a more efficient multi-threading
solution could enable large-scale crawls using browser-based frame-
works.

REFERENCES
[1] Hugo L.J. Bijmans, Tim M. Booij, and Christian Doerr. 2019. Inadvertently Making
Cyber Criminals Rich: A Comprehensive Study of Cryptojacking Campaigns at
Internet Scale. In 28th USENIX Security Symposium (USENIX Security 19). USENIX
Association, Santa Clara, CA, 1627–1644. https://www.usenix.org/conference/
usenixsecurity19/presentation/bijmans
[2] Pavel Ciorici. 2025. How Many Websites Use WordPress in 2025? WordPress Sta-
tistics. Retrieved June 23, 2025 from https://www.wpzoom.com/blog/wordpress-
statistics/#:~:text=WordPress%20powers%2043.5%25%20of%20all,over%2030%
2C000%20WordPress%20themes%20available
[3] CVE. 2023. CVE-2020-36719 - ListingPro Arbitrary Plugin Installation. https:
//www.cve.org/CVERecord?id=CVE-2020-36719
[4] CVE. 2024. LearnPress – WordPress LMS Plugin <= 4.2.7 - Unauthenticated SQL
Injection via ’c_only_fields’. https://www.cve.org/CVERecord?id=CVE-2024-
8522
[5] CVE. 2024. VikBooking < 1.6.8 - Broken Access Control. https://www.cve.org/
CVERecord?id=CVE-2024-2749
Crawling the Web Delft, 2025

A APPENDIX

Figure 1: HTTP Server header showing Apache version

Figure 2: Default Apache 404 HTTP response page showing version (e-jp.cmcd1.com )

Figure 3: Default Apache Tomcat 404 HTTP response page showing version (changelogs.ubuntu.com)

B APACHE-RELATED CVES
The following CVEs were considered during the vulnerability analysis of Apache web servers:
• https://nvd.nist.gov/vuln/detail/CVE-2021-42013
• https://nvd.nist.gov/vuln/detail/CVE-2021-41773
• https://nvd.nist.gov/vuln/detail/CVE-2019-10098
• https://nvd.nist.gov/vuln/detail/CVE-2019-10092
• https://nvd.nist.gov/vuln/detail/CVE-2016-8740
• https://nvd.nist.gov/vuln/detail/CVE-2008-0455
• https://nvd.nist.gov/vuln/detail/CVE-2007-6203

Vulnerability Assessment and Penetration Testing
100% (1)
Vulnerability Assessment and Penetration Testing
19 pages
Bug Hunters Methodology Live Day Two App Analysis Master
No ratings yet
Bug Hunters Methodology Live Day Two App Analysis Master
102 pages
Automatic Bug Bounty
100% (2)
Automatic Bug Bounty
80 pages
Vintage Games 2.0
100% (10)
Vintage Games 2.0
375 pages
Geartrax For Solidworks 2016 471instmankl PDF
No ratings yet
Geartrax For Solidworks 2016 471instmankl PDF
2 pages
Practical Linear Algebra A Geometry Toolbox
100% (1)
Practical Linear Algebra A Geometry Toolbox
506 pages
NCM 110 Midterm
No ratings yet
NCM 110 Midterm
14 pages
CS GTU Study Material Presentations Unit-3 26112020052923AM
No ratings yet
CS GTU Study Material Presentations Unit-3 26112020052923AM
37 pages
What Is Normalization in DBMS (SQL) - 1NF, 2NF, 3NF, BCNF Database With Example
No ratings yet
What Is Normalization in DBMS (SQL) - 1NF, 2NF, 3NF, BCNF Database With Example
8 pages
Web App Penetration Testing Tools Comparison
No ratings yet
Web App Penetration Testing Tools Comparison
50 pages
Web Crawler Toolkit for Developers
No ratings yet
Web Crawler Toolkit for Developers
6 pages
Rajab 2011a PDF
No ratings yet
Rajab 2011a PDF
12 pages
An Empirical Analysis Reflected Server XSS Exploitation Techniques
No ratings yet
An Empirical Analysis Reflected Server XSS Exploitation Techniques
14 pages
Ethical Hacking: Web Vulnerability Tools
No ratings yet
Ethical Hacking: Web Vulnerability Tools
5 pages
Gigabyte Ga-Q77m-D2h Rev 1.01
No ratings yet
Gigabyte Ga-Q77m-D2h Rev 1.01
32 pages
Web Security for Developers
No ratings yet
Web Security for Developers
5 pages
Google VRP Bug Hunting Guide
No ratings yet
Google VRP Bug Hunting Guide
54 pages
Black Widow: Blackbox Data-Driven Web Scanning
No ratings yet
Black Widow: Blackbox Data-Driven Web Scanning
16 pages
Cybersecurity for IT Students
No ratings yet
Cybersecurity for IT Students
65 pages
Local Media4517619182949719879
No ratings yet
Local Media4517619182949719879
14 pages
MulesoftDevLead-Resume
No ratings yet
MulesoftDevLead-Resume
11 pages
Catalog-SHANGHAI M2C SMART DEVICE CO.,LTD
100% (1)
Catalog-SHANGHAI M2C SMART DEVICE CO.,LTD
51 pages
Eclipse Shortcuts
No ratings yet
Eclipse Shortcuts
1 page
158 511 1 PB Exploring Effectiveness Web Crawlers in Detecting Security
No ratings yet
158 511 1 PB Exploring Effectiveness Web Crawlers in Detecting Security
10 pages
2019 KY Toward A Deep Learning
No ratings yet
2019 KY Toward A Deep Learning
8 pages
Week 2 - Enumerating An Application
No ratings yet
Week 2 - Enumerating An Application
21 pages
Paper Submission Example From Peergrade
No ratings yet
Paper Submission Example From Peergrade
6 pages
QUANTITATIVE ANALYSES OF SOFTWARE VULNERABILITIES-Joh - Colostate - 0053A - 10768
No ratings yet
QUANTITATIVE ANALYSES OF SOFTWARE VULNERABILITIES-Joh - Colostate - 0053A - 10768
224 pages
Research Project Arduino Powered Automatic Fluid Dispenser
No ratings yet
Research Project Arduino Powered Automatic Fluid Dispenser
36 pages
An Analysis of Vulnerability Scanners in Web Applications For VAPT
No ratings yet
An Analysis of Vulnerability Scanners in Web Applications For VAPT
6 pages
Delete interDB Macro Records - Design Module - Modules - AVEVA Software Community
No ratings yet
Delete interDB Macro Records - Design Module - Modules - AVEVA Software Community
2 pages
Baseintrorpyecto
No ratings yet
Baseintrorpyecto
2 pages
10 AJSE+18 (1) +2024+pp+1-21+Detection+of+Malicious+Websites
No ratings yet
10 AJSE+18 (1) +2024+pp+1-21+Detection+of+Malicious+Websites
21 pages
Tutorial 03 Latch FF State Machines 1
No ratings yet
Tutorial 03 Latch FF State Machines 1
81 pages
Functional Testing Techniques
No ratings yet
Functional Testing Techniques
42 pages
Detecting Victim System in Client and Client Networks
No ratings yet
Detecting Victim System in Client and Client Networks
4 pages
Org Baldurs Gate II Shadow of Amn Quick Reference Card
No ratings yet
Org Baldurs Gate II Shadow of Amn Quick Reference Card
6 pages
FYP Report VulnScan
No ratings yet
FYP Report VulnScan
73 pages
Using Pentestingtoolstoscanwebdomains
No ratings yet
Using Pentestingtoolstoscanwebdomains
16 pages
A Machine Learning Approach For Web Application Vulnerability Detection Using Random Forest
No ratings yet
A Machine Learning Approach For Web Application Vulnerability Detection Using Random Forest
9 pages
Cyber Security: Threats, Vulnerabilities and Countermeasures - A Perspective On The State of Affairs in Mauritius
No ratings yet
Cyber Security: Threats, Vulnerabilities and Countermeasures - A Perspective On The State of Affairs in Mauritius
15 pages
60 Website Vulnerability Scanning System Using Python PY060
No ratings yet
60 Website Vulnerability Scanning System Using Python PY060
7 pages
Security Tool for Product Teams
No ratings yet
Security Tool for Product Teams
14 pages
RP Web Vulnerability
No ratings yet
RP Web Vulnerability
6 pages
Python Web Security Scanner
No ratings yet
Python Web Security Scanner
33 pages
Bug Bounty For Beginners
No ratings yet
Bug Bounty For Beginners
24 pages
SQL Injection
No ratings yet
SQL Injection
5 pages
Efficiency and Effectiveness Ofweb Application Vulnerability Detection Approaches - A Review
No ratings yet
Efficiency and Effectiveness Ofweb Application Vulnerability Detection Approaches - A Review
36 pages
WPScan-WordPress Pentesting Framework
No ratings yet
WPScan-WordPress Pentesting Framework
27 pages
Kavin PPT Phase 1
No ratings yet
Kavin PPT Phase 1
24 pages
2arduino Intro
No ratings yet
2arduino Intro
23 pages
Web HackingV2
No ratings yet
Web HackingV2
116 pages
Calculation Method of Spectrum Requirement For IMT-2020 eMBB and URLLC With Puncturing Based On M G 1 Priority Queuing Model
No ratings yet
Calculation Method of Spectrum Requirement For IMT-2020 eMBB and URLLC With Puncturing Based On M G 1 Priority Queuing Model
14 pages
Wallpaper For Phones - Google Search
No ratings yet
Wallpaper For Phones - Google Search
1 page
Cause 09042025
No ratings yet
Cause 09042025
814 pages
VL2023240503483 Pe003
No ratings yet
VL2023240503483 Pe003
13 pages
Web Vulnerability Detection Analyzer Based On Pyth
No ratings yet
Web Vulnerability Detection Analyzer Based On Pyth
17 pages
Final - Year Project Report
No ratings yet
Final - Year Project Report
81 pages
Fin Irjmets1711370701-1
No ratings yet
Fin Irjmets1711370701-1
4 pages
Evaluation PHP
No ratings yet
Evaluation PHP
1 page
Com1 IpPbx: Innovative Indian IP Switch
100% (1)
Com1 IpPbx: Innovative Indian IP Switch
2 pages
Client Copy Procedure
No ratings yet
Client Copy Procedure
14 pages
Pronto Xi Help 750.2 - Item Creation Request Function
No ratings yet
Pronto Xi Help 750.2 - Item Creation Request Function
3 pages
Mid Term
No ratings yet
Mid Term
13 pages
Bug Hunting
No ratings yet
Bug Hunting
119 pages
Webscanner Research Paper Draft 250203 192018
No ratings yet
Webscanner Research Paper Draft 250203 192018
4 pages
Network Model
No ratings yet
Network Model
25 pages
Speech Emotion Analysis System
No ratings yet
Speech Emotion Analysis System
10 pages
Web Application Vulnerability Prediction Using Hybrid Program Ana
No ratings yet
Web Application Vulnerability Prediction Using Hybrid Program Ana
21 pages
Cours de Béton Armé Selon L'eurocode PDF
No ratings yet
Cours de Béton Armé Selon L'eurocode PDF
1 page
Penetration Testing Report
No ratings yet
Penetration Testing Report
15 pages
Geometric Sequences (Using Standard Formulae) - Lesson3
No ratings yet
Geometric Sequences (Using Standard Formulae) - Lesson3
15 pages
CS Unit 3
No ratings yet
CS Unit 3
35 pages
Bots 2025
No ratings yet
Bots 2025
21 pages
Short Answer 1
No ratings yet
Short Answer 1
3 pages
Walter Rudametkin 2020 - Studying The Resilience of Browser Fingerprinting To Block Crawlers
No ratings yet
Walter Rudametkin 2020 - Studying The Resilience of Browser Fingerprinting To Block Crawlers
13 pages
Detecting Malicious Urls:trends, Challenges and The Role of Browser Extensions
No ratings yet
Detecting Malicious Urls:trends, Challenges and The Role of Browser Extensions
11 pages
Vulnerability Scanning Part1
No ratings yet
Vulnerability Scanning Part1
32 pages
Detecting Vulnerabilities in Website Using Multisc
No ratings yet
Detecting Vulnerabilities in Website Using Multisc
8 pages
Machine Learning For Web Vulnerability Detection: Department of Computer Science & Engineering
No ratings yet
Machine Learning For Web Vulnerability Detection: Department of Computer Science & Engineering
30 pages
The Institute of Management Sciences, Lahore: Assignment #1
No ratings yet
The Institute of Management Sciences, Lahore: Assignment #1
3 pages
Advance Ethical Hacking Bug Bounty Hunting & Penetration Testing PDF
No ratings yet
Advance Ethical Hacking Bug Bounty Hunting & Penetration Testing PDF
17 pages
Classifying Dark Web Executables Using Public Malware Tools
No ratings yet
Classifying Dark Web Executables Using Public Malware Tools
10 pages
Bug Bounty Guide
No ratings yet
Bug Bounty Guide
10 pages
A3 Project Expo
No ratings yet
A3 Project Expo
1 page
Major Final Report
No ratings yet
Major Final Report
15 pages
Web Vulnerability Scanner Identifying and Reporting Website Security Flaws
No ratings yet
Web Vulnerability Scanner Identifying and Reporting Website Security Flaws
10 pages

Crawling The Web Final

Uploaded by

Crawling The Web Final

Uploaded by

Crawling the Web

Matyas Kollert Cristian Comendant Adrian Josan

Matei Ivan Tudor Cristian Perlog Andrei Ionescu

security issues like Cross-Site Scripting (XSS) and Content 3 DATASETS

datasets) and precision. 2 https://radar.cloudflare.com/domains

Date Verified Title

5.2 Vulnerabilities Found Table 4: Vulnerable WordPress Themes

Figure 1: HTTP Server header showing Apache version

You might also like