2023 11th International Conference on Internet of Everything, Microwave Engineering, Communication and Networks (IEMECON) | 978-1-6654-7512-9/23/$31.00 ©2023 IEEE | DOI: 10.1109/IEMECON56962.2023.10092327
Solution to Web Scraping
Chandan Biswas
Master of Computer Application
University of Engineering and Management, Jaipur
Jaipur, India

Rahul Mallick
Master of Computer Application
University of Engineering and Management, Jaipur
Jaipur, India

Subrata Paul
Master of Computer Application
University of Engineering and Management, Jaipur
Jaipur, India

Prof. Dipta Mukherjee
Dept of Computer Science & Engineering
University of Engineering and Management, Jaipur
Jaipur, India
[email protected] [email protected] [email protected] [email protected]

Abstract - Scraping is the retrieval of data from a website, frequently automatically and without the owner's consent. The scraper can then use this information however it sees fit. The practice is considered illegal, but changes in its legal status have not stopped others from following suit. Anti-scraping tools, however, are often not used. Anti-scraping solutions are offered as fairly expensive services that are effective but slow. This paper offers suggestions for reducing these impediments through the development of an anti-scraping system, delivered as a product, for small- to medium-sized websites.

Keywords - Scraping, Web scraping, Anti-web scraping.

I. INTRODUCTION
Website scraping is the practice of periodically obtaining data from a certain website (at intervals that might be regular, completely arbitrary, or somewhat arbitrary within a range), and it is typically carried out by robots. Human participation is often restricted to choosing the time interval and, sporadically, creating countermeasures to bot detection and web analysis methods, in the event that the target is concerned about security [1].
Medium-sized and smaller websites are more vulnerable to scraping because of the way they are built: scraping bots are simple to create, while expensive anti-scraping tools are not used. In this field, scraping mostly entails acquiring competitor data in order to increase one's own volume of data and, as a result, the number of customers visiting one's own website. Several court decisions and precedents have been established.
Web scrapers can harm us by stealing important or personal data from our website using web scraping techniques, so we have to block them. We have created an anti-web scraping technique that protects websites from web scrapers.
At this time, when everything is based on the Internet, the possibility of people's important data being stolen is very high. Hackers, third-party organizations, and others are always ready to take data from websites without the owner's permission (that is, to scrape websites) [2]. Anti-web scraping techniques and programs are very helpful for protecting your data from web scrapers.

II. WEB SCRAPING
The unauthorized process of collecting data from web pages is called web scraping.
Web scraping was initially employed by financial analysts to forecast stock market patterns, but other businesses might benefit from this method of data acquisition. Since this is an automated procedure, businesses can easily collect data and concentrate their efforts on data analysis and business strategy development [3].
Anti-web scraping techniques and programs are very helpful for protecting your data from web scrapers [2].
Key Components of Web Scraping
A scraper gathers and extracts data and transforms it into a usable format [3].
Step 1: Send an HTTP request to the server [3].
Step 2: Parse the website code and extract the data [3].
Step 3: Download and save the data [3].
Fig 1 : Web Scraping Steps
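The three steps listed in Section II can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library, not the tooling of any particular scraper: the function names, the sample HTML, and the choice of links as the extracted data are assumptions made for the example. The `fetch` function shows Step 1 but is not called, so the example runs without network access.

```python
import csv
import io
import urllib.request
from html.parser import HTMLParser


def fetch(url: str) -> str:
    """Step 1: send an HTTP GET request and return the response body as text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class LinkExtractor(HTMLParser):
    """Step 2: parse the HTML and collect (href, text) for every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def save_csv(rows, out) -> None:
    """Step 3: save the extracted rows in a structured format (CSV)."""
    writer = csv.writer(out)
    writer.writerow(["href", "text"])
    writer.writerows(rows)


# Hypothetical page content standing in for the result of fetch(url).
SAMPLE_HTML = (
    '<ul><li><a href="/p/1">Item one</a></li>'
    '<li><a href="/p/2">Item two</a></li></ul>'
)

rows = extract_links(SAMPLE_HTML)
buffer = io.StringIO()
save_csv(rows, buffer)
print(buffer.getvalue())
```

Because each step is this simple to automate, a site that serves its data to any anonymous request offers a scraper no resistance, which is the motivation for the countermeasures in Section III.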
Authorized licensed use limited to: NYSS'S Yeshwantrao Chavan College of Engineering. Downloaded on January 21,2025 at 07:42:28 UTC from IEEE Xplore. Restrictions apply.
Fig 2 : How the Attacker Sends the Link to the Victim

III. ANTI-WEB SCRAPING
The technique of protecting websites from web scrapers is called anti-web scraping. Web scrapers always try to steal data from websites using scraper bots, so anti-web scraping should be applied to websites to protect that important data [1].
Web scraping bots are frequently blocked using anti-scraping techniques, which prevent the information they collect from being publicly accessed [2].
We used several techniques to achieve anti-web scraping:
A. Sign-up and login verification
B. Captcha verification
C. OTP verification
D. Honeypot
E. Web Application Firewall

A. SIGN-UP AND LOGIN VERIFICATION
Sign-up is like registration or ID creation. Using those IDs, we can log in and access websites. To sign up (create an ID), you must enter a user ID such as Rahul (your name), a password, and a confirmation of the password, along with other details such as your name, date of birth, email, and contact number [4].
Fig 3 : User Registration
A user must submit their identity information to a system during login to gain access to that website. It is a crucial part of our website security protocols.
A user name and a password are always the first two pieces of information needed for login [5].
A user name, often known as an account name, is used to identify a person specifically. User names may be random, identical to or connected to the real names of users, or both.
A password is also a string, but unlike a user name, it is meant to be kept private and known only by the user and perhaps the system administrator [6].
Fig 4 : User Login

B. CAPTCHA VERIFICATION
Captcha verification is a popular method of protecting data against online scraping. Here, entering the captcha text is required for access to the website. The main drawback of this approach is the inconvenience it causes to normal users who are made to type captchas. As a result, it is most commonly applied to systems where data is accessed sparingly and only in response to specific requests [7].
Captchas can be bypassed using captcha recognition software and human-based solving services. Compared with software, which is acquired for a one-time payment, the human-based alternative is typically more effective; however, in that case a fee is charged for each captcha that is recognized [8].
CAPTCHA is an acronym for Completely Automated Public Turing test to tell Computers and Humans Apart. It is public, automated software that can tell whether a user is a robot or a human [7]. Such an application presents a variety of challenges, including distorted visuals, fill-in-the-blanks, and even equations, that supposedly only human beings can solve.
Fig 5 : Captcha Verification
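The challenge-and-verify flow behind a captcha can be sketched as follows. This is a minimal illustration of the equation-style challenge mentioned above, not the captcha implementation used on the authors' website: the `CaptchaStore` class and its method names are hypothetical, and a plain-text sum is trivial for a bot to solve, so a real deployment would render the challenge as a distorted image or use an established captcha service. The sketch only shows that the expected answer stays on the server and that each challenge can be used once.

```python
import hmac
import secrets


class CaptchaStore:
    """Issues arithmetic challenges and checks answers on the server side."""

    def __init__(self) -> None:
        self._answers = {}  # challenge id -> expected answer

    def issue(self):
        """Create a challenge and return (challenge_id, question_text)."""
        a = secrets.randbelow(9) + 1  # operands in the range 1..9
        b = secrets.randbelow(9) + 1
        challenge_id = secrets.token_urlsafe(16)
        self._answers[challenge_id] = str(a + b)
        return challenge_id, f"What is {a} + {b}?"

    def check(self, challenge_id: str, submitted: str) -> bool:
        """Verify an answer. Each challenge can be checked only once."""
        expected = self._answers.pop(challenge_id, None)
        if expected is None:
            return False  # unknown or already used challenge
        return hmac.compare_digest(expected.encode(), submitted.strip().encode())


store = CaptchaStore()
challenge_id, question = store.issue()
print(question)
```

Removing the challenge on the first check means a bot cannot replay a solved challenge or guess repeatedly against the same one; it has to request, and solve, a new challenge for every attempt.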
C. OTP VERIFICATION
A one-time password (OTP), also called a one-time code, is a string of letters or numbers that is created by a computer program and intended to be used only once for logging in. One-time passwords reduce the possibility of malicious login attempts and, consequently, the possibility of data theft. In OTP verification, the computer-generated string is always sent to your email, mobile number, etc. Only then can the user access websites using that OTP [9].
When a person (user or visitor) wants to access the website, the person has to go through OTP verification. A newly generated string is always sent to the person's email, mobile number, etc. Only then can the user access the website using that OTP [10].
Every OTP has a validity period, and the person has to use the OTP within that period. If the OTP expires, the person cannot log in to or access the website using the expired OTP. To try again, the person has to request that the website send a new OTP (Resend OTP) [10].
The person will then get a new OTP to log in to or access the website.
Fig 6 : OTP Verification

D. HONEYPOT
A honeypot is a security measure that sets up virtual traps to entice intruders. A computer system is purposefully left with flaws that attackers can take advantage of, allowing you to study the attackers and strengthen your security measures.
A honeypot is a technique used to detect and trap bad requests on a website. The majority of these spam attempts target websites by way of spam form submissions or searches for security holes. Spam bots frequently flood a website by filling in the fields on its contact page, entry form, or product inquiry form, leaving us with possibly tens of thousands of spam submissions each day. As a result, the foundation of our idea is a honeypot-based spam defense system. As a security measure, a field is placed on the registration form that users cannot see because CSS and JavaScript hide the trap's position. The honeypot system recognizes and traps the bots when they scan the code and fill in the hidden field. In our project, we used email as the hidden field, together with more sophisticated hiding strategies. The goal of our research is to use advanced hiding techniques to detect, trap, and block spam bots and to generate a list of spam messages, making the honeypot an advanced honeypot. An agile model was employed as the methodology, since agile is a popular and effective methodology for website development. The desired outcome of our project is for the honeypot system to be able to recognize, seize, and stop spam bots in order to keep them from accessing our website, and additionally to create a list of spam emails that anti-spam organizations can use to confirm spam emails. In short, the goal of this project is to locate spam bots, catch them, and prevent them from accessing our website. In addition, the information about the captured spam bots is used to create a list that serves both as an awareness list and as a verification list for regular users [11].
Fig 7 : Applying Honeypot

E. WEB APPLICATION FIREWALL
By filtering and monitoring web traffic between a website and the Internet, a WAF (web application firewall) helps protect web applications. A web application firewall is a firewall specially made to manage web traffic. Its job is to examine all HTTP traffic going to a web server, filter out any "bad" requests, and forward any "good" traffic.
We need to protect the web server and its content from cyber-attacks such as Cross-Site Scripting, Layer 7 DoS attacks, web scraping, and third-party attacks.
A web application firewall guards against harmful HTTP/S traffic entering and exiting your website by filtering, analyzing, and blocking it. It also stops any unauthorized data from leaving the app. It accomplishes this by following a set of rules that help distinguish between safe and malicious communications. Every HTTP/S request at the application level is examined by the web application firewall, which safeguards the application layer. A web application firewall can be thought of as a bridge that connects the user and the web app, filtering all communications before they reach the user or the app.

❖ FLOWCHART OF ANTI-WEB SCRAPING TECHNIQUE (shown in Fig 9)
Some web application firewalls:
Cloudflare
AppTrana
Akamai
Citrix
F5 Advanced WAF
SiteLock
Sucuri Website Firewall
Fig 8 : Applying Firewall
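The rule-based filtering that Section E attributes to a web application firewall can be sketched as a list of checks applied to each request before it is forwarded. This is a toy illustration only and does not reflect how Cloudflare or any other listed product works internally: the `Request` class, the three rules, the user-agent blocklist, and the `email_confirm` field name are all assumptions made for the example. The third rule shows how the hidden-field honeypot from Section D could be expressed as one more rule.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Request:
    """Minimal stand-in for an incoming HTTP request."""
    path: str
    headers: dict = field(default_factory=dict)
    form: dict = field(default_factory=dict)


SCRIPT_TAG = re.compile(r"<\s*script", re.IGNORECASE)
BLOCKED_AGENTS = ("curl", "python-requests", "scrapy")  # hypothetical blocklist
HONEYPOT_FIELD = "email_confirm"  # hypothetical hidden form field


def has_script_tag(req: Request) -> bool:
    """Flag a script tag in the path or in any submitted form value."""
    values = [req.path, *req.form.values()]
    return any(SCRIPT_TAG.search(str(value)) for value in values)


def has_blocked_agent(req: Request) -> bool:
    """Flag a missing User-Agent or one that names a known automation tool."""
    agent = req.headers.get("User-Agent", "").lower()
    return not agent or any(name in agent for name in BLOCKED_AGENTS)


def honeypot_filled(req: Request) -> bool:
    """Flag a submission that filled the field hidden from human users."""
    return bool(str(req.form.get(HONEYPOT_FIELD, "")).strip())


RULES = [
    ("script-tag", has_script_tag),
    ("blocked-agent", has_blocked_agent),
    ("honeypot", honeypot_filled),
]


def evaluate(req: Request):
    """Return (allowed, matched_rule_name); the first matching rule blocks."""
    for name, rule in RULES:
        if rule(req):
            return False, name
    return True, None


browser = {"User-Agent": "Mozilla/5.0"}
print(evaluate(Request("/products", browser, {"q": "shoes"})))
print(evaluate(Request("/products", {"User-Agent": "curl/8.0"})))
```

Keeping the rules in a list separates the decision logic from the individual checks, so a new check can be added without touching `evaluate`. A user-agent check alone is easy to evade, since the header is set by the client, which is why it is combined here with content and honeypot rules.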
➢ Cloudflare Firewall
We have used the Cloudflare firewall on our website.
A cloud firewall is a security product that filters out potentially harmful network traffic. Unlike conventional firewalls, cloud firewalls are hosted in the cloud. This cloud-based delivery model is also called Firewall-as-a-Service (FWaaS).
Just as traditional firewalls form a barrier around an organization's internal network, cloud-based firewalls form a virtual barrier around cloud platforms, infrastructure, and applications. Cloud firewalls can protect on-site infrastructure as well.
The Cloudflare web application firewall (WAF) guards your website against SQL injection, cross-site scripting (XSS), and zero-day threats, as well as the risks and vulnerabilities affecting the application layer that are recognized by OWASP.
According to Cloudflare, its customers include the Top 50 Alexa-ranked websites, banks, e-commerce businesses, and large corporations. Its WAF blocks millions of attacks per day, is fully integrated with its DDoS defense, and automatically learns from every new threat. Cloudflare serves as an intermediary between a client and the backend (hosting) server, duplicating and caching the website's content. It can reduce loading times by storing web content for delivery from the nearest edge server. It can also modify content, such as graphics and rich text, so that it performs better. This intermediary position also provides a level of security filtering: by sitting between the user and the hosting server, Cloudflare can block spam and bot traffic, stop distributed denial-of-service attacks, intercept bot attacks, and detect harmful traffic.
Fig 9 : Working Principle of the Anti-Scraping Technique

IV. FUTURE SCOPE
In the future, we will turn this anti-scraping technique into anti-scraping software. In that software, we will use better firewalls, a login system, Captcha, OTP verification, email verification, a honeypot, IP tracing, etc. We will also try to stop different kinds of illegal requests to web servers.
V. CONCLUSION
In this era, when everything is based on the internet, our important and personal data also resides on the internet, so it can be stolen by scrapers (hackers), who can then harm us using that data. To solve this problem, we have to block those scrapers and protect our important and personal data from them.
We have created a website that implements an anti-web scraping technique. In this technique, we have used modules such as sign-up and login verification, Captcha verification, OTP verification, a honeypot, and a web application firewall.
By using this anti-web scraping technique, we can block scrapers (hackers) and protect our important and personal data from them.
Since scrapers will keep developing new methods to steal data from websites, we also need to keep developing new methods to prevent theft.

REFERENCES
[1] Haque, Afzalul, and Sanjay Singh. "Anti-scraping application development." 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE, 2015.
[2] Parikh, Kaushal, et al. "Detection of web scraping using machine learning." Open Access International Journal of Science and Engineering (2018): 114-118.
[3] Sirisuriya, De S. "A comparative study on web scraping." (2015).
[4] Lemon, Tatiana, et al. "Development of a Student Work-Hour Verification Database for the Pace University Center for Community Action and Research."
[5] Sun, San-Tsai, et al. "What makes users refuse web single sign-on? An empirical investigation of OpenID." Proceedings of the Seventh Symposium on Usable Privacy and Security. 2011.
[6] Gamboa, H., A. L. N. Fred, and A. K. Jain. "WebBiometrics: User verification via web interaction." 2007 Biometrics Symposium. IEEE, 2007.
[7] Mehra, Mahendra, et al. "Mitigating denial of service attack using CAPTCHA mechanism." Proceedings of the International Conference & Workshop on Emerging Trends in Technology. 2011.
[8] Cui, Jing-Song, et al. "A CAPTCHA implementation based on moving objects recognition problem." 2010 International Conference on E-Business and E-Government. IEEE, 2010.
[9] Kurniawan, Dwi Ely, et al. "Login Security Using One Time Password (OTP) Application with Encryption Algorithm Performance." Journal of Physics: Conference Series. Vol. 1783. No. 1. IOP Publishing, 2021.
[10] Singh, B., K. Sh Ranjan, and D. Aggarwal. "Smart voting web-based application using face recognition, Aadhar and OTP verification." International Journal of Research in Industrial Engineering 9.3 (2020): 260-270.
[11] Mairh, Abhishek, et al. "Honeypot in network security: a survey." Proceedings of the 2011 International Conference on Communication, Computing & Security. 2011.
[12] Muzammil, Akanksha Chaudhary, and Rohit Nandan. "Comparative Analysis of Packet Filtering Firewall."
[13] Bhamra, Satnam Singh. "The 2010 Personal Firewall Robustness Evaluation." (2010).