Brief History of Web Scraping
May 14, 2021
Data, web scraping
Web scraping is becoming a more widely known term. Most people associate it with web data
extraction, one of the simplest and most efficient ways of copying large amounts of
information from the web. However, did you know that web scraping was born for a completely
different purpose, and that it took almost two decades to turn into the web scraping we are
familiar with now?
Here is the timeline:
The birth of the World Wide Web
The origins of web scraping can be traced back to 1989, when the British scientist
Tim Berners-Lee invented the World Wide Web. Originally, the idea was to have a platform
where information could be automatically shared between scientists in universities and
institutes all around the world. However, the World Wide Web also introduced three very
important features that remain the key elements of every web scraping tool today:
URLs, which we now use to point a scraper at a specific website,
embedded hyperlinks, which allow us to navigate through that website,
and web pages, which contain various types of data: text, images, audio, video, etc.
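To make these three elements concrete, here is a minimal sketch of how a scraper might use them together. It relies only on Python's standard library; the page URL, the HTML snippet, and the `LinkCollector` class name are made up for illustration, and a real scraper would download the page instead of using a hard-coded string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


# Hypothetical URL and page content, used only for illustration.
PAGE_URL = "https://example.com/articles/"
PAGE_HTML = """
<html><body>
  <h1>Articles</h1>
  <a href="history.html">History</a>
  <a href="/about">About</a>
  <a href="https://example.org/data">External data</a>
</body></html>
"""

collector = LinkCollector(PAGE_URL)
collector.feed(PAGE_HTML)
for link in collector.links:
    print(link)
```

The URL tells the scraper where the page lives, the hyperlinks tell it where it could go next, and the page itself holds the data to be extracted.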
First web browser
Continuing his work, over the next two years Tim Berners-Lee built the very first web browser
and the first http:// web page, both running on his NeXT computer, which also acted as the
server. This gave people a way to access and interact with the World Wide Web.
The Wanderer
Not long after, in 1993, the first concept of crawling was born. The Wanderer, or more
precisely the World Wide Web Wanderer, developed by Matthew Gray at the Massachusetts
Institute of Technology, was the first Perl-based web crawler of its kind, and its sole
purpose was to measure the size of the web. The same year, the Wanderer was used to
generate an index called the Wandex. Even though its author does not claim it, the
Wanderer together with the Wandex had the potential to become the first general-purpose
World Wide Web search engine.
JumpStation
However, in the same year, 1993, JumpStation was born. It laid the groundwork for big names
such as Google, Bing, Yahoo, and other web search tools of today, and became the first actual
crawler-based web search engine. As crawlers like it went on to index millions of web pages,
the internet turned into an openly accessible source of data in various forms.
BeautifulSoup
A bit more than a decade later, in 2004, came BeautifulSoup, an HTML parser and a library of
commonly used parsing routines written in the Python programming language. BeautifulSoup made
it easier to understand the structure of a site and to parse the contents of its HTML
elements, saving programmers hours of work. By then the internet had become an immense,
easily searchable source of information that anyone with a computer and an internet
connection could access, and people had started to take advantage of this by extracting the
information available to them. For some time websites did not restrict the downloading of
their content; slowly that changed, however, and given the amount of data being downloaded,
manual copy-pasting was simply not an option. Other ways of obtaining the information were
therefore bound to be developed.
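As a rough illustration of what parsing the contents of HTML containers looks like in practice, here is a minimal sketch using the current BeautifulSoup package (`beautifulsoup4`, imported as `bs4`), which is a third-party install. The product listing, class names, and prices are invented for this example, and the modern API shown here is not necessarily what the 2004 release looked like.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical product listing, used only for illustration.
PAGE_HTML = """
<ul id="products">
  <li class="product"><span class="name">Kettle</span><span class="price">24.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">31.50</span></li>
</ul>
"""

# Parse the markup with Python's built-in HTML parser backend.
soup = BeautifulSoup(PAGE_HTML, "html.parser")

products = []
for item in soup.find_all("li", class_="product"):
    # Pull the text out of each nested container.
    name = item.find("span", class_="name").get_text(strip=True)
    price = float(item.find("span", class_="price").get_text(strip=True))
    products.append({"name": name, "price": price})

print(products)
```

A few lines like these replace what would otherwise be manual copy-pasting or fragile string searching, which is the time saving the library became known for.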
Rise of visual web scrapers
Soon after, web scraping as we know it was born. The visual web scraping software Web
Integration Platform version 6.0, launched by Stefan Andresen, allowed users to highlight
the information they needed on a web page and structure that data into a usable Excel file
or database. This gave non-programmers an opportunity to join in and easily extract data
from the web.
Nowadays, as technologies and industries progress, companies are looking to gain an
advantage over their competition. And because the amount of information available on the
internet keeps growing rapidly, web scraping is becoming one of the most prominent and
widely used methods of acquiring data at scale across various industries and business
spheres.
Future of web scraping
Web scraping has grown immensely in recent years and is almost certain to keep growing.
Currently, commercial web scraping is mostly used to gain a competitive advantage by
collecting leads, scraping competitors, monitoring prices, etc. However, as technologies
such as Artificial Intelligence develop, and as data becomes even more accessible and
crucial to different aspects of life, web scraping will advance with them and produce
various new and remarkable applications that we are looking forward to experimenting with.