Commit 3340024

Merge pull request hastagAB#45 from nitish-iiitd/master
Added a Simple Webpage Parser Wrapper
2 parents b24df42 + c5f6bf0 commit 3340024

4 files changed: +32 -0 lines changed

SimpleWebpageParser/README.md

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# Simple Webpage Parser
+A simple wrapper around the popular web scraping library BeautifulSoup. It combines the Requests and BeautifulSoup libraries in one class that hides the steps of fetching a page's HTML from its URL, leaving the user with clean code to work with.
+
+## Libraries Required
+1. requests
+`$ pip install requests`
+2. beautifulsoup4
+`$ pip install beautifulsoup4`
+
+## Usage
+A sample script `webpage_parser.py` is provided to show how to use SimpleWebpageParser. It prints all the links from the Hacktoberfest home page.

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+import requests
+from bs4 import BeautifulSoup
+
+class SimpleWebpageParser():
+
+    def __init__(self, url):
+        self.url = url
+
+    def getHTML(self):
+        r = requests.get(self.url)
+        data = r.text
+        soup = BeautifulSoup(data, "lxml")
+        return soup
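
A note on the class above: `requests.get` is called without a timeout or a status check, and the `"lxml"` parser needs the separate `lxml` package, which the README does not list. Below is a minimal sketch of a more defensive variant, assuming those changes are wanted. The `timeout` and `parser` parameters and the `parse` helper are hypothetical additions for illustration and are not part of this commit.

```python
import requests
from bs4 import BeautifulSoup


class SimpleWebpageParser:
    """Fetch a page with requests and return it as a BeautifulSoup tree."""

    def __init__(self, url, timeout=10, parser="html.parser"):
        # timeout and parser are illustrative parameters, not in the commit.
        # "html.parser" ships with Python, so no extra install is needed.
        self.url = url
        self.timeout = timeout
        self.parser = parser

    def parse(self, markup):
        # Kept separate from fetching so it can be used without a network call.
        return BeautifulSoup(markup, self.parser)

    def getHTML(self):
        # Give up on slow hosts and raise on 4xx/5xx responses.
        response = requests.get(self.url, timeout=self.timeout)
        response.raise_for_status()
        return self.parse(response.text)
```

Passing `parser="lxml"` would restore the committed behaviour, provided `lxml` is installed.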

SimpleWebpageParser/__init__.py

Whitespace-only changes.

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+from SimpleWebpageParser import SimpleWebpageParser
+
+swp = SimpleWebpageParser("https://hacktoberfest.digitalocean.com/")
+html = swp.getHTML()
+print(html.find_all('a'))
+
+## The returned html is a BeautifulSoup object, so it can be parsed with the usual BeautifulSoup syntax.
+## Refer to the BeautifulSoup documentation for more functionality.
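
The sample script prints whole `<a>` tags. If only the link targets are wanted, something like the following could be applied to the returned object. It is a sketch that runs on an inline HTML string rather than the live page; the `extract_links` helper and the sample markup are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(soup, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    links = []
    for anchor in soup.find_all("a", href=True):
        # urljoin resolves relative hrefs against the page URL.
        links.append(urljoin(base_url, anchor["href"]))
    return links


# Hypothetical markup standing in for a fetched page.
sample = (
    '<a href="/events">Events</a>'
    '<a name="top"></a>'
    '<a href="https://github.com/">GitHub</a>'
)
soup = BeautifulSoup(sample, "html.parser")
print(extract_links(soup, "https://hacktoberfest.digitalocean.com/"))
```

Filtering with `href=True` skips anchors that carry no target, such as the named anchor in the sample.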
