rplessl/wk2txt
NAME
wk2txt - A Website to Text Converter
SYNOPSIS
wk2txt *options*
Options:
    --input URL          convert *URL* to text
    --urlfile file       convert URLs in *file* to text
    --output directory   save output in directory *directory*
    --xml                dump in XML format
    --timeout interval   timeout after *interval* seconds (NOT IMPLEMENTED)
    --debug              print diagnostic information
    --help               print help on options
    --version            print version number
DESCRIPTION
wk2txt is a command line tool based on the WebKit engine that extracts
the text content of a website as seen by current web browsers. This
includes evaluating the JavaScript code present in many modern webpages.
wk2txt can process a whole list of URLs, which can be built by using the
"--input" option multiple times or by passing the filename of a list of
URLs to the "--urlfile" option. This file is expected to contain one URL
per line.
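The way the URL list is assembled can be sketched as follows. This is a
minimal illustration in Python, not the tool's actual code: the function
name collect_urls is hypothetical, and skipping blank lines and stripping
surrounding whitespace are assumptions that the text above does not
confirm.

```python
from pathlib import Path


def collect_urls(input_urls, urlfile=None):
    """Combine URLs given directly with URLs read from a file.

    input_urls: URLs as they would come from repeated --input options.
    urlfile:    optional path to a file with one URL per line.
    """
    urls = list(input_urls)
    if urlfile is not None:
        text = Path(urlfile).read_text(encoding="utf-8")
        for line in text.splitlines():
            line = line.strip()
            # Assumption: blank lines are ignored rather than treated as URLs.
            if line:
                urls.append(line)
    return urls
```

In this sketch the URLs passed directly come first, followed by the URLs
from the file in the order in which they appear.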
The contents of the webpages are written to one file per webpage. The
files are stored in the directory specified by the "--output" option.
The filename is derived from the URL: it is the SHA1 hash code of the
URL.
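To locate the output file for a given URL, the filename can be
recomputed from the URL. The sketch below assumes that the hash is taken
over the UTF-8 bytes of the URL exactly as it was passed in and that the
filename is the lowercase hexadecimal digest without an extension. None
of these details are stated above, and the function name output_path is
hypothetical.

```python
import hashlib
from pathlib import Path


def output_path(output_dir, url):
    """Return the path where the text of `url` would be stored.

    Assumptions: SHA1 over the UTF-8 encoded URL string, hex digest used
    as the filename, no file extension.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return Path(output_dir) / digest
```

Because the hash depends on every character of the URL, two spellings of
the same address (for example with and without a trailing slash) would
map to different files under these assumptions.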
By default, wk2txt dumps the contents and the metadata of the webpage
in plain text format. If the "--xml" option is used, the contents and
metadata are dumped in XML format.
AUTHOR
Christian Plessl <[email protected]>
Roman Plessl <[email protected]>