Note: This is a fork of the original (now unmaintained) internetarchive/warc project and Python module.

WARC (Web ARChive) is a file format for storing web crawls.
The format is described at http://bibnum.bnf.fr/WARC/ and https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/

The warc library makes it easy to iterate over the records in a WARC file::

    import warc

    # Open the archive and print the target URI and body length of each record.
    with warc.open("test.warc") as f:
        for record in f:
            print(record['WARC-Target-URI'], record['Content-Length'])

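To make the header fields used in the example above more concrete, here is a minimal sketch of how a single uncompressed WARC record is laid out: a version line, named header fields, a blank line, and then ``Content-Length`` bytes of content. It uses only the Python standard library and is not how the warc library is implemented; the function ``read_warc_record`` and the sample record are hypothetical and exist only for illustration.

```python
import io


def read_warc_record(stream):
    """Read one record from a binary stream.

    Returns (version, headers, payload), or None at end of input.
    Illustrative only: header names are kept exactly as written,
    and compressed (.warc.gz) input is not handled.
    """
    # Skip the blank lines that separate consecutive records.
    line = stream.readline()
    while line in (b"\r\n", b"\n"):
        line = stream.readline()
    if not line:
        return None

    # The first line of a record names the format version, e.g. "WARC/1.0".
    version = line.strip().decode("utf-8")
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record: %r" % version)

    # Named fields follow as "Name: value" lines until a blank line.
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # The content block is exactly Content-Length bytes long.
    payload = stream.read(int(headers["Content-Length"]))
    return version, headers, payload


# A hand-written sample record (hypothetical data).
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

stream = io.BytesIO(sample)
version, headers, payload = read_warc_record(stream)
print(headers["WARC-Target-URI"], headers["Content-Length"])
```

In practice the library takes care of this parsing, so application code only needs the record interface shown in the example above.
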
ARC files and the WARC derivatives WAT and WET are also supported.

The documentation of the warc library is available at https://warc.readthedocs.org/.

This software is licensed under the GPL v2. See the LICENSE file for details.