Waybackpy is a Python package and a CLI tool that interfaces with the Wayback Machine APIs.
The Internet Archive's Wayback Machine has three useful public APIs:
- SavePageNow or Save API
- CDX Server API
- Availability API
All three APIs can be accessed through waybackpy, either by importing it in a Python file or module, or from the command-line interface.
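As a rough illustration of what these three APIs look like at the HTTP level, the sketch below builds request URLs with the standard library only and performs no network access. The endpoint constants are assumptions taken from the Internet Archive's public documentation, and the helper names (`save_request_url`, `cdx_request_url`, `availability_request_url`) are hypothetical; they are not part of waybackpy, whose internals may differ.

```python
from urllib.parse import urlencode

# Assumed public endpoints (from the Internet Archive's documentation);
# waybackpy may build its requests differently.
SAVE_ENDPOINT = "https://web.archive.org/save/"
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def save_request_url(url: str) -> str:
    """Return the URL that asks Save Page Now to archive `url`."""
    return SAVE_ENDPOINT + url


def cdx_request_url(url: str, **params: str) -> str:
    """Return a CDX Server query URL; extra CDX parameters are optional."""
    return CDX_ENDPOINT + "?" + urlencode({"url": url, **params})


def availability_request_url(url: str, timestamp: str = "") -> str:
    """Return an Availability API query URL, optionally near a timestamp."""
    query = {"url": url}
    if timestamp:
        query["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(query)


if __name__ == "__main__":
    print(save_request_url("https://github.com"))
    print(cdx_request_url("google.com", limit="1"))
    print(availability_request_url("google.com", "20101010"))
```

In practice waybackpy hides these details behind its interface classes, so building URLs by hand is only useful for understanding or debugging.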
Using pip, from PyPI (recommended):
pip install waybackpy -U
Using conda, from conda-forge (recommended):
See also the waybackpy feedstock; its maintainers are @rafaelrdealmeida, @labriunesp and @akamhy.
conda install -c conda-forge waybackpy
Install directly from this git repository (NOT recommended):
pip install git+https://github.com/akamhy/waybackpy.git
Docker Hub: hub.docker.com/r/secsi/waybackpy
The Docker image is automatically updated on every release by Regularly and Automatically Updated Docker Images (RAUDI).
RAUDI is a tool by SecSI, an Italian cybersecurity startup.
>>> from waybackpy import WaybackMachineSaveAPI
>>> url = "https://github.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> save_api = WaybackMachineSaveAPI(url, user_agent)
>>> save_api.save()
https://web.archive.org/web/20220118125249/https://github.com/
>>> save_api.cached_save
False
>>> save_api.timestamp()
datetime.datetime(2022, 1, 18, 12, 52, 49)
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://google.com"
>>> user_agent = "my new app's user agent"
>>> cdx_api = WaybackMachineCDXServerAPI(url, user_agent)
>>> cdx_api.oldest()
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest = cdx_api.oldest()
>>> oldest
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest.archive_url
'https://web.archive.org/web/19981111184551/http://google.com:80/'
>>> oldest.original
'http://google.com:80/'
>>> oldest.urlkey
'com,google)/'
>>> oldest.timestamp
'19981111184551'
>>> oldest.datetime_timestamp
datetime.datetime(1998, 11, 11, 18, 45, 51)
>>> oldest.statuscode
'200'
>>> oldest.mimetype
'text/html'
>>> newest = cdx_api.newest()
>>> newest
com,google)/ 20220217234427 http://@google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 563
>>> newest.archive_url
'https://web.archive.org/web/20220217234427/http://@google.com/'
>>> newest.timestamp
'20220217234427'
>>> near = cdx_api.near(year=2010, month=10, day=10, hour=10, minute=10)
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.timestamp
'20101010101435'
>>> near = cdx_api.near(wayback_machine_timestamp=2008080808)
>>> near.archive_url
'https://web.archive.org/web/20080808051143/http://google.com/'
>>> near = cdx_api.near(unix_timestamp=1286705410)
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
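The attributes shown in the sessions above map onto the seven space-separated fields of a CDX record. The following is a minimal sketch of that mapping, assuming the standard CDX field order (urlkey, timestamp, original, mimetype, statuscode, digest, length). `CDXRecord`, `parse_cdx_line` and `unix_to_wayback` are hypothetical names used for illustration only; they are not waybackpy's own classes or functions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

WAYBACK_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamp, e.g. 19981111184551


@dataclass(frozen=True)
class CDXRecord:
    """One CDX line, fields kept as strings (hypothetical helper class)."""
    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    statuscode: str
    digest: str
    length: str

    @property
    def datetime_timestamp(self) -> datetime:
        # Parse the 14-digit timestamp into a naive datetime.
        return datetime.strptime(self.timestamp, WAYBACK_FORMAT)

    @property
    def archive_url(self) -> str:
        # Snapshot URL: archive prefix + timestamp + original URL.
        return f"https://web.archive.org/web/{self.timestamp}/{self.original}"


def parse_cdx_line(line: str) -> CDXRecord:
    """Split a space-separated CDX line into its seven fields."""
    fields = line.strip().split(" ")
    if len(fields) != 7:
        raise ValueError(f"expected 7 CDX fields, got {len(fields)}")
    return CDXRecord(*fields)


def unix_to_wayback(unix_timestamp: int) -> str:
    """Convert a Unix timestamp (seconds, UTC) to the 14-digit format."""
    moment = datetime.fromtimestamp(unix_timestamp, tz=timezone.utc)
    return moment.strftime(WAYBACK_FORMAT)


if __name__ == "__main__":
    record = parse_cdx_line(
        "com,google)/ 19981111184551 http://google.com:80/ text/html 200 "
        "HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381"
    )
    print(record.archive_url)
    print(unix_to_wayback(1286705410))
```

The Unix timestamp 1286705410 used in the `near()` example corresponds to 2010-10-10 10:10:10 UTC, which is why that call returns the same snapshot as the `year=2010, month=10, day=10, hour=10, minute=10` call.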
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://pypi.org"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> cdx = WaybackMachineCDXServerAPI(url, user_agent, start_timestamp=2016, end_timestamp=2017)
>>> for item in cdx.snapshots():
...     print(item.archive_url)
...
https://web.archive.org/web/20160110011047/http://pypi.org/
https://web.archive.org/web/20160305104847/http://pypi.org/
.
. # URLS REDACTED FOR READABILITY
.
https://web.archive.org/web/20171127171549/https://pypi.org/
https://web.archive.org/web/20171206002737/http://pypi.org:80/
Using the Availability API is not recommended because of performance issues. Every method of its interface class, WaybackMachineAvailabilityAPI, is also implemented in the CDX Server API interface class, WaybackMachineCDXServerAPI. Note, however, that the snapshot returned by WaybackMachineAvailabilityAPI's newest() method can be more recent than the one returned by the same method of WaybackMachineCDXServerAPI.
>>> from waybackpy import WaybackMachineAvailabilityAPI
>>>
>>> url = "https://google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
>>> availability_api.newest()
https://web.archive.org/web/20220118150444/https://www.google.com/
>>> availability_api.near(year=2010, month=10, day=10, hour=10)
https://web.archive.org/web/20101010101708/http://www.google.com/
Documentation is at https://github.com/akamhy/waybackpy/wiki/Python-package-docs.
A demo video is available on asciinema.org; the text in the video can be copied.
CLI documentation is at https://github.com/akamhy/waybackpy/wiki/CLI-docs.
- akamhy (https://github.com/akamhy)
- eggplants (https://github.com/eggplants)
- danvalen1 (https://github.com/danvalen1)
- AntiCompositeNumber (https://github.com/AntiCompositeNumber)
- rafaelrdealmeida (https://github.com/rafaelrdealmeida)
- jonasjancarik (https://github.com/jonasjancarik)
- jfinkhaeuser (https://github.com/jfinkhaeuser)
- mhmdiaa (https://github.com/mhmdiaa), on whose gist --known-urls is based.
- dequeued0 (https://github.com/dequeued0) for reporting bugs and useful feature requests.