aioarxiv

Asynchronous Python wrapper for the arXiv API

Credit: Based on the original synchronous arxiv.py by Lukas Schwab

arXiv, operated by Cornell University, provides open access to more than 2,000,000 scholarly articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, Statistics, Electrical Engineering and Systems Science, and Economics.

Features

Fully asynchronous API interactions
Efficient, non-blocking arXiv searches
Support for downloading PDFs and source files
Access to arXiv's RSS feed
Flexible search and client configuration

Installation

From Git Repository

$ pip install git+https://github.com/jmmeus/aioarxiv.git

Local Installation

$ git clone https://github.com/jmmeus/aioarxiv.git
$ cd aioarxiv
$ pip install .

Usage

Basic Search

import asyncio
import aioarxiv

async def main():
    # Search for the 10 most recent articles matching the keyword "quantum"
    search = aioarxiv.SearchQuery(
        query="quantum",
        max_results=10,
        sort_by=aioarxiv.SortCriterion.SubmittedDate
    )

    # Initialize the asynchronous API client and fetch results asynchronously
    async with aioarxiv.Client() as client:
        # `results` is an AsyncGenerator; iterate through its elements
        results = client.results(search)
        async for result in results:
            print(result.title)

        # ... or exhaust it into a list. A generator can only be consumed
        # once, so request a fresh one for each pass
        all_results = [r async for r in client.results(search)]
        print([r.title for r in all_results])

        # Alternatively, access only the first element using the
        # `__anext__()` magic method
        first_result = await client.results(search).__anext__()
        # ... or, on Python >= 3.10, the built-in `anext()` function
        first_result = await anext(client.results(search))
        print(first_result.title)

asyncio.run(main())
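
Because the results arrive as an async generator, they can only be walked once, and a partially consumed generator resumes where it stopped. The sketch below illustrates this with a small `take` helper. Both `take` and `fake_results` are hypothetical names used for illustration and are not part of aioarxiv; `fake_results` stands in for `client.results(search)` so the example runs without network access.

```python
import asyncio
from typing import AsyncIterator, List, Tuple, TypeVar

T = TypeVar("T")


async def take(source: AsyncIterator[T], n: int) -> List[T]:
    """Collect at most `n` items from an async iterator, then stop."""
    items: List[T] = []
    if n <= 0:
        return items
    async for item in source:
        items.append(item)
        if len(items) >= n:
            break
    return items


async def fake_results(count: int) -> AsyncIterator[str]:
    """Stand-in for `client.results(search)`; yields placeholder titles."""
    for i in range(count):
        await asyncio.sleep(0)  # pretend to wait on the network
        yield f"paper-{i}"


async def demo() -> Tuple[List[str], List[str]]:
    results = fake_results(5)
    first_two = await take(results, 2)   # consumes paper-0 and paper-1
    remainder = await take(results, 10)  # resumes after the first call
    return first_two, remainder


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With a real client, the same helper should accept the generator returned by `client.results(search)`, assuming it behaves like a standard async generator.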

Advanced Search and Client Configuration

import asyncio
import aioarxiv

async def main():
    # Configure async client with custom parameters
    client = aioarxiv.Client(
        page_size=1000,  # Number of results per page
        delay_seconds=10.0,  # Delay between API requests
        num_retries=5  # Number of retry attempts for failed requests
    )

    async with client as _client:
        # Advanced query searching by author and title
        search = aioarxiv.SearchQuery(query="au:del_maestro AND ti:checkerboard")
        results = _client.results(search)
        first_result = await results.__anext__()
        print(first_result)

        # Search by specific paper ID
        search_by_id = aioarxiv.SearchQuery(id_list=["1605.08386v1"])
        results = _client.results(search_by_id)
        paper = await results.__anext__()
        print(paper.title)

        # Iterate through all results
        search = aioarxiv.SearchQuery(query="quantum", max_results=100)
        async for result in _client.results(search):
            print(result.title)

asyncio.run(main())
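
Query strings such as `au:del_maestro AND ti:checkerboard` follow the arXiv API query syntax: a field prefix, a colon, a value, and boolean operators between terms. A minimal sketch of a helper that assembles such strings is shown below. `build_query` and `FIELD_PREFIXES` are hypothetical names, not part of aioarxiv, and only a handful of the prefixes are covered.

```python
from typing import Dict, List

# A subset of the field prefixes defined by the arXiv API query syntax
FIELD_PREFIXES: Dict[str, str] = {
    "title": "ti",
    "author": "au",
    "abstract": "abs",
    "category": "cat",
    "all": "all",
}


def build_query(operator: str = "AND", **fields: str) -> str:
    """Join `prefix:value` terms with a boolean operator (AND, OR, ANDNOT)."""
    if operator not in {"AND", "OR", "ANDNOT"}:
        raise ValueError(f"unsupported operator: {operator!r}")
    terms: List[str] = []
    for name, value in fields.items():
        if name not in FIELD_PREFIXES:
            raise ValueError(f"unknown field: {name!r}")
        # Quote multi-word values so they are matched as a phrase
        if " " in value:
            value = f'"{value}"'
        terms.append(f"{FIELD_PREFIXES[name]}:{value}")
    return f" {operator} ".join(terms)


if __name__ == "__main__":
    print(build_query(author="del_maestro", title="checkerboard"))
```

The returned string can then be passed as the `query` argument of `aioarxiv.SearchQuery`.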

Downloading Papers

import asyncio
import aioarxiv

async def main():
    # Initialize the asynchronous API client
    client = aioarxiv.Client()

    # Download a paper by ID
    search = aioarxiv.SearchQuery(id_list=["1605.08386v1"])
    async with client as _client:
        results = _client.results(search)
        paper = await results.__anext__()

        # Download PDF asynchronously
        await paper.download_pdf()
        
        # Download with custom filename and directory
        await paper.download_pdf(
            dirpath="./downloads", 
            filename="quantum-paper.pdf"
        )

        # Download source archive
        await paper.download_source(filename="paper-source.tar.gz")

asyncio.run(main())
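
When downloading many papers, it can help to run the downloads concurrently while capping how many are in flight at once. The sketch below shows one way to do that with `asyncio.Semaphore` and `asyncio.gather`. `FakePaper` and `download_all` are hypothetical names: `FakePaper` only imitates the `download_pdf(dirpath=..., filename=...)` call shape used above, and its return value (a path string) is an assumption rather than documented aioarxiv behavior.

```python
import asyncio
import tempfile
from pathlib import Path
from typing import Iterable, List


class FakePaper:
    """Stand-in for a search result; only mimics the `download_pdf` call shape."""

    def __init__(self, name: str) -> None:
        self.name = name

    async def download_pdf(self, dirpath: str = "./", filename: str = "") -> str:
        await asyncio.sleep(0.01)  # pretend to transfer data
        return str(Path(dirpath) / (filename or f"{self.name}.pdf"))


async def download_all(
    papers: Iterable[FakePaper], dirpath: str, limit: int = 3
) -> List[str]:
    """Download every paper into `dirpath`, at most `limit` at a time."""
    Path(dirpath).mkdir(parents=True, exist_ok=True)  # make sure the target exists
    semaphore = asyncio.Semaphore(limit)

    async def download_one(paper: FakePaper) -> str:
        async with semaphore:  # waits while `limit` downloads are in flight
            return await paper.download_pdf(dirpath=dirpath)

    # gather preserves the order of the input papers in its result list
    return list(await asyncio.gather(*(download_one(p) for p in papers)))


async def demo() -> List[str]:
    with tempfile.TemporaryDirectory() as tmp:
        papers = [FakePaper(f"paper-{i}") for i in range(5)]
        return await download_all(papers, dirpath=tmp, limit=2)


if __name__ == "__main__":
    print(len(asyncio.run(demo())))
```

A small `limit` keeps the number of simultaneous connections modest, which is usually the considerate choice for a shared public service.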

RSS Feed Access

The RSS feed provides a faster alternative to sorting results by publication date, though it comes with some trade-offs:

Limited to papers announced during the previous full day (at most 2000 entries)
Contains less metadata per result (e.g. no published/updated timestamps)
Updates only once a day, at midnight Eastern Time, so a regular search is more reliable for finding the newest papers
Still supports core functionality such as PDF and source downloads

import asyncio
import aioarxiv

async def main():
    client = aioarxiv.Client()

    async with client as _client:
        # Get the most recent entries from the RSS feed using RSSQuery
        feed_query = aioarxiv.RSSQuery(query="astro-ph")
        feed_results = _client.results(feed_query)
        
        # Iterate through feed entries
        async for entry in feed_results:
            # Contains largely the same metadata as search results
            # Also includes announcement type info to filter for
            # only new (not updated or cross-posted) entries
            if entry.announce_type == aioarxiv.AnnounceType.New:
                # Print only info for new publications
                print(entry.entry_id, entry.title, entry.authors)

        # Limit the number of results
        limited_feed = aioarxiv.RSSQuery(query="astro-ph", max_results=5)
        feed_entries = [entry async for entry in _client.results(limited_feed)]

        # Get the first entry using anext (Python >= 3.10). `feed_results` was
        # exhausted by the loop above, so request a fresh generator
        first_entry = await anext(_client.results(limited_feed))
        print(first_entry.title)

        # Download PDF and source files just as you would for search results
        await first_entry.download_pdf(
            dirpath="./downloads",
            filename="latest-astro-ph-paper.pdf"
        )
        await first_entry.download_source()

asyncio.run(main())
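
The announcement-type check in the loop above can also be factored into a reusable async filter, so the filtered feed can be passed around like any other async iterator. The sketch below is one possible shape. `afilter`, `new_titles`, `FakeEntry`, and `fake_feed` are hypothetical names, and the local `AnnounceType` enum is a stand-in: only the `New` member appears in this README, while the other members and all values are placeholders.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum
from typing import AsyncIterator, Callable, List, TypeVar

T = TypeVar("T")


class AnnounceType(Enum):
    """Stand-in enum; members other than `New` are placeholders."""
    New = "new"
    Replace = "replace"
    Cross = "cross"


@dataclass
class FakeEntry:
    """Stand-in for an RSS result with just the fields used here."""
    title: str
    announce_type: AnnounceType


async def afilter(
    predicate: Callable[[T], bool], source: AsyncIterator[T]
) -> AsyncIterator[T]:
    """Yield only the items of `source` for which `predicate` is true."""
    async for item in source:
        if predicate(item):
            yield item


async def fake_feed() -> AsyncIterator[FakeEntry]:
    """Stand-in for `client.results(feed_query)`."""
    entries = [
        FakeEntry("A", AnnounceType.New),
        FakeEntry("B", AnnounceType.Replace),
        FakeEntry("C", AnnounceType.New),
    ]
    for entry in entries:
        await asyncio.sleep(0)  # pretend to wait on the network
        yield entry


async def new_titles(feed: AsyncIterator[FakeEntry]) -> List[str]:
    """Collect the titles of entries announced as new."""
    only_new = afilter(lambda e: e.announce_type is AnnounceType.New, feed)
    return [entry.title async for entry in only_new]


if __name__ == "__main__":
    print(asyncio.run(new_titles(fake_feed())))
```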

Logging

Configure logging to inspect network behavior and API interactions:

import logging
import aioarxiv

logging.basicConfig(level=logging.DEBUG)
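
Setting the root logger to `DEBUG` also surfaces debug output from every other library in the process. A narrower option is to raise the level for a single named logger, as sketched below. The logger name `"aioarxiv"` is an assumption (many packages log under their own package name); check the package source if no output appears. `enable_debug` is a hypothetical helper name.

```python
import logging


def enable_debug(logger_name: str = "aioarxiv") -> logging.Logger:
    """Enable DEBUG output for one named logger and return it.

    The default name is an assumption, not confirmed by the package docs.
    """
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    return logger


# Keep everything else at WARNING, then opt in for the one logger
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
enable_debug()
```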

Types

Client: Configurable async client for fetching results from both arXiv API and RSS feeds
BaseQuery: Base class for queries, not intended to be instantiated by the user
SearchQuery: Defines search parameters for arXiv database queries
RSSQuery: Defines parameters for RSS feed queries
BaseResult: Base class for results, not intended to be instantiated by the user
SearchResult: Represents paper metadata from arXiv API searches, with download methods. The underlying raw data is documented in the "Details of Atom Results Returned" section of the arXiv API User Manual
RSSResult: Represents paper metadata from the RSS feed, with download methods. The underlying raw data is documented in the arXiv RSS feed specifications

Contributing

Contributions are welcome! Please open issues and submit pull requests on the GitHub repository.

Acknowledgements

This package is an asynchronous reimplementation of the original arxiv.py by Lukas Schwab, designed to provide async capabilities for arXiv API interactions.
