arXiv Business Model White Paper
================================

Cornell University Library  
January 15, 2010; last updated August 5, 2010  
See also: [arXiv Support](/help/support) and [FAQ](/help/support/faq)  
Contact: support@arxiv.org

## 1. Introduction

Started in August 1991, arXiv.org (formerly xxx.lanl.gov) is
internationally acknowledged as a pioneering and successful digital
archive and open-access distribution service for research articles. The
e-print repository has transformed the scholarly communication
infrastructure of multiple fields of physics and plays an increasingly
prominent role in a unified set of global resources for physics,
mathematics, computer science, and related disciplines. It is very
firmly embedded in the research workflows of these subject domains and
has changed the way in which material is shared, making science more
democratic and allowing for the rapid dissemination of scientific
findings.

arXiv is an international initiative and has mirror sites in 13
countries and collaborations with U.S. and foreign professional
societies and other international organizations. It has provided a
crucial life-line for isolated researchers in developing countries. Most
scientists and researchers who post content on arXiv also submit their
work for publication in traditional peer-reviewed journals. However,
the 2003 decision by the famously reclusive Russian mathematician Grigori
Perelman to post his proof of the 100-year-old Poincaré Conjecture
solely on arXiv underscores the repository's importance and its role in
transforming scholarly communication.

Since its inception in 1991 with a focus on the high energy physics
community, arXiv has significantly expanded both its subject coverage
and user base. It moved to Cornell with its founder, physicist Paul
Ginsparg, when he returned as a faculty member in 2001, and is now a
collaboration between the Cornell University Library and Cornell's
Computing and Information Science Program. The library is responsible
for arXiv's operation and maintenance, while research around the
repository is performed in conjunction with the Information Science
program.

arXiv is a primary exemplar of an effective scholarly repository and is
often cited to illustrate digital repositories' potential role in
transforming scholarly communication, broadening access, and allowing
for the rapid dissemination of scientific findings. Cultural practices
within high energy physics such as the long-standing reliance on
pre-prints likely influenced the initial rapid appropriation of arXiv
within that community, but arXiv has since been adopted by many other
communities with different practices. Through Paul Ginsparg's leadership
the service has consistently focused on the disciplinary cultures
represented in the digital repository and the needs of the user
communities. Although arXiv is not peer-reviewed, the submissions are
screened by subject-specific moderators to ensure content is relevant to
current research in the specified disciplines (see arXiv's
[primer](http://arxiv.org/help/primer) and [moderation
guidelines](http://arxiv.org/help/moderation)). Additionally, an
endorsement system uses community feedback to pre-screen new submitters.
arXiv has facilities to harvest, record and display references and links
to formally published versions of articles based on the deposited
e-prints, thus providing an overt link to peer review. arXiv currently
numbers over 580,000 e-prints. In calendar year 2009, arXiv accepted
64,047 new submissions and served over 30 million full-text downloads.

## 2. Business Model

Scholars worldwide depend upon the stable operation and continued
development of arXiv. Sustainability is best assured by aligning revenue
sources with the constituents that realize value from arXiv, and by
reducing dependence on Cornell University Library's budget. We have
decided to pursue a collaborative business model that will engage the
institutions that benefit from arXiv.

Based on extensive feedback from arXiv stakeholders, we are proposing an
interim business arrangement for three years that will provide the most
immediately viable short-term funding model: income generated by
recurring subsidies from the libraries at academic institutions,
research centers, government laboratories, and other organizations that
are the heaviest users of arXiv, managed through a tiered structure of
annual support requests similar to many other open-access funding
models.

### 2.1 Budget

The calendar year 2010 budget for arXiv is \$400,000, which includes
costs for personnel and operating expenses (G&A overhead, hardware,
hosting, and network charges; see [budget summary](2010_budget)). Staff
salaries account for about 80% of total annual expenses. We expect the
annual budget to increase to $500,000 by 2012 to facilitate necessary
upgrades and enhancements (see [section 3.1](#sec3_1) for more
information regarding our technology agenda).

Our goal is to continue providing this valuable component of the
scholarly communication system at minimal cost to the community.
Measured against 2009 usage, the 2010 budget corresponds to an effective
cost of about 1.3 cents per download or, alternatively, less than $7 per
submission.
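
The per-unit figures can be checked with a short calculation. The sketch below assumes the 2010 budget is divided by the calendar year 2009 usage figures quoted in the introduction; the variable and function names are illustrative only.

```python
# Figures quoted in this paper.
BUDGET_2010_USD = 400_000      # calendar year 2010 budget
DOWNLOADS_2009 = 30_000_000    # "over 30 million" downloads, so this is a lower bound
SUBMISSIONS_2009 = 64_047      # new submissions accepted in 2009


def cost_per_unit(budget_usd: float, units: int) -> float:
    """Spread the budget evenly across a number of downloads or submissions."""
    return budget_usd / units


# Assumption: the 2010 budget is compared against 2009 usage.
cents_per_download = 100 * cost_per_unit(BUDGET_2010_USD, DOWNLOADS_2009)
dollars_per_submission = cost_per_unit(BUDGET_2010_USD, SUBMISSIONS_2009)

print(f"cost per download: about {cents_per_download:.1f} cents")
print(f"cost per submission: about ${dollars_per_submission:.2f}")
```

Because the download count is a lower bound, the true cost per download is at most the value computed here.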

### 2.2 Collaborative support model

We are seeking direct support from the heaviest institutional users of
arXiv. Our tier-based support structure is based on the previous
calendar year's download activity and will be applied equally to
academic institutions, research centers, government labs, and other
organizations. For consortia willing to promote and capture support for
arXiv, we will offer a 10% discount for new supporting institutions and
a 5% discount on renewals. No institutional site license is required.

Initially, a three-tiered institutional support model will be implemented
with a top-end rate of \$4,000 per year and a low-end rate of \$2,300 per
year. We seek support from institutions representing the most active
users of arXiv, in both the United States and other countries.

#### 2010 Support Request Rates

**Tier 1**

The top 100 institutions based on the previous year's download activity.
These institutions account for approximately 55% of all institutional
downloads from arXiv.

\$4,000/year

**Tier 2**

Institutions that rank between 101 and 200 in terms of download activity
and account for approximately 25% of all institutional downloads from
arXiv.

\$3,200/year

**Tier 3**

Institutions that rank below 200 in download activity and account for
approximately 20% of all institutional downloads from arXiv.

\$2,300/year

Institutions wishing to support arXiv should contact their consortia
representatives or the arXiv office at Cornell University Library at
support@arxiv.org.

We anticipate that it will take time to attract sufficient support to
meet arXiv's budget needs. If strong support results in a surplus, it
will be reinvested in arXiv or result in reduced rates. See [section
2.4](#sec2_4) for a discussion of operating principles and Cornell
support during transition to this model.

### 2.3 Why a use-based model?

As a public good, arXiv should be supported by those institutions that
use it the most. We have compiled a [listing of the most active 200
institutions](2009_usage), based on download data for calendar year 2009.
We will compile and review usage data, by institution, on an annual
basis and will notify institutions that have moved from one tier to
another.

We do not have precise statistics based on articles submitted to arXiv.
This is because arXiv does not demand that submitters include
affiliation information for authors, and does not control any
affiliation metadata that is submitted. We can provide approximate
statistics for supporting institutions but note that these should be
interpreted with care. Early feedback from our colleagues indicates
interest in submission statistics. Options for generating better
statistics will be discussed by the arXiv Sustainability Advisory Group
(see [section 2.7](#sec2_7)).

### <span id="sec2_4">2.4 Operating principles</span>

-   Cornell University Library is committed to maintaining arXiv as an
    open access service, free to submitters and users alike.
-   Cornell University Library will initially implement an institutional
    support model that targets academic institutions, research centers,
    and other organizations (e.g. government labs) until a more diverse
    funding model can be developed.
-   Cornell University Library will continue to provide a sizeable
    portion of the costs to operate arXiv, but that amount will diminish
    incrementally over the next four years. In 2010, Cornell University
    Library will subvent up to 75% of total annual costs; in 2011, the
    amount will decline to 50%; in 2012 to 25%; and thereafter remain
    steady at 15% of the operating budget. The budget for 2010 is
    approximately $400,000. We expect this amount to rise to
    $500,000 by 2012 due to necessary upgrades and enhancements.
-   Costs associated with institutional support management are factored
    in as a percentage of annual support income.
-   Cornell University Library will not realize a "profit" from this
    support model; any surplus accrued will be reinvested in arXiv or
    result in reduced rates.
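
The subvention schedule in the principles above implies how much must be raised from supporting institutions each year. The following sketch works through the two years for which this paper states a budget (2010 and 2012). The function names are illustrative, and treating the stated percentages as exact shares rather than ceilings is an assumption.

```python
# Maximum share of annual costs covered by Cornell University Library,
# in percent, as listed in the operating principles.
CORNELL_SHARE_PCT = {2010: 75, 2011: 50, 2012: 25}
STEADY_STATE_SHARE_PCT = 15  # applies after 2012


def cornell_share_pct(year: int) -> int:
    """Return Cornell's percentage share of the budget for a given year."""
    if year < 2010:
        raise ValueError("the schedule starts in 2010")
    return CORNELL_SHARE_PCT.get(year, STEADY_STATE_SHARE_PCT)


def split_budget(year: int, budget_usd: int) -> tuple[int, int]:
    """Split a budget into (Cornell share, amount to raise externally)."""
    cornell = budget_usd * cornell_share_pct(year) // 100
    return cornell, budget_usd - cornell


# Budgets stated in this paper: $400,000 for 2010, $500,000 expected by 2012.
for year, budget in [(2010, 400_000), (2012, 500_000)]:
    cornell, external = split_budget(year, budget)
    print(f"{year}: Cornell ${cornell:,}, institutional support ${external:,}")
```

The 2011 budget is not stated in this paper, so it is left as a parameter rather than guessed.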

### 2.5 Development of a long-term business plan

We realize that our business model needs to be responsive to the
shifting ecology of scholarly publishing. As we continue to investigate
long-term sustainability issues, our goal through this document is to
articulate potential interim strategies. The sustainability scenario we
propose represents a short-term strategy for the next three years. Our
initial approach will be to collaborate with the libraries of the
heaviest user institutions in the US and abroad in our effort to
reposition arXiv as a vested online scholarly resource, an asset with
shared benefits and accountability.

Over the next few years we will develop a long-term business plan that
provides a strategic framework to protect and increase the value of
arXiv for those who use it. Ideally this will comprise a blend of
ongoing underwriting from Cornell University Library and support from
the academic library community and research centers. It might also
include support from scholarly societies, an endowment, or funding
agencies such as the NSF. We will strengthen existing collaborations
(e.g. with the INSPIRE project of CERN, SLAC, DESY and Fermilab) and
develop additional partnerships that allow arXiv to provide better
services or to share the support burden. Advice from the Sustainability
Advisory Group (see [section 2.7](#sec2_7)) and other supporting
institutions will be used in developing this long-term business plan.

### 2.6 Alternative revenue sources

In the process of investigating business models we have considered many
options. For a good overview see the [Ithaka report on Sustainability
and Revenue Models for Online Academic
Resources](http://www.ithaka.org/ithaka-s-r/strategy/sustainability-and-revenue-models-for-online-academic-resources).
Alternative revenue sources considered included:

**Requesting donations at time of submission —**  
We have no plans to impose article processing charges or submission
fees. Barrier-free submission and use is one of the founding principles
of arXiv. We have considered requesting donations at time of submission
but have concluded that such fundraising would incur greater overhead
than the institutional support model, and would not engage our peer
institutions. We also want to ensure broad international contributions
to the repository without financial expectations from the authors.

**SCOAP3 —**  
arXiv would potentially be a beneficiary of redirected funding
administered by the [Sponsoring Consortium for Open Access Publishing in
Particle Physics](http://scoap3.org/) (SCOAP3). It is not clear,
however, when this initiative will meet its annual funding goal
of €10,000,000 ($14,120,000). It should also be noted that the SCOAP3
initiative is restricted to HEP and particle physics content only, which
represents between 18% and 40% of submissions to arXiv (depending on how
broadly the subject area is construed). If SCOAP3 is successful, it could
potentially subvent a similar fraction of arXiv's operating costs. We
will continue to monitor the development of SCOAP3 and its impact on our
long-term plans.

### <span id="sec2_7">2.7 Governance</span>

arXiv is maintained and operated by the Cornell University Library, with
guidance from the [arXiv Scientific Advisory
Board](http://arxiv.org/help/scientific_ad_board) and several subject
Advisory Committees. Additionally, the [arXiv Sustainability Advisory
Group](sustainability_advisory_group) provides an essential consultative
role in developing diverse sustainability strategies for arXiv.

### 2.8 Benefits for supporting institutions

Contributions will be openly acknowledged on the arXiv.org website.
Within the first year arXiv will add banners recognizing institutional
support (*"Your access to arXiv is supported by University X"* or
similar) and support for local OpenURL based services. Both of these
features will be based on the IP address information supplied by the
supporting institution. We will develop plans for additional benefits in
consultation with supporting institutions, including more detailed usage
statistics.

### 2.9 Why should my institution support arXiv?

The recent [Ithaka report on
sustainability](http://www.ithaka.org/ithaka-s-r/strategy/sustainability-and-revenue-models-for-online-academic-resources)
provides a comprehensive review of a variety of business models for
supporting online academic resources. This report defines sustainability
as *"the ability to generate or gain access to the resources, financial
or otherwise, needed to protect and increase the value of the content or
service for those who use it."* Therefore, keeping open access academic
resources such as arXiv sustainable involves not only covering their
costs but also continuing to enhance their value based on the needs of
the user community. Such a financial commitment is likely to be beyond a
single institution's resources.

arXiv has been one of the most important disruptive innovations in
scholarly communications since the advent of the Internet. Its
preemptive dissemination model represented the first significant means
to provide expedited access to scientific research well ahead of formal
publication. It remains an exemplar in the open-access debate.

arXiv is a primary destination site for both authors and readers in its
core domains within physics and math. If a case can be made for any
repository being community-supported, arXiv has to be at the top of the
list. We believe that arXiv sustainability should be considered a shared
investment in a culturally embedded resource that provides unambiguous
value to a global network of science researchers.

## 3. Technical Architecture

The arXiv software was developed in-house at the Los Alamos National
Laboratory and Cornell over the past 18 years. The software is
predominantly written in Perl with components that use Java, PHP and
Python. Metadata and user information are stored in a MySQL database, and
Lucene is used to provide the search service. A key focus of software
development has been to automate the operation of arXiv as much as
possible, and continual improvement has been necessary to keep the
administrative staff requirement from increasing as the number of
incoming submissions has steadily increased. The three server machines
that provide the main arXiv.org site are supported by Cornell's central
IT organization with 24x7 support. Mirror sites are locally supported
and receive updates daily.

### <span id="sec3_1">3.1 Technical plans for maintaining and advancing arXiv</span>

While the underlying technology has been updated throughout its 18-year
history, the system requires significant internal re-engineering to
support an evolving technological landscape, to accommodate increased
growth and use, and to ensure the sustainability of the service. arXiv's
success has
relied upon a highly efficient use of both author and administrative
effort, and has served its large and ever-growing user base with only a
fixed-size skeletal staff. In this respect, it long anticipated many of
the current "Web 2.0" and social networking trends: providing a
framework in which a community of users can deposit, share and annotate
content. It also helped initiate, and continues to play a leading role
in, the growing movement for open access to scholarly literature.

**Improve submission system —**  
We expect to roll out a complete revision of the arXiv submission system
early in 2010. This will provide much improved user interaction during
the submission process and will also streamline the workflow for
moderators and administrators, giving much greater flexibility when
handling submissions with technical or classification issues.
Underpinning this new system are several infrastructural improvements
which will facilitate later development and/or platform migration.

**Support for associated data and information objects —**  
Digital data and associated multimedia information such as images and
audio/video are becoming an integral part of scientific publications. To
maintain its innovation role in scholarly communication, it is essential
for arXiv to develop features in support of the deposit and archiving of
supplementary information objects that are associated with a given
paper.

**Scalable and expandable architecture for sustainability —**  
The arXiv software has been developed in-house over many years and this
has both benefits and burdens associated with it. To keep arXiv
sustainable, it is important to re-engineer the software to layer
arXiv-specific functionality over generic repository software. Creating
a generalized architecture will facilitate efficient technology
management processes and allow the implementation of digital
preservation procedures and policies.

### 3.2 Subject area coverage and expansion

The Cornell University Library frequently receives requests to extend
arXiv to include other subject areas. In recent years we have added the
fields of quantitative biology, statistics and quantitative finance;
requests currently under consideration include mechanics and some areas
of engineering. Due to limited resources, we have adopted a measured
approach to expansion because there is significant organizational and
administrative effort required both to create and to maintain new
subject areas. Adding a new subject area involves exploring the
user-base and use characteristics pertaining to the subject area,
establishing the necessary advisory committees, and recruiting
moderators.

Although arXiv.org is the central portal for scientific communication in
some disciplines, it is neither feasible nor necessarily desirable to
play that role in all disciplines. However, arXiv can provide a model
for other communities through improved service to its existing dedicated
user communities, and act as an essential component of a global
networked scholarly communication system. We anticipate that system will
become increasingly broad in its subject area coverage, and increasingly
diverse in its component databases, repositories, and other online tools
and services.

### 3.3 Enduring access

The first element in our plan is to ensure the long-term sustainability
of arXiv as a service. This requires a solid business plan to support
maintenance and expansion of the system. The second element is digital
preservation of arXiv content.

Digital preservation refers to a range of managed activities to support
the long-term maintenance of bitstreams to ensure that digital objects
are usable (intact and readable), retaining all qualities of
authenticity, accuracy, and functionality deemed to be essential when
articles (and other associated materials) were ingested. Formats
accepted by arXiv have been selected based on their archival value
(TeX/LaTeX, PDF, HTML, OOXML) and the ability to process all source
files is actively monitored. The underlying bits are protected by
standard backup procedures at the Cornell campus, and off-site backup
facilities in New York City provide geographic redundancy. The complete
content is replicated at our mirror sites around the world, and
additional managed tape backups are taken at Los Alamos National
Laboratory.

The Cornell University Library is developing an archival repository that
will support preservation of critical content from institutional
resources including arXiv. All arXiv documents, both in source and
processed form, will be stored in this repository by the end of 2010.
There will be ongoing incremental ingest of new material. We expect that
the preservation costs for arXiv will be borne by the Cornell University
Library leveraging the archival infrastructure developed for the library
system.
