FEATURE ARTICLE
PHILOSOPHY OF BIOLOGY
The challenges of big data biology
Abstract The availability of big data has the potential to transform many areas of the life sciences and
usher in new ways of doing research. Here, I argue that big data biology also raises fundamental
questions in the philosophy of science: for example, what is a good dataset, and how can reliable
knowledge be extracted from big data? Collaborations between biologists, data scientists and
philosophers of science will help us to answer these and other questions.
SABINA LEONELLI
Leonelli. eLife 2019;8:e47381. DOI: https://doi.org/10.7554/eLife.47381

Copyright Leonelli. This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

The life sciences have a long history of dealing with large quantities of data, and recent advances in experimental capabilities have vastly increased the amount of data that needs to be stored and analysed. The computational power available to researchers has also improved over time, but the volume and heterogeneity of the data regularly outstrip the strategies and tools available for their collection and analysis. Moreover, the volume of the data now available, especially in ‘omics’ fields, raises fundamental questions about the research process, such as the role of theory, the importance of context, and the purpose of know-how in data interpretation.

For example, there is widespread debate around the extent to which a scientist needs to be familiar with the protocols and instruments used to generate the data, and the relevant biology of the organisms at hand, in order to be able to interpret data. The extent to which algorithms can reliably identify causal links in data also remains a matter of contention: discovering that a specific gene pathway is frequently associated with a particular phenotypic trait is not the same as understanding why that is the case and whether the pathway is causing the trait.

There are many other questions that are of interest to philosophers of science. Does a reliance on big data change the very idea of biological discovery and what counts as biological knowledge? What role do theories play in data-intensive research and how does big data biology relate to hypothesis-driven, observational and exploratory research? How does the automation of data analysis affect the reliability of results? What is the difference between data and noise, and what are data in the first place?

Biologists might think that these questions, while interesting and important, are somewhat removed from – and possibly irrelevant to – their everyday work. In this article, I aim to counter this perception by highlighting philosophical insights that can help to confront some of the key challenges related to the use of big data in biology.

Big data biology meets biological pluralism

Biology is notoriously fragmented in its methods, goals, instruments and conceptual frameworks. Often, different groups – even within the
same subfield – disagree over preferred terminology, research organisms, and experimental methods and protocols (Leonelli, 2012). As a consequence, one term may be used to refer to different processes, or different definitions may apply to the same term. This profound fragmentation, which philosophers call pluralism (Kellert et al., 2006), is reflected within the many technologies and domain-specific standards that are used to generate, store, share and analyse data (O’Malley and Soyer, 2012). Finding ways to tackle pluralism is a key challenge for big data biology.

It is easy to dismiss these difficulties as purely technical matters that can be overcome by, for example, using interoperable databases and file formats to integrate data from different sources so that they can be used and re-used across a variety of research contexts. However, there are deeper conceptual and philosophical difficulties. Databases need to be accessed through a common ‘query’ system, and this raises the questions of which terminologies should be used to classify the data and integrate them with other data, and what the implications of such choices are. The considerable labour involved in devising credible retrieval systems for biological databases speaks to the difficulty of this task, which is illustrated by the lively debates over the definitions of terms such as ‘pathogen’ and ‘metabolism’ on the Gene Ontology database (The Gene Ontology Consortium, 2019).

The implications for big data biology are substantive. Far from being ‘the end of theory’, the computational mining of big data involves significant theoretical commitments. The choice and definition of keywords used to classify and retrieve data matters enormously to their subsequent interpretation. Linking diverse datasets means making decisions about the concepts through which nature is best represented and investigated. In other words, the networks of concepts associated with data in big data infrastructures should be viewed as theories: ways of seeing the biological world that guide scientific reasoning and the direction of research, which are often revised to take into account new discoveries (Leonelli et al., 2011). The quest for large-scale data integration makes it necessary for all biological disciplines to identify such theories and debate their implications for the modelling and analysis of big data (Leonelli, 2016).

Philosophers have long discussed the theoretical significance of classification and naming practices in biology (Dupré, 2001), often in collaboration with taxonomists, and occasionally with molecular and developmental biologists. For example, researchers have attributed multiple meanings to the gene concept, which philosophers have documented and articulated as part of a broader investigation of the intellectual foundations and implications of the ‘molecular bandwagon’ that has dominated the last 50 years of biological research (Griffiths and Stotz, 2013; Rheinberger and Müller-Wille, 2017). These studies demonstrated that biological concepts – no matter how loosely defined – are always embedded in broader theoretical perspectives on how nature works (Callebaut, 2012).

This is not to say that big data biology is fully determined by pre-existing hypotheses. Rather, it draws on current theories and hypotheses but does not let them predetermine research outcomes (Waters, 2007). It is also important to note that, no matter which method is used to generate them, observations and measurements are always situated in a specific framework (Bogen, 2013). Irrespective of how standardised they are, the instruments used to generate those data are built to satisfy specific research agendas (Rheinberger, 2011). This means that we need to acknowledge that no data are ‘raw’ in the sense of being independent from human interpretation. Moreover, data can be processed differently. It is thus important to understand the conceptual choices that shaped the production and classification of data. Researchers using big data need to recognise that the theoretical structures that informed the production and processing of the data will influence their future use.
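
The point that the choice of keywords shapes what a query retrieves can be made concrete with a toy sketch. The Python example below is purely illustrative and is not taken from the Gene Ontology or any real database: the records, the labels, the lab names and the two mappings (`NARROW_MAPPING`, `BROAD_MAPPING`) are all invented assumptions. It shows only that the same question returns different data depending on a prior decision about which labels fall under a shared concept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str
    source: str  # hypothetical lab or database name
    label: str   # term chosen by the data producers


# Hypothetical records; the labels are invented for illustration only.
RECORDS = [
    Record("r1", "lab_A", "programmed cell death"),
    Record("r2", "lab_B", "apoptosis"),
    Record("r3", "lab_C", "necrosis"),
    Record("r4", "lab_A", "cell growth"),
]

# Two alternative curation decisions about which labels belong under the
# shared concept "cell death". Neither is claimed to be the correct one.
NARROW_MAPPING = {
    "cell death": {"programmed cell death", "apoptosis"},
}
BROAD_MAPPING = {
    "cell death": {"programmed cell death", "apoptosis", "necrosis"},
}


def query_exact(records, term):
    """Return ids of records whose label is literally the query term."""
    return [r.record_id for r in records if r.label == term]


def query_concept(records, concept, mapping):
    """Return ids of records whose label the mapping files under the concept."""
    labels = mapping.get(concept, set())
    return [r.record_id for r in records if r.label in labels]


if __name__ == "__main__":
    print(query_exact(RECORDS, "apoptosis"))                     # ['r2']
    print(query_concept(RECORDS, "cell death", NARROW_MAPPING))  # ['r1', 'r2']
    print(query_concept(RECORDS, "cell death", BROAD_MAPPING))   # ['r1', 'r2', 'r3']
```

The three result sets differ although a user might regard all three queries as asking about the same phenomenon. Which set is appropriate depends on the research purpose, and the decision is built into the mapping before any query is run.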
One might ask if pluralism is an obstacle to the integration of data from different sources and to the extraction of reliable and accurate knowledge from these data. Philosophers of science have argued that pluralism may actually be beneficial when attempting to extract knowledge about the highly complex and variable processes encountered in the life sciences (Dupre, 1993; Mitchell, 2003). Fragmented research traditions arise from centuries of fine-tuning research tools in order to study a given process or species in as much detail as possible. While this makes it more challenging to generalise these tools and the resulting knowledge (Levins, 1984; Wimsatt, 2007), it also ensures that the data collected are robust and inferences are accurate (Longino, 2013; Wylie, 2017). It is crucial for big data biology to build on this legacy by creating ways to work with data from diverse sources without misinterpreting their provenance or losing the insights they provide into the complexity of life.

Assessing the quality of data

Biologists often have feelings of unease about the quality of data and metadata found in online databases, particularly when the relevant databases are not curated by experts in the specific field and/or organism. Many databases are not peer reviewed or curated, and even when they are, assessments of quality and reliability are often specific to certain fields of research and cannot easily be transferred to other research fields or other kinds of studies in the same research field (Floridi and Illari, 2014; Leonelli, 2017). The potential for loss of data quality grows the more databases become interoperable, since extensive data linkage makes it possible for unreliable data sources to pollute the overall reliability of online data collections.

This is another realm where pluralism seems to be a problem for big data biology. Does a lack of consensus on how to assess the quality of data signal a distinctive weakness of how biology can (and should) engage in big data research? One way to answer is to challenge the very understanding of the data on which this question is grounded. Thinking of data as being intrinsically good or bad – independent of context and goals of inquiry – means thinking of them as being static representations of nature that are useful because they accurately and objectively document a feature of the world at a particular time and place. This view certainly motivates the search for definitive, universal and context-independent ways of assessing which data are reliable and which are not. But it does not take into account that data are often extensively processed artefacts resulting from highly planned interactions with the world; nor does it do justice to the observation that biologists have different views of what counts as reliable data, or what counts as data in the first place (Borgman, 2015; Leonelli, 2016). Thus, what counts as noise for one community and/or research purpose can sometimes count as data for another (McAllister, 2011; Loettgers, 2009; Woodward, 2010).

Building on these insights, I have argued that data are ‘relational’: in other words, the objects that best serve as data can change depending on the standards, goals and methods used to generate, process and interpret those objects as evidence (Leonelli, 2016). This explains why assessments of data quality always relate to a specific investigation. It also accounts for the reluctance of researchers to trust data sources whose history is not clearly documented, and the related drive to collect metadata about data provenance.

Data scientists sometimes underestimate the importance of linking databases to the physical samples from which the data were originally collected (such as specimens, tissues, and cell and microbial cultures). It has been shown that access to original samples enhances data reproducibility and provides researchers with better opportunities to replicate experiments and reuse data (Dietrich et al., 2014). Access to original samples also provides a concrete point of contact between research traditions and approaches, through which differences can be identified and critically examined (Leonelli and Ankeny, 2012).

Accepting a relational view of data means moving away from generic approaches to data curation towards context-sensitive approaches that include fine-grained descriptors for the data, even though this may slow down the pace of research (Leonelli and Tempini, 2018). At the same time, recognising the local and situated nature of big data selection helps to identify what conclusions can be drawn from data analysis, and evaluate how to generalise particular inferences. This is particularly useful when assessing whether a given extrapolation can be extended from one species to another (say, from rats to humans).

There is no doubt that big data mining has a powerful heuristic function: it is often the first step in any biological inquiry, helping to define
the direction and scope of research (Nickles, 2018). Big data enable biologists to spot patterns and trends more effectively, and indeed, philosophers are starting to explore how data mining can help to explore, develop and verify mechanistic hypotheses (Pietsch, 2016; Ratti, 2015; Canali, 2019). At the same time, the relational view highlights how the interpretation and reliability of inferences from big data depend on two crucial factors: first, regular confrontation with other research methods, models and approaches (Elliott et al., 2016); and second, thoughtful contextualisation of data with respect to shifts in the perspective, goals and methods of investigators (Shavit and Griesemer, 2009). Taking a relational view of data means taking seriously the value- and theory-laden history of data objects. It also promotes efforts to document that history within databases, so that future data users can assess the quality of data for themselves and according to their own standards. A case in point is a recent collaboration between a taxonomist and a philosopher on the value of ambiguity in the labels used for data in biodiversity research (Sterner and Franz, 2017).

Automated data analysis is an exciting prospect for biological discovery. Far from making human judgement unnecessary, the increasing power of computational algorithms requires a proportional increase in critical thinking. Collaboration between philosophers and biologists can foster essential reflection on which parts of data browsing and integration should be conducted with the help of algorithms, and how results should be interpreted. Collaboration between philosophers and bioinformaticians (and other types of data scientists) can promote the development of data infrastructures that adequately capture the provenance of data, and encourage users to assess the quality and relevance of data in relation to their research questions.

Note
This Feature Article is part of the Philosophy of Biology collection.

Sabina Leonelli is in the Department of Sociology, Philosophy and Anthropology, University of Exeter, Exeter, United Kingdom
[email protected]
https://orcid.org/0000-0002-7815-6609

Competing interests: The author declares that no competing interests exist.

Published 05 April 2019

Funding
Funder                            Grant reference number    Author
H2020 European Research Council   335925                    Sabina Leonelli
Australian Research Council       DP160102989               Sabina Leonelli
Alan Turing Institute             EP/N510129/1              Sabina Leonelli

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

References
Bogen J. 2013. Theory and observation in science. The Stanford Encyclopedia of Philosophy. http://plato.stanford.edu/archives/spr2013/entries/science-theory-observation/ [Accessed March 22, 2019].
Borgman C. 2015. Big Data, Little Data, No Data. MIT Press.
Callebaut W. 2012. Scientific perspectivism: a philosopher of science’s response to the challenge of big data biology. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences 43:69–80. DOI: https://doi.org/10.1016/j.shpsc.2011.10.007, PMID: 22326074
Canali S. 2019. Evaluating evidential pluralism in epidemiology: mechanistic evidence in exposome research. History and Philosophy of the Life Sciences 41:4. DOI: https://doi.org/10.1007/s40656-019-0241-6, PMID: 30756196
Dietrich MR, Ankeny RA, Chen PM. 2014. Publication trends in model organism research. Genetics 198:787–794. DOI: https://doi.org/10.1534/genetics.114.169714, PMID: 25381363
Dupre JA. 1993. The Disorder of Things: Metaphysical Foundations of the Disunity of Science. Harvard University Press.
Dupré J. 2001. In defence of classification. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences 32:203–219. DOI: https://doi.org/10.1016/S1369-8486(01)00003-6
Elliott KC, Cheruvelil KS, Montgomery GM, Soranno PA. 2016. Conceptions of good science in our data-rich world. BioScience 66:880–889. DOI: https://doi.org/10.1093/biosci/biw115, PMID: 29599533
Floridi L, Illari P. 2014. The Philosophy of Information Quality. Springer.
Griffiths P, Stotz K. 2013. Genetics and Philosophy: An Introduction. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9780511744082
Kellert SH, Longino HE, Waters CK. 2006. Introduction: the pluralist stance. In: Kellert SH, Longino HE, Waters CK (Eds). Scientific Pluralism. University of Minnesota Press.
Leonelli S, Diehl AD, Christie KR, Harris MA, Lomax J. 2011. How the gene ontology evolves. BMC Bioinformatics 12:325. DOI: https://doi.org/10.1186/1471-2105-12-325, PMID: 21819553
Leonelli S. 2012. When humans are the exception: cross-species databases at the interface of biological and clinical research. Social Studies of Science 42:214–236. DOI: https://doi.org/10.1177/0306312711436265, PMID: 22848998
Leonelli S. 2016. Data-Centric Biology: A Philosophical Study. Chicago University Press.
Leonelli S. 2017. Global data quality assessment and the situated nature of “best” research practices in biology. Data Science Journal 16:32. DOI: https://doi.org/10.5334/dsj-2017-032
Leonelli S, Ankeny RA. 2012. Re-thinking organisms: the impact of databases on model organism biology. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences 43:29–36. DOI: https://doi.org/10.1016/j.shpsc.2011.10.003
Leonelli S, Tempini N. 2018. Where health and environment meet: the use of invariant parameters in big data analysis. Synthese 29. DOI: https://doi.org/10.1007/s11229-018-1844-2
Levins R. 1984. The strategy of model building in population biology. In: Sober E (Ed). Conceptual Issues in Evolutionary Biology. MIT Press. p. 18–27.
Loettgers A. 2009. Synthetic biology and the emergence of a dual meaning of noise. Biological Theory 4:340–356. DOI: https://doi.org/10.1162/BIOT_a_00009
Longino H. 2013. Studying Human Behaviour. University of Chicago Press.
McAllister JW. 2011. What do patterns in empirical data tell us about the structure of the world? Synthese 182:73–87. DOI: https://doi.org/10.1007/s11229-009-9613-x
Mitchell S. 2003. Biological Complexity and Integrative Pluralism. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9780511802683
Nickles T. 2018. Alien reasoning: is a major change in scientific research underway? Topoi 16. DOI: https://doi.org/10.1007/s11245-018-9557-1
O’Malley MA, Soyer OS. 2012. The roles of integration in molecular systems biology. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences 43:58–68. DOI: https://doi.org/10.1016/j.shpsc.2011.10.006, PMID: 22326073
Pietsch W. 2016. The causal nature of modeling with big data. Philosophy & Technology 29:137–171. DOI: https://doi.org/10.1007/s13347-015-0202-2
Ratti E. 2015. Big data biology: between eliminative inferences and exploratory experiments. Philosophy of Science 82:198–218. DOI: https://doi.org/10.1086/680332
Rheinberger H-J. 2011. Infra-experimentality: from traces to data, from data to patterning facts. History of Science 49:337–348. DOI: https://doi.org/10.1177/007327531104900306
Rheinberger H-J, Müller-Wille S. 2017. The Gene: From Genetics to Postgenomics. University of Chicago Press.
Shavit A, Griesemer J. 2009. There and back again, or the problem of locality in biodiversity surveys. Philosophy of Science 76:273–294. DOI: https://doi.org/10.1086/649805
Sterner B, Franz NM. 2017. Taxonomy for humans or computers? Cognitive pragmatics for big data. Biological Theory 12:99–111. DOI: https://doi.org/10.1007/s13752-017-0259-5
The Gene Ontology Consortium. 2019. The gene ontology resource: 20 years and still GOing strong. Nucleic Acids Research 47:D330–D338. DOI: https://doi.org/10.1093/nar/gky1055, PMID: 30395331
Waters CK. 2007. The nature and context of exploratory experimentation: an introduction to three case studies of exploratory research. History and Philosophy of the Life Sciences 29:275–284. PMID: 18822658
Wimsatt W. 2007. Re-Engineering Philosophy for Limited Beings: Piecewise Approximations to Reality. Harvard University Press. DOI: https://doi.org/10.1007/s10539-010-9199-1
Woodward J. 2010. Data, phenomena, signal, and noise. Philosophy of Science 77:792–803. DOI: https://doi.org/10.1086/656554
Wylie A. 2017. How archaeological evidence bites back: strategies for putting old data to work in new ways. Science, Technology & Human Values 42:203–225. DOI: https://doi.org/10.1177/0162243916671200
© 2019, Leonelli. This work is published under
http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding
the ProQuest Terms and Conditions, you may use this content in accordance
with the terms of the License.