Introduction

piikun is a Python package for the analysis and visualization of species delimitation models in an information theoretic framework that provides a true distance or metric space for these models, based on the variation of information criterion of Meila (2007). The name piikun is from a Kumeyaay (Ipai) word for "sparrowhawk", in homage to the indigenous people of Southern California, on whose land I live and work and which has become my home.

The species delimitation models being analyzed may be generated by any inference package, such as BP&P, SNAPP, DELINEATE, etc., or constructed from taxonomies or classifications based on conceptual descriptions in the literature, geography, folk taxonomies, etc. Regardless of source or basis, each species delimitation model can be considered a partition of taxa or lineages and thus can be represented in a dedicated and widely supported data exchange format, "SPART-XML", which piikun takes as one of its input formats, in addition to DELINEATE.

For every collection of species delimitation models, piikun generates a set of partition profiles, partition comparison tables, and a suite of graphical plots visualizing the data in these tables. The partition profiles report unitary information theoretic and other statistics for each of the species delimitation partitions, including the probability and entropy of each partition following [@meila2007comparing].

The partition comparison tables, on the other hand, provide a range of bivariate statistics for every distinct pair of partitions, including the mutual information, joint entropy, etc., as well as information theoretic distance statistics that are true metrics on the space of species delimitation models: the variation of information [@meila2007comparing] and the normalized joint variation of information distance [@vinh2010information].
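As a rough illustration of the quantities named above, the following sketch computes the entropy of a partition, the mutual information of a pair of partitions, the variation of information distance, and a variant normalized by the joint entropy. It uses the standard textbook definitions with natural logarithms; the function names are hypothetical and this is not piikun's own implementation.

```python
import math
from collections import Counter


def _labels(partition):
    """Map each element to the index of the subset that contains it."""
    return {element: i for i, subset in enumerate(partition) for element in subset}


def entropy(partition):
    """Entropy H(P) of a partition, in nats, from the relative subset sizes."""
    n = sum(len(subset) for subset in partition)
    return -sum((len(s) / n) * math.log(len(s) / n) for s in partition if s)


def mutual_information(p1, p2):
    """Mutual information I(P1; P2), from the overlap counts between subsets."""
    l1, l2 = _labels(p1), _labels(p2)
    if l1.keys() != l2.keys():
        raise ValueError("partitions must be defined on the same elements")
    n = len(l1)
    joint = Counter((l1[e], l2[e]) for e in l1)
    size1, size2 = Counter(l1.values()), Counter(l2.values())
    return sum(
        (c / n) * math.log(c * n / (size1[i] * size2[j]))
        for (i, j), c in joint.items()
    )


def vi_distance(p1, p2):
    """Variation of information: H(P1) + H(P2) - 2 I(P1; P2)."""
    vi = entropy(p1) + entropy(p2) - 2 * mutual_information(p1, p2)
    return max(0.0, vi)  # clamp tiny negative rounding errors


def vi_normalized(p1, p2):
    """Variation of information divided by the joint entropy H(P1, P2)."""
    joint_entropy = entropy(p1) + entropy(p2) - mutual_information(p1, p2)
    if joint_entropy <= 0:
        return 0.0  # both partitions consist of a single subset
    return vi_distance(p1, p2) / joint_entropy


# Hypothetical partitions of four lineages.
split = [["a", "b"], ["c", "d"]]
lumped = [["a", "b", "c", "d"]]
print(vi_distance(split, lumped))    # about 0.693, i.e. log(2)
print(vi_normalized(split, lumped))
```

Both distances are zero for identical partitions and symmetric in their arguments, which is part of what makes them usable as metrics.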

More details regarding the theory, metrics, and measures implemented here can be found in Sukumaran and Meila (2024).

Installation

Installing from the GitHub Repositories

We recommend that you install directly from the main GitHub repository using pip (which works with an Anaconda environment as well):

$ python3 -m pip install --user --upgrade git+https://github.com/jeetsukumaran/piikun.git

or

$ python3 -m pip install --user --upgrade git+git://github.com/jeetsukumaran/piikun.git

Usage

Following the generation, definition, or conceptualization of the species delimitation partitions (see below), a typical analysis would consist of:

1. Compiling the partition definitions and associated information from the results of the species delimitation model analyses.
2. Analyzing the partition data to calculate the various measures of information for each partition and the associated distances between each distinct pair of partitions.
3. Visualizing the results.

Generating the Partitions

A typical piikun analysis starts after generating or otherwise conceptualizing the source species delimitation models to be analyzed. These can be the results of one or more DELINEATE, BPP, or other species delimitation analyses run on a dataset that, while needing to be invariant in terms of lineage/population concepts, may vary in the statistical data used. For example, one may imagine a dataset consisting of 40 samples that we are organizing into higher-level units (e.g., a sample of individuals into demes in a BPP A11 analysis, or a sample of putative demes into species units in a BPP A10 or DELINEATE analysis). We may use different kinds of data to represent these lineages in these and other related analyses. We might run multiple BPP analyses using different markers, or we might compare our work on one set of markers to another study that used a different set (or ran under different constraints).

We can compare the species delimitation results across all these analyses because the basic concepts being organized into deme or species "blocks" or "subsets" are consistent across all of them. This is why we can also include and compare results not just across different genetic datasets of the same samples, but also across analyses that integrate morphological data, or that take into account arrangements described in the literature, speculative arrangements, etc. As long as the "elements" of the subsets of the partitions map to the same concepts, the resulting partitions can be compared, regardless of the method of conceptualization or identification.
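The requirement that all partitions be defined on the same elements can be checked before any comparison is attempted. The sketch below is one way to do so in plain Python; the helper names and the population labels are hypothetical and are not part of piikun.

```python
def partition_elements(partition):
    """Return the set of elements in a partition, rejecting overlapping subsets."""
    seen = set()
    for subset in partition:
        for element in subset:
            if element in seen:
                raise ValueError(f"element {element!r} appears in more than one subset")
            seen.add(element)
    return seen


def comparable(partitions):
    """True if every partition is defined on exactly the same set of elements."""
    element_sets = [partition_elements(p) for p in partitions]
    return all(s == element_sets[0] for s in element_sets[1:])


# Two hypothetical delimitations of the same four populations, from different data.
genetic = [["pop1", "pop2"], ["pop3", "pop4"]]
morphological = [["pop1"], ["pop2", "pop3", "pop4"]]
# A delimitation that is missing one population cannot be compared with them.
incomplete = [["pop1", "pop2"], ["pop3"]]

print(comparable([genetic, morphological]))
print(comparable([genetic, incomplete]))
```

The first call reports that the two delimitations can be compared; the second does not, because "pop4" is absent from the third partition.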

Quick Start: Single-Step Analysis

The program piikun-analyze connects together three separate programs (discussed individually below): piikun-compile, piikun-evaluate, and piikun-visualize. You can run the entire pipeline on one of the provided example datasets by:

$ cd examples/
$ piikun-analyze -f delineate data/maddison-2020/lionepha-p090-hky.mcct-mean-age.delimitation-results.json

Tool Details

Compiling the Partitions: Extracting the Partition Data from the Species Delimitation Model Sources

piikun-compile is a command-line program that parses and formats data about species delimitation models from various sources and concatenates them into a common .json-formatted datastore. piikun-compile takes as its input a collection of partitions in one of the following data formats, specified using the -f or --source-format option:

"delineate": DELINEATE results

This specifies the primary .json results files produced by DELINEATE as sources.

$ piikun-compile -f delineate delineate-results.json
$ piikun-compile --source-format delineate delineate-results.json

"bpp-a11": BPP (A11 mode) format

This specifies the output log files from BPP as sources.

$ piikun-compile -f bpp-a11 output.txt
$ piikun-compile --source-format bpp-a11 output.txt

"spart-xml": SPART-XML

This specifies the "SPART"-format XML as sources.

$ piikun-compile -f spart-xml data.xml
$ piikun-compile --source-format spart-xml data.xml

"json-dict": Generic JSON dictionary

This specifies the sources will be dictionaries in JSON format, with a specific set of keys/elements (see below for details).
```
$ piikun-compile -f json-dict data.json
$ piikun-compile --source-format json-dict data.json
```
"json-list": Generic JSON list (of lists)

This specifies the sources will be lists of lists in JSON format (see below for details).
```
$ piikun-compile -f json-list data.json
$ piikun-compile --source-format json-list data.json
```

The output file names and paths can be specified by using the -o/--output-title and -O/--output-directory options.

Additional information can be added using the "--set-property" flag. For example, the following adds information regarding the source that can be referenced in visualizations and analysis downstream:

$ piikun-compile \
    -f delineate delineate-results.json \
    --set-property n_genes:143 \
    --set-property hypothesis:geographical

See --help for details on this and other options.

Collating and Combining Multiple Sources

The data files produced by piikun-compile can each be analyzed directly by piikun-evaluate. To analyze data from multiple source formats, run piikun-compile on each source type separately, generating a piikun partition JSON data file for each one, and then run piikun-compile on all of these results to generate a single dataset.

# Produces: ``delineate1-results.partitions.json``, ``delineate2-results.partitions.json``
$ piikun-compile -f delineate delineate1-results.json delineate2-results.json

# Produces: ``bpp1.out.partitions.json``, ``bpp2.out.partitions.json``
$ piikun-compile -f bpp-a11 bpp1.out.txt bpp2.out.txt

# Produces unified dataset for analysis: "``concated-data.partitions.json``"
$ piikun-compile \
    delineate1-results.partitions.json \
    delineate2-results.partitions.json \
    bpp1.out.partitions.json \
    bpp2.out.partitions.json \
    -o concated-data

See --help for details on this and other options, such as setting the output file names and paths using the -o/--output-title and -O/--output-directory, etc.

`piikun-evaluate`: Calculate Statistics and Distances

This command carries out the main calculations of this package. It takes as its input the .partitions.json data file produced by piikun-compile.

# Produces: ``delineate1-results.partitions.json``, ``delineate2-results.partitions.json``
$ piikun-compile -f delineate delineate1-results.json delineate2-results.json

# Produces: ``bpp1.out.partitions.json``, ``bpp2.out.partitions.json``
$ piikun-compile -f bpp-a11 bpp1.out.txt bpp2.out.txt

# Independent/separate comparative analysis of species
# delimitation models from multiple sources
$ piikun-evaluate delineate1-results.partitions.json
$ piikun-evaluate delineate2-results.partitions.json
$ piikun-evaluate bpp1.out.partitions.json
$ piikun-evaluate bpp2.out.partitions.json

# Single joint analysis of species delimitation models
# from multiple sources

# Combine species delimitation models from multiple sources
# into single data file: ``concated-data.partitions.json``
$ piikun-compile \
    delineate1-results.partitions.json \
    delineate2-results.partitions.json \
    bpp1.out.partitions.json \
    bpp2.out.partitions.json \
    -o concated-data

# Joint/single comparative analysis of species
# delimitation models from multiple sources
$ piikun-evaluate concated-data.partitions.json

$ piikun-evaluate \
    -o project42 \
    -O analysis_dir \
    data.partitions.json
$ piikun-evaluate \
    --output-title project42 \
    --output-directory analysis_dir \
    data.partitions.json

See --help for details on this and other options, such as setting the output file names and paths using the -o/--output-title and -O/--output-directory, etc.

The number of partitions read from the input set can be restricted to the first $n$ partitions using the --limit-partitions option:
```
$ piikun-evaluate \
    --source-format delineate \
    --output-title project42 \
    --output-directory analysis_dir \
    --limit-partitions 10 \
    delineate-results.json
```
This option is particularly useful when the number of partitions in the input is large and/or most of the partitions in the input set are not of interest. For example, a typical DELINEATE analysis may generate hundreds if not thousands of partitions, most of which are low-probability ones of little practical interest. Using the --limit-partitions option focuses the analysis on just the subset of interest, which reduces computation time and resources.

Output

piikun-evaluate will generate two data files (named and located based on the -o/--output-title and -O/--output-directory options):

output-directory/output-title-profiles.tsv
output-directory/output-title-comparisons.tsv

The first file provides univariate statistics for each partition; the second provides a mix of univariate and bivariate statistics for each pair of partitions.

Both of these files can be loaded directly as pandas data frames for more detailed analysis:

>>> import pandas as pd
>>> df1 = pd.read_json(
...     "output-directory/output-title-comparisons.json",
... )

The __distances file includes the variation of information distance statistics: vi_distance and vi_normalized_kraskov.

`piikun_analyze.py`

This is the main pipeline for analyzing species delimitation models. It computes partition profiles and distances between partitions.

Input:

Species delimitation models in formats like DELINEATE or SPART-XML.

Output:

Profiles: Information-theoretic statistics, such as entropy, for each partition.
Distances: Pairwise comparisons between partitions, including mutual information and joint entropy.

Example Usage:

python piikun_analyze.py --source-path <models_file.spart> --output-directory <output_dir>

This script supports flexible input formats, allowing integration with various species delimitation tools.

`piikun_visualize.py`

The piikun_visualize.py script generates various visualizations of the analysis results, helping users interpret partition profiles and distances.

Available Plot Types:

Cluster Maps: Visualizes the relationships between partitions using hierarchical clustering.
Partition Score CDF: Displays the cumulative distribution of partition scores.
Profile Comparison: Plots size and entropy comparisons across partitions.

Output Formats:

Supported formats include png, jpg, pdf, and html.

Example Usage:

python piikun_visualize.py --input-file <distances.json> --plot-type clustermap --output-format png

This script provides flexible visualization capabilities, allowing users to explore the relationships between species partitions interactively or via static plots.

Reference

Standard Workflow Tool Chain

| Command | Purpose | Input | Output |
|---|---|---|---|
| `piikun-compile` | Dataset assembly | (Various) | `<title>__partitions.json` |
| `piikun-evaluate` | Scoring and comparing | `<title>__partitions.json` | `<title>__profiles.json`, `<title>__distances.json` |
| `piikun-visualize` | Plotting and visualization | `<title>__distances.json` | `<title>__distances__<visualization-name>.html`, `<title>__distances__<visualization-name>.jpg`, `<title>__distances__<visualization-name>.pdf`, `<title>__distances__<visualization-name>.png` |

Internal Data Formats

JSON Dictionaries (`piikun` format)

piikun uses structured JSON dictionaries to store analysis results, which include partition profiles and pairwise distances. Below is an example format for the output JSON:

{
  "partition_profiles": {
    "ptn1": {"size": 10, "entropy": 0.8},
    "ptn2": {"size": 12, "entropy": 0.9}
  },
  "distances": [
    {"ptn1": "A", "ptn2": "B", "distance": 0.15},
    {"ptn1": "A", "ptn2": "C", "distance": 0.20}
  ]
}
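For downstream work it is often convenient to turn the long-form list of pairwise records into a symmetric lookup table. The sketch below does this with the standard library only, using the key names from the example above ("ptn1", "ptn2", "distance"); whether actual piikun output uses exactly these keys is an assumption, and the function name is hypothetical.

```python
import json

# The example document shown above, embedded as a string for illustration.
document = json.loads("""
{
  "partition_profiles": {
    "ptn1": {"size": 10, "entropy": 0.8},
    "ptn2": {"size": 12, "entropy": 0.9}
  },
  "distances": [
    {"ptn1": "A", "ptn2": "B", "distance": 0.15},
    {"ptn1": "A", "ptn2": "C", "distance": 0.20}
  ]
}
""")


def distance_matrix(records):
    """Build a symmetric nested-dict matrix from pairwise distance records.

    Pairs that are not listed in the records are left as None.
    """
    labels = sorted({r[key] for r in records for key in ("ptn1", "ptn2")})
    matrix = {a: {b: (0.0 if a == b else None) for b in labels} for a in labels}
    for r in records:
        a, b = r["ptn1"], r["ptn2"]
        matrix[a][b] = matrix[b][a] = r["distance"]
    return labels, matrix


labels, matrix = distance_matrix(document["distances"])
for a in labels:
    print(a, [matrix[a][b] for b in labels])
```

Leaving unlisted pairs as None, rather than filling in a default value, makes gaps in the input visible instead of silently treating them as zero distances.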

Nested Lists

For simpler structures, piikun also supports nested list formats. This is often used when working with raw partitions of populations:

[
  [["pop1", "pop2", "pop3", "pop4"]],
  [["pop1"], ["pop2", "pop3", "pop4"]],
  [["pop1", "pop2"], ["pop3", "pop4"]]
]

Each nested list represents a partition structure, grouping populations or taxa for analysis.
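When delimitations are recorded as per-population species assignments rather than as grouped lists, they can be converted to the nested list layout shown above before being serialized. The following is a minimal sketch under that assumption; the function name, the species labels, and the assignments are hypothetical.

```python
import json
from collections import defaultdict


def assignments_to_partition(assignments):
    """Convert an {element: group label} mapping into a list of subsets."""
    groups = defaultdict(list)
    for element, label in assignments.items():
        groups[label].append(element)
    # Sort subsets by label, and members within each subset, for stable output.
    return [sorted(members) for _, members in sorted(groups.items())]


# Three hypothetical delimitation hypotheses for the same four populations.
hypotheses = [
    {"pop1": "sp1", "pop2": "sp1", "pop3": "sp1", "pop4": "sp1"},
    {"pop1": "sp1", "pop2": "sp2", "pop3": "sp2", "pop4": "sp2"},
    {"pop1": "sp1", "pop2": "sp1", "pop3": "sp2", "pop4": "sp2"},
]

partitions = [assignments_to_partition(h) for h in hypotheses]
text = json.dumps(partitions, indent=2)
print(text)
```

The resulting text has the same list-of-lists structure as the example above and could be saved to a file for use as a "json-list" source.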

Citation

@article{sukumaran2024piikun,
  title = {Piikun: an information theoretic toolkit for analysis and visualization of species delimitation metric space},
  shorttitle = {Piikun},
  author = {Sukumaran, Jeet and Meila, Marina},
  year = {2024},
  month = dec,
  journal = {BMC Bioinformatics},
  volume = {25},
  number = {1},
  pages = {385},
  issn = {1471-2105},
  doi = {10.1186/s12859-024-05997-y},
  url = {https://doi.org/10.1186/s12859-024-05997-y},
  urldate = {2025-01-07},
  abstract = {Existing software for comparison of species delimitation models do not provide a (true) metric or distance functions between species delimitation models, nor a way to compare these models in terms of relative clustering differences along a lattice of partitions.},
}
