PIIvot utilizes fine-tuned named entity recognition to identify entities that commonly contain personally identifiable information and anonymize them with contextually accurate surrogates.
- [Overview](#overview) 📖
- [Setup](#setup) 🧑‍🔬
  - [Prerequisites](#prerequisites) 📋
  - [Installation](#installation) ⏬
    - [Windows](#windows)
    - [MacOS](#macos)
    - [Linux](#linux)
- [Run](#run) 🏃
- [Backend Model Training](#backend-model-training) 🏋️‍♂️
## Overview

PIIvot is a library for detecting potential PII and anonymizing it in data workflows. It utilizes realistic surrogates to obfuscate names, schools, phone numbers, and locations.

For a closer look, you can explore the core module's primary code located at `piivot/engine/analyzer.py` and `piivot/engine/anonymizer.py`, where you'll find the implementations of the main functional classes, `Analyzer` and `Anonymizer` (sketched together below):
- `Analyzer`: Analyzes specified data columns in a given DataFrame using a fine-tuned NER algorithm to label potentially sensitive information.
- `Anonymizer`: Anonymizes specified data columns in a given DataFrame using provided labeled spans (potentially from the `Analyzer`) and generates reasonable surrogate replacements to obfuscate potential PII.
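As a rough sketch of how the two compose (the column names come from the Run example below, and `analyzer`/`anonymizer` are constructed there; this is not additional API):

```python
# Sketch only: assumes analyzer and anonymizer are built as in the Run section below.
df = analyzer.analyze(df, data_columns=["message"])  # labels spans of potential PII
anonymized_df = anonymizer.anonymize(
    df,
    data_columns=["message"],
    label_columns=["message_labels"],  # labeled spans produced by the Analyzer
)
```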
## Setup

### Prerequisites

- **Python** 🐍

  Python version 3.12^ is required.

- **OpenAI API Key**

  To anonymize data with the `Anonymizer` you'll need an active OpenAI API key (see the sketch after this list for one way to supply it).

- **Huggingface Account**

  To use fine-tuned models, you may need to request access for your Huggingface account. Once you've been granted access to the hub, run `huggingface-cli login` with a User Access Token that has "Read access to contents of all public gated repos you can access".

- **(Optional) W&B Account**

  To use the Experiment model training pipeline, you will need to log into a wandb account. Use `wandb login` to set up your desired logging project.
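If you'd rather not hard-code these credentials, a minimal sketch using the standard `openai` and `huggingface_hub` packages (the environment-variable names here are conventions, not something piivot requires):

```python
import os

from huggingface_hub import login
from openai import OpenAI

# Pull the OpenAI key from the environment instead of pasting it into code.
gpt_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Programmatic equivalent of `huggingface-cli login` for gated model access.
login(token=os.environ["HF_TOKEN"])
```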
### Installation

`piivot` uses poetry (do not use pip or conda). To create the environment:
#### Windows

```bash
poetry env use 3.12
poetry config virtualenvs.in-project true
poetry install
# to activate the env
poetry shell
```

#### MacOS

```bash
poetry env use 3.12
poetry config virtualenvs.in-project true
poetry config --local installer.no-binary pyodbc
poetry install
# to activate the env
poetry shell
```

#### Linux

```bash
export PYTHON_KEYRING_BACKEND=keyring.backends.fail.Keyring
poetry env use 3.12
poetry config virtualenvs.in-project true
poetry install
# to activate the env
poetry shell
```
❗ NOTE: if you get the following error:

```
This error originates from the build backend, and is likely not a problem with poetry but with multidict (6.0.4) not supporting PEP 517 builds. You can verify this by running 'pip wheel --use-pep517 "multidict (==6.0.4)"'.
```

Run:

```bash
poetry shell
pip install --upgrade pip
MULTIDICT_NO_EXTENSIONS=1 pip install multidict
poetry add inflect
poetry add pyodbc
# if packages are not reinstalled then run: poetry update
```
## Run

Example run for the analyze and anonymize functions:

```python
from piivot.engine import Analyzer, Anonymizer, LabelAnonymizationManager
from openai import OpenAI
import pandas as pd

data = [
    {"message": "Hi, I'm John and I live in New York."},
    {"message": "Hello, my name is Jane and I live in Los Angeles."},
    {"message": "Hey, I'm Alice from Chicago."},
    {"message": "Greetings, I'm Bob and I'm based in Seattle."},
    {"message": "Hi, I'm Carol and I reside in Miami."},
]
df = pd.DataFrame(data, columns=["message"])

# For this demo we demonstrate how to apply an open-source NER model to the PIIvot NER labeling task
analyzer = Analyzer("dslim/bert-base-NER")
df = analyzer.analyze(df, data_columns=['message'])
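# Note (inference from the label_columns argument below): analyze writes the
# spans it detected in "message" to a new "message_labels" column.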
gpt_client = OpenAI(api_key="[[Your API key...]]")
label_anon_manager = LabelAnonymizationManager()
# Rename labels from QATD_2k to configure label manager for dslim/bert-base-NER
label_anon_manager.rename_label("NAME", "PER")
label_anon_manager.rename_label("LOCATION_ADDRESS", "LOC")
anonymizer = Anonymizer(label_anon_manager, client=gpt_client)
anonymized_df = anonymizer.anonymize(df, data_columns=['message'], label_columns=['message_labels'])
anonymized_df.head()
```

### Using `context_groups` to preserve anonymizations between rows

```python
data2 = [
    {"conversation_id": 1, "message": "Hi, I'm John and I live in New York."},
    {"conversation_id": 1, "message": "Hello John, my name is Jane and I live in Los Angeles which is pretty far from New York."},
    {"conversation_id": 2, "message": "Hey, I'm Alice from Portland."},
    {"conversation_id": 2, "message": "Greetings Alice, I'm Bob and I'm based in Seattle which is pretty close to Portland!"},
    {"conversation_id": 2, "message": "Great to meet you Alice! We'll have to meet up sometime."},
]
df2 = pd.DataFrame(data2, columns=["message", "conversation_id"])
df2 = analyzer.analyze(df2, data_columns=['message'], context_groups=['conversation_id'])
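# Comment for clarity: context_groups keeps surrogate choices consistent across
# rows that share the same conversation_id (e.g. "John" maps to one fake name).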
anonymized_df2 = anonymizer.anonymize(df2, data_columns=['message'], label_columns=['message_labels'], context_groups=['conversation_id'])
anonymized_df2.head()
```

If running locally in a Jupyter Notebook, you can import the PIIvot repo with the following code:

```python
import os
import sys
module_path = os.path.abspath([[Path to PIIvot Repo]])
if module_path not in sys.path:
    sys.path.append(module_path)
```

Previously we would have installed the package globally using `pip install -e .`; with poetry, you simply add a dependency on the local package.
- Clone the repository:

  ```bash
  git clone [[repo_path]]
  ```

- In your other repository, add the following to the `pyproject.toml`:

  ```toml
  piivot = { path = "<path-to-piivot>", develop = true }
  ```

  Example: `piivot` was cloned in the parent directory of the current project:

  ```toml
  piivot = { path = "../piivot", develop = true }
  ```

  The `develop` flag should mean that your installation will be automatically updated when `piivot` is edited.

- You can now import this package:

  ```python
  from piivot.engine import Analyzer, Anonymizer
  ```

- If you then update this package, it should update automatically (if `develop = true`). If this does not happen, you should be able to just run `poetry update piivot`, but you may need to reinstall your poetry environment.
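To double-check that the dependency resolved to your local clone (a quick sketch; this is ordinary Python, not part of the piivot API):

```python
import piivot

# With develop = true, this should print a path inside your local clone
# (e.g. ../piivot/...), not a path under site-packages.
print(piivot.__file__)
```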
## Backend Model Training

PIIvot has built-in support for a variety of model fine-tuning and experimentation use cases. This code is highly tailored to the Tutor/Student Dialogue dataset, and further work is needed to generalize the experimentation pipeline to any dataset.
- A labeled .csv

  Use orchex to generate a ground-truth tutor-student dialogues extract.
Example experiment run using `deberta_base_experiment`:

```bash
poetry run python ./learn.py --exp_folder ./experiment_configs/deberta_base_experiment --data_filepath [[data_filepath]]
```