data-deduplication

Here are 7 public repositories matching this topic...

sail-sg / sailcraft

🚢 Data Toolkit for Sailor Language Models

data-deduplication data-cleaning

Updated Feb 24, 2025
Python

gagan3012 / PolyDeDupe

PolyDeDupe: Multi-Lingual Data Deduplication

multilingual nlp data-deduplication

Updated Oct 27, 2025
Python

dffdgdg / FindDuplicates

This project is a tool for finding and analyzing duplicate files in a specified directory. The program identifies identical files by their content, using the SHA-256 hashing algorithm. It supports configurable parameters, such as the minimum file size to check and ignoring certain…

python hashing productivity multithreading data-deduplication file-system sha256 file-management system-utility cli-tool dev-tools file-deduplication file-comparison disk-cleanup command-line-utility duplicate-file-finder

Updated Feb 14, 2025
Python
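
The content-based approach described above (hash each file with SHA-256, then group files that share a digest) can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the names `sha256_of_file` and `find_duplicates`, the `min_size` parameter, and the step that groups files by size before hashing are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, min_size=1):
    """Return groups of files under `root` that have identical content.

    `min_size` (hypothetical parameter) skips files smaller than this many bytes.
    """
    # Assumption for this sketch: group by size first, since files of
    # different sizes cannot be identical and need not be hashed.
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file() and not path.is_symlink():
            size = path.stat().st_size
            if size >= min_size:
                by_size[size].append(path)

    # Hash only the files that share a size with at least one other file.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            by_hash[sha256_of_file(path)].append(path)

    # Keep only digests seen more than once.
    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Calling `find_duplicates("some/dir", min_size=1024)` would return a list of groups, each a sorted list of `Path` objects with matching content. Grouping by size first is only an optimization: it avoids reading files that cannot have a duplicate.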

Anveshika06 / VIT-VTAS-TY-2022

data-deduplication hashing-algorithm

Updated Jan 7, 2023
Python

anirudh-69 / Financial-Data-ETL-Workflow

ETL workflow for stock data processing using Mage and PostgreSQL

python etl docker-compose postgresql data-deduplication data-engineering stock-market data-processing financial-data data-modeling data-cleaning data-aggregation api-integration alpha-vantage mage-ai

Updated Jan 17, 2025
Python

RayanGAtech / HR-Roster-Change-Data-Capture-Pipeline

The HR Roster Change Detection Pipeline is an automated solution for processing HR roster data. Leveraging Apache Airflow and PostgreSQL, it enables seamless data ingestion, deduplication, and change detection, streamlining HR operations.

python open-source automation etl postgresql data-deduplication data-engineering data-pipelines apache-airflow roster-management workforce-analytics scalable-solutions hr-technology delta-detection hr-data-processing

Updated Dec 4, 2024
Python

fabriziosalmi / text-boundaries

A Python-based tool for preprocessing, cleaning, and analyzing text datasets, designed to filter, deduplicate, sort data, and generate statistical insights.

machine-learning natural-language-processing data-validation data-deduplication data-preprocessing data-sorting data-automation dataset-cleaning text-data-analysis dataset-boundaries data-statistics-generation

Updated Oct 31, 2025
Python
