This repository is for CSVs of DS data at various stages of extraction, transformation, and enrichment. It also includes RDF and JSON data extracted from our linked database for display and search through a user interface.
Name new files `DATE-name.csv`, where `DATE` is the date the file was created in YYYYMMDD format and `name` is a short descriptor such as `jhu`, `combined`, or `upenn`. Separate multiple descriptors with dashes, as in `20220920-language-combined-enriched`. Where the instructions refer to `element`, it stands for the metadata element, field, or type of data, as in `languages.csv`, `named-subjects-unreconciled.csv`, or `20220705-places-combined-enriched.csv`.
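The naming pattern above can be sketched in Ruby as follows. The helper name `build_filename` is hypothetical and not part of this repository; it only illustrates how the date and the dash-separated descriptors combine.

```ruby
require 'date'

# Hypothetical helper (not part of this repository): builds a file name
# following the DATE-name.csv pattern described above.
def build_filename(date, *descriptors)
  raise ArgumentError, 'at least one descriptor is required' if descriptors.empty?

  # DATE is the creation date in YYYYMMDD format.
  stamp = date.strftime('%Y%m%d')
  # Multiple descriptors are joined with dashes.
  "#{stamp}-#{descriptors.join('-')}.csv"
end

puts build_filename(Date.new(2022, 9, 20), 'language', 'combined', 'enriched')
# prints 20220920-language-combined-enriched.csv
```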
- Navigate to `member-data` directory.
- Create sub-directory for institution using code list (if directory does not exist).
- Create sub-sub-directory based on filename DATE in YYYY-MM format (if directory does not exist).
- Click on `Add File` button and click `Upload files` from context menu.
- Drag and drop or `choose your files` to be uploaded.
- `Commit changes` directly to main branch.
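As a rough illustration of where a file ends up after the steps above, the following Ruby sketch derives the target directory from a file name. The helper `member_data_dir` is hypothetical, the institution code is assumed to come from the code list, and the pattern accepts the DATE prefix written either as YYYYMMDD or as YYYY-MM-DD.

```ruby
# Hypothetical helper (not part of this repository): derives the upload
# directory for a member-data file from its DATE prefix.
# institution_code is assumed to be taken from the code list.
def member_data_dir(filename, institution_code)
  # Accept DATE as YYYYMMDD or YYYY-MM-DD at the start of the file name.
  match = filename.match(/\A(\d{4})-?(\d{2})-?\d{2}-/)
  raise ArgumentError, "no DATE prefix in #{filename}" unless match

  # Sub-sub-directory is the year and month in YYYY-MM format.
  File.join('member-data', institution_code, "#{match[1]}-#{match[2]}")
end

puts member_data_dir('2022-02-23-burke-marc-combined-import.csv', 'burke')
# prints member-data/burke/2022-02
```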
- Navigate to `ds-data/terms/batch` directory.
- Create sub-directory based on filename DATE in YYYY-MM format (if directory does not exist).
- Click on `Add File` button and click `Upload files` from context menu (file to be uploaded should be `DATE-element-enriched.csv`).
- Drag and drop or `choose your files` to be uploaded.
- `Commit changes` directly to main branch.
See also instructions at ds-open-refine/instructions/
- Navigate to `ds-data/terms/reconciled` directory.
- Click on `Add File` button and click `Upload files` from context menu (file to be uploaded should be `element.csv`).
- Drag and drop or `choose your files` to be uploaded.
- `Commit changes` directly to main branch.
.
├── README.md
├── Workflow-README-template.md
├── config.yml
├── member-data
│   ├── burke
│   │   ├── 2022-02
│   │   │   ├── 2022-02-23-burke-marc-combined-import.csv
│   │   │   ├── 2022-02-25-burke-marc-mmw-import.csv
│   │   ├── 2022-04
│   │   │   ├── 2022-04-04-burke-hebrew-import.csv
│   ├── ccny
│   │   ├── 2022-02
│   │   │   ├── 2022-02-23-ccny-mets-import.csv
│   │   │   ├── 2022-02-25-ccny-csv-import.csv
│   ├── columbia
│   │   ├── 2023-01
│   │   │   ├── 2022-02-23-columbia-marcxml-european-import.csv
│   etc.
├── split_data.rb
├── test
│   ├── missing_inst_name.csv
│   ├── missing_qid.csv
│   └── unknown_qid.csv
└── workflow
    ├── 2022-02-23-combined-README.md
    ├── 2022-02-23-combined-enriched.csv
    ├── 2022-02-23-combined.csv
    ├── 2022-04-04-combined-README.md
    ├── 2022-04-04-combined-enriched.csv
    └── 2022-04-04-combined.csv
CSVs in the `workflow` directory should be split into institution-specific
directories. The `split_data.rb` script splits a CSV on the QID in the
`holding_institution` column and puts each resulting file in the folder
defined for that institution in the `config.yml` file.
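The idea of splitting on the QID can be sketched as below. This is not the code of `split_data.rb`; the helper `group_rows_by_directory`, the `title` column, and the sample rows are made up for illustration, and raising on an unknown QID is an assumption about reasonable behavior.

```ruby
require 'csv'

# Hypothetical helper (not the actual split_data.rb logic): groups CSV rows
# by the QID in the holding_institution column.
# qid_to_directory maps a QID to its folder name, as config.yml would.
def group_rows_by_directory(csv_text, qid_to_directory)
  table = CSV.parse(csv_text, headers: true)
  table.group_by do |row|
    qid = row['holding_institution']
    # Assumption: an unmapped QID is treated as an error.
    qid_to_directory.fetch(qid) { raise ArgumentError, "unknown QID: #{qid.inspect}" }
  end
end

# Made-up sample data; only the holding_institution column comes from the README.
sample = <<~CSV
  holding_institution,title
  Q814779,Example A
  Q995265,Example B
  Q814779,Example C
CSV

groups = group_rows_by_directory(sample, 'Q814779' => 'beinecke', 'Q995265' => 'brynmawr')
groups.each { |dir, rows| puts "#{dir}: #{rows.size} row(s)" }
```

Each group could then be written to its own file under the matching institution directory.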
The configuration file contains the QID, name, and a single-word folder name for each institution. New repositories should be added to the configuration. Entries are formatted as follows:
---
- :qid: Q814779
  :name: Beinecke Rare Book & Manuscript Library
  :directory: beinecke
- :qid: Q995265
  :name: Bryn Mawr College
  :directory: brynmawr
- :qid: Q63969940
  :name: Burke Library at Union Theological Seminary
  :directory: burke

`split_data.rb` validates the config file and the CSV.
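A minimal sketch of what a config check might look like, assuming each entry must carry `:qid`, `:name`, and `:directory`. The helper `config_errors` is hypothetical and the actual checks in `split_data.rb` may differ. Because the keys are written as Ruby symbols, `Symbol` has to be permitted when loading the YAML safely.

```ruby
require 'yaml'

# Hypothetical check (the real validation in split_data.rb may differ):
# returns a list of messages for entries missing a required key.
def config_errors(yaml_text)
  required_keys = %i[qid name directory]
  # Symbol must be permitted because keys are written as :qid, :name, etc.
  entries = YAML.safe_load(yaml_text, permitted_classes: [Symbol])
  entries.each_with_index.flat_map do |entry, index|
    missing = required_keys.select { |key| entry[key].to_s.strip.empty? }
    missing.map { |key| "entry #{index}: missing :#{key}" }
  end
end

# The second entry deliberately omits :directory to show a reported error.
config = <<~CONFIG
  ---
  - :qid: Q814779
    :name: Beinecke Rare Book & Manuscript Library
    :directory: beinecke
  - :qid: Q995265
    :name: Bryn Mawr College
CONFIG

puts config_errors(config)
# prints entry 1: missing :directory
```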