An R package to read, write, and edit Data Package data and metadata. Unlike the existing R packages dpmr and datapkg, dpkg can be used to build and document Data Packages entirely within R. Please note that this is a work in progress: function names and functionality may change based on feedback from the community.
This package is not on CRAN. To install in R, use devtools:
devtools::install_github("ezwelty/dpkg")
To build a data package, assemble the data and add metadata to the various elements:
data <- data.frame(
  id = 1L %>% set_field(title = "Identifier"),
  value = 1.1,
  added = Sys.Date()
)
# Data Resource (list of Fields)
dr <- data %>%
  set_resource(
    name = "data",
    path = "data/data.csv"
  )
# Data Package (list of Resources)
dp <- list(dr) %>%
  set_package(
    name = "data-package"
  )
You can preview the package metadata:
get_package(dp) %>% str()
## List of 3
## $ name : chr "data-package"
## $ profile : chr "data-package"
## $ resources:List of 1
## ..$ :List of 4
## .. ..$ name : chr "data"
## .. ..$ path : chr "data/data.csv"
## .. ..$ profile: chr "data-resource"
## .. ..$ schema :List of 1
## .. .. ..$ fields:List of 3
## .. .. .. ..$ :List of 3
## .. .. .. .. ..$ name : chr "id"
## .. .. .. .. ..$ type : chr "integer"
## .. .. .. .. ..$ title: chr "Identifier"
## .. .. .. ..$ :List of 2
## .. .. .. .. ..$ name: chr "value"
## .. .. .. .. ..$ type: chr "number"
## .. .. .. ..$ :List of 3
## .. .. .. .. ..$ name : chr "added"
## .. .. .. .. ..$ type : chr "date"
## .. .. .. .. ..$ format: chr "%Y-%m-%d"
Write the package to file:
dir <- tempdir()
write_package(dp, path = dir)
And read the package back in:
read_package(dir)
## $data
##   id value      added
## 1  1   1.1 2017-08-24
##
## attr(,"dpkg_package")
## attr(,"dpkg_package")$name
## [1] "data-package"
##
## attr(,"dpkg_package")$profile
## [1] "data-package"
##
## attr(,"class")
## [1] "dpkg" "list"
In dpkg, the contents of a data package are stored as a list of one or more data resources, each of which is a list of one or more fields (each typically an atomic vector). For example:
dp <- list(
  dr = data.frame(
    id = 1L,
    value = 1.1,
    added = Sys.Date()
  )
)
Package, resource, and field ("data objects") metadata can be set or updated using the set_* functions (set_package, set_resource, set_field), which come in a <- flavor:
set_field(dp$dr$id) <- field(title = "Unique identifier", constraints = constraints(unique = TRUE))
and a pipe-friendly flavor:
dp$dr$id %<>% set_field(title = "Identifier", constraints = NULL)
As seen above with the use of field and constraints, a suite of helper functions is available to help build metadata:
- Data objects: package, resource, field
- Meta objects: schema, foreignKey, constraints, license, source, contributor
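The two setter flavors follow a common R idiom: a replacement function (called as f(x) <- value) paired with an ordinary function that returns a modified copy. The following is a minimal base-R sketch of that idiom, not dpkg's implementation. The names set_meta and the "meta" attribute are hypothetical, and treating NULL as "remove this property" is an assumption modelled on the constraints = NULL call above.

```r
# Pipe-friendly flavor: returns a modified copy of x.
# Metadata lives in a hypothetical "meta" attribute (not dpkg's storage).
set_meta <- function(x, ...) {
  meta <- attr(x, "meta")
  if (is.null(meta)) meta <- list()
  # modifyList() overwrites named entries and drops those set to NULL
  attr(x, "meta") <- utils::modifyList(meta, list(...))
  x
}

# Replacement flavor: enables `set_meta(x) <- list(...)`
`set_meta<-` <- function(x, value) {
  do.call(set_meta, c(list(x), value))
}

id <- 1L
set_meta(id) <- list(title = "Unique identifier", unique = TRUE)
id <- set_meta(id, title = "Identifier", unique = NULL)
str(attr(id, "meta"))
```

After both calls, only the title property remains, with the updated value.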
Data object metadata is stored as attributes. Base R drops attributes in many common operations; dpkg guards against this by making metadata resilient to [, [[, subset, and append.
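To illustrate the problem and one general way around it, here is a base-R sketch. It is not how dpkg is implemented; the class name keep_meta and the attribute name title are hypothetical. It shows that plain subsetting drops a custom attribute, and that an S3 method for [ can copy it back.

```r
# A plain vector with a custom attribute
x <- structure(1:3, title = "Identifier")
attributes(x[1:2])  # the custom attribute is gone after subsetting

# Hypothetical S3 class whose `[` method restores non-structural attributes
as_keep_meta <- function(x) {
  class(x) <- c("keep_meta", class(x))
  x
}

`[.keep_meta` <- function(x, ...) {
  out <- NextMethod()
  meta <- attributes(x)
  # names, dim and dimnames depend on the subset, so leave those alone
  meta <- meta[!names(meta) %in% c("names", "dim", "dimnames")]
  for (key in names(meta)) attr(out, key) <- meta[[key]]
  out
}

y <- as_keep_meta(x)
attr(y[1:2], "title")  # metadata survives subsetting
```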
To preview a package, metadata can be retrieved from data objects using the get_* functions (get_package, get_resource, get_field). Missing properties are filled with their default values:
- Fields
  - name: The name of the object in a list (resource).
  - type: The type corresponding to the object class.
    - character -> "string"
    - numeric -> "number"
    - integer -> "integer"
    - logical -> "boolean"
    - Date -> "date"
    - POSIXt -> "datetime"
    - otherwise -> "string"
  - format: The default format for that type.
    - date -> "%Y-%m-%d"
    - datetime -> "%Y-%m-%dT%H-%M-%SZ"
  - unit: Units set by units deparsed to product power form.
- Resources
  - name: The name of the object in a list (package).
  - schema$fields: Field metadata from the elements of the object.
- Packages
  - resources: Resource metadata from the elements of the object.
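The class-to-type defaults listed above can be sketched in base R as follows. This is an illustration of the mapping only, not dpkg's code; the function name default_type is hypothetical.

```r
# Map an R object to its default Table Schema type (hypothetical helper).
default_type <- function(x) {
  # Date and POSIXt are checked by inheritance, since POSIXct objects
  # carry several classes
  if (inherits(x, "Date")) return("date")
  if (inherits(x, "POSIXt")) return("datetime")
  switch(class(x)[1],
    character = "string",
    numeric = "number",
    integer = "integer",
    logical = "boolean",
    "string"  # fallback for any other class
  )
}

df <- data.frame(id = 1L, value = 1.1, added = as.Date("2017-08-24"))
types <- vapply(df, default_type, character(1))
types
```

Applied to the example data frame, this gives integer, number, and date for id, value, and added, matching the metadata previews shown in this document.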
get_field(dp$dr$id) %>% str()
## List of 2
## $ type : chr "integer"
## $ title: chr "Identifier"
get_resource(dp$dr) %>% str()
## List of 3
## $ profile: chr "data-resource"
## $ schema :List of 1
## ..$ fields:List of 3
## .. ..$ :List of 3
## .. .. ..$ name : chr "id"
## .. .. ..$ type : chr "integer"
## .. .. ..$ title: chr "Identifier"
## .. ..$ :List of 2
## .. .. ..$ name: chr "value"
## .. .. ..$ type: chr "number"
## .. ..$ :List of 3
## .. .. ..$ name : chr "added"
## .. .. ..$ type : chr "date"
## .. .. ..$ format: chr "%Y-%m-%d"
## $ data :'data.frame': 1 obs. of 3 variables:
## ..$ id : int 1
## ..$ value: num 1.1
## ..$ added: chr "2017-08-24"
get_package(dp) %>% str()
## List of 2
## $ profile : chr "data-package"
## $ resources:List of 1
## ..$ :List of 4
## .. ..$ name : chr "dr"
## .. ..$ profile: chr "data-resource"
## .. ..$ schema :List of 1
## .. .. ..$ fields:List of 3
## .. .. .. ..$ :List of 3
## .. .. .. .. ..$ name : chr "id"
## .. .. .. .. ..$ type : chr "integer"
## .. .. .. .. ..$ title: chr "Identifier"
## .. .. .. ..$ :List of 2
## .. .. .. .. ..$ name: chr "value"
## .. .. .. .. ..$ type: chr "number"
## .. .. .. ..$ :List of 3
## .. .. .. .. ..$ name : chr "added"
## .. .. .. .. ..$ type : chr "date"
## .. .. .. .. ..$ format: chr "%Y-%m-%d"
## .. ..$ data :'data.frame': 1 obs. of 3 variables:
## .. .. ..$ id : int 1
## .. .. ..$ value: num 1.1
## .. .. ..$ added: chr "2017-08-24"
write_package writes package data and metadata to disk using the following rules for each resource:
- format: If missing, checks the path file extension and mediatype. Only "csv" ("text/csv") and "json" ("application/json") are supported.
- path: If not set, the data is saved in the metadata (datapackage.json) as either an inline JSON object (format: "json" or missing) or a CSV string (format: "csv"). For writing, path must be a single, local, relative path.
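The rules above could be approximated in base R as shown below. This is a sketch under stated assumptions, not dpkg's implementation: the function names infer_format and is_local_relative_path are hypothetical, and checking the file extension before the mediatype is an assumed order of precedence.

```r
# Guess a resource format from its path extension, then its mediatype.
# Returns NA when neither identifies a supported format.
infer_format <- function(path = NULL, mediatype = NULL) {
  ext <- if (is.null(path)) "" else tolower(tools::file_ext(path))
  if (ext %in% c("csv", "json")) return(ext)
  if (!is.null(mediatype)) {
    fmt <- switch(mediatype,
      "text/csv" = "csv",
      "application/json" = "json",
      NULL
    )
    if (!is.null(fmt)) return(fmt)
  }
  NA_character_
}

# TRUE only for a single path that is neither a URL nor absolute
is_local_relative_path <- function(path) {
  is.character(path) && length(path) == 1 &&
    !grepl("^([A-Za-z][A-Za-z0-9+.-]*://|/|~|[A-Za-z]:)", path)
}

infer_format("data/data.csv")
infer_format(mediatype = "application/json")
is_local_relative_path("data/data.csv")
```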
Resource as an inline JSON object:
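# Scratch directory for the examples below (assumed; the original does not show where tmpdir is defined)
tmpdir <- tempfile("dpkg-")
dir.create(tmpdir)
set_resource(dp$dr) <- package(format = "json", path = NULL)
get_resource(dp$dr)$data
##   id value      added
## 1  1   1.1 2017-08-24
write_package(dp, path = tmpdir)
list.files(tmpdir)
## [1] "datapackage.json"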
Resource as an inline CSV string:
set_resource(dp$dr) <- package(format = "csv", path = NULL)
get_resource(dp$dr)$data
## [1] "id,value,added\n1,1.1,2017-08-24"
write_package(dp, path = tmpdir)
list.files(tmpdir)
## [1] "datapackage.json"
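For readers curious how a data frame can end up as a single CSV string like the one above, here is a base-R sketch. It only illustrates the idea and makes no claim about how dpkg serializes data internally.

```r
df <- data.frame(id = 1L, value = 1.1, added = as.Date("2017-08-24"))

# Capture the CSV lines that write.csv() would print, then join them
lines <- utils::capture.output(
  utils::write.csv(df, row.names = FALSE, quote = FALSE)
)
csv <- paste(lines, collapse = "\n")
csv

# The string can be parsed back into a data frame
back <- utils::read.csv(text = csv)
str(back)
```

Note that read.csv returns the added column as character, so a reader would still need the field's type and format metadata to restore it as a Date.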
Resource as a JSON file:
set_resource(dp$dr) <- package(format = "json", path = "data/data.json")
get_resource(dp$dr)$data
## NULL
write_package(dp, path = tmpdir)
list.files(tmpdir, recursive = TRUE)
## [1] "data/data.json" "datapackage.json"
Resource as a CSV file:
set_resource(dp$dr) <- package(format = "csv", path = "data/data.csv")
get_resource(dp$dr)$data
## NULL
write_package(dp, path = tmpdir)
list.files(tmpdir, recursive = TRUE)
## [1] "data/data.csv" "datapackage.json"
read_package reads package data and metadata into the same structure described above, but unlike write_package, it supports both local and remote paths. The resources argument can be used to read a subset of the package's resources (or all if NULL, the default).
dp <- read_package(
"https://raw.githubusercontent.com/columbia-glacier/optical-surveys-1985/master",
resources = c("station", "velocity")
)
get_package(dp) %>% str()
## List of 8
## $ name : chr "optical-surveys-1985"
## $ title : chr "Optical Surveys (1985)"
## $ description : chr "Velocity of three reflectors 1.3, 2.8, and 4.6 km from the terminus and meteorological observations from a station on nearby He"| __truncated__
## $ profile : chr "data-package"
## $ version : chr "0.1.0"
## $ sources :List of 1
## ..$ :List of 2
## .. ..$ title: chr "Original data, scripts, and documentation"
## .. ..$ path : chr "sources/"
## $ contributors:List of 1
## ..$ :List of 3
## .. ..$ title: chr "Ethan Welty"
## .. ..$ email: chr "[email protected]"
## .. ..$ role : chr "author"
## $ resources :List of 2
## ..$ :List of 5
## .. ..$ name : chr "station"
## .. ..$ path : chr "data/station.csv"
## .. ..$ profile: chr "data-resource"
## .. ..$ title : chr "Station Metadata"
## .. ..$ schema :List of 1
## .. .. ..$ fields:List of 2
## .. .. .. ..$ :List of 4
## .. .. .. .. ..$ name : chr "lat"
## .. .. .. .. ..$ type : chr "number"
## .. .. .. .. ..$ description: chr "Latitude (WGS84, EPSG:4326)."
## .. .. .. .. ..$ unit : chr "°"
## .. .. .. ..$ :List of 4
## .. .. .. .. ..$ name : chr "lng"
## .. .. .. .. ..$ type : chr "number"
## .. .. .. .. ..$ description: chr "Longitude (WGS84, EPSG:4326)."
## .. .. .. .. ..$ unit : chr "°"
## ..$ :List of 5
## .. ..$ name : chr "velocity"
## .. ..$ path : chr "data/velocity.csv"
## .. ..$ profile: chr "data-resource"
## .. ..$ title : chr "Marker Velocity"
## .. ..$ schema :List of 1
## .. .. ..$ fields:List of 4
## .. .. .. ..$ :List of 3
## .. .. .. .. ..$ name : chr "marker"
## .. .. .. .. ..$ type : chr "integer"
## .. .. .. .. ..$ description: chr "Marker identifier (1: 1.3 km, 2: 2.8km, and 3: 4.6 km from the terminus)."
## .. .. .. ..$ :List of 3
## .. .. .. .. ..$ name : chr "sequence"
## .. .. .. .. ..$ type : chr "integer"
## .. .. .. .. ..$ description: chr "Sequence number from figure tracing. Observations are 'continuous' between times of the same sequence."
## .. .. .. ..$ :List of 4
## .. .. .. .. ..$ name : chr "t"
## .. .. .. .. ..$ type : chr "datetime"
## .. .. .. .. ..$ format : chr "%Y-%m-%dT%H:%M:%SZ"
## .. .. .. .. ..$ description: chr "Date and time (UTC)."
## .. .. .. ..$ :List of 4
## .. .. .. .. ..$ name : chr "value"
## .. .. .. .. ..$ type : chr "number"
## .. .. .. .. ..$ description: chr "Velocity"
## .. .. .. .. ..$ unit : chr "m d-1"
dp$station
## lat lng
## 1 60.98891 ° -147.0357 °
head(dp$velocity)
## marker sequence t value
## 1 1 2 1985-08-06 14:43:20 9.289464 m/d
## 2 1 1 1985-08-06 14:45:22 9.218484 m/d
## 3 1 4 1985-08-06 14:51:20 9.511275 m/d
## 4 1 3 1985-08-06 15:07:43 9.440296 m/d
## 5 1 6 1985-08-06 15:27:08 9.635490 m/d
## 6 1 5 1985-08-06 15:28:58 9.571165 m/d
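Reading from either a local directory or a remote URL, and reading only some resources, comes down to two small steps: resolving each resource path against the package root, and selecting resource descriptors by name. The base-R sketch below illustrates both. The helper names resolve_resource_path and select_resources are hypothetical and are not part of dpkg; the descriptors are hand-written stand-ins for parsed datapackage.json content.

```r
is_url <- function(x) grepl("^https?://", x)

# Join a resource path to a local or remote package root
resolve_resource_path <- function(root, path) {
  if (is_url(path)) return(path)
  if (is_url(root)) {
    paste(sub("/+$", "", root), path, sep = "/")
  } else {
    file.path(root, path)
  }
}

# Keep only the named resources (all of them if names is NULL)
select_resources <- function(resources, names = NULL) {
  if (is.null(names)) return(resources)
  available <- vapply(resources, function(r) r$name, character(1))
  unknown <- setdiff(names, available)
  if (length(unknown) > 0) {
    stop("Unknown resources: ", paste(unknown, collapse = ", "), call. = FALSE)
  }
  resources[match(names, available)]
}

root <- "https://raw.githubusercontent.com/columbia-glacier/optical-surveys-1985/master"
resources <- list(
  list(name = "station", path = "data/station.csv"),
  list(name = "velocity", path = "data/velocity.csv")
)
picked <- select_resources(resources, c("velocity", "station"))
vapply(picked, function(r) resolve_resource_path(root, r$path), character(1))
```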
read_package_github accepts a shorthand GitHub repository address.
dp <- read_package_github("columbia-glacier/optical-surveys-1985", "station")
Only types string, number, integer, boolean, date, and datetime are implemented (see table-schema/field-descriptors). Add support for the remaining types:
- type = object
- type = array
- type = time (via package hms)
- type = year (already supported via type = date and format = "%Y")
- type = yearmonth (already supported via type = date and format = "%Y-%m")
- type = duration (already supported via type = numeric and unit)
- type = geopoint
- type = geojson
Additionally:
- Validate field values against constraints property
- For type = string, validate values against format property
- Validate packages, resources, and schemas against standard (https://specs.frictionlessdata.io/schemas/registry.json) or custom profiles.
- Read/write resources with a path like "data/data.csv.gz" to/from compressed files
- Write resource schemas with a path to a JSON file
- Read/write GeoJSON and TopoJSON to/from spatial objects
- Validate license name against http://licenses.opendefinition.org/licenses/groups/all.json
- Support reading packages based on their Data Package Identifier