A set of streaming data parsers based on fs2.
The following modules are available:
- `fs2-data-json`: a JSON parser and manipulation library
- `fs2-data-json-circe`: circe support for parsed JSON
- `fs2-data-json-diffson`: diffson support for patching JSON streams
- `fs2-data-xml`: an XML parser
- `fs2-data-csv`: a CSV parser
- `fs2-data-csv-generic`: generic decoder for CSV files
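To use a module, add it to your build. The following sbt line is a hypothetical sketch: the `org.gnieh` organization is an assumption of this example, and the version placeholder stands for whatever release you pick:

```scala
// Hypothetical coordinates; check the published artifacts for the
// actual organization and latest version.
libraryDependencies += "org.gnieh" %% "fs2-data-json" % "<version>"
```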
To create a stream of JSON tokens from an input stream, use the `tokens` pipe in the `fs2.data.json` package:
```scala
import cats.effect._
import fs2._
import fs2.data.json._
val input = """{
| "field1": 0,
| "field2": "test",
| "field3": [1, 2, 3]
|}
|{
| "field1": 2,
| "field3": []
|}""".stripMargin
val stream = Stream.emits(input).through(tokens[IO])
println(stream.compile.toList.unsafeRunSync())
```

The pipe validates the JSON structure while parsing. It reads all the JSON values in the input stream and emits tokens as they are available.
Selectors can be used to select a subset of a JSON token stream.
For instance, to select and enumerate the elements of the `field3` array, you can create this selector. Only the tokens describing the values in `field3` will be emitted as a result.
```scala
val selector = ".field3.[]".parseSelector[IO].unsafeRunSync()
val filtered = stream.through(filter(selector))
println(filtered.compile.toList.unsafeRunSync())
```

The filter syntax is as follows:
- `.` selects the root values; it is basically the identity filter.
- `.f` selects the field named `f` in objects. It fails if the value it is applied to is not a JSON object.
- `.f?` is similar to `.f` but doesn't fail if the value it is applied to is not a JSON object.
- `.[f1, f2, ..., fn]` selects only the fields `f1` to `fn` in objects. The fields are emitted wrapped in an object. It fails if the value it is applied to is not an object.
- `.[f1, f2, ..., fn]?` is similar to `.[f1, f2, ..., fn]` but doesn't fail if the value it is applied to is not an object.
- `.[idx1, idx2, ..., idxn]` selects only the elements `idx1`, ..., `idxn` in arrays. The values are emitted wrapped in an array. It fails if the value it is applied to is not an array.
- `.[idx1, idx2, ..., idxn]?` is similar to `.[idx1, idx2, ..., idxn]` but doesn't fail if the value it is applied to is not an array.
- `.[idx1:idx2]` selects only the elements between `idx1` (inclusive) and `idx2` (exclusive) in arrays. The values are emitted wrapped in an array. It fails if the value it is applied to is not an array.
- `.[idx1:idx2]?` is similar to `.[idx1:idx2]` but doesn't fail if the value it is applied to is not an array.
- `.[]` selects and enumerates elements from arrays or objects. The values are not wrapped in an array or object. It fails if the value it is applied to is neither an array nor an object.
- `.[]?` is similar to `.[]` but doesn't fail if the value it is applied to is neither an array nor an object.
- `sel1 sel2` applies selector `sel1` to the root value, and selector `sel2` to each selected value.
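Combining these rules, other selectors can be parsed the same way as above. The following sketch reuses the `parseSelector` syntax already shown; the selector strings are just illustrations of the forms listed:

```scala
// lenient field access: doesn't fail if the root value is not an object
val lenientField = ".field2?".parseSelector[IO].unsafeRunSync()
// first two elements of the field3 array, emitted wrapped in an array
val firstTwo = ".field3.[0:2]".parseSelector[IO].unsafeRunSync()
```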
JSON ASTs can be built if you provide an implicit `Builder[Json]` to the `values` pipe. The `Builder[Json]` typeclass describes how JSON ASTs of type `Json` are built from streams.
```scala
implicit val builder: Builder[SomeJsonType] = ...
val asts = stream.through(values[F, SomeJsonType])
```

The `asts` stream emits all top-level JSON values that were parsed; in our example, the two objects are emitted.
If you provide an implicit `Tokenizer[Json]`, which describes how a JSON AST is transformed into JSON events, you can apply transformations to the JSON stream. For instance, you can wrap all values in the `field3` array by using this code:
```scala
implicit val tokenizer: Tokenizer[SomeJsonType] = ...
val transformed = stream.through(transform[IO, SomeJsonType](selector, json => SomeJsonObject("test" -> json)))
```

The fs2-data-json-circe module provides `Builder` and `Tokenizer` instances for the circe `Json` type.
For instance, both examples above can be written this way using circe:
```scala
import fs2.data.json.circe._
import io.circe._
val asts = stream.through(values[IO, Json])
println(asts.compile.toList.unsafeRunSync())
val transformed = stream.through(transform[IO, Json](selector, json => Json.obj("test" -> json)))
println(transformed.through(values[IO, Json]).compile.toList.unsafeRunSync())
```

The fs2-data-json-diffson module provides some integration with diffson.
It allows a JSON stream to be patched as it is read, emitting the patched values downstream.
Patching a stream can be useful in several cases, for instance:
- it can be used to filter out fields you don't need for further processing, before building an AST with these fields;
- it can be used to make data from an input stream anonymous by removing names or identifiers;
- it makes it possible to enrich an input stream with extra data you need for further processing;
- many other use cases where you need to amend input data on the fly without building the entire AST in memory.
Currently only JSON Merge Patch is supported.
In order for patches to be applied, you need a `Tokenizer` for some `Json` type the patch operates on (see above) and a `Jsony` from diffson for that same `Json` type.
Let's say you are using circe as your `Json` AST library; you can use patches like this:
```scala
import fs2.data.json.mergepatch._
import diffson._
import diffson.circe._
import diffson.jsonmergepatch._
import io.circe._
val mergePatch: JsonMergePatch[Json] = ...
val patched = stream.through(patch(mergePatch))
println(patched.compile.toList.unsafeRunSync())
```
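As an illustration, a merge patch can also be built directly with diffson's constructors. This is a sketch assuming diffson's `JsonMergePatch.Object` case class; per the merge patch semantics (RFC 7386), a null value removes a field and any other value adds or replaces it:

```scala
val removeField2: JsonMergePatch[Json] = JsonMergePatch.Object(
  Map(
    "field2" -> Json.Null,             // null removes the field
    "field4" -> Json.fromString("new") // other values add or replace fields
  ))
```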
To create a stream of XML events from an input stream, use the `events` pipe in the `fs2.data.xml` package:

```scala
import cats.effect._
import fs2._
import fs2.data.xml._
val input = """<a xmlns:ns="http://test.ns">
| <ns:b ns:a="attribute">text</ns:b>
|</a>
|<a>
| <b/>
| test entity resolution &amp; normalization
|</a>""".stripMargin
val stream = Stream.emits(input).through(events[IO])
println(stream.compile.toList.unsafeRunSync())
```

The pipe validates the XML structure while parsing. It reads all the XML elements in the input stream and emits events as they are available.
Namespaces can be resolved by using the `namespaceResolver` pipe.
```scala
val nsResolved = stream.through(namespaceResolver[IO])
println(nsResolved.compile.toList.unsafeRunSync())
```

Using the `referenceResolver` pipe, entity and character references can be resolved. By default, the standard `xmlEntities` mapping is used, but it can be replaced by any mapping you see fit.
```scala
val entityResolved = stream.through(referenceResolver[IO]())
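// A custom mapping could be passed instead of the default one, e.g.
// (assuming here that the resolver accepts a Map[String, String] from
// entity names to replacement text):
// val customResolved = stream.through(referenceResolver[IO](xmlEntities + ("copy" -> "©")))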
println(entityResolved.compile.toList.unsafeRunSync())
```

Once entities and namespaces are resolved, the events might be numerous and can be normalized to avoid emitting too many of them. For instance, after reference resolution, consecutive text events can be merged. This is achieved by using the `normalize` pipe.
```scala
val normalized = entityResolved.through(normalize[IO])
println(normalized.compile.toList.unsafeRunSync())
```

To create a stream of CSV rows from an input stream, use the `rows` pipe in the `fs2.data.csv` package. The default column separator is the comma `,`, but this can be overridden by providing the `separator` parameter.
```scala
import cats.effect._
import fs2._
import fs2.data.csv._
val input = """i,s,j
|1,test,2
|,other,-3
|""".stripMargin
val stream = Stream.emits(input).through(rows[IO]())
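// Another separator could be passed explicitly, e.g. for
// semicolon-separated data (a sketch using the separator parameter
// mentioned above):
// val semiColonRows = Stream.emits(input).through(rows[IO](';'))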
println(stream.compile.toList.unsafeRunSync())
```

Rows can be converted to a `CsvRow[Header]` for some `Header` type. This class provides higher-level utilities to manipulate rows.
If your CSV file doesn't have headers, you can use the `noHeaders` pipe, which creates `CsvRow[Nothing]` values:
```scala
val noh = stream.through(noHeaders[IO])
println(noh.compile.toList.unsafeRunSync())
```

If you want to consider the first row as a header row, you can use the `headers` pipe. For instance, to have headers as `String`:
```scala
val withh = stream.through(headers[IO, String])
println(withh.map(_.toMap).compile.toList.unsafeRunSync())
```

To support your own type of `Header`, you must provide an implicit `ParseableHeader[Header]`. For instance, if you have a fixed set of headers represented as enumeratum enum values, you can provide an instance of `ParseableHeader` as follows:
```scala
import enumeratum._
sealed trait MyHeaders extends EnumEntry
object MyHeaders extends Enum[MyHeaders] {
case object I extends MyHeaders
case object S extends MyHeaders
case object J extends MyHeaders
def values = findValues
}
implicit object ParseableMyHeaders extends ParseableHeader[MyHeaders] {
def parse(h: String) = MyHeaders.withNameInsensitive(h)
}
val withMyHeaders = stream.through(headers[IO, MyHeaders])
println(withMyHeaders.map(_.toMap).compile.toList.unsafeRunSync())
```

If the `parse` method fails for a header, the entire stream fails.
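If failing the whole stream is not what you want, plain fs2 error handling can be applied downstream. This is a minimal sketch using a standard fs2 combinator, not an fs2-data-specific API:

```scala
// Fall back to an empty stream if the header row cannot be parsed.
val safeHeaders = stream
  .through(headers[IO, MyHeaders])
  .handleErrorWith(_ => Stream.empty)
```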
Using the `decode` or `decodeRow` pipes, you can decode the rows into some Scala types by providing implicit instances of `RowDecoder` or `CsvRowDecoder`, respectively.
The simplest way of doing this is to use the fs2-data-csv-generic module, which provides automatic derivation for case classes.
For instance, to decode to a shapeless `HList`:
```scala
import fs2.data.csv.generic.hlist._
import shapeless._
val decodedH = stream.tail.through(decode[IO, Option[Int] :: String :: Int :: HNil]) // tail drops the header line
println(decodedH.compile.toList.unsafeRunSync())
```

Cell types (`Int`, `String`, ...) can be decoded by providing implicit instances of `CellDecoder`. Instances for primitives and common types are defined already. You can easily define your own (see the sketch below) or use generic derivation for coproducts.
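A hand-written instance could look like the following sketch, which assumes `CellDecoder` follows the usual typeclass pattern with a summoner and `map`; `LocalDate` is just an illustrative target type:

```scala
import java.time.LocalDate

// Build a LocalDate decoder on top of the existing String instance.
// A production instance would also want to turn parse exceptions into
// decoding failures instead of throwing.
implicit val localDateDecoder: CellDecoder[LocalDate] =
  CellDecoder[String].map(LocalDate.parse(_))
```

Generic derivation for coproducts works as follows: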
```scala
import fs2.data.csv.generic.semiauto._
sealed trait State
case object On extends State
case object Off extends State
implicit val stateDecoder = deriveCellDecoder[State]
// use stateDecoder to derive decoders for rows...or just test:
println(stateDecoder("On"))
println(stateDecoder("Off"))The generic derivation for cell decoders also supports renaming and deriving instances for unary product types (case classes with one field):
```scala
import fs2.data.csv.generic.semiauto._
sealed trait Advanced
@CsvValue("Active") case object On extends Advanced
case class Unknown(name: String) extends Advanced
implicit val unknownDecoder = deriveCellDecoder[Unknown] // works as we have an implicit CellDecoder[String]
implicit val advancedDecoder = deriveCellDecoder[Advanced]
println(advancedDecoder("Active")) // prints Right(On)
println(advancedDecoder("Off")) // prints Right(Unknown(Off))You can also decode rows to case classes automatically.
```scala
import fs2.data.csv.generic.semiauto._
case class Row(s: String, i: Option[Int], j: Int)
implicit val rowDecoder = deriveCsvRowDecoder[Row]
val rows = withh.through(decodeRow[IO, String, Row])
println(rows.compile.toList.unsafeRunSync())
```

The case class derivation also supports default parameters:
```scala
case class Row(s: String, i: Int = 34, j: Int)
implicit val rowDecoder = deriveCsvRowDecoder[Row]
val rows = withh.through(decodeRow[IO, String, Row])
println(rows.compile.toList.unsafeRunSync())
```

There's also support for full auto-derivation: import `fs2.data.csv.generic.auto._` for everything, `fs2.data.csv.generic.auto.row._` for `RowDecoder` support only, or `fs2.data.csv.generic.auto.csvrow._` for `CsvRowDecoder` support only.
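For instance, with the full auto import, the explicit `deriveCsvRowDecoder` call above becomes unnecessary. This is a sketch based on the imports listed above; `AutoRow` is just an illustrative name:

```scala
import fs2.data.csv.generic.auto._

case class AutoRow(s: String, i: Option[Int], j: Int)

// The CsvRowDecoder[AutoRow] instance is derived automatically.
val autoRows = withh.through(decodeRow[IO, String, AutoRow])
```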
This project builds using mill. You can install mill yourself or use the provided `millw` wrapper; in that case, replace `mill` with `./millw` in the following commands:
- compile everything: `mill __.compile`
- compile & run all tests: `mill __.test`
- run benchmarks (you can provide JMH arguments in the end): `mill '__.benchmarks[2.13.1].runJmh'`