fs2 data

A set of streaming data parsers based on fs2.

The following modules are available:

  • fs2-data-json: A JSON parser and manipulation library
  • fs2-data-json-circe: circe support for parsed JSON
  • fs2-data-json-diffson: diffson support for patching JSON streams
  • fs2-data-xml: An XML parser
  • fs2-data-csv: A CSV parser
  • fs2-data-csv-generic: generic decoder for CSV files

JSON module usage

Stream parser

To create a stream of JSON tokens from an input stream, use the tokens pipe in the fs2.data.json package:

import cats.effect._

import fs2._
import fs2.data.json._

val input = """{
              |  "field1": 0,
              |  "field2": "test",
              |  "field3": [1, 2, 3]
              |}
              |{
              |  "field1": 2,
              |  "field3": []
              |}""".stripMargin

val stream = Stream.emits(input).through(tokens[IO])
println(stream.compile.toList.unsafeRunSync())

The pipe validates the JSON structure while parsing. It reads all the JSON values in the input stream and emits events as they become available.

Selectors

Selectors can be used to select a subset of a JSON token stream.

For instance, to select and enumerate elements that are in the field3 array, you can create this selector. Only the tokens describing the values in field3 will be emitted as a result.

val selector = ".field3.[]".parseSelector[IO].unsafeRunSync()
val filtered = stream.through(filter(selector))
println(filtered.compile.toList.unsafeRunSync())

The filter syntax is as follows (a combined example is shown after the list):

  • . selects the root value; it is basically the identity filter.
  • .f selects the field named f in objects. It fails if the value it is applied to is not a JSON object.
  • .f? is similar to .f but doesn't fail in case the value it is applied to is not a JSON object.
  • .[f1, f2, ..., fn] selects only fields f1 to fn in objects. The fields are emitted wrapped in an object. It fails if the value it is applied to is not an object.
  • .[f1, f2, ..., fn]? is similar to .[f1, f2, ..., fn] but doesn't fail if the value it is applied to is not an object.
  • .[idx1, idx2, ..., idxn] selects only the elements at indices idx1, ..., idxn in arrays. The values are emitted wrapped in an array. It fails if the value it is applied to is not an array.
  • .[idx1, idx2, ..., idxn]? is similar to .[idx1, idx2, ..., idxn] but doesn't fail if the value it is applied to is not an array.
  • .[idx1:idx2] selects only elements between idx1 (inclusive) and idx2 (exclusive) in arrays. The values are emitted wrapped in an array. It fails if the value it is applied to is not an array.
  • .[idx1:idx2]? is similar to .[idx1:idx2] but doesn't fail if the value it is applied to is not an array.
  • .[] selects and enumerates the elements of an array or the values of an object. The values are not wrapped in an array or object. It fails if the value it is applied to is neither an array nor an object.
  • .[]? is similar to .[] but doesn't fail if the value it is applied to is neither an array nor an object.
  • sel1 sel2 applies selector sel1 to the root value, and selector sel2 to each selected value.
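
For example, selectors compose by chaining. The following selects the elements at indices 1 and 2 of the field3 array in the sample stream defined above (a quick sketch reusing the stream and the filter pipe from the previous example):

val composed = ".field3.[1, 2]".parseSelector[IO].unsafeRunSync()
val composedFiltered = stream.through(filter(composed))
println(composedFiltered.compile.toList.unsafeRunSync())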

AST Builder and Tokenizer

JSON ASTs can be built if you provide an implicit Builder[Json] to the values pipe. The Builder[Json] typeclass describes how JSON ASTs of type Json are built from streams.

implicit val builder: Builder[SomeJsonType] = ...
val asts = stream.through(values[F, SomeJsonType])

The asts stream emits all top-level JSON values parsed; in our example, the two objects are emitted.

If you provide an implicit Tokenizer[Json], which describes how a JSON AST is transformed into JSON events, you can apply transformations to the JSON stream. For instance, you can wrap all values in the field3 array by using this code:

implicit val tokenizer: Tokenizer[SomeJsonType] = ...
val transformed = stream.through(transform[IO, SomeJsonType](selector, json => SomeJsonObject("test" -> json)))

Circe

The fs2-data-json-circe module provides Builder and Tokenizer instances for the circe Json type. For instance, both examples above can be written this way using circe:

import fs2.data.json.circe._
import io.circe._

val asts = stream.through(values[IO, Json])
println(asts.compile.toList.unsafeRunSync())

val transformed = stream.through(transform[IO, Json](selector, json => Json.obj("test" -> json)))
println(transformed.through(values[IO, Json]).compile.toList.unsafeRunSync())

Patches

The fs2-data-json-diffson module provides some integration with diffson. It allows for patching a Json stream as it is read, emitting the patched values downstream. Patching a stream can be useful in several cases, for instance:

  • it can be used to filter out fields you don't need for further processing, before building an AST with these fields;
  • it can be used to make data from an input stream anonymous by removing names or identifiers;
  • it makes it possible to enrich an input stream with extra data you need for further processing;
  • it covers many other use cases where you need to amend input data on the fly without building the entire AST in memory.

Currently only JSON Merge Patch is supported.

In order for patches to be applied, you need a Tokenizer for the Json type the patch operates on (see above) and a Jsony instance from diffson for that same Json type.

Let's say you are using circe as your Json AST library; you can then use patches like this:

import fs2.data.json.mergepatch._

import diffson._
import diffson.circe._
import diffson.jsonmergepatch._

import io.circe._

val mergePatch: JsonMergePatch[Json] = ...

val patched = stream.through(patch(mergePatch))

println(patched.compile.toList.unsafeRunSync())
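
A concrete value for the mergePatch placeholder above can be built directly with diffson's constructors. A minimal sketch, assuming diffson's JsonMergePatch.Object constructor, which wraps a map of fields to merge:

// merge patch semantics: a null value removes the field,
// so this patch strips field2 from every patched object
val removeField2: JsonMergePatch[Json] =
  JsonMergePatch.Object(Map("field2" -> Json.Null))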

XML module usage

Stream parser

To create a stream of XML events from an input stream, use the events pipe in the fs2.data.xml package:

import cats.effect._

import fs2._
import fs2.data.xml._

val input = """<a xmlns:ns="http://test.ns">
              |  <ns:b ns:a="attribute">text</ns:b>
              |</a>
              |<a>
              |  <b/>
              |  test entity resolution &amp; normalization
              |</a>""".stripMargin

val stream = Stream.emits(input).through(events[IO])
println(stream.compile.toList.unsafeRunSync())

The pipe validates the XML structure while parsing. It reads all the XML elements in the input stream and emits events as they become available.

Resolvers

Namespaces can be resolved by using the namespaceResolver pipe.

val nsResolved = stream.through(namespaceResolver[IO])
println(nsResolved.compile.toList.unsafeRunSync())

Using the referenceResolver pipe, entity and character references can be resolved. By default, the standard xmlEntities mapping is used, but it can be replaced by any mapping you see fit.

val entityResolved = stream.through(referenceResolver[IO]())
println(entityResolved.compile.toList.unsafeRunSync())
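
For instance, extending the default mapping with a project-specific entity might look like this (a sketch assuming referenceResolver takes the entity mapping as a Map[String, String] parameter, which is what the default xmlEntities suggests; the copyright entity is purely illustrative):

// resolve &copyright; in addition to the standard XML entities
val customEntities = xmlEntities ++ Map("copyright" -> "© Gnieh")
val customResolved = stream.through(referenceResolver[IO](customEntities))
println(customResolved.compile.toList.unsafeRunSync())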

Normalization

Once entities and namespaces are resolved, the events might be numerous and can be normalized to avoid emitting too many of them. For instance, after reference resolution, consecutive text events can be merged. This is achieved by using the normalize pipe.

val normalized = entityResolved.through(normalize[IO])
println(normalized.compile.toList.unsafeRunSync())

CSV module usage

Stream parser

To create a stream of CSV rows from an input stream, use the rows pipe in the fs2.data.csv package. The default column separator is the comma (,), but this can be overridden by providing the separator parameter.

import cats.effect._

import fs2._
import fs2.data.csv._

val input = """i,s,j
              |1,test,2
              |,other,-3
              |""".stripMargin

val stream = Stream.emits(input).through(rows[IO]())
println(stream.compile.toList.unsafeRunSync())
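
For instance, parsing semicolon-separated content could look like the following (a quick sketch assuming the separator is passed as the parameter of rows, as described above):

val semicolonInput = """i;s;j
                       |1;test;2""".stripMargin

val semicolonRows = Stream.emits(semicolonInput).through(rows[IO](';'))
println(semicolonRows.compile.toList.unsafeRunSync())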

CSV Rows with headers

Rows can be converted to a CsvRow[Header] for some Header type. This class provides higher-level utilities to manipulate rows.

If your CSV file doesn't have headers, you can use the noHeaders pipe, which creates CsvRow[Nothing]:

val noh = stream.through(noHeaders[IO])
println(noh.compile.toList.unsafeRunSync())

If you want to treat the first row as a header row, you can use the headers pipe. For instance, to have headers as String:

val withh = stream.through(headers[IO, String])
println(withh.map(_.toMap).compile.toList.unsafeRunSync())

To support your own Header type, you must provide an implicit ParseableHeader[Header]. For instance, if you have a fixed set of headers represented as enumeratum enum values, you can provide an instance of ParseableHeader as follows:

import enumeratum._

sealed trait MyHeaders extends EnumEntry
object MyHeaders extends Enum[MyHeaders] {
  case object I extends MyHeaders
  case object S extends MyHeaders
  case object J extends MyHeaders
  def values = findValues
}

implicit object ParseableMyHeaders extends ParseableHeader[MyHeaders] {
  def parse(h: String) = MyHeaders.withNameInsensitive(h)
}

val withMyHeaders = stream.through(headers[IO, MyHeaders])
println(withMyHeaders.map(_.toMap).compile.toList.unsafeRunSync())

If the parse method fails for a header, the entire stream fails.
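
If you prefer to surface that failure as a value instead of letting the stream crash, fs2's regular error handling applies; for instance, attempt turns the error into a Left element:

// emits Right(row) for each decoded row and a final Left(error)
// if a header fails to parse, instead of raising on compile
val guarded = stream.through(headers[IO, MyHeaders]).attempt
println(guarded.compile.toList.unsafeRunSync())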

Decoding

Using the decode or decodeRow pipes, one can decode rows into Scala types by providing implicit instances of RowDecoder and CsvRowDecoder, respectively.

The simplest way to do this is to use the fs2-data-csv-generic module, which provides automatic derivation for case classes.

For instance, to decode to a shapeless HList:

import fs2.data.csv.generic.hlist._
import shapeless._

val decodedH = stream.tail.through(decode[IO, Option[Int] :: String :: Int :: HNil]) // tail drops the header line
println(decodedH.compile.toList.unsafeRunSync())

Cell types (Int, String, ...) can be decoded by providing implicit instances of CellDecoder. Instances for primitives and common types are already provided. You can easily define your own (a hand-written instance is sketched after the next example) or use generic derivation for coproducts:

import fs2.data.csv.generic.semiauto._

sealed trait State
case object On extends State
case object Off extends State

implicit val stateDecoder = deriveCellDecoder[State]
// use stateDecoder to derive decoders for rows...or just test:
println(stateDecoder("On"))
println(stateDecoder("Off"))

The generic derivation for cell decoders also supports renaming and deriving instances for unary product types (case classes with one field):

import fs2.data.csv.generic.semiauto._

sealed trait Advanced
@CsvValue("Active") case object On extends Advanced
case class Unknown(name: String) extends Advanced

implicit val unknownDecoder = deriveCellDecoder[Unknown] // works as we have an implicit CellDecoder[String]
implicit val advancedDecoder = deriveCellDecoder[Advanced]

println(advancedDecoder("Active")) // prints Right(On)
println(advancedDecoder("Off")) // prints Right(Unknown(Off))

You can also decode rows to case classes automatically.

import fs2.data.csv.generic.semiauto._

case class Row(s: String, i: Option[Int], j: Int)

implicit val rowDecoder = deriveCsvRowDecoder[Row]

val rows = withh.through(decodeRow[IO, String, Row])

println(rows.compile.toList.unsafeRunSync())

The case class derivation also supports default parameters:

case class Row(s: String, i: Int = 34, j: Int)

implicit val rowDecoder = deriveCsvRowDecoder[Row]

val rows = withh.through(decodeRow[IO, String, Row])

println(rows.compile.toList.unsafeRunSync())

There's also support for full auto-derivation: import fs2.data.csv.generic.auto._ for everything, fs2.data.csv.generic.auto.row._ for RowDecoder support only, or fs2.data.csv.generic.auto.csvrow._ for CsvRowDecoder support only.
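
A quick sketch of the fully automatic flavour, reusing the withh stream from above (the case class is purely illustrative):

import fs2.data.csv.generic.auto._

// no explicit deriveCsvRowDecoder call needed,
// the decoder is derived on the fly
case class AutoRow(i: Option[Int], s: String, j: Int)

val autoRows = withh.through(decodeRow[IO, String, AutoRow])
println(autoRows.compile.toList.unsafeRunSync())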

Development

This project builds using mill. You can install mill yourself or use the provided millw wrapper; in that case, replace mill with ./millw in the following commands:

  • compile everything: mill __.compile
  • compile & run all tests: mill __.test
  • run benchmarks (you can provide JMH arguments at the end): mill '__.benchmarks[2.13.1].runJmh'
