Developing the Firehose for a given protocol requires a deep understanding of that protocol, of the meaning of each bit of data. So yes: designing the schema, writing the instrumentation that extracts the right info at the right time, and writing smart contracts that exercise the different intrinsics/externals/environment functions made available by the transaction execution runtime (we call our testbeds "battlefields") are all critical and significant tasks. We usually put in a few weeks' worth of effort, especially when a new protocol is radically different from previous ones; I'd say around 4-6 weeks with 2-3 engineers. Porting instrumentation to different EVM runtimes obviously takes less time.
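To give a rough feel for the instrumentation side, here's a minimal, self-contained sketch, not our actual code: the trace type, its field names, and the "FIRE TRX_END" line prefix are all illustrative, and the real instrumentation marshals protobuf rather than JSON. The shape is the important part: at well-defined points in the execution runtime, the patched node prints one prefixed, base64-encoded line to its output, and the Firehose reader consumes those lines.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// TrxTrace stands in for the protobuf message the real instrumentation
// emits; the field names here are illustrative, not the actual schema.
type TrxTrace struct {
	Hash    string `json:"hash"`
	From    string `json:"from"`
	To      string `json:"to"`
	GasUsed uint64 `json:"gas_used"`
}

// emitTrxEnd shows the shape of an instrumentation hook: serialize the
// trace and print a single prefixed line for the reader to pick up.
// The "FIRE TRX_END" prefix is hypothetical.
func emitTrxEnd(t TrxTrace) {
	payload, err := json.Marshal(t) // the real code marshals protobuf
	if err != nil {
		panic(err)
	}
	fmt.Printf("FIRE TRX_END %s\n", base64.StdEncoding.EncodeToString(payload))
}

func main() {
	emitTrxEnd(TrxTrace{
		Hash:    "0xabc…",
		From:    "0x111…",
		To:      "0x222…",
		GasUsed: 21000,
	})
}
```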
The battlefield repo for a given protocol will include:
- scripts to boot a chain from genesis (mining or producing blocks),
- scripts to deploy one or two test contracts that exercise the features of the chain,
- scripts to execute transactions on the producing node,
- a Firehose-enabled node connected over p2p to the producing node, outputting the instrumentation.

We also found it useful to write regression tests, so we can diff outputs whenever we upgrade the node or upgrade our instrumentation (see the sketch below). Once the test chain is written, we can also turn off the producer and simply replay the chain to gather the Firehose output. This whole flow makes it efficient to iterate on Firehoses for different protocols, and has allowed steady progress without much regression.
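To illustrate the regression-test part, here's a minimal Go sketch of the kind of golden-file test a battlefield repo carries; the file names and layout are illustrative, not our actual structure:

```go
package battlefield

import (
	"bytes"
	"os"
	"testing"
)

// TestFirehoseOutputUnchanged assumes the test chain was already replayed
// through the instrumented node (outside this test), then compares the
// captured Firehose output against a committed golden file, so any node
// or instrumentation upgrade that changes the output fails loudly.
func TestFirehoseOutputUnchanged(t *testing.T) {
	expected, err := os.ReadFile("testdata/firehose.expected.dmlog")
	if err != nil {
		t.Fatal(err)
	}
	actual, err := os.ReadFile("run/firehose.actual.dmlog")
	if err != nil {
		t.Fatal(err)
	}
	if !bytes.Equal(expected, actual) {
		t.Fatalf("firehose output drifted from the golden file (expected %d bytes, got %d); diff the two files to inspect",
			len(expected), len(actual))
	}
}
```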
For Ethereum in particular, we generate identical protobuf output from both geth and OpenEthereum. The Firehose instrumentation is different (one in Go, the other in Rust), but the piece of software consuming the output of both produces a single, unified output. It makes sense for the data produced by both to be identical, so that we can abstract away the implementation: if it weren't, the data would either be implementation-specific or non-deterministic (effectively outside the protocol). It took us a day or two to port the instrumentation to each meaningful geth-derived chain (like BSC, etc.), and the same for OE-derived chains. I don't foresee any issue porting it to Erigon, but (unless Matt V. has done it overnight) we don't have it ready as of August 4th, 2021.
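A minimal sketch of that consuming piece, assuming a hypothetical "BLOCK_END <number> <hash>" line format (the real reader decodes protobuf payloads into our Ethereum block model):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Block stands in for the single block model both clients map into; in the
// real pipeline this is the Ethereum Firehose protobuf schema.
type Block struct {
	Number uint64
	Hash   string
}

// parseLine sketches the consumer: whichever client emitted the line (geth,
// instrumented in Go, or OpenEthereum, instrumented in Rust), it decodes to
// the same Block type, so everything downstream is implementation-agnostic.
func parseLine(line string) (Block, error) {
	fields := strings.Fields(line)
	if len(fields) != 3 || fields[0] != "BLOCK_END" {
		return Block{}, fmt.Errorf("unexpected line: %q", line)
	}
	number, err := strconv.ParseUint(fields[1], 10, 64)
	if err != nil {
		return Block{}, err
	}
	return Block{Number: number, Hash: fields[2]}, nil
}

func main() {
	// The same reader handles lines from either client.
	for _, line := range []string{
		"BLOCK_END 12965000 0xaaa…", // from the geth instrumentation
		"BLOCK_END 12965001 0xbbb…", // from the OpenEthereum instrumentation
	} {
		b, err := parseLine(line)
		if err != nil {
			panic(err)
		}
		fmt.Printf("block #%d %s\n", b.Number, b.Hash)
	}
}
```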
Not sure if it was clear from the doc: when Firehose is enabled and data is consumed this way, the node doesn't need any flags that make it heavier (like archive mode or tracing enabled). Resource-wise, it can be treated just like the simplest full node that validates transactions (i.e., executes all transactions and processes the state transitions for the protocol).
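To make that concrete, an invocation in this spirit is all it takes; the --firehose-enabled flag name here is hypothetical (the real switch depends on the instrumented fork), while --syncmode=full is a standard geth flag. The point is what's absent: no --gcmode=archive, no tracing APIs.

```
# hypothetical flag name; the real switch depends on the instrumented fork
geth --syncmode=full --firehose-enabled
```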
If I'm not mistaken, our Ethereum instrumentation does support using the node normally while it is being used for extraction. I'd suggest not doing that, though. Putting no load on the node other than processing transactions and extracting data ensures that downstream systems get the freshest data as fast as possible. Querying nodes while they perform write operations will inevitably cause lock contention and other slowdowns (although this affects some protocols more than others).
Hope this helps, and hope it was clear enough! Thanks for the feedback!