Add Apache Arrow Support to TensorFlow Dataset #23002
This is currently a WIP, but I wanted to open the PR to start discussion. Still needed:
@yongtang I saw that you are adding a Parquet Dataset in #19461. There is a lot of overlap since the parquet-cpp reader is based on Arrow. I believe that the parquet-cpp reader can produce Arrow record batches that could then leverage the base classes here that iterate over the batches. It would be nice to have a common Arrow layer in TensorFlow Dataset that is easy to extend for specific Ops. Would you be interested in collaborating?
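To illustrate the overlap mentioned above, here is a small sketch (the Parquet file name is a hypothetical placeholder) showing that the parquet-cpp reader, exposed in Python through pyarrow.parquet, already surfaces Arrow record batches, the same unit a shared Arrow base layer would iterate over:

```python
# Sketch only: reading Parquet via parquet-cpp yields Arrow record batches.
import pyarrow.parquet as pq

table = pq.read_table("example.parquet")  # hypothetical file path
for batch in table.to_batches():          # pyarrow.RecordBatch objects
    print(batch.num_rows, batch.schema)
```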
@BryanCutler Yes, it would be great to have a common layer for Arrow. PR #19461 has been approved and is awaiting the bazel update (to 0.17.1) in #22449 to be merged. I think the Arrow part could be consolidated once that is in.
Adding @dmitrievanthony for the I/O-related discussion.
Thanks @yongtang. Hi @BryanCutler, the PR looks very promising! I see you have implemented only POSIX sockets so far; let me suggest having a look at the Apache Ignite Dataset (#22210). In that dataset I implemented Windows sockets as well as POSIX, so it might be helpful.
Great, thanks @yongtang! I will keep an eye out for #19461 being merged, and then we can think about consolidating it in this PR or a follow-up, if this one gets approved.
Thanks @dmitrievanthony! I would have liked to use Boost.Asio, but I'm not sure it's worth adding the dependency just for that. If it's been done for Windows too in the Apache Ignite Dataset, I'll take a look there and try to do the same.
@BryanCutler PR #19461 has the boost dependency in place, so adding asio might be less of an issue. Note, though, that the boost dependency only works with bazel 0.17.1+ (which is also the reason the merge of #19461 is blocked right now).
As a suggestion, you could consider making this a separate Python package that depends on both TensorFlow and Apache Arrow to build. The tooling around these kinds of Python packages with binary dependencies is still very new, but it is doable. TensorFlow does export its dataset header files. They are not under the API stability guarantees, but they have been fairly stable from what I have observed (two method signatures changed trivially in 1.10, I believe). Of course, @mrry knows more, as the author. I myself have made my own Dataset in an external project, using CMake. I don't have a ready-made example, but I know that Arrow uses CMake as well, so it is easily doable to have this live separately from the tensorflow repo. You can follow up with me about that. I mention this because tensorflow/contrib is going away.

Super cool work, by the way! I remember chatting with a Databricks engineer at the Spark Summit about how Arrow could be a great complement to Project Hydrogen.
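To make the out-of-tree suggestion above concrete, here is a minimal sketch of how an external package can compile a custom dataset op against TensorFlow's exported headers using the stock tf.sysconfig helpers; the file names arrow_dataset_ops.cc/.so are hypothetical placeholders, not files from this PR:

```python
# Minimal sketch (not from this PR): build an out-of-tree dataset op against
# TensorFlow's exported headers. arrow_dataset_ops.cc/.so are hypothetical.
import subprocess
import tensorflow as tf

compile_flags = tf.sysconfig.get_compile_flags()  # include dir, ABI defines
link_flags = tf.sysconfig.get_link_flags()        # lib dir, -ltensorflow_framework

subprocess.check_call(
    ["g++", "-std=c++11", "-shared", "-fPIC", "arrow_dataset_ops.cc",
     "-o", "arrow_dataset_ops.so"] + compile_flags + link_flags)

# The resulting shared library can then be loaded at runtime.
arrow_module = tf.load_op_library("./arrow_dataset_ops.so")
```

A CMake-based build, as mentioned in the comment above, would simply pass the same include and link flags to its own targets.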
@BryanCutler See related discussion about …
Force-pushed from e42861a to faee235
@BryanCutler In the short term, I guess it may depend on the length of the transition period. If 2.0 is imminent, then the PR may not be accepted. I don't know the planned date of 2.0 yet. Maybe @martinwicke has some insight about the 2.0 schedule?
We announced we wouldn't accept new projects into contrib when we announced 2.0. I think this is an obvious candidate for sig-io. @ewilderj how far are we with sig-io? I think we're looking for a date, right? Moving there will delay this for at least a few weeks, as we have to set up the sig-io repo and integration. But if we were to accept this here, it would simply have to be moved to sig-io anyway, and I don't think there will be another TF release that could include this before sig-io is off the ground, so I think the extra work is not too useful.
I created the mailing list for SIG IO (announcement on the developers@ list) and am recruiting interested folks for that group. Once we've got a few people signed up, we can kick off the SIG with a conference call and get things moving.
Thanks for the replies, @ewilderj and @martinwicke. I signed up for the SIG and am looking forward to the discussions!
Force-pushed from faee235 to 3d94ae5 (commit: "… batches. Define 3 ops to read record batches: 1) from memory, 2) from Feather files, 3) from an input stream/socket")
Nagging Reviewer @mrry: You have been added as a reviewer to this pull request. Please add your review or reassign. It has been 39 days with no activity.
I removed myself as reviewer, since it looks like this is going to go through the SIG IO process instead.
I am not involved in sig-io.
This PR has been moved to tensorflow/io#36 and can be closed.
Apache Arrow is a standard format for in-memory columnar data. It provides a cross-language platform for systems to communicate and operate on data efficiently.
Adding Arrow support to TensorFlow Dataset will allow systems to interface with TensorFlow in a well-defined way, without the need to develop custom converters, serialize data, or write to specialized files.
This PR defines an Arrow base layer that creates a TensorFlow Dataset with an iterator over Arrow record batches, producing Tensor values for each column. This base layer is then extended to implement three Dataset Ops that consume Arrow record batches: 1) from Python memory / Pandas DataFrames, 2) from Arrow Feather files, 3) from an input stream, with a socket client that connects to a server streaming Arrow record batches. These are implemented now because they are straightforward and provide good initial functionality, but the design should allow for more Arrow Ops in the future.
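As background on the data model described above, the sketch below uses only stock pyarrow and tf.data APIs (not the ops added by this PR) to show how the columns of an Arrow record batch map onto per-column Tensor values; the proposed ops would perform the equivalent iteration natively inside the Dataset kernels:

```python
# Conceptual sketch, not the API added by this PR: map the columns of an Arrow
# record batch onto per-column Tensor values with plain pyarrow + tf.data.
import pandas as pd
import pyarrow as pa
import tensorflow as tf

df = pd.DataFrame({"label": [0, 1, 1], "feature": [0.1, 0.2, 0.3]})
batch = pa.RecordBatch.from_pandas(df, preserve_index=False)

# Pull each column out as a numpy array; the ops in this PR would read the
# record batch directly in C++ instead of copying through Python.
columns = tuple(batch.column(i).to_pandas().values
                for i in range(batch.num_columns))

# Each dataset element is one row: a (label, feature) pair of scalar Tensors.
dataset = tf.data.Dataset.from_tensor_slices(columns)
```

The Feather and stream ops follow the same pattern; only the source of the record batches differs.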
fixes #23001