@BryanCutler
Member

@BryanCutler BryanCutler commented Oct 15, 2018

Apache Arrow is a standard format for in-memory columnar data. It provides a cross-language platform for systems to communicate and operate on data efficiently.

Adding Arrow support to TensorFlow Dataset will allow systems to interface with TensorFlow in a well-defined way, without the need to develop custom converters, serialize data, or write to specialized files.

This PR defines an Arrow base layer that creates a TensorFlow Dataset with an iterator over Arrow record batches, producing Tensor values for each column. This base layer is then extended to implement three Dataset Ops that consume Arrow record batches: 1) from Python memory / Pandas DataFrames, 2) from Arrow Feather files, and 3) from an input stream, using a socket client that connects to a server streaming Arrow record batches. These are implemented now because they are straightforward and provide useful initial functionality, but the design should allow for more Arrow Ops in the future.
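To make the layering described above concrete, here is a minimal sketch in plain Python. It is not the code in this PR and uses neither the TensorFlow nor the Arrow API: a "record batch" is stood in for by a dict of column lists, and the names `RecordBatchSource` and `InMemorySource` are hypothetical. It only illustrates the idea that the base layer owns the iteration over batches and columns, while each concrete source only has to say where batches come from.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Sequence, Tuple

# Stand-in for an Arrow record batch: column name -> list of values.
# Assumption for illustration only; this is not the Arrow API.
Batch = Dict[str, List[object]]


class RecordBatchSource(ABC):
    """Hypothetical base layer shared by all batch sources."""

    def __init__(self, columns: Sequence[str]) -> None:
        self.columns = list(columns)

    @abstractmethod
    def batches(self) -> Iterator[Batch]:
        """Yield columnar batches one at a time."""

    def __iter__(self) -> Iterator[Tuple[object, ...]]:
        # Shared logic: walk each batch row by row and emit one value per
        # selected column (the point where tensors would be produced).
        for batch in self.batches():
            num_rows = len(batch[self.columns[0]]) if self.columns else 0
            for row in range(num_rows):
                yield tuple(batch[name][row] for name in self.columns)


class InMemorySource(RecordBatchSource):
    """Hypothetical subclass for batches that are already in memory."""

    def __init__(self, columns: Sequence[str], data: Sequence[Batch]) -> None:
        super().__init__(columns)
        self._data = list(data)

    def batches(self) -> Iterator[Batch]:
        return iter(self._data)


batches = [
    {"id": [1, 2], "score": [0.5, 0.25]},
    {"id": [3], "score": [1.0]},
]
rows = list(InMemorySource(["id", "score"], batches))
print(rows)
```

Under this sketch, a file-backed or socket-backed source would be another subclass that overrides only `batches()`, which is the kind of extension point the description refers to.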

fixes #23001

@BryanCutler
Member Author

BryanCutler commented Oct 15, 2018

This is currently a WIP, but I wanted to open the PR to start the discussion. Still needed:

  • Make sure all reasonable data types are supported
  • Tests are currently very simple, need to expand them
  • Socket impl is only enabled for posix
  • Clean up some rough edges

@BryanCutler
Member Author

@yongtang I saw that you are adding a Parquet Dataset in #19461. There is a lot of overlap since the parquet-cpp reader is based on Arrow. I believe that the parquet-cpp reader can produce Arrow record batches that could then leverage the base classes here that iterate over the batches. It would be nice to have a common Arrow layer in TensorFlow Dataset that is easy to extend for specific Ops. Would you be interested in collaborating?

@BryanCutler
Member Author

cc @frreiss @feihugis

@yongtang
Member

@BryanCutler Yes, it would be great to have a common layer for Arrow. PR #19461 has been approved and is waiting for the bazel update (to 0.17.1) in #22449 before it can be merged. I think the Arrow part could be consolidated once it is in.

@yongtang
Member

Add @dmitrievanthony for I/O related discussion.

@ymodak ymodak self-assigned this Oct 15, 2018
@ymodak ymodak added the awaiting review label Oct 15, 2018
@dmitrievanthony
Contributor

dmitrievanthony commented Oct 16, 2018

Thanks @yongtang.

Hi @BryanCutler, the PR looks very promising! I see you have implemented only POSIX sockets so far, so I suggest having a look at the Apache Ignite Dataset (#22210). In that dataset I implemented Windows sockets as well as POSIX ones, which might be helpful.

@BryanCutler
Member Author

> Yes, it would be great to have a common layer for Arrow. PR #19461 has been approved and is waiting for the bazel update (to 0.17.1) in #22449 before it can be merged. I think the Arrow part could be consolidated once it is in.

Great, thanks @yongtang! I will keep an eye out for #19461 being merged and then we can think about consolidating it in this PR or a followup, if this one gets approved.

@BryanCutler
Member Author

> I see you have implemented only POSIX sockets so far, so I suggest having a look at the Apache Ignite Dataset

Thanks @dmitrievanthony ! I would have liked to use boost asio, but I'm not sure if it's worth adding the dependency for that. If it's been done for Windows too in Apache Ignite Dataset, I'll take a look there and try to do the same.

@yongtang
Member

@BryanCutler PR #19461 already has the boost dependency in place, so adding asio might be less of an issue. However, the boost dependency only works with bazel 0.17.1+ (which is also why the merge of #19461 is currently blocked).

@galv

galv commented Oct 17, 2018

As a suggestion, you could consider making this a separate Python package that depends on both TensorFlow and Apache Arrow to build. The tooling around this kind of Python package with binary dependencies is still very new, but it is doable. TensorFlow does export its dataset header files. They are not covered by the API stability guarantees, but they have been fairly stable from what I have observed (two method signatures changed trivially in 1.10, I believe). Of course, @mrry knows more, as the author.

I have made my own Dataset in an external project, using CMake. I don't have a ready-made example, but I know that Arrow uses CMake as well, so it should be quite feasible to have this live separately from the TensorFlow repo. You can follow up with me about that.

I mention this because tensorflow/contrib is going away.

Super cool work, by the way! I remember chatting with a Databricks engineer at the Spark Summit about how Arrow could be a great complement to Project Hydrogen.

@yongtang
Member

@BryanCutler See the related discussion about tensorflow/contrib and the dataset-related tensorflow/io, which will likely happen in TensorFlow 2.0:

tensorflow/community#18

FYI @dmitrievanthony @mrry @ewilderj @martinwicke

@BryanCutler
Member Author

Thanks for the info @galv and @yongtang, it sounds like a lot of things are in transition right now. Since contrib is going away, are no new additions being accepted there now?

@yongtang
Member

@BryanCutler When tensorflow/contrib is deprecated, Dataset-related components could be moved to the planned tensorflow/io, so it will not be a concern at that time. /cc @dmitrievanthony @mrry

In the short term, though, I guess it depends on the length of the transition period. If 2.0 is imminent, the PR may not be accepted here (it could be opened against tensorflow/io instead).

I don't know the planned date of 2.0 yet. Maybe @martinwicke has some insight into the 2.0 schedule?

@martinwicke
Member

We announced we wouldn't accept new projects into contrib when we announced 2.0.

I think this is an obvious candidate for sig-io. @ewilderj how far are we with sig-io? I think we're looking for a date, right?

Moving there will delay this for at least a few weeks as we have to set up the sig-io repo and integration. But if we were to accept this here, it would simply have to be moved to sig-io anyway, and I don't think there will be another TF release which could include this before sig-io is off the ground, so I think the extra work is not too useful.

@ewilderj
Contributor

I created the mailing list for SIG IO (announced on the developers@ list) and am recruiting people interested in that group. Once we've got a few folks signed up, we can kick off the SIG with a conference call and get things moving.

@BryanCutler
Member Author

Thanks for the replies @ewilderj and @martinwicke. I signed up to the SIG and am looking forward to the discussions, thanks!

@tensorflowbutler
Member

Nagging Reviewer @mrry: You have been added as a reviewer to this pull request. Please add your review or reassign. It has been 39 days with no activity and the awaiting review label has been applied.

@mrry mrry removed their request for review December 8, 2018 01:33
@mrry
Contributor

mrry commented Dec 8, 2018

I removed myself as reviewer, since it looks like this is going to go through the SIG IO process instead.

@ymodak ymodak requested a review from gunan December 13, 2018 18:43
@gunan gunan removed their request for review December 13, 2018 23:49
@gunan
Contributor

gunan commented Dec 13, 2018

I am not involved in sig-io.
I think you are looking for @yongtang.

@ymodak ymodak requested a review from yongtang December 14, 2018 00:03
@tensorflowbutler tensorflowbutler removed the awaiting review label Dec 14, 2018
@BryanCutler
Member Author

This PR has been moved to tensorflow/io#36 and can be closed

Development

Successfully merging this pull request may close these issues.

Add Support for Apache Arrow in TensorFlow Dataset