
Conversation

@Craigacp
Member

Description

This change rebases CSVLoader on top of CSVDataSource, instead of reading the CSVs itself and processing the examples into a ListDataSource.

There are currently two breaking changes, both of which are sufficiently minor that we're ok with introducing them in a minor release.

  • All the CSVLoader.loadDataSource overloads now return DataSource<T> rather than ListDataSource<T> and the source they return is an instance of CSVDataSource<T>. This is unavoidable unless we want to read the CSVDataSource directly into a ListDataSource on construction, which might cause provenance issues.
  • CSVLoader.loadDataSource is now lazy, like CSVDataSource itself. Previously it read the whole CSV file and buffered the examples in memory, triggering any parsing exceptions during the call. Now the file is read as it's iterated, so parsing exceptions are triggered on loading into a dataset, and the examples are not cached. We could modify CSVDataSource to have a cache flag that recovers the original behaviour.
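The difference between the old eager behaviour and the new lazy behaviour can be sketched with a small self-contained example. This is not the Tribuo API; `loadEager` and `loadLazy` are hypothetical names standing in for the old and new `loadDataSource`, and integer parsing stands in for example construction:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LazyParseDemo {
    // Eager (old behaviour): parse every row up front, so a bad
    // row throws during the load call itself.
    static List<Integer> loadEager(List<String> rows) {
        List<Integer> out = new ArrayList<>();
        for (String r : rows) {
            out.add(Integer.parseInt(r)); // throws here, inside the load call
        }
        return out;
    }

    // Lazy (new behaviour): parsing happens only as the iterator
    // is advanced, so a bad row throws during iteration instead.
    static Iterable<Integer> loadLazy(List<String> rows) {
        return () -> rows.stream().map(Integer::parseInt).iterator();
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("1", "2", "not-a-number");

        try {
            loadEager(rows);
        } catch (NumberFormatException e) {
            System.out.println("eager load failed immediately");
        }

        // The lazy load call itself succeeds; nothing has been parsed yet.
        Iterable<Integer> lazy = loadLazy(rows);
        System.out.println("lazy load returned without error");

        // The failure only surfaces when iteration reaches the bad row.
        try {
            for (int i : lazy) {
                System.out.println("parsed " + i);
            }
        } catch (NumberFormatException e) {
            System.out.println("failure deferred to iteration");
        }
    }
}
```

Under the change described above, code that relied on `loadDataSource` to validate the CSV up front would need to trigger iteration (e.g. by loading into a dataset) to surface parse errors.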

This work required a few changes in DoubleFieldProcessor and FieldResponseProcessor to allow them to behave in the same way as the old CSVLoader parsing code. All these options are off by default, so the current behaviour is preserved. I also went through and tagged all the method overrides where the superclass method was marked deprecated in PR #150, which removes all the compile-time deprecation warnings.

There is one further behaviour change: multi-output responses are generated in a different order. This appears to be because the set iteration order is not fixed, so the test could have broken at any time in the old code. The test was fixed by using a LinkedHashSet, and we should probably specify in the docs that a LinkedHashSet should be used to ensure consistent iteration order.
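The iteration-order issue is a plain java.util property, which a short sketch makes concrete (generic example, not Tribuo code; the field names are made up):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class ResponseOrderDemo {
    public static void main(String[] args) {
        // LinkedHashSet guarantees iteration in insertion order,
        // so response ordering is deterministic across runs.
        Set<String> ordered = new LinkedHashSet<>(Arrays.asList("height", "weight", "age"));
        System.out.println(ordered); // prints [height, weight, age]

        // HashSet iteration order is an implementation detail:
        // it may happen to match, but nothing guarantees it,
        // which is how a test over response order can break.
        Set<String> unordered = new HashSet<>(Arrays.asList("height", "weight", "age"));
        System.out.println(unordered);
    }
}
```

Passing the response names as a LinkedHashSet (rather than a HashSet) is what makes the generated multi-output response order reproducible.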

Motivation

CSVLoader currently emits a ListDataSource with a special provenance class. This isn't configurable, so it needs to be special-cased in the reproducibility system, which seems unnecessary. After this change CSVLoader will emit a CSVDataSourceProvenance which can be used to reconstruct the data without user intervention or special casing.

Craigacp added 2 commits July 14, 2021 23:03
This refactor makes it easier to reproduce a model which was
trained using data loaded via CSVLoader. Previously, as it produced
a ListDataSource with a specific provenance, automatic reproduction
required special-casing everything which used CSVLoader. Now it is
refactored to sit on top of RowProcessor and produce a regular
CSVDataSource.

There are two breaking changes:
- CSVLoader.loadDataSource now returns DataSource<T> (an instance of
  CSVDataSource<T>) instead of ListDataSource<T>.
- CSVLoader.loadDataSource is now lazy and does not cache the CSV rows.
  This causes the parsing exceptions to be thrown when the source is
iterated rather than when loadDataSource is called.
@Craigacp Craigacp requested a review from JackSullivan July 15, 2021 13:32
@Craigacp Craigacp added the Oracle employee This PR is from an Oracle employee label Jul 16, 2021
/**
* Constructs a field processor which extracts a single double valued feature from the specified field name.
* <p>
* Generates features named "fieldName@value", and does not throw an exception if the value failed to parse.
Member

It's not 100% clear on first reading that fieldName is the name of the field while value is literally the string value. Is there a way to make this clearer? Maybe make fieldName a reference in the javadoc?

Member Author

Sure.

* @throws IOException If the disk read failed.
*/
- public ListDataSource<T> loadDataSource(Path csvPath, Set<String> responseNames) throws IOException {
+ public DataSource<T> loadDataSource(Path csvPath, Set<String> responseNames) throws IOException {
Member

You mention it in the commit message but not here; we should document that whatever order these responseNames are in will be used in the generated responses, so users should use an ordered set.

Member Author

Ok.

Member

@JackSullivan JackSullivan left a comment

Looks good apart from the documentation changes mentioned.

JackSullivan
JackSullivan previously approved these changes Aug 5, 2021
Member

@JackSullivan JackSullivan left a comment

Looks good to me

@Craigacp Craigacp merged commit 5c9e198 into oracle:main Aug 5, 2021
@Craigacp Craigacp deleted the csv-loader-refactor branch August 5, 2021 19:50