CSVLoader refactor #153

Craigacp · 2021-07-15T13:32:08Z

Description

This change rebases CSVLoader on top of CSVDataSource, instead of reading the CSVs itself and processing the examples into a ListDataSource.

There are currently two breaking changes, both of which are sufficiently minor that we're OK with introducing them in a minor release.

- All the CSVLoader.loadDataSource overloads now return DataSource<T> rather than ListDataSource<T>, and the source they return is an instance of CSVDataSource<T>. This is unavoidable unless we want to read the CSVDataSource directly into a ListDataSource on construction, which might cause provenance issues.
- CSVLoader.loadDataSource is now lazy, like CSVDataSource itself. Previously it read the whole CSV file and buffered the examples in memory, triggering any parsing exceptions during the call. Now the file is read as it's iterated, so parsing exceptions are triggered when the source is loaded into a dataset, and the examples are not cached. We could modify CSVDataSource to have a cache flag which recovers the original behaviour.
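To illustrate the second breaking change, here is a minimal sketch of how the point of failure moves when parsing becomes lazy. It uses only the JDK; `LazyVsEagerSketch`, `loadEagerly` and `loadLazily` are hypothetical names for this illustration and are not part of Tribuo, and rows are simplified to single numeric strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a data source; not a Tribuo class. */
final class LazyVsEagerSketch {

    /** Eager: parses every row up front, so a bad row fails inside this call. */
    static List<Double> loadEagerly(List<String> rows) {
        List<Double> parsed = new ArrayList<>();
        for (String row : rows) {
            parsed.add(Double.parseDouble(row)); // NumberFormatException surfaces here
        }
        return parsed;
    }

    /** Lazy: returns immediately; a row is parsed only when the iterator reaches it. */
    static Iterable<Double> loadLazily(List<String> rows) {
        return () -> new Iterator<Double>() {
            private final Iterator<String> inner = rows.iterator();

            @Override
            public boolean hasNext() {
                return inner.hasNext();
            }

            @Override
            public Double next() {
                return Double.parseDouble(inner.next()); // NumberFormatException surfaces here
            }
        };
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("1.0", "not-a-number", "3.0");

        // The lazy load itself never throws; the failure appears during iteration.
        Iterable<Double> lazy = loadLazily(rows);
        try {
            for (Double value : lazy) {
                System.out.println("parsed " + value);
            }
        } catch (NumberFormatException e) {
            System.out.println("failed during iteration: " + e.getMessage());
        }

        // The eager load fails inside the load call.
        try {
            loadEagerly(rows);
        } catch (NumberFormatException e) {
            System.out.println("failed during load: " + e.getMessage());
        }
    }
}
```

Callers that relied on the old behaviour to validate a file at load time would, under this model, need to iterate the source (for example by loading it into a dataset) before assuming the file parsed cleanly.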

This work required a few changes to DoubleFieldProcessor and FieldResponseProcessor so that they can behave in the same way as the old CSVLoader parsing code. All the new options are off by default, so the current behaviour is preserved. I also went through and tagged all the method overrides whose superclass method was marked deprecated in PR #150, which removes all the compile-time deprecation warnings.

There is one further behaviour change: multi-output responses are generated in a different order. This seems to be because the set iteration order is not fixed, so the test could have broken at any time with the old code. The test was fixed by using a LinkedHashSet, and we should probably specify in the docs that an ordered set should be used to ensure consistent iteration order.
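The ordering point can be sketched with the JDK alone. `ResponseOrderSketch` and `iterationOrder` are hypothetical names used only for this illustration; the helper simply copies the set's iteration order, standing in for any code that reads response names from a `Set<String>`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration; not a Tribuo class. */
final class ResponseOrderSketch {

    /** Returns the names in whatever order the set iterates them. */
    static List<String> iterationOrder(Set<String> responseNames) {
        return new ArrayList<>(responseNames);
    }

    public static void main(String[] args) {
        // LinkedHashSet guarantees iteration in insertion order.
        Set<String> ordered = new LinkedHashSet<>();
        ordered.add("zebra");
        ordered.add("apple");
        ordered.add("mango");
        System.out.println(iterationOrder(ordered)); // prints [zebra, apple, mango]

        // HashSet makes no ordering guarantee, so this order may differ
        // from insertion order and should not be relied upon.
        Set<String> unordered = new HashSet<>(ordered);
        System.out.println(iterationOrder(unordered));
    }
}
```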

Motivation

CSVLoader currently emits a ListDataSource with a special provenance class. This isn't configurable, so it needs to be special-cased in the reproducibility system, which seems unnecessary. After this change, CSVLoader will emit a CSVDataSourceProvenance, which can be used to reconstruct the data without user intervention or special casing.

This refactor makes it easier to reproduce a model which was trained using data loaded via CSVLoader. Previously, because it produced a ListDataSource with a specific provenance, automatic reproduction required special casing everything which used CSVLoader. Now it is refactored to sit on top of RowProcessor and produce a regular CSVDataSource. There are two breaking changes:
- CSVLoader.loadDataSource now returns DataSource<T> (an instance of CSVDataSource<T>) instead of ListDataSource<T>.
- CSVLoader.loadDataSource is now lazy and does not cache the CSV rows. This causes the parsing exceptions to be thrown when the source is iterated rather than when loadDataSource is called.

JackSullivan · 2021-08-05T17:41:48Z

Data/src/main/java/org/tribuo/data/columnar/processors/field/DoubleFieldProcessor.java

    /**
     * Constructs a field processor which extracts a single double valued feature from the specified field name.
+     * <p>
+     * Generates features named "fieldName@value", and does not throw an exception if the value failed to parse.


It's not 100% clear on first reading that fieldName is the name of the field while value is literally the string "value". Is there a way to make this clearer? Maybe make fieldName a reference in the javadoc?

JackSullivan · 2021-08-05T19:22:36Z

Data/src/main/java/org/tribuo/data/csv/CSVLoader.java

     * @throws IOException If the disk read failed.
     */
-    public ListDataSource<T> loadDataSource(Path csvPath, Set<String> responseNames) throws IOException {
+    public DataSource<T> loadDataSource(Path csvPath, Set<String> responseNames) throws IOException {


You mention it in the commit message but not here. We should document that the generated responses will use whatever order these responseNames are in, so users should use an ordered set.

JackSullivan

Looks good apart from the documentation changes mentioned.

Data/src/main/java/org/tribuo/data/columnar/processors/field/DoubleFieldProcessor.java

…oubleFieldProcessor.java

JackSullivan

Looks good to me

Data/src/main/java/org/tribuo/data/columnar/processors/field/DoubleFieldProcessor.java

…oubleFieldProcessor.java

Craigacp added 2 commits July 14, 2021 23:03

Fixing formatting in FieldResponseProcessor.

7647d10

Craigacp requested a review from JackSullivan July 15, 2021 13:32

Craigacp added the "Oracle employee" label (This PR is from an Oracle employee) Jul 16, 2021

JackSullivan reviewed Aug 5, 2021

Data/src/main/java/org/tribuo/data/csv/CSVLoader.java (resolved)

JackSullivan reviewed Aug 5, 2021

Documentation updates after review.

9d4522e

JackSullivan reviewed Aug 5, 2021

Data/src/main/java/org/tribuo/data/columnar/processors/field/DoubleFieldProcessor.java (outdated, resolved)

Update Data/src/main/java/org/tribuo/data/columnar/processors/field/D…

75e0c62

…oubleFieldProcessor.java

JackSullivan previously approved these changes Aug 5, 2021

JackSullivan reviewed Aug 5, 2021

Data/src/main/java/org/tribuo/data/columnar/processors/field/DoubleFieldProcessor.java (outdated, resolved)

Update Data/src/main/java/org/tribuo/data/columnar/processors/field/D…

f0b2688

…oubleFieldProcessor.java

JackSullivan dismissed their stale review via f0b2688 August 5, 2021 19:44

JackSullivan approved these changes Aug 5, 2021

Craigacp merged commit 5c9e198 into oracle:main Aug 5, 2021

Craigacp deleted the csv-loader-refactor branch August 5, 2021 19:50
