CSVLoader refactor #153
This refactor makes it easier to reproduce a model which was trained using data loaded via CSVLoader. Previously CSVLoader produced a ListDataSource with a specific provenance, so automatic reproduction required special casing everything which used CSVLoader. Now it is refactored to sit on top of RowProcessor and produce a regular CSVDataSource.

There are two breaking changes:
- CSVLoader.loadDataSource now returns DataSource<T> (an instance of CSVDataSource<T>) instead of ListDataSource<T>.
- CSVLoader.loadDataSource is now lazy and does not cache the CSV rows. This causes the parsing exceptions to be thrown when the source is iterated rather than when loadDataSource is called.
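The shift from eager to lazy loading described in the second breaking change can be illustrated with a minimal, library-free sketch. It does not use the Tribuo API: `LazyParsingSketch`, `parseRow`, `loadEagerly` and `loadLazily` are hypothetical names invented for illustration, and the row format is an assumption. It only shows why a malformed row surfaces at iteration time once parsing is deferred.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class LazyParsingSketch {

    /** Hypothetical row parser: splits on commas and throws NumberFormatException on bad input. */
    static double[] parseRow(String row) {
        String[] parts = row.split(",");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Double.parseDouble(parts[i].trim());
        }
        return values;
    }

    /** Eager loading: every row is parsed and buffered up front, so a bad row fails inside this call. */
    static List<double[]> loadEagerly(List<String> rows) {
        List<double[]> parsed = new ArrayList<>();
        for (String row : rows) {
            parsed.add(parseRow(row));
        }
        return parsed;
    }

    /** Lazy loading: returns immediately; each row is parsed only when the iterator reaches it. */
    static Iterable<double[]> loadLazily(List<String> rows) {
        return () -> new Iterator<double[]>() {
            private final Iterator<String> inner = rows.iterator();

            @Override
            public boolean hasNext() {
                return inner.hasNext();
            }

            @Override
            public double[] next() {
                return parseRow(inner.next());
            }
        };
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("1.0,2.0", "oops,3.0");
        // Constructing the lazy source does not parse anything, so no exception is thrown here.
        Iterable<double[]> source = loadLazily(rows);
        try {
            for (double[] row : source) {
                System.out.println("parsed a row with " + row.length + " values");
            }
        } catch (NumberFormatException e) {
            System.out.println("parse failure surfaced during iteration");
        }
    }
}
```

Code that previously caught parsing exceptions around the load call would, under this pattern, need to catch them around the point where the source is consumed instead.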
/**
 * Constructs a field processor which extracts a single double valued feature from the specified field name.
 * <p>
 * Generates features named "fieldName@value", and does not throw an exception if the value failed to parse.
It's not 100% clear on first reading that fieldName is the name of the field while value is literally the string value. Is there a way to make this clearer? Maybe make fieldName a reference in the javadoc?
Sure.
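To make the naming convention discussed in this thread concrete, here is a small standalone sketch. It is not the Tribuo implementation: `DoubleFieldSketch`, `featureName` and `parse` are hypothetical names, and the behaviour is only what the quoted javadoc states, namely that the feature name is the field name followed by the literal suffix "@value" and that an unparseable value does not raise an exception.

```java
import java.util.Optional;

public class DoubleFieldSketch {

    /** Builds the feature name: the field's name plus the literal string "@value". */
    static String featureName(String fieldName) {
        return fieldName + "@value";
    }

    /** Parses the raw field contents, returning empty instead of throwing when it is not a double. */
    static Optional<Double> parse(String rawValue) {
        try {
            return Optional.of(Double.parseDouble(rawValue));
        } catch (NumberFormatException e) {
            // Assumption for this sketch: an unparseable value simply produces no feature.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(featureName("height"));
        System.out.println(parse("1.5").isPresent());
        System.out.println(parse("not-a-number").isPresent());
    }
}
```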
   * @throws IOException If the disk read failed.
   */
- public ListDataSource<T> loadDataSource(Path csvPath, Set<String> responseNames) throws IOException {
+ public DataSource<T> loadDataSource(Path csvPath, Set<String> responseNames) throws IOException {
You mention it in the commit message but not here: we should document that the order of these responseNames is the order used in the generated responses, so users should pass an ordered set.
Ok.
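The point about ordered sets can be shown with the standard library alone. In this sketch `ResponseOrderSketch` and `responseOrder` are hypothetical names standing in for any code that walks the response name set. A LinkedHashSet iterates in insertion order, whereas the iteration order of a plain HashSet is unspecified.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ResponseOrderSketch {

    /** Hypothetical helper: lists the response names in whatever order the set iterates them. */
    static List<String> responseOrder(Set<String> responseNames) {
        return new ArrayList<>(responseNames);
    }

    public static void main(String[] args) {
        // LinkedHashSet preserves insertion order, so the output order is deterministic.
        Set<String> ordered = new LinkedHashSet<>();
        ordered.add("width");
        ordered.add("height");
        ordered.add("depth");
        System.out.println(responseOrder(ordered)); // prints [width, height, depth]
    }
}
```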
JackSullivan left a comment
Looks good apart from the documentation changes mentioned.
Data/src/main/java/org/tribuo/data/columnar/processors/field/DoubleFieldProcessor.java
JackSullivan left a comment
Looks good to me
Description
This change rebases CSVLoader on top of CSVDataSource, instead of reading the CSVs itself and processing the examples into a ListDataSource.

There are currently two breaking changes, both of which are sufficiently minor that we're ok with introducing them in a minor release.

- CSVLoader.loadDataSource overloads now return DataSource<T> rather than ListDataSource<T>, and the source they return is an instance of CSVDataSource<T>. This is unavoidable unless we want to read the CSVDataSource directly into a ListDataSource on construction, which might cause provenance issues.
- CSVLoader.loadDataSource is now lazy, like CSVDataSource itself. Previously it read the whole CSV file and buffered the examples in memory, triggering any parsing exceptions during the call. Now the file is read as it's iterated, so the parsing exceptions are triggered on loading into a dataset, and the examples are not cached. We could modify CSVDataSource to have a cache flag which recovered the original behaviour.

This work caused a few changes in DoubleFieldProcessor and FieldResponseProcessor to allow them to behave in the same way as the old CSVLoader parsing code. All these options are off by default, so the current behaviour is preserved. I also went through and tagged all the method overrides where the superclass method was marked deprecated in PR #150, which removes all the compile time deprecation warnings.

There is one further behaviour change: multi-output responses are generated in a different order. This seems to be because the set iteration order is not fixed, and so could have broken the test at any time in the old code. The test was fixed by using a LinkedHashSet, and we should probably specify in the docs that this should be used to ensure consistent iteration order.

Motivation

CSVLoader currently emits a ListDataSource with a special provenance class. This isn't configurable, which means it needs to be special cased in the reproducibility system, and that seems unnecessary. After this change CSVLoader will emit a CSVDataSourceProvenance which can be used to reconstruct the data without user intervention or special casing.