Fix for not all input objects supporting `last_line` #1222

oliver-s-lee · 2023-06-30T11:01:56Z

Parser now checks before accessing last_line in case it's not available, and fileinput.FileInput is now correctly wrapped with FileWrapper (so it now does have last_line)
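A minimal sketch of the kind of guard described above, assuming a simplified wrapper. `LineTrackingWrapper` and `describe_last_line` are hypothetical names for illustration and are not cclib's actual `FileWrapper` or parser code.

```python
import io


class LineTrackingWrapper:
    """Hypothetical stand-in for a wrapper that remembers the line it last returned."""

    def __init__(self, stream):
        self.stream = stream
        self.last_line = None

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self.stream)  # raises StopIteration at end of input
        self.last_line = line
        return line


def describe_last_line(inputfile):
    """Return the last line read, or None if the object does not track it."""
    return getattr(inputfile, "last_line", None)


plain = io.StringIO("first\nsecond\n")
wrapped = LineTrackingWrapper(io.StringIO("first\nsecond\n"))
for _ in wrapped:
    pass

print(describe_last_line(plain))          # None: StringIO has no last_line
print(repr(describe_last_line(wrapped)))  # 'second\n'
```

Using `getattr` with a default keeps the caller working for any iterable input, while wrapped inputs still expose the extra attribute.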

In theory, we probably want all input types to be wrapped with FileWrapper, but I didn't make this change because 1) there's a weird amount of IO boilerplate code and I don't understand what a lot of it is doing and 2) I'm not sure how well covered by tests IO is...

berquist · 2023-06-30T13:57:59Z

I've converted it to draft only so I know when you're done.

For the coverage, we don't have it turned on in GitHub right now, but you can get it by doing

python -m pytest -v --cov=cclib --cov-report=html -k 'test_parser' test

and open htmlcov/index.html in your browser.

oliver-s-lee · 2023-07-05T11:10:31Z

Ok, right off the bat two io tests are failing because they depend on stdin, which is disabled by pytest. The recommended way around this is to emulate it with StringIO, but the tests seem to depend on stdin's specific seek method for some reason? Not really sure what's going on, but I'll try to fix them up as best I can.
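For reference, one way to emulate stdin with StringIO looks roughly like the sketch below. The `fake_stdin` helper and `count_lines_from_stdin` are hypothetical and not taken from the cclib test suite. Note that an `io.StringIO` is seekable, whereas a real piped stdin usually is not, so this does not reproduce the unseekable case.

```python
import io
import sys
from contextlib import contextmanager


@contextmanager
def fake_stdin(contents):
    """Temporarily replace sys.stdin with an in-memory text stream."""
    original = sys.stdin
    sys.stdin = io.StringIO(contents)
    try:
        yield sys.stdin
    finally:
        sys.stdin = original  # always restore, even if the body raises


def count_lines_from_stdin():
    """Toy consumer (hypothetical) that reads whatever sys.stdin currently is."""
    return sum(1 for _ in sys.stdin)


with fake_stdin("line one\nline two\n"):
    n_lines = count_lines_from_stdin()

print(n_lines)  # 2
```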

oliver-s-lee · 2023-07-05T11:11:12Z

test/parser/testlogfileparser.py

-                stdin = io.StringIO(contents)
-            except TypeError:
-                stdin = io.StringIO(unicode(contents))
-            stdin.seek = sys.stdin.seek


I don't understand why this is necessary for the test

oliver-s-lee · 2023-07-05T11:47:25Z

I've had a look over the IO internals and there's a fair bit to do, I think. We support lots of different file objects, which is great, but they all get processed at different points. Some processing is done in io/ccio.py, some is in parser/logfileparser.py. Some functions relating to IO are in logfileparser.py and some relating to log files are in ccio.py. Some files get opened and closed multiple times for no apparent reason. We do a lot of type checking in multiple places, and ideally it would be moved to one central location.

I think a class-based approach is probably the way forward here: have one InputFile class that wraps whatever input files/streams we want to support, and everything else steps away as much as possible from caring about where it's reading from. I'll hack away and see where I get.

oliver-s-lee · 2023-07-05T12:03:01Z

For (my) reference, a non-exhaustive list of the types of input we can read from at various points:

- A string (a filename)
- A pathlib.Path (a filename)
- A string (a URL)
- A list of strings (containing any mix of the above)
- An archived version (.gz, .zip, .bz, .bz2) of any of the above
- An open file object (a stream, which may or may not be seekable)

At some point, some of the above get wrapped/converted to:

- cclib.logfileparser.FileWrapper
- fileinput.FileInput
- another file object (with open)
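To make the variety above concrete, here is a rough sketch of how a single central function might classify a source before any opening or wrapping happens. The function name, the returned labels, and the URL scheme list are assumptions for illustration only; this is not how ccio.py is written.

```python
import io
from pathlib import Path
from urllib.parse import urlparse

ARCHIVE_SUFFIXES = {".gz", ".zip", ".bz", ".bz2"}


def classify_source(source):
    """Return a coarse label for one input source (labels are illustrative)."""
    if isinstance(source, (list, tuple)):
        return [classify_source(item) for item in source]
    if hasattr(source, "read"):
        # File-like object: ask it whether seeking is supported.
        seekable = getattr(source, "seekable", lambda: False)()
        return "seekable stream" if seekable else "unseekable stream"
    if isinstance(source, str) and urlparse(source).scheme in ("http", "https", "ftp"):
        return "url"
    if isinstance(source, (str, Path)):
        suffix = Path(source).suffix.lower()
        return "archive" if suffix in ARCHIVE_SUFFIXES else "filename"
    raise TypeError(f"unsupported input type: {type(source).__name__}")


print(classify_source("water.out"))                       # filename
print(classify_source(Path("water.out.gz")))              # archive
print(classify_source("https://example.com/water.out"))   # url
print(classify_source(io.StringIO("text")))               # seekable stream
```

Doing the type checks once, in one place, is the point of the exercise; the rest of the code can then work with a single wrapped object.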

oliver-s-lee · 2023-07-05T12:08:17Z

Two parsers (ADF and Gaussian) call seek() from inside the for line in inputfile loop, which is pretty wacky, and it's only used to skip to the end of the file. This would probably be better supported by a custom exception. As far as I can tell, this is the only place we use seek() directly, so removing it would have the added benefit of removing our dependence on all input file types supporting seek, which we currently have to do a fair bit of legwork to set up.
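A sketch of the custom-exception idea, under the assumption that the main parse loop catches it. The exception name, the `extract` and `parse` functions, and the marker lines are all illustrative and not the actual parser code.

```python
import io


class StopParsing(Exception):
    """Raised by extraction code to signal that the rest of the file can be skipped."""


def extract(line, data):
    """Toy per-line handler (hypothetical); real parsers are far more involved."""
    if line.startswith("ENERGY"):
        data["energy"] = float(line.split()[1])
    elif line.startswith("END OF USEFUL OUTPUT"):
        raise StopParsing


def parse(inputfile):
    """Iterate over the input once, stopping early if a handler asks for it."""
    data = {}
    try:
        for line in inputfile:
            extract(line, data)
    except StopParsing:
        pass  # stop early without ever calling seek() on the input
    return data


result = parse(io.StringIO("ENERGY -76.4\nEND OF USEFUL OUTPUT\nENERGY 0.0\n"))
print(result)  # {'energy': -76.4}
```

Because nothing here calls seek(), the same loop works for unseekable streams.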

berquist · 2023-07-06T03:31:25Z

FYI I think some of my type annotations are wrong and shouldn't be trusted. The code will have to be read as you've done.

a non-exhaustive list of the types of input we can read from at various points

A bad part of this is that it isn't clear what functionality other people are even using. I doubt that anyone is passing URLs directly or using the streaming functionality (that's defeated if you're gonna call seek...), and probably few are reading archives.

oliver-s-lee · 2023-07-06T09:43:26Z

Yeah no problem at all, I'll go through the type annotations at the end once everything's working ok.

Totally agree re. not clear what's actually being used. I presume at one point all these different inputs were being used by someone, but whether that's still the case is hard to say for sure. The weirdest one for me to understand is the unseekable stream, which I suspect is really just to support stdin, but could be wrong. I do wonder how many people are using the archive stuff too, but there is some sense to it I think.

oliver-s-lee · 2023-07-06T16:16:45Z

Ok, most of the grunt work on this is now done and it's ready for review. There are a few more things that would be nice to do at some point, but that can't be done at present without badly breaking backwards compatibility (which should be mostly maintained as is). Some things I don't like:

- The FileWrapper class is still in cclib.parser rather than cclib.io where it clearly belongs. Sadly it can't move to IO because of circular dependency issues, which would require some restructuring to get around.
- There is some low-level IO (opening files etc.) happening transparently inside FileWrapper (and also by extension in ccopen), which makes it difficult to keep track of open files. Ideally opening files would happen in context managers at the top-most level.
- FileWrapper isn't a context manager or a real file object (it's probably missing some useful methods).
- FileWrapper only supports seek() to the start or end. Not a problem at the moment as none of the parsers do seeking anymore, but might be nice to have.
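As an illustration of the context-manager point, a wrapper could look something like this sketch. `ManagedWrapper` is a hypothetical class, not the real FileWrapper, and delegating unknown attributes with `__getattr__` is just one possible way to fill in the missing file methods.

```python
import io


class ManagedWrapper:
    """Hypothetical wrapper: context manager plus delegation to the wrapped file."""

    def __init__(self, stream):
        self._stream = stream
        self.last_line = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._stream.close()
        return False  # do not swallow exceptions

    def __iter__(self):
        for line in self._stream:
            self.last_line = line
            yield line

    def __getattr__(self, name):
        # Only called for attributes not found on the wrapper itself,
        # so read(), seek(), tell() and so on fall through to the stream.
        return getattr(self._stream, name)


raw = io.StringIO("a\nb\n")
with ManagedWrapper(raw) as wrapped:
    lines = list(wrapped)
    wrapped.seek(0)             # delegated to the underlying stream
    first = wrapped.readline()  # also delegated

print(lines, repr(first), raw.closed)  # ['a\n', 'b\n'] 'a\n' True
```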

oliver-s-lee · 2023-07-06T16:17:28Z

Also, the test is failing for some reason relating to logging that I can't fathom. Do we overload the logging module as part of CI?

oliver-s-lee · 2023-07-20T10:04:16Z

Yeah agreed. Yes it's separate, but all logging done outside of the parser is now with the same cclib instance (instead of the root logger it was before this PR). Each parser still creates its own unique instance using the name of the logfile being parsed, which is quite nice for seeing where the parser's warnings are coming from.
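The difference between logging to the root logger and to a named package logger can be sketched as follows. The child-logger naming (`water_log`) and the in-memory handler are assumptions for illustration; per the comment above, the parsers name their loggers after the logfile being parsed, which may not be children of the package logger.

```python
import logging

# Package-level logger; a handler and level set here apply to its children too.
package_logger = logging.getLogger("cclib")


class ListHandler(logging.Handler):
    """Collects records in memory so the example needs no console output."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.name}: {record.getMessage()}")


handler = ListHandler()
package_logger.addHandler(handler)
package_logger.setLevel(logging.WARNING)

# Hypothetical per-file logger, created as a child of the package logger.
parser_logger = package_logger.getChild("water_log")
parser_logger.warning("unexpected line")
parser_logger.info("this is below the configured level")

print(handler.messages)  # ['cclib.water_log: unexpected line']
```

An application that embeds the library can then adjust `logging.getLogger("cclib")` without touching its own root logger configuration.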

Haha no problem. They have just closed the bug report though so what good it will do remains to be seen...

berquist · 2023-07-20T11:33:32Z

This is why I'm tiring of open source. That person obviously didn't read any of what you said and assumed you were intentionally using distutils.

We could propose an issue to PySCF since all their downstream projects would be affected.

berquist · 2023-07-20T13:18:51Z

I promise to give this a review by the end of my day today.

berquist

I'm not done, but here is a first batch of questions.

cclib/io/ccio.py

test/parser/testlogfileparser.py

cclib/parser/turbomoleparser.py

berquist · 2023-07-21T02:28:30Z

cclib/parser/turbomoleparser.py

-        # A list of previous lines to allow look-behind functionality.
-        self.last_lines = collections.deque([""] * 10, 10)
+
+    def sort_input(self, file_names: list) -> list:


Putting this alongside the parser is ok, but the reason it was a bare function is that self isn't necessary. I could see it being an abstract class method in the future, for other parsers that can take multiple files like Molpro, but a top-level (private) function would work just as well.

Same comment about using list as a type annotation. (It translates to typing.List[Any] which isn't great.)

Yeah ok I see your point. My main motivation for making it a class method was to take advantage of inheritance to automatically pick the correct sorting function, seeing as how we've already got the correct parser class at this point. Should make future additions easier because there's no need to maintain a table of functions or similar.

I'll change it to a classmethod, although if we'd rather go with non-class function I'm not totally opposed either.

Also, I actually rely on cclib.io.ccio.sort_turbomole_outputs() in my own code, which this change obviously breaks. For me that's not a problem, but for others who might rely on it this is a non-backwards-compatible change. Think it's worth adding back cclib.io.ccio.sort_turbomole_outputs() as an alias to the new class method (or whatever we end up going with)?

Breaking a function like that across a major or even minor (but not patch) release is ok, at least when you consider how many people were likely using it compared to ccread or ccopen.

Keeping it as an alias is ok, and if we had an official way of doing deprecation (it's never come up) we could get rid of it after 1.8.x.

But I am thinking the method is the better approach. It just isn't clear to me yet which one of instance/class/static it should be. Class is just a good compromise.

Cool, I'll add an alias and make it emit a warning on first use, that way at least the user/developer is aware they're using a deprecated function.

Class or abstract makes the most sense I think, and to me class always feels more natural than abstract...
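The arrangement discussed in this thread could be sketched roughly as below: a class method that subclasses override, plus a deprecated module-level alias. The class bodies are simplified stand-ins, and the alphabetical ordering is a placeholder assumption rather than the real Turbomole sorting rules.

```python
import warnings
from typing import List


class Logfile:
    @classmethod
    def sort_input(cls, file_names: List[str]) -> List[str]:
        """Default behaviour: leave the order untouched."""
        return list(file_names)


class Turbomole(Logfile):
    @classmethod
    def sort_input(cls, file_names: List[str]) -> List[str]:
        """Placeholder ordering (alphabetical); the real rules are not shown here."""
        return sorted(file_names)


def sort_turbomole_outputs(file_names: List[str]) -> List[str]:
    """Deprecated alias kept for backwards compatibility."""
    warnings.warn(
        "sort_turbomole_outputs() is deprecated; use Turbomole.sort_input()",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not this wrapper
    )
    return Turbomole.sort_input(file_names)


# Whichever parser class has been selected picks its own sorting via inheritance.
for parser_cls in (Logfile, Turbomole):
    print(parser_cls.__name__, parser_cls.sort_input(["b.out", "a.out"]))
```

Annotating with `List[str]` rather than bare `list` also addresses the earlier remark about the element type being left unspecified.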

cclib/parser/logfilewrapper.py

cclib/parser/logfileparser.py

oliver-s-lee · 2023-07-21T07:05:19Z

> This is why I'm tiring of open source. That person obviously didn't read any of what you said and assumed you were intentionally using distutils.
>
> We could propose an issue to PySCF since all their downstream projects would be affected.

Yeah indeed, happily though they have now reopened the issue, so maybe some hope does yet remain :) Hmm yeah I did consider opening an issue on PySCF too, the problem for them though is that there's no easy workaround if they need that ctypeslib function (except to manually fix the logger afterwards I guess?)

berquist · 2023-07-22T02:58:45Z

> Yeah indeed, happily though they have now reopened the issue, so maybe some hope does yet remain :) Hmm yeah I did consider opening an issue on PySCF too, the problem for them though is that there's no easy workaround if they need that ctypeslib function (except to manually fix the logger afterwards I guess?)

We can wait for changes on NumPy's end. PySCF isn't doing anything wrong by using ctypeslib. But letting them know would be good too, since I've seen them perform workarounds for other NumPy API changes.

berquist

One last thing with log levels.

cclib/scripts/ccget.py

berquist

🎉

oliver-s-lee mentioned this pull request Jun 30, 2023

inputfile.last_line is not always available #1220

Closed

berquist linked an issue Jun 30, 2023 that may be closed by this pull request

inputfile.last_line is not always available #1220

Closed

berquist marked this pull request as draft June 30, 2023 13:54

berquist self-requested a review June 30, 2023 13:54

berquist added this to the v1.8 milestone Jun 30, 2023

berquist added bug io labels Jun 30, 2023

oliver-s-lee commented Jul 5, 2023

oliver-s-lee force-pushed the io_fix branch from 0550619 to cc0a678 Compare July 5, 2023 16:49

oliver-s-lee force-pushed the io_fix branch from 3ab8d28 to 6e0cc41 Compare July 6, 2023 11:27

oliver-s-lee marked this pull request as ready for review July 6, 2023 16:48

oliver-s-lee added 9 commits July 17, 2023 12:16

Fixed Logfile parser always assuming its input file supports 'last_line'

0dce2c6

fileinput.FileInput is now wrapped by FileWrapper

d5b212b

Temporarily disabled non-functioning stdin tests

bbeb347

Replaced stdin IO test with StringIO equivalents (sort of)

8550031

Combined if/else

5e0f08f

Refactoring of IO code, moved handling of file object types to FileWrapper class

6ca7803

Replaced seek(0, 2) calls in ADF and Gaussian parsers with new StopParsing exception

da10408

Added whitespace

c33221d

Fixed ccopen not supporting quiet anymore

62e7fe9

berquist requested changes Jul 21, 2023

oliver-s-lee mentioned this pull request Jul 21, 2023

ccread/ccget support for cjson is not working #1234

Closed

oliver-s-lee added 13 commits July 21, 2023 08:59

Converted old print statements to use the logging mechanism

50e5af2

Fixed calling non-existent cjson.read_cjson() method (fixes cclib#1234)

d5646f7

Added type annotations for ccread

27c0251

Removed verbose kwarg

83cb04c

Corrected documentation for logname attribute

96b3bce

Switched to using type(log).__name__ to determine the 'name' of the logger chosen, rather than relying on the name of the logging object

86453ec

Added handler for root cclib logger and changed log-level depending on --verbose

9ffcf33

CJSONReaderTest now uses the ccread interface

7022148

Updated type hints for source arguments

37456a3

Fixed weird import

4f24803

Removed comments

a5d0f77

Updated signature of sort_input()

b4ad9e5

Fixed type hint

6978d52

berquist requested changes Jul 22, 2023

oliver-s-lee added 3 commits July 24, 2023 08:24

Fixed Turbomole sort not working for more complex paths

073e7be

Readded sort_turbomole_outputs() as an alias to Turbomole.sort_input() and marked it as deprecated

da083a2

ccget no longer ignores warnings by default

7258359

berquist approved these changes Jul 24, 2023

berquist merged commit b36fc21 into cclib:master Jul 24, 2023

This was referenced Aug 20, 2023

Create and test new attributes for the NBO parser #1251

Merged

Fix printing false negatives during tests #1255

Open

oliver-s-lee deleted the io_fix branch February 26, 2024 13:27

oliver-s-lee mentioned this pull request Apr 26, 2024

Order of parsing files for Turbomole #562

Closed
