Fix for not all input objects supporting last_line #1222

Merged: 54 commits, Jul 24, 2023

Conversation

oliver-s-lee
Contributor

Parser now checks before accessing last_line in case it's not available, and fileinput.FileInput is now correctly wrapped with FileWrapper (so it now does have last_line)
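As a rough illustration of the "check before accessing" idea, here is a minimal sketch. TrackingReader and last_line_or_default are hypothetical names made up for this example and are not cclib's actual classes or functions; the sketch also assumes last_line means the most recently read line.

```python
import io


class TrackingReader:
    """Hypothetical wrapper that remembers the most recently read line."""

    def __init__(self, stream):
        self._stream = stream
        self.last_line = ""

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self._stream)
        self.last_line = line
        return line


def last_line_or_default(inputfile, default=""):
    """Return inputfile.last_line when present, otherwise the default."""
    # Plain file objects and StringIO have no last_line attribute, so the
    # lookup is defensive instead of assuming every input is wrapped.
    return getattr(inputfile, "last_line", default)
```

With this shape, an unwrapped stream simply yields the default, while a wrapped one reports whatever it last read.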

In theory, we probably want all input types to be wrapped with FileWrapper, but I didn't make this change because 1) there's a weird amount of IO boilerplate code and I don't understand what a lot of it is doing and 2) I'm not sure how well covered by tests IO is...

@berquist berquist linked an issue Jun 30, 2023 that may be closed by this pull request
@berquist berquist marked this pull request as draft June 30, 2023 13:54
@berquist berquist self-requested a review June 30, 2023 13:54
@berquist
Member

I've converted it to draft only so I know when you're done.

For the coverage, we don't have it turned on in GitHub right now, but you can get it by doing

python -m pytest -v --cov=cclib --cov-report=html -k 'test_parser' test

and open htmlcov/index.html in your browser.

@berquist berquist added this to the v1.8 milestone Jun 30, 2023
@oliver-s-lee
Contributor Author

Ok, right off the bat two io tests are failing because they depend on stdin, which is disabled by pytest. The recommended way around this is to emulate it with StringIO, but the tests seem to depend on stdin's specific seek method for some reason? Not really sure what's going on, but I'll try to fix them up as best I can.

try:
    stdin = io.StringIO(contents)
except TypeError:
    stdin = io.StringIO(unicode(contents))
stdin.seek = sys.stdin.seek
Contributor Author

I don't understand why this is necessary for the test
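For comparison, a minimal sketch of emulating stdin with an in-memory stream. The helper names are hypothetical and this is not the cclib test code. io.StringIO already provides its own seek(), so a seekable fake needs no borrowed method; the original test may instead have wanted to mimic a non-seekable stdin, which this sketch does not cover.

```python
import io
import sys


def read_all_from_stdin():
    """Toy consumer: rewind stdin when possible, then read everything."""
    if sys.stdin.seekable():
        sys.stdin.seek(0)
    return sys.stdin.read()


def run_with_fake_stdin(contents):
    """Temporarily replace sys.stdin with an in-memory text stream."""
    original = sys.stdin
    sys.stdin = io.StringIO(contents)
    try:
        return read_all_from_stdin()
    finally:
        # Always restore the real stdin, even if the consumer raises.
        sys.stdin = original
```

Inside a pytest test, the monkeypatch fixture's setattr could do the swap and restore instead of the manual try/finally.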

@oliver-s-lee
Contributor Author

I've had a look over the IO internals and I think there's a fair bit to do. We support lots of different file objects, which is great, but they all get processed at different points. Some processing is done in io/ccio.py, some is in parser/logfileparser.py. Some functions relating to IO are in logfileparser.py and some relating to log files are in ccio.py. Some files get opened and closed multiple times for no apparent reason. We do a lot of type checking in multiple places, and ideally it would be moved to one central location.

I think a class based approach is probably the way forward here; have one InputFile class that wraps whatever input files/stream we want to support and everything else steps away as much as possible from caring about where it's reading from. I'll hack away and see where I get.
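A minimal sketch of what such a wrapper could look like, assuming only two kinds of input (a filename or an already open stream). The InputFile shown here is purely illustrative and is not the class that ended up in cclib.

```python
import io
from pathlib import Path


class InputFile:
    """Hypothetical wrapper giving parsers one interface for any input."""

    def __init__(self, source):
        # All type checking lives here, in one central place.
        if isinstance(source, (str, Path)):
            self.name = str(source)
            self._stream = open(source, "r")
            self._owns_stream = True
        elif hasattr(source, "read"):
            self.name = getattr(source, "name", "<stream>")
            self._stream = source
            self._owns_stream = False
        else:
            raise TypeError(f"unsupported input type: {type(source).__name__}")

    def __iter__(self):
        return iter(self._stream)

    def close(self):
        # Only close streams this wrapper opened itself.
        if self._owns_stream:
            self._stream.close()
```

Callers then iterate over lines without knowing whether they came from disk or from memory.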

@oliver-s-lee
Contributor Author

For (my) reference, a non-exhaustive list of the types of input we can read from at various points:

  • A string (a filename)
  • A pathlib.Path (a filename)
  • A string (a URL)
  • A list of strings (containing any mix of the above)
  • An archived version (.gz, .zip, .bz, .bz2) of any of the above
  • An open file object (a stream, which may or may not be seekable)
At some point, some of the above get wrapped/converted to:
  • cclib.logfileparser.FileWrapper
  • fileinput.FileInput
  • another file object (with open)
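To make the archive case concrete, here is a small sketch that picks an opener from the filename suffix. It covers only .gz and .bz2 through the standard library (not .zip or .bz), and open_text, _OPENERS, and roundtrip_demo are hypothetical names, not cclib functions.

```python
import bz2
import gzip
import tempfile
from pathlib import Path

# Map a filename suffix to a callable that can open it in text mode.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_text(path):
    """Open a plain or compressed file for reading text, chosen by suffix."""
    path = Path(path)
    opener = _OPENERS.get(path.suffix.lower(), open)
    return opener(path, "rt")


def roundtrip_demo(text="energy = -76.4\n"):
    """Write text to a temporary .gz file and read it back via open_text."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "job.out.gz"
        with gzip.open(target, "wt") as handle:
            handle.write(text)
        with open_text(target) as handle:
            return handle.read()
```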

@oliver-s-lee
Contributor Author

Two parsers (ADF and Gaussian) call seek() from inside the for line in inputfile loop, which is pretty wacky, and it's only used to skip to the end of the file. This would probably be better supported by a custom exception. As far as I can tell, this is the only place we use seek() directly, so removing it would have the added benefit of removing our dependence on all input file types supporting seek, which we currently have to do a fair bit of legwork to set up.
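A sketch of the custom-exception idea, assuming a simplified parse loop. StopParsing, extract, and parse are hypothetical names for illustration, and the "Normal termination" trigger is only a placeholder condition.

```python
import io


class StopParsing(Exception):
    """Hypothetical signal raised by a parser to stop reading early."""


def extract(line):
    """Toy per-line handler that can ask the driver loop to stop."""
    if line.startswith("Normal termination"):
        raise StopParsing
    return line.strip()


def parse(inputfile):
    """Read lines until the input is exhausted or the parser asks to stop."""
    results = []
    try:
        for line in inputfile:
            results.append(extract(line))
    except StopParsing:
        pass  # finished early; no seek() on the input was needed
    return results
```

Because the loop is abandoned rather than fast-forwarded, this works equally well on unseekable streams.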

@berquist
Member

berquist commented Jul 6, 2023

FYI I think some of my type annotations are wrong and shouldn't be trusted. The code will have to be read as you've done.

a non-exhaustive list of the types of input we can read from at various points

A bad part of this is that it isn't clear what functionality other people are even using. I doubt that anyone is passing URLs directly or using the streaming functionality (that's defeated if you're gonna call seek...), and probably few are reading archives.

@oliver-s-lee
Contributor Author

Yeah no problem at all, I'll go through the type annotations at the end once everything's working ok.

Totally agree re. not clear what's actually being used. I presume at one point all these different inputs were being used by someone, but whether that's still the case is hard to say for sure. The weirdest one for me to understand is the unseekable stream, which I suspect is really just to support stdin, but could be wrong. I do wonder how many people are using the archive stuff too, but there is some sense to it I think.

@oliver-s-lee
Contributor Author

Ok, most of the grunt work on this is now done and it's ready for review. There are a few more things that would be nice to do at some point, but can't presently without badly breaking backwards compatibility (which should be mostly maintained as is). Some things I don't like:

  • The FileWrapper class is still in cclib.parser rather than cclib.io where it clearly belongs. Sadly it can't move to IO because of circular dependency issues, which would require some restructuring to get around.
  • There is some low level IO (opening files etc) happening transparently inside FileWrapper (and also by extension in ccopen) which makes it difficult to keep track of open files. Ideally opening files would happen in context managers at the top-most level.
  • FileWrapper isn't a context manager or a real file object (it's probably missing some useful methods)
  • FileWrapper only supports seek() to the start or end. Not a problem at the moment as none of the parsers do seeking anymore, but might be nice to have.
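For the context-manager point, a minimal sketch of a wrapper that closes its stream on exit and forwards everything else to the wrapped object. ManagedWrapper is a hypothetical name and does not reproduce FileWrapper's real interface.

```python
import io


class ManagedWrapper:
    """Hypothetical file wrapper usable in a with statement."""

    def __init__(self, stream):
        self._stream = stream

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stream.close()
        return False  # never swallow exceptions from the with body

    def __iter__(self):
        return iter(self._stream)

    def __getattr__(self, name):
        # Called only for attributes not found on the wrapper itself, so
        # read(), readline(), seek() and friends go to the real stream.
        return getattr(self._stream, name)
```

Delegating through __getattr__ is one way to avoid "missing useful methods", at the cost of exposing whatever the underlying stream happens to support.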

@oliver-s-lee
Contributor Author

Also the test is failing for some reason relating to logging that I can't fathom, do we overload the logging module as part of CI?

@oliver-s-lee oliver-s-lee marked this pull request as ready for review July 6, 2023 16:48
@oliver-s-lee
Contributor Author

Yeah agreed. Yes it's separate, but all logging done outside of the parser is now with the same cclib instance (instead of the root logger it was before this PR). Each parser still creates its own unique instance using the name of the logfile being parsed, which is quite nice for seeing where the parser's warnings are coming from.
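The difference between the root logger and a named package logger can be sketched as follows. The child naming scheme (a "cclib." prefix plus the logfile name) is an assumption made for this example and may not match how the parsers actually name their loggers.

```python
import logging

# One named logger for package-level messages, instead of the root logger.
package_logger = logging.getLogger("cclib")


def make_parser_logger(logfile_name):
    """Return a per-logfile logger; the naming scheme is an assumption."""
    return logging.getLogger(f"cclib.{logfile_name}")
```

Using a named logger means applications can adjust or silence the package's output without touching the root logger's configuration.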

Haha no problem. They have just closed the bug report though so what good it will do remains to be seen...

@berquist
Member

This is why I'm tiring of open source. That person obviously didn't read any of what you said and assumed you were intentionally using distutils.

We could propose an issue to PySCF since all their downstream projects would be affected.

@berquist
Member

I promise to give this a review by the end of my day today.

Member

@berquist berquist left a comment

I'm not done, but here is a first batch of questions.

# A list of previous lines to allow look-behind functionality.
self.last_lines = collections.deque([""] * 10, 10)

def sort_input(self, file_names: list) -> list:
Member

Putting this alongside the parser is ok, but the reason it was a bare function is that self isn't necessary. I could see it being an abstract class method in the future, for other parsers that can take multiple files like Molpro, but a top-level (private) function would work just as well.

Same comment about using list as a type annotation. (It translates to typing.List[Any] which isn't great.)

Contributor Author

Yeah ok I see your point. My main motivation for making it a class method was to take advantage of inheritance to automatically pick the correct sorting function, seeing as how we've already got the correct parser class at this point. Should make future additions easier because there's no need to maintain a table of functions or similar.

I'll change it to a classmethod, although if we'd rather go with non-class function I'm not totally opposed either.
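A sketch of how a classmethod lets inheritance pick the sorting function, with List[str] in place of a bare list annotation. The class names and the alphabetical rule are placeholders for illustration, not the real Turbomole ordering logic.

```python
from typing import List


class Logfile:
    """Hypothetical base parser: by default, keep the order given."""

    @classmethod
    def sort_input(cls, file_names: List[str]) -> List[str]:
        return list(file_names)


class Turbomole(Logfile):
    """Hypothetical subclass overriding the ordering for its own files."""

    @classmethod
    def sort_input(cls, file_names: List[str]) -> List[str]:
        # Placeholder rule for illustration only: alphabetical order.
        return sorted(file_names)
```

Whoever already holds the parser class can call parser_class.sort_input(names) without keeping a separate table of sorting functions.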

Contributor Author

Also, I actually rely on cclib.io.ccio.sort_turbomole_outputs() in my own code, which this change obviously breaks. For me that's not a problem, but for others who might rely on it this is a non backwards compatible change. Think it's worth adding back cclib.io.ccio.sort_turbomole_outputs() as an alias to the new class method (or whatever we end up going with)?

Member

Breaking a function like that across a major or even minor (but not patch) release is ok, at least when you consider how many people were likely using it compared to ccread or ccopen.

Keeping it as an alias is ok, and if we had an official way of doing deprecation (it's never come up) we could get rid of it after 1.8.x.

But I am thinking the method is the better approach. It just isn't clear to me yet which one of instance/class/static it should be. Class is just a good compromise.

Contributor Author

Cool, I'll add an alias and make it emit a warning on first use, that way at least the user/developer is aware they're using a deprecated function.
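One common shape for such an alias, sketched with the standard warnings module. The Turbomole class and its sorting rule here are stand-ins, and the message text is invented for the example.

```python
import warnings


class Turbomole:
    """Stand-in for the parser class that now owns the sorting logic."""

    @classmethod
    def sort_input(cls, file_names):
        return sorted(file_names)  # placeholder ordering for illustration


def sort_turbomole_outputs(file_names):
    """Deprecated alias kept for backwards compatibility."""
    warnings.warn(
        "sort_turbomole_outputs() is deprecated; use Turbomole.sort_input()",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return Turbomole.sort_input(file_names)
```

Under Python's default filters a given warning is reported once per calling location, which is close to "warn on first use" without any extra bookkeeping.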

Class or abstract makes the most sense I think, and to me class always feels more natural than abstract...

@oliver-s-lee
Contributor Author

This is why I'm tiring of open source. That person obviously didn't read any of what you said and assumed you were intentionally using distutils.

We could propose an issue to PySCF since all their downstream projects would be affected.

Yeah indeed, happily though they have now reopened the issue, so maybe some hope does yet remain :) Hmm yeah I did consider opening an issue on PySCF too, the problem for them though is that there's no easy workaround if they need that ctypeslib function (except to manually fix the logger afterwards I guess?)

@berquist
Member

Yeah indeed, happily though they have now reopened the issue, so maybe some hope does yet remain :) Hmm yeah I did consider opening an issue on PySCF too, the problem for them though is that there's no easy workaround if they need that ctypeslib function (except to manually fix the logger afterwards I guess?)

We can wait for changes on NumPy's end. PySCF isn't doing anything wrong by using ctypeslib. But letting them know would be good too, since I've seen they perform workarounds for other NumPy API changes.

Member

@berquist berquist left a comment

One last thing with log levels.

Member

@berquist berquist left a comment

🎉

Successfully merging this pull request may close these issues.

inputfile.last_line is not always available
2 participants