Complete rewrite of Parser and Tokenizer #58
- Decouple the Tokenizer and Parser:
  - No cross talk or dependency between Tokenizer and Parser, for easier debugging.
  - Parser works purely on the token list generated by the Tokenizer.
- Complete include parsing support:
  - Support the required, file, url, and classpath keywords. Include types are passed to the HoconIncludeCallback delegate.
  - Support parsing a hocon configuration that contains a pure array (should be used for array includes only).
  - An include can be declared floating in an object (merged with the fields on the same level), assigned to a field key, or declared inside an array.
  - The required, file, url, and classpath declarations must be typed exactly as "required(", "file(", "url(", and "classpath(", with no space between the keyword and the opening parenthesis.
  - Change the ExternalIncludes example to show how to use the new include implementation.
- Make the Hocon parser completely free of Akka-related implementation:
  - Configuration-related classes are moved to Hocon.Configuration.
  - Create functionality similar to the Config class in the HoconRoot class.
  - The CONFIGURATION compiler directive is removed by bumping the Hocon.Configuration netstandard version to 2.0 and including the System.Configuration.ConfigurationManager package.
- Make the Hocon parser async capable.
  - The async parser makes long-running operations possible, such as retrieving an included configuration from a url.
- Add the ability to normalize a parsed Hocon to minimize the memory footprint.
- Add a PrettyPrint feature to HoconRoot.
- Breaking changes:
  - String types are now stored as separate types, meaning:
    - An unquoted null string is resolved as a null value.
    - A quoted "null" string is resolved as the string "null".
  - Paths and keys are fully parsed; a quoted key needs to be quoted to be valid, e.g.:
    - "a.\"some quoted, key\"" is not the same as "a.some quoted, key" because commas are invalid inside an unquoted string.
    - "\"a.key.with.dots\"" will actually use the whole quoted string as a field key, while "a.key.with.dots" will be resolved as a path with 4 keys.
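
The quoted-key rule in the breaking changes can be illustrated with a small standalone sketch. `PathSketch.SplitPath` is a hypothetical helper written for this illustration, not the parser from this PR; it ignores escape sequences and the comma rule and only shows how quoting changes where a path is split.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PathSketch
{
    // Hypothetical helper, not the library's parser: splits a path on dots,
    // treating a double-quoted segment as a single key.
    public static List<string> SplitPath(string path)
    {
        var keys = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in path)
        {
            if (c == '"')
            {
                inQuotes = !inQuotes;          // quotes delimit a key; they are not part of it
            }
            else if (c == '.' && !inQuotes)
            {
                keys.Add(current.ToString());  // an unquoted dot ends the current key
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        if (inQuotes)
            throw new FormatException("Unterminated quoted key in path: " + path);

        keys.Add(current.ToString());
        return keys;
    }
}
```

Under these assumptions, `SplitPath("a.key.with.dots")` yields four keys, while `SplitPath("\"a.key.with.dots\"")` yields the single key `a.key.with.dots`.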
I haven't managed to get a deep insight yet, so more issues may arise. Right now: congrats, it's a huge amount of work 👍 Nonetheless, the problems I see are mostly related to performance. I can recommend reading this one: http://xoofx.com/blog/2016/06/13/implementing-a-markdown-processor-for-dotnet/ (more interestingly, this article is 2 years old; with new .NET APIs we can do much faster things now).
If we decide to make this library compatible with Spans and the new super duper bits of .NET's highly optimized libraries, the only parts actually allocating here would be the final HOCON objects. I think this would be both a great learning experience and a powerful selling point for the HOCON standard.
    {
        var hocon = @"a=[1,2] [3,4]";
        Assert.True(new[] { 1, 2, 3, 4 }.SequenceEqual(ConfigurationFactory.ParseString(hocon).GetIntList("a")));
        Assert.True(new[] { 1, 2, 3, 4 }.SequenceEqual(HoconParser.Parse(hocon).GetIntList("a")));
HoconParser.Parse? What has happened to ConfigurationFactory? Reminder: we have an existing API that we should keep for backward compatibility.
ConfigurationFactory static class is moved to Hocon.Configuration namespace. I guess I can keep the namespace as Hocon for the "Hocon.Configuration" project?
On the same note, should I keep the old name "Parser" instead of the new "HoconParser"?
Hocon.Configuration namespace is moved to Hocon, HoconParser is refactored back to Parser
src/HOCON.Tests/Commas.cs
Outdated
    {
        while (config_1.MoveNext())
        {
            config_2.MoveNext();
Why not sequence equal anymore?
Because for this test, we're comparing 2 KeyValuePairs from 2 different objects. The value of each KVP is a HoconField, which does not have any IEquatable implementation. I'll add an IEquatable implementation in the next commit.
IEquatable added, you can use object1.AsEnumerable().SequenceEqual(object2.AsEnumerable()) now.
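
For readers following the thread, here is a minimal sketch of why value equality on the field type makes `SequenceEqual` usable. `FieldSketch` and `EqualitySketch` are hypothetical stand-ins, not the PR's `HoconField`. The sketch relies on the fact that `KeyValuePair<,>` does not override `Equals`, so the default struct comparison calls `Equals(object)` on each member, which means the object overload has to be overridden alongside `IEquatable<T>`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a field type; the real HoconField carries more state.
public sealed class FieldSketch : IEquatable<FieldSketch>
{
    public string Value { get; }

    public FieldSketch(string value) { Value = value; }

    public bool Equals(FieldSketch other)
        => other != null && Value == other.Value;

    // The default struct equality of KeyValuePair<,> goes through Equals(object),
    // so this overload must agree with the typed one.
    public override bool Equals(object obj) => Equals(obj as FieldSketch);

    public override int GetHashCode() => Value == null ? 0 : Value.GetHashCode();
}

public static class EqualitySketch
{
    // Compares two key/value sequences element by element, in order.
    public static bool SameFields(
        IEnumerable<KeyValuePair<string, FieldSketch>> left,
        IEnumerable<KeyValuePair<string, FieldSketch>> right)
        => left.SequenceEqual(right);
}
```

Note that `SequenceEqual` is order-sensitive, so two objects with the same fields in a different enumeration order would still compare as different in this sketch.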
    namespace Hocon
    {
        public delegate string HoconIncludeCallback(HoconCallbackType callbackType, string value);
        public delegate Task<string> HoconIncludeCallbackAsync(HoconCallbackType callbackType, string value);
What use cases do we expect for async callbacks? From what I see below, this is the only reason for the entire API to be async. Even then, I guess that in the vast majority of cases there will be no resource like a file/url etc. involved, which would make ValueTask a better fit.
The main purpose is to make it non-blocking for "include url()", which can potentially wait a long time, and for large Hocon config files, because the current Tokenizer implementation does not support a string Stream. Since I'm trying to make Hocon as standalone and as generic as possible, I thought making everything async was the best move.
My big concern here is how much this feature justifies all of this. Tbh, I never knew that you could include a URL or file in your HOCON files, and I've been using it for the last 4 years. On the other hand, parsing and text processing can be made very lightweight if we don't try to make it all async. My gut feeling is that the 2 heaviest parts of the Parse...Async methods are throwing exceptions and actually calling await/returning a Task.
From personal experience: a few weeks ago I wrote a CSV parser using the latest bits of the .NET low-level API. After benchmarking it on 1.5GB files, it turned out that over 80% of memory and CPU was used by the async state machine related to async loading of files, and 8% was used by the garbage collector to recycle those tasks (the rest of the parsing pipeline was not allocating anything).
Well, I can always remove all the async/await; it's pretty easy to do since none of my code is actually asynchronous. It was there for completeness, since I wanted to support the required url() spec.
The whole point of this exercise is trying to support the full spec and make Hocon a real standalone configuration library outside of Akka (ie. just like NewtonSoft Json), so I'm just covering all the bases.
Parsing HOCON should be a boot-up process; it seems far-fetched that there would be so many calls to the parser that it justifies a complete async API, IMO.
If the purpose is to make it more maintainable, aim to make it smaller/less code instead of adding things with low end user value.
my 2 cents
Refactored it so that only the include callback is async. Would this be a better compromise?
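
The ValueTask suggestion from earlier in this thread could look roughly like the sketch below. Everything in it is an assumption made for illustration: `IncludeSketch`, its in-memory `Cache`, and the `loadAsync` parameter are hypothetical and not part of this PR. On netstandard2.0, `ValueTask<T>` also requires the System.Threading.Tasks.Extensions package.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class IncludeSketch
{
    // Hypothetical in-memory include source; a real callback might read a file or URL.
    private static readonly Dictionary<string, string> Cache =
        new Dictionary<string, string> { ["defaults.conf"] = "a = 1" };

    // Returns an already-completed ValueTask when the content is in memory,
    // and only wraps a real Task when loading is actually needed.
    public static ValueTask<string> ResolveInclude(string name, Func<string, Task<string>> loadAsync)
    {
        if (Cache.TryGetValue(name, out string cached))
            return new ValueTask<string>(cached);       // no Task allocation on this path

        return new ValueTask<string>(loadAsync(name));  // slow path, e.g. a url include
    }
}
```

The design point is that the common synchronous case pays no Task allocation, while the rare slow include can still be awaited.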
src/HOCON/HoconParser.cs
Outdated
    /// This class contains methods used to parse HOCON (Human-Optimized Config Object Notation)
    /// configuration strings.
    /// </summary>
    public class HoconParser
By default, please keep classes sealed. Inheritance makes no sense (and it hurts performance) if the target class was not designed for it to begin with.
src/HOCON/Impl/Utils.cs
Outdated
    TaskContinuationOptions.None,
    TaskScheduler.Default);

    public static TResult RunSync<TResult>(Func<Task<TResult>> func)
How do these functions differ from func().GetAwaiter().GetResult()?
To be honest, I don't know. I lifted the code from
https://github.com/aspnet/AspNetIdentity/blob/master/src/Microsoft.AspNet.Identity.Core/AsyncHelper.cs
The biggest concern is that func().GetAwaiter().GetResult() can cause a deadlock related to context. I kind of figured that Microsoft knows their own code better than I do, so I tried to code as defensively as possible.
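
As a rough answer to the question above, here is a simplified sketch of the helper pattern, assuming it matches the `TaskScheduler.Default` arguments visible in the quoted diff. `AsyncHelperSketch` is a hypothetical name, and the linked helper may do more than this. The difference from calling `func().GetAwaiter().GetResult()` directly is where `func` runs: here it is started on the thread pool, so awaits inside it do not capture the caller's SynchronizationContext and cannot wait on the thread that is blocked for the result.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncHelperSketch
{
    // Always schedules on the thread pool, regardless of the caller's context.
    private static readonly TaskFactory Factory = new TaskFactory(
        CancellationToken.None,
        TaskCreationOptions.None,
        TaskContinuationOptions.None,
        TaskScheduler.Default);

    public static TResult RunSync<TResult>(Func<Task<TResult>> func)
    {
        // StartNew returns Task<Task<TResult>>; Unwrap flattens it so that
        // blocking here waits for the inner operation to finish.
        return Factory.StartNew(func).Unwrap().GetAwaiter().GetResult();
    }
}
```

The cost is an extra thread-pool hop and a blocked calling thread, which is one more argument for keeping the parser synchronous where possible.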
src/HOCON/Impl/HoconPath.cs
Outdated
    namespace Hocon
    {
        public class HoconPath:List<string>, IEquatable<HoconPath>
sealed
src/HOCON/Impl/HoconPath.cs
Outdated
    unchecked
    {
        return this.Aggregate(seed, (current, item) => (current * modifier) + item.GetHashCode());
From what I've seen, HoconPath may be used as a key in dictionaries, and using aggregates instead of loops is a super expensive way of computing a hash code.
Changed to a loop. Internally, I only use the last key in the path as a key in the HoconObject field Dictionary, but other people may decide to use paths as keys in dictionaries. HoconPath should be fixed by that time, though I didn't put in the effort to make it an immutable list. Will this be a problem in the future?
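
The loop-based version might look something like this sketch. `PathHashSketch` and its `Seed` and `Modifier` constants are illustrative assumptions, since the PR's actual values are not shown in the quoted diff.

```csharp
using System.Collections.Generic;

public static class PathHashSketch
{
    // Arbitrary illustrative primes, not the values used in the PR.
    private const int Seed = 17;
    private const int Modifier = 23;

    public static int Combine(IReadOnlyList<string> keys)
    {
        unchecked // let the arithmetic wrap around instead of throwing on overflow
        {
            int hash = Seed;
            for (int i = 0; i < keys.Count; i++)
                hash = (hash * Modifier) + keys[i].GetHashCode();
            return hash;
        }
    }
}
```

On the mutability question: because the hash depends on the current contents, changing a path after it has been used as a dictionary key changes its hash code, and the entry can no longer be found under the new contents.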
src/HOCON/Impl/HoconTokenizer.cs
Outdated
    return false;

    return _text[_index];
    return _text.Substring(Index = pattern.Length, pattern.Length) == pattern;
I don't know which version of .NET we want to target, but I'm pretty sure we can do this check without allocating.
It was dead code; pruned.
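
For reference, one allocation-free way to do this kind of check on any .NET target is `string.CompareOrdinal` with offsets. The sketch below is a standalone illustration, and `MatchSketch.Matches` is a hypothetical helper rather than the tokenizer's actual method.

```csharp
using System;

public static class MatchSketch
{
    // Returns true when `pattern` occurs in `text` starting at `index`,
    // comparing in place instead of allocating a Substring.
    public static bool Matches(string text, int index, string pattern)
    {
        if (index < 0 || index + pattern.Length > text.Length)
            return false;

        return string.CompareOrdinal(text, index, pattern, 0, pattern.Length) == 0;
    }
}
```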
src/HOCON/Impl/HoconTokenizer.cs
Outdated
    if (Index + length > _text.Length)
        return null;

    string s = _text.Substring(Index, length);
Again, no need to allocate. The current HOCON parser can allocate 1000x more memory than the string it parses. I think that if we are going to rewrite it, we should also think about that.
Done, I've removed as many Substring calls as possible.
src/HOCON/Impl/HoconTokenizer.cs
Outdated
    private bool PullNewLine(HoconTokenizerResult tokens)
    {
        int start = Index;
        if (!Matches(Utils.NewLine))
If you check for both cases (\n and \r\n), you should not need that replacement of Unix-style endings at the initial HoconParser.Parse.
It's already handled; \n and \r\n shouldn't matter anymore.
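
A check that accepts both line endings without normalizing the input first could be sketched as below. `NewLineSketch.NewLineLength` is a hypothetical helper for illustration, not the tokenizer's `PullNewLine`.

```csharp
public static class NewLineSketch
{
    // Returns the length of the line ending at `index`:
    // 2 for "\r\n", 1 for "\n", and 0 when there is no line ending there.
    public static int NewLineLength(string text, int index)
    {
        if (index < 0 || index >= text.Length) return 0;
        if (text[index] == '\n') return 1;
        if (text[index] == '\r' && index + 1 < text.Length && text[index + 1] == '\n') return 2;
        return 0;
    }
}
```

A tokenizer can then advance its position by the returned length, so the input never needs a replacement pass.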
…f to improve performance. Side effect: Improved `include` parsing.
…ll be processed during value query
…Parser` class name back to `Parser`
# Conflicts: # src/HOCON/Impl/HoconParserException.cs
@Arkatufus looks like build server barfed on some new language features in your latest set of commits
@Aaronontheweb which C# version is our build server using? We should go to latest by default (now 7.3).
@Horusiath since we're building against the .NET Core SDK, I think you can just set the language setting in the .CSPROJ file. Build server has 7.1 installed on it by default I think, but the CSPROJ settings should override.
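
For context, the project-file setting mentioned above is the MSBuild `LangVersion` property. A minimal fragment, assuming the 7.3 version named in this thread, would be:

```xml
<!-- Fragment of a .csproj file; merge into the project's existing PropertyGroup. -->
<PropertyGroup>
  <LangVersion>7.3</LangVersion>
</PropertyGroup>
```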
This review is from a few months back and it looks like @Arkatufus made an effort to work with the new System.Memory types, as requested, but ultimately decided it wasn't worth the trouble. Still, the rest of the review is good and much appreciated.
@Arkatufus, these changes look fantastic and I appreciate the amount of time and trouble it took to implement them. Also really appreciate the amount of time and effort spent by @Horusiath and @rogeralsing reviewing the code in detail over the summer. I'm going to merge this in now so we can start applying other changes (such as porting akkadotnet/akka.net#3600 into this library) in separate PRs on top of this one, and I think we might be ready to do a new release of the library soon. Nice work.
Decouple the Tokenizer and Parser:
Complete include parsing support:
Complete self-referencing substitution support.
Support for the += array assignment operator.
Make Hocon parser to be completely free from Akka related implementation:
Add ability to normalize a parsed Hocon to minimize the memory footprint.
Add PrettyPrint feature to HoconRoot.
Add Infinity, -Infinity and NaN support for float and double value.
Add hexadecimal and octal value support for long, int, and byte value.
Breaking changes: