[Utf8] New component with Bytes, CodePoints and Graphemes implementations of string objects #22184
b807183 to 9edd61a
The inheritance does not really make sense. Programming against the `GenericStringInterface` does not give me any confidence about the kind of string I will receive; in other words, I could just as well program against literal PHP strings.

What would actually make sense is for the UTF-8 variations to extend `Bytes` while implementing a common `UTF8String` interface.

This approach would also be extensible in the future, e.g. to add an `ASCIIString` that extends `Bytes` and implements the `UTF8String` interface as well, since any valid ASCII string is valid UTF-8.

To put it differently: anyone capable of processing bytes is capable of processing UTF-8, anyone capable of processing UTF-8 is capable of processing ASCII, … you may continue this chain until you reach a pure Latin alphabet (e.g. `[a-z]`).
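The hierarchy proposed in this comment could look roughly like the sketch below. All type names (`Utf8StringInterface`, `Bytes`, `CodePoints`, `AsciiString`) are hypothetical illustrations of the reviewer's suggestion, not classes from the patch, and validation is done with core PCRE functions only.

```php
<?php
declare(strict_types=1);

// Hypothetical contract for anything guaranteed to hold valid UTF-8.
interface Utf8StringInterface
{
    public function length(): int; // length in code points
}

// Base type: an arbitrary sequence of bytes, no encoding guarantee.
class Bytes
{
    protected string $bytes;

    public function __construct(string $bytes)
    {
        $this->bytes = $bytes;
    }

    public function byteLength(): int
    {
        return strlen($this->bytes);
    }

    public function __toString(): string
    {
        return $this->bytes;
    }
}

class CodePoints extends Bytes implements Utf8StringInterface
{
    public function __construct(string $bytes)
    {
        // With the /u modifier, preg_match() returns false on malformed UTF-8.
        if (1 !== preg_match('//u', $bytes)) {
            throw new InvalidArgumentException('Invalid UTF-8 string.');
        }
        parent::__construct($bytes);
    }

    public function length(): int
    {
        // "." with /su matches exactly one code point, newlines included.
        return (int) preg_match_all('/./su', $this->bytes);
    }
}

final class AsciiString extends Bytes implements Utf8StringInterface
{
    public function __construct(string $bytes)
    {
        if (1 === preg_match('/[^\x00-\x7F]/', $bytes)) {
            throw new InvalidArgumentException('Invalid ASCII string.');
        }
        parent::__construct($bytes);
    }

    public function length(): int
    {
        // In ASCII, one byte is one code point.
        return strlen($this->bytes);
    }
}
```

Under this layout, a function type hinted with `Bytes` accepts all three classes, while one type hinted with `Utf8StringInterface` rejects plain `Bytes` at call time.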
            }
        } else {
            throw new InvalidArgumentException('Pattern replacement must be a valid string or array of strings.');
        }
Same problem as before, second argument can be an array only if first argument is an array.
nope, see above
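For context, the constraint raised in this exchange could be enforced with a check like the sketch below. The helper name `assertReplaceArguments()` and the last error message are hypothetical, and the thread itself disputes whether the component needs the check at all; the sketch only shows what the reviewer's rule would look like in code.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the patch): validates a pattern/replacement
 * pair before it is handed to a preg_replace()-style call.
 */
function assertReplaceArguments($pattern, $replacement): void
{
    if (!is_string($pattern) && !is_array($pattern)) {
        throw new InvalidArgumentException('Pattern must be a valid string or array of strings.');
    }

    if (!is_string($replacement) && !is_array($replacement)) {
        throw new InvalidArgumentException('Pattern replacement must be a valid string or array of strings.');
    }

    // The reviewer's rule: an array of replacements only makes sense
    // when there is an array of patterns to pair them with.
    if (is_array($replacement) && !is_array($pattern)) {
        throw new InvalidArgumentException('An array replacement requires an array pattern.');
    }
}
```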
PHP doesn't provide generic programming (as e.g. C++ "templates"), so this is the way we found to simulate generic programming.
Certainly not: a UTF-8 string is not an instance of bytes, for the purpose of this component. The return value of eg the
That is generic programming: ignoring the type of things to do similar operations on objects.
I can treat any and every UTF-8 string as a series of bytes. The current implementation of

If this is considered to be useful, then it's fine with me.

I should probably add that I truly like the initiative and effort. String handling is very complicated, and I have often thought about creating something similar. I would not take the time to review this and give constructive feedback if I considered it to be crap. So please feel encouraged, not discouraged, by all my comments. 🐱
624ee76 to 8f52667
👍 cool stuff.
In general there's a lot of same-ish code. Could it be further simplified with a common `mb_` trait using either `8bit` or `UTF-8`?
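A minimal sketch of that suggestion, assuming the mbstring extension is available: the trait holds the shared `mb_*` calls and each class only supplies its encoding. `MbStringTrait`, `ByteString` and `CodePointString` are hypothetical names, not classes from the patch.

```php
<?php
declare(strict_types=1);

// Hypothetical trait: shared mb_* logic, parameterized by the encoding.
trait MbStringTrait
{
    // '8bit' makes mbstring count bytes, 'UTF-8' makes it count code points.
    abstract protected function encoding(): string;

    public function length(): int
    {
        return mb_strlen($this->string, $this->encoding());
    }

    public function slice(int $start = 0, ?int $length = null): self
    {
        $copy = clone $this;
        $copy->string = mb_substr($this->string, $start, $length, $this->encoding());

        return $copy;
    }
}

final class ByteString
{
    use MbStringTrait;

    private string $string;

    public function __construct(string $string = '')
    {
        $this->string = $string;
    }

    protected function encoding(): string
    {
        return '8bit';
    }

    public function __toString(): string
    {
        return $this->string;
    }
}

final class CodePointString
{
    use MbStringTrait;

    private string $string;

    public function __construct(string $string = '')
    {
        $this->string = $string;
    }

    protected function encoding(): string
    {
        return 'UTF-8';
    }

    public function __toString(): string
    {
        return $this->string;
    }
}
```

This only covers operations that mbstring already exposes per encoding; grapheme clusters would still need separate code, since mbstring has no grapheme mode.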
a4fe716 to 551d790
So if I understood the component correctly, we have to instantiate an object containing the string to do UTF-8-safe manipulation and comparison? Why not have static methods like the voku/portable-utf8 package does?

BTW, maybe this was already discussed, and I'd be glad to have a thread link in that case, but why not require and use a library like

Maybe I'm too curious, but I'd like some elucidation. 👼 😉
@soullivaneuh this is an API upgrade of an existing package from @nicolas-grekas, meant to bring this package into the Symfony ecosystem (so that it is maintained by the core team rather than by @nicolas-grekas alone).
@soullivaneuh because the package from voku doesn't allow manipulating strings as byte, code point or grapheme units. Depending on your domain-specific context, you'll choose one of the 3 implementations.
@stof Seems legit indeed; I was thinking the same after posting this question. 😉
Indeed, this is maybe the package class name is
But still: why class instantiation for string manipulation? It looks a little too heavy if I just want the length of a string and a substring. Maybe it will be clearer to me when the related documentation comes. 😉
The reason for using the type system is usually to be able to use the type system. Util classes do not give you any kind of security. If I need a valid UTF-8 string I should be able to type hint that to you. Using

That being said, I still don't like the implementation, sorry.
@nicolas-grekas What about this one? Does it make sense to finish it and merge it?
I really want to finish it :)
I'm closing here so we can keep the discussion relevant to the attached patch.
Continued in #33553
…anagement with an abstract unit system (nicolas-grekas, hhamon, gharlan)

This PR was merged into the 5.0-dev branch.

Discussion
----------

[String] a new component for object-oriented strings management with an abstract unit system

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | no
| New feature?  | yes
| Deprecations? | no
| Tickets       | -
| License       | MIT
| Doc PR        | -

This is a reboot of #22184 (thanks @hhamon for working on it) and a generalization of my previous work on the topic ([patchwork/utf8](https://github.com/tchwork/utf8)).

Unlike existing libraries (including `patchwork/utf8`), this component provides a unified API for the 3 unit systems of strings: bytes, code points and grapheme clusters.

The unified API is defined by the `AbstractString` class. It has 2 direct child classes: `BinaryString` and `AbstractUnicodeString`, itself extended by `Utf8String` and `GraphemeString`.

All objects are immutable and provide clear edge-case semantics, using exceptions and/or (nullable) types!

Two helper functions are provided to create such strings:

```php
new GraphemeString('foo') == u('foo'); // when dealing with Unicode, prefer grapheme units
new BinaryString('foo') == b('foo');
```

`GraphemeString` is the most linguistic-friendly variant of them, which means it's the one ppl should use most of the time *when dealing with written text*.

Future ideas:
- improve tests
- add more docblocks (only where they'd add value!)
- consider adding more methods in the string API (`is*()?`, `*Encode()`?, etc.)
- first class Emoji support
- merge the Inflector component into this one
- use `width()` to improve `truncate()` and `wordwrap()`
- move method `slug()` to a dedicated locale-aware service class
- propose your ideas (send PRs after merge)

Out of (current) scope:
- what [intl](https://php.net/intl) provides (collations, transliterations, confusables, segmentation, etc)

Here is the unified API I'm proposing in this PR, borrowed from looking at many existing libraries, but also Java, Python, JavaScript and Go.

```php
function __construct(string $string = '');
static function unwrap(array $values): array
static function wrap(array $values): array
function after($needle, bool $includeNeedle = false, int $offset = 0): self;
function afterLast($needle, bool $includeNeedle = false, int $offset = 0): self;
function append(string ...$suffix): self;
function before($needle, bool $includeNeedle = false, int $offset = 0): self;
function beforeLast($needle, bool $includeNeedle = false, int $offset = 0): self;
function camel(): self;
function chunk(int $length = 1): array;
function collapseWhitespace(): self
function endsWith($suffix): bool;
function ensureEnd(string $suffix): self;
function ensureStart(string $prefix): self;
function equalsTo($string): bool;
function folded(): self;
function ignoreCase(): self;
function indexOf($needle, int $offset = 0): ?int;
function indexOfLast($needle, int $offset = 0): ?int;
function isEmpty(): bool;
function join(array $strings): self;
function jsonSerialize(): string;
function length(): int;
function lower(): self;
function match(string $pattern, int $flags = 0, int $offset = 0): array;
function padBoth(int $length, string $padStr = ' '): self;
function padEnd(int $length, string $padStr = ' '): self;
function padStart(int $length, string $padStr = ' '): self;
function prepend(string ...$prefix): self;
function repeat(int $multiplier): self;
function replace(string $from, string $to): self;
function replaceMatches(string $fromPattern, $to): self;
function slice(int $start = 0, int $length = null): self;
function snake(): self;
function splice(string $replacement, int $start = 0, int $length = null): self;
function split(string $delimiter, int $limit = null, int $flags = null): array;
function startsWith($prefix): bool;
function title(bool $allWords = false): self;
function toBinary(string $toEncoding = null): BinaryString;
function toGrapheme(): GraphemeString;
function toUtf8(): Utf8String;
function trim(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
function trimEnd(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
function trimStart(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
function truncate(int $length, string $ellipsis = ''): self;
function upper(): self;
function width(bool $ignoreAnsiDecoration = true): int;
function wordwrap(int $width = 75, string $break = "\n", bool $cut = false): self;
function __clone();
function __toString(): string;
```

`AbstractUnicodeString` adds these:

```php
static function fromCodePoints(int ...$codes): self;
function ascii(array $rules = []): self;
function codePoint(int $index = 0): ?int;
function folded(bool $compat = true): parent;
function normalize(int $form = self::NFC): self;
function slug(string $separator = '-'): self;
```

and `BinaryString`:

```php
static function fromRandom(int $length = 16): self;
function byteCode(int $index = 0): ?int;
function isUtf8(): bool;
function toUtf8(string $fromEncoding = null): Utf8String;
function toGrapheme(string $fromEncoding = null): GraphemeString;
```

Case insensitive operations are done with the `ignoreCase()` method.
e.g. `b('abc')->ignoreCase()->indexOf('B')` will return `1`.

For reference, CLDR transliterations (used in the `ascii()` method) are defined here:
https://github.com/unicode-org/cldr/tree/master/common/transforms

Commits
-------

dd8745a [String] add more tests
82a0095 [String] add tests
012e92a [String] a new component for object-oriented strings management with an abstract unit system
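The immutability and `ignoreCase()` semantics described in the merged PR can be illustrated with a toy class. This is a minimal sketch, not the component's code: `TinyBinaryString` is a hypothetical name, it implements only three methods, and it assumes that "immutable" means every modifier returns a changed clone.

```php
<?php
declare(strict_types=1);

// Hypothetical toy class mimicking a small part of the byte-oriented API.
final class TinyBinaryString
{
    private string $string;
    private bool $ignoreCase = false;

    public function __construct(string $string = '')
    {
        $this->string = $string;
    }

    // Returns a copy flagged as case-insensitive; the original is untouched.
    public function ignoreCase(): self
    {
        $copy = clone $this;
        $copy->ignoreCase = true;

        return $copy;
    }

    // Returns the byte offset of the needle, or null when it is absent.
    public function indexOf(string $needle, int $offset = 0): ?int
    {
        if ('' === $needle) {
            return null;
        }

        $index = $this->ignoreCase
            ? stripos($this->string, $needle, $offset)
            : strpos($this->string, $needle, $offset);

        return false === $index ? null : $index;
    }

    // Returns a new object holding the concatenation.
    public function append(string ...$suffix): self
    {
        $copy = clone $this;
        $copy->string .= implode('', $suffix);

        return $copy;
    }

    public function __toString(): string
    {
        return $this->string;
    }
}

$abc = new TinyBinaryString('abc');
$abcdef = $abc->append('d', 'ef');

echo $abc, ' ', $abcdef, "\n";                    // prints "abc abcdef"
var_dump($abc->indexOf('B'));                     // NULL
var_dump($abc->ignoreCase()->indexOf('B'));       // int(1)
```

Returning `null` rather than `false` for a missing needle is what lets the `?int` return type express the edge case.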
[edit: continued in #33553]
This is a port of tchwork/utf8 to Symfony.
tchwork/utf8 has 7M downloads on packagist, and I'd be really happy to maintain it under the Symfony umbrella.
It provides 3 classes that wrap PHP strings into objects, and deal with the 3 usual unit spaces of strings: bytes, utf8 chars and grapheme clusters.
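To see why the three unit spaces differ, here is a short sketch that measures the same string in each of them. It deliberately uses only core PHP functions (`strlen()` and PCRE with the `/u` modifier) rather than the component's classes, so it runs without any extension or dependency.

```php
<?php
declare(strict_types=1);

// "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
$string = "e\u{301}";

// Bytes: "e" takes 1 byte, U+0301 takes 2 bytes in UTF-8.
$bytes = strlen($string);

// Code points: "." with the /u modifier matches one code point at a time.
$codePoints = preg_match_all('/./su', $string);

// Grapheme clusters: \X matches one user-perceived character.
$graphemes = preg_match_all('/\X/u', $string);

printf("bytes=%d code points=%d grapheme clusters=%d\n", $bytes, $codePoints, $graphemes);
// prints "bytes=3 code points=2 grapheme clusters=1"
```

A reader sees one character, which is why the grapheme view is usually the right one for written text, while the byte view matters for storage and binary protocols.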
All 3 classes implement the `GenericStringInterface`, so that one can type hint any of them, then potentially select which unit system one wants to deal with (see above) using "converter" methods.
`GenericStringInterface` is annotated `@final` to tag it as not implementable by userland, thus allowing us to change it and add more methods later on if we want, without being blocked by our BC promise.

In order to help the implementation, the component has a PHP 7.0 requirement. It'd be nice if this could be accepted as such: it helps a lot in keeping the code clean.
Test coverage is at 100%.
A big thank you to @hhamon, who did the port.
(for cross ref, here is a related package: https://packagist.org/packages/danielstjules/stringy)