License: LGPLv3
Author: Uvarov Michael
A module for working with strings. A string is a flat list of Unicode characters.
All Unicode operations are described in the Unicode standards. This library implements only the following documents:
- UAX 44 Unicode Character Database
ux_string uses ux_char and ux_unidata.
ux_uca uses ux_char and ux_unidata.
ux_char uses ux_unidata.
ux_unidata is for internal data access.
This module provides functions for operations on UNIDATA. UNIDATA contains data about Unicode characters.
- to_nfc/1
- to_nfd/1
- to_nfkd/1
- to_nfkc/1
- is_nfc/1
- is_nfd/1
- is_nfkc/1
- is_nfkd/1
- to_lower/1
- to_upper/1
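
As a point of comparison only, the four normalization forms can be tried with the `unicode` module that ships with Erlang/OTP 20 and later. This sketch does not use ux; the module name `nf_demo` is made up for illustration, and it is an assumption that `ux_string:to_nfc/1` and the related functions give the same results on these inputs.

```erlang
-module(nf_demo).
-export([canonical/0, compatibility/0]).

%% Canonical forms: U+00E9 (e with acute) decomposes into "e" followed by
%% U+0301 (combining acute accent), and composes back again.
canonical() ->
    Nfd = unicode:characters_to_nfd_list([16#E9]),
    Nfc = unicode:characters_to_nfc_list(Nfd),
    {Nfd, Nfc}.

%% Compatibility forms additionally replace compatibility characters:
%% U+FB01 (the "fi" ligature) becomes the two letters "f" and "i".
compatibility() ->
    unicode:characters_to_nfkd_list([16#FB01]).
```

`nf_demo:canonical()` returns `{[101,769],[233]}` and `nf_demo:compatibility()` returns `"fi"`.
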
A grapheme is a letter together with its modifiers.
- length/1
- reverse/1
- first/2
- last/2
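
The difference between counting code points and counting graphemes can also be seen with the standard `string` module of Erlang/OTP 20 and later. This is a stdlib-only sketch for comparison, not ux code; the module name `grapheme_demo` is hypothetical.

```erlang
-module(grapheme_demo).
-export([counts/1, clusters/1]).

%% Returns {NumberOfCodePoints, NumberOfGraphemeClusters} for a flat
%% list of code points.
counts(Str) ->
    {erlang:length(Str), string:length(Str)}.

%% Splits a string into grapheme clusters. A cluster made of several
%% code points (a letter plus combining marks) comes back as a nested list.
clusters(Str) ->
    string:to_graphemes(Str).
```

For the five code points `[$F, $F, $\s, $g, 16#308]`, `counts/1` returns `{5, 4}` and `clusters/1` returns `[70,70,32,[103,776]]`. The stdlib leaves single-code-point clusters as bare integers.
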
Code:
(ux@delta)11> ux_string:length("FF g̈").
4
(ux@delta)12> string:len("FF g̈").
5
(ux@delta)13> ux_string:to_graphemes("FF g̈").
["F","F"," ",[103,776]]

- explode/2,3
- html_special_chars/1 (htmlspecialchars in PHP)
- strip_tags/1,2
Code:
ux_string:explode(["==", "++", "|"], "+++-+=|==|==|=+-+++").
Result:
[[],"+-+=",[],[],[],[],"=+-","+"]
Code:
ux_string:strip_tags("<b>bold text</b>").
Result:
"bold text"

A type is a Unicode General Category.
Code:
Str = "Erlang created the field of telephone
networks analysis. His early work in scrutinizing the use of local, exchange
and trunk telephone line usage in a small community, to understand the
theoretical requirements of an efficient network led to the creation of the
Erlang formula, which became a foundational element of present day
telecommunication network studies.",
ux_string:explode_types(['Zs', 'Lu'], Str).
Result:
[[],"rlang","created","the","field","of","telephone",
"networks","analysis.",[],"is","early","work","in",
"scrutinizing","the","use","of","local,","exchange","and",
"trunk","telephone","line","usage","in","a","small",
[...]|...]
Code:
ux_string:types(Str).
Result:
['Lu','Ll','Ll','Ll','Ll','Ll','Zs','Ll','Ll','Ll','Ll',
'Ll','Ll','Ll','Zs','Ll','Ll','Ll','Zs','Ll','Ll','Ll','Ll',
'Ll','Zs','Ll','Ll','Zs','Ll'|...]

Here the atom 'Lu' means Letter, Uppercase, and 'Ll' means Letter, Lowercase. For more
about types, see the description of ux_char:type/1.
Code:
ux_string:delete_types(['Ll'], Str).
Result:
"E . H , , E , ."
Code:
ux_char:type($ ).
Result:
'Zs'

- Normative Categories:
    - Lu Letter, Uppercase
    - Ll Letter, Lowercase
    - Lt Letter, Titlecase
    - Mn Mark, Non-Spacing
    - Mc Mark, Spacing Combining
    - Me Mark, Enclosing
    - Nd Number, Decimal Digit
    - Nl Number, Letter
    - No Number, Other
    - Zs Separator, Space
    - Zl Separator, Line
    - Zp Separator, Paragraph
    - Cc Other, Control
    - Cf Other, Format
    - Cs Other, Surrogate
    - Co Other, Private Use
    - Cn Other, Not Assigned (no characters in the file have this property)
- Informative Categories:
    - Lm Letter, Modifier
    - Lo Letter, Other
    - Pc Punctuation, Connector
    - Pd Punctuation, Dash
    - Ps Punctuation, Open
    - Pe Punctuation, Close
    - Pi Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
    - Pf Punctuation, Final quote (may behave like Ps or Pe depending on usage)
    - Po Punctuation, Other
    - Sm Symbol, Math
    - Sc Symbol, Currency
    - Sk Symbol, Modifier
    - So Symbol, Other

ux_uca implements the Unicode Collation Algorithm. See Unicode Technical Standard #10.
- compare/2,3
- sort/1,2
- sort_key/1,2
- sort_array/1,2
- search/2,3,4
Code from the Erlang shell:
1> ux_uca:sort_key("a").
<<21,163,0,0,32,0,0,2,0,0,255,255>>
2> ux_uca:sort_key("abc").
<<21,163,21,185,21,209,0,0,34,0,0,4,0,0,255,255,255,255,
255,255>>
3> ux_uca:sort_key("abcd").
<<21,163,21,185,21,209,21,228,0,0,35,0,0,5,0,0,255,255,
255,255,255,255,255,255>>
Code:
ux_uca:compare("a", "a").
ux_uca:compare("a", "b").
ux_uca:compare("c", "b").
Result:
equal
lower
greater
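
UTS #10 defines sort keys so that a plain binary comparison of two keys orders the strings they were built from. Under that property, the relation between `sort_key` and `compare` can be sketched as follows. This is only an illustration: the module `key_compare` and both function names are hypothetical, and whether ux_uca implements `compare/2` this way internally is an assumption.

```erlang
-module(key_compare).
-export([compare_keys/2, sort_by_key/2]).

%% Maps Erlang term order on two sort keys onto the atoms returned by
%% the compare examples above. Binaries compare byte by byte, and a
%% binary that is a prefix of another one sorts first.
compare_keys(K, K) -> equal;
compare_keys(K1, K2) when K1 < K2 -> lower;
compare_keys(_, _) -> greater.

%% Sorts strings by a key computed once per string (hypothetical helper).
%% KeyFun is any fun from a string to a comparable key.
sort_by_key(KeyFun, Strings) ->
    Keyed = [{KeyFun(S), S} || S <- Strings],
    [S || {_Key, S} <- lists:sort(Keyed)].
```

With the keys printed above for "a" and "abc", `compare_keys/2` returns `lower`, because the third byte (0 versus 21) decides the comparison. Passing `fun ux_uca:sort_key/1` as `KeyFun` would be one way to use `sort_by_key/2` with this library.
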
Code:
Options = ux_uca_options:get_options([
{natural_sort, false},
{strength, 3},
{alternate, shifted}
]),
InStrings = ["erlang", "esl", "nitrogen", "epm", "mochiweb", "rebar", "eunit"],
OutStrings = ux_uca:sort(Options, InStrings),
[io:format("~ts~n", [S]) || S <- OutStrings],
SortKeys = [{Str, ux_uca:sort_key(Options, Str)} || Str <- OutStrings],
[io:format("~ts ~w~n", [S, K]) || {S, K} <- SortKeys],
ok.
Result:
epm
erlang
esl
eunit
mochiweb
nitrogen
rebar
epm [5631,5961,5876,0,32,32,32,0,2,2,2]
erlang [5631,6000,5828,5539,5890,5700,0,32,32,32,32,32,32,0,2,2,2,2,2,2]
esl [5631,6054,5828,0,32,32,32,0,2,2,2]
eunit [5631,6121,5890,5760,6089,0,32,32,32,32,32,0,2,2,2,2,2]
mochiweb [5876,5924,5585,5735,5760,6180,5631,5561,0,32,32,32,32,32,32,32,32,0,2,2,2,2,2,2,2,2]
nitrogen [5890,5760,6089,6000,5924,5700,5631,5890,0,32,32,32,32,32,32,32,32,0,2,2,2,2,2,2,2,2]
rebar [6000,5631,5561,5539,6000,0,32,32,32,32,32,0,2,2,2,2,2]
ok
Code:
(ux@delta)30> ux_uca:search("The quick brown fox jumps over the lazy dog.",
"fox").
{"The quick brown ","fox"," jumps over the lazy dog."}
(ux@delta)33> ux_uca:search("The quick brown fox jumps over the lazy dog.",
"cat").
false
Code:
(ux@delta)20> CF = fun(S) -> ux_uca_options:get_options([{strength,S}]) end.
#Fun<erl_eval.6.80247286>
(ux@delta)32> ux_uca:search(CF(2), "The quick brown fox jumps over the lazy dog.",
"dog", maximal).
{"The quick brown fox jumps over the lazy"," dog.",[]}
(ux@delta)21> ux_uca:search(CF(2), "fF", "F").
{[],"f","F"}
(ux@delta)22> ux_uca:search(CF(3), "fF", "F").
{"f","F",[]}
Code:
(ux@delta)20> CF = fun(S) -> ux_uca_options:get_options([{strength,S}]) end.
#Fun<erl_eval.6.80247286>
(ux@delta)27> ux_uca:search(CF(3), "! F ?S?", "! F !", 'minimal').
{"! ","F"," ?S?"}
(ux@delta)28> ux_uca:search(CF(3), "! F ?S?", "! F !", 'maximal').
{[],"! F ?","S?"}
(ux@delta)29> ux_uca:search(CF(3), "! F ?S?", "! F !", 'medium').
{[],"! F "," ?S?"}

ux_unidata stores UNIDATA information. It is for internal use only.
ux_unidata_filelist:set_source(Level, ParserType, ImportedDataTypes,
FromFile).
For example:
ux_unidata_filelist:set_source(process, blocks, all, code:priv_dir(ux) ++ "/UNIDATA/Blocks.txt").
This call loads data about Unicode blocks from priv/UNIDATA/Blocks.txt.
So, different processes can use their own UNIDATA dictionaries.
Level is one of process, application or node.
Parsers are located in the ux_unidata_parser_* modules.
The default UNIDATA files are loaded when the application tries to access them.