u8c

Fast validating UTF-8 encoder/decoder for C

The decoding function is based on the work of Bjoern Hoehrmann: http://bjoern.hoehrmann.de/utf-8/decoder/dfa

with the following differences:

The implementation is faster than the one presented on the original site.
The code is easier to read and to relate to the state machines.
It has been extended to accept C0 80 as the encoding of U+0000 (see https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8) and treats '\0' as the string terminator.
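To illustrate the C0 80 extension, here is a minimal sketch of a decoder that follows the u8next contract described in the API section. It is not the u8c implementation: it uses plain range checks instead of the state machine, the name sketch_next is hypothetical, and returning 0 at the string terminator is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical stand-in, NOT the u8c source. Returns the number of bytes
   encoding the first codepoint of s, or -1 if the encoding is not valid.
   On error *c holds the first byte of s. Returns 0 at '\0' (assumption). */
int sketch_next(const char *s, int *c)
{
    const unsigned char *p = (const unsigned char *)s;
    int dummy, len, cp, min, i;

    if (c == NULL) c = &dummy;      /* c is optional */
    *c = p[0];                      /* default result, kept on error */

    if (p[0] == 0x00) return 0;     /* string terminator */
    if (p[0] < 0x80) return 1;      /* ASCII */

    /* Modified UTF-8: C0 80 is accepted as U+0000 */
    if (p[0] == 0xC0 && p[1] == 0x80) { *c = 0; return 2; }

    if (p[0] >= 0xC2 && p[0] <= 0xDF)      { len = 2; cp = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0)        { len = 3; cp = p[0] & 0x0F; min = 0x800; }
    else if (p[0] >= 0xF0 && p[0] <= 0xF4) { len = 4; cp = p[0] & 0x07; min = 0x10000; }
    else return -1;                 /* stray continuation or invalid lead byte */

    for (i = 1; i < len; i++) {
        /* a '\0' is not a continuation byte, so we never read past the end */
        if ((p[i] & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    /* reject overlong forms, surrogates and values above U+10FFFF */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return -1;

    *c = cp;
    return len;
}
```

With this sketch, "\xC0\x80" decodes to codepoint 0 with length 2, while any other overlong form such as "\xC0\xAF" is rejected with -1.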

To use it, include u8c.h in your code and link against u8c.c

API

 int u8next(char *s [, int *c]) --> Returns the number of bytes encoding the first
                                    codepoint in the string s.
                                    If c is not NULL, stores the codepoint in *c.
                                    If the encoding is not valid, returns -1 and stores
                                    in *c the first byte of the string.
                                    Note that c is optional; omitting it is the same as
                                    passing NULL.

 int u8strlen(char *s)          --> Returns the number of codepoints in the string s.
                                    Does NOT validate the string encoding!

 int u8codepoint(char *s)       --> Returns the codepoint encoded in s or the first
                                    byte if the encoding is not valid.

 int u8encode(char *s, int c)   --> Stores the encoding of codepoint c in s and
                                    places a string terminator ('\0') at the end.
                                    There must be *at least* 5 bytes allocated in the
                                    string s (up to 4 encoding bytes plus the terminator).
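The 5-byte requirement of u8encode and the non-validating nature of u8strlen can be seen in the following sketch. The functions sketch_encode and sketch_count are hypothetical stand-ins, not the u8c source. The return value of sketch_encode (the number of encoding bytes) and the choice to encode U+0000 as C0 80, mirroring the decoder, are assumptions of this sketch.

```c
#include <stddef.h>

/* Hypothetical sketch: writes the UTF-8 encoding of c into s, followed by
   '\0', and returns the number of encoding bytes (0 if c is not encodable).
   s must hold at least 5 bytes: up to 4 encoding bytes plus the terminator. */
int sketch_encode(char *s, int c)
{
    unsigned char *p = (unsigned char *)s;
    int n;

    if (c < 0 || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
        n = 0;                                    /* not a valid codepoint */
    } else if (c == 0) {
        p[0] = 0xC0; p[1] = 0x80; n = 2;          /* assumption: Modified UTF-8 NUL */
    } else if (c < 0x80) {
        p[0] = (unsigned char)c; n = 1;
    } else if (c < 0x800) {
        p[0] = (unsigned char)(0xC0 | (c >> 6));
        p[1] = (unsigned char)(0x80 | (c & 0x3F));
        n = 2;
    } else if (c < 0x10000) {
        p[0] = (unsigned char)(0xE0 | (c >> 12));
        p[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        p[2] = (unsigned char)(0x80 | (c & 0x3F));
        n = 3;
    } else {
        p[0] = (unsigned char)(0xF0 | (c >> 18));
        p[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        p[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        p[3] = (unsigned char)(0x80 | (c & 0x3F));
        n = 4;
    }
    p[n] = '\0';                                  /* needs the 5th byte when n == 4 */
    return n;
}

/* Hypothetical sketch: counts codepoints by counting every byte that is not
   a continuation byte (10xxxxxx). It does not validate the encoding. */
int sketch_count(const char *s)
{
    int n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80) n++;
    return n;
}
```

Because sketch_count only looks at the top two bits of each byte, it gives a meaningful result only for strings that are already known to be valid.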
 	
  The following functions are the counterparts of the C standard functions:

 char *u8strcat(char *d, char *s)
 char *u8strchr(char *str, int c)
 int   u8strcmp(char *str1, const char *str2)  
 int   u8strncmp(const char *str1, const char *str2, size_t n) -> 'n' is the size of the buffer in bytes
 char *u8strcpy(char *dest, const char *src)
 char *u8strstr(const char *haystack, const char *needle)

 These two functions ensure that the last codepoint is not truncated:

 char *u8strncat(char *dest, const char *src, size_t n) -> 'n' is the buffer size in bytes
 char *u8strncpy(char *dest, const char *src, size_t n) -> 'n' is the buffer size in bytes
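The idea of not breaking the last codepoint can be sketched as follows. This is a hypothetical illustration (sketch_ncpy is not the u8c source): it assumes that src is valid UTF-8, that n counts the terminator, and that the result is always terminated, none of which is stated above for u8strncpy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: copies src into a buffer of n bytes, always adding
   '\0', and never leaves a partial codepoint at the end of dest. */
char *sketch_ncpy(char *dest, const char *src, size_t n)
{
    size_t len;

    if (n == 0) return dest;        /* no room, not even for the terminator */

    len = strlen(src);
    if (len > n - 1) {
        len = n - 1;
        /* src[len] is the first byte that will NOT be copied. If it is a
           continuation byte (10xxxxxx), the cut falls inside a codepoint:
           back up until the cut sits just before a lead byte. */
        while (len > 0 && ((unsigned char)src[len] & 0xC0) == 0x80)
            len--;
    }
    memcpy(dest, src, len);
    dest[len] = '\0';
    return dest;
}
```

For example, copying "a€" (bytes 61 E2 82 AC) into a 4-byte buffer yields just "a", since the three bytes of the euro sign do not fit together with the terminator.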

Compatibility with standard C functions

Strings encoded with UTF-8 are mostly compatible with the functions in the standard C library:

strtok()        Only works for delimiters with ASCII codes (01..7F).
  
strcspn()       Only works for ASCII codes (01..7F).
  
strspn()        Only works for ASCII codes (01..7F).
  
strpbrk()       Only works for ASCII codes (01..7F).
  
strrchr()       Only works for ASCII codes (01..7F).
  
[f]printf()     Fully compatible. Will output the string "as is",
[f]puts()       preserving the UTF-8 encoding.
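The ASCII-only restriction comes from the fact that these functions treat every byte of the delimiter set as a separate character. The sketch below uses only standard strpbrk() and strstr(); the helper name strpbrk_false_positive is hypothetical and exists just for the demonstration.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 when strpbrk() reports a match although the complete byte
   sequence of delim does not occur anywhere in text. */
int strpbrk_false_positive(const char *text, const char *delim)
{
    return strpbrk(text, delim) != NULL && strstr(text, delim) == NULL;
}
```

With text "caffè" (ending in C3 A8) and delimiter "é" (C3 A9), the helper returns 1: strpbrk() matches on the shared lead byte C3 even though "é" is not in the text. With an ASCII delimiter such as "," no false positive can occur, because bytes 01..7F never appear inside a multi-byte sequence.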

FSM

Here is the decoder's Finite State Machine.

Note that states LEN3 and LEN4 (dashed in the picture) are there only to ease the implementation; they do not consume any input.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Uh oh!

Repository files navigation

u8c

API

Compatibility with standard C functions

FSM

About

Uh oh!

Releases

Packages

Languages

Uh oh!

License

Uh oh!

rdentato/u8c

Folders and files

Latest commit

History

Repository files navigation

u8c

API

Compatibility with standard C functions

FSM

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages