Elevator Pitch
You may already know how to work with codecs like UTF-8 and store Unicode strings in files and databases. But what about more complex tasks? How do you normalize strings, write correct regular expression patterns, and do other text processing? And finally: how many punctuation characters exist?
Description
Everybody uses Unicode nowadays, at least for emoji in Slack or Twitter 🤓
Saving data in UTF-8 and reading it back is a well-known procedure. But what about more complex challenges?
- Unicode, code points and byte strings. What is each of them for?
- Converting bytes to Unicode: error modes, codecs, etc. Source encoding autodetection as a bonus.
- UTF-16, little/big endian, surrogate pairs. What a casual software developer should know about them.
- Composed Unicode characters and their normalization.
- Unicode categories, the API for working with them, and internationalized regular expressions.
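For a taste of the topics above, here is a minimal Python sketch that uses only the standard library. The sample strings and variable names are illustrative assumptions, not material from the talk itself. Note that the stdlib `re` module does not understand category classes such as `\p{P}`; to my knowledge the third-party `regex` package does, but it is not used here.

```python
import re
import sys
import unicodedata

# Bytes -> text: the error mode decides what happens with invalid input.
raw = "café".encode("utf-8")      # the "é" takes two bytes in UTF-8
broken = raw[:-1]                 # cut the last character in half
replaced = broken.decode("utf-8", errors="replace")  # bad bytes become U+FFFD
ignored = broken.decode("utf-8", errors="ignore")    # bad bytes are dropped
# broken.decode("utf-8") with the default errors="strict" raises UnicodeDecodeError.

# UTF-16: code points above U+FFFF are stored as a surrogate pair.
nerd = "\U0001F913"               # the nerd-face emoji, one code point
utf16_le = nerd.encode("utf-16-le")   # two 16-bit units, low byte first
utf16_be = nerd.encode("utf-16-be")   # the same units, high byte first

# Normalization: the same visible character can be one code point or two.
composed = "\u00e9"               # precomposed "é"
decomposed = "e\u0301"            # "e" followed by a combining acute accent
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

# Categories: every punctuation category starts with "P" (Po, Ps, Pe, Pi, ...).
# The exact count depends on the Unicode version bundled with the interpreter.
punctuation_count = sum(
    1
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
)

# Regular expressions: for str patterns, \w matches Unicode word characters.
words = re.findall(r"\w+", "naïve café, Привет!")

print(repr(replaced), repr(ignored))
print(len(nerd), utf16_le.hex(), utf16_be.hex())
print(composed == decomposed, same_after_nfc)
print(punctuation_count, unicodedata.unidata_version)
print(words)
```

Printing `unicodedata.unidata_version` next to the count is deliberate: the answer to "how many punctuation characters exist?" changes with each Unicode release.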
I work at https://ocean.io/. We parse a very large number of web pages in the wild and extract useful information from them. The talk is based on our experience in this area.