The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)



NaN · 3412 days ago

joelonsoftware.com · #programming · #unicode

An article all the way back from 2003, but still very relevant.


deanSolecki · 3412 days ago · link ·

TL;DR, UTF-8?


thundara · 3412 days ago · link ·

TL;DR: state the encoding

It does not make sense to have a string without knowing what encoding it uses
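A minimal Python sketch of why that is, assuming nothing beyond the standard library: the same bytes turn into different text depending on which encoding the reader assumes. The sample word is just an illustration.

```python
# The same byte sequence, interpreted under two different assumed encodings.
raw = "caf\u00e9".encode("utf-8")   # 'café' stored as UTF-8: b'caf\xc3\xa9'

as_utf8 = raw.decode("utf-8")       # the encoding the writer actually used
as_latin1 = raw.decode("latin-1")   # a wrong guess by the reader

print(as_utf8)     # café
print(as_latin1)   # cafÃ©
```

Nothing in the bytes themselves says which reading is right, so the encoding has to travel with the data.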


lke · 3412 days ago · link ·

No, the piece doesn't tell you what you should use, only what you need to consider when working with strings.

The author mentions using UCS-2 (UTF-16) in his products but exporting everything to UTF-8.


j4d3 · 3412 days ago · link ·

I was thinking almost exactly the same thing. I'd add only: everywhere. In your spreadsheet, in your CSV, in your database, in your back-end language, in your JavaScript, and sure, in your HTML.
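For the CSV case, a short Python sketch of what "state it everywhere" can look like. The file name and sample rows are made up for illustration; the point is only that both the writer and the reader name the encoding explicitly instead of relying on the platform default.

```python
import csv
import os
import tempfile

# Hypothetical sample data containing non-ASCII characters.
rows = [["name", "city"], ["Zo\u00eb", "Krak\u00f3w"]]
path = os.path.join(tempfile.mkdtemp(), "people.csv")

# Write with an explicit encoding; newline="" is what the csv module expects.
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

# Read back with the same explicit encoding.
with open(path, encoding="utf-8", newline="") as f:
    round_tripped = list(csv.reader(f))

print(round_tripped == rows)  # True
```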


thundara · 3412 days ago · link ·

Also worth a read: Unicode: The Good, the Bad, and the (mostly) Ugly


Sadly, the original website is more often down than not, but here are the unformatted slides:

https://dheeb.files.wordpress.com/2011/07/gbu.pdf

It gives a good overview of how well different languages support various Unicode features, and of the problems you can run into with things like regexes and passwords.
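As a small illustration of the kind of regex and password problem meant here (an example of the general issue, not something taken from the slides): the same visible text can be stored as different code point sequences, so comparisons and patterns disagree unless the strings are normalized first. The sketch below uses Python's standard `unicodedata` module.

```python
import re
import unicodedata

composed = "caf\u00e9"      # 'é' as one precomposed code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# Both look identical on screen, yet they are different strings.
same_raw = composed == decomposed                        # False
lengths = (len(composed), len(decomposed))               # (4, 5)

# A pattern that expects one character after "caf" matches only one form.
match_composed = re.fullmatch(r"caf.", composed)         # a match object
match_decomposed = re.fullmatch(r"caf.", decomposed)     # None

# Normalizing both sides to the same form (NFC here) makes them compare equal.
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))  # True
```

A password check that compares or hashes the raw strings would treat these two inputs as different passwords, which is why normalizing before comparing is commonly suggested.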


DarkLinkXXXX · 3412 days ago · link ·

Here's the slideshow


briandmyers · 3412 days ago · link ·

I read this back when it came out, and while it makes some good points, ASCII is still king in the embedded world. Unicode overhead is often unnecessary, and undesirable, in that realm. Just sayin'. Things are changing, but slowly (i.e. many embedded systems now run some form of Linux under the hood, and have plenty of power to spare for Unicode support).
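A rough Python sketch of the storage side of that overhead only (it says nothing about code size or lookup tables, which may matter more on small devices). The sample string is arbitrary.

```python
text = "TEMP=23C"  # arbitrary pure-ASCII sample

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark

print(len(ascii_bytes))           # 8
print(len(utf8_bytes))            # 8
print(len(utf16_bytes))           # 16
print(ascii_bytes == utf8_bytes)  # True
```

For pure-ASCII text, UTF-8 produces the same bytes as ASCII, while a 16-bit encoding doubles the size.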


WanderingEng · 3412 days ago · link ·

This is pretty topical. Just yesterday I learned that a program I use only accepts ASCII file names. It gives a useless error if you give it a Unicode file name via the Python API.
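One way to get a more useful error in a situation like this is to check the name before handing it to the ASCII-only tool. The sketch below is hypothetical: `check_ascii_filename` is a made-up helper, not part of the program's API, and it assumes the tool rejects anything outside plain ASCII.

```python
def check_ascii_filename(name: str) -> str:
    """Return name unchanged, or raise a descriptive error if it is not pure ASCII."""
    try:
        name.encode("ascii")
    except UnicodeEncodeError as exc:
        # exc.start and exc.end delimit the first run of unencodable characters.
        bad = name[exc.start:exc.end]
        raise ValueError(
            f"file name {name!r} contains non-ASCII character(s) "
            f"{bad!r} at index {exc.start}"
        ) from exc
    return name


# Usage: validate first, then pass the result on to the ASCII-only API.
safe_name = check_ascii_filename("report.txt")
```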