Perl Unicode Essentials

Location: D135
Tags: perl, unicode
Average rating: 3.12 out of 5 (8 ratings)

Attendee prerequisites for this tutorial are listed below.

Unicode text has grown exponentially over the last decade and now makes up over 95% of the documents retrieved over the web; some other collections are often 100% Unicode. This tutorial shows Perl programmers how to manage Unicode data.

Key differences between bytes and characters affect numerous basic Perl string operations. Automatic encoding and decoding between multiple external formats make dealing with Unicode I/O easy. Your programs can contain literal UTF-8 strings and even identifiers, but alternative and sometimes more convenient ways of expressing Unicode code points symbolically in your strings are available. Combining characters make string comparisons trickier with Unicode, so strings need to be normalized before they are compared. The old sort function doesn’t work well on Unicode text; locale-specific sorting algorithms take its place. Unicode pattern matching is only lightly discussed, because it’s a big enough topic to merit a tutorial of its own.
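The topics in the abstract can be illustrated with a minimal sketch. It is not taken from the tutorial materials; the sample strings and variable names are invented for illustration, and it assumes only modules that ship with Perl (Encode, Unicode::Normalize, Unicode::Collate). It uses the default Unicode collation order rather than a locale-specific one.

```perl
use strict;
use warnings;
use charnames ':full';                 # enables \N{...} names on older perls
use Encode qw(decode);
use Unicode::Normalize qw(NFC);
use Unicode::Collate;

# Encode characters as UTF-8 automatically on output.
binmode(STDOUT, ':encoding(UTF-8)');

# Bytes versus characters: the same text has two different lengths.
my $octets = "caf\xC3\xA9";            # UTF-8 encoded octets for "café"
my $text   = decode('UTF-8', $octets); # decoded into a character string
printf "%d octets, %d characters\n", length($octets), length($text);
# prints "5 octets, 4 characters"

# Two symbolic ways to write the same code point without a UTF-8 literal.
my $by_name   = "\N{LATIN SMALL LETTER E WITH ACUTE}";
my $by_number = "\x{E9}";

# Normalization: a precomposed é and e + combining accent look identical
# but compare unequal until both are normalized to the same form.
my $composed   = "\x{E9}";
my $decomposed = "e\x{301}";
my $raw_equal  = $composed eq $decomposed ? 1 : 0;            # 0
my $nfc_equal  = NFC($composed) eq NFC($decomposed) ? 1 : 0;  # 1

# Sorting: the built-in sort orders by code point, so "Émile" lands after
# "zebra"; Unicode::Collate applies the Unicode Collation Algorithm instead.
my @words        = ("zebra", "\x{C9}mile", "apple", "eclair");
my @by_codepoint = sort @words;
my @by_collation = Unicode::Collate->new->sort(@words);
print "code point order: @by_codepoint\n";
print "collation order:  @by_collation\n";
```

For language-specific ordering, the Unicode::Collate::Locale module accepts a locale name; whether the tutorial itself uses that module is not stated in the abstract.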


This tutorial has workstation requirements and prerequisites, including software that must be installed before attending. Click HERE to download the file.

Tom Christiansen


Tom Christiansen is a programmer, author, and lecturer who’s been
involved with Perl since its initial public release back in 1987. Tom is
the owner of the PERL.COM domain and website, and original author of much
of Perl’s online documentation. Tom is lead author of
The Perl Cookbook and co-author of Programming Perl, Learning Perl
(2nd edition), and Learning Perl on Win32 Systems, all bestselling titles by
O’Reilly & Associates.

He served two terms on the USENIX Association Board of Directors, and was
president of The Perl Journal. Perl users selected Tom to receive the
first White Camel Award in 1999. In 2000, members of the Open Source
community voted Tom Best Newbie Helper in the first annual Andover.Net
Slashdot Open Source Community Awards, to honor Open Source pioneers.

Tom holds a Master’s degree in Computer Science from the University of
Wisconsin – Madison with a dual specialization in operating systems design
and in computational linguistics. He previously received his Bachelor’s
degree there in Spanish and Computer Science with minor fields of study in
French, Mathematics, and Music. Tom has lived abroad in England and in
Spain, where he studied Romance Philology, café solo, and vino tinto.

Residing at the western edge of Boulder, Colorado, Tom is an amateur
naturalist who spends most of his summer hiking and camping high in the
wilderness well above 10,000 feet of elevation, wandering about the vast
Colorado Plateau, or relaxing under the glittering kaleidoscope of the
Black Rock Desert’s starkly featureless playa. Over the past five years,
Tom has become especially interested in how the exciting growth of
affordable digital photography has opened up to mere mortals dramatic
artistic opportunities previously possible to only the most dedicated
and persistent of professional photographers, and often not even to them.

Comments on this page are now closed.


Jacinta Richardson
08/09/2011 12:34pm PDT

I’m fairly Unicode ignorant. I don’t want to be. I went along to this talk, hoping that “essentials” would be what I needed to grok the basics of Unicode. Unfortunately the talk seemed to start at a point that was too advanced for me and stayed there. I followed along as best I could and really will have to do a whole lot more reading about Unicode soon.

Further feedback for the author: 5+ lines of text on your slides means that people halfway down the room and further will start having difficulty reading your text. 10+ lines guarantees difficulty. I think I saw one slide of yours with about 19. Please increase your text size and the number of slides in your deck.

Ricardo Signes
07/31/2011 11:01pm PDT

This was an excellent talk. Tom had mentioned that he had originally considered doing this as a full-day talk, and I think that would be great. I would love to see him offer a half-day talk on “stuff you need to know about Unicode text in general” for programmers of all denominations, if that’s something he thinks could be done.

Unicode ignorance is one of the biggest problems I see with many of the otherwise good, in-the-know programmers with whom I work. Unicode text is everywhere, and people keep working with it as if it’s just “ASCII with some voodoo.” Despite that programmer perspective, users see mojibake show up on sites and immediately “know” that the site is crap, implemented by idiots.

The talk seemed to make some assumptions about what the students knew: stupidly, I can’t remember very good examples, but basically I felt like the basics of what Unicode is, what case folding is, and so on, were more assumed than given. Later, a few attendees said much of the talk was over their heads because they lacked the fundamentals. Although I generally don’t like sitting through fundamentals that I know, here I think it might be worth the time.

Either way, though, this was probably the most educational and generally useful thing I attended all week.

brian d foy
07/28/2011 8:25am PDT

Some of the stuff that Tom mentioned is in CPAN as Unicode::Tussle.

Elliot Shank
07/26/2011 11:35am PDT

Dear O’Reilly, more sessions like this, please.

Matt Riffle
07/21/2011 9:01am PDT

I’m not able to get it unzipped, even. None of zip, gzip, or tar (with/without the z option) recognize the file as valid, for me.

Bart Sutton
07/20/2011 7:38am PDT

Hello Tom,

I downloaded the oscon2011_tute_christiansen_19542.gzip file to my laptop and unzipped it. I now have a 27M file: 07/20/2011 02:30 PM 27,617,280 oscon2011_tute_christiansen_19542

I figured out it was a tar file and untarred it, but you may want to let people know that.

Thanks Bart Sutton