
Conquering Character Encoding Chaos With GNU Recode

December 4, 2008

In the beginning were C and C++, and hosts of other computer programming languages. These are all based on ASCII (American Standard Code for Information Interchange), which, as the name implies, is based on the English alphabet. That wouldn’t be an issue except that there are a lot of other humans in the world, and they don’t use the English alphabet.

So along came Unicode to the rescue. Unicode provides a framework for all alphabets of the world to be represented on computers. UTF-8 is the most popular Unicode encoding because it preserves backwards compatibility with ASCII. That is all fun to know, but what good is it when you’re looking at piles of computer files that need to be converted from ISO-8859-1 (Latin-1, Western European) into whatever encoding you prefer? Naturally, there are a number of utilities just for this task.
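To make the compatibility point concrete, here is a small Python sketch (Python is used only for illustration and is not part of the tools discussed in this article). It shows how the same text is stored as bytes under each encoding; the sample strings are arbitrary choices.

```python
# Illustration only: how the same characters are stored in two encodings.
ascii_text = "recode"
accented = "caf\u00e9"  # "café", written with an escape to avoid editor encoding issues

# Pure ASCII text is byte-for-byte identical in ASCII, ISO-8859-1, and UTF-8.
assert (
    ascii_text.encode("ascii")
    == ascii_text.encode("iso-8859-1")
    == ascii_text.encode("utf-8")
)

latin1_bytes = accented.encode("iso-8859-1")  # 'é' is the single byte 0xE9
utf8_bytes = accented.encode("utf-8")         # 'é' becomes two bytes: 0xC3 0xA9

print(latin1_bytes.hex(" "))  # 63 61 66 e9
print(utf8_bytes.hex(" "))    # 63 61 66 c3 a9

# Reading UTF-8 bytes as if they were ISO-8859-1 yields the familiar mojibake.
print(utf8_bytes.decode("iso-8859-1"))  # cafÃ©
```

Only the non-ASCII character differs between the two byte sequences, which is why files that mix plain English with a few accented letters look mostly fine and then break in odd places.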

GNU Recode supports over 150 character sets, and converts just about anything to anything. For example, there are still legacy Linux systems whose files are encoded in ISO-8859-1. Recode will convert these to nice modern UTF-8. The request names the source and target character sets, separated by two dots, like this:

$ recode ISO-8859-1..UTF-8 recode-test.txt


Recode rewrites the file in place, so work on a copy if you want to keep the original. Check out the GNU Recode Manual for instructions.
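For readers who want to see what such a conversion amounts to, here is a minimal Python sketch of the same idea. It is not how Recode is implemented; the function name latin1_to_utf8 and the output filename are made up for this example, and unlike Recode it writes to a separate file instead of overwriting the input.

```python
import tempfile
from pathlib import Path


def latin1_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a text file from ISO-8859-1 to UTF-8, writing to a new file."""
    # Every byte value is valid ISO-8859-1, so this decode never fails.
    # The flip side: it silently produces garbage if the file is really
    # in some other encoding, so be sure of the source encoding first.
    text = src.read_bytes().decode("iso-8859-1")
    dst.write_bytes(text.encode("utf-8"))


# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "recode-test.txt"
    dst = Path(tmp) / "recode-test.utf8.txt"  # hypothetical output name
    src.write_bytes(b"na\xefve caf\xe9\n")    # "naïve café" in ISO-8859-1
    latin1_to_utf8(src, dst)
    print(dst.read_bytes())                   # b'na\xc3\xafve caf\xc3\xa9\n'
```

Working with raw bytes on both ends keeps line endings untouched, so the only thing that changes is the encoding of the non-ASCII characters.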

That’s fast and easy enough, but there’s one more job: converting the filenames. The convmv command is just the tool for this job. This example converts all the ISO-8859-1 filenames under the files/ directory to UTF-8 (the -r option makes convmv descend into the directory):

$ convmv -r -f iso-8859-1 -t utf8 --notest files/


Run without the --notest option, convmv does a dry run without changing anything, which is probably a wise thing to do first.
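The dry-run-first habit is easy to reproduce in your own scripts. The following Python sketch is an illustration of the idea, not a reimplementation of convmv: the function names plan_renames and convert_filenames are hypothetical, it handles a single directory without recursion, and it assumes a POSIX filesystem where filenames are plain byte strings.

```python
import os


def plan_renames(names, source_encoding="iso-8859-1"):
    """Return (old, new) byte-string pairs for names that need converting."""
    plan = []
    for old in names:
        try:
            old.decode("utf-8")  # already valid UTF-8 (this includes plain ASCII)
            continue             # leave it alone to avoid double encoding
        except UnicodeDecodeError:
            pass
        new = old.decode(source_encoding).encode("utf-8")
        plan.append((old, new))
    return plan


def convert_filenames(directory, dry_run=True):
    """Rename entries in `directory`; by default only print what would happen."""
    directory = os.fsencode(directory)
    # Passing bytes to os.listdir returns the raw, undecoded byte names.
    for old, new in plan_renames(os.listdir(directory)):
        target = os.path.join(directory, new)
        print(f"{old!r} -> {new!r}")
        if dry_run:
            continue
        if os.path.exists(target):
            print(f"skipping {old!r}: target already exists")
            continue
        os.rename(os.path.join(directory, old), target)
```

Calling convert_filenames("files/") prints the planned renames and touches nothing; passing dry_run=False performs them. Skipping names that already decode as UTF-8 is a design choice of this sketch, meant to make a second run harmless.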

Maybe you have a file whose encoding you don’t know. Upload your file to this online tool and it will tell you. You can even do file conversions there.
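If you would rather not upload a file, a rough first check can be done locally. The Python sketch below is a heuristic under simple assumptions, not a real detector: the function name guess_encoding and its return labels are invented for this example, and it cannot tell one 8-bit encoding (ISO-8859-1, ISO-8859-15, Windows-1252, and so on) from another.

```python
def guess_encoding(data: bytes) -> str:
    """Very rough guess: 'ascii', 'utf-8', or an unknown 8-bit encoding."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Text in a legacy 8-bit encoding rarely happens to be valid UTF-8,
        # so a clean decode is strong (not conclusive) evidence of UTF-8.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown 8-bit (possibly iso-8859-1)"


print(guess_encoding(b"plain text"))         # ascii
print(guess_encoding(b"caf\xc3\xa9"))        # utf-8
print(guess_encoding(b"caf\xe9"))            # unknown 8-bit (possibly iso-8859-1)
```

To check a real file, read it in binary mode and pass the bytes in, for example guess_encoding(open("recode-test.txt", "rb").read()).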

Resources

The subject of character encoding is huge and bewildering, especially for us dinosaurs from the typewriter era. By golly, when you hit a typewriter key it came out the same way every single time. Wikipedia has a number of excellent introductory articles:

Unicode
UTF-8
ISO/IEC 8859

This article was first published on LinuxPlanet.com.
