URL Percent Encoding and Unicode

2010-05-24

This page discusses which characters should be percent-encoded in a URL, and how different browsers behave.

Browser Behavior

Here are some tests of how browsers encode and decode URLs. Apparently, some browsers automatically decode parts of the percent encoding.

Copy this line:

http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)

Then, in a browser, open a new tab or window, paste the line into the URL field, and press Enter to load the page.

Next, select the URL field, copy the URL, and paste it into a text editor. Here are the results (browsers on Windows):

• Google Chrome
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(D%C3%BCrer)

• Safari
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)

• Firefox
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_%28D%C3%BCrer%29

• Opera
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)

• IE
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)

Now, try again, starting with this line:

http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_%28D%C3%BCrer%29

• Google Chrome
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(D%C3%BCrer)

• Safari
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_%28Dürer%29

• Firefox
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_%28D%C3%BCrer%29

• Opera
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)

• IE
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_%28D%C3%BCrer%29

Another example. Start with:

http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

• Google Chrome
http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

• Safari
http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem

• Firefox
http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

• Opera
http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem

• IE
http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

All results are from Windows Vista, using the latest publicly released version of each browser as of 2010-05-24.
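The three forms seen in the Dürer test can be reproduced in a few lines. The following is a minimal sketch in Python, not a model of any browser's actual code. It uses the standard library functions urllib.parse.quote and urllib.parse.unquote, and the variable names are made up for illustration. The only difference between the Firefox-like form and the Chrome-like form is whether parentheses are treated as safe characters.

```python
from urllib.parse import quote, unquote

# Path part of the test URL, with literal parentheses and a literal
# u-umlaut (written as an escape so the source stays ASCII).
path = "/wiki/St._Jerome_in_His_Study_(D\u00fcrer)"

# Escape everything except letters, digits, "_.-~" and "/".
# Parentheses and the u-umlaut (UTF-8 bytes C3 BC) are both escaped.
encode_all = quote(path, safe="/")

# Additionally treat "(" and ")" as safe: only the u-umlaut is escaped.
keep_parens = quote(path, safe="/()")

# Decode every percent sequence, interpreting the bytes as UTF-8.
decoded = unquote(encode_all)

print(encode_all)
print(keep_parens)
print(decoded)
```

The first value matches what Firefox showed, the second what Chrome showed, and the third what Opera showed.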

Summary

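Here is a summary of the behavior, as it appears from the above tests (the URL as copied back from the URL field):

• Google Chrome: shows parentheses literally and non-ASCII characters percent-encoded, regardless of how the URL was entered.

• Safari: shows non-ASCII characters decoded; parentheses stay as entered.

• Firefox: shows parentheses and non-ASCII characters percent-encoded, regardless of how the URL was entered.

• Opera: shows parentheses and non-ASCII characters decoded, regardless of how the URL was entered.

• IE: leaves the URL as entered.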

Emacs Question

Is there an Emacs Lisp function that decodes URL percent encoding? For example,

http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

should become

http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem

That's an EN DASH (Unicode code point 8211, #o20023, #x2013).

I know there's

 (require 'gnus-util)
 (gnus-url-unhex-string …)

but that just unhexes, and produces gibberish if the URL contains non-ASCII Unicode characters.

Some study shows that %E2%80%93 stands for the hexadecimal bytes E2 80 93, which is the UTF-8 byte sequence of the en dash character.
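The relationship between the code point and the three bytes can be checked directly. This is a small sketch in Python, chosen here only for illustration; the variable names are hypothetical. It also shows one plausible source of the gibberish: if each byte is turned into a character on its own (modeled here as a Latin-1 decode, which is an assumption about what a plain unhex does), the result is three unrelated characters instead of one en dash.

```python
en_dash = "\u2013"  # EN DASH

# Code point of the character: 8211 decimal, 20023 octal, 2013 hex.
code_point = ord(en_dash)

# UTF-8 encodes this code point as three bytes.
utf8_bytes = en_dash.encode("utf-8")

# Write each byte as %XX to get the percent-encoded form.
percent_form = "".join(f"%{b:02X}" for b in utf8_bytes)

# Treating each byte as its own character (Latin-1) gives gibberish:
# three characters instead of one.
gibberish = utf8_bytes.decode("latin-1")

print(code_point, oct(code_point), hex(code_point))
print(percent_form)
print(len(gibberish))
```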

So, I guess I could parse the URL, interpret each %xx sequence as one byte of UTF-8, and then turn the bytes back into Unicode characters. Any idea if there's a built-in function that helps with this?
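The approach described above can be sketched as follows. This is an illustration in Python rather than Emacs Lisp, and the function name decode_percent_utf8 is hypothetical. It finds each run of %xx sequences, collects the bytes, and decodes the run as UTF-8. It assumes the URL was encoded as UTF-8, and raises UnicodeDecodeError if a run is not valid UTF-8.

```python
import re
from urllib.parse import unquote


def decode_percent_utf8(url: str) -> str:
    """Replace each run of %XX sequences with the UTF-8 text it encodes."""

    def decode_run(match: re.Match) -> str:
        # "%E2%80%93" -> "E28093" -> b"\xe2\x80\x93" -> en dash
        hex_digits = match.group(0).replace("%", "")
        return bytes.fromhex(hex_digits).decode("utf-8")

    # A run is one or more consecutive %XX sequences, so that
    # multi-byte characters are decoded together.
    return re.sub(r"(?:%[0-9A-Fa-f]{2})+", decode_run, url)


encoded = "http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem"
manual = decode_percent_utf8(encoded)

# The standard library does the same thing, using UTF-8 by default.
builtin = unquote(encoded)

print(manual)
print(manual == builtin)
```

Python's standard library already provides this as urllib.parse.unquote; the manual version is shown only to make the steps visible.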

Some discussion and temp solutions at:

Reported to FSF: bug#6252.

From the above discussions, you can see that it is not clear which characters should be percent-encoded. In fact, different browsers behave differently.