Xah Lee, 2010-05-24
This page discusses which characters should be percent-encoded in a URL, and how different browsers behave.
Here are some tests of browser behavior on URL encoding and decoding. Apparently, some browsers automatically decode parts of the percent encoding.
Copy this line:
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)
Then go to the browser, open a new tab or window, paste the line into the URL field, and press Enter to load the page.
Then select the URL field, copy the URL, and paste it into a text editor. Here are the results (on Windows browsers):
• Google Chrome: http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(D%C3%BCrer)
• Safari: http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)
• Firefox: http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_%28D%C3%BCrer%29
• Opera: http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)
• IE: http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)
Now, try again, starting with this line:
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_%28D%C3%BCrer%29
• Google Chrome: http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(D%C3%BCrer)
• Safari: http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_%28Dürer%29
• Firefox: http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_%28D%C3%BCrer%29
• Opera: http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)
• IE: http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_%28D%C3%BCrer%29
Another example. Start with:
http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
• Google Chrome: http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
• Safari: http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem
• Firefox: http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
• Opera: http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem
• IE: http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
All results are from Windows Vista, using the latest publicly released version of each browser as of 2010-05-24.
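The two encoded spellings seen in the first test can be reproduced outside a browser. The following is a minimal sketch in Python (chosen here only for illustration), assuming the standard library's `urllib.parse.quote`. The variable names are made up for this example, and the `safe` arguments are simply picked to match the copied strings. This is not a claim about how any browser is implemented.

```python
from urllib.parse import quote, unquote

# The URL from the first test, typed with literal "(", ")" and "ü".
typed = "http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)"

# Percent-encode the non-ASCII character only; ":", "/", "(" and ")" stay literal.
# This gives the same string that was copied from Google Chrome.
non_ascii_only = quote(typed, safe=":/()")

# Percent-encode the parentheses as well.
# This gives the same string that was copied from Firefox.
parens_too = quote(typed, safe=":/")

print(non_ascii_only)
print(parens_too)

# Both spellings decode back to the same text.
print(unquote(non_ascii_only) == typed)
print(unquote(parens_too) == typed)
```

Both encoded forms name the same page. The browsers differ only in which form they show and copy.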
Here's a summary of the behavior, as it appears from the above tests:
Is there an Emacs Lisp function that decodes URL percent encoding? For example,
http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
should become
http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem
That's an EN DASH (Unicode 8211, #o20023, #x2013).
I know there's a
(require 'gnus-util) (gnus-url-unhex-string …)
but that only unhexes, and it generates gibberish if the URL contains Unicode characters.
Some study shows that %E2%80%93 stands for the hexadecimal bytes E2 80 93, which is the byte sequence of the en dash character in the UTF-8 encoding.
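The byte-level claim is easy to check. Here is a small sketch in Python (used only for illustration, not Emacs Lisp); the variable names are hypothetical. The last step assumes that the gibberish comes from treating each byte as its own character, which is one plausible explanation rather than a confirmed description of what the gnus function does.

```python
# The three bytes written as %E2%80%93 in the URL.
raw = bytes.fromhex("E28093")

# Read as UTF-8, the three bytes form a single character: the en dash.
as_utf8 = raw.decode("utf-8")
print(len(as_utf8))                  # one character
print(ord(as_utf8) == 0x2013)        # same code point as #x2013
print(ord(as_utf8), oct(ord(as_utf8)), hex(ord(as_utf8)))

# Read one byte per character (Latin-1 here), the same bytes
# become three unrelated characters instead of one dash.
as_latin1 = raw.decode("latin-1")
print(len(as_latin1))                # three characters
```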
So, I guess I could parse the URL, interpret the %xx sequences as UTF-8 bytes, then turn them back into Unicode characters. Any idea if there's a built-in function that helps with this?
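The approach described above (find the %xx runs, collect them as bytes, decode the bytes as UTF-8) can be sketched as follows. This is an illustration in Python, not an Emacs Lisp answer. The function name `decode_percent_utf8` is made up for this example, and the sketch assumes every run of %xx sequences is valid UTF-8; otherwise it raises `UnicodeDecodeError`.

```python
import re
from urllib.parse import unquote

def decode_percent_utf8(url: str) -> str:
    """Replace each run of %xx sequences with the UTF-8 text it encodes."""
    def replace_run(match):
        # "%E2%80%93" -> "E28093" -> b"\xe2\x80\x93" -> en dash
        hex_digits = match.group(0).replace("%", "")
        return bytes.fromhex(hex_digits).decode("utf-8")
    # Match whole runs, so that multi-byte characters are decoded together.
    return re.sub(r"(?:%[0-9A-Fa-f]{2})+", replace_run, url)

encoded = "http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem"
decoded = decode_percent_utf8(encoded)
print(decoded)

# Python's standard library already has a function that does the same job.
print(unquote(encoded) == decoded)
```

Decoding a whole run at once matters: decoding %E2, %80, and %93 one at a time would produce three separate characters rather than one en dash.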
Some discussion and temp solutions at:
Reported to FSF: bug#6252.
From the above discussion, you can see that it is not clear which characters should be percent-encoded. In fact, different browsers behave differently.