I'm working on a site where the plan is to move URLs from a query-string format to a number-based format. Lots of existing URLs contain unescaped accented and similar UTF-8 characters. The problem? I can’t seem to get Apache2 to properly match accented characters and do a rewrite. I am doing this all in the Apache2 config.
For example, this URL:
http://great.website.example.com/?place=cafe
Will work as expected with this Apache2 RewriteRule setting:
RewriteCond %{QUERY_STRING} ^(place|location)=cafe
RewriteRule ^/find/$ /find/1234? [L,R=301]
Now look at this URL. Note the accented é:
http://great.website.example.com/?place=café
Why doesn’t that URL work with the following Apache2 RewriteRule setting:
RewriteCond %{QUERY_STRING} ^(place|location)=café
RewriteRule ^/find/$ /find/1234? [L,R=301]
Both of these rules should rewrite the URL to the following:
http://great.website.example.com/find/1234
But the example with the accented é simply doesn’t work. Maybe a wildcard character would work, but I can’t seem to get that to work either.
Your /?place=café will be URL-encoded by the browser to /?place=caf%C3%A9, and this is what you should match. You can use a RewriteMap to do the unescaping for you.
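A minimal sketch of that approach, assuming the built-in int:unescape map function and the rule from the question. The map name "unescape" and the [^&]+ capture are illustrative choices, and the config file is assumed to be saved as UTF-8 so that the literal é matches the decoded bytes:

```apache
RewriteEngine On

# Built-in map function that decodes %XX sequences (caf%C3%A9 -> café).
# RewriteMap has to be declared in server or virtual-host context.
RewriteMap unescape int:unescape

# First condition: %1 captures the parameter name, %2 its raw (still encoded) value.
RewriteCond %{QUERY_STRING} ^(place|location)=([^&]+)
# Second condition: decode %2 and compare it with the literal word.
RewriteCond ${unescape:%2} ^café$
RewriteRule ^/find/$ /find/1234? [L,R=301]
```

Matching the encoded form directly, with caf%C3%A9 in the pattern of the original RewriteCond, should also work, since %{QUERY_STRING} is not decoded before it is matched.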
In the second RewriteCond line I use %2, as %1 would contain either "location" or "place".
However, adding a lot of RewriteRules to your config in order to map words to numbers is going to be a big performance hit on your server, and will be hard to maintain. A better solution is to use a RewriteMap for that too.
For example, assume that /etc/apache2/places.txt contains one place name and its number per line. Then a single lookup in that map can replace the individual rules.
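One way this could look, as a sketch rather than a tested configuration. The file contents, the map names, and the chained conditions are assumptions made for illustration; the target number 1234 is taken from the question:

```apache
# Hypothetical contents of /etc/apache2/places.txt,
# one "key value" pair per line, saved as UTF-8:
#
#   cafe    1234
#   café    1234

RewriteEngine On
RewriteMap unescape int:unescape
RewriteMap places   txt:/etc/apache2/places.txt

# Capture the raw value of place= or location= as %1.
RewriteCond %{QUERY_STRING} ^(?:place|location)=([^&]+)
# Decode it; after this condition %1 holds the decoded word.
RewriteCond ${unescape:%1} ^(.+)$
# Look the word up; with no matching entry the result is empty and the condition fails.
RewriteCond ${places:%1} ^(\d+)$
# %1 now refers to the number captured by the last condition.
RewriteRule ^/find/$ /find/%1? [L,R=301]
```

Because a txt map separates key and value by whitespace, place names that contain spaces would need a different key scheme.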
You can also use a RewriteMap based on a database query. That would be my preferred choice, as I could then offload the job of matching words to numbers to the content management system.
You can find more details in the documentation: http://httpd.apache.org/docs/2.2/mod/mod_rewrite.html#rewritemap
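As a rough sketch of the database variant: the dbd map type is only available in Apache 2.4 and later (it is not part of the 2.2 documentation linked above) and needs mod_dbd. The driver, database path, table name, and column names below are all hypothetical:

```apache
# Assumption: Apache 2.4+ with mod_dbd loaded and a working DBD driver.
# Driver, database file, table and columns are made up for illustration.
DBDriver  sqlite3
DBDParams "/var/lib/cms/places.db"

RewriteEngine On
RewriteMap unescape int:unescape
# %s is replaced by the lookup key; the query should return a single value.
RewriteMap places "dbd:SELECT id FROM places WHERE name = %s"

RewriteCond %{QUERY_STRING} ^(?:place|location)=([^&]+)
RewriteCond ${unescape:%1} ^(.+)$
RewriteCond ${places:%1} ^(\d+)$
RewriteRule ^/find/$ /find/%1? [L,R=301]
```

On Apache 2.2, a dbm: map or a prg: map backed by the content management system would be the closest equivalents.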
In a related question, someone suggested using RewriteMap to call an external program to rewrite URLs. Also: perhaps the request is actually something different entirely? A browser might have internally translated the accented characters to URL-encoded ASCII, e.g. '%20' rather than ' '.