These are some tests run against Apache 2.2.?? using a simple bit of mod_rewrite in a .htaccess
file:
RewriteEngine on
# Test rewrite
RewriteRule ^(.+) junk?_START_&q1=$1&_END_ [NC,QSA]
# Final rewrite
RewriteCond %{IS_SUBREQ} !true [NC]
RewriteCond %{SERVER_NAME} (.+) [NC]
RewriteRule (.*) http://%1/apps/mod_rewrite?_R=Final&URI=$1 [NC,L,QSA,NE,PROXY]
This gives us somewhere to pop some tests through. The mod_rewrite file targeted by the final rule is a PHP script which dumps out some data for debug purposes.
Putting the following URI fragments into the rewrites:
Against the following rule:
RewriteRule ^(.+) junk?_START_&param=$1&_END_ [NC,QSA]
yields:
/apps/mod_rewrite?_R=Final&URI=junk&_START_&q1=foo
So we can see that the any-character dot has fallen over at the space. The same happens with the physical whitespace and the tab, and we even lose the last query string parameter (_END_). The plus character, however, is passed through:
/apps/mod_rewrite?_R=Final&URI=junk&_START_&q1=foo+bar&_END_
Most other encoded values pass through okay. An encoded ampersand later confuses PHP, which splits the value into a new variable (as was to be expected).
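The ampersand behaviour can be reproduced outside Apache. The following is a minimal Python sketch, not the PHP debug script itself: `build_query` is a hypothetical helper that mimics the substitution `junk?_START_&q1=$1&_END_`, and `urllib.parse` stands in for PHP's query string parsing on the assumption that both split on `&` in the same way.

```python
from urllib.parse import parse_qsl, unquote

# Path-style decoding: %20 becomes a space, %26 becomes "&", "+" is left alone.
for raw in ("foo%20bar", "foo+bar", "foo%26bar"):
    print(raw, "->", repr(unquote(raw)))


def build_query(captured: str) -> str:
    """Hypothetical helper mimicking the substitution junk?_START_&q1=$1&_END_."""
    return f"_START_&q1={captured}&_END_"


# The decoded ampersand lands in the query string unescaped...
query = build_query(unquote("foo%26bar"))

# ...so a query string parser treats it as a separator and "bar" becomes its own key.
params = parse_qsl(query, keep_blank_values=True)
print(params)
```

Here `bar` shows up as a separate, empty parameter rather than as part of `q1`, which is the split described above.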
Testing against the rule with a B flag to stop expansion of URI encoded vars yields:
/apps/mod_rewrite?_R=Final&URI=junk&_START_&q1=foo%20bar&_END_
So we can see the space has been passed through as a literal. That is to be expected: when the rule sees it this time, it sees it as literally a percent followed by a two followed by a zero. Physical whitespace is encoded to %20 and passed through. The ampersand is decoded in the output and once again ends up as a separator. Adding an NE flag along with the B doesn’t seem to alter the output.
Moving to a different rewrite rule where we try to catch our elusive whitespace:
RewriteRule ^(.+)(\s+)(.+) junk?_START_&q1=$1&q2=$2&q3=$3&_END_ [NC,QSA]
Matching against our %20 encoded whitespace again fails to properly capture the space and we end up with a result like this:
/apps/mod_rewrite?_R=Final&URI=junk&_START_&q1=foo&q2=
The foo+bar obviously doesn’t match the pattern and misses the rule. Other whitespace characters fail as we would expect. Adding a B flag causes the pattern to be properly processed and we end up with a result like this:
/apps/mod_rewrite?_R=Final&URI=junk&_START_&q1=foo&q2=%20&q3=bar&_END_
So we can see the %20 has correctly been matched against the \s whitespace character class.
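One way to reason about this result is a toy model in Python. It is only a sketch under stated assumptions: that the pattern is matched against the percent-decoded path, and that the B flag re-escapes each backreference before substitution. The function `rewrite` and its parameters are hypothetical names, and `urllib.parse.quote` is merely an approximation of Apache's own escaping.

```python
import re
from typing import Optional
from urllib.parse import quote, unquote

# The rule's pattern; re.IGNORECASE stands in for the [NC] flag.
PATTERN = re.compile(r"^(.+)(\s+)(.+)", re.IGNORECASE)


def rewrite(raw_path: str, escape_backrefs: bool) -> Optional[str]:
    """Toy model: decode the path, match, then optionally escape captures like [B]."""
    match = PATTERN.match(unquote(raw_path))
    if match is None:
        return None  # the rule does not apply
    groups = match.groups()
    if escape_backrefs:
        # Assumption: approximate [B] by percent-encoding every capture.
        groups = tuple(quote(g, safe="") for g in groups)
    q1, q2, q3 = groups
    return f"junk?_START_&q1={q1}&q2={q2}&q3={q3}&_END_"


print(repr(rewrite("foo%20bar", escape_backrefs=False)))  # literal space after q2=
print(repr(rewrite("foo%20bar", escape_backrefs=True)))   # q2=%20
print(repr(rewrite("foo+bar", escape_backrefs=True)))     # None, no whitespace to match
```

Under this model the escaped variant produces the same query string as the B flag result above, while the unescaped variant leaves a literal space in the substitution.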
Now that we have some idea of how the whitespace behaves, we should see how it behaves as part of a cascaded query string. So if we construct some rules like this:
RewriteRule ^(.+)(\s+)(.+) junk?_START_&q1=$1&q2=$2&q3=$3&_END_ [NC,QSA,B,NE]
RewriteCond %{QUERY_STRING} q2=(.) [NC]
RewriteRule ^junk junk?_S_&e1=%1&_E_ [NC,QSA]
and pass through our foo%20bar, we find the e1 parameter has captured a % character, so we know our space character has ended up URI encoded in the query string.
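The single-character capture can be checked in isolation. This is a small Python sketch that assumes the query string seen by the RewriteCond is the one produced by the first rule with B in effect; the variable names are illustrative only.

```python
import re

# Assumed query string after the first rule, with the space escaped by [B].
query_string = "_START_&q1=foo&q2=%20&q3=bar&_END_"

# Equivalent of: RewriteCond %{QUERY_STRING} q2=(.) [NC]
cond = re.search(r"q2=(.)", query_string, re.IGNORECASE)
captured = cond.group(1) if cond else ""

# Equivalent of the substitution junk?_S_&e1=%1&_E_ (before QSA appends the old query).
rewritten = f"junk?_S_&e1={captured}&_E_"
print(captured)
print(rewritten)
```

Because `(.)` takes exactly one character, it picks up the `%` of `%20` rather than a space, which is how the test tells the encoded form apart from a literal one.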
Written on 10 Feb 2013 and categorised in Apache and NIX, tagged as pcre, regexp, and uri