Discussion:
[Rabbit-dev] FilterHandler is unable to correctly parse some buffer streams
Mindaugas Žakšauskas
2009-12-12 20:27:41 UTC
Hi,

I was playing with various HTML filters (e.g. BackgroundFilter) but,
strangely, I am getting intermittent results.

Basically I only want to add some extra stuff to the output HTML and do not
want to bother with gzipping at all (is there any reason why FilterHandler
extends GZipHandler?). So my config is like this:

8<------------------------------------------------------------------------------------
#
text/html=rabbit.handler.FilterHandler
text/html;charset\=iso-?8859-1=rabbit.handler.FilterHandler
text/html;charset\=iso-?8859_1=rabbit.handler.FilterHandler
text/html;charset\=utf-?8=rabbit.handler.FilterHandler
text/html;charset\=utf_?8=rabbit.handler.FilterHandler

# few lines below....

[rabbit.handler.GZipHandler]
# If I set this to false, no filtering happens - at all! WTF?
compress=true

[rabbit.handler.FilterHandler]
filters=rabbit.filter.BodyFilter,rabbit.filter.BackgroundFilter
8<------------------------------------------------------------------------------------

Now if I place a breakpoint in rabbit.filter.BackgroundFilter::handleTag
(the line which does tag.removeAttribute ("background");) it sometimes stops
(and works correctly) but sometimes doesn't. E.g. it stops if I go to

http://www.ukstudentlife.com/Life/Money.htm (1)
but doesn't if I go to
http://en.wikipedia.org/wiki/South_African_Republic_pond (2)

The reason this is so intermittent is that the byte array (named "arr") which
is formed in FilterHandler::modifyBuffer sometimes comes through as meaningful
text and sometimes as unparseable garbage. If I create a new String from this
array, a readable string is produced for page (1) but total garbage for (2).
At some later stage, when the parser parses this garbage, it spits out a single
HtmlBlock which is just another variant of the same garbage.

The reason why "arr" array sometimes comes as garbled must be coming from
implementation details of rabbit.io.BufferHandle
(rabbit.io.CacheBufferHandle in my case). It is rather hard and time
consuming for me to dig deeper to understand what might be wrong here, but I
hope I have provided enough pointers for somebody who knows the internals
better.

Has somebody else seen this problem before? Does it depend on differing HTTP
server behaviour (to me it looks like it does)?
BTW, I am using Rabbit 4.2.

m.
Robert Olofsson
2009-12-12 20:52:58 UTC
On Sat, 12 Dec 2009 20:27:41 +0000
Post by Mindaugas Žakšauskas
Now if I place a breakpoint in rabbit.filter.BackgroundFilter::handleTag
(the line which does tag.removeAttribute ("background");) it sometimes stops
(and works correctly) but sometimes doesn't. E.g. it stops if I go to
http://www.ukstudentlife.com/Life/Money.htm (1)
but doesn't if I go to
http://en.wikipedia.org/wiki/South_African_Republic_pond (2)
If you add the HttpSnoop filter to the httpoutfilters, you can easily see
the headers that rabbit reads from the server, and you will find:

GET http://en.wikipedia.org/wiki/South_African_Republic_pond HTTP/1.1
HTTP/1.0 200 OK
...
Content-Encoding: gzip
Content-Length: 5878
Content-Type: text/html; charset=utf-8
...

So seeing binary junk is expected: the data is gzipped.

Rabbit should not normally try to filter this data; if it does, it is doing the
wrong thing.
If you set "repack=true" rabbit ought to unpack and filter pages
that are gzipped.
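For example, something like this (a sketch only, reusing the sections from
your config snippet; I am writing from memory, so double check in which
handler section repack actually belongs):

[rabbit.handler.FilterHandler]
filters=rabbit.filter.BodyFilter,rabbit.filter.BackgroundFilter
# assumption: repack sits next to the other handler options
repack=true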

Rabbit does not normally try to unzip compressed pages, since
compressed content means that someone has already thought about
minimizing the content and then rabbit will only add latency.

/robo
Mindaugas Žakšauskas
2009-12-12 21:49:51 UTC
2009/12/12 Robert Olofsson <robert.olofsson at khelekore.org>
Post by Robert Olofsson
If you add the HttpSnoop filter to the httpoutfilters you can easily see
<..> the data is gzipped.
Cool, thanks for the explanation! That suddenly makes sense now.
Post by Robert Olofsson
Rabbit should not normally try to filter this data; if it does, it is doing the
wrong thing.
Yep, I think the fix is necessary (at least in 4.2, I haven't had a chance to
look at 4.3). Something like

if ("gzip".equals(response.getHeaders("Content-Encoding"))) {
return;
}

in FilterHandler::handleArray, just before parser.setText (arr, off, len)
perhaps?
But this might be just too straightforward...
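Or, spelled out a bit more defensively (just a sketch; I am guessing the exact
accessor name on rabbit.http.HttpHeader, so adjust it to whatever the class
actually provides):

// sketch only: skip filtering when the body is compressed
String enc = response.getHeader("Content-Encoding"); // accessor name is a guess
if (enc != null && enc.trim().equalsIgnoreCase("gzip")) {
    return; // leave the gzipped bytes untouched instead of parsing them
}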

If you set "repack=true" rabbit ought to unpack and filter pages
Post by Robert Olofsson
that are gzipped.
This makes sense only if repack=true *and* compress=true. If compress=false
and repack=true, the browser ends up with a "Content Encoding Error" when
trying to get a gzipped stream, such as http://www.guardian.co.uk. Is this
expected?

Post by Robert Olofsson
Rabbit does not normally try to unzip compressed pages, since
compressed content means that someone has already thought about
minimizing the content and then rabbit will only add latency.
Yes, unless one wants to be too intrusive or more user friendly :)

Also, another question: what is the best way to share data
between proxy.filter.HttpFilter and proxy.filter.HtmlFilter?
E.g. say I want to show a nice (HTML) error screen (which I would normally
do in proxy.filter.HtmlFilter) if the user is not authenticated (and that is
decided in proxy.filter.HttpFilter), instead of just throwing a default HTTP
407.
One way of doing this would be adding an extra rabbit.http.HttpHeader and
checking (also deleting?) it later. But is this good enough? Any other options?

Thanks a lot for help!

Regards,
Mindaugas
Robert Olofsson
2009-12-13 14:13:19 UTC
On Sat, 12 Dec 2009 21:49:51 +0000
Post by Mindaugas Žakšauskas
Post by Robert Olofsson
Rabbit should not normally try to filter this data; if it does, it is doing the
wrong thing.
Yep, I think the fix is necessary (at least in 4.2, I haven't had a chance to
look at 4.3). Something like
When I look at the code and try it out, everything seems to work fine.
FilterHandler.setupHandler calls GZipHandler.setupHandler, and that will
set mayFilter to false for compressed content.
In FilterHandler.modifyBuffer we check mayFilter, and since it is not
true we just call super.modifyBuffer and return (and that will just
write the buffer out).
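Roughly, the flow looks like this (a simplified sketch from memory, not the
literal 4.2 source; the signature is approximate):

// FilterHandler.modifyBuffer, simplified sketch
protected void modifyBuffer (BufferHandle bufHandle) {
    if (!mayFilter) {
        // compressed (or otherwise unfilterable) content: pass the buffer
        // through unchanged instead of feeding binary data to the html parser
        super.modifyBuffer (bufHandle);
        return;
    }
    // ... otherwise parse the buffer as html and run the configured filters ...
}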
Post by Mindaugas Žakšauskas
Post by Robert Olofsson
If you set "repack=true" rabbit ought to unpack and filter pages
that are gzipped.
This makes sense only if repack=true *and* compress=true. If compress=false
and repack=true, the browser ends up with a "Content Encoding Error" when
trying to get a gzipped stream, such as http://www.guardian.co.uk. Is this
expected?
If repack is not working, then we have a bug. I have to check it.
Post by Mindaugas Žakšauskas
Also, another question: what is the best way to share data
between proxy.filter.HttpFilter and proxy.filter.HtmlFilter?
E.g. say I want to show a nice (HTML) error screen (which I would normally
do in proxy.filter.HtmlFilter) if the user is not authenticated (and that is
decided in proxy.filter.HttpFilter), instead of just throwing a default HTTP
407.
One way of doing this would be adding an extra rabbit.http.HttpHeader and
checking (also deleting?) it later. But is this good enough? Any other options?
I do not think that question makes much sense.
If you want to return a nice error page for unauthorized users, you do not
want to fetch the real HTML resource and filter it.
What we ought to do is make the error pages use configurable templates.
Currently StandardResponseHeaders is the class that generates the error
pages, and it does not read any templates. So changing that class to read a
template would be a nice thing.

/robo
Robert Olofsson
2009-12-13 15:22:43 UTC
On Sun, 13 Dec 2009 15:13:19 +0100
Post by Robert Olofsson
Post by Mindaugas Žakšauskas
This makes sense only if repack=true *and* compress=true. If compress=false
and repack=true, the browser ends up with a "Content Encoding Error" when
trying to get a gzipped stream, such as http://www.guardian.co.uk. Is this
expected?
If repack is not working, then we have a bug. I have to check it.
That is a bug; I have fixed it in my tree.
Thanks for spotting it.

/robo