cRoSSbow and gdb
Yesterday I took a half-hour break from work to do some futzing about of my own, and came back to something I'd already been playing with: trying to figure out why the feed processor I'd just downloaded and built wouldn't handle the main feed I wanted to view, https://www.databreaches.net/feed/.
cRoSSbow
I first found out about `crossbow` on this lobste.rs post and decided it was probably right up my alley. I've written and spoken elsewhere about work I've done previously with feeds, primarily on my betterfeed framework for reconstituting RSS feeds.
I'm fully aware, however, that my solution is super heavyweight, has a lot of code that I probably wouldn't have needed to write myself in any other language, and just isn't really usable by others.
So `crossbow` looked like a great solution. Looking at the man page for `crossbow-cookbook`, it has all sorts of examples like:

> Download the full article
>
> This scenario is similar to the previous one, except that the item description contains only part of the content, or nothing at all. The link field contains a valid URL, which is intended to be reached by means of a browser. In this case we can leverage curl(1) to do the retrieval:

```
crossbow set -i "$ID" -u "$URL" \
    -o subproc \
    -f "curl -o %n.html %l" -C /some/destination/path/
```
This is a much more tractable way of doing something similar to what I was already doing, but composable on the command line. I could easily throw in a pipeline to `pup` for a given feed and re-use my existing CSS selectors to still pull just the actual content out.
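Something like the following ought to work for my use-case, though I haven't tried it yet. The feed ID, destination path, and the `.entry-content` selector are placeholders, and wrapping the pipeline in `sh -c` so that crossbow's `%l`/`%n` substitutions feed into pup is an assumption on my part:

```shell
# Hypothetical: fetch each linked article and keep only the content
# matched by a CSS selector, reusing the cookbook's subproc outlet.
crossbow set -i "databreaches" -u "https://www.databreaches.net/feed/" \
    -o subproc \
    -f "sh -c 'curl -s %l | pup .entry-content > %n.html'" \
    -C ~/feeds/databreaches/
```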
But after downloading, building, and installing the full set of dependencies and the project itself, I tried a very similar invocation to the one above to add https://www.databreaches.net to `crossbow`'s download list. That returned 0, but when I ran `crossbow-fetch` I got the cryptic error message:

```
crossbow-fetch: cannot open feed [https://www.databreaches.net/feed/]: error_code=2 (mrss: Parser error)
```
God dammit.
Now I could probably have just dug through the code, but I don't have much patience for that, and I have been having some amount of fun and success recently in turning to…
gdb
This isn't my first rodeo with `gdb`, but I'm not a prolific writer of compiled languages, nor am I an accomplished reverse engineer. I'm largely interested in building up my skillset because I (fairly recently) purchased a debugger for ARM devices.
Anyway, I'm going to cut a long story short here, but I learned a few things:
- How to "break" (technically it's a `catch`) on a syscall (in this case `clone`, because I wanted to see what was going on in another thread)
- How to lock the scheduler so other threads don't continue to run in the background
- How to step through a thread in the foreground
- How to enable source listing for stepping through code with debug symbols compiled in
Again, to cut a long story short, the error logic I wanted to see was in the `libmrss.so.0` dependency, which in my efforts I ended up recompiling and installing separately to /usr/local/... and loading with LD_PRELOAD in my `gdb` invocation.
I then did the following:
- `dir ~/libmrss/src` to get the line information corresponding to the debug symbols
- `catch syscall clone` to create a breakpoint for when a thread is about to be created
- `run`, because `set scheduler-locking` refuses to work until you start the program (we're safe because we'll break before any threads are created)
- `set scheduler-locking on` in order to prevent other threads from being scheduled whilst stepping through the current one

We can then step (`s`) until the new thread is created and switch to it with:

- `info thread` to view the threads and their IDs
- `thread 2` if you want to, for example, switch to the thread with id=2
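Put together, the session looks roughly like this. This is a reconstruction rather than a verbatim transcript: the library path is a placeholder (I only noted it went somewhere under /usr/local/...), and `set environment` is one way of getting the rebuilt library picked up via LD_PRELOAD without affecting gdb itself:

```
$ gdb -q crossbow-fetch
(gdb) set environment LD_PRELOAD /usr/local/lib/libmrss.so.0
(gdb) dir ~/libmrss/src           # map debug symbols to the library source
(gdb) catch syscall clone         # stop when a thread is about to be created
(gdb) run
(gdb) set scheduler-locking on    # only the current thread runs while stepping
(gdb) s
(gdb) info thread                 # list threads and their IDs
(gdb) thread 2                    # switch to the second thread
```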
I mean, it turns out none of the important stuff actually happened in the second thread, which returns before thread 1 does the work that produces the error, but it's a good thing to have learned.
Anyway, here's the interesting bit:
```
(gdb) s
mrss_parse_url_with_options_error_and_transfer_buffer (feed_size=0x0, feed_content=0x0,
    code=<optimized out>, options=0x0, ret=0x7fffffff9440,
    url=0x7fffffffd480 "https://www.databreaches.net/feed/") at mrss_parser.c:1114
1114      if (nxml_parse_buffer(doc, buffer, size) != NXML_OK) {
(gdb) s
1121      if (!(err = __mrss_parser(doc, ret))) {
(gdb) s
__mrss_parser (doc=0x555555563560, ret=ret@entry=0x7fffffff9440) at mrss_parser.c:1015
1015      if (!(cur = nxmle_root_element(doc, NULL)))
(gdb) s
1018      if (!strcmp(cur->value, "rss")) {
(gdb) s
1042      else if (!strcmp(cur->value, "RDF"))
(gdb) s
1045      else if (!strcmp(cur->value, "feed"))
(gdb) s
mrss_parse_url_with_options_error_and_transfer_buffer (feed_size=0x0, feed_content=0x0,
    code=<optimized out>, options=0x0, ret=<optimized out>,
    url=0x7fffffffd480 "https://www.databreaches.net/feed/") at mrss_parser.c:1133
1133      nxml_free(doc);
```
Well, up until I came to write about this, in my mind it was the case that the feed was Atom, and that's why `libmrss` couldn't parse it. But in checking my facts to be able to actually post examples, this doesn't appear to be the case.
This is not the feed you're looking for
The feed itself is RSS and the library does have support for Atom as far as I can tell.
So going back through, let's see what that root value is if it's not `rss`:
```
Thread 1 "crossbow-fetch" hit Breakpoint 3, __mrss_parser (doc=0x555555563560,
    ret=ret@entry=0x7fffffff9440) at mrss_parser.c:1015
1015      if (!(cur = nxmle_root_element(doc, NULL)))
(gdb) s
1018      if (!strcmp(cur->value, "rss")) {
(gdb) p cur->value
$1 = 0x555555563600 "html"
(gdb)
```
Well, there's a turn-up for the books. If I `curl` the URL, it's definitely returning RSS. I bet this is user-agent related.
To test the theory, I add a test feed pointing at my website and tail the `nginx` logs while doing both a `crossbow-fetch` and a regular `curl`:
```
77.97.102.36 - - [13/Aug/2020:11:09:18 +0000] "GET / HTTP/1.1" 200 2090 "-" "-"
77.97.102.36 - - [13/Aug/2020:11:09:55 +0000] "GET / HTTP/1.1" 301 178 "-" "curl/7.71.1"
```
So the libcurl being used by `nxml_download_file` isn't passing any User-Agent header.
Let's see what databreaches.net thinks of this:
```
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>
<title>Attention Required! | Cloudflare</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
```
Well, there we go: Cloudflare has decided that I'm not allowed to see the content because I haven't told them what my UA is (╯°□°)╯︵ ┻━┻
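For what it's worth, this is easy to reproduce straight from the command line; `-H 'User-Agent:'` tells curl to strip the header entirely, which as far as I can tell matches what the library's fetch does:

```
$ curl -s -H 'User-Agent:' https://www.databreaches.net/feed/   # Cloudflare "Attention Required!" page
$ curl -s https://www.databreaches.net/feed/                    # the actual RSS, via curl's default UA
```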
Sticking a pin in it
I need to have a bit of a think about the best way to proceed (and I need to get some real work done) so here are some options:
- Set up an HTTP proxy that keeps track of which site wants which UA and uses it accordingly
- Add a UA argument for `crossbow-fetch`
- Go the whole hog and add support to both `crossbow-fetch` and `crossbow-set` so that a UA can be set for each feed
I don't want to put in too much speculative work for something that might not be accepted upstream, so I'm going to stick a pin in this and come back to it later.