Yesterday I took half an hour off work to do some futzing about of my own, and came back to something I’d already been playing with: figuring out why the feed processor I’d just downloaded and built wouldn’t handle the main feed I wanted to view, https://www.databreaches.net/feed/.

cRoSSbow

I first found out about crossbow on this lobste.rs post and decided it was probably right up my alley. I’ve written and spoken elsewhere about work I’ve done previously with feeds, primarily on my betterfeed framework for reconstituting RSS feeds.

I’m fully aware, however, that my solution is super heavyweight, has a lot of code that I probably wouldn’t have needed to write myself in any other language, and is just not really usable for others.

So crossbow looked like a great solution. Looking at the man page for crossbow-cookbook it has all sorts of examples like:

Download the full article
    This scenario is similar to the previous one, except that the item description
    contains only part of the content, or nothing at all.  The link field contains a
    valid URL, which is intended to be reached by means of a browser.

    In this case we can leverage curl(1) to do the retrieval:

                crossbow set -i "$ID" -u "$URL" \
                    -o subproc \
                    -f "curl -o %n.html %l" \
                    -C /some/destination/path/

This is a much more tractable way of doing something similar to what I was already doing, and it’s composable on the command line. I could easily throw in a pipeline to pup for a given feed and re-use my extant CSS selectors to pull out just the actual content.

But after downloading, building and installing the full set of dependencies and the project itself, I used a very similar invocation to the one above to add https://www.databreaches.net to crossbow’s download list. That returned 0, but when I ran crossbow-fetch I got this cryptic error message:

crossbow-fetch: cannot open feed [https://www.databreaches.net/feed/]: error_code=2 (mrss: Parser error)

God dammit.

Now I could probably have just dug through the code, but I don’t have much patience for that, and I have been having some amount of fun and success recently in turning to…

gdb

This isn’t my first rodeo with gdb, but I’m not a prolific writer of compiled languages nor am I an accomplished reverse engineer - I’m largely interested in building up my skillset because I (fairly recently) purchased a debugger for ARM devices.

Anyway, I’m going to cut a long story short here, but I learned a few things:

How to “break” (technically it’s a catch) on a syscall (in this case clone because I wanted to see what was going on in another thread)
How to lock the scheduler so other threads don’t continue to run in the background
How to step through a thread in the foreground
How to enable source listing for stepping through code with debug symbols compiled in

Again, to cut a long story short, the error logic I wanted to see was in the libmrss.so.0 dependency, which I ended up recompiling, installing separately to /usr/local/... and loading with LD_PRELOAD in my gdb invocation.

I then did the following:

dir ~/libmrss/src to get the line information corresponding to the debug symbols
catch syscall clone to create a breakpoint when a thread is about to be created
run because the scheduler-locking setting that follows can’t be applied until the program has started (we’re safe because we’ll break before any threads are created)
set scheduler-locking on in order to prevent other threads from being scheduled whilst stepping through the current one.

We can then step (s) until the new thread is created and switch to it with:

info threads to view the threads and their IDs
thread 2 if you want to, for example, switch to the thread with id=2

I mean, it turns out none of the important stuff actually happened in the second thread, which returns before thread 1 does the work that produces the error, but it’s a good thing to have learned.
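For repeat runs, the gdb commands listed above could be collected into a command file rather than typed each time. This is only a minimal sketch: the file name and the MRSS_LIB variable are placeholders I have made up, and `set environment LD_PRELOAD` is just one way of getting a rebuilt library into the debugged program, not necessarily the invocation used here.

```shell
#!/bin/sh
# Hypothetical location for the generated gdb command file.
GDB_SCRIPT="${TMPDIR:-/tmp}/crossbow-debug.gdb"

{
    # Optional: MRSS_LIB is a placeholder for the path of the rebuilt
    # libmrss.so.0; when set, preload it into the debugged program.
    if [ -n "${MRSS_LIB:-}" ]; then
        echo "set environment LD_PRELOAD $MRSS_LIB"
    fi
    # Source path, catchpoint, run, then lock the scheduler, in the
    # same order as the walkthrough above.
    cat <<'EOF'
dir ~/libmrss/src
catch syscall clone
run
set scheduler-locking on
info threads
EOF
} > "$GDB_SCRIPT"

echo "Wrote $GDB_SCRIPT; try: gdb -x $GDB_SCRIPT crossbow-fetch"
```

Once the catchpoint is hit, gdb carries on with the rest of the file, so the scheduler is already locked by the time you start stepping by hand.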

Anyway, here’s the interesting bit:

(gdb) s
mrss_parse_url_with_options_error_and_transfer_buffer (feed_size=0x0, feed_content=0x0,
        code=<optimized out>, options=0x0, ret=0x7fffffff9440,
        url=0x7fffffffd480 "https://www.databreaches.net/feed/") at mrss_parser.c:1114
1114      if (nxml_parse_buffer(doc, buffer, size) != NXML_OK) {
(gdb) s
1121      if (!(err = __mrss_parser(doc, ret))) {
(gdb) s
__mrss_parser (doc=0x555555563560, ret=ret@entry=0x7fffffff9440) at mrss_parser.c:1015
1015      if (!(cur = nxmle_root_element(doc, NULL)))
(gdb) s
1018      if (!strcmp(cur->value, "rss")) {
(gdb) s
1042      else if (!strcmp(cur->value, "RDF"))
(gdb) s
1045      else if (!strcmp(cur->value, "feed"))
(gdb) s
mrss_parse_url_with_options_error_and_transfer_buffer (feed_size=0x0, feed_content=0x0,
        code=<optimized out>, options=0x0, ret=<optimized out>,
        url=0x7fffffffd480 "https://www.databreaches.net/feed/") at mrss_parser.c:1133
1133      nxml_free(doc);

Well, up until I came to write this up, I’d assumed the feed was Atom and that was why libmrss couldn’t parse it. But when I checked my facts so I could post actual examples, that turned out not to be the case.

This is not the feed you’re looking for

The feed itself is RSS and the library does have support for Atom as far as I can tell.

So going back through, let’s see what that root value is if it’s not rss:

Thread 1 "crossbow-fetch" hit Breakpoint 3, __mrss_parser (doc=0x555555563560,
        ret=ret@entry=0x7fffffff9440) at mrss_parser.c:1015
1015      if (!(cur = nxmle_root_element(doc, NULL)))
(gdb) s
1018      if (!strcmp(cur->value, "rss")) {
(gdb) p cur->value
$1 = 0x555555563600 "html"
(gdb)

Well, there’s a turn-up for the books. If I curl the URL, it’s definitely returning RSS. I bet this is user-agent related.

To test the theory out, I add a test feed that’s pointing to my website and tail the nginx logs, doing both a crossbow-fetch and a regular curl:

77.97.102.36 - - [13/Aug/2020:11:09:18 +0000] "GET / HTTP/1.1" 200 2090 "-" "-"
77.97.102.36 - - [13/Aug/2020:11:09:55 +0000] "GET / HTTP/1.1" 301 178 "-" "curl/7.71.1"

So the libcurl being used by nxml_download_file isn’t passing any user-agent header.

Let’s see what databreaches.net thinks of this:

<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>
<title>Attention Required! | Cloudflare</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />

Well, there we go: Cloudflare has decided that I’m not allowed to see content because I haven’t told them what my UA is (╯ °□°）╯︵ ┻━┻
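The same behaviour should be reproducible without crossbow by asking curl to drop its User-Agent header: passing -H with a header name and nothing after the colon removes that header from the request. A rough sketch follows; the function names are made up for illustration, and the check only matches the title of the block page shown above.

```shell
#!/bin/sh
FEED_URL="https://www.databreaches.net/feed/"

# Fetch a URL with no User-Agent header at all.
# An -H header with nothing after the colon tells curl to omit it.
fetch_without_ua() {
    curl -sS -H 'User-Agent:' "$1"
}

# Succeed if the HTML on stdin carries the Cloudflare block page title.
is_cloudflare_block() {
    grep -qF '<title>Attention Required! | Cloudflare</title>'
}

# Example (needs network access, so it is left commented out):
# fetch_without_ua "$FEED_URL" | is_cloudflare_block && echo "blocked without a UA"
```

Running the commented-out example and then a plain curl of the same URL should show whether the missing header is really the only difference.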

Sticking a pin in it

I need to have a bit of a think about the best way to proceed (and I need to get some real work done) so here are some options:

Set up an HTTP proxy that keeps track of which site wants which UA and uses it accordingly
Add a UA argument for crossbow-fetch
Go the whole hog and add support to both crossbow-fetch and crossbow-set so that UA can be set for each feed
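To make the per-feed idea a bit more concrete, here is a rough sketch of the lookup such a proxy or patch would need, written as a curl wrapper rather than anything inside crossbow. Everything in it is hypothetical: the function names, the default UA string and the mapping itself. The only real value is curl/7.71.1, which is the UA from the nginx log above.

```shell
#!/bin/sh
# Hypothetical lookup: pick a User-Agent based on the feed's host.
ua_for_url() {
    # Strip the scheme, then everything from the first slash onwards.
    host=${1#*://}
    host=${host%%/*}
    case "$host" in
        www.databreaches.net) echo "curl/7.71.1" ;;      # UA seen in the nginx log
        *)                    echo "feed-fetcher/0.1" ;; # made-up default
    esac
}

# Fetch a feed, sending whichever UA the lookup chose for its host.
fetch_feed() {
    curl -sS -A "$(ua_for_url "$1")" "$1"
}
```

A call like fetch_feed https://www.databreaches.net/feed/ would then send the UA mapped to that host; how the result would be handed to crossbow is left open.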

I don’t want to put in too much speculative work for something that might not be accepted upstream, so I’m going to stick a pin in this and come back to it later.