Yesterday I took a half-hour break from work to do some futzing about of my own, and I came back to something I’d already been playing around with: trying to figure out why the feed processor I’d just downloaded and built wouldn’t handle the main feed I wanted to view, https://www.databreaches.net/feed/.
I first found out about crossbow on this lobste.rs post and decided it was probably right up my alley. I’ve written and spoken elsewhere about work I’ve done previously with feeds, primarily on my betterfeed framework for reconstituting RSS feeds.
I’m fully aware, however, that my solution is super heavyweight, has a lot of code that I probably wouldn’t have needed to write myself in any other language, and is just not really usable by others.
So crossbow looked like a great solution. Looking at the man page for crossbow-cookbook, it has all sorts of examples, like:
Download the full article
This scenario is similar to the previous one, except that the item description
contains only part of the content, or nothing at all. The link field contains a
valid URL, which is intended to be reached by means of a browser.
In this case we can leverage curl(1) to do the retrieval:
crossbow set -i "$ID" -u "$URL" \
  -o subproc \
  -f "curl -o %n.html %l" \
  -C /some/destination/path/
This is a much more tractable way of doing something similar to what I was already doing, but composable on the command line. I could easily throw a pipeline to pup in for a given feed and re-use my existing CSS selectors to still pull out just the actual content.
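As a sketch of what I mean, here is my own toy expansion of the subproc format string - per the man page, %n is the item name and %l its link. To be clear, expand_fmt is a hypothetical helper of mine, not crossbow code (crossbow does its substitution internally), and the pup selector div.entry-content is a made-up example:

```shell
# Toy expansion of crossbow's subproc format string: %n -> item name,
# %l -> item link (as documented in crossbow-cookbook). expand_fmt is
# my own stand-in helper, not part of crossbow.
expand_fmt() {
  fmt=$1 name=$2 link=$3
  # '|' is a safe sed delimiter here: neither names nor these URLs contain it
  printf '%s\n' "$fmt" | sed -e "s|%n|$name|g" -e "s|%l|$link|g"
}

# What a pup-enabled pipeline could expand to for one item:
expand_fmt 'curl -s %l | pup "div.entry-content" > %n.html' \
           some-article 'https://example.com/post'
```

Running this prints the command line that would be executed for the item: curl fetching the link, pup filtering it with the CSS selector, and the result landing in some-article.html.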
But after downloading, building and installing the full set of dependencies and the project itself, I tried a very similar invocation to the one above to add https://www.databreaches.net to crossbow’s download list. That returned 0, but when I ran crossbow-fetch I got the cryptic error message:
crossbow-fetch: cannot open feed [https://www.databreaches.net/feed/]: error_code=2 (mrss: Parser error)
God dammit.
Now I could probably have just dug through the code, but I don’t have much patience for that, and I have been having some amount of fun and success recently in turning to…
This isn’t my first rodeo with gdb, but I’m not a prolific writer of compiled languages, nor am I an accomplished reverse engineer - I’m largely interested in building up my skillset because I (fairly recently) purchased a debugger for ARM devices.
Anyway, I’m going to cut a long story short here, but I learned a few things - like the fact that you can set a catchpoint (catch) on a syscall (in this case clone, because I wanted to see what was going on in another thread).
Again, to cut a long story short: the error logic I wanted to see was in the libmrss.so.0 dependency, which in my efforts I ended up recompiling and installing separately to /usr/local/... and loading in with LD_PRELOAD in my gdb invocation.
I then did the following:
- dir ~/libmrss/src to get the line information corresponding to the debug symbols
- catch syscall clone to create a breakpoint when a thread is about to be created
- run, because the next scheduler argument doesn’t want to work until you start the program (we’re safe because we’ll break before threads are created)
- set scheduler-locking on in order to prevent other threads from being scheduled whilst stepping through the current one
We can then step (s) until the new thread is created and switch to it with:
- info thread to view the threads and their IDs
- thread 2 if you want to, for example, switch to the thread with id=2
I mean, it turns out none of the important stuff actually happened in the second thread, which returns before thread 1 actually does the work that produces the error, but it’s a good thing to have learned.
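For future reference, the prep steps above can also live in a gdb command file (paths are from my setup; this is just the same commands batched up, assuming the clone catchpoint fires before the interesting work happens):

```gdb
# libmrss-debug.gdb - run with: gdb -x libmrss-debug.gdb crossbow-fetch
dir ~/libmrss/src
catch syscall clone
run
set scheduler-locking on
```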
Anyway, here’s the interesting bit:
(gdb) s
mrss_parse_url_with_options_error_and_transfer_buffer (feed_size=0x0, feed_content=0x0,
code=<optimized out>, options=0x0, ret=0x7fffffff9440,
url=0x7fffffffd480 "https://www.databreaches.net/feed/") at mrss_parser.c:1114
1114 if (nxml_parse_buffer(doc, buffer, size) != NXML_OK) {
(gdb) s
1121 if (!(err = __mrss_parser(doc, ret))) {
(gdb) s
__mrss_parser (doc=0x555555563560, ret=ret@entry=0x7fffffff9440) at mrss_parser.c:1015
1015 if (!(cur = nxmle_root_element(doc, NULL)))
(gdb) s
1018 if (!strcmp(cur->value, "rss")) {
(gdb) s
1042 else if (!strcmp(cur->value, "RDF"))
(gdb) s
1045 else if (!strcmp(cur->value, "feed"))
(gdb) s
mrss_parse_url_with_options_error_and_transfer_buffer (feed_size=0x0, feed_content=0x0,
code=<optimized out>, options=0x0, ret=<optimized out>,
url=0x7fffffffd480 "https://www.databreaches.net/feed/") at mrss_parser.c:1133
1133 nxml_free(doc);
Well, up until I came to write about this, in my mind it was the case that the feed was Atom and that’s why libmrss couldn’t parse it, but in checking my facts to be able to actually post examples, this doesn’t appear to be the case. The feed itself is RSS, and the library does have support for Atom as far as I can tell.
So going back through, let’s see what that root value is if it’s not rss:
Thread 1 "crossbow-fetch" hit Breakpoint 3, __mrss_parser (doc=0x555555563560,
ret=ret@entry=0x7fffffff9440) at mrss_parser.c:1015
1015 if (!(cur = nxmle_root_element(doc, NULL)))
(gdb) s
1018 if (!strcmp(cur->value, "rss")) {
(gdb) p cur->value
$1 = 0x555555563600 "html"
(gdb)
Well, there’s a turn-up for the books. If I curl the URL, it’s definitely returning RSS. I bet this is user-agent related.
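To make the theory concrete before testing it: my suspicion was something shaped like the toy "server" below, which only hands back the feed when the request carries some User-Agent. This is purely illustrative - my guess at the behaviour, not Cloudflare’s actual rule set:

```shell
# Toy model of the suspected behaviour: RSS for requests with a UA,
# an HTML challenge page for requests without one. Illustrative only.
serve() {
  ua=$1
  if [ -n "$ua" ]; then
    echo '<rss version="2.0"></rss>'   # root element "rss": libmrss is happy
  else
    echo '<!DOCTYPE html>'             # root element "html": mrss parser error
  fi
}

serve 'curl/7.71.1'   # what curl from my terminal would get
serve ''              # what a bare libcurl client would get
```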
To test the theory out, I add a test feed pointing at my website and tail the nginx logs, doing both a crossbow-fetch and a regular curl:
77.97.102.36 - - [13/Aug/2020:11:09:18 +0000] "GET / HTTP/1.1" 200 2090 "-" "-"
77.97.102.36 - - [13/Aug/2020:11:09:55 +0000] "GET / HTTP/1.1" 301 178 "-" "curl/7.71.1"
So the libcurl being used by nxml_download_file isn’t passing any user-agent header.
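The user-agent is the final quoted field of nginx’s default combined log format, so the difference is easy to pull out mechanically (log lines copied verbatim from above; the awk one-liner is just illustrative):

```shell
# Extract the User-Agent (the last quoted field) from the two log lines:
# crossbow-fetch's request carries none ("-"), while curl identifies itself.
logs='77.97.102.36 - - [13/Aug/2020:11:09:18 +0000] "GET / HTTP/1.1" 200 2090 "-" "-"
77.97.102.36 - - [13/Aug/2020:11:09:55 +0000] "GET / HTTP/1.1" 301 178 "-" "curl/7.71.1"'

# Split on '"': the UA is the second-to-last field of each record
printf '%s\n' "$logs" | awk -F'"' '{print "UA: " $(NF-1)}'
```

This prints "UA: -" for the first request and "UA: curl/7.71.1" for the second.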
Let’s see what databreaches.net thinks of this:
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]> <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>
<title>Attention Required! | Cloudflare</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
Well there we go, Cloudflare has decided that I’m not allowed to see content because I haven’t told them what my UA is (╯ °□°)╯︵ ┻━┻
I need to have a bit of a think about the best way to proceed (and I need to get some real work done), so here are some options:
- Hard-code a user-agent in crossbow-fetch
- Add a user-agent flag to crossbow-fetch
- Add options to both crossbow-fetch and crossbow-set so that a UA can be set for each feed
I don’t want to put in too much speculative work for something that might not be accepted upstream, so I’m going to stick a pin in this and come back to it later.