This is my first post about Backstory, a service that organizes news around real people, places, and things. It was invented, designed, and built by me and a longtime friend and fellow engineer. This post will focus on Backstory's raison d'être and software architecture. I also hope to write soon about our tooling, deployment practices, engineering challenges, and product development learnings.
Problems with Online News
For a long time I've been terribly unfulfilled by reading news online. I like to stay informed, so I read a lot, but always encounter the same few issues with news sites:
1. Lack of context
Let's work with an example. At the time of writing, a trending topic is the 2015 Baltimore Riots. Imagine that someone hitherto unfamiliar with the topic was reading the following LA Times article: As Baltimore curfew ends, celebratory crowds peacefully gather. They'd probably have many questions: Why was there a curfew? How bad were the riots? What caused them in the first place? Where is Baltimore? And who is Freddie Gray? Some of these questions might well be answered by the article itself, but more often than not, a single article is just one perspective of reporting at a single moment in a larger story. The reality of the situation is much bigger than what this one article has to offer. There is history, both recent and distant (think Michael Brown & Ferguson). There are many points of view, and many developments to the story. In our case, thousands of news publishers across the country have been reporting on this topic for a week.
The problem of context is being addressed in various ways. The riots are big enough for the Baltimore Sun to curate a special section called Freddie Gray & Baltimore Unrest. A Google search for Baltimore includes a card with useful information about the city. Wikipedia already has a substantial page dedicated to the subject. Sites like Vox.com and apps like Timeline leverage teams of editors to hand-contextualize the news. But nobody, as far as I know, is doing it systematically and automatically.
2. Lack of cohesion
This one is more about how news tends to be organized. New stories pour in at alarming rates, and yet we continue to organize them in straight lines by decreasing date. At best, we put them in predictable categories like Local, Nation, World, and Business. See Google News, SmartNews, all news readers, and every newspaper website ever. Do news events really happen in a vacuum? Don't they inform each other? The Reverb app is the best I've seen at organizing news into interesting bins, but fails to provide cross-bin relationships, ultimately focusing on "news discovery". That brings us to the next point…
3. Lack of coverage
Some services, like Zite, Flipboard, and StumbleUpon, employ personalized feeds and machine learning to serve you more of what you've already been reading, or what you think your interests are. While this can potentially make for a highly engaged reader (at least initially), it seems to ultimately work against helping the reader be objectively informed. How do I know I'm getting all the important stuff? Am I just reading for serendipity, or out of pure habit? Personally, I feel I'm missing out on the bigger picture just about everywhere I go for news. I'm always playing catch-up. A given service is typically only as extensive as the editorial team behind it, or the sources it has plugged into it. Circa and Inside are solving the coverage problem by focusing only on breaking stories and their development, but have the aforementioned context and cohesion deficiencies.
The idea we had to start solving some of the above problems was to organize news around real people, places, and things, i.e. proper nouns. Proper nouns are tangible and compelling. They are the actors in the world's stories. When news is organized by actors, you can present it in new and interesting ways:
- Timelines — Take all the news for a given actor and sort it by date. What you have is a living story that shows real-life change over time.
- Relationships — Actors interact. Which actors were involved in a particular news event? How did they interact before? Expressing these relationships, especially through graphs, can expose informative patterns.
- Enrichment — Actors tend to have things like Wikipedia entries and Twitter accounts. The more of these you identify, the more contextualized and informative you can make your presentation of the news.
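To make the first two ideas concrete, here's a minimal sketch of timelines and relationships, assuming each article has already been tagged with its actors. The `Article` class, the function names, the newest-first ordering, and the sample articles are all invented for illustration; this is not Backstory's actual code or data.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Article:
    title: str
    published: date
    actors: frozenset  # proper nouns mentioned in the article


def build_timelines(articles):
    """Map each actor to the articles that mention it, newest first."""
    timelines = defaultdict(list)
    for article in articles:
        for actor in article.actors:
            timelines[actor].append(article)
    for items in timelines.values():
        items.sort(key=lambda a: a.published, reverse=True)
    return dict(timelines)


def co_occurrences(articles):
    """Count how often each pair of actors appears in the same article."""
    counts = defaultdict(int)
    for article in articles:
        for pair in combinations(sorted(article.actors), 2):
            counts[pair] += 1
    return dict(counts)


# Invented sample data, for illustration only.
ARTICLES = [
    Article("Curfew lifted", date(2015, 5, 3), frozenset({"Baltimore", "Freddie Gray"})),
    Article("Protests begin", date(2015, 4, 25), frozenset({"Baltimore", "Freddie Gray"})),
    Article("City council meets", date(2015, 4, 28), frozenset({"Baltimore"})),
]

if __name__ == "__main__":
    for article in build_timelines(ARTICLES)["Baltimore"]:
        print(article.published, article.title)
    print(co_occurrences(ARTICLES))
```

The pair counts are a crude stand-in for relationships: the more often two actors share an article, the stronger the edge you might draw between them in a graph.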
Our hypothesis was that organizing news in a way that better approximates the real world, and allowing folks to explore that structure, would result in a refreshingly different and informative news experience. The application described below sets out to validate that hypothesis.
At the time of writing, Backstory is incarnated as a responsive website at backstory.io with the slogan The Names Behind the News. The website dubs proper nouns names and surfaces trending and latest names as well as exposing full-text search for them. Most importantly, every name gets its own page with a Wikipedia snippet and a news timeline:
The Software Architecture
As described above, the original thought behind Backstory was to organize news around actors to create something of a bigger picture, something bigger than any single piece of content. A great way to represent this bigger picture is with a news graph. Graph databases excel at modeling situations with many interlacing and unpredictable relationships. Thus, zooming all the way out, Backstory is built around a graph database centerpiece, with independent components that write to and read from that database. Arrows indicate the general flow of data through the application:
From the outset, we wanted a robust system that would also respond well to constant experimentation and change. We made some very intentional design decisions early on. Each component runs as a distinct process (or set of processes) in its own Docker container. Among other things, independently layering the various domains of the application this way provides four advantages:
- Scalability — If a particular component requires more resources, like a bigger hard drive, higher bandwidth, or more machines altogether, this can be addressed without touching the other components.
- Modularity — As long as these components maintain predictable interfaces between themselves, each can be tuned independently. For example, we've been able to experiment with several algorithms for identifying actors in the News Graph Builder. Because the same graph structure is always created, the rest of the application couldn't care less.
- Fail-safety — Sometimes the website goes down. Because all components are loosely coupled, the backend can continue to churn through articles without caring. When the website comes back, no data has been lost.
- Portability — Since each thing runs in its own Docker environment, we can spread components across different servers, or run them all on the same machine!
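The Modularity point is easiest to see in code. Below is a rough sketch, in Python rather than the Java the builder is actually written in, of two interchangeable actor-identification strategies behind one interface. The `ActorIdentifier` protocol, both strategy classes, and the `MENTIONS` edge label are hypothetical names made up for this example; neither strategy is the algorithm Backstory uses.

```python
import re
from typing import Protocol


class ActorIdentifier(Protocol):
    """The predictable interface: text in, set of actor names out."""

    def identify(self, text: str) -> set[str]: ...


class CapitalizedRunIdentifier:
    """Naive strategy: treat runs of two or more capitalized words as actors."""

    _pattern = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

    def identify(self, text: str) -> set[str]:
        return set(self._pattern.findall(text))


class KnownNamesIdentifier:
    """Alternative strategy: look for names from a fixed list."""

    def __init__(self, names):
        self._names = set(names)

    def identify(self, text: str) -> set[str]:
        return {name for name in self._names if name in text}


def to_graph_edges(article_id, text, identifier: ActorIdentifier):
    """Produce the same edge shape no matter which strategy is plugged in."""
    return sorted((article_id, "MENTIONS", actor) for actor in identifier.identify(text))


if __name__ == "__main__":
    text = "Hillary Clinton visited Pasadena on Monday."
    print(to_graph_edges("a1", text, CapitalizedRunIdentifier()))
    print(to_graph_edges("a1", text, KnownNamesIdentifier(["Hillary Clinton", "Pasadena"])))
```

The two strategies disagree about which actors they find, but anything downstream only ever sees `(article, MENTIONS, actor)` triples, so swapping one for the other requires no other changes.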
Following is a more detailed discussion of each component:
1. News Graph Builder
Technologies: Java, Named Entity Recognition, clustering, secret sauce
This component is responsible for:
- Constantly fetching news articles from the internet, and for a given article:
  - Identifying the actors in the article
  - Grouping the article with other articles about the same thing
  - Integrating the article and actors into the news graph
We've been careful to formulate each of these steps as independent processes with well-defined inputs and outputs, so that they can scale independently and become their own web services if necessary. It's also worth noting that we've planned for a future where we consume arbitrary content, not just news articles.
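As a rough illustration of steps with well-defined inputs and outputs, here's a toy pipeline in Python (the real component is Java). Everything in it is an assumption made for the example: substring lookup stands in for Named Entity Recognition, Jaccard overlap of actor sets stands in for the clustering "secret sauce", and a small in-memory `NewsGraph` stands in for the graph database.

```python
from dataclasses import dataclass, field


@dataclass
class NewsGraph:
    events: dict = field(default_factory=dict)  # event id -> list of article ids
    mentions: set = field(default_factory=set)  # (article id, actor) pairs


def identify_actors(article, known_actors):
    """Stand-in for NER: keep the known actors that appear in the text."""
    return {actor for actor in known_actors if actor in article["text"]}


def assign_event(actors, event_actors, threshold=0.5):
    """Stand-in for clustering: join the first event whose actor set
    has a Jaccard overlap of at least `threshold`, else open a new event."""
    for event_id, known in event_actors.items():
        union = actors | known
        if union and len(actors & known) / len(union) >= threshold:
            return event_id
    return f"event-{len(event_actors) + 1}"


def integrate(article, actors, event_id, graph, event_actors):
    """Record the article, its event, and its actors in the graph."""
    graph.events.setdefault(event_id, []).append(article["id"])
    graph.mentions.update((article["id"], actor) for actor in actors)
    event_actors.setdefault(event_id, set()).update(actors)


def process(articles, known_actors):
    """Run every article through the three steps in order."""
    graph, event_actors = NewsGraph(), {}
    for article in articles:
        actors = identify_actors(article, known_actors)
        event_id = assign_event(actors, event_actors)
        integrate(article, actors, event_id, graph, event_actors)
    return graph


# Invented sample input, for illustration only.
KNOWN_ACTORS = {"Baltimore", "Freddie Gray", "NASCAR"}
SAMPLE_ARTICLES = [
    {"id": "a1", "text": "Baltimore lifts curfew after Freddie Gray protests"},
    {"id": "a2", "text": "Freddie Gray case: Baltimore officers charged"},
    {"id": "a3", "text": "NASCAR season opens"},
]

if __name__ == "__main__":
    print(process(SAMPLE_ARTICLES, KNOWN_ACTORS).events)
```

Because each function only depends on its arguments, any one of them could be swapped out or moved behind a web service without the others noticing.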
Why Java? Because one of us rocks it professionally, and there's a huge ecosystem of open-source libraries. Also, because this component was likely to outlive any given frontend view of the data, we wanted a resilient, strongly typed situation.
2. News Graph Database
We use neo4j to maintain a large, ever-growing network of news events, articles, and the actors they talk about. Noting that we've been processing articles from a couple hundred news sources since September 2014, here are some counts, fresh from the database at the time of writing:
- Articles: 342,662
- Events (article clusters): 235,531
- Actors: 71,666
- Actor Relationships: 3,703,630
Thanks to neo4j, the size of this data is only 4.20GiB! I won't go into the details of the graph structure here, but suffice it to say that we can quickly answer questions like the following:
- What has Hillary Clinton been up to this week? This month?
- Which newspaper reports the most about NASCAR?
- Has Benedict Cumberbatch ever publicly interacted with Pasadena?
- How many "news hops" from the ISS to the IDF?
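The "news hops" question is a shortest-path problem. In practice the graph database would answer it, but the idea can be sketched in plain Python with a breadth-first search. I'm assuming here that one hop means one actor-to-actor relationship; the `news_hops` function and the toy edges below are invented for illustration and are not real Backstory data.

```python
from collections import deque


def news_hops(graph, start, goal):
    """Return the fewest relationship edges between two actors,
    or None if they are not connected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Invented toy adjacency lists: each key maps to the actors it has
# shared a news event with. Not real data.
TOY_GRAPH = {
    "ISS": {"NASA"},
    "NASA": {"ISS", "United States"},
    "United States": {"NASA", "Israel"},
    "Israel": {"United States", "IDF"},
    "IDF": {"Israel"},
    "NASCAR": set(),
}

if __name__ == "__main__":
    print(news_hops(TOY_GRAPH, "ISS", "IDF"))
    print(news_hops(TOY_GRAPH, "ISS", "NASCAR"))
```

Breadth-first search visits actors in order of distance, so the first time it reaches the goal it has found a shortest path.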
3. News Graph Admin Website
This is a minimal AngularJS website in front of a Java webserver, intended for managerial tasks over the graph database. The neo4j browser is excellent for various introspections and ad-hoc queries, but sometimes the task is sufficiently complex or repeated to warrant a dedicated UI.
4. News Graph REST API
Technologies: Python, Django, REST, Swagger
From the discussion in (2), hopefully it's clear that backstory.io is just a small window into the power of the news graph database. Ultimately we'd like to let others tap into that power to answer their own interesting questions. In terms of software architecture, the best way to do that is with an API layer in front of the graph. This component exposes HTTP queries for things like events, actors, and subgraphs, and has read-only interaction with the graph database. The website is currently the only API client.
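To give a feel for what a read-only layer looks like, here's a framework-free sketch using only the Python standard library. It is not the Django implementation: the `/actors/<id>` path, the `limit` parameter, the `handle` function, and the sample record are all hypothetical.

```python
import json
from urllib.parse import parse_qs, urlparse

# Invented stand-in for data that would come from the graph database.
ACTORS = {
    "nascar": {"name": "NASCAR", "events": ["Season opens", "Race postponed"]},
}


def handle(method, url):
    """Route a request to a (status code, JSON body) pair."""
    if method != "GET":
        # Read-only: anything that could modify the graph is rejected.
        return 405, json.dumps({"error": "read-only API"})

    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) == 2 and parts[0] == "actors":
        actor = ACTORS.get(parts[1])
        if actor is None:
            return 404, json.dumps({"error": "unknown actor"})
        limit = int(parse_qs(parsed.query).get("limit", ["10"])[0])
        return 200, json.dumps({"name": actor["name"], "events": actor["events"][:limit]})

    return 404, json.dumps({"error": "unknown resource"})


if __name__ == "__main__":
    print(handle("GET", "/actors/nascar?limit=1"))
    print(handle("POST", "/actors/nascar"))
```

Keeping the routing in a pure function like this makes the read-only guarantee easy to check: there is simply no code path that writes.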
5. Backstory Website
The website is basically a cool way to explore the news graph. It's changed a lot as we've clarified our vision and adapted to customer feedback over the months. As far as software goes, it's primarily AngularJS that's enabled us to change quickly. Because of data-binding, once my services and object models are set, I can spend time tinkering with declarative views. It's so simple it doesn't even feel like coding. The Foundation framework has also given us snappy mobile-friendliness out of the box.
Shortly after the inception of Backstory, we discovered Lean Startup. That is to say, we discovered how to apply the scientific method to entrepreneurship, so as not to waste people's time. We're not experts, and are learning every day how to better develop this product. We've built a lot, and it seems valuable to us, but we still need confirmation from the online news-reading masses. If you're interested in actively using Backstory, or providing us feedback that will lead to new features, you can sign up here! Backstory is also on Twitter and Facebook.
How we run this system is another story I'd like to tell soon. Think Bitbucket, JARs in containers, Continuous Delivery, and Cloudflare… Thanks!