Monday, August 8, 2011

Hacking the newsroom

[This is part 2 of the final pitch, which covers the newsroom and business perspective. Part 1, detailing the newsreader perspective, is here.]

Before anything else, there must be a 90-second theatrical promo:

Stop laughing at my amateurish video editing! This is my first ever ... even Bergman, Godard and Fellini had to start somewhere before they became great! Jokes apart, here's what REVEAL is actually all about:


Let's consider a hypothetical newsroom which uses REVEAL. A journalist gets hold of a huge collection of classified documents that contain potentially sensitive information. Instead of painstakingly reading each line and jumping back to Google to search for relevant information, she uploads them to REVEAL and heads to the pantry for her coffee. REVEAL goes to work and automatically parses out the names of people, places, organizations etc. Using the names it detected, REVEAL attaches a thumbnail image to each named entity and maps the entity to the documents that mention it. The journalist now sits back, sips her coffee and flips through the images looking for someone, something or some place that's interesting, and jumps directly to the document when she finds her target.
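To make the entity-to-document mapping concrete, here is a minimal Python sketch. It is not REVEAL's actual code: the regular expression is a deliberately naive stand-in for a real named-entity recognizer (it simply treats runs of two or more capitalized words as a name), and the function names, document IDs and sample sentences are all made up for illustration.

```python
import re
from collections import defaultdict

# ASSUMPTION: a naive stand-in for a real named-entity recognizer.
# It treats any run of two or more capitalized words as a candidate name.
CANDIDATE_NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract_entities(text):
    """Return the set of candidate names found in one document."""
    return set(CANDIDATE_NAME.findall(text))


def build_entity_index(documents):
    """Map each candidate name to the IDs of the documents mentioning it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for name in extract_entities(text):
            index[name].add(doc_id)
    return index


# Hypothetical sample documents, invented for this example.
documents = {
    "cable-001": "Officials said Jane Doe met with Acme Holdings in New Delhi.",
    "cable-002": "A report from New Delhi mentions Jane Doe again.",
}

index = build_entity_index(documents)
for name in sorted(index):
    print(name, "->", sorted(index[name]))
```

With an index like this, clicking on a name (or its thumbnail) only needs a dictionary lookup to find every document that mentions it. A real recognizer would also tell people, places and organizations apart, which this sketch does not attempt.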

But that's not all. To make life much easier for the journalist, REVEAL uses the names and keywords from the documents to aggregate semantically related content from the net - images, video, news, blogs, wiki articles - using open APIs. By making the background context readily available, it allows the journalist to focus solely on her analysis of the story.
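As a rough illustration of the aggregation step, the Python sketch below turns detected names into search requests against one open API. The choice of Wikipedia's MediaWiki search endpoint is an assumption made for this example, not necessarily a source REVEAL uses, and the function names are hypothetical. The code only builds the request URLs; fetching and merging the results is left out.

```python
from urllib.parse import urlencode

# ASSUMPTION: Wikipedia's MediaWiki API is used here as one example of an
# open API; the pitch does not say which sources REVEAL actually queries.
WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"


def wikipedia_search_url(entity, limit=5):
    """Build a full-text search request URL for one detected name."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": entity,
        "srlimit": limit,
        "format": "json",
    }
    return WIKIPEDIA_API + "?" + urlencode(params)


def context_queries(entities):
    """Map every detected name to the URL that would fetch its background."""
    return {name: wikipedia_search_url(name) for name in sorted(entities)}


# Hypothetical detected names, invented for this example.
queries = context_queries({"Jane Doe", "Acme Holdings"})
for name, url in queries.items():
    print(name, "->", url)
```

Image, video and news sources would each need a similar small adapter, so keeping one function per source makes it easy to add or drop a source later.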


What follows is an over-the-top ambitious plan for making lots of money - I mean, the business plan.

Unearthing named entities involves tonnes of computationally intensive text analysis, and for any sizable dataset we need a cloud-based solution. While REVEAL will always be Free and Open Source Software, the business proposition is to offer it as a service. Be it a startup or a news corp, whoever deploys REVEAL at their site can offer it as a service to other news agencies and organizations on a pay-per-use model. Different packages can be offered based on when they want to share the information dug out from their documents.

Nothing like REVEAL exists today. The cohesive bond of unknown information on well-known personalities and organizations, original content (the documents), expert opinion (the journalist's view), user-generated content (comments) and aggregated content will make REVEAL a dream product for generating ad revenue. Features for lead generation are built into the system, and the karma-points-based reader appreciation, along with the 360-degree view of the world, will ensure persistent traffic.

Now get me to the Berlin Hackathon!
(398 words)

Most common names detected in Wikileaks cablegate files

Link to an incomplete implementation

2 comments:

Mark Reginald James said...

This is great. It could actually be used on any link-poor webpage to make it easier to view and access related information.

The social document deconstruction aspects would really help a newsroom collaborate to tease out meaning and significance. It should be a feature of Wikileaks.

It may even be able to work as a browser add-on rather than as a hosted service, making the magic happen automatically. See if you can work with other KMLL participants who are developing add-on-based Web annotation and comment systems.

Tathagata Dasgupta said...

Actually, the annotation does not rely on mechanical turks (i.e. human readers) but on an information retrieval technique called Named Entity Extraction that AUTOMATICALLY pulls out the names. Human effort would come in at a later stage - more like a teacher grading answer scripts. I ran the code on 36 MB of cablegate documents released in December 2010 and found thousands of names of people, organizations and places. The auto-tagging took all night to complete on my Intel Core 2 Duo laptop with 4 GB of RAM. So a cloud is absolutely essential!