Friday, May 11, 2012

The Third Meetup

Last Tuesday was our third meetup for Chuck eesley's venture-lab.org. Instead of the Michigan Street Starbucks opposite to The Chicago Tribune, we pivoted to the Wormhole for this one. For any geek who has been to this place, knows what a riot it is. From "Back to the Future" time-machine retro-fitted on the ceiling, old atari cartridges as showpieces on the coffee-table, super typo-friendly wifi password, stopwatch controlled brewing, Starwars puppets, shopkeep.com app on ipad instead of cashbox - the bearded coffee masters had it all. Everything except a place to accommodate the thunderous 8 of Lake Effect Ventures. But of course, we have Benn Bennett - son of a lawyer and a linguist by profession. We witnessed the art of "coaxing people out of their couch" - 2 minutes later the entire team had the best possible seating arrangement.
 


Two hours of caffeine drenched brainstorming spitted out the following:

  1. I sketched out how the process might flow in two steps.  We are down to a pretty bare minimum concept build which is ideal both for this class and for getting something up quickly so that we can test it.
  2. I set up a Twitter account for Lake Effect Ventures so that we can tweet about progress we are making.
  3. Andy is going to jot up a positioning statement and beef up the business model canvas for the concept
  4. Leandre will use these to complete our 2-slide initial submission for our deliverable for the next deadline
  5. Leandre will also use this to start to craft a presentation deck
  6. Benn will be working on the copy for the landing page that I started.
  7. Benn will also be crafting a logo in Photoshop (Alex, Zak, Sidi if any one of you is good with design Benn would appreciate the assistance there)
  8. We need to think of a name for the concept as well
We think it is a bit premature to start on the user stories right now given that we have a good idea of what we are gonna build. Charles and me are gonna start on that and look to have something complete from a Version 1.0 standpoint by mid next week barring no setbacks. We will look to craft the user stories once we complete the MVP and use them as structure for testing features and functionality (Zak stay tuned on this)
Benn and Andy will also be working on putting together a more formal customer survey so that we structure the interviews we are having and start to compile meaningful data which we will need going forward.
Its getting exciting ….

Saturday, April 14, 2012

Advice:John Doerr on working in teams



Incredible Networking:  Collect names, emails of all folks you meet. Be very careful about who your friends and keep in touch - after all you become the average of the five people you spend your time with. Call them up - Its incredible what people will tell you over the phone. (This is something, I have always fallen short - I can hardly get beyond emails).

Carry Chessick, the founder and last CEO of restaurant.com once told me after his lecture session at UIC, that networking as it is perceived is worthless. When you meet people, make sure you finish off by saying "If I can be of any help to you, please do not hesitate to get in touch". That's the only way that business card will actually fetch you some benefit. I met a sales guy from SalesForce.com, some time back at Chicago Urban Geeks drink ... who sent out a mail immediately after the introduction from his phone with a one line saying who he was, where we met, and that he'll keep an eye on tech internship notices for me. Brilliant.

360-s: If you want to find information about some company, of course you Google. So lets say if you are gathering info about Google, you'll also want to talk to their competitors Yahoo, Bing ... and find what they are thinking. Then you triangulate all that information to get in a good position.

Coaching: Make sure there is some one will consistently give you advice on what's going on in your workplace.

Mentoring: Having a very trusted person outside your work who can give advice is invaluable.

Time buddy: How do you make sure that you are doing good time management? Get a time buddy, compare your calendars on how you are spending time. Bill Gates does this Steve Balmer.

Another interesting practice I've read sometime back on Hackernews is communicating with team members in two short at regular intervals:
(1) What I did last week/day:
(2) What I'll do next week/day:

As my dear friend Guru Devanla(https://github.com/gdevanla) would put it "Its all about setting expectations ... and meeting them"!

Monday, August 15, 2011

Data loss protection for source code

Scopes of Data loss in SDLC
In a post Wikileaks age the software engineering companies should probably start sniffing their development artifacts to protect the customer's interest. From requirement analysis document to the source code and beyond, different the software artifacts contain information that the clients will consider sensitive. The traditional development process has multiple points for potential data loss - external testing agencies, other software vendors, consulting agencies etc. Most software companies have security experts and/or business analysts redacting sensitive information from documents written in natural language. Source code is a bit different though.

A lot companies do have people looking into the source code for trademark infringements, copyright statements that do not adhere to established patterns, checking if previous copyright/credits are maintained, when applicable. Blackduck or, Coverity are nice tools to help you with that.

Ambitious goal

I am trying to do a study on data loss protection in source code - sensitive information or and quasi-identifiers that might have seeped into the code in the form of comments, variable names etc. The ambitious goal is detection of such leaks and automatically sanitize (probably replace all is enough) such source code and retain code comprehensibility at the same time.

To formulate a convincing case study with motivating examples I need to mine considerable code base and requirement specifications. But no software company would actually give you access to such artifacts. Moreover (academic) people who would evaluate the study are also expected to be lacking such facilities for reproducibility. So we turn towards Free/Open source softwares. Sourceforge.net, Github, Bitbucket, Google code - huge archives of robust softwares written by sharpest minds all over the globe. However there are two significant issues with using FOSS for such a study.

Sensitive information in FOSS code?


Firstly, what can be confidential in open source code? Majority of FOSS projects develop and thrive outside the corporate firewalls with out the need for hiding anything. So we might be looking for the needle in the wrong haystack. However, being able to define WHAT sensitive information is we can probably get around with it.

There are commercial products like Identity Finder that detect information like Social Security Numbers (SSNs), Credit/Debit Card Information (CCNs), Bank Account Information, any Custom Pattern or Sensitive Data in documents. Some more regex foo or should be good enough for detecting all such stuff ...


#/bin/sh
SRC_DIR=$1
for i in `cat sensitive_terms_list.txt`;do
        for j in `ls $SRC_DIR`; do cat $SRC_DIR$j | grep -EHn --color=always $i ; done
done




Documentation in FOSS


Secondly, the 'release early, release often' bits of FOSS make a structured software development model somewhat redundant. Who would want to write requirements docs, design docs when you just want to scratch the itch? The nearest in terms of design or, specification documentation would be projects which have adopted the Agile model (or, Scrum, say) of development. In other words, a model that mandates extensive requirements documentation be drawn up in the form of user stories and their ilk. being a trivial example.

Still Looking
What are some of the famous Free/Open Source projects that have considerable documentation closely resembling a traditional development model (or models accepted in closed source development)? I plan to build a catalog of such software projects so that it can serve as a reference for similar work that involve traceability in source code and requirements.

Possible places to look into: (WIP)
* Repositories mentioned above
* ACM/IEEE
* NSA, NASA, CERN

Would sincerely appreciate if you leave your thoughts, comments, poison fangs in the comments section ... :)

Monday, August 8, 2011

Hacking the newsroom

[This is part 2 of the final pitch, which talks about the newsroom and business perspective. Part 1, detailing the newsreader perspective is here.]

Before anything else, there must be a 90 seconds theatrical promo:

Stop laughing at my amateurish video editing! This is my first ever ... even Bergman, Godard, Fellini started somewhere to be great! Jokes apart here's what REVEAL actually is all about:


Lets consider a hypothetical newsroom which uses REVEAL. A journalist gets hold a huge collection of classified documents that contains potentially sensitive information. Instead of painstakingly reading each line and jumping back to google to search relevant information - she uploads them to REVEAL and hits the pantry for her coffee. Reveal goes to work and automatically parses out names of pepole, places, organizations etc. Using the names it detected, REVEAL affixes thumbnail images with the mappings of the named entities with the documents. The journalist now sits back, sips the coffee and flips through the images looking for someone/something/some place that's interesting and jumps directly to the document when she finds her target.

But that's not all. In order to make the life much easier for the journalist - REVEAL uses the names and keywords from the document, to aggregates semantically related contents from the net - images, video, news, blog, wiki articles using open apis. Making the background context readily available, it allows the journalist focus solely on her analysis of the story.


What follows is an over the top ambitious plan for making lots of money - I mean the business plan.

Unearthing named entities involves doing tonnes of computationally intensive text analysis and for any sizable dataset we need a cloud based solution. While REVEAL will always be Free and Open Source Software, the business proposition is offering it as a service. Be a startup or a news corp, whoever deploys REVEAL at their site - they can offer it as a service to other news agencies/ organizations based on pay by usage model. Different packages can be offered based on when they want to share the information dug out from their documents.

Nothing like REVEAL exists today. The cohesive bond of unknown information on well known personalities and organizations, original content (the documents), expert opinion(journalist's view), user generated content(comments) and  aggregated content - will make REVEAL a dream product for generating ad-revenues. Features for lead generation is inbuilt into the system and the karma points based reader appreciation along with the 360 degree view of the world will ensure persistent traffic.

Now get me to Berlin Hackathon!
(398 words)

Most common names detected in Wikileaks cablegate files

Link to an incomplete implementation

Saturday, July 23, 2011

Reveal - How much does the world know?


Have you ever had a shit-tonne of documents dumped into your inbox with an impossible deadline demanding to suck out the hidden juicy bits? Or may be it has been a joyful experience of discovering the dump of an MILF's emails, diplomatic cables, or code dumps of an evil corporation's website? At moments like those, you might have uttered, "fcuk! ... Omne Ignotum Pro Magnifico!". Wouldn't it be nice if the needles just magically popped out of the haystack? Meet Reveal (clickable prototype) - a software framework that aspires to achieve that and may be a bit more.


Background:
While sed/awk/grep-ing the cablegate files, I stumbled upon a cable that mentioned Kofi Annan asking Robert Mugabe to step down in exchange for a handsome retirement plan during the Millennium summit. Being an ignorant bloke, I could hardly recall what the Millennium summit was about, had no clue if Mugabe was still in office, and if Kofi Annan has made a comment on this! Without the right background and context I could not appreciate the data to the full extent. Below is the #MozNewsLab final project idea pitch in the lights of the three speakets  this week: Chris Heilmann, John Resig and Jesse James Garrett

"What is this thing for? What does it do? How is it supposed to fit into people's lives?", @Jesse James Garrett:

Journalists get amazing amount of digital data everyday which are in the form of numbers in tables. With some spreadsheet skills or help from newsroom programmers, they produce incredible revelations of the reality that hides behind those numbers. However, when the data comes in the form of unstructured text files written in natural language - there isn't much algorithmic help available, other than full text searches with a list of guess words. Using cutting edge information retrieval technique,  Reveal would aim to build a framework that automatically annotates names, places, locations, dates etc. in the unstructured text files.

"Adopting Open Source, Open standards", @Chris Heilmann:
Being baptized by St. IGNUcius, the idea of Free as in Freedom runs through the core of the technology stack of Reveal. Standard LAMP stack for server side, UI powered by HTML5, CSS3 and jQuery plugins and a number of open source libraries for doing the information extraction - long post describing the information retrieval technology coming soon. (Mind map above).

Using the detected names, locations, dates etc., Reveal will try to aggregate additional information in the form of images, maps, news articles, videos, wikipedia pages, visualizations etc. via open API-s and use them as navigational elements to browse the data. Juxtaposed to the document under scrutiny, these will provide the right context to gauge the sensitivity of the information.


"User to Contributor", @John Resig:
Additionally, by showing a relative score of "How much does the world know?", calculated on the basis of the aggregated information published before the documents surfaced, we can excite the newsreaders to share the information across their own social network. Add some game mechanics by quantifying that "sharing", and we bust the filter bubble of ignorant blokes and turn them into responsible citizens who'll raise voices against wrong doings of totalitarian regimes, evil corporations or other bad asses. This will lead to creation of more content and will act as a feedback loop to the background and context aggregation step before.

Now, a similar project by the uber journalist-programmer Jonathan Stray of the AP has won this year's Knight Mozilla news challenge. His approach, Overview, solely focuses on clustering documents based on cosine similarity of their tf-idf scores. Using sexy visualization, it pulls out key terms specific to the corpus under study. The night when the results of Knight Mozilla challenge was announced - in an euphoric outburst I sent him an embarrassingly long late night email ranting the above. Obviously, I never heard back but he will be releasing his code soon and I am super excited to fork it for visualizations in Reveal.

That is my final software idea pitch inspired by Chris Heilmann, John Resig and Jesse James Garrett #MozNewsLab Week 2:



Monday, July 18, 2011

Tweetsabers of News Revolution across the globe #MozNewsLab



After Amanda Cox's lecture I was too pepped up to do some quick and sexy data visualization. In my daily life, I rely on R or GNUPlot for doing all my plots, simply because of their scripting interface. I have played a bit with Google chart and visualization api and they are absolutely brilliant. I'm planning to get my hands dirty with matplotlib (yep, yep ... Python!).

So mid way into the re-listening the lecture, I was overpowered by this feeling of doing some global data visualization. One of the best global data visualization tool that have left an impression on my mind, is the WebGL Globe - an open platform from Google Data Arts Team. I grabbed the example code from here and with a simple Python script collected Twitter activities during the first week of #MozNewsLab into a .json file. Some simple changes to the sample javascript code and there you have colored light sabers shooting out from the globe.

The main problem was the twitter api limit - a meager 125 queries per hour. So I had to rely on the geo-location data that I had scraped earlier for mapping #MozNewsLab participants in the world map. That allowed me to narrow down the geo location queries for those who were not a participant of #MozNewsLab. In case the geo-location info was not available on their twitter profile, those homeless tweet counts were assigned to our dear Lab co-lead Phillip Smith.

Friday, July 15, 2011

EntreJourno 101 with Burt Herman, CEO & co-founder Storify at #MozNewsLab