Sonian Summer Codefest 2011: Abundant Innovation - Part 2
Continued from the first blog installment of Codefest 2011.
Team 7: Git ’r Done (Third Place Winner)
Windows node deployments were one of the last manual tasks. Automating them makes automatic scaling possible and reduces the risk of human error.
The benefit to Sonian is reaching the goal of 100% automated cloud deployments.
Team 8: The Sharper Image (Second Place Winner)
This team, a combination of SAFE and website folks (Paul, Phil, Kevin, and Bira), showed us how to employ an image hashing algorithm to find images in the archive.
Searching the archive now is limited to text-based queries. But what if you want to find an image that has no text and all you have is a similar image to use as your reference?
The team started by using a technique called perceptual hashing to calculate a hash value for each image in the archive, storing that value in the full-text index alongside the standard index text for each object. Perceptual hashing is well suited to images because visually similar images produce similar hash values: the hash is largely unaffected by image scaling and works across a variety of formats.
The theory is that with all image hash values stored in the index, a customer could use the search UI to find an image if they have a sample of what they are looking for. The best way to describe the image is to upload a similar image and ask the system to return all images in the archive that “look like this sample.” And that is what the team demonstrated.
Paul initially indexed three emails with photo attachments of his children and one email with an image that was NOT of his children. The team first ran a search without image classification, and the results returned all four images: the three of his children and the one that was not. The team then ran the same search again, this time uploading a sample image of Paul’s child, and the search returned just the three emails with pictures of the children.
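As a rough illustration of the idea, here is a minimal Python sketch of one simple perceptual hash, the "average hash," together with a similarity lookup. This is not the algorithm the team used (the post does not say which one they chose); the function names, the 8x8 grid size, the distance threshold, and the in-memory `index` dictionary are all assumptions made for the example. Images are represented as plain grids of grayscale values so the sketch needs no imaging library.

```python
from typing import Dict, List, Sequence

Grid = Sequence[Sequence[int]]  # grayscale pixels, 0-255 (assumed input format)


def downscale(pixels: Grid, size: int = 8) -> List[List[float]]:
    """Shrink a grayscale image to size x size by averaging blocks of pixels."""
    height, width = len(pixels), len(pixels[0])
    small = []
    for row in range(size):
        y0 = row * height // size
        y1 = max(y0 + 1, (row + 1) * height // size)
        out_row = []
        for col in range(size):
            x0 = col * width // size
            x1 = max(x0 + 1, (col + 1) * width // size)
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            out_row.append(sum(block) / len(block))
        small.append(out_row)
    return small


def average_hash(pixels: Grid, size: int = 8) -> int:
    """Return a size*size-bit hash: one bit per cell, set if brighter than the mean."""
    flat = [value for row in downscale(pixels, size) for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


def find_similar(sample_hash: int, index: Dict[str, int], max_distance: int = 5) -> List[str]:
    """Return the names of indexed images whose hash is close to the sample's."""
    return sorted(
        name for name, stored in index.items()
        if hamming(sample_hash, stored) <= max_distance
    )


# Hypothetical usage: a gradient image, a 2x enlargement of it, and an unrelated image.
photo = [[x * 16 for x in range(16)] for _ in range(16)]
enlarged = [[(x // 2) * 16 for x in range(32)] for _ in range(32)]
other = [[y * 16 for _ in range(16)] for y in range(16)]

index = {"photo.png": average_hash(photo), "other.png": average_hash(other)}
print(find_similar(average_hash(enlarged), index))
```

Because the hash is computed from a downscaled copy, the enlarged image hashes the same as the original, while the unrelated image lands far away in Hamming distance. A production system would need a more robust hash and an index structure that supports nearest-neighbor lookups.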
This approach and framework for indexing non-text components and then searching based on non-text samples can be extended to other data types beyond images. Over time the Sonian service will focus on not just returning a million hits fast, but returning that “one in a million” hit with speed and accuracy.
Team 9: Beautification
Matt W., Sonian’s website team leader, along with Ryan, demonstrated the Sonian Viewer using ExtJS, a new component from Sencha. As background, the latest product release with the enhanced My Archive application employs ExtJS as its new visual toolkit. It is a set of tools for creating “rich web applications” that perform and feel like desktop apps but run in the browser.
Team Beautification’s goal was to show the Sonian Viewer with a more intuitive interface and improved user experience. The Sonian Viewer is the “Swiss army knife” for managing our cloud infrastructure, but as an in-house project its design was basic Rails scaffolding. Adding ExtJS functionality will make pages load faster and make the Sonian Viewer easier for non-technical users.
ExtJS will become a core Sonian user interface component. This toolkit also supports mobile device development and implements HTML5 standards to ensure cross-browser functionality.
Team 10: Performance Art
Continuing along the themes of “efficiency, cost analysis, and visualizations,” Jim, Joe G. and Steve from the SAFE team demonstrated some data performance art to great applause. The famous quote from Lord Kelvin is how the team summarized their ideas: “When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge of it is of a meager and unsatisfactory kind...” The team’s goal was to instrument performance benchmarking in the SAFE server code and analyze the results.
Based on a preliminary review from one test run, the team was pretty confident they could improve SAFE server efficiency by 20% with a few configuration changes. With more data, more efficiency gains will be realized.
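The post does not show how the SAFE server was instrumented, but the general idea of timing code paths and then analyzing the numbers can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `timed` helper, the `timings` store, and the "index" label are hypothetical names, not part of SAFE.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean

# Collected durations in seconds, keyed by a label chosen by the caller.
timings = defaultdict(list)


@contextmanager
def timed(label):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label].append(time.perf_counter() - start)


def summarize():
    """Return call count, mean, and maximum duration for each label."""
    return {
        label: {"calls": len(samples), "mean_s": mean(samples), "max_s": max(samples)}
        for label, samples in timings.items()
    }


# Hypothetical usage: time a stand-in workload three times, then report.
for _ in range(3):
    with timed("index"):
        sum(range(100_000))

print(summarize())
```

With measurements like these collected across many runs, it becomes possible to compare configurations by the numbers rather than by feel, which is the point of the Kelvin quote.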
Sonian benefits whenever we can find performance optimizations in our own code. In this example we will be able to process more data for less money without sacrificing customer experience. A triple win scenario.
Team 11: Team Speed
Pete started his presentation by reminding the audience of a couple of core tenets of how customers use Sonian Archive. Tenet one is that in the archiving use case, the data does not change. An item the user “views” in the current session will be the same in future sessions. The second tenet is that we can predict customer navigation patterns and, by anticipating the next UI action, make pages load faster. The net benefit for customers is an overall faster experience in the archive system, even as the volumes of data increase.
Pete used a browser plug-in to show that pages load substantially faster with caching and with data pre-fetched in the background. In the first example, Pete loaded a page without caching or pre-fetching and it took 10 seconds. With caching enabled, the page loaded 1,000 percent faster. In the second example, Pete showed a user navigating the results page. After the first page rendered, the next page was fetched in the background so that when the user clicked “next” the page would display instantly. Pre-fetching data in the background works best when we can anticipate the user’s next action. In the case of search results, there is a high degree of certainty that the user will navigate the first few results pages.
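The two techniques combine naturally, as the following Python sketch suggests. It is an illustration only, not Pete's implementation: the `PrefetchingPager` class and the `fetch_page` callback are hypothetical, and the pre-fetch here runs synchronously for simplicity, whereas a real UI would issue it in the background after the current page renders. Because archived data does not change, cached pages never need to be invalidated.

```python
from typing import Callable, Dict, List


class PrefetchingPager:
    """Cache result pages and load the next page ahead of the user's click."""

    def __init__(self, fetch_page: Callable[[int], List[str]]):
        self._fetch_page = fetch_page          # slow call to the back end (assumed)
        self._cache: Dict[int, List[str]] = {}

    def _load(self, page: int) -> List[str]:
        # Only hit the back end for pages we have not seen before.
        if page not in self._cache:
            self._cache[page] = self._fetch_page(page)
        return self._cache[page]

    def get(self, page: int) -> List[str]:
        results = self._load(page)
        # Anticipate the "next" click. A real client would do this asynchronously.
        self._load(page + 1)
        return results


# Hypothetical usage with a fake back end that records which pages were requested.
requested = []


def fake_backend(page: int) -> List[str]:
    requested.append(page)
    return [f"result-{page}-{i}" for i in range(2)]


pager = PrefetchingPager(fake_backend)
pager.get(1)   # fetches page 1 and pre-fetches page 2
pager.get(2)   # page 2 is served from the cache; page 3 is pre-fetched
print(requested)
```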
Sonian customers benefit from a more pleasing user experience. Pages load faster, and the application will feel more responsive.
Team 12: Commit Hooked
Decklin, a DevOps engineer, demonstrated tight integration between Chef and Git. Sonian uses Chef to manage cloud infrastructure deployments. Git (and GitHub.com) is where Sonian manages source code. Prior to Decklin’s work, there was only loose integration between the deployment system (Chef) and the repository where source code is maintained (GitHub). This had been a source of frustration, and Decklin tackled the problem.
The DevOps group is working continuously to remove friction from automated deployments. Decklin’s Codefest solution helps this effort by centralizing the source for software components and making GitHub the single authority for code installed on every new node.
Team 13: Diylizo
Lee, a SAFE team member, used his prior experience and personal interest in Natural Language Processing (NLP) algorithms to categorize and aggregate SAFE server log file data.
*Background info - In March 2010, Lee released his Clojure-opennlp project to interface Clojure with the OpenNLP library functions. OpenNLP is a set of linguistic tools that allow a computer to “understand” chunks of text.
SAFE server logs contain valuable information for debugging and for gathering other useful data for analysis. These logs also contain Java Virtual Machine stack traces. In a cloud computing environment, SAFE error statements, as well as JVM stack traces, are spread across many virtual machines. Lee’s solution is to aggregate and categorize log files with NLP, allowing a whole new level of understanding. In this demonstration, the NLP algorithms were trained to identify error codes by looking for text patterns.
The breakthrough here is that the NLP library did not need to know the meaning or language (English, French, Russian, etc.) of the patterns; it only needed to know how to find them. Each error code and stack trace has a unique “signature” for identification, and diagnostic data could be extracted from the error statements and correlated with other system information. Correlation along a consistent time series is a “must have” to identify problem patterns across a distributed database.
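To make the notion of a log “signature” concrete, here is a deliberately simplified Python sketch. It swaps in plain regular expressions for the trained NLP models Lee used, so it illustrates only the aggregation step, not his technique. The placeholder names, the patterns, and the sample log lines are all invented for the example and are not real SAFE output.

```python
import re
from collections import Counter
from typing import Iterable

# Parts of a log line that vary between occurrences of the same error.
# Order matters: timestamps must be masked before bare numbers.
_VARIABLE_PARTS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*"), "<TIME>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def signature(line: str) -> str:
    """Mask the variable parts of a line so that similar lines share a signature."""
    for pattern, placeholder in _VARIABLE_PARTS:
        line = pattern.sub(placeholder, line)
    return line.strip()


def aggregate(lines: Iterable[str]) -> Counter:
    """Count how often each signature occurs across all log lines."""
    return Counter(signature(line) for line in lines)


# Made-up sample lines, as if gathered from several virtual machines.
sample_lines = [
    "2011-08-01 12:00:01 ERROR node-12 timeout after 300 ms",
    "2011-08-01 12:03:44 ERROR node-7 timeout after 450 ms",
    "2011-08-01 12:05:00 WARN disk at 0x1F3A nearly full",
]

for sig, count in aggregate(sample_lines).most_common():
    print(count, sig)
```

The two timeout lines collapse into one signature even though they come from different nodes at different times, which is what makes counting and correlating errors across machines possible.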
In the future, correlating log statements with customer actions will help trace errors from user action to back-end function.
Congratulations to all the teams who competed! The next Codefest is sure to be another interesting event.