Sonian's Email & Data Archiving Blog:

Now Releasing 100 Stories at a Time: Part 2

2008
  • EC2 in Europe (EU) Region
  • Amazon Elastic Block Store (EBS)
  • Premium Support
  • Lower data transfer costs
  • S3 tiered pricing
The most significant Amazon services launched in 2008 were Elastic Block Store (EBS) and S3 tiered pricing. With EBS we were able to make architectural changes that improved performance and lowered our costs. Prior to EBS, the only persistent storage was S3, since EC2 instances had only ephemeral storage attached. Ephemeral storage was not durable enough for customer data, and trying to use S3 as persistent block storage was cumbersome.

The second notable 2008 event was S3 tiered pricing. At the time Sonian was preparing to launch our first customer shipping instance of our flagship archiving service and as we completed the beta program we were looking forward to on-boarding the first paying customers. S3 tiered pricing allowed us to build a financial model for our cloud “cost of goods,” and the notion that as we stored more customer data our unit costs would drop was very compelling.
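The effect of tiered pricing on unit cost can be illustrated with a minimal sketch. The tier sizes and rates below are placeholders chosen for illustration, not Amazon's actual price list, and the function names are hypothetical.

```python
# Hypothetical tiers: (tier size in GB, price per GB-month in dollars).
# These figures are placeholders for illustration, not Amazon's actual rates.
TIERS = [
    (50_000, 0.15),        # first 50 TB
    (50_000, 0.14),        # next 50 TB
    (float("inf"), 0.13),  # everything beyond 100 TB
]


def monthly_storage_cost(stored_gb, tiers=TIERS):
    """Total monthly cost when each tier is billed at its own rate."""
    remaining = stored_gb
    total = 0.0
    for size_gb, price_per_gb in tiers:
        billed = min(remaining, size_gb)
        total += billed * price_per_gb
        remaining -= billed
        if remaining <= 0:
            break
    return total


def blended_unit_cost(stored_gb, tiers=TIERS):
    """Average price per GB-month across all tiers in use."""
    if stored_gb <= 0:
        raise ValueError("stored_gb must be positive")
    return monthly_storage_cost(stored_gb, tiers) / stored_gb


if __name__ == "__main__":
    for gb in (10_000, 50_000, 150_000, 500_000):
        print(f"{gb:>7} GB stored: ${blended_unit_cost(gb):.4f} per GB-month")
```

With any schedule shaped like this one, the blended price per gigabyte falls as the archive grows, which is the property a cost-of-goods model can build on.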

2009
  • Relational Database Service (RDS)
  • SAS70 Type 2 Certification
  • EC2 Reserved instance
  • Lower EC2 pricing
  • Lower S3 pricing
  • Amazon Import/Export Service
  • Amazon EC2 Spot Pricing
In 2009 Amazon offered many enhancements that contributed to overall lower costs and improved customer service.

Four price reductions were introduced in 2009. S3 standard pricing was lowered by a couple of cents per gigabyte per month (a substantial savings at scale), and EC2 per-unit-hour pricing was reduced for all compute types, another big price advantage at scale.

Amazon Import/Export Service (IES) removed a significant barrier to cloud adoption: how to quickly move a lot of data from on-premises to the cloud. With IES, customers can copy their enterprise data to portable media (typically USB or eSATA drives), choose their encryption, and move massive amounts of data to the cloud easily, cost effectively and securely. The phrase "there's no bandwidth limits to a FedEx truck" comes to mind when thinking about the possibilities of this service for rapid cloud adoption.

Sonian created a customer self-service workflow using the IES APIs. Now customers can initiate the physical media import process on their own.

The new Relational Database Service (RDS) is a great example of the pro and con differences between Infrastructure as a Service (IaaS) and Platform as a Service (PaaS). RDS is a cloud implementation of MySQL. Amazon packages compute and storage with MySQL software and sells the package as a unit for an hourly rate. There is a compute and storage profile that should meet all but the most demanding relational database needs. RDS is an example of PaaS. Pay a nominally higher price for a managed database service and Amazon takes care of backup, software updates and scaling. But you also have the option to do-it-yourself by using EC2 instances and EBS volumes to run your own MySQL instances. Your per-hour costs will be lower, but you also take on the ownership of backup, administration and scaling.

2010-2011
  • Singapore (APJ) Region
  • Lower outbound data transfer costs
  • S3 Reduced Redundancy Storage (RRS)
  • CloudWatch
  • Management Console
  • S3 Large Object Support
  • Lower EC2 prices
  • Gov Cloud Region
  • Amazon Simple Email Service (SES)
More than a dozen new cloud features were introduced throughout 2010. That's a release cadence of at least one per month. This is the beginning of the steepening "hockey stick" curve showing increasing innovation over a shorter time period.

Sonian, and in turn our customers, are benefiting from every one of these innovations. A few highlights include the Simple Email Service (SES), S3 Large Object Support and the Gov Cloud Region.

With SES, Sonian's outbound email notifications are delivered more reliably to customers. Prior to SES, the only way to ensure that an email sent from an EC2 instance would not get tagged as spam was to relay the outbound message through an external third-party service. This was cumbersome and expensive. With SES, Amazon has solved this problem in an elegant and low-cost way.

S3 Large Object Support increased the maximum S3 object size from 5 GB to 5 TB. Before this change, any ISV storing a lot of large data objects had to create a sharding method to work around the size limitation. In practice, the new 5 TB limit is "limitless" relative to our current average enterprise data sets.
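The kind of sharding workaround mentioned above can be sketched as follows. This is an illustrative outline, not Sonian's implementation: the function names and the key naming scheme are assumptions, and the actual upload calls are left out.

```python
def plan_shards(object_size, max_shard_size):
    """Return (offset, length) byte ranges that cover the object.

    Every range is at most max_shard_size bytes, so each shard fits
    under the store's per-object limit.
    """
    if object_size < 0 or max_shard_size <= 0:
        raise ValueError("object_size must be >= 0 and max_shard_size > 0")
    shards = []
    offset = 0
    while offset < object_size:
        length = min(max_shard_size, object_size - offset)
        shards.append((offset, length))
        offset += length
    return shards


def shard_keys(base_key, shard_count):
    """Hypothetical naming scheme: one storage key per shard, in order."""
    return [f"{base_key}.part{index:05d}" for index in range(shard_count)]


if __name__ == "__main__":
    ranges = plan_shards(object_size=12, max_shard_size=5)
    for key, (offset, length) in zip(shard_keys("archive/mailbox.bin", len(ranges)), ranges):
        print(key, offset, length)
```

Reading the object back means fetching the shards in key order and concatenating them, which is the bookkeeping a higher per-object limit makes unnecessary.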

The new Gov Cloud Region opens a whole new set of possibilities for the Federal Government to use the cloud to manage government infrastructure. The Gov Cloud, in conjunction with Federal Information Security Management Act (FISMA) accreditation, allows Amazon and its ISVs to sell cloud services to government agencies. Considering that a typical on-premises physical server costs the government $31,000 a year, one should expect the cloud to play an increasing role within government IT.
See the first part of this series here: http://blog.sonian.com/bid/79903/Now-Releasing-100-Stories-at-a-Time-Part-1

Now Releasing 100 Stories at a Time: Part 1

One of Sonian's core themes is "the cloud creates innovation acceleration." Proof of this is our upcoming software release, which includes over 100 new stories deploying to customer accounts. The release includes enhancements, fixes and most importantly new features driven by feedback from engaging with our customer audience.

One of the remarkable positive attributes of cloud computing is the ever accelerating "pace of innovation." Sonian has been working in this "pure cloud" operating mode for nearly five years, and now we're entering a period of "innovation cadence," which means more new features are being deployed to our customers faster and faster.

In the cloud there is a theory of innovation inheritance not previously seen in the enterprise IT space. As a pure-cloud ISV, Sonian "inherits" the innovation occurring at the cloud infrastructure layer. In some cases the inheritance is a cash windfall because prices are lowered (see this post about cloud costs: http://www.gregarnette.com/blog/2011/11/a-brief-history-cloud-cpu-costs-over-the-past-5-years/), and in other cases the inheritance is a new capability allowing us to add value and create a better customer experience.

Our primary cloud provider is Amazon Web Services (AWS). Amazon's innovation pace has accelerated over the previous 12 months. In 2010 alone, Amazon introduced more significant new features than in the prior four-year period, 2006 through 2009. The foundational IaaS services S3 and EC2 laid the groundwork for this acceleration effect. Likewise, Sonian's core investments in cloud search, cloud monitoring, security and automated deployments enable us to "step on the innovation gas." Our next product release, now making its way through user acceptance testing, has over one hundred changes (what we call "stories"), from minor fixes to advanced new features our customers told us they want. We couldn't have deployed a release of this significance without the prior investments in core cloud building blocks.

Below is a chronology of Amazon's most notable achievements over the past five years (from Sonian's perspective), and how Sonian leverages these cloud infrastructure building blocks.

2006
  • Amazon Simple Storage Service (S3)  
  • Amazon Elastic Compute Cloud (EC2)
S3 and EC2, working in conjunction with each other, are the "original" IaaS building blocks. These two web services are key to Sonian's cloud success. With S3, we get the "bedrock of storage" that allows us to manage customer data in a cost-effective and reliable manner. Amazon has perfected the state of the art in resilient storage.

EC2 allows us to provide "compute" services on customer data stored in the cloud. For our use case, we need both compute and storage on the same high-speed network. Perfecting compute in the cloud has been a rewarding challenge. On average we launch over 500 virtual compute nodes, representing 3,000-plus elastic compute units (ECU). The ability to scale up on demand and, equally important, to scale down on demand is critical to our infrastructure cost management.

2007
  • Amazon SimpleDB (SDB)
  • Amazon S3 in Europe (EU) Region
Sonian took a hard look at Amazon SimpleDB (SDB) as a cloud-based key/value store, but we couldn't figure out the long-term operational costs. SDB is priced along multiple dimensions (data stored, compute hours running queries, and API requests), and we did not have the ability to profile our application at scale to truly understand how our costs would grow. We also wanted to maintain a degree of "cloud independence" and felt that SDB would lock us too deeply into the AWS environment.

At every decision point in choosing our application building block we could weigh the do-it-yourself approach using IaaS against the comparable PaaS offering. IaaS requires more people time self-managing, while PaaS offers benefits but also the downside of vendor lock-in.

S3 services in Europe allowed us to expand service to EMEA with no up-front costs.
To continue reading, click here to see second part of this series...

How Sonian uses Amazon Web Services to Solve Information Archiving Problems

New Thinking Required to be Successful in the Cloud
In 2006, Amazon Web Services flashed brilliance with a “light bulb moment” that sparked the imaginations of leading edge technologists and entrepreneurs. Literally overnight, “The Cloud” had arrived. The cloud offered the ability to create, launch and operate SaaS applications in a way that was never possible before. Using simple and secure APIs, a software engineer could harness vast quantities of compute and storage services on-demand and with no up-front costs...all without touching a single physical atom. The cloud allowed small, efficient teams to build an application that could serve a large world-wide audience.

The core requirements for every SaaS application are scale-up, reliability and efficient infrastructure utilization. Scaling in the cloud means harnessing the on-demand capabilities. Reliability in the cloud means designing for failure by making software mirror what “physical” hardware used to supply in the co-located world. Operating cost-efficiently means “gaming the cloud” to find every place where you can process more work with fewer compute resources.

Sonian has created innovative technology, architectures and processes in three major categories to be able to harness the cloud for enterprise information archiving and analytics.

1. Effective Budgeting with a Cost Control System
Compared to a traditional dedicated data center environment, it’s way too easy to spend money in the cloud. “Purchasing” in the cloud is psychologically different: two mindsets (using purchase orders to buy everything up front versus consuming small bits at a time) have to be reconciled with the vastly different operating styles of dedicated and cloud environments. In the dedicated environment, big capital expenditures get multiple approvals and are on many people’s radar. But in the cloud, most teams start their cloud relationship with a credit card and pay monthly for the previous 30 days of small micro-charges for gigabytes of storage and hours of CPU time consumed.

As the project grows and more people cycle onto the team, and the march to launch pushes faster and faster, it’s natural to consume more and more infrastructure. This is when a cost control system needs to be implemented.

There are several ways to implement a budgeting and cost framework. A simple spreadsheet can be useful in the early days, along with a once-a-month review to ensure the billed amount is close to the projected budget. But more often than not, there is a startling “gotcha” moment. A monthly bill comes in with a dramatic unexpected uptick in CPU or storage expense. This is “the canary in the coal mine” early warning to implement a more automatic and vigilant cost control system.

In the cloud, tremendous amounts of computing horsepower are easily available, and at an hourly cost that seems minuscule (just dimes per hour), but when extrapolated out over 30 days, the expense can add up quickly. A hypothetical $0.68/hour compute instance, running 24 hours a day for 30 days, will cost roughly $489 at the end of the month. Multiply that by 10 compute instances, and the compute fee is close to five thousand dollars. If the 10 compute instances are doing valuable work, then that’s a great deal, because when the work is complete, the compute instances get turned off. And these compute instances probably didn’t need to run for the full month in the first place. Maybe just a couple of hours per month is all that is required. The less positive scenario is that the 10 compute instances were started, did some work, but someone forgot to make sure they were turned off. The surprise five thousand dollar uptick is the tipping point to the realization that all these positive cloud upsides need a governing structure to ensure costs do not quickly spiral out of control.
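The arithmetic in that example can be checked with a few lines. This is a small sketch assuming a rate of $0.68 per instance-hour, which is consistent with the $489 monthly figure; the function name is made up for illustration.

```python
HOURLY_RATE = 0.68       # dollars per instance-hour (hypothetical rate)
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30
FULL_MONTH_HOURS = HOURS_PER_DAY * DAYS_PER_MONTH  # 720 hours


def monthly_compute_cost(hourly_rate, instances=1, hours=FULL_MONTH_HOURS):
    """Cost of running `instances` identical instances for `hours` each."""
    return hourly_rate * instances * hours


if __name__ == "__main__":
    one = monthly_compute_cost(HOURLY_RATE)
    ten = monthly_compute_cost(HOURLY_RATE, instances=10)
    ten_for_two_hours = monthly_compute_cost(HOURLY_RATE, instances=10, hours=2)
    print(f"1 instance, full month:    ${one:,.2f}")
    print(f"10 instances, full month:  ${ten:,.2f}")
    print(f"10 instances, 2 hours each: ${ten_for_two_hours:,.2f}")
```

The gap between the last two figures is the cost of forgetting to turn instances off after a job that only needed a couple of hours.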

To solve the cost control problem, you can buy one of the third-party commercial cost management systems available or build your own, which is what we did at Sonian: we built a system tailored to our needs. Each cloud ISV has different use cases that will determine the best path. The downside to using a commercial package is the “one size fits all” approach, and the fact that you will need to give the third party access credentials to your cloud account control panels. (N.B. Access control granularity is improving to appropriately restrict third-party cost analyzers from doing harm to the infrastructure.)

The “do it yourself” approach uses cloud APIs to collect infrastructure utilization and billing data. With a reasonable effort, you can build a custom cost analyzer that is designed more tightly into your software stack. The advantage of a do-it-yourself approach is a system that solves your unique use case, and the downside is more code to maintain. But since cost management is critical to cloud computing success, the custom implementation ensures tight integration with the core technology.
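A custom cost analyzer of this kind might start out as small as the sketch below. It is not Sonian's system: the record shape, field names and tolerance factor are assumptions, and collecting the records from a provider's usage or billing API is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """One instance's usage for the billing period (hypothetical shape)."""
    instance_id: str
    hourly_rate: float     # dollars per hour
    hours_running: float   # hours billed so far this period
    expected_hours: float  # hours the owning job was expected to need


def spend(records):
    """Total spend so far across all records."""
    return sum(r.hourly_rate * r.hours_running for r in records)


def find_runaways(records, tolerance=1.5):
    """Instances that have run well beyond what their job was expected to need."""
    return [r.instance_id for r in records
            if r.hours_running > r.expected_hours * tolerance]


def over_budget(records, monthly_budget):
    """True when spend so far already exceeds the monthly budget."""
    return spend(records) > monthly_budget


if __name__ == "__main__":
    records = [
        UsageRecord("node-a", hourly_rate=0.50, hours_running=10, expected_hours=10),
        UsageRecord("node-b", hourly_rate=0.50, hours_running=100, expected_hours=4),
    ]
    print("spend so far:", spend(records))
    print("possible forgotten instances:", find_runaways(records))
    print("over a $50 budget:", over_budget(records, monthly_budget=50))
```

Running a check like this daily rather than monthly is what turns the surprise bill into an early alert.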

2. Scale Beautifully with a Systems Monitoring and Automation Framework
The next area to focus on is achieving reliable operations with monitoring and deployment automation. In some cases the cost management system described above should be an extension of the automation and monitoring framework, since all three will benefit from being tightly coupled together. In the old pre-cloud world there was no need for a real-time cost management system. The adjunct technology and systems surrounding the core application were mostly monitoring and alerting in nature. But in the cloud, systems automation is as critical as cost management and monitoring.

In order to take advantage of cloud on-demand scaling, automation needs to be responsible for provisioning new infrastructure and keeping track of what is running so you can efficiently turn it off later. Automation removes “human hands” and potential errors. And while cloud automation is required to start and stop infrastructure, a monitoring system is needed to keep vigilance on the running infrastructure and alert when a server or process fails. Cloud architectures tend to be dynamic, distributed, and highly complex, which means an effective monitoring system is a "must have" in order to know when a component is trending toward failure.
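The two responsibilities described above, remembering what was launched and noticing when something stops responding, can be sketched in one small class. The class and method names are hypothetical, this is not how Sensu or any particular framework works, and the two concerns are combined here only for brevity.

```python
import time


class NodeRegistry:
    """Tracks nodes that automation launched, plus their last heartbeat."""

    def __init__(self, heartbeat_timeout=60.0):
        self.heartbeat_timeout = heartbeat_timeout
        self._last_seen = {}  # node id -> timestamp of last heartbeat

    def launched(self, node_id, now=None):
        """Record a newly provisioned node so it can be turned off later."""
        self._last_seen[node_id] = time.time() if now is None else now

    def heartbeat(self, node_id, now=None):
        """Record that a known node reported in."""
        if node_id not in self._last_seen:
            raise KeyError(f"unknown node: {node_id}")
        self._last_seen[node_id] = time.time() if now is None else now

    def terminated(self, node_id):
        """Forget a node once it has been shut down."""
        self._last_seen.pop(node_id, None)

    def running(self):
        """Everything that is still believed to be running (and billing)."""
        return sorted(self._last_seen)

    def unresponsive(self, now=None):
        """Nodes whose last heartbeat is older than the timeout."""
        now = time.time() if now is None else now
        return sorted(node for node, seen in self._last_seen.items()
                      if now - seen > self.heartbeat_timeout)


if __name__ == "__main__":
    registry = NodeRegistry(heartbeat_timeout=60)
    registry.launched("indexer-1", now=0)
    registry.launched("indexer-2", now=0)
    registry.heartbeat("indexer-1", now=100)
    print("needs attention:", registry.unresponsive(now=120))
    print("still running:", registry.running())
```

The `now` parameters exist so the behavior can be exercised without waiting on a real clock.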

The current state of the art is to have a separate automation framework focused on scaling infrastructure up and down, while monitoring is its own set of technologies with the mandate to alert and report on the live environment.

Each of these areas needs a near-custom implementation to graft into your project. Cloud-based software stacks, by the very nature of how distributed software works in the cloud, need custom automation and monitoring. There are many great frameworks (Sonian’s open source Sensu project, for example) which support extensive customization capabilities. Cloud ISVs should use one of these as the foundation to create a monitoring and automation framework that meets their unique needs.

3. Game the Cloud with Elastic Applications
A simple fact proven through real-world experience: “Net new” software stacks, designed with cloud operating principles firmly rooted in the core design, are a “must have” to take advantage of cloud computing economics and reliability. In fact, the dual goals of economical operation and non-stop reliability can be achieved with the same architectural principles. Building for the cloud is all about “fluid designs” of discrete components that can be woven into cloud computing fabric. The days of rigid architectures are behind us.

The prevailing conventional enterprise software design and architecture patterns are not adequate for today’s new cloud infrastructures. Traditional architectures assume client/server, PC-based building blocks. The era of an assemblage of Windows or Linux compute nodes combined together in different component roles is falling behind us. And that is a good thing. When thinking about building software for the cloud, design patterns that were previously commonplace in mainframe thinking are more appropriate than trying to retrofit the old client/server model onto the cloud.

The cloud should really be thought of as an enormous mainframe computer.  Embracing this notion is a big leap in design architectures from where we are today, because our conventional modern thinking has been heavily influenced by the building blocks of yesteryear: Cheap hardware running Windows and Linux operating systems, custom software, client/server and Web 1.0 architectures, all wired together to form various clusters of functionality.

There are three primary reasons to use the cloud versus a traditional co-located data center:
  • Your use case has dynamic, shifting, variable work loads
  • Your use case has a very high up-time SLA requirement (at whatever cost)
  • You’re building a prototype and scale and cost are not important… for the moment
Not every SaaS application has these requirements. It’s easy to get caught up in the “cloud hype” du jour and the desire to harness the cloud because it’s the “in thing” right now. But look seriously at your needs, because maybe the cloud is not the best place to host your software. If you are running a web application that manages light-weight user generated content for a predictable number of simultaneous people, then the cloud probably offers no economic advantage. But if your app needs to be 99.9999% available, and your audience will pay for this up-time SLA, then the cloud is the right infrastructure. No other hosting platform has the ability to be economical, scalable and reliable all at the same time. To be all three requires elastic applications.

In 2006 Amazon introduced its Elastic Compute Cloud (EC2) service, which complemented the very reliable Simple Storage Service (S3). EC2 is flexible compute on demand, with no upfront costs. EC2 allows you to build applications that can be both fault tolerant and economical at the same time. Without elastic software, however, EC2 is a very expensive hosting platform.

The majority of enterprise software is not elastic. Traditional enterprise software created during the client/server era of the last decade, or the more recent Web 1.0 era, didn’t know about cloud computing. You would need to look back in time toward mainframe design principles to get a feel for a software architecture with compute efficiencies baked in. But with the cloud, and the ability to take a step back and design differently, what was old (mainframe) seems fresh and new.

Take for example Microsoft Exchange Server. Used by hundreds of millions of people on a daily basis, it’s one of the most common examples of traditional non-cloud architectures. The Exchange Server software requires a dedicated amount of hardware, regardless of the “in the moment” usage patterns. Now envision a scenario where Exchange Server was re-built with elastic capabilities. Back-end and front-end Exchange components could dynamically scale up and down based on time of day or other usage patterns. Instead of dedicated compute resources for front-end services, data stores, gateways, etc. a re-architected Exchange Server could dynamically allocate compute to the component most under load. For evenings and weekends capacity could be pared back (saving money). During peak activity periods, such as at the start of the work day when everyone is checking their email, compute capacity could be added for a few hours, then scaled back and the same cycle could repeat after lunch. When the software has elasticity at the core it’s possible to operate at peak efficiency around the clock.
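The scaling behavior described for a hypothetical elastic mail server can be sketched as a simple sizing rule. Everything here is an assumption made for illustration: the load figures, the per-node capacity, and the bounds are invented, and a real system would also smooth its decisions over time.

```python
import math


def desired_nodes(current_load, capacity_per_node, min_nodes=1, max_nodes=20):
    """Smallest node count that covers the load, clamped to the allowed range."""
    if capacity_per_node <= 0:
        raise ValueError("capacity_per_node must be positive")
    needed = math.ceil(current_load / capacity_per_node)
    return max(min_nodes, min(max_nodes, needed))


# Hypothetical load (requests per second) on a front-end component,
# sampled at a few hours of one weekday.
HOURLY_LOAD = {3: 40, 9: 1800, 12: 600, 13: 1500, 22: 90}

if __name__ == "__main__":
    for hour, load in sorted(HOURLY_LOAD.items()):
        nodes = desired_nodes(load, capacity_per_node=200)
        print(f"{hour:02d}:00  load={load:>5}  nodes={nodes}")
```

Applying the same rule to each component separately is what lets capacity follow the morning and after-lunch peaks and shrink overnight.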

Elastic application characteristics include:
  • Enterprise service bus architecture patterns
  • Loosely coupled modules that scale independently of one another
  • A design that evokes “fluid” concepts as opposed to rigid constructs

Design to Survive an Earthquake and Save the Planet at the Same Time
The big data cloud needs applications that are flexible and pliable. Just as you would design a building to withstand an earthquake, structures need to sway and balance gracefully when the ground trembles. In a metaphorical sense the cloud has similarities to geographies with consistent seismic activity. Rigid structures won’t endure an earthquake’s potential to cause catastrophic destruction. The same can be said for software running in the cloud. A “rigid” application like Microsoft Exchange Server will suffer greater damage in the cloud compared to software written to be fluid and flexible.

In the cloud, as in an earthquake zone, the software gets no advance warning before calamity strikes. In the cloud, the software can’t “peer” beneath the virtualization layer to see pending doom. In the cloud you need to expect failure and be resilient. Being resilient doesn’t mean being rigid; it’s actually just the opposite. Enabling software to endure an infrastructure failure in the cloud requires a fluid design that can adapt in real time to a rapidly changing environment.

The dual goals of cost efficiency and resiliency are accomplished with the same design. To be resilient is to be a collection of loosely coupled services that are woven together and flex to accommodate a dynamic cloud environment. To be cost effective is to be a collection of loosely coupled services that can optimize every transaction and approach 100% compute utilization. The closer to 100% utilization, the less waste.
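The link between utilization and waste can be made concrete with a short calculation. The node-hour figures and hourly rate below are invented for illustration, and the function names are hypothetical.

```python
def utilization(used_node_hours, provisioned_node_hours):
    """Fraction of paid-for compute that did useful work."""
    if provisioned_node_hours <= 0:
        raise ValueError("provisioned_node_hours must be positive")
    return used_node_hours / provisioned_node_hours


def wasted_spend(used_node_hours, provisioned_node_hours, hourly_rate):
    """Money spent on node-hours that did no useful work."""
    return (provisioned_node_hours - used_node_hours) * hourly_rate


if __name__ == "__main__":
    used = 96                 # node-hours of real work in one day (made up)
    fixed_fleet = 10 * 24     # ten nodes left running all day
    elastic_fleet = 110       # node-hours when capacity follows the load (made up)
    for label, provisioned in (("fixed", fixed_fleet), ("elastic", elastic_fleet)):
        share = utilization(used, provisioned)
        waste = wasted_spend(used, provisioned, hourly_rate=0.50)
        print(f"{label:>7}: utilization {share:.0%}, wasted ${waste:.2f} per day")
```

The same amount of useful work costs less in the elastic case only because fewer idle node-hours are being paid for.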

A Cloud Customer Success Story - Dollar Tree Stores
Sonian, operating in the Amazon Web Services cloud, has been providing information archiving services for US-based Dollar Tree Stores (DTS), covering over 10,000 employees and thousands of store locations. DTS has an IBM Domino/Notes environment and needed a new way to manage long-term archival, search, and e-discovery requirements. DTS also wanted to move their on-premises legacy archive data to the cloud in order to take advantage of better economics and ROI for their IT budget.

Battle of the Archiving Models - Part 2: On-premise vs. cloud

Why cloud?

In today’s technology-consumed world, it seems as though more and more businesses are migrating their archived data to the cloud and abandoning their on-premise archive systems. But why?

Benefits of the cloud

Industry-leading analyst Gartner has estimated that Software as a Service (SaaS) revenue will pass the $12 billion mark in 2011, a 20.7% increase over 2010’s $10 billion record. The main reasons for this increase in SaaS and cloud services are cost, powerful search capabilities, accessibility, eDiscovery needs and security.

  • Cost-effective price
    • Deploying in the cloud requires no on-premise hardware or software.
    • For example, Sonian has a single low monthly per-user price. No fine print or extra/hidden fees.
    • On-premise solutions have costly IT and maintenance fees.
  • Powerful search
    • With data stored in the cloud, you can search through your archive in less than half the time it would take in an on-premise system.
    • For instance, Sonian has AES 256-bit encryption in transit and at rest, allowing users to access their data within seconds! Try out the Sonian sandbox! Demo our product and see how the cloud really works!
    • You could spend days, weeks, or even months searching for a specific piece of data in an on-premise archive.
  • Instant accessibility
    • Regardless of the time or day, you can access your archived information when it is stored in the cloud. With Sonian, once you’ve stored your information, it will be archived, accessible and searchable forever.
    • Because an on-premise system lives in a single location, the time of day or the date could hinder your access to the archive.
  • Streamlined compliance and eDiscovery needs
    • There are rules and regulations (FINRA, HIPAA, SEC, etc.) that most industries must abide by, which require retaining emails and other pertinent business information for “x” number of years. Cloud services make it easier to meet these eDiscovery and compliance requirements through automated services.
    • Sonian’s cloud-powered service can define specific archive retention policies that align with those requirements.
    • Most of these regulations have strict deadlines, but on-premise systems often delay this process with the amount of time it takes to search the physical archive and the lack of eDiscovery and compliance services.
The cloud is a powerful, accessible and cost-efficient product. Compared to on-premise models, the cloud is infinitely more scalable, provides more elasticity and can meet eDiscovery needs effortlessly.

Try it out today by using Sonian’s Archive – get a free trial for 2 full weeks at no cost!

Battle of the Archiving Models - Part 1: On-premise vs. Hosted vs. Cloud Computing

This is the first post in a blog series, "Battle of the Archiving Models." To start it off, we will cover the basics: on-premise vs. hosted vs. cloud computing.  Below is a general description of all three, and the benefits that each model offers.

It’s been said that we generate more data on a daily basis than all of the data combined from the beginning of time until 2003. Most of that is attributed to you and me and what we do every day. From our daily emails, Facebook posts, tweets on Twitter, to our connections on LinkedIn, the files we download and the pictures we upload, this all amounts to an enormous data load.

We in IT have struggled for years to address the ever-growing impact of user-generated data.  Email is a great example: within the past decade, the average corporate mailbox has grown to over 10 gigabytes of allocated storage, and the trend shows no sign of relenting. Organizations now want to provide users with “unlimited” mailboxes that could contain hundreds of gigabytes of data.

IT is also tasked with managing new data types. Priorities related to improved collaboration, data loss prevention, and intellectual property protection are driving us to move files from client storage (laptop and hard drives) to IT-managed storage. So now, IT is being called upon to address the unmitigated rise of social media, blogging, and mobile communications as business tools. 

Regulations and Their Consequences

The explosion in IT-managed data has occurred during a period when regulators are placing new requirements on the handling, retention, and disposition of content.  For example:

  • The United States Federal Rules of Civil Procedure (FRCP) require that organizations of all sizes maintain data archives that are readily accessible in the event of litigation.

  • The Sarbanes-Oxley Act (SOX) requires that companies preserve a variety of correspondences (including email messages) for a period of seven years.

  • The Financial Industry Regulatory Authority (FINRA) and Securities and Exchange Commission (SEC) place numerous restrictions on financial services firms related to the management and preservation of email, instant messaging, and social media data.

  • The Health Insurance Portability and Accountability Act (HIPAA) requires that companies operating in the healthcare industry retain certain communications and documentation (which can include email messages and attachments) for a minimum of six years.
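Retention periods like the ones listed above are the kind of rule an archive has to encode. The sketch below is illustrative only: the mapping uses the seven-year and six-year figures from the examples, the function names are hypothetical, and real retention schedules depend on record type and legal advice.

```python
from datetime import date

# Retention periods in years, taken from the examples above.
# Illustrative only; actual obligations vary by record type.
RETENTION_YEARS = {"SOX": 7, "HIPAA": 6}


def retain_until(received, regulations):
    """Earliest date a message may be disposed of under all listed regulations."""
    if not regulations:
        raise ValueError("at least one regulation is required")
    years = max(RETENTION_YEARS[name] for name in regulations)
    try:
        return received.replace(year=received.year + years)
    except ValueError:
        # February 29 has no counterpart in a non-leap target year.
        return received.replace(year=received.year + years, day=28)


def may_dispose(received, regulations, today):
    """True once the longest applicable retention period has elapsed."""
    return today >= retain_until(received, regulations)


if __name__ == "__main__":
    received = date(2011, 3, 1)
    print(retain_until(received, ["SOX", "HIPAA"]))
    print(may_dispose(received, ["SOX", "HIPAA"], today=date(2017, 6, 1)))
```

When several regulations apply to the same message, the longest period wins, which is why the function takes the maximum.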

In addition to the examples cited above, industries ranging from energy to state and federal government to education and the non-profit sector all fall under specific regulations that govern how user-generated content should be managed, and the penalties for non-compliance have never been higher:

  • In 2004, the SEC fined Bank of America $10 million for failing to retain and produce emails in accordance with SEC regulations.

  • In 2008, non-compliance with FRCP mandates compelled a judge in the United States to award $29 million to a plaintiff in a suit against UBS Warburg.

  • In 2010, FINRA fined Piper Jaffray and MetLife Securities a combined $2M+ for email-related failures.  While the amounts may be modest in the scheme of the financial services industry, we should consider the reputational impact to the companies; particularly after FINRA published press releases referring to “supervisory and reporting violations” as well as “investigations of broker misconduct.”

More Storage, More Problems

While the impact of fines and the publicity they generate should not be understated, they are dwarfed by the costs borne by all customers as the storage of user-generated data places an ever-increasing drain on our budgets, focus, and productivity.

Organizations have tried to address the data explosion by buying more storage. But after years of this, it has been proven that on-premise storage systems are considerably more expensive than a line item on a budget would ever reflect.

Beyond the direct and indirect costs of storage systems, maintaining large data stores also impedes the performance of our IT infrastructure and applications.  Mail systems buckle under the weight of giant data stores. Network latency increases as backup windows span an ever-growing portion of the business day. Overall, businesses risk losing profits when critical applications are slow and unstable.

In order to address this “perfect storm” of unprecedented growth in unstructured data, most organizations have found that information archiving represents the only viable solution.

Archiving is the Answer

At a minimum, information archiving satisfies regulatory requirements and reduces the burden placed on IT applications, such as email.  In most cases, archiving provides significant storage and infrastructure cost savings, and in some cases, it enables IT to redirect focus and resources away from infrastructure and toward value-added activities.

Archiving is therefore no longer a value-added service for IT; it is an essential component of the IT portfolio, and it is required to tame skyrocketing storage costs while maintaining compliance.  Now comes the next step: which archiving model do you choose? The options are traditional on-premise archiving, hosted archiving, and cloud-powered archiving.

On-Premise Archiving

With the traditional on-premise model, archiving systems are located entirely within a business’s data center, and the business maintains responsibility for the installation, configuration, and operation of the archiving system and underlying infrastructure. With on-premise systems, customers experience fairly rapid migration of legacy data, attributable in large part to the physical proximity of the archive system to the legacy data store. 

The on-premise archiving model was the most popular model for early adopters of archiving solutions (particularly large financial services customers in the early 2000s).  Due to the cost and complexity of the systems, which require investments in hardware, software, and storage as well as ongoing operations and support, adoption of this model has been waning as organizations are opting for a third-party archiving service.

Hosted Archiving

In the hosted model, archiving systems are housed within an archiving vendor’s data center.  Unlike the on-premise model, customers are not required to install, configure, or maintain the archiving system or its underlying infrastructure—the vendors manage these activities on behalf of the customer.

With this service, the customer only needs to be concerned with capacity management to the extent that it impacts pricing. Otherwise, hosted vendors shoulder the burden of capacity management. Customers can also focus on activities related to the archiving process and functionality, such as defining retention policies, searching for specific content, and exporting data for discovery. 

The benefit of this solution is that it reduces IT complexity and offers cost savings relative to on-premise systems. It is also a fairly low-risk evolution of the legacy model in that (unlike cloud-powered archiving, discussed below) the archiving system leverages traditional infrastructure technologies. However, this solution comes with many of the same issues on-premise systems have: capacity management, service availability, and large capital expenses.

Cloud-Powered Archiving

Rather than operating their own infrastructure, cloud-powered archiving vendors build their applications to operate on top of cloud infrastructure from third parties, such as Amazon or Rackspace. In this model, neither the customer nor the archiving vendor operates physical infrastructure directly. The archiving vendor builds and maintains an archiving system (software layer) that is operated on top of cloud infrastructure. 

Of the three archiving models, the cloud-powered approach best capitalizes on the value propositions of specialization, scale, and elasticity. The infrastructure vendor, archiving service provider, and businesses are able to focus on core competencies, i.e. operating data centers, developing archiving software, and facilitating business processes, respectively. Likewise, the cloud vendor procures and operates infrastructure at tremendous scale, enabling them to offer the lowest prices in the market. Finally, cloud-optimized technologies such as ElasticSearch and Chef enable archiving vendors to maximize availability and performance based on their customers’ real-time processing, bandwidth, and storage requirements.

Some Things are Certain in IT...

Moving forward, the volume of user-generated data will only continue to increase.  The number of restrictions placed on the management of that data will also go up, along with the number of  requests (and demands) for data to support litigation, compliance, and business intelligence. IT leaders need to be prepared for the convergence of these trends that, if left unaddressed, will drain the productivity of their teams, increase storage expenses, and put the reputations and financial viability of their organizations at risk.

For most organizations, the only way to effectively address the data explosion is with a robust and effective archiving system. Fortunately, customers have their choice of numerous vendors and at least three archiving models in the market today that each offer unique benefits. IT leaders should choose the archiving option that best suits their needs and budgets, but this should be done relatively soon, before an audit, discovery request, or regulatory inquiry arises that makes them wish they had.

How To Manage Your Money in the Cloud

  
  
  
  
  
  
  

Some people say the cloud can save you money...but how? Take a look at the most recent blog post from Joe Kinsella, Sonian's VP of Engineering...

When thinking about cloud costs, I am often reminded of Suze Orman telling a shocked caller that the daily $4 Starbucks latte he drinks is costing him over a thousand dollars a year. The cost of cloud infrastructure, like a daily latte, has a way of sneaking up on you. Unfortunately, the lack of knowledge around managing cloud costs has resulted in a new oddity in the cloudosphere: cloud dropouts that have forsaken the cloud for the financial predictability of physical infrastructure. To read more of Joe's blog click here...

What Happened to the Cloud Computing Class of 2008

  
  
  
  
  
  
  

This particular blog is a little different than what we would normally post. You see, in 2008, Sonian was a top 7 finalist in Amazon’s AWS Start-Up Challenge. This competition was designed for start-ups to get themselves heard and noticed, and win up to $100,000 in cash and AWS credits.

As it has been a few years since, Sonian would like to see what each finalist has been up to. Where is the company now, what have they been doing, and what were some of the obstacles or triumphs they overcame?

We’ve contacted each finalist and will begin to post their responses as they filter in. To begin, we have provided you with the letter we distributed to each finalist. Also, Sonian has started the blog series by sharing what we’ve been doing for the past three years…

Dear <2008 AWS Start-Up Challenge Company>,

In 2008 your company and my company Sonian were in a distinguished group of 7 pioneering cloud-computing vendors selected from thousands of entrants to compete for the AWS Start-Up Challenge grand prize.

Sonian is now curious to learn how your business is doing 3 years later. We would like to spend 15 minutes on the phone with you to discuss your AWS experience, what you learned, what you would have done the same or differently, and what you’re doing now!

Here, to ease the pressure, we’ll go first.

Sonian - the name of the company is in homage to the nation’s preeminent archival repository: The Smithsonian Institution.

In 2007, Greg Arnette founded Sonian and began a “journey to the cloud.” The premise was simple: use cloud computing infrastructure as an enterprise-class electronic information archive repository. The technology challenge was immense but surmountable. Using “the cloud” for business-class needs would require new ways of thinking about software architectures and design patterns, and new systems administration processes would need to be conceived to operate a software-as-a-service system at cloud-scale.

Four years ago, businesses, government agencies, schools and hospitals were all suffering from overburdened storage systems and rising costs. If cloud computing could be harnessed effectively, a business could save 50 to 80 percent on their Tier 2 and Tier 3 storage costs.   

The reason the cloud was so appealing in 2007 (and continues to be appealing today) is two-fold: economics and reliability. Efficiently utilizing the cloud to solve enterprise IT pain points was a worthy mission to start a company. It took another two years before the venture community would see the cloud as “a positive,” and the financial collapse in September 2008 accelerated cloud adoption, because companies across the world implemented budget austerity programs.

Amazon Web Services was the first credible “cloud infrastructure” to attract wide adoption. Sun Microsystems and others had tried “cloud-like” systems before, but the economics didn’t work for a start-up. For example, Sun was charging one dollar per CPU hour, while Amazon was at 10 cents per hour. The same was true of storage: Sun was one dollar per gigabyte per month, while Amazon was at 15 cents.

In order to attract attention to their nascent cloud services, Amazon launched “Start-Up Challenges.” The winner would receive a $50,000 cash prize as well as Amazon usage credits. Amazon was prescient to focus on start-up companies at first. Enterprises were not ready to make big cloud migrations, but start-ups had no existing infrastructure investments and were already in the mindset to take a risk. Sonian entered the Challenge in 2008, and was a Top 7 Finalist. Here is how Sonian has prospered since then.

Since the 2008 Start-Up Challenge, Sonian has grown nicely and matured from an early-stage company to growth stage. In August 2009 Sonian raised seven million dollars in a Series “A” venture financing, and in December 2010 added an additional nine million to the coffers with a “B” round.

Sonian is managing an impressive amount of data on the Amazon cloud. As of this writing (August 2011) Sonian is storing nearly 4 billion objects on S3 and EBS.  Sonian utilizes on average 500 compute nodes per month and is pioneering many “big data and search” technologies to store information at the lowest costs.

Sonian’s engineering group is a new kind of distributed model leveraging the best of both on-premises and remote personnel. The key to our success with this hybrid model is great team leadership, an agile development process, web-based tools and frictionless remote meeting and audio capabilities.

As of today, the company has 60 people, and is headquartered in the Boston area.

Alright, now it’s your turn. We promise not to take much of your time.  You can contact us at marketing@sonian.net or call 617-958-4000. All of those who oblige will have their results published in a follow-up blog. Hope you’re all doing well and we’re looking forward to hearing from you!

Your Fellow Classmate,

Sonian

Natural Disasters - The Effect on Your Data

  
  
  
  
  
  
  

Did you feel that? Well, your data sure did.

Yesterday, at about 3:15 PM, Massachusetts, along with some other east coast states, felt the aftershock of a 5.9 magnitude earthquake that hit Virginia. The Sonian office here in Newton even felt the floor and a few chairs and desks shake. Granted, some Sonian employees were a little worried, especially since earthquakes aren’t a frequent occurrence in New England. However, one thing we didn’t have to worry about was our data and our customers’ data.

A natural disaster is never planned and always unexpected. At a moment’s notice your office, your airplane flight to a business meeting, or even your data, can be affected by a hurricane, a tornado, or as we saw (or should I say felt) yesterday, an earthquake. So how do you prepare for it? The answer is the cloud.

Storing and archiving your data on-premise can be ideal for large corporations, ones that have the budget and staffing to manage the archive. Most enterprises have the resources to fund an on-premise archiving solution, as it requires significant upfront investment for estimated storage and a budget for an IT staff to manage the archive. However, what happens to that data if a hurricane storms through the town and damages your servers with terabytes of information stored on them? Or what if an earthquake rumbles through the building in which your on-premise archive is located?

This would give organizations reason to worry about data loss, but that worry could be eliminated if they stored their data in the cloud. For example, in the Sonian Archive, your data is always available, searchable, and accessible, regardless of unexpected weather conditions.

Once you archive your data, Sonian uses GlobalRAID Infrastructure that immediately replicates your information to 8 data centers, geographically dispersed across the globe. With an on-premise solution, there is only one location of storage. Archiving data in the cloud eliminates the risk of losing your data from extreme weather conditions, and in Sonian’s case, if one data center was hit, there’s no need to worry because your company information is stored in seven other places.

Dispersing and duplicating data across geographically dispersed centers requires no cost or maintenance on your part. Your data is also safe thanks to Sonian’s three-layer security system, which uses industry-standard encryption (Defense Department AES and SSL) and ensures that encryption keys cannot be shared between customers.

With an on-premise solution, companies have to estimate the number of servers and the retention time needed for “x” amount of data. A cloud-powered solution, like Sonian’s, provides 11 9’s data resiliency and unlimited storage space. Archiving data in the cloud requires no maintenance, hardware or software, while this could be a full-time job for an IT department relying on an on-premise solution.

Essentially, would you rather worry about an entire archive (years, possibly decades, of corporate information) being lost within a matter of minutes, or would you rather know that it is secure and safe in the cloud?

Sonian Summer Codefest 2011: Abundant Innovation - Part 2

  
  
  
  
  
  
  

Continued from the first blog installment of Codefest 2011

Team 7: Git ’r Done (Third Place Winner)

Windows node deployments were one of the last manual tasks. With automation, automatic scaling is now possible, as well as reducing the risk of human errors from manual tasks.

The benefit to Sonian is completing a goal for 100% automated cloud deployments.

Team 8: The Sharper Image (Second Place Winner)

This team was a combination of SAFE and website folks (Paul, Phil, Kevin, and Bira), showing us how to employ an image hashing algorithm to find images in the archive.

Searching the archive now is limited to text-based queries. But what if you want to find an image that has no text and all you have is a similar image to use as your reference?

The team started by using a technique called perceptual hashing to calculate a hash value for each image in the archive, and storing that hash value in the full-text index alongside the standard index text for each object in the archive. Perceptual hashing is suited for images because it is impervious to image scaling and works with a variety of formats.

The theory is that with all image hash values stored in the index, a customer could use the search UI to find an image if they have a sample of what they are looking for. The best way to describe the image is to upload a sample of a similar image and ask the system to return results for all images in the archive that “look like this sample.” And that is what the team demonstrated. Paul initially indexed three emails with photo attachments of his children and one email with an image that was NOT of his children. The team first demonstrated a search without image classification, and the results returned all four images: the three of his children and the one not of his children. The team then ran the same search with a sample image of Paul’s child uploaded, and this time the search returned just the three emails with pictures of the children.
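To make the idea concrete, here is a minimal sketch of one simple member of the perceptual hashing family, often called an "average hash." The post does not say which algorithm the team actually used, so treat this as an illustration only. The function names are hypothetical, and the sketch assumes an image is already decoded into a grid of grayscale values (0 to 255) that is at least 8 by 8 pixels.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels (0-255), one inner list per row


def shrink(image: Image, size: int = 8) -> Image:
    """Downscale to size x size by averaging the pixels that fall in each cell.

    Assumes the image is at least size x size, so no cell is empty.
    """
    height, width = len(image), len(image[0])
    small = []
    for row in range(size):
        top, bottom = row * height // size, (row + 1) * height // size
        out_row = []
        for col in range(size):
            left, right = col * width // size, (col + 1) * width // size
            cell = [image[y][x] for y in range(top, bottom) for x in range(left, right)]
            out_row.append(sum(cell) // len(cell))
        small.append(out_row)
    return small


def average_hash(image: Image, size: int = 8) -> int:
    """Return a size*size-bit hash: one bit per cell, set if the cell is brighter than the mean."""
    pixels = [p for row in shrink(image, size) for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ; a small distance suggests the images look alike."""
    return bin(a ^ b).count("1")
```

Because the hash is computed from a shrunken copy of the image, a resized version of the same picture tends to produce the same (or a very close) hash, which is what makes a "looks like this sample" search possible. A search would compare the sample's hash against the stored hashes and keep the ones within some small Hamming distance; the right threshold is a tuning decision.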

This approach and framework for indexing non “text” components and then searching based on non-text samples can be extended to other data types beyond images. Over time the Sonian service will focus on not just returning a million hits fast, but returning that “one in a million” hit with speed and accuracy.  

Team 9: Beautification

Matt W., Sonian’s website team leader, along with Ryan, illuminated the audience with a demonstration of the Sonian Viewer using a new component ExtJS from Sencha. As background, the latest product release with the enhanced My Archive application employs a new visual toolkit called ExtJS. This is a set of tools to create “rich web applications” that perform and feel like desktop apps, but running in the browser.

Team Beautification’s goal was to show the Sonian Viewer with a more intuitive interface and improved user experience. The Sonian Viewer is the “Swiss army knife” for managing our cloud infrastructure, but as an in-house project the design was basic Rails scaffolding. Adding ExtJS functionality will load pages faster and make it easier for non-technical users to use the Sonian Viewer.

ExtJS will become a core Sonian user interface component. This toolkit also supports mobile device development and implements HTML 5 standards to ensure cross-browser functionality.

Team 10: Performance Art

Continuing along the themes of “efficiency, cost analysis, and visualizations,” Jim, Joe G. and Steve from the SAFE team demonstrated some data performance art to great applause. The famous quote from Lord Kelvin is how the team summarized their ideas: “When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge of it is of a meager and unsatisfactory kind...” The team’s goal was to instrument performance benchmarking in the SAFE server code and analyze the results.

Based on a preliminary review from one test run, the team was pretty confident they could improve SAFE server efficiency by 20% with a few configuration changes. With more data, more efficiency gains will be realized.

Sonian benefits whenever we can find performance optimizations in our own code. In this example we will be able to process more data for less money without sacrificing customer experience. A triple win scenario.

Team 11: Team Speed

Pete started his presentation by reminding the audience of a couple of core tenets on how customers use Sonian Archive. Tenet one is that with the archiving use case, the data does not change. An item the user “views” in the current session will be the same in future sessions. The second tenet is that we can predict customer navigation patterns, and with that anticipated UI action, make the page loading process faster. The net benefit for customers is an overall faster experience in the archive system, even as the volumes of data increase.

Pete used a browser plug-in to show that pages load substantially faster by using caching techniques and pre-fetching data in the background. In the first example, Pete loaded a page without caching/pre-fetch and it took 10 seconds. With caching enabled, the page loaded 1,000 percent faster. In the second example, Pete showed a user navigating the results page. After the first page rendered, the next page was fetched in the background so that when the user clicked “next” the page would display instantly. Pre-fetching data in the background works best when we can anticipate the user’s next action. In the case of search results, there is a high degree of certainty the user will navigate the first few results pages.
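The two ideas, caching and pre-fetching, can be sketched in a few lines. This is not Pete's implementation (the post does not show it); the class and the `fetch_page` callback are hypothetical stand-ins for whatever slow back-end call returns a page of results. The sketch fetches the next page synchronously to keep it short, where a real UI would do that work in the background.

```python
from typing import Callable, Dict, List


class PrefetchingResults:
    """Serve result pages from a cache and fetch the next page ahead of time."""

    def __init__(self, fetch_page: Callable[[int], List[str]]) -> None:
        self._fetch_page = fetch_page  # hypothetical slow back-end call
        self._cache: Dict[int, List[str]] = {}

    def _load(self, page: int) -> List[str]:
        # Archived data does not change, so a cached page never goes stale.
        if page not in self._cache:
            self._cache[page] = self._fetch_page(page)
        return self._cache[page]

    def get(self, page: int) -> List[str]:
        results = self._load(page)
        # Anticipate the likely "next" click by loading the following page now.
        self._load(page + 1)
        return results
```

The cache is safe here precisely because of the first tenet: archived items do not change, so there is no invalidation problem to solve. The pre-fetch pays off because of the second tenet: on a results page, "next" is by far the most likely click.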

Sonian customers benefit from a more pleasing user experience. Pages load faster, and the application will feel more responsive.

Team 12: Commit Hooked

Decklin, a DevOps engineer, demonstrated tight integration between Chef and Git. Sonian uses Chef to manage cloud infrastructure deployments. Git (and Github.com) is where Sonian manages source code. Prior to Decklin’s work, there was loose integration between the deployment system (Chef) and the repository where source code is maintained (Github.) This has been a source of frustration and Decklin tackled the problem.

The DevOps group is working continuously to remove friction from automated deployments. Decklin’s Codefest solution helps this effort by centralizing the source for software components, and makes Github the single authority for code installed on every new node.

Team 13: Diylizo

Lee, a SAFE team member, used his prior experience and personal interest in Natural Language Processing (NLP) algorithms to categorize and aggregate SAFE server log file data.

*Background info - In March 2010, Lee released his Clojure-opennlp project to interface Clojure with the OpenNLP library functions. OpenNLP is a set of linguistic tools that allow a computer to “understand” chunks of text.

SAFE server logs contain valuable information for debugging and gathering other useful data for analysis. These logs also contain Java Virtual Machine stack traces. In a cloud computing environment, SAFE error statements, as well as JVM stack traces, are spread across many virtual machines. Lee’s solution is to aggregate and categorize log files with NLP, allowing a whole new level of understanding. In this demonstration, the NLP algorithms were trained to identify error codes by looking for text patterns.

The breakthrough here is that the NLP library was agnostic to the meaning and language (English or French or Russian, etc.) of the patterns; it only needed to know how to find them. Each error code and stack trace has a unique “signature” for identification, and diagnostic data could be extracted from the error statements and correlated with other system information. Correlation along a consistent time series is a “must have” to identify problem patterns across a distributed database.
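The "signature" idea can be illustrated without any NLP at all. The sketch below swaps in plain regular expressions for Lee's trained OpenNLP models, so it is a much cruder technique than the one he demonstrated: it replaces the parts of a log line that vary (timestamps, numbers, hex addresses) with placeholders and counts what is left. The function names and the sample log format are assumptions for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Parts of a log line that change from one occurrence to the next.
# Order matters: timestamps are replaced before bare numbers.
_VARIABLE_PARTS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\b"), "<TIME>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def signature(line: str) -> str:
    """Reduce a log line to its stable shape by masking the variable parts."""
    for pattern, placeholder in _VARIABLE_PARTS:
        line = pattern.sub(placeholder, line)
    return line.strip()


def count_signatures(lines: Iterable[str]) -> Counter:
    """Group log lines gathered from many machines by signature."""
    return Counter(signature(line) for line in lines)
```

Two timeout errors from different nodes at different times collapse to the same signature, so the most common problems surface at the top of the count no matter which virtual machine logged them.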

In the future correlating log statements with customer actions will help trace errors from user action to back-end function.

 

Congratulations to all the teams who competed! The next Codefest is sure to be another interesting event.


Sonian Summer Codefest 2011: Abundant Innovation - Part 1

  
  
  
  
  
  
  

The first quarterly all-engineering Codefest completed Tuesday evening (Aug. 16th) with three winning teams, one dramatic performance, and many laughs.

The entire company was invited to view the presentations and vote for their favorites. The only voting rule was you couldn’t vote for your own team. The judging was based on three criteria:

1. Impact on solving a Sonian or customer pain point (50%)

2. “Cool-ness” factor (25%)

3. Presentation style and effectiveness to convey the idea (25%).

Thirteen teams competed, representing the four functional units in the Sonian Engineering organization: SAFE (back-end), Website (front-end), DevOps (systems management) and QA. There were several teams from each group. The themes ranged from automation and performance measurement to UI beautification and speed. Each team gravitated toward their “natural” inclinations.

The DevOps teams focused on automating manual tasks and removing friction from deployments. The SAFE team (back-end) showcased applying “math” to measuring performance and data classification. The website team looked at speed and a better user experience, and the QA team showed us new ways to think about cost-testing alongside bug testing.

Six teams had a metrics or analytics theme. Two teams focused on user interface improvements, and four teams came up with solutions for automation and deployment problems.

Instead of Ernst and Young tallying the votes, our Harvard MBA trained ROI analyst, Chris H., stepped in to ensure a fair and accurate counting.

And thanks to all the non-technical folks who sat patiently through presentations where terms like “latency,” “lazy loading,” “grepping logs” and “foreground queues” were discussed.

Teams chose their presentation order with the QA team volunteering first. Below is a summary of the first six presentations with some context on how the idea fits into Sonian's needs and long-term vision. There will be another blog soon to follow with the remaining seven presentations.

Team 1: “You paid what for that…Export job, Object list request, or ES cluster?”

Andrea, Gopal, Bryan and Jitesh from the quality assurance team got together around an idea to extend testing methodologies into infrastructure cost analysis. In order to maximize the cloud’s economic advantage, the engineering team is always thinking about the cost of software operating on “big data scale” levels of activity. From architecture to implementation, the goal is to infuse “cost consciousness” at every level. The QA team came up with a novel idea on this theme.

The proposed idea is to extend the testing framework to set a baseline of feature infrastructure costs, and then measure successive releases against the baseline. A significant cost deviation from the baseline could be considered a design flaw, implementation error or a SEV1 bug. Some sample features with measurable costs would be an import job, export request, or a re-index. Over time the entire app suite could have an expense profile established.
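A baseline check of this kind could look something like the sketch below. It is only an illustration of the idea, not the QA team's framework: the function name, the feature names, the dollar figures, and the 10 percent tolerance are all assumptions.

```python
from typing import Dict, List


def cost_regressions(
    baseline: Dict[str, float],
    measured: Dict[str, float],
    tolerance: float = 0.10,
) -> List[str]:
    """Return the features whose measured cost exceeds the baseline by more than the tolerance."""
    flagged = []
    for feature, base_cost in baseline.items():
        cost = measured.get(feature)
        if cost is None:
            continue  # feature not exercised in this test run
        if cost > base_cost * (1 + tolerance):
            flagged.append(feature)
    return sorted(flagged)


# Hypothetical per-run infrastructure costs in dollars.
baseline = {"import_job": 1.00, "export_request": 0.40, "reindex": 2.50}
release = {"import_job": 1.05, "export_request": 0.60, "reindex": 2.40}
print(cost_regressions(baseline, release))
```

With these made-up numbers only the export request is flagged, since its cost rose by half while the import job stayed within tolerance and the re-index got cheaper. A test suite could fail the build on any flagged feature, the same way it would on a functional bug.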

Having QA be an additional “cost analysis layer” in the full development cycle will only help make the Sonian software as efficient as possible.

**Bonus points to this team for the most elaborate props and “dramatic performance.”

Team 2: Visualizing Beautiful Insights with Flowing Data

David, Drew, Greg and data analytic consultants, Luke and Sean, demonstrated several prototypes on how to visualize unstructured data. The team shared a common goal around the idea to show customers what’s possible by looking at their data from a visual analytics perspective (i.e., pie charts, heat maps, tag clouds, social graphs, etc.)

David and Drew exported email header information (including attachments) from the index and formatted the output (JSON to CSV) so that Luke and Sean could import a CSV file into their visual framework. Greg created a simulated corporate directory and grouped the email addresses.
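The JSON-to-CSV step is simple enough to sketch with the Python standard library. The field names below are hypothetical (the post does not list the exported header fields), and joining attachment names with a semicolon is just one way to fit a list into a single CSV cell.

```python
import csv
import io
import json

# Hypothetical header fields; the real export format is not shown in the post.
FIELDS = ["sender", "recipient", "date", "attachments"]


def headers_json_to_csv(json_text: str) -> str:
    """Convert a JSON array of email header records into CSV text."""
    records = json.loads(json_text)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        row = dict(record)
        # Flatten the attachment list into one cell.
        row["attachments"] = ";".join(record.get("attachments", []))
        writer.writerow(row)
    return out.getvalue()
```

Fields that are not in `FIELDS` are dropped, and missing fields come out as empty cells, which keeps the CSV rectangular for whatever visualization tool imports it.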

The resulting demo used real-world data to showcase communication patterns, frequent “sender/recipient” pairings, a social graph based on who talks to whom, and attachment file type relative to company organizational unit affiliation.

Visualizing data is relevant to Sonian as we start planning more analytics on the information in the archive and the desire to show customers “actionable intelligence” on their dark data.

Team 3: Team Hobo

Joe W. and Efren D. from the website team demonstrated the concept of using Vagrant (hence the team name?) on laptops to support running most of the application locally. In cloud development, one of the holy grails of developer efficiency is to be able to develop the website application either locally or on a cloud node. Vagrant allows individuals and teams to configure their developer environment more efficiently, whether running on EC2, Rackspace or local. Getting Vagrant to work correctly, consistently and bulletproof across all platforms is challenging.

This team demonstrated Vagrant working on Macbooks and the website and dev tools running completely from the virtual environment, starting from “bare metal.”

Team 4: Awesome / Doppler

Bill, Thomas and Dan S. from Tier 2 Support showed us how they are innovating their tools to help make case resolution faster.

Tier 2 support steps in to solve the challenging support requests that Tier 1 support folks hand-off. Tier 2 is the “glue” between engineering and the help desk, triaging new cases and researching vexing problems. Many Tier 2 requests involve locating the status of individual items in the archive, and this effort can be time consuming given the distributed nature of cloud computing and the pipeline process that manages customer data.

Team Awesome / Doppler created a framework to search across the many log files spread across the compute infrastructure that powers the archive system.

Using a combination of bash scripts, Chef commands and their knowledge of the archive file system, the team created a menu-driven search capability that could zoom into a specific part of the system and search for log file statements based on customer ID, distribution partner, or other artifact.  

Team 5: Pointy Haired Bosses (First Place Winner)

The results were quite impressive, and the team demonstrated the Viewer with a new data import capability and a new report showing a few sample accounts where the reported mailbox counts were lower than the calculated counts. In typical Codefest “right to the finish line,” the data was hot off the press.

Historically, keeping customers on an honest self-reported mailbox count was handled in the theme “trust but verify,” with the emphasis on trust. Now with this new feature the verification part can occur. 

The benefits to Sonian are numerous. Accurate billing data tightens the revenue story, as well as giving us a benchmark to calculate true ratios for infrastructure expense to billable subscriber seats. What’s next: Fine-tune the mailbox count algorithm and then run the initial data gathering task across all index clusters to populate the database. Igor will propose a plan, and he and Joe K. will monitor the process.

 

Team 6: Next Gen Metrics

Two DevOps engineers, Sean P. and Justin K., solved a challenging systems administration problem.

Recording and analyzing system events and performance metrics is a “must-have” for distributed systems. The dynamic nature of cloud computing magnifies this problem since good metrics are needed for optimization, and receiving good metrics in a cloud environment requires better tools than the current state of the art. 

Sean and Justin proved Elastic Search could be a very useful storage repository for metric data gathered by the new cloud-scale monitoring project. The monitoring system collects detailed metric data from Sonian server processes (SAFE, Elastic Search, Website) and stores the data as JSON documents in an Elastic Search cluster. JSON is a lightweight data format and Elastic Search has native JSON capabilities. Elastic Search queries support data ranges and identifying facets in the data stream.
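As a rough picture of what "metrics as JSON documents" means, the sketch below builds one metric document and the body of a range query over it. The field names are hypothetical, since the post does not describe the monitoring project's schema. The query body follows the general shape of a range query in Elasticsearch's JSON query language, but nothing here talks to a cluster.

```python
import json
from typing import Any, Dict


def metric_document(host: str, process: str, name: str, value: float, timestamp: str) -> str:
    """Serialize one metric sample as a JSON document (hypothetical field names)."""
    doc = {
        "host": host,
        "process": process,
        "metric": name,
        "value": value,
        "timestamp": timestamp,
    }
    return json.dumps(doc, sort_keys=True)


def range_query(field: str, start: str, end: str) -> Dict[str, Any]:
    """Build a query body matching documents where start <= field < end."""
    return {"query": {"range": {field: {"gte": start, "lt": end}}}}


doc = metric_document("node-17", "SAFE", "queue_depth", 42.0, "2011-08-16T14:05:00Z")
query = range_query("timestamp", "2011-08-16T14:00:00Z", "2011-08-16T15:00:00Z")
```

Because each sample is a self-describing document, new metrics can be added without a schema change, and the same index can answer both "what happened in this hour" and "break this down by process" questions.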

