My first workshop is Scalable Internet Architectures by Theo Schlossnagle, CEO of OmniTI. He gave a nearly identical talk last year but I missed some of it, and it was really good, so I went! (Robert from our Web Admin team attended as well.)

There aren’t many good books on scalability. Mainly there are three – Art of Scalability, Cal Henderson’s Building Scalable Web Sites, and his, Scalable Internet Architectures. So any tips you can get a hold of are welcome.

Following are my notes from the talk; my own thoughts are in italics.

What is architecture? It encompasses everything from power up to the client touchpoint and everything in between.

Of necessity, people are specialized into specific disciplines but you have to overcome that to make a whole system make sense.

The new push towards devops (development/operations collaboration) tries to address this kind of problem.

Operations

Operations is a serious part of this, and it takes knowledge, tools, experience, and discipline.

Knowledge – Is easy to get; Internet, conferences (Velocity, Structure, Surge), user groups

Tools – All tools are good; understand the tools you have. Some of operations encourages hackiness because when there is a disruption, the goal is “make it stop as fast as possible.”

You have to know how to use tools like truss, strace, dtrace through previous practice before the outage comes. Tools (and automation) can help you maintain discipline.

Experience comes from messing up and owning up.

Discipline is hardest. It’s the single most lacking thing in our field. You have to become a craftsman: learn discipline through experience, and through practice achieve excellence. You can’t be so timid that you take no risks, and you can’t take risks you don’t understand.

It’s like my old “Web Admin Standing Orders” that tried to delineate this approach for my ops guys – “1. Make it happen. 2. Don’t f*ck it up. 3. There’s the right way, the wrong way, and the standard way.” Take risks, but not dumb risks, and have discipline and tools.

He recommends the classic Zen and the Art of Motorcycle Maintenance for operations folks. Cowboys and heroes burn out. Embrace a Zen attitude.

Best Practices

Version Control everything. All tools are fine, but mainly it’s about knowing how to use it and using it correctly, whether it’s CVS or Subversion or git.
Know Your Systems – Know what things look like normally so you have a point of comparison. “Hey, there’s 100 database connections open! That must be the problem!” Maybe that’s normal. Have a baseline (also helps you practice using the tools). Your brain is the best pattern matcher.
Don’t say “I don’t know” twice. They wrote an open source tool called Reconnoiter that looks at data, graphs regressions, and alerts on them (instead of Cacti, Nagios, and other time-consuming stuff). Now available as SaaS!
Management – Package rollout, machine management, provisioning. “You should use puppet or chef! Get with the times and use declarative definition!” Use the tools you like. He uses kickstart and cfengine and he likes it just fine.
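The “know your baseline” point can be sketched in a few lines of Python. This is only an illustration, not how Reconnoiter works; the sample connection counts and the three-sigma threshold are made-up assumptions.

```python
import statistics

def is_abnormal(history, current, sigmas=3.0):
    """Return True if `current` is more than `sigmas` standard deviations
    away from the mean of the `history` samples."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) > sigmas * stdev

# Hypothetical samples of open database connections during normal operation.
normal_connections = [95, 102, 98, 105, 100, 97, 103]

print(is_abnormal(normal_connections, 100))  # 100 connections is business as usual here
print(is_abnormal(normal_connections, 400))  # 400 is far outside the baseline
```

Without the recorded history, “100 connections” tells you nothing either way, which is the point of keeping a baseline.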

Dynamic Content

Our job is all about the dynamic content. Static content – Bah, use Akamai or CacheFly or Panther or whatever. It’s a solved problem.

Premature optimization is the root of all evil – well, 97% of it. It’s the other 3% that’s a bitch. And you’re not smart enough to know where that 3% is.

Optimization means “don’t do work you don’t have to.” Computational reuse and caching, but don’t do it in the first place when possible.
He puts comments for things he decides not to optimize explaining the assumptions and why not.

Sometimes naive business decisions force insane implementations down the line; you need to re-check them.

Your content is not as dynamic as you think it is. Use caching.

Technique – Static Element Caching

Applied YSlow optimizations – it’s all about the JavaScript, CSS, images. Consolidate and optimize. Make it all publicly cacheable with 10 year expiry.

RewriteRule (.*)\.([0-9]+)\.css $1.css rewrites /s/app.23412.css to /s/app.css – each version gets a unique URL, so browsers fetch a fresh copy while the file on disk keeps its name. Bump up the number in the template. Use “cat” to consolidate files, freaks!
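To see the versioned-URL trick end to end, here is a small Python sketch that mimics both halves: the template side that stamps a version into the URL, and the server side that strips it again. The function names are hypothetical and the regex is simply the RewriteRule pattern from above carried over to Python.

```python
import re

# Same pattern as the RewriteRule above: drop the numeric version before ".css".
VERSIONED_CSS = re.compile(r"(.*)\.([0-9]+)\.css$")

def versioned_url(path, version):
    """Template side: turn /s/app.css into /s/app.<version>.css."""
    base, ext = path.rsplit(".", 1)
    return f"{base}.{version}.{ext}"

def resolve(url):
    """Server side: map a versioned URL back to the real file on disk."""
    return VERSIONED_CSS.sub(r"\1.css", url)

url = versioned_url("/s/app.css", 23412)
print(url)           # the URL the browser sees and caches
print(resolve(url))  # the file the server actually reads
```

Because the browser treats every new version number as a brand new resource, the long expiry never serves a stale stylesheet.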

Images, put a new one at a new URI. Can’t trust caches to really refresh.

Technique – Cookie Caching

Announcing a distributed database cache that always is near the user and is totally resilient! It’s called cookies. Sign it if you don’t want tampering. Encrypt if you don’t want them to see its contents. Done. Put user preferences there and quit with the database lookups.
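A minimal sketch of the “sign it” part, using Python’s standard hmac module. The key, the cookie layout (payload, a dot, then the signature), and the function names are my own assumptions, not something from the talk. Note this only detects tampering; the user can still read the contents, so encrypt as well if that matters.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key; it never leaves the server

def sign_cookie(prefs):
    """Serialize preferences and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(prefs, sort_keys=True).encode("utf-8"))
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature

def read_cookie(value):
    """Return the preferences dict, or None if the cookie is malformed or tampered with."""
    try:
        payload, signature = value.rsplit(".", 1)
        payload_bytes = payload.encode("ascii")
        signature_bytes = signature.encode("ascii")
    except ValueError:  # no separator, or non-ASCII characters
        return None
    expected = hmac.new(SECRET_KEY, payload_bytes, hashlib.sha256).hexdigest().encode("ascii")
    if not hmac.compare_digest(signature_bytes, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload_bytes))

cookie = sign_cookie({"theme": "dark", "per_page": 50})
print(read_cookie(cookie))        # preferences come back with no database lookup
print(read_cookie("x" + cookie))  # a modified cookie is rejected
```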

Technique – Data Caching

Data caching. Caching happens at a lot of layers. Cache if you don’t have to be accurate, use a materialized view if you do. Figuring out the state breakdown of your users? Put it in a separate table at signup or state change time, don’t query all the time. Do it from the app layer if you have to.
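The “separate table at signup or state change time” idea might look roughly like this when done from the app layer. It is a sketch using an in-memory SQLite database; the table and function names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, state TEXT NOT NULL);
    CREATE TABLE users_by_state (state TEXT PRIMARY KEY, total INTEGER NOT NULL);
""")

def _adjust(state, delta):
    # Make sure the counter row exists, then move it by delta.
    conn.execute("INSERT OR IGNORE INTO users_by_state (state, total) VALUES (?, 0)", (state,))
    conn.execute("UPDATE users_by_state SET total = total + ? WHERE state = ?", (delta, state))

def signup(user_id, state):
    with conn:  # one transaction: the user row and the counter change together
        conn.execute("INSERT INTO users (id, state) VALUES (?, ?)", (user_id, state))
        _adjust(state, +1)

def change_state(user_id, new_state):
    with conn:
        (old_state,) = conn.execute("SELECT state FROM users WHERE id = ?", (user_id,)).fetchone()
        conn.execute("UPDATE users SET state = ? WHERE id = ?", (new_state, user_id))
        _adjust(old_state, -1)
        _adjust(new_state, +1)

def breakdown():
    """Read the precomputed counts instead of running GROUP BY over every user."""
    return dict(conn.execute("SELECT state, total FROM users_by_state WHERE total > 0"))

signup(1, "TX")
signup(2, "TX")
signup(3, "CA")
change_state(2, "CA")
print(breakdown())
```

The report query now touches a handful of rows no matter how many users there are; the cost moves to the comparatively rare signup and state-change writes.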

Technique – Choosing Technologies

Understand how you’ll be writing and retrieving data – and how everyone else in the business will be too! (Reports, BI, etc.) You have to be technology agnostic and find the best fit for all the needs – business requirements as well as consistency, availability, recoverability, performance, stability. That’s a place where NoSQL falls down.

Technique – Database

Shard your database then shoot yourself. Horizontal scaling isn’t always better. It will make your life hell, so scale vertically first. If you have to, do it, and try not to have regrets.

Do try “files,” NoSQL, cookies, and other non-ACID alternatives because they scale more easily. Keep stuff out of the DB where you can.

When you do shard, partition to where you don’t need more than one shard per OLTP question. Example – private messaging system. You can partition by recipient and then you can see your messages easily. But once someone looks for messages they sent, you’re borked. But you can just keep two copies! Twice the storage but problem solved. Searching cross-user messages, however, borks you.
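Here is a toy version of the private messaging example, with plain dicts standing in for the shards. The shard count, hash choice, and names are assumptions made for illustration. Every message is written twice, once to the recipient’s shard and once to the sender’s, so both “my inbox” and “my sent mail” hit exactly one shard.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count

# Each shard is a dict here; a real system would have one database per shard.
shards = [{"inbox": defaultdict(list), "sent": defaultdict(list)} for _ in range(NUM_SHARDS)]

def shard_for(user):
    """Pick a user's shard with a stable hash of the user name."""
    return shards[zlib.crc32(user.encode("utf-8")) % NUM_SHARDS]

def send_message(sender, recipient, body):
    message = {"from": sender, "to": recipient, "body": body}
    shard_for(recipient)["inbox"][recipient].append(message)  # copy 1: partitioned by recipient
    shard_for(sender)["sent"][sender].append(message)         # copy 2: partitioned by sender

def inbox(user):
    return list(shard_for(user)["inbox"].get(user, []))  # single-shard read

def sent(user):
    return list(shard_for(user)["sent"].get(user, []))   # single-shard read

send_message("alice", "bob", "lunch?")
print(inbox("bob"))
print(sent("alice"))
```

A search across all users’ messages would still have to visit every shard, which is exactly the case the notes warn about.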

Don’t use multimaster replication. It sucks – it’s not ready for prime time. Outside ACID there are key-value stores, document databases, etc. Eventual consistency helps. MongoDB, Cassandra, Voldemort, Redis, CouchDB – you will have some data loss with all of them.

NoSQL isn’t a cure-all; these stores are not PCI compliant, for example. Shiny is not necessarily good. Break up the problem and implement the KISS principle. Of course you can’t get to the finish line with pure relational for large problems either – you have to use a mix; there is NO one size fits all for data management.

Keep in mind your restore-time and restore-point needs as well as ACID requirements of your data set.

Technique – Service Decoupling

One of the most fundamental techniques to scaling. The theory is, do it asynchronous. Why do it now if you can postpone it? Break down the user transaction and determine what parts can be asynchronous. Queue the info required to complete the task and process it behind the scenes.
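As a rough sketch of “queue it and process it behind the scenes,” here is the pattern with Python’s in-process queue standing in for a real message broker. The signup scenario and all the names are hypothetical.

```python
import queue
import threading

tasks = queue.Queue()
sent_emails = []  # stand-in for the slow side effect

def handle_signup(email):
    """The synchronous part of the user transaction: record the task and return."""
    tasks.put({"type": "welcome_email", "to": email})
    return "signup accepted"

def worker():
    """The asynchronous part: drain the queue behind the scenes."""
    while True:
        task = tasks.get()
        sent_emails.append(task["to"])  # pretend this is the slow email send
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()

print(handle_signup("user@example.com"))  # returns without waiting for the email
tasks.join()                              # only this demo waits; a web request would not
print(sent_emails)
```

In production the queue would live outside the web process so that the two sides can fail, restart, and scale independently.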

It is hard, though, and is more about service isolation than postponing work. The more you break down the problem into small parts, the more you gain: problem simplification, fault isolation, simplified design, decoupling of approach, strategy, and tactics, simpler capacity planning, and more accurate performance modeling. (Like SOA, but you know, that really works.)

One of my new mantras while building our cloud systems is “Sharing is the devil,” which is another way of stating “decouple heavily.”

Message queueing is an important part of this – you can use ActiveMQ, OpenAMQ, RabbitMQ (winner!). STOMP sucks but is a universal protocol most everyone uses to talk to message queues.

Don’t decouple something small and simple, though.

Design & Implementation Techniques

Architecture and implementation are intrinsically tied, you can’t wholly separate them. You can’t label a box “Database” and then just choose Voldemort or something.

Value accuracy over precision.

Make sure the “gods aren’t angry.” The dtrace guy was running mpstat one day, and the columns didn’t line up. The gods intended them to, so that’s your new problem instead of the original one! OK, that’s a confusing anecdote. A better one is “your Web servers are only handling 25 requests per second.” It should be obvious the gods are angry. There has to be something fundamentally wrong with the universe to make that true. That’s not a provisioning problem, that’s an engineering problem.

Develop a model. A complete model is nearly impossible, but a good queue theory model is easy to understand and provides good insight on dependencies.
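The talk didn’t name a specific model, so as an assumption here is the simplest textbook one, a single-server M/M/1 queue, to show the kind of insight you get. The rates in the example are invented.

```python
def mm1(arrival_rate, service_rate):
    """Steady-state figures for an M/M/1 queue (rates in requests per second).

    Returns (utilization, average requests in the system, average response time in seconds).
    """
    if arrival_rate >= service_rate:
        raise ValueError("arrival rate must be below service rate or the queue grows forever")
    utilization = arrival_rate / service_rate
    avg_in_system = utilization / (1 - utilization)
    avg_response_time = 1 / (service_rate - arrival_rate)
    return utilization, avg_in_system, avg_response_time

# Hypothetical tier that can serve 100 requests/second.
for load in (50, 90, 99):
    utilization, in_system, response = mm1(load, 100)
    print(f"{load} req/s: utilization {utilization:.0%}, "
          f"{in_system:.1f} in system, {response * 1000:.0f} ms response")
```

Even this crude model shows response time climbing steeply as utilization nears 100%, so a tier that looks fine at half load can fall over well before it is nominally full.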

Draw it out, rationalize it. When a user comes in to the site, what all happens? You end up doing a lot of I/O ops. Given traffic, you should then know about what each tier will bear.

Complexity is a problem – decoupling helps with it.

In the end…

Don’t be an idiot. A lot of scalability problems are from being stupid somewhere. High performance systems don’t have to scale as much. Here’s one example of idiocy in three acts.

Act 1 – Amusing Error By Marketing Boneheads – sending a huge mailing with a URL that redirects. You just doubled your load, good deal.

Act 2 – Faulty Capacity Planning – you have 100k users now. You try to plan for 10 million. Don’t bother, plan only to 10x up, because you just don’t understand the problems you’ll have at that scale – a small margin of error will get multiplied.

Someone into agile operations might point out here that this is a way of stating the agile principle of “iterative development.”

Act 3 – The Traffic Spike – I plan on having a spike that gives me 3000 more visitors/second to a page with various CSS/JS/images. I do loads of math and think that’s 5 machines’ worth. Oh whoops, I forgot to do every part of the math – the redirect issue from the amusing error above! Suddenly there’s a huge amount more traffic and my pipe is saturated. (Remember the Internet really works on packets and not bytes…)

This shows a lot of trust in engineering math… But isn’t this why testing was invented? Whenever anyone shows me math and hasn’t tested it I tend to assume they’re full of it.

Come see him at Surge 2010! It’s a new scalability and performance conference in Baltimore in late Sep/early Oct.

A new conference, interesting! Is that code for “server side performance,” where Velocity kinda focuses on client side/front end a lot?

I was recently reading a good Cameron Purdy post where he talks about his eight theses regarding why startups or students can pull stuff off that large enterprise IT shops can’t.

My summary/trenchant restatement of his points:

Changing existing systems is harder than making a custom-built new one (version 2 is harder)
IT veterans overcomplicate new systems
The complexity of a system increases exponentially the work needed to change it (versions 3 and 4 are way way harder)
Students/startups do fail a lot, you just don’t see those
Risk management steps add friction
Organizational overhead (paperwork/meetings) adds friction
Only overconservative goons work in enterprise IT anyway
The larger the org, the more conflict

Though I suspect #1 and #3 are the same, #2 and #5 are the same, and #6 and #8 are the same, really.

I’ve been thinking about this lately with my change from our enterprise IT Web site to a new greenfield cloud-hosted SaaS product in our R&D organization. It’s definitely a huge breath of fresh air to be able to move fast. My observations:

Complexity

The problem of systems complexity (theses #1 and #3) is a very real one. I used to describe our Web site as having reached “system gridlock.” There were hundreds of apps running dozens to a server with poorly documented dependencies on all kinds of stuff. You would go in and find something that looked “wrong” – an Apache config, script, load balancer rule, whatever – but if you touched it some house of cards somewhere would come tumbling down. Since every app developer was allowed to design their own app in its own tightly coupled way, we had to implement draconian change control and release processes in an attempt to stem the tide of people lining up to crash the Web site.

We have a new system design philosophy for our new gig which I refer to as “sharing is the devil.” All components are separated and loosely coupled. Using cloud computing for hardware and open source for software makes it easy and affordable to have a box that does “only one thing.” In traditional compute environments there’s pressure to “use up all that CPU before you add more”, which results in a penny wise, pound foolish strategy of consolidation. More and more apps and functions get crunched closer together and when you go back to pull them out you discover that all kinds of new connections and dependencies have formed unbidden.

Complication

Overcomplicating systems (#2 and #5) can be somewhat overcome by using agile principles. We’ve been delving heavily into doing not just our apps but also our infrastructure according to an agile methodology. It surfaces your requirements – frankly, systems people often get away with implementing whatever they want, without having a spec let alone one open to review. Also, it makes you prioritize. “Whatever you can get done in this two week iteration, that’s what you’ll have done, and it should be working.” It forces focus on what is required to get things to work and delays more complex niceties till later as there’s time.

Conservatism

Both small and large organizations can suffer from #6 and #8. That’s mostly a mindset issue. I like to tell the story about how we were working on a high level joint IT/business vision for our Web site. We identified a number of “pillars” of the strategy we were developing – performance, availability, TCO, etc. I had identified agility as one, but one of the application directors just wasn’t buying into it. “Agility, that’s weird, how do we measure that, we should just forget about it.” I finally had to take all the things we had to the business head of the Web and say “of these, which would you say is the single most important one?” “Agility, of course,” he said, as I knew he would. I made it a point to train my staff that “getting it done” was the most important thing, more important than risk mitigation or crossing all the t’s and dotting all the i’s. That can be difficult if the larger organization doesn’t reward risk and achievement over conservatism, but you can work on it.

Velocity 2010: Scalable Internet Architectures

Enterprise Systems vs. Agility
