Tag Archives: web

Why SSL Sucks

How do you make your Web site secure?  “SSL!” is the immediate response.  It’s a shame that there are so many pain points surrounding using SSL.

  • It slows things down.  On ni.com we bought a hardware SSL accelerator, but in the cloud we can’t do that.
  • Cert management.  Everyone uses loads of VIPs, so you end up needing to get wildcard certs.  But then you have to share them – we’re a big company, and don’t want to give the main wildcard cert out to multiple teams.  And you can’t get extended validation (EV) on wildcard certs.
  • UI issues.  We just tried turning SSL on for our internal wiki.  But anytime there’s any non-https included element – warnings pop up.  If there’s a form field to go to search, which isn’t https, warnings pop up.  Hell, sometimes you follow a link to a non-https page, and warnings pop up.
  • CAs.  We have an internal NI CA and try to have some NI-signed certs, but of course you have to figure out how to get them into every browser in your enterprise.
  • It’s just retardedly complicated.  For putting a cert on Apache, it’s pretty well-worn, but recently I was trying to set up encrypted replication for mySQL and OpenDS and Jesus, doing anything other than the default self signed cert is hell on earth.  “Oh is that in the right format wallet?”

The result is that SSL’s suckiness ends up driving behavior that degrades security.  People know to just “accept the exception” any time they hit a site that complains about an invalid cert.  We have decided to remove SSL from most of our internal wiki and just leave it on the login page to avoid all the UI issues.  We couldn’t secure our replication from a combination of bugs (OpenDS secure replication works, until you restart any server that is – then it’s broken permanently) and the hassle.

In general, there has been little to no usability work put into the cert/encryption area, and that is why still so few people use it.  PGPing your email is only for gearheads. Hell, you have to transform key formats to use PuTTY to log into Amazon servers using SSL.  Stop the madness.

If the world really gave a crap about encryption, then e.g. your public key could be attached to your Facebook profile and people’s mail readers could pull that in automatically to validate signatures, for instance. “Key exchange” isn’t harder than the other kinds of more-convenient information exchange that happen all the time on the Net.   And you could take a standard cert in whatever format your CA gives it to you and feed it into any software in an easy and standard way and have it start encrypting its communication.

Me to world – make it happen!

7 Comments

Filed under Security

Upcoming Free Velocity WebOps Web Conference

O’Reilly’s Velocity conference is the only generalized Web ops and performance conference out there.  We really like it; you can go to various other conferences and have 10-20% of the content useful to you as a Web Admin, or you can go here and have most of it be relevant!

They’ve been doing some interim freebie Web conferences and there’s one coming up.  Check it out.  They’ll be talking about performance functionality in Google Webmaster Tools, mySQL, Show Slow, provisioning tools, and dynaTrace’s new AJAX performance analysis tool.

O’Reilly Velocity Online Conference: “Speed and Stability”
Thursday, March 17; 9:00am PST
Cost: Free

Leave a comment

Filed under Conferences, DevOps

Come To OpsCamp!

Next weekend, Jan 30 2009, there’s a Web Ops get-together here in Austin called OpsCamp!  It’ll be a Web ops “unconference” with a cloud focus.  Right up our alley!  We hope to see you there.

Leave a comment

Filed under DevOps, Uncategorized

Velocity 2009 – Best Tidbits

Besides all the sessions, which were pretty good, a lot of the good info you get from conferences is by networking with other folks there and talking to vendors.  Here are some of my top-value takeaways.

Aptimize is a New Zealand-based company that has developed software to automatically do the most high value front end optimizations (image spriting, CSS/JS combination and minification, etc.).  We predict it’ll be big.  On a site like ours, going back and doing all this across hundreds of apps will never happen – we can engineer new ones and important ones better, but something like this which can benefit apps by the handful is great.

I got some good info from the MySpace people.  We’ve been talking about whether to run our back end as Linux/Apache/Java or Windows/IIS/.NET for some of our newer stuff.  In the first workshop, I was impressed when the guy asked who all runs .NET and only one guy raised his hand.   MySpace is one of the big .NET sites, but when I talked with them about what they felt the advantage was, they looked at each other and said “Well…  It was the most expeditious choice at the time…”  That’s damning with faint praise, so I asked about what they saw the main disadvantage being, and they cited remote administration – even with the new PowerShell stuff it’s just still not as easy as remote admin/CM of Linux.  That’s top of my list too, but often Microsoft apologists will say “You just don’t understand because you don’t run it…”  But apparently running it doesn’t necessarily sell you either.

Our friends from Opnet were there.  It was probably a tough show for them, as many of these shops are of the “I never pay for software” camp.  However, you end up wasting far more in skilled personnel time if you don’t have the right tools for the job.  We use the heck out of their Panorama tool – it pulls metrics from all tiers of your system, including deep in the JVM, and does dynamic baselining, correlation and deviation.  If all your programmers are 3l33t maybe you don’t need it, but if you’re unsurprised when one of them says “Uhhh… What’s a thread leak?” then it’s money.

ControlTier is nice, they’re a commercial open source CM tool for app deploys – it works at a higher level than chef/puppet, more like capistrano.

EngineYard was a really nice cloud provisioning solution (sits on top of Amazon or whatever).  The reality of cloud computing as provided by the base IaaS vendors isn’t really the “machines dynamically spinning up and down and automatically scaling your app” they say it is without something like this (or lots of custom work).  Their solution is, sadly, Rails only right now.  But it is slick, very close to the blue-sky vision of what cloud computing can enable.

And also, I joined the EFF!  Cyber rights now!

You can see most of the official proceedings from the conference (for free!):

1 Comment

Filed under Conferences, DevOps

Velocity 2009 – Monday Night

After a hearty trip to Gordon Biersch, Peco went to the Ignite battery of five minute presentations, which he said was very good.  I went to two Birds of a Feather sessions, which were not.  The first was a general cloud computing discussion which covered well-trod ground.  The second was by a hapless Sun guy on Olio and Fabian.  No, you don’t need to know about them.  It was kinda painful, but I want to commend that Asian guy from Google for diplomatically continuing to try to guide the discussion into something coherent without just rolling over the Sun guy.  Props!

And then – we were lame and just turned in.  I’m getting old, can’t party every night like I used to.  (I don’t know what Peco’s excuse is!)

Leave a comment

Filed under Conferences, DevOps

Velocity 2009 – Scalable Internet Architectures

OK, I’ll be honest.  I started out attending “Metrics that Matter – Approaches to Managing High Performance Web Sites” (presentation available!) by Ben Rushlo, Keynote proserv.  I bailed after a half hour to the other one, not because the info in that one was bad but because I knew what he was covering and wanted to get the less familiar information from the other workshop.  Here’s my brief notes from his session:

  • Online apps are complex systems
  • A siloed approach of deciding to improve midtier vs CDN vs front end engineering results in suboptimal experience to the end user – have to take holistic view.  I totally agree with this, in our own caching project we took special care to do an analysis project first where we evaluated impact and benefit of each of these items not only in isolation but together so we’d know where we should expend effort.
  • Use top level/end user metrics, not system metrics, to measure performance.
  • There are other metrics that correlate to your performance – “key indicators.”
  • It’s hard to take low level metrics and take them “up” into a meaningful picture of user experience.

He’s covering good stuff but it’s nothing I don’t know.  We see the differences and benefits in point in time tools, Passive RUM, tagging RUM, synthetic monitoring, end user/last mile synthetic monitoring…  If you don’t, read the presentation, it’s good.  As for me, it’s off to the scaling session.

I hopped into this session a half hour late.  It’s Scalable Internet Architectures (again, go get the presentation) by Theo Schlossnagle, CEO of OmniTI and author of the similarly named book.

I like his talk, it starts by getting to the heart of what Web Operations – what we call “Web Admin” hereabouts – is.  It kinda confuses architecture and operations initially but maybe that’s because I came in late.

He talks about knowledge, tools, experience, and discipline, and mentions that discipline is the most lacking element in the field. Like him, I’m a “real engineer” who went into IT so I agree vigorously.

What specifically should you do?

  • Use version control
  • Monitor
  • Serve static content using a CDN, and behind that a reverse proxy and behind that peer based HA.  Distribute DNS for global distribution.
  • Dynamic content – now it’s time for optimization.

Optimizing Dynamic Content

Don’t pay to generate the same content twice – use caching.  Generate content only when things change and break the system into components so you can cache appropriately.

example: a php news site – articles are in oracle, personalization on each page, top new forum posts in a sidebar.

Why abuse oracle by hitting it every page view?  updates are controlled.  The page should pull user prefs from a cookie.  (p.s. rewrite your query strings)
But it’s still slow to pull from the db vs hardcoding it.
All blog sw does this, for example
Check for a hardcoded php page – if it’s not there, run something that puts it there.  Still dynamically puts in user personalization from the cookie.  In the preso he provides details on how to do this.
Do cache invalidation on content change, use a message queuing system like openAMQ for async writes.
Apache is now the bottleneck – use APC (alternative php cache)
or use memcached – he says no timeouts!  Or… be careful about them!  Or something.

Scaling Databases

1. shard them
2. shoot yourself

Sharding, or breaking your data up by range across many databases, means you throw away relational constraints and that’s sad.  Get over it.

You may not need relations – use files fool!  Or other options like couchdb, etc.  Or hadoop, from the previous workshop!

Vertically scale first by:

  • not hitting the damn db!
  • run a good db.  postgres!  not mySQL boo-yah!

When you have to go horizontal, partition right – more than one shard shouldn’t answer an oltp question.   If that’s not possible, consider duplication.

IM example.  Store messages sharded by recipient.  But then the sender wants to see them too and that’s an expensive operation – so just store them twice!!!

But if it’s not that simple, partitioning can hose you.

Do math and simulate it before you do it fool!   Be an engineer!

Multi-master replication doesn’t work right.  But it’s getting closer.

Networking

The network’s part of it, can’t forget it.

Of course if you’re using Ruby on Rails the network will never make your app suck more.  Heh, the random drive-by disses rile the crowd up.

A single machine can push a gig.  More isn’t hard with aggregated ports.  Apache too, serving static files.  Load balancers too.  How to get to 10 or 20 Gbps though?  All the drivers and firmware suck.  Buy an expensive LB?

Use routing.  It supports naive LB’ing.  Or routing protocol on front end cache/LBs talking to your edge router.  Use hashed routes upstream.  User caches use same IP.  Fault tolerant, distributed load, free.

Use isolation for floods.  Set up a surge net.  Route out based on MAC.  Used vs DDoSes.

Service Decoupling

One of the most overlooked techniques for scalable systems.  Why do now what you can postpone till later?

Break transaction into parts.  Queue info.  Process queues behind the scenes.  Messaging!  There’s different options – AMQP, Spread, JMS.  Specifically good message queuing options are:

Most common – STOMP, sucks but universal.

Combine a queue and a job dispatcher to make this happen.  Side note – Gearman, while cool, doesn’t do this – it dispatches work but it doesn’t decouple action from outcome – should be used to scale work that can’t be decoupled.  (Yes it does, says dude in crowd.)

Scalability Problems

It often boils down to “don’t be an idiot.”  His words not mine.  I like this guy. Performance is easier than scaling.  Extremely high perf systems tend to be easier to scale because they don’t have to scale as much.

e.g. An email marketing campaign with an URL not ending in a trailing slash.  Guess what, you just doubled your hits.  Use the damn trailing slash to avoid 302s.

How do you stop everyone from being an idiot though?  Every person who sends a mass email from your company?  That’s our problem  – with more than fifty programmers and business people generating apps and content for our Web site, there is always a weakest link.

Caching should be controlled not prevented in nearly any circumstance.

Understand the problem.  going from 100k to 10MM users – don’t just bucketize in small chunks and assume it will scale.  Allow for margin for error.  Designing for 100x or 1000x requires a profound understanding of the problem.

Example – I plan for a traffic spike of 3000 new visitors/sec.  My page is about 300k.  CPU bound.  8ms service time.  Calculate servers needed.  If I varnish the static assets, the calculation says I need 3-4 machines.  But do the math and it’s 8 GB/sec of throughput.  No way.  At 1.5MM packets/sec – the firewall dies.  You have to keep the whole system in mind.

So spread out static resources across multiple datacenters, agg’d pipes.
The rest is only 350 Mbps, 75k packets per second, doable – except the 302 adds 50% overage in packets per sec.

Last bonus thought – use zfs/dtrace for dbs, so run them on solaris!

1 Comment

Filed under Conferences, DevOps

Velocity 2009 – Hadoop Operations: Managing Big Data Clusters

Hadoop Operations: Managaing Big Data Clusters (see link on that page for preso) was given by Jeff Hammerbacher of Cloudera.

Other good references –
book: “Hadoop: The Definitive Guide
preso: hadoop cluster management from USENIX 2009

Hadoop is an Apache project inspired by Google’s infrastructure; it’s software for programming warehouse-scale computers.

It has recently been split into three main subprojects – HDFS, MapReduce, and Hadoop Common – and sports an ecosystem of various smaller subprojects (hive, etc.).

Usually a hadoop cluster is a mess of stock 1 RU servers with 4x1TB SATA disks in them.  “I like my servers like I like my women – cheap and dirty,” Jeff did not say.

HDFS:

  • Pools servers into a single hierarchical namespace
  • It’s designed for large files, written once/read many times
  • It does checksumming, replication, compression
  • Access is from from Java, C, command line, etc.  Not usually mounted at the OS level.

MapReduce:

  • Is a fault tolerant data layer and API for parallel data processing
  • Has a key/value pair model
  • Access is via Java, C++, streaming (for scripts), SQL (Hive), etc
  • Pushes work out to the data

Subprojects:

  • Avro (serialization)
  • HBase (like Google BigTable)
  • Hive (SQL interface)
  • Pig (language for dataflow programming)
  • zookeeper (coordination for distrib. systems)

Facebook used scribe (log aggregation tool) to pull a big wad of info into hadoop, published it out to mysql for user dash, to oracle rac for internal…
Yahoo! uses it too.

Sample projects hadoop would be good for – log/message warehouse, database archival store, search team projects (autocomplete), targeted web crawls…
As boxes you can use unused desktops, retired db servers, amazon ec2…

Tools they use to make hadoop include subversion/jira/ant/ivy/junit/hudson/javadoc/forrest
It uses an Apache 2.0 license

Good configs for hadoop:

  • use 7200 rpm sata, ecc ram, 1U servers
  • use linux, ext3 or maybe xfs filesystem, with noatime
  • JBOD disk config, no raid
  • java6_14+

To manage it –

unix utes: sar, iostat, iftop, vmstat, nfsstat, strace, dmesg, friends

java utes: jps, jstack, jconsole
Get the rpm!  http://www.cloudera.com/hadoop

config: my.cloudera.com
modes – standalong, pseudo-distrib, distrib
“It’s nice to use dsh, cfengine/puppet/bcfg2/chef for config managment across a cluster; maybe use scribe for centralized logging”

I love hearing what tools people are using, that’s mainly how I find out about new ones!

Common hadoop problems:

  • “It’s almost always DNS” – use hostnames
  • open ports
  • distrib ssh keys (expect)
  • write permissions
  • make sure you’re using all the disks
  • don’t share NFS mounts for large clusters
  • set JAVA_HOME to new jvm (stick to sun’s)

HDFS In Depth

1.  NameNode (master)
VERSION file shows data structs, filesystem image (in memory) and edit log (persisted) – if they change, painful upgrade

2.  Secondary NameNode (aka checkpoint node) – checkpoints the FS image and then truncates edit log, usually run on a sep node
New backup node in .21 removes need for NFS mount write for HA

3.  DataNode (workers)
stores data in local fs
stored data into blk_<id> files, round robins through dirs
heartbeat to namenode
raw socket to serve to client

4.  Client (Java HDFS lib)
other stuff (libhdfs) more unstable

hdfs operator utilities

  • safe mode – when it starts up
  • fsck – hadoop version
  • dfsadmin
  • block scanner – runs every 3 wks, has web interface
  • balancer – examines ratio of used to total capacity across the cluster
  • har (like tar) archive – bunch up smaller files
  • distcp – parallel copy utility (uses mapreduce) for big loads
  • quotas

has users, groups, permissions – including x but there is no execution, but used for dirs
hadoop has some access trust issues – used through gateway cluster or in trusted env
audit logs – turn on in log4j.properties

has loads of Web UIs – on namenode go to /metrics, /logLevel, /stacks
non-hdfs access – HDFS proxy to http, or thriftfs
has trash (.Trash in home dir) – turn it on

includes benchmarks – testdfsio, nnbench

Common HDFS problems

  • disk capacity, esp due to log file sizes – crank up reserved space
  • slow but not dead disks and flapping NICS to slow mode
  • checkpointing and backing up metadata – monitor that it happens hourly
  • losing write pipeline for long lived writes – redo every hour is recommended
  • upgrades
  • many small files

MapReduce

use Fair Share or Capacity scheduler
distributed cache
jobcontrol for ordering

Monitoring – They use ganglia, jconsole, nagios and canary jobs for functionality

Question – how much admin resource would you need for hadoop?  Answer – Facebook ops team had 20% of 2 guys hadooping, estimate you can use 1 person/100 nodes

He also notes that this preso and maybe more are on slideshare under “jhammerb.”

I thought this presentation was very complete and bad ass, and I may have some use cases that hadoop would be good for coming up!

Leave a comment

Filed under Conferences, DevOps