Tag Archives: data

Velocity 2010: Day 2 Keynotes Continued

Back from break and it’s time for more!  The dealer room was fricking MOBBED.  And this is with a semi-competing cloud computing convention, Structure,  going on down the road.

Lightning Demos

Time for a load of demos!

dynaTrace

Up first is dynaTrace, a new hotshot in the APM (application performance management) space.  We did a lot of research in this space (well, the sub-niche in this space for deep dive agent-driven analytics), eventually going with Opnet Panorama over CA/Wily Introscope, Compuware, HP, etc.  dynaTrace broke onto the scene since and it seems pretty pimp.  It does traditional metric-based continuous APM and also has a “PurePath” technology where they inject stuff so you can trace a transaction through the tiers it travels along, which is nice.

He analyzed the FIFA page for performance.  dynaTrace is more of a big agenty thing but they have an AJAX edition free analyzer that’s more light/client based.  A lot of the agent-based perf vendors just care about that part, and it’s great that they are also looking at front end performance because it all ties together into the end user experience.  Anyway, he shows off the AJAX edition which does a lot of nice performance analysis on your site.  Their version 2.0 of Ajax Edition is out tomorrow!

And, they’re looking at how to run in the cloud, which is important to us.

Firebug

A Firefox plugin for page inspection, but if you didn’t know about Firebug already you’re fired.  Version 1.6 is out!  And there’s like 40 addons for it now, not just YSlow etc, but you don’t know about them – so they’re putting together “swarms” so you can get more of ’em.

In the new version you can see paint events.  And export your findings in HAR format for portability across tools of this sort.  Like httpwatch, pagespeed, showslow.  Nice!

They’ve added breakpoints on network and HTML events.  “FireCookie” lets you mess with cookies (and breakpoints on this too).

YSlow

A Firebug plugin, was first on the scene in terms of awesome Web page performance analysis.  showslow.com and gtmetrix.com complement it.   You can make custom rules now.  WTF (Web Testing Framework) is an new YSlow plugin that tests for more shady dev practices.

PageSpeed

PageSpeed is like YSlow, but from Google!  They’ve been working on turning the core engine into an SDK so it can be used in other contexts.  Helps to identify optimizations to get time to first pane, making JS/CSS recommendations.

The Big Man

Now it’s time for Tim O’Reilly to lay down the law on us.  He wrote a blog post on operations being the new secret sauce that kinda kicked off the whole Velocity conference train originally.

Tim’s very first book was the Massacomp System Administrator’s Guide, back in 1983.  System administration was the core that drove O’Reilly’s book growth for a long time.

Applications’ competitive advantage is being driven by large amounts of data.  Data is the “intel inside” of the new generation of computing.  The “internet operating system” being built is a data operating system.  And mobile is the window into it.

He mentioned OpenCV (computer vision) and Collective Intelligence, which ended up suing the same kinds of algorithms.  So that got his thinking about sensors and things like Arduino.  And the way technology evolves is hackers/hobbyists to innovators/entrepreneurs.  RedLaser, Google Goggles, etc. are all moves towards sensors and data becoming pervasive.  Stuff like the NHIN (nationwide health information network).  CabSense.  AMEE, “the world’s energy meter” (or at least the UK’s) determined you can tell the make and model of someone’s appliances based on the power spike!  Passur has been gathering radar data from sensors, feeding through algorithms, and now doing great prediction.

Apps aren’t just consumed by humans, but by other machines.  In the new world every device generates useful data, in which every action creates “information shadows” on the net.

He talks about this in Web Squared.  Robotics, augmented reality, personal electronics, sensor webs.

More and more data acquisition is being fed back into real time response – Walmart has a new item on order 20 seconds after you check out with it.  Immediately as someone shows up at the polls, their name is taken off the call-reminder list with Obama’s Project Houdini.

Ushahidi, a crowdsourced crisis mapper.  In Haiti relief, it leveraged tweets and skype and Mechanical Turk – all these new protocls were used to find victims.

And he namedrops Opscode, the Chef guys – the system is part of this application.  And the new Web Operations essay book.  Essentially, operations are who figures out how to actually do this Brave New World of data apps.

And a side note – the closing of the Web is evil.  In The State of the Internet Operating System, Tim urges you to collaborate and cooperate and stay open.

Back to ops – operations is making some pretty big – potentially world-affecting- decisions without a lot to guide us.  Zen and the Art of Motorcycle Maintenance will guide you.  And do the right thing.

Current hot thing he’s into – Gov2.0!  Teaching government to think like a platform provider.  He wants our help to engage with the government and make sure they’re being wise and not making technopopic decisions.

Code for America is trying to get devs for city governments, which of course are the ass end of the government sandwich – closest to us and affecting our lives the most, but with the least funds and skills.

Third Party Pain

And one more, a technical one, on the effects of third party trash on your Web site – “Don’t Let Third Parties Slow You Down“, by Google’s Arvind Jain and Michael Kleber.  This is a good one – if you run YSlow on our Web site, www.ni.com, you’ll see a nice fast page with some crappy page tags and inserted JS junk making it half as fast as it should be.  Eloqua, Unica, and one of our own demented design layers an extra like 6s on top of a nice sub-2s page load.

Adsense adds 12% to your page load.  Google Analytics adds 5% (and you should use the new asynchronous snippet!).  Doubleclick adds 11.5%.  Digg, FacebookConnect, etc. all add a lot.

So you want to minimize blocking on the external publisher’s content – you can’t get rid of them and can’t make them be fast (we’ve tried with Eloqua and Unica, Lord knows).

They have a script called ASWIFT that makes show_ads.js a tiny loader script.  They make a little iframe and write into it.  Normally if you document.write, you block the hell out of everything.  Their old show_ads.js had a median of 47 ms and a 90th percentile of 288 ms latency – the new ASWIFT one has a mdeian of 11 ms and a 90th %ile of 32 ms!!!

And as usual there’s a lot of browser specific details.  See the presentation for details.  They’re working out bugs, and hope to use this on AdSense soon!

Leave a comment

Filed under Conferences, DevOps

Velocity 2009 – Hadoop Operations: Managing Big Data Clusters

Hadoop Operations: Managaing Big Data Clusters (see link on that page for preso) was given by Jeff Hammerbacher of Cloudera.

Other good references –
book: “Hadoop: The Definitive Guide
preso: hadoop cluster management from USENIX 2009

Hadoop is an Apache project inspired by Google’s infrastructure; it’s software for programming warehouse-scale computers.

It has recently been split into three main subprojects – HDFS, MapReduce, and Hadoop Common – and sports an ecosystem of various smaller subprojects (hive, etc.).

Usually a hadoop cluster is a mess of stock 1 RU servers with 4x1TB SATA disks in them.  “I like my servers like I like my women – cheap and dirty,” Jeff did not say.

HDFS:

  • Pools servers into a single hierarchical namespace
  • It’s designed for large files, written once/read many times
  • It does checksumming, replication, compression
  • Access is from from Java, C, command line, etc.  Not usually mounted at the OS level.

MapReduce:

  • Is a fault tolerant data layer and API for parallel data processing
  • Has a key/value pair model
  • Access is via Java, C++, streaming (for scripts), SQL (Hive), etc
  • Pushes work out to the data

Subprojects:

  • Avro (serialization)
  • HBase (like Google BigTable)
  • Hive (SQL interface)
  • Pig (language for dataflow programming)
  • zookeeper (coordination for distrib. systems)

Facebook used scribe (log aggregation tool) to pull a big wad of info into hadoop, published it out to mysql for user dash, to oracle rac for internal…
Yahoo! uses it too.

Sample projects hadoop would be good for – log/message warehouse, database archival store, search team projects (autocomplete), targeted web crawls…
As boxes you can use unused desktops, retired db servers, amazon ec2…

Tools they use to make hadoop include subversion/jira/ant/ivy/junit/hudson/javadoc/forrest
It uses an Apache 2.0 license

Good configs for hadoop:

  • use 7200 rpm sata, ecc ram, 1U servers
  • use linux, ext3 or maybe xfs filesystem, with noatime
  • JBOD disk config, no raid
  • java6_14+

To manage it –

unix utes: sar, iostat, iftop, vmstat, nfsstat, strace, dmesg, friends

java utes: jps, jstack, jconsole
Get the rpm!  http://www.cloudera.com/hadoop

config: my.cloudera.com
modes – standalong, pseudo-distrib, distrib
“It’s nice to use dsh, cfengine/puppet/bcfg2/chef for config managment across a cluster; maybe use scribe for centralized logging”

I love hearing what tools people are using, that’s mainly how I find out about new ones!

Common hadoop problems:

  • “It’s almost always DNS” – use hostnames
  • open ports
  • distrib ssh keys (expect)
  • write permissions
  • make sure you’re using all the disks
  • don’t share NFS mounts for large clusters
  • set JAVA_HOME to new jvm (stick to sun’s)

HDFS In Depth

1.  NameNode (master)
VERSION file shows data structs, filesystem image (in memory) and edit log (persisted) – if they change, painful upgrade

2.  Secondary NameNode (aka checkpoint node) – checkpoints the FS image and then truncates edit log, usually run on a sep node
New backup node in .21 removes need for NFS mount write for HA

3.  DataNode (workers)
stores data in local fs
stored data into blk_<id> files, round robins through dirs
heartbeat to namenode
raw socket to serve to client

4.  Client (Java HDFS lib)
other stuff (libhdfs) more unstable

hdfs operator utilities

  • safe mode – when it starts up
  • fsck – hadoop version
  • dfsadmin
  • block scanner – runs every 3 wks, has web interface
  • balancer – examines ratio of used to total capacity across the cluster
  • har (like tar) archive – bunch up smaller files
  • distcp – parallel copy utility (uses mapreduce) for big loads
  • quotas

has users, groups, permissions – including x but there is no execution, but used for dirs
hadoop has some access trust issues – used through gateway cluster or in trusted env
audit logs – turn on in log4j.properties

has loads of Web UIs – on namenode go to /metrics, /logLevel, /stacks
non-hdfs access – HDFS proxy to http, or thriftfs
has trash (.Trash in home dir) – turn it on

includes benchmarks – testdfsio, nnbench

Common HDFS problems

  • disk capacity, esp due to log file sizes – crank up reserved space
  • slow but not dead disks and flapping NICS to slow mode
  • checkpointing and backing up metadata – monitor that it happens hourly
  • losing write pipeline for long lived writes – redo every hour is recommended
  • upgrades
  • many small files

MapReduce

use Fair Share or Capacity scheduler
distributed cache
jobcontrol for ordering

Monitoring – They use ganglia, jconsole, nagios and canary jobs for functionality

Question – how much admin resource would you need for hadoop?  Answer – Facebook ops team had 20% of 2 guys hadooping, estimate you can use 1 person/100 nodes

He also notes that this preso and maybe more are on slideshare under “jhammerb.”

I thought this presentation was very complete and bad ass, and I may have some use cases that hadoop would be good for coming up!

Leave a comment

Filed under Conferences, DevOps