Tue, 28 Mar 2006

After filling a CornFS volume for a couple of days now, I found a few problems that really begged for another release.

I'm still building cornfs with debug flags and running it under gdb to catch any segfaults in the new caching code. Sure enough, it turned up a segfault or two, which meant cleaning up my pointer handling a bit. Cacheinsert() now works for a rather huge cacheinventory() run without incident.

There was also a bug with the statfs() setup in the cache upload function. Instead of statfs()ing the /data/cornfs/import/{servername} directory, it was handily using /data/cornfs/import. All of the servers appeared to have the same remaining free space, which caused the last two servers to fill to the brim.
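To make the statfs() fix concrete, here is a minimal Python sketch of the corrected behaviour: measure free space on each server's own import directory rather than on the shared parent. It is not the cornfs C code; the function names and the injectable `statvfs` parameter are assumptions made for illustration.

```python
import os


def free_bytes(path, statvfs=os.statvfs):
    """Bytes available to unprivileged users on the filesystem holding path."""
    st = statvfs(path)
    return st.f_bavail * st.f_frsize


def pick_emptiest(import_root, servers, statvfs=os.statvfs):
    """Return the server whose own import directory has the most free space.

    The bug described above is equivalent to calling free_bytes(import_root)
    for every server, which yields the same number each time.
    """
    return max(
        servers,
        key=lambda name: free_bytes(os.path.join(import_root, name), statvfs),
    )
```

With the real mounts in place, `pick_emptiest("/data/cornfs/import", ["server1", "server2"])` would compare the two NFS mounts individually (the server names here are placeholders).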

Like I said, there will likely be some rapid releases this week as I stumble upon more nits to pick.

For now you can download cornfs-v0.0.5.1.tar.bz2 and have at it.

Sun, 26 Mar 2006

I've been working on cornfs this weekend a bit to speed things up.

With the help of gprof and gcc -pg, I found that the caching routines were causing a huge performance hit. Every read() and write() was doing a linear linked list search through every cached entry. This is fixed.
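The post does not say what replaced the linear scan, so the following is only one common way to remove that kind of hotspot: key the cache by path in a hash table. This is a Python sketch with made-up names, not the cornfs implementation.

```python
def linear_lookup(entries, path):
    """The slow pattern: walk every (path, entry) pair until one matches."""
    for candidate, entry in entries:
        if candidate == path:
            return entry
    return None


class CacheIndex:
    """Path-keyed cache: lookups are O(1) on average instead of O(n)."""

    def __init__(self):
        self._entries = {}

    def insert(self, path, entry):
        self._entries[path] = entry

    def lookup(self, path):
        # Returns None when the path has never been cached.
        return self._entries.get(path)
```

Both give the same answers; only the cost per read() or write() changes as the cache grows.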

Along the way, I found it difficult to debug things with one huge cornfs.c source file. So I've split that up into numerous .c source files to fix this.

I also updated the Makefile to build on its own without building under fuse/examples as before. It is now FUSE 2.5.2 friendly, and compiles with version 22 ABI compatibility. I'll see about adding the version 25 ABI functions shortly.

So, download cornfs-v0.0.5.0.tar.bz2, extract, and build with make.

A few folks have mentioned via private email that they were playing with cornfs. With this latest version, and NFS over ssh, NKS is finally running this in a production environment.

Look forward to some rapid updates here in the near future.

Thu, 23 Mar 2006

I've had serious problems using shfs, sshfs, and sfs. The first two fall apart under load, and the latter is a nightmare to get working in our environment (a PAM nightmare, that is).

Rather than dealing with something crazy, I decided to go back to a faithful old standard: NFS. As the remote storage nodes are accessible only via ssh, ssh was the ideal transport for the NFS mounts.

How do you do this? With a little port trickery and some inittab craziness to hold the tunnels up.

NFS v3 and newer have a TCP transport mode that makes it possible to tunnel using ssh. Older versions of NFS use a UDP based ONC RPC transport. Make sure you have kernel support for TCP and NFS v3 before you continue.

On the remote nodes, install NFS:

 # apt-get install nfs-kernel-server nfs-common portmap

Then setup an exports file sharing something to localhost:

 # echo "/export localhost(rw,async,insecure,no_root_squash)" >> /etc/exports

We need to have mountd start on a known port to set up the ssh tunnel from the master. The "-p" flag is used for this. Debian keeps the RPCMOUNTDOPTS flags in /etc/default/nfs-kernel-server, easily updated with this perl one-liner:

 # perl -pi -e 's/^(RPCMOUNTDOPTS)=.*$/$1="-p 32767"/' /etc/default/nfs-kernel-server

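It's also a good idea to block portmap requests from anything but localhost with tcpwrappers, just in case your firewall rules happen to be down for some reason. Note that the allow rule below only restricts access if /etc/hosts.deny also contains a matching deny rule such as "portmap: ALL".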

 # echo "portmap: LOCAL" >> /etc/hosts.allow

Now restart things and make sure the mountpoint is being exported:

 # /etc/init.d/nfs-kernel-server stop
 # /etc/init.d/nfs-common stop
 # /etc/init.d/portmap stop
 # /etc/init.d/portmap start
 # /etc/init.d/nfs-common start
 # /etc/init.d/nfs-kernel-server start
 # rpcinfo -p localhost
 # showmount -e localhost

The remote server is now ready to mount. Return to the central master cornfs server that will act as the client and set up an ssh tunnel.

Step 1: Install nfs-client

 # apt-get install nfs-client

Step 2: Setup key trust with the remote server:

 # ssh-keygen -f ~/.ssh/id_dsa-cornfs -P'' -t dsa -b 1024
 # cat ~/.ssh/id_dsa-cornfs.pub | ssh remoteserver 'mkdir ~/.ssh; cat - >> ~/.ssh/authorized_keys'

Step 3: Setup the SSH tunnel with an inittab respawn

 # echo 'N0:23:respawn:/usr/bin/ssh -c blowfish -L 10000:localhost:2049 -L 11000:localhost:32767 remoteserver vmstat 300' >> /etc/inittab
 # telinit q

Now you should see an ssh tunnel running in a process listing. Check your system logs to see if there are any problems.

Step 4: Add fstab entries for NFS:

 # echo 'localhost:/export /data/cornfs/import/remoteserver nfs rw,bg,soft,port=10000,mountport=11000,tcp 0 0' >> /etc/fstab
 # mount /data/cornfs/import/remoteserver

You should now see your remote server /export filesystem mounted under /data/cornfs/import.

Each remote server will need to have a unique nfs and mountd port assignment. Repeat steps 3 and 4 for each.

I started at 10000 and 11000 and worked my way up from there. The next server's port assignments are 10001 and 11001, etc.
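The port numbering above is mechanical enough to generate. The sketch below builds the inittab and fstab lines for a list of servers, following the 10000/11000 scheme. The function name is made up, and extending the inittab id as N0, N1, ... is my assumption (inittab ids need to be unique and short).

```python
def tunnel_config(servers, nfs_base=10000, mountd_base=11000,
                  import_root="/data/cornfs/import"):
    """Return one (inittab_line, fstab_line) pair per remote server."""
    pairs = []
    for i, server in enumerate(servers):
        nfs_port = nfs_base + i        # local end of the tunnel to nfsd (2049)
        mountd_port = mountd_base + i  # local end of the tunnel to mountd (32767)
        inittab = (
            f"N{i}:23:respawn:/usr/bin/ssh -c blowfish "
            f"-L {nfs_port}:localhost:2049 -L {mountd_port}:localhost:32767 "
            f"{server} vmstat 300"
        )
        fstab = (
            f"localhost:/export {import_root}/{server} nfs "
            f"rw,bg,soft,port={nfs_port},mountport={mountd_port},tcp 0 0"
        )
        pairs.append((inittab, fstab))
    return pairs
```

Printing the pairs for each server gives lines ready to append to /etc/inittab and /etc/fstab; the second server in the list gets ports 10001 and 11001, and so on.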

This works surprisingly well, and appears to be quite stable (far more stable than the other alternatives).

That's not to say things are as fast as they could possibly be, but it works.

Thu, 09 Feb 2006

This article disappeared from my site for a while.. restoring..

The DENSO NAV system uses a DVD which you can find in a Prius so equipped by sliding the driver seat forward, prying off the plastic cover panel, sliding the eject button lock over, and hitting the eject button while the car is in aux power but not yet “On”.

If you put the DVD in your home computer, you will see an ISO9660 filesystem representation of the KIWI data files used by the NAV computer. Check out the KIWI Documents for the actual data format of these files; I’m actively trying to write something to decode them now. Mapmaster appears to be Toyota’s navigation division.

This is a full 8G DVD, so you will need a dual-layer burner to make a copy.

The files are just pointers into the actual KIWI datasets.

VERSION.TXT - The version of the data on this DVD (ver.04.2 for a DENSO NAV 4.2 DVD)

ALLDATA.KWI - A file that represents all of the KIWI MAP data (I think). - This is the biggest file on the root of the filesystem at 2.8G

COVERAGE.TXT - Text file listing all files in the COVERAGE/ directory

COVERAGE/ - Directory full of BMP images showing the coverage area of this DVD as a raster image. This would be the easiest thing to hack with your own graphics.

INDEXDAT.KWI - Lists the IDX/ directory files.

IDX/ - Directory full of index files. My guess is that they are for searching and/or quick access to names, addresses, phone numbers, etc., as that is what they appear to contain. - This is the biggest directory on the DVD (5G)

METADATA.KWI - Metadata. Contains this string:


LANG::=US English,UK English,German,French,Spanish,Italian,Dutch,Swedish,Danish ; CHCD ::=ISO 8859-1 ; COOR::=WGS84 ;

SPEC.KWI - Specification file. Contains this string:


SUPERMETA::=AFUS:2.47, AGUS:2.47 ;
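The two strings above look like simple "KEY::=value ;" records with comma-separated values. Here is a small Python sketch that splits them apart. The syntax is inferred only from these two samples, not from the KIWI documents, so treat the parsing rules as an assumption.

```python
def parse_kiwi_fields(text):
    """Split 'KEY::=a,b ; KEY2::=c ;' into {'KEY': ['a', 'b'], 'KEY2': ['c']}."""
    fields = {}
    for chunk in text.split(";"):
        if "::=" not in chunk:
            continue  # skip the empty tail after the final ';'
        key, value = chunk.split("::=", 1)
        fields[key.strip()] = [item.strip() for item in value.split(",")]
    return fields
```

Run against the METADATA.KWI string, this yields a LANG list of nine languages plus single-item CHCD and COOR entries.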

LOADING.KWI - Loading Module Management File. - At 520M, including strings like “/HDD/slot1/passwd.osg”, the first thing that I think of when I see this file is “this is the code that the NAV computer boots”. - “I Agree” is in this file. Removing that initial nag screen may be as simple as a binary patch to this file! Woo!

DICVCE56.KWI - From the strings, it looks like a group list of Points of Interest (not the actual points, just the general groups). This is a small file.

VIRTUAL0.DAT VIRTUAL1.DAT - No idea what these are for. They appear to be full of NULLs.

I’ll be updating this node as I figure things out.

I’ve created a quick little perl script based solely on the KIWI “All Data Management Frame” document 0500122e.pdf. The script, alldata.pl is a very rudimentary parser of the data structures, based entirely on a hash tree of the schema information gleaned from this document. Hey, it’s a start.

Update: KAMIYA Satosi has restored his Ruby KIWI Voice scripts!

More to come…

Sat, 14 Jan 2006

Running a large farm of heavily modified MailScanner instances for high-volume customers exposes a number of problems when dealing with spammers.

Recently, some of the more irritating spammers appear to be leaning on bounces for delivery from reputable sources rather than proxies or direct delivery from botnets. For this kind of spammer, connecting to your load balanced cluster and delivering 10,000 messages in a single SMTP conversation isn't out of the ordinary.

For most customers, you verify the recipients enumerated to make sure each one exists for delivery before accepting the messages.

If your customer doesn't want to reject any received mail for their domain for whatever reason (argh!), but just wants you to identify spam for them and ship it off to a spam jail so their (broken) mail system won't choke on the volume, you end up queueing every one of those 10,000 messages to process, whether there is a real person to receive it or not.

It might help in this instance to use the greylisting trick of initially returning a "temporary error" for each incoming message. Blind spammers simply don't seem to want to re-send a temporarily rejected message. Real mail servers don't have this problem, and happily retry 5-15 minutes later without issue.

If your customer also doesn't understand the value of greylisting, and is adamant against rejecting mail for whatever reason, you're stuck in a rather messy position. All mail must be accepted, queued, and processed.

This means, across your farm of hundreds of mailscanner servers, you end up with a handful of servers chewing through 10,000+ message backlogs, while the rest of the servers in your cluster are chewing through nothing. After moving those queue files between servers for a while, you realize that you're losing the battle....

That is, unless you're clever.

Enter: the greypit.

The idea behind greypitting is to try and balance incoming messages across your load balanced cluster of mail servers.

It would be nice to accept only the first N messages in an SMTP connection. Once the limit is reached, simply return a "temporary failure" to the remote MTA. Actual RFC valid MTAs will give up and attempt to retry in the near future (5-15 minutes).

Greylisting is the act of returning these temporary failures immediately, but recording the IP/sender/recipient triplet and accepting the messages when they are resent. A greypit keeps no triplet, and accepts the first N messages without interfering at all.
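For comparison, the greylisting side of that distinction can be sketched in a few lines of Python. This is a generic illustration, not code from any particular greylisting package; the class name, the `min_delay` parameter, and the injectable clock are all assumptions.

```python
import time


class Greylist:
    """Tempfail a triplet on first sight, accept it once it is retried later."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # seconds a sender must wait before retrying
        self.clock = clock
        self.first_seen = {}        # (ip, sender, recipient) -> first attempt time

    def check(self, ip, sender, recipient):
        triplet = (ip, sender, recipient)
        now = self.clock()
        seen = self.first_seen.setdefault(triplet, now)
        if now - seen >= self.min_delay:
            return "accept"
        return "tempfail"
```

The state table is exactly what a greypit does without: it needs nothing beyond a per-connection counter.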

The fun bit here: the spammers ignore the SMTP return codes. They happily continue to blindly hammer away sending message after message, each of which is rejected with a temporary failure.

So, after the first "temporary failure" result, what if we start to sleep a given number of seconds for each successive MAIL FROM: beyond the initial limit of 10 messages (in addition to not accepting those messages for delivery)? The spammers get a taste of tarpit.

The cluster now balances evenly. Only 10 messages from that evil spammer actually make it through to be scanned, the remainder cause the spammer to tarpit themselves into oblivion.
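The per-connection logic described above fits in a few lines. This Python sketch is not greypit.c; it only models the decision made at each MAIL FROM:, and the class name, the `delay` parameter, and the return shape are assumptions. The caller would sleep for the returned number of seconds before replying.

```python
class GreypitSession:
    """Decision state for a single SMTP connection."""

    def __init__(self, limit=10, delay=5.0):
        self.limit = limit  # messages accepted before tempfailing
        self.delay = delay  # seconds to stall each MAIL FROM after the first tempfail
        self.seen = 0

    def on_mail_from(self):
        """Return (action, seconds_to_sleep) for the next MAIL FROM:."""
        self.seen += 1
        if self.seen <= self.limit:
            return ("accept", 0.0)
        if self.seen == self.limit + 1:
            # First message over the limit: tempfail right away.
            return ("tempfail", 0.0)
        # Sender ignored the tempfail and kept going: tarpit it.
        return ("tempfail", self.delay)
```

A well-behaved MTA sees at most one immediate tempfail and retries later on a fresh connection; only a sender that ignores the return code ever hits the delay.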

So, here is the Sendmail Milter source for a little project I call greypit:

greypit.c - v1.0 C daemon source.

To build the daemon, link with libmilter and libpthread:

gcc -o greypit greypit.c -lmilter -lpthread

You will also need to add a line to sendmail.mc to use this milter:

INPUT_MAIL_FILTER(`greypit', `S=local:/var/run/greypit/sock, F=')

And create the /var/run/greypit directory for the socket.

Simple. Elegant. Hacktastik. But it works.

Mon, 12 Dec 2005

I've built a set of CME-681 rules to catch the six common English messages:

cme-681.cf

This particular Sober worm is also known as:

Sat, 03 Dec 2005

Luke Kanies mentioned that someone pointed him toward Freeride's Freebase "bus".

Looking into it, it's a neat programming model, though the bus doesn't seem to address transport issues or be intended to run across more than one machine. For that matter, it doesn't appear Freebase persists queues at all between restarts. It has a neat plugin architecture though allowing for easy extensions, and the Slot abstraction is a good idea.

Documentation is sparse, though; reading the code is the best way to grok it.

As a reliable transport for messages on a system, I'm still most interested in Assaf Arkin's reliable-msg library.

Using Freebase with a Slot handler for a reliable-msg Queue would be neat though. I'm digging through Freebase and reliable-msg now to see if I can devise something.

Yes, this is getting into implementation rather quickly. So many alternatives and variables.

The idea is to get something going, to keep up the momentum. Release early, release often...

Thu, 01 Dec 2005

There has been much discussion on the SAGE config-mgmt list regarding Luke Kanies' effort toward a message bus for Puppet.

A number of folks in #puppet on irc.projects.net continue to talk about this new message bus and what it should accomplish.

The goals are:

  1. Create a message router that will allow agents to subscribe to message feeds
  2. Write agents for each subsystem or opensource component engine that either publish or subscribe to those messages.

These messages may range from simple syslog messages and SNMP traps to system stats, IDS alerts, netflow log snippets, or anything else that might concern a sysadmin.
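As a rough picture of the router in goal 1, here is an in-memory publish/subscribe sketch in Python. It is not a design anyone on the list proposed; the class and method names are made up, and it ignores transport, persistence, and message format entirely.

```python
from collections import defaultdict


class Router:
    """Deliver each published message to every handler subscribed to its feed."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # feed name -> list of callables

    def subscribe(self, feed, handler):
        self.subscribers[feed].append(handler)

    def publish(self, feed, message):
        """Send message to the feed's subscribers; return how many received it."""
        handlers = self.subscribers.get(feed, [])
        for handler in handlers:
            handler(message)
        return len(handlers)
```

An agent wrapping syslog would call `publish("syslog", line)`, while a ticketing agent would `subscribe("syslog", open_ticket)`; the feed names and `open_ticket` are placeholders.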

The Runnel message bus needs two things:

  1. A data abstraction for the messages
  2. A transport to ship them around reliably.

For data abstraction, the primary contenders are RDF and microformats. Luke posted to the microformats list asking for their input on this matter.

For transport, we would like to keep it simple yet allow for the messages to be transported over any network topology. This might range from messages transported over SMTP to messages sent over a direct TCP socket connection between an agent and the router.

The goal is not to make a message bus that will solve any computing problem generically. We're not trying to rework MQSeries here. In the end, this will solve a problem for Puppet, and potentially open up an avenue for communication between disparate systems like Request Tracker (RT), Nagios, SEC, and any other subsystems we can build agents to communicate over the bus.

Wed, 30 Nov 2005

Copy on Write (CoW)

First off, let's decide how we're going to build our filesystems. While there is CopyOnWrite (CoW) support (LVM writable persistent snapshots), it isn't 100% reliable yet, and doesn't handle out-of-space conditions very well. Because of this, I am going to avoid using it.

That doesn't mean we shouldn't understand it a bit first though:

Creating the "virgin" backing store volume:

        # lvcreate -n virgin -L 4G vg
        # mkfs -t xfs /dev/vg/virgin
        # mount /dev/vg/virgin /mnt
        # debootstrap sarge /mnt http://source.rfc822.org/debian
        # vi /mnt/etc/fstab 
        # umount /mnt

Creating a clone filesystem:

        # lvcreate -s -n myclonedisk1 -L 1G /dev/vg/virgin

This new volume ("myclonedisk1") can handle up to 1G of "block differences" before it runs out of space. To that end, you will need to periodically grow the block device depending on the space remaining:

        # lvextend -L +1G /dev/vg/myclonedisk1

Can you see the danger here? For each clone disk snapshot, you will need to monitor the space used to see if enough space remains, and grow it whenever the space approaches some kind of threshold. If something goes crazy and rapidly makes changes to a filesystem, you may not catch the change in time with a monitoring script in dom0, and you may get a fatally corrupted volume in the process.
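The decision half of such a monitoring script might look like the Python sketch below. How the usage percentages are collected is deliberately left out, and the function names, the 80% threshold, and the 1G step are all assumptions. The race described above still applies no matter how often this runs.

```python
def snapshots_to_grow(usage_percent, threshold=80.0):
    """Return names of snapshots whose CoW space usage is at or over threshold.

    usage_percent maps snapshot LV name -> percent of snapshot space used.
    """
    return sorted(name for name, pct in usage_percent.items() if pct >= threshold)


def lvextend_command(vg, lv, step="1G"):
    """Build the argv that grows one snapshot by `step`."""
    return ["lvextend", "-L", f"+{step}", f"/dev/{vg}/{lv}"]
```

A cron job in dom0 could pass each resulting argv to `subprocess.run`.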

For this reason, I am avoiding it.

XenU RAID1 vs dm-mirror

Rather than use the somewhat experimental dm-mirror support for mirrored volumes, we're going to leave the mirroring up to the XenU domains to do themselves.

Let's create a domain that runs on "node0", the first cluster node:

Create some volumes.

        # lvcreate -n blenke-web-00_mirror0 -L 4G vg /dev/md3
        # lvcreate -n blenke-web-00_mirror1 -L 4G vg /dev/etherd/e0.1

Fill the primary volume:


        # mkfs -t xfs /dev/vg/blenke-web-00_mirror0
        # mount /dev/vg/blenke-web-00_mirror0 /mnt
        # debootstrap sarge /mnt http://source.rfc822.org/debian
        # vi /mnt/etc/fstab
        # echo blenke-web-00 > /mnt/etc/hostname

Rather than running debootstrap for every domain, I strongly suggest doing this once and rsyncing other images from this base tree somewhere in your management infrastructure.

Now that the volumes exist, here is a XenU configuration that would use these volumes:

        # cat - <<EOF > /etc/xen/auto/blenke-web-00
        kernel = "/boot/vmlinuz-2.6-xenU"
        memory = 64
        cpu = -1 # Xen should allocate a proc to run on.
        vcpus = 1 # We only want 1 CPU for this domain (Xen 3.0 SMP!)
        name = "blenke-web-00"
        nics = 1
        vif = [ 'mac=aa:00:0a:00:00:0a, bridge=xenbr0' ]
        ip = "10.0.0.10"
        disk = [ 'phy:vg/blenke-web-00_mirror0,sda1,w',
                 'phy:vg/blenke-web-00_mirror1,sda2,w' ]
        root = "/dev/md0 ro"
        EOF

(more to come)

Wed, 30 Nov 2005

This is a summary of the GFS wiki instructions, as applied to our new cluster.

First, get fenced running:

        # fence_tool join

Next, create the GFS filesystem:

        # gfs_mkfs -p lock_dlm -t <ClusterName>:<FSName> -j <Journals> <Device>

        <ClusterName> must match the cluster name used in CCS config
        <FSName> is a unique name chosen now to distinguish this fs from others
        <Journals> the number of journals in the fs, one for each node to mount
        <Device> a block device, usually an LVM logical volume

For a 2 node setup ("node0" and "node1"), you might use:

On node0:

        # lvcreate -n shared_node0 -L 10G vg /dev/md3
        # lvcreate -n shared_node1 -L 10G vg /dev/etherd/e0.1

        # gfs_mkfs -p lock_dlm -t blenke:shared_node0 -j 2 /dev/vg/shared_node0
        # gfs_mkfs -p lock_dlm -t blenke:shared_node1 -j 2 /dev/vg/shared_node1

On both:

        # mkdir -p /shared/node0 /shared/node1
        # mount /dev/vg/shared_node0 /shared/node0
        # mount /dev/vg/shared_node1 /shared/node1
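Since the journal count has to track the number of nodes that will mount the filesystem, it can help to derive the gfs_mkfs arguments from the node list rather than typing "-j 2" by hand. This is a small Python sketch with a made-up function name; it only builds the argument list and runs nothing.

```python
def gfs_mkfs_command(cluster, fs_name, nodes, device):
    """Build a gfs_mkfs argv with one journal per node that will mount the fs."""
    return [
        "gfs_mkfs",
        "-p", "lock_dlm",
        "-t", f"{cluster}:{fs_name}",
        "-j", str(len(nodes)),
        device,
    ]
```

Adding a third node to the list then changes the journal count in one place.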

Remember: GFS filesystems, while accessible by both nodes, ARE NOT MIRRORED. You create the GFS filesystem on a shared block device. If the block device happens to be on one server or the other, when that server is rebooted, the other nodes will be unable to access that filesystem.

For cluster mirroring, look at dm-mirror and the lvcreate -m option. The dm-mirror kernel module is made up of dm-raid1 and dm-log, which Red Hat is working on right now (see LVM2 Mirroring for RHEL4). Currently only pvmove and lvcreate -m use this kernel module (if you have a recent lvm2 build), and you're really on your own.

If you have a cluster of at least 3 nodes (at least 3 PVs in the cluster VG), you can create a mirrored volume. One PV will get one half of the mirror, one PV will get the other half of the mirror, and one PV will get the mirror log volume.

        # lvcreate -m 1 -n mirror1 --alloc anywhere -L 4G vg
        Logical volume "mirror1" created
        # lvscan
        ACTIVE            '/dev/vg/mirror1' [4.00 GB] anywhere
        ACTIVE            '/dev/vg/mirror1_mlog' [4.00 MB] anywhere
        ACTIVE            '/dev/vg/mirror1_mimage_0' [4.00 GB] inherit
        ACTIVE            '/dev/vg/mirror1_mimage_1' [4.00 GB] inherit

Wed, 30 Nov 2005

First, create a Physical Volume for the local RAID10 stripe, then for the remote RAID10 stripe via AoE:

    pvcreate /dev/md3
    pvcreate /dev/etherd/e0.1

This is where that extra RAID stripe comes in. The first pv is for the stripe on this cluster node, the second is for the stripe on the other cluster node.

Next, create a Volume Group that contains both Physical Volumes:

    vgcreate vg /dev/md3 /dev/etherd/e0.1

This creates a "vg" volume group that is visible from both cluster nodes, where volumes can be carved out as needed between them.

(Note: This does not mirror the pv's. That's what the -m flag to lvcreate is for. Alternatively, the XenU domain must do software RAID1 to accomplish this goal.)

Wed, 30 Nov 2005

lvm2 is an entirely userspace abstraction that uses the devmapper kernel module to present volumes carved out of physical block device space.

lvm2 has a cluster manager called "clvmd" that registers with cman to communicate with other cluster nodes to act in a cluster configuration. With clvmd, lvm2 becomes a cluster-wide naming system for volumes carved up out of network exposed block devices, and a locking engine for the same.

        # apt-get install lvm2

Or build from CVS:

        # cvs -d :pserver:cvs@sources.redhat.com:/cvs/lvm2 login cvs
        # cvs -d :pserver:cvs@sources.redhat.com:/cvs/lvm2 checkout LVM2
        # cd LVM2 ; ./configure --with-clvmd=cman --with-confdir=/etc/lvm --prefix=/usr && make && make install

After the cluster is configured and running ("ccsd" and "cman"), and lvm2 is installed, we need to edit /etc/lvm/lvm.conf to make this a cluster aware setup.

        # vi /etc/lvm/lvm.conf

In devices {}, Add:

        filter = [ "a|/dev/etherd/*|" ]
        types = [ "aoe", 1024 ]
        sysfs_scan = 0

In global {}, comment out:

        # locking_type = 1

just below that, in global {}, uncomment or add:

        locking_library = "liblvm2clusterlock.so"
        locking_type = 2
        library_dir = "/lib/lvm2"

Then save, and start up clvmd (make sure cman is running first, and the node is part of the cluster):

        # clvmd &

You can now scan for volume groups:

        # vgscan

NOTE: lvm2 does not scan AoE devices by default. In fact, if sysfs scanning is enabled it will not find AoE devices at all, even if you add a filter that matches them (hence sysfs_scan = 0 above). Moreover, lvm2 will only find AoE devices with the major number listed in /proc/devices:

        # grep aoe /proc/devices
        152 aoechr
        152 aoe

This means that all of the AoE devices you wish to scan must start with a major number of 152. If you look at /dev/etherd, you will see 16 "partition" devices for each shelf/slot device by default. Using 16 partitions, as AoE assigns minor numbers linearly, the crossover to major 153 happens just after "e1.5p14". This means that you really only have all of one shelf visible to lvm2, and part of a second (a maximum of 16 devices.. not good for a large cluster of more than 16 nodes).
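The arithmetic behind that limit is short. The Python sketch below assumes the classic 8-bit minor number space (256 minors per major), which is what the 16-device figure above implies; the function name is made up.

```python
MINORS_PER_MAJOR = 256  # assumption: 8-bit minor numbers, 0 through 255


def devices_per_major(partitions_per_device):
    """How many AoE shelf/slot devices fit under a single major number."""
    return MINORS_PER_MAJOR // partitions_per_device
```

With the default of 16 partition slots per device this gives 16 devices; with a single slot per device it gives 256.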

One "fix" is to edit drivers/block/aoe/aoe.h in your kernel source and replace "AOE_PARTITIONS 16" with "AOE_PARTITIONS 1":

        # perl -pi -e 's/(AOE_PARTITIONS 1)6/$1/g' drivers/block/aoe/aoe.h

Alternatively, set AOE_PARTITIONS=1 when building your kernel:

        # make ARCH=xen AOE_PARTITIONS=1 oldconfig clean bzImage modules modules_install

Rebuild your kernel, then re-generate your /dev/etherd devices using the n_partitions variable:

        # n_partitions=1 aoe-mkdevs /dev/etherd

This really fixes the problem, and lvm2 can scan all of the AoE shelf/slot devices!

Wed, 30 Nov 2005

When configuring the RedHat clustering, you must create a cluster.conf which will exist on every node.

 # vi /etc/cluster/cluster.conf

This is an example 2 node configuration, with manual fencing:

        <?xml version="1.0"?>
         <cluster name="blenke" config_version="1">
         <clusternodes>
          <clusternode name="smart" nodeid="1" votes="1">
           <fence>
            <method name="human">
             <device name="last_resort" ipaddr="smart.ssn.blenke.net"/>
            </method>
           </fence>
          </clusternode>
          <clusternode name="stupid" nodeid="2" votes="1">
           <fence>
            <method name="human">
             <device name="last_resort" ipaddr="stupid.ssn.blenke.net"/>
            </method>
           </fence>
          </clusternode>
         </clusternodes>
         <fencedevices>
          <fencedevice name="last_resort" agent="fence_manual"/>
         </fencedevices>
         <cman port="6809" two_node="1" expected_votes="1">
         </cman>
        </cluster>
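
Every device named inside a clusternode's fence block has to match a fencedevice defined further down, and it is easy to get the two out of step. Here is a rough sketch of a check for that. The function name check_fence_refs is made up for this example, and it assumes one XML element per line, as in the listing above. It is a grep-level sanity check, not an XML parser.

```shell
#!/bin/sh
# check_fence_refs FILE: every <device name="..."> used by a node should
# match some <fencedevice name="..."> definition in the same file.
# Assumes one element per line; prints each unmatched name.
check_fence_refs() {
    conf="$1"; rc=0
    defined=$(sed -n 's/.*<fencedevice[[:space:]][^>]*name="\([^"]*\)".*/\1/p' "$conf")
    for used in $(sed -n 's/.*<device[[:space:]][^>]*name="\([^"]*\)".*/\1/p' "$conf" | sort -u); do
        if ! printf '%s\n' "$defined" | grep -Fqx -- "$used"; then
            echo "undefined fence device: $used"
            rc=1
        fi
    done
    return $rc
}
```

Run it against the file on each node before starting anything:

        # check_fence_refs /etc/cluster/cluster.conf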

   

Once the config file is created, we start ccsd. The ccs daemon keeps the configuration in sync between cluster nodes.

    # /etc/init.d/ccsd start

Next, join the cluster with cman. The cman kernel module is the cluster manager. It uses dlm locking and a heartbeat thread to form a quorum of nodes that are part of the cluster.

    # cman_tool join

This will join, or create, a cluster.

Wed, 30 Nov 2005

On a cluster server, the goal is to share storage with other nodes in the cluster.

Each cluster server node is going to share the entire /dev/md3 stripe as a single large block device to the other clvm'ed nodes.

Each "shared" cluster stripe will be defined as an AoE shelf/slot.

# vblade 0 0 eth1 /dev/md3

This will create a device "/dev/etherd/e0.0" shared over the eth1 network interface between the cluster nodes on the shared private storage network. Only the other nodes will see this device; locally, you must continue to reference it as /dev/md3. LVM2 will automagically scan this device and include it when re-assembling the cluster volume group on boot.

For production use, as vblade doesn't fork, the easiest way to keep vblade running is to add it to inittab as respawn.

on node0:

# echo "e0:2:respawn:/usr/sbin/vblade 0 0 eth1 /dev/md3" >> /etc/inittab
# init q

on node1:


# echo "e1:2:respawn:/usr/sbin/vblade 0 1 eth1 /dev/md3" >> /etc/inittab
# init q
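
Appending with echo is fine once, but run it twice and you get duplicate inittab ids. Here is a small sketch that builds the same line from shelf, slot, interface and device, and only appends it if it is not already there. Both function names are made up for this example; the "e<slot>" id just mirrors the two entries above (inittab ids need to be unique and short).

```shell
#!/bin/sh
# vblade_inittab_line SHELF SLOT IFACE DEVICE
# Prints a respawn entry in the same shape as the ones above.
vblade_inittab_line() {
    echo "e$2:2:respawn:/usr/sbin/vblade $1 $2 $3 $4"
}

# add_inittab_line FILE LINE: append LINE to FILE unless an identical
# line is already present.
add_inittab_line() {
    grep -Fqx -- "$2" "$1" 2>/dev/null || echo "$2" >> "$1"
}
```

On node1 that would look like:

        # add_inittab_line /etc/inittab "$(vblade_inittab_line 0 1 eth1 /dev/md3)"
        # init q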

You should see output from vblade starting up appear in /var/log/daemon. On the other node, you should be able to run aoe-discover and then see the device in aoe-stat:

# aoe-interfaces eth1
# aoe-discover
# aoe-stat
e0.0       306.440GB   eth1 up

Note: as this is at the end of /etc/inittab, and running in runlevel 2, the rc2 script will need to finish first before init starts respawning vblade. To expose the aoe device to the network before this point (if you really must), just put this line before the rc2 line in /etc/inittab.

Wed, 30 Nov 2005

Both aoetools and vblade (the ATA over Ethernet target) have Debian packages. If you can't apt-get install them straightaway, drop me an email, and I'll post the backports of these to woody.

This step needs a bit more documentation (will fill it in shortly).

Wed, 30 Nov 2005

During a "make dist", the build process looks in xen-unstable/dist/install/boot/config-2.6.12.6-xen0 (or -xenU) for the config file to use, and will override the default w/ those files if they exist.

It's generally best to remove the xen-unstable/linux-2.6.12-xen? directories between builds if the Xen tree has been updated; safer that way.

You will need to change the Xen0 2.6.12 kernel so that it builds with devmapper (dm) support, and ATA over Ethernet (AoE):


# cd xen-unstable/linux-2.6.12-xen0
# make ARCH=xen menuconfig clean bzImage modules
# cp -f arch/i386/boot/bzImage /boot/vmlinuz-2.6.12.6-xen0
# cp -f System.map /boot/System.map-2.6.12.6-xen0
# cp -f .config /boot/config-2.6.12.6-xen0

Once you're done rebuilding and preparing to install your kernel, you will also need to rebuild the "dlm" and "cman" kernel modules:

# cd cluster ; ./configure --kernel_src=`pwd`/../xen-unstable/linux-2.6.12-xen0
# make install

You will also need to add a boot menu option for this Xen kernel using the Xen 3.0 hypervisor:


# vi /boot/grub/menu.lst

Add a section like so:


title Xen 3.0 / XenLinux 2.6.12.6
kernel /boot/xen-3.0.gz dom0_mem=256000 console=vga apic_verbosity=verbose noapic
module /boot/vmlinuz-2.6.12.6-xen0 root=/dev/md0 noapic ro console=tty0

Note: this is why we don't use lilo. Getting lilo to work with command line arguments for both kernel (append=) and module (initrd=) is only the beginning of the pain. Use grub. Be happy.
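
A menu.lst entry that points at a file that isn't in /boot is a good way to end up in front of a console. Here is a rough sketch of a pre-reboot check. The function name check_menu_paths is made up for this example, and it assumes the kernel and module lines use plain paths like the ones above, not the "(hd0,0)/..." form.

```shell
#!/bin/sh
# check_menu_paths MENU [ROOT]: for every "kernel" or "module" line in
# MENU, verify that the referenced file exists under ROOT (default: /).
# Prints each missing path and returns non-zero if any are missing.
check_menu_paths() {
    menu="$1"; root="${2:-}"; rc=0
    for path in $(awk '$1 == "kernel" || $1 == "module" { print $2 }' "$menu"); do
        if [ ! -e "$root$path" ]; then
            echo "missing: $path"
            rc=1
        fi
    done
    return $rc
}
```

For example:

        # check_menu_paths /boot/grub/menu.lst && echo "safe to reboot"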

You are now ready to reboot with a cluster-ready Xen kernel.

Wed, 30 Nov 2005

To build the source below, we will need a compiler, and cvs for the source checkouts.

# apt-get install gcc-3.4-dev libc6-dev cvs

Xen has a few dependencies:

# apt-get install libncurses5-dev bridge-utils hotplug iproute python2.3-dev zlib1g-dev

If you want to build the documentation as well, you'll need a few more (tetex, "ps2pdf" from gs-common, and "fig2dev" from transfig, and a recent version of perl with pod2man that supports the --name option).

# apt-get install tetex gs-common transfig perl

Now, grab the Xen "unstable" release and extract it. This includes a 2.6.12 kernel, which is required by the RedHat cluster tools (which we will discuss below).

# wget http://www.cl.cam.ac.uk/Research/SRG/netos/xen/downloads/xen-unstable-src.tgz
# tar xvzf xen-unstable-src.tgz

Now, build the userspace Xen tools and an initial Dom0 kernel (we will rebuild it in the next step, so don't worry too much about the .config file right now):

# cd xen-unstable
# make dist (everything builds)
# ./install.sh
Installing Xen from './dist/install' to '/'...
All done.
Checking to see whether prerequisite tools are installed...
Xen CHECK-INSTALL  Wed Nov 23 22:46:09 EST 2005
Checking check_brctl: OK
Checking check_hotplug: OK
Checking check_iproute: OK
Checking check_python: OK
Checking check_zlib_lib: OK
All done.
# make install

Other bits that probably aren't required anymore:

Now we're done with the Xen kernel and userspace tools. Let's move on to the RedHat cluster tools to build against the Xen Dom0 kernel.

The stable RedHat cluster tools can be grabbed via CVS:

# cvs -d :pserver:cvs@sources.redhat.com:/cvs/cluster login cvs
Password: {enter "cvs"}
# cvs -d :pserver:cvs@sources.redhat.com:/cvs/cluster checkout -r STABLE cluster

When we build the cluster tools, we want to point the build at the source tree for the Xen Dom0 kernel so that it builds the appropriate kernel modules.

First, some dependencies:

# apt-get install libxml2-dev

Then a small fix to get around the fact that a glibc 2.2 doesn't have an ifaddrs.h or getifaddrs()/freeifaddrs(). You don't need to do this if you're running a glibc 2.3 or later system:

# cat > /usr/include/ifaddrs.h <<EOF
#define getifaddrs(x)   -1
#define freeifaddrs(x)

struct ifaddrs {
    struct  ifaddrs *ifa_next;
    char    *ifa_name;
    struct sockaddr *ifa_addr;
};
EOF

Yeah, it's an ugly hack, but it fixes our woody enough to allow this to build. I'm a bad bad sysadmin.
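
If you script this, it's worth making sure the stub never clobbers a real header on a glibc 2.3 box. Here is a minimal sketch that only writes the file when nothing is there yet. The function name write_ifaddrs_stub is made up for this example; the header contents are the same as above.

```shell
#!/bin/sh
# write_ifaddrs_stub PATH: create the stub header only when PATH does
# not exist, so a real ifaddrs.h is never overwritten.
write_ifaddrs_stub() {
    target="$1"
    if [ -e "$target" ]; then
        echo "$target already exists, leaving it alone"
        return 0
    fi
    cat > "$target" <<'EOF'
#define getifaddrs(x)   -1
#define freeifaddrs(x)

struct ifaddrs {
    struct  ifaddrs *ifa_next;
    char    *ifa_name;
    struct sockaddr *ifa_addr;
};
EOF
}
```

Then:

        # write_ifaddrs_stub /usr/include/ifaddrs.h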

In the latest CVS checkout, I also had to add an #include back to the top of cluster/cman/lib/libcman.c:

#include "libcman.h"

Then we build:

# cd cluster
# ./configure --kernel_src=`pwd`/../xen-unstable/linux-2.6.12-xen0
# make install

Now the software is ready. Both the Xen tools and the RedHat cluster tools are installed, and the Xen hypervisor and Dom0 kernel are built with the RedHat cluster kernel modules.

Wed, 30 Nov 2005

I use a Debian-based distro that I maintain in-house with an extensive hand-maintained repository of backports.

The auto-install platform is roughly based on the SystemImager package, only heavily hacked to simplify maintenance and unify the install script across all of our builds in a flexible way (I hope to open-source it here some day).

I strongly recommend that you have a running filesystem for root (/), usr, and var, that are NOT encapsulated with lvm. You will understand why later. This would be a slightly different layout than our standard NKS setup:


 /dev/md0 - RAID1 - root (/) (1G)
 /dev/md1 - RAID10 - /usr (4G)
 /dev/md2 - RAID10 - /var (16G)
 /dev/md3 - RAID10 - everything else.

You can do the following manually with a Knoppix CD if you really want to:

On a 4 drive Parallel ATA (PATA) setup you can generate the above using:


$ cat - <<EOF | sfdisk /dev/hda
0,500,fd,*
,1000,82
,4000,83
,,5
,8000,83
,,83
EOF

Cryptic, yes, but simple.

Repeat for each drive to partition. Then follow with mdadm to build the arrays:


# /sbin/mdadm --create /dev/md0 --force --run --level 1 --chunk 128 \
    --raid-devices 4 /dev/hda1 /dev/hdb1 /dev/hdc1 /dev/hdd1
# /sbin/mdadm --create /dev/md1 --force --run --level 10 --chunk 128 \
    --raid-devices 4 /dev/hda3 /dev/hdb3 /dev/hdc3 /dev/hdd3
# /sbin/mdadm --create /dev/md2 --force --run --level 10 --chunk 128 \
    --raid-devices 4 /dev/hda5 /dev/hdb5 /dev/hdc5 /dev/hdd5
# /sbin/mdadm --create /dev/md3 --force --run --level 10 --chunk 128 \
    --raid-devices 4 /dev/hda6 /dev/hdb6 /dev/hdc6 /dev/hdd6
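
The four commands differ only in the md number, the RAID level and the partition number, so the pattern can be sketched as a small generator. This is a dry-run helper that prints the command instead of running it, so you can eyeball it first. The function name mdadm_create_cmd is made up for this example; the mdadm flags are the same ones used above.

```shell
#!/bin/sh
# mdadm_create_cmd MD LEVEL PARTITION DRIVE...
# Prints (does not run) the mdadm command for one array built from the
# same partition number on every listed drive.
mdadm_create_cmd() {
    md="$1"; level="$2"; part="$3"; shift 3
    members=""
    for drive in "$@"; do
        members="$members /dev/$drive$part"
    done
    echo "/sbin/mdadm --create /dev/md$md --force --run --level $level --chunk 128 --raid-devices $#$members"
}
```

The layout above then becomes:

        # mdadm_create_cmd 0 1  1 hda hdb hdc hdd
        # mdadm_create_cmd 1 10 3 hda hdb hdc hdd
        # mdadm_create_cmd 2 10 5 hda hdb hdc hdd
        # mdadm_create_cmd 3 10 6 hda hdb hdc hdd

Pipe the output to sh once you're happy with what it prints.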

Keeping with this scheme, booting single user, or init=/bin/bash, should give you at least md0 from which you can mount md1 and md2 to do rescue operations. This should be enough to fix most server deaths with RAID1 and without worrying about LVM.

Now format those arrays:


 # mke2fs -j /dev/md0
 # mkfs.xfs /dev/md1
 # mkfs.xfs /dev/md2

And mount them:


# mkdir /target
# mount /dev/md0 /target
# mkdir /target/usr
# mount /dev/md1 /target/usr
# mkdir /target/var
# mount /dev/md2 /target/var

Then fill it with debootstrap (or rsync, or whatever):


# debootstrap sarge /target http://source.rfc822.org/debian

Now edit /target/etc/fstab:


/dev/md0 / ext3 defaults 0 0
/dev/md1 /usr xfs defaults 0 0
/dev/md2 /var xfs defaults 0 0

and install a kernel (this is temporary):


# cp /etc/resolv.conf /target/etc/resolv.conf
# chroot /target apt-get update
# chroot /target apt-get install kernel-image-2.6.8

Now install grub as the MBR on all drives. Make them all bootable as hda, in case hda should die. NOTE: We do not use lilo, as it cannot handle booting the Xen hypervisor and Xen kernels without some ugliness.


# chroot /target apt-get install grub
# mkdir /target/boot/grub
# cp -a /lib/grub/i386-pc/ /target/boot/grub/
# cp /target/usr/share/doc/grub/examples/menu.lst /target/boot/grub/menu.lst
# grub
grub> root (hd0,0)
grub> setup (hd0)
grub> setup (hd1)
grub> setup (hd2)
grub> setup (hd3)

Edit your /target/boot/grub/menu.lst so that it points to the kernel.

Now you should have a bootable system. Unmount the /target mounted filesystems and reboot.

You should now be running a base install of a distribution of Linux on your server that boots with grub and has an unused md storage device that spans the majority of free space (/dev/md3). Xen requires the former, and lvm2/aoe will require the latter.

Tue, 29 Nov 2005

When building any Linux cluster, the first step is laying out the topology and shared storage.

To keep things simple, fast, and cheap, ATA over Ethernet (AoE) is really the best solution available at the moment.

For simplicity, each server in the cluster will be given two network interfaces. An "internal" protected storage network, and an "external" firewalled public network.

Tue, 29 Nov 2005

The goal: Make a manageable cluster of machines work together to provide 99.999% availability for a set of virtual machines in the fastest way possible with current cheap commodity hardware.

To this end, I've put a bit of energy into building a simple Xen cluster. This whitepaper is an attempt to document the effort.

Xen is a hypervisor. Think of it as a microkernel done right. There are Linux, NetBSD, and even OpenSolaris ports that run under the Xen hypervisor. The "host" machine is Domain 0 (Dom0), and is responsible for talking to hardware on the box and configuring and booting the Domain User (DomU) slices. Don't be confused by Dom0, however; the Xen hypervisor is the magician behind the scenes making this possible.

Xen 3.0 has migration features: you can move a Xen DomU instance between physical Xen servers. To do this, however, you need a shared storage system, or some method of NAS/SAN visible to all nodes in the cluster.

RedHat has a wonderful clustering platform with native clustered support for LVM2. Instead of GNBD, however, I've decided to use ATA-over-Ethernet for simplicity and speed. With this, we have a clusterable group of machines that share a common storage namespace (and can access each other's storage directly via the network), permitting native Xen domain migration.

The following guides formed the basis of the above decision:
