Al-Adwan: And we do look at our relationship with the SQL Server team as the blueprint for building the rest of our relationships with Microsoft.
Tevelde: One helpful thing is trending. We sample all the information we’ve accumulated every 15 minutes and keep the samples in a data warehouse on a separate server for three months. Even though the data is being taken in a reactive state, we can mark the trends and flip to be more proactive. The only issues that cause a serious outage now are hardware issues—where a hard drive goes bad inside the box, or where something you can't predict goes bad. Based on our scripts, if a box goes down, we move the logs to another box, and get up and running in about 20 minutes.
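The trending approach Tevelde describes (sample every 15 minutes, keep three months of history, watch the trend) might look roughly like the sketch below. It is not MySpace's actual tooling: the function names, the 90-day approximation of "three months," and the least-squares fit are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

SAMPLE_INTERVAL = timedelta(minutes=15)  # sampling period quoted in the interview
RETENTION = timedelta(days=90)           # assumption: "three months" taken as 90 days


def prune(samples, now):
    """Drop (timestamp, value) samples older than the retention window."""
    cutoff = now - RETENTION
    return [(ts, value) for ts, value in samples if ts >= cutoff]


def slope_per_hour(samples):
    """Least-squares slope of value against time, in units per hour."""
    if len(samples) < 2:
        return 0.0
    t0 = samples[0][0]
    xs = [(ts - t0).total_seconds() / 3600 for ts, _ in samples]
    ys = [value for _, value in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    denom = sum((x - mean_x) ** 2 for x in xs)
    if denom == 0:
        return 0.0
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / denom


def hours_until(samples, limit):
    """Hours until the metric reaches `limit`, extrapolating from the latest
    sample at the fitted slope. Returns None if the metric is not rising."""
    slope = slope_per_hour(samples)
    if slope <= 0:
        return None
    return max(0.0, (limit - samples[-1][1]) / slope)
```

Fed a hypothetical metric such as disk usage, `hours_until` turns reactive samples into an early warning: a rising slope gives an estimate of when a limit will be hit, which is the "flip to be more proactive" idea in miniature.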
Al-Adwan: These scripts are all automated. Hardware failures are our biggest problem because we're on Standard Edition and not all of our infrastructure is clustered. Once we move to the clustering model, that's going to remove the tight coupling we have right now between a database and a specific server, and give us the flexibility to recover faster. We're really excited about that.
SQL Server Magazine: How many companies would you say are at this level, using as many servers as MySpace?
Stelzmuller: Even in smaller implementations you encounter the same problems, just not as often as you do at the scale of servers that we have. We have to work hard to build all of these solutions because the problems happen often. It's a resource issue. But the tools we have would be useful to smaller implementations.
Tevelde: Take our DBA monitoring scripts: Whether you have 100 servers or one server, all these scripts are installed on every single box. So if all of our boxes were to drop offline, we'd still use these monitoring scripts to do exactly what we're doing for predictive analysis and problem solving.
Stelzmuller: Whether you're dealing with five DL585s or 100 DL585s, you're still going to face the same type of limitations in a single server or implementation. We just happen to have more iterations.
SQL Server Magazine: So the scale you're operating at causes problems to surface more often than at a small-to-midsized company?
Tevelde: Right, if you have one or two boxes and there are 10 people using them, you're going to run into these problems maybe once a year. But when you have 400 boxes and they're being pushed to their limits, once or twice a week we'll see four or five of the same major issues. But because of these scripts, we're able to identify what these issues are. After we find the workaround, we add it to our monitoring system and baseline knowledge.
SQL Server Magazine: It will be interesting to see how much of the give and take you have with the SQLCAT team ends up in the next version of SQL Server itself.
Stelzmuller: We encounter some bugs that the average implementation might not get around to. But the bugs are there for everyone, whether or not you hit them today or a year from now.
SQL Server Magazine: What do you do to back up so many servers, and with related data on different servers, how do you handle it?
Al-Adwan: We use 3PAR [3PAR.com] for our SAN storage solution. We have about 14 production 3PAR frames to power all of our databases. We have two colos [collocated servers] where our data lives, and we have 14 production frames distributed across those colos. We have a set of four backup frames; we call those "near lines," and they're all running SATA [Serial ATA] disks. We have our near lines distributed in both colos. So from a backup perspective, within each one of our 3PAR frames, we take a snapshot of our database once daily. We maintain three versions of that snapshot on our live production databases, and then we replicate those snapshots to the other colo on our SATA drives. So we have three hot snapshots of the data on the production frames, and three backup copies on our near lines and office colo. We were very excited about that design. Actually, 3PAR liked it so much that they're thinking of marketing it to their other customers.
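The retention scheme Al-Adwan outlines (one snapshot a day, three versions kept on the production frame, three copies on the near line in the other colo) can be modeled with a fixed-size queue. The sketch below is illustrative only: `Frame` and `daily_backup` are hypothetical names, and nothing here reflects 3PAR's actual snapshot or replication interfaces.

```python
from collections import deque

KEEP = 3  # versions retained per location, per the interview


class Frame:
    """Hypothetical stand-in for a storage frame that holds snapshots."""

    def __init__(self, name, keep=KEEP):
        self.name = name
        # A bounded deque discards the oldest snapshot once `keep` is exceeded.
        self.snapshots = deque(maxlen=keep)

    def add(self, snapshot_id):
        self.snapshots.append(snapshot_id)


def daily_backup(production, near_line, day):
    """Take one snapshot on the production frame and copy it to the near line."""
    snapshot_id = f"{production.name}-{day}"
    production.add(snapshot_id)
    near_line.add(snapshot_id)  # stands in for replication to the other colo
    return snapshot_id


if __name__ == "__main__":
    prod = Frame("prod")
    near = Frame("nearline")
    for day in range(1, 6):
        daily_backup(prod, near, day)
    print(list(prod.snapshots))
    print(list(near.snapshots))
```

After any number of daily runs, each location holds only the three most recent snapshots, so a restore can come from the production frame first and fall back to the near line if the frame itself is lost.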
Stelzmuller: We heard from Microsoft that we’re the largest implementation of SQL Server from a transaction and data volume standpoint. It's really important to show what SQL Server is capable of handling. I know a lot of people tend to underestimate it as a platform, and I think that speaks to the limits of their imagination.
SQL Server Magazine: A lot of people think of SQL Server as a medium-sized business play, but I think you're showing that it can be a pretty large business play.
Al-Adwan: We have a motto here—it's not about platform, it's about architecture. It's platform agnostic—you just have to know how to set it up.