My name is Ara Sadoyan, and I'm the founder of OddEye, a monitoring and anomaly detection platform. Besides our platform, we also provide DevOps as a service to our most valuable clients. Here I want to share some information about one of our great friends and partners, Mouseflow, and their infrastructure hosted at LeaseWeb.
Recently I have been reading lots of articles and stories about Docker, container orchestration, and microservice deployment, and how they saved lives and tons of time for a particular company. While I truly believe that containers, microservices, and container orchestration are great, I always stand by a simple rule: the right tool for the right task.
A little background about the project: over 500 Linux and Windows servers in two data centers. Most applications are in .NET, with fewer than 10 services, some of which are heavily loaded and some not. HBase is the main database, ElasticSearch stores series data, and we use Ansible to deploy servers and, of course, OddEye for monitoring.
We use TeamCity to build and MSDeploy to deploy our .NET applications on Windows servers. Deployment usually takes less than 10 minutes, and the service remains online throughout, with zero downtime.
So the tasks are simple:
- No SPOF
- Maximum performance
- Fewer servers (we already have more than 500)
- Easy infrastructure management
- As much prediction of possible issues as possible
- As much automation of tasks as possible
- As little time as possible spent on maintenance per machine and overall infrastructure.
Lots of people will say that the only, or at least the best, way to achieve these goals is to deploy Docker containers and orchestrate them with Kubernetes, or at least go to AWS and order tons of cloud instances. But one of our tasks is “Maximum performance”. This means that we must be as close as possible to the hardware, so cloud instances with noisy neighbors are not an option for us. Containers are probably the most performant way to virtualize environments, but every abstraction costs some performance, maybe just a little, but it does. Also, neither Hadoop/HBase nor ElasticSearch, which are our main services, plays well with containers. So the only solution left is to work with metal boxes.
So how do we manage this monster? Do we need a team of engineers, DevOps, solution architects, and several managers to keep track of everything? No, we don't.
We do not want to spend hundreds of man-hours and keep a large team to manage these servers; we need to do this effectively with a small team. So how do we achieve that? The thing is that many people forget about battle-tested old tools, and forget that most of these tools are in use by the biggest players. So the decision was made: we are moving forward with metal boxes. For that we needed a reliable data center provider offering good servers at reasonable prices in the US and EU. After a short search we chose to work with LeaseWeb.
There is a small task for the local data center personnel: they should deliver servers with the KVM settings we define, so that we have offline access to the servers when things go bad and we somehow lose contact with the OS over SSH. We ordered an initial number of racks and servers and started configuring our system. The first thing was to create a TFTP server with install images containing some basic configuration and SSH keys. When that was done, we could boot a machine via PXE and automatically install the OS on it.
The server delivery procedure is the following:
- We ask the data center to deliver a certain number of servers and configure the KVMs with the desired IP addresses
- When the servers are delivered, we power them on via KVM and boot them via PXE
- After several minutes the OS is installed and ready to use.
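The delivery steps above depend on each new machine finding its install configuration over TFTP. As a minimal sketch, assuming PXELINUX is the bootloader (the setup described here does not name one), the following Python helpers derive the per-host file name that PXELINUX looks up under `pxelinux.cfg/` and render a small boot entry. The kernel path, initrd path, and extra kernel arguments are hypothetical placeholders, not values from the real installation.

```python
import re


def pxelinux_config_name(mac: str) -> str:
    """Return the per-host config file name PXELINUX tries for an Ethernet MAC.

    PXELINUX looks for the ARP hardware type (01 for Ethernet) followed by
    the MAC address in lowercase hex, separated by dashes.
    """
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    octets = [digits[i:i + 2] for i in range(0, 12, 2)]
    return "01-" + "-".join(octets)


def render_boot_entry(kernel: str, initrd: str, extra_args: str = "") -> str:
    """Render a minimal PXELINUX entry that boots an installer image."""
    append = f"initrd={initrd} {extra_args}".strip()
    return (
        "DEFAULT install\n"
        "LABEL install\n"
        f"  KERNEL {kernel}\n"
        f"  APPEND {append}\n"
    )


if __name__ == "__main__":
    # Hypothetical MAC address and image paths, for illustration only.
    print("pxelinux.cfg/" + pxelinux_config_name("AA:BB:CC:00:11:22"))
    print(render_boot_entry("images/vmlinuz", "images/initrd.img"))
```

Writing one such file per MAC address would let each delivered server pick up its own installer settings, while a shared `default` file could cover the common case.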
When the basic OS is installed and configured, we need to deploy software and services. Most of our servers run HDFS/HBase or ElasticSearch, so the task is to automate the installation and configuration process. It did not take us long to choose Ansible as the preferred automation tool. The reason was obvious: Ansible relies on SSH and doesn't require any agent to be installed on the target systems. As we install SSH keys during OS installation, nothing else needs to be done on the end machines in order to use them as Ansible clients. We installed Ansible on the head machine and created several playbooks to automate the HDFS/HBase and ElasticSearch installations.
As I mentioned above, many of our applications run on .NET on Windows, so our development team uses MSDeploy to handle deployment of those applications. When we switch to .NET Core on Linux, which is on our roadmap, we will write several Ansible playbooks for that as well.
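Because Ansible only needs SSH access, describing the fleet comes down to an inventory file. The following is a minimal sketch, assuming an INI-style inventory; the group names and host names are hypothetical and do not come from the real setup.

```python
from __future__ import annotations


def render_inventory(groups: dict[str, list[str]]) -> str:
    """Render an INI-style Ansible inventory from a group -> hosts mapping."""
    sections = []
    for group, hosts in groups.items():
        # One [group] header followed by its hosts, sorted for stable output.
        lines = [f"[{group}]"] + sorted(hosts)
        sections.append("\n".join(lines))
    return "\n\n".join(sections) + "\n"


if __name__ == "__main__":
    # Hypothetical groups and hosts, for illustration only.
    fleet = {
        "hbase": ["hbase-02.example.com", "hbase-01.example.com"],
        "elasticsearch": ["es-01.example.com"],
    }
    print(render_inventory(fleet))
```

The resulting file could then be passed to `ansible-playbook` with the `-i` option, together with whichever playbook handles the service in question.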
To sum up the setup:
- 2 DevOps
- 4 people (one active at a time) doing 24/7/365 human monitoring
- Over 500 servers
- Over 2PB of HBase data
- Over 50 billion documents in ElasticSearch
My message isn't new: choose the right tool for the right task. What is trending may not be the best fit for your case. And of course, it is definitely possible to manage a metal monster with a micro team.