Behind the scenes: how Stack Overflow is deployed – the 2016 edition


To get a sense of what this architecture actually "does", let's start with a normal day at Stack Overflow. Below are the daily statistics from February 9, 2016, with the differences from November 12, 2013:

  • 209,420,973 (+61,336,090) HTTP requests to our load balancer
  • 66,294,789 (+30,199,477) of those were page loads
  • 1,240,266,346,053 (+406,273,363,426) bytes (1.24 TB) of HTTP traffic sent
  • 569,449,470,023 (+282,874,825,991) bytes (569 GB) total received
  • 3,084,303,599,266 (+1,958,311,041,954) bytes (3.08 TB) total sent
  • 504,816,843 (+170,244,740) SQL Queries (from HTTP requests alone)
  • 5,831,683,114 (+5,418,818,063) Redis hits
  • 17,158,874 (not tracked in 2013) Elastic searches
  • 3,661,134 (+57,716) Tag Engine requests
  • 607,073,066 (+48,848,481) ms (168 hours) spent running SQL queries
  • 10,396,073 (-88,950,843) ms (2.8 hours) spent on Redis hits
  • 147,018,571 (+14,634,512) ms (40.8 hours) spent on Tag Engine requests
  • 1,609,944,301 (-1,118,232,744) ms (447 hours) spent processing in ASP.Net
  • 22.71 (-5.29) ms average (19.12 ms in ASP.Net) for 49,180,275 question page renders
  • 11.80 (-53.2) ms average (8.81 ms in ASP.Net) for 6,370,076 home page renders

You might be wondering about the drastic drop in ASP.Net processing time compared to 2013 (which was 757 hours per day), despite 61 million more requests a day. That's thanks to both a hardware upgrade in early 2015 and a lot of performance tuning inside the application itself. Don't forget: performance is still a feature. If you're curious about the hardware, don't worry, all your hardware questions will be answered in a later post.

So what has changed in the last 2 years? Besides replacing some servers and network devices, not much. Here is the top-level list of hardware that runs the site every day (noting what's different from 2013):

  • 4 Microsoft SQL Servers (new hardware with 2 of them)
  • 11 IIS Web Servers (new hardware)
  • 2 Redis Servers (new hardware)
  • 3 Tag Engine servers (new hardware for 2 out of 3)
  • 3 Elasticsearch servers (old)
  • 4 HAProxy Load Balancers (2 more to support CloudFlare)
  • 2 Networks (each made up of a Nexus 5596 Core + 2232TM Fabric Extenders, upgraded to 10Gbps everywhere)
  • 2 Fortinet 800C Firewalls (replacing Cisco 5525-X ASAs)
  • 2 Cisco ASR-1001 Routers (replacing Cisco 3945 Routers)
  • 2 Cisco ASR-1001-x Routers (new!)

What do we need to run Stack Overflow? Not much has changed since 2013, but thanks to the optimizations and new hardware mentioned above, we're down to needing only a single web server. We have (unintentionally) tested this, successfully, several times. To be clear: I'm saying it works, not that it's a good idea. But it's fun, right?

Now that we have some baseline numbers for a sense of scale, let's look at how we make those fancy web pages. Since very few systems exist in complete isolation (and Stack Overflow is no exception), architecture decisions often make little sense without a bigger picture of how the small pieces fit into the whole. That's the goal here: covering the big picture. This will be an overview of the infrastructure with some hardware highlights; the next post will cover the hardware in more detail.

Here are some photos of the current hardware; this one is of rack A (it has a matching sister rack B):

Next, let's talk about layout. Here is a logical overview of the major systems currently in play:

Ground Rules

Here are some rules that apply globally, so I don't have to repeat them with every setup:

  • Everything is redundant.
  • Every server and network device has at least 2x 10Gbps connectivity.
  • Every server has 2 power supplies, via 2 power feeds, from 2 UPS units, backed by 2 generators and 2 utility feeds.
  • All servers have a redundant link to both rack A and rack B.
  • All servers and services are doubly redundant via another data center (Colorado), though this post mostly talks about New York.
  • Everything is redundant.

The Internets

First, you have to find us – that's DNS. Finding us has to be fast, so we farm this out to CloudFlare (currently), because they have DNS servers closer to almost everyone around the world. We update our DNS records via an API, and they take care of "hosting" the DNS. To be extra safe, we obviously still maintain our own DNS servers as well. Should the apocalypse happen (probably caused by the GPL, Punyon, or caching) and people still want to program to take their mind off it, we'll flip them on.

Once our secret hideout is found, HTTP traffic comes in from one of four ISPs (Level 3, Zayo, Cogent, and Lightower in New York) and flows through one of our four edge routers. We peer with our ISPs using BGP (fairly standard) in order to control the flow of traffic and provide several paths for traffic to reach us as efficiently as possible. The ASR-1001 and ASR-1001-X routers are set up in 2 pairs, each pair serving 2 ISPs in active/active fashion, so we have redundancy here. Though they're all on the same physical 10Gbps network, external traffic is isolated into separate external VLANs, which the load balancers are connected to as well. After flowing through the routers, traffic heads to a load balancer.

This is probably also a good time to mention that we have a 10Gbps MPLS link between the two data centers, but it's not directly involved in serving the sites. We use this connection to replicate data and for quick recovery when we need it. "But hey, that's not redundant!" Well, you're technically correct; on its face, that is a single point of failure. But wait! We also maintain two more failover OSPF routes (the MPLS is #1, these are #2 and #3 by cost) via our ISPs. Each set connects to the corresponding set in Colorado, and they load balance traffic between them in a failover situation.

Note: failover is a protection mechanism in which redundant equipment takes over from the main system if something goes wrong.

Load Balancers (HAProxy)

The load balancers run HAProxy 1.5.15 on CentOS 7, our preferred flavor of Linux. TLS (SSL) traffic is also terminated in HAProxy. We'll need to look at HAProxy 1.7 soon for HTTP/2 support.

Unlike the other servers, which have a dual 10Gbps LACP network link, each load balancer has 2 pairs of 10Gbps: one for the external network and one for the DMZ. These boxes carry 64GB of memory or more to handle TLS (SSL) negotiation more efficiently. When we can cache more TLS sessions in memory for reuse, there is less to recompute on subsequent connections from the same client. This means we can resume sessions both faster and cheaper. Given that RAM is pretty cheap, it's an easy choice.

The load balancer setup itself is fairly simple. We listen for different sites on various IPs (mostly for certificate concerns and DNS management) and route to various backends based mostly on the host header. The only notable things we do here are rate limiting and some header captures (sent from our web tier) into the HAProxy syslog message, so we can record performance metrics for every single request.

Web Tier (IIS 8.5, ASP.Net MVC 5.2.3, and .Net 4.6.1)

The load balancers feed traffic to 9 servers we refer to as "primary" (01-09) and 2 "dev/meta" (10-11, our staging environment) web servers. The primary servers run things like Stack Overflow, Careers, and all Stack Exchange sites, except meta.stackoverflow.com and meta.stackexchange.com, which run on the last two servers.

Here's what Stack Overflow's distribution across the web tier looks like in Opserver (our internal monitoring dashboard):

…and here's what those web servers look like from the application's perspective:

Service Tier (IIS, ASP.Net MVC 5.2.3, .Net 4.6.1, and HTTP.SYS)

Behind those web servers is the very similar "service tier", also running IIS 8.5 on Windows 2012R2. This tier runs internal services to support the production web tier and other internal systems. The two big players here are "Stack Server", which runs the tag engine and is based on http.sys (not behind IIS), and the Providence API (IIS-based). Fun fact: I have to set affinity on each of these two processes to land on separate sockets, because Stack Server just steamrolls the L2 and L3 cache when refreshing the question lists every 2 minutes.
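For readers unfamiliar with process affinity, here is a minimal sketch of pinning a .Net process to one CPU socket via the ProcessorAffinity bitmask. The mask value and class name are illustrative assumptions, not the actual Stack Server configuration:

```csharp
using System;
using System.Diagnostics;

class AffinityExample
{
    static void Main()
    {
        // Hypothetical: pin the current process to the first 8 logical CPUs,
        // i.e. one socket of a dual-socket box with 8 cores per socket.
        // The real Stack Server / Providence affinity settings are not published here.
        var process = Process.GetCurrentProcess();
        process.ProcessorAffinity = (IntPtr)0x00FF; // bitmask: CPUs 0-7

        Console.WriteLine($"Affinity mask: 0x{(long)process.ProcessorAffinity:X}");
    }
}
```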

These service boxes do the heavy lifting with the tag engine and backend APIs, where we need redundancy, but not 9x redundancy. For example, loading all posts and their tags from the database every n minutes (currently 2) is not cheap. We don't want to do that load 9 times on the web tier; 3 times is enough and gives us plenty of safety. We also configure these boxes differently on the hardware side to better optimize for the different computational load characteristics of the tag engine and the indexing jobs (which also run here). The "tag engine" is a relatively complicated topic in itself and will get a dedicated post. The basics: when you visit /questions/tagged/java, you're hitting the tag engine to see which questions match. It does all of our tag matching outside of /search, so the new navigation and so on are all using this service for their data.
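To make the idea concrete, here is a deliberately simplified, hypothetical sketch of an in-memory tag-to-question index. The real tag engine is far more sophisticated (sorting, paging, exclusions, and more) and is not open source; this only illustrates the lookup that backs a page like /questions/tagged/java:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, heavily simplified tag index: maps a tag to the set of question IDs carrying it.
class TagIndex
{
    private readonly Dictionary<string, HashSet<int>> _questionsByTag =
        new Dictionary<string, HashSet<int>>(StringComparer.OrdinalIgnoreCase);

    public void Add(int questionId, IEnumerable<string> tags)
    {
        foreach (var tag in tags)
        {
            if (!_questionsByTag.TryGetValue(tag, out var set))
                _questionsByTag[tag] = set = new HashSet<int>();
            set.Add(questionId);
        }
    }

    // e.g. /questions/tagged/java -> Query("java")
    public IReadOnlyCollection<int> Query(string tag) =>
        _questionsByTag.TryGetValue(tag, out var set)
            ? (IReadOnlyCollection<int>)set
            : Array.Empty<int>();
}
```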

Cache & Pub/Sub (Redis)

We use Redis in a few ways here, and it is rock solid. Despite handling about 160 billion ops a month, every instance is below 2% CPU. Usually much lower:

We have an L1/L2 cache system with Redis. "L1" is the HTTP cache on the web servers, or whatever application is in play. "L2" is falling back to Redis to fetch the value. Our values are stored in the Protobuf format, via protobuf-dot-net by Marc Gravell. For a client, we use StackExchange.Redis, written in-house and open source. When a web server gets a cache miss in both L1 and L2, it fetches the value from the source (a database query, an API call, etc.) and puts the result into both the local cache and Redis. The next server wanting the value may miss L1, but will find the value in L2/Redis, saving a database query or API call.

We also run many Q&A sites, so each site has its own L1/L2 caching: by key prefix in L1 and by database ID in L2/Redis.
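Here is a minimal sketch of that L1/L2 pattern, using System.Runtime.Caching for L1 and StackExchange.Redis for L2. The real implementation, key scheme, and protobuf-net serialization differ; the class and method names here are illustrative, and plain strings stand in for the serialized values:

```csharp
using System;
using System.Runtime.Caching;
using StackExchange.Redis;

class TwoLevelCache
{
    private readonly MemoryCache _l1 = MemoryCache.Default; // L1: local, in-process
    private readonly IDatabase _l2;                         // L2: Redis

    public TwoLevelCache(ConnectionMultiplexer redis) => _l2 = redis.GetDatabase();

    public string GetOrFetch(string key, Func<string> fetchFromSource, TimeSpan ttl)
    {
        // 1. L1: local memory
        if (_l1.Get(key) is string local)
            return local;

        // 2. L2: Redis
        RedisValue remote = _l2.StringGet(key);
        if (remote.HasValue)
        {
            _l1.Set(key, (string)remote, DateTimeOffset.UtcNow.Add(ttl));
            return remote;
        }

        // 3. Miss on both: fetch from source (DB query, API call, ...) and populate both caches
        string value = fetchFromSource();
        _l2.StringSet(key, value, ttl);
        _l1.Set(key, value, DateTimeOffset.UtcNow.Add(ttl));
        return value;
    }
}
```

Usage would look something like `new TwoLevelCache(ConnectionMultiplexer.Connect("localhost")).GetOrFetch("so:question:1", LoadFromDb, TimeSpan.FromMinutes(5))`, with the key prefix carrying the per-site scoping described above.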

Besides the two main Redis servers (master/slave) that run all the site instances, we also have a machine learning instance slaved across two more dedicated servers (due to memory). That machine learning instance is used for recommending questions on the home page, better job matching, and so on. It is a platform called Providence, covered here by Kevin Montrose.

The main Redis servers have 256GB of RAM (about 90GB in use) and the Providence servers have 384GB of RAM (about 125GB in use).

Redis isn't just for caching, though; it also has a publish & subscribe mechanism, in which one server can publish a message and all subscribers receive it, including downstream clients on Redis slaves. We use this mechanism to clear L1 caches on the other servers when one web server does a removal, to keep things consistent, but there's another great use: websockets.
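Before moving on to websockets, here is a hedged sketch of that cache-clearing use with StackExchange.Redis pub/sub: when one server removes a key, it publishes the key on a channel and every other server drops its own L1 entry. The channel name and wiring here are hypothetical, not Stack Overflow's actual code:

```csharp
using System.Runtime.Caching;
using StackExchange.Redis;

static class L1Invalidation
{
    // Hypothetical channel name.
    private const string Channel = "cache-clear";

    public static void Wire(ConnectionMultiplexer redis)
    {
        ISubscriber sub = redis.GetSubscriber();

        // Every server subscribes: when any server publishes a key, drop it from local L1.
        sub.Subscribe(Channel, (channel, key) => MemoryCache.Default.Remove(key));
    }

    public static void Remove(ConnectionMultiplexer redis, string key)
    {
        // Remove locally and in Redis (L2), then tell everyone else to drop their L1 copy.
        MemoryCache.Default.Remove(key);
        redis.GetDatabase().KeyDelete(key);
        redis.GetSubscriber().Publish(Channel, key);
    }
}
```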

Websockets (NetGain)

We use websockets to push real-time updates to users, such as notifications in the top bar, vote counts, new nav counts, new answers and comments, and much more.

The socket servers themselves use raw sockets running on the web tier. It's a small application on top of our open source library: StackExchange.NetGain. At peak, we have about 500,000 concurrently open websockets. That's a lot of browsers. Fun fact: some of those browsers have been open for more than 18 months; we're not sure why. Someone should go check whether those developers are still alive. Here's this week's pattern of concurrently open websockets:

Why websockets? At our scale, they're far more efficient than polling. This way we can simply push more data with fewer resources, and updates reach users faster. They're not without issues, though, such as ephemeral port and file handle exhaustion on the load balancer.
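For illustration only, here is roughly what a .NET consumer of such a push channel could look like using the framework's ClientWebSocket. The actual clients are browsers talking to NetGain, and the endpoint URL and message format below are made up:

```csharp
using System;
using System.Net.WebSockets;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

class PushClient
{
    static async Task Main()
    {
        using (var ws = new ClientWebSocket())
        {
            // Hypothetical endpoint; the real websocket endpoint is not shown here.
            await ws.ConnectAsync(new Uri("wss://example.invalid/updates"), CancellationToken.None);

            var buffer = new byte[8192];
            while (ws.State == WebSocketState.Open)
            {
                // The server pushes updates; no HTTP polling loop is needed.
                WebSocketReceiveResult result =
                    await ws.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
                if (result.MessageType == WebSocketMessageType.Close) break;

                Console.WriteLine(Encoding.UTF8.GetString(buffer, 0, result.Count));
            }
        }
    }
}
```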

Search (Elasticsearch)

Spoiler: there's not a lot to get excited about here. The web tier does fairly vanilla searches against Elasticsearch 1.4, using the StackExchange.Elastic client. Unlike other areas, we have no plans to open source this, simply because it only exposes a very small subset of the API we use, and releasing it would do more harm than good. We use elastic for /search, calculating related questions, and suggestions when asking a question.

Each Elastic cluster (there's one in each data center) has 3 nodes, and each site has its own index. Careers has a few additional indexes. What makes our setup a bit non-standard in the elastic world: our 3-server clusters are a bit beefier than average, with all-SSD storage, 192GB of RAM, and dual 10Gbps network each.

The same application domains (yeah, we're in trouble with .Net Core here…) in Stack Server that host the tag engine also continuously index items in Elasticsearch. We do some tricks here, such as comparing ROWVERSION in SQL Server (the data source) against a "last position" document in Elastic. Since it behaves like a sequence, we can simply grab and index any items that have changed since the last pass.
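Here is a hedged sketch of that incremental pattern using Dapper against SQL Server. Since the StackExchange.Elastic client isn't public, the IndexDocument call below is a hypothetical stand-in, as are the table and column names; the real indexer also persists its "last position" as a document in Elastic rather than a field:

```csharp
using System.Data.SqlClient;
using Dapper;

class IncrementalIndexer
{
    // Hypothetical shape of an indexed document.
    public class PostDoc
    {
        public int Id { get; set; }
        public string Title { get; set; }
        public string Tags { get; set; }
        public long Version { get; set; }   // CAST of the SQL ROWVERSION to BIGINT
    }

    private long _lastVersion; // "last position" reached on the previous pass

    public void IndexPass(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            // Grab only rows whose ROWVERSION has moved past the last pass.
            // Table/column names are illustrative, not Stack Overflow's schema.
            var changed = conn.Query<PostDoc>(@"
                SELECT Id, Title, Tags, CAST([RowVersion] AS BIGINT) AS Version
                FROM   Posts
                WHERE  CAST([RowVersion] AS BIGINT) > @lastVersion
                ORDER  BY [RowVersion]", new { lastVersion = _lastVersion });

            foreach (var doc in changed)
            {
                IndexDocument(doc);          // hypothetical call into the Elastic client
                _lastVersion = doc.Version;  // advance the "last position"
            }
        }
    }

    private void IndexDocument(PostDoc doc) { /* send to Elasticsearch via your client of choice */ }
}
```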

Scalability and a better allocation of money are the main reasons we use Elasticsearch instead of something like SQL full-text search. SQL CPUs are comparatively very expensive, while Elastic is cheap and has far more features these days. So why not Solr? We want to search across the entire network (many indexes at once), and that wasn't supported at the time we made the decision. The reason we're not on 2.x yet is a big change to "types", which means we would need to re-index everything to upgrade, and we haven't had the time for that yet.

Databases (SQL Server)

We use SQL Server as our single source of truth. All data in Elastic and Redis comes from SQL Server. We run 2 SQL Server clusters with AlwaysOn Availability Groups. Each cluster has 1 master (taking almost all of the load) and 1 replica in New York. In addition, each has a replica in Colorado (our DR data center). All replicas are asynchronous.

The first cluster is a set of Dell R720xd servers, each with 384GB of RAM, 4TB of PCIe SSD space, and 2x 12 cores. It hosts the Stack Overflow, Sites (bad name, I'll explain later), PRIZM, and Mobile databases.

The second cluster is a set of Dell R730xd servers, each with 768GB of RAM, 6TB of PCIe SSD space, and 2x 8 cores. This cluster runs everything else, including Careers, Open ID, Chat, our Exception log, and every other Q&A site (like Super User, Server Fault, …).

CPU utilization on the database tier is something we like to keep very low, but it's currently a bit higher than we'd like due to some plan cache issues we're trying to track down. As of right now, NY-SQL02 and 04 are masters, while 01 and 03 are replicas that were just restarted during some SSD upgrades. Here's what the last 24 hours look like:

Our usage of SQL is pretty simple. And simple is fast. Though some queries can get crazy, our interaction with SQL itself is fairly vanilla. We have some legacy Linq2Sql, but all of our recent development uses Dapper, our open source micro-ORM using POCOs. To put it another way: Stack Overflow has only one stored procedure in the database, and I intend to move that last vestige into code.
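For a feel of what that looks like in practice, here is a minimal Dapper example with a POCO. The table and column names are made up for illustration, not Stack Overflow's actual schema:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using Dapper;

public class Question   // a plain POCO; Dapper maps columns to properties by name
{
    public int Id { get; set; }
    public string Title { get; set; }
    public int Score { get; set; }
}

public static class QuestionStore
{
    public static List<Question> TopQuestions(string connectionString, int minScore)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            // Parameterized, plain SQL; no stored procedures, no heavy ORM layer.
            return conn.Query<Question>(
                "SELECT TOP 10 Id, Title, Score FROM Questions WHERE Score > @minScore ORDER BY Score DESC",
                new { minScore }).AsList();
        }
    }
}
```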

Libraries

Finally, let's move on to something that can more directly help you. Below is a list of open source .Net libraries we maintain for everyone to use. We open source them because they have no core business value but can benefit the developer community:

  • Dapper (.Net Core) – High performance Micro-ORM for ADO.Net
  • StackExchange.Redis – High performance redis client
  • MiniProfiler – A lightweight profiler we run on every page (it also supports Ruby, Go, and Node)
  • Exceptional – Error logger for SQL, JSON, MySQL, …
  • Jil – High-performance JSON (de)serializer
  • Sigil – A .Net CIL generation helper (for when C# isn't fast enough)
  • NetGain – High-performance websocket server
  • Opserver – Monitoring dashboard that polls most systems directly and also pulls information from Orion, Bosun, or WMI.
  • Bosun – System to monitor the backend, written in Go

ITZone via nickcraver


Source: TechTalk