These are notes on architecting large-scale systems. (Is architecting even a word?) The notes came out of a series of recent conversations with trusted colleagues.

Large software systems have several axes to manage:

  • number of requests -> millions of requests per second
  • amount of data -> millions of data records to process
  • amount of code -> millions of lines of code to organize
  • number of staff -> millions of communication channels


Monitoring

  • is the system healthy?
  • keep it healthy

Monitoring > Logging

  • log everything
  • keep logs for about a year
  • do not log confidential data such as passwords
  • include (1) a correlation id and (2) the entire request/response payload
  • structured logging is a nice-to-have
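A minimal sketch of structured logging with a correlation id, using Python's stdlib logging; the field names are illustrative:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("requests")

def log_event(logger, correlation_id, payload, level=logging.INFO):
    # One JSON object per line: easy to parse, search, and correlate later.
    # Never put confidential data (e.g. passwords) in the payload.
    line = json.dumps({"correlation_id": correlation_id, "payload": payload})
    logger.log(level, line)
    return line

line = log_event(log, "req-42", {"method": "GET", "path": "/items"})
```

With every log line carrying the same correlation id, one request can be traced across services by a simple text search.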

Monitoring > Notifications

  • monitor every important part of the system
  • ping employees via phone, email, text
  • have clear plans of action; e.g. restart the service
  • Kubernetes has some out-of-the-box monitoring: liveness and readiness probes.
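A health endpoint that such probes or monitors could poll might look like this minimal Python sketch; the dependency checks are hypothetical:

```python
import json

def health_check(db_ok, cache_ok):
    # Returns (HTTP status, body) for a /healthz-style endpoint. A monitor
    # or a Kubernetes probe polls it; a 503 triggers the notification and
    # the agreed plan of action (e.g. restart the service).
    healthy = db_ok and cache_ok
    body = json.dumps({"healthy": healthy, "db": db_ok, "cache": cache_ok})
    return (200 if healthy else 503), body
```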


Deployment

  • who can deploy?
  • the machine needs approval from two people:
  • ... one person (e.g. a developer) starts a deployment,
  • ... another person must approve the deployment,
  • ... the machine does the actual deployment via a script.
  • this is more secure and prevents accidents (e.g. deleting the database, oops)
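The two-person rule can be sketched as a guard the deployment script runs first; the names are illustrative:

```python
def authorize_deployment(initiator, approver):
    # Two-person rule: the person who starts a deployment cannot also
    # approve it; only once this passes does the machine run the script.
    if not initiator or not approver:
        return False
    return initiator != approver
```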


Scaling Out

  • scale out is the trend
  • it is about load balancing
  • compared to one large machine, multiple small machines are
  • ... less expensive
  • ... more flexible
  • ... more resilient (w/ redundancies)
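A toy round-robin balancer illustrates the load-balancing idea behind many small machines; this is a sketch, not a production balancer:

```python
from itertools import cycle

class RoundRobinBalancer:
    # Spreads requests across several small machines; if one machine dies
    # it can simply be removed from the pool (resilience via redundancy).
    def __init__(self, servers):
        self._cycle = cycle(list(servers))

    def next_server(self):
        return next(self._cycle)
```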

Distributed Transactions

                       HasStarted    HasCompleted

database1.someTable        1              1
database2.someTable        1              0 // problem!
database3.someTable        1              1

  • for each distributed database table involved in a distributed transaction,
  • ... store the transaction state: HasStarted, HasCompleted...
  • also include a supervisor service to check for transaction integrity
  • --
  • some transactions will fail
  • fail fast and on purpose:
  • ... if something in the system fails, shoot the flare,
  • ... sound the alarm immediately, do not keep processing
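The supervisor check over the state columns above can be sketched as follows; the row shape is illustrative:

```python
def find_incomplete(rows):
    # Each row: (table_name, has_started, has_completed). A row that has
    # started but not completed breaks transaction integrity; report it
    # immediately (fail fast) rather than keep processing.
    return [name for name, started, completed in rows
            if started and not completed]

rows = [
    ("database1.someTable", 1, 1),
    ("database2.someTable", 1, 0),  # problem!
    ("database3.someTable", 1, 1),
]
```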

Event Sourcing

  • do not store state
  • rather, record every event that would otherwise change state,
  • and determine the state at a given time by replaying each event.
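A minimal event-sourcing sketch, using a bank balance as a stand-in domain:

```python
def replay(events):
    # No stored state: the balance at any point in time is derived by
    # replaying every event up to that point.
    balance = 0
    for kind, amount in events:
        balance += amount if kind == "deposit" else -amount
    return balance

events = [("deposit", 100), ("withdraw", 30), ("deposit", 5)]
```

Replaying a prefix of the event list gives the state "as of" any earlier moment, which is the main payoff of the pattern.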


Distributed Cache

  • supports clustering for a reliable, distributed cache
  • multiple services read/write to a single distributed cache
  • more than just a cache
  • ... supports queue and bus features
  • ... pub-sub, in which services subscribe to a key
  • ... consistent index increment operations (plus one)
  • limitation: only supports serialized data
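A single-process toy with the same surface area (get/set, atomic increment, per-key pub-sub) shows the idea; a real distributed cache does this across machines:

```python
from collections import defaultdict
from threading import Lock

class ToyCache:
    def __init__(self):
        self._data = {}
        self._lock = Lock()
        self._subscribers = defaultdict(list)

    def set(self, key, value):
        with self._lock:
            self._data[key] = value
        for callback in self._subscribers[key]:  # pub-sub: notify key subscribers
            callback(key, value)

    def get(self, key):
        return self._data.get(key)

    def incr(self, key):
        with self._lock:  # consistent "plus one" even under concurrency
            self._data[key] = self._data.get(key, 0) + 1
            return self._data[key]

    def subscribe(self, key, callback):
        self._subscribers[key].append(callback)
```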

Lots of Data - Paging, Filtering

Back 1 2 3 4 5 6 7 8 9 10 .. 11,320 Next // no!
Back 1 2 3 4 5 6 7 8 9 10 Next // yes
  • exposing the total number of available pages in the UI is expensive,
  • because counting the total number of records entails scanning the entire data store,
  • and it's non-trivial to cache the count on a dynamic query
  • a better alternative is to expose a limited number of pages with back/next
  • on the other hand, filtering is cheap, especially with DB indexing
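The limited back/next approach can be sketched by fetching one extra row, so we know whether a next page exists without counting everything; the list stands in for a LIMIT/OFFSET query:

```python
def fetch_page(records, page, page_size=10):
    # Ask for page_size + 1 rows: the extra row tells us whether a
    # "Next" link exists, with no full count over the data store.
    start = page * page_size
    window = records[start:start + page_size + 1]
    return window[:page_size], len(window) > page_size
```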

Message Queue

  • consumers mostly read from them
  • uses a pull model, not a push model
  • usually support message persistence: a message stays stored until something has handled (acknowledged) it
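A toy pull-model queue where a message stays persisted until the consumer acknowledges it; one common queue design, sketched, not a real broker:

```python
from collections import deque

class PullQueue:
    def __init__(self):
        self._pending = deque()
        self._in_flight = {}   # persisted until acknowledged
        self._next_id = 0

    def publish(self, message):
        self._pending.append((self._next_id, message))
        self._next_id += 1

    def pull(self):
        # Pull model: the consumer asks for work when it is ready.
        if not self._pending:
            return None
        msg_id, message = self._pending.popleft()
        self._in_flight[msg_id] = message
        return msg_id, message

    def ack(self, msg_id):
        # Only an explicit acknowledgement removes the message for good.
        self._in_flight.pop(msg_id, None)
```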

Message Bus

  • this is pub/sub (publish/subscribe)
  • multiple listeners/subscribers
  • usually do not support message persistence
  • compared to object-oriented event handling,
  • ... publishing to a message bus can be harder to debug at runtime,
  • ... because of the difficulty of finding the message handler(s) and stepping into code;
  • ... static analysis does not support "find all message bus subscribers"
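A minimal in-process bus shows the fan-out, and also why debugging is harder: nothing in the publish call names the handlers that will run. A sketch:

```python
from collections import defaultdict

class MessageBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Fan-out to every subscriber; the message is not persisted after.
        for handler in self._subscribers[topic]:
            handler(message)
```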

Managing Millions of Lines of Code

  • reduce coupling (cliched, but true)
  • define clean lines among functional areas,
  • even though we own the entire code base,
  • ... break code out into libraries;
  • ... pretend we are creating a set of public APIs
  • ... focus on opaque APIs that hide their implementation details.
  • in a multi-layered/multi-tiered architecture,
  • ... avoid depending on services/libraries that are more than one layer away,
  • ... strive for dependency structures that look like linked lists
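One way to enforce the linked-list shape is a dependency structure test; here is a sketch over a hand-built module graph (module and layer names are hypothetical):

```python
def layer_violations(dependencies, layers):
    # dependencies: module -> set of modules it uses; layers: module ->
    # layer index (0 = top). A dependency may only point to the same
    # layer or the one directly below; anything else is flagged.
    bad = []
    for module, deps in dependencies.items():
        for dep in deps:
            distance = layers[dep] - layers[module]
            if distance < 0 or distance > 1:  # upward, or skipping a layer
                bad.append((module, dep))
    return bad
```

Run as part of the build, a test like this makes new coupling fail fast instead of accreting silently.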

Traffic Localization, Distributed Systems

  • direct the user to the closest server
  • e.g. Azure Traffic Manager does this; Cloudflare (probably) provides it too
  • when the request arrives
  • ... ping available servers
  • ... look at the latency
  • ... direct the request to the correct server
  • this is basically a performance enhancing DNS service
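The routing step reduces to picking the lowest measured latency; a sketch with made-up ping times:

```python
def pick_server(latencies_ms):
    # Direct the request to the server with the lowest measured latency.
    return min(latencies_ms, key=latencies_ms.get)

latencies = {"us-east": 120, "eu-west": 35, "ap-south": 210}  # made-up pings
```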

Challenge: region specific transactions

  • e.g. payment processing
  • ... payments in Australia
  • ... payments in United Kingdom
  • ... payments in United States
  • localize the algorithms for payment processing,
  • scale payment servers based on regional demand,
  • reuse shopping cart and invoicing systems for all regions,
  • doing this is harder when payment/cart/invoicing are coupled
  • one viable approach: break out a service per payment region
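The service-per-region idea can be sketched as a dispatch table; the tax rates here are only illustrative stubs:

```python
def process_payment_au(amount):
    return {"region": "AU", "amount": amount, "tax": round(amount * 0.10, 2)}

def process_payment_uk(amount):
    return {"region": "UK", "amount": amount, "tax": round(amount * 0.20, 2)}

# Each region gets its own payment service; cart and invoicing stay shared.
PAYMENT_SERVICES = {"AU": process_payment_au, "UK": process_payment_uk}

def process_payment(region, amount):
    return PAYMENT_SERVICES[region](amount)
```

Because each regional service sits behind the same call, regions can be scaled independently without touching cart or invoicing.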

MonoRepos vs MicroRepos

  • mono repositories require more investment in tooling
  • without that, it's too easy to do stupid things in a mono repository

Politics in Large Organizations

  • as the system grows, change becomes more difficult,
  • the corporate politics can become reactionary,
  • the code base's technical debt can become unwieldy and brittle,
  • focus on making small, impactful changes,
  • large systems are sobering - it's like steering an enormous ship
  • mitigation:
    • avoid coupling from the start
    • the harder it is to introduce coupling, the less likely people are to do it
    • so, make coupling painful and more difficult:
      • separate repositories
      • code analysis
      • dependency structure tests
      • code inspection
    • one choice is among:
      • an above average system that prevents bad patterns
      • an above average team that prevents bad patterns
      • a mix of both is probably the best