Chapter 1: Computer Architecture
Why not just use bigger and bigger CPUs? Mostly because single-CPU speeds have reached a plateau.
Chapter 2: Application Architecture
What should you do when a server is not capable of serving your current load?
-
We can make the server a bit better: a better CPU, more RAM, etc. This is called vertical scaling.
-
We can take our server and make copies of it. This is called horizontal scaling.
Metric Service: provides all the metrics we care about, i.e., how the server's resources are being utilized. Sometimes these are actually log-based metrics.
- When something goes wrong, we wouldn't want to manually go and look at the metrics to figure out that something is wrong!
- We feed the metrics to another service, called an alerting service, which automatically notifies us when some metric crosses a configured threshold.
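The alerting idea above can be sketched in a few lines. This is a minimal illustration, not a real alerting service's API; the function name and the 90% CPU threshold are made up.

```python
# Minimal sketch of a threshold-based alerting check (names are illustrative).
def check_alert(metric_value: float, threshold: float) -> bool:
    """Return True when the metric breaches its configured threshold."""
    return metric_value > threshold

# e.g., alert when CPU utilization exceeds 90%
assert check_alert(95.0, threshold=90.0) is True
assert check_alert(40.0, threshold=90.0) is False
```

A real alerting service would evaluate such rules continuously against the metric stream and notify on-call engineers instead of returning a boolean.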
Chapter 3: Design Requirements
High Level requirements:
- Move Data
- Store Data:
- Usually on database
- Blob store
- File systems or distributed file systems
- Transform Data
NOTE: Bad design choices for application architecture are very difficult to correct later. For example, a badly designed database schema will hurt for a long time.
Availability:
If a system has 99% availability, it experiences 1% downtime. Increasing availability to 99.9% cuts downtime from 1% to just 0.1%, a tenfold improvement. This is substantial: even a small rise in availability dramatically reduces potential disruptions. 99% is two 9s! :D
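The tenfold claim is easy to check numerically. A small sketch converting an availability percentage into downtime per year:

```python
# Downtime per year implied by an availability percentage (illustrative).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct: float) -> float:
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

two_nines = downtime_minutes_per_year(99.0)    # ~5,256 minutes (~3.65 days)
three_nines = downtime_minutes_per_year(99.9)  # ~525.6 minutes (~8.8 hours)
assert abs(two_nines / three_nines - 10) < 1e-6  # a tenfold improvement
```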
SLO: Service Level Objective, an internal target for metrics like availability or latency. SLA: Service Level Agreement, which states how the customer will be compensated (e.g., a partial refund) if the company's service is interrupted.
Throughput
- For user-facing traffic, it is # of requests/second
- For databases, throughput is usually measured in queries/second (QPS)
- For pipelines where large data flows through different stages, bytes/second is the metric
IPv4: 32-bit IPv6: 128-bit Port Number is a 16-bit value.
Sequence number in the TCP header
When sending a request to https://google.com, the browser reserves a port on your client machine on which you will receive the data back from the backend server.
TCP is reliable
- IP by itself does not solve reliability
- Missing packets will be retransmitted
- It is a connection-oriented protocol: a (logical, not physical) connection must be established before data can be exchanged
- Full-Duplex
- 3-way handshake
- Packets can arrive out of order and are reordered by the receiver
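The points above can be seen in a tiny localhost round-trip: a connection is established first (the 3-way handshake happens inside `connect`/`accept`), then bytes flow reliably in both directions. The echo behavior and port choice here are just for illustration.

```python
# A minimal localhost TCP echo round-trip, sketching the connection-oriented,
# reliable byte stream described above.
import socket
import threading

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))       # let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

def echo_once():
    conn, _ = server.accept()       # completes the 3-way handshake
    conn.sendall(conn.recv(1024))   # echo the bytes back (full-duplex)
    conn.close()

t = threading.Thread(target=echo_once)
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))  # SYN / SYN-ACK / ACK
client.sendall(b"hello")
reply = client.recv(1024)
client.close()
t.join()
server.close()
assert reply == b"hello"
```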
UDP
- User Datagram Protocol
- Connection-Less
- Some packets may be lost or arrive out of order: no guarantees
- Lower overhead and latency than TCP
- UDP is used for DNS because a query and its response each fit in a single datagram, and speed matters more than delivery guarantees (DNS retries on its own)
DNS
- Domain Name System
- ICANN: Internet Corporation for Assigned Names and Numbers
- Non-Profit Organization: Does not sell the domains
- Domain Registrar resell domains.
- An A record (Address record) is the main record type: it maps a domain name to an IPv4 address
- On a URL: https://domains.google.com/get-started
- com: is called TLD (Top Level Domain)
- google: domain name or primary domain name
- www (here, "domains"): a subdomain; www does not serve a technical purpose and was mainly a convention
- /get-started: is a path
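The URL anatomy above can be taken apart with the standard library:

```python
# Splitting the example URL into the parts named above.
from urllib.parse import urlparse

url = "https://domains.google.com/get-started"
parts = urlparse(url)

assert parts.scheme == "https"
assert parts.hostname == "domains.google.com"
assert parts.path == "/get-started"

# The hostname itself splits into subdomain / domain / TLD:
subdomain, domain, tld = parts.hostname.split(".")
assert (subdomain, domain, tld) == ("domains", "google", "com")
```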
RPC: Remote Procedure Call, executing code that does not reside on the local machine.
HTTP:
- Client and Server do not need to know anything about each other.
- No state management
- Anything needed for that request and response, is handled/included in the request/response.
Endpoint: A URL and a method like GET, PUT, ... is defined as an endpoint.
HTTP Responses:
- Informational Responses: (100-199)
- Successful Responses: (200-299)
- 201 (created) which is a reasonable response for a POST request
- Redirection Responses: (300-399)
- Client error responses: (400-499)
- Server error responses: (500-599)
SSL/TLS:
- SSL: Secure Socket Layer
- It came before TLS
- It is deprecated, but the name is still widely used (e.g., "SSL certificates")
- TLS: Transport Layer Security
HTTPS:
- Enhances HTTP with encryption (TLS) to protect against Man-in-the-Middle (MITM) attacks.
WebSocket:
- It resolves an issue you hit when using plain HTTP
- Let's say you need to retrieve all the chat messages in a group
- With HTTP, you would need to frequently send requests asking the server for new messages
- That is polling. It might work, but it is not optimal:
- Every request may need to make a new connection
- Client sends a http request to establish a WebSocket connection
- Going forward there is a persistent connection between the client and server
- Now, either the client or the server can initiate a message.
- For example a client starts typing a message. This will be delivered to everybody in the channel.
- For other clients, the server is the initiator.
HTTP/2 added streams to the protocol; some predicted this would make WebSocket obsolete, but WebSocket remains widely used.
API: Application Programming Interface
Three of the modern APIs:
- REST
- GraphQL
- gRPC
REST API:
- REpresentational State Transfer
- Stateless: no client state is stored on the server
- For example let's say we need to query all the comments for a Youtube video:
- If we use pagination, we just send something like: https://youtube.com/videos?id=abc&offset=0&limit=10 which gives us the first 10 comments.
- Being stateless helps when we have multiple servers: any server can handle any request, since no per-client data needs to be saved!
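The pagination example above can be sketched as a handler that relies only on what the request carries, so any replica can answer it. The comment data and function name are made up.

```python
# A stateless pagination sketch: every request carries its own offset/limit.
comments = [f"comment {i}" for i in range(25)]  # pretend this is the database

def get_comments(offset=0, limit=10):
    """Serve one page; no per-client state is kept between calls."""
    return comments[offset:offset + limit]

first_page = get_comments(offset=0, limit=10)
second_page = get_comments(offset=10, limit=10)
assert first_page[0] == "comment 0"
assert second_page[0] == "comment 10"
assert len(get_comments(offset=20, limit=10)) == 5  # last, partial page
```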
JSON: JavaScript Object Notation
- Readable
- From a performance perspective it is not the best (text-based and verbose)
GraphQL:
- It is built on top of HTTP
- It typically uses only POST requests, so the query describing which resources and fields are needed can be sent in the body
- With plain REST we often end up with overfetching: for example, for a particular user we may only need the profile picture and the username, not everything. GraphQL lets the client ask for exactly the fields it needs.
gRPC:
- Built on top of HTTP2 because it needs certain features of HTTP2
- Cannot be used natively from a browser
- You need a proxy
- It is usually used for server-to-server communication
- It is generally faster than REST
- Instead of JSON, it sends data using Protocol Buffers
- This leads to sending smaller payloads
- Instead of HTTP status codes as in REST, it has its own status codes and error messages: custom error handling
Good API
Let's say you have a POST endpoint which requires a userId and the content. After a while you might need to add another field in the body.
- You can make that extra field optional so that your endpoint is backward-compatible.
- You can use versioning.
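The optional-field idea can be sketched as follows. The field names (`userId`, `content`, `tags`) are illustrative, not from any real API:

```python
# Backward compatibility via an optional field: old clients omit it,
# new clients can opt in (field names are made up).
def create_post(user_id, content, tags=None):
    return {"userId": user_id, "content": content, "tags": tags or []}

old_client = create_post("u1", "hello")             # old payload still works
new_client = create_post("u1", "hello", ["intro"])  # new field is opt-in
assert old_client["tags"] == []
assert new_client["tags"] == ["intro"]
```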
Let's talk about an API for tweets!
- POST: https://api.twitter.com/v1.0/tweet and pass tweet text and userId
- GET (to retrieve one tweet): https://api.twitter.com/v1.0/tweet/:id
- DELETE (to delete one tweet): https://api.twitter.com/v1.0/tweet/:id
- GET (to retrieve a list of a user's tweets): We already used https://api.twitter.com/v1.0/tweet/:id. Therefore we need another path: https://api.twitter.com/v1.0/users/:id/tweets?limit=10&offset=0 (for pagination). We make limit and offset optional.
Caching
- The response header called Cache-Control is what tells the browser to cache a file
- x-cache: HIT means the response was served from a cache
Write-Around Cache:
- We write the data directly to the primary store, which is disk, bypassing the cache.
- The first read of that data is gonna be a miss. Then we copy the data into the memory (cache).
Write-Through Cache:
- Data is written to the cache and to the primary store at the same time, so both stay in sync (at the cost of slower writes).
Write-Back Cache:
- Data is written to the cache only, and flushed to disk later (e.g., periodically or on eviction).
- Difference from write-through: writes are faster, but cached data can be lost if the machine crashes before the flush.
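The three write strategies can be modeled with two dicts standing in for cache and disk. This is a toy model, not how a real storage engine is built:

```python
# Toy model of the three cache write strategies (dicts stand in for
# cache memory and disk).
class Store:
    def __init__(self):
        self.cache, self.disk = {}, {}

    def write_around(self, k, v):
        self.disk[k] = v              # skip the cache; the first read misses

    def write_through(self, k, v):
        self.cache[k] = v             # write both places synchronously
        self.disk[k] = v

    def write_back(self, k, v):
        self.cache[k] = v             # fast: cache only, flush later

    def flush(self):
        self.disk.update(self.cache)  # write-back's deferred disk write

s = Store()
s.write_back("a", 1)
assert "a" not in s.disk   # not durable until flushed
s.flush()
assert s.disk["a"] == 1
```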
CDN: Content Delivery Network
- CDNs are used for static content that rarely or never changes
- Pull CDNs and Push CDNs
Proxies
Load Balancers are examples of reverse proxy.
Load balancing Strategies:
- Round Robin
- Weighted Round Robin
- Least Number of Connections
- Location
- Hashing
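The simplest of these, round robin, just cycles through the server list. A minimal sketch (server names are made up):

```python
# A minimal round-robin picker over a server list.
from itertools import cycle

servers = ["server-1", "server-2", "server-3"]
rr = cycle(servers)  # endlessly cycles: 1, 2, 3, 1, 2, 3, ...

picked = [next(rr) for _ in range(5)]
assert picked == ["server-1", "server-2", "server-3", "server-1", "server-2"]
```

Weighted round robin would repeat heavier servers in the rotation; least-connections would instead pick the server with the fewest active connections.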
There are two types of load balancers: Layer 4 (transport layer) and Layer 7 (application layer)
- Layer 4 is faster since it only looks at IPs and ports; it has no visibility into the higher layers
- That's why Layer 7 is more powerful. It is also more expensive.
Note: Look at a paper by Google called Maglev.
Consistent Hashing (for Load Balancing)
Let's say you have multiple servers, each with its own in-memory cache. How can we make the best use of the cached data, for example when a server goes down or we add more servers? How can we keep mapping requests to the server that already has their data cached?
Note: You need to research this more thoroughly.
Note: Learn Rendezvous Hashing
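A bare-bones sketch of the idea, pending the deeper research the note above calls for: servers and keys hash onto the same ring, and a key belongs to the first server clockwise from it. Adding or removing one server then only remaps the keys between it and its neighbor, so most cached data stays put. The server names are made up.

```python
# A minimal consistent-hash ring.
import bisect
import hashlib

def _h(s: str) -> int:
    """Stable hash onto the ring (md5 used only for stable distribution)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers):
        # Each server occupies one point on the ring, sorted by hash.
        self._points = sorted((_h(s), s) for s in servers)

    def server_for(self, key: str) -> str:
        hashes = [h for h, _ in self._points]
        # First server clockwise from the key's hash, wrapping around.
        i = bisect.bisect(hashes, _h(key)) % len(self._points)
        return self._points[i][1]

ring = Ring(["cache-1", "cache-2", "cache-3"])
owner = ring.server_for("user:42")
assert owner in {"cache-1", "cache-2", "cache-3"}
assert ring.server_for("user:42") == owner  # same key, same server, every time
```

Production implementations add many virtual nodes per server to even out the distribution; this sketch uses one point per server for clarity.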
SQL
RDBMS: Relational DataBase Management System
- SQL databases store their data on disk
- Efficient read and write operations are necessary
- They use B+ trees. TODO: study this
Tradeoffs in using an RDBMS (the ACID guarantees):
- Atomicity
- We have transactions in database
- Transactions should be atomic
- If a portion of a transaction fails, the whole transaction should fail
- For example, you need to take money from one customer and add it to another. What happens if you update the first customer's balance, but the database crashes before updating the second customer's?
- Consistency
- Some columns should not be NULL.
- Certain columns should be unique for example.
- Isolation:
- Transactions should not have any impact on each other when running in parallel (side effect).
- We might need to enforce a sequence on transactions
- Durability:
- Data should be persisted
- Redis (an in-memory store) is not durable by default, and therefore not ACID-compliant
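Atomicity, the money-transfer example above, can be demonstrated with the stdlib sqlite3 module: when a transaction fails mid-way, everything in it rolls back. The schema and amounts are made up.

```python
# Atomicity demo: a failed transfer rolls back both updates, so money
# is never half-moved.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
db.executemany("INSERT INTO accounts VALUES (?, ?)",
               [("alice", 100), ("bob", 0)])
db.commit()

try:
    with db:  # one transaction: commits on success, rolls back on error
        db.execute("UPDATE accounts SET balance = balance - 50 "
                   "WHERE name = 'alice'")
        raise RuntimeError("crash before crediting bob")
except RuntimeError:
    pass

balances = dict(db.execute("SELECT name, balance FROM accounts"))
assert balances == {"alice": 100, "bob": 0}  # the debit was rolled back
```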
NoSQL (Not Only SQL)
The biggest limitation that NoSQL overcomes is scale.
Examples of NoSQL:
- Key-Value Stores: Redis, Memcached, etcd
- Usually used for cache
- Document Store:
- Data is stored as documents (e.g., JSON) grouped into collections
- The benefit is that it is a lot more flexible
- The most popular is MongoDB
- Wide-Column:
- Suited to workloads with a lot of writes and fewer reads and updates, like time series
- Might have a schema as well
- The most optimized for a massive amount of writes
- Cassandra and Google's BigTable.
- Graph DB:
- Some graph databases are built on top of SQL databases in some ways
- Efficient, for example, for Facebook-style friendship graphs
SQL vs NoSQL:
- NoSQL dbs can generally scale a lot better than SQL dbs
- Mostly because of the constraints on SQL: ACID
- Scaling horizontally for SQL is quite difficult because of ACID
- joining data across two servers while enforcing the constraints is really difficult
- For NoSQL we don't have the constraints and therefore they can scale better
- ACID for SQL is like BASE (Basically Available, Soft state, Eventual consistency) for NoSQL
- Eventual Consistency
- Leader-Follower Concept:
- We have two databases that are supposed to be replicas of each other
- Writes should go to the leader database
- The leader will eventually make sure the same data reaches the follower(s)
- For a short window, requests that go to the followers might see stale values
- We can handle a higher read capacity
- Partitioning data is available for SQL databases as well: Sharding
- not implemented by default in most SQL databases
- it is much more complicated there (joins and constraints across shards); that's why it is not built in
Replication
Leader-Follower Replication: We talked about it a bit earlier.
- Asynchronous vs. Synchronous Replication
Leader-Leader Replications:
- Pros:
- We can read from and write to every single node, scaling reads as well as writes.
- Cons:
- Keeping the nodes consistent is difficult (write conflicts)
- Each leader usually serves a different part of the world, independently.
Sharding
- When you have a massive amount of data, a single query can take seconds to complete.
- We put half of the rows in a database on machine 1 and the second half in another database on machine 2.
- How do we divide the data, then?
- We use a shard key, e.g., gender (female/male)
- Or range-based: last names A-L go to machine 1 and M-Z go to machine 2
- Running joins between two machines is gonna be complicated
- Hash-based Sharding
- We hash the shard key and use the hash to pick the machine, which spreads rows evenly
- SQL databases like MySQL and Postgres do not provide sharding out of the box; we would have to implement it ourselves.
- RDBMSs were simply not designed with sharding in mind.
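Hash-based sharding is essentially one line: hash the shard key, take it modulo the shard count. Note that Python's built-in `hash()` is randomized per process for strings, so a stable hash must be used; `crc32` here is just one convenient choice.

```python
# Hash-based sharding: hash the shard key, mod the shard count.
import zlib

def shard_for(key: str, num_shards: int) -> int:
    # crc32 is stable across runs, unlike Python's built-in hash() on str.
    return zlib.crc32(key.encode()) % num_shards

s = shard_for("user:42", 4)
assert 0 <= s < 4
assert shard_for("user:42", 4) == s  # same key always lands on the same shard
```

The catch, as noted under consistent hashing: with plain modulo, changing `num_shards` remaps almost every key, which is why consistent hashing exists.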
CAP Theorem
- Consistency
- Availability
- Partition Tolerance
It is like having two servers, a leader and a follower. A (network) partition is when the two servers cannot talk to each other; partition tolerance means the system keeps working despite this. When a partition happens, we have two options:
- Prioritize consistency, which means we make the followers unavailable (no stale reads)
- Prioritize availability, which means users can still send requests to the followers, even though some data is not up to date
PACELC: Given P (a network partition), choose A or C. Else (no partition, the network between servers is working), favor Latency or Consistency.
Object Storage
The predecessor to object storage was blob storage.
BLOB: Binary Large OBjects.
- Once we put files there, we cannot edit them in place
- Optimized for storing large amounts of data
- Similar to hashmaps, your files' IDs/keys should be globally unique
Use Cases:
- When you're storing large files that your applications need long-term
- Metadata can still be saved in databases
- Storing large files in a database does not make sense
- You can easily send one simple GET HTTP request and retrieve your file
Message Queues
- When a large number of events hits a server and the server cannot keep up, message queues come into the picture.
- The data will be stored on disk
- Usually events will be processed FIFO
Data transportation between queue and server:
- Server can pull from the queue
- Queue can push the data to the servers
- Sometimes the server responds with an acknowledgment (ack)
Pub/Sub:
- There are multiple publishers that produce the events
- A middle layer of topics feeds into different subscriptions
Note:
- The benefit is that if an application or server is added, the architecture easily scales and adapts by defining more topics and subscribers.
- Popular message queues include RabbitMQ and Kafka
- Popular cloud-based one is Google's pub/sub
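The topic fan-out described above can be sketched in-process with callbacks standing in for subscriptions (topic and message names are made up; real systems like Kafka or Google Pub/Sub do this durably across machines):

```python
# An in-process pub/sub sketch: topics fan events out to every subscriber.
from collections import defaultdict

subscribers = defaultdict(list)   # topic -> list of subscriber callbacks

def subscribe(topic, callback):
    subscribers[topic].append(callback)

def publish(topic, message):
    for callback in subscribers[topic]:
        callback(message)          # every subscriber receives the event

received_a, received_b = [], []
subscribe("orders", received_a.append)
subscribe("orders", received_b.append)
publish("orders", "order-123 created")

assert received_a == ["order-123 created"]
assert received_b == ["order-123 created"]
```

Adding a new consumer is just another `subscribe` call, which is the scaling/adaptability benefit the notes mention.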
MapReduce
- We have a large amount of personal data
- For each record, we want to redact the personal information
- If we need to process it quickly, we may need multiple servers working in parallel
Now we have two types of processing:
- Batch processing:
- We have all the data upfront
- Micro-batch processing is like processing a bunch of data every 30 seconds.
- Streaming
- We are processing data in real time, as it arrives
Note:
- Spark and Hadoop are examples of MapReduce-style frameworks; probably Flink too (though Flink focuses on streaming)
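The MapReduce shape itself fits in a few lines, shown here on the classic word-count example rather than the redaction use case (the input records are made up):

```python
# MapReduce in miniature: map each record to key/value pairs, shuffle
# (group) by key, then reduce each group.
from collections import defaultdict

records = ["big data", "big compute", "data"]

# Map: emit (word, 1) pairs
mapped = [(word, 1) for record in records for word in record.split()]

# Shuffle: group values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: sum each group
counts = {key: sum(values) for key, values in groups.items()}
assert counts == {"big": 2, "data": 2, "compute": 1}
```

Frameworks like Spark and Hadoop run the map and reduce phases on many machines in parallel, with the shuffle moving data between them over the network.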