Database Engineering
brian | Published: March 5, 2024, 11:57 a.m. | Updated: May 25, 2025, 5:45 p.m.
Transaction: A transaction is a unit of work that is performed within a database management system. In simple terms, a transaction generally means anything that changes within the database.
Note: Transactions provide reliability even in case of system failures.
Example: Let's say Person A (Current total = $100) wants to send $50 to Person B (Current total = $200).
1. We will deduct $50 from Person A (New total = $50)
2. Now let's say there is a system failure, and Person B never receives the $50, yet we already removed $50 from Person A
3. In the end, since a transaction is atomic (the transaction must complete in its entirety, or no change will be performed), the withdrawal will roll back, and the final state will be as if there was never a transaction in the first place. A transaction is not only atomic; it also has the other ACID properties (consistency, isolation, and durability), which we will explain in the next sections.
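The rollback behavior above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a real banking implementation; the table and account names are made up for the example:

```python
import sqlite3

# In-memory database with two accounts, mirroring the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("A", 100), ("B", 200)])
conn.commit()

try:
    # Both updates belong to one transaction: all or nothing.
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'A'")
    raise RuntimeError("system failure")  # simulate a crash mid-transfer
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'B'")
    conn.commit()
except RuntimeError:
    conn.rollback()  # atomicity: undo the partial deduction

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'A': 100, 'B': 200} -- as if the transfer never began
```

Because the crash happens before `commit()`, the rollback erases the deduction from A, and both balances come back unchanged.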
Atomicity: atomicity ensures that either everything in the transaction occurs, or nothing occurs at all. If there is a system failure during a transaction, everything will be rolled back, and any changes that may have been made during the transaction will be undone.
Consistency: consistency ensures that a transaction takes the database from one valid state to another, so the rules and constraints defined on the data are never violated.
Example: Let's say I have an e-commerce application where I sell certain items. When someone buys an item from my store, I want that purchase to be reflected in the database so that the total number of items left is accurate. So let's say someone buys the last item from my store, and then someone else tries to buy that same item. With consistency, the second user should not be able to buy the item because there are none left.
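One simple way to enforce this rule is a conditional update plus a database constraint, sketched here with sqlite3 (the `items` table and `buy` helper are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint makes negative stock impossible at the database level.
conn.execute(
    "CREATE TABLE items (name TEXT PRIMARY KEY, stock INTEGER CHECK (stock >= 0))")
conn.execute("INSERT INTO items VALUES ('widget', 1)")
conn.commit()

def buy(conn, name):
    # Only decrement when stock remains; rowcount tells us if the buy succeeded.
    cur = conn.execute(
        "UPDATE items SET stock = stock - 1 WHERE name = ? AND stock > 0", (name,))
    conn.commit()
    return cur.rowcount == 1

first = buy(conn, "widget")   # buys the last item
second = buy(conn, "widget")  # nothing left, so the update matches no row
print(first, second)          # True False
```

The second buyer's `UPDATE` matches zero rows, so the purchase is rejected and the database stays consistent.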
Isolation: isolation determines how and when the changes made by one transaction become visible to other transactions and users.
* Can transactions that are currently executing see changes that other concurrently executing transactions make?
* These interactions are known as "read phenomena". Example: A queries some data. Then B updates that data. Finally, A queries/reads that data again.
* There are 3 read phenomena:
1. Dirty Reads
2. Non-repeatable reads
3. Phantom reads
Dirty Read: A dirty read is a phenomenon that occurs in database systems when one transaction reads data that has been modified by another transaction but has not yet been committed.
Ex: "1" reads some data, "2" updates the data, "1" reads the data again, but hold up... "2" made a mistake and decides to roll back. Now "1" has incorrect data, and if "1" saves it to the database, there will be inconsistency throughout the entire database, because whoever retrieves this newly saved data will actually get incorrect information.
Non-repeatable Reads: A non-repeatable read occurs when a transaction retrieves a row twice and that row is updated by another transaction that is committed in between.
Ex:
"1" retrieves -> "2" updates & commits -> "1" retrieves again
Phantom Reads: A phantom read occurs when a transaction retrieves a set of rows twice and new rows are inserted into or removed from that set by another transaction that is committed in between.
Ex: Let's say you queried all items in the store (with a date range filter), and you got your response. Now let's say another transaction adds a new item to your store. If you query all the items in the store again, you will get a different result.
Isolation Levels
1. Read Uncommitted: This is the lowest isolation level. At this level, dirty reads are allowed, so one transaction may see not-yet-committed changes made by other transactions.
2. Read Committed: Each query in a transaction only sees changes committed by other transactions. For example, if "2" changes a value, "1" won't be able to see it until "2" has committed it first, which prevents dirty reads. However, this doesn't restrict further changes after the value has been read, so if we make another query, there is no guarantee that we will get the same results. Note: PostgreSQL's default isolation level is Read Committed.
3. Repeatable Reads: Each transaction obtains a consistent snapshot of the database at the start of the transaction. Throughout the transaction, all queries see the same snapshot of the database as it existed at the start. Basically, if I query some data, it will remain the same throughout this transaction; it will never change.
Note 1: This DOES NOT eliminate phantom reads. For example, let's say we are querying every person aged 21 and above. The values we already read are guaranteed not to change, but that does not prevent new rows from being added or deleted in the middle of the transaction.
Note 2: In PostgreSQL, Repeatable Read does prevent phantom reads.
4. Snapshot: Every single query in a transaction only sees changes that were committed up to the start of the transaction. It's basically a snapshot version of the database at that precise moment. This eliminates all read phenomena, along with Serializable, which I talk about below.
5. Serializable: This is the highest level of isolation; here each transaction operates as if it were the only transaction executing on the database. So anything that I read must not depend on any other transaction that is currently running.
Ex: If I begin a transaction, and query some items, and then another transaction creates new items in the same table, if I query again in the first transaction, those new items that were just added will NOT show up in the query.
Durability: Durability ensures that once a transaction is committed, its changes are permanently stored in the database and will not be lost even in the event of a system failure or crash right after.
Techniques for achieving durability:
1. WAL (Write-Ahead Log): The database saves all changes being made to an append-only log on disk before applying them, which guarantees persistence. If there is a system failure, we can replay the WAL to recover committed changes that had not yet reached the data files and reconstruct the database to its state just before the crash.
2. Asynchronous snapshot technique: When we make changes, everything is kept in memory, but in the background we are snapshotting everything to disk.
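The WAL idea from point 1 can be sketched as a toy key-value store: every committed change is appended to a log file and forced to disk before it touches the in-memory state, and recovery simply replays the log. The `commit`/`recover` helpers and log format are invented for the sketch:

```python
import json
import os
import tempfile

LOG = os.path.join(tempfile.mkdtemp(), "wal.log")

def commit(db, key, value):
    # Write-ahead: the change hits the durable log *before* the in-memory state.
    with open(LOG, "a") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")
        f.flush()
        os.fsync(f.fileno())  # force the record onto disk
    db[key] = value

def recover():
    # After a crash, replay the log in order to rebuild the committed state.
    db = {}
    with open(LOG) as f:
        for line in f:
            rec = json.loads(line)
            db[rec["key"]] = rec["value"]
    return db

db = {}
commit(db, "A", 50)
commit(db, "B", 250)
db = None      # simulate losing the in-memory state in a crash
db = recover()
print(db)      # {'A': 50, 'B': 250} -- rebuilt from the log
```

Real WALs also record transaction boundaries so half-written transactions can be skipped during replay; this sketch only shows the append-then-apply ordering.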
Durability and the OS cache: When the database asks the OS to write to disk, the OS will actually write to its own memory cache (RAM) instead, and once it's done, it will tell the database that it has written to disk even though that's not the case; in reality it only wrote to the cache. The reason the OS does this is that it wants to bundle up all the writes and then flush them to disk all at once to increase performance.
Problem: Let's say we commit; the database asks the OS to write the WAL segments to disk, but the OS instead writes them to its cache and tells the database they were saved to disk (we know that's a lie). Now let's say there is an OS crash; when the OS restarts, that cached data is lost. In this case the database is not durable, because it acknowledged a commit that never actually reached disk.
Solution: There is an OS system call, fsync, which ensures that data is physically written from the operating system's cache to persistent storage (such as a hard disk). Downside: fsync calls are expensive and can slow down commits.
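From an application, the call sequence looks like this in Python (the file name is made up; the pattern is the point):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "wal_segment.bin")

with open(path, "wb") as f:
    f.write(b"committed transaction record")
    f.flush()             # push from Python's buffer into the OS page cache
    os.fsync(f.fileno())  # force the OS to flush its cache to the physical disk

# Only after fsync returns can the database honestly report the commit durable.
data = open(path, "rb").read()
print(data)
```

Note that `flush()` alone only empties the process's own buffer into the OS cache; it is the `fsync` that defeats the "lie" described above.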