https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
The spanner paper
Externally consistent distributed transactions
Lock-free read-only transactions
Atomic schema updates
Data is stored in schematized semi-relational tables; data is versioned, and each version is automatically timestamped with its commit time; old versions of data are subject to configurable garbage-collection policies; and applications can read data at old timestamps. Spanner supports general-purpose transactions, and pro- vides a SQL-based query language.
Apps can control replication
These features enable Spanner to support con- sistent backups, consistent MapReduce executions [12], and atomic schema updates, all at global scale, and even in the presence of ongoing transactions.
In addition, the serialization order satisfies external consistency (or equivalently, linearizability [20]): if a transaction T1 commits before another transaction T2 starts, then T1's commit timestamp is smaller than T2's. Spanner is the first system to provide such guarantees at global scale.
The API directly exposes clock uncertainty, and the guarantees on Spanner's times- tamps depend on the bounds that the implementation pro- vides
externally-consistent distributed transactions, lock- free read-only transactions, and atomic schema update
It then describes the directory abstraction, which is used to manage replica- tion and locality, and is the unit of data movement.
We currently run a test/playground universe, a development/production uni- verse, and a production-only universe.
Unlike Bigtable, Spanner assigns timestamps to data, which is an important way in which Spanner is more like a multi-version database than a key-value store.
Our Paxos implementation supports long-lived leaders with time-based leader leases, whose length defaults to 10 seconds. (Note that having a long-lived Paxos leader is critical to efficiently managing the lock table.)
Our implementation of Paxos is pipelined, so as to improve Spanner's throughput in the presence of WAN latencies; but writes are applied by Paxos in order (a fact on which we will depend in Section 4).
In both Bigtable and Span- ner, we designed for long-lived transactions (for exam- ple, for report generation, which might take on the order of minutes), which perform poorly under optimistic con- currency control in the presence of conflicts.
a set of contiguous keys that share a common prefix
lexicographically contiguous partition of the row space. Instead,
For expository clarity we have over-simplified
Two -phase commit
It looks like SQL with some extensions to support protocol-buffer-valued fields.
Spanner's data model is not purely relational, in that rows must have names. More precisely, every table is re- quired to have an ordered set of one or more primary-key columns. This requirement is where Spanner still looks like a key-value store: the primary keys form the name for a row, and each table defines a mapping from the primary-key columns to the non-primary-key columns. A row has existence only if some value (even if it is NULL) is defined for the row's keys. Imposing this struc- ture is useful because it lets applications control data lo- cality through their choices of keys.
This interleaving of tables to form directories is significant because it allows clients to describe the locality relation- ships that exist between multiple tables, which is nec- essary for good performance in a sharded, distributed database. Without it, Spanner would not know the most important locality relationships.
an interval with bounded time uncertainty (un- like standard time interfaces that give clients no notion of uncertainty)
Daemons apply a variant of Marzullo's algorithm [27] to detect and reject liars, and synchronize the local machine clocks to the non- liars.
In our production environ- ment, ε is typically a sawtooth function of time
A read-only transaction is a kind of transaction that has the performance benefits of snapshot isolation [6]. A read-only transaction must be predeclared as not hav- ing any writes; it is not simply a read-write transaction without any writesGlossary
Reads in a read-only transaction ex- ecute at a system-chosen timestamp without locking, so that incoming writes are not blocked
A snapshot read is a read in the past that executes with- out locking. A client can either specify a timestamp for a snapshot read, or provide an upper bound on the desired timestamp's staleness and let Spanner choose a time- stamp. In either case, the execution of a snapshot read proceeds at any replica that is sufficiently up-to-date.
A potential leader sends requests for timed lease votes; upon receiving a quorum of lease votes the leader knows it has a lease.
Spanner depends on the following disjointness invariant:
each Paxos leader's lease interval is disjoint from every other leader's.
Spanner depends on the following monotonicity in- variant: within each Paxos group, Spanner assigns times- tamps to Paxos writes in monotonically increasing or- der, even across leaders.
Reads within read-write transactions use wound- wait [33] to avoid deadlocks.
Having the client drive two-phase commit avoids send- ing data twice across wide-area links
The leader-lease intervals are disjointed
because of two ar- tifacts of our experiment
Shorter lease times would reduce the effect of server deaths on availability, but would require greater amounts of lease-renewal network traffi
How- ever, there can be significant tail-latency issues that cause higher values of ε. The reduction in tail latencies begin- ning on March 30 were due to networking improvements that reduced transient network-link congestion. The in- crease in ε on April 13, approximately one hour in dura- tion, resulted from the shutdown of 2 time masters at a datacenter for routine maintenance. We continue to in- vestigate and remove causes of TrueTime spikes.
Application semantics requires transactions across arbitrary data, and consistent reads.
Spanner's timestamp semantics made it efficient for F1 to maintain in-memory data structures computed from the database state
F1 takes full snapshots of data at a timestamp to initialize its data structures, and then reads incremental changes to update them
The new paper is paper-spanner-becoming-a-sql-system