Proof of SQL™
The Space and Time engineering team is aggressively building out the capabilities of Proof of SQL™: novel cryptography that allows the data warehouse to generate a SNARK cryptographic proof of SQL query execution, proving that query computation was done accurately and that both the query and the data are verifiably tamperpoof.
About Tamperproof Queries
Tamperproof queries require two parties: a verifier and a prover. For the Space and Time architecture, the Validator nodes play the role of the verifier, and the database clusters play the role of the prover.
The process consists of two types of interactions: proof ingestion, and query requests.
Proof Ingestion
When a service or a client sends data that is to be added to the database, that data is routed to the verifier so that the verifier can create a commitment to that data. This commitment is a small “digest” of the data but holds enough information for the rest of the protocol to ensure that the data is not tampered with. After [Note: the data can be routed immediately, but the verifier only "locks in" the data after creating this commitment. Some extra precautions are needed when routing immediately to keep the commitment properly synced with the database] creating this commitment, the verifier can route the data to the database for storage. The verifier stores this commitment for later usage.
An important design constraint is that the commitment must be updatable. In other words, suppose that the verifier already holds the commitment to a specific table, but new data is to be ingested and appended to the table. To do this efficiently, the verifier must be able to combine the old commitment with the incoming data to create a new commitment to the entire updated table. The key constraint here is that the verifier must be able to do this without access to the old existing data.
Note: at no stage of this process does the prover need to do anything other than accept and store incoming data.
Query Requests
When a service or a client sends a query request, that request is routed through the Validator to the prover, which is the database. [Because the Validator plays the role of the verifier, the verifier is routing the request. However, this isn't strictly necessary, and the request could be routed directly to the prover without even opening the packet.] At this point, the prover parses the query, computes the correct result, and produces a proof of SQL. It then sends both the query result, and the proof to the verifier.
Once the verifier has this proof, it can use the commitment to check the proof against the result and verify that the prover has produced the correct result to the query request. This query result can then be routed back to the service or client along with a tamperproof success flag. However, if the proof does not pass, the verifier does not route the result, but sends a failure message instead.
Primitives
Non-interactive randomness
Conceptually, it is easier to design a protocol for proving results when the prover and verifier are able to have a multi-round back-and-forth protocol. For example, one fundamental building block in every protocol is the use of random challenges. Often, the verifier wants to check if a claim from the prover is true. To do this, the verifier sends a random challenge to the prover. Since the prover sends the claim to the verifier before being sent the challenge, the prover has no way of knowing it ahead of time and cannot maliciously modify the claim for that specific challenge. If the claim and the challenges are appropriately structured, each challenge with a correct response lets the verifier know that it is highly unlikely that the prover is malicious. More complex claims can require more challenge/response interactions.
This multi-round interaction between the two parties would lead to extreme latency problems, especially for more complex queries. Fortunately, this problem is solved by the Fiat–Shamir heuristic, which works by essentially allowing the prover to simulate the entire interaction. In this situation, the prover generates the random challenges, but by the design of the heuristic, still cannot learn the challenges before "sending" the claim. This allows us to utilize the power of random challenges while still allowing the prover to only send one message in response to the verifier.
Commitments
The biggest computational cost revolves around the commitments and their uses. So, the decision of which commitment scheme we use is important. Fortunately, the commitment scheme that we chose can be treated as a "black box", so this choice can be modified as the product evolves.
There are three main^[This is, perhaps, a bit of a generalization.] commitment schemes that often get used: Pedersen commitments, KZG commitments, and FRI commitments. The commitment scheme that we are initially using is a generalized Pedersen commitment. There are several reasons for this.
Simplicity
A Pedersen commitment is a simple construction, which lends itself to faster development, and shorter time to market. It also has nice mathematical properties, which simplifies other components of the protocol. As a brief introduction, suppose a0, a1, a2,… is a column of data. Then, the Pedersen commitment is simply a0⋅G0⊕a1⋅G1⊕a2⋅G2⊕⋯ where the Gi's are "random" public parameters. Of note is that these Gi's are transparent parameters and don't require a trusted setup phase, as is required by some other commitment schemes.
Easy Updates
As mentioned in the overview, an important design constraint is that these commitments should be updatable. All that needs to be done to append to a commitment is to add an extra term to the above sum for each new piece of data. [Other schemes are updatable but may require a more complex architecture.]
Speed
While Pedersen commitments aren't the fastest commitments (FRI commitment are quite a bit faster) to compute, we have made some choices so that they are still fast enough. Many commitment schemes rely on a choice of an elliptic curve. Because we need fast computation times, we are using Curve25519 for our first version because it is one of the fastest elliptic curves [The reason to use another curve would be if we wanted to use protocols that required elliptic curve pairings. This is helpful, but the cost is that pairing-friendly curves are significantly slower.], and it has a strong reputation in cryptography. Additionally, because the structure of the commitment is very repetitive, computation can be easily parallelized. As a result, we are pushing the computation of these commitments to GPUs, which can compute them 100-1000 times faster than CPUs.
A final note should be made about the opening scheme we are using for commitments. Because asymptotic verifier time and proof size isn't as important for the SQL tamperproofing application, we can use schemes that aren't as succinct as most other applications require. This means that we can use schemes that are both simpler and require less round-trip time. For these reasons, we have chosen to use the commitment scheme that Bulletproofs use [We are not using the surrounding bulletproof protocol, just its PCS.]. This is an attractive option because it is a well-tested, open-source system that we can use quickly. However, as with almost everything else in this section on commitments, this is effectively a "black box", so it is likely that as our product evolves, we will exchange this scheme for a commitment scheme specifically tailored to our use case.
Example
We will walk through an example where the client creates a table, appends to it, and then queries the table.
Table creation
The client sends the following data to the verifier to create a table with some data.
Example Employees table:
Weekly Pay | Yearly Bonus |
---|---|
4000 | 50000 |
7500 | 0 |
5000 | 400000 |
1500 | 0 |
The verifier accepts this data and computes the commitment to each column. In this case, these are:
Cpay = 2000 ⋅ G0 ⋅ G1 ⊕ 5000⋅ G2 ⊕ 1500 ⋅ G3
Cbonus = 50000 ⋅ G0 ⊕ 0 ⋅ G1 ⊕ 40000⋅ G2 ⊕ 0 ⋅ G3
The verifier then stores these commitments and sends the data to the prover, who creates a new table in the database.
Table append
Suppose the client then sends the following new rows of data to the verifier.
Rows to append to Employees table:
Weekly Pay | Yearly Bonus |
---|---|
3000 | 100000 |
4500 | 30000 |
The verifier accepts this data and updates the commitment to each column. In this case, the new commitments are:
Cnew pay = Cold pay ⊕ 3000 ⋅ G4 ⊕ 3000 G5
Cnew bonus = Cold bonus ⊕ 100000 ⋅ G4 ⊕ 30000 G5
The verifier then sends the new rows to the prover, who appends them to the table in the database.
SQL Query
Finally, the client decides that it want to know the total compensation. So, it sends the following SQL query to the verifier.
Total compensation SQL query:
SELECT Pay*52+Bonus FROM Employees
The verifier routes the query to the prover who then executes the query by taking the Pay column, multiplying it by 52 and adding the Bonus column. This is the result, which is sent back to the verifier. Because this query is so simple, the prover does not need to send any additional proof to the verifier.
Employees table along with total compensation:
Weekly Pay | Yearly Bonus |
---|---|
4000 | 50000 |
7500 | 0 |
5000 | 400000 |
1500 | 0 |
3000 | 100000 |
4500 | 30000 |
The verifier receives this result column from the prover. To verify it, the verifier must mirror the provers computation by taking the Pay commitment, multiplying it by 52 and adding the Bonus commitment. The happens to be exactly the commitment [Most operations are not this simple, and for more complex operations, a proof is required as well] to what the result should be, which is:
Ccorrect result = 52 ⋅ Cpay ⊕ Cbonus
The verifier then also computes the commitment to the result that got sent from the prover. If the prover sent the correct result, this should be:
Csent result = 258000 ⋅ G0 ⊕ 390000 ⋅ G1 ⊕ 660000 ⋅ G2 ⊕ 78000 ⋅ G3 ⊕ 256000 ⋅ G4 ⊕ 254000 ⋅ G5
If the prover sent the wrong result, this will be something else. Then, the verifier checks to see if:
Ccorrect result = Csent result
[By combining all of the previous equations one can see that these two should be the same]. If it does, the verifier sends the result to the client along with a flag that confirms that the result is correct. If the two values are not equal the verifier tells the client that there was an error [The verifier also should alert the network to the fact that the prover is potentially malicious].
Updated 4 months ago