The journey of a query in BigQuery.

Before diving deep into BigQuery, let's find out what happens behind the scenes when you hit the "Run" button to submit your query.

1️⃣ HTTP POST

Whenever you run a query through the console or an SDK, the client sends an HTTP POST request to the BigQuery endpoint.

The request carries an OAuth2 token for authorization and a JSON payload that includes the query you want to run.
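As a sketch, the request a client builds might look like the following. The endpoint path and payload shape follow BigQuery's public `jobs.insert` REST API; the project ID, token, and query text are placeholders.

```python
import json

# Sketch of the HTTP POST a client might send to the BigQuery REST API.
# PROJECT_ID and the token value are placeholders, not real credentials.
PROJECT_ID = "my-project"
endpoint = f"https://bigquery.googleapis.com/bigquery/v2/projects/{PROJECT_ID}/jobs"

headers = {
    "Authorization": "Bearer <OAUTH2_ACCESS_TOKEN>",  # placeholder token
    "Content-Type": "application/json",
}

# JSON payload: the query job configuration.
payload = {
    "configuration": {
        "query": {
            "query": "SELECT name FROM `my-project.my_dataset.my_table` LIMIT 10",
            "useLegacySql": False,
        }
    }
}

body = json.dumps(payload)
print(endpoint)
print(body)
```

In practice the SDKs build and send this request for you; the sketch only shows what goes over the wire.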

2️⃣ Routing

The request is routed to the BigQuery endpoint over the internet. This address is served by a Google Front End (GFE) server, the same type of server that fronts Google Search.

The router transforms the JSON HTTP request into Protocol Buffers (Protobufs), the serialization format used for internal communication between Google services.
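The actual Protobuf schemas are internal, but as a rough illustration of why a binary wire format is preferred over JSON internally, compare the same record encoded both ways (the fixed-width packing below is a simplification, not real Protobuf encoding):

```python
import json
import struct

# Rough illustration (not actual Protobuf): the same record encoded as JSON
# text versus a simple packed binary layout. Binary wire formats avoid
# repeating field names as strings and encode numbers compactly.
record = {"job_id": 42, "priority": 1}

json_bytes = json.dumps(record).encode("utf-8")
# Pack both integers as fixed 4-byte little-endian values for simplicity.
binary_bytes = struct.pack("<ii", record["job_id"], record["priority"])

print(len(json_bytes), len(binary_bytes))  # binary is far smaller
```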

3️⃣ Job Server

This component is responsible for keeping track of the state of a request.

Since the network connection between the client and the BigQuery server is not expected to last forever, and some queries can take hours to run (up to the 6-hour maximum), the Job Server is designed to operate asynchronously.
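This asynchronous pattern can be sketched as follows: the client submits a job, immediately gets a job ID back, and polls for the job's state rather than holding one long-lived connection open. `get_job_state` here is a stand-in for the real `jobs.get` endpoint and returns canned states purely for illustration.

```python
import itertools
import time

# Canned sequence of job states, standing in for what jobs.get would return.
fake_job_states = itertools.chain(
    ["PENDING", "RUNNING", "RUNNING"], itertools.repeat("DONE")
)

def get_job_state(job_id: str) -> str:
    # A real client would call the jobs.get REST endpoint here.
    return next(fake_job_states)

def wait_for_job(job_id: str, poll_interval: float = 0.0) -> str:
    """Poll until the job reaches the DONE state, then return it."""
    while True:
        state = get_job_state(job_id)
        if state == "DONE":
            return state
        time.sleep(poll_interval)

print(wait_for_job("job_123"))  # -> DONE
```

Because the job's state lives server-side, the client can disconnect, reconnect, and resume polling with the same job ID.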

4️⃣ Query Engine

The query then goes to the Query Master, which is responsible for the overall execution.

It contacts the metadata server to figure out the physical data layout in Colossus.

This is where partition pruning happens: only the metadata of the partitions relevant to the query is returned.
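Conceptually, pruning is a filter over partition metadata: given a predicate on the partitioning column, partitions that cannot match are dropped before any data is scanned. The metadata fields and the `prune` helper below are illustrative, not BigQuery internals.

```python
from datetime import date

# Hypothetical partition metadata for a date-partitioned table.
partitions = [
    {"partition_date": date(2024, 1, 1), "bytes": 10_000},
    {"partition_date": date(2024, 1, 2), "bytes": 12_000},
    {"partition_date": date(2024, 1, 3), "bytes": 9_000},
]

def prune(partitions, lower_bound):
    """Keep only partitions that can satisfy `partition_date >= lower_bound`."""
    return [p for p in partitions if p["partition_date"] >= lower_bound]

pruned = prune(partitions, date(2024, 1, 2))
print([p["partition_date"].isoformat() for p in pruned])
# -> ['2024-01-02', '2024-01-03']
```

Fewer surviving partitions means fewer bytes scanned, which is also why the Query Master can estimate how much data is involved before requesting slots.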

After determining how much data is involved, the Query Master forms an initial query plan and requests slots from the scheduler.

When the query finishes executing, the results are split into two parts.

The first part is stored in Spanner along with the query metadata.

The remaining data is written to Colossus, Google's distributed file system.
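The split can be sketched as a simple slice: a small first page kept next to the job metadata (so small results come back fast), with the remainder spilled to distributed storage. The page size and names below are made up for the example.

```python
# Illustrative split of a result set: the first page travels with the job
# metadata (stored in Spanner), the rest is written to distributed storage
# (Colossus). FIRST_PAGE_ROWS is an arbitrary example value.
FIRST_PAGE_ROWS = 2

rows = [{"id": i} for i in range(5)]

first_page = rows[:FIRST_PAGE_ROWS]   # stored alongside query metadata
spilled = rows[FIRST_PAGE_ROWS:]      # written to the distributed file system

print(len(first_page), len(spilled))  # -> 2 3
```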
