The journey of a query in BigQuery.
Before diving deep into BigQuery, let's find out what happens behind the scenes when you hit the "Run" button to submit your query.
1️⃣ HTTP POST
Whenever you run a query through the console or an SDK, the client sends an HTTP POST request to the BigQuery endpoint. The request carries an OAuth2 token for authorization and a JSON payload that includes the query you want to run.
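To make that concrete, here is a rough sketch of what such a request looks like when built by hand against the REST API's `jobs.query` method. The project ID, token, and query are placeholders, and the sketch only constructs the request rather than sending it:

```python
import json

PROJECT_ID = "my-project"          # hypothetical project ID
ACCESS_TOKEN = "ya29.placeholder"  # hypothetical OAuth2 access token

# REST endpoint for the bigquery.jobs.query method.
url = f"https://bigquery.googleapis.com/bigquery/v2/projects/{PROJECT_ID}/queries"

headers = {
    "Authorization": f"Bearer {ACCESS_TOKEN}",  # OAuth2 token for authorization
    "Content-Type": "application/json",
}

# The JSON payload carrying the SQL you want to run.
payload = {"query": "SELECT 1", "useLegacySql": False}
body = json.dumps(payload)

# Actually sending it would be a POST of `body` with `headers` to `url`
# (e.g. via urllib.request); omitted so the sketch stays offline.
print(url)
print(body)
```

In practice the console or client library assembles all of this for you; the shape of the request is the point here.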
2️⃣ Routing
The request is routed over the internet to the BigQuery endpoint. This address is served by a Google Front End (GFE) server, the same type of server that fronts Google Search.

The router transforms the JSON HTTP request into Protocol Buffers (protobufs), the serialization format used for communication internally between Google services.
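One reason for that transformation is compactness: a binary, field-numbered encoding drops the repeated field names that JSON carries. The toy tag/length encoding below stands in for the real protobuf wire format (which uses varints and differs in detail) purely to illustrate the size difference:

```python
import json
import struct

query = "SELECT 1"
use_legacy_sql = False

# The JSON form, as it arrives at the front end.
as_json = json.dumps({"query": query, "useLegacySql": use_legacy_sql}).encode()

# A toy binary form: field number, length, then raw bytes -- a stand-in
# for protobuf's tag/length/value layout, not the actual wire format.
as_binary = (
    struct.pack("B", 1) + struct.pack("B", len(query)) + query.encode()  # field 1: query
    + struct.pack("B", 2) + struct.pack("B", int(use_legacy_sql))        # field 2: bool
)

print(len(as_json), len(as_binary))  # the binary form is much smaller
```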
3️⃣ Job Server
This component is responsible for keeping track of the state of a request. Since the network connection between the client and the BigQuery server is not expected to last forever, and some queries can take hours to run (up to the 6-hour maximum), the Job Server is designed to operate asynchronously.
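That asynchronous design shows up in the API: the client receives a job ID right away and then polls for the job's state with independent requests, rather than holding one connection open for hours. A minimal offline sketch of the polling pattern, where `fetch_job_state` is a stand-in for the real `jobs.get` call:

```python
import itertools
import time

# Simulated server-side job lifecycle for this sketch.
_states = itertools.chain(["PENDING", "RUNNING", "RUNNING"], itertools.repeat("DONE"))

def fetch_job_state(job_id: str) -> str:
    """Stand-in for GET /projects/{p}/jobs/{job_id}, returning the job status."""
    return next(_states)

def wait_for_job(job_id: str, poll_interval: float = 0.01) -> str:
    # Each poll is a separate request, so the client can disconnect
    # and resume checking on the job later using only the job ID.
    while True:
        state = fetch_job_state(job_id)
        if state == "DONE":
            return state
        time.sleep(poll_interval)

final = wait_for_job("job_123")
print(final)
```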
4️⃣ Query Engine
The query then goes to the Query Master, which is responsible for the overall execution. It contacts the metadata server to figure out the physical data layout in Colossus. This is where partition pruning happens: only the metadata of the active partitions is returned. After determining how much data is involved, the Query Master forms an initial query plan and requests slots from the scheduler.
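Partition pruning can be pictured as a metadata-only filter: the query's predicate is compared against each partition's bounds, and partitions that cannot match are dropped before any data is read. A simplified sketch with invented date-partitioned metadata:

```python
from datetime import date

# Invented partition metadata: one entry per daily partition.
PARTITIONS = [
    {"id": "20240101", "min_date": date(2024, 1, 1), "max_date": date(2024, 1, 1)},
    {"id": "20240102", "min_date": date(2024, 1, 2), "max_date": date(2024, 1, 2)},
    {"id": "20240103", "min_date": date(2024, 1, 3), "max_date": date(2024, 1, 3)},
]

def prune(partitions, lo, hi):
    """Keep only partitions whose date range overlaps [lo, hi]."""
    return [p for p in partitions if p["max_date"] >= lo and p["min_date"] <= hi]

# Predicate equivalent to: WHERE d BETWEEN '2024-01-02' AND '2024-01-03'
active = prune(PARTITIONS, date(2024, 1, 2), date(2024, 1, 3))
print([p["id"] for p in active])  # only two of the three partitions survive
```

Because the decision uses only partition metadata, the pruned partitions cost nothing to scan.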
When the query finishes executing, the results are split into two parts. The first part is stored in Spanner along with the query metadata; the remaining data is written to Colossus, Google's distributed file system.
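The split above can be sketched as: keep a small first page of rows next to the job metadata for fast retrieval, and spill the rest to bulk storage. The dicts and page size below are stand-ins invented for illustration, not BigQuery's actual layout:

```python
FIRST_PAGE_SIZE = 2  # invented page size for this sketch

spanner = {}   # stand-in for the metadata store (Spanner)
colossus = {}  # stand-in for the distributed file system (Colossus)

def store_results(job_id, rows):
    # First part: a prefix of the rows, kept with the query metadata
    # so the first page can be returned quickly.
    spanner[job_id] = {
        "metadata": {"row_count": len(rows)},
        "first_page": rows[:FIRST_PAGE_SIZE],
    }
    # Remaining data: written out to the distributed file system.
    colossus[job_id] = rows[FIRST_PAGE_SIZE:]

store_results("job_123", [("a", 1), ("b", 2), ("c", 3)])
print(spanner["job_123"]["first_page"], colossus["job_123"])
```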