redash/docs/dev/query_execution.rst
2015-07-26 15:24:16 +03:00

95 lines
3.3 KiB
ReStructuredText

Query Execution Model
#####################
Introduction
============
The first datasource which was used with re:dash was Redshift. Because
we had billions of records in Redshift, and some queries were costly to
re-run, from the get go there was the idea of caching query results in
re:dash.
This was to relieve stress from the Redshift cluster and also to improve
user experience.
How queries get executed and cached in re:dash?
===============================================
Server
------
To make sure each query is executed only once at any giving time, we
translate the query to a ``query hash``, using the following code:
.. code:: python
COMMENTS_REGEX = re.compile("/\*.*?\*/")
def gen_query_hash(sql):
sql = COMMENTS_REGEX.sub("", sql)
sql = "".join(sql.split()).lower()
return hashlib.md5(sql.encode('utf-8')).hexdigest()
When query execution is done, the result gets stored to
``query_results`` table. Also we check for all queries in the
``queries`` table that have the same query hash and update their
reference to the query result we just saved
(`code <https://github.com/EverythingMe/redash/blob/master/redash/models.py#L235>`__).
Client
------
The client (UI) will execute queries in two scenarios:
1. (automatically) When opening a query page of a query that doesn't
have a result yet.
2. (manually) When the user clicks on "Execute".
In each case the client does a POST request to ``/api/query_results``
with the following parameters: ``query`` (the query text),
``data_source_id`` (data source to execute the query with) and ``ttl``.
When loading a cached result, ``ttl`` will be the one set to the query
(if it was set). This is a relic from previous versions, and I'm not
sure if it's really used anymore, as usually we will fetch query result
using its id.
When loading a non cached result, ``ttl`` will be 0 which will "force"
the server to execute the query.
As a response to ``/api/query_results`` the server will send either the
query results (in case of a cached query) or job id of the currently
executing query. When job id received the client will start polling on
this id, until a query result received (this is encapsulated in
``Query`` and ``QueryResult`` services).
Ideas on how to implement query parameters
==========================================
Client side only implementation
-------------------------------
(This was actually implemented in. See pull request `#363 <https://github.com/EverythingMe/redash/pull/363>`__ for details.)
The basic idea of how to implement parametized queries is to treat the
query as a template and merge it with parameters taken from query string
or UI (or both).
When the caching facility isn't required (with queries that return in a
reasonable time frame) the implementation can be completly client side
and the backend can be "blind" to the parameters - it just receives the
final query to execute and returns result.
As one improvement over this, we can let the UI/user specify the TTL
value when making the request to ``/api/query_results``, in which case
caching will be availble too, while not having to make the server aware
of the parameters.
Hybrid
------
Another option, will be to store the list of possible parameters for a
query, with their default/optional values. In such case, the server can
prefetch all the options and cache them to provide faster results to the
client.