redash/docs/dev/query_execution.rst

Query Execution Model
#####################

Introduction
============

The first datasource which was used with re:dash was Redshift. Because
we had billions of records in Redshift, and some queries were costly to
re-run, from the get go there was the idea of caching query results in
re:dash.

This was to relieve stress from the Redshift cluster and also to improve
user experience.

How queries get executed and cached in re:dash?
===============================================

Server
------

To make sure each query is executed only once at any giving time, we
translate the query to a ``query hash``, using the following code:

.. code:: python

    COMMENTS_REGEX = re.compile("/\*.*?\*/")

    def gen_query_hash(sql):
        sql = COMMENTS_REGEX.sub("", sql)
        sql = "".join(sql.split()).lower()
        return hashlib.md5(sql.encode('utf-8')).hexdigest()

When query execution is done, the result gets stored to
``query_results`` table. Also we check for all queries in the
``queries`` table that have the same query hash and update their
reference to the query result we just saved
(`code <https://github.com/EverythingMe/redash/blob/master/redash/models.py#L235>`__).

Client
------

The client (UI) will execute queries in two scenarios:

1. (automatically) When opening a query page of a query that doesn't
   have a result yet.
2. (manually) When the user clicks on "Execute".

In each case the client does a POST request to ``/api/query_results``
with the following parameters: ``query`` (the query text),
``data_source_id`` (data source to execute the query with) and ``ttl``.

When loading a cached result, ``ttl`` will be the one set to the query
(if it was set). This is a relic from previous versions, and I'm not
sure if it's really used anymore, as usually we will fetch query result
using its id.

When loading a non cached result, ``ttl`` will be 0 which will "force"
the server to execute the query.

As a response to ``/api/query_results`` the server will send either the
query results (in case of a cached query) or job id of the currently
executing query. When job id received the client will start polling on
this id, until a query result received (this is encapsulated in
``Query`` and ``QueryResult`` services).

Ideas on how to implement query parameters
==========================================

Client side only implementation
-------------------------------

(This was actually implemented in. See pull request `#363 <https://github.com/EverythingMe/redash/pull/363>`__ for details.)

The basic idea of how to implement parametized queries is to treat the
query as a template and merge it with parameters taken from query string
or UI (or both).

When the caching facility isn't required (with queries that return in a
reasonable time frame) the implementation can be completly client side
and the backend can be "blind" to the parameters - it just receives the
final query to execute and returns result.

As one improvement over this, we can let the UI/user specify the TTL
value when making the request to ``/api/query_results``, in which case
caching will be availble too, while not having to make the server aware
of the parameters.

Hybrid
------

Another option, will be to store the list of possible parameters for a
query, with their default/optional values. In such case, the server can
prefetch all the options and cache them to provide faster results to the
client.