Foreword
While models are the brain of systematic trading, data is its blood. Naturally, running a successful systematic trading business requires a well-designed and efficient data backbone. However, this matter is often overlooked until the situation becomes dire, and by that point, it may be hard to improve it in a meaningful way.
One of the reasons why so little attention is paid to the data backbone is the availability of free and seemingly suitable technologies. Unfortunately, the vast majority of them do not cater for the specific needs of systematic trading, which eventually results in low productivity.
Another reason is the lack of understanding of the long-term costs of standardising on suboptimal tech. The adopted technology comes with its constraints, which can severely impact the efficiency of quantitative research, modelling, backtesting, and post-trading analysis.
Finally, there is a tendency to go with the crowd: “Everyone is doing this, so we shall do the same”. Doubtless, there are benefits to going down the well-trodden path, but that does not mean the path inevitably leads to paradise.
Here is my current understanding of what is important in the data backbone and how it can be implemented.
Requirements
How can we tell whether a data backbone is good or not?
First, it is good if it serves its purpose well, and its purpose is to collect, store, and provide data. The main users of the data are quants. However, an advanced data backbone may also cater for real-time subscribers, such as trading models running live. For the time being, let’s focus exclusively on the requirements of quants.
Second, the data backbone should be cost-efficient. The data to be stored may be voluminous and the access patterns may be quite inconvenient to deal with. This may sometimes be solved by throwing money at the problem, but it would be ideal to avoid a situation where the cost of technology is higher than the PnL of the firm.
When we consider the needs of quants, what is actually required?
There are several things:
- Large volumes of data (a ballpark estimate is petabytes).
- Reliability: loss of the data can be costly as not every kind of data can be restored.
- The data needs to be stored with precision, especially the data in its raw format.
- The data must be discoverable.
- The quality of data must be high. Errors in data recording or retention must be spotted early.
- Access to the data must be efficient. In particular, temporal as-of joins may be required for many types of data.
- The data model must be able to represent the data precisely.
- The meaning of the data and historical changes in that meaning must be documented and easy to find.
Also, let’s not forget that sometimes what looks like gold is actually fool’s gold.
It is often assumed that any additional feature of a system is by definition good, or at least neutral. That is not true, for several reasons.
The first is that some features have a nasty trait: while they seemingly help in the short term, they create a heavy burden that has to be carried in the long term. An example is a “clever” API which tries to guess what is required. While it may seem tempting to have it, it brings in fragility: no matter how clever the API is, it will occasionally make mistakes.
The second is that the human mind is not infinitely powerful. It is far more efficient when laser-focused on a specific matter rather than hundreds of them. For example, a quant will be more productive in modelling if there is no need to keep in mind strange quirks of the data access layer, bugs in statistics libraries, a team meeting which is about to happen in ten minutes, and so on.
An excellent data backbone would deliberately lack features that are not truly necessary.
The Design of the Data Backbone
Data Storage
There are two (and a half) kinds of data storage which are essential: Data Lake, Data Warehouse, and optionally a transactional database. Quants would normally use data from the Data Warehouse and there should be a process by which the information from the Data Lake and the transactional database is distilled, transformed and put into the Data Warehouse.
There are various kinds of data to deal with: tick data, reference data, event streams, estimates of quantitative parameters, configurations of models, position data, and so on. The way the data is stored and used depends on its type.
Data Lakes
What is it?
A Data Lake is a collection of data in its raw form: logs, PCAP files, and other data obtained from external sources. The amount of data is likely to be large and the data should be seen as mostly unstructured. Quants should not need to use it actively, but they should be able to do so if they need to get an idea of what is available.
Data Organisation
It is important that the data is appropriately organised. As a minimum, each sort of data in the Data Lake should live in its own dedicated place rather than being intermixed with other data. The data should be linked to the corresponding documentation. Where possible, the data should be immutable: if an update for the same data arrives, it should be stored as an alternative version alongside the old version.
Due to the mostly unstructured nature of the data and its access patterns, it makes sense to store it partitioned by day and provide for efficient bulk upload/delete/read operations with a time granularity of a day. It should be assumed that the vast majority of operations are going to be reads, but updates should be given priority so that the ingestion process does not stall due to excessive reads.
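As an illustration, below is a minimal sketch of such a layout, assuming a file-system-backed lake. The root path, the raw/<source>/<day> structure, and the .v<N> version suffix are made-up conventions rather than a prescription.

```python
from datetime import date
from pathlib import Path

LAKE_ROOT = Path("/data/lake")  # hypothetical mount point

def lake_path(source: str, day: date, filename: str, version: int = 1) -> Path:
    """Build the destination path for one raw file; each kind of data gets its own prefix."""
    return LAKE_ROOT / "raw" / source / day.isoformat() / f"{filename}.v{version}"

def store_raw(source: str, day: date, filename: str, payload: bytes) -> Path:
    """Write immutably: if the same file arrives again, keep it as a new version."""
    version = 1
    while lake_path(source, day, filename, version).exists():
        version += 1
    target = lake_path(source, day, filename, version)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target

# Example: store_raw("exchange_pcap", date(2024, 1, 2), "md_feed.pcap", b"...")
```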
Access Controls
It may be prudent to ensure that access to the Data Lake is appropriately controlled: only specific software and a handful of people can write to it. This is required to avoid accidental damage to the data.
Implementation
A Data Lake can be implemented in various ways: as BLOBs in an S3 storage, as files in a file system, or in some other way. Technologically, it is the least complex part. After all, most operations are bulk reads and bulk writes. However, it may be sensible to keep in mind that the volumes of data transfers may be significant, which may have various cost implications. Furthermore, as the data may need to be read by various quality assurance processes as well as processes parsing the data and writing it to the Data Warehouse, the corresponding software would likely have to be close to the Data Lake.
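For an S3-backed lake, a bulk day-level read might look roughly like the sketch below. It uses boto3 and assumes the hypothetical bucket name and raw/<source>/<day>/ key layout from the previous sketch.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-lake"  # hypothetical bucket name

def read_day(source: str, day: str) -> dict[str, bytes]:
    """Bulk-read every raw object stored under one source/day prefix."""
    prefix = f"raw/{source}/{day}/"
    objects = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for entry in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=entry["Key"])["Body"].read()
            objects[entry["Key"]] = body
    return objects
```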
Data Warehouse
What is it?
A Data Warehouse is a system which stores information used by quants in a way which is most suitable for their purposes. This would often (but not always) mean column-oriented storage with support for temporal as-of joins of several types, as well as bulk loading or reading segments of data within the specified time ranges.
The volume of data in a Data Warehouse would likely be smaller than that in a Data Lake, but it may still be significant, especially where tick data and streams of events are considered.
Data Organisation
Unlike Data Lakes, where information is mostly unstructured, a Data Warehouse stores highly structured data optimised for the required read patterns. In all cases, bulk reads should be possible, with data selected by various attributes and time ranges.
Also, in most cases, as-of joins are needed: it must be possible to look up the most recent (or the next available) data as of a specific time, possibly with the data at that specific time excluded. Note that half-usable solutions, such as those where the practical horizon of an as-of join is significantly limited, may be detrimental to the quality of data queries, and consequently to the quant research.
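As a concrete illustration of these semantics, here is a small pandas sketch; the column names and values are made up. pandas.merge_asof supports both the “most recent before” and the “next available” flavours, and exact matches can be excluded.

```python
import pandas as pd

quotes = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-02 09:30:00", "2024-01-02 09:30:05"], utc=True),
    "symbol": ["ABC", "ABC"],
    "bid": [99.9, 100.0],
}).sort_values("ts")

trades = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-02 09:30:05", "2024-01-02 09:30:07"], utc=True),
    "symbol": ["ABC", "ABC"],
    "qty": [100, 50],
}).sort_values("ts")

# Most recent quote as of each trade time, excluding quotes at exactly that time.
backward = pd.merge_asof(trades, quotes, on="ts", by="symbol",
                         direction="backward", allow_exact_matches=False)

# Next available quote at or after each trade time.
forward = pd.merge_asof(trades, quotes, on="ts", by="symbol",
                        direction="forward", allow_exact_matches=True)
```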
Only some types of data would not require as-of joins. An obvious example would be streams of events. With those, as-of joins may still be useful in some rare circumstances, but they would not be critically required.
In order to work with the data efficiently, one may need to represent it in a column-oriented manner, partitioned by some of those columns and possibly by time (for instance, by days).
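A minimal sketch of such a representation, using Parquet via pandas with hypothetical column names:

```python
import pandas as pd

ticks = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-02", "2024-01-03"],
    "symbol": ["ABC", "XYZ", "ABC"],
    "ts": pd.to_datetime(["2024-01-02 09:30:00", "2024-01-02 09:30:01",
                          "2024-01-03 09:30:00"], utc=True),
    "price": [100.0, 55.5, 101.2],
})

# Column-oriented storage, partitioned by day and symbol.
ticks.to_parquet("warehouse/ticks", partition_cols=["date", "symbol"], engine="pyarrow")

# A bulk read of one day for one symbol touches only the matching partitions.
one_day = pd.read_parquet("warehouse/ticks", engine="pyarrow",
                          filters=[("date", "=", "2024-01-02"), ("symbol", "=", "ABC")])
```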
In all cases, it may be useful to guarantee that the sequence of entries matches their temporal order. This may be particularly important in the case of dense event streams where it may be impractical to have true and precise timestamps on every single event, but it is still important to make sure the events can be read in the correct sequence.
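One lightweight way to enforce this, assuming a hypothetical per-day sequence column `seq` used as a tiebreaker within identical timestamps:

```python
import pandas as pd

def check_ordering(events: pd.DataFrame) -> None:
    """Confirm that the events can be replayed in the order they were recorded."""
    # Timestamps must be non-decreasing...
    assert events["ts"].is_monotonic_increasing, "timestamps go backwards"
    # ...and within equal timestamps the sequence number must disambiguate the order.
    assert not events.duplicated(subset=["ts", "seq"]).any(), "ambiguous ordering"
```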
Further Considerations
Just like with Data Lakes, access should be appropriately controlled so that only certain processes and a limited number of people can modify the data.
There may be a need to represent the same data in several competing ways. For example, for some kinds of research, it may be important to know how the data was perceived as of a specific time, while for others it is important to know what it actually was at a specific time. Naturally, these two views on the data do not match where alterations to previously received data are made.
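One common way to support both views is to keep two timestamps per row; the sketch below assumes hypothetical `key`, `event_time`, and `knowledge_time` columns and is only meant to show the difference between the two queries.

```python
import pandas as pd

def as_perceived(df: pd.DataFrame, at: pd.Timestamp) -> pd.DataFrame:
    """The data as it was perceived at time `at`: later corrections are ignored."""
    known = df[df["knowledge_time"] <= at]
    return (known.sort_values("knowledge_time")
                 .drop_duplicates(subset=["key", "event_time"], keep="last"))

def latest_view(df: pd.DataFrame) -> pd.DataFrame:
    """The best current estimate of what the data was at each event time."""
    return (df.sort_values("knowledge_time")
              .drop_duplicates(subset=["key", "event_time"], keep="last"))
```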
It would be useful if the Data Warehouse were able to run complicated joins across multiple datasets, but that is not, strictly speaking, required.
It may also be useful to support virtual columns such as a cumulative sum of another column since the start of the day. However, the same can be simulated by a process which populates the data.
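For example, a population process might precompute such a column along these lines (column names are illustrative):

```python
import pandas as pd

trades = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-02", "2024-01-03"],
    "volume": [100, 50, 200],
})
# Cumulative volume since the start of each day, materialised at population time.
trades["cum_volume"] = trades.groupby("date")["volume"].cumsum()
```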
Implementation
It appears that for all types of data, it would be useful to represent it as a kind of table: zero or more columns used to partition the table, one (or more) column(s) of timestamps with a guarantee that the timestamps are non-descending, and zero or more columns of data.
Note that if there is more than one timestamp column, it is not clear which one should be used for day-level aggregations, and the meaning of as-of joins becomes muddled. One of them will definitely be special. Furthermore, it may be problematic to run as-of joins on timestamp columns other than the “main” one.
Ideally, this data should be stored in a column-oriented structure as that is somewhat better for performance.
Given that the vast majority of queries are reads, the system should be optimised to service them well. In particular, concurrent reads should not block each other. A clever implementation would be able to use the partitioning of the data to run various queries in parallel. Writes should be able to run in parallel too: there is no need to maintain complex consistency invariants, so there is no need to block a write from uploading an updated version of the data while readers are still reading the old version.
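One minimal way to achieve such non-blocking writes is copy-on-write with an atomically swapped “latest” pointer. The layout, the file names, and the `write_files` callback below are assumptions made purely for illustration.

```python
import json
import os
from pathlib import Path

ROOT = Path("warehouse/ticks")  # hypothetical dataset root

def read_latest_version() -> int:
    pointer = ROOT / "LATEST"
    return json.loads(pointer.read_text())["version"] if pointer.exists() else 0

def publish_new_version(write_files) -> None:
    """Write a complete new version next to the old one, then flip the pointer."""
    new_version = read_latest_version() + 1
    new_dir = ROOT / f"v{new_version}"
    new_dir.mkdir(parents=True, exist_ok=True)
    write_files(new_dir)                    # caller writes the new data files here
    tmp = ROOT / "LATEST.tmp"
    tmp.write_text(json.dumps({"version": new_version}))
    os.replace(tmp, ROOT / "LATEST")        # atomic: readers see the old or the new version
```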
Some data can be arbitrarily sparse. In such cases, for each daily sub-table of data, it may be useful to store an optional aggregation: the most recent data just before the sub-table.
The tricky question is who should do it: the Data Warehouse automatically, or the process which populates the data? In the latter case, the aggregations look like independent datasets which need to be worked with explicitly. In the former case, the aggregations are less visible and therefore more convenient to use, but they can be burdensome to implement.
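A sketch of the latter option, where the populating process computes the aggregation explicitly (the `ts` and `key` column names are hypothetical):

```python
import pandas as pd

def carry_in_snapshot(history: pd.DataFrame, day_start: pd.Timestamp) -> pd.DataFrame:
    """Most recent value per key strictly before `day_start`.

    Stored alongside the daily sub-table so that sparse data can be as-of joined
    without scanning arbitrarily far back in time.
    """
    before = history[history["ts"] < day_start]
    return before.sort_values("ts").drop_duplicates(subset="key", keep="last")
```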
Finally, true as-of joins are deceptively hard to implement. That is why even systems which excel at working with financial data tend to cut corners: they limit as-of joins either to a range defined by a maximum deviation from the reference time, or to a time bucket of pre-configured granularity (such as a day).
Transactional Database
What is it?
A transactional database would normally be used to store information on orders, fills, cancellations, amendments, trade confirmations, and so on. It is populated and accessed in real time and is usually the go-to place when one needs to know the current position or the trading activity that has happened historically. It is not designed, however, to run analytical workloads.
Data Organisation
The data is organised as rows, and normally it is structured (although there are exceptions). Each row represents a transaction, an event, or a version thereof.
Considerations
Records may need to be updated on a per-entry level. This should be possible, and the history of changes should be preserved. Ideally, a write-once, read-many approach should be used: append a new version rather than overwrite the old one.
Records may need to be linked to each other: trades with trade confirmations, fills with orders, etc. These links should be reliable and stable.
The transactional database must be highly available. In case it fails, it should be easy to restore it in a matter of minutes no matter what the root cause of the outage was.
It must be possible to answer certain questions efficiently, such as “What are the current positions in a book?”.
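As a minimal illustration, an append-only fills table in SQLite can answer that question with a simple aggregation; the table and column names are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE fills (
        fill_id   INTEGER PRIMARY KEY,
        order_id  INTEGER NOT NULL,   -- stable link back to the parent order
        book      TEXT    NOT NULL,
        symbol    TEXT    NOT NULL,
        qty       INTEGER NOT NULL,   -- signed: buys positive, sells negative
        recorded  TEXT    NOT NULL    -- rows are appended, never updated in place
    )
""")
con.executemany(
    "INSERT INTO fills (order_id, book, symbol, qty, recorded) VALUES (?, ?, ?, ?, ?)",
    [(1, "bookA", "ABC", 100, "2024-01-02T09:30:05Z"),
     (2, "bookA", "ABC", -40, "2024-01-02T10:15:00Z")],
)

# "What are the current positions in a book?"
positions = con.execute(
    "SELECT symbol, SUM(qty) FROM fills WHERE book = ? GROUP BY symbol", ("bookA",)
).fetchall()
print(positions)   # [('ABC', 60)]
```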
Implementation
There is a wealth of Database Management Systems which can be used to implement a transactional database: from traditional relational databases to various NoSQL solutions.
When working on the implementation, care should be taken to confirm that the chosen approach to data versioning works well and that, with the expected volumes of transactions, the data can be updated and queried efficiently.
Where there is a danger of internal inconsistencies appearing, it is prudent to create automatic processes which would periodically check the data and confirm there are no issues with it.
Data Types
Here are some observations on what works and what does not where data types are considered.
Firstly, it is strongly advised to use GMT timestamps wherever possible. If you need to store timestamps in a time zone other than GMT (really, you probably do not), then the best thing to do is to use a data type for a timestamp with a time zone and to specify the time zone explicitly.
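In Python, for instance, this boils down to always attaching an explicit zone and converting only at the edges:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Store and pass timestamps around in UTC/GMT, with the zone attached explicitly.
recorded_at = datetime.now(timezone.utc)

# If a local-time view is genuinely needed, convert at the presentation edge only.
local_view = recorded_at.astimezone(ZoneInfo("Europe/London"))
```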
Where you need to store fractional decimals, especially data such as prices, do prefer data types designed for decimals. Avoid binary floating-point representation if you can. There are good reasons for the existence of binary floating point representations, but storing prices or units of funds is not one of them.
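A two-line illustration of why:

```python
from decimal import Decimal

print(0.1 + 0.2 == 0.3)                                   # False: binary floats drift
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True: decimals stay exact
```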
If you can, choose type systems which allow for actual missing values (not just NaNs) and for all types of data. This can save you much trouble down the road.
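In pandas, for instance, the nullable extension dtypes provide a real missing value for integers and booleans instead of silently falling back to float NaN:

```python
import pandas as pd

qty = pd.array([100, None, 250], dtype="Int64")        # [100, <NA>, 250], still integers
flag = pd.array([True, None, False], dtype="boolean")  # [True, <NA>, False]
```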
Avoid type systems which are too focused on low-level details. If you need to choose between a 32-bit binary floating point and a 64-bit one, you are probably in the wrong place.
Where you are dealing with text, one day you may need to deal with non-ASCII characters. It may be wise to just use UTF-8 or an equivalent Unicode encoding everywhere as a standard.
Atoms/symbols are useful too. In particular, they can be handy when representing the direction of trading, type of events in event streams, types of orders, etc. It is convenient if these are available.
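In Python, these can be modelled with enums or, in tabular data, with categoricals; the `Side` enum below is only an illustration:

```python
import enum
import pandas as pd

class Side(enum.Enum):
    BUY = "buy"
    SELL = "sell"

# In a table, a categorical column plays the same role as an atom/symbol.
sides = pd.Series(["buy", "sell", "buy"], dtype="category")
```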
Quant Tools
In short, whatever tools quants wish to use should be available to them.
It is not a great idea to impose restrictions on their environment: each person is more efficient when working with the tools he or she is most comfortable with. Some like Python, while others are good with R or Matlab.
The important thing is that access to the data backbone (and especially the Data Warehouse) must be as convenient as possible so that quants can focus on their research rather than combating problems with the data APIs.
Another important consideration is stability. The APIs should be as stable as possible: if they change, old research code may stop working, and that is a disaster.
Finally, it is quite helpful to have APIs where it is easy to see what is using them. This allows us to identify which parts of functionality are used and which are not, who/what the users are, which version of the client API library is in use, and so on. Designing the data APIs with these things in mind will result in a better experience months later, potentially saving significant costs.
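A sketch of what this might look like on the client side; the function, the field names, and the version string are all hypothetical.

```python
import getpass
import logging

log = logging.getLogger("warehouse.client")
CLIENT_VERSION = "1.4.2"  # hypothetical version of the client API library

def fetch(dataset: str, start: str, end: str):
    """Fetch a dataset while reporting enough metadata to see who uses what."""
    log.info("warehouse read: user=%s client=%s dataset=%s range=%s..%s",
             getpass.getuser(), CLIENT_VERSION, dataset, start, end)
    # The actual transport call to the warehouse API is omitted in this sketch.
```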
Checklist
Now suppose you have a design in mind or perhaps an existing system already. Let’s do a few more checks to see whether everything is all right.
- Do you have situations where missing values are represented as zeros? If so, mend it first and fire whoever decided that zero is a good representation of NA.
- Does your data storage work in such a way that while one of the clients is pulling some data, others have to wait even if they have requested the same data? If so, there is room for improvement.
- Is there visibility into who/what/when is connecting to the data store and which data is being accessed? If not, you should mend it before it becomes a serious problem.
- Do you need to pay thousands of USD per TB of data stored? Someone might be seeing you as a source of free cash.
- How failure-resistant is the system? You do not want to end up in a situation where your data storage has failed and there is no quick recovery (or, worse still, no recovery process whatsoever).
- Can you query the data efficiently? It may be OK if quants pull the data in bulk and then query it locally, but it may be a sign that the data API layer is just not good enough.
- Is it easy to find the data you need or find out what is available?
- Is the meaning of the data explained? In case of any questions, is it possible to find out who knows about the data?
- How easy is it to find the code which populated the data in the Data Warehouse? Losing it can be painful, as one day you may need to re-fill the datasets.
- Are there mechanisms for automatic quality checks?
Conclusion
Above, an outline of a basic data backbone for systematic trading has been given. The description deliberately avoids specific technologies: there is no perfect solution in any case. What matters is an understanding of the requirements and a focus on satisfying them as far as is practical.
The backbone is indeed basic. It does not have advanced features such as support for real-time intraday feeds and derived data. However, it should provide a decent foundation and serve a systematic trading business well. More advanced functionality can be added at a later point without a need to make dramatic changes.