We at Flowtale frequently use the TimescaleDB PostgreSQL extension in client projects that involve high-frequency time series data (from finance, click-streams, the Internet of Things, and similar domains). Since it is not always a default choice in cloud data science architectures, we get questions about it, so we set out our observations here in case such data is relevant to your next project. In addition, version 2.0 was released a couple of weeks ago and made some previously enterprise-only features free, so it is an appropriate time to review the key concepts at work.
Please note that this is not an evaluation of TimescaleDB against other relational time series database solutions (such as the CitusDB cluster extension to PostgreSQL), but rather some observations on our experience with TimescaleDB specifically. We may follow up with detailed comparisons in future blog posts depending on reader interest.
Executive Summary
The goal of this writeup is to give a brief overview of TimescaleDB, its common use cases, best practices, and benefits, and to compare it with other ways of storing time series data in data science projects, such as traditional relational databases, NoSQL stores, and time-oriented databases.
What is TimescaleDB
TimescaleDB is an open-source relational database extension for time-series data. It extends the well-known PostgreSQL database, which has earned a strong reputation over the years for its proven architecture and rich feature set. PostgreSQL is not only a reliable open-source data store: it natively supports many data types, ensures data integrity, reliability, and disaster recovery, and delivers security and performance through concurrency control, transactions, sophisticated query planning, advanced indexing, and more. And the cherry on top: a huge number of monitoring and development tools, BI platforms, and analytics extensions are built to use PostgreSQL as a source.
TimescaleDB claims 10-100x higher performance than vanilla PostgreSQL or MongoDB. It naturally extends the relational model of PostgreSQL by leveraging the nature of time-series data: it is (mostly) append-only and time-centric, which allows it to scale much better. It also has its own advanced partitioning strategy, time/space partitioning, which means TimescaleDB partitions by timestamp (time) and by another column (space).
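As a concrete illustration, a table with both a time and a space dimension might be declared as follows. This is a minimal sketch: the `conditions` table, its columns, and the partition count are hypothetical, and the statements assume a PostgreSQL instance with the timescaledb extension installed.

```python
# Hypothetical DDL illustrating TimescaleDB's time/space partitioning.
# Table name, columns, and partition count are invented for this example.
CREATE_TABLE = """
CREATE TABLE conditions (
    time        TIMESTAMPTZ NOT NULL,
    device_id   TEXT        NOT NULL,
    temperature DOUBLE PRECISION
);
"""

# create_hypertable() turns the plain table into a hypertable partitioned
# by the 'time' column; the optional partitioning column adds the "space"
# dimension (here device_id, hashed into 4 partitions).
CREATE_HYPERTABLE = """
SELECT create_hypertable(
    'conditions', 'time',
    partitioning_column => 'device_id',
    number_partitions   => 4
);
"""
```

Run against a TimescaleDB instance, the second statement converts the plain PostgreSQL table into a partitioned hypertable while queries against it remain ordinary SQL.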
Why SQL
NoSQL models are popular nowadays because of the benefits they provide over relational models, such as scalability. However, SQL has not lost its relevance. It has been a standard for decades and, while it has its restrictions, it ensures data integrity, normalization, strong relationships, and space efficiency by avoiding data duplication. NoSQL and time-specific databases require learning new concepts (document-based, graph-based, key-value-based, etc.) along with a new language and syntax, whereas SQL is a unified, common language for manipulating and retrieving data that almost every software engineer is familiar with. The traditional relational model, however, turns out not to adapt easily to large scale and high volumes of input data. That is where TimescaleDB shows its advantage in our architecture: because it is built on top of PostgreSQL, there is no need to run one database for time-series data and a separate database for relational data; the two models coexist naturally. It is also worth noting that PostgreSQL now supports foreign data wrappers, so it can also serve as a centralized query store connecting to many other data sources.
Performance
A few conceptual changes to the relational model allow TimescaleDB to boost its performance compared to traditional SQL or NoSQL stores. As mentioned earlier, TimescaleDB leverages the nature of time-series data to extend the relational model with much more scalability and speed. It encourages a time-centric approach, where incoming data is grouped by timestamp: for example, instead of creating one row per metric (the narrow-table model), a row is created per timestamp and includes all the metrics gathered at that specific time (the wide-table model).
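The two schema shapes can be sketched side by side. Both tables and all column names below are hypothetical, chosen only to contrast the models.

```python
# Narrow-table model: one row per individual metric reading.
NARROW_SCHEMA = """
CREATE TABLE metrics_narrow (
    time        TIMESTAMPTZ NOT NULL,
    device_id   TEXT        NOT NULL,
    metric_name TEXT        NOT NULL,   -- e.g. 'temperature', 'humidity'
    value       DOUBLE PRECISION
);
"""

# Wide-table model: one row per timestamp, with one column per metric
# collected at that instant. For metrics that are always captured
# together, this avoids repeating the timestamp and device per metric.
WIDE_SCHEMA = """
CREATE TABLE metrics_wide (
    time        TIMESTAMPTZ NOT NULL,
    device_id   TEXT        NOT NULL,
    temperature DOUBLE PRECISION,
    humidity    DOUBLE PRECISION,
    cpu_load    DOUBLE PRECISION
);
"""
```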
Time-series data is mostly appended and unlikely to be updated later. IoT workloads demonstrate this well: once an IoT device, whether a small sensor or industrial equipment, sends a reading, that reading is unlikely to change, and the purpose of such data is analysis rather than manipulation. Because incoming data is append-mostly, TimescaleDB partitions tables into time-ordered chunks so that the indexes of recently written chunks stay in memory, avoiding expensive swapping to disk; a huge number of rows is therefore no longer the bottleneck it is in stock PostgreSQL or other relational databases. These and other improvements allow TimescaleDB to ingest millions of data points per second. Additionally, TimescaleDB provides native compression of data and parallelizes operations across partitions, which also improves performance.
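Native compression, for instance, is enabled per hypertable and scheduled with a background policy. The sketch below uses the hypothetical `conditions` table and the TimescaleDB 2.0 policy API.

```python
# Enable native compression on a hypothetical 'conditions' hypertable,
# segmenting compressed data by device so per-device queries stay fast.
ENABLE_COMPRESSION = """
ALTER TABLE conditions SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id'
);
"""

# Schedule a background job that compresses chunks older than 7 days;
# recent (hot) chunks remain uncompressed for fast ingest.
COMPRESSION_POLICY = """
SELECT add_compression_policy('conditions', INTERVAL '7 days');
"""
```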
Advantages Over Other Models
By using NoSQL we introduce a whole new approach that is not always natural for data processing. NoSQL encourages eventual consistency and tolerates data duplication because it typically does not support JOINs between datasets.
On the other hand, a pure relational approach can lack the needed scalability and speed because the data is not treated as time-centric, and traditional querying and indexing algorithms are limited in their capacity to handle high input load.
TimescaleDB brings the time-series concept into the relational world. While supporting standard SQL, it lets data be treated as relational and participate in BI and Big Data pipelines using Grafana, Tableau, Power BI, Periscope, and other tools. It natively supports JOINs and can be joined with other SQL-speaking data sources using, for example, Presto. At the same time, being time-oriented, it provides orders of magnitude greater performance than traditional SQL databases and can sustain a higher ingestion rate.
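Because a hypertable is still a PostgreSQL table, it joins directly against ordinary relational tables. The query below is a sketch over hypothetical `conditions` (time-series) and `devices` (metadata) tables.

```python
# Join time-series readings in a hypertable against a plain relational
# metadata table -- something NoSQL stores typically cannot do natively.
JOIN_QUERY = """
SELECT d.location,
       avg(c.temperature) AS avg_temp
FROM conditions c
JOIN devices d ON d.device_id = c.device_id
WHERE c.time > now() - INTERVAL '1 day'
GROUP BY d.location;
"""
```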
There are also time-oriented databases, such as InfluxDB and AWS Timestream, which are often considered for time-oriented tasks. One caveat is that AWS Timestream locks the project into AWS infrastructure. More generally, these time-specific databases are optimized only for time-series workloads, which means that if a project starts to require other kinds of data, they will not handle it well and a second database must be introduced. TimescaleDB avoids this by supporting both the time-series and relational models.
Cloud
TimescaleDB has the same potential for use in cloud environments as PostgreSQL. In particular, it provides Docker images for containerized environments such as Kubernetes, Docker Swarm, and AWS ECS. It installs easily on most operating systems, so it can be used with any major PaaS cloud platform, and an enterprise solution is available. Several cloud providers, including Azure and AWS, also support TimescaleDB natively as part of their services.
In our experience, TimescaleDB shows good performance when scaled on a single node, making it easier to operate than relational databases or NoSQL stores that would require multiple instances, special autoscaling, and other DevOps techniques to manage a database cluster.
IoT
Being a time-oriented database, TimescaleDB is, in our view, well suited for IoT workloads. Metrics from devices and sensors are time-series data, and TimescaleDB makes it possible to receive, retain, and analyze a stream of metrics and sensor readings from an array of devices at scale.
As TimescaleDB supports millions of data points per second, it provides an opportunity to collect data from tens of millions of devices with low latency.
It supports the same real-time ingestion tools used to feed data into PostgreSQL, such as Kafka, and provides continuous aggregate views for quick real-time insights and answers based on the data.
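A continuous aggregate might look like the following. This is a sketch: the table and view names are hypothetical, and the syntax follows the TimescaleDB 2.0 form.

```python
# Continuous aggregate: an incrementally maintained materialized view
# that rolls raw readings up into hourly averages per device.
HOURLY_VIEW = """
CREATE MATERIALIZED VIEW conditions_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       device_id,
       avg(temperature) AS avg_temp
FROM conditions
GROUP BY bucket, device_id;
"""

# Background refresh policy keeping the view up to date as data arrives.
REFRESH_POLICY = """
SELECT add_continuous_aggregate_policy('conditions_hourly',
    start_offset      => INTERVAL '3 hours',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
"""
```

Dashboards can then query `conditions_hourly` directly instead of re-aggregating the raw stream on every request.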
Final Thoughts
In our experience TimescaleDB is a good choice for time-series data in many contexts, such as IoT workloads, finance and trading operations, and application metrics. It combines the strengths of the relational model (data integrity, stability, normalization, strong relationships, and joins) with enormous scalability, much higher throughput, and lower latency.
We expect to continue using it in many of the architectures we deliver, although this is a fast-moving technology space, so only time will tell. Alternative ways of scaling PostgreSQL for time-series data, such as Citus, also exist and remain on our radar.