Overview
Tick data is the highest resolution market data available — every trade execution, every quote change, every order book update recorded individually with its precise timestamp. Where bar data summarises market activity into open, high, low, and close for a defined time period, tick data preserves every price movement within that period. The bid that moved one tick before a large trade. The ask that widened during low liquidity. The burst of activity in the microseconds before a price level broke. This level of detail is invisible in bar data and essential for certain categories of trading analysis.
Tick data processing is the infrastructure that captures, stores, validates, and serves this high-resolution data at the scale that production use requires. A single liquid instrument generates thousands of ticks per second during active trading. A portfolio of instruments across multiple venues generates millions of ticks per hour. Storing this data efficiently, querying it at speed, and processing it into the derived representations — bars, volume profiles, order flow metrics — that trading analysis consumes require an architecture built for the specific characteristics of tick data rather than a general-purpose database handling a data type it was not designed for.
We build custom tick data processing infrastructure for systematic trading firms, quantitative research operations, high-frequency trading systems, and any operation where tick-level market data is a component of the trading or research workflow — from focused tick data capture and storage for a specific instrument universe to comprehensive market microstructure analysis platforms processing data across multiple exchanges.
What Tick Data Processing Covers
Tick data capture. The real-time capture of trade and quote data from exchange feeds, broker price streams, and data vendor APIs — recording each tick event with the precision and completeness that downstream use requires.
Trade tick capture records every executed trade: the instrument, the timestamp with microsecond or nanosecond precision where the feed provides it, the trade price, the trade volume, and the aggressor side where the feed identifies buyer or seller initiation. Quote tick capture records every bid/ask update: the timestamp, the best bid and ask prices, the bid and ask sizes at the top of the book, and optionally the full order book depth snapshot at the time of the quote.
Capture latency — the delay between the exchange recording a tick event and the tick appearing in the local data store — is a function of network infrastructure, feed architecture, and processing design. For latency-sensitive research and execution applications, minimising capture latency is a design objective. For historical research applications where the primary concern is data completeness and accuracy, capture latency is less critical than ensuring that no ticks are lost.
Gap detection during capture — monitoring the tick stream for periods where expected tick activity is absent, indicating a connectivity or feed issue rather than genuine market inactivity. Gaps detected during capture are flagged for the data quality management process rather than silently appearing as missing data in the historical record.
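A minimal sketch of interval-based gap detection. The 5-second threshold and the tick list are illustrative; a production monitor would calibrate the threshold per instrument and session:

```python
from datetime import datetime, timedelta

def detect_gaps(tick_times, max_gap=timedelta(seconds=5)):
    """Flag intervals where no ticks arrived for longer than max_gap.

    Returns (gap_start, gap_end) pairs for the data quality process to
    review; deciding whether a gap is a feed outage or genuine market
    inactivity happens downstream, not here.
    """
    return [(prev, curr)
            for prev, curr in zip(tick_times, tick_times[1:])
            if curr - prev > max_gap]

times = [datetime(2024, 1, 2, 9, 30, 0),
         datetime(2024, 1, 2, 9, 30, 1),
         datetime(2024, 1, 2, 9, 30, 9),   # 8 seconds of silence
         datetime(2024, 1, 2, 9, 30, 10)]
gaps = detect_gaps(times)  # one gap, between 09:30:01 and 09:30:09
```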
Tick data storage. Storing tick data efficiently is a fundamental challenge — the data volumes are large, the write patterns are continuous and high-frequency, and the query patterns (range queries over time for a specific instrument) differ significantly from general transactional database query patterns.
Columnar storage and time-series database architectures are well-suited to tick data's characteristics — sequential writes of time-ordered records for each instrument, and range queries over time that retrieve all records for an instrument within a specified time window. TimescaleDB (PostgreSQL extension for time-series data), ClickHouse (columnar analytical database), InfluxDB, and Parquet files on object storage each provide different trade-offs between write performance, query performance, compression efficiency, and operational complexity.
Parquet file storage — the columnar file format used extensively in quantitative research — provides excellent compression for tick data (tick data's sequential, low-variance numerical values compress well) and is directly consumable by the Python data science stack (Pandas, PyArrow, Dask) without a database query layer. For research applications where data is processed in large batches rather than queried interactively, Parquet on S3 or equivalent object storage is often the most practical storage architecture.
Partitioning strategy — organising the tick data store by instrument and by time period — determines query performance for the access patterns the consuming applications use. Partitioning by instrument and by trading day allows efficient retrieval of all ticks for a specific instrument on a specific day without scanning the full dataset. Partitioning by instrument and by month provides more efficient storage for instruments with lower tick rates where daily partitions would be excessively fine-grained.
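As an illustration of instrument-and-day partitioning, a Hive-style path layout (the bucket and file names here are hypothetical) keeps a one-symbol, one-day range query confined to a single directory:

```python
from datetime import date

def partition_path(root: str, symbol: str, day: date) -> str:
    # One symbol, one trading day -> one directory; a typical research
    # query never has to scan data outside its partition.
    return f"{root}/symbol={symbol}/date={day.isoformat()}/ticks.parquet"

path = partition_path("s3://ticks", "BTCUSDT", date(2024, 1, 2))
# s3://ticks/symbol=BTCUSDT/date=2024-01-02/ticks.parquet
```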
Data compression reduces storage cost and improves I/O performance by reducing the volume of data that must be read from disk for each query. Tick data compresses well due to the sequential, delta-encoded nature of price series — successive prices are close to each other and compress efficiently with delta encoding before general compression. Compression schemes tailored to tick data's characteristics (delta encoding combined with integer packing and general compression) achieve compression ratios that significantly reduce storage requirements for large tick datasets.
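The effect of delta encoding is easy to demonstrate on a simulated random-walk price series (integer ticks, zlib standing in for the general compressor; a production pipeline would add integer packing on top):

```python
import random
import struct
import zlib

random.seed(7)
# simulate prices already scaled to integer ticks, as a random walk
prices = [100_000]
for _ in range(50_000):
    prices.append(prices[-1] + random.choice((-1, 0, 1)))

# delta encoding: store the first price, then successive differences;
# the differences cluster near zero and compress far better
deltas = [prices[0]] + [b - a for a, b in zip(prices, prices[1:])]

raw_c = zlib.compress(struct.pack(f"{len(prices)}q", *prices))
dlt_c = zlib.compress(struct.pack(f"{len(deltas)}q", *deltas))
print(len(raw_c), len(dlt_c))  # the delta-encoded stream is much smaller
```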
Bar construction from tick data. Most trading analysis is conducted on bar data rather than tick-level data — moving averages, RSI, Bollinger Bands, and the vast majority of technical indicators are defined on OHLCV bars. Constructing bars from tick data allows the bar type and resolution to be determined by the analysis requirements rather than by the data source.
Time-based bar construction from tick data produces OHLCV bars at any resolution — 1-second, 5-second, 1-minute, or any other time resolution — by grouping tick records by time bucket and computing the open (first trade price in the bucket), high (maximum trade price), low (minimum trade price), close (last trade price), and volume (sum of trade volumes). Time-based bars constructed from tick data are equivalent to the bars that data vendors provide but can be generated at any resolution from the same underlying tick data.
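A minimal time-bar builder over (epoch-seconds, price, volume) trade ticks, shown with a 60-second bucket:

```python
def time_bars(ticks, bucket_seconds=60):
    """Aggregate (epoch_seconds, price, volume) trade ticks into OHLCV
    bars keyed by bucket start time. Assumes ticks arrive time-ordered."""
    bars = {}
    for ts, price, vol in ticks:
        key = ts - ts % bucket_seconds
        if key not in bars:
            bars[key] = {"open": price, "high": price, "low": price,
                         "close": price, "volume": 0.0}
        b = bars[key]
        b["high"] = max(b["high"], price)
        b["low"] = min(b["low"], price)
        b["close"] = price   # last tick in the bucket so far
        b["volume"] += vol
    return bars

ticks = [(0, 100.0, 1), (30, 101.5, 2), (59, 99.5, 1), (61, 100.5, 3)]
bars = time_bars(ticks)
# bars[0] == {'open': 100.0, 'high': 101.5, 'low': 99.5, 'close': 99.5, 'volume': 4.0}
```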
Volume bars — bars that close after a defined traded volume — are constructed by accumulating tick volume until the threshold is reached, then closing the bar. Volume bars normalise for market activity rather than time, making each bar represent the same market participation. Tick bars — bars that close after a defined number of individual trades — provide a similar normalisation. Dollar bars — bars that close after a defined dollar volume has traded — normalise for the capital deployed rather than the number of trades.
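A volume-bar sketch under one simplifying assumption: the tick that crosses the threshold closes the bar rather than being split across bars. Tick bars and dollar bars swap the accumulator (trade count, price times volume) but share the same loop:

```python
def volume_bars(ticks, threshold):
    """ticks: time-ordered (price, volume) trades. A bar closes on the
    tick that lifts cumulative volume to the threshold."""
    bars, bar_prices, cum = [], [], 0.0
    for price, vol in ticks:
        bar_prices.append(price)
        cum += vol
        if cum >= threshold:
            bars.append({"open": bar_prices[0], "high": max(bar_prices),
                         "low": min(bar_prices), "close": bar_prices[-1],
                         "volume": cum})
            bar_prices, cum = [], 0.0
    return bars

bars = volume_bars([(100, 40), (101, 70), (99, 30), (100, 80)], 100)
# two bars: the first closes on the 70-lot trade, the second on the 80-lot
```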
Renko and range bars from tick data — the price-movement-based bar types that close when price moves a defined amount rather than after a defined time or volume — require processing tick data sequentially to track the cumulative price movement and identify when the threshold for a new bar is reached.
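A range-bar sketch of that sequential scan (a new bar opens at the tick that closed the previous one; Renko construction adds fixed brick alignment on top of the same loop):

```python
def range_bars(prices, bar_range):
    """Close a bar once its high-low span reaches bar_range; the next
    bar opens at the closing tick."""
    bars = []
    op = hi = lo = prices[0]
    for p in prices[1:]:
        hi, lo = max(hi, p), min(lo, p)
        if hi - lo >= bar_range:
            bars.append({"open": op, "high": hi, "low": lo, "close": p})
            op = hi = lo = p
    return bars

bars = range_bars([100.0, 100.2, 100.6, 100.4, 99.9, 100.1], 0.5)
# two bars: one closing at 100.6 on the way up, one at 99.9 on the way down
```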
Order flow analysis. The classification of each trade as buyer-initiated or seller-initiated, and the aggregation of this classification into directional volume metrics, require tick-level data with trade price and quote context.
Trade classification using the Lee-Ready algorithm or the BVC (Bulk Volume Classification) algorithm determines whether each trade was buyer-initiated (the buyer was the aggressor who crossed the spread) or seller-initiated (the seller crossed the spread). For exchanges that provide aggressor side in the trade feed, classification uses the exchange-provided flag. For data sources that do not provide aggressor side, classification algorithms estimate the aggressor based on the relationship between the trade price and the prevailing bid/ask at the time of the trade.
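A simplified version of the quote-rule-then-tick-rule logic. Real Lee-Ready implementations also lag quotes to account for reporting delays and handle stale quotes; the prices here are chosen for illustration:

```python
def classify_trade(trade_price, bid, ask, prev_price=None):
    """Return +1 for buyer-initiated, -1 for seller-initiated, 0 if
    indeterminate. Quote rule first; tick rule breaks midpoint ties."""
    mid = (bid + ask) / 2
    if trade_price > mid:
        return 1
    if trade_price < mid:
        return -1
    # trade at the midpoint: fall back to the tick rule
    if prev_price is not None:
        if trade_price > prev_price:
            return 1
        if trade_price < prev_price:
            return -1
    return 0

buy = classify_trade(100.375, 100.0, 100.5)                      # above mid: +1
sell = classify_trade(100.25, 100.0, 100.5, prev_price=100.375)  # at mid, downtick: -1
```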
Delta calculation — the difference between buyer-initiated volume and seller-initiated volume within a bar — produces the directional volume metric that order flow trading analysis uses to assess whether buyers or sellers are more aggressive. Cumulative delta — the running sum of delta over the session — tracks the overall directional tendency of order flow through the trading day.
Volume at price analysis — computing the volume traded at each distinct price within a bar or time period — produces the footprint chart data that shows the distribution of buying and selling activity across price levels. Volume profile computation — aggregating volume by price level over a longer period — produces the price distribution data that volume profile trading analysis uses to identify high-volume nodes and value areas.
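Both metrics fall out of one pass over classified ticks. This sketch assumes each tick already carries a side flag (+1 buyer-initiated, -1 seller-initiated):

```python
from collections import defaultdict

def footprint(ticks):
    """ticks: (price, volume, side). Returns per-price buy/sell volume
    (footprint rows) and the bar delta (buy volume minus sell volume)."""
    vap = defaultdict(lambda: {"buy": 0.0, "sell": 0.0})
    delta = 0.0
    for price, vol, side in ticks:
        vap[price]["buy" if side > 0 else "sell"] += vol
        delta += side * vol
    return dict(vap), delta

vap, delta = footprint([(100.0, 5, 1), (100.0, 3, -1), (100.5, 2, 1)])
# delta == 4.0; the 100.0 level shows 5 bought against 3 sold
```

Cumulative delta is simply the running sum of per-bar deltas over the session.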
High-frequency data analysis. For research and systems operating at the sub-second level, tick data processing extends to the microstructure analysis that requires tick-level resolution.
Bid-ask spread analysis — computing the time-weighted average spread, the volume-weighted average spread, and the distribution of spread widths across different times of day, days of the week, and market conditions. Spread dynamics around market events — the spread widening that precedes and follows news releases, the spread compression during high-liquidity periods, the spread spikes that indicate temporary liquidity withdrawal.
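Time-weighting matters because quotes persist for unequal durations. A minimal sketch, assuming each quote is in force until the next update:

```python
def time_weighted_spread(quotes, end_ts):
    """quotes: time-ordered (ts, bid, ask); each quote is assumed to be
    in force until the next update (or end_ts for the last one)."""
    total_time, weighted = 0.0, 0.0
    for (ts, bid, ask), (next_ts, _, _) in zip(
            quotes, quotes[1:] + [(end_ts, None, None)]):
        duration = next_ts - ts
        total_time += duration
        weighted += (ask - bid) * duration
    return weighted / total_time

# spread is 0.2 for 8 seconds and 0.1 for 2 seconds -> TWAS 0.18
twas = time_weighted_spread(
    [(0, 99.9, 100.1), (6, 99.95, 100.05), (8, 99.9, 100.1)], 10)
```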
Price impact analysis — the relationship between trade size and price movement, quantifying how much price moves on average for each unit of volume traded. Price impact is the primary determinant of execution cost for large orders and is estimated from tick data that records the sequence of trades and prices.
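The simplest impact estimate is the least-squares slope of price change on signed volume, a Kyle-lambda-style coefficient. The toy data here is constructed to be perfectly linear so the slope is known:

```python
def price_impact_slope(signed_volumes, price_changes):
    """Least-squares slope of price change on signed (net) volume:
    the average price move per unit of net volume traded."""
    n = len(signed_volumes)
    mx = sum(signed_volumes) / n
    my = sum(price_changes) / n
    cov = sum((x - mx) * (y - my)
              for x, y in zip(signed_volumes, price_changes))
    var = sum((x - mx) ** 2 for x in signed_volumes)
    return cov / var

vols = [10, -5, 20, -15, 8]
moves = [v * 0.001 for v in vols]        # exactly 0.001 per unit by construction
slope = price_impact_slope(vols, moves)  # ~0.001
```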
Intraday seasonality — the patterns in tick rate, spread, and volatility that repeat at consistent times of day. Tick rate seasonality shows the periods of high and low market activity within the trading session. Volatility seasonality shows when price movement is concentrated. These patterns inform execution scheduling and strategy session filters.
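The simplest seasonality profile is a tick count per UTC hour of day; a single-session sketch (averaging across many sessions is the same aggregation divided by the session count):

```python
from collections import Counter

def tick_rate_by_hour(tick_epochs):
    """Tick count per UTC hour of day for one session, the most basic
    intraday activity profile. tick_epochs: epoch seconds per tick."""
    return dict(Counter((ts // 3600) % 24 for ts in tick_epochs))

profile = tick_rate_by_hour([9 * 3600, 9 * 3600 + 5, 9 * 3600 + 10, 14 * 3600])
# {9: 3, 14: 1}: three ticks in the 09:00 UTC hour, one at 14:00
```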
Data quality and cleaning. Raw tick data from any source contains errors — erroneous prices, duplicate records, timestamp inconsistencies, and missing data. Tick data cleaning processes identify and handle these issues before the data is used for research or system inputs.
Outlier detection — identifying prices that are implausibly far from the surrounding price history given the instrument's typical volatility. Erroneous price spikes that represent data errors rather than genuine market prices are flagged and optionally replaced with interpolated values or excluded from the clean dataset. The threshold for outlier detection is calibrated to the instrument's volatility — a 0.1% move in a second is plausible for a volatile cryptocurrency but implausible for a large-cap equity in a normal market.
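A rolling median/MAD filter is one common shape for this check; the window and multiplier here are illustrative and would be calibrated per instrument:

```python
def flag_outliers(prices, window=20, k=8.0):
    """Flag a price further than k MADs from the trailing median.
    The median/MAD pair is robust, so a spike does not poison the
    statistics used to judge the ticks that follow it."""
    flags = []
    for i, p in enumerate(prices):
        hist = sorted(prices[max(0, i - window):i])
        if len(hist) < 5:
            flags.append(False)   # not enough history to judge
            continue
        med = hist[len(hist) // 2]
        mad = sorted(abs(x - med) for x in hist)[len(hist) // 2] or 1e-9
        flags.append(abs(p - med) > k * mad)
    return flags

flags = flag_outliers([100.0, 100.1, 100.0, 100.2, 100.1, 100.0, 180.0, 100.1])
# only the 180.0 spike is flagged
```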
Duplicate detection and deduplication — identifying records that represent the same tick event appearing multiple times in the raw data (a common occurrence with real-time feed capture where duplicate delivery can occur on reconnection). Duplicate records are removed before storage or during the cleaning pass.
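A minimal order-preserving deduplication pass over (timestamp, price, volume) records. Real feeds usually carry sequence numbers, which make a more reliable dedup key, because genuinely distinct trades can share all three fields:

```python
def dedupe(ticks):
    """Drop exact repeats of (timestamp, price, volume) records while
    preserving order; replayed ticks after a reconnect disappear."""
    seen, out = set(), []
    for t in ticks:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

clean = dedupe([(1, 100.0, 5), (2, 100.1, 3), (2, 100.1, 3)])
# [(1, 100.0, 5), (2, 100.1, 3)]
```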
Timestamp normalisation — converting timestamps from source-specific formats and time zones to UTC with consistent precision. Different data sources represent timestamps at different resolutions (second, millisecond, microsecond, nanosecond) and in different time zones. Normalisation to UTC microseconds provides consistent temporal ordering across sources.
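A sketch of string-timestamp normalisation to UTC epoch microseconds. The integer timedelta division at the end avoids the precision loss of multiplying a float `timestamp()` by one million:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_utc_micros(ts_str, fmt, source_utc_offset_hours):
    """Parse a source-local timestamp string and return UTC epoch
    microseconds. Sources delivering native epoch timestamps skip the
    parse and only need a unit conversion."""
    local = datetime.strptime(ts_str, fmt).replace(
        tzinfo=timezone(timedelta(hours=source_utc_offset_hours)))
    return (local.astimezone(timezone.utc) - EPOCH) // timedelta(microseconds=1)

# 09:30 in UTC-5 is 14:30 UTC
us = to_utc_micros("2024-01-02 09:30:00.123", "%Y-%m-%d %H:%M:%S.%f", -5)
```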
Research data serving. Making tick data efficiently accessible for research workflows — the backtesting engine that requests tick data for a specific instrument and time range, the Python research notebook that loads a day's ticks for analysis, the real-time strategy that queries the recent tick history for indicator initialisation.
Research data APIs — the endpoints that expose tick data to consuming systems with the query interface that research workflows require. Instrument and time range queries that retrieve all ticks for a specified instrument between two timestamps. Filtered queries that return only trades (excluding quote ticks) or only ticks above a minimum size threshold. Aggregation queries that return pre-computed bar data for a specified bar type and resolution.
Data access libraries for Python research workflows — the client library that wraps the data API with the Pandas-compatible interface that quantitative researchers use. Loading tick data for a specific instrument and date as a Pandas DataFrame with a single function call — the data access pattern that allows research notebooks to focus on analysis rather than data retrieval plumbing.
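The shape of such a library is a one-call loader over the partitioned store. In this sketch the reader callable and the path convention are hypothetical stand-ins for a PyArrow/Parquet backend that would hand back a DataFrame:

```python
def load_ticks(symbol, day, reader):
    """One-call tick access for research notebooks. `reader` hides the
    storage backend (Parquet on S3, ClickHouse, ...) behind a
    hypothetical symbol/date path convention."""
    return reader(f"symbol={symbol}/date={day}/ticks.parquet")

# stand-in backend for the sketch: a dict playing the object store
store = {"symbol=EURUSD/date=2024-01-02/ticks.parquet": [(0, 1.0945, 2)]}
ticks = load_ticks("EURUSD", "2024-01-02", store.get)
```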
Technologies Used
- Rust — ultra-high-performance tick data ingestion, real-time bar construction, order flow calculation, data validation pipeline
- Python — research data processing, statistical analysis, Pandas-compatible data access libraries, alternative data integration
- C# / ASP.NET Core — exchange feed integration, data vendor connectivity, REST API for research data serving
- TimescaleDB / PostgreSQL — tick data storage with time-series query optimisation and automatic partitioning
- ClickHouse — columnar analytical database for high-throughput tick data queries at research scale
- Parquet / Apache Arrow — columnar file format for efficient tick data storage and Python research stack compatibility
- Redis — real-time tick state, recent tick cache for indicator initialisation, processing coordination
- Apache Kafka — high-throughput tick data ingestion pipeline and distribution to multiple consuming systems
- AWS S3 / object storage — cost-effective large-scale tick data archival
- WebSocket — real-time exchange feed connectivity for tick data capture
- Binance / Bybit / Kraken WebSocket APIs — cryptocurrency tick data capture
- Interactive Brokers TWS API — equity and futures tick data
- Polygon.io / Refinitiv APIs — professional tick data vendor integration
- NumPy / Pandas / PyArrow — Python numerical computation for tick data analysis
Tick Data as Research Infrastructure
The quality of quantitative research is bounded by the resolution and quality of the data it is conducted on. Research conducted on daily bars cannot detect the intraday patterns that tick data reveals. Research conducted on tick data with undetected errors produces conclusions that do not hold in live trading. Backtesting on constructed bars rather than the underlying tick data misses the execution realities — the intrabar price movement, the spread costs, the partial fill probabilities — that tick-accurate backtesting captures.
Tick data processing infrastructure that captures data completely, validates it rigorously, stores it efficiently, and serves it to research and production systems through well-designed interfaces is the research foundation that systematic trading operations build on. The investment in this infrastructure is recovered in the quality of research it enables and the accuracy of the live systems it supports.
High-Resolution Data for High-Resolution Analysis
Tick data contains information that bar data discards. The analysis and systems that use tick data see the market at its actual resolution rather than at the resolution that data summarisation imposes. Custom tick data processing infrastructure built for the specific instruments, the specific resolution requirements, and the specific research and production use cases of the operation provides the data foundation that high-resolution analysis requires.