Building Foundation Models for Banking Data: A Deep Dive into PRAGMA

Most machine learning systems I have worked on in financial domains follow the same pattern: a messy, heterogeneous stream of user activity gets transformed into a static feature table, and then a model—usually gradient boosting or a shallow neural net—tries to make sense of it.

That pipeline works, but it is brittle. Every new task (fraud, credit, churn, engagement) tends to grow its own feature layer. Every change in product behavior forces rework. And we throw away structure—time, sequence, relationships between events—in favor of flattened snapshots.

PRAGMA: Revolut Foundation Model (Ostroukhov et al., 2026) takes a different stance: instead of forcing financial data into tabular form first, it treats user histories as first-class sequential objects, closer to structured streams than to a single wide row. The goal is to pre-train a foundation model on these sequences and reuse it across tasks like credit scoring, fraud detection, and lifetime value (LTV) prediction.

This post is a technical walkthrough of that paper: how PRAGMA represents events, why naive “just use an LLM” or “just aggregate features” approaches break, and which design choices matter for production-scale banking data.

The problem with “tabular first” and “text first”

At first glance, pointing a Transformer at banking data sounds easy: serialize records and train. In practice, banking histories have a few properties that make naive approaches fail.

1. Heterogeneity. A single event can mix numerical amounts, categorical merchant types, free-text descriptions, and metadata (channel, device, location). Flattening into unstructured text loses type semantics; treating everything as generic tokens obscures what each field means.

2. Irregular time gaps. Unlike tokenized sentences, events are not evenly spaced. A user might make ten transactions in one day, then go quiet for weeks. Standard positional embeddings do not encode “wall-clock spacing” well on their own.

3. Long-tailed lengths. Some users have thousands of events; others have few. Padding to a fixed length wastes compute; aggressive truncation drops signal. Production systems need truncation strategies that acknowledge the tradeoff (the paper uses a fixed context window and augments profile state with life-long milestone events—timestamped first occurrences such as a first top-up—so early history is not entirely lost when the tail is cut).

4. Multi-source context. Behavior depends on static profile (tenure, region, plan), the event stream (transactions, app usage, communications), and recurring calendar structure. Models that ignore profile or bolt it on as an afterthought leave performance on the table.

The cumulative effect is familiar: feature engineering becomes the bottleneck, models do not transfer well across tasks, and temporal structure is underused. PRAGMA is aimed squarely at that gap.

What does not work (and why)

Before getting to the architecture, it helps to see why three common alternatives still fall short.

Treat everything as text (LLM-style serialization)

The dominant instinct—serialize events into sentences and reuse the LLM stack—is seductive. Example: “User made a card payment of 14.99 GBP at a grocery store on Monday morning.”

It breaks down in subtle ways. Token inefficiency is the first: field names, delimiters, and formatting turn one logical event into many subword tokens. That inflates context usage for long histories, which is exactly where banking data lives.

Numerical fidelity is worse. When amounts are split into digit or subword fragments, magnitude and ordering—the currency of finance—are not represented cleanly. There is also a schema erasure problem: in the raw data, “amount” and “merchant category” are distinct fields; in text they are adjacent tokens, and the model must re-discover structure you already had.

So the model ends up doing two jobs at once: recover structure from prose, then learn behavior on top. That shows up as weaker, less sample-efficient representations.
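To make the token-count and numeric-fidelity points concrete, here is a rough sketch. The regex is only a crude stand-in for a real subword tokenizer (actual counts depend on the vocabulary), and the structured key and value names are hypothetical, not the paper's vocabulary.

```python
import re

sentence = "User made a card payment of 14.99 GBP at a grocery store on Monday morning."

# Crude stand-in for a subword tokenizer: whole words, single digits, punctuation.
text_tokens = re.findall(r"[A-Za-z]+|\d|[^\sA-Za-z\d]", sentence)

# The same event as typed key/value pairs (names are illustrative assumptions).
structured_tokens = [
    ("[KEY:type]", "[CAT:card_payment]"),
    ("[KEY:amount]", "[BUCKET:2]"),
    ("[KEY:currency]", "[CAT:GBP]"),
    ("[KEY:merchant_category]", "[CAT:grocery]"),
]

print(len(text_tokens), "text tokens vs", len(structured_tokens), "typed fields")
# The amount is scattered over several fragments with no notion of magnitude.
print(text_tokens[6:11])
```

Even in this toy case the amount ends up as five unrelated fragments, while the structured view keeps it as one value attached to an explicit key.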

Tabular deep learning (handcrafted aggregates)

The opposite extreme is the classic pipeline: compute aggregates—last 30-day spend, transaction counts, average balance—and feed a tree model or small MLP.

Operationally this is dependable. Representationally it destroys temporal dynamics. A spending spike followed by silence, a chain of small transactions before a large one, or recurring pay-cycle deposits are not recoverable from a few rolling sums. Two users with different trajectories can look identical after aggregation.
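A toy illustration of that collapse, with made-up amounts: two hypothetical users whose 30-day histories differ sharply in shape yet produce identical snapshot features. The `aggregate` helper and the feature names are my own assumptions for illustration, not features from the paper.

```python
from statistics import mean

# (day_index, amount) pairs over a 30-day window; all values are made up.
steady_user = [(d, 50.0) for d in range(0, 30, 5)]  # six evenly spaced payments
bursty_user = [(0, 10.0), (0, 10.0), (1, 10.0), (1, 10.0), (2, 10.0), (29, 250.0)]

def aggregate(events):
    """Classic snapshot features: order and spacing are discarded."""
    amounts = [amount for _, amount in events]
    return {
        "txn_count_30d": len(amounts),
        "total_spend_30d": sum(amounts),
        "avg_amount_30d": mean(amounts),
    }

def gaps_in_days(events):
    """Inter-event gaps, which only the sequence view retains."""
    days = [day for day, _ in sorted(events)]
    return [later - earlier for earlier, later in zip(days, days[1:])]

print(aggregate(steady_user) == aggregate(bursty_user))  # same snapshot features
print(gaps_in_days(steady_user))
print(gaps_in_days(bursty_user))
```

The aggregates match exactly, while the gap sequences (regular spacing versus a burst followed by a long silence and a large payment) are what a sequence model would actually get to see.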

It also does not share statistical strength across products and tasks: each new objective spawns new feature work. The PRAGMA authors explicitly contrast this with a shared backbone trained once on raw sequences.

Standard sequence models without a disciplined representation

Encoding each event as a dense vector and feeding an RNN or Transformer is directionally right—order is preserved. But teams often patch heterogeneity ad hoc: different scaling for numerics, one-off embeddings for categories, text handled inconsistently, missing values special-cased. Time is frequently reduced to positions or raw timestamps that do not capture irregular gaps or calendar regularities. Profile context is often concatenated awkwardly at the end of the sequence.

The paper’s diagrams (roughly pages 5–6 in the PDF) make the design point explicit: separate event-level encoding, profile-level encoding, and history-level fusion so each stage can specialize before the sequence model does its job.

The core realization

Text models underuse structure. Tabular models underuse sequence. Naïve sequence models underuse typed fields and well-encoded time.

PRAGMA’s bet is that representation and tokenisation matter as much as the Transformer stack. If keys, values, and time are encoded cleanly, the encoder can learn patterns that generic serialization obscures.

A mental model for PRAGMA

At a high level, PRAGMA treats a user record as a structured document:

A profile state: static key–value attributes at an evaluation cut-off (plan, region, quantiles, and similar).
A sequence of events, each a timestamped set of key–value fields drawn from multiple sources (transactions, app, communications, trading, …).
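One way to picture that record is as a pair of small data classes. This is a minimal sketch under my own assumptions: the class names, the `type` field, and the `first_<type>_at` milestone keys are hypothetical, and the truncation helper only mirrors the idea mentioned earlier (fixed context window plus first-occurrence milestones folded into the profile), not the paper's actual implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Event:
    timestamp: datetime
    source: str   # e.g. "transactions", "app", "communications"
    fields: dict  # key -> value for this event

@dataclass
class UserRecord:
    cutoff: datetime  # evaluation cut-off for the profile state
    profile: dict     # static key-value attributes
    events: list = field(default_factory=list)

def truncate_with_milestones(record, max_events):
    """Keep the most recent events; store first-occurrence timestamps in the profile."""
    history = sorted(
        (e for e in record.events if e.timestamp <= record.cutoff),
        key=lambda e: e.timestamp,
    )
    milestones = {}
    for event in history:
        kind = event.fields.get("type", event.source)
        milestones.setdefault(f"first_{kind}_at", event.timestamp)
    recent = history[-max_events:] if max_events > 0 else []
    return UserRecord(
        cutoff=record.cutoff,
        profile={**record.profile, **milestones},
        events=recent,
    )

record = UserRecord(
    cutoff=datetime(2024, 4, 1),
    profile={"plan": "standard", "region": "GB"},
    events=[
        Event(datetime(2023, 1, 5), "transactions", {"type": "top_up", "amount": 20.0}),
        Event(datetime(2024, 3, 1), "transactions", {"type": "card_payment", "amount": 14.99}),
        Event(datetime(2024, 3, 2), "transactions", {"type": "card_payment", "amount": 3.50}),
    ],
)
trimmed = truncate_with_milestones(record, max_events=1)
print(len(trimmed.events), sorted(trimmed.profile))
```

Only the latest event survives truncation here, but the date of the first top-up is still available to the model through the profile.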

The model is encoder-only and bidirectional: the training objective is masked reconstruction, not open-ended generation—well suited to discriminative downstream tasks where you want a transferable embedding.

Figure: PRAGMA hierarchical encoder. Static profile and heterogeneous events are encoded in separate branches; the history encoder fuses event embeddings with temporal signals and the profile embedding (Revolut PRAGMA, Fig. 2–3 style decomposition).

This separation matters:

Local patterns are learned inside the event encoder.
Cross-event patterns are learned in the history encoder.
The user's identity and context are injected via the profile branch instead of being diffused across a messy token soup.

Key–value–time tokenisation

The central representational move is to decompose each field into three aligned signals:

Piece	Role	Typical handling
Key	Semantic type (`amount`, `merchant_category`, …)	Single token per key (~60 key tokens in the paper’s vocabulary)
Value	The payload	Percentile buckets for numerics; categorical IDs; subwords for text
Time	When the event occurred	See next section

For example, a numeric amount might land in a bucket token that preserves rank information without fragile float tokenization; a category maps to a dedicated token; text may expand to multiple value tokens while the key still anchors meaning.
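The handling described above can be sketched in a few lines. Treat this as an illustration under assumptions: the token strings (`[KEY:...]`, `[BUCKET:...]`, `[CAT:...]`), the helper names, and the whitespace split standing in for subword tokenisation are all mine, and the bucket edges come from a tiny made-up sample.

```python
from bisect import bisect_right

def fit_percentile_edges(values, n_buckets=10):
    """Bucket boundaries taken from the empirical distribution of training values."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * i) // n_buckets] for i in range(1, n_buckets)]

def tokenise_event(event, edges_by_key, text_keys=()):
    """Turn one event dict into (key_token, value_token) pairs."""
    tokens = []
    for key, value in event.items():
        key_token = f"[KEY:{key}]"
        if key in edges_by_key:
            # Numeric: rank-preserving bucket id instead of raw digits.
            bucket = bisect_right(edges_by_key[key], value)
            tokens.append((key_token, f"[BUCKET:{bucket}]"))
        elif key in text_keys:
            # Text: several value tokens, each still anchored to the same key.
            for piece in str(value).lower().split():
                tokens.append((key_token, piece))
        else:
            # Categorical: one dedicated token.
            tokens.append((key_token, f"[CAT:{value}]"))
    return tokens

training_amounts = [1.5, 3.0, 4.2, 7.9, 12.0, 14.99, 20.0, 35.0, 80.0, 400.0]
edges = {"amount": fit_percentile_edges(training_amounts, n_buckets=5)}
event = {"amount": 14.99, "merchant_category": "grocery", "description": "Weekly shop"}
for pair in tokenise_event(event, edges, text_keys={"description"}):
    print(pair)
```

Larger amounts always land in an equal or higher bucket, so ordering survives tokenisation, and every value token travels with the key that says what it means.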

Figure: Key–value–time tokenisation. Each field is disentangled: the model always knows which semantic slot it is reading, how the value was encoded, and where the event sits in time, without serialising everything as free-form text.

Why this works:

The model always knows which slot it is reading (key embedding).
Types are not smashed together into undifferentiated text.
The token space stays controlled relative to naive serialization.

Temporal encoding

Time is one of the hardest parts of financial modeling. PRAGMA combines two complementary views:

Log-seconds since the previous event — emphasizes recency, compresses long idle periods, and keeps fine resolution when gaps are short (so an hour still matters differently from a day, while month-scale gaps are not dominated by linear raw seconds).
Cyclical calendar features — hour-of-day, day-of-week, and related signals capture recurring behaviors (salary deposits, weekend spending spikes).

Together they encode relative spacing and absolute calendar structure.
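A minimal sketch of the two views, using only the standard library. Two details are my assumptions rather than something the paper specifies: `log1p` is used so a zero-second gap is well defined, and the cyclical features are encoded as sine/cosine pairs, which is a common choice for periodic signals.

```python
import math
from datetime import datetime, timedelta, timezone

def relative_time_feature(prev_ts, ts):
    """log(1 + seconds since the previous event): fine for short gaps, compressed for long ones."""
    gap_seconds = max((ts - prev_ts).total_seconds(), 0.0)
    return math.log1p(gap_seconds)

def cyclical_features(ts):
    """Sine/cosine pairs so 23:00 sits next to 00:00 and Sunday next to Monday."""
    hour_angle = 2 * math.pi * (ts.hour + ts.minute / 60) / 24
    dow_angle = 2 * math.pi * ts.weekday() / 7
    return (math.sin(hour_angle), math.cos(hour_angle),
            math.sin(dow_angle), math.cos(dow_angle))

t0 = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)  # a Monday at midnight UTC
gaps = [("1 hour", timedelta(hours=1)),
        ("1 day", timedelta(days=1)),
        ("30 days", timedelta(days=30))]
for label, gap in gaps:
    print(label, round(relative_time_feature(t0, t0 + gap), 3))
print(cyclical_features(t0))
```

On the log scale, the step from an hour to a day is roughly as large as the step from a day to a month, which is the compression the text describes: short gaps stay distinguishable and long silences do not swamp everything else.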

Figure: Two complementary time encodings. Relative time keeps fine resolution for short gaps and compresses long silence; cyclical features capture payday weekends, salary deposits, and other calendar structure.

Pre-training: multi-level masked modeling

PRAGMA uses a BERT-style masked objective, extended to structured records. Masking is drawn from three mechanisms (plus occasional UNK replacements instead of MASK tokens so the model does not overfit to mask symbols that never appear at inference):

Token-level (~15%) — predict individual corrupted values.
Event-level (~10%) — reconstruct an entire masked event.
Semantic-type / key-level (~10%) — mask all values for a chosen key across the record, forcing the model to use context and field identity.

That mixture pushes the model past shallow co-occurrence: it must use within-event structure, cross-event dependencies, and cross-feature interactions.
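Here is a rough sketch of how the three mechanisms could be combined on a tokenised record. How the paper actually composes them is not something I can confirm, so the following are assumptions: all three are applied independently to the same record, each key is selected for key-level masking with probability `p_key`, only values are corrupted (keys stay visible), and `unk_rate` is a made-up value for the UNK substitution.

```python
import random

MASK, UNK = "[MASK]", "[UNK]"

def mask_record(record, rng, p_token=0.15, p_event=0.10, p_key=0.10, unk_rate=0.1):
    """Return (corrupted_record, targets); targets maps (event_idx, field_idx) -> original value."""
    keys = sorted({key for event in record for key, _ in event})
    masked_keys = {key for key in keys if rng.random() < p_key}              # key-level
    masked_events = {i for i in range(len(record)) if rng.random() < p_event}  # event-level

    corrupted, targets = [], {}
    for i, event in enumerate(record):
        new_event = []
        for j, (key, value) in enumerate(event):
            hit = (i in masked_events or key in masked_keys
                   or rng.random() < p_token)                                # token-level
            if hit:
                targets[(i, j)] = value  # what the model must reconstruct
                value = UNK if rng.random() < unk_rate else MASK
            new_event.append((key, value))
        corrupted.append(new_event)
    return corrupted, targets

record = [
    [("amount", "[BUCKET:2]"), ("merchant_category", "[CAT:grocery]")],
    [("amount", "[BUCKET:7]"), ("merchant_category", "[CAT:travel]")],
    [("screen", "[CAT:home]")],
]
corrupted, targets = mask_record(record, random.Random(0))
print(corrupted)
print(targets)
```

The `targets` dictionary holds the original values at corrupted positions, which is what a reconstruction loss would be computed against.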

Figure: Multi-level masked modelling. Masking at several granularities pushes the model to learn local field structure, cross-event dependencies, and interactions between feature types, not just shallow co-occurrence.

Scale and training engineering

The reported pre-training corpus is 24 billion events, 207 billion tokens, and 26 million user records across 111 countries, with a 25-month window (2023–2025) chosen to balance coverage, recency, and distribution shift—details that matter as much as any layer count when you are operating on real institutional timelines.

Two throughput optimizations stand out in the paper:

Sequence packing — pack multiple shorter user histories into one training sequence to cut padding waste.
Dynamic batching — vary batch size with sequence length so GPU memory is used predictably on heavy-tailed lengths.

The authors report on the order of 2–5× throughput improvement from these techniques. That is the difference between an experiment and a training job that finishes.
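Both ideas are easy to sketch. The paper does not spell out its packing or batching algorithms in the parts summarised here, so I am swapping in two simple, well-known heuristics: first-fit-decreasing bin packing, and length-sorted batches capped by a padded-token budget. Function names and the example lengths are hypothetical.

```python
def pack_sequences(lengths, max_tokens):
    """First-fit-decreasing packing: each bin is one training sequence of <= max_tokens."""
    bins = []  # each bin: [remaining_capacity, [sequence indices]]
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        length = min(lengths[idx], max_tokens)  # longer histories assumed truncated upstream
        for b in bins:
            if b[0] >= length:
                b[0] -= length
                b[1].append(idx)
                break
        else:
            bins.append([max_tokens - length, [idx]])
    return [members for _, members in bins]

def dynamic_batches(lengths, token_budget):
    """Group length-sorted sequences so batch_size * longest stays within the budget."""
    batches, current = [], []
    for idx in sorted(range(len(lengths)), key=lambda i: lengths[i]):
        longest = lengths[idx]  # ascending order, so the newest item is the longest
        if current and (len(current) + 1) * longest > token_budget:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches

lengths = [900, 40, 300, 120, 60, 512, 15, 700]
packed = pack_sequences(lengths, max_tokens=1024)
batches = dynamic_batches(lengths, token_budget=1024)

naive_padded = len(lengths) * max(lengths)  # pad every history to the longest
packed_padded = len(packed) * 1024          # pad every packed sequence to the window
print(packed, naive_padded, packed_padded)
print(batches)
```

On these eight toy histories, packing fits everything into three windows instead of eight padded rows. A real implementation would also need attention masks that stop packed histories from attending to each other.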

Adaptation after pre-training

Once the backbone exists, PRAGMA follows two complementary adaptation paths:

Embedding probe — freeze the encoder, train a small head (even linear). Fast and cheap for iteration.
LoRA fine-tuning — update a small fraction of parameters (the paper cites roughly 2–4%), applied to attention and MLP blocks, for near–full fine-tuning quality with most weights shared across tasks.
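For readers who have not used LoRA, here is a dependency-free sketch of the mechanism on a single linear layer: the pre-trained weight stays frozen and only a low-rank update is trained. It follows the standard LoRA recipe (Gaussian-initialised `A`, zero-initialised `B`, scaling by `alpha / rank`); the class name, the toy weight, and the sizes are my own, and nothing here reflects PRAGMA's actual ranks or layer shapes.

```python
import random

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen, shared across tasks
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]  # zero init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_fraction(self):
        frozen = len(self.weight) * len(self.weight[0])
        adapter = sum(len(row) for row in self.A) + sum(len(row) for row in self.B)
        return adapter / (frozen + adapter)

weight = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]  # toy 2x3 "pre-trained" weight
layer = LoRALinear(weight, rank=1)
x = [1.0, 2.0, 3.0]
print(layer.forward(x) == matvec(weight, x))  # adapter is a no-op before training
print(round(layer.trainable_fraction(), 3))
```

The trainable fraction looks large here only because the toy matrix is tiny; for a square layer of width d and rank r the adapter adds 2·d·r parameters against d² frozen ones, so the share shrinks quickly as d grows.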

Figure: Downstream adaptation. The same pre-trained encoder supports a cheap embedding probe for exploration and LoRA specialisation when you need near–full fine-tuning quality with most weights shared across tasks.

Method	Speed	Performance	Typical use
Probing	Fast	Moderate	Exploration, baseline strength checks
LoRA	Medium	High	Production specialization

What the results suggest (qualitatively)

The paper reports relative improvements only (absolute metrics are withheld as commercially sensitive), but the patterns are still instructive:

Credit scoring benefits strongly from scale—risk seems to reward depth.
Simpler predictive tasks (e.g., some LTV or transaction modeling setups) may not need the largest variant.
Profile state matters a great deal for risk-flavored tasks; events alone are not the whole story.
Text encoding can help, but adds latency—another production tradeoff, not a free lunch.

That matches a broader lesson: not every downstream task extracts the same value from the same backbone and modality stack.

What I'd conclude from this

What PRAGMA really demonstrates is not just that transformers can ingest banking data, but that most of the value comes from treating financial activity as a structured sequence problem instead of a feature-engineering problem.

For years, the dominant pattern in fintech has been to compress rich user behavior into static snapshots and hope models can recover signal from aggregates. PRAGMA flips that premise. It assumes the structure already lives in the data—the ordering of events, the semantics of fields, the timing between actions—and focuses on preserving that structure so the model can learn from it.

That shift has a few implications that matter beyond the paper’s benchmarks.

Representation is the core system design choice, not the headline architecture. The key–value–time formulation is doing a disproportionate share of the work. Encode the data in a way that respects types, fields, and time, and the Transformer stops fighting the input; it can learn patterns across events, temporal spacing, and user context instead of reconstructing schema from prose.

Pre-training changes how teams ship models. Rather than isolated pipelines per task, you get a shared backbone that encodes behavior once and reuses it broadly. The fact that a frozen embedding plus a linear probe already performs well is a strong signal: a large fraction of what teams have historically built by hand can be absorbed into the representation layer up front.

Gains are not uniform—and that is a production constraint, not a footnote. The largest improvements tend to show up where signal is sparse, delayed, or expensive to hand-engineer—credit-style risk, engagement, and similar settings. For simpler, more local objectives, smaller models are often enough. A single “one size fits all” footprint is rarely optimal; the real engineering question is where scale earns its cost.

Limitations are as instructive as the wins. The paper’s anti–money laundering discussion is a useful example: the weakness there is not necessarily “the encoder isn’t big enough,” but that the task is fundamentally relational. You cannot fully characterize financial behavior by staring at isolated user timelines forever. The plausible next step is to pair sequence backbones with graph- or network-aware context—counterparties, clusters, and flows—not to pretend every problem is a longer per-user sequence.

Key takeaways

Treating financial data as plain text or as static tables leaves signal on the table. Events, fields, and time are where most of the information lives.
Tokenisation is not a preprocessing detail in systems like this; it defines what the model can and cannot learn.
Pre-training on raw event histories can replace large parts of traditional feature engineering—but only if the representation preserves meaning (keys, values, and time, not a lossy narrative).
Scaling helps selectively. The biggest wins cluster in complex, low-signal tasks, not in every benchmark equally.
Parameter-efficient adaptation (e.g., LoRA) is what keeps foundation models practical; without cheap specialization, reuse across tasks breaks down economically.
Sequence models alone do not solve all financial ML. Problems dominated by networks and relationships will need extensions beyond single-user histories.

If I were building in this space, I would not treat PRAGMA as the last word. I would treat it as a blueprint: a strong representation layer for user behavior that you can extend—especially with relational context—as the obvious next increment.

For the full technical treatment (figures, masking recipe, LoRA and profile ablations), see the paper: PRAGMA: Revolut Foundation Model (arXiv:2604.08649).