Building Foundation Models for Banking Data: A Deep Dive into PRAGMA
How Revolut’s PRAGMA encoder treats multi-source financial histories as first-class sequences—key–value–time tokenisation, hierarchical encoders, temporal encoding, and adaptation without tabular feature walls.
April 27, 2026
Most machine learning systems I have worked on in financial domains follow the same pattern: a messy, heterogeneous stream of user activity gets transformed into a static feature table, and then a model—usually gradient boosting or a shallow neural net—tries to make sense of it.
That pipeline works, but it is brittle. Every new task (fraud, credit, churn, engagement) tends to grow its own feature layer. Every change in product behavior forces rework. And we throw away structure—time, sequence, relationships between events—in favor of flattened snapshots.
PRAGMA: Revolut Foundation Model (Ostroukhov et al., 2026) takes a different stance: instead of forcing financial data into tabular form first, it treats user histories as first-class sequential objects, closer to structured streams than to a single wide row. The goal is to pre-train a foundation model on these sequences and reuse it across tasks like credit scoring, fraud detection, and lifetime value (LTV) prediction.
This post is a technical walkthrough of that paper: how PRAGMA represents events, why naive “just use an LLM” or “just aggregate features” approaches break, and which design choices matter for production-scale banking data.
The problem with “tabular first” and “text first”
At first glance, pointing a Transformer at banking data sounds easy: serialize records and train. In practice, banking histories have a few properties that make naive approaches fail.
1. Heterogeneity. A single event can mix numerical amounts, categorical merchant types, free-text descriptions, and metadata (channel, device, location). Flattening into unstructured text loses type semantics; treating everything as generic tokens obscures what each field means.
2. Irregular time gaps. Unlike tokenized sentences, events are not evenly spaced. A user might make ten transactions in one day, then go quiet for weeks. Standard positional embeddings do not encode “wall-clock spacing” well on their own.
3. Long-tailed lengths. Some users have thousands of events; others have few. Padding to a fixed length wastes compute; aggressive truncation drops signal. Production systems need truncation strategies that acknowledge the tradeoff (the paper uses a fixed context window and augments profile state with life-long milestone events—timestamped first occurrences such as a first top-up—so early history is not entirely lost when the tail is cut).
4. Multi-source context. Behavior depends on static profile (tenure, region, plan), the event stream (transactions, app usage, communications), and recurring calendar structure. Models that ignore profile or bolt it on as an afterthought leave performance on the table.
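The truncation tradeoff in point 3 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `truncate_with_milestones`, the `(timestamp, event_type)` tuple layout, and the choice to treat the first occurrence of every event type as a milestone are assumptions made for the example.

```python
from datetime import datetime

def truncate_with_milestones(events, context_len):
    """Keep the most recent events plus first-occurrence milestones.

    `events` is a list of (timestamp, event_type) tuples sorted by time.
    Treating every event type as milestone-worthy is an assumption here.
    """
    milestones = {}
    for timestamp, event_type in events:
        # setdefault only stores the first timestamp seen for each type.
        milestones.setdefault(event_type, timestamp)
    recent = events[-context_len:] if context_len > 0 else []
    return {"milestones": milestones, "recent_events": recent}

# Hypothetical history: the first top-up is far outside the context window.
history = [
    (datetime(2023, 3, 1), "top_up"),
    (datetime(2023, 3, 2), "card_payment"),
    (datetime(2024, 7, 9), "top_up"),
    (datetime(2025, 1, 5), "card_payment"),
    (datetime(2025, 1, 6), "card_payment"),
]
record = truncate_with_milestones(history, context_len=2)
```

Only the last two events survive in the window, but the date of the first top-up still reaches the model through the profile-side milestones.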
The cumulative effect is familiar: feature engineering becomes the bottleneck, models do not transfer well across tasks, and temporal structure is underused. PRAGMA is aimed squarely at that gap.
What does not work (and why)
Before the architecture, it helps to see why three common alternatives still stumble.
Treat everything as text (LLM-style serialization)
The dominant instinct—serialize events into sentences and reuse the LLM stack—is seductive. Example: “User made a card payment of 14.99 GBP at a grocery store on Monday morning.”
It breaks down in subtle ways. Token inefficiency is the first: field names, delimiters, and formatting turn one logical event into many subword tokens. That inflates context usage for long histories, which is exactly where banking data lives.
Numerical fidelity is worse. When amounts are split into digit or subword fragments, magnitude and ordering—the currency of finance—are not represented cleanly. There is also a schema erasure problem: in the raw data, “amount” and “merchant category” are distinct fields; in text they are adjacent tokens, and the model must re-discover structure you already had.
So the model ends up doing two jobs at once: recover structure from prose, then learn behavior on top. That shows up as weaker, less sample-efficient representations.
Tabular deep learning (handcrafted aggregates)
The opposite extreme is the classic pipeline: compute aggregates—last 30-day spend, transaction counts, average balance—and feed a tree model or small MLP.
Operationally this is dependable. Representationally it destroys temporal dynamics. A spending spike followed by silence, a chain of small transactions before a large one, or recurring pay-cycle deposits are not recoverable from a few rolling sums. Two users with different trajectories can look identical after aggregation.
It also does not share statistical strength across products and tasks: each new objective spawns new feature work. The PRAGMA authors explicitly contrast this with a shared backbone trained once on raw sequences.
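The claim that two different trajectories can look identical after aggregation is easy to reproduce. The toy example below uses two made-up users and three typical rolling features; the names `rolling_features`, `spend_30d`, `txn_count_30d`, and `avg_amount` are illustrative, not taken from the paper.

```python
from datetime import date

# Hypothetical user A: steady weekly spend of the same amount.
user_a = [(date(2026, 1, day), 100.0) for day in (1, 8, 15, 22)]

# Hypothetical user B: a chain of small payments, then one large payment.
user_b = [
    (date(2026, 1, 27), 10.0),
    (date(2026, 1, 28), 20.0),
    (date(2026, 1, 29), 30.0),
    (date(2026, 1, 30), 340.0),
]

def rolling_features(events):
    """Hand-built aggregates over a window: total, count, and mean amount."""
    amounts = [amount for _, amount in events]
    return {
        "spend_30d": sum(amounts),
        "txn_count_30d": len(amounts),
        "avg_amount": sum(amounts) / len(amounts),
    }

features_a = rolling_features(user_a)
features_b = rolling_features(user_b)

# The sequences differ, yet the feature rows are indistinguishable.
print(features_a == features_b)  # True
print(user_a == user_b)          # False
```

Adding more aggregates (max amount, days since last transaction) would separate these two users, which is exactly how per-task feature layers grow.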
Standard sequence models without a disciplined representation
Encoding each event as a dense vector and feeding an RNN or Transformer is directionally right—order is preserved. But teams often patch heterogeneity ad hoc: different scaling for numerics, one-off embeddings for categories, text handled inconsistently, missing values special-cased. Time is frequently reduced to positions or raw timestamps that do not capture irregular gaps or calendar regularities. Profile context is often concatenated awkwardly at the end of the sequence.
The paper’s diagrams (roughly pages 5–6 in the PDF) make the design point explicit: separate event-level encoding, profile-level encoding, and history-level fusion so each stage can specialize before the sequence model does its job.
The core realization
Text models underuse structure. Tabular models underuse sequence. Naïve sequence models underuse typed fields and a well-designed encoding of time.
PRAGMA’s bet is that representation and tokenisation matter as much as the Transformer stack. If keys, values, and time are encoded cleanly, the encoder can learn patterns that generic serialization obscures.
A mental model for PRAGMA
At a high level, PRAGMA treats a user record as a structured document:
- A profile state: static key–value attributes at an evaluation cut-off (plan, region, quantiles, and similar).
- A sequence of events, each a timestamped set of key–value fields drawn from multiple sources (transactions, app, communications, trading, …).
The model is encoder-only and bidirectional: the training objective is masked reconstruction, not open-ended generation—well suited to discriminative downstream tasks where you want a transferable embedding.
This separation matters:
- Local patterns are learned inside the event encoder.
- Cross-event patterns are learned in the history encoder.
- Information about who the user is enters through the profile branch instead of being diffused across a messy token soup.
Key–value–time tokenisation
The central representational move is to decompose each field into three aligned signals:
| Piece | Role | Typical handling |
|---|---|---|
| Key | Semantic type (amount, merchant_category, …) | Single token per key (~60 key tokens in the paper’s vocabulary) |
| Value | The payload | Percentile buckets for numerics; categorical IDs; subwords for text |
| Time | When the event occurred | See next section |
For example, a numeric amount might land in a bucket token that preserves rank information without fragile float tokenization; a category maps to a dedicated token; text may expand to multiple value tokens while the key still anchors meaning.
Why this works:
- The model always knows which slot it is reading (key embedding).
- Types are not smashed together into undifferentiated text.
- The token space stays controlled relative to naive serialization.
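A rough sketch of the key and value parts of this scheme follows (the time signal is left out here). Everything concrete in it is an assumption for illustration: the token spellings such as `[KEY:amount]` and `[amount:q2]`, the helper names, the five-bucket setting, and the whitespace split standing in for a real subword tokenizer. The paper's actual vocabulary and bucket counts may differ.

```python
from bisect import bisect_right

def fit_percentile_edges(values, n_buckets):
    """Return interior edges so each bucket holds a similar share of values."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * i) // n_buckets] for i in range(1, n_buckets)]

def tokenize_field(key, value, edges_by_key, text_keys=frozenset()):
    """Emit one key token followed by one or more value tokens."""
    tokens = [f"[KEY:{key}]"]
    if key in edges_by_key:
        # Numeric field: the bucket index preserves rank, not the raw float.
        bucket = bisect_right(edges_by_key[key], value)
        tokens.append(f"[{key}:q{bucket}]")
    elif key in text_keys:
        # Whitespace split is a stand-in for a real subword tokenizer.
        tokens.extend(str(value).lower().split())
    else:
        # Categorical field: one dedicated token per category.
        tokens.append(f"[{key}:{value}]")
    return tokens

def tokenize_event(event, edges_by_key, text_keys=frozenset()):
    """Concatenate the key and value tokens of every field in the event."""
    tokens = []
    for key, value in event.items():
        tokens.extend(tokenize_field(key, value, edges_by_key, text_keys))
    return tokens

# Hypothetical historical amounts used to fit the bucket edges.
past_amounts = [1.0, 2.5, 4.0, 9.99, 14.99, 20.0, 55.0, 120.0, 800.0, 5000.0]
edges = {"amount": fit_percentile_edges(past_amounts, n_buckets=5)}

event = {"amount": 14.99, "merchant_category": "grocery", "description": "Weekly shop"}
tokens = tokenize_event(event, edges, text_keys={"description"})
```

Because buckets come from ranks, a larger amount can never land in a lower bucket than a smaller one, which is the ordering property that digit-level text tokenization fails to guarantee.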
Temporal encoding
Time is one of the hardest parts of financial modeling. PRAGMA combines two complementary views:
- Log-seconds since the previous event — emphasizes recency, compresses long idle periods, and keeps fine resolution when gaps are short (so an hour still matters differently from a day, while month-scale gaps are not dominated by linear raw seconds).
- Cyclical calendar features — hour-of-day, day-of-week, and related signals capture recurring behaviors (salary deposits, weekend spending spikes).
Together they encode relative spacing and absolute calendar structure.
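The two views can be sketched as a small feature function. This is an assumption-laden illustration: the function name `time_features`, the use of `log1p` (so a zero-second gap maps to 0 instead of negative infinity), and the sine/cosine pairing for hour and weekday are choices made here, and the paper's exact formulation may differ.

```python
import math
from datetime import datetime

def time_features(current, previous):
    """Relative gap on a log scale plus cyclical calendar position."""
    gap_seconds = max((current - previous).total_seconds(), 0.0)
    hour_angle = 2 * math.pi * (current.hour + current.minute / 60) / 24
    weekday_angle = 2 * math.pi * current.weekday() / 7
    return {
        # log1p compresses long idle periods but keeps short gaps distinct.
        "log_gap": math.log1p(gap_seconds),
        # Sine/cosine pairs place 23:00 next to 00:00 and Sunday next to Monday.
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "dow_sin": math.sin(weekday_angle),
        "dow_cos": math.cos(weekday_angle),
    }

start = datetime(2026, 4, 27, 9, 0)
hour_gap = time_features(datetime(2026, 4, 27, 10, 0), start)
day_gap = time_features(datetime(2026, 4, 28, 9, 0), start)
month_gap = time_features(datetime(2026, 5, 27, 9, 0), start)
```

In raw seconds the 30-day gap is 720 times the one-hour gap; on the log scale it is less than twice as large, so neither end of the range swamps the other.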
Pre-training: multi-level masked modeling
PRAGMA uses a BERT-style masked objective, extended to structured records. Masking is drawn from three mechanisms (plus occasional UNK replacements instead of MASK tokens so the model does not overfit to mask symbols that never appear at inference):
- Token-level (~15%) — predict individual corrupted values.
- Event-level (~10%) — reconstruct an entire masked event.
- Semantic-type / key-level (~10%) — mask all values for a chosen key across the record, forcing the model to use context and field identity.
That mixture pushes the model past shallow co-occurrence: it must use within-event structure, cross-event dependencies, and cross-feature interactions.
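One way to picture the three mechanisms is a masking routine over a record of key-to-value dictionaries. This is a sketch under stated assumptions: the post does not say how the three rates are combined, so drawing them as independent Bernoulli trials is a guess; `p_unk=0.1` is a hypothetical rate; and `mask_record` and the token strings are invented names.

```python
import random

MASK, UNK = "[MASK]", "[UNK]"

def mask_record(events, rng, p_token=0.15, p_event=0.10, p_key=0.10, p_unk=0.1):
    """Return (corrupted_events, targets) for one user record.

    `events` is a list of {key: value_token} dicts. `targets` maps
    (event_index, key) to the original value the model must reconstruct.
    Keys stay visible; only values are hidden.
    """
    all_keys = sorted({key for event in events for key in event})
    # Key-level: hide every value of a chosen key across the whole record.
    masked_keys = {key for key in all_keys if rng.random() < p_key}
    # Event-level: hide every value inside a chosen event.
    masked_events = {i for i in range(len(events)) if rng.random() < p_event}

    corrupted, targets = [], {}
    for i, event in enumerate(events):
        new_event = {}
        for key, value in event.items():
            # Token-level masking applies only if the coarser levels did not fire.
            hide = i in masked_events or key in masked_keys or rng.random() < p_token
            if hide:
                targets[(i, key)] = value
                # Occasionally use UNK so the model does not rely on MASK alone.
                new_event[key] = UNK if rng.random() < p_unk else MASK
            else:
                new_event[key] = value
        corrupted.append(new_event)
    return corrupted, targets

# Hypothetical three-event record with already-tokenized values.
record = [
    {"amount": "q2", "merchant_category": "grocery"},
    {"amount": "q4", "merchant_category": "travel"},
    {"amount": "q1", "channel": "app"},
]
corrupted, targets = mask_record(record, random.Random(0))
```

The training loss would then be computed only at the positions listed in `targets`.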
Scale and training engineering
The reported pre-training corpus is 24 billion events, 207 billion tokens, and 26 million user records across 111 countries, with a 25-month window (2023–2025) chosen to balance coverage, recency, and distribution shift—details that matter as much as any layer count when you are operating on real institutional timelines.
Two throughput optimizations stand out in the paper:
- Sequence packing — pack multiple shorter user histories into one training sequence to cut padding waste.
- Dynamic batching — vary batch size with sequence length so GPU memory is used predictably on heavy-tailed lengths.
The authors report on the order of 2–5× throughput improvement from these techniques. That is the difference between an experiment and a training job that finishes.
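Both optimizations reduce to simple bookkeeping over sequence lengths. The sketch below is a generic illustration, not the paper's code: first-fit-decreasing is one common packing heuristic chosen here for brevity, the token-budget rule for batching is an assumption, and the function names and example lengths are made up.

```python
def pack_sequences(lengths, max_tokens):
    """First-fit-decreasing packing; assumes every length <= max_tokens."""
    packs = []  # each pack: {"free": remaining tokens, "items": sequence indices}
    for idx in sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True):
        for pack in packs:
            if pack["free"] >= lengths[idx]:
                pack["items"].append(idx)
                pack["free"] -= lengths[idx]
                break
        else:
            # No existing pack had room, so open a new one.
            packs.append({"free": max_tokens - lengths[idx], "items": [idx]})
    return [pack["items"] for pack in packs]

def dynamic_batches(lengths, token_budget):
    """Group length-sorted sequences so batch_size * longest <= token_budget."""
    batches, current, current_max = [], [], 0
    for idx in sorted(range(len(lengths)), key=lambda i: lengths[i]):
        new_max = max(current_max, lengths[idx])
        if current and new_max * (len(current) + 1) > token_budget:
            batches.append(current)
            current, new_max = [], lengths[idx]
        current.append(idx)
        current_max = new_max
    if current:
        batches.append(current)
    return batches

# Hypothetical heavy-tailed user history lengths, in tokens.
lengths = [900, 120, 60, 500, 400, 30, 14]
packs = pack_sequences(lengths, max_tokens=1024)
batches = dynamic_batches(lengths, token_budget=1024)

print(packs)    # [[0, 1], [3, 4, 2, 5, 6]]
print(batches)  # [[6, 5, 2, 1], [4, 3], [0]]
```

Padding each of the seven histories to 1024 tokens would cost 7168 token slots; the two packed rows cost 2048 for the same 2024 real tokens. A real implementation would also need attention masks so that users packed into the same row cannot attend to each other.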
Adaptation after pre-training
Once the backbone exists, PRAGMA follows two complementary adaptation paths:
- Embedding probe — freeze the encoder, train a small head (even linear). Fast and cheap for iteration.
- LoRA fine-tuning — update a small fraction of parameters (the paper cites roughly 2–4%), applied to attention and MLP blocks, for near–full fine-tuning quality with most weights shared across tasks.
| Method | Speed | Performance | Typical use |
|---|---|---|---|
| Probing | Fast | Moderate | Exploration, baseline strength checks |
| LoRA | Medium | High | Production specialization |
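To make the LoRA row concrete, here is a minimal NumPy sketch of a single adapted linear layer. It illustrates the general LoRA idea (a frozen weight plus a trainable low-rank update), not PRAGMA's configuration: the class name `LoRALinear`, the 512-wide layer, `rank=8`, and `alpha=16` are hypothetical values chosen for the example.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scale * (B @ A)."""

    def __init__(self, weight, rank, alpha, rng):
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen, shared across tasks
        self.A = rng.standard_normal((rank, d_in)) * 0.01  # trainable, small random init
        self.B = np.zeros((d_out, rank))                   # trainable, zero init
        self.scale = alpha / rank

    def forward(self, x):
        # Equivalent to x @ (W + scale * B @ A).T without forming the full update.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_fraction(self):
        """Adapter parameters relative to this layer's frozen weight."""
        return (self.A.size + self.B.size) / self.weight.size

rng = np.random.default_rng(0)
frozen = rng.standard_normal((512, 512))
layer = LoRALinear(frozen, rank=8, alpha=16, rng=rng)
x = rng.standard_normal((4, 512))

# With B initialised to zero, the adapted layer starts out identical to the frozen one.
print(np.allclose(layer.forward(x), x @ frozen.T))  # True
print(layer.trainable_fraction())                   # 0.03125
```

The 3.1% here is a per-layer figure for these made-up dimensions and should not be read as a reproduction of the paper's model-wide 2–4%. A linear probe corresponds to the degenerate case where nothing inside the encoder is trained at all.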
What the results suggest (qualitatively)
The paper reports relative improvements only (absolute metrics are withheld as commercially sensitive), but the patterns are still instructive:
- Credit scoring benefits strongly from scale—risk seems to reward depth.
- Simpler predictive tasks (e.g., some LTV or transaction modeling setups) may not need the largest variant.
- Profile state matters a great deal for risk-flavored tasks; events alone are not the whole story.
- Text encoding can help, but adds latency—another production tradeoff, not a free lunch.
That matches a broader lesson: not every downstream task extracts the same value from the same backbone and modality stack.
What I'd conclude from this
What PRAGMA really demonstrates is not just that transformers can ingest banking data, but that most of the value comes from treating financial activity as a structured sequence problem instead of a feature-engineering problem.
For years, the dominant pattern in fintech has been to compress rich user behavior into static snapshots and hope models can recover signal from aggregates. PRAGMA flips that premise. It assumes the structure already lives in the data—the ordering of events, the semantics of fields, the timing between actions—and focuses on preserving that structure so the model can learn from it.
That shift has a few implications that matter beyond the paper’s benchmarks.
Representation is the core system design choice, not the headline architecture. The key–value–time formulation is doing a disproportionate share of the work. Encode the data in a way that respects types, fields, and time, and the Transformer stops fighting the input; it can learn patterns across events, temporal spacing, and user context instead of reconstructing schema from prose.
Pre-training changes how teams ship models. Rather than isolated pipelines per task, you get a shared backbone that encodes behavior once and reuses it broadly. The fact that a frozen embedding plus a linear probe already performs well is a strong signal: a large fraction of what teams have historically built by hand can be absorbed into the representation layer up front.
Gains are not uniform—and that is a production constraint, not a footnote. The largest improvements tend to show up where signal is sparse, delayed, or expensive to hand-engineer—credit-style risk, engagement, and similar settings. For simpler, more local objectives, smaller models are often enough. A single “one size fits all” footprint is rarely optimal; the real engineering question is where scale earns its cost.
Limitations are as instructive as the wins. The paper’s anti–money laundering discussion is a useful example: the weakness there is not necessarily “the encoder isn’t big enough,” but that the task is fundamentally relational. You cannot fully characterize financial behavior by staring at isolated user timelines forever. The plausible next step is to pair sequence backbones with graph- or network-aware context—counterparties, clusters, and flows—not to pretend every problem is a longer per-user sequence.
Key takeaways
- Treating financial data as plain text or as static tables leaves signal on the table. Events, fields, and time are where most of the information lives.
- Tokenisation is not a preprocessing detail in systems like this; it defines what the model can and cannot learn.
- Pre-training on raw event histories can replace large parts of traditional feature engineering—but only if the representation preserves meaning (keys, values, and time, not a lossy narrative).
- Scaling helps selectively. The biggest wins cluster in complex, low-signal tasks, not in every benchmark equally.
- Parameter-efficient adaptation (e.g., LoRA) is what keeps foundation models practical; without cheap specialization, reuse across tasks breaks down economically.
- Sequence models alone do not solve all financial ML. Problems dominated by networks and relationships will need extensions beyond single-user histories.
If I were building in this space, I would not treat PRAGMA as the last word. I would treat it as a blueprint: a strong representation layer for user behavior that you can extend—especially with relational context—as the obvious next increment.
For the full technical treatment (figures, masking recipe, LoRA and profile ablations), here is the paper: PRAGMA: Revolut Foundation Model (arXiv:2604.08649).