The two architectures
In Architecture A (LLM-in-the-loop), the integration pipeline receives a source row, constructs a prompt containing the row and the destination schema, calls a hosted language model, and writes the model’s output to the destination.
[Diagram: Source row (CSV, JSON, API) → LLM call (hosted provider, per-row inference) → Mapped row (destination schema)]
In Architecture B (configuration-time AI with a deterministic runtime), the language model reads the source schema and a sample of source data during a configuration phase. It proposes mappings with confidence scores. A human reviewer inspects the proposal, modifies it where needed, and certifies it. The certified mapping is compiled to deterministic code. Production data flows through the compiled mapping; the language model is not called again unless the source schema changes, in which case the mapping is re-certified.
[Diagram, configuration phase (runs once per source, against samples): Source schema + sample rows → LLM proposes mapping → Human certifies → Compiled mapping artifact]
[Diagram, production runtime (runs per row, against production data): Source row → Deterministic code (no LLM call) → Mapped row]
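The Architecture B runtime can be illustrated with a minimal sketch. It assumes the certified mapping is stored as a small dictionary that names a source column and a transform for each destination field. The artifact layout, the field names, and the `compile_mapping` helper are all hypothetical, chosen only to illustrate the idea; they do not describe any particular product's format.

```python
from typing import Any, Callable, Dict

# Hypothetical certified mapping artifact: destination field -> source column + transform name.
CERTIFIED_MAPPING: Dict[str, Dict[str, str]] = {
    "customer_id": {"source": "CustID", "transform": "strip"},
    "email": {"source": "E-mail", "transform": "lower"},
    "signup_date": {"source": "Created", "transform": "identity"},
}

# Fixed, reviewed set of pure transforms (illustrative names).
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "identity": lambda v: v,
    "strip": lambda v: v.strip(),
    "lower": lambda v: v.strip().lower(),
}


def compile_mapping(mapping: Dict[str, Dict[str, str]]) -> Callable[[Dict[str, str]], Dict[str, Any]]:
    """Resolve the artifact into a plain function; unknown transforms fail here, not at runtime."""
    steps = [
        (dest, spec["source"], TRANSFORMS[spec["transform"]])
        for dest, spec in mapping.items()
    ]

    def apply(row: Dict[str, str]) -> Dict[str, Any]:
        # No model call: each output value is computed from a named source column.
        return {dest: fn(row[src]) for dest, src, fn in steps}

    return apply


map_row = compile_mapping(CERTIFIED_MAPPING)
row = {"CustID": " 1042 ", "E-mail": " Ada@Example.COM", "Created": "2024-03-01"}
mapped = map_row(row)
```

In this sketch the language model's only contribution would be proposing the contents of `CERTIFIED_MAPPING` before certification; the per-row path is ordinary code.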
Data exposure
In Architecture A, production rows are transmitted to the language model provider in order to be mapped. Depending on the pipeline, the data may include personally identifiable information, financial records, health information, or other regulated content. Under the provider’s terms of service and the customer’s contract, transmitted data may be logged for safety review, used for model improvement, or retained for a defined period. Enterprise contracts can restrict these uses; the transmission itself still occurs.
In Architecture B, the model is exposed to the source schema and a sample of rows during the configuration phase. Sample size is typically in the tens to low hundreds of rows. After certification, the production data path does not include the language model.
The implications for several common compliance frameworks differ between the two architectures because the language model provider’s role with respect to production data differs.
| Framework | Relevant question | Architecture A | Architecture B |
|---|---|---|---|
| GDPR Art. 28 | Is the model provider a sub-processor of production personal data? | Yes, for the data sent on each row | The model processes only configuration-phase samples |
| HIPAA | Does PHI reach the model provider? | Yes, on rows containing PHI | No PHI in the production data path |
| PCI DSS | Is the model provider in scope as a processor of cardholder data? | Yes, when cardholder data is in the rows being mapped | No |
| Data residency (EU, UK, and similar) | Does production data leave the customer’s chosen region for inference? | Subject to the provider’s inference region policy | Production data stays where the customer puts it |
| SOC 2 Type 2 | Can the same control be tested twice and produce the same observation? | Output may vary across provider-side model versions | The compiled mapping is testable and version-controlled |
How each of these resolves in practice depends on the customer’s contractual position, the data being processed, and the controls implemented at both the customer and the provider.
Inference properties
Language model inference is stochastic at sampling temperatures greater than zero. Even at temperature zero, batch-level effects, hardware variation, concurrent inference, and provider-side model versioning can produce output variation across calls. Public guidance from major inference providers does not include a commitment to bitwise-reproducible outputs across calls or across model versions.
Compiled mapping code, by contrast, produces the same output for the same input.
Pipelines whose downstream systems rely on output stability across runs (for example deduplication, reconciliation, and change-detection systems that compare consecutive runs) are affected directly: under Architecture A, variation between calls can surface as differences between runs even when the source has not changed, whereas under Architecture B identical input yields identical output.
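One way a change-detection system might compare consecutive runs is to fingerprint each run's output. The sketch below assumes rows are JSON-serializable dictionaries; `run_fingerprint` and the example `map_row` are hypothetical helpers written for illustration.

```python
import hashlib
import json
from decimal import Decimal
from typing import Dict, Iterable


def run_fingerprint(rows: Iterable[Dict[str, object]]) -> str:
    """Hash a run's output rows in order, using a canonical JSON encoding."""
    digest = hashlib.sha256()
    for row in rows:
        encoded = json.dumps(row, sort_keys=True, separators=(",", ":"))
        digest.update(encoded.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def map_row(row: Dict[str, str]) -> Dict[str, object]:
    """Hypothetical deterministic mapping: trim the id, convert the amount to integer cents."""
    return {
        "id": row["ID"].strip(),
        "amount_cents": int(Decimal(row["Amount"]) * 100),
    }


source = [{"ID": " a1 ", "Amount": "19.99"}, {"ID": "b2", "Amount": "5.00"}]
run_1 = run_fingerprint(map_row(r) for r in source)
run_2 = run_fingerprint(map_row(r) for r in source)
unchanged = run_1 == run_2
```

With a deterministic mapping, two runs over unchanged input yield the same fingerprint, so any difference points to a change in the source or in the mapping itself.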
Model behavior: hallucination
Foundation models can produce outputs that are not grounded in the input. The phenomenon is documented across public benchmarks. The Vectara Hallucination Evaluation Model leaderboard [1] tracks summarization hallucination rates across commercial and open-source models on an ongoing basis. The HaluEval benchmark [2] measures hallucination across question answering, knowledge-grounded dialogue, and summarization. The Stanford “Can Foundation Models Wrangle Your Data?” study [3] evaluates foundation models on entity matching, data imputation, and error detection, and reports plausible-but-wrong outputs as a recurring failure mode. The Ji et al. survey [4] catalogs hallucination behavior across natural language generation tasks. Rates vary substantially by task, model version, year, and prompt design; current values are best read from the relevant evaluation source rather than asserted as fixed numbers.
In data integration contexts, hallucination has been observed in published reports and customer case writeups as: a field value that resembles others in the source but does not exist there; a date in the expected format that differs from the source value; a category label inferred from another field rather than read from the column the source actually carries.
In Architecture B, the production runtime is compiled code without a language model call. The output is determined by the source values according to the mapping logic; values not present in the source do not appear in the output.
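The failure modes listed above share one property: the output contains a value that the source row does not. As a rough illustration, the check below flags such fields. It assumes every destination field is meant to be a verbatim copy of some source value, which real mappings with transforms would not satisfy; `ungrounded_fields` is a hypothetical helper.

```python
from typing import Dict, List


def ungrounded_fields(source_row: Dict[str, object], mapped_row: Dict[str, object]) -> List[str]:
    """Return destination fields whose value does not appear anywhere in the source row.

    Assumption: fields are copied verbatim, so comparison is by string equality.
    """
    source_values = {str(value) for value in source_row.values()}
    return sorted(
        field for field, value in mapped_row.items() if str(value) not in source_values
    )


source = {"Name": "Ada", "Created": "2024-03-01"}
faithful = {"name": "Ada", "signup_date": "2024-03-01"}
# A date in the expected format that differs from the source value.
altered = {"name": "Ada", "signup_date": "2024-03-10"}

faithful_report = ungrounded_fields(source, faithful)
altered_report = ungrounded_fields(source, altered)
```

A check of this kind only catches verbatim-copy violations; it is shown to make the failure mode concrete, not as a complete validation strategy.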
Cost scaling
Hosted LLM inference is billed per token. The cost of an Architecture A pipeline scales linearly with row count. Cost per row depends on the model used, the prompt design (how much of the destination schema is sent with each row), the provider’s token pricing, retry behavior, and overhead such as embedding calls or tool use. Current pricing is best read from the model provider directly.
In Architecture B, the language model cost is incurred during configuration and is independent of subsequent production volume.
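The scaling difference can be written as simple arithmetic. The sketch below uses hypothetical parameter names, and every number passed in is a placeholder rather than any provider's actual price; for simplicity it also assumes configuration-phase calls cost the same per call as production calls.

```python
def per_call_cost(
    tokens_in: int,
    tokens_out: int,
    price_in_per_million: float,
    price_out_per_million: float,
) -> float:
    """Cost of one inference call under per-token billing."""
    return (tokens_in * price_in_per_million + tokens_out * price_out_per_million) / 1_000_000


def architecture_a_cost(rows: int, call_cost: float, retry_rate: float = 0.0) -> float:
    """One call per production row, inflated by the fraction of calls that are retried."""
    return rows * call_cost * (1.0 + retry_rate)


def architecture_b_cost(config_calls: int, call_cost: float) -> float:
    """Calls happen only during configuration; production row count does not appear."""
    return config_calls * call_cost


# Placeholder values for illustration only.
call = per_call_cost(tokens_in=1000, tokens_out=200, price_in_per_million=1.0, price_out_per_million=2.0)
cost_a_1m = architecture_a_cost(rows=1_000_000, call_cost=call)
cost_a_2m = architecture_a_cost(rows=2_000_000, call_cost=call)
cost_b = architecture_b_cost(config_calls=50, call_cost=call)
```

Doubling `rows` doubles the Architecture A figure, while the Architecture B figure has no `rows` term at all.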
Audit and review
Audit and compliance reviews of a pipeline typically include a request to explain why a specific source value produced a specific destination value. In Architecture B, the mapping artifact (a JSON document, a SQL expression, or an AST) is a static description of the transformation. The artifact can be read, version-controlled, diffed against earlier versions, and rolled back.
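As a sketch of what diffing an artifact can look like, the snippet below compares two versions of a JSON mapping using Python's standard `difflib`. The artifact structure, the version labels, and the `artifact_diff` helper are hypothetical.

```python
import difflib
import json
from typing import Dict, List


def artifact_diff(old: Dict[str, object], new: Dict[str, object]) -> List[str]:
    """Unified diff of two mapping artifacts, serialized canonically so key order is irrelevant."""
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="mapping@v1", tofile="mapping@v2", lineterm=""
        )
    )


# Two hypothetical versions: the source column was renamed between certifications.
v1 = {"signup_date": {"source": "Created"}}
v2 = {"signup_date": {"source": "CreatedAt"}}

for line in artifact_diff(v1, v2):
    print(line)
```

Because the artifact is plain data, the same comparison works in any version-control system that stores it as text.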
In Architecture A, the explanation for a given output is the LLM call that produced it on a specific date with a specific model version. Providers do not generally commit to multi-year version stability; reproducing a historical call may produce a different output. In that case, reviewers assess pipeline behavior by sampling outputs and inferring patterns.
SOC 2 Type 2 audits test whether controls operate effectively over a period of time, which favors tests that can be repeated with the same result. The two architectures support this through different mechanisms: Architecture B through the artifact’s static properties, Architecture A through additional infrastructure such as full output logging, version pinning, or batch replay.
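For Architecture A, full output logging might take a form like the following sketch. The record fields and helper names are hypothetical, and the timestamp and model version are passed in by the caller rather than read from any provider API.

```python
import hashlib
import json
from typing import Dict


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_audit_record(
    prompt: str, output: Dict[str, object], model_version: str, called_at: str
) -> Dict[str, object]:
    """Capture what is needed to explain one mapped row later, without re-calling the model."""
    return {
        "called_at": called_at,
        "model_version": model_version,
        "prompt_sha256": sha256_hex(prompt),
        "output": output,
        "output_sha256": sha256_hex(json.dumps(output, sort_keys=True)),
    }


def replay_matches(record: Dict[str, object], replayed_output: Dict[str, object]) -> bool:
    """Compare a replayed call's output with the logged one."""
    return record["output_sha256"] == sha256_hex(json.dumps(replayed_output, sort_keys=True))


record = make_audit_record(
    prompt="Map this row to the destination schema: ...",
    output={"customer_id": "1042"},
    model_version="example-model-2024-01",  # placeholder identifier
    called_at="2024-01-01T00:00:00Z",
)
```

A log like this preserves the historical output even if a later replay against a newer model version produces something different.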
Where each architecture has been deployed
Published examples and customer reports describe Architecture A in:
- One-off transformations performed by an analyst, with human review of each output before downstream use
- Exploratory analytics where downstream consumers tolerate output variance — sentiment dashboards, topic clustering, document classification for review queues
- Low-volume, human-in-the-loop workflows where a domain expert reviews each output (research synthesis, clinical decision support, legal contract review)
Architecture B is described in:
- Production data pipelines that run unattended
- Pipelines handling data subject to compliance requirements (financial transactions, healthcare records, identity verification)
- Pipelines whose downstream systems compare runs (reconciliation, deduplication, change tracking)
- High-volume pipelines where per-row inference cost would be material
References
1. Vectara. Hughes Hallucination Evaluation Model (HHEM) leaderboard. Public benchmark tracking hallucination rates across commercial and open-source LLMs on document summarization.
2. Li, J., Cheng, X., Zhao, W. X., Nie, J.-Y., & Wen, J.-R. (2023). HaluEval: A large-scale hallucination evaluation benchmark for large language models. EMNLP 2023.
3. Narayan, A., Chami, I., Orr, L., & Ré, C. (2022). Can foundation models wrangle your data? arXiv:2205.09911.
4. Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., & Fung, P. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1-38.