SDS and Hazmat Document Extraction for Retail Compliance

The compliance filing cabinet problem

A home improvement retailer carries 4,000 chemical products: paints, adhesives, solvents, cleaners, pesticides. Each one has a Safety Data Sheet. Each SDS is a PDF from the manufacturer, formatted according to that manufacturer’s template, with 16 mandatory sections containing the hazard classifications, handling procedures, storage requirements, and regulatory information that determine how the product can be sold, shipped, stored, and displayed.

Somewhere in the organization, someone is responsible for making sure every one of those 4,000 SDSs has been reviewed, the relevant data extracted, and the compliance fields populated in the product information system. When a supplier updates a formulation and sends a revised SDS, that person needs to identify what changed, update the extracted data, and verify that the product still meets the requirements for every channel and jurisdiction where it is sold.

In most retail organizations, that person is doing this manually. They open each PDF, locate the relevant sections, transcribe GHS classification codes, copy hazard statements, note storage temperature ranges, and type it all into a spreadsheet or compliance database. For a single SDS, this takes 15 to 30 minutes. For 4,000 products with annual SDS refreshes, it is a full-time job that produces errors at a rate of 3 to 5 percent. In hazmat compliance, a 3 percent error rate is not a rounding error. It is a liability.

What makes SDS extraction particularly difficult

Safety Data Sheets follow the GHS (Globally Harmonized System) 16-section structure, which sounds like it should make extraction straightforward. It does not, for several reasons.

Layout varies by manufacturer. The 16 sections are mandated, but their visual presentation is not. One manufacturer uses a two-column layout with bordered tables. Another uses a single-column format with bold section headers. A third embeds hazard pictograms as inline images that break text flow. A fourth uses a template from 2015 that predates their current branding, with different fonts, spacing, and page breaks than their newer documents.

Related data appears in different formats within the same document. Section 2 (Hazard Identification) contains GHS classification codes like Flam. Liq. 3 and H-statements like H226 - Flammable liquid and vapour. Section 14 (Transport Information) contains UN numbers, packing groups, and DOT proper shipping names. Section 15 (Regulatory Information) lists TSCA, SARA 313, California Prop 65, and state-specific regulatory statuses. Each section uses different formatting conventions, and the relationship between them matters. A product’s GHS classification and its DOT transport classification are assigned under separate rules, but they describe the same physical hazards, so the two must be consistent.
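As a small illustration of the structured end of this problem, H-statement codes follow a fixed pattern (the letter H plus three digits), so they can be pulled out of Section 2 text with a regular expression once the text has been isolated. The sketch below assumes the Section 2 text is already available as a string; the function name is hypothetical.

```python
import re

# H-codes are "H" plus three digits (H2xx physical, H3xx health, H4xx
# environmental). Combined statements such as "H302+H312" yield two codes.
H_CODE = re.compile(r"\bH[2-4]\d{2}\b")

def extract_h_codes(section2_text: str) -> list[str]:
    """Return the unique H-codes in order of first appearance."""
    seen: dict[str, None] = {}
    for code in H_CODE.findall(section2_text):
        seen.setdefault(code, None)
    return list(seen)

sample = (
    "Flam. Liq. 3\n"
    "H226 - Flammable liquid and vapour\n"
    "H315 - Causes skin irritation\n"
    "H226"
)
print(extract_h_codes(sample))
```

The pattern only finds codes. Deciding which section a code came from is the layout problem described above and is not addressed here.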

Multi-component SDSs add complexity. A product with five chemical components has a composition table in Section 3 listing each component with its CAS number, concentration range, and individual hazard classification. The product-level hazard classification in Section 2 is the aggregate, but compliance databases often need both: the product classification for labeling and the component data for regulatory reporting.
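One way to hold both levels is a record with product-level classifications and a list of components. The sketch below is an assumed data model, not the schema of any particular compliance database; the class and field names are hypothetical. It also validates the CAS check digit, which is computed by weighting the digits from right to left and taking the sum modulo 10, and which catches many transcription errors.

```python
import re
from dataclasses import dataclass, field

CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")

def cas_checksum_ok(cas: str) -> bool:
    """Validate the format and check digit of a CAS Registry Number."""
    match = CAS_PATTERN.match(cas)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Rightmost digit has weight 1, the next weight 2, and so on.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))

@dataclass(frozen=True)
class Component:
    name: str
    cas_number: str
    min_pct: float
    max_pct: float
    hazard_classes: tuple[str, ...] = ()

@dataclass
class SdsRecord:
    product_name: str
    product_hazard_classes: list[str]          # Section 2, product level
    components: list[Component] = field(default_factory=list)  # Section 3

    def invalid_cas_numbers(self) -> list[str]:
        return [c.cas_number for c in self.components
                if not cas_checksum_ok(c.cas_number)]

record = SdsRecord(
    product_name="Example Thinner",            # made-up product
    product_hazard_classes=["Flam. Liq. 2"],
    components=[
        Component("Acetone", "67-64-1", 30.0, 60.0, ("Flam. Liq. 2",)),
        Component("Ethanol", "64-17-5", 10.0, 30.0, ("Flam. Liq. 2",)),
        Component("Mistyped entry", "67-64-2", 1.0, 5.0),
    ],
)
print(record.invalid_cas_numbers())
```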

Languages and regional variants multiply the problem. A product sold in the US and Canada needs an English SDS and a French-Canadian SDS. A product sold in Mexico needs a Spanish SDS. The GHS classification codes are the same, but the hazard statement text, precautionary statement text, and regulatory section content differ by language and jurisdiction. Some manufacturers send a single multi-language SDS. Others send separate documents per language with different filenames and no consistent naming convention.

The downstream systems that depend on extracted data

SDS data does not stay in the compliance database. It feeds multiple downstream systems, each requiring specific fields in specific formats.

Product information management (PIM) systems need GHS pictogram codes, signal words, and hazard statements for product labels and online listings. Retailers selling into California need Prop 65 warning text. Where a state maintains its own hazardous product registry, the listing needs the registration status of each product in that state.

Warehouse management systems need storage compatibility codes to prevent incompatible chemicals from being stored adjacent. A flammable liquid cannot be stored next to an oxidizer. The storage requirements come from SDS Sections 7 and 10, but they arrive as prose text (“Store in a cool, well-ventilated area away from oxidizing agents”), not as the structured compatibility codes the WMS requires.
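To make the prose-to-code gap concrete, here is a deliberately naive keyword sketch. The compatibility codes are invented for illustration, since each WMS defines its own code list, and keyword matching is a stand-in for whatever interpretation step is actually used. Its output should be treated as a suggestion for a compliance specialist, not a final assignment.

```python
import re

# Hypothetical WMS codes; a real WMS defines its own code list.
KEYWORD_RULES = [
    (re.compile(r"\boxidi[sz](ing agents?|ers?)\b", re.I), "SEG_OXIDIZER"),
    (re.compile(r"\bacids?\b", re.I), "SEG_ACID"),
    (re.compile(r"\b(heat|sparks?|open flames?|ignition sources?)\b", re.I),
     "NO_IGNITION"),
    (re.compile(r"\bwell[- ]ventilated\b", re.I), "VENTILATED"),
]

def suggest_storage_codes(prose: str) -> list[str]:
    """Return suggested codes, in rule order, for a storage instruction."""
    return [code for pattern, code in KEYWORD_RULES if pattern.search(prose)]

text = "Store in a cool, well-ventilated area away from oxidizing agents"
print(suggest_storage_codes(text))
```

A keyword approach misses negations and unusual phrasing, which is one reason prose-derived codes deserve lower confidence than values copied from a table.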

Transportation management systems need DOT hazmat classifications, UN numbers, packing groups, and proper shipping names to generate compliant shipping documents. An incorrect UN number on a bill of lading is a DOT violation. The data comes from SDS Section 14, and it must match exactly. Not approximately, not “close enough.” Exactly.
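Because Section 14 values must match exactly, a format check before the data reaches the TMS is cheap insurance. The sketch below assumes a canonical form of "UN" followed by four digits with no space and Roman-numeral packing groups; the function name is hypothetical. It checks format only, not whether the number is the right one for the product.

```python
import re
from typing import Optional

UN_NUMBER = re.compile(r"^UN\d{4}$")
PACKING_GROUPS = {"I", "II", "III"}

def validate_transport(un_number: str,
                       packing_group: Optional[str]) -> list[str]:
    """Return a list of format problems; an empty list means none found."""
    problems = []
    if not UN_NUMBER.match(un_number):
        problems.append(f"UN number not in canonical form: {un_number!r}")
    # Some entries have no packing group, so None is accepted.
    if packing_group is not None and packing_group not in PACKING_GROUPS:
        problems.append(f"Unknown packing group: {packing_group!r}")
    return problems

print(validate_transport("UN1263", "III"))   # well-formed
print(validate_transport("UN 1263", "3"))    # two format problems
```

Rejecting "UN 1263" rather than silently normalizing it is a design choice: in a field that must match exactly, a reviewer should see that the source document and the canonical form differ.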

E-commerce platforms need hazmat shipping flags to determine which carriers can handle the product, whether it can ship by air, and whether special packaging is required. A product incorrectly flagged as non-hazmat that ships via air freight creates regulatory exposure for the retailer and the carrier.

Environmental, health, and safety (EHS) systems need the full composition data from Section 3, exposure limits from Section 8, and toxicological data from Section 11 for workplace safety compliance.

Each of these systems needs different fields from the same source document. Manual extraction means the same SDS is read and transcribed multiple times by different teams for different systems, with no guarantee of consistency between the extractions.
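The alternative is to extract once and derive each system's view from the same record. A minimal sketch, assuming a flat record and made-up field lists per system (real PIM, TMS, and WMS schemas will differ):

```python
# Hypothetical field lists; each real system defines its own schema.
SYSTEM_FIELDS = {
    "pim": ("signal_word", "pictograms", "hazard_statements"),
    "tms": ("un_number", "packing_group", "proper_shipping_name"),
    "wms": ("storage_codes",),
}

def project(record: dict, system: str) -> dict:
    """Return only the fields one downstream system needs."""
    wanted = SYSTEM_FIELDS[system]
    missing = [name for name in wanted if name not in record]
    if missing:
        # Fail loudly instead of sending a partial record downstream.
        raise KeyError(f"{system} is missing fields: {missing}")
    return {name: record[name] for name in wanted}

record = {
    "signal_word": "Warning",
    "pictograms": ["GHS02"],
    "hazard_statements": ["H226"],
    "un_number": "UN1263",
    "packing_group": "III",
    "proper_shipping_name": "Paint",
    "storage_codes": ["NO_IGNITION"],
}
print(project(record, "tms"))
```

Since every view is cut from one record, two systems cannot disagree about a value unless the record itself is wrong.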

AI vision extraction from SDS documents

datathere processes SDS PDFs using AI vision extraction, reading the document as a rendered page and identifying the structural elements that matter for compliance data.

When an SDS is uploaded, the AI identifies the 16-section structure regardless of the manufacturer’s layout. It locates Section 2 and extracts GHS classification codes, hazard category numbers, H-statements, P-statements, signal words, and pictogram identifiers. It reads the composition table in Section 3 and extracts CAS numbers, component names, and concentration ranges. It pulls UN numbers, packing groups, and proper shipping names from Section 14. It identifies regulatory listings in Section 15.

The extraction is not a flat text dump. The AI maps each extracted value to the destination compliance schema with confidence scores. A GHS classification of Flam. Liq. 3 maps to the ghs_flammability_category field with high confidence. A storage instruction extracted from prose text (“Keep away from heat, hot surfaces, sparks, open flames and other ignition sources”) maps to storage compatibility codes with lower confidence, flagging it for a compliance specialist to verify the code assignment.

This matters because SDS documents mix structured data (GHS codes, CAS numbers, UN numbers) with unstructured prose (handling instructions, first aid measures, ecological information). The structured data extracts cleanly. The prose requires interpretation, converting a text description of storage requirements into the specific compatibility codes the WMS uses. The confidence scores tell the compliance team exactly where to focus their review.
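The review routing this implies can be sketched in a few lines. The threshold value, the class, and the storage_compatibility_code field name below are assumptions for illustration and do not describe datathere's internal API; only ghs_flammability_category comes from the example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractedField:
    target_field: str
    value: str
    confidence: float   # 0.0 to 1.0

REVIEW_THRESHOLD = 0.90  # assumed value; tune per field criticality

def route(fields, threshold=REVIEW_THRESHOLD):
    """Split extractions into auto-accepted and needs-review lists."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review

fields = [
    ExtractedField("ghs_flammability_category", "3", 0.98),
    ExtractedField("storage_compatibility_code", "NO_IGNITION", 0.62),
]
accepted, needs_review = route(fields)
print([f.target_field for f in needs_review])
```

In practice a single global threshold is probably too coarse. Compliance-critical fields such as UN numbers may warrant review at any confidence.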

Handling SDS revisions and version tracking

Chemical manufacturers update SDSs when formulations change, when regulatory classifications are revised, or when new regulatory requirements take effect. A retailer with 4,000 chemical products receives hundreds of SDS updates per year.

The challenge is not just processing the new document. It is identifying what changed and determining whether the change affects compliance status, labeling, storage, or shipping.

datathere runs a revised SDS through the same extraction and mapping pipeline as the original. The system then produces a structured comparison: which fields changed, what the old and new values are, and whether the change crosses a compliance threshold. A concentration change from 12% to 11% might be immaterial. A GHS classification change from Category 4 to Category 3 triggers a cascade of updates: new labels, new storage requirements, potentially new shipping restrictions.
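A structured comparison of this kind reduces to a field-by-field diff of two extracted records. The sketch below assumes each revision has been extracted to a flat dictionary; component_pct_max and the function name are hypothetical.

```python
def diff_fields(old: dict, new: dict) -> dict:
    """Map each changed field to its (old, new) pair.

    A field present on only one side is reported with None on the other.
    """
    keys = sorted(old.keys() | new.keys())
    return {k: (old.get(k), new.get(k))
            for k in keys if old.get(k) != new.get(k)}

old = {"ghs_flammability_category": 4, "component_pct_max": 12.0,
       "signal_word": "Warning"}
new = {"ghs_flammability_category": 3, "component_pct_max": 11.0,
       "signal_word": "Warning"}

for name, (before, after) in diff_fields(old, new).items():
    print(f"{name}: {before} -> {after}")
```

The diff only says what changed. Whether a change matters is a separate decision that belongs to the compliance rules.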

Quality enforcement rules catch changes that require action. A GHS classification that moves to a more severe category triggers a halt, requiring a compliance specialist to review before the updated data flows to downstream systems. A change to Section 3 composition data that crosses a SARA 313 reporting threshold gets quarantined for environmental compliance review. A change to Section 14 transport data that alters the UN number or packing group gets flagged for immediate logistics team notification.
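Rules like these can be written as a small function from a single field change to an action. The sketch below is an illustration under assumptions: the field names are hypothetical, and the 1.0 percent threshold is a placeholder, since the applicable reporting threshold depends on the chemical and must come from the regulation itself.

```python
from enum import Enum

class Action(Enum):
    PASS = "pass"
    HALT = "halt"
    QUARANTINE = "quarantine"
    NOTIFY_LOGISTICS = "notify_logistics"

def action_for_change(field: str, old, new,
                      threshold_pct: float = 1.0) -> Action:
    """Decide what a single changed field should trigger."""
    if old is None or new is None:
        return Action.HALT  # a field appeared or vanished: always review
    # In GHS, a lower category number means a more severe hazard.
    if field == "ghs_flammability_category" and new < old:
        return Action.HALT
    # Quarantine when the concentration crosses the threshold either way.
    if field == "component_pct_max" and \
            (old >= threshold_pct) != (new >= threshold_pct):
        return Action.QUARANTINE
    if field in {"un_number", "packing_group"}:
        return Action.NOTIFY_LOGISTICS
    return Action.PASS

print(action_for_change("ghs_flammability_category", 4, 3))
print(action_for_change("component_pct_max", 12.0, 11.0))
```

Treating an appearing or disappearing field as a halt is an added conservative assumption, not something the rules above specify.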

Scaling across the supplier base

The first SDS from a new manufacturer requires the most review. The AI generates mappings between the manufacturer’s layout and the destination compliance schema, and a compliance specialist verifies the extractions, particularly for prose-to-code conversions like storage compatibility.

That verified mapping becomes a template. The second SDS from the same manufacturer, using the same layout, processes faster because the template already handles the manufacturer’s formatting conventions. By the fiftieth SDS from that manufacturer, the extraction is nearly automatic, with review focused only on the fields where the AI’s confidence drops below the threshold.

Over time, the template library covers the formatting conventions of the manufacturer base. A retailer working with 200 chemical suppliers might encounter 30 to 40 distinct SDS layouts. Once those layouts are mapped, new products from existing suppliers process with minimal manual effort. New suppliers require an initial template setup, but the investment is amortized across every product they supply.

From manual compliance to systematic enforcement

The shift from manual SDS extraction to AI-driven processing changes the compliance posture from reactive to systematic. Instead of a compliance specialist finding problems when they manually review a document, the system catches problems when the data enters the pipeline.

A product with a GHS classification that does not match its DOT shipping category gets flagged at extraction time, not when a shipment is stopped at a carrier facility. A storage compatibility conflict between a newly received product and its assigned warehouse location surfaces during data processing, not during a safety audit. A Prop 65 listing that changed in a revised SDS triggers a label update workflow immediately, not when a customer complaint arrives.
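The first of these checks can be sketched as a lookup from GHS class to the transport class one would normally expect. The table below is a tiny, assumed excerpt covering only flammable liquids and is not a complete or authoritative mapping. Transport rules include exceptions, so a mismatch is a flag for review rather than proof of an error.

```python
from typing import Optional

# Assumed, incomplete mapping for illustration only.
EXPECTED_DOT_CLASS = {
    "Flam. Liq. 1": "3",
    "Flam. Liq. 2": "3",
    "Flam. Liq. 3": "3",
}

def consistency_flags(ghs_classes: list[str],
                      dot_class: Optional[str]) -> list[str]:
    """Flag GHS classes whose usual DOT class differs from Section 14.

    dot_class is None when Section 14 says the product is not regulated.
    """
    flags = []
    for ghs in ghs_classes:
        expected = EXPECTED_DOT_CLASS.get(ghs)
        if expected is not None and expected != dot_class:
            flags.append(
                f"{ghs} usually implies DOT class {expected}, "
                f"but Section 14 gives {dot_class or 'not regulated'}"
            )
    return flags

print(consistency_flags(["Flam. Liq. 3"], "3"))    # consistent
print(consistency_flags(["Flam. Liq. 3"], None))   # flagged for review
```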

The compliance team’s role shifts from data transcription to exception management. They review the extractions the system is uncertain about, verify the compliance-critical mappings, and handle the cases that require judgment: the product whose GHS classification falls at the boundary between two categories, the SDS whose Section 14 data contradicts its Section 2 classification, the multi-component product whose aggregate hazard rating requires calculation rather than direct extraction.

The 4,000 SDSs are still there. The 16 sections per document are still there. The downstream systems still need their specific fields in their specific formats. What changes is that the mechanical work of reading, locating, transcribing, and entering data is handled by a system that does it consistently, at scale, with quality enforcement at every step. The compliance team’s expertise goes to the decisions that require expertise, not to the data entry that consumed it.