10 Essential Questions Data Engineers Should Ask In The Era Of AI Pipelines


By Devika Gayapu, HEXstream data solutions engineer

Over years of building data pipelines across financial services, telecom and SaaS platforms, I’ve found that the costliest mistakes don’t occur in the transformation layer; they happen long before a single line of SQL is written.

The pattern is always the same: the pipeline looks solid, tests pass, then production data arrives and something breaks. A customer_key that should be numeric contains alphanumeric codes. A timestamp field is NULL 12% of the time. A table assumed to hold unique records has thousands of duplicates inflating downstream metrics. In almost every post-mortem I've been part of, the root cause is the same—the right questions weren't asked about the source data early enough.

1. What is the business meaning of this data?

Before writing any SQL, understand what the data represents. A column named customer_key might represent billing accounts, not individual customers, and a single customer could have dozens of them. Talk to someone who owns the source system before touching the transformation layer.

2. What is the expected data type—and what is it actually?

Source and target systems rarely agree on types, and the gaps are almost never documented. A Boolean might arrive as 'Y'/'N', '1'/'0', or 'True'/'False' depending on the app version. Always profile the actual data—not just the schema docs. This one query catches most surprises early:

SELECT
    COUNT(*) AS total_rows,
    SUM(CASE WHEN TRY_CAST(id AS INT) IS NULL THEN 1 ELSE 0 END) AS non_numeric
FROM source_table;

3. Are there formatting issues or special characters?

Real-world data is messy—phone numbers with mixed delimiters, IDs with leading zeros, currency symbols inside numeric fields. Never assume a column is clean because it has a numeric-sounding name. Scan for non-conforming values during discovery, not after the first load failure.
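As a discovery-time sketch (in Python, with illustrative IDs and a digits-only pattern standing in for whatever format your source should follow), a non-conformance scan can be a few lines:

```python
import re

def find_nonconforming(values, pattern=r"^\d+$"):
    """Return the values that fail the expected format.

    Leading-zero IDs like '00042' pass the string check here, but would
    silently lose their zeros if naively cast to INT downstream.
    """
    rx = re.compile(pattern)
    return [v for v in values if not rx.fullmatch(v.strip())]

ids = ["00042", "117", "A-9", "$1,200", " 88 "]
bad = find_nonconforming(ids)
# bad == ["A-9", "$1,200"] — an alphanumeric code and an embedded currency symbol
```

Run this against a sample of every "numeric-sounding" column before you write a single cast.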

4. Can this field contain NULL values—and what do they mean?

NULL can mean 'unknown', 'not applicable', 'not yet populated', or 'legacy migration default', and each requires different handling. A NULL in a join key silently drops rows. Confirm nullability expectations with the source team and validate against actual null rates before go-live.
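Validating actual null rates can be as simple as the sketch below (Python, with a hypothetical ts timestamp field); compare the measured rate against whatever threshold the source team agreed to:

```python
def null_rate(rows, field):
    """Fraction of rows where `field` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(field) is None)
    return missing / len(rows)

rows = [{"ts": "2024-03-01"}, {"ts": None}, {"ts": None}, {"ts": "2024-03-02"}]
rate = null_rate(rows, "ts")
# rate == 0.5 — far above a 'rarely NULL' expectation; raise it with the source team
```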

5. Is the data truly unique where it needs to be?

Most pipeline designs assume at least one field is a unique identifier. That assumption fails constantly—from CDC systems emitting duplicate events, to batch extracts overlapping time windows, to acquisitions merging systems with colliding IDs. Verify uniqueness explicitly before designing any join or upsert logic.
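The uniqueness check itself is one GROUP BY. Here it is end to end, using an in-memory SQLite table as a stand-in for the source (table and column names are illustrative):

```python
import sqlite3

# In-memory stand-in for a source table with a supposedly unique customer_key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (customer_key TEXT)")
conn.executemany("INSERT INTO source_table VALUES (?)",
                 [("C1",), ("C2",), ("C2",), ("C3",)])

# Any row returned here disproves the uniqueness assumption.
dupes = conn.execute("""
    SELECT customer_key, COUNT(*) AS n
    FROM source_table
    GROUP BY customer_key
    HAVING COUNT(*) > 1
""").fetchall()
# dupes == [('C2', 2)]
```

Design the join or upsert only after this query comes back empty, or after you've decided how to deduplicate.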

6. Is it a full load or incremental—and how reliable is that?

This shapes the entire pipeline architecture. Incremental loads are efficient but fragile if change tracking is unreliable. Many systems update records via direct SQL, bypassing application-layer timestamps entirely, meaning updated_at is never touched and you're silently missing changes. Trust, but periodically reconcile.
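A periodic reconciliation can be sketched as a key-level diff between a fresh source extract and the warehouse copy (Python, with illustrative records; a real version would compare hashes over a sampled window rather than full rows):

```python
def reconcile(source_rows, target_rows, key="id"):
    """Compare a source extract against the warehouse copy by key; return drift."""
    src = {r[key]: r for r in source_rows}
    tgt = {r[key]: r for r in target_rows}
    missing = sorted(src.keys() - tgt.keys())           # rows that never arrived
    stale = sorted(k for k in src.keys() & tgt.keys()   # rows changed upstream
                   if src[k] != tgt[k])                 # after we loaded them
    return missing, stale

source = [{"id": 1, "status": "paid"}, {"id": 2, "status": "open"}]
target = [{"id": 1, "status": "open"}]   # a direct-SQL update never propagated
missing, stale = reconcile(source, target)
# missing == [2], stale == [1] — both invisible to timestamp-based incremental loads
```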

7. Do we need historical tracking?

If a business user will ever ask, "What was this customer's status on March 1?" you need to plan for it now. SCD Type 1 overwrites (simple, no history), Type 2 adds effective date rows (full history, more complex to query), Type 4 archives to a separate table. Retrofitting history later means reprocessing everything—often impossible if the source doesn't retain it.
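To make the Type 2 mechanics concrete, here is a minimal sketch (Python, with hypothetical keys and attributes): close the current row by stamping its end date, then append a new open-ended row.

```python
from datetime import date

def scd2_apply(history, key, new_attrs, as_of):
    """Minimal SCD Type 2: close the open row for `key`, append a new current row."""
    for row in history:
        if row["key"] == key and row["end_date"] is None:
            row["end_date"] = as_of          # close out the old version
    history.append({"key": key, **new_attrs,
                    "start_date": as_of, "end_date": None})
    return history

hist = [{"key": "C1", "status": "trial",
         "start_date": date(2024, 1, 1), "end_date": None}]
scd2_apply(hist, "C1", {"status": "active"}, date(2024, 3, 1))
# hist now holds a 'trial' row ending 2024-03-01 and an open 'active' row,
# so "What was C1's status on a given date?" becomes a date-range lookup.
```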

8. What are the valid value ranges and business rules?

Every domain has implicit rules that should be made explicit: durations can't be negative, start dates can't follow end dates, status codes must belong to a known set. Document these as assertions in dbt or your validation layer. Failed checks should quarantine records—not silently load bad data into the warehouse.
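Whatever tool you use, the shape is the same: a named rule, a predicate, and a list of violations per record. A minimal Python sketch (the rule names and fields are illustrative):

```python
RULES = {
    "non_negative_duration": lambda r: r["duration"] >= 0,
    "start_before_end": lambda r: r["start"] <= r["end"],
    "known_status": lambda r: r["status"] in {"open", "paid", "void"},
}

def check_rules(record):
    """Return the names of every business rule this record violates."""
    return [name for name, rule in RULES.items() if not rule(record)]

rec = {"duration": -5, "start": "2024-01-02", "end": "2024-01-01", "status": "paid"}
violations = check_rules(rec)
# violations == ['non_negative_duration', 'start_before_end'] — quarantine, don't load
```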

9. Are there dependencies between fields?

Logical constraints often exist even when database-level constraints don't. A status of 'paid' should always have a non-null payment_date. A child record should always reference an existing parent. Map these dependencies during discovery and add validation to surface violations before they cause join failures or silently vanish from analytics.
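Both examples from the paragraph above can be expressed as a single per-record check (Python sketch, illustrative field names; in a real pipeline the parent-key set would come from the parent table, not a literal):

```python
def check_dependencies(record, parent_ids):
    """Cross-field and referential checks the database often doesn't enforce."""
    problems = []
    if record["status"] == "paid" and record.get("payment_date") is None:
        problems.append("paid without payment_date")
    if record["parent_id"] not in parent_ids:
        problems.append("orphaned child record")
    return problems

known_parents = {"P1", "P2"}
rec = {"status": "paid", "payment_date": None, "parent_id": "P9"}
problems = check_dependencies(rec, known_parents)
# problems == ['paid without payment_date', 'orphaned child record']
```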

10. What should happen when bad data arrives?

Every pipeline will encounter unexpected data. Decide upfront: reject and alert (pipeline halts, zero silent corruption), quarantine and continue (bad records routed for review, valid records load), or auto-correct and load (safe only for well-understood, low-risk issues). I default to quarantine-and-continue. The key is that this decision is made during design, not improvised at 2 am.
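The quarantine-and-continue pattern is a small routing step at load time. A Python sketch (the alert function is a stand-in for your real Slack/PagerDuty/email hook):

```python
def alert(msg):
    """Stand-in for a real alerting hook (Slack, PagerDuty, email, ...)."""
    print(f"ALERT: {msg}")

def load_with_quarantine(records, is_valid):
    """Quarantine-and-continue: valid rows load, bad rows are routed for review."""
    loaded, quarantined = [], []
    for rec in records:
        (loaded if is_valid(rec) else quarantined).append(rec)
    if quarantined:
        alert(f"{len(quarantined)} record(s) quarantined for review")
    return loaded, quarantined

rows = [{"id": 1}, {"id": None}, {"id": 3}]
loaded, quarantined = load_with_quarantine(rows, lambda r: r["id"] is not None)
# loaded == [{'id': 1}, {'id': 3}], quarantined == [{'id': None}]
```

The valid records make it to the warehouse, the bad record is preserved for review, and a human gets pinged, all without a 2 am improvisation.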

How AI is changing these questions

AI is reshaping how we answer these questions, and it is adding an entirely new one.

AI as a Discovery Accelerator

Tasks that used to take hours—profiling a table, documenting column semantics, flagging anomalies—can now be accelerated significantly. Tools like Databricks Assistant, dbt Cloud's AI features, and LLMs used directly can generate profiling queries from a schema, suggest quality checks from sample data, and draft transformation logic from a plain-English description.

I paste schemas and sample rows into a prompt and ask what data-quality issues to watch for. The output isn't always right, but it surfaces blind spots fast—especially in unfamiliar domains. AI suggestions are hypotheses, not answers. The 10 questions haven't changed; the speed at which you can generate and explore them has.

Bonus Question 11: Is this data ready for AI?

If your organization is building ML models, LLM features, or RAG systems on top of your pipelines, ask one more question about every source: Is this data suitable as training data, context, or model input?

These are concerns that traditional data pipelines typically didn’t have to address. Does the data fairly represent the population it'll be used to make decisions about—or does it encode historical bias? Does training data contain information unavailable at prediction time (temporal leakage)? For RAG pipelines, is the data fresh enough that an LLM won't confidently answer with outdated context? And critically—does the data contain PII or regulated information that can't be sent to a third-party AI API without compliance review?
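On the last point, a pre-flight scan before any text leaves your boundary is cheap insurance. The sketch below uses two illustrative regex patterns; real PII detection needs a purpose-built tool and a compliance review, not a pair of regexes:

```python
import re

# Illustrative patterns only — not a substitute for a real PII scanner.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pii_hits(text):
    """Flag text that should not be sent to an external AI API without review."""
    return [name for name, rx in PII_PATTERNS.items() if rx.search(text)]

doc = "Contact jane.doe@example.com, SSN 123-45-6789"
hits = pii_hits(doc)
# hits == ['email', 'ssn_like'] — block this payload before the API call
```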

This is where data engineers now need to loop in legal, security, and compliance—not just source system owners. The technical questions are familiar. The stakeholder map has changed.

What AI cannot replace

AI tools are genuinely useful for going faster. They don't replace the thinking. An LLM doesn't know your company's definition of 'active customer' changed after an acquisition. It can't navigate data ownership politics or make judgment calls about acceptable quality tradeoffs. And when a pipeline built on AI-generated logic fails, a human engineer still owns it. Use AI to accelerate discovery—not to skip it.

Final thoughts

Building reliable pipelines isn't primarily a SQL problem. It's a discovery and communication problem. The engineers I've seen excel treat source data like a foreign system they need to understand before building on top of it.

These questions are a mindset, not a one-time checklist. Ask them for every new source—and with AI now in the picture, add one more: Is this data safe, fresh, and suitable for the AI systems that will run on top of it?

Good transformations start with good questions. And good questions start before the first SELECT statement.
