October 15, 2025
15 min read

From BCP Commands to Azure Data Factory: A Technical Migration Journey

Introduction

File generation and export processes live in a peculiar gray area of software architecture. Too often, these critical business processes get tacked onto databases through SQL Server Agent jobs, stored procedures, and command-line utilities like BCP (Bulk Copy Program). While this approach works, it creates maintenance headaches and architectural questions that many organizations eventually face: Where should file generation actually live?

This article chronicles a migration from BCP command-line exports to Azure Data Factory (ADF), exploring both the technical implementation details and the larger architectural questions about separation of concerns in data systems.

What is BCP and Why Migrate Away?

For those unfamiliar, BCP (Bulk Copy Program) is a command-line utility that ships with SQL Server, designed to bulk-copy data between SQL Server instances and data files. It's been around since the early days of SQL Server and is remarkably efficient at what it does—moving large volumes of data quickly.

A typical BCP command looks like this:

bcp "SELECT * FROM dbo.ValidationData WHERE BatchID = 12345" ^
    queryout "C:\Files\ValidationFile_11292025.csv" ^
    -c -t, -S ServerName -d DatabaseName -T

BCP jobs were often scheduled through SQL Server Agent, wrapped in stored procedures, or called from application code. They worked, but they came with baggage:

  • No built-in error handling beyond basic exit codes
  • Limited retry logic when network shares were unavailable
  • Hard-coded file paths scattered across dozens of jobs
  • Difficult to track lineage and audit who generated what
  • No source control unless you manually tracked job definitions
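As a concrete example of that missing scaffolding, here's a minimal Python sketch of the retry wrapper every BCP job effectively needed someone to hand-roll (the server, database, and file path are placeholders, not real values):

```python
import subprocess
import time

def run_with_retry(run, max_attempts=3, delay_seconds=5):
    """Retry a command runner until it returns exit code 0.

    This is the scaffolding BCP never provided out of the box --
    it had to be hand-rolled around every export.
    """
    for attempt in range(1, max_attempts + 1):
        exit_code = run()
        if exit_code == 0:
            return attempt  # succeeded on this attempt
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"command failed after {max_attempts} attempts")

def run_bcp_export():
    """Hypothetical BCP invocation; server, database, and paths are examples."""
    return subprocess.call([
        "bcp", "SELECT * FROM dbo.ValidationData WHERE BatchID = 12345",
        "queryout", r"C:\Files\ValidationFile_11292025.csv",
        "-c", "-t,", "-S", "ServerName", "-d", "DatabaseName", "-T",
    ])
```

ADF's activity-level retry policy replaces exactly this kind of boilerplate.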

Our organization had accumulated these jobs over years. When a cloud migration was required, we faced a choice: replicate the BCP approach in the cloud, or fundamentally rethink how file generation worked.

The SSIS Detour (And Why It Failed)

Before BCP, many organizations tried SQL Server Integration Services (SSIS). SSIS was Microsoft's answer to ETL and data integration, offering visual designers and robust data flow capabilities. In theory, it should have been perfect for file generation workflows.

In practice, SSIS became a maintenance nightmare:

  • Visual Studio dependency: Every developer needed Visual Studio with specific SSIS extensions installed
  • Version compatibility issues: Opening a project created in VS 2017 in VS 2019 often broke things
  • Deployment complexity: SSIS projects had to be deployed to Integration Services Catalog, creating another layer of infrastructure
  • Debugging difficulties: When packages failed, troubleshooting often involved arcane XML manipulation
  • No real source control: While you could check .dtsx files into Git, they're sprawling machine-generated XML files that don't diff well

The breaking point came when certain SSIS projects simply refused to open and the failures proved impossible to troubleshoot. Files would throw cryptic errors, and the only solution was often "rebuild it from scratch." This drove the initial migration from SSIS to BCP—at least BCP was simple and predictable.

But BCP wasn't a long-term solution either.

Enter Azure Data Factory

When our organization initiated a broader cloud migration, Azure Data Factory emerged as the logical choice for replacing our BCP jobs. ADF offered several immediate advantages:

  1. Native cloud integration with Azure services
  2. Visual pipeline designer without the Visual Studio baggage
  3. Built-in source control through Git integration
  4. Parameterization allowing reusable pipeline templates
  5. Monitoring and alerting baked into the platform
  6. Orchestration capabilities for complex workflows

The migration involved 10 jobs representing 6 unique file generation processes. The goal wasn't just to replicate BCP functionality—it was to build a maintainable, scalable system.

Technical Implementation: The Components

Migrating from BCP to ADF revealed just how many moving parts were involved in what seemed like simple file exports. Here's what the complete architecture looked like:

1. Core Pipeline Structure

Each ADF pipeline followed a similar pattern:

Lookup Activity → Stored Procedure → Copy Data → Integration Components

The Lookup Activity retrieved batch IDs or processing parameters from the database—essentially replacing the dynamic SQL that BCP commands often used.

The Stored Procedure Activity executed the business logic. This was a conscious decision to keep transformation logic in the database where it belonged, rather than recreating complex business rules in ADF's data flows.

The Copy Data Activity handled the actual file generation, writing to Azure Blob Storage with dynamically generated filenames like ValidationFile_EDI_11292025.csv.
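The date-stamped naming above can be sketched as a small function; in the actual pipeline this is an ADF expression (roughly concat plus formatDateTime) rather than Python, so treat this as an illustration of the MMddyyyy convention only:

```python
from datetime import date

def build_export_filename(prefix: str, run_date: date, suffix: str = "") -> str:
    """Build an MMDDYYYY-stamped export filename like the pipelines produce."""
    stamp = run_date.strftime("%m%d%Y")
    middle = f"_{suffix}" if suffix else ""
    return f"{prefix}{middle}_{stamp}.csv"
```

For example, build_export_filename("ValidationFile", date(2025, 11, 29), "EDI") yields ValidationFile_EDI_11292025.csv.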

2. Azure Blob Storage as Staging

Unlike BCP, which wrote directly to network shares, the ADF approach used Azure Blob Storage as an intermediary:

Database → ADF Pipeline → Blob Storage → Final Destination

This added a layer but provided significant benefits:

  • Resilience: Blob storage is highly available; network shares are not
  • Auditing: Azure automatically tracks blob creation/modification
  • Cost-effective: Blob storage is cheap for staging data
  • Integration ready: Other Azure services can easily consume blobs

Files landed in containers with organized folder structures:

/datafactory/ValidationFiles/Customer01/ValidationFile_11292025.csv

3. Self-Hosted Integration Runtime for Network Shares

Some files needed to reach on-premises file shares for SFTP pickup or legacy system consumption. This required deploying a Self-Hosted Integration Runtime (SHIR)—essentially an agent installed on a server with network access to the file shares.

The SHIR enabled ADF to:

  • Copy files from Azure Blob to Windows network shares
  • Maintain security boundaries between cloud and on-premises
  • Handle authentication against Active Directory resources

The pipeline pattern became:

Copy Data (to Blob) → Copy Data (Blob to File Share via SHIR)

4. Logic Apps for Email Distribution

BCP jobs often included ad-hoc scripts to send notification emails. ADF doesn't have native email capabilities, but it integrates seamlessly with Azure Logic Apps.

The implementation used a Web Activity in ADF to trigger a Logic App via HTTP:

ADF Pipeline:

{
  "type": "WebActivity",
  "typeProperties": {
    "url": "https://prod-xx.region.logic.azure.com/workflows/.../triggers/manual/paths/invoke",
    "method": "POST",
    "body": {
      "fileName": "@{variables('pipelineFileName')}",
      "filePathAndName": "@{concat('datafactory/ValidationFiles/', variables('pipelineFileName'))}",
      "batchId": "@{variables('batchId')}",
      "status": "Success"
    }
  }
}

Logic App Workflow:

  1. HTTP trigger receives payload from ADF
  2. Parse JSON extracts the filename
  3. Get blob content retrieves the file from storage
  4. Send email (V2) delivers the file as an attachment

Key lessons learned:

  • Use Content property (not ContentBytes) when attaching blob content
  • Construct blob paths dynamically: @{concat('datafactory/', body('Parse_JSON')['fileName'])}
  • Logic Apps provide retry logic and error handling that BCP never had
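The contract between the pipeline and the Logic App is just the JSON shape shown above; sketched as a Python function, with field names mirroring the Web Activity body (the path prefix matches the earlier example; everything else is illustrative):

```python
def build_notification_payload(file_name: str, batch_id: str,
                               status: str = "Success") -> dict:
    """Mirror of the body the Web Activity POSTs to the Logic App trigger."""
    return {
        "fileName": file_name,
        "filePathAndName": f"datafactory/ValidationFiles/{file_name}",
        "batchId": batch_id,
        "status": status,
    }
```

Treating the payload as a pure function of its inputs makes the pipeline/Logic App boundary easy to reason about and test.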

5. Azure Functions for Excel Exports

Some business users demanded Excel files instead of CSVs. While ADF can generate CSVs natively, Excel requires programmatic libraries like EPPlus or ClosedXML.

We implemented an Azure Function (C#) that:

  1. Received a trigger from ADF with dataset parameters
  2. Queried the database or read from blob storage
  3. Generated .xlsx files using EPPlus
  4. Wrote back to blob storage for distribution

This modular approach kept the Excel-specific logic separate from the core pipeline.
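Our function was C# with EPPlus, but the orchestration shape is language-agnostic; this Python sketch shows only that shape, with the query, Excel rendering, and blob write injected as callables (the blob path convention here is invented for illustration):

```python
def generate_excel_export(params, query_rows, render_xlsx, write_blob):
    """Skeleton of the Azure Function's flow: query -> render -> stage in blob.

    query_rows, render_xlsx, and write_blob are injected so the Excel-specific
    piece (EPPlus in the real C# function) stays swappable and testable.
    """
    rows = query_rows(params["batchId"])
    content = render_xlsx(rows)
    blob_path = f"datafactory/ExcelExports/{params['fileName']}"  # illustrative path
    write_blob(blob_path, content)
    return blob_path
```

Keeping the rendering behind an interface is what makes the "modular approach" above testable without a live database or storage account.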

The Power of Parameterization

The real transformation came from treating pipelines as templates. Instead of 10 separate, hard-coded jobs, we created base pipelines with parameters:

Pipeline Parameters:

  • clientName: Target client/customer
  • fileNamePrefix: Base filename (e.g., "ValidationFile")
  • emailRecipients: Comma-separated email list
  • destinationPath: Blob container and folder path
  • storedProcedureName: Which proc to execute

This meant:

  • 6 unique pipeline templates served all 10 file generation jobs
  • Configuration-driven execution: New clients/files required parameter changes, not new pipelines
  • Consistent monitoring: All jobs followed the same pattern, making debugging predictable

Compare this to BCP where each job was its own snowflake, with paths and logic scattered across SQL Server Agent job steps.
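Configuration-driven execution amounts to a lookup table of parameter sets feeding one template; a Python sketch of the idea (every value below is invented, not a real client, proc, or address):

```python
# Illustrative only: client names, procs, and recipients are made up.
CLIENT_CONFIGS = {
    "Customer01": {
        "fileNamePrefix": "ValidationFile",
        "emailRecipients": "ops@example.com,client01@example.com",
        "destinationPath": "datafactory/ValidationFiles/Customer01",
        "storedProcedureName": "dbo.usp_BuildValidationFile",
    },
}

def pipeline_parameters(client_name: str) -> dict:
    """Resolve the parameter set a template pipeline run receives for one client."""
    config = CLIENT_CONFIGS[client_name]
    return {"clientName": client_name, **config}
```

Onboarding a new client means adding one entry to the table, not authoring a new pipeline.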

Organization Through Folder Structure

ADF allows organizing pipelines, datasets, and linked services into folders. This might seem trivial, but it was transformative for maintainability.

Our structure:

/Pipelines
  /FileGeneration
    /ValidationFiles
      - Pipeline_ValidationFile_Template
      - Pipeline_ValidationFile_Customer01
      - Pipeline_ValidationFile_Customer02
    /ReportExports
      - Pipeline_DailySalesReport
      - Pipeline_WeeklyInventory
  /DataIntegration
    - (other ETL pipelines)

When a validation file job failed at 2 AM, having this organization meant immediately knowing:

  1. Where to find the pipeline
  2. Which template it derived from
  3. Related datasets and connections

With BCP and SQL Agent, this context was lost—jobs were just names in a list, and you had to open each one to understand what it did.

Source Control and Deployment

Perhaps one of the biggest improvements was native Git integration. ADF pipelines are JSON files under the hood, and they live in an Azure DevOps or GitHub repository.

While we did use Redgate SQL Source Control to track SQL Server Agent jobs and stored procedures, the experience was fundamentally different. Redgate worked by periodically syncing database objects to source control—it was a bridge between SQL Server and Git, not a native integration.

With ADF, the Git workflow is first-class:

  • Pull request workflows for pipeline changes
  • Code review before deploying to production
  • Rollback capabilities when changes broke things
  • Environment promotion (Dev → Test → Prod) through ARM templates
  • Audit trail of who changed what and when

The difference is subtle but important. With Redgate:

Make change in SQL Server → Redgate syncs → Commit to Git

With ADF:

Make change in Git branch → Deploy to ADF → Test → Merge

In the ADF model, source control is the source of truth. In the Redgate model, the database is the source of truth, and source control is a mirror. This inversion matters for deployment pipelines and preventing configuration drift.

Deployments used Azure DevOps pipelines with ARM template deployments—no more manually recreating jobs in production or relying on Redgate's deployment tools to keep environments in sync.

With ADF, every pipeline change was:

git commit -m "Updated email recipients for client validation file"
git push origin feature/update-client-recipients

The Architectural Question: Application Layer vs. Data Layer

This migration forced a deeper question: Should file generation live in the application layer or the data layer?

The Case for Application Layer

Modern software engineering principles suggest keeping business logic in application code:

Advantages:

  • Better testability: Unit tests, integration tests, mocking—all easier in application code
  • Language flexibility: Use languages optimized for the task (Python for data manipulation, C# for complex business rules)
  • Separation of concerns: Databases handle data persistence; applications handle workflows
  • Scalability: Applications can scale horizontally; databases often can't
  • Version control: Application code naturally lives in Git
  • Observability: Modern APM (Application Performance Monitoring) tools work best with application code

The Argument: File generation is a workflow—an operation that happens because of business events. The application knows when an order ships, when a batch process completes, when a report period closes. Why pull data out of the application, into the database, only to push it back out through separate ETL tooling?

A well-architected system might handle file generation like this:

Business Event (Order Shipped) 
  → Application Service (Generate Shipping Manifest)
    → Data Repository (Fetch Required Data)  
    → File Service (Generate CSV/Excel)
    → Distribution Service (Email/SFTP)

All of this lives in the application, uses application logging, and deploys with application code.

The Case for Data/ETL Layer

However, the reality is more nuanced:

Advantages:

  • Data proximity: Files are often aggregations of database data; keeping generation close to the data reduces network overhead
  • Specialized tools: ADF, SSIS, and ETL tools are optimized for data transformations and bulk operations
  • Operational separation: File generation failures don't impact application uptime
  • Historical complexity: Many files involve complex SQL queries built by DBAs over years—rewriting in application code is risky
  • Scheduling independence: Reports run on schedules (nightly, weekly) independent of application events
  • Resource management: Generating a 100MB Excel file in your web application kills performance; ETL tools handle this better

The Reality: Some file generation truly is data integration work, not application work. Consider:

  • Compliance reports: Aggregating historical data for regulatory reporting
  • Data exports for partners: Providing extracts to third-party systems
  • Archive processes: Extracting aged data for long-term storage
  • Analytics feeds: Generating files for data science/BI platforms

These workflows are fundamentally about data movement and transformation—the core purpose of ETL systems.

The Hybrid Approach

The answer isn't binary. The best architecture recognizes that different file generation needs belong in different places:

Application Layer:

  • Transactional documents (invoices, shipping labels, order confirmations)
  • User-initiated exports (download my data)
  • Real-time reports triggered by specific events
  • Files requiring complex business logic beyond SQL

Data/ETL Layer:

  • Scheduled batch reports
  • Data extracts for external systems
  • Compliance and regulatory reports
  • Analytics feeds and data warehouse exports
  • Historical aggregations

The key insight: File generation should live closest to where the triggering logic and required data reside.

If generating a file requires:

  • Joining 15 tables
  • Complex temporal logic
  • Historical data analysis

Then keeping it in the data layer makes sense.

If generating a file requires:

  • Real-time application state
  • Integration with external APIs
  • Complex orchestration across microservices

Then it belongs in the application layer.

Why Azure Data Factory Won

Despite the architectural debates, ADF proved superior to both SSIS and BCP for our specific needs:

1. Maintainability

Pipelines are organized, parameterized, and version-controlled. Adding a new file export meant cloning a template and updating parameters—not recreating everything from scratch.

2. Observability

ADF provides built-in monitoring with:

  • Pipeline run history
  • Activity-level duration tracking
  • Error messages and stack traces
  • Integration with Azure Monitor for alerting

BCP gave you exit codes. SSIS gave you cryptic XML errors. ADF gives you actionable insights.

3. Scalability

ADF can scale to handle massive data volumes through:

  • Parallel copy activities
  • Data flow optimizations
  • Integration Runtimes sized appropriately

We haven't hit scaling limits yet, but we know the runway is there.

4. Cost Visibility

Azure provides detailed cost breakdowns by:

  • Pipeline runs
  • Data movement volumes
  • Integration Runtime hours

This transparency didn't exist with on-premises SQL Agent jobs.

5. No Visual Studio Dependency

The ADF portal is web-based. Any developer can:

  • Open a browser
  • Navigate to the ADF resource
  • Edit pipelines
  • Test changes

No more "I can't open this SSIS package because I have the wrong VS version installed."

6. Flexibility Through Integration

ADF isn't trying to do everything. It orchestrates:

  • Logic Apps for email
  • Azure Functions for custom code
  • Databricks for complex transformations
  • Power BI for embedding visualizations

This composability is powerful. Each component does what it's best at.

Lessons Learned

After migrating 10 jobs and running them in production for months, here are the key takeaways:

1. Parameterize Everything

We initially hard-coded file paths and email addresses. Changing these required pipeline edits and deployments. Moving to parameters meant configuration changes were just trigger updates—no code changes needed.

2. Blob Storage Is Your Friend

Using blob storage as staging adds a step, but the benefits (resilience, auditing, cost) outweigh the complexity. Plus, it enables scenarios like "regenerate last month's file" by simply re-running the pipeline.

3. Modular Pipelines Over Monoliths

Our first instinct was to build one giant pipeline that did everything. Breaking into discrete, chained pipelines made debugging vastly easier. When the email step fails, you don't want to re-run the entire 30-minute data extraction.

4. Error Handling Is Not Optional

ADF's retry policies, dependency conditions, and failure paths need to be explicitly configured. We learned this the hard way when a transient network issue caused 50 pipeline failures over a weekend because retry wasn't enabled.
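Retry is opt-in at the activity level: ADF defaults to zero retries, which is exactly what bit us. A minimal policy fragment of the kind we now set on copy activities (values illustrative):

```json
{
  "name": "CopyToBlob",
  "type": "Copy",
  "policy": {
    "timeout": "0.01:00:00",
    "retry": 3,
    "retryIntervalInSeconds": 60
  }
}
```

Three retries a minute apart would have absorbed that weekend's transient network blip without a single alert.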

5. Documentation Matters

We created README files for each pipeline explaining:

  • Purpose and business owner
  • Trigger schedule
  • Expected output
  • Troubleshooting steps

When jobs fail at 3 AM, good documentation means anyone on-call can fix it.

6. Source Control Discipline

Require pull requests even for "small" pipeline changes. We had an incident where a well-meaning developer changed a pipeline directly in production, which later got overwritten by a deployment from Git. Always work from source.

Conclusion

Migrating from BCP commands to Azure Data Factory wasn't just a technical lift—it forced us to rethink where file generation belongs in our architecture. While the "application vs. data layer" debate doesn't have a single answer, the migration revealed that ADF provides a sweet spot for our specific needs:

  • Close to the data for SQL-heavy transformations
  • Cloud-native for resilience and scalability
  • Observable and maintainable in ways BCP never was
  • Flexible enough to integrate with application-layer services when needed

The 6 parameterized pipeline templates now handle all our file generation needs. Adding new exports takes hours instead of days. Troubleshooting happens in minutes instead of hours. And most importantly, we have confidence that the system won't mysteriously break because someone upgraded Visual Studio.

For organizations still relying on BCP commands, SQL Agent jobs, or struggling with SSIS, Azure Data Factory offers a modern alternative. It won't solve every architectural question, but it provides the tools to build maintainable, observable data integration workflows—and that's a foundation worth building on.


Have you migrated legacy file generation processes to modern cloud platforms? I'd love to hear about your experiences and what architectural patterns worked (or didn't work) for your team. Feel free to connect and share your story.