Businesses can now produce analytics from big data drawn from a variety of sources. To make better decisions, they need access to all of their data sources for analytics and business intelligence (BI).
An incomplete picture of available data can result in misleading reports, spurious analytic conclusions, and inhibited decision-making. To correlate data from multiple sources, data should be stored in a centralized location — a data warehouse — which is a special kind of database architected for efficient reporting.
Information must be ingested before it can be digested. Analysts, managers, and decision-makers need to understand data ingestion and its associated technologies, because a strategic and modern approach to designing the data pipeline ultimately drives business value.
What is data ingestion?
Data ingestion is the transportation of data from assorted sources to a storage medium where it can be accessed, used, and analyzed by an organization. The destination is typically a data warehouse, data mart, database, or a document store. Sources may be almost anything — including SaaS data, in-house apps, databases, spreadsheets, or even information scraped from the internet.
The data ingestion layer is the backbone of any analytics architecture. Downstream reporting and analytics systems rely on consistent and accessible data. There are different ways of ingesting data, and the design of a particular data ingestion layer can be based on various models or architectures.
Batch vs. streaming ingestion
Business requirements and constraints inform the structure of a particular project’s data ingestion layer. The right ingestion model supports an optimal data strategy, and businesses typically choose the model that’s appropriate for each data source by considering the timeliness with which they’ll need analytical access to the data:
Batch processing. Here, the ingestion layer periodically collects and groups source data and sends it to the destination system. Groups may be processed based on any logical ordering, the activation of certain conditions, or a simple schedule. When near-real-time data is not important, batch processing is typically used, since it’s generally easier and cheaper to implement than streaming ingestion.
Real-time processing (also called stream processing or streaming) involves no grouping at all. Data is sourced, manipulated, and loaded as soon as it’s created or recognized by the data ingestion layer. This kind of ingestion is more expensive, since it requires systems to constantly monitor sources and accept new information. However, it may be appropriate for analytics that require continually refreshed data.
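The difference between the two models can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the function names `batch_ingest` and `stream_ingest`, the `load` callback, and the sample `events` are all hypothetical, and a plain list stands in for the destination system.

```python
from typing import Callable, Iterable

Record = dict


def batch_ingest(source: Iterable[Record], load: Callable[[list], None]) -> None:
    """Collect everything the source currently holds, then load it as one group."""
    group = list(source)
    if group:
        load(group)


def stream_ingest(source: Iterable[Record], load: Callable[[list], None]) -> None:
    """Load each record on its own as soon as it is seen; no grouping."""
    for record in source:
        load([record])


# Hypothetical source data; a real source would be a database, API, or queue.
events = [{"id": 1}, {"id": 2}, {"id": 3}]

batch_calls = []   # each entry is one load sent to the destination
batch_ingest(events, batch_calls.append)

stream_calls = []
stream_ingest(events, stream_calls.append)

print(len(batch_calls), len(stream_calls))  # 1 3
```

The same three records reach the destination either way. What changes is how many loads the destination has to accept, which is one reason streaming tends to cost more to operate.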
It’s worth noting that some “streaming” platforms (such as Apache Spark Streaming) actually use batch processing. Here the ingested groups are simply smaller or prepared at shorter intervals, but records are still not processed individually. This type of processing is often called micro-batching and is considered by some to be another distinct category of data ingestion.
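A rough sketch of micro-batching follows. It is an illustration only: the helper `micro_batches` and the sample `readings` are hypothetical, and groups are cut by record count rather than by a time interval so that the example stays deterministic. Platforms that micro-batch usually cut groups on a short timer instead.

```python
from itertools import islice
from typing import Iterable, Iterator


def micro_batches(records: Iterable[dict], max_size: int) -> Iterator[list]:
    """Yield small groups of at most max_size records until the source is exhausted."""
    iterator = iter(records)
    while True:
        group = list(islice(iterator, max_size))
        if not group:
            return
        yield group


# Hypothetical stream of sensor readings, consumed lazily.
readings = ({"sensor": "a", "value": v} for v in range(5))
groups = list(micro_batches(readings, max_size=2))

print([len(g) for g in groups])  # [2, 2, 1]
```

Shrinking `max_size` toward 1 approaches record-by-record streaming, while growing it approaches conventional batch ingestion.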
Common data ingestion challenges
Certain difficulties can impact the data ingestion layer and pipeline performance as a whole.
The global data ecosystem is growing more diverse, and data volume has exploded. Information can come from numerous distinct data sources, from transactional databases to SaaS platforms to mobile and IoT devices. These sources are constantly evolving while new ones come to light, making an all-encompassing and future-proof data ingestion process difficult to define.
Coding and maintaining an analytics architecture that can ingest this volume and diversity of data is costly and time-consuming, but a worthwhile investment: The more data businesses have available, the more robust their potential for competitive analysis becomes.
Meanwhile, speed can be a challenge for both the ingestion process and the data pipeline. As data grows more complex, it’s more time-consuming to develop and maintain data ingestion pipelines, particularly when it comes to “real-time” data processing, which depending on the application can be fairly slow (updating every 10 minutes) or incredibly current (think stock ticker applications during trading hours).
Knowing whether an organization truly needs real-time processing is crucial for making appropriate architectural decisions about data ingestion. Choosing technologies like autoscaling cloud-based data warehouses allows businesses to maximize performance and resolve challenges affecting the data pipeline.
Legal and compliance requirements add complexity (and expense) to the construction of data pipelines. For example, companies that process the personal data of people in the European Union need to comply with the General Data Protection Regulation (GDPR), US healthcare data is governed by the Health Insurance Portability and Accountability Act (HIPAA), and companies using third-party IT services need auditing procedures like Service Organization Control 2 (SOC 2).
Businesses make decisions based on the data in their analytics infrastructure, and the value of that data depends on their ability to ingest and integrate it. If the initial ingestion of data is problematic, every stage down the line will suffer, so holistic planning is essential for a performant pipeline.
Data ingestion and ETL
The growing popularity of cloud-based storage solutions has given rise to new techniques for replicating data for analysis.
Until recently, data ingestion paradigms called for an extract, transform, load (ETL) procedure in which data is taken from the source, manipulated to fit the properties of a destination system or the needs of the business, then added to that system. When businesses used costly in-house analytics systems, it made sense to do as much prep work as possible, including transformations, prior to loading data into the warehouse.
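As a rough sketch of the ETL sequence, the example below reshapes rows in application code before they reach the destination. Everything here is an assumption for illustration: the `source_rows` data, the `orders` table, and the `extract`, `transform`, and `load` helpers are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source rows, as they might arrive from an app or a CSV export.
source_rows = [
    {"order_id": "1001", "amount": "19.99", "country": "us"},
    {"order_id": "1002", "amount": "5.00", "country": "DE"},
]


def extract():
    """Take the data from the source."""
    return list(source_rows)


def transform(rows):
    """Fit the data to the destination schema before loading."""
    return [
        (int(r["order_id"]), float(r["amount"]), r["country"].upper())
        for r in rows
    ]


def load(conn, rows):
    """Add the already-transformed rows to the destination."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(warehouse, transform(extract()))

etl_result = warehouse.execute(
    "SELECT order_id, amount, country FROM orders ORDER BY order_id"
).fetchall()
print(etl_result)  # [(1001, 19.99, 'US'), (1002, 5.0, 'DE')]
```

Because the cleanup happens in `transform`, the warehouse only ever stores the prepared form of the data, and any new reporting need that requires a different shape means changing the pipeline code.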
But today, cloud data warehouses like Amazon Redshift, Google BigQuery, Snowflake, and Microsoft Azure SQL Data Warehouse can cost-effectively scale compute and storage resources with latency measured in seconds or minutes. This allows data engineers to skip the preload transformations and load all of the organization’s raw data into the data warehouse. Data scientists can then define transformations in SQL and run them in the data warehouse at query time. This new sequence has changed ETL into ELT, which is ideal for replicating data cost-effectively in cloud infrastructure.
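The ELT sequence can be sketched the same way. The example is illustrative rather than a recommended setup: the `raw_orders` table, the `orders` view, and the sample rows are hypothetical, and an in-memory SQLite database stands in for a cloud data warehouse. Raw values are loaded untouched, and the transformation is defined in SQL as a view, so it runs inside the database whenever the view is queried.

```python
import sqlite3

# Hypothetical raw rows, loaded exactly as they were extracted (all text).
raw_rows = [
    ("1001", "19.99", "us"),
    ("1002", "5.00", "DE"),
]

warehouse = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# Extract and load: no preload transformation.
warehouse.execute(
    "CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)"
)
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: defined in SQL and evaluated at query time.
warehouse.execute(
    """
    CREATE VIEW orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)      AS amount,
           UPPER(country)            AS country
    FROM raw_orders
    """
)

elt_result = warehouse.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(elt_result)  # [('DE', 5.0), ('US', 19.99)]
```

Since the raw table is preserved, a different team could define another view over the same data without touching the ingestion code.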
Businesses don’t use ELT to replicate data to a cloud platform just because it gets the data to a destination faster. ELT removes the need to write complex transformations as a part of the data pipeline, and avoids reliance on less scalable on-premises hardware. Most importantly, ELT gives data and analytics teams more freedom to develop ad-hoc transformations according to their particular needs.