3.1 Extracting Data from Cloud Sources
Extracting data from cloud sources is a key part of the modern ETL process. It involves using specialized tools to retrieve raw data from various cloud-based systems and move it to a temporary staging area for further processing.
Extraction Concepts:
🡆Connectors and APIs: Cloud ETL tools rely on pre-built connectors and APIs (Application Programming Interfaces) to connect to a wide range of cloud services. These tools handle the complexities of authentication and data retrieval, making the process seamless.
- Example: A financial services company needs to pull data from a cloud-based CRM (Salesforce) and a marketing automation tool (Marketo). The ETL tool uses a pre-built Salesforce connector and the Marketo API to automatically extract the customer and campaign data.
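Under the hood, a connector mostly builds authenticated requests against a service's API. The sketch below shows the idea with Python's standard library; the base URL, endpoint path, and token are hypothetical stand-ins, not real Salesforce or Marketo details.

```python
from urllib import request

def build_api_request(base_url: str, endpoint: str, token: str) -> request.Request:
    """Build an authenticated GET request, as an ETL connector might.

    Real connectors also handle pagination, retries, and rate limits.
    """
    req = request.Request(f"{base_url}/{endpoint}")
    req.add_header("Authorization", f"Bearer {token}")  # token auth (assumed scheme)
    req.add_header("Accept", "application/json")
    return req

# Hypothetical endpoint; commercial tools ship pre-built connectors for this.
crm_req = build_api_request("https://example-crm.api", "v1/customers", "TOKEN")
```

In practice the ETL tool hides this layer entirely: the engineer selects a connector and supplies credentials, and the tool issues requests like the one above.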
🡆Staging Area: After extraction, the raw data is temporarily stored in a staging area. This isolated repository is crucial because it acts as a buffer, allowing transformations to occur without impacting the live data sources or the final data warehouse.
- Example: Data from various sources (CSV files, a SQL database, and a cloud-based application) is extracted and placed into separate tables in a staging database. This allows the data engineering team to work on cleaning and transforming the data without affecting the production data warehouse.
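The staging pattern can be sketched with an in-memory SQLite database standing in for a dedicated staging store; the table name and schema are illustrative assumptions.

```python
import sqlite3

# In-memory database stands in for a staging store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_customers (id INTEGER, name TEXT, email TEXT)")

# Rows as they arrive from extraction, loaded untransformed into staging.
extracted_rows = [(1, "Ada", "ada@example.com"), (2, "Grace", "grace@example.com")]
conn.executemany("INSERT INTO stg_customers VALUES (?, ?, ?)", extracted_rows)
conn.commit()

# Cleaning and transformation queries run against staging tables only;
# the production warehouse is never touched until the load step.
staged_count = conn.execute("SELECT COUNT(*) FROM stg_customers").fetchone()[0]
```

A common convention, used above, is a `stg_` prefix so staging tables are easy to distinguish from warehouse tables.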
🡆Data Formats and Types: Data from cloud sources can come in various formats, such as JSON, CSV, or Parquet. The extraction process must be able to handle these different structures and prepare them for transformation.
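A minimal sketch of format handling with Python's standard library, using made-up sample payloads; note that CSV parsing yields strings, so types still need normalizing during transformation (Parquet, being columnar and binary, would require a library such as pyarrow).

```python
import csv
import io
import json

json_payload = '[{"id": 1, "amount": 9.5}]'   # e.g. a REST API response
csv_payload = "id,amount\n2,12.0\n"           # e.g. an exported flat file

# Parse both formats into a common row shape (list of dicts).
records = json.loads(json_payload)
records += [dict(r) for r in csv.DictReader(io.StringIO(csv_payload))]

# CSV values arrive as strings ("2", "12.0"); JSON keeps native types (1, 9.5).
# Type normalization is deferred to the transformation stage.
```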
🡆Load Types (Full vs. Delta): The extraction can be a full load, which pulls all available data, or a more efficient delta load, which only extracts new or changed data since the last run. Delta loads are essential for keeping data repositories current without reprocessing the entire source each time.
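One common way to implement a delta load is a "watermark": record the timestamp of the last successful run and extract only rows modified after it. The sketch below uses invented sample rows and an `updated_at` column, a typical but assumed convention.

```python
from datetime import datetime, timezone

# Hypothetical source rows, each carrying a last-modified timestamp.
rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]

def delta_extract(source, watermark):
    """Return only rows changed since the previous run (delta load)."""
    return [r for r in source if r["updated_at"] > watermark]

last_run = datetime(2024, 2, 1, tzinfo=timezone.utc)
changed = delta_extract(rows, last_run)  # picks up only id 2

# A full load is the degenerate case: a watermark before all data.
full = delta_extract(rows, datetime.min.replace(tzinfo=timezone.utc))
```

After each successful run, the pipeline advances the stored watermark to the run's start time so nothing modified mid-extraction is skipped.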