Skip to main content

Data Management

Data is one of the most valuable assets in any modern organization. CPAs must understand how data is extracted, stored, queried, and integrated to support financial analysis, operational decisions, and audit procedures. The ISC exam tests your knowledge of the full data life cycle — from creation through active use, storage, and final disposition — as well as your ability to work with relational databases, SQL queries, and business process models.

This section covers data extraction methods, data storage types and database schemas, the data life cycle, relational database structure (integrity rules, data dictionaries, normalization), SQL queries, data integration from multiple sources, and business process models.

info

Data management is tested across all three skill levels on the ISC exam. You must be able to define storage types and summarize the data life cycle (Remembering and Understanding), examine database structure and SQL queries (Application), and investigate business process models to identify improvements (Analysis).


Data Extraction Methods and Techniques

Data extraction is the process of retrieving data from one or more sources for further processing, analysis, or storage. CPAs encounter data extraction in contexts such as audit data analytics, financial reporting, and system migrations.

Common data extraction methods include:

MethodDescriptionExample
Manual extractionUsers export data from a system using built-in reporting or export featuresExporting a trial balance from an accounting application to a spreadsheet
Query-based extractionUsing a query language (typically SQL) to retrieve specific data from a databaseWriting a SQL query to extract all journal entries over $100,000 from the general ledger
API-based extractionUsing application programming interfaces to programmatically retrieve data from systemsPulling transaction data from a SaaS application via its REST API
ETL (Extract, Transform, Load)An automated process that extracts data from source systems, transforms it into a consistent format, and loads it into a target system (e.g., data warehouse)Nightly ETL job that extracts sales data from three regional systems and loads it into a centralized data warehouse
Exam Tip

The ETL process is a foundational concept in data management. Remember the three steps: Extract (pull data from sources), Transform (clean, validate, and format the data), and Load (insert the data into the target system). If any step has a control weakness — for example, the transform step does not validate data completeness — the resulting data may be unreliable.


Data Storage Types

Organizations use different types of data storage depending on their analytical needs, data volume, and performance requirements:

Storage TypeDescriptionPurposeCharacteristics
Data warehouseA centralized repository that stores integrated data from multiple sources in a structured, query-optimized formatEnterprise-wide reporting and analyticsHistorical data, structured, optimized for read-heavy queries
Data lakeA large storage repository that holds raw data in its native format until it is needed for analysisStoring unstructured, semi-structured, and structured data at scaleRaw data, flexible schema, supports diverse data types
Data martA subset of a data warehouse that focuses on a specific business function or departmentDepartmental reporting and analysisNarrower scope than a data warehouse, faster queries for specific use cases

Example: MAS Inc. uses a data warehouse to consolidate financial data from its ERP system, CRM system, and external market data feeds. The finance department accesses a data mart that contains only the financial data relevant to their reporting needs, which provides faster query performance than querying the full data warehouse.


Database Schemas

A database schema defines the logical structure of a database — how tables are organized and how they relate to each other. Two common schema designs used in data warehouses are the star schema and snowflake schema.

Star Schema

A star schema organizes data around a central fact table (which contains quantitative data such as sales amounts, quantities, or transaction counts) surrounded by dimension tables (which contain descriptive attributes such as date, product, customer, or location).

Characteristics:

  • Simple structure, easy to understand and query
  • Dimension tables are denormalized (may contain redundant data)
  • Faster query performance due to fewer joins

Snowflake Schema

A snowflake schema is a variation of the star schema where dimension tables are normalized — broken into additional sub-tables to reduce redundancy.

Characteristics:

  • More complex structure with additional tables and joins
  • Dimension tables are normalized, reducing data redundancy
  • Slower query performance due to more joins, but more efficient storage
note

For the ISC exam, you should understand the structural difference between star and snowflake schemas. The star schema trades storage efficiency for query simplicity, while the snowflake schema trades query simplicity for storage efficiency.


The Data Life Cycle

The data life cycle describes the stages that data passes through from its creation to its eventual disposal. Understanding the data life cycle is essential for implementing appropriate controls at each stage.

StageDescriptionKey Controls
Creation / CollectionData is generated or collected from internal or external sourcesInput validation, authorization, data quality checks
StorageData is stored in databases, data warehouses, or other repositoriesAccess controls, encryption, backup and recovery
Active UseData is accessed, processed, and analyzed for business purposesAccess controls, processing integrity, audit trails
Sharing / DistributionData is shared with internal or external partiesAuthorization, encryption in transit, data classification
ArchivalData that is no longer actively used but must be retained is moved to long-term storageRetention policies, access controls, retrieval procedures
Disposition / DestructionData that is no longer needed is securely destroyedSecure deletion methods, documentation of destruction, compliance with retention policies
warning

Data disposition is often overlooked but is critically important, especially for sensitive data. Simply deleting a file does not securely destroy it — data may be recoverable from storage media unless overwritten or the media is physically destroyed. The exam may test whether you can identify a deficiency in an organization's data disposition practices.


Relational Database Structure

A relational database organizes data into tables (also called relations), where each table consists of rows (records) and columns (fields). Understanding relational database concepts is essential for examining database integrity and writing SQL queries.

Data Integrity Rules

Relational databases enforce several types of data integrity:

Integrity TypeDescriptionExample
Entity integrityEvery table must have a primary key, and no primary key value can be nullEach customer in the Customers table has a unique, non-null CustomerID
Referential integrityForeign key values must match a primary key value in the related table (or be null)An OrderID in the Orders table must correspond to a valid CustomerID in the Customers table
Domain integrityValues in a column must conform to defined data types, formats, and allowable rangesA DateOfBirth column must contain a valid date; a State column must contain a valid two-letter state code

Data Dictionary

A data dictionary (also called a metadata repository) is a centralized record that describes the structure, content, and characteristics of the data in a database. It typically includes:

  • Table names and descriptions
  • Column names, data types, and constraints
  • Primary key and foreign key relationships
  • Valid values and business rules
  • Data ownership and stewardship

The data dictionary is an essential tool for auditors because it provides a map of the database structure and helps verify that the database is properly designed and documented.

Normalization

Normalization is the process of organizing data in a relational database to reduce redundancy and improve data integrity. Normalization involves decomposing tables into smaller, related tables and defining relationships between them.

Normal FormRulePurpose
First Normal Form (1NF)Each column contains only atomic (indivisible) values; no repeating groupsEliminate repeating groups
Second Normal Form (2NF)1NF + every non-key column depends on the entire primary key (not just part of it)Eliminate partial dependencies
Third Normal Form (3NF)2NF + every non-key column depends only on the primary key (not on other non-key columns)Eliminate transitive dependencies

Example: If Bear Co.'s Employees table contains columns for EmployeeID, EmployeeName, DepartmentID, and DepartmentName, the DepartmentName depends on DepartmentID (not on EmployeeID). To achieve 3NF, DepartmentName should be moved to a separate Departments table, with DepartmentID as the link.


Structured Query Language (SQL)

SQL (Structured Query Language) is the standard language for interacting with relational databases. The ISC exam tests your ability to examine a SQL query and determine whether the retrieved data set is relevant and complete.

Common SQL Commands

CommandPurposeExample
SELECTRetrieves data from one or more tablesSELECT CustomerName, Balance FROM Customers
FROMSpecifies the table(s) to queryFROM Customers
WHEREFilters rows based on a conditionWHERE Balance > 10000
JOINCombines rows from two or more tables based on a related columnJOIN Orders ON Customers.CustomerID = Orders.CustomerID
INSERTAdds new rows to a tableINSERT INTO Customers (CustomerID, CustomerName) VALUES (101, 'Bear Co.')
UPDATEModifies existing data in a tableUPDATE Customers SET Balance = 15000 WHERE CustomerID = 101
DELETERemoves rows from a tableDELETE FROM Customers WHERE CustomerID = 101

Common SQL Clauses and Operators

Clause / OperatorPurposeExample
ORDER BYSorts the result setORDER BY Balance DESC
GROUP BYGroups rows that share a value for aggregate calculationsGROUP BY DepartmentID
HAVINGFilters groups (used with GROUP BY)HAVING SUM(Amount) > 50000
AND / ORCombines multiple conditionsWHERE Status = 'Active' AND Balance > 0
INMatches any value in a listWHERE State IN ('IL', 'NY', 'CA')
BETWEENMatches values within a rangeWHERE PostDate BETWEEN '2025-01-01' AND '2025-12-31'
LIKEPattern matching with wildcardsWHERE CustomerName LIKE 'Bear%'
IS NULLTests for null valuesWHERE ApprovalDate IS NULL

Aggregate Functions

FunctionPurposeExample
COUNT()Counts the number of rowsSELECT COUNT(*) FROM Transactions
SUM()Calculates the sum of a numeric columnSELECT SUM(Amount) FROM Transactions
AVG()Calculates the average of a numeric columnSELECT AVG(Amount) FROM Transactions
MAX()Returns the maximum valueSELECT MAX(Amount) FROM Transactions
MIN()Returns the minimum valueSELECT MIN(Amount) FROM Transactions

String Functions

FunctionPurposeExample
UPPER() / LOWER()Converts text to uppercase or lowercaseSELECT UPPER(CustomerName) FROM Customers
TRIM()Removes leading and trailing whitespaceSELECT TRIM(CustomerName) FROM Customers
SUBSTRING()Extracts a portion of a stringSELECT SUBSTRING(AccountCode, 1, 3) FROM Accounts
CONCAT()Combines two or more stringsSELECT CONCAT(FirstName, ' ', LastName) FROM Employees
LENGTH() / LEN()Returns the length of a stringSELECT LENGTH(CustomerName) FROM Customers

Examining a SQL Query for Relevance and Completeness

When the ISC exam presents a SQL query and asks whether the retrieved data set is relevant and complete, consider:

  • Relevance — Does the WHERE clause filter the data to the correct population? Are the correct columns selected?
  • Completeness — Does the query include all necessary records? Could the WHERE clause be inadvertently excluding relevant data? Are JOIN conditions correct (an INNER JOIN excludes records with no match, while a LEFT JOIN includes all records from the left table)?
  • Accuracy — Are aggregate functions applied correctly? Is the GROUP BY clause grouping data at the right level?

Example: An auditor at Gies Co. needs to extract all journal entries posted to revenue accounts during 2025. The following query is proposed:

SELECT JournalID, PostDate, AccountCode, Amount
FROM JournalEntries
WHERE AccountCode LIKE '4%'
AND PostDate BETWEEN '2025-01-01' AND '2025-12-31'
AND Amount > 0

Is this query complete? Not necessarily. The Amount > 0 condition excludes credit entries (negative amounts), which could include revenue reversals and adjustments. Removing this condition would make the query more complete.


Data Integration

Data integration is the process of combining data from different sources to provide a unified view for analysis and decision-making. Organizations often have data scattered across multiple systems — ERP, CRM, spreadsheets, external data feeds — and need to bring this data together.

Integration Challenges

ChallengeDescription
Data format inconsistenciesDifferent systems may store the same data in different formats (e.g., date formats, currency codes)
Duplicate recordsThe same entity (customer, vendor, employee) may exist in multiple systems with different identifiers
Data quality issuesMissing values, incorrect entries, and outdated records can compromise the integrated data set
TimelinessData from different sources may be updated at different frequencies, leading to inconsistencies
Semantic differencesThe same term may have different meanings in different systems (e.g., "revenue" may be defined differently in the sales system vs. the accounting system)

Integration Approaches

ApproachDescription
ETL (Extract, Transform, Load)Data is extracted from source systems, transformed into a consistent format, and loaded into a central repository
Data virtualizationData remains in its source systems but is accessed through a unified interface that presents an integrated view
API integrationSystems exchange data in real time through application programming interfaces
Manual consolidationData is exported from individual systems and combined manually (e.g., in spreadsheets) — highest risk of error

Business Process Models

A business process model is a visual representation of the steps, decisions, data flows, and actors in a business process. The ISC exam tests your ability to investigate a business process model and identify potential improvements.

Types of Business Process Models

Model TypeDescriptionKey Use
FlowchartA diagram that shows the sequential steps in a process, including decision points and document flowsDocumenting transaction processing steps and control points
Data Flow Diagram (DFD)A diagram that shows how data moves through a system, including data sources, processes, data stores, and data destinationsUnderstanding data inputs, outputs, and transformations
Business Process Model and Notation (BPMN) diagramA standardized graphical notation for modeling business processes, including events, activities, gateways, and flowsDetailed process modeling with industry-standard notation

Identifying Improvements

When investigating a business process model, look for:

  • Redundant steps — Activities that are performed multiple times without adding value
  • Manual processes that could be automated (e.g., using RPA or system integration)
  • Missing controls — Points in the process where authorization, validation, or reconciliation should occur but does not
  • Bottlenecks — Steps that create delays due to capacity constraints or sequential dependencies
  • Unnecessary handoffs — Points where data or documents are transferred between people or systems, increasing the risk of errors or delays

Example: Illini Entertainment has a process model for its accounts payable cycle. The model shows that vendor invoices are first entered into a spreadsheet by a clerk, then manually re-entered into the ERP system by a different clerk. This redundant manual data entry step is a candidate for elimination — the invoice could be entered directly into the ERP system or captured through automated invoice processing.


Summary

TopicKey Takeaway
Data extractionMethods include manual export, SQL queries, API calls, and ETL processes
Data storage typesData warehouses (structured, integrated), data lakes (raw, flexible), data marts (departmental subsets)
Database schemasStar schema (denormalized, fast queries) and snowflake schema (normalized, less redundancy)
Data life cycleCreation, storage, active use, sharing, archival, and disposition — controls are needed at each stage
Relational databasesTables with rows and columns; enforce entity, referential, and domain integrity
SQLStandard language for querying databases; examine queries for relevance, completeness, and accuracy
Data integrationCombining data from multiple sources; ETL is the most common approach
Business process modelsFlowcharts, DFDs, and BPMN diagrams; investigate for redundancies, missing controls, and improvement opportunities

Practice Questions

  1. Bear Co. maintains a data warehouse that is loaded nightly from its ERP system using an ETL process. The finance team notices that the data warehouse contains duplicate customer records — the same customer appears with slightly different names (e.g., "Bear Co." and "Bear Company"). At which step of the ETL process should this issue be addressed?

  2. An auditor is reviewing the following SQL query used by MAS Inc. to identify high-value unpaid invoices:

    SELECT InvoiceID, VendorName, Amount
    FROM Invoices
    INNER JOIN Vendors ON Invoices.VendorID = Vendors.VendorID
    WHERE Amount > 50000
    AND PaymentDate IS NULL

    A vendor recently changed its name in the Vendors table, but the corresponding invoices still reference the old VendorID, which was deleted from the Vendors table. Will those invoices appear in the query results? Why or why not?

  3. Gies Co. is considering implementing a data lake alongside its existing data warehouse. What is the primary difference between a data warehouse and a data lake, and when would each be more appropriate?

Answers
  1. The Transform step. During the Transform phase of the ETL process, data should be cleansed, standardized, and deduplicated. This includes matching and merging records for the same customer that appear with different name variations.

  2. No. The query uses an INNER JOIN, which only returns rows where there is a matching record in both tables. If the old VendorID was deleted from the Vendors table, the invoices referencing that VendorID will have no match and will be excluded from the results. To include these invoices, the auditor should use a LEFT JOIN instead, which returns all rows from the Invoices table even if there is no matching Vendor record.

  3. A data warehouse stores structured, integrated data that has been cleaned and organized (typically through an ETL process) and is optimized for reporting and analytics. A data lake stores raw data in its native format — including unstructured and semi-structured data — and imposes structure only when the data is queried. A data warehouse is more appropriate when the organization needs consistent, reliable data for financial reporting. A data lake is more appropriate when the organization needs to store large volumes of diverse data for exploratory analysis or machine learning.