040206 Quality Assuring your Data Pipeline

2023-04-10

The badge above was awarded for completing the course.

Get Data into the EMS

02. Refine your Data Pipeline

Quality Assuring your Data Pipeline

This is a practical takeaway checklist you can use to run quality assurance checks before your data pipeline goes live. The checklist is based on project best practices and is used in current implementations.

If you have worked through the "Get Data into the EMS" track, most of these points will already be familiar. Brief explanations are provided for the items that are not self-explanatory.

The checklist follows. Read through it quickly now, and download the list from the course resources later.

Data Connection

| Check | Explanation | Status |
| --- | --- | --- |
| Connections in place and working, with no errors/warnings? | | Not Started |

Extractions

| Check | Explanation | Status |
| --- | --- | --- |
| Should we be using Replication Cockpit? | This is a consideration if the existing pipeline is too slow or needs to support an operational / high-speed use case. | Not Started |
| Full extraction loads run in less than 12 hours? | 12 hours is a benchmark that typically should not be crossed for extractions. The points below can help reduce this run time. | Not Started |
| Replication Cockpit not used during extractions? | If you use both Data Jobs and the Replication Cockpit, make sure they do not run at the same time. You can use the Replication Cockpit Calendar function along with Data Job schedules to set this up. | Not Started |
| No "unused" / "disabled" tables present in extractions? | | Not Started |
| Columns extracted limited to only those necessary? | | Not Started |
| Filters applied to large tables? | | Not Started |
| All extractions placed in a single data pool, and data connections exported to process-specific data pools? | This is a best practice to avoid extracting the same data more than once. | Not Started |
| Dynamic Parameters utilized in the Delta Filter section (Last Loads, Change Number, etc.)? | This applies if you use delta extractions with Data Jobs. See the sketch after this table. | Not Started |
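For the dynamic-parameter item, a minimal sketch of a delta filter expression, assuming Celonis's `<%=...%>` placeholder syntax for Data Job parameters; the column name (`AEDAT`) and parameter name (`LAST_DELTA_LOAD`) are hypothetical:

```sql
-- Extract only rows changed since the last successful load.
-- LAST_DELTA_LOAD is a hypothetical dynamic parameter configured to
-- resolve to the timestamp of the last successful extraction.
AEDAT >= <%=LAST_DELTA_LOAD%>
```

This keeps each delta run proportional to the volume of changed rows rather than to the full table size.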
Transformation scripts

Review each step in each transformation script for:

| Check | Explanation | Status |
| --- | --- | --- |
| a. Ensure that changes to any Marketplace Connector are commented with initials, date, and commentary | Commented changes with dates allow for easier Connector updates in the future. The sketch after this table shows items (a) and (c) together. | Not Started |
| b. Ensure each block of code has an unambiguous explanation of its purpose | | Not Started |
| c. `ANALYZE_STATISTICS('XXXX');` used on all temporary tables | | Not Started |
| d. No SELECT DISTINCTs (unless a comment is present explaining why one is needed) | | Not Started |
| e. Appropriate naming convention utilized: Cases table: `«Process Name»_«Table Name»` (e.g. `CLAIMS_CASES`); Activities table: `_CEL_«Process Name»_ACTIVITIES` (e.g. `_CEL_CLAIMS_ACTIVITIES`) | | Not Started |
| f. Intuitive variable naming | | Not Started |
| g. No "unused" transformations (e.g. "Testing", "Sandbox") present in the Data Job | | Not Started |
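For items (a) and (c), a minimal sketch of a commented change to a connector transformation; the initials, table, and column names are hypothetical:

```sql
-- 2023-04-10 JS: Restricted the standard Marketplace Connector logic to
-- company code '1000' per customer requirement (item a: initials, date,
-- and commentary on the change).
DROP TABLE IF EXISTS TMP_PO_ITEMS;
CREATE TABLE TMP_PO_ITEMS AS
SELECT EBELN, EBELP, MATNR, WERKS
FROM EKPO
WHERE BUKRS = '1000';

-- Item (c): refresh optimizer statistics on every temporary table.
SELECT ANALYZE_STATISTICS('TMP_PO_ITEMS');
```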
Transformations - Additional

| Check | Explanation | Status |
| --- | --- | --- |
| Temporary tables utilized? | Use temporary tables if you run similar joins across multiple transformations. | Not Started |
| Ensure that there are no cartesian (many-to-many) joins present | | Not Started |
| Use WHERE EXISTS rather than joins where applicable | See the sketch after this table. | Not Started |
| Can transformation jobs be run in parallel? | If transformations are independent of one another, you can consider splitting them into separate Data Jobs and running them in parallel on a schedule. | Not Started |
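For the cartesian-join and WHERE EXISTS items, a minimal sketch using hypothetical SAP-style tables (`EKKO` purchase-order headers, `EKPO` line items). A join used only as a filter multiplies header rows whenever several line items match, which in turn tempts a SELECT DISTINCT; WHERE EXISTS filters without changing the row count:

```sql
-- Join used purely as a filter: duplicates EKKO rows when EKPO has
-- several line items per order, forcing a DISTINCT to clean up.
SELECT DISTINCT h.EBELN
FROM EKKO AS h
JOIN EKPO AS p
  ON p.EBELN = h.EBELN;

-- WHERE EXISTS returns one row per order with no DISTINCT needed.
SELECT h.EBELN
FROM EKKO AS h
WHERE EXISTS (
    SELECT 1
    FROM EKPO AS p
    WHERE p.EBELN = h.EBELN
);
```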
Data Model Loads

| Check | Explanation | Status |
| --- | --- | --- |
| No error messages on Data Model load (including warnings)? | | Not Started |
| Using tables instead of views to load into the Data Model? | See the sketch after this table. | Not Started |
| Using a Data Model with the minimal number of tables and columns for a high-speed use case? | | Not Started |
| Subscribed to all Data Models? | | Not Started |
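For the tables-instead-of-views item, a minimal sketch, assuming a hypothetical activity view `V_CEL_CLAIMS_ACTIVITIES`: materializing it as a table means the Data Model load reads precomputed rows instead of re-executing the view's query on every load.

```sql
-- Materialize the (hypothetical) activity view as a table, following the
-- _CEL_ + «Process Name» + _ACTIVITIES naming convention, and point the
-- Data Model at the table rather than the view.
DROP TABLE IF EXISTS _CEL_CLAIMS_ACTIVITIES;
CREATE TABLE _CEL_CLAIMS_ACTIVITIES AS
SELECT * FROM V_CEL_CLAIMS_ACTIVITIES;

-- Refresh statistics on the newly created table.
SELECT ANALYZE_STATISTICS('_CEL_CLAIMS_ACTIVITIES');
```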
Replication Cockpit

| Check | Explanation | Status |
| --- | --- | --- |
| Replication Cockpit replicating without errors? | | Not Started |

Scheduling

| Check | Explanation | Status |
| --- | --- | --- |
| Full / delta loads scheduled, enabled, and running? | | Not Started |

Execution History

| Check | Explanation | Status |
| --- | --- | --- |
| Processing time for the delta ETL run (Extraction > Transformation > Data Model) less than 1 hour (unless other circumstances override this)? | | Not Started |
| Schedules have no errors in recent history? | | Not Started |

Data Validation

| Check | Explanation | Status |
| --- | --- | --- |
| Confirm that the customer has approved the accuracy of the raw data and activity steps | | Not Started |
