Checklist Best Practices Known Issues v4


Checklist 1 (S.No, Applicable Phase, Title, Checklist Y/N, Comment)

1. [Analysis] Limitations in tech stack technology tools identified and documented?
   Comment: Any limitations with the existing tech stack have to be compensated with workaround logic and additional manual effort, causing delays.

2. [Analysis] Was the discovery phase done to finalize the inventory or scope with the client?
   Comment: Without a final inventory, project development pipelines may be impacted at any phase of the project, leading to significant rework.

3. [Analysis] DB objects inventory categorized based on size?
   Comment: Need to know the number of large-volume objects being handled throughout the engagement to align with data load and delivery plans.

4. [Analysis] End-to-end lineage of source scripts identified and finalized?
   Comment: Without proper lineage, source scripts might be missed, leading to rework and data validation mismatches.

5. [Analysis] Validating the identified lineage against the scheduling tool sequence flow?
   Comment: This may give insight into missing information on lineage and dependencies.

7. [Analysis] Data availability in lower environments during project execution agreed?
   Comment: Lower environments (dev, QA) need sufficient data for testing purposes to avoid data mismatches during later stages of the project.

8. [Analysis] Is the priority of work planned, i.e. which objects/subject areas are to be worked on in which phases?
   Comment: Sharing information between teams on data anomalies, data inconsistencies, etc. saves a lot of time.

9. [Analysis] Data validation criteria and accepted deviation defined and agreed with the customer?
   Comment: To ensure data quality, the accepted deviation from source data should be finalized. See the validation sketch after this checklist.

10. [Analysis] Plan in place to address source-to-target limitations and exceptions?

11. [Analysis] Analysis done on load management (target DB) between existing apps running in prod and the new development work?
    Comment: Any new development work should not affect the apps already running in production. Work should be segregated accordingly.

12. [Analysis] Availability of required documentation for the apps being migrated?
    Comment: Agree with the customer on the approach and on deviations arising from document unavailability.

15. [Design] Validation approach identified for UT, SIT, UAT?
    Comment: Without a proper testing plan, data quality might be compromised.

17. [Design] Decided on lift-and-shift migration or rearchitecting as part of the shift?
    Comment: All performance issues may not be addressed in a lift and shift.

18. [Design] Performance optimization approach/plan identified?
    Comment: This will be helpful in the long run to reduce job run times in production.

19. [Design] Provisioning server configuration based on the analysed data volume in scope?

20. [Design] Performance optimization techniques identified and implemented for execution engines/clusters?

21. [Execution] Identifying code reviewers and establishing a code review approval process before moving code to higher environments?

22. [Execution] DBA approval received for the projected load on prod servers?
    Comment: To avoid load on Teradata leading to ID blocks, prior permission from the DBA is necessary before running the load.
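As a hedged illustration of the validation items above (rows 9 and 15), the sketch below compares row counts and an order-independent checksum between a source extract and the migrated target table in PySpark. The table names, column handling and hashing approach are illustrative assumptions, not part of the original checklist.

# Minimal PySpark sketch: row-count and checksum comparison between a source
# extract and the migrated target table. Table names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("migration-validation").getOrCreate()

src = spark.table("source_extract.customer")   # assumed source snapshot
tgt = spark.table("target_lake.customer")      # assumed migrated table

# 1. Row-count comparison.
src_count, tgt_count = src.count(), tgt.count()
print(f"source={src_count}, target={tgt_count}, diff={src_count - tgt_count}")

# 2. Order-independent checksum: hash each row, then sum the hashes, so row
#    ordering differences between the two systems do not matter.
def table_checksum(df, cols):
    hashed = df.select(F.xxhash64(*[F.col(c).cast("string") for c in cols]).alias("h"))
    return hashed.agg(F.sum(F.col("h").cast("decimal(38,0)")).alias("cs")).collect()[0]["cs"]

common_cols = sorted(set(src.columns) & set(tgt.columns))
print("checksums match:", table_checksum(src, common_cols) == table_checksum(tgt, common_cols))

Any accepted deviation agreed with the customer (for example a tolerated row-count difference) can be applied to these numbers before flagging a mismatch.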
Checklist 2 (S.No, Applicable Phase, Title, Checklist Y/N, Comment)

1. [Analysis] Datatype compatibility checks done between source and target DB?
   Comment: This helps in reducing data quality issues.

2. [Analysis] Access level validation on the source side (inclusive of all drill-down levels)?

3. [Analysis] Access validation on the target side (some targets may need delete privilege)?

4. [Analysis] Scheduling / data migration window identified for data migration?
   Comment: Avoiding business peak time.

5. [Analysis] Datatype limitations (like BLOB/CLOB/BINARY/FLOAT) specific to the target execution engine?
   Comment: Databricks is an example: FLOAT values converted to DOUBLE can have precision round-off issues.

7. [Analysis] Catch-up load time frame discussed, considering Go-Live deadlines?
   Comment: The catch-up load needs to be completed in time, otherwise it may delay Go-Live.

8. [Analysis] Identification of current job execution timings to perform data validation when comparing against daily refreshing data?
   Comment: Need to identify a proper window for data validation after the load, as prod data is refreshed on a daily basis.

10. [Analysis] Defining equivalents for non-compatible datatypes between source and target?
    Comment: This activity should be performed during the initial phase of the project.

11. [Analysis] Concurrency and performance optimization parameter checks on the source side to increase throughput efficiency?
    Comment: Having source batch IDs that can handle large volumes and parallel executions.

12. [Analysis] Agreed on retaining leading/trailing spaces during migration?
    Comment: To avoid reporting tool issues during later stages of the project.

14. [Analysis] Decommission plan for bridging tables (used in more than one application/warehouse)?
    Comment: Warehouses may go live one by one, which will impact the data in these tables.

15. [Analysis] Identifying the number of encrypted-column tables involved in the migration of sensitive data?
    Comment: If we are migrating from a non-prod environment, the data needs to be decrypted and re-encrypted in the lower environment. Test scripts also require changes.

16. [Analysis] Tables which have hard deletes in the source DB (daily/weekly/monthly)?
    Comment: Needs additional effort to run the script in the target DB after data migration/catch-up.

20. [Design] View migration analysis and approach finalized?
    Comment: Drill-down views.

21. [Design] Partitioning required and applicable?
    Comment: This will enable faster data offload/migration.

22. [Design] Data retention policy for offloaded historical data (in ADLS / storage layers)?
    Comment: Cost consideration.

23. [Design] Data migration strategy identified for large-volume tables?
    Comment: Splitting into years/months to avoid issues in case of file-system storage.

24. [Design] Incremental / catch-up load strategy and execution duration analysis?
    Comment: Avoid delays during Go-Live.

25. [Design] Plan to test end-to-end data validation across data lake layers?
    Comment: Avoid data reloads at a later stage of the project.

26. [Design] Server / hardware sizing based on the volume in scope?
    Comment: The app server should have sufficient space to copy data, and the copied data should be removed once the full load is done.

28. [Design] Is there any sensitive data, and what is the pre-defined testing approach for these tables' data?

29. [Execution] Identified approach for null / blank value handling between source and target?
    Comment: Helps in reducing data quality issues, as nulls sometimes get converted to blanks when the source value is not recognized by the tool.

30. [Execution] Identified approach for any newline or ASCII character handling between source and target?
    Comment: Helps in reducing data quality issues. A newline causes data to be shifted to the next line, causing count and data mismatches.

31. [Execution] Foreign-language character compatibility check between source and target?
    Comment: Not all foreign languages are supported in the target, so samples need to be validated.

32. [Execution] Validation of data migration from views traversing multiple layers?
    Comment: Data and datatype mismatch validation.

33. [Execution] Usage of appropriate delimiters and field separator values to avoid data mismatch and data inconsistency issues?
    Comment: A field delimiter present in the data causes a data shift, so a delimiter combination needs to be identified that will not be present in the data. See the ingestion sketch after this checklist.

34. [Execution] Availability of data in lower environments to perform data validation against stagnant data?
    Comment: Sufficient data needs to be available in the dev and QA environments for proper testing.

35. [Execution] Case sensitivity checks in DDLs and source data between source and target DBs?
    Comment: This will help in avoiding column name and data mismatch issues.
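Several of the Execution-phase items above (null/blank handling, newline characters, delimiter choice) come down to how extracted files are written and read. The PySpark sketch below shows one way to make those choices explicit when reading a delimited extract; the file path, separator character and null token are assumptions for illustration.

# Hedged sketch: read a delimited extract with explicit separator, quoting,
# null-token and multiline settings so that embedded newlines, empty strings
# and separator characters inside the data do not shift columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("extract-ingestion").getOrCreate()

df = (spark.read
      .option("sep", "\u0001")         # assumed: a control character unlikely to appear in the data
      .option("quote", '"')            # quote fields that contain the separator
      .option("escape", '"')
      .option("multiLine", "true")     # keep embedded newlines inside quoted fields
      .option("nullValue", "\\N")      # assumed null token agreed with the extraction team
      .option("emptyValue", "")        # keep genuine empty strings distinct from nulls
      .option("header", "true")
      .csv("/mnt/raw/customer_extract/"))   # hypothetical landing path

df.printSchema()

The same separator, quoting and null-token choices should be used on the extraction side so that counts and values reconcile during validation.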
Checklist 3 (S.No, Applicable Phase, Title, Checklist Y/N, Comment)

1. [Analysis] Is there a need to merge multiple scripts into a single script based on an identified pattern?

2. [Analysis] Do scripts that belong to different layers require different implementations?
   Comment: Some layer-specific changes, e.g. DWL layer scripts adhere to a different naming convention compared to ACQ layer scripts.

3. [Analysis] Handling of DB-specific constructs in the target DB (volatile tables)?
   Comment: In Teradata, volatile tables are temporary tables. There are various ways to implement the functionality of a volatile table in the target, e.g. as views or normal tables. See the sketch after this checklist.

4. [Analysis] Will additional custom code be required on top of the converted scripts?
   Comment: The target environment may require pre-processing and post-processing of tables before executing the actual SQL transaction.

5. [Analysis] Availability of source objects that are non-compatible with the target?
   Comment: Certain source-specific objects such as record error, sys and activity tables will not be applicable in the target environment.

6. [Analysis] Analysis of complex scripts and finding performance improvement patterns?
   Comment: Avoid multi-joins and implement techniques to rewrite existing queries in a more optimized way.

7. [Analysis] Analyzing execution time across each layer and reporting performance issues in case of a performance lag compared to the current DB?
   Comment: This will help to locate and identify the important factors that cause delays in execution and reduce performance.

8. [Analysis] Cost estimation for new object deployment in the target or reporting DB?
   Comment: For example, Azure SQL DB.

9. [Design] Alignment with the customer on the naming convention to be followed for containers/files/folders?
   Comment: In the case of storage like ADLS Gen2 or Delta Lake.

10. [Design] Approach identified and pattern analysis done on key handling?
    Comment: Surrogate keys.

11. [Design] Data alignment validation based on the different tools used for data loading (historical/incremental)?

12. [Design] Analysis of and agreement on data processing requirements for BI layers?
    Comment: High-concurrency and low-concurrency data.

13. [Execution] Validating and mapping key function or workaround availability between source and target?

14. [Execution] Converted output code review by an SME to meet customer standards and expectations?
    Comment: The coding structure, standards and naming conventions of the converted code need to be evaluated by subject matter experts.

15. [Execution] Avoiding hardcoded parameters in scripts to eliminate manual updates of key variables and connection parameters?
    Comment: This will reduce the manual changes to be made on top of the converted scripts, and scripts can be reused across applications. Parameterization is also shown in the sketch after this checklist.

16. [Execution] Processing time validation across each data lake layer and validating the requirement for maintaining history in each layer?
    Comment: Based on the data retention requirements across layers as per the architecture.

17. [Execution] Options considered to enhance cluster performance during script execution?
    Comment: Photon accelerator in Databricks; cluster fine-tuning options to enhance performance and parallel executions across teams/projects.
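As a hedged sketch of the volatile-table workaround (row 3) and the no-hardcoding item (row 15) above, the snippet below mimics a Teradata volatile table with a session-scoped Spark temporary view and takes run-time parameters from Databricks widgets instead of hardcoding them. The widget names, schema and table names are illustrative assumptions, not part of the original checklist.

# Sketch: a session-scoped temporary view is one way to mimic a Teradata
# volatile table in Databricks; widgets avoid hardcoded parameters.
# dbutils is available inside Databricks notebooks; names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Run-time parameters supplied via widgets instead of being hardcoded.
dbutils.widgets.text("target_schema", "dwl")
dbutils.widgets.text("load_date", "2024-01-01")
target_schema = dbutils.widgets.get("target_schema")
load_date = dbutils.widgets.get("load_date")

# Equivalent of a volatile/temporary work table: visible only to this session
# and gone when the session ends.
spark.sql(f"""
    CREATE OR REPLACE TEMPORARY VIEW stg_orders AS
    SELECT order_id, customer_id, order_amount
    FROM {target_schema}.orders
    WHERE order_date = DATE'{load_date}'
""")

spark.table("stg_orders").show(5)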
Best practices (S.No, Title, Comment)

1. Key tables identification during the initial phase of the project
   Key tables identification is a significant factor for the project timeline. Defining proper keys and implementing a surrogate-key approach during the initial phase of the project reduces rework and pipeline issues. A hedged surrogate-key sketch follows.
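The sketch below is one hedged illustration of a surrogate-key approach: deriving a deterministic key by hashing the business-key columns so that reloads and parallel pipelines produce the same value. Table and column names are assumptions.

# Sketch: deterministic surrogate key derived from the business-key columns,
# so reloads produce the same key value. Names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.table("acq.orders")   # hypothetical source table

orders_with_sk = orders.withColumn(
    "order_sk",
    F.sha2(F.concat_ws("||", F.col("order_id").cast("string"),
                             F.col("source_system").cast("string")), 256)
)
orders_with_sk.select("order_sk", "order_id", "source_system").show(5, truncate=False)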

2. Partition columns identification
   Partitioning the tables by date allows the ETL pipeline to target only the partitions/folders that need to be processed, greatly improving read performance. This applies to both full-load and incremental ingestion patterns. Partitioning also helps with Delta table management scenarios such as running the OPTIMIZE command at the partition level, as in the sketch below.
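A hedged sketch of this pattern: writing a Delta table partitioned by a date column and running OPTIMIZE against only the partition that was loaded. The table names, column name and partition value are assumptions.

# Sketch: write a Delta table partitioned by a date column, then compact small
# files only for the partition just loaded. Names/values are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.table("staging.sales_extract")       # hypothetical staged data

(df.write
   .format("delta")
   .mode("append")
   .partitionBy("load_date")                     # partition column agreed with the customer
   .saveAsTable("dwl.sales"))

# OPTIMIZE scoped to a single partition (Databricks Delta).
spark.sql("OPTIMIZE dwl.sales WHERE load_date = '2024-01-01'")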

3. Handling tables with larger volumes during historical data offload
   Identify a data breakdown policy. Break large tables into smaller chunks of data, for example by year, before the history load to avoid putting too much load on the Teradata server. Also check the capacity of Teradata for data extraction while batch jobs are executing. A chunked-extraction sketch follows.
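One hedged way to implement the year-wise breakdown is to push a year filter into each extraction query so every chunk is a separate, bounded read from Teradata. The JDBC URL, driver setup, credentials, table names and year range below are placeholders for illustration.

# Sketch: offload a large history table in year-sized chunks so a single huge
# extract does not overload the Teradata server. Connection details are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:teradata://<host>/DATABASE=edw"    # placeholder connection string

for year in range(2015, 2024):
    chunk = (spark.read.format("jdbc")
             .option("url", jdbc_url)
             .option("driver", "com.teradata.jdbc.TeraDriver")  # driver jar assumed on the cluster
             .option("user", "<user>")                          # placeholder credentials
             .option("password", "<password>")
             .option("query", f"SELECT * FROM edw.sales_history "
                              f"WHERE EXTRACT(YEAR FROM sale_date) = {year}")
             .load())
    chunk.write.format("delta").mode("append").saveAsTable("hist.sales_history")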

4. Using cloud advantages to enable faster data load options
   For history data loads, TPT is a slower process than NOS. During client engagements, the team needs to ensure Teradata Vantage is enabled for leveraging NOS capabilities for faster history data loads. The configurations and server permissions required for this should also be considered, along with the app server storage configuration for copying intermediate data.

5. Understanding the extraction scheduling window
   The Teradata server should have the capacity to support large-volume data offloads and to extract tables/views in parallel. A runtime window should be clearly set for history data loads so that batch executions are not affected and slowness issues are mitigated. Scheduled jobs should also be monitored.
6. Incremental strategy for data offloads
   Change data feed (CDF) is a feature for Delta tables that require incremental loads. CDF allows data changes to be identified efficiently in the form of INSERT, UPDATE, MERGE and DELETE operations against the base table. Setting proper incremental columns and the right filter conditions reduces DB overload. A hedged CDF sketch follows.
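As a hedged sketch of the CDF usage described above: enabling change data feed on a Delta table and reading only the changes produced after a known version. The table name and version number are assumptions.

# Sketch: enable Delta change data feed on a base table and read only the
# changes recorded after a given version. Names and versions are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable CDF on the base table (one-time table property change).
spark.sql("ALTER TABLE dwl.orders SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read inserts/updates/deletes produced after version 42 of the table.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 42)
           .table("dwl.orders"))

changes.select("order_id", "_change_type", "_commit_version").show(5)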

7. Identified and agreed data retention policy
   The operational reason for implementing a data retention policy is proper data backup to aid recovery in the event of data loss. Setting data retention policies for inactive (deleted) data and enforcing them for different tables using both the Delta VACUUM feature and the ADLS soft-delete feature saves cost. A hedged retention sketch follows.
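A hedged illustration of enforcing the Delta side of such a policy: setting a deleted-file retention property on a table and then vacuuming files older than that window. The 30-day value and table name are assumptions to be agreed with the customer.

# Sketch: enforce a retention window for deleted data on a Delta table.
# The 30-day value is an assumption; the actual policy is agreed with the customer.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    ALTER TABLE dwl.orders
    SET TBLPROPERTIES (delta.deletedFileRetentionDuration = 'interval 30 days')
""")

# Physically remove data files no longer referenced and older than 30 days (720 hours).
spark.sql("VACUUM dwl.orders RETAIN 720 HOURS")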
8. Optimization done using Databricks
   Periodically run ANALYZE TABLE ... COMPUTE STATISTICS to make sure the Spark optimizer has an accurate understanding of the data distribution of Delta tables. This specifically helps AQE (Adaptive Query Execution) make better optimization decisions at execution time, as in the sketch below.
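A hedged sketch of this practice: computing table and column statistics for a Delta table and making the AQE setting explicit. The table name is an assumption.

# Sketch: refresh table- and column-level statistics so the optimizer and AQE
# can make better decisions. Table name is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Collect table-level and column-level statistics.
spark.sql("ANALYZE TABLE dwl.sales COMPUTE STATISTICS FOR ALL COLUMNS")

# AQE is on by default in recent runtimes; this makes the setting explicit.
spark.conf.set("spark.sql.adaptive.enabled", "true")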
