Saturday, 28 June 2014

Data Conversion


Data Conversion term has been perceived and interpreted in different ways. Everybody I have interacted with, describes Data Conversion differently. In very simple term it is a process of converting data from source to target, however it is important to understand that though it sounds simple but is a complex process and requires number of iteration of development and testing similar to any other project. To start with it is important to understand difference between terms Migration/Re-Platform and Data Conversion.

MIGRATION or RE-PLATFORM vs Data Conversion

Many times people get confused with Migration and DATA Conversion; these terms are many times incorrectly understood as synonym of each other. This results in both being looked and evaluated through same lenses which creates some confusion with respect to scope of work under Data Conversion.














Figure above illustrates difference between Migration and Data Conversion, as shown above Migration involves much wider scope and Data Conversion being small part of overall delivery supported by technology with support of other areas.
DATA Conversion as name suggests involves DATA and converting it from source to target through an ETL process. ETL is a common term with respect to Data Conversion and involves Extraction Transformation and Loading of data from Source to Target system. Data conversion is a very specialized area of work under a wider project Migration and is very critical to overall success of any Migration.

Data Conversion Life Cycle

Now that we know what is Data Conversion it is right time to delve into how is Data Conversion Same Same but Different from other work streams within project. After working on few Data Conversion projects, I have realized that Data Conversion follows a standard life cycle of the project but at the same time has number of nuances which makes it different from standard projects and requires deviations. Similar to every project Data Conversion follows standard Waterfall Model, however it is very different with respect to set of activities performed and deliverables at each stage of SDLC. Unfortunately there are no best practices to be followed for any Data Conversion project, hence this paper provides some inputs or direction on how a Data Conversion project should be approached or managed. Irrespective of type of data and magnitude of project activities below should be applicable to most software projects involving data conversion.
Different companies have evolved Waterfall model to suit internal process, but basis of the model remains same i.e. it is a sequential process which progress through various processes. For the purpose of discussion we will split the complete project into following phases:

·       INITIATE
·       ANALYSE
·       DESIGN
·       BUILD
·       VALIDATE
·       IMPLEMENT



Illustration above sums up various activities performed in Data Conversion, now we will try to break this and try to fit each activity to various project phases starting with Initiate.

INITIATE

Initiate phase defines the project scope, from Data Conversion point of view it is the stage where Data Conversion System Scope is identified and defined. In simple terms this phase requires getting answers for following questions –
·       WHERE is data stored in current system and WHERE will data get stored in target system?
Answer to above question is basis of all future activities, WHERE question will tell us the current and target data stores. Most important and basic thing to be kept in mind is if system does not store Data is it outside the scope of Data Conversion. Once we have answered above question, next step is to determine
·       WHAT data need to be converted?
By answering WHERE question we can determine systems which store data, but answering WHAT question is most critical as it will refine the system list and might eliminate some of the systems we enumerated by answering the WHERE question. Data can be stored in number of formats in different systems, including hard copy documents inside a cabinet or a customer database on a personal desktop of CEO or a database within card processing system etc. Answer to WHAT question will help in defining a more defined scope of Data Conversion, we will be able to determine that files inside the cabinet or customer database on personal desktop of CEO is not in scope of Data Conversion hence helping us to reduce the scope.
After we have done due deliberation in getting answers to WHERE and WHAT all-simple tasks are complete i.e. systems in scope of Data Conversion are defined and we are ready for the tough job i.e. to answer HOW? There is no single answer to this question but based on my previous experience I recommend that each system in scope of data conversion should be categorized into one of the following –
ü  Data Conversion

Systems categorized under Data Conversion require an Extract Transform & Load (ETL) process to be developed to take data from source to target,

ü  Data Load

Systems categorized under Data Load are new (or existing) data stores and require one time initial load at the time of migration in order to build data within the system.

ü  Data Refresh

System categorized as Data Refresh are existing data stores used in target state as well and require some type of a data feed sent to support target systems

This categorization will help us in definition of strategy of converting data into each of the system. Strategy might involve using some ETL tool like informatica or abinito or can be based on developing some in-house ETL tool.

 

Initiate phase ends with defining Data Conversion System Scope and deliverable produced is a Data Conversion Scope & Strategy which provides details on each system in-scope and also out of scope. Though at this stage we don’t define a detailed design but is important to document some of the vital points like how data is going to be converted is it going to be a big bang or an incremental data conversion, differences in approach for static and dynamic data conversion etc?

ANALYSIS

Most challenging area in Data Conversion is to determining specifics of data that needs to be converted from source to target. Unlike any enhancement project where requirements are provided by business and defines what exactly is required for data conversion there are no requirement document and in many cases no business to define what data should be converted? Data conversion is always considered too technical for any business area to provide requirements. However that said this does not mean data conversion is like working in darkness, some clear requirements are always there more of basic data obligations that we have to meet e.g. legal and compliance requirements to ensure specific information is made available for 10 years from specific date etc.

But what is left is for technical team to figure out what data should be converted. Here again there is no rule book to help in identification but what I think is best rule is to convert data required to support Business As Usual based on target system and not based on what is currently available on source system. It is very important to have the mindset of looking at things from target system and not from current system point of view.  Analysis phase is the phase in which we define each individual system data conversion scope, it is important to understand difference between scope defined at INITIATE and ANALSYIS phase, at initiate phase we define we need to convert data from a system A with data stored in access, system B with data stored in Informix whereas ANALYSIS phase defined within system A we need to convert tables customer, office and address. So it is getting into more specifics under each system. 

It is like Data Discovery exercise in which we delve into details of data structure of source systems to answer same WHERE and WHAT questions we answered at INITIATE but now more specific to system in question.  WHERE is data stored within system A and WHAT data is stored within system A.




At end of this phase what comes out is a detailed Data Dictionary of source system. Many times most of the products already have a very clearly defined data dictionary and that makes life much easier for data architects.

DESIGN

By design phase we have already understood what data is stored where in source system and by this time we should already have in-depth understanding of data structure on target systems. This phase is all about answering HOW data required on target system needs to be converted from source system. Exercise in defining the rules for data is commonly known as Data Mapping which defines rule for each data element on target system.
Emphasis on target system is important, as complete data mapping exercise is not about where to store data from source system into target system but is about how to convert data into target system from source system. At first read both approaches might sound same but diving in details it will make it clear that it is not like shifting our house where everything from old house is taken and shifted to new house. Instead it is about selectively shifting items from old house to new house depending on availability of place holder in new house. At the end we might end up leaving a sofa in old house as it is not required in new house.
In addition to defining data mapping rules for each data element Design phase also includes coming up with a design for ETL process for system conversion. ETL or Extract Transform & Load process distinctly splits the complete process of moving data from source to target system into 3 phases:
·       Extract refers to process of how data needs to be extracted from the source system.

·       Transform refers to process of transforming or converting data from source to target data structure

·       Load refers to process of loading data into target system

Depending on source and target systems, technology used and various other factors these 3 processes can be distinctly defined or can be overlapping or sometimes even not visible in a data conversion process.




 Deliverables at end of this phase are data mapping sheets which provide detail mapping rule for each data structure and second is a specification document which is a detailed design for ETL process required to convert data.
Another important deliverables which I feel is very useful is Migration Functional Solution produced at this stage. Though data conversion is mainly a technology delivery however it is important to get business approval on what data is converted. In few organizations business might agree on reviewing field level data mapping sheet which is best option however considering the technical details present in data mapping sheet it is advisable that a more business friendly specification document is produced which in business terms explains and covers all critical aspect of moving data from source to target system
Reconciliation specification or strategy is one of the most critical aspect of Data Conversion and as an output of design phase a specification document should be produced which contains details on various KPI used to prove or verify data is correctly moved from source to target system. It is essential to ensure that reconciliation strategy includes KPI for validation for both completeness and correctness of data. There is no standard list of KPI and it totally depends on source and target system on what KPI are used.

BUILD

There is nothing special or different from any other project with regards to BUILD phase, this is the phase where actual development is done for the ETL process.

VALIDATE

Validate or Testing is another challenging phase for data conversion, unlike other software projects where testing, defect fixing and retest is much easier because of high number of test cycles. For data conversion depending on volume of data and complexity of data mapping rules, testing can be very difficult and like finding of needle in haystack. It is a daunting task to find data for all scenarios and test them through ETL process, hence it is recommended that as much as possible from first phase of testing for data conversion should be performed using actual live size data volume and if possible with actual data that needs to be converted. This mitigates the risk of finding issues towards the end of project.
Mock and dress Rehearsals are general terms used for the purpose of defining data conversion testing. Some organizations just use single term and some use these as distinct phases. Mock referring to initial test phases where focus is more on verification of data mapping rule validation, reconciliation, ETL build, performance of each system in scope of data conversion individually. Whereas dress rehearsals are more focused on rehearsing complete cut over data activity where data conversion is just part of it, and cover a much wider scope. Another focus of dress rehearsal is to determine and confirm that all activities (including data conversion) can be completed end to end within acceptable time window.

IMPLEMENTATION

Last phase before CLOSURE, covers the actual activity which needs to be performed on the actual day when data needs to be converted from source to target system. This is the time when all the hard work done in earlier phases is delivered in form of an ETL process to move data successfully from source to target.



At the end I hope this paper will be helpful for all experts who are venturing into a Data Conversion project and it will help in understanding how Data Conversion is “Same Same but Different” from other software projects.

No comments:

Post a Comment