Key Tokenization

In my previous post I mentioned having to move data from an Oracle 12c database into a SQL Server database. SQL Server still had several applications that needed to read from it, but the Oracle database was the source of truth. In our case, the applications that sat on top of the SQL Server database were read-only.

At the beginning of the project, I was told that IDs in the Oracle database were generated using a standard sequence that produced numbers for new records higher than the top value in SQL Server. Not only did I find out this wasn't true, but we had the extra caveat that simply dropping the alpha part of the ID would result in number collisions across the two databases.

For this situation we tokenized the alpha parts of the generated IDs into a number. So essentially we had IDs of a form similar to 'prod123456' that, when tokenized, would look like '10123456'. We then used this tokenized ID to match against the SQL Server database for the purposes of uniqueness in the ODI packages. Legacy IDs that were imported into Oracle retained their numeric IDs. This solution worked for this particular project because of the data growth projection.

Due to how I broke the task up, I used the Oracle REGEXP_LIKE and REGEXP_REPLACE functions in a CASE statement to pull the working set the way I needed it.

CASE
    WHEN REGEXP_LIKE(TABLE.FIELD_NAME, '^\\d{0,10}$') THEN TO_NUMBER(TABLE.FIELD_NAME)
    WHEN REGEXP_LIKE(TABLE.FIELD_NAME, '^[A-Za-z]{1,10}\\d{0,10}$') THEN TO_NUMBER(REGEXP_REPLACE(TABLE.FIELD_NAME, '^[A-Za-z]{1,10}', '10'))
    ELSE 0
END
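If the Oracle regex is hard to eyeball, here's the same tokenization logic sketched in Python. This helper is my own illustration of the idea, not part of the actual ODI mapping:

```python
import re

# Mirrors the CASE statement: purely numeric IDs pass through untouched,
# alpha-prefixed IDs get their prefix collapsed to the token '10',
# and anything else falls to the default of 0.
NUMERIC_ONLY = re.compile(r'^\d{1,10}$')
ALPHA_PREFIX = re.compile(r'^[A-Za-z]{1,10}(\d{0,10})$')

def tokenize_id(raw_id):
    if NUMERIC_ONLY.match(raw_id):
        return int(raw_id)
    m = ALPHA_PREFIX.match(raw_id)
    if m:
        return int('10' + m.group(1))
    return 0

print(tokenize_id('prod123456'))  # 10123456
print(tokenize_id('123456'))      # 123456
print(tokenize_id('bad-id!'))     # 0
```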

This approach allowed me to leave my existing numeric IDs alone and tokenize the alpha parts of the others into a number. The same concept could be used with a different number; 10 just fit my requirements. I had to make sure I wasn't going to overflow the integer column in the SQL Server database.
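The overflow concern is easy to check with quick arithmetic. The digit counts below are assumptions based on the ID format above, just to show the shape of the check:

```python
# SQL Server's INT column tops out at 2,147,483,647 (10 digits).
SQL_SERVER_INT_MAX = 2_147_483_647

def max_tokenized(token, numeric_digits):
    """Largest key the token plus an all-9s numeric part could produce."""
    return int(token + '9' * numeric_digits)

print(max_tokenized('10', 8))   # 1099999999  -> fits in an INT column
print(max_tokenized('10', 10))  # would overflow INT
```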

NOTE: You'll notice the double backslashes in the regular expressions. This was needed due to how ODI Studio sends the values you enter to generate a script. I needed the double backslashes so that the generated SQL had a single backslash and the regular expression would work correctly. I used regular expressions in other parts of the query, but only the SELECT portion of the script had this anomaly. Some things you don't want to dig too deeply into. This is one of them.

Moving Data with ODI Studio

I've recently been working on a project shuffling data from an Oracle 12c database into a SQL Server database, and I wanted to get some thoughts on the project down somewhere so I can refer back to them. The client is heavily invested in a product called ODI Studio, and while I've never used this particular product, I have migrated data from one database to another.

ODI Studio has quite a learning curve, but once you get proficient with it, you can create data-moving scenarios fairly quickly using many of the built-in modules. You can even modify the built-in modules and create fairly powerful data migration modules if you are willing to look through the BeanShell documentation for ODI.

I plan on creating a series of these blog posts outlining some of the approaches I took and some gotchas that I ran across. This first post is about the general approach.

Overview

I always approach data projects in a similar way. In this case the newer database was the Oracle database, and it was the source of truth. Several other applications needed to stay on the existing SQL Server database, so the task was to do a nightly push of data from the Oracle database to the SQL Server database using ODI.

The schemas between the databases are very different, so after analyzing the data, I first create a set of temporary tables in the source database that mimic the destination tables I will be inserting into and updating. So for instance, if I have a table in the destination called Foo, I would create a table in the source database called Temp_Foo.

Since I'm dealing with different RDBMSs that have different data types, I approximate and do the best I can with types. String types are fairly straightforward: varchar2's in Oracle map over to varchar's in SQL Server. Integers are interesting because the two systems handle them very differently. Oracle's NUMBER type requires you to specify the maximum number of digits, so if the largest value you need to store is 9999, your data type will be NUMBER(4,0). The first parameter is the precision, the total number of digits, and the second is the scale, how many of those digits fall after the decimal point. Most of the columns in my SQL Server database were integers or strings, so this worked out rather nicely. Since an integer field in SQL Server can hold a max of 2,147,483,647, the field in my temp table needed to be NUMBER(10,0) to hold all of the digits.
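The precision arithmetic is easy to verify. This little helper is my own, purely for illustration of why the temp column needs ten digits of precision:

```python
def digits_needed(max_value):
    """Total digits an Oracle NUMBER precision must cover for a whole number."""
    return len(str(abs(max_value)))

SQL_SERVER_INT_MAX = 2_147_483_647

print(digits_needed(9999))                # 4  -> NUMBER(4,0)
print(digits_needed(SQL_SERVER_INT_MAX))  # 10 -> NUMBER(10,0)
```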

Once you have the fields in the temp table specified, you can use ODI Studio to start mapping fields over. I'll get to this in a separate post; this one is just about the general approach. Putting data into the temp tables is done using Integrations in ODI Studio. You will have at least one integration for each table in the destination database that you want to populate and keep up to date. This is honestly where I did most of the hard work. The whole point of creating temp tables that mimic the destination tables is to leverage the power of ODI Studio to do what is essentially a batch of merge statements on the destination database.

The built-in modules do a fairly good job of these insert/update/delete statements, which are the next steps.

Summary

  1. Create temp tables in your source database, 1 per table in your destination that you want to migrate data to.
  2. Create an integration per temp table that will select data from the source database that mimics the destination data format and insert that into the temp tables.
  3. Create an integration per temp table that will do inserts and updates in the destination tables based on data in the source temp tables.
  4. Create an integration per temp table that will do deletes in the destination tables based on data in the source temp tables.
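The steps above can be sketched as a per-table plan. The table names and plan strings here are hypothetical stand-ins for the actual ODI integrations; this just shows the ordering:

```python
def nightly_plan(destination_tables):
    # One temp table per destination table, then populate -> merge -> delete.
    steps = []
    for table in destination_tables:
        temp = f'Temp_{table}'
        steps.append(f'populate {temp} from the Oracle source')
        steps.append(f'insert/update {table} from {temp}')
        steps.append(f'delete from {table} where the row is missing in {temp}')
    return steps

for step in nightly_plan(['Foo', 'Bar']):
    print(step)
```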

Setup an Azure Virtual Machine to Test Github Pages Posts

I think it's no secret that I'm a Windows guy, so testing out blog posts for my Github Pages site isn't the easiest thing to do. Rather than try to get Jekyll working on Windows, I went another route. I'm always wanting to expand my knowledge and stretch what I know, so I decided to get a little closer to the environment that Github uses, which is Linux.

I'm writing this post on a Surface Pro 3, and running VMs on it messes with the Instant On capabilities of the device. In light of this, I opted to try an Azure VM loaded with the latest Ubuntu image. I used the wizard on the Azure portal to create a new VM using the Ubuntu 14.10 server image from the gallery. I took all the defaults in the wizard with two exceptions.

  1. I just set up a username and password instead of setting a certificate/key file.
  2. I forwarded external port 80 to internal port 4000.

The first item is just to make my life initially easier; you could just as easily have set up a key, I simply chose a username/password instead. The second item will come into play later once we have everything set up on the box and want to test our blog posts.

Setting up the box with Jekyll

Once Azure tells us the virtual machine is up and running, we can ssh into it using the username and password we set up earlier. Use the following command, substituting your username and machine name. You'll be asked for a password; enter what you set up during VM creation and then you'll be in!

ssh username@machinename.cloudapp.net

This is a blank virtual machine with nothing on it and the meat of this blog post is getting it to the point where you can run Jekyll.

Update apt-get and install prerequisites

Before we can get Jekyll and ruby up and running, we need to make sure our apt-get is updated and we have a few things on the box. Run the following commands.

sudo apt-get -y update
sudo apt-get -y upgrade
sudo apt-get install -y git-core curl nodejs

Next, we'll get Ruby up and running using RVM. For those just getting started, you could probably skip this step, but from what I understand from my Ruby friends, you want to use RVM to allow different projects on a server to use different versions of Ruby. This is a very similar setup to what ASP.NET vNext will use, so you may want to get familiar with the concept. The instructions are paraphrased from here.

gpg --keyserver hkp://keys.gnupg.net --recv-keys D39DC0E3
\curl -sSL https://get.rvm.io | bash -s stable --ruby

The above will put RVM on your system and install the stable version of Ruby. I found I needed to exit the Linux session and then ssh back into the box to get access to the rvm command. Once you've done that, you can finish out by installing the correct version of Ruby and getting bundler installed.

rvm install 2.0
gem install bundler

Your Repository and Jekyll

You should clone your repository, cd into the repository root, and ensure that you have a Gemfile in your repository root with the following contents.

source 'https://rubygems.org'
gem 'github-pages'

Once you have that, use bundler to make sure all the gems from the Gemfile are installed.

bundle install

Once that gets finished, you should be good to go. Just run the following command to start your server on port 4000.

bundle exec jekyll serve

Use your browser to hit your virtual machine at its cloudapp.net address. This will hit port 80 on the box, which, if you remember from earlier, we forwarded to port 4000 internally. You should see your Github Pages site!

Pointing My Domain at Hover to Github Pages

The first order of business in standing up my blog was to point my domain, registered at Hover, at my Github Pages repository. Now, I'm not a networking guy, but being a webby type person, I reckoned I could figure it out.

After a little reading of the help documentation, I was ultimately going to need to do two things:

  1. Use A Records to point my domain to the IP addresses that Github uses to serve the Jekyll blog.
  2. Use CNAME Records to point my top level domain as well as the www subdomain to my Github Pages site at sdanna.github.io.

My config for that looks like the image below:

[Image: Hover Configuration for Github Pages]

It's important to note that when you are creating the CNAME records, you put the trailing dot (.) after the URL. Otherwise you will not have a good time.