Session E-OLAP1&2, E-DTS

Interfacing Visual FoxPro to
OLAP Services

Val Matison
Matison Consulting Group Inc.


Table of Contents

Introduction
Data Warehouse
The Problem with OLTP Data
Data Mining
Performance Measures
Building the Data Warehouse
Dimensional Design
Dimensions
Deriving knowledge from data
DTS - Data Transformation Services
Data cleaning
Referential Integrity - Orphaned rows
Corrupted Data
Log Files
Filtering
Data Conversions
Indexes
Accessing the DTS package from VFP
Using a Custom VFP ActiveX DLL
Refreshing Data
Controlling DTS from within VFP
DTS Object Model
DTS Example
Type Indicators in a Microsoft SQL Server Provider
Accessing OLAP data from VFP

Introduction

This document is an introduction to interfacing Visual FoxPro to Microsoft's OLAP Services. The interface includes using FoxPro-based tables for data and Visual FoxPro for programming logic. Note that I said FoxPro-based data, ignoring the Visual modifier: there are older systems still undergoing transformation into the newer Windows DNA model, and that data can be successfully imported into the current technology. The programmatic interfaces, however, are all based on VFP. Most of this document focuses on DTS, primarily because this aspect of OLAP-related work consumes the largest amount of time in any OLAP project. That said, I have tried to touch on each aspect of OLAP development to provide a good introduction for the data warehouse developer.

Data Warehouse

It is not uncommon to find corporations that have many gigabytes of data yet very little knowledge of what that information contains. OLTP systems can provide pages upon pages of summary data but little insight into the meaning hidden in the numbers. The production database permits immediate queries, such as the list of items on an invoice for a given customer. However, it is difficult if not impossible to determine sales of last quarter vs. sales of this quarter for the customers with the highest transaction volumes. Asking the DBA this type of question would elicit a comment like "You can't get there from here". He's correct.

The data warehouse acts as a consolidated repository that provides access to accurate information. It resolves the problem of an abundance of data but a dearth of information. This physical database is the place where people can access data that is unavailable to them from regular production systems. The DW is the consolidation of all the data for business processes from all the systems that pertain to them. It is structured to facilitate queries of the type that show trends, expose inconsistencies or highlight hidden facts. It facilitates access because data is structured for quick response. Reports based on one aggregation can quickly be drilled down to provide additional detail. Many views of the data are also obtainable, providing insight where none previously existed. The DW allows a business to respond quickly to changing customer needs, reduce time-to-market, discover new markets and products, and improve business processes.

Data in the warehouse is consistently in sync with corporate definitions. This means that data values are the same for the same attributes, units for measures are the same, and even the manner in which processes are recorded is consistent. Data from many sources can be based on different weights and measures, so the database designer picks a standard and sticks to it. Data types are synchronized to be consistent. Dates are dates and not a combination of string, numeric or date data types. Sales are recorded based on a definition agreed to by the participants of the design team.

The DW is the source for quality data. You might say the source is the OLTP data; however, the raw source data is often filled with inconsistencies that must be resolved before it has any value in the data warehouse. These inconsistencies include corrupted data, orphaned records, incomplete transactions, mismatched data types and other issues that make it difficult to obtain accurate information.

Finally, the DW is the repository for historical information. Trends can be obtained from the data showing how the key business indicators fare over time. It can be argued that a data warehouse is not only the data but also the tools that allow you to query and view the data.

The Problem with OLTP Data

OLTP data represents the current state of a system, since transactional information is rarely kept for more than a month in a large system. A business transaction can take place over time, but in the context of an OLTP system it is represented only in the current phase of its life. What happens if an item is short shipped from a supplier or back ordered to the customer? Two months ago this might have been reflected in the OLTP system, but today the short-shipped item has arrived and the item has been sent off. The order is marked as complete, but the important information regarding the supplier and an unhappy customer was lost. The DW captures snapshots of the data in time, providing the foundation for ongoing analysis of the company.

An OLTP system is often normalized to promote fast transactions and quick response times, but this does not facilitate querying. In addition, data often isn't in the same place all the time. How easy is it to submit a query that spans more than one database? Heterogeneous joins are possible between databases, but this requires the databases to be structured in a manner that facilitates such joins. SQL Server permits joins like this through OLE DB. Even so, unless databases are designed with data exchange in mind, it is likely to be very difficult to query them in real time. Large queries against OLTP data degrade performance, and users cannot tolerate lengthy delays in processing transactions, so such queries are often put off until late in the evening or other periods of low user activity.

Data Mining

Data mining is the process of knowledge discovery from data contained within a database. Statistical relationships and patterns are used to answer questions such as: Who is the model customer? Can credit card fraud be detected? Which items coexist on the same invoice? The methods used in data mining include visual associations, statistical relationships and even genetic algorithms. Data mining can benefit from the data cleaning activities performed with DTS, but the methods used to find patterns are quite different from OLAP. It is included here since the next release of SQL Server will provide some fundamental data mining functionality. Some of the techniques for data mining include:

Associations/Affinity

This is often called market basket analysis. Rules are developed for associations between sets of items. This allows marketers to group items in a store or promote them together.

Sequences/Temporal Patterns

This technique is used to identify time-based affinity.

Clustering

By grouping similar items according to statistical similarities, marketers can discover distinct groups in their customer bases. Companies can then use this knowledge to develop targeted marketing programs. Clustering might be used, for example, to identify distinct customer segments.

Classification

Classification is often a result of clustering. For example, a bank might use it to predict whether an applicant is credit-worthy. By mining its database of existing loan customers, the bank could develop two classes of loan applicants (likely to default and likely to make good) based on two factors: income and debt.

Forecasting/Prediction

Forecasting is a different type of prediction from the other techniques. Rather than predicting the likelihood of an event, it predicts the future value of a variable, for example revenues or claims paid. After a forecasting model is built, companies can use it for "what-if" scenarios. A company might vary the input variables of advertising, product location and pricing to determine the impact on sales.

Performance Measures

Consider the types of questions to be asked of the DW. In the case of a satellite cable TV supplier the following questions are typical. Do associations between products exist? If someone buys pay-per-view wrestling, will they buy pay-per-view boxing? If a customer buys the sports channel, will he also buy specialty sports events? What makes a customer more likely to buy adult programming than someone else? What indicators show that someone will buy more than one service? What parts of the country purchase which service? Are there seasonal patterns in what people buy? Is there any unique group of programming that makes people buy a service? Who is the ideal customer? How many dollars per hour per dealer vs. consumer sale does the sales team make? Why do people buy short duration contracts when longer terms are less expensive? Is gender an indicator of buying behavior? Of course there are many other questions to be asked, depending on the business process; the list is endless. Review the following list to see what kind of information is sought after in a data-warehousing project.

Objectives Common to Many Target Groups

Objectives for Specific Industries

Health Care

Finance/Banking

Manufacturing

Returns from an investment in a DW are usually realized quickly, since the information can be put to use in reducing costs and making existing business processes more profitable. Target marketing to customers who are more disposed to buying a particular product reduces costs and increases overall revenue.

Building the Data Warehouse

There are many steps to building a DW, not unlike any other large software project. The list that follows outlines some of the steps involved in building a DW. This list focuses on data preparation, since this is often the most critical aspect of the project. Like any project, decisions must be made in advance as to what will be achieved by implementing the DW. In OLTP systems, poor design decisions made at the beginning of the project will negatively impact the entire application. In a DW, poor comprehension of what the data means and how it is related results in a weak application.

Sample Project Steps

Dimensional Design

The dimensional model, or star join schema, is a direct reflection of the manner in which a business process is viewed. The star schema is the most useful because it permits browsing of the data, joins are simple, and drilling down, up or even across is easily facilitated. It is tempting to create a snowflaked design by normalizing the dimensions, but this is discouraged since additional joins are then required for almost every query, degrading performance considerably.

Figure 1 is the database schema for the Northwind database that ships with SQL Server. This structure is very common for database design. Figure 2 shows the dimensional version of this same database. The dimensional version facilitates queries due to its design and aggregations.

The central table is called a fact table. It contains performance indicators that are used to measure a business process. In the Northwind example, this table is called SALES_FACT. The outlying tables are called dimension tables and they contain parameters or criteria by which the central table can be queried.
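As an illustration of how such a schema is queried (the table and column names in this sketch are assumed, not taken from the actual Northwind example), a typical star-join aggregation might look like this:

SELECT d.cal_year, c.country, SUM(f.sales_amount) AS total_sales
  FROM sales_fact f
  JOIN time_dim d ON f.time_key = d.time_key
  JOIN customer_dim c ON f.customer_key = c.customer_key
 GROUP BY d.cal_year, c.country

Every query follows the same shape: join the fact table to one or more dimensions, constrain or group by dimension attributes, and aggregate the measures.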

Dimensions

The most common dimension is time. As the business is in flux, the time dimension will reflect trends and changes in the data. A typical query might ask, "What are the top 10 customers buying this month compared to last month, and how are those products doing against last quarter?" The time dimension records the database as it existed at a given point in time. Incomplete orders or backordered items are important indicators of business activity, and they are recorded with a time attribute associated with the data.

Deriving knowledge from data

A well architected database will include many attributes on which to query. The art of any data warehouse system is the ability to determine which attributes are indicators of future activity and which are spurious. It is vital to extract the greatest amount of information from any piece of data, particularly if there are only a few attributes in the database. Figure 3 shows the amount of information that is obtainable from just one date. Some aspects of this are trivial, but others may indicate why sales are poor or booming.
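As a small sketch of the idea in VFP (the attribute set shown is illustrative), a surprising number of time-dimension attributes can be derived from a single date value:

* Derive several time-dimension attributes from one date
ldOrder = {^1998-03-14}
? YEAR(ldOrder)                      && calendar year
? MONTH(ldOrder), CMONTH(ldOrder)    && month number and month name
? CEILING(MONTH(ldOrder) / 3)        && calendar quarter
? DOW(ldOrder), CDOW(ldOrder)        && day of week as number and name
? INLIST(DOW(ldOrder), 1, 7)         && weekend flag (1 = Sunday, 7 = Saturday)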

DTS - Data Transformation Services

DTS is a tool that facilitates the efficient transfer of data from one source to another. Sources can include almost anything, as long as ODBC or OLE DB can talk to the source and destination; the data source can be a database or even a text file. Data can be manipulated using simple or complex custom transformations, and even database objects can be moved. A number of interfaces are available to the developer, including SQL, mail, and other services that aid in the process of moving data from one location to another. DTS can be manipulated from an application via COM, simply called from an application, or augmented with ActiveX controls. Elements can execute in serial or parallel fashion, with constraints placed on any process to ensure accurate porting of data. DTS is used not only to move data but also to validate it and to clean it. The most recent service packs for SQL Server should be installed before attempting DTS with Visual FoxPro tables. An ODBC DLL that ships with the original release of SQL Server 7.0 has a problem with VFP data, and an older version of the DLL is required to get any meaningful extracts from VFP data. This older DLL is available from Microsoft's web site, but it is not required if the most current releases of the SQL Server software are used.

Data cleaning

Data cleaning includes a number of different operations under one umbrella. For a data warehouse to work properly, data must be as clean as possible. Without good data, the warehouse is suspect and won't have much value. The first question that comes to mind is how the OLTP system can be trusted if it is so "dirty". The simple answer is that every database system has some bad data. That data could be corrupt, like having dates of February 30, or it can be invalid because it doesn't conform to business rules. Fields might have blank values because operators were too busy to record so-called non-essential data, the program logic could be faulty and not record information correctly, and so on. I have never seen a database of any significant size that was perfectly clean. Unfortunately, FoxPro, and in particular FoxPro for DOS, allows corrupt data to reside in the database, usually without any operational problems!

Referential Integrity - Orphaned rows

Records that do not have corresponding foreign keys in the parent table must be evaluated for their usefulness in the system. If the line items for an invoice reference a non-existent invoice header, there may be sufficient information buried in the line items to rebuild a header. If the problem can't be resolved, then a decision must be made to simply ignore this data and not include it as part of the port. A simple method to remove unwanted orphaned rows is to write an SQL procedure that removes them once the rows have been imported into the database.
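As a minimal sketch of such a cleanup step (the table and key names are assumed), a procedure run after the import could delete the orphans directly:

-- Remove line items whose invoice header never made it into the database
DELETE FROM order_items
WHERE NOT EXISTS (SELECT 1
                    FROM order_header h
                   WHERE h.order_id = order_items.order_id)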

Another option is to ignore these rows, since they will be removed anyway if further processing is done via joins between the two tables. It is sometimes easier to import data from a source into a temporary holding area, apply transformations to that data, and then port it into the final warehouse. This is particularly true in the early stages of development, when the structures are still being worked out. Once the warehouse is operational, the end user will often ask questions that require additional modifications, and having space for temporary tables is a bonus for these types of requests. Once data is ported to the final warehouse, rows that fail the join condition are simply ignored and do not migrate any further. Most systems will address all of these concerns during the import stage and import directly into the DW.

Corrupted Data

Yes, this happens. It's very annoying since it's often very hard to find. SQL Server is far less tolerant of corrupted information than FoxPro, so the data being imported must be fairly clean. It is still possible for the odd thing to sneak through to SQL Server, and you won't know about these types of problems until you begin working with the data. A decision must be made to throw out the data or, if it is very important, to try and fix it. All data is important, but OLAP entails looking at a macro view of things, and usually losing a few rows here and there is inconsequential. Clearly, if the discarded data represents a significant portion of the business, then something has to be done to keep it. It is not always possible to ascertain up front whether there are problems with source data. Referential integrity is a simple matter to address, but corrupted data is more difficult to find since VFP has no built-in mechanism for validating the data. A numeric field can contain character data, for example, or dates might have 40 days in the month. An initial pass of the data through DTS will highlight problems.

Log Files

The initial import of data into SQL Server can prove enlightening. All databases have errors. DTS will stop at the first error unless the number of allowed errors is changed on the advanced tab of the data transformation properties dialog. Change this value to a high number on the initial pass of the data, and be sure to include an error log. The place to look for errors is in this log file, as displayed below. Note that the log file for the entire package will list no errors if errors are logged as part of the data transformation process. The following is a small excerpt from the error log. The row number is given for each error along with an error description that at least points to the field that caused the problem.

Filtering

DTS can be configured to import a subset of the source table. This is most easily accomplished by using an SQL statement as the source of data. The SQL dialect used is the one supported by the source data. For example, the following statement is valid in SQL Server:
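A date filter, for instance, can be written with a string date literal (the table and column names in this sketch are assumed):

SELECT * FROM orders WHERE ord_date >= '1998-01-01'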

But in Visual FoxPro the same statement must be written with a date literal rather than a string:
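Continuing the illustration with the same assumed table and column:

SELECT * FROM orders WHERE ord_date >= {^1998-01-01}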

The SELECT statement can be used to combine several tables if required, but statements like UPDATE or DELETE are not supported. Within the SQL statement it is perfectly permissible to use functions to convert data types, perform concatenations, and so on. It is preferable to filter out data that you know will never be required up front, so no errors can occur at a later stage. Occasionally source tables contain rows flagged as being of no further use, yet the rows were never deleted from the source data. If these rows are not required for any further transactions, removing them at the onset makes sense. Filtering is also another way of validating data. A field that is critical to further processing but is empty may render that row invalid, and the row should be excluded from any further processing.

Note that deleted rows in Visual FoxPro are not retrieved into SQL Server regardless of any SET DELETED defaults established in VFP. If deleted rows are required as part of the import then rows must be recalled within VFP.
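A minimal VFP sketch of recalling the rows before the import (the table name is illustrative):

* Make deleted rows visible to the import by recalling them first
lcDeleted = SET("DELETED")
USE orders EXCLUSIVE
SET DELETED OFF            && ensure the ALL scope includes deleted rows
RECALL ALL
USE
SET DELETED &lcDeleted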

Data Conversions

Data can be converted in the original select extract, defined as part of the Transformations tab in Data Transformation Services, or after the fact by manipulating data from a temporary table into the final data warehouse. When the conversion is performed in the original select, you have the full use of VFP's function library at your disposal; user-defined conversions, however, are not allowed at this stage. The most common conversions include character to date, string to numeric and standard string manipulations.

The problem with using VFP's functions at this early stage is that VFP is more lenient about problems within the host data. The date ranges for smalldatetime and datetime in SQL Server are different, and empty FoxPro dates convert into different values. FoxPro's CTOD() function will also happily perform conversions on invalid input, none of which are correct.
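As a minimal defensive sketch (the field and table names are assumed), the character-to-date conversion can be validated in the VFP-dialect source statement itself, substituting a sentinel date when CTOD() cannot produce a valid one:

SELECT cust_id,
       IIF(EMPTY(CTOD(ship_dt_c)), {^1900-01-01}, CTOD(ship_dt_c)) AS ship_date
  FROM orders

The sentinel value keeps the row inside smalldatetime's range so the DTS transformation does not reject it.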

Indexes

Use TRUNCATE TABLE before importing large amounts of data. This operation is not logged row by row, so very little information is written to the transaction log. The other option is DELETE FROM tablename, but that is much slower. I also drop the indexes and then recreate them after the import. This is implemented via an Execute SQL task.
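A minimal sketch of the statements such a task might run (the table and index names are assumed):

-- before the import
TRUNCATE TABLE sales_fact
DROP INDEX sales_fact.ix_sales_fact_time

-- after the import completes
CREATE INDEX ix_sales_fact_time ON sales_fact (time_key)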

Accessing the DTS package from VFP

DTS packages can be created and accessed from VFP. The following code shows how to access DTS packages that are located either in the repository or on disk. The package ID can be obtained by looking at the package's properties dialog and noting the ID there. The repository saves multiple versions, so if the version ID is excluded, the most recent version is used. You must include the curly braces as part of the package ID. This example also looks at the Steps collection. There is no single place to determine how long the package took to run, but each Step object does record its elapsed time; to determine how long the entire package took, sum the ExecutionTime property of each Step object.
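A minimal VFP sketch of the idea (the server, file and package names are placeholders; LoadFromRepository works similarly, taking the package GUID, curly braces included):

LOCAL oPkg, oStep, lnSeconds
oPkg = CREATEOBJECT("DTS.Package")

* Load a package stored in SQL Server; 256 = use a trusted connection
oPkg.LoadFromSQLServer("(local)", "", "", 256, "", "", "", "MyImportPackage")

* A package saved to disk as a structured storage file loads like this instead:
* oPkg.LoadFromStorageFile("c:\dts\myimport.dts", "", "", "", "MyImportPackage")

oPkg.Execute()

* Sum the ExecutionTime of every step to time the whole package
lnSeconds = 0
FOR EACH oStep IN oPkg.Steps
    lnSeconds = lnSeconds + oStep.ExecutionTime
ENDFOR
? "Package ran in", lnSeconds, "seconds"

oPkg.UnInitialize()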

Using a Custom VFP ActiveX DLL

Visual FoxPro can be used to control how data is imported into SQL Server by building an ActiveX DLL. In the following example, VFP is used in combination with the Microsoft Internet Transfer Control to import data from the web into SQL Server. This DTS package consists of two data connections, one to a text file and the other to SQL Server. The data transformations have already been defined for the text file, since this package will be used to refresh the data whenever the DTS package is run. For more help on importing text into SQL Server, see the documentation. An ActiveX script task is required to control the process of getting the data; if the retrieval is successful, the data will be imported.

VFP will be used to retrieve data from the web via an ActiveX control. Build this by creating a class based on the Form class and, in the Class Info dialog, checking OLE Public; the class will not be exposed by the DLL otherwise. Next, add the Microsoft Internet Transfer Control to the form the way you would any other control. You may have to add the control to your controls list before you begin. The control has a number of different settings, but the defaults will suffice to show how the whole process works. Now add a new property and a new method to the form. The property will hold the results of the data transfer. The custom method contains the code to retrieve the data and to write it to a local file. The listing for the custom method follows below.
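A minimal sketch of such a method, assuming the Internet Transfer Control on the form is named oleInet, the new property is cResult and the method is GetWebFile (all of these names are illustrative):

* GetWebFile(tcUrl, tcLocalFile) - retrieve a URL and write it to a local file
LPARAMETERS tcUrl, tcLocalFile
LOCAL lcData
THIS.cResult = ""
* OpenURL with data type 0 (icString) returns the resource as a character string
lcData = THIS.oleInet.OpenURL(tcUrl, 0)
IF VARTYPE(lcData) = "C" AND !EMPTY(lcData)
    THIS.cResult = lcData
    STRTOFILE(lcData, tcLocalFile)
    RETURN .T.
ENDIF
RETURN .F.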

The class must be added to a project and then compiled. Before compiling select the Build COM DLL option. The DLL is automatically registered on the development machine but it can be registered anywhere using regsvr32.exe which resides in the Windows 2000 operating system folder.

The next listing shows the ActiveX script that is used to talk to the VFP ActiveX DLL. Note that the project name and the class name are used to instantiate the object from the script. The custom method is called and a status check is made to see whether the function completed successfully; if so, the data is imported into SQL Server.
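A sketch of what that ActiveX script task could look like in VBScript (the ProgID mywebproject.webfetch is the VFP project name plus the class name; it, the method name GetWebFile, the URL and the file path are all illustrative):

Function Main()
    Dim oFox
    ' Instantiate the VFP COM object by its ProgID
    Set oFox = CreateObject("mywebproject.webfetch")
    If oFox.GetWebFile("http://www.example.com/data.txt", "c:\temp\data.txt") Then
        Main = DTSTaskExecResult_Success
    Else
        Main = DTSTaskExecResult_Failure
    End If
    Set oFox = Nothing
End Function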

Refreshing Data

It is important to resolve how and when data will be refreshed. How will data be copied from the production environment to the SQL Server engine? DTS will place a lock on the VFP table, meaning that all users should be logged out of the system; the simplest solution is to perform DTS imports when everyone is out of the system. Full table copies will also place a burden on the production environment. One method of locking the table is to create an ActiveX DLL that locks a table. Once the lock is obtained, the data is copied to a temporary holding area. The lock is then released and the data is imported into SQL Server from the copy. If there are space limitations, the temporary copy can be zapped once the import completes.
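A minimal VFP sketch of the lock, copy and release idea for such a DLL method (the table and file names are illustrative):

* Lock the production table, snapshot it for DTS, then release the lock
LPARAMETERS tcTable, tcSnapshot
LOCAL llOk
llOk = .F.
USE (tcTable) IN 0 SHARED ALIAS src
SELECT src
IF FLOCK()
    COPY TO (tcSnapshot)      && DTS imports from this copy
    UNLOCK
    llOk = .T.
ENDIF
USE IN src
RETURN llOk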

Controlling DTS from within VFP

DTS can be controlled directly from within VFP using the DTS Package Object Library, which is implemented in DTSPKG.DLL. The entire DTS object model is accessible, allowing the creation, execution and saving of DTS packages. Considerably more functionality is exposed in the object model than in the visual package designer; even simple things, like determining a step's execution time, are available within a package but not in the designer. Four collections are exposed in the DTS object model: Connections, Tasks, Steps and GlobalVariables.

DTS Example

The following example will create a DTS package from within VFP and then execute it. The first step requires that an object of type DTS Package be created. Basic properties are then set for the entire package. In the code that follows, it is possible to see similarities to the checkboxes within the DTS designer, such as writing a status message to the event log upon completion.
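A sketch of that first step (the package name and settings chosen here are illustrative):

* Create the package and set basic package-level properties
oPackage = CREATEOBJECT("DTS.Package")
oPackage.Name = "VFP_to_SQL_Import"
oPackage.Description = "Import VFP customer data"
oPackage.FailOnError = .T.
oPackage.WriteCompletionStatusToNTEventLog = .T.    && the designer's event-log checkbox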

Connections must be established for both the source and the destination. The VFP source uses a DSN that points to the database container. The destination uses the local server and a trusted connection to establish the login. Finally, a reference is made in the destination connection to the database being used, DTS_TEST. Connections are added to the package once they are created.
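Continuing the sketch (the DSN name and server are assumptions; DTS_TEST comes from the text above):

* Source connection: VFP data through the OLE DB provider for ODBC and a DSN
oSource = oPackage.Connections.New("MSDASQL")
oSource.ID = 1
oSource.Name = "VFP Source"
oSource.DataSource = "vfp_dsn"               && DSN pointing at the database container
oPackage.Connections.Add(oSource)

* Destination connection: local SQL Server with a trusted connection
oDest = oPackage.Connections.New("SQLOLEDB")
oDest.ID = 2
oDest.Name = "SQL Destination"
oDest.DataSource = "(local)"
oDest.UseTrustedConnection = .T.
oDest.Catalog = "DTS_TEST"
oPackage.Connections.Add(oDest)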

A step is then added to the package; the step's TaskName property contains a reference to the task. The task in this example uses a simple SQL SELECT statement to retrieve two columns from the customer table. The task references the connection IDs for both source and destination, and the final destination table is set with the DestinationObjectName property.
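A sketch of the task and step (the column, table and destination names are illustrative; the empty copy transformation is assumed to apply to all selected columns):

* Data pump task: pull two columns from the VFP customer table into SQL Server
oTask = oPackage.Tasks.New("DTSDataPumpTask")
oPump = oTask.CustomTask
oPump.Name = "CopyCustomer"
oPump.SourceConnectionID = 1
oPump.SourceSQLStatement = "select cust_id, company from customer"
oPump.DestinationConnectionID = 2
oPump.DestinationObjectName = "customer_import"

* A plain copy transformation; with no columns listed it copies every selected column
oXform = oPump.Transformations.New("DTS.DataPumpTransformCopy")
oXform.Name = "CopyColumns"
oPump.Transformations.Add(oXform)
oPackage.Tasks.Add(oTask)

* The step refers to the task by name; then run the package
oStep = oPackage.Steps.New()
oStep.Name = "Step_CopyCustomer"
oStep.TaskName = oPump.Name
oPackage.Steps.Add(oStep)
oPackage.Execute()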