The following document was converted from a MS Word Document. This document
was used during intitial design discussions to help determine the scope
of the project and may or may not reflect the actual implementation design.
By: Norm Wright
SIS Team/Grad. Admissions Team
Jonathan Ives
People Database/IT Integration Team
Date: November 5, 1997
Version 5
1.1. Introduction *
1.2. Expected Audience *
1.3. Next Steps *
2.2. What is the status of the People Database? *
2.3. What is the Technology behind the People Database? *
2.4. Who is Responsible for the People Database? *
3.2. How do duplicate MIT Id’s get generated? *
3.3. Duplicate checking needs to be done anyway *
5. Outline of Matching Process and Suspense Resolution? *
5.2. Processing (1) Definite Matches to an existing Person *
5.3. Processing (2) Definitely does NOT match anyone. *
5.4. Processing (3) may or may not match some person(s) *
6.2. The API is in written in C *
6.3. The algorithm needs to be tested in a batch environment *
6.4. The MIT Id Assignment Scheme needs to be coordinated *
6.5. People Database may not have sufficient information to resolve questionable matches *
6.6. People Database does not have a separate Name Suffix *
6.7. MITSIS "creates" other people *
7.2. Need a high hit ratio *
7.3. Current API May need more info for suspense resolution *
7.4. We will probably have to adjust the matching logic *
7.5. Code will have to be upgraded *
7.6. C DLL’s upgraded to 32 bit *
7.7. No need to Port to MAC *
7.8. Who will do this work? *
7.9. Could this be designed more generically? *
7.10. Kerberos 4 vers. Kerberos 5 *
1.2. General Rules *
1.3. Wide net of "potential" matches is cast *
1.4. People caught in wide net are compared to "new" person *
1.5. Each person caught in the net is categorized as follows: *
1.6. The rules for this classification are stored in a table, but match the following guidelines: *
1.7. Date of Birth Matching *
1.8. Name Matching *
1.9. ID Matching *
The Graduate Admissions system is currently being re-designed and integrated in with the MITSIS system. During this process, we hope to integrate the new system in with MIT’s "People" database. This document discusses that issue.
There are three groups of people for whom this document is intended. All members of these groups are programmers or have a technical background. The groups are:
The People database is the result of an IT integration effort to create a common infrastructure service for identifying people associated with MIT.
It is a database that contains basic biographic information about any person who has been associated with MIT including students, employees, contractors, and spouses. Not all contractors and spouses are currently available.
It has information from both MITSIS and the Personnel systems.
It provides access to this database via an API which allows a user to query the database. The API performs a complex and sophisticated search of the existing data and provides for the creation of new people and the assignment of an MIT Id to them.
It has a graphical client server front end that provides end user access to the database using the API. The front end can be downloaded off the WWW.
The People Database was originally developed several years ago (3 years?) by Scott Thorne, et. al. and is currently at release Version 1.
It uses the Data Warehouse as its source for information from the personnel and student systems.
Consideration for a new release of the People Database is underway. As of October 1997, this project is in the Discovery Phase with no target completion date. Information about this project is available at:
http://web.mit.edu.ezproxyberklee.flo.org/peopledb/www/v2
The database is Oracle. The version is currently behind slightly -- V7.1.4.1.1. An upgrade to V7.3.2 is planned but not currently scheduled.
The Mid Tier is a stand-a-lone Unix C program that listens to a particular port for requests. It establishes a connection with the client and processes requests by interrogating the database and returning results to the client. The client interaction is managed via a well defined API.
The API is accessible through calls to an API that has been deployed as a windows DLL.
The DLL was developed in C and is currently available only in a 16 bit Windows 3.11 version, which runs on Windows 95 and Windows NT as a 16 bit application.
The connection is fully encrypted and authenticated using Kerberos V. 4.
A Powerbuilder client was built in PowerBuilder 4.0 and is deployed only in the Windows Environment (3.1, 95, NT 4.0). The PowerBuilder application makes calls to the C DLL.
» Could JAVA call the Mid Tier?
http://web.mit.edu.ezproxyberklee.flo.org/peopeldb/www
Mail can be sent to:
peopledb@mit.edu
The main contact person is:
Jonathan Ives
x3-0590
E19-349L
It is hoped that by integrating the People Database in with the new Admissions system, and therefore MITSIS, we will be able to solve three main business needs:
Admissions has stated a business need to identify and clean up any duplicates prior to sending out decision letters. It is very important to make sure the right decision letter is sent to the right person and that the same person not get two different letters.
The Registrars Office is charged as part of academic accreditation standards with maintaining a unified academic history for each student that matriculates. Undergraduate and graduate records are supposed to be connected together.
The Admissions Office is NOT always the 1st point of contact. The Admissions Database is the primary entry point for MOST BUT NOT ALL students. It is where a basic biographic record for a prospective student is usually created and when that record is passed over to the Registrars Office a new student gets created in the MITSIS database and an MIT Id gets assigned. There is an implicit assumption in this process that the student is an entirely new person to MIT and that she did not exist in any other system prior to her creation and Id assignment in MITSIS.
This assumption of course is not necessarily true. On the graduate level, a new student may have been an employee or the spouse of another student. A new undergraduate student could have been the son or daughter of an employee who is on the health plan. These situations create serious problems and confusions. The person gets multiple Ids cards with different numbers. Information about the person cannot be readily matched and entered in the various systems.
It is hoped that integration with the "People" database will solve these problems and eliminate the creation of duplicate Id’s for the same person.
This needs to be done because it is NOT possible to force a thorough searching of the database before allowing the user to enter data into the database. Often the user does not have sufficient information to make a fully informed decision. As a result, duplicates are sure to appear.
In addition, duplicate checking needs to be done when a person becomes a student in order to make sure the Registrars Office maintains a unified academic history for the person.
It would be nice to do one set of "matching" against just the People Database to identify both duplicates within Admissions itself against Registrars Office and people whose point of 1st contact was not Admissions.
Contact The person sent scores to MIT or requested some information.
Inquiry The person has requested an application.
» Application The person has submitted an application
Admitted The person has been admitted by MIT
Student The person has said that she will matriculate
In theory, the earlier in the process we do the integration with the people database, the better. But in practice, before the Application stage the information that we have on the person is often fragmentary, is prone to errors. Therefore, integration with the people database cannot occur before this point.
Also, at the Application stage – before a decision is made – the admissions system has an internal business requirement to "clean up" the data and that duplicates be identified and eliminated. This needs to be done so that a decision can be correctly made and attributed to the proper person and so that same person does not get both an admit letter and a deny letter.
Currently Freshmen have their MIT Id’s assigned at the Application stage, when the data is passed over to MITSIS. This may seem a bit early, but there are several reasons for this:
» Taking all this into account, it appears that the integration with the People database should occur at the Application stage also.
All applicants that do not have an MIT Id need to be programmatically matched against the People database. The matching puts the applicant into one of three categories:
» The reason we only need to search on people who do not already have an MIT Id is because the system does not allow anyone to enter an MIT Id. Therefore, if one exists, it is because the new person was already manually matched to an existing person (and we hope they did a good job of the matching).
If the applicant definitely matches a single person in the people database, then that person’s MIT ID should be assigned to the applicant.
But before this can be done, the system should "double check" that the MIT ID does not already exist in MITSIS. The existing person’s ID from the People database should be used to query MITSIS. If the ID already exists in MITSIS then there probably is a duplicate between the applicant and a student in MITSIS.
Remember only people without ID’s are matched to the People database in the first place; so the original applicant and the MITSIS person who already has the ID must have different PIDMs.
Initially, at least, all these suspected duplicates should put into a suspense table for manual review.
If a suspected duplicate has been reviewed and the user says "Yes" the applicant definitely is the person who already exists in MITSIS then the system should go through and change the PIDM on the original applicant to match the PIDM of the real person. Either that or it should copy the original data over to the real PIDM and mark the original applicant as invalid. The assumption is that for whatever reason, the person was not found when originally browsing the MITSIS database but that all the data should go under the proper PIDM.
If an applicant falls into category (2) above by definitely NOT matching anyone in the People Database then she should be added to the People Database and have an MIT Id assigned to her.
This MIT Id should be posted into MITSIS for the applicant.
An MIT Id can also be assigned as the result of a suspense resolution where the person says the applicant definitely is new.
MITSIS currently contains data on all students going back to 1960. The people database contains only "recent" students (It goes back to students who were active since 1994 or so).
To solve this we need to expand the current MITSIS pass to the Data Warehouse to include all students in the both "active" and "archive" sections of the database.
This will add between 70,000 to 100,000 more "old students" to the people database. Jonathan Ives has checked with the DBA and was told that there is sufficient space to hold the NEW people. Also, performance probably should NOT be an issue, but it should be checked and indexes added if needed.
Currently, MITSIS is not sending the Date of Birth to the Data Warehouse (and therefore to the People Database).
Since the People database is fed via the Data Warehouse, the MITSIS DW extract needs to be changed to accommodate these changes. The MITSIS programming needed to accomplish this is almost complete and all we need is coordination with the Data Warehouse to get the new columns and additional people included in the file. This is planned for sometime after November.
» We need to investigate whether the DW can implement a "FULL" refresh of so many people nightly or whether we need to implement an incremental update.
The API that was written in version 1.0 of the People Database was written in C. This is a concern because MITSIS (and therefore Grad. Admissions) is using Stored Procedures for much of its processing. If the matching logic were a PL/SQL stored procedure then MITSIS and the People Database could be integrated together very tightly.
» The problem is that while C can call PL/SQL, the reverse is not true. PL/SQL cannot effectively call C.
One of the reasons the search algorithm was written in C is because of its inherent complexity and the lack of alternatives at the time, 1995. Technology may now have caught up and it may now be possible to write the algorithm in PL/SQL using cursor variables and PL/SQL tables.
There are other reasons why the API was in C including, portability and support for systems without direct access to Oracle.
We have three options:
The current People Database subscribers do not use it in a heavy duty "batch" environment. Until we pump large volumes of messy data against it we will not know whether adjustments will be needed to the basic algorithm.
In particular, the current algorithm does not determine if a person DEFINITELY DOES OR DOES NOT MATCH people on the People Database. Instead, it "ranks" them according to how close they match, relying on the user to browse the list and make the DEFINITIVE decision. This is fine for an interactive application, but with the large number of people to be checked, it would be impractical for the user to sit there all day manually reviewing each case.
MITSIS and the People Database use an identical scheme for assigning new MIT Ids to people.
The only difference is that the People database uses the prefix "92" (is it 92 or 90?) before attaching 7 random digits where MITSIS rotates between prefixes of "95" and "99".
As the People database takes on more and more responsibility for assigning Ids, MITSIS should give up its number space to the People Database.
People Restricts Access to SSN
Currently the SSN can be used to "query" the People database, but it cannot be returned for viewing. Often the SSN is just mistyped and a person reviewing the problem on a suspense screen would need to see the SSN to see that it was simply mistyped.
Jonathan Ives was concerned because his understanding was that the MIT ID was established so that SSN would not and need not be used. Unfortunately, although SSN is a good identifier, we are not permitted to display it. While some departments still may have access to this information within their systems, from what he understood they are not permitted to see this information for records coming from other departmental systems (this would also be true for home address info). Using SSN for search purposes is somewhat of a convenience. What we might consider is being able to modify the algorithm to only return records with exact match on SSN.
Norm’s understanding was that MIT wanted to get away from using the SSN as the KEY to the person’s computer based records at MIT. A separate issue is deciding who needs to have access to it and for what reasons.
People Database does not have Address
Often when searching and resolving suspense, a person’s address helps to resolve the issue. The people database does not have this information.
» There are privacy issues that need to be worked out here along with the access to the SSN.
Often the absence or existence of a suffix helps to tell apart a Jr. and Sr. who live at the same address.
In addition to somebody becoming "admitted", certain groups of students get created directly in MITSIS. This is a problem that we hope to address, but it will NOT be addressed immediately and the "multiple" points of entry creates potential headaches due to timing and related issues.
Such groups include:
» This will not be a major problem if we do refreshes every night between MITSIS and the People Database.
Create a PowerBuilder application that attaches simultaneously to both the MITSIS mid tier database and the people database. This application will replace logic currently provided by the MITSIS Cobol Matching Algorithm that is part of the current Admissions/MITSIS interface.
The new application will be written use the EXISTING People Database PowerBuilder UI CODE as its basis.
Instead of PROMPTING the user for NAME, SSN and DOB, the application will read that information from MITSIS and automatically call the People database API.
Depending on the response from the People database API, the application will do one of the following:
In order for this approach to work, the People Database matching logic must produce a fairly high hit ratio. 90% to 95% of the time, the application should DEFINITELY MATCH or DEFINITELY NOT MATCH. If too many records require manual review then the approach will become too cumbersome to implement.
As noted previously, the current People Database restricts view access to DOB, SSN and Address information, all of which are important in allowing a user to determine whether or not two people REALLY are the same person. This issue still needs to be addressed.
» We may be able to solve this by getting the "missing" data on students directly from MITSIS.
It appears that the current People database matching logic does NOT RETURN DEFINITE answers. We may be able to simply use the "closeness" number that is returned, but this will have to be checked.
The current People Database PowerBuilder code is in version 4.0. To allow us to integrate into MITSIS properly, we would need to upgrade it to version 5.0.
We may also have to upgrade the client API DLL’s to 32 bit too so they will work with PB 5.0.
Since this would be a single application running on one machine, we might be able to justify setting it up to run on just an NT machine.
Norman Wright can probably do the MITSIS database work (stored procedures, etc.)
Jonathan Ives can probably handle the C work required to port the Client API to 32 bit and any adjustments to the Matching Logic.
We need someone to port the existing PB code from version 4.0 to 5.0.
We would then need to make the changes required to integrate it into MITSIS.
Could this process be designed from a generic enough perspective that it would NOT be specific to MITSIS? I.e. could it be designed to take in a list of people and spit out a new list of the people containing the MIT ID? Basically it would be a "batch" version of the People Database.
» There may be such a need to but it would probably not need to be written in PowerBuilder in a client server mode. Instead it would probably be written to run all on the server.
This algorithm is implemented as part of the current MITSIS Admissions Interface program and is written in Pro/Cobol.
All names are stripped of spaces and punctuation and mapped to uppercase before any comparisons are made.
First a wide net is cast for people who "potentially" might be the person in question. This is done in several ways:
For each potential person selected in the wide net above we compare the three main pieces of data to see if it:
If either is Blank.
They match exactly year, month and day.
Months and Days match, but one Year is <= 15 years old
There are no 15 year old students at MIT.
Not allowed – must have at least a last name as a condition of doing the search.
Both suffixes are present but different!
If no Ids are present.
Both ID’s exist and match exactly.