Best Practices Guide for Documenting SAS Data Sets

This guide provides a set of best practices for documenting a SAS data set. These best practices were established by the developers at Quarksoft, LLC, who have extensive experience in technical documentation, database design and development.

Start With a Good Database Design

Good database design can make documenting your SAS data set much easier. You will not always have a say in it, but if you have any influence over the database design, here are some key design areas to look out for:

  • Naming conventions - Without naming conventions, tables, columns and other database objects tend to end up with inconsistent and cryptic names, which makes them harder to document properly. Applying good naming conventions to the database design makes the documentation more readable.
  • Primary keys - Every table should have a primary key that defines which columns uniquely identify a record, but most databases do not require one. A missing primary key affects database performance and data integrity, and it also leaves a gap in the documentation, because knowing a table's primary key is essential to understanding the data model.
  • Table relationships - Another key to understanding the data model is documenting the relationships between tables. Like primary keys, these relationships are not required to be defined in the database.

Even if you have no control over the database design, a quick review of it will help you estimate how long the documentation will take and set expectations for its completion.
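A quick design review of this kind can be partly automated. The following is a minimal sketch, not a definitive tool: it assumes the schema has already been exported into a plain Python dictionary, and the naming rule (lowercase snake_case, at most 32 characters) and the names `NAME_PATTERN`, `MAX_NAME_LENGTH` and `review_design` are illustrative assumptions rather than anything prescribed by SAS or by this guide.

```python
import re

# Hypothetical convention: lowercase snake_case names, at most 32 characters.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")
MAX_NAME_LENGTH = 32


def review_design(tables):
    """Return a list of findings for a schema description.

    `tables` maps a table name to a dict with 'columns' (a list of column
    names) and 'primary_key' (a list of column names, possibly empty).
    """
    findings = []
    for table_name, table in tables.items():
        # Check the table name and every column name against the convention.
        for name in [table_name, *table["columns"]]:
            if not NAME_PATTERN.match(name) or len(name) > MAX_NAME_LENGTH:
                findings.append(
                    f"{table_name}: name '{name}' breaks the naming convention"
                )
        # Flag tables that do not declare a primary key.
        if not table.get("primary_key"):
            findings.append(f"{table_name}: no primary key defined")
    return findings
```

The number of findings gives a rough indication of how much extra effort the documentation is likely to need.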

Embed Comments inside Your SAS Data Set

SAS supports the LABEL keyword for adding comments to tables and columns.  These comments are stored as metadata within the database definition, making them easy to maintain as the database structure changes over time.

By embedding comments inside your SAS data set, you create a "self-documenting" database that will be easier for DBAs and database programmers to understand.
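As an illustration, labels kept in a simple mapping can be turned into SAS code that applies them. The sketch below is written in Python and only builds the text of a PROC DATASETS step, using a MODIFY statement with the LABEL= option for the table and a LABEL statement for the columns. The helper names `sas_quote` and `build_label_step`, and the library, table and column names in the usage note, are hypothetical. Review the generated code before running it in your own SAS environment.

```python
def sas_quote(text):
    """Quote text as a SAS single-quoted literal, doubling embedded quotes."""
    return "'" + text.replace("'", "''") + "'"


def build_label_step(library, table, table_label, column_labels):
    """Build the text of a PROC DATASETS step that labels a table and its columns."""
    lines = [
        f"proc datasets library={library} nolist;",
        f"  modify {table} (label={sas_quote(table_label)});",
    ]
    if column_labels:
        # One LABEL statement carries all of the column labels.
        lines.append("  label")
        for column, label in column_labels.items():
            lines.append(f"    {column} = {sas_quote(label)}")
        lines.append("  ;")
    lines.append("quit;")
    return "\n".join(lines)
```

For example, `build_label_step("work", "customer", "Customer master record", {"customer_id": "Unique customer identifier"})` returns a step that could be saved to a .sas file and submitted alongside the code that creates the data set.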

For more information on embedding comments see Documenting SAS Data Sets using SAS Workbench.

Comment What You Can and Annotate Everything Else

For SAS data sets, you will want to describe the following database objects:

  • Tables - Most tables store logical representations of real world objects.  For these types of tables you should describe what that real world object is.  Your database may also include tables, sometimes referred to as "link tables", that are used to create many-to-many relationships between tables.  For these types of tables, you should describe the purpose of the relationship.
  • Columns - As tables typically represent real world objects, the columns within those tables are used to store properties related to those objects.  Column descriptions should include what those properties are and how the column represents them.

The main purpose of the comments that you embed in your SAS data set should be to provide descriptions for these database objects. These descriptions should be concise and consist of one or two sentences.  If there is additional information that is important to document, you should consider structuring your comments into two parts: a summary description followed by a remarks section.  The remarks section can be used to include the additional information.
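The two-part structure can be made machine-readable with a simple convention. The sketch below assumes, purely for illustration, that the remarks section is introduced by the marker text "Remarks:". That marker and the names `Comment` and `parse_comment` are hypothetical and not part of SAS or of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    summary: str       # concise description, one or two sentences
    remarks: str = ""  # optional additional information


def parse_comment(text, marker="Remarks:"):
    """Split a comment into its summary and its optional remarks section."""
    summary, _, remarks = text.partition(marker)
    return Comment(summary.strip(), remarks.strip())
```

A comment without the marker is treated as a summary only, so existing one-line descriptions keep working unchanged.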

Embedding comments in your SAS data set using the LABEL keyword does have some limitations. These include:

  • SAS labels can be no longer than 256 characters (40 characters in SAS Version 6 data sets).
  • SAS labels are expected to be very short descriptions, as they are used as column headings in reports.

To get around these limitations, you can use a tool like TechWriter to annotate SAS data sets with external comments that can be merged together with the embedded comments to produce the complete documentation.
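The idea of merging embedded and external comments can be sketched as follows. This is not how TechWriter works internally. It is a minimal illustration that assumes both sets of comments are available as dictionaries keyed by object name, and the function names are hypothetical. The length limit is passed in as a parameter because it depends on the SAS release and on how short you want your report headings to be.

```python
def labels_over_limit(labels, max_length):
    """Return the names of objects whose label exceeds max_length characters."""
    return [name for name, label in labels.items() if len(label) > max_length]


def merge_documentation(embedded, external):
    """Combine each embedded label with its external annotation, if any."""
    merged = {}
    for name in sorted(set(embedded) | set(external)):
        # Keep the short embedded label first, then the longer annotation.
        parts = [embedded.get(name, ""), external.get(name, "")]
        merged[name] = " ".join(part for part in parts if part)
    return merged
```

Keeping the embedded label short and moving the detail into the external annotation lets the label continue to work as a report heading.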

Use a Documentation Generator

Having a self-documented SAS data set is helpful for database design and maintenance, but there are many instances where you are going to need actual database documentation. Here are a few examples:

  • System Documentation - If you are creating a SAS data set for a client, a complete set of system documentation is typically a required deliverable.
  • End-User Documentation - While your DBA and database programmer may have full access to the database definition, you may allow some users restricted access that only allows them to execute database queries.  These users typically will not have the access or the skills necessary to read and understand a database definition. You will need to provide them with some form of documentation that will help them create their database queries. 
  • Design Reviews - Not all participants in a design review are going to have the same level of SAS knowledge and expertise.  Your subject matter experts may not be technical at all.  To facilitate such a design review, you will need to provide documentation that anybody can read and understand.

Fortunately, there are tools like TechWriter that can generate the database documentation automatically, leveraging the database definition and the comments you've embedded.

Make Documentation Part of Your Development Process

If your project uses an automated build tool such as Ant, NAnt or MSBuild, you can make generating the documentation part of the automated build process for the system. By keeping the documentation as up-to-date as the system itself, your project will benefit in the following areas:

  • Knowledge Capture - While some development methodologies such as Agile subscribe to a "document late" philosophy, projects can suffer greatly when they lose a team member.  The knowledge of a system literally walks out the door.  Documenting as you go minimizes the impact of losing a team member and makes it easier for new team members to ramp up.
  • Communication - Documentation can help communication among project team members.  The project manager can better assess the status of the project.  Subject matter experts can verify that developers are on the right track.  And mistakes can be identified earlier in the project life-cycle.
  • Documentation Quality - Scrambling at the end of the project to throw the documentation together for the final deliverable can result in the database not being documented properly, if documented at all. 

By making documentation part of your development process, you will ensure that the documentation is always in sync with your SAS data set, even in extreme programming environments where the database is constantly changing.
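One way to enforce this in a build is a small check that fails when a table or column has no description. The sketch below is an assumption-laden illustration, not a feature of any build tool: it expects the schema as a dictionary of table names to column lists, the labels as a dictionary keyed by "table" or "table.column", and the function names are hypothetical.

```python
def find_undocumented(schema, labels):
    """Return the tables and columns in `schema` that have no label."""
    missing = []
    for table, columns in schema.items():
        if not labels.get(table):
            missing.append(table)
        for column in columns:
            if not labels.get(f"{table}.{column}"):
                missing.append(f"{table}.{column}")
    return missing


def check_documentation(schema, labels):
    """Return an exit code for the build: 0 if fully documented, 1 otherwise."""
    missing = find_undocumented(schema, labels)
    for name in missing:
        print(f"undocumented: {name}")
    return 1 if missing else 0
```

A build script could pass the result to `sys.exit()` so that the build stops whenever a new table or column is added without a description.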

For more information please visit: