Data Cleaning and Management Course Using Stata
All 5 days only R13,500.00 (Excl. VAT) per delegate
About This Course
Every data analyst who is keen on producing accurate, high-quality statistical outputs spends ample time exploring and cleaning their data before starting the statistical analysis. In fact, data cleaning and management can sometimes take more time than the analysis itself. There are two reasons for this. Firstly, the validity of data analysis outputs depends on the quality of the data used. Secondly, a reviewer or reader of a report is more likely to spot data analysis errors than data cleaning and management errors, so cleaning errors that the analyst misses tend to go undetected.
Despite the importance of data cleaning and management for programme and research data, the concepts and procedures are rarely taught in postgraduate schools, and there are scarcely any short courses on the continent that cover them. This leaves many data analysts without a systematic approach to data cleaning and management.
Drawing on many years of data analysis experience in diverse projects, CESAR has put together a comprehensive data cleaning and management short course. The course will be taught using Stata, and participants will be required to have some experience with Stata or similar software.
Day 1 - Data cleaning using Stata
- Setting up Stata
- Introduction to data quality
- Review of basic data quality check commands
- Timing and procedures for data cleaning
Day 2 - Data management using Stata
- Importing and combining datasets in Stata
- Review of basic data management commands
- Handling string variables in Stata
- Stata egen and collapse commands
Day 3 - Efficient data management using Stata
- Automatic outputting of Stata results
- Automatic sequence commands (looping)
The course will teach participants the following:
Definitions and dimensions of data quality
This section will cover why and how to combine datasets. The different procedures for merging and appending data will be covered, including how to use data merging to troubleshoot data errors.
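As a minimal sketch of how merging can expose data errors, the do-file fragment below builds two small made-up datasets, merges them, and uses the _merge variable that Stata creates to find records that failed to match. The variable names client_id, age and visit and all values are illustrative assumptions, not course material.

```stata
* Made-up client file: one row per client
clear
input client_id age
1 34
2 27
3 45
end
tempfile clients
save `clients'

* Made-up visit file: several rows per client
clear
input client_id visit
1 1
1 2
2 1
4 1
end

* Many visits map to one client record
merge m:1 client_id using `clients'

* _merge == 1: in the visit file only; _merge == 2: in the client file only
tabulate _merge
list client_id visit if _merge == 1
list client_id age if _merge == 2
```

Unmatched rows on either side are often the first sign of a mistyped identifier or a missing record.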
Learn to fix dates
Understanding dates is central to longitudinal studies (including intervention studies) and is also important for cross-sectional surveys. The course will teach participants how Stata stores dates, how to convert dates from different formats to Stata format, how to format dates, how to subtract dates to get age and duration, and how to extract part of a date (e.g. the month). Participants will also learn the numerous Stata commands for dealing with dates, including nuisance dates.
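The date steps described above might look roughly like the following sketch. It assumes dates arrive as day/month/year strings; the variable names and values are made up for illustration, and age is approximated by dividing by 365.25 rather than computed exactly.

```stata
clear
input str10 dob_str str10 visit_str
"15/03/1990" "01/07/2023"
"02/11/1985" "20/01/2024"
end

* Convert strings to Stata daily dates (days since 01jan1960)
gen dob   = date(dob_str, "DMY")
gen visit = date(visit_str, "DMY")

* Display the numbers as readable dates
format dob visit %td

* Subtract dates to get an approximate age in completed years
gen age_years = floor((visit - dob) / 365.25)

* Extract parts of a date
gen visit_month = month(visit)
gen visit_year  = year(visit)

list
```

A string that does not fit the "DMY" pattern comes back as missing, which is one quick way to flag problem dates.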
Handling strings – inlist, strmatch, inrange, trim and substr
Database managers are trained to avoid or reduce the use of string fields in research data, and for good reason. However, strings cannot be avoided entirely in datasets. For example, the following types of fields are usually strings: unique identifiers, open-ended questions and questions that require participants to specify their option.
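A short sketch of the string functions named in the heading, applied to made-up identifiers and site names (pid, site and the ID layout are assumptions for illustration only):

```stata
clear
input str12 pid str12 site
" ab-001 " "north"
"AB-002"   "South "
"CD-003"   "east"
end

* Remove stray spaces and standardise case
replace pid  = upper(trim(pid))
replace site = lower(trim(site))

* substr(): take the first two characters of the identifier
gen prefix = substr(pid, 1, 2)

* strmatch(): 1 if the identifier fits a wildcard pattern
gen byte is_ab = strmatch(pid, "AB-*")

* inlist(): 1 if the value is one of the listed options
gen byte known_site = inlist(site, "north", "south")

* inrange(): 1 if the numeric part of the ID falls within a range
gen seq = real(substr(pid, 4, 3))
gen byte seq_ok = inrange(seq, 1, 2)

list
```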
What is special about the Stata egen command?
The generate command is one of the most commonly used commands in Stata. The course will show applications of generate beyond the basic ones, including conditional generate. Beyond the generate command, there is an extended-generate (egen) command, which has numerous applications for diverse types of data management, including risk score creation. There is a time to use generate and a time to use egen. Both will be covered.
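To illustrate the difference, here is a small sketch using the auto dataset that ships with Stata. The cut-offs and the "score" are invented for the example and are not a real risk score.

```stata
sysuse auto, clear

* generate: row-by-row arithmetic
gen price_k = price / 1000

* conditional generate: 1/0 indicator, left missing where price is missing
gen byte expensive = price > 6000 if !missing(price)

* egen: a group-level statistic attached to every row of the group
egen mean_price = mean(price), by(foreign)

* egen: count missing values across several variables in each row
egen n_missing = rowmiss(price mpg rep78)

* egen: add up made-up indicator variables into a simple score
gen byte heavy   = weight > 3000 if !missing(weight)
gen byte thirsty = mpg < 20 if !missing(mpg)
egen score = rowtotal(heavy thirsty)

list make price_k expensive mean_price n_missing score in 1/5
```

As a rule of thumb, generate works within one observation at a time, while egen functions summarise across observations or across variables.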
Learn to aggregate the data
Sometimes analysts will have to convert their data from individual-level data to aggregate-level data. This is important for longitudinal studies, multi-site M&E data and certain forms of data analysis. Two approaches will be taught for aggregating data – the intuitive step-by-step approach and the approach using Stata's built-in collapse command. Participants will also learn the difference between _n and _N for such analysis.
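The two approaches could be sketched as follows, again with Stata's built-in auto dataset. Within a by-group, _n is the position of the current observation and _N is the number of observations in the group. This is one possible version of the step-by-step approach, not necessarily the one taught on the course.

```stata
sysuse auto, clear

* Step-by-step approach
preserve
bysort foreign: egen mean_price = mean(price)   // group mean on every row
bysort foreign: gen n_cars = _N                 // _N: rows in the group
bysort foreign: keep if _n == 1                 // _n: keep the first row only
keep foreign mean_price n_cars
list
restore

* collapse approach: one row per group in a single command
* (count) counts non-missing values of price
collapse (mean) mean_price=price (count) n_cars=price, by(foreign)
list
```

Because collapse replaces the data in memory, it is worth saving the dataset or wrapping the command in preserve and restore first.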
Stata allows data management and statistical analysis to be carried out in wide and long formats, but some analysts may be more comfortable with a particular format. Stata can reshape data from wide to long format and vice-versa using the reshape command.
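A minimal sketch of reshape with made-up repeated weight measurements (id, weight1 to weight3 and the new variable visit are illustrative names):

```stata
clear
input id weight1 weight2 weight3
1 60 61 63
2 72 71 70
end

* Wide to long: one row per person per visit
reshape long weight, i(id) j(visit)
list

* Long back to wide: one row per person
reshape wide weight, i(id) j(visit)
list
```

Here i() names the identifier of the unit and j() names the variable that holds the numeric suffix.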
Foreach and forvalues
foreach and forvalues are Stata loop commands that speed up analysis by applying the same procedure to many variables or values, without having to write separate code for each one. The commands and their application will be covered.
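As a simple sketch, the loops below use the auto dataset; the new variable names rep_ge1 to rep_ge3 are made up for the example.

```stata
sysuse auto, clear

* foreach: run the same commands for each variable in a list
foreach v of varlist price mpg weight {
    quietly summarize `v'
    display "`v': mean = " r(mean)
}

* forvalues: loop over a range of numbers
forvalues i = 1/3 {
    gen byte rep_ge`i' = rep78 >= `i' if !missing(rep78)
}
```

Inside the loop, the loop name wrapped in a backtick and an apostrophe is replaced by the current variable name or number.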
Automatic outputting of results
Manually copying results from the Stata results window remains the most common way many analysts transfer results from Stata to Word documents or Excel reports. Yet this can be both time-consuming and error-prone. This course will teach participants how to use the user-written tabout command to output Stata results automatically. Other similar commands will also be mentioned.
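A hedged sketch of what a tabout call can look like is shown below. tabout is not part of official Stata and must be installed first; the options follow commonly documented usage and may differ between tabout versions, so check help tabout after installing. The output file name table1.txt is an arbitrary choice.

```stata
* Install the user-written command once from SSC
ssc install tabout, replace

sysuse auto, clear

* Two-way table of repair record by car origin, written to a
* tab-delimited text file that opens in Excel or pastes into Word
tabout rep78 foreign using table1.txt, cells(freq col) format(0 1) replace
```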
The course will cover the philosophies, commands and practice of exploring data for the following types of errors. Steps on how to correct them will also be covered.
- Duplicate records
- Illogical sequence
- Missing data reports
- Illegal values
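The checks listed above can be sketched on a tiny made-up dataset as follows. All variable names, codes and values are assumptions for illustration: sex is assumed to be coded 1 or 2, and an exit date earlier than the enrolment date is treated as an illogical sequence.

```stata
clear
input id str10 enrol_str str10 exit_str sex
1 "01/02/2023" "15/06/2023" 1
2 "10/03/2023" "01/01/2023" 2
2 "10/03/2023" "01/01/2023" 2
3 "05/04/2023" ""           9
end

gen enrol     = date(enrol_str, "DMY")
gen exit_date = date(exit_str, "DMY")
format enrol exit_date %td

* Duplicate records: report and list repeated identifiers
duplicates report id
duplicates list id

* Illogical sequence: exit recorded before enrolment
list id enrol exit_date if exit_date < enrol & !missing(exit_date)

* Missing data report for the date variables
misstable summarize enrol exit_date

* Illegal values: codes outside the allowed set
list id sex if !inlist(sex, 1, 2)
```

Listing problem records rather than changing them straight away leaves a record of what was found before any correction is made.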
Timing and procedures for data cleaning and management
- Before data collection – data collection planning
- During data collection – data collection execution
- After data collection – cleaning data for analysis
Who should attend
- Research analysts
- Data analysts
- Programme managers
- Postgraduate students
- Market researchers
- Clinical and medical researchers
- Government practitioners