Table of Contents
Introduction to OpenRefine ExtensionsMessy data is useless data – this is the first thing that comes into your mind when you encounter chaotic, sloppy, and irregular content. Extra spaces, duplicate records, misspellings, or unstructured format needs to be cleaned up before starting a proper data analysis. You’ll agree that being a data analyst does not imply cleaning “dirty” material, though this process takes a major part of analytic time. That’s why it’s critical to prepare it first. In this article, we’ll learn about a powerful tool called OpenRefine and how it can help us with data cleaning.
About OpenRefine and Its AlternativesOpenRefine software, a former Google Refine, is a robust desktop application that will help to fix inconsistencies in your extracted content and transform it into a clean, near-perfect source for further analysis. OpenRefine has the following features:
- Standardizing dataset
- Reorganizing data columns
- Faceting and Clustering
- Operations Tracking
- Data Exporting
- Have a complete overview of the material
- Work with massive datasets up to 100,000 rows
- Eliminate inconsistencies
- Split data into parts
- Accomplish your content using other sources
- You do not need an internet connection though it runs on your default web browser
- Any of the content you enter in OpenRefine will not be sent to a remote server
- Your original content is not modified
- All your actions can be undone
- All your actions are automatically tracked
- Your project is automatically saved
How to use OpenRefine
Installing OpenRefineTo download OpenRefine on your computer, make the following steps:
- Go to http://openrefine.org/download.html to download and install it.
- Run it by clicking on openrefine.exe
- Wait until it is opened with Firefox or Chrome (make sure that one of these browsers is set as a default one)
- If OpenRefine is not automatically opened, point your web browser at http://127.0.0.1:3333/ or http://localhost:3333/ and run it again.
Creating a projectLet’s start with creating a project with OpenRefine.
- Prepare a .csv file with content to be cleaned
- Run OpenRefine
- Select the Create Project tab
- Click on the Choose Files button and select the file.
- Click on Next to begin the uploading.
- When the file is uploaded, name the project and click on Create Project.
- Change the Settings available on the bottom of the screen according to your needs.
- Click on Create Project to finish importing.
Parsing dataWhile proceeding with further parsing, you can consider the following options by clicking the arrow on the required column:
- Select Facet to arrange data in a column for sorting, inspecting, and editing.
- Select Facet ->Custom Text Facet to split column data without creating new columns.
- Select Edit cells -> Common transforms -> Trim leading and trailing white space to remove white spaces.
- Select Edit cells -> Common transforms -> To Titlecase to change the case in names.
- Select Edit cells -> Cluster and edit… to define values that might be alterations of the same thing.
Exporting dataAfter finishing with parsing, you need to export the cleaned dataset:
- Click on the Export button.
- Select the required format.
OpenRefine AlternativesSeveral other data cleaning tools can be used as an alternative to OpenRefine.
- Trifacta data engineering tool. The tool has built-in analytics, GitHub code quality control, and integration with AWS, Google Cloud, and Microsoft Azure. It has a free trial. However, due to the extended functionality, all plans are premium, with the starter premium license costing $80 per user per month.
- Tibco Clarity tool is a cloud-based data cleansing service with a free 30-day trial. The standard plan starts at $100 per month.
- Melissa Clean Suite for CRM – is a part of Melissa data management service. It has similar functions and works with the most popular data formats and compressed archives. The pricing is credit-based at $4/1,000 credits.
- IBM InfoSphere QualityStage is a very powerful tool allowing you to investigate, clean, and organize data. One of its characteristic features is deep data profiling with built-in data classes. It can save you a lot of hours.
- Mr. Data Converter – a simple and free tool that can be useful if you work with single-page content and clean it up manually. It converts CSV / Exel data into web formats (HTML, JSON, XML).