In the last blog post, we introduced Singer, the open-source framework, as a powerful tool for ETL processes. This time, we’d like to discuss how you can implement the framework in your own projects.
How to start working with Singer
Starting a test run is rather simple. First, you need to create a python environment, for which step-by-step instructions to do so are available online.
As soon as you’ve done that, it’s time to create your first virtual environment inside python.
Please note before beginning that it’s a best practice to create and use an individual virtual environment for every tap and target. This avoids any conflicts between module requirements for the different modules.
The next step is to install the tap and target you’ve chosen into their corresponding virtual environment. This installation can be performed very easily using a pip install command. This example command installs the tap-salesforce to the load data from your Salesforce account:
After installing the tap, you’ll need to configure the connection to the Salesforce api.
This is done by creating a config.json inside your working directory containing all required configuration parameters. If searching for the file, remember that these can always be found on the tap-specific documentation and mainly consists of the api-credentials.
Singer uses a method called discover to identify all available data objects in your source system. This information is stored either in a catalog.json or in a properties.json, based on the tap you are using. For example, if you are using tap-salesforce, you can create a properties.json with the following command:
This properties.json is used to select all the objects you want to load with your ETL process. Objects are selected by adding a selected: true row to the metadata section of every object you want to select inside the properties.json.
Now, your tap is ready to go and can either be combined with a target or be run as a standalone process, sending the data to stdout, using the following command:
Please note that the square brackets are not a part of the code and are only used to show the option of piping the source data to a target, in this case the target-csv.
Since the setup of a new tap or target is generally the same, you can easily add more source systems or target databases using the same process.
Though easy to install, your growing Singer instance may affect a few comfort features. A further complication that may arise is the eventual need to schedule your ELT processes to run on a daily basis for consistent staging purposes. Which might require a bit more work to accomplish. Luckily, there also is something for you in that case.
Overall, we consider the Singer framework a great environment for easy-to-use ETL processes. That said, while the process is straightforward, you may run into some problems when working with a larger number of different taps and targets. This is due to the presence of different virtual environments which make the commands grow longer and more complex with time.
Though, there is another open-source tool that uses the Singer standards and even improves upon its ease of use and flexibility.
This tool is called Meltano and we’ll discuss its benefits in the next newsletter!
-by Tim Kirschke (Scalefree)
Get Updates and Support
Please send inquiries and feature requests to [email protected].
To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil is implemented that can be used to draw Data Vault models. The stencil is available at www.visualdatavault.com.