Directly accessing remote and/or compressed files in Kettle

Integrating data from different systems often involves working with data files that must first be fetched from a server. If the remote file system cannot be mounted over the network, the download is usually done using FTP/SFTP or HTTP/HTTPS. The resulting Kettle solution often starts with a job that downloads the correct file, extracts it locally if it is a compressed archive, and then calls a transformation to load the data. After the transformation completes, some cleanup may be necessary to archive the source files or dispose of them altogether.

The somewhat tedious and repetitive job of downloading the files, unpacking them and finally cleaning them up can often be avoided by utilizing the now ubiquitous support for Virtual File System URLs in Kettle. This article shows how remote (and optionally compressed) files can be accessed directly from the steps that need them.
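
As a rough illustration of what such a URL can look like, the sketch below composes layered URLs in the style used by Apache Commons VFS, the library behind Kettle's VFS support: an archive scheme wraps the URL of the remote file, and `!/` separates the archive from the entry inside it. The class name, helper method, host names and file paths are hypothetical and only serve the example; check the VFS documentation of your Kettle version for the schemes it actually supports.

```java
public class VfsUrlSketch {

    // Wraps the URL of a (possibly remote) archive in an archive scheme such as
    // "zip" or "tar" and points at one entry inside that archive.
    public static String inArchive(String scheme, String archiveUrl, String entryPath) {
        return scheme + ":" + archiveUrl + "!/" + entryPath;
    }

    public static void main(String[] args) {
        // Hypothetical remote archive; host, user and paths are placeholders.
        String remoteZip = "sftp://etl@files.example.com/export/orders.zip";

        // A CSV file inside a ZIP archive that lives on an SFTP server.
        System.out.println(inArchive("zip", remoteZip, "orders.csv"));

        // A single gzip-compressed file served over HTTP needs no entry path.
        System.out.println("gz:" + "http://files.example.com/export/orders.csv.gz");
    }
}
```

A string of this form would be entered wherever a step expects a file name, so the download and extraction happen as part of reading the file.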

Dealing with Proxy Problems in ETL Processes

Sometimes your ETL processes need to access systems outside your network. Suppose your ETL process needs to download a ZIP file from a business partner over the internet using SFTP. If your company has a proxy infrastructure that prohibits direct access to the internet, you might run into trouble with ETL tools that do not support proxies. Pentaho Kettle 3.2.0, for example, does not support a generic proxy configuration in its SFTP transfer job entry. Depending on the platform you are using, you might be able to use a generic proxy transparently (a SOCKS5 proxy, for example). For Java-based tools you might specify the proxy settings on the command line using -DproxySet=true -DproxyHost=xxxx -DproxyPort=xxxx. But sometimes none of these techniques work. When looking for workarounds, you might consider using the classic Unix tool expect in combination with the tsocks library to do the download. Both are available as standard packages for Linux/Unix systems, including Mac OS X. On Unix derivatives you are likely to find these tools in your package manager; on Mac OS X you can use MacPorts to install them.
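
To make the Java proxy settings a little more concrete, here is a minimal sketch that sets the standard `http.proxyHost` and `http.proxyPort` system properties (the documented counterparts of the command-line switches above) and asks the JVM's default proxy selector which proxy it would use for a URL. The class name, proxy host, port and target URL are hypothetical. Note that this only shows how the JVM's built-in HTTP handling picks up the settings; a component that opens its own connections, such as an SFTP client, may ignore them entirely, which is the situation described above.

```java
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

public class ProxySettingsSketch {

    // Sets the JVM-wide HTTP proxy properties, then returns the proxies the
    // default ProxySelector would pick for the given URL.
    public static List<Proxy> proxiesFor(String proxyHost, int proxyPort, String url) {
        System.setProperty("http.proxyHost", proxyHost);
        System.setProperty("http.proxyPort", Integer.toString(proxyPort));
        return ProxySelector.getDefault().select(URI.create(url));
    }

    public static void main(String[] args) {
        // Hypothetical proxy and target; replace with the values of your network.
        List<Proxy> proxies =
                proxiesFor("proxy.example.com", 3128, "http://partner.example.com/data.zip");
        for (Proxy proxy : proxies) {
            System.out.println(proxy.type() + " " + proxy.address());
        }
    }
}
```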
