Dealing with Proxy Problems in ETL Processes

Sometimes your ETL processes need to access systems outside your network. Suppose an ETL process needs to download a ZIP file from a business partner over the internet using SFTP. If your company's proxy infrastructure prohibits direct internet access, you may run into trouble with ETL tools that do not support proxies. Pentaho Kettle 3.2.0, for example, does not support a generic proxy configuration in its SFTP transfer job entry. Depending on your platform, you might be able to use a generic proxy transparently (a SOCKS5 proxy, for example). For Java-based tools you might specify the proxy settings on the command line using -DproxySet=true -DproxyHost=xxxx -DproxyPort=xxxx. But sometimes none of these techniques work.

When looking for workarounds, consider using the classic Unix tool expect in combination with the tsocks library to do the download. Both are available as standard packages for Linux/Unix systems, including Mac OS X. On most Unix derivatives you are likely to find them in your package manager; on Mac OS X you can install them with MacPorts.
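Before turning to that workaround, the Java system-property route mentioned above can be sketched for a SOCKS proxy as follows. This is only an illustration: socksProxyHost and socksProxyPort are standard Java networking properties and JAVA_TOOL_OPTIONS is read by the HotSpot JVM, but whether a particular tool's SFTP implementation honors them is an assumption you have to verify. The function name and the proxy address are hypothetical.

```shell
#!/bin/sh
# Sketch: build JVM options that point Java's default socket handling
# at a SOCKS proxy. Function name and addresses are placeholders.

java_proxy_opts() {
    # $1 = SOCKS proxy host, $2 = SOCKS proxy port
    printf '%s\n' "-DsocksProxyHost=$1 -DsocksProxyPort=$2"
}
```

A launcher script could then export the result before starting the tool, for example `export JAVA_TOOL_OPTIONS="$(java_proxy_opts 192.168.0.1 1080)"`.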

Expect automates interactive command-line programs. Here it is used to script a conversation with an SFTP server through the sftp command-line client.

Consider the following short expect script file sftp.expect:

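#!/usr/bin/expect -f

# Usage: sftp.expect HOST USER PASSWORD FILE
set host [lindex $argv 0]
set user [lindex $argv 1]
set pass [lindex $argv 2]
set file [lindex $argv 3]

# Wait indefinitely for each prompt; large downloads can take a while.
set timeout -1

# Start the interactive sftp client.
spawn sftp $user@$host

# Answer the password prompt ("Password:" or "password:").
expect "assword:"
send "$pass\r"

# Fetch the file from /reports into the current directory.
expect "sftp>"
send "get /reports/$file $file\r"

# The next prompt appears once the transfer has finished.
expect "sftp>"
send "quit\r"

# Let sftp exit cleanly before the script ends.
expect eof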

The script takes four arguments (host, user name, password and file name), connects to the host and downloads the file from the remote /reports directory into the current directory. The following shell command downloads /reports/report_2010-06-03.zip from host.org as user scott with password tiger:

expect -f sftp.expect host.org scott tiger report_2010-06-03.zip

The problem remains that the sftp client does not know about any proxies. If you are behind a proxy, it will not be able to connect to the target host. Tsocks to the rescue.

After configuring tsocks properly in /etc/tsocks.conf (make sure to read the man page for tsocks.conf), you can make almost any program use the proxy transparently by prefixing its command line with tsocks:

tsocks expect -f sftp.expect host.org scott tiger report_2010-06-03.zip

The tsocks wrapper script causes the tsocks library to be preloaded into the expect process and its sub-processes. The library intercepts outgoing TCP connections and routes them through your company's SOCKS proxy without the programs being aware of it.
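For orientation, a minimal tsocks configuration might look like the sketch below, written here as a small shell function that creates the file. The directives (local, server, server_port, server_type) are those described in the tsocks.conf man page; the addresses are placeholders, not values from this article, and the function name is hypothetical. Note that tsocks expects the SOCKS server itself to sit on a network declared as local.

```shell
#!/bin/sh
# Sketch: write a minimal tsocks configuration file.
# All addresses are placeholders; adjust them to your own network.

write_tsocks_conf() {
    # $1 = path of the configuration file to create
    cat > "$1" <<'EOF'
# Networks that are reachable directly, without the proxy
local = 192.168.0.0/255.255.255.0

# Default SOCKS server for everything else (must be on a local network)
server = 192.168.0.1
server_port = 1080
server_type = 5
EOF
}
```

Running `write_tsocks_conf /etc/tsocks.conf` as root would install it system-wide. If your tsocks build honors the TSOCKS_CONF_FILE environment variable, a per-user file can be used instead, for example `TSOCKS_CONF_FILE="$HOME/.tsocks.conf" tsocks expect -f sftp.expect host.org scott tiger report_2010-06-03.zip`.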

All that remains is to execute this command line from your ETL tool. The ETL tool typically has to compute the requested file name first, and usually embeds the download in a larger processing and transformation context.
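As a rough sketch of that last step, a wrapper script that an ETL job calls as a shell step could compute the file name and run the download. The host, user and naming scheme come from the example above; the function names and the SFTP_PASS environment variable are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of a download wrapper for an ETL shell step.
# Function names and the SFTP_PASS variable are hypothetical.

# Build the report file name for a date given as YYYY-MM-DD.
report_filename() {
    printf 'report_%s.zip\n' "$1"
}

# Download the report for the given date through the SOCKS proxy.
download_report() {
    file=$(report_filename "$1")
    tsocks expect -f sftp.expect host.org scott "$SFTP_PASS" "$file" || return 1
    # Fail if the file is missing or empty, so the ETL job can react.
    [ -s "$file" ]
}
```

The job would then call something like `download_report "$(date +%Y-%m-%d)"` and treat a non-zero exit status as a failed download.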

Example Job File

You might want to download the example job files which were created for Kettle 3.2.0. Make sure that you have tsocks and expect installed and configured before trying to run the ETL job. You also need to change the job entry Download File via SFTP to customize your connection credentials.
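A quick way to check the prerequisites before running the job is a small preflight function like the sketch below. It only verifies that the commands are on the PATH, not that tsocks is configured correctly; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: verify that the required command-line tools are installed.

# Return 0 if every named command is on PATH; otherwise report what is missing.
require_commands() {
    missing=""
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing" >&2
        return 1
    fi
}
```

For this setup the call would be `require_commands tsocks expect sftp || exit 1`.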
