All posts by Kevin Whitson

Software developer, database administrator, infrastructure apprentice, electronics addict, and occasional blogger.

Use Xargs to Handle split-by Skew in Sqoop


I often use Sqoop to migrate relational data to Hadoop. However, I frequently encounter data skew in my “split-by” column, which leads to poor mapper utilization: some mappers get a lot of work while others get none. For example, imagine an integer column called “order_date” with data stored as YYYYMMDD (e.g. 20190101). Yes, using integers to store dates is not the best choice (IMHO), but it frequently occurs in the wild. In this situation, if I let Sqoop allocate a month of data, say 20190101 to 20190201, across 10 mappers, Sqoop will divide the work as follows…

Mapper StartDate EndDate Notes
01 20190101 20190110 10 days of orders
02 20190111 20190120 10 days of orders
03 20190121 20190130 10 days of orders
04 20190131 20190140 01 day of orders
05 20190141 20190150 00 days of orders
06 20190151 20190160 00 days of orders
07 20190161 20190170 00 days of orders
08 20190171 20190180 00 days of orders
09 20190181 20190190 00 days of orders
10 20190191 20190200 00 days of orders


Because Sqoop tries to evenly allocate the split-by keys, mappers 01, 02, and 03 will end up processing most of the daily orders. Mapper 04 will only process one day of orders, and mappers 05 through 10 will not process any orders.

It would be better if I could specifically tell each mapper which records to process. This would allow me to manually divide up the work (thus, avoiding the skew). Since Sqoop doesn’t really have a native way to solve this problem, my first thought was to run Sqoop inside a loop, such that each iteration processed a specific allocation of data splits. For example…
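That loop looked roughly like the following — a sketch with the real Sqoop call shown only as a comment (the connection details are hypothetical) and replaced by an echo placeholder so the looping structure is clear:

```shell
#!/bin/bash
# Serial approach: one single-mapper Sqoop job per order_date.
# Each iteration blocks until the previous ingest finishes.

ingest_day() {
  # The real call would be something like (hypothetical connection string):
  #   sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
  #     --where "order_date = $1" --target-dir /data/orders/$1 --num-mappers 1
  echo "ingesting order_date=$1"   # placeholder for the blocking Sqoop call
}

for day in 20190101 20190102 20190103; do
  ingest_day "$day"
done
```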

Note… In my use case, each order_date has millions of orders.

However, the problem with this approach is that each iteration of the loop waits for Sqoop to finish before continuing. Although this approach improves mapper utilization by evenly allocating data splits, I now have a parallelization problem, as I’m only running one mapper at a time. In addition to having precise control over data splits, I also need a way to parallelize the work.

To solve the parallelization problem, I thought I would just subprocess each call to Sqoop, which would give me multiple concurrent jobs, each running one mapper as a background job. This can be achieved by adding an ampersand “&” to the end of the call to Sqoop (this forks and runs the command in a separate sub-shell, as an asynchronous job).
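The backgrounded version looks like this (again with the Sqoop call reduced to a sleep-and-echo placeholder): appending “&” forks each iteration into its own sub-shell, and “wait” blocks until all the background jobs finish.

```shell
#!/bin/bash
# Each call runs as an asynchronous background job; the loop does not block.
job() {
  sleep 1                 # stand-in for a long-running Sqoop import
  echo "done $1"
}

for day in 20190101 20190102 20190103; do
  job "$day" &            # "&" forks the call into a background sub-shell
done
wait                      # block until every background job has finished
```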

Unfortunately, this technique introduces two problems. First, the stdout and stderr for all the jobs get jumbled together. While this is not critically important, I will address this problem later. More importantly, now I have a problem of flooding the source database with too much concurrent work. Every turn of the loop kicks off a background Sqoop job with one mapper. If I have to pull more than a few days of data, the source database will quickly get overwhelmed. So now, in addition to having precise control over splitting the data, and having a way to parallelize multiple jobs, I also need a way of throttling concurrent work.

After a little research on multithreading, subprocessing, and parallelizing bash scripts, I found that xargs can throttle subprocesses. That is, I can arbitrarily limit concurrent xargs subprocesses by setting the “max-procs” argument. When one job completes, another job will auto spawn to take its place until all jobs are complete. In other words, xargs will handle the complexity of tracking, running, and throttling concurrent work. Armed with this new information, I now had all of the pieces needed to control data splits, parallelize jobs, and throttle concurrent work.

The solution I came up with was to define a generic bash function to perform the Sqoop ingest. Using a function encapsulates the common logic of each iteration and simplifies the call to xargs. To this function I pass the filter criteria (which gives me the custom split-by functionality). Xargs iteratively calls the function using a list of filter criteria, and within xargs I can specify “max-procs” to limit concurrent work (this provides the throttling). The following script is a simplified version of my approach: I pass a list of filter criteria to xargs, which then passes each item to the “do_work” function. No more than 3 jobs will run concurrently because “max-procs=3”.
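A simplified sketch of that script — the Sqoop invocation appears only as a comment with hypothetical connection details, replaced by an echo so the xargs wiring is clear and runnable:

```shell
#!/bin/bash
# xargs fans a list of filter values out to do_work, running at most
# 3 jobs at a time (--max-procs=3).

do_work() {
  local day="$1"
  # sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
  #   --where "order_date = ${day}" --target-dir /data/orders/${day} \
  #   --num-mappers 1
  echo "processed ${day}"
}
export -f do_work   # child shells spawned by xargs need to see the function

# The list of filter criteria; xargs passes each item to do_work.
printf '%s\n' 20190101 20190102 20190103 20190104 20190105 |
  xargs --max-procs=3 -I{} bash -c 'do_work "$@"' _ {}
```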

After testing this idea, everything seemed to work, except I noticed that all the concurrent jobs were dumping stderr and stdout to the terminal (visually, every job’s output was jumbled together). If something went wrong, it was difficult to tell which job I needed to troubleshoot. What I needed was for each job to redirect its output stream to an individual file; then I could go back, look at each job’s log, and troubleshoot any issues. The solution was to use the “tee” command to copy Sqoop’s output streams to files in HDFS (you could, however, write to local disk if you wanted to). The following is a simplified version of my final solution.
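A sketch of that final shape, writing the logs to local disk for simplicity (the original copied them to HDFS); the log directory and the placeholder Sqoop call are assumptions:

```shell
#!/bin/bash
# Each job tees its combined stdout/stderr to its own per-day log file.
export LOG_DIR=/tmp/sqoop_logs       # hypothetical log location
mkdir -p "$LOG_DIR"

do_work() {
  local day="$1"
  {
    # sqoop import ... --where "order_date = ${day}" ... --num-mappers 1
    echo "processed ${day}"          # placeholder for the Sqoop call
  } 2>&1 | tee "${LOG_DIR}/sqoop_${day}.log"
}
export -f do_work

printf '%s\n' 20190101 20190102 |
  xargs --max-procs=3 -I{} bash -c 'do_work "$@"' _ {}
```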

Note that I export common variables that are the same for every Sqoop job (this is not required if you define the variables within the do_work function). Also, it goes without saying: please don’t put your Sqoop password in your script.

Although I only allocate one mapper in my function, it is possible to concurrently run Sqoop jobs that have multiple mappers. This would be useful if you had large data splits for a column that also had some significant skew. Just add the “split-by” argument and set “num-mappers” inside your call to Sqoop. However, be careful how high you set “max-procs” because the concurrent load you put on the source database will be (max-procs * num-mappers).

Although this solution doesn’t fix all skew problems for Sqoop, I have tweaked the approach in a few projects, and, so far, I’ve had pretty good success in achieving a better level of utilization of my Sqoop mappers.

One caveat: you end up running multiple application masters (each call to Sqoop is a new MapReduce job). However, the overhead of multiple application masters is a good trade compared to wasting resources on mappers that do very little work, or nothing at all.

Hopefully this post gives you another technique for optimizing mapper utilization with Sqoop. If this helped you, or if you have ideas or suggestions, please let me know in the comments.

Monitoring Temperature With Raspberry Pi


The Problem:

I recently remodeled my home office and I now have a dedicated closet for my electronics (Server, NAS, AV Receiver, etc.) During the build I planned for heat remediation by installing an exhaust fan that dumps air from the closet into my adjoining office. However, the temperature in the closet hovers around 90°F (32°C), even with the fan on. Although this temperature is within hardware thresholds, it’s a bit warmer than I would prefer. To get a better understanding of my heat dissipation needs, I decided to monitor and record temperature fluctuations over several days to see what temperature ranges I was experiencing.

Monitoring temperature levels is a perfect project for the Raspberry Pi. I have used an analog TMP36GZ low voltage temperature sensor before in an Arduino project but this would be my first attempt at using the Raspberry Pi’s GPIO pins. Unfortunately, after a bit of research, I discovered that my analog temperature sensor wouldn’t work with the Raspberry Pi’s “digital only” IO pins. While I could have prototyped a solution using an ADC and some spare components, I really wanted a simple build so I could just start coding on the Pi.

The solution to my problem was a DS18B20 Digital Temperature Sensor IC. The DS18B20 uses the 1-Wire communication bus, which is perfect for the BCM GPIO4 pin (PIN 7) on the Raspberry Pi. As a bonus, you can work with the DS18B20 from the Linux terminal, and you can connect multiple 1-Wire devices, in series, to PIN 7.

The Build:

I had some spare CAT5e cable so I stripped and soldered 3 wires to the three pins on the sensor – orange for +3.3v, brown for ground, and green for data. Also, the DS18B20 requires a pull-up resistor between the power and data leads.

DS18B20 CAT5e connection

Then, I used electrical tape to insulate the exposed areas and I shrink-wrapped everything to protect the connections.

DS18B20 Shrink-wrapped

To the other end of the CAT5e cable I attached three female jumper wire cable housing connectors. These will principally be used for quick connections to a splitter rather than connecting directly to the Pi because I need to connect several devices to a single pin (specifically PIN7 for 1-wire).

DS18B20 Female Connectors

Next, I manufactured three tiny Y-splitters (2 male to 1 female) to join the VDD, DQ, and GND lines from 2 sensors before connecting to the Pi.

DS18B20 Y-Splitter

Finally, I made a second sensor and attached both to the Raspberry Pi using the following arrangement.
DS18B20 Schematic

Here is the finished build. Note the three splitters are plugged into PIN1 (orange/3.3v), PIN6 (brown/GND), and PIN7 (green/data).


The Code:

After connecting the DS18B20s to the Raspberry Pi, you can interact with the devices using the below terminal commands. Note, your device IDs will be specific to your 1-Wire devices. In my case, my devices are 28-0000055f311a and 28-0000055f327d.
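The commands look something like the following. The w1-gpio and w1-therm kernel modules enable the 1-Wire bus and thermometer support (on newer Raspberry Pi OS releases you may also need “dtoverlay=w1-gpio” in /boot/config.txt), and each sensor then appears as a directory under /sys/bus/w1/devices/.

```
sudo modprobe w1-gpio
sudo modprobe w1-therm
ls /sys/bus/w1/devices/
cat /sys/bus/w1/devices/28-0000055f311a/w1_slave
cat /sys/bus/w1/devices/28-0000055f327d/w1_slave
```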

Here is what my terminal window looks like after running the above.


In the terminal window above, the interesting information is the first line ending with “YES” – which means no CRC errors – and the second line containing “t=” followed by the temperature in thousandths of a degree Celsius (°C * 1000). We can divide that number by 1000 to get Celsius, then convert to Fahrenheit using…
°F = °C * (9/5) + 32

Once we verify from the terminal that the DS18B20s are operating correctly, we can write a Python script to read from the devices and append the temperature readings to a CSV file. Later we will plot the readings to see how much they fluctuate over time.
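A minimal sketch of such a script, using my device IDs; the CSV path is an assumption, and the hardware read only runs when the 1-Wire sysfs tree is actually present:

```python
#!/usr/bin/env python
# Read each DS18B20 and append "device, timestamp, temp F" rows to a CSV.
import csv
import datetime
import os

DEVICES = ["28-0000055f311a", "28-0000055f327d"]   # your IDs will differ
CSV_PATH = "/home/pi/temperature_log.csv"          # hypothetical output path

def read_raw(device_id):
    """Return the two-line w1_slave payload for one sensor."""
    path = "/sys/bus/w1/devices/%s/w1_slave" % device_id
    with open(path) as f:
        return f.read()

def to_fahrenheit(raw):
    """Validate the CRC line, extract t=<millidegrees C>, convert to F."""
    lines = raw.strip().splitlines()
    if not lines[0].endswith("YES"):
        raise ValueError("CRC check failed")
    celsius = int(lines[1].split("t=")[1]) / 1000.0
    return celsius * 9.0 / 5.0 + 32.0

def main():
    stamp = datetime.datetime.now()
    with open(CSV_PATH, "a") as f:
        writer = csv.writer(f)
        for dev in DEVICES:
            writer.writerow([dev, stamp, round(to_fahrenheit(read_raw(dev)), 4)])

if __name__ == "__main__" and os.path.isdir("/sys/bus/w1/devices"):
    main()   # only attempt the hardware read when the 1-Wire bus exists
```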

Once created, you can run the above Python script from the terminal by using the following command.
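Assuming the script was saved as read_temps.py (the name and path are hypothetical):

```
sudo python /home/pi/read_temps.py
```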

Running the above terminal command a few times produces output, like the below, in a CSV file. Note that I have 2 Temperature sensors connected to the Pi. One is hanging in the center of the top of the closet, measuring ambient temperature, and the other is attached to the hottest exterior surface on my server.

Device TimeStamp Temp °F
28-0000055f311a 2014-05-21 22:21:02.585486 80.9366
28-0000055f327d 2014-05-21 22:21:02.585486 119.3
28-0000055f311a 2014-05-21 22:21:09.331944 81.05
28-0000055f327d 2014-05-21 22:21:09.331944 119.4116
28-0000055f311a 2014-05-21 22:21:13.082604 81.05
28-0000055f327d 2014-05-21 22:21:13.082604 119.3

The Schedule:

Now that we have a script to collect temperature data and save it to a CSV file, we need to schedule it to run periodically. In my case, I wanted to run the script every minute of every day for several days. To schedule the python script, we can use crontab from the terminal.
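To open your user’s crontab in edit mode:

```
crontab -e
```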

While in edit mode, add a crontab job using the following syntax. Note that this command should be all on one line in crontab.
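The entry looks something like this (the script path is hypothetical):

```
* * * * * python /home/pi/read_temps.py > /dev/null 2>&1
```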

The five asterisks (*) mean run the job every minute, every hour, every day of the month, every month, and every day of the week. The trailing “> /dev/null 2>&1” means discard both output and errors (i.e. run silently even if errors occur).

The Results:

After a few days of collecting data, I can plot the results to see how effective I am at dissipating heat from my closet. Below is a chart of the output after running the Python script, via crontab, for several days.


Now that I have a baseline, I can experiment with different heat dissipating methods to find what works best to keep my electronics cool. When I’m done monitoring my electronics closet, I can see myself redeploying the rig to other projects such as attics, crawl spaces, automotive projects, and mini-fridge hacking. Let me know in the comments if you come up with your own temperature monitoring projects.

The Kevin Test for Successful Projects

One of my personality tics is constantly asking the question, “Why do projects fail?” Too many times to count, I’ve trawled the internet and tried to absorb the lessons learned by countless people before me. I’ve had my fair share of flops, and I can say with confidence that, in hindsight, it’s always easier to spot the cause of a failed project after it fails. However, when you are in the middle of a project, it’s not so easy. It’s like the fog of war. You are so entrenched in the present that by the time you’re convinced the project is going to fail, it usually already has.

Several years ago, while searching for the answer to the ultimate question, I stumbled on a post by Joel Spolsky entitled “The Joel Test: 12 Steps to Better Code.” Written back in August of 2000, this test is basically a simple, objective survey designed to help developers build good software. Over the years, I have found myself going back to Joel’s test to see how my current software development practices stack up against the questions he outlines. Even after a decade, I’ve never achieved a perfect score; but I’ve found that, as long as I maintain a high percentage of affirmative answers, the quality of my software stays high.

The Joel Test
  1. Do you use source control?
  2. Can you make a build in one step?
  3. Do you make daily builds?
  4. Do you have a bug database?
  5. Do you fix bugs before writing new code?
  6. Do you have an up-to-date schedule?
  7. Do you have a spec?
  8.  Do programmers have quiet working conditions?
  9. Do you use the best tools money can buy?
  10. Do you have testers?
  11. Do new candidates write code during their interview?
  12. Do you do hallway usability testing?

In October 2011, Rands (aka Michael Lopp) followed up the “Joel Test” with a post entitled “The Rands Test”. In this test, Rands asks 10 subjective survey questions designed to help you gain perspective on how well your team functions as a group. Unfortunately, I’ve never achieved a perfect score on The Rands Test either; but that’s not the point. The point, on both tests, is to maintain as many positive answers as possible while simultaneously working to improve the areas where your answers are negative. Having a strong score on The Rands Test means you have a strong team.

The Rands Test
  1. Do you have a consistent 1:1 where you talk about topics other than status?
  2. Do you have a consistent team meeting?
  3. Are handwritten status reports delivered weekly via email?
  4. Are you comfortable saying NO to your boss?
  5. Can you explain the strategy of your company to a stranger?
  6. Can you tell me with some accuracy the state of the business? (Or could you go to someone / somewhere and figure it out right now?)
  7. Is there a regular meeting where the guy/gal in charge gets up in front of everyone and tells you what he/she is thinking? Are you buying it?
  8. Can you explain your career trajectory? Can your boss?
  9. Do you have well-defined and protected time to be strategic?
  10. Are you actively killing the Grapevine?

These two tests are great if you want to build good software or good teams. However, what if we combine good team behaviors and good software development behaviors and come up with a test for good project behaviors? While I can’t claim to be in any league with Joel Spolsky or Rands, I’ll venture to take what I have learned from them and try my hand at what I’ll dub “The Kevin Test” for successful projects. All credit, however, should go to the aforementioned, because “The Kevin Test” shamelessly combines their two tests into a litmus test for successful software development projects. My contribution merely applies and expands many of their ideas to projects, which lie in the territory between strong teams and strong software.

The Kevin Test
  1. Do you have a backlog of requirements or specs?
  2. Do you keep a list of bugs, issues, and scope changes?
  3. Do you fix problems before starting new work?
  4. Do you have a schedule for when work is supposed to be done?
  5. Do you feel that the team members can speak freely and frankly?
  6. Do you have testers (other than the developers)?
  7. Do you think you can communicate what you are building to a stranger?
  8. Do you know what you are supposed to do next?
  9. Do you have too many concurrent tasks?
  10. Do you conduct retrospectives?

Do you have a backlog of requirements or specs?

It’s surprisingly easy to attend a meeting, and immediately go back to your desk and start creating the solution. On extremely simple projects you might get lucky by skipping the backlog, but on more complex projects you are taking a huge risk. Creating a backlog of the requirements is the first step in showing that more thought was put into the project than merely attending a meeting. The backlog should contain known specs and their status, whether they get implemented or not. Once you have a backlog, questions about scope, completed work, and remaining effort suddenly become trivial to answer (“It’s in the backlog”). Additionally, the backlog can serve as a tool for estimating the effort for similar projects.

Do you keep a list of bugs, issues, and scope changes?

As people test or use your solution, they will come up with bugs, issues, or changes to scope. Some of these you will fix and others you will ignore. Keeping a log of these details and the decisions made about them, provides a historical record, and eliminates scope misunderstandings. This project artifact, along with the requirements backlog, is a roadmap that tells how the project got to where it is.

Do you fix problems before starting new work?

Ignoring problems with the assumption that you will get to them later is technical debt. Also, unfinished or faulty work can spread to other parts of your project. These unresolved issues tarnish the perceived quality of the overall product. In addition, fixing problems before starting new work reduces context switching and focuses attention on current tasks.

Do you have a schedule for when work is supposed to be done?

All too often, I hear people give estimates like “We should be done in about three weeks.” Unfortunately, what typically happens is most of the work ends up getting done, in a rush, just before it’s due. If a setback is encountered, the due date gets blown with little to no time left to react. However, if you set multiple short term goals, you will have a continuous stream of due dates. Consequently, instead of having most of the work performed at the end, work will be performed throughout the project schedule. There is no better incentive to start working now than having a deadline that is tomorrow instead of in three weeks. If a setback occurs early, you’ll likely have more time to make adjustments or at least notify stakeholders about the problem. Daily SCRUM meetings and weekly sprints are perfect examples of setting short term goals.

Do you feel that the team members can speak freely and frankly?

The balance of authority between stakeholders and team members cannot be lopsided. The weaker represented group, stakeholders or team members, needs to have an equal voice. If there is not a healthy debate in meetings, then problems or issues with the project are concealed. The “Yes Man Phenomenon” ensues and emotional investment in the project is replaced by apathy.

Do you have testers (other than the developers)?

Asking a software developer to test their own code is like asking an Algebra student to grade their own paper. If a mistake is made, how will it get noticed? Developers don’t think and behave like testers. The order in which inputs are collected and entered may not match the order in which users experience these workflows. Suppose a customer doesn’t know the shipping zip code, should you prevent them from placing the order? Developers and sales staff might have different opinions. The point is what’s valuable and important to users is different than what’s valuable and important to developers.

Do you think you can communicate what you are building to a stranger?

If you can’t describe the value and basic function to a stranger then it will be difficult to estimate build complexity or timeframe. If you don’t understand the underlying problem you are trying to solve, then you won’t know if you were successful at solving it. You don’t have to be a domain expert, but you do need to know enough about the problem to participate in an intelligent conversation.

Do you know what you are supposed to do next?

One of the benefits of SCRUM, when done correctly, is that anyone can know, or determine, the status of the project at any time. Built into the SCRUM process is a backlog that clearly defines what has been completed, what is being worked on, and what is remaining. If any member is unsure about what they should be doing or where the project stands, then all they have to do is consult the project backlog. Meetings to discuss “where we are on the project” are either unnecessary, or are the result of poor project planning. Consult the backlog to see what’s next if there’s slack time between completed work and the next meeting.

Do you have too many concurrent tasks?

One of the biggest project velocity killers is context switching caused by multitasking. Unlike computers, human brains are not well equipped for multithreading conscious tasks. Every time you context switch, there is a significant warm-up period of low productivity. When working a problem, it takes a while to remember where you were, and where you were trying to go. If you are doing too much context switching, your whole day can be a string of task warm-ups with little work getting accomplished.

Do you conduct retrospectives?

In the words of George Santayana, “Those who cannot remember the past are condemned to repeat it.” The point of this meeting is to communicate what’s working and what’s not. The discussions should be encouraging but frank, and liberal with positive reinforcement. Generally, this meeting attendance should be kept small; no one likes their flaws to be in the public spotlight. It’s important to note that “retrospectives” is plural. That is, multiple retrospectives should occur throughout a project not just at the end; if you’re only conducting a retrospective at the end of a project, you’re probably conducting an autopsy. Mid-project retrospectives can be an opportunity to make corrective changes before a project fails. Of all the things you can learn on a project, the retrospective can be the most valuable. Unfortunately, retrospectives are often skipped or done poorly.

It’s not critical to have a perfect score on “The Joel Test”, “The Rands Test”, and “The Kevin Test”. Rather, focus on maintaining good form and positive answers to most of the questions while simultaneously working to improve those elements that are lacking. Don’t pick too many problem areas to address at the same time; otherwise, you might fall into the context switching trap. Focus on the problems that are most likely to cause a project to fail, such as not having a backlog. Over time, good project management form will become habitual and projects not using it will feel unnatural. As projects succeed, there will be a natural tendency to repeat the behaviors that contributed to the project’s success.

Visual Aids for Communicating Project Management Concepts

With project members and stakeholders, I occasionally find that it’s difficult to communicate the benefits of using certain project management tools or methods. For example, conversations about the benefits of using SCRUM or Test Driven Development (TDD) to improve a project’s probability of success are occasionally met with blank stares or even friction. For these meetings, it’s good to have a few visual aids that help digest how using these tools can mitigate certain kinds of risk. Below is a sample of some of the visual aids I’ve used over the years.

The Triple Constraint

The first diagram is the one almost everyone knows. The Triple Constraint, also known as the Project Management Triangle, is a good visual aid for communicating how a project’s cost, scope, and schedule relate to each other. Changes to one or two of the three constraints will adversely affect the remaining constraint(s). For example, reducing cost will negatively impact the project’s scope or schedule. Limiting the schedule will negatively impact the scope or cost. Widening a project’s scope will negatively impact the schedule or cost. Trying to fix all three constraints simultaneously is impossible. Additionally, there is significant risk to a project’s success if you materially change which constraints are most important. Successful projects – i.e. on time, on budget, and within scope – strike a delicate equilibrium between the three constraints.

Product Management Trap

Another visual aid I like to use is the Product Management Trap diagram. In this diagram, the stakeholder controls the value axis while the development team controls the complexity axis. Often a project will have a wide portfolio of features or components, and identifying where each feature ranks helps stakeholders weigh risk against ROI. When trimming scope, focus on low-value, high-complexity features. When trying to balance resource schedules, add or remove low-complexity features. If possible, when negotiating scope, remove features in descending order (4, 3, 2, 1), so that you eliminate the most difficult features first. Be mindful, however, that although quadrant 3 contains difficult and complex features, trimming here could significantly change the overall ROI of the project. There is a great YouTube video by sketchcaster about this chart.

POS Vs Complexity

Visualizing the relationship between probability of success (POS) and complexity is useful when communicating the effectiveness of project management methodologies or tools (e.g. SCRUM, Kanban, TDD, etc.) The POS-versus-complexity diagram illustrates how utilizing proven methodologies and tools can improve the POS without significantly affecting complexity or scope. For example, SCRUM improves POS (on the y-axis), but complexity (on the x-axis) remains relatively unchanged with or without SCRUM. The relative improvement in POS becomes more pronounced on projects of higher complexity; for projects of low complexity, the benefit is marginal.

Time Vs Complexity

Similar to the previous graph, time-to-develop versus complexity is a good tool for communicating the effectiveness of good project management methodologies and tools. Typically, the more complex a project, the more time it takes to complete. The benefits of effective project management methods and tools can often compound and, therefore, affect time in a logarithmic way: low-complexity projects benefit only marginally, but high-complexity projects experience a significant time improvement. For example, a good TDD (Test Driven Development) strategy continuously exercises code and reduces the likelihood of bugs introduced by refactoring or changes in scope. Finding and fixing bugs before they surface in Quality Control (QC) translates to reduced communication (e.g. tickets, emails, phone calls, and meetings), reduced context switching, and reduced follow-up testing for QC agents. Over time, these improvements compound and can significantly reduce the overall time needed to complete a project.

Efficiency Vs Flexibility

Finally, efficiency versus flexibility is a good visual aid for illustrating that highly flexible systems generally perform less efficiently than rigid ones. I typically break out this diagram when I hear requirements that suggest the need to build a system that can do everything. For example, requirements might ask that users be able to process data from many different data sources; or that users be able to dynamically build, customize, and maintain reports; or that users be able to view and manipulate the data from multiple front-end systems. Beyond adding significant complexity to a project, the typical trade-off for such requirements is inefficient resource consumption: disk space, memory, and CPU usage are all greater in highly flexible systems. As flexibility increases, views or reports typically shift from real-time to deferred, or batched. Highly flexible systems are also more complex and therefore take longer to build. Ultimately, the currency for measuring efficiency in such systems is time: how fast does the system need to perform, and how fast do you need it built?

I have many other visual aids that, unfortunately, I’ve omitted here for brevity (e.g. risk versus complexity, supportability versus complexity, etc.) Also, there are a whole host of artifacts generated in well managed projects (Gantt charts, burn-down charts, etc.) that are great communication visual aids. For now, I’ll save those for future blog posts. Until then, I’ll leave that homework up to you.

Automating SQL Express database backups

SQL Server Express is a fantastic, free database engine for small, standalone databases. However, the limitations of the Express engine can make automating database backups tricky. Because Express lacks SQL Server Agent, you will likely have to roll your own process for backing up databases. Enterprise backup solutions exist, but they are typically licensed per server, so burning a valuable license on a small, standalone Express database might be cost prohibitive. There is, however, an easy way to roll your own automated backup solution using a stored procedure, PowerShell, and a Windows scheduled task.

First, we need a stored procedure that accepts a parameter for the directory where we want to save backups. This stored procedure should back up every database except tempdb. Also, the stored procedure needs to be saved in a database that will never get deleted (e.g. master).

Here is a script to create the spBackupDatabases procedure in the master database…
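A sketch of that script; a cursor over sys.databases is one straightforward way to implement it, though the exact names and timestamp format below are my conventions:

```sql
USE master;
GO
CREATE PROCEDURE dbo.spBackupDatabases
    @path NVARCHAR(256)   -- backup directory, including trailing backslash
AS
BEGIN
    DECLARE @name SYSNAME,
            @stamp VARCHAR(12),
            @file NVARCHAR(512);

    -- Build a YYYYMMDDHHMM timestamp for the file names.
    SELECT @stamp = CONVERT(VARCHAR(8), GETDATE(), 112)
                  + REPLACE(CONVERT(VARCHAR(5), GETDATE(), 108), ':', '');

    DECLARE db_cursor CURSOR FOR
        SELECT name
        FROM sys.databases
        WHERE name <> 'tempdb'          -- back up everything except tempdb
          AND state_desc = 'ONLINE';

    OPEN db_cursor;
    FETCH NEXT FROM db_cursor INTO @name;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @file = @path + @name + '_' + @stamp + '.bak';
        BACKUP DATABASE @name TO DISK = @file;
        FETCH NEXT FROM db_cursor INTO @name;
    END

    CLOSE db_cursor;
    DEALLOCATE db_cursor;
END
GO
```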

Each database in the instance will be saved to the backup location specified by the path parameter. As written, the file name for each backup will be in the form <DatabaseName>_<YYYYMMDDHHMM>.bak (e.g. InventoryDB_201401072105.bak); however, you can change this to whatever naming convention you prefer. One important factor to note is that the directory you pass to the stored procedure needs to grant the SQL Server service account “write” permissions. If you are running the SQL Express service as the system account, you will have to grant the appropriate permissions to the backup directory for that account.

Let’s test the stored procedure to confirm everything is working. Be sure the destination folder already exists or you will get SQL Server error 3013.
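For example, assuming a C:\SQLBackups folder (the path is hypothetical):

```sql
EXEC master.dbo.spBackupDatabases @path = 'C:\SQLBackups\';
```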

Now that we have a stored procedure to back up all databases, we need a script that runs it. Additionally, the script needs to delete backup files that are older than 30 days. PowerShell is a perfect fit because it can natively interact with SQL Server using the .NET Framework. The following script will connect to the SQL Express instance, run the new stored procedure, and then delete all backup files that are over 30 days old. Note that the databases are backed up using the SQL Server service account, but the old files are deleted using the account that runs the PowerShell script. Therefore, be sure both accounts can read and write to the backup directory.
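A simplified sketch; the instance name, backup directory, and 30-day retention below are my values and would need adjusting for your environment:

```powershell
# Back up every database via the stored procedure, then prune old .bak files.
$instance  = ".\SQLEXPRESS"      # hypothetical instance name
$backupDir = "D:\SQLBackups"     # hypothetical backup directory
$days      = 30                  # retention window

$conn = New-Object System.Data.SqlClient.SqlConnection
$conn.ConnectionString = "Server=$instance;Database=master;Integrated Security=True"
$conn.Open()

$cmd = $conn.CreateCommand()
$cmd.CommandText = "EXEC dbo.spBackupDatabases @path = '$backupDir\'"
$cmd.CommandTimeout = 0          # large databases can take a while
$cmd.ExecuteNonQuery() | Out-Null
$conn.Close()

# Delete backups older than the retention window. This runs as the account
# executing the script, not the SQL Server service account.
Get-ChildItem -Path "$backupDir\*.bak" |
    Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-$days) } |
    Remove-Item
```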

You will note that the database connection string above is using “Integrated Security=True”. This is because I did not want to store the database connection user ID and password in the PowerShell script. Instead, I will schedule the PowerShell script via a Windows Scheduled Task with the credentials for a domain account saved in the Scheduled Task. This makes it harder for someone to obtain a user ID and password for the database. If you are using a SQL Account, and you are okay with putting the credentials in the connection string, then you can use the following…
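For example, with placeholder credentials:

```powershell
# SQL authentication -- the user ID and password end up in plain text in the script
$connectionString = "Server=.\SQLEXPRESS;Database=master;User ID=BackupUser;Password=YourPasswordHere;"
```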

Finally, the last step for automated SQL Express database backups is creating a Windows Scheduled task. The trick for running a PowerShell script from a Windows Scheduled Task is to specify “PowerShell” in the Program/Script section of the Edit Action setting. Then, in the “Add arguments” section, add the path to your saved PowerShell script.
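For example, assuming the script was saved to a path like the one below:

```text
Program/script:  PowerShell
Add arguments:   -File "C:\Scripts\BackupSqlExpress.ps1"
```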

Note… If you have a problem running the PowerShell script, you can add the "-noexit" switch to the "Add arguments" section so that the error text stays on-screen after the script runs. Just be sure to remove the "-noexit" switch after testing so that PowerShell closes when the script finishes.

Running Linux Commands from PowerShell.

In my lab, I occasionally need to automate maintenance tasks that span Windows and Linux systems. For example, I need to back up Windows directories to a Linux-based NAS device, compress and decompress files, delete old backups, etc. Sometimes, what I need to do is run SSH commands from PowerShell in a dynamic way. I found some examples online, but they only ran one command at a time. For me, it would be better if I could dynamically build a set of commands and then have them all run consecutively in one SSH call.

To do this, you first need to define the statements you want to run in an array. In my case, I wanted something dynamic, so I came up with the following.
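A sketch of what I use; the directory and file names are placeholders:

```powershell
# Variables make the command list dynamic
$backupDir = "/share/backups"
$zipFile   = "archive.zip"

# Each command must end with ";" so they run consecutively
$sshCommands = @(
    "lsb_release -a;",      # display the distribution release info
    "cd $backupDir;",       # change the working directory
    "pwd;",                 # print the working directory
    "unzip -o $zipFile;",   # unzip a file
    "rm $zipFile;"          # remove the zip file
)
```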

Basically, the above commands will display the Linux distribution release info, change the working directory, print the working directory, unzip a file, and then remove the zip file. Note that the ";" after each command is required. Alternatively, you can use an "and list" (&&) or an "or list" (||) instead of ";" if you understand how they behave.

Now that I have the SSH commands that I want to run, how do I pass them to Linux? Manually, when I want to remotely connect to Linux in an interactive way, I use PuTTY. However, by itself, PuTTY doesn’t have a Windows command-line interface. Thankfully, the makers of PuTTY released Plink, aka “PuTTY Link”, which is a command-line connection tool for PuTTY. Armed with this new information, I downloaded Plink to the same directory as PuTTY and added an alias to my PowerShell script.
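The alias looks something like this; adjust the path to wherever plink.exe lives on your machine:

```powershell
New-Alias -Name plink -Value "C:\Program Files\PuTTY\plink.exe"
```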

Now that I have an alias for Plink, I can pass my array of SSH commands directly to my Linux machine in one line of code.
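For example, with a placeholder host, user name, and password:

```powershell
plink -ssh user@192.168.1.50 -pw MyPassword $sshCommands
```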

One nice thing about this approach is that the output of the SSH commands is displayed in the PowerShell console. That way, you can see if any Linux-based warnings or errors occur.

In the above example, I’ve added my user name and password as command-line parameters. Obviously, in a production environment this is not desirable. You can get around this by using public keys for SSH authentication. For more information, check out PuTTY’s help documentation; at the time of this writing, Chapter 8 covered how to set up public keys for SSH authentication.

Here is the finished script.
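Pulled together, it looks something like this; paths, host, and credentials are placeholders:

```powershell
# Alias for Plink (adjust the path for your machine)
New-Alias -Name plink -Value "C:\Program Files\PuTTY\plink.exe"

# Variables for the remote work
$backupDir = "/share/backups"
$zipFile   = "archive.zip"

# The commands to run on the Linux box, in order
$sshCommands = @(
    "lsb_release -a;",
    "cd $backupDir;",
    "pwd;",
    "unzip -o $zipFile;",
    "rm $zipFile;"
)

# Run everything in one SSH session; output comes back to this console
plink -ssh user@192.168.1.50 -pw MyPassword $sshCommands
```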

Some notes worth sharing… Initially, my instinct told me that zipping a large directory locally on the NAS device would be faster than remotely zipping the files from my Windows PC. I assumed the network overhead of downloading the files and then uploading the compressed archive back to the NAS would be the bottleneck. In fact, in my case, it was faster to do it remotely from Windows. This is because the limited RAM and CPU of my consumer-grade NAS device were quickly overwhelmed by the compression task. My Windows box, with a dual-core CPU, 4GB of RAM, a Gigabit NIC, and an SSD, could compress the files faster than the NAS device despite having to send the data over the network both ways. Other tasks, such as deleting large directories, were significantly faster when run locally on the NAS. Therefore, you will have to experiment to find out what works best for you.

Backup VMs with PowerShell and PowerCLI.

In my lab, I back up my VMs by exporting OVA templates and saving the files on a NAS device. My typical routine involves powering down the VM, exporting an OVA template, waiting about 30 minutes for the export to complete, powering the VM back up, rinse and repeat. After a few backup iterations, I found this process to be a chore that I occasionally put off. Today, I decided to automate the task and pay down some principal on that technical debt.

I’m fairly comfortable in PowerShell, and some time ago I downloaded VMware’s PowerCLI. I decided to give it another test drive and knock out a real-world problem. The first hurdle was how to load the modules. PowerCLI loads via a snap-in, so we can add the snap-in and view the available cmdlets like this.
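Assuming the standard PowerCLI core snap-in name:

```powershell
Add-PSSnapin VMware.VimAutomation.Core
Get-Command -PSSnapin VMware.VimAutomation.Core
```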


Okay, great! We have several hundred cmdlets with fairly decent naming conventions. Now let’s reproduce my manual workflow in code. First, I need to connect to a host and get a list of all my VMs.
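For example (the host address is a placeholder; Connect-VIServer will prompt for credentials):

```powershell
Connect-VIServer -Server 192.168.1.10
Get-VM
```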


Next, I need to power down running VMs and bring them back online after backing them up.
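A sketch of that step, using a hypothetical VM named WebServer01:

```powershell
# Gracefully shut down the guest OS (requires VMware Tools in the guest)
Shutdown-VMGuest -VM "WebServer01" -Confirm:$false

# Wait for the VM to report PoweredOff before exporting
while ((Get-VM -Name "WebServer01").PowerState -ne "PoweredOff") {
    Start-Sleep -Seconds 5
}

# ...export the OVA here...

# Bring the VM back online
Start-VM -VM "WebServer01"
```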

The last thing I need is to export the OVA to my NAS device. The command we will need is Export-VApp. (I prefer to power down the VM before running this.)
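For example, with a placeholder destination for my NAS share:

```powershell
Get-VM -Name "WebServer01" | Export-VApp -Destination "\\nas\backups\" -Format Ova
```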

Finally, let’s pull it all together and back up all the VMs on the host. This is the completed script.
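A sketch of the whole thing; the host address and backup path are placeholders:

```powershell
Add-PSSnapin VMware.VimAutomation.Core

$backupPath = "\\nas\backups\"
Connect-VIServer -Server 192.168.1.10   # prompts for credentials

foreach ($vm in Get-VM) {
    $wasRunning = ($vm.PowerState -eq "PoweredOn")

    if ($wasRunning) {
        # Graceful guest shutdown, then wait for power off
        Shutdown-VMGuest -VM $vm -Confirm:$false
        while ((Get-VM -Name $vm.Name).PowerState -ne "PoweredOff") {
            Start-Sleep -Seconds 5
        }
    }

    # Export the OVA to the NAS
    Export-VApp -VM $vm -Destination $backupPath -Format Ova

    # Power the VM back on only if it was running before the backup
    if ($wasRunning) {
        Start-VM -VM $vm
    }
}

Disconnect-VIServer -Confirm:$false
```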

While it is possible to use Connect-VIServer without having to manually enter credentials, I’ll leave that step up to you. Google “PowerShell credentials SecureString” for more information.

Ideas for the next step? Addressing the possible infinite loop would be wise if you don’t want to periodically check your backup job. Backing up the ESX Host OS and configuration is important especially if your environment is complex. Adding some logging for backup date, VM size, and other metrics would be easy and valuable. You could send “Successful” or “Failed” email alerts if you’re okay with a little more spam. Let me know what you come up with.

Try backing up your VMs today. Tomorrow you might be glad you did.

Arduino Memory Management.

I had been playing with my Arduino Uno for about 3 months when I decided I needed a real project to put my skills to the test. My first real project was to web-enable my garage door. I planned to use the Ethernet Shield to host a simple web page. However, during development I started getting strange behavior: only the first portion of my web page was being served. Not yet knowing my problem was related to memory, I tried in vain to comment out different parts of my code to find an elusive bug. After a bit of frustrating troubleshooting, I discovered that my problem wasn’t a code bug; it was memory based.

As it turns out, the ATmega328 chip in my Arduino Uno has 3 kinds of memory…

ATmega328 Memory
Type Amount Description
SRAM 2 KB RAM for your program to run
EEPROM 1 KB Can be used as durable storage
FLASH 32 KB This is what you upload your sketch to

SRAM is the static random access (i.e. volatile) memory for your sketches. The variables you create and the libraries you use can quickly eat up this limited memory resource. To get an idea of how small 2 KB is, generate a 2 KB Lorem Ipsum.

Since I was using three libraries, Ultrasonic.h, SPI.h, and Ethernet.h, I suspected I might have a problem there. I loaded up the following simple sketch to see what my base memory footprint was before writing any code.
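The sketch was essentially just the library includes plus a free-memory check; freeRam() here is the common heap-versus-stack idiom for AVR boards:

```cpp
#include <SPI.h>
#include <Ethernet.h>
#include <Ultrasonic.h>

// Approximate free SRAM: the gap between the top of the heap
// and the current stack pointer
int freeRam() {
  extern int __heap_start, *__brkval;
  int v;
  return (int)&v - (__brkval == 0 ? (int)&__heap_start : (int)__brkval);
}

void setup() {
  Serial.begin(9600);
  Serial.print("Free SRAM: ");
  Serial.println(freeRam());
}

void loop() {
}
```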

I examined the SRAM consumption and found I had only consumed about 273 bytes (~15%) of the SRAM. Therefore, the memory consumption from the libraries I was using was not an issue. Next, I started adding the code for my simple web page to see if something I was doing there was problematic.
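The web page code was a series of client.println calls, along these lines (the HTML is a simplified reconstruction):

```cpp
// Serve a bare-bones page to the connected Ethernet client;
// every string literal here gets copied into SRAM at startup
client.println("HTTP/1.1 200 OK");
client.println("Content-Type: text/html");
client.println();
client.println("<html><head><title>Garage Door</title></head>");
client.println("<body>");
client.println("<h1>Garage Door Control</h1>");
client.println("<form method='get'>");
client.println("<input type='submit' name='door' value='Toggle Door'>");
client.println("</form>");
client.println("</body></html>");
```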

After adding the above code and examining the SRAM consumption, I was starting to understand where my memory problems were coming from. The strings in my println calls were eating into SRAM even though they were static: at runtime, such strings are copied from FLASH to SRAM. At my current rate of SRAM consumption, I could tell my sketch wasn’t going to cut it. I wasn’t yet reading the incoming form data, and I was already consuming 1236 bytes (~60%) of the available SRAM. Also, I still needed to add code to read the ultrasonic sensor, determine whether the garage door was up or down, and inject that into the HTML.

Luckily, I discovered the F() function for leaving static strings in FLASH memory instead of copying them to SRAM. The F() function basically tells your Arduino not to copy the strings into SRAM. Rather, a pointer is used to read the strings from FLASH. Since every character in my HTML consumes a byte of memory, this trick saves a lot of SRAM. The following demonstrates the kind of changes I needed to make.
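For example:

```cpp
// Before: the string literal is copied from FLASH into SRAM at startup
client.println("<h1>Garage Door Control</h1>");

// After: F() keeps the string in FLASH and reads it through a pointer
client.println(F("<h1>Garage Door Control</h1>"));
```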

After making this change, my SRAM memory consumption was only about 400 bytes (21%).

This simple trick works for static strings in sketches, but what if you wanted to declare a variable in a sketch and not have that variable copied to SRAM at runtime?

PROGMEM is a variable modifier specifically for this purpose. In my case, I had a global constant that was being used by my sketch. Using the PROGMEM variable modifier, I was able to keep the variable value in FLASH and out of SRAM.
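A small example of the pattern; the variable name is just for illustration:

```cpp
#include <avr/pgmspace.h>

// Stored in FLASH instead of SRAM
const char doorName[] PROGMEM = "MainGarageDoor";

void printDoorName() {
  char buffer[16];
  // Copy the value from FLASH into a small stack buffer before using it
  strcpy_P(buffer, doorName);
  Serial.println(buffer);
}
```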

It’s worth noting that, in my case, I’m only saving myself a few bytes of SRAM with the above variable. But the example illustrates that you can keep variables in FLASH to preserve the precious little SRAM you have.

The last form of memory on the ATmega328 is EEPROM (1 KB). This form of memory works like durable storage: values stored in EEPROM are not volatile and persist even if the Arduino loses power. You can store one byte in EEPROM by supplying the memory address and the byte value, then retrieve the byte later by supplying the address. Because access is a byte at a time, it’s not practical to optimize SRAM consumption by offloading data to EEPROM. The real benefit of using EEPROM is data durability.
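Reading and writing a byte looks like this:

```cpp
#include <EEPROM.h>

void setup() {
  Serial.begin(9600);

  // Write one byte to address 0; the value survives a power cycle
  EEPROM.write(0, 42);

  // Read it back by supplying the same address
  byte value = EEPROM.read(0);
  Serial.println(value);
}

void loop() {
}
```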

For completeness, it’s worth mentioning that the data write/erase cycle lifespan is 100,000 for EEPROM, 10,000 for FLASH, and basically unlimited for SRAM. Therefore, choosing the appropriate memory location for volatile data would be prudent so that you don’t prematurely wear out the chip.

There is a great memory optimization series on Adafruit’s web site; check out Memories of an Arduino for more information. There is also a good tutorial that covers the basics.

Monitoring Host System Temperature with PowerShell.

I recently made some custom modifications to an old ESX virtual host to limit the noise produced by the CPU fans. After injecting resistors into the CPU fan circuits, I wanted to monitor the system temperature in case it ever exceeded the normal operating range. On my laptop, I prototyped a temperature monitoring script using WMI and PowerShell.

Everything was working beautifully until I tried scheduling the script on one of my VMs. The problem is that guest VMs know very little about the underlying host they are running on. Since VMs can be migrated from host to host without interruption, it’s pointless for the guest OS to ask “What’s my temperature?” What I should have done instead is have the guest OS ask “EsxHost1, what is your temperature?” I had to laugh when I realized the problem, because I should have seen it coming. Back to the drawing board I went.

After my initial SNAFU, I abandoned WMI because it wasn’t going to work against ESX. Next, I decided to try VMware’s PowerCLI. Unfortunately, after a 200MB download and an hour of digging around, host temperature was nowhere to be found. Then I discovered VMware’s CIM interface. CIM is an open standard, similar to WMI, that allows you to control and manage hardware. PowerShell 3.0 has some new Cmdlets that improve the user experience of working with CIM. After a little Googling I found the class “CIM_NumericSensor” which contains, among other things, “Ambient Temp”.
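Here is roughly what I ended up with. The host address is a placeholder, and on my hardware the raw CurrentReading came back in tenths of a degree (a UnitModifier of -1), so I scale it down; check your sensor’s UnitModifier before trusting the math.

```powershell
$esxHost    = "192.168.1.10"
$credential = Get-Credential root

# ESX speaks CIM over HTTPS; skip the certificate checks for a lab box
$options = New-CimSessionOption -UseSsl -SkipCACheck -SkipCNCheck -SkipRevocationCheck
$session = New-CimSession -ComputerName $esxHost -Port 443 `
    -Authentication Basic -Credential $credential -SessionOption $options

# Pull the numeric sensors and keep the ambient temperature reading
Get-CimInstance -CimSession $session -ClassName CIM_NumericSensor |
    Where-Object { $_.Name -like "*Ambient Temp*" } |
    Select-Object Name, @{ Name = "TempC"; Expression = { $_.CurrentReading / 10 } }
```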

Using the above script, I can remotely ask an ESX host for its system temperature; so far, everything has been “cool”. If you need to monitor your host from PowerShell 2.0, check out the cmdlet “New-WSManInstance”. Carter Shanklin wrote a post entitled “Monitoring ESX hardware with Powershell” that should get you going in the right direction.