Get company registration data for the State of Delaware.
Some people think that data about company registrations should be freely distributed and free of charge. The State of Delaware apparently doesn’t think so; Delaware blocks computers from accessing the General Information Name Search site if they make more than a few hundred (I’m not sure of the exact number) requests in a short time.
In order to download all of the data, we are thus using a swarm of computers from different IP addresses, each making very few requests to the site. You can help!
You just need to install a program and let it keep running. It will periodically contact a central server for directions, and it will query Delaware’s General Information Name Search accordingly. It is very careful to avoid being blocked. But if it detects that you are on a new IP address it will take advantage of that.
The installation process involves running things in a terminal. Remind me to put some directions here about how to do that on Mac and Windows.
If you already have Python and Pip installed, you can just do this.
sudo pip install delaware
If you have Python but not Pip, you can download a standalone package (I have to make this.) and run the setup like so.
tar xzf delaware.tar.gz
cd delaware
sudo python setup.py install
If you don’t have Python installed, follow these directions.
If you are on any operating system other than Windows, you probably already have Python installed.
Add a note about Enthought, Continuum, &c.
Once you’ve installed the program, type this into a terminal.
It’ll ask you a few questions the first time you run it, but you can totally ignore it after that.
If errors come up
If the program stops running, please send the error message to firstname.lastname@example.org. Also, please save the ~/.delaware directory, as it contains files that can be helpful for figuring out what went wrong.
How it works
I went with a worker-manager architecture, but maybe I should have gone with something less classist? Peer-to-peer connections are annoying because of port blocking of various sorts, but they would be nice because then I wouldn’t need to be responsible. Well anyway, here’s how it works.
Asking for directions
The worker contacts the manager asking for a job. It provides the following information.
- Username
  - Chosen by the user
- Password-like thing
  - Hash of a salted installation ID, which is created when the program is first run
- IP address (implicitly)
  - The manager is able to determine the IP address from which the request came.
The username is there so that the person can be recognized for her efforts.
The password-like thing is there to trace provenance of the data. This is mainly here in case someone fakes the data, so that I can figure out which data not to trust. It could also be helpful for debugging issues specific to certain systems.
The IP address is used to determine whether the rate limit is close to being reached. The manager directs workers not to query the Delaware site if they are approaching the rate limit. The IP address is wholly separate from the username and installation ID, as the same IP address can be shared by multiple devices associated with the same user and by devices associated with multiple users.
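For concreteness, here is a minimal sketch of what the worker might send when it asks for directions. The field names and the hashing details are my guesses, not the actual API.

```python
import hashlib

def installation_id_hash(installation_id, salt):
    """Hash a salted installation ID so the raw ID never leaves the machine."""
    return hashlib.sha256((salt + installation_id).encode('utf-8')).hexdigest()

def build_directions_request(username, password_like_thing, installation_id, salt):
    # The IP address is not included; the manager reads it
    # from the connection itself.
    return {
        'username': username,                # chosen by the user
        'token': password_like_thing,        # password-like thing
        'installation': installation_id_hash(installation_id, salt),
    }
```
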
Receiving work orders
In response to the above directions request, the worker will receive either a status code of 429 (too many requests) or a status code of 200. The manager decides which one based on how many requests have come from this IP address recently.
If the manager provides a status code of 200, it also provides the following information.
- File number
  - The company to look up
- An IP address
  - This will be passed back to the manager for rate-limiting purposes.
The IP address is the worker’s own IP address, but it needed to contact the manager to figure that out.
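In code, the worker’s handling of this response might look roughly like the following; the body’s field names are assumptions on my part.

```python
def handle_directions_response(status_code, body):
    """Decide what to do with the manager's directions response.

    `body` is assumed to be a dict like
    {'file_number': 1234567, 'ip_address': '203.0.113.9'}
    (the field names are my guesses).
    """
    if status_code == 429:
        # Near the rate limit: back off rather than query Delaware.
        return ('wait', None)
    elif status_code == 200:
        return ('query', (body['file_number'], body['ip_address']))
    else:
        raise ValueError('Unexpected status code: %d' % status_code)
```
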
The file number is chosen randomly (with uniform weights) from the file numbers with the fewest responses so far.
For example, all file numbers (0 to 8 million) are possible when we start because there have been zero responses so far. Soon, some file numbers will be selected, so there will be some file numbers with zero responses and some with one response. Once all file numbers have been chosen at least once, the manager will begin repeating file numbers. By repeating file numbers, we check for consistency between different responses (in case someone is trying to fake data), and we continue to update the data (in case companies change).
I chose this approach so that we can be intelligent about which file numbers we query without assigning jobs to particular workers.
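The selection rule above can be sketched like this. (With the full 0-to-8-million range you’d bucket the file numbers by response count rather than scanning them all, but the rule is the same.)

```python
import random
from collections import Counter

def choose_file_number(response_counts, all_file_numbers):
    """Pick uniformly among the file numbers with the fewest responses so far.

    `response_counts` maps file number -> how many responses we've received;
    file numbers never seen count as zero.
    """
    counts = Counter(response_counts)
    fewest = min(counts[n] for n in all_file_numbers)
    candidates = [n for n in all_file_numbers if counts[n] == fewest]
    return random.choice(candidates)
```
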
Querying the website
Once the bot has been directed to look up a particular file number, it queries the Delaware corporations site accordingly. It goes to the starting page for the General Information Name Search (called home in the code). It enters the file number and receives a list of up to one company. (This page is called a search in the code.) It then goes to this maybe-company page (called result in the code).
At every step, the bot
- minimally parses the web page so that it may advance to the next step,
- sends information about the HTTP response to the manager, and
- pauses randomly for a time on the order of a second to avoid looking so obviously like a bot.
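The home → search → result walk can be sketched as below. The fetching and reporting are injected as functions here so the sketch stays independent of the real site’s URLs and form fields, which I’m not reproducing.

```python
import random
import time

def pause():
    # A random pause on the order of a second, so the traffic looks
    # a bit less mechanical.
    time.sleep(random.uniform(0.5, 2))

def look_up(file_number, fetch, report, pause=pause):
    """Walk the three pages: home -> search -> result.

    `fetch(step, file_number)` performs the HTTP request for one step and
    returns (response, ok); `report(step, response)` sends the response
    information to the manager.
    """
    for step in ('home', 'search', 'result'):
        response, ok = fetch(step, file_number)
        report(step, response)
        if not ok:
            # e.g. the search returned no company, so there is no result page
            break
        pause()
```
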
When it sends the response information to the manager, it includes the following.
- “Before” IP address
  - The previous IP address that the manager told the worker
- Current IP address (implicitly)
  - The IP address that the manager currently detects from the worker
- Simplified HTTP response from Delaware
  - This is the main information that we are looking for.
- Whether the request appeared successful
  - Based on a rough parse, the worker says whether the request was successful. The manager uses this when selecting file numbers for job assignments (in the first step of the process).
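A sketch of the report, with illustrative field names. The current IP address is omitted because the manager reads it off the connection, and the marker string in the rough parse is made up; the real check would look for something that reliably appears on a successful result page.

```python
def appeared_successful(html):
    """Rough parse to say whether the lookup seemed to work.

    The marker string here is an illustrative stand-in.
    """
    return 'FILE NUMBER' in html.upper()

def build_report(before_ip, response_text):
    return {
        'before_ip': before_ip,        # what the manager told us last time
        'response': response_text,     # simplified HTTP response from Delaware
        'success': appeared_successful(response_text),  # used for file number selection
    }
```
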
Saving information on the manager
When the manager receives a response, it first needs to determine an additional piece of information. The worker has provided the “before” IP address; the manager now determines the “after” IP address.
Having determined this, it writes the following to a simple log file.
- installation id
- before ip address
- after ip address
- serialized request
It also saves the IP address(es) in an IP address table. We maintain this table so we can avoid exceeding thresholds for IP blocking. If the before and after IP addresses are different, we conservatively count the request as having come from both addresses.
Finally, it parses the file number from the response and updates the sampling weights for the file number selection.
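The manager-side bookkeeping might look roughly like this; the field names and the line-based JSON log format are my assumptions.

```python
import json
from collections import Counter

def record_response(write_line, ip_counts, installation_id, before_ip, after_ip, request):
    """Log one response and update the per-IP request counts.

    `write_line` appends one line to the log file; `ip_counts` is a Counter
    consulted for rate limiting.
    """
    write_line(json.dumps({
        'installation_id': installation_id,
        'before_ip': before_ip,
        'after_ip': after_ip,
        'request': request,  # the serialized request
    }))
    # If the IP address changed mid-job, conservatively count the request
    # against both addresses.
    for ip in {before_ip, after_ip}:
        ip_counts[ip] += 1
```
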
A separate process comes along later, reads the log files, and extracts more information from the responses. The involved parsing is moved to a separate task for two main reasons. First, it reduces the load on the manager. Second, we can reuse the separate task for loading backups; we don’t need to write a separate thing for that.
The worker waits a random time on the order of seconds before repeating the above process. This way, the bots may look a bit less like bots and thus be harder to block.
Questions you might have
- Why not do this in the browser?
  - We can’t make cross-domain requests, so we’d have to inject something into the Delaware page, and that’s annoying, especially for this site.
- Doesn’t OpenCorporates already have it?
  - OpenCorporates doesn’t have it.
- Have people done similar things in terms of this distributed API?
- Why Python rather than something that people with Windows can run?
  - Because it’s easier.
- Has anyone tried talking to Delaware?
- How many companies?
  - Dunno, but less than 600,000.
In order to avoid faking of data, enforce that the worker only completes work that it has been ordered to do. This could happen through some form of encryption or just by looking for strange patterns in the server logs.
The rate-limit query on the database isn’t working. Fix it.
Figure out what the actual rate limit is.