You want to parse Craigslist
If you read the Craigslist Terms of Use, you’ll quickly learn that using an automated system to interact with Craigslist in certain ways is prohibited. We do not advocate violating Craigslist’s Terms of Use. However, many people do ask how it can be done.
If you find this useful, please provide attribution on your site with a link back to this post. Need help or custom software? Send us an email.
Overview
This combination of Perl and bash scripts is meant to run on a Unix box and has been tested on several shared-hosting providers, including GoDaddy and MediaTemple. The scripts will likely work on your local Mac or Ubuntu machine too.
Without using a wrapper script, the normal invocation is as follows:
./load.pl
./shake.pl
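If you would rather use a wrapper, a minimal bash sketch might look like the following. The function name run_pipeline is hypothetical, and the sketch assumes load.pl exits with a non-zero status when it fails, which the scripts themselves may not do.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of the download): run load.pl, then shake.pl,
# from the directory given as the first argument (default: current directory).
run_pipeline() {
    local dir="${1:-.}"
    (
        cd "$dir" || exit 1
        ./load.pl || exit 1   # step 1: collect candidate post URLs
        ./shake.pl            # step 2: fetch the post data for each candidate
    )
}
```

Calling `run_pipeline craigslist` would run both steps in order inside the craigslist directory. The subshell keeps the `cd` from changing the caller's working directory.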
Required Files
Each code block below is a separate file. In addition, you’ll need two text files, candidates.txt and emails.txt, both of which start out empty.
Here is an overview of how things work.
Action Files
• page.pl — The first file parses Craigslist listing pages. It accepts a Craigslist (category) URL and a city as arguments, and outputs the URLs of posts whose titles match certain keywords.
• post.pl — The second file parses individual posts. It accepts a (post) URL as an argument, and prints out the URL, the post title, the post ID and the email address (if there is one).
• load.pl — The third file is a wrapper file for page.pl. It supplies all of the cities, allows you to edit categories, and sets the number of days in the past that you wish to include.
• shake.pl — The final action file is a wrapper for post.pl. It iterates over the URLs in the candidate file to retrieve the post data.
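The loop that shake.pl performs over the candidate file can be pictured with a short bash sketch. This is not the Perl code from shake.pl; for_each_url is a hypothetical helper, and the usage comment assumes post.pl writes its result to standard output.

```shell
# Hypothetical helper: run a command once per non-empty line of a URL list.
# Usage: for_each_url FILE COMMAND [ARGS...]
for_each_url() {
    local file="$1"
    shift
    local url
    # Read the list on file descriptor 3 so the command can still use stdin.
    while IFS= read -r url <&3 || [ -n "$url" ]; do
        [ -n "$url" ] || continue   # skip blank lines
        "$@" "$url"
    done 3< "$file"
}

# Assumed usage, if post.pl prints one result line per post:
#   for_each_url candidates.txt ./post.pl >> emails.txt
```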
Cities
• cities.txt — This file includes all of the cities we want to search, separated by semi-colons.
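To inspect a semicolon-separated file like this from the shell, something along these lines should work. list_cities is a hypothetical helper, and the exact form of each entry (name or URL) depends on your copy of cities.txt.

```shell
# Hypothetical helper: print one entry per line from a semicolon-separated
# file, trimming surrounding whitespace and dropping empty entries.
list_cities() {
    tr ';' '\n' < "$1" |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Assumed usage: count how many cities will be searched.
#   list_cities cities.txt | wc -l
```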
Output Files
• candidates.txt — page.pl prints all of the matching post URLs to this file. (There will likely be duplicates if you search multiple cities or related categories in the same geographical area.)
• emails.txt — post.pl prints the email, post title, post ID and post URL to this file, with the values separated by :!:SEP:!:. If you run post.pl multiple times (by using shake.pl, for example), emails.txt will include many lines.
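Both output files are plain text, so standard tools can post-process them. The sketch below removes duplicate URLs from a candidate list and pulls one column out of a :!:SEP:!:-separated file. The helper names are hypothetical, and the field order (email first, URL fourth) is taken from the description above, so check it against your own emails.txt.

```shell
SEP=':!:SEP:!:'

# Hypothetical helper: print unique lines, keeping first-seen order.
dedupe_candidates() {
    awk '!seen[$0]++' "$1"
}

# Hypothetical helper: print field N (1-based) from each non-empty line.
# Usage: emails_field FILE N
emails_field() {
    awk -F "$SEP" -v n="$2" 'NF { print $n }' "$1"
}

# Assumed usage: list each email address once.
#   emails_field emails.txt 1 | sort -u
```

The separator contains no regular-expression metacharacters, so awk can take it as a field separator unchanged.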
Quick Setup (and basic use)
1. Download the zip.
2. FTP it to your server and/or unzip it on your local Mac, Ubuntu or other *nix-type system.
3. Run
#Run the following commands; cd into the directory and make the files executable
cd craigslist
chmod +x *
chmod 777 *
4. Open load.pl in your favorite text editor and change the following section to include any categories you want to search. You can place a “#” before each line to “turn them off”, or add new lines to add additional categories.
print `./page.pl $_ "/cpg/" $datesBack`; #cpg = computer gigs
print `./page.pl $_ "/eng/" $datesBack`; #eng = internet engineering jobs
print `./page.pl $_ "/sof/" $datesBack`; #sof = software/QA/DBA/etc jobs
print `./page.pl $_ "/web/" $datesBack`; #web = web/HTML/info design jobs
5. Open page.pl and edit the words next to the keywords variable to change your search:
#These are the words you want to search for in the post title.
$keywords = '(seo|php|javascript|porsche|whatever)';
6. Run it
./load.pl #wait for it to finish (will write to candidates.txt)
./shake.pl #wait for it to finish
7. Check your results: open emails.txt
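Before a long run, it can help to try the keyword pattern from step 5 against a few sample titles. The sketch below uses grep rather than Perl. matches_keywords is a hypothetical helper, and the case-insensitive match is an assumption, since page.pl may compare titles differently.

```shell
# Same alternation as the $keywords example in page.pl.
keywords='(seo|php|javascript|porsche|whatever)'

# Hypothetical helper: succeed if the title contains any keyword.
# The -i flag (ignore case) is an assumption about how page.pl matches.
matches_keywords() {
    printf '%s\n' "$1" | grep -Eiq "$keywords"
}

# Assumed usage:
#   matches_keywords "Senior PHP developer needed" && echo "match"
```

The pattern has no word boundaries, so under these assumptions a short keyword such as seo also matches inside longer words (for example "Seoul").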