Yet another data feed method or "Everybody wants a SALE!"

I’ve had an idea kicking around in the back of my head for a few months, but not the time to write it down. As I’m currently on a flight with some time to kill, I thought I’d write an abstract of the idea for the blog to whet people’s appetites for a white paper to come soon.

In talking with realtors and developers who make use of our various RETS projects like libRETS, ezRETS, and vieleRETS, it seems a majority of users are replicating the MLS database for web-based advertising purposes. Following CRT’s mission of playing with a diverse set of technology, and following Perl’s motto of “there’s more than one way to do it,” I’ve been tinkering with an idea of how MLSs can share that data in alternative ways.

Since every good project needs a catchy name that you can turn into a fun acronym, I’ve decided to call this the Syndicated Advertising-based Listings Engine or SALE. Besides, this makes promoting the project fun with catchphrases like “Everybody wants a SALE!” But what is this project you ask?

Being the hip up-to-date netizen that I am, I obviously read tons of web sites and listen to a fair amount of podcasts via their RSS feeds. In some ways, syndication is just another word for replication, so I wondered: how could we replicate the MLS’s database easily using something like RSS and/or ATOM? Since I talk to realtors more than MLS staff, I’ve heard lots of talk of Realtors using RSS/ATOM feeds of listings in their business, but I haven’t heard of an MLS doing so. It’s also worth pointing out that some companies, such as Google and Trulia, take in listings in an RSS-ish format.

In this thought experiment I’ve made a few assumptions. 1) In the general case, a Realtor only wants to pull listings via this technology, not update them. 2) RSS/ATOM is a widely deployed technology that most VAR/consulting software developers have been exposed to. 3) An advertising feed doesn’t contain ALL the data in the MLS listing. 4) This is advertising data we, as real estate professionals, WANT spread all over the earth, i.e., only public data.

Using those assumptions, the idea is that an MLS could have a feed of listings that could be hit periodically and would contain a few key pieces of data that enable the person receiving the data to do complex things like tracking price changes and status. In my mind, to be on the safe side, this feed should have the last week’s worth of changes. So if someone chose to hit the feed every three days, that would still work, and the feed would stay a reasonable size.
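To make the rolling window concrete, here’s a minimal sketch of the server-side filtering, assuming change records are simple (listing ID, field, timestamp) tuples; the IDs and dates are made up, and a real server would pull these records from the MLS change log.

```python
from datetime import datetime, timedelta

# Hypothetical change log entries: (listing_id, changed_field, timestamp).
changes = [
    ("MLS1001", "status", datetime(2023, 12, 30)),
    ("MLS1002", "price",  datetime(2024, 1, 6)),
    ("MLS1003", "status", datetime(2024, 1, 7)),
]

def feed_window(records, now, days=7):
    """Keep only the changes that fall inside the rolling feed window."""
    cutoff = now - timedelta(days=days)
    return [r for r in records if r[2] >= cutoff]

# Someone hitting the feed every three days still catches everything,
# because the window holds a full week of changes.
recent = feed_window(changes, now=datetime(2024, 1, 8))
```

With a week-long window, the oldest entry above falls out of the feed while the two recent ones remain, which is exactly the slack that lets a consumer poll every few days without missing a change.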

For every change in the MLS database, there would be an entry in the feed. Say a property was listed, had its price changed, and then was closed; that would generate three entries in the feed over the lifetime of that listing, even if the whole lifecycle happened in a single day.

There is the problem of someone just coming onto the system for the first time. There will need to be a “full database” version of the feed available as well. In theory, this should only be hit once per entity pulling the data, but in reality it will probably be more due to system crashes and the like. I see this as something that needs to be updated once a week or so, and then the “live feed” can be used to synchronize to the current date.
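The bootstrap workflow can be sketched in a few lines: seed the local copy from the weekly full dump, then replay the live change feed in order to catch up. The listing IDs and field names here are made up for illustration.

```python
# Step 1: the weekly full dump, keyed by listing ID (hypothetical data).
full_dump = {
    "MLS1001": {"price": 250000, "status": "Active"},
    "MLS1002": {"price": 310000, "status": "Active"},
}

# Step 2: live-feed entries published since the dump, oldest first.
live_feed = [
    {"listing_id": "MLS1002", "price": 299000, "status": "Active"},
    {"listing_id": "MLS1001", "price": 250000, "status": "Pending"},
]

local = dict(full_dump)   # seed from the full dump
for entry in live_feed:   # replay each change entry in order
    fields = {k: v for k, v in entry.items() if k != "listing_id"}
    local[entry["listing_id"]] = fields
```

After the replay, the local copy reflects the price change and the status change, and from then on only the live feed needs to be polled.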

Based on the needs of the live feed and the full dump feed, I see that each listing should have the following as required items: 1) the Listing ID, 2) a unique ID of some sort, maybe a UUID, to uniquely identify the change, 3) the URL to where the property is listed on the MLS or Realtor web site, and 4) URL(s) to image(s) of the property that can be snagged easily via some media server.
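As a sketch of what one entry carrying those four required items might look like, here is an ATOM entry built with Python’s standard library. The SALE extension namespace and its element names (`listingId`, `photo`) are assumptions for illustration, not a published schema.

```python
import uuid
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
SALE = "http://example.org/sale/1.0"  # hypothetical extension namespace

# One change entry: a UUID for the change itself, the Listing ID,
# the listing URL, and at least one image URL.
entry = ET.Element(f"{{{ATOM}}}entry")
ET.SubElement(entry, f"{{{ATOM}}}id").text = f"urn:uuid:{uuid.uuid4()}"
ET.SubElement(entry, f"{{{SALE}}}listingId").text = "MLS1001"
ET.SubElement(entry, f"{{{ATOM}}}link",
              href="https://mls.example.com/listings/MLS1001")
ET.SubElement(entry, f"{{{SALE}}}photo",
              href="https://media.example.com/MLS1001/1.jpg")

xml_text = ET.tostring(entry, encoding="unicode")
```

Because ATOM already requires a globally unique `atom:id`, the per-change UUID falls out of the format for free; only the listing ID and media URLs need an extension namespace.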

My original thought in emphasizing the change is that the client receiving the data would be responsible for figuring out what changed, since each entry in the feed would have the full data. However, I’m thinking maybe a required field that lists the fields that have changed, or NONE, might also be a good idea. This does, however, put more burden on the server-side implementer.
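The server-side burden is small if the previous version of each listing is at hand; a changed-fields element is just a field-by-field comparison. A minimal sketch, with made-up field names:

```python
def changed_fields(old, new):
    """Names of the fields whose values differ, or ["NONE"] if nothing did."""
    diff = sorted(k for k in new if old.get(k) != new.get(k))
    return diff if diff else ["NONE"]

# Hypothetical before/after snapshots of one listing.
before = {"price": 250000, "status": "Active"}
after_ = {"price": 239000, "status": "Active"}
```

The same helper also works on the client side, which is exactly the trade-off in question: either the server runs this once per change, or every client runs it against its own stored copy.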

The security minded will notice I didn’t talk about authentication. Because these feeds are just basic HTTP grabs, you can secure them any way you want using your web server’s authentication choices, as well as running it over SSL. Also, thanks to assumption 4, we can ignore things like field-level encryption and the like, as this is all public data we want seen.
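From the client side, “just a basic HTTP grab” means any standard auth scheme works unchanged; here’s a sketch using HTTP Basic auth over TLS with the standard library. The URL and credentials are placeholders.

```python
import base64
import urllib.request

# Hypothetical feed URL and credentials; because the feed is an ordinary
# HTTP resource, whatever the web server supports (Basic, Digest, etc.)
# works without any change to the feed format itself.
url = "https://mls.example.com/feeds/sale.atom"
token = base64.b64encode(b"agent:secret").decode("ascii")

req = urllib.request.Request(url)
req.add_header("Authorization", f"Basic {token}")
# urllib.request.urlopen(req) would then fetch the feed over SSL.
```

The point is that authentication lives entirely in the HTTP layer, so nothing about SALE itself needs a security design.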

One of the things that I haven’t really thought about yet is what the data inside of the feed would be or how it would look. In my conversations around the industry, a magic number of “about 50 fields” keeps coming up as what would be needed for a schema like this. I know of at least one regional MLS data sharing agreement that shares just about that number of data items. For something like SALE, I could see that number as well. Of course, because RSS/ATOM feeds are namespaced XML, a particular server vendor could extend it. I haven’t thought in that direction yet either.

In any case, this was my brain dump/abstract of the idea. In the white paper version I’ll have more of the specifics fleshed out in terms of what the feed could look like, and maybe a workflow example of that first meshing of the full feed with the live feed.

  1. RETS2, IDXExtended schema. Almost available. 😉

  2. Keller Williams Realty has recently implemented the KWLS – Keller Williams Listing Service – that is designed to replicate MLS data but provide us (REALTORS) with direct control over our own listing inventory. As we ramp up and gain agent participation, several bugs have popped up and are being dealt with, but overall our listings are getting excellent exposure via numerous search engines and destination web sites. The ultimate goal is to eliminate the MLS’ ability to sell our listing data back to us in the form of “captured leads”.

  3. I never understood why giving IDX vendors, brokers and agents read-only access to the MLS database server never caught on. It’s MUCH easier to implement a keep-the-DBs-in-sync process than to play the data download/upload/unzip game most MLSs require you to participate in. It doesn’t require additional tools, libraries or standards (since nearly every software development platform worth using includes DB drivers)…

    Besides, why stop at RSS? Why not other XML based standards (such as Trulia, Google Earth KML, Google Base, PropSmart, etc)?

  4. Robbie,

    No reason to stop at RSS, but I was mostly thinking of something that had the transport/syndication that more people are familiar with. Given that a lot of the audience that wants to sync the data isn’t as advanced as yourself, I was going for something a bit easier to deal with.

    Straight database access seems like something unlikely to happen. I need to think about it more to define the reasons my gut is saying it’s a bad idea.

    However, most people are familiar with RSS and there are a ton of reader libraries out there. It’s just using proven, well-known tech to do this. (As it is, the GoogleBase file format is pretty much an RSS feed. Last time I looked at it anyway.)

  5. Obviously, my brain wasn’t working when I posted on Friday night at 8:30. (What was I doing working then anyway?) Anyway, a tech at an MLS mailed me the following:

    “From our perspective (an MLS), we don’t allow direct DB access because of the security risk. Even if the exposed account is read-only, the hole poked through the firewall to the Internet exposes our SQL server to all sorts of bad stuff like DDoS attacks, the next slammer-type worm, or some as-of-yet unknown threat.

    As such, we restrict outside access to the data to some form of intermediary method such as RETS or FTP flat-file dumps.”

    I think that was mostly what my gut feeling was, but at 8:30 it was just calling for more beer. (Hey, the kids were in bed, it was all good.)

    Databases probably aren’t hardened against the cesspool the internet has become. I’d say this is due to the assumption that there is some sort of application layer between the DB and the rest of the world.

    I was also thinking that most developers have a hard enough time writing efficient SQL without a DBA’s help. I’d hate to open it up to my customer base at large and hope to keep up performance. I suppose you could only expose VIEWS that have been optimized and only allow certain types of queries. At that point, might as well go RETS, SALE, or whatever.

  6. I’d be more willing to put a DB server on the internet these days because its bugs are well known and its vendors are under constant scrutiny. At least my DB vendor (Microsoft) has been doing a good job after the embarrassment of Slammer. Besides, you can always use firewalls, give only trusted entities direct DBMS access, and then add DBMS security on top of it. You have to buy a DBMS server anyway; might as well wring every cent of value out of the thing…

    You still have a security risk w/ RETS because it will likely still be based on top of a database back-end (so you have both RETS & DBMS flaws to deal with instead of just DBMS flaws). Unfortunately, MLS RETS servers probably do not undergo the scrutiny that DBMS servers do (security by obscurity). It reminds me of the old “you now have 2 problems” programmer jokes.

    Besides, every university-level computer science program teaches SQL & databases. Every software platform worth using has libraries for querying and connectivity against a DBMS. SQL is a powerful, constantly improving querying language that is well supported.

    Nobody but a few MLS vendors’ and IDX vendors’ software engineers knows anything about RETS. The library support is improving but still lacking, and it likely will never be taught in college, have out-of-the-box support from major platform vendors, or have books written that I can purchase from Amazon.

    It seems odd that we are ignoring, replacing or reinventing a well known, well understood, well supported and performant data querying, access, and transmission mechanism that the rest of the world uses in favor of a Real Estate industry-only solution (RETS)… I’m sure there are good reasons, I just haven’t heard any yet. 🙂

    Still, I think adding RSS support is a great move to empower the non-power users. Everybody knows and supports XML & HTTP. It’s just that us power users yearn for something that gives us more power & flexibility (or otherwise saves us time & effort)…

  7. Andrew Tillman

    The issue of RETS being a good solution for exchanging data and the issue of putting a SQL database directly on the internet are two very different issues. For many companies, the database that RETS may pull from is used by more than one application. The reasons to have an abstraction layer are many:

    1. Security. Sure, vendors are getting better at creating secure DBMSs. But I doubt any of them think it’s a good idea to put the database open directly on the internet. Sure, your RETS server could get hacked, but if you have a secure database, then they have another layer of security to go through before they can damage your data. And any other applications that use your database can still run.

    2. Abstraction. It’s nice when your customers don’t know what your underlying data schema is and instead access some sort of abstraction layer (which is what RETS is). If you decide to change your data schema (and it does happen, often) you can make sure that the abstraction layer outputs the same before and after the change. This means that your customers don’t need to be aware of the change you made.

    3. Security. If you allow anyone to write direct SQL queries to your database, you are opening yourself to malicious SQL being run on your box. Sure, you could filter out SQL that does updates if all they need to do is query, but if you need a filtering layer you might as well have an abstraction layer. If you allow updates it gets even worse: you MUST check the validity and security of all input from the outside world, unless you want people to put in data that corrupts your database or never releases a lock on some data.

    4. Differences in SQL between vendors. If you allow people to access your database directly, you force them to deal with the quirks of your specific DBMS. Sure, there is ANSI SQL, but there are a lot of subtle and important differences between Oracle, SQLServer, MySQL, etc. Trust me, migrate from SQLServer to Oracle and you will learn how much they can differ.

    5. Performance. An abstraction layer would be able to cache the results of queries so as to reduce the load on a server. If all your customers are running live queries on the database, it could put a huge load on your box.

    6. No DBA with any experience would ever want to be responsible for such a system. They would know better, so getting someone to maintain it would be difficult.

    In the end, I find the idea of opening up my database to the internet to be terrifying, and I think it is a very, very bad idea.

  8. I think #4 is the best reason for not doing direct SQL. DB interop would probably be as big a pain as RETS interop would be. And as long as it’s painful, you might as well implement and deal with an abstraction layer (only deal with the pain once, instead of once for every DB vendor)

    I should note, I’m only concerned with read-only and copy/sync scenarios, for my applications (IDX / MLS web searches). I realize RETS probably addresses more scenarios than the ones I care about (and for those scenarios, it is better than direct DB access would be).

    Still, I long for the ease and flexibility that living in a SQL Server world brings me, where moving data from one server to another is just an INSERT INTO … SELECT … FROM OPENROWSET(….) query away.