• subscribe
December 19, 2001 12:00 AM

Identifying Duplicate .jpg Files

SQL Server Pro
InstantDoc ID #23234
Downloads
23234.zip

I've created a table in SQL Server 2000 that stores .jpg files in an image column. Unfortunately, the table includes many duplicate images. Can I compare the image value in one row with the image contents of another row to determine whether the contents are identical?

I'm sure you've tried using the equal to (=) operator to compare the columns through a JOIN clause. Unfortunately, you can't use image data types with the = operator. However, a simple workaround might help you deal with this limitation.

Listing 1 shows a code example based on the Employees table in the Northwind sample database. Imagine that you want to eliminate duplicate employee photos from your database. Because you want to find employees who have the same picture in the same table, you use a self-join to compare rows in the Employees table. You alias the Employees table as E1 and E2. You can't use an = operator to join image columns, but you can use the Substring() function that Listing 1 shows to compare chunks of data. This example will work as written if none of your image columns are more than 8000 bytes. If your image columns contain more than 8000 bytes, your SQL code will be more complex. For example, I've used the Datalength() function in the following command to determine that the longest photo value in the Employees table is 21,722 bytes:

SELECT max(datalength(photo)) FROM Employees

Each Substring() call can return only 8000 bytes, so you need three separate clauses in the JOIN statement to accommodate the existing data, as Listing 2 shows. Keep in mind that a query such as the one that Listing 2 shows can take a long time to run on a large table that has large image columns.



ARTICLE TOOLS

Comments
  • GARY
    9 years ago
    Jan 03, 2003

    In case of self join that List2 shows, E1.Employee_Id<>E2.Employee_Id should be added.

  • Jeff
    10 years ago
    Dec 24, 2002

    I find the best, quickest and easiest way to find duplicates is not by using JOINS (which doesn't work if you have NULLS) but by using GROUP BY's.

    Just GROUP BY the fields you need to check for duplicates (in this case, the SUBSTRINGS() of the image) and where you have a count(*) of >1, you have a duplicate record! And this handles NULLs just fine, which JOINS do not.

    Not only that, but you know exactly how many duplicates you have because the COUNT(*) tells you!

    The SQL statement is much shorter, quicker, easier to read and debug.

    - Jeff

  • Stephane Baudet
    10 years ago
    Dec 24, 2002

    Don't you think that the function CHECKSUM might help instead of using several substring() calls ?

You must log on before posting a comment.

Are you a new visitor? Register Here