
I have a financial system that needs to attach a PDF receipt to each record saved in the system.

These receipts will be scanned by a dedicated device attached to the computer and saved as PDFs to be stored in my database.

Today the system saves each PDF as varbinary(max), but because of the number of rows in the table, the size of my database is growing too fast. The average file size is about 1 to 2 MB.
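
For reference, a minimal sketch of the kind of table involved (the table and column names here are illustrative, not the real schema):

    CREATE TABLE dbo.Receipt
    (
        ReceiptId INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_Receipt PRIMARY KEY,
        RecordId  INT            NOT NULL,  -- link to the financial record
        PdfData   VARBINARY(MAX) NOT NULL   -- scanned PDF, roughly 1-2 MB per row
    );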

What is the best way to store this kind of file without compromising my database performance?

  • If it were me, I'd store the receipt information and a path to the PDF. Unless there is a REALLLLY good reason, don't store the object itself in the database. It would be really easy to generate a link in a report for one-click access to the PDF. – Steve Mangiameli Dec 14 '15 at 19:28
  • Use FILESTREAM. It can be accessed through a UNC path as well as through database commands, and it stores the files on disk, separate from the database. – Jonathan Fite Dec 14 '15 at 19:48
  • Have you contacted your receipt generation vendor? It may be possible to stream the receipt data directly from your receipt generator to the PC/database, or to simply store the data needed to recreate the receipt. – armitage Dec 14 '15 at 20:56

2 Answers


You are not sacrificing database performance by using varbinary(MAX): you are not searching on the varbinary(MAX) column, and you are not indexing it.

What is nice about keeping the files in the table is a single, consistent backup. The downside is a bigger backup.

Delivering the file to the client from varbinary(MAX) is going to be less efficient than FILESTREAM or the file system.

1-2 MB is relatively small. If the files were 200+ MB, you would typically want to avoid storing them in varbinary(MAX).

If table size alone is the problem, then FILESTREAM is probably your best solution. It will not hurt database performance, and the files are not stored in the table itself.
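
A minimal sketch of what that could look like, assuming FILESTREAM is already enabled at the instance level (the database name, container path, and table/column names below are examples only):

    -- Add a FILESTREAM filegroup and a container directory for it.
    ALTER DATABASE FinanceDb
        ADD FILEGROUP ReceiptFiles CONTAINS FILESTREAM;
    ALTER DATABASE FinanceDb
        ADD FILE (NAME = ReceiptFiles, FILENAME = 'D:\SQLData\ReceiptFiles')
        TO FILEGROUP ReceiptFiles;

    -- A table with a FILESTREAM column needs a UNIQUE ROWGUIDCOL column.
    CREATE TABLE dbo.ReceiptFs
    (
        ReceiptGuid UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        RecordId    INT                       NOT NULL,
        PdfData     VARBINARY(MAX) FILESTREAM NULL
    );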

FILESTREAM enables SQL Server-based applications to store unstructured data, such as documents and images, on the file system. Applications can leverage the rich streaming APIs and performance of the file system and at the same time maintain transactional consistency between the unstructured data and corresponding structured data.
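
In practice the streaming access works roughly like this: inside a transaction, the client asks SQL Server for the file's path and a transaction token, then reads the file through the Win32 streaming API (for example SqlFileStream in .NET). A sketch against the example table above:

    DECLARE @RecordId INT = 42;  -- example key

    BEGIN TRANSACTION;

    -- PathName() and the transaction context are what the client needs to
    -- open the file via the Win32 API instead of pulling the bytes through TDS.
    SELECT  PdfData.PathName()                   AS FilePath,
            GET_FILESTREAM_TRANSACTION_CONTEXT() AS TxContext
    FROM    dbo.ReceiptFs
    WHERE   RecordId = @RecordId;

    -- ... client streams the file here, then ...
    COMMIT TRANSACTION;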

From a licensing perspective, I am pretty sure FILESTREAM data does not count toward database size. For example, with SQL Server Express and its 10 GB database size limit, the FILESTREAM data does not count.

Another benefit is that if the PDF has been OCRed, you can full-text search the document with an Adobe iFilter. You might be able to do that with FILESTREAM as well, but from what I can see, you cannot.
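
If you do go the full-text route on the varbinary(MAX) column, the setup is roughly the following, assuming a table like the dbo.Receipt sketch in the question (with primary key index PK_Receipt) and assuming the Adobe PDF iFilter is installed and registered with SQL Server:

    -- The full-text engine picks the PDF iFilter based on a "type column"
    -- that holds the file extension.
    ALTER TABLE dbo.Receipt ADD FileExtension NVARCHAR(10) NOT NULL DEFAULT N'.pdf';

    CREATE FULLTEXT CATALOG ReceiptCatalog;

    CREATE FULLTEXT INDEX ON dbo.Receipt (PdfData TYPE COLUMN FileExtension LANGUAGE 1033)
        KEY INDEX PK_Receipt
        ON ReceiptCatalog;

    -- Example search across the OCRed PDF text.
    SELECT ReceiptId
    FROM   dbo.Receipt
    WHERE  CONTAINS(PdfData, N'invoice');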

Or you can just manage the files completely separately and only store a path in SQL.

paparazzo
  • Actually, you can take a perf hit, but you might have no choice depending on how the app will access the data. If stored in the DB, each retrieval goes through SQL Server's layers before it gets delivered to the client; CPU, memory, and IO all get consumed. If stored in FILESTREAM, the app can access it via the Win32 API, which bypasses most layers of SQL Server except the permission checks in most situations. If you retrieve a lot of data, it can make a very big difference, especially when each receipt is 1-2 MB (that's a whole different issue: why so big for receipts?) – SQLmojoe Dec 17 '15 at 17:37
  • @SQLmojoe Actually, I did not assert there was no performance hit. I said "You can read varbinary(MAX) about as fast as a file system." I don't agree with a "big" difference on a 1-2 MB file. Yes, a large volume is a large load, but it is still a small fractional difference. – paparazzo Dec 17 '15 at 17:52
  • @SQLmojoe My comment on sacrificing performance was about pure database performance. I will try to clarify the question. – paparazzo Dec 17 '15 at 17:53
  • That's my point: you can take a big perf hit on the DB. Consider the code path a request has to go through if run through SQL Server vs. streaming directly from the filesystem, plus you're consuming buffer pool for BLOBs. The memory manager in 2014 is MUCH better with large page allocations compared to earlier versions, but it's still a lot more expensive. We've tested this a lot, but do verify yourself and check the difference. However, as noted earlier, the OP may not have a choice. If the app can't access the BLOBs via the Win32 API, or can't use Windows Auth only, then FILESTREAM is a non-starter. – SQLmojoe Dec 17 '15 at 18:17
  • @SQLmojoe I said there is a performance hit on delivery and feel that is a sufficient description given the nature of the question. I am not going to debate how big the hit is with you. If you feel you have a more complete answer, then I welcome you to post one. The OP does not state he is having any performance issues; the OP wants to address table size and not compromise performance. – paparazzo Dec 17 '15 at 18:27

The best way, just as Documentum and other major packages do it, is to store a path and filename in the database and put the file at that location.

For an even smaller database, have one table of common paths and derive the filename from the row's short primary or candidate key; you would then only need to store the path ID and the extension.
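
A rough sketch of that layout (all object names are examples):

    CREATE TABLE dbo.StoragePath
    (
        PathId   SMALLINT      NOT NULL CONSTRAINT PK_StoragePath PRIMARY KEY,
        BasePath NVARCHAR(260) NOT NULL  -- e.g. N'\\fileserver\receipts\2015\'
    );

    CREATE TABLE dbo.ReceiptDocument
    (
        ReceiptId INT          NOT NULL CONSTRAINT PK_ReceiptDocument PRIMARY KEY,
        PathId    SMALLINT     NOT NULL REFERENCES dbo.StoragePath (PathId),
        Extension NVARCHAR(10) NOT NULL DEFAULT N'.pdf'
    );

    -- The full file name is never stored; it is derived from the key.
    SELECT p.BasePath + CAST(d.ReceiptId AS NVARCHAR(20)) + d.Extension AS FullPath
    FROM   dbo.ReceiptDocument AS d
    JOIN   dbo.StoragePath     AS p ON p.PathId = d.PathId;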

If your financial compliance rules require integrity or tamper protection, you could check with your compliance or legal group to see if storing a hash in the database or a public/private key signature in that location would suffice.
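
If a hash turns out to be enough, one possibility (only a sketch; whether it satisfies your compliance rules is for that group to decide) is to keep a SHA-256 of each file next to the path and compare it when the file is read back:

    ALTER TABLE dbo.ReceiptDocument ADD PdfSha256 BINARY(32) NULL;

    -- The hash would normally be computed in the application when the file is
    -- written and again when it is read back. HASHBYTES could do it server-side,
    -- but before SQL Server 2016 its input is capped at 8000 bytes, so 1-2 MB
    -- PDFs are better hashed on the application side.
    DECLARE @ReceiptId INT        = 12345;  -- example key
    DECLARE @FileHash  BINARY(32) = 0x00;   -- placeholder for the recomputed hash

    SELECT CASE WHEN PdfSha256 = @FileHash THEN 'intact' ELSE 'changed' END AS IntegrityCheck
    FROM   dbo.ReceiptDocument
    WHERE  ReceiptId = @ReceiptId;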

Anti-weakpasswords