
I am trying to find out what the state of the art is for databases, Python, and big data.

My starting point was SQL Server, pandas with multiprocessing, and Dask. Imagine I need to maintain a database with more than 1 billion rows, keep inserting into it, and also run complex, larger-than-memory analysis on it across multiple processes.

One drawback is that SQL Server is very slow at inserting and extracting data.

Inserting 100k rows takes about 1 second, and reading the first 1M rows takes 5s+. That is very disappointing compared with Dask on Parquet. However, with Dask and Parquet I cannot keep inserting into this more-than-1-billion-row dataset, and multi-indexes / non-clustered indexes are not supported, which even makes some previously fast SQL joins slower. Roughly the kind of comparison I mean is sketched below.
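A minimal sketch of the two read paths I am comparing; the connection string, table name, and Parquet path are placeholders, not real objects:

```python
import pandas as pd
import dask.dataframe as dd
from sqlalchemy import create_engine

# Placeholder connection string for a SQL Server instance.
engine = create_engine(
    "mssql+pyodbc://user:pass@myserver/mydb?driver=ODBC+Driver+17+for+SQL+Server"
)

# SQL Server path: pull the first 1M rows back to the client.
head_sql = pd.read_sql_query("SELECT TOP 1000000 * FROM big_table", engine)

# Dask + Parquet path: lazily open the dataset, then materialize the head
# (npartitions=-1 lets Dask read across partitions to collect enough rows).
head_parquet = dd.read_parquet("data/big_table.parquet").head(1_000_000, npartitions=-1)
```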

I looked around and found Apache Spark SQL and PySpark, but I'm a bit unsure whether that is the correct step forward. Any suggestions? Thanks!

thinker
  • I suggest you think strongly about column sizes and indexing, and about SSDs for your disks. This question is not answerable as it stands, too many moving parts – Charlieface Aug 01 '21 at 06:27

1 Answer


I've worked with tables in the tens of billions of records that were fairly heavily transactional, and SQL Server handled it just fine. I'm not sure what your expectations are, but reading 1 million rows (by returning them to a client) in about 5 seconds is pretty reasonable, depending on total data size; even Parquet-formatted data (which I've also worked with) is only going to shave off less than a second for the same amount of data used in the same exact way. I could say the same about inserting 100,000 rows in about 1 second.

Note that measuring performance purely in terms of row counts leaves out the variable of column width, which you need in order to judge whether the performance is reasonable. So yes, I'd agree that returning 1 million rows of a single TINYINT (1-byte-wide) column in 5 seconds sounds slow, but 1 million rows with a total column width of 1,000 bytes is reasonable.
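To make that concrete, here is a back-of-the-envelope sketch with illustrative numbers only (the 1,000-byte row width is an assumption for the example, not a measurement):

```python
# Illustrative arithmetic only: assumed row width, not a measurement.
rows = 1_000_000
row_width_bytes = 1_000            # total width of all columns per row
seconds = 5

total_gb = rows * row_width_bytes / 1e9                    # ~1 GB of data
throughput_mb_s = rows * row_width_bytes / seconds / 1e6   # ~200 MB/s
print(f"~{total_gb:.1f} GB returned at ~{throughput_mb_s:.0f} MB/s")
```

At roughly 200 MB/s delivered to the client, 5 seconds for that result set is not an unreasonable figure.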

At this scale, measured performance becomes bound by hardware and architecture design, which is mostly independent of the database platform. If you've perfectly architected your schema, then the performance you're seeing is not a limitation of SQL Server; rather, it's the reality of the amount of data you're trying to process on the hardware you've provisioned.

Long story short, the tools out there for managing data these days aren't necessarily inherently any different from one another performance-wise; what matters is how you use those tools and the hardware you provision to back them.

J.D.
  • Thanks! If I wanted to keep the non-clustered indexing and frequent-insert abilities of a database, and also the fast loading of Parquet files, what are my options? I am currently experimenting with SQL Server, and the loading and writing speed of a large dataframe is not very good compared to Dask with Parquet. – thinker Aug 01 '21 at 20:37
  • @user40780 Unfortunately I'm not familiar with Dask and have almost no experience with Python's pandas DataFrames, but if you're not seeing optimal performance when using them to write to and read from a SQL Server instance, then likely the bottleneck lies somewhere in how you're leveraging those technologies. You may want to look into some of the topics Charlieface mentioned in his comment, but I'd also recommend posting a question, possibly on StackOverflow, on "*how to efficiently write large amounts of data to SQL Server with Python?*", since it's a more programmatic question (one common starting point is sketched after these comments). – J.D. Aug 01 '21 at 21:57
  • @user40780 Additionally, if you want to post a separate question asking for help with efficient schema design and architecture in SQL Server, here on DBA.StackExchange.com would be a good place for that. Having more specifics on your data, its *structure*, and your use cases and goals would be useful in helping you architect your schema efficiently. – J.D. Aug 01 '21 at 22:12
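As a hedged sketch of the "write large amounts of data to SQL Server with Python" direction mentioned in the comments above: pyodbc's `fast_executemany` combined with chunked `to_sql` writes is one common lever, not necessarily the right one for every workload. The connection string, table name, and generated data below are all placeholders.

```python
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; fast_executemany batches parameterized
# INSERTs on the driver side instead of sending them row by row.
engine = create_engine(
    "mssql+pyodbc://user:pass@myserver/mydb?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,
)

# Placeholder data standing in for a real 100k-row frame.
df = pd.DataFrame({"id": np.arange(100_000), "value": np.random.rand(100_000)})

# Append in chunks so a very large frame isn't sent as one giant batch.
df.to_sql("big_table", engine, if_exists="append", index=False, chunksize=10_000)
```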