hi my dears, I have an issue at work where we have to work with millions (~150 million) of product data points. We are using SQL Server because it was available in-house for development. However, with various tables growing beyond 10 million rows, the server becomes quite slow and wait/buffer time climbs above 7000 ms/sec, which is tearing down our whole setup of microservices that continuously read, write, and delete from these tables. All the Stack Overflow answers lead to: it's complex, read a 2000-page book.
The thing is, my queries are not that complex. They simply go through the whole table to identify duplicates, which are then not processed further, because the processing takes time (and we thought that would be the bottleneck). But the time saved by not processing duplicates now looks smaller than the time it takes to compare batches against the SQL table. The other culprit is that our server runs on an HDD, which at about 150 MB/s read and write is probably at its limit.
The question is: is there a wizard move to bypass any of these restrictions, or is a change in the setup and algorithm inevitable?
Edit: I know my question seems broad, but as I am new to database architecture I welcome any input and discussion, since the topic is a lifetime of know-how in itself. Thanks for every bit of feedback.
If you are new to something and want to learn, seek resources and educate yourself with them. Learning takes time, and there are no shortcuts.
A hot DB should not run on HDDs. Slap some NVMe storage into that server if you can. If you can't, consider getting a new server and migrating to it.
SQL Server can generate execution plans for you. Generate them for your queries and check whether you're doing any operations that iterate over the entire table. You should avoid scanning an entire table with a huge number of rows when possible, at least during requests.
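For example, a quick way to spot this (a minimal sketch; the table and column names are made up) is to turn on the I/O statistics and look at the plan:

```sql
-- Hypothetical query against a products table; run with the actual
-- execution plan enabled (Ctrl+M in SSMS).
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT product_id
FROM dbo.Products
WHERE external_sku = 'ABC-123';

-- A huge "logical reads" number in the Messages tab plus a Table Scan /
-- Clustered Index Scan operator in the plan means the whole table is
-- being walked for this lookup.
```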
If you want some kind of dupe protection, let the DB do it for you. Create an index and a table constraint on the relevant columns. If the data is too complex for that, find a way to reduce it, like generating and storing hashes, sorting lists/dicts, etc., just so the DB can do the work for you. The DB is better at enforcing constraints than you are (when it can do so).
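As a rough sketch of what that could look like in SQL Server (again, hypothetical table and column names), a unique index plus an insert that skips existing keys replaces the full-table comparison entirely:

```sql
-- Enforce uniqueness on the business key once.
CREATE UNIQUE INDEX UX_Products_ExternalSku
    ON dbo.Products (external_sku);

-- Insert a new batch from a staging table, skipping rows whose key
-- already exists; the index turns the existence check into a seek
-- instead of a scan.
INSERT INTO dbo.Products (external_sku, name, price)
SELECT s.external_sku, s.name, s.price
FROM dbo.StagingProducts AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.Products AS p
    WHERE p.external_sku = s.external_sku
);
```

SQL Server also has an IGNORE_DUP_KEY option on unique indexes that silently drops duplicate inserts, if that fits your pipeline better.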
For read-heavy workflows, consider whether caches or read replicas will benefit you.
And finally back to my first point: read. Learn. There are no shortcuts. You cannot get better at something if you don’t take the time to educate yourself on it.
We're doing this because of the convincing replies in this thread: migrating to modern hardware and switching from SQL Server to PostgreSQL (because it is already used by another system we work with, and there is know-how available in this domain).
But how can we then ensure that we are not adding/processing products which are already in the “final” table, when we have no knowledge of ALL the products that are in this final table?
This is helpful and also what I experienced. At the peak of the period when the server was overloaded, the CPU load was pretty much zero - all the work was disk reads/writes, because we had implemented poor query design/architecture.
Could you elaborate on what you mean by read replicas? Storage in memory?
Yes, I will swallow the pill. But thanks to the replies here, I now have many good starting points.
RTFM is nice - but starting at page 0 is overwhelming.
Without knowledge of your schema, I don't know enough to answer this. However, the database doesn't need to scan all rows in a table to check whether a value exists if you can build an index on the relevant columns. If your products have some unique ID (or tuple of columns), you can usually build an index on those values, which means the DB builds what is basically a lookup table for those indexed columns.
Without going into too much detail, you can think of an index as a way for the DB to make a “contains” (or “retrieve”) operation drop from O(n) (check all rows) to something much faster, like O(log n). The tradeoff is that you now need more space for the index.
This comes with the added benefit that uniqueness constraints can be easily enforced on indexed columns if needed. And yes, your PK is indexed by default.
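Since you're moving to PostgreSQL anyway, here is a minimal sketch of what this looks like there (table and column names are made up):

```sql
-- One unique index on the natural key enforces the "no duplicates" rule.
CREATE UNIQUE INDEX IF NOT EXISTS products_external_sku_idx
    ON products (external_sku);

-- New batches can then be loaded without any manual duplicate check;
-- rows whose key already exists are simply skipped.
INSERT INTO products (external_sku, name, price)
SELECT external_sku, name, price
FROM staging_products
ON CONFLICT (external_sku) DO NOTHING;
```

That ON CONFLICT clause needs a unique index (or constraint) on the column it targets, which is exactly what the first statement provides.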
Read more about indexes in Postgres's docs; they are actually pretty readable in my experience. Or read a book on indexes, watch a video, etc. The concept is universal.
This highly depends on your needs. I’ll link PG’s docs on replication though.
If you’re migrating right now, I wouldn’t think about this too much. Replicas are basically duplicates of your database hosted on different servers (ideally in different data centers, or even different regions if possible). Replicas work together to stay in sync, and depending on the kind of replica and the kind of query, any replica may be able to handle an incoming query (rather than everything going through a single central database).
If all you need are backups, then replicas could be overkill. Either way, you usually don’t want all your prod data stored on a single machine. I would talk to your management about backup requirements and potentially availability/uptime requirements.
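If you do end up needing one later, this is roughly the shape of PostgreSQL streaming replication (host names, user, and paths below are placeholders; the details depend on your version and setup):

```
# primary: postgresql.conf
wal_level = replica
max_wal_senders = 5

# primary: pg_hba.conf - allow the standby to connect for replication
host  replication  replicator  10.0.0.2/32  scram-sha-256

# standby: clone the primary and configure it as a follower
# (-R writes standby.signal and primary_conninfo for you)
pg_basebackup -h primary-host -U replicator -D /var/lib/postgresql/data -P -R
```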