
I once consulted at a company with a very large monolithic SQL Server. It actually wasn't Postgres but let's pretend it was.

It had been around for decades and over time it had ended up being used for all sorts of things within the company. In fact, it was more or less true that every application and business process within the whole company stored its data within this database.

A key issue was that, because many different applications queried this database and a huge number of processes and procedures inserted or updated data within it, queries would sometimes break when an upstream insert/update process was amended, when a new one was added that violated application-level invariants, or when a normal process behaved differently in the presence of bad data.

It was very difficult to work out what had happened, because often everything you looked at had been written a decade before you arrived, and the employees responsible had long since left the company.

Would it be possible to capture changes from a Postgres database in some kind of DAG, so that you could find out things like:

- What processes are inserting, updating or deleting data, and how have they behaved historically? For example, do they ever behave differently?

- How are different applications querying this data? Are there any statistics about their queries that hold generally? How have those statistics changed over time?

I don't know whether there is prior art here, or what kind of approach might allow a tool like this to be built.

(I've thought of making something like this before but I think this is an area in which you'd want to be a core Postgres engineer to make good choices.)



Logical replication in Postgres contains all of the information about each change statement (insert/update/delete) needed to logically recreate the same state in another database.

You won't get client-level provenance data with each change...

However you could hack around that. The logical replication stream can also include informational messages from the "pg_logical_emit_message" function to insert your own metadata from clients. It might be possible to configure your clients to emit their identifier at the beginning of each transaction.
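As a sketch of that idea (table and client names are illustrative): `pg_logical_emit_message(transactional, prefix, content)` writes a message into the WAL, and passing `true` for the first argument makes it transactional, so it only appears in the stream if the surrounding transaction commits.

```sql
BEGIN;

-- Emit the client's identity so it travels with this transaction's
-- changes in the logical replication stream.
SELECT pg_logical_emit_message(true, 'client_id', 'billing-service v2.3');

UPDATE accounts SET balance = balance - 100 WHERE id = 42;

COMMIT;
```

A consumer of the replication stream (configured to receive messages) would then see the `client_id` message immediately before the row changes from that transaction.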


I’m not sure how to handle queries, but for inserts/updates I have a column that tracks the event source (last updated by). Maybe this is an anti-pattern - I’d love a more robust solution.
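One way to make that column harder to get wrong is to populate it from a trigger rather than trusting every writer to set it. A Postgres sketch, with hypothetical table and setting names, where each application declares its identity via a session variable:

```sql
ALTER TABLE orders
    ADD COLUMN updated_by text,
    ADD COLUMN updated_at timestamptz;

-- Each application runs once per session (or per transaction):
--   SET app.client_id = 'inventory-worker';

CREATE FUNCTION stamp_updated_by() RETURNS trigger AS $$
BEGIN
    -- Second argument 'true' means: return NULL instead of erroring
    -- when the setting has not been set by the client.
    NEW.updated_by := current_setting('app.client_id', true);
    NEW.updated_at := now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_stamp
    BEFORE INSERT OR UPDATE ON orders
    FOR EACH ROW EXECUTE FUNCTION stamp_updated_by();
```

If you want to enforce it, the trigger could instead RAISE an exception when the setting is absent.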


Technically, logical replication captures everything done by every client, and if you are careful with triggers you can also track everything yourself, using a DDL/DML capture table (DCL too, if you're worried!).

These approaches work on basically every SQL database that uses a WAL or supports triggers.
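For the DDL side, Postgres event triggers can feed a capture table like this (a sketch; table and function names are illustrative):

```sql
CREATE TABLE ddl_audit (
    occurred_at     timestamptz DEFAULT now(),
    username        text        DEFAULT current_user,
    command_tag     text,
    object_identity text
);

-- pg_event_trigger_ddl_commands() is only callable from an event
-- trigger function; it lists each DDL command in the statement.
CREATE FUNCTION log_ddl() RETURNS event_trigger AS $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
        INSERT INTO ddl_audit (command_tag, object_identity)
        VALUES (r.command_tag, r.object_identity);
    END LOOP;
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER ddl_audit_trigger
    ON ddl_command_end EXECUTE FUNCTION log_ddl();
```

Row-level DML capture works the same way with ordinary per-table triggers writing into an audit table.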

For your specific question, I have used a trigger approach many times in SQL Server, but it tends to slow things down if you are logging every query, so designing an insertion mechanism that doesn't bog down production is tricky; you might want to perform some sampling.


You get a lot just by making sure each application has its own user access.


This seems like the right approach to me.

An approach I've taken is temporal tables with Application and UpdatedBy fields. That gives you a permanent record of every change: what application did it, what user performed the action, and at what time. It also lets you query the database as if you were querying it at that point in time. You can add triggers that fail CRUD operations when those fields are not set, if you want to get really paranoid.
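In SQL Server terms, that combination might look like the following sketch (table and column names are illustrative):

```sql
CREATE TABLE dbo.Orders (
    Id          int IDENTITY PRIMARY KEY,
    Amount      decimal(10,2) NOT NULL,
    Application nvarchar(100) NOT NULL,
    UpdatedBy   nvarchar(100) NOT NULL,
    -- System-versioning columns maintained automatically by the engine:
    ValidFrom   datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo     datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.OrdersHistory));

-- Query the table as it looked at a point in time:
SELECT Id, Amount, Application, UpdatedBy
FROM dbo.Orders
FOR SYSTEM_TIME AS OF '2024-01-01T00:00:00'
WHERE Id = 42;
```

Every update or delete moves the old row version into the history table, which is where the storage overhead mentioned below comes from.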

There's a lot of overhead to this in terms of storage, so it's not suitable for high-throughput or cost-constrained transactional systems, but it's something for the toolbox.


Even if it doesn't, you can start by adding the application name to the connection string, and you could probably do something gnarly with triggers to write that to a table and get it pushed into the logical replication stream.
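In Postgres specifically, this is the built-in `application_name` setting (the service name below is illustrative):

```sql
-- Set via a libpq-style connection string:
--   host=db dbname=app user=svc application_name=billing-service
-- or at runtime:
SET application_name = 'billing-service';

-- The name is then visible per connection in pg_stat_activity,
-- alongside the currently executing query:
SELECT application_name, query FROM pg_stat_activity;

-- And a trigger can read it to stamp rows:
SELECT current_setting('application_name');
```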


yeah, for SQL Server the connection string has an ApplicationName property for this purpose, it's pretty useful :)


This is an interesting idea


I had the following idea recently: go through all scripts/programs that send queries to the db and append a comment to each query containing a unique id that links it back to the originating script/program. The query log should then show the comment with the id, so you can trace the origin.
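A tagged query might look like this (the file path and id format are illustrative); since the comment is part of the statement text, it survives into `pg_stat_activity` and the server query log:

```sql
SELECT o.customer_id, sum(o.amount)
FROM orders o
GROUP BY o.customer_id
/* query_id: reports/daily_summary.py#q1 */;
```

This is essentially what tools like sqlcommenter do automatically at the driver level.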


For the "I just need to emit and visualize a DAG" problem, at one point I wrote a python script that would filter for the relevant data, emit line-by-line mermaidJS-flavored markdown, and then shove that into an HTML file that imported the MermaidJS library.

MermaidJS lays out the DAG and visualizes it, and your browser gives you enough context to Ctrl-F for any interesting phrases you put in the labels.
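The emitted markdown for the writer/reader question in this thread might look like (node names are illustrative):

```mermaid
flowchart LR
    billing_service --> orders_table
    nightly_etl --> orders_table
    orders_table --> reporting_app
    orders_table --> finance_export
```

MermaidJS renders each `A --> B` line as an edge, so the Python side only has to print one line per observed writer/reader relationship.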


I think it's users + pgAudit.
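A minimal pgAudit setup, assuming the `pgaudit` extension is installed (role name illustrative):

```
# postgresql.conf
shared_preload_libraries = 'pgaudit'
pgaudit.log = 'write, ddl'     # log INSERT/UPDATE/DELETE and DDL statements
pgaudit.log_relation = on      # include the table name in each log entry
```

Combined with per-application users, you can also scope the auditing per role, e.g. `ALTER ROLE billing_app SET pgaudit.log = 'read, write';`, so the audit log records which user (and therefore which application) ran each statement.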



