I am running a case-control study for which I wish to choose 5 controls for each case, stratifying by age, sex, and date of measurement. Each case or control has a unique serial number and the controls will be given a stratum_id
that matches the case. My Python routine is approximately as follows:
control_cases = {}
for stratum_id, case in case_dict.items():
query = (
"SELECT serial_number "
"FROM foo_db "
"WHERE AGE = {} "
"AND SEX = {} "
"AND DATE = {} "
"AND SERIAL_NUMBER NOT LIKE {} "
"LIMIT 5;"
).format(case['age'], case['sex'], case['date'], case['serial_number']
records = run_query(query)
control_cases[stratum_id] = records
The problem is, I have hundreds of cases, and the SQL server seems to be highly throttled and doesn't respond well to hundreds of queries. I would like to somehow loop through all the case records, matching each with 5 controls, within a single SQL query. Right now I work around this by exporting them as strings and creating a single meta-string (QUERY_1) UNION ALL (QUERY_2) ... UNION ALL (QUERY_N). It gets the job done. But this is surely not very good SQL.